-rw-r--r-- | content/blog/aws/ansible-fact-metadata.md | 2
-rw-r--r-- | content/blog/aws/capacity_blocks.md | 93
-rw-r--r-- | content/blog/aws/defaults.md | 2
-rw-r--r-- | content/blog/aws/secrets.md | 2
-rw-r--r-- | content/blog/terraform/chart-http-datasources.md | 2
5 files changed, 97 insertions, 4 deletions
diff --git a/content/blog/aws/ansible-fact-metadata.md b/content/blog/aws/ansible-fact-metadata.md
index 3c48f1c..7721a5c 100644
--- a/content/blog/aws/ansible-fact-metadata.md
+++ b/content/blog/aws/ansible-fact-metadata.md
@@ -4,7 +4,7 @@ description: 'An ansible fact I wrote'
 date: '2024-10-12'
 tags:
 - ansible
-- aws
+- AWS
 ---
 
 ## Introduction
diff --git a/content/blog/aws/capacity_blocks.md b/content/blog/aws/capacity_blocks.md
new file mode 100644
index 0000000..be90b69
--- /dev/null
+++ b/content/blog/aws/capacity_blocks.md
@@ -0,0 +1,93 @@
+---
+title: 'AWS capacity blocks with OpenTofu/terraform'
+description: 'Some pitfalls to avoid'
+date: '2025-01-04'
+tags:
+- AWS
+- OpenTofu
+- terraform
+---
+
+## Introduction
+
+AWS capacity blocks for machine learning are a short term GPU instance reservation mechanism. It is somewhat recent and has some rough edges when used via OpenTofu/terraform because of the incomplete documentation. I had to figure out things the hard way a few months ago, here they are.
+
+## EC2 launch template
+
+When you reserve a capacity block, you get a capacity reservation id. You need to feed this id to an EC2 launch template. The twist is that you also need to use a specific instance market option not specified in the AWS provider's documentation for this to work:
+
+``` hcl
+resource "aws_launch_template" "main" {
+  capacity_reservation_specification {
+    capacity_reservation_target {
+      capacity_reservation_id = "cr-XXXXXX"
+    }
+  }
+  instance_market_options {
+    market_type = "capacity-block"
+  }
+  instance_type = "p4d.24xlarge"
+  # soc2: IMDSv2 for all ec2 instances
+  metadata_options {
+    http_endpoint               = "enabled"
+    http_put_response_hop_limit = 1
+    http_tokens                 = "required"
+    instance_metadata_tags      = "enabled"
+  }
+  name = "imdsv2-${var.name}"
+}
+```
+
+## EKS node group
+
+In order to use a capacity block reservation for a kubernetes node group, you need to:
+- set a specific capacity type, not specified in the AWS provider's documentation
+- use an AMI with GPU support
+- disable the kubernetes cluster autoscaler if you are using it (and you should)
+
+``` hcl
+resource "aws_eks_node_group" "main" {
+  for_each = var.node_groups
+
+  ami_type      = each.value.gpu ? "AL2_x86_64_GPU" : null
+  capacity_type = each.value.capacity_reservation != null ? "CAPACITY_BLOCK" : null
+  cluster_name  = aws_eks_cluster.main.name
+  labels = {
+    adyxax-gpu-node   = each.value.gpu
+    adyxax-node-group = each.key
+  }
+  launch_template {
+    name    = aws_launch_template.imdsv2[each.key].name
+    version = aws_launch_template.imdsv2[each.key].latest_version
+  }
+  node_group_name = each.key
+  node_role_arn   = aws_iam_role.nodes.arn
+  scaling_config {
+    desired_size = each.value.scaling.min
+    max_size     = each.value.scaling.max
+    min_size     = each.value.scaling.min
+  }
+  subnet_ids = local.subnet_ids
+  tags = {
+    "k8s.io/cluster-autoscaler/enabled" = each.value.capacity_reservation == null
+  }
+  update_config {
+    max_unavailable = 1
+  }
+  version = local.versions.aws-eks.nodes-version
+
+  depends_on = [
+    aws_iam_role_policy_attachment.AmazonEC2ContainerRegistryReadOnly,
+    aws_iam_role_policy_attachment.AmazonEKSCNIPolicy,
+    aws_iam_role_policy_attachment.AmazonEKSWorkerNodePolicy,
+  ]
+  lifecycle {
+    create_before_destroy = true
+    ignore_changes        = [scaling_config[0].desired_size]
+  }
+}
+```
+
+## Conclusion
+
+There is a terraform resource to provision the capacity blocks themselves that might be of interest, but I did not attempt to use it seriously. Capacity blocks are never available right when you create them, you need to book them days (sometimes weeks) in advance. Though OpenTofu/terraform has some basic date and time handling functions I could use to work around this, my needs are too sparse to go through the hassle of automating this.
diff --git a/content/blog/aws/defaults.md b/content/blog/aws/defaults.md
index 454b325..3d1aed9 100644
--- a/content/blog/aws/defaults.md
+++ b/content/blog/aws/defaults.md
@@ -3,7 +3,7 @@ title: Securing AWS default VPCs
 description: With terraform/OpenTofu
 date: 2024-09-10
 tags:
-- aws
+- AWS
 - OpenTofu
 - terraform
 ---
diff --git a/content/blog/aws/secrets.md b/content/blog/aws/secrets.md
index a25f9ef..448bf5b 100644
--- a/content/blog/aws/secrets.md
+++ b/content/blog/aws/secrets.md
@@ -3,7 +3,7 @@ title: Managing AWS secrets
 description: with the CLI and with terraform/OpenTofu
 date: 2024-08-13
 tags:
-- aws
+- AWS
 - OpenTofu
 - terraform
 ---
diff --git a/content/blog/terraform/chart-http-datasources.md b/content/blog/terraform/chart-http-datasources.md
index 5c4108d..f5a827d 100644
--- a/content/blog/terraform/chart-http-datasources.md
+++ b/content/blog/terraform/chart-http-datasources.md
@@ -3,7 +3,7 @@ title: Manage helm charts extras with OpenTofu
 description: a use case for the http datasource
 date: 2024-04-25
 tags:
-- aws
+- AWS
 - OpenTofu
 - terraform
 ---
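
Note on the new capacity_blocks.md post: its conclusion mentions that the built-in date and time functions could work around a capacity block not being usable before its start date. A minimal sketch of that idea follows. It is not part of the commit and is untested against a real reservation. The variable `capacity_block_start`, the local and the output are hypothetical names; `plantimestamp()` and `timecmp()` are built-in functions in recent OpenTofu/terraform releases.

``` hcl
# Hypothetical input, not from the post: the RFC 3339 start time of the
# booked capacity block, for example "2025-02-01T11:30:00Z".
variable "capacity_block_start" {
  type = string
}

locals {
  # timecmp returns -1, 0 or 1 depending on the order of the two timestamps.
  # plantimestamp() stays constant for the duration of a plan, unlike timestamp().
  capacity_block_started = timecmp(plantimestamp(), var.capacity_block_start) >= 0
}

# Exposed only to make the computed value visible in the plan output.
output "capacity_block_started" {
  value = local.capacity_block_started
}
```

Such a flag could in principle gate the GPU node group with `count` or `for_each` until the reservation becomes active. Whether that plays well with the EKS node group lifecycle is an assumption the post does not verify.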