Announcing NVIDIA GPU support for Bottlerocket on Amazon ECS

Last year, we announced the general availability of the Amazon Elastic Container Service (Amazon ECS)-optimized Bottlerocket AMI. Bottlerocket is an open source project that focuses on security and maintainability, providing a reliable and consistent Linux distribution for hosting container-based workloads. Today, we are happy to announce that you can run NVIDIA GPU-accelerated workloads on Amazon ECS using Bottlerocket.

In this post, we will walk through how to create an Amazon ECS task to run an NVIDIA GPU workload on Bottlerocket.

Why Bottlerocket?

Customers have continued to adopt containers to run their workloads, and AWS saw a need for a Linux distribution designed and optimized to run these containerized applications. Bottlerocket OS was built to provide a secure foundation for hosts running containers and to minimize the operational overhead of managing them at scale. Bottlerocket is designed for reliable updates that can be applied through automation.

You can learn more about getting started with Bottlerocket and Amazon ECS in the Getting started with Bottlerocket and Amazon ECS blog post.

Setting up an ECS cluster with Bottlerocket and NVIDIA GPUs

Let’s have a look at how this is done in practice. We will be working in the us-west-2 (Oregon) Region.

Prerequisites

The AWS CLI with appropriate credentials
A default VPC in a region of your choice (you can also use an existing VPC in your account)

First, let’s create the ECS cluster named ecs-bottlerocket.

aws ecs --region us-west-2 create-cluster --cluster-name ecs-bottlerocket

The instance we’re launching will need an AWS Identity and Access Management (IAM) role to communicate with both the ECS APIs and the Systems Manager Session Manager APIs. I have created an IAM role named ecsInstanceRole that has both the AmazonSSMManagedInstanceCore and the AmazonEC2ContainerServiceforEC2Role managed policies attached.
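
If the role does not exist yet in your account, one way to create it with the AWS CLI is sketched below. This is a minimal sketch, not the exact steps used for this post: the trust policy file name (ec2-trust-policy.json) and the helper function name are illustrative assumptions. The instance profile is given the same name as the role so that it matches the run-instances call later in this post.

```shell
# Trust policy that lets EC2 instances assume the role.
# The file name is an arbitrary choice for this sketch.
cat > ./ec2-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Hypothetical helper: creates the role, attaches both managed policies,
# and wraps the role in an instance profile of the same name.
create_ecs_instance_role() {
  aws iam create-role --role-name ecsInstanceRole \
    --assume-role-policy-document file://ec2-trust-policy.json &&
  aws iam attach-role-policy --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore &&
  aws iam attach-role-policy --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role &&
  aws iam create-instance-profile --instance-profile-name ecsInstanceRole &&
  aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceRole --role-name ecsInstanceRole
}
```

Calling create_ecs_instance_role once per account should be enough; the function only defines the steps and does nothing until it is invoked.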

The list of Bottlerocket Amazon Machine Images (AMIs) supported for use with NVIDIA GPUs is publicly available from AWS Systems Manager Parameter Store, so let’s get the AMI ID for the latest Bottlerocket release. AMIs are available for both the x86_64 and aarch64 architectures; in this blog post, we are going to use the x86_64 AMI.

latest_bottlerocket_ami=$(aws ssm get-parameter --region us-west-2 \
    --name "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/latest/image_id" \
    --query Parameter.Value --output text)

Next, we get the list of subnets that are configured to allocate a public IP address.

# $vpc_id must already hold the ID of the VPC you want to use
aws ec2 describe-subnets \
   --region us-west-2 \
   --filters "Name=vpc-id,Values=$vpc_id" \
   --query 'Subnets[?MapPublicIpOnLaunch == `true`].SubnetId'

[
    "subnet-bc8993e6",
    "subnet-b55f6bfe",
    "subnet-e1e27fca",
    "subnet-21cbc058"
]
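
The describe-subnets command above reads the VPC ID from a shell variable named vpc_id, which has to be set before running it. A minimal sketch of one way to populate it, assuming the account still has a default VPC in us-west-2 (the helper name is made up for illustration):

```shell
# Hypothetical helper: prints the ID of the default VPC in us-west-2.
get_default_vpc_id() {
  aws ec2 describe-vpcs --region us-west-2 \
    --filters Name=is-default,Values=true \
    --query 'Vpcs[0].VpcId' --output text
}
```

With the helper defined, vpc_id=$(get_default_vpc_id) sets the variable. If you are using a non-default VPC, assign its ID to vpc_id directly instead.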

To associate our EC2 instance with the ECS cluster, we need to provide some information to the instance when we create it: a small config file (userdata.toml), saved in the current directory, that contains the details of the ECS cluster.

A full set of supported settings is here.

cat > ./userdata.toml << 'EOF'
[settings.ecs]
cluster = "ecs-bottlerocket"
EOF

Let’s deploy one Bottlerocket instance in one of the subnets above. We are choosing a public subnet for this blog post because it makes it easier to connect to and debug the instance if needed. You can choose private or public subnets based on your use case.

We are using the p3.2xlarge instance type, which has one NVIDIA Tesla V100 Tensor Core GPU.

aws ec2 run-instances \
   --subnet-id subnet-bc8993e6 \
   --image-id $latest_bottlerocket_ami \
   --instance-type p3.2xlarge \
   --region us-west-2 \
   --tag-specifications 'ResourceType=instance,Tags=[{Key=bottlerocket,Value=quickstart}]' \
   --user-data file://userdata.toml \
   --iam-instance-profile Name=ecsInstanceRole
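
It can take a few minutes for the new instance to boot and register with the cluster. Before running a task, it may help to confirm that registration has happened. The sketch below polls the cluster for its registered container instance count; the function names, the 10-second interval, and the 30-attempt limit are arbitrary choices for illustration.

```shell
# Hypothetical helper: prints the number of container instances
# currently registered with the ecs-bottlerocket cluster.
registered_instance_count() {
  aws ecs describe-clusters --region us-west-2 \
    --clusters ecs-bottlerocket \
    --query 'clusters[0].registeredContainerInstancesCount' --output text
}

# Poll every 10 seconds, up to 30 times, until an instance shows up.
wait_for_registration() {
  attempts=0
  while [ "$attempts" -lt 30 ]; do
    count=$(registered_instance_count)
    # A non-numeric value (for example, if the cluster is missing)
    # makes the test fail, so the loop simply keeps polling.
    if [ "${count:-0}" -ge 1 ] 2>/dev/null; then
      echo "registered container instances: $count"
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 10
  done
  echo "no container instance registered yet" >&2
  return 1
}
```

Running wait_for_registration returns successfully as soon as at least one instance has joined the cluster.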

Next, let’s create the task definition for the sample application.

cat > ./sample-gpu.json << 'EOF'
{
  "containerDefinitions": [
    {
      "memory": 80,
      "essential": true,
      "name": "gpu",
      "image": "nvidia/cuda:11.0-base",
      "resourceRequirements": [
         {
           "type":"GPU",
           "value": "1"
         }
      ],
      "command": [
        "sh",
        "-c",
        "nvidia-smi"
      ],
      "cpu": 100,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
           "awslogs-group": "/ecs/bottlerocket",
           "awslogs-region": "us-west-2",
           "awslogs-stream-prefix": "demo-gpu"
           }
      }
    }
  ],
  "family": "bottlerocket-gpu"
}
EOF

In the task definition, we assign one NVIDIA GPU to our task through the resourceRequirements parameter. We also define the awslogs log configuration so that the log output from our container is sent to Amazon CloudWatch.

The log group configuration is as follows:

region: us-west-2
log group name: /ecs/bottlerocket
log stream prefix: demo-gpu

Create the CloudWatch log group specified above in the task definition.

aws logs create-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2

Register the task definition.

aws ecs register-task-definition \
   --region us-west-2 \
   --cli-input-json file://sample-gpu.json
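
Each registration creates a new revision of the task definition family, so the revision number is only 1 the first time. As a minimal sketch, the helper below (a hypothetical name) looks up the ARN of the latest active revision for a family passed as its first argument, so a script does not have to hard-code the revision:

```shell
# Hypothetical helper: prints the ARN of the latest ACTIVE revision
# of the task definition family given as $1.
latest_task_definition_arn() {
  aws ecs describe-task-definition --region us-west-2 \
    --task-definition "$1" \
    --query 'taskDefinition.taskDefinitionArn' --output text
}
```

Pass the family name from your task definition file as the argument; the printed ARN can then be given to run-task in place of a family:revision pair.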

Run the task.

aws ecs run-task --cluster ecs-bottlerocket \
   --region us-west-2 \
   --task-definition bottlerocket-gpu:1

The task will run and execute a command (nvidia-smi) inside the container to report the available GPU configuration, and then it will exit.
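
Because the task exits on its own, a script that launches it needs to wait for the task to stop before checking the result. The following is a sketch under the assumption that you have the task ARN or ID at hand; the function name is hypothetical.

```shell
# Hypothetical helper: waits until the task given as $1 (ARN or ID) has
# stopped, then prints the exit code of its first container.
wait_for_task_exit_code() {
  aws ecs wait tasks-stopped --region us-west-2 \
    --cluster ecs-bottlerocket --tasks "$1" &&
  aws ecs describe-tasks --region us-west-2 \
    --cluster ecs-bottlerocket --tasks "$1" \
    --query 'tasks[0].containers[0].exitCode' --output text
}
```

The task ARN can be captured from run-task by adding --query 'tasks[0].taskArn' --output text to that command. An exit code of 0 indicates that nvidia-smi ran successfully inside the container.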

When you go into the ECS console in your account, you will see a stopped task. Select Clusters on the left menu, select the ecs-bottlerocket cluster, and then select the Tasks tab.

UI showing stopped ECS task in the ecs-bottlerocket cluster

Click on the task ID and then select the Logs tab, which will show you the log output from the task that just ran:

UI shows Container log entries in the task Logs tab

You can also view the log output from the container on the command line by passing in the log group name, the log stream name, and a time frame. In my case this would be:

aws logs tail '/ecs/bottlerocket' \
   --region us-west-2 \
   --log-stream-names 'demo-gpu/gpu/7af782059c644872977da89a06023483' \
   --since 1h --format short

Output showing the log output from the AWS CLI command
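
The log stream name follows the awslogs pattern of stream prefix, container name, and task ID, separated by slashes. As a small sketch, the helper below (a hypothetical name) derives the stream name for this post's task from a task ARN or a bare task ID. The account ID in the example ARN is a placeholder.

```shell
# Hypothetical helper: builds the awslogs stream name for this post's task
# from a task ARN (or bare task ID) passed as $1.
log_stream_for_task() {
  task_id=${1##*/}                # keep everything after the last "/"
  echo "demo-gpu/gpu/$task_id"    # <stream prefix>/<container name>/<task ID>
}

# Example with a placeholder account ID in the ARN.
log_stream_for_task \
  "arn:aws:ecs:us-west-2:111122223333:task/ecs-bottlerocket/7af782059c644872977da89a06023483"
```

The example call prints demo-gpu/gpu/7af782059c644872977da89a06023483, which is the value passed to --log-stream-names above.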

Cleanup

To remove the resources that you created during this post, run the following commands.

aws ecs deregister-task-definition \
   --region us-west-2 \
   --task-definition bottlerocket-gpu:1
      
# --output text returns whitespace-separated IDs that the loop can iterate over
delete_instances=$(aws ec2 describe-instances --region us-west-2 \
   --filters "Name=tag-key,Values=bottlerocket" "Name=tag-value,Values=quickstart" \
   --query 'Reservations[].Instances[].InstanceId' --output text)

for instance in $delete_instances
  do aws ec2 terminate-instances --instance-ids "$instance" --region us-west-2
done
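
Amazon ECS refuses to delete a cluster that still has registered container instances, so the delete-cluster call may fail if it runs while the instance is still shutting down. One option, sketched here with a hypothetical helper name, is to wait for termination to complete first:

```shell
# Hypothetical helper: blocks until all instance IDs passed as
# arguments have reached the terminated state.
wait_for_termination() {
  aws ec2 wait instance-terminated --region us-west-2 --instance-ids "$@"
}
```

Calling the helper with the instance IDs you just terminated pauses the script until they are gone. If delete-cluster still reports registered container instances afterwards, retrying after a short pause is a reasonable fallback.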

aws ecs delete-cluster \
   --region us-west-2 \
   --cluster ecs-bottlerocket

aws logs delete-log-group --log-group-name '/ecs/bottlerocket' --region us-west-2

Conclusion

In this post, we walked through how to create an ECS task definition with the appropriate configuration to run a GPU-enabled workload inside a container on Bottlerocket, quickly and securely. We also saw how the container logs are available in CloudWatch and how to access them from the command line. If you are looking for additional examples of GPU-accelerated workloads to run with Bottlerocket on ECS, you can check out the NVIDIA GPU-optimized containers from the NVIDIA NGC catalog on AWS Marketplace.

Bottlerocket is open source (MIT or Apache 2.0 licensed), meaning you have a number of well-documented freedoms to use, modify, and extend it. Bottlerocket is also developed in the open on GitHub (https://github.com/bottlerocket-os/), and we welcome contributions, issues, and feedback on our discussion forum (https://github.com/bottlerocket-os/bottlerocket/discussions).
