AWS Batch Features

Why AWS Batch?

With AWS Batch, you package the code for your batch jobs, specify their dependencies, and submit your batch job using the AWS Management Console, CLIs, or SDKs. Once you specify execution parameters and job dependencies, AWS Batch facilitates integration with a broad range of popular batch computing workflow engines and languages (for example, Pegasus WMS, Luigi, Nextflow, Metaflow, Apache Airflow, and AWS Step Functions). AWS Batch efficiently and dynamically provisions and scales Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), and AWS Fargate compute resources, with an option to use On-Demand or Spot Instances based on your job requirements. AWS Batch provides default job queues and compute environment definitions so you can get started quickly.
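
As a sketch of the SDK path, here is a minimal job submission with the AWS SDK for Python (boto3); the queue name, job definition, and command are hypothetical placeholders:

    import boto3

    batch = boto3.client("batch")

    # Submit a job to a hypothetical queue using a hypothetical job definition.
    response = batch.submit_job(
        jobName="example-job",
        jobQueue="my-job-queue",       # hypothetical queue name
        jobDefinition="my-job-def:1",  # hypothetical definition:revision
        containerOverrides={"command": ["python", "process.py"]},
    )
    print(response["jobId"])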

Job definitions

Multi-container jobs are a feature that makes it easier and faster to run simulations at scale when testing complex systems like those used in autonomous vehicles and robotics. With the ability to run multiple containers in a job, you can benefit from the advanced scaling, scheduling, and cost optimization capabilities of AWS Batch without rebuilding your system into a complex monolithic container. Instead, you can keep using multiple smaller, modular containers that each represent different system components, such as a 3D virtual environment, robot perception sensors, or a data logging sidecar. Multi-container jobs accelerate development by reducing job preparation steps and eliminating the need to build extra in-house tools. Running simulation jobs with multiple containers also simplifies software development (Dev) and IT operations (Ops) by defining component responsibility so that different teams can find and fix errors in their own team’s component without being distracted by other teams’ components.
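
As a rough sketch, a two-container simulation job might be registered with boto3 through the ecsProperties form of a job definition; the names and image URIs are hypothetical, and the exact field shape should be checked against the current AWS Batch API reference:

    import boto3

    batch = boto3.client("batch")

    # Hypothetical simulation job: a main simulator container plus a
    # data-logging sidecar, registered as one multi-container job definition.
    batch.register_job_definition(
        jobDefinitionName="sim-with-logger",  # hypothetical
        type="container",
        ecsProperties={
            "taskProperties": [{
                "containers": [
                    {
                        "name": "simulator",
                        "image": "my-registry/simulator:latest",  # hypothetical
                        "essential": True,
                        "resourceRequirements": [
                            {"type": "VCPU", "value": "4"},
                            {"type": "MEMORY", "value": "8192"},
                        ],
                    },
                    {
                        "name": "data-logger",
                        "image": "my-registry/logger:latest",  # hypothetical
                        "essential": False,
                        "resourceRequirements": [
                            {"type": "VCPU", "value": "1"},
                            {"type": "MEMORY", "value": "2048"},
                        ],
                    },
                ],
            }],
        },
    )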

AWS Batch supports multi-node parallel jobs, so you can run single jobs that span multiple EC2 instances. With this feature, you can use AWS Batch to efficiently run workloads such as large-scale, tightly coupled, high performance computing (HPC) applications or distributed GPU model training. AWS Batch also supports Elastic Fabric Adapter, a network interface that lets you run applications requiring high levels of internode communication at scale on AWS.

With AWS Batch, you can specify resource requirements, such as vCPU and memory, AWS Identity and Access Management (IAM) roles, volume mount points, container properties, and environment variables to define how jobs are run. AWS Batch runs your jobs as containerized applications on Amazon ECS. You can also define dependencies between different jobs. For example, your batch job can be composed of three stages of processing with different resource needs. With dependencies, you can create three jobs with different resource requirements where each successive job depends on the previous job.
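
The three-stage example might look like the following boto3 sketch, where each stage’s queue and job definition names are hypothetical:

    import boto3

    batch = boto3.client("batch")

    # Stage 1 runs immediately; stages 2 and 3 wait on their predecessors.
    prep = batch.submit_job(
        jobName="stage-1-prepare", jobQueue="cpu-queue", jobDefinition="prep-def")
    train = batch.submit_job(
        jobName="stage-2-train", jobQueue="gpu-queue", jobDefinition="train-def",
        dependsOn=[{"jobId": prep["jobId"]}])
    batch.submit_job(
        jobName="stage-3-report", jobQueue="cpu-queue", jobDefinition="report-def",
        dependsOn=[{"jobId": train["jobId"]}])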

Integrations

AWS Batch can be integrated with commercial and open-source workflow engines such as Pegasus WMS, Luigi, Nextflow, Metaflow, Apache Airflow, and AWS Step Functions, so you can use familiar workflow languages to model your batch computing pipelines.
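
For instance, Apache Airflow’s Amazon provider ships a BatchOperator that submits a job and waits for it to finish. The DAG below is a minimal sketch with hypothetical names; note that the override argument is called overrides in older provider releases, and that Airflow versions before 2.4 use schedule_interval instead of schedule:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.batch import BatchOperator

    # Minimal DAG with one AWS Batch task; all names are hypothetical.
    with DAG("batch_pipeline", start_date=datetime(2024, 1, 1), schedule=None):
        run_etl = BatchOperator(
            task_id="run_etl",
            job_name="nightly-etl",    # hypothetical
            job_queue="my-job-queue",  # hypothetical
            job_definition="etl-def",  # hypothetical
            container_overrides={"command": ["python", "etl.py"]},
        )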

AWS Batch supports EC2 launch templates, so you can build customized templates for your compute resources while Batch scales instances to meet your requirements. You can specify your EC2 launch template to add storage volumes, select network interfaces, or configure permissions, among other capabilities. EC2 launch templates reduce the number of steps required to configure Batch environments by capturing launch parameters within one resource.
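
A sketch of the pattern with boto3, creating a launch template that enlarges the root volume and referencing it from a managed compute environment; all names, subnet IDs, and roles are hypothetical:

    import boto3

    ec2 = boto3.client("ec2")
    batch = boto3.client("batch")

    # Launch template that gives each Batch instance a 200 GiB gp3 root volume.
    ec2.create_launch_template(
        LaunchTemplateName="batch-big-disk",
        LaunchTemplateData={
            "BlockDeviceMappings": [{
                "DeviceName": "/dev/xvda",
                "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},
            }],
        },
    )

    # Managed compute environment that launches instances from the template.
    # serviceRole is omitted, so Batch uses its service-linked role.
    batch.create_compute_environment(
        computeEnvironmentName="ec2-big-disk",        # hypothetical
        type="MANAGED",
        computeResources={
            "type": "EC2",
            "minvCpus": 0,
            "maxvCpus": 256,
            "instanceTypes": ["optimal"],
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical
            "instanceRole": "ecsInstanceRole",        # hypothetical
            "launchTemplate": {"launchTemplateName": "batch-big-disk"},
        },
    )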


AWS Batch displays key operational metrics for your batch jobs in the AWS Management Console. You can view metrics related to compute capacity, as well as metrics for running, pending, and completed jobs. Logs for your jobs (for example, STDERR and STDOUT) are available in the console and are also written to Amazon CloudWatch Logs.
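
For example, once a job is running you can look up its log stream and read the output with boto3; the job ID is a placeholder, and /aws/batch/job is the default log group for ECS-based jobs:

    import boto3

    batch = boto3.client("batch")
    logs = boto3.client("logs")

    # The log stream name is populated once the job reaches RUNNING.
    job = batch.describe_jobs(jobs=["<job-id>"])["jobs"][0]
    stream = job["container"]["logStreamName"]

    # ECS-based Batch jobs write STDOUT/STDERR here by default.
    for event in logs.get_log_events(
            logGroupName="/aws/batch/job", logStreamName=stream)["events"]:
        print(event["message"])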

AWS Batch uses IAM to control and monitor the AWS resources that your jobs can access, such as Amazon DynamoDB tables. Through IAM, you can also define policies for different users in your organization. For example, administrators can be granted full access permissions to any AWS Batch API operation, developers can have limited permissions related to configuring compute environments and registering jobs, and end users can be restricted to the permissions needed to submit and delete jobs.
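
A sketch of an end-user policy of that kind, created with boto3; the policy name is hypothetical and the action list is illustrative rather than exhaustive:

    import json

    import boto3

    iam = boto3.client("iam")

    # End users can submit, cancel, and inspect jobs, but cannot touch
    # compute environments or job definitions.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "batch:SubmitJob",
                "batch:CancelJob",
                "batch:TerminateJob",
                "batch:DescribeJobs",
                "batch:ListJobs",
            ],
            "Resource": "*",
        }],
    }
    iam.create_policy(
        PolicyName="BatchEndUser",  # hypothetical
        PolicyDocument=json.dumps(policy_document),
    )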

Compute environments

AWS Batch can run your batch jobs on your existing Amazon EKS clusters. You specify the vCPU, memory, and GPU requirements your containers need and then submit them to a queue attached to an EKS cluster–enabled compute environment. AWS Batch manages both the scaling of Kubernetes nodes and the placement of pods within your nodes. Additionally, AWS Batch manages queueing, dependency tracking, job retries, prioritization, and pod submission, along with providing support for Amazon Elastic Compute Cloud (EC2) On-Demand and Spot Instances. AWS Batch also integrates with your EKS cluster in a distinct namespace, so you don’t need to worry about batch jobs interfering with your existing processes. Finally, AWS Batch manages capacity for you, including maintaining a warm pool of nodes, capping capacity at a certain amount of vCPU, scaling nodes, and running jobs either in a single cluster or across multiple clusters.
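
A boto3 sketch of an EKS compute environment; the cluster ARN, namespace, subnet, and node role are hypothetical placeholders:

    import boto3

    batch = boto3.client("batch")

    # Compute environment attached to an existing EKS cluster. Batch places
    # its pods in the dedicated namespace named here.
    batch.create_compute_environment(
        computeEnvironmentName="eks-ce",  # hypothetical
        type="MANAGED",
        eksConfiguration={
            "eksClusterArn": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster",
            "kubernetesNamespace": "my-batch-namespace",
        },
        computeResources={
            "type": "EC2",
            "minvCpus": 0,
            "maxvCpus": 128,
            "instanceTypes": ["optimal"],
            "subnets": ["subnet-0123456789abcdef0"],   # hypothetical
            "instanceRole": "eks-node-instance-role",  # hypothetical
        },
    )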

AWS Batch with Fargate resources gives you a completely serverless architecture for your batch jobs. With Fargate, every job receives the exact amount of CPU and memory that it requests (within allowed Fargate SKUs), so there is no wasted resource time or need to wait for EC2 instance launches.
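
A minimal Fargate compute environment in boto3; swap FARGATE for FARGATE_SPOT to use Fargate Spot. The subnet and security group IDs are hypothetical:

    import boto3

    batch = boto3.client("batch")

    # Serverless compute environment: no instances to choose or manage.
    batch.create_compute_environment(
        computeEnvironmentName="fargate-ce",  # hypothetical
        type="MANAGED",
        computeResources={
            "type": "FARGATE",  # or "FARGATE_SPOT"
            "maxvCpus": 64,
            "subnets": ["subnet-0123456789abcdef0"],       # hypothetical
            "securityGroupIds": ["sg-0123456789abcdef0"],  # hypothetical
        },
    )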

If you’re a current AWS Batch user, Fargate allows an additional layer of separation from EC2 instances. There’s no need to manage or patch Amazon Machine Images (AMIs). When submitting your Fargate-compatible jobs to Batch, you don’t have to worry about maintaining two different services if you have some workloads that run on Amazon EC2 and others that run on Fargate.

AWS Batch provides a cloud-based scheduler complete with a managed queue and the ability to specify priorities, job retries, dependencies, timeouts, and more. AWS Batch manages submission to Fargate and the lifecycle of your jobs so you don’t have to.

Fargate also provides security benefits that come with no added effort (for example, SOX and PCI compliance) and isolation between compute resources for every job.

Scheduling

With AWS Batch, you can set up multiple queues with different priority levels. Batch jobs are stored in the queues until compute resources are available to run the job. The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a queue based on the resource requirements of each job. The scheduler evaluates the priority of each queue and runs jobs in priority order on optimal compute resources (for example, memory-optimized compared to CPU-optimized), as long as those jobs have no outstanding dependencies.
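
For example, two queues that share one compute environment but differ in priority, sketched with boto3 (names are hypothetical; jobs in higher-priority queues are scheduled first):

    import boto3

    batch = boto3.client("batch")

    # An urgent queue and a low-priority backfill queue on the same
    # compute environment.
    for name, priority in [("urgent-queue", 100), ("backfill-queue", 1)]:
        batch.create_job_queue(
            jobQueueName=name,
            state="ENABLED",
            priority=priority,
            computeEnvironmentOrder=[
                {"order": 1, "computeEnvironment": "fargate-ce"},  # hypothetical
            ],
        )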

GPU scheduling allows you to specify the number and type of accelerators your jobs require as job definition input variables in AWS Batch. AWS Batch will then scale up instances appropriate for your jobs based on the required number of GPUs and isolate the accelerators according to each job’s needs, so only the appropriate containers can access them.
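
Accelerators are requested through the resourceRequirements of a job definition, as in this boto3 sketch with a hypothetical training image:

    import boto3

    batch = boto3.client("batch")

    # Each job gets four GPUs; Batch launches GPU instances to match and
    # pins the devices to this job's container.
    batch.register_job_definition(
        jobDefinitionName="gpu-train",  # hypothetical
        type="container",
        containerProperties={
            "image": "my-registry/train:latest",  # hypothetical
            "resourceRequirements": [
                {"type": "VCPU", "value": "8"},
                {"type": "MEMORY", "value": "32768"},
                {"type": "GPU", "value": "4"},
            ],
        },
    )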

Scaling

When using Fargate or AWS Fargate Spot with AWS Batch, you only need to set up a few concepts in Batch (a compute environment, a job queue, and a job definition), and you have a complete queue, scheduler, and compute architecture, without managing a single piece of compute infrastructure.

For those wanting EC2 instances, AWS Batch provides managed compute environments that dynamically provision and scale compute resources based on the volume and resource requirements of your submitted jobs. You can configure your managed compute environments with requirements such as the type of EC2 instances, VPC subnet configurations, the min/max/desired vCPUs across all instances, and the amount you are willing to pay for Spot Instances as a percent of the On-Demand Instance price.
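
Those settings map onto the computeResources block of a compute environment, as in this boto3 sketch of a Spot environment; the name, instance types, subnet, and role are hypothetical:

    import boto3

    batch = boto3.client("batch")

    batch.create_compute_environment(
        computeEnvironmentName="spot-ce",  # hypothetical
        type="MANAGED",
        computeResources={
            "type": "SPOT",
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,
            "maxvCpus": 512,
            "desiredvCpus": 0,
            "instanceTypes": ["c5", "m5", "r5"],
            "bidPercentage": 60,  # pay at most 60% of the On-Demand price
            "subnets": ["subnet-0123456789abcdef0"],  # hypothetical
            "instanceRole": "ecsInstanceRole",        # hypothetical
        },
    )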

Alternatively, you can provision and manage your own compute resources within AWS Batch unmanaged compute environments if you need to use different configurations (for example, larger EBS volumes or a different operating system) for your EC2 instances than what is provided by managed compute environments. You only need to provision EC2 instances that include the Amazon ECS agent and run supported versions of Linux and Docker. AWS Batch will then run batch jobs on the EC2 instances that you provision.
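
In boto3, an unmanaged environment is created without a computeResources block; Batch then exposes the ECS cluster that your self-provisioned instances should join (the environment name is hypothetical):

    import boto3

    batch = boto3.client("batch")

    batch.create_compute_environment(
        computeEnvironmentName="self-managed-ce",  # hypothetical
        type="UNMANAGED",
    )

    # Join your own ECS-agent-enabled instances to this cluster.
    env = batch.describe_compute_environments(
        computeEnvironments=["self-managed-ce"])["computeEnvironments"][0]
    print(env["ecsClusterArn"])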

With AWS Batch, you can choose from three methods to allocate compute resources. These strategies let you factor in throughput and price when deciding how AWS Batch should scale instances for you; a short sketch mapping them to the corresponding API values follows the three descriptions below.

Best fit: AWS Batch selects an instance type that best fits the needs of the jobs with a preference for the lowest-cost instance type. If additional instances of the selected instance type are not available, AWS Batch will wait for the additional instances to become available. If there are not enough instances available, or if you are hitting Amazon EC2 service limits, then additional jobs will not run until currently running jobs have completed. This allocation strategy keeps costs lower but can limit scaling.

Best fit progressive: AWS Batch will select additional instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types with a lower-cost-per-unit vCPU. If additional instances of the previously selected instance types are not available, AWS Batch will select new instance types.

Spot capacity–optimized: AWS Batch will select one or more instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types that are less likely to be interrupted. This allocation strategy is only available for Spot Instance compute resources.
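
As a reference, the three methods correspond to allocationStrategy values accepted by create_compute_environment; the mapping below is a sketch and should be verified against the current AWS Batch API reference:

    # Feature name -> computeResources["allocationStrategy"] value
    ALLOCATION_STRATEGIES = {
        "Best fit": "BEST_FIT",
        "Best fit progressive": "BEST_FIT_PROGRESSIVE",
        "Spot capacity-optimized": "SPOT_CAPACITY_OPTIMIZED",
    }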