Amazon EC2 UltraClusters

Run HPC and ML applications at scale

Why Amazon EC2 UltraClusters?

Amazon Elastic Compute Cloud (Amazon EC2) UltraClusters can help you scale to thousands of GPUs or purpose-built machine learning (ML) chips, such as AWS Trainium, to get on-demand access to a supercomputer. They democratize access to supercomputing-class performance for ML, generative AI, and high performance computing (HPC) developers through a simple pay-as-you-go usage model, without any setup or maintenance costs. Amazon EC2 instances deployed in EC2 UltraClusters include P5en, P5e, P5, P4d, Trn2, and Trn1 instances.

EC2 UltraClusters consist of thousands of accelerated EC2 instances that are co-located in a given AWS Availability Zone and interconnected using Elastic Fabric Adapter (EFA) networking in a petabit-scale nonblocking network. EC2 UltraClusters also provide access to Amazon FSx for Lustre, fully managed shared storage built on the most popular high-performance parallel file system, so you can quickly process massive datasets on demand and at scale with sub-millisecond latencies. EC2 UltraClusters provide scale-out capabilities for distributed ML training and tightly coupled HPC workloads.
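
To make the co-location idea concrete, the short sketch below (illustrative only, using boto3; the group name is a placeholder, and UltraCluster capacity itself requires no extra setup) shows how a cluster placement group keeps a set of instances on the same low-latency network fabric within one Availability Zone.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A cluster placement group packs instances onto the same low-latency
# network fabric within a single Availability Zone.
ec2.create_placement_group(GroupName="ultracluster-demo", Strategy="cluster")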

Benefits

EC2 UltraClusters help you reduce training times and time-to-solution from weeks to just a few days. This helps you iterate at a faster pace and get your deep learning (DL), generative AI, and HPC applications to market more quickly.

Because the accelerated EC2 instances in an EC2 UltraCluster are interconnected with EFA networking in a petabit-scale nonblocking network, you can get on-demand access to several exaflops of accelerated compute.

EC2 UltraClusters are supported on a growing list of EC2 instances and give you the flexibility to choose the right compute option to maximize performance while keeping costs under control for your workload.

Features

High-performance networking

EC2 instances deployed in EC2 UltraClusters are interconnected with EFA networking to improve performance for distributed training workloads and tightly coupled HPC workloads. P5en, P5e, P5, and Trn2 instances deliver up to 3,200 Gbps; Trn1 instances deliver up to 1,600 Gbps; and P4d instances deliver up to 400 Gbps of EFA networking. EFA is also coupled with NVIDIA GPUDirect RDMA (P5en, P5e, P5, P4d) and NeuronLink (Trn2, Trn1) to enable low-latency accelerator-to-accelerator communication between servers with operating system bypass.
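
As a hedged illustration (the AMI, subnet, and security group IDs are placeholders, and larger instance types such as P5 and Trn2 attach multiple EFA interfaces, one per network card, to reach their full bandwidth), an EFA-enabled instance can be launched with boto3 by requesting a network interface of type efa:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a P4d instance with an EFA network interface into the cluster
# placement group. The AMI is assumed to have the EFA driver and
# libfabric installed.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",               # placeholder AMI ID
    InstanceType="p4d.24xlarge",
    MinCount=1,
    MaxCount=1,
    Placement={"GroupName": "ultracluster-demo"},  # placeholder group name
    NetworkInterfaces=[
        {
            "DeviceIndex": 0,
            "SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
            "Groups": ["sg-0123456789abcdef0"],      # placeholder security group
            "InterfaceType": "efa",
        }
    ],
)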

High-performance storage

EC2 UltraClusters use FSx for Lustre, fully managed shared storage built on the most popular high-performance parallel file system. With FSx for Lustre, you can quickly process massive datasets on demand and at scale, and deliver sub-millisecond latencies. The low-latency and high-throughput characteristics of FSx for Lustre are optimized for DL, generative AI, and HPC workloads on EC2 UltraClusters. FSx for Lustre keeps the GPUs and AI chips in EC2 UltraClusters fed with data, accelerating the most demanding workloads. These workloads include large language model (LLM) training, generative AI inferencing, DL, genomics, and financial risk modeling. You can also get access to virtually unlimited cost-effective storage with Amazon Simple Storage Service (Amazon S3).
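
As a minimal sketch (bucket name, subnet, and capacity are placeholder assumptions, and SCRATCH_2 is only one of several Lustre deployment types), an FSx for Lustre file system linked to an S3 bucket can be created with boto3 roughly like this:

import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

# Create a scratch Lustre file system that lazily loads objects from the
# linked S3 bucket, so every node in the cluster can stream the same dataset.
fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                        # GiB; minimum size for SCRATCH_2
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder subnet
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://example-training-data",  # placeholder bucket
    },
)

Each instance then mounts the file system with the Lustre client, so the whole cluster shares one namespace.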

Instances supported

Powered by AWS Trainium2 chips, Trn2 instances offer 30-40% better price performance than comparable GPU-based EC2 instances.

Powered by NVIDIA H200 Tensor Core GPUs, P5en and P5e instances provide the highest performance in Amazon EC2 for ML training and HPC applications.

Powered by NVIDIA H100 Tensor Core GPUs, P5 instances provide high performance in Amazon EC2 for ML training and HPC applications.

Powered by NVIDIA A100 Tensor Core GPUs, P4d instances provide high performance for ML training and HPC applications.

Powered by AWS Trainium AI chips, Trn1 instances are purpose built for high-performance ML training. They offer up to 50% cost-to-train savings over comparable EC2 instances.
