Amazon EC2 Trn1 Instances

High-performance, cost-effective training of generative AI models

Why Amazon EC2 Trn1 Instances?

Amazon Elastic Compute Cloud (EC2) Trn1 instances, powered by AWS Trainium chips, are purpose built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs) and latent diffusion models. Trn1 instances offer up to 50% cost-to-train savings over other comparable Amazon EC2 instances. You can use Trn1 instances to train 100B+ parameter DL and generative AI models across a broad set of applications, such as text summarization, code generation, question answering, image and video generation, recommendation, and fraud detection.

The AWS Neuron SDK helps developers train models on AWS Trainium (and deploy models on AWS Inferentia chips). It integrates natively with frameworks such as PyTorch and TensorFlow, so that you can continue using your existing code and workflows to train models on Trn1 instances. To learn about the current Neuron support for machine learning (ML) frameworks and libraries, model architectures, and hardware optimizations, see the Neuron documentation.
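
For a sense of what that workflow looks like, here is a minimal sketch of a PyTorch training loop on a Trn1 instance, assuming the PyTorch XLA (torch_xla) integration that Neuron builds on. The model, data, and hyperparameters are placeholders; see the Neuron documentation for current, complete examples.

```python
import torch
import torch_xla.core.xla_model as xm  # PyTorch XLA bindings used by AWS Neuron

# Neuron exposes the Trainium chips as an XLA device.
device = xm.xla_device()

model = torch.nn.Linear(512, 512).to(device)      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for step in range(10):
    x = torch.randn(64, 512).to(device)           # placeholder batch
    y = torch.randn(64, 512).to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Steps the optimizer and flushes the lazily built XLA graph to the device.
    xm.optimizer_step(optimizer, barrier=True)
```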

Benefits

Trn1 instances are purpose built for high-performance DL and reduce training times from months to weeks, or even days. With reduced training times, you can iterate faster, build more innovative models, and increase productivity. Trn1n instances deliver up to 20% faster time-to-train than Trn1 instances for models that benefit from increased network bandwidth.

Trn1 instances deliver high performance while offering up to 50% cost-to-train savings over other comparable Amazon EC2 instances.

Use the AWS Neuron SDK to extract the full performance of Trn1 instances. With Neuron, you can use popular ML frameworks like PyTorch and TensorFlow and continue to use your existing code and workflows to train models on Trn1 instances. To quickly get started with Trn1 instances, see popular model examples in the Neuron documentation.

Trn1 instances support up to 800 Gbps of second-generation Elastic Fabric Adapter (EFAv2) network bandwidth. Trn1n instances support up to 1600 Gbps of EFAv2 network bandwidth for even higher performance on network-intensive models. Both instance types are deployed in EC2 UltraClusters that scale up to 30,000 Trainium chips, interconnected with a nonblocking, petabit-scale network to provide 6 exaflops of compute performance.
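
As a hedged sketch of how a multi-node job exercises that fabric: with the XLA process-group backend from torch_xla, ordinary torch.distributed collectives are what travel over EFA between instances. The rank and world-size wiring is assumed to come from a launcher such as torchrun, and the exact module layout can vary by Neuron release.

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" process-group backend

# Assumes a distributed launcher (e.g. torchrun) has set RANK and WORLD_SIZE
# for each worker on each Trn1 instance.
torch.distributed.init_process_group("xla")

device = xm.xla_device()
t = torch.ones(1, device=device)
torch.distributed.all_reduce(t)  # gradient-style collective; crosses nodes over EFA
print(t.item())                  # equals the world size once summed across workers
```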

Features

Trn1 instances are powered by up to 16 AWS Trainium chips purpose built to accelerate DL training and deliver up to 3 petaflops of FP16/BF16 compute power. Each chip includes two second-generation NeuronCores.

To support efficient data and model parallelism, each Trn1 instance has 512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth.

To support training of network-intensive models, such as Mixture of Experts (MoE) and Generative Pre-Trained Transformers (GPT), each Trn1n instance delivers up to 1600 Gbps of EFAv2 networking bandwidth. Each Trn1 instance supports up to 800 Gbps of EFAv2 bandwidth. EFAv2 speeds up distributed training by delivering up to 50% improvement in collective communications performance over first-generation EFA. These instances also support up to 80 Gbps of Amazon Elastic Block Store (EBS) bandwidth and up to 8 TB of local NVMe solid state drive (SSD) storage for fast workload access to large datasets.

For fast connectivity between Trainium chips and streamlined collective communications, Trn1 instances support up to 768 GB/s of NeuronLink, a high-speed, nonblocking interconnect.

To deliver high performance while meeting accuracy goals, Trn1 instances are optimized for FP32, TF32, BF16, FP16, UINT8, and the new configurable FP8 (cFP8) data type. To keep pace with fast-moving DL and generative AI research, Trn1 instances include several innovations that make them flexible and extensible for training constantly evolving models: hardware optimizations and software support for dynamic input shapes, custom operators written in C++ so new operators can be supported in the future, and stochastic rounding, a method of rounding probabilistically that achieves high performance and higher accuracy than legacy round-to-nearest modes.
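
In the PyTorch flow, for example, BF16 casting and stochastic rounding are typically toggled through environment variables rather than code changes. The variable names below follow the PyTorch XLA and Neuron documentation, but treat them as assumptions to verify against your Neuron release:

```python
import os

# Assumption: these are the documented switches for BF16 auto-casting and
# hardware stochastic rounding; confirm the names for your Neuron version.
os.environ["XLA_USE_BF16"] = "1"                   # auto-cast FP32 ops to BF16 on device
os.environ["NEURON_RT_STOCHASTIC_ROUNDING"] = "1"  # probabilistic rounding instead of round-to-nearest

# Build and train the model as usual afterwards; BF16 with stochastic
# rounding typically tracks FP32 accuracy more closely than round-to-nearest.
```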

Customer and Partner Testimonials

Here are some examples of how customers and partners have achieved their business goals with Amazon EC2 Trn1 instances.

  • Databricks

    More than 10,000 organizations worldwide, including Comcast, Condé Nast, and over 50% of the Fortune 500, rely on Databricks to unify their data, analytics, and AI.

    Thousands of customers have implemented Databricks on AWS, giving them the ability to use MosaicML to pre-train, fine-tune, and serve foundation models for a variety of use cases. AWS Trainium gives us the scale and high performance needed to train our Mosaic MPT models, and at a low cost. As we train our next generation Mosaic MPT models, Trainium2 will make it possible to build models even faster, allowing us to provide our customers unprecedented scale and performance so they can bring their own generative AI applications to market more rapidly.

    Naveen Rao, VP of Generative AI, Databricks
  • Stockmark Co., Ltd

    With the mission of “reinventing the mechanism of value creation and advancing humanity,” Stockmark helps many companies create and build innovative businesses by providing cutting-edge natural language processing technology.

    With 16 nodes of Amazon EC2 Trn1 instances powered by AWS Trainium chips, we have developed and released stockmark-13b, a large language model with 13 billion parameters, pre-trained from scratch on a Japanese corpus of 220B tokens. The corpus includes the latest business-domain texts up to September 2023. The model achieved the highest JSQuAD score (0.813) on the JGLUE (Japanese General Language Understanding Evaluation) benchmark compared to other equivalent models. It is available on the Hugging Face Hub and can be used commercially under the MIT license. Trn1 instances helped us achieve a 20% reduction in training cost compared to equivalent GPU instances.

    Kosuke Arima, CTO, Stockmark Co., Ltd.
  • RICOH

    RICOH offers workplace solutions and digital transformation services designed to manage and optimize the flow of information across businesses.

    The migration to Trn1 instances was quite straightforward. We were able to complete the training of our 13B parameter model in just 8 days. Building on this success, we are looking forward to developing and training our 70B parameter model on Trainium and are excited about the potential of these instances in training our models faster and more cost-effectively.

    Yoshiaki Umetsu, Director, Digital Technology Development Center, RICOH
  • HeliXon

    At HeliXon, we build next-generation AI solutions for protein-based therapeutics. We aim to develop AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies. Today, we use training distribution libraries like FSDP to parallelize model training over many GPU-based servers, but this still takes us weeks to train a single model. We are excited to utilize Amazon EC2 Trn1 instances, featuring the highest networking bandwidth (800 Gbps) available in AWS, to improve the performance of our distributed training jobs, reduce our model training times, and also reduce our training costs.

    Jian Peng, CEO, HeliXon
  • Money Forward, Inc.

    Money Forward, Inc. serves businesses and individuals with an open and fair financial platform.

    We launched a large-scale AI chatbot service on Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. As we keep fine-tuning tailored NLP models periodically, reducing model training times and costs is also important. Based on our successful migration of inference workloads to Inf1 instances and our initial work on AWS Trainium-based EC2 Trn1 instances, we expect Trn1 instances will provide additional value in improving end-to-end ML performance and cost.

    Takuya Nakade, CTO, Money Forward, Inc.
  • Magic

    Magic is an integrated product and research company developing AI that feels like a colleague to make the world more productive.

    Training large autoregressive Transformer-based models is an essential component of our work. AWS Trainium-powered Trn1 instances are designed specifically for these workloads, offering near-infinite scalability, fast inter-node networking, and advanced support for 16-bit and 8-bit data types. Trn1 instances will help us train large models faster, at a lower cost. We are particularly excited about the native support for BF16 stochastic rounding in Trainium, which increases performance while keeping numerical accuracy indistinguishable from full precision.

    Eric Steinberger, Cofounder and CEO, Magic
  • Cactus Communications

    CACTUS has a suite of products and solutions for researchers and organizations that improve how research gets funded, published, communicated, and discovered.

    At Cactus Labs, we harness the power of AI, with research focused on natural language processing, ranking and recommendation, conversational AI, large language models, computer vision, AR/VR, and XAI. In line with our quest to enable faster training of machine learning models, and to let our researchers run more experiments while managing infrastructure costs, we were delighted to evaluate AWS Trainium. AWS Trainium's out-of-the-box features, such as XLA optimization, multi-worker data-parallel training, and graph caching, help us reduce our training times and run more experiments faster and more cheaply.

    Nishchay Shah, CTO and Head of Emerging Products, Cactus Communications
  • Watashiha

    Watashiha offers an innovative and interactive AI chatbot service, “OGIRI AI,” which incorporates humor to provide witty, on-the-spot answers to questions.

    We use large language models to incorporate humor and offer a more relevant and conversational experience to our customers on our AI services. This requires us to pre-train and fine-tune these models frequently. We pre-trained a GPT-based Japanese model on an EC2 trn1.32xlarge instance, leveraging tensor and data parallelism. The training was completed within 28 days at a 33% cost reduction over our previous GPU-based infrastructure. As our models rapidly grow in complexity, we look forward to Trn1n instances, which have double the network bandwidth of Trn1, to speed up training of larger models.

    Yohei Kobashi, CTO, Watashiha, K.K.
  • PyTorch

    At PyTorch, we accelerate taking machine learning from research prototyping to production readiness for customers. We have collaborated extensively with the AWS team to provide native PyTorch support for the new AWS Trainium-powered Amazon EC2 Trn1 instances, which are purpose built for training deep learning models. Developers building PyTorch models can start training on Trn1 instances with minimal code changes. Additionally, we have worked with the OpenXLA community to enable PyTorch Distributed libraries for easy model migration from GPU-based instances to Trn1 instances. We are excited about the innovation that Trn1 instances bring to the PyTorch community, including more efficient data types, dynamic shapes, custom operators, hardware-optimized stochastic rounding, and eager debug mode. All of these make Trn1 well suited for wide adoption by PyTorch developers, and we look forward to future joint contributions to PyTorch to further optimize training performance.

    Geeta Chauhan, Applied AI, Engineering Manager, PyTorch
  • Hugging Face

    Hugging Face’s mission is to democratize good ML to help ML developers around the world solve real-world problems. And key to that is ensuring the latest and greatest models run as fast and efficiently as possible on the best ML chips in the cloud. We are incredibly excited about the potential for Inferentia2 to become the new standard way to deploy generative AI models at scale. With Inf1, we saw up to 70% lower cost than traditional GPU-based instances, and with Inf2 we have seen up to 8x lower latency for BERT-like transformers compared to Inferentia1. With Inferentia2, our community will be able to easily scale this performance to LLMs at the 100B+ parameters scale, and to the latest diffusion and computer vision models as well.

  • Amazon

    We are training large language models (LLM) that are multi-modal (text + image), multilingual, multi-locale, pre-trained on multiple tasks, and span multiple entities (products, queries, brands, reviews, etc.) to improve the customer shopping experience. Trn1 instances provide a more sustainable way to train LLMs by delivering the best performance per watt compared to other accelerated machine-learning solutions, and they offer us high performance at the lowest cost. We plan to explore the new configurable FP8 data type and hardware-accelerated stochastic rounding to further increase our training efficiency and development velocity.

    Trishul Chilimbi, VP, Amazon Search

Getting started

You can easily train models on Trn1 instances by using Amazon SageMaker, which significantly reduces the time and cost to train and tune ML models without the need to manage infrastructure. With SageMaker, you can use built-in tools to manage and track training experiments, automatically choose optimal hyperparameters, debug training jobs, and monitor the use of system resources.
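
As a hedged sketch, launching a Trn1 training job through the SageMaker Python SDK looks roughly like this. The entry script, IAM role, S3 path, and framework versions are placeholders to adapt to your account and to an available Neuron-enabled training container.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                               # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder execution role
    instance_type="ml.trn1.32xlarge",                     # Trn1 instance type in SageMaker
    instance_count=1,
    framework_version="1.13.1",                           # example version; check container availability
    py_version="py39",
)

estimator.fit({"training": "s3://my-bucket/train-data"})  # placeholder dataset location
```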

The AWS Deep Learning AMIs (DLAMI) provide DL practitioners and researchers with the infrastructure and tools to accelerate DL on AWS at any scale. AWS Neuron drivers come preconfigured in the DLAMI to train your DL models optimally on Trn1 instances.

You can now deploy Trn1 instances in Amazon Elastic Kubernetes Service (EKS), a fully managed Kubernetes service, and in Amazon Elastic Container Service (ECS), a fully managed container orchestration service. Neuron is also available pre-installed in AWS Deep Learning Containers. To learn more about running containers on Trn1 instances, see the Neuron containers tutorials.

Product details

| Instance Size | Trainium Chips | Accelerator Memory (GB) | vCPUs | Instance Memory (GiB) | Local NVMe Storage (TB) | Network Bandwidth (Gbps) | EFA and RDMA Support | EBS Bandwidth (Gbps) | On-Demand Price per Hour | 1-Year Reserved Instance Effective Hourly* | 3-Year Reserved Instance Effective Hourly* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| trn1.2xlarge | 1 | 32 | 8 | 32 | 0.5 | Up to 12.5 | No | Up to 20 | $1.34 | $0.79 | $0.4744 |
| trn1.32xlarge | 16 | 512 | 128 | 512 | 8 | 800 | Yes | 80 | $21.50 | $12.60 | $7.59 |
| trn1n.32xlarge | 16 | 512 | 128 | 512 | 8 | 1600 | Yes | 80 | $24.78 | $14.52 | $8.59 |