AWS Inferentia

Get high performance at the lowest cost in Amazon EC2 for deep learning and generative AI inference

Why Inferentia?

AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost in Amazon EC2 for your deep learning (DL) and generative AI inference applications. 

The first-generation AWS Inferentia accelerator powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, which deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances. Many customers, including Finch AI, Sprinklr, Money Forward, and Amazon Alexa, have adopted Inf1 instances and realized their performance and cost benefits.

The AWS Inferentia2 accelerator delivers up to 4x higher throughput and up to 10x lower latency than first-generation Inferentia. Inferentia2-based Amazon EC2 Inf2 instances are optimized to deploy increasingly complex models, such as large language models (LLMs) and latent diffusion models, at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators. Many customers, including Leonardo.ai, Deutsche Telekom, and Qualtrics, have adopted Inf2 instances for their DL and generative AI applications.
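As an illustration of scale-out inference, the sketch below shards an LLM across several NeuronCores on an Inf2 instance using the transformers-neuronx library from the AWS Neuron SDK. The model name, tp_degree value, and other arguments are assumptions drawn from public transformers-neuronx examples and may differ between Neuron SDK releases.

    # Illustrative sketch only: tensor-parallel LLM inference on Inf2.
    # Model name and argument values are placeholders/assumptions; the exact
    # transformers-neuronx API may vary by Neuron SDK release.
    import torch
    from transformers import AutoTokenizer
    from transformers_neuronx import LlamaForSampling

    tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")

    # tp_degree shards the model weights across 8 NeuronCores connected by the
    # high-speed accelerator interconnect.
    neuron_model = LlamaForSampling.from_pretrained(
        "openlm-research/open_llama_3b", batch_size=1, tp_degree=8, amp="f16"
    )
    neuron_model.to_neuron()  # compile and load the sharded model onto the accelerators

    input_ids = tokenizer("Hello, Inferentia2!", return_tensors="pt").input_ids
    with torch.inference_mode():
        generated = neuron_model.sample(input_ids, sequence_length=128)
    print(tokenizer.decode(generated[0]))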

The AWS Neuron SDK helps developers deploy models on AWS Inferentia accelerators (and train them on AWS Trainium accelerators). It integrates natively with popular frameworks, such as PyTorch and TensorFlow, so you can keep using your existing code and workflows and run your models on Inferentia accelerators.
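A minimal sketch of that workflow, assuming an Inf2 instance with the Neuron SDK and torch-neuronx installed: a stock PyTorch model is compiled for Inferentia2 with a single trace call.

    # Minimal sketch: compile a PyTorch model for Inferentia2 with torch-neuronx.
    import torch
    import torch_neuronx

    # Any TorchScript-traceable PyTorch model works; a tiny example model here.
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).eval()
    example_input = torch.rand(1, 128)   # the example input fixes the compiled shape

    # torch_neuronx.trace invokes the Neuron compiler and returns a TorchScript
    # module that executes on Inferentia2 NeuronCores.
    neuron_model = torch_neuronx.trace(model, example_input)
    torch.jit.save(neuron_model, "model_neuron.pt")   # reload later with torch.jit.load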

Benefits of Inferentia

Each first-generation Inferentia accelerator has four first-generation NeuronCores, with up to 16 Inferentia accelerators per EC2 Inf1 instance. Each Inferentia2 accelerator has two second-generation NeuronCores, with up to 12 Inferentia2 accelerators per EC2 Inf2 instance. Each Inferentia2 accelerator supports up to 190 tera floating-point operations per second (TFLOPS) of FP16 performance. The first-generation Inferentia has 8 GB of DDR4 memory per accelerator and also features a large amount of on-chip memory. Inferentia2 offers 32 GB of HBM per accelerator, increasing the total memory by 4x and memory bandwidth by 10x over Inferentia.
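For a rough sense of per-instance scale, the arithmetic below combines the per-accelerator figures above with the maximum accelerator counts quoted for Inf1 and Inf2; it is a sketch only, and actual instance sizes and usable memory vary.

    # Back-of-the-envelope totals from the per-accelerator figures quoted above,
    # assuming the largest Inf1 and Inf2 instance sizes.
    inf1_accel_memory_gb = 16 * 8        # 16 Inferentia x 8 GB DDR4  = 128 GB
    inf2_accel_memory_gb = 12 * 32       # 12 Inferentia2 x 32 GB HBM = 384 GB
    inf2_peak_fp16_tflops = 12 * 190     # up to 2,280 TFLOPS of FP16 per instance
    print(inf1_accel_memory_gb, inf2_accel_memory_gb, inf2_peak_fp16_tflops)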
The AWS Neuron SDK integrates natively with popular ML frameworks such as PyTorch and TensorFlow. With AWS Neuron, you can use these frameworks to optimally deploy DL models on both generations of AWS Inferentia accelerators, and Neuron is designed to minimize code changes and avoid tie-in to vendor-specific solutions. Neuron helps you run your inference applications for natural language processing (NLP) and understanding, language translation, text summarization, video and image generation, speech recognition, personalization, fraud detection, and more on Inferentia accelerators.
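Continuing the earlier sketch, the inference code itself stays plain PyTorch; only the compile step is Neuron-specific. The file name is the placeholder used above.

    # Load the compiled model and run inference with ordinary PyTorch code.
    import torch
    import torch_neuronx  # importing registers the Neuron ops needed to load the module

    model = torch.jit.load("model_neuron.pt")
    batch = torch.rand(1, 128)            # same shape the model was traced with
    with torch.inference_mode():
        logits = model(batch)             # runs on the Inferentia NeuronCores
    print(logits.shape)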
The first-generation Inferentia supports FP16, BF16, and INT8 data types. Inferentia2 adds support for FP32, TF32, and the new configurable FP8 (cFP8) data type, giving developers more flexibility to optimize performance and accuracy. AWS Neuron takes high-precision FP32 models and automatically casts them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining.
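A sketch of steering that autocasting at compile time is shown below. The --auto-cast and --auto-cast-type flag names are assumptions about the neuronx-cc compiler options; check the Neuron compiler reference for the exact options in your SDK version.

    # Sketch: ask the Neuron compiler to cast FP32 operations down to BF16.
    # Flag names are assumptions; verify against the neuronx-cc documentation.
    import torch
    import torch_neuronx

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).eval()                               # weights are ordinary FP32 here
    example_input = torch.rand(1, 128)

    neuron_model = torch_neuronx.trace(
        model,
        example_input,
        compiler_args=["--auto-cast", "all", "--auto-cast-type", "bf16"],
    )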
Inferentia2 adds hardware optimizations for dynamic input sizes and custom operators written in C++. It also supports stochastic rounding, a way of rounding probabilistically that enables high performance and higher accuracy compared to legacy rounding modes.
Inf2 instances offer up to 50% better performance per watt than comparable Amazon EC2 instances because they and the underlying Inferentia2 accelerators are purpose-built to run DL models at scale. Inf2 instances help you meet your sustainability goals when deploying ultra-large models.

Videos

Behind the scenes look at Generative AI infrastructure at Amazon
Introducing Amazon EC2 Inf2 instances powered by AWS Inferentia2
How four AWS customers reduced ML costs and drove innovation with AWS Inferentia