Amazon EC2 Inf2 Instances

High performance at the lowest cost in Amazon EC2 for generative AI inference

Why Amazon EC2 Inf2 Instances?

Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances are purpose-built for deep learning (DL) inference. They deliver high performance at the lowest cost in Amazon EC2 for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers. You can use Inf2 instances to run your inference applications for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more.

Inf2 instances are powered by AWS Inferentia2, the second-generation AWS Inferentia chip. Inf2 instances improve on Inf1 by delivering 3x higher compute performance, 4x larger total accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between Inferentia chips. You can now efficiently and cost-effectively deploy models with hundreds of billions of parameters across multiple chips on Inf2 instances.

The AWS Neuron SDK helps developers deploy models on AWS Inferentia chips (and train them on AWS Trainium chips). It integrates natively with frameworks such as PyTorch and TensorFlow, so you can continue using your existing workflows and application code when running on Inf2 instances.
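
For example, compiling an existing PyTorch model for Inferentia2 typically takes only a few lines. The following is a minimal sketch, assuming the torch-neuronx package from the Neuron SDK is installed on an Inf2 instance; the torchvision ResNet-50 is just a stand-in for your own model.

```python
import torch
import torch_neuronx
from torchvision import models

# Any traceable PyTorch model works; ResNet-50 is used here as a placeholder.
model = models.resnet50(weights=None).eval()
example = torch.rand(1, 3, 224, 224)  # example input fixes the compiled shape

# Compile for Inferentia2; the result behaves like a regular TorchScript module.
neuron_model = torch_neuronx.trace(model, example)
neuron_model.save("resnet50_neuron.pt")

# At serving time, load the compiled artifact and call it like any other module.
loaded = torch.jit.load("resnet50_neuron.pt")
output = loaded(example)
```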

Benefits

Inf2 instances are the first inference-optimized instances in Amazon EC2 to support distributed inference at scale. You can now efficiently deploy models with hundreds of billions of parameters across multiple Inferentia chips on Inf2 instances, using the ultra-high-speed connectivity between the chips.
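
As a rough illustration of what scale-out inference looks like in code, the sketch below shards a decoder model across several NeuronCores with tensor parallelism. It assumes the optional transformers-neuronx package and a GPT2ForSampling-style interface with a tp_degree argument; verify the exact class names, arguments, and checkpoint-preparation steps against the current Neuron documentation.

```python
import torch
# Assumed import path; transformers-neuronx is an optional Neuron SDK package.
from transformers_neuronx.gpt2.model import GPT2ForSampling

# Shard the model across 8 NeuronCores (tp_degree=8); weights and activations
# then flow between Inferentia2 chips over NeuronLink rather than the host CPU.
model = GPT2ForSampling.from_pretrained("gpt2-xl", tp_degree=8, amp="f16")
model.to_neuron()  # compiles and loads the sharded model onto the chips

input_ids = torch.tensor([[50256]])  # a trivial single-token prompt
generated = model.sample(input_ids, sequence_length=64)
```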

Inf2 instances are designed to deliver high performance at the lowest cost in Amazon EC2 for your DL deployments. They offer up to 4x higher throughput and up to 10x lower latency than Amazon EC2 Inf1 instances. Inf2 instances deliver up to 40% better price performance than other comparable Amazon EC2 instances.

Use the AWS Neuron SDK to extract the full performance of Inf2 instances. With Neuron, you can use your existing frameworks like PyTorch and TensorFlow and get optimized out-of-the-box performance for models in popular repositories like Hugging Face. Neuron supports runtime integrations with serving tools like TorchServe and TensorFlow Serving. It also helps optimize performance with built-in profiling and debugging tools like Neuron-Top and integrates into popular visualization tools like TensorBoard.

Inf2 instances deliver up to 50% better performance/watt over other comparable Amazon EC2 instances. These instances and the underlying Inferentia2 chips use advanced silicon processes and hardware and software optimizations to deliver high energy efficiency when running DL models at scale. Use Inf2 instances to help meet your sustainability goals when deploying ultra-large models.

Features

Inf2 instances are powered by up to 12 AWS Inferentia2 chips connected with ultra-high-speed NeuronLink for streamlined collective communications. They offer up to 2.3 petaflops of compute and up to 4x higher throughput and 10x lower latency than Inf1 instances.

To accommodate large DL models, Inf2 instances offer up to 384 GB of shared accelerator memory (32 GB HBM in every Inferentia2 chip, 4x larger than first-generation Inferentia) with 9.8 TB/s of total memory bandwidth (10x faster than first-generation Inferentia).

For fast communication between Inferentia2 chips, Inf2 instances support 192 GB/s of NeuronLink, a high-speed, nonblocking interconnect. Inf2 is the only inference-optimized EC2 instance type to offer this interconnect, a feature otherwise available only in more expensive training instances. For ultra-large models that do not fit into a single chip, data flows directly between chips over NeuronLink, bypassing the CPU completely. With NeuronLink, Inf2 supports faster distributed inference and improves throughput and latency.

Inferentia2 supports FP32, TF32, BF16, FP16, UINT8, and the new configurable FP8 (cFP8) data type. AWS Neuron can take high-precision FP32 and FP16 models and autocast them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining and enabling higher-performance inference with smaller data types.
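
In practice, autocasting is steered at compile time. The sketch below assumes the Neuron compiler's --auto-cast and --auto-cast-type options can be passed through torch_neuronx.trace via compiler_args; treat the specific flag spellings and values as assumptions to check against the Neuron compiler reference.

```python
import torch
import torch_neuronx

# A small FP32 model stands in for your own network.
model = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU()).eval()
example = torch.rand(1, 128)

# Ask the compiler to autocast FP32 matrix-multiply operations down to BF16,
# trading a small amount of precision for higher throughput.
neuron_model = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast=matmult", "--auto-cast-type=bf16"],
)
```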

To keep pace with DL innovation, Inf2 instances include several features that make them flexible and extensible for deploying constantly evolving DL models. They provide hardware optimizations and software support for dynamic input shapes, and they support custom operators written in C++ so new operators can be added as they emerge. They also support stochastic rounding, a method of rounding probabilistically that achieves high performance and higher accuracy compared to legacy rounding modes.

Product details

Instance Size  | Inferentia2 Chips | Accelerator Memory (GB) | vCPUs | Memory (GiB) | Local Storage | Inter-Chip Interconnect | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price ($/hr) | 1-Year Reserved Instance ($/hr) | 3-Year Reserved Instance ($/hr)
inf2.xlarge    | 1                 | 32                      | 4     | 16           | EBS Only      | N/A                     | Up to 15                 | Up to 10             | $0.76                  | $0.45                           | $0.30
inf2.8xlarge   | 1                 | 32                      | 32    | 128          | EBS Only      | N/A                     | Up to 25                 | 10                   | $1.97                  | $1.81                           | $0.79
inf2.24xlarge  | 6                 | 192                     | 96    | 384          | EBS Only      | Yes                     | 50                       | 30                   | $6.49                  | $3.89                           | $2.60
inf2.48xlarge  | 12                | 384                     | 192   | 768          | EBS Only      | Yes                     | 100                      | 60                   | $12.98                 | $7.79                           | $5.19

Customer and Partner testimonials

Here are some examples of how customers and partners have achieved their business goals with Amazon EC2 Inf2 instances.

  • Leonardo.ai

    Our team at Leonardo leverages generative AI to enable creative professionals and enthusiasts to produce visual assets with unmatched quality, speed, and style consistency. The price to performance of AWS Inf2 is fundamentally changing the value proposition we can offer customers: utilizing AWS Inf2, we are able to reduce our costs by 80% without sacrificing performance, enabling our most advanced features at a more accessible price point. It also alleviates concerns around cost and capacity availability for our ancillary AI services, which are increasingly important as we grow and scale. It is a key enabling technology for us as we continue to push the envelope on what’s possible with generative AI, enabling a new era of creativity and expressive power for our users.

    Pete Werner, Head of AI at Leonardo.ai
  • Runway

    At Runway, our suite of AI Magic Tools enables our users to generate and edit content like never before. We are constantly pushing the boundaries of what is possible with AI-powered content creation, and as our AI models become more complex, the underlying infrastructure costs to run these models at scale can become expensive. Through our collaboration with Amazon EC2 Inf2 instances powered by AWS Inferentia, we’re able to run some of our models with up to 2x higher throughput than comparable GPU-based instances. This high-performance, low-cost inference enables us to introduce more features, deploy more complex models, and ultimately deliver a better experience for the millions of creators using Runway.

    Cristóbal Valenzuela, Cofounder and CEO at Runway
  • Qualtrics

    Qualtrics designs and develops experience management software.

    At Qualtrics, our focus is building technology that closes experience gaps for customers, employees, brands, and products. To achieve that, we are developing complex multi-task, multi-modal DL models to launch new features, such as text classification, sequence tagging, discourse analysis, key-phrase extraction, topic extraction, clustering, and end-to-end conversation understanding. As we utilize these more complex models in more applications, the volume of unstructured data grows, and we need more performant inference-optimized solutions that can meet these demands, such as Inf2 instances, to deliver the best experiences to our customers. We are excited about the new Inf2 instances because they will not only allow us to achieve higher throughput while dramatically cutting latency, but also introduce features like distributed inference and enhanced dynamic input shape support, which will help us scale to meet deployment needs as we push toward larger, more complex models.

    Aaron Colak, Head of Core Machine Learning at Qualtrics
  • Finch Computing

    Finch Computing is a natural language technology company providing artificial intelligence applications for government, financial services, and data integrator clients.

    To meet our customers’ needs for real-time natural language processing, we develop state-of-the-art DL models that scale to large production workloads. We have to provide low-latency transactions and achieve high throughputs to process global data feeds. We already migrated many production workloads to Inf1 instances and achieved an 80% reduction in cost over GPUs. Now, we are developing larger, more complex models that enable deeper, more insightful meaning from written text. A lot of our customers need access to these insights in real time, and the performance on Inf2 instances will help us deliver lower latency and higher throughput over Inf1 instances. With the Inf2 performance improvements and new Inf2 features, such as support for dynamic input sizes, we are improving our cost-efficiency, elevating the real-time customer experience, and helping our customers glean new insights from their data.

    Franz Weckesser, Chief Architect at Finch Computing
  • Money Forward Inc.

    Money Forward Inc. serves businesses and individuals with an open and fair financial platform. As part of this platform, HiTTO Inc., a Money Forward group company, offers an AI chatbot service, which uses tailored natural language processing (NLP) models to address the diverse needs of their corporate customers.

    We launched a large-scale AI chatbot service on the Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. We were very pleased to see further performance improvements in our initial test results on Amazon EC2 Inf2 instances. Using the same custom NLP model, AWS Inf2 was able to further reduce the latency by 10x over Inf1. As we move to larger multibillion parameter models, Inf2 gives us the confidence that we can continue to provide our customers with a superior end-to-end user experience.

    Takuya Nakade, CTO at Money Forward Inc.
  • Fileread

    At Fileread.ai, we are building solutions to make interacting with your docs as easy as asking them questions, enabling users to find what they are looking for across all their docs and get the right information faster. Since switching to the new Inf2 EC2 instance, we've seen a significant improvement in our NLP inference capabilities. The cost savings alone have been a game-changer for us, allowing us to allocate resources more efficiently without sacrificing quality. We reduced our inferencing latency by 33% while increasing throughput by 50%—delighting our customers with faster turnarounds. Our team has been blown away by the speed and performance of Inf2 compared to the older G5 instances, and it's clear that this is the future of deploying NLP models.

    Daniel Hu, CEO at Fileread
  • Yaraku

    At Yaraku, our mission is to build the infrastructure that helps people communicate across language barriers. Our flagship product, YarakuZen, enables anyone, from professional translators to monolingual individuals, to confidently translate and post-edit texts and documents. To support this process, we offer a wide range of sophisticated tools based on DL models, covering tasks such as translation, bitext word alignment, sentence segmentation, language modeling, and many others. By using Inf1 instances, we have been able to speed up our services to meet the increasing demand while reducing the inference cost by more than 50% compared to GPU-based instances. We are now moving into the development of next-generation larger models that will require the enhanced capabilities of Inf2 instances to meet demand while maintaining low latency. With Inf2, we will be able to scale up our models by 10x while maintaining similar throughput, allowing us to deliver even higher levels of quality to our customers.

    Giovanni Giacomo, NLP Lead at Yaraku
  • Hugging Face

    Hugging Face’s mission is to democratize good ML to help ML developers around the world solve real-world problems. And key to that is ensuring the latest and greatest models run as fast and efficiently as possible on the best ML chips in the cloud. We are incredibly excited about the potential for Inferentia2 to become the new standard way to deploy generative AI models at scale. With Inf1, we saw up to 70% lower cost than traditional GPU-based instances, and with Inf2 we have seen up to 8x lower latency for BERT-like Transformers compared to Inferentia1. With Inferentia2, our community will be able to easily scale this performance to LLMs at the 100B+ parameter scale, and to the latest diffusion and computer vision models as well.

  • PyTorch

    PyTorch accelerates the path from research prototyping to production deployments for ML developers. We have collaborated with the AWS team to provide native PyTorch support for the new AWS Inferentia2 powered Amazon EC2 Inf2 instances. As more members of our community look to deploy large generative AI models, we are excited to partner with the AWS team to optimize distributed inference on Inf2 instances with high-speed NeuronLink connectivity between chips. With Inf2, developers using PyTorch can now easily deploy ultra-large LLMs and vision transformer models. Additionally, Inf2 instances bring other innovative capabilities to PyTorch developers, including efficient data types, dynamic shapes, custom operators, and hardware-optimized stochastic rounding, making them well-suited for wide adoption by the PyTorch community.

  • Weights & Biases

    Weights & Biases (W&B) provides developer tools for ML engineers and data scientists to build better models faster. The W&B platform provides ML practitioners a wide variety of insights to improve the performance of models, including the utilization of the underlying compute infrastructure. We have collaborated with the AWS team to add support for Amazon Trainium and Inferentia2 to our system metrics dashboard, providing valuable data much needed during model experimentation and training. This enables ML practitioners to optimize their models to take full advantage of AWS’s purpose-built hardware to train their models faster and at lower cost.

    Phil Gurbacki, VP of Product at Weights & Biases
  • OctoML

    OctoML helps developers reduce costs and build scalable AI applications by packaging their DL models to run on high-performance hardware. We have spent the last several years building expertise on the best software and hardware solutions and integrating them into our platform. Our roots as chip designers and system hackers make AWS Trainium and Inferentia even more exciting for us. We see these chips as a key driving factor for the future of AI innovation on the cloud. The GA launch of Inf2 instances is especially timely, as we are seeing popular LLMs emerge as a key building block of next-generation AI applications. We are excited to make these instances available in our platform to help developers easily take advantage of their high performance and cost-saving benefits.

    Jared Roesch, CTO and Cofounder at OctoML
  • Nextira

    The historic challenge with LLMs, and more broadly with enterprise-level generative AI applications, is the cost associated with training and running high-performance DL models. Along with AWS Trainium, AWS Inferentia2 removes the financial compromises our customers make when they require high-performance training. Now, our customers looking for advantages in training and inference can achieve better results for less money. Trainium and Inferentia accelerate scale to meet even the most demanding DL requirements for today’s largest enterprises. Many Nextira customers running large AI workloads will benefit directly with these new chipsets, increasing efficiencies in cost savings and performance and leading to faster results in their market.

    Jason Cutrer, founder and CEO at Nextira
  • Amazon CodeWhisperer

    Amazon CodeWhisperer is an AI coding companion that generates real-time single-line or full-function code recommendations in your integrated development environment (IDE) to help you quickly build software.

    With CodeWhisperer, we're improving software developer productivity by providing code recommendations using generative AI models. To develop highly effective code recommendations, we scaled our DL network to billions of parameters. Our customers need code recommendations in real time as they type, so low-latency responses are critical. Large generative AI models require high-performance compute to deliver response times in a fraction of a second. With Inf2, we're delivering the same latency as running CodeWhisperer on training-optimized GPU instances for large input and output sequences. Thus, Inf2 instances are helping us save cost and power while delivering the best possible experience for developers.

    Doug Seven, General Manager at Amazon CodeWhisperer
  • Amazon Search

    Amazon's product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world.

    I am super excited about the Inf2 GA launch. The superior performance of Inf2, coupled with its ability to handle larger models with billions of parameters, makes it the perfect choice for our services and enables us to unlock new possibilities in terms of model complexity and accuracy. With the significant speedup and cost-efficiency offered by Inf2, integrating them into the Amazon Search serving infrastructure can help us meet the growing demands of our customers. We are planning to power our new shopping experiences using generative LLMs on Inf2.

    Trishul Chilimbi, VP at Amazon Search

Getting started

Amazon SageMaker makes it easier to deploy models on Inf2 instances, helping you significantly reduce the cost of deploying ML models and increase performance without the need to manage infrastructure. SageMaker is a fully managed service that integrates with MLOps tools, so you can scale your model deployment, manage models more effectively in production, and reduce operational burden.
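
The following is a minimal sketch of what an Inf2-backed SageMaker deployment might look like with the SageMaker Python SDK. The S3 artifact, IAM role, entry point, and framework versions are placeholders; substitute your own values and a Neuron-compatible container image where required.

```python
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",             # placeholder artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    entry_point="inference.py",                           # your inference handler
    framework_version="1.13",                             # assumed Neuron-compatible version
    py_version="py39",
)

# SageMaker provisions and manages the Inf2-backed endpoint for you.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)

result = predictor.predict({"inputs": "Summarize this paragraph ..."})
```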

The AWS Deep Learning AMIs (DLAMI) provide DL practitioners and researchers with the infrastructure and tools to accelerate DL in the cloud, at any scale. AWS Neuron drivers come preconfigured in the DLAMI to deploy your DL models optimally on Inf2 instances.

You can now deploy Inf2 instances in Amazon Elastic Kubernetes Service (Amazon EKS), a fully managed Kubernetes service, and in Amazon Elastic Container Service (Amazon ECS), a fully managed container orchestration service. Neuron is also available preinstalled in AWS Deep Learning Containers. To learn more about running containers on Inf2 instances, see the Neuron containers tutorials.