Why Amazon EC2 Inf2 Instances?
Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances are purpose-built for deep learning (DL) inference. They deliver high performance at the lowest cost in Amazon EC2 for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers. You can use Inf2 instances to run your inference applications for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more.
Inf2 instances are powered by AWS Inferentia2, the second-generation AWS Inferentia chip. Compared to Inf1, Inf2 instances deliver 3x higher compute performance, 4x larger total accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between Inferentia chips. You can now efficiently and cost-effectively deploy models with hundreds of billions of parameters across multiple chips on Inf2 instances, as sketched below.
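As a hedged illustration of that scale-out capability, here is a minimal sketch of tensor-parallel inference across NeuronCores using the transformers-neuronx package from the AWS Neuron SDK (introduced below). The checkpoint path and token IDs are placeholders, and exact module paths and arguments may vary across Neuron SDK versions:

```python
# Sketch: sharding an LLM across Inferentia2 NeuronCores with tensor
# parallelism, following the transformers-neuronx Llama sample.
# Paths and token IDs are placeholders; APIs may differ by SDK version.
import torch
from transformers_neuronx.llama.model import LlamaForSampling

# tp_degree=24 splits the model across the 24 NeuronCores of an
# inf2.48xlarge (12 Inferentia2 chips x 2 cores), linked by NeuronLink.
neuron_model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split",  # placeholder: locally saved, split checkpoint
    batch_size=1,
    tp_degree=24,
    amp="f16",
)
neuron_model.to_neuron()  # compile and load the sharded model

# Generate from pre-tokenized input IDs (placeholder values).
input_ids = torch.tensor([[1, 15043, 29892, 590]])
with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256)
```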
The AWS Neuron SDK helps developers deploy models on AWS Inferentia chips (and train them on AWS Trainium chips). It integrates natively with frameworks such as PyTorch and TensorFlow, so you can continue using your existing workflows and application code while running on Inf2 instances.
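A minimal sketch of that workflow in PyTorch, assuming an Inf2 instance with the Neuron drivers and the torch-neuronx package installed:

```python
# Sketch: compiling a stock PyTorch model for Inferentia2 with torch-neuronx.
import torch
import torch_neuronx
from torchvision import models

# Load a standard model and put it in inference mode.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Ahead-of-time compile the model for the NeuronCores, using an example
# input to fix the tensor shapes.
example = torch.rand(1, 3, 224, 224)
neuron_model = torch_neuronx.trace(model, example)

# The compiled model is called like any TorchScript module.
output = neuron_model(example)
print(output.shape)  # torch.Size([1, 1000])
```

Because the traced model behaves like a regular TorchScript module, existing serving code can typically call it unchanged.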
Product details
| Instance Size | Inferentia2 Chips | Accelerator Memory (GB) | vCPU | Memory (GiB) | Local Storage | Inter-Chip Interconnect | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps) | On-Demand Price (per hour) | 1-Year Reserved Instance (per hour) | 3-Year Reserved Instance (per hour) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| inf2.xlarge | 1 | 32 | 4 | 16 | EBS Only | N/A | Up to 15 | Up to 10 | $0.76 | $0.45 | $0.30 |
| inf2.8xlarge | 1 | 32 | 32 | 128 | EBS Only | N/A | Up to 25 | 10 | $1.97 | $1.81 | $0.79 |
| inf2.24xlarge | 6 | 192 | 96 | 384 | EBS Only | Yes | 50 | 30 | $6.49 | $3.89 | $2.60 |
| inf2.48xlarge | 12 | 384 | 192 | 768 | EBS Only | Yes | 100 | 60 | $12.98 | $7.79 | $5.19 |
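For reference, a sketch of launching one of these instances programmatically with boto3; the AMI ID, key pair, and security group below are placeholders (a Deep Learning AMI with Neuron preinstalled is a common starting point):

```python
# Sketch: launching a single inf2.xlarge On-Demand instance with boto3.
# The AMI ID, key name, and security group are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    InstanceType="inf2.xlarge",
    ImageId="ami-0123456789abcdef0",            # placeholder Neuron DLAMI ID
    KeyName="my-key-pair",                      # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```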
Customer and partner testimonials
Here are some examples of how customers and partners have achieved their business goals with Amazon EC2 Inf2 instances.
Leonardo.ai
Our team at Leonardo leverages generative AI to enable creative professionals and enthusiasts to produce visual assets with unmatched quality, speed, and style consistency. Utilizing AWS Inf2, we are able to reduce our costs by 80% without sacrificing performance, fundamentally changing the value proposition we can offer customers and enabling our most advanced features at a more accessible price point. It also alleviates concerns around cost and capacity availability for our ancillary AI services, which are increasingly important as we grow and scale. It is a key enabling technology for us as we continue to push the envelope on what’s possible with generative AI, enabling a new era of creativity and expressive power for our users.
Pete Werner, Head of AI at Leonardo.ai
Runway
At Runway, our suite of AI Magic Tools enables our users to generate and edit content like never before. We are constantly pushing the boundaries of what is possible with AI-powered content creation, and as our AI models become more complex, the underlying infrastructure costs to run these models at scale can become expensive. Through our collaboration with Amazon EC2 Inf2 instances powered by AWS Inferentia, we’re able to run some of our models with up to 2x higher throughput than comparable GPU-based instances. This high-performance, low-cost inference enables us to introduce more features, deploy more complex models, and ultimately deliver a better experience for the millions of creators using Runway.
Cristóbal Valenzuela, Cofounder and CEO at Runway
Qualtrics
Qualtrics designs and develops experience management software.
At Qualtrics, our focus is building technology that closes experience gaps for customers, employees, brands, and products. To achieve that, we are developing complex multi-task, multi-modal DL models to launch new features, such as text classification, sequence tagging, discourse analysis, key-phrase extraction, topic extraction, clustering, and end-to-end conversation understanding. As we utilize these more complex models in more applications, the volume of unstructured data grows, and we need more performant inference-optimized solutions that can meet these demands, such as Inf2 instances, to deliver the best experiences to our customers. We are excited about the new Inf2 instances because they will not only allow us to achieve higher throughput while dramatically cutting latency, but also introduce features like distributed inference and enhanced dynamic input shape support, which will help us scale to meet deployment needs as we push toward larger, more complex models.
Aaron Colak, Head of Core Machine Learning at Qualtrics
Finch Computing
Finch Computing is a natural language technology company providing artificial intelligence applications for government, financial services, and data integrator clients.
To meet our customers’ needs for real-time natural language processing, we develop state-of-the-art DL models that scale to large production workloads. We have to provide low-latency transactions and achieve high throughputs to process global data feeds. We already migrated many production workloads to Inf1 instances and achieved an 80% reduction in cost over GPUs. Now, we are developing larger, more complex models that enable deeper, more insightful meaning from written text. A lot of our customers need access to these insights in real time, and the performance on Inf2 instances will help us deliver lower latency and higher throughput over Inf1 instances. With the Inf2 performance improvements and new Inf2 features, such as support for dynamic input sizes, we are improving our cost-efficiency, elevating the real-time customer experience, and helping our customers glean new insights from their data.
Franz Weckesser, Chief Architect at Finch Computing
Money Forward Inc.
Money Forward Inc. serves businesses and individuals with an open and fair financial platform. As part of this platform, HiTTO Inc., a Money Forward group company, offers an AI chatbot service, which uses tailored natural language processing (NLP) models to address the diverse needs of their corporate customers.
We launched a large-scale AI chatbot service on the Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. We were very pleased to see further performance improvements in our initial test results on Amazon EC2 Inf2 instances. Using the same custom NLP model, AWS Inf2 was able to further reduce the latency by 10x over Inf1. As we move to larger multibillion parameter models, Inf2 gives us the confidence that we can continue to provide our customers with a superior end-to-end user experience.
Takuya Nakade, CTO at Money Forward Inc.
Fileread
At Fileread.ai, we are building solutions that make interacting with your documents as easy as asking them questions, enabling users to find what they’re looking for across all their documents and get the right information faster. Since switching to the new Inf2 EC2 instances, we've seen a significant improvement in our NLP inference capabilities. The cost savings alone have been a game-changer for us, allowing us to allocate resources more efficiently without sacrificing quality. We reduced our inference latency by 33% while increasing throughput by 50%, delighting our customers with faster turnarounds. Our team has been blown away by the speed and performance of Inf2 compared to the older G5 instances, and it's clear that this is the future of deploying NLP models.
Daniel Hu, CEO at Fileread
Yaraku
At Yaraku, our mission is to build the infrastructure that helps people communicate across language barriers. Our flagship product, YarakuZen, enables anyone, from professional translators to monolingual individuals, to confidently translate and post-edit texts and documents. To support this process, we offer a wide range of sophisticated tools based on DL models, covering tasks such as translation, bitext word alignment, sentence segmentation, language modeling, and many others. By using Inf1 instances, we have been able to speed up our services to meet the increasing demand while reducing the inference cost by more than 50% compared to GPU-based instances. We are now moving into the development of next-generation larger models that will require the enhanced capabilities of Inf2 instances to meet demand while maintaining low latency. With Inf2, we will be able to scale up our models by 10x while maintaining similar throughput, allowing us to deliver even higher levels of quality to our customers.
Giovanni Giacomo, NLP Lead at Yaraku
Hugging Face
Hugging Face’s mission is to democratize good ML to help ML developers around the world solve real-world problems. And key to that is ensuring the latest and greatest models run as fast and efficiently as possible on the best ML chips in the cloud. We are incredibly excited about the potential for Inferentia2 to become the new standard way to deploy generative AI models at scale. With Inf1, we saw up to 70% lower cost than traditional GPU-based instances, and with Inf2 we have seen up to 8x lower latency for BERT-like Transformers compared to Inferentia1. With Inferentia2, our community will be able to easily scale this performance to LLMs at the 100B+ parameters scale, and to the latest diffusion and computer vision models as well.
PyTorch
PyTorch accelerates the path from research prototyping to production deployments for ML developers. We have collaborated with the AWS team to provide native PyTorch support for the new AWS Inferentia2 powered Amazon EC2 Inf2 instances. As more members of our community look to deploy large generative AI models, we are excited to partner with the AWS team to optimize distributed inference on Inf2 instances with high-speed NeuronLink connectivity between chips. With Inf2, developers using PyTorch can now easily deploy ultra-large LLMs and vision transformer models. Additionally, Inf2 instances bring other innovative capabilities to PyTorch developers, including efficient data types, dynamic shapes, custom operators, and hardware-optimized stochastic rounding, making them well-suited for wide adoption by the PyTorch community.
Weights & Biases
Weights & Biases (W&B) provides developer tools for ML engineers and data scientists to build better models faster. The W&B platform provides ML practitioners a wide variety of insights to improve the performance of models, including the utilization of the underlying compute infrastructure. We have collaborated with the AWS team to add support for Amazon Trainium and Inferentia2 to our system metrics dashboard, providing valuable data that is much needed during model experimentation and training. This enables ML practitioners to optimize their models to take full advantage of AWS’s purpose-built hardware to train their models faster and at lower cost.
Phil Gurbacki, VP of Product at Weights & Biases
OctoML
OctoML helps developers reduce costs and build scalable AI applications by packaging their DL models to run on high-performance hardware. We have spent the last several years building expertise on the best software and hardware solutions and integrating them into our platform. Our roots as chip designers and system hackers make AWS Trainium and Inferentia even more exciting for us. We see these chips as a key driving factor for the future of AI innovation on the cloud. The GA launch of Inf2 instances is especially timely, as we are seeing the emergence of popular LLMs as a key building block of next-generation AI applications. We are excited to make these instances available in our platform to help developers easily take advantage of their high performance and cost-saving benefits.
Jared Roesch, CTO and Cofounder at OctoML
Nextira
The historic challenge with LLMs, and more broadly with enterprise-level generative AI applications, is the cost associated with training and running high-performance DL models. Along with AWS Trainium, AWS Inferentia2 removes the financial compromises our customers make when they require high-performance training. Now, our customers looking for advantages in training and inference can achieve better results for less money. Trainium and Inferentia accelerate scale to meet even the most demanding DL requirements for today’s largest enterprises. Many Nextira customers running large AI workloads will benefit directly from these new chipsets, increasing efficiencies in cost savings and performance and leading to faster results in their market.
Jason Cutrer, founder and CEO at Nextira
Amazon CodeWhisperer
Amazon CodeWhisperer is an AI coding companion that generates real-time single-line or full-function code recommendations in your integrated development environment (IDE) to help you quickly build software.
With CodeWhisperer, we're improving software developer productivity by providing code recommendations using generative AI models. To develop highly effective code recommendations, we scaled our DL network to billions of parameters. Our customers need code recommendations in real time as they type, so low-latency responses are critical. Large generative AI models require high-performance compute to deliver response times in a fraction of a second. With Inf2, we're delivering the same latency as running CodeWhisperer on training-optimized GPU instances for large input and output sequences. Thus, Inf2 instances are helping us save cost and power while delivering the best possible experience for developers.
Doug Seven, General Manager at Amazon CodeWhisperer
Amazon Search
Amazon's product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world.
I am super excited about the Inf2 GA launch. The superior performance of Inf2, coupled with its ability to handle larger models with billions of parameters, makes it the perfect choice for our services and enables us to unlock new possibilities in terms of model complexity and accuracy. With the significant speedup and cost-efficiency offered by Inf2, integrating these instances into the Amazon Search serving infrastructure can help us meet the growing demands of our customers. We are planning to power our new shopping experiences with generative LLMs running on Inf2.
Trishul Chilimbi, VP at Amazon Search