Many organizations are building generative AI applications powered by large language models (LLMs) to boost productivity and build differentiated experiences. These LLMs are large and complex, and deploying them requires powerful computing resources and results in high inference costs. For businesses and researchers with limited resources, the high inference costs of generative AI models can be a barrier to entering the market, so more efficient and cost-effective solutions are needed. Most generative AI use cases involve human interaction, which requires AI accelerators that can deliver real-time response rates with low latency. At the same time, the pace of innovation in generative AI is increasing, and it’s becoming more challenging for developers and researchers to quickly evaluate and adopt new models to keep pace with the market.
One of the ways to get started with LLMs such as Llama and Mistral is by using Amazon Bedrock. However, customers who want to deploy LLMs in their own self-managed workflows for greater control and flexibility of underlying resources can use these LLMs optimized on top of AWS Inferentia2-powered Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances. In this blog post, we show how to use an Amazon EC2 Inf2 instance to cost-effectively deploy multiple industry-leading LLMs on AWS Inferentia2, a purpose-built AWS AI chip. The deployment lets customers quickly test the models and exposes an API for performance benchmarking and downstream application calls.
Model introduction
There are many popular open source LLMs to choose from. For this blog post, we review three different use cases based on model expertise, using Meta-Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.2, and CodeLlama-7b-Instruct-hf.
Model name | Publisher | Number of parameters | Release date | Model capabilities |
Meta-Llama-3-8B-Instruct | Meta | 8 billion | April 2024 | Language understanding, translation, code generation, reasoning, chat |
Mistral-7B-Instruct-v0.2 | Mistral AI | 7.3 billion | March 2024 | Language understanding, translation, code generation, reasoning, chat |
CodeLlama-7b-Instruct-hf | Meta | 7 billion | August 2023 | Code generation, code completion, chat |
Meta-Llama-3-8B-Instruct is a popular language model, released by Meta AI in April 2024. The Llama 3 model has improved pre-training, instant comprehension, output generation, coding, reasoning, and math skills. The Meta AI team says that Llama 3 has the potential to start a new wave of innovation in AI. The Llama 3 model is available in two publicly released versions, 8B and 70B. At the time of writing, Llama 3.1 instruction-tuned models are available in 8B, 70B, and 405B versions. In this blog post, we use the Meta-Llama-3-8B-Instruct model, but the same process can be followed for Llama 3.1 models.
Mistral-7B-Instruct-v0.2, released by Mistral AI in March 2024, marks a major milestone in the development of publicly available foundation models. With its strong performance, efficient architecture, and wide range of features, Mistral 7B v0.2 sets a new standard for user-friendly and powerful AI tools. The model handles tasks ranging from natural language processing to coding, which makes it a useful resource for researchers, developers, and businesses. In this blog post, we use the Mistral-7B-Instruct-v0.2 model, but the same process can be followed for the Mistral-7B-Instruct-v0.3 model.
CodeLlama-7b-Instruct-hf is part of the Code Llama collection of models published by Meta AI. It is an LLM that uses text prompts to generate code. Code Llama is aimed at code tasks, making developers’ workflows faster and more efficient and lowering the barrier to entry for people learning to code. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more robust and well-documented software.
Solution architecture
The solution uses a client-server architecture. The client uses the Hugging Face Chat UI to provide a chat page that can be accessed on a PC or mobile device. Server-side model inference uses Hugging Face’s Text Generation Inference, an efficient LLM inference framework that runs in a Docker container. We pre-compiled the models using Hugging Face’s Optimum Neuron and uploaded the compilation results to the Hugging Face Hub. We have also added a model switching mechanism to the Hugging Face Chat UI, which uses a scheduler to control which model is loaded in the Text Generation Inference container.
Solution highlights
- All components are deployed on a single-chip Inf2 instance (inf2.xlarge or inf2.8xlarge, referred to in this post as inf2.xl and inf2.8xl), and users can try multiple models on one instance.
- With the client-server architecture, users can flexibly replace either the client or the server side according to their actual needs. For example, the model can be deployed in Amazon SageMaker, and the frontend Chat UI can be deployed on a Node server. To simplify the demonstration, we deployed both the frontend and the backend on the same Inf2 server.
- Because the solution is built on publicly available frameworks, users can customize the frontend pages or the models according to their own needs.
- Text Generation Inference exposes an API, so users can quickly integrate the models through API calls.
- Deployment uses AWS CloudFormation, which makes the solution suitable for all types of businesses and for developers within the enterprise.
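To make the single-chip constraint behind the first highlight concrete, the following back-of-the-envelope sketch estimates the weight footprint of the three models. The parameter counts come from the model table earlier in this post, and 32 GB is the accelerator memory of one Inferentia2 chip. The 2 bytes per parameter (16-bit weights) is an assumption, and the estimate ignores the KV cache and runtime overhead, so treat it as a rough illustration of why the models are loaded one at a time rather than as a sizing guide.

```python
# Parameter counts are taken from the model table; everything else is an assumption.
MODEL_PARAMS_BILLIONS = {
    "Meta-Llama-3-8B-Instruct": 8.0,
    "Mistral-7B-Instruct-v0.2": 7.3,
    "CodeLlama-7b-Instruct-hf": 7.0,
}
BYTES_PER_PARAM = 2          # assumption: 16-bit (fp16/bf16) weights
ACCELERATOR_MEMORY_GB = 32   # accelerator memory of one Inferentia2 chip


def weight_footprint_gb(params_billions: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


footprints = {name: weight_footprint_gb(p) for name, p in MODEL_PARAMS_BILLIONS.items()}
all_models_gb = sum(footprints.values())

for name, gb in footprints.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
print(f"All three together: ~{all_models_gb:.1f} GB vs {ACCELERATOR_MEMORY_GB} GB available")
```

Under these assumptions each model fits on the chip by itself, but the three together do not, which is consistent with the model switching approach used in this solution.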
Main components
The following are the main components of the solution.
Hugging Face Optimum Neuron
Optimum Neuron is an interface between the Hugging Face Transformers library and the AWS Neuron SDK. It provides a set of tools for model loading, training, and inference on single- and multiple-accelerator setups for different downstream tasks. In this article, we mainly used Optimum Neuron’s export interface. To deploy a Hugging Face Transformers model on Neuron devices, the model needs to be compiled and exported to a serialized format before inference is performed. The export interface compiles the model ahead of time (AOT) using the Neuron compiler (neuronx-cc) and converts it into a serialized and optimized TorchScript module. This is shown in the following figure.
During the compilation process, we introduced a tensor parallelism mechanism to split the weights, data, and computations between the two NeuronCores. For more compilation parameters, see Export a model to Inferentia.
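The export step described above might look roughly like the following sketch. It assumes the `NeuronModelForCausalLM.from_pretrained(..., export=True)` interface with the argument names used by Optimum Neuron releases from around the time of this post (`num_cores`, `auto_cast_type`, `batch_size`, `sequence_length`); newer releases may name these differently. The helper names, batch size, sequence length, and cast type are illustrative assumptions and not the values the authors used. Only `num_cores=2` reflects the two-NeuronCore tensor parallelism mentioned in the text.

```python
def neuron_export_kwargs(num_cores=2, batch_size=1, sequence_length=4096, auto_cast_type="fp16"):
    """Collect compile-time settings for the Optimum Neuron export (values are assumptions)."""
    return {
        "export": True,                  # compile ahead of time instead of loading a compiled model
        "num_cores": num_cores,          # tensor parallelism across the two NeuronCores
        "batch_size": batch_size,        # static input shape baked into the compiled graph
        "sequence_length": sequence_length,
        "auto_cast_type": auto_cast_type,
    }


def compile_for_inferentia(model_id: str, output_dir: str, **overrides):
    """Compile a Hugging Face causal LM for Neuron devices and save the artifacts."""
    # Imported lazily: optimum-neuron requires the Neuron SDK, so this call is
    # expected to work only on an instance with Neuron devices (for example Inf2).
    from optimum.neuron import NeuronModelForCausalLM

    model = NeuronModelForCausalLM.from_pretrained(model_id, **neuron_export_kwargs(**overrides))
    model.save_pretrained(output_dir)
    return model

# Example call on an Inf2 instance (hypothetical output directory):
# compile_for_inferentia("meta-llama/Meta-Llama-3-8B-Instruct", "llama3-8b-neuron")
```

Because the input shapes are fixed at compile time, a different batch size or sequence length generally means compiling again, which is one reason to precompile and upload the results as described earlier.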
Hugging Face’s Text Generation Inference (TGI)
Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLMs. TGI provides high performance text generation services for the most popular publicly available foundation LLMs. Its main features are:
- Simple launcher that provides inference services for many LLMs
- Supports both generate and stream interfaces
- Token stream using server-sent events (SSE)
- Supports AWS Inferentia, Trainium, NVIDIA GPUs and other accelerators
Hugging Face Chat UI
Hugging Face Chat UI is an open-source chat tool built with SvelteKit that can be deployed to Cloudflare, Netlify, Node, and so on. It has the following main features:
- Pages can be customized
- Conversation history can be stored, with chat records kept in MongoDB
- Works on PC and mobile devices
- The backend can connect to Text Generation Inference and supports API interfaces such as Anthropic, Amazon SageMaker, and Cohere
- Compatible with various publicly available foundation models (Llama series, Mistral/Mixtral series, Falcon, and so on)
Thanks to the page customization capabilities of the Hugging Face Chat UI, we’ve added a model switching function, so users can switch between different models on the same EC2 Inf2 instance.
Solution deployment
- Before deploying the solution, make sure you have an inf2.xl or inf2.8xl usage quota in the us-east-1 (N. Virginia) or us-west-2 (Oregon) AWS Region. See the reference link for how to apply for a quota.
- Sign in to the AWS Management Console and switch the Region to us-east-1 (N. Virginia) or us-west-2 (Oregon) in the upper right corner of the console page.
- Enter CloudFormation in the service search box and choose Create stack.
- Select Choose an existing template, and then select Amazon S3 URL.
- If you plan to use an existing virtual private cloud (VPC), follow the steps under Use an existing VPC; if you plan to create a new VPC for the deployment, follow the steps under Create a new VPC.
  - Use an existing VPC.
    - Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_default_vpc_ubuntu22.04.yaml in Amazon S3 URL.
    - Stack name: Enter the stack name.
    - InstanceType: Select inf2.xl (lower cost) or inf2.8xl (better performance).
    - KeyPairName (optional): If you want to sign in to the Inf2 instance, enter the key pair name.
    - VpcId: Select the VPC.
    - PublicSubnetId: Select a public subnet.
    - VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.
    - Choose Next, then Next again. Choose Submit.
  - Create a new VPC.
    - Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_new_vpc_ubuntu22.04.yaml in Amazon S3 URL.
    - Stack name: Enter the stack name.
    - InstanceType: Select inf2.xl or inf2.8xl.
    - KeyPairName (optional): If you want to sign in to the Inf2 instance, enter the key pair name.
    - VpcId: Leave as New.
    - PublicSubnetId: Leave as New.
    - VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.
    - Choose Next, and then Next again. Then choose Submit.
- After creating the stack, wait for the resources to be created and started (about 15 minutes). After the stack status is displayed as CREATE_COMPLETE, choose Outputs. Choose the URL shown as the value for the key Public endpoint for the web server (close all VPN connections and firewall programs).
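For readers who prefer scripting over the console, the existing-VPC path above could be approximated with boto3 as in the following sketch. The template URL and the parameter keys (InstanceType, KeyPairName, VpcId, PublicSubnetId, VolumeSize) are taken from the steps above. The function names are hypothetical, the `Capabilities` list is an assumption in case the template creates IAM resources, and the exact strings the template accepts for each parameter are defined by the template itself and have not been verified here.

```python
TEMPLATE_URL = (
    "https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/"
    "launch_server_default_vpc_ubuntu22.04.yaml"
)


def build_stack_parameters(instance_type, vpc_id, subnet_id, volume_size_gb=80, key_pair_name=None):
    """Translate the console form fields into the Parameters list CloudFormation expects."""
    if volume_size_gb < 80:
        raise ValueError("VolumeSize must be at least 80 GB")
    values = {
        "InstanceType": instance_type,
        "VpcId": vpc_id,
        "PublicSubnetId": subnet_id,
        "VolumeSize": str(volume_size_gb),
    }
    if key_pair_name:  # optional in the console, so it is omitted here when not provided
        values["KeyPairName"] = key_pair_name
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


def deploy_stack(stack_name, parameters, region="us-east-1"):
    """Create the stack, wait for CREATE_COMPLETE, and return its outputs as a dict."""
    import boto3  # imported lazily so build_stack_parameters works without boto3 installed

    cfn = boto3.client("cloudformation", region_name=region)
    cfn.create_stack(
        StackName=stack_name,
        TemplateURL=TEMPLATE_URL,
        Parameters=parameters,
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # assumption: template may create IAM roles
    )
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)  # blocks until CREATE_COMPLETE
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}
```

The returned dictionary corresponds to the Outputs tab in the console, so the web server URL can be read from it instead of being copied by hand.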
User interface
After the solution is deployed, users can access the preceding URL on a PC or mobile phone. The page loads the Llama3-8B model by default. To switch models, open the settings menu, select the model to activate in the model list, and choose Activate. Switching requires reloading the new model into the Inferentia2 accelerator memory, which takes about 1 minute. During this time, users can check the loading status of the new model by choosing Retrieve model status. A status of Available indicates that the new model has been loaded successfully.
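A script that calls the API right after a model switch may want to wait for the reload in a similar way. The following sketch polls Text Generation Inference’s `/info` route until it reports the expected `model_id`. This is only one possible approach: the post does not describe how the Retrieve model status button is implemented, and whether `/info` stays reachable during a switch, and which `model_id` string it reports for the precompiled models, are assumptions to verify on your own deployment. The function names, base URL, and timings are hypothetical.

```python
import json
import time
import urllib.request


def fetch_tgi_info(base_url: str) -> dict:
    """Call the /info route; raises an OSError subclass while the server is unreachable."""
    with urllib.request.urlopen(f"{base_url}/info", timeout=5) as resp:
        return json.load(resp)


def wait_for_model(expected_model_id, fetch=fetch_tgi_info, base_url="http://localhost:8080",
                   timeout_s=180.0, interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the server reports expected_model_id; return False if the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if fetch(base_url).get("model_id") == expected_model_id:
                return True
        except OSError:
            pass  # connection refused or timed out: the server is still restarting or loading
        sleep(interval_s)
    return False
```

The `fetch`, `sleep`, and `clock` arguments are injectable so the polling logic can be exercised without a running server.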
The effects of the different models are shown in the following figure:
The following figures show the solution in a browser on a PC:
API interface and performance testing
The solution uses a Text Generation Inference server, which supports the /generate and /generate_stream interfaces and uses port 8080 by default. You can make API calls by replacing <IP> in the following examples with the IP address of your Inf2 instance.
The /generate interface returns the complete response to the client in one piece after all tokens have been generated on the server side.
curl <IP>:8080/generate \
    -X POST \
    -d '{"inputs": "Calculate the distance from Beijing to Shanghai"}' \
    -H 'Content-Type: application/json'
The /generate_stream interface reduces waiting time and improves the user experience by returning tokens one by one as they are generated, which helps when the model output is long.
curl <IP>:8080/generate_stream \
    -X POST \
    -d '{"inputs": "Write an essay on the mental health of elementary school students with no more than 300 words. "}' \
    -H 'Content-Type: application/json'
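The following sample code calls the /generate interface using the Python requests library.
import requests

# Replace <IP> with the IP address of your Inf2 instance.
url = "http://<IP>:8080/generate"
headers = {"Content-Type": "application/json"}
data = {
    "inputs": "Calculate the distance from Beijing to Shanghai",
    "parameters": {"max_new_tokens": 200},  # cap the length of the generated output
}
# Send the prompt and print the raw JSON response body.
response = requests.post(url, headers=headers, json=data)
print(response.text)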
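The /generate_stream interface can be consumed from Python in a similar way. The following sketch uses only the standard library (urllib instead of requests) and assumes each server-sent event arrives as a `data:` line whose JSON payload carries the new token under `token.text`, which is the event shape Text Generation Inference normally emits. The function names are hypothetical, and `<IP>` is a placeholder for the address of your server.

```python
import json
import urllib.request


def parse_sse_line(raw: bytes):
    """Return the token text carried by one SSE line, or None for blank or non-data lines."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    event = json.loads(line[len("data:"):])
    return event.get("token", {}).get("text")


def stream_generate(base_url: str, prompt: str, max_new_tokens: int = 200):
    """Yield tokens from /generate_stream as the server produces them."""
    payload = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}})
    request = urllib.request.Request(
        f"{base_url}/generate_stream",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:  # the HTTP response object iterates line by line
            text = parse_sse_line(raw)
            if text is not None:
                yield text

# Example usage (replace <IP> first):
# for token in stream_generate("http://<IP>:8080", "Write a haiku about the sea."):
#     print(token, end="", flush=True)
```

Printing each token as it arrives gives the same incremental effect the chat page shows, without waiting for the full response.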
Summary
References
About the authors
Zheng Zhang is a technical expert for Amazon Web Services machine learning products, focusing on Amazon Web Services-based accelerated computing and GPU instances. He has extensive experience in large-scale model training and inference acceleration in machine learning.
Bingyang Huang is a Go-To-Market Specialist for Accelerated Computing on the GCR SSO GenAI team. She has experience deploying AI accelerators in customers’ production environments. Outside of work, she enjoys watching films and exploring good food.
Tian Shi is a Senior Solution Architect at Amazon Web Services. He has extensive experience in cloud computing, data analysis, and machine learning and is currently dedicated to research and practice in the fields of data science, machine learning, and serverless computing. His translations include Machine Learning as a Service, DevOps Practices Based on Kubernetes, Practical Kubernetes Microservices, Prometheus Monitoring Practice, and CoreDNS Study Guide in the Cloud Native Era.
Chuan Xie is a Senior Solution Architect at Amazon Web Services Generative AI, responsible for the design, implementation, and optimization of generative artificial intelligence solutions based on the Amazon Cloud. He has many years of production and research experience in the communications, ecommerce, internet, and other industries, and extensive practical experience in data science, recommendation systems, LLM RAG, and other areas. He holds multiple patents for AI-related product technology inventions.