Tailoring foundation models for your business needs: A comprehensive guide to RAG, fine-tuning, and hybrid approaches | Amazon Web Services

Foundation models (FMs) have revolutionized AI capabilities, but adapting them to specific business needs can be challenging. Organizations often struggle to balance model performance, cost-efficiency, and the need for domain-specific knowledge. This blog post explores three powerful techniques for tailoring FMs to your unique requirements: Retrieval Augmented Generation (RAG), fine-tuning, and a hybrid approach that combines both methods. We dive into the advantages, limitations, and ideal use cases for each strategy.

AWS provides a suite of services and features to simplify the implementation of these techniques. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Knowledge Bases provides native support for RAG, streamlining the process of enhancing model outputs with domain-specific information. Amazon Bedrock also offers native features for model customization through continued pre-training and fine-tuning. In addition, you can use Amazon Bedrock Custom Model Import to bring and use your customized models alongside existing FMs through a single serverless, unified API. You can also use Amazon Bedrock Model Distillation to create smaller, faster, more cost-effective models that deliver use-case-specific accuracy comparable to the most advanced models in Amazon Bedrock.

For this post, we used Amazon SageMaker AI for the fine-tuning and hybrid approaches to maintain more control over the fine-tuning script and to try different fine-tuning methods. We used Amazon Bedrock Knowledge Bases for the RAG approach, as shown in Figure 1.

To help you make informed decisions, we provide ready-to-use code in our GitHub repo that uses these AWS services to experiment with RAG, fine-tuning, and hybrid approaches. You can evaluate their performance on your specific use case and dataset, and choose the approach that best customizes FMs for your business needs.

Figure 1: Architecture diagram for RAG, fine-tuning and hybrid approaches

Retrieval Augmented Generation

RAG is a cost-effective way to enhance AI capabilities by connecting existing models to external knowledge sources. For example, an AI-powered customer service chatbot using RAG can answer questions about current product features by first checking the product documentation knowledge base. When a customer asks a question, the system retrieves the specific details from the knowledge base before composing its response, helping to make sure that the information is accurate and up to date.

A RAG approach gives AI models access to external knowledge sources so they can produce better responses. It has two main steps: retrieval, which finds the relevant information from connected data sources, and generation, in which an FM composes an answer based on the retrieved information.
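As a quick illustration of the concept, the following minimal sketch calls Amazon Bedrock Knowledge Bases through the RetrieveAndGenerate API, which performs both steps in a single managed request. The accompanying repository implements retrieval and generation as separate steps instead, and the knowledge base ID and model ARN here are placeholders:

import boto3

# Bedrock Agent Runtime exposes the Knowledge Bases retrieval APIs
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def answer_with_rag(question: str, kb_id: str, model_arn: str) -> str:
    # RetrieveAndGenerate runs retrieval and generation in one managed call
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,   # placeholder: your knowledge base ID
                "modelArn": model_arn,      # placeholder: ARN of the generator model
            },
        },
    )
    return response["output"]["text"]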

Fine-tuning

Fine-tuning is a powerful way to customize FMs for specific tasks or domains using additional training data. In fine-tuning, you adjust the model's parameters using a smaller, labeled dataset relevant to the target domain.

For example, to build an AI-powered customer service chatbot, you can fine-tune an existing FM on your own dataset to handle questions about a company's product features. By training on historical customer interactions and product specifications, the fine-tuned model learns the context and the company's messaging tone, so it can provide more accurate responses.

If the company launches a new product, the model should be fine-tuned again with new data to update its knowledge and maintain relevance. Fine-tuning helps make sure that the model can deliver precise, context-aware responses. However, it requires more computational resources and time compared to RAG, because the model itself needs to be retrained with the new data.

Hybrid approach

The hybrid approach combines the strengths of RAG and fine-tuning to deliver highly accurate, context-aware responses. Let's consider an example: a company frequently updates the features of its products. The company wants to customize its FM using internal data, but keeping the model updated with changes in the product catalog is challenging. Because product features change monthly, keeping the model up to date through fine-tuning alone would be costly and time-consuming.

By adopting a hybrid approach, the company can reduce costs and improve efficiency. They can fine-tune the model every couple of months to keep it aligned with the company’s overall tone. Meanwhile, RAG can retrieve the latest product information from the company’s knowledge base, helping to make sure that responses are up-to-date. Fine-tuning the model also enhances RAG’s performance during the generation phase, leading to more coherent and contextually relevant responses. If you want to further improve the retrieval phase, you can customize the embedding model, use a different search algorithm, or explore other retrieval optimization techniques.

The following sections provide the background for dataset creation and the implementation of the three approaches.

Prerequisites

To deploy the solution, you need:

Dataset description

For the proof-of-concept, we created two synthetic datasets using Anthropic’s Claude 3 Sonnet on Amazon Bedrock.

Product catalog dataset

This dataset serves as the primary knowledge source for the Amazon Bedrock knowledge base. We created a product catalog of 15 fictitious manufacturing products by prompting Anthropic's Claude 3 Sonnet with example product catalogs. You should create your dataset in .txt format. The example dataset for this post has the following fields (an illustrative entry follows the list):

  • Product names
  • Product descriptions
  • Safety instructions
  • Configuration manuals
  • Operation instructions
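For illustration only, a single entry in the .txt catalog might look like the following; this product and all of its details are invented and are not part of the actual dataset:

Product name: PrecisionCut X200 industrial laser cutter
Product description: A compact laser cutter for sheet-metal fabrication with a 2 kW source and automated material feed.
Safety instructions: Wear certified laser safety goggles and never bypass the interlock on the access door.
Configuration manual: Connect the chiller, set the focal height for the material thickness, and calibrate the axes before first use.
Operation instructions: Load the cutting program, confirm material placement, and start the job from the operator panel.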

Training and test datasets

We use the same product catalog we created for the RAG approach as training data to run domain adaptation fine-tuning.

The test dataset consists of question-and-answer pairs about the product catalog dataset created earlier. We used the code in the Question-Answer Dataset section of the Jupyter notebook to generate the test dataset.
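The following is a minimal sketch of how such question-and-answer pairs can be generated by prompting Anthropic's Claude 3 Sonnet on Amazon Bedrock; the prompt wording and output handling are simplified assumptions rather than the exact code from the notebook:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # Claude 3 Sonnet on Amazon Bedrock

def generate_qa_pairs(product_text: str, num_pairs: int = 5) -> str:
    # Ask the model to produce Q&A pairs grounded in the product catalog text
    prompt = (
        f"Based only on the following product documentation, write {num_pairs} "
        "question-and-answer pairs, each as a 'Q:' line followed by an 'A:' line.\n\n"
        f"{product_text}"
    )
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]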

Implementation

We implemented three different approaches: RAG, fine-tuning, and hybrid. See the Readme file for instructions to deploy the whole solution.

RAG

The RAG approach uses Amazon Bedrock Knowledge Bases and consists of two main parts.

To set up the infrastructure:

  1. Update the config file with your required data (details in the Readme)
  2. Run the following commands in the infrastructure folder:
cd infrastructure
./prepare.sh
cdk bootstrap aws://<account-id>/<region>
cdk synth
cdk deploy --all

Context retrieval and response generation:

  1. The system finds relevant information by searching the knowledge base with the user’s question
  2. It then sends both the user's question and the retrieved information to the Meta Llama 3.1 8B model on Amazon Bedrock
  3. The LLM then generates a response based on the user's question and the retrieved information (a minimal sketch follows this list)
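The following minimal sketch of these two steps uses boto3 with the Amazon Bedrock Knowledge Bases Retrieve API and the Amazon Bedrock Converse API. The prompt wording is illustrative rather than the exact template from the repository, and the knowledge base ID is a placeholder you supply:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")
MODEL_ID = "meta.llama3-1-8b-instruct-v1:0"  # Meta Llama 3.1 8B Instruct on Amazon Bedrock

def retrieve_context(question: str, kb_id: str, top_k: int = 3) -> str:
    # Step 1: search the knowledge base for the most relevant chunks
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    return "\n\n".join(r["content"]["text"] for r in response["retrievalResults"])

def generate_answer(question: str, context: str) -> str:
    # Step 2: let the LLM answer using the retrieved context
    prompt = (
        "Answer the customer question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.1},
    )
    return response["output"]["message"]["content"][0]["text"]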

Fine-tuning

We used Amazon SageMaker AI JumpStart to fine-tune the Meta Llama 3.1 8B Instruct model using the domain adaptation method for five epochs (a sketch follows the list below). You can adjust the following parameters in the config.py file:

  • Fine-tuning method: You can change the fine-tuning method in the config file; the default is domain_adaptation.
  • Number of epochs: Adjust the number of epochs in the config file according to your data size.
  • Fine-tuning template: Change the template based on your use case. The current one prompts the LLM to answer a customer question.
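A minimal sketch of this fine-tuning job with the SageMaker Python SDK is shown below. The JumpStart model ID, the hyperparameter names, and the S3 training path are assumptions for illustration; the repository's config.py and fine-tuning script remain the reference implementation:

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Assumed JumpStart model ID for Meta Llama 3.1 8B Instruct
MODEL_ID = "meta-textgeneration-llama-3-1-8b-instruct"

estimator = JumpStartEstimator(
    model_id=MODEL_ID,
    environment={"accept_eula": "true"},  # Meta Llama models require accepting the EULA
    instance_type="ml.g5.12xlarge",
)

# Domain adaptation: train on raw domain text (the product catalog) rather than instruction pairs
estimator.set_hyperparameters(instruction_tuned="False", epoch="5")

# Hypothetical S3 location of the training data
estimator.fit({"training": "s3://<your-bucket>/train/"})

# Host the fine-tuned model on a real-time endpoint for inference
predictor = estimator.deploy()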

Hybrid

The hybrid approach combines RAG and fine-tuning, and uses the following high-level steps:

  1. The system retrieves the most relevant context for the user's question from the knowledge base
  2. The fine-tuned model generates an answer using the retrieved context

You can customize the prompt template in the config.py file.
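The following minimal sketch illustrates this flow, assuming the knowledge base ID from the RAG setup and a hypothetical SageMaker endpoint name for the fine-tuned model; the request payload follows the common JumpStart text-generation format, which you might need to adapt to your serving container:

import json
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")
sagemaker_runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "<your-finetuned-llama-endpoint>"  # hypothetical endpoint name

def hybrid_answer(question: str, kb_id: str) -> str:
    # Step 1: retrieve the most relevant chunks from the knowledge base
    retrieval = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
    )
    context = "\n\n".join(r["content"]["text"] for r in retrieval["retrievalResults"])

    # Step 2: the fine-tuned model on SageMaker generates the answer from the retrieved context
    prompt = (
        "Answer the customer question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 512, "temperature": 0.1}}
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    # The response shape depends on the serving container; adjust the parsing as needed
    return result[0]["generated_text"] if isinstance(result, list) else result.get("generated_text", "")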

Evaluation

For this example, we use three evaluation metrics to measure performance, along with a cost analysis. You can modify src/evaluation.py to implement your own metrics.
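As a starting point for your own metrics, the following sketch shows one way to compute the two similarity metrics described below: BERTScore through the bert-score package and an LLM evaluator score through a judge model on Amazon Bedrock. The judge prompt, the choice of judge model, and the score parsing are illustrative assumptions:

import re
import boto3
from bert_score import score as bert_score

bedrock_runtime = boto3.client("bedrock-runtime")
JUDGE_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # example judge model

def average_bertscore_f1(generated: list[str], reference: list[str]) -> float:
    # BERTScore compares contextual token embeddings; we keep the F1 measure
    _, _, f1 = bert_score(generated, reference, lang="en")
    return float(f1.mean())

def llm_evaluator_score(generated: str, reference: str) -> float:
    # Ask a judge model for a similarity score between 0 and 1
    prompt = (
        "Rate how well the candidate answer matches the reference answer "
        "on a scale from 0 to 1. Reply with the number only.\n\n"
        f"Reference answer: {reference}\nCandidate answer: {generated}"
    )
    response = bedrock_runtime.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    match = re.search(r"[01](?:\.\d+)?", text)
    return float(match.group()) if match else 0.0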

Each metric helps you understand different aspects of how well each of the approaches works:

  • BERTScore: BERTScore measures how similar each generated answer is to the correct answer using cosine similarity between contextual token embeddings. It calculates precision, recall, and the F1 measure; we used the F1 measure as the evaluation score.
  • LLM evaluator score: We use different language models from Amazon Bedrock to score the responses from the RAG, fine-tuning, and hybrid approaches. Each evaluator receives both the correct answer and the generated answer and assigns a score between 0 and 1 (closer to 1 indicates higher similarity) to each generated answer. We then calculate the final score by averaging all the evaluator scores. The process is shown in the following figure.

Figure 2: LLM evaluator method

  • Inference latency: Response times matter in applications such as chatbots, so depending on your use case, this metric might weigh heavily in your decision. For each approach, we averaged the time it took to receive a full response for each sample.
  • Cost analysis: To perform a full cost analysis, we made the following assumptions:
    • We used one OpenSearch compute unit (OCU) for indexing and one for search for the document index used in RAG. See OpenSearch Serverless pricing for more details.
    • We assume an application with 1,000 users, each making 10 requests per day with an average of 2,000 input tokens and 1,000 output tokens per request. See Amazon Bedrock pricing for more details.
    • We used an ml.g5.12xlarge instance for fine-tuning and for hosting the fine-tuned model. The fine-tuning job took 15 minutes to complete. See SageMaker AI pricing for more details.
    • For fine-tuning and the hybrid approach, we assume that the model instance is up 24/7, which might vary according to your use case.
    • The cost calculation is done for one month.

Based on those assumptions, the cost associated with each of the three approaches is calculated as follows (a worked cost example follows the list):

  • For RAG: 
    • OpenSearch Serverless monthly costs = Cost of 1 OCU per hour * 2 OCUs * 24 hours * 30 days
    • Total invocation cost for Meta Llama 3.1 8B = 1,000 users * 10 requests * (price per input token * 2,000 + price per output token * 1,000) * 30 days
  • For fine-tuning:
    • (Number of minutes used for the fine-tuning job / 60) * Hourly cost of an ml.g5.12xlarge instance
    • Hourly cost of an ml.g5.12xlarge instance hosting * 24 hours * 30 days
  • For hybrid:
    • OpenSearch Serverless monthly costs = Cost of 1 OCU per hour * 2 OCUs * 24 hours * 30 days
    • (Number of minutes used for the fine-tuning job / 60) * Hourly cost of an ml.g5.12xlarge instance
    • Hourly cost of an ml.g5.12xlarge instance hosting * 24 hours * 30 days
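The following sketch reproduces these estimates as a worked example. The rates are the approximate US East prices assumed for this post (about $0.24 per OCU-hour for OpenSearch Serverless, $0.22 per million input and output tokens for Meta Llama 3.1 8B on Amazon Bedrock, and $7.09 per hour for an ml.g5.12xlarge instance); check the current pricing pages before reusing them. Small differences from the results table come from rounding the OpenSearch cost to roughly $350:

# Worked monthly cost example under the stated assumptions (prices are approximate and may change)
OCU_PER_HOUR = 0.24            # OpenSearch Serverless, per OCU-hour (assumed)
LLAMA_PER_M_TOKENS = 0.22      # Meta Llama 3.1 8B on Bedrock, per 1M input or output tokens (assumed)
G5_12XLARGE_PER_HOUR = 7.09    # SageMaker ml.g5.12xlarge, per hour (assumed)

HOURS_PER_MONTH = 24 * 30
requests_per_month = 1_000 * 10 * 30          # users * requests per day * days
input_tokens = requests_per_month * 2_000
output_tokens = requests_per_month * 1_000

opensearch = 2 * OCU_PER_HOUR * HOURS_PER_MONTH                                # ≈ $346 (rounded to ~$350 in the table)
bedrock_inference = (input_tokens + output_tokens) / 1e6 * LLAMA_PER_M_TOKENS  # ≈ $198
finetune_job = (15 / 60) * G5_12XLARGE_PER_HOUR                                # ≈ $1.77
hosting = G5_12XLARGE_PER_HOUR * HOURS_PER_MONTH                               # ≈ $5,105

print(f"RAG:         ${opensearch + bedrock_inference:,.0f} per month")        # ≈ $544
print(f"Fine-tuning: ${finetune_job + hosting:,.0f} per month")                # ≈ $5,107
print(f"Hybrid:      ${opensearch + finetune_job + hosting:,.0f} per month")   # ≈ $5,452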

Results

You can find detailed evaluation results in two places in the code repository. The individual scores for each sample are in the JSON files under data/output, and a summary of the results is in summary_results.csv in the same folder.

The results in the following table show:

  • How each approach (RAG, fine-tuning, and hybrid) performs
  • Their scores from both BERTScore and LLM evaluators
  • The cost analysis for each method calculated for the US East region
| Approach | Average BERTScore | Average LLM evaluator score | Average inference time (seconds) | Cost per month (US East Region) |
|---|---|---|---|---|
| RAG | 0.8999 | 0.8200 | 8.336 | ~$350 + $198 ≈ $548 |
| Fine-tuning | 0.8660 | 0.5556 | 4.159 | ~$1.77 + $5,105 ≈ $5,107 |
| Hybrid | 0.8908 | 0.8556 | 17.700 | ~$350 + $1.77 + $5,105 ≈ $5,457 |

Note that the costs for both the fine-tuning and hybrid approaches can decrease significantly, depending on the traffic pattern, if you configure the SageMaker real-time inference endpoint to scale down to zero instances when not in use.

Clean up

Follow the cleanup section in the Readme file to avoid paying for unused resources.

Conclusion

In this post, we showed you how to implement and evaluate three powerful techniques for tailoring FMs to your business needs: RAG, fine-tuning, and a hybrid approach combining both methods. We provided ready-to-use code to help you experiment with these approaches and make informed decisions based on your specific use case and dataset.

The results in this example were specific to the dataset that we used. For that dataset, RAG outperformed fine-tuning and achieved comparable results to the hybrid approach with a lower cost, but fine-tuning led to the lowest latency. Your results will vary depending on your dataset.

We encourage you to test these approaches using our code as a starting point:

  1. Add your own datasets in the data folder
  2. Fill out the config.py file
  3. Follow the rest of the readme instructions to run the full evaluation

About the Authors

Idil Yuksel is a Working Student Solutions Architect at AWS, pursuing her MSc. in Informatics with a focus on machine learning at the Technical University of Munich. She is passionate about exploring application areas of machine learning and natural language processing. Outside of work and studies, she enjoys spending time in nature and practicing yoga.

Karim Akhnoukh is a Senior Solutions Architect at AWS working with customers in the financial services and insurance industries in Germany. He is passionate about applying machine learning and generative AI to solve customers’ business challenges. Besides work, he enjoys playing sports, aimless walks, and good quality coffee.