Automating model customization in Amazon Bedrock with AWS Step Functions workflow

Large language models have become indispensable in generating intelligent and nuanced responses across a wide variety of business use cases. However, enterprises often have unique data and use cases that require customizing large language models beyond their out-of-the-box capabilities. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. To enable secure and scalable model customization, Amazon Web Services (AWS) announced support for customizing models in Amazon Bedrock at AWS re:Invent 2023. This allows customers to further pre-train selected models using their own proprietary data to tailor model responses to their business context. The quality of the custom model depends on multiple factors including the training data quality and hyperparameters used to customize the model. This requires customers to perform multiple iterations to develop the best customized model for their requirement.

To address this challenge, AWS announced native integration between Amazon Bedrock and AWS Step Functions. This empowers customers to orchestrate repeatable and automated workflows for customizing Amazon Bedrock models.

In this post, we will demonstrate how Step Functions can help overcome key pain points in model customization. You will learn how to configure a sample workflow that orchestrates model training, evaluation, and monitoring. Automating these complex tasks through a repeatable framework reduces development timelines and unlocks the full value of Amazon Bedrock for your unique needs.

Architecture

Architecture Diagram

We will use a summarization use case using Cohere Command Light Model in Amazon Bedrock for this demonstration. However, this workflow can be used for the summarization use case for other models by passing the base model ID and the required hyperparameters and making model-specific minor changes in the workflow. See the Amazon Bedrock user guide for the full list of supported models for customization. All the required infrastructure will be deployed using the AWS Serverless Application Model (SAM).

The following is a summary of the functionality of the architecture:

User uploads the training data in JSON Line into an Amazon Simple Storage Service (Amazon S3) training data bucket and the validation, reference inference data into the validation data bucket. This data must be in the JSON Line format.
The Step Function CustomizeBedrockModel state machine is started with the input parameters such as the model to customize, hyperparameters, training data locations, and other parameters discussed later in this post.
- The workflow invokes the Amazon Bedrock CreateModelCustomizationJob API synchronously to fine tune the base model with the training data from the S3 bucket and the passed-in hyperparameters.
- After the custom model is created, the workflow invokes the Amazon Bedrock CreateProvisionedModelThroughput API to create a provisioned throughput with no commitment.
- The parent state machine calls the child state machine to evaluate the performance of the custom model with respect to the base model.
- The child state machine invokes the base model and the customized model provisioned throughput with the same validation data from the S3 validation bucket and stores the inference results into the inference bucket.
- An AWS Lambda function is called to evaluate the quality of the summarization done by custom model and the base model using the BERTScore metric. If the custom model performs worse than the base model, the provisioned throughput is deleted.
- A notification email is sent with the outcome.

Prerequisites

Create an AWS account if you do not already have one.
Access to the AWS account through the AWS Management Console and the AWS Command Line Interface (AWS CLI). The AWS Identity and Access Management (IAM) user that you use must have permissions to make the necessary AWS service calls and manage AWS resources mentioned in this post. While providing permissions to the IAM user, follow the principle of least-privilege.
Git Installed.
AWS Serverless Application Model (AWS SAM) installed.
Docker must be installed and running.
You must enable the Cohere Command Light Model access in the Amazon Bedrock console in the AWS Region where you’re going to run the AWS SAM template. We will customize the model in this demonstration. However, the workflow can be extended with minor model-specific changes to support customization of other supported models. See the Amazon Bedrock user guide for the full list of supported models for customization. You must have no commitment model units reserved for the base model to run this demo.

Demo preparation

The resources in this demonstration will be provisioned in the US East (N. Virginia) AWS Region (us-east-1). We will walk through the following phases to implement our model customization workflow:

Deploy the solution using the AWS SAM template
Upload proprietary training data to the S3 bucket
Run the Step Functions workflow and monitor
View the outcome of training the base foundation model
Clean up

Step 1: Deploy the solution using the AWS SAM template

Refer to the GitHub repository for latest instruction. Run the below steps to deploy the Step Functions workflow using the AWS SAM template. You can

Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:

git clone https://github.com/aws-samples/amazon-bedrock-model-customization.git

Change directory to the solution directory:

cd amazon-bedrock-model-customization

Run the build.sh to create the container image.

When prompted, enter the following parameter values:

image_name=model-evaluation
repo_name=bedrock-model-customization
aws_account={your-AWS-account-id}
aws_region={your-region}

From the command line, use AWS SAM to deploy the AWS resources for the pattern as specified in the template.yml file:

Provide the below inputs when prompted:

Enter a stack name.
Enter us-east-1 or your AWS Region where you enabled Amazon Bedrock Cohere Command Light Model.
Enter SenderEmailId - Once the model customization is complete email will come from this email id. You need to have access to this mail id to verify the ownership.
Enter RecipientEmailId - User will be notified to this email id.
Enter ContainerImageURI - ContainerImageURI is available from the output of the `bash build.sh` step.
Keep default values for the remaining fields.

Note the outputs from the SAM deployment process. These contain the resource names and/or ARNs which are used in the subsequent steps.

Step 2: Upload proprietary training data to the S3 bucket

Our proprietary training data will be uploaded to the dedicated S3 bucket created in the previous step, and used to fine-tune the Amazon Bedrock Cohere Command Light model. The training data needs to be in JSON Line format with every line containing a valid JSON with two attributes: prompt and completion.

I used this public dataset from HuggingFace and converted it to JSON Line format.

Upload the provided training data files to the S3 bucket using the command that follows. Replace TrainingDataBucket with the value from the sam deploy --guided output. Update your-region with the Region that you provided while running the SAM template.

aws s3 cp training-data.jsonl s3://{TrainingDataBucket}/training-data.jsonl --region {your-region}

Upload the validation-data.json file to the S3 bucket using the command that follows. Replace ValidationDataBucket with the value from the sam deploy --guided output. Update your-region with the Region that you provided while running the SAM template:

aws s3 cp validation-data.json s3://{ValidationDataBucket}/validation-data.json --region {your-region}

Upload the reference-inference.json file to the S3 bucket using the command that follows. Replace ValidationDataBucket with the value from the sam deploy --guided output. Update your-region with the region that you provided while running the SAM template.

aws s3 cp reference-inference.json s3://{ValidationDataBucket}/reference-inference.json --region {your-region}

You should have also received an email for verification of the sender email ID. Verify the email ID by following the instructions given in the email.

Email Address Verification Request

Step 3: Run the Step Functions workflow and monitor

We will now start the Step Functions state machine to fine tune the Cohere Command Light model in Amazon Bedrock based on the training data uploaded into the S3 bucket in the previous step. We will also pass the hyperparameters. Feel free to change them.

Run the following AWS CLI command to start the Step Functions workflow. Replace StateMachineCustomizeBedrockModelArn and TrainingDataBucket with the values from the sam deploy --guided output. Replace UniqueModelName and UniqueJobName with unique values. Change the values of the hyperparameters based on the selected model. Update your-region with the region that you provided while running the SAM template.

aws stepfunctions start-execution --state-machine-arn "{StateMachineCustomizeBedrockModelArn}" --input "{\"BaseModelIdentifier\": \"cohere.command-light-text-v14:7:4k\",\"CustomModelName\": \"{UniqueModelName}\",\"JobName\": \"{UniqueJobName}\", \"HyperParameters\": {\"evalPercentage\": \"20.0\", \"epochCount\": \"1\", \"batchSize\": \"8\", \"earlyStoppingPatience\": \"6\", \"earlyStoppingThreshold\": \"0.01\", \"learningRate\": \"0.00001\"},\"TrainingDataFileName\": \"training-data.jsonl\"}" --region {your-region}

Example output:

{
"executionArn": "arn:aws:states:{your-region}:123456789012:execution:{stack-name}-wcq9oavUCuDH:2827xxxx-xxxx-xxxx-xxxx-xxxx6e369948",
"startDate": "2024-01-28T08:00:26.030000+05:30"
}

The foundation model customization and evaluation might take 1 hour to 1.5 hours to complete! You will get a notification email after the customization is done.

Run the following AWS CLI command or sign in to the AWS Step Functions console to check the Step Functions workflow status. Wait until the workflow completes successfully. Replace the executionArn from the previous step output and update your-region.

aws stepfunctions describe-execution --execution-arn {executionArn} --query status --region {your-region}

Step 4: View the outcome of training the base foundation model

After the Step Functions workflow completes successfully, you will receive an email with the outcome of the quality of the customized model. If the customized model isn’t performing better than the base model, the provisioned throughput will be deleted. The following is a sample email:

Model Customization Complete

If the quality of the inference response is not satisfactory, you will need to retrain the base model based on the updated training data or hyperparameters.

See the ModelInferenceBucket for the inferences generated from both the base foundation model and custom model.

Step 5: Cleaning up

Properly decommissioning provisioned AWS resources is an important best practice to optimize costs and enhance security posture after concluding proofs of concept and demonstrations. The following steps will remove the infrastructure components deployed earlier in this post:

Delete the Amazon Bedrock provisioned throughput of the custom mode. Ensure that the correct ProvisionedModelArn is provided to avoid an accidental unwanted delete. Also update your-region.

aws bedrock delete-provisioned-model-throughput --provisioned-model-id {ProvisionedModelArn} --region {your-region}

Delete the Amazon Bedrock custom model. Ensure that the correct CustomModelName is provided to avoid accidental unwanted delete. Also update your-region.

aws bedrock delete-custom-model --model-identifier {CustomModelName} --region {your-region}

Delete the content in the S3 bucket using the following command. Ensure that the correct bucket name is provided to avoid accidental data loss:

aws s3 rm s3://{TrainingDataBucket} --recursive --region {your-region}
aws s3 rm s3://{CustomizationOutputBucket} --recursive --region {your-region}
aws s3 rm s3://{ValidationDataBucket} --recursive --region {your-region}
aws s3 rm s3://{ModelInferenceBucket} --recursive --region {your-region}

To delete the resources deployed to your AWS account through AWS SAM, run the following command:

Conclusion

This post outlined an end-to-end workflow for customizing an Amazon Bedrock model using AWS Step Functions as the orchestration engine. The automated workflow trains the foundation model on customized data and tunes hyperparameters. It then evaluates the performance of the customized model against the base foundation model to determine the efficacy of the training. Upon completion, the user is notified through email of the training results.

Customizing large language models requires specialized machine learning expertise and infrastructure. AWS services like Amazon Bedrock and Step Functions abstract these complexities so enterprises can focus on their unique data and use cases. By having an automated workflow for customization and evaluation, customers can customize models for their needs more quickly and with fewer the operational challenges.

Further study

About the Author

Biswanath Mukherjee is a Senior Solutions Architect at Amazon Web Services. He works with large strategic customers of AWS by providing them technical guidance to migrate and modernize their applications on AWS Cloud. With his extensive experience in cloud architecture and migration, he partners with customers to develop innovative solutions that leverage the scalability, reliability, and agility of AWS to meet their business needs. His expertise spans diverse industries and use cases, enabling customers to unlock the full potential of the AWS cloud.

Automating model customization in Amazon Bedrock with AWS Step Functions workflow | Amazon Web Services