Build Agentic Workflows with OpenAI GPT OSS on Amazon SageMaker AI and Amazon Bedrock AgentCore

OpenAI has released two open-weight models, gpt-oss-120b (117 billion parameters) and gpt-oss-20b (21 billion parameters), both built with a Mixture of Experts (MoE) design and a 128K context window. These are leading open-weight models according to Artificial Analysis benchmarks, and they excel at reasoning and agentic workflows. With Amazon SageMaker AI, you can fine-tune or customize models and deploy them with your choice of framework through a fully managed service. Amazon SageMaker Inference gives you the flexibility to bring your own inference code and framework without having to build and maintain your own clusters.

Although large language models (LLMs) excel at understanding language and generating content, building real-world agentic applications requires complex workflow management, tool calling capabilities, and context management. Multi-agent architectures address these challenges by breaking down complex systems into specialized components, but they introduce new complexities in agent coordination, memory management, and workflow orchestration.

In this post, we show how to deploy the gpt-oss-20b model to SageMaker managed endpoints and demonstrate a practical stock analyzer agent example with LangGraph, a powerful graph-based framework that handles state management, coordinated workflows, and persistent memory systems. We then deploy our agents to Amazon Bedrock AgentCore, a unified orchestration layer that abstracts away infrastructure and lets you securely deploy and operate AI agents at scale.

Solution overview

In this solution, we build an agentic stock analyzer with the following key components:

  • The GPT OSS 20B model deployed to a SageMaker endpoint using vLLM, an open source serving framework for LLMs
  • LangGraph to build a multi-agent orchestration framework
  • Amazon Bedrock AgentCore to deploy the agents

The following diagram illustrates the solution architecture.

The architecture implements a multi-agent workflow hosted on Amazon Bedrock AgentCore Runtime running on AWS. A user submits a query, which is handled by a pipeline of specialized agents (a Data Gathering Agent, a Stock Performance Analyzer Agent, and a Stock Report Generation Agent), each responsible for a distinct part of the stock evaluation process.

These agents collaborate within Amazon Bedrock AgentCore Runtime, and when language understanding or generation is required, they invoke a GPT OSS model hosted on SageMaker AI. The model processes the input and returns structured outputs that inform agent actions, enabling a fully serverless, modular, and scalable agentic system using open-source models.

Prerequisites

  1. Ensure that you have the required quota for G6e instances to deploy the model. If you do not, request a quota increase through the Service Quotas console.
  2. If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
  3. Ensure that your IAM role has the required permissions to deploy SageMaker models and endpoints. For more information, see How Amazon SageMaker AI works with IAM in the SageMaker Developer Guide.

Deploy GPT-OSS models to SageMaker Inference

Customers who want to customize their models and frameworks can use self-managed deployments, but doing so requires access to GPUs, serving frameworks, load balancers, and infrastructure setup. SageMaker AI provides a fully managed hosting platform that provisions the infrastructure with the necessary drivers, downloads the models, and deploys them.

OpenAI’s GPT-OSS models are launched with a 4-bit quantization scheme (MXFP4), enabling fast inference while keeping resource usage low. These models can run on P5 (H100), P6 (H200), P4 (A100), and G6e (L40S) instances. The GPT-OSS models are sparse MoE architectures with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts with no shared expert. Applying MXFP4 to the MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single H100 GPU.
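
As a rough back-of-envelope check, these sizes are consistent with the architecture. The following sketch assumes MXFP4 stores about 4.25 bits per MoE weight (4-bit values plus a shared per-block scale) and that MoE weights dominate the parameter counts; the exact storage layout here is an assumption for illustration, not official math.

# Back-of-envelope model-size estimate; an illustrative sketch, not official math.
# Assumes ~4.25 bits per MoE weight under MXFP4 and MoE-dominated parameter counts.
BITS_PER_MXFP4_WEIGHT = 4.25  # 4-bit values plus a shared per-block scale

for name, params_billions in [("gpt-oss-120b", 117), ("gpt-oss-20b", 21)]:
    approx_gb = params_billions * 1e9 * BITS_PER_MXFP4_WEIGHT / 8 / 1e9
    print(f"{name}: ~{approx_gb:.0f} GB of MoE weights")

# gpt-oss-120b: ~62 GB (close to the 63 GB cited above)
# gpt-oss-20b: ~11 GB (non-MoE weights kept in 16-bit bring this nearer 14 GB)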

Deploying these models effectively requires a high-performance serving framework such as vLLM. To deploy the model, we build a vLLM container, using the latest version that supports GPT OSS models, for use on SageMaker AI.

You can use the following Docker file and script to build the container and push it to a private Amazon Elastic Container Registry (Amazon ECR) repository. The recommended approach is to do this directly from Amazon SageMaker Studio, which provides a managed JupyterLab environment with AWS CLI access where you can build and push images to Amazon ECR as part of your SageMaker workflow. Alternatively, you can perform the same steps on an Amazon Elastic Compute Cloud (Amazon EC2) instance with Docker installed.

After you have built and pushed the container to Amazon ECR, you can open Amazon SageMaker Studio by going to the SageMaker AI console, as shown in the following screenshot.

You can then create a Jupyter space or use an existing one to launch JupyterLab and run notebooks.

Clone the following notebook and run “Option 3: Deploying from HF using BYOC.” Update the required parameters in the notebook, such as the inference image, with your container image URI. We also provide the necessary environment variables, as shown in the following code.

import json
import sagemaker

inference_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:v0.10.0-gpt-oss"
instance_type = "ml.g6e.4xlarge"
num_gpu = 1
model_name = sagemaker.utils.name_from_base("model-byoc")
endpoint_name = model_name
inference_component_name = f"ic-{model_name}"
config = {
    "OPTION_MODEL": "openai/gpt-oss-20b",
    "OPTION_SERVED_MODEL_NAME": "model",
    "OPTION_TENSOR_PARALLEL_SIZE": json.dumps(num_gpu),
    "OPTION_ASYNC_SCHEDULING": "true",
}

After you set up the deployment configuration, you can deploy to SageMaker AI using the following code:

from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env=config,
    role=role,  # SageMaker execution role created or retrieved earlier in the notebook
    name=model_name,
)

# deploy() returns a Predictor; we capture it as `llm` for the inference calls below
llm = lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    resources=ResourceRequirements(
        requests={
            "num_accelerators": num_gpu,  # GPUs reserved for this inference component
            "memory": 1024 * 5,           # host memory in MB
            "copies": 1,                  # number of model copies
        }
    ),
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

You can now run an inference example:

payload={
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}
res = llm.predict(payload)
print("-----\n" + res["choices"][0]["message"]["content"] + "\n-----\n")
print(res["usage"])

-----
Here are some of the must‑see spots in London — a mix of iconic landmarks, world‑class museums, and vibrant neighborhoods:

| # | Place | Why It’s Popular |
|---|-------|------------------|
| 1 | **Buckingham Palace** | The Queen’s official London residence – watch the Changing of the Guard. |
| 2 | **The Tower of London & Tower Bridge** | Historic castle, Crown Jewels, and the iconic bridge with glass floors. |
| 3 | **The British Museum** | World‑famous collection from the Rosetta Stone to Egyptian mummies (free entry). |
| 4 | **The Houses of Parliament & Big Ben** | The classic symbol of London’s politics and architecture. |
| 5 | **The National Gallery (Tate Britain)** | Home to masterpieces from Van Gogh to Turner. |
| 6 | **Buckinghamshire Gardens (Kew Gardens)** | Stunning botanical gardens with a glasshouse and the Horniman Insect Zoo. |
| 7 | **Camden Market** | Eclectic stalls, street food, music and vintage fashion. |
| 8 | **Covent Garden** | Lively piazza with street performers, boutique shops, and the Royal Opera House. |
| 9 | **West End Theatres** | Theatre district famous for grand productions (musicals, dramas). |
|10 | **The Shard** | Skyscraper with panoramic 360° views of London. |
|11 | **St. Paul’s Cathedral** | Massive dome, stunning interior and a climb up the Whispering Gallery. |
|12 | **The Tate Modern** | Contemporary art museum set in a former power station. |
|13 | **The Victoria & Albert Museum** | Design and fashion, costume, and jewelry collections. |
|14 | **Hyde Park & Kensington Gardens** | Huge green spaces with Serpentine Lake, Speaker’s Corner and Speakers' Corner. |
|15 | **Oxford Street & Regent Street** | Prime shopping streets for fashion, flagship stores, and historic architecture. |

These spots cover history, culture, shopping, and leisure—perfect for a first visit or a weekend escape in London!
-----
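
If you prefer the low-level AWS SDK over the SageMaker Python SDK predictor, a roughly equivalent call through the runtime client looks like the following sketch, which reuses the endpoint and inference component names defined above:

import json

import boto3

smr = boto3.client("sagemaker-runtime")

# Inference-component-based endpoints take the component name as a request field
resp = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,
    ContentType="application/json",
    Body=json.dumps({
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ]
    }),
)
result = json.loads(resp["Body"].read())
print(result["choices"][0]["message"]["content"])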

Use LangGraph to build a stock analyzer agent

For our stock analyzer multi-agent system, we use LangGraph to orchestrate the workflow. The Jupyter notebook for the code is located in this GitHub repository. The system comprises three specialized tools that work together to analyze stocks comprehensively (a sketch of how they are wired into a graph follows the list):

  • The gather_stock_data tool collects comprehensive stock data for a given ticker symbol, including current price, historical performance, financial metrics, and market data. It returns formatted information covering price history, company fundamentals, trading metrics, and recent news headlines.
  • The analyze_stock_performance tool performs detailed technical and fundamental analysis of stock data, calculating metrics like price trends, volatility, and overall investment scores. It evaluates multiple factors, including P/E ratios, profit margins, and dividend yields, to provide a comprehensive performance analysis.
  • The generate_stock_report tool creates professional PDF reports from the gathered stock data and analysis, automatically uploading them to Amazon Simple Storage Service (Amazon S3) with organized date-based folders.
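
The following is a minimal sketch of that wiring, assuming the three tool functions above are already defined; the state fields and node names are illustrative, and the complete implementation is in the repository notebook.

from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class StockState(TypedDict):
    prompt: str
    stock_data: str
    analysis: str
    report: str

def gather_node(state: StockState) -> dict:
    # Each tool can invoke the SageMaker-hosted GPT OSS model internally
    # whenever language understanding or generation is needed
    return {"stock_data": gather_stock_data(state["prompt"])}

def analyze_node(state: StockState) -> dict:
    return {"analysis": analyze_stock_performance(state["stock_data"])}

def report_node(state: StockState) -> dict:
    return {"report": generate_stock_report(state["analysis"])}

builder = StateGraph(StockState)
builder.add_node("gather", gather_node)
builder.add_node("analyze", analyze_node)
builder.add_node("report", report_node)
builder.add_edge(START, "gather")
builder.add_edge("gather", "analyze")
builder.add_edge("analyze", "report")
builder.add_edge("report", END)
stock_graph = builder.compile()

Calling stock_graph.invoke({"prompt": "Analyze SIM_STOCK Stock for Investment purposes."}) then runs the three stages in order and returns the accumulated state, including the final report location.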

For local testing, you can use a simplified version of the system by importing the necessary functions from your local script. For example:

from langgraph_stock_local import langgraph_stock_sagemaker
# Test the agent locally
result = langgraph_stock_sagemaker({
    "prompt": "Analyze SIM_STOCK Stock for Investment purposes."
})
print(result)

This way, you can iterate quickly on your agent’s logic before deploying it to a scalable platform, making sure each component functions correctly and the overall workflow produces the expected results for different types of stocks.

Deploy to Amazon Bedrock AgentCore

After you have developed and tested your LangGraph workflow locally, you can deploy it to Amazon Bedrock AgentCore Runtime. Amazon Bedrock AgentCore handles the heavy lifting of container orchestration, session management, and scaling, abstracting away infrastructure management. It provides persistent execution environments that can maintain an agent’s state across multiple invocations.

Before deploying our stock analyzer agent to Amazon Bedrock AgentCore Runtime, we need to create an AWS Identity and Access Management (IAM) role with the appropriate permissions. This role allows Amazon Bedrock AgentCore to:

  • Invoke your SageMaker endpoint for GPT-OSS model inference
  • Manage ECR repositories for storing container images
  • Write Amazon CloudWatch logs for monitoring and debugging
  • Access Amazon Bedrock AgentCore workload services for runtime operations
  • Send telemetry data to AWS X-Ray and CloudWatch for observability

See the following code:

from create_agentcore_role import create_bedrock_agentcore_role
role_arn = create_bedrock_agentcore_role(
    role_name="MyStockAnalyzerRole",
    sagemaker_endpoint_name="your-endpoint-name",
    region="us-west-2"
)
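
For reference, the permissions the helper attaches look roughly like the following. This is an illustrative sketch only, not the exact policy from create_agentcore_role.py; in production, scope the resources more tightly than the wildcards shown here.

# Illustrative permission summary, expressed as a Python dict (not the exact
# policy from the helper script). Resource ARNs are wildcarded for brevity.
policy_sketch = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",  # invoke the GPT-OSS endpoint on SageMaker
            "Action": ["sagemaker:InvokeEndpoint"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",  # pull the agent container image from Amazon ECR
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",  # logs and telemetry for observability
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords",
                "cloudwatch:PutMetricData",
            ],
            "Resource": "*",
        },
    ],
}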

After creating the role, you can use the Amazon Bedrock AgentCore Starter Toolkit to deploy your agent. The toolkit simplifies the deployment process by packaging your code, creating the necessary container image, and configuring the runtime environment:

from bedrock_agentcore_starter_toolkit import Runtime
agentcore_runtime = Runtime()
# Configure the agent
response = agentcore_runtime.configure(
    entrypoint="langgraph_stock_sagemaker_gpt_oss.py",
    execution_role=role_arn,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="stock_analyzer_agent"
)
# Deploy to the cloud
launch_result = agentcore_runtime.launch(local=False, local_build=False)

When you use BedrockAgentCoreApp, it automatically creates an HTTP server that listens on port 8080, implements the required /invocations endpoint for processing the agent’s requests, implements the /ping endpoint for health checks (which is very important for asynchronous agents), handles proper content types and response formats, and manages error handling according to AWS standards.
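
For reference, a minimal entrypoint file follows this shape. This is a sketch based on the documented starter toolkit pattern; langgraph_stock_sagemaker_gpt_oss.py in the repository is the complete version.

# Sketch of an Amazon Bedrock AgentCore entrypoint file
from bedrock_agentcore.runtime import BedrockAgentCoreApp

from langgraph_stock_local import langgraph_stock_sagemaker  # local agent module

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    # payload is the JSON body passed to invoke_agent_runtime
    prompt = payload.get("prompt", "")
    return langgraph_stock_sagemaker({"prompt": prompt})

if __name__ == "__main__":
    app.run()  # serves /invocations and /ping on port 8080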

After you deploy to Amazon Bedrock AgentCore Runtime, you will be able to see the status show as Ready on the Amazon Bedrock AgentCore console.

Invoke the agent

After you create the agent, you must set up the agent invocation entry point. With Amazon Bedrock AgentCore Runtime, we decorate the invocation part of our agent with the @app.entrypoint decorator and use it as the entry point for our runtime, as in the sketch above. After you deploy the agent to Amazon Bedrock AgentCore Runtime, you can invoke it using the AWS SDK:

import boto3
import json
agentcore_client = boto3.client('bedrock-agentcore', region_name="us-west-2")
response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=launch_result.agent_arn,
    qualifier="DEFAULT",
    payload=json.dumps({
        "prompt": "Analyze SIM_STOCK for investment purposes"
    })
)

After invoking the stock analyzer agent through Amazon Bedrock AgentCore Runtime, you must parse and format the response for clear presentation. The response processing involves the following steps:

  1. Decode the byte stream from Amazon Bedrock AgentCore into readable text.
  2. Parse the JSON response containing the complete stock analysis.
  3. Extract three main sections using regex pattern matching:
    1. Stock Data Gathering section: Extracts core stock information, including symbol, company details, current pricing, market metrics, financial ratios, trading data, and recent news headlines.
    2. Performance Analysis section: Analyzes technical indicators, fundamental metrics, and volatility measures to generate a comprehensive stock analysis.
    3. Stock Report Generation section: Generates a detailed PDF report containing the complete technical analysis.

The system also includes error handling that catches JSON parsing errors, falls back to plain-text display if structured parsing fails, and provides debugging information for troubleshooting issues with parsing the stock analysis response.
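
A simplified sketch of such a parsing helper follows. The response shape (a byte stream under the response key) matches the boto3 bedrock-agentcore client, and the section markers match the sample output shown later in this post, but the full implementation lives in the repository.

import json
import re

def parse_bedrock_agentcore_stock_response(response):
    # 1. Decode the byte stream returned by invoke_agent_runtime
    raw = response["response"].read().decode("utf-8")
    try:
        # 2. Parse the JSON payload containing the complete analysis text
        analysis_text = json.loads(raw)
        if not isinstance(analysis_text, str):
            analysis_text = json.dumps(analysis_text)
    except json.JSONDecodeError:
        # Fall back to plain text if structured parsing fails
        analysis_text = raw
    # 3. Extract the three report sections with regex pattern matching
    patterns = {
        "data_gathering": r"STOCK DATA GATHERING REPORT:(.*?)(?=STOCK PERFORMANCE ANALYSIS:|$)",
        "performance": r"STOCK PERFORMANCE ANALYSIS:(.*?)(?=STOCK REPORT GENERATION:|$)",
        "report": r"STOCK REPORT GENERATION:(.*)",
    }
    sections = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, analysis_text, re.DOTALL)
        if match:
            sections[name] = match.group(1).strip()
    return sections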

stock_analysis = parse_bedrock_agentcore_stock_response(response)

This formatted output makes it straightforward to review the agent’s decision-making process and present professional stock analysis results to stakeholders, completing the end-to-end workflow from model deployment to meaningful business output:

STOCK DATA GATHERING REPORT:
================================
Stock Symbol: SIM_STOCK
Company Name: Simulated Stock Inc.
Sector: SIM_SECTOR
Industry: SIM INDUSTRY
CURRENT MARKET DATA:
- Current Price: $29.31
- Market Cap: $3,958
- 52-Week High: $29.18
- 52-Week Low: $16.80
- YTD Return: 1.30%
- Volatility (Annualized): 32.22%
FINANCIAL METRICS:
- P/E Ratio: 44.80
- Forward P/E: 47.59
- Price-to-Book: 11.75
- Dividend Yield: 0.46%
- Revenue (TTM): $4,988
- Profit Margin: 24.30%

STOCK PERFORMANCE ANALYSIS:
===============================
Stock: SIM_STOCK | Current Price: $29.31
TECHNICAL ANALYSIS:
- Price Trend: SLIGHT UPTREND
- YTD Performance: 1.03%
- Technical Score: 3/5
FUNDAMENTAL ANALYSIS:
- P/E Ratio: 34.80
- Profit Margin: 24.30%
- Dividend Yield: 0.46%
- Beta: 1.165
- Fundamental Score: 3/5
STOCK REPORT GENERATION:
===============================
Stock: SIM_STOCK 
Sector: SIM_INDUSTRY
Current Price: $29.78
REPORT SUMMARY:
- Technical Analysis: 8.33% YTD performance
- Report Type: Comprehensive stock analysis for informational purposes
- Generated: 2025-09-04 23:11:55
PDF report uploaded to S3: s3://amzn-s3-demo-bucket/2025/09/04/SIM_STOCK_Stock_Report_20250904_231155.pdf
REPORT CONTENTS:
• Executive Summary with key metrics
• Detailed market data and financial metrics
• Technical and fundamental analysis
• Professional formatting for documentation

Clean up

You can delete the SageMaker endpoint to avoid accruing costs after your testing by running the following cells in the same notebook:

sess.delete_inference_component(inference_component_name)
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)

You can also delete Amazon Bedrock AgentCore resources using the following commands:

runtime_delete_response = agentcore_control_client.delete_agent_runtime(
    agentRuntimeId=launch_result.agent_id
)
response = ecr_client.delete_repository(
    repositoryName=launch_result.ecr_uri.split('/')[1],
    force=True
)

Conclusion

In this post, we built an end-to-end solution for deploying OpenAI’s open-weight models on a single G6e (L40S) GPU, creating a multi-agent stock analysis system with LangGraph, and deploying it seamlessly with Amazon Bedrock AgentCore. This implementation demonstrates how organizations can now use powerful open source LLMs cost-effectively with efficient serving frameworks such as vLLM. Beyond the technical implementation, enhancing this workflow can provide significant business value, such as reduced stock analysis processing time and increased analyst productivity through the automation of routine stock assessments. Furthermore, by freeing analysts from repetitive tasks, organizations can redirect skilled professionals toward complex cases and relationship-building activities that drive business growth.

We invite you to try out our code samples and iterate on your agentic workflows to meet your use cases.


About the authors

Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and solutions for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.