Amazon SageMaker AI Adds OpenAI-Compatible /openai/v1 API for Real-Time Endpoints

News

5/21/2026, 12:29:23 AM

Amazon SageMaker AI Adds OpenAI-Compatible /openai/v1 API for Real-Time Endpoints

Amazon SageMaker AI now exposes an /openai/v1 path on its real-time inference endpoints that accepts OpenAI Chat Completions requests and returns container responses unchanged, including streaming. This makes any OpenAI — compatible client able to call SageMaker endpoints directly without container — side protocol translation, effectively letting developers treat SageMaker like an OpenAI — style inference service. The change simplifies integration for teams that want to run models in their own AWS accounts while using standard OpenAI SDKs and clients.

The OpenAI — compatible interface is enabled for all SageMaker AI endpoints and inference components when accessed via the standard SageMaker APIs and SDKs. SageMaker routes each incoming request by the endpoint name embedded in the URL, and callers can authenticate with time-limited bearer tokens instead of relying on API keys or implementing AWS Signature V4 signing in client code.

Authentication uses bearer tokens generated by the SageMaker Python SDK. Tokens can carry the role or user credentials, be valid for up to 12 hours, and require the IAM permissions sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint. The blog includes a sample snippet showing generation with from sagemaker.core.token_generator import generate_token and token = generate_token(region="us-west-2", expiry=timedelta(minutes=5)).

To follow the walkthrough you need an AWS account with permissions to create SageMaker AI endpoints, the SageMaker Python SDK (pip install sagemaker), the OpenAI Python SDK (pip install openai), and a model stored in Amazon S3; the post uses Qwen3 — 4B from Hugging Face as its example model. Creating and invoking endpoints also requires an IAM execution role with AmazonSageMakerFullAccess plus the invocation permissions noted above.

AWS frames several developer — facing use cases: run agentic multi — step workflows (for example, Strands Agents or LangChain) entirely on private SageMaker endpoints using dedicated GPU instances in your account; host multiple models on a single endpoint by composing inference components with per-model resource allocation (for example a general Llama model, a fine-tuned Mistral for domain tasks, and a smaller classifier); and serve fine‑tuned open-source models without changing SDK calls, streaming logic, or prompt formatting — only the endpoint URL needs to change.

To help implementers, AWS provides a step-by-step solution walkthrough and a GitHub notebook that demonstrate token generation, deploying single — model endpoints, deploying inference components for multi — model setups, and integrating with the Strands Agents framework. Giorgio Piatti, AI/ML Engineer at Caffeine.AI, says the bearer token feature let his team add SageMaker as a drop-in OpenAI — compatible inference endpoint without custom SigV4 signing, allowing it to work with their LLM gateway (Bifrost), the Vercel AI SDK, and standard OpenAI clients.

Sources

AWS Machine Learning Blog · 5/20/2026

Replies (0)

No replies in this topic yet.

Back