Amazon SageMaker AI adds HTTP/2 bidirectional streaming; vLLM adds WebSocket real‑time transcription

5/20/2026, 5:58:09 PM

Amazon SageMaker AI has supported HTTP/2 bidirectional streaming to model containers since November 2025, letting clients stream audio in while receiving transcription tokens back over a single persistent connection. Separately, vLLM’s Realtime API adds native WebSocket transcription, emitting tokens incrementally instead of waiting for an entire recording to finish. The combination matters because it enables low‑latency, full‑duplex speech workloads (voice agents, live captioning, contact‑center analytics and accessibility tools) that need transcription as audio arrives and cannot be served efficiently by traditional request‑response APIs.

The announcement includes a deployment example that runs Voxtral-Mini-4B-Realtime-2602, Mistral AI’s compact real‑time speech model, on a SageMaker AI endpoint using a vLLM container. vLLM exposes its native WebSocket Realtime API at /v1/realtime and is designed to emit tokens incrementally as audio is received. To reduce per‑token latency for that incremental output, vLLM uses piecewise CUDA graph execution to cut GPU kernel launch overhead. Because vLLM is open source, developers keep control over model configuration, quantization and compilation when they build and optimize real‑time pipelines.

On the transport and data side, SageMaker AI provides native HTTP/2 bidirectional streaming on port 8443 and automatically bridges the HTTP/2 event stream protocol from the client to a WebSocket at the container. Client applications are responsible for resampling incoming audio (typically to 16 kHz mono PCM16), chunking it, and base64‑encoding each chunk. Those base64 PCM16 chunks travel over the bridged WebSocket into vLLM, which streams tokens back in real time, enabling incremental transcription and reducing the need to buffer full recordings.
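The client-side preparation described above (resample, convert to PCM16, chunk, base64-encode) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not code from the announcement or the referenced repository: the function names, the 100 ms chunk size and the linear-interpolation resampler are all assumptions chosen for clarity, and the sketch deliberately stops before the network layer because the article does not describe the message format.

```python
import base64
import math
import struct

TARGET_RATE = 16_000  # Hz; the rate the article cites as typical


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample mono float samples by linear interpolation.

    Assumption: a simple interpolator is enough for illustration. It does
    no anti-alias filtering, so real pipelines would use a proper resampler.
    """
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def base64_chunks(pcm_bytes, chunk_ms=100, rate=TARGET_RATE):
    """Yield base64 strings, each covering chunk_ms of mono PCM16 audio.

    chunk_ms=100 is a hypothetical default, not a value from the article.
    """
    bytes_per_chunk = rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        chunk = pcm_bytes[start:start + bytes_per_chunk]
        yield base64.b64encode(chunk).decode("ascii")


# Example input: one second of a 440 Hz tone captured at 48 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 48_000) for n in range(48_000)]
pcm = to_pcm16(resample_linear(tone, 48_000))
chunks = list(base64_chunks(pcm))
print(len(pcm), len(chunks))
```

Each string in `chunks` is what a client would place into whatever message envelope the endpoint expects before sending it over the stream. Smaller chunks lower the delay before the first tokens can come back but increase per-message overhead, so the chunk duration is a tuning knob rather than a fixed value.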

Operational features aim to make the setup production‑ready: SageMaker AI maintains WebSocket connections with ping/pong keepalive frames, performs container health checks, and exposes endpoint‑level monitoring via Amazon CloudWatch for observability and resilience. These platform features take over some of the custom engineering that previously fell to builders, namely connection management, health monitoring and basic telemetry, so teams can focus on model tuning and application logic rather than low‑level protocol handling.

For developers, the net effect is a managed speech‑to‑text path with lower latency and no custom protocol translation between client and container, backed by example code in the referenced GitHub repository. By combining platform‑level HTTP/2 streaming on port 8443 with vLLM’s WebSocket Realtime API and CUDA optimizations, organizations can meet the strict latency and persistent connection requirements of live speech applications without rebuilding the underlying transport stack.

Sources

AWS Machine Learning Blog · 5/20/2026
