
Together AI published a technical breakdown (Sebastien Beurnier, 5/29/2026) describing a production speech — to-text stack it says achieved top latency rankings in Artificial Analysis. The team frames automatic speech recognition (ASR) as a full-path systems problem — spanning GPU execution, CPU preprocessing, memory movement, transport, scheduling and runtime behavior — arguing that end-to-end engineering, not just GPU inference, is what delivered the performance gains. This matters most for production streaming workloads where small jitter and per-packet latency determine quality of service.
The stack serves two low-latency models called out by Artificial Analysis: NVIDIA’s Parakeet — TDT 0.6B v3 and OpenAI’s Whisper Large v3. Together reports Parakeet — TDT 0.6B v3 as the faster model and cites a headline throughput: roughly 20 hours of speech transcribed in under 10 seconds, a figure used to illustrate the stack’s end-to-end speed rather than single — step inference performance.
Together’s engineers lay out why audio changes the serving problem. Raw audio inputs are orders of magnitude larger than text and must pass through container decoding, resampling, noise filtering, voice — activity detection (VAD), segmentation and feature computation before reaching the GPU. Speech models themselves are often hundreds of millions to a few billion parameters — much smaller than modern LLMs-so CPU preprocessing and I/O paths commonly dominate latency, especially in streaming scenarios where per-chunk overhead and jitter matter.
The post walks through concrete runtime optimizations the team applied: compiling the encoder to match real audio shape distributions; using NVIDIA TensorRT multi — profile engines to avoid wasteful padded execution paths; conditional CUDA graphs for more efficient control flow; GPU-side decoder control flow to minimize CPU-GPU handoffs; lower — copy CPU paths and shared memory for interprocess transfers; evented streaming I/O; and a Python garbage — collection fix to reduce runtime stalls. Together notes the encoder holds roughly 95% of model weights, making encoder profile tuning particularly valuable, and reports that optimizing encoder execution exposed the decoder loop as the next performance bottleneck.
Before adopting TensorRT for selected encoder profiles, Together ran an optimized PyTorch path using torch.compile plus CUDA graphs as a baseline. Moving those profiles into profile — aware TensorRT reduced memory use from about 6 GB to about 5 GB and produced several — times faster execution for short inputs versus a single large padded profile. gather realistic input — shape distributions, build multi — profile execution plans, cut CPU-copy work, adopt event — driven I/O and control runtime GC-these are the levers Together used to shift latency and throughput in production ASR.
Sources
Replies (0)
No replies in this topic yet.