
The Together Inference Engine delivered substantially better end‑to‑end performance for production coding agents in a benchmark published 19 May 2026. The report cites 31% higher tokens‑per‑second (TPS) than the next‑fastest open‑source engine (the TL;DR also cites TensorRT‑LLM), roughly 2× faster time‑to‑first‑token (TTFT) under saturation, and 76% lower inference cost than Claude Opus 4.6. Those gains matter because they change which inference stacks remain usable once realistic multi‑tenant traffic, long contexts and concurrency are introduced.
The authors attribute Together’s advantage to full‑stack improvements rather than single‑component gains: ThunderMLA scheduling, custom kernel rewrites and end‑to‑end profiling against real traffic combine to reduce prefill contention and improve throughput under load. The report frames these optimizations as coordinated changes that shift behavior across the scheduler, memory/cache use and GPU kernels, not just isolated single‑request speedups.
Methodologically, the benchmark measures multi‑tenant inference behavior instead of isolated single‑request peaks. The report argues that common public tests, which run one user against a dedicated endpoint, miss the failure modes operators see in production, where dozens or hundreds of concurrent requests contend for shared resources and latency and throughput shift as load grows. Under realistic concurrency, shared KV caches, memory bandwidth and GPU cycles become the primary bottlenecks: as caches fill and scheduling pressure increases, prefill latency and TTFT grow, and some engines degrade or fail outright. To capture that shift, the authors compare engines across rising queries‑per‑second (QPS) and realistic churn rather than reporting only single‑request peak throughput.
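The saturation effect described here can be illustrated with a toy model. The sketch below is not the report's harness: it assumes a single FIFO server, evenly spaced arrivals and a fixed prefill time per request, and the names `simulate_ttft` and `sweep_qps` are hypothetical. It only shows why median TTFT stays flat while the arrival rate is below capacity and climbs once it exceeds it.

```python
from statistics import median


def simulate_ttft(qps: float, prefill_s: float, n_requests: int = 200) -> list[float]:
    """Toy single-server FIFO queue (an assumption, not the report's setup).

    Each request needs `prefill_s` seconds of exclusive prefill before its
    first token is emitted. Requests arrive evenly spaced at `qps` per second.
    Returns the time-to-first-token of every request, in seconds.
    """
    ttfts = []
    server_free_at = 0.0
    for i in range(n_requests):
        arrival = i / qps
        start = max(arrival, server_free_at)   # wait if the server is busy
        server_free_at = start + prefill_s     # first token leaves here
        ttfts.append(server_free_at - arrival)
    return ttfts


def sweep_qps(qps_levels: list[float], prefill_s: float = 0.5) -> dict[float, float]:
    """Median TTFT at each load level, mimicking a rising-QPS sweep."""
    return {qps: median(simulate_ttft(qps, prefill_s)) for qps in qps_levels}


if __name__ == "__main__":
    for qps, p50 in sweep_qps([1, 2, 4, 8]).items():
        print(f"qps={qps:>2}  p50 TTFT={p50:.3f}s")
```

With a 0.5 s prefill the toy server saturates at 2 QPS. Below that point every request sees only its own prefill time; above it the queue grows without bound and median TTFT depends on how long the test runs, which is one reason single-request numbers say little about behavior under load.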
For engineers building coding assistants the study highlights two practical constraints. First, perceived responsiveness tracks TTFT under load because developers wait for the first streamed token. Second, coding workloads pair very long contexts with bounded output shapes (functions instead of essays), which stresses prefill and scheduler behavior more than sustained decode throughput; engines tuned mainly for long, uninterrupted decode may therefore underperform on this class of traffic.
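Both quantities a developer actually perceives can be derived from the arrival timestamps of streamed tokens. The following is a minimal sketch under stated assumptions: `StreamTrace` is a hypothetical record type, and per-user TPS is taken here to mean decode tokens per second after the first token, which may differ from the report's exact definition.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class StreamTrace:
    """Hypothetical record of one streamed response (times in seconds)."""
    request_sent_s: float
    token_times_s: list[float]  # arrival time of each streamed token


def ttft(trace: StreamTrace) -> float:
    """Time from sending the request to receiving the first token."""
    if not trace.token_times_s:
        raise ValueError("trace contains no tokens")
    return trace.token_times_s[0] - trace.request_sent_s


def decode_tps(trace: StreamTrace) -> float:
    """Tokens per second after the first token (assumed definition)."""
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0
    span = trace.token_times_s[-1] - trace.token_times_s[0]
    return (n - 1) / span if span > 0 else 0.0


def p50_ttft(traces: list[StreamTrace]) -> float:
    """Median TTFT across a batch of traces."""
    return median(ttft(t) for t in traces)
```

For a request sent at t = 0 whose five tokens arrive at 2.0, 2.5, 3.0, 3.5 and 4.0 seconds, `ttft` is 2.0 s and `decode_tps` is 2.0 tokens per second. Separating the two makes it visible when a long prompt inflates the wait before the first token even though decode speed is unchanged.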
The workload model used prompt lengths that grow to reflect live coding sessions (roughly 45k to 200k tokens) and generation lengths averaging about 450 tokens (p50 293, p99 2,230). Key measurements included input tokens per minute (TPM), per‑user TPS and p50 TTFT, tracked alongside how KV cache occupancy and scheduler behavior evolve with load. Each engine ran on 4× NVIDIA B200 GPUs, except the SGLang variant, which the report notes ran on 8× B200.
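To reproduce a similar output-length profile in a load generator, one option is to fit a distribution to the quoted percentiles. The sketch below assumes a lognormal shape, which the report does not state; `fit_lognormal`, `implied_mean` and `sample_output_lengths` are hypothetical helper names. Only the p50 of 293 and p99 of 2,230 come from the summary above.

```python
import math
import random
from statistics import NormalDist

P50_TOKENS = 293    # from the reported workload
P99_TOKENS = 2230   # from the reported workload


def fit_lognormal(p50: float, p99: float) -> tuple[float, float]:
    """Solve for (mu, sigma) of a lognormal matching the two percentiles.

    The lognormal shape is an assumption made for illustration only.
    """
    mu = math.log(p50)                        # median of a lognormal is exp(mu)
    z99 = NormalDist().inv_cdf(0.99)          # standard normal 99th percentile
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def implied_mean(mu: float, sigma: float) -> float:
    """Mean of a lognormal distribution with parameters (mu, sigma)."""
    return math.exp(mu + sigma ** 2 / 2)


def sample_output_lengths(n: int, seed: int = 0) -> list[int]:
    """Draw n generation lengths (in tokens) from the fitted distribution."""
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(P50_TOKENS, P99_TOKENS)
    return [max(1, round(rng.lognormvariate(mu, sigma))) for _ in range(n)]


if __name__ == "__main__":
    mu, sigma = fit_lognormal(P50_TOKENS, P99_TOKENS)
    print(f"mu={mu:.3f} sigma={sigma:.3f} mean={implied_mean(mu, sigma):.0f}")
```

Under this assumption the fitted mean comes out in the neighbourhood of 430 tokens, reasonably close to the reported average of about 450, so a lognormal is a plausible stand-in when the raw length distribution is not available. The heavy right tail is the point: most responses are short, but a few are several times longer than the median.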
The benchmark release is labeled version one and is authored by Alex Angus, Will Van Eaton and Dan Fu. It reports results across 40+ models selected for production scenarios and is presented as a guide for operators and builders who must tune inference stacks for long‑context, high‑concurrency coding‑assistant traffic rather than relying on single‑request peak numbers.