
On May 4, 2026, Together AI published a technical blog post by Will Van Eaton, Adee Feiner and Hiral Jasani that describes a broad industry shift: inference, not training, is now the dominant driver of production costs for large language models. The authors argue that inference is an ongoing expense that grows with users and features, and they position the company’s recent research and production systems as direct responses to that operational challenge for builders.
The post summarizes several concrete research outputs and deployed systems. FlashAttention has advanced to FlashAttention-4, which the company reports is up to 1.3× faster than cuDNN on NVIDIA Blackwell hardware. The blog also highlights ThunderKittens, as well as ATLAS, an adaptive speculative decoding system the authors say can deliver up to 4× faster LLM inference. Together additionally describes Aurora as an open-source, reinforcement learning-based framework that learns from live inference, and notes that this research is frequently pushed into production for customers within weeks.
Together frames inference as a multi-dimensional optimization problem where latency and throughput require different trade-offs. The blog stresses that sub-500 ms responses remain critical for interactive experiences, and that agentic workflows amplify latency because multiple model calls accumulate: for example, five calls at 200 ms each add up to a full second. On the throughput side, the company emphasizes that faster inference increases requests served per GPU-hour and therefore improves unit economics for AI-native firms.
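The latency and throughput arithmetic above can be sketched in a few lines of Python. This is an illustration only, not code from the Together post: the function names are hypothetical, the GPU-hour price is a made-up placeholder, and the throughput model assumes a GPU serves one request at a time, which ignores batching.

```python
# Illustrative sketch only; names and the price below are assumptions,
# not values or APIs from the Together AI blog post.

MS_PER_HOUR = 3_600_000


def sequential_latency_ms(call_latencies_ms):
    """Total latency when model calls run strictly one after another."""
    return sum(call_latencies_ms)


def requests_per_gpu_hour(ms_per_request):
    """Requests served per GPU-hour, assuming one request at a time (no batching)."""
    if ms_per_request <= 0:
        raise ValueError("ms_per_request must be positive")
    return MS_PER_HOUR / ms_per_request


def cost_per_request(gpu_hour_price, ms_per_request):
    """GPU cost attributed to a single request under the same assumption."""
    return gpu_hour_price / requests_per_gpu_hour(ms_per_request)


# The article's example: five agent calls at 200 ms each.
agent_calls_ms = [200] * 5
total_ms = sequential_latency_ms(agent_calls_ms)
exceeds_interactive_budget = total_ms > 500  # the sub-500 ms target

# Hypothetical price of 2.00 per GPU-hour, used only to show the trend:
# halving per-request time halves the cost per request.
HYPOTHETICAL_GPU_HOUR_PRICE = 2.00
baseline_cost = cost_per_request(HYPOTHETICAL_GPU_HOUR_PRICE, 200)
faster_cost = cost_per_request(HYPOTHETICAL_GPU_HOUR_PRICE, 100)

print(f"agent workflow latency: {total_ms} ms")
print(f"exceeds 500 ms budget: {exceeds_interactive_budget}")
```

Under these assumptions a single 200 ms call fits the interactive budget, but the five-call workflow does not. That is why per-call latency matters more in agentic settings than in single-turn ones.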
The authors quantify the economic pressure driving these investments: inference is estimated to account for 80–90% of the total lifetime cost of a production AI system, and they cite industry data suggesting inference can represent roughly 23% of revenue at scaling-stage AI companies. Based on those figures, Together argues that continuous investment across the stack, from algorithms to orchestration to hardware, is necessary rather than optional for teams operating at scale.
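To make the margin argument concrete, the sketch below applies a throughput speedup to the 23% figure. It is a back-of-the-envelope model, not a calculation from the post: it assumes inference cost falls in inverse proportion to the speedup and that revenue and traffic stay constant, and the function name is hypothetical. The 1.5× and 4× values match the speedup range cited elsewhere in the article.

```python
# Back-of-the-envelope sketch. Assumes inference cost scales inversely with
# throughput and that revenue and request volume are unchanged; neither
# assumption comes from the Together AI post.

def inference_share_after_speedup(baseline_share, speedup):
    """Inference cost as a fraction of revenue after a throughput speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_share / speedup


BASELINE_SHARE = 0.23  # roughly 23% of revenue, as cited in the article

for speedup in (1.0, 1.5, 4.0):
    share = inference_share_after_speedup(BASELINE_SHARE, speedup)
    print(f"{speedup:.1f}x speedup: inference is about {share:.3f} of revenue")
```

In practice the savings would likely be smaller, since real speedups vary by workload and cheaper inference often invites more usage.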
At NVIDIA GTC 2026, Jensen Huang summarized the shift by saying, “People pay for information, but people mostly pay for work. Agentic systems get work done.” Together’s CTO Ce Zhang presented related operational lessons at the same conference, noting that concurrency, variable context lengths and evolving model architectures create scheduling, orchestration and hardware challenges that change how infrastructure teams prioritize engineering effort.
Three takeaways follow for builders: adaptive speculative decoding and other compiler- and kernel-level advances can yield roughly 1.5–4× end-to-end speedups on predictable workloads; aligning latency and throughput targets with product expectations directly affects margins; and maintaining competitive inference performance requires ongoing research, system design and continuous hardware tuning rather than one-off fixes.