
Toto 2.0 has been released on Hugging Face as an open-weight family of time-series forecasting foundation models. The family spans five sizes, from 4 million to 2.5 billion parameters, and ships with an infrastructure library for distributed training. Model weights and supporting code are released under the Apache 2.0 license. The launch matters because the team reports consistent quality improvements as model size increases, a property they describe as rare for time-series foundation models and one that helps builders decide where to invest compute and engineering effort.
All Toto 2.0 variants were trained from a single recipe on a mix of observability and synthetic data, and the authors say the models did not see any public forecasting datasets during pretraining. The authors promise a technical report with full training details: data composition, architectural and training recipes, and the u-μP hyperparameter transfer pipeline, which the team used to tune hyperparameters once on a small proxy model and then transfer them to the larger sizes.
Benchmark results place Toto 2.0 at or near the top of the evaluated leaderboards. On the BOOM observability forecasting benchmark, every Toto 2.0 size sits on or near the Pareto frontier. The three largest sizes lead GIFT-Eval among foundation models and also hold top ranks on TIME, a contamination-resistant zero-shot benchmark. When broader leaderboards include finetuned and ensemble systems, a finetuned 2.5-billion-parameter variant (FT) and a “Family and Friends” (FnF) ensemble occupy the top two slots.
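For readers unfamiliar with the term, a model is on the Pareto frontier when no other model is both cheaper and more accurate. The sketch below is only an illustration of that idea, assuming two axes where lower is better (a cost such as parameter count or latency, and an error score). The function name, the axes, and the sample values are hypothetical placeholders and are not taken from the BOOM leaderboard.

```python
def pareto_frontier(models):
    """Return names of non-dominated models.

    `models` is a list of (name, cost, error) tuples; lower is better
    for both cost and error. This is a hypothetical helper, not part
    of any Toto or BOOM tooling.
    """
    frontier = []
    best_error = float("inf")
    # Walk from cheapest to most expensive; a model stays on the frontier
    # only if it beats the error of every cheaper (or equally cheap) model.
    for name, cost, error in sorted(models, key=lambda m: (m[1], m[2])):
        if error < best_error:
            frontier.append(name)
            best_error = error
    return frontier


# Placeholder values for illustration only, not reported results.
example_models = [
    ("small", 1.0, 0.90),
    ("medium", 2.0, 0.80),
    ("dominated", 3.0, 0.85),  # costs more than "medium" but is less accurate
    ("large", 4.0, 0.50),
]
print(pareto_frontier(example_models))
```

A model "near" the frontier would be one that is dominated only by a small margin; how small is a judgment the leaderboard reader has to make.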
The release emphasizes scaling behavior and efficiency: reported metrics show monotonic improvements in CRPS rank as parameter count increases across the family, with no sign of saturation at the largest, 2.5-billion-parameter model. The authors also claim a generational jump over Toto 1.0: Toto 2.0 is said to be roughly seven times more parameter-efficient at matched quality and to deliver materially faster inference, improving the cost-performance trade-offs for production use.
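CRPS (continuous ranked probability score) measures how well a probabilistic forecast matches an observed value; lower is better, and for a single-point forecast it reduces to absolute error. As a minimal sketch, assuming the forecast is available as a set of samples, the standard sample-based estimator is mean |x - y| minus half the mean pairwise |x - x'|. The function name is hypothetical, and the benchmarks mentioned here may use different estimators or normalizations before ranking models.

```python
def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate for one observed value.

    Uses the energy form: E|X - y| - 0.5 * E|X - X'|.
    Hypothetical helper for illustration; not the benchmarks' own code.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one forecast sample")
    # Average distance between forecast samples and the observation.
    accuracy = sum(abs(x - observed) for x in samples) / n
    # Average distance between pairs of samples (forecast spread).
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


# A point forecast that is off by 2 scores 2.0 (plain absolute error).
print(crps_from_samples([3.0], 5.0))
# Two samples bracketing the observation score lower than either alone.
print(crps_from_samples([1.0, 3.0], 2.0))
```

A "CRPS rank" then comes from scoring each model this way across a benchmark's series and ordering the models by their aggregate score.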
The team calls out practical behaviors relevant to engineers and researchers. They report improvements in inference latency and better long-horizon stability, and note that the model generalizes broadly despite not using public forecasting datasets in pretraining. At the same time, the release identifies remaining gaps and research directions: closing the long-horizon performance gap versus some classical baselines, improved data curation, evaluation metrics that track downstream value, and work on multimodality.