
On June 1, 2026, JetBrains published Mellum2, a 12‑billion‑parameter Mixture‑of‑Experts (MoE) model aimed at high‑throughput, low‑latency text and code workloads. Trained from scratch on natural language and code, Mellum2 activates only about 2.5 billion parameters per token during inference and is released under the Apache 2.0 license, which permits private and commercial deployment. The release frames Mellum2 as an evolution of a code‑completion foundation, broadened to general natural language and software engineering tasks, and explicitly scopes the model to text and code rather than multimodal inputs.
At the architectural level, Mellum2 uses MoE routing to keep total model capacity high while directing each token to a subset of expert parameters. That selective activation is designed to reduce per‑token compute and serving cost compared with dense models of similar total size. JetBrains highlights this cost and latency profile as a way to make latency‑sensitive model calls practical inside larger, multi‑model stacks where frequent intermediate calls can be a bottleneck.
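The selective activation described above can be illustrated with a minimal top-k routing sketch. This is a generic illustration of MoE routing, not Mellum2's actual router: the number of experts, the value of k, the toy expert functions, and the names `route_top_k` and `moe_layer` are all assumptions made for the example. The only figures taken from the announcement are the 12 billion total and roughly 2.5 billion active parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the selected experts' weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs.

    Experts that are not selected are never called, which is where the
    per-token compute saving comes from.
    """
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


def make_expert(scale):
    """Toy stand-in for an expert network: scales its input (hypothetical)."""
    return lambda x: [scale * v for v in x]


# Illustrative setup: 4 experts, 2 active per token (assumed values).
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]

selected = route_top_k(logits, k=2)
output = moe_layer([1.0, 1.0], experts, logits, k=2)

# Share of parameters active per token, from the announced figures.
active_fraction = 2.5 / 12

print("selected experts:", [i for i, _ in selected])
print("active share of parameters: {:.1%}".format(active_fraction))
```

With the announced figures, roughly a fifth of the model's parameters participate in any single token's forward pass, while the full 12 billion remain available as capacity across tokens.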
In its technical report, the team evaluates Mellum2 across code generation, reasoning, science, and math benchmarks. The paper reports the model is competitive with similarly sized open models on those tasks while delivering more than 2× faster inference, a combination the authors argue is well suited to production workloads where throughput and latency matter. The report documents the architecture, training recipe, benchmarks, and evaluation methodology in detail (arXiv:2605.31268).
JetBrains positions Mellum2 for concrete roles inside AI stacks: lightweight routing and orchestration such as prompt classification, tool selection, and control flow; latency‑sensitive retrieval‑augmented generation tasks including context compression, summarization, and retrieval post‑processing; and as sub‑agents for planning, validation, and transformation. The company also presents the model as practical for private, self‑hosted deployments that involve proprietary code or internal data, where avoiding repeated calls to larger models can reduce cost and improve responsiveness.
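The routing and orchestration role can be sketched as a small dispatcher: a cheap classification step labels the prompt, and the label decides which handler runs. Everything named here is hypothetical. `dispatch`, `keyword_classifier`, and the handler labels are illustrative, and the keyword-based classifier is only a stand-in so the sketch runs without a model. In a real stack, the `classify` callable would wrap a call to a small, fast model.

```python
from typing import Callable, Dict


def dispatch(
    prompt: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
    default: str = "general",
) -> str:
    """Label the prompt, then run the matching handler.

    Unknown labels fall back to the default handler.
    """
    label = classify(prompt)
    handler = handlers.get(label, handlers[default])
    return handler(prompt)


def keyword_classifier(prompt: str) -> str:
    """Stand-in classifier (assumption): a real one would call a small model."""
    lowered = prompt.lower()
    if "def " in lowered or "traceback" in lowered:
        return "code"
    if "summarize" in lowered:
        return "summarize"
    return "general"


# Hypothetical handlers; each would normally call a tool or another model.
handlers = {
    "code": lambda p: "code: " + p,
    "summarize": lambda p: "summarize: " + p,
    "general": lambda p: "general: " + p,
}

result = dispatch("Summarize this incident report", keyword_classifier, handlers)
print(result)
```

Keeping the classifier behind a plain callable means the routing logic stays the same whether the label comes from keywords, a self-hosted small model, or a remote endpoint.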
The Mellum2 model collection and accompanying resources are available for download and for experimentation in IDEs, RAG pipelines, agent workflows, or private infrastructure. The announcement links to a Hugging Face collection and to the full technical report for readers who want to reproduce or build on the evaluations.