StepFun Releases Step 3.7 Flash, a 198B MoE Multimodal Model for Coding Agents and Search

5/29/2026, 11:16:07 PM

StepFun published Step 3.7 Flash on May 29, 2026 as a multimodal sparse Mixture‑of‑Experts (MoE) model aimed at agentic coding and search workflows, highlighting native vision input and improved tool‑use reliability. The release introduces an Advisor Mode designed to keep most execution on cheaper paths while escalating to a larger advisor model only at strategic planning or failure‑recovery points. That combination positions the model for predictable, cost‑aware agent deployments in environments that mix retrieval, code generation and visual reasoning.

The model totals roughly 198 billion parameters: a 196B language backbone plus a 1.8B ViT visual encoder that injects image representations into the language context. Because Step 3.7 Flash uses sparse MoE routing, roughly 11B parameters activate per token during inference, which StepFun frames as keeping inference compute closer to that of an 11B dense model while retaining a 198B parameter budget. The system supports a 256k-token context window, throughput of up to 400 tokens/sec, and three selectable reasoning depths (low/medium/high) that trade latency for deeper reasoning.
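As a rough check on the figures above, the following sketch adds the two component sizes and computes the share of parameters active per token. The constant and function names are illustrative, and the timing helper assumes the quoted 400 tokens/sec peak is sustained, which the article does not claim.

```python
# Figures quoted in the article, in billions of parameters.
LANGUAGE_BACKBONE_B = 196.0
VISION_ENCODER_B = 1.8
ACTIVE_PER_TOKEN_B = 11.0


def total_params_b() -> float:
    """Backbone plus vision encoder, in billions."""
    return LANGUAGE_BACKBONE_B + VISION_ENCODER_B


def active_fraction() -> float:
    """Share of all parameters that activate for a single token."""
    return ACTIVE_PER_TOKEN_B / total_params_b()


def seconds_to_generate(tokens: int, tokens_per_sec: float = 400.0) -> float:
    """Idealized generation time, assuming the peak rate is sustained (an assumption)."""
    return tokens / tokens_per_sec
```

The components sum to 197.8B, which rounds to the headline 198B, and the active share works out to about 5.6% of the total.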

StepFun released the model under an Apache 2.0 license. The company reports coding gains over Step 3.5 Flash on several benchmarks: SWE-Bench Pro rose to 56.26% from 51.3%, Terminal-Bench climbed to 59.55% from 53.37%, and SWE-MTLG reached 72.42%. In its internal Step-SWE-Bench harness comparisons, Step 3.7 Flash's per-harness results span 64.5%–71.5%, a much narrower range than the 43%–73% previously seen with 3.5. StepFun presents this reduced cross-harness spread as a sign of more predictable agent behavior across heterogeneous scaffolds and tool schemas.
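The reported improvements can be restated as percentage-point gains and per-harness spreads. This is a small sketch over the numbers quoted above; the dictionary layout and function names are illustrative, and it assumes "3.5" refers to Step 3.5 Flash.

```python
# (Step 3.5 Flash, Step 3.7 Flash) scores in percent, as quoted in the article.
REPORTED = {
    "SWE-Bench Pro": (51.3, 56.26),
    "Terminal-Bench": (53.37, 59.55),
}

# (lowest, highest) per-harness result on Step-SWE-Bench, in percent.
HARNESS_RANGE = {
    "Step 3.5 Flash": (43.0, 73.0),
    "Step 3.7 Flash": (64.5, 71.5),
}


def point_gain(benchmark: str) -> float:
    """Percentage-point improvement from 3.5 Flash to 3.7 Flash."""
    old, new = REPORTED[benchmark]
    return round(new - old, 2)


def spread(model: str) -> float:
    """Gap between the best and worst harness result, in percentage points."""
    low, high = HARNESS_RANGE[model]
    return round(high - low, 2)
```

On these figures the gains are 4.96 points on SWE-Bench Pro and 6.18 points on Terminal-Bench, while the cross-harness spread shrinks from 30 points to 7.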

Advisor Mode is StepFun's on-device advisor strategy: the model runs the agentic loop end-to-end (calling tools, reading outputs and iterating) and escalates to a larger advisor only at planning or failure-recovery points. On StepFun's internal SWE-Bench Verified tests, enabling Advisor Mode reportedly brings Step 3.7 Flash to about 97% of Claude Opus 4.6's coding performance while lowering per-task cost to roughly $0.19 versus $1.76; StepFun describes these figures as internal estimates.
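The escalation pattern described above might be sketched as follows. This is not StepFun's implementation or API: `run_with_advisor`, `StepResult`, `AgentTrace` and the stub executor are hypothetical names, and the choice to consult the advisor once up front and again after each failed step is an assumption about what "planning or failure-recovery points" means.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    done: bool
    failed: bool
    output: str


@dataclass
class AgentTrace:
    advisor_calls: int = 0
    executor_steps: int = 0
    log: List[str] = field(default_factory=list)


def run_with_advisor(
    task: str,
    executor_step: Callable[[str], StepResult],  # cheap model: one loop iteration
    advisor_plan: Callable[[str], str],          # larger model: returns a plan
    max_steps: int = 10,
) -> AgentTrace:
    """Run the cheap loop, escalating only for planning and failure recovery."""
    trace = AgentTrace()
    plan = advisor_plan(f"plan: {task}")  # escalation point: initial planning
    trace.advisor_calls += 1
    for _ in range(max_steps):
        result = executor_step(plan)
        trace.executor_steps += 1
        trace.log.append(result.output)
        if result.done:
            break
        if result.failed:
            # Escalation point: ask the advisor for a recovery plan.
            plan = advisor_plan(f"recover: {result.output}")
            trace.advisor_calls += 1
    return trace


def make_flaky_executor() -> Callable[[str], StepResult]:
    """Stub executor for illustration: fails once, then succeeds."""
    calls = {"n": 0}

    def step(plan: str) -> StepResult:
        calls["n"] += 1
        if calls["n"] == 1:
            return StepResult(done=False, failed=True, output="tests failed")
        return StepResult(done=True, failed=False, output=f"ok using {plan}")

    return step
```

With the stub executor and any advisor callable, the loop makes two advisor calls (one plan, one recovery) and two executor steps. On the quoted cost figures, $0.19 is about 11% of $1.76.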

Multimodal capabilities include two visual tool pathways. A Visual Search Tool handles recognition and retrieval when parametric knowledge is insufficient; on SimpleVQA (with Search) Step 3.7 Flash scores 79.16%, comparable to GPT 5.5 (79.11%) and ahead of Kimi K2.6 (78.24%) and GLM 5V Turbo (78.20%). A Python Tool enables fine-grained image probing and manipulation: StepFun's self-tested results include 95.29% on V*, 89.13% on HR-Bench 4K and 86.34% on HR-Bench 8K.
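To illustrate the kind of image probing a Python tool could perform, here is a minimal crop-and-zoom sketch in pure Python. It is an assumption about what such probing might look like, not the tool StepFun ships; the `Image` alias and both function names are hypothetical, and a real tool would more likely use an imaging library.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, row-major


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the region box = (left, top, right, bottom); right/bottom exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def upscale_nearest(image: Image, factor: int) -> Image:
    """Enlarge by an integer factor using nearest-neighbour replication."""
    out: Image = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```

Cropping a small region and enlarging it lets a model re-inspect detail that is hard to resolve at the original scale, which is the general idea behind high-resolution benchmarks such as HR-Bench.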

On Android Daily, a long-horizon phone UI benchmark, Step 3.7 Flash scores 61.87%, ahead of Kimi K2.6 at 53.36% and GLM 5V Turbo at 51.68% but behind Gemini 3 Flash, which leads at 63.21%. StepFun also observed emergent compositional tool use during testing: the model combined visual and non-visual tools without explicit training, for example by rendering frontend output via a GUI and then inspecting it before iterating.

Sources

MarkTechPost AI · 5/29/2026
