Qwen‑Scope, a new interpretability toolkit, has been released with open‑source Sparse Autoencoder (SAE) weights that expose sparse internal features in the Qwen3 and Qwen3.5 model families. The assets cover seven LLM variants, including both dense and Mixture‑of‑Experts architectures, and provide 14 SAE sets in total. The SAEs were trained on a 0.5B‑token sample drawn from the corresponding models' pretraining data to improve feature coverage and stability.
The toolkit works by inserting and training SAEs inside hidden layers of the target models. Sparsity constraints push dense hidden activations to decompose into sparse, disentangled components that are more interpretable than raw activations. Because the SAEs operate at the representation level, they expose feature activations that correspond to coherent signals such as language, entities, or stylistic attributes without changing the base model weights.
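To make the decomposition concrete, here is a minimal sketch of a sparse autoencoder acting on a single hidden-state vector. It assumes a ReLU encoder with an L1 sparsity penalty, which is one common SAE formulation; the post does not say which sparsity constraint Qwen‑Scope uses. The class name `TinySAE`, the dimensions, and the random initialization are all hypothetical and are not taken from the release.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Illustrative sparse autoencoder; shapes and init are assumptions."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model -> d_features (usually d_features >> d_model).
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        # Decoder maps d_features -> d_model.
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, h):
        """Hidden state -> non-negative feature activations (ReLU)."""
        pre = [a + b for a, b in zip(matvec(self.w_enc, h), self.b_enc)]
        return [max(0.0, x) for x in pre]

    def decode(self, f):
        """Feature activations -> reconstructed hidden state."""
        return [a + b for a, b in zip(matvec(self.w_dec, f), self.b_dec)]

    def loss(self, h, l1_coeff=1e-3):
        """Squared reconstruction error plus an L1 penalty on activations."""
        f = self.encode(h)
        h_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(h, h_hat))
        sparsity = sum(abs(x) for x in f)
        return recon + l1_coeff * sparsity


sae = TinySAE(d_model=4, d_features=16)
hidden = [1.0, -0.5, 0.25, 2.0]
features = sae.encode(hidden)
print(sum(1 for x in features if x > 0.0), "of", len(features), "features active")
```

In this sketch the base model is never modified: the SAE only reads a hidden state and writes a reconstruction, which matches the point above that the base weights stay unchanged.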
On inference and data tasks, Qwen‑Scope lets developers control those feature activations to steer outputs (altering language choice, entity emphasis, or style) without issuing explicit natural‑language instructions. For classification, the toolkit can select highly relevant features from only a small amount of seed data. It can also exploit inactive features to synthesize targeted samples for long‑tail capabilities; the authors report roughly a 15× improvement in training‑data efficiency for such targeted synthesis compared with conventional methods.
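As a rough illustration of the first two ideas, the sketch below shows one common way feature steering is done (adding a scaled copy of a feature's decoder direction to the hidden state) and one simple way to pick features from seed data (ranking by the gap in mean activation between positive and negative examples). Both functions are hypothetical stand-ins; the post does not describe Qwen‑Scope's actual steering or selection procedure, and the numbers are made up for the example.

```python
def steer(hidden, feature_direction, strength):
    """Shift a hidden state along one feature's decoder direction."""
    return [h + strength * d for h, d in zip(hidden, feature_direction)]


def rank_features(pos_acts, neg_acts, top_k=3):
    """Rank features by mean-activation gap between two seed sets.

    pos_acts / neg_acts: lists of per-sample feature-activation vectors.
    Returns the indices of the top_k features that fire more on positives.
    """
    n_features = len(pos_acts[0])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    gaps = [(mean(pos_acts, j) - mean(neg_acts, j), j)
            for j in range(n_features)]
    gaps.sort(reverse=True)
    return [j for _, j in gaps[:top_k]]


# Steering: push the hidden state along a (made-up) feature direction.
steered = steer([1.0, 2.0], [0.5, -1.0], strength=2.0)

# Selection: four seed samples, three features.
positives = [[0.0, 1.0, 0.2], [0.0, 0.8, 0.1]]
negatives = [[0.5, 0.0, 0.2], [0.4, 0.1, 0.1]]
best = rank_features(positives, negatives, top_k=1)
print(steered, best)
```

The mean-gap ranking is deliberately simple; with only a handful of seed samples, a more robust statistic may be preferable in practice.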
For training and evaluation workflows, Qwen‑Scope helps pinpoint abnormally activated features that correlate with low‑quality behaviors such as code‑switching or repetitive generation, informing interventions during supervised fine‑tuning (SFT) and reinforcement learning (RL). The toolkit also computes feature‑activation patterns across samples and benchmarks to reveal redundancy, guide benchmark selection, reduce evaluation cost, and improve coverage. In practice, builders can use Qwen‑Scope for controllable inference, faster labeling and classification from minimal seed data, targeted long‑tail example synthesis, and feature‑level diagnostics during fine‑tuning, often without full model retraining or extensive prompt engineering.
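One way to picture the redundancy analysis is to compare which features each benchmark activates. The sketch below assumes a simple Jaccard overlap between sets of active features; this metric, the threshold, and the benchmark names are illustrative choices and are not described in the post.

```python
def active_features(activations, threshold=0.0):
    """Indices of features that exceed threshold on at least one sample."""
    return {j
            for row in activations
            for j, a in enumerate(row)
            if a > threshold}


def jaccard(a, b):
    """Overlap of two feature sets: |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up per-sample feature activations for three hypothetical benchmarks.
bench_a = [[0.0, 1.2, 0.0, 0.3]]
bench_b = [[0.0, 0.9, 0.0, 0.0], [0.0, 0.0, 0.0, 0.4]]
bench_c = [[0.5, 0.0, 0.0, 0.0]]

overlap_ab = jaccard(active_features(bench_a), active_features(bench_b))
overlap_ac = jaccard(active_features(bench_a), active_features(bench_c))
print(overlap_ab, overlap_ac)
```

Under these assumptions, a pair of benchmarks with high overlap exercises largely the same features and is a candidate for pruning, while a low-overlap benchmark adds coverage.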
The project is accompanied by a technical report that details the methods and demonstrates applications across inference, evaluation, data, and training. Released assets focus on the Qwen3 and Qwen3.5 series; researchers and engineers interested in model interpretability or optimization can apply the provided SAE weights and analyses to probe, diagnose, and steer Qwen model behavior.