
Perplexity Research has published a detailed breakdown of how the company trains AI search agents built on open models. The core problem is familiar to anyone developing search-augmented language models: the model must be factually accurate, use search tools economically, and respond in a style that meets user expectations, all at once. If only accuracy is optimized, the agent starts calling tools excessively. If only conciseness is optimized, the answers risk losing completeness and reliability.
To solve this problem, Perplexity uses a two-stage post-training pipeline. The first stage is supervised fine-tuning (SFT). Its purpose is less to maximize search accuracy than to lock in the behaviors without which a model cannot ship in a product: declining to answer when reliable data is lacking, following instructions, keeping a stable response language, maintaining a careful tone, and producing predictable formatting. Models from the Qwen3.5 family are mentioned as the base, including 122B-A10B and 397B-A17B.
The SFT dataset consists of two parts. The first is preference-oriented examples covering style, clarity, language, and format. These do not always have a single correct answer, but they set the product requirements for writing. The second is tool-use trajectories in a production format: queries and internal tasks with one-step, two-step, and multi-step scenarios, annotated via a search harness. This matters because a poorly chosen SFT mix can degrade the model's basic search ability even while it improves the surface style of responses.
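As a rough illustration of the trajectory part of the dataset, the sketch below buckets tool-use trajectories by the number of tool calls so that the balance between one-step, two-step, and multi-step scenarios can be inspected before training. The record layout (`ToolCall`, `Trajectory`), the `step_bucket` helper, and the extra "no-tool" bucket are assumptions made for this example; the source does not describe Perplexity's actual data schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str        # hypothetical tool name, e.g. "search"
    arguments: dict  # hypothetical tool arguments


@dataclass
class Trajectory:
    query: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""


def step_bucket(traj: Trajectory) -> str:
    """Label a trajectory by how many tool calls it contains."""
    n = len(traj.tool_calls)
    if n == 0:
        return "no-tool"  # assumption: answered without calling any tool
    if n == 1:
        return "one-step"
    if n == 2:
        return "two-step"
    return "multi-step"


def bucket_counts(trajectories: list[Trajectory]) -> Counter:
    """Count trajectories per bucket to check the balance of the SFT mix."""
    return Counter(step_bucket(t) for t in trajectories)
```

A skewed count, for example almost no multi-step trajectories, would be one early warning that the SFT mix may under-train the search behavior it is supposed to preserve.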
After SFT, Perplexity moves to on-policy reinforcement learning. Here the objective is closer to search quality itself: increase response accuracy, improve tool selection, and reduce unnecessary actions. Optimization uses GRPO (Group Relative Policy Optimization), and the reward is designed as a composite: basic correctness, preference-based signals, anchored efficiency shaping, and safeguards against reward hacking. The team also singles out token-level importance sampling as a practical way to reduce the mismatch between training and inference without excessive infrastructure complexity.
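The pieces of this stage can be sketched in a few lines. The group-relative advantage below follows the standard GRPO formulation (each reward is normalized against the other rollouts for the same prompt). Everything else is an assumption for illustration: the source does not define "anchored efficiency shaping", so the sketch treats the anchor as a reference tool-call count and penalizes only calls beyond it; gating the preference and efficiency terms on correctness is one plausible guard against reward hacking, not a confirmed detail; and the token-level importance weights are shown in a truncated form with a hypothetical `cap`. The weights `w_pref` and `w_eff` are likewise made up.

```python
import math
from statistics import mean, pstdev


def composite_reward(correct, preference, tool_calls, anchor_calls,
                     w_pref=0.2, w_eff=0.1):
    """Combine correctness, a preference score in [0, 1], and efficiency.

    Assumption: style and efficiency bonuses are granted only to correct
    answers, so a short or well-styled wrong answer earns nothing.
    """
    if not correct:
        return 0.0
    # Assumption: only tool calls beyond the anchor reduce the bonus.
    overuse = max(0, tool_calls - anchor_calls)
    efficiency = 1.0 / (1.0 + overuse)
    return 1.0 + w_pref * preference + w_eff * efficiency


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_importance_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratio between trainer and rollout-engine probabilities.

    Assumption: the ratio is truncated at `cap` to keep variance bounded.
    """
    return [min(math.exp(lt - lr), cap)
            for lt, lr in zip(train_logprobs, rollout_logprobs)]
```

In a full training loop, each token's policy-gradient term would be scaled by both the advantage of its rollout and its importance weight, so tokens that the inference engine sampled with a noticeably different probability than the trainer assigns them contribute a corrected, bounded amount.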
The construction of the RL data is especially important. Perplexity mixes verifiable search-agent QA, where results can be checked, with rubric-based general chat, which is needed to maintain guardrails during RL. This approach reflects a maturing trend in agent training: quality is determined not by a single benchmark but by a balance between reliability, search trajectory, safe behavior, and response usability. For the market, this signals that search agents will compete not only on access to sources but also on how carefully their training connects data, rewards, and real product constraints.
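A minimal sketch of such a data mix follows, assuming two prompt pools and a fixed mixing fraction. The exact-match checker, the rubric scoring as a fraction of satisfied criteria, and the `qa_fraction` parameter are illustrative assumptions; the source does not state how Perplexity verifies answers, grades rubrics, or sets the ratio.

```python
import random


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def verifiable_reward(answer: str, gold: str) -> float:
    """Assumed checker for verifiable QA: normalized exact match."""
    return 1.0 if normalize(answer) == normalize(gold) else 0.0


def rubric_reward(passed: list[bool]) -> float:
    """Assumed rubric score: fraction of criteria a grader marked as met."""
    return sum(passed) / len(passed) if passed else 0.0


def sample_mixed_batch(qa_pool, chat_pool, batch_size, qa_fraction, rng):
    """Draw a batch that mixes verifiable QA with rubric-based chat prompts."""
    n_qa = round(batch_size * qa_fraction)
    # Sampling with replacement keeps the sketch valid for small pools.
    batch = [("qa", item) for item in rng.choices(qa_pool, k=n_qa)]
    batch += [("chat", item)
              for item in rng.choices(chat_pool, k=batch_size - n_qa)]
    rng.shuffle(batch)
    return batch
```

Tagging each item with its source lets the training loop route "qa" items to the verifiable checker and "chat" items to the rubric grader, which is what allows the chat portion to keep guardrail behavior under reward pressure.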