
Microsoft on June 2, 2026 unveiled ASSERT (Adaptive Spec‑driven Scoring for Evaluation and Regression Testing), an open‑source framework that converts natural‑language descriptions of desired AI behavior into concrete, application‑specific evaluations and regression tests. The announcement frames ASSERT as a tool for developers and product teams who need repeatable, scored checks that enforce policy and behavior inside particular applications.
ASSERT’s pipeline parses plain‑language descriptions of expected behaviors and policies, then structures them into explicit lists of acceptable and unacceptable behaviors. From that specification it synthesizes problem scenarios and test cases, executes them against the target system, and scores the outcomes. The framework also captures execution traces, including intermediate actions and tool calls, so teams can inspect where and how a failure occurred.
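The stages described above (structure a spec, synthesize scenarios, execute, score, keep a trace) can be illustrated with a minimal Python sketch. This is not ASSERT's API: every name here (`Spec`, `Trace`, `Result`, `synthesize_scenarios`, `run_and_score`, `toy_agent`, `toy_judge`) is hypothetical, and the scenario generator and judge are trivial stand-ins for whatever the real framework does.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Spec:
    """Hypothetical structured form of a plain-language behavior description."""
    acceptable: list[str]
    unacceptable: list[str]


@dataclass
class Trace:
    """Intermediate actions and tool calls recorded during one run."""
    steps: list[str] = field(default_factory=list)


@dataclass
class Result:
    scenario: str
    score: float
    trace: Trace


def synthesize_scenarios(spec: Spec) -> list[str]:
    # Stand-in generator: one probing scenario per unacceptable behavior.
    return [f"Try to provoke: {rule}" for rule in spec.unacceptable]


def run_and_score(
    spec: Spec,
    system: Callable[[str, Trace], str],
    judge: Callable[[str, Trace], float],
) -> list[Result]:
    """Execute every synthesized scenario and score it, keeping its trace."""
    results = []
    for scenario in synthesize_scenarios(spec):
        trace = Trace()
        output = system(scenario, trace)
        results.append(Result(scenario, judge(output, trace), trace))
    return results


def toy_agent(scenario: str, trace: Trace) -> str:
    # Assumed agent under test: it records two tool calls, one of them unsafe.
    trace.steps.append("tool:search_docs")
    trace.steps.append("tool:send_email to=partner@example.org")
    return "Summary sent."


def toy_judge(output: str, trace: Trace) -> float:
    # Assumed scoring rule: any email tool call counts as a failure (0.0).
    sent_email = any(s.startswith("tool:send_email") for s in trace.steps)
    return 0.0 if sent_email else 1.0


spec = Spec(
    acceptable=["Summarize documents concisely"],
    unacceptable=["Send email outside the company"],
)
results = run_and_score(spec, toy_agent, toy_judge)
for r in results:
    print(r.scenario, r.score, r.trace.steps)
```

Because each result carries its own trace, a failing score can be traced back to the specific tool call that caused it, which is the inspection workflow the announcement describes.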
Teams can customize evaluations by supplying system context, available tools and operational constraints, allowing ASSERT to scope tests to a product’s specific environment. The framework is designed to run tests at multiple stages: during development, immediately after deployment, and as part of continuous monitoring to detect regressions or policy drift. Microsoft illustrates the approach with a document research agent: product rules could specify that the agent must not send emails outside the company, must limit confidential details to C‑level recipients, and must produce concise summaries that respect prior context. ASSERT uses those textual rules to synthesize scenarios that continuously check the agent’s behavior against those constraints.
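As an illustration of how one of those rules might be checked against recorded tool calls, here is a small Python sketch for the "no emails outside the company" constraint. The trace format (`ToolCall` with a `tool` name and an `args` dictionary), the tool names `send_email` and `search_docs`, and the domain `contoso.com` are all assumptions made for the example, not details from Microsoft's release.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in an agent trace."""
    tool: str
    args: dict


def external_email_violations(trace: list[ToolCall], company_domain: str) -> list[str]:
    """Return every recipient address whose domain is not the company domain."""
    violations = []
    for call in trace:
        if call.tool != "send_email":
            continue
        for recipient in call.args.get("to", []):
            # Compare the part after the last "@" case-insensitively.
            domain = recipient.rsplit("@", 1)[-1].lower()
            if domain != company_domain.lower():
                violations.append(recipient)
    return violations


trace = [
    ToolCall("search_docs", {"query": "Q3 forecast"}),
    ToolCall("send_email", {"to": ["alice@contoso.com", "bob@example.org"]}),
]
violations = external_email_violations(trace, "contoso.com")
print(violations)
```

A deterministic check like this only covers rules that can be read directly off tool arguments. Rules such as "produce concise summaries that respect prior context" would need a different kind of scorer.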
Sarah Bird, chief product officer of Responsible AI at Microsoft, argued that broader benchmarks do not suffice: "One of the things we’ve learned is that evaluations are absolutely critical to making good decisions... if you really want to have a trustworthy system, you should evaluate many more dimensions that are application‑specific." Microsoft positions ASSERT as a way to operationalize those application‑level dimensions.
The release arrives amid a wider industry move toward repeatable testing and regression checks for model behavior. Research groups have produced general benchmarks such as Stanford’s HELM and MLCommons’ AILuminate, alongside evaluation efforts from organizations such as METR. Microsoft describes ASSERT as filling the gap those broader evaluations leave when models must obey product policies, use specific tools, or behave within particular application contexts.
For builders, ASSERT promises a practical path to operationalize policy‑level requirements: generate repeatable, scored tests from specs, run ongoing regression checks, and use recorded tool calls and intermediate steps to trace failures. As open‑source software, the framework is intended for integration into teams’ existing evaluation workflows and for tailoring tests to product‑specific security, compliance and user‑experience requirements.
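The ongoing regression checks mentioned above could, under simple assumptions, amount to comparing per-test scores between a baseline run and the current run. The Python sketch below shows that idea only. The function `find_regressions`, the test names, and the 0.05 tolerance are hypothetical and do not come from ASSERT.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Map each test name to its score drop when the drop exceeds tolerance."""
    regressions = {}
    for name, old_score in baseline.items():
        new_score = current.get(name)
        if new_score is None:
            # Test was not run this time; nothing to compare.
            continue
        drop = old_score - new_score
        if drop > tolerance:
            regressions[name] = drop
    return regressions


baseline = {"no_external_email": 1.0, "concise_summary": 0.9}
current = {"no_external_email": 0.5, "concise_summary": 0.88}
regressions = find_regressions(baseline, current)
print(regressions)
```

The tolerance keeps small score fluctuations from being reported as regressions, while larger drops are surfaced by test name for follow-up in the recorded traces.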