AgentCore Optimization preview adds Recommendations API, batch checks and A/B testing to automate agent fixes

5/5/2026, 2:25:48 AM

AgentCore Optimization (preview) introduces a Recommendations API plus batch evaluation and live A/B testing to close the observe-evaluate-improve loop for production agents, automating detection and remediation of drift and measuring impact before rollout.

AgentCore Optimization has entered preview with tools to detect and repair quality drift in production agents as models, prompts and usage patterns change. The release bundles a Recommendations API that analyzes traces and evaluation outputs to propose specific changes — such as edits to the system prompt or tool descriptions — and ties those recommendations to measurable validation paths so teams can test and deploy improvements with fewer blind spots.

Traces are captured as OpenTelemetry‑compatible spans and surfaced through AgentCore Observability, which records every model call, tool invocation and reasoning step for downstream scoring and debugging. The Recommendations API consumes those traces (for example, agent traces written to a CloudWatch Log group) along with evaluator outputs to generate actionable suggestions, keeping an auditable chain from production behavior to proposed fixes.
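As a rough illustration of the data this implies, the sketch below models spans as plain Python records and rebuilds the ordered steps of one trace. The `Span` fields and the `kind` labels are assumptions made for the example. They are not the AgentCore Observability schema or the OpenTelemetry SDK types.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (not an official schema).
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    kind: str      # assumed labels: "model_call", "tool_call", "reasoning"
    name: str
    start_ns: int  # start time in nanoseconds
    attributes: dict = field(default_factory=dict)


def steps_for_trace(spans, trace_id):
    """Return the spans of one trace, sorted by start time."""
    return sorted(
        (s for s in spans if s.trace_id == trace_id),
        key=lambda s: s.start_ns,
    )


def kind_counts(spans):
    """Count how many spans of each kind appear."""
    return Counter(s.kind for s in spans)


spans = [
    Span("t1", "a", None, "model_call", "plan", 100),
    Span("t1", "c", "a", "model_call", "answer", 300),
    Span("t1", "b", "a", "tool_call", "search_orders", 200,
         {"tool.input": "order 42"}),
    Span("t2", "d", None, "model_call", "plan", 150),
]

ordered = steps_for_trace(spans, "t1")
print([s.name for s in ordered])
print(kind_counts(ordered))
```

Keeping the trace ID and parent ID on every record is what lets a scorer or a reviewer walk from an aggregate metric back to the individual step that caused it.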

Validation is provided two ways: offline batch flows and live A/B tests. Batch evaluation runs a proposed change across a predefined test dataset and returns aggregate scores so teams can detect regressions on cases they already care about. When hand‑authored test cases fall short, teams can simulate scenarios with an LLM‑backed actor that plays end users and exercises behaviors not captured in static tests. These batch flows let teams estimate likely effects before touching live traffic.
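The batch comparison described above can be sketched in a few lines. This is a minimal illustration, not the AgentCore batch evaluation API: `run_batch`, `regressions`, the toy agents and the exact-match scorer are all hypothetical stand-ins, and a real agent call would replace the plain functions used here.

```python
from statistics import mean


def run_batch(agent, dataset, scorer):
    """Run `agent` on every test case; return per-case scores and their mean."""
    scores = [scorer(agent(case["input"]), case["expected"]) for case in dataset]
    return {"scores": scores, "mean": mean(scores)}


def regressions(baseline, candidate):
    """Indices of test cases where the candidate scores lower than the baseline."""
    pairs = zip(baseline["scores"], candidate["scores"])
    return [i for i, (b, c) in enumerate(pairs) if c < b]


def exact_match(output, expected):
    # Toy scorer: 1.0 on an exact match, otherwise 0.0.
    return 1.0 if output == expected else 0.0


def baseline_agent(text):
    # Stand-in for the current agent configuration.
    return text.upper()


def candidate_agent(text):
    # Stand-in for the agent with a proposed change applied.
    return text.strip().upper()


dataset = [
    {"input": "refund", "expected": "REFUND"},
    {"input": " cancel ", "expected": "CANCEL"},
    {"input": "status", "expected": "STATUS"},
]

baseline_report = run_batch(baseline_agent, dataset, exact_match)
candidate_report = run_batch(candidate_agent, dataset, exact_match)
print(baseline_report["mean"], candidate_report["mean"])
print("regressed cases:", regressions(baseline_report, candidate_report))
```

Reporting per-case regressions alongside the aggregate matters because a higher mean can still hide individual cases that got worse.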

Live validation uses AgentCore Gateway to run controlled A/B tests that split production traffic at a configured percentage and report results with confidence intervals and statistical significance. By comparing the treatment and control paths on real users, teams can confirm whether a recommendation improves goal success, tool selection or other metrics before rolling it out broadly, reducing the risk of regressions across user cohorts.
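To make the statistics concrete, the sketch below shows one common way to split traffic deterministically and to compare two success rates with a two-proportion z-test. It assumes a binary per-session success metric and a normal approximation. The function names are hypothetical, and nothing here reflects how AgentCore Gateway assigns traffic or computes its reports.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(session_id: str, treatment_percent: float) -> str:
    """Deterministically map a session to 'treatment' or 'control'."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "treatment" if bucket < treatment_percent * 100 else "control"


def compare_success_rates(c_success, c_total, t_success, t_total, confidence=0.95):
    """Two-proportion z-test plus a confidence interval for the difference."""
    p_c = c_success / c_total
    p_t = t_success / t_total
    diff = p_t - p_c

    # Pooled standard error for the test of "no difference".
    pooled = (c_success + t_success) / (c_total + t_total)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / c_total + 1 / t_total))
    z = diff / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval around the observed difference.
    se = math.sqrt(p_c * (1 - p_c) / c_total + p_t * (1 - p_t) / t_total)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "diff": diff,
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
    }


# Illustrative numbers only: 700/1000 successes in control, 760/1000 in treatment.
result = compare_success_rates(700, 1000, 760, 1000)
print(result)
print(assign_variant("session-123", treatment_percent=10))
```

Hashing the session ID keeps a returning user in the same arm for the whole experiment, which a per-request random draw would not guarantee.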

The product formalizes a data‑driven improvement loop: recommendations derived from production traces, automated batch checks or simulated scenarios, and controlled live comparisons to confirm impact. That flow replaces a common manual cycle of reading traces, guessing fixes, testing a few examples, and deploying without robust validation. The preview aims to shorten that cycle and make it repeatable, with measurable outcomes.

Operationally, builders generate recommendations by pointing the Recommendations API at their agent trace logs, selecting a reward signal (a built‑in evaluator or a custom evaluator), and choosing whether to optimize the system prompt or tool descriptions. Evaluators can score traces on dimensions such as goal success rate, tool selection accuracy, helpfulness and safety, using ground truth labels, built‑in metrics, or LLM‑as‑judge scoring. The preview positions Optimization as the closing piece for continuous improvement, with security enforced at the infrastructure level. As NTT DATA’s Yoshiharu Okuda notes, deriving improvement recommendations from production trace data and validating them with A/B testing enables rapid, repeatable cycles and continuous, scalable improvement compared with traditional manual tuning.
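Two of the evaluator dimensions named above, goal success rate and tool selection accuracy, can be computed from ground truth labels roughly as follows. The trace record layout (`goal_achieved`, `tools_called`, `tools_expected`) and the `custom_reward` weighting are assumptions for illustration only. They are not the format or scoring used by AgentCore's built‑in evaluators.

```python
from statistics import mean


def goal_success_rate(traces):
    """Fraction of traces labelled as having achieved the user's goal."""
    if not traces:
        raise ValueError("no traces to score")
    return sum(1 for t in traces if t["goal_achieved"]) / len(traces)


def tool_selection_accuracy(traces):
    """Fraction of traces whose tool sequence matches the labelled one."""
    if not traces:
        raise ValueError("no traces to score")
    matches = sum(1 for t in traces if t["tools_called"] == t["tools_expected"])
    return matches / len(traces)


def custom_reward(trace):
    """Hypothetical custom evaluator: full credit needs goal and tools right."""
    if not trace["goal_achieved"]:
        return 0.0
    return 1.0 if trace["tools_called"] == trace["tools_expected"] else 0.5


# Hand-labelled toy traces (hypothetical format).
traces = [
    {"goal_achieved": True, "tools_called": ["lookup_order"],
     "tools_expected": ["lookup_order"]},
    {"goal_achieved": False, "tools_called": ["search_docs"],
     "tools_expected": ["lookup_order"]},
    {"goal_achieved": True, "tools_called": ["lookup_order", "issue_refund"],
     "tools_expected": ["lookup_order", "issue_refund"]},
    {"goal_achieved": True, "tools_called": ["search_docs"],
     "tools_expected": []},
]

print("goal success rate:", goal_success_rate(traces))
print("tool selection accuracy:", tool_selection_accuracy(traces))
print("mean custom reward:", mean(custom_reward(t) for t in traces))
```

The last trace shows why the two dimensions are scored separately: the goal was met, but an unnecessary tool call lowers the tool selection score and the custom reward.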

Sources

AWS Machine Learning Blog · 5/4/2026
