AgentCore Browser adds OS‑Level Actions so agents can interact with native UI

5/5/2026, 6:02:15 PM

Amazon Bedrock’s AgentCore Browser now supports OS Level Actions, a feature that lets programmatic agents interact with operating‑system rendered elements outside the browser DOM. Agents can perform mouse and keyboard events and capture full‑desktop screenshots while a browser session is active, enabling them to actuate native dialogs and other UI that previously lay beyond conventional browser automation.

OS Level Actions are invoked through the InvokeBrowser API. Each API call must carry exactly one action (a type and its arguments) and returns a status of either SUCCESS or FAILED. The API ties actions to the active browser session via the x-amzn-browser-session-id header, and captured screenshots are returned as base64‑encoded PNG images.
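The request and response rules described above can be sketched in a few lines of Python. This is only an illustration: the helper names (build_request, decode_screenshot, is_success) and the dictionary layout are hypothetical and are not taken from the AWS SDK. Only the header name, the one‑action‑per‑call rule, the two status values and the base64 PNG encoding come from the article.

```python
import base64

SESSION_HEADER = "x-amzn-browser-session-id"  # header name as given in the article
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"          # fixed 8-byte prefix of every PNG file


def build_request(session_id: str, action_type: str, arguments: dict) -> dict:
    """Assemble one request carrying exactly one action.

    The dict layout is an assumption made for illustration; the real
    InvokeBrowser wire format may differ.
    """
    if not session_id:
        raise ValueError("a browser session id is required")
    return {
        "headers": {SESSION_HEADER: session_id},
        "action": {"type": action_type, "arguments": dict(arguments)},
    }


def decode_screenshot(b64_png: str) -> bytes:
    """Decode a base64 screenshot and check that it looks like a PNG."""
    raw = base64.b64decode(b64_png, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw


def is_success(response: dict) -> bool:
    """Map the two documented status values onto a boolean."""
    status = response.get("status")
    if status not in ("SUCCESS", "FAILED"):
        raise ValueError(f"unexpected status: {status!r}")
    return status == "SUCCESS"
```

Keeping the action as a single object rather than a list makes it impossible to build a request that violates the one‑action‑per‑call rule by accident.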

The change addresses a long‑standing boundary in web automation: tools such as Playwright and the Chrome DevTools Protocol (CDP) can manipulate only the browser DOM. Native or OS‑rendered prompts, such as system print dialogs, macOS privacy dialogs, Windows Security prompts, certificate choosers, context menus and some browser settings, are not exposed through CDP and therefore cannot be reached by standard automation flows.

For agents that use a vision loop (screenshot → model → action), that limitation meant a model could locate UI in an image but had no programmatic path to interact with it. With OS Level Actions, an agent can observe native UI in a captured screenshot, have a vision model infer coordinates or commands, and then execute mouse or keyboard events against the OS within the same session, enabling end‑to‑end automation in production workflows.
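A minimal sketch of such a vision loop follows, assuming the screenshot capture, the vision model and the action executor are supplied as plain callables. The function run_vision_loop and its parameters are hypothetical names chosen for this example; in a real agent the callables would wrap InvokeBrowser calls and a model endpoint.

```python
from typing import Callable, Optional

# An action is modelled as a plain dict, for example
# {"type": "mouseClick", "arguments": {"x": 10, "y": 20}} (layout is an assumption).
Action = dict


def run_vision_loop(
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[bytes], Optional[Action]],
    execute: Callable[[Action], str],
    max_steps: int = 20,
) -> list:
    """Run screenshot -> model -> action until the model stops or an action fails."""
    history = []
    for _ in range(max_steps):
        image = take_screenshot()          # observe the full desktop
        action = choose_action(image)      # the model infers the next action
        if action is None:                 # the model signals that the task is done
            break
        status = execute(action)           # "SUCCESS" or "FAILED"
        history.append((action, status))
        if status != "SUCCESS":            # stop so the caller can handle the failure
            break
    return history
```

The max_steps bound is a design choice for the sketch, not something the article prescribes; it keeps a confused model from looping indefinitely.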

Supported actions fall into three categories: mouse control, keyboard input and visual capture. The source blog post lists eight actions in total. Example mouse actions include mouseClick (fields x, y, button and clickCount; the defaults are the current pointer position, the LEFT button and a single click, and clickCount accepts values from 1 to 10), mouseMove (x, y) and mouseDrag (endX and endY, with optional startX and startY; button defaults to LEFT). Visual capture returns full‑desktop screenshots so agents can see UI outside the browser window.
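The mouseClick defaults and the clickCount range can be checked on the client before a call is made. The sketch below is one way to do that; the MouseClick class and its to_arguments method are hypothetical helpers, and the rule that x and y must be supplied together is an assumption of this example rather than something the article states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MouseClick:
    """Client-side model of the mouseClick arguments described in the article."""

    x: Optional[int] = None      # None means "use the current pointer position"
    y: Optional[int] = None
    button: str = "LEFT"         # documented default
    click_count: int = 1         # documented default: single click

    def __post_init__(self) -> None:
        # Assumption of this sketch: a lone x or y is treated as a mistake.
        if (self.x is None) != (self.y is None):
            raise ValueError("x and y must be given together or both omitted")
        # Documented range for clickCount is 1 to 10.
        if not 1 <= self.click_count <= 10:
            raise ValueError("clickCount must be between 1 and 10")

    def to_arguments(self) -> dict:
        """Return the argument dict, leaving out x and y when they are unset."""
        args = {"button": self.button, "clickCount": self.click_count}
        if self.x is not None:
            args["x"] = self.x
            args["y"] = self.y
        return args
```

For example, MouseClick(x=5, y=7, click_count=2).to_arguments() describes a double click at (5, 7) with the left button, while MouseClick(click_count=11) raises a ValueError before any request is sent.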

OS Level Actions are available for both new and existing AgentCore Browser configurations without additional setup. The recommended interaction pattern is action → screenshot → model inference → next action, which lets agents react to dynamic UI that may appear mid‑workflow. Because each action executes on the full desktop, it can reach native dialogs and other OS elements and returns immediate success or failure feedback for programmatic handling.

By combining full‑desktop screenshots with mouse and keyboard events tied to a browser session, AgentCore Browser’s OS Level Actions close the gap between visual recognition and reliable actuation, giving vision‑enabled agents a practical path to complete tasks that span browser and native system interfaces.

Sources

AWS Machine Learning Blog · 5/5/2026
