
On May 20, 2026, a product team published a living reference that lists every major AI provider and the specific model variants that can be automated inside Zaps. The same update introduced AutomationBench, a public benchmark used to decide which models to deploy across automated workflows. The update matters because it gives builders a single, maintained source for picking models that will actually finish real automation tasks, rather than relying on single-prompt quality metrics.
AutomationBench was created to measure how well models complete messy, multi-step business tasks, not how good a single-shot response is. Its tests simulate realistic workflow patterns by introducing irrelevant data and ambiguity, hiding key details behind tool calls, and enforcing strict business policies. The authors say no personally identifiable information was used in building the benchmark, and they published both the benchmark and its methodology.
Scoring emphasizes the final state of a workflow: whether the task was fully completed and whether any unintended side effects occurred. Evaluators do not require particular tools or call sequences; they judge outcomes. That approach favors models that reliably finish complex automations, even when they cost more, over cheaper models that produce good single responses but fail to reach the required end state.
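The outcome-only scoring described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's published grader: the `Outcome` type, its field names, and the rule that any observed side effect fails a run are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Outcome:
    """End-state check for one workflow run (hypothetical structure)."""
    # Hypothetical goal name -> whether the final state satisfies it.
    goals_met: dict[str, bool]
    # Unintended changes observed in the final state (assumed to fail the run).
    side_effects: list[str] = field(default_factory=list)


def task_passed(outcome: Outcome) -> bool:
    # Only the end state is inspected; which tools were called, and in
    # what order, plays no part in the decision.
    return all(outcome.goals_met.values()) and not outcome.side_effects


def completion_rate(outcomes: list[Outcome]) -> float:
    # Percentage of runs that were fully completed.
    if not outcomes:
        return 0.0
    return 100.0 * sum(task_passed(o) for o in outcomes) / len(outcomes)
```

Under these assumptions there is no partial credit: a run that meets most goals but misses one, or that meets all goals while changing something it should not have, counts the same as a run that did nothing.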
To illustrate complexity, the article includes a representative AutomationBench task: resolve a scheduling conflict on February 20, 2026 at 2:00 PM between a Zoom meeting and a Google Calendar event, consult a spreadsheet policy to choose the winner, reschedule the losing meeting by prepending [RESCHEDULED] to its title, and post a summary to #ops-updates on Slack that includes both meeting IDs.
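To make the steps of that task concrete, here is a small in-memory sketch of the decision logic. It calls no real Zoom, Google Calendar, or Slack API. The `Meeting` type, the `category` field, the meeting IDs, and the "lower rank wins" policy format are hypothetical stand-ins, since the article does not describe what the spreadsheet policy contains.

```python
from dataclasses import dataclass

TAG = "[RESCHEDULED]"


@dataclass
class Meeting:
    meeting_id: str
    source: str    # e.g. "zoom" or "google_calendar"
    title: str
    category: str  # hypothetical key looked up in the policy sheet


def resolve_conflict(
    a: Meeting, b: Meeting, policy: dict[str, int]
) -> tuple[Meeting, Meeting, str]:
    """Pick a winner from the policy, tag the loser, and build the summary."""
    # Assumed policy format: category -> rank, where a lower rank wins.
    rank_a, rank_b = policy[a.category], policy[b.category]
    if rank_a == rank_b:
        raise ValueError("policy does not pick a winner")
    winner, loser = (a, b) if rank_a < rank_b else (b, a)

    # Prepend the tag once, so rerunning the step does not stack tags.
    if not loser.title.startswith(TAG):
        loser.title = f"{TAG} {loser.title}"

    # The summary must mention both meeting IDs; posting it to
    # #ops-updates is left to the caller.
    summary = (
        f"Conflict on 2026-02-20 at 2:00 PM: kept {winner.meeting_id}, "
        f"rescheduled {loser.meeting_id}"
    )
    return winner, loser, summary
```

For example, with a hypothetical policy of `{"external": 1, "internal": 2}`, an external Zoom call would win over an internal calendar event, and the calendar event's title would gain the `[RESCHEDULED]` prefix. The real task is harder because the model has to discover the conflict, the policy, and the IDs through tool calls before it can apply logic like this.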
The platform’s AutomationBench leaderboard ranks the top five model variants by the share of workflow tasks they fully completed: Gemini 3.5 Flash (Medium) at 14.5%, GPT-5.5 (XHigh) at 12.9%, Gemini 3.5 Flash (High) at 12.6%, Gemini 3.5 Flash (Low) at 12.2%, and GPT-5.5 (High) at 11.3%. Those percentages reflect complete task success across the benchmark’s set of workflow challenges.
On model selection, the write-up describes OpenAI's lineup as the broadest available on the platform, spanning budget-friendly mini models, advanced reasoning engines, and specialized tools for transcription and image generation. It notes that GPT-5.5 (XHigh) tops AutomationBench for Sales and Marketing workflows, while GPT-5.5 (High) places highly in Operations. The platform also supports direct integrations with hundreds of other AI apps and with partners such as Google, Salesforce, and Microsoft, and provides a built-in AI orchestration tool to coordinate models across automated workflows.