A Datadog engineering team combined Jobs Monitoring telemetry with an agentic AI built on Claude to locate execution-plan bottlenecks in a daily Spark job, cutting compute costs by 44% and run time by 60% in the company’s largest data center.
Datadog’s Referential Data Platform team used an agentic AI built on Claude together with Jobs Monitoring telemetry to diagnose and optimize ServiceQueryEdge, a daily Spark job responsible for mapping service entities to their metric and log queries. The integration let the agent surface slow operators in the Spark execution graph and point engineers directly to problematic plan nodes, reducing the manual work of correlating telemetry and source code. The change accelerated root-cause resolution and lowered operating expense for a high‑scale pipeline.
ServiceQueryEdge runs daily across seven data centers and handles extreme data volumes: a single partition can process up to 27 TB of input and 16 billion records. At that scale, the job had previously averaged about $1,500 in infrastructure costs per day and took more than 17 hours per run. To enable actionable analysis, the team fed stage metrics, the Spark SQL execution plan, telemetry, and relevant source code into a custom prompt structure so the agent could link slow operators to the application code and surface precise evidence for fixes.
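To make the idea of a structured prompt concrete, here is a minimal Python sketch of one way such inputs could be assembled into labeled sections. It is an illustration only, not the team's actual prompt structure: the `Artifact` class, the `build_prompt` function, the section labels, and every sample value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    """One labeled input for the agent (hypothetical structure)."""
    label: str
    body: str


def build_prompt(task: str, artifacts: list[Artifact]) -> str:
    """Join the task and each artifact into clearly delimited sections."""
    sections = [f"## Task\n{task.strip()}"]
    for artifact in artifacts:
        sections.append(f"## {artifact.label}\n{artifact.body.strip()}")
    return "\n\n".join(sections)


# All values below are invented placeholders, not real job data.
prompt = build_prompt(
    "Identify the slowest operators and link them to the source code.",
    [
        Artifact("Stage metrics", "stage=12 duration_s=5400 shuffle_read_gb=900"),
        Artifact("Spark SQL execution plan", "SortMergeJoin -> Exchange -> Scan"),
        Artifact("Source code", "edges = services.join(queries, 'service_id')"),
    ],
)
print(prompt)
```

Keeping each artifact under its own label lets the agent cite the section that supports a given recommendation, which is the kind of linkage between operators and code described above.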
The project relied on Jobs Monitoring’s interactive Spark SQL Plan to provide a visual, end-to-end representation of the execution plan, because reasoning over code alone, without telemetry, often yields guesses. Early runs exposed a practical constraint: the agent exhausted its context window during Model Context Protocol (MCP) calls through the Datadog MCP Server while fetching traces via get_datadog_trace, apm_search_spans, and apm_explore_trace, and multiple attempts produced incomplete suggestions.
Through iterative tuning and targeted changes, the team addressed context limits and improved data acquisition. They introduced subagents to handle specific trace- and metric-fetching tasks, so the primary analysis agent retained the context required for coherent reasoning. The engineers found that recommendation quality depended less on sheer data volume than on narrowly scoped inputs: constraining what the agent ingested made its guidance more coherent and actionable than the earlier, less useful suggestions.
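The delegation pattern can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the team's implementation: `trace_subagent`, `fake_fetch`, and the span fields are hypothetical, and the fetch function stands in for a real trace query rather than calling any Datadog tool. The point is that the raw spans stay inside the subagent and only a short summary returns to the primary agent.

```python
from typing import Callable

# Hypothetical span record: {"operator": str, "duration_ms": int}
Span = dict


def trace_subagent(fetch: Callable[[], list[Span]], top_n: int = 3) -> str:
    """Fetch raw spans in isolation and return only a short summary."""
    spans = fetch()
    slowest = sorted(spans, key=lambda s: s["duration_ms"], reverse=True)[:top_n]
    lines = [f"{s['operator']}: {s['duration_ms']} ms" for s in slowest]
    return "\n".join(lines)


def fake_fetch() -> list[Span]:
    """Stand-in for a real trace query; the values are invented."""
    return [
        {"operator": "Scan", "duration_ms": 1200},
        {"operator": "SortMergeJoin", "duration_ms": 98000},
        {"operator": "Exchange", "duration_ms": 41000},
        {"operator": "Project", "duration_ms": 300},
    ]


# The primary agent sees only this summary, never the raw span list.
summary = trace_subagent(fake_fetch, top_n=2)
print(summary)
```

In this sketch the `top_n` cutoff plays the role of narrow scoping: the primary agent receives a fixed, small amount of text regardless of how many spans the fetch returns.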
The combined workflow produced measurable gains: in US1, the company’s largest data center, daily Spark compute costs fell by 44% and run duration shortened by 60%. Beyond cost and time savings, the approach cut several hours of investigation per issue by delivering contextualized evidence at relevant execution‑plan nodes, enabling faster fixes. The team documents what worked and what didn’t, including the specific code and plan changes that produced the savings.
For builders, the case yields two practical takeaways: include observability-level artifacts (Spark SQL Plan, stage metrics, traces) alongside source code in an agent’s prompt, and scope telemetry acquisition through subagents or dedicated tooling to prevent context exhaustion. Those steps helped this team turn agentic reasoning from noisy guesswork into targeted diagnostic support.