Justin Reock Calls for Rigorous Metrics to Evaluate AI's Impact on Engineering

News

5/8/2026, 1:18:29 PM

Justin Reock, Deputy CTO of DX, told QCon AI that organizations must move beyond anecdotes and measure generative AI’s real effects on engineering using frameworks like DORA, SPACE and DevEx.

Justin Reock, Deputy CTO of DX, used a 51-minute QCon AI presentation to argue that engineering leaders must adopt rigorous measurement to know whether generative AI actually improves software delivery. He warned that while many teams report subjective productivity gains, objective outcomes vary widely, so leaders who scale tools without hard metrics risk misallocating effort and harming delivery.

Reock anchored his recommendations in DX's product work and in academic research. DX has built an engineering intelligence platform around DORA, the SPACE model and the DevEx framework, and the company drew on studies from Microsoft, the Google Productivity Lab and the University of Victoria. He described a Change Confidence Developer Experience Index that DX uses to quantify developer experience and estimate ROI.

To show why measurement matters, Reock contrasted published empirical results that paint a mixed picture. He cited Google's finding of a roughly 10% productivity uplift where AI is embedded, then noted the METR experiment, a small controlled study that observed a 19% productivity drop, partly attributed to developers' unfamiliarity with the Cursor tool. He also referenced a recent DORA community report that found modest average gains associated with AI adoption.

Reock highlighted specific DORA-linked correlations to illustrate uneven benefits: a 25% increase in AI adoption was associated with a 7.5% rise in documentation quality, a 3.4% improvement in code quality, 3.1% faster code reviews and 1.3% faster approvals. He argued that these figures show measurable gains exist, but that they are modest and not uniform across delivery metrics.

He named the gap between pilot expectations and production outcomes the “GenAI Divide,” and asserted that roughly 95% of pilots fail to deliver at scale. To bridge that divide, Reock advised treating pilots as controlled experiments: define baselines using SPACE and Core 4 metrics, instrument systems to capture objective signals, and track both qualitative and quantitative indicators before deciding to scale.
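Treating a pilot as a controlled experiment means comparing how a metric moved for AI-pilot teams against how it moved for comparable teams without the tool over the same window. The sketch below illustrates one simple way to do that, a difference-in-differences style comparison on relative changes. It is an assumption for illustration, not DX's method or anything Reock presented: the names `TeamSample`, `pct_change` and `diff_in_diff` are hypothetical, the metric (PR cycle time in hours, where lower is better) is an assumed example, and the sample numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TeamSample:
    """Hypothetical per-team measurement of one metric, e.g. PR cycle time in hours."""
    team: str
    uses_ai: bool    # True if the team is in the AI pilot cohort
    baseline: float  # metric value measured before the pilot started
    pilot: float     # metric value measured during the pilot window


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the metric went down."""
    return (after - before) / before * 100.0


def diff_in_diff(samples: list[TeamSample]) -> float:
    """Average relative change in the pilot cohort minus the average relative
    change in the control cohort, in percentage points. Subtracting the control
    change removes shifts that affected every team (seasonality, reorgs, etc.)."""
    pilot = [pct_change(s.baseline, s.pilot) for s in samples if s.uses_ai]
    control = [pct_change(s.baseline, s.pilot) for s in samples if not s.uses_ai]
    if not pilot or not control:
        raise ValueError("need at least one pilot team and one control team")
    return mean(pilot) - mean(control)


if __name__ == "__main__":
    # Made-up cycle times in hours: pilot teams improved 20% and 10%,
    # control teams improved 4% and 0% over the same period.
    teams = [
        TeamSample("a", True, 20.0, 16.0),
        TeamSample("b", True, 30.0, 27.0),
        TeamSample("c", False, 25.0, 24.0),
        TeamSample("d", False, 40.0, 40.0),
    ]
    print(round(diff_in_diff(teams), 2))  # prints -13.0
```

Here the pilot cohort's cycle time fell about 13 percentage points more than the control cohort's. With only a handful of teams, a result like this is a signal to investigate rather than proof, which is why the article pairs objective metrics with qualitative indicators before any decision to scale.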

Reock closed with operational guidance for engineering leadership: measure to align perception with reality, balance velocity against quality, address developer fear with evidence and tooling, and explore agentic AI across the full SDLC only after rigorous measurement. He framed the talk as a playbook for senior executives and engineering leaders.

Sources

InfoQ AI/ML · 5/8/2026
