How to measure ROI in coding agents
Most teams measuring coding agent ROI are measuring the wrong thing. They count tokens and collect developer feedback. Those signals are useful, but they are not ROI.
What is ROI for coding agents?
ROI measures whether an organization ships more valuable software, at equal or better quality, for less total cost, regardless of how many developers use agents or how often. Coding agents can create the illusion of productivity until delayed costs show up in review, debugging, rework, and model spend.
Effectively measuring ROI means treating coding agents as production infrastructure, with incremental value tracked through cycle time, deployment frequency, and defect rate, and incremental cost tracked through model spend, review time, and rework.
Why is AI productivity hard to measure?
Software engineering has always been hard to measure, and AI agents make it harder.
The most important finding in this field is that AI's gains shrink as you move from writing code to shipping. A 2026 NBER study of more than 100,000 GitHub developers found that AI coding tools raised commit volume by roughly 40% for autocomplete, 140% for sync agents, and 180% for async agents, but the gains attenuated across the production hierarchy: the 180% commit effect fell to about 50% for projects and 30% for actual releases. The extra code piles up at the human bottlenecks of review, integration, and release.
METR's 2025 randomized controlled trial found experienced open-source developers were 19% slower on repositories they knew well, while believing AI had made them faster.
A single output metric like PR throughput is worse than no metric, because it will make you confidently wrong. In fact, it’s not too different from tracking KLOCs. Measuring ROI requires understanding the whole value delivery chain.
What should you measure?
To begin measuring ROI effectively, we recommend DX's AI Measurement Framework: utilization, impact, and cost. The impact layer is where DORA metrics live, so a team already tracking deployment frequency, lead time, and change failure rate can reuse that work directly.
- Utilization: Is it being used? Track weekly and daily active users, share of PRs that are AI-assisted, and suggestion acceptance rate. This data usually lives in IDE telemetry, e.g. the GitHub Copilot usage dashboard and REST API or your tool's export.
- Impact: Is it working? Track time saved per developer per week from a survey, PR throughput with agent PRs counted under their operator, cycle time from first commit to merge, lead time from merge to deploy, change failure rate, and escaped defects at 30 and 90 days. This data usually lives in git history, CI/CD and deploy logs, incident trackers, and a short recurring developer survey.
- Cost: Is the return worth it? Track AI spend per developer (licenses plus inference, divided by headcount), review and rework hours, and cost per shipped change. This data usually lives in model provider billing exports plus the time data above.
When tracking, pair every speed metric with a quality metric, because the two often conflict. A cycle time gain that lifts change failure rate is deferred cost. Also settle how agents are counted before the data arrives: an agent is an extension of the engineer who supervises it, not a separate teammate, so its PRs are attributed to that engineer and rolled up into team-level metrics. This attribution exists for accounting purposes.
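The attribution rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed schema: the record fields (`author`, `operator`) and the `agent-bot` account name are hypothetical, and your PR export will likely name them differently.

```python
from collections import Counter

# Hypothetical PR records. "operator" is the engineer supervising the agent
# that opened the PR, or None when a human opened it directly.
prs = [
    {"id": 1, "author": "alice", "operator": None},
    {"id": 2, "author": "agent-bot", "operator": "alice"},
    {"id": 3, "author": "agent-bot", "operator": "bob"},
    {"id": 4, "author": "bob", "operator": None},
    {"id": 5, "author": "agent-bot", "operator": "alice"},
]

def attributed_engineer(pr):
    """Return the engineer a PR counts toward: the operator if one is set."""
    return pr["operator"] or pr["author"]

def prs_per_engineer(prs):
    """Count PRs per engineer, with agent PRs rolled up under their operator."""
    return Counter(attributed_engineer(pr) for pr in prs)

counts = prs_per_engineer(prs)
print(dict(counts))  # {'alice': 3, 'bob': 2}
```

With this rule no PR is ever counted under the agent account, so team totals stay equal to the sum of per-engineer totals.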
Quick and dirty metrics for measuring coding agent ROI
It's easy to get overwhelmed by the number of developer productivity metrics out there. If you're not sure where to start with measuring ROI, the three metrics we have found most useful to track are:
- Merged PRs per developer per week, weighted by LOC
- Cycle time (time from PR opened to merged)
- Revert rate (PRs reverted within 7 days)
Together, these metrics show a) the output volume of AI-generated code and b) the quality of that code after human review and production deployment.
These metrics are not perfect, since many engineers do not produce PRs, but they are a good place to start. Once you are tracking these baselines, you can layer on other metrics.
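As a rough sketch, the three starter metrics can be computed from a list of merged-PR records. The record fields and sample data below are hypothetical, and the LOC weighting is left out because the weighting scheme is not specified here; cycle time uses the PR-opened-to-merged definition from the list above.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical merged-PR records covering a two-week window.
prs = [
    {"dev": "alice", "opened": datetime(2026, 1, 5, 9), "merged": datetime(2026, 1, 5, 17),
     "reverted": None},
    {"dev": "alice", "opened": datetime(2026, 1, 6, 9), "merged": datetime(2026, 1, 7, 9),
     "reverted": datetime(2026, 1, 9, 9)},
    {"dev": "bob", "opened": datetime(2026, 1, 12, 9), "merged": datetime(2026, 1, 14, 9),
     "reverted": None},
    {"dev": "bob", "opened": datetime(2026, 1, 13, 9), "merged": datetime(2026, 1, 13, 13),
     "reverted": datetime(2026, 2, 1, 9)},
]
WEEKS = 2

def merged_prs_per_dev_week(prs, weeks):
    """Average merged PRs per developer per week (unweighted)."""
    devs = {pr["dev"] for pr in prs}
    return len(prs) / (len(devs) * weeks)

def median_cycle_time_hours(prs):
    """Median time from PR opened to merged, in hours."""
    return median((pr["merged"] - pr["opened"]).total_seconds() / 3600 for pr in prs)

def revert_rate(prs, window_days=7):
    """Share of merged PRs reverted within window_days of merging."""
    window = timedelta(days=window_days)
    reverted = sum(
        1 for pr in prs
        if pr["reverted"] is not None and pr["reverted"] - pr["merged"] <= window
    )
    return reverted / len(prs)

print(merged_prs_per_dev_week(prs, WEEKS))  # 1.0
print(median_cycle_time_hours(prs))         # 16.0
print(revert_rate(prs))                     # 0.25
```

The fourth PR was reverted 19 days after merge, so it falls outside the 7-day window and does not count toward the revert rate.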
Where do the numbers come from?
Here’s how to source the data you need to measure ROI across utilization, impact, and cost:
- Vendor usage APIs. Every major assistant ships usage telemetry. For example, the GitHub Copilot metrics API returns suggestions shown, suggestions accepted, acceptance rate, and active users, broken down by team and language. Cursor, Claude Code, and the rest expose equivalents.
- IDE instrumentation. Acceptance rate does not reveal code share. To answer how much of what ships was written by AI, companies capture insertion provenance inside the editor. For example, a VS Code extension or the platform's IDE agent can tag each inserted block as AI-generated, typed, or pasted. This is the method behind Google's public figure that 75% of its new code is now AI-generated and approved by engineers, up from 50% in late 2025.
- Assessing quality with Git and delivery-system analysis. Cycle time, throughput, change failure rate, and rework all come from systems you already run: git history, CI/CD logs, and the incident tracker. Engineering intelligence platforms such as DX, Jellyfish, Faros AI, LinearB, and Swarmia join this delivery data to the AI-usage metadata from the first two layers, so you can compare AI-assisted and non-AI work within the same teams.
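The join between delivery data and AI-usage metadata can be illustrated with a small Python sketch. It assumes you already have a set of PR ids flagged as AI-assisted (from a vendor API or IDE instrumentation) and a per-change incident flag; both structures and all field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical delivery records assembled from git, CI/CD, and the incident tracker.
changes = [
    {"pr": 101, "team": "payments", "caused_incident": False},
    {"pr": 102, "team": "payments", "caused_incident": True},
    {"pr": 103, "team": "payments", "caused_incident": False},
    {"pr": 104, "team": "payments", "caused_incident": False},
]

# Hypothetical AI-usage metadata: ids of PRs flagged as AI-assisted.
ai_assisted_prs = {101, 102}

def change_failure_rate_by_cohort(changes, ai_assisted_prs):
    """Change failure rate per (team, cohort), comparing cohorts within each team."""
    totals = defaultdict(lambda: [0, 0])  # (team, cohort) -> [failures, changes]
    for change in changes:
        cohort = "ai" if change["pr"] in ai_assisted_prs else "non_ai"
        bucket = totals[(change["team"], cohort)]
        bucket[0] += int(change["caused_incident"])
        bucket[1] += 1
    return {key: failures / count for key, (failures, count) in totals.items()}

rates = change_failure_rate_by_cohort(changes, ai_assisted_prs)
print(rates)  # {('payments', 'ai'): 0.5, ('payments', 'non_ai'): 0.0}
```

Keying on the team keeps the comparison inside one team, which avoids attributing cross-team differences to AI use.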
What this looks like in practice
Booking.com was already running DX for baseline productivity metrics, then fed its assistant's usage metadata into the same platform. Daily active AI users had a 16% higher change-merge throughput than non-users, and the analysis exposed enablement gaps that, once fixed, drove a 65% increase in adoption. Those numbers predate the agent wave, so treat the design rather than the figures as the takeaway: AI users compared against non-users inside one company on one platform. For a current reference point, Google reported in April 2026 that a complex code migration run by agents and engineers together finished six times faster than the same work a year earlier.
Jellyfish's analysis of 20M pull requests from 200,000 developers across 1,000 companies found adoption climbing from 22% to around 90%, cycle time down around 24%, PR size up around 18%, and up to 2x PR throughput at full adoption.
Measuring quality and sentiment
For sentiment analysis, run a short pulse survey built on the DevEx framework rather than a tool-satisfaction poll. DX's survey template operationalizes it as 5-point Likert items; select the handful most sensitive to coding agent quality, such as perceived productivity, effort, satisfaction, engagement, and focus time, then administer the same instrument before rollout and again after. Two details make the result usable. Ask about the developer's experience of the work rather than opinions about the tool, which removes acquiescence and sponsor bias. And omit items that cannot shift in the measurement window, such as deploy frequency or meeting culture, since they add noise without signal. Set the regression threshold before the survey goes out, for example no decline of more than half a point per question between baseline and follow-up.
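A pre-set regression threshold is straightforward to check mechanically. The sketch below assumes mean scores per item on a 5-point scale where higher is better; the item names and scores are made up for illustration.

```python
REGRESSION_THRESHOLD = 0.5  # points on the 5-point scale, fixed before the survey runs

# Hypothetical mean scores per survey item (higher is assumed to be better).
baseline = {"perceived_productivity": 3.8, "effort": 3.5, "satisfaction": 4.0, "focus_time": 3.2}
follow_up = {"perceived_productivity": 4.1, "effort": 3.4, "satisfaction": 3.4, "focus_time": 3.3}

def regressions(baseline, follow_up, threshold=REGRESSION_THRESHOLD):
    """Return items whose mean score dropped by more than the threshold."""
    flagged = {}
    for item, before in baseline.items():
        delta = follow_up[item] - before
        if delta < -threshold:
            flagged[item] = round(delta, 2)
    return flagged

print(regressions(baseline, follow_up))  # {'satisfaction': -0.6}
```

Small dips such as the 0.1 drop on `effort` stay below the threshold and are not flagged.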
Build vs. buy. If you want to buy, DX, Jellyfish, Faros AI, LinearB, Swarmia, Sleuth, and GitClear all ship AI-impact dashboards out of the box. To build it yourself, the minimum viable stack is your assistant's metrics API piped into a warehouse, joined to git and CI data, with a quarterly developer survey on top.
How do you calculate the return?
"Value delivered" sounds abstract, so reduce it to net time gain per developer, a number a finance leader can act on.
Net time gain = hours recovered − (fully-loaded AI cost ÷ loaded hourly rate)
Both terms are in hours: the cost side converts AI spend into hours by dividing it by the loaded hourly rate, so the result reads as engineering capacity gained or lost per developer.
Define "hours recovered" by output. The gain arrives in two shapes. A developer can do the same work in less time, which surveys capture as freed hours. Or a developer can work the same hours and ship more: someone who saves no hours but now produces more would report zero time saved while delivering a larger return. The definition that covers both shapes is:
Hours recovered = (hours the shipped output would have taken at the pre-AI baseline) − (hours actually worked)
Here’s a worked example, per developer, per month. Take a team whose pre-AI baseline was 8 merged changes per developer per month on roughly 160 working hours, about 20 hours per change. After adopting agents the same developer ships 10 changes in the same 160 hours. That output would have taken 200 baseline hours, so hours recovered is 40 per month, worth $4,000 at a fully loaded rate of $100 per hour.
Surveys give a cheaper proxy for the same number: DX's data shows developers report saving around 3.9 hours per week, roughly 16 hours per month, but self-reports miss reinvested time and overstate per-task savings, so treat the survey as a directional floor and the throughput calculation as the primary figure.
On the cost side, the tool license runs $30 to $40 per seat and inference is the swing factor for long-horizon agents, anywhere from a few dollars to several hundred per developer per month depending on how many frontier-model calls each task makes. At $400 of monthly AI spend the cost term is 4 hours, and net time gain is 36 hours per developer per month. If that number is positive and growing over time, the tool is paying for itself. If inference creeps toward the value of the hours recovered, you are paying for activity rather than outcomes.
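The two formulas and the worked example translate directly into code. This sketch uses the example's own numbers; the function and parameter names are illustrative.

```python
def hours_recovered(baseline_changes, baseline_hours, shipped_changes, hours_worked):
    """Hours the shipped output would have taken at the pre-AI baseline, minus hours worked."""
    hours_per_change = baseline_hours / baseline_changes
    return shipped_changes * hours_per_change - hours_worked

def net_time_gain(recovered_hours, ai_cost, loaded_hourly_rate):
    """Hours recovered minus the AI cost expressed in hours at the loaded rate."""
    return recovered_hours - ai_cost / loaded_hourly_rate

# Worked example: 8 changes in 160 hours before, 10 changes in 160 hours after,
# $400 of monthly AI spend, $100 per hour fully loaded.
recovered = hours_recovered(baseline_changes=8, baseline_hours=160,
                            shipped_changes=10, hours_worked=160)
net = net_time_gain(recovered, ai_cost=400, loaded_hourly_rate=100)

print(recovered)  # 40.0
print(net)        # 36.0
```

Because both terms are in hours, the same function handles the other shape of gain: a developer who ships the baseline 8 changes in 140 hours recovers 20 hours.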
For unit economics, divide fully loaded AI cost over a period by the number of merged, accepted changes in that period to get cost per shipped change, and watch it next to throughput. This metric has no time-accounting subtlety: its denominator is output, so a developer who ships three times as many changes at the same spend cuts it to a third. Rising cost per shipped change with flat throughput is an early warning that spend is outrunning value.
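A minimal sketch of the unit-economics check follows. The monthly figures and the 5% tolerance for "flat" throughput are assumptions chosen for illustration, not recommended values.

```python
def cost_per_shipped_change(ai_cost, merged_changes):
    """Fully loaded AI cost divided by merged, accepted changes in the same period."""
    if merged_changes <= 0:
        raise ValueError("need at least one merged change in the period")
    return ai_cost / merged_changes

def spend_outrunning_value(previous, current, flat_tolerance=0.05):
    """Flag rising cost per shipped change while throughput stays roughly flat."""
    prev_cost = cost_per_shipped_change(previous["ai_cost"], previous["merged"])
    curr_cost = cost_per_shipped_change(current["ai_cost"], current["merged"])
    throughput_change = (current["merged"] - previous["merged"]) / previous["merged"]
    return curr_cost > prev_cost and abs(throughput_change) <= flat_tolerance

# Hypothetical team-level figures for two consecutive months.
january = {"ai_cost": 4000, "merged": 100}
february = {"ai_cost": 5200, "merged": 102}

print(cost_per_shipped_change(january["ai_cost"], january["merged"]))  # 40.0
print(spend_outrunning_value(january, february))                       # True
```

In the sample data spend rises 30% while merged changes rise 2%, so the warning fires.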
Five rules for measuring ROI
- Baseline against the same engineers, not other teams. Comparing each engineer's post-AI output to their own pre-AI trailing baseline removes tenure, seasonality, and team-composition confounders.
- Wait out the learning curve. Productivity gains typically take at least 2 months to appear as developers ramp up. Week-one numbers can be noisy because they capture novelty effects.
- Segment by task type. Agents are excellent at test generation, refactors, dependency upgrades, and boilerplate. They are less effective on ambiguous product work and on code the developer already knows deeply.
- Track rework explicitly. Count follow-up fixes filed within 30 days against AI-assisted changes, including convention cleanup.
- Read the layers together. Speed up with quality and cost flat is a win. Speed up with defects up implies deferred cost. Flat throughput with cost up means you are buying activity.
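The last rule can be written as a small decision function. This is one possible encoding: inputs are fractional changes against each engineer's or team's own baseline, and the 5% tolerance that separates "flat" from "up" is an assumption you would tune.

```python
def read_layers(speed_change, defect_change, cost_change, tolerance=0.05):
    """Classify a period from fractional changes versus baseline.

    speed_change: change in throughput or delivery speed (positive means faster)
    defect_change: change in defect or change failure rate (positive means worse)
    cost_change: change in AI cost (positive means more expensive)
    """
    faster = speed_change > tolerance
    speed_flat = abs(speed_change) <= tolerance
    defects_up = defect_change > tolerance
    cost_up = cost_change > tolerance

    if faster and defects_up:
        return "deferred cost"
    if faster and not cost_up:
        return "win"
    if speed_flat and cost_up:
        return "buying activity"
    return "inconclusive"

print(read_layers(0.20, 0.00, 0.00))  # win
print(read_layers(0.20, 0.15, 0.00))  # deferred cost
print(read_layers(0.00, 0.00, 0.30))  # buying activity
```

Combinations the rules do not cover, such as speed up with cost up, come back as "inconclusive" rather than being forced into one of the three readings.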
Intelligent model routing maximizes ROI
Once you can see cost per shipped change, the place to act is the agent loop. For long-horizon agents the expense is the loop rather than the final answer: plan, search files, call tools, edit, run tests, retry, compact context, often dozens of model calls per task. Running a frontier model on every step is where inference compounds out of proportion to quality.
The solution is to route each step to the cheapest model that clears the quality bar: frontier reasoning models for planning and debugging, cheaper models for summarization, simple edits, and repetitive transforms. An AI model router makes that decision per request. The distinction from a gateway matters here: gateways handle access, policy, and observability, while routers decide where traffic should go. For a full breakdown of where agent spend comes from and how to reduce it, see our guide to reducing Claude Code costs without sacrificing output quality; for teams approaching renewal, the same routing data becomes the leverage point in an enterprise renegotiation. Routing is how you move cost per shipped change down once you are measuring it.
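To make the cost effect concrete, here is a deliberately simplified sketch that routes by step type with a static lookup table. A real router would decide per request against a measured quality bar; the tier names, step names, and per-call prices below are all hypothetical.

```python
# Hypothetical per-call prices in dollars, for illustration only.
PRICE_PER_CALL = {"frontier": 0.30, "small": 0.02}

# Hypothetical routing table: step type -> model tier.
ROUTES = {
    "plan": "frontier",
    "debug": "frontier",
    "summarize": "small",
    "simple_edit": "small",
    "transform": "small",
}

def route(step_type):
    """Pick a tier for a step; unknown step types fall back to the frontier tier."""
    return ROUTES.get(step_type, "frontier")

def task_cost(steps, router=route):
    """Total inference cost of one task given its sequence of step types."""
    return sum(PRICE_PER_CALL[router(step)] for step in steps)

# One hypothetical agent task made of seven model calls.
steps = ["plan", "summarize", "simple_edit", "simple_edit", "debug", "transform", "summarize"]

all_frontier = task_cost(steps, router=lambda step: "frontier")
routed = task_cost(steps)
print(f"all frontier: ${all_frontier:.2f}, routed: ${routed:.2f}")
```

Falling back to the frontier tier for unknown steps trades some cost for safety: an unrecognized step is never silently sent to a model that may not clear the quality bar.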