How to reduce Claude Code costs without sacrificing output quality
Claude Code is one of the most important tools for modern engineering organizations, and one of the most expensive. With developers spending thousands on long-running agent sessions and enterprises burning through annual budgets in less than a quarter, understanding how to reduce Claude Code costs is critical for teams that are scaling their use of AI.
This guide is for engineering leaders, platform teams, and power users who want to cut the fat on their Claude Code spend without sacrificing output quality. We use Claude Code at Not Diamond, and we build tools for reducing costs ourselves, so we know (nearly) every lever there is to pull in the harness. In this article we cover why coding agent bills have grown so rapidly, every method people are using to reduce costs today, the specific tools and repos behind those levers, and how to sequence the work so cost governance becomes durable infrastructure.
Why Claude Code is so expensive
Across enterprise deployments, Anthropic now reports an average of $150–$250 per developer per month, with the 90th percentile spending 2–3x that amount and some power users spending tens of thousands per month. Much of this is driven by genuinely heavy usage, but a tremendous amount of spend is simply wasteful and unnecessary.
The key drivers behind spend:
- Quadratic context accumulation. Every turn re-sends the full conversation, so cumulative cost grows quadratically with turn count: a session that adds ~2K tokens per turn re-sends ~2K × n tokens on turn n, which works out to roughly 2.5M cumulative input tokens over 50 turns.
- Cache mismanagement. The KV cache is one of the most powerful tools for managing costs in long-running sessions, yet it is very frequently broken or left cold.
- Overpowered models. Most developers default to the most powerful model available, even when a significant portion of coding tasks can be handled by much cheaper models.
- Poorly timed compaction. When Claude Code hits context limits, it auto-compacts by summarizing the entire history into a new prompt, an operation that itself bills the full history as input and the summary at output rates.
- Subagent parallelization. Cost grows roughly linearly with fan-out agents under supervision and superlinearly when subagents spawn further subagents or run unattended.
Below we walk through each of these cost drivers and the tools and practices that bring them down.
1. Measure before you optimize
Getting a detailed baseline of your spend is critical before you start trying to reduce it. There are many tools that can be used for this instrumentation:
- ccusage (5.8K+ stars) — the de facto CLI for Claude Code/Codex usage. Parses the local JSONL files under ~/.claude/projects/ and gives you daily, weekly, monthly, session, and 5-hour-block reports with model breakdowns and pricing from LiteLLM. Install: npm i -g @ryoppippi/ccusage && ccusage daily. Pairs with ccusage-monitor (macOS menu bar) and Usage Monitor for Claude (Windows tray).
- Claude-Code-Usage-Monitor — real-time terminal monitor with ML-based burn-rate predictions and limit warnings.
- claude-code-usage-analyzer — deeper cost & token breakdown by model and token type, built on top of ccusage data.
- Live tracking via hooks. Claude Code hooks fire on every tool call; users have built DIY trackers that calibrate the actual token-per-window burn.
- The /cost command is a starting point but hides cache-write costs. Trust the JSONL logs, not the in-app number.
- Cost per accepted diff is the metric finance cares about. Track tokens-in, tokens-out, model, and PR-merge / test-pass outcome, then divide. Most teams discover that the exercise of producing this number is what unlocks every subsequent decision.
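As a starting point, here is a minimal sketch of a log-based baseline script. The message.usage field names are assumptions (they match what ccusage reads from the same files); verify them against your own logs before trusting the output.

```python
import json
from collections import defaultdict
from pathlib import Path

# Sum per-model token usage from Claude Code's local JSONL transcripts.
totals: dict = defaultdict(lambda: defaultdict(int))

for path in Path.home().glob(".claude/projects/**/*.jsonl"):
    for line in path.read_text(errors="ignore").splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        msg = entry.get("message")
        if not isinstance(msg, dict):
            continue
        usage = msg.get("usage") or {}
        for key in ("input_tokens", "output_tokens",
                    "cache_creation_input_tokens", "cache_read_input_tokens"):
            totals[msg.get("model", "unknown")][key] += usage.get(key, 0) or 0

for model, u in sorted(totals.items()):
    reads = u["cache_read_input_tokens"]
    writes = u["cache_creation_input_tokens"]
    # Cache hit rate: reads should dominate in a healthy session.
    hit = reads / ((reads + writes + u["input_tokens"]) or 1)
    print(f"{model}: {u['input_tokens']:,} in / {u['output_tokens']:,} out, "
          f"cache hit rate {hit:.0%}")
```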
2. Cache rules everything around me
Prompt caching is the single biggest economic lever Anthropic gives you, and the most fragile. Here's an example of the math, based on Anthropic’s pricing, for Sonnet 4.6:
- Base input: $3/M tokens
- Cache write (5m TTL): $3.75/M (25% premium)
- Cache write (1h TTL): $6/M (100% premium)
- Cache hit/refresh: $0.30/M (90% discount)
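To make those rates concrete: a stable 100K-token prefix costs $0.30 to send uncached, $0.375 to write to the 5-minute cache, and $0.03 to read on each subsequent turn. A single warm reuse already pays back the write premium, and every turn after that runs at a tenth of the base price.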
The Claude Code team has described how "we build our entire harness around prompt caching… we run alerts on our prompt cache hit rate and declare SEVs if they're too low." If your cache hit rate is low, you're paying 10x the base price.
What breaks the cache and multiplies your bill
- Anything that changes the prefix. Cache lookups hash the entire prompt prefix in order. A single edit to your CLAUDE.md, a new MCP server, a tool added mid-session, or a non-deterministic timestamp injected into the system prompt invalidates every downstream entry.
- Compaction. Every compaction event starts a fresh prefix, dumping the cache and forcing full-price re-prefill on the next turn. See section 5.
- TTL. When the TTL expires, the cached prefix is gone and the next call pays full write price again. The default TTL for subscriptions silently dropped from 1 hour to 5 minutes in early March 2026, confirmed in an open Claude Code issue with JSONL evidence, causing a 20–32% jump in cache-creation costs. Anthropic's March 2026 caching incident caused 10–20× token inflation for some users with no warning. The billing layer is more opaque than most builders assume.
- Model changes. KV caches are per-model. Routing between Sonnet and Opus mid-session means you re-prefill from scratch on the new model. Even effort-level changes on the same model can invalidate the cache. Switching models can make sense early in a trajectory or within sub-agents, but switching when context has accumulated and the cache is warm often incurs more cost than it saves.
What to do
- For API users, set a 1-hour TTL explicitly if you return to work across multiple sessions: "cache_control": {"type": "ephemeral", "ttl": "1h"}. The 2× write premium pays for itself after a single avoided miss.
- Use all four cache breakpoints Anthropic allows: typically (1) system prompt, (2) tool definitions, (3) stable docs/CLAUDE.md, (4) conversation prefix. A sketch of this layout follows the list.
- Monitor cache hit rate as a first-class metric. The JSONL logs show ephemeral_5m_input_tokens vs ephemeral_1h_input_tokens per turn; a healthy session should be >80% cache reads after the first turn.
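As an illustration, here is a minimal sketch of that breakpoint layout using the Anthropic Python SDK. The model name and prompt contents are placeholders, and depending on your API version the 1-hour TTL may require a beta header; treat this as a shape to adapt, not a drop-in config.

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a coding agent for acme-repo."  # placeholder
STABLE_DOCS = "<contents of CLAUDE.md and pinned docs>"  # placeholder

response = client.messages.create(
    model="claude-sonnet-4-5",  # substitute your model
    max_tokens=1024,
    system=[
        # Breakpoint 1: system prompt. Breakpoint 2 would go on the last
        # entry of a tools=[...] list, caching all tool definitions at once.
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
        # Breakpoint 3: stable docs / CLAUDE.md, so doc edits don't
        # invalidate the system-prompt entry above.
        {"type": "text", "text": STABLE_DOCS,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    ],
    messages=[
        # Breakpoint 4: mark the last message of the stable conversation
        # prefix so later turns extend the cache instead of rewriting it.
        {"role": "user",
         "content": [{"type": "text",
                      "text": "Run the failing test suite and fix it.",
                      "cache_control": {"type": "ephemeral", "ttl": "1h"}}]},
    ],
)

# Health check for this turn: cache reads should dominate after turn one.
u = response.usage
print(u.cache_read_input_tokens, u.cache_creation_input_tokens, u.input_tokens)
```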
3. Intelligent model selection and routing
Coding-agent tasks vary enormously in complexity, yet most developers default to Opus for everything.
This is why one of the highest-leverage architectural changes is to stop sending every request to a single model. A routing layer makes the per-prompt decision automatically, sending each request to the cheapest model that meets your quality bar. On production coding workloads, this typically removes a large portion of the bill with no regression on performance, because the router exploits a distribution that already exists in the traffic.
Intelligent model routing is what we work on at Not Diamond—you can read more about routers, how they differ from gateways, and how to implement them in agentic settings in our comprehensive guide to model routing.
- Cache economics > routing economics for short tasks. A mid-trajectory model switch invalidates the KV cache. Routing must be cache-aware to avoid naively blowing up costs.
- Coding workloads have heterogeneous intent. Planning, code localization, code generation, testing, and error recovery each benefit from different models. Routing is valuable precisely because the workload is not uniform—but only when the cache cost of switching is paid back by output-token savings.
- Haiku 4.5 is dramatically underrated for coding. At $1/$5 per million tokens vs Opus's $5/$25, it achieves 66% on SWE-bench Verified. Use it as the default for research, documentation, log parsing, simple refactors, and as the subagent model.
- Effort levels have a significant impact on spend. Claude Code exposes Low/Medium/High/Max via /effort; thinking tokens are billed at output rates, and at Max can use far more tokens than the visible output suggests. Anthropic's April postmortem is explicit that the default is a chosen point on the test-time-compute curve, not a free upgrade. In our testing, Max effort on Opus 4.7 can be a trap: the marginal quality gain on most coding problems doesn't always justify the token blow-up.
- Migrating to Opus 4.7? Tokenizer changes mean the same input now maps to 1.0–1.35× more tokens than 4.6, and 4.7 thinks more at higher effort, particularly on later turns. Re-baseline your spend before assuming the upgrade is free.
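To make the cache-aware point concrete, here is a toy routing sketch. The classifier, model names, and the 30K-token switching threshold are all illustrative assumptions, not Not Diamond's implementation.

```python
# Toy cache-aware router: route by task complexity, but refuse to switch
# models once the accumulated (and cached) context makes a cold start more
# expensive than the routing savings. All thresholds are assumptions to tune.

MODELS = {"low": "claude-haiku-4-5", "high": "claude-opus-4-5"}

def classify_complexity(prompt: str) -> str:
    # Stand-in for a learned router; a real one scores the prompt itself.
    return "high" if len(prompt) > 2_000 or "refactor" in prompt else "low"

def pick_model(prompt: str, context_tokens: int, current_model: str) -> str:
    candidate = MODELS[classify_complexity(prompt)]
    if candidate == current_model:
        return current_model
    # KV caches are per-model: switching re-prefills the whole prefix at
    # write rates on the new model. Only switch while the prefix is small.
    if context_tokens > 30_000:
        return current_model
    return candidate
```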
4. Context engineering: the high-leverage habits
Compressing context is a critical skill for managing cost. Below we outline concrete actions that save you money.
CLAUDE.md hygiene
CLAUDE.md loads on every session and persists in the context window for the entire session. It is not lazy-loaded. A 5,000-token CLAUDE.md is a 5,000-token tax on every turn. Practical rules:
- Keep CLAUDE.md under ~1,500 tokens (a quick budget check is sketched after this list). Move long procedures into Skills (SKILL.md); skill bodies load only when invoked, so reference material costs nothing until used.
- Use @docs/testing.md-style references for hierarchical docs instead of inlining; Claude Code reads the chain only when needed.
- Strip preambles. Drop-in repos like caveman or claude-token-efficient impose terse-response rules (no "Sure!", no restating the question, no smart quotes) that meaningfully reduce output tokens.
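Here is the budget check mentioned above, a minimal sketch that uses a crude ~4-characters-per-token approximation rather than the real tokenizer:

```python
from pathlib import Path

# Warn when CLAUDE.md exceeds a token budget. len(text) / 4 is a rough
# heuristic, not actual tokenization; it's enough to catch runaway growth.
BUDGET_TOKENS = 1_500

def check_claude_md(path: str = "CLAUDE.md") -> None:
    approx_tokens = len(Path(path).read_text()) // 4
    status = "OK" if approx_tokens <= BUDGET_TOKENS else "OVER BUDGET"
    # Every one of these tokens is re-sent on every turn of every session.
    print(f"{path}: ~{approx_tokens} tokens ({status}; budget {BUDGET_TOKENS})")

if __name__ == "__main__":
    check_claude_md()
```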
Scoped reads, not whole-codebase loads
Claude Code's default is to load more than it needs. The structural fixes to this problem come in several forms:
- Codebase indexers via MCP. Tools like Serena and code-index-mcp enable much more efficient code indexing and search. claude-context-mode spawns isolated subprocesses for MCP tool calls so only stdout enters context.
- LSP integration. LSP plugins let Claude resolve symbols, definitions, and references without reading whole files.
- Tool-output compression. RTK (Rust Token Killer) sits between your shell and the agent and compresses command output before it hits context. Reported ~89% average savings across 7,000+ commands, e.g. cargo test from 155 lines to 3, and 10M tokens saved across one user's sessions. Install with rtk init -g and restart Claude Code. RTK also ships an auto-rewrite hook and a starter RTK-optimized CLAUDE.md.
- Token-optimization MCPs. token-optimizer-mcp bundles caching, compression, and tool intelligence into a single MCP server; it reports up to 95% reductions on heavy tool-use workloads.
- Prompt-level compression. Microsoft's LLMLingua-2 achieves 50–80% prompt compression without semantic loss on long documents. TOON (Token-Oriented Object Notation) reduces structured-data tokens 60–70% and is in production at SAP.
MCP token overhead is real
A single MCP server can add 2,000–5,000 tokens of tool definitions per request; with 10 servers you're at 18,000+ tokens of overhead before your first prompt (a quick way to size this is sketched after the list). Mitigations include:
- Move noisy or rarely-used MCP servers from the global ~/.claude/settings.json to project-level .claude/settings.json.
- Disable any MCP server you aren't using this session. The cost is paid even when the tools aren't called.
- Anthropic's research on advanced tool use shows dynamic tool search reducing token usage by 85% when 30+ tools are available; Speakeasy's Dynamic Toolsets hit a 100× reduction in MCP-related tokens across 40–400-tool sets while maintaining 100% success rates.
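For sizing the problem, here is a minimal sketch that estimates definition overhead from exported tool schemas (e.g. each server's tools/list response); the ~4-characters-per-token ratio is an approximation:

```python
import json

# Estimate how many context tokens your MCP tool definitions consume per
# request. Paste each server's tools/list output into SCHEMAS; chars / 4
# is a crude approximation of real tokenization.
SCHEMAS: dict[str, list[dict]] = {
    "github": [{"name": "create_issue", "description": "...",
                "inputSchema": {}}],
    # ... one entry per configured server ...
}

for server, tools in SCHEMAS.items():
    tokens = len(json.dumps(tools)) // 4
    print(f"{server}: {len(tools)} tools, ~{tokens} tokens per request")
```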
/clear aggressively, /compact deliberately
Every message in a session is re-sent on every turn. Switching tasks without /clear means you're paying for the authentication discussion while debugging CSS. Heuristics that work:
- Run /clear whenever you switch task or codebase area. Cost: zero. Benefit: resets to baseline.
- Use /compact only when you genuinely need history preserved.
5. Avoid the long-running sessions that trigger auto-compaction
Compaction can be one of the most expensive surprises in Claude Code billing. Mechanically, /compact takes the entire conversation, sends it to the model as input, and replaces the history with the summary. You pay full input price to re-feed the entire prefix you've already paid for, and the summary itself is generated as output, billed at output rates. These summaries can be quite large.
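A back-of-envelope example, using the Sonnet-class input and cache-write rates from section 2 plus an assumed $15/M output rate (the history and summary sizes are illustrative):

```python
# Illustrative cost of a single compaction event at Sonnet-class rates.
INPUT_PER_M = 3.00        # $/M uncached input (section 2)
OUTPUT_PER_M = 15.00      # assumption: Sonnet-class output rate
CACHE_WRITE_PER_M = 3.75  # $/M cache write, 5m TTL (section 2)

history_tokens = 150_000  # assumed session size at auto-compact
summary_tokens = 8_000    # assumed summary size; these can be large

refeed = history_tokens / 1e6 * INPUT_PER_M        # re-read the whole history
summary = summary_tokens / 1e6 * OUTPUT_PER_M      # generate the summary
rewarm = summary_tokens / 1e6 * CACHE_WRITE_PER_M  # next turn rewrites cache
print(f"one compaction ≈ ${refeed + summary + rewarm:.2f}, plus a cold cache")
```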
The long-running session is the root cause, and not just for cost: quality degrades on long-horizon iterative work, because the longer a session runs, the more likely the agent is to forget decisions you made earlier. Ways to break the habit:
- Plan-Build-Run / scratchpad workflows. Push real work out to fresh subagent context windows, persist decisions and state to disk (markdown plans, JSON state files), and start a new session per phase. Open-source implementations: Plan-Build-Run, get-shit-done, and the workflow described in my-claude-code-context-workflow.
- Plan in a cheaper surface. Do the thinking in Chat, then execute in Claude Code; or use Opus Plan Mode (/model opus-plan) to plan with Opus, then execute with Opus or Sonnet depending on the complexity of the plan.
- Anthropic's first-party tools: server-side compaction (beta header compact-2026-01-12) is more efficient than client-side /compact because it avoids the round-trip. The memory tool persists structured notes to disk so the model retrieves them on demand. Context editing can selectively clear stale tool_result blocks, though enough needs to be removed to justify the cache miss. The Anthropic cookbook on memory + compaction + tool clearing is the best reference for combining them.
6. Subagents: powerful, dangerous, easy to misuse
Subagents have their own context window and execution loop, which makes them both excellent for context isolation and a fast path to a growing bill.
Where they pay off: noisy work that would otherwise pollute the main context window. Test runs, log parsing, file enumeration, and research/exploration are textbook fits. The Anthropic context engineering guide explains how subagents do the heavy lifting while only the results return to the parent's context.
Where they explode: parallel teams. Community benchmarks put a 3-agent team at 3–4× tokens, a 5-agent team at ~5×, and Anthropic's own docs cite 7× for plan-mode teammates.
- Never spawn parallel agents unattended. Cap concurrent subagents at a known number; require human approval before fan-out beyond that.
- Multiply your single-agent cost estimate by your parallelism factor, then add overhead (sketched after this list). Treat parallelism as a budget multiplier, not a productivity multiplier.
- Use subagents to keep the parent context clean, not to accelerate work that a single agent could do sequentially. The cost-benefit only flips when the noise reduction is real.
- Mid-tier or smaller models for subagents. A typical pattern: Sonnet as orchestrator, Haiku for log/test/file subagents, Opus only for genuinely hard reasoning. The "advisor strategy"—Opus as senior advisor, Haiku/Sonnet as executor—has been shown to reduce costs while improving benchmark performance.
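A minimal budgeting helper for the multiplier rule above; the 25% coordination overhead is an assumption to tune against your own logs:

```python
# Treat parallelism as a budget multiplier: n agents cost roughly n times a
# solo run, plus coordination overhead (orchestrator turns, result merging).
def team_cost(single_agent_usd: float, n_agents: int,
              overhead: float = 0.25) -> float:
    return single_agent_usd * n_agents * (1 + overhead)

# A task that costs $4 solo becomes ~$15 with a 3-agent team.
print(f"${team_cost(4.00, 3):.2f}")
```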
7. Subscriptions vs API: be strategic
The era of cheap, all-you-can-eat coding subscriptions is starting to wind down. Two recent signals are worth weighing in any tooling decision with a horizon longer than a quarter:
- Anthropic briefly removed Claude Code from the $20 Pro plan on April 21, 2026, updating the pricing page and support docs to make Claude Code Max-only. Head of Growth Amol Avasare described it as a "small test on ~2% of new prosumer signups"; after public backlash Anthropic reverted the change within ~48 hours. The reversal doesn't change the direction of travel—Opus on Pro now requires opting into extra paid credits, and the test itself signals where Anthropic's economics are pointing.
- GitHub Copilot is moving from request-based to usage-based billing on June 1, 2026. Every plan will get a fixed monthly allotment of GitHub AI Credits priced at published API rates (input, output, and cached tokens); overages are pay-as-you-go. GitHub also paused new Pro / Pro+ / Student signups and removed Opus from the Pro tier entirely. The framing is: Copilot "is not the same product it was a year ago" and pricing has to align to actual compute cost.
Subscriptions are still the cheapest surface for individual heavy users today, but plan around the assumption that flat-rate economics on coding agents will keep tightening. For anything mission-critical or multi-year, API-based deployments and a routing layer give you a cost trajectory you control rather than one a vendor's pricing team controls.
That said, for individual developers and small teams the Max subscription is still dramatically subsidized for heavy users. Concrete data points users have published:
- A user on Max 5x ($100/month): 138 sessions, 61.8M tokens in 30 days; estimated API cost $1,593. ~15× the plan price.
- Kyle Redelinghuys: ~10B tokens over 8 months, estimated $15,000 in API; paid ~$800 on Max, 93% saving.
- Rough breakeven: ~$100/month in API usage. If you're hitting limits regularly, Max pays for itself.
Strategic-consumption rules that hold up:
- One human, one subscription, one beneficiary. Anthropic's policy is explicit, and there are documented bans for sharing accounts or proxying a subscription to multiple humans. If your team needs more capacity, it's per-seat plans or API, not shared logins.
- Use the cheapest surface for the work. Plan and explore in Chat (or Sonnet); execute in Claude Code with Opus only when the job demands it.
- Track 5-hour blocks. Subscription throttling resets in 5-hour windows; ccusage's blocks command shows where you are in the current window so you can avoid getting throttled mid-task.
8. Hard cost controls
Caps aren't a cost-reduction tool in themselves but a containment mechanism that makes every other lever safe to run at scale. You cannot manage what you don't control. Larger enterprises implementing cost management should put the following controls in place:
- Per-user and per-team API budgets with automatic throttling when exceeded.
- Repo-level policies. Production repos get higher limits than experimental ones.
- Task-level caps. A single agent run cannot burn half a million tokens uncapped.
- Daily/weekly spend alerts routed to engineering leads, not just finance.
- Automatic downgrade as users approach their limits: Sonnet by default, stepping down to Haiku as the budget tightens (a minimal version is sketched below).
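A minimal sketch of that downgrade guard, assuming you proxy API traffic through a gateway you control; the budget, threshold, and model names are illustrative:

```python
# Per-user budget guard for a self-hosted gateway: block at the cap and
# downgrade to a cheaper model as spend approaches it. Values illustrative.
DAILY_BUDGET_USD = 25.00
DOWNGRADE_AT = 0.80  # start downgrading at 80% of the daily budget

def enforce(user_spend_usd: float, requested_model: str) -> str:
    if user_spend_usd >= DAILY_BUDGET_USD:
        raise RuntimeError("daily budget exhausted; request blocked")
    if user_spend_usd >= DAILY_BUDGET_USD * DOWNGRADE_AT:
        return "claude-haiku-4-5"  # cheapest capable model
    return requested_model

print(enforce(21.50, "claude-opus-4-5"))  # -> claude-haiku-4-5
```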
9. Right-sizing the contract
For enterprise deployments, the contract itself is a lever. Most teams that audit their commits find they're overcommitted relative to realized usage. Beyond volume discounts, two terms worth pushing on specifically:
- Rollover or soft-commit structures that survive variance in usage.
- Explicit rights to use other model providers in Claude Code. Some enterprise terms restrict this; that restriction is the single most expensive line in the contract over a multi-year horizon.
The term that compounds most over a multi-year horizon is portability: contracts that lock traffic to one provider weaken your negotiating position every year, while contracts that preserve the right to route keep leverage intact across the next renewal and the one after that.
A 30-day sequence
If you're starting from zero, here's the best path forward:
- Week 1: Instrument. Install ccusage and a usage monitor. Capture cost-per-developer, cost-per-repo, cache hit rate, and (if possible) cost-per-accepted-diff. Establish the baseline.
- Week 2: Cache and context. Audit CLAUDE.md size, MCP overhead, and average tool-output volume. Install RTK or an equivalent output compressor. Set explicit cache TTLs. Move long procedures into Skills.
- Week 3: Models and routing. Roll out intelligent routing to make use of smaller models on lower-complexity tasks and monitor quality.
- Week 4: Budgets and contract. Implement per-user and per-team limits. Begin the commit-structure conversation with Anthropic if applicable.
Depending on your workloads, these techniques can lead to cost reductions of anywhere from 30–90%. What's more, done correctly they will not degrade developer experience or productivity.
Conclusion
Coding agent costs have exploded in the past 6 months, and they are expected to 10x again in the coming 6 months. Models are becoming more expensive and more capable, meaning they can run longer and more autonomously. This is painful for individual developers, and even more painful for organizations deploying coding agents at scale. The good news is that there is a tremendous amount we can do to trim wasted spend and run leaner, faster, and more effectively. By changing behavior and implementing the right tools and guardrails, you can invest in your Claude Code usage much more intelligently and continue scaling at the pace of the models themselves.
For more information on Not Diamond and our intelligent model routing for coding agents, reach out to set up a demo of the product.