Key takeaways

  1. Headline pricing tells you nothing useful: your effective rate is cache hit rate x input price + miss rate x full input price + output x output price.
  2. Cache reads bill at $0.50/MTok (one-tenth the $5 input rate), so on long-context agent loops, where we measured hit rates north of 90% with disciplined prefixes, that discount is where the real savings live.
  3. For interactive coding workloads, Opus 4.8 is now within 12% of Sonnet 4.6 on effective per-request cost while keeping the Opus quality bar.
  4. GPT-5.5 still wins on raw output throughput per dollar; Gemini 3.1 Pro wins on context-cost ratio above 500K tokens.
  5. Anyone with an annualized inference bill above $50K should rerun their pricing spreadsheet this week.

For two years, the calculus for picking a frontier model in production was simple: use Opus when quality matters, swap to Sonnet (or GPT-4-class, or Gemini-Flash-class) for everything else, and accept that Opus would cost 4-6x more on a per-request basis. That mental model is now stale. Opus 4.8 lists at $5/MTok input and $25 output (a third of the $15 / $75 the prior Opus generation charged), and the cheaper headline is only half the story. The other half is what those numbers actually mean in production.

This is the gap most pricing comparisons miss: published rates are headline rates. The number you actually pay is determined by your cache hit rate, your prefix structure, and how long your average request lives.

What the numbers actually are

Three things, in plain language:

  1. Headline rates are a third of the old Opus. Opus 4.8 lists at $5/MTok input, $0.50/MTok on cache reads, and $25/MTok output, versus $15 / $1.50 / $75 for the prior Opus generation (Anthropic pricing). Cache reads are billed at one-tenth the input rate.
  2. The cache hit rate is where the savings concentrate. Every input token served from cache costs $0.50/MTok instead of $5, a 90% discount on that token. On long-context agentic workloads with stable prefixes, a disciplined setup keeps most input tokens cached; in our testing, well-structured agent loops sit north of 90%.
  3. The default cache TTL is 5 minutes. You can extend it to 1 hour, but the cache write then costs 2x the base input rate instead of 1.25x (pricing), worth it only for batch jobs that revisit the same context window across separated calls.

The third one is small. The first two compound.

The math everyone gets wrong

The standard mistake when comparing model prices is to look at the price-per-million-tokens row in the docs and divide it by what your average request uses. That gives you a number, but it’s the worst case number. In practice your effective rate looks something like this:

effective_cost = (cache_hit_rate × cached_input_price × input_tokens)
               + ((1 - cache_hit_rate) × full_input_price × input_tokens)
               + (output_price × output_tokens)

Most production workloads (RAG, agents, coding assistants, anything with stable system prompts) sit somewhere between 60% and 90% cache hit rate on input. Long-running agents that keep coming back to the same context window sit above 90%. The arithmetic between “what the price tag says” and “what you actually pay” is a 3-6x difference for these workloads.

The current Opus pricing matters because both terms move in the right direction at once: a low headline rate and a steep cache discount.

Opus 4.8 got cheaper, and much better at being cheap.

A side-by-side, run on real workloads

We rebuilt our internal pricing model against current pricing and ran four representative workloads through each candidate frontier model. The numbers below are dollars per 1,000 requests, normalized to the same token budgets (our own estimates, not vendor figures). The “Opus (prior gen)” column is the old $15/$75 Opus, kept in to show the generational drop. The Gemini 3.1 Pro column is illustrative: it reflects the prior Google flagship’s modeling and hasn’t been re-derived for 3.1 Pro’s lower current rates.

WorkloadOpus (prior gen)Opus 4.8Sonnet 4.6GPT-5.5Gemini 3.1 Pro
Interactive coding (high cache)$48.20$28.40$25.10$36.50$31.80
RAG over 200K-token corpus$112.00$66.00$44.00$59.00$51.00
Long agent loop (>20 turns)$86.50$48.30$39.10$52.40$44.20
One-shot batch summarization$22.00$18.50$11.20$14.50$19.80
$0$50$100$150$200Interactive codingCoding$48.2$28.4$25.1RAG 200KRAG 200K$112$66$44Long agent loopAgent loop$86.5$48.3$39.1One-shot batchBatch$22$18.5$11.2
Effective cost per workload — prior-gen Opus vs 4.8 vs Sonnet The Counter Brief's own estimates (not vendor figures); GPT-5.5 and Gemini 3.1 Pro are in the table above. Source: The Counter Brief's model cost calculator
Effective cost per workload — prior-gen Opus vs 4.8 vs Sonnet
CategoryOpus (prior gen)Opus 4.8Sonnet 4.6
Interactive coding $48.2$28.4$25.1
RAG 200K $112$66$44
Long agent loop $86.5$48.3$39.1
One-shot batch $22$18.5$11.2
Cite or embed this

Free to reuse with a credit link back to The Counter Brief.

What jumps out: for the workloads where Opus quality genuinely matters (agents, complex coding, multi-step reasoning), the current Opus pricing closes the once-massive gap against the prior generation. Sonnet 4.6 is still the cheaper option in absolute terms, but the cost premium for going to Opus has narrowed from “3x more” to roughly “10-15% more” on the high-cache workloads. That changes the calculus.

For one-shot batch jobs without cache reuse, Sonnet (or in some cases GPT-5.5) remains the right call. Don’t pay for Opus on workloads where its strengths don’t matter.

The cache-hit-rate caveat

The high cache hit rate we observed is real, but it depends on you actually structuring your requests for cache hits. Two things will tank it:

  • Reordering messages. Cache keys are prefix-based. If you slot a new system message in the middle of a long conversation, every cached token after that point invalidates.
  • Per-request drift in the system prompt. Stamping the current timestamp into the system prompt at every call defeats the cache. Move dynamic content to the user message.

If you’re rebuilding the system prompt template per call, you’ll never see the headline rate. Audit your prompt scaffolding before you blame the model for being expensive.

What this means for your roadmap

If you’re running any meaningful inference volume:

  1. Rerun your pricing spreadsheet with the 4.8 numbers and your actual cache hit rate. The Counter Brief’s model cost calculator does this with the current rates built in. Don’t use Anthropic’s example workloads. Use yours.
  2. Audit prompt structure for cache-killers (per-request timestamps, reordered system messages, randomized tool order).
  3. Re-evaluate the Opus / Sonnet split. Workloads you previously routed to Sonnet for cost reasons may now be cheap enough on Opus that the quality lift is worth it.
  4. Don’t panic-migrate. The same audit applies to GPT-5.5 and Gemini, both of which have their own pricing inflections coming. Build the spreadsheet once; rerun it each quarter.

The big-picture takeaway: for the first time since Opus launched, the answer to “should we use the most capable model?” isn’t automatically “no, it’s too expensive.” For high-cache-locality work, the right model is now the best one. That’s a meaningful shift in how production AI gets architected, and almost everyone’s pricing assumptions are now stale.

The Counter Brief — one email, every Monday.

The week's AI-for-revenue moves in a 5-minute read: which tools are worth the budget, which to skip, and the one thing to do about it this week. Source-checked, no vendor decks.

Edited by Aditya Marin Gasga · Read a recent issue →

Free. One click to unsubscribe.

About Aditya Marin Gasga

Founding Editor

Aditya Marin Gasga is the founding editor of The Counter Brief and Head of Growth at Demand Nexus, its parent company, where he works on sourcing qualified pipeline across SDR, content, and paid channels. His background is in performance marketing and demand generation. He studied business administration at Northumbria University.

More from Aditya Marin Gasga →