Key takeaways

  1. Caching is opt-in: nothing caches until you add a cache_control: {type: 'ephemeral'} breakpoint (or top-level auto-caching). An identical prefix on its own does nothing.
  2. The cached prefix is an exact match. Any byte change before the breakpoint (a timestamp, a reordered JSON key) busts every token after it.
  3. Keep stable instructions in the system role with the breakpoint on its last block; move timestamps, IDs, and per-user data into the user turn. You do not need to smuggle the system prompt into a user message.
  4. On Opus 4.8 the minimum cacheable prefix is 4,096 tokens. A shorter prefix silently won't cache, with no error and cache_creation_input_tokens stuck at 0.
  5. Reads bill at 0.1x base input; writes at 1.25x (5-minute TTL) or 2x (1-hour). Break-even is two requests on the default TTL, three on the hour.
  6. Verify with the response: usage.cache_read_input_tokens should be non-zero on the second identical-prefix call. If it stays zero, a silent invalidator is in your prefix.

A team I talked to was paying full price on a system prompt they sent on every request: the same few thousand tokens of instructions, tools, and examples, billed at the base input rate thousands of times a day. They didn’t have a volume problem or a model problem. They had a trailing-timestamp problem: one dynamic line at the top of the prompt was busting the cache on every call. Anthropic bills a cache hit at one-tenth the base input rate (prompt caching docs), so that one line was charging them roughly 10x for the part of the prompt that never actually changed.

That’s the whole game with prompt caching, and the whole trap. Two things have to be true for a token to be served from cache: you have to mark a cache breakpoint, and everything before that breakpoint has to be byte-for-byte identical to a recent request. Miss either one and you pay full price. Most “why is my hit rate low” problems are a broken second condition, but the first one, forgetting to turn caching on at all, trips up more people than you’d think.

How caching is actually turned on

Caching is opt-in. An identical prefix on its own caches nothing; you have to tell the API where the reusable boundary is with a cache_control marker.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # this is the breakpoint
        }
    ],
    messages=[{"role": "user", "content": user_message_content}],
)

The cache_control block marks the end of the cacheable prefix. Everything up to and including it (your tools, then your system prompt) gets cached; everything after it (the user turn) is read fresh. If you’d rather not manage placement, set cache_control={"type": "ephemeral"} at the top level of the request and the API puts the breakpoint on the last cacheable block for you.

cached · byte-identicalread freshtoolsposition 0systemstatic promptuser turndynamic valuescache_control breakpoint cached · byte-identicaltoolsposition 0systemstatic promptuser turndynamic valuescache_control breakpointread fresh
Render order is tools, then system, then messages. The cache_control marker ends the cacheable prefix: everything up to and including it is cached and must stay byte-identical; the user turn after it is read fresh.

A few mechanics worth holding in your head, because they’re the ones that quietly cost you:

  • The render order is tools then system then messages. The cache key is built in that order, so stable content (frozen system prompt, deterministic tool list) goes first and volatile content goes after the last breakpoint.
  • The minimum cacheable prefix on Opus 4.8 is 4,096 tokens. A shorter prefix silently won’t cache, with no error: cache_creation_input_tokens just stays 0. The three-sentence prompts in this piece are below that floor, so treat them as illustrations of structure, not of a prompt that would actually cache. (Sonnet 4.6’s floor is 2,048; older Sonnet models are 1,024.)
  • You get four breakpoints per request, maximum. Spend them at real stability boundaries: a frozen tool list, a frozen system prompt, the end of a long shared document.
  • Changing the model or the tool set invalidates the whole prefix. Tools render at position 0, so adding, removing, or reordering one busts everything after it. Serialize your tool list deterministically, and don’t swap models mid-conversation.

With caching switched on, the rest of the work is keeping the prefix byte-identical.

What quietly kills your cache

A handful of ordinary prompt-engineering habits look harmless and quietly wreck your hit rate. These are the ones I see most.

1. Dynamic timestamps and random IDs

Including current_datetime or a request_id directly in your system prompt makes every request unique, guaranteeing a cache miss. This is the most common offender.

Bad Example:

# This will always miss the cache
system_prompt = f"""
You are a helpful assistant. Current time: {datetime.now().isoformat()}. Request ID: {uuid.uuid4()}.
Your task is to summarize user input.
"""

If you need the current time or an ID for logging or a specific instruction, pass it in the user message, outside the cached prefix.

2. Per-user data in the system prompt

Embedding data that changes per user or per session (user_id, session_token, user_preferences) into the system prompt makes the prefix unique for each interaction. That data belongs in the user message or a context block, after the breakpoint.

Bad Example:

# This makes the system prompt unique per user
system_prompt = f"""
You are a personalized assistant for user {user.id}. Their preferred language is {user.language}.
Respond in a helpful and concise manner.
"""

3. Inconsistent formatting and whitespace

Subtle differences in how you build prompt strings, varied newlines, extra spaces, inconsistent JSON key ordering, all cause misses. The cache sees {"a": 1, "b": 2} as different from {"b": 2, "a": 1} or the same object pretty-printed across three lines.

Recommendation: Use a canonical serialization for any structured data in the cached prefix, e.g. json.dumps(data, sort_keys=True, separators=(",", ":")).

4. Irrelevant context bloat

Don’t put context in the cached prefix that doesn’t shape the model’s behavior for that system prompt’s task. If a block changes often but doesn’t alter the core task definition, it belongs after the breakpoint, in the user turn.

What keeps the hit rate high

Flip each of those around and you get the patterns that work. The throughline: treat the prefix before your breakpoint like version-controlled code, not like a string you rebuild per request.

1. A static system prompt, cached at the breakpoint

If you do one thing, do this. Your system prompt should be a fixed string for a given application context: persona, core instructions, constraints, ideally checked into source control. Pass it in the system role with a cache_control marker on its last block, and keep everything that varies in the user turn.

Good Example:

STATIC_SYSTEM_PROMPT = """
You are an expert financial analyst. Provide concise, factual summaries of earnings reports.
Use only information explicitly provided in the user's input; do not add external data.
Format your response as a JSON object with 'summary' and 'key_metrics'.
"""  # in practice, 4,096+ tokens of instructions, schema, and few-shot examples

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[
        {"type": "text", "text": STATIC_SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{"role": "user", "content": user_message_content}],
)

You do not need to smuggle the system prompt into a user message to cache it. The system role is cached directly; the breakpoint is what matters, not the role. Keep instructions in system, data in the user turn.

2. Canonical serialization for any structured data

When you pass structured information, use a canonical serialization so logically identical data produces an identical string across requests.

import json

def canonical(data):
    return json.dumps(data, sort_keys=True, separators=(",", ":"))

user_message_content = (
    "Here is the customer's context:\n"
    f"{canonical(user_context)}\n\n"
    "Analyze their recent orders and suggest a relevant upsell."
)

Identical user_context now serializes to an identical string, so two such requests share a prefix instead of fragmenting the cache over key ordering or whitespace.

3. RAG: stable prefix first, retrieved context after the breakpoint

For retrieval-augmented generation, keep the static system prompt (and its breakpoint) first, and put the retrieved snippets plus the question in the user turn, where they’re read fresh.

import anthropic

client = anthropic.Anthropic()

SYSTEM = """
You are a question-answering bot for a documentation system.
Answer strictly from the provided documentation snippets.
If the answer isn't in them, say you don't have enough information.
"""  # plus your full instruction set, 4,096+ tokens

def answer(question, retrieved_docs):
    docs = "<documents>\n" + "".join(
        f'<doc id="{d["id"]}">{d["content"]}</doc>\n' for d in retrieved_docs
    ) + "</documents>"
    return client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": f"{docs}\n\nQuestion: {question}"}],
    )

The system prompt hits the cache on every call; the retrieved snippets and the question are read fresh, which is what you want, since they change each time. If your retrieved context is itself stable across a session (the same handful of docs for a multi-turn conversation), add a second breakpoint at the end of it.

4. Version-control your prompts

Treat your system prompts like code. Store them in your repository, review changes, and version them, so a stray edit can’t silently invalidate the cache for every caller. A simple prompt_v1.txt is often enough.

How to actually measure it

You don’t need to hash anything. Every response tells you exactly what happened, in three usage fields:

print(response.usage.cache_creation_input_tokens)  # written to cache this call (~1.25x base)
print(response.usage.cache_read_input_tokens)      # served from cache this call (~0.1x base)
print(response.usage.input_tokens)                 # processed fresh, full price

The pattern to watch: on the first request with a given prefix, cache_creation_input_tokens is non-zero (you paid to write it) and cache_read_input_tokens is 0. On the second identical-prefix request inside the TTL, that flips, and cache_read_input_tokens is non-zero. If it stays 0 across calls you believe are identical, a silent invalidator is in your prefix: diff the rendered bytes of two requests and you’ll find the timestamp, the unsorted JSON, or the tool that moved. Make it a standing check rather than a one-off — a five-minute weekly cache audit catches a regression the week it lands. (Your total prompt size is the sum of all three fields, not input_tokens alone, so an agent that ran for hours on 4K of input_tokens was mostly served from cache.)

What it’s worth

Reads are the headline: a cache hit is billed at one-tenth the base input rate. But writes aren’t free, and a cost piece that skips them is lying to you — the same gap between a raw Opus 4.8 per-token sticker and what you actually pay once caching is in play. Writing the cache costs 1.25x base input on the default 5-minute TTL, or 2x if you opt into the 1-hour TTL. So the break-even is two requests on the default TTL (1.25x to write plus 0.1x to read, versus 2x for two uncached reads) and three on the hour. For anything you re-send more than once or twice inside the window, the math is overwhelmingly in your favor; for a prefix you touch exactly once and never again, caching costs you money.

0x0.4x0.8x1.1x1.5xCache read (hit)Fresh read (miss)Cache write (5-min TTL)
What each token costs, relative to base input Cache reads bill at 0.1x base input; a fresh read is 1x; writing the cache is 1.25x on the 5-minute TTL (2x on the 1-hour). Break-even is two requests. Source: Anthropic prompt caching docs
What each token costs, relative to base input
CategoryBilled rate vs base input
Cache read (hit) 0.1x
Fresh read (miss) 1x
Cache write (5-min TTL) 1.3x
Cite or embed this

Free to reuse with a credit link back to The Counter Brief.

Go back to that team with the trailing timestamp. A 5,000-token prefix sent 50,000 times a day is 250M input tokens daily. Missing the cache, that’s billed at the full base rate; hitting it, the prefix costs one-tenth as much, a 90% cut on the tokens that never change, plus a faster time-to-first-token on every call. They moved the timestamp into the user message, kept the breakpoint on the system prompt, and changed nothing else. The prefix went from a line item they argued about to one they stopped noticing. That’s the shape of the win: not clever, just disciplined. Mark the boundary once, keep the bytes before it byte-identical, and let the cache do the rest.

The Counter Brief — one email, every Monday.

The week's AI-for-revenue moves in a 5-minute read: which tools are worth the budget, which to skip, and the one thing to do about it this week. Source-checked, no vendor decks.

Edited by Aditya Marin Gasga · Read a recent issue →

Free. One click to unsubscribe.

About Aditya Marin Gasga

Founding Editor

Aditya Marin Gasga is the founding editor of The Counter Brief and Head of Growth at Demand Nexus, its parent company, where he works on sourcing qualified pipeline across SDR, content, and paid channels. His background is in performance marketing and demand generation. He studied business administration at Northumbria University.

More from Aditya Marin Gasga →