# Prompt Caching for Cost Reduction

Prompt caching is one of the most underused cost optimization features for Claude on Amazon Bedrock. When you send the same system prompt or context prefix with every request, you are paying full price for those input tokens each time. Prompt caching lets you pay once and reuse the cached tokens at a 90% discount. For applications with large, static prompt prefixes, this can cut your input token costs dramatically.

## How Prompt Caching Works

When you enable prompt caching, Bedrock stores the processed representation of your prompt prefix (the key-value attention cache) on the server. On subsequent requests that start with the same prefix, the cached tokens are loaded from memory instead of being reprocessed.

**First request (cache write):**

- Full input token cost plus a 25% cache-write premium
- The prompt prefix is stored for reuse

**Subsequent requests (cache hit):**

- Cached tokens are charged at ~10% of the normal input price
- Only the new/changed tokens after the cached prefix are charged at full price
| Pricing Component | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 |
|-------------------|-----------|------------|----------|
| Normal input      | $0.80/1M  | $3.00/1M   | $15.00/1M |
| Cache write       | $1.00/1M  | $3.75/1M   | $18.75/1M |
| Cache read (hit)  | $0.08/1M  | $0.30/1M   | $1.50/1M  |

Cache read tokens cost 90% less than normal input tokens. The cache write costs 25% more, but you only pay it once.
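To make this concrete, here is back-of-the-envelope arithmetic for a hypothetical 10,000-token static prefix reused across 100 requests at Sonnet pricing (ignoring the dynamic suffix, which costs the same either way):

```python
PREFIX_TOKENS = 10_000
REQUESTS = 100

# Sonnet prices per million input tokens
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30

without_cache = REQUESTS * PREFIX_TOKENS / 1e6 * INPUT

with_cache = (
    PREFIX_TOKENS / 1e6 * CACHE_WRITE                   # one cache write
    + (REQUESTS - 1) * PREFIX_TOKENS / 1e6 * CACHE_READ  # 99 cache reads
)

print(f"without caching: ${without_cache:.2f}")  # $3.00
print(f"with caching:    ${with_cache:.4f}")     # $0.3345
print(f"savings: {(1 - with_cache / without_cache) * 100:.0f}%")  # 89%
```

The cache-write premium is recovered on the very first hit: the write adds 0.25x the normal price once, while every hit saves 0.9x.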

## Cache TTL and Invalidation

- Default TTL: 5 minutes from the last cache hit
- The TTL resets every time the cache is accessed, so active caches stay warm
- The cache is invalidated if even a single token in the cached prefix changes
- Minimum cacheable prefix: 1,024 tokens for Sonnet and Opus (Haiku models have a higher minimum)

The 5-minute TTL means caching is most effective for:

- Applications with steady request volume (at least a few requests per minute)
- Multi-turn conversations where the system prompt and history prefix remain stable
- Batch processing where many requests share the same context
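A toy model helps build intuition for the sliding TTL: each access resets the expiry clock, so what matters is the gap between consecutive requests, not overall volume. This is a sketch of that behavior only, not the service's actual cache logic:

```python
def cache_hits(timestamps: list[float], ttl: float = 300.0) -> int:
    """Count which requests in a sequence would find a warm cache under a
    sliding TTL: any access (hit or write) refreshes the expiry clock."""
    hits = 0
    last = None
    for t in timestamps:
        if last is not None and t - last <= ttl:
            hits += 1
        last = t  # every request refreshes the cache
    return hits

# Requests every 2 minutes stay warm after the initial write
print(cache_hits([0, 120, 240, 360]))  # → 3
# Requests 10 minutes apart always miss
print(cache_hits([0, 600, 1200]))      # → 0
```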

## Best Practices: Structure Your Prompts for Caching

The key principle is: put static content first, dynamic content last. The cache matches from the beginning of the prompt, so everything before the first change point can be cached.

```
┌──────────────────────────────────────┐
│ System Prompt (static)               │ ← Cached
│ Few-shot Examples (static)           │ ← Cached
│ Reference Documents (static)         │ ← Cached
├──────────────────────────────────────┤
│ User's Current Query (dynamic)       │ ← NOT cached (full price)
└──────────────────────────────────────┘
```

## Code Example: Cache-Optimized Prompt Structure

```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Static system prompt with detailed instructions (cacheable)
SYSTEM_PROMPT = """You are an expert AWS solutions architect assistant.

## Response Guidelines
- Always provide specific AWS service recommendations
- Include cost estimates when relevant
- Reference AWS Well-Architected Framework pillars
- Suggest infrastructure-as-code implementations
- Consider multi-region and disaster recovery implications

## Knowledge Base
[... extensive reference documentation, architectural patterns,
best practices, compliance requirements, etc. - potentially
thousands of tokens of static context ...]
"""

# Static few-shot examples (cacheable)
FEW_SHOT_EXAMPLES = [
    {
        "role": "user",
        "content": "How should I set up a multi-region active-active database?",
    },
    {
        "role": "assistant",
        "content": "For multi-region active-active, use Amazon DynamoDB Global Tables...",
    },
    {
        "role": "user",
        "content": "What's the best way to handle file uploads at scale?",
    },
    {
        "role": "assistant",
        "content": "Use S3 presigned URLs for direct upload with API Gateway...",
    },
]
```
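To actually enable caching, the end of the static prefix has to be marked with a `cache_control` breakpoint. Here is a minimal sketch using the Anthropic Messages request format that `invoke_model` accepts on Bedrock; `build_cached_request` is an illustrative helper, not the article's gated implementation:

```python
import copy


def build_cached_request(system_prompt: str, examples: list, user_question: str) -> dict:
    """Build a Messages-API request body whose static prefix
    (system prompt + few-shot examples) is marked cacheable."""
    system_block = {"type": "text", "text": system_prompt}
    examples = copy.deepcopy(examples)

    if examples:
        # The cache breakpoint goes on the LAST static block: everything up
        # to and including it (system prompt + all examples) is the prefix.
        last = examples[-1]
        last["content"] = [{
            "type": "text",
            "text": last["content"],
            "cache_control": {"type": "ephemeral"},
        }]
    else:
        # No examples: cache just the system prompt
        system_block["cache_control"] = {"type": "ephemeral"}

    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "system": [system_block],
        # Only the final user message changes between requests, so it sits
        # after the breakpoint and is billed at the normal input rate.
        "messages": examples + [{"role": "user", "content": user_question}],
    }
```

The returned dict is what you would pass as `body=json.dumps(...)` to `bedrock.invoke_model`; on a cache hit, the response's `usage` block reports the prefix under `cache_read_input_tokens`.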


## Production Implementation

The complete implementation includes a cache-optimized query function with `cache_control` markers and a `CachedConversation` class for multi-turn conversations that automatically tracks cache hit rates and cost savings.

<PremiumCTA
title="Production Caching Implementation"
description="Get the complete CachedConversation class with cache performance tracking, savings calculator, and multi-turn conversation optimization."
/>

## When Caching Helps vs When It Doesn't

| Scenario | Caching Benefit | Why |
|----------|----------------|-----|
| Chatbot with large system prompt | **High** | System prompt cached across all user requests |
| RAG with static knowledge base prefix | **High** | Knowledge base cached, only query changes |
| Multi-turn conversations | **High** | Conversation history grows but prefix is stable |
| Batch processing same document | **High** | Document cached, each question is cheap |
| Unique prompts every request | **None** | Nothing to cache -- every request is different |
| Very short prompts (under 1024 tokens) | **None** | Below minimum cacheable size |
| Infrequent requests (over 5 min apart) | **Low** | Cache expires between requests |

## Measuring Cache Effectiveness

```python
def analyze_cache_performance(usage: dict, model: str = "sonnet") -> dict:
    """Analyze cache hit rate and cost savings from a response's usage data."""
    pricing = {
        "sonnet": {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75},
    }
    prices = pricing[model]

    # In the usage block, input_tokens counts only uncached tokens;
    # cache reads and writes are reported in separate fields.
    uncached = usage.get("input_tokens", 0)
    cache_read = usage.get("cache_read_input_tokens", 0)
    cache_write = usage.get("cache_creation_input_tokens", 0)
    total_input = uncached + cache_read + cache_write

    # Cost with caching
    actual_cost = (
        (uncached / 1_000_000 * prices["input"])
        + (cache_read / 1_000_000 * prices["cache_read"])
        + (cache_write / 1_000_000 * prices["cache_write"])
    )

    # Cost if every input token had been billed at the normal rate
    baseline_cost = total_input / 1_000_000 * prices["input"]

    return {
        "cache_hit_rate": f"{cache_read / max(total_input, 1) * 100:.1f}%",
        "actual_input_cost": round(actual_cost, 6),
        "baseline_input_cost": round(baseline_cost, 6),
        "savings": f"{(1 - actual_cost / max(baseline_cost, 1e-9)) * 100:.1f}%",
    }
```

## Flashcards

**Q:** How much do cache read tokens cost compared to normal input tokens?

**A:** Cache read tokens cost 90% less than normal input tokens. For example, Sonnet cache reads cost $0.30/1M vs $3.00/1M for normal input. Cache writes cost 25% more ($3.75/1M) but only happen once.

## Key Insight

Prompt caching delivers the biggest ROI when you have a large, static prefix (system prompt + few-shot examples + reference docs) shared across many requests. Restructure your prompts to put all static content first, mark it with `cache_control`, and aim for at least a few requests per minute to keep the cache warm.