Token Optimization Techniques

Token optimization is about doing more with fewer tokens -- sending less, generating less, and making every token count. These techniques reduce both cost (you pay per token) and latency (fewer output tokens means faster responses). This page covers practical techniques you can apply immediately to any Claude workload on Amazon Bedrock.

Measuring Token Usage​

Before optimizing, you need to measure. Every Claude response includes usage data:

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_and_measure(messages: list, system: str | None = None, **kwargs) -> dict:
    """Invoke Claude and return response with detailed token metrics."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "messages": messages,
        "max_tokens": kwargs.get("max_tokens", 1024),
    }
    if system:
        body["system"] = system

    response = bedrock.invoke_model(
        modelId=kwargs.get("model_id", "us.anthropic.claude-sonnet-4-6-20250514"),
        contentType="application/json",
        accept="application/json",
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())

    usage = result["usage"]
    return {
        "text": result["content"][0]["text"],
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "total_tokens": usage["input_tokens"] + usage["output_tokens"],
    }
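The usage numbers above translate directly into dollars. A minimal sketch of a cost helper (the price arguments are placeholders, not real prices -- pass the current Bedrock on-demand rates for your model and region):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Dollar cost of one call, given per-million-token prices.

    Look up the current per-million-token input/output prices for your
    model on the Bedrock pricing page; they are deliberately not
    hardcoded here because they vary by model and region.
    """
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000
```

Track this per task type over a day of traffic and the sections below tell you where to cut first.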

For pre-flight token counting before invoking the model:

# Estimate tokens before sending (rough approximation)
def estimate_tokens(text: str) -> int:
    """Approximate token count. Claude averages ~1.3 tokens per word in English."""
    return int(len(text.split()) * 1.3)

# More accurate: the Anthropic Python SDK exposes a token counting endpoint
# (a lightweight API call that counts tokens without generating any output)
# pip install anthropic
from anthropic import Anthropic

client = Anthropic()
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Your text here"}],
)
print(count.input_tokens)

System Prompt Optimization​

The system prompt is sent with every request. Reducing it by 500 tokens saves those tokens on every single call.

System prompts are often 5-10x longer than necessary. A well-structured concise prompt produces the same quality output while reducing input tokens on every request.

Key principles:

  • Remove filler words and conversational instructions ("You are an expert...", "Please make sure to...")
  • Use bullet points and abbreviations instead of full sentences
  • Remove policies/rules that Claude can infer from context
  • Keep only the instructions that change Claude's behavior
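A hypothetical before/after applying these principles, with the savings estimated by the ~1.3 tokens-per-word heuristic from the measuring section (both prompts are invented for illustration):

```python
# Before: conversational filler, instructions Claude can infer
verbose_system = (
    "You are an expert customer support agent. Please make sure to always "
    "be helpful and professional. When a user asks a question, you should "
    "carefully consider their request and respond with a clear, concise "
    "answer. Remember to never share internal information."
)

# After: only the instructions that change behavior
concise_system = (
    "Customer support agent.\n"
    "- Answer clearly and concisely\n"
    "- Never share internal information"
)

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~1.3 tokens per English word."""
    return int(len(text.split()) * 1.3)

saved_per_call = estimate_tokens(verbose_system) - estimate_tokens(concise_system)
print(f"~{saved_per_call} tokens saved on every request")
```

Multiply the per-call saving by your daily request count to see what trimming the system prompt is actually worth.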
Premium

System Prompt Optimization Playbook

Get our before/after system prompt templates with specific token savings calculations and cost projections for high-volume deployments.

Output Format Control​

Controlling output format is the single most effective way to reduce output tokens:

JSON Mode​

# Verbose: free-form response (~300 tokens)
prompt_verbose = "What are the top 3 AWS services for serverless?"
# Response: "Great question! When it comes to serverless computing on AWS,
# there are several excellent services to consider. Let me walk you through
# the top three options: ..."

# Concise: JSON response (~60 tokens)
prompt_json = """List the top 3 AWS serverless services. Respond with JSON only:
{"services": [{"name": "string", "use_case": "string"}]}"""

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [
        {"role": "user", "content": prompt_json},
        {"role": "assistant", "content": "{"},  # Force JSON start
    ],
    "max_tokens": 200,
})
# The model continues from the prefilled "{", so prepend it before parsing:
# "{" + output -> {"services": [{"name": "Lambda", "use_case": "compute"}, ...]}

Prefilled Assistant Messages​

Skip preamble by starting Claude's response:

messages = [
    {"role": "user", "content": "Classify this as spam or not spam: 'You won $1M!'"},
    {"role": "assistant", "content": "Classification: "},
]
# Output: "spam" (1 token, instead of 30+ for "Based on my analysis,
# this message appears to be spam because...")

Structured Templates​

system = """Respond using this exact template, filling in only the bracketed sections:
SEVERITY: [low|medium|high]
CATEGORY: [one word]
ACTION: [one sentence]"""

# This produces ~15 output tokens instead of a 200-token paragraph
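Template output this rigid is also trivial to parse. A small sketch (the `SEVERITY`/`CATEGORY`/`ACTION` field names match the template above; the sample reply is invented):

```python
def parse_template(text: str) -> dict:
    """Parse a KEY: value template reply into a dict with lowercase keys."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    return fields

reply = "SEVERITY: high\nCATEGORY: billing\nACTION: Escalate to the payments team."
parsed = parse_template(reply)
# parsed["severity"] == "high"
```

Compared with free-form prose, the template makes downstream handling a dictionary lookup instead of a second LLM call.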

Few-Shot vs Zero-Shot Tradeoffs​

Few-shot examples improve quality but increase input tokens. The tradeoff depends on your volume:

| Approach | Input Tokens | Output Quality | Best When |
|----------|--------------|----------------|-----------|
| Zero-shot | Low | Good for simple tasks | High volume, simple tasks |
| 1-shot | +100-300 per example | Better consistency | Medium volume, formatting matters |
| 3-5 shot | +300-1500 | Highest consistency | Low volume, complex format |

# Zero-shot: minimal input tokens
zero_shot = [{"role": "user", "content": "Extract entities: 'John works at Google in NYC'"}]

# One-shot: +~200 input tokens but much better format consistency
one_shot = [
    {"role": "user", "content": "Extract entities: 'Alice is a doctor at Mayo Clinic in Rochester'"},
    {"role": "assistant", "content": '{"people": ["Alice"], "orgs": ["Mayo Clinic"], "locations": ["Rochester"]}'},
    {"role": "user", "content": "Extract entities: 'John works at Google in NYC'"},
]

Decision rule: Start zero-shot. If output format is inconsistent, add 1 example. Only add more if quality is still insufficient. With prompt caching enabled, few-shot examples in the cached prefix add minimal cost after the first request.
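The decision rule can be automated per request rather than decided once. A sketch, where `invoke` stands for whatever wrapper you use around the Bedrock client (it is a placeholder callable, not a Bedrock API):

```python
import json

def is_valid_json(text: str) -> bool:
    """True if the model's reply parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def extract_with_fallback(invoke, zero_shot_msgs: list, one_shot_msgs: list) -> str:
    """Try zero-shot first; pay for the example tokens only when the
    zero-shot reply fails JSON validation."""
    reply = invoke(zero_shot_msgs)
    if is_valid_json(reply):
        return reply
    return invoke(one_shot_msgs)
```

If most requests succeed zero-shot, you pay the few-shot premium only on the minority that need it.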

max_tokens Parameter Impact​

max_tokens sets the hard ceiling on output length. Claude stops generating when it hits this limit, even mid-sentence.

# Common mistake: using max_tokens as a "quality" setting
# max_tokens=4096 doesn't make the response better -- it just allows longer output

# Right approach: set max_tokens based on expected output length
task_limits = {
    "classification": 10,        # "positive", "negative", etc.
    "entity_extraction": 200,    # JSON with extracted entities
    "summarize_email": 150,      # 2-3 sentence summary
    "code_review": 1000,         # Detailed feedback
    "document_analysis": 2000,   # Comprehensive analysis
}

def invoke_with_task_limit(messages: list, task: str) -> dict:
    max_tokens = task_limits.get(task, 1024)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": messages,
        "max_tokens": max_tokens,
    })
    response = bedrock.invoke_model(
        modelId="us.anthropic.claude-sonnet-4-6-20250514",
        contentType="application/json",
        accept="application/json",
        body=body,
    )
    return json.loads(response["body"].read())

Important: max_tokens does not affect cost directly -- you pay only for tokens actually generated. A tight limit caps the worst-case output cost, but it can also truncate a response mid-sentence; check the response's stop_reason field (it is "max_tokens" when output was cut off) before trusting a reply generated near the limit.
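A sketch of handling that truncation case, where `invoke` is a placeholder for your client wrapper that returns the parsed response body:

```python
def invoke_with_truncation_retry(invoke, body: dict, factor: int = 2) -> dict:
    """Invoke once; if the reply was cut off (stop_reason == "max_tokens"),
    retry once with a larger limit instead of shipping a truncated reply."""
    result = invoke(body)
    if result.get("stop_reason") == "max_tokens":
        body = {**body, "max_tokens": body["max_tokens"] * factor}
        result = invoke(body)
    return result
```

This keeps the tight per-task limits as the common case while making the rare overflow self-correcting.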

Chunking Strategies for Long Documents​

When processing documents that exceed the context window, or when you want to optimize costs for very long documents, you need a chunking strategy.

Key concepts:

  • Fixed-size chunking with overlap prevents losing context at chunk boundaries
  • Extract then synthesize: use a cheaper model (Haiku) for per-chunk extraction, then a stronger model (Sonnet) for final synthesis
  • Map-reduce pattern: summarize individual documents in parallel, then combine summaries into a final overview
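A minimal sketch of the first concept, fixed-size chunking with overlap (character-based for simplicity; sizing by estimated tokens is more precise but the pattern is identical):

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so content
    that straddles a boundary appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk then goes to the per-chunk extraction step, and the overlap region is what keeps boundary-spanning sentences from being split across two extractions.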
Premium

Document Processing Patterns

Get the complete chunking, extract-synthesize, and map-reduce implementations with multi-model routing for cost-optimized document processing at scale.

Flashcards​

Question

How many tokens does Claude use per word on average in English text?

Answer

Approximately 1.3 tokens per word. This means 1,000 words of English text is roughly 1,300 tokens. Use this as a quick estimate; for exact counts, use Anthropic's token counting API.

Key Insight

The biggest token optimization wins come from output control, not input reduction. Prefilled assistant messages, JSON output format, and tight max_tokens limits can reduce output tokens by 80% or more. Since output tokens typically cost around 5x more than input tokens, this is where the real savings are.