Token Optimization Techniques

Token optimization is about doing more with fewer tokens -- sending less, generating less, and making every token count. These techniques reduce both cost (you pay per token) and latency (fewer output tokens means faster responses). This page covers practical techniques you can apply immediately to any Claude workload on Amazon Bedrock.

Measuring Token Usage​

Before optimizing, you need to measure. Every Claude response includes usage data:

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_and_measure(messages: list, system: str | None = None, **kwargs) -> dict:
    """Invoke Claude and return response with detailed token metrics."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "messages": messages,
        "max_tokens": kwargs.get("max_tokens", 1024),
    }
    if system:
        body["system"] = system

    response = bedrock.invoke_model(
        modelId=kwargs.get("model_id", "us.anthropic.claude-sonnet-4-6-20250514"),
        contentType="application/json",
        accept="application/json",
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())

    usage = result["usage"]
    return {
        "text": result["content"][0]["text"],
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "total_tokens": usage["input_tokens"] + usage["output_tokens"],
    }
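The usage numbers above translate directly into dollars. A minimal sketch of a cost helper (the price arguments are placeholders, not real prices -- pass the current Bedrock on-demand rates for your model and region):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Dollar cost of one call, given per-million-token prices.

    Look up the current per-million-token input/output prices for your
    model on the Bedrock pricing page; they are deliberately not
    hardcoded here because they vary by model and region.
    """
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000
```

Track this per task type over a day of traffic and the sections below tell you where to cut first.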

For pre-flight token counting before invoking the model:

# Estimate tokens before sending (rough approximation)
def estimate_tokens(text: str) -> int:
    """Approximate token count. Claude averages ~1.3 tokens per word in English."""
    return int(len(text.split()) * 1.3)

# More accurate: the Anthropic Python SDK exposes a token counting endpoint
# (a lightweight API call that counts tokens without generating any output)
# pip install anthropic
from anthropic import Anthropic

client = Anthropic()
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Your text here"}],
)
print(count.input_tokens)

System Prompt Optimization​

The system prompt is sent with every request. Reducing it by 500 tokens saves those tokens on every single call.

System prompts are often 5-10x longer than necessary. A well-structured concise prompt produces the same quality output while reducing input tokens on every request.

Key principles:

  • Remove filler words and conversational instructions ("You are an expert...", "Please make sure to...")
  • Use bullet points and abbreviations instead of full sentences
  • Remove policies/rules that Claude can infer from context
  • Keep only the instructions that change Claude's behavior
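A hypothetical before/after applying these principles, with the savings estimated by the ~1.3 tokens-per-word heuristic from the measuring section (both prompts are invented for illustration):

```python
# Before: conversational filler, instructions Claude can infer
verbose_system = (
    "You are an expert customer support agent. Please make sure to always "
    "be helpful and professional. When a user asks a question, you should "
    "carefully consider their request and respond with a clear, concise "
    "answer. Remember to never share internal information."
)

# After: only the instructions that change behavior
concise_system = (
    "Customer support agent.\n"
    "- Answer clearly and concisely\n"
    "- Never share internal information"
)

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~1.3 tokens per English word."""
    return int(len(text.split()) * 1.3)

saved_per_call = estimate_tokens(verbose_system) - estimate_tokens(concise_system)
print(f"~{saved_per_call} tokens saved on every request")
```

Multiply the per-call saving by your daily request count to see what trimming the system prompt is actually worth.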
Premium

System Prompt Optimization Playbook

Get our before/after system prompt templates with specific token savings calculations and cost projections for high-volume deployments.

Output Format Control​

Controlling output format is the single most effective way to reduce output tokens:

JSON Mode​

# Verbose: free-form response (~300 tokens)
prompt_verbose = "What are the top 3 AWS services for serverless?"
# Response: "Great question! When it comes to serverless computing on AWS,
# there are several excellent services to consider. Let me walk you through
# the top three options: ..."

# Concise: JSON response (~60 tokens)
prompt_json = """List the top 3 AWS serverless services. Respond with JSON only:
{"services": [{"name": "string", "use_case": "string"}]}"""

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [
        {"role": "user", "content": prompt_json},
        {"role": "assistant", "content": "{"},  # Force JSON start
    ],
    "max_tokens": 200,
})
# The model continues from the prefilled "{", so prepend it before parsing:
# "{" + output -> {"services": [{"name": "Lambda", "use_case": "compute"}, ...]}

Prefilled Assistant Messages​

Skip preamble by starting Claude's response:

messages = [
    {"role": "user", "content": "Classify this as spam or not spam: 'You won $1M!'"},
    {"role": "assistant", "content": "Classification: "},
]
# Output: "spam" (1 token, instead of 30+ for "Based on my analysis,
# this message appears to be spam because...")

Structured Templates​

system = """Respond using this exact template, filling in only the bracketed sections:
SEVERITY: [low|medium|high]
CATEGORY: [one word]
ACTION: [one sentence]"""

# This produces ~15 output tokens instead of a 200-token paragraph
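Template output this rigid is also trivial to parse. A small sketch (the `SEVERITY`/`CATEGORY`/`ACTION` field names match the template above; the sample reply is invented):

```python
def parse_template(text: str) -> dict:
    """Parse a KEY: value template reply into a dict with lowercase keys."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    return fields

reply = "SEVERITY: high\nCATEGORY: billing\nACTION: Escalate to the payments team."
parsed = parse_template(reply)
# parsed["severity"] == "high"
```

Compared with free-form prose, the template makes downstream handling a dictionary lookup instead of a second LLM call.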

Few-Shot vs Zero-Shot Tradeoffs​

Few-shot examples improve quality but increase input tokens. The tradeoff depends on your volume:

| Approach | Input Tokens | Output Quality | Best When |
|----------|--------------|----------------|-----------|
| Zero-shot | Low | Good for simple tasks | High volume, simple tasks |
| 1-shot | +100-300 per example | Better consistency | Medium volume, formatting matters |
| 3-5 shot | +300-1500 | Highest consistency | Low volume, complex format |

# Zero-shot: minimal input tokens
zero_shot = [{"role": "user", "content": "Extract entities: 'John works at Google in NYC'"}]

# One-shot: +~200 input tokens but much better format consistency
one_shot = [
    {"role": "user", "content": "Extract entities: 'Alice is a doctor at Mayo Clinic in Rochester'"},
    {"role": "assistant", "content": '{"people": ["Alice"], "orgs": ["Mayo Clinic"], "locations": ["Rochester"]}'},
    {"role": "user", "content": "Extract entities: 'John works at Google in NYC'"},
]

Decision rule: Start zero-shot. If output format is inconsistent, add 1 example. Only add more if quality is still insufficient. With prompt caching enabled, few-shot examples in the cached prefix add minimal cost after the first request.
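The decision rule can be automated per request rather than decided once. A sketch, where `invoke` stands for whatever wrapper you use around the Bedrock client (it is a placeholder callable, not a Bedrock API):

```python
import json

def is_valid_json(text: str) -> bool:
    """True if the model's reply parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def extract_with_fallback(invoke, zero_shot_msgs: list, one_shot_msgs: list) -> str:
    """Try zero-shot first; pay for the example tokens only when the
    zero-shot reply fails JSON validation."""
    reply = invoke(zero_shot_msgs)
    if is_valid_json(reply):
        return reply
    return invoke(one_shot_msgs)
```

If most requests succeed zero-shot, you pay the few-shot premium only on the minority that need it.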

max_tokens Parameter Impact​

max_tokens sets the hard ceiling on output length. Claude stops generating when it hits this limit, even mid-sentence.

# Common mistake: using max_tokens as a "quality" setting
# max_tokens=4096 doesn't make the response better -- it just allows longer output

# Right approach: set max_tokens based on expected output length
task_limits = {
    "classification": 10,        # "positive", "negative", etc.
    "entity_extraction": 200,    # JSON with extracted entities
    "summarize_email": 150,      # 2-3 sentence summary
    "code_review": 1000,         # Detailed feedback
    "document_analysis": 2000,   # Comprehensive analysis
}

def invoke_with_task_limit(messages: list, task: str) -> dict:
    max_tokens = task_limits.get(task, 1024)
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": messages,
        "max_tokens": max_tokens,
    })
    response = bedrock.invoke_model(
        modelId="us.anthropic.claude-sonnet-4-6-20250514",
        contentType="application/json",
        accept="application/json",
        body=body,
    )
    return json.loads(response["body"].read())

Important: max_tokens does not affect cost directly -- you pay only for tokens actually generated. A tight limit caps the worst-case output cost, but it can also truncate a response mid-sentence; check the response's stop_reason field (it is "max_tokens" when output was cut off) before trusting a reply generated near the limit.
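A sketch of handling that truncation case, where `invoke` is a placeholder for your client wrapper that returns the parsed response body:

```python
def invoke_with_truncation_retry(invoke, body: dict, factor: int = 2) -> dict:
    """Invoke once; if the reply was cut off (stop_reason == "max_tokens"),
    retry once with a larger limit instead of shipping a truncated reply."""
    result = invoke(body)
    if result.get("stop_reason") == "max_tokens":
        body = {**body, "max_tokens": body["max_tokens"] * factor}
        result = invoke(body)
    return result
```

This keeps the tight per-task limits as the common case while making the rare overflow self-correcting.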

Chunking Strategies for Long Documents​

When processing documents that exceed the context window, or when you want to optimize costs for very long documents, you need a chunking strategy.

Key concepts:

  • Fixed-size chunking with overlap prevents losing context at chunk boundaries
  • Extract then synthesize: use a cheaper model (Haiku) for per-chunk extraction, then a stronger model (Sonnet) for final synthesis
  • Map-reduce pattern: summarize individual documents in parallel, then combine summaries into a final overview
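A minimal sketch of the first concept, fixed-size chunking with overlap (character-based for simplicity; sizing by estimated tokens is more precise but the pattern is identical):

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so content
    that straddles a boundary appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk then goes to the per-chunk extraction step, and the overlap region is what keeps boundary-spanning sentences from being split across two extractions.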
Premium

Document Processing Patterns

Get the complete chunking, extract-synthesize, and map-reduce implementations with multi-model routing for cost-optimized document processing at scale.

Flashcards​

Question

How many tokens does Claude use per word on average in English text?

Answer

Approximately 1.3 tokens per word. This means 1,000 words of English text is roughly 1,300 tokens. Use this as a quick estimate; for exact counts, use Anthropic's token counting API.

Key Insight

The biggest token optimization wins come from output control, not input reduction. Prefilled assistant messages, JSON output format, and tight max_tokens limits can reduce output tokens by 80% or more. Since output tokens typically cost around 5x more than input tokens, this is where the real savings are.