Common Claude Performance Pitfalls
After working with dozens of production Claude deployments on AWS Bedrock, the same mistakes come up repeatedly. Each one is easy to fix once you know about it, but painful to debug in production. This page covers the seven most common pitfalls and exactly how to avoid them.
1. Timeout Errors
The most common production issue. The default read_timeout in boto3 is 60 seconds, but Claude responses (especially with extended thinking or long outputs) can take longer.
Symptoms: ReadTimeoutError after exactly 60 seconds, even though the model would have completed in 90 seconds.
The fix configures the boto3 client with an extended read_timeout (300s), a short connect_timeout (5s), and adaptive retry mode.
Production Fix Patterns
Get the complete code patterns for timeout configuration, retry with exponential backoff, cold start warmup, and region-optimized client setup.
Rule of thumb: Set read_timeout to at least 2x your expected maximum response time. For extended thinking, use 300-600 seconds.
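The configuration described above can be sketched as follows. This is an illustrative sketch, not the page's gated code: the helper name make_bedrock_client is invented here, and the values mirror the text (300s read, 5s connect, adaptive retries).

```python
# Timeout configuration sketch: long read_timeout for slow generations,
# short connect_timeout so an unreachable endpoint fails fast, and
# adaptive retry mode to absorb throttling.
import boto3
from botocore.config import Config

bedrock_config = Config(
    read_timeout=300,     # at least 2x your expected max response time
    connect_timeout=5,
    retries={"max_attempts": 5, "mode": "adaptive"},
)

def make_bedrock_client(region_name="us-east-1"):
    """Build a bedrock-runtime client with production timeouts applied."""
    return boto3.client(
        "bedrock-runtime", region_name=region_name, config=bedrock_config
    )
```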
2. Cold Start Latency
The first request to Bedrock in a new connection takes significantly longer due to TLS handshake, connection pooling setup, and potential model loading.
Symptoms: First request takes 2-5 seconds, subsequent requests take 500ms-2 seconds.
The warmup_bedrock() function sends a minimal 1-token Haiku request at startup to establish the TLS connection and warm the connection pool.
For Lambda functions, use provisioned concurrency or a scheduled warmup to keep connections alive.
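A minimal sketch of the warmup_bedrock() idea follows. The model ID is a placeholder (substitute a Haiku model ID available in your account), and the lazy boto3 import is a design choice here so a stub client can be injected for testing.

```python
# Warmup sketch: fire a tiny 1-token request at startup so the TLS
# handshake and connection-pool setup happen before real traffic.
import json

def warmup_bedrock(client=None, model_id="anthropic.claude-3-haiku-20240307-v1:0"):
    """Send a minimal 1-token request and discard the reply."""
    if client is None:
        import boto3  # imported lazily so tests can inject a stub client
        client = boto3.client("bedrock-runtime")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1,
        "messages": [{"role": "user", "content": "ping"}],
    })
    try:
        client.invoke_model(modelId=model_id, body=body)
    except Exception:
        # Warmup is best-effort: never let it block application startup.
        pass
```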
3. Oversized System Prompts
A common pattern is to stuff everything into the system prompt: examples, rules, context, formatting instructions, persona details. A 10,000-token system prompt is processed on every single request, adding latency and cost even when the user asks a simple question.
Symptoms: Consistently high input token counts even for short user messages. Slow responses across the board.
The fix replaces a bloated 10K-token system prompt with a lean one-liner and a build_messages() function that injects examples only when needed.
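The shape of that fix can be sketched like this. The prompt text, example content, and the needs_examples flag are all invented for illustration; the point is the structure, with examples injected per-request rather than living in the system prompt.

```python
# Lean system prompt: keep the always-on instruction to one line and
# inject few-shot examples only for the requests that need them.
SYSTEM_PROMPT = "You are a concise support assistant for billing questions."

FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "Example: how do I request a refund?"},
    {"role": "assistant", "content": "Example answer: open Billing, then Refunds."},
]

def build_messages(user_message: str, needs_examples: bool = False) -> list:
    """Prepend the few-shot examples only when the task calls for them."""
    messages = list(FEW_SHOT_EXAMPLES) if needs_examples else []
    messages.append({"role": "user", "content": user_message})
    return messages
```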
4. Not Using Streaming When You Should
Every user-facing application should use streaming. The time-to-first-token with streaming is 200-800ms versus waiting for the entire response (which could be 5-30 seconds).
Symptoms: Users complaining about slow responses. High bounce rates. Timeout errors on longer responses.
The fix switches from invoke_model (batch) to invoke_model_with_response_stream, sending text chunks to the user as they arrive rather than waiting for the full response.
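A sketch of that switch, assuming the Anthropic Messages API event format on Bedrock (chunk bytes containing content_block_delta events); the function name stream_response is illustrative.

```python
# Streaming sketch: yield text chunks as they arrive instead of
# blocking until the whole response is generated.
import json

def stream_response(client, model_id, messages, max_tokens=1024):
    """Generator yielding text deltas from a streaming Bedrock call."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": messages,
    })
    response = client.invoke_model_with_response_stream(
        modelId=model_id, body=body
    )
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        # Only content_block_delta events carry generated text.
        if chunk.get("type") == "content_block_delta":
            yield chunk["delta"].get("text", "")
```

The caller can forward each yielded chunk straight to the UI, so the user sees output at time-to-first-token rather than at time-to-last-token.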
5. Wrong Model for the Task​
Using Opus for text classification, or Haiku for complex code review, wastes resources: the first wastes money, the second sacrifices quality.
Symptoms: Unexpectedly high costs for simple tasks, or poor quality outputs that require human correction.
| Task | Wrong Model | Right Model | Cost Savings |
|---|---|---|---|
| Email classification | Opus 4.6 | Haiku 4.5 | ~95% |
| Sentiment analysis | Sonnet 4.6 | Haiku 4.5 | ~80% |
| Contract analysis | Haiku 4.5 | Sonnet 4.6 | N/A (quality issue) |
| System design review | Sonnet 4.6 | Opus 4.6 | N/A (quality issue) |
Implement a model router (covered in the Model Selection Guide) to automatically direct tasks to the appropriate model.
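A minimal router can be sketched as a lookup table mirroring the table above. The task keys and model names here are placeholders: substitute the actual Bedrock model IDs available in your account, and a classifier or heuristic for the task label.

```python
# Model router sketch: pick the cheapest model that is good enough
# for the task, defaulting to the mid-tier model for unknown tasks.
MODEL_ROUTES = {
    "email_classification": "claude-haiku",      # cheap and fast
    "sentiment_analysis": "claude-haiku",
    "contract_analysis": "claude-sonnet",        # stronger reasoning
    "system_design_review": "claude-opus",       # highest quality
}

def route_model(task: str, default: str = "claude-sonnet") -> str:
    """Map a task label to a model name, falling back to the default."""
    return MODEL_ROUTES.get(task, default)
```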
6. Missing Retry Logic with Exponential Backoff
Bedrock returns throttling errors (HTTP 429) when you exceed your account's throughput limits. Without retry logic, these become user-visible failures.
Symptoms: Intermittent ThrottlingException errors, especially during peak traffic.
Two approaches are shown: (1) boto3's built-in adaptive retry mode with 5 max attempts, and (2) a custom invoke_with_retry() function with exponential backoff plus jitter, capping wait time at 30 seconds.
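The custom variant can be sketched as below: exponential backoff with full jitter, capped at 30 seconds per wait. The helper names and the error-code extraction are illustrative; boto3's ClientError carries the code under exc.response["Error"]["Code"].

```python
# Retry sketch: exponential backoff with full jitter, 30s cap.
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def invoke_with_retry(call, max_attempts: int = 5,
                      retryable=("ThrottlingException",)):
    """Run call() and retry on throttling errors with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # boto3 ClientError keeps the code on exc.response; fall back
            # to the exception class name for anything else.
            code = getattr(exc, "response", {}).get("Error", {}).get(
                "Code", type(exc).__name__)
            if code not in retryable or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

Full jitter (rather than a fixed exponential schedule) spreads retries out so that throttled clients do not all hammer the endpoint again at the same instant.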
7. Region Mismatch
Deploying your application in us-west-2 but calling Bedrock in us-east-1 adds 60-120ms of network latency to every request. This is pure waste.
Symptoms: Consistently high TTFT even with Haiku. Network latency visible in traces.
The fix reads the region from AWS_REGION environment variable instead of hardcoding it. For multi-region deployments, a get_bedrock_client() function provides automatic failover across regions.
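That pattern can be sketched like this. The resolve_regions helper and the us-west-2 fallback are assumptions for illustration; pick fallback regions where Bedrock actually hosts your models.

```python
# Region sketch: resolve the primary region from AWS_REGION instead of
# hardcoding it, then try regions in order for a simple failover.
import os

def resolve_regions(env=None):
    """Primary region from AWS_REGION (default us-east-1) plus a fallback."""
    env = env if env is not None else os.environ
    primary = env.get("AWS_REGION", "us-east-1")
    fallback = "us-west-2" if primary != "us-west-2" else "us-east-1"
    return [primary, fallback]

def get_bedrock_client(regions=None):
    """Return a bedrock-runtime client in the first region that works."""
    import boto3  # imported lazily so the pure helpers are testable alone
    last_error = None
    for region in regions or resolve_regions():
        try:
            return boto3.client("bedrock-runtime", region_name=region)
        except Exception as exc:
            last_error = exc
    raise last_error
```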
Flashcards
What is the default boto3 read_timeout and why is it problematic for Claude?
The default read_timeout is 60 seconds. Claude responses with extended thinking or long outputs can take longer, causing ReadTimeoutError. Set read_timeout to at least 300 seconds for production use.
The single most impactful fix for most teams is increasing the read_timeout and adding proper retry logic. These two changes eliminate the majority of production incidents with Claude on Bedrock.