Common Claude Performance Pitfalls
After working with dozens of production Claude deployments on AWS Bedrock, the same mistakes come up repeatedly. Each one is easy to fix once you know about it, but painful to debug in production. This page covers the seven most common pitfalls and exactly how to avoid them.
1. Timeout Errors
The most common production issue. The default read_timeout in boto3 is 60 seconds, but Claude responses (especially with extended thinking or long outputs) can take longer.
Symptoms: ReadTimeoutError after exactly 60 seconds, even though the model would have completed in 90 seconds.
The fix configures the boto3 client with an extended read_timeout (300s), a short connect_timeout (5s), and adaptive retry mode.
Production Fix Patterns
Get the complete code patterns for timeout configuration, retry with exponential backoff, cold start warmup, and region-optimized client setup.
Rule of thumb: Set read_timeout to at least 2x your expected maximum response time. For extended thinking, use 300-600 seconds.
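The configuration described above can be sketched as follows. This is an illustrative sketch, not the page's gated code: the helper name make_bedrock_client is invented here, and the values mirror the text (300s read, 5s connect, adaptive retries).

```python
# Timeout configuration sketch: long read_timeout for slow generations,
# short connect_timeout so an unreachable endpoint fails fast, and
# adaptive retry mode to absorb throttling.
import boto3
from botocore.config import Config

bedrock_config = Config(
    read_timeout=300,     # at least 2x your expected max response time
    connect_timeout=5,
    retries={"max_attempts": 5, "mode": "adaptive"},
)

def make_bedrock_client(region_name="us-east-1"):
    """Build a bedrock-runtime client with production timeouts applied."""
    return boto3.client(
        "bedrock-runtime", region_name=region_name, config=bedrock_config
    )
```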
2. Cold Start Latency
The first request to Bedrock in a new connection takes significantly longer due to TLS handshake, connection pooling setup, and potential model loading.
Symptoms: First request takes 2-5 seconds, subsequent requests take 500ms-2 seconds.
The warmup_bedrock() function sends a minimal 1-token Haiku request at startup to establish the TLS connection and warm the connection pool.
For Lambda functions, use provisioned concurrency or a scheduled warmup to keep connections alive.
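A minimal sketch of the warmup_bedrock() idea follows. The model ID is a placeholder (substitute a Haiku model ID available in your account), and the lazy boto3 import is a design choice here so a stub client can be injected for testing.

```python
# Warmup sketch: fire a tiny 1-token request at startup so the TLS
# handshake and connection-pool setup happen before real traffic.
import json

def warmup_bedrock(client=None, model_id="anthropic.claude-3-haiku-20240307-v1:0"):
    """Send a minimal 1-token request and discard the reply."""
    if client is None:
        import boto3  # imported lazily so tests can inject a stub client
        client = boto3.client("bedrock-runtime")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1,
        "messages": [{"role": "user", "content": "ping"}],
    })
    try:
        client.invoke_model(modelId=model_id, body=body)
    except Exception:
        # Warmup is best-effort: never let it block application startup.
        pass
```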
3. Oversized System Prompts
A common pattern is to stuff everything into the system prompt: examples, rules, context, formatting instructions, persona details. A 10,000-token system prompt is processed on every single request, adding latency and cost even when the user asks a simple question.
Symptoms: Consistently high input token counts even for short user messages. Slow responses across the board.
The fix replaces a bloated 10K-token system prompt with a lean one-liner and a build_messages() function that injects examples only when needed.
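The shape of that fix can be sketched like this. The prompt text, example content, and the needs_examples flag are all invented for illustration; the point is the structure, with examples injected per-request rather than living in the system prompt.

```python
# Lean system prompt: keep the always-on instruction to one line and
# inject few-shot examples only for the requests that need them.
SYSTEM_PROMPT = "You are a concise support assistant for billing questions."

FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "Example: how do I request a refund?"},
    {"role": "assistant", "content": "Example answer: open Billing, then Refunds."},
]

def build_messages(user_message: str, needs_examples: bool = False) -> list:
    """Prepend the few-shot examples only when the task calls for them."""
    messages = list(FEW_SHOT_EXAMPLES) if needs_examples else []
    messages.append({"role": "user", "content": user_message})
    return messages
```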
4. Not Using Streaming When You Should
Every user-facing application should use streaming. The time-to-first-token with streaming is 200-800ms versus waiting for the entire response (which could be 5-30 seconds).
Symptoms: Users complaining about slow responses. High bounce rates. Timeout errors on longer responses.
The fix switches from invoke_model (batch) to invoke_model_with_response_stream, sending text chunks to the user as they arrive rather than waiting for the full response.
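A sketch of that switch, assuming the Anthropic Messages API event format on Bedrock (chunk bytes containing content_block_delta events); the function name stream_response is illustrative.

```python
# Streaming sketch: yield text chunks as they arrive instead of
# blocking until the whole response is generated.
import json

def stream_response(client, model_id, messages, max_tokens=1024):
    """Generator yielding text deltas from a streaming Bedrock call."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": messages,
    })
    response = client.invoke_model_with_response_stream(
        modelId=model_id, body=body
    )
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        # Only content_block_delta events carry generated text.
        if chunk.get("type") == "content_block_delta":
            yield chunk["delta"].get("text", "")
```

The caller can forward each yielded chunk straight to the UI, so the user sees output at time-to-first-token rather than at time-to-last-token.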
5. Wrong Model for the Task​
Using Opus for text classification, or Haiku for complex code review, wastes resources: the first wastes money, the second sacrifices quality.
Symptoms: Unexpectedly high costs for simple tasks, or poor quality outputs that require human correction.
| Task | Wrong Model | Right Model | Cost Savings |
|---|---|---|---|
| Email classification | Opus 4.6 | Haiku 4.5 | ~95% |
| Sentiment analysis | Sonnet 4.6 | Haiku 4.5 | ~80% |
| Contract analysis | Haiku 4.5 | Sonnet 4.6 | N/A (quality issue) |
| System design review | Sonnet 4.6 | Opus 4.6 | N/A (quality issue) |
Implement a model router (covered in the Model Selection Guide) to automatically direct tasks to the appropriate model.
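A minimal router can be sketched as a lookup table mirroring the table above. The task keys and model names here are placeholders: substitute the actual Bedrock model IDs available in your account, and a classifier or heuristic for the task label.

```python
# Model router sketch: pick the cheapest model that is good enough
# for the task, defaulting to the mid-tier model for unknown tasks.
MODEL_ROUTES = {
    "email_classification": "claude-haiku",      # cheap and fast
    "sentiment_analysis": "claude-haiku",
    "contract_analysis": "claude-sonnet",        # stronger reasoning
    "system_design_review": "claude-opus",       # highest quality
}

def route_model(task: str, default: str = "claude-sonnet") -> str:
    """Map a task label to a model name, falling back to the default."""
    return MODEL_ROUTES.get(task, default)
```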
6. Missing Retry Logic with Exponential Backoff
Bedrock returns throttling errors (HTTP 429) when you exceed your account's throughput limits. Without retry logic, these become user-visible failures.
Symptoms: Intermittent ThrottlingException errors, especially during peak traffic.
Two approaches are shown: (1) boto3's built-in adaptive retry mode with 5 max attempts, and (2) a custom invoke_with_retry() function with exponential backoff plus jitter, capping wait time at 30 seconds.
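The custom variant can be sketched as below: exponential backoff with full jitter, capped at 30 seconds per wait. The helper names and the error-code extraction are illustrative; boto3's ClientError carries the code under exc.response["Error"]["Code"].

```python
# Retry sketch: exponential backoff with full jitter, 30s cap.
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def invoke_with_retry(call, max_attempts: int = 5,
                      retryable=("ThrottlingException",)):
    """Run call() and retry on throttling errors with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # boto3 ClientError keeps the code on exc.response; fall back
            # to the exception class name for anything else.
            code = getattr(exc, "response", {}).get("Error", {}).get(
                "Code", type(exc).__name__)
            if code not in retryable or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

Full jitter (rather than a fixed exponential schedule) spreads retries out so that throttled clients do not all hammer the endpoint again at the same instant.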
7. Region Mismatch
Deploying your application in us-west-2 but calling Bedrock in us-east-1 adds 60-120ms of network latency to every request. This is pure waste.
Symptoms: Consistently high TTFT even with Haiku. Network latency visible in traces.
The fix reads the region from AWS_REGION environment variable instead of hardcoding it. For multi-region deployments, a get_bedrock_client() function provides automatic failover across regions.
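That pattern can be sketched like this. The resolve_regions helper and the us-west-2 fallback are assumptions for illustration; pick fallback regions where Bedrock actually hosts your models.

```python
# Region sketch: resolve the primary region from AWS_REGION instead of
# hardcoding it, then try regions in order for a simple failover.
import os

def resolve_regions(env=None):
    """Primary region from AWS_REGION (default us-east-1) plus a fallback."""
    env = env if env is not None else os.environ
    primary = env.get("AWS_REGION", "us-east-1")
    fallback = "us-west-2" if primary != "us-west-2" else "us-east-1"
    return [primary, fallback]

def get_bedrock_client(regions=None):
    """Return a bedrock-runtime client in the first region that works."""
    import boto3  # imported lazily so the pure helpers are testable alone
    last_error = None
    for region in regions or resolve_regions():
        try:
            return boto3.client("bedrock-runtime", region_name=region)
        except Exception as exc:
            last_error = exc
    raise last_error
```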
Flashcards
What is the default boto3 read_timeout and why is it problematic for Claude?
The default read_timeout is 60 seconds. Claude responses with extended thinking or long outputs can take longer, causing ReadTimeoutError. Set read_timeout to at least 300 seconds for production use.
The single most impactful fix for most teams is increasing the read_timeout and adding proper retry logic. These two changes eliminate the majority of production incidents with Claude on Bedrock.