Streaming vs Batch Response Patterns
Streaming is one of the most impactful optimizations for user-facing Claude applications. Instead of waiting for the entire response to generate, streaming delivers tokens as they are produced. For a response that takes 8 seconds to complete, streaming can show the first token in under 500ms. This section covers when and how to implement streaming on Amazon Bedrock.
Why Streaming Improves Perceived Latency
With batch (synchronous) invocation, users stare at a loading spinner until the entire response is ready. With streaming, text appears almost immediately and flows continuously.
| Metric | Batch | Streaming |
|---|---|---|
| Time-to-first-token (TTFT) | Same as total time | 200-800ms |
| Total completion time | Baseline | Roughly the same (slight chunking overhead) |
| Perceived user experience | Slow, jarring | Fast, natural |
| Memory for long responses | All at once | Incremental |
Streaming does not make the total response faster. It may even add a few percent overhead from chunked transfer encoding. But the user experience improvement is dramatic -- users perceive streaming responses as 3-5x faster even when total time is identical.
Bedrock Streaming API
Amazon Bedrock provides InvokeModelWithResponseStream for streaming responses. The response arrives as a series of events, each containing a chunk of the response.
The stream_claude_response() generator function calls InvokeModelWithResponseStream, parses each event chunk, and yields text deltas as they arrive from Bedrock.
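A minimal sketch of such a generator, assuming boto3 and the Anthropic Messages API request/event format on Bedrock; the model ID is illustrative, and the client is injectable so the event parsing can be exercised without an AWS connection:

```python
import json

def stream_claude_response(prompt,
                           model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
                           client=None, max_tokens=1024):
    """Yield text deltas as they arrive from Bedrock."""
    if client is None:
        import boto3  # imported lazily so the parsing logic is testable offline
        client = boto3.client("bedrock-runtime")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })
    response = client.invoke_model_with_response_stream(modelId=model_id, body=body)
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        # Only content_block_delta events carry new text
        if chunk.get("type") == "content_block_delta":
            delta = chunk.get("delta", {})
            if delta.get("type") == "text_delta":
                yield delta["text"]
```

In a terminal UI you would consume it with something like `for piece in stream_claude_response("Explain TTFT"): print(piece, end="", flush=True)`.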
Production Streaming Implementation
Get the complete streaming implementation with time-to-first-token measurement, error handling, and retry logic for Bedrock.
Complete Streaming Implementation with Metadata
In production, you need the full response text plus usage metadata. Here is a complete implementation:
The invoke_claude_streaming() function provides a full streaming invocation with TTFT measurement, token usage tracking across all event types (message_start, content_block_delta, message_delta, message_stop), and returns the assembled response with performance metadata.
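A sketch of that shape, under the same assumptions as above (illustrative model ID, injectable client, Messages API event types); field names in the returned dict are this sketch's own, not a Bedrock convention:

```python
import json
import time

def invoke_claude_streaming(prompt,
                            model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
                            client=None, max_tokens=1024):
    """Stream a response and return the assembled text plus performance metadata."""
    if client is None:
        import boto3  # lazy import keeps the event handling testable offline
        client = boto3.client("bedrock-runtime")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })
    start = time.monotonic()
    response = client.invoke_model_with_response_stream(modelId=model_id, body=body)
    parts, usage, ttft = [], {}, None
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        ctype = chunk.get("type")
        if ctype == "message_start":
            usage.update(chunk["message"].get("usage", {}))   # input_tokens
        elif ctype == "content_block_delta":
            if ttft is None:
                ttft = time.monotonic() - start               # first token seen
            delta = chunk.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta["text"])
        elif ctype == "message_delta":
            usage.update(chunk.get("usage", {}))              # output_tokens
        # message_stop carries invocation metrics; nothing to assemble here
    return {
        "text": "".join(parts),
        "ttft_seconds": ttft,
        "total_seconds": time.monotonic() - start,
        "usage": usage,
    }
```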
When to Use Streaming vs Batch
Streaming is not always the right choice. Here is a decision framework:
| Use Case | Recommendation | Why |
|---|---|---|
| Chatbot / conversational UI | Streaming | Users need immediate feedback |
| API endpoint returning JSON | Batch | You need the complete response to parse it |
| Document processing pipeline | Batch | No user watching; simpler error handling |
| Real-time coding assistant | Streaming | Developers expect progressive output |
| Webhook / async processing | Batch | Response goes into a queue, not to a user |
| Long-form content generation | Streaming | Avoids timeout issues on very long responses |
The unified invoke_claude() function provides a single interface that dispatches to either streaming or batch mode based on a boolean flag, simplifying calling code.
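One way to sketch that dispatcher, assuming the same boto3 client and Messages API formats as the earlier examples (the batch path uses `invoke_model`, whose payload carries a `content` list of blocks):

```python
import json

def invoke_claude(prompt, stream=False,
                  model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
                  client=None, max_tokens=1024):
    """Single entry point: dispatch to streaming or batch based on the flag."""
    if client is None:
        import boto3
        client = boto3.client("bedrock-runtime")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })
    if stream:
        # Streaming path: assemble text from the event stream
        response = client.invoke_model_with_response_stream(modelId=model_id, body=body)
        parts = []
        for event in response["body"]:
            chunk = json.loads(event["chunk"]["bytes"])
            if chunk.get("type") == "content_block_delta":
                delta = chunk.get("delta", {})
                if delta.get("type") == "text_delta":
                    parts.append(delta["text"])
        return "".join(parts)
    # Batch path: one blocking call, then parse the complete payload
    response = client.invoke_model(modelId=model_id, body=body)
    payload = json.loads(response["body"].read())
    return "".join(b["text"] for b in payload.get("content", []) if b.get("type") == "text")
```

In a real codebase the two branches would delegate to the full implementations rather than inlining the parsing, but the interface is the point: callers choose a mode with one flag.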
Error Handling in Streaming Responses
Streaming introduces failure modes that batch does not have. The connection can drop mid-stream, chunks can arrive corrupted, or the model can hit a content filter partway through.
The safe_stream() function wraps streaming invocation with retry logic that handles EventStreamError (mid-stream interruptions), ThrottlingException (with exponential backoff), and ModelTimeoutException (by halving max_tokens).
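A dependency-free sketch of that retry policy. The callable `stream_fn(max_tokens)` stands in for the streaming invocation, and errors are matched by class name as a simplification; in real botocore code, `ThrottlingException` typically surfaces as a `ClientError` whose error code you would inspect instead:

```python
import time

def safe_stream(stream_fn, max_retries=3, max_tokens=1024, base_delay=1.0):
    """Retry a streaming call, adapting the strategy to the failure mode.

    stream_fn(max_tokens) performs the invocation and returns the full text.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return stream_fn(max_tokens)
        except Exception as exc:
            if attempt == max_retries:
                raise                                 # retries exhausted
            name = type(exc).__name__                 # simplification; see lead-in
            if name == "ThrottlingException":
                time.sleep(delay)
                delay *= 2                            # exponential backoff
            elif name == "ModelTimeoutException":
                max_tokens = max(1, max_tokens // 2)  # halve max_tokens, retry
            elif name == "EventStreamError":
                pass                                  # mid-stream drop: retry now
            else:
                raise                                 # unknown error: don't mask it
```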
Flashcards
What is time-to-first-token (TTFT) and why does it matter?
TTFT is the time from sending the request to receiving the first token of the response. With streaming, TTFT is typically 200-800ms regardless of total response length. It matters because users perceive streaming responses as much faster even when total time is the same.
If a human is waiting for the response, use streaming. The perceived speed improvement is so significant that streaming should be the default for all user-facing Claude integrations. Reserve batch for machine-to-machine pipelines.