
Streaming vs Batch Response Patterns

Streaming is one of the most impactful optimizations for user-facing Claude applications. Instead of waiting for the entire response to generate, streaming delivers tokens as they are produced. For a response that takes 8 seconds to complete, streaming can show the first token in under 500ms. This section covers when and how to implement streaming on Amazon Bedrock.

Why Streaming Improves Perceived Latency

With batch (synchronous) invocation, users stare at a loading spinner until the entire response is ready. With streaming, text appears almost immediately and flows continuously.

| Metric | Batch | Streaming |
| --- | --- | --- |
| Time-to-first-token (TTFT) | Same as total time | 200-800ms |
| Total completion time | Same | Same (slightly higher) |
| Perceived user experience | Slow, jarring | Fast, natural |
| Memory for long responses | All at once | Incremental |

Streaming does not make the total response faster; chunked transfer encoding can even add a few percent of overhead. But the user-experience improvement is dramatic: users perceive streaming responses as 3-5x faster even when total time is identical.

Bedrock Streaming API

Amazon Bedrock provides InvokeModelWithResponseStream for streaming responses. The response arrives as a series of events, each containing a chunk of the response.

The stream_claude_response() generator function calls InvokeModelWithResponseStream, parses each event chunk, and yields text deltas as they arrive from Bedrock.

Premium

Production Streaming Implementation

Get the complete streaming implementation with time-to-first-token measurement, error handling, and retry logic for Bedrock.

Complete Streaming Implementation with Metadata

In production, you need the full response text plus usage metadata. Here is a complete implementation:

The invoke_claude_streaming() function provides a full streaming invocation with TTFT measurement, token usage tracking across all event types (message_start, content_block_delta, message_delta, message_stop), and returns the assembled response with performance metadata.


When to Use Streaming vs Batch

Streaming is not always the right choice. Here is a decision framework:

| Use Case | Recommendation | Why |
| --- | --- | --- |
| Chatbot / conversational UI | Streaming | Users need immediate feedback |
| API endpoint returning JSON | Batch | You need the complete response to parse it |
| Document processing pipeline | Batch | No user watching; simpler error handling |
| Real-time coding assistant | Streaming | Developers expect progressive output |
| Webhook / async processing | Batch | Response goes into a queue, not to a user |
| Long-form content generation | Streaming | Avoids timeout issues on very long responses |

The unified invoke_claude() function provides a single interface that dispatches to either streaming or batch mode based on a boolean flag, simplifying calling code.


Error Handling in Streaming Responses

Streaming introduces failure modes that batch does not have. The connection can drop mid-stream, chunks can arrive corrupted, or the model can hit a content filter partway through.

The safe_stream() function wraps streaming invocation with retry logic that handles EventStreamError (mid-stream interruptions), ThrottlingException (with exponential backoff), and ModelTimeoutException (by halving max_tokens).


Flashcards

Question

What is time-to-first-token (TTFT) and why does it matter?

Answer

TTFT is the time from sending the request to receiving the first token of the response. With streaming, TTFT is typically 200-800ms regardless of total response length. It matters because users perceive streaming responses as much faster even when total time is the same.

Key Insight

If a human is waiting for the response, use streaming. The perceived speed improvement is so significant that streaming should be the default for all user-facing Claude integrations. Reserve batch for machine-to-machine pipelines.