
Streaming and Real-Time

Real-time data processing is critical for ML workloads that need to act on data as it arrives — fraud detection, IoT telemetry, clickstream analysis, and real-time recommendations. AWS provides the Kinesis family of services plus managed Kafka (MSK) for streaming architectures.

Overview

Amazon Kinesis Family

| Service | What It Does | When to Use |
|---|---|---|
| Kinesis Data Streams (KDS) | Real-time data streaming with manual shard management and custom consumers | Custom processing logic, ordering guarantees, data replay |
| Kinesis Data Firehose | Fully managed delivery to S3, Redshift, OpenSearch, Splunk. Auto-scales with near-real-time delivery (60 s minimum buffer) | Stream data to storage destinations with optional transformation |
| Kinesis Data Analytics (KDA) | Real-time SQL or Apache Flink analytics on streams | Windowed aggregations, streaming SQL, real-time anomaly detection |
| Kinesis Video Streams | Ingest, store, and process video/audio streams | Video analytics, live monitoring. Integrates with Rekognition Video |

Amazon MSK (Managed Kafka)

Fully managed Apache Kafka for teams that need Kafka compatibility. More operational overhead than Kinesis — use only when you specifically need the Kafka protocol or ecosystem.

Kinesis Data Streams — Deep Dive

Each shard provides:

  • 1 MB/s ingestion (write)
  • 2 MB/s consumption (read)
  • 1,000 records/s write limit

Shard capacity formula: shards needed = ceil(MAX(data_MB_per_sec / 1, records_per_sec / 1000)) — round up, since you cannot provision a fraction of a shard.
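The formula can be expressed as a small helper. This is a sketch of the sizing arithmetic only, using the per-shard write limits stated above; the function name and example numbers are illustrative:

```python
import math

def shards_needed(data_mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate the KDS shard count from per-shard write limits
    (1 MB/s and 1,000 records/s), rounding up to a whole shard."""
    return max(1, math.ceil(max(data_mb_per_sec / 1.0, records_per_sec / 1000.0)))

# Example: 4.5 MB/s of ~2 KB records (about 2,250 records/s).
# Throughput is the binding constraint here, not record count.
print(shards_needed(4.5, 2250))  # → 5
```

Note that whichever dimension is larger — throughput or record rate — drives the shard count.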

Key limitation: KDS does NOT deliver to S3 directly. You need Firehose as a consumer to land data in S3.
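Writing to KDS looks roughly like the sketch below. The record structure (`Data` bytes plus a `PartitionKey` that determines shard placement) is the standard KDS producer shape; the stream name, event format, and random partition keys are illustrative assumptions, and the actual `put_records` call requires AWS credentials:

```python
import json
import uuid

def build_records(events):
    """Package events as KDS records; the partition key decides which
    shard each record lands on (random keys spread load evenly)."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(uuid.uuid4())}
        for e in events
    ]

def send(events, stream_name="clickstream"):  # stream name is hypothetical
    import boto3  # requires AWS credentials at call time
    kinesis = boto3.client("kinesis")
    # PutRecords accepts at most 500 records per call, so send in chunks
    for i in range(0, len(events), 500):
        kinesis.put_records(
            StreamName=stream_name,
            Records=build_records(events[i : i + 500]),
        )
```

If per-entity ordering matters, use a stable key (e.g., a user ID) instead of a random UUID, so all of that entity's records hit the same shard in order.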

Kinesis Data Firehose — Deep Dive

Three powerful capabilities make Firehose the workhorse for streaming delivery:

  1. Lambda transformation: Transform records in-flight using a Lambda function
  2. Source record backup: Keep the original untransformed records in a separate S3 location
  3. Format conversion: Convert JSON to Parquet or ORC using the Glue Data Catalog schema

For delivering to both S3 and OpenSearch, use a single Firehose stream with S3 backup enabled — not two separate streams.
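A Firehose transformation Lambda receives base64-encoded records and must return each one with its original `recordId`, a `result` status (`Ok`, `Dropped`, or `ProcessingFailed`), and re-encoded data. A minimal handler sketch, where the `processed` field is a purely illustrative enrichment:

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: decode each record, enrich it,
    re-encode it, and mark it Ok so Firehose delivers it downstream."""
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["processed"] = True  # illustrative enrichment
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": out}
```

Appending a newline per record keeps the delivered S3 objects line-delimited, which downstream tools like Athena expect.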

Kinesis Data Analytics — Deep Dive

KDA supports two runtimes:

  • SQL: Write SQL queries over streaming data (tumbling windows, sliding windows)
  • Apache Flink: Full-featured stream processing with Java or Python
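The tumbling-window semantics that KDA computes (fixed, non-overlapping time buckets) can be illustrated with a plain-Python sketch — this is not the KDA API, just the aggregation logic a tumbling-window `GROUP BY` performs:

```python
from collections import defaultdict

def tumbling_counts(events, window_sec=60):
    """Group (timestamp, key) events into fixed, non-overlapping
    windows and count occurrences per key — the semantics of a
    tumbling-window aggregation in streaming SQL."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        windows[ts - ts % window_sec][key] += 1  # floor ts to window start
    return {w: dict(counts) for w, counts in sorted(windows.items())}

clicks = [(5, "home"), (42, "cart"), (61, "home"), (119, "home")]
print(tumbling_counts(clicks))
# → {0: {'home': 1, 'cart': 1}, 60: {'home': 2}}
```

A sliding window differs only in that each event contributes to every window whose span covers it, so windows overlap.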

KDA has a built-in Random Cut Forest (RCF) function for real-time anomaly detection on streams — this is a key pattern for detecting unusual behavior in streaming data.

Kinesis Decision Guide

| Scenario | Service |
|---|---|
| Deliver streaming data to S3 with transforms | Firehose (+ Lambda for transforms) |
| Custom consumers with replay and ordering | Data Streams |
| SQL on streaming data or windowed aggregation | Data Analytics |
| Least operational overhead for S3 delivery | Firehose (always beats Streams) |
| Real-time anomaly detection on a stream | KDA + Random Cut Forest |
| Need Kafka protocol compatibility | Amazon MSK |

When to Use

Use Kinesis when your ML pipeline needs to process data in real time — whether that means delivering clickstream data to S3 for batch training, running streaming anomaly detection, or computing real-time aggregations as features for inference.

Flashcards

Question

What is the per-shard capacity of Kinesis Data Streams?

Answer

1 MB/s in (write), 2 MB/s out (read), 1,000 records/s write limit. Calculate shards needed as ceil(MAX(data_MB / 1, records / 1000)).

Key Insight

For streaming format conversion (e.g., JSON to Parquet), use Firehose's native conversion capability backed by the Glue Data Catalog — it is simpler and more efficient than writing a custom Lambda transformation for the same purpose.