Streaming and Real-Time
Real-time data processing is critical for ML workloads that need to act on data as it arrives — fraud detection, IoT telemetry, clickstream analysis, and real-time recommendations. AWS provides the Kinesis family of services plus managed Kafka (MSK) for streaming architectures.
Overview
Amazon Kinesis Family
| Service | What It Does | When to Use |
|---|---|---|
| Kinesis Data Streams (KDS) | Real-time data streaming with manual shard management and custom consumers | Custom processing logic, ordering guarantees, data replay |
| Kinesis Data Firehose | Fully managed delivery to S3, Redshift, OpenSearch, Splunk. Auto-scales with near-real-time delivery (60s minimum buffer) | Stream data to storage destinations with optional transformation |
| Kinesis Data Analytics (KDA) | Real-time SQL or Apache Flink analytics on streams | Windowed aggregations, streaming SQL, real-time anomaly detection |
| Kinesis Video Streams | Ingest, store, and process video/audio streams | Video analytics, live monitoring. Integrates with Rekognition Video |
Amazon MSK (Managed Kafka)
Fully managed Apache Kafka for teams that need Kafka compatibility. More operational overhead than Kinesis — use only when you specifically need the Kafka protocol or ecosystem.
Kinesis Data Streams — Deep Dive
Each shard provides:
- 1 MB/s ingestion (write)
- 2 MB/s consumption (read), shared across all standard consumers of the shard (enhanced fan-out gives each registered consumer its own 2 MB/s)
- 1,000 records/s write limit
Shard capacity formula: shards needed = max(ceil(write_MB_per_sec / 1), ceil(records_per_sec / 1000))
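The capacity formula can be written as a small helper; the function name and the example numbers below are illustrative, not from any AWS SDK.

```python
import math

def shards_needed(write_mb_per_sec: float, records_per_sec: float) -> int:
    """Minimum shard count for a Kinesis Data Stream.

    Each shard accepts 1 MB/s and 1,000 records/s on the write side,
    so the requirement is the larger of the two ceilings.
    """
    by_throughput = math.ceil(write_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    return max(by_throughput, by_records, 1)

# Example: 4.5 MB/s of small records arriving at 12,000 records/s
# -> ceil(4.5) = 5 by throughput, ceil(12) = 12 by record rate -> 12 shards
print(shards_needed(4.5, 12_000))  # 12
```

Note that the record-rate limit often dominates for small records, which is why the formula takes the max of both terms.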
Key limitation: KDS does NOT deliver to S3 directly. You need Firehose as a consumer to land data in S3.
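The ordering guarantee mentioned above is per partition key: Kinesis hashes each record's partition key with MD5 and routes it to the shard whose hash-key range contains the result, so records sharing a key stay on one shard, in order. The helper below is an illustrative sketch of that mapping (assuming evenly split shards), not the service's internal code.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Illustrative partition-key-to-shard mapping for evenly split shards.

    The 128-bit MD5 hash of the key is bucketed into num_shards equal
    hash-key ranges, mirroring how Kinesis assigns records to shards.
    """
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    bucket = 2 ** 128 // num_shards  # size of each shard's hash-key range
    return min(h // bucket, num_shards - 1)

# All records for one user hash to one shard, preserving their order.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

This is why choosing a high-cardinality partition key matters: a hot key concentrates traffic on a single shard regardless of how many shards you provision.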
Kinesis Data Firehose — Deep Dive
Three powerful capabilities make Firehose the workhorse for streaming delivery:
- Lambda transformation: Transform records in-flight using a Lambda function
- Source record backup: Keep the original untransformed records in a separate S3 location
- Format conversion: Convert JSON to Parquet or ORC using the Glue Data Catalog schema
For delivering to both S3 and OpenSearch, use a single Firehose stream with S3 backup enabled — not two separate streams.
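A transformation Lambda receives batches of base64-encoded records and must return each recordId with a result of Ok, Dropped, or ProcessingFailed, plus re-encoded data. A minimal sketch of such a handler — the "event_type"/"ts" output fields are made up for illustration:

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda: reshape each JSON record.

    Firehose delivers records base64-encoded; every output record must
    echo the incoming recordId and report Ok, Dropped, or
    ProcessingFailed so Firehose knows what to deliver or retry.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        transformed = {
            "event_type": payload.get("type", "unknown"),  # illustrative fields
            "ts": payload.get("timestamp"),
        }
        data = base64.b64encode(
            (json.dumps(transformed) + "\n").encode("utf-8")
        ).decode("utf-8")
        output.append(
            {"recordId": record["recordId"], "result": "Ok", "data": data}
        )
    return {"records": output}
```

Appending a newline to each output record keeps the delivered S3 objects line-delimited, which downstream tools like Athena expect.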
Kinesis Data Analytics — Deep Dive
KDA supports two runtimes:
- SQL: Write SQL queries over streaming data (tumbling windows, sliding windows)
- Apache Flink: Full-featured stream processing with Java or Python
KDA has a built-in Random Cut Forest (RCF) function for real-time anomaly detection on streams — this is a key pattern for detecting unusual behavior in streaming data.
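As a concept check on windowed aggregation: a tumbling window partitions the stream into fixed, non-overlapping time buckets and aggregates within each. The toy batch version below is plain Python to show the bucketing math only; a real KDA job expresses this in streaming SQL or Flink and runs continuously.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Toy tumbling-window count over (timestamp, payload) pairs.

    Each event falls into exactly one window, identified by the window's
    start time: floor(ts / window_seconds) * window_seconds.
    """
    windows = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += 1
    return dict(windows)

# Events at t=1, 2, 61 with a 60 s window -> two windows
print(tumbling_counts([(1, "a"), (2, "b"), (61, "c")], 60))  # {0: 2, 60: 1}
```

A sliding window differs only in that windows overlap, so one event can contribute to several windows.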
Kinesis Decision Guide
| Scenario | Service |
|---|---|
| Deliver streaming data to S3 with transforms | Firehose (+ Lambda for transforms) |
| Custom consumers with replay and ordering | Data Streams |
| SQL on streaming data or windowed aggregation | Data Analytics |
| Least operational overhead for S3 delivery | Firehose (fully managed; no shards or consumers to run) |
| Real-time anomaly detection on a stream | KDA + Random Cut Forest |
| Need Kafka protocol compatibility | Amazon MSK |
When to Use
Use Kinesis when your ML pipeline needs to process data in real time — whether that means delivering clickstream data to S3 for batch training, running streaming anomaly detection, or computing real-time aggregations as features for inference.
Flashcards
What is the per-shard capacity of Kinesis Data Streams?
1 MB/s in (write), 2 MB/s out (read), 1,000 records/s write limit. Calculate shards needed as max(ceil(data_MB / 1), ceil(records / 1000)).
For streaming format conversion (e.g., JSON to Parquet), use Firehose's native conversion capability backed by the Glue Data Catalog — it is simpler and more efficient than writing a custom Lambda transformation for the same purpose.
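As a sketch of what that native conversion looks like in practice, this is the shape of the DataFormatConversionConfiguration block passed inside ExtendedS3DestinationConfiguration when creating a Firehose stream via boto3's create_delivery_stream; the database, table, and role names below are placeholders.

```python
# Record-format conversion settings for a Firehose delivery stream.
# Values marked "placeholder" are illustrative, not real resources.
data_format_conversion = {
    "Enabled": True,
    "InputFormatConfiguration": {
        # Deserialize incoming JSON records
        "Deserializer": {"OpenXJsonSerDe": {}}
    },
    "OutputFormatConfiguration": {
        # Serialize to columnar Parquet on the way to S3
        "Serializer": {"ParquetSerDe": {}}
    },
    "SchemaConfiguration": {
        # The schema comes from the Glue Data Catalog, not the stream itself
        "DatabaseName": "analytics_db",      # placeholder
        "TableName": "clickstream_events",   # placeholder
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-role",  # placeholder
    },
}
```

Because the schema lives in the Glue Data Catalog, the same table definition that Athena queries also drives the conversion, so the two stay in sync.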