Streaming and Real-Time
Real-time data processing is critical for ML workloads that need to act on data as it arrives — fraud detection, IoT telemetry, clickstream analysis, and real-time recommendations. AWS provides the Kinesis family of services plus managed Kafka (MSK) for streaming architectures.
Overview
Amazon Kinesis Family
| Service | What It Does | When to Use |
|---|---|---|
| Kinesis Data Streams (KDS) | Real-time data streaming with manual shard management and custom consumers | Custom processing logic, ordering guarantees, data replay |
| Kinesis Data Firehose | Fully managed delivery to S3, Redshift, OpenSearch, Splunk. Auto-scales with near-real-time delivery (60s minimum buffer) | Stream data to storage destinations with optional transformation |
| Kinesis Data Analytics (KDA) | Real-time SQL or Apache Flink analytics on streams | Windowed aggregations, streaming SQL, real-time anomaly detection |
| Kinesis Video Streams | Ingest, store, and process video/audio streams | Video analytics, live monitoring. Integrates with Rekognition Video |
Amazon MSK (Managed Kafka)
Fully managed Apache Kafka for teams that need Kafka compatibility. More operational overhead than Kinesis — use only when you specifically need the Kafka protocol or ecosystem.
Kinesis Data Streams — Deep Dive
Each shard provides:
- 1 MB/s ingestion (write)
- 2 MB/s consumption (read), shared across all standard consumers of the shard (enhanced fan-out gives each registered consumer its own 2 MB/s)
- 1,000 records/s write limit
Shard capacity formula: shards needed = max(ceil(write_MB_per_sec / 1), ceil(records_per_sec / 1000))
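The capacity formula can be written as a small helper; the function name and the example numbers below are illustrative, not from any AWS SDK.

```python
import math

def shards_needed(write_mb_per_sec: float, records_per_sec: float) -> int:
    """Minimum shard count for a Kinesis Data Stream.

    Each shard accepts 1 MB/s and 1,000 records/s on the write side,
    so the requirement is the larger of the two ceilings.
    """
    by_throughput = math.ceil(write_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    return max(by_throughput, by_records, 1)

# Example: 4.5 MB/s of small records arriving at 12,000 records/s
# -> ceil(4.5) = 5 by throughput, ceil(12) = 12 by record rate -> 12 shards
print(shards_needed(4.5, 12_000))  # 12
```

Note that the record-rate limit often dominates for small records, which is why the formula takes the max of both terms.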
Key limitation: KDS does NOT deliver to S3 directly. You need Firehose as a consumer to land data in S3.
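The ordering guarantee mentioned above is per partition key: Kinesis hashes each record's partition key with MD5 and routes it to the shard whose hash-key range contains the result, so records sharing a key stay on one shard, in order. The helper below is an illustrative sketch of that mapping (assuming evenly split shards), not the service's internal code.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Illustrative partition-key-to-shard mapping for evenly split shards.

    The 128-bit MD5 hash of the key is bucketed into num_shards equal
    hash-key ranges, mirroring how Kinesis assigns records to shards.
    """
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    bucket = 2 ** 128 // num_shards  # size of each shard's hash-key range
    return min(h // bucket, num_shards - 1)

# All records for one user hash to one shard, preserving their order.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

This is why choosing a high-cardinality partition key matters: a hot key concentrates traffic on a single shard regardless of how many shards you provision.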
Kinesis Data Firehose — Deep Dive
Three powerful capabilities make Firehose the workhorse for streaming delivery:
- Lambda transformation: Transform records in-flight using a Lambda function
- Source record backup: Keep the original untransformed records in a separate S3 location
- Format conversion: Convert JSON to Parquet or ORC using the Glue Data Catalog schema
For delivering to both S3 and OpenSearch, use a single Firehose stream with S3 backup enabled — not two separate streams.
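A transformation Lambda receives batches of base64-encoded records and must return each recordId with a result of Ok, Dropped, or ProcessingFailed, plus re-encoded data. A minimal sketch of such a handler — the "event_type"/"ts" output fields are made up for illustration:

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda: reshape each JSON record.

    Firehose delivers records base64-encoded; every output record must
    echo the incoming recordId and report Ok, Dropped, or
    ProcessingFailed so Firehose knows what to deliver or retry.
    """
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        transformed = {
            "event_type": payload.get("type", "unknown"),  # illustrative fields
            "ts": payload.get("timestamp"),
        }
        data = base64.b64encode(
            (json.dumps(transformed) + "\n").encode("utf-8")
        ).decode("utf-8")
        output.append(
            {"recordId": record["recordId"], "result": "Ok", "data": data}
        )
    return {"records": output}
```

Appending a newline to each output record keeps the delivered S3 objects line-delimited, which downstream tools like Athena expect.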
Kinesis Data Analytics — Deep Dive
KDA supports two runtimes:
- SQL: Write SQL queries over streaming data (tumbling windows, sliding windows)
- Apache Flink: Full-featured stream processing with Java or Python
KDA has a built-in Random Cut Forest (RCF) function for real-time anomaly detection on streams — this is a key pattern for detecting unusual behavior in streaming data.
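As a concept check on windowed aggregation: a tumbling window partitions the stream into fixed, non-overlapping time buckets and aggregates within each. The toy batch version below is plain Python to show the bucketing math only; a real KDA job expresses this in streaming SQL or Flink and runs continuously.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Toy tumbling-window count over (timestamp, payload) pairs.

    Each event falls into exactly one window, identified by the window's
    start time: floor(ts / window_seconds) * window_seconds.
    """
    windows = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += 1
    return dict(windows)

# Events at t=1, 2, 61 with a 60 s window -> two windows
print(tumbling_counts([(1, "a"), (2, "b"), (61, "c")], 60))  # {0: 2, 60: 1}
```

A sliding window differs only in that windows overlap, so one event can contribute to several windows.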
Kinesis Decision Guide
| Scenario | Service |
|---|---|
| Deliver streaming data to S3 with transforms | Firehose (+ Lambda for transforms) |
| Custom consumers with replay and ordering | Data Streams |
| SQL on streaming data or windowed aggregation | Data Analytics |
| Least operational overhead for S3 delivery | Firehose (fully managed; no shards or consumers to run) |
| Real-time anomaly detection on a stream | KDA + Random Cut Forest |
| Need Kafka protocol compatibility | Amazon MSK |
When to Use
Use Kinesis when your ML pipeline needs to process data in real time — whether that means delivering clickstream data to S3 for batch training, running streaming anomaly detection, or computing real-time aggregations as features for inference.
Flashcards
What is the per-shard capacity of Kinesis Data Streams?
1 MB/s in (write), 2 MB/s out (read), 1,000 records/s write limit. Calculate shards needed as max(ceil(data_MB / 1), ceil(records / 1000)).
For streaming format conversion (e.g., JSON to Parquet), use Firehose's native conversion capability backed by the Glue Data Catalog — it is simpler and more efficient than writing a custom Lambda transformation for the same purpose.
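As a sketch of what that native conversion looks like in practice, this is the shape of the DataFormatConversionConfiguration block passed inside ExtendedS3DestinationConfiguration when creating a Firehose stream via boto3's create_delivery_stream; the database, table, and role names below are placeholders.

```python
# Record-format conversion settings for a Firehose delivery stream.
# Values marked "placeholder" are illustrative, not real resources.
data_format_conversion = {
    "Enabled": True,
    "InputFormatConfiguration": {
        # Deserialize incoming JSON records
        "Deserializer": {"OpenXJsonSerDe": {}}
    },
    "OutputFormatConfiguration": {
        # Serialize to columnar Parquet on the way to S3
        "Serializer": {"ParquetSerDe": {}}
    },
    "SchemaConfiguration": {
        # The schema comes from the Glue Data Catalog, not the stream itself
        "DatabaseName": "analytics_db",      # placeholder
        "TableName": "clickstream_events",   # placeholder
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-role",  # placeholder
    },
}
```

Because the schema lives in the Glue Data Catalog, the same table definition that Athena queries also drives the conversion, so the two stay in sync.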