ETL and Data Processing
Before data can be used for ML training, it needs to be cleaned, transformed, and formatted. AWS offers a spectrum of ETL tools ranging from no-code visual editors to full Spark cluster management. Choosing the right tool depends on the complexity of the transformation, the team's technical skills, and the scale of the data.
Overview
| Service | What It Does | When to Use |
|---|---|---|
| AWS Glue | Serverless ETL with PySpark/Scala. Includes Crawlers for schema discovery, Data Catalog for metadata, and FindMatches for fuzzy dedup | Batch ETL at scale, format conversion, schema discovery, data cataloging |
| AWS Glue DataBrew | No-code visual data preparation with 250+ built-in transformations | Data prep by non-technical users. Resampling, filling missing values, cleaning |
| Amazon EMR | Managed Hadoop/Spark clusters for big data processing | Massive-scale batch processing, existing Spark/Hadoop workloads, full cluster control |
| AWS Lambda | Serverless event-driven functions (max 15 min, 10 GB memory, no GPU) | Lightweight transforms, event triggers, glue between services |
| AWS Step Functions | General-purpose serverless workflow orchestration | Orchestrate multi-step workflows across AWS services |
| AWS Data Pipeline | Scheduled batch data movement (legacy service) | Move data from RDS/DynamoDB to S3 on a schedule |
| AWS DMS | Database migration and ongoing replication (CDC) | Database migration, change data capture |
| AWS DataSync | Transfer data between on-premises and AWS with encryption, scheduling, and integrity validation | On-premises to S3 migration with automation |
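Lambda's role as lightweight "glue between services" can be sketched as a small handler that normalizes incoming records before passing them downstream. A minimal sketch, assuming a hypothetical event payload with `records`, `email`, and `amount` fields (not a real AWS event schema):

```python
import json

def handler(event, context):
    """Lightweight transform: normalize records from a hypothetical
    upstream event payload (field names are illustrative)."""
    cleaned = []
    for record in event.get("records", []):
        cleaned.append({
            "id": record["id"],
            # Trim whitespace and lowercase for consistent joins downstream
            "email": record.get("email", "").strip().lower(),
            # Coerce missing numeric fields to 0.0 rather than dropping the row
            "amount": float(record.get("amount") or 0.0),
        })
    return {"statusCode": 200, "body": json.dumps(cleaned)}
```

This kind of per-record cleanup fits well within Lambda's 15-minute/10 GB limits; anything heavier belongs in Glue or EMR.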
AWS Glue — Deep Dive
Glue is the most versatile batch ETL tool on AWS. Its components serve different purposes:
| Component | Purpose |
|---|---|
| Crawlers | Discover schema from S3 data and populate the Data Catalog. Crawlers discover, they do not transform |
| Data Catalog | Central metadata repository — stores table definitions, schemas, and partitions |
| ETL Jobs | PySpark or Scala jobs that read, transform, and write data |
| FindMatches | ML-based fuzzy deduplication — matches records that are similar but not identical |
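FindMatches trains an ML model on labeled record pairs, so its internals can't be reproduced in a few lines. The underlying idea, flagging records that are similar but not identical, can be illustrated with a toy character-similarity check using Python's stdlib `difflib` (purely conceptual; this is not how FindMatches actually scores candidates):

```python
from difflib import SequenceMatcher

def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Toy fuzzy dedup: treat two strings as duplicates if their
    character-level similarity exceeds a threshold. FindMatches itself
    learns this decision from labeled examples instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# A near-duplicate pair that exact-match dedup would miss
is_fuzzy_match("Jon Smith, 12 Main St", "John Smith, 12 Main St.")
```

The value of the ML approach is that it learns which differences matter (a typo in a name vs. a different street number) rather than relying on a single fixed threshold.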
Glue DataBrew vs. Glue ETL
| | DataBrew | Glue ETL |
|---|---|---|
| Target user | Business analysts, no-code | Data engineers |
| Interface | Visual, point-and-click | PySpark/Scala code |
| Database access | Connects directly to PostgreSQL, MySQL, etc. — no DMS needed | Via JDBC connections |
| Scale | Medium | Large-scale (distributed Spark) |
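To make concrete what a DataBrew recipe step such as "fill missing values with the column mean" actually computes, here is an equivalent-logic sketch in plain Python (DataBrew itself is visual and point-and-click; the `price` column is hypothetical):

```python
from statistics import mean

def fill_missing(rows, column):
    """Fill missing numeric values in `column` with the column mean --
    the same logic as a fill-with-mean recipe step. `rows` is a list
    of dicts; None marks a missing value."""
    present = [r[column] for r in rows if r[column] is not None]
    col_mean = mean(present)
    return [
        {**r, column: col_mean if r[column] is None else r[column]}
        for r in rows
    ]

rows = [{"price": 10.0}, {"price": None}, {"price": 20.0}]
filled = fill_missing(rows, "price")
```

In DataBrew this is one click in a recipe; the point of the service is that such steps need no code at all.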
ETL Tool Decision Guide
| Scenario | Best Choice |
|---|---|
| Serverless batch ETL with PySpark | AWS Glue |
| No-code data preparation | Glue DataBrew |
| Full Spark cluster control or existing Hadoop | Amazon EMR |
| Event-driven lightweight processing | AWS Lambda (15-min timeout, no GPU) |
| ML workflow orchestration | SageMaker Pipelines |
| General AWS workflow orchestration | AWS Step Functions |
| On-premises to S3 with encryption and scheduling | AWS DataSync |
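Step Functions workflows are defined in Amazon States Language (JSON). A minimal sketch that runs a Glue job and then publishes a notification; the job name and topic ARN are placeholders:

```json
{
  "Comment": "Hypothetical ETL workflow: run a Glue job, then notify on success",
  "StartAt": "RunGlueJob",
  "States": {
    "RunGlueJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "my-etl-job" },
      "Next": "Notify"
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
        "Message": "ETL complete"
      },
      "End": true
    }
  }
}
```

The `.sync` suffix makes Step Functions wait for the Glue job to finish before moving to the next state, which is what makes it suitable for orchestrating multi-step ETL.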
Effort ranking (least to most): DataBrew → Glue → EMR → Custom EC2
Format Conversion Reference
| Conversion | Tool |
|---|---|
| CSV → Parquet (batch) | Glue ETL job |
| JSON → Parquet (streaming) | Kinesis Firehose (native conversion via Glue Catalog) |
| Images → RecordIO | im2rec (MXNet utility) |
When to Use
Start with Glue DataBrew for simple, no-code data prep. Move to Glue ETL for complex PySpark transformations at scale. Use EMR only when you need full Spark cluster control or have existing Hadoop workloads. Use Lambda for lightweight, event-driven processing — but not for heavy ETL or anything requiring GPU.
Flashcards
What is the difference between Glue Crawlers and Glue ETL Jobs?
Crawlers discover schema from data sources and populate the Data Catalog (metadata only). ETL Jobs actually transform the data — read, process, and write.
Glue DataBrew and SageMaker Canvas together form a complete no-code ML pipeline: DataBrew handles data preparation and Canvas handles model building. This is the simplest path for teams without ML or data engineering expertise.