ETL and Data Processing
Before data can be used for ML training, it needs to be cleaned, transformed, and formatted. AWS offers a spectrum of ETL tools ranging from no-code visual editors to full Spark cluster management. Choosing the right tool depends on the complexity of the transformation, the team's technical skills, and the scale of the data.
Overview
| Service | What It Does | When to Use |
|---|---|---|
| AWS Glue | Serverless ETL with PySpark/Scala. Includes Crawlers for schema discovery, Data Catalog for metadata, and FindMatches for fuzzy dedup | Batch ETL at scale, format conversion, schema discovery, data cataloging |
| AWS Glue DataBrew | No-code visual data preparation with 250+ built-in transformations | Data prep by non-technical users. Resampling, filling missing values, cleaning |
| Amazon EMR | Managed Hadoop/Spark clusters for big data processing | Massive-scale batch processing, existing Spark/Hadoop workloads, full cluster control |
| AWS Lambda | Serverless event-driven functions (max 15 min, 10 GB memory, no GPU) | Lightweight transforms, event triggers, glue between services |
| AWS Step Functions | General-purpose serverless workflow orchestration | Orchestrate multi-step workflows across AWS services |
| AWS Data Pipeline | Scheduled batch data movement (legacy service) | Move data from RDS/DynamoDB to S3 on a schedule |
| AWS DMS | Database migration and ongoing replication (CDC) | Database migration, change data capture |
| AWS DataSync | Transfer data between on-premises and AWS with encryption, scheduling, and integrity validation | On-premises to S3 migration with automation |
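Lambda's role as lightweight "glue between services" can be sketched as a small handler that normalizes incoming records before passing them downstream. A minimal sketch, assuming a hypothetical event payload with `records`, `email`, and `amount` fields (not a real AWS event schema):

```python
import json

def handler(event, context):
    """Lightweight transform: normalize records from a hypothetical
    upstream event payload (field names are illustrative)."""
    cleaned = []
    for record in event.get("records", []):
        cleaned.append({
            "id": record["id"],
            # Trim whitespace and lowercase for consistent joins downstream
            "email": record.get("email", "").strip().lower(),
            # Coerce missing numeric fields to 0.0 rather than dropping the row
            "amount": float(record.get("amount") or 0.0),
        })
    return {"statusCode": 200, "body": json.dumps(cleaned)}
```

This kind of per-record cleanup fits well within Lambda's 15-minute/10 GB limits; anything heavier belongs in Glue or EMR.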
AWS Glue — Deep Dive
Glue is the most versatile batch ETL tool on AWS. Its components serve different purposes:
| Component | Purpose |
|---|---|
| Crawlers | Discover schema from S3 data and populate the Data Catalog. Crawlers discover, they do not transform |
| Data Catalog | Central metadata repository — stores table definitions, schemas, and partitions |
| ETL Jobs | PySpark or Scala jobs that read, transform, and write data |
| FindMatches | ML-based fuzzy deduplication — matches records that are similar but not identical |
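FindMatches trains an ML model on labeled record pairs, so its internals can't be reproduced in a few lines. The underlying idea, flagging records that are similar but not identical, can be illustrated with a toy character-similarity check using Python's stdlib `difflib` (purely conceptual; this is not how FindMatches actually scores candidates):

```python
from difflib import SequenceMatcher

def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Toy fuzzy dedup: treat two strings as duplicates if their
    character-level similarity exceeds a threshold. FindMatches itself
    learns this decision from labeled examples instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# A near-duplicate pair that exact-match dedup would miss
is_fuzzy_match("Jon Smith, 12 Main St", "John Smith, 12 Main St.")
```

The value of the ML approach is that it learns which differences matter (a typo in a name vs. a different street number) rather than relying on a single fixed threshold.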
Glue DataBrew vs. Glue ETL
| | DataBrew | Glue ETL |
|---|---|---|
| Target user | Business analysts, no-code | Data engineers |
| Interface | Visual, point-and-click | PySpark/Scala code |
| Database access | Connects directly to PostgreSQL, MySQL, etc. — no DMS needed | Via JDBC connections |
| Scale | Medium | Large-scale (distributed Spark) |
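To make concrete what a DataBrew recipe step such as "fill missing values with the column mean" actually computes, here is an equivalent-logic sketch in plain Python (DataBrew itself is visual and point-and-click; the `price` column is hypothetical):

```python
from statistics import mean

def fill_missing(rows, column):
    """Fill missing numeric values in `column` with the column mean --
    the same logic as a fill-with-mean recipe step. `rows` is a list
    of dicts; None marks a missing value."""
    present = [r[column] for r in rows if r[column] is not None]
    col_mean = mean(present)
    return [
        {**r, column: col_mean if r[column] is None else r[column]}
        for r in rows
    ]

rows = [{"price": 10.0}, {"price": None}, {"price": 20.0}]
filled = fill_missing(rows, "price")
```

In DataBrew this is one click in a recipe; the point of the service is that such steps need no code at all.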
ETL Tool Decision Guide
| Scenario | Best Choice |
|---|---|
| Serverless batch ETL with PySpark | AWS Glue |
| No-code data preparation | Glue DataBrew |
| Full Spark cluster control or existing Hadoop | Amazon EMR |
| Event-driven lightweight processing | AWS Lambda (15-min timeout, no GPU) |
| ML workflow orchestration | SageMaker Pipelines |
| General AWS workflow orchestration | AWS Step Functions |
| On-premises to S3 with encryption and scheduling | AWS DataSync |
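Step Functions workflows are defined in Amazon States Language (JSON). A minimal sketch that runs a Glue job and then publishes a notification; the job name and topic ARN are placeholders:

```json
{
  "Comment": "Hypothetical ETL workflow: run a Glue job, then notify on success",
  "StartAt": "RunGlueJob",
  "States": {
    "RunGlueJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "my-etl-job" },
      "Next": "Notify"
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
        "Message": "ETL complete"
      },
      "End": true
    }
  }
}
```

The `.sync` suffix makes Step Functions wait for the Glue job to finish before moving to the next state, which is what makes it suitable for orchestrating multi-step ETL.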
Effort ranking (least to most): DataBrew → Glue → EMR → Custom EC2
Format Conversion Reference
| Conversion | Tool |
|---|---|
| CSV → Parquet (batch) | Glue ETL job |
| JSON → Parquet (streaming) | Kinesis Firehose (native conversion via Glue Catalog) |
| Images → RecordIO | im2rec (MXNet utility) |
When to Use
Start with Glue DataBrew for simple, no-code data prep. Move to Glue ETL for complex PySpark transformations at scale. Use EMR only when you need full Spark cluster control or have existing Hadoop workloads. Use Lambda for lightweight, event-driven processing — but not for heavy ETL or anything requiring GPU.
Flashcards
What is the difference between Glue Crawlers and Glue ETL Jobs?
Crawlers discover schema from data sources and populate the Data Catalog (metadata only). ETL Jobs actually transform the data — read, process, and write.
Glue DataBrew and SageMaker Canvas together form a complete no-code ML pipeline: DataBrew handles data preparation and Canvas handles model building. This is the simplest path for teams without ML or data engineering expertise.