
ETL and Data Processing

Before data can be used for ML training, it needs to be cleaned, transformed, and formatted. AWS offers a spectrum of ETL tools ranging from no-code visual editors to full Spark cluster management. Choosing the right tool depends on the complexity of the transformation, the team's technical skills, and the scale of the data.

Overview

| Service | What It Does | When to Use |
| --- | --- | --- |
| AWS Glue | Serverless ETL with PySpark/Scala. Includes Crawlers for schema discovery, the Data Catalog for metadata, and FindMatches for fuzzy deduplication | Batch ETL at scale, format conversion, schema discovery, data cataloging |
| AWS Glue DataBrew | No-code visual data preparation with 250+ built-in transformations | Data prep by non-technical users: resampling, filling missing values, cleaning |
| Amazon EMR | Managed Hadoop/Spark clusters for big data processing | Massive-scale batch processing, existing Spark/Hadoop workloads, full cluster control |
| AWS Lambda | Serverless event-driven functions (max 15 min, 10 GB memory, no GPU) | Lightweight transforms, event triggers, glue between services |
| AWS Step Functions | General-purpose serverless workflow orchestration | Orchestrating multi-step workflows across AWS services |
| AWS Data Pipeline | Scheduled batch data movement (legacy service) | Moving data from RDS/DynamoDB to S3 on a schedule |
| AWS DMS | Database migration and ongoing replication (CDC) | Database migration, change data capture |
| AWS DataSync | Transfers data between on-premises storage and AWS with encryption, scheduling, and integrity validation | On-premises-to-S3 migration with automation |

AWS Glue — Deep Dive

Glue is the most versatile batch ETL tool on AWS. Its components serve different purposes:

| Component | Purpose |
| --- | --- |
| Crawlers | Discover schema from S3 data and populate the Data Catalog. Crawlers discover; they do not transform |
| Data Catalog | Central metadata repository that stores table definitions, schemas, and partitions |
| ETL Jobs | PySpark or Scala jobs that read, transform, and write data |
| FindMatches | ML-based fuzzy deduplication that matches records that are similar but not identical |
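As a concrete illustration of the Crawler/Catalog split, the sketch below builds the request for registering an S3 crawler with the boto3 Glue client. The crawler name, role ARN, database, and S3 path are placeholders of my own, not values from this document.

```python
# Sketch: kwargs for glue.create_crawler() -- all names and ARNs below
# are illustrative placeholders.

def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the request for registering an S3 crawler.

    The crawler only discovers schema and writes table definitions into the
    given Data Catalog database -- it never transforms the data itself.
    """
    return {
        "Name": name,
        "Role": role_arn,          # IAM role Glue assumes to read the S3 data
        "DatabaseName": database,  # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

# Usage (requires AWS credentials; shown for illustration only):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_request(
#     "sales-crawler", "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#     "sales_db", "s3://my-bucket/raw/sales/"))
# glue.start_crawler(Name="sales-crawler")
```

Once the crawler has run, a Glue ETL job can read the discovered table by database and table name instead of hard-coding the schema.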

Glue DataBrew vs. Glue ETL

| | DataBrew | Glue ETL |
| --- | --- | --- |
| Target user | Business analysts (no-code) | Data engineers |
| Interface | Visual, point-and-click | PySpark/Scala code |
| Database access | Connects directly to PostgreSQL, MySQL, etc. — no DMS needed | Via JDBC connections |
| Scale | Medium | Large-scale (distributed Spark) |

ETL Tool Decision Guide

| Scenario | Best Choice |
| --- | --- |
| Serverless batch ETL with PySpark | AWS Glue |
| No-code data preparation | Glue DataBrew |
| Full Spark cluster control or existing Hadoop | Amazon EMR |
| Event-driven lightweight processing | AWS Lambda (15-min timeout, no GPU) |
| ML workflow orchestration | SageMaker Pipelines |
| General AWS workflow orchestration | AWS Step Functions |
| On-premises to S3 with encryption and scheduling | AWS DataSync |

Effort ranking (least to most): DataBrew → Glue → EMR → Custom EC2
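The Lambda limits called out in the guide (15-minute timeout, 10 GB memory, no GPU) can be turned into a quick feasibility check. This is an illustrative sketch, not an official AWS API:

```python
# Sketch: encode the Lambda hard limits from the decision guide as a
# simple predicate -- a hypothetical helper, not part of any AWS SDK.

def fits_lambda(duration_min: float, memory_gb: float, needs_gpu: bool) -> bool:
    """True if a job fits inside Lambda's limits: 15-min timeout,
    10 GB memory, no GPU. Anything that fails belongs in Glue or EMR."""
    return duration_min <= 15 and memory_gb <= 10 and not needs_gpu

# A 5-minute CPU transform fits; a 2-hour Spark job or anything
# needing a GPU does not.
```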

Format Conversion Reference

| Conversion | Tool |
| --- | --- |
| CSV → Parquet (batch) | Glue ETL job |
| JSON → Parquet (streaming) | Kinesis Data Firehose (native conversion via the Glue Data Catalog) |
| Images → RecordIO | im2rec (MXNet utility) |
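For the streaming JSON → Parquet row, Firehose's record format conversion is configured on the delivery stream's extended S3 destination and pulls its schema from a Glue Data Catalog table. The sketch below builds that configuration block; the role ARN, database, and table names are placeholders of my own.

```python
# Sketch: the DataFormatConversionConfiguration block that enables
# Firehose's native JSON -> Parquet conversion. ARNs and names are
# illustrative placeholders.

def format_conversion_config(role_arn: str, database: str, table: str) -> dict:
    """Config block for ExtendedS3DestinationConfiguration: deserialize
    incoming JSON, serialize to Parquet using a Glue Catalog schema."""
    return {
        "Enabled": True,
        "SchemaConfiguration": {       # Glue Catalog table supplies the schema
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
        },
        "InputFormatConfiguration": {  # incoming records are JSON
            "Deserializer": {"OpenXJsonSerDe": {}}
        },
        "OutputFormatConfiguration": { # written to S3 as Parquet
            "Serializer": {"ParquetSerDe": {}}
        },
    }
```

This block would be passed inside the ExtendedS3DestinationConfiguration when creating the delivery stream; no Glue ETL job is involved, which is why the table lists Firehose for the streaming case.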

When to Use

Start with Glue DataBrew for simple, no-code data prep. Move to Glue ETL for complex PySpark transformations at scale. Use EMR only when you need full Spark cluster control or have existing Hadoop workloads. Use Lambda for lightweight, event-driven processing — but not for heavy ETL or anything requiring GPU.
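To make "lightweight, event-driven processing" concrete, here is a minimal Lambda-style handler that cleans a small batch of records. The event shape is a simplified assumption for illustration, not a real AWS event format:

```python
# Sketch: the kind of lightweight transform Lambda suits -- dropping
# incomplete records and normalizing types. The event shape here is an
# assumed, simplified format, not an actual AWS event payload.
import json

def handler(event, context=None):
    """Filter out records missing an id or value, coerce value to float.
    Well under Lambda's 15-min/10 GB limits; heavier ETL belongs in Glue."""
    records = event.get("records", [])
    cleaned = [
        {"id": r["id"], "value": float(r["value"])}
        for r in records
        if r.get("id") is not None and r.get("value") not in (None, "")
    ]
    return {"statusCode": 200, "body": json.dumps(cleaned)}

# Local usage:
# handler({"records": [{"id": 1, "value": "3.5"}, {"id": None, "value": "2"}]})
```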

Flashcards

Question

What is the difference between Glue Crawlers and Glue ETL Jobs?

Answer

Crawlers discover schema from data sources and populate the Data Catalog (metadata only). ETL Jobs actually transform the data — read, process, and write.

Key Insight

Glue DataBrew and SageMaker Canvas together form a complete no-code ML pipeline: DataBrew handles data preparation and Canvas handles model building. This is the simplest path for teams without ML or data engineering expertise.