Analytics and Query
Before training a model, you often need to explore, query, and understand your data. AWS provides serverless SQL engines, data warehousing, BI dashboards, and data lake governance tools that integrate with ML pipelines.
Overview
| Service | What It Does | When to Use |
|---|---|---|
| Amazon Athena | Serverless SQL queries directly on S3 data | Ad-hoc SQL exploration of data lake files. Least-effort SQL on S3 |
| Amazon Redshift | Data warehouse with columnar storage and massively parallel processing | Structured analytics at scale, BI reporting. Redshift ML can run SageMaker from SQL |
| Amazon QuickSight | BI dashboards and visualizations | Business intelligence, dashboards, reporting. Connects to DynamoDB, Redshift, S3, Athena |
| Amazon OpenSearch Service | Search and log analytics (formerly Elasticsearch) | Log analytics, full-text search, real-time dashboards. Firehose can deliver directly to OpenSearch |
| AWS Lake Formation | Data lake governance with fine-grained access control on top of Glue Data Catalog | Governed data lake with column-level and row-level security |
Athena — Deep Dive
Athena is a serverless query engine that runs standard SQL on data stored in S3. Key facts:
- Pricing: $5 per TB scanned
- Cost optimization: Use Parquet format and partitioning to reduce the amount of data scanned
- Use case: Ad-hoc exploration, data quality checks, feature engineering queries
Athena queries data at rest in S3. For SQL on streaming data, use Kinesis Data Analytics instead.
Athena is NOT a data catalog — it reads metadata from the Glue Data Catalog.
Lake Formation — Deep Dive
Lake Formation sits on top of the Glue Data Catalog and adds governance capabilities that IAM alone cannot provide:
| Capability | What It Does |
|---|---|
| Column-level security | Restrict access to specific columns within a table |
| Row-level security | Filter rows based on user identity |
| Tag-based access control | Assign metadata tags to data and grant access based on tags |
| Central permissions | Single place to manage access across Athena, Redshift Spectrum, Glue, and EMR |
Use Lake Formation when you need granular access control combined with a central data catalog — IAM policies alone cannot enforce column-level restrictions.
QuickSight Integrations
QuickSight can connect directly to many data sources without intermediate ETL:
- Amazon DynamoDB (direct connection — no pipeline needed for dashboards)
- Amazon Redshift
- Amazon S3
- Amazon Athena
- Amazon RDS
When to Use
Use Athena for ad-hoc SQL exploration of S3 data during the data understanding phase. Use Lake Formation when you need governed, fine-grained access to your data lake. Use QuickSight for visualization and stakeholder reporting. Use Redshift when you need a persistent data warehouse with complex analytics.
Flashcards
What is the pricing model for Amazon Athena?
Click to reveal$5 per TB of data scanned. Reduce cost by using columnar formats (Parquet/ORC) and partitioning your data so queries scan less.
When querying S3 data with Athena, always store your data in Parquet or ORC format with partitioning by common query dimensions (date, region, etc.). This can reduce both query time and cost by 10-100x compared to scanning raw CSV or JSON files.