Analytics and Query

Before training a model, you often need to explore, query, and understand your data. AWS provides serverless SQL engines, data warehousing, BI dashboards, and data lake governance tools that integrate with ML pipelines.

Overview

Service	What It Does	When to Use
Amazon Athena	Serverless SQL queries directly on S3 data	Ad-hoc SQL exploration of data lake files. Least-effort SQL on S3
Amazon Redshift	Data warehouse with columnar storage and massively parallel processing	Structured analytics at scale, BI reporting. Redshift ML can run SageMaker from SQL
Amazon QuickSight	BI dashboards and visualizations	Business intelligence, dashboards, reporting. Connects to DynamoDB, Redshift, S3, Athena
Amazon OpenSearch Service	Search and log analytics (formerly Elasticsearch)	Log analytics, full-text search, real-time dashboards. Firehose can deliver directly to OpenSearch
AWS Lake Formation	Data lake governance with fine-grained access control on top of Glue Data Catalog	Governed data lake with column-level and row-level security

Athena — Deep Dive

Athena is a serverless query engine that runs standard SQL on data stored in S3. Key facts:

Pricing: $5 per TB scanned
Cost optimization: Use Parquet format and partitioning to reduce the amount of data scanned
Use case: Ad-hoc exploration, data quality checks, feature engineering queries

Athena queries data at rest in S3. For SQL on streaming data, use Kinesis Data Analytics instead.

Athena is NOT a data catalog — it reads metadata from the Glue Data Catalog.

Lake Formation — Deep Dive

Lake Formation sits on top of the Glue Data Catalog and adds governance capabilities that IAM alone cannot provide:

Capability	What It Does
Column-level security	Restrict access to specific columns within a table
Row-level security	Filter rows based on user identity
Tag-based access control	Assign metadata tags to data and grant access based on tags
Central permissions	Single place to manage access across Athena, Redshift Spectrum, Glue, and EMR

Use Lake Formation when you need granular access control combined with a central data catalog — IAM policies alone cannot enforce column-level restrictions.

QuickSight Integrations

QuickSight can connect directly to many data sources without intermediate ETL:

Amazon DynamoDB (direct connection — no pipeline needed for dashboards)
Amazon Redshift
Amazon S3
Amazon Athena
Amazon RDS

When to Use

Use Athena for ad-hoc SQL exploration of S3 data during the data understanding phase. Use Lake Formation when you need governed, fine-grained access to your data lake. Use QuickSight for visualization and stakeholder reporting. Use Redshift when you need a persistent data warehouse with complex analytics.

Flashcards

1 / 7

Question

What is the pricing model for Amazon Athena?

Click to reveal

Answer

$5 per TB of data scanned. Reduce cost by using columnar formats (Parquet/ORC) and partitioning your data so queries scan less.

Pro Tip

When querying S3 data with Athena, always store your data in Parquet or ORC format with partitioning by common query dimensions (date, region, etc.). This can reduce both query time and cost by 10-100x compared to scanning raw CSV or JSON files.

Overview​

Athena — Deep Dive​

Lake Formation — Deep Dive​

QuickSight Integrations​

When to Use​

Flashcards​