Skip to main content

Analytics and Query

Before training a model, you often need to explore, query, and understand your data. AWS provides serverless SQL engines, data warehousing, BI dashboards, and data lake governance tools that integrate with ML pipelines.

Overview

ServiceWhat It DoesWhen to Use
Amazon AthenaServerless SQL queries directly on S3 dataAd-hoc SQL exploration of data lake files. Least-effort SQL on S3
Amazon RedshiftData warehouse with columnar storage and massively parallel processingStructured analytics at scale, BI reporting. Redshift ML can run SageMaker from SQL
Amazon QuickSightBI dashboards and visualizationsBusiness intelligence, dashboards, reporting. Connects to DynamoDB, Redshift, S3, Athena
Amazon OpenSearch ServiceSearch and log analytics (formerly Elasticsearch)Log analytics, full-text search, real-time dashboards. Firehose can deliver directly to OpenSearch
AWS Lake FormationData lake governance with fine-grained access control on top of Glue Data CatalogGoverned data lake with column-level and row-level security

Athena — Deep Dive

Athena is a serverless query engine that runs standard SQL on data stored in S3. Key facts:

  • Pricing: $5 per TB scanned
  • Cost optimization: Use Parquet format and partitioning to reduce the amount of data scanned
  • Use case: Ad-hoc exploration, data quality checks, feature engineering queries

Athena queries data at rest in S3. For SQL on streaming data, use Kinesis Data Analytics instead.

Athena is NOT a data catalog — it reads metadata from the Glue Data Catalog.

Lake Formation — Deep Dive

Lake Formation sits on top of the Glue Data Catalog and adds governance capabilities that IAM alone cannot provide:

CapabilityWhat It Does
Column-level securityRestrict access to specific columns within a table
Row-level securityFilter rows based on user identity
Tag-based access controlAssign metadata tags to data and grant access based on tags
Central permissionsSingle place to manage access across Athena, Redshift Spectrum, Glue, and EMR

Use Lake Formation when you need granular access control combined with a central data catalog — IAM policies alone cannot enforce column-level restrictions.

QuickSight Integrations

QuickSight can connect directly to many data sources without intermediate ETL:

  • Amazon DynamoDB (direct connection — no pipeline needed for dashboards)
  • Amazon Redshift
  • Amazon S3
  • Amazon Athena
  • Amazon RDS

When to Use

Use Athena for ad-hoc SQL exploration of S3 data during the data understanding phase. Use Lake Formation when you need governed, fine-grained access to your data lake. Use QuickSight for visualization and stakeholder reporting. Use Redshift when you need a persistent data warehouse with complex analytics.

Flashcards

1 / 7
Question

What is the pricing model for Amazon Athena?

Click to reveal
Answer

$5 per TB of data scanned. Reduce cost by using columnar formats (Parquet/ORC) and partitioning your data so queries scan less.

Pro Tip

When querying S3 data with Athena, always store your data in Parquet or ORC format with partitioning by common query dimensions (date, region, etc.). This can reduce both query time and cost by 10-100x compared to scanning raw CSV or JSON files.