AWS TL;DR: Amazon Athena
Back to AWS TL;DR Hub

Amazon Athena

/tldr: Serverless, interactive SQL query service for data in S3.

Analytics Serverless Ad-Hoc SQL

1. Core Concepts: No Infrastructure

Athena uses the Trino (formerly Presto) engine to execute standard ANSI SQL queries directly against structured and unstructured data stored in Amazon S3. You never manage any compute infrastructure.

Key Principles

  • **Schema-on-Read:** Athena only applies the schema when you run the query, not when the data is loaded (unlike traditional databases).
  • **No ETL Needed:** Data is analyzed in place in S3. It is often referred to as a "query engine."
  • **Output to S3:** Query results are automatically saved to an S3 bucket you define.

The Role of AWS Glue

To query data, Athena needs to know the data structure. It gets this information from the **AWS Glue Data Catalog**.

  • **Data Catalog:** Glue stores the metadata (schema, column types, partitioning) for your S3 data.
  • **Glue Crawlers:** These can automatically scan S3 buckets to infer the schema and populate the Data Catalog, making the data instantly queryable by Athena.

2. The Pricing Model: Pay Per Scan

The primary cost driver for Athena is the amount of data scanned per query.

Cost Optimization is Crucial

Every byte scanned costs money. Optimization means reducing the data Athena has to read.

  • **Partitioning:** Organizing your S3 data by date, ID, or region. Athena uses partitions to eliminate unnecessary data scanning.
  • **Columnar Formats:** Convert data from row-based formats (like CSV/JSON) to columnar formats (like **Parquet** or **ORC**). These are significantly faster and cheaper as Athena only reads the columns requested in the `SELECT` statement.
  • **Compression:** Compressing files (e.g., GZIP) reduces the volume of data transferred and scanned.

3. Use Cases & Example Syntax

Athena is ideal for interactive analysis where you don't need a dedicated, always-on cluster.

Common Applications

  • **Log Analysis:** Quickly query years of web server or application logs stored in S3.
  • **Ad-Hoc Reporting:** Running one-off business intelligence queries without setting up a data warehouse.
  • **Data Lake Exploration:** Discovering patterns in raw data before deciding on full ETL jobs.

Example: Creating a Table in Athena

CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  `time` timestamp,
  `user_id` string,
  `action` string
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
LOCATION 's3://my-raw-data-bucket/web-logs/';

SELECT count(user_id) 
FROM web_logs 
WHERE action = 'login' AND time > now() - interval '7' day;
            

Note: The `CREATE EXTERNAL TABLE` command defines the schema in Glue; it does not move or modify the data in S3.

Athena turns your S3 bucket into a fully queryable, serverless database.

AWS Fundamentals Series: Athena