AWS TL;DR: Glue Crawler

AWS Glue Crawler

/tldr: An automatic data discovery agent that infers schema and partitions.


1. Core Concepts: Data Discovery

The Glue Crawler is a serverless component of the AWS Glue service. Its sole job is to connect to a data store (like S3, DynamoDB, or RDS), examine the data, and automatically extract the schema and related metadata.

Crawler Process

  • **Connect:** The Crawler assumes an IAM role to connect to the source data (e.g., an S3 path).
  • **Sample & Infer:** It samples a subset of the data to guess the data format (CSV, JSON, Parquet), schema (column names, types), and potential compression (GZIP, Snappy).
  • **Partition Discovery:** Crucially, it identifies how the data is logically grouped (e.g., `s3://bucket/year=2024/month=01/`). It creates partition keys so services like Athena can query subsets of the data efficiently.
  • **Catalog Update:** It writes the discovered tables and schemas into the **AWS Glue Data Catalog** (a minimal setup is sketched after this list).
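
A minimal sketch of that process using boto3. The crawler name, IAM role, database, and S3 path are hypothetical placeholders, not values from a real setup:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler pointed at an S3 prefix. The role, database, and
# bucket names below are hypothetical.
glue.create_crawler(
    Name="sales-crawler",
    Role="GlueCrawlerRole",   # IAM role the crawler assumes to read the source
    DatabaseName="sales_db",  # Data Catalog database that receives the tables
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply detected schema changes
        "DeleteBehavior": "LOG",                 # log, rather than drop, stale tables
    },
)

# Kick off a run: the crawler samples the files, infers format, schema,
# and partition keys, then writes the results to the Data Catalog.
glue.start_crawler(Name="sales-crawler")
```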

2. The Output: Glue Data Catalog

The Data Catalog is the central, persistent metadata store for your Data Lake (S3).

Why it Matters

Without a schema in the Data Catalog, S3 data is just raw files. With the schema, it becomes a queryable database table.

  • **Centralized Access:** The same table definitions created by the Crawler can be used by Athena, Redshift Spectrum, EMR, and Glue ETL jobs (an Athena example follows this list).
  • **Eliminates Code:** You don't have to write code to define the schema; the Crawler does it for you.
  • **Handles Change:** Crawlers can be scheduled to run regularly, automatically detecting new partitions or changes in the schema of the source data.
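
To illustrate the "same table, many engines" point, here is a sketch of querying a crawler-created table from Athena via boto3, reusing the hypothetical names from above (the results bucket is also a placeholder):

```python
import boto3

athena = boto3.client("athena")

# Query the crawler-created table. The partition predicate means Athena
# scans only s3://my-data-lake/sales/year=2024/month=01/ instead of
# the whole dataset.
resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM sales WHERE year = '2024' AND month = '01'",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
print("Query started:", resp["QueryExecutionId"])
```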

3. Configuration Notes

Grouping Behavior

  • **Table Creation:** Crawlers look for common structures. If the files under an S3 prefix share a compatible schema, the Crawler creates a single partitioned table; if schemas diverge, it may create many tables. Grouping options let you fine-tune this behavior.
  • **Custom Classifiers:** If your data format is non-standard (e.g., a highly specific log format), you can write Grok or XML classifiers to guide the Crawler's inference engine (see the sketch below).
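
As an illustration, a custom Grok classifier can be registered and attached to a crawler like this; the classifier name, classification value, and pattern are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Define a Grok classifier for a custom log line such as:
# "2024-01-15T10:42:00 ERROR payment gateway timeout"
glue.create_classifier(
    GrokClassifier={
        "Name": "app-log-classifier",
        "Classification": "app_logs",  # recorded on tables this classifier matches
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)

# Custom classifiers attached to a crawler are tried before the built-ins.
glue.update_crawler(Name="sales-crawler", Classifiers=["app-log-classifier"])
```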

Cost Control

  • **Time-Based Runs:** Crawler runs are billed by runtime (DPU-hours), so running them on a schedule (e.g., nightly) rather than continuously is the common pattern.
  • **Incremental Crawls:** You can configure the Crawler to crawl only folders added to S3 since the last run, instead of rescanning the entire dataset, which saves time and money (see the sketch below).
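
A sketch of both settings applied to the hypothetical crawler from earlier:

```python
import boto3

glue = boto3.client("glue")

# Run nightly at 02:00 UTC and only crawl folders added since the
# last run, instead of rescanning the whole dataset.
glue.update_crawler(
    Name="sales-crawler",
    Schedule="cron(0 2 * * ? *)",
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    # Incremental crawls are paired with LOG so that schema changes on
    # existing partitions are recorded rather than applied.
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
```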

The Relationship (S3 to Query)

S3 (Raw Data) --> Glue Crawler (Infers Schema) --> Glue Data Catalog (Stores Metadata) --> Athena (Queries Data)

The Glue Crawler turns disorganized files in S3 into usable, queryable tables.
