dbt (Data Build Tool)
tl;dr: The transformation tool that brings software engineering best practices to data work.
1. Models: The Heart of dbt
A **dbt Model** is simply a single SELECT statement defined in a .sql file. When you run dbt, it compiles the SQL and executes it against your data warehouse (e.g., Snowflake, BigQuery), creating a table or view in the process.
Materializations
This defines how your model's SQL is converted into an object in the warehouse.
- **View (Default):** The SQL runs every time the view is queried. Fast to build with no storage cost, but slower at query time.
- **Table:** The SQL is executed once per run and a physical table is built. Slower to build and consumes storage, but fast at query time.
- **Incremental:** Only new or changed rows since the last run are processed and appended/merged into the existing table. Optimal for large datasets.
- **Ephemeral:** Used for intermediate logic; models are compiled into Common Table Expressions (CTEs) and never materialized physically.
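A model declares its materialization in a config() block at the top of its .sql file. A minimal incremental sketch (the model, source, and column names are hypothetical):

```sql
-- models/fct_events.sql (hypothetical model)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already loaded.
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on subsequent runs the is_incremental() branch filters the input down to new rows.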
The ref() Function
The key to dependency management. Instead of hardcoding table names, you use {{ ref('other_model_name') }}.
dbt uses these ref() calls to automatically build a dependency graph (DAG) of your models, ensuring that upstream tables are built before downstream tables use them.
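For instance, a downstream model references its upstream staging models by name rather than by hardcoded warehouse path (the model names here are hypothetical):

```sql
-- models/fct_orders.sql (hypothetical model)
select
    o.order_id,
    o.customer_id,
    p.total_amount
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_payments') }} as p
    on o.order_id = p.order_id
```

At compile time dbt replaces each ref() with the fully qualified relation name and records the edges stg_orders → fct_orders and stg_payments → fct_orders in the DAG.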
2. Jinja: SQL Superpower
dbt uses Jinja, a Python-based templating language, to introduce programming concepts (like loops, macros, and variables) into SQL, making it modular and reusable.
Why Jinja?
- **Macros:** Reusable code snippets (like functions in Python) for common logic (e.g., standardizing a date format across all models).
- **Variables:** Define parameters that change based on environment (e.g., development vs. production) or run time.
- **Loops/Conditionals:** Generate repetitive SQL boilerplate (e.g., creating 50 similar columns) or dynamically choose SQL based on logic.
Example:

```sql
select {{ dbt_utils.surrogate_key(['id', 'name']) }} as pk_id, *
from source
```

(Note: this macro was renamed generate_surrogate_key in dbt_utils 1.0+.)
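A loop can also stamp out repetitive SQL, such as pivoting one column per category. A sketch (the model, column, and category names are hypothetical):

```sql
-- Pivots payment amounts into one column per payment method.
select
    order_id,
    {% for method in ['bank_transfer', 'credit_card', 'gift_card'] %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by 1
```

Adding a new payment method means editing one list item instead of copying another CASE expression.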
3. Tests, Docs, and Exposures
dbt's governance features ensure data quality, transparency, and traceability from model creation to end-user consumption.
Tests
Assertions about your data (defined in YAML). dbt runs these tests as SQL queries to check data quality.
- **Generic:** Built-in tests: unique, not_null, accepted_values.
- **Custom:** Write your own test SQL (e.g., to check if total sales equal total refunds).
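Generic tests are attached to columns in a YAML file. A sketch (the model, column names, and accepted values are hypothetical):

```yaml
# models/schema.yml (hypothetical path)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

dbt compiles each test into a query that selects violating rows; the test passes when that query returns zero rows.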
Docs (Documentation)
Central place to define model, column, and source descriptions using YAML.
dbt compiles this metadata into a navigable, searchable website that visually maps the entire data lineage (DAG).
Exposures
Links dbt models to their downstream consumers (e.g., Tableau dashboard, ML model, or external application).
Crucial for impact analysis: if a model fails a test, you know exactly which business-critical reports are affected.
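An exposure is declared in YAML alongside the models it depends on (the exposure name, owner, and model names are hypothetical):

```yaml
# models/exposures.yml (hypothetical path)
version: 2

exposures:
  - name: weekly_sales_dashboard
    type: dashboard
    maturity: high
    owner:
      name: Analytics Team
      email: analytics@example.com
    depends_on:
      - ref('fct_orders')
```

You can then select everything the dashboard relies on, e.g. with `dbt ls --select +exposure:weekly_sales_dashboard`.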
4. Snapshots
Snapshots track how data in a source table changes over time, preserving a historical record of each row's state.
Snapshot Mechanics (SCD Type 2)
dbt creates a **Slowly Changing Dimension Type 2 (SCD Type 2)** table by applying two key columns to the source data:
- **dbt_valid_from:** Timestamp when the record became valid (or was first inserted).
- **dbt_valid_to:** Timestamp when the record was superseded by a change (or NULL if it is the current valid record).
This is essential for analyzing metrics based on historical business logic, such as viewing a customer's segment *at the time* they made a purchase.
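A snapshot is defined in a {% snapshot %} block with a strategy that tells dbt how to detect changes. A timestamp-strategy sketch (the schema, source, and column names are hypothetical):

```sql
-- snapshots/customers_snapshot.sql (hypothetical path)
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```

Each `dbt snapshot` run compares incoming rows against the snapshot table: changed rows get their dbt_valid_to closed out, and a fresh version is inserted with a new dbt_valid_from.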
dbt is the standardization layer that turns chaotic scripts into version-controlled, testable data products.