Medallion Architecture
/tldr: A multi-hop pattern for reliable, layered data quality and refinement.
The Medallion Architecture organizes data processing into multiple layers, with each layer representing a specific level of data quality, cleanliness, and aggregation. The goal is to maximize flexibility while ensuring trust in the data consumed by the business.
1. Bronze Layer (Raw)
Purpose: Ingest Data
This layer holds the raw, source-system data exactly as it arrived. It serves as an immutable, audited history of the data.
- **Format:** Often mixed source formats (JSON, CSV, Avro), typically landed into a modern Lakehouse table format (Delta, Iceberg).
- **Schema:** Unstructured or semi-structured; the schema matches the source system's payload.
- **Transformations:** Minimal; only what is needed for data type compatibility or to add source metadata (ingestion timestamp, source file name). See the sketch after this list.
- **Users:** Data Engineers for auditing, debugging, and reprocessing.
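As a rough illustration, a Bronze load can be as simple as landing the payload unchanged and stamping it with audit metadata. The PySpark sketch below is illustrative only; the landing path, table name, and metadata column names are assumptions.

```python
# Bronze ingestion sketch (PySpark). Paths, table, and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw payload exactly as delivered (schema inferred from the source).
raw = spark.read.json("/landing/orders/")

# Add audit/provenance metadata only; no business transformations.
bronze = (
    raw
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name())
)

# Append-only: Bronze is the immutable history of what arrived.
bronze.write.format("delta").mode("append").saveAsTable("bronze.orders")
```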
2. Silver Layer (Cleansed & Conformed)
Purpose: Cleanse and Integrate
The Silver layer applies essential quality and harmonization rules. Data here is structurally sound and ready for business analysis.
- **Transformations:** Filtering invalid records, de-duplication, handling null values, parsing timestamps, and standardizing categorical fields (see the sketch after this list).
- **Schema:** Structured, normalized, and enforced. This is where you typically find entity tables (like customer, product, order).
- **Conforming:** Keys across different source systems are resolved (e.g., matching a web user ID to a CRM customer ID).
- **Users:** Data Scientists for feature engineering, and advanced Data Analysts.
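A Bronze-to-Silver step might look like the following sketch. It reuses the hypothetical bronze.orders table from above and assumes a crm.customers lookup for key conforming; all column names are illustrative.

```python
# Silver cleansing and conforming sketch (PySpark). Table and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.table("bronze.orders")
crm = spark.read.table("crm.customers").select("web_user_id", "customer_id")

silver = (
    bronze
    .filter(F.col("order_id").isNotNull())                    # drop invalid records
    .dropDuplicates(["order_id"])                             # de-duplication
    .withColumn("order_ts", F.to_timestamp("order_ts_raw"))   # parse timestamps
    .withColumn("status", F.upper(F.trim(F.col("status"))))   # standardize categorical fields
    .fillna({"discount": 0.0})                                # handle null values
    .join(crm, on="web_user_id", how="left")                  # conform web user ID -> CRM customer ID
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```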
3. Gold Layer (Curated)
Purpose: Business-Ready Reporting
The Gold layer is the most refined, optimized for fast analytical consumption and containing the final business logic.
- **Transformations:** Aggregation (daily sales totals, monthly active users), complex metric calculation, and application of star/snowflake schema modeling (see the sketch after this list).
- **Schema:** Denormalized dimensional models (fact and dimension tables), optimized for BI query speed.
- **Granularity:** Reduced (aggregated) compared to the row-level detail in Silver.
- **Users:** BI Tools, Executive Dashboards, and end-user Analysts.
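For example, a Gold table for daily sales can be a simple aggregation of the Silver detail. The sketch below assumes the hypothetical silver.orders table and writes an aggregated gold.daily_sales table; metric names are illustrative.

```python
# Gold aggregation sketch (PySpark). Table and column names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("silver.orders")

# Reduce row-level detail to business-ready daily metrics.
daily_sales = (
    orders
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(
        F.sum("sale_amount").alias("total_sales"),
        F.countDistinct("customer_id").alias("active_customers"),
    )
)

daily_sales.write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales")
```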
4. Quality Gates & Delta Live Tables (DLT)
The "Medallion" concept relies on **Quality Gates**—defined checks that data must pass before promoting it to the next layer.
Quality Gates
Checks applied at the Silver and Gold layer boundaries to ensure data integrity.
- **Expectations:** Defined constraints (e.g., "Customer ID must not be NULL," "Sale Amount must be positive").
- **Handling:** Data that fails a quality gate can be (see the sketch after this list):
  - **Dropped:** The record is silently discarded (rare).
  - **Quarantined:** The record is moved to an error table for later review.
  - **Failed:** The entire pipeline run stops until the source issue is resolved.
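A hand-rolled gate with quarantine handling (outside of DLT) might look like the sketch below. The rule, source table, and quarantine.orders destination are illustrative assumptions.

```python
# Quality-gate sketch with quarantine handling (PySpark). Names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

candidates = spark.read.table("bronze.orders")

# Expectations: customer_id must not be NULL and sale_amount must be positive.
# coalesce(..., False) treats a NULL evaluation (e.g. NULL sale_amount) as a failure.
passes = F.coalesce(
    F.col("customer_id").isNotNull() & (F.col("sale_amount") > 0),
    F.lit(False),
)

valid = candidates.filter(passes)
quarantined = candidates.filter(~passes).withColumn("_rejected_at", F.current_timestamp())

valid.write.format("delta").mode("append").saveAsTable("silver.orders")
quarantined.write.format("delta").mode("append").saveAsTable("quarantine.orders")
```

Frameworks like DLT express the same behavior declaratively, as described in the next subsection.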
Delta Live Tables (DLT)
A declarative framework from Databricks, commonly used to build and manage Medallion pipelines with built-in data quality rules.
- **Simplified ETL:** Automatically manages table creation, dependencies, schema evolution, and updates.
- **Declarative Pipelines:** You define the desired state (e.g., "Silver is a clean version of Bronze"), not the complex execution logic.
- **Quality Integration:** Quality gates are defined directly in the table definition using SQL or Python (e.g., `CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW`).
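In Python, expectations are attached to the table definition with decorators. The sketch below is a minimal, illustrative DLT pipeline; the table names, landing path, and expectation conditions are assumptions, not a reference implementation.

```python
# Minimal DLT pipeline sketch (Python). Table names, path, and rules are illustrative.
import dlt
from pyspark.sql import functions as F

# `spark` is provided by the DLT runtime inside a pipeline.

@dlt.table(comment="Raw orders, ingested as-is")
def bronze_orders():
    return spark.read.json("/landing/orders/")

@dlt.table(comment="Cleansed orders")
@dlt.expect_or_drop("valid_id", "order_id IS NOT NULL")   # rows that fail are dropped
@dlt.expect("positive_amount", "sale_amount > 0")         # rows are kept, violations are recorded
def silver_orders():
    return (
        dlt.read("bronze_orders")
        .dropDuplicates(["order_id"])
        .withColumn("order_ts", F.to_timestamp("order_ts_raw"))
    )
```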
The Medallion approach ensures that every hop increases trust and value.