Data Engineering TL;DR: Medallion Architecture

Medallion Architecture

/tldr: A multi-hop pattern for reliable, layered data quality and refinement.

Tags: Data Lakehouse · ETL/ELT · Data Quality · Refinement

The Medallion Architecture organizes data processing into multiple layers, with each layer representing a specific level of data quality, cleanliness, and aggregation. The goal is to maximize flexibility while ensuring trust in the data consumed by the business.

1. Bronze Layer (Raw)

Purpose: Ingest Data

This layer holds the raw, source-system data exactly as it arrived. It serves as an immutable, auditable history of the data.

  • **Format:** Often mixed source formats (JSON, CSV, Avro), typically landed into a modern Lakehouse table format (Delta, Iceberg).
  • **Schema:** Unstructured or semi-structured; the schema matches the source system's payload.
  • **Transformations:** Minimal; only what is needed for data-type compatibility, plus source metadata such as an ingest timestamp and the source file name (see the sketch after this list).
  • **Users:** Data Engineers for auditing, debugging, and reprocessing.
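
As a rough illustration, a Bronze ingestion step might look like the following PySpark sketch. The landing path `/landing/orders/` and table name `bronze.orders` are hypothetical, not part of the pattern itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name

spark = SparkSession.builder.getOrCreate()

# Read raw JSON exactly as it arrived; no cleansing yet.
raw = spark.read.json("/landing/orders/")

# Add only source metadata: ingest time and originating file.
bronze = (
    raw
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
)

# Append-only write preserves the immutable history.
bronze.write.format("delta").mode("append").saveAsTable("bronze.orders")
```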

2. Silver Layer (Cleansed & Conformed)

Purpose: Cleanse and Integrate

The Silver layer applies essential quality and harmonization rules. Data here is structurally sound and ready for downstream analysis.

  • **Transformations:** Filtering invalid records, de-duplication, handling null values, parsing timestamps, and standardizing categorical fields (sketched after this list).
  • **Schema:** Structured, normalized, and enforced. This is where you typically find entity tables (like customer, product, order).
  • **Conforming:** Keys across different source systems are resolved (e.g., matching a web user ID to a CRM customer ID).
  • **Users:** Data Scientists for feature engineering, and advanced Data Analysts.
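
A minimal Silver cleansing sketch, assuming the hypothetical `bronze.orders` table from above; the column names (`order_id`, `amount`, `ordered_at`, `status`) are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, trim, upper

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.table("bronze.orders")

silver = (
    bronze
    # Filter invalid records: require a business key and a positive amount.
    .filter(col("order_id").isNotNull() & (col("amount") > 0))
    # Parse string timestamps into proper timestamp types.
    .withColumn("ordered_at", to_timestamp(col("ordered_at"), "yyyy-MM-dd HH:mm:ss"))
    # Standardize a categorical field.
    .withColumn("status", upper(trim(col("status"))))
    # De-duplicate on the business key, keeping one row per order.
    .dropDuplicates(["order_id"])
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```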

3. Gold Layer (Curated)

Purpose: Business-Ready Reporting

The most refined layer, optimized for fast analytical consumption and containing the final business logic.

  • **Transformations:** Aggregation (daily sales totals, monthly active users), complex metric calculation, and star/snowflake schema modeling (see the sketch after this list).
  • **Schema:** Denormalized dimensional models (fact and dimension tables), optimized for BI query speed.
  • **Granularity:** Reduced (aggregated) compared to the row-level detail in Silver.
  • **Users:** BI Tools, Executive Dashboards, and end-user Analysts.
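
A Gold aggregation sketch along the same lines, reducing the assumed `silver.orders` table to daily totals; the table and column names remain hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct, to_date, sum as sum_

spark = SparkSession.builder.getOrCreate()

silver = spark.read.table("silver.orders")

# Reduce granularity: one row per day instead of one row per order.
gold = (
    silver
    .withColumn("order_date", to_date(col("ordered_at")))
    .groupBy("order_date")
    .agg(
        sum_("amount").alias("daily_sales_total"),
        countDistinct("customer_id").alias("active_customers"),
    )
)

gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales")
```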

4. Quality Gates & Delta Live Tables (DLT)

The "Medallion" concept relies on **Quality Gates**—defined checks that data must pass before promoting it to the next layer.

Quality Gates

Checks applied at the Silver and Gold layer boundaries to ensure data integrity.

  • **Expectations:** Defined constraints (e.g., "Customer ID must not be NULL," "Sale Amount must be positive").
  • **Handling:** Data that fails a quality gate can be (a quarantine sketch follows this list):
    1. **Dropped:** The record is silently discarded (rare).
    2. **Quarantined:** The record is moved to an error table for later review.
    3. **Failed:** The entire pipeline stops until the source issue is resolved.
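
One way to implement the quarantine option is to split the data on the expectation predicate, as in this PySpark sketch; the predicate and table names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("bronze.orders")

# The expectation: customer_id must not be NULL and amount must be positive.
# coalesce() treats a NULL predicate result as a failure, so no row is lost.
ok = coalesce(col("customer_id").isNotNull() & (col("amount") > 0), lit(False))

# Promote passing rows; quarantine failures in an error table for review.
df.filter(ok).write.format("delta").mode("append").saveAsTable("silver.orders")
df.filter(~ok).write.format("delta").mode("append").saveAsTable("quarantine.orders")
```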

Delta Live Tables (DLT)

A framework from Databricks designed specifically to build and manage Medallion pipelines with built-in quality rules.

  • **Simplified ETL:** Automatically manages table creation, dependencies, schema evolution, and updates.
  • **Declarative Pipelines:** You define the desired state (e.g., "Silver is a clean version of Bronze"), not the complex execution logic.
  • **Quality Integration:** Quality gates are defined directly in the table definition using SQL or Python (e.g., `CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW`); a Python sketch follows.
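
Below is a minimal Python sketch of the same idea using DLT's decorator API (`@dlt.table`, `@dlt.expect_or_drop`, `@dlt.expect_or_fail`). It only runs inside a Databricks DLT pipeline, where `spark` is provided; the landing path is an assumption.

```python
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(comment="Raw orders, ingested as-is.")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader ingestion
        .option("cloudFiles.format", "json")
        .load("/landing/orders/")                      # hypothetical landing path
        .withColumn("_ingested_at", current_timestamp())
    )

@dlt.table(comment="Cleansed orders.")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")      # drop violating rows
@dlt.expect_or_fail("positive_amount", "amount > 0")   # stop the pipeline on violation
def silver_orders():
    return dlt.read_stream("bronze_orders").dropDuplicates(["id"])
```

Note how the decorators map directly to the handling options above: `expect_or_drop` discards, `expect_or_fail` halts, and plain `expect` records violations without blocking.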

The Medallion approach ensures that every hop increases trust and value.
