Databricks TL;DR: Delta Live Tables

Delta Live Tables (DLT)

/tldr: A framework for building reliable, maintainable, and testable data pipelines using a simple declarative approach (SQL or Python).

Declarative ETL · Data Quality · Automated Dependencies · Observability

1. The DLT Pipeline: Automated ETL/ELT

DLT abstracts away the complexity of Spark/Delta Lake operations, cluster management, and error handling. You simply *declare* the desired state of your output tables, and DLT handles the incremental computation and orchestration.

How Pipelines Work

Declarative Code

Instead of defining *how* to process data, you use SQL or Python (with decorators) to declare *what* each table should contain; DLT infers the dependency graph from the references between datasets: CREATE OR REFRESH STREAMING TABLE target AS SELECT ... FROM source.

Automated Orchestration

DLT automatically determines the execution order, handles failures and retries, and provisions the necessary infrastructure (cluster creation and autoscaling).

DLT is typically used to implement the Medallion Architecture (Bronze, Silver, Gold layers) through a series of dependent data sets.
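To make the declarative flow concrete, here is a minimal Python sketch of a three-layer pipeline. The dataset names (bronze_orders, silver_orders, gold_revenue), the columns, and the landing path are illustrative assumptions, not a prescribed layout:

import dlt

@dlt.table(comment="Bronze: raw orders ingested incrementally with Auto Loader")
def bronze_orders():
    # Illustrative landing path; cloudFiles (Auto Loader) picks up new files as they arrive
    return spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load("/data/landing/orders")

@dlt.table(comment="Silver: cleaned orders")
def silver_orders():
    # Reading bronze_orders here is what tells DLT about the bronze -> silver dependency
    return dlt.read_stream("bronze_orders").where("order_id IS NOT NULL")

@dlt.table(comment="Gold: revenue per customer")
def gold_revenue():
    # Batch read of the full silver table; DLT maintains this as a materialized view
    return dlt.read("silver_orders").groupBy("customer_id").agg({"amount": "sum"})

DLT derives the execution order (bronze, then silver, then gold) from these reads; no explicit scheduling or dependency configuration is written.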

2. Data Sets: Streaming vs. Materialized

DLT defines two main types of data sets, which are simply Delta tables with auto-managed refresh logic.

Streaming Tables (ST)

**Purpose:** Ingesting and processing data incrementally from append-only sources (cloud storage via Auto Loader, Kafka, Kinesis) or from other Streaming Tables. They are designed to continuously process newly arriving data using **Structured Streaming**.

**Code Example (Python):**

import dlt

@dlt.table(comment="Raw streaming events")
def bronze_events():
    # Auto Loader (cloudFiles) ingests new files incrementally; the path is illustrative
    return spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load("/data/raw/events")

Materialized Views (MV)

**Purpose:** Handling transformations that require **batch processing** of the full input data, such as aggregations (e.g., counting unique users over all time). Their results are recomputed, fully or incrementally where DLT can determine an incremental plan, each time the pipeline updates.

**Code Example (SQL):**

CREATE OR REFRESH MATERIALIZED VIEW gold_summary AS
  SELECT count(*) AS total_rows FROM silver_processed

3. Data Quality with Expectations

Expectations are assertions about data quality defined directly in the table definition. DLT monitors and enforces these rules and reports quality metrics for every pipeline run.

Expectation Example & Enforcement Modes

@dlt.table(comment="Validated products")
@dlt.expect("valid_sku", "sku_id IS NOT NULL")    # warn: keep rows, log violations
@dlt.expect_or_drop("valid_price", "price > 0")   # drop: discard violating rows
def clean_products():
    # bronze_products is an illustrative upstream dataset in the same pipeline
    return dlt.read("bronze_products")
  • **FAIL** (@dlt.expect_or_fail): The pipeline update stops immediately if any record violates the expectation. (Highest strictness)
  • **DROP** (@dlt.expect_or_drop): Records that violate the expectation are dropped from the target, and the pipeline continues. (Standard approach for bad records)
  • **WARN** (@dlt.expect, the default): Violating records are kept, and the violation count is logged as a metric. (Lowest strictness, for monitoring)
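A minimal sketch combining all three modes on one table; the dataset name checked_orders, the upstream bronze_orders, and the column names are assumptions for illustration:

import dlt

@dlt.table(comment="Orders validated with all three expectation modes")
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")  # FAIL: stop the update if violated
@dlt.expect_or_drop("valid_amount", "amount > 0")              # DROP: discard violating records
@dlt.expect("recent_order", "order_date >= '2020-01-01'")      # WARN: keep records, log the metric
def checked_orders():
    # bronze_orders is an illustrative upstream dataset in the same pipeline
    return dlt.read("bronze_orders")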

DLT provides data engineers with a 'set and forget' approach to building production-grade data pipelines.

Databricks Fundamentals Series: Delta Live Tables