Databricks TL;DR: Delta Live Tables Maintenance

DLT Pipeline Maintenance

/tldr: Managing Delta Live Tables (DLT) updates, retention policies, and data quality expectations.


1. Pipeline Updates: Refresh vs. Full

A DLT pipeline run is called an "Update." DLT automatically manages incremental processing, ensuring only new data is read and processed in most runs. However, there are two types of pipeline updates:

Standard Refresh (Default)

This performs an **incremental update**. It reads all available new data from configured sources since the last successful update. For most production environments (especially streaming), this is the optimal and most cost-effective method.

  • Action: Efficiently processes new files/records.
  • Use Case: Scheduled jobs, continuously running streams.
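As a minimal sketch of incremental ingestion (the table name, landing path, and format below are illustrative assumptions, not part of any specific pipeline), a streaming table only picks up files that arrived since the last update:

```sql
-- Minimal sketch: a streaming table processed incrementally on each update.
-- The table name, landing path, and format are assumptions for illustration.
CREATE OR REFRESH STREAMING TABLE raw_orders
AS SELECT *
FROM STREAM read_files(
  '/Volumes/main/landing/orders/',  -- hypothetical ingestion path
  format => 'json'
);
```

On a standard refresh, only files the stream has not yet seen are read; a full refresh would reprocess the whole directory and rebuild `raw_orders` from scratch.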

Full Refresh (Full Update)

This reprocesses *all* data from the configured sources, recreating the output tables entirely. This is essential for schema changes, logic fixes, or historical backfills.

  • Action: Clears the output tables' data and state, then re-runs the entire ETL/ELT logic against all source data.
  • Use Case: Applying a bug fix to historical data, recovering from a major failure, or schema migration.

**Cost Impact:** Full refreshes consume significantly more DBUs and compute because they reprocess the entire dataset. Use them only when necessary.

2. Data Quality: Expectations and Alerting

DLT's major selling point is declarative Data Quality using **Expectations**. An expectation is a data quality rule applied to every record written to a dataset (e.g., "column `user_id` must be NOT NULL").

Expectation Failure Actions

1. WARN (Default)

The least strict action, and the default when no `ON VIOLATION` clause is specified. The violation is recorded as a metric in the DLT event log, but the data still passes through the pipeline.
`CONSTRAINT valid_id EXPECT (user_id IS NOT NULL)`

2. DROP ROW

Violating records are dropped from the target dataset, and the drop counts are recorded in the event log. This prevents bad data from corrupting downstream tables. (DLT does not quarantine dropped rows automatically; keeping them requires a separate table with inverted expectations.)
`CONSTRAINT valid_date EXPECT (ingestion_date IS NOT NULL) ON VIOLATION DROP ROW`

3. FAIL UPDATE (Halt)

The strictest action. If any record violates the expectation, the entire pipeline update fails immediately and the transaction is rolled back. This is suitable for mission-critical Silver/Gold tables. Note that expectations are row-level boolean expressions, so aggregate checks such as uniqueness need separate validation logic.
`CONSTRAINT valid_key EXPECT (unique_key IS NOT NULL) ON VIOLATION FAIL UPDATE`
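Putting the three actions together, here is a hedged sketch of how the constraints attach to a table definition (the table, source, and column names are illustrative assumptions):

```sql
-- Sketch: all three violation behaviors on one streaming table.
-- Table, source, and column names are assumptions for illustration.
CREATE OR REFRESH STREAMING TABLE orders_silver (
  CONSTRAINT valid_id   EXPECT (user_id IS NOT NULL),                               -- warn: record metric, keep row
  CONSTRAINT valid_date EXPECT (ingestion_date IS NOT NULL) ON VIOLATION DROP ROW,  -- drop offending rows
  CONSTRAINT valid_key  EXPECT (unique_key IS NOT NULL) ON VIOLATION FAIL UPDATE    -- halt the whole update
)
AS SELECT * FROM STREAM(LIVE.raw_orders);
```

Warn-level constraints (no `ON VIOLATION` clause) keep every row; the pass/fail counts for all three constraints surface as metrics in the event log.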

3. Pipeline Retention and Housekeeping

Pipeline Checkpoints & Metadata

DLT maintains checkpoints and logs in the storage location defined for the pipeline. This metadata includes the state of the streaming queries and is critical for recovery and incremental processing.

  • **Action:** Do not manually delete the checkpoint or log folders; this will break incremental processing.
  • **Cost:** Metadata storage contributes minimally to overall cloud costs but is vital for integrity.
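Rather than browsing the storage location, pipeline health and expectation metrics can be inspected by querying the event log. A minimal sketch, assuming the `event_log` table-valued function is available in your workspace and the pipeline publishes a table named `main.sales.orders_silver` (an illustrative name):

```sql
-- Sketch: inspect recent pipeline events (including data quality metrics)
-- via the event log instead of touching checkpoint/log folders directly.
SELECT timestamp, event_type, message
FROM event_log(TABLE(main.sales.orders_silver))
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
LIMIT 50;
```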

Table `VACUUM` Policy

DLT tables are Delta tables. You must apply standard Delta Lake retention best practices. The `VACUUM` command physically removes old, non-referenced data files to save storage cost.

  • **Rule:** Keep the `VACUUM` retention threshold (and `delta.deletedFileRetentionDuration`) at least as long as the furthest back you need Time Travel to reach; `delta.logRetentionDuration` controls how long transaction-log history is kept.
  • **Tip:** DLT schedules maintenance (including `VACUUM`) on its tables automatically, but running `VACUUM` manually on the resulting Delta tables (not the pipeline itself) can still be useful for cost control; see the sketch below.
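A hedged sketch of retention housekeeping on a single output table (the table name and retention windows are illustrative assumptions, not recommendations):

```sql
-- Sketch: retention housekeeping on one DLT output table.
-- Table name and retention windows are assumptions for illustration.
ALTER TABLE main.sales.orders_silver SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',         -- transaction-log history kept for Time Travel
  'delta.deletedFileRetentionDuration' = 'interval 7 days'   -- how long removed data files survive VACUUM
);

VACUUM main.sales.orders_silver RETAIN 168 HOURS;  -- 7 days; should not undercut the retention above
```

Shortening these windows reduces storage cost but limits how far back Time Travel and table restores can reach.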

DLT shifts the focus from managing infrastructure to enforcing data quality via declarative expectations.

Databricks Fundamentals Series: DLT Maintenance