Data Lineage
tl;dr: The historical map of data flow: tracing data from source to consumption.
1. Core Concept & Directions
Data Lineage is the documentation of the data lifecycle, including where it originated, where it went, and what transformations were applied to it at each hop. It is the "who, what, when, and how" for every dataset.
Forward Lineage
Tracks data flow from **source to consumers**. Used for **impact analysis**.
*If I change column A in the staging table, which downstream dashboards will break?*
Reverse Lineage (also called Backward Lineage)
Tracks data flow from **consumers back to the source**. Used for **auditing and root cause analysis**.
*This final report has an incorrect total. Where in the Bronze or Silver layers did the bad data originate?*
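Both directions are just traversals of the same lineage graph, walked along opposite edge orientations. A minimal sketch (the table names are illustrative, not from any real catalog):

```python
from collections import defaultdict, deque

# Toy lineage graph: each edge points from a source table to a downstream consumer.
EDGES = [
    ("bronze.tx_raw", "silver.tx_clean"),
    ("silver.tx_clean", "gold.daily_sales"),
    ("gold.daily_sales", "dashboard.revenue"),
]

downstream = defaultdict(set)  # forward lineage
upstream = defaultdict(set)    # reverse lineage
for src, dst in EDGES:
    downstream[src].add(dst)
    upstream[dst].add(src)

def walk(start, graph):
    """Breadth-first traversal: every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Forward lineage (impact analysis): what breaks if bronze.tx_raw changes?
impacted = walk("bronze.tx_raw", downstream)

# Reverse lineage (root cause): where could bad data in the dashboard originate?
suspects = walk("dashboard.revenue", upstream)
```

The same structure scales to column-level lineage by using `(table, column)` tuples as graph nodes.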
2. Unity Catalog (Databricks)
Unity Catalog is the centralized governance layer for the Databricks Lakehouse. Crucially, it automatically captures lineage for operations performed across different compute engines.
Unity Catalog's Role
- **Automatic Capture:** Lineage is captured automatically for any table-to-table operation executed via Databricks (e.g., Spark, DLT, SQL queries).
- **Granularity:** Tracks lineage at the **column level** (e.g., `Silver.sale_amount` comes from `Bronze.tx_value`).
- **Central Metadata:** Stores all lineage information alongside access permissions and audit logs in a single metastore.
- **Access:** This metadata is exposed via the Unity Catalog UI and REST APIs for external consumption.
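A hedged sketch of pulling table lineage over REST: the endpoint path and query parameters below follow the Databricks lineage-tracking API as publicly documented, but verify them against your workspace's API reference before use; the workspace URL and table name are placeholders.

```python
import json
import urllib.request

def lineage_url(workspace: str, full_table_name: str) -> str:
    """Build the table-lineage request URL (assumed endpoint shape)."""
    return (f"{workspace}/api/2.0/lineage-tracking/table-lineage"
            f"?table_name={full_table_name}&include_entity_lineage=true")

def fetch_table_lineage(workspace: str, token: str, full_table_name: str) -> dict:
    """GET the upstream/downstream lineage for a Unity Catalog table."""
    req = urllib.request.Request(
        lineage_url(workspace, full_table_name),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call (placeholders, not executed here):
# lineage = fetch_table_lineage(
#     "https://my-workspace.cloud.databricks.com", "dapi...", "main.silver.sales")
# print(lineage.get("upstreams"), lineage.get("downstreams"))
```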
3. OpenLineage (Interoperability)
OpenLineage is an open standard designed to enable data tools and platforms to exchange lineage metadata consistently.
Key Features
- **Standardized Format:** Defines a universal format (JSON) for capturing lineage events, including the input and output datasets and the transformation run.
- **Interoperability:** Allows lineage gathered by different tools (e.g., Airflow, Spark, Flink) to be integrated and viewed cohesively in a single metadata catalog.
- **Extensibility:** Uses "facets" (metadata properties) to add custom information about the data quality or job environment.
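A minimal OpenLineage run event, hand-built here for illustration (in practice the `openlineage-python` client provides typed classes for this; the job and dataset names are invented, and the spec version in `schemaURL` should match whatever your backend expects):

```python
import json
import uuid
from datetime import datetime, timezone

# A COMPLETE event: one job run that read one dataset and wrote another.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",  # identifies the emitting tool
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my_pipeline", "name": "load_silver_sales"},
    "inputs": [{"namespace": "lakehouse", "name": "bronze.tx_raw"}],
    "outputs": [{"namespace": "lakehouse", "name": "silver.tx_clean"}],
}

# This JSON payload would be POSTed to a lineage backend such as Marquez.
payload = json.dumps(event, indent=2)
```

Facets would be attached as nested objects under `run`, `job`, or the individual datasets, e.g. a schema facet describing a dataset's columns.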
4. Lineage for Data Trust (Monte Carlo)
Lineage is an indispensable component of **Data Observability**, providing the context needed for anomaly detection and rapid incident response.
Anomaly Detection
Tools like **Monte Carlo** use machine learning to detect unexpected changes in data (e.g., a column suddenly goes 50% NULL).
When an anomaly is found in a final report, the lineage graph is instantly queried to find the upstream source that caused the issue, providing immediate root cause analysis.
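The detection side can be sketched very simply: compare today's metric against a historical baseline and flag large deviations. The thresholds and data below are illustrative, not how any particular vendor implements it.

```python
def null_rate(values):
    """Fraction of NULL (None) values in a column sample."""
    return sum(v is None for v in values) / len(values)

def is_anomalous(current, history, tolerance=0.10):
    """Flag if the current NULL rate exceeds the historical mean by > tolerance."""
    baseline = sum(history) / len(history)
    return current - baseline > tolerance

history = [0.01, 0.02, 0.01, 0.02]           # past daily NULL rates (~1-2%)
today = null_rate([None] * 50 + [1.0] * 50)  # column suddenly 50% NULL

assert is_anomalous(today, history)  # 0.50 vs. ~0.015 baseline -> anomaly
```

Production tools learn the baseline and tolerance from the data itself rather than hard-coding them, then pivot from the alert into the lineage graph.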
Impact Analysis & Incident Response
When data is deemed faulty, lineage helps Data Engineers quickly identify:
- **Who to Alert:** Which downstream consumers (users, dashboards, models) are impacted.
- **What to Re-run:** Which upstream jobs need to be fixed and re-executed to correct the data (Reverse Lineage).
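Once the affected upstream jobs are identified, they must be re-executed in dependency order. A sketch using the standard library's topological sort (the job graph is illustrative):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# job -> set of jobs it depends on (i.e., reverse lineage at the job level)
deps = {
    "gold.daily_sales": {"silver.tx_clean"},
    "silver.tx_clean": {"bronze.tx_raw"},
    "bronze.tx_raw": set(),
}

# static_order() yields jobs so that every job comes after its dependencies:
# bronze.tx_raw first, then silver.tx_clean, then gold.daily_sales.
rerun_order = list(TopologicalSorter(deps).static_order())
```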
Lineage transforms your data pipelines from black boxes into traceable, auditable systems.