Data Modeling Patterns
TL;DR: Structuring data for performance, flexibility, and historical traceability.
1. Dimensional Modeling (Star Schema)
Pioneered by Ralph Kimball, Dimensional Modeling is the dominant approach for analytical data stores. It prioritizes **readability and query performance** for business intelligence.
The Star Schema
A simple, de-normalized structure consisting of a central **Fact Table** surrounded by multiple **Dimension Tables**.
Fact Tables
- Contain numerical, measurable data (metrics) like sales amount, quantity, or duration.
- Contain foreign keys pointing to dimension tables.
- Tend to be huge and constantly growing.
Dimension Tables
- Contain descriptive attributes (textual context) like product name, customer address, or date components.
- Are relatively small and change slowly compared to facts.
- Enable slicing and dicing of the fact data.
**Benefit:** Simple, single-hop joins (the fact table joins directly to each dimension), resulting in faster analytical queries.
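The fact/dimension split can be sketched in plain Python. This is a minimal illustration with hypothetical tables (`dim_product`, `fact_sales`), not a production pattern; in practice the join would be expressed in SQL against a warehouse.

```python
# Dimension table: small, descriptive attributes keyed by product_id.
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Electronics"},
}

# Fact table: numeric measures plus foreign keys into the dimensions.
fact_sales = [
    {"product_id": 1, "amount": 100.0},
    {"product_id": 2, "amount": 250.0},
    {"product_id": 1, "amount": 50.0},
]

def sales_by_category(facts, dim):
    """Aggregate a measure after a single join hop to the dimension."""
    totals = {}
    for row in facts:
        category = dim[row["product_id"]]["category"]  # the one join hop
        totals[category] = totals.get(category, 0.0) + row["amount"]
    return totals

print(sales_by_category(fact_sales, dim_product))
# {'Hardware': 150.0, 'Electronics': 250.0}
```

Note that "slicing and dicing" is just a matter of which dimension attribute the aggregation groups by.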
2. Slowly Changing Dimensions (SCDs)
SCDs are techniques used within Dimension Tables to handle changes in descriptive attributes over time, ensuring accurate historical reporting.
SCD Type 1 (Overwrite)
The simplest method: new data overwrites old data.
- **History:** No history is preserved.
- **Use Case:** Correcting errors or tracking metadata where history is irrelevant (e.g., correcting a misspelling).
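A Type 1 update is a plain in-place overwrite. A minimal sketch, assuming a hypothetical customer dimension held as a dict:

```python
# SCD Type 1: the new value replaces the old one; no history survives.
dim_customer = {101: {"name": "Jon Smith", "city": "Boston"}}

def scd1_update(dim, key, attribute, value):
    """Overwrite the attribute in place -- the old value is lost."""
    dim[key][attribute] = value

scd1_update(dim_customer, 101, "name", "John Smith")  # fix a misspelling
```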
SCD Type 2 (New Row)
The preferred method for tracking history: a new row is created whenever an attribute changes.
- **History:** Full history is preserved.
- **Mechanism:** Uses effective-dating columns (`start_date`, `end_date`) and/or a current flag (`is_current`) to identify the valid record for any given point in time.
- **Use Case:** Tracking customer address changes or product price modifications over time.
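The expire-and-insert mechanics can be sketched as follows. This is a simplified illustration with a hypothetical customer dimension; real implementations typically use a surrogate key per row and a SQL MERGE.

```python
from datetime import date

# SCD Type 2: each attribute change expires the current row and appends a new one.
dim_customer = [
    {"customer_id": 101, "city": "Boston",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def scd2_update(dim, customer_id, attribute, value, change_date):
    """Close the current row, then insert a new row carrying the new value."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["is_current"])
    current["end_date"] = change_date
    current["is_current"] = False
    new_row = {**current, attribute: value,
               "start_date": change_date, "end_date": None, "is_current": True}
    dim.append(new_row)

scd2_update(dim_customer, 101, "city", "Chicago", date(2023, 6, 1))
# Two rows now exist: the expired Boston row and the current Chicago row.
```

Point-in-time queries then filter on `start_date <= d < end_date` (treating a `None` end date as open-ended) rather than on `is_current`.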
SCD Type 3 (New Column)
Tracks only the current and one previous state using extra columns in the same row.
- **History:** Limited history (current + one prior value).
- **Use Case:** Tracking a status that frequently changes but only the most recent transition matters (e.g., previous department).
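Type 3 keeps the prior value in a sibling column, so repeated changes overwrite the "previous" slot. A minimal sketch with a hypothetical employee dimension:

```python
# SCD Type 3: one extra column holds only the immediately previous value.
dim_employee = {201: {"department": "Sales", "previous_department": None}}

def scd3_update(dim, key, new_department):
    """Shift the current value into the 'previous' column, then overwrite."""
    row = dim[key]
    row["previous_department"] = row["department"]
    row["department"] = new_department

scd3_update(dim_employee, 201, "Marketing")
scd3_update(dim_employee, 201, "Support")
# Only the last transition survives: Support (current) / Marketing (previous);
# the original Sales value has been lost.
```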
3. Data Vault Modeling
Data Vault, developed by Dan Linstedt, is a highly normalized, detail-oriented modeling method optimized for **auditability, flexibility, and integration** of data from multiple operational systems.
Data Vault Components
Data Vault separates data into three core component tables:
Hubs
- Represent unique **Business Keys** (Customer ID, Order ID).
- Contain only the key and load metadata.
- The 'who' or 'what'.
Links
- Model the **relationship** between two or more Hubs (e.g., Customer places Order).
- Only contain foreign keys to the connected Hubs.
- The 'how'.
Satellites
- Contain the descriptive attributes and history (the context).
- Attach to Hubs or Links.
- The 'when, where, and why'.
**Benefit:** Highly scalable ingest, easy integration of new sources, and excellent auditability because history is captured in every Satellite record.
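The three table types above can be sketched in plain Python. All keys, sources, and timestamps here are hypothetical; the point is the separation of concerns: Hubs hold business keys, Links hold only foreign keys between Hubs, and Satellites accumulate timestamped descriptive history.

```python
from datetime import datetime

# Hubs: one row per unique business key, plus load metadata only.
hub_customer = [{"customer_key": "C-1", "load_ts": datetime(2024, 1, 1), "source": "crm"}]
hub_order = [{"order_key": "O-9", "load_ts": datetime(2024, 1, 2), "source": "erp"}]

# Link: models the relationship (Customer places Order) via hub keys only.
link_customer_order = [
    {"customer_key": "C-1", "order_key": "O-9", "load_ts": datetime(2024, 1, 2)},
]

# Satellite: descriptive attributes; every change is a new timestamped record,
# which is why history is never lost on ingest.
sat_customer = [
    {"customer_key": "C-1", "city": "Boston", "load_ts": datetime(2024, 1, 1)},
    {"customer_key": "C-1", "city": "Chicago", "load_ts": datetime(2024, 3, 1)},
]

def current_attributes(satellite, key_name, key):
    """The latest satellite record for a business key is its current state."""
    rows = [r for r in satellite if r[key_name] == key]
    return max(rows, key=lambda r: r["load_ts"])

print(current_attributes(sat_customer, "customer_key", "C-1")["city"])  # Chicago
```

Adding a new source system only means appending to Hubs and adding new Satellite tables, which is the flexibility the comparison below refers to.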
4. Comparison: Where to Use Which?
| Feature | Star Schema (Dimensional) | Data Vault |
|---|---|---|
| Goal | Analytic Performance & BI Reporting | Historical Auditability & Source Integration |
| Normalization Level | De-normalized (optimized for reads) | Highly Normalized (optimized for writes/ingest) |
| Schema Change | Rigid; changes ripple through both fact and dimension tables. | Flexible; new attributes only require a new Satellite table. |
| Use Case | Data Marts, BI Dashboards, Standard Reports | Raw Data Layer, Enterprise Data Warehouse (EDW) |
**Conclusion:** In a modern data environment (like the Lakehouse), Data Vault is often used as the **integration layer** (the core EDW), and Star Schemas are created **on top of it** to serve specific business intelligence (BI) needs.
Data modeling dictates the speed, cost, and complexity of all downstream analytics.