Data Modeling Patterns
TL;DR: Structuring data for performance, flexibility, and historical traceability.
1. Dimensional Modeling (Star Schema)
Pioneered by Ralph Kimball, Dimensional Modeling is the dominant approach for analytical data stores. It prioritizes **readability and query performance** for business intelligence.
The Star Schema
A simple, de-normalized structure consisting of a central **Fact Table** surrounded by multiple **Dimension Tables**.
Fact Tables
- Contain numerical, measurable data (metrics) like sales amount, quantity, or duration.
- Contain foreign keys pointing to dimension tables.
- Tend to be huge and constantly growing.
Dimension Tables
- Contain descriptive attributes (textual context) like product name, customer address, or date components.
- Are relatively small and change slowly compared to facts.
- Enable slicing and dicing of the fact data.
**Benefit:** Simple, single-hop joins (the fact table joins directly to each dimension), resulting in faster analytical queries.
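The fact/dimension split can be sketched in plain Python. This is a minimal illustration with hypothetical tables (`dim_product`, `fact_sales`), not a production pattern; in practice the join would be expressed in SQL against a warehouse.

```python
# Dimension table: small, descriptive attributes keyed by product_id.
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Electronics"},
}

# Fact table: numeric measures plus foreign keys into the dimensions.
fact_sales = [
    {"product_id": 1, "amount": 100.0},
    {"product_id": 2, "amount": 250.0},
    {"product_id": 1, "amount": 50.0},
]

def sales_by_category(facts, dim):
    """Aggregate a measure after a single join hop to the dimension."""
    totals = {}
    for row in facts:
        category = dim[row["product_id"]]["category"]  # the one join hop
        totals[category] = totals.get(category, 0.0) + row["amount"]
    return totals

print(sales_by_category(fact_sales, dim_product))
# {'Hardware': 150.0, 'Electronics': 250.0}
```

Note that "slicing and dicing" is just a matter of which dimension attribute the aggregation groups by.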
2. Slowly Changing Dimensions (SCDs)
SCDs are techniques used within Dimension Tables to handle changes in descriptive attributes over time, ensuring accurate historical reporting.
SCD Type 1 (Overwrite)
The simplest method: new data overwrites old data.
- **History:** No history is preserved.
- **Use Case:** Correcting errors or tracking metadata where history is irrelevant (e.g., correcting a misspelling).
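A Type 1 update is a plain in-place overwrite. A minimal sketch, assuming a hypothetical customer dimension held as a dict:

```python
# SCD Type 1: the new value replaces the old one; no history survives.
dim_customer = {101: {"name": "Jon Smith", "city": "Boston"}}

def scd1_update(dim, key, attribute, value):
    """Overwrite the attribute in place -- the old value is lost."""
    dim[key][attribute] = value

scd1_update(dim_customer, 101, "name", "John Smith")  # fix a misspelling
```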
SCD Type 2 (New Row)
The preferred method for tracking history: a new row is created whenever an attribute changes.
- **History:** Full history is preserved.
- **Mechanism:** Uses effective-dating columns (`start_date`, `end_date`) and/or a current flag (`is_current`) to identify the valid record for any given point in time.
- **Use Case:** Tracking customer address changes or product price modifications over time.
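The expire-and-insert mechanics can be sketched as follows. This is a simplified illustration with a hypothetical customer dimension; real implementations typically use a surrogate key per row and a SQL MERGE.

```python
from datetime import date

# SCD Type 2: each attribute change expires the current row and appends a new one.
dim_customer = [
    {"customer_id": 101, "city": "Boston",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def scd2_update(dim, customer_id, attribute, value, change_date):
    """Close the current row, then insert a new row carrying the new value."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["is_current"])
    current["end_date"] = change_date
    current["is_current"] = False
    new_row = {**current, attribute: value,
               "start_date": change_date, "end_date": None, "is_current": True}
    dim.append(new_row)

scd2_update(dim_customer, 101, "city", "Chicago", date(2023, 6, 1))
# Two rows now exist: the expired Boston row and the current Chicago row.
```

Point-in-time queries then filter on `start_date <= d < end_date` (treating a `None` end date as open-ended) rather than on `is_current`.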
SCD Type 3 (New Column)
Tracks only the current and one previous state using extra columns in the same row.
- **History:** Limited history (current + one prior value).
- **Use Case:** Tracking a status that frequently changes but only the most recent transition matters (e.g., previous department).
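Type 3 keeps the prior value in a sibling column, so repeated changes overwrite the "previous" slot. A minimal sketch with a hypothetical employee dimension:

```python
# SCD Type 3: one extra column holds only the immediately previous value.
dim_employee = {201: {"department": "Sales", "previous_department": None}}

def scd3_update(dim, key, new_department):
    """Shift the current value into the 'previous' column, then overwrite."""
    row = dim[key]
    row["previous_department"] = row["department"]
    row["department"] = new_department

scd3_update(dim_employee, 201, "Marketing")
scd3_update(dim_employee, 201, "Support")
# Only the last transition survives: Support (current) / Marketing (previous);
# the original Sales value has been lost.
```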
3. Data Vault Modeling
Data Vault, developed by Dan Linstedt, is a highly normalized, detail-oriented modeling method optimized for **auditability, flexibility, and integration** of data from multiple operational systems.
Data Vault Components
Data Vault separates data into three core component tables:
Hubs
- Represent unique **Business Keys** (Customer ID, Order ID).
- Contain only the key and load metadata.
- The 'who' or 'what'.
Links
- Model the **relationship** between two or more Hubs (e.g., Customer places Order).
- Only contain foreign keys to the connected Hubs.
- The 'how'.
Satellites
- Contain the descriptive attributes and history (the context).
- Attach to Hubs or Links.
- The 'when, where, and why'.
**Benefit:** Highly scalable ingest, easy integration of new sources, and excellent auditability because history is captured in every Satellite record.
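The three table types above can be sketched in plain Python. All keys, sources, and timestamps here are hypothetical; the point is the separation of concerns: Hubs hold business keys, Links hold only foreign keys between Hubs, and Satellites accumulate timestamped descriptive history.

```python
from datetime import datetime

# Hubs: one row per unique business key, plus load metadata only.
hub_customer = [{"customer_key": "C-1", "load_ts": datetime(2024, 1, 1), "source": "crm"}]
hub_order = [{"order_key": "O-9", "load_ts": datetime(2024, 1, 2), "source": "erp"}]

# Link: models the relationship (Customer places Order) via hub keys only.
link_customer_order = [
    {"customer_key": "C-1", "order_key": "O-9", "load_ts": datetime(2024, 1, 2)},
]

# Satellite: descriptive attributes; every change is a new timestamped record,
# which is why history is never lost on ingest.
sat_customer = [
    {"customer_key": "C-1", "city": "Boston", "load_ts": datetime(2024, 1, 1)},
    {"customer_key": "C-1", "city": "Chicago", "load_ts": datetime(2024, 3, 1)},
]

def current_attributes(satellite, key_name, key):
    """The latest satellite record for a business key is its current state."""
    rows = [r for r in satellite if r[key_name] == key]
    return max(rows, key=lambda r: r["load_ts"])

print(current_attributes(sat_customer, "customer_key", "C-1")["city"])  # Chicago
```

Adding a new source system only means appending to Hubs and adding new Satellite tables, which is the flexibility the comparison below refers to.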
4. Comparison: Where to Use Which?
| Feature | Star Schema (Dimensional) | Data Vault |
|---|---|---|
| Goal | Analytic Performance & BI Reporting | Historical Auditability & Source Integration |
| Normalization Level | De-normalized (optimized for reads) | Highly Normalized (optimized for writes/ingest) |
| Schema Change | Rigid; changes ripple through both fact and dimension tables. | Flexible; new attributes only require a new Satellite table. |
| Use Case | Data Marts, BI Dashboards, Standard Reports | Raw Data Layer, Enterprise Data Warehouse (EDW) |
**Conclusion:** In a modern data environment (like the Lakehouse), Data Vault is often used as the **integration layer** (the core EDW), and Star Schemas are created **on top of it** to serve specific business intelligence (BI) needs.
Data modeling dictates the speed, cost, and complexity of all downstream analytics.