Data Quality & Testing
/tldr: Defining, measuring, and enforcing trust in data pipelines.
1. The 5 Core DQ Dimensions
Data quality is defined across five core dimensions. A good data engineering team builds checks for all of them.
1. Completeness
Is all the expected data present? (No NULL values, no missing records.)
Test Example: Ensure that “customer_id” is never NULL and that the daily file has a record count > 10,000.
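A minimal sketch of this check in Python with pandas; the DataFrame and the 10,000-row threshold stand in for the real daily extract:

```python
import pandas as pd

def check_completeness(df: pd.DataFrame) -> None:
    # customer_id must never be NULL
    null_ids = df["customer_id"].isna().sum()
    assert null_ids == 0, f"{null_ids} rows are missing customer_id"
    # the daily file is expected to carry more than 10,000 records
    assert len(df) > 10_000, f"expected > 10,000 rows, got {len(df)}"
```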
2. Consistency
Is the data uniform across all systems and following defined rules?
Test Example: Ensure that a record's “purchase_date” is never earlier than its “account_creation_date”.
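A sketch of the same rule as a row-level comparison, assuming both columns are datetime types in one DataFrame:

```python
import pandas as pd

def check_consistency(df: pd.DataFrame) -> None:
    # a purchase can never predate the account that made it
    violations = df[df["purchase_date"] < df["account_creation_date"]]
    assert violations.empty, f"{len(violations)} purchases predate account creation"
```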
3. Validity
Does the data conform to the syntax, domain, or format of its definition?
Test Example: Ensure that “email_address” matches a regex pattern (contains an @ symbol) and “country_code” is only one of the accepted ISO 3166-1 alpha-2 codes.
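A sketch of both validity rules; the regex is deliberately loose (it only enforces an @ with text on either side, as described above) and the country whitelist is a hypothetical subset of the full ISO 3166-1 list:

```python
import pandas as pd

EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+$"                  # loose: just requires an @
VALID_COUNTRY_CODES = {"US", "GB", "DE", "FR", "JP"}  # illustrative subset

def check_validity(df: pd.DataFrame) -> None:
    bad_emails = ~df["email_address"].str.match(EMAIL_PATTERN, na=False)
    assert not bad_emails.any(), f"{bad_emails.sum()} invalid email addresses"
    bad_codes = ~df["country_code"].isin(VALID_COUNTRY_CODES)
    assert not bad_codes.any(), f"{bad_codes.sum()} invalid country codes"
```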
4. Accuracy
Does the data correctly reflect the real-world value it is intended to represent? (Hardest to measure).
Test Example: Compare the total sum of “daily_sales_amount” in the lakehouse table against the reported total from the source system's API/dashboard.
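A reconciliation sketch; how the source total is fetched depends on the source system, so both totals are plain inputs here, and the 0.1% tolerance is an assumed allowance for rounding:

```python
import math

def check_accuracy(lakehouse_total: float, source_total: float,
                   rel_tolerance: float = 0.001) -> None:
    # totals should agree within 0.1% to absorb rounding differences
    assert math.isclose(lakehouse_total, source_total, rel_tol=rel_tolerance), \
        f"lakehouse total {lakehouse_total} != source total {source_total}"
```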
5. Timeliness
Is the data available when expected (latency) and current enough for the task?
Test Example: Ensure the pipeline run time is less than 5 minutes and that the maximum “event_timestamp” in the table is no more than 1 hour behind the current time.
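A sketch of the freshness half of this check, assuming “event_timestamp” is a timezone-aware UTC column; the run-time check typically lives in the orchestrator instead:

```python
from datetime import datetime, timedelta, timezone
import pandas as pd

def check_timeliness(df: pd.DataFrame,
                     max_lag: timedelta = timedelta(hours=1)) -> None:
    # the newest event must be no more than max_lag behind now
    latest = df["event_timestamp"].max()
    lag = datetime.now(timezone.utc) - latest
    assert lag <= max_lag, f"data is {lag} behind; allowed lag is {max_lag}"
```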
2. Core Data Testing Types
Testing is the active process of validating data quality by applying specific checks at different points in the pipeline lifecycle.
Unit Tests
Focus on the smallest transformation functions (e.g., clean_name('john smith ') returns 'John Smith'). Ensures individual logic blocks are bug-free.
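A pytest-style sketch; this clean_name implementation is an assumption standing in for the real function:

```python
def clean_name(raw: str) -> str:
    # assumed behavior: trim whitespace, then title-case
    return raw.strip().title()

def test_clean_name_trims_and_title_cases():
    assert clean_name("john smith ") == "John Smith"

def test_clean_name_leaves_clean_input_unchanged():
    assert clean_name("John Smith") == "John Smith"
```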
Integration Tests (Functional)
Verify the entire pipeline or job executes correctly by comparing input data against the expected output data set. Checks the flow and integration of components.
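A sketch of the pattern: run the whole job on a small fixed input and compare against a hand-verified expected output. run_pipeline here is a stand-in for the real entry point:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for the real job: here it just casts amount to float
    return df.assign(amount=df["amount"].astype(float))

def test_pipeline_end_to_end():
    input_df = pd.DataFrame({"customer_id": [1, 2], "amount": ["10", "20"]})
    expected = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
    assert_frame_equal(run_pipeline(input_df), expected)
```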
Acceptance Tests (Data Contract)
Define checks on the final, consumed table to ensure it meets the downstream user's explicit requirements (the Data Contract). Checks completeness, freshness, and final schema.
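A sketch of a contract check on the consumed table; the expected schema is an illustrative stand-in for the real agreement with downstream users:

```python
import pandas as pd

EXPECTED_SCHEMA = {"customer_id": "int64", "daily_sales_amount": "float64"}

def check_contract(df: pd.DataFrame) -> None:
    # schema must match the contract exactly
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    assert actual == EXPECTED_SCHEMA, f"schema drift detected: {actual}"
    # contract: customer_id is NOT NULL
    assert not df["customer_id"].isna().any(), "NULL customer_id violates contract"
```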
3. Tools and Enforcement
Modern tools focus on declarative testing and automated monitoring.
Declarative Testing (dbt, Great Expectations)
Instead of writing complex test code, you declare what the data should look like.
Example (dbt): In a schema YAML file, declare tests such as “unique”, “not_null”, and “relationships” between tables, as sketched below.
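A sketch of such a schema file; the model and column names are illustrative:

```yaml
# models/schema.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```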
Data Observability (Monte Carlo, Datafold)
Uses **machine learning** to monitor data automatically and detect anomalies (unexpected spikes, drops, or schema changes) without requiring explicit test configuration for every scenario.
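A toy illustration of the underlying idea (commercial tools learn thresholds automatically and monitor far more metrics): flag today's row count if it sits more than three standard deviations from the recent mean.

```python
import statistics

def is_anomalous(history: list[int], today: int,
                 z_threshold: float = 3.0) -> bool:
    # compare today's value against the recent distribution
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(today - mean) / stdev > z_threshold
```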
Shift Left Testing
The principle of moving testing **earlier** in the pipeline (closer to the source). Implementing DQ checks in the Bronze/Silver layers ensures bad data is isolated immediately, preventing downstream corruption.
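A sketch of the pattern at the Bronze layer: validate on ingestion and quarantine failing rows instead of letting them flow downstream (column and layer names are illustrative):

```python
import pandas as pd

def ingest_bronze(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    bad = df["customer_id"].isna()
    quarantine = df[bad]   # routed to a review table, never promoted
    clean = df[~bad]       # continues to the Silver layer
    return clean, quarantine
```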
Data Quality is not a feature; it's a prerequisite for any valuable data product.