Data Quality & Testing
/tldr: Defining, measuring, and enforcing trust in data pipelines.
1. The 5 Core DQ Dimensions
Data quality is defined across five core dimensions. A good data engineering team builds checks for all of them.
1. Completeness
Is all the expected data present? (No NULL values, no missing records.)
Test Example: Ensure that “customer_id” is never NULL and that the daily file has a record count > 10,000.
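A minimal sketch of this check in Python with pandas; the DataFrame and the 10,000-row threshold stand in for the real daily extract:

```python
import pandas as pd

def check_completeness(df: pd.DataFrame) -> None:
    # customer_id must never be NULL
    null_ids = df["customer_id"].isna().sum()
    assert null_ids == 0, f"{null_ids} rows are missing customer_id"
    # the daily file is expected to carry more than 10,000 records
    assert len(df) > 10_000, f"expected > 10,000 rows, got {len(df)}"
```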
2. Consistency
Is the data uniform across all systems and following defined rules?
Test Example: Ensure that a record's “purchase_date” is never earlier than its “account_creation_date”.
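A sketch of the same rule as a row-level comparison, assuming both columns are datetime types in one DataFrame:

```python
import pandas as pd

def check_consistency(df: pd.DataFrame) -> None:
    # a purchase can never predate the account that made it
    violations = df[df["purchase_date"] < df["account_creation_date"]]
    assert violations.empty, f"{len(violations)} purchases predate account creation"
```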
3. Validity
Does the data conform to the syntax, domain, or format of its definition?
Test Example: Ensure that “email_address” matches a regex pattern (contains an @ symbol) and “country_code” is only one of the accepted ISO 3166-1 alpha-2 codes.
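A sketch of both validity rules; the regex is deliberately loose (it only enforces an @ with text on either side, as described above) and the country whitelist is a hypothetical subset of the full ISO 3166-1 list:

```python
import pandas as pd

EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+$"                  # loose: just requires an @
VALID_COUNTRY_CODES = {"US", "GB", "DE", "FR", "JP"}  # illustrative subset

def check_validity(df: pd.DataFrame) -> None:
    bad_emails = ~df["email_address"].str.match(EMAIL_PATTERN, na=False)
    assert not bad_emails.any(), f"{bad_emails.sum()} invalid email addresses"
    bad_codes = ~df["country_code"].isin(VALID_COUNTRY_CODES)
    assert not bad_codes.any(), f"{bad_codes.sum()} invalid country codes"
```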
4. Accuracy
Does the data correctly reflect the real-world value it is intended to represent? (Hardest to measure).
Test Example: Compare the total sum of “daily_sales_amount” in the lakehouse table against the reported total from the source system's API/dashboard.
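A reconciliation sketch; how the source total is fetched depends on the source system, so both totals are plain inputs here, and the 0.1% tolerance is an assumed allowance for rounding:

```python
import math

def check_accuracy(lakehouse_total: float, source_total: float,
                   rel_tolerance: float = 0.001) -> None:
    # totals should agree within 0.1% to absorb rounding differences
    assert math.isclose(lakehouse_total, source_total, rel_tol=rel_tolerance), \
        f"lakehouse total {lakehouse_total} != source total {source_total}"
```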
5. Timeliness
Is the data available when expected (latency) and current enough for the task?
Test Example: Ensure the pipeline run time is less than 5 minutes and that the maximum “event_timestamp” in the table is no more than 1 hour behind the current time.
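A sketch of the freshness half of this check, assuming “event_timestamp” is a timezone-aware UTC column; the run-time check typically lives in the orchestrator instead:

```python
from datetime import datetime, timedelta, timezone
import pandas as pd

def check_timeliness(df: pd.DataFrame,
                     max_lag: timedelta = timedelta(hours=1)) -> None:
    # the newest event must be no more than max_lag behind now
    latest = df["event_timestamp"].max()
    lag = datetime.now(timezone.utc) - latest
    assert lag <= max_lag, f"data is {lag} behind; allowed lag is {max_lag}"
```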
2. Core Data Testing Types
Testing is the active process of validating data quality by applying specific checks at different points in the pipeline lifecycle.
Unit Tests
Focus on the smallest transformation functions (e.g., clean_name('john smith ') returns 'John Smith'). Ensures individual logic blocks are bug-free.
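A pytest-style sketch; this clean_name implementation is an assumption standing in for the real function:

```python
def clean_name(raw: str) -> str:
    # assumed behavior: trim whitespace, then title-case
    return raw.strip().title()

def test_clean_name_trims_and_title_cases():
    assert clean_name("john smith ") == "John Smith"

def test_clean_name_leaves_clean_input_unchanged():
    assert clean_name("John Smith") == "John Smith"
```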
Integration Tests (Functional)
Verify the entire pipeline or job executes correctly by comparing input data against the expected output data set. Checks the flow and integration of components.
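A sketch of the pattern: run the whole job on a small fixed input and compare against a hand-verified expected output. run_pipeline here is a stand-in for the real entry point:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    # stand-in for the real job: here it just casts amount to float
    return df.assign(amount=df["amount"].astype(float))

def test_pipeline_end_to_end():
    input_df = pd.DataFrame({"customer_id": [1, 2], "amount": ["10", "20"]})
    expected = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
    assert_frame_equal(run_pipeline(input_df), expected)
```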
Acceptance Tests (Data Contract)
Define checks on the final, consumed table to ensure it meets the downstream user's explicit requirements (the Data Contract). Checks completeness, freshness, and final schema.
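A sketch of a contract check on the consumed table; the expected schema is an illustrative stand-in for the real agreement with downstream users:

```python
import pandas as pd

EXPECTED_SCHEMA = {"customer_id": "int64", "daily_sales_amount": "float64"}

def check_contract(df: pd.DataFrame) -> None:
    # schema must match the contract exactly
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    assert actual == EXPECTED_SCHEMA, f"schema drift detected: {actual}"
    # contract: customer_id is NOT NULL
    assert not df["customer_id"].isna().any(), "NULL customer_id violates contract"
```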
3. Tools and Enforcement
Modern tools focus on declarative testing and automated monitoring.
Declarative Testing (dbt, Great Expectations)
Instead of writing complex test code, you declare what the data should look like.
Example (dbt): In a schema YAML file, declare tests such as “unique”, “not_null”, and “relationships” between tables, as sketched below.
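A sketch of such a schema file; the model and column names are illustrative:

```yaml
# models/schema.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```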
Data Observability (Monte Carlo, Datafold)
Uses **machine learning** to monitor data automatically and detect anomalies (unexpected spikes, drops, or schema changes) without requiring explicit test configuration for every scenario.
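A toy illustration of the underlying idea (commercial tools learn thresholds automatically and monitor far more metrics): flag today's row count if it sits more than three standard deviations from the recent mean.

```python
import statistics

def is_anomalous(history: list[int], today: int,
                 z_threshold: float = 3.0) -> bool:
    # compare today's value against the recent distribution
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(today - mean) / stdev > z_threshold
```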
Shift Left Testing
The principle of moving testing **earlier** in the pipeline (closer to the source). Implementing DQ checks in the Bronze/Silver layers ensures bad data is isolated immediately, preventing downstream corruption.
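A sketch of the pattern at the Bronze layer: validate on ingestion and quarantine failing rows instead of letting them flow downstream (column and layer names are illustrative):

```python
import pandas as pd

def ingest_bronze(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    bad = df["customer_id"].isna()
    quarantine = df[bad]   # routed to a review table, never promoted
    clean = df[~bad]       # continues to the Silver layer
    return clean, quarantine
```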
Data Quality is not a feature; it's a prerequisite for any valuable data product.