Monitoring & Alerting for Data Platforms

TL;DR: Defining data contracts, tracking system health, and prioritizing alerts to ensure data quality and system reliability.

Tags: System Metrics · Data Quality · SLAs & RPO · Alert Triage

1. The Pillars of Data Platform Observability

A robust monitoring strategy covers system health, application behavior, and data outcomes.

Metrics (The "What")

Numeric time-series data used for dashboarding and alerting (e.g., CPU utilization, job duration, queue size).

Logs (The "Why")

Detailed, often unstructured text events used for debugging and forensic analysis (e.g., Spark driver exceptions, network errors).

Traces (The "How")

Records the end-to-end path of a request or operation across multiple services (less common in pure ETL but useful for microservices).
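
A minimal sketch of how a single pipeline task might emit all three signal types; the `emit_*` helpers below are hypothetical stand-ins for a real metrics store, log aggregator, and tracing backend, not any specific library's API:

```python
import json
import time
import uuid

def emit_metric(name, value, tags):
    """Metrics: numeric time-series points, cheap to aggregate and alert on."""
    print(json.dumps({"type": "metric", "name": name, "value": value,
                      "tags": tags, "ts": time.time()}))

def emit_log(level, message, **context):
    """Logs: detailed events with free-form context, used for debugging."""
    print(json.dumps({"type": "log", "level": level, "message": message,
                      "context": context, "ts": time.time()}))

def emit_span(trace_id, operation, duration_s):
    """Traces: one hop of an end-to-end request or operation path."""
    print(json.dumps({"type": "span", "trace_id": trace_id,
                      "operation": operation, "duration_s": duration_s}))

trace_id = str(uuid.uuid4())
start = time.time()
emit_log("INFO", "ingest job started", table="orders")
emit_metric("ingest.rows_written", 125_000, {"table": "orders"})
emit_span(trace_id, "ingest.orders", time.time() - start)
```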

2. Monitoring Platform Health (Compute & Cost)

Focus on resource utilization and cost deviation to maintain efficiency and stability; the sketches after each list below show how these thresholds might be evaluated.

Compute Resource Alerts

  • **High CPU/Memory:** Sustained CPU or memory use above 90% indicates under-sizing or inefficient code (risk of job failure).
  • **Low Utilization:** Sustained CPU use below 20% on a static cluster indicates over-sizing (wasted cost).
  • **Scaling Failure:** Failure of the autoscaling group to procure new nodes (critical for elasticity).
  • **Driver Health:** Failure of the Spark driver/head node.
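
A minimal sketch of how the utilization rules above might be evaluated against recent CPU samples; the window length, thresholds, and sampling cadence are illustrative assumptions:

```python
HIGH_CPU_PCT = 90.0  # sustained use above this suggests under-sizing
LOW_CPU_PCT = 20.0   # sustained use below this suggests over-sizing

def evaluate_cpu(samples_pct, window=12):
    """Return an alert name when every sample in the trailing window
    (e.g., 12 one-minute samples) breaches a threshold, else None."""
    if len(samples_pct) < window:
        return None  # not enough history to call the breach "sustained"
    recent = samples_pct[-window:]
    if min(recent) >= HIGH_CPU_PCT:
        return "HIGH_CPU"         # risk of job failure / under-sized cluster
    if max(recent) <= LOW_CPU_PCT:
        return "LOW_UTILIZATION"  # over-sized cluster, wasted cost
    return None

print(evaluate_cpu([95, 97, 93, 96, 94, 98, 92, 95, 97, 96, 93, 94]))  # HIGH_CPU
print(evaluate_cpu([15, 12, 18, 10, 14, 16, 11, 13, 17, 12, 15, 14]))  # LOW_UTILIZATION
```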

Cost and Efficiency Alerts

  • **Spend Spike:** Daily cost exceeds historical 95th percentile (e.g., an infinitely looping job).
  • **Idle Clusters:** Cluster running for >X hours with zero active jobs (violation of auto-termination policy).
  • **Spot Instance Loss Rate:** High rate of Spot instance interruptions (may require shifting to On-Demand or a different instance family).
  • **Expensive Queries:** Alert on queries consuming more than $Y of compute time per run.
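
The spend-spike and idle-cluster checks can be sketched the same way; the cost history, idle threshold, and function names here are illustrative assumptions:

```python
import statistics

def spend_spike(daily_costs_usd, today_usd):
    """Flag a spend spike when today's cost exceeds the historical
    95th percentile of daily spend (e.g., an infinitely looping job)."""
    p95 = statistics.quantiles(daily_costs_usd, n=20)[18]  # 95th percentile
    return today_usd > p95

def idle_cluster(active_jobs, uptime_hours, max_idle_hours=2):
    """Flag a cluster that has been up with no active jobs for longer
    than the auto-termination policy allows (threshold is illustrative)."""
    return active_jobs == 0 and uptime_hours > max_idle_hours

history = [120, 135, 110, 140, 128, 150, 125, 132, 118, 145,
           122, 138, 127, 131, 119, 143, 126, 137, 129, 134]
print(spend_spike(history, today_usd=410))          # True: investigate
print(idle_cluster(active_jobs=0, uptime_hours=5))  # True: terminate
```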

3. Pipeline and Data Quality Monitoring

This shifts the monitoring focus from the infrastructure to the integrity and timeliness of the data itself; the sketches after each list below show the corresponding checks in code.

Pipeline Latency and Freshness (SLA/RPO)

  • **Job Success/Failure:** Basic alert on scheduler failure.
  • **Duration Drift:** Job duration exceeds its historical 99th percentile (slow query or bottleneck).
  • **Data Freshness:** The maximum age of the latest record in a critical table exceeds the defined Service Level Agreement (SLA).
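
A minimal freshness check, assuming the pipeline can read the timestamp of the newest record in the critical table; the two-hour SLA is an illustrative value:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=2)  # illustrative SLA for a critical table

def freshness_violation(latest_record_ts, now=None):
    """True when the age of the newest record in a critical table
    exceeds the agreed freshness SLA."""
    now = now or datetime.now(timezone.utc)
    return (now - latest_record_ts) > FRESHNESS_SLA

latest = datetime.now(timezone.utc) - timedelta(hours=3)
print(freshness_violation(latest))  # True: freshness SLA breached
```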

Data Quality Checks

  • **Null/Completeness:** Percentage of Null values in a required column exceeds a defined threshold (e.g., 1%).
  • **Validity/Uniqueness:** Primary key column fails a uniqueness test.
  • **Schema Drift:** An unexpected column appears or a required column disappears (critical alert).
  • **Referential Integrity:** Foreign key columns fail to match entries in the lookup table.
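
A sketch of the completeness, uniqueness, and schema checks on a single batch, assuming a pandas DataFrame and the hypothetical contract below; dedicated tools such as Great Expectations or dbt tests express the same rules declaratively:

```python
import pandas as pd

EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}  # agreed data contract

def quality_checks(df):
    """Run the batch-level checks described above; each failed check
    should map to an alert with an appropriate severity."""
    return {
        # Null/Completeness: share of missing values in a required column
        "customer_id_null_pct_ok": df["customer_id"].isna().mean() <= 0.01,
        # Validity/Uniqueness: primary key must not contain duplicates
        "order_id_unique": df["order_id"].is_unique,
        # Schema drift: unexpected or missing columns are a critical alert
        "schema_matches_contract": set(df.columns) == EXPECTED_COLUMNS,
    }

batch = pd.DataFrame({
    "order_id": [1, 2, 2],
    "customer_id": [10, None, 12],
    "amount": [9.99, 15.00, 15.00],
})
print(quality_checks(batch))  # uniqueness and completeness both fail here
```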

Data Volume and Drift

  • **Volume Drop:** Daily ingested row count falls below 50% of the trailing daily average (possible upstream failure).
  • **Volume Spike:** Daily ingested row count exceeds 300% of the trailing daily average (possible duplication or unexpected backfill).
  • **Value Drift:** Statistical properties of a column (mean, median, standard deviation) shift significantly from historical norms (e.g., average sale price drops 90%).
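
A sketch of the volume and value-drift rules above; the trailing windows and the three-sigma drift threshold are assumptions, not fixed standards:

```python
import statistics

def volume_anomaly(daily_counts, today_count):
    """Compare today's row count to the trailing daily average: a large
    drop hints at an upstream failure, a large spike at duplication."""
    avg = statistics.mean(daily_counts)
    if today_count < 0.5 * avg:
        return "VOLUME_DROP"
    if today_count > 3.0 * avg:
        return "VOLUME_SPIKE"
    return None

def value_drift(historical_means, today_mean, z_threshold=3.0):
    """Flag a column whose mean has moved more than `z_threshold`
    standard deviations away from its historical distribution."""
    mu = statistics.mean(historical_means)
    sigma = statistics.stdev(historical_means)
    return abs(today_mean - mu) > z_threshold * sigma

print(volume_anomaly([1_000_000] * 14, 320_000))         # VOLUME_DROP
print(value_drift([49.5, 50.2, 50.0, 49.8, 50.1], 5.0))  # True: mean collapsed
```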

4. Alerting Strategy and Triage

Alerts must be prioritized by severity to prevent "alert fatigue" and ensure an immediate response to critical issues; a small routing sketch at the end of this section shows one way to encode these tiers.

P1: Critical/Immediate (PagerDuty/On-Call)

**Impact:** Complete service outage or severe data corruption impacting external customers. Requires immediate stop-the-line action.

**Examples:** All production ETL jobs failed for 30+ minutes, production metastore is down, critical data freshness SLA violation.

P2: High/Urgent (Dedicated Slack Channel)

**Impact:** Isolated job failures, performance degradation, or data quality issues in non-critical systems. Requires investigation during business hours.

**Examples:** Single major ETL job failed, high CPU on a non-critical cluster, persistent low cluster utilization (cost alert), schema drift detected on a staging table.

P3: Low/Informational (Triage Channel/Dashboard)

**Impact:** Noise or potential future problems. Track for long-term trend analysis; no immediate action is required.

**Examples:** Single Spot instance loss, job duration increased by 10% (minor drift), daily cost spike within acceptable margin, minor completeness checks failed.
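
One way to encode this triage policy as a routing table; the alert names, channel names, and default tier below are illustrative assumptions:

```python
from enum import Enum

class Severity(Enum):
    P1 = "pagerduty"            # page the on-call engineer immediately
    P2 = "slack:#data-alerts"   # investigate during business hours
    P3 = "dashboard"            # triage channel / long-term trend analysis

# Illustrative routing table keyed by alert name
ROUTING = {
    "freshness_sla_violation": Severity.P1,
    "metastore_down": Severity.P1,
    "single_etl_job_failed": Severity.P2,
    "schema_drift_staging": Severity.P2,
    "spot_instance_loss": Severity.P3,
    "duration_drift_minor": Severity.P3,
}

def route(alert_name):
    """Send an alert to the destination for its severity tier;
    unknown alerts land in the triage channel by default."""
    severity = ROUTING.get(alert_name, Severity.P3)
    return f"{severity.name} -> {severity.value}"

print(route("freshness_sla_violation"))  # P1 -> pagerduty
print(route("spot_instance_loss"))       # P3 -> dashboard
```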

A healthy data platform is an observable data platform. Prioritize data quality checks as highly as system health checks.
