Pipeline Design Patterns
/tldr: Architectural paradigms for processing data reliably at scale, from batch to real-time.
1. Batch vs. Streaming (The Latency Trade-Off)
The core decision in pipeline design is how to handle time: process data in large fixed chunks (Batch) or continuously as it arrives (Streaming).
Batch Processing
High Latency, High Throughput
- **Data Model:** Bounded (finite dataset size).
- **Latency:** Hours to days (reports, ETL).
- **Complexity:** Easier to operate and debug; failed jobs can simply be re-run against the bounded input.
- **Tooling:** Apache Spark, Snowflake, Data Warehouses.
Stream Processing
Low Latency, Real-Time Response
- **Data Model:** Unbounded (infinite stream).
- **Latency:** Milliseconds to seconds (alerts, dashboards).
- **Complexity:** Requires careful handling of event ordering, state, and fault recovery.
- **Tooling:** Apache Kafka, Apache Flink, Spark Structured Streaming, Apache Pulsar.
2. Core Design Principles
Idempotency
The property that an operation can be applied multiple times without changing the result beyond the initial application. This is **critical** for safe retries and fault tolerance.
**Example:** Instead of an `INSERT` (which duplicates on retry), use an `UPSERT` (update or insert) based on a unique key.
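A minimal sketch of an idempotent write using SQLite's `INSERT ... ON CONFLICT` upsert (Python standard library only; requires SQLite >= 3.24). The `orders` table, its columns, and the `write_order` helper are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL, updated_at TEXT)"
)

def write_order(order_id, amount, updated_at):
    """Safe to retry: a replayed message overwrites the same row
    instead of inserting a duplicate."""
    conn.execute(
        """
        INSERT INTO orders (order_id, amount, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT(order_id) DO UPDATE SET
            amount = excluded.amount,
            updated_at = excluded.updated_at
        """,
        (order_id, amount, updated_at),
    )

# Running the same write twice leaves exactly one row.
write_order("o-123", 42.0, "2024-01-01T00:00:00Z")
write_order("o-123", 42.0, "2024-01-01T00:00:00Z")
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1
```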
Schema Evolution
The ability of a system to handle changes to the data structure (schema) over time without breaking downstream consumers.
**Strategy:** Use data formats and table layers with explicit evolution rules (e.g., Apache Avro, Parquet, Delta Lake), and prefer backward-compatible changes such as adding nullable columns with defaults.
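A minimal tolerant-reader sketch in plain Python: the consumer fills in defaults for fields added to the schema after a record was written, so old and new records both parse. The field names, defaults, and `read_record` helper are hypothetical:

```python
# Current schema with a default for each field. "currency" is a newer
# field that older records were written without.
SCHEMA_DEFAULTS = {"user_id": None, "amount": 0.0, "currency": "USD"}

def read_record(raw):
    """Project a raw record onto the current schema.
    Missing (newer) fields get defaults; unknown extra fields are ignored."""
    return {field: raw.get(field, default) for field, default in SCHEMA_DEFAULTS.items()}

# A record written before the "currency" field existed still reads cleanly.
old_record = {"user_id": "u-1", "amount": 9.99}
assert read_record(old_record) == {"user_id": "u-1", "amount": 9.99, "currency": "USD"}
```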
3. Handling Late Data
In stream processing, data often arrives out of order or after its intended processing window.
Late Arriving Data
Data is **late** when its **Event Time** (when the event actually happened) lags far enough behind its **Processing Time** (when the system receives it) that the window it belongs to has already been evaluated.
- **Watermarks:** A threshold of allowed lateness maintained by the stream processor (e.g., Flink or Spark). Events arriving within the watermark are still incorporated; events older than the watermark are dropped or routed to a side output.
- **Windowing:** Time-based aggregations (e.g., 5-minute totals). Late data within the allowed lateness forces the affected window to be recomputed and its result re-emitted, as in the sketch after this list.
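A sketch of watermarking and windowing in Spark Structured Streaming, using the built-in `rate` test source (which emits `timestamp` and `value` columns) in place of a real Kafka feed; the app name and durations are arbitrary:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

# Built-in test source; a real pipeline would read from Kafka or similar.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (
    events
    # Wait up to 10 minutes for late events before finalizing a window.
    .withWatermark("timestamp", "10 minutes")
    # Aggregate into 5-minute event-time windows.
    .groupBy(window("timestamp", "5 minutes"))
    .count()
)

# "update" mode re-emits corrected counts when late-but-within-watermark
# data arrives; events older than the watermark are silently dropped.
query = counts.writeStream.outputMode("update").format("console").start()
# query.awaitTermination()  # block until the stream is stopped
```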
4. Architectural Paradigms
These two reference models describe how to deliver both high accuracy and low latency from the same data.
Lambda Architecture
Dual-Layer Processing
This pattern uses two separate paths for data processing:
- **Batch Layer (Accuracy):** Processes historical data slowly to ensure high accuracy, correcting any errors from the speed layer.
- **Speed Layer (Latency):** Processes real-time data quickly, providing low-latency, approximate results.
- **Serving Layer:** Merges results from both layers into the final view (sketched below).
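A toy sketch of the serving-layer merge: batch values are authoritative wherever they exist, and the speed layer fills in the freshest windows. Both input views and the `serve` helper are hypothetical:

```python
def serve(batch_view, speed_view):
    """Merge precomputed views: start from the approximate speed-layer
    values, then let the accurate batch values overwrite where available."""
    merged = dict(speed_view)
    merged.update(batch_view)
    return merged

batch_view = {"2024-01-01": 1000, "2024-01-02": 1500}  # accurate, hours old
speed_view = {"2024-01-02": 1490, "2024-01-03": 120}   # approximate, fresh

# Batch wins for Jan 1-2; the speed layer covers Jan 3 until batch catches up.
assert serve(batch_view, speed_view) == {
    "2024-01-01": 1000, "2024-01-02": 1500, "2024-01-03": 120,
}
```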
Kappa Architecture
Stream-First Simplicity
Developed to eliminate the dual-codebase complexity of the Lambda Architecture.
- **Single Stream Layer:** All data (historical and real-time) is treated as an infinite stream via a central message queue (e.g., Kafka).
- **Reprocessing:** To recalculate historical views or correct errors, re-run the same stream processing logic over the raw event log from the beginning (see the sketch after this list).
- **Benefit:** Eliminates the need to maintain two separate codebases (Batch and Stream).
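A sketch of Kappa-style reprocessing with the `kafka-python` client, assuming that package is installed: pointing a fresh consumer group at the earliest offset replays the whole log through the same transformation code used for live traffic. The topic name, broker address, and `apply_event` handler are hypothetical:

```python
from kafka import KafkaConsumer  # pip install kafka-python (assumed)

consumer = KafkaConsumer(
    "events",                           # raw event log (hypothetical topic)
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",       # start from the beginning of the log
    enable_auto_commit=False,           # no committed offsets = repeatable replay
    group_id="reprocess-2024-06",       # fresh group so old offsets don't apply
)

def apply_event(event_bytes):
    ...  # the same transformation logic that serves live traffic

for message in consumer:
    apply_event(message.value)
```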
**Modern Reality:** Most production systems are converging on a **Modified Kappa** or **Hybrid Batch/Stream** approach (e.g., the modern Lakehouse), where batch handles cost-effective backfills and historical corrections, and streaming handles low-latency delivery.
Effective pipeline design is about balancing data latency, complexity, and accuracy.