Data Engineering TL;DR: Structured Streaming

Structured Streaming (Apache Spark)

TL;DR: Processing real-time data using the same simple DataFrame/SQL APIs as batch processing.

Tags: Micro-Batching · Fault Tolerance · Stream-Batch Unification · Exactly-Once

1. The Infinite Table Model

Structured Streaming abstracts stream processing by modeling the data stream as an unbounded **Input Table** to which new data is continuously appended. Operations (like filter, group by, join) are defined over this virtual table, producing a continuously updating **Result Table**.

How it Works: Micro-Batching

The stream processing is executed as a series of tiny, fault-tolerant **micro-batches**. Spark reads a small chunk of new data (the micro-batch), processes it using standard Spark code, and writes the output.

This keeps end-to-end latency low (typically on the order of 100 ms) while leveraging Spark's highly optimized batch engine and its fault-tolerance guarantees.
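A minimal sketch of the model in PySpark, using the built-in `rate` source (which generates timestamped test rows); the app name, sink, and variable names here are illustrative, not a prescribed setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("infinite-table-demo").getOrCreate()

# The stream is treated as an unbounded input table; the rate source
# appends 10 new rows (columns: timestamp, value) to it every second.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A standard DataFrame operation defines the query over the virtual table.
evens = events.filter(F.col("value") % 2 == 0)

# Each micro-batch appends its slice of the Result Table to the console.
query = (evens.writeStream
              .outputMode("append")
              .format("console")
              .start())

query.awaitTermination()
```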

2. Watermarks: Handling Late Data

In real-world streaming, events arrive out of order or late. **Watermarks** define how long Spark waits for late events, so it can safely finalize results and discard the state associated with older events, preventing unbounded memory growth.

Watermark Mechanism

  • **Definition:** A watermark is a time threshold defined by the user (e.g., .withWatermark("timestamp", "10 minutes")).
  • **Function:** Spark tracks the maximum event time seen so far (say, 1:00 PM); the watermark is that maximum minus the allowed delay (e.g., 12:50 PM for a 10-minute delay).
  • **Discarding State:** Any state (e.g., in a running count or window) associated with a timestamp older than the current watermark is finalized and subsequently cleared from memory.

Watermarks are essential for **stateful operations** like aggregations (e.g., counting events in 10-minute windows) and stream-stream joins.
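A minimal sketch of a watermarked window aggregation, reusing the `events` stream from the earlier sketch (the `rate` source emits a `timestamp` column):

```python
from pyspark.sql import functions as F

# Count events per 10-minute window, tolerating up to 10 minutes of lateness.
# State for windows older than (max event time - 10 minutes) is finalized
# and dropped from memory.
windowed_counts = (events
                   .withWatermark("timestamp", "10 minutes")
                   .groupBy(F.window(F.col("timestamp"), "10 minutes"))
                   .count())

query = (windowed_counts.writeStream
                        .outputMode("update")  # emit counts as windows evolve
                        .format("console")
                        .start())
```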

3. Checkpointing: Resilience and Recovery

Checkpointing is the mechanism that underpins fault tolerance and, combined with replayable sources and idempotent sinks, **Exactly-Once** processing semantics.

What is Checkpointed?

  • **Progress:** The query's progress, i.e., which data has been processed, tracked as source offsets (e.g., Kafka offsets).
  • **Configuration:** The necessary configuration and schema information for the streaming query.
  • **State:** The intermediate state data (e.g., window counts, running totals) managed by watermarks.

If the streaming job fails, it reads the checkpoint directory and restarts precisely from the last recorded offsets and state, so no data is lost and, given a replayable source and an idempotent sink, none is duplicated. You must specify a checkpoint path in persistent storage (e.g., S3, ADLS, GCS).
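Enabling checkpointing is a single option on the writer. A minimal sketch, with a hypothetical S3 path, continuing the windowed-count example:

```python
query = (windowed_counts.writeStream
                        .outputMode("update")
                        .format("console")
                        # Hypothetical path; must be durable, shared storage.
                        .option("checkpointLocation",
                                "s3://my-bucket/checkpoints/event-counts")
                        .start())
```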

4. Triggers and foreachBatch

These options give you precise control over how and when the micro-batch processing occurs.

Triggers

Defines the timing and execution mode of the micro-batches (a sketch follows the list below).

  • **Default:** If no trigger is set, Spark runs the next micro-batch as soon as the previous one finishes, processing new data as quickly as possible. (Note: "continuous processing" is a separate, experimental low-latency mode, not the default.)
  • **Fixed Interval:** .trigger(processingTime='5 seconds') starts a micro-batch every 5 seconds, regardless of incoming data volume.
  • **Once:** .trigger(once=True) runs one complete batch over all available data and then shuts down (useful for scheduled batch ETL over streaming sources); on Spark 3.3+, .trigger(availableNow=True) is the recommended replacement, splitting the backlog into multiple batches.
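A minimal sketch of the trigger options (pick one per query; `df` stands in for any streaming DataFrame):

```python
# Fixed interval: start a micro-batch every 5 seconds.
(df.writeStream
   .format("console")
   .trigger(processingTime="5 seconds")
   .start())

# Once: one complete batch over all available data, then shut down.
(df.writeStream
   .format("console")
   .trigger(once=True)
   .start())
```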

foreachBatch

Passes each micro-batch to a user-defined function as a regular (batch) DataFrame, so you can apply batch-style operations or write to external systems using the full Spark DataFrame API (a sketch follows the list below).

  • **Upsert Logic:** Essential for implementing complex logic like Delta Lake MERGE INTO operations on the incoming data.
  • **External Writes:** Enables writing to systems that require batch APIs (e.g., custom JDBC writers, bulk load APIs).
  • **Batch Functions:** Allows the use of DataFrame features not available in pure streaming mode.
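A minimal sketch of a foreachBatch upsert, assuming the delta-spark package and a hypothetical Delta table named `events_target` keyed on an `id` column (`events` stands in for any streaming DataFrame with that column):

```python
from delta.tables import DeltaTable

def upsert_to_delta(batch_df, batch_id):
    # Each micro-batch arrives as an ordinary DataFrame, so batch-only
    # APIs such as Delta's MERGE are usable here.
    target = DeltaTable.forName(batch_df.sparkSession, "events_target")
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

query = (events.writeStream
               .foreachBatch(upsert_to_delta)
               # Hypothetical checkpoint path.
               .option("checkpointLocation", "s3://my-bucket/checkpoints/upsert")
               .start())
```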

Structured Streaming provides the best of both worlds: simple DataFrame APIs with robust, fault-tolerant stream semantics.
