Apache Flink (Stateful Stream Processing)
TL;DR: A true streaming engine (records are processed one at a time, not micro-batched) for high-throughput, stateful, and event-time-aware computations.
1. Core Architecture and Dataflow
Flink is designed around a master/worker architecture that executes continuous dataflow graphs.
JobManager (Master)
The coordinator. It handles job submission, scheduling of tasks, managing checkpoints, and monitoring the cluster. If a TaskManager fails, the JobManager orchestrates recovery.
TaskManagers (Workers)
The worker processes that execute tasks (operator chains) of a dataflow, manage the allocated slots, and handle the flow of data between operators.
Dataflow Graph
A Flink program is mapped to a **dataflow graph** in which operations (Source, Map, KeyBy, Window, Sink) become **operators**; where possible, adjacent operators are chained together into a single task. The TaskManagers execute these tasks in parallel, based on the parallelism defined for the job.
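The key-partitioning step of such a graph can be sketched in a few lines. This is plain Python, not Flink's API: it only illustrates how a KeyBy operation routes records to parallel operator instances by key hash, so that all state for one key lives in one subtask.

```python
# Illustrative sketch (plain Python, not the Flink API): how a KeyBy step
# partitions a stream across parallel downstream operator instances.

def key_by(records, key_fn, parallelism):
    """Route each record to one of `parallelism` downstream instances by key hash."""
    partitions = [[] for _ in range(parallelism)]
    for record in records:
        # Flink hashes the key to pick the target subtask; we mimic that here.
        idx = hash(key_fn(record)) % parallelism
        partitions[idx].append(record)
    return partitions

events = [("user_a", 3), ("user_b", 5), ("user_a", 7)]
partitions = key_by(events, key_fn=lambda r: r[0], parallelism=2)

# All events for the same key land in the same partition,
# so per-key state stays local to one operator instance.
for r in events:
    assert r in partitions[hash(r[0]) % 2]
```

The invariant at the end is the point: co-locating a key's records is what makes Flink's per-key managed state possible without cross-task coordination.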
2. Event Time and Watermarks
One of Flink's defining features is its first-class support for **Event Time**: the timestamp embedded in the event when it occurred at its source. Computing on event time yields accurate results regardless of when, or in what order, events arrive.
Event Time
The time recorded in the event itself. Flink uses this time for accurate aggregations and windowing, handling out-of-order arrival.
Processing Time
The time on the machine running the operator. Useful for very low-latency processing where strict order is less critical.
Watermarks
A mechanism used with Event Time to signal the progress of time through the stream. A watermark carrying timestamp t tells Flink: "I won't see any more events older than t." This allows Flink to safely close event-time windows and discard their state.
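The most common watermark strategy, which Flink exposes as `WatermarkStrategy.forBoundedOutOfOrderness`, can be sketched in plain Python: the watermark simply trails the highest timestamp seen so far by a fixed allowed lateness.

```python
# Illustrative sketch (plain Python): a bounded-out-of-orderness watermark
# generator. The watermark lags the max event timestamp by a fixed delay,
# asserting "no events older than this will arrive."

class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_out_of_orderness_ms):
        self.delay = max_out_of_orderness_ms
        self.max_ts = float("-inf")

    def on_event(self, event_ts):
        self.max_ts = max(self.max_ts, event_ts)

    def current_watermark(self):
        # Windows ending at or before this value can safely fire.
        return self.max_ts - self.delay

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=2_000)
for ts in [1_000, 3_000, 2_500]:   # note the out-of-order arrival
    wm.on_event(ts)
print(wm.current_watermark())      # 1000
```

Choosing the delay is a latency/accuracy trade-off: a larger delay tolerates more disorder but holds windows (and their state) open longer.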
Windowing
Windowing operators group streams of elements based on time (Event Time) or count. Common types include:
- **Tumbling Windows:** Fixed, non-overlapping time segments (e.g., 5-minute blocks).
- **Sliding Windows:** Fixed duration, but windows overlap (e.g., a 10-minute window computed every 5 minutes).
- **Session Windows:** Group events by user activity, closing the window when a period of inactivity (gap) is detected.
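Tumbling window assignment is simple enough to state precisely. The sketch below (plain Python, not Flink's window assigners) shows the arithmetic: an event belongs to exactly one window, found by flooring its timestamp to a multiple of the window size.

```python
# Illustrative sketch (plain Python): assigning an event timestamp to its
# tumbling window. Windows of size W are the intervals [k*W, (k+1)*W),
# so each event falls into exactly one of them.

def tumbling_window(event_ts_ms, size_ms):
    start = (event_ts_ms // size_ms) * size_ms
    return (start, start + size_ms)   # window is [start, end)

# 5-minute (300,000 ms) tumbling windows:
assert tumbling_window(0, 300_000) == (0, 300_000)
assert tumbling_window(299_999, 300_000) == (0, 300_000)   # last ms of window 0
assert tumbling_window(300_000, 300_000) == (300_000, 600_000)
```

Sliding windows generalize this: with size 10 min and slide 5 min, each event is assigned to size/slide = 2 overlapping windows rather than one.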
3. Exactly-Once Guarantees via Checkpointing
Flink manages application state (e.g., sums, counts, model parameters) directly, storing it locally in a **State Backend**. This state is consistently snapshotted to guarantee **Exactly-Once** processing.
Checkpointing
Flink uses a lightweight, distributed snapshot mechanism, a variant of the **Chandy-Lamport algorithm** known as asynchronous barrier snapshotting.
- **Barriers:** Checkpoint barriers are injected into the data stream at the sources and flow through the dataflow graph. When an operator has received the barrier on all of its inputs (barrier alignment), it takes a snapshot of its current state.
- **Asynchronous Snapshots:** State snapshots are written to persistent storage (like S3 or HDFS) in the background while data processing continues, minimizing latency impact.
- **Exactly-Once:** In case of failure, the job restarts from the latest successful checkpoint and the sources rewind to the recorded offsets, so internal state reflects each input record exactly once. (End-to-end exactly-once additionally requires transactional or idempotent sinks.)
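Why snapshotting state and input position *together* yields exactly-once results can be shown with a toy simulation (plain Python, not Flink): a "checkpoint" records a running sum plus the source offset, and recovery rolls both back in lockstep, so no event is counted twice or dropped.

```python
# Illustrative sketch (plain Python): checkpoint-based recovery. The
# checkpoint pairs operator state (a running sum) with the source offset;
# restoring both together makes the result identical with or without a crash.

def run(events, checkpoint_every, crash_after=None):
    state, offset = 0, 0
    checkpoint = (0, 0)                       # (state, offset) snapshot
    while offset < len(events):
        if crash_after is not None and offset == crash_after:
            state, offset = checkpoint        # recover: restore state AND position
            crash_after = None                # crash only once
        state += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (state, offset)      # "barrier": snapshot state + offset

    return state

events = [1, 2, 3, 4, 5]
assert run(events, checkpoint_every=2) == 15
assert run(events, checkpoint_every=2, crash_after=3) == 15   # same result after a crash
```

Events between the last checkpoint and the crash are reprocessed, but because the state was rolled back to the matching snapshot, the final result is unaffected.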
Savepoints vs. Checkpoints
- **Checkpoint:** Automatically triggered, optimized for failure recovery.
- **Savepoint:** Manually triggered, primarily used for **planned upgrades, maintenance, or migrating job versions**. They are designed to be portable and flexible.
4. The APIs
Flink offers several abstraction layers for writing streaming applications.
DataStream API
The core, lowest-level of the main APIs. Used for fine-grained control over state, time, and custom operators. Well suited to complex, event-time-aware logic and low-latency transformations.
Example: implementing a custom fraud detection algorithm that tracks state per user ID.
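The shape of such per-key stateful logic can be sketched in plain Python. This is not the DataStream API: in real Flink code this would be a `KeyedProcessFunction` using `ValueState`; here a dict keyed by user ID stands in for Flink's managed keyed state, and the fraud rule and thresholds are hypothetical.

```python
# Illustrative sketch (plain Python, not the DataStream API): per-key
# stateful fraud detection. Hypothetical rule: flag a large charge that
# immediately follows a small "test" charge from the same user.

from collections import defaultdict

def detect_fraud(transactions, small_limit=1.00, large_limit=500.00):
    last_was_small = defaultdict(bool)   # per-key state: one flag per user ID
    alerts = []
    for user_id, amount in transactions:
        if last_was_small[user_id] and amount > large_limit:
            alerts.append((user_id, amount))
        last_was_small[user_id] = amount < small_limit
    return alerts

txns = [("u1", 0.50), ("u2", 20.00), ("u1", 900.00), ("u2", 900.00)]
print(detect_fraud(txns))   # [('u1', 900.0)]
```

Only u1 triggers an alert: the keyed state remembers the small charge per user, exactly the kind of per-key memory Flink's state backends manage (with checkpointing) at scale.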
Table & SQL API
A higher-level, relational API for stream processing. It's often used for declarative queries and simple transformations, providing a unified SQL interface for both batch and streaming data.
Example: SELECT user_id, COUNT(item) FROM orders GROUP BY HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '10' MINUTE), user_id
Flink's maturity in state management and event-time semantics makes it the go-to choice for mission-critical, high-accuracy streaming applications.