Data Engineering TL;DR: Apache Kafka

Apache Kafka (Streaming Platform)

/tldr: A distributed, fault-tolerant, and highly scalable event-streaming platform.

Real-time · High-Throughput · Distributed Log · Decoupling

1. Core Concepts: Topics & Partitions

Kafka is fundamentally a distributed commit log, managed by a cluster of **Brokers** (servers). Two core concepts define how data is organized on those brokers:

Topics

A category or feed name to which records (events) are published. A topic behaves like a named, append-only log (e.g., user_clicks, order_updates).

Topics are persisted on disk and are highly durable, retaining messages for a configurable amount of time (retention policy).
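
As a minimal sketch (assuming the confluent-kafka Python client and a broker on localhost:9092; topic name and settings are illustrative), a topic with an explicit retention policy can be created through the admin API:

```python
# Create a topic with 6 partitions, 3 replicas, and 7-day retention.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "user_clicks",
    num_partitions=6,
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # retention policy
)

futures = admin.create_topics([topic])
futures["user_clicks"].result()  # raises if creation failed
```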

Partitions

Topics are split into ordered, immutable sequences of records called Partitions. This is how Kafka achieves horizontal scalability.

  • **Ordering:** Records are only guaranteed to be ordered **within a single partition**, not across the entire topic.
  • **Keying:** Producers determine which partition a message goes to, usually by hashing a message key (e.g., user_id). All messages with the same key land in the same partition (see the producer sketch after this list).
  • **Replication:** Each partition has multiple replicas across the cluster for fault tolerance.
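
Here is a minimal producer sketch (again assuming the confluent-kafka Python client and a local broker; topic and field names are illustrative) showing how keying by user_id keeps a user's events on one partition:

```python
# Produce keyed records: records sharing a key hash to the same partition,
# so per-user ordering is preserved.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Called asynchronously once the broker acknowledges (or rejects) the record.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to partition {msg.partition()} at offset {msg.offset()}")

event = {"user_id": "42", "page": "/checkout"}
producer.produce(
    "user_clicks",
    key=event["user_id"],          # same key -> same partition
    value=json.dumps(event),
    callback=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered
```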

2. Consumer Groups and Offsets

Consumers read data from topics. They are managed in groups to scale consumption and coordinate processing.

Consumer Groups

  • **Scalability:** All consumers within the same group share the workload of a topic's partitions. Each partition is assigned to exactly one consumer within the group.
  • **Fan-out:** If you have multiple distinct applications needing the same data, they can belong to different consumer groups, meaning each application gets a full copy of the data stream.
  • **Offset Tracking:** The consumer group tracks its progress for each partition as an **Offset** (the position of the next record to read). Committed offsets are stored in an internal topic called __consumer_offsets.

This offset tracking allows consumers to stop, restart, or crash and then resume reading exactly where they left off, without losing data or re-reading the entire stream.
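
A minimal consumer sketch (assuming the confluent-kafka Python client; group and topic names are illustrative) ties these ideas together: every consumer started with the same group.id shares the partitions, while a different group.id receives its own full copy of the stream.

```python
# Join the consumer group "billing-service" and read from order_updates,
# committing progress to __consumer_offsets automatically.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing-service",
    "auto.offset.reset": "earliest",   # where to start when no offset is stored
    "enable.auto.commit": True,
})
consumer.subscribe(["order_updates"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value()}")
finally:
    consumer.close()
```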

3. Delivery Semantics (At Most Once, At Least Once, Exactly Once)

Kafka offers different guarantees regarding how messages are delivered, which is crucial for building reliable data pipelines.

At Most Once

Messages might be lost, but they are never duplicated.

The consumer commits the offset before processing the message. If it crashes mid-processing, that message is lost. Fastest, but least reliable.
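
A minimal sketch of this ordering (assuming the confluent-kafka Python client; group and topic names are illustrative):

```python
# At-most-once: commit the offset *before* processing, so a crash
# mid-processing drops the record instead of replaying it.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "metrics-sampler",
    "enable.auto.commit": False,       # commit manually, in the chosen order
})
consumer.subscribe(["user_clicks"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    consumer.commit(message=msg, asynchronous=False)  # 1. record progress first
    print(f"processing {msg.value()}")                # 2. then process (may crash)
```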

At Least Once

Messages are never lost, but they might be duplicated.

The consumer processes the message, then commits the offset. If it crashes before committing, it re-reads and re-processes the message on restart, producing duplicates.
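
The same sketch with the two steps swapped (same assumed client; handlers should therefore tolerate duplicates, e.g., by being idempotent):

```python
# At-least-once: process first, commit after. A crash between the two
# steps means the record is re-read and re-processed on restart.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "invoice-writer",
    "enable.auto.commit": False,
})
consumer.subscribe(["order_updates"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print(f"processing {msg.value()}")                # 1. process first (may crash)
    consumer.commit(message=msg, asynchronous=False)  # 2. then record progress
```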

Exactly Once

Every message is delivered and processed exactly once.

Achieved by combining **Idempotent Producers** (retries never create duplicate writes) with **Transactions**, which commit the processed output and the consumer's offsets atomically.
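
A minimal consume-transform-produce sketch with transactions (assuming the confluent-kafka Python client; topic, group, and transactional ids are illustrative). Output records and consumer offsets are committed atomically, and downstream readers using read_committed only ever see completed transactions:

```python
# Exactly-once processing within Kafka via an idempotent, transactional producer.
from confluent_kafka import Consumer, Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,               # no duplicate writes on retry
    "transactional.id": "order-enricher-1",   # stable id per producer instance
})
producer.init_transactions()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-enricher",
    "enable.auto.commit": False,
    "isolation.level": "read_committed",      # only see committed transactions
})
consumer.subscribe(["order_updates"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("orders_enriched", key=msg.key(), value=msg.value())
    # Commit the consumer's position as part of the same transaction.
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```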

4. Schema Registry & Kafka Connect

Managing data structure changes (schema evolution) in a stream is complex, so Kafka is often paired with tools from the Confluent ecosystem.

Schema Registry

A centralized repository for managing the schemas of topic messages (commonly **Avro**; Protobuf and JSON Schema are also supported).

  • **Validation:** Producers register their schema; the Registry enforces compatibility rules (e.g., **Backward Compatibility** means consumers using the new schema can still read data written with the old one).
  • **Data Quality:** Prevents corruption by enforcing that all messages flowing through Kafka adhere to the agreed-upon structure.
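
As a minimal sketch (assuming the confluent-kafka Python client with its schema-registry extras and a registry at localhost:8081; the schema and topic are illustrative), producing Avro records validated against the Registry looks like this:

```python
# Serialize a record with a registered Avro schema before producing it.
from confluent_kafka import Producer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

schema_str = """
{
  "type": "record",
  "name": "UserClick",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page", "type": "string"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})
value = serializer(
    {"user_id": "42", "page": "/checkout"},
    SerializationContext("user_clicks", MessageField.VALUE),
)
producer.produce("user_clicks", value=value)
producer.flush()
```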

Kafka Connect

A framework for moving data in and out of Kafka reliably, without writing custom code.

  • **Source Connectors:** Ingest data from external systems (e.g., databases using Change Data Capture/CDC) into Kafka topics.
  • **Sink Connectors:** Deliver data from Kafka topics to external systems (e.g., S3, Elasticsearch, data warehouses).
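
Connectors are configured declaratively and registered through the Kafka Connect REST API. A minimal sketch (the connector class, config keys, and bucket name follow the Confluent S3 sink plugin and are assumptions; the Connect worker is assumed on its default port 8083):

```python
# Register a sink connector by POSTing its config to the Connect REST API.
import json
import requests

connector = {
    "name": "orders-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",  # assumed plugin
        "topics": "order_updates",
        "tasks.max": "2",
        "s3.bucket.name": "my-data-lake",
        "flush.size": "1000",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```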

Kafka provides the durable, scalable messaging layer that enables modern event-driven architectures.
