Apache Parquet (Columnar Storage)
/tldr: The optimized, self-describing, and compressed file format for analytical workloads.
1. Columnar Storage Principle
Parquet's fundamental advantage is its shift from traditional row-based storage (like CSV or JSON) to **columnar storage**.
Row-Based (e.g., CSV, JSON)
All values for a single record are stored together. Excellent for transactional systems (OLTP), where you typically read or write an entire record at once. For analytical queries (OLAP), this layout forces the engine to read and then discard data from columns the query never uses.
Columnar (Parquet)
All values for a single column are stored together. This is ideal for analytical systems (OLAP) where queries typically only access a subset of columns (e.g., calculating the average price, not needing the user's address).
2. File Structure and Metadata
A Parquet file is highly structured and self-describing, organized to enable massive parallel processing.
File Header & Footer
The **Header** is just the 4-byte magic number `PAR1`. The **Footer** (written last, so readers start from the end of the file) contains the schema, compression details, and pointers to all Row Groups. This metadata lets the engine skip data intelligently.
Row Groups
The primary horizontal division of the data (commonly in the 128 MB to 1 GB range). Processing frameworks (like Spark) parallelize work across Row Groups.
Column Chunks & Pages
Within a Row Group, data is stored column-wise in Column Chunks. Each Chunk is composed of Pages, which are the smallest unit of encoding and compression.
3. Key Performance Optimizations
The columnar format enables massive efficiency gains across storage and computation.
Compression
Since a column contains data of the same type and often similar values (high homogeneity), compression algorithms (like Snappy, Zstd, Gzip) are far more effective. This drastically reduces storage costs and I/O time.
Predicate Pushdown
This is the most critical feature. The engine (e.g., Spark) reads the metadata stored in the file footer (min/max values for columns) and skips reading entire Row Groups or Column Chunks if they cannot possibly satisfy the query's `WHERE` clause.
Example: A query asks for `WHERE salary > 100000`. The engine checks each Row Group's metadata; if the maximum salary recorded for a Row Group is 95,000, it skips that Row Group's disk I/O entirely.
4. Data Types and Nested Data
Parquet supports a rich set of data types and elegantly handles complex, nested data structures.
Dremel Encoding (Repetition/Definition Levels)
Parquet uses the record-shredding and assembly technique from Google's Dremel paper to store complex structures such as arrays (LIST) and objects (MAP) efficiently in a columnar layout. **Definition levels** track which optional fields along a path are present (i.e., where nulls occur at any nesting depth), and **repetition levels** track where a new record or array element begins. This avoids serializing nested records as opaque blobs, preserving the benefits of columnar compression for the leaf values.
This capability allows Parquet to process semi-structured data (like JSON) efficiently without flattening the entire structure.
For most analytical workloads involving large-scale data lakes, Parquet is the default and optimized format.