Partitioning & Bucketing
**tl;dr:** Optimizing distributed data layout for faster queries and lower costs.
1. Partitioning (Pruning)
Partitioning is the process of dividing a table into smaller, more manageable parts based on the distinct values of one or more columns. It physically separates the data into different directories on the storage system.
Key Characteristics
- **Mechanism:** Creates a directory structure (e.g., `/data/table/year=2024/month=01/`).
- **Use Case:** Highly effective for filtering on low-cardinality, frequently filtered columns like **date, region, or source system**.
- **Benefit (Pruning):** When a query filters on a partition key (e.g., `WHERE date = '2024-01-15'`), the engine only reads the files within that specific directory, dramatically reducing I/O.
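The pruning mechanism above can be sketched in plain Python. This is a toy simulation with hypothetical file and directory names, not how any real engine is implemented: the point is that a filter on the partition key selects directories, so unmatched files are never even listed for the scan.

```python
from pathlib import PurePosixPath

# Hypothetical partition layout: each partition key value maps to a
# directory containing that slice's data files.
partitions = {
    "date=2024-01-14": ["part-000.parquet", "part-001.parquet"],
    "date=2024-01-15": ["part-000.parquet"],
    "date=2024-01-16": ["part-000.parquet", "part-001.parquet"],
}

def prune(partitions, key, value):
    """Return only the files under the directory matching the filter."""
    target = f"{key}={value}"
    return [
        str(PurePosixPath("/data/table") / part / f)
        for part, files in partitions.items()
        if part == target
        for f in files
    ]

# WHERE date = '2024-01-15' touches one directory, not the whole table.
files_to_scan = prune(partitions, "date", "2024-01-15")
```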
Partitioning Pitfalls (Skew & Small Files)
- **Too Many Partitions:** If you partition on a high-cardinality column (e.g., User ID), you create one directory per value, each holding tiny files, overwhelming the filesystem's metadata handling (the "small file problem").
- **Data Skew:** If one partition key value contains 90% of the data (e.g., `country = 'US'` is massive, all others are small), processing performance will be bottlenecked by that single, skewed partition.
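A toy illustration of why skew caps parallelism, using made-up row counts: even with one worker per partition, total runtime is bounded by the largest partition, not the average.

```python
from collections import Counter

# Toy dataset: 90% of the rows share one partition key value.
rows = ["US"] * 90 + ["DE"] * 5 + ["FR"] * 5

sizes = Counter(rows)          # rows landing in each partition
largest = max(sizes.values())  # the straggler partition

# With one worker per partition, wall-clock time scales with the
# biggest partition (90 rows), not with total_rows / num_workers.
total_rows = sum(sizes.values())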
2. Bucketing (Grouping)
Bucketing is a technique to divide the data within a partition (or the entire table) into a fixed number of buckets based on the **hash value** of a column.
Key Characteristics
- **Mechanism:** Data is distributed across a fixed number of files (buckets) using `hash(column) % num_buckets`.
- **Use Case:** Optimizing operations on high-cardinality columns, primarily **joins and aggregations**.
- **Benefit (Join Optimization):** When joining two tables bucketed the same way on the join column, the engine can perform a much faster **bucket-by-bucket join** (matching corresponding buckets directly instead of reshuffling both tables across the cluster).
- **Example:** Bucketing on `customer_id` (high cardinality) within a `date` partition (low cardinality).
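The `hash(column) % num_buckets` rule can be demonstrated directly. A minimal sketch: `zlib.crc32` stands in for the engine's hash function (Spark, for instance, uses Murmur3 internally, so real bucket numbers will differ), and the key property shown is determinism: the same key always lands in the same bucket, which is what makes bucket-by-bucket joins possible.

```python
import zlib

NUM_BUCKETS = 8

def bucket_for(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministic bucket assignment: hash(column) % num_buckets.
    crc32 is a stand-in for the engine's real hash function."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_buckets

# The same key always maps to the same bucket number, so two tables
# bucketed identically on customer_id can be joined bucket-by-bucket.
a = bucket_for("cust-42")
b = bucket_for("cust-42")
```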
3. Z-Ordering (Data Skipping)
Z-Ordering (a feature of Delta Lake and other modern table formats) is an advanced technique used to co-locate related values for *multiple* columns within the same set of files.
Mechanism and Benefit
- **How it Works:** It maps multidimensional data (multiple columns) into a single dimension using Z-curve space-filling curves, physically sorting the data based on the chosen columns.
- **Primary Goal:** To maximize **data skipping**. If a query filters on any Z-ordered column, the engine can quickly determine which files contain the relevant range of values and skip the rest.
- **Advantage over Partitioning:** You can Z-Order on high-cardinality columns (e.g., Latitude, User ID) without creating the small file problem, as the physical directory structure is unchanged.
**Command Example (Delta Lake/Databricks):** `OPTIMIZE table_name ZORDER BY (user_id, product_category)`
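The Z-curve mapping described above can be sketched with classic Morton encoding: interleave the bits of two columns so that rows with nearby values in *both* dimensions get nearby Z-values and therefore sort into the same files. This is an illustrative simplification; production engines use more elaborate range-based variants.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-curve) encoding: weave the bits of x and y together.
    Points close in (x, y) space end up close in the 1-D Z ordering."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x's bit i -> even position
        z |= ((y >> i) & 1) << (2 * i + 1)  # y's bit i -> odd position
    return z

# Sorting rows by their Z-value clusters similar (x, y) pairs into the
# same files, so each file's min/max stats cover a tight range.
points = [(0, 0), (1, 1), (7, 7), (1, 0), (6, 7)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
```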
4. Hive vs. Delta Lake
While Partitioning is a legacy concept inherited from Hive, modern systems like Delta Lake offer more robust alternatives.
Hive (Traditional)
- Relies heavily on **Partitioning** to manage file sizes and prune scans.
- Requires manual commands (`MSCK REPAIR TABLE`) to register new partitions.
- **Bucketing** is complex to set up and is often avoided in newer implementations.
Delta Lake (Modern Lakehouse)
- **Partitioning** is used sparingly (e.g., only on high-volume date columns).
- **Z-Ordering** is the preferred technique for optimizing scans on multiple columns, replacing the need for complex bucketing.
- Features like **data skipping** (min/max stats stored in the transaction log) provide filter efficiency even without Z-Ordering or Partitioning.
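The min/max data-skipping described above reduces to a simple interval check. A sketch with hypothetical file names and statistics, shaped like the per-file column stats a Delta-style transaction log records: a file can be skipped whenever the predicate value falls outside its `[min, max]` range.

```python
# Hypothetical per-file column statistics (min/max of user_id per file),
# as a Delta-style transaction log would record them.
file_stats = [
    {"file": "part-000.parquet", "min": 1,    "max": 1000},
    {"file": "part-001.parquet", "min": 1001, "max": 2000},
    {"file": "part-002.parquet", "min": 2001, "max": 3000},
]

def files_for_predicate(stats, value):
    """Keep only files whose [min, max] range can contain the value."""
    return [s["file"] for s in stats if s["min"] <= value <= s["max"]]

# WHERE user_id = 1500 -> only one of the three files is opened.
scanned = files_for_predicate(file_stats, 1500)
```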
In modern DE, we partition for directories and Z-Order for files.