Compaction vs. Vacuum (Lakehouse Maintenance)
/tldr: The two essential operations for optimizing query performance and managing cloud storage costs in Delta, Iceberg, and Hudi.
1. Compaction (e.g., Delta's OPTIMIZE)
Compaction is a **performance-enhancing** operation. Its primary goal is to address the **"small files problem"** created by continuous streaming and micro-batch writes. Query engines like Spark perform poorly when dealing with millions of tiny files due to high overhead in metadata management and file opening.
Goal: Optimize Query Speed
Compaction reads many small Parquet files, merges their contents, and writes the data back as a smaller number of optimally sized files (typically 128 MB to 1 GB). This significantly reduces metadata overhead, leading to faster query planning and execution.
Mechanism: File Rewriting (CoW)
This is a **copy-on-write** operation, not an in-place edit: new, larger files are written and registered in the transaction log, while the old, smaller files are marked as stale but not immediately deleted.
OPTIMIZE table_name
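On large tables, compaction is usually scoped so that only recently written partitions are rewritten. A minimal sketch in Delta's SQL dialect, assuming the table is partitioned by a `date` column (the column name and cutoff are illustrative):
OPTIMIZE table_name WHERE date >= '2024-06-01'
The predicate may only reference partition columns, which keeps the rewrite bounded to the partitions that have actually accumulated small files.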
Advanced Compaction Strategies
Z-Ordering (e.g., Delta's ZORDER BY)
A technique used during compaction to co-locate related values in a multi-dimensional space (i.e., multiple columns). Z-Ordering physically reorganizes data to maximize data skipping when filtering on two or more columns simultaneously.
Example: If you Z-Order by `(CustomerKey, OrderDate)`, queries filtering on both fields will only need to read a small fraction of the files.
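In Delta Lake this is expressed as part of the OPTIMIZE command; the columns below reuse the example above:
OPTIMIZE table_name ZORDER BY (CustomerKey, OrderDate)
Z-Ordering is most effective on high-cardinality columns that appear together in filter predicates, and it must be re-run periodically because newly written files are not automatically clustered.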
Bin Packing (General)
The simple strategy where small files are combined into files that are close to a target size. This is the default mechanism for improving I/O throughput by ensuring the query engine processes larger, contiguous blocks of data.
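The target size is usually configurable. A sketch for open-source Delta Lake, assuming the session-level property below (the exact property name and default vary by engine and version, so verify against your platform's docs):
-- assumed OSS Delta session config; target output size of roughly 1 GB
SET spark.databricks.delta.optimize.maxFileSize=1073741824;
OPTIMIZE table_name;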
2. Vacuum (Data Retention and Cleanup)
Vacuuming is a **cost and compliance-focused** operation. Its primary goal is to physically delete stale or "orphaned" data files that are no longer referenced by the table's current version in the transaction log (metadata).
Goal: Reduce Storage Costs & Risk
Every transaction (writes, updates, deletions, and compaction) creates new data files and marks old files as stale. Vacuuming physically removes these stale files from cloud storage (S3, ADLS, GCS), reducing storage bills and minimizing the risk of accessing obsolete data.
Mechanism: Physical Deletion
Vacuuming typically uses a **retention period** (e.g., 7 days). Files older than the retention period *and* not referenced by the current transaction log are physically deleted. **Caution:** Setting the retention too low can break long-running queries accessing older snapshots.
VACUUM table_name RETAIN 168 HOURS
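Before deleting anything, it is worth previewing what would be removed. Delta Lake's VACUUM supports a dry-run mode that lists candidate files without touching storage:
VACUUM table_name RETAIN 168 HOURS DRY RUN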
Time Travel and Safety
Minimum Retention Time
Lakehouse formats enforce a minimum safe retention (e.g., 7 days or 168 hours in Delta Lake). This is a critical safety measure to ensure that concurrent, long-running read operations (which may have started based on an older snapshot) do not fail when files they need are suddenly deleted. If files are deleted, you lose the ability to **time travel** back to versions referencing those files.
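As a sketch of what is at stake, the following Delta SQL reads an older snapshot and inspects the table's version history (the version number is illustrative):
SELECT * FROM table_name VERSION AS OF 12;
DESCRIBE HISTORY table_name;
Delta will refuse to vacuum below the default 7-day threshold unless the safety check (spark.databricks.delta.retentionDurationCheck.enabled) is explicitly disabled, which should only be done when no reader or job depends on the older versions.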
3. Compaction vs. Vacuum: Key Differences
Compaction (OPTIMIZE)
- **Primary Goal:** Query Performance.
- **Operation:** Rewrites data files (merges small files into larger ones).
- **Affects:** File layout and table metadata; does not free storage by itself.
- **Result:** Faster queries, but total storage temporarily grows because the old files remain until vacuumed.
Vacuum (CLEAN)
- **Primary Goal:** Cost Management / Compliance.
- **Operation:** Physical deletion (removes stale files).
- **Affects:** Cloud storage usage (S3, ADLS, GCS).
- **Result:** Reduced storage costs, but time travel is limited to versions within the retention window.
Compaction should run frequently to maintain high query performance, while Vacuum should run periodically (e.g., weekly) to clean up old data after a safe retention period.
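A minimal maintenance routine under these guidelines might look like the following, assuming an external scheduler runs the compaction block nightly and the vacuum block weekly (table name, columns, and retention are illustrative):
-- nightly: merge small files and cluster by common filter columns
OPTIMIZE table_name ZORDER BY (CustomerKey, OrderDate);
-- weekly: physically remove files outside the 7-day retention window
VACUUM table_name RETAIN 168 HOURS;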