Delta Lake
TL;DR: The only table format you should be using in 2025
Tags: ACID • Time Travel • Z-Order • OPTIMIZE • 2025
2025 LAW
No Delta Lake = You’re doing it wrong.
Parquet alone is dead.
Why Delta Wins Everything
- ACID Transactions: concurrent writes never corrupt the table
- Time Travel: SELECT * FROM table VERSION AS OF 123
- Schema Enforcement: blocks bad data at write time (sketch below)
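A minimal sketch of schema enforcement in action, assuming a Spark session already configured for Delta Lake; the path and columns are placeholders:

# Schema enforcement: the second write fails because "amount" arrives as a string, not a double
good = spark.createDataFrame([(1, 10.0)], ["id", "amount"])
good.write.format("delta").mode("overwrite").save("/delta/payments")

bad = spark.createDataFrame([(2, "oops")], ["id", "amount"])
try:
    bad.write.format("delta").mode("append").save("/delta/payments")  # type mismatch
except Exception as e:
    print("Blocked by schema enforcement:", e)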
The 8 Commands That Rule Your Life
# 1. Create / Convert
spark.read.parquet("s3a://old/").write.format("delta").save("/delta/table")
CONVERT TO DELTA parquet.`s3a://path`
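The same conversion is also available from Python; the call below is a sketch using the path from the SQL form:

# Convert an existing Parquet directory to Delta in place (Python API)
from delta.tables import DeltaTable
DeltaTable.convertToDelta(spark, "parquet.`s3a://path`")
# For partitioned Parquet, also pass the partition schema, e.g. "date DATE"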
# 2. Time Travel
SELECT * FROM delta.`/path` VERSION AS OF 123
SELECT * FROM delta.`/path` TIMESTAMP AS OF "2025-04-01"
df = spark.read.format("delta").option("versionAsOf", 123).load("/path")
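A common use of time travel is auditing what changed since a version; a hedged sketch using standard DataFrame set operations (same version number and path as above):

# Diff the current table against version 123
v123 = spark.read.format("delta").option("versionAsOf", 123).load("/path")
current = spark.read.format("delta").load("/path")
added_or_changed = current.exceptAll(v123)    # rows present now, absent at v123
removed_or_changed = v123.exceptAll(current)  # rows present at v123, gone now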
# 3. OPTIMIZE + Z-Order (up to 100× faster filtered reads), then VACUUM
OPTIMIZE delta.`/path` ZORDER BY (user_id, date)
VACUUM delta.`/path` RETAIN 168 HOURS -- 168 hours = 7 days, the default retention
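If you prefer to keep maintenance in Python, Delta Lake 2.0+ exposes the same operations on DeltaTable; the path and columns here are placeholders:

# OPTIMIZE + Z-order + VACUUM via the Python API
from delta.tables import DeltaTable
t = DeltaTable.forPath(spark, "/path")
t.optimize().executeZOrderBy("user_id", "date")  # compact small files + Z-order
t.vacuum(168)                                    # drop unreferenced files older than 168 hours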
# 4. Upsert (MERGE)
MERGE INTO prod_table AS target
USING new_data AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
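The equivalent upsert through the Python DeltaTable API, assuming new_data is a DataFrame with a matching schema:

# MERGE (upsert) via the Python API
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "prod_table")
(target.alias("target")
    .merge(new_data.alias("source"), "target.id = source.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())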
# 5. Delete / Update
DELETE FROM prod_table WHERE date < "2024-01-01"
UPDATE prod_table SET status = "archived" WHERE age > 90
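Python equivalents of the same delete and update, with condition strings mirroring the SQL above:

# DELETE / UPDATE via the Python API
from delta.tables import DeltaTable
t = DeltaTable.forName(spark, "prod_table")
t.delete("date < '2024-01-01'")
t.update(condition="age > 90", set={"status": "'archived'"})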
# 6. Change Data Feed (CDF)
ALTER TABLE prod_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
spark.readStream.format("delta").option("readChangeFeed", "true").table("prod_table")
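For backfills, the change feed can also be read in batch; the starting version and the id column below are assumptions:

# Batch read of the change feed from a specific version
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("prod_table"))
changes.select("id", "_change_type", "_commit_version", "_commit_timestamp").show()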
Z-ORDER + OPTIMIZE = 10–100× Faster Queries
DO THIS WEEKLY
OPTIMIZE delta.`/path` ZORDER BY (high_cardinality_col)
-- e.g. user_id, event_type, country
NEVER DO
- ZORDER BY date only (partition by date instead; Z-order is for high-cardinality filter columns)
- ZORDER on low-cardinality columns (booleans, small enums)
- Skip OPTIMIZE indefinitely (small files pile up and reads slow down)
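A minimal sketch of the weekly routine above, wrapped as a reusable job; the table path and Z-order columns are placeholders:

# Weekly maintenance: OPTIMIZE + Z-order, then VACUUM
def weekly_maintenance(spark, path, zorder_cols, retain_hours=168):
    cols = ", ".join(zorder_cols)
    spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY ({cols})")
    spark.sql(f"VACUUM delta.`{path}` RETAIN {retain_hours} HOURS")

weekly_maintenance(spark, "/delta/events", ["user_id", "event_type"])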
Production Checklist (Never Fail Again)
All tables = Delta format
Auto-optimize + auto-compaction ON (Databricks; see the property sketch after this checklist)
VACUUM weekly (retain 7–30 days)
ZORDER on most-filtered high-cardinality columns
Use MERGE for upserts, never overwrite
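The auto-optimize item refers to Databricks-specific table properties; a sketch of enabling them on one table (property names per the Databricks docs, table name is a placeholder):

# Enable optimized writes + auto-compaction on a Databricks Delta table
spark.sql("""
  ALTER TABLE prod_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")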
FINAL ANSWER:
Delta Lake is not optional.
It is the foundation of modern data lakes.
ACID • Time Travel • Z-Order • Streaming
All solved. Forever.