Caching Strategy
/tldr: Cache only what you reuse 3+ times, and always as MEMORY_AND_DISK
2025 LAW
cache() → MEMORY_AND_DISK
Everything else is wrong.
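What that law looks like in PySpark (a minimal sketch; assumes a SparkSession named spark and some DataFrame df, and the exact storage-level repr varies by Spark version):

df.cache()              # for DataFrames this is a memory-and-disk persist by default
print(df.is_cached)     # True once the level has been assigned
print(df.storageLevel)  # reports a level that uses both memory and disk
df.unpersist()          # release it when the check is done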
Storage Levels — Only 2 Matter
MEMORY_AND_DISK
Default in 2025: fast, safe, spills gracefully to disk
MEMORY_ONLY
Kills jobs: OOM → recompute → cascade fail
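Setting levels explicitly, for contrast (a sketch; df_big is a placeholder DataFrame, and a DataFrame keeps the first level it's given, so pick one or unpersist() before switching):

from pyspark import StorageLevel

# MEMORY_ONLY: partitions that don't fit in memory are dropped and
# recomputed from lineage on the next access → the cascade behaviour above
df_big.persist(StorageLevel.MEMORY_ONLY)

# MEMORY_AND_DISK: partitions that don't fit spill to local disk instead,
# which is what plain cache() already gives you
df_big.persist(StorageLevel.MEMORY_AND_DISK)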
Cache ONLY If:
Used in 3+ downstream actions (joins, counts, ML training)
Expensive to recompute (joins, window functions)
Never cache raw input → waste of memory
Correct Caching Pattern (2025)
# Correct — safe + fast
df_clean = (spark.read.parquet("s3a://raw/")
            .filter(...)
            .join(dim_table, ...)
            .cache())            # ← MEMORY_AND_DISK by default!

# Use it many times
df_clean.groupBy("user").count().write.parquet(...)
df_clean.join(events, ...).write.parquet(...)
# ...and feed df_clean into ML training

# Done? Unpersist!
df_clean.unpersist()
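One optional refinement, not part of the pattern itself: unpersist() is asynchronous by default, so pass blocking=True when the next step needs that memory right away:

df_clean.unpersist(blocking=True)   # wait until the cached blocks are actually freed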
Top 5 Ways People Lose $100k
1. Forgetting unpersist()
Memory leak → OOM after 10 runs (fix: the try/finally sketch after this list)
2. Caching after shuffle (too big)
Spills everywhere → slower than recompute
3. Using MEMORY_ONLY on big data
Job dies → recompute 10x slower
4. Caching streaming DataFrame
Memory grows forever → crash
5. Caching inside loops without unpersisting
Cached copies pile up every iteration → memory explosion (same fix: the sketch after this list)
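Pitfalls 1 and 5 share one fix: scope each cache in a try/finally so it's always released. A minimal sketch (batch_paths, the status column, and the output path are placeholders, not from the pattern above):

from pyspark.sql import functions as F

for batch_path in batch_paths:                     # hypothetical list of input paths
    df = (spark.read.parquet(batch_path)
          .filter(F.col("status") == "ok")         # hypothetical filter
          .cache())
    try:
        total = df.count()                         # reuse 1
        (df.groupBy("user").count()
           .write.mode("append").parquet("s3a://out/daily/"))   # reuse 2, hypothetical sink
    finally:
        df.unpersist()   # always runs → no leak, no pile-up across iterations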
FINAL ANSWER:
cache() → MEMORY_AND_DISK
Reuse 3+ times → cache
Done → unpersist()
That’s it. Forever.