Caching Strategy
/tldr: Cache only what you reuse 3+ times, and always as MEMORY_AND_DISK
2025 LAW
cache() → MEMORY_AND_DISK
Everything else is wrong.
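What that law looks like in PySpark (a minimal sketch; assumes a SparkSession named spark and some DataFrame df, and the exact storage-level repr varies by Spark version):

df.cache()              # for DataFrames this is a memory-and-disk persist by default
print(df.is_cached)     # True once the level has been assigned
print(df.storageLevel)  # reports a level that uses both memory and disk
df.unpersist()          # release it when the check is done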
Storage Levels — Only 2 Matter
MEMORY_AND_DISK
Default in 2025: fast, safe, spills gracefully to disk
MEMORY_ONLY
Kills jobs: OOM → recompute → cascade fail
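Setting levels explicitly, for contrast (a sketch; df_big is a placeholder DataFrame, and a DataFrame keeps the first level it's given, so pick one or unpersist() before switching):

from pyspark import StorageLevel

# MEMORY_ONLY: partitions that don't fit in memory are dropped and
# recomputed from lineage on the next access → the cascade behaviour above
df_big.persist(StorageLevel.MEMORY_ONLY)

# MEMORY_AND_DISK: partitions that don't fit spill to local disk instead,
# which is what plain cache() already gives you
df_big.persist(StorageLevel.MEMORY_AND_DISK)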
Cache ONLY If:
Used in 3+ downstream actions (joins, counts, ML training)
Expensive to recompute (joins, window functions)
Never cache raw input → waste of memory
Correct Caching Pattern (2025)
# Correct — safe + fast
df_clean = (spark.read.parquet("s3a://raw/")
            .filter(...)
            .join(dim_table, ...)
            .cache())            # ← MEMORY_AND_DISK by default!

# Use it many times
df_clean.groupBy("user").count().write.parquet(...)
df_clean.join(events, ...).write.parquet(...)
# ...and feed df_clean into ML training

# Done? Unpersist!
df_clean.unpersist()
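One optional refinement, not part of the pattern itself: unpersist() is asynchronous by default, so pass blocking=True when the next step needs that memory right away:

df_clean.unpersist(blocking=True)   # wait until the cached blocks are actually freed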
Top 5 Ways People Lose $100k
1. Forgetting unpersist()
Memory leak → OOM after 10 runs (fix: the try/finally sketch after this list)
2. Caching after shuffle (too big)
Spills everywhere → slower than recompute
3. Using MEMORY_ONLY on big data
Job dies → recompute 10x slower
4. Caching streaming DataFrame
Memory grows forever → crash
5. Caching inside loops without unpersisting
Cached copies pile up every iteration → memory explosion (same fix: the sketch after this list)
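Pitfalls 1 and 5 share one fix: scope each cache in a try/finally so it's always released. A minimal sketch (batch_paths, the status column, and the output path are placeholders, not from the pattern above):

from pyspark.sql import functions as F

for batch_path in batch_paths:                     # hypothetical list of input paths
    df = (spark.read.parquet(batch_path)
          .filter(F.col("status") == "ok")         # hypothetical filter
          .cache())
    try:
        total = df.count()                         # reuse 1
        (df.groupBy("user").count()
           .write.mode("append").parquet("s3a://out/daily/"))   # reuse 2, hypothetical sink
    finally:
        df.unpersist()   # always runs → no leak, no pile-up across iterations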
FINAL ANSWER:
cache() → MEMORY_AND_DISK
Reuse 3+ times → cache
Done → unpersist()
That’s it. Forever.