Shuffle Deep Dive
TL;DR: Shuffle is often 90% of your runtime and 100% of your pain
Exchange • Spill • OOM • AQE • Partitions
WARNING
Shuffle is the #1 killer of Spark jobs.
Understand it — or it will destroy you.
What Happens During Shuffle (Step by Step)
1. Map Phase (Pre-Shuffle)
- Triggered by groupByKey(), join(), distinct(), repartition()
- Each map task writes shuffle files to local disk
- Sort-based shuffle: one data file + one index file per task, with a segment per downstream partition
2. Shuffle Write
- Data is serialized and spilled to disk when execution memory fills up
- Files: shuffle_0_1_0.data + .index
- External Shuffle Service serves them (when enabled), so they survive executor loss
3. Shuffle Read (Fetch)
- Next stage tasks pull their partition
- Over network → BlockManager → merge
- Usually the slowest part of the job
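A minimal PySpark sketch of where the shuffle shows up in a plan (the dataset and bucket expression are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# A wide transformation (groupBy) forces an Exchange -- that node IS the shuffle
df = spark.range(10_000_000).withColumnRenamed("id", "user_id")
agg = df.groupBy((df.user_id % 100).alias("bucket")).count()

# Physical plan shows "Exchange hashpartitioning(bucket, ...)":
# map tasks write partitioned shuffle files, the next stage fetches them over the network
agg.explain()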
The 3 Ways Shuffle Kills You
Spill
Memory fills → data hits disk → often 10–100× slower
Skew
One task gets 90% of the data → spills, OOMs, or never finishes
Too Many Partitions
10k tiny partitions → tiny files, GC pressure + task-scheduling overhead
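A quick skew check before you blame the cluster (PySpark sketch; `df` and the key column "customer_id" are assumptions -- substitute your own join/group key):

from pyspark.sql import functions as F

key_counts = df.groupBy("customer_id").count().orderBy(F.desc("count"))

# If the top few keys own most of the rows, the tasks holding those keys
# will spill or OOM while everything else finishes in seconds
key_counts.show(10)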
Golden Rules That Save Billions
200–2000 partitions total
Ideal: ~128MB per partition after shuffle
spark.sql.shuffle.partitions: stop hand-tuning it and let AQE coalesce partitions
Let AQE handle it in 2025
External Shuffle Service = mandatory
Shuffle files survive executor removal under dynamic allocation
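Back-of-the-envelope sizing if you do set the count by hand (sketch; assumes an active `spark` session, and the shuffle volume here is an example -- read the real number off the stage's Shuffle Write metric in the Spark UI):

shuffle_bytes = 100 * 1024**3    # e.g. 100 GB shuffled in this stage
target_bytes = 128 * 1024**2     # ~128 MB per post-shuffle partition

num_partitions = max(200, shuffle_bytes // target_bytes)   # -> 800 here
spark.conf.set("spark.sql.shuffle.partitions", int(num_partitions))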
Configs That Fix 95% of Shuffle Pain
# 2025 MUST-HAVE
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled true
spark.shuffle.service.enabled true
# Optional but golden
spark.sql.adaptive.advisoryPartitionSizeInBytes 128MB
spark.sql.files.maxPartitionBytes 256MB
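One way to wire these in at session start (PySpark sketch; the app name is arbitrary, and the External Shuffle Service must already be running on the cluster nodes):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuned")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
    .config("spark.sql.files.maxPartitionBytes", "256MB")
    .getOrCreate()
)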
TRUTH:
Avoid shuffle when possible.
Broadcast • Bucket • Cache • Filter early.
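Example of the first trick, broadcasting the small side of a join so no shuffle happens at all (PySpark sketch; `orders` and `countries` are hypothetical DataFrames, and the broadcast side must fit in executor memory):

from pyspark.sql import functions as F

joined = orders.join(F.broadcast(countries), on="country_code", how="left")
joined.explain()   # plan shows BroadcastHashJoin instead of SortMergeJoin + Exchange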
Master shuffle.
Or it masters you.