Shuffle Deep Dive
TL;DR: Shuffle is often 90% of your runtime and 100% of your pain
Exchange • Spill • OOM • AQE • Partitions
WARNING
Shuffle is the #1 killer of Spark jobs.
Understand it — or it will destroy you.
What Happens During Shuffle (Step by Step)
1. Map Phase (Pre-Shuffle)
- Triggered by groupByKey(), join(), distinct(), repartition()
- Each map task writes shuffle files to local disk
- Sort-based shuffle: one data file + one index file per task, with a segment per downstream partition
2. Shuffle Write
- Data is serialized and spilled to disk when execution memory fills up
- Files: shuffle_0_1_0.data + .index
- External Shuffle Service serves them (when enabled), so they survive executor loss
3. Shuffle Read (Fetch)
- Next stage tasks pull their partition
- Over network → BlockManager → merge
- Usually the slowest part of the job
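A minimal PySpark sketch of where the shuffle shows up in a plan (the dataset and bucket expression are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# A wide transformation (groupBy) forces an Exchange -- that node IS the shuffle
df = spark.range(10_000_000).withColumnRenamed("id", "user_id")
agg = df.groupBy((df.user_id % 100).alias("bucket")).count()

# Physical plan shows "Exchange hashpartitioning(bucket, ...)":
# map tasks write partitioned shuffle files, the next stage fetches them over the network
agg.explain()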
The 3 Ways Shuffle Kills You
Spill
Memory fills → data hits disk → often 10–100× slower
Skew
One task gets 90% of the data → spills, OOMs, or never finishes
Too Many Partitions
10k tiny partitions → tiny files, GC pressure + task-scheduling overhead
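A quick skew check before you blame the cluster (PySpark sketch; `df` and the key column "customer_id" are assumptions -- substitute your own join/group key):

from pyspark.sql import functions as F

key_counts = df.groupBy("customer_id").count().orderBy(F.desc("count"))

# If the top few keys own most of the rows, the tasks holding those keys
# will spill or OOM while everything else finishes in seconds
key_counts.show(10)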
Golden Rules That Save Billions
200–2000 partitions total
Ideal: ~128MB per partition after shuffle
spark.sql.shuffle.partitions: stop hand-tuning it and let AQE coalesce partitions
Let AQE handle it in 2025
External Shuffle Service = mandatory
Shuffle files survive executor removal under dynamic allocation
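Back-of-the-envelope sizing if you do set the count by hand (sketch; assumes an active `spark` session, and the shuffle volume here is an example -- read the real number off the stage's Shuffle Write metric in the Spark UI):

shuffle_bytes = 100 * 1024**3    # e.g. 100 GB shuffled in this stage
target_bytes = 128 * 1024**2     # ~128 MB per post-shuffle partition

num_partitions = max(200, shuffle_bytes // target_bytes)   # -> 800 here
spark.conf.set("spark.sql.shuffle.partitions", int(num_partitions))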
Configs That Fix 95% of Shuffle Pain
# 2025 MUST-HAVE
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled true
spark.shuffle.service.enabled true
# Optional but golden
spark.sql.adaptive.advisoryPartitionSizeInBytes 128MB
spark.sql.files.maxPartitionBytes 256MB
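One way to wire these in at session start (PySpark sketch; the app name is arbitrary, and the External Shuffle Service must already be running on the cluster nodes):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuned")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
    .config("spark.sql.files.maxPartitionBytes", "256MB")
    .getOrCreate()
)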
TRUTH:
Avoid shuffle when possible.
Broadcast • Bucket • Cache • Filter early.
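Example of the first trick, broadcasting the small side of a join so no shuffle happens at all (PySpark sketch; `orders` and `countries` are hypothetical DataFrames, and the broadcast side must fit in executor memory):

from pyspark.sql import functions as F

joined = orders.join(F.broadcast(countries), on="country_code", how="left")
joined.explain()   # plan shows BroadcastHashJoin instead of SortMergeJoin + Exchange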
Master shuffle.
Or it masters you.