Spark SQL + AQE
TL;DR: Catalyst + AQE = 2–10× faster queries with near-zero tuning
Catalyst · AQE · Dynamic Optimization · Runtime Magic · 2025 Reality Check
Just turn AQE on.
Everything else is legacy.
Spark 3.0+ with AQE enabled → 80% of manual tuning is obsolete.
Catalyst Optimizer (Static Phase)
- Predicate Pushdown: push filters below joins and down into the data source read
- Column Pruning: read only the columns the query actually needs
- Join Reordering: put the smallest table first
- Constant Folding: pre-compute literal expressions at plan time
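A minimal PySpark sketch of pushdown and pruning at work (the Parquet path and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = (spark.read.parquet("/data/events")       # placeholder path
        .filter("event_date = '2025-01-01'")   # predicate pushed down to the scan
        .select("user_id", "event_type"))      # only two columns are read

df.explain(True)  # plan shows PushedFilters: [...] and a narrowed ReadSchema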
Adaptive Query Execution (AQE) – Runtime Superpowers
- Dynamic Join Strategy: converts SortMergeJoin → BroadcastHashJoin at runtime if one side shrinks enough
- Skew Handling: splits hot partitions into smaller tasks automatically
- Coalesce Partitions: merges small shuffle partitions after a selective filter
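A hedged sketch of the runtime join conversion (paths, join key, and filter are made up for illustration):

spark.conf.set("spark.sql.adaptive.enabled", "true")

dims = spark.read.parquet("/data/dims").filter("region = 'EU'")  # shrinks at runtime
facts = spark.read.parquet("/data/facts")

joined = facts.join(dims, "dim_id")
joined.count()    # run it: AQE replans from real shuffle statistics
joined.explain()  # plan is wrapped in AdaptiveSparkPlan; after the run,
                  # isFinalPlan=true and the runtime-chosen join is shown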
One-Line Config = 2–10× Speedup
# Spark 3.5+ (on by default since Spark 3.2 on Databricks, EMR, GCP, Azure)
spark.conf.set("spark.sql.adaptive.enabled", "true")                        # Main switch
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")     # Shrink after filter
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")               # Auto-split hot keys
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")  # Target partition size

# Optional: more aggressive coalescing (common on Databricks)
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", 1000)
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "1MB")
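Quick sanity check that the settings took effect in the current session:

print(spark.conf.get("spark.sql.adaptive.enabled"))           # 'true'
print(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))  # 'true'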
Before vs After AQE (Real World)
2019 (No AQE)
- Manual broadcast hints
- repartition() guesswork
- Skew → job dies
- Tuning hell
2025 (AQE On)
- Zero hints needed
- Auto-optimizes at runtime
- Skew just works
- Write clean SQL → win
Spark SQL Pro Tips (2025)
- Use ANSI mode (spark.sql.ansi.enabled, Spark 3.4+) → stricter semantics, clearer errors
- CTAS + INSERT OVERWRITE = atomic table swaps
- CACHE LAZY TABLE → materialized only when first used
- Use Delta Lake → OPTIMIZE + ZORDER BY for data layout
- Avoid SELECT * → Catalyst can’t prune
- Use spark.sql() for complex logic
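A short sketch combining the tips above (the sales table and its columns are hypothetical):

spark.conf.set("spark.sql.ansi.enabled", "true")  # ANSI mode: strict casts, clearer errors

spark.sql("CACHE LAZY TABLE sales")               # materialized on first use, not now

top = spark.sql("""
    SELECT customer_id, SUM(amount) AS total      -- explicit columns, so Catalyst can prune
    FROM sales
    WHERE sale_date >= '2025-01-01'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top.show()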
Final Answer:
AQE = ON
Everything else is optional
spark.sql.adaptive.enabled = true