Spark SQL + AQE TL;DR
Back to Apache Spark TL;DR Hub

Spark SQL + AQE

/tldr: Catalyst + AQE = 2–10× faster queries with zero tuning

Catalyst AQE Dynamic Optimization Runtime Magic

2025 Reality Check

Just turn AQE on.
Everything else is legacy.

Spark 3.0+ with AQE enabled → 80% of manual tuning is obsolete.

Catalyst Optimizer (Static Phase)

Predicate Pushdown

Filters before joins/reads

Column Pruning

Only read needed columns

Join Reordering

Smallest table first

Constant Folding

Pre-compute literals

Adaptive Query Execution (AQE) – Runtime Superpowers

Dynamic Join Strategy

Converts SortMerge → Broadcast at runtime if data shrinks

Skew Handling

Splits hot partitions automatically

Coalesce Partitions

Reduces shuffle partitions after filter

One-Line Config = 3–10× Speedup

# Spark 3.5+ (Databricks, EMR, GCP, Azure – all have this on by default)
spark.conf.set("spark.sql.adaptive.enabled", true)                    # Main switch
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", true) # Shrink after filter
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", true)           # Auto-split hot keys
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128mb") # Target size

# Optional: More aggressive (Databricks loves this)
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", 1000)
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "1mb")

Before vs After AQE (Real World)

2019 (No AQE)

  • Manual broadcast hints
  • Repartition() guessing
  • Skew → job dies
  • Tuning hell

2025 (AQE On)

  • Zero hints needed
  • Auto-optimizes at runtime
  • Skew just works
  • Write clean SQL → win

Spark SQL Pro Tips (2025)

Use ANSI SQL (Spark 3.4+) → better errors
CTAS + INSERT OVERWRITE = atomic tables
CACHE LAZY → only when used
Use Delta Lake → Z-Order + OPTIMIZE
Avoid SELECT * → Catalyst can’t prune
Use spark.sql() for complex logic

Final Answer:

AQE = ON
Everything else is optional

spark.sql.adaptive.enabled = true

Spark 3.5+ • AQE enabled by default on Databricks, EMR, GCP Dataproc, Azure Synapse • 2025 standard