Spark Cluster Sizing & Tuning TL;DR

Cluster Sizing & Tuning

/tldr: Get a 2–5× cost reduction with 10 minutes of math

Executors • Cores • Memory • Autoscaling

The Golden Sizing Rule (2025)

5 cores + 20–40 GB RAM per executor → works 95% of the time

Why? HDFS/Parquet blocks are 128–256 MB → 5 concurrent tasks per JVM keeps I/O throughput high → the heap stays small enough for healthy GC and no OOM

Executor Sizing Cheat Sheet

Node Type                          vCPU   RAM      Executors/Node   Cores/Executor   Memory/Executor
r6i.4xlarge / r7g.4xlarge          16     128 GB   3                5                38 GB
r6i.8xlarge                        32     256 GB   6                5                40 GB
i4i.16xlarge (storage-optimized)   64     512 GB   12               5                40 GB
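
Prefer to redo the math for a node type that isn't listed? A minimal Python sketch of the same arithmetic (the function name and the 1 vCPU / 1 GB OS reservation are illustrative assumptions; the cheat-sheet rows keep roughly 90% of each executor's RAM share as heap, while the 75% rule in the next section is more conservative):

def size_executors(node_vcpu: int, node_ram_gb: float,
                   cores_per_executor: int = 5,
                   heap_fraction: float = 0.75):
    """Back-of-the-envelope executor sizing from the golden rule.

    Reserves 1 vCPU and 1 GB RAM for the OS/daemons, packs 5-core executors,
    and splits each executor's RAM share into heap (heap_fraction) and
    memoryOverhead (the remainder).
    """
    executors_per_node = (node_vcpu - 1) // cores_per_executor
    ram_per_executor = (node_ram_gb - 1) / executors_per_node
    heap_gb = int(ram_per_executor * heap_fraction)
    overhead_gb = int(ram_per_executor - heap_gb)
    return executors_per_node, heap_gb, overhead_gb

# r6i.4xlarge (16 vCPU / 128 GB): 3 executors x 5 cores
print(size_executors(16, 128))                       # (3, 31, 11)  conservative 75% heap
print(size_executors(16, 128, heap_fraction=0.9))    # (3, 38, 4)   matches the row above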

Memory Breakdown per Executor

Execution Memory – shuffle, joins, aggregations (shares the unified pool: spark.memory.fraction, default 0.6 of heap)
Storage Memory – caching/persist (same unified pool, borrowed back and forth with execution)
User Memory – UDFs, your own objects (the remaining ~40% of heap; worked split below)
Total heap ≈ 75% of container RAM (leave the rest for memoryOverhead and the OS)
--executor-memory = round_down(0.75 × container_RAM)
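
To see where a given heap actually goes, here is a sketch of the unified-memory split using Spark's defaults (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, 300 MB reserved); the function name is an illustrative assumption:

RESERVED_MB = 300  # fixed reservation Spark keeps for internal objects

def memory_regions_gb(executor_memory_gb: float,
                      memory_fraction: float = 0.6,      # spark.memory.fraction default
                      storage_fraction: float = 0.5):    # spark.memory.storageFraction default
    """Split the executor heap into Spark's unified-memory regions."""
    usable = executor_memory_gb - RESERVED_MB / 1024
    unified = usable * memory_fraction               # execution + storage, shared and elastic
    storage_protected = unified * storage_fraction   # cached blocks protected from eviction by execution
    user = usable * (1 - memory_fraction)            # UDF objects, user data structures
    return {"unified_gb": round(unified, 1),
            "storage_protected_gb": round(storage_protected, 1),
            "user_gb": round(user, 1)}

# For the 19 GB executors in the config below:
print(memory_regions_gb(19))   # {'unified_gb': 11.2, 'storage_protected_gb': 5.6, 'user_gb': 7.5}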

Battle-Tested Configs (2024–2025)

Databricks / EMR (Recommended)

spark.executor.cores            5
spark.executor.memory           19g      # 16-vCPU / 64 GB node → 3 executors
spark.executor.memoryOverhead   4g
spark.driver.cores              5
spark.driver.memory             10g
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled   true
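
The same settings can also be applied when building the session; a minimal PySpark sketch (executor sizing must be fixed before executors launch, so on Databricks/EMR the cluster- or job-level config above is usually the right home; the app name is a placeholder):

from pyspark.sql import SparkSession

# Mirrors the recommended config above; has effect only if set before the
# executors start (e.g. when this builder creates the session).
spark = (
    SparkSession.builder
    .appName("sized-job")                                # placeholder name
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "19g")
    .config("spark.executor.memoryOverhead", "4g")
    .config("spark.driver.cores", "5")
    .config("spark.driver.memory", "10g")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)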

Kubernetes / Self-managed

--executor-cores 5
--executor-memory 30g
--conf spark.executor.memoryOverhead=6g
--conf spark.kubernetes.memoryOverheadFactor=0.2
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.shuffleTracking.enabled=true   # K8s has no external shuffle service
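
Worth sanity-checking how those flags translate into pod size: on Kubernetes the executor pod's memory request is executor memory plus overhead, and an explicit spark.executor.memoryOverhead takes precedence over the factor-based default. A small sketch (function name is illustrative; off-heap and PySpark memory are not modeled):

from typing import Optional

MIN_OVERHEAD_GIB = 384 / 1024   # Spark's 384 MiB floor for computed overhead

def pod_memory_request_gib(executor_memory_gib: float,
                           explicit_overhead_gib: Optional[float] = None,
                           overhead_factor: float = 0.1) -> float:
    """Approximate memory request of one executor pod on Kubernetes."""
    overhead = explicit_overhead_gib
    if overhead is None:
        # No explicit overhead: Spark uses max(factor * executor memory, 384 MiB)
        overhead = max(overhead_factor * executor_memory_gib, MIN_OVERHEAD_GIB)
    return executor_memory_gib + overhead

# With the config above: 30 GiB heap + 6 GiB explicit overhead -> 36 GiB pod
# request, so each node needs at least that much allocatable memory per executor.
print(pod_memory_request_gib(30, 6))   # 36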

Autoscaling That Actually Works

Min executors: 2–5 (keep warm)
Max executors: data_size_in_TB × 15–25 (sketch below)
Scale-up threshold: 75% executor busy
Scale-down after 5–10 min idle
Enable spark.dynamicAllocation.shuffleTracking.enabled=true
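
A tiny sketch of the min/max bounds from the checklist above (the function name, the ×20 default multiplier, and the 3-executor floor are illustrative picks from the stated ranges):

def autoscaling_bounds(data_tb: float, multiplier: float = 20, min_executors: int = 3):
    """Rule-of-thumb dynamic-allocation bounds: 15-25 executors per TB of input,
    with a few executors kept warm so short jobs skip the spin-up cost."""
    max_executors = max(min_executors, round(data_tb * multiplier))
    return min_executors, max_executors

lo, hi = autoscaling_bounds(2.5)   # 2.5 TB of input data
print(lo, hi)                      # 3 50
# -> spark.dynamicAllocation.minExecutors=3, spark.dynamicAllocation.maxExecutors=50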

Avoid These (Cost You Millions)

  • 1 core per executor → insane GC pressure
  • >8 cores per executor → HDFS throughput drops
  • No memoryOverhead → containers killed by YARN/K8s for exceeding memory limits
  • Static clusters for bursty workloads
  • Forgetting spark.shuffle.service.enabled=true with dynamic allocation

5 cores • 20–40 GB RAM • dynamic allocation + shuffle service = happy cluster

Updated for Spark 3.5+ • Works on Databricks, EMR, Kubernetes, GCP, Azure • 2025 edition