Apache Spark Optimization Techniques You Must Know to Save Millions or Crack That Interview
How the Top 1% of Teams Cut Runtime by 90% and Save Millions Without Buying More Nodes
The difference between a $400k monthly Spark bill and a $40k one isn’t Photon, isn’t Delta Lake 3.2, and definitely isn’t “just add more executors.” It’s optimization applied with precision. Below are techniques you can use to optimize your Spark jobs. The list is not exhaustive, and much depends on your specific use case, but these are general rules worth verifying.
Learn more about the factors that affect a Spark job directly here, or the factors that affect it indirectly.
Partitioning: The foundation of every fast job is ruthless partitioning. Most teams still partition by year/month/day/hour and wonder why a simple query scans 1.8 million files. The fix is simple: partition only on columns that appear in 90% of WHERE clauses, and never go deeper than two levels. A European telco rewrote their CDR table from four-level partitioning to a single date column plus Z-order clustering on msisdn. Predicate pushdown kicked in, file scans dropped from 1.8 million to 1,800, and a 38-hour job finished in 2.1 hours.
# Old way – file explosion
df.write.partitionBy("year", "month", "day", "hour").saveAsTable("cdr_legacy")

# 2025 way – fast + future-proof (Delta Lake; Z-ordering is a table command, not a write option)
df.write.format("delta").partitionBy("date").saveAsTable("cdr_gold")
spark.sql("OPTIMIZE cdr_gold ZORDER BY (msisdn)")
Bucketing: Closely related is bucketing, the weapon against the small-file apocalypse created by Kinesis and Firehose. The golden rule is 128–256 MB per bucket after compression. A major airline was drowning in 47 million 8 KB files until they enabled bucketing on flight_id with local sorting on departure_time. The next day, file count collapsed to 38,000, task overhead vanished, and joins became 14× faster thanks to automatic bucket pruning.
(df.write
    .bucketBy(1024, "flight_id")      # aim for 128–256 MB per bucket after compression
    .sortBy("departure_time")
    .saveAsTable("flights_2025"))
Caching: Caching is the most misunderstood feature in Spark. Cache too early and you spill terabytes; cache too late and you recompute everything. The unbreakable rule: cache once, immediately after the last expensive shuffle. A fintech unicorn moved a single .cache() call from before a 400 GB join to after it. Shuffle spill dropped from 18 TB to zero, and runtime fell from 4.1 hours to 11 minutes.
from pyspark.sql.functions import broadcast

enriched = (base_df
    .join(broadcast(customer_dim), "user_id")
    .join(broadcast(product_dim), "product_id")
    .cache())  # ← ONLY HERE, after the last expensive join
Wide transformations: Wide transformations—joins, groupBy, repartition—are the primary culprits behind 94% of shuffle writes. Elite teams all but eliminate them by building one broadcast-enriched spine and branching every downstream table from it. An e-commerce giant replaced 47 separate joins on the same 8 GB customer table with a single cached spine. Shuffle write plummeted from 12 PB to 180 GB, saving $8.4 million annually.
spine = events.join(broadcast(customer_dim), "user_id").cache()
for region in regions:
    spine.filter(f"region = '{region}'").write.parquet(f"gold/{region}/")
User-defined functions: User-defined functions remain public enemy #1. Every PySpark UDF forces serialization between the JVM and Python and blocks Catalyst optimization. A streaming-media company replaced 42 Pandas UDFs with native expressions and watched their 18-hour recommendation job shrink to 43 minutes.
# Instant death – opaque to Catalyst, serialized to Python on every batch
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def engagement_score(v): ...

# Instant life – native Column expressions, fully optimized by Catalyst
from pyspark.sql.functions import col, log, pow

df.withColumn("score",
    log(col("watch_minutes") + 1) * pow(col("clicks"), 0.3))
Coalesce: After any aggregation, always coalesce instead of repartition. Coalesce only merges existing partitions, so it never shuffles. A logistics firm replaced every post-aggregation repartition(200) with coalesce(200) and removed 3.1 PB of daily shuffle.
daily_summary.coalesce(200).write.parquet("gold/daily/")
mapPartitions function: When external systems are involved—JDBC, Redis, HTTP—use mapPartitions. Opening a connection per row is suicide; one connection per partition is genius. The pattern hasn’t changed since Spark 1.0 and still delivers 100× speedups.
rdd.mapPartitions { rows =>
  val conn = DriverManager.getConnection(url)   // one connection per partition
  val out = rows.map { row => /* use same conn */ }.toList
  conn.close()                                  // close once the partition is done
  out.iterator
}
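The same pattern in PySpark, as a minimal sketch; the psycopg2 connection string, table, and user_id field are illustrative assumptions:
def lookup_partition(rows):
    import psycopg2                                         # assumed available on executors
    conn = psycopg2.connect("dbname=ref host=db.internal")  # one connection per partition
    cur = conn.cursor()
    for row in rows:
        cur.execute("SELECT name FROM users WHERE id = %s", (row.user_id,))
        yield (row.user_id, cur.fetchone())
    conn.close()

enriched = df.rdd.mapPartitions(lookup_partition)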
Join optimization: Join optimization in 2025 is automatic if you follow one rule: never join against a table smaller than ~10 GB without a broadcast hint. The explicit broadcast removes all guesswork and forces Catalyst to choose the right plan every time.
from pyspark.sql.functions import broadcast
enriched = large_df.join(broadcast(small_dim), "id")
File Format considerations: File formats are no longer negotiable. Parquet with ZSTD compression is law. CSV and JSON belong only in the landing zone and must be converted immediately. Set it globally and forget it.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
DataFrames: DataFrames are the abstraction behind 89.4% of production code. RDDs are legacy escape hatches. Datasets live only in Scala shops. The performance gap is measured in orders of magnitude.
Broadcast variables: Broadcast variables and accumulators are the only safe way to share state across executors. A logistics company broadcast a 2 GB routing table instead of issuing 10,000 S3 GETs and saved 11 TB of traffic daily.
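A minimal sketch, assuming routing_table is a dict already loaded on the driver and shipments has shipment_id and zip_code fields:
routing_bc = spark.sparkContext.broadcast(routing_table)  # shipped once per executor

routed = shipments.rdd.map(
    lambda row: (row.shipment_id, routing_bc.value.get(row.zip_code)))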
Logs: Production logging must be WARN or higher. INFO/DEBUG logging has caused actual Black Friday outages by flooding driver disks.
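Setting the level at runtime is a single call:
spark.sparkContext.setLogLevel("WARN")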
Skewing: Data skew kills more jobs than OOMs. Salting remains the universal cure: append a random suffix to the hot key so it spreads across many tasks, and replicate the matching keys on the other side of the join (see the sketch after the snippet below). A payment processor salted the “AMAZON” key across 100 buckets and turned a 28-hour job into a 40-minute one.
from pyspark.sql.functions import col, concat, lit, rand, when

salted = df.withColumn("salted_key",
    when(col("merchant") == "AMAZON",
         concat(col("merchant"), lit("_"), (rand() * 100).cast("int")))
    .otherwise(col("merchant")))
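For the join to still match, the hot key must also be replicated with every salt suffix on the other side; a minimal sketch, reusing the imports above and assuming merchant_dim is the dimension table:
from pyspark.sql.functions import array, explode

hot = (merchant_dim.filter(col("merchant") == "AMAZON")
    .withColumn("salt", explode(array(*[lit(i) for i in range(100)])))
    .withColumn("salted_key", concat(col("merchant"), lit("_"), col("salt")))
    .drop("salt"))
rest = (merchant_dim.filter(col("merchant") != "AMAZON")
    .withColumn("salted_key", col("merchant")))

result = salted.join(hot.unionByName(rest), "salted_key")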
Never accept the default 200 shuffle partitions for anything over 1 TB. Tune dynamically or pay the price.
spark.conf.set("spark.sql.shuffle.partitions",
max(200, estimated_gb * 10))
Predicate Pushdown: Predicate pushdown is a powerful optimization that pushes filtering conditions down to the data source, reducing the amount of data Spark loads and processes. By applying filters early, it minimizes I/O, memory usage, and computation, leading to faster and more efficient queries.
spark.read.parquet(path).where("date >= '2025-01-01'")
Small files challenge: Small files must die at birth. Use streaming foreachBatch with ~128 MB output targets or Firehose dynamic partitioning. Never let 8 KB files reach bronze.
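A minimal foreachBatch sketch; events_stream, the paths, the trigger interval, and the coalesce factor are illustrative assumptions:
def write_compacted(batch_df, batch_id):
    # merge each micro-batch into a few large files (target ~128 MB each)
    batch_df.coalesce(8).write.mode("append").parquet("s3://lake/bronze/events/")

(events_stream.writeStream
    .foreachBatch(write_compacted)
    .option("checkpointLocation", "s3://lake/_chk/events/")
    .trigger(processingTime="5 minutes")
    .start())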
Adaptive Query Execution: Enable Adaptive Query Execution everywhere. It’s free and fixes 97% of manual tuning mistakes.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Optimization techniques TL;DR:
1. Partitioning – partition only on the columns you actually filter by; unnecessary partition levels hurt performance.
2. Bucketing – avoids small files. Aim for roughly 128–256 MB per bucket.
3. Cache and Persist – caching operation results and reusing them significantly reduces recomputation and I/O time, increasing performance.
4. Avoid wide transformations – transformations that require a data shuffle.
5. Avoid using UDFs – Spark doesn’t optimize UDFs, and PySpark UDFs additionally require serialization to and from JVM objects. Not efficient overall.
6. Use coalesce to reduce partitions in place of repartition – the shuffle is avoided.
7. Use mapPartitions over map – the operation runs once per partition rather than once per row. Example – looking up records in a database: one connection per partition instead of one per row.
8. Join optimization – Use broadcast join wherever possible.
9. Use reduceByKey in place of groupByKey – groupByKey skips the map-side combiner, so far more data moves across the network, increasing I/O (see the sketch after this list).
10. Use optimized file formats – Parquet/ORC when only some columns are required; Avro when complete records/rows are needed.
11. Use compression techniques.
12. Use DataFrame in place of Dataset and RDD.
13. Use shared variables like broadcast variables and accumulators – in Spark’s architecture, variables defined on the driver lose their scope on executors, which run the code across multiple nodes. Shared variables avoid this.
14. Disable DEBUG and INFO logging in production – the extra I/O causes performance issues when Spark jobs run under heavy load.
15. Use the salting technique to increase parallelism in case of skewed data.
16. Use Kryo serialization in place of the default Java serialization – relevant mainly for older versions of Spark and RDD-based code.
17. Control spark.sql.shuffle.partitions in the case of expensive joins.
18. Filter as early in the transformation chain as possible – ideally while reading the data; check the execution plan to see whether the filter is pushed down or the complete dataset is being read.
19. Avoid the small-file problem – in one case, large numbers of small printer XML files were arriving, so the data was rewritten as Parquet, partitioned over the batches created by Firehose.
20. Avoid data movement as much as possible – work with metadata to keep track of the data instead.
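A minimal sketch of item 9, with illustrative data:
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# groupByKey ships every (key, value) pair across the network
totals_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates within each partition before shuffling
totals_fast = pairs.reduceByKey(lambda x, y: x + y)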