Partitioning & Repartitioning | Spark Practical Scenarios

Partitioning & Data Distribution

Optimizing parallelism and storage layouts to eliminate bottlenecks and minimize data shuffle.

In-memory partitioning determines how many parallel tasks Spark runs. repartition() performs a full shuffle and can either increase or decrease the partition count, while coalesce() avoids a full shuffle by merging existing partitions and can only reduce the count.

# Increasing partitions for higher parallelism (Full Shuffle)
df_boosted = df.repartition(100)

# Reducing partitions for final output (No Full Shuffle)
df_final = df_boosted.coalesce(5)

# Repartitioning by a specific key to colocate data
df_by_user = df.repartition("user_id")
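
To see why repartitioning by a key colocates data, here is a simplified model of hash partitioning in plain Python. Spark itself uses a Murmur3 hash rather than Python's built-in hash, and the 4-partition setup and sample rows below are invented for illustration; the point is only that every row with the same key maps to the same partition.

```python
# Simplified model of hash partitioning. Spark uses Murmur3, not
# Python's hash(); this only illustrates how keys get colocated.
def assign_partition(key, num_partitions):
    # Rows with the same key always land in the same partition.
    return hash(key) % num_partitions

rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]
partitions = {}
for user_id, value in rows:
    p = assign_partition(user_id, 4)
    partitions.setdefault(p, []).append((user_id, value))

# All "alice" rows share exactly one partition.
alice_parts = {assign_partition(u, 4) for u, _ in rows if u == "alice"}
print(len(alice_parts))  # 1
```

This colocation is what makes subsequent per-key operations (joins, aggregations on user_id) cheaper: the data for each key is already in one place.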

For data that is naturally ordered (like timestamps or IDs), repartitionByRange() ensures that rows with similar values are placed in the same partition, which is highly beneficial for range-based queries.

# Partitioning by transaction_date to optimize time-series queries
df_ranged = df.repartitionByRange(10, "transaction_date")
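
Conceptually, range partitioning picks boundary values for the sort key and routes each row to the partition whose range contains it. The sketch below is a toy model: Spark derives the boundaries by sampling the data, whereas here they are hard-coded quarter starts purely for illustration.

```python
import bisect

# Toy model of range partitioning: fixed boundaries, binary search to
# route each row. Spark samples the data to choose boundaries itself.
boundaries = ["2024-04-01", "2024-07-01", "2024-10-01"]  # 4 partitions

def range_partition(key):
    # Partition 0 holds keys below the first boundary, and so on.
    return bisect.bisect_right(boundaries, key)

dates = ["2024-01-15", "2024-05-02", "2024-08-20", "2024-12-31"]
print([range_partition(d) for d in dates])  # [0, 1, 2, 3]
```

Because each partition holds a contiguous value range, a query like `transaction_date BETWEEN x AND y` only has to touch the few partitions whose ranges overlap the filter.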

When writing data to a data lake (S3/ADLS), partitionBy() creates a Hive-style folder structure, one directory per distinct value of each partition column. This allows Spark to perform "partition pruning": skipping entire directories of data during a read when a query filters on those columns.

# Saving data partitioned by year and month
df.write.partitionBy("year", "month") \
  .format("parquet") \
  .save("s3://analytics-bucket/sales_history/")
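
The resulting directory layout follows the Hive convention `column=value`. The small helper below is hypothetical, just to show what paths the write above produces for a given row's year and month:

```python
# partitionBy("year", "month") writes each row group under a Hive-style
# path: <base>/year=<value>/month=<value>/part-....parquet
# `partition_path` is a hypothetical helper for illustration only.
def partition_path(base, **cols):
    segments = [f"{k}={v}" for k, v in cols.items()]
    return "/".join([base.rstrip("/")] + segments)

path = partition_path("s3://analytics-bucket/sales_history", year=2024, month=3)
print(path)  # s3://analytics-bucket/sales_history/year=2024/month=3
```

A read with `WHERE year = 2024 AND month = 3` can then list and scan only that one directory instead of the whole table.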
Q: When should you prefer coalesce over repartition?
A: Use coalesce when you only need to reduce the number of partitions (e.g., from 1000 to 100) before writing to disk. It avoids a full shuffle by merging existing partitions locally, making it significantly faster than repartition.
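
A simplified model of that merge, assuming only the grouping logic (Spark's real algorithm also considers executor locality): consecutive source partitions are folded into the same target group, so no individual row is redistributed.

```python
# Simplified model of coalesce(n): existing partitions are merged into
# n groups; rows stay inside their original partition's group.
def coalesce_groups(num_partitions, target):
    groups = [[] for _ in range(target)]
    for p in range(num_partitions):
        # Consecutive source partitions map onto the same target group.
        groups[p * target // num_partitions].append(p)
    return groups

groups = coalesce_groups(1000, 100)
print(len(groups), len(groups[0]))  # 100 10
```

Because each target group is just a concatenation of existing partitions, no data crosses the network, which is exactly why coalesce is cheaper than a full repartition.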
Q: What is the "Small File Problem" in partitioning?
A: If you use partitionBy on a column with too many unique values (high cardinality), Spark will create thousands of tiny files. This overwhelms the file system metadata and severely degrades read performance. Aim for files around 128MB to 1GB.
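
A back-of-the-envelope check makes the danger concrete. In the worst case, every writing task can emit one file per partition value it sees, so the file count scales with tasks times distinct values; the numbers below (200 tasks, 10,000 values, 50 GB of data) are invented for illustration.

```python
# Worst-case output file count with partitionBy: each writing task can
# emit one file per distinct partition value it encounters.
def worst_case_files(num_tasks, distinct_partition_values):
    return num_tasks * distinct_partition_values

def target_file_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    # Aim for roughly 128 MB to 1 GB per file, per the guidance above.
    return max(1, total_bytes // target_file_bytes)

print(worst_case_files(200, 10_000))   # 2000000 potential tiny files
print(target_file_count(50 * 1024**3)) # 400 files for 50 GB at 128 MB each
```

Repartitioning by the partition columns before the write (so each partition value is handled by few tasks) is one common way to pull the actual count down toward the target.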
Q: How do you determine the ideal number of partitions?
A: A good rule of thumb is 2x to 4x the number of CPU cores in your cluster. This keeps executors busy without the overhead of managing too many tiny tasks.
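
The rule of thumb reduces to simple arithmetic. The helper below is hypothetical, and the 10-executor, 4-core cluster is an assumed example:

```python
# Rule-of-thumb sizing: 2x-4x the total cores in the cluster.
# `suggested_partitions` is a hypothetical helper for illustration.
def suggested_partitions(executors, cores_per_executor, factor=3):
    total_cores = executors * cores_per_executor
    return total_cores * factor

print(suggested_partitions(10, 4))  # 120 partitions for 40 cores
```

In a live session the core count can be read from the cluster configuration, and the shuffle-side default is governed by the `spark.sql.shuffle.partitions` setting (200 by default), which is often worth aligning with this estimate.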