Pivoting & Reshaping | Spark Practical Scenarios

Pivoting & Data Reshaping

Transposing row data into columns to create cross-tab reports and multi-dimensional summaries.

Pivoting in Spark is a chained aggregation: groupBy the columns you want to keep as rows, pivot the column whose values will become the new headers, then supply an aggregate (such as F.sum) to fill the cells.

from pyspark.sql import functions as F

# Transposing 'Year' values into columns to compare Revenue
pivot_df = df.groupBy("Product") \
             .pivot("Year") \
             .agg(F.sum("Revenue"))

# result: [Product, 2023, 2024, 2025]
    
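To see what the pivot actually produces, the same reshaping can be sketched in plain Python on a handful of rows. The Product/Year/Revenue values below are made up for illustration:

```python
from collections import defaultdict

# Toy rows standing in for df: (Product, Year, Revenue)
rows = [
    ("Widget", 2023, 100), ("Widget", 2024, 150),
    ("Gadget", 2023, 80),  ("Gadget", 2025, 120),
]

# Emulate groupBy("Product").pivot("Year").agg(sum("Revenue")) with dicts
pivoted = defaultdict(dict)
for product, year, revenue in rows:
    pivoted[product][year] = pivoted[product].get(year, 0) + revenue

# Each Product becomes one row; each Year becomes one column.
# Missing (Product, Year) combinations stay absent -> null in Spark.
print(dict(pivoted))
```

Note how Gadget has no 2024 entry: that missing cell is exactly the null Spark inserts into the pivoted report.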

When you call pivot("column"), Spark has to perform an extra pass over the data to discover all unique values. You can significantly speed this up by providing the list of values explicitly.

# Faster execution by specifying the pivot categories
years = ["2023", "2024", "2025"]
optimized_pivot = df.groupBy("Product") \
                    .pivot("Year", years) \
                    .agg(F.sum("Revenue"))
    

To reverse a pivot, moving columns back into rows, Spark uses the stack() SQL expression inside expr(). This is often necessary when normalizing wide datasets for machine learning.

# Converting column headers back into a 'Year' column
unpivoted_df = pivot_df.select(
    "Product",
    F.expr("stack(3, '2023', `2023`, '2024', `2024`, '2025', `2025`) as (Year, Revenue)")
)
    
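The semantics of stack(n, k1, v1, ..., kn, vn) are simple: it emits n rows per input row, pairing each label with its value. A plain-Python sketch (the stack helper and wide_row data here are illustrative, not Spark's implementation):

```python
def stack(n, *pairs):
    """Plain-Python sketch of Spark SQL's stack(): emits n (key, value) rows."""
    return [(pairs[2 * i], pairs[2 * i + 1]) for i in range(n)]

# One pivoted row: Product=Widget, 2023=100, 2024=150, 2025=None
wide_row = {"Product": "Widget", "2023": 100, "2024": 150, "2025": None}

long_rows = [
    (wide_row["Product"], year, revenue)
    for year, revenue in stack(3, "2023", wide_row["2023"],
                                  "2024", wide_row["2024"],
                                  "2025", wide_row["2025"])
]
print(long_rows)
```

One wide row becomes three long rows, and a null cell survives the unpivot as a null Revenue, which you may want to filter out afterwards.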
Q: Why is pivoting considered a "Wide Transformation"?
A: Pivoting requires a shuffle. Data must be redistributed across the cluster so that all records belonging to a group (e.g., a specific Product) end up on the same executor to be aggregated into the new column structure.
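The redistribution step can be sketched in plain Python: rows are hash-partitioned by the groupBy key, so every record for a given Product lands in the same partition. The row data and partition count below are made up for illustration:

```python
# Sketch of the shuffle: hash-partition rows by the groupBy key ("Product")
NUM_PARTITIONS = 4
rows = [("Widget", 2023, 100), ("Gadget", 2023, 80), ("Widget", 2024, 150)]

partitions = {i: [] for i in range(NUM_PARTITIONS)}
for row in rows:
    partitions[hash(row[0]) % NUM_PARTITIONS].append(row)

# All "Widget" rows share one partition, so a single task can
# aggregate them into that product's pivoted output row.
```

Because every key must travel to its target partition, the shuffle moves data across the network, which is what makes pivot (like groupBy and join) a wide transformation.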
Q: What is the limit of the pivot function?
A: By default, Spark limits pivots to 10,000 distinct values to prevent memory issues. This can be adjusted via spark.sql.pivotMaxValues, but pivoting high-cardinality columns (like UserID) is generally bad practice.
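If you genuinely need more distinct values, the cap can be raised on the session before running the pivot. A minimal config fragment, assuming an existing SparkSession bound to the name spark:

```python
# Assumes an existing SparkSession bound to `spark`.
# Raise the 10,000 distinct-value cap; the trade-off is a wider
# output schema and higher driver/executor memory pressure.
spark.conf.set("spark.sql.pivotMaxValues", "50000")
```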
Q: How do you handle nulls that appear after a pivot?
A: If a group doesn't have a value for one of the pivoted columns, Spark inserts a null. You can chain .na.fill(0) after the aggregation to clean up the report.
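The effect of .na.fill(0) can be sketched in plain Python on one pivoted row; the row data here is made up for illustration:

```python
# One pivoted row where Gadget had no 2024 revenue (null after pivot)
years = ["2023", "2024", "2025"]
pivoted_row = {"Product": "Gadget", "2023": 80, "2024": None, "2025": 120}

# Emulate .na.fill(0): replace None with 0, but only in the year columns
filled = {
    col: (0 if val is None and col in years else val)
    for col, val in pivoted_row.items()
}
print(filled)  # {'Product': 'Gadget', '2023': 80, '2024': 0, '2025': 120}
```

In real PySpark you would simply chain .na.fill(0) (optionally with a subset of column names) onto the pivoted DataFrame so only the numeric report columns are filled.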