Analytical Window Functions
Performing complex rankings and running aggregations across partitioned slices of data.

1. Defining the Window Spec
A window function requires a WindowSpec, which defines how to partition rows (partitionBy) and how to order them within each partition (orderBy).
from pyspark.sql.window import Window
from pyspark.sql import functions as F
# Creating a window partitioned by 'department' and ordered by 'salary'
window_spec = Window.partitionBy("department").orderBy(F.col("salary").desc())
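Outside Spark, the partition-then-sort behavior of this spec can be sketched in plain Python (the sample rows below are hypothetical, not from any dataset in this scenario):

```python
from itertools import groupby

# Hypothetical sample rows -- assumed for illustration only
rows = [
    {"department": "eng", "name": "a", "salary": 120},
    {"department": "eng", "name": "b", "salary": 150},
    {"department": "hr", "name": "c", "salary": 90},
]

# groupby needs the input sorted by the grouping key first
rows.sort(key=lambda r: r["department"])

# Mirror Window.partitionBy("department").orderBy(F.col("salary").desc()):
# one group per department, sorted by salary descending within it
partitions = {
    dept: sorted(group, key=lambda r: r["salary"], reverse=True)
    for dept, group in groupby(rows, key=lambda r: r["department"])
}
# partitions["eng"] -> row b (150) first, then a (120)
```

Each window function then walks the rows of one such partition in order, which is exactly the per-partition work Spark distributes across executors.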
2. Ranking and Row Numbering
Ranking functions allow you to identify the top items in a category. row_number() assigns a unique sequential number to every row, while rank() gives tied rows the same rank and skips the subsequent numbers.
# Identifying the top 3 highest earners in each department
# (with rank(), ties can return more than 3 rows; use row_number() for exactly 3)
df_ranked = df.withColumn("rank", F.rank().over(window_spec)) \
    .filter(F.col("rank") <= 3)
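The same top-3 filter can be sketched in plain Python over hypothetical rows; it also shows why rank() can keep more than three rows per department when salaries tie:

```python
from collections import defaultdict

# Hypothetical rows -- (department, name, salary), assumed for illustration
rows = [
    ("eng", "a", 150), ("eng", "b", 140), ("eng", "c", 120),
    ("eng", "d", 120), ("eng", "e", 110), ("hr", "f", 90),
]

by_dept = defaultdict(list)
for dept, name, salary in rows:
    by_dept[dept].append((name, salary))

top3 = {}
for dept, people in by_dept.items():
    people.sort(key=lambda p: p[1], reverse=True)
    # rank() semantics: tied salaries share a rank; the rank after a tie skips ahead
    ranked, prev_salary = [], None
    for i, (name, salary) in enumerate(people, start=1):
        r = ranked[-1][1] if salary == prev_salary else i
        ranked.append((name, r))
        prev_salary = salary
    # rank <= 3 keeps the tied pair c/d, so "top 3" yields four rows here
    top3[dept] = [name for name, r in ranked if r <= 3]
```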
3. Sliding Windows & Running Totals
By adding a rowsBetween clause (frame defined by row counts) or a rangeBetween clause (frame defined by the values of the ordering column), you can calculate cumulative sums or moving averages.
# Calculating a 7-row moving average of sales
# (rowsBetween counts rows, not days -- this equals 7 days only if there is exactly one row per date)
moving_window = Window.orderBy("date").rowsBetween(-6, Window.currentRow)
df_moving_avg = df.withColumn("7_day_avg", F.avg("sales").over(moving_window))
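What rowsBetween(-6, Window.currentRow) computes can be sketched in plain Python as a sliding window over the current row and up to six preceding rows (the sales values below are hypothetical):

```python
def moving_avg(values, window=7):
    """Average of the current row and up to window-1 preceding rows,
    mirroring rowsBetween(-(window - 1), Window.currentRow)."""
    out = []
    for i in range(len(values)):
        # Early rows have a shorter frame, just as in Spark
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

sales = [10, 20, 30, 40]
# moving_avg(sales, window=3) -> [10.0, 15.0, 20.0, 30.0]
```

Note that the first few frames are smaller than the full window, so their averages are taken over fewer rows.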
Interview Q&A
Q: What is the difference between rank() and dense_rank()?
Both assign the same rank to tied rows, but rank() leaves gaps in the sequence after a tie (1, 2, 2, 4), whereas dense_rank() does not leave gaps (1, 2, 2, 3).
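The gap behavior can be checked with a small pure-Python sketch over hypothetical scores:

```python
def rank_sequences(scores):
    """Return (rank, dense_rank) sequences for scores sorted descending."""
    scores = sorted(scores, reverse=True)
    ranks, dense = [], []
    for i, s in enumerate(scores, start=1):
        if i > 1 and s == scores[i - 2]:
            ranks.append(ranks[-1])    # tie: same rank as the previous row
            dense.append(dense[-1])
        else:
            ranks.append(i)            # rank() jumps to the row number, leaving a gap
            dense.append(dense[-1] + 1 if dense else 1)
    return ranks, dense

# rank_sequences([100, 90, 90, 80])
# rank()       -> [1, 2, 2, 4]
# dense_rank() -> [1, 2, 2, 3]
```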
Q: Are window functions Wide or Narrow transformations?
They are Wide Transformations. Spark must shuffle the data so that all records sharing the same partitionBy key land in the same partition before the window calculation can run.
Q: What is the danger of a Window without a partitionBy clause?
If you define a window using only orderBy, Spark is forced to move all data into a single partition on a single executor to perform the global sort. This will likely cause an OutOfMemory (OOM) error on large datasets.