Analytical Window Functions
Performing complex rankings and running aggregations across partitioned slices of data.

1. Defining the Window Spec
A window function requires a WindowSpec, which defines how to partition rows (partitionBy) and how to order them within each partition (orderBy).
from pyspark.sql.window import Window
from pyspark.sql import functions as F
# Creating a window partitioned by 'department' and ordered by 'salary'
window_spec = Window.partitionBy("department").orderBy(F.col("salary").desc())
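Outside Spark, the partition-then-sort behavior of this spec can be sketched in plain Python (the sample rows below are hypothetical, not from any dataset in this scenario):

```python
from itertools import groupby

# Hypothetical sample rows -- assumed for illustration only
rows = [
    {"department": "eng", "name": "a", "salary": 120},
    {"department": "eng", "name": "b", "salary": 150},
    {"department": "hr", "name": "c", "salary": 90},
]

# groupby needs the input sorted by the grouping key first
rows.sort(key=lambda r: r["department"])

# Mirror Window.partitionBy("department").orderBy(F.col("salary").desc()):
# one group per department, sorted by salary descending within it
partitions = {
    dept: sorted(group, key=lambda r: r["salary"], reverse=True)
    for dept, group in groupby(rows, key=lambda r: r["department"])
}
# partitions["eng"] -> row b (150) first, then a (120)
```

Each window function then walks the rows of one such partition in order, which is exactly the per-partition work Spark distributes across executors.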
2. Ranking and Row Numbering
Ranking functions allow you to identify the top items in a category. row_number() assigns a unique sequential number to every row, while rank() gives tied rows the same rank and skips the subsequent numbers.
# Identifying the top 3 highest earners in each department
# (with rank(), ties can return more than 3 rows; use row_number() for exactly 3)
df_ranked = df.withColumn("rank", F.rank().over(window_spec)) \
    .filter(F.col("rank") <= 3)
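The same top-3 filter can be sketched in plain Python over hypothetical rows; it also shows why rank() can keep more than three rows per department when salaries tie:

```python
from collections import defaultdict

# Hypothetical rows -- (department, name, salary), assumed for illustration
rows = [
    ("eng", "a", 150), ("eng", "b", 140), ("eng", "c", 120),
    ("eng", "d", 120), ("eng", "e", 110), ("hr", "f", 90),
]

by_dept = defaultdict(list)
for dept, name, salary in rows:
    by_dept[dept].append((name, salary))

top3 = {}
for dept, people in by_dept.items():
    people.sort(key=lambda p: p[1], reverse=True)
    # rank() semantics: tied salaries share a rank; the rank after a tie skips ahead
    ranked, prev_salary = [], None
    for i, (name, salary) in enumerate(people, start=1):
        r = ranked[-1][1] if salary == prev_salary else i
        ranked.append((name, r))
        prev_salary = salary
    # rank <= 3 keeps the tied pair c/d, so "top 3" yields four rows here
    top3[dept] = [name for name, r in ranked if r <= 3]
```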
3. Sliding Windows & Running Totals
By adding a rowsBetween clause (frame defined by row counts) or a rangeBetween clause (frame defined by the values of the ordering column), you can calculate cumulative sums or moving averages.
# Calculating a 7-row moving average of sales
# (rowsBetween counts rows, not days -- this equals 7 days only if there is exactly one row per date)
moving_window = Window.orderBy("date").rowsBetween(-6, Window.currentRow)
df_moving_avg = df.withColumn("7_day_avg", F.avg("sales").over(moving_window))
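What rowsBetween(-6, Window.currentRow) computes can be sketched in plain Python as a sliding window over the current row and up to six preceding rows (the sales values below are hypothetical):

```python
def moving_avg(values, window=7):
    """Average of the current row and up to window-1 preceding rows,
    mirroring rowsBetween(-(window - 1), Window.currentRow)."""
    out = []
    for i in range(len(values)):
        # Early rows have a shorter frame, just as in Spark
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

sales = [10, 20, 30, 40]
# moving_avg(sales, window=3) -> [10.0, 15.0, 20.0, 30.0]
```

Note that the first few frames are smaller than the full window, so their averages are taken over fewer rows.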
Interview Q&A
Q: What is the difference between rank() and dense_rank()?
Both assign the same rank to tied rows, but rank() leaves gaps in the sequence after a tie (1, 2, 2, 4), whereas dense_rank() does not leave gaps (1, 2, 2, 3).
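The gap behavior can be checked with a small pure-Python sketch over hypothetical scores:

```python
def rank_sequences(scores):
    """Return (rank, dense_rank) sequences for scores sorted descending."""
    scores = sorted(scores, reverse=True)
    ranks, dense = [], []
    for i, s in enumerate(scores, start=1):
        if i > 1 and s == scores[i - 2]:
            ranks.append(ranks[-1])    # tie: same rank as the previous row
            dense.append(dense[-1])
        else:
            ranks.append(i)            # rank() jumps to the row number, leaving a gap
            dense.append(dense[-1] + 1 if dense else 1)
    return ranks, dense

# rank_sequences([100, 90, 90, 80])
# rank()       -> [1, 2, 2, 4]
# dense_rank() -> [1, 2, 2, 3]
```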
Q: Are window functions Wide or Narrow transformations?
They are Wide Transformations. Spark must shuffle the data so that all records sharing the same partitionBy key land in the same partition before the window calculation can run.
Q: What is the danger of a Window without a partitionBy clause?
If you define a window using only orderBy, Spark is forced to move all data into a single partition on a single executor to perform the global sort. This will likely cause an OutOfMemory (OOM) error on large datasets.