Window Functions
/tldr: The most powerful weapon in PySpark SQL
Ranking
Running Totals
LAG/LEAD
Frames
2025 LAW
90% of “complex” PySpark logic
should be solved with Window Functions
should be solved with Window Functions
The One Syntax You Need
from pyspark.sql.window import Window
from pyspark.sql.functions import *
# 1. Define window
w = Window.partitionBy("user_id").orderBy("event_time")
# 2. Use it
df.withColumn("rank", rank().over(w))
.withColumn("prev_value", lag("value", 1).over(w))
.withColumn("running_total", sum("amount").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
Ranking Functions — Know the Difference
row_number()
Unique 1,2,3... even with ties
row_number().over(w)
rank()
Gaps: 1,1,3
rank().over(w)
dense_rank()
No gaps: 1,1,2
dense_rank().over(w)
Top 10 Patterns You’ll Use Daily
1. Running Total
sum("amount").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow))
2. Moving Average (7-day)
avg("value").over(w.rowsBetween(-6, 0))
3. Sessionization (Gap Detection)
lag("event_time", 1).over(w)
# Then: when(timestamp_diff > 30min, new session)
4. Dedupe Keep Latest
row_number().over(Window.partitionBy("id").orderBy(desc("ts"))).alias("rn")
# Then filter rn == 1
5. Top N per Group
row_number().over(w.orderBy(desc("sales"))).alias("rank")
# filter rank <= 5
Window Frames — Never Guess Again
| Frame | Code | Use Case |
|---|---|---|
| All previous rows | rowsBetween(Window.unboundedPreceding, Window.currentRow) | Running total |
| Last N rows | rowsBetween(-N+1, 0) | Moving average |
| From start to end | rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) | Group totals |
| Time-based (30 min) | rangeBetween(-1800, 0) | Event-time windows |
Performance Tips (2025)
DO
- Always orderBy() in window
- Use rangeBetween() for time windows
- Cache if reusing window
DON’T
- Forget partitionBy() → global window
- Use rowsBetween with no orderBy()
- Apply window on unfiltered data
MASTERED.
You now wield the most powerful tool in PySpark.
Window Functions = Your superpower.