UDFs vs Pandas UDFs
TL;DR: Never use regular Python UDFs in 2025. Ever.
[Figure: Scalar UDF vs Pandas UDF, a roughly 10–100× performance gap]
2025 LAW
Regular Python UDF = career-ending anti-pattern
10–100× slower · opaque to Catalyst · serializes every single row
Real-World Benchmarks (10M rows)
| Method | Time | Speed vs Built-in | Verdict |
|---|---|---|---|
| Built-in function (col * 2) | 8 sec | 1× | Best |
| Pandas UDF (vectorized) | 12–25 sec | 1.5–3× slower | Acceptable |
| Regular Python UDF | 4–15 min | 30–100× slower | Never use |
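Absolute times are cluster-dependent, but the ordering is easy to reproduce. A minimal sketch, assuming Spark 3.x (for the `noop` writer); the toy transformation, DataFrame size, and column names are made up for illustration:

```python
import time

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).withColumn("x", col("id").cast("double"))

@udf(returnType=DoubleType())
def double_py(x):                                  # one Python call per row
    return x * 2 if x is not None else None

@pandas_udf("double")
def double_vec(x: pd.Series) -> pd.Series:         # one call per Arrow batch
    return x * 2

def timed(label, frame):
    start = time.time()
    frame.write.format("noop").mode("overwrite").save()   # forces every column to be computed
    print(f"{label}: {time.time() - start:.1f}s")

timed("built-in",   df.withColumn("y", col("x") * 2))
timed("pandas UDF", df.withColumn("y", double_vec(col("x"))))
timed("python UDF", df.withColumn("y", double_py(col("x"))))
```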
Why Regular UDFs Destroy Performance
- Serialization: every row is pickled from the JVM to a Python worker, processed one call at a time, then serialized back.
- No Catalyst: the optimizer can't see inside the function → black box (no pushdown, no codegen); see the plan check below.
- GC hell: millions of tiny per-row objects → garbage-collection pressure on both the Python and JVM sides.
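You can see the black-box problem directly in the physical plan. A minimal sketch, assuming an active `spark` session and a throwaway toy UDF: a regular Python UDF shows up as a `BatchEvalPython` node that Catalyst cannot look inside, while the equivalent built-in expression stays an ordinary, fully optimizable projection.

```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

plus_one = udf(lambda x: x + 1, LongType())        # hypothetical toy UDF

spark.range(5).withColumn("y", plus_one(col("id"))).explain()
# physical plan contains a BatchEvalPython node: the UDF is opaque to Catalyst

spark.range(5).withColumn("y", col("id") + 1).explain()
# pure built-in arithmetic stays a plain Project node, fully optimizable
```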
Code: The Good, The Bad, The Ugly
NEVER DO THIS
```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def clean_text(text):
    # called once per row: every value is pickled to a Python worker and back
    return text.strip().lower() if text else None

df.withColumn("clean", clean_text(col("text")))
```
DO THIS (2025)
```python
from pyspark.sql.functions import col, pandas_udf
import pandas as pd

@pandas_udf("string")
def clean_text_vec(texts: pd.Series) -> pd.Series:
    # called once per Arrow batch: vectorized pandas string ops, nulls pass through
    return texts.str.strip().str.lower()

df.withColumn("clean", clean_text_vec(col("text")))
```
GOD TIER
```python
from pyspark.sql.functions import col, lower, trim

# 99% of the time a built-in chain already does the job
df.withColumn("clean", lower(trim(col("text"))))
```
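Worth knowing before deleting the UDF: built-in `trim()`/`lower()` propagate nulls on their own, so the explicit `None` check is unnecessary (empty strings stay empty rather than becoming null, a minor behavioral difference from the UDF above). A quick check, assuming an active `spark` session:

```python
from pyspark.sql.functions import col, lower, trim

tiny = spark.createDataFrame([("  Hello World  ",), (None,)], ["text"])
tiny.withColumn("clean", lower(trim(col("text")))).show()
# "  Hello World  " -> "hello world"; the null row stays null
```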
Pandas UDF Types You’ll Actually Use
Scalar (Most Common)
```python
@pandas_udf("double")
def predict_score(features: pd.Series) -> pd.Series:
    # `model` is assumed to be a pre-fitted estimator already available on the workers;
    # wrap the ndarray result in a Series so Arrow can send it back
    return pd.Series(model.predict(features))
```
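Real models usually need more than one feature column. A hedged sketch of that common case; the column names, the scikit-learn-style `model`, and how it gets onto the workers are all assumptions:

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# Hypothetical pre-trained scikit-learn-style model, loaded once per worker
# (e.g. model = joblib.load("model.pkl")) or broadcast from the driver.

@pandas_udf("double")
def predict_score_multi(f1: pd.Series, f2: pd.Series) -> pd.Series:
    # each call gets one Arrow batch per feature column; stack into a 2-D frame
    X = pd.concat([f1, f2], axis=1)
    return pd.Series(model.predict(X))

scored = df.withColumn("score", predict_score_multi(col("feature_1"), col("feature_2")))
```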
Grouped Map (applyInPandas)
```python
def normalize_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # each group arrives as one pandas DataFrame; add a z-scored column
    pdf["normalized"] = (pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    return pdf

# the @pandas_udf GROUPED_MAP form is deprecated; use groupBy(...).applyInPandas
# `schema` is the output schema, e.g. "group string, value double, normalized double"
df.groupBy("group").applyInPandas(normalize_group, schema)
```
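End-to-end usage with toy data (the column names and the output schema string are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 30.0)],
    ["group", "value"],
)

toy.groupBy("group").applyInPandas(
    normalize_group,
    schema="group string, value double, normalized double",
).show()
```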
Decision Tree (2025)
- Built-in function exists? → Use it (`lower()`, `trim()`, `regexp_replace()`, etc.); check first, as in the sketch below.
- Complex logic / ML model / external lib? → Pandas UDF (scalar).
- You wrote `@udf`? → Delete it. Rewrite as a Pandas UDF or a built-in.
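A quick way to act on the first branch (a hedged habit, with "trim" as an example search term): scan the built-in namespace before writing any UDF.

```python
from pyspark.sql import functions as F

# search pyspark.sql.functions for an existing built-in before reaching for a UDF
print([name for name in dir(F) if "trim" in name.lower()])   # e.g. ['ltrim', 'rtrim', 'trim']
```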
FINAL ANSWER:
Regular UDFs are dead.
Pandas UDFs or built-ins only.
Your cluster thanks you.