UDFs vs Pandas UDFs TL;DR

/tldr: Never use regular UDFs in 2025. Ever.

Scalar UDF · Pandas UDF · 10–100× Performance

2025 LAW

Regular Python UDF = Career-Ending Anti-Pattern

10–100× slower · Opaque to the Catalyst optimizer · Serializes rows between the JVM and Python

Real-World Benchmarks (10M rows)

Method | Time | Speed vs Built-in | Verdict
Built-in function (col * 2) | 8 sec | baseline | Best
Pandas UDF (vectorized) | 12–25 sec | 1.5–3× slower | Acceptable
Regular Python UDF | 4–15 min | 30–100× slower | Never use
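
Numbers like these depend heavily on cluster size, data, and Spark version, so it is worth reproducing them locally. Below is a minimal sketch of a timing harness, not the benchmark behind the table. The names `timed` and `run_benchmark`, the synthetic `value` column, and the times-two workload are all illustrative assumptions. It assumes an existing SparkSession and, for the Pandas UDF, that pandas and pyarrow are installed.

```python
import time


def timed(label, action):
    """Run a zero-argument callable and return (label, seconds elapsed)."""
    start = time.perf_counter()
    action()
    return label, time.perf_counter() - start


def run_benchmark(spark, n_rows=10_000_000):
    """Time the three approaches on a synthetic double column."""
    # Imported lazily so the timing helper above works without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf, udf
    from pyspark.sql.types import DoubleType

    df = spark.range(n_rows).withColumn("value", col("id").cast("double"))

    @pandas_udf("double")
    def times_two_vec(v: pd.Series) -> pd.Series:
        return v * 2  # operates on a whole Arrow batch at once

    times_two_row = udf(lambda x: x * 2.0, DoubleType())  # called once per row

    candidates = {
        "built-in": col("value") * 2,
        "pandas_udf": times_two_vec(col("value")),
        "python_udf": times_two_row(col("value")),
    }
    results = []
    for label, expr in candidates.items():
        out = df.withColumn("doubled", expr)
        # The "noop" sink forces full evaluation without writing any data.
        results.append(
            timed(label, lambda: out.write.format("noop").mode("overwrite").save())
        )
    return results
```

With a session in hand, `for label, secs in run_benchmark(spark): print(label, round(secs, 1))` prints one line per method. Expect the ordering to match the table more reliably than the absolute times.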

Why Regular UDFs Destroy Performance

Serialization

JVM row → serialized (pickled) → deserialized in a Python worker → Python function runs per row → result serialized back to the JVM

No Catalyst

Optimizer can’t see inside → black box, so it can’t push filters through it or optimize the expression

GC Hell

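Millions of tiny short-lived Python objects → allocation and garbage-collection overhead in the Python workers (the objects live in Python, not on the JVM heap)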
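
The shape of the serialization cost can be seen without Spark at all. The sketch below is a deliberately simplified stand-in that uses only the standard library's `pickle`. It is not Spark's actual transport, and the function names are made up for illustration. It compares one round trip per value against one round trip per batch, which is roughly the difference a vectorized UDF buys you.

```python
import pickle
import time


def per_row_round_trip(rows):
    """Serialize, deserialize, and transform one value at a time."""
    out = []
    for value in rows:
        received = pickle.loads(pickle.dumps(value))          # hand-off in
        out.append(pickle.loads(pickle.dumps(received * 2)))  # hand-off out
    return out


def batched_round_trip(rows):
    """Serialize the whole batch once, transform it, and send it back once."""
    received = pickle.loads(pickle.dumps(rows))
    return pickle.loads(pickle.dumps([v * 2 for v in received]))


def time_call(fn, rows):
    """Return (result, seconds elapsed) for fn(rows)."""
    start = time.perf_counter()
    result = fn(rows)
    return result, time.perf_counter() - start


rows = list(range(200_000))
slow_result, slow_secs = time_call(per_row_round_trip, rows)
fast_result, fast_secs = time_call(batched_round_trip, rows)
print(f"per-row: {slow_secs:.3f}s  batched: {fast_secs:.3f}s")
```

Both functions produce the same output; only the number of hand-offs differs.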

Code: The Good, The Bad, The Ugly

NEVER DO THIS

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Row-at-a-time Python UDF: every value crosses the JVM/Python boundary
@udf(returnType=StringType())
def clean_text(text):
    return text.strip().lower() if text else None

df = df.withColumn("clean", clean_text(col("text")))

DO THIS (2025)

from pyspark.sql.functions import col, pandas_udf
import pandas as pd

# Vectorized: receives a whole batch as a pandas Series
@pandas_udf("string")
def clean_text_vec(texts: pd.Series) -> pd.Series:
    return texts.str.strip().str.lower()

df = df.withColumn("clean", clean_text_vec(col("text")))

GOD TIER

from pyspark.sql.functions import col, lower, trim

# 99% of the time this exists
df = df.withColumn("clean",
    lower(trim(col("text")))
)
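
Because the body of a Pandas UDF is ordinary pandas code, it can be checked on a plain Series before it goes anywhere near a cluster. Here is a minimal sketch, assuming pandas is installed. The names `clean_text_logic` and `as_pandas_udf` are illustrative, and the Spark wrapping is deferred to a helper so the logic stays testable without a Spark session.

```python
import pandas as pd


def clean_text_logic(texts: pd.Series) -> pd.Series:
    """Strip whitespace and lowercase; missing values stay missing."""
    return texts.str.strip().str.lower()


def as_pandas_udf():
    """Wrap the logic as a Spark Pandas UDF (needs pyspark and pyarrow)."""
    from pyspark.sql.functions import pandas_udf  # lazy: only needed with Spark
    return pandas_udf(clean_text_logic, returnType="string")


sample = pd.Series(["  Hello ", "WORLD", None])
cleaned = clean_text_logic(sample)
print(cleaned.tolist())
```

One difference worth noticing: the row-at-a-time version above turns empty strings into nulls (`if text else None`), while the vectorized version leaves them as empty strings. If that matters for your data, handle it explicitly.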

Pandas UDF Types You’ll Actually Use

Scalar (Most Common)

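@pandas_udf("double")
def predict_score(features: pd.Series) -> pd.Series:
    # `model` is assumed to be loaded already, with a scikit-learn-style predict()
    # that takes 2-D input and returns an array; wrap it back into a Series
    return pd.Series(model.predict(features.to_frame()))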
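
A scalar Pandas UDF must return a pandas Series with the same length as its input, while most model libraries hand back a NumPy array. The sketch below shows that conversion with a hypothetical `DummyModel` standing in for whatever model is actually in use. The class, its halving "prediction", and the helper names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


class DummyModel:
    """Hypothetical stand-in for a trained model with a predict() method."""

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        return X.iloc[:, 0].to_numpy() * 0.5


model = DummyModel()


def predict_score_logic(features: pd.Series) -> pd.Series:
    """Score one batch: 2-D input for the model, same-length Series back."""
    preds = model.predict(features.to_frame())
    return pd.Series(preds, index=features.index)


def as_pandas_udf():
    """Wrap the logic as a Spark Pandas UDF (needs pyspark and pyarrow)."""
    from pyspark.sql.functions import pandas_udf  # lazy: only needed with Spark
    return pandas_udf(predict_score_logic, returnType="double")


scores = predict_score_logic(pd.Series([2.0, 4.0, 6.0]))
print(scores.tolist())
```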

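Grouped Map (applyInPandas)

def normalize_group(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["normalized"] = (pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    return pdf

# Spark 3.x: pass a plain function to applyInPandas instead of decorating it.
# "key" is a placeholder grouping column; schema must list every output column.
df.groupBy("key").applyInPandas(normalize_group, schema=schema)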
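
The group function is plain pandas, so edge cases are easy to check locally. One to watch: a single-row or constant group has a standard deviation of NaN or 0, and dividing by it yields NaN or infinity. Below is a minimal sketch that guards against that. Returning 0.0 for such groups is an assumption chosen for illustration, as are the `key` column name and the schema string.

```python
import pandas as pd


def normalize_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Z-score `value` within one group; groups with no spread get 0.0."""
    std = pdf["value"].std()
    if pd.isna(std) or std == 0:
        # Single-row or constant group: nothing meaningful to divide by.
        return pdf.assign(normalized=0.0)
    return pdf.assign(normalized=(pdf["value"] - pdf["value"].mean()) / std)


def normalize_by_key(df):
    """Apply per group on a Spark DataFrame with key (string), value (double)."""
    return df.groupBy("key").applyInPandas(
        normalize_group, schema="key string, value double, normalized double"
    )


group = pd.DataFrame({"key": ["a", "a", "a"], "value": [1.0, 2.0, 3.0]})
result = normalize_group(group)
single = normalize_group(pd.DataFrame({"key": ["b"], "value": [5.0]}))
print(result)
```

Using `assign` returns a new DataFrame rather than mutating the one Spark passed in, which keeps the function free of side effects.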

Decision Tree (2025)

Built-in function exists? → Use it (lower(), trim(), regexp_replace(), etc.)
Complex logic / ML model / external lib? → Pandas UDF (Scalar)
You wrote @udf → Delete it. Rewrite as Pandas UDF or built-in.

FINAL ANSWER:

Regular UDFs are dead.
Pandas UDFs or built-ins only.

Your cluster thanks you.

Spark 3.5+ • Arrow-optimized • Databricks, EMR, GCP, Azure • 2025 standard