1. Basic Python UDF Workflow
Custom User-Defined Functions.
Extending the Spark ecosystem with custom Python logic for specialized data transformations.
A UDF allows you to take a standard Python function and register it so it can be applied to a Spark column. This process involves serializing the function and sending it to each executor.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# 1. Define a regular Python function
def categorize_age(age):
    if age is None:
        return "Unknown"
    if age < 18:
        return "Minor"
    return "Adult"
# 2. Register as a UDF
age_udf = udf(categorize_age, StringType())
# 3. Apply to DataFrame
df_classified = df.withColumn("age_group", age_udf(df.age))
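Because the wrapped function is ordinary Python, it can be unit-tested on the driver before being registered as a UDF. A minimal sketch (re-declaring categorize_age so it runs standalone, with no Spark required):

```python
# Same logic as the UDF above; plain Python, testable without a SparkSession.
def categorize_age(age):
    if age is None:
        return "Unknown"
    if age < 18:
        return "Minor"
    return "Adult"

# Exercise the edge cases the executors will hit.
print(categorize_age(None))  # Unknown
print(categorize_age(17))    # Minor
print(categorize_age(42))    # Adult
```

Catching a bug here is far cheaper than discovering it as a failed task on a remote executor.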
2. Advanced: Pandas UDFs (Vectorized)
Standard UDFs process data row-by-row, which is slow. Pandas UDFs (using Apache Arrow) process data in batches (vectors), significantly reducing the communication overhead between the JVM and Python.
[Image comparison of standard Spark UDF vs Pandas Vectorized UDF performance]
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf("double")
def multiply_by_ten(s: pd.Series) -> pd.Series:
    return s * 10
df_vectorized = df.withColumn("multiplied_price", multiply_by_ten(df.price))
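Since a Pandas UDF body is plain pandas code operating on a pd.Series batch, the same logic can be sanity-checked locally without a SparkSession. A small sketch, assuming pandas is installed (multiply_by_ten_impl is an illustrative stand-in for the decorated function):

```python
import pandas as pd

# The body the @pandas_udf decorator wraps: one call handles a whole batch.
def multiply_by_ten_impl(s: pd.Series) -> pd.Series:
    return s * 10

batch = pd.Series([1.0, 2.5, 4.0])
result = multiply_by_ten_impl(batch)
print(result.tolist())  # [10.0, 25.0, 40.0]
```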
3. UDFs in Spark SQL
If you prefer writing raw SQL, you can register your UDF in the Spark catalog to make it accessible within spark.sql() queries.
# Registering for SQL usage
spark.udf.register("sql_categorize", categorize_age, StringType())
# Using in SQL
spark.sql("SELECT name, sql_categorize(age) FROM users_table").show()
Interview Q&A
Q: Why are standard UDFs considered a "performance killer"?
Standard UDFs act as a black box to the Catalyst Optimizer: Spark cannot inspect or optimize the logic inside, and it must serialize data from the JVM to a Python worker process and back for every single row, causing massive serialization overhead.
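The per-row round-trip can be mimicked in pure Python with pickle. This is a rough analogy, not Spark's actual wire protocol: serializing each value separately stands in for row-at-a-time UDF transfer, while one round-trip for the whole list stands in for Arrow-style batching.

```python
import pickle

def categorize(x):
    return "big" if x > 50 else "small"

values = list(range(1000))

# Row-at-a-time: one serialization round-trip per value.
row_results = [categorize(pickle.loads(pickle.dumps(v))) for v in values]

# Batched: a single round-trip for the entire batch.
batch_results = [categorize(v) for v in pickle.loads(pickle.dumps(values))]

assert row_results == batch_results  # same answers, far fewer round-trips
```

The results are identical either way; only the number of serialization boundaries crossed differs, which is exactly the overhead Pandas UDFs reduce.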
Q: When is it mandatory to use a UDF?
When you need to use external Python libraries (like Scikit-learn, NLTK, or custom internal packages) that do not have a native Spark implementation, or when the logic is too complex for when/otherwise chains.
Q: How do you handle null values inside a UDF?
Spark passes nulls directly to your Python function as None. You must ensure your function has defensive logic (e.g., if x is None) to avoid TypeError crashes during execution.
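A quick way to see why the guard matters, using a hypothetical risky/safe pair: in Python 3, comparing None to an int raises TypeError, so an unguarded UDF body crashes the task on the first null row.

```python
def risky(age):
    return "Minor" if age < 18 else "Adult"  # None < 18 raises TypeError

def safe(age):
    if age is None:
        return "Unknown"
    return "Minor" if age < 18 else "Adult"

try:
    risky(None)          # what happens to a task when a null row arrives
    crashed = False
except TypeError:
    crashed = True

print(crashed)           # True
print(safe(None))        # Unknown -- null handled, task survives
```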