1. Basic Python UDF Workflow
Custom User-Defined Functions.
Extending the Spark ecosystem with custom Python logic for specialized data transformations.
A UDF allows you to take a standard Python function and register it so it can be applied to a Spark column. This process involves serializing the function and sending it to each executor.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# 1. Define a regular Python function
def categorize_age(age):
    if age is None:
        return "Unknown"
    if age < 18:
        return "Minor"
    return "Adult"
# 2. Register as a UDF
age_udf = udf(categorize_age, StringType())
# 3. Apply to DataFrame
df_classified = df.withColumn("age_group", age_udf(df.age))
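Because the wrapped function is ordinary Python, it can be unit-tested on the driver before being registered as a UDF. A minimal sketch (re-declaring categorize_age so it runs standalone, with no Spark required):

```python
# Same logic as the UDF above; plain Python, testable without a SparkSession.
def categorize_age(age):
    if age is None:
        return "Unknown"
    if age < 18:
        return "Minor"
    return "Adult"

# Exercise the edge cases the executors will hit.
print(categorize_age(None))  # Unknown
print(categorize_age(17))    # Minor
print(categorize_age(42))    # Adult
```

Catching a bug here is far cheaper than discovering it as a failed task on a remote executor.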
2. Advanced: Pandas UDFs (Vectorized)
Standard UDFs process data row-by-row, which is slow. Pandas UDFs (using Apache Arrow) process data in batches (vectors), significantly reducing the communication overhead between the JVM and Python.
[Image comparison of standard Spark UDF vs Pandas Vectorized UDF performance]
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf("double")
def multiply_by_ten(s: pd.Series) -> pd.Series:
    return s * 10
df_vectorized = df.withColumn("multiplied_price", multiply_by_ten(df.price))
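Since a Pandas UDF body is plain pandas code operating on a pd.Series batch, the same logic can be sanity-checked locally without a SparkSession. A small sketch, assuming pandas is installed (multiply_by_ten_impl is an illustrative stand-in for the decorated function):

```python
import pandas as pd

# The body the @pandas_udf decorator wraps: one call handles a whole batch.
def multiply_by_ten_impl(s: pd.Series) -> pd.Series:
    return s * 10

batch = pd.Series([1.0, 2.5, 4.0])
result = multiply_by_ten_impl(batch)
print(result.tolist())  # [10.0, 25.0, 40.0]
```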
3. UDFs in Spark SQL
If you prefer writing raw SQL, you can register your UDF in the Spark catalog to make it accessible within spark.sql() queries.
# Registering for SQL usage
spark.udf.register("sql_categorize", categorize_age, StringType())
# Using in SQL
spark.sql("SELECT name, sql_categorize(age) FROM users_table").show()
Interview Q&A
Q: Why are standard UDFs considered a "performance killer"?
Standard UDFs act as a black box to the Catalyst Optimizer: Spark cannot inspect or optimize the logic inside, and it must serialize data from the JVM to a Python worker process and back for every single row, causing massive serialization overhead.
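The per-row round-trip can be mimicked in pure Python with pickle. This is a rough analogy, not Spark's actual wire protocol: serializing each value separately stands in for row-at-a-time UDF transfer, while one round-trip for the whole list stands in for Arrow-style batching.

```python
import pickle

def categorize(x):
    return "big" if x > 50 else "small"

values = list(range(1000))

# Row-at-a-time: one serialization round-trip per value.
row_results = [categorize(pickle.loads(pickle.dumps(v))) for v in values]

# Batched: a single round-trip for the entire batch.
batch_results = [categorize(v) for v in pickle.loads(pickle.dumps(values))]

assert row_results == batch_results  # same answers, far fewer round-trips
```

The results are identical either way; only the number of serialization boundaries crossed differs, which is exactly the overhead Pandas UDFs reduce.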
Q: When is it mandatory to use a UDF?
When you need to use external Python libraries (like Scikit-learn, NLTK, or custom internal packages) that do not have a native Spark implementation, or when the logic is too complex for when/otherwise chains.
Q: How do you handle null values inside a UDF?
Spark passes nulls directly to your Python function as None. You must ensure your function has defensive logic (e.g., if x is None) to avoid TypeError crashes during execution.
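A quick way to see why the guard matters, using a hypothetical risky/safe pair: in Python 3, comparing None to an int raises TypeError, so an unguarded UDF body crashes the task on the first null row.

```python
def risky(age):
    return "Minor" if age < 18 else "Adult"  # None < 18 raises TypeError

def safe(age):
    if age is None:
        return "Unknown"
    return "Minor" if age < 18 else "Adult"

try:
    risky(None)          # what happens to a task when a null row arrives
    crashed = False
except TypeError:
    crashed = True

print(crashed)           # True
print(safe(None))        # Unknown -- null handled, task survives
```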