Broadcasting & Optimization.
Eliminating expensive shuffles by replicating small datasets across the cluster for high-performance joins.
1. Broadcast Hash Join
In a standard Sort-Merge Join, Spark shuffles both tables across the network so that matching keys land on the same executor. In a Broadcast Hash Join, the smaller table is collected to the driver and then shipped once to every executor, where it is kept in memory as a hash table. The large table stays put, and the join happens locally on each executor with no shuffle of the big side.
from pyspark.sql.functions import broadcast
# Manually hinting Spark to broadcast the smaller 'dim_products' table
fact_sales_df.join(broadcast(dim_products_df), "product_id")
2. Broadcast Variables
Sometimes you need to share a read-only lookup object (like a Python dictionary or a small list) with your UDFs or RDD transformations. Broadcast variables ship the object to each executor once and cache it there, instead of re-serializing it with every task.
# Creating a broadcast variable for a lookup dictionary
mapping_dict = {"US": "United States", "IN": "India", "UK": "United Kingdom"}
broadcast_map = sc.broadcast(mapping_dict)
# Accessing the variable inside a UDF or map function
def get_country_name(code):
    return broadcast_map.value.get(code, "Unknown")
3. Automatic Broadcasting
Spark automatically attempts to broadcast a table if its size is below a certain threshold (default is 10MB). You can tune this behavior to optimize larger lookups.
# Increasing the auto-broadcast threshold to 50MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "52428800")
Interview Q&A