1. Core Join Strategies
Relational Join Operations.
Integrating disparate datasets using Inner, Left, and Anti joins to build comprehensive data models.
Joins in Spark allow you to combine DataFrames based on a common key. The default join type is inner, which only returns rows where the key exists in both tables.
# Standard Inner Join
enriched_df = orders_df.join(customers_df, "customer_id", "inner")
# Joining on differently named key columns with an explicit condition
final_df = orders_df.join(
    payments_df,
    orders_df.order_id == payments_df.ref_id,
    "left"
)
2. The Left Anti Join (Delta Detection)
A left_anti join is a specialized operation that returns only the rows from the left table that do not have a match in the right table. This is perfect for identifying missing records or new data.
# Finding customers who have NEVER placed an order
inactive_customers = customers_df.join(orders_df, "customer_id", "left_anti")
3. Handling Large-Scale Joins (Shuffle)
When Spark joins two large tables, it performs a Sort-Merge Join. This requires a network shuffle to move rows with matching keys onto the same partition. If one table is small enough to fit in each executor's memory, you can avoid the shuffle entirely with a Broadcast Join.
from pyspark.sql.functions import broadcast
# Optimizing a join with a small lookup table
fast_join_df = large_sales_df.join(broadcast(small_dim_df), "product_id")
Interview Q&A
Q: What is the difference between a Left Outer join and a Left Semi join?
A Left Outer join returns all rows from the left table, filling unmatched right-side columns with nulls. A Left Semi join returns only the left rows that have a match on the right, and it appends no columns from the right table at all — it acts as an existence filter.
Q: How do you handle duplicate column names after a join?
If you join on a string or list of column names (e.g., "id"), Spark keeps only one copy of the key column. If you join on a boolean expression, Spark keeps both copies. To fix this, you should alias the DataFrames before joining or drop the redundant column immediately after.
Q: What is "Data Skew" in the context of joins?
Data Skew occurs when one join key (such as a null key or a generic 'Guest' ID) appears far more frequently than others. All rows for that key land in a single partition, so one task processes much more data than the rest, producing a straggler task that slows down the whole job.