Handling Nested Arrays with Explode Functions
Flattening complex data structures for relational analysis and downstream consumption.
The Scenario
Imagine an e-commerce platform where order data is stored in S3 as Parquet. Each row represents a single order_id but contains a nested column, items, which is an array of structs.
Input Data (Before Explode)
| order_id | customer_id | items (Array of Structs) |
|---|---|---|
| ORD_101 | CUST_A | [{"p_id": "M1", "qty": 1}, {"p_id": "M2", "qty": 3}] |
| ORD_102 | CUST_B | [{"p_id": "L5", "qty": 1}] |
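To experiment without access to the production bucket, the same structure can be built in memory. The following is a minimal sketch; the local SparkSession and the literal sample values are assumptions for illustration, not part of the real pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

# Local session purely for experimentation (assumed setup)
spark = SparkSession.builder.master("local[*]").appName("explode-demo").getOrCreate()

# Schema mirroring the table above: items is an array of structs
schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("p_id", StringType()),
        StructField("qty", IntegerType()),
    ]))),
])

# If you follow along with this DataFrame, skip the S3 read below
df = spark.createDataFrame(
    [
        ("ORD_101", "CUST_A", [("M1", 1), ("M2", 3)]),
        ("ORD_102", "CUST_B", [("L5", 1)]),
    ],
    schema,
)
```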
Implementation
```python
from pyspark.sql.functions import explode, col

# Load data from S3
df = spark.read.parquet("s3a://production-data/orders/")

# Explode the array into individual rows
df_exploded = df.withColumn("item", explode(col("items")))

# Select final columns
df_final = df_exploded.select(
    "order_id",
    col("item.p_id").alias("product_id"),
    col("item.qty").alias("quantity"),
)
```
Output Data (After Explode)
| order_id | product_id | quantity |
|---|---|---|
| ORD_101 | M1 | 1 |
| ORD_101 | M2 | 3 |
| ORD_102 | L5 | 1 |
Notice how ORD_101 now appears in two rows because its original array contained two items.
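One practical consequence is that the output row count equals the total number of items across all orders, not the number of orders. A quick sanity check along these lines, assuming the df and df_final DataFrames from above, can catch unexpected fan-out before it reaches downstream tables:

```python
from pyspark.sql.functions import col, size
from pyspark.sql.functions import sum as spark_sum

# Total number of array elements across all orders.
# Caveat: under the default (non-ANSI) config, size() returns -1 for a
# null items column, so clean or filter out nulls first if they can occur.
expected_rows = (
    df.select(size(col("items")).alias("n_items"))
      .agg(spark_sum("n_items").alias("total"))
      .first()["total"]
)

actual_rows = df_final.count()
print(expected_rows, actual_rows)  # should match: one output row per item
```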
Interview Q&A
Q: What happens if the 'items' array is empty for a specific order?
With a standard explode, that order's row is dropped entirely and disappears from the results (the same happens when the items column is null). To keep the order in the results with null values in the item columns, use explode_outer instead.
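A short sketch of the difference, using a hypothetical order with an empty items array (the demo DataFrame below is illustrative only and assumes an existing SparkSession named spark):

```python
from pyspark.sql.functions import explode, explode_outer, col

# Hypothetical order whose items array is empty (illustrative only)
demo = spark.createDataFrame(
    [("ORD_103", [])],
    "order_id string, items array<struct<p_id:string,qty:int>>",
)

demo.withColumn("item", explode(col("items"))).show()        # 0 rows: ORD_103 disappears
demo.withColumn("item", explode_outer(col("items"))).show()  # 1 row: item is null
```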
Q: Is exploding a "Wide" or "Narrow" transformation?
It is a narrow transformation. Even though it creates new rows, each output row is generated independently from a single parent row, so no data needs to be shuffled across the network between executors.
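One way to verify this is to inspect the physical plan, assuming the df loaded earlier: the explode shows up as a Generate operator, and no Exchange (the marker of a shuffle) is introduced by it.

```python
from pyspark.sql.functions import explode, col

# The plan contains a Generate node for the explode but no Exchange,
# i.e. this step does not trigger a shuffle on its own
df.withColumn("item", explode(col("items"))).explain()
```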