Explode Functions | Spark Practical Scenarios

Handling Nested Arrays with Explode Functions

Flattening complex data structures for relational analysis and downstream consumption.

Imagine an e-commerce platform where order data is stored in S3 as Parquet. Each row represents a single order_id but contains a nested column, items, which is an array of structs.

Input Data (Before Explode)

order_id | customer_id | items (array of structs)
ORD_101  | CUST_A      | [{"p_id": "M1", "qty": 1}, {"p_id": "M2", "qty": 3}]
ORD_102  | CUST_B      | [{"p_id": "L5", "qty": 1}]
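
For reference, here is a minimal sketch of the schema the table above implies (field names are taken from the sample rows; the real dataset may carry more fields):

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Assumed layout: each order row carries an array of item structs.
item_schema = StructType([
    StructField("p_id", StringType()),
    StructField("qty", IntegerType()),
])
orders_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("items", ArrayType(item_schema)),
])

With that shape in mind, the flattening itself takes only a few lines: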
from pyspark.sql.functions import explode, col

# Load data from S3
df = spark.read.parquet("s3a://production-data/orders/")

# Explode the array into individual rows
df_exploded = df.withColumn("item", explode(col("items")))

# Select final columns
df_final = df_exploded.select(
    "order_id",
    col("item.p_id").alias("product_id"),
    col("item.qty").alias("quantity")
)

Output Data (After Explode)

order_id | product_id | quantity
ORD_101  | M1         | 1
ORD_101  | M2         | 3
ORD_102  | L5         | 1

Notice that ORD_101 now appears on two rows, one per element of its original items array.
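
If you want to reproduce this behavior without access to the S3 bucket, here is a minimal local sketch with the same two sample orders (the DDL schema string mirrors the assumed layout above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

# Inline rows mirroring the input table above.
rows = [
    ("ORD_101", "CUST_A", [("M1", 1), ("M2", 3)]),
    ("ORD_102", "CUST_B", [("L5", 1)]),
]
df = spark.createDataFrame(
    rows, "order_id string, customer_id string, items array<struct<p_id:string, qty:int>>"
)

# Same transformation as above: one output row per array element.
df.withColumn("item", explode(col("items"))).select(
    "order_id",
    col("item.p_id").alias("product_id"),
    col("item.qty").alias("quantity"),
).show()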

Q: What happens if the items array is empty (or null) for a specific order?
A: With a standard explode, that order's row is dropped entirely and disappears from the results. To keep the order in the results with null values for the item fields, use explode_outer.
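
A minimal sketch of the explode_outer variant, reusing the df loaded above:

from pyspark.sql.functions import explode_outer, col

# explode_outer emits one row with null item fields when the array
# is empty or null, instead of dropping the order entirely.
df_kept = df.withColumn("item", explode_outer(col("items"))).select(
    "order_id",
    col("item.p_id").alias("product_id"),
    col("item.qty").alias("quantity"),
)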
Q: Is explode a wide or narrow transformation?
A: It is a narrow transformation. Even though it creates new rows, each output row is generated independently from a single parent row, so no data needs to be shuffled across the network between executors.
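
You can check this yourself by inspecting the physical plan: explode appears as a Generate node, and no Exchange (shuffle) operator is introduced. The output below is abridged and its exact shape varies by Spark version:

df_final.explain()
# == Physical Plan == (abridged)
# Generate explode(items), [order_id], false, [item]
# +- FileScan parquet [order_id, items] ...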