Handling Nested Arrays with Explode Functions
Flattening complex data structures for relational analysis and downstream consumption.
The Scenario
Imagine an e-commerce platform where order data is stored in S3 as Parquet. Each row represents a single order_id but contains a nested column, items, which is an array of structs.
Input Data (Before Explode)
| order_id | customer_id | items (Array of Structs) |
|---|---|---|
| ORD_101 | CUST_A | [{"p_id": "M1", "qty": 1}, {"p_id": "M2", "qty": 3}] |
| ORD_102 | CUST_B | [{"p_id": "L5", "qty": 1}] |
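To experiment without access to the production bucket, the same structure can be built in memory. The following is a minimal sketch; the local SparkSession and the literal sample values are assumptions for illustration, not part of the real pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

# Local session purely for experimentation (assumed setup)
spark = SparkSession.builder.master("local[*]").appName("explode-demo").getOrCreate()

# Schema mirroring the table above: items is an array of structs
schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("p_id", StringType()),
        StructField("qty", IntegerType()),
    ]))),
])

# If you follow along with this DataFrame, skip the S3 read below
df = spark.createDataFrame(
    [
        ("ORD_101", "CUST_A", [("M1", 1), ("M2", 3)]),
        ("ORD_102", "CUST_B", [("L5", 1)]),
    ],
    schema,
)
```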
Implementation
```python
from pyspark.sql.functions import explode, col

# Load data from S3
df = spark.read.parquet("s3a://production-data/orders/")

# Explode the array into individual rows
df_exploded = df.withColumn("item", explode(col("items")))

# Select final columns
df_final = df_exploded.select(
    "order_id",
    col("item.p_id").alias("product_id"),
    col("item.qty").alias("quantity"),
)
```
Output Data (After Explode)
| order_id | product_id | quantity |
|---|---|---|
| ORD_101 | M1 | 1 |
| ORD_101 | M2 | 3 |
| ORD_102 | L5 | 1 |
Notice how ORD_101 now appears in two rows because its original array contained two items.
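One practical consequence is that the output row count equals the total number of items across all orders, not the number of orders. A quick sanity check along these lines, assuming the df and df_final DataFrames from above, can catch unexpected fan-out before it reaches downstream tables:

```python
from pyspark.sql.functions import col, size
from pyspark.sql.functions import sum as spark_sum

# Total number of array elements across all orders.
# Caveat: under the default (non-ANSI) config, size() returns -1 for a
# null items column, so clean or filter out nulls first if they can occur.
expected_rows = (
    df.select(size(col("items")).alias("n_items"))
      .agg(spark_sum("n_items").alias("total"))
      .first()["total"]
)

actual_rows = df_final.count()
print(expected_rows, actual_rows)  # should match: one output row per item
```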
Interview Q&A
Q: What happens if the 'items' array is empty for a specific order?
With a standard explode, that order's row is dropped entirely and disappears from the results (the same happens when the items column is null). To keep the order in the results with null values in the item columns, use explode_outer instead.
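A short sketch of the difference, using a hypothetical order with an empty items array (the demo DataFrame below is illustrative only and assumes an existing SparkSession named spark):

```python
from pyspark.sql.functions import explode, explode_outer, col

# Hypothetical order whose items array is empty (illustrative only)
demo = spark.createDataFrame(
    [("ORD_103", [])],
    "order_id string, items array<struct<p_id:string,qty:int>>",
)

demo.withColumn("item", explode(col("items"))).show()        # 0 rows: ORD_103 disappears
demo.withColumn("item", explode_outer(col("items"))).show()  # 1 row: item is null
```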
Q: Is exploding a "Wide" or "Narrow" transformation?
It is a narrow transformation. Even though it creates new rows, each output row is generated independently from a single parent row, so no data needs to be shuffled across the network between executors.
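One way to verify this is to inspect the physical plan, assuming the df loaded earlier: the explode shows up as a Generate operator, and no Exchange (the marker of a shuffle) is introduced by it.

```python
from pyspark.sql.functions import explode, col

# The plan contains a Generate node for the explode but no Exchange,
# i.e. this step does not trigger a shuffle on its own
df.withColumn("item", explode(col("items"))).explain()
```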