Array & String Functions | Spark Practical Scenarios

Array & String Manipulation

Splitting strings into collections, joining arrays back to text, and flattening nested data for relational analysis.

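Real-world data often arrives as delimited strings (for example, comma-separated values packed into a single column). Use split to turn such a string into an array, and concat_ws (concatenate with separator) to join an array back into a string. Note that split treats its delimiter as a regular expression, so characters such as | or . must be escaped.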

from pyspark.sql.functions import split, concat_ws, col

# Splitting the comma-delimited 'tags_raw' string into an array of strings
df_array = df.withColumn("tag_list", split(col("tags_raw"), ","))

# Joining an array back into a semicolon-delimited string
df_string = df_array.withColumn("tags_formatted", concat_ws(";", col("tag_list")))
    

The explode function is the core tool for flattening nested data. It takes an array column and emits one row per element, repeating the other selected columns (the parent data) on every generated row.

from pyspark.sql.functions import explode

# Converting one row per user into one row per tag
df_exploded = df_array.select(
    "user_id",
    explode(col("tag_list")).alias("individual_tag")
)
    

Like arrays, MapType columns can be exploded. Each map entry becomes its own row with two new columns, one for the key and one for the value. They are named key and value unless you alias them.

# Exploding a metadata map into key and value columns
df_metadata = df.select(
    "product_id",
    explode(col("properties_map")).alias("attr_key", "attr_value")
)
    
Q: What is the difference between explode() and posexplode()? explode() returns only the elements of the array. posexplode() returns each element together with its zero-based position (index) in the array, which is useful when the order of elements carries meaning.
Q: How does explode() handle empty arrays or nulls? Standard explode() will remove the entire row if the array is null or empty. To keep the row and return a null for the element, use explode_outer().
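Q: Can you explode multiple columns in a single select? Not directly. Spark 2.x and 3.x reject more than one generator in a single select clause with an AnalysisException. Exploding the arrays in successive select or withColumn calls does work, but it produces every combination of elements (a per-row Cartesian product) rather than matching pairs. For safe parallel exploding of related arrays, arrays_zip followed by explode is the recommended pattern: arrays_zip pairs elements by position and pads shorter arrays with null.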