Struct & Map Handling | Spark Practical Scenarios

Struct & Map Type Handling.

Managing nested hierarchies using Structs for fixed schemas and Maps for flexible key-value metadata.

The struct() function bundles multiple columns into a single nested object. This is highly effective for grouping related fields like address details or technical metadata.

from pyspark.sql.functions import col, struct

# Bundling city and zip into a single 'location' struct
df_nested = df.withColumn("location", struct(
    col("city"), 
    col("zip_code"),
    col("street_address")
))

# Accessing a nested field using dot notation
df_city_only = df_nested.select("location.city")
    

While StructType enforces a fixed schema, MapType allows a variable set of keys per row. Use create_map to build a map from existing columns when you need to iterate over attributes or handle keys that differ between rows.

from pyspark.sql.functions import create_map, lit

# Manually converting fields to a map for dynamic key access
df_mapped = df.withColumn("technical_details", create_map(
    lit("os"), col("os_version"),
    lit("browser"), col("browser_name")
))
    

Handling complex types requires careful schema definitions. Using StructType and StructField, you can programmatically define deeply nested structures for JSON or Parquet ingestion.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Defining an explicit schema for nested ingestion
custom_schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("profile", StructType([
        StructField("bio", StringType(), True),
        StructField("website", StringType(), True)
    ]))
])
    
Q: When should I use a Struct vs. a Map?
A: Use a Struct when the fields are known and fixed (e.g., 'FirstName', 'LastName'). Use a Map when the keys are unpredictable or highly variable across rows (e.g., 'TagCloud' or 'UserMetadata').
Q: How do you "flatten" a Struct into top-level columns?
A: Use df.select("struct_col.*"). The .* operator expands all internal fields of the struct into individual columns in the parent DataFrame.
Q: Can a Map have different types for its values?
A: No. In Spark, a MapType requires all keys to be the same type (e.g., String) and all values to be the same type (e.g., Double). If you need mixed types, you must use a Struct.