Struct & Map Type Handling
Managing nested hierarchies using Structs for fixed schemas and Maps for flexible key-value metadata.
1. Creating Structs (Nesting Columns)
The struct() function bundles multiple columns into a single nested object. This is highly effective for grouping related fields like address details or technical metadata.
from pyspark.sql.functions import col, struct

# Bundling the address fields into a single 'location' struct
df_nested = df.withColumn("location", struct(
    col("city"),
    col("zip_code"),
    col("street_address")
))

# Accessing a nested field using dot notation
df_city_only = df_nested.select("location.city")
2. Converting Structs to Maps
While StructType has a fixed schema, MapType allows a flexible set of keys. You can build a map from individual columns (or from a struct's fields via dot notation) when you need to iterate over attributes or handle keys that vary across rows.
from pyspark.sql.functions import create_map, lit

# Manually converting fields to a map for dynamic key access
df_mapped = df.withColumn("technical_details", create_map(
    lit("os"), col("os_version"),
    lit("browser"), col("browser_name")
))
3. Schema Evolution with Nested Data
Handling complex types requires careful schema definitions. Using StructType and StructField, you can programmatically define deeply nested structures for JSON or Parquet ingestion.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Defining an explicit schema for nested ingestion
custom_schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("profile", StructType([
        StructField("bio", StringType(), True),
        StructField("website", StringType(), True)
    ]), True)
])
Interview Q&A
Q: When should I use a Struct vs. a Map?
Use a Struct when the fields are known and fixed (e.g., 'FirstName', 'LastName'). Use a Map when the keys are unpredictable or highly variable across rows (e.g., 'TagCloud' or 'UserMetadata').
Q: How do you "flatten" a Struct into top-level columns?
You can use df.select("struct_col.*"). The .* operator expands all internal fields of the struct into individual columns in the parent DataFrame.
Q: Can a Map have different types for its values?
No. In Spark, a MapType requires all keys to be the same type (e.g., String) and all values to be the same type (e.g., Double). If you need mixed types, you must use a Struct.