1. Primitive Type Casting
Type Casting & Data Normalization.
Converting raw strings to strict types and managing MapType transformations for metadata-heavy datasets.
In the Bronze-to-Silver layer, the most common task is converting StringType columns into their correct numeric or date formats. In PySpark, we use the cast() method.
from pyspark.sql.functions import col
# Casting price from string to double and quantity to integer
df_normalized = df.withColumn("price", col("raw_price").cast("double")) \
    .withColumn("quantity", col("raw_qty").cast("int")) \
    .withColumn("event_date", col("timestamp").cast("date"))
2. Advanced: Columns to Map
Data normalization often involves consolidating multiple sparse columns into a single MapType (key-value pair). This is useful for storing attributes that vary across records without bloating the schema.
from pyspark.sql.functions import create_map, lit
# Consolidating 'color' and 'material' columns into an 'attributes' map
df_mapped = df.withColumn("attributes", create_map(
    lit("color"), col("color"),
    lit("material"), col("material")
))
3. Map to Columns (Exploding Attributes)
Conversely, you may need to "pivot" a map back into individual columns for easier querying or filtering.
# Extracting specific keys from a map column into standalone columns
df_final = df_mapped.select(
    "product_id",
    col("attributes")["color"].alias("product_color"),
    col("attributes")["material"].alias("product_material")
)
Interview Q&A
Q: What happens if cast() fails (e.g., casting 'ABC' to an Integer)?
By default, Spark returns null for values that cannot be parsed; the job does not fail. With ANSI mode enabled (spark.sql.ansi.enabled=true), the same cast raises a runtime error instead.
Q: Why use a MapType instead of separate columns?
MapTypes are ideal for highly sparse data where most rows have nulls for most attributes. A single map column keeps the schema compact and reduces metadata overhead in the Hive Metastore or Unity Catalog.
Q: How do you check the current data types of a DataFrame?
You can use df.printSchema() for a visual tree, or df.dtypes to get a list of tuples containing column names and their string-represented types.