Column Manipulation Essentials
Efficiently shaping your data schema by adding, renaming, and transforming columns in PySpark.
In PySpark, DataFrames are immutable. When we "add" or "rename" a column, Spark creates a new execution plan that represents the new state. Mastery of functions like withColumn and select allows for clean, readable ETL pipelines.
1. Adding and Updating Columns
The withColumn() method is used to either add a new column or replace an existing one of the same name.
from pyspark.sql.functions import col, lit, current_date
# Adding a constant value and a calculated column
df_updated = df.withColumn("ingestion_date", current_date()) \
    .withColumn("total_price", col("quantity") * col("unit_price"))
# Updating an existing column (Type Casting)
df_final = df_updated.withColumn("quantity", col("quantity").cast("integer"))
2. Renaming and Dropping
Cleaning up a DataFrame often involves removing temporary variables or aligning column names with a Target Delta Table schema.
# Renaming a single column
df_renamed = df.withColumnRenamed("old_name", "new_name")
# Dropping multiple columns at once
df_cleaned = df_renamed.drop("temp_id", "internal_flag", "raw_string")
3. Selecting and Aliasing
For extensive schema changes, a single select() is often preferable to chaining many withColumn calls: each withColumn adds its own projection to the plan, while select defines the final shape of the DataFrame in one block.
# Selecting a subset and aliasing on the fly
df_selected = df.select(
    col("user_id").alias("CustomerID"),
    col("email"),
    (col("age") + 1).alias("next_year_age")
)
Interview Q&A