← All Scenarios
1. Parsing & Normalization
Temporal Data & Time Travel.
Mastering date arithmetic, format conversions, and interval calculations for time-series analysis.
Data ingestion often yields dates in non-standard string formats (e.g., "MM/dd/yyyy"). To perform arithmetic, we must first parse them using to_date or to_timestamp.
from pyspark.sql.functions import to_date, to_timestamp, col
# Converting raw strings with custom formats
df_clean = df.withColumn("order_date", to_date(col("raw_date"), "MM/dd/yyyy")) \
.withColumn("process_time", to_timestamp(col("raw_time"), "yyyy-MM-dd HH:mm:ss"))
2. Date Arithmetic & Differences
Calculating lead times or expirations requires datediff (for days) and add_months (for logical month steps).
from pyspark.sql.functions import datediff, add_months, current_date
# Calculating days until delivery and adding a 3-month trial end date
df_calc = df_clean.withColumn("days_to_ship", datediff(col("ship_date"), col("order_date"))) \
.withColumn("trial_expiry", add_months(col("order_date"), 3)) \
.withColumn("report_run_date", current_date())
3. Time Intervals & Precise Differences
For sub-day precision, we convert timestamps to Unix seconds to calculate exact hours or minutes between events.
from pyspark.sql.functions import unix_timestamp
# Calculating process duration in minutes
time_diff_df = df.withColumn("duration_min",
(unix_timestamp("end_time") - unix_timestamp("start_time")) / 60
)
Interview Q&A
Q: What is the difference between current_date() and current_timestamp()?
current_date() returns only the date part (yyyy-MM-dd) with no time component. current_timestamp() provides the full precision including hours, minutes, seconds, and milliseconds, typically in UTC.
Q: Why use add_months() instead of simply adding 30 days?
add_months() is month-aware. It handles the different lengths of months (28, 30, 31 days) and leap years correctly, whereas adding a fixed 30 days would cause "date drift" over time.
Q: How do you handle Timezones in Spark?
By default, Spark assumes timestamps are in the session local timezone. You can use from_utc_timestamp and to_utc_timestamp to explicitly convert between specific zones like 'EST' or 'GMT'.