Date & Timestamp Operations | Spark Practical Scenarios

Temporal Data & Time Travel

Mastering date arithmetic, format conversions, and interval calculations for time-series analysis.

Data ingestion often yields dates as strings in non-standard formats (e.g., "MM/dd/yyyy"). Before any arithmetic, we must parse them with to_date or to_timestamp; values that don't match the supplied pattern come back as null (or raise an error under ANSI mode), so it's worth checking for nulls after conversion.

from pyspark.sql.functions import to_date, to_timestamp, col

# Converting raw strings with custom formats
df_clean = df.withColumn("order_date", to_date(col("raw_date"), "MM/dd/yyyy")) \
             .withColumn("process_time", to_timestamp(col("raw_time"), "yyyy-MM-dd HH:mm:ss"))

Calculating lead times or expirations relies on datediff(end, start), which returns the difference in whole days, and add_months for calendar-aware month steps.

from pyspark.sql.functions import datediff, add_months, current_date

# Calculating days until delivery and adding a 3-month trial end date
df_calc = df_clean.withColumn("days_to_ship", datediff(col("ship_date"), col("order_date"))) \
                  .withColumn("trial_expiry", add_months(col("order_date"), 3)) \
                  .withColumn("report_run_date", current_date())

For sub-day precision, we convert timestamps to Unix seconds to calculate exact hours or minutes between events.

from pyspark.sql.functions import unix_timestamp

# Calculating process duration in minutes
time_diff_df = df.withColumn("duration_min", 
    (unix_timestamp("end_time") - unix_timestamp("start_time")) / 60
)

Q: What is the difference between current_date() and current_timestamp()?
A: current_date() returns only the date part (yyyy-MM-dd) with no time component. current_timestamp() returns a full timestamp with hour, minute, second, and microsecond precision, rendered in the session time zone (spark.sql.session.timeZone).
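As a rough plain-Python analogy (stdlib only, not Spark), the same date-versus-timestamp split looks like:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
# current_date()-style value: the date part only
print(now.date().isoformat())
# current_timestamp()-style value: full sub-second precision
print(now.isoformat())
```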
Q: Why use add_months() instead of simply adding 30 days?
A: add_months() is calendar-aware: it respects the varying lengths of months (28-31 days) and leap years, and clamps month-end dates (e.g., Jan 31 plus one month yields Feb 29 in a leap year). Adding a fixed 30 days instead causes gradual "date drift" over time.
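The month-aware behavior can be mimicked in plain Python. This is a sketch of the clamping semantics Spark applies, not Spark's actual implementation:

```python
import calendar
from datetime import date, timedelta

def add_months(d: date, n: int) -> date:
    """Advance d by n months, clamping the day to the target month's
    length (mirrors how Spark's add_months handles month-end dates)."""
    total = d.month - 1 + n
    year, month = d.year + total // 12, total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

print(add_months(date(2024, 1, 31), 1))        # 2024-02-29: clamped to leap day
print(date(2024, 1, 31) + timedelta(days=30))  # 2024-03-01: 30-day "drift"
```

Note that a naive `+ timedelta(days=30)` lands in March, while the month-aware step stays at the end of February.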
Q: How do you handle time zones in Spark?
A: By default, Spark interprets timestamps in the session time zone (spark.sql.session.timeZone). Use from_utc_timestamp and to_utc_timestamp to convert explicitly between zones, and prefer full region IDs such as 'America/New_York' over ambiguous three-letter abbreviations like 'EST'.
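The underlying conversion can be sketched with Python's zoneinfo module; this illustrates the semantics only, and in a Spark job you would stay with from_utc_timestamp:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Treat a wall-clock value as UTC, then view it in a regional zone --
# conceptually what from_utc_timestamp(ts, 'America/New_York') does.
utc_ts = datetime(2024, 6, 1, 12, 0, tzinfo=ZoneInfo("UTC"))
ny_ts = utc_ts.astimezone(ZoneInfo("America/New_York"))
print(ny_ts)  # 2024-06-01 08:00:00-04:00 (EDT, four hours behind UTC)
```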