DataFrame vs Dataset TL;DR

DataFrame vs Dataset

TL;DR: DataFrame = untyped (fast, use by default) • Dataset = typed (safer)


2025 Verdict

Use DataFrame 99% of the time

Dataset[T] is effectively deprecated for new code.
Same performance since Spark 3.4+ • Less boilerplate with DataFrame

Head-to-Head (2025 Reality)

                      DataFrame               Dataset[T]
Type Safety           No (Row)                Yes (case class)
Performance           100%                    100% (Spark 3.4+)
Catalyst Optimizer    Full                    Full
Tungsten Encoding     Yes                     Yes
UDFs                  Slow (serialization)    Can be fast (Encoders)
Python / R            Yes                     Scala/Java only
Community Usage       95%+                    <5%
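
The type-safety row is the practical difference that matters day to day. Here is a minimal Scala sketch of what it looks like (the User case class, app name, and column names are made up for illustration):

import org.apache.spark.sql.SparkSession

case class User(name: String, age: Int)

val spark = SparkSession.builder.appName("type-safety-demo").getOrCreate()
import spark.implicits._

val df = Seq(User("Ada", 36), User("Bo", 28)).toDF()  // DataFrame = Dataset[Row]
val ds = Seq(User("Ada", 36), User("Bo", 28)).toDS()  // Dataset[User]

// DataFrame: a misspelled column only fails at runtime (AnalysisException)
// df.select("agee").show()

// Dataset: the same typo fails at compile time, before the job ever runs
// ds.map(_.agee).show()

df.select("age").show()   // untyped: checked only when the plan is analyzed
ds.map(_.age).show()      // typed: checked by the Scala compiler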

Why They’re Both Fast: Catalyst + Tungsten

Catalyst Optimizer

  • Predicate pushdown
  • Column pruning
  • Join reordering
  • Constant folding

Tungsten (Phase 2)

  • Off-heap memory
  • Binary (not JVM objects)
  • Code generation
  • Cache-friendly layout
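
To watch both engines at work, ask Spark for the optimized plan. A minimal sketch, assuming a Parquet dataset at a made-up path (/data/people.parquet); in the formatted physical plan, predicate pushdown shows up as PushedFilters on the scan and column pruning as a narrowed ReadSchema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

// Hypothetical path, substitute your own dataset
val people = spark.read.parquet("/data/people.parquet")

// Only the filter and the two selected columns survive into the scan
people
  .filter(col("age") > 30)
  .select("name", "age")
  .explain("formatted")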

UDFs: The Real Performance Killer

  • Slow: Regular UDF (per-row serialization), 10–100× slower
  • Better: Pandas UDF (vectorized), 2–5× improvement
  • Best: Built-in functions, no serialization overhead
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# NEVER do this: every value is shipped to Python and back, row by row
df.withColumn("doubled", udf(lambda x: x * 2, LongType())(col("value")))

# DO this instead: stays inside Catalyst/Tungsten, no serialization
df.withColumn("doubled", col("value") * 2)

Same Logic — Two Styles

DataFrame API (Recommended)

df.filter(col("age") > 30)
  .groupBy("country")
  .agg(avg("salary").alias("avg_salary"))
  .orderBy(desc("avg_salary"))

Dataset API (Legacy)

case class Person(name: String, age: Int, country: String, salary: Double)

ds.filter(_.age > 30)
  .groupByKey(_.country)
  .mapGroups { (country, people) =>
    // `people` is an Iterator, so materialize it before using it twice
    val group = people.toSeq
    (country, group.map(_.salary).sum / group.size)
  }

Final Answer (2025+)

Use DataFrame everywhere
Dataset[T] = legacy Scala pattern

Spark 3.5+ • Catalyst & Tungsten apply to both • Performance parity since 3.4 • Databricks, EMR, GCP, Azure