DataFrame vs Dataset
TL;DR: DataFrame = untyped (fast, universal) • Dataset[T] = typed (safer, Scala/Java only)
2025 Verdict
Use DataFrame 99% of the time
Dataset[T] is effectively deprecated for new code.
Same performance since Spark 3.4+ • Less boilerplate with DataFrame
Head-to-Head (2025 Reality)
| Feature | DataFrame | Dataset[T] |
|---|---|---|
| Type Safety | No (untyped `Row`) | Yes (case class) |
| Performance | 100% | 100% (Spark 3.4+) |
| Catalyst Optimizer | Full | Full |
| Tungsten Encoding | Yes | Yes |
| UDFs | Slow (serialization) | Can be fast (Encoders) |
| Python / R | Yes | No (Scala/Java only) |
| Community Usage | 95%+ | <5% |
Why They’re Both Fast: Catalyst + Tungsten
Catalyst Optimizer
- Predicate pushdown
- Column pruning
- Join reordering
- Constant folding
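These rewrites show up directly in the query plan. A minimal sketch, assuming a hypothetical Parquet file at `/tmp/people.parquet` with `age`, `country`, and `salary` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Hypothetical columnar source; any Parquet file with these columns works.
people = spark.read.parquet("/tmp/people.parquet")

# Catalyst pushes the filter into the scan (PushedFilters) and prunes
# unread columns (ReadSchema) -- both visible in the physical plan.
query = people.filter(col("age") > 30).select("country", "salary")
query.explain(True)  # prints parsed, analyzed, optimized, and physical plans
```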
Tungsten (Phase 2)
- Off-heap memory
- Binary (not JVM objects)
- Code generation
- Cache-friendly layout
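Whole-stage code generation can be inspected from PySpark too; a quick sketch (the query itself is just an illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tungsten-demo").getOrCreate()

# Tungsten fuses the range scan and the projection into a single
# generated Java function; mode="codegen" (Spark 3.0+) prints that code.
df = spark.range(1_000_000).selectExpr("id * 2 AS doubled")
df.explain(mode="codegen")
```

Operators that participate in whole-stage codegen appear with a `*(n)` prefix in the physical plan.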
UDFs: The Real Performance Killer
- 10–100× slower: regular UDF (row-by-row serialization)
- 2–5× slower: Pandas UDF (vectorized via Arrow)
- 1× (baseline): built-in functions
```python
# NEVER do this: a Python UDF serializes every row to a Python worker
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

double_udf = udf(lambda x: x * 2, LongType())
df.withColumn("doubled", double_udf(col("value")))

# DO this instead: built-in Column arithmetic never leaves the JVM
df.withColumn("doubled", col("value") * 2)
```
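When custom Python logic genuinely can't be expressed with built-ins, a vectorized Pandas UDF is the middle tier from the list above. A minimal sketch, assuming `df` has a long-typed `value` column:

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# Operates on Arrow-backed pd.Series batches instead of one row at a
# time, which is what cuts the overhead to roughly the 2-5x range.
@pandas_udf("long")
def doubled(v: pd.Series) -> pd.Series:
    return v * 2

df.withColumn("doubled", doubled(col("value")))
```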
Same Logic — Two Styles
DataFrame API (Recommended)
```python
from pyspark.sql.functions import avg, col, desc

(df.filter(col("age") > 30)
   .groupBy("country")
   .agg(avg("salary").alias("avg_salary"))
   .orderBy(desc("avg_salary")))
```
Dataset API (Legacy)
```scala
case class Person(name: String, age: Int, country: String, salary: Double)

ds.filter(_.age > 30)
  .groupByKey(_.country)
  .mapGroups { (country, people) =>
    // the Iterator can only be consumed once, so materialize it first
    val salaries = people.map(_.salary).toSeq
    (country, salaries.sum / salaries.size)
  }
```
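Migrating typed code is incremental rather than a rewrite: any Dataset drops down to the untyped API with `ds.toDF()`, so existing typed pipelines can adopt DataFrame-style operators one step at a time.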
Final Answer (2025+)
Use DataFrame everywhere
Dataset[T] = legacy Scala pattern