DataFrame vs Dataset
TL;DR: DataFrame = untyped (fast, universal) • Dataset[T] = typed (safer, Scala/Java only)
2025 Verdict
Use DataFrame 99% of the time
Dataset[T] is effectively deprecated for new code.
Same performance since Spark 3.4+ • Less boilerplate with DataFrame
Head-to-Head (2025 Reality)
| Feature | DataFrame | Dataset[T] |
|---|---|---|
| Type Safety | No (untyped `Row`) | Yes (case class) |
| Performance | 100% | 100% (Spark 3.4+) |
| Catalyst Optimizer | Full | Full |
| Tungsten Encoding | Yes | Yes |
| UDFs | Slow (serialization) | Can be fast (Encoders) |
| Python / R | Yes | No (Scala/Java only) |
| Community Usage | 95%+ | <5% |
Why They’re Both Fast: Catalyst + Tungsten
Catalyst Optimizer
- Predicate pushdown
- Column pruning
- Join reordering
- Constant folding
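These rewrites show up directly in the query plan. A minimal sketch, assuming a hypothetical Parquet file at `/tmp/people.parquet` with `age`, `country`, and `salary` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Hypothetical columnar source; any Parquet file with these columns works.
people = spark.read.parquet("/tmp/people.parquet")

# Catalyst pushes the filter into the scan (PushedFilters) and prunes
# unread columns (ReadSchema) -- both visible in the physical plan.
query = people.filter(col("age") > 30).select("country", "salary")
query.explain(True)  # prints parsed, analyzed, optimized, and physical plans
```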
Tungsten (Phase 2)
- Off-heap memory
- Binary (not JVM objects)
- Code generation
- Cache-friendly layout
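Whole-stage code generation can be inspected from PySpark too; a quick sketch (the query itself is just an illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tungsten-demo").getOrCreate()

# Tungsten fuses the range scan and the projection into a single
# generated Java function; mode="codegen" (Spark 3.0+) prints that code.
df = spark.range(1_000_000).selectExpr("id * 2 AS doubled")
df.explain(mode="codegen")
```

Operators that participate in whole-stage codegen appear with a `*(n)` prefix in the physical plan.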
UDFs: The Real Performance Killer
- 10–100× slower: regular UDF (row-by-row serialization)
- 2–5× slower: Pandas UDF (vectorized via Arrow)
- 1× (baseline): built-in functions
```python
# NEVER do this: a Python UDF serializes every row to a Python worker
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

double_udf = udf(lambda x: x * 2, LongType())
df.withColumn("doubled", double_udf(col("value")))

# DO this instead: built-in Column arithmetic never leaves the JVM
df.withColumn("doubled", col("value") * 2)
```
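When custom Python logic genuinely can't be expressed with built-ins, a vectorized Pandas UDF is the middle tier from the list above. A minimal sketch, assuming `df` has a long-typed `value` column:

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# Operates on Arrow-backed pd.Series batches instead of one row at a
# time, which is what cuts the overhead to roughly the 2-5x range.
@pandas_udf("long")
def doubled(v: pd.Series) -> pd.Series:
    return v * 2

df.withColumn("doubled", doubled(col("value")))
```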
Same Logic — Two Styles
DataFrame API (Recommended)
```python
from pyspark.sql.functions import avg, col, desc

(df.filter(col("age") > 30)
   .groupBy("country")
   .agg(avg("salary").alias("avg_salary"))
   .orderBy(desc("avg_salary")))
```
Dataset API (Legacy)
```scala
case class Person(name: String, age: Int, country: String, salary: Double)

ds.filter(_.age > 30)
  .groupByKey(_.country)
  .mapGroups { (country, people) =>
    // the Iterator can only be consumed once, so materialize it first
    val salaries = people.map(_.salary).toSeq
    (country, salaries.sum / salaries.size)
  }
```
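Migrating typed code is incremental rather than a rewrite: any Dataset drops down to the untyped API with `ds.toDF()`, so existing typed pipelines can adopt DataFrame-style operators one step at a time.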
Final Answer (2025+)
Use DataFrame everywhere
Dataset[T] = legacy Scala pattern