1. Cache vs. Persist
Caching & Persistence Strategies
Saving intermediate results to optimize iterative jobs and prevent redundant re-computations.
cache() is shorthand for persisting with the default storage level — MEMORY_AND_DISK for DataFrames (MEMORY_ONLY for RDDs). persist() lets you choose a specific storage level, such as DISK_ONLY or MEMORY_AND_DISK.
```python
# Simple caching — shorthand for persist(StorageLevel.MEMORY_AND_DISK) in Spark 3.x
df.cache()

# Specific persistence: keep the data only on executor-local disk.
# (PySpark always stores data serialized, so the *_SER levels exist only on the JVM side.)
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)

# Caching is lazy — trigger materialization with an action
df.count()
```
2. The Lifecycle of a Cached DataFrame
Caching is **lazy**: nothing is stored until an action is called. Once materialized, subsequent actions read from the cache rather than recomputing from the source. It is equally important to call unpersist() to free cluster resources once the DataFrame is no longer needed.
```python
# Perform multiple actions on the same DataFrame
print(df.count())             # First action: computes and caches the data
print(df.distinct().count())  # Subsequent action: reads from the cache
# Release the memory manually when done
df.unpersist()
```
3. Choosing the Right Storage Level
If your dataset fits in memory, MEMORY_ONLY is fastest. If it is larger than available RAM, MEMORY_AND_DISK keeps what fits in memory and spills the remaining partitions to the executor's local disk, avoiding recomputation and preventing job failure.
Interview Q&A
Q: When should you NOT cache a DataFrame?
Don't cache a DataFrame that is used only once. Caching adds the overhead of materializing and storing the data; if it isn't reused, you waste both time and RAM. Also avoid caching very large DataFrames that can cause Garbage Collection (GC) pressure on the executors.
Q: Is caching an Action or a Transformation?
It is a lazy transformation. cache()/persist() simply mark the DataFrame to be stored; the actual caching happens during the first action (like count() or show()) performed on that DataFrame.
Q: How do you verify if a DataFrame is cached?
You can check the "Storage" tab in the Spark UI. It lists all cached RDDs/DataFrames, their memory usage, and the percentage of partitions currently residing in memory.