Caching & Persistence | Spark Practical Scenarios

Caching & Persistence Strategies

Saving intermediate results to optimize iterative jobs and prevent redundant re-computations.

cache() is shorthand for persist() with the default storage level (MEMORY_AND_DISK for DataFrames, MEMORY_ONLY for RDDs). persist() lets you choose a specific storage level, such as disk-only or replicated variants.

# Simple caching (shorthand for MEMORY_AND_DISK in Spark 3.x)
df.cache()

# Specific persistence level. Note: PySpark always serializes data with Pickle,
# so the *_SER levels from the Scala API are not exposed; use e.g. DISK_ONLY
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)

# Trigger the cache with an action
df.count()

Caching is **lazy**: nothing is stored until an action runs. Once materialized, subsequent actions read from the cache rather than recomputing from the source. It is equally important to call unpersist() when you are done, to free cluster resources.

# Perform multiple actions on the same DF
print(df.count())            # First action: materializes the cache
print(df.distinct().count()) # Reads from the cache, not the source

# Clean up memory manually
df.unpersist()

If your data fits in memory, MEMORY_ONLY is fastest. If the data is larger than available RAM, MEMORY_AND_DISK spills the overflow to the executors' local disks; with MEMORY_ONLY, partitions that do not fit are simply dropped and recomputed on demand, which can be far slower.

Q: When should you NOT cache a DataFrame? Never cache a DataFrame that is only used once. Caching involves an overhead of writing to memory; if the data isn't reused, you're wasting time and RAM. Also, avoid caching very large DataFrames that might cause Garbage Collection (GC) pressure.
Q: Is caching an Action or a Transformation? It is a Lazy Transformation. It simply marks the DataFrame to be stored. The actual "caching" happens during the first Action (like count or show) performed on that DataFrame.
Q: How do you verify if a DataFrame is cached? You can check the "Storage" tab in the Spark UI. It lists all cached RDDs/DataFrames, their memory usage, and the percentage of partitions currently residing in memory.