Looping & Iteration | Spark Practical Scenarios

Looping & Iterative Processing

Understanding when to use driver-side loops versus distributed partition-level iteration.

Looping through a DataFrame row by row with a Python for loop is generally discouraged. Driver-side loops are appropriate only when you are iterating over metadata (such as a list of table names or file paths) to trigger separate Spark jobs.

# Iterating over a driver-side list of paths to kick off separate Spark jobs
file_list = ["sales_jan.csv", "sales_feb.csv", "sales_mar.csv"]

for file in file_list:
    df = spark.read.csv(f"s3://raw/{file}", header=True, inferSchema=True)
    # Process and append to the consolidated dataset (written as Parquet)
    df.write.mode("append").parquet("s3://silver/yearly_sales")

When you need to perform an expensive setup (like opening a database connection or loading a machine learning model), mapPartitions is the gold standard. It allows you to execute logic once per partition rather than once per row.

def process_batch(partition_iterator):
    # Setup: initialize a heavy resource once per partition
    model = load_my_custom_model()

    # Yield results lazily instead of buffering the whole partition in a list,
    # so memory usage stays bounded even for large partitions
    for row in partition_iterator:
        prediction = model.predict(row.features)
        yield (row.id, prediction)

# Applying the partition-level logic
rdd_results = df.rdd.mapPartitions(process_batch)

While map returns exactly one element for every input, flatMap lets you return zero or more elements per input, which Spark then flattens into a single RDD. This is commonly used for tokenizing text.

# Splitting sentences into individual words
words_rdd = text_df.rdd.flatMap(lambda row: row.sentence.split(" "))
Q: Why is mapPartitions faster than map/UDF for external calls? A: With map, you might open and close a database connection for every row. With mapPartitions, you open one connection, process thousands of rows, and close it once, drastically reducing the overhead.
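As a concrete sketch of the one-connection-per-partition pattern, the function below writes a partition's rows into a local SQLite database. The `/tmp/sales.db` path, the `sales` schema, and the `Row` stand-in are made-up placeholders for this illustration; a real job would pass the function to foreachPartition against your actual database:

```python
import sqlite3
from collections import namedtuple

# Stand-in for Spark Row objects in this sketch
Row = namedtuple("Row", ["id", "amount"])

def write_partition(rows):
    # Setup cost is paid once per partition, not once per row
    conn = sqlite3.connect("/tmp/sales.db")  # placeholder target database
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        ((r.id, r.amount) for r in rows),
    )
    conn.commit()
    conn.close()

# On a real DataFrame this would be: df.foreachPartition(write_partition)
```

Because foreachPartition hands each partition's iterator to the function on the executor, the connection setup happens once per partition rather than once per row.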
Q: Can you loop through DataFrame rows using .collect()? A: You can, but it is risky. .collect() pulls all data into the driver's memory; if the dataset is large, the driver will crash with an OutOfMemoryError. Prefer distributed operations, or toLocalIterator() if you truly must stream rows to the driver.
Q: What is the difference between map and flatMap? A: map transforms an input of N elements into exactly N outputs. flatMap can produce 0, 1, or many outputs per input element, and flattens the resulting collections into a single RDD of M elements.
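The N-to-M behaviour is easy to see in plain Python, where flatMap is roughly a map followed by a flatten (the sentence list here is an arbitrary example):

```python
from itertools import chain

sentences = ["the quick fox", "jumps high"]

# map-like: one output per input (2 in, 2 out -- each output is itself a list)
mapped = [s.split(" ") for s in sentences]
# -> [['the', 'quick', 'fox'], ['jumps', 'high']]

# flatMap-like: map, then flatten (2 in, 5 out)
flattened = list(chain.from_iterable(mapped))
# -> ['the', 'quick', 'fox', 'jumps', 'high']
```

This is why the tokenization example above uses flatMap: each sentence yields a variable number of words, and the nested lists must be flattened into one RDD of words.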