RDD Fundamentals: The Engine Underneath.
Understanding Resilient Distributed Datasets and the low-level functional API of Apache Spark.
The RDD (Resilient Distributed Dataset) is Spark’s primary abstraction. It is a fault-tolerant collection of elements that can be operated on in parallel. Unlike DataFrames, RDDs do not have a schema—they are just collections of Python objects.
1. Parallelization & Transformation
You create an RDD using sc.parallelize(). Common transformations include map (1-to-1) and flatMap (1-to-many).
from pyspark.sql import SparkSession

# Obtain the SparkContext (sc) via the SparkSession entry point
spark = SparkSession.builder.appName("rdd-fundamentals").getOrCreate()
sc = spark.sparkContext

# Initializing an RDD from a Python list
data = ["Hello Spark", "RDD Fundamentals", "Low Level API"]
rdd = sc.parallelize(data)

# flatMap (1-to-many): split each string into individual words
words_rdd = rdd.flatMap(lambda x: x.split(" "))

# map (1-to-1): turn each word into a (word, 1) key-value pair
pairs_rdd = words_rdd.map(lambda word: (word, 1))
2. Aggregation with reduceByKey
reduceByKey is the closest RDD equivalent of a DataFrame groupBy-and-aggregate. It merges the values for each key using an associative and commutative reduce function, combining partial results on each partition before shuffling data across the network.
# Word count: sum the 1s for each distinct word
word_counts = pairs_rdd.reduceByKey(lambda a, b: a + b)

# collect() is an action: it triggers execution and returns the results
# to the driver, e.g. [('Hello', 1), ('Spark', 1), ...] (order not guaranteed)
print(word_counts.collect())
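Conceptually, the merge that reduceByKey performs can be sketched in plain Python. This is only a local illustration (the helper name reduce_by_key is ours); it ignores partitioning and the shuffle, which is where Spark does the real work:

```python
def reduce_by_key(pairs, fn):
    """Local sketch of reduceByKey: fold the values for each key with fn.
    Spark does this per partition first, then again after the shuffle."""
    merged = {}
    for key, value in pairs:
        # If we have seen the key, merge; otherwise seed with the first value
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())

pairs = [("Hello", 1), ("Spark", 1), ("Hello", 1)]
print(reduce_by_key(pairs, lambda a, b: a + b))  # [('Hello', 2), ('Spark', 1)]
```

The requirement that fn be associative and commutative exists precisely because Spark applies it in this piecewise, order-independent fashion.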
3. Optimization: Broadcast Variables
If you have a large "lookup" list that needs to be accessed by every task, use a Broadcast Variable. This sends the data to each executor once, rather than sending it with every task, saving significant network bandwidth.
# Broadcasting a lookup dictionary to every executor
region_lookup = {"UK": "United Kingdom", "US": "United States"}
broadcast_regions = sc.broadcast(region_lookup)

# Access the broadcast value inside a transformation
codes_rdd = sc.parallelize(["UK", "US", "FR"])
mapped_rdd = codes_rdd.map(lambda code: broadcast_regions.value.get(code, "Unknown"))
# mapped_rdd.collect() -> ['United Kingdom', 'United States', 'Unknown']
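A rough back-of-envelope shows where the savings come from: a variable captured in a task closure is serialized and shipped with every task, whereas a broadcast variable is shipped once per executor. The cluster sizes and lookup size below are hypothetical, chosen only to illustrate the ratio:

```python
# Hypothetical job: a 50 MB lookup table, 200 tasks spread over 10 executors
lookup_mb = 50
tasks, executors = 200, 10

without_broadcast = lookup_mb * tasks      # shipped inside every task closure
with_broadcast = lookup_mb * executors     # shipped once to each executor

print(without_broadcast, with_broadcast)   # 10000 500
```

The more tasks per executor (and the larger the lookup), the bigger the win.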
4. Conversion to DataFrames
Pipelines that begin with RDDs (for example, data from legacy systems or socket streams) typically convert to DataFrames as early as possible, so that the Catalyst Optimizer can plan and optimize the rest of the work.
# Converting an RDD of tuples to a DataFrame with named columns
# (toDF is available once a SparkSession has been created)
df = word_counts.toDF(["word", "count"])
df.show()
Interview Q&A