Top 50 Apache Spark Interview Questions
Prepare with these top-tier questions. For the full technical deep-dive (115+ Q&A), download our comprehensive Guide.
1. What is the difference between a Spark Driver and an Executor?
The Driver is the central coordinator that runs the main() function, creates the SparkSession, and converts user code into tasks. Executors are worker processes that reside on cluster nodes, responsible for executing those tasks and storing data in memory or on disk.
2. How does Spark handle fault tolerance for RDDs?
Spark uses a Lineage Graph (DAG) to track the transformations applied to an RDD. If a partition is lost due to node failure, Spark refers to this lineage to recompute only the missing partition from the original data source.
3. What is the "Shuffle" in Spark and why is it expensive?
A shuffle is the process of redistributing data across the cluster so that rows with the same keys are grouped together. It is expensive because it involves heavy disk I/O, data serialization, and network data transfer between executors.
4. Explain the concept of Lazy Evaluation.
Lazy evaluation means Spark does not execute transformations (like map or filter) immediately; instead, it records them in a DAG. Execution begins only when an 'Action' (like collect or count) is called, which lets Spark (and, for DataFrames, the Catalyst Optimizer) optimize the entire plan at once.
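The idea can be shown without Spark at all. Below is a minimal plain-Python sketch (the `LazyDataset` class is invented for illustration, not a Spark API): transformations only record themselves, and nothing runs until the action.

```python
# A minimal plain-Python sketch of lazy evaluation: transformations are
# recorded, not executed, until an action forces the whole pipeline to run.
class LazyDataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # the recorded plan (the "DAG")

    def map(self, fn):                # transformation: just record it
        return LazyDataset(self.data, self.ops + [("map", fn)])

    def filter(self, pred):           # transformation: just record it
        return LazyDataset(self.data, self.ops + [("filter", pred)])

    def collect(self):                # action: execute the recorded plan
        rows = self.data
        for kind, fn in self.ops:
            rows = [fn(r) for r in rows] if kind == "map" else [r for r in rows if fn(r)]
        return rows

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has executed yet; collect() triggers the whole pipeline.
print(ds.collect())  # -> [20, 30, 40]
```

Having the full plan before execution is what allows an optimizer to, for example, reorder a filter before an expensive map.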
5. What are Narrow vs. Wide Transformations?
Narrow transformations (e.g., filter, map) happen within a single partition without data movement across the network. Wide transformations (e.g., groupByKey, join) require a shuffle as data from multiple partitions must be combined to produce the result.
6. What is the benefit of a Broadcast Join?
A Broadcast Join sends a small table to every executor in the cluster, allowing a large table to be joined locally without a shuffle. This significantly reduces network traffic and execution time for joining fact tables with smaller dimension tables.
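What each executor does locally can be sketched in plain Python (the table values are made up): the broadcast table becomes an in-memory hash map, and the local partition of the large table is probed against it with no shuffle.

```python
# Plain-Python sketch of the executor-side work in a broadcast join:
# the small dimension table arrives as an in-memory hash map, and the
# large table's local partition is joined against it without any shuffle.
dim = {1: "Books", 2: "Games"}                       # broadcast small table
fact_partition = [(1, 9.99), (2, 59.99), (1, 4.99)]  # one local fact partition

joined = [(key, price, dim[key]) for key, price in fact_partition if key in dim]
print(joined)  # -> [(1, 9.99, 'Books'), (2, 59.99, 'Games'), (1, 4.99, 'Books')]
```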
7. How does Spark SQL’s Catalyst Optimizer work?
Catalyst is an extensible query optimizer that transforms a SQL query into an optimized physical plan. It performs rule-based optimizations (like predicate pushdown) and cost-based optimizations to ensure the most efficient execution possible.
8. What is the difference between `cache()` and `persist()`?
`cache()` is shorthand for `persist()` with the default storage level: MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames/Datasets. `persist()` allows you to choose from various storage levels, such as MEMORY_AND_DISK or MEMORY_ONLY_SER, giving you more control over how data is stored.
9. What is a "Watermark" in Spark Structured Streaming?
A watermark is a threshold that tells Spark how long to wait for late-arriving data based on event time. Once the watermark passes, Spark can safely discard old state data from memory, preventing the state store from growing too large.
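The rule can be sketched in plain Python (timestamps and the 10-minute lateness are illustrative): the watermark is the maximum event time seen minus the allowed lateness, and any window state that ends before it is safe to drop.

```python
# Sketch of the watermark rule: watermark = max event time seen - allowed
# lateness; state for windows ending before the watermark can be dropped.
from datetime import datetime, timedelta

events = [datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 12)]
allowed_lateness = timedelta(minutes=10)

watermark = max(events) - allowed_lateness  # 10:02

# In-memory state keyed by window end time (counts are made up).
state = {datetime(2024, 1, 1, 9, 55): 3, datetime(2024, 1, 1, 10, 5): 7}
state = {end: n for end, n in state.items() if end >= watermark}  # evict old state
print(sorted(state))  # only the window ending at 10:05 survives
```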
10. Why is Parquet a better format than CSV for Spark?
Parquet is a columnar format that supports schema evolution, column pruning, and predicate pushdown, allowing Spark to read only the required columns and skip row groups at the storage level. This results in much faster I/O compared to scanning entire CSV files.
11. What is the role of an Accumulator?
Accumulators are write-only variables used to aggregate information (like error counters) across executors. While multiple executors can add to an accumulator, only the Driver program can read the final aggregated value.
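The contract can be sketched in plain Python (the partitions and error rule are invented): each task adds to its local partial value, and only the merged value on the driver side is read.

```python
# Sketch of the accumulator contract: each executor task accumulates
# locally, the driver merges the partial values, and only the driver
# reads the final total.
partitions = [[1, -2, 3], [-4, 5], [6]]

def process(partition):
    # Executor-side work: count "errors" (negative values) while processing.
    return sum(1 for x in partition if x < 0)

partial_counts = [process(p) for p in partitions]  # one partial value per task
total_errors = sum(partial_counts)                 # merged on the driver
print(total_errors)  # -> 2
```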
12. How does `coalesce` differ from `repartition`?
`repartition` increases or decreases the number of partitions by performing a full shuffle. `coalesce` is used only to decrease partitions and is more efficient because it avoids a full shuffle by merging existing partitions locally.
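A toy plain-Python sketch of the contrast (Spark's real partition assignment differs, e.g. `coalesce` merges contiguous partitions): merging keeps whole partitions together with no per-row movement, while repartitioning re-hashes every individual row.

```python
# Toy sketch: coalesce merges existing partitions wholesale, while
# repartition hashes and moves every individual row (a full shuffle).
parts = [[1, 2], [3], [4, 5], [6]]

def coalesce(parts, n):
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):     # merge whole partitions, no per-row move
        out[i % n].extend(p)
    return out

def repartition(parts, n):
    out = [[] for _ in range(n)]
    for p in parts:
        for row in p:                 # every row is hashed and relocated
            out[hash(row) % n].append(row)
    return out

print(coalesce(parts, 2))  # -> [[1, 2, 4, 5], [3, 6]]
```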
13. What is Adaptive Query Execution (AQE)?
AQE is a Spark 3.x feature that re-optimizes the query plan at runtime based on statistics gathered during execution. It can dynamically coalesce shuffle partitions and handle data skew automatically to improve performance.
14. What is a Spark Task?
A task is the smallest unit of work in Spark: it executes the computation of one stage on a single partition of data. The Driver sends these tasks to executors, where they run in parallel across the cluster.
15. Explain "Event Time" vs. "Processing Time".
Event time is the timestamp recorded when the data was generated at the source, whereas processing time is when the data is actually processed by Spark. Event-time processing is more accurate for handling late or out-of-order data.
16. What is a "Stage" in Spark?
A Stage is a set of parallel tasks that can be executed without a shuffle. A new stage is created whenever a 'Wide Dependency' (shuffle) is required, acting as a boundary in the execution plan.
17. Explain the "Small File Problem" in Spark.
The small file problem occurs when a job generates many tiny files, causing significant metadata overhead for the NameNode and slowing down subsequent reads. It is typically fixed by using `coalesce()` or `repartition()` before writing data.
18. What is "Data Skew" and how do you detect it?
Data skew occurs when one or a few partitions have significantly more data than others, causing some tasks to run much longer. You can detect it in the Spark UI by looking for a high discrepancy between the Max and Median task duration in a stage.
19. How does "Salting" help resolve data skew?
Salting involves adding a random number (the salt) to the join key of the skewed dataset and duplicating the corresponding keys in the other dataset. This force-redistributes the skewed data across more partitions, balancing the workload.
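The key manipulation can be sketched in plain Python (3 salts and the sample rows are arbitrary): the hot key on the skewed side gets a random suffix, and every key on the small side is replicated once per suffix so no matches are lost.

```python
# Sketch of salting with 3 salts: the hot key "A" is split into A_0..A_2
# on the skewed side, and every key on the other side is replicated 3 times
# so each salted variant still finds its match.
import random

SALTS = 3
skewed = [("A", v) for v in range(6)]   # one hot key on the large side
small  = [("A", "dim_row")]             # matching row on the small side

salted_big   = [(f"{k}_{random.randrange(SALTS)}", v) for k, v in skewed]
salted_small = [(f"{k}_{s}", v) for k, v in small for s in range(SALTS)]

# A join on the salted key now spreads key "A" over up to 3 partitions,
# and every salted big-side key still has a small-side partner.
small_keys = {k for k, _ in salted_small}
print(all(k in small_keys for k, _ in salted_big))  # -> True
```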
20. What is "Whole-Stage Code Generation"?
This is a Tungsten optimization where Spark collapses multiple physical operators into a single Java function at runtime. This removes virtual function calls and allows the CPU to process data at the speed of hand-written code.
21. What is "Dynamic Partition Pruning" (DPP)?
DPP is a Spark 3.0 optimization where Spark uses the results of a filter on a dimension table to skip reading irrelevant partitions of a large fact table. This drastically reduces I/O in star-schema join scenarios.
22. Explain the difference between `groupByKey` and `reduceByKey`.
`reduceByKey` performs a local combine on each mapper before shuffling data, which reduces network traffic. `groupByKey` shuffles all data across the network first, which is much less efficient and can lead to OutOfMemory errors.
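The map-side combine can be sketched in plain Python (the sample word-count data is made up): combining per partition means only one `(key, partial_sum)` per key crosses the "network", instead of every raw pair.

```python
# Sketch of why reduceByKey is cheaper: each "mapper" combines values for
# a key locally, so only one (key, partial_sum) per key per partition is
# shuffled, while groupByKey would ship every raw (key, value) pair.
from collections import defaultdict

partitions = [[("a", 1), ("a", 1), ("b", 1)], [("a", 1), ("b", 1)]]

def local_combine(partition):
    acc = defaultdict(int)
    for k, v in partition:
        acc[k] += v                   # map-side combine
    return list(acc.items())

shuffled = [kv for p in partitions for kv in local_combine(p)]
print(len(shuffled))                  # 4 records shuffled instead of 5 raw pairs

final = defaultdict(int)
for k, v in shuffled:                 # reduce-side merge
    final[k] += v
print(dict(final))                    # -> {'a': 3, 'b': 2}
```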
23. What are Broadcast Variables?
Broadcast variables are read-only variables that are cached on every machine in the cluster rather than being sent with every task. They are ideal for large lookup tables or configuration parameters used across multiple stages.
24. What is a "Checkpoint" and when is it used?
Checkpointing saves the RDD/DataFrame to a reliable storage (like HDFS) and removes its lineage. It is essential for long-running streaming jobs or iterative algorithms to prevent the lineage graph from becoming too deep.
25. What is the role of the Cluster Manager?
The Cluster Manager (e.g., YARN, Kubernetes, Mesos) is responsible for allocating resources across the cluster. Spark requests executors from the manager, which then monitors their health and lifecycle.
26. How do you handle Nulls in a DataFrame?
You can use `df.na.fill(value)` to replace nulls with a default, `df.na.drop()` to remove rows containing nulls, or the `coalesce()` column function (not the partition operation of the same name) to pick the first non-null value from a list of columns.
27. What is the Tungsten Engine?
Tungsten is a Spark sub-project focused on optimizing memory management and CPU efficiency. It uses an off-heap memory manager and binary data format to bypass JVM object overhead and garbage collection issues.
28. What is the difference between `select` and `selectExpr`?
`select` is used to pick existing columns or column objects. `selectExpr` allows you to write SQL-like expressions as strings (e.g., `selectExpr("age + 1 as next_year")`), providing more flexibility for on-the-fly transformations.
29. What is a "Window Function"?
A window function performs a calculation across a set of rows that are related to the current row (e.g., `rank()` or a moving average). Unlike aggregations, window functions do not group rows into a single output row.
30. How do you convert a DataFrame to an RDD?
You can access the underlying RDD by calling `df.rdd`. This returns an RDD of `Row` objects, allowing you to use low-level RDD transformations if the DataFrame API is insufficient.
31. What is "Exactly-once" processing in Streaming?
Exactly-once ensures that even if a failure occurs, the final output reflects each input record being processed exactly once. Spark achieves this via replayable sources, idempotent sinks, and state checkpointing.
32. What is a "Tumbling Window"?
A tumbling window is a fixed-size, non-overlapping time interval. For example, a 5-minute tumbling window will process data from 10:00-10:05, then 10:05-10:10, with no data shared between windows.
33. What is a "Sliding Window"?
A sliding window has a fixed duration but moves (slides) at a regular interval. If you have a 10-minute window that slides every 5 minutes, you will have overlapping data processed in consecutive windows.
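The two window types above can be contrasted with a small plain-Python sketch (integer timestamps for simplicity): a tumbling window assigns each timestamp to exactly one window, while a 10-unit window sliding every 5 units covers each timestamp twice.

```python
# Sketch of window assignment: a tumbling window owns each timestamp once;
# a sliding window whose size exceeds its slide covers it multiple times.
def tumbling(ts, size):
    start = (ts // size) * size
    return [(start, start + size)]

def sliding(ts, size, slide):
    wins = []
    start = (ts // slide) * slide
    while start + size > ts:          # walk back while the window still covers ts
        if start <= ts:
            wins.append((start, start + size))
        start -= slide
    return wins

print(tumbling(12, 5))     # -> [(10, 15)]
print(sliding(12, 10, 5))  # -> [(10, 20), (5, 15)]
```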
34. How do you monitor Spark memory usage?
The "Executors" tab in the Spark Web UI is the best place to monitor memory. It shows storage memory (for caching), execution memory (for shuffles), and JVM heap usage for each individual executor.
35. What is `spark.sql.shuffle.partitions`?
This parameter sets the number of partitions to use when shuffling data for joins or aggregations. The default is 200, but it often needs to be tuned higher for large datasets or lower for small ones to avoid overhead.
36. What is "Speculative Execution"?
If a task is running significantly slower than the median task in a stage, Spark will launch a duplicate (speculative) copy of that task on another node. Whichever finishes first is kept, and the other is killed.
37. Explain "Client Mode" vs. "Cluster Mode".
In Client mode, the Driver runs on the machine that submitted the job. In Cluster mode, the Driver runs on a worker node within the cluster, which is the standard for production as it survives the disconnection of the client.
38. What is a "UDF" and what is its performance impact?
A User Defined Function (UDF) allows custom logic. However, Python UDFs are slow because data must be serialized between the JVM and Python. Use built-in Spark functions or Pandas UDFs whenever possible.
39. What is "Predicate Pushdown" in Parquet?
It is the ability of Spark to push "where" or "filter" conditions down to the file level. Parquet files store metadata (min/max values), allowing Spark to skip entire blocks of data that don't match the criteria.
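The min/max skipping can be sketched in plain Python (the block contents are invented): with a filter like `value > 50`, any block whose max is below the threshold is never read at all.

```python
# Sketch of min/max block skipping: footer metadata lets the reader rule
# out whole blocks before touching their rows.
blocks = [
    {"min": 1,  "max": 40,  "rows": [5, 17, 40]},    # skipped entirely
    {"min": 30, "max": 90,  "rows": [30, 60, 90]},   # must be scanned
    {"min": 95, "max": 120, "rows": [95, 120]},      # must be scanned
]

threshold = 50
scanned, result = 0, []
for b in blocks:
    if b["max"] <= threshold:        # metadata rules this block out
        continue
    scanned += 1
    result.extend(v for v in b["rows"] if v > threshold)

print(scanned, result)  # -> 2 [60, 90, 95, 120]
```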
40. How do you join a Stream with a Static DataFrame?
Spark allows "Stream-Static" joins, which are typically used for data enrichment. The static DataFrame is joined against every micro-batch of the stream; however, Spark does not automatically detect changes in the static data.
41. What is the "Output Mode" in Structured Streaming?
Output mode defines what is written to the sink. 'Append' writes only new rows, 'Complete' rewrites the entire result, and 'Update' writes only the rows that were updated since the last trigger.
42. What is "Data Locality"?
Data Locality is Spark's attempt to run tasks on nodes where the data is physically stored (e.g., local disk or memory) to minimize network latency. Levels include PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL.
43. What is a "DAG Scheduler"?
The DAG Scheduler is the high-level scheduler that transforms the logical plan of RDDs into a physical plan of stages. It computes the best way to execute the job based on data dependencies.
44. What is `spark-submit`?
`spark-submit` is the command-line utility used to launch Spark applications on a cluster. It allows you to specify the JAR/Python file, cluster manager, memory settings, and external dependencies.
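A representative invocation might look like the following; the file names, resource numbers, and application arguments are illustrative placeholders, not recommendations:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  --conf spark.sql.shuffle.partitions=400 \
  --py-files deps.zip \
  my_job.py --input /data/events --date 2024-01-01
```

Note `--deploy-mode cluster`: as described under Client vs. Cluster mode, this places the Driver inside the cluster rather than on the submitting machine.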
45. What is the difference between `Union` and `Join`?
`Union` combines the rows of two DataFrames with the same schema (vertical). `Join` combines columns from two DataFrames based on a common key (horizontal).
46. What is "Broadcast Hash Join"?
This is generally the fastest join strategy in Spark. It broadcasts the smaller table to all executors, creating a hash table in memory to join with the larger table locally, completely avoiding a shuffle.
47. Explain "Sort-Merge Join".
Sort-Merge Join is the default for large-table joins. It shuffles data to group keys, sorts them on both sides, and then merges them. It is very robust but slower than Broadcast joins due to the shuffle.
48. How do you handle "OOM" (Out of Memory) errors?
OOM errors are handled by increasing executor memory, increasing shuffle partitions to reduce partition size, or checking for data skew. You should also ensure you aren't using `collect()` on a massive dataset.
49. What is "Kryo Serialization"?
Kryo is a fast and compact serialization framework. It is significantly more efficient than the default Java serialization, reducing the size of data sent over the network during shuffles.
50. What is a "Shuffle Hash Join"?
In a Shuffle Hash Join, Spark shuffles both datasets and builds a hash table from the smaller side within each partition. It can outperform a Sort-Merge Join when one side is small enough per partition to fit in memory, because it avoids sorting both sides.