Filtering & Deduplication | Spark Practical Scenarios

Filtering, Deduplication & Null Handling.

Ensuring data integrity by removing duplicate records, filtering specific conditions, and managing missing values.

1. Conditional Filtering

The filter() (or its alias where()) method allows you to subset your data based on logical expressions. This is a Narrow Transformation and is highly efficient in Spark.

from pyspark.sql.functions import col

# Filtering for active users in a specific age range
df_filtered = df.filter((col("status") == "active") & (col("age") >= 18))

# Filtering for non-null email addresses
df_valid = df.filter(col("email").isNotNull())

2. Deduplication (Distinct vs. Drop Duplicates)

Duplicates can occur at the row level or the key level. distinct() compares the entire row, while dropDuplicates() allows you to specify a subset of columns.

# Removing exact row duplicates
df_unique = df.distinct()

# Removing duplicates based on specific business keys (keeping first occurrence)
df_clean_users = df.dropDuplicates(["user_id", "email"])

3. Handling Nulls and Missing Data

Spark provides the na sub-module to handle missing values gracefully. You can either drop problematic rows or fill them with default values (imputation).

# Dropping rows where more than 2 columns are null
df_dropped = df.na.drop(thresh=2)

# Filling nulls with default values
df_filled = df.na.fill({
    "age": 0,
    "city": "Unknown",
    "is_active": False
})

Interview Q&A

Q: Is dropDuplicates() a Wide or Narrow transformation? It is a Wide Transformation. Spark must Shuffle data across the cluster to ensure that all records with the same keys (e.g., the same user_id) are on the same executor so it can decide which one to keep and which to drop.

Q: What is the difference between fillna() and coalesce()? fillna() is used to replace nulls across an entire column with a literal value. coalesce() is used to pick the first non-null value from a list of multiple columns in the same row.

Q: How do you filter for a list of values (SQL 'IN' clause)? You can use the isin() method: df.filter(col("country").isin(["UK", "US", "CA"])).

Filtering, Deduplication & Null Handling.

Resources

Company

Socials