Advanced Aggregation & Counting
Optimizing unique counts and using collection functions to build complex data summaries.

1. Distinct vs. Approximate Counting
Calculating exact unique values with countDistinct() is expensive because it requires a shuffle and memory-intensive tracking of every distinct key. For massive datasets, approx_count_distinct() returns an estimate with a small, tunable relative error (5% by default) and runs significantly faster.
from pyspark.sql import functions as F

# Exact vs. approximate distinct counts per category
summary_df = df.groupBy("category").agg(
    F.countDistinct("user_id").alias("exact_unique_users"),
    F.approx_count_distinct("user_id", rsd=0.05).alias("approx_unique_users"),
)
2. Collection Functions: Building Sets
Sometimes you don't just want a count; you want the actual values. collect_set() aggregates values into an array while automatically removing duplicates.
# Aggregate the unique products purchased by each customer,
# collecting the set once and sizing it afterwards
customer_profile = (
    df.groupBy("customer_id")
    .agg(F.collect_set("product_name").alias("purchased_items"))
    .withColumn("unique_item_count", F.size("purchased_items"))
)
3. Summary Statistics
Spark provides built-in functions to get a high-level view of numeric distributions quickly.
# Statistical summary of the price distribution
stats_df = df.select(
    F.mean("price").alias("avg_price"),
    F.stddev("price").alias("price_volatility"),
    F.min("price").alias("min_price"),
    F.max("price").alias("max_price"),
)
Interview Q&A
Q: When should you use count(*) vs count(column)?
count(*) counts every row, including rows that contain NULLs. count(column) counts only the rows where that specific column is NOT NULL.
Q: How does approx_count_distinct save time?
It uses the HyperLogLog algorithm, which estimates cardinality using a fixed amount of memory regardless of the data size, avoiding the massive memory overhead of keeping a list of every unique key.
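To make the fixed-memory idea concrete, here is a deliberately simplified HyperLogLog-style estimator in plain Python. This is an illustration of the principle, not Spark's implementation; the register count (2**b one-byte registers) and the single small-range correction are minimal choices:

```python
import hashlib
import math

def approx_distinct(values, b=8):
    """Toy HyperLogLog: estimate cardinality with 2**b small registers."""
    m = 1 << b
    registers = [0] * m
    for v in values:
        # Stable 64-bit hash of the value's string form
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)                     # low b bits pick a register
        w = h >> b                            # remaining 64 - b bits
        rho = (64 - b) - w.bit_length() + 1   # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rho)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:              # small-range (linear counting) correction
        return m * math.log(m / zeros)
    return raw
```

However many items stream through, memory stays at `m` registers; that is the trade Spark makes for the bounded relative error.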
Q: What is the risk of using collect_list() on a large group?
If a single group (e.g., one 'category') contains millions of records, collect_list will attempt to store all of them in a single row in memory, which often leads to OutOfMemory (OOM) errors.