Aggregation & Counting | Spark Practical Scenarios

Advanced Aggregation & Counting

Optimizing unique counts and utilizing collection functions to build complex data summaries.

Calculating exact unique values with countDistinct() is expensive because it requires a global shuffle and keeping track of every unique key in memory. For massive datasets, approx_count_distinct() returns an estimate with a small, configurable margin of error (default relative standard deviation of 5%) and runs significantly faster.

from pyspark.sql import functions as F

# Exact vs Approximate count comparison
summary_df = df.groupBy("category").agg(
    F.countDistinct("user_id").alias("exact_unique_users"),
    F.approx_count_distinct("user_id").alias("approx_unique_users")
)

Sometimes you don't just want a count; you want the actual values. collect_set() aggregates values into an array while automatically removing duplicates.

# Aggregating all unique products purchased by a customer
customer_profile = df.groupBy("customer_id").agg(
    F.collect_set("product_name").alias("purchased_items"),
    F.size(F.collect_set("product_name")).alias("unique_item_count")
)

Spark provides built-in functions to get a high-level view of numeric distributions quickly.

# Getting a statistical summary of price distribution
stats_df = df.select(
    F.mean("price").alias("avg_price"),
    F.stddev("price").alias("price_volatility"),
    F.min("price").alias("min_price"),
    F.max("price").alias("max_price")
)
Q: When should you use count(*) vs count(column)?
A: count(*) counts all rows, including those with nulls. count(column) counts only rows where that specific column is NOT NULL.
Q: How does approx_count_distinct save time?
A: It uses the HyperLogLog algorithm, which estimates cardinality using a fixed amount of memory regardless of the data size, avoiding the massive memory overhead of keeping a list of every unique key.
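To build intuition, here is a simplified, illustrative HyperLogLog sketch in plain Python (not Spark's actual implementation): each of m registers stores only the maximum number of leading zero bits seen among hashes routed to it, so memory stays fixed no matter how many values stream in.

```python
import hashlib

M = 1024        # number of registers (a power of two)
P = 10          # log2(M): bits used to pick a register
ALPHA = 0.7213 / (1 + 1.079 / M)  # bias-correction constant for large m

registers = [0] * M

def add(value: str) -> None:
    # 64-bit hash of the value (sha256 truncated, for illustration)
    h = int(hashlib.sha256(value.encode()).hexdigest(), 16) & ((1 << 64) - 1)
    idx = h >> (64 - P)                  # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)     # remaining bits feed the rank
    rank = (64 - P) - rest.bit_length() + 1  # position of leftmost 1-bit
    registers[idx] = max(registers[idx], rank)

def estimate() -> float:
    # Harmonic mean of 2^register values, scaled by the correction constant
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(20000):
    add(f"user_{i}")        # 20,000 unique values, constant memory

print(round(estimate()))    # close to 20,000
```

The standard error of this scheme is roughly 1.04/sqrt(m), so 1,024 registers give about a 3% typical error while using only 1,024 small integers of state.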
Q: What is the risk of using collect_list() on a large group?
A: If a single group (e.g., one 'category') contains millions of records, collect_list will attempt to store all of them in a single row in memory, which often leads to OutOfMemory (OOM) errors.