Aggregate Functions | Spark Practical Scenarios

Mastering Global Data Aggregations

Summarizing large-scale logistics datasets using grouping, filtering, and approximate counting.

Working with a global shipment table (shipments), we need to extract meaningful KPIs such as total revenue, average weight per region, and unique vessel counts. Simple aggregations are standard, but Spark SQL offers powerful optimizations for massive datasets.

Sample Data: Shipments

vessel_id  origin_region  status      shipping_cost
V_99       EMEA           Delivered   4500.00
V_102      APAC           In Transit  12000.00
V_99       EMEA           Delayed     500.00

The GROUP BY clause collapses rows into summary buckets. You can aggregate by multiple dimensions to see more granular trends.

-- Calculating total revenue and average cost per region/status
SELECT 
    origin_region,
    status,
    SUM(shipping_cost) as total_revenue,
    AVG(shipping_cost) as avg_cost,
    COUNT(*) as total_shipments
FROM shipments
GROUP BY origin_region, status
ORDER BY total_revenue DESC;
    

In big data, calculating exact unique counts can be slow due to memory pressure. APPROX_COUNT_DISTINCT uses the HyperLogLog++ algorithm to return an estimate within a configurable relative error (5% relative standard deviation by default), often dramatically faster than an exact COUNT(DISTINCT).

-- Using FILTER clauses and Approximate counts
SELECT 
    origin_region,
    -- Count unique vessels approximately (fast)
    APPROX_COUNT_DISTINCT(vessel_id) as est_vessel_count,
    -- Conditional aggregation using FILTER
    SUM(shipping_cost) FILTER (WHERE status = 'Delivered') as successful_revenue,
    MAX(shipping_cost) as highest_single_cost
FROM shipments
GROUP BY origin_region;
    

While WHERE filters rows before they are grouped, HAVING filters the resulting groups. Use this to find outliers or high-priority regions.

-- Finding only regions with high delay volumes
SELECT 
    origin_region,
    COUNT(*) as delayed_count
FROM shipments
WHERE status = 'Delayed'
GROUP BY origin_region
HAVING delayed_count > 100;
    
Q: What is the difference between COUNT(column) and COUNT(*)?
A: COUNT(*) counts every row, including rows where every value is NULL. COUNT(column) only counts rows where that specific column is NOT NULL.
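
A quick way to see the difference in one query. (The delivered_at column is hypothetical sample data, assumed to be NULL for shipments that have not arrived.)

-- COUNT(*) vs COUNT(column) on a column containing NULLs
SELECT
    COUNT(*) as all_rows,               -- counts every row
    COUNT(delivered_at) as delivered    -- skips rows where delivered_at IS NULL
FROM shipments;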
Q: When should you use APPROX_COUNT_DISTINCT?
A: Use it when speed matters more than an exact number, such as in dashboards or exploratory data analysis (EDA) over billions of rows. An exact COUNT(DISTINCT) must shuffle every distinct value across the network to deduplicate it; the HyperLogLog sketch is small, fixed-size, and mergeable, so most of the work stays local to each executor.
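
Spark's APPROX_COUNT_DISTINCT also accepts an optional second argument, the maximum relative standard deviation allowed. Tightening it trades memory (a larger sketch) for accuracy:

-- Requesting a tighter estimate than the 5% default
SELECT
    origin_region,
    APPROX_COUNT_DISTINCT(vessel_id, 0.01) as est_vessels  -- 1% relative error
FROM shipments
GROUP BY origin_region;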
Q: Why are aggregate functions considered 'Wide Transformations'?
A: Because data with the same grouping key (e.g., 'EMEA') might reside on different executors. Spark must shuffle that data across the network so that all rows for one group arrive at the same task for calculation.
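
You can see this shuffle in the physical plan. The exact plan text varies by Spark version, but for a grouped aggregation it will contain an Exchange (hashpartitioning) node, typically between a partial HashAggregate and a final HashAggregate; the partial step is what lets COUNT and SUM pre-aggregate locally before shuffling.

-- Inspecting the shuffle boundary: look for an Exchange node in the output
EXPLAIN
SELECT origin_region, COUNT(*) as total
FROM shipments
GROUP BY origin_region;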