Aggregate Functions | Spark Practical Scenarios

Mastering Global Data Aggregations

Summarizing large-scale logistics datasets using grouping, filtering, and approximate counting.

Working with a global shipment table (shipments), we need to extract meaningful KPIs such as total revenue, average weight per region, and unique vessel counts. Simple aggregations are standard, but Spark SQL offers powerful optimizations for massive datasets.

Sample Data: Shipments

vessel_id  origin_region  status      shipping_cost
V_99       EMEA           Delivered   4500.00
V_102      APAC           In Transit  12000.00
V_99       EMEA           Delayed     500.00

The GROUP BY clause collapses rows into summary buckets. You can aggregate by multiple dimensions to see more granular trends.

-- Calculating total revenue and average cost per region/status
SELECT 
    origin_region,
    status,
    SUM(shipping_cost) as total_revenue,
    AVG(shipping_cost) as avg_cost,
    COUNT(*) as total_shipments
FROM shipments
GROUP BY origin_region, status
ORDER BY total_revenue DESC;
    

In big data, calculating exact unique counts can be slow due to memory pressure. APPROX_COUNT_DISTINCT uses the HyperLogLog++ algorithm to return an estimate within a configurable relative error (5% relative standard deviation by default), often dramatically faster than an exact COUNT(DISTINCT).

-- Using FILTER clauses and Approximate counts
SELECT 
    origin_region,
    -- Count unique vessels approximately (fast)
    APPROX_COUNT_DISTINCT(vessel_id) as est_vessel_count,
    -- Conditional aggregation using FILTER
    SUM(shipping_cost) FILTER (WHERE status = 'Delivered') as successful_revenue,
    MAX(shipping_cost) as highest_single_cost
FROM shipments
GROUP BY origin_region;
    

While WHERE filters rows before they are grouped, HAVING filters the resulting groups. Use this to find outliers or high-priority regions.

-- Finding only regions with high delay volumes
SELECT 
    origin_region,
    COUNT(*) as delayed_count
FROM shipments
WHERE status = 'Delayed'
GROUP BY origin_region
HAVING delayed_count > 100;
    
Q: What is the difference between COUNT(column) and COUNT(*)?
A: COUNT(*) counts every row, including rows where every value is NULL. COUNT(column) only counts rows where that specific column is NOT NULL.
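
A quick way to see the difference in one query. (The delivered_at column is hypothetical sample data, assumed to be NULL for shipments that have not arrived.)

-- COUNT(*) vs COUNT(column) on a column containing NULLs
SELECT
    COUNT(*) as all_rows,               -- counts every row
    COUNT(delivered_at) as delivered    -- skips rows where delivered_at IS NULL
FROM shipments;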
Q: When should you use APPROX_COUNT_DISTINCT?
A: Use it when speed matters more than an exact number, such as in dashboards or exploratory data analysis (EDA) over billions of rows. An exact COUNT(DISTINCT) must shuffle every distinct value across the network to deduplicate it; the HyperLogLog sketch is small, fixed-size, and mergeable, so most of the work stays local to each executor.
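
Spark's APPROX_COUNT_DISTINCT also accepts an optional second argument, the maximum relative standard deviation allowed. Tightening it trades memory (a larger sketch) for accuracy:

-- Requesting a tighter estimate than the 5% default
SELECT
    origin_region,
    APPROX_COUNT_DISTINCT(vessel_id, 0.01) as est_vessels  -- 1% relative error
FROM shipments
GROUP BY origin_region;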
Q: Why are aggregate functions considered 'Wide Transformations'?
A: Because data with the same grouping key (e.g., 'EMEA') might reside on different executors. Spark must shuffle that data across the network so that all rows for one group arrive at the same task for calculation.
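
You can see this shuffle in the physical plan. The exact plan text varies by Spark version, but for a grouped aggregation it will contain an Exchange (hashpartitioning) node, typically between a partial HashAggregate and a final HashAggregate; the partial step is what lets COUNT and SUM pre-aggregate locally before shuffling.

-- Inspecting the shuffle boundary: look for an Exchange node in the output
EXPLAIN
SELECT origin_region, COUNT(*) as total
FROM shipments
GROUP BY origin_region;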