Mastering Global Data Aggregations
Summarizing large-scale logistics datasets using grouping, filtering, and approximate counting.
Working with a global shipment table (shipments), we need to extract meaningful KPIs such as total revenue, average weight per region, and unique vessel counts. Simple aggregations are standard, but Spark SQL offers powerful optimizations for massive datasets.
Sample Data: Shipments
| vessel_id | origin_region | status | shipping_cost |
|---|---|---|---|
| V_99 | EMEA | Delivered | 4500.00 |
| V_102 | APAC | In Transit | 12000.00 |
| V_99 | EMEA | Delayed | 500.00 |
1. Grouping by Multiple Dimensions
The GROUP BY clause collapses rows into summary buckets. You can aggregate by multiple dimensions, such as region and status together, to see more granular trends.
-- Calculating total revenue and average cost per region/status
SELECT
origin_region,
status,
SUM(shipping_cost) as total_revenue,
AVG(shipping_cost) as avg_cost,
COUNT(*) as total_shipments
FROM shipments
GROUP BY origin_region, status
ORDER BY total_revenue DESC;
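To make the collapsing behavior concrete, here is a small self-contained sketch that loads the three sample rows into an in-memory SQLite database (standing in for Spark SQL; the table and column names follow the article) and runs the same query. Each distinct region/status pair becomes one output row.

```python
import sqlite3

# Build an in-memory table mirroring the sample shipments data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE shipments (
        vessel_id TEXT, origin_region TEXT, status TEXT, shipping_cost REAL
    )
""")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?, ?)",
    [("V_99",  "EMEA", "Delivered",  4500.00),
     ("V_102", "APAC", "In Transit", 12000.00),
     ("V_99",  "EMEA", "Delayed",    500.00)],
)

# Same aggregation as the Spark SQL example above.
rows = conn.execute("""
    SELECT origin_region, status,
           SUM(shipping_cost) AS total_revenue,
           AVG(shipping_cost) AS avg_cost,
           COUNT(*)           AS total_shipments
    FROM shipments
    GROUP BY origin_region, status
    ORDER BY total_revenue DESC
""").fetchall()
for row in rows:
    print(row)
```

With the three sample rows, every region/status combination is unique, so each group contains a single shipment; on real data the SUM, AVG, and COUNT columns would summarize many rows per bucket.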
2. Advanced Aggregates: Filter & Approx
In big data, calculating exact unique counts can be slow because every distinct value must be tracked in memory. APPROX_COUNT_DISTINCT uses the HyperLogLog algorithm to return an estimate significantly faster in fixed memory; in Spark SQL the default maximum relative standard deviation is 5%, and you can pass a custom rsd argument to trade accuracy for speed.
-- Using FILTER clauses and Approximate counts
SELECT
origin_region,
-- Count unique vessels approximately (fast)
APPROX_COUNT_DISTINCT(vessel_id) as est_vessel_count,
-- Conditional aggregation using FILTER
SUM(shipping_cost) FILTER (WHERE status = 'Delivered') as successful_revenue,
MAX(shipping_cost) as highest_single_cost
FROM shipments
GROUP BY origin_region;
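The intuition behind HyperLogLog is that the longest run of leading zeros seen in hashed values reveals roughly how many distinct values passed through. The toy sketch below illustrates the idea (it is not Spark's implementation): the first p bits of each hash pick a register, each register remembers the longest zero-run observed, and a bias-corrected harmonic mean turns the registers into an estimate.

```python
import hashlib
import math

def approx_count_distinct(values, p=12):
    """Toy HyperLogLog sketch: estimate distinct count in O(2**p) memory."""
    m = 1 << p                  # number of registers
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                       # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)          # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1   # leading-zero run length + 1
        registers[idx] = max(registers[idx], rank)
    # Standard HLL bias correction and harmonic mean of registers
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction: fall back to linear counting
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return round(est)

est = approx_count_distinct(f"V_{i}" for i in range(10_000))
print(est)  # close to 10,000, using only 4,096 registers
```

The expected relative error is roughly 1.04 / sqrt(2**p), which is why production systems let you pick the precision: more registers means tighter estimates at the cost of memory.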
3. The HAVING Clause
While WHERE filters rows before they are grouped, HAVING filters the resulting groups. Use this to find outliers or high-priority regions.
-- Finding only regions with high delay volumes
SELECT
origin_region,
COUNT(*) as delayed_count
FROM shipments
WHERE status = 'Delayed'
GROUP BY origin_region
HAVING COUNT(*) > 100;
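The two-stage filtering can be seen end to end in a runnable sketch. Using SQLite again as a stand-in (with a toy dataset and the threshold lowered to 2 so the small sample produces a match), WHERE discards non-delayed rows before grouping, then HAVING discards whole groups whose count is too low.

```python
import sqlite3

# Toy data: EMEA has 3 delayed shipments, APAC has 1 delayed + 1 delivered.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (origin_region TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?)",
    [("EMEA", "Delayed")] * 3 + [("APAC", "Delayed"), ("APAC", "Delivered")],
)

# WHERE prunes rows before grouping; HAVING prunes the resulting groups.
rows = conn.execute("""
    SELECT origin_region, COUNT(*) AS delayed_count
    FROM shipments
    WHERE status = 'Delayed'
    GROUP BY origin_region
    HAVING COUNT(*) > 2
""").fetchall()
print(rows)  # only EMEA clears the group-level threshold
```

APAC's single delayed row survives the WHERE stage and forms a group, but that group is then dropped by HAVING, which is exactly the distinction the section describes.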
Interview Q&A