GROUP BY & Aggregation
/tldr: Summarizing and analyzing data sets.
THE CORE IDEA: Summarizing Data
Aggregation transforms multiple rows of data into a single summary row, like calculating the total sales or the average price. GROUP BY then organizes these summary rows based on shared values in a specified column.
Data Rows (Many) → GROUP BY → Summary Rows (Few)
Aggregate Functions (The Five Majors)
These functions operate on a set of rows and return a single, summary value.
-
COUNT()
Counts the number of rows.
COUNT(*)counts all rows;COUNT(column)counts non-NULL values in that column.SELECT COUNT(*) FROM orders;
-
SUM()
Calculates the total sum of all non-NULL values in a numeric column.
SELECT SUM(order_total) FROM orders;
-
AVG()
Calculates the average (mean) value of a numeric column. Ignores NULLs.
SELECT AVG(price) FROM products;
-
MAX() and MIN()
Finds the largest (MAX) or smallest (MIN) value in a column. Works on numbers, dates, and text (alphabetically).
SELECT MAX(salary), MIN(hire_date) FROM employees; -
COUNT(DISTINCT column)
Counts only the unique, non-NULL values in a column.
SELECT COUNT(DISTINCT country) FROM users;
The GROUP BY Clause
Grouping for Context
When you use an aggregate function, you often want to see the summary broken down by category (e.g., total sales *per region*). The GROUP BY clause defines these categories.
SELECT list that is NOT an aggregate function **MUST** be included in the GROUP BY clause.
SELECT
department,
AVG(salary) AS avg_dept_salary
FROM
employees
GROUP BY
department; -- Grouping by the department column
Filtering: WHERE vs. HAVING
These clauses both filter, but they operate at different stages of query execution.
The WHERE Clause
**Filters Individual Rows:** Used to filter data *before* the grouping and aggregation takes place.
It cannot reference the results of an aggregate function (e.g., SUM()).
-- Filters out part-time employees before calculating the average
WHERE is_full_time = TRUE
The HAVING Clause
**Filters Grouped Results:** Used to filter data *after* the grouping and aggregation has taken place. It is the **only** way to filter based on the result of an aggregate function.
-- Filters out groups (departments) that have a total salary less than 500,000
HAVING SUM(salary) > 500000
Putting It All Together
-- Goal: Find the average price of products in categories 10 and 20,
-- but only for those categories where the count of products is greater than 5.
SELECT
category_id, -- 1. Grouping Column
COUNT(*) AS product_count, -- 2. Aggregate Function 1
AVG(price) AS average_price -- 3. Aggregate Function 2
FROM
products
WHERE
category_id IN (10, 20) -- 4. WHERE: Filters individual rows BEFORE grouping
GROUP BY
category_id -- 5. GROUP BY: Groups remaining rows by category
HAVING
COUNT(*) > 5 -- 6. HAVING: Filters the grouped results
ORDER BY
average_price DESC; -- 7. ORDER BY: Sorts the final result set
The full SQL Query order of operations is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.
Time to analyze your data.