SQL TL;DR: Aggregation & Grouping (DQL)
Back to SQL TL;DR Hub

GROUP BY & Aggregation

/tldr: Summarizing and analyzing data sets.

Data Query Language (DQL) Summary Functions Analytical Queries

THE CORE IDEA: Summarizing Data

Aggregation transforms multiple rows of data into a single summary row, like calculating the total sales or the average price. GROUP BY then organizes these summary rows based on shared values in a specified column.

Data Rows (Many) GROUP BY Summary Rows (Few)

Aggregate Functions (The Five Majors)

These functions operate on a set of rows and return a single, summary value.

  • COUNT()

    Counts the number of rows. COUNT(*) counts all rows; COUNT(column) counts non-NULL values in that column.

    SELECT COUNT(*) FROM orders;
  • SUM()

    Calculates the total sum of all non-NULL values in a numeric column.

    SELECT SUM(order_total) FROM orders;
  • AVG()

    Calculates the average (mean) value of a numeric column. Ignores NULLs.

    SELECT AVG(price) FROM products;
  • MAX() and MIN()

    Finds the largest (MAX) or smallest (MIN) value in a column. Works on numbers, dates, and text (alphabetically).

    SELECT MAX(salary), MIN(hire_date)
    FROM employees;
                            
  • COUNT(DISTINCT column)

    Counts only the unique, non-NULL values in a column.

    SELECT COUNT(DISTINCT country) FROM users;

The GROUP BY Clause

Grouping for Context

When you use an aggregate function, you often want to see the summary broken down by category (e.g., total sales *per region*). The GROUP BY clause defines these categories.

Rule: Any column in your SELECT list that is NOT an aggregate function **MUST** be included in the GROUP BY clause.
SELECT 
    department, 
    AVG(salary) AS avg_dept_salary
FROM 
    employees
GROUP BY 
    department; -- Grouping by the department column
                

Filtering: WHERE vs. HAVING

These clauses both filter, but they operate at different stages of query execution.

The WHERE Clause

**Filters Individual Rows:** Used to filter data *before* the grouping and aggregation takes place. It cannot reference the results of an aggregate function (e.g., SUM()).

-- Filters out part-time employees before calculating the average
WHERE is_full_time = TRUE
                        

The HAVING Clause

**Filters Grouped Results:** Used to filter data *after* the grouping and aggregation has taken place. It is the **only** way to filter based on the result of an aggregate function.

-- Filters out groups (departments) that have a total salary less than 500,000
HAVING SUM(salary) > 500000
                        

Putting It All Together


-- Goal: Find the average price of products in categories 10 and 20, 
-- but only for those categories where the count of products is greater than 5.

SELECT
    category_id,                        -- 1. Grouping Column
    COUNT(*) AS product_count,          -- 2. Aggregate Function 1
    AVG(price) AS average_price         -- 3. Aggregate Function 2
FROM
    products
WHERE
    category_id IN (10, 20)             -- 4. WHERE: Filters individual rows BEFORE grouping
GROUP BY
    category_id                         -- 5. GROUP BY: Groups remaining rows by category
HAVING
    COUNT(*) > 5                        -- 6. HAVING: Filters the grouped results
ORDER BY
    average_price DESC;                 -- 7. ORDER BY: Sorts the final result set

The full SQL Query order of operations is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.

Time to analyze your data.

Mastering GROUP BY and HAVING is essential for creating meaningful business reports.