Deduplication with Collect Set & Array Distinct.
Consolidating row-level data into unique collections for user-profile and behavioral analysis.
In many Data Engineering workflows, you need to group records by a specific key (like customer_id) and list all associated values (like product_ids) without duplicates.
While collect_list keeps every occurrence, collect_set automatically removes duplicates during the aggregation process. If you already have an array column with duplicates, array_distinct is used to clean it.
Practical ExampleConsider an orders table where one customer has made multiple purchases. We want to see a unique list of every book category they are interested in.
-- Aggregating multiple rows into a unique set
SELECT
customer_id,
collect_set(category) AS unique_categories_purchased
FROM orders
GROUP BY customer_id;
Input vs Output
| customer_id | category | Resulting Array (collect_set) |
|---|---|---|
| 101 | Tech | ["Tech", "Fiction"] |
| 101 | Fiction | |
| 101 | Tech |
If your data is already nested (e.g., from a JSON source) and contains duplicate entries within the same row, use array_distinct.
-- Removing duplicates from an existing array column
SELECT
customer_id,
array_distinct(tags) AS cleaned_tags
FROM raw_user_profiles;
Interview Q&A