Array Aggregations | Spark Practical Scenarios

Deduplication with Collect Set & Array Distinct.

Consolidating row-level data into unique collections for user-profile and behavioral analysis.

The Concept

In many Data Engineering workflows, you need to group records by a specific key (like customer_id) and list all associated values (like product_ids) without duplicates.

While collect_list keeps every occurrence, collect_set automatically removes duplicates during the aggregation process. If you already have an array column with duplicates, array_distinct is used to clean it.

Practical Example

Consider an orders table where one customer has made multiple purchases. We want to see a unique list of every book category they are interested in.

-- Aggregating multiple rows into a unique set
SELECT 
    customer_id,
    collect_set(category) AS unique_categories_purchased
FROM orders
GROUP BY customer_id;

Input vs Output

customer_id	category	Resulting Array (collect_set)
101	Tech	["Tech", "Fiction"]
101	Fiction
101	Tech

Cleaning Existing Arrays

If your data is already nested (e.g., from a JSON source) and contains duplicate entries within the same row, use array_distinct.

-- Removing duplicates from an existing array column
SELECT 
    customer_id,
    array_distinct(tags) AS cleaned_tags
FROM raw_user_profiles;

Interview Q&A

Q: When would you use collect_list instead of collect_set? Use collect_list when the frequency of items matters (e.g., calculating the average number of times a user clicks a specific button). Use collect_set when you only care about the existence of an item.

Q: Does collect_set guarantee the order of elements in the array? No. Spark collections are non-deterministic regarding order because data is processed in parallel across different partitions. If order matters, you must follow the aggregation with an array_sort() function.

Q: What is the performance impact of using these functions on very large groups? If a single key has millions of associated items, collect_set can cause Out Of Memory (OOM) errors on the driver or executor because the entire array must fit into memory for that specific row.

Deduplication with Collect Set & Array Distinct.

Input vs Output

Resources

Company

Socials