Grouping, Aggregation & Ordering
Consolidating large datasets into summary statistics and organizing results for downstream reporting.

1. Grouping and Multi-Aggregations
The groupBy() method defines the keys that partition your data into groups. Following it with agg() lets you compute multiple statistics (sum, average, distinct count) in a single pass over the data.
from pyspark.sql import functions as F

# Summarize sales by region and category in a single aggregation pass
summary_df = df.groupBy("region", "category").agg(
    F.sum("revenue").alias("total_rev"),
    F.avg("units_sold").alias("avg_units"),
    F.countDistinct("customer_id").alias("unique_customers"),
)
2. Ordering and Sorting
Use orderBy() or sort() to arrange data. You can specify ascending or descending order for multiple columns to create hierarchical lists.
# Sort by region (ascending), then by revenue (descending) within each region
sorted_df = summary_df.orderBy(
    F.col("region").asc(),
    F.col("total_rev").desc(),
)
3. Performance: The Shuffle Effect
Grouping is a wide transformation: Spark must shuffle rows with the same key to the same executor. Tuning spark.sql.shuffle.partitions (default 200) is key here, since too many partitions produce swarms of tiny tasks and small output files, while too few concentrate data and risk OOM errors.
Interview Q&A
Q: What is the difference between sort() and orderBy()?
In the PySpark DataFrame API, they are functionally identical. orderBy is simply an alias for sort, though orderBy is often preferred by those with a SQL background.
Q: When should you use sortWithinPartitions()?
Use sortWithinPartitions() when you don't need a global sort across the entire dataset. It is much faster because it avoids the shuffle required for a global orderBy, sorting data only within each parallel task.
Q: How do you handle Nulls when sorting?
You can explicitly control null placement using F.col("name").asc_nulls_last() or F.col("name").desc_nulls_first() to ensure they don't interfere with your top-N results.