1. Inspecting Content with Show and Take
Sampling & Exploratory Display.
Techniques for inspecting large-scale DataFrames without overwhelming the cluster resources.
The show() method prints a tabular preview of your DataFrame to the console. By default, it displays 20 rows and truncates string values longer than 20 characters. take(n) returns the first n rows to the driver as a list of Row objects.
# Showing top 5 rows without truncating column content
df.show(5, truncate=False)
# Getting the first 3 rows as a Python list
top_rows = df.take(3)
print(top_rows[0].asDict())
2. Statistical Sampling
When a dataset is too large for local processing, use sample() to retrieve a representative fraction. This is essential for building machine learning prototypes or quick visualizations.
# Sampling 10% of the data without replacement (seed ensures reproducibility)
sampled_df = df.sample(withReplacement=False, fraction=0.1, seed=42)
print(f"Sampled count: {sampled_df.count()}")
3. Determining Shape & Metadata
Unlike Pandas, PySpark DataFrames do not have a .shape attribute, because computing the row count requires triggering an action across the cluster. We combine count() with the columns attribute to recover the dimensions.
# Getting DataFrame 'shape' (Rows, Columns)
rows = df.count()
cols = len(df.columns)
print(f"DataFrame Shape: ({rows}, {cols})")
print(f"Column Names: {df.columns}")
Interview Q&A
Q: What is the difference between show() and collect()?
show() is for visual inspection; it prints formatted results and is memory-safe as it only pulls 20 rows by default. collect() pulls 100% of the data to the driver, which can cause an OutOfMemory (OOM) error if the dataset is large.
Q: Why is df.count() sometimes slow on large DataFrames?
count() is an action that requires Spark to scan all partitions across all executors to sum up the records. If the data isn't cached, Spark has to read the entire source from disk/S3.
Q: How do you display a DataFrame in a Jupyter notebook with better formatting?
You can use display(df) in Databricks, or df.limit(5).toPandas() in standard environments to leverage the HTML rendering of a Pandas DataFrame.