Initializing PySpark DataFrames.
Methods for converting Python objects, lists, and dictionaries into distributed datasets.
In PySpark, the SparkSession (usually accessible as spark) is the entry point for creating DataFrames. While production data usually arrives from sources like S3 or databases, creating DataFrames manually is essential for unit tests, small reference tables, and configuration data.
1. Creation from Lists & Dictionaries
The simplest way to create a DataFrame is createDataFrame(). You can pass a list of tuples or Row objects, and optionally provide column names as a list.
data = [("Laptop", 1200), ("Mouse", 25), ("Monitor", 300)]
columns = ["product", "price"]
# Basic creation
df = spark.createDataFrame(data, columns)
df.show()
2. Enforcing Strict Schemas
For production code, relying on Spark to infer types from Python objects is fragile (e.g., inference fails outright on all-null columns, and date strings stay plain strings). We use StructType and StructField for explicit control.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("user_id", IntegerType(), False),
    StructField("email", StringType(), True),
    StructField("signup_date", StringType(), True)
])
users_data = [(1, "alex@example.com", "2026-01-01"), (2, "sam@example.com", "2026-01-02")]
df_users = spark.createDataFrame(users_data, schema)
3. Handling Empty DataFrames
Sometimes you need to initialize an empty DataFrame to append data to later in a loop, or to ensure a function has a consistent return type even when no data is found.
# Creating an empty DataFrame with a schema
empty_df = spark.createDataFrame([], schema)
# Checking if a DataFrame is empty (isEmpty() requires Spark 3.3+)
if df.isEmpty():
    print("No data found to process.")
Interview Q&A