Reading External Data Sources.
Efficiently ingesting CSV, JSON, and Parquet files into Spark DataFrames while managing schema integrity.
Reading data is the first step in any ETL process. Spark provides a unified DataFrameReader interface accessible via spark.read. Choosing the right options (like headers, delimiters, and multiline support) is critical for successful ingestion.
1. Reading CSV with Schema Inference
For exploratory work, the inferSchema option lets Spark make two passes over the data: one to determine column types and a second to load it.
# Reading CSV with automatic type detection
df_csv = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("s3://raw-zone/sales_data.csv"))
df_csv.printSchema()
2. Production Pattern: Explicit Schema
In production, we avoid inferSchema: the extra scan is slow, and the inferred types can shift as the underlying data changes. Instead, we define a StructType explicitly.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
user_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("salary", DoubleType(), True)
])
# Loading with a predefined schema
df = spark.read.csv("path/to/data.csv", schema=user_schema, header=True)
3. Handling Semi-Structured JSON
JSON files can be tricky when records span multiple lines or vary in structure; Spark's reader exposes specific options for both cases.
# Reading multiline JSON files
df_json = spark.read.option("multiLine", "true").json("data/users.json")
Interview Q&A