Reading External Data Sources.
Efficiently ingesting CSV, JSON, and Parquet files into Spark DataFrames while managing schema integrity.
Reading data is the first step in any ETL process. Spark provides a unified DataFrameReader interface accessible via spark.read. Choosing the right options (like headers, delimiters, and multiline support) is critical for successful ingestion.
1. Reading CSV with Schema Inference
For exploratory work, the inferSchema option lets Spark make two passes over the data: one to determine column types and a second to load it.
# Reading CSV with automatic type detection
df_csv = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("s3://raw-zone/sales_data.csv"))
df_csv.printSchema()
2. Production Pattern: Explicit Schema
In production, we avoid inferSchema: the extra scan is slow, and the inferred types can shift as the underlying data changes. Instead, we define a StructType explicitly.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
user_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("salary", DoubleType(), True)
])
# Loading with a predefined schema
df = spark.read.csv("path/to/data.csv", schema=user_schema, header=True)
3. Handling Semi-Structured JSON
JSON files can be tricky when records span multiple lines or vary in structure; Spark's reader exposes specific options for both cases.
# Reading multiline JSON files
df_json = spark.read.option("multiLine", "true").json("data/users.json")
Interview Q&A