Reading Data Sources | Spark Practical Scenarios

Reading External Data Sources

Efficiently ingesting CSV, JSON, and Parquet files into Spark DataFrames while managing schema integrity.

Reading data is the first step in any ETL process. Spark provides a unified DataFrameReader interface accessible via spark.read. Choosing the right options (like headers, delimiters, and multiline support) is critical for successful ingestion.

For exploratory work, the inferSchema option has Spark make two passes over the data: one to determine column types and one to load it.

# Reading CSV with automatic type detection
df_csv = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ",")
    .load("s3://raw-zone/sales_data.csv"))

df_csv.printSchema()

In production, we avoid inferSchema because it is slow (it triggers an extra full pass over the data) and the inferred types can shift as the data changes. Instead, we define an explicit StructType.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

user_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("salary", DoubleType(), True)
])

# Loading with a predefined schema
df = spark.read.csv("path/to/data.csv", schema=user_schema, header=True)

JSON files can be tricky if they are multiline or have varying structures. Spark handles these with specific reader options.

# Reading multiline JSON files
df_json = spark.read.option("multiLine", "true").json("data/users.json")
Q: Why is Parquet preferred over CSV in Spark?
A: Parquet is a columnar format that stores its own schema in the file metadata. Unlike CSV, it supports column pruning (reading only the required columns) and predicate pushdown (skipping row groups that cannot match a filter), plus built-in compression, making it significantly faster for Spark to process.
Q: What happens if a CSV row has more columns than the schema defines?
A: In the default PERMISSIVE mode, Spark simply drops the extra tokens; a row with more tokens than the schema is not treated as corrupt. Genuinely malformed values (for example, text in an integer column) can be captured via columnNameOfCorruptRecord, or made to fail the job by setting the mode to FAILFAST.
Q: Is reading a CSV file a Wide or Narrow transformation?
A: It is a narrow transformation: each Spark task reads a distinct chunk (split) of the file independently, so no data shuffle is required to create the initial DataFrame.