Parsing JSON-in-Parquet: Methods & Best Practices
A guide to transforming semi-structured S3 data into structured Delta tables.
In Databricks, Delta Lake (built on Parquet) is the native storage format, but external data often arrives as raw Parquet files containing nested JSON strings. Spark provides several ways to "flatten" this data, ranging from manual parsing to automated ingestion.
1. Reading External Parquet
Load the file into a DataFrame using an s3a:// or dbfs:/ path. When you point the reader at a partitioned directory, Spark handles partition discovery automatically.
# Standard Read
df = spark.read.parquet("s3a://my-bucket/customers.parquet")
df.printSchema()
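For partitioned datasets, read the directory root rather than a single file. A minimal sketch, assuming a hypothetical events/ dataset laid out by year and month:
# Hypothetical layout: s3a://my-bucket/events/year=2024/month=01/part-*.parquet
df_events = spark.read.parquet("s3a://my-bucket/events/")
# year and month appear as regular columns, inferred from the directory names
df_events.printSchema()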
2. Parsing the JSON Column
Use from_json with a defined schema. Because Spark knows the output types up front, this is generally the most efficient method for large-scale distributed processing.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# Define the expected structure of the JSON string
schema = StructType([
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("preferences", ArrayType(StringType()), True)
])

# Parse the JSON column, then promote the nested fields to top-level columns
df_parsed = df.withColumn("data", from_json(col("profile"), schema)) \
    .select("customer_id", "data.age", "data.city", "data.preferences")
3. Production Alternatives
Depending on your use case, one of these patterns may be more efficient than the manual read-parse-write flow above.
A. The CTAS Approach (SQL First)
Create a Delta table directly from the Parquet files. This registers the table in the metastore and copies the data in one atomic operation.
spark.sql("""
CREATE TABLE silver_customers
USING DELTA
AS SELECT * FROM parquet.`s3a://my-bucket/raw_data/`
""")
B. Auto Loader (Streaming Ingestion)
Best for incremental loads. It tracks which files in S3 have already been ingested and handles schema inference and schema evolution automatically; the inferred schema is persisted under cloudFiles.schemaLocation.
df_stream = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.schemaLocation", "/path/to/checkpoint") \
    .load("s3a://my-bucket/raw_data/")
C. Spark SQL get_json_object
Use get_json_object when you only need one or two specific fields and don't want to define a full schema; it returns each extracted value as a string.
from pyspark.sql.functions import get_json_object
df.select("customer_id", get_json_object("profile", "$.city").alias("city"))
Summary of Methods
| Method | Primary Benefit | Use Case |
|---|---|---|
| from_json | High performance / typed output | Production ETL |
| CTAS | Simplified metadata registration | One-time migrations |
| Auto Loader | Incremental / automatic schema handling | Continuous ingestion |
| get_json_object | No schema required | Ad-hoc extraction of a few fields |