Querying External Data | Spark Practical Scenarios

Direct Ingestion & Querying of Multi-Format Data.

Mastering self-describing and non-self-describing file formats in the Databricks Lakehouse.

Databricks allows you to run Spark SQL directly against files using the `format`.`path` syntax (for example, `SELECT * FROM json.`/path/to/files``). This is often the first step when auditing raw data in an external S3 bucket or DBFS location.

-- Querying JSON with Wildcards
SELECT *, input_file_name() AS source_file
FROM json.`${dataset.bookstore}/customers-json/export_*.json`;

-- Querying Binary Files (Useful for Images/Blobs)
SELECT * FROM binaryFile.`${dataset.bookstore}/customers-json`;

-- Querying Text (Returns one 'value' column per line)
SELECT * FROM text.`${dataset.bookstore}/customers-json`;
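
Parquet is self-describing as well, so it can be queried with the same direct syntax; a minimal sketch, assuming a hypothetical `orders-parquet` directory alongside the others:

-- Querying Parquet (schema and types come from the file footer)
SELECT * FROM parquet.`${dataset.bookstore}/orders-parquet`;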
    

Unlike Parquet or JSON, CSVs are not self-describing. When creating an external table over CSV files, you must provide the schema explicitly, along with file-specific OPTIONS such as the delimiter and header flag.

CREATE TABLE books_csv
  (book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
  header = "true",
  delimiter = ";"
)
LOCATION "${external_location}/books-csv";
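
To confirm the table is a non-Delta external table and that the options were registered, inspect its metadata with Spark's standard DESCRIBE output:

-- Verify provider (CSV), location, and storage options
DESCRIBE EXTENDED books_csv;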
    

A common pitfall with non-Delta external tables is that Spark caches the file list. If you add new files to the underlying S3 folder (via PySpark or an external process), the table count will not update automatically.

-- Table count will stay the same even if files are added
SELECT COUNT(*) FROM books_csv;

-- You must manually refresh the Metastore
REFRESH TABLE books_csv;

-- Now the count is accurate
SELECT COUNT(*) FROM books_csv;
    

While CTAS (Create Table As Select) is powerful, it does not accept file options such as delimiters or headers, so a direct CTAS over raw CSV typically produces a single column of unparsed text. To solve this, we use a TEMP VIEW as a schema-enforcement layer.
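
To see the pitfall concretely, here is a sketch of the naive approach (table name hypothetical): with no way to pass the ";" delimiter or header option, each row lands as unparsed text.

-- Naive CTAS against raw CSV: OPTIONS are not supported here,
-- so the ';' delimiter is ignored and rows arrive unparsed
CREATE TABLE books_unparsed AS
SELECT * FROM csv.`${dataset.bookstore}/books-csv`;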

-- Step 1: Create a View with defined Schema/Options
CREATE TEMP VIEW books_tmp_vw
   (book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
  path = "${dataset.bookstore}/books-csv/export_*.csv",
  header = "true",
  delimiter = ";"
);

-- Step 2: Persist as a Delta Table (CTAS)
CREATE TABLE books AS
SELECT * FROM books_tmp_vw;
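
Unlike the temp view, the resulting `books` table is a Delta table with the declared schema baked in; a quick sanity check using standard metadata commands:

-- Provider should now be DELTA, with columns typed as declared
DESCRIBE EXTENDED books;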
    
Q: Why does a SELECT on a JSON directory work immediately while CSV requires a schema?
A: JSON is self-describing; it contains keys that Spark maps to column names. CSV is just delimited text; without a schema or inferSchema, Spark cannot know whether a value is an integer, double, or string.

Q: When should you use the input_file_name() function?
A: It is a best practice during Bronze layer ingestion. It lets you trace any record back to its specific source file in S3, which is vital for debugging data quality issues.

Q: What is the limitation of CTAS when reading from a raw CSV?
A: CTAS does not support passing OPTIONS like headers or delimiters directly, which usually results in a table with a single column of unparsed text. Using a temp view or a PySpark .read() first is the solution.