Direct Ingestion & Querying of Multi-Format Data.
Mastering self-describing and non-self-describing file formats in the Databricks Lakehouse.
Databricks allows you to run Spark SQL directly against files using the `format`.`path` syntax. This is the first step in auditing raw data from an external S3 or DBFS location.
-- Querying JSON with Wildcards
SELECT *, input_file_name() source_file
FROM json.`${dataset.bookstore}/customers-json/export_*.json`;
-- Querying Binary Files (Useful for Images/Blobs)
SELECT * FROM binaryFile.`${dataset.bookstore}/customers-json`;
-- Querying Text (Returns one 'value' column per line)
SELECT * FROM text.`${dataset.bookstore}/customers-json`;
2. External Tables with CSV
Unlike Parquet or JSON, CSVs are not self-describing. When creating an external table over CSV files, you must provide the schema and specific OPTIONS like delimiters and headers.
CREATE TABLE books_csv
(book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
header = "true",
delimiter = ";"
)
LOCATION "${external_location}/books-csv";
3. The Metadata Challenge
A common pitfall with non-Delta external tables is that Spark caches the file list. If you add new files to the underlying S3 folder (via PySpark or an external process), the table count will not update automatically.
-- Table count will stay the same even if files are added
SELECT COUNT(*) FROM books_csv;
-- You must manually refresh the Metastore
REFRESH TABLE books_csv;
-- Now the count is accurate
SELECT COUNT(*) FROM books_csv;
4. CTAS vs Temp Views
While CTAS (Create Table As Select) is powerful, it often fails to parse CSVs correctly because it treats all columns as strings. To solve this, we use a TEMP VIEW as a schema-enforcement layer.
-- Step 1: Create a View with defined Schema/Options
CREATE TEMP VIEW books_tmp_vw
(book_id STRING, title STRING, author STRING, category STRING, price DOUBLE)
USING CSV
OPTIONS (
path = "${dataset.bookstore}/books-csv/export_*.csv",
header = "true",
delimiter = ";"
);
-- Step 2: Persist as a Delta Table (CTAS)
CREATE TABLE books AS
SELECT * FROM books_tmp_vw;
Interview Q&A