Writing & Merging Data | Spark Practical Scenarios
← All Scenarios

Writing, Overwriting, and Merging Tables.

Managing data lifecycle through idempotent overwrites and Delta Lake upserts.

In production, we often prefer INSERT OVERWRITE or CREATE OR REPLACE TABLE (CRAS) over simple appends. This makes pipelines idempotent—meaning you can re-run them without duplicating data.

-- Method A: Replace the entire table structure and data
CREATE OR REPLACE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`;

-- Method B: Overwrite data but keep the existing schema
INSERT OVERWRITE orders
SELECT *, current_timestamp() FROM parquet.`${dataset.bookstore}/orders`;

-- Check transaction logs to see the overwrite version
DESCRIBE HISTORY orders;
    

Use INSERT INTO when you want to add new records to an existing table without touching current data. This is common for logging and streaming sinks.

INSERT INTO orders
SELECT * FROM parquet.`${dataset.bookstore}/orders-new`;
    

The MERGE statement is the powerhouse of Delta Lake. It allows you to perform Updates, Inserts, and even Deletes in a single atomic transaction based on a join condition.

-- Example: Updating emails only if they were previously NULL
MERGE INTO customers c
USING customers_updates u
ON c.customer_id = u.customer_id
WHEN MATCHED AND c.email IS NULL AND u.email IS NOT NULL THEN
  UPDATE SET email = u.email, updated = u.updated
WHEN NOT MATCHED THEN 
  INSERT *;
    

You can also use MERGE to filter what gets ingested into your target table, ensuring only high-quality or specific data makes it through.

MERGE INTO books b
USING books_updates u
ON b.book_id = u.book_id AND b.title = u.title
WHEN NOT MATCHED AND u.category = 'Computer Science' THEN 
  INSERT *;
    
Q: What is the difference between CRAS and INSERT OVERWRITE? CREATE OR REPLACE TABLE (CRAS) fully replaces the table, including any schema changes. INSERT OVERWRITE only replaces the data; if your new query doesn't match the existing table schema, it will fail unless you enable schema evolution.
Q: Why is MERGE considered better for SCD (Slowly Changing Dimensions)? Because it handles the Upsert logic natively. Without MERGE, you would have to manually find which records changed, delete them from the target, and then insert the new versions—which is not atomic and prone to errors.
Q: How does Delta Lake track these writes? Every write operation is recorded in the Delta Log (_delta_log). You can use DESCRIBE HISTORY to see who did what and even use Time Travel to restore the table to a state before an accidental overwrite.