Writing, Overwriting, and Merging Tables.
Managing data lifecycle through idempotent overwrites and Delta Lake upserts.
In production, we often prefer INSERT OVERWRITE or CREATE OR REPLACE TABLE (CRAS) over simple appends. This makes pipelines idempotent—meaning you can re-run them without duplicating data.
-- Method A: Replace the entire table structure and data
CREATE OR REPLACE TABLE orders AS
SELECT * FROM parquet.`${dataset.bookstore}/orders`;
-- Method B: Overwrite data but keep the existing schema
INSERT OVERWRITE orders
SELECT *, current_timestamp() FROM parquet.`${dataset.bookstore}/orders`;
-- Check transaction logs to see the overwrite version
DESCRIBE HISTORY orders;
2. Appending Data
Use INSERT INTO when you want to add new records to an existing table without touching current data. This is common for logging and streaming sinks.
INSERT INTO orders
SELECT * FROM parquet.`${dataset.bookstore}/orders-new`;
3. Advanced Upserts with MERGE
The MERGE statement is the powerhouse of Delta Lake. It allows you to perform Updates, Inserts, and even Deletes in a single atomic transaction based on a join condition.
-- Example: Updating emails only if they were previously NULL
MERGE INTO customers c
USING customers_updates u
ON c.customer_id = u.customer_id
WHEN MATCHED AND c.email IS NULL AND u.email IS NOT NULL THEN
UPDATE SET email = u.email, updated = u.updated
WHEN NOT MATCHED THEN
INSERT *;
4. Conditional Inserts
You can also use MERGE to filter what gets ingested into your target table, ensuring only high-quality or specific data makes it through.
MERGE INTO books b
USING books_updates u
ON b.book_id = u.book_id AND b.title = u.title
WHEN NOT MATCHED AND u.category = 'Computer Science' THEN
INSERT *;
Interview Q&A