Disaster Recovery (DR) for the Lakehouse
/tldr: Achieving low RPO and RTO by continuously replicating both data and metadata across cloud regions.
DR Fundamentals: RPO & RTO
A DR strategy is defined by two metrics:
- Recovery Point Objective (RPO): The maximum acceptable amount of data loss, expressed as a window of time (e.g., 5 minutes, meaning up to the last 5 minutes of writes may be lost after a disruption). In practice this is tracked as replication lag, as sketched after this list.
- Recovery Time Objective (RTO): The maximum acceptable duration of time needed to restore the service after a disaster (e.g., 4 hours).
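To make these targets operational, teams usually monitor replication lag against the RPO. A minimal sketch, assuming a 5-minute RPO and that `last_replicated_at` comes from whatever replication metric is available (both the thresholds and the function are assumptions, not part of any specific tool):

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=5)   # assumed target: lose at most 5 minutes of data
RTO = timedelta(hours=4)     # assumed target: restore service within 4 hours

def rpo_breached(last_replicated_at: datetime) -> bool:
    """True if the newest data confirmed in the DR region is older than the RPO."""
    lag = datetime.now(timezone.utc) - last_replicated_at
    return lag > RPO

# Example: the last file confirmed in the secondary region landed 3 minutes ago.
print(rpo_breached(datetime.now(timezone.utc) - timedelta(minutes=3)))  # False
```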
1. Data Layer Recovery (Parquet/Delta Files)
The primary data (Parquet files, Delta log files) resides in cloud object storage (S3, ADLS, GCS). DR here means ensuring the data is copied to a secondary region.
A. Cross-Region Storage Replication
Use native cloud features (S3 Cross-Region Replication, ADLS Geo-Redundancy) to automatically copy new files from the primary region to the secondary region.
- **Benefit:** Achieves a low RPO for the raw data files; most new objects replicate to the secondary region within minutes of being written (a configuration sketch follows this list).
- **Caveat:** Replication is asynchronous and per-object, with no ordering guarantee, so Delta transaction log entries can arrive before (or after) the data files they reference, leaving the replica table transiently inconsistent. The **metastore metadata** (schemas, partitions, table locations) does not live in object storage at all and must be handled separately (see section 2).
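For S3, cross-region replication is configured per bucket. A minimal boto3 sketch, assuming both buckets already exist with versioning enabled; the bucket names and the IAM role ARN below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Both buckets must already exist with versioning enabled; the role ARN and
# bucket names are placeholders, not real resources.
s3.put_bucket_replication(
    Bucket="lakehouse-primary",                     # bucket in the primary region
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-crr-role",
        "Rules": [
            {
                "ID": "dr-replication",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},           # replicate every object
                "Destination": {
                    "Bucket": "arn:aws:s3:::lakehouse-dr",  # bucket in the DR region
                    "StorageClass": "STANDARD",
                },
                "DeleteMarkerReplication": {"Status": "Enabled"},
            }
        ],
    },
)
```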
B. Delta Deep Clones (Data and Log)
Delta Lake's `CLONE` operation is ideal for periodic, consistent backups, as it copies the data files *and* the transaction history.
- **Deep Clone:** Copies the physical data files to the target location. This is expensive but provides a fully self-contained backup.
- **Shallow Clone:** Copies only the transaction log. This is cheap and fast but relies on the original data files still being accessible (not suitable for cross-region DR).
```sql
CREATE OR REPLACE TABLE disaster_region.table_name DEEP CLONE primary_region.table_name
```
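In practice the clone is refreshed on a schedule (e.g., nightly), since re-running a deep clone only copies files added or changed since the last run. A minimal PySpark sketch, assuming `disaster_region` is a schema whose storage location lives in the DR-region bucket and that the table list is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder list of tables to back up; in practice this would be driven
# by the catalog or a config file.
tables = ["orders", "customers", "inventory"]

for table in tables:
    # Re-running DEEP CLONE refreshes the target incrementally: only files
    # added or changed since the previous clone are copied.
    spark.sql(
        f"CREATE OR REPLACE TABLE disaster_region.{table} "
        f"DEEP CLONE primary_region.{table}"
    )
```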
2. Metadata Layer Recovery (Metastore Backup)
The metastore (e.g., Hive Metastore, Unity Catalog, AWS Glue) holds the schema, partition information, and table locations. Without it, the data files are useless.
Continuous Metastore Replication
The metastore database (typically PostgreSQL or MySQL) must be continuously backed up or replicated to the disaster recovery region. Cross-region replication is generally asynchronous, so its lag contributes to the effective RPO for metadata; a minimal sketch of the replica setup and promotion follows the list below.
- **Managed Services:** Use cloud features like AWS RDS Read Replicas or Azure DB Geo-Replication to maintain a passive, read-only copy of the metastore database in the secondary region.
- **Restoration:** Upon failover, the replica is promoted to primary, allowing the secondary compute environment to immediately access the critical metadata, ensuring low RTO.
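As a concrete example, if the metastore runs on Amazon RDS, the two operations involved are creating the cross-region read replica ahead of time and promoting it at failover. A minimal boto3 sketch; the regions, instance identifiers, and ARN are placeholders:

```python
import boto3

# Client in the secondary (DR) region -- the region name is a placeholder.
rds_dr = boto3.client("rds", region_name="us-west-2")

# 1. Ahead of time: maintain a cross-region read replica of the metastore DB.
#    Cross-region replicas reference the source instance by its full ARN
#    (an encrypted source additionally requires a KmsKeyId in the DR region).
rds_dr.create_db_instance_read_replica(
    DBInstanceIdentifier="hive-metastore-dr",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:111122223333:db:hive-metastore",
)

# 2. At failover: promote the replica to a standalone, writable instance.
rds_dr.promote_read_replica(DBInstanceIdentifier="hive-metastore-dr")
```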
3. Compute and Failover Strategy
The strategy for activating compute resources in the secondary region largely determines the achievable RTO.
A. Active-Passive (Warm Standby)
The most common and cost-effective approach. Compute resources (e.g., Spark clusters, job schedulers) are generally **shut down** in the secondary region, but their configuration and infrastructure (VPCs, Networking, IAM roles) are pre-provisioned and ready to start.
- **RTO:** Moderate (e.g., 1-4 hours) as compute instances must be launched and configured before processing resumes.
- **Cost:** Low, as you only pay for storage and the minimal cost of the passive metastore replica.
B. Automated Failover
The failover process should be an orchestrated, codified runbook that can be executed quickly, typically driven by infrastructure-as-code (IaC) tools such as Terraform; a condensed sketch follows the step list below.
- **Trigger:** Manual confirmation after an outage is declared, or automated by monitoring systems.
- **Steps:**
- Stop all writes in the primary region.
- Promote the secondary metastore replica.
- Launch compute clusters in the secondary region.
- Redirect incoming job traffic and end-user read access to the secondary region.
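A condensed Python sketch of such a runbook; the helper functions, DNS names, and identifiers are placeholders, and the write-fencing and compute-launch steps are site-specific (e.g., pausing schedulers, applying Terraform):

```python
import boto3

DR_REGION = "us-west-2"                    # placeholder DR region
HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"   # placeholder Route 53 hosted zone


def stop_primary_writes():
    """Fence off writes in the primary region (site-specific: pause schedulers, revoke write roles)."""
    ...


def launch_dr_compute():
    """Start pre-provisioned clusters in the DR region (site-specific: Terraform apply, cluster APIs)."""
    ...


def failover():
    rds = boto3.client("rds", region_name=DR_REGION)
    route53 = boto3.client("route53")

    stop_primary_writes()                                 # 1. stop writes in the primary region

    rds.promote_read_replica(                             # 2. promote the metastore replica
        DBInstanceIdentifier="hive-metastore-dr"
    )

    launch_dr_compute()                                   # 3. launch compute in the secondary region

    route53.change_resource_record_sets(                  # 4. redirect jobs and readers to the DR endpoint
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "lakehouse.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "lakehouse-dr.example.com"}],
                },
            }]
        },
    )
```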
A successful DR plan requires continuous testing (regular failover drills) and a clear separation of concerns: data is protected by storage replication, metadata by metastore synchronization.