Databricks TL;DR: Cost Optimization

Cost Optimization Strategies

/tldr: Strategies to minimize Databricks Unit (DBU) consumption and cloud VM costs.

Topics: DBU Pricing · Spot Instances · Auto-Scaling · Cluster Efficiency

1. The Core Cost Driver: DBUs

The Databricks bill has two main components: Cloud VM cost (paid directly to AWS/Azure/GCP) and Databricks Unit (DBU) cost (paid to Databricks). DBUs are the proprietary unit of processing capacity.

DBU Consumption Variables

DBUs are consumed based on two factors: the **size of the cluster** (number and type of nodes) and the **compute/runtime type**; total consumption scales with how long the cluster runs. A rough cost estimate is sketched after the list below.

  • Workload Type: Interactive (All-Purpose) clusters used for notebooks are billed at a higher DBU rate per hour than Automated (Jobs) clusters used for jobs/workflows.
  • Runtime Optimization: The specific Databricks Runtime (DBR) version and whether you use Photon affect DBU consumption and job completion time.
  • Databricks SQL: DB SQL Warehouses typically offer the best price/performance for analytical SQL workloads.
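
To make the trade-offs concrete, here is a rough back-of-the-envelope estimate of a single run's cost. This is a minimal sketch; the rates (`dbu_per_node_hour`, `dbu_price`, `vm_price_per_hour`) are placeholder assumptions, not actual Databricks or cloud prices.

```python
# Back-of-the-envelope cost estimate for one job run.
# All rates are illustrative assumptions -- check the Databricks and
# cloud pricing pages for the rates that actually apply to you.

def estimate_job_cost(
    num_workers: int,
    runtime_hours: float,
    dbu_per_node_hour: float = 0.75,  # assumed DBU rate for the instance type
    dbu_price: float = 0.15,          # assumed $/DBU for jobs compute
    vm_price_per_hour: float = 0.40,  # assumed on-demand VM $/hour
) -> float:
    """Approximate total cost: DBU charge plus cloud VM charge."""
    nodes = num_workers + 1  # workers plus the driver
    dbu_cost = nodes * runtime_hours * dbu_per_node_hour * dbu_price
    vm_cost = nodes * runtime_hours * vm_price_per_hour
    return dbu_cost + vm_cost

# Example: an 8-worker cluster that runs for 2 hours.
print(f"Estimated cost: ${estimate_job_cost(8, 2.0):.2f}")
```

The same arithmetic explains why a faster runtime can lower the bill even at a higher DBU rate: both cost terms scale with `runtime_hours`.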

2. Strategic Cluster Setup

Spot / Preemptible Instances

Use **Spot Instances** for non-critical, fault-tolerant workloads (e.g., ETL jobs that can restart). They are significantly cheaper, with up to 90% savings on VM cost; a minimal cluster-spec sketch follows the list below.

  • Recommendation: Allocate 60-90% of your cluster worker nodes as Spot instances.
  • **Risk:** Spot instances can be reclaimed by the cloud provider, leading to cluster failure or delays.
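
As an illustration, the sketch below shows how a spot-heavy cluster might be described in a Clusters API payload on AWS (written here as a Python dict). The instance type, worker count, and runtime version are placeholder assumptions; with `first_on_demand: 2`, the driver and one worker stay on-demand while the remaining workers run on spot with fallback.

```python
# Illustrative Clusters API payload (AWS): mostly spot workers with fallback.
# Instance type, worker count, and runtime version are placeholders.
cluster_spec = {
    "cluster_name": "etl-spot-example",
    "spark_version": "14.3.x-scala2.12",  # pick a current LTS runtime
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
    "aws_attributes": {
        # Driver + first worker stay on-demand; the other workers run on spot,
        # falling back to on-demand if spot capacity is reclaimed.
        "first_on_demand": 2,
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
    },
}
```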

Auto-Termination and Auto-Scaling

Both settings are fundamental for preventing wasted spend on idle clusters; the sketch after this list shows how they appear in a cluster spec.

  • Auto-Termination: Set interactive clusters to terminate after 10-20 minutes of inactivity.
  • Auto-Scaling: Define a min/max worker range. Databricks dynamically adjusts cluster size based on workload demand, ensuring you only pay for compute when it's actively processing data.
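
A minimal sketch of both settings in a cluster spec (again as a Python dict for the Clusters API); the specific bounds and timeout are placeholder assumptions:

```python
# Illustrative cluster spec fragment: auto-scaling plus auto-termination.
cluster_spec = {
    "cluster_name": "interactive-dev-example",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    # Scale between 2 and 8 workers with demand instead of a fixed size.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Shut the cluster down after 15 minutes of inactivity.
    "autotermination_minutes": 15,
}
```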

Cluster Sizing (Right-Sizing)

Choosing the right VM family and size for the driver and workers is key.

  • Small or memory-bound jobs: Use smaller clusters of memory-optimized instances rather than scaling out general-purpose nodes.
  • I/O heavy jobs: Use machines with high network throughput and optimized SSDs (if available).
  • Avoid Overprovisioning: Using a cluster that's too large for a job wastes both DBU and VM costs. Use the cluster UI metrics to check CPU utilization.

3. Workload Efficiency

Leverage Photon and DLT

A faster job is a cheaper job. Photon reduces query execution time, directly lowering DBU consumption for supported workloads. Delta Live Tables (DLT) further optimizes this with Enhanced Autoscaling and automated cluster management.
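
As a sketch, Photon can be requested on a jobs cluster through the `runtime_engine` field of the cluster spec; the other values below are placeholders. Photon compute is billed at a higher DBU rate, so it pays off when the speedup outweighs that premium.

```python
# Illustrative cluster spec fragment: enabling Photon on a jobs cluster.
photon_cluster_spec = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    # Photon costs more DBUs per hour but usually finishes sooner,
    # so net DBU spend is often lower for supported workloads.
    "runtime_engine": "PHOTON",
}
```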

Delta Lake Optimization

Ensure your Delta tables are well-maintained to minimize expensive file scans.

  • Z-Ordering: Essential for large tables. Z-Ordering co-locates related data in the same set of files, dramatically speeding up predicate filtering and reducing I/O.
  • Optimize: Run `OPTIMIZE table_name` regularly to compact small files, improving read efficiency.
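
A minimal maintenance pass from a notebook or job might look like the sketch below; the table and column names are placeholders, and `spark` is the SparkSession that Databricks provides in notebooks and jobs.

```python
# Illustrative Delta maintenance pass; table/column names are placeholders.
spark.sql("""
    OPTIMIZE sales.orders
    ZORDER BY (customer_id, order_date)
""")

# Optionally clean up files no longer referenced by the table.
# Default retention is 7 days; shorter retention needs extra care.
spark.sql("VACUUM sales.orders")
```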

Cost optimization is achieved by minimizing idle time and maximizing workload speed.

Databricks Fundamentals Series: Cost Optimization