Databricks Workflows (Jobs)
/tldr: The native, managed orchestration service for building and running production data pipelines.
1. Tasks and the DAG Structure
A Databricks Workflow is a collection of one or more Tasks, executed in a specific order defined by a Directed Acyclic Graph (DAG). The DAG ensures dependencies are met before a task begins.
Task Types
A single workflow can combine multiple task types:
Notebook Task
Executes a Databricks Notebook. This is the most common type for running ETL logic (Python/Scala/SQL).
Delta Live Tables (DLT) Task
Triggers an update (refresh) of an existing DLT pipeline. Well suited to streaming workloads or batch pipelines that enforce data quality expectations.
Python Wheel / Jar Task
Runs packaged, versioned code, such as an entry-point function inside a Python wheel or a main class in a JAR, giving production-grade robustness through standard build, test, and deployment tooling.
SQL Task
Runs SQL queries against a Databricks SQL Warehouse (Endpoint). Excellent for final transformation steps or data quality assertions.
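As a rough sketch, these task types map to the following payload shapes in the Jobs API 2.1; all paths, pipeline IDs, warehouse IDs, query IDs, and package names below are hypothetical placeholders.

```python
# Sketch of Jobs API 2.1 task payloads, one per task type.
# All paths, IDs, and names below are hypothetical placeholders.
notebook_task = {"task_key": "etl_notebook",
                 "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"}}

dlt_task = {"task_key": "refresh_pipeline",
            "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"}}

wheel_task = {"task_key": "run_wheel",
              "python_wheel_task": {"package_name": "my_pkg",
                                    "entry_point": "main"}}

sql_task = {"task_key": "quality_checks",
            "sql_task": {"warehouse_id": "<sql-warehouse-id>",
                         "query": {"query_id": "<saved-query-id>"}}}
```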
Defining Dependencies (The DAG)
Tasks declare their upstream dependencies with the depends_on parameter; these references form the DAG that defines the execution order (see the sketch after this list).
- Parallelism: Tasks without mutual dependencies run simultaneously.
- Control Flow: Tasks can be set to run only on success, failure, or completion of upstream tasks.
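A minimal sketch of a two-task DAG created through the Jobs API 2.1, assuming a workspace URL and personal access token; the notebook paths are placeholders and cluster configuration is omitted for brevity.

```python
import requests

HOST = "https://<workspace-host>"      # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder credential

job_spec = {
    "name": "daily_sales_pipeline",
    "tasks": [
        {   # no dependencies: starts immediately
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
        },
        {   # starts only after 'ingest' succeeds (run_if defaults to ALL_SUCCESS)
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/team/etl/transform"},
        },
    ],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

A third task with no depends_on entry would run in parallel with ingest, since nothing in the DAG orders them relative to each other.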
2. Task Values and Scheduling
Task Values (Cross-Task Communication)
This mechanism allows a task to pass small values (such as file paths, success metrics, or row counts) to downstream tasks in the DAG.
- Setting: Use dbutils.jobs.taskValues.set('key', 'value') in the originating task.
- Getting: Use {{tasks.task_name.values.key}} in the downstream task's parameters or dbutils.jobs.taskValues.get(...) in code.
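A sketch of both sides of the handoff inside two notebook tasks, assuming a notebook context where spark and dbutils are predefined; the task key ingest and the table name are illustrative assumptions.

```python
# Upstream task (task_key = "ingest"): publish a value for later tasks.
row_count = spark.table("raw.sales").count()          # placeholder table
dbutils.jobs.taskValues.set(key="row_count", value=row_count)

# Downstream task: read the value by upstream task key.
# 'default'/'debugValue' apply when running outside a job (e.g., interactively).
rows = dbutils.jobs.taskValues.get(taskKey="ingest",
                                   key="row_count",
                                   default=0,
                                   debugValue=0)
print(f"Upstream ingest reported {rows} rows")
```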
Schedules and Triggers
Workflows can be scheduled to run automatically based on time, file arrival, or external events.
- Time-Based: Standard cron schedules (e.g., daily at 3:00 AM).
- File Arrival: Triggered when new files land in a monitored cloud storage location (e.g., an external location or volume backed by S3 or ADLS).
- API Trigger: Manual or programmatic initiation via the Jobs API run-now endpoint (e.g., from an external orchestrator such as Airflow).
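Sketches of how these triggers are expressed in a Jobs API 2.1 job specification, plus a programmatic run-now call; the cron expression, storage URL, job ID, and credentials are placeholders.

```python
import requests

HOST = "https://<workspace-host>"      # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder credential

# 'schedule' and 'trigger' sit at the top level of the job spec shown earlier.

# Time-based: Quartz cron for "daily at 03:00" in the given timezone.
schedule = {
    "quartz_cron_expression": "0 0 3 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}

# File arrival: start a run when new files land in this storage location.
trigger = {
    "file_arrival": {"url": "s3://my-bucket/landing/sales/"},
}

# API trigger: start an existing job on demand (job_id is a placeholder).
requests.post(f"{HOST}/api/2.1/jobs/run-now",
              headers={"Authorization": f"Bearer {TOKEN}"},
              json={"job_id": 123456789})
```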
3. Monitoring and Failure Management
Repair and Rerun
A critical feature for fixing failed runs without reprocessing the entire pipeline.
- Rerun Failed Tasks: Automatically reruns only the tasks that failed and any downstream tasks that depend on them.
- Repair Run: Allows the user to select specific tasks (or dependencies) to be rerun, for example, if a specific upstream data source was fixed manually.
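A sketch of a repair call against the Jobs API 2.1 runs/repair endpoint, assuming a failed run ID and a task key to rerun (both placeholders).

```python
import requests

HOST = "https://<workspace-host>"      # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder credential

# Repair a failed job run: either name specific tasks to rerun,
# or set "rerun_all_failed_tasks": True to retry everything that failed.
resp = requests.post(f"{HOST}/api/2.1/jobs/runs/repair",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json={
                         "run_id": 987654321,          # the failed run (placeholder)
                         "rerun_tasks": ["transform"], # task keys to rerun
                     })
resp.raise_for_status()
```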
Managed Compute
Each task can run on its own isolated, ephemeral job cluster, or multiple tasks can share a job cluster defined at the job level (optionally backed by an instance pool). Job clusters are created for the run and terminate automatically when it completes, providing cost efficiency and resource isolation.
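A sketch of a job-level cluster definition and a task that references it; the Spark version, node type, and worker count are illustrative values.

```python
# Job-level (ephemeral) cluster definition; created when the run starts
# and terminated automatically when the run finishes.
# Spark version, node type, and worker count are illustrative values.
job_clusters = [
    {
        "job_cluster_key": "etl_cluster",
        "new_cluster": {
            "spark_version": "14.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }
]

# A task opts into the shared job cluster by referencing its key.
task = {
    "task_key": "transform",
    "job_cluster_key": "etl_cluster",
    "notebook_task": {"notebook_path": "/Repos/team/etl/transform"},
}
```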
Workflows provide robust, native orchestration that simplifies deployment and management of mission-critical data pipelines.