Databricks Workflows (Jobs)
/tldr: The native, managed orchestration service for building and running production data pipelines.
1. Tasks and the DAG Structure
A Databricks Workflow is a collection of one or more Tasks, executed in a specific order defined by a Directed Acyclic Graph (DAG). The DAG ensures dependencies are met before a task begins.
Task Types
A single workflow can combine multiple task types:
Notebook Task
Executes a Databricks Notebook. This is the most common type for running ETL logic (Python/Scala/SQL).
Delta Live Tables (DLT) Task
Triggers an update (refresh) of an existing DLT pipeline. Well suited to streaming workloads or batch pipelines that enforce data quality expectations.
Python Wheel / Jar Task
Runs packaged, versioned code, such as an entry-point function inside a Python wheel or a main class in a JAR, giving production-grade robustness through standard build, test, and deployment tooling.
SQL Task
Runs SQL queries against a Databricks SQL Warehouse (Endpoint). Excellent for final transformation steps or data quality assertions.
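As a rough sketch, these task types map to the following payload shapes in the Jobs API 2.1; all paths, pipeline IDs, warehouse IDs, query IDs, and package names below are hypothetical placeholders.

```python
# Sketch of Jobs API 2.1 task payloads, one per task type.
# All paths, IDs, and names below are hypothetical placeholders.
notebook_task = {"task_key": "etl_notebook",
                 "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"}}

dlt_task = {"task_key": "refresh_pipeline",
            "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"}}

wheel_task = {"task_key": "run_wheel",
              "python_wheel_task": {"package_name": "my_pkg",
                                    "entry_point": "main"}}

sql_task = {"task_key": "quality_checks",
            "sql_task": {"warehouse_id": "<sql-warehouse-id>",
                         "query": {"query_id": "<saved-query-id>"}}}
```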
Defining Dependencies (The DAG)
Tasks declare their upstream dependencies with the depends_on parameter; these references form the DAG that defines the execution order (see the sketch after this list).
- Parallelism: Tasks without mutual dependencies run simultaneously.
- Control Flow: Tasks can be set to run only on success, failure, or completion of upstream tasks.
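A minimal sketch of a two-task DAG created through the Jobs API 2.1, assuming a workspace URL and personal access token; the notebook paths are placeholders and cluster configuration is omitted for brevity.

```python
import requests

HOST = "https://<workspace-host>"      # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder credential

job_spec = {
    "name": "daily_sales_pipeline",
    "tasks": [
        {   # no dependencies: starts immediately
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
        },
        {   # starts only after 'ingest' succeeds (run_if defaults to ALL_SUCCESS)
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/team/etl/transform"},
        },
    ],
}

resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

A third task with no depends_on entry would run in parallel with ingest, since nothing in the DAG orders them relative to each other.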
2. Task Values and Scheduling
Task Values (Cross-Task Communication)
This mechanism allows a task to pass small values (such as file paths, success metrics, or row counts) to downstream tasks in the DAG.
- Setting: Use dbutils.jobs.taskValues.set('key', 'value') in the originating task.
- Getting: Use {{tasks.task_name.values.key}} in the downstream task's parameters or dbutils.jobs.taskValues.get(...) in code.
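A sketch of both sides of the handoff inside two notebook tasks, assuming a notebook context where spark and dbutils are predefined; the task key ingest and the table name are illustrative assumptions.

```python
# Upstream task (task_key = "ingest"): publish a value for later tasks.
row_count = spark.table("raw.sales").count()          # placeholder table
dbutils.jobs.taskValues.set(key="row_count", value=row_count)

# Downstream task: read the value by upstream task key.
# 'default'/'debugValue' apply when running outside a job (e.g., interactively).
rows = dbutils.jobs.taskValues.get(taskKey="ingest",
                                   key="row_count",
                                   default=0,
                                   debugValue=0)
print(f"Upstream ingest reported {rows} rows")
```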
Schedules and Triggers
Workflows can be scheduled to run automatically based on time, file arrival, or external events.
- Time-Based: Standard cron schedules (e.g., daily at 3:00 AM).
- File Arrival: Triggered when new files land in a monitored cloud storage location (e.g., an external location or volume backed by S3 or ADLS).
- API Trigger: Manual or programmatic initiation via the Jobs API run-now endpoint (e.g., from an external orchestrator such as Airflow).
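Sketches of how these triggers are expressed in a Jobs API 2.1 job specification, plus a programmatic run-now call; the cron expression, storage URL, job ID, and credentials are placeholders.

```python
import requests

HOST = "https://<workspace-host>"      # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder credential

# 'schedule' and 'trigger' sit at the top level of the job spec shown earlier.

# Time-based: Quartz cron for "daily at 03:00" in the given timezone.
schedule = {
    "quartz_cron_expression": "0 0 3 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}

# File arrival: start a run when new files land in this storage location.
trigger = {
    "file_arrival": {"url": "s3://my-bucket/landing/sales/"},
}

# API trigger: start an existing job on demand (job_id is a placeholder).
requests.post(f"{HOST}/api/2.1/jobs/run-now",
              headers={"Authorization": f"Bearer {TOKEN}"},
              json={"job_id": 123456789})
```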
3. Monitoring and Failure Management
Repair and Rerun
A critical feature for fixing failed runs without reprocessing the entire pipeline.
- Rerun Failed Tasks: Automatically reruns only the tasks that failed and any downstream tasks that depend on them.
- Repair Run: Allows the user to select specific tasks (or dependencies) to be rerun, for example, if a specific upstream data source was fixed manually.
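A sketch of a repair call against the Jobs API 2.1 runs/repair endpoint, assuming a failed run ID and a task key to rerun (both placeholders).

```python
import requests

HOST = "https://<workspace-host>"      # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder credential

# Repair a failed job run: either name specific tasks to rerun,
# or set "rerun_all_failed_tasks": True to retry everything that failed.
resp = requests.post(f"{HOST}/api/2.1/jobs/runs/repair",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json={
                         "run_id": 987654321,          # the failed run (placeholder)
                         "rerun_tasks": ["transform"], # task keys to rerun
                     })
resp.raise_for_status()
```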
Managed Compute
Each task can run on its own isolated, ephemeral job cluster, or multiple tasks can share a job cluster defined at the job level (optionally backed by an instance pool). Job clusters are created for the run and terminate automatically when it completes, providing cost efficiency and resource isolation.
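A sketch of a job-level cluster definition and a task that references it; the Spark version, node type, and worker count are illustrative values.

```python
# Job-level (ephemeral) cluster definition; created when the run starts
# and terminated automatically when the run finishes.
# Spark version, node type, and worker count are illustrative values.
job_clusters = [
    {
        "job_cluster_key": "etl_cluster",
        "new_cluster": {
            "spark_version": "14.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }
]

# A task opts into the shared job cluster by referencing its key.
task = {
    "task_key": "transform",
    "job_cluster_key": "etl_cluster",
    "notebook_task": {"notebook_path": "/Repos/team/etl/transform"},
}
```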
Workflows provide robust, native orchestration that simplifies deployment and management of mission-critical data pipelines.