Databricks TL;DR: Workspace & Repos

Databricks Workspace & Repos

/tldr: The centralized environment for all data engineering, data science, and machine learning activities on the Databricks Lakehouse Platform.

Lakehouse Platform · Collaboration · Source Control (Git)

1. The Databricks Workspace

The Workspace is the primary web-based user interface where teams manage resources and execute workloads. It acts as the *operating system* for the Databricks Lakehouse, providing access to all tools, data, and compute resources.

Key Workspace Components

Notebooks

The core development environment. These are live documents that support multiple languages (Python, Scala, R, SQL) and allow users to mix code, visualizations, and narrative text.
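
For example, a single Python cell can use the cluster's built-in `spark` session and the notebook's `display()` helper, while a magic command switches another cell to SQL; the table name below is a placeholder:

```python
# Runs inside a Databricks notebook, where `spark` and `display()` are predefined.
df = spark.read.table("samples.nyctaxi.trips")        # placeholder table name
display(df.groupBy("pickup_zip").count())             # renders an interactive table/chart

# A magic command switches a single cell to another language, e.g. SQL:
# %sql
# SELECT pickup_zip, COUNT(*) AS trips FROM samples.nyctaxi.trips GROUP BY pickup_zip
```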

Compute & Clusters

The Workspace allows users to configure and manage shared computational resources (Clusters) based on Apache Spark, enabling parallel data processing.
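
As a rough sketch, a cluster can also be created programmatically with the Databricks SDK for Python (`databricks-sdk`); the runtime version, node type, and sizing below are illustrative and depend on your cloud and workspace:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads host/token from env vars or ~/.databrickscfg

# Create a small shared cluster; the values below are illustrative placeholders.
cluster = w.clusters.create(
    cluster_name="team-shared-cluster",
    spark_version="14.3.x-scala2.12",     # pick a supported Databricks Runtime
    node_type_id="i3.xlarge",             # cloud-specific instance type
    num_workers=2,
    autotermination_minutes=30,           # shut down when idle to save cost
).result()                                # waits until the cluster is running

print(cluster.cluster_id, cluster.state)
```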

Data & Catalog (Unity Catalog)

The interface to manage data access, security, and governance. This includes browsing data assets registered in Unity Catalog and configuring external locations for cloud object storage.
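
Because Unity Catalog uses a three-level `catalog.schema.table` namespace, browsing and governing data can also be done with plain SQL from a notebook; the catalog, schema, table, and group names below are hypothetical:

```python
# Explore the three-level namespace: catalog -> schema -> table.
display(spark.sql("SHOW CATALOGS"))
display(spark.sql("SHOW SCHEMAS IN main"))             # 'main' is a common default catalog
display(spark.sql("SHOW TABLES IN main.default"))

# Governance is also expressed in SQL (hypothetical table and group):
spark.sql("GRANT SELECT ON TABLE main.default.trips TO `analysts`")
```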

Jobs & Workflow Orchestration

Users convert notebooks or Delta Live Tables (DLT) pipelines into production Jobs for scheduled execution. The Workspace handles scheduling, monitoring, and task dependencies.
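
A hedged sketch of defining such a Job with the Databricks SDK for Python; the notebook path, cluster ID, and schedule are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# One scheduled job with a single notebook task; all values are illustrative.
job = w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="run_etl_notebook",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/team/etl/notebooks/etl"),
            existing_cluster_id="0401-123456-abcd123",   # or define new_cluster=... per run
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # every day at 02:00
        timezone_id="UTC",
    ),
)
print(job.job_id)
```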

2. Databricks Repos (Git Integration)

Databricks Repos is the feature that integrates the Workspace with external Git providers (GitHub, GitLab, AWS CodeCommit, Azure DevOps, etc.). This allows developers to apply **DevOps best practices** (version control, CI/CD, code review) to their Databricks notebooks and code.

Key Benefits and Workflow

Code Isolation & Branches

Developers work in isolated Git branches tied to their Repos folder in the Workspace. This prevents production code from being altered accidentally during development or testing.
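
One way to set this up programmatically, assuming the Databricks SDK for Python, is to clone the remote repository into a personal Repos folder and check out a feature branch; the URL, path, and branch name are hypothetical:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Clone the remote repo into a per-user Repos folder (path is illustrative).
repo = w.repos.create(
    url="https://github.com/acme/etl-pipelines.git",   # hypothetical repository
    provider="gitHub",
    path="/Repos/dev.user@example.com/etl-pipelines",
)

# Switch the Workspace copy to an isolated feature branch.
w.repos.update(repo_id=repo.id, branch="feature/add-quality-checks")
```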

CI/CD Integration

The Git connection enables continuous integration and continuous deployment. Once a feature branch is merged to `main` in the Git provider, automated pipelines can trigger deployment via Databricks Jobs APIs.
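
A minimal deployment step in such a pipeline might fast-forward a production Repos folder to `main` and then trigger a Job run; the IDs below are placeholders, and the SDK calls shown are one of several ways to wire this up:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

PROD_REPO_ID = 123456      # hypothetical id of /Repos/prod/etl-pipelines
NIGHTLY_JOB_ID = 987654    # hypothetical job id

# 1. Fast-forward the production Repos folder to the latest main.
w.repos.update(repo_id=PROD_REPO_ID, branch="main")

# 2. Kick off the production job against the freshly deployed code.
run = w.jobs.run_now(job_id=NIGHTLY_JOB_ID).result()   # waits for the run to finish
print(run.state.result_state)
```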

Notebooks as Files

Repos treats Databricks Notebooks as standard code files, synchronizing them with the Git repository as plain-text files (e.g., `.ipynb` or the Databricks `.py` source format), which makes diffing and merging straightforward.
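
As an illustration, a Python notebook exported in the Databricks source format is just a commented `.py` file: the header, cell separators, and magics are plain comments, so Git diffs stay readable (the content below is a made-up example):

```python
# Databricks notebook source
# MAGIC %md
# MAGIC # Daily trip aggregation

# COMMAND ----------

df = spark.read.table("main.default.trips")   # hypothetical table
display(df.groupBy("pickup_zip").count())
```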

Non-Notebook File Support

Beyond notebooks, Repos supports arbitrary files such as Python modules (`.py`), R scripts, configuration files, and utility libraries, which notebooks and Jobs can import and execute.
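
For instance, a helper module committed next to the notebooks can be imported directly, since the repo root is placed on `sys.path` when a notebook runs from a Repos folder; the file, function, and table names below are hypothetical:

```python
# utils/cleaning.py — an ordinary module committed alongside the notebooks
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def drop_null_keys(df: DataFrame, key: str) -> DataFrame:
    """Remove rows whose join key is missing."""
    return df.filter(F.col(key).isNotNull())


# In a notebook in the same repo, the module imports like any installed package:
# from utils.cleaning import drop_null_keys
# clean = drop_null_keys(spark.read.table("main.default.trips"), "trip_id")
```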

The Workspace is the "where" you run things; Repos is the "how" you manage them like professional software.

Databricks Fundamentals Series: Workspace and Repos