Databricks Workspace & Repos
/tldr: The centralized environment for all data engineering, data science, and machine learning activities on the Databricks Lakehouse Platform.
1. The Databricks Workspace
The Workspace is the primary web-based user interface where teams manage resources and execute workloads. It acts as the *operating system* for the Databricks Lakehouse, providing access to all tools, data, and compute resources.
Key Workspace Components
Notebooks
The core development environment. These are live documents that support multiple languages (Python, Scala, R, SQL) and allow users to mix code, visualizations, and narrative text.
Compute & Clusters
The Workspace allows users to configure and manage shared computational resources (Clusters) based on Apache Spark, enabling parallel data processing.
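As a rough illustration of what configuring a cluster involves, the sketch below builds a request body of the kind the Clusters REST API accepts. The cluster name, runtime version string, and node type are placeholder assumptions; valid values depend on your cloud provider and workspace, and nothing is sent to any server here.

```python
import json


def build_cluster_spec(name, num_workers, autotermination_minutes=30):
    """Return a dict shaped like a Clusters API create-request body."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # assumed runtime version string
        "node_type_id": "i3.xlarge",          # assumed node type (cloud-specific)
        "num_workers": num_workers,           # Spark workers that process data in parallel
        "autotermination_minutes": autotermination_minutes,
    }


# Hypothetical shared cluster with two workers.
spec = build_cluster_spec("shared-etl", num_workers=2)
print(json.dumps(spec, indent=2))
```

Setting an auto-termination window is a common choice for shared clusters, since idle compute still incurs cost.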
Data & Catalog (Unity Catalog)
The interface for managing data access, security, and governance. This includes browsing the data assets registered in Unity Catalog (catalogs, schemas, and tables) and configuring access to external storage locations.
Jobs & Workflow Orchestration
Users schedule notebooks or Delta Live Tables (DLT) pipelines as production Jobs. The Workspace handles scheduling, run monitoring, and the ordering of dependent tasks within a Job.
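To make the idea of a scheduled Job with dependent tasks concrete, here is a minimal sketch of a job definition in the shape used by the Jobs API (`tasks`, `depends_on`, `schedule`). The job name, task keys, and notebook paths are hypothetical, and compute settings are omitted for brevity. The `run_order` helper is not part of Databricks; it only illustrates how task dependencies imply an execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical job definition; names and paths are placeholders.
job_spec = {
    "name": "nightly-sales-refresh",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/team/project/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/team/project/transform"},
        },
        {
            "task_key": "report",
            "depends_on": [{"task_key": "transform"}],
            "notebook_task": {"notebook_path": "/Repos/team/project/report"},
        },
    ],
}


def run_order(spec):
    """Return task keys in an order that respects every depends_on entry."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in spec["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(job_spec))
```

A cyclic dependency would make `TopologicalSorter` raise `graphlib.CycleError`, which is a cheap local check before submitting a definition.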
2. Databricks Repos (Git Integration)
Databricks Repos integrates the Workspace with external Git providers such as GitHub, GitLab, AWS CodeCommit, and Azure DevOps. This allows developers to apply **DevOps best practices** (version control, CI/CD, code review) to their Databricks notebooks and code.
Key Benefits and Workflow
Code Isolation & Branches
Developers work in isolated Git branches, each checked out in their own Repos folder in the Workspace. This helps prevent production code from being altered accidentally during development or testing.
CI/CD Integration
The Git connection enables continuous integration and continuous deployment. Once a feature branch is merged to 'main' in the Git provider, automated pipelines can trigger deployment via Databricks Jobs APIs.
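A deployment step of this kind might look like the sketch below, which a CI pipeline could run after a merge. It assumes two REST calls: updating a Repos checkout to the `main` branch, then triggering an existing Job. The host, token, `repo_id`, and `job_id` values are placeholders, and the requests are only constructed, not sent.

```python
import json
import urllib.request


def build_request(host, token, method, path, body):
    """Build (but do not send) an authenticated JSON request."""
    return urllib.request.Request(
        url=f"{host}{path}",
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def deployment_requests(host, token, repo_id, job_id):
    """Return the requests a pipeline could send after a merge to main."""
    # Assumed step 1: point the Workspace checkout at the merged branch.
    update_repo = build_request(
        host, token, "PATCH", f"/api/2.0/repos/{repo_id}", {"branch": "main"}
    )
    # Step 2: trigger a run of an existing Job through the Jobs API.
    run_job = build_request(
        host, token, "POST", "/api/2.1/jobs/run-now", {"job_id": job_id}
    )
    return [update_repo, run_job]


# Placeholder values; a real pipeline would read these from its secret store.
requests_to_send = deployment_requests(
    "https://example.cloud.databricks.com", "<token>", repo_id=123, job_id=456
)
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

To actually send a request, a pipeline would pass each object to `urllib.request.urlopen` and check the response status.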
Notebooks as Files
Repos stores Databricks notebooks as ordinary files in the Git repository, either in source format (for example, a `.py` file that begins with a `# Databricks notebook source` comment) or as `.ipynb`, so they can be diffed and merged like any other code.
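Because a source-format notebook is plain text, ordinary tooling can read it. The sketch below splits such a file into cells, assuming the usual layout: a `# Databricks notebook source` header line, `# COMMAND ----------` between cells, and `# MAGIC` prefixes on non-Python cells. The `split_cells` helper and the sample notebook are illustrative, not part of any Databricks library.

```python
NOTEBOOK_HEADER = "# Databricks notebook source"
CELL_SEPARATOR = "# COMMAND ----------"


def split_cells(source):
    """Split a source-format notebook into a list of cell bodies."""
    lines = source.splitlines()
    if not lines or lines[0].strip() != NOTEBOOK_HEADER:
        raise ValueError("not a Databricks source-format notebook")
    cells, current = [], []
    for line in lines[1:]:
        if line.strip() == CELL_SEPARATOR:
            cells.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    cells.append("\n".join(current).strip())
    return [cell for cell in cells if cell]  # drop empty cells


# Hypothetical notebook content as it would appear in Git.
example = """# Databricks notebook source
# MAGIC %md
# MAGIC # Daily sales report

# COMMAND ----------

df = spark.table("sales")

# COMMAND ----------

display(df)
"""

cells = split_cells(example)
for index, cell in enumerate(cells):
    print(f"--- cell {index} ---")
    print(cell)
```

Since each cell is just a run of text lines, a change to one cell shows up in a Git diff as a small, localized edit.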
Non-Notebook File Support
Beyond notebooks, Repos supports arbitrary files, such as Python modules (`.py`), R scripts, configuration files, and utility libraries, which notebooks and Jobs can import or read.
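The import mechanism can be simulated locally. The sketch below creates a temporary folder standing in for a repository root, writes a small helper module into it, and imports that module the way a notebook would. The package name `demo_repo_utils` and the `normalize` function are hypothetical. On Databricks the repository root is typically already on the Python path for notebooks inside the repo, so the explicit `sys.path` step is an assumption made for this standalone demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as repo_root:
    # Hypothetical repo layout: <repo_root>/demo_repo_utils/cleaning.py
    package_dir = Path(repo_root) / "demo_repo_utils"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "cleaning.py").write_text(
        "def normalize(name):\n"
        "    return name.strip().lower()\n"
    )

    # Make the simulated repo root importable, as a notebook environment would.
    sys.path.insert(0, repo_root)
    importlib.invalidate_caches()
    try:
        cleaning = importlib.import_module("demo_repo_utils.cleaning")
        result = cleaning.normalize("  Alice ")
    finally:
        sys.path.remove(repo_root)

print(result)
```

Keeping shared logic in modules rather than notebooks lets the same function be unit-tested in CI and reused across several notebooks and Jobs.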
In short, the Workspace is where you run workloads, and Repos is how you manage that code with standard software engineering practices.