Model Serving
/tldr: Transforming MLflow models into low-latency, production-ready REST APIs.
1. The Goal: Real-Time Predictions
Model Serving is the process of taking a trained, governed model and exposing it as a highly available, high-performance web service. This is critical for applications that require **real-time, low-latency** predictions (e.g., fraud detection, personalized recommendations).
In Databricks, models registered in the MLflow Model Registry can be easily deployed via the **Serverless Real-Time Inference** capability, which removes the need for managing underlying infrastructure.
2. Key Capabilities
- **Serverless & Isolation:** No infrastructure management required. Dedicated compute endpoints guarantee high availability and isolation from other workloads.
- **Auto-Scaling:** Automatically scales up (from zero replicas) to handle high traffic and scales back down when demand drops to optimize cost.
- **Feature Integration:** Integrates with the Feature Store for automatic feature lookups during real-time inference, preventing Training-Serving Skew.
3. From Registry to Endpoint
The serving workflow is fully governed by the MLflow Model Registry, making the deployment process simple and traceable.
A. Model Stage Promotion
A model version must be promoted to the **Staging** or **Production** stage in the MLflow Model Registry. This acts as a gate for deployment.
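With the MLflow Python client, the promotion is a few lines; a minimal sketch, where the model name fraud_detector and version "3" are placeholders:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote version 3 to Production; archiving existing versions ensures
# only one version holds the Production stage at a time.
client.transition_model_version_stage(
    name="fraud_detector",   # registered model name (placeholder)
    version="3",             # version to promote (placeholder)
    stage="Production",
    archive_existing_versions=True,
)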
B. Deployment via UI or API
The endpoint is created either via the Databricks UI or using the Databricks Model Serving REST API, pointing to the specific model version (e.g., models:/fraud_detector/Production).
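For the API route, here is a sketch of creating an endpoint with the Serving Endpoints REST API (/api/2.0/serving-endpoints) via Python requests; the workspace URL, token, and model version are placeholders:

import requests

# Placeholders: substitute your workspace URL and a valid access token.
DATABRICKS_HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "fraud_detector",
        "config": {
            "served_models": [
                {
                    "model_name": "fraud_detector",  # registered model name
                    "model_version": "3",            # version promoted above (placeholder)
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,
                }
            ]
        },
    },
)
resp.raise_for_status()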
Example prediction request to the REST endpoint:
POST /serving-endpoints/fraud_detector/invocations
{
  "dataframe_split": {
    "columns": ["user_id", "feature_a"],
    "data": [
      [101, 0.55],
      [102, 0.88]
    ]
  }
}
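The same request from application code, sketched with Python requests; the workspace URL and token are placeholders:

import requests

# Placeholders: substitute your workspace URL and a valid access token.
resp = requests.post(
    "https://<workspace-url>/serving-endpoints/fraud_detector/invocations",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "dataframe_split": {
            "columns": ["user_id", "feature_a"],
            "data": [[101, 0.55], [102, 0.88]],
        }
    },
)
resp.raise_for_status()
print(resp.json())  # typically a JSON object containing a "predictions" list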
C. Data and Model Monitoring
Once live, the serving endpoint streams inference data back into Delta Lake. This enables automated monitoring for **data drift** (input features changing) and **model drift** (performance degrading) to trigger re-training pipelines.
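As an illustration of what such monitoring can look like, here is a crude drift check over a hypothetical inference table; the table name, column names, and baseline value are all placeholders, and production setups typically use statistical tests such as PSI or KS rather than a simple mean comparison:

from pyspark.sql import functions as F

# Read the Delta table the endpoint logs inference data into (hypothetical
# name); assumes the Databricks-provided `spark` session.
inference = spark.read.table("ml.monitoring.fraud_detector_inference")

# Mean of an input feature over the last 7 days.
recent_mean = (
    inference
    .filter(F.col("timestamp") >= F.date_sub(F.current_date(), 7))
    .agg(F.mean("feature_a"))
    .first()[0]
)

TRAINING_MEAN = 0.52  # baseline captured at training time (placeholder)
if recent_mean is not None and abs(recent_mean - TRAINING_MEAN) > 0.1:
    print("Possible data drift on feature_a; consider triggering retraining.")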
Model Serving is the final handshake between Data Science and the production application.