Model Serving
/tldr: Transforming MLflow models into low-latency, production-ready REST APIs.
1. The Goal: Real-Time Predictions
Model Serving is the process of taking a trained, governed model and exposing it as a highly available, high-performance web service. This is critical for applications that require **real-time, low-latency** predictions (e.g., fraud detection, personalized recommendations).
In Databricks, models registered in the MLflow Model Registry can be easily deployed via the **Serverless Real-Time Inference** capability, which removes the need for managing underlying infrastructure.
2. Key Capabilities
- **Serverless & Isolation:** No infrastructure management required. Dedicated compute endpoints guarantee high availability and isolation from other workloads.
- **Auto-Scaling:** Automatically scales up (from zero replicas) to handle high traffic and scales back down when demand drops to optimize cost.
- **Feature Integration:** Integrates with the Feature Store for automatic feature lookups during real-time inference, preventing Training-Serving Skew.
3. From Registry to Endpoint
The serving workflow is fully governed by the MLflow Model Registry, making the deployment process simple and traceable.
A. Model Stage Promotion
A model version must be promoted to the **Staging** or **Production** stage in the MLflow Model Registry. This acts as a gate for deployment.
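With the MLflow Python client, the promotion is a few lines; a minimal sketch, where the model name fraud_detector and version "3" are placeholders:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote version 3 to Production; archiving existing versions ensures
# only one version holds the Production stage at a time.
client.transition_model_version_stage(
    name="fraud_detector",   # registered model name (placeholder)
    version="3",             # version to promote (placeholder)
    stage="Production",
    archive_existing_versions=True,
)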
B. Deployment via UI or API
The endpoint is created either via the Databricks UI or using the Databricks Model Serving REST API, pointing to the specific model version (e.g., models:/fraud_detector/Production).
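For the API route, here is a sketch of creating an endpoint with the Serving Endpoints REST API (/api/2.0/serving-endpoints) via Python requests; the workspace URL, token, and model version are placeholders:

import requests

# Placeholders: substitute your workspace URL and a valid access token.
DATABRICKS_HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "fraud_detector",
        "config": {
            "served_models": [
                {
                    "model_name": "fraud_detector",  # registered model name
                    "model_version": "3",            # version promoted above (placeholder)
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,
                }
            ]
        },
    },
)
resp.raise_for_status()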
Example prediction request to the REST endpoint:
POST /serving-endpoints/fraud_detector/invocations
{
  "dataframe_split": {
    "columns": ["user_id", "feature_a"],
    "data": [
      [101, 0.55],
      [102, 0.88]
    ]
  }
}
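The same request from application code, sketched with Python requests; the workspace URL and token are placeholders:

import requests

# Placeholders: substitute your workspace URL and a valid access token.
resp = requests.post(
    "https://<workspace-url>/serving-endpoints/fraud_detector/invocations",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "dataframe_split": {
            "columns": ["user_id", "feature_a"],
            "data": [[101, 0.55], [102, 0.88]],
        }
    },
)
resp.raise_for_status()
print(resp.json())  # typically a JSON object containing a "predictions" list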
C. Data and Model Monitoring
Once live, the serving endpoint streams inference data back into Delta Lake. This enables automated monitoring for **data drift** (input features changing) and **model drift** (performance degrading) to trigger re-training pipelines.
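As an illustration of what such monitoring can look like, here is a crude drift check over a hypothetical inference table; the table name, column names, and baseline value are all placeholders, and production setups typically use statistical tests such as PSI or KS rather than a simple mean comparison:

from pyspark.sql import functions as F

# Read the Delta table the endpoint logs inference data into (hypothetical
# name); assumes the Databricks-provided `spark` session.
inference = spark.read.table("ml.monitoring.fraud_detector_inference")

# Mean of an input feature over the last 7 days.
recent_mean = (
    inference
    .filter(F.col("timestamp") >= F.date_sub(F.current_date(), 7))
    .agg(F.mean("feature_a"))
    .first()[0]
)

TRAINING_MEAN = 0.52  # baseline captured at training time (placeholder)
if recent_mean is not None and abs(recent_mean - TRAINING_MEAN) > 0.1:
    print("Possible data drift on feature_a; consider triggering retraining.")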
Model Serving is the final handshake between Data Science and the production application.