Feature Store
TL;DR: A centralized, governed repository for sharing, discovering, and serving ML features.
1. The Need for a Feature Store
In machine learning, features are the inputs a model uses. One of the biggest challenges in production ML is **Training-Serving Skew**: a mismatch between the feature values and computation logic used during training and those used for real-time inference.
The Feature Store solves this by guaranteeing the exact same feature definitions, code, and values are used in both environments, ensuring consistency and reproducibility. It also provides a central hub for feature discovery and reuse across teams.
2. Dual-Store Architecture
The Databricks Feature Store utilizes two distinct storage layers optimized for different latency and volume requirements.
Offline Feature Store
Purpose: Training and Batch Inference
- **Storage:** High-volume, high-throughput storage (Delta Tables on cloud storage).
- **Latency:** High (seconds to minutes).
- **Use Case:** Joining features with historical data to create large training datasets.
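For batch work, the offline table behaves like any other Delta-backed table and can be read back as a Spark DataFrame. A minimal sketch, assuming the user_features.daily_agg table registered in section 3 and a hypothetical historical_events DataFrame keyed by user_id:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Read the full offline (Delta-backed) feature table as a Spark DataFrame
features_df = fs.read_table(name='user_features.daily_agg')

# Typical offline use: join features onto historical events (hypothetical DataFrame)
# to assemble a large training or batch-scoring dataset
batch_input = historical_events.join(features_df, on='user_id', how='left')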
Online Feature Store
Purpose: Real-Time Inference
- **Storage:** Low-latency key-value databases (e.g., DynamoDB, Azure Cosmos DB).
- **Latency:** Very low (milliseconds).
- **Use Case:** Quick lookup of the latest features when a model needs a prediction immediately.
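Getting feature values into the online store is a separate publish step on Databricks. A minimal sketch, assuming a DynamoDB-backed online store and the user_features.daily_agg table registered in section 3; the region and secret prefixes are placeholders:

from databricks.feature_store import FeatureStoreClient
from databricks.feature_store.online_store_spec import AmazonDynamoDBSpec

fs = FeatureStoreClient()

# Describe the target online store; region and secret prefixes are placeholders
online_store = AmazonDynamoDBSpec(
    region='us-west-2',
    write_secret_prefix='feature-store/writer',
    read_secret_prefix='feature-store/reader',
)

# Copy the latest offline feature values into the low-latency online store
fs.publish_table(name='user_features.daily_agg', online_store=online_store, mode='merge')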
3. Core Workflow and Components
The Feature Store orchestrates feature creation, model training, and serving through specialized APIs.
A. Feature Creation (The Publisher)
Data engineers create a DataFrame of features (e.g., user_7_day_avg_spend) and register it with the Feature Store, defining the primary key and the expected refresh schedule.
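The DataFrame being registered (spark_df in the example below) is an ordinary Spark DataFrame with one row per primary key. A minimal sketch of how it might be computed, assuming a hypothetical raw.transactions table with user_id, amount, and event_date columns, and the ambient SparkSession available in a Databricks notebook:

from pyspark.sql import functions as F

# Hypothetical raw source: one row per transaction (user_id, amount, event_date)
txns = spark.table('raw.transactions')

# Average spend per user over the trailing 7 days; one row per user_id
spark_df = (
    txns
    .filter(F.col('event_date') >= F.date_sub(F.current_date(), 7))
    .groupBy('user_id')
    .agg(F.avg('amount').alias('user_7_day_avg_spend'))
)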
Python Example: Publishing a Feature Table
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Register the feature table (name, primary key, schema) in the Feature Store
fs.create_table(
    name='user_features.daily_agg',
    primary_keys=['user_id'],
    schema=spark_df.schema,
    description='7-day and 30-day aggregated user spending.',
)

# Write the feature data to the offline store (Delta); 'merge' upserts rows by primary key
fs.write_table(name='user_features.daily_agg', df=spark_df, mode='merge')
B. Training (The Consumer)
Data scientists define a TrainingSet by specifying which features and keys they need. The Feature Store automatically joins the specified features from the Offline Store with the historical target variable data.
Python Example: Creating a Training Dataset
# Define features to use
from databricks.feature_store import FeatureLookup

# Define which features to pull from the feature table, keyed by user_id
feature_lookups = [
    FeatureLookup(table_name='user_features.daily_agg', lookup_key='user_id')
]

# Build the training set: the client joins the features onto the raw labeled data
training_set = fs.create_training_set(
    df=raw_target_data,            # data with historical labels (e.g., purchased=1)
    feature_lookups=feature_lookups,
    label='purchased',
)
training_df = training_set.load_df()
C. Inference (The Serving Bridge)
When the model is logged to MLflow through the Feature Store client, the feature lookup metadata is packaged as part of the model artifact. When the model is queried for real-time predictions, it automatically fetches the required features from the **Online Store** using the same keys defined during training.
- **Automatic Lookup:** The model carries the metadata needed to look up features.
- **Latency:** Lookups happen in milliseconds against the Online Store.
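The packaging happens when the model is logged through the Feature Store client rather than plain MLflow. A minimal sketch, assuming a scikit-learn classifier trained on the training_set object from section B; the model, registry name, and batch_df are illustrative:

import mlflow
from sklearn.ensemble import RandomForestClassifier
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Train on the joined data produced by create_training_set (section B)
pdf = training_set.load_df().toPandas()
X = pdf.drop(columns=['purchased', 'user_id'])
y = pdf['purchased']
model = RandomForestClassifier().fit(X, y)

# Logging through the Feature Store client packages the feature lookup
# metadata with the model, enabling automatic lookups at inference time
fs.log_model(
    model,
    artifact_path='model',
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name='user_purchase_model',  # illustrative name
)

# Batch inference: batch_df (hypothetical) only needs the lookup keys;
# the stored features are fetched and joined automatically before scoring
predictions = fs.score_batch('models:/user_purchase_model/1', batch_df)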
The Feature Store ensures feature logic is written once but used everywhere—training and serving.