Feature Store
TL;DR: A centralized, governed repository for sharing, discovering, and serving ML features.
1. The Need for a Feature Store
In machine learning, features are the inputs a model uses. One of the biggest challenges in production ML is **Training-Serving Skew**: a mismatch between the feature values and computation logic used during training and those used for real-time inference.
The Feature Store solves this by guaranteeing the exact same feature definitions, code, and values are used in both environments, ensuring consistency and reproducibility. It also provides a central hub for feature discovery and reuse across teams.
2. Dual-Store Architecture
The Databricks Feature Store utilizes two distinct storage layers optimized for different latency and volume requirements.
Offline Feature Store
Purpose: Training and Batch Inference
- **Storage:** High-volume, high-throughput storage (Delta Tables on cloud storage).
- **Latency:** High (seconds to minutes).
- **Use Case:** Joining features with historical data to create large training datasets.
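For batch work, the offline table behaves like any other Delta-backed table and can be read back as a Spark DataFrame. A minimal sketch, assuming the user_features.daily_agg table registered in section 3 and a hypothetical historical_events DataFrame keyed by user_id:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Read the full offline (Delta-backed) feature table as a Spark DataFrame
features_df = fs.read_table(name='user_features.daily_agg')

# Typical offline use: join features onto historical events (hypothetical DataFrame)
# to assemble a large training or batch-scoring dataset
batch_input = historical_events.join(features_df, on='user_id', how='left')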
Online Feature Store
Purpose: Real-Time Inference
- **Storage:** Low-latency key-value databases (e.g., DynamoDB, Azure Cosmos DB).
- **Latency:** Very low (milliseconds).
- **Use Case:** Quick lookup of the latest features when a model needs a prediction immediately.
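Getting feature values into the online store is a separate publish step on Databricks. A minimal sketch, assuming a DynamoDB-backed online store and the user_features.daily_agg table registered in section 3; the region and secret prefixes are placeholders:

from databricks.feature_store import FeatureStoreClient
from databricks.feature_store.online_store_spec import AmazonDynamoDBSpec

fs = FeatureStoreClient()

# Describe the target online store; region and secret prefixes are placeholders
online_store = AmazonDynamoDBSpec(
    region='us-west-2',
    write_secret_prefix='feature-store/writer',
    read_secret_prefix='feature-store/reader',
)

# Copy the latest offline feature values into the low-latency online store
fs.publish_table(name='user_features.daily_agg', online_store=online_store, mode='merge')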
3. Core Workflow and Components
The Feature Store orchestrates feature creation, model training, and serving through specialized APIs.
A. Feature Creation (The Publisher)
Data engineers create a DataFrame of features (e.g., user_7_day_avg_spend) and register it with the Feature Store, defining the primary key and the expected refresh schedule.
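The DataFrame being registered (spark_df in the example below) is an ordinary Spark DataFrame with one row per primary key. A minimal sketch of how it might be computed, assuming a hypothetical raw.transactions table with user_id, amount, and event_date columns, and the ambient SparkSession available in a Databricks notebook:

from pyspark.sql import functions as F

# Hypothetical raw source: one row per transaction (user_id, amount, event_date)
txns = spark.table('raw.transactions')

# Average spend per user over the trailing 7 days; one row per user_id
spark_df = (
    txns
    .filter(F.col('event_date') >= F.date_sub(F.current_date(), 7))
    .groupBy('user_id')
    .agg(F.avg('amount').alias('user_7_day_avg_spend'))
)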
Python Example: Publishing a Feature Table
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Register the feature table (name, primary key, schema) in the Feature Store
fs.create_table(
    name='user_features.daily_agg',
    primary_keys=['user_id'],
    schema=spark_df.schema,
    description='7-day and 30-day aggregated user spending.',
)

# Write the feature data to the offline store (Delta); 'merge' upserts rows by primary key
fs.write_table(name='user_features.daily_agg', df=spark_df, mode='merge')
B. Training (The Consumer)
Data scientists define a TrainingSet by specifying which features and keys they need. The Feature Store automatically joins the specified features from the Offline Store with the historical target variable data.
Python Example: Creating a Training Dataset
# Define features to use
from databricks.feature_store import FeatureLookup

# Define which features to pull from the feature table, keyed by user_id
feature_lookups = [
    FeatureLookup(table_name='user_features.daily_agg', lookup_key='user_id')
]

# Build the training set: the client joins the features onto the raw labeled data
training_set = fs.create_training_set(
    df=raw_target_data,            # data with historical labels (e.g., purchased=1)
    feature_lookups=feature_lookups,
    label='purchased',
)
training_df = training_set.load_df()
C. Inference (The Serving Bridge)
When the model is logged to MLflow through the Feature Store client, the feature lookup metadata is packaged as part of the model artifact. When the model is queried for real-time predictions, it automatically fetches the required features from the **Online Store** using the same keys defined during training.
- **Automatic Lookup:** The model carries the metadata needed to look up features.
- **Latency:** Lookups happen in milliseconds against the Online Store.
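The packaging happens when the model is logged through the Feature Store client rather than plain MLflow. A minimal sketch, assuming a scikit-learn classifier trained on the training_set object from section B; the model, registry name, and batch_df are illustrative:

import mlflow
from sklearn.ensemble import RandomForestClassifier
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Train on the joined data produced by create_training_set (section B)
pdf = training_set.load_df().toPandas()
X = pdf.drop(columns=['purchased', 'user_id'])
y = pdf['purchased']
model = RandomForestClassifier().fit(X, y)

# Logging through the Feature Store client packages the feature lookup
# metadata with the model, enabling automatic lookups at inference time
fs.log_model(
    model,
    artifact_path='model',
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name='user_purchase_model',  # illustrative name
)

# Batch inference: batch_df (hypothetical) only needs the lookup keys;
# the stored features are fetched and joined automatically before scoring
predictions = fs.score_batch('models:/user_purchase_model/1', batch_df)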
The Feature Store ensures feature logic is written once but used everywhere—training and serving.