Databricks TL;DR: Feature Store

Feature Store

/tldr: A centralized, governed repository for sharing, discovering, and serving ML features.

Tags: MLOps · Data Consistency · Real-Time Inference · Feature Engineering

1. The Need for Feature Store

In machine learning, features are the inputs a model uses. One of the biggest challenges in production ML is **Training-Serving Skew**: a mismatch between the feature values (or the logic that computes them) used during training and those used during real-time inference.

The Feature Store solves this by guaranteeing the exact same feature definitions, code, and values are used in both environments, ensuring consistency and reproducibility. It also provides a central hub for feature discovery and reuse across teams.

2. Dual-Store Architecture

The Databricks Feature Store utilizes two distinct storage layers optimized for different latency and volume requirements.

Offline Feature Store

Purpose: Training and Batch Inference

  • **Storage:** High-volume, high-throughput storage (Delta Tables on cloud storage).
  • **Latency:** High (seconds to minutes).
  • **Use Case:** Joining features with historical data to create large training datasets.
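For batch work against the Offline Store, the client can read an entire feature table back as a Spark DataFrame. A minimal sketch, assuming the databricks-feature-store client and the user_features.daily_agg table registered in section 3:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Read the full offline feature table (a Delta table) as a Spark DataFrame,
# e.g. for exploration or batch scoring.
features_df = fs.read_table(name='user_features.daily_agg')
features_df.show(5)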

Online Feature Store

Purpose: Real-Time Inference

  • **Storage:** Low-latency key-value databases (e.g., DynamoDB, Azure Cosmos DB).
  • **Latency:** Very low (milliseconds).
  • **Use Case:** Quick lookup of the latest features when a model needs a prediction immediately.
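Feature values reach the Online Store through an explicit publish step. A hedged sketch, assuming an AWS DynamoDB online store; the region, secret-scope prefixes, and table name are placeholders:

from databricks.feature_store import FeatureStoreClient
from databricks.feature_store.online_store_spec import AmazonDynamoDBSpec

fs = FeatureStoreClient()

# Placeholder DynamoDB target; credentials are resolved from Databricks secret scopes
# referenced by the read/write secret prefixes.
online_store = AmazonDynamoDBSpec(
    region='us-west-2',
    read_secret_prefix='feature-store/dynamo-read',
    write_secret_prefix='feature-store/dynamo-write'
)

# Copy the latest feature values from the offline Delta table into DynamoDB
# so they can be looked up in milliseconds at inference time.
fs.publish_table(
    name='user_features.daily_agg',
    online_store=online_store,
    mode='merge'
)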

3. Core Workflow and Components

The Feature Store orchestrates feature creation, model training, and serving through specialized APIs.

A. Feature Creation (The Publisher)

Data engineers create a DataFrame of features (e.g., user_7_day_avg_spend) and register it with the Feature Store, defining the primary key and the expected refresh schedule.
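Before the table can be registered, the feature DataFrame itself has to be computed. A hypothetical sketch of that step, assuming a raw.transactions Delta table with user_id, amount, and event_date columns, and the spark session available in Databricks notebooks:

from pyspark.sql import functions as F

# Hypothetical raw source: one row per transaction (user_id, amount, event_date).
transactions = spark.table('raw.transactions')

# Average spend per user over the trailing 7 and 30 days, as of today.
spark_df = (
    transactions
    .groupBy('user_id')
    .agg(
        F.avg(F.when(F.col('event_date') >= F.date_sub(F.current_date(), 7),
                     F.col('amount'))).alias('user_7_day_avg_spend'),
        F.avg(F.when(F.col('event_date') >= F.date_sub(F.current_date(), 30),
                     F.col('amount'))).alias('user_30_day_avg_spend')
    )
)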

Python Example: Publishing a Feature Table

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Register the feature table's metadata: name, primary key(s), and schema.
fs.create_table(
    name='user_features.daily_agg',
    primary_keys=['user_id'],
    schema=spark_df.schema,
    description='7-day and 30-day aggregated user spending.'
)

# Write the feature data to the offline store (Delta); 'merge' upserts rows by primary key.
fs.write_table(name='user_features.daily_agg', df=spark_df, mode='merge')
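A note on the write step: mode='overwrite' replaces the table contents, while mode='merge' upserts rows by primary key, which is the usual choice for scheduled incremental refreshes. Writing only updates the offline Delta table; pushing values to the Online Store is a separate publish step (see the online-store sketch in section 2).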

B. Training (The Consumer)

Data scientists define a TrainingSet by specifying which features and keys they need. The Feature Store automatically joins the specified features from the Offline Store with the historical target variable data.

Python Example: Creating a Training Dataset

from databricks.feature_store import FeatureLookup, FeatureStoreClient

fs = FeatureStoreClient()

# Define which features to pull and which key to join them on.
feature_lookups = [
    FeatureLookup(
        table_name='user_features.daily_agg',
        lookup_key='user_id'
    )
]

# Join the features to the raw data that carries the historical labels (e.g., purchased=1).
training_set = fs.create_training_set(
    df=raw_target_data,
    feature_lookups=feature_lookups,
    label='purchased'
)

training_df = training_set.load_df()

C. Inference (The Serving Bridge)

When the model is logged to MLflow through the Feature Store client, the feature lookup metadata is packaged with it. When the model is then queried for real-time predictions, it automatically fetches the required features from the **Online Store** using the same lookup keys defined during training.

  • **Automatic Lookup:** The model carries the metadata needed to look up features.
  • **Latency:** Lookups happen in milliseconds against the Online Store.
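The packaging step that enables this automatic lookup is logging the model through the Feature Store client. A minimal sketch, assuming a scikit-learn model fitted on training_df and the training_set object from section B; the model variable and registered model name are placeholders:

import mlflow
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Log the model together with the feature lookup metadata from the training set.
# At serving time, requests only need to supply the lookup key (user_id) and any raw inputs;
# the registered features are fetched from the Online Store automatically.
fs.log_model(
    trained_sklearn_model,                       # placeholder: a model fitted on training_df
    'model',
    flavor=mlflow.sklearn,
    training_set=training_set,                   # the TrainingSet created in section B
    registered_model_name='user_purchase_model'  # placeholder registry name
)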

The Feature Store ensures feature logic is written once but used everywhere—training and serving.

Databricks Fundamentals Series: Feature Store