The Gap Between Notebooks and Production

Most ML projects die in the notebook. A data scientist builds a promising model, shares impressive metrics in a slide deck... and then nothing ships. The gap between "it works on my laptop" and "it's running in production" is where MLOps lives.

MLOps (Machine Learning Operations) applies DevOps principles to ML systems: version control, automated testing, continuous deployment, and monitoring — adapted for the unique challenges of machine learning.

Why ML Is Harder to Operationalize

Traditional software has one artifact: code. ML systems have three: code, data, and models.

All three can change independently, and any change can break the system. That's why you need versioning, testing, and monitoring for all three.
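
One way to make "all three can change independently" concrete is to record a fingerprint of each artifact next to every trained model. The sketch below is only an illustration: `fingerprint` and `run_manifest` are hypothetical helper names, and the byte strings stand in for real files or a Git commit hash.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a short SHA-256 digest identifying one artifact's contents."""
    return hashlib.sha256(payload).hexdigest()[:12]


def run_manifest(code: bytes, data: bytes, model: bytes) -> dict:
    """Record which code, data, and model versions belong together."""
    return {
        "code": fingerprint(code),
        "data": fingerprint(data),
        "model": fingerprint(model),
    }


# Stand-in contents (assumption); in practice these would be file bytes.
before = run_manifest(b"train.py v1", b"rows: 1000", b"weights-a")
after = run_manifest(b"train.py v1", b"rows: 1200", b"weights-a")

# Compare the two manifests to see which artifact moved between runs.
changed = [name for name in before if before[name] != after[name]]
print(json.dumps(changed))
```

Comparing manifests between two runs tells you which of the three artifacts changed, which is usually the first question when a model starts misbehaving.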

The MLOps Stack

1. Experiment Tracking

Track every training run: hyperparameters, metrics, datasets, and artifacts. Tools like MLflow, Weights & Biases, or Neptune can automate most of this.

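import mlflow
import mlflow.sklearn

# `model` is assumed to be a fitted scikit-learn estimator from your training code.
# The parameter and metric values below are illustrative.

# The context manager closes the run even if a logging call raises.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.sklearn.log_model(model, "model")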

2. Model Registry

A central repository for trained models with versioning, stage management (staging → production → archived), and approval workflows. MLflow Model Registry and SageMaker Model Registry are popular choices.
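
To show what stage management and approval mean in practice, here is a toy in-memory registry. It is a sketch of the idea only: `ToyRegistry`, `register`, and `promote` are hypothetical names and do not mirror the MLflow or SageMaker APIs.

```python
from dataclasses import dataclass, field

# Allowed stage transitions, following the staging -> production -> archived flow.
ALLOWED = {
    "staging": {"production", "archived"},
    "production": {"archived"},
    "archived": set(),
}


@dataclass
class ToyRegistry:
    """In-memory stand-in for a model registry (hypothetical, not a real API)."""

    stages: dict = field(default_factory=dict)  # version number -> stage

    def register(self) -> int:
        """Add a new version in 'staging' and return its version number."""
        version = len(self.stages) + 1
        self.stages[version] = "staging"
        return version

    def promote(self, version: int, to_stage: str, approved: bool) -> None:
        """Move a version to a new stage if the move is allowed and approved."""
        current = self.stages[version]
        if to_stage not in ALLOWED[current]:
            raise ValueError(f"cannot move {current} -> {to_stage}")
        if to_stage == "production" and not approved:
            raise PermissionError("production promotion needs approval")
        if to_stage == "production":
            # Archive whichever version currently serves production.
            for v, stage in self.stages.items():
                if stage == "production":
                    self.stages[v] = "archived"
        self.stages[version] = to_stage


registry = ToyRegistry()
v1 = registry.register()
registry.promote(v1, "production", approved=True)
v2 = registry.register()
registry.promote(v2, "production", approved=True)
print(registry.stages)
```

A real registry adds storage, lineage, and access control, but the core contract is the same: versions are immutable, stages are labels, and moving into production is an explicit, approved step.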

3. Data Versioning

Track changes to your training data the same way you track code changes. DVC (Data Version Control) works alongside Git:

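# Track a large dataset; DVC writes a small pointer file next to it
dvc add data/training_set.parquet

# Commit the pointer file (not the data itself) to Git
git add data/training_set.parquet.dvc data/.gitignore
git commit -m "Track training set with DVC"

# Push data to remote storage (assumes a DVC remote is already configured)
dvc push

# Reproduce the exact training data from any commit or tag (v1.2 is an example tag)
git checkout v1.2
dvc checkout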

4. Feature Stores

A centralized system for computing, storing, and serving features. It ensures the same feature logic is used in training and serving, which eliminates training-serving skew. Tools: Feast, Tecton, Hopsworks.
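
The core idea behind avoiding training-serving skew fits in a few lines: define the feature logic once and call it from both paths. This is a minimal sketch, not how Feast, Tecton, or Hopsworks are implemented; `build_features` and its inputs are hypothetical.

```python
from datetime import date


def build_features(signup: date, as_of: date, purchases: list[float]) -> dict:
    """Single source of truth for feature logic, used offline and online."""
    total = sum(purchases)
    return {
        "account_age_days": (as_of - signup).days,
        "purchase_count": len(purchases),
        "avg_purchase": total / len(purchases) if purchases else 0.0,
    }


# Training path: features computed as of a historical label date.
train_row = build_features(date(2023, 1, 1), date(2023, 3, 1), [20.0, 40.0])

# Serving path: the same function, called at request time with live values.
serve_row = build_features(date(2023, 1, 1), date(2023, 3, 1), [20.0, 40.0])

# Same inputs and same code path, so both sides see the same features.
print(train_row == serve_row)
```

Skew usually creeps in when the training pipeline and the serving code each reimplement this logic separately. A feature store enforces the single definition and adds storage and low-latency lookup on top.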

CI/CD for Machine Learning

ML CI/CD pipelines look different from those for traditional software:

  1. Code tests: Unit tests for feature engineering, data validation, and pipeline logic.
  2. Data validation: Check schema, distributions, missing values, and outliers on new data.
  3. Model training: Automated retraining triggered by new data or code changes.
  4. Model validation: Compare the new model against the current baseline on held-out data. Promote it only if it beats the baseline on the agreed metric.
  5. Deployment: Blue-green or canary deployment with automatic rollback.
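
The data validation step in the list above can start as a plain function that returns a list of problems. The sketch below uses only the standard library. The schema, thresholds, and column names are assumptions for illustration; a real pipeline would derive them from the training data or use a dedicated validation library.

```python
import math

# Hypothetical expectations for an incoming batch.
EXPECTED_SCHEMA = {"age": float, "country": str}
MAX_MISSING_FRACTION = 0.1
AGE_RANGE = (0.0, 120.0)


def is_missing(value) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def validate_batch(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    problems = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        values = [row.get(column) for row in rows]
        present = [v for v in values if not is_missing(v)]
        # Missing-value check.
        if (len(values) - len(present)) / len(values) > MAX_MISSING_FRACTION:
            problems.append(f"{column}: too many missing values")
        # Schema check on the values that are present.
        if any(not isinstance(v, expected_type) for v in present):
            problems.append(f"{column}: unexpected type")
    # Simple range-based outlier check for one numeric column.
    low, high = AGE_RANGE
    ages = [row.get("age") for row in rows]
    if any(isinstance(a, float) and not is_missing(a) and not low <= a <= high
           for a in ages):
        problems.append("age: value outside expected range")
    return problems


good = [{"age": 34.0, "country": "DE"}, {"age": 51.0, "country": "US"}]
bad = [{"age": 340.0, "country": "DE"}, {"age": None, "country": 7}]
print(validate_batch(good))
print(validate_batch(bad))
```

In a pipeline, a non-empty result would stop the run before any training happens.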

Testing principle

Test the data and model, not just the code. A pipeline that produces a bad model should fail the same way a pipeline with a bug should.
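
A model validation gate that fails the pipeline might look like the sketch below. The metric name, the `min_gain` margin, and the function names are assumptions; pick whatever metric and margin your team has agreed on.

```python
def should_promote(candidate: dict, baseline: dict, metric: str = "f1_score",
                   min_gain: float = 0.005) -> bool:
    """Promote only if the candidate beats the baseline by a margin on the metric."""
    return candidate[metric] >= baseline[metric] + min_gain


def validation_gate(candidate: dict, baseline: dict) -> None:
    """Raise so the pipeline run fails, the same way a failing unit test would."""
    if not should_promote(candidate, baseline):
        raise RuntimeError("candidate model does not beat baseline; not promoting")


baseline_metrics = {"f1_score": 0.91}

# A clearly better candidate passes silently.
validation_gate({"f1_score": 0.93}, baseline_metrics)

# A worse candidate raises, which a CI runner would report as a failed job.
try:
    validation_gate({"f1_score": 0.90}, baseline_metrics)
except RuntimeError as err:
    print(f"pipeline failed: {err}")
```

Requiring a small margin rather than any improvement guards against promoting a model whose gain is within evaluation noise.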

Monitoring in Production

Models degrade silently. The world changes, user behavior shifts, and your training data becomes stale. You need to monitor:

The Minimum Viable MLOps

You don't need the entire stack on day one. Start with:

  1. Version everything: Code in Git, data in DVC, models in a registry.
  2. Automate training: One command or trigger retrains the model end-to-end.
  3. Test before deploying: Compare new model vs. baseline on validation data.
  4. Monitor predictions: Log predictions and set up alerts for drift.

Add sophistication as your ML system matures. The goal is reliability, not complexity.
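
For the monitoring step, a drift alert can begin as a single statistic compared against a threshold. The sketch below computes the Population Stability Index (PSI) between a reference sample and live values using only the standard library. The 0.2 alert threshold is a commonly quoted rule of thumb, not a universal constant, and the bin count and sample data are assumptions.

```python
import math

ALERT_THRESHOLD = 0.2  # rule-of-thumb value (assumption); tune per feature


def psi(expected: list[float], actual: list[float], bins: int = 5) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # fall back if the reference is constant

    def bucket_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range land in edge bins.
            index = min(max(int((v - low) / width), 0), bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Reference values seen at training time, and a shifted live sample.
reference = [0.1 * i for i in range(100)]
shifted = [v + 5.0 for v in reference]

for name, live in [("unchanged", reference), ("shifted", shifted)]:
    score = psi(reference, live)
    status = "ALERT" if score > ALERT_THRESHOLD else "ok"
    print(f"{name}: {status}")
```

Run a check like this on a schedule against logged predictions or input features, and route alerts to whoever owns the model.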

Common Anti-Patterns

The best ML system is the one that's actually running in production, delivering value, and being monitored. Ship first, optimize later.