3.3 MLOps Engineering

The Meta-Narrative

MLOps is where software engineering meets machine learning. It's the discipline of delivering ML models reliably, repeatedly, and at scale. Without MLOps, every deployment is a one-off miracle. With it, you have a machine — a system that continuously trains, validates, deploys, and monitors models with minimal human intervention.


The MLOps Stack

graph TD
    subgraph "Development"
        A["Experiment Tracking<br/>(MLflow, W&B)"]
        B["Data Versioning<br/>(DVC, lakeFS)"]
        C["Feature Store<br/>(Feast, Tecton)"]
    end
    subgraph "Pipeline Orchestration"
        D["Training Pipeline<br/>(Kubeflow, Airflow)"]
        E["Validation Gate<br/>(performance thresholds)"]
    end
    subgraph "Deployment"
        F["Model Registry<br/>(MLflow, SageMaker)"]
        G["Serving<br/>(Triton, TorchServe)"]
    end
    subgraph "Monitoring"
        H["Data Drift Detection"]
        I["Performance Monitoring"]
        J["Alerting<br/>(PagerDuty, Slack)"]
    end

    A --> D
    B --> D
    C --> D
    D --> E
    E --> |"Pass"| F
    E --> |"Fail"| D
    F --> G
    G --> H
    G --> I
    I --> J
    H --> J
    J --> |"Retrain"| D
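The monitoring boxes above can be made concrete. A minimal sketch of data drift detection using a two-sample Kolmogorov-Smirnov test (the `detect_drift` helper and its `alpha` threshold are illustrative choices, not a prescribed API; assumes NumPy and SciPy):

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> bool:
    """Flag drift when live feature values no longer match the training distribution."""
    statistic, p_value = ks_2samp(reference, live)
    return bool(p_value < alpha)

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)   # distribution seen at training time
shifted = rng.normal(0.8, 1.0, 5000)         # simulated upstream change in production

print("shifted distribution drift:", detect_drift(train_feature, shifted))  # True: mean shift detected
```

In a real pipeline this check runs per feature on a schedule, and a positive result feeds the alerting box, which can in turn trigger retraining.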

Experiment Tracking: The Foundation

Why Spreadsheets Kill ML Projects

Without experiment tracking, teams cannot:

  • Reproduce a model from 3 months ago
  • Compare 50 hyperparameter configurations
  • Roll back to a previous model version
  • Audit which data trained which model

MLflow, Weights & Biases, and Neptune solve this by logging parameters, metrics, artifacts, and code versions for every run.

🚀 Lab: MLflow Experiment Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("iris-classification")

# Each hyperparameter combination becomes one tracked run.
for n_est in [50, 100, 200]:
    for max_depth in [3, 5, None]:
        with mlflow.start_run(run_name=f"rf_n{n_est}_d{max_depth}"):
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", max_depth)

            model = RandomForestClassifier(n_estimators=n_est, max_depth=max_depth, random_state=42)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='macro')
            mlflow.log_metric("accuracy", acc)
            mlflow.log_metric("f1_macro", f1)
            mlflow.sklearn.log_model(model, "model")  # persist the fitted model as a run artifact

            print(f"n={n_est}, d={max_depth}: acc={acc:.4f}, f1={f1:.4f}")

CI/CD for Machine Learning

ML-Specific Testing Pyramid

graph TD
    A["Unit Tests<br/>(Data transforms, feature logic)"] --> B["Integration Tests<br/>(Pipeline end-to-end)"]
    B --> C["Model Quality Tests<br/>(Accuracy > threshold)"]
    C --> D["Fairness Tests<br/>(Bias audits)"]
    D --> E["Performance Tests<br/>(Latency, throughput)"]
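The base of the pyramid is cheap, deterministic tests on transform logic. A minimal sketch — the `normalize` transform and its expected behavior are illustrative, not from the source:

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale each feature column to zero mean and unit variance; guard zero std."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns must not divide by zero
    return (x - x.mean(axis=0)) / std

def test_normalize_centers_and_guards():
    x = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
    out = normalize(x)
    assert np.allclose(out.mean(axis=0), 0.0)   # centered
    assert not np.isnan(out).any()              # zero-std guard held

test_normalize_centers_and_guards()
print("unit tests passed")
```

Tests like this run in milliseconds on every commit, which is exactly why they sit at the bottom of the pyramid, before the slower pipeline and model-quality stages.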

Quality Gates

| Gate | Metric | Threshold | Action on Failure |
| --- | --- | --- | --- |
| Data schema | Column types, ranges | Must match | Block pipeline |
| Data freshness | Max age of data | < 24 hours | Alert team |
| Model accuracy | Test set performance | > baseline | Block deploy |
| Inference latency | p95 latency | < 100 ms | Block deploy |
| Fairness | Demographic parity diff | < 0.1 | Block deploy |
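A gate is just a function that returns a verdict; CI runs it between training and registry promotion. A minimal sketch of the accuracy and latency gates (thresholds mirror the table; function names and the 0.90 baseline are illustrative):

```python
import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def accuracy_gate(model, X_test, y_test, baseline: float) -> bool:
    """Block deploy unless the candidate beats the current baseline."""
    return bool(accuracy_score(y_test, model.predict(X_test)) > baseline)

def latency_gate(model, X_sample, p95_budget_ms: float = 100.0, n: int = 200) -> bool:
    """Block deploy if p95 single-row inference latency exceeds the budget."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        model.predict(X_sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.percentile(timings, 95)) <= p95_budget_ms

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

verdicts = {
    "accuracy": accuracy_gate(model, X_test, y_test, baseline=0.90),
    "latency": latency_gate(model, X_test[:1]),
}
print(verdicts)
if not all(verdicts.values()):
    raise SystemExit("quality gate failed: blocking deploy")
```

In a real pipeline the exit code is what matters: a non-zero exit fails the CI stage, which is what "Block deploy" means mechanically.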
