3.3 MLOps Engineering
The Meta-Narrative
MLOps is where software engineering meets machine learning: the discipline of delivering ML models reliably, repeatedly, and at scale. Without MLOps, every deployment is a one-off miracle. With it, you have a machine: a system that continuously trains, validates, deploys, and monitors models with minimal human intervention.
The MLOps Stack
graph TD
    subgraph "Development"
        A["Experiment Tracking<br/>(MLflow, W&B)"]
        B["Data Versioning<br/>(DVC, lakeFS)"]
        C["Feature Store<br/>(Feast, Tecton)"]
    end
    subgraph "Pipeline Orchestration"
        D["Training Pipeline<br/>(Kubeflow, Airflow)"]
        E["Validation Gate<br/>(performance thresholds)"]
    end
    subgraph "Deployment"
        F["Model Registry<br/>(MLflow, SageMaker)"]
        G["Serving<br/>(Triton, TorchServe)"]
    end
    subgraph "Monitoring"
        H["Data Drift Detection"]
        I["Performance Monitoring"]
        J["Alerting<br/>(PagerDuty, Slack)"]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    E --> |"Pass"| F
    E --> |"Fail"| D
    F --> G
    G --> H
    G --> I
    I --> J
    H --> J
    J --> |"Retrain"| D
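The diagram names a "Data Drift Detection" stage without prescribing a metric. As one possible illustration, the sketch below computes the Population Stability Index (PSI) between a training-time reference sample and a live sample of a single numeric feature. The helper name `population_stability_index`, the bin count, the synthetic Gaussian data, and the 0.2 alert threshold are all assumptions made for this example (0.2 is a commonly quoted rule of thumb, not a value taken from this guide).

```python
import math
import random


def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature (hypothetical helper)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))


# Synthetic data: one live sample from the same distribution, one shifted.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff, tune per feature

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
for name, psi in [("same", psi_same), ("shifted", psi_shifted)]:
    status = "ALERT: consider retraining" if psi > DRIFT_THRESHOLD else "ok"
    print(f"{name}: PSI={psi:.3f} -> {status}")
```

In a real pipeline the alert branch would feed the "Retrain" edge of the diagram rather than print a message, and each monitored feature would likely need its own binning and threshold.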
Experiment Tracking: The Foundation
Why Spreadsheets Kill ML Projects
Without experiment tracking, teams cannot:
- Reproduce a model from 3 months ago
- Compare 50 hyperparameter configurations
- Roll back to a previous model version
- Audit which data trained which model
MLflow, Weights & Biases, and Neptune solve this by logging parameters, metrics, artifacts, and code versions for every run.
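To make concrete what a tracked run contains, here is a minimal standard-library sketch of a run record. It is not how MLflow, Weights & Biases, or Neptune store runs: the `log_run` and `best_run` helpers, the JSON Lines layout, the field names, and the metric values are all hypothetical and exist only to show that parameters, metrics, a data fingerprint, and a code version are enough to answer the questions listed above.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def log_run(store: Path, params: dict, metrics: dict, data: bytes, code_version: str) -> dict:
    """Append one run record to a JSON Lines file (hypothetical layout)."""
    record = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        # A hash of the training data answers "which data trained which model".
        "data_sha256": hashlib.sha256(data).hexdigest(),
        # In practice this would be a git commit hash; here it is passed in.
        "code_version": code_version,
    }
    with store.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def best_run(store: Path, metric: str) -> dict:
    """Return the logged run with the highest value of `metric`."""
    with store.open(encoding="utf-8") as fh:
        runs = [json.loads(line) for line in fh if line.strip()]
    return max(runs, key=lambda r: r["metrics"][metric])


# Placeholder inputs: the accuracy values are illustrative, not real results.
store = Path(tempfile.mkdtemp()) / "runs.jsonl"
data = b"sepal_length,sepal_width\n5.1,3.5\n"
log_run(store, {"n_estimators": 50, "max_depth": 3}, {"accuracy": 0.93}, data, "abc123")
log_run(store, {"n_estimators": 100, "max_depth": None}, {"accuracy": 0.97}, data, "abc123")

winner = best_run(store, "accuracy")
print(winner["params"], winner["data_sha256"][:12])
```

A dedicated tracker adds what this sketch lacks: artifact storage, a UI for comparing runs, and concurrent writers.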
🚀 Lab: MLflow Experiment Tracking
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# All runs below are grouped under one named experiment.
mlflow.set_experiment("iris-classification")

# Grid over forest size and tree depth: one MLflow run per configuration.
for n_est in [50, 100, 200]:
    for max_depth in [3, 5, None]:
        with mlflow.start_run(run_name=f"rf_n{n_est}_d{max_depth}"):
            # Hyperparameters (max_depth=None is logged as the string "None").
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", max_depth)

            model = RandomForestClassifier(n_estimators=n_est, max_depth=max_depth, random_state=42)
            model.fit(X_train, y_train)

            # Evaluate on the held-out split.
            y_pred = model.predict(X_test)
            acc = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average="macro")
            mlflow.log_metric("accuracy", acc)
            mlflow.log_metric("f1_macro", f1)

            # Store the fitted model as a run artifact.
            mlflow.sklearn.log_model(model, "model")
            print(f"n={n_est}, d={max_depth}: acc={acc:.4f}, f1={f1:.4f}")
CI/CD for Machine Learning
ML-Specific Testing Pyramid
graph TD
    A["Unit Tests<br/>(Data transforms, feature logic)"] --> B["Integration Tests<br/>(Pipeline end-to-end)"]
    B --> C["Model Quality Tests<br/>(Accuracy > threshold)"]
    C --> D["Fairness Tests<br/>(Bias audits)"]
    D --> E["Performance Tests<br/>(Latency, throughput)"]
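The first three layers of the pyramid can be sketched as plain test functions in the style a runner such as pytest would collect. Everything here is a stand-in chosen for illustration: `min_max_scale` plays the role of a feature transform, `threshold_classifier` the role of a trained model, and `MIN_ACCURACY` is an assumed threshold rather than a recommended value.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def threshold_classifier(x, threshold=0.5):
    """Stand-in 'model': predicts 1 when the scaled feature exceeds the threshold."""
    return 1 if x > threshold else 0


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Unit test: feature logic in isolation, including an edge case.
def test_min_max_scale():
    assert min_max_scale([10, 20, 30]) == [0.0, 0.5, 1.0]
    assert min_max_scale([7, 7]) == [0.0, 0.0]


# Integration test: raw input flows through transform and model end to end.
def test_pipeline_end_to_end():
    raw = [1, 2, 8, 9]
    preds = [threshold_classifier(x) for x in min_max_scale(raw)]
    assert len(preds) == len(raw)
    assert set(preds) <= {0, 1}


# Model quality test: held-out accuracy must clear a threshold.
MIN_ACCURACY = 0.75  # assumed threshold for this toy example


def test_model_quality():
    raw, labels = [1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1]
    preds = [threshold_classifier(x) for x in min_max_scale(raw)]
    assert accuracy(labels, preds) >= MIN_ACCURACY


# Run the tests directly when no test runner is available.
for test in (test_min_max_scale, test_pipeline_end_to_end, test_model_quality):
    test()
print("all tests passed")
```

The cheap, fast tests sit at the base so that a broken transform is caught before any time is spent training or auditing a model.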
Quality Gates
| Gate | Metric | Threshold | Action on Failure |
|---|---|---|---|
| Data schema | Column types, ranges | Must match | Block pipeline |
| Data freshness | Max age of data | < 24 hours | Alert team |
| Model accuracy | Test set performance | > baseline | Block deploy |
| Inference latency | p95 latency | < 100 ms | Block deploy |
| Fairness | Demographic parity diff | < 0.1 | Block deploy |
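A minimal sketch of how the gates in the table might be evaluated in code follows. The `report` dictionary and its keys, the helper names, the nearest-rank percentile method, and all sample values are assumptions for illustration. Demographic parity difference is computed here as the largest gap in positive-prediction rate between groups.

```python
import math


def p95(latencies_ms):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def demographic_parity_diff(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = {}
    for g in set(groups):
        member_preds = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(member_preds) / len(member_preds)
    return max(rates.values()) - min(rates.values())


def evaluate_gates(report):
    """Return (gate, action) pairs for every gate that failed."""
    dp_diff = demographic_parity_diff(report["predictions"], report["groups"])
    gates = [
        ("Data schema", report["schema_matches"], "Block pipeline"),
        ("Data freshness", report["data_age_hours"] < 24, "Alert team"),
        ("Model accuracy", report["accuracy"] > report["baseline_accuracy"], "Block deploy"),
        ("Inference latency", p95(report["latencies_ms"]) < 100, "Block deploy"),
        ("Fairness", dp_diff < 0.1, "Block deploy"),
    ]
    return [(name, action) for name, passed, action in gates if not passed]


# Hypothetical report: stale data and a fairness gap, everything else passing.
report = {
    "schema_matches": True,
    "data_age_hours": 30,
    "accuracy": 0.91,
    "baseline_accuracy": 0.89,
    "latencies_ms": list(range(1, 101)),  # 1..100 ms
    "predictions": [1, 1, 0, 0, 1, 0, 0, 0],
    "groups": ["a", "a", "a", "a", "b", "b", "b", "b"],
}

failures = evaluate_gates(report)
for name, action in failures:
    print(f"{name} gate failed -> {action}")
```

Keeping the thresholds in one list makes the gates easy to review alongside the table. A production version would more likely load them from versioned configuration.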