3.4 Case Studies, Tools, and the Modern AI Stack¶
The Meta-Narrative
Theory and engineering converge in real-world systems. This chapter examines how leading companies architect their AI systems, surveys the modern tooling landscape, and provides a comprehensive reference for building production ML systems.
Case Study: Large-Scale Recommendation¶
Architecture: YouTube Recommendations¶
```mermaid
graph LR
A["User Activity<br/>(watch, click, search)"] --> B["Feature Pipeline<br/>(real-time + batch)"]
B --> C["Candidate Generation<br/>(retrieve ~500 from millions)"]
C --> D["Ranking Model<br/>(deep neural ranker)"]
D --> E["Business Rules<br/>(diversity, freshness)"]
E --> F["Final Feed<br/>(~20 items)"]
F --> G["Impression Logging"]
G --> A
```
Two-Tower Architecture (Internal Detail)
The candidate generation model uses a two-tower architecture: one tower encodes the user (history, demographics), another encodes items (title, category, embeddings). At training time, the towers are jointly optimized. At serving time, item embeddings are precomputed and indexed in an approximate nearest neighbor (ANN) system (FAISS, ScaNN) for sub-millisecond retrieval.
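The serving-time split described above can be sketched in a few lines. This is a toy illustration with made-up dimensions and a brute-force dot-product search standing in for the ANN index (FAISS, ScaNN); the `user_tower` here is just an averaged history vector, not a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ITEMS = 16, 1000

# Item embeddings are produced offline by the item tower and indexed;
# rows are unit-normalized so inner product equals cosine similarity.
item_embeddings = rng.standard_normal((N_ITEMS, DIM))
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def user_tower(history_vectors: np.ndarray) -> np.ndarray:
    """Hypothetical user tower: average the history, normalize."""
    u = history_vectors.mean(axis=0)
    return u / np.linalg.norm(u)

def retrieve_candidates(user_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k items by inner product (brute force; an ANN index in production)."""
    scores = item_embeddings @ user_vec
    return np.argsort(-scores)[:k]

user_vec = user_tower(rng.standard_normal((10, DIM)))
candidates = retrieve_candidates(user_vec, k=5)
print(candidates)  # indices of the 5 highest-scoring items
```

The key property to notice: only the user tower runs per request; the expensive item side is entirely precomputed, which is what makes sub-millisecond retrieval over millions of items feasible.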
Key Engineering Challenges¶
| Challenge | Solution | Scale |
|---|---|---|
| Candidate set too large | ANN indexing (FAISS, ScaNN) | Billions of items |
| Cold start (new users) | Content-based fallback | Millions of new users/day |
| Real-time features | Streaming pipeline (Kafka → Flink) | Millions of events/sec |
| Feedback loops | Exploration (epsilon-greedy, Thompson sampling) | Continual |
| A/B testing | Multi-armed bandits, interleaving | Statistical rigor |
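The feedback-loop row above deserves a concrete picture: if the model only shows what it already predicts users like, it never gathers data about anything else. A minimal epsilon-greedy slate builder is sketched below; all names and the 10% exploration rate are illustrative, not a production API.

```python
import random

def epsilon_greedy_slate(ranked_items, candidate_pool, epsilon=0.1, k=20, seed=None):
    """Build a k-item feed: with probability 1 - epsilon take the next
    ranked item, with probability epsilon inject a random unranked
    candidate so the system keeps gathering data outside its own head."""
    rng = random.Random(seed)
    slate, ranked = [], list(ranked_items)
    pool = [c for c in candidate_pool if c not in set(ranked)]
    while len(slate) < k and (ranked or pool):
        if pool and (not ranked or rng.random() < epsilon):
            slate.append(pool.pop(rng.randrange(len(pool))))  # explore
        else:
            slate.append(ranked.pop(0))  # exploit
    return slate

# Ranked items 0..29, exploration pool 100..199.
feed = epsilon_greedy_slate(list(range(30)), list(range(100, 200)),
                            epsilon=0.2, k=20, seed=42)
print(feed)
```

Thompson sampling replaces the fixed epsilon with posterior sampling over click-through rates, so exploration concentrates on items the system is genuinely uncertain about.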
Case Study: Autonomous Driving¶
Full Perception Stack¶
```mermaid
graph TD
subgraph "Sensors"
C["Camera (8-12x)"]
L["LiDAR (1-5x)"]
R["Radar (5-8x)"]
G["GPS + IMU"]
end
subgraph "Perception"
D["Object Detection<br/>(BEVFormer, YOLO)"]
S["Semantic Segmentation"]
T["3D Tracking<br/>(multi-frame)"]
end
subgraph "Planning"
P["Behavior Prediction<br/>(other agents)"]
Q["Path Planning<br/>(lattice, A*)"]
end
subgraph "Control"
U["PID / MPC Controller"]
end
C --> D
L --> D
R --> T
G --> Q
D --> T
S --> Q
T --> P
P --> Q
Q --> U
```
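The planning box above names A* search. A minimal A* on a 4-connected occupancy grid shows the core idea (priority queue ordered by cost-so-far plus an admissible heuristic); real planners search lattices of kinematically feasible motions, not grid cells, so treat this as a toy illustration of the planner's role.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid (1 = obstacle).
    Manhattan-distance heuristic; returns the cell path or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap = [(h(start), 0, start, None)]   # (f = g + h, g, cell, parent)
    came_from, g_best = {}, {start: 0}
    while open_heap:
        _, g, cur, parent = heapq.heappop(open_heap)
        if cur in came_from:          # already expanded with a better g
            continue
        came_from[cur] = parent
        if cur == goal:               # walk parents back to the start
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_best.get((nr, nc), float("inf")):
                    g_best[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc), cur))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
print(path)
```

Because Manhattan distance never overestimates cost on this grid, the first time the goal is popped the path is optimal; here the wall in the middle row forces a detour through the single gap.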
Safety Engineering¶
The Long Tail Problem
Self-driving models perform well on common scenarios (roughly 99.9% of driving) but fail on edge cases: unusual objects, extreme weather, construction zones. It is the remaining 0.1% that contains the scenarios that cause accidents. This is why:
- Tesla uses a massive fleet (billions of miles) to mine edge cases
- Waymo uses extensive simulation (20 billion simulated miles)
- Neither has fully solved the long-tail safety problem
The Modern AI/ML Tool Landscape¶
Data & Feature Engineering¶
| Tool | Category | Strengths |
|---|---|---|
| Apache Spark | Distributed compute | Mature, large-scale ETL |
| dbt | Data transformation | SQL-first, version controlled |
| Great Expectations | Data validation | Schema + distribution tests |
| Feast | Feature store | Open-source, offline + online |
| DVC | Data versioning | Git-like for data |
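To make the data-validation row concrete, the two checks that tools in this category run on every batch are a schema check (types, nulls) and a distribution check (does a statistic stay in an expected range). The sketch below is plain Python in the spirit of Great Expectations, not its actual API; column names and thresholds are invented.

```python
import statistics

def validate_batch(rows, schema, ranges):
    """Toy batch validation: `schema` maps column -> expected type,
    `ranges` maps column -> (lo, hi) bounds on the batch mean.
    Returns a list of human-readable failure messages."""
    failures = []
    for col, typ in schema.items():
        if not all(isinstance(r.get(col), typ) for r in rows):
            failures.append(f"schema: {col} is not all {typ.__name__}")
    for col, (lo, hi) in ranges.items():
        mean = statistics.fmean(r[col] for r in rows)
        if not lo <= mean <= hi:
            failures.append(f"distribution: mean({col})={mean:.2f} outside [{lo}, {hi}]")
    return failures

rows = [{"age": 34, "ctr": 0.04}, {"age": 29, "ctr": 0.06}, {"age": 41, "ctr": 0.31}]
print(validate_batch(rows, {"age": int, "ctr": float}, {"ctr": (0.0, 0.10)}))
```

The distribution check is what catches silent upstream breakage (a logging change that inflates click-through rate, say) that a schema check alone would pass.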
Training & Experimentation¶
| Tool | Category | Strengths |
|---|---|---|
| PyTorch | DL framework | Research flexibility, dominant |
| JAX | DL framework | Functional, TPU-optimized |
| HuggingFace | Model hub + libraries | Largest pretrained model ecosystem |
| Weights & Biases | Experiment tracking | Best visualization, team features |
| Lightning | Training framework | Reduces PyTorch boilerplate |
Deployment & Serving¶
| Tool | Category | Strengths |
|---|---|---|
| vLLM | LLM serving | PagedAttention, continuous batching |
| Triton (NVIDIA) | Multi-framework serving | GPU-optimized, ensemble models |
| BentoML | Model packaging | Clean API, containerization |
| TensorRT | Inference optimization | NVIDIA GPU kernel optimization |
| ONNX Runtime | Cross-platform inference | Framework-agnostic |
Orchestration & MLOps¶
| Tool | Category | Strengths |
|---|---|---|
| Kubeflow | ML pipelines on K8s | Full lifecycle on Kubernetes |
| MLflow | Experiment tracking + registry | Open-source, comprehensive |
| Airflow | Workflow orchestration | Mature, large community |
| Evidently AI | ML monitoring | Data + model drift detection |
Benchmarks and Evaluation¶
LLM Benchmarks¶
| Benchmark | What It Tests | Notable |
|---|---|---|
| MMLU | Massive multi-task knowledge (57 subjects) | Standard LLM knowledge test |
| HumanEval | Code generation (Python) | Functional correctness |
| GSM8K | Grade-school math reasoning | Tests chain-of-thought |
| ARC | Science reasoning (challenge set) | Harder than MMLU |
| MT-Bench | Multi-turn conversation quality | GPT-4 as judge |
| LMSYS Chatbot Arena | Human preference (Elo ratings) | Most trusted ranking |
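The Arena's Elo-style ratings reduce to a simple update rule after each pairwise human vote: the winner gains rating in proportion to how unexpected the win was. A minimal version (the classic chess formula with K = 32; LMSYS's actual fitting is more involved, e.g. Bradley-Terry regression over all votes):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a pairwise comparison.
    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Equal ratings, A wins: the upset probability was 0.5, so A gains k/2.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Note the zero-sum property: whatever A gains, B loses, so the population mean rating is conserved as votes accumulate.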
References¶
- Covington, P. et al. (2016). Deep Neural Networks for YouTube Recommendations. RecSys.
- Bojarski, M. et al. (2016). End to End Learning for Self-Driving Cars. arXiv.
- Paleyes, A. et al. (2022). Challenges in Deploying Machine Learning. ACM Computing Surveys.