3.4 Case Studies, Tools, and the Modern AI Stack¶
The Meta-Narrative
Theory and engineering converge in real-world systems. This chapter examines how leading companies architect their AI systems, surveys the modern tooling landscape, and provides a comprehensive reference for building production ML systems.
Case Study: Large-Scale Recommendation¶
Architecture: YouTube Recommendations¶
```mermaid
graph LR
A["User Activity<br/>(watch, click, search)"] --> B["Feature Pipeline<br/>(real-time + batch)"]
B --> C["Candidate Generation<br/>(retrieve ~500 from millions)"]
C --> D["Ranking Model<br/>(deep neural ranker)"]
D --> E["Business Rules<br/>(diversity, freshness)"]
E --> F["Final Feed<br/>(~20 items)"]
F --> G["Impression Logging"]
G --> A
```
Two-Tower Architecture (Internal Detail)
The candidate generation model uses a two-tower architecture: one tower encodes the user (history, demographics), another encodes items (title, category, embeddings). At training time, the towers are jointly optimized. At serving time, item embeddings are precomputed and indexed in an approximate nearest neighbor (ANN) system (FAISS, ScaNN) for sub-millisecond retrieval.
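The serving-time split described above can be sketched in a few lines. This is a toy illustration with made-up dimensions and a brute-force dot-product search standing in for the ANN index (FAISS, ScaNN); the `user_tower` here is just an averaged history vector, not a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ITEMS = 16, 1000

# Item embeddings are produced offline by the item tower and indexed;
# rows are unit-normalized so inner product equals cosine similarity.
item_embeddings = rng.standard_normal((N_ITEMS, DIM))
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def user_tower(history_vectors: np.ndarray) -> np.ndarray:
    """Hypothetical user tower: average the history, normalize."""
    u = history_vectors.mean(axis=0)
    return u / np.linalg.norm(u)

def retrieve_candidates(user_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k items by inner product (brute force; an ANN index in production)."""
    scores = item_embeddings @ user_vec
    return np.argsort(-scores)[:k]

user_vec = user_tower(rng.standard_normal((10, DIM)))
candidates = retrieve_candidates(user_vec, k=5)
print(candidates)  # indices of the 5 highest-scoring items
```

The key property to notice: only the user tower runs per request; the expensive item side is entirely precomputed, which is what makes sub-millisecond retrieval over millions of items feasible.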
Key Engineering Challenges¶
| Challenge | Solution | Scale |
|---|---|---|
| Candidate set too large | ANN indexing (FAISS, ScaNN) | Billions of items |
| Cold start (new users) | Content-based fallback | Millions of new users/day |
| Real-time features | Streaming pipeline (Kafka → Flink) | Millions of events/sec |
| Feedback loops | Exploration (epsilon-greedy, Thompson sampling) | Continual |
| A/B testing | Multi-armed bandits, interleaving | Statistical rigor |
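The feedback-loop row above deserves a concrete picture: if the model only shows what it already predicts users like, it never gathers data about anything else. A minimal epsilon-greedy slate builder is sketched below; all names and the 10% exploration rate are illustrative, not a production API.

```python
import random

def epsilon_greedy_slate(ranked_items, candidate_pool, epsilon=0.1, k=20, seed=None):
    """Build a k-item feed: with probability 1 - epsilon take the next
    ranked item, with probability epsilon inject a random unranked
    candidate so the system keeps gathering data outside its own head."""
    rng = random.Random(seed)
    slate, ranked = [], list(ranked_items)
    pool = [c for c in candidate_pool if c not in set(ranked)]
    while len(slate) < k and (ranked or pool):
        if pool and (not ranked or rng.random() < epsilon):
            slate.append(pool.pop(rng.randrange(len(pool))))  # explore
        else:
            slate.append(ranked.pop(0))  # exploit
    return slate

# Ranked items 0..29, exploration pool 100..199.
feed = epsilon_greedy_slate(list(range(30)), list(range(100, 200)),
                            epsilon=0.2, k=20, seed=42)
print(feed)
```

Thompson sampling replaces the fixed epsilon with posterior sampling over click-through rates, so exploration concentrates on items the system is genuinely uncertain about.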
Case Study: Autonomous Driving¶
Full Perception Stack¶
```mermaid
graph TD
subgraph "Sensors"
C["Camera (8-12x)"]
L["LiDAR (1-5x)"]
R["Radar (5-8x)"]
G["GPS + IMU"]
end
subgraph "Perception"
D["Object Detection<br/>(BEVFormer, YOLO)"]
S["Semantic Segmentation"]
T["3D Tracking<br/>(multi-frame)"]
end
subgraph "Planning"
P["Behavior Prediction<br/>(other agents)"]
Q["Path Planning<br/>(lattice, A*)"]
end
subgraph "Control"
U["PID / MPC Controller"]
end
C --> D
L --> D
R --> T
G --> Q
D --> T
S --> Q
T --> P
P --> Q
Q --> U
```
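The planning box above names A* search. A minimal A* on a 4-connected occupancy grid shows the core idea (priority queue ordered by cost-so-far plus an admissible heuristic); real planners search lattices of kinematically feasible motions, not grid cells, so treat this as a toy illustration of the planner's role.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid (1 = obstacle).
    Manhattan-distance heuristic; returns the cell path or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap = [(h(start), 0, start, None)]   # (f = g + h, g, cell, parent)
    came_from, g_best = {}, {start: 0}
    while open_heap:
        _, g, cur, parent = heapq.heappop(open_heap)
        if cur in came_from:          # already expanded with a better g
            continue
        came_from[cur] = parent
        if cur == goal:               # walk parents back to the start
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_best.get((nr, nc), float("inf")):
                    g_best[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc), cur))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
print(path)
```

Because Manhattan distance never overestimates cost on this grid, the first time the goal is popped the path is optimal; here the wall in the middle row forces a detour through the single gap.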
Safety Engineering¶
The Long Tail Problem
Self-driving models perform well on common scenarios (roughly 99.9% of driving) but fail on edge cases: unusual objects, extreme weather, construction zones. It is the remaining 0.1% that contains the scenarios that cause accidents. This is why:
- Tesla uses a massive fleet (billions of miles) to mine edge cases
- Waymo uses extensive simulation (20 billion simulated miles)
- Neither has fully solved the long-tail safety problem
The Modern AI/ML Tool Landscape¶
Data & Feature Engineering¶
| Tool | Category | Strengths |
|---|---|---|
| Apache Spark | Distributed compute | Mature, large-scale ETL |
| dbt | Data transformation | SQL-first, version controlled |
| Great Expectations | Data validation | Schema + distribution tests |
| Feast | Feature store | Open-source, offline + online |
| DVC | Data versioning | Git-like for data |
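To make the data-validation row concrete, the two checks that tools in this category run on every batch are a schema check (types, nulls) and a distribution check (does a statistic stay in an expected range). The sketch below is plain Python in the spirit of Great Expectations, not its actual API; column names and thresholds are invented.

```python
import statistics

def validate_batch(rows, schema, ranges):
    """Toy batch validation: `schema` maps column -> expected type,
    `ranges` maps column -> (lo, hi) bounds on the batch mean.
    Returns a list of human-readable failure messages."""
    failures = []
    for col, typ in schema.items():
        if not all(isinstance(r.get(col), typ) for r in rows):
            failures.append(f"schema: {col} is not all {typ.__name__}")
    for col, (lo, hi) in ranges.items():
        mean = statistics.fmean(r[col] for r in rows)
        if not lo <= mean <= hi:
            failures.append(f"distribution: mean({col})={mean:.2f} outside [{lo}, {hi}]")
    return failures

rows = [{"age": 34, "ctr": 0.04}, {"age": 29, "ctr": 0.06}, {"age": 41, "ctr": 0.31}]
print(validate_batch(rows, {"age": int, "ctr": float}, {"ctr": (0.0, 0.10)}))
```

The distribution check is what catches silent upstream breakage (a logging change that inflates click-through rate, say) that a schema check alone would pass.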
Training & Experimentation¶
| Tool | Category | Strengths |
|---|---|---|
| PyTorch | DL framework | Research flexibility, dominant |
| JAX | DL framework | Functional, TPU-optimized |
| HuggingFace | Model hub + libraries | Largest pretrained model ecosystem |
| Weights & Biases | Experiment tracking | Best visualization, team features |
| Lightning | Training framework | Reduces PyTorch boilerplate |
Deployment & Serving¶
| Tool | Category | Strengths |
|---|---|---|
| vLLM | LLM serving | PagedAttention, continuous batching |
| Triton (NVIDIA) | Multi-framework serving | GPU-optimized, ensemble models |
| BentoML | Model packaging | Clean API, containerization |
| TensorRT | Inference optimization | NVIDIA GPU kernel optimization |
| ONNX Runtime | Cross-platform inference | Framework-agnostic |
Orchestration & MLOps¶
| Tool | Category | Strengths |
|---|---|---|
| Kubeflow | ML pipelines on K8s | Full lifecycle on Kubernetes |
| MLflow | Experiment tracking + registry | Open-source, comprehensive |
| Airflow | Workflow orchestration | Mature, large community |
| Evidently AI | ML monitoring | Data + model drift detection |
Benchmarks and Evaluation¶
LLM Benchmarks¶
| Benchmark | What It Tests | Notable |
|---|---|---|
| MMLU | Massive multi-task knowledge (57 subjects) | Standard LLM knowledge test |
| HumanEval | Code generation (Python) | Functional correctness |
| GSM8K | Grade-school math reasoning | Tests chain-of-thought |
| ARC | Science reasoning (challenge set) | Harder than MMLU |
| MT-Bench | Multi-turn conversation quality | GPT-4 as judge |
| LMSYS Chatbot Arena | Human preference (Elo ratings) | Most trusted ranking |
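The Arena's Elo-style ratings reduce to a simple update rule after each pairwise human vote: the winner gains rating in proportion to how unexpected the win was. A minimal version (the classic chess formula with K = 32; LMSYS's actual fitting is more involved, e.g. Bradley-Terry regression over all votes):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a pairwise comparison.
    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Equal ratings, A wins: the upset probability was 0.5, so A gains k/2.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Note the zero-sum property: whatever A gains, B loses, so the population mean rating is conserved as votes accumulate.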
References¶
- Covington, P. et al. (2016). Deep Neural Networks for YouTube Recommendations. RecSys.
- Bojarski, M. et al. (2016). End to End Learning for Self-Driving Cars. arXiv.
- Paleyes, A. et al. (2022). Challenges in Deploying Machine Learning. ACM Computing Surveys.