2.4 Modern Generative AI Engineering¶
The Meta-Narrative
Generative AI shifted from research curiosity to the most commercially valuable technology in AI. This chapter focuses on the engineering of generative systems: how to fine-tune foundation models, build efficient inference pipelines, implement RAG systems, and deploy text-to-image generation at scale.
Large Language Model Engineering¶
The Anatomy of an LLM¶
| Component | GPT-3 | Llama 2-70B | GPT-4 (est.) |
|---|---|---|---|
| Parameters | 175B | 70B | ~1.8T (MoE) |
| Layers | 96 | 80 | ~120 |
| Hidden size | 12,288 | 8,192 | ~12,288 |
| Attention heads | 96 | 64 | ~96 per expert |
| Context length | 2K | 4K | 128K |
| Training tokens | 300B | 2T | ~13T |
Fine-Tuning Strategies¶
Update all parameters. Maximum flexibility, maximum compute cost.
Cost: For Llama-70B: ~300GB VRAM, 128+ A100 GPUs.
Freeze the pretrained weights \(W\) and add trainable low-rank decomposition:
where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times d}\), and \(r \ll d\) (typically 8-64).
Cost: Only \(2 \times r \times d\) trainable params per layer. For Llama-7B with \(r=16\): ~4M trainable params (0.06% of total).
Quantize base model to 4-bit, then apply LoRA adapters in FP16.
Cost: Fine-tune a 70B model on a single 48GB GPU.
Why LoRA Works
Aghajanyan et al. (2021) showed that pretrained models have a low intrinsic dimensionality — the updates during fine-tuning lie in a low-rank subspace. LoRA directly parameterizes this subspace, achieving comparable performance to full fine-tuning with 10,000× fewer trainable parameters.
Quantization for Deployment¶
| Method | Bits | Quality Loss | Speedup | Memory Savings |
|---|---|---|---|---|
| FP32 (baseline) | 32 | None | 1× | 1× |
| FP16 / BF16 | 16 | Negligible | ~2× | 2× |
| INT8 (LLM.int8()) | 8 | Minimal | ~2-3× | 4× |
| GPTQ (4-bit) | 4 | Small | ~3-4× | 8× |
| GGUF (llama.cpp) | 2-6 | Variable | CPU inference | 5-16× |
Inference Optimization¶
graph LR
A["Naive Autoregressive<br/>(slow)"] --> B["KV-Cache<br/>(don't recompute past keys/values)"]
B --> C["Continuous Batching<br/>(vLLM, TGI)"]
C --> D["Speculative Decoding<br/>(draft model + verify)"]
D --> E["PagedAttention<br/>(vLLM: virtual memory for KV cache)"]
🚀 Lab: Fine-tuning with LoRA using PEFT
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
# Load base model
model_name = "meta-llama/Llama-2-7b-hf" # requires access
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, # Rank
lora_alpha=32, # Scaling factor
lora_dropout=0.05,
target_modules=["q_proj", "v_proj"], # Apply to attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable %: 0.06%
# Train with HuggingFace Trainer
training_args = TrainingArguments(
output_dir="./lora-output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
fp16=True,
)
# trainer = Trainer(model=model, args=training_args, train_dataset=...)
# trainer.train()
Text-to-Image: Diffusion in Production¶
Stable Diffusion Architecture¶
graph LR
A["Text Prompt"] --> B["CLIP Text Encoder"]
B --> C["Cross-Attention Conditioning"]
N["Random Noise z_T"] --> D["U-Net Denoiser"]
C --> D
D --> |"T denoising steps"| E["Denoised Latent z_0"]
E --> F["VAE Decoder"]
F --> G["Output Image"]
The Latent Diffusion Trick
Instead of denoising in pixel space (expensive), Stable Diffusion operates in a compressed latent space (64×64 instead of 512×512). The VAE encoder/decoder handles the compression. This reduces compute by ~48× while maintaining image quality.
References¶
- Hu, E. J. et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
- Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized Large Language Models.
- Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention.
- Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR.