
2.5 Reinforcement Learning Engineering

The Meta-Narrative

RL theory is elegant. RL engineering is brutal. The gap between the Bellman equation on a whiteboard and a working RL agent in production is filled with hyperparameter sensitivity, reward shaping nightmares, and sample inefficiency. This chapter focuses on the practical engineering of RL systems.


The RL Engineering Stack

```mermaid
graph TD
    A["Environment<br/>(Gymnasium, Unity, Real World)"] <--> B["Agent<br/>(Policy + Value Network)"]
    B --> C["Replay Buffer<br/>(Experience Storage)"]
    C --> D["Training Loop<br/>(Gradient Updates)"]
    D --> B
    B --> E["Evaluation<br/>(Deterministic Policy)"]
    E --> F["Logging<br/>(W&B, TensorBoard)"]
```
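
The replay buffer in the stack above is conceptually simple: a fixed-capacity FIFO store with uniform random sampling. A minimal pure-Python sketch (the class and its capacity are illustrative, not any particular library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience store with uniform random sampling."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive environment steps, stabilizing gradient updates
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add(t, 0, 1.0, t + 1, False)
print(len(buf))  # 3 -- capacity caps the buffer; the oldest entries are gone
```

Production buffers add prioritization and pre-allocated arrays, but the contract stays the same: append transitions, sample decorrelated minibatches.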

Environment Engineering

| Environment Type | Latency per step | Fidelity | Example |
|---|---|---|---|
| Grid world | μs | Very low | FrozenLake, Cliff Walking |
| Physics sim | ms | Medium | MuJoCo, PyBullet |
| Game engine | ms-s | High | Atari, StarCraft II |
| Photorealistic sim | 10ms-1s | Very high | CARLA, Isaac Sim |
| Real world | 10ms+ | Perfect | Robotics |
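
Whatever sits on the far side of the table, the agent usually sees the same reset/step contract. A toy environment sketching the Gymnasium-style interface (the `ChainEnv` itself is made up for illustration; the five-tuple return mirrors Gymnasium's `step`):

```python
class ChainEnv:
    """Toy 5-state chain: move right to reach state 4 and earn +1."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state, {}  # (observation, info), as in Gymnasium

    def step(self, action):
        # action 1 moves right, anything else moves left, clamped at the ends
        if action == 1:
            self.state = min(self.state + 1, self.n_states - 1)
        else:
            self.state = max(self.state - 1, 0)
        terminated = self.state == self.n_states - 1
        reward = 1.0 if terminated else 0.0
        # (obs, reward, terminated, truncated, info)
        return self.state, reward, terminated, False, {}

env = ChainEnv()
obs, info = env.reset()
```

Keeping custom environments behind this interface means any of the algorithms below can be dropped in without glue code.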

Reward Engineering: The Hardest Part

Specification Gaming

Agents exploit loopholes in reward functions. Famous examples:

  • Coast Runners: the boat circles a lagoon collecting respawning point targets instead of finishing the race
  • Evolved Creatures: creatures evolved for locomotion grow tall and simply fall over, covering distance by toppling ("body surfing")
  • Tetris Agent: pauses the game just before the final losing piece, so it never technically loses

The lesson: the reward function is the specification, and specifications are notoriously hard to get right.
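
One principled defense is potential-based reward shaping (Ng et al., 1999): adding F(s, s') = γΦ(s') − Φ(s) to the reward provably leaves the optimal policy unchanged, because the potential terms telescope over any trajectory. A minimal sketch, with an illustrative distance-to-goal potential (the goal state and γ here are made up):

```python
GAMMA = 0.99

def potential(state, goal=4):
    # Illustrative potential: negative distance to the goal state.
    # Any function of state alone is safe to use here.
    return -abs(goal - state)

def shaped_reward(reward, state, next_state):
    # F(s, s') = gamma * Phi(s') - Phi(s). The shaping terms telescope
    # across an episode, so the optimal policy is preserved while the
    # agent gets denser intermediate feedback.
    return reward + GAMMA * potential(next_state) - potential(state)

print(shaped_reward(0.0, 2, 3) > 0)  # True: moving toward the goal is rewarded
print(shaped_reward(0.0, 2, 1) < 0)  # True: moving away is penalized
```

Ad-hoc shaping bonuses that are *not* of this form change the optimization target itself, which is exactly how specification gaming starts.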

Practical Algorithm Selection

| Scenario | Algorithm | Why |
|---|---|---|
| Discrete actions, simple | DQN | Straightforward, well-understood |
| Continuous control | SAC (Soft Actor-Critic) | Sample-efficient, stable |
| General purpose | PPO | Most robust, good defaults |
| RLHF / LLM alignment | PPO with KL penalty | Standard for ChatGPT-style training |
| Sample-constrained | Model-based (Dreamer, MuZero) | Learns a world model, plans internally |
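
PPO's robustness comes largely from its clipped surrogate objective, Lᶜᴸᴵᴾ = E[min(rₜAₜ, clip(rₜ, 1−ε, 1+ε)Aₜ)], where rₜ is the probability ratio π_new(a|s)/π_old(a|s) and Aₜ the advantage estimate. A per-sample sketch in plain Python (real implementations vectorize this over a batch):

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for a single sample.

    ratio: pi_new(a|s) / pi_old(a|s); advantage: e.g. a GAE estimate.
    Taking the min() removes any incentive to push the ratio outside
    [1 - eps, 1 + eps] in the direction the advantage rewards.
    """
    clipped = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps
print(clipped_surrogate(1.5, advantage=2.0))   # 2.4, i.e. 1.2 * 2.0
# Negative advantage: min() keeps the pessimistic (more negative) term
print(clipped_surrogate(0.5, advantage=-2.0))  # -1.6, i.e. 0.8 * -2.0
```

The `clip_range=0.2` default here matches the `clip_range` hyperparameter in the lab below.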
🚀 Lab: PPO Training with Logging

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.evaluation import evaluate_policy

# Create separate training and evaluation environments
env = gym.make("LunarLander-v3")
eval_env = gym.make("LunarLander-v3")

# Evaluation callback: periodically evaluate and save the best model
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    eval_freq=5000,
    n_eval_episodes=10,
    deterministic=True,  # evaluate without exploration noise
)

# Train PPO
model = PPO(
    "MlpPolicy", env,
    learning_rate=3e-4,
    n_steps=2048,        # rollout length per environment before each update
    batch_size=64,
    n_epochs=10,         # gradient passes over each rollout
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,       # entropy bonus for exploration
    verbose=1,
)
model.learn(total_timesteps=200_000, callback=eval_callback)

# Final evaluation
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=50)
print(f"Final: {mean_reward:.2f} ± {std_reward:.2f}")
```

References

  • Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv.
  • Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv.
  • Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv.