
2.5 Reinforcement Learning Engineering

The Meta-Narrative

RL theory is elegant. RL engineering is brutal. The gap between the Bellman equation on a whiteboard and a working RL agent in production is filled with hyperparameter sensitivity, reward shaping nightmares, and sample inefficiency. This chapter focuses on the practical engineering of RL systems.


The RL Engineering Stack

```mermaid
graph TD
    A["Environment<br/>(Gymnasium, Unity, Real World)"] <--> B["Agent<br/>(Policy + Value Network)"]
    B --> C["Replay Buffer<br/>(Experience Storage)"]
    C --> D["Training Loop<br/>(Gradient Updates)"]
    D --> B
    B --> E["Evaluation<br/>(Deterministic Policy)"]
    E --> F["Logging<br/>(W&B, TensorBoard)"]
```
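
The replay buffer in the stack above is conceptually simple: a fixed-capacity FIFO store with uniform random sampling. A minimal pure-Python sketch (the class and its capacity are illustrative, not any particular library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience store with uniform random sampling."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive environment steps, stabilizing gradient updates
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add(t, 0, 1.0, t + 1, False)
print(len(buf))  # 3 -- capacity caps the buffer; the oldest entries are gone
```

Production buffers add prioritization and pre-allocated arrays, but the contract stays the same: append transitions, sample decorrelated minibatches.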

Environment Engineering

| Environment Type | Latency per step | Fidelity | Example |
|---|---|---|---|
| Grid world | μs | Very low | FrozenLake, Cliff Walking |
| Physics sim | ms | Medium | MuJoCo, PyBullet |
| Game engine | ms-s | High | Atari, StarCraft II |
| Photorealistic sim | 10ms-1s | Very high | CARLA, Isaac Sim |
| Real world | 10ms+ | Perfect | Robotics |
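
Whatever sits on the far side of the table, the agent usually sees the same reset/step contract. A toy environment sketching the Gymnasium-style interface (the `ChainEnv` itself is made up for illustration; the five-tuple return mirrors Gymnasium's `step`):

```python
class ChainEnv:
    """Toy 5-state chain: move right to reach state 4 and earn +1."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state, {}  # (observation, info), as in Gymnasium

    def step(self, action):
        # action 1 moves right, anything else moves left, clamped at the ends
        if action == 1:
            self.state = min(self.state + 1, self.n_states - 1)
        else:
            self.state = max(self.state - 1, 0)
        terminated = self.state == self.n_states - 1
        reward = 1.0 if terminated else 0.0
        # (obs, reward, terminated, truncated, info)
        return self.state, reward, terminated, False, {}

env = ChainEnv()
obs, info = env.reset()
```

Keeping custom environments behind this interface means any of the algorithms below can be dropped in without glue code.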

Reward Engineering: The Hardest Part

Specification Gaming

Agents exploit loopholes in reward functions. Famous examples:

  • Coast Runners: the boat circles a lagoon collecting respawning point targets instead of finishing the race
  • Evolved Creatures: creatures evolved for locomotion grow tall and simply fall over, covering distance by toppling ("body surfing")
  • Tetris Agent: pauses the game just before the final losing piece, so it never technically loses

The lesson: the reward function is the specification, and specifications are notoriously hard to get right.
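
One principled defense is potential-based reward shaping (Ng et al., 1999): adding F(s, s') = γΦ(s') − Φ(s) to the reward provably leaves the optimal policy unchanged, because the potential terms telescope over any trajectory. A minimal sketch, with an illustrative distance-to-goal potential (the goal state and γ here are made up):

```python
GAMMA = 0.99

def potential(state, goal=4):
    # Illustrative potential: negative distance to the goal state.
    # Any function of state alone is safe to use here.
    return -abs(goal - state)

def shaped_reward(reward, state, next_state):
    # F(s, s') = gamma * Phi(s') - Phi(s). The shaping terms telescope
    # across an episode, so the optimal policy is preserved while the
    # agent gets denser intermediate feedback.
    return reward + GAMMA * potential(next_state) - potential(state)

print(shaped_reward(0.0, 2, 3) > 0)  # True: moving toward the goal is rewarded
print(shaped_reward(0.0, 2, 1) < 0)  # True: moving away is penalized
```

Ad-hoc shaping bonuses that are *not* of this form change the optimization target itself, which is exactly how specification gaming starts.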

Practical Algorithm Selection

| Scenario | Algorithm | Why |
|---|---|---|
| Discrete actions, simple | DQN | Straightforward, well-understood |
| Continuous control | SAC (Soft Actor-Critic) | Sample-efficient, stable |
| General purpose | PPO | Most robust, good defaults |
| RLHF / LLM alignment | PPO with KL penalty | Standard for ChatGPT-style training |
| Sample-constrained | Model-based (Dreamer, MuZero) | Learns a world model, plans internally |
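
PPO's robustness comes largely from its clipped surrogate objective, Lᶜᴸᴵᴾ = E[min(rₜAₜ, clip(rₜ, 1−ε, 1+ε)Aₜ)], where rₜ is the probability ratio π_new(a|s)/π_old(a|s) and Aₜ the advantage estimate. A per-sample sketch in plain Python (real implementations vectorize this over a batch):

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for a single sample.

    ratio: pi_new(a|s) / pi_old(a|s); advantage: e.g. a GAE estimate.
    Taking the min() removes any incentive to push the ratio outside
    [1 - eps, 1 + eps] in the direction the advantage rewards.
    """
    clipped = max(min(ratio, 1.0 + clip_range), 1.0 - clip_range)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps
print(clipped_surrogate(1.5, advantage=2.0))   # 2.4, i.e. 1.2 * 2.0
# Negative advantage: min() keeps the pessimistic (more negative) term
print(clipped_surrogate(0.5, advantage=-2.0))  # -1.6, i.e. 0.8 * -2.0
```

The `clip_range=0.2` default here matches the `clip_range` hyperparameter in the lab below.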
🚀 Lab: PPO Training with Logging

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.evaluation import evaluate_policy

# Create separate training and evaluation environments
env = gym.make("LunarLander-v3")
eval_env = gym.make("LunarLander-v3")

# Evaluation callback: periodically evaluate and save the best model
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    eval_freq=5000,
    n_eval_episodes=10,
    deterministic=True,  # evaluate without exploration noise
)

# Train PPO
model = PPO(
    "MlpPolicy", env,
    learning_rate=3e-4,
    n_steps=2048,        # rollout length per environment before each update
    batch_size=64,
    n_epochs=10,         # gradient passes over each rollout
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,       # entropy bonus for exploration
    verbose=1,
)
model.learn(total_timesteps=200_000, callback=eval_callback)

# Final evaluation
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=50)
print(f"Final: {mean_reward:.2f} ± {std_reward:.2f}")
```

References

  • Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv.
  • Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv.
  • Amodei, D. et al. (2016). Concrete Problems in AI Safety. arXiv.