Part 2: Kinds of RL Algorithms
Goal:
- Highlight the most foundational design choices in deep RL algorithms: what to learn, and how to learn it
- Expose the trade-offs in those choices
- Place a few prominent modern algorithms within this taxonomy
Model-Free vs. Model-Based RL
The most important branching point is whether the agent has access to (or learns) a model of the environment, where an environment model means a function that predicts state transitions and rewards.
Thus the main upside of having a model is that it lets the agent plan by thinking ahead and seeing what would happen for a range of candidate actions. AlphaZero is a famous example.
However, the main downside is that a ground-truth model of the environment is usually not available to the agent, so the model must be learned purely from experience. The biggest challenge is that bias in the learned model can be exploited by the agent: it overfits to the model, performing well inside it but badly in the real environment.
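As a minimal sketch of what planning with a learned model can look like, here is random-shooting model predictive control. Every name in it (`model`, `reward_fn`, the sampling ranges) is an assumed, hypothetical interface for illustration, not anything specified by Spinning Up:

```python
import numpy as np

def plan_random_shooting(model, reward_fn, state, action_dim,
                         horizon=15, n_candidates=500, rng=None):
    """Return the first action of the best-scoring random action sequence.

    `model(state, action) -> next_state` is the learned dynamics model and
    `reward_fn(state, action) -> float` scores each step; both are assumed
    interfaces for this sketch.
    """
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = model(s, a)  # roll the candidate sequence forward in the model
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action  # replan from scratch at every real step
```

Note how real-world performance hinges entirely on how faithful `model` is, which is exactly where the model-bias failure mode above bites.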
What to Learn
- Policies: stochastic or deterministic
- Action-value functions (Q-functions)
- Value functions
- Environment models
What to Learn in Model-Free RL
Policy Optimization
- Policy $\pi_\theta(a \mid s)$
- Performance objective $J(\pi_\theta)$, improved either directly by gradient ascent or by maximizing a local approximation of $J(\pi_\theta)$
- On-policy: each update only uses data collected while acting according to the most recent version of the policy
- Usually also learns an approximator $V_\phi(s)$ for the on-policy value function $V^\pi(s)$
E.g.:
- A2C/A3C: performs gradient ascent directly on the performance objective
- PPO: maximizes a surrogate objective that gives a conservative estimate of how much $J(\pi_\theta)$ will change as a result of the update
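A minimal sketch of the gradient-ascent idea behind these on-policy updates, with assumed shapes and network interfaces (an illustration, not Spinning Up's implementation): weight the log-probabilities of the actions actually taken by advantage estimates from the learned baseline $V_\phi$.

```python
import torch

def policy_gradient_loss(policy, value_fn, obs, actions, returns):
    """One on-policy loss computation; assumed shapes:
    obs [N, obs_dim], actions [N] (long), returns [N] (reward-to-go)."""
    dist = torch.distributions.Categorical(logits=policy(obs))
    logp = dist.log_prob(actions)                  # log pi_theta(a|s)
    adv = returns - value_fn(obs).squeeze(-1)      # advantage via baseline V_phi
    # Negate: optimizers minimize, and ascending J means descending -J.
    pg_loss = -(logp * adv.detach()).mean()
    v_loss = ((value_fn(obs).squeeze(-1) - returns) ** 2).mean()
    return pg_loss, v_loss
```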
Q-Learning
Learn an approximator $Q_\theta(s,a)$ for the optimal action-value function $Q^*(s,a)$.
Typically based on the Bellman equation.
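Concretely, the self-consistency condition in question is the Bellman optimality equation (standard notation, not defined elsewhere in these notes: $P$ is the transition distribution, $r$ the reward function, $\gamma$ the discount factor): \[Q^*(s,a) = \underset{s' \sim P}{\mathbb{E}}\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \right]\]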
Off-policy: each update can use data collected at any point during training, regardless of how the agent was acting when the data was obtained.
Corresponding policy: obtained via the connection between $Q^*$ and $\pi^*$, \[a(s) = \arg \max_a Q_\theta(s,a)\]
E.g.:
- DQN: the classic that substantially launched the field of deep RL
- C51: a variant that learns a distribution over returns whose expectation is $Q^*$
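A sketch of the kind of update this family performs (names like `q_net` and `target_net` are assumed for illustration): regress $Q_\theta$ toward the Bellman backup computed with a frozen target network, on a batch sampled from a replay buffer, which is exactly the off-policy data reuse mentioned above.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss on one replay batch: (obs, actions, rewards, next_obs, done)."""
    obs, actions, rewards, next_obs, done = batch
    # Q_theta(s, a) for the actions actually taken in the replayed transitions.
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman backup r + gamma * max_a' Q_target(s', a'), cut off at terminals.
        next_q = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - done) * next_q
    return F.mse_loss(q, target)
```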
Trade-off Between Policy Optimization and Q-Learning
- Policy optimization methods directly optimize the thing you want (policy performance), which makes them stable and reliable.
- Q-learning methods only indirectly optimize for performance, by training $Q_\theta$ to satisfy a self-consistency equation; this tends to be less stable.
- When they do work, though, Q-learning methods are substantially more sample efficient than policy optimization, because they can reuse data collected at any point during training.
Interpolating Between Policy Optimization and Q-Learning
- DDPG: concurrently learns a deterministic policy and a Q-function, using each to improve the other (sketched below)
- SAC: uses stochastic policies and entropy regularization, among other tricks, to stabilize learning and score higher than DDPG on standard benchmarks
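A sketch of the DDPG-style coupling, with assumed network names (`mu` is the deterministic actor, `targ_*` are slow-moving target copies): the critic chases a Bellman backup, and the actor simply climbs the critic.

```python
import torch
import torch.nn.functional as F

def ddpg_losses(mu, q_net, targ_mu, targ_q, batch, gamma=0.99):
    """The two intertwined DDPG updates on one replay batch."""
    obs, actions, rewards, next_obs, done = batch
    with torch.no_grad():
        # The target actor picks a' instead of a max over actions, which is
        # what makes the Q-learning side workable with continuous actions.
        next_q = targ_q(next_obs, targ_mu(next_obs)).squeeze(-1)
        backup = rewards + gamma * (1.0 - done) * next_q
    critic_loss = F.mse_loss(q_net(obs, actions).squeeze(-1), backup)
    actor_loss = -q_net(obs, mu(obs)).mean()  # maximize Q(s, mu(s))
    return critic_loss, actor_loss
```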