Part 2: Kinds of RL Algorithms
Goal:
- Highlight the most foundational design choices in deep RL algorithms: what to learn, and how to learn it
- Expose the trade-offs in those choices
- Place a few prominent modern algorithms within this taxonomy
Model-Free vs. Model-Based RL
The most important branching point is whether the agent has access to (or learns) a model of the environment, where an environment model means a function that predicts state transitions and rewards.
Thus the main upside of having a model is that it lets the agent plan by thinking ahead and seeing what would happen for a range of candidate actions. AlphaZero is a famous example.
However, the main downside is that a ground-truth model of the environment is usually not available to the agent, so the model must be learned purely from experience. The biggest challenge is that bias in the learned model can be exploited by the agent: it overfits to the model, performing well inside it but badly in the real environment.
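As a minimal sketch of what planning with a learned model can look like, here is random-shooting model predictive control. Every name in it (`model`, `reward_fn`, the sampling ranges) is an assumed, hypothetical interface for illustration, not anything specified by Spinning Up:

```python
import numpy as np

def plan_random_shooting(model, reward_fn, state, action_dim,
                         horizon=15, n_candidates=500, rng=None):
    """Return the first action of the best-scoring random action sequence.

    `model(state, action) -> next_state` is the learned dynamics model and
    `reward_fn(state, action) -> float` scores each step; both are assumed
    interfaces for this sketch.
    """
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = model(s, a)  # roll the candidate sequence forward in the model
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action  # replan from scratch at every real step
```

Note how real-world performance hinges entirely on how faithful `model` is, which is exactly where the model-bias failure mode above bites.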
What to Learn
- Policies: stochastic or deterministic
- Action-value functions (Q-functions)
- Value functions
- Environment models
What to Learn in Model-Free RL
Policy Optimization
- Policy $\pi_\theta(a \mid s)$
- Performance objective $J(\pi_\theta)$, improved either directly by gradient ascent or by maximizing a local approximation of $J(\pi_\theta)$
- On-policy: each update only uses data collected while acting according to the most recent version of the policy
- Usually also learns an approximator $V_\phi(s)$ for the on-policy value function $V^\pi(s)$
E.g.:
- A2C/A3C: performs gradient ascent directly on the performance objective
- PPO: maximizes a surrogate objective that gives a conservative estimate of how much $J(\pi_\theta)$ will change as a result of the update
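A minimal sketch of the gradient-ascent idea behind these on-policy updates, with assumed shapes and network interfaces (an illustration, not Spinning Up's implementation): weight the log-probabilities of the actions actually taken by advantage estimates from the learned baseline $V_\phi$.

```python
import torch

def policy_gradient_loss(policy, value_fn, obs, actions, returns):
    """One on-policy loss computation; assumed shapes:
    obs [N, obs_dim], actions [N] (long), returns [N] (reward-to-go)."""
    dist = torch.distributions.Categorical(logits=policy(obs))
    logp = dist.log_prob(actions)                  # log pi_theta(a|s)
    adv = returns - value_fn(obs).squeeze(-1)      # advantage via baseline V_phi
    # Negate: optimizers minimize, and ascending J means descending -J.
    pg_loss = -(logp * adv.detach()).mean()
    v_loss = ((value_fn(obs).squeeze(-1) - returns) ** 2).mean()
    return pg_loss, v_loss
```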
Q-Learning
Learn an approximator $Q_\theta(s,a)$ for the optimal action-value function $Q^*(s,a)$.
Typically based on the Bellman equation.
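Concretely, the self-consistency condition in question is the Bellman optimality equation (standard notation, not defined elsewhere in these notes: $P$ is the transition distribution, $r$ the reward function, $\gamma$ the discount factor): \[Q^*(s,a) = \underset{s' \sim P}{\mathbb{E}}\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \right]\]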
Off-policy: each update can use data collected at any point during training, regardless of how the agent was acting when the data was obtained.
Corresponding policy: obtained via the connection between $Q^*$ and $\pi^*$, \[a(s) = \arg \max_a Q_\theta(s,a)\]
E.g.:
- DQN: the classic that substantially launched the field of deep RL
- C51: a variant that learns a distribution over returns whose expectation is $Q^*$
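A sketch of the kind of update this family performs (names like `q_net` and `target_net` are assumed for illustration): regress $Q_\theta$ toward the Bellman backup computed with a frozen target network, on a batch sampled from a replay buffer, which is exactly the off-policy data reuse mentioned above.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss on one replay batch: (obs, actions, rewards, next_obs, done)."""
    obs, actions, rewards, next_obs, done = batch
    # Q_theta(s, a) for the actions actually taken in the replayed transitions.
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman backup r + gamma * max_a' Q_target(s', a'), cut off at terminals.
        next_q = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - done) * next_q
    return F.mse_loss(q, target)
```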
Trade-off Between Policy Optimization and Q-Learning
- Policy optimization methods directly optimize the thing you want (policy performance), which makes them stable and reliable.
- Q-learning methods only indirectly optimize for performance, by training $Q_\theta$ to satisfy a self-consistency equation; this tends to be less stable.
- When they do work, though, Q-learning methods are substantially more sample efficient than policy optimization, because they can reuse data collected at any point during training.
Interpolating Between Policy Optimization and Q-Learning
- DDPG: concurrently learns a deterministic policy and a Q-function, using each to improve the other (sketched below)
- SAC: uses stochastic policies and entropy regularization, among other tricks, to stabilize learning and score higher than DDPG on standard benchmarks
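A sketch of the DDPG-style coupling, with assumed network names (`mu` is the deterministic actor, `targ_*` are slow-moving target copies): the critic chases a Bellman backup, and the actor simply climbs the critic.

```python
import torch
import torch.nn.functional as F

def ddpg_losses(mu, q_net, targ_mu, targ_q, batch, gamma=0.99):
    """The two intertwined DDPG updates on one replay batch."""
    obs, actions, rewards, next_obs, done = batch
    with torch.no_grad():
        # The target actor picks a' instead of a max over actions, which is
        # what makes the Q-learning side workable with continuous actions.
        next_q = targ_q(next_obs, targ_mu(next_obs)).squeeze(-1)
        backup = rewards + gamma * (1.0 - done) * next_q
    critic_loss = F.mse_loss(q_net(obs, actions).squeeze(-1), backup)
    actor_loss = -q_net(obs, mu(obs)).mean()  # maximize Q(s, mu(s))
    return critic_loss, actor_loss
```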