Spinning Up: Part 1 Key Concepts in RL

Introduction to RL

Part 1: Key Concepts in RL

[Figure: the agent-environment interaction loop]

Concepts:

  • states and observations,
  • action spaces,
  • policies,
  • trajectories,
  • different formulations of return,
  • the RL optimization problem,
  • and value functions.
  1. The agent interacts with the environment by taking actions ($a$).
  2. It then receives observations ($o$) and takes actions again.
  3. As the agent acts, the environment changes, but the environment may also change on its own.

The reward is a signal from the environment; the agent's goal is to maximize its cumulative reward, called the return.

States and observations

  • A state $s$ is a complete description of the world.
  • An observation $o$ is just a partial description of a state.
  • Fully observed vs. partially observed environments.

Action space: the set of all valid actions in a given environment.

  • Discrete action spaces
  • Continuous action spaces
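As a quick, hedged illustration (assuming the `gym` package; `gymnasium` behaves the same for these classes), the two kinds of action space look like this in code:

```python
# Discrete vs. continuous action spaces, illustrated with Gym's space classes.
# Sketch only; assumes the `gym` package is installed.
from gym import spaces

# Discrete action space: a finite set of actions {0, 1, 2, 3} (e.g. Atari).
discrete = spaces.Discrete(4)

# Continuous action space: a box in R^3 with per-dimension bounds (e.g. robot torques).
continuous = spaces.Box(low=-1.0, high=1.0, shape=(3,))

print(discrete.sample())    # an integer in {0, ..., 3}
print(continuous.sample())  # a length-3 real vector with entries in [-1, 1]
```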

Policies are the rules an agent uses to decide which actions to take. A deterministic policy is usually denoted by $\mu$: \[a_t = \mu(s_t).\]

A stochastic policy is usually denoted by $\pi$: \[a_t \sim \pi(\cdot | s_t).\]

We often denote the parameters of a policy by $\theta$ or $\phi$: \[\begin{cases} a_{t} = \mu_{\theta}(s_{t})\\ a_{t} \sim \pi_{\theta}(\cdot \mid s_{t}) \end{cases}\]

Two key computations on Stochastic Policies:

  • Sampling actions from the policy.
  • Computing log-likelihoods of actions, $\log \pi_\theta (a\mid s)$

For a categorical (discrete-action) policy, this is much like training a classifier whose classes are the actions: the network outputs a probability for each action, and training works with the log-likelihoods of the actions taken.

Log-likelihood, where $P_\theta(s)$ denotes the vector of action probabilities output by the network: \[\log\pi_\theta(a \mid s) = \log\left[P_\theta(s)\right]_a\]
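A minimal PyTorch sketch of these two computations for a categorical policy; the small `policy_net` below is an assumed toy network, not something from the original text:

```python
# Categorical (discrete-action) stochastic policy sketch in PyTorch.
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 2
# Assumed toy network mapping observations to action logits.
policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

obs = torch.randn(obs_dim)          # a dummy observation
logits = policy_net(obs)            # unnormalized log-probabilities over actions
pi = Categorical(logits=logits)     # pi_theta(. | s)

a = pi.sample()                     # key computation 1: sample an action
log_p = pi.log_prob(a)              # key computation 2: log pi_theta(a | s)
print(a.item(), log_p.item())
```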

Diagonal Gaussian Policies

  • A multivariate Gaussian has a mean vector $\mu$ and a covariance matrix $\Sigma$.
  • A diagonal Gaussian's covariance matrix has entries only on the diagonal, so it can be represented by a vector $\sigma$.
  • A neural network maps states to mean actions: $s_t \xrightarrow{NN} \mu_{\theta}(s_t)$

Two ways to represent the covariance (via the log standard deviation):

  1. Standalone parameters $\log\sigma$, not a function of the state.
  2. A neural network mapping states to log standard deviations: $s \xrightarrow{NN} \log \sigma_\theta(s)$

Sampling:

Given the mean action $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$, and the vector $z$ of noise from spherical Gaussian ($z\sim \mathcal{N}(0, I)$): \[a=\mu_\theta(s)+\sigma_\theta(s) \odot z,\]

where $\odot$ denotes the elementwise product of two vectors.

Log-likelihood of $k$-dim action $a$: \[\log \pi_\theta(a \mid s)=-\frac{1}{2}\left(\sum_{i=1}^k\left(\frac{\left(a_i-\mu_i\right)^2}{\sigma_i^2}+2 \log \sigma_i\right)+k \log 2 \pi\right)\]
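A sketch of the sampling rule and the log-likelihood formula above in PyTorch; `mu_net` and the standalone `log_std` parameter are assumptions for illustration:

```python
# Diagonal Gaussian policy sketch: a = mu_theta(s) + sigma_theta(s) * z.
import math
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 3
mu_net = nn.Linear(obs_dim, act_dim)                  # assumed network for mu_theta(s)
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))    # option 1: standalone log sigma

obs = torch.randn(obs_dim)
mu, std = mu_net(obs), torch.exp(log_std)

z = torch.randn(act_dim)          # z ~ N(0, I), spherical Gaussian noise
a = mu + std * z                  # elementwise product with sigma

# Log-likelihood of the k-dimensional action, matching the formula above.
k = act_dim
log_p = -0.5 * (((a - mu) ** 2 / std ** 2 + 2 * log_std).sum() + k * math.log(2 * math.pi))

# Cross-check against torch's built-in Normal distribution (summed over dimensions).
log_p_ref = torch.distributions.Normal(mu, std).log_prob(a).sum()
print(log_p.item(), log_p_ref.item())
```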

Trajectories

A trajectory is a sequence of states and actions in the world: \[\tau=(s_0, a_0, s_1, a_1, ...),\]

where $s_0\sim\rho_0(\cdot)$, and $\rho_0$ is the start-state distribution.

State transitions can be deterministic, $s_{t+1}=f(s_t, a_t)$, or stochastic, $s_{t+1}\sim P(\cdot \mid s_t, a_t)$.
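A hedged sketch of collecting one trajectory in a Gym environment with a random stand-in policy (this assumes the classic `gym` step API that returns `(obs, reward, done, info)`):

```python
# Roll out one trajectory tau = (s_0, a_0, s_1, a_1, ...) with a random policy.
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                            # s_0 ~ rho_0(.)
trajectory, rewards = [], []

done = False
while not done:
    act = env.action_space.sample()          # stand-in for a_t ~ pi(. | s_t)
    trajectory.append((obs, act))
    obs, rew, done, info = env.step(act)     # s_{t+1} ~ P(. | s_t, a_t)
    rewards.append(rew)

print(len(trajectory), "steps, total reward:", sum(rewards))
```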

Reward and return

We write $r_t$ for the reward at time $t$; the reward function is often simplified to depend only on the state-action pair, or on the state alone: \[r_t = R(s_t, a_t, s_{t+1})\simeq R(s_t,a_t)\simeq R(s_t)\]

Two kinds of return:

  1. Finite-horizon undiscounted return, the sum of rewards obtained in a fixed window of $T$ steps: \[R(\tau)=\sum_{t=0}^T r_t,\]

  2. Infinite-horizon discounted return, where $\gamma \in (0,1)$ is the discount factor: \[R(\tau)=\sum_{t=0}^\infty \gamma^t r_t,\]

Why use $\gamma^t$?

  • Intuitively appealing: cash now is worth more than cash later.
  • Mathematically convenient: with $\gamma < 1$ and bounded rewards, the infinite sum converges to a finite value.
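A small sketch of both kinds of return computed from a list of rewards, plus a check that the discounted sum stays bounded:

```python
# The two kinds of return, computed from a reward sequence.
def finite_horizon_return(rewards):
    """Undiscounted return R(tau) = sum_{t=0}^T r_t."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """Discounted return R(tau) = sum_t gamma^t r_t (finite for gamma < 1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0] * 100
print(finite_horizon_return(rewards))   # 100.0, grows with the horizon
print(discounted_return(rewards))       # ~63.4, bounded by 1 / (1 - gamma) = 100
```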

RL Problem

Goal of RL: choose a policy that maximizes the expected return when the agent acts according to it.

Probability of a $T$-step trajectory under policy $\pi$: \[P(\tau \mid \pi)=\rho_0\left(s_0\right) \prod_{t=0}^{T-1} P\left(s_{t+1} \mid s_t, a_t\right) \pi\left(a_t \mid s_t\right)\]

The expected return: \[J(\pi)=\int_\tau P(\tau \mid \pi) R(\tau)=\underset{\tau \sim \pi}{\mathrm{E}}[R(\tau)]\]

The central optimization problem in RL, whose solution $\pi^*$ is the optimal policy: \[\pi^*=\arg \max _\pi J(\pi)\]
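Since the integral over trajectories is rarely tractable, $J(\pi)$ is usually estimated by sampling. A minimal sketch, where `run_episode` is a hypothetical helper that rolls out one episode with policy `pi` and returns $R(\tau)$ (e.g. the rollout sketch above):

```python
# Monte Carlo estimate of the expected return J(pi) = E_{tau ~ pi}[R(tau)].
def estimate_expected_return(run_episode, pi, n_trajectories=100):
    """`run_episode(pi)` is a hypothetical helper returning R(tau) for one rollout."""
    returns = [run_episode(pi) for _ in range(n_trajectories)]
    return sum(returns) / len(returns)
```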

Value Function

The value of a state (or state-action pair) is the expected return obtained by starting there and then acting according to a particular policy forever after.

  1. On-Policy Value Function: the expected return if you start in state $s$ and always act according to policy $\pi$:
\[V^\pi(s)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s\right]\]
  2. On-Policy Action-Value Function: the expected return if you start in state $s$, take an arbitrary action $a$, and then act according to policy $\pi$:
\[Q^\pi(s, a)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s, a_0=a\right]\]
  3. Optimal Value Function: the expected return if you start in state $s$ and always act according to the optimal policy:
\[V^*(s)=\max_\pi \underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s\right]\]
  4. Optimal Action-Value Function: the expected return if you start in state $s$, take an arbitrary action $a$, and then act according to the optimal policy:
\[Q^*(s, a)=\max_\pi \underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s, a_0=a\right]\]

Thus, we get: \[V^\pi(s)=\underset{a \sim \pi}{\mathrm{E}}\left[Q^\pi(s, a)\right], \qquad V^*(s)=\max_a Q^*(s, a)\]
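A tiny numeric sketch of these two identities, with made-up values for $Q^\pi$, $\pi(\cdot \mid s)$, and $Q^*$ at a single state with three actions:

```python
# V^pi(s) = E_{a~pi}[Q^pi(s, a)] and V*(s) = max_a Q*(s, a), on toy numbers.
import numpy as np

q_pi = np.array([1.0, 2.0, 0.5])        # made-up Q^pi(s, a) for a in {0, 1, 2}
pi_probs = np.array([0.2, 0.5, 0.3])    # made-up pi(a | s)
v_pi = pi_probs @ q_pi                  # expectation over actions -> 1.35

q_star = np.array([1.0, 2.0, 0.5])      # made-up Q*(s, a)
v_star = q_star.max()                   # max over actions -> 2.0

print(v_pi, v_star)
```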

Optimal Q-Function and Action

The optimal action-value function $Q^*(s, a)$ gives the expected return for starting in $s$, taking an arbitrary action $a$, and then acting according to the optimal policy forever after.

Thus, if we have $Q^*$, we can recover the optimal action directly: \[a^*(s)=\arg \max_a Q^*(s, a)\]
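A sketch of greedy action selection from a tabular $Q^*$; the table here is an assumed toy dictionary with one value per action:

```python
# a*(s) = argmax_a Q*(s, a), using an assumed tabular Q*.
import numpy as np

q_star_table = {
    "s0": np.array([0.1, 0.7, 0.2]),
    "s1": np.array([0.9, 0.3, 0.4]),
}

def optimal_action(s):
    return int(np.argmax(q_star_table[s]))

print(optimal_action("s0"))  # 1
print(optimal_action("s1"))  # 0
```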

Bellman Equation

All four value functions obey self-consistency relations called Bellman equations:

The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.

For the on-policy value functions: \[\begin{aligned} V^\pi(s) & =\underset{\substack{a \sim \pi \\ s^{\prime} \sim P}}{\mathrm{E}}\left[r(s, a)+\gamma V^\pi\left(s^{\prime}\right)\right], \\ Q^\pi(s, a) & =\underset{s^{\prime} \sim P}{\mathrm{E}}\left[r(s, a)+\gamma \underset{a^{\prime} \sim \pi}{\mathrm{E}}\left[Q^\pi\left(s^{\prime}, a^{\prime}\right)\right]\right], \end{aligned}\]

where $s^{\prime} \sim P$ is shorthand for $s^{\prime} \sim P(\cdot \mid s, a)$, indicating that the next state $s^{\prime}$ is sampled from the environment’s transition rules; $a \sim \pi$ is shorthand for $a \sim \pi(\cdot \mid s)$; and $a^{\prime} \sim \pi$ is shorthand for $a^{\prime} \sim \pi\left(\cdot \mid s^{\prime}\right)$.

Then, for the optimal value functions: \[\begin{aligned} V^*(s) & =\max _a \underset{s^{\prime} \sim P}{\mathrm{E}}\left[r(s, a)+\gamma V^*\left(s^{\prime}\right)\right], \\ Q^*(s, a) & =\underset{s^{\prime} \sim P}{\mathrm{E}}\left[r(s, a)+\gamma \max _{a^{\prime}} Q^*\left(s^{\prime}, a^{\prime}\right)\right] . \end{aligned}\]
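As a sketch of how the optimal Bellman equation can be turned into an update rule, here is value iteration on a made-up two-state, two-action MDP (the numbers are arbitrary, chosen only for illustration):

```python
# Value iteration: repeatedly apply V(s) <- max_a E_{s'~P}[r(s, a) + gamma V(s')].
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] transition probabilities
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.0, 1.0]
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 1.0],                       # R[s, a] expected reward
              [0.5, 2.0]])

V = np.zeros(n_states)
for _ in range(200):
    V = np.max(R + gamma * (P @ V), axis=1)     # one Bellman backup per state

print(V)   # converges to about [19., 20.] for this toy MDP
```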

Advantage Functions

We often want to know how much better an action is than the others on average; this is captured by the advantage function: \[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).\]

It describes how much better it is to take action $a$ in state $s$ than to select an action at random from $\pi(\cdot \mid s)$, assuming you act according to $\pi$ afterwards.
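A tiny numeric sketch: the advantage is $Q^\pi$ minus its average under the policy, so it is positive for better-than-average actions and negative otherwise (numbers are made up):

```python
# A^pi(s, a) = Q^pi(s, a) - V^pi(s), with V^pi(s) = E_{a~pi}[Q^pi(s, a)].
import numpy as np

q_pi = np.array([1.0, 2.0, 0.5])       # made-up Q^pi(s, a)
pi_probs = np.array([0.2, 0.5, 0.3])   # made-up pi(a | s)

v_pi = pi_probs @ q_pi                 # 1.35
advantage = q_pi - v_pi                # [-0.35, 0.65, -0.85]
print(advantage)
```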

Formalism

Markov Decision Processes (MDPs). An MDP is a 5-tuple, $\left\langle S, A, R, P, \rho_0\right\rangle$, where

  • $S$ is the set of all valid states,
  • $A$ is the set of all valid actions,
  • $R: S \times A \times S \rightarrow \mathbb{R}$ is the reward function, with $r_t=R\left(s_t, a_t, s_{t+1}\right)$,
  • $P: S \times A \rightarrow \mathcal{P}(S)$ is the transition probability function, with $P\left(s^{\prime} \mid s, a\right)$ being the probability of transitioning into state $s^{\prime}$ if you start in state $s$ and take action $a$,
  • and $\rho_0$ is the starting state distribution.
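One way to carry this 5-tuple around in code is a small container with tabular components; the names below are illustrative assumptions, not a standard API:

```python
# A tabular MDP as a 5-tuple <S, A, R, P, rho_0>.
from typing import NamedTuple
import numpy as np

class MDP(NamedTuple):
    n_states: int        # S = {0, ..., n_states - 1}
    n_actions: int       # A = {0, ..., n_actions - 1}
    R: np.ndarray        # R[s, a, s'], the reward function
    P: np.ndarray        # P[s, a, s'], transition probabilities
    rho0: np.ndarray     # rho0[s], the start-state distribution

mdp = MDP(
    n_states=2,
    n_actions=2,
    R=np.zeros((2, 2, 2)),
    P=np.full((2, 2, 2), 0.5),      # uniform transitions, purely for illustration
    rho0=np.array([1.0, 0.0]),
)
```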

Understanding

\(V^\pi(s)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s\right]\)

  • $\pi \overset{\text{sample}}{\longrightarrow} \tau$: trajectories are obtained by rolling the policy forward.
\[Q^\pi(s, a)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s, a_0=a\right]\]
  • $Q^\pi(s_t,a)$ fixes the first action $a$ and then follows $\pi$, so it reduces to the value of the next state: \[Q^\pi(s_t,a)=\underset{s_{t+1} \sim P}{\mathrm{E}}\left[r(s_t,a)+\gamma V^\pi(s_{t+1})\right]\]

\(V^\pi(s)=\underset{a \sim \pi}{\mathrm{E}}\left[Q^\pi(s, a)\right]\)

  • $s_t \overset{a}{\longrightarrow} s_{t+1}$
  • $V^\pi(s_t) \overset{a}{\rightarrow} Q^\pi(s_t, a)$

\(V^\pi(s) =\underset{\substack{a \sim \pi \\ s^{\prime} \sim P}}{\mathrm{E}}\left[r(s, a)+\gamma V^\pi\left(s^{\prime}\right)\right],\)

  • Backtracking + Recursion
  • $s_t \rightarrow s_{t+1}$
  • $s_{t}\overset{?}{\leftarrow} s_{t+1}$
\[Q^\pi(s, a) =\underset{s^{\prime} \sim P}{\mathrm{E}}\left[r(s, a)+\gamma \underset{a^{\prime} \sim \pi}{\mathrm{E}}\left[Q^\pi\left(s^{\prime}, a^{\prime}\right)\right]\right],\] \[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]
  • $V^\pi(s)=\underset{a \sim \pi}{\mathrm{E}}\left[Q^\pi(s, a)\right]$
  • $Q^\pi-\mathrm{E}(Q^\pi)$

Written by Yiran