A (Long) Peek into Reinforcement Learning: Part2


[A (Long) Peek into Reinforcement Learning | Lil’Log](https://lilianweng.github.io/posts/2018-02-19-rl-overview/)

Common Approaches

Dynamic Programming

Policy Evaluation: $\pi\rightarrow V_\pi$

\(\begin{align} V_{t+1}(s) &=\mathbb{E}_\pi\left[r+\gamma V_t\left(s^{\prime}\right) \mid S_t=s\right] \\ &=\sum_a \pi(a \mid s) \sum_{s^{\prime}, r} P\left(s^{\prime}, r \mid s, a\right)\left(r+\gamma V_t\left(s^{\prime}\right)\right) \end{align}\)

Policy Improvement: $\pi\xrightarrow{\text{greedy}} \pi^\prime$

\(\begin{align} Q_\pi(s, a) &=\mathbb{E}\left[R_{t+1}+\gamma V_\pi\left(S_{t+1}\right) \mid S_t=s, A_t=a\right] \\ &=\sum_{s^{\prime}, r} P\left(s^{\prime}, r \mid s, a\right)\left(r+\gamma V_\pi\left(s^{\prime}\right)\right) \end{align}\)

Policy Iteration

Generalized Policy Iteration (GPI) algorithm:

\(\pi_0 \xrightarrow{\text{evaluation}} V_{\pi_0} \xrightarrow{\text{improve}} \pi_1 \xrightarrow{\text{evaluation}} V_{\pi_1} \xrightarrow{\text{improve}} \pi_2 \xrightarrow{\text{evaluation}} \cdots \xrightarrow{\text{improve}} \pi_* \xrightarrow{\text{evaluation}} V_*\)

Why is $V_{\pi^\prime}$ at least as good as $V_\pi$?

​ For $\pi^{\prime}(s) = \arg\max_{a\in \mathbb{A}}Q_\pi(s, a)$,
\(\begin{align} Q_\pi(s, \pi^{\prime}(s)) &= Q_\pi(s, \arg\max_{a\in \mathbb{A}}Q_\pi(s, a))\\ &= \max_{a\in\mathbb{A}}Q_\pi(s,a) \\ &\geq Q_\pi(s,\pi(s)) = V_\pi(s) \end{align}\)

Acting greedily for one step is therefore at least as good as following $\pi$; by the policy improvement theorem, $Q_\pi(s, \pi^{\prime}(s)) \geq V_\pi(s)$ for all $s$ implies $V_{\pi^\prime}(s) \geq V_\pi(s)$ everywhere.
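A minimal sketch of the full evaluation/improvement loop, assuming a small tabular MDP whose model is given as nested lists `P[s][a] = [(prob, next_state, reward), ...]`; the function name, the model layout, and the tolerance `theta` are illustrative choices, not something from the original post.

```python
import numpy as np

# Minimal tabular policy iteration. The model is assumed to be given as
# P[s][a] = list of (prob, next_state, reward) tuples (an illustrative layout,
# not a specific library API).
def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    policy = np.zeros(n_states, dtype=int)   # deterministic policy pi(s) -> a
    V = np.zeros(n_states)

    def q_value(s, a):
        # Q_pi(s, a) = sum_{s', r} P(s', r | s, a) * (r + gamma * V(s'))
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: sweep V_{t+1}(s) = E_pi[r + gamma * V_t(s')] until convergence.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: pi'(s) = argmax_a Q_pi(s, a).
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q_value(s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:   # pi is greedy w.r.t. its own value function => optimal
            return policy, V
```

When the greedy improvement step no longer changes the policy, the policy is greedy with respect to its own value function and the iteration has reached $\pi_*$.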

Monte-Carlo Methods

  • Learn from raw experience (complete episodes), without modeling the environment.
  • Compute the empirical mean return as an approximation of the expected return.
  • $[S_1, A_1, R_2, \dots, S_T]\rightarrow G_t=\sum_{k=0}^{T-t-1}\gamma^k R_{t+k+1}$

Empirical Mean return: \(V(s) = \frac{\sum_{t=1}^T (\mathbb{I}[S_t = s]G_t)}{\sum_{t=1}^T \mathbb{I}[S_t = s]}\) where $\mathbb{I}[S_t=s]$ is a binary indicator function.
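As a concrete (every-visit) version of this estimator, here is a small sketch. Episodes are assumed to be lists of `(state, reward)` pairs, with `reward` being the $R_{t+1}$ received after leaving that state; the layout and names are illustrative.

```python
from collections import defaultdict

# Every-visit Monte-Carlo prediction: V(s) = average of returns G_t over all
# time steps where S_t = s. Episodes are lists of (state, reward) pairs
# (assumed layout), with reward = R_{t+1}.
def mc_prediction(episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        G = 0.0
        for s, r in reversed(episode):   # accumulate G_t backwards through the episode
            G = gamma * G + r
            returns_sum[s] += G
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```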

e.g. the difference between “first-visit” and “every-visit” estimates (see the sketch after this list):

  • Episode 1: $t=2,5; G_2=10, G_5=6$
  • Episode 2: $t=3, G_3=8$
  • 1st visit: $V(s)=(10+8)/2=9$
  • every visit: $V(s)=(10+8+6)/3=8$
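A quick check of these numbers, taking the returns $G_t$ directly from the example above rather than recomputing them from rewards:

```python
# Reproduce the toy numbers above: (t, G_t) pairs for visits to state s.
episodes = [
    [(2, 10), (5, 6)],   # episode 1: s visited at t=2 (G_2=10) and t=5 (G_5=6)
    [(3, 8)],            # episode 2: s visited at t=3 (G_3=8)
]

# First-visit MC: average only the return of the earliest visit per episode.
first_visit = [min(ep)[1] for ep in episodes]        # min() picks the smallest t
print(sum(first_visit) / len(first_visit))           # (10 + 8) / 2 = 9.0

# Every-visit MC: average the return of every visit to s.
every_visit = [G for ep in episodes for _, G in ep]
print(sum(every_visit) / len(every_visit))           # (10 + 6 + 8) / 3 = 8.0
```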

Similarly, for the action-value $Q(s,a)$:

\(Q(s, a)=\frac{ \sum_{t=1}^T (\mathbb{I}\left[S_t=s, A_t=a\right] G_t) }{ \sum_{t=1}^T \mathbb{I}\left[S_t=s, A_t=a\right]}\)

Iterate to learn the optimal policy (sketched in code after the steps below):


  1. Improvement: $\pi(s) = \arg\max_{a\in \mathbb{A}}Q_\pi(s, a)$
  2. Generate new episode with $\pi$ ($\epsilon$-greedy: balance exploitation and exploration)
  3. Evaluation: $Q_\pi(s, a)=\frac{\sum_{t=1}^T\left(\mathbb{I}\left[S_t=s, A_t=a\right] \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}\right)}{\sum_{t=1}^T \mathbb{I}\left[S_t=s, A_t=a\right]}$
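A sketch of this loop as on-policy first-visit Monte-Carlo control with an $\epsilon$-greedy policy. The environment interface (`reset()`, `step()`) and all names here are assumptions for illustration, not an actual library API.

```python
import random
from collections import defaultdict

# One GPI loop for on-policy Monte-Carlo control with an epsilon-greedy policy.
# `env` is a hypothetical environment with reset() -> s and step(a) -> (s2, r, done).
def mc_control(env, n_actions, n_episodes=10_000, gamma=1.0, eps=0.1):
    Q = defaultdict(float)                   # Q[(s, a)] estimates
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)

    def eps_greedy(s):
        if random.random() < eps:            # explore
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])   # exploit

    for _ in range(n_episodes):
        # 2. Generate an episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = eps_greedy(s)
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2

        # 3. Evaluation: first-visit averaging of discounted returns.
        first_idx = {}
        for t, (s, a, _) in enumerate(episode):
            first_idx.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_idx[(s, a)] == t:       # only the first visit contributes
                returns_sum[(s, a)] += G
                returns_cnt[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
        # 1. Improvement is implicit: eps_greedy always acts on the latest Q.
    return Q
```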

Written by Yiran