
Reinforcement Learning Notes

Reinforcement Learning Framework

import gym
import numpy as np
import matplotlib.pyplot as plt

How do we learn to make good sequences of decisions under uncertainty?

  • Agent and environment interact in discrete time steps: \(t = 0, 1, 2, ...\)
  • Initial state of the environment is given and may be random: \(s_0 \sim \mathbb{P}(s_0)\), \(s_0 \in \mathcal{S}\)
  • Agent observes the environment and determines the state of the environment: \(o_t = s_t \in \mathcal{S}\) (here we assume that environment is fully observable)
  • Agent makes a decision on which action to choose in the current state. The agent's decision is based on a policy \(\pi\) that is either deterministic (\(\pi: \mathcal{S} \rightarrow \mathcal{A}\)) or stochastic (\(\pi(a_t|s_t),\ a_t \in \mathcal{A}\))
  • The state of the environment changes based on the agent's action and the environment's dynamics: \(s_{t+1} \sim P(s_{t+1}|s_t, a_t)\), \(s_{t+1} \in \mathcal{S}\)
  • Agent receives a reward: \(r_{t+1} = R(s_t, a_t, s_{t+1}) \in \mathbb{R}\) (rewards may be delayed!)

Markov Decision Process

Two assumptions have been made in this framework that make it a Markov Decision Process (MDP):

  1. State transition probabilities satisfy the Markov property:
\[\mathbb{P}(s_{t+1}| a_t, s_t, ..., a_0, s_0) = \mathbb{P}(s_{t+1}| a_t, s_t)\]

  2. The reward function depends only on the most recent transition:
\[R(s_0, a_0, ..., s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1})\]

A Markov Decision Process is defined by a tuple \((\mathcal{S}, \mathcal{A}, P, R, s_0)\).

Main Objective

The main goal in reinforcement learning is to find a policy that maximizes the expected sum of rewards:

\[\pi^* = argmax_\pi \mathbb{E}\left[\sum_{t=0}^\infty R(s_t, a_t, s_{t+1}) \bigg| s_0, \pi\right]\]

Note that the expectation is over the joint distribution of states and actions, which by the Markov property factorizes as:

\[\prod_{t = 0}^\infty \pi(a_t|s_t)P(s_{t+1}| s_t, a_t)\]

For a deterministic policy this reduces to:

\[\prod_{t = 0}^\infty P(s_{t+1}| s_t, \pi(s_t))\]

One problem with this objective is that in the infinite horizon setting (where t goes to infinity) the expectation may not converge, and we can have multiple policies with an infinite expected sum of rewards. In that case comparing different policies is not straightforward, which makes optimization hard. For example, consider an MDP with two actions, 0 and 1, where each reward equals the action taken. Consider two policies: one always takes action 1, the other alternates 0, 1, 0, 1, ... Both have an infinite expected sum of rewards, but one may argue that the first policy is better.

To alleviate this problem a discount factor \(0< \gamma < 1\) can be added to make the summation convergent.

\[\pi^* = argmax_\pi \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t, s_{t+1}) \bigg| s_0, \pi\right]\]

The discount factor also reduces the variance of the expectation estimate, so it helps in finite horizon settings as well.
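To see how discounting resolves the two-policy example above, compare the discounted returns directly:

\[\sum_{t=0}^\infty \gamma^t \cdot 1 = \frac{1}{1-\gamma} \qquad \text{vs.} \qquad \sum_{k=0}^\infty \gamma^{2k+1} \cdot 1 = \frac{\gamma}{1-\gamma^2}\]

Since \(\frac{\gamma}{1-\gamma^2} = \frac{\gamma}{(1-\gamma)(1+\gamma)} < \frac{1}{1-\gamma}\) for \(0 < \gamma < 1\), the always-1 policy is now strictly better and the two policies become comparable.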

Overview of RL Algorithms

Dynamic Programming Methods

Value Iteration

Value Function and Bellman Equation

Given a state \(s\) and a policy \(\pi\), the value of policy \(\pi\) at state \(s\) is the expected sum of (discounted) rewards starting from \(s\) and acting according to \(\pi\).

\[\begin{align*} V^\pi(s) &= \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t, s_{t+1}) \bigg| s_0=s, \pi\right] \qquad \text{(Value function)}\\ &= \mathbb{E}\left[R(s_0, a_0, s_1)+\sum_{t=1}^\infty \gamma^t R(s_t, a_t, s_{t+1}) \bigg| s_0=s, \pi\right] \\ &= \mathbb{E}\left[R(s_0, a_0, s_1)+\gamma V^\pi(s_1) \bigg| s_0=s, \pi\right] \\ & = \sum_{a\in\mathcal{A}} \pi(a|s) \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma V^\pi(s')\right] \qquad \text{(Bellman equation)} \end{align*}\]

For deterministic policies the Bellman equation reduces to:

\[V^\pi(s) = \sum_{s'\in \mathcal{S}} P(s'|s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V^\pi(s')\right]\]
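As a side note, the Bellman equation itself can be turned into an algorithm for evaluating a fixed policy by iterating it as an update rule (iterative policy evaluation). A minimal sketch, assuming the transition model is stored as P[s][a] = list of (prob, next_state, reward, done) tuples, the same layout gym's FrozenLake exposes later in this post:

import numpy as np

def policy_evaluation(P, nS, pi, gamma=0.99, threshold=1e-9):
    # Iterate V(s) <- sum_s' P(s'|s, pi(s)) [R(s, pi(s), s') + gamma V(s')]
    # for a deterministic policy pi given as an array of action indices.
    V = np.zeros(nS)
    while True:
        U = np.zeros(nS)
        for s in range(nS):
            U[s] = np.sum([p * (r + gamma * V[s_]) for p, s_, r, _ in P[s][pi[s]]])
        if np.max(np.abs(U - V)) < threshold:
            return U
        V = U

For example, policy_evaluation(env.P, 16, pi) evaluates a FrozenLake policy found by one of the algorithms below.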

Optimal Value Function and Optimal Policy

The optimal value function gives the maximum expected sum of discounted rewards achievable from each state \(s\) under optimal play.

\[V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s) \qquad \forall s\in \mathcal{S}\]

Considering the partial ordering \(\pi_1 \geq \pi_2 \iff V^{\pi_1}(s) \geq V^{\pi_2}(s) \quad \forall s\in\mathcal{S}\), it can be proven that:

  • There exists an optimal policy \(\pi^*\) that is at least as good as all other policies: \(\pi^* \geq \pi \quad \forall \pi\)
  • There can be many optimal policies, but all optimal policies achieve the optimal value function.
  • There is always an optimal deterministic policy for any MDP.
  • Optimal value function \(V^*(s)\) satisfies the Bellman equation (Bellman Optimality Equation):
\[V^*(s) = \max_{a \in \mathcal{A}} \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma V^*(s')\right] \quad \forall s \in \mathcal{S}\]
  • Any algorithm that solves this non-linear equation can be used to find the optimal value function.

  • The optimal policy can be recovered from the optimal value function:
\[\pi^*(s) = argmax_{a \in \mathcal{A}} \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma V^*(s')\right] \quad \forall s \in \mathcal{S}\]

Value Iteration Algorithm


  1. \[V_0(s) := 0 \qquad \forall s \in \mathcal{S}\]
  2. for \(k=0, 1, ...\) until convergence: \(V_{k+1}(s) := \max_{a \in \mathcal{A}} \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma V_k(s')\right] \quad \forall s \in \mathcal{S}\)

* On convergence \(V_{k+1}(s) \approx V_{k}(s) \quad \forall s \in \mathcal{S}\).

* Value iteration converges. At convergence, we have found the optimal value function \(V^*\), which satisfies the Bellman optimality equation. In fact we have: \(\|V_{i+1} - V_i\|_\infty < \epsilon \Rightarrow \|V_{i+1} - V^*\|_\infty < 2\epsilon\frac{\gamma}{1-\gamma}\)
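For example, with \(\gamma = 0.99\) and a stopping threshold of \(\epsilon = 10^{-4}\), this bound only guarantees
\[\|V_{i+1} - V^*\|_\infty < 2 \cdot 10^{-4} \cdot \frac{0.99}{0.01} \approx 0.02,\]
so the threshold has to be chosen tight when \(\gamma\) is close to 1 (hence the \(10^{-9}\) used in the code below).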

Problems

  • We may not have the transition probabilities \(P(s'|s, a)\), so we have to solve a density estimation problem to find them. Even with the transition probabilities or a good model of them, the summation (or integral in the case of continuous states) may be intractable.
  • We cannot approximate the expectation with Monte Carlo methods because of the max operator.

Frozen Lake

FrozenLake is a simple grid-world problem with 16 states and 4 actions. The transition probabilities are known and the number of states is small, so we can easily solve it with the value iteration algorithm.

The goal of this game is to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H). However, the ice is slippery, so you won’t always move in the direction you intend (stochastic environment).

Here I explain the gym environment as I implement the algorithm. For an introduction to the gym library I suggest reading the documentation page.

# make FrozenLake environment
env = gym.make('FrozenLake-v0')

# initializes the environment and returns initial state (in FrozenLake initial state is always 0)
s = env.reset()

# FrozenLake has a discrete observation space that takes 16 values [0-15] corresponding to current position
print('observation space:', env.observation_space)

# FrozenLake has a discrete action space that takes 4 values [0-3] corresponding to direction of movement
# 0: Left
# 1: Down
# 2: Right
# 3: Up
print('action space:', env.action_space)

# transition probabilities
# env.P[s][a] = [(p(s'|s, a), s', r, done) for each s' if p(s'|s, a) != 0] :))
# done indicates end of an episode
print('P[0][2]:', env.P[0][2])

# visualizes the environment
env.render()
    observation space: Discrete(16)
    action space: Discrete(4)
    P[0][2]: [(0.3333333333333333, 4, 0.0, False), (0.3333333333333333, 1, 0.0, False), (0.3333333333333333, 0, 0.0, False)]
    
    SFFF
    FHFH
    FFFH
    HFFG
nS = env.observation_space.n        # number of states
nA = env.action_space.n             # number of actions
P = env.P

gamma = .99                         # discount factor
threshold = 1e-9                    # convergence threshold

# Value Iteration Algorithm
V = np.zeros(nS)
while True:
    U = np.zeros(nS)
    for s in range(nS):
        U[s] = np.max([np.sum([t[0] * (t[2] + gamma * V[t[1]]) for t in P[s][a]]) for a in range(nA)])
    
    if np.linalg.norm(V - U) < threshold:
        V = U
        break
    V = U

print('Optimal value function:\n', V)

# extracting policy from value function
pi = np.zeros(nS, dtype=int)        # one integer action per state
for s in range(nS):
    pi[s] = np.argmax([np.sum([t[0] * (t[2] + gamma * V[t[1]]) for t in P[s][a]]) for a in range(nA)])
print('Optimal policy:\n', pi)


# average sum of rewards for 100 episodes
rewards = []
for i in range(100):
    s = env.reset()
    done = False
    reward = 0.
    while not done:
        s, r, done, _ = env.step(pi[s])
        reward += r
    rewards.append(reward)
print('average reward:', np.mean(rewards))
    Optimal value function:
     [0.54202592 0.49880318 0.47069568 0.45685169 0.55845095 0.
     0.35834807 0.         0.59179874 0.64307982 0.61520755 0.
     0.         0.74172044 0.86283743 0.        ]
    Optimal policy:
     [0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]
    average reward: 0.79

Q-Value Iteration (Tabular Q-Learning)

Q-Value Function and Bellman Equation

Given a state \(s\), an action \(a\) and a policy \(\pi\), the q-value of policy \(\pi\) at state \(s\) with initial action \(a\) is the expected sum of (discounted) rewards starting from \(s\), taking action \(a\), and acting according to \(\pi\) afterwards.

\[\begin{align*} Q^\pi(s, a) &= \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t, s_{t+1}) \bigg| s_0=s, a_0=a, \pi\right] \qquad \text{(Q-Value function)}\\ & = \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma \sum_{a' \in \mathcal{A} }\pi(a'|s')Q^\pi(s', a')\right] \qquad \text{(Bellman equation)} \end{align*}\]

For deterministic policies the Bellman equation reduces to:

\[Q^\pi(s, a) = \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma Q^\pi(s', \pi(s'))\right]\]

Optimal Q-Value Function and Optimal Policy

\[Q^*(s, a) = Q^{\pi^*}(s, a) = \max_\pi Q^\pi(s, a) \qquad \forall (s, a)\in \mathcal{S}\times\mathcal{A}\]
  • All optimal policies achieve the optimal q-value function.
  • The optimal q-value function \(Q^*(s, a)\) satisfies the Bellman equation (Bellman Optimality Equation):
\[Q^*(s, a) = \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a' \in \mathcal{A}}Q^*(s', a')\right]\]
  • The optimal policy can be found from the optimal q-value function: \(\pi^*(s) = argmax_{a \in \mathcal{A}} Q^*(s, a)\)

Q-Value Iteration Algorithm (Known P)


  1. \[Q_0(s, a) := 0 \qquad \forall (s, a) \in \mathcal{S}\times\mathcal{A}\]
  2. for \(k=0, 1, ...\) until convergence: \(Q_{k+1}(s, a) := \sum_{s'\in \mathcal{S}} P(s'|s, a) \left[ R(s, a, s') + \gamma \max_{a' \in \mathcal{A}}Q_k(s', a')\right]\)

* On convergence \(Q_{k+1}(s, a) \approx Q_{k}(s, a) \quad \forall (s, a) \in \mathcal{S}\times \mathcal{A}\).

Frozen Lake

gamma = .99                         # discount factor
threshold = 1e-9                    # convergence threshold

# Q-Value Iteration Algorithm
Q = np.zeros((nS, nA))
while True:
    U = np.zeros((nS, nA))
    for s in range(nS):
        for a in range(nA):
            U[s, a] = np.sum([t[0] * (t[2] + gamma * np.max(Q[t[1]])) for t in P[s][a]])
    
    if np.linalg.norm(Q - U) < threshold:
        Q = U
        break
    Q = U

print('Optimal q-value function:\n', Q)

# extracting policy from q-value function
pi = np.argmax(Q, axis=1)

print('Optimal policy:\n', pi)


# average sum of rewards for 100 episodes
rewards = []
for i in range(100):
    s = env.reset()
    done = False
    reward = 0.
    while not done:
        s, r, done, _ = env.step(pi[s])
        reward += r
    rewards.append(reward)
print('average reward:', np.mean(rewards))
    Optimal q-value function:
     [[0.54202593 0.52776242 0.52776242 0.52234216]
     [0.34347361 0.33419813 0.31993462 0.49880318]
     [0.43818949 0.43362097 0.4243455  0.47069568]
     [0.30609063 0.30609063 0.30152212 0.45685169]
     [0.55845096 0.3795824  0.37416214 0.36315737]
     [0.         0.         0.         0.        ]
     [0.35834807 0.20301849 0.35834807 0.15532958]
     [0.         0.         0.         0.        ]
     [0.3795824  0.40750993 0.39650516 0.59179874]
     [0.44006133 0.64307982 0.44778624 0.39831208]
     [0.61520756 0.49695269 0.40299121 0.3304712 ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.45698409 0.5295041  0.74172044 0.49695269]
     [0.73252259 0.86283743 0.82108818 0.78111957]
     [0.         0.         0.         0.        ]]
    Optimal policy:
     [0 3 3 3 0 0 0 0 3 1 0 0 0 2 1 0]
    average reward: 0.77

Q-Value Iteration Algorithm (Unknown P)

When the transition probabilities are unknown, we can estimate the expectation with a one-sample Monte Carlo estimate: choose an action, observe the next state of the environment, and change the algorithm to:


  1. \[Q(s, a) := 0 \qquad \forall (s, a) \in \mathcal{S}\times\mathcal{A}\]
  2. for episode \(i = 1, 2, ..., N\):
    • for \(t = 0, 1, ...\):
      • choose an action \(a_t\) (\(\epsilon-greedy\))
      • observe next state \(s_{t+1}\)
      • if \(s_{t+1}\) is terminal:
        • \[Q(s_t, a_t) = R(s_t, a_t, s_{t+1})\]
      • else:
        • \[Q(s_t, a_t) = R(s_t, a_t, s_{t+1}) + \gamma \max_{a' \in \mathcal{A}}Q(s_{t+1}, a')\]

* The one-sample approximation of the expectation has very high variance. To reduce this variance we can fold new estimates into a running average of the Q-values: \(Q(s_t, a_t) = (1-\alpha_t) Q(s_t,a_t) + \alpha_t \left[R(s_t, a_t, s_{t+1}) + \gamma \max_{a' \in \mathcal{A}}Q(s_{t+1}, a')\right]\) The learning rate \(\alpha\) should satisfy the conditions \(\sum_k \alpha_k^2 < \infty, \quad \sum_k \alpha_k = \infty\) (e.g. \(\alpha_k = 1/k\)). This method is also called Temporal Difference Learning. We will motivate this more later.

* \(\epsilon-greedy\) strategy: Choose a random action with probability \(\epsilon\), otherwise choose the action greedily. Q-learning converges to the optimal policy even if you’re acting suboptimally (off-policy learning).

Frozen Lake

Here we do not use transition probabilities.

epsilon = .2
alpha = .5
gamma = .99
Q = np.zeros((nS, nA))

for i in range(20000):
    s = env.reset()
    done = False
    while not done:
        
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s])
        
        s_, r, done, _ = env.step(a)

        if done:
            target = r
        else:
            target = r + gamma * np.max(Q[s_])
        
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_

# extracting policy from value function
pi = np.argmax(Q, axis=1)
print('Optimal policy:\n', pi)

# average sum of rewards for 100 episodes
rewards = []
for i in range(100):
    s = env.reset()
    done = False
    reward = 0.
    while not done:
        s, r, done, _ = env.step(pi[s])
        reward += r
    rewards.append(reward)
print('average reward:', np.mean(rewards))
    Optimal policy:
     [0 2 3 3 0 0 2 0 3 1 0 0 0 2 1 0]
    average reward: 0.63

Cart Pole

In the CartPole problem we do not know the transition probabilities and the observation space is continuous, so we have to estimate the expectation and discretize the state.

env = gym.make('CartPole-v0')
epsilon = .2
alpha = .5
gamma = .99
Q = np.zeros((2, 2, 2, 2, 2))

for i in range(20000):
    s = (env.reset() > 0).astype(int)   # simplest case of discretization
    done = False
    while not done:
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s[0], s[1], s[2], s[3]])
        
        s_, r, done, _ = env.step(a)
        s_ = (s_ > 0).astype(int)

        if done:
            target = r
        else:
            target = r + gamma * np.max(Q[s_[0], s_[1], s_[2], s_[3]])
        
        Q[s[0], s[1], s[2], s[3], a] = (1 - alpha) * Q[s[0], s[1], s[2], s[3], a] + alpha * target
        s = s_

pi = np.argmax(Q, axis=4)
rewards = []
for i in range(100):
    s = (env.reset() > 0).astype(int)
    done = False
    reward = 0.
    while not done:
        s, r, done, _ = env.step(pi[s[0], s[1], s[2], s[3]])
        s = (s > 0).astype(int)
        reward += r
    rewards.append(reward)
print('average reward:', np.mean(rewards))
    average reward: 42.72

Problems

  • Needs a lot of episodes and exploration for a good approximation.
  • If we do not see a state-action pair during training, we don’t have any idea about it (we do not generalize).
  • We need a table of size \(|\mathcal{S}|\times|\mathcal{A}|\) to store q-values. This does not scale to high dimensional continuous state or action spaces.

Approximate Q-Learning

In realistic situations, we cannot possibly learn about every single state:

  • Too many states to visit them all in training
  • Too many states to hold the q-tables in memory

Instead, we want to generalize:

  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar situations

What if instead of keeping a table of all q-values, we approximate q-value function with a function approximator \(Q_\theta: \mathcal{S}\times\mathcal{A} \rightarrow\mathbb{R}\)? In this way we can generalize between similar state action values and we do not have to keep a large q-value table. Now the problem is how to find the parameters?

Recall that the optimal Q-Value function satisfies the Bellman equation:

\[\begin{align*} Q^*(s, a) &= \sum_{s' \in \mathcal{S}} P(s' | s, a) \left[R(s, a, s') + \gamma \max_{a' \in \mathcal{A}}Q^*(s', a')\right]\\ &\approx R(s, a, s') + \gamma \max_{a' \in \mathcal{A}}Q^*(s', a') \qquad s'\sim P(s'|s, a) \end{align*}\]

We want our parametric function \(Q_\theta\) to be as close as possible to \(Q^*\). Concretely, we find parameters such that \(Q_\theta(s, a)\) is close to \(target=R(s, a, s') + \gamma \max_{a' \in \mathcal{A}}Q^*(s', a')\) for \(s'\sim P(s'|s, a)\). Computing the target requires \(\max_{a' \in \mathcal{A}}Q^*(s', a')\), but we do not have \(Q^*\), so we plug in our best approximation of it (i.e. \(Q_\theta\)). We can then minimize a squared error between \(Q_\theta\) and the target values.


  1. Initialize \(Q_\theta(s, a)\)
  2. for episode \(i = 1, 2, ..., N\):
    • observe state \(s_0\)
    • for \(t = 0, 1, ...\):
      • choose an action \(a_t\) (\(\epsilon-greedy\))
      • observe next state \(s_{t+1}\)
      • if \(s_{t+1}\) is terminal:
        • \[target = R(s_t, a_t, s_{t+1})\]
      • else:
        • \[target = R(s_t, a_t, s_{t+1}) + \gamma \max_{a' \in \mathcal{A}} Q_\theta(s_{t+1}, a')\]
      • \[\theta = \theta - \alpha \nabla_\theta \frac{1}{2}\left[Q_\theta(s_t, a_t) - target\right]^2\]
      • \[s_t = s_{t+1}\]

Cartpole with Polynomial Function Approximator

env = gym.make('CartPole-v0')
nA = env.action_space.n

epsilon = .2
alpha = .01
gamma = .99

theta = np.random.randn(2, 9)

# features
def f(s):
    return np.concatenate([np.ones(1), s, s**2])

# parametrized Q-function
def Q(theta, s):
    return np.dot(theta, f(s))

# policy
def pi(theta, s):
    return np.argmax(Q(theta, s))

for i in range(20000):
    s = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = pi(theta, s)
        
        s_, r, done, _ = env.step(a)

        if done:
            target = r
        else:
            target = r + gamma * np.max(Q(theta, s_))
        
        theta[a] -= alpha * (Q(theta, s)[a] - target) * f(s)   # gradient step only on the row for action a
        s = s_

rewards = []
for i in range(100):
    s = env.reset()
    done = False
    reward = 0.
    while not done:
        s, r, done, _ = env.step(pi(theta, s))
        reward += r
    rewards.append(reward)
print('average reward:', np.mean(rewards))
    average reward: 22.66

Tabular Q-Learning as a special case of Approximate Q-Learning

Assume our function approximator is in the form:

\(Q_\theta(s, a) = \theta_{s, a} \qquad \theta \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}\) Therefore we have:

\[\begin{align*} \theta_{s, a} &= \theta_{s, a} - \alpha \nabla_{\theta_{s, a}} \frac{1}{2}\left[\theta_{s, a} - target\right]^2\\ & = \theta_{s, a} - \alpha \left(\theta_{s, a} - target\right)\\ & = (1- \alpha)\theta_{s, a} + \alpha ~ (target) \end{align*}\]

DQN

The main idea is to use a neural network as the function approximator in Approximate Q-Learning. Two main ideas have been used to stabilize Q-learning:

  1. Apply Q-updates on batches of past experience instead of online (use Experience Replay): instead of using a one-sample Monte Carlo estimate of the expectation (online updates), store transitions \((s, a, s', r)\) in a buffer called the experience replay and update the parameters on a batch of transitions to reduce variance. This also makes the data distribution more stationary, since the samples in a batch are far less correlated than consecutive online updates.
  2. In supervised learning the targets are constant (they do not depend on the parameters \(\theta\)), but in Q-Learning the computed targets depend on \(\theta\) and change every time \(\theta\) changes. To prevent the targets from changing too fast we can use a target network \(Q_{\theta^-}\) whose parameters are updated less often.

DQN Algorithm


  1. Initialize \(Q_\theta(s, a)\)
  2. \[Q_{\theta^-} = Q_{\theta}\]
  3. for episode \(i = 1, 2, ..., N\):
    • observe state \(s_0\)
    • for \(t = 0, 1, ...\):
      • choose an action \(a_t\) (\(\epsilon-greedy\))
      • observe next state \(s_{t+1}\)
      • add \((s_t, a_t, s_{t+1}, r_{t+1})\) to experience replay
      • take a batch \(\left[(s_i, a_i, s'_i, r_i)\right]_i\) from experience replay
      • if \(s'_i\) is terminal:
        • \[target_i = r_i\]
      • else:
        • \[target_i = r_i + \gamma \max_{a' \in \mathcal{A}} Q_{\theta^-}(s'_i, a')\]
      • \[\theta = \theta - \alpha \nabla_\theta \frac{1}{2} \sum_i \left[Q_\theta(s_i, a_i) - target_i\right]^2\]
      • \[s_t = s_{t+1}\]
    • \[Q_{\theta^-} = Q_\theta\]
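To make the two stabilization ideas concrete, here is a minimal sketch in the spirit of the CartPole examples above. It keeps the polynomial features and a linear Q-function, so it is not a real DQN (which would use a neural network and an autodiff library); the point is the experience replay buffer and the periodically synced target parameters. The hyperparameters (buffer size, batch size, sync period) are arbitrary choices for illustration.

import random
from collections import deque

import gym
import numpy as np

env = gym.make('CartPole-v0')
nA = env.action_space.n

epsilon, alpha, gamma = .2, .01, .99
batch_size, sync_every = 32, 500

def f(s):                                    # polynomial features, as above
    return np.concatenate([np.ones(1), s, s**2])

theta = np.random.randn(nA, 9) * 0.01        # online parameters
theta_target = theta.copy()                  # target-network parameters

buffer = deque(maxlen=10000)                 # experience replay
step = 0

for _ in range(2000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action from the online parameters
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(theta @ f(s))
        s_, r, done, _ = env.step(a)
        buffer.append((s, a, r, s_, done))
        s = s_
        step += 1

        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            grad = np.zeros_like(theta)
            for (bs, ba, br, bs_, bdone) in batch:
                # targets are computed with the frozen target parameters
                target = br if bdone else br + gamma * np.max(theta_target @ f(bs_))
                grad[ba] += (theta[ba] @ f(bs) - target) * f(bs)
            theta -= alpha * grad / batch_size

        if step % sync_every == 0:           # periodically sync the target network
            theta_target = theta.copy()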

Policy Optimization

The approach we have taken so far to solve the RL problem is to first find the Value or Q-Value function using some variant of the Bellman equation, then extract the policy corresponding to the learned Value or Q-Value function. Finding the optimal Q-value for each state-action pair in a complex environment is a very hard problem. What if we directly learn a parametrized policy \(\pi_\theta\)? We don’t need to know \(Q(s, a)\) exactly for every state-action pair to determine the correct action; we only need to know which action maximizes \(Q(s, a)\) for every state, and the exact value is irrelevant. In this sense the policy function is simpler than the Q-Value function, and learning the policy directly could be easier.

Policy Gradient

Recall the main objective stated earlier, let’s rewrite that for a parametrized policy \(\pi_\theta\):

\(\theta^* = argmax_\theta \mathbb{E}\left[\sum_{t=0}^\infty R(s_t, a_t, s_{t+1}) \bigg| s_0, \pi_\theta\right]\) where the expectation is over: \(\prod_{t = 0}^\infty \pi_\theta(a_t|s_t)P(s_{t+1}| s_t, a_t)\)

Let’s try to optimize this objective. Exact computation of this expectation is intractable, and impossible when we don’t have the environment dynamics \(P\). We can approximate it with a Monte Carlo estimate:

\[\theta^* = argmax_\theta \frac{1}{N}\sum_{i=1}^N\sum_{t=0}^\infty R(s^i_t, a^i_t, s^i_{t+1})\]
where \(a^i_t \sim \pi_\theta(a_t|s^i_t)\), \(s^i_{t+1}\sim P(s_{t+1}|s^i_t, a^i_t)\) and \(s^i_0\) is given for all \(i\).

We can’t optimize this sampled objective directly: once the trajectories are sampled it no longer depends on \(\theta\), so its gradient is zero and gradient descent cannot work.

Reparametrization Trick (Pathwise Derivatives)

One way to solve this problem is to use the reparametrization trick. In general, assume we want to optimize \(\theta\) in an objective of the form \(\mathbb{E}_{p_\theta(x)}[f(x)]\). In the reparametrization trick we find a differentiable (w.r.t. the parameters) and invertible transformation \(g\) and a random variable \(\epsilon\) with distribution \(p(\epsilon)\) independent of \(\theta\) such that \(x = g(\epsilon; \theta)\). Then we have:

\[\mathbb{E}_{p_\theta(x)}[f(x)] = \mathbb{E}_{p(\epsilon)}[f(g(\epsilon; \theta))] \approx \frac{1}{N}\sum_{i=1}^N f(g(\epsilon^i; \theta))\]

Let’s assume we have such \(g\) and \(\epsilon\) and apply this trick to our objective:

\[\theta^* = argmax_\theta \frac{1}{N}\sum_{i=1}^N\sum_{t=0}^\infty R(s^i_t, g(\epsilon^i_t;s^i_t, \theta), s^i_{t+1})\]

where \(\epsilon^i_t \sim p(\epsilon)\), \(s^i_{t+1}\sim P(s_{t+1} | s^i_t, g(\epsilon^i_t;s^i_t, \theta))\) and \(s^i_0\) is given for all i. This objective does depend on \(\theta\). We can optimize this if R is differentiable w.r.t. \(g\).
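As a toy illustration (not from the notes), take \(p_\theta(x) = \mathcal{N}(x; \theta, 1)\) and \(f(x) = x^2\), so the true gradient is \(\nabla_\theta \mathbb{E}[x^2] = \nabla_\theta(\theta^2 + 1) = 2\theta\). Writing \(x = g(\epsilon; \theta) = \theta + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, 1)\) gives a reparametrized estimator:

import numpy as np

theta, N = 1.5, 100000
eps = np.random.randn(N)          # epsilon ~ N(0, 1), independent of theta
x = theta + eps                   # x = g(eps; theta)
grad = np.mean(2 * x * 1.0)       # d/dtheta f(g(eps; theta)) = 2x * dg/dtheta
print(grad)                       # close to 2 * theta = 3.0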

REINFORCE (Score Function Estimator)

R is not always differentiable, hence we can’t always use the reparametrization trick. The REINFORCE algorithm introduces another solution to this problem.

Let’s rewrite our objective with simplified notation. Let \(\tau = s_0, a_0, s_1, a_1 ,...\), \(p_\theta(\tau) = \prod_{t = 0}^\infty \pi_\theta(a_t|s_t)P(s_{t+1}|s_t, a_t)\) and \(r(\tau) = \sum_{t=0}^\infty R(s_t, a_t, s_{t+1})\), then we have:
\[\begin{align*} \nabla_\theta\mathbb{E}\left[\sum_{t=0}^\infty R(s_t, a_t, s_{t+1}) \bigg| s_0, \pi_\theta\right] &= \nabla_\theta\mathbb{E}_{p_\theta(\tau)}\left[r(\tau)\bigg| s_0, \pi_\theta\right] \\ &=\nabla_\theta\int_\tau p_\theta(\tau)r(\tau)~d\tau\\ &=\int_\tau p_\theta(\tau)\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}r(\tau)~d\tau\\ &=\int_\tau p_\theta(\tau)\nabla_\theta \log p_\theta(\tau) r(\tau)~d\tau\\ &=\mathbb{E}_{p_\theta(\tau)} \left[\nabla_\theta \log p_\theta(\tau) r(\tau)\right]\\ &\approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log p_\theta(\tau^i) r(\tau^i) \end{align*}\]

Now we only have to calculate \(\nabla_\theta \log p_\theta(\tau)\):

\[\begin{align*} \nabla_\theta \log p_\theta(\tau) &= \nabla_\theta \log \left(\prod_{t=0}^\infty \pi_\theta(a_t|s_t)P(s_{t+1}|s_t, a_t)\right)\\ &= \nabla_\theta \sum_{t = 0}^\infty \log \pi_\theta(a_t|s_t) + \log P(s_{t+1}| s_t, a_t)\\ &= \sum_{t = 0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) \end{align*}\]

Fortunately this doesn’t depend on transition probabilities!

* The likelihood ratio estimator changes the probabilities of experienced paths; it does not try to change the paths themselves (the pathwise derivative does).
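For comparison, the score-function estimator on the same toy problem as before uses \(\nabla_\theta \log \mathcal{N}(x; \theta, 1) = x - \theta\):

import numpy as np

theta, N = 1.5, 100000
x = theta + np.random.randn(N)            # samples from N(theta, 1)
grad = np.mean((x - theta) * x**2)        # score-function (likelihood ratio) estimator
print(grad)                               # also close to 2 * theta = 3.0

Both estimators average to \(2\theta\), but in this toy problem the score-function estimate has noticeably higher variance, which is one motivation for the baselines discussed below.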

REINFORCE Algorithm


  1. Initialize \(\pi_\theta\)
  2. for iterations \(i = 1, 2, ...\):
    • Rollout N episodes with policy \(\pi_\theta\) and save \((s^i_t, a^i_t, s^i_{t+1}, r^i_{t+1})\)

    • Compute \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \left( \sum_{t = 0}^\infty\log \pi_\theta(a^i_t|s^i_t) \right) \left(\sum_{t'=0}^\infty R(s^i_{t'}, a^i_{t'}, s^i_{t'+1}) \right)\)
    • BackProp and minimize \(\mathcal{L}\) using GD.
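A minimal sketch of this loop on CartPole with a softmax policy over the polynomial features used earlier (the linear parametrization, step size, and episode counts are my own choices for illustration, not part of the notes); the gradient of the log-softmax policy, \(\nabla_\theta \log \pi_\theta(a|s)\), is written out by hand instead of using backprop:

import gym
import numpy as np

env = gym.make('CartPole-v0')
nA = env.action_space.n
alpha, N = 0.001, 10

def f(s):                                    # polynomial features
    return np.concatenate([np.ones(1), s, s**2])

theta = np.zeros((nA, 9))

def pi(theta, s):                            # softmax policy pi_theta(a|s)
    logits = theta @ f(s)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

for it in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):                       # rollout N episodes
        s, done = env.reset(), False
        grad_logp = np.zeros_like(theta)     # sum_t grad log pi(a_t|s_t)
        R = 0.
        while not done:
            p = pi(theta, s)
            a = np.random.choice(nA, p=p)
            # grad of log softmax w.r.t. theta: (one_hot(a) - p) outer f(s)
            grad_logp += (np.eye(nA)[a] - p)[:, None] * f(s)[None, :]
            s, r, done, _ = env.step(a)
            R += r
        grad += grad_logp * R                # (sum_t grad log pi) * (sum_t' r)
    theta += alpha * grad / N                # gradient ascent on the objective

In practice \(\mathcal{L}\) is implemented in an autodiff framework and minimized with a stochastic gradient method; the hand-written update above is the same gradient for this particular linear softmax policy.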

Actor-Critic Methods

Policy gradient objective tries to:

  • Increase probability of paths with positive R
  • Decrease probability of paths with negative R

What if we only get positive (negative) rewards?

Baseline

The baseline addresses the all-positive (all-negative) reward problem by subtracting a baseline (e.g. the mean reward) from the path reward. Another motivation for the baseline is variance reduction (you will learn about this in your homework). The gradient estimate with a baseline becomes:

\[\frac{1}{N}\sum_{i=1}^N \nabla_\theta \log p_\theta(\tau^i) \left(r(\tau^i) - b\right)\]

This estimate is unbiased as long as \(b\) doesn’t depend on actions.

  • Constant Baseline: \(b = \mathbb{E}[r(\tau)]\)

  • Optimal Constant Baseline: \(b = \frac{\mathbb{E}[(\nabla_\theta \log p_\theta (\tau))^2 r(\tau)]}{\mathbb{E}[(\nabla_\theta \log p_\theta (\tau))^2]}\)

Temporal Structure

In the original policy gradient objective we increase or decrease the probability of an action depending on the reward of the whole path. Instead, we want to modify the probability of an action based only on rewards that are a consequence of that action. We can do this by taking the temporal structure of rewards into account:

\[\frac{1}{N}\sum_{i=1}^N \left( \sum_{t = 0}^\infty \nabla_\theta \log \pi_\theta(a^i_t|s^i_t) \left(\sum_{t'=t}^\infty R(s^i_{t'}, a^i_{t'}, s^i_{t'+1}) - b\right) \right)\]
  • Time Dependent Baseline: \(b_t = \mathbb{E}[\sum_{t'=t}^\infty R(s_{t'}, a_{t'}, s_{t'+1})]\)

  • State Dependent Baseline: \(b(s_t) = V^{\pi}(s_t) = \mathbb{E}\left[\sum_{t'=0}^\infty R(s_{t'}, a_{t'}, s_{t'+1}) \bigg| s_0=s_t, \pi\right]\)

Advantage Actor Critic Method (A2C)

The main idea of this method is to increase or decrease the probability of an action based on the advantage of that action under the current policy. We already approximated the advantage with different baselines and sums of rewards \(\left(\sum_{t'=t}^\infty R(s^i_{t'}, a^i_{t'}, s^i_{t'+1}) - b\right)\), but a better way to estimate the advantage is to use the Value / Q-Value functions:

\[A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\]

One way to find this advantage is to use another function approximator \(V^{\pi_\theta}_\phi\) for the value function and compute the Q-Value with a K-step lookahead:

\[\begin{align*} Q^{\pi_\theta}(s, a) &= \mathbb{E}\left[\sum_{t=0}^{K-1}\gamma^t R(s_{t}, a_{t}, s_{t+1}) + \gamma^K V^{\pi_\theta}_\phi(s_{K}) \bigg|s_0=s, a_0=a, \pi_\theta \right]\\ &\approx \sum_{t=0}^{K-1}\gamma^t R(s_{t}, a_{t}, s_{t+1}) + \gamma^K V^{\pi_\theta}_\phi(s_{K}) \qquad \text{(one-sample approximation)} \end{align*}\]

For K = 1:

\[Q^{\pi_\theta}(s, a) \approx R(s, a, s') + \gamma V^{\pi_\theta}_\phi(s')\]

Now the question is how to fit \(V^{\pi_\theta}_\phi\). Answer: the Bellman equation! Recall the Bellman equation for the value function:

\[V^{\pi_\theta}(s) = \mathbb{E} \left[ R(s, a, s') + \gamma V^{\pi_\theta}(s') \bigg| s_0 = s, \pi_\theta\right]\]

We can use this equation to fit a parametrized value function the same way we did for Approximate Q-Learning: minimize the squared error \([target - V^{\pi_\theta}_\phi(s)]^2\) between our value function and \(target=R(s, a, s') + \gamma V^{\pi_\theta}_\phi(s')\).

A2C Algorithm


  1. Sample a batch \(\{(s_i, a_i, r_i, s_{i + 1})\}_i\) under policy \(\pi_\theta\).
  2. Fit \(V_\phi^{\pi_\theta}(s_i)\) to \(r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})\) by minimizing squared error \(\|r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V_\phi^{\pi_\theta}(s_i)\|^2\).
  3. \[\max_{\theta}~ \sum_{i} \log \pi_\theta(a_i|s_i) \left[ r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V^{\pi_\theta}_\phi(s_i) \right]\]
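A compact sketch of these three steps with linear approximators for both the policy and the value function on CartPole (the features, step sizes, and one-episode batches are assumptions for illustration; a practical A2C would use neural networks, parallel workers, and an autodiff library):

import gym
import numpy as np

env = gym.make('CartPole-v0')
nA = env.action_space.n
gamma, alpha_pi, alpha_v = .99, .001, .01

def f(s):                                    # polynomial features
    return np.concatenate([np.ones(1), s, s**2])

theta = np.zeros((nA, 9))                    # policy parameters
phi = np.zeros(9)                            # value-function parameters

def pi(theta, s):                            # softmax policy pi_theta(a|s)
    logits = theta @ f(s)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

for it in range(2000):
    # 1. sample a batch (one episode here) under pi_theta
    batch, s, done = [], env.reset(), False
    while not done:
        p = pi(theta, s)
        a = np.random.choice(nA, p=p)
        s_, r, done, _ = env.step(a)
        batch.append((s, a, r, s_, done))
        s = s_

    for (s, a, r, s_, done) in batch:
        target = r if done else r + gamma * (phi @ f(s_))
        adv = target - phi @ f(s)            # 1-step advantage estimate
        # 2. fit V_phi with one gradient step on the squared error
        phi += alpha_v * adv * f(s)
        # 3. policy gradient step weighted by the advantage
        p = pi(theta, s)
        theta += alpha_pi * adv * (np.eye(nA)[a] - p)[:, None] * f(s)[None, :]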


This post is licensed under CC BY 4.0 by the author.