Value Function Approximation

Idea: approximate the state and action values using parameterized functions: $\hat{v}(s,w)\approx v_\pi(s)$, where $w\in\mathbb{R}^m$ is the parameter vector.

Advantages:
(1) Storage: the dimension of $w$ may be much smaller than $|\mathcal{S}|$.
(2) Generalization: the learned values can generalize to unvisited states.

Algorithm for state value estimation

Objective function
$v_\pi(s)$: the true state value.
$\hat{v}(s,w)$: the approximating function.

The objective function is $J(w)=E[(v_\pi(S)-\hat{v}(S,w))^2]$

  • Goal: find the best $w$ that minimizes $J(w)$.
  1. The first way is to use a uniform distribution,
    setting the probability of each state to $1/|\mathcal{S}|$.
    The objective function becomes $J(w)=\frac{1}{|\mathcal{S}|}\sum_{s\in \mathcal{S}}(v_\pi(s)-\hat{v}(s,w))^2$.
    Drawback: all states are treated as equally important, which usually does not match reality.

  2. The second way is to use the stationary distribution.

    • It describes the long-run behavior of the Markov process.
    • Let $\{d_\pi(s)\}_{s\in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$.
      It reflects the probability of visiting each state after the process has reached its steady state (which requires running for a long time). A numerical sketch of estimating $d_\pi$ is given after this list.
    • The objective function can be rewritten as $J(w)=E[(v_\pi(S)-\hat{v}(S,w))^2]=\sum_{s\in \mathcal{S}}d_\pi(s)(v_\pi(s)-\hat{v}(s,w))^2$. (This is a weighted average, where each state's weight is the probability of visiting it in the steady state.)
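A minimal sketch (not from the original notes) of estimating $d_\pi$ numerically: under policy $\pi$, the state-transition matrix is $P_\pi[s,s']=\sum_a \pi(a|s)p(s'|s,a)$, and repeatedly applying it to any initial distribution converges to $d_\pi$. The 3-state matrix below is purely illustrative.

```python
import numpy as np

# Hypothetical 3-state transition matrix P_pi under some policy pi
# (each row sums to 1): P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a).
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

d = np.ones(P_pi.shape[0]) / P_pi.shape[0]   # start from a uniform guess
for _ in range(1000):                        # power iteration: d <- d P_pi
    d = d @ P_pi
print(d)  # approximates d_pi, i.e. the long-run visit frequencies under pi
```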

Optimization algorithm
This algorithm estimates the state values of a given policy.
Use the gradient-descent algorithm: $w_{k+1}=w_k-\alpha_k\nabla_w J(w_k)$.
The true gradient is

$$\begin{aligned} \nabla_w J(w)&=\nabla_w E[(v_\pi(S)-\hat{v}(S,w))^2]\\ &= E[\nabla_w(v_\pi(S)-\hat{v}(S,w))^2]\\ &=2 E[(v_\pi(S)-\hat{v}(S,w))(-\nabla_w\hat{v}(S,w))] \\&=-2 E[(v_\pi(S)-\hat{v}(S,w))(\nabla_w\hat{v}(S,w))] \end{aligned} $$

Use the stochastic gradient to replace the true gradient:
$w_{t+1}=w_t+\alpha_t (v_\pi(s_t)-\hat{v}(s_t,w_t))\nabla_w\hat{v}(s_t,w_t)$
This update requires the true value $v_\pi(s_t)$, so it is not implementable as is.
Use Monte Carlo or TD learning with function approximation to approximate $v_\pi(s_t)$.

Pseudocode: TD learning with function approximation:

  • Initialization: a function $\hat{v}(s,w)$ that is differentiable in $w$; an initial parameter $w_0$.
  • Aim: Approximate the true state values of a given policy $\pi$.

For each episode generated following the policy $\pi$, do
 For each step $(s_t,r_{t+1},s_{t+1})$, do
  In the general case, $w_{t+1}=w_{t}+\alpha_t[r_{t+1}+\gamma\hat{v}(s_{t+1},w_t)-\hat{v}(s_t,w_t)]\nabla_w \hat{v}(s_t,w_t)$.
  In the linear case, $w_{t+1}=w_{t}+\alpha_t[r_{t+1}+\gamma \phi ^T(s_{t+1})w_t-\phi^T(s_t)w_t]\phi(s_t)$.
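The following is a minimal sketch of the linear-case update in the pseudocode above. It is not from the original notes: the environment interface (`reset()` returning a state, `step(a)` returning `(s_next, r, done)`), the feature map `phi`, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def td_linear(env, policy, phi, num_episodes=500, alpha=0.01, gamma=0.9):
    """TD(0) with linear function approximation: v_hat(s, w) = phi(s)^T w.

    Hypothetical interface: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, phi(s) -> feature vector.
    """
    w = np.zeros(phi(env.reset()).shape)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target bootstraps with the current estimate phi(s')^T w
            target = r + (0.0 if done else gamma * phi(s_next) @ w)
            td_error = target - phi(s) @ w
            w = w + alpha * td_error * phi(s)   # w_{t+1} = w_t + alpha_t * delta_t * phi(s_t)
            s = s_next
    return w
```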

Selection of function approximators

  1. Use a linear function $\hat{v}(s,w)=\phi^T(s)w$ to approximate $v_\pi(s)$, where $\phi(s)$ is the feature vector of state $s$.
    In this case $\nabla_w \hat{v}(s,w)=\phi(s)$.
    Disadvantage: suitable feature vectors must be chosen by hand (see the feature-construction sketch after this list).
    Advantages: the theoretical properties can be analyzed relatively well; with well-chosen features the representational power is still fairly strong and can cover a large class of functions.
  2. Neural networks (nonlinear function approximators).
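Two illustrative feature constructions for the linear case, assuming a hypothetical grid world whose states are indexed row by row (not part of the original notes): one-hot features recover the tabular method, while low-order polynomial features use far fewer parameters at the cost of restricting which value surfaces can be represented.

```python
import numpy as np

def phi_onehot(s, n_states):
    """One-hot features: linear approximation reduces to the tabular case."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def phi_poly(s, n_rows, n_cols):
    """Quadratic polynomial features of the (row, col) position.
    Only 6 parameters regardless of the number of states, but the
    representable value surfaces are restricted to quadratics."""
    i, j = divmod(s, n_cols)
    x, y = i / (n_rows - 1), j / (n_cols - 1)   # normalize coordinates to [0, 1]
    return np.array([1.0, x, y, x * y, x * x, y * y])
```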

Sarsa with function approximation

This estimates action values.
The Sarsa algorithm with value function approximation is:
$w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \hat{q}(s_{t+1},a_{t+1},w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.

Pseudocode: Sarsa with function approximation:

  • Aim: search for a policy that leads the agent to the target from an initial state-action pair $(s_0,a_0)$.

For each episode, do
 If the current state $s_t$ is not the target state, do
  Take action $a_t$ following $\pi_t(s_t)$, generate $r_{t+1},s_{t+1}$, and then take action $a_{t+1}$ following $\pi_t(s_{t+1})$.
  Value update (parameter update):
  $w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \hat{q}(s_{t+1},a_{t+1},w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.
  Policy update:
  $\pi_{t+1}(a|s_t)=1-\frac{\epsilon}{|\mathcal{A}(s_t)|}(|\mathcal{A}(s_t)|-1)$ if $a=\arg\max_{a\in\mathcal{A}(s_t)}\hat q(s_t,a,w_{t+1})$,
  $\pi_{t+1}(a|s_t)=\frac{\epsilon}{|\mathcal{A}(s_t)|}$ otherwise.
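A minimal sketch of the pseudocode above with a linear approximator $\hat q(s,a,w)=\phi_{sa}(s,a)^Tw$. The environment interface, the state-action feature map `phi_sa`, and the hyperparameters are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sarsa_fa(env, phi_sa, n_actions, num_episodes=500,
             alpha=0.01, gamma=0.9, epsilon=0.1):
    """Sarsa with linear function approximation (hypothetical env interface:
    reset() -> s, step(a) -> (s_next, r, done); phi_sa(s, a) -> features)."""
    def eps_greedy(s, w):
        q = np.array([phi_sa(s, a) @ w for a in range(n_actions)])
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(q))

    w = np.zeros(phi_sa(env.reset(), 0).shape)
    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s, w)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next, w) if not done else 0
            # Value (parameter) update with the Sarsa target q_hat(s', a', w)
            target = r + (0.0 if done else gamma * phi_sa(s_next, a_next) @ w)
            td_error = target - phi_sa(s, a) @ w
            w = w + alpha * td_error * phi_sa(s, a)
            s, a = s_next, a_next   # the epsilon-greedy policy update is implicit
    return w
```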

Q-learning with function approximation

The q-value update rule is $w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \underset{a\in\mathcal{A}(s_{t+1})}\max\hat{q}(s_{t+1},a,w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.

Pseudocode: Q-learning with function approximation:

  • Aim: search for a policy that leads the agent to the target from an initial state-action pair $(s_0,a_0)$.

For each episode, do
 If the current state $s_t$ is not the target state, do
  Take action $a_t$ following $\pi_t(s_t)$, generate $r_{t+1},s_{t+1}$.
  Value update (parameter update):
  $w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \underset{a\in\mathcal{A}(s_{t+1})}\max\hat{q}(s_{t+1},a,w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.
  Policy update:
  $\pi_{t+1}(a|s_t)=1-\frac{\epsilon}{|\mathcal{A}(s_t)|}(|\mathcal{A}(s_t)|-1)$ if $a=\arg\max_{a\in\mathcal{A}(s_t)}\hat q(s_t,a,w_{t+1})$,
  $\pi_{t+1}(a|s_t)=\frac{\epsilon}{|\mathcal{A}(s_t)|}$ otherwise.
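The same sketch adapted to Q-learning: the only change from the Sarsa sketch is that the TD target uses $\max_a \hat q(s_{t+1},a,w_t)$. As before, the environment interface, `phi_sa`, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_learning_fa(env, phi_sa, n_actions, num_episodes=500,
                  alpha=0.01, gamma=0.9, epsilon=0.1):
    """Q-learning with linear function approximation; behavior policy is
    epsilon-greedy, target uses max_a q_hat(s', a, w)."""
    def q_values(s, w):
        return np.array([phi_sa(s, a) @ w for a in range(n_actions)])

    w = np.zeros(phi_sa(env.reset(), 0).shape)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            q = q_values(s, w)
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(q))
            s_next, r, done = env.step(a)
            # Q-learning target: r + gamma * max_a q_hat(s', a, w)
            target = r + (0.0 if done else gamma * q_values(s_next, w).max())
            td_error = target - phi_sa(s, a) @ w
            w = w + alpha * td_error * phi_sa(s, a)
            s = s_next
    return w
```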

Deep Q-learning (DQN)

This is one of the most successful ways of bringing deep neural networks into reinforcement learning.
The neural network plays the role of a nonlinear function approximator.

DQN aims to minimize the objective (loss) function
$J(w)=E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w)-\hat q(S,A,w)\Big)^2\Big]$, where $(S,A,R,S')$ are random variables.
Let $y:=R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w)$. For simplicity, we can assume that $w$ in $y$ is fixed (at least for a while) when we calculate the gradient.

To do that, we can introduce 2 networks:

  • a main network representing $\hat q(s,a,w)$.
  • a target network $\hat q(s,a,w_T)$.
    The objective function then becomes $J=E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w_T)-\hat q(S,A,w)\Big)^2\Big]$, where $w_T$ is the target network parameter.
    When $w_T$ is fixed, the gradient of $J$ can be easily obtained as
    $\nabla_w J=-2E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w_T)-\hat q(S,A,w)\Big)\nabla_w\hat{q}(S,A,w)\Big]$.

Implementation details:

  • Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively. They are set to be the same initially.
  • In every iteration, we draw a mini-batch of samples $\{(s,a,r,s')\}$ from the replay buffer.
  • The inputs of the networks include state $s$ and action $a$. The target output is $y_T=r+\gamma \max_{a\in\mathcal{A}(s')}\hat q(s',a,w_T)$. Then we directly minimize the TD error (loss function) $(y_T-\hat q(s,a,w))^2$ over the mini-batch $\{(s,a,y_T)\}$. A minimal sketch of this training step is given after this list.
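A minimal sketch of the main/target network update described above. The layer sizes, optimizer settings, and synchronization period are illustrative assumptions, not values from the original notes.

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99   # hypothetical problem sizes

# Main network q_hat(s, a, w) and a target network with parameters w_T
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)          # w_T is initialized equal to w
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(batch):
    """One gradient step on a mini-batch of tensors (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    with torch.no_grad():                  # w_T is treated as fixed
        y_T = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # q_hat(s, a, w)
    loss = nn.functional.mse_loss(q_sa, y_T)               # (y_T - q_hat)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Every so often, copy the main network's parameters into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```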

Experience replay:
Store all experience samples in a set $\mathcal{B}:=\{(s,a,r,s')\}$, called the replay buffer.
Each time the neural network is trained, a mini-batch of samples is drawn from this buffer.
Why is experience replay needed? Recall the objective
$J(w)=E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w)-\hat q(S,A,w)\Big)^2\Big]$

  • $(S,A)\sim d$
  • $R\sim p(R|S,A),\ S'\sim p(S'|S,A)$
  • The distribution of $(S,A)$ is assumed to be uniform, but the samples are not collected uniformly: they arrive along correlated trajectories. Drawing samples uniformly from the buffer breaks the correlation between consecutive samples; this is the experience replay method. A minimal replay-buffer sketch follows below.
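A minimal replay-buffer sketch (capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between
        # consecutive samples collected along a trajectory.
        return random.sample(self.buffer, batch_size)
```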