Value Function Approximation

Idea: approximate the state and action values using parameterized functions: $\hat{v}(s,w)\approx v_\pi(s)$, where $w\in\mathbb{R}^m$ is the parameter vector.

Advantages:
(1) Storage: the dimension of $w$ may be much smaller than $|\mathcal{S}|$.
(2) Generalization: the learned values can generalize to unvisited states.

Algorithm for state value estimation

Objective function
$v_\pi(s)$: the true state value.
$\hat{v}(s,w)$: the approximating function.

The objective function is $J(w)=E[(v_\pi(S)-\hat{v}(S,w))^2]$

  • Goal: find the best $w$ that minimizes $J(w)$.
  1. The first way is to use a uniform distribution,
    setting the probability of each state to $1/|\mathcal{S}|$.
    The objective function becomes $J(w)=\frac{1}{|\mathcal{S}|}\sum_{s\in \mathcal{S}}(v_\pi(s)-\hat{v}(s,w))^2$.
    Drawback: all states are treated as equally important, which usually does not match reality.

  2. The second way is to use the stationary distribution.

    • It describes the long-run behavior of the Markov process.
    • Let $\{d_\pi(s)\}_{s\in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$.
      It reflects the probability of visiting each state after the process has reached its steady state (which requires running for a long time). A numerical sketch of estimating $d_\pi$ is given after this list.
    • The objective function can be rewritten as $J(w)=E[(v_\pi(S)-\hat{v}(S,w))^2]=\sum_{s\in \mathcal{S}}d_\pi(s)(v_\pi(s)-\hat{v}(s,w))^2$. (This is a weighted average, where each state's weight is the probability of visiting it in the steady state.)
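A minimal sketch (not from the original notes) of estimating $d_\pi$ numerically: under policy $\pi$, the state-transition matrix is $P_\pi[s,s']=\sum_a \pi(a|s)p(s'|s,a)$, and repeatedly applying it to any initial distribution converges to $d_\pi$. The 3-state matrix below is purely illustrative.

```python
import numpy as np

# Hypothetical 3-state transition matrix P_pi under some policy pi
# (each row sums to 1): P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a).
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

d = np.ones(P_pi.shape[0]) / P_pi.shape[0]   # start from a uniform guess
for _ in range(1000):                        # power iteration: d <- d P_pi
    d = d @ P_pi
print(d)  # approximates d_pi, i.e. the long-run visit frequencies under pi
```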

Optimization algorithm
This algorithm estimates the state values of a given policy.
Use the gradient-descent algorithm: $w_{k+1}=w_k-\alpha_k\nabla_w J(w_k)$.
The true gradient is

$$\begin{aligned} \nabla_w J(w)&=\nabla_w E[(v_\pi(S)-\hat{v}(S,w))^2]\\ &= E[\nabla_w(v_\pi(S)-\hat{v}(S,w))^2]\\ &=2 E[(v_\pi(S)-\hat{v}(S,w))(-\nabla_w\hat{v}(S,w))] \\&=-2 E[(v_\pi(S)-\hat{v}(S,w))(\nabla_w\hat{v}(S,w))] \end{aligned} $$

Use the stochastic gradient to replace the true gradient:
$w_{t+1}=w_t+\alpha_t (v_\pi(s_t)-\hat{v}(s_t,w_t))\nabla_w\hat{v}(s_t,w_t)$
This update requires the true value $v_\pi(s_t)$, so it is not implementable as is.
Use Monte Carlo or TD learning with function approximation to approximate $v_\pi(s_t)$.

Pseudocode: TD learning with function approximation:

  • Initialization: a function $\hat{v}(s,w)$ that is differentiable in $w$; an initial parameter $w_0$.
  • Aim: Approximate the true state values of a given policy $\pi$.

For each episode generated following the policy $\pi$, do
 For each step $(s_t,r_{t+1},s_{t+1})$, do
  In the general case, $w_{t+1}=w_{t}+\alpha_t[r_{t+1}+\gamma\hat{v}(s_{t+1},w_t)-\hat{v}(s_t,w_t)]\nabla_w \hat{v}(s_t,w_t)$.
  In the linear case, $w_{t+1}=w_{t}+\alpha_t[r_{t+1}+\gamma \phi ^T(s_{t+1})w_t-\phi^T(s_t)w_t]\phi(s_t)$.
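The following is a minimal sketch of the linear-case update in the pseudocode above. It is not from the original notes: the environment interface (`reset()` returning a state, `step(a)` returning `(s_next, r, done)`), the feature map `phi`, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def td_linear(env, policy, phi, num_episodes=500, alpha=0.01, gamma=0.9):
    """TD(0) with linear function approximation: v_hat(s, w) = phi(s)^T w.

    Hypothetical interface: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, phi(s) -> feature vector.
    """
    w = np.zeros(phi(env.reset()).shape)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target bootstraps with the current estimate phi(s')^T w
            target = r + (0.0 if done else gamma * phi(s_next) @ w)
            td_error = target - phi(s) @ w
            w = w + alpha * td_error * phi(s)   # w_{t+1} = w_t + alpha_t * delta_t * phi(s_t)
            s = s_next
    return w
```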

Selection of function approximators

  1. Use a linear function $\hat{v}(s,w)=\phi^T(s)w$ to approximate $v_\pi(s)$, where $\phi(s)$ is the feature vector of state $s$.
    In this case $\nabla_w \hat{v}(s,w)=\phi(s)$.
    Disadvantage: suitable feature vectors must be chosen by hand (see the feature-construction sketch after this list).
    Advantages: the theoretical properties can be analyzed relatively well; with well-chosen features the representational power is still fairly strong and can cover a large class of functions.
  2. Neural networks (nonlinear function approximators).
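Two illustrative feature constructions for the linear case, assuming a hypothetical grid world whose states are indexed row by row (not part of the original notes): one-hot features recover the tabular method, while low-order polynomial features use far fewer parameters at the cost of restricting which value surfaces can be represented.

```python
import numpy as np

def phi_onehot(s, n_states):
    """One-hot features: linear approximation reduces to the tabular case."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def phi_poly(s, n_rows, n_cols):
    """Quadratic polynomial features of the (row, col) position.
    Only 6 parameters regardless of the number of states, but the
    representable value surfaces are restricted to quadratics."""
    i, j = divmod(s, n_cols)
    x, y = i / (n_rows - 1), j / (n_cols - 1)   # normalize coordinates to [0, 1]
    return np.array([1.0, x, y, x * y, x * x, y * y])
```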

Sarsa with function approximation

This estimates action values.
The Sarsa algorithm with value function approximation is:
$w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \hat{q}(s_{t+1},a_{t+1},w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.

Pseudocode: Sarsa with function approximation:

  • Aim: search for a policy that leads the agent to the target from an initial state-action pair $(s_0,a_0)$.

For each episode, do
 If the current state $s_t$ is not the target state, do
  Take action $a_t$ following $\pi_t(s_t)$, generate $r_{t+1},s_{t+1}$, and then take action $a_{t+1}$ following $\pi_t(s_{t+1})$.
  Value update (parameter update):
  $w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \hat{q}(s_{t+1},a_{t+1},w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.
  Policy update:
  $\pi_{t+1}(a|s_t)=1-\frac{\epsilon}{|\mathcal{A}(s_t)|}(|\mathcal{A}(s_t)|-1)$ if $a=\arg\max_{a\in\mathcal{A}(s_t)}\hat q(s_t,a,w_{t+1})$,
  $\pi_{t+1}(a|s_t)=\frac{\epsilon}{|\mathcal{A}(s_t)|}$ otherwise.
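A minimal sketch of the pseudocode above with a linear approximator $\hat q(s,a,w)=\phi_{sa}(s,a)^Tw$. The environment interface, the state-action feature map `phi_sa`, and the hyperparameters are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sarsa_fa(env, phi_sa, n_actions, num_episodes=500,
             alpha=0.01, gamma=0.9, epsilon=0.1):
    """Sarsa with linear function approximation (hypothetical env interface:
    reset() -> s, step(a) -> (s_next, r, done); phi_sa(s, a) -> features)."""
    def eps_greedy(s, w):
        q = np.array([phi_sa(s, a) @ w for a in range(n_actions)])
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(q))

    w = np.zeros(phi_sa(env.reset(), 0).shape)
    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s, w)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next, w) if not done else 0
            # Value (parameter) update with the Sarsa target q_hat(s', a', w)
            target = r + (0.0 if done else gamma * phi_sa(s_next, a_next) @ w)
            td_error = target - phi_sa(s, a) @ w
            w = w + alpha * td_error * phi_sa(s, a)
            s, a = s_next, a_next   # the epsilon-greedy policy update is implicit
    return w
```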

Q-learning with function approximation

The q-value update rule is $w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \underset{a\in\mathcal{A}(s_{t+1})}\max\hat{q}(s_{t+1},a,w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.

Pseudocode: Q-learning with function approximation:

  • Aim: search for a policy that leads the agent to the target from an initial state-action pair $(s_0,a_0)$.

For each episode, do
 If the current state $s_t$ is not the target state, do
  Take action $a_t$ following $\pi_t(s_t)$, generate $r_{t+1},s_{t+1}$.
  Value update (parameter update):
  $w_{t+1}=w_t+\alpha_t[r_{t+1}+\gamma \underset{a\in\mathcal{A}(s_{t+1})}\max\hat{q}(s_{t+1},a,w_t)-\hat{q}(s_t,a_t,w_t)]\nabla_w\hat q(s_t,a_t,w_t)$.
  Policy update:
  $\pi_{t+1}(a|s_t)=1-\frac{\epsilon}{|\mathcal{A}(s_t)|}(|\mathcal{A}(s_t)|-1)$ if $a=\arg\max_{a\in\mathcal{A}(s_t)}\hat q(s_t,a,w_{t+1})$,
  $\pi_{t+1}(a|s_t)=\frac{\epsilon}{|\mathcal{A}(s_t)|}$ otherwise.
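The same sketch adapted to Q-learning: the only change from the Sarsa sketch is that the TD target uses $\max_a \hat q(s_{t+1},a,w_t)$. As before, the environment interface, `phi_sa`, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_learning_fa(env, phi_sa, n_actions, num_episodes=500,
                  alpha=0.01, gamma=0.9, epsilon=0.1):
    """Q-learning with linear function approximation; behavior policy is
    epsilon-greedy, target uses max_a q_hat(s', a, w)."""
    def q_values(s, w):
        return np.array([phi_sa(s, a) @ w for a in range(n_actions)])

    w = np.zeros(phi_sa(env.reset(), 0).shape)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            q = q_values(s, w)
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(q))
            s_next, r, done = env.step(a)
            # Q-learning target: r + gamma * max_a q_hat(s', a, w)
            target = r + (0.0 if done else gamma * q_values(s_next, w).max())
            td_error = target - phi_sa(s, a) @ w
            w = w + alpha * td_error * phi_sa(s, a)
            s = s_next
    return w
```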

Deep Q-learning (DQN)

This is one of the most successful ways of bringing deep neural networks into reinforcement learning.
The neural network plays the role of a nonlinear function approximator.

DQN aims to minimize the objective (loss) function
$J(w)=E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w)-\hat q(S,A,w)\Big)^2\Big]$, where $(S,A,R,S')$ are random variables.
Let $y:=R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w)$. For simplicity, we can assume that $w$ in $y$ is fixed (at least for a while) when we calculate the gradient.

To do that, we can introduce 2 networks:

  • a main network representing $\hat q(s,a,w)$.
  • a target network $\hat q(s,a,w_T)$.
    The objective function then becomes $J=E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w_T)-\hat q(S,A,w)\Big)^2\Big]$, where $w_T$ is the target network parameter.
    When $w_T$ is fixed, the gradient of $J$ can be easily obtained as
    $\nabla_w J=-2E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w_T)-\hat q(S,A,w)\Big)\nabla_w\hat{q}(S,A,w)\Big]$.

Implementation details:

  • Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively. They are set to be the same initially.
  • In every iteration, we draw a mini-batch of samples $\{(s,a,r,s')\}$ from the replay buffer.
  • The inputs of the networks include state $s$ and action $a$. The target output is $y_T=r+\gamma \max_{a\in\mathcal{A}(s')}\hat q(s',a,w_T)$. Then we directly minimize the TD error (loss function) $(y_T-\hat q(s,a,w))^2$ over the mini-batch $\{(s,a,y_T)\}$. A minimal sketch of this training step is given after this list.
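A minimal sketch of the main/target network update described above. The layer sizes, optimizer settings, and synchronization period are illustrative assumptions, not values from the original notes.

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99   # hypothetical problem sizes

# Main network q_hat(s, a, w) and a target network with parameters w_T
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)          # w_T is initialized equal to w
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(batch):
    """One gradient step on a mini-batch of tensors (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    with torch.no_grad():                  # w_T is treated as fixed
        y_T = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # q_hat(s, a, w)
    loss = nn.functional.mse_loss(q_sa, y_T)               # (y_T - q_hat)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Every so often, copy the main network's parameters into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```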

Experience replay:
Store all experience samples in a set $\mathcal{B}:=\{(s,a,r,s')\}$, called the replay buffer.
Each time the neural network is trained, a mini-batch of samples is drawn from this buffer.
Why is experience replay needed? Recall the objective
$J(w)=E\Big[\Big(R+\gamma\underset{a\in\mathcal{A}(S')}\max \hat q(S',a,w)-\hat q(S,A,w)\Big)^2\Big]$

  • $(S,A)\sim d$
  • $R\sim p(R|S,A),\ S'\sim p(S'|S,A)$
  • The distribution of $(S,A)$ is assumed to be uniform, but the samples are not collected uniformly: they arrive along correlated trajectories. Drawing samples uniformly from the buffer breaks the correlation between consecutive samples; this is the experience replay method. A minimal replay-buffer sketch follows below.
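A minimal replay-buffer sketch (capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between
        # consecutive samples collected along a trajectory.
        return random.sample(self.buffer, batch_size)
```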