Temporal-Difference (TD) Algorithms

1. TD learning of state value
The data/experience required by the algorithm:

  • $(s_0,r_1,s_1,\cdots,s_t,r_{t+1},s_{t+1})$ or $\{(s_t,r_{t+1},s_{t+1})\}_t$ generated following the given policy $\pi$.

The TD learning algorithm is:
 $v_{t+1}(s_t)=v_t(s_t)-\alpha_t(s_t)\Big[v_t(s_t)-[r_{t+1}+\gamma v_t(s_{t+1})]\Big]$,(1)
 $v_{t+1}(s)=v_t(s),\forall s\neq s_t$,(2)
where $t=0,1,2,\cdots$. Here, $v_t(s_t)$ is the estimate of the state value $v_{\pi}(s_t)$; $\alpha_t(s_t)$ is the learning rate for $s_t$ at time $t$.

This algorithm drives $v(s_t)$ towards the TD target $\bar v_t=r_{t+1}+\gamma v_t(s_{t+1})$.
Another property:
The algorithm only estimates the state values of a given policy. It cannot directly estimate action values or optimal values.
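
A minimal Python sketch of the tabular TD(0) update above, assuming the experience samples $(s_t,r_{t+1},s_{t+1})$ are given as a list of tuples and states are indexed by integers (these assumptions are for illustration only):

```python
import numpy as np

def td0_state_value(samples, num_states, gamma=0.9, alpha=0.1):
    """Tabular TD(0): estimate v_pi from samples (s_t, r_{t+1}, s_{t+1})."""
    v = np.zeros(num_states)              # initial guess (TD bootstraps, so one is required)
    for s, r, s_next in samples:
        td_target = r + gamma * v[s_next]  # r_{t+1} + gamma * v_t(s_{t+1})
        td_error = v[s] - td_target        # v_t(s_t) - TD target
        v[s] -= alpha * td_error           # update only the visited state, as in (1)-(2)
    return v
```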

Theorem (Convergence of TD learning)
By the TD algorithm (1), $v_t(s)$ converges with probability 1 to $v_\pi(s)$ for all $s\in \mathcal{S}$ as $t\rightarrow \infty$ if $\sum_t\alpha_t(s)=\infty$ and $\sum_t\alpha_t^2(s)\lt\infty$ for all $s\in\mathcal{S}$.

TD/Sarsa learning vs MC learning:

  • Online vs. offline: TD/Sarsa is online; it can update the state/action values immediately after receiving a reward. MC is offline; it has to wait until an episode has been completely collected.
  • Continuing vs. episodic tasks: TD/Sarsa can handle both episodic and continuing tasks. MC can only handle episodic tasks, i.e., tasks with terminal states.
  • Bootstrapping vs. non-bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of that value, so it requires an initial guess. MC does not bootstrap; it can estimate state/action values directly without any initial guess.
  • Low vs. high estimation variance: TD has lower estimation variance because it involves fewer random variables; MC has higher estimation variance because it involves many random variables.

2. TD learning of action values: Sarsa
Suppose we have some experience $\{(s_t,a_t,r_{t+1},s_{t+1},a_{t+1})\}_t$. The Sarsa algorithm is:
$q_{t+1}(s_t,a_t)=q_t(s_t,a_t)-\alpha_t(s_t,a_t)\Big[q_t(s_t,a_t)-[r_{t+1}+\gamma q_t(s_{t+1},a_{t+1})]\Big]$,(1)
 $q_{t+1}(s,a)=q_t(s,a),\forall (s,a)\neq (s_t,a_t)$,(2)
where $t=0,1,2,\cdots$. Here, $q_t(s_t,a_t)$ is the estimate of the action value $q_{\pi}(s_t,a_t)$; $\alpha_t(s_t,a_t)$ is the learning rate for $(s_t,a_t)$ at time $t$.

Sarsa is the abbreviation of state-action-reward-state-action.
Sarsa is an action-value version of the TD algorithm.

Theorem (Convergence of Sarsa learning)
By the Sarsa algorithm, $q_t(s,a)$ converges with probability $1$ to $q_\pi(s,a)$ as $t\rightarrow \infty$ for all $(s,a) $ if $\sum_t\alpha_t(s,a)=\infty$ and $\sum_t\alpha_t^2(s,a)\lt\infty$ for all $(s,a)$.
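
A minimal sketch of Sarsa combined with $\varepsilon$-greedy policy improvement, assuming a tabular setting and a hypothetical `env.reset()` / `env.step(a)` interface that returns an integer state, a reward, and a done flag; the value update itself follows equations (1)–(2) of this section:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action from one row of the Q-table with an epsilon-greedy policy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def sarsa(env, num_states, num_actions, episodes=500,
          gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()                        # hypothetical environment interface
        a = epsilon_greedy(q[s], epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)      # collect (s, a, r, s', a')
            a_next = epsilon_greedy(q[s_next], epsilon, rng)
            td_target = r + gamma * q[s_next, a_next]
            q[s, a] -= alpha * (q[s, a] - td_target)
            s, a = s_next, a_next
    return q
```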

3. Expected Sarsa
$q_{t+1}(s_t,a_t)=q_t(s_t,a_t)-\alpha_t(s_t,a_t)\Big[q_t(s_t,a_t)-(r_{t+1}+\gamma E[q_t(s_{t+1},A)])\Big]$,
$q_{t+1}(s,a)=q_t(s,a),\forall (s,a)\neq (s_t,a_t)$,
where $E[q_t(s_{t+1},A)]=\sum_a \pi_t(a|s_{t+1})q_t(s_{t+1},a)=v_t(s_{t+1})$ is the expected value of $q_t(s_{t+1},A)$ under policy $\pi_t$.

Expected Sarsa needs more computation than Sarsa, but it is beneficial in the sense that it reduces the estimation variance, because it reduces the random variables in Sarsa from $\{s_t,a_t,r_{t+1},s_{t+1},a_{t+1}\}$ to $\{s_t,a_t,r_{t+1},s_{t+1}\}$.

Expected Sarsa is a stochastic approximation for solving the following equation:
$q_\pi(s,a)=E\Big[R_{t+1}+\gamma E_{A_{t+1}\sim\pi(S_{t+1})}[q_\pi(S_{t+1},A_{t+1})]\Big|S_t=s,A_t=a\Big],\forall s,a$.
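
A sketch of how the Expected Sarsa TD target differs from Sarsa's, assuming `q` is a tabular array of shape (states, actions) and `pi` is a table of action probabilities $\pi_t(a|s)$ (the variable names are illustrative):

```python
import numpy as np

def expected_sarsa_update(q, pi, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """One Expected Sarsa update; q has shape (S, A), pi[s] holds pi_t(.|s)."""
    expected_q = np.dot(pi[s_next], q[s_next])  # sum_a pi(a|s') q(s', a) = v(s')
    td_target = r + gamma * expected_q          # no sampled a_{t+1} is needed
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```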

4. n-step Sarsa
n-step Sarsa unifies Sarsa and Monte Carlo learning.
Sarsa aims to solve: $q_\pi(s,a)=E[G_t^{(1)}|s,a]=E[R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})|s,a]$.
MC learning aims to solve: $q_\pi(s,a)=E[G_t^{(\infty)}|s,a]=E[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots|s,a]$.
An intermediate algorithm called n-step Sarsa aims to solve:
$q_\pi(s,a)=E[G_t^{(n)}|s,a]=E[R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^nq_\pi(S_{t+n},A_{t+n})|s,a]$.

The algorithm of n-step Sarsa is: $q_{t+1}(s_t,a_t)=q_t(s_t,a_t)-\alpha_t(s_t,a_t)\Big[q_t(s_t,a_t)-[r_{t+1}+\gamma r_{t+2}+\cdots+ \gamma^n q_t(s_{t+n},a_{t+n})]\Big]$.
n-step Sarsa is more general: it becomes the (one-step) Sarsa algorithm when $n=1$ and the MC learning algorithm when $n=\infty$.

n-step Sarsa needs $(s_t,a_t,r_{t+1},s_{t+1},a_{t+1},\cdots,r_{t+n},s_{t+n},a_{t+n})$.
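
A sketch of the n-step TD target, assuming the rewards $r_{t+1},\cdots,r_{t+n}$ and the bootstrap pair $(s_{t+n},a_{t+n})$ have already been collected (names hypothetical):

```python
def n_step_target(rewards, q, s_tn, a_tn, gamma=0.9):
    """TD target r_{t+1} + gamma r_{t+2} + ... + gamma^n q(s_{t+n}, a_{t+n})."""
    target = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted rewards
    target += gamma**len(rewards) * q[s_tn, a_tn]              # bootstrap term
    return target

# One n-step Sarsa update at (s_t, a_t):
# q[s_t, a_t] -= alpha * (q[s_t, a_t] - n_step_target(rewards, q, s_tn, a_tn))
```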

5. TD learning of optimal action values: Q-learning
Q-learning directly estimates the optimal action values.
The Q-learning algorithm is
 $q_{t+1}(s_t,a_t)=q_t(s_t,a_t)-\alpha_t(s_t,a_t)\Big[q_t(s_t,a_t)-[r_{t+1}+\gamma \underset{a\in\mathcal{A}}\max q_t(s_{t+1},a)]\Big]$,
 $q_{t+1}(s,a)=q_t(s,a),\forall (s,a)\neq (s_t,a_t)$.

It aims to solve $q(s,a)=E\Big[R_{t+1}+\gamma\underset{a\in\mathcal{A}}\max q(S_{t+1},a)\Big|S_t=s,A_t=a\Big],\forall s,a$.
This is the Bellman optimality equation expressed in terms of action values.
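
A minimal sketch of one tabular Q-learning update; because the TD target takes a max over actions, no sampled $a_{t+1}$ is needed:

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """One Q-learning update toward the Bellman optimality equation."""
    td_target = r + gamma * np.max(q[s_next])   # greedy bootstrap over actions
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```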

There exist two policies in a TD learning task:

  • The behavior policy is used to generate experience samples.
  • The target policy is constantly updated toward an optimal policy.

On-policy vs off-policy:

  • When the behavior policy is the same as the target policy, such kind of learning is called on-policy (e.g., Sarsa).
  • When they are different, the learning is called off-policy (e.g., Q-learning).
    • Advantage: it can search for optimal policies based on the experience samples generated by any other policy.
  1. Sarsa is on-policy:
    • Sarsa aims to solve the Bellman equation of a given policy $\pi$.
    • The algorithm requires $(s_t,a_t,r_{t+1},s_{t+1},a_{t+1})$.
    • $\pi_t$ is both the target policy and the behavior policy.
  2. Monte Carlo learning is on-policy:

    • It aims to estimate the action values of a given policy with long episodes.
    • $\pi\rightarrow$ experience $\rightarrow$ action value estimates.
    • A policy is used to generate samples, which are then used to estimate the action values of that policy. Based on the action values, we improve the policy (the target policy).
  3. Q-learning is off-policy:

    • Aims to solve the Bellman optimality equation (which does not explicitly involve any policy $\pi$).
    • Requires $(s_t,a_t,r_{t+1},s_{t+1})$.
    • The behavior policy used to generate $a_t$ at $s_t$ can be anything. The target policy is the optimal policy.

Q-learning can be implemented in either off-policy or on-policy fashion.
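
A sketch of an off-policy implementation, assuming (purely for illustration) a uniformly random behavior policy that generates the experience, while the target policy is taken greedily from the learned $q$ at the end; the environment interface is the same hypothetical one as above:

```python
import numpy as np

def off_policy_q_learning(env, num_states, num_actions,
                          episodes=500, gamma=0.9, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()                       # hypothetical interface
        done = False
        while not done:
            a = int(rng.integers(num_actions))  # behavior policy: uniform random
            s_next, r, done = env.step(a)
            q[s, a] -= alpha * (q[s, a] - (r + gamma * np.max(q[s_next])))
            s = s_next
    greedy_policy = np.argmax(q, axis=1)      # target policy: greedy w.r.t. q
    return q, greedy_policy
```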

6. A unified point of view
All these TD algorithms can be written in a unified expression:
 $q_{t+1}(s_t,a_t)=q_t(s_t,a_t)-\alpha_t(s_t,a_t)[q_t(s_t,a_t)-\bar{q}_t]$,
where $\bar{q}_t$ is the TD target.

Different TD algorithms have different TD targets $\bar{q}_t$:

  • Sarsa: $\bar{q}_t=r_{t+1}+\gamma q_t(s_{t+1},a_{t+1})$
  • $n$-step Sarsa: $\bar{q}_t=r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^nq_t(s_{t+n},a_{t+n})$
  • Expected Sarsa: $\bar{q}_t=r_{t+1}+\gamma \sum_a \pi_t(a|s_{t+1})q_t(s_{t+1},a)$.
  • Q-learning: $\bar{q}_t=r_{t+1}+\gamma \max_a q_t(s_{t+1},a)$.
  • Monte Carlo: $\bar{q}_t=r_{t+1}+\gamma r_{t+2}+\cdots$.
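
A minimal sketch of this unified update, where the concrete TD target $\bar{q}_t$ is computed separately and passed in; the commented examples mirror the list above:

```python
def unified_td_update(q, s, a, td_target, alpha=0.1):
    """q_{t+1}(s,a) = q_t(s,a) - alpha * (q_t(s,a) - q_bar_t)."""
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q

# Examples of q_bar_t for one sample (s, a, r, s_next, a_next):
# Sarsa:          r + gamma * q[s_next, a_next]
# Expected Sarsa: r + gamma * sum_a pi(a|s_next) * q[s_next, a]
# Q-learning:     r + gamma * max_a q[s_next, a]
```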