Value Iteration and Policy Iteration

Value iteration is the iterative algorithm that solves the Bellman optimality equation: given an initial value $v_0$,

$$ \begin{aligned} v_{k+1}=\max_\pi(r_{\pi}+\gamma P_{\pi}v_k)\\ \begin{cases} \text{Policy update}:\pi_{k+1}=\arg\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_k)\\ \text{Value update}:v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}}v_k \end{cases} \end{aligned} $$
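As a concrete illustration, here is a minimal Python sketch of the value/policy update above on a made-up 2-state, 2-action MDP; the transition tensor `P`, reward table `R`, and discount `gamma` are arbitrary assumptions chosen only for demonstration, not values from the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, s, s'] = p(s'|s,a), R[a, s] = r(s,a).
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.1, 0.9],
               [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)                        # initial value v_0
for _ in range(1000):
    q = R + gamma * P @ v              # q[a, s] = r(s,a) + gamma * sum_s' p(s'|s,a) v(s')
    v_new = q.max(axis=0)              # value update: v_{k+1}(s) = max_a q(s,a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

pi = q.argmax(axis=0)                  # greedy (deterministic) policy from the final q
print("optimal state values:", v_new, "greedy policy:", pi)
```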

Policy iteration: given an initial policy $\pi_0$,

$$ \begin{aligned} \begin{cases} \text{Policy evaluation}:v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k}v_{\pi_k}\\ \text{Policy improvement}:\pi_{k+1}=\arg\max_{\pi}(r_{\pi}+\gamma P_\pi v_{\pi_k}) \end{cases} \end{aligned} $$
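For comparison, here is a minimal sketch of policy iteration on the same kind of made-up 2-state, 2-action MDP (`P`, `R`, and `gamma` are again illustrative assumptions). The exact policy-evaluation step is done here with a linear solve rather than an inner iteration:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, n_states = 0.9, 2

pi = np.zeros(n_states, dtype=int)               # initial deterministic policy pi_0
for _ in range(100):                             # converges in finitely many improvement steps
    # Policy evaluation: solve v = r_pi + gamma * P_pi v exactly.
    P_pi = P[pi, np.arange(n_states)]            # P_pi[s, s'] = p(s'|s, pi(s))
    r_pi = R[pi, np.arange(n_states)]            # r_pi[s]     = r(s, pi(s))
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to v_{pi_k}.
    q = R + gamma * P @ v
    pi_new = q.argmax(axis=0)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print("converged policy:", pi, "state values:", v)
```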

Compare value iteration and policy iteration

  • In policy iteration, solving $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k}v_{\pi_k}$ requires an iterative algorithm (an infinite number of iterations).
  • In value iteration, $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}}v_k$ is a one-step iteration.

Note that in policy iteration, solving $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k}v_{\pi_k}$ with the iterative algorithm takes infinitely many iterations to reach $v_{\pi_k}$ exactly, so the algorithm is realizable only in theory. If the number of evaluation iterations is limited to $j$, we obtain the truncated policy iteration algorithm.

Pseudocode of Truncated policy iteration
Initialization: The probability models $p(r|s,a)$ and $p(s'|s,a)$ for all $(s,a)$ are known. Initial guess $\pi_0$.
Aim: Search for the optimal state value and an optimal policy.
While the policy has not converged, for the $k$-th iteration, do

  • Policy evaluation:
    Initialization: select the initial guess as $v_k^{(0)}=v_{k-1}$. The maximum number of evaluation iterations is set to $j_{truncate}$.
    While $j\lt j_{truncate}$, do
     For every state $s\in S$, do
      $v_k^{(j+1)}(s)=\sum_a\pi_k(a|s)[\sum_r p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v_k^{(j)}(s')]$

    Set $v_k=v_k^{(j_{truncate})}$.

  • Policy improvement:
    For every state $s\in S$, do
     For every action $a\in \mathcal{A}(s)$, do
      $q_k(s,a)=\sum_r p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v_k(s')$
     $a_k^{*}(s)=\arg\max_a q_k(s,a)$
     $\pi_{k+1}(a|s)=1$ if $a=a_k^{*}(s)$, and $\pi_{k+1}(a|s)=0$ otherwise.

Value iteration and policy iteration are therefore two special cases of truncated policy iteration, with $j_{truncate}$ set to $1$ and $\infty$, respectively. It can be shown that for a general value of $j_{truncate}$ the value estimate produced in each outer iteration lies between those of value iteration and policy iteration, so truncated policy iteration also converges.
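Below is a minimal Python sketch of the truncated policy iteration pseudocode above, on the same kind of made-up 2-state, 2-action MDP (`P`, `R`, `gamma`, and the outer iteration cap are illustrative assumptions). Setting `j_truncate = 1` behaves like value iteration, while a large `j_truncate` approaches policy iteration:

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, n_states = 0.9, 2

def truncated_policy_iteration(j_truncate, n_outer=200):
    pi = np.zeros(n_states, dtype=int)   # initial policy pi_0
    v = np.zeros(n_states)               # v_0; reused as the initial guess v_k^(0) = v_{k-1}
    for _ in range(n_outer):
        # Truncated policy evaluation: only j_truncate sweeps instead of solving exactly.
        P_pi = P[pi, np.arange(n_states)]
        r_pi = R[pi, np.arange(n_states)]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to the truncated value estimate.
        q = R + gamma * P @ v
        pi = q.argmax(axis=0)
    return pi, v

# j_truncate = 1 recovers value iteration; a large j_truncate behaves like policy iteration.
for j in (1, 5, 100):
    pi, v = truncated_policy_iteration(j)
    print(f"j_truncate={j}: policy={pi}, values={np.round(v, 3)}")
```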
