Value Iteration and Policy Iteration
Value iteration: it is the iterative algorithm that solves the Bellman optimality equation. Given an initial value $v_0$, it repeats the policy update $\pi_{k+1}=\arg\max_\pi(r_\pi+\gamma P_\pi v_k)$ and the value update $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}}v_k$.
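The following is a minimal sketch of value iteration on a toy tabular MDP; the discount `gamma`, reward table `r`, and transition tensor `P` are illustrative stand-ins, not values from the text.

```python
import numpy as np

# Toy 2-state, 2-action MDP in matrix form (illustrative numbers only):
# r[s, a] is the expected immediate reward, P[s, a, s'] the transition probability.
gamma = 0.9
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])

v = np.zeros(2)                          # initial guess v_0
for _ in range(1000):
    q = r + gamma * P @ v                # q[s, a] = r(s, a) + gamma * sum_s' p(s'|s, a) v(s')
    pi = q.argmax(axis=1)                # policy update: greedy policy pi_{k+1}
    v_new = q.max(axis=1)                # value update: v_{k+1} = r_{pi_{k+1}} + gamma * P_{pi_{k+1}} v_k
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new
print("value estimate:", v, "greedy policy:", pi)
```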
Policy iteration: given an initial policy $\pi_0$, it alternates policy evaluation, which solves $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k}v_{\pi_k}$ for $v_{\pi_k}$, and policy improvement, which computes $\pi_{k+1}=\arg\max_\pi(r_\pi+\gamma P_\pi v_{\pi_k})$.
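For comparison, here is a minimal sketch of policy iteration on the same kind of toy MDP (again with illustrative numbers). Policy evaluation is done here by solving the linear system exactly; an iterative evaluation would need infinitely many sweeps to reach the same result.

```python
import numpy as np

# Same toy MDP as in the value-iteration sketch (illustrative numbers only).
gamma = 0.9
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])

pi = np.zeros(2, dtype=int)              # initial deterministic policy pi_0 (action index per state)
for _ in range(100):
    # Policy evaluation: solve v_pi = r_pi + gamma * P_pi v_pi exactly.
    r_pi = r[np.arange(2), pi]           # r_pi[s] = r(s, pi(s))
    P_pi = P[np.arange(2), pi]           # P_pi[s, s'] = p(s'|s, pi(s))
    v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    # Policy improvement: greedy policy w.r.t. q_pi.
    q = r + gamma * P @ v_pi
    pi_new = q.argmax(axis=1)
    if np.array_equal(pi_new, pi):       # stop once the policy is stable
        break
    pi = pi_new
print("optimal policy:", pi, "state values:", v_pi)
```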
Comparing value iteration and policy iteration:
- In policy iteration, solving $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k}v_{\pi_k}$ requires an iterative algorithm (an infinite number of iterations).
- In value iteration, $v_{k+1}=r_{\pi_{k+1}}+\gamma P_{\pi_{k+1}}v_k$ is a one-step update.
Note that in policy iteration, solving $v_{\pi_k}=r_{\pi_k}+\gamma P_{\pi_k}v_{\pi_k}$ iteratively requires infinitely many iterations to reach $v_{\pi_k}$ exactly, so the algorithm is feasible only in theory. If the number of evaluation iterations is limited to $j$, we obtain the truncated policy iteration algorithm.
Pseudocode of Truncated policy iteration
Initialization: The probability models $p(r|s,a)$ and $p(s'|s,a)$ for all $(s,a)$ are known. Initial guess $\pi_0$.
Aim: Search for the optimal state value and an optimal policy.
While the policy has not converged, for the $k$-th iteration, do
Policy evaluation:
Initialization: select the initial guess as $v_k^{(0)}=v_{k-1}$. The maximum number of iterations is set to $j_{truncate}$.
While $j\lt j_{truncate}$, do
For every state $s\in S$, do
$v_k^{(j+1)}(s)=\sum_a\pi_k(a|s)[\sum_r p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v_k^{(j)}(s')]$
Set $v_k=v_k^{(j_{truncate})}$.
Policy improvement:
For every state $s\in S$, do
For every action $a\in \mathcal{A}(s)$, do
$q_k(s,a)=\sum_r p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v_k(s')$
$a_k^{*}(s)=\arg\max_a q_k(s,a)$
$\pi_{k+1}(a|s)=1$ if $a=a_k^{*}$, and $\pi_{k+1}(a|s)=0$ otherwise.
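As a concrete illustration, here is a minimal Python sketch of the pseudocode above on a toy MDP. The reward values `rewards`, the distributions `p_r` ($p(r|s,a)$) and `p_s` ($p(s'|s,a)$), the discount `gamma`, and `j_truncate` are all assumed for illustration.

```python
import numpy as np

# Toy MDP for the truncated policy iteration pseudocode (illustrative numbers only):
# p_r[s, a, i] = p(rewards[i] | s, a) and p_s[s, a, s'] = p(s'|s, a).
gamma, j_truncate = 0.9, 5
rewards = np.array([0.0, 1.0])
p_r = np.array([[[1.0, 0.0], [0.5, 0.5]],
                [[0.0, 1.0], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 1.0], [1.0, 0.0]]])
n_states, n_actions = 2, 2

pi = np.full((n_states, n_actions), 1.0 / n_actions)   # initial guess pi_0 (uniform)
v = np.zeros(n_states)                                 # v_0; reused as v_k^(0) = v_{k-1}

for _ in range(100):                                   # outer loop over k
    # Policy evaluation, truncated after j_truncate sweeps.
    for _ in range(j_truncate):
        q = p_r @ rewards + gamma * p_s @ v            # sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s')
        v = (pi * q).sum(axis=1)                       # v_k^(j+1)(s) = sum_a pi_k(a|s) q(s, a)
    # Policy improvement: greedy deterministic policy.
    q = p_r @ rewards + gamma * p_s @ v
    a_star = q.argmax(axis=1)                          # a_k^*(s)
    pi_new = np.zeros_like(pi)
    pi_new[np.arange(n_states), a_star] = 1.0          # pi_{k+1}(a|s) = 1 iff a = a_k^*(s)
    if np.array_equal(pi_new, pi):                     # stop once the policy no longer changes
        break
    pi = pi_new

print("greedy actions:", a_star, "value estimate:", v)
```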
Value iteration and policy iteration can be seen as two special cases of truncated policy iteration, obtained by setting $j_{truncate}$ to $1$ and $\infty$, respectively. It can be shown that, for a general value of $j_{truncate}$, the value estimate produced in each iteration lies between those of value iteration and policy iteration, so the algorithm also converges.