Bellman Optimality Equation
Bellman optimality equation (elementwise form, BOE):
$$v(s)=\max_{\pi}\sum_{a}\pi(a|s)\left(\sum_{r}p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v(s')\right),\quad \forall s\in S$$
This is similar to the Bellman equation, but here $\pi(s)$ is unknown and must be solved for in order to obtain the optimal policy.
Bellman optimality equation (matrix-vector form):
$$v=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$$
where the elements corresponding to $s$ or $s'$ are
$$[r_{\pi}]_s=\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)r,\qquad [P_{\pi}]_{s,s'}=p(s'|s)=\sum_{a}\pi(a|s)p(s'|s,a).$$
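As a concrete illustration of the matrix-vector quantities, here is a minimal NumPy sketch (the toy model arrays `P`, `R`, `pi` and their random values are assumptions for illustration, not from the text) that builds $r_{\pi}$ and $P_{\pi}$ from the model and a given policy, and evaluates the right-hand side $r_{\pi}+\gamma P_{\pi}v$:

```python
import numpy as np

# Hypothetical toy model: n_s states, n_a actions.
# P[s, a, s'] = p(s'|s,a);  R[s, a] = expected immediate reward sum_r p(r|s,a) * r.
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)            # normalize so each p(.|s,a) is a distribution
R = rng.random((n_s, n_a))

pi = np.full((n_s, n_a), 1.0 / n_a)          # a given policy pi(a|s), here uniform

# [r_pi]_s    = sum_a pi(a|s) * R[s, a]
# [P_pi]_s,s' = sum_a pi(a|s) * p(s'|s,a)
r_pi = (pi * R).sum(axis=1)                  # shape (n_s,)
P_pi = np.einsum("sa,sat->st", pi, P)        # shape (n_s, n_s), rows sum to 1

gamma = 0.9
v = np.zeros(n_s)
rhs = r_pi + gamma * P_pi @ v                # r_pi + gamma * P_pi * v for this particular pi
print(rhs)
```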
Contraction mapping: $f$ is a contraction mapping if $||f(x_1)-f(x_2)||\leq\gamma||x_1-x_2||$ for some $\gamma\in (0,1)$.
Contraction mapping theorem:
For any equation that has the form of $x=f(x)$, if $f$ is a contraction mapping, then:
- Existence: there exists a fixed point $x^{*}$ satisfying $f(x^{*})=x^{*}$.
- Uniqueness: the fixed point $x^{*}$ is unique.
- Algorithm: consider a sequence $\{x_k\}$ where $x_{k+1}=f(x_k)$, then $x_k\rightarrow x^{*}$ as $k\rightarrow\infty$. Moreover, the convergence rate is exponentially fast.
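The following minimal sketch (assuming the scalar map $f(x)=0.5x+1$, which is a contraction with $\gamma=0.5$) illustrates the iterative algorithm $x_{k+1}=f(x_k)$ converging to the unique fixed point $x^{*}=2$:

```python
# f(x) = 0.5*x + 1 is a contraction with gamma = 0.5; its unique fixed point is x* = 2.
def f(x):
    return 0.5 * x + 1.0

x = 10.0                 # arbitrary initial guess x_0
for k in range(30):
    x = f(x)             # x_{k+1} = f(x_k)

print(x)                 # ~2.0, satisfying f(x*) = x*
```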
For the BOE, define $f(v):=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, so the BOE can be written as $v=f(v)$.
Theorem (Contraction Property)
$f(v)$ is a contraction mapping satisfying $||f(v_1)-f(v_2)||\leq \gamma||v_1-v_2||$ where $\gamma$ is the discount rate.
Theorem (Existence, Uniqueness, and Algorithm)
For the BOE $v=f(v)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, there always exists a solution $v^{*}$ and the solution is unique. The solution can be obtained iteratively by
$$v_{k+1}=f(v_k)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_k).$$
This sequence $\{v_k\}$ converges to $v^{*}$ exponentially fast given any initial guess $v_0$. The convergence rate is determined by $\gamma$.
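As a sketch under stated assumptions (reusing the hypothetical `P`, `R`, `gamma` from the earlier toy-model sketch), the iteration $v_{k+1}=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_k)$ can be implemented by maximizing over actions in each state, since the maximizing policy is deterministic elementwise:

```python
import numpy as np

# v_{k+1}(s) = max_a [ R[s,a] + gamma * sum_{s'} P[s,a,s'] * v_k(s') ]
# P, R, gamma are the hypothetical toy quantities from the sketch above.
def solve_boe(P, R, gamma, tol=1e-10, max_iter=10_000):
    v = np.zeros(R.shape[0])                   # any initial guess v_0 works
    for _ in range(max_iter):
        q = R + gamma * P @ v                  # q_k(s,a), shape (n_s, n_a)
        v_new = q.max(axis=1)                  # maximize over actions in each state
        if np.max(np.abs(v_new - v)) < tol:    # stop once successive iterates are close
            return v_new
        v = v_new
    return v

# v_star = solve_boe(P, R, gamma)
```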
Theorem (Policy Optimality)
Suppose that $v^{*}$ is the unique solution to $v=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, and that $v_{\pi}$ is the state value function satisfying $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$ for any given policy $\pi$. Then $v^{*}\geq v_{\pi},\ \forall\pi$.
The optimal policy $\pi^{*}$ has the following form:
For any $s\in S$, the deterministic greedy policy
$$\pi^{*}(a|s)=\begin{cases}1, & a=a^{*}(s)\\ 0, & a\neq a^{*}(s)\end{cases}$$
is an optimal policy solving the BOE. Here $a^{*}(s)=\arg\max_a q^{*}(s,a)$, where $q^{*}(s,a):=\sum_r p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v^{*}(s')$.
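A short sketch of extracting this greedy policy (again reusing the hypothetical `P`, `R`, `gamma` and the `solve_boe` sketch above, with `R[s,a]` standing for $\sum_r p(r|s,a)r$):

```python
import numpy as np

# q*(s,a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v*(s')
#         = R[s,a] + gamma * (P @ v_star)[s,a]
def greedy_policy(P, R, gamma, v_star):
    q_star = R + gamma * P @ v_star                 # shape (n_s, n_a)
    a_star = q_star.argmax(axis=1)                  # a*(s) = argmax_a q*(s,a)
    pi_star = np.zeros_like(q_star)                 # pi*(a|s) = 1 iff a = a*(s)
    pi_star[np.arange(R.shape[0]), a_star] = 1.0
    return pi_star

# pi_star = greedy_policy(P, R, gamma, solve_boe(P, R, gamma))
```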