2025-06-26发表2025-06-29更新3 分钟读完 (大约480个字)

贝尔曼最优公式

Bellman optimality equation (elementwise form) (BOE)

$$ \begin{aligned} v(s)&=\max_{\pi}\sum_{a}\pi(a|s)[\sum_r p(r|s,a)r+\gamma\sum_{s'}p(s'|s,a)v(s')]\\ &=\max_{\pi}\sum_a\pi(a|s)q(s,a) \end{aligned} $$

类似Bellman公式，但这里$\pi(s)$是未知的，需要求解得到最优策略

Bellman optimality equation (matrix-vector form)

$$ \begin{aligned} v=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v) \end{aligned} $$

where the elements corresponding to $s$ or $s’$ are

$$ \begin{aligned} [r_{\pi}]_s &\triangleq\sum_a\pi(a|s)\sum_r p(r|s,a)r,\\ [P_{\pi}]_{s,s'}=p(s|s')&\triangleq\sum_a\pi(a|s)\sum_{s'}p(s'|s,a) \end{aligned} $$

Contraction mapping(压缩映射):f is a contraction mapping if $||f(x_1)-f(x_2)||\leq\gamma||x_1-x_2||$ where $\gamma\in (0,1)$.
Contraction mapping theorem(压缩映射原理):
For any equation that has the form of $x=f(x)$, if $f$ is a contraction mapping, then:

Existence: there exists a fixed point $x^{*}$ satisfying $f(x^{*})=x^{*}$.
Uniqueness: the fixed point $x^{*}$ is unique.
Algorithm: consider a sequence $\{x_k\}$ where $x_{k+1}=f(x_k)$, then $x_k\rightarrow x^{*}$ as $k\rightarrow\infty$. Moreover, the convergence rate is exponentially fast.

For BOE,

$\begin{aligned} v=f(v)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v) \end{aligned}$

Theorem(Contraction Property)
$f(v)$ is a contraction mapping satisfying $||f(v_1)-f(v_2)||\leq \gamma||v_1-v_2||$ where $\gamma$ is the discount rate.

Theorem(Exitence,Uniqueness, and Algorithm)
For the BOE $v=f(v)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, there always exists a solution $v^{*}$ and the solution is unique. The solution could be solved iteratively by

$\begin{aligned} v_{k+1}=f(v_k)=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v_k) \end{aligned}$

This sequence $\{v_k\}$ converges to $v^{*}$ exponentially fast given any initial guess $v_0$. The convergence rate is determined by $\gamma$.

$\begin{aligned} v^*=r_{\pi}^*+\gamma P_{\pi^*}v^* \end{aligned}$

Theorem(Policy Optimality)
Suppose that $v^{*}$ is the unique solution to $v=\max_{\pi}(r_{\pi}+\gamma P_{\pi}v)$, and $v_{\pi}$ is the state value function satisfying $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$ for any given policy $\pi$, then $v^{*}\geq v_{\pi},\forall\pi$

Optimal polity $\pi^*$ has the form:
For any $s\in S$, the deterministic greedy policy

$$ \begin{aligned} \pi^*(a|s)=\begin{cases} 1,&a=a^*(s)\\0,&a\neq a^*(s) \end{cases} \end{aligned} $$

is an optimal policy solving the BOE.Here $a^{*}(s)=\arg\max_aq^{*}(a,s)$, where $q^{*}(s,a):=\sum_r p(r|s,a)r+\gamma\sum_{s’}p(s’|s,a)v^{*}(s’)$.

贝尔曼最优公式

http://example.com/RL/BOE.html

作者

Jiamin Liu

发布于

2025-06-26

更新于

2025-06-29

贝尔曼最优公式

作者

发布于

更新于

许可协议

评论