Monte Carlo Methods

Three algorithms:

  • MC Basic
  • MC Exploring Starts
  • MC $\epsilon$-Greedy

MC Basic

Two expressions of the action value:

  1. requires the model:
     $q_{\pi_k}(s,a)=\underset{r}{\sum}p(r|s,a)r+\gamma\underset{s'}{\sum}p(s'|s,a)v_{\pi_k}(s')$
  2. does not require the model (model-free):
     $q_{\pi_k}(s,a)=E[G_t|S_t=s,A_t=a]$
     We can use expression 2 to calculate $q_{\pi_k}(s,a)$ based on data (samples or experiences).
     Summary: of the two ways to compute an action value, we must have either the model or the data.
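For a concrete illustration of expression 1, here is a minimal sketch that computes $q_{\pi_k}(s,a)$ for a single $(s,a)$ pair when the model is known; the distributions and value estimates are made-up toy numbers, not taken from any particular example.

```python
# Expression 1: compute q_pi_k(s, a) for one fixed (s, a) from a known model.
# The distributions and value estimates below are made-up toy numbers.
GAMMA = 0.9

reward_dist = {0.0: 0.2, 1.0: 0.8}        # p(r | s, a)
transition_dist = {"s1": 0.5, "s2": 0.5}  # p(s' | s, a)
v = {"s1": 4.0, "s2": 2.0}                # current estimate of v_pi_k(s')

# q(s, a) = sum_r p(r|s,a) * r + gamma * sum_s' p(s'|s,a) * v(s')
q_sa = sum(p * r for r, p in reward_dist.items()) \
     + GAMMA * sum(p * v[s_next] for s_next, p in transition_dist.items())
print(q_sa)  # 0.8 + 0.9 * 3.0 = 3.5
```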

The procedure of Monte Carlo estimation of action values:

  • Starting from $(s,a)$, following policy $\pi_k$, generate an episode.
  • The return of this episode is $g(s,a)$
  • $g(s,a)$ is a sample of $G_t$ in $q_{\pi_k}(s,a)=E[G_t|S_t=s,A_t=a]$
  • Suppose we have a set of $N$ episodes and hence the returns $\{g^{(i)}(s,a)\}_{i=1}^{N}$. Then, $q_{\pi_k}(s,a)=E[G_t|S_t=s,A_t=a]\approx\frac{1}{N}\underset{i=1}{\overset{N}{\sum}}g^{(i)}(s,a)$.
    We estimate $q_{\pi_k}(s,a)$ from data. For every $(s,a)$ we need sufficiently many episodes to compute $q_{\pi_k}(s,a)$.
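A minimal sketch of this procedure is below; `sample_return` is a hypothetical helper that generates one episode starting from $(s,a)$ under $\pi_k$ and returns its return $g(s,a)$.

```python
# Expression 2: estimate q_pi_k(s, a) as the average of sampled returns.
# `sample_return(s, a)` is a hypothetical helper: it generates one episode
# starting from (s, a) under pi_k and returns that episode's return g(s, a).
def mc_action_value(s, a, sample_return, num_episodes=1000):
    returns = [sample_return(s, a) for _ in range(num_episodes)]
    return sum(returns) / len(returns)   # (1/N) * sum_i g^(i)(s, a)
```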

MC Basic is very inefficient, so it is of little practical use.

MC Exploring Starts

The MC Basic algorithm can be improved in two ways:

1. Use data more efficiently

Try to make fuller use of the data visited in each episode.
Visit: every time a state-action pair appears in the episode, it is called a visit of that state-action pair.

  • Initial-visit method: only the initial visit of an episode is used to estimate the action value.
  • Data-efficient methods (contrasted in the sketch after this list):
    • first-visit method: only the first occurrence of $(s,a)$ in an episode is used to estimate the action value of $(s,a)$.
    • every-visit method: every occurrence of $(s,a)$ in an episode can be used to estimate the action value.
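Below is a sketch contrasting the two data-efficient methods; it assumes an episode is given as a list of `(state, action, reward)` tuples and that the discount factor is known.

```python
# Extract return samples for each (s, a) from one episode, under the
# first-visit or every-visit convention.
# `episode` is a list of (state, action, reward) tuples; gamma is the discount factor.
def episode_returns(episode, gamma=0.9, first_visit_only=True):
    # Compute the return g_t that follows each step t, working backwards.
    g, returns_at = 0.0, [0.0] * len(episode)
    for t in range(len(episode) - 1, -1, -1):
        g = episode[t][2] + gamma * g
        returns_at[t] = g

    samples = {}   # (s, a) -> list of return samples usable for estimation
    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if first_visit_only and (s, a) in seen:
            continue          # first-visit: only the first occurrence counts
        seen.add((s, a))
        samples.setdefault((s, a), []).append(returns_at[t])  # every-visit: all occurrences count
    return samples
```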

2. Update value estimate more efficiently

  • The first method: wait until all episodes starting from $(s,a)$ have been collected, then estimate the action value of $(s,a)$ and use the result to update $\pi$.
      MC Basic uses this method. Waiting for all episodes to be collected takes time and is inefficient.
  • The second method: use the return of a single episode to approximate the action value, so the policy can be improved after every episode, which is more efficient (a short sketch follows this list).
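One common way to realize the second method is an incremental average, so $q(s,a)$ is refreshed as soon as a new return from a single episode arrives; the incremental form below is an implementation choice assumed here, not spelled out above.

```python
# Second method: refresh q(s, a) as soon as one new return sample g arrives.
q = {}       # (s, a) -> current action-value estimate
counts = {}  # (s, a) -> number of return samples seen so far

def update(s, a, g):
    counts[(s, a)] = counts.get((s, a), 0) + 1
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + (g - old) / counts[(s, a)]  # incremental mean
```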

Pseudocode: MC Exploring Starts (a sample-efficient variant of MC Basic)

  • Initialization: Initial guess $\pi_0$.
  • Aim: Search for an optimal policy.

  • For each episode, do

    • Episode generation: select a starting state-action pair $(s_0,a_0)$ (exploring starts: every pair has a chance of being selected) and, following the current policy, generate an episode.
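Below is a minimal, runnable Python sketch of the idea, assuming a made-up 1-D chain environment (`STATES`, `ACTIONS`, and `step` are illustrative names, not from the text); it uses exploring starts, every-visit incremental estimates, and greedy policy improvement after each episode.

```python
import random
from collections import defaultdict

# Toy 1-D chain environment (assumed for illustration): states 0..4, actions -1/+1,
# reward +1 for reaching the terminal state 4, 0 otherwise.
STATES = list(range(5))
ACTIONS = [-1, +1]

def step(s, a):
    s_next = min(max(s + a, 0), 4)
    done = (s_next == 4)
    return (1.0 if done else 0.0), s_next, done

GAMMA, NUM_EPISODES, MAX_LEN = 0.9, 2000, 50

q = defaultdict(float)      # (s, a) -> action-value estimate
counts = defaultdict(int)   # (s, a) -> number of return samples for (s, a)
policy = {s: ACTIONS[0] for s in STATES}  # initial guess pi_0: always move left

for _ in range(NUM_EPISODES):
    # Exploring start: any non-terminal (s, a) pair may be chosen as the starting pair.
    s, a = random.choice(STATES[:-1]), random.choice(ACTIONS)
    episode = []
    for _ in range(MAX_LEN):
        r, s_next, done = step(s, a)
        episode.append((s, a, r))
        if done:
            break
        s, a = s_next, policy[s_next]

    # Policy evaluation + improvement, working backwards through the episode.
    g = 0.0
    for s_t, a_t, r_t in reversed(episode):
        g = r_t + GAMMA * g
        counts[(s_t, a_t)] += 1
        q[(s_t, a_t)] += (g - q[(s_t, a_t)]) / counts[(s_t, a_t)]  # incremental mean
        policy[s_t] = max(ACTIONS, key=lambda act: q[(s_t, act)])  # greedy improvement

print(policy)  # the learned policy should favour moving right (+1) in non-terminal states
```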