Monte Carlo Methods
Three algorithms:
- MC Basic
- MC Exploring Starts
- MC $\epsilon$-Greedy
MC Basic
Two expressions of the action value:
- requires the model:
$q_{\pi_k}(s,a)=\underset{r}{\sum}p(r|s,a)r+\gamma\underset{s'}{\sum}p(s'|s,a)v_{\pi_k}(s')$
- does not require the model (model-free):
$q_{\pi_k}(s,a)=E[G_t|S_t=s,A_t=a]$
We can use expression 2 to calculate $q_{\pi_k}(s,a)$ based on data (samples or experiences).
Summary: there are two ways to compute action values, and we must have either the model or data. A sketch of the model-based expression is given below.
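As a point of comparison, here is a minimal sketch of evaluating the first (model-based) expression for a single $(s,a)$ pair; the arrays `p_r`, `p_s`, `v_next`, and all numbers in them are made-up placeholders for illustration, not values from the text.

```python
import numpy as np

# Hypothetical model for a single (s, a) pair; all numbers are placeholders.
rewards = np.array([0.0, 1.0])        # possible immediate rewards r
p_r     = np.array([0.3, 0.7])        # p(r | s, a)
p_s     = np.array([0.2, 0.5, 0.3])   # p(s' | s, a)
v_next  = np.array([0.5, 0.2, 0.9])   # v_{pi_k}(s') for each next state s'
gamma   = 0.9

# Expression 1: q_{pi_k}(s,a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_{pi_k}(s')
q_sa = p_r @ rewards + gamma * (p_s @ v_next)
print(q_sa)   # a single number: the model-based action value
```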
The procedure of Monte Carlo estimation of action values:
- Starting from $(s,a)$, following policy $\pi_k$, generate an episode.
- The return of this episode is $g(s,a)$
- $g(s,a)$ is a sample of $G_t$ in $q_{\pi_k}(s,a)=E[G_t|S_t=s,A_t=a]$
- Suppose we have a set of episodes and hence $\{g^{(j)}(s,a)\}$. Then, $q_{\pi_k}(s,a)=E[G_t|S_t=s,A_t=a]\approx\frac{1}{N}\underset{j=1}{\overset{N}{\sum}}g^{(j)}(s,a)$.
Estimate $q_{\pi_k}(s,a)$ from data: for each $(s,a)$, enough episodes must be collected to compute $q_{\pi_k}(s,a)$ (see the sketch below).
MC Basic is therefore very inefficient and of little practical use.
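A minimal sketch of this sample-average estimate, assuming a hypothetical `generate_episode(s, a, policy)` that returns the list of rewards collected by starting from $(s,a)$ and then following $\pi_k$:

```python
def mc_estimate_q(s, a, policy, generate_episode, num_episodes=1000, gamma=0.9):
    """Estimate q_{pi_k}(s, a) as the average of sampled returns g^{(j)}(s, a)."""
    returns = []
    for _ in range(num_episodes):
        rewards = generate_episode(s, a, policy)   # one episode starting from (s, a)
        g = 0.0
        for r in reversed(rewards):                # discounted return of this episode
            g = r + gamma * g
        returns.append(g)
    return sum(returns) / len(returns)             # approximates E[G_t | S_t=s, A_t=a]
```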
MC Exploring Starts
The MC Basic algorithm can be improved in two aspects:
1. Use data more efficiently
Try to make fuller use of the data visited in each episode.
Visit: every time a state-action pair appears in the episode, it is called a visit of that state-action pair.
- Initial-visit method: only the initial visit of the episode is used to estimate the action value.
- Data-efficient methods (see the sketch after this list):
  - first-visit method: only the first visit of $(s,a)$ in an episode is used to estimate the action value of $(s,a)$.
  - every-visit method: every visit of $(s,a)$ in an episode is used to estimate the action value.
2. Update value estimate more efficiently
- The first method: wait until all episodes starting from $(s,a)$ have been collected, then estimate the action value of $(s,a)$ and use the result to update $\pi$.
  MC Basic uses this method; waiting for all episodes to be collected is time-consuming and inefficient.
- The second method: use the return of a single episode to approximate the action value, so the policy can be improved after every episode, which is more efficient.
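A rough sketch of the difference between the first-visit and every-visit methods over one episode; `episode` is a hypothetical list of `(state, action, reward)` tuples and `returns` maps each $(s,a)$ to its list of sampled returns (these names are assumptions for illustration).

```python
from collections import defaultdict

def accumulate_returns(episode, returns, gamma=0.9, first_visit=True):
    """episode: list of (state, action, reward) tuples;
    returns: dict mapping (s, a) -> list of sampled returns g(s, a)."""
    g = 0.0
    # Walk the episode backwards so that g is the return following each (s, a).
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = r + gamma * g
        if first_visit and any((x[0], x[1]) == (s, a) for x in episode[:t]):
            continue                  # not the first visit of (s, a): skip it
        returns[(s, a)].append(g)     # record this sampled return
    return returns

# Usage: returns = accumulate_returns(episode, defaultdict(list), first_visit=False)
```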
Pseudocode: MC Exploring Starts (a sample-efficient variant of MC Basic)
- Initialization: Initial guess $\pi_0$.
Aim: Search for an optimal policy.
For each episode, do
- Episode generation: