Stochastic Approximation and Stochastic Gradient Descent

Mean estimation: estimate $E[X]$ using samples $\{x_k\}$, $w_{k+1}=w_k-\frac{1}{k}(w_k-x_k)$.
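
A minimal sketch of this incremental mean-estimation update; the Gaussian sample stream and its parameters are assumptions made only for the demo.

```python
import numpy as np

# Incremental mean estimation: w_{k+1} = w_k - (1/k) * (w_k - x_k).
# The sample distribution below is an illustrative assumption.
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)  # x_1, x_2, ...

w = 0.0  # w_1: initial estimate (any value works)
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)  # equals the running average of x_1..x_k

print(w, samples.mean())  # the two numbers should nearly coincide
```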

RM algorithm: solve $g(w)=0$ using $\{\widetilde{g}(w_k,\eta_k)\}$, $w_{k+1}=w_k-a_k\widetilde{g}(w_k,\eta_k)$.
Theorem (Robbins-Monro Theorem)
In the Robbins-Monro algorithm, if
(1) $0\lt c_1\leq\nabla_wg(w)\leq c_2,\forall w$;
(2) $\underset{k=1}{\overset{\infty}\sum}a_k=\infty$ and $\underset{k=1}{\overset{\infty}\sum}a_k^2\lt\infty$;
(3) $E[\eta_k|\mathcal{H}_k]=0$ and $E[\eta_k^2|\mathcal{H}_k]\lt \infty$;
where $\mathcal{H}_k=\{w_k,w_{k-1},\cdots\}$, then $w_k$ converges with probability $1$ to the root $w^{*}$ satisfying $g(w^{*})=0$.
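
A minimal sketch of the RM iteration, assuming a simple $g(w)=2w-6$ (so $\nabla_w g\equiv 2$ satisfies condition (1)), additive Gaussian observation noise $\eta_k$, and step sizes $a_k=1/k$ satisfying condition (2); all of these choices are illustrative.

```python
import numpy as np

# Robbins-Monro iteration: w_{k+1} = w_k - a_k * g_tilde(w_k, eta_k).
# The function g, the noise level, and the iteration count are illustrative assumptions.
rng = np.random.default_rng(1)

def g(w):
    return 2.0 * w - 6.0       # root w* = 3; gradient is 2, so condition (1) holds

w = 0.0                        # initial guess w_1
for k in range(1, 20_001):
    a_k = 1.0 / k              # satisfies sum a_k = inf and sum a_k^2 < inf
    g_tilde = g(w) + rng.normal(scale=0.5)  # noisy observation g(w_k) + eta_k
    w = w - a_k * g_tilde

print(w)                       # should be close to the root w* = 3
```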

SGD algorithm: minimize $J(w)=E[f(w,X)]$ using $\{\nabla_w f(w_k,x_k)\}$, $w_{k+1}=w_k-\alpha_k\nabla_w f(w_k,x_k)$.
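
A minimal sketch of SGD, assuming $f(w,x)=\frac{1}{2}(w-x)^2$ so that $\nabla_w f(w,x)=w-x$ and the minimizer of $J$ is $E[X]$; the data distribution and step sizes $\alpha_k=1/k$ are illustrative assumptions.

```python
import numpy as np

# SGD for J(w) = E[f(w, X)] with f(w, x) = 0.5 * (w - x)^2,
# so grad_w f(w, x) = w - x and the minimizer is w* = E[X].
rng = np.random.default_rng(2)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)  # illustrative data

w = 0.0
for k, x_k in enumerate(samples, start=1):
    alpha_k = 1.0 / k
    grad = w - x_k               # stochastic gradient grad_w f(w_k, x_k)
    w = w - alpha_k * grad

print(w, samples.mean())  # SGD recovers the sample mean
```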

Suppose we would like to minimize $J(w)=E[f(w,X)]$ given a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The BGD, MBGD, and SGD algorithms for solving this problem are, respectively,
(1) BGD: $w_{k+1}=w_k-\alpha_k\frac{1}{n}\underset{i=1}{\overset{n}\sum}\nabla_w f(w_k,x_i)$.
(2) MBGD: $w_{k+1}=w_k-\alpha_k\frac{1}{m}\underset{j\in\mathcal{I}_k}\sum\nabla_w f(w_k,x_j)$.
(3) SGD: $w_{k+1}=w_k-\alpha_k\nabla_w f(w_k,x_k)$.

In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $\frac{1}{n}\underset{i=1}{\overset{n}\sum}\nabla_w f(w_k,x_i)$ is close to the true gradient.
In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1,\cdots,n\}$ with size $|\mathcal{I}_k|=m$. The set $\mathcal{I}_k$ is obtained by $m$ i.i.d. samplings.
In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time $k$.

If $m=1$, MBGD becomes SGD.
If $m=n$, MBGD does not become BGD strictly speaking, because MBGD uses $n$ randomly fetched samples whereas BGD uses all $n$ numbers. In particular, MBGD may use a value in $\{x_i\}_{i=1}^n$ multiple times, whereas BGD uses each number only once.
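
A minimal sketch comparing the three updates on the same toy objective $f(w,x)=\frac{1}{2}(w-x)^2$; the data, mini-batch size, step sizes, and iteration count are all illustrative assumptions.

```python
import numpy as np

# Compare BGD, MBGD, and SGD on f(w, x) = 0.5 * (w - x)^2.
rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)   # the fixed sample set {x_i}
n, m, iters = len(x), 32, 2_000

def run(grad_fn):
    w = 0.0
    for k in range(1, iters + 1):
        w -= (1.0 / k) * grad_fn(w)              # alpha_k = 1/k
    return w

# BGD: average gradient over all n samples in every iteration.
w_bgd = run(lambda w: np.mean(w - x))
# MBGD: average gradient over a size-m mini-batch drawn i.i.d. (with replacement).
w_mbgd = run(lambda w: np.mean(w - x[rng.integers(0, n, size=m)]))
# SGD: gradient at a single randomly drawn sample.
w_sgd = run(lambda w: w - x[rng.integers(0, n)])

print(w_bgd, w_mbgd, w_sgd, x.mean())  # all three should approach the sample mean
```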
