\begin{equation} \theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f_{\theta}(x), y) \end{equation}
The iteration stops when the change in \(\theta\) between successive steps becomes small, or when progress stalls, for example when the loss \(L\) stops decreasing and begins to rise.
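As a minimal sketch of this loop together with a stopping check, assuming a toy one-dimensional loss \(L(\theta) = (\theta - 3)^2\) and illustrative values for the learning rate and tolerance (none of these come from the text above):
\begin{verbatim}
def loss(theta):
    return (theta - 3.0) ** 2          # toy loss, minimized at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)         # dL/dtheta

theta, lr, tol = 0.0, 0.1, 1e-6        # illustrative start and settings
prev_loss = loss(theta)
for step in range(10000):
    new_theta = theta - lr * grad(theta)   # theta^{t+1} = theta^t - eta*grad
    # Stop when theta barely changes or the loss stops decreasing.
    if abs(new_theta - theta) < tol or loss(new_theta) >= prev_loss:
        theta = new_theta
        break
    prev_loss, theta = loss(new_theta), new_theta
print(theta)   # converges to (approximately) 3.0
\end{verbatim}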
In SGD we update the weights using the gradient of the loss on a single randomly drawn sample, stepping in the direction that reduces that sample's loss.
\begin{verbatim}
# SGD sketch: draw one random training sample, step against its gradient.
while True:   # in practice, run until a stopping criterion above is met
    sample = sample_window(corpus)        # draw a single random sample
    theta = theta - lr * sample.grad()    # move theta against its gradient
\end{verbatim}
In theory this is only a noisy approximation of gradient descent; in practice, however, neural networks often train better with a bit of noise in the updates.
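For a runnable version of the sketch above, here is single-sample SGD on a toy linear-regression problem; the data, learning rate, and the helper \texttt{grad\_single} are illustrative assumptions rather than anything from the text:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)                        # toy 1-D inputs
y = 2.0 * X + rng.normal(scale=0.1, size=100)   # targets: true slope is 2

def grad_single(theta, x_i, y_i):
    """Gradient of the squared error (theta*x - y)^2 for one sample."""
    return 2.0 * (theta * x_i - y_i) * x_i

theta, lr = 0.0, 0.05
for step in range(2000):
    i = rng.integers(len(X))                    # pick one sample at random
    theta = theta - lr * grad_single(theta, X[i], y[i])
print(theta)   # noisy, but close to the true slope 2.0
\end{verbatim}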
\begin{itemize}
  \item \textbf{Stochastic gradient descent} produces noisy, choppy updates because each step uses only a single sample.
  \item \textbf{Batch gradient descent} computes the gradient over the entire dataset; the updates are stable, but each step is slow.
  \item \textbf{Mini-batch gradient descent} takes advantage of both: each update is computed over a group of \(m\) samples (see the sketch after this list).
\end{itemize}
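A minimal sketch of one mini-batch pass over the data, assuming NumPy arrays \texttt{X} and \texttt{y} and a user-supplied \texttt{grad\_loss(theta, X\_batch, y\_batch)} that returns the gradient averaged over the batch; these names and the default values are assumptions for illustration:
\begin{verbatim}
import numpy as np

def minibatch_sgd_epoch(theta, X, y, grad_loss, lr=0.01, m=32):
    """One pass over the data in shuffled mini-batches of size m."""
    n = X.shape[0]
    idx = np.random.permutation(n)        # shuffle so batches are random
    for start in range(0, n, m):
        batch = idx[start:start + m]      # indices of the next m samples
        # Average gradient over the mini-batch, then take one step.
        theta = theta - lr * grad_loss(theta, X[batch], y[batch])
    return theta
\end{verbatim}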
Regularization
See the section on regularization.