\begin{equation} \theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f_{\theta}(x), y) \end{equation}
The iteration stops when the change in \(\theta\) between successive steps becomes small, or when progress stalls, for example when the loss \(L\) stops decreasing and begins to rise.
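As a minimal sketch of this loop together with a stopping check, assuming a toy one-dimensional loss \(L(\theta) = (\theta - 3)^2\) and illustrative values for the learning rate and tolerance (none of these come from the text above):
\begin{verbatim}
def loss(theta):
    return (theta - 3.0) ** 2          # toy loss, minimized at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)         # dL/dtheta

theta, lr, tol = 0.0, 0.1, 1e-6        # illustrative start and settings
prev_loss = loss(theta)
for step in range(10000):
    new_theta = theta - lr * grad(theta)   # theta^{t+1} = theta^t - eta*grad
    # Stop when theta barely changes or the loss stops decreasing.
    if abs(new_theta - theta) < tol or loss(new_theta) >= prev_loss:
        theta = new_theta
        break
    prev_loss, theta = loss(new_theta), new_theta
print(theta)   # converges to (approximately) 3.0
\end{verbatim}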
In SGD we update the weights using the gradient of the loss on a single randomly drawn sample, stepping in the direction that reduces that sample's loss.
\begin{verbatim}
# SGD sketch: draw one random training sample, step against its gradient.
while True:   # in practice, run until a stopping criterion above is met
    sample = sample_window(corpus)        # draw a single random sample
    theta = theta - lr * sample.grad()    # move theta against its gradient
\end{verbatim}
In theory this is only a noisy approximation of gradient descent; in practice, however, neural networks often train better with a bit of noise in the updates.
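For a runnable version of the sketch above, here is single-sample SGD on a toy linear-regression problem; the data, learning rate, and the helper \texttt{grad\_single} are illustrative assumptions rather than anything from the text:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)                        # toy 1-D inputs
y = 2.0 * X + rng.normal(scale=0.1, size=100)   # targets: true slope is 2

def grad_single(theta, x_i, y_i):
    """Gradient of the squared error (theta*x - y)^2 for one sample."""
    return 2.0 * (theta * x_i - y_i) * x_i

theta, lr = 0.0, 0.05
for step in range(2000):
    i = rng.integers(len(X))                    # pick one sample at random
    theta = theta - lr * grad_single(theta, X[i], y[i])
print(theta)   # noisy, but close to the true slope 2.0
\end{verbatim}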
\begin{itemize}
  \item \textbf{Stochastic gradient descent} produces noisy, choppy updates because each step uses only a single sample.
  \item \textbf{Batch gradient descent} computes the gradient over the entire dataset; the updates are stable, but each step is slow.
  \item \textbf{Mini-batch gradient descent} takes advantage of both: each update is computed over a group of \(m\) samples (see the sketch after this list).
\end{itemize}
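A minimal sketch of one mini-batch pass over the data, assuming NumPy arrays \texttt{X} and \texttt{y} and a user-supplied \texttt{grad\_loss(theta, X\_batch, y\_batch)} that returns the gradient averaged over the batch; these names and the default values are assumptions for illustration:
\begin{verbatim}
import numpy as np

def minibatch_sgd_epoch(theta, X, y, grad_loss, lr=0.01, m=32):
    """One pass over the data in shuffled mini-batches of size m."""
    n = X.shape[0]
    idx = np.random.permutation(n)        # shuffle so batches are random
    for start in range(0, n, m):
        batch = idx[start:start + m]      # indices of the next m samples
        # Average gradient over the mini-batch, then take one step.
        theta = theta - lr * grad_loss(theta, X[batch], y[batch])
    return theta
\end{verbatim}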
Regularization
See the section on regularization.