Diffusion Models
We can consider a model that maps between random noise and images.
For every step, we sample Gaussian noise and add it to the image. The original approach adds the Gaussian noise directly to the pixels; more recent variants instead operate on a latent representation rather than on raw pixels.
Usually, there are a few thousand noising steps.
Why can’t we have a one-step mapping from noise to pictures? A result from the physics of diffusion processes says the reverse of a diffusion is only (approximately) Gaussian when the steps are small; with steps that are too large, the reverse conditionals no longer have a tractable form.
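To make the forward (noising) process concrete, here is a minimal NumPy sketch; the noise schedule (`beta_start`, `beta_end`) and the step count are illustrative assumptions, not values from these notes.

```python
import numpy as np

def forward_noising(x0, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Minimal sketch of the forward diffusion process:
    at each step, mix the current image with fresh Gaussian noise.
    x0: image as a float array (e.g., pixels scaled to [-1, 1])."""
    betas = np.linspace(beta_start, beta_end, num_steps)  # illustrative linear schedule
    x = x0.copy()
    trajectory = [x]
    for beta in betas:
        noise = np.random.randn(*x.shape)                     # sample Gaussian noise
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise   # one noising step
        trajectory.append(x)
    return trajectory  # after many steps, x is approximately pure Gaussian noise

# usage: states = forward_noising(np.zeros((32, 32, 3)), num_steps=1000)
```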
Loss Function
One way to model our objective is as maximum-likelihood estimation (MLE). Because we are continuously adding Gaussian noise, we can assume that
\begin{equation} y \sim \mathcal{N}(\mu = \hat{y}(\theta), \sigma^{2}=k) \end{equation}
If you compute the MLE over the choice of \(\hat{y}(\theta)\), you get the squared error.
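Spelling out that step (standard algebra, with the fixed variance \(k\) and additive constants dropped at the end):

\begin{equation} \arg\max_{\theta} \log \mathcal{N}\!\left(y \mid \hat{y}(\theta), k\right) = \arg\max_{\theta} \left[ -\frac{\left(y - \hat{y}(\theta)\right)^{2}}{2k} - \frac{1}{2}\log(2\pi k) \right] = \arg\min_{\theta} \left(y - \hat{y}(\theta)\right)^{2} \end{equation}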
ELBO
This is the loss function diffusion models actually use: it leverages the Gaussian fact above, but considers the entire diffusion process rather than a single step.
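For reference, one standard form of this bound from the DDPM literature (not spelled out in these notes) decomposes over the noising steps, with \(x_0\) the data, \(x_1,\dots,x_T\) the noised versions, \(q\) the fixed noising process, and \(p_\theta\) the learned reverse process:

\begin{equation} \mathcal{L} = \mathbb{E}_{q}\!\left[ D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right) + \sum_{t=2}^{T} D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_{\theta}(x_{t-1} \mid x_t)\right) - \log p_{\theta}(x_0 \mid x_1) \right] \end{equation}

Because each KL term compares two Gaussians, it reduces to a squared error between their means, which is exactly where the Gaussian-MLE fact above enters.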
LSTMs
A big text-generation flaw with LSTMs: the latent state vector has to contain information about the ENTIRE sentence, and that information has to be propagated through the recursion. Information from early tokens therefore has to survive every subsequent update, which becomes a bottleneck for long sequences.
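A minimal PyTorch sketch makes the bottleneck visible (the vocabulary, embedding, and hidden sizes here are illustrative assumptions): no matter how long the input, everything downstream steps can know about it must fit in the fixed-size hidden/cell state pair.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the specific numbers are assumptions for the sketch.
vocab_size, embed_dim, hidden_dim, seq_len = 10_000, 128, 256, 500

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # a 500-token "sentence"
outputs, (h, c) = lstm(embed(tokens))

# The entire sentence must be summarized in these fixed-size vectors, and each
# token's information only survives by passing through every recursive update.
print(h.shape, c.shape)  # torch.Size([1, 1, 256]) torch.Size([1, 1, 256])
```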
Cross Entropy
It's MLE over a multinomial; the counts of everything other than the one-hot class just happen to be 0.
We are essentially taking the gradient of an objective of the form
\begin{equation} \arg\max_{\theta}\; p_{\text{correct}}(\theta) \end{equation}
i.e., we only try to maximize the categorical probability assigned to the correct element.
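Writing the cross-entropy out for a one-hot target \(y\) (standard algebra, not in the original notes) shows why only the correct class matters:

\begin{equation} \mathrm{CE}(y, p) = -\sum_{k} y_{k}\,\log p_{k} = -\log p_{\text{correct}}, \qquad \text{so minimizing CE is the same as maximizing } p_{\text{correct}}. \end{equation}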