Floating Point
FP32 is 4 bytes (32 bits).
\begin{equation} (-1)^{B} \times 2^{E-127} \times \left(1 + \sum_{i=1}^{23} b_{23-i}2^{-i}\right) \end{equation}
Usually \(E\) is 8 bits and there are 23 bits of \(b\) (the mantissa).
With more bits in \(E\) we get more dynamic range; with more bits in \(b\) we get more precision.
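As a concrete check of the formula, here is a small Python sketch that unpacks a normal FP32 value into its sign, exponent, and mantissa bits and reassembles it by hand (zero, subnormals, and infinities are ignored):

```python
# Decode a normal FP32 value from its raw bits using the formula above.
import struct

def decode_fp32(x: float) -> float:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                      # 1 sign bit
    exponent = (bits >> 23) & 0xFF         # 8 bits of E
    mantissa = bits & 0x7FFFFF             # 23 bits of b
    fraction = 1 + sum(((mantissa >> (23 - i)) & 1) * 2 ** -i for i in range(1, 24))
    return (-1) ** sign * 2 ** (exponent - 127) * fraction

print(decode_fp32(3.14159))  # prints ~3.14159
```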
Mixed Precision Training
- keep a master copy of the model weights in FP32
- run the forward pass in FP16
- scale the loss up so small gradient values are not rounded away (underflow) in FP16
- compute gradients in FP16
- convert the gradients to FP32
- scale the gradients back down
- apply the update to the FP32 master weights (a full step is sketched after this list)
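A hand-rolled sketch of a single step of that recipe in PyTorch (assuming `model`, `loss_fn`, and CUDA `inputs`/`targets` that work in FP16; a real loop would keep the FP32 master copy across steps, and in practice `torch.cuda.amp` automates the scaling):

```python
import torch

def mixed_precision_step(model, loss_fn, inputs, targets, lr=1e-3, loss_scale=1024.0):
    # FP32 master copy of the weights
    master = [p.detach().clone().float() for p in model.parameters()]

    # forward pass in FP16
    model.half()
    loss = loss_fn(model(inputs.half()), targets)

    # scale the loss so small gradients don't round away to zero in FP16
    (loss * loss_scale).backward()

    for p, m in zip(model.parameters(), master):
        grad = p.grad.float() / loss_scale   # convert to FP32 and unscale
        m -= lr * grad                       # apply the update to the FP32 master
        p.data.copy_(m)                      # refresh the FP16 working copy
        p.grad = None
```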
BFloat16
To avoid loss scaling, we can use a format with less precision but the same dynamic range as FP32: allocate the same 8 bits to \(E\) and chop \(b\) down to 7 bits. With FP32's exponent range, gradients no longer underflow, so no scaling is needed.
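A quick way to see the trade-off with `torch.finfo`:

```python
# Compare dynamic range (max) and precision (eps, the gap just above 1.0).
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "eps:", info.eps)
# float32 and bfloat16 share the ~3.4e38 range, while float16 tops out near 65504;
# bfloat16's eps is much larger than float32's, i.e. less precision.
```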
Distributed Data Parallel
- every GPU has a copy of the model
- each GPU runs the forward and backward pass on its own shard of the batch
all-reduce
reduce (average) each GPU's gradients down so every copy of the model ends up with the same gradient
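A sketch of doing this gradient all-reduce by hand with `torch.distributed` (assuming the process group is already initialized; PyTorch's DDP wrapper does this for you, overlapped with the backward pass):

```python
import torch.distributed as dist

def allreduce_gradients(model):
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients across GPUs
            p.grad /= world_size                           # then average
```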
DeepSpeed ZeRO
reduce-scatter
squish (sum) the shards down and send each part to the right GPU
all-gather
every GPU sends its shard to everyone else, so each GPU ends up with the full tensor
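A small sketch of the two collectives with `torch.distributed` (process group assumed initialized, and the tensor length assumed to divide evenly across ranks):

```python
import torch
import torch.distributed as dist

def reduce_scatter_then_all_gather(full_tensor):
    world_size = dist.get_world_size()
    chunks = list(full_tensor.chunk(world_size))   # one shard per rank
    my_shard = torch.empty_like(chunks[0])

    # reduce-scatter: sum corresponding shards across ranks; each rank keeps only its own
    dist.reduce_scatter(my_shard, chunks, op=dist.ReduceOp.SUM)

    # all-gather: every rank sends its shard to everyone, rebuilding the full tensor
    gathered = [torch.empty_like(my_shard) for _ in range(world_size)]
    dist.all_gather(gathered, my_shard)
    return torch.cat(gathered)
```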
Stage 1
We cache only a slice of the optimizer state (e.g. Adam's moments) on each GPU; parameters and gradients are still fully replicated.
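A conceptual sketch of that partitioning (the round-robin assignment and Adam-style moments are just illustrative choices):

```python
import torch

def shard_optimizer_state(params, rank, world_size):
    # each rank keeps optimizer moments only for its own slice of the parameters
    my_params = list(params)[rank::world_size]
    state = {id(p): {"m": torch.zeros_like(p), "v": torch.zeros_like(p)} for p in my_params}
    return my_params, state
```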
Stage 2
- perform a backward pass
- at each layer, compute the gradient
- look up which GPU in the cluster is responsible for that layer, reduce the gradient to it, and free the local copy (sketched after this list)
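A sketch of that per-layer routing with `torch.distributed` (`owner_of` is a hypothetical lookup from a parameter to the rank that owns its shard):

```python
import torch.distributed as dist

def route_gradient(param, owner_of):
    owner = owner_of(param)                                   # hypothetical ownership lookup
    dist.reduce(param.grad, dst=owner, op=dist.ReduceOp.SUM)  # sum onto the owning rank
    if dist.get_rank() != owner:
        param.grad = None                                     # non-owners free the gradient
```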
Stage 3
- divide the model parameters into FSDP units
- shard each unit across multiple GPUs
- run the forward pass, all-gathering each unit's parameters just before they are needed and freeing them afterwards
- run the backward pass the same way, reduce-scattering each unit's gradients to their owning GPUs
- each GPU updates its own shard using the gradient shard it received
(unlike stages 1 and 2, you need to stream in your parameters—more communication overhead!)
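A minimal sketch of stage-3-style training with PyTorch's FSDP wrapper (assuming the process group is initialized, one GPU per rank, and hypothetical `build_model()` / `get_batch()` helpers):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(build_model().cuda())      # parameters get sharded across ranks (FSDP units)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs, targets = get_batch()
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
loss.backward()          # params are all-gathered per unit, gradients reduce-scattered
optimizer.step()         # each rank updates only the shard it owns
optimizer.zero_grad()
```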
Lessons
- always use mixed precision training
- always use bfloat16
PEFT
See PEFT