Transformers
Motivation
Lower Sequence-Length Time Complexity
Minimize Linear Interaction Distance
In a recurrent model, the interaction distance between two tokens scales as \(O(l)\) with sequence length \(l\): gradients have to travel through every intermediate step, so linear order is baked into the architecture.
Maximize Parallelization
In a recurrent model, the forward and backward passes have to wait for the computation to roll from left to right; attention, by contrast, can be computed for every position in parallel.
Key Advantage
- Maximum interaction distance is \(O(1)\): every word is directly connected to every other word
- The number of unparallelizable operations does not grow with sequence length
Self-Attention
Self-attention is formulated as each word in a sequence attending to every word in that same sequence.
Calculating QKV
\begin{equation} \begin{cases} q_{i} = W^{(Q)} x_{i}\\ k_{i} = W^{(K)} x_{i}\\ v_{i} = W^{(V)} x_{i}\\ \end{cases} \end{equation}
and then you have a standard good time using reduced-rank multiplicative attention:
\begin{equation} e_{ij} = q_{i} \cdot k_{j} \end{equation}
and normalize over \(j\) with a softmax:
\begin{equation} a_{ij} = \text{softmax} (e_{ij}) \end{equation}
to obtain:
\begin{equation} O_{i} = \sum_{j} a_{ij} v_{j} \end{equation}
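A minimal NumPy sketch of these per-token equations (the dimensions, random weights, and variable names are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
l, d = 5, 16                           # sequence length, embedding dimension (illustrative)
X = rng.normal(size=(l, d))            # x_1 ... x_l stored as rows
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def softmax(e):
    e = e - e.max()                    # subtract max for numerical stability
    return np.exp(e) / np.exp(e).sum()

outputs = []
for i in range(l):
    q_i = W_Q @ X[i]                                              # q_i = W^(Q) x_i
    e_i = np.array([q_i @ (W_K @ X[j]) for j in range(l)])        # e_ij = q_i . k_j
    a_i = softmax(e_i)                                            # a_ij = softmax_j(e_ij)
    outputs.append(sum(a_i[j] * (W_V @ X[j]) for j in range(l)))  # O_i = sum_j a_ij v_j
O = np.stack(outputs)                  # one output vector per position, shape (l, d)
```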
Vectorized:
\begin{equation} \begin{cases} Q = X W^{(Q)\top}\\ K = X W^{(K)\top}\\ V = X W^{(V)\top}\\ \end{cases} \end{equation}
where \(X\) stacks the inputs \(x_{i}^{\top}\) as rows, so row \(i\) of \(Q\) is \(q_{i}^{\top}\), and we compute scaled dot-product attention:
\begin{equation} Out = \text{softmax} \qty(\frac{QK^{\top}}{\sqrt{d_{k}}}) V \end{equation}
Why divide by \(\sqrt{d_{k}}\)? See Tricks in Training.
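The same computation vectorized, matching the formula above (a sketch; rows of \(X\) are the token vectors, so projections are applied as `X @ W.T`):

```python
import numpy as np

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over the rows of X (shape (l, d))."""
    Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T       # rows of Q are q_i = W^(Q) x_i, etc.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (l, l); the sqrt(d_k) is explained in Tricks in Training
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)           # row-wise softmax: a_ij
    return A @ V                                    # (l, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) / 4.0 for _ in range(3))
out = scaled_dot_product_attention(X, W_Q, W_K, W_V)   # (5, 16)
```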
Transformer Block
Self-attention by itself is just a weighted (rolling) average of the value vectors, with no nonlinearity. To introduce nonlinearity, we apply a linear layer followed by a ReLU after the attention.
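A sketch of that idea: attention alone mixes value vectors linearly, so a linear layer plus ReLU is applied afterwards (the hidden size `d_ff` and all weights are illustrative; residual connections and layer norm are covered next):

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, d_ff = 5, 16, 32
X = rng.normal(size=(l, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W_1, b_1 = rng.normal(size=(d, d_ff)) / np.sqrt(d), np.zeros(d_ff)

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

# self-attention on its own: just a weighted average of the value vectors
Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
H = softmax_rows(Q @ K.T / np.sqrt(d)) @ V

# add a linear layer followed by a ReLU to introduce nonlinearity
out = np.maximum(0.0, H @ W_1 + b_1)               # (l, d_ff)
```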
Tricks in Training
- skip connections \(x_{l} = F(x_{l-1})+x_{l-1}\)
- layernorm (normalize each layer's output to mean zero and standard deviation one, so we protect against distribution shift in the lower layers): \(x'^{(l)} = \frac{x^{(l)}- \mu^{(l)}}{\sigma^{(l)} + \epsilon}\), using the population mean and population standard deviation (see the sketch after this list)
- the mean of a sum is the sum of the means, so the dot product of zero-mean inputs still has mean \(0\), which is good
- but the variance of a sum is the sum of the variances, so for key dimension \(d_{k}\) the dot product of unit-variance components has variance \(d_{k}\); dividing the attention scores by \(\sqrt{d_{k}}\) brings the variance back to \(1\)
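A sketch of both tricks together (note that the LayerNorm used in practice also has learned gain and bias parameters, which the formula above omits):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to mean 0 and (roughly) std 1 using that row's own statistics."""
    mu = x.mean(axis=-1, keepdims=True)        # population mean over the layer's dimensions
    sigma = x.std(axis=-1, keepdims=True)      # population standard deviation
    return (x - mu) / (sigma + eps)

def residual_then_norm(x, sublayer):
    """Skip connection x_l = F(x_{l-1}) + x_{l-1}, followed by layer norm."""
    return layer_norm(sublayer(x) + x)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W = rng.normal(size=(16, 16)) / 4.0
out = residual_then_norm(x, lambda h: np.maximum(0.0, h @ W))   # any sublayer F works here
```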
Word Order
Sinusoidal Position Embeddings
Rarely used in practice: ABSOLUTE position doesn't really matter. See Relative Position Embeddings.
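For reference, a sketch of the sinusoidal scheme from the original Transformer paper: \(\mathrm{PE}_{(pos, 2i)} = \sin(pos / 10000^{2i/d})\) and \(\mathrm{PE}_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})\), added to the input embeddings.

```python
import numpy as np

def sinusoidal_positions(l, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(l)[:, None]                  # (l, 1)
    i = np.arange(0, d, 2)[None, :]              # (1, d/2): even dimension indices
    angles = pos / np.power(10000.0, i / d)      # (l, d/2)
    pe = np.zeros((l, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get cosine
    return pe

X = np.random.default_rng(0).normal(size=(5, 16))
X = X + sinusoidal_positions(5, 16)              # absolute positions are added to the input embeddings
```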
Relative Position Embeddings
Relative positions are LEARNED: we learn an embedding for each relative offset and incorporate it into the self-attention computation, rather than adding anything to the inputs.
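The notes don't pin down the exact mechanism, so here is one common variant (an assumption, in the style of T5's relative attention bias): learn a scalar per clipped relative offset \(j - i\) and add it to the attention scores before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
l, max_offset = 6, 4
rel_bias = rng.normal(size=2 * max_offset + 1)    # one learned scalar per clipped offset (random stand-in here)

offsets = np.arange(l)[None, :] - np.arange(l)[:, None]            # (l, l) matrix of relative offsets j - i
offsets = np.clip(offsets, -max_offset, max_offset) + max_offset   # shift into the index range [0, 2*max_offset]
scores = rng.normal(size=(l, l))                                   # stand-in for Q K^T / sqrt(d_k)
scores = scores + rel_bias[offsets]                                # relative position enters the attention itself
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)                              # softmax now depends on relative positions
```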
Multi-Head Attention
Perform attention several times in parallel to get one set of self-attention embeddings per head, then concatenate. Each head works in a reduced dimension, \(d\) divided by the number of heads, so you end up doing about the same amount of computation (as sketched below).
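A sketch of multi-head attention: with \(h\) heads, each head works in dimension \(d/h\), so \(h\) attentions together cost about the same as one full-width head (names and sizes illustrative):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, n_heads):
    """Run attention once per head in a reduced dimension, then concatenate the head outputs."""
    l, d = X.shape
    d_head = d // n_heads                                    # each head gets d / n_heads dimensions
    Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T                # (l, d) each
    split = lambda M: M.reshape(l, n_heads, d_head).transpose(1, 0, 2)   # -> (n_heads, l, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, l, l)
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                    # per-head softmax
    out = A @ Vh                                             # (n_heads, l, d_head)
    return out.transpose(1, 0, 2).reshape(l, d)              # concatenate heads back into (l, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) / 4.0 for _ in range(3))
out = multi_head_attention(X, W_Q, W_K, W_V, n_heads=4)      # (5, 16)
```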
Transformer Drawbacks
- quadratic compute of self-attention (computing all pairs of interactions means the computation grows quadratically with sequence length); Linformer is one attempt to solve this (sketched below)
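A rough sketch of the Linformer idea: project the sequence dimension of \(K\) and \(V\) down to a fixed length \(k\), so the score matrix is \(l \times k\) instead of \(l \times l\). The projection matrices below are random stand-ins for what Linformer actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, k = 512, 64, 32                                 # sequence length, head dim, projected length (k << l)
Q, K, V = (rng.normal(size=(l, d)) for _ in range(3))
E, F = (rng.normal(size=(k, l)) / np.sqrt(l) for _ in range(2))  # learned in Linformer; random stand-ins here

K_proj, V_proj = E @ K, F @ V                         # (k, d): compress the sequence dimension
scores = Q @ K_proj.T / np.sqrt(d)                    # (l, k) instead of the quadratic (l, l)
scores = scores - scores.max(axis=-1, keepdims=True)
A = np.exp(scores)
A = A / A.sum(axis=-1, keepdims=True)
out = A @ V_proj                                      # (l, d), computed in O(l * k) rather than O(l^2)
```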