- 1990 static word embeddings
- 2003 neural language models
- 2008 multi-task learning
- 2015 attention
- 2017 transformer
- 2018 trainable contextual word embeddings + large scale pretraining
- 2019 prompt engineering
Motivating Attention
Given a sequence of embeddings: \(x_1, x_2, …, x_{n}\)
For each \(x_{i}\), the goal of attention is to produce a new embedding \(a_{i}\) based on the dot-product similarity of \(x_{i}\) with all the words that come before it.
Let’s define:
\begin{equation} score(x_{i}, x_{j}) = x_{i} \cdot x_{j} \end{equation}
This means we can write:
\begin{equation} a_{i} = \sum_{j \leq i} \alpha_{i,j} x_{j} \end{equation}
where the weights are the scores normalized with a softmax over all positions \(j \leq i\):
\begin{equation} \alpha_{i,j} = softmax \qty(score(x_{i}, x_{j}) ) \end{equation}
The resulting \(a_{i}\) is the output of our attention.
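To make the two equations concrete, here is a minimal NumPy sketch of this simplified attention; the function name `simple_causal_attention` and the random toy input are illustrative assumptions, not from the text.

```python
import numpy as np

def simple_causal_attention(X: np.ndarray) -> np.ndarray:
    """For each x_i (a row of X), return a_i = sum_{j <= i} alpha_{i,j} x_j,
    where alpha_{i,j} is the softmax over the scores x_i . x_j for j <= i."""
    n, _ = X.shape
    A = np.zeros_like(X)
    for i in range(n):
        scores = X[: i + 1] @ X[i]               # score(x_i, x_j) = x_i . x_j for j <= i
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        A[i] = weights @ X[: i + 1]              # weighted sum of the earlier x_j
    return A

# Example: 5 embeddings of dimension 4 -> 5 new embeddings of dimension 4
X = np.random.randn(5, 4)
print(simple_causal_attention(X).shape)  # (5, 4)
```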
Attention
From the above, we call the input embeddings \(x_{j}\) the values, and we will create a separate embedding called the key with which we measure similarity. The word we want the new embedding for (i.e. \(x_{i}\) above) is called the query.
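A sketch of how this query/key/value vocabulary maps onto code, under the assumption (standard in transformers, but going beyond what this paragraph states) that queries, keys, and values each get their own learned projection of the input; the matrix names `W_Q`, `W_K`, `W_V` are illustrative.

```python
import numpy as np

def qkv_causal_attention(X, W_Q, W_K, W_V):
    """Same score-softmax-average loop as above, but queries, keys, and
    values are separate (here random, in practice learned) projections of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n = X.shape[0]
    out = np.zeros_like(V)
    for i in range(n):
        scores = K[: i + 1] @ Q[i]               # compare query i with keys j <= i
        weights = np.exp(scores - scores.max())  # softmax over j <= i
        weights /= weights.sum()
        out[i] = weights @ V[: i + 1]            # weighted sum of the values
    return out

d = 4
X = np.random.randn(5, d)
W_Q, W_K, W_V = np.random.randn(3, d, d)         # toy projection matrices
print(qkv_causal_attention(X, W_Q, W_K, W_V).shape)  # (5, 4)
```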