- 1990 static word embeddings
- 2003 neural language models
- 2008 multi-task learning
- 2015 attention
- 2017 transformer
- 2018 trainable contextual word embeddings + large scale pretraining
- 2019 prompt engineering
Motivating Attention
Given a sequence of embeddings: \(x_1, x_2, …, x_{n}\)
For each \(x_{i}\), the goal of attention is to produce a new embedding \(a_{i}\) based on the dot-product similarity of \(x_{i}\) with all the words that come before it.
Let’s define:
\begin{equation} score(x_{i}, x_{j}) = x_{i} \cdot x_{j} \end{equation}
This means we can write:
\begin{equation} a_{i} = \sum_{j \leq i} \alpha_{i,j} x_{j} \end{equation}
where the weights are the scores normalized with a softmax over all positions \(j \leq i\):
\begin{equation} \alpha_{i,j} = softmax \qty(score(x_{i}, x_{j}) ) \end{equation}
The resulting \(a_{i}\) is the output of our attention.
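To make the two equations concrete, here is a minimal NumPy sketch of this simplified attention; the function name `simple_causal_attention` and the random toy input are illustrative assumptions, not from the text.

```python
import numpy as np

def simple_causal_attention(X: np.ndarray) -> np.ndarray:
    """For each x_i (a row of X), return a_i = sum_{j <= i} alpha_{i,j} x_j,
    where alpha_{i,j} is the softmax over the scores x_i . x_j for j <= i."""
    n, _ = X.shape
    A = np.zeros_like(X)
    for i in range(n):
        scores = X[: i + 1] @ X[i]               # score(x_i, x_j) = x_i . x_j for j <= i
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        A[i] = weights @ X[: i + 1]              # weighted sum of the earlier x_j
    return A

# Example: 5 embeddings of dimension 4 -> 5 new embeddings of dimension 4
X = np.random.randn(5, 4)
print(simple_causal_attention(X).shape)  # (5, 4)
```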
Attention
From the above, we call the input embeddings \(x_{j}\) the values, and we will create a separate embedding called the key with which we measure similarity. The word we want the new embedding for (i.e. \(x_{i}\) above) is called the query.
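A sketch of how this query/key/value vocabulary maps onto code, under the assumption (standard in transformers, but going beyond what this paragraph states) that queries, keys, and values each get their own learned projection of the input; the matrix names `W_Q`, `W_K`, `W_V` are illustrative.

```python
import numpy as np

def qkv_causal_attention(X, W_Q, W_K, W_V):
    """Same score-softmax-average loop as above, but queries, keys, and
    values are separate (here random, in practice learned) projections of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n = X.shape[0]
    out = np.zeros_like(V)
    for i in range(n):
        scores = K[: i + 1] @ Q[i]               # compare query i with keys j <= i
        weights = np.exp(scores - scores.max())  # softmax over j <= i
        weights /= weights.sum()
        out[i] = weights @ V[: i + 1]            # weighted sum of the values
    return out

d = 4
X = np.random.randn(5, d)
W_Q, W_K, W_V = np.random.randn(3, d, d)         # toy projection matrices
print(qkv_causal_attention(X, W_Q, W_K, W_V).shape)  # (5, 4)
```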