we will train a classifier on a binary prediction task: “are the context words \(c_{1:L}\) likely to show up near some target word \(w_{0}\)?”
We estimate the probability that \(w_{0}\) occurs within this window as a product, over the context words, of probabilities derived from the embedding similarity between each context word and the target word.
- we have a corpus of text
- each word is represented by a vector
- go through each position \(t\) in the text, which has a center word \(c\) and set of context words \(o \in O\)
- use similarity of word vectors \(c\) and \(o\) to calculate \(P(o|c)\)
In other words, we want to devise a model that assigns high probabilities \(P(w_{t+j} | w_{t})\) to words \(w_{t+j}\) that actually appear near \(w_{t}\) (small \(|j|\)), and low probabilities otherwise.
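A minimal sketch of this setup in code; the toy corpus and the window size below are illustrative assumptions, not from any particular implementation:

```python
# Enumerate (center, context) pairs from a toy corpus with window size m.
corpus = "the quick brown fox jumps over the lazy dog".split()  # assumed toy corpus
m = 2                                                           # assumed window size

pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        pairs.append((center, corpus[t + j]))   # (center word, context word)

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```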
Word2Vec is a Bag of Words model!
This is a Bag of Words model because training does not learn any information about the ordering or structure of the words within the window.
Likelihood
Writing the above out over the whole corpus (corpus length \(T\), window size \(m\)):
\begin{equation} L(\theta) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m, j\neq 0}^{} p_{\theta}(w_{t+j} | w_{t}) \end{equation}
Calculating \(p_{\theta}\)
We are going to use TWO VECTORS for each word:
- \(v_{w}\) when \(w\) is the center word
- and \(u_{w}\) when \(w\) is a context word
These vectors are the only parameters of our model. We use two vectors per word only to make the math easier; the final “word vector” for a word is obtained by averaging \(u_{w}\) and \(v_{w}\).
Therefore:
\begin{equation} p(o|c) = \frac{\exp\qty(u_{o} \cdot v_{c})}{ \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{equation}
- exponentiation makes anything positive
- normalize over the entire vocabulary
this is a softmax operation.
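As a sketch in code, reusing the toy corpus from the first sketch and assuming hypothetical, randomly initialized embedding matrices `U` (rows are the \(u_{w}\)) and `V` (rows are the \(v_{w}\)); the embedding dimension is an arbitrary choice:

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()   # same toy corpus as above
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

d = 8                                                # assumed embedding dimension
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(len(vocab), d))      # v_w: center-word ("inside") vectors
U = rng.normal(scale=0.1, size=(len(vocab), d))      # u_w: context-word ("outside") vectors

def p_o_given_c(o, c):
    """Softmax probability p(o|c) for context-word index o and center-word index c."""
    scores = U @ V[c]             # u_w · v_c for every word w in the vocabulary
    scores -= scores.max()        # stabilize the exponentials (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()
```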
Objective Function
But to get an objective we can minimize, we:
- negate the likelihood (so that gradient descent minimizes it)
- take the log of each factor to prevent underflow (turning the product into a sum)
- average over the \(T\) positions in the corpus
\begin{equation} J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j\neq 0}^{} \log p_{\theta}\qty(w_{t+j} | w_{t}) \end{equation}
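Putting the previous pieces together, a sketch of \(J(\theta)\) over the toy corpus, reusing the hypothetical `pairs`, `word2id`, and `p_o_given_c` from the sketches above (averaged over pairs rather than corpus positions, which only changes a constant factor):

```python
import numpy as np

def objective(pairs):
    """Average negative log-likelihood over all (center, context) pairs."""
    total = 0.0
    for center, context in pairs:
        total -= np.log(p_o_given_c(word2id[context], word2id[center]))
    return total / len(pairs)

print(objective(pairs))   # ≈ log|V| ≈ 2.08 at random initialization (8 distinct words)
```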
Recall that:
\begin{equation} p(o|c) = \frac{\exp\qty(u_{o} \cdot v_{c})}{ \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{equation}
Because we want to minimize this objective, we need its derivative with respect to the parameters:
\begin{align} \pdv{J}{\theta} &= -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j\neq 0}^{} \pdv{}{\theta} \log p_{\theta}\qty(w_{t+j} | w_{t}) \end{align}
meaning we just need the inner part. Writing \(c = w_{t}\) for the center word and \(o = w_{t+j}\) for the context word, we take the derivative with respect to \(v_{c}\):
\begin{equation} \pdv{\log p(o|c)}{v_{c}} = \pdv{}{v_{c}} \log \exp\qty(u_{o} \cdot v_{c}) - \pdv{}{v_{c}} \log \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c}) \end{equation}
Look! The first part is a log of an exp, which cancel each other out, so its derivative is just \(u_{o}\).
For the right part, by the chain rule:
\begin{align} \pdv{}{v_{c}} \log \sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c}) &= \frac{\sum_{x \in V}^{} \pdv{}{v_{c}}\exp \qty(u_{x} \cdot v_{c}) }{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \\ &= \frac{\sum_{x \in V}^{}\exp \qty(u_{x} \cdot v_{c}) u_{x}}{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{align}
Combining this whole thing, we have:
\begin{equation} \pdv{\log p(o|c)}{v_{c}} = u_{o} - \frac{\sum_{x \in V}^{}\exp \qty(u_{x} \cdot v_{c}) u_{x}}{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} \end{equation}
Rewriting this slightly:
\begin{equation} \pdv{\log p(o|c)}{v_{c}} = u_{o} - \sum_{x \in V}^{}\frac{\exp \qty(u_{x} \cdot v_{c}) }{\sum_{w \in V}^{} \exp \qty(u_{w} \cdot v_{c})} u_{x} \end{equation}
Meaning:
\begin{equation} \pdv{\log p(o|c)}{v_{c}} = u_{o} - \sum_{x \in V}^{} P(x|c)\, u_{x} \end{equation}
The second term is just the softmax probability \(P(x|c)\) of each word \(x\) times \(u_{x}\), meaning it is \(\mathbb{E}_{x \sim P(x|c)}[u_{x}]\); so this loss just minimizes the “error between the observed context vector \(u_{o}\) and the model’s expectation”.
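A quick way to sanity-check this result is to compare it against a finite-difference estimate; the matrices below are hypothetical random embeddings, not trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(5, 8))   # assumed context vectors u_w (5 words, dim 8)
V = rng.normal(scale=0.1, size=(5, 8))   # assumed center vectors v_w

def log_p(o, c, U, V):
    """log p(o|c) = u_o · v_c - logsumexp_w(u_w · v_c)."""
    scores = U @ V[c]
    return scores[o] - (np.log(np.exp(scores - scores.max()).sum()) + scores.max())

def grad_analytic(o, c, U, V):
    """d log p(o|c) / d v_c = u_o - sum_x P(x|c) u_x."""
    scores = U @ V[c]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return U[o] - probs @ U

def grad_numeric(o, c, U, V, eps=1e-6):
    """Central finite differences on log p(o|c) with respect to v_c."""
    g = np.zeros_like(V[c])
    for i in range(V.shape[1]):
        Vp, Vm = V.copy(), V.copy()
        Vp[c, i] += eps
        Vm[c, i] -= eps
        g[i] = (log_p(o, c, U, Vp) - log_p(o, c, U, Vm)) / (2 * eps)
    return g

print(np.allclose(grad_analytic(1, 0, U, V), grad_numeric(1, 0, U, V), atol=1e-5))  # True
```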
Word2Vec Variants
Model
- skip-gram: predict the probability of the surrounding (“outside”) words given the center word, \(P(o|c)\)
- CBOW: predict the probability of the center word given the surrounding words
Objective
- naive softmax (above)
- hierarchical softmax
- negative sampling (see also skip-gram with negative sampling)
properties
window size
- smaller windows: capture more syntax-level information
- larger windows: capture more semantic-field information
parallelogram model
a simple way to solve analogy problems with vector semantics: take the difference between two word vectors, and add it to a third vector to get an analogous transformation (see the sketch after this list).
- only works for frequent words
- and only over small distances in embedding space
- but it does not quite hold for larger systems of relations
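A minimal sketch of the parallelogram rule, assuming some pretrained `word_vectors` matrix and `word2id` mapping; the function name and the king/queen example are illustrative (the toy vectors above are far too small to show this behaviour):

```python
import numpy as np

def analogy(a, b, c, word2id, word_vectors):
    """Solve 'a is to b as c is to ?' via the parallelogram rule: x ≈ b - a + c."""
    id2word = {i: w for w, i in word2id.items()}
    target = word_vectors[word2id[b]] - word_vectors[word2id[a]] + word_vectors[word2id[c]]
    # cosine similarity of the target point against every word vector
    sims = word_vectors @ target
    sims /= np.linalg.norm(word_vectors, axis=1) * np.linalg.norm(target)
    for w in (a, b, c):                 # standard trick: never return one of the inputs
        sims[word2id[w]] = -np.inf
    return id2word[int(np.argmax(sims))]

# analogy("man", "king", "woman", word2id, word_vectors) ≈ "queen" on good pretrained vectors
```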
allocational harm
embeddings bake in existing biases, which leads to bias in hiring practices, etc.
skip-gram with negative sampling
skip-gram trains separate vectors for a word when it is used as the target and when it is used as context.
the mechanism for training the embedding:
- select some \(k\), which is the count of negative examples (if \(k=2\), every one positive example will be matched with 2 negative examples)
- sample a target word, and generate positive samples paired by words in its immediate window
- sample window size times \(k\) negative examples, where the noise words are chosen explicitly to not be near our target word, and are weighted by unigram frequency (see the sketch below)
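A sketch of this pair-generation step; the function name is hypothetical, and the uniform noise distribution here is a simplification (a fuller version would sample from the \(U(w)^{3/4}\) distribution discussed below and avoid drawing a word's true neighbours as noise):

```python
import numpy as np

def make_training_data(corpus, word2id, m=2, k=2, seed=0):
    """Return (center, context, label) index triples: label 1 = real pair, 0 = noise pair."""
    rng = np.random.default_rng(seed)
    data = []
    for t, center in enumerate(corpus):
        c = word2id[center]
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(corpus)):
                continue
            data.append((c, word2id[corpus[t + j]], 1))        # positive example
            for _ in range(k):                                 # k noise words per positive
                data.append((c, int(rng.integers(len(word2id))), 0))
    return data
```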
for each paired training sample, we minimize the loss via binary cross entropy loss:
\begin{equation} L_{CE} = -\qty[ \log (\sigma(c_{pos} \cdot w)) + \sum_{i=1}^{k} \log \sigma\qty(-c_{neg_{i}} \cdot w)] \end{equation}
recall that:
\begin{equation} \pdv{L_{CE}}{w} = \qty[\sigma(c_{pos} \cdot w) -1]c_{pos} + \sum_{i=1}^{k} \qty[\sigma(c_{neg_{i}}\cdot w)]c_{neg_{i}} \end{equation}
Importantly, because the sigmoid satisfies \(\sigma(-x) = 1 - \sigma(x)\), our objective can equivalently be written as:
\begin{equation} L_{CE} = -\qty[ \log (\sigma(c_{pos} \cdot w)) + \sum_{i=1}^{k} \log \qty(1 - \sigma\qty(c_{neg_{i}} \cdot w))] \end{equation}
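A sketch of one SGD step on this loss for a single positive pair and its \(k\) noise words, updating only the target vector \(w\) (a full implementation would also update \(c_{pos}\) and the \(c_{neg_{i}}\); the function name and learning rate are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, c_negs, lr=0.025):
    """One gradient-descent step on the negative-sampling loss; w is updated in place."""
    loss = -(np.log(sigmoid(c_pos @ w)) + sum(np.log(sigmoid(-c @ w)) for c in c_negs))
    # Gradient from above: [sigma(c_pos·w) - 1] c_pos + sum_i sigma(c_neg_i·w) c_neg_i
    grad_w = (sigmoid(c_pos @ w) - 1.0) * c_pos + sum(sigmoid(c @ w) * c for c in c_negs)
    w -= lr * grad_w
    return loss
```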
how to sample the \(k\) negative examples
We actually sample the noise words from:
\begin{equation} P(w) = \frac{U(w)^{3/4}}{Z} \end{equation}
where \(U(w)\) is the unigram frequency and \(Z\) is a normalizing constant,
to give the less common words slightly higher probability.
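As a sketch of this weighting (the counts in the example are made up just to show the effect):

```python
import numpy as np

def negative_sampling_dist(unigram_counts, alpha=0.75):
    """P(w) ∝ U(w)^alpha: flattening the unigram distribution upweights rarer words."""
    counts = np.asarray(unigram_counts, dtype=float)
    weights = counts ** alpha
    return weights / weights.sum()

print(negative_sampling_dist([900, 90, 10]))
# unigram frequencies are [0.90, 0.09, 0.01]; with alpha = 3/4 this becomes ≈ [0.83, 0.15, 0.03]
```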