stochastic gradient descent
See stochastic gradient descent
Word2Vec
See word2vec
Or, we can use an even simpler approach: window-based co-occurrence counts.
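A tiny sketch of how such counts are collected (toy corpus and window size are my own illustrative choices, not from these notes):

```python
from collections import Counter

# Window-based co-occurrence counting with a window of +-2 over a toy corpus.
corpus = [["the", "cat", "sat", "on", "the", "mat"]]
window = 2
counts = Counter()
for sent in corpus:
    for i, center in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(center, sent[j])] += 1
print(counts[("the", "sat")])   # 2: "sat" is within +-2 of both "the" tokens
```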
GloVe
- goal: capture linear meaning components correctly in a word vector space
- insight: ratios of co-occurrence probabilities encode linear meaning components
Therefore, GloVe vectors come from a log-bilinear model:
\begin{equation} w_{i} \cdot w_{j} = \log P(i|j) \end{equation}
such that:
\begin{equation} w_{x} \cdot (w_{a} - w_{b}) = \log \frac{P(x|a)}{P(x|b)} \end{equation}
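As a minimal sketch of this relation (not the actual GloVe objective, which also uses separate context vectors, bias terms, and a weighting function), we can fit vectors to \(w_i \cdot w_j \approx \log P(i|j)\) by plain gradient descent on a toy co-occurrence matrix; all sizes, the learning rate, and the step count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr, steps = 6, 8, 0.002, 20000

X = rng.integers(1, 50, size=(V, V)).astype(float)   # toy co-occurrence counts
X = (X + X.T) / 2.0                                   # make counts symmetric
log_P = np.log(X / X.sum(axis=0, keepdims=True))      # log P(i|j): column j normalized

W = 0.1 * rng.standard_normal((V, dim))               # one vector per word
mask = 1.0 - np.eye(V)                                # skip diagonal: w_i . w_i >= 0 can't match log P(i|i) < 0
for _ in range(steps):
    E = (W @ W.T - log_P) * mask                      # error in reproducing log P(i|j)
    W -= lr * (E + E.T) @ W                           # gradient of the squared error

# Linear meaning components: w_x . (w_a - w_b) ~= log P(x|a) - log P(x|b)
x, a, b = 0, 1, 2
print(W[x] @ (W[a] - W[b]), log_P[x, a] - log_P[x, b])
```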
Evaluating an NLP System
Intrinsic
- evaluate on the specific subtask the system is trained on
- fast to compute
- helps us understand the system
Extrinsic
- evaluate on a real task, attempting to replace an older system with the new one
- may be expensive to compute
Word Sense Ambiguity
Each word may have multiple different meanings; each of those separate word senses should live in a different place in vector space. However, polysemous words have related senses, so we usually store one averaged vector:
\begin{equation} v = a_1 v_1 + a_2v_2 + a_3v_3 \end{equation}
where \(a_{j}\) is the relative frequency of the \(j\)th sense of the word, and \(v_{1} \dots v_{3}\) are the vectors for the separate word senses.
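A tiny worked example of this averaging, with made-up 2-d sense vectors and sense counts:

```python
import numpy as np

# Hypothetical sense vectors for one polysemous word (e.g. "pike": weapon / fish / road).
v1, v2, v3 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])
counts = np.array([30.0, 60.0, 10.0])      # how often each sense occurs (made up)
a = counts / counts.sum()                  # a_j = relative frequency of sense j
v = a[0] * v1 + a[1] * v2 + a[2] * v3      # the single vector stored for the word
print(v)
```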
sparse coding
If each sense is relatively common, then at high enough dimensionality, sparse coding lets you recover the component sense vectors from the averaged word vector, because of the general sparsity of the vector space.
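A small sketch of that recovery, using scikit-learn's SparseCoder with orthogonal matching pursuit on a random dictionary of hypothetical sense vectors (all sizes, indices, and weights here are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
n_senses, dim = 2000, 300                          # illustrative sizes

# Dictionary of candidate sense vectors (one per row), normalized for OMP.
D = rng.standard_normal((n_senses, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# An averaged word vector built from 3 hidden senses with frequency weights.
true_senses = [10, 500, 1300]
weights = np.array([0.5, 0.3, 0.2])
v = weights @ D[true_senses]

# Sparse coding: find the few dictionary atoms whose combination reproduces v.
coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                    transform_n_nonzero_coefs=3)
code = coder.transform(v.reshape(1, -1))[0]
print(np.nonzero(code)[0])                         # recovers indices 10, 500, 1300
```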
Word-Vector NER
Create a window of \(\pm n\) words around each word to classify; concatenate the window's embeddings, feed them into a neural classifier, and train against a target that says whether the center word is an entity/person/no label, etc.
The simplest such classifiers use a softmax output, which, without any other activations, gives a linear decision boundary.
With a neural classifier, we can add enough nonlinearity in the middle layers to re-represent the data so that the classes become linearly separable; the final output layer is still a linear (softmax) classifier, but now over that learned representation.
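A minimal PyTorch sketch of such a window classifier (vocabulary size, window size, hidden size, and number of labels are all illustrative, not from these notes); note the ReLU in the middle and the linear softmax output layer:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, window, hidden_dim, n_classes = 10_000, 50, 2, 100, 3

class WindowClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        in_dim = (2 * window + 1) * emb_dim              # concatenated window embeddings
        self.hidden = nn.Linear(in_dim, hidden_dim)      # the nonlinearity lives here
        self.out = nn.Linear(hidden_dim, n_classes)      # final layer is still linear

    def forward(self, window_ids):                       # window_ids: (batch, 2*window + 1)
        x = self.emb(window_ids).flatten(start_dim=1)    # concatenate the window's embeddings
        return self.out(torch.relu(self.hidden(x)))      # logits; softmax applied in the loss

model = WindowClassifier()
ids = torch.randint(0, vocab_size, (4, 2 * window + 1))  # a batch of 4 fake windows
labels = torch.randint(0, n_classes, (4,))                # label for each center word
loss = nn.CrossEntropyLoss()(model(ids), labels)          # cross-entropy = softmax + NLL
loss.backward()
```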