stochastic gradient descent
See stochastic gradient descent
Word2Vec
See word2vec
Or, we can use an even simpler approach: window-based co-occurrence counts.
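A tiny sketch of how such counts are collected (toy corpus and window size are my own illustrative choices, not from these notes):

```python
from collections import Counter

# Window-based co-occurrence counting with a window of +-2 over a toy corpus.
corpus = [["the", "cat", "sat", "on", "the", "mat"]]
window = 2
counts = Counter()
for sent in corpus:
    for i, center in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(center, sent[j])] += 1
print(counts[("the", "sat")])   # 2: "sat" is within +-2 of both "the" tokens
```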
GloVe
- goal: capture linear meaning components correctly in a word vector space
- insight: ratios of co-occurrence probabilities encode linear meaning components
Therefore, GloVe vectors come from a log-bilinear model:
\begin{equation} w_{i} \cdot w_{j} = \log P(i|j) \end{equation}
such that:
\begin{equation} w_{x} \cdot (w_{a} - w_{b}) = \log \frac{P(x|a)}{P(x|b)} \end{equation}
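As a minimal sketch of this relation (not the actual GloVe objective, which also uses separate context vectors, bias terms, and a weighting function), we can fit vectors to \(w_i \cdot w_j \approx \log P(i|j)\) by plain gradient descent on a toy co-occurrence matrix; all sizes, the learning rate, and the step count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr, steps = 6, 8, 0.002, 20000

X = rng.integers(1, 50, size=(V, V)).astype(float)   # toy co-occurrence counts
X = (X + X.T) / 2.0                                   # make counts symmetric
log_P = np.log(X / X.sum(axis=0, keepdims=True))      # log P(i|j): column j normalized

W = 0.1 * rng.standard_normal((V, dim))               # one vector per word
mask = 1.0 - np.eye(V)                                # skip diagonal: w_i . w_i >= 0 can't match log P(i|i) < 0
for _ in range(steps):
    E = (W @ W.T - log_P) * mask                      # error in reproducing log P(i|j)
    W -= lr * (E + E.T) @ W                           # gradient of the squared error

# Linear meaning components: w_x . (w_a - w_b) ~= log P(x|a) - log P(x|b)
x, a, b = 0, 1, 2
print(W[x] @ (W[a] - W[b]), log_P[x, a] - log_P[x, b])
```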
Evaluating an NLP System
Intrinsic
- evaluate on the specific subtask the system is trained on
- fast to compute
- helps us understand the system
Extrinsic
- evaluate on a real task, attempting to replace an older system with the new one
- may be expensive to compute
Word Sense Ambiguity
Each word may have multiple different meanings; each of those separate word senses should live in a different place in vector space. However, polysemous words have related senses, so we usually store one averaged vector:
\begin{equation} v = a_1 v_1 + a_2v_2 + a_3v_3 \end{equation}
where \(a_{j}\) is the relative frequency of the \(j\)th sense of the word, and \(v_{1} \dots v_{3}\) are the vectors for the separate word senses.
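A tiny worked example of this averaging, with made-up 2-d sense vectors and sense counts:

```python
import numpy as np

# Hypothetical sense vectors for one polysemous word (e.g. "pike": weapon / fish / road).
v1, v2, v3 = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])
counts = np.array([30.0, 60.0, 10.0])      # how often each sense occurs (made up)
a = counts / counts.sum()                  # a_j = relative frequency of sense j
v = a[0] * v1 + a[1] * v2 + a[2] * v3      # the single vector stored for the word
print(v)
```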
sparse coding
If each sense is relatively common, then at high enough dimensionality, sparse coding lets you recover the component sense vectors from the averaged word vector, because of the general sparsity of the vector space.
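A small sketch of that recovery, using scikit-learn's SparseCoder with orthogonal matching pursuit on a random dictionary of hypothetical sense vectors (all sizes, indices, and weights here are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
n_senses, dim = 2000, 300                          # illustrative sizes

# Dictionary of candidate sense vectors (one per row), normalized for OMP.
D = rng.standard_normal((n_senses, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# An averaged word vector built from 3 hidden senses with frequency weights.
true_senses = [10, 500, 1300]
weights = np.array([0.5, 0.3, 0.2])
v = weights @ D[true_senses]

# Sparse coding: find the few dictionary atoms whose combination reproduces v.
coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                    transform_n_nonzero_coefs=3)
code = coder.transform(v.reshape(1, -1))[0]
print(np.nonzero(code)[0])                         # recovers indices 10, 500, 1300
```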
Word-Vector NER
Create a window of \(\pm n\) words around each word to classify; concatenate the window's embeddings, feed them into a neural classifier, and train against a target that says whether the center word is an entity/person/no label, etc.
The simplest such classifiers use a softmax output, which, without any other activations, gives a linear decision boundary.
With a neural classifier, we can add enough nonlinearity in the middle layers to re-represent the data so that the classes become linearly separable; the final output layer is still a linear (softmax) classifier, but now over that learned representation.
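A minimal PyTorch sketch of such a window classifier (vocabulary size, window size, hidden size, and number of labels are all illustrative, not from these notes); note the ReLU in the middle and the linear softmax output layer:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, window, hidden_dim, n_classes = 10_000, 50, 2, 100, 3

class WindowClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        in_dim = (2 * window + 1) * emb_dim              # concatenated window embeddings
        self.hidden = nn.Linear(in_dim, hidden_dim)      # the nonlinearity lives here
        self.out = nn.Linear(hidden_dim, n_classes)      # final layer is still linear

    def forward(self, window_ids):                       # window_ids: (batch, 2*window + 1)
        x = self.emb(window_ids).flatten(start_dim=1)    # concatenate the window's embeddings
        return self.out(torch.relu(self.hidden(x)))      # logits; softmax applied in the loss

model = WindowClassifier()
ids = torch.randint(0, vocab_size, (4, 2 * window + 1))  # a batch of 4 fake windows
labels = torch.randint(0, n_classes, (4,))                # label for each center word
loss = nn.CrossEntropyLoss()(model(ids), labels)          # cross-entropy = softmax + NLL
loss.backward()
```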