Main goals: assign a probability to a whole sequence of words:
\begin{equation} P(W) = P(w_1, \dots, w_{n}) \end{equation}
closely related is the NLG formulation of predicting an upcoming word:
\begin{equation} P(w_{n}|w_1, \dots, w_{n-1}) \end{equation}
a model that computes either of these we call a “grammar”, or a “Language Model”.
Chain Rule Language Modeling
Recall the probability chain rule. Now, the probability of a sequence like:
\begin{equation} P(its\ water\ is\ so\ transparent) \end{equation}
decomposes as:
\begin{equation} P(its) \times P(water|its) \times P(is | its\ water) \dots \end{equation}
That is:
\begin{equation} P(w_1 \dots w_{n}) = \prod_{i}P(w_{i} | w_1 \dots w_{i-1}) \end{equation}
Markov Assumption
Because we can’t estimate conditional counts over arbitrarily long histories, we make an assumption: the probability of the current word depends only on the last \(k\) words.
\begin{equation} P(w_1, \dots, w_{n}) \approx \prod_{i}P(w_{i} | w_{i-k} \dots w_{i-1}) \end{equation}
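As a minimal sketch of the \(k=1\) (bigram) case, here is how the approximation turns the chain-rule product into a product of bigram probabilities. The bigram table and its values are made up purely for illustration:

```python
# Minimal sketch of the k=1 (bigram) Markov approximation.
# The bigram probabilities below are toy values, not estimated from data.
bigram_prob = {
    ("<s>", "its"): 0.2,
    ("its", "water"): 0.1,
    ("water", "is"): 0.3,
    ("is", "so"): 0.05,
    ("so", "transparent"): 0.02,
}

def sentence_prob(words, bigram_prob):
    """P(w_1 .. w_n) ~= prod_i P(w_i | w_{i-1}), with <s> as a start symbol."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        prob *= bigram_prob.get((prev, w), 0.0)  # unseen bigram -> 0 (no smoothing here)
        prev = w
    return prob

print(sentence_prob("its water is so transparent".split(), bigram_prob))  # ~6e-06
```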
Unigrams
The simplest case is the unigram model, which conditions on no context at all; generating from it produces word salad because it captures no language structure.
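A quick sketch of why: sampling each word independently from a unigram distribution (the counts below are made up) ignores all context, so the output reads as word salad:

```python
import random

# Toy unigram distribution (made-up relative frequencies).
unigram_counts = {"the": 50, "of": 30, "water": 10, "is": 20, "so": 15, "transparent": 2}
words, weights = zip(*unigram_counts.items())

# Each word is drawn independently of its neighbours -> "word salad".
random.seed(0)
print(" ".join(random.choices(words, weights=weights, k=10)))
```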
Naive Bayes Language Modeling
You can consider each class-conditional distribution \(P(word | c)\) in Naive Bayes as a language model.
So:
\begin{equation} P(sentence|c) = \prod_{i}P(word_{i}|c) \end{equation}
Each class is a separate class-conditioned language model. So we compute the probability of the sentence under each class’s model and classify it with the class that gives the higher probability.
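A minimal sketch of this view, assuming toy class-conditional unigram probabilities and uniform priors (all values are made up):

```python
import math

# Each class c defines a unigram language model P(word | c); values are toy numbers.
class_lm = {
    "positive": {"great": 0.10, "movie": 0.05, "boring": 0.01},
    "negative": {"great": 0.01, "movie": 0.05, "boring": 0.10},
}
prior = {"positive": 0.5, "negative": 0.5}

def classify(sentence, class_lm, prior, unk=1e-6):
    """Pick argmax_c P(c) * prod_i P(word_i | c), computed in log space."""
    scores = {}
    for c, lm in class_lm.items():
        scores[c] = math.log(prior[c]) + sum(math.log(lm.get(w, unk)) for w in sentence.split())
    return max(scores, key=scores.get)

print(classify("great movie", class_lm, prior))   # positive
print(classify("boring movie", class_lm, prior))  # negative
```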
Limitations
In general, n-gram models are limited because they don’t capture the long-distance dependencies that are present in English.
Estimating N-Grams
Many counts are the result of…
- world knowledge (“people want Chinese food more often, so want+Chinese appears more”)
- grammar (“want+want is less likely”)
MLE
\begin{equation} P(w_{i} | w_{i-1}) = \frac{C(w_{i-1}, w_{i})}{C(w_{i-1})} \end{equation}
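A minimal sketch of the MLE estimate, counting bigrams over a tiny toy corpus (the corpus is invented for illustration):

```python
from collections import Counter

# MLE bigram estimates from a toy two-sentence corpus.
corpus = [
    "<s> i want chinese food </s>".split(),
    "<s> i want to eat </s>".split(),
]
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_mle(w, prev):
    """P(w | prev) = C(prev, w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("want", "i"))        # 2/2 = 1.0
print(p_mle("chinese", "want"))  # 1/2 = 0.5
```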
MAP, i.e. Laplace Smoothing
\begin{equation} P(w_{i} | w_{i-1}) = \frac{C(w_{i-1}, w_{i})+1}{C(w_{i-1})+V} \end{equation}
We have to add \(V\) to the denominator because any of the \(V\) vocabulary words could possibly follow \(w_{i-1}\). Note that as the n-gram order increases we still add only \(V\): at each step we are predicting a single word (just conditioned on more context), so smoothing adds only \(V\) extra counts.
IMPORTANT NOTE THOUGH: this is typically not used for n-grams (because there are simply so many unseen, zero-count sequences). Instead, it’s more frequently used in other cases such as Naive Bayes for text classification.
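For completeness, a sketch of the add-1 version over the same toy corpus as the MLE sketch above (keeping in mind the note that this is rarely the smoothing of choice for n-grams):

```python
from collections import Counter

# Add-1 (Laplace) smoothed bigram estimates from the same toy corpus.
corpus = [
    "<s> i want chinese food </s>".split(),
    "<s> i want to eat </s>".split(),
]
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
V = len(unigram_counts)  # vocabulary size

def p_laplace(w, prev):
    """P(w | prev) = (C(prev, w) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("chinese", "want"))  # seen bigram: pushed toward uniform
print(p_laplace("food", "want"))     # unseen bigram: still gets non-zero probability
```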
Log Probs
In practice, we keep probabilities as log probabilities after computing them: summing log probabilities avoids the numerical underflow that comes from multiplying many small numbers.
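A short sketch of why: the product of many small probabilities underflows to zero in floating point, while the sum of their logs does not:

```python
import math

probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p
print(product)                          # 0.0: floating-point underflow

print(sum(math.log(p) for p in probs))  # about -921.0, the log of the true product
```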
N-Gram Models
Large-scale examples and tooling: Google n-gram models, the SRILM toolkit.
Backoff
Use the trigram if there is sufficient evidence for it (non-zero counts); otherwise back off to the bigram, and then to the unigram.
Stupid Backoff
- give the MLE estimate if the full n-gram has a non-zero count
- otherwise, back off: recursively compute the score of the current word given the (n-1)-gram context, multiplied by a discount factor (0.4 is the commonly used value)
- if we end up with a unigram, just give the unigram probability
This DOES NOT PRODUCE A PROBABILITY, as it is not normalized. Instead of probabilities, we consider the results “scores” (a sketch follows below).
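A minimal sketch of Stupid Backoff over counts from a toy corpus; the discount \(\alpha = 0.4\) is the commonly cited value, and the returned values are scores rather than probabilities:

```python
from collections import Counter

# Stupid Backoff over trigram/bigram/unigram counts from a toy corpus.
corpus = ["<s> i want chinese food </s>".split(), "<s> i want to eat </s>".split()]
ngram_counts = Counter()
for sent in corpus:
    for n in (1, 2, 3):
        for i in range(len(sent) - n + 1):
            ngram_counts[tuple(sent[i:i + n])] += 1
total_words = sum(len(sent) for sent in corpus)

def stupid_backoff(word, context, alpha=0.4):
    """S(w | context): MLE if the full n-gram was seen, else alpha * S(w | shorter context)."""
    if not context:
        return ngram_counts[(word,)] / total_words           # unigram base case
    full = tuple(context) + (word,)
    if ngram_counts[full] > 0:
        return ngram_counts[full] / ngram_counts[tuple(context)]
    return alpha * stupid_backoff(word, context[1:], alpha)  # back off to a shorter context

print(stupid_backoff("chinese", ("i", "want")))  # trigram seen: plain MLE
print(stupid_backoff("eat", ("i", "want")))      # trigram unseen: discounted backoff
```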
Unigram Interpolation
In practice, interpolation works better. Interpolation smooths the probability by mixing the unigram, bigram, and trigram estimates.
Most simply, we mix them with some factors \(\lambda_{1}, \lambda_{2}, \lambda_{3}\), where \(\sum_{i} \lambda_{i} = 1\). This makes a weighted average over probabilities:
\begin{equation} P(comb) = \lambda_{1} P(uni) + \lambda_{2} P(bi)+ \lambda_{3} P(tri) \end{equation}
The lambdas could also be a function of the previous tokens.
We typically set the lambdas using a held-out dataset disjoint from the original training set: we estimate the n-gram probabilities on the training data, then choose the \(\lambda\)s that maximise the probability of the held-out data.
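A minimal sketch of the mixture, with made-up component probabilities and lambda weights (in practice the lambdas come from held-out tuning as described above):

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_comb(w | context) = l1*P(w) + l2*P(w | w_{i-1}) + l3*P(w | w_{i-2}, w_{i-1})."""
    l1, l2, l3 = lambdas  # must sum to 1 so the mixture stays a probability
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy component estimates for the same word in the same context.
print(interpolate(p_uni=0.002, p_bi=0.04, p_tri=0.20))  # 0.1322
```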
OOV Words
We sometimes replace the few lowest-frequency words in the training data with <UNK> and train the model with it, so that we have an open vocabulary: whenever we encounter an unknown word at test time, we replace it with <UNK>.
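A sketch of closing the vocabulary this way, keeping only the top-\(k\) most frequent training words (the corpus and \(k\) are arbitrary toy choices):

```python
from collections import Counter

# Keep the k most frequent training words; map everything else to <UNK>.
train_tokens = "the cat sat on the mat the cat slept".split()
k = 4
vocab = {w for w, _ in Counter(train_tokens).most_common(k)}

def replace_unk(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

print(replace_unk(train_tokens, vocab))
print(replace_unk("the dog sat".split(), vocab))  # "dog" was never seen -> <UNK>
```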
Scaling Up
Strategies to make language modeling with n-grams more efficient:
- pruning: only store the top-\(k\) n-grams, or those above a count threshold (see the sketch below)
- use tries (suffix trees, etc.)
- approximations: Bloom filters
- store word indices rather than strings
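As a small sketch of the pruning idea, keeping only the \(k\) most frequent bigrams from a toy corpus (\(k\) and the corpus are arbitrary):

```python
from collections import Counter

# Count-based pruning: keep only the top-k most frequent bigrams.
tokens = "i want chinese food i want to eat i want chinese food".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

k = 3
pruned = dict(bigram_counts.most_common(k))
print(pruned)  # the 3 most frequent bigrams and their counts
```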