Bag of Words is an order-free representation of a corpus. Specifically, each word is assigned a count, with no other information (such as ordering) retained.
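As a minimal sketch (the sentence below is made up for illustration), such counts can be built with Python's `collections.Counter`:

```python
# Bag of words: per-word counts with no ordering information.
from collections import Counter

doc = "the cat sat on the mat and the cat slept"
bow = Counter(doc.split())
print(bow["the"], bow["cat"])   # 3 2 -- counts only, word order is discarded
```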
With Bayes’ Theorem, we have:
\begin{equation} C_{MAP} = \arg\max_{c \in C} P(d|c)P(c) \end{equation}
where \(d\) is the document, and \(c\) is the class.
So, representing a document as a set of words (features) \(x_1, \dots, x_{n}\):
\begin{equation} C_{MAP} = \arg\max_{c\in C} P(x_1, \dots, x_{n}|c)P( c) \end{equation}
Naive Bayes for Text Classification
Assumptions of Bag of Words for Naive Bayes
\(P(c)\)
The prior is just the relative frequency of the class: the number of training documents in class \(c\) divided by the total number of documents.
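A sketch of this estimate, assuming a hypothetical list `train_labels` holding one class label per training document:

```python
# Sketch: estimate P(c) as the fraction of training documents labeled c.
from collections import Counter

train_labels = ["spam", "ham", "ham", "spam", "ham"]   # made-up labels
label_counts = Counter(train_labels)
priors = {c: n / len(train_labels) for c, n in label_counts.items()}
print(priors)   # {'spam': 0.4, 'ham': 0.6}
```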
\(P(x_1, \dots, x_{n}|c)\)
- Bag of Words assumption: word position doesn’t matter, so a document is just a multiset of words
- Naive Bayes (conditional independence) assumption: the feature probabilities \(P(x_{i}|c)\) are independent given the class, so the joint likelihood factors into a product
So we have:
\begin{equation} C_{NB} = \arg\max_{c\in C} P(c) \prod_{x\in X} P(x|c) \end{equation}
We eventually use logs to prevent underflow:
\begin{equation} C_{NB} = \arg\max_{c\in C} \log P(c) + \sum_{x\in X} \log P(x|c) \end{equation}
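A minimal sketch of this log-space rule, assuming hypothetical dictionaries `log_prior` (class → \(\log P(c)\)) and `log_likelihood` (class → word → \(\log P(w|c)\)) already estimated from training data:

```python
# Sketch of the log-space decision rule. `log_prior` and `log_likelihood`
# are hypothetical dictionaries estimated beforehand.
import math

def classify(doc_tokens, log_prior, log_likelihood):
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c]                    # log P(c)
        for w in doc_tokens:
            if w in log_likelihood[c]:          # unknown words are ignored (see below)
                score += log_likelihood[c][w]   # + log P(w|c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```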
Parameter Estimation
Because unseen words would otherwise completely zero out the probability of a class, we establish a Beta Distribution prior which smooths the estimates (add-\(\alpha\) smoothing), meaning:
\begin{equation} P(w_{k}|c_{j}) = \frac{n_{k} + \alpha }{n + \alpha |V|} \end{equation}
where \(n_{k}\) is the number of occurrences of word \(w_{k}\) in class \(c_{j}\), \(n\) is the total number of word tokens in class \(c_{j}\), and \(|V|\) is the size of the vocabulary.
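A sketch of this smoothed estimate for a single class, where `class_tokens` (all word tokens observed in class \(c_{j}\)), `vocab`, and `alpha` are hypothetical names:

```python
# Sketch of add-alpha smoothed log-likelihood estimation for one class.
import math
from collections import Counter

def log_likelihoods(class_tokens, vocab, alpha=1.0):
    counts = Counter(class_tokens)           # n_k for each word w_k
    n = len(class_tokens)                    # total word tokens in the class
    denom = n + alpha * len(vocab)           # n + alpha * |V|
    return {w: math.log((counts[w] + alpha) / denom) for w in vocab}
```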
Unknown Words
We ignore them, because knowing that a class has lots of unknown words doesn’t help.
Binary Naive Bayes
There is another version, which simply clips all the word counts \(n_{k}\) to \(1\) for both train and test. You do this by de-duplicating words within each DOCUMENT (i.e. if a word appears twice in the same document, it is counted only once).
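A sketch of this per-document de-duplication, with `docs` as a hypothetical list of tokenized documents from one class:

```python
# Sketch: binary counts -- each word contributes at most 1 per document.
from collections import Counter

def binary_counts(docs):
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))   # de-duplicate within the document first
    return counts

counts = binary_counts([["great", "great", "movie"], ["great", "plot"]])
print(counts["great"])   # 2 -- once per document, even though it occurs three times
```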
Benefits
- Doesn’t have significant fragmentation problems (i.e. it copes well when many features are equally important to the decision)
- Robust to irrelevant features (which cancel each other out)