Naive Bayes is a special class of Bayesian Network inference problem that follows a specific structure used to solve classification problems.
The Naive Bayes classifier is a Bayesian Network of the shape:
(Why is this backwards(ish)? Though we typically think about models as a function M(obs) = cls, the real world works roughly the other way around: World(thing happening) = things we observe. Therefore, the observations are a RESULT of the class happening.)
We consider, naively, that \(o_{1:n}\) are all conditionally independent given \(c\). From this graph, we can therefore use the probability chain rule + conditional probability to write that:
\begin{equation} P(c, o_{1:n}) = P( c) \prod_{i=1}^{n} P(o_{i} | c) \end{equation}
So, to actually compute this, we don't want to multiply all of these probabilities directly because of underflow; instead, we work with logs and write:
\begin{equation} \hat{y} = \arg\max_{y} \log \hat{P}(y) + \sum_{i=1}^{n} \log \hat{P}(x_{i}|y) \end{equation}
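For instance, here is a minimal sketch in Python of that log-space prediction; the parameter values and the `log_prior` / `log_likelihood` structures are made up purely for illustration:

```python
import math

# Hypothetical learned parameters for a 2-class model with two binary features.
# log_prior[y] = log P(y); log_likelihood[y][i][v] = log P(x_i = v | y)
log_prior = {0: math.log(0.6), 1: math.log(0.4)}
log_likelihood = {
    0: [{0: math.log(0.9), 1: math.log(0.1)},   # feature 1 given y = 0
        {0: math.log(0.7), 1: math.log(0.3)}],  # feature 2 given y = 0
    1: [{0: math.log(0.2), 1: math.log(0.8)},   # feature 1 given y = 1
        {0: math.log(0.5), 1: math.log(0.5)}],  # feature 2 given y = 1
}

def predict(x):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y)."""
    scores = {
        y: log_prior[y] + sum(log_likelihood[y][i][v] for i, v in enumerate(x))
        for y in log_prior
    }
    return max(scores, key=scores.get)

print(predict([1, 0]))  # classify the observation x = (1, 0)
```

Summing logs keeps the score in a numerically safe range even with thousands of features, whereas the raw product would underflow to 0.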
brute-force Bayes classifier
\begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} \frac{P(x|y)P(y)}{P(x)} \end{equation}
but because we are taking the argmax, we don't need to normalize:
\begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} P(x|y)P(y) \end{equation}
this only works if \(x\) is a single value (i.e. you have a one-feature classifier).
This system has 6 parameters; they can be estimated by MLE for a Bernoulli from data, but you can also use the Bayesian Parameter Learning method:
|        | y = 0  | y = 1  |
|--------|--------|--------|
| x1 = 0 | theta0 | theta2 |
| x1 = 1 | theta1 | theta3 |
|       | P(y)                  |
|-------|-----------------------|
| y = 0 | theta4                |
| y = 1 | theta5 (= 1 - theta4) |
To perform estimation with MLE:
\begin{equation} p(X=1| Y=0) = \frac{\text{examples where X=1, Y=0}}{\text{examples where Y=0}} \end{equation}
and with MAP using a Laplace prior (i.e. add-one smoothing):
\begin{equation} p(X=1| Y=0) = \frac{\text{(examples where X=1, Y=0)}+1}{\text{(examples where Y=0)}+\text{(nclass = 2)}} \end{equation}
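To make this concrete, here is a sketch in Python of estimating those six parameters from a toy dataset (the data and the helper names `smoothed` / `classify` are made up for illustration), using MLE for the prior and add-one smoothing for the conditionals, then classifying with the one-feature argmax from above:

```python
# Toy dataset of (x1, y) pairs; made up for illustration.
data = [(0, 0), (1, 0), (0, 0), (1, 1), (1, 1), (0, 1), (1, 1)]

n = len(data)
n_y0 = sum(1 for _, y in data if y == 0)
n_y1 = n - n_y0

# Prior parameters (theta4, theta5) via MLE.
theta4 = n_y0 / n       # P(y = 0)
theta5 = 1 - theta4     # P(y = 1)

# Conditionals with Laplace smoothing: (count + 1) / (class count + nclass), nclass = 2.
def smoothed(x1_val, y_val):
    num = sum(1 for x1, y in data if x1 == x1_val and y == y_val) + 1
    den = (n_y0 if y_val == 0 else n_y1) + 2
    return num / den

theta0, theta1 = smoothed(0, 0), smoothed(1, 0)   # P(x1=0 | y=0), P(x1=1 | y=0)
theta2, theta3 = smoothed(0, 1), smoothed(1, 1)   # P(x1=0 | y=1), P(x1=1 | y=1)

# One-feature brute-force classification: argmax_y P(x1 | y) P(y).
def classify(x1):
    score0 = (theta0 if x1 == 0 else theta1) * theta4
    score1 = (theta2 if x1 == 0 else theta3) * theta5
    return 0 if score0 > score1 else 1

print(classify(1))
```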
We can keep going; for instance, if you have \(x_1, x_2\), two different features:
\begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} P(x_1, x_2|y)P(y) \end{equation}
but this requires us to have \(2^{2}\) and ultimately \(2^{n}\) parameters, which is exponential blowup. Hence, we need to treat the variables as (naively) independent so we can just multiply them:
Naive Bayes assumption
we assume conditional independence between the input features given the class. That is, we assume:
\begin{equation} P(x_1, \dots, x_{n}|y) = \prod_{i=1}^{n} P(x_{i}|y) \end{equation}
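To see why this matters, a rough parameter count for binary features (the \(n = 20\) figure is just an illustrative choice):

\begin{align}
\text{full table } P(x_1, \dots, x_{n}|y) &: \ 2^{n} \text{ entries per class} \quad (n = 20 \Rightarrow 1{,}048{,}576) \\
\text{naive Bayes } \prod_{i=1}^{n} P(x_{i}|y) &: \ n \text{ entries per class} \quad (n = 20 \Rightarrow 20)
\end{align}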
inference with Naive Bayes
Recall the definition of inference, for our case here:
given observations \(o_{1:n}\), we desire to know what’s the probability of \(c\) happening. That is, from conditional probability:
\begin{equation} P(c | o_{1:n}) = \frac{P(c, o_{1:n})}{P(o_{1:n})} \end{equation}
Now, from above we already have \(P(c, o_{1:n})\). To get the denominator, we invoke the law of total probability and sum the joint over all classes. That is:
\begin{equation} P(o_{1:n}) = \sum_{c \in C} P(c, o_{1:n}) \end{equation}
You will note that this value \(P(o_{1:n})\) is constant with respect to \(c\): for a fixed set of observations, it is the same no matter which class we consider. Therefore, we tend to write:
\begin{align} P(c | o_{1:n}) &= \frac{P(c, o_{1:n})}{P(o_{1:n})} \\ &= \kappa P(c, o_{1:n}) \end{align}
or, that:
\begin{equation} P(c|o_{1:n}) \propto P(c, o_{1:n}) \end{equation}
“the probability of a class occurring given the inputs is proportional to the probability of that class occurring along with the inputs”
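A tiny sketch of that normalization in Python (the joint values here are made up; any unnormalized scores work the same way):

```python
# Unnormalized joints P(c, o_1:n) for each class; values made up for illustration.
joint = {"spam": 2.1e-9, "ham": 6.3e-9}

# kappa = 1 / P(o_1:n), where P(o_1:n) = sum over classes of P(c, o_1:n)
kappa = 1.0 / sum(joint.values())

posterior = {c: kappa * p for c, p in joint.items()}
print(posterior)  # {'spam': 0.25, 'ham': 0.75}
```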
Multiple beliefs
\begin{equation} P(A=a | R_1) \propto P(R_1 | A=a) \cdot P(A=a) \end{equation}
But now, suppose a second report \(R_2\) arrives as well.
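If we assume the reports are conditionally independent given \(A\) (the same naive assumption as before; \(R_2\) here is a hypothetical second observation), the update just picks up another likelihood factor:

\begin{equation} P(A=a | R_1, R_2) \propto P(R_1 | A=a) \, P(R_2 | A=a) \, P(A=a) \end{equation}

Equivalently, the posterior after seeing \(R_1\) becomes the prior when folding in \(R_2\).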
Motivation: Bayes rule
This will give us:
\begin{equation} P(TSF^{1}|FB^{1}) = \frac{P(FB^{1} | TSF^{1})P(TSF^{1})}{P(FB^{1})} \end{equation}
However, what if we don't want to use the law of total probability to add up \(P(FB^{1})\)? We can note that \(P(FB^{1})\) does not depend on \(TSF\), so we can write:
\begin{equation} P(TSF^{1}|FB^{1}) \propto P(TSF^{1})P(FB^{1} | TSF^{1}) \end{equation}
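If we do want an actual probability at the end, this is just the \(\kappa\) normalization from before applied to this example (writing \(TSF^{0}\) for the other value of \(TSF\)):

\begin{align} P(TSF^{1}|FB^{1}) &= \kappa \, P(TSF^{1})P(FB^{1} | TSF^{1}), \\ \kappa &= \frac{1}{P(TSF^{1})P(FB^{1} | TSF^{1}) + P(TSF^{0})P(FB^{1} | TSF^{0})} \end{align}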