Naive Bayes computes \(P(Y|X)\) via Bayes' rule together with the Naive Bayes assumption. What if we could model \(P(Y|X)\) directly?
With \(\sigma\) as the sigmoid function:
\begin{equation} P(Y=1|X=x) = \sigma (\theta^{\top}x) \end{equation}
and we tune the parameters \(\theta\) until this models the training data well.
We always want to introduce a bias parameter, which acts as an offset: we fix the first component of \(x\) to \(1\), so that the first component of \(\theta\) acts as the "bias".
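A quick sketch of the bias trick (the values here are made up, assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])              # a hypothetical 2-feature input
x_aug = np.concatenate(([1.0], x))     # bias trick: prepend a constant 1

theta = np.array([0.1, 2.0, -0.5])     # theta[0] now acts as the bias
p_y1 = sigmoid(theta @ x_aug)          # P(Y=1 | X=x)
```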
For optimizing this function, we have:
\begin{equation} LL(\theta) = y \log \sigma(\theta^{\top} x) + (1-y) \log (1- \sigma(\theta^{\top} x)) \end{equation}
Summing over all \(n\) training examples and taking the derivative with respect to a particular parameter \(\theta_{j}\):
\begin{equation} \pdv{LL(\theta)}{\theta_{j}} = \sum_{i=1}^{n} \qty[y^{(i)} - \sigma(\theta^{\top}x^{(i)})] x_{j}^{(i)} \end{equation}
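The key step in this derivation is the sigmoid derivative identity:
\begin{equation} \sigma'(z) = \sigma(z) \qty(1-\sigma(z)) \end{equation}
applying the chain rule with \(z = \theta^{\top}x^{(i)}\) to both log terms and simplifying collapses everything into the difference \(y^{(i)} - \sigma(\theta^{\top}x^{(i)})\).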
logistic regression assumption
We assume that there exists some \(\theta\) which, when multiplied with the input and squashed by the sigmoid function, models our underlying probability distribution:
\begin{equation} \begin{cases} P(Y=1|X=x) = \sigma (\theta^{\top}x) \\ P(Y=0|X=x) = 1- \sigma (\theta^{\top}x) \\ \end{cases} \end{equation}
We then attempt to compute the \(\theta\) which maximizes the likelihood of the data:
\begin{equation} \theta_{MLE} = \arg\max_{\theta} P(y^{(1)}, \dots, y^{(n)} | \theta, x^{(1)}, \dots, x^{(n)}) \end{equation}
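Assuming the training examples are independent, taking the log turns this product into a sum of the per-example log likelihoods \(LL(\theta)\) from above:
\begin{equation} \log P(y^{(1)}, \dots, y^{(n)} | \theta, x^{(1)}, \dots, x^{(n)}) = \sum_{i=1}^{n} \qty[y^{(i)} \log \sigma(\theta^{\top} x^{(i)}) + (1-y^{(i)}) \log \qty(1- \sigma(\theta^{\top} x^{(i)}))] \end{equation}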
Log Likelihood of Logistic Regression
To actually perform MLE for the \(\theta\)s, we need to do parameter learning. Now, recall that we defined, through the logistic regression assumption:
\begin{equation} P(Y=1|X=x) = \sigma (\theta^{\top}x) \end{equation}
essentially, this is a Bernoulli:
\begin{equation} (Y|X=x) \sim Bern(p=\sigma(\theta^{\top}x)) \end{equation}
We desire to maximize:
\begin{equation} P(Y=y | \theta, X=x) \end{equation}
Now, recall the PMF of the Bernoulli:
\begin{equation} P(Y=y) = p^{y} (1-p)^{1-y} \end{equation}
we now plug in our expression for \(p\):
\begin{equation} P(Y=y|X=x) = \sigma(\theta^{\top}x)^{y} (1-\sigma(\theta^{\top}x))^{1-y} \end{equation}
for all \(x,y\).
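Plugging in the two possible values of \(y\) confirms that this compact form recovers the assumption:
\begin{equation} y=1:\ \sigma(\theta^{\top}x)^{1} \qty(1-\sigma(\theta^{\top}x))^{0} = \sigma(\theta^{\top}x), \qquad y=0:\ \sigma(\theta^{\top}x)^{0} \qty(1-\sigma(\theta^{\top}x))^{1} = 1-\sigma(\theta^{\top}x) \end{equation}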
Logistic Regression, in general
For each input/output pair \((x^{(i)}, y^{(i)})\), we map the input \(x^{(i)}\) into a feature vector of length \(n\) with components \(x^{(i)}_{1}, \dots, x^{(i)}_{n}\).
Training
We are going to learn the weights \(w\) and bias \(b\) using stochastic gradient descent, and measure our performance using the cross-entropy loss.
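A minimal sketch of such a training loop (assuming NumPy, a fixed learning rate, and the per-example cross-entropy gradient derived at the end of this section):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(X, y, eta=0.1, epochs=100):
    """Learn w, b by SGD on cross-entropy loss; X is (m, n), y is (m,)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = sigmoid(w @ x_i + b) - y_i   # y_hat - y
            w -= eta * error * x_i               # gradient step for weights
            b -= eta * error                     # gradient step for bias
    return w, b
```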
Test
Given a test example \(x\), we compute \(p(y|x)\) for each \(y\), returning the label with the highest probability.
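Continuing the sketch above, test-time prediction just thresholds the predicted probability:

```python
from math import exp

def predict(x, w, b, threshold=0.5):
    """Return the label with the higher probability under the model."""
    p_y1 = 1.0 / (1.0 + exp(-(w @ x + b)))   # p(y=1|x); p(y=0|x) = 1 - p_y1
    return 1 if p_y1 > threshold else 0
```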
Logistic Regression Text Classification
Given a series of input/output pairs of text to labels, we want to assign a predicted class to a new input.
We represent each text in terms of features. Each feature \(x_{i}\) comes with some weight \(w_{i}\), informing us of the importance of feature \(x_{i}\).
So: the input is a vector \(x_1 \dots x_{n}\) together with weights \(w_1 \dots w_{n}\), which will eventually give us an output.
There is usually also a bias term \(b\). Eventually, classification gives:
\begin{equation} z = w \cdot x + b \end{equation}
However, \(z\) is not a probability; it can be any real number. To fix this, we apply a squashing function \(\sigma\), which gives
\begin{equation} \sigma(z) = \frac{1}{1+\exp(-z)} \end{equation}
which ultimately yields:
\begin{equation} \hat{y} = \sigma(w \cdot x+ b) \end{equation}
with the sigmoid function.
To make the two class probabilities sum to \(1\), we write:
\begin{equation} p(y=1|x) = \sigma(w \cdot x + b) \end{equation}
and
\begin{equation} p(y=0|x) = 1- p(y=1|x) \end{equation}
Also, recall that \(\sigma(-x) = 1- \sigma(x)\); this gives:
\begin{equation} p(y=0|x) = \sigma(-w\cdot x-b) \end{equation}
The probability threshold at which we make a decision is called the decision boundary. Typically this is \(0.5\).
We can featurize by counts from a lexicon, by word counts, etc.
For instance, a count-based featurizer over hypothetical sentiment lexicons might look like this sketch:
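```python
# hypothetical sentiment lexicons; real feature sets vary by task
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def featurize(text):
    """Map a document to a small count-based feature vector."""
    words = text.lower().split()
    return [
        sum(w in POSITIVE for w in words),   # count of positive-lexicon words
        sum(w in NEGATIVE for w in words),   # count of negative-lexicon words
        len(words),                          # document length
    ]

featurize("I love this great movie")  # -> [2, 0, 5]
```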
logistic regression terms
- feature representation: each input \(x\) is represented by a vectorized list of features
- classification function: \(p(y|x)\), which computes the estimated class \(\hat{y}\)
- objective function: the loss to minimize (i.e. cross entropy)
- optimizer: SGD, etc.
- decision boundary: the threshold at which classification decisions are made, e.g. predicting \(y=1\) when \(P(y=1|x) > 0.5\)
binary cross entropy
\begin{equation} \mathcal{L} = - \qty[y \log \sigma(w \cdot x + b) + (1-y) \log (1- \sigma(w \cdot x + b))] \end{equation}
or, for neural networks in general:
\begin{equation} \mathcal{L} = - \qty[y \log \hat{y} + (1-y) \log (1- \hat{y})] \end{equation}
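A minimal sketch of this loss (assuming NumPy; the clip guards against \(\log 0\)):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy between true label y and prediction y_hat."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```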
gradient descent
\begin{equation} \theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} \mathcal{L} \end{equation}
“update the weights by taking a step in the opposite direction of the gradient of the loss with respect to the weights”.
Weight gradient for logistic regression
\begin{equation} \pdv{\mathcal{L}_{CE}(\hat{y}, y)}{w_{j}} = \qty[\sigma\qty(w \cdot x + b) -y] x_{j} \end{equation}
where \(x_{j}\) is feature \(j\).
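As a sanity check, this analytic gradient can be compared against a finite-difference estimate; a minimal sketch with made-up numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(w @ x + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w, b = np.array([0.2, -0.3]), 0.1
x, y = np.array([1.5, 0.5]), 1.0

analytic = (sigmoid(w @ x + b) - y) * x          # [sigma(w.x + b) - y] * x_j

h = 1e-6                                         # finite-difference check on w_0
w_plus = w.copy(); w_plus[0] += h
numeric = (loss(w_plus, b, x, y) - loss(w, b, x, y)) / h
# analytic[0] and numeric should agree to ~1e-5
```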