Consider a case where there’s only a single binary outcome:
- “success”, with probability \(p\)
- “failure”, with probability \(1-p\)
constituents
\begin{equation} X \sim Bern(p) \end{equation}
requirements
the probability mass function:
\begin{equation} P(X=k) = \begin{cases} p & \text{if } k=1 \\ 1-p & \text{if } k=0 \end{cases} \end{equation}
This piecewise form is not differentiable, which is a problem for Maximum Likelihood Parameter Learning. Therefore, we rewrite it as a single expression:
\begin{equation} P(X=k) = p^{k} (1-p)^{1-k} \end{equation}
This matches the piecewise PMF at \(k=1\) and \(k=0\), and its behavior anywhere else doesn't matter, since \(k\) only ever takes those two values. We can then use this form directly in Maximum Likelihood Parameter Learning.
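As a minimal Python sketch (the function name `bernoulli_pmf` is illustrative, not from any particular library), the product form reproduces the piecewise cases:

```python
def bernoulli_pmf(k, p):
    """Product-form PMF: p^k * (1-p)^(1-k), matching the cases at k = 0 and k = 1."""
    return p**k * (1 - p)**(1 - k)

p = 0.3
print(bernoulli_pmf(1, p))  # 0.3 -> P(X = 1) = p
print(bernoulli_pmf(0, p))  # 0.7 -> P(X = 0) = 1 - p
```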
additional information
properties of Bernoulli distribution
- expected value: \(p\)
- variance: \(p(1-p)\)
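Both follow directly from the PMF, since \(X\) only takes the values \(0\) and \(1\):

\begin{equation} E[X] = 1 \cdot p + 0 \cdot (1-p) = p \end{equation}

\begin{equation} Var(X) = E[X^{2}] - E[X]^{2} = p - p^{2} = p(1-p) \end{equation}

(here \(E[X^{2}] = 1^{2} \cdot p + 0^{2} \cdot (1-p) = p\)).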
Bernoulli as indicator
If there’s a series of events whose probabilities you are given, you can model each one as a Bernoulli indicator variable and add or subtract these indicators to reason about combinations of events (for instance, by linearity of expectation, the expected number of events that occur is the sum of their probabilities), as in the sketch below.
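A quick sketch of this with made-up probabilities:

```python
# Hypothetical probabilities of three events (made-up numbers for illustration).
event_probs = [0.25, 0.5, 0.75]

# Model each event as a Bernoulli indicator X_i with E[X_i] = p_i;
# by linearity of expectation, E[number of events that occur] = sum of the p_i.
expected_count = sum(event_probs)
print(expected_count)  # 1.5
```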
MLE for Bernoulli
\begin{equation} p_{MLE} = \frac{m}{n} \end{equation}
where \(m\) is the number of successes observed out of \(n\) total trials.
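This follows from maximizing the log-likelihood of \(n\) independent samples containing \(m\) successes, using the product form of the PMF:

\begin{equation} \log L(p) = m \log p + (n-m) \log (1-p) \end{equation}

Setting the derivative to zero:

\begin{equation} \frac{d}{dp} \log L(p) = \frac{m}{p} - \frac{n-m}{1-p} = 0 \implies p_{MLE} = \frac{m}{n} \end{equation}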