A random variable is a quantity that can take on different values, with a probability associated with each value (or range of values):
- discrete: takes on a countable set of values (often finite)
- continuous: takes on uncountably many possible values (a continuum)
probability mass function
A discrete random variable is described by a probability mass function (PMF), which assigns a probability to each value it can take.
probability density function
A continuous random variable is represented by a probability density function (PDF); probabilities come from integrating the density over an interval.
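A minimal sketch of both representations, using a fair six-sided die for the PMF and a Uniform(0, 1) for the PDF (both illustrative choices, not from the notes):

```python
# PMF of a fair six-sided die: a probability attached to each discrete value.
die_pmf = {k: 1 / 6 for k in range(1, 7)}
assert abs(sum(die_pmf.values()) - 1.0) < 1e-12   # probabilities sum to 1

# PDF of a Uniform(0, 1): a density, not a probability; probabilities
# come from integrating the density over an interval.
def uniform_pdf(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

# P(0.2 <= X <= 0.5) via a crude Riemann sum over the density.
dx = 1e-4
prob = sum(uniform_pdf(0.2 + i * dx) * dx for i in range(3000))
print(round(prob, 3))   # ~0.3
```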
summary statistics
- the probability mass function is a full description of the random variable, and random variables are usually communicated via their probability mass functions
- expected value
adding random variables
“what’s the probability that \(X + Y = n\) for IID \(X\) and \(Y\)?” In other words: “what’s the probability of two independent samples from the same exact distribution adding up to \(n\)?”
\begin{equation} \sum_{i=-\infty}^{\infty} P(X=i, Y=n-i) \end{equation}
or, for continuous cases, the analogous integral over the PDFs
for every value \(i\) that \(X\) could take, we pair it with the value \(n - i\) of \(Y\) that makes the two sum to \(n\).
We can use convolution to enumerate every combination of assignments to the two random variables that adds to a given value, and sum their probabilities together.
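A minimal sketch of that convolution, assuming the two IID variables are fair six-sided dice (an illustrative choice, using numpy.convolve):

```python
import numpy as np

# PMF of one fair die over the values 1..6.
die = np.full(6, 1 / 6)

# Convolving the two PMFs computes sum_i P(X = i) * P(Y = n - i),
# giving the PMF of X + Y over the values 2..12.
sum_pmf = np.convolve(die, die)
for total, p in zip(range(2, 13), sum_pmf):
    print(total, round(p, 4))   # e.g. P(X + Y = 7) = 6/36 ≈ 0.1667
```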
If you add a bunch of IID things together, the sum tends toward a Gaussian: the central limit theorem.
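A quick empirical sketch of that tendency, summing Uniform(0, 1) draws (the uniform distribution and trial counts are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum 30 IID Uniform(0, 1) draws, repeated over many trials.
sums = rng.uniform(0, 1, size=(100_000, 30)).sum(axis=1)

# The central limit theorem says the sums should look roughly Gaussian,
# here with mean 30 * 0.5 = 15 and variance 30 * (1 / 12) = 2.5.
print(round(sums.mean(), 3), round(sums.var(), 3))
```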
averaging random variables
adding random variables + linear transformations on a Gaussian
You end up with:
\begin{equation} \mathcal{N}\qty(\mu, \frac{1}{n} \sigma^{2}) \end{equation}
note: as you average together more and more IID variables, the expected value stays the same \(\mu\), but the variance \(\frac{\sigma^{2}}{n}\) shrinks as \(n\) grows.
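A quick sketch of that shrinking variance, assuming an underlying Normal population with \(\mu = 2\) and \(\sigma = 3\) (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0

for n in (1, 10, 100, 1000):
    # Average n IID draws, repeated over many trials.
    means = rng.normal(mu, sigma, size=(50_000, n)).mean(axis=1)
    # The averages stay centered at mu, but their variance falls off as sigma^2 / n.
    print(n, round(means.mean(), 3), round(means.var(), 3), sigma ** 2 / n)
```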
maxing random variables
The maximum of many IID variables tends (after suitable normalization) toward an extreme-value distribution such as the Gumbel distribution: the Fisher–Tippett–Gnedenko theorem.
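A rough empirical sketch of that extreme-value behavior, under the assumption of an Exponential(1) population (an illustrative choice, whose normalized maxima converge to the Gumbel shape):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Take the max of n IID Exponential(1) draws, repeated over many trials.
maxima = rng.exponential(1.0, size=(10_000, n)).max(axis=1)

# For Exponential(1), (max - log n) approaches a standard Gumbel distribution,
# whose mean is the Euler-Mascheroni constant (~0.577) and variance pi^2 / 6 (~1.645).
shifted = maxima - np.log(n)
print(round(shifted.mean(), 3), round(shifted.var(), 3))
```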
sampling statistics
We assume that there’s some underlying distribution with some true mean \(\mu\) and true variance \(\sigma^{2}\). We would like to estimate these with some confidence.
Consider a series of measured samples \(x_1, \dots, x_{n}\), each an instantiation of one of the IID random variables \(X_1, \dots, X_{n}\) drawn from the underlying distribution.
sample mean
Let us estimate the true population mean by creating a random variable that averages the \(n\) random variables representing the observations:
\begin{equation} \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_{i} \end{equation}
This works because \(\mathbb{E}[\bar{X}] = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[X_{i}] = \frac{1}{n} \, n \mu = \mu\), so as long as each of the underlying variables has the same expected value (they do, because they are IID), the sample mean is an unbiased estimate of the population mean.
sample variance
We can’t just estimate the population variance with the raw variance of the sample. This is because the sample mean is, by construction, closer to the sampled points than the true mean is, which biases the raw variance downward; so we correct for it by dividing by \(n - 1\). The sample variance is a random variable too:
\begin{equation} S^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{i} - \bar{X})^{2} \end{equation}
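A quick sketch of why the \(n - 1\) correction matters, assuming a Normal population with \(\sigma^{2} = 4\) (the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n = 4.0, 5

samples = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))

# Measuring spread around the sample mean and dividing by n (ddof=0)
# systematically underestimates sigma^2; dividing by n - 1 (ddof=1) corrects it.
biased = samples.var(axis=1, ddof=0).mean()
unbiased = samples.var(axis=1, ddof=1).mean()
print(round(biased, 3), round(unbiased, 3))   # ≈ 3.2 vs ≈ 4.0
```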
standard error of the mean
\begin{equation} \operatorname{Var}(\bar{X}) = \frac{\sigma^{2}}{n} \approx \frac{S^{2}}{n} \end{equation}
this estimates how far the sample mean is likely to stray from the true mean given what you measured: by the central limit theorem, \(\bar{X}\) is approximately Gaussian around \(\mu\) with this variance, and its square root \(S / \sqrt{n}\) is the standard error of the mean.
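Putting the pieces together on a small, made-up set of measurements (the data values here are purely illustrative):

```python
import numpy as np

# Hypothetical measurements x_1, ..., x_n.
x = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 5.2])
n = len(x)

x_bar = x.mean()            # sample mean, our estimate of mu
s2 = x.var(ddof=1)          # sample variance with the n - 1 correction
sem = np.sqrt(s2 / n)       # standard error of the mean

# By the central limit theorem, X-bar is roughly Normal(mu, sigma^2 / n),
# so mu plausibly lies within about two standard errors of x_bar.
print(round(x_bar, 3), round(s2, 4), round(sem, 4))
```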