Reference: Statistics for Applications
Statistical Modeling
\[\text{Complicated Process} = \text{Simple Process} + \text{Random Noise}\]
Good modeling is choosing a plausible simple process and noise distribution.
Basics
Let \(X, X_1, ..., X_n\) be i.i.d. random variables with common distribution \(\mathbb{P}\), mean \(\mu = \mathbb{E}[X]\), and variance \(\sigma^2 = \mathbb{V}[X]\).
Law of Large Numbers (Strong and Weak)
The sample (empirical) mean converges to the true mean as the number of samples increases.
\[\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i~\underset{n\rightarrow\infty}{\overset{a.s.,\mathbb{P}}{\longrightarrow}}~\mu\]
We can say \(\bar{X}_n\) is a strongly consistent estimator of \(\mu\).
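As a quick illustration, here is a minimal simulation sketch (not part of the original notes; Exp(1) data is an assumed example, so \(\mu = 1\)):

```python
# Minimal LLN sketch: sample means of Exp(1) draws approach the true mean mu = 1.
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000, 1_000_000]:
    x_bar = rng.exponential(scale=1.0, size=n).mean()
    print(f"n={n:>9,}: sample mean = {x_bar:.4f}")  # drifts toward mu = 1
```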
Variance of \(\bar{X}_n\) is:
\[\mathbb{V}[\bar{X}_n] = \frac{\sigma^2}{n}\]
Central Limit Theorem
We already know the mean and variance of \(\bar{X}_n\); the CLT tells us that its distribution is asymptotically normal.
\[\sqrt{n}\left(\frac{\bar{X}_n - \mu}{\sigma}\right)~\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}~\mathcal{N}(0, 1)\]
or:
\[\bar{X}_n~\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}~\mathcal{N}(\mu, \frac{\sigma^2}{n})\]
or:
\[\mathbb{P}[\bar{X}_n-\mu \le x] \approx \Phi\left(\frac{\sqrt{n}\,x}{\sigma}\right)\]
for large \(n\), where \(\Phi\) is the CDF of the standard normal distribution.
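A minimal simulation sketch (an assumed example, not from the notes): even for skewed Exp(1) data, where \(\mu = \sigma = 1\), the standardized sample mean is close to \(\mathcal{N}(0, 1)\):

```python
# Minimal CLT sketch: standardized means of Exp(1) samples look standard normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 500, 2_000
samples = rng.exponential(scale=1.0, size=(reps, n))   # mu = sigma = 1
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0    # sqrt(n)(X_bar - mu)/sigma
print(stats.kstest(z, "norm"))                         # KS statistic near 0 -> close to N(0, 1)
```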
Confidence Interval
So for large enough \(n\) we can derive:
\[\mathbb{P}[|\bar{X}_n-\mu| \le \epsilon] \approx 2\Phi\left(\frac{\sqrt{n}\,\epsilon}{\sigma}\right) - 1 = 1 - \alpha\]
Solving the right-hand equation for \(\epsilon\), we have:
\[\epsilon = \frac{\sigma}{\sqrt{n}}q\left(1 - \frac{\alpha}{2}\right)\]
where \(q = \Phi^{-1}\) is the quantile function of the standard Gaussian distribution.
In another form:
\[\mathbb{P}\left[\mu - \frac{\sigma}{\sqrt{n}}q_{\frac{\alpha}{2}} < \bar{X}_n \le \mu+\frac{\sigma}{\sqrt{n}}q_{\frac{\alpha}{2}}\right] = 1 - \alpha\]
where \(q_{\frac{\alpha}{2}} = q\left(1 - \frac{\alpha}{2}\right)\).
To replace the unknown parameter \(\sigma\) in the bound, we can use its empirical estimate \(\hat{\sigma}\); the interval keeps its asymptotic level by Slutsky’s theorem.
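Putting the pieces together, a minimal sketch (an assumed example, not from the notes) of an asymptotic 95% confidence interval that plugs in \(\hat{\sigma}\):

```python
# Asymptotic level-0.95 CI for the mean: X_bar +/- sigma_hat/sqrt(n) * q(1 - alpha/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200)      # illustrative data, true mu = 1
alpha = 0.05
q = stats.norm.ppf(1 - alpha / 2)             # q(1 - alpha/2) ~= 1.96
eps = q * x.std(ddof=1) / np.sqrt(len(x))     # epsilon with sigma replaced by sigma_hat
print(f"CI: [{x.mean() - eps:.3f}, {x.mean() + eps:.3f}]")
```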
The Delta Method
If CLT holds for \(\bar{X}_n\) and \(g: \mathbb{R}\rightarrow\mathbb{R}\) is differentiable at \(\mu\):
\[\sqrt n \left(g(\bar{X}_n) - g(\mu)\right)\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}\mathcal N(0, g'(\mu)^2\sigma^2)\]
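For example (a standard application, added here for concreteness), taking \(g(x) = x^2\) gives \(g'(\mu) = 2\mu\), so:
\[\sqrt n \left(\bar{X}_n^2 - \mu^2\right)\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}\mathcal N(0, 4\mu^2\sigma^2)\]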
Hoeffding’s Inequality
For cases where \(n\) is not large enough (say below 50) and \(X\in[a,b]\):
\[\mathbb{P}[|\bar{X}_n-\mu|\ge\epsilon] \le 2e^{-\frac{2n\epsilon^2}{(b-a)^2}}\]
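A minimal sketch (the helper below is my own, not from the notes): setting the bound equal to \(\alpha\) and solving for \(\epsilon\) gives a nonasymptotic confidence radius \(\epsilon = (b-a)\sqrt{\log(2/\alpha)/(2n)}\):

```python
# Nonasymptotic radius from Hoeffding: solve 2*exp(-2n*eps^2/(b-a)^2) = alpha for eps.
import numpy as np

def hoeffding_eps(n: int, a: float, b: float, alpha: float = 0.05) -> float:
    """Radius eps such that P(|X_bar - mu| >= eps) <= alpha for X in [a, b]."""
    return (b - a) * np.sqrt(np.log(2 / alpha) / (2 * n))

print(hoeffding_eps(n=40, a=0.0, b=1.0))  # ~0.215, valid even for a small sample
```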
Parametric Inference
Statistical Model
Let the observed outcome of a statistical experiment be a sample \(X_1, ..., X_n\) of \(n\) i.i.d. random variables in some measurable space \(\mathcal X\), and denote by \(\mathbb P\) their common distribution. A statistical model associated with that statistical experiment is a pair:
\[\left(\mathcal X, \{\mathbb{P}_\theta\}_{\theta\in\Theta}\right)\]
- Usually it is assumed that the statistical model is well specified: \(\mathbb{P}=\mathbb{P}_{\theta^*}\) for some \(\theta^*\in\Theta\).
- The aim of the statistical experiment is to estimate \(\theta^*\), or to check its properties when they have a special meaning.
Parameter Estimation
Given an observed sample \(X_1, ..., X_n\) and a statistical model \(\left(\mathcal X, \{\mathbb{P}_\theta\}_{\theta\in\Theta}\right)\), estimate the parameter \(\theta^*\).
Statistic: Any measurable function of the samples.
Estimator of \(\theta^*\): A statistic that does not depend on \(\theta\).
An estimator \(\hat{\theta}_n\) is consistent (strongly or weakly) iff:
\[\hat{\theta}_n~\underset{n \rightarrow \infty}{\overset{a.s., \mathbb{P}}{\longrightarrow}}~\theta^*\]
w.r.t. \(\mathbb{P}_{\theta^*}\).
Bias of an Estimator
\[\text{bias} = \mathbb{E}[\hat{\theta}_n] - \theta^*\]
Variance of an Estimator
\[\text{variance} = \mathbb{E}\left[\left(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\right)^2\right]\]
(Quadratic) Risk of an Estimator
\[\mathbb{E}[|\hat{\theta}_n - \theta^*|^2] = \text{bias}^2 + \text{variance}\]
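A quick Monte Carlo check of this decomposition (a sketch with assumed inputs, not from the notes), using the biased \(1/n\) variance estimator as \(\hat{\theta}_n\):

```python
# Verify risk = bias^2 + variance by simulation for the 1/n variance estimator.
import numpy as np

rng = np.random.default_rng(0)
n, reps, sigma2 = 20, 100_000, 1.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
est = x.var(axis=1)                      # ddof=0: the biased 1/n estimator
bias = est.mean() - sigma2               # ~ -sigma2/n = -0.05
variance = est.var()
risk = ((est - sigma2) ** 2).mean()
print(risk, bias**2 + variance)          # equal up to Monte Carlo error
```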
Confidence Intervals
Let \((\mathcal{X}, \{\mathbb{P}_{\theta}\}_{\theta\in\Theta})\) be a statistical model, and let \(\alpha\in(0, 1)\):
Confidence Interval (C.I.) of level \(1 - \alpha\) is any random interval \(\mathcal{I}\) whose boundaries do not depend on \(\theta\) such that:
\[\mathbb{P}_\theta[\theta\in\mathcal{I}] \ge 1 - \alpha, \qquad \forall \theta\in\Theta\]
Confidence Interval (C.I.) of asymptotic level \(1 - \alpha\) is any random interval \(\mathcal{I}\) whose boundaries do not depend on \(\theta\) such that:
\[\lim_{n\rightarrow\infty}\mathbb{P}_\theta[\theta\in\mathcal{I}] \ge 1 - \alpha, \qquad \forall \theta\in\Theta\]
Note that \(\mathcal{I}\) must not depend on the parameter \(\theta\). Any unknown quantity in its boundaries can be replaced either by its estimate (justified by Slutsky’s theorem) or by a tight upper bound on it.
Maximum Likelihood
\[\mathcal{D}_{KL}(\mathbb{P}_{\theta^*}\| \mathbb{P}_{\theta}) = \mathbb{E}_{\mathbb{P}_{\theta^*}}\left[\log\frac{\mathbb{P}_{\theta^*}(X)}{\mathbb{P}_{\theta}(X)}\right] = -\mathcal{H}(X) - \mathbb{E}_{\mathbb{P}_{\theta^*}}\left[\log{\mathbb{P}_{\theta}(X)}\right]\]
Since the entropy term \(\mathcal{H}(X)\) does not depend on \(\theta\), minimizing the KL divergence over \(\theta\) amounts to maximizing \(\mathbb{E}_{\mathbb{P}_{\theta^*}}\left[\log \mathbb{P}_\theta(X)\right]\). Replacing the expectation with the empirical average gives the maximum likelihood estimate of \(\theta^*\):
\[\hat{\theta}_n = \underset{\theta}{\arg\min}\ {-\frac{1}{n}\sum_{i=1}^n\log \mathbb{P}_\theta(X_i)}\]
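A minimal sketch (an assumed exponential-model example, not from the notes): minimizing the average negative log-likelihood numerically recovers the closed-form MLE \(\hat\lambda = 1/\bar{X}_n\):

```python
# MLE for an Exponential(lambda) model: minimize the average negative log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.5, size=1_000)    # true rate lambda* = 2

def avg_neg_log_lik(lam: float) -> float:
    # p_lambda(x) = lam * exp(-lam * x)  =>  -(1/n) sum log p = -log(lam) + lam * X_bar
    return -np.log(lam) + lam * x.mean()

res = minimize_scalar(avg_neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())                    # both ~ lambda* = 2
```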
Parametric Hypothesis Testing
Consider a sample \(X_1, ..., X_n\) of \(n\) i.i.d. random variables \(X\sim\mathbb P_{\theta^*}\) and a statistical model \((\mathcal{X}, \{\mathbb{P}_\theta\}_{\theta\in\Theta})\).
Let \(\Theta_0\) and \(\Theta_1\) be disjoint subsets of \(\Theta\). Consider two hypotheses:
- Null hypothesis: \(H_0: \theta^*\in\Theta_0\)
- Alternative hypothesis: \(H_1: \theta^*\in\Theta_1\)
We want to decide whether to reject the null hypothesis, i.e., to look for evidence against \(H_0\) in the data. The data is only used to try to disprove \(H_0\). The null hypothesis is the “status quo”, and lack of evidence against \(H_0\) does not mean that \(H_0\) is true. The alternative hypothesis is a “discovery” that goes against the status quo. We test the null hypothesis against the alternative hypothesis.
Innocent until proven guilty.
Test
A test is a statistic (a function of the samples) in the form of an indicator function:
\[\psi(X_1, ..., X_n) = \mathbb{I}\{T_n > c\} \in \{0, 1\}\]
A test is defined by a test statistic \(T_n\) and a threshold \(c\); an example is sketched after the list below.
- If \(\psi = 0\), \(H_0\) is not rejected.
- If \(\psi = 1\), \(H_0\) is rejected.
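A minimal sketch of such a test (an assumed Bernoulli example, not from the notes): testing \(H_0: p = 0.5\) with \(T_n = \left|\sqrt{n}\,(\bar{X}_n - 0.5)/\sqrt{0.5 \cdot 0.5}\right|\) and threshold \(c = q\left(1 - \frac{\alpha}{2}\right)\):

```python
# Two-sided asymptotic test of H0: p = 0.5 for Bernoulli data: psi = I{T_n > c}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.56, size=400)             # data generated with p = 0.56
p0, alpha = 0.5, 0.05
t_n = abs(np.sqrt(len(x)) * (x.mean() - p0) / np.sqrt(p0 * (1 - p0)))
psi = int(t_n > stats.norm.ppf(1 - alpha / 2))  # psi = 1 -> reject H0
print(t_n, psi)
```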
Rejection Region
Region of the sample space where the test rejects the null hypothesis:
\[R_\psi = \{(X_1, ...,X_n)\in\mathcal X^n: \psi(X_1, ...,X_n)=1\}\]
equivalently:
\[R_\psi = \{(X_1, ...,X_n)\in\mathcal X^n: T_n > c\}\]
Error Types
Type 1 Error
The null hypothesis is true, but we rejected it in favor of the alternative hypothesis. This is the probability that \(\psi = 1\) under \(\mathbb{P}_\theta\), for \(\theta \in \Theta_0\):
\[\alpha_\psi(\theta) = \mathbb{P}_\theta[\psi=1],\qquad\forall \theta\in\Theta_0\]
Type 2 Error
The alternative hypothesis is true, but we failed to reject the null hypothesis:
\[\beta_\psi(\theta) = \mathbb{P}_\theta[\psi=0],\qquad\forall \theta\in\Theta_1\]
Power of a Test
Probability of correctly rejecting the null hypothesis, in the worst-case scenario over the alternative:
\[\pi_\psi = \inf_{\theta\in\Theta_1} \mathbb{P}_\theta[\psi=1]\]
Level of a Test
A test \(\psi\) has level \(\alpha\) if:
\[\alpha_\psi(\theta) \le \alpha, \qquad \forall \theta\in\Theta_0\]
Or asymptotic level of \(\alpha\) if:
\[\lim_{n\rightarrow\infty} \alpha_\psi(\theta) \le \alpha, \qquad \forall \theta\in\Theta_0\]
p-Value
The smallest (asymptotic) level \(\alpha\) at which \(\psi\) rejects \(H_0\).
The p-value is random: it depends on the sample.
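For the two-sided test sketched earlier, the asymptotic p-value is \(2(1 - \Phi(T_n))\); a minimal sketch (the value of the statistic is assumed):

```python
# p-value of a two-sided asymptotic test: smallest alpha at which T_n exceeds c.
from scipy import stats

t_n = 2.4                                     # assumed observed test statistic
p_value = 2 * (1 - stats.norm.cdf(t_n))
print(p_value)                                # ~0.016: reject at level 0.05
```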
\(\chi^2\) Distribution
The sum of squares of \(d\) independent standard normal random variables has the \(\chi^2_d\) distribution:
\[Z_1^2+...+Z_d^2\sim\chi^2_d\]
If \(Z\sim\mathcal{N}_d(0, I_d)\) then \(\|Z\|_2^2\sim\chi^2_d\). As a special case, \(\chi^2_2 = \text{Exp}(\lambda=\frac{1}{2})\).
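A minimal simulation sketch (an assumed example, not from the notes) checking that \(\|Z\|_2^2\) has mean \(d\) and variance \(2d\), as a \(\chi^2_d\) variable should:

```python
# ||Z||^2 for Z ~ N_d(0, I_d) behaves like chi^2_d: mean d, variance 2d.
import numpy as np

rng = np.random.default_rng(0)
d, reps = 5, 100_000
z = rng.standard_normal(size=(reps, d))
sq_norm = (z**2).sum(axis=1)
print(sq_norm.mean(), sq_norm.var())   # ~ 5 and ~ 10
```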
Sample Variance
For \(X_1,...,X_n\overset{i.i.d.}{\sim}\mathcal{N}(\mu, \sigma^2)\), if \(\hat{\sigma}^2_n\) is the sample variance:
\[\hat{\sigma}^2_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2\]
then:
\[\frac{n\hat{\sigma}^2_n}{\sigma^2}\sim\chi^2_{n-1}\]
Student’s T Distribution
The random variable \(\frac{Z}{\sqrt{V/d}}\), where \(Z\sim\mathcal{N}(0, 1)\) and \(V\sim\chi^2_d\) are independent, has Student’s T distribution with \(d\) degrees of freedom, denoted \(t_d\).
For a Gaussian sample, combining the two previous facts gives:
\[\sqrt{n-1}\frac{\bar{X}_n - \mu}{\hat{\sigma}_n}\sim t_{n-1}\]
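A minimal sketch (an assumed example, not from the notes): an exact confidence interval for \(\mu\) from a small Gaussian sample, built on this pivot; note the \(1/n\) convention for \(\hat\sigma^2_n\) above.

```python
# Exact 95% CI for mu: X_bar +/- t_{n-1}(0.975) * sigma_hat / sqrt(n - 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, size=15)
n = len(x)
sigma_hat = x.std(ddof=0)                     # 1/n convention, matching the notes
half = stats.t.ppf(0.975, df=n - 1) * sigma_hat / np.sqrt(n - 1)
print(f"CI: [{x.mean() - half:.3f}, {x.mean() + half:.3f}]")
```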