Reference: Statistics for Applications
Statistical Modeling
\[\text{Complicated Process} = \text{Simple Process} + \text{Random Noise}\]
Good modeling is choosing a plausible simple process and noise distribution.
Basics
Let \(X, X_1, ..., X_n\) be i.i.d. random variables with common distribution \(\mathbb{P}\), mean \(\mu = \mathbb{E}[X]\), and variance \(\sigma^2 = \mathbb{V}[X]\).
Law of Large Numbers (Strong and Weak)
The sample (empirical) mean converges to the true mean as the number of samples increases.
\[\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i~\underset{n\rightarrow\infty}{\overset{a.s.,\mathbb{P}}{\longrightarrow}}~\mu\]
We can say \(\bar{X}_n\) is a strongly consistent estimator of \(\mu\).
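As a quick illustration, here is a minimal simulation sketch (not part of the original notes; Exp(1) data is an assumed example, so \(\mu = 1\)):

```python
# Minimal LLN sketch: sample means of Exp(1) draws approach the true mean mu = 1.
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000, 1_000_000]:
    x_bar = rng.exponential(scale=1.0, size=n).mean()
    print(f"n={n:>9,}: sample mean = {x_bar:.4f}")  # drifts toward mu = 1
```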
Variance of \(\bar{X}_n\) is:
\[\mathbb{V}[\bar{X}_n] = \frac{\sigma^2}{n}\]
Central Limit Theorem
We already know the mean and variance of \(\bar{X}_n\); the CLT tells us that its distribution is asymptotically normal.
\[\sqrt{n}\left(\frac{\bar{X}_n - \mu}{\sigma}\right)~\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}~\mathcal{N}(0, 1)\]
or:
\[\bar{X}_n~\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}~\mathcal{N}(\mu, \frac{\sigma^2}{n})\]
or:
\[\mathbb{P}[\bar{X}_n-\mu \le x] \approx \Phi\left(\frac{\sqrt{n}\,x}{\sigma}\right)\]
for large \(n\), where \(\Phi\) is the CDF of the standard normal distribution.
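A minimal simulation sketch (an assumed example, not from the notes): even for skewed Exp(1) data, where \(\mu = \sigma = 1\), the standardized sample mean is close to \(\mathcal{N}(0, 1)\):

```python
# Minimal CLT sketch: standardized means of Exp(1) samples look standard normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 500, 2_000
samples = rng.exponential(scale=1.0, size=(reps, n))   # mu = sigma = 1
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0    # sqrt(n)(X_bar - mu)/sigma
print(stats.kstest(z, "norm"))                         # KS statistic near 0 -> close to N(0, 1)
```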
Confidence Interval
So for large enough \(n\) we can derive:
\[\mathbb{P}[|\bar{X}_n-\mu| \le \epsilon] \approx 2\Phi\left(\frac{\sqrt{n}\,\epsilon}{\sigma}\right) - 1 = 1 - \alpha\]
Solving the right-hand equation for \(\epsilon\), we have:
\[\epsilon = \frac{\sigma}{\sqrt{n}}q\left(1 - \frac{\alpha}{2}\right)\]
where \(q = \Phi^{-1}\) is the quantile function of the standard Gaussian distribution.
In another form:
\[\mathbb{P}\left[\mu - \frac{\sigma}{\sqrt{n}}q_{\frac{\alpha}{2}} < \bar{X}_n \le \mu+\frac{\sigma}{\sqrt{n}}q_{\frac{\alpha}{2}}\right] = 1 - \alpha\]
where \(q_{\frac{\alpha}{2}} = q\left(1 - \frac{\alpha}{2}\right)\).
To replace the unknown parameter \(\sigma\) in the bound, we can use its empirical estimate \(\hat{\sigma}\); the interval keeps its asymptotic level by Slutsky’s theorem.
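Putting the pieces together, a minimal sketch (an assumed example, not from the notes) of an asymptotic 95% confidence interval that plugs in \(\hat{\sigma}\):

```python
# Asymptotic level-0.95 CI for the mean: X_bar +/- sigma_hat/sqrt(n) * q(1 - alpha/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200)      # illustrative data, true mu = 1
alpha = 0.05
q = stats.norm.ppf(1 - alpha / 2)             # q(1 - alpha/2) ~= 1.96
eps = q * x.std(ddof=1) / np.sqrt(len(x))     # epsilon with sigma replaced by sigma_hat
print(f"CI: [{x.mean() - eps:.3f}, {x.mean() + eps:.3f}]")
```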
The Delta Method
If CLT holds for \(\bar{X}_n\) and \(g: \mathbb{R}\rightarrow\mathbb{R}\) is differentiable at \(\mu\):
\[\sqrt n \left(g(\bar{X}_n) - g(\mu)\right)\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}\mathcal N(0, g'(\mu)^2\sigma^2)\]
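For example (a standard application, added here for concreteness), taking \(g(x) = x^2\) gives \(g'(\mu) = 2\mu\), so:
\[\sqrt n \left(\bar{X}_n^2 - \mu^2\right)\underset{n \rightarrow \infty}{\overset{(d)}{\longrightarrow}}\mathcal N(0, 4\mu^2\sigma^2)\]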
Hoeffding’s Inequality
For cases where \(n\) is not large enough (say below 50) and \(X\in[a,b]\):
\[\mathbb{P}[|\bar{X}_n-\mu|\ge\epsilon] \le 2e^{-\frac{2n\epsilon^2}{(b-a)^2}}\]
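A minimal sketch (the helper below is my own, not from the notes): setting the bound equal to \(\alpha\) and solving for \(\epsilon\) gives a nonasymptotic confidence radius \(\epsilon = (b-a)\sqrt{\log(2/\alpha)/(2n)}\):

```python
# Nonasymptotic radius from Hoeffding: solve 2*exp(-2n*eps^2/(b-a)^2) = alpha for eps.
import numpy as np

def hoeffding_eps(n: int, a: float, b: float, alpha: float = 0.05) -> float:
    """Radius eps such that P(|X_bar - mu| >= eps) <= alpha for X in [a, b]."""
    return (b - a) * np.sqrt(np.log(2 / alpha) / (2 * n))

print(hoeffding_eps(n=40, a=0.0, b=1.0))  # ~0.215, valid even for a small sample
```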
Parametric Inference
Statistical Model
Let the observed outcome of a statistical experiment be a sample \(X_1, ..., X_n\) of \(n\) i.i.d. random variables in some measurable space \(\mathcal X\), and denote by \(\mathbb P\) their common distribution. A statistical model associated with that statistical experiment is a pair:
\[\left(\mathcal X, \{\mathbb{P}_\theta\}_{\theta\in\Theta}\right)\]
- Usually it is assumed that the statistical model is well specified: \(\mathbb{P}=\mathbb{P}_{\theta^*}\) for some \(\theta^*\in\Theta\).
- The aim of the statistical experiment is to estimate \(\theta^*\), or to check its properties when they have a special meaning.
Parameter Estimation
Given an observed sample \(X_1, ..., X_n\) and a statistical model \(\left(\mathcal X, \{\mathbb{P}_\theta\}_{\theta\in\Theta}\right)\), estimate the parameter \(\theta^*\).
Statistic: Any measurable function of the samples.
Estimator of \(\theta^*\): A statistic that does not depend on \(\theta\).
An estimator \(\hat{\theta}_n\) is consistent (strongly or weakly) iff:
\[\hat{\theta}_n~\underset{n \rightarrow \infty}{\overset{a.s., \mathbb{P}}{\longrightarrow}}~\theta^*\]
w.r.t. \(\mathbb{P}_{\theta^*}\).
Bias of an Estimator
\[\text{bias} = \mathbb{E}[\hat{\theta}_n] - \theta^*\]
Variance of an Estimator
\[\text{variance} = \mathbb{E}\left[\left(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n]\right)^2\right]\]
(Quadratic) Risk of an Estimator
\[\mathbb{E}[|\hat{\theta}_n - \theta^*|^2] = \text{bias}^2 + \text{variance}\]
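A quick Monte Carlo check of this decomposition (a sketch with assumed inputs, not from the notes), using the biased \(1/n\) variance estimator as \(\hat{\theta}_n\):

```python
# Verify risk = bias^2 + variance by simulation for the 1/n variance estimator.
import numpy as np

rng = np.random.default_rng(0)
n, reps, sigma2 = 20, 100_000, 1.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
est = x.var(axis=1)                      # ddof=0: the biased 1/n estimator
bias = est.mean() - sigma2               # ~ -sigma2/n = -0.05
variance = est.var()
risk = ((est - sigma2) ** 2).mean()
print(risk, bias**2 + variance)          # equal up to Monte Carlo error
```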
Confidence Intervals
Let \((\mathcal{X}, \{\mathbb{P}_{\theta}\}_{\theta\in\Theta})\) be a statistical model, and let \(\alpha\in(0, 1)\):
Confidence Interval (C.I.) of level \(1 - \alpha\) is any random interval \(\mathcal{I}\) whose boundaries do not depend on \(\theta\) such that:
\[\mathbb{P}_\theta[\theta\in\mathcal{I}] \ge 1 - \alpha, \qquad \forall \theta\in\Theta\]
Confidence Interval (C.I.) of asymptotic level \(1 - \alpha\) is any random interval \(\mathcal{I}\) whose boundaries do not depend on \(\theta\) such that:
\[\lim_{n\rightarrow\infty}\mathbb{P}_\theta[\theta\in\mathcal{I}] \ge 1 - \alpha, \qquad \forall \theta\in\Theta\]
Note that \(\mathcal{I}\) must not depend on the parameter \(\theta\). Any unknown quantity in its boundaries can be replaced either by its estimate (justified by Slutsky’s theorem) or by a tight upper bound on it.
Maximum Likelihood
\[\mathcal{D}_{KL}(\mathbb{P}_{\theta^*}\| \mathbb{P}_{\theta}) = \mathbb{E}_{\mathbb{P}_{\theta^*}}\left[\log\frac{\mathbb{P}_{\theta^*}(X)}{\mathbb{P}_{\theta}(X)}\right] = -\mathcal{H}(X) - \mathbb{E}_{\mathbb{P}_{\theta^*}}\left[\log{\mathbb{P}_{\theta}(X)}\right]\]
Since the entropy term \(\mathcal{H}(X)\) does not depend on \(\theta\), minimizing the KL divergence over \(\theta\) amounts to maximizing \(\mathbb{E}_{\mathbb{P}_{\theta^*}}\left[\log \mathbb{P}_\theta(X)\right]\). Replacing the expectation with the empirical average gives the maximum likelihood estimate of \(\theta^*\):
\[\hat{\theta}_n = \underset{\theta}{\arg\min}\ {-\frac{1}{n}\sum_{i=1}^n\log \mathbb{P}_\theta(X_i)}\]
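A minimal sketch (an assumed exponential-model example, not from the notes): minimizing the average negative log-likelihood numerically recovers the closed-form MLE \(\hat\lambda = 1/\bar{X}_n\):

```python
# MLE for an Exponential(lambda) model: minimize the average negative log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.5, size=1_000)    # true rate lambda* = 2

def avg_neg_log_lik(lam: float) -> float:
    # p_lambda(x) = lam * exp(-lam * x)  =>  -(1/n) sum log p = -log(lam) + lam * X_bar
    return -np.log(lam) + lam * x.mean()

res = minimize_scalar(avg_neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())                    # both ~ lambda* = 2
```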
Parametric Hypothesis Testing
Consider a sample \(X_1, ..., X_n\) of \(n\) i.i.d. random variables \(X\sim\mathbb P_{\theta^*}\) and a statistical model \((\mathcal{X}, \{\mathbb{P}_\theta\}_{\theta\in\Theta})\).
Let \(\Theta_0\) and \(\Theta_1\) be disjoint subsets of \(\Theta\). Consider two hypotheses:
- Null hypothesis: \(H_0: \theta^*\in\Theta_0\)
- Alternative hypothesis: \(H_1: \theta^*\in\Theta_1\)
We want to decide whether to reject the null hypothesis, i.e., to look for evidence against \(H_0\) in the data. The data is only used to try to disprove \(H_0\). The null hypothesis is the “status quo”, and lack of evidence against \(H_0\) does not mean that \(H_0\) is true. The alternative hypothesis is a “discovery” that goes against the status quo. We test the null hypothesis against the alternative hypothesis.
Innocent until proven guilty.
Test
A test is a statistic (a function of the samples) in the form of an indicator function:
\[\psi(X_1, ..., X_n) = \mathbb{I}\{T_n > c\} \in \{0, 1\}\]
A test is defined by a test statistic \(T_n\) and a threshold \(c\); an example is sketched after the list below.
- If \(\psi = 0\), \(H_0\) is not rejected.
- If \(\psi = 1\), \(H_0\) is rejected.
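A minimal sketch of such a test (an assumed Bernoulli example, not from the notes): testing \(H_0: p = 0.5\) with \(T_n = \left|\sqrt{n}\,(\bar{X}_n - 0.5)/\sqrt{0.5 \cdot 0.5}\right|\) and threshold \(c = q\left(1 - \frac{\alpha}{2}\right)\):

```python
# Two-sided asymptotic test of H0: p = 0.5 for Bernoulli data: psi = I{T_n > c}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.56, size=400)             # data generated with p = 0.56
p0, alpha = 0.5, 0.05
t_n = abs(np.sqrt(len(x)) * (x.mean() - p0) / np.sqrt(p0 * (1 - p0)))
psi = int(t_n > stats.norm.ppf(1 - alpha / 2))  # psi = 1 -> reject H0
print(t_n, psi)
```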
Rejection Region
Region of the sample space where the test rejects the null hypothesis:
\[R_\psi = \{(X_1, ...,X_n)\in\mathcal X^n: \psi(X_1, ...,X_n)=1\}\]
equivalently:
\[R_\psi = \{(X_1, ...,X_n)\in\mathcal X^n: T_n > c\}\]
Error Types
Type 1 Error
The null hypothesis is true, but we rejected it in favor of the alternative hypothesis. This is the probability that \(\psi = 1\) under \(\mathbb{P}_\theta\), for \(\theta \in \Theta_0\):
\[\alpha_\psi(\theta) = \mathbb{P}_\theta[\psi=1],\qquad\forall \theta\in\Theta_0\]
Type 2 Error
The alternative hypothesis is true, but we failed to reject the null hypothesis:
\[\beta_\psi(\theta) = \mathbb{P}_\theta[\psi=0],\qquad\forall \theta\in\Theta_1\]
Power of a Test
Probability of correctly rejecting the null hypothesis, in the worst-case scenario over the alternative:
\[\pi_\psi = \inf_{\theta\in\Theta_1} \mathbb{P}_\theta[\psi=1]\]
Level of a Test
A test \(\psi\) has level \(\alpha\) if:
\[\alpha_\psi(\theta) \le \alpha, \qquad \forall \theta\in\Theta_0\]
Or asymptotic level of \(\alpha\) if:
\[\lim_{n\rightarrow\infty} \alpha_\psi(\theta) \le \alpha, \qquad \forall \theta\in\Theta_0\]
p-Value
The smallest (asymptotic) level \(\alpha\) at which \(\psi\) rejects \(H_0\).
The p-value is random: it depends on the sample.
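For the two-sided test sketched earlier, the asymptotic p-value is \(2(1 - \Phi(T_n))\); a minimal sketch (the value of the statistic is assumed):

```python
# p-value of a two-sided asymptotic test: smallest alpha at which T_n exceeds c.
from scipy import stats

t_n = 2.4                                     # assumed observed test statistic
p_value = 2 * (1 - stats.norm.cdf(t_n))
print(p_value)                                # ~0.016: reject at level 0.05
```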
\(\chi^2\) Distribution
The sum of squares of \(d\) independent standard normal random variables has the \(\chi^2_d\) distribution:
\[Z_1^2+...+Z_d^2\sim\chi^2_d\]
If \(Z\sim\mathcal{N}_d(0, I_d)\) then \(\|Z\|_2^2\sim\chi^2_d\). As a special case, \(\chi^2_2 = \text{Exp}(\lambda=\frac{1}{2})\).
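A minimal simulation sketch (an assumed example, not from the notes) checking that \(\|Z\|_2^2\) has mean \(d\) and variance \(2d\), as a \(\chi^2_d\) variable should:

```python
# ||Z||^2 for Z ~ N_d(0, I_d) behaves like chi^2_d: mean d, variance 2d.
import numpy as np

rng = np.random.default_rng(0)
d, reps = 5, 100_000
z = rng.standard_normal(size=(reps, d))
sq_norm = (z**2).sum(axis=1)
print(sq_norm.mean(), sq_norm.var())   # ~ 5 and ~ 10
```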
Sample Variance
For \(X_1,...,X_n\overset{i.i.d.}{\sim}\mathcal{N}(\mu, \sigma^2)\), if \(\hat{\sigma}^2_n\) is the sample variance:
\[\hat{\sigma}^2_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2\]
then:
\[\frac{n\hat{\sigma}^2_n}{\sigma^2}\sim\chi^2_{n-1}\]
Student’s T Distribution
The random variable \(\frac{Z}{\sqrt{V/d}}\), where \(Z\sim\mathcal{N}(0, 1)\) and \(V\sim\chi^2_d\) are independent, has Student’s T distribution with \(d\) degrees of freedom, denoted \(t_d\).
For a Gaussian sample, combining the two previous facts gives:
\[\sqrt{n-1}\frac{\bar{X}_n - \mu}{\hat{\sigma}_n}\sim t_{n-1}\]
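A minimal sketch (an assumed example, not from the notes): an exact confidence interval for \(\mu\) from a small Gaussian sample, built on this pivot; note the \(1/n\) convention for \(\hat\sigma^2_n\) above.

```python
# Exact 95% CI for mu: X_bar +/- t_{n-1}(0.975) * sigma_hat / sqrt(n - 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, size=15)
n = len(x)
sigma_hat = x.std(ddof=0)                     # 1/n convention, matching the notes
half = stats.t.ppf(0.975, df=n - 1) * sigma_hat / np.sqrt(n - 1)
print(f"CI: [{x.mean() - half:.3f}, {x.mean() + half:.3f}]")
```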