Statistics Review

Statistics Review#

Before we dive into Machine Learning, we will do a brief review of the following concepts from statistics:

Probability Distributions
The Binomial Distribution
The Normal Distribution
The Central Limit Theorem
Hypothesis Testing
The Multivariate Normal Distribution

If you are already familiar with these concepts, feel free to skip this section or to only read the sections you need to review.

Probability Distributions:#

A random variable \(X\) is a variable that can take on one of a number of possible values with associated probabilities. The set of possible values attainable by \(X\) is called the support of \(X\). In this workshop, we will use the notation \(\mathcal{X}\) to denote the support of a random variable \(X\).

Random variables are often defined by probability distributions over their respective supports. A probability distribution is a function that assigns a probability (or, in the continuous case, a probability density) to each possible value \(x\) in the support \(\mathcal{X}\). Probability distributions can be discrete (i.e. when \(\mathcal{X}\) is countable) or continuous (when \(\mathcal{X}\) is not countable). For example, the probability distribution of outcomes for rolling a six-sided die is discrete, whereas the distribution of landing positions of darts thrown at a dartboard is continuous. In this workshop, we will use the notation \(p(x)\) to denote probability distributions.

In order for a probability distribution to be well-defined, we require the distribution to be normalized, meaning all probabilities add up to 1. This means that:

\[\begin{split}1 = \begin{cases} \sum_{x} p(x) & [\text{for discrete } p(x)]\\ \int_\mathcal{X} p(x)\ dx & [\text{for continuous } p(x)] \end{cases}\end{split}\]

The expected value of a random variable \(X\) with distribution \(p(x)\), denoted \(\mathbb{E}[X]\), is given by:

\[\begin{split}\mathbb{E}[X] = \begin{cases} \sum_{x} p(x)x & [\text{for discrete } p(x)]\quad \\ \int_{\mathcal{X}} p(x)x\ dx & [\text{for continuous } p(x)] \end{cases}\end{split}\]

The expected value of a random variable, sometimes called the average value or mean value, is the average of all possible outcomes weighted according to their likelihoods. The mean of a random variable is also often denoted by \(\mu\).

Note

In physics and quantum chemistry, you might encounter Dirac notation, which uses the notation \(\langle x \rangle\) to denote the expected value of \(x\). Often, this is referred to as the “expectation value”, instead of the “expected value”.

The variance of a random variable \(X\) describes the degree to which the distribution deviates from the mean \(\mu\). It is often denoted by \(\sigma^2\), and is given by:

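\[\begin{split}\sigma^2 = \mathbb{E}[ (X - \mu)^2 ] = \begin{cases} \sum_{x} p(x)(x - \mu)^2 & [\text{for discrete } p(x)]\\ \int_\mathcal{X} p(x)(x - \mu)^2\ dx & [\text{for continuous } p(x)] \end{cases}\end{split}\]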

The variance can also be computed by the equivalent formula:

\[\sigma^2 = \mathbb{E}[X^2] - \mathbb{E}[X]^2\]

The standard deviation of a distribution, denoted by \(\sigma\), is the square root of the variance \(\sigma^2\). Roughly speaking, \(\sigma\) measures how far we expect a random variable to deviate from its mean. As a general rule of thumb, if an outcome is more than \(2\sigma\) away from \(\mu\), it is considered to be a statistically significant deviation.
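
As a small worked example of the definitions above, the following sketch checks the normalization, mean, and variance of a fair six-sided die. The variable names are illustrative, and exact fractions are used only so that the two variance formulas can be compared without rounding error.

```python
from fractions import Fraction

# Fair six-sided die: support {1, ..., 6}, each outcome has probability 1/6.
support = range(1, 7)
p = {x: Fraction(1, 6) for x in support}

# Normalization: the probabilities must add up to 1.
total = sum(p.values())

# Expected value E[X] = sum_x p(x) x
mean = sum(p[x] * x for x in support)

# Variance from the definition E[(X - mu)^2] ...
var_def = sum(p[x] * (x - mean) ** 2 for x in support)

# ... and from the equivalent formula E[X^2] - E[X]^2
var_alt = sum(p[x] * x**2 for x in support) - mean**2

print(total, mean, var_def, var_alt)
```

Both variance formulas give \(35/12\), and the mean is \(7/2\), so the standard deviation of a single die roll is \(\sqrt{35/12} \approx 1.71\).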

The Binomial Distribution:#

The binomial distribution is a discrete probability distribution that models the number of successes in a set of \(N\) independent trials, where each trial succeeds with a fixed probability \(p\). A random variable \(X\) that is binomially distributed has support \(\mathcal{X} = \{ 0, 1, ..., N \}\) and probability distribution:

\[p(x) = p^{x} (1-p)^{N-x} \binom{N}{x} = p^x (1-p)^{N-x} \left[ \frac{N!}{x!(N-x)!} \right]\]

Note

We emphasize that \(p(x)\) is not the same as \(p\). \(p\) is the probability of success within any single, independent trial (experiment), so that \((1-p)\) is the probability of failure in any trial. We interpret \(p(x)\) as the probability that in a set of \(N\) trials, exactly \(x\) trials are successful, and \(N-x\) trials are failures.

Let’s write some Python code to visualize a Binomial distribution. We can compute the probability distribution by hand, or we can use the scipy.stats.binom.pmf function:
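
A minimal sketch of both approaches is shown here. The values \(N = 10\) and \(p = 0.3\) are illustrative assumptions, not necessarily the ones used for the figure that follows.

```python
import math

import numpy as np
from scipy.stats import binom

# Illustrative parameters (assumed for this sketch).
N, p = 10, 0.3

# Support of the distribution: 0, 1, ..., N successes.
x = np.arange(N + 1)

# "By hand": p(x) = C(N, x) p^x (1 - p)^(N - x)
pmf_by_hand = np.array(
    [math.comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(N + 1)]
)

# Using SciPy's probability mass function.
pmf_scipy = binom.pmf(x, N, p)

print(np.allclose(pmf_by_hand, pmf_scipy))  # the two computations agree
print(pmf_scipy.sum())                      # normalization check
```

The arrays `x` and `pmf_scipy` can then be passed to a plotting routine, for example `matplotlib.pyplot.bar(x, pmf_scipy)`, to draw the distribution.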

../_images/12de071679598628b5b293abf8b46a596ec13053842d228a4433d9c730895ab1.png

The mean and variance of this distribution are \(\mu = Np\) and \(\sigma^2 = Np(1-p)\) respectively.
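
These closed-form expressions can be checked numerically against the general definitions of mean and variance. The sketch below does so for the assumed illustrative values \(N = 10\) and \(p = 0.3\).

```python
import math

# Illustrative parameters (assumed for this sketch).
N, p = 10, 0.3

# Binomial probabilities p(x) for x = 0, 1, ..., N.
pmf = [math.comb(N, x) * p**x * (1 - p) ** (N - x) for x in range(N + 1)]

# Mean and variance from the general definitions.
mean = sum(px * x for x, px in enumerate(pmf))
var = sum(px * (x - mean) ** 2 for x, px in enumerate(pmf))

print(mean, N * p)            # both should agree up to rounding
print(var, N * p * (1 - p))   # both should agree up to rounding
```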

The Normal Distribution:#

The normal distribution (also called the Gaussian distribution) is perhaps the most important continuous distribution in statistics. This distribution is parameterized by its mean \(\mu\) and standard deviation \(\sigma\) and has support \(\mathcal{X} = (-\infty, \infty)\). The distribution is:

\[p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{1}{2}\left[\frac{x - \mu}{\sigma}\right]^2\right)\]

If we plot this distribution (using scipy.stats.norm.pdf), we obtain the familiar “bell curve” shape:

../_images/5b114876734af2f85d4c22b06463ce1559855c92001d6b9415436045db6e5b6d.png
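
A curve like this could be computed as in the sketch below, which evaluates the density both from the formula above and with `scipy.stats.norm.pdf`. The parameters \(\mu = 0\) and \(\sigma = 1\) and the grid of \(x\) values are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters (assumed for this sketch).
mu, sigma = 0.0, 1.0

# Evaluate the density on a grid spanning four standard deviations.
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 801)

# Density from the formula in the text.
pdf_by_hand = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)

# Density from SciPy (loc is the mean, scale is the standard deviation).
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)

# Simple Riemann-sum estimate of the area under the curve on this grid.
dx = x[1] - x[0]
area = np.sum(pdf_scipy) * dx

print(np.allclose(pdf_by_hand, pdf_scipy))
print(area)  # slightly below 1 because the tails beyond 4 sigma are cut off
```

Plotting `x` against `pdf_scipy`, for example with `matplotlib.pyplot.plot`, gives the bell curve.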

The Central Limit Theorem#

The Law of Large Numbers and the Central Limit Theorem are two important theorems in statistics. The Law of Large Numbers states that as the number of samples \(n\) of a random variable \(X\) increases, the average of these samples approaches the distribution mean \(\mu = \mathbb{E}[X]\):

\[\text{For samples } x_1, x_2, ..., x_n,\quad \sum_{i=1}^n \frac{x_i}{n} \rightarrow \mu \quad\text{ as }\quad n \rightarrow \infty.\]

The Central Limit Theorem refines the Law of Large Numbers. It states that for a set of \(n\) independent samples \(x_1, x_2, ..., x_n\) of any random variable \(X\) with finite mean \(\mu_X\) and finite variance \(\sigma_X^2\), the sample mean \(\bar{X}_n = \sum_{i=1}^n x_i/n\) is such that:

\[\sqrt{n}(\bar{X}_n - \mu_X)\ \underset{distribution}{\longrightarrow}\ \text{Normal}(\mu=0, \sigma=\sigma_X)\]

This theorem is useful for quantifying the uncertainty of the sample mean. If we divide both sides by \(\sqrt{n}\) and shift by \(\mu_X\), we see that, for large \(n\), approximately:

\[\bar{X}_n \sim \text{Normal}(\mu=\mu_X,\sigma=\sigma_X/\sqrt{n})\]

In other words, the standard deviation of the sample mean \(\bar{x} = \sum_{i=1}^n x_i/n\) is roughly \(\sigma_X/\sqrt{n}\). This relation quantifies the uncertainty of using the sample mean as an estimate of a population mean.
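
The \(\sigma_X/\sqrt{n}\) scaling can be checked with a short simulation. The sketch below is one possible illustration: it assumes samples drawn from a uniform distribution on \([0, 1]\) (for which \(\mu_X = 1/2\) and \(\sigma_X = \sqrt{1/12}\)), and the sample sizes and random seed are arbitrary choices.

```python
import numpy as np

# Fixed seed, chosen only so that the sketch is reproducible.
rng = np.random.default_rng(0)

n = 100            # number of samples averaged in each sample mean
n_repeats = 20000  # number of independent sample means to generate

# Each row holds n independent samples from Uniform(0, 1).
samples = rng.uniform(0.0, 1.0, size=(n_repeats, n))

# One sample mean per row.
sample_means = samples.mean(axis=1)

# Known standard deviation of Uniform(0, 1).
sigma_X = np.sqrt(1.0 / 12.0)

print(sample_means.mean())       # should be close to mu_X = 0.5
print(sample_means.std(ddof=1))  # should be close to sigma_X / sqrt(n)
print(sigma_X / np.sqrt(n))
```

Although the underlying distribution is flat rather than bell-shaped, a histogram of `sample_means` is expected to look approximately normal.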

Hypothesis Testing#

An important part of doing science is the testing of hypotheses. The standard way of doing this is through the steps of the scientific method: Formulate a research question, propose a hypothesis, design an experiment, collect experimental data, analyze the results, and report conclusions. In the analysis of our data, how do we know if our hypothesis is correct? There are many different statistical methods we can apply to test a given hypothesis, each with different strengths and weaknesses. In machine learning, we often use hypothesis testing to determine (hopefully with a high degree of certainty) whether one model is more accurate than another. We can also use hypothesis testing to determine which data features are more significant than other data features when making predictions.

Typically, hypothesis testing involves two competing hypotheses: the null hypothesis (denoted \(H_0\)) and the alternative hypothesis (denoted \(H_1\)). The null hypothesis is often a statement of the “status quo” or a statement of “statistical insignificance”. The alternative hypothesis is the statement of “statistical significance” that we are typically trying to find evidence for. To better illustrate the process of hypothesis testing, we will use the following example:

Example: Conductor vs. Insulator Classifier#

Suppose we are developing a classifier model that predicts whether a material is a conductor or an insulator. For simplicity, we shall assume that roughly half of all materials are conductors and half are insulators. Our two competing hypotheses would then be:

\(H_0\): The accuracy of our classifier is the same as random guessing (accuracy = 0.5)
\(H_1\): The accuracy of our classifier is better than random guessing (accuracy > 0.5)

Suppose that in order to test our alternative hypothesis \(H_1\), we compile a dataset of 40 materials (20 conductors and 20 insulators) and use these to evaluate our model. We find that the model has an accuracy of 0.6, meaning it correctly classifies \(60\%\) of the dataset. Since the accuracy is greater than 0.5, does this mean we immediately reject \(H_0\) in favor of \(H_1\)? Not necessarily; it could be the case that our model simply got lucky and “randomly guessed” the classification of more than \(50\%\) of the dataset.

First, let’s consider the distribution of accuracies that could be attained by a random guessing strategy. If we treat each guess as one of \(N = 40\) trials with a probability \(p = 0.5\) of succeeding, we can model the distribution of random guessing strategies with a binomial distribution. Let’s write some Python code to visualize this distribution:
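
A minimal sketch of this computation, assuming `scipy.stats.binom` is used as in the earlier section, is given here. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import binom

# Random guessing on N = 40 materials, each guess correct with probability 0.5.
N, p = 40, 0.5

# Possible numbers of correct guesses and the corresponding accuracies.
k = np.arange(N + 1)
accuracy = k / N

# Probability of each possible accuracy under random guessing.
prob = binom.pmf(k, N, p)

# The most likely accuracy under random guessing.
most_likely = accuracy[np.argmax(prob)]
print(most_likely)
```

The distribution can then be drawn with, for example, `matplotlib.pyplot.bar(accuracy, prob, width=1/N)`.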

../_images/91f90fe6ed817835ccb8c2db181279c85f1c7f9632c0bf0e6ed679416f0d1b77.png

In order to evaluate whether or not our result is statistically significant, we will compute the \(p\)-value associated with our hypothesis test. A \(p\)-value is a quantity between \(0\) and \(1\) that describes the probability of obtaining a result at least as extreme as the experimentally observed value, assuming that \(H_0\) is true. Roughly speaking, we can interpret a \(p\)-value as the probability of observing data like ours “by coincidence” if \(H_0\) is in fact true. A low \(p\)-value means the observed data would be unlikely under \(H_0\), which we take as evidence against \(H_0\) and in favor of \(H_1\); it is not the probability that \(H_0\) is true. In most research settings, a \(p\)-value of at most \(0.05\) (\(5\%\) chance of coincidence) is considered sufficient to reject \(H_0\) in favor of \(H_1\).

From inspecting this plot we see that the accuracy distribution is approximately normal. Each individual guess is a random variable \(X\) that equals \(1\) (correct) with probability \(p\) and \(0\) otherwise, so it has mean \(\mu_X = p = 0.5\) and variance \(\sigma^2_X = p(1-p) = 0.25\), i.e. \(\sigma_X = 0.5\). Per the Central Limit Theorem, the accuracy of random guessing (the mean of \(40\) such guesses) is approximately normally distributed with mean \(\mu = \mu_X\) and standard deviation \(\sigma = \sigma_X/\sqrt{40} \approx 0.079\). The \(p\)-value is the area under this normal distribution curve over accuracies of \(0.6\) or greater. Using the values from the previous code cell, we can compute the \(p\)-value as follows:
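
One way this area could be computed is sketched below, using the survival function `scipy.stats.norm.sf` (one minus the cumulative distribution function). As a cross-check that is not part of the discussion above, the sketch also computes the exact binomial tail probability with `scipy.stats.binom.sf`. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import binom, norm

N, p0 = 40, 0.5            # dataset size and accuracy under H_0
observed_accuracy = 0.6    # accuracy measured for our classifier

# Standard deviation of a single guess, and of the mean of N guesses.
sigma_X = np.sqrt(p0 * (1 - p0))
sigma = sigma_X / np.sqrt(N)

# Normal approximation: area under the curve at or above the observed accuracy.
p_value_normal = norm.sf(observed_accuracy, loc=p0, scale=sigma)

# Exact binomial tail: probability of at least k_observed correct guesses.
k_observed = round(observed_accuracy * N)
p_value_exact = binom.sf(k_observed - 1, N, p0)

print(p_value_normal)
print(p_value_exact)
```

Note that the standard deviation \(\sigma_X\), not the variance \(\sigma_X^2\), enters the formula for \(\sigma\). The normal approximation and the exact binomial tail differ somewhat because \(N = 40\) is a fairly small sample.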

../_images/8e7a01aa1ff650d4a42160878dfe2fe9e2df8975deafa4aba0f6ec218e85f155.png

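With \(\sigma = 0.5/\sqrt{40} \approx 0.079\), an accuracy of \(0.6\) lies about \(1.26\sigma\) above the mean, which corresponds to a \(p\)-value of roughly \(0.10\). Since \(0.10 > 0.05\), we cannot reject \(H_0\): on a dataset of only 40 materials, an accuracy of \(0.6\) is not a statistically significant improvement over random guessing (\(0.5\)), and a larger test set would be needed to draw a firmer conclusion. It is also worth noting that even a statistically significant result would only indicate that the model is better than random guessing; a model with an accuracy of \(0.6\) may still not be practically useful for distinguishing between insulators and conductors.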

The Multivariate Normal Distribution:#

Often, we will find that we are working with multi-dimensional data where correlations may exist between two or more variables. In many cases, these correlations can be modeled by a multivariate normal distribution. Like the 1-dimensional normal distribution, the multivariate normal distribution is characterized by two parameters: a mean vector \(\boldsymbol{\mu}\) and a covariance matrix \(\mathbf{\Sigma}\). For a \(d\)-dimensional distribution, these parameters can be written in matrix form:

\[\begin{split}\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{bmatrix}, \qquad\qquad \mathbf{\Sigma} = \begin{bmatrix} \sigma_{1}^2 & \sigma_{12} & \dots & \sigma_{1d} \\ \sigma_{21} & \sigma_{2}^2 & \dots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \dots & \sigma_{d}^2 \end{bmatrix}\end{split}\]

(For a review of matrices and matrix-vector products see the next section.) The entries \(\mu_i = \mathbb{E}[X_i]\) are the coordinates of the mean \(\boldsymbol{\mu}\). The entries \(\sigma_i^2 = \mathbb{E}[(X_i - \mu_i)^2]\) in \(\Sigma\) are the variances of each individual component of the distribution. Finally, the off-diagonal components \(\sigma_{ij}\) are the covariances of components \(i\) and \(j\). The covariance of two components is given by:

\[\text{Cov}(X_i,X_j) = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)] = \iint_{\mathcal{X}_i \times \mathcal{X}_j} p(x_i,x_j)(x_i - \mu_i)(x_j - \mu_j)\ dx_j\,dx_i\]

The probability distribution of a multivariate normal distribution is given by:

\[p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d \det(\mathbf{\Sigma})}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)\]

Note

From the definition of \(\text{Cov}(X_i, X_j)\), it follows that \(\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)\). This means that the covariance matrix \(\mathbf{\Sigma}\) is symmetric (\(\mathbf{\Sigma} = \mathbf{\Sigma}^T\)), having up to \(d(d+1)/2\) distinct values that need to be determined.

For any two random variables, if \(\text{Cov}(A,B) = 0\) the random variables are uncorrelated; otherwise, the sign of \(\text{Cov}(A,B)\) indicates whether \(A\) and \(B\) are positively or negatively correlated.

Also, for the multivariate normal distribution to be well-defined, the matrix \(\mathbf{\Sigma}\) must be invertible (more precisely, positive definite). If \(\mathbf{\Sigma}\) is not invertible, then \(\det(\mathbf{\Sigma}) = 0\) and \(\mathbf{\Sigma}^{-1}\) does not exist, so the density \(p(\mathbf{x})\) given above is undefined.
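
The properties in this note can be explored numerically. The sketch below draws samples from a 2-dimensional normal distribution and estimates the covariance matrix with `numpy.cov`. The mean vector, covariance matrix, sample size, and seed are all assumptions chosen for illustration.

```python
import numpy as np

# Fixed seed, chosen only so that the sketch is reproducible.
rng = np.random.default_rng(0)

# Illustrative parameters (assumed for this sketch).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Sigma must be positive definite: all eigenvalues strictly positive.
eigenvalues = np.linalg.eigvalsh(Sigma)

# Draw samples; each row is one 2-dimensional sample.
samples = rng.multivariate_normal(mu, Sigma, size=50000)

# Estimated covariance matrix (rowvar=False: columns are the variables).
Sigma_hat = np.cov(samples, rowvar=False)

print(eigenvalues)
print(Sigma_hat)
```

The estimate `Sigma_hat` is symmetric, and its positive off-diagonal entry reflects the positive correlation between the two components in this example.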

To evaluate the density of a multivariate normal distribution, we can use the scipy.stats.multivariate_normal.pdf function:
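
A minimal sketch is shown here, evaluating the density at a single point both from the formula above and with SciPy. The mean vector, covariance matrix, and evaluation point are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (assumed for this sketch).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.5])  # point at which to evaluate the density

# Density from the formula in the text.
d = len(mu)
diff = x - mu
# Solve Sigma y = diff instead of forming the inverse explicitly.
quad = diff @ np.linalg.solve(Sigma, diff)
pdf_by_hand = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

# Density from SciPy.
pdf_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

print(pdf_by_hand, pdf_scipy)
```

To draw the density, the same function can be evaluated on a grid of points and passed to a contour plotting routine such as `matplotlib.pyplot.contourf`.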

../_images/dc78a8590f368cc2ed9dce90468b5f6801232c9889db1cdb8c26fb2c54e287c0.png
