
# Hypothesis Testing

## The intuition

Let’s assume that we have a coin and we would like to tell whether it is fair or not (i.e. biased towards heads or tails).
If we flip the coin 100 times and it comes up heads 51 times, what should we conclude? What if it came up heads only 5 times instead?
Very likely, in the first case one would be inclined to say that the coin is fair, while in the second one could be tempted to infer that the coin is actually biased towards tails.
However, the point is: how likely is it that the coin is actually fair in each case?

Questions such as the ones above fall into the so-called domain of hypothesis testing.
Hypothesis testing is a methodology for systematically quantifying how certain one can be of the result of a statistical experiment. For instance, in the example above the “statistical experiment” consists of flipping the coin 100 times.
More generally, any situation where we take a random sample of a population and measure “something” about it (i.e. any statistic) can be considered an “experiment”.

## Frequentist vs. Bayesian

As a matter of fact, there are two questions one might want to ask:
1) Assuming the coin is fair, how likely is it that we observe the result we get?
2) What is the likelihood that the coin is fair given that we observe the result we get?

More generally, the first statement can be mathematically translated into finding the probability of the observed data $\mathcal{D}$ (i.e. outcome of the experiment) conditioned on the hypothesis $\mathcal{H}$ (i.e. the coin is fair), namely $P(\mathcal{D}|\mathcal{H})$. This quantity, which is generally referred to as the likelihood, is the evidence of $\mathcal{H}$ provided by the data $\mathcal{D}$.
The approach trying to give an answer to this kind of question is better known as frequentist.

The second statement, instead, is concerned with finding the probability of the hypothesis conditioned on the observed data, namely $P(\mathcal{H}|\mathcal{D})$. This probability is also known as the posterior, and the rationale of this approach is to find the hypothesis which best explains the data, amongst all the possible hypotheses one may want to consider. To compute the posterior probability, this approach makes use of Bayes’ rule:
$P(\mathcal{H}|\mathcal{D}) = \frac{P(\mathcal{D}|\mathcal{H}) P(\mathcal{H})}{P(\mathcal{D})}$
where:

• $P(\mathcal{H})$ is the prior, i.e. the probability of the hypothesis being true before any data is observed
• $P(\mathcal{D}|\mathcal{H})$ is the likelihood as defined above
• $P(\mathcal{D})$ is the marginal probability, i.e. the total probability of the data taking into account all possible hypotheses

Because of its use of Bayes’ rule, this second approach is known as Bayesian.
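
As a minimal sketch of the Bayesian approach (the function name and the specific biased alternative $p = 0.4$ are illustrative choices, not from the text), we can compute the posterior probability that the coin is fair under a simple two-hypothesis model:

```python
from math import comb

def posterior_fair(k, n, prior_fair=0.5, p_biased=0.4):
    """Posterior probability that the coin is fair (p = 0.5), given
    k heads in n flips, comparing H_fair: p = 0.5 vs. H_biased: p = p_biased."""
    # Likelihoods P(D|H) under each hypothesis (Binomial probabilities)
    like_fair = comb(n, k) * 0.5**k * 0.5**(n - k)
    like_biased = comb(n, k) * p_biased**k * (1 - p_biased)**(n - k)
    # Bayes' rule: posterior = likelihood * prior / marginal
    marginal = like_fair * prior_fair + like_biased * (1 - prior_fair)
    return like_fair * prior_fair / marginal

print(posterior_fair(51, 100))  # ~0.92: data favor the fair coin
print(posterior_fair(5, 100))   # tiny: data overwhelmingly favor the biased alternative
```

With a flat prior (0.5 each), observing 51 heads leaves us fairly confident the coin is fair, while observing 5 heads all but rules it out.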

Both the frequentist and the Bayesian approaches are used in statistical inference to evaluate evidence against competing hypotheses. The former makes use of hypothesis testing whereas the latter is at the core of Bayesian inference. There isn’t a clear winner strategy between the two as the preference of one over the other can be mostly attributed to a different philosophical interpretation of probability.

## The Null Hypothesis

In the frequentist approach, the most common type of hypothesis testing requires a so-called null hypothesis.
This is usually denoted by $\mathcal{H}_0$ and is a statement about the world which can plausibly account for the observed data.
For instance, in the example above a possible null hypothesis states “the coin is fair”; however, another possible null hypothesis could state the opposite, namely “the coin is biased”.
No matter which logic we want the null hypothesis to represent, it needs to be expressed in simple, mathematical terms.

The main goal of hypothesis testing is thus to come up with a decision on whether there exists enough evidence to reject the null hypothesis. What exactly does that mean? If we are assuming as our null hypothesis that the coin is fair, but the outcome of our statistical experiment (i.e. the data we observe after, say, flipping the coin 100 times) provides enough evidence that contradicts the null hypothesis (e.g., we observe only 1 head out of 100 coin flips), then we can safely reject it.
Statistics is responsible for precisely quantifying what “enough evidence” and “safely” really mean.

Coming back to our coin flip example, our null hypothesis is assuming the coin is fair.
If we flip it 100 times and we observe 51 heads, then we can probably say that the coin is actually fair, since the expected number of heads under the null hypothesis would be 50, and 51 is quite close to that. However, what if we flip the coin 100,000 times and come up with 51,000 heads? In both experiments (100 vs. 100,000 flips) we see 51% heads, but in the second case the coin seems much more likely to be biased.
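
We can make this intuition concrete by asking how likely it is, under a fair coin, to see at least 51% heads in each experiment. A rough sketch using the normal approximation to the Binomial (function names are mine, for illustration):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def prob_at_least(k, n, p=0.5):
    """Normal approximation to P(Y >= k) for Y ~ Binomial(n, p)."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    return 1 - phi((k - mean) / sd)

# Same 51% heads, very different evidence against fairness:
print(prob_at_least(51, 100))          # ~0.42: unremarkable for a fair coin
print(prob_at_least(51_000, 100_000))  # ~1e-10: essentially impossible if fair
```

Seeing 51 heads out of 100 is entirely consistent with fairness, whereas 51,000 out of 100,000 would be an extraordinary coincidence.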

Lack of evidence to the contrary is not evidence that the null hypothesis is true. In fact, it only means that we don’t have sufficient evidence to conclude that the null hypothesis is false: the coin might have a 51% bias towards heads, after all.

To put it more formally, a single coin flip is represented by a Bernoulli trial.
Generally speaking, a Bernoulli trial is an experiment which has exactly two possible outcomes (e.g., $\{0,1\}$, $\{\text{Yes}, \text{No}\}$, $\{\text{Heads}, \text{Tails}\}$, etc.).
It is therefore associated with a binary random variable $X$ which takes on two values, $X \in \{0,1\}$ (one for each possible outcome). Moreover, $X=1$ with probability $p$ and $X=0$ with probability $q = (1-p)$:

$P(X=1) = 1 - P(X=0) = 1 - q = 1 - (1-p) = p$

Usually, we write $X \sim \text{Bernoulli}(p)$ to indicate that the random variable $X$ is distributed according to a Bernoulli distribution parametrized by $p$.
In the case of a coin flip we can assign $X=1$ if the outcome of the flip is heads, and $X=0$ if it comes up tails.
Now, let’s assume we repeat the coin flip experiment $n$ times (e.g., $n=100$ in our previous example), and that $X_i$ is the random variable representing the outcome of the $i$-th flip. Then, the random variable $Y$ defined as follows:
$Y = \sum_{i=1}^n X_i$
represents the total number of times we come up with heads (out of the $n$ trials).
The random variable $Y$ is known to be distributed according to the Binomial distribution, $Y\sim \text{Binomial}(n,p)$. In fact, the probability $P(Y=k)$ (i.e. the probability that the total number of heads is $k$) is described by two parameters: $n$, the total number of Bernoulli trials of the experiment, and $p$, the probability of each outcome being $1$ (i.e. $P(X=1)$).
The probability mass function of a Binomial random variable $Y$ is $f_{Y}(k;n,p)$, defined as:

$f_{Y}(k;n,p) = P(Y=k) = \binom{n}{k}~p^k (1-p)^{n-k}$
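
This probability mass function translates directly into code (a minimal sketch; the function name is mine):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 51 heads in 100 flips of a fair coin
print(binom_pmf(51, 100, 0.5))  # ~0.078
```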

So, let’s say we have a set of observed data $\mathcal{D}$ and a null hypothesis $\mathcal{H}_0$; we then calculate $P(\mathcal{D}|\mathcal{H}_0)$, which is the probability of observing what we did, given that the null hypothesis is true. If this probability is “sufficiently small” we can confidently reject the null hypothesis.
Typical confidence levels used to reject the null hypothesis are 90%, 95%, or 99% (corresponding to significance levels of 0.10, 0.05, and 0.01, respectively). For example, if we choose a 95% confidence level we reject the null hypothesis if:

$P(\mathcal{D}|\mathcal{H}_0) \leq 1 - 0.95 \Rightarrow P(\mathcal{D}|\mathcal{H}_0) \leq 0.05$

Now, in the case of coin flips we can use the observed data $\mathcal{D}$ to compute a sample estimate of the true probability of heads on each trial, $p$. How do we do that? Well, an easy way of computing this statistic is just to count the proportion of heads over all the trials. If out of $n$ coin flips we come up with $k$ heads, then an estimate of $p$ is $\hat{p}$:

$\hat{p} = \frac{k}{n}$

For instance, if $n=100$ and $k=51$ then $\hat{p} = 51/100 = 0.51$.
This is known as the Maximum Likelihood Estimate of $p$; however, it is just a single-point estimate.
Suppose we can repeat this experiment many, many times. By the Central Limit Theorem (CLT) we are guaranteed that the sample mean of $n$ i.i.d. random variables approximates the Normal distribution as $n$ goes to infinity, no matter what the underlying distribution of the original sample is.
In other words, since $\hat{p}$ is itself a sample mean, the CLT guarantees that the sampling distribution of our estimate $\hat{p}$ approaches a Normal distribution as the sample size goes to infinity.
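
We can see this sampling distribution empirically by simulating the whole experiment many times (a sketch; the function name and the 10,000 repetitions are illustrative choices):

```python
import random
random.seed(0)  # for reproducibility

def sample_phat(n, p=0.5):
    """One estimate of p from n simulated coin flips."""
    return sum(random.random() < p for _ in range(n)) / n

# Repeat the n = 100 experiment 10,000 times and inspect p-hat's distribution
estimates = [sample_phat(100) for _ in range(10_000)]
mean = sum(estimates) / len(estimates)
var = sum((x - mean) ** 2 for x in estimates) / len(estimates)
print(mean)        # close to p = 0.5
print(var ** 0.5)  # close to sqrt(p(1-p)/n) = 0.05
```

The empirical mean and standard deviation of the estimates match the theoretical values $p$ and $\sqrt{p(1-p)/n}$, and a histogram of `estimates` would look approximately Normal.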

We can translate our null hypothesis that the coin is fair by stating that:

$\mathcal{H}_0: p_0 = 0.5$

Depending on whether we want to test for a specific condition against the null hypothesis or we are just trying to disprove the null hypothesis, one-tailed or two-tailed tests can be conducted.
In the first case, we can build two alternative hypotheses: one which states that $p < p_0 \Rightarrow p < 0.5$ (meaning the coin is biased towards tails); the other instead claims that $p > p_0 \Rightarrow p > 0.5$ (meaning the coin is biased towards heads).
In the second case, we are not interested in the “direction” of the test and the alternative hypothesis simply states that $p \neq p_0 \Rightarrow p \neq 0.5$. This is exactly the kind of test we are considering here.

It is worth remembering that given any sample statistic $\hat{\theta}$ which aims to approximate a true population parameter $\theta$, we can compute:

$|\hat{\theta}-\theta| = \epsilon = x^*\cdot \text{StdErr}(\hat{\theta})$

The formula above is usually used to define a confidence interval at a specific confidence level, as described here.
Depending on the sample statistic of interest $\hat{\theta}$, we can compute the $\text{StdErr}(\hat{\theta})$ (i.e. an estimate of the standard deviation of our sample statistic). However, given the same sample statistic its estimated standard deviation (standard error) is computed in two different ways, depending on whether we are (i) testing a null hypothesis or (ii) building a confidence interval.
Specifically, in our running example the statistic of interest is the proportion of coin flips resulting in heads (i.e. $\hat{\theta} = \hat{p}$) and we are interested in (i). In fact, since the null hypothesis assumes that this statistic evaluates to a specific value, i.e. $\theta = p_0 = 0.5$, we can use this value to estimate the standard error of our statistic (as we don’t know the true proportion, we take the null hypothesis to be true until proven otherwise). Hence:

$\text{StdErr}(\hat{p}) = \sqrt{\frac{p_0(1-p_0)}{n}}$

If instead we were interested in (ii), i.e. computing a confidence interval, we would not assume anything about the true proportion (i.e. $\theta = p$ and $p$ is left unknown) and we could use the proportion estimated from the sample in the standard error formula above, namely:

$\text{StdErr}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

Similarly, when comparing two proportions, the standard error in hypothesis testing uses a pooled proportion (assuming the two proportions are equal under the null), whereas the confidence interval does not assume equality and combines the two proportions differently.
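
The two standard error formulas above can be sketched as follows (function names are mine, for illustration):

```python
from math import sqrt

def stderr_test(p0, n):
    """Standard error under the null hypothesis p = p0 (for hypothesis testing)."""
    return sqrt(p0 * (1 - p0) / n)

def stderr_ci(p_hat, n):
    """Standard error from the sample estimate (for confidence intervals)."""
    return sqrt(p_hat * (1 - p_hat) / n)

# Running example: 51 heads out of 100 flips, H0: p0 = 0.5
print(stderr_test(0.5, 100))  # 0.05
print(stderr_ci(0.51, 100))   # ~0.04999 (nearly identical here, since p-hat is close to p0)
```

For proportions near 0.5 the two versions barely differ, but they can diverge noticeably when $\hat{p}$ is far from $p_0$.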

From the equation above we can easily compute the standardized test statistic, usually denoted by $z$ and known as the “z-score”. Since we are testing the null hypothesis, we plug in the null-based standard error:

$z = \frac{\hat{p}-p_0}{\text{StdErr}(\hat{p})} = \frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{(\hat{p}-p_0)\sqrt{n}}{\sqrt{p_0(1-p_0)}}$

Now, by the CLT we know that, under the null hypothesis, the statistic above is approximately normally distributed with mean 0 and standard deviation 1.
Intuitively, the z-score indicates how many standard errors our sample proportion lies away from the value expected under the null hypothesis.
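
For the running example, the z-score computation looks like this (a sketch; the function name is illustrative):

```python
from math import sqrt

def z_score(p_hat, p0, n):
    """How many null-based standard errors the sample proportion p_hat
    lies from the value p0 hypothesized under H0."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

print(z_score(0.51, 0.5, 100))      # 0.2: well within the bulk of the null distribution
print(z_score(0.51, 0.5, 100_000))  # ~6.32: far out in the tail
```

Note how the same observed proportion (51% heads) yields wildly different z-scores as the sample size grows, matching the earlier intuition about the 100 vs. 100,000 flip experiments.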

## The p-value

The p-value is defined as the probability, under the assumption of the null hypothesis $\mathcal{H}_0$, of obtaining a result statistic equal to or more extreme than what was actually observed. In our case, the “result statistic” is the z-score.
Depending on how we look at it, the “more extreme than what was actually observed” can either mean $\{ X \geq z \}$ (one-tailed test, right-tail) or $\{ X \leq z \}$ (one-tailed test, left-tail) or two times the “smaller” of $\{ X \leq z \}$ and $\{ X \geq z \}$ (two-tailed test). Thus the p-value is given by:

• $P(X \geq z|\mathcal{H}_0)$ (one-tailed test, right-tail)
• $P(X \leq z|\mathcal{H}_0)$ (one-tailed test, left-tail)
• $2~\text{min}\{P(X \geq z|\mathcal{H}_0), P(X \leq z|\mathcal{H}_0)\}$ (two-tailed test)

Therefore we can compute the probability of obtaining that particular z-score by looking at the cumulative distribution function (CDF). Indeed, note that for a continuous distribution it always holds that $P(X \geq z|\mathcal{H}_0) = 1 - P(X \leq z|\mathcal{H}_0)$.
The smaller the p-value, the larger the significance, because it tells the investigator that the hypothesis under consideration ($\mathcal{H}_0$) may not adequately explain the observation. The hypothesis $\mathcal{H}_0$ is rejected if the p-value is less than or equal to a small, fixed, but arbitrarily pre-defined, threshold value $\alpha$, which is referred to as the level of significance.

Conversely, if the p-value is greater than the level of significance, this intuitively means that the observed z-score falls within the central $(1-\alpha)100\%$ region of the null distribution. It turns out, however, that not being able to reject the null hypothesis does not imply that the hypothesis $\mathcal{H}_0$ is true. In fact, it could still be false, but our observed data do not provide enough evidence of that.
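
Putting it all together, the three p-value variants can be sketched using the standard normal CDF (function names are mine; the CDF is computed via the error function):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def p_value(z, tail="two"):
    """p-value for an observed z-score under H0."""
    if tail == "right":
        return 1 - phi(z)          # P(X >= z | H0)
    if tail == "left":
        return phi(z)              # P(X <= z | H0)
    return 2 * min(phi(z), 1 - phi(z))  # two-tailed

# Running example: z = 0.2 (51 heads out of 100 flips)
print(p_value(0.2))   # ~0.84: nowhere near alpha = 0.05, so we cannot reject H0
print(p_value(6.32))  # tiny: for 51,000 out of 100,000 flips, we reject H0
```

At $\alpha = 0.05$, the 100-flip experiment gives no reason to doubt fairness, while the 100,000-flip experiment decisively rejects it.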