# Confidence Intervals

Any statistic over a population of items (e.g., the population mean $\mu$) can be computed exactly as long as one has access to the whole population. Unfortunately, this is often infeasible in practice, or even impossible if the population contains an infinite number of items. That is why in statistical inference we resort to computing estimates of those statistics from a finite amount of sample data (e.g., the sample mean $\bar{x}$).

Without loss of generality, suppose we are interested in estimating the true, yet unknown, statistic $\theta$ of a certain population of items. In addition, assume we can access a sample of $n$ observations drawn from that population, e.g., $x_1, x_2, \ldots, x_n$. Finally, assume that we are able to compute the sample statistic (i.e. an estimate) $\hat{\theta}$ from that sample. Now, we may wonder how well our sample statistic $\hat{\theta}$ approximates the true statistic $\theta$.

From the Central Limit Theorem (CLT) we know that, whatever the underlying distribution governing the population (provided its variance is finite), the sample mean computed from repeated samples drawn from that population approaches a Normal distribution as the sample size goes to infinity, i.e. as $n \rightarrow \infty$; analogous results hold for many other statistics under suitable regularity conditions. Intuitively, if we had access to $k$ samples, each of size $n_i$ (with $i \in \{1,2,\ldots,k\}$), drawn from the same unknown underlying distribution, we could compute a sample statistic for each of those samples (i.e. $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$). Different samples, and different sample sizes, generally yield different estimates of $\theta$: some of those estimates will be larger and some smaller than the true $\theta$, but if the estimator is unbiased and we repeat the sampling process indefinitely, the average of all the sample estimates converges to $\theta$.
The distribution of such sample statistics is usually referred to as the sampling distribution (of the sample statistic). Under the assumption made by the CLT (i.e. as the sample size goes to infinity) the sampling distribution approaches a Normal distribution. Indeed, if we plot the frequency histogram of the values of those sample statistics $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$, the result is the typical Gaussian shape, provided the samples are large enough.
In fact, by repeating the estimation $k$ times, and assuming no (or negligible) overlap between samples, we are implicitly using a single, larger sample whose size is at most $n = \sum_{i=1}^k n_i$ (attained when the samples do not overlap at all). By the CLT, the larger $n$ is, the more “Normal” the sampling distribution of the sample statistic becomes.
The classical statement of the CLT concerns the population mean ($\theta = \mu$) estimated via the sample mean ($\hat{\theta} = \bar{x}$). In such a case, we speak about the sampling distribution of the sample mean.
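
The sampling distribution of the sample mean can be explored with a small simulation. The sketch below is illustrative only: the Exponential population, the helper name `sample_means`, and the values of `k`, `n` and the seed are assumptions chosen for the example, not part of the discussion above.

```python
import random
import statistics

def sample_means(k, n, rate=1.0, seed=0):
    """Draw k independent samples of size n from an Exponential(rate)
    population and return the list of the k sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(k)
    ]

# The Exponential(1) population is clearly skewed; its mean and its
# standard deviation are both equal to 1.
means = sample_means(k=2000, n=50)

# The sample means should centre near the population mean (1.0), with a
# spread near sigma / sqrt(n) = 1 / sqrt(50), i.e. roughly 0.14.
print("average of sample means:", statistics.fmean(means))
print("std. dev. of sample means:", statistics.stdev(means))
```

Plotting a histogram of `means` should show a roughly bell-shaped curve, even though the population itself is far from Normal.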

Coming back to our original question, we may be interested in evaluating how close our estimated sample statistic is to the true statistic.
A confidence interval gives an estimated range of values, instead of a simple single-point estimate, which is likely to include an unknown population parameter; the estimated range is calculated from a given set of sample data.
Together with the confidence interval comes the confidence level, which describes the uncertainty associated with a sampling method. Suppose we use the same sampling method to select different samples and to compute a different interval estimate for each sample. Some interval estimates would include the true population parameter and some would not. Therefore, a 0.90 confidence level (i.e. 90%) means that we would expect 90% of the interval estimates to include the population parameter; a 0.95 confidence level means that we would expect 95% of the intervals to include the parameter; and so on.
Note that this is fundamentally different from saying that a 90% confidence interval indicates that there exists a 90% chance of the true population statistic falling in that interval. In fact, a population statistic is not a random variable but a constant. Therefore, given any specific interval, the probability of the population statistic being in that interval is either 0 (if the constant does not fall in that range) or 1 (if it does).
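
This repeated-sampling reading of the confidence level can be checked empirically. The following is a minimal sketch under stated assumptions: a Normal population with known $\sigma$, and the margin of error given by the known-$\sigma$ formula from the construction steps later in this article. The function name `coverage` and all parameter values are hypothetical.

```python
import random
import statistics

def coverage(trials=2000, n=50, mu=10.0, sigma=2.0, gamma=0.95, seed=1):
    """Fraction of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    # Critical value and margin of error for a known population sigma.
    z = statistics.NormalDist().inv_cdf(1 - (1 - gamma) / 2)
    eps = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The true mean mu is fixed; only the interval changes between trials.
        if xbar - eps <= mu <= xbar + eps:
            hits += 1
    return hits / trials

observed = coverage()
print("fraction of intervals containing mu:", observed)
```

The printed fraction should be close to the chosen confidence level of 0.95. Each individual interval either contains $\mu$ or it does not; the 95% describes the procedure, not any single interval.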

To fully express a confidence interval, one needs three pieces of information:

• Confidence level $\gamma=(1-\alpha)$ where $\alpha$ usually denotes the significance level;
• Sample statistic $\hat{\theta}$ (estimate of the true population statistic $\theta$);
• Margin of error $\epsilon$.

Intuitively, we want $\epsilon$ to be an upper bound on the error we make by estimating $\theta$ with $\hat{\theta}$:
$|\hat{\theta} - \theta| \le \epsilon \Rightarrow \hat{\theta} - \epsilon \le \theta \le \hat{\theta} + \epsilon$
Putting it all together, the confidence interval $C$ is defined by the following range:
$C = [\hat{\theta} - \epsilon, \hat{\theta} + \epsilon]$
The uncertainty associated with the confidence interval $C$ is specified by the confidence level $\gamma$.

### How to Construct a Confidence Interval

There are four steps to constructing a confidence interval.

1) Choose a sample statistic of interest $\hat{\theta}$ (e.g., sample mean, sample proportion) that will be used to estimate a population parameter $\theta$ (assume $n$ is the sample size).

2) Select a confidence level $\gamma=(1-\alpha)$. As noted above, the confidence level describes the uncertainty of a sampling method. Researchers often choose $\gamma \in \{0.90, 0.95, 0.99\}$, though any level can be used.

3) Find the margin of error $\epsilon$. Depending on whether the standard deviation of the population is known, one of the following two equations is used to compute the margin of error (the rightmost expressions are written for the sample mean):
$\epsilon = x^{*} \cdot \text{StdDev}(\hat{\theta}) = x^{*} \cdot \frac{\sigma}{\sqrt{n}}$
$\epsilon = x^{*} \cdot \text{StdErr}(\hat{\theta}) = x^{*} \cdot \frac{s}{\sqrt{n}}$
where $x^{*}$ is the critical value and $\text{StdDev}(\hat{\theta})$ and $\text{StdErr}(\hat{\theta})$ are the standard deviation and the standard error of the sample statistic, respectively. The former can be computed only if the true standard deviation of the population (i.e. $\sigma$) is known, whereas the latter, which is an estimate of the standard deviation of the sample statistic, can be computed from $s$, the sample estimate of the standard deviation of the population.
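
The two margin-of-error formulas translate directly into code. This is a minimal sketch for the sample mean; the function names and the example numbers are illustrative assumptions, and the critical value is passed in as given.

```python
import math
import statistics

def margin_known_sigma(critical_value, sigma, n):
    """Margin of error when the population standard deviation sigma is known."""
    return critical_value * sigma / math.sqrt(n)

def margin_from_sample(critical_value, sample):
    """Margin of error when sigma is unknown and estimated by s.

    statistics.stdev uses the n - 1 denominator, i.e. the usual
    sample estimate s of the population standard deviation.
    """
    s = statistics.stdev(sample)
    return critical_value * s / math.sqrt(len(sample))

# Hypothetical numbers: critical value 1.96, sigma = 2.0, n = 100.
print(margin_known_sigma(1.96, sigma=2.0, n=100))

# Hypothetical sample of eight observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(margin_from_sample(1.96, data))
```

Both functions differ only in where the standard deviation comes from, which mirrors the distinction between $\text{StdDev}(\hat{\theta})$ and $\text{StdErr}(\hat{\theta})$.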

How to find the Critical Value $x^{*}$?
The critical value is a factor used to compute the margin of error. We describe how to find this critical value when the sampling distribution of the sample statistic is Normal or nearly Normal.

Recall that the Central Limit Theorem is a statement about large samples. As a practical rule of thumb, the sampling distribution of the sample statistic can be treated as Normal or nearly Normal if any of the following conditions applies:

• The population distribution is Normal;
• The sample data are symmetric, unimodal, without outliers, and the sample size is 15 or less;
• The sample data are moderately skewed, unimodal, without outliers, and the sample size is between 16 and 40;
• The sample size is greater than 40, without outliers.

When one of these conditions is satisfied, the critical value can be expressed as a z-score or as a t-score. To find the critical value, follow these steps.
– Compute $\alpha = 1-\gamma$;
– Find the critical probability value $p^{*} = 1 - \alpha/2$;
Then, to express the critical value as a z-score, find the value $x^{*} = z$ whose cumulative probability equals the critical probability $p^{*}$. Let the standardized sample statistic be represented by a random variable $X$, and let $f_X$ be the density function of its sampling distribution, which, by the CLT, approaches the density of a standard Normal distribution. Then, given the corresponding cumulative distribution function $F_X(x) = P(X\leq x)$, the critical value can be computed using the inverse cumulative distribution function (or quantile function) $F_X^{-1}$:
$x^{*} = z = F_X^{-1}(p^{*})$, where $F_X(z) = P(X \leq z) = \int_{-\infty}^z f_X(x)~dx = p^{*}$
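
The z-score steps above can be sketched with Python's standard library, whose `statistics.NormalDist` class provides the quantile function of the Normal distribution as `inv_cdf`. The helper name `z_critical` is an assumption made for this example.

```python
from statistics import NormalDist

def z_critical(gamma):
    """Critical z-score for a two-sided interval at confidence level gamma."""
    alpha = 1 - gamma            # significance level
    p_star = 1 - alpha / 2       # critical probability
    # Quantile function of the standard Normal distribution (mean 0, sd 1).
    return NormalDist().inv_cdf(p_star)

for gamma in (0.90, 0.95, 0.99):
    print(gamma, round(z_critical(gamma), 3))
```

For these three levels the values should match the familiar table entries of approximately 1.645, 1.960 and 2.576.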

If we want to express the critical value as a t-score, instead, the following steps are needed:
– Find the degrees of freedom (DF). When estimating a population mean or a proportion from a single sample, DF is equal to the sample size minus one. For other applications, the degrees of freedom may be calculated differently.
The critical t-score $x^{*} = t$ is the value of the Student's t distribution with DF degrees of freedom whose cumulative probability equals the critical probability $p^{*}$ defined above.
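
The critical t-score can be found numerically in the same way. The standard library has no Student's t quantile function, so the sketch below integrates the t density with Simpson's rule and inverts the result by bisection. This is an illustrative approach only, with hypothetical helper names `t_pdf`, `t_cdf` and `t_quantile`; in practice one would normally rely on a statistics library or a printed t table.

```python
import math

def t_pdf(x, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via symmetry, F(x) = 0.5 + integral of the density over [0, x],
    with the integral approximated by Simpson's rule (steps must be even)."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df, tol=1e-8):
    """Invert the CDF by bisection; this sketch only handles 0.5 <= p < 1."""
    if not 0.5 <= p < 1:
        raise ValueError("this sketch only handles 0.5 <= p < 1")
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:     # widen the bracket until it holds the quantile
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 95% confidence level and a sample of size 10, hence DF = 9.
gamma, n = 0.95, 10
p_star = 1 - (1 - gamma) / 2
print(t_quantile(p_star, df=n - 1))
```

With DF = 9 the result should be close to the tabulated value of about 2.262, noticeably larger than the corresponding z-score of about 1.960.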

Should one express the critical value as a z-score or as a t-score? There are several ways to answer this question. As a practical matter, when the sample size is large (greater than 40), it doesn’t make much difference. Both approaches yield similar results. Strictly speaking, when the standard deviation of the population $\sigma$ is unknown or when the sample size is small, the t-score is preferred. Nevertheless, often the z-score is used exclusively.

4) Specify the confidence interval $C$. The uncertainty is denoted by the confidence level $\gamma$ and the range of the confidence interval is defined by the following equation:
$C = [\hat{\theta} - \epsilon, \hat{\theta} + \epsilon]$
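
The four steps can be combined into a single routine. The sketch below is one possible arrangement for the sample mean, assuming a z-score critical value together with the sample standard deviation $s$, which the discussion above suggests is reasonable only for large samples; for small samples a t-score would be used instead. The function name and the example data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, gamma=0.95):
    """Two-sided confidence interval for the population mean (z-based sketch)."""
    n = len(sample)
    # Step 1: the sample statistic, here the sample mean.
    theta_hat = statistics.fmean(sample)
    # Step 2: the confidence level gamma gives the critical probability.
    p_star = 1 - (1 - gamma) / 2
    # Step 3: critical value times the standard error gives the margin of error.
    z = statistics.NormalDist().inv_cdf(p_star)
    eps = z * statistics.stdev(sample) / math.sqrt(n)
    # Step 4: the interval itself.
    return theta_hat - eps, theta_hat + eps

# Hypothetical sample: the integers 1..50 as observations.
data = [float(x) for x in range(1, 51)]
low, high = mean_confidence_interval(data, gamma=0.95)
print(f"95% confidence interval: [{low:.3f}, {high:.3f}]")
```

The returned pair corresponds to $[\hat{\theta} - \epsilon, \hat{\theta} + \epsilon]$; raising `gamma` widens the interval, since the critical value grows.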

Finally, the intuition behind confidence intervals is closely related to the idea of hypothesis testing, which we explore in another article.