
Hoeffding’s Inequality

Hoeffding’s inequality is a crucial result in probability theory, as it provides an upper bound on the probability that the sum of a sample of independent random variables deviates from its expected value.

Although the bound holds for general independent random variables (not necessarily identically distributed) with some slight modifications, here we restrict ourselves to the special case of a sample of $n$ independent and identically distributed (i.i.d.) Bernoulli random variables $X_1, X_2, \ldots, X_n$, that is, a Bernoulli process. By definition, the outcome of each variable is either $0$ (failure) or $1$ (success), namely $X_i \in \{0,1\}~\forall i \in \{1,\ldots,n\}$. Moreover, the probability of success (resp., failure) is described by the same parameter $\theta$ (resp., $1-\theta$) for all the variables $X_i$. More formally, we state that $X_i \sim \mathcal{B}(\theta)~\forall i \in \{1,\ldots,n\}$, where $\mathcal{B}$ denotes the Bernoulli distribution and $\theta$ is the parameter governing the distribution.

Generally speaking, if we want to provide a (single-point) estimate $\hat{\theta}$ of the true population parameter $\theta$ from the sample of $n$ observed Bernoulli random variables, we just need to measure the frequency of successes out of the total number of trials, that is, how many variables out of $n$ output $1$. This is often referred to as the Maximum Likelihood Estimate (MLE), as it maximizes the probability of seeing the observed data.

$\hat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$

Note also that $\hat{\theta}$ and $\theta$ can be considered as the sample mean and the population mean, respectively.
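As a minimal sketch, the MLE is just the fraction of successes in the sample (the observations below are made up purely for illustration):

```python
# MLE of the Bernoulli parameter theta as the frequency of successes.
# The sample below is a hypothetical run of n = 10 Bernoulli trials.
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

n = len(sample)
theta_hat = sum(sample) / n  # fraction of successes = sample mean
print(theta_hat)  # 0.7
```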

Hence, the (one-sided) inequality proposed by Hoeffding states that:
$P(\hat{\theta} - \theta \geq \epsilon) \leq e^{-2n\epsilon^2}$
By symmetry, the above inequality is also valid for the other side of the difference:
$P(-\hat{\theta} +\theta \geq \epsilon) \leq e^{-2n\epsilon^2}$
Finally, adding up the two one-sided inequalities (a union bound) we obtain the two-sided inequality:
$P(|\hat{\theta} -\theta| \geq \epsilon) \leq 2e^{-2n\epsilon^2}$

More generally, Hoeffding’s inequality states that the probability of the estimation error $|\hat{\theta} -\theta|$ being greater than a positive constant $\epsilon > 0$ is bounded by the quantity $2e^{-2n\epsilon^2}$.

Intuitively, we want the estimate $\hat{\theta}$ to be close to the true $\theta$. Therefore, for a fixed and very small $\epsilon$, we want the above probability to be small as well, namely we would like $2e^{-2n\epsilon^2}$ to be close to $0$!
If we look carefully at the inequality, we notice that the bound does not depend on $\theta$ at all, which is nice. Notice also that it decays exponentially with $n$, which is also very good: as one might expect, the more observations we collect, the smaller the probability of a bad estimation error.
However, the exponent also includes the squared term $\epsilon^2$, which goes rapidly to $0$ as $\epsilon$ gets smaller, thereby pushing the overall exponent $-2n\epsilon^2$ toward $0$. That means the “price” we have to pay, in terms of the number of observations needed, for a very restrictive bound on the error can be very high.
In the unlucky edge case where $-2n\epsilon^2 \to 0$, the bound is in fact practically useless, since it would simply tell us that:
$P(|\hat{\theta} - \theta| \geq \epsilon) \leq 2e^{0} \longrightarrow P(|\hat{\theta} - \theta| \geq \epsilon) \leq 2$
which is something we already know from the basic axioms of probability! (In fact, those axioms provide an even stricter bound, since the probability of any event always ranges between $0$ and $1$.)
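We can sanity-check the two-sided bound with a quick Monte Carlo simulation; the values of $n$, $\theta$, and $\epsilon$ below are arbitrary choices for illustration:

```python
import random
from math import exp

random.seed(0)

n = 50           # sample size per experiment (arbitrary choice)
theta = 0.3      # true Bernoulli parameter (arbitrary choice)
eps = 0.1        # deviation threshold
trials = 20_000  # number of repeated experiments

# Empirical frequency of the "bad" event |theta_hat - theta| >= eps
exceed = 0
for _ in range(trials):
    theta_hat = sum(random.random() < theta for _ in range(n)) / n
    if abs(theta_hat - theta) >= eps:
        exceed += 1
empirical = exceed / trials

# Hoeffding's two-sided bound: 2 e^{-2 n eps^2}
bound = 2 * exp(-2 * n * eps**2)

print(f"empirical = {empirical:.4f}, bound = {bound:.4f}")
```

The empirical probability stays well below the bound, and (consistently with the discussion above) the bound itself is computed without ever looking at $\theta$.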

The probability described by Hoeffding’s inequality can be interpreted as the significance level $\alpha$ (i.e. the upper bound on the probability of making an error) for a confidence interval around $\theta$ of size $2\epsilon$.

$P(|\hat{\theta} - \theta| \geq \epsilon) = P(\hat{\theta} \notin [\theta -\epsilon, \theta + \epsilon]) \leq 2e^{-2n\epsilon^2}\leq \alpha$

Solving the inequality above for $n$ gives us the number of observations (i.e. the number of times the Bernoulli trial should be repeated) required for $\hat{\theta}=\theta\pm \epsilon$ at the $(1-\alpha)$ confidence level.

$2e^{-2n \epsilon^2} \leq \alpha$

$e^{-2n\epsilon^2} \leq \frac{\alpha}{2}$

Applying the natural logarithm $\textrm{log}_{e} = \textrm{ln}$ to both sides of the inequality above results in

$\textrm{ln}(e^{-2n\epsilon^2}) \leq \textrm{ln}\big(\frac{\alpha}{2}\big)$

$-2n \epsilon^2 \leq \textrm{ln} \big(\frac{\alpha}{2}\big)$

$-n \leq \frac{1}{2 \epsilon^2}\textrm{ln} \big(\frac{\alpha}{2}\big)$

$n \geq -\frac{1}{2 \epsilon^2}\textrm{ln} \big(\frac{\alpha}{2}\big)$

Therefore, this gives us a lower bound on the number of observations we would need in order to be at least $100(1-\alpha) \%$ confident that our estimate $\hat{\theta}$ differs at most $\epsilon$ from the true value $\theta$.

As an example, consider how large the sample size $n$ must be if we want to be (at least) $95\%$ confident (i.e. $\alpha = 0.05$) that our estimate is correct within an error $\epsilon = 0.1$. According to the result above, it must hold that

$n \geq - \frac{1}{2\cdot 0.1^2}\textrm{ln}\bigg(\frac{0.05}{2}\bigg)$

$n \geq - \frac{1}{2\cdot 0.01}\textrm{ln}(0.025)$

$n \geq - \frac{1}{\frac{2}{100}}\textrm{ln}(0.025)$

$n \geq (-50)\cdot\textrm{ln}(0.025)$

$n \geq (-50)\cdot (-3.69) \approx 185$

So, with $185$ observations we can be at least $95\%$ confident that our estimation error is bounded by $\epsilon = 0.1$. Of course, if we want a smaller error bound and/or a smaller significance level (resp., a higher confidence level), $n$ must increase accordingly.
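The lower bound derived above is easy to wrap in a small helper (the function name here is just a hypothetical choice):

```python
from math import ceil, log

def hoeffding_sample_size(eps: float, alpha: float) -> int:
    """Smallest integer n such that 2 * exp(-2 * n * eps**2) <= alpha."""
    return ceil(-log(alpha / 2) / (2 * eps**2))

print(hoeffding_sample_size(0.1, 0.05))   # 185, matching the worked example
print(hoeffding_sample_size(0.05, 0.05))  # halving eps roughly quadruples n
```

Note how the $\epsilon^2$ term in the exponent shows up in practice: tightening the error bound by a factor of $2$ multiplies the required sample size by roughly $4$.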