
# Maximum Likelihood Estimation (MLE)

Roughly speaking, a typical statistical/machine learning problem involves finding a function estimate $\hat{f}$ (from a set of candidate functions $\mathcal{H}$ called the hypothesis space) which best approximates an unknown target function $f$, namely such that $\hat{f} \approx f$.
Usually, the estimate $\hat{f}$ is derived from a sample of observed data of interest. Moreover, it is sometimes useful to make "reasonable" assumptions about the shape of the target $f$.

As an example, let us assume we are interested in the outcome of an experiment represented by the continuous random variable $X$ (similar considerations hold when $X$ is discrete). Also, let the values of $X$ come from a distribution with an unknown probability density function (PDF) denoted by $f_X(x)$ and a cumulative distribution function (CDF) denoted by $F_X(x)$, so that:

$F_X(x) = P(X \leq x) = \int_{-\infty}^x f_X(t)~dt$

Finally, let us conjecture that, though unknown, the target PDF $f_X$ belongs to a certain family of parametric distributions defined as $\mathcal{F}=\{f(x;{\boldsymbol \theta}), {\boldsymbol \theta} \in {\boldsymbol \Theta}\}$, where ${\boldsymbol \theta}$ is a vector of parameters governing the parametric model. Note that a function $f$ parametrized by the vector ${\boldsymbol \theta}$ can also be written as $f_{{\boldsymbol \theta}}(x)$ or $f(x|{\boldsymbol \theta})$, i.e., "a function $f$ of the variable $x$ given the parameter vector ${\boldsymbol \theta}$".
We usually denote by ${\boldsymbol \theta}_X \in {\boldsymbol \Theta}$ the true, unknown value of the parameter vector specific to $f_X \in \mathcal{F}$, and we write $f_X = f(x;{\boldsymbol \theta}_X)$. If we consider another random variable $Y$ whose probability density function $f_Y$ belongs to the same parametric model, i.e. $f_Y \in \mathcal{F}$, this will be governed by another vector of parameters ${\boldsymbol \theta}_Y$, which nevertheless still belongs to ${\boldsymbol \Theta}$.
However, to keep the notation simple, let us write $f_X = f(x;{\boldsymbol \theta})$, where we have dropped the subscript $X$ from the vector of parameters. Since $f_X$ is and will remain unknown, it would at least be desirable to find an estimate $\hat{{\boldsymbol \theta}}$ as close to the true value ${\boldsymbol \theta}$ as possible.
By doing so, we are actually reducing the problem of approximating $f_X$ with $\hat{f}_X$ to that of estimating ${\boldsymbol \theta}$ with $\hat{{\boldsymbol \theta}}$, i.e. $f_X(x;{\boldsymbol \theta}) \approx \hat{f}_X(x;\hat{{\boldsymbol \theta}})$.

Maximum Likelihood Estimation (MLE) is a technique which allows us to find such an estimate from a sample of independent and identically distributed (i.i.d.) random variables $X_1, \ldots, X_n$, where each $X_i$ has exactly the same CDF $F_X$ (and PDF $f_X$) as the random variable of interest $X$. Concretely, we have access to the actual value $x_i$ of each random variable $X_i$ which composes the sample. Those values, also called random variates, are the actual observations/realizations of the random variables of the sample, i.e. $X_1=x_1,\ldots, X_n=x_n$.
The i.i.d. assumption means two things:
i) independence between observations (i.e. each observation is not affected by any of the others);
ii) all the observations are drawn from the same underlying and unknown PDF, i.e., $f_X = f(x;{\boldsymbol \theta})$.
To use the method of MLE, we first need to specify the joint density function of all observations. For an independent and identically distributed sample, this joint density function is:

$f(x_1, \ldots, x_n;{\boldsymbol \theta}) = f(x_1;{\boldsymbol \theta}) \times \ldots \times f(x_n;{\boldsymbol \theta}) = \prod_{i=1}^n f(x_i;{\boldsymbol \theta})$
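The product structure above can be made concrete with a short sketch. Here we assume, purely for illustration, that the parametric family $\mathcal{F}$ is the family of normal distributions with ${\boldsymbol \theta} = (\mu, \sigma)$; the sample values are made up:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def joint_density(xs, mu, sigma):
    """Joint density of an i.i.d. sample: the product of the marginal densities."""
    prod = 1.0
    for x in xs:
        prod *= normal_pdf(x, mu, sigma)
    return prod

# A hypothetical observed sample x_1, ..., x_n
sample = [4.2, 5.1, 4.8, 5.5, 4.9]
print(joint_density(sample, mu=5.0, sigma=0.5))
```

Note that the joint density is larger for parameter values that fit the data well than for values that do not, which is exactly the intuition MLE exploits.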

So far, we have been considering the function above with $x_1, \ldots, x_n$ as the variables and ${\boldsymbol \theta}$ as the vector of fixed parameters. If we instead regard the observed values $x_1, \ldots, x_n$ as fixed and ${\boldsymbol \theta}$ as the function's (vector) variable, which is allowed to change freely, then the function above turns into the likelihood function $L$:

$L({\boldsymbol \theta};x_1, \ldots, x_n) = f(x_1, \ldots, x_n;{\boldsymbol \theta}) = \prod_{i=1}^n f(x_i;{\boldsymbol \theta})$

The aim of MLE is thus to find a value of ${\boldsymbol \theta}$ – i.e. an estimate $\hat{{\boldsymbol \theta}}_{MLE}$ – which maximizes the likelihood function, that is:

$\hat{{\boldsymbol \theta}}_{MLE} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{L({\boldsymbol \theta};x_1, \ldots, x_n) \bigg\} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{ \prod_{i=1}^n f(x_i;{\boldsymbol \theta})\bigg\}$

In practice, it is mathematically more convenient to work with the (natural) logarithm of the likelihood function, called the log-likelihood, and maximize it instead of the likelihood itself.
Note that the logarithmic transformation does not affect the final estimate: since the logarithm is a strictly monotonically increasing function, as long as the maximum of the likelihood function exists, so does the maximum of the log-likelihood, and both are attained at the same value of ${\boldsymbol \theta}$.
Therefore, the log-likelihood function can be written as follows:

$\ln \big(L({\boldsymbol \theta};x_1, \ldots, x_n)\big) = \ln \bigg(\prod_{i=1}^n f(x_i;{\boldsymbol \theta})\bigg) = \sum_{i=1}^n \ln\big(f(x_i;{\boldsymbol \theta})\big)$

Finally, the maximum of the log-likelihood function is computed as follows:

$\hat{{\boldsymbol \theta}}_{MLE} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{ \ln \big(L({\boldsymbol \theta};x_1, \ldots, x_n)\big) \bigg\} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\}$
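Before moving to the analytical approach, it may help to see the argmax computed by brute force. The sketch below again assumes a normal model (with $\sigma$ known and fixed, an assumption made here only to keep the example one-dimensional) and maximizes the log-likelihood over a grid of candidate values of $\mu$:

```python
import math

def log_likelihood(mu, xs, sigma=1.0):
    """Log-likelihood of mu for an i.i.d. N(mu, sigma^2) sample (sigma assumed known)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

# A hypothetical observed sample
sample = [4.2, 5.1, 4.8, 5.5, 4.9]

# Grid search over candidate values of mu in [3.00, 7.00)
candidates = [m / 100 for m in range(300, 700)]
mu_hat = max(candidates, key=lambda m: log_likelihood(m, sample))
print(mu_hat)  # the grid point closest to the sample mean
```

For the normal mean, the MLE turns out to be the sample average, so the grid search lands on (approximately) `sum(sample) / len(sample)`.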

In order to find the parameter vector $\hat{{\boldsymbol \theta}}_{MLE}$ which maximizes the (log-)likelihood function, one typically computes the partial derivatives with respect to each of the parameters of the vector ${\boldsymbol \theta}$, namely the gradient, and sets them all to 0. If, for instance, ${\boldsymbol \theta} = (\theta_1, \ldots, \theta_m)$, then we compute the following:

$\frac{\partial}{\partial \theta_1}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\} = \frac{\partial}{\partial \theta_1}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;\theta_1, \ldots, \theta_m)\big)\bigg\} = 0$
$\frac{\partial}{\partial \theta_2}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\} = \frac{\partial}{\partial \theta_2}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;\theta_1, \ldots, \theta_m)\big)\bigg\} = 0$

$\vdots$

$\frac{\partial}{\partial \theta_m}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\} = \frac{\partial}{\partial \theta_m}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;\theta_1, \ldots, \theta_m)\big)\bigg\} = 0$
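As a worked instance of this gradient-based recipe, consider the normal model with ${\boldsymbol \theta} = (\mu, \sigma^2)$. Setting the two partial derivatives of the log-likelihood to zero yields the well-known closed-form estimates: the sample mean for $\mu$ and the (biased) sample variance for $\sigma^2$. The sketch below, with a made-up sample, computes them and checks numerically that they are indeed a maximum:

```python
import math

def log_likelihood(xs, mu, sigma2):
    """Normal log-likelihood with theta = (mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2) for x in xs)

# A hypothetical observed sample
sample = [4.2, 5.1, 4.8, 5.5, 4.9]
n = len(sample)

# Closed-form solutions obtained by setting the gradient to zero
mu_hat = sum(sample) / n                                  # sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n   # biased sample variance

# Sanity check: perturbing either parameter should not increase the log-likelihood
eps = 1e-3
best = log_likelihood(sample, mu_hat, sigma2_hat)
assert best >= log_likelihood(sample, mu_hat + eps, sigma2_hat)
assert best >= log_likelihood(sample, mu_hat, sigma2_hat + eps)
```

Note that the MLE of the variance divides by $n$ rather than $n-1$, which is why it is a biased (though consistent) estimator.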