
Maximum Likelihood Estimation (MLE)

Roughly speaking, a typical statistical/machine learning problem involves finding a function estimate \hat{f} (among a set of candidate functions \mathcal{H}, called the hypothesis space) which best approximates a target, yet unknown, function f, i.e. such that \hat{f} \approx f.
Usually, the estimate \hat{f} is derived from a sample of observed data. Moreover, it is sometimes useful to make some “reasonable” assumptions about the shape of the target f.

As an example, let us assume we are interested in the outcome of an experiment represented by a continuous random variable X (similar considerations apply to the discrete case). Also, let the values of X come from a distribution with an unknown probability density function (PDF) denoted by f_X(x) and a cumulative distribution function (CDF) denoted by F_X(x), so that:

F_X(x) = P(X \leq x) = \int_{-\infty}^x f_X(t)~dt
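For instance, assuming a standard normal purely for illustration, this relation between CDF and PDF can be checked numerically in Python:

# A minimal sketch: F_X(x) should equal the integral of f_X(t) from -infinity to x.
import numpy as np
from scipy import stats, integrate

x = 1.5
pdf_integral, _ = integrate.quad(stats.norm.pdf, -np.inf, x)  # \int_{-\infty}^{x} f_X(t) dt
cdf_value = stats.norm.cdf(x)                                 # F_X(x) = P(X <= x)
print(pdf_integral, cdf_value)                                # both should be approximately 0.9332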

Finally, let us conjecture that, though unknown, the target PDF f_X belongs to a certain parametric family of distributions defined as \mathcal{F}=\{f(x;{\boldsymbol \theta}), {\boldsymbol \theta} \in {\boldsymbol \Theta}\}, where {\boldsymbol \theta} is a vector of parameters governing the parametric model. Note that a function f parametrized by the vector {\boldsymbol \theta} can sometimes also be written as f_{{\boldsymbol \theta}}(x) or f(x|{\boldsymbol \theta}), i.e., “a function f of the variable x given the parameter vector {\boldsymbol \theta}”.
We usually denote by {\boldsymbol \theta}_X \in {\boldsymbol \Theta} the true, unknown value of the parameter vector specific to f_X \in \mathcal{F}, and we write f_X = f(x;{\boldsymbol \theta}_X). If we consider another random variable Y whose probability density function f_Y belongs to the same parametric model, i.e. f_Y \in \mathcal{F}, it will be governed by another parameter vector {\boldsymbol \theta}_Y, which nevertheless still belongs to {\boldsymbol \Theta}.
However, to keep the notation simple, let us write f_X = f(x;{\boldsymbol \theta}), where we have dropped the subscript X from the vector of parameters. Since f_X is and will remain unknown, it would at least be desirable to find an estimate \hat{{\boldsymbol \theta}} that is as close to the true value {\boldsymbol \theta} as possible.
By doing so, we reduce the problem of approximating f_X with \hat{f}_X to that of estimating {\boldsymbol \theta} with \hat{{\boldsymbol \theta}}, i.e. f(x;{\boldsymbol \theta}) \approx f(x;\hat{{\boldsymbol \theta}}).

Maximum Likelihood Estimation (MLE) is a technique which allows us to find such an estimate from a sample of independent and identically distributed (i.i.d.) random variables X_1, \ldots, X_n, where each X_i has exactly the same CDF F_X (and PDF f_X) as the random variable of interest X. Concretely, we have access to the observed value x_i of each random variable X_i in the sample. Those values, also called random variates, are the actual observations/realizations of the random variables of the sample, i.e. X_1=x_1, \ldots, X_n=x_n.
The i.i.d. assumption means two things:
i) independence between observations (i.e. no observation is affected by any of the others);
ii) all the observations are drawn from the same underlying and unknown PDF, i.e., f_X = f(x;{\boldsymbol \theta}).
To use the method of MLE, we first need to specify the joint density function of all the observations. For an independent and identically distributed sample, this joint density function factorizes as:

f(x_1, \ldots, x_n;{\boldsymbol \theta}) = f(x_1;{\boldsymbol \theta}) \times \ldots \times f(x_n;{\boldsymbol \theta}) = \prod_{i=1}^n f(x_i;{\boldsymbol \theta})

So far, we have regarded the function above as a function of the variables x_1, \ldots, x_n, with {\boldsymbol \theta} a vector of fixed parameters. If we instead treat the observed values x_1, \ldots, x_n as fixed and let {\boldsymbol \theta} be the function’s (vector) variable, free to change, then the function above turns into the likelihood function L:

L({\boldsymbol \theta};x_1, \ldots, x_n) = f(x_1, \ldots, x_n;{\boldsymbol \theta}) = \prod_{i=1}^n f(x_i;{\boldsymbol \theta})
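For instance, assuming purely for illustration that the parametric family is the exponential one, f(x;\theta) = \theta e^{-\theta x} with \theta > 0, the likelihood of a (hypothetical) sample can be evaluated at different candidate values of \theta:

# A minimal sketch: evaluate L(theta; x_1, ..., x_n) for an assumed exponential family.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=50)  # hypothetical i.i.d. sample with true rate theta = 2

def likelihood(theta, x):
    # L(theta; x_1, ..., x_n) = prod_i f(x_i; theta), with f(x; theta) = theta * exp(-theta * x)
    return np.prod(theta * np.exp(-theta * x))

print(likelihood(1.0, x), likelihood(2.0, x))  # the candidate closer to the true theta typically scores higher

Note that for large samples the product of many small densities quickly underflows in floating point, which is one practical reason for preferring the log-likelihood introduced below.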

The aim of MLE is thus to find a value of {\boldsymbol \theta} – i.e. an estimate \hat{{\boldsymbol \theta}}_{MLE} – which maximizes the likelihood function, that is:

\hat{{\boldsymbol \theta}}_{MLE} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{L({\boldsymbol \theta};x_1, \ldots, x_n) \bigg\} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{ \prod_{i=1}^n f(x_i;{\boldsymbol \theta})\bigg\}

In practice, it is mathematically more convenient to work with the (natural) logarithm of the likelihood function, called the log-likelihood, and to maximize that instead of the likelihood itself.
Note that the logarithmic transformation does not affect the final estimate: since the logarithm is a strictly increasing function, the likelihood and the log-likelihood attain their maximum at the same point (whenever that maximum exists), so the MLE is the same regardless of which of the two we maximize.
Therefore, the log-likelihood function can be written as follows:

\ln \big(L({\boldsymbol \theta};x_1, \ldots, x_n)\big) = \ln \big(\prod_{i=1}^n f(x_i;{\boldsymbol \theta})\big) = \sum_{i=1}^n \ln\big(f(x_i;{\boldsymbol \theta})\big)

Finally, the maximizer of the log-likelihood function is computed as follows:

\hat{{\boldsymbol \theta}}_{MLE} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{ \ln \big(L({\boldsymbol \theta};x_1, \ldots, x_n)\big) \bigg\} = \text{argmax}_{{\boldsymbol \theta}\in {\boldsymbol \Theta}} \bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\}
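Continuing the illustrative exponential example from above, the maximization can also be carried out numerically, for instance by minimizing the negative log-likelihood with SciPy (the sample below is again hypothetical):

# A minimal sketch: numerically maximize the log-likelihood for an assumed exponential family.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=50)  # hypothetical i.i.d. sample with true rate theta = 2

def neg_log_likelihood(theta, x):
    # -sum_i ln f(x_i; theta), with f(x; theta) = theta * exp(-theta * x)
    return -np.sum(np.log(theta) - theta * x)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), args=(x,), method="bounded")
print(res.x, 1 / x.mean())  # the numerical MLE should match the closed-form estimate 1 / (sample mean)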

In order to find the parameter vector \hat{{\boldsymbol \theta}}_{MLE} which maximizes the (log-)likelihood function, one typically computes the partial derivatives with respect to each of the parameters of the vector {\boldsymbol \theta}, namely the gradient, and sets them all to 0. If, for instance, {\boldsymbol \theta} = (\theta_1, \ldots, \theta_m), then we solve the following system:

\frac{\partial}{\partial \theta_1}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\} = \frac{\partial}{\partial \theta_1}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;\theta_1, \ldots, \theta_m)\big)\bigg\} = 0
\frac{\partial}{\partial \theta_2}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\} = \frac{\partial}{\partial \theta_2}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;\theta_1, \ldots, \theta_m)\big)\bigg\} = 0
\vdots
\frac{\partial}{\partial \theta_m}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;{\boldsymbol \theta})\big)\bigg\} = \frac{\partial}{\partial \theta_m}\bigg\{ \sum_{i=1}^n \ln \big(f(x_i;\theta_1, \ldots, \theta_m)\big)\bigg\} = 0
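As a classical worked example, consider the normal family f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} with {\boldsymbol \theta} = (\mu, \sigma^2). Setting the two partial derivatives of the log-likelihood to zero and solving gives the well-known closed-form estimates

\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_{MLE})^2

i.e. the sample mean and the (biased, since it divides by n rather than n-1) sample variance. When no closed form exists, the same maximization is carried out numerically, as sketched above.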

