Roughly speaking, a typical statistical/machine learning problem involves finding a function *estimate* $\hat{f}$ (among a set of possible candidate functions called the hypothesis space $\mathcal{H}$), which best approximates a *target* yet *unknown* function $f$, namely such that $\hat{f} \approx f$.

Usually, the estimate $\hat{f}$ can be derived from a sample of observed data of interest. Moreover, sometimes it is useful to make some “reasonable” assumptions on the shape of the target $f$.

As an example, let us assume we are interested in the outcome of an experiment which is represented by the continuous random variable $X$ (similar considerations apply to the case of a discrete $X$). Also, let the values of $X$ come from a distribution with an unknown *probability density function* (PDF) denoted by $f_X(x)$ and a *cumulative distribution function* (CDF) denoted by $F_X(x)$, so that:

$$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t)\, dt$$

Finally, let us conjecture that, though unknown, the target PDF $f_X(x)$ belongs to a certain family of parametric distributions defined as $\{f(x; \theta),\ \theta \in \Theta\}$, where $\theta$ is a vector of parameters governing the parametric model. Note that sometimes a function parametrized by the vector $\theta$ can also be written as $f(x; \theta)$ or $f(x \mid \theta)$, i.e., “a function of the variable $x$ given the parameter vector $\theta$”.
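
As a concrete illustration (a hypothetical choice, not implied by the general setup above), such a parametric family could be the Gaussian one, whose parameter vector collects the mean and the variance:

$$\left\{ f(x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\quad \theta = (\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}_{>0} \right\}$$

We will occasionally come back to this hypothetical family in the examples below.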

We usually denote by $\theta_0$ the *unknown* and *true* value of the parameter vector specific to $X$, and we write $f_X(x) = f(x; \theta_0)$. If we consider another random variable $Y$ whose probability density function belongs to the same parametric model, i.e. $f_Y(y) = f(y; \theta_1)$, this will be regulated by another vector of parameters $\theta_1$, which however still belongs to $\Theta$.

However, to keep the notation simple, let us write $f_X(x) = f(x; \theta)$, where we have removed the subscript from the vector of parameters. Since $\theta$ is and will remain unknown, at least it would be desirable to find an estimate $\hat{\theta}$ which would be as close to the true value $\theta$ as possible.

By doing so, we are actually reducing the problem of approximating $f_X$ with $\hat{f}_X$ to that of estimating $\theta$ with $\hat{\theta}$, i.e. $\hat{f}_X(x) = f(x; \hat{\theta})$.

*Maximum Likelihood Estimation* (MLE) is a technique which allows us to find such an estimate $\hat{\theta}$ from a **sample** of *independent and identically distributed* (i.i.d.) random variables $X_1, X_2, \ldots, X_n$, where each $X_i$ has exactly the same CDF $F_X$ (and PDF $f_X$) as the random variable of interest $X$. Concretely, we have access to the *actual value* of each random variable which composes the sample. Those values, also called *random variates*, are the actual observations/realizations of the random variables of the sample, i.e. $x_1, x_2, \ldots, x_n$.

The i.i.d. assumption means two things:

*i)* independence between observations (i.e. each observation is not affected by any of the others);

*ii)* all the observations are drawn from the same underlying and unknown PDF, i.e., $f_{X_1}(x) = f_{X_2}(x) = \dots = f_{X_n}(x) = f_X(x)$.
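
In code, the distinction between the random variables and their observed variates simply amounts to drawing $n$ numbers from the (in practice unknown) distribution. The minimal sketch below assumes, purely for illustration, that the true distribution happens to be the hypothetical Gaussian family above with mean $\mu_0 = 2.0$ and standard deviation $\sigma_0 = 1.5$:

```python
import numpy as np

# Purely illustrative assumption: the "unknown" true distribution is
# Gaussian with mu_0 = 2.0 and sigma_0 = 1.5 (in practice we never know this).
rng = np.random.default_rng(seed=0)
n = 1000

# x holds the random variates x_1, ..., x_n, i.e. the observed
# realizations of the i.i.d. random variables X_1, ..., X_n.
x = rng.normal(loc=2.0, scale=1.5, size=n)
print(x[:5])  # first few observations of the sample
```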

To use the method of MLE, we first need to specify the *joint density function* of all the observations. For an independent and identically distributed sample, this joint density function is:

$$f(x_1, x_2, \ldots, x_n; \theta) = f(x_1; \theta) \cdot f(x_2; \theta) \cdots f(x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
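
For instance, under the hypothetical Gaussian family used above, the product takes the explicit form:

$$f(x_1, \ldots, x_n; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right)$$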

So far, we have been considering the function above where $x_1, x_2, \ldots, x_n$ have been the *variables* and $\theta$ the *vector of fixed parameters*. If we instead consider the observed values $x_1, x_2, \ldots, x_n$ as the fixed parameters and $\theta$ as the function’s (vector) variable which is allowed to change freely, then the function above turns into the *likelihood function* $\mathcal{L}(\theta; x_1, \ldots, x_n)$:

$$\mathcal{L}(\theta; x_1, \ldots, x_n) = f(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$$

The aim of MLE is thus to find a value of $\theta$ – i.e. an estimate $\hat{\theta}$ – which maximizes the likelihood function, that is:

$$\hat{\theta} = \arg\max_{\theta \in \Theta}\ \mathcal{L}(\theta; x_1, \ldots, x_n)$$

In fact, it is mathematically more convenient to work with the (natural) logarithm of the likelihood function, called the *log-likelihood*, and to maximize that instead of the likelihood itself.

Note that the logarithmic transformation does not affect the final estimate produced because, as long as the maximum of the likelihood function exists, so does the maximum of the log-likelihood, and the MLE estimate is the same regardless of whether we maximize the likelihood or the log-likelihood function, since the logarithm is a strictly monotonically increasing function.

Therefore, the log-likelihood function can be written as follows:

$$\ell(\theta; x_1, \ldots, x_n) = \ln \mathcal{L}(\theta; x_1, \ldots, x_n) = \sum_{i=1}^{n} \ln f(x_i; \theta)$$
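
Continuing the hypothetical Gaussian example, the logarithm turns the product above into a sum:

$$\ell(\mu, \sigma^2; x_1, \ldots, x_n) = -\frac{n}{2}\ln\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$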

Finally, the maximum of the log-likelihood function is computed as follows:

$$\hat{\theta} = \arg\max_{\theta \in \Theta}\ \ell(\theta; x_1, \ldots, x_n)$$
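
When no closed-form solution is available (or simply as a sanity check), this maximization can be carried out numerically. The sketch below is a minimal, purely illustrative example: it assumes the hypothetical Gaussian model and a synthetic sample like the one generated earlier, and it minimizes the *negative* log-likelihood with `scipy.optimize.minimize`, which is equivalent to maximizing the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic sample from the illustrative "true" model (see the earlier sketch).
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

def neg_log_likelihood(params, data):
    """Negative Gaussian log-likelihood; sigma is optimized on a log scale
    so that it stays strictly positive."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2)
    return -np.sum(log_pdf)  # maximizing the log-likelihood == minimizing its negative

# Derivative-free minimization starting from an arbitrary initial guess.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]),
                  args=(x,), method="Nelder-Mead")

mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # expected to be close to 2.0 and 1.5
```

Optimizing over $\ln\sigma$ rather than $\sigma$ is just a convenient way to enforce the positivity constraint without resorting to a constrained solver.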

In order to find the parameter vector $\hat{\theta}$ which maximizes the (log-)likelihood function, one typically computes all the partial derivatives with respect to each of the parameters of the vector $\theta$, namely the **gradient**, and sets them all to 0. If, for instance, $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, then we compute the following:

$$\frac{\partial\, \ell(\theta; x_1, \ldots, x_n)}{\partial \theta_1} = 0$$

$$\frac{\partial\, \ell(\theta; x_1, \ldots, x_n)}{\partial \theta_2} = 0$$

…
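
To make this last step concrete, consider once more the hypothetical Gaussian family with $\theta = (\mu, \sigma^2)$. Setting the two partial derivatives of its log-likelihood to zero gives (a standard derivation, sketched here only as an illustration):

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Longrightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0 \;\Longrightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$

i.e. the maximum likelihood estimates are the sample mean and the (biased) sample variance, which, up to numerical precision, is also what the optimization sketch above converges to.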