Logistic Regression is a well-known example of a linear model, complementing the other two most famous linear models: linear classification and linear regression.

As with other supervised learning approaches, there are three key components that must be specified in order to properly define logistic regression:

- The **model**, which describes the set of hypotheses (the *hypothesis space*) that can possibly be implemented;
- The **error measure** (or *cost function*), which measures the price that must be paid if a misclassification occurs;
- The **learning algorithm**, which is responsible for picking the best hypothesis (according to the error measure) by searching through the hypothesis space.

All three components above differ from their counterparts for linear classification and linear regression. Let’s discuss each of them separately.

**The Model**

Generally speaking, for a model to be *linear* means that, given an input as a $d$-dimensional vector $\mathbf{x}$ whose components are $x_1, \dots, x_d$, we consider the family of real-valued functions having $d+1$ variables $w_0, w_1, \dots, w_d$ (a.k.a. *parameters* or *weights*) and whose output is a real number obtained as a linear combination of the input with the parameters $\mathbf{w}$. More formally:

$$f_{\mathbf{w}}(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \mathbf{x},$$

where we use the standard convention of a constant component $x_0 = 1$ that absorbs the bias weight $w_0$.

The symbol $f_{\mathbf{w}}(\mathbf{x})$ means “the application of the function $f$ parametrized by $\mathbf{w}$ to the input $\mathbf{x}$”; often, this is also written as $s = \mathbf{w}^T \mathbf{x}$ and referred to as the *signal*. The signal is then usually passed through a “thresholding” filter, i.e. another real-valued function $\theta$. Finally, the resulting composite function defines the hypothesis space $\mathcal{H}$:

$$\mathcal{H} = \left\{ h_{\mathbf{w}} \;\middle|\; h_{\mathbf{w}}(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x}) \right\}.$$
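As a quick illustration, the signal can be computed in a few lines of Python (a minimal sketch; the function name and the NumPy dependency are my own choices, not part of the original text):

```python
import numpy as np

# A minimal sketch of the linear signal s = w^T x, using the common
# convention of prepending a constant component x_0 = 1 so that the
# bias weight w_0 is absorbed into the weight vector w.
def signal(w, x):
    """Compute s = w^T x for a d-dimensional input x and d+1 weights."""
    x = np.concatenate(([1.0], x))  # prepend x_0 = 1
    return float(np.dot(w, x))

w = np.array([0.5, 2.0, -1.0])  # w_0, w_1, w_2
x = np.array([1.0, 3.0])        # a 2-dimensional input
print(signal(w, x))             # 0.5*1 + 2.0*1.0 + (-1.0)*3.0 = -0.5
```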

It turns out that, depending on the parametric model ($f_{\mathbf{w}}$) and on the thresholding function ($\theta$), the set of hypotheses changes as well. With a linear parametric model, we can generally associate the following thresholding functions:

This is the $\mathrm{sign}$ function used in linear classification, which takes as input the output of the linear parametric model and produces a *binary* output, i.e. $-1$ or $+1$.

Generally speaking, it is defined as follows:

$$\mathrm{sign}(s) = \begin{cases} +1 & \text{if } s > 0 \\ -1 & \text{if } s < 0 \end{cases}$$

If we plug our parametric linear model into the $\mathrm{sign}$ function, we obtain the following kind of hypotheses:

$$h_{\mathbf{w}}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$$

This is the identity function used in linear regression, which simply passes the output of the linear parametric model through unchanged.

It is defined as follows:

$$\theta(s) = s$$

If we plug our parametric linear model into the identity function, we obtain the following kind of hypotheses:

$$h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$$

This is the logistic function used in logistic regression, which takes as input the output of the linear parametric model and produces a real-valued output in the range $(0, 1)$.

The logistic function is defined as follows:

$$\theta(s) = \frac{e^s}{1 + e^s}$$

Therefore, if we plug our parametric linear model into the logistic function $\theta$, we obtain the following kind of hypotheses:

$$h_{\mathbf{w}}(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x}) = \frac{e^{\mathbf{w}^T \mathbf{x}}}{1 + e^{\mathbf{w}^T \mathbf{x}}}$$

Concretely, with the logistic function we are actually applying a *non-linear* transformation to our (linear) signal. Moreover, the output of this function can genuinely be interpreted as a *probability value*.

The logistic function is also known as *sigmoid*, due to its “S” shape, or as *soft threshold* (compare it to the hard threshold imposed by the $\mathrm{sign}$ function), because it encapsulates the notion of “uncertainty”.
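A small sketch can make the hard-vs-soft contrast concrete. The code below (function names are illustrative, not from the original text) evaluates the three thresholding functions on the same signal values:

```python
import numpy as np

# Comparing the three thresholding functions on the same signal values.
def hard_threshold(s):
    return 1.0 if s > 0 else -1.0         # linear classification: sign(s)

def identity(s):
    return s                              # linear regression: theta(s) = s

def logistic(s):
    return np.exp(s) / (1.0 + np.exp(s))  # logistic regression

for s in (-2.0, -0.1, 0.1, 2.0):
    print(s, hard_threshold(s), identity(s), round(float(logistic(s)), 3))
# The hard threshold jumps abruptly from -1 to +1 around s = 0, while
# the logistic function moves smoothly through 0.5: a "soft" threshold.
```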

**The probabilistic interpretation of the model**

Having the set of hypotheses described by the logistic function is not, by itself, enough to state that the output is a probability value. After all, the only thing we know up to now about the logistic function is that its output always lies between $0$ and $1$. But this alone is not enough to claim that we can treat such an output as a probability!

The point is that the output of the logistic function can be genuinely treated as a probability *even during learning*.

In fact, remembering that we are approaching a supervised learning problem, we are provided with a *training set* of labelled examples $\mathcal{D} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$ where $y_n \in \{-1, +1\}$, i.e. each $y_n$ is a *binary* variable.

Of course, this is the only information we get from the training set: we **cannot** access the individual probability associated with each example. However, this is still enough to assume that the examples in our training set are labelled as positive ($y_n = +1$) or negative ($y_n = -1$) according to an underlying and unknown probability function (i.e. a noisy target). From the data we can only access the binary labels, but we can still assume those labels are generated by an underlying probability function which we want to estimate.

More formally, given the generic training example $(\mathbf{x}, y)$, we claim there exists a conditional probability $P(y \mid \mathbf{x})$ defined as follows:

$$P(y \mid \mathbf{x}) = \begin{cases} f(\mathbf{x}) & \text{for } y = +1 \\ 1 - f(\mathbf{x}) & \text{for } y = -1 \end{cases}$$

where $f$ is the *noisy target function*.

Differently from a *deterministic* target function which, given $\mathbf{x}$, will always output $+1$ **or** $-1$ in a mutually exclusive way, a noisy target function, when applied to $\mathbf{x}$, would in general output **both** $+1$ **and** $-1$, each with an associated “degree of certainty”, i.e. a probability.

Therefore, if we assume $f$ is the underlying and unknown noisy target which generates our examples, our aim is to find an estimate $h$ which *best approximates* $f$, i.e. $h \approx f$.

We assume that such an estimate is picked from the set of hypotheses defined by the logistic function, namely

$$h_{\mathbf{w}}(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x}) = \frac{e^{\mathbf{w}^T \mathbf{x}}}{1 + e^{\mathbf{w}^T \mathbf{x}}}.$$

But how do we choose the $h$ which best approximates $f$?

Well, it turns out that, given the training set $\mathcal{D}$, the only thing we can operate on is the set of parameters $\mathbf{w}$. Moreover, in order to find the best estimate, we need to explicitly define an error measure (i.e. a cost function) in terms of the parameters $\mathbf{w}$ to minimize.

**The Error Measure**

When using a probabilistic approach, we consider as our hypothesis space $\mathcal{H}$ a family of parametric models, each of which can be interpreted as an *approximation* of the true, unknown probability distribution that generates our observed labelled samples (as is the case for logistic regression). What we are ultimately interested in is finding the hypothesis $h$ within $\mathcal{H}$ which maximizes the following quantity:

$$P(h \mid \mathcal{D})$$

That is, we want to maximize the probability of the hypothesis $h$ given the observed data $\mathcal{D}$.

If we want to measure the error we are making by assuming that our hypothesis $h$ approximates the true noisy target $f$, we can measure how likely it is for the observed data to have been generated by our selected hypothesis. This somewhat flips the perspective: we are now interested in the (data) *likelihood*, namely in finding the hypothesis $h$ which maximizes the probability $P(\mathcal{D} \mid h)$ of the observed data given that particular hypothesis.

__What does the likelihood of a single example look like?__

Remember from above that, given the generic training example $(\mathbf{x}, y)$, we claim there exists a conditional probability $P(y \mid \mathbf{x})$ defined as follows:

$$P(y \mid \mathbf{x}) = \begin{cases} f(\mathbf{x}) & \text{for } y = +1 \\ 1 - f(\mathbf{x}) & \text{for } y = -1 \end{cases}$$

where $f$ is the unknown target function we would like to approximate/learn.

A plausible way to measure the error we make by approximating the true target $f$ with a generic hypothesis $h$ is given by the **likelihood**.

Let us see how to measure the likelihood that a single training example $(\mathbf{x}, y)$ has been generated by a generic hypothesis $h$; then we will see how this can be generalised to a whole training set $\mathcal{D}$.

In formula, this means that we are assuming $h \approx f$, so that:

$$P(y \mid \mathbf{x}) = \begin{cases} h(\mathbf{x}) & \text{for } y = +1 \\ 1 - h(\mathbf{x}) & \text{for } y = -1 \end{cases}$$

If we now substitute $h(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x})$ from the above (i.e. assuming our hypothesis is in fact the logistic function), and note that $\theta(-s) = 1 - \theta(s)$ (i.e. the logistic function is symmetric), we can rewrite everything in a single expression:

$$P(y \mid \mathbf{x}) = \theta(y \, \mathbf{w}^T \mathbf{x})$$
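The symmetry property is easy to sanity-check numerically. The sketch below (a toy check, not part of the original text) verifies $\theta(-s) = 1 - \theta(s)$ and shows how it lets a single expression cover both labels:

```python
import numpy as np

# Numerically checking the symmetry theta(-s) = 1 - theta(s) that allows
# collapsing the two-case probability into P(y | x) = theta(y * s).
def logistic(s):
    return np.exp(s) / (1.0 + np.exp(s))

for s in np.linspace(-5.0, 5.0, 11):
    assert abs(logistic(-s) - (1.0 - logistic(s))) < 1e-12

def per_example_likelihood(y, s):
    # y = +1 gives theta(s); y = -1 gives theta(-s) = 1 - theta(s)
    return logistic(y * s)

# The two label probabilities for a fixed signal sum to 1.
print(per_example_likelihood(+1, 1.3) + per_example_likelihood(-1, 1.3))
```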

__What does the likelihood of a whole training set look like?__

Assuming we have a full training set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$ of independent and identically distributed examples, the overall likelihood is computed as:

$$\prod_{n=1}^{N} P(y_n \mid \mathbf{x}_n) = \prod_{n=1}^{N} \theta(y_n \, \mathbf{w}^T \mathbf{x}_n).$$

Note that this measure behaves as expected. For example, suppose that the computed signal $\mathbf{w}^T \mathbf{x}_n$ is strongly positive (resp. negative). Then, if the true label for that example is **concordant**, namely $y_n = +1$ (resp. $y_n = -1$), the corresponding argument $y_n \, \mathbf{w}^T \mathbf{x}_n$ of the logistic function is strongly positive (either because both $y_n$ and $\mathbf{w}^T \mathbf{x}_n$ are positive or because they are both negative). Either way, the resulting value of the logistic function consistently approaches $1$, as our “prediction” is actually agreeing with the true label. Vice versa, if the signal computed by the hypothesis on the example $\mathbf{x}_n$ is strongly positive (resp. negative) but the corresponding label disagrees (i.e. $y_n = -1$ or $y_n = +1$, resp.), then the argument of the logistic function is strongly negative, and therefore it results in a value which approaches $0$.
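The concordant/discordant behaviour can be seen directly with a made-up signal value (a toy sketch, not from the original text):

```python
import numpy as np

# Illustrating the behaviour described above: the per-example likelihood
# theta(y * s) approaches 1 when the signal agrees in sign with the label,
# and approaches 0 when it disagrees.
def logistic(s):
    return np.exp(s) / (1.0 + np.exp(s))

strong_signal = 6.0                         # hypothetical w^T x_n
print(float(logistic(+1 * strong_signal)))  # concordant label: close to 1
print(float(logistic(-1 * strong_signal)))  # discordant label: close to 0
```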

__How to maximize the likelihood?__

To maximize the likelihood function described above, we are implicitly asking to find the vector of parameters $\mathbf{w}$ so that the following quantity is maximized:

$$\max_{\mathbf{w}} \prod_{n=1}^{N} \theta(y_n \, \mathbf{w}^T \mathbf{x}_n)$$

__From maximizing the likelihood to minimizing the in-sample error function__

What we are ultimately interested in is a measure for the in-sample error which looks like the following:

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} e\big(h(\mathbf{x}_n), y_n\big).$$

Let’s start from what we have so far, namely the formula to maximize the likelihood:

$$\max_{\mathbf{w}} \prod_{n=1}^{N} \theta(y_n \, \mathbf{w}^T \mathbf{x}_n)$$

It turns out that the formula above can also be rewritten as follows:

$$\max_{\mathbf{w}} \frac{1}{N} \sum_{n=1}^{N} \ln \theta(y_n \, \mathbf{w}^T \mathbf{x}_n)$$

In fact, $\frac{1}{N}$ is simply a multiplicative, strictly positive, constant factor; moreover, the logarithm is a strictly monotonically increasing function, which also turns the product into a sum. Therefore, maximizing the first expression is the same as maximizing the second one!
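This equivalence can be checked on a toy example. The sketch below (with made-up 1-D data; names and values are mine, not from the original text) scans candidate weights and confirms that the plain likelihood and its scaled logarithm peak at the same parameter value:

```python
import numpy as np

# A toy check that the likelihood (a product) and its scaled logarithm
# (a mean of logs) are maximized by the same weight value.
def logistic(s):
    return np.exp(s) / (1.0 + np.exp(s))

xs = np.array([-2.0, -1.0, 1.5, 2.5])  # hypothetical 1-D inputs
ys = np.array([-1.0, -1.0, 1.0, 1.0])  # hypothetical binary labels

ws = np.linspace(0.1, 5.0, 200)        # candidate 1-D weights
likelihood = [np.prod(logistic(ys * w * xs)) for w in ws]
log_mean = [np.mean(np.log(logistic(ys * w * xs))) for w in ws]

print(np.argmax(likelihood) == np.argmax(log_mean))  # same maximizer
```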

As we are interested in an error measure, we may instead do the following:

$$\min_{\mathbf{w}} -\frac{1}{N} \sum_{n=1}^{N} \ln \theta(y_n \, \mathbf{w}^T \mathbf{x}_n),$$

where we have just changed the sign of the main expression and minimize the resulting quantity accordingly.

Note that the formula above can also be written as:

$$\min_{\mathbf{w}} \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n \, \mathbf{w}^T \mathbf{x}_n)}$$

as $-\ln a = \ln \frac{1}{a}$.

By noticing that $\theta(s) = \frac{e^s}{1 + e^s}$ and dividing both the numerator and the denominator by the same factor $e^s$, we can rewrite the logistic function as follows:

$$\theta(s) = \frac{1}{1 + e^{-s}}$$

Therefore, the error function to be minimized can be rewritten as follows:

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln \left( 1 + e^{-y_n \, \mathbf{w}^T \mathbf{x}_n} \right)$$

This quantity is commonly known as the *cross-entropy* error.
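As a final sketch, the error function can be implemented in a few lines of NumPy. The tiny dataset below is hypothetical, and the inputs are assumed to already include the constant component $x_0 = 1$:

```python
import numpy as np

# A sketch of the in-sample (cross-entropy) error
#   E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n)),
# written with np.log1p for better numerical behaviour near zero.
def cross_entropy_error(w, X, y):
    signals = X @ w  # per-example signals w^T x_n
    return float(np.mean(np.log1p(np.exp(-y * signals))))

X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])           # rows: [x_0 = 1, x_1]
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(2)

print(cross_entropy_error(w, X, y))  # with w = 0 every term is ln(2)
```

Minimizing this function in $\mathbf{w}$ (e.g. with gradient descent) is precisely the job of the learning algorithm.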

Please, check out my slides on Slideshare if you would like to know more about Logistic Regression.