
# Logistic Regression

Logistic Regression is a well-known example of a linear model, which complements the other two most famous linear models, i.e. Linear Classification and Linear Regression.
As with other supervised learning approaches, there are 3 key components that must be specified in order to properly define logistic regression:

• The model, which describes the set of hypotheses (hypothesis space) that can be possibly implemented;
• The error measure (or cost function), which measures the price that must be paid if a misclassification occurs;
• The learning algorithm, which is responsible for picking the best hypothesis (according to the error measure) by searching through the hypothesis space.

All three components above are different from their counterparts in linear classification and linear regression. Let’s discuss each of them separately.

## The Model
Generally speaking, for a model to be linear means that, given an input ${\bf x}\in \mathbb{R}^{d+1}$, a $(d+1)$-dimensional vector whose components are ${\bf x}^T = (x_0, x_1, \ldots, x_d)$, we consider the family $\mathcal{F}$ of real-valued functions having $d+1$ parameters (a.k.a. weights) ${\boldsymbol \theta}^T = (\theta_0, \theta_1, \ldots, \theta_d)$, whose output is a real number obtained as the linear combination of the input ${\bf x}$ with the parameters ${\boldsymbol \theta}$. More formally:

$\mathcal{F} = \{f_{\boldsymbol \theta}:\mathbb{R}^{d+1}\longmapsto \mathbb{R}~|~f_{\boldsymbol \theta}({\bf x}) ={\boldsymbol \theta^T} {\bf x} = \sum_{i=0}^d \theta_i x_i \}$

The symbol $f_{\boldsymbol \theta}({\bf x})$ means “the application of the function $f$ parametrized by ${\boldsymbol \theta}$ to the input ${\bf x}$”; often, this is also written as $f({\bf x};{\boldsymbol \theta})$, and it is referred to as the signal. The signal is then usually passed through a “thresholding” filter, i.e. another real-valued function $g: \mathbb{R} \longmapsto \mathbb{R}$. Finally, the resulting composite function $h_{\boldsymbol \theta}({\bf x}) = (g\circ f_{\boldsymbol \theta})({\bf x}) = g(f_{\boldsymbol \theta}({\bf x}))$ defines the hypothesis space $\mathcal{H}$:

$\mathcal{H} = \{h_{\boldsymbol \theta}: \mathbb{R}^{d+1}\longmapsto \mathbb{R}~|~h_{\boldsymbol \theta}({\bf x}) = g(f_{\boldsymbol \theta}({\bf x})) = g({\boldsymbol \theta^T} {\bf x}) = g\big(\sum_{i=0}^d \theta_i x_i \big) \}$

It turns out that, depending on the parametric model ($f_{\boldsymbol \theta}$) and on the thresholding function ($g$), the set of hypotheses changes as well. With a linear parametric model, the following thresholding functions are commonly associated:

• $g = \textrm{sign}$
This is the function used in linear classification, which takes as input the output of the linear parametric model and produces a binary output, i.e. $+1$ or $-1$.
Generally speaking it is defined as follows:
$\textrm{sign}(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z < 0 \end{cases}$
If we plug our parametric linear model into the $\textrm{sign}$ function we obtain the following kind of hypotheses:
$h_{\boldsymbol \theta}({\bf x}) = \textrm{sign}(f_{\boldsymbol \theta}({\bf x})) = \textrm{sign}({\boldsymbol \theta^T} {\bf x}) = \textrm{sign}\big(\sum_{i=0}^d \theta_i x_i \big)$

• $g = \textrm{identity}$
This is the identity function used in linear regression, which simply passes the output of the linear parametric model through unchanged.
It is defined as follows:
$\textrm{identity}(z) = z.$
If we plug our parametric linear model into the $\textrm{identity}$ function we obtain the following kind of hypotheses:
$h_{\boldsymbol \theta}({\bf x}) = \textrm{identity}(f_{\boldsymbol \theta}({\bf x})) = f_{\boldsymbol \theta}({\bf x}) = {\boldsymbol \theta^T} {\bf x} = \sum_{i=0}^d \theta_i x_i$

• $g = \ell$
This is the logistic function used in logistic regression, which takes as input the output of the linear parametric model and produces a real-valued output in the range $[0,1]$.
The logistic function is defined as follows:
$\ell(z) = \frac{e^z}{1+e^z}$
Therefore, if we plug our parametric linear model into $\ell$ we obtain the following kind of hypotheses:
$h_{\boldsymbol \theta}({\bf x}) = \ell(f_{\boldsymbol \theta}({\bf x})) = \ell({\boldsymbol \theta^T} {\bf x}) = \frac{e^{({\boldsymbol \theta^T} {\bf x})}}{1+e^{({\boldsymbol \theta^T} {\bf x})}}$
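As a concrete illustration of the three thresholding functions above, here is a short Python sketch; the weights `theta` and input `x` are made-up values, not taken from the article:

```python
import math

# Linear signal f_theta(x) = theta^T x (the bias is handled via x_0 = 1).
def signal(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

# Hard threshold used in linear classification.
def sign(z):
    return 1 if z > 0 else -1

# Identity used in linear regression.
def identity(z):
    return z

# Logistic (soft threshold) used in logistic regression.
def logistic(z):
    return math.exp(z) / (1.0 + math.exp(z))

theta = [0.5, -1.0, 2.0]   # hypothetical weights (theta_0 is the bias)
x = [1.0, 0.3, 0.8]        # x_0 = 1 accounts for the bias term

z = signal(theta, x)       # 0.5 - 0.3 + 1.6 = 1.8
print(sign(z))             # +1
print(identity(z))         # 1.8
print(logistic(z))         # a value in (0, 1), here about 0.858
```

Note how all three hypotheses share the same linear signal; only the final filter $g$ changes.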

Concretely, with the logistic function we are applying a non-linear transformation to our (linear) signal. Moreover, the output of this function can genuinely be interpreted as a probability value.
The logistic function is also known as the sigmoid, due to its “S” shape, or as a soft threshold (compared with the hard threshold imposed by the $\textrm{sign}$ function), because it encapsulates the notion of “uncertainty”.

## The probabilistic interpretation of the model
Describing the set of hypotheses via the logistic function is not, by itself, enough to state that the output is a probability value. After all, the only thing we know up to now about the logistic function is that its output always lies between $0$ and $1$. But this alone is not enough to claim that we can treat such output as a probability!
The point is that the output of the logistic function can be genuinely treated as a probability even during learning.
In fact, remembering that we are approaching a supervised learning problem, we are provided with a training set of labelled examples $\mathcal{D}=\{({\bf x_1},y_1), \ldots, ({\bf x_m}, y_m)\}$ where $y_i \in \{-1,+1\}$, i.e. each $y_i$ is a binary variable.
Of course, this is the only information we get from the training set: we have no access to the individual probability associated with each example. However, it is still reasonable to assume that the examples in our training set are labelled as positive ($+1$) or negative ($-1$) according to an underlying, unknown probability function (i.e. a noisy target): from the data we can only access the binary labels, but we can still assume those labels are generated by an underlying probability function which we want to estimate.
More formally, given the generic training example $({\bf x}, y)$ we claim there exists a conditional probability $P(y~|~{\bf x})$ defined as follows:
$P(y~|~{\bf x}) = \begin{cases} \phi({\bf x}) & \text{if } y = +1 \\ 1 - \phi({\bf x}) & \text{if } y = -1 \end{cases}$
where $\phi(\cdot)$ is the noisy target function.
Unlike a deterministic target function which, given ${\bf x}$, always outputs $y=+1$ or $y=-1$ in a mutually exclusive way, a noisy target function $\phi(\cdot)$, when applied to ${\bf x}$, in general outputs $y=+1$ and $y=-1$ each with an associated “degree of certainty”, i.e. a probability.
Therefore, if we assume $\phi: \mathbb{R}^{d+1} \longmapsto [0,1]$ is the underlying and unknown noisy target which generates our examples, our aim is to find an estimate $\phi^*$ which best approximates $\phi$, i.e. $\phi^* \approx \phi$.
We assume that such an estimate $\phi^*$ is picked from the set of hypotheses defined by the logistic function, namely $\phi^*({\bf x}) = h_{{\boldsymbol \theta}}^*({\bf x}) = \ell({\boldsymbol \theta^T} {\bf x}) \approx \phi({\bf x})$.
But how do we choose the $h_{{\boldsymbol \theta}}^*$ which best approximates $\phi$?
Well, it turns out that, given the training set, the only thing we can operate on is the set of parameters ${\boldsymbol \theta}$. Moreover, in order to find the best estimate we need to explicitly define an error measure (i.e. a cost function) in terms of the parameters, which we then minimize.
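To make the noisy-target idea concrete, here is a small Python sketch; the particular $\phi$ below is invented for illustration (the learner would never see it). Although every drawn label is binary, the fraction of positive draws approaches $\phi({\bf x})$:

```python
import math
import random

# Hypothetical noisy target: phi(x) is the probability that x is labelled +1.
# This particular phi is made up for illustration only.
def phi(x):
    z = 2.0 * x[1] - 1.0 * x[2]
    return 1.0 / (1.0 + math.exp(-z))

# The training set only ever exposes a binary label, drawn according to phi.
def draw_label(x, rng):
    return +1 if rng.random() < phi(x) else -1

rng = random.Random(0)
x = [1.0, 0.7, 0.2]          # x_0 = 1 is the bias coordinate
labels = [draw_label(x, rng) for _ in range(10000)]
frac_pos = labels.count(+1) / len(labels)
print(frac_pos)              # hovers around phi(x), which is about 0.77 here
```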

## The Error Measure
When using a probabilistic approach, our hypothesis space $\mathcal{H}$ is a family of parametric models, each of which can be interpreted as an approximation of the true, unknown probability distribution that generates the observed labelled samples (as in the case of logistic regression). What we are ultimately interested in is finding the hypothesis $h_{{\boldsymbol \theta}}^*$ within $\mathcal{H}$ which maximizes the following quantity:
$h_{{\boldsymbol \theta}}^* = \text{{\tt argmax}}_{h_{{\boldsymbol \theta}}\in \mathcal{H}}~P(h_{{\boldsymbol \theta}}~|~\mathcal{D})$
That is, we want to maximize the probability of the hypothesis given the observed data.
If we want to measure the error we make by assuming that our hypothesis approximates the true noisy target $\phi$, we can measure how likely it is that the observed data were generated by our selected hypothesis. This flips the perspective: we are now interested in the (data) likelihood, namely in finding the hypothesis which maximizes the probability of the observed data given that particular hypothesis, $P(\mathcal{D}~|~h_{{\boldsymbol \theta}})$.

### What does the likelihood of a single example look like?
Remember from above that, given the generic training example $({\bf x}, y)$, we claim there exists a conditional probability $P(y~|~{\bf x})$ defined as follows:
$P(y~|~{\bf x}) = \begin{cases} \phi({\bf x}) & \text{if } y = +1 \\ 1 - \phi({\bf x}) & \text{if } y = -1 \end{cases}$
where $\phi$ is the unknown target function we would like to approximate/learn.
A plausible way to measure the error we make by approximating the true target $\phi$ with a generic hypothesis $h_{{\boldsymbol \theta}}$ is given by the likelihood.
Let us see how to measure the likelihood that a single training example $({\bf x},y)$ has been generated by a generic hypothesis $h_{{\boldsymbol \theta}}$; then we will see how this can be generalised to a whole training set $\mathcal{D}$.
In formula, this means that we are assuming:
$P(y~|~{\bf x}) = \begin{cases} h_{{\boldsymbol \theta}}({\bf x}) & \text{if } y = +1 \\ 1 - h_{{\boldsymbol \theta}}({\bf x}) & \text{if } y = -1 \end{cases}$
If we substitute $h_{\boldsymbol \theta}({\bf x}) = \ell({\boldsymbol \theta^T} {\bf x})$ from above (i.e. assume our hypothesis is in fact the logistic function), and note that $\ell(-z) = 1 - \ell(z)$ (a symmetry property of the logistic function), we can rewrite the two cases compactly as follows:
$P(y~|~{\bf x}) = \ell(y{\boldsymbol \theta^T} {\bf x})$
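The symmetry property and the collapsed single-example likelihood can be checked numerically; the weights and input below are made up for illustration:

```python
import math

def logistic(z):
    return math.exp(z) / (1.0 + math.exp(z))

# Symmetry check: l(-z) = 1 - l(z).
z = 1.3
assert abs(logistic(-z) - (1.0 - logistic(z))) < 1e-12

# Single-example likelihood under hypothesis theta: P(y | x) = l(y * theta^T x).
def likelihood(theta, x, y):
    z = sum(t * xi for t, xi in zip(theta, x))
    return logistic(y * z)

theta = [0.2, 1.0]           # made-up weights
x = [1.0, 0.5]               # x_0 = 1 is the bias coordinate
print(likelihood(theta, x, +1) + likelihood(theta, x, -1))
# the two cases are complementary: they sum to 1
```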

### What does the likelihood of a whole training set look like?
Assuming we have a full training set of independent and identically distributed examples $\mathcal{D} = \{({\bf x_1},y_1), ({\bf x_2},y_2), \ldots, ({\bf x_m},y_m)\}$, the overall likelihood is computed as:
$\prod_{i=1}^m P(y_i~|~{\bf x_i}) = \prod_{i=1}^m \ell(y_i{\boldsymbol \theta^T} {\bf x_i})$.
Note that this measure behaves as expected. Suppose the computed signal ${\boldsymbol \theta^T} {\bf x_i}$ is strongly positive (resp. negative). If the true label $y_i$ for that example ${\bf x_i}$ is concordant, namely $y_i = +1$ (resp. $y_i = -1$), then the argument $y_i{\boldsymbol \theta^T} {\bf x_i}$ of the logistic function is strongly positive (either because both $y_i$ and the signal are positive or because they are both negative). Either way, the resulting value of the logistic function approaches $1$, as our “prediction” agrees with the true label. Vice versa, if the signal ${\boldsymbol \theta^T} {\bf x_i}$ is strongly positive (resp. negative) but the corresponding label disagrees (i.e. $y_i = -1$, resp. $y_i = +1$), then the argument of the logistic function is strongly negative and the resulting value approaches $0$.
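This behaviour can be verified on a tiny dataset; the examples and weights below are invented for illustration:

```python
import math

def logistic(z):
    return math.exp(z) / (1.0 + math.exp(z))

def signal(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

# Invented toy dataset: x_0 = 1 is the bias coordinate.
data = [([1.0,  2.0], +1),   # strongly positive signal, concordant label
        ([1.0, -2.0], -1),   # strongly negative signal, concordant label
        ([1.0,  2.0], -1)]   # strongly positive signal, discordant label

theta = [0.0, 3.0]           # made-up weights
for x, y in data:
    print(logistic(y * signal(theta, x)))
# the two concordant examples give values near 1, the discordant one near 0

# The overall likelihood is the product of the per-example values.
full_likelihood = math.prod(logistic(y * signal(theta, x)) for x, y in data)
```

A single badly misclassified example drags the whole product towards $0$, which is exactly why maximizing the likelihood pushes the parameters towards agreement with all labels.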

### How to maximize the likelihood?
To maximize the likelihood function described above, we are implicitly asking to find the vector of parameters ${\boldsymbol \theta}$ so that the quantity is maximized, i.e.:

$\text{{\tt argmax}}_{\boldsymbol \theta} \bigg(\prod_{i=1}^m P(y_i~|~{\bf x_i})\bigg) = \text{{\tt argmax}}_{\boldsymbol \theta} \bigg(\prod_{i=1}^m \ell(y_i{\boldsymbol \theta^T} {\bf x_i})\bigg)$

### From maximizing the likelihood to minimizing the in-sample error function
What we are ultimately interested in is to find a measure for the in-sample error which looks like the following:
$E_{\text{in}}({\boldsymbol \theta}) = \frac{1}{m}\sum_{i=1}^m e(h_{\boldsymbol \theta}({\bf x_i}), y_i)$.
Let’s start from what we had so far, namely with the formula to maximize the likelihood:

$\text{{\tt argmax}}_{\boldsymbol \theta} \bigg(\prod_{i=1}^m \ell(y_i{\boldsymbol \theta^T} {\bf x_i})\bigg)$

It turns out that the formula above can also be rewritten as follows:

$\text{{\tt argmax}}_{\boldsymbol \theta} \bigg(\frac{1}{m}~\text{ln} \big(\prod_{i=1}^m \ell(y_i{\boldsymbol \theta^T} {\bf x_i})\big)\bigg)$

In fact, the $\frac{1}{m}$ is simply a multiplicative, strictly positive, constant factor; moreover, the logarithm $\text{ln}$ is a strictly monotonically-increasing function. Therefore, maximizing the first function is the same as maximizing the second one!

As we are interested in an error measure, we may want to do the following:

$\text{{\tt argmax}}_{\boldsymbol \theta} \bigg(\frac{1}{m}~\text{ln} \big(\prod_{i=1}^m \ell(y_i{\boldsymbol \theta^T} {\bf x_i})\big)\bigg) = \text{{\tt argmin}}_{\boldsymbol \theta} \bigg(-\frac{1}{m}~\text{ln} \big(\prod_{i=1}^m \ell(y_i{\boldsymbol \theta^T} {\bf x_i})\big)\bigg)$,
where we have simply changed the sign of the expression and minimize the resulting quantity accordingly.
Note that the formula above can also be written as:

$\text{{\tt argmin}}_{\boldsymbol \theta} \bigg(-\frac{1}{m}~\text{ln} \big(\prod_{i=1}^m \ell(y_i{\boldsymbol \theta^T} {\bf x_i})\big)\bigg) =$
$= \text{{\tt argmin}}_{\boldsymbol \theta} \bigg(-\frac{1}{m}~\text{ln}\big(\ell(y_1{\boldsymbol \theta^T} {\bf x_1})\big) -\ldots -\frac{1}{m}~\text{ln}\big(\ell(y_m{\boldsymbol \theta^T} {\bf x_m})\big) \bigg)$
as $k~\text{ln}(a\cdot b) = k\big(\text{ln}(a) + \text{ ln}(b)\big) = k~\text{ln}(a) + k~\text{ln}(b)$.

$= \text{{\tt argmin}}_{\boldsymbol \theta} \bigg(\frac{1}{m}~\sum_{i=1}^m~-\text{ln} (\ell(y_i{\boldsymbol \theta^T} {\bf x_i}))\bigg)$

$= \text{{\tt argmin}}_{\boldsymbol \theta} \bigg(\frac{1}{m}~\sum_{i=1}^m~\text{ln} \bigg( \frac{1}{\ell(y_i{\boldsymbol \theta^T} {\bf x_i})}\bigg)\bigg)$

as $-\text{ln}(a) = \text{ln}(\frac{1}{a})$.

By noticing that $\ell(z) = \frac{e^z}{1+e^z}$ and dividing both the numerator and the denominator by the same factor $e^z$, we can rewrite the logistic function as follows:

$\ell(z) = \frac{e^z}{1+e^z} = \frac{1}{e^{-z} + 1}$
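This algebraic identity is easy to confirm numerically (a quick Python sketch):

```python
import math

# Check e^z / (1 + e^z) == 1 / (1 + e^{-z}) on a few sample points.
for z in [-5.0, -1.0, 0.0, 0.5, 3.0]:
    a = math.exp(z) / (1.0 + math.exp(z))
    b = 1.0 / (1.0 + math.exp(-z))
    assert abs(a - b) < 1e-12
print("both forms agree")
```

In code, the second form $\frac{1}{e^{-z}+1}$ is usually preferred for positive $z$, since $e^z$ in the first form overflows for large arguments.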

Therefore, the error function $E_{\text{in}}({\boldsymbol \theta})$ to be minimized can be rewritten as follows:

$E_{\text{in}}({\boldsymbol \theta}) = \frac{1}{m}~\sum_{i=1}^m~\text{ln} \big( e^{-y_i{\boldsymbol \theta^T} {\bf x_i}} + 1\big)$
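This article derives the error measure but stops before the third component, the learning algorithm. As a sketch of how $E_{\text{in}}$ would actually be minimized, here is a plain batch gradient-descent loop in Python; the toy dataset, learning rate, and step count are all invented for illustration:

```python
import math

def e_in(theta, data):
    """In-sample error: (1/m) * sum_i ln(1 + e^{-y_i * theta^T x_i})."""
    total = 0.0
    for x, y in data:
        z = sum(t * xi for t, xi in zip(theta, x))
        total += math.log(1.0 + math.exp(-y * z))
    return total / len(data)

def gradient(theta, data):
    """Gradient of e_in: (1/m) * sum_i (-y_i * x_i) / (1 + e^{y_i * theta^T x_i})."""
    m = len(data)
    grad = [0.0] * len(theta)
    for x, y in data:
        z = sum(t * xi for t, xi in zip(theta, x))
        coef = -y / (1.0 + math.exp(y * z))
        for i, xi in enumerate(x):
            grad[i] += coef * xi / m
    return grad

def fit(data, d, eta=0.5, steps=2000):
    """Plain (batch) gradient descent starting from theta = 0."""
    theta = [0.0] * (d + 1)
    for _ in range(steps):
        g = gradient(theta, data)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

# Invented toy data (x_0 = 1 is the bias coordinate); the label follows the sign of x_1.
data = [([1.0, -2.0], -1), ([1.0, -1.0], -1), ([1.0, 1.0], +1), ([1.0, 2.0], +1)]
theta = fit(data, d=1)
print(e_in(theta, data))   # far below ln(2) ~ 0.693, the error at theta = 0
```

Gradient descent is the standard choice here because $E_{\text{in}}$ is smooth and convex in ${\boldsymbol \theta}$, so following the negative gradient reliably decreases the error.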

Please check out my slides on Slideshare if you would like to know more about Logistic Regression.