Home » Miscellanea » Machine Learning » Supervised Learning

# Supervised Learning

Generally speaking, the supervised learning problem can be described as follows.

Let $X$ be a population of objects, where each object $\mathbf{x_i}\in X$ is represented by a set of $n$ features, i.e., $\mathbf{x_i} = \{x_{i,1}, x_{i,2}, \ldots x_{i,n}\}$.
Each feature $x_{i,j}$ is actually a property derivable from the particular object instance.
In addition, we assume each object $\mathbf{x_i}$ is paired with a value $y_i$, so that $y_i \in Y$, and we call any $(\mathbf{x_i},y_i)$ a labeled example (or observation), namely any object instance with its associated value.
Now, suppose we have $\mathcal{D} = \{(\mathbf{x_1},y_1),\ldots,(\mathbf{x_m},y_m)\}$ as a set of $m$ labeled examples, namely a sample drawn from the population of objects above together with their associated values, and let us call it the training set.

Using matrix notation, the training set $\mathcal{D}$ can be represented as follows.
First, we write each $\mathbf{x_i}$ as a $n\times 1$ (column) vector of features:
$\mathbf{x_i} = \begin{pmatrix} x_{i,1} \\ x_{i,2} \\ \vdots \\ x_{i,n} \end{pmatrix}$
Then, we represent all the object instances by a $m\times n$ matrix $\mathbf{X}$, so that:
$\mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} = \begin{pmatrix} \mathbf{x_1}^T\\ \mathbf{x_2}^T\\ \vdots \\ \mathbf{x_m}^T \end{pmatrix}$
where $T$ stands for the transpose operator, i.e., $\mathbf{x_i}^T = (x_{i,1}, x_{i,2}, \ldots x_{i,n})$.
In addition, the set of $m$ values associated with the objects can be written as an $m\times 1$ column vector $\mathbf{y}$, as follows:
$\mathbf{y} = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{pmatrix}$
Finally, the training set $\mathcal{D}$ can be overall represented by $\mathcal{D} = \mathbf{X},\mathbf{y}.$

Intuitively, supervised learning is the machine learning task of inferring an unknown function, called the target function, from labeled training data so that this function can be in turn used for predicting which value $y'\in Y$ should be assigned to unseen instances of population $\mathbf{x'}\in X$, i.e., instances $(\mathbf{x'},y) \notin \mathcal{D}$ which lay outside the training set and whose actual value is unknown.
Note that if $Y\subseteq \mathbb{R}$ (i.e., if each $y_i$ takes on continuous values) then supervised learning problem turns into a regression problem. On the other hand, if $Y\subseteq \mathbb{N}$ and $Y = \{0,\ldots,k-1\}$ (i.e., if each $y_i$ takes on $k$ discrete values) the supervised learning problem turns into a $k$-ary classification problem.

To put it more formally, assuming the existence of an unknown target function $f: X\longmapsto Y$, which maps each object to its correct value, the goal of a supervised learning algorithm is to exploit the training set $\mathcal{D}$ to learn another function $h^{\ast}: X\longmapsto Y$ called hypothesis and selected from a fixed family of functions, namely a fixed space of possible hypotheses $\mathcal{H}$, which best approximates $f$.

We assume there exists a joint probability distribution $P(X=\mathbf{x},Y=y)$ over all possible values $\mathbf{x} \in X$ and $y\in Y$, which we will denote by $P(X,Y)$ for short. In addition, the training set $\mathcal{D}$ consists of $m$ instances drawn independently identically distributed (i.i.d.) from $P(X,Y)$. This allows us to model uncertainty in predictions (e.g., due to noise in data) so that $y$ is not a deterministic function of $\mathbf{x}$ yet it is a random variable with a conditional probability distribution $P(Y=y|X=\mathbf{x})$ for a fixed value of $\mathbf{x}$. Similarly to the joint probability distribution, we denote by $P(Y|X)$ the conditional probability distribution over all the possible values $\mathbf{x} \in X$ and $y\in Y$.
It turns out that the hypothesis $h^{\ast}$ we would like to find may either lead to learning the whole joint probability distribution $P(X,Y)$ or to learning directly the conditional probability distribution $P(Y|X)$. The former leads to so-called generative models (e.g., Naïve Bayes), whereas the latter refers to so-called discriminative models (e.g., Logistic Regression).

Anyway, since the hypothesis space $\mathcal{H}$ which the function $h^{\ast}$ can be selected from is usually very large (in fact, $\mathcal{H}$ may be also infinite!), how do we choose $h^{\ast}$?
There are two fundamental approaches for choosing $h^{\ast}$: empirical risk minimization and structural risk minimization. The former picks the hypothesis function that best fits the training data. The latter includes also a penalty function in order to control the bias/variance tradeoff.

No matter what approach is used to select the hypothesis, we are given a non-negative, real-valued loss function $L: Y\times Y\longmapsto \mathbb{R}^{\geq 0}$.
Generally speaking, $L(y,\hat{y})$ measures how different the prediction $\hat{y}$ made by a certain hypothesis $h\in \mathcal{H}$ when applied to $\mathbf{x}$ is from the actual value $y$. A typical choice for the function $L$ is the 0-1 loss function, which is defined as:

$L(y,\hat{y}) = 1_{y\neq \hat{y}}(y,\hat{y})$, where $1_{y\neq \hat{y}}$ is the indicator function. In this particular case, the indicator function is defined on the set of all the possible pairs $Y\times Y$, and indicates membership of an element in a specific subset of the original domain, namely $A=\{(y,\hat{y}) \in Y\times Y | y\neq \hat{y}\}$. The indicator function returns $1$ for all the elements of $A$ and $0$ for all the other elements.

Finally, we define the risk $R(h)$ associated with the hypothesis $h$ as the expectation of the loss function:

$R(h) = E[L(y,h(\mathbf{x})]$.

The ultimate goal of the learning task is thus to find such a hypothesis $h^{\ast}\in \mathcal{H}$ for which the risk $R(h)$ is minimal:

$h^{\ast} = \texttt{argmin}_{h\in \mathcal{H}} R(h)$.

Empirical Risk Minimization
In general, the risk $R(h)$ cannot be computed because the whole joint probability distribution $P(X,Y)$ is unknown to the learning algorithm. In fact, we can only operate on a sample drawn from such distribution, namely the training set $\mathcal{D}$. Thus, an approximation, called empirical risk (or in-sample error), can be computed by averaging the loss function on the training set $\mathcal{D}$:

$R_{\mathcal{D}}(h) = \frac{1}{m}\sum_{i=1}^{m}L(y_i,h(\mathbf{x_i}))$.

Empirical risk minimization states that the learning algorithm should provide a hypothesis $h^{\ast}$ which minimizes the empirical risk above:

$h^{\ast} = \texttt{argmin}_{h\in \mathcal{H}} R_{\mathcal{D}}(h) = \texttt{argmin}_{h\in \mathcal{H}} \sum_{i=1}^{m}L(y_i,h(\mathbf{x_i}))$.

Note that we removed the factor $1/m$ from the empirical risk minimization since it doesn’t affect the optimization (i.e., the value $v$ for which a sum $S(v)$ gets minimized also minimizes $\frac{1}{m} S(v)$).

Therefore, the supervised learning algorithm consists in solving the above optimization problem.
However, as we stated before, a truly effective learning algorithm should be able to devise a hypothesis which in turn can be used to predict unseen examples (i.e., those outside the training set).
Unfortunately, with emprirical risk minimization we are guaranteed of very few errors on the training set (i.e., a low in-sample error) but we cannot ensure the learned hypothesis is able to generalize on unseen example as well (i.e., a low out-of-sample error). For instance, a hypothesis function $h$, which is a lookup table that simply memorizes all the training examples, has the minimum in-sample error ($R_{\mathcal{D}}(h) = 0$) but no ability to generalize on unseen examples. This problem is called overfitting and it is mostly caused by the huge size of the hypothesis space $\mathcal{H}$ which $h^{\ast}$ is picked from. Intuitively, not every hypothesis space has the same size. By increasing the size of the set of functions that the learning algorithm can choose from we implicitly increase the complexity, and therefore the flexibility, of the hypothesis we are going to select. That means that our hypothesis will almost perfectly fit our training examples (i.e., low bias) but will not be able to predict unseen examples correctly (i.e., high variance).

As a side note, let us consider how the size of the hypothesis space $\mathcal{H}$ may vary.
Generally speaking, $\mathcal{H}$ can be expressed as follows:

$\mathcal{H} = \{h|h:X\longmapsto Y\}$.

Given two sets $X$ and $Y$ the possible number of functions that can be derived using $X$ as the domain and $Y$ as the range are $|Y|^{|X|}$.
Therefore,

Let us first consider $\mathcal{H}_1 = \{h|h:X\longmapsto Y\}$ as our hypothesis space, which is made of all the purely conjunctive concepts over the same $n$ boolean attributes. Each attribute is required to be true, false, or ignored.
It turns out that $X$ is the set of $n$ boolean attributes (i.e., $|X| = n$), whereas $Y$ is the set of possible outcomes for each attribute (i.e., $|Y|=3$). Thus, $|\mathcal{H}_1| = |Y|^{|X|} = 3^n$.

Now, assume instead to consider another hypothesis space $\mathcal{H}_2$ as the set of all the boolean functions whose domain consists of all the possible boolean $n$-uples. That is, $X = \{0,1\}^n$ and $Y\in \{0,1\}$:

$\mathcal{H}_2 = \{h|h:\{0,1\}^n\longmapsto \{0,1\}\}$.

Then, what can we say about $|\mathcal{H}_2|$?

It is straightforward to see that the set of hypotheses coincides with all the possible truth tables which can be derived from all the possible $n$-uple boolean inputs. The number of possible inputs to any of this hypothesis is thus $|X| = 2^n$, namely all the possible binary assignments of a tuple of length $n$. For each of those $2^n$ possible inputs, any function can either output $0$ or $1$, i.e., $|Y| = 2$. Overall, we have a total number of $|Y|^{|X|} = 2^{2^{n}}$ possible hypotheses, namely the number of possible hypotheses is exactly the number of possible boolean assignments for a tuple of length $2^n$.
As it turns out, with just $n=6$ there are $2^{2^{6}}\approx 18\times 10^{18}$ possible hypotheses we can choose from!

It results that the latter hypothesis space is far greater than the former, i.e., $|\mathcal{H}_2| \gg |\mathcal{H}_1|$. Intuitively, this means that $\mathcal{H}_2$ is a more expressive and complex hypothesis space. From a side, this increases the chance the target function can be expressed (i.e., it is more likely to pick the “right” hypothesis from a larger family of functions). On the other hand, this also increases the number of hypotheses that are consistent with our training set, so it may get worse predictions on unseen examples.
Things get even worst and need to be properly fixed within a specific theory when the size of the hypothesis space becomes infinite. For instance, how many distinct linear separators exist in $n$-dimensional Euclidean space ($\mathbb{R}^n$)? Infinitely many! In fact, if we consider the easy case of $n=2$ we are focussing on finding how many lines the $\mathbb{R}^2$ space can be partitioned to. More generally, for $n>2$ the hypothesis space coincides with all the possible hyperplanes that divides $\mathbb{R}^n$.
As I said before, there is a measure of the size even for such infinite hypothesis spaces that requires setting up a more advanced theory framework. Please, refer to Vapnik-Chervonenkis (VC) dimension and Probabilistic Approximately Correct (PAC) learning.

Structural Risk Minimization
Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization of the risk function. The regularization penalty can be viewed as implementing a form of Occam’s razor, according to which simpler hypotheses should be preferred over more complex ones.
A wide variety of penalties have been employed that correspond to different definitions of complexity.
For instance, consider a typical linear regression problem where $\mathbf{x}$ is the generic input and $\hat{y}\in \mathbb{R}$ its predicted value, which is computed from a linear function $h$ parametrized by $\theta$:

$\hat{y} = h(\mathbf{x};\theta) = \theta_0 + \theta_1 x_{1} + \ldots + \theta_n x_{n} = \theta_0 + \sum_{j=1}^{n}\theta_j x_{j}$.

In such a case, a popular regularization penalty is given by $\|\theta\|^{2}_2 = \sum_{j=1}^n \theta^{2}_j$ which is the squared Euclidean norm of the weights, also known as the $L_2$ norm. Other well-known penalties include the so-called $L_1$ norm, which is computed as $\|\theta\|_1 = \sum_{j=1}^n |\theta_j|$, and the $L_0$ norm, which is simply the count of non-zero $\theta_j$s, i.e., $\|\theta\|_0 = \sum_{j=1}^n 1_{\theta_j\neq 0}(\theta_j)$.
Whether you go for $L_2$, $L_1$, or $L_0$ you may indicate such penalty as $C(h)$.
Also, in this scenario a typical empirical risk function $R_{\mathcal{D}}(h)$ is the average of the squared losses, i.e., $L(y,\hat{y}) = (y-\hat{y})^2$

$R_{\mathcal{D}}(h) = \frac{1}{m}\sum_{i=1}^{m}(y_i-h(\mathbf{x_i};\theta))^2$

Without considering regularization penalty, the above risk function can be minimized according to the empirical risk minimization we seen before, as follows:

$h^{\ast} = \texttt{argmin}_{h\in \mathcal{H}} R_{\mathcal{D}}(h) = \texttt{argmin}_{\theta} \sum_{i=1}^{m}(y_i-h(\mathbf{x_i};\theta))^2$.

In fact, if we consider also the regularization the supervised learning problem aims to find a hypothesis $h^{\ast}$ which minimizes this revisited risk function:

$h^{\ast} = \texttt{argmin}_{h\in \mathcal{H}} R_{\mathcal{D}}(h) + \lambda C(h)$.

Note that the above optimization differs from the empirical risk minimization due to the penalty factor $\lambda C(h)$.
For instance, if $C(h) = \|\theta\|^{2}_2$, the structural risk minimization is computed as follows:

$h^{\ast} = \texttt{argmin}_{\theta} \sum_{i=1}^{m}(y_i-h(\mathbf{x_i};\theta))^2 + \lambda \|\theta\|^{2}_2$.

The parameter $\lambda$ controls the bias-variance tradeoff. More specifically, when $\lambda = 0$, the structural risk minimization turns into empirical risk minimization, thus providing a hypothesis with low bias and high variance. On the other hand, as $\lambda$ increases, the learned hypothesis will have high bias and low variance. The value of $\lambda$ can be chosen empirically via cross validation.