*This post is the sixth one of our series on* *the history and foundations of econometric and machine learning models. The first fours were on econometrics techniques. Part 5 is online here.*

## The probabilistic formalism in the 80’s

We have a training sample, with observations (\mathbf{x}_i,y_i) where the variables y are in a set \mathcal{Y}. In the case of classification, \mathcal{Y}=\{-1,+1\}, but a relatively general set can be considered (note that if econometricians prefer \mathcal{Y}=\{0,1\} – because of the Bernoulli distribution and because 0 and 1 are lower and upper bounds of probabilities, people in the “machine learning” community prefer \mathcal{Y}=\{-1,+1\}). A predictor m is an function taking values in \mathcal{Y}, used to label (or classify) future new observations, using some features that lie in a set \mathcal{X}. It is assumed that the labels are produced by an (unknown) classifier f called target. For a statistician, this function would be the real model. Naturally, we want to build m as close as possible to f. Let \mathbb{P} be a (unknown) distribution on \mathcal{X}. The error of m with respect to target f is defined by \mathcal{R}_{\mathbb{P},f}(m)=\mathbb{P}[m(\boldsymbol{X})\neq f(\boldsymbol{X})]\text{ where }\boldsymbol{X}\sim\mathbb{P}or equivalently,\mathcal{R}_{\mathbb{P},f}(m)=\mathbb{P}\big[\{\boldsymbol{x}\in\mathcal{X}:m(\boldsymbol{x})\neq f(\boldsymbol{x})\}\big]To obtain our “optimal” classifier, it becomes necessary to assume that there is a link between the data in our sample and the pair (\mathbb{P},f) , i.e. a data generation model. We will then assume that the \mathbf{x}_i are obtained by independent draws according to \mathbb{P}, and that then y_i=f(\mathbf{x}_i) . We can define the empirical risk of a classifier m, as \widehat{{R}}(m)=\frac{1}{n}\sum_{i=1}^n \boldsymbol{1}(m(\boldsymbol{x}_i)\neq y_i)

It is important to recognize that a perfect model cannot be found, in the sense that R_{\mathbb{P},f} (m)=0. Indeed, if we consider the simplest case, with \mathcal{X}=\{x_1,x_2\} and \mathbb{P} is such that \mathbb{P}(\{x_1\})=p and \mathbb{P}(\{x_2\})=1-p. The probability of never observing \{x_2\} among the n observations is (1-p)^n, and if p<1/n, it is quite likely never to observe \{x_2\} so it can never be predicted. We cannot therefore hope to have a zero risk whatever \mathbb{P}. And more generally, it is also possible to observe \{x_1\} and \{x_2\}, and despite everything, to make mistakes on the labels. Also, instead of looking for a perfect model, we can try to have an “approximately correct” model. We will then try to find m such that R_{\mathbb{P},f} (m)\leq\varepsilon, where \varepsilon is an a priori specified threshold. But even this condition is too strong, and cannot be fulfilled. Thus, we will usually as to have R_{\mathbb{P},f} (m)\leq\varepsilon with some probability 1-\delta. Hence, we will try to be “probably approximately correct” (PAC), allowing to make a mistake with a probability \delta, again fixed a priori.

Also, when we build a classifier, we do not know either \mathbb{P} or f, but we give ourselves a precision criterion \varepsilon , and a confidence parameter \delta, and we have n observations. Note that n, \varepsilon and \delta can be linked. We then look for a model m such that R_{\mathbb{P},f} (m)\leq\varepsilon with probability (at least) 1-\delta, so that we are probably approximately correct. Wolpert (1996) has shown (see details in Wolpert & Macready (1997)) that there is no universal learning algorithm. In particular, it can be shown that there is \mathbb{P} such that R_{\mathbb{P},f} (m) is relatively high, with a relatively high probability (also).

The interpretation is that since we cannot learn (in the PAC sense) about all the functions m, we will then force m to belong to a particular class, noted \mathcal{M}. Let us suppose, to start with, that \mathcal{M} contains a finite number of possible models. We can then show that for all \varepsilon and \delta, that for all \mathbb{P} and f, if we have enough observations (more precisely n\geq \varepsilon^{-1} \log[\delta^{-1} |\mathcal{M}|], then with a greater probability than 1-\delta, R_{\mathbb{P},f} (m^\star)\leq\varepsilon where m^\star \in \underset{m\in\mathcal{M}}{\text{argmin}}\Big\lbrace\frac{1}{n}\sum_{i=1}^n \boldsymbol{1}(m(\boldsymbol{x}_i)\neq y_i)\Big\rbracein other words m^\star is a model in \mathcal{M} that minimizes empirical risk.

We can go a little further, staying in the case where \mathcal{Y}=\{-1,+1\}. An \mathcal{M} class of classifiers will be called PAC-learnable if there is n_M:[0,1]^2\rightarrow \mathbb{N} such that, for all \varepsilon, \delta, \mathbb{P} and if it is assumed that the target f belongs to \mathcal{M}, then using n>n_M (\varepsilon,\delta) observations \mathbf{x}_i drawn from \mathbb{P}, labelled y_i by f, then there is m\in\mathcal{M} such that, with probability 1-\delta, R_{\mathbb{P},f} (m)\leq\varepsilon. The n_M function is then called “sample complexity to learn”. In particular, we have seen that if M contains a finite number of classifiers, then \mathcal{M} is PAC-learnable with complexity n_M (\varepsilon,\delta)=\varepsilon^{-1} \log[\delta^{-1} |M|].

Naturally, we would like to have a more general result, especially if \mathcal{M} is not finite. To do this, the VC dimension of Vapnik-Chervonenkis must be used, which is based on the idea of shattering points (for a binary classification). Consider k points \{x_1,\cdot,x_k\}, and consider the set {E}_k=\big\lbrace(m(\boldsymbol{x}_1),\cdots,m(\boldsymbol{x}_k))\text{ for }m\in\mathcal{M})\big\rbrace Note that the elements of E_k belong to \{-1,+1\}^k. In other words, |E_k |\leq 2^k. We will say that M shatter all the points if all the combinations are possible, i. e. |E_k |=2^k. Intuitively, the labels of the set of points do not provide enough information on target f, because anything is possible. The VC dimension of \mathcal{M} is then VC(\mathcal{M})=\sup\big\lbrace k\text{ such that }\mathcal{M}\text{ shatters }\{\boldsymbol{x}_1,\cdots\boldsymbol{x}_k\}\big\rbrace

For example, if \mathcal{X}=\mathbb{R} and all (simple) models of the form **[1] **m_{a,b}=\mathbf{1}_{\pm}(x\in[a,b]) are considered. No set of \{x_1,x_2,x_2,x_3\} ordered points can be shattered because it is sufficient to assign respectively +1, -1 and +1 to x_1, x_2 and x_3 respectively, therefore VC<3. On the other hand \{0,1\} is shattered, so VC\geq 2. The dimension of this predictor set is 2: If we increase by one dimension, \mathcal{X}=\mathbb{R}^2 and consider all (simple) models of the form m_{a,b}=\mathbf{1}_{\pm} (x\in[a,b]) (where [a,b] refers to the rectangle), then the dimension of \mathcal{M} is here 4.

To introduce SVMs, let’s place ourselves in the case where \mathcal{X}=\mathbb{R}^k, and consider separations by hyperplanes passing through the origin (we will say homogeneous), in the sense that m_{\mathbf{w}} (\mathbf{x})=\mathbf{1}_{\pm}(\mathbf{w}^T \mathbf{x}\geq 0) . It can be shown that no set of k+1 points can be shattered by these two homogeneous spaces in \mathbb{R}^k, and therefore VC(M)=k. If we add a constant, in the sense that m_{\mathbf{w},b} (\mathbf{x})=\mathbf{1}_{\pm}(\mathbf{w}^T \mathbf{x}+b\geq 0), we can show that no set of k+2 points can be sprayed by these two (non-homogeneous) spaces in \mathbb{R}^k, and therefore VC(M)=k+1. This dimension reminds us of the dimension of the model we’ve seen in the econometric context.

From this dimension VC, we deduce the so-called fundamental theorem of learning: if \mathcal{M} is a class of dimension d=VC(M) , then there are positive constants \underline{C} and \overline{C} such as the sample complexity for M to be PAC-learnable satisfies \underline{C}\epsilon^{-1}\big(d+\log[\delta^{-1}]\big)\leq n_{\mathcal{M}}(\epsilon,\delta) \leq \overline{C}\epsilon^{-1}\big(d\log[\epsilon^{-1}]+\log[\delta^{-1}]\big)The link between the notion of learning (as defined in Vailiant (1984)) and the VC dimension was clearly established in Blumer et al (1989).

Nevertheless, while the work of Vapnik and Chervonenkis is considered to be the foundation of statistical learning, Thomas Cover’s work in the 1960s and 1970s should also be mentioned, in particular Cover (1965) on the capacities of linear models, and Cover & Hart (1967) on learning in the context of the algorithm of the k-nearest neighbors. These studies have linked learning, information theory (with the textbook Cover & Thomas (1991)), complexity and statistics. Other authors have subsequently brought the two communities closer together, in terms of learning and statistics. For example, Halbert White proposed to see neural networks in a statistical context in White (1989), going so far as to state that « *learning procedures used to train artificial neural networks are inherently statistical techniques. It follows that statistical theory can provide considerable insight into the properties, advantages, and disadvantages of different network learning methods* ». This turning point in the late 1980s will anchor learning theory in a probabilistic context.

## Objective and loss function

These choices (of objective and loss function) are essential, and very dependent on the problem under consideration. Let us begin by describing a historically important model, Rosenblatt’s (1958) “perceptron”, introduced into classification problems, where y\in\{-1,+1\}, inspired by McCulloch & Pitts (1943). We have data \{(y_i,\mathbf{x}_i)\}, and we will iteratively build a set of m_k[\mathbf{x} models, where at each step, we will learn from the errors of the previous model. In the perceptron, a linear model is considered so that :m(\mathbf{x})=\boldsymbol{1}_{\pm}(\beta_0+\mathbf{x}^T \boldsymbol{\beta}\geq 0)=\left\lbrace\begin{array}{l}+1\text{ si }\beta_0+\mathbf{x}^T \boldsymbol{\beta}\geq 0\\-1\text{ si }\beta_0+\mathbf{x}^T \boldsymbol{\beta}< 0\end{array}\right.where \beta coefficients are often interpreted as “weights” assigned to each of the explanatory variables. We give ourselves initial weights (\beta_0^{(0)},\beta^{(0)} , which we will update taking into account the prediction error made, between y_i and the prediction \widehat{y}_i^{(k)} :\widehat{y}_i^{(k)}=m^{(k)}(\mathbf{x}_i)=\boldsymbol{1}_{\pm}(\beta_0^{(k)}+\mathbf{x}^T \boldsymbol{\beta}^{(k)}\geq 0), with, in the case of the perceptron:\beta_j^{(k+1)}={\beta}_j^{(k)}+\eta\underbrace{(\mathbf{y}-\widehat{\mathbf{y}}^{(k)})^T}_{=\ell({\mathbf{y}},\widehat{\mathbf{y}}^{(k)})}\mathbf{x}_jHere \ell(y,y')=\mathbf{1}(y\neq y') is a loss function, which will allow to give a price to an error made, by predicting \widehat{y}=m(\mathbf{x}) and observing y. For a regression problem, we can consider a quadratic error \ell_2, such that \ell(y,m(\mathbf{x}))=(y-m(\mathbf{x}))^2 or in absolute value \ell_1, with \ell(y,m(\mathbf{x}))=|y-m(\mathbf{x})|. Here, for our classification problem, we used a mis-qualification indicator (we could discuss the symmetry of this loss function, suggesting that a false positive costs as much as a false negative). Once this loss function has been specified, we recognize in the problem previously described a gradient descent, and we see that we are trying to solve:m^\star(\mathbf{x})=\underset{m\in\mathcal{M}}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell(y_i,m(\mathbf{x}_i))\right\rbrace~~~(6)for a predefined set of predictors \mathcal{M}. Any machine learning problem is mathematically formulated as an optimization problem, whose solution determines a set of model parameters (if the \mathcal{M} family is described by a set of parameters – which can be coordinates in a functional database). We can note \mathcal{M}_0 the space of the hyperplanes of \mathbb{R}^p in the sense thatm\in\mathcal{M}_0 \text{\quad means \quad}m(\mathbf{x})=\beta_0+\beta^T\mathbf{x}\text{ where }\beta\in\mathbb{R}^p generating the class of linear predictors. We will then have the estimator that minimizes the empirical risk. Some of the recent work in statistical learning aims to study the properties of the estimator \widehat{m}^\star, known as “oracle”, in a family of \mathcal{M} estimators, \widehat{m}^{\star} =\underset{\widehat{m}\in\mathcal{M}}{\text{argmin}}\big\lbrace\mathcal{R}(\widehat{m},m)\big\rbraceThis estimator is, of course, impossible to define because it depends on m, the real model, unknown.

But let’s come back a little more to these loss functions. A loss function \ell is a function \mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}_+, symmetric, which checks the triangular inequality, and such that \ell(x,y)=0 if and only if x=y. The associated norm is \|\cdot\|, such that \ell(x,y)=\|x-y\|=\ell(x-y,0) (using the fact that \ell(x,y+z)=\ell(x-y,z) – we will review this fundamental property later).

For a quadratic loss function, it should be noted that we can have a particular interpretation of this problem, since:\overline{y}=\underset{m\in\mathbb{R}}{\text{argmin}} \left\lbrace\sum_{i=1}^n\frac{1}{n} [y_i-m]^2\right\rbrace=\underset{m\in\mathbb{R}}{\text{argmin}} \left\lbrace \sum_{i=1}^n \ell_2(y_i,m)\right\rbrace where \ell_2 is the usual quadratic distance If we assume – as we did in econometrics – that there is an underlying probabilistic model, and observe that : \displaystyle{\mathbb{E}(Y)=\underset{m\in\mathbb{R}}{\text{argmin}}\left\lbrace\mathbb{E}\left([Y-m]^2\right)\right\rbrace=\underset{m\in\mathbb{R}}{\text{argmin}}\left\lbrace\mathbb{E}\big[\ell_2(Y,m)\big]\right\rbrace}it should be noted that what we are trying to obtain here, by solving the problem (6) by taking the norm \ell_2, is an approximation (in a given functional space, \mathcal{M}) of the conditional expectation x\mapsto\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]. Another particularly interesting loss function is the loss \ell_1, \ell_1 (y,m)=|y-m|[\latex]. It should be recalled that [latex display="true"]\displaystyle{\text{median}(\boldsymbol{y})=\underset{m\in\mathbb{R}}{\text{argmin}}\left\lbrace\sum_{i=1}^n\ell_1(y_i,m)\right\rbrace}The optimization problem :\widehat{m}^{\star}=\underset{m\in\mathcal{M}_0}{\text{argmin}}\left\lbrace\sum_{i=1}^n\vert y_i-m(\mathbf{x}_i)\vert\right\rbrace is obtained in econometrics by assuming that the conditional law of Y follows a Laplace law centered on m(\mathbf{x}), and by maximizing the likelihood (log) (the sum of the absolute values of the errors corresponds to the log-reasonableness of a Laplace law). It should also be noted that if the conditional law of Y is symmetrical with respect to 0, the median and the mean coincide If this loss function is rewritten \ell_1(y,m)=\vert (y-m)(1/2-\boldsymbol{1}_{y\leq m})\vert a generalization can be obtained for \tau\in[0.1]:\widehat{m}^\star_\tau=\underset{m\in\mathcal{M}_0}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell_\tau^{ q} (y_i,m(\mathbf{x}_i)) \right\rbracewhere\ell_{\tau}^{q}(x,y)= (x-y)(\tau-\boldsymbol{1}_{x\leq y}) is then the quantile regression of level \tau (Koenker, 2003; d'Haultefœuille & Givord, 2014). Another loss function, introduced by Aigner et al (1977) and analysed in Waltrup et al (2014), is the function associated with the notion of expectations: \displaystyle{\ell}^{\text{ e}}_{\tau}(x,y)= (x-y)^2\cdot\big\vert\tau-\boldsymbol{1}_{x\leq y}\big\vertwith \tau\in[0.1]. We see the parallel with the quantile function: \displaystyle{\ell}^{\text{ q}}_{\tau}(x,y)= \vert x-y\vert \cdot\big\vert\tau-\boldsymbol{1}_{x\leq y}\big\vertKoenker & Machado (1999) and Yu & Moyeed (2001) also noted a link between this condition and the search for maximum likelihood when Y's conditional law follows an asymmetric Laplace law.

In connection with this approach, Gneiting (2011) introduced the notion of "ellicable statistics" - or "ellicable measurement" in its probabilistic (or distributional) version: a statistic T will be said to be "ellicitable" if there is a loss function \ell:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}_+ such that:T(Y)=\underset{x\in\mathbb{R}}{\text{argmin}}\left\lbrace\int_{\mathbb{R}} \ell(x,y)dF(y)\right\rbrace=\underset{x\in\mathbb{R}}{\text{argmin}}\left\lbrace\mathbb{E}\big[ \ell(x,Y)\big]\text{ where }Y\overset{\mathcal{L}}{\sim} F\right\rbrace The mean (mathematical expectation) is thus ellicable by the quadratic distance, \ell_2, while the median is ellicable by the distance \ell_1. According to Gneiting (2011), this property is essential for obtain predictions and forecasts. There may then be a strong link between measures associated with probabilistic models and loss functions. Finally, Bayesian statistics provide a direct link between the form of the a priori law and the loss function, as studied by Berger (1985) and Bernardo & Smith (2000). We will come back to the use of these different norms in the section on penalization.

*To be continued (keep in mind that references are online here)…*

**[1]** Where the indicator \mathbf{1}_{\pm} does not take values 0 or 1 (like the classical \mathbf{1} function), but -1 and +1.