Econometrics and Machine Learning*
Arthur Charpentier, Emmanuel Flachaire and Antoine Ly
* Economie et Statistique / Economics and Statistics, 505-506, 2018
Compléments en ligne / Online complements

Online Complement – History and Foundations of Econometric and Machine Learning Models

Econometrics and the Probabilistic Model

The importance of probabilistic models in economics is rooted in Working's (1927) questions and the attempts to answer them in Tinbergen's two volumes (1939). The latter subsequently generated a great deal of reflection, as recalled by Duo (1993) in his book on the foundations of econometrics, and more particularly in its first chapter, "The Probability Foundations of Econometrics". Trygve Haavelmo, who was awarded the Nobel Prize in Economics in 1989 for his "clarification of the foundations of the probabilistic theory of econometrics", recognized that influence. More specifically, Haavelmo (1944), which initiated a profound change in econometric theory (as recalled in Chapter 8 of Morgan, 1990), stated that econometrics is fundamentally based on a probabilistic model, for two main reasons. First, the use of statistical quantities (or "measures") such as means, standard errors and correlation coefficients for inferential purposes can only be justified if the process generating the data can be expressed in terms of a probabilistic model. Second, the probability approach is relatively general and particularly well suited to the analysis of "dependent" and "non-homogeneous" observations, as they are often found in economic data.

In this framework, we will assume that there is a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ such that the observations $(y_i, x_i)$ are seen as realizations of random variables $(Y_i, X_i)$. In practice, we are not very interested in the joint distribution of the pair $(Y, X)$: the law of $X$ is unknown (actually, all the analysis is done conditional on the observations $x_i$), and it is the law of $Y$ conditional on $X$ that we will be interested in. In the following, we will denote $x$ a single observation, $\boldsymbol{x}$ a vector of observations, $X$ a random variable, and $\boldsymbol{X}$ a random vector. Abusively, $\boldsymbol{X}$ may also designate the matrix of individual observations (denoted $x_i$), depending on the context.

Foundations of Mathematical Statistics

As recalled in the introduction of Vapnik (1998), inference in parametric statistics is based on the following belief: the statistician knows the problem to be analyzed well; in particular, he knows the physical law that generates the stochastic properties of the data, and the function to be found is written via a finite number of parameters.[1] To find these parameters, the maximum likelihood method is usually considered. There are many theoretical justifications for this approach. We will see that in machine learning the philosophy is very different, since we do not have a priori reliable information on the statistical law underlying the problem, nor even on the function we would like to approximate (we will then propose methods to construct an approximation from the data at our disposal, as in Vapnik, 1998). A "golden age" of parametric inference, from 1930 to 1960, laid the foundations of mathematical statistics, which can be found in all statistical textbooks, including those used nowadays as references in many courses. As Vapnik (1998) states, the classical parametric paradigm is based on the following three beliefs:
- To find a functional relationship from the data, the statistician is able to define a set of functions, linear in their parameters, that contains a good approximation of the desired function. The number of parameters describing this set is small.
- The statistical law underlying the stochastic component of most real-life problems is the normal law. This belief has been supported by reference to the central limit theorem, which states that, under fairly general conditions, the sum of a large number of random variables can be approximated by the normal distribution.
- The maximum likelihood method is a good tool for estimating parameters.

In this section we will come back to the construction of the econometric paradigm, directly inspired by that of classical inferential statistics.

Conditional Distributions and Likelihood

Linear econometrics has been constructed under the assumption of individual data,[2] which amounts to assuming independent variables $(Y_i, X_i)$. More precisely, we will assume that, conditionally on the explanatory variables $X_i$, the variables $Y_i$ are independent. We will also assume that these conditional distributions remain in the same parametric family, but that the parameter is a function of $x$. In the Gaussian linear model, it is assumed that:

$$(Y \mid X = x) \overset{\mathcal{L}}{\sim} \mathcal{N}(\mu(x), \sigma^2) \quad \text{where } \mu(x) = \beta_0 + x^T \beta, \text{ and } \beta \in \mathbb{R}^p. \qquad (1)$$

[1] This approach can be compared to structural econometrics, as presented for example in Kean (2010).
[2] It is possible to consider temporal observations; we would then have time series $(Y_t, X_t)$, but we will not discuss those in this article.

It is usually called a "linear" model since $\mathbb{E}[Y \mid X = x] = \beta_0 + x^T \beta$ is a linear combination of the covariates. It is said to be a homoscedastic model if $\text{Var}[Y \mid X = x] = \sigma^2$, where $\sigma^2$ is a positive constant. To estimate the parameters, the traditional approach is to use the maximum likelihood estimator, as initially suggested by Ronald Fisher. In the case of the Gaussian linear model, the log-likelihood is written:

$$\log \mathcal{L}(\beta_0, \beta, \sigma^2 \mid y, x) = -\frac{n}{2} \log[2\pi\sigma^2] - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2.$$

Note that the term on the right, measuring a distance between the data and the model, will be interpreted as the deviance in generalized linear models. We will then set:

$$(\hat{\beta}_0, \hat{\beta}, \hat{\sigma}^2) = \text{argmax} \left\{ \log \mathcal{L}(\beta_0, \beta, \sigma^2 \mid y, x) \right\}$$

Since the first term does not depend on $\beta_0$ and $\beta$, the maximum likelihood estimator of these parameters is obtained, from the term on the right, by minimizing the sum of squared errors (the so-called "least squares" estimator), which we will find again in the machine learning approach. The first-order conditions yield the normal equations (spelled out in the short derivation below), which in matrix form read $X^T[y - X\hat{\beta}] = 0$, and can also be written $(X^T X)\hat{\beta} = X^T y$. If $X$ is a full (column) rank matrix, we then find the classical estimator:

$$\hat{\beta} = (X^T X)^{-1} X^T y = \beta + (X^T X)^{-1} X^T \varepsilon \qquad (2)$$

using the residual-based formulation (as is common in econometrics), $y = X\beta + \varepsilon$. The Gauss-Markov theorem ensures that this estimator is the unbiased linear estimator with minimum variance. It can then be shown that $\hat{\beta} \overset{\mathcal{L}}{\sim} \mathcal{N}(\beta, \sigma^2 [X^T X]^{-1})$, and in particular, if we simply need the first two moments,

$$\mathbb{E}[\hat{\beta}] = \beta \quad \text{and} \quad \text{Var}[\hat{\beta}] = \sigma^2 [X^T X]^{-1}.$$

In fact, the normality hypothesis makes it possible to make a link with mathematical statistics, but the estimator given by equation (2) can be constructed without that Gaussian assumption. Hence, if we assume that $Y \mid X = x \overset{\mathcal{L}}{=} x^T \beta + \varepsilon$, where the $\varepsilon$'s have the same distribution, with $\mathbb{E}[\varepsilon] = 0$, $\text{Var}[\varepsilon] = \sigma^2$ and $\text{Cov}[\varepsilon, X_j] = 0$ for all $j$, then $\hat{\beta}$ is an unbiased estimator of $\beta$ with smallest variance among unbiased linear estimators. Furthermore, even if normality cannot be obtained at finite distance, this estimator is asymptotically Gaussian, with

$$\sqrt{n}(\hat{\beta} - \beta) \overset{\mathcal{L}}{\to} \mathcal{N}(0, \Sigma) \quad \text{as } n \to \infty,$$

for some matrix $\Sigma$.
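As announced above, here is the short derivation of the normal equations from the first-order conditions; it is only a sketch, using the matrix notation of equation (2), with the intercept absorbed into the design matrix $X$:

$$\frac{\partial}{\partial \beta} \, \|y - X\beta\|^2 = \frac{\partial}{\partial \beta} \left( y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \right) = -2 X^T y + 2 X^T X \beta,$$

and setting this gradient to zero at $\hat{\beta}$ gives $X^T[y - X\hat{\beta}] = 0$, i.e. $(X^T X)\hat{\beta} = X^T y$, hence equation (2) whenever $X^T X$ is invertible.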
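To make the link between the likelihood and least squares concrete, here is a minimal numerical sketch in Python (using only numpy; the values chosen for $n$, $\beta$ and $\sigma$ are arbitrary and purely illustrative). It simulates from the Gaussian linear model (1), computes $\hat{\beta}$ from the normal equations, checks that it coincides with a standard least-squares routine, and estimates $\text{Var}[\hat{\beta}] = \sigma^2 [X^T X]^{-1}$.

import numpy as np

rng = np.random.default_rng(0)

# Simulate from the Gaussian linear model (1): y_i = beta_0 + x_i' beta + eps_i, eps_i ~ N(0, sigma^2)
n, p = 1000, 3
beta_true = np.array([1.0, 2.0, -1.0, 0.5])    # (beta_0, beta_1, beta_2, beta_3), illustrative values
sigma = 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with an intercept column
y = X @ beta_true + sigma * rng.normal(size=n)

# Maximum likelihood / least squares estimator of equation (2): solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimator is returned by a generic least-squares routine
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_ls)

# Estimated residuals and estimated variance of beta_hat: sigma^2 (X'X)^{-1}
eps_hat = y - X @ beta_hat
sigma2_hat = eps_hat @ eps_hat / (n - p - 1)   # unbiased estimate of sigma^2
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)

print("beta_hat   :", np.round(beta_hat, 3))
print("std. errors:", np.round(np.sqrt(np.diag(var_beta_hat)), 3))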
The condition that $X$ has full rank can be (numerically) demanding in large dimensions. If it is not satisfied, $\hat{\beta} = (X^T X)^{-1} X^T y$ does not exist. If $\mathbb{I}$ denotes the identity matrix, note however that $(X^T X + \lambda \mathbb{I})^{-1} X^T y$ always exists, for any $\lambda > 0$. This estimator is called the ridge estimator of level $\lambda$ (introduced in the 1960s by Hoerl, 1962, and associated with a regularization studied by Tikhonov, 1963); a numerical sketch is given at the end of this complement. This estimator also appears naturally in a Bayesian econometric context.

Residuals

It is not uncommon to introduce the linear model through the distribution of the residuals, as we mentioned earlier. Equation (1) is then often written as:

$$y_i = \beta_0 + x_i^T \beta + \varepsilon_i \qquad (3)$$

where the $\varepsilon_i$'s are realizations of independent and identically distributed (i.i.d.) random variables drawn from some $\mathcal{N}(0, \sigma^2)$ distribution. With a vector notation, we will write $\varepsilon \overset{\mathcal{L}}{\sim} \mathcal{N}(0, \sigma^2 \mathbb{I})$. The estimated residuals are defined as:

$$\hat{\varepsilon}_i = y_i - [\hat{\beta}_0 + x_i^T \hat{\beta}]$$

Those (estimated) residuals are basic tools for diagnosing the relevance of the model.

An extension of the model described by equation (1) has been proposed to take into account a possible heteroscedastic feature:

$$(Y \mid X = x) \overset{\mathcal{L}}{\sim} \mathcal{N}(\mu(x), \sigma^2(x))$$

where $\sigma^2(x)$ is a positive function of the explanatory variables. In that case, the model can be rewritten as:

$$y_i = \beta_0 + x_i^T \beta + \sigma(x_i) \cdot \varepsilon_i$$

where the residuals are still i.i.d., with unit variance:

$$\varepsilon_i = \frac{y_i - [\beta_0 + x_i^T \beta]}{\sigma(x_i)}$$

While formulations based on residuals are popular in linear econometrics when the dependent variable is continuous, they are much less so for count models or logistic regression. However, writing the model with an error term (as in equation (3)) raises many questions about the representation of an economic relationship between two quantities. For example, one can assume that there is a relationship (linear, to begin with) between the quantity of a traded good, $q$, and its price $p$. This allows us to imagine a supply equation:

$$q_i = \beta_0 + \beta_1 p_i + u_i$$

($u_i$ being an error term), where the quantity sold depends on the price; but, in an equally legitimate way, one can imagine that the price depends on the quantity produced (what one could call a demand equation):

$$p_i = \alpha_0 + \alpha_1 q_i + v_i$$

($v_i$ denoting another error term).
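As announced above, here is a minimal numerical sketch of the ridge estimator $(X^T X + \lambda \mathbb{I})^{-1} X^T y$, in Python with numpy only. The design matrix deliberately contains two perfectly collinear columns, so that $X^T X$ is singular and the ordinary least squares estimator does not exist; the value of $\lambda$ is arbitrary.

import numpy as np

rng = np.random.default_rng(1)

# Design matrix with an exactly collinear column, so that X'X is singular
n = 200
x1 = rng.normal(size=n)
x2 = 2.0 * x1                                   # perfectly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 3.0 * x1 + rng.normal(size=n)

XtX = X.T @ X
print("rank of X'X:", np.linalg.matrix_rank(XtX), "out of", XtX.shape[0])   # 2 out of 3

# (X'X)^{-1} X'y does not exist here, but (X'X + lambda I)^{-1} X'y does, for any lambda > 0
lam = 1.0                                       # arbitrary regularization level
beta_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), X.T @ y)
print("ridge estimator of level", lam, ":", np.round(beta_ridge, 3))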
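Similarly, the heteroscedastic extension above can be illustrated with a short sketch: data are simulated with a variance $\sigma^2(x)$ that increases with the covariate (here $\sigma(x) = 1 + |x|$, an arbitrary choice made only for this example), and the standardized residuals $\varepsilon_i = (y_i - [\beta_0 + x_i^T \beta]) / \sigma(x_i)$ are checked to have unit variance while the raw residuals do not.

import numpy as np

rng = np.random.default_rng(2)

# Heteroscedastic Gaussian linear model: Var[Y | X = x] = sigma(x)^2
n = 5000
x = rng.normal(size=n)
beta0, beta1 = 1.0, 2.0                        # illustrative coefficients
sigma_x = 1.0 + np.abs(x)                      # arbitrary positive function sigma(x)
y = beta0 + beta1 * x + sigma_x * rng.normal(size=n)

# Raw residuals around the regression line have non-constant variance ...
raw = y - (beta0 + beta1 * x)
# ... while the standardized residuals eps_i = raw_i / sigma(x_i) have unit variance
standardized = raw / sigma_x

print("variance of raw residuals         :", round(raw.var(), 2))
print("variance of standardized residuals:", round(standardized.var(), 2))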