Machine Learning Methods: An Overview

A. Colin Cameron, U.C.-Davis. Presented at the University of California - Riverside

April 23 2019

Abstract

This talk summarizes key machine learning methods and presents them in a language familiar to economists. The focus is on regression of y on x. The talk is based on the two books
- Introduction to Statistical Learning (ISL)
- Elements of Statistical Learning (ESL).
It emphasizes the ML methods most used currently in economics.

Introduction

Machine learning methods include data-driven algorithms to predict y given x.
- there are many machine learning (ML) methods
- the best ML methods vary with the particular data application
- and guard against in-sample overfitting.
The main goal of the machine learning literature is prediction of y
- and not estimation of β.
For economics, prediction of y is sometimes a goal
- e.g. do not provide a hip transplant to individuals with low predicted one-year survival probability
- then the main issue is what is the best standard ML method.
But often economics is interested in estimation of β or estimation of a partial effect such as average treatment effects
- then new methods using ML are being developed by econometricians.

Econometrics for Machine Learning

The separate slides on ML methods in Economics consider the following microeconomics examples.
Treatment effects under an unconfoundedness assumption
- Estimate β in the model y = β x1 + g(x2) + u
- the assumption that x1 is exogenous is more plausible with better g(x2)
- machine learning methods can lead to a good choice of g(x2).
Treatment effects under endogeneity using instrumental variables
- Now x1 is endogenous in the model y = β1 x1 + g(x2) + u
- given instruments x3 and x2 there is a potential many-instruments problem
- machine learning methods can lead to a good choice of instruments.
Average treatment effects for heterogeneous treatment
- ML methods may lead to better regression imputation and better propensity score matching.

Econometrics for Machine Learning (continued)

ML methods involve data mining
- using traditional methods, data mining leads to the complications of pre-test bias and multiple testing.
In the preceding econometrics examples, the ML methods are used in such a way that these complications do not arise
- an asymptotic distribution for the estimates of β1 or ATE is obtained
- furthermore this is Gaussian.

Overview

1. Terminology
2. Model selection - especially cross-validation
3. Shrinkage methods
   1. Variance-bias trade-off
   2. Ridge regression, LASSO, elastic net
4. Dimension reduction - principal components
5. High-dimensional data
6. Nonparametric and semiparametric regression
7. Flexible regression (splines, sieves, neural networks, ...)
8. Regression trees and random forests
9. Classification - including support vector machines
10. Unsupervised learning (no y) - k-means clustering

1. Terminology

The topic is called machine learning or statistical learning or data learning or data analytics, where the data may be big or small.
Supervised learning = Regression
- We have both outcome y and regressors (or features) x
- 1. Regression: y is continuous
- 2. Classification: y is categorical.
Unsupervised learning
- We have no outcome y - only several x
- 3. Cluster analysis: e.g. determine five types of individuals given many psychometric measures.
This talk
- focuses on 1.
- briefly mentions 2.
- even more briefly mentions 3.

Terminology (continued)

Consider two types of data sets
- 1. training data set (or estimation sample)
  - used to fit a model.
- 2. test data set (or hold-out sample or validation set)
  - additional data used to determine how good is the model fit
  - a test observation (x0, y0) is a previously unseen observation.

2. Model Selection
2.1 Model selection: Forwards, backwards and best subsets

Forwards selection (or specific to general)
- start with the simplest model (intercept-only) and in turn include the variable that is most statistically significant or most improves fit
- requires up to p + (p−1) + ... + 1 = p(p+1)/2 regressions, where p is the number of regressors.
Backwards selection (or general to specific)
- start with the most general model and in turn drop the variable that is least statistically significant or least improves fit
- requires up to p(p+1)/2 regressions.
Best subsets
- for k = 1, ..., p find the best-fitting model with k regressors
- in theory requires (p choose 0) + (p choose 1) + ... + (p choose p) = 2^p regressions
- but the leaps and bounds procedure makes this much quicker
- p < 40 is manageable, though recent work suggests p in the thousands.

2.2 Selection using Statistical Significance

Use as criterion p < 0.05
- and do forward selection or backward selection
- common in ...
Problem:
- pre-testing changes the distribution of β̂
- this problem is typically ignored (but should not be ignored).

2.3 Goodness-of-fit measures

We wish to predict y given x = (x1, ..., xp).
A training data set d yields prediction rule f̂(x)
- we predict y at point x_0 using ŷ_0 = f̂(x_0)
- e.g. for OLS ŷ_0 = x_0'(X'X)^{-1}X'y.
For regression consider squared error loss (y − ŷ)²
- some methods adapt to other loss functions
  - e.g. absolute error loss and log-likelihood loss
- and for classification the loss is 1(ŷ ≠ y).
We wish to estimate the true prediction error
- Err_d = E_F[(y_0 − ŷ_0)²]
- for a test data set point (x_0, y_0) ~ F.

Models overfit in sample

We want to estimate the true prediction error
- E_F[(y_0 − ŷ_0)²] for a test data set point (x_0, y_0) ~ F.
The obvious criterion is the in-sample squared error
- MSE = (1/n) Σ_{i=1}^n (y_i − ŷ_i)², where MSE = mean squared error.
Problem: in-sample MSE under-estimates the true prediction error
- intuitively, models "overfit" within sample.
Example: suppose y = Xβ + u
- then û = (y − Xβ̂_OLS) = (I − M)u where M = X(X'X)^{-1}X'
  - so |û_i| < |u_i| (the OLS residual is smaller than the true unknown error)
- and we use σ̂² = s² = (1/(n−k)) Σ_{i=1}^n (y_i − ŷ_i)² and not (1/n) Σ_{i=1}^n (y_i − ŷ_i)².
Two solutions:
- penalize for overfitting, e.g. adjusted R², AIC, BIC, Cp
- use out-of-estimation-sample prediction (cross-validation).

2.4 Penalized Goodness-of-fit Measures

Standard measures for a general parametric model are
- AIC: Akaike's information criterion (small penalty)
  - AIC = −2 ln L + 2k
- BIC: Bayesian information criterion (larger penalty)
  - BIC = −2 ln L + (ln n) × k.
Models with smaller AIC and BIC are preferred
- for regression with i.i.d. errors replace −2 ln L with n ln σ̂² plus a constant.
For OLS other measures are
- R̄²: adjusted R-squared
- Mallows Cp
- AICC = AIC + 2(K+1)(K+2)/(N−K−2).
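After an estimation command, official Stata's estat ic reports ln L, AIC and BIC directly; a minimal sketch, assuming a fitted regression of y on x1 as in the generated-data example below (the example slides instead compute the measures manually):

* AIC and BIC via official Stata's estat ic
quietly regress y x1
estat ic          // reports ln L, AIC = -2 lnL + 2k and BIC = -2 lnL + k ln(n)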

Stata Example

These slides use Stata
- most machine learning code is initially done in R.
Generated data: n = 40
Three correlated regressors
- (x_{1i}, x_{2i}, x_{3i}) are joint normal with unit variances and pairwise correlations 0.5.
But only x1 determines y
- y_i = 2 + x_{1i} + u_i where u_i ~ N(0, 3²).
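A minimal sketch of one way such data could be generated (drawnorm and rnormal are official Stata; the actual seed and code used for the slides are not shown, so the details here are illustrative assumptions):

* Illustrative data generation: three correlated regressors, y depends on x1 only
clear
set seed 10101
matrix C = (1, .5, .5 \ .5, 1, .5 \ .5, .5, 1)
drawnorm x1 x2 x3, n(40) corr(C)        // joint normal, unit variances, correlations 0.5
generate y = 2 + x1 + rnormal(0, 3)     // u ~ N(0, 3^2)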

Fitted OLS regression

As expected only x1 is statistically significant at 5%
- though due to randomness this is not guaranteed.

. * OLS regression of y on x1-x3
. regress y x1 x2 x3, vce(robust)

Linear regression                 Number of obs =     40
                                  F(3, 36)      =   4.91
                                  Prob > F      = 0.0058
                                  R-squared     = 0.2373
                                  Root MSE      = 3.0907

                      Robust
       y      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
      x1   1.555582   .5006152    3.11    0.004    .5402873   2.570877
      x2   .4707111   .5251826    0.90    0.376   -.5944086   1.535831
      x3  -.0256025   .6009393   -0.04    0.966   -1.244364   1.193159
   _cons   2.531396   .5377607    4.71    0.000    1.440766   3.622025

Example of penalty measures

We will consider all 8 possible models based on x1, x2 and x3.

. * Regressor lists for all possible models
. global xlist1
. global xlist2 x1
. global xlist3 x2
. global xlist4 x3
. global xlist5 x1 x2
. global xlist6 x2 x3
. global xlist7 x1 x3
. global xlist8 x1 x2 x3

. * Full sample estimates with AIC, BIC, Cp, R2adj penalties
. quietly regress y $xlist8
. scalar s2full = e(rmse)^2    // Needed for Mallows Cp

Example of penalty measures (continued)

Manually get the various measures
- all (but MSE) favor the model with just x1.

. forvalues k = 1/8 {
  2.   quietly regress y ${xlist`k'}
  3.   scalar mse`k' = e(rss)/e(N)
  4.   scalar r2adj`k' = e(r2_a)
  5.   scalar aic`k' = -2*e(ll) + 2*e(rank)
  6.   scalar bic`k' = -2*e(ll) + e(rank)*ln(e(N))
  7.   scalar cp`k' = e(rss)/s2full - e(N) + 2*e(rank)
  8.   display "Model " "${xlist`k'}" _col(15) " MSE=" %6.3f mse`k' ///
  >       " R2adj=" %6.3f r2adj`k' " AIC=" %7.2f aic`k' ///
  >       " BIC=" %7.2f bic`k' " Cp=" %6.3f cp`k'
  9. }
Model           MSE=11.272  R2adj= 0.000  AIC= 212.41  BIC= 214.10  Cp= 9.199
Model x1        MSE= 8.739  R2adj= 0.204  AIC= 204.23  BIC= 207.60  Cp= 0.593
Model x2        MSE= 9.992  R2adj= 0.090  AIC= 209.58  BIC= 212.96  Cp= 5.838
Model x3        MSE=10.800  R2adj= 0.017  AIC= 212.70  BIC= 216.08  Cp= 9.224
Model x1 x2     MSE= 8.598  R2adj= 0.196  AIC= 205.58  BIC= 210.64  Cp= 2.002
Model x2 x3     MSE= 9.842  R2adj= 0.080  AIC= 210.98  BIC= 216.05  Cp= 7.211
Model x1 x3     MSE= 8.739  R2adj= 0.183  AIC= 206.23  BIC= 211.29  Cp= 2.592
Model x1 x2 x3  MSE= 8.597  R2adj= 0.174  AIC= 207.57  BIC= 214.33  Cp= 4.000

2.5 Cross-validation

Begin with single-split validation
- for pedagogical reasons.
Then present K-fold cross-validation
- used extensively in machine learning
- generalizes to loss functions other than MSE, such as (1/n) Σ_{i=1}^n |y_i − ŷ_i|
- though more computation than e.g. BIC.
And present leave-one-out cross-validation
- widely used for bandwidth choice in nonparametric regression.
Given a selected model, the final estimation is on the full dataset
- usual inference ignores the data-mining.

Single-split validation

Randomly divide the available data into two parts
- 1. the model is fit on the training set
- 2. MSE is computed for predictions in the validation set.

Example: estimate all 8 possible models with x1, x2 and x3
- for each model estimate on the training set to get the β̂'s, predict on the validation set and compute MSE in the validation set
- choose the model with the lowest validation set MSE.
Problems with this single-split validation
- 1. Lose precision due to the smaller training set
  - so may actually overestimate the test error rate (MSE) of the model.
- 2. Results depend a lot on the particular single split.

Single-split validation example

Randomly form
- training sample (n = 14)
- test sample (n = 26)

. * Form indicator that determines training and test datasets
. set seed 10101
. gen dtrain = runiform() > 0.5   // 1 for training set and 0 for test data
. count if dtrain == 1
  14

Single-split validation example (continued)

In-sample (training sample) MSE is minimized with x1, x2, x3.
Out-of-sample (test sample) MSE is minimized with only x1.

. * Split sample validation - training and test MSE for the 8 possible models
. forvalues k = 1/8 {
  2.   quietly reg y ${xlist`k'} if dtrain==1
  3.   qui predict y`k'hat
  4.   qui gen y`k'errorsq = (y`k'hat - y)^2
  5.   qui sum y`k'errorsq if dtrain == 1
  6.   scalar mse`k'train = r(mean)
  7.   qui sum y`k'errorsq if dtrain == 0
  8.   qui scalar mse`k'test = r(mean)
  9.   display "Model " "${xlist`k'}" _col(16) ///
  >       " Training MSE = " %7.3f mse`k'train "  Test MSE = " %7.3f mse`k'test
 10. }
Model           Training MSE =  10.496  Test MSE =  11.852
Model x1        Training MSE =   7.625  Test MSE =   9.568
Model x2        Training MSE =  10.326  Test MSE =  10.490
Model x3        Training MSE =   7.943  Test MSE =  13.052
Model x1 x2     Training MSE =   7.520  Test MSE =  10.350
Model x2 x3     Training MSE =   7.798  Test MSE =  15.270
Model x1 x3     Training MSE =   7.138  Test MSE =  10.468
Model x1 x2 x3  Training MSE =   6.830  Test MSE =  12.708

K-fold cross-validation

K-fold cross-validation
- splits the data into K mutually exclusive folds of roughly equal size
- for j = 1, ..., K fit using all folds but fold j and predict on fold j
- standard choices are K = 5 and K = 10.
The following shows the case K = 5

         Fit on folds   Test on fold
  j = 1    2,3,4,5           1
  j = 2    1,3,4,5           2
  j = 3    1,2,4,5           3
  j = 4    1,2,3,5           4
  j = 5    1,2,3,4           5

The K-fold CV estimate is
CV_K = (1/K) Σ_{j=1}^K MSE_(j), where MSE_(j) is the MSE for fold j.

K-fold cross-validation example

The user-written crossfold command (Daniels 2012) implements this
- do so for the model with all three regressors and K = 5
- set the seed for replicability.

. * Five-fold cross validation example for model with all regressors
. set seed 10101
. crossfold regress y x1 x2 x3, k(5)

             RMSE
  est1   3.739027
  est2   2.549458
  est3   3.059801
  est4   2.532469
  est5   3.498511

. matrix RMSEs = r(est)
. svmat RMSEs, names(rmse)
. quietly generate mse = rmse^2
. quietly sum mse
. display _n "CV5 (average MSE in 5 folds) = " r(mean) " with st. dev. = " r(sd)

CV5 (average MSE in 5 folds) = 9.6990836 with st. dev. = 3.3885108

K-fold cross-validation example (continued)

Now do so for all eight models with K = 5
- the model with only x1 has the lowest CV(5).

. * Five-fold cross validation measure for all possible models
. drop rm*     // Drop variables created by previous crossfold
. drop _est*   // Drop variables created by previous crossfold
. forvalues k = 1/8 {
  2.   set seed 10101
  3.   quietly crossfold regress y ${xlist`k'}, k(5)
  4.   matrix RMSEs`k' = r(est)
  5.   svmat RMSEs`k', names(rmse`k')
  6.   quietly generate mse`k' = rmse`k'^2
  7.   quietly sum mse`k'
  8.   scalar cv`k' = r(mean)
  9.   scalar sdcv`k' = r(sd)
 10.   display "Model " "${xlist`k'}" _col(16) " CV5 = " %7.3f cv`k' ///
  >       " with st. dev. = " %7.3f sdcv`k'
 11. }
Model           CV5 =  11.960 with st. dev. =   3.561
Model x1        CV5 =   9.138 with st. dev. =   3.069
Model x2        CV5 =  10.407 with st. dev. =   4.139
Model x3        CV5 =  11.776 with st. dev. =   3.272
Model x1 x2     CV5 =   9.173 with st. dev. =   3.367
Model x2 x3     CV5 =  10.872 with st. dev. =   4.221
Model x1 x3     CV5 =   9.639 with st. dev. =   2.985
Model x1 x2 x3  CV5 =   9.699 with st. dev. =   3.389

Leave-one-out Cross Validation (LOOCV)

Use a single observation for validation and (n − 1) for training
- ŷ_(−i) is the prediction of y_i after OLS on observations 1, ..., i−1, i+1, ..., n
- cycle through all n observations doing this.
Then the LOOCV measure is
CV_(n) = (1/n) Σ_{i=1}^n MSE_(i) = (1/n) Σ_{i=1}^n (y_i − ŷ_(−i))².
Requires n regressions in general
- except for OLS one can show CV_(n) = (1/n) Σ_{i=1}^n ((y_i − ŷ_i)/(1 − h_ii))²
  - where ŷ_i is the fitted value from OLS on the full training sample
  - and h_ii is the i-th diagonal entry of the hat matrix X(X'X)^{-1}X'
  (a sketch of this shortcut follows below).
Used for bandwidth choice in local nonparametric regression
- such as k-nearest neighbors, kernel regression and local polynomial regression
- but not used for machine learning (see below).
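A minimal sketch of the OLS shortcut above, using only the residuals and leverages from a single full-sample regression (the model with only x1 is assumed):

* LOOCV for OLS without n regressions: CV(n) = (1/n) sum_i ((y_i - yhat_i)/(1 - h_ii))^2
quietly regress y x1
predict double uhat, residuals        // y_i - yhat_i
predict double hii, leverage          // h_ii, diagonal of the hat matrix
generate double cvterm = (uhat/(1 - hii))^2
quietly summarize cvterm
display "LOOCV estimate CV(n) = " r(mean)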

How many folds?

LOOCV is the special case of K-fold CV with K = N
- it has little bias
  - as all but one observation is used to fit
- but large variance
  - as the n predictions ŷ_(−i) are based on very similar samples
  - so subsequent averaging does not reduce variance much.
The choice K = 5 or K = 10 is found to be a good compromise
- neither high bias nor high variance.
Remember: for replicability set the seed, as this determines the folds.

3. Shrinkage Estimation

Consider the linear regression model with p potential regressors, where p is too large.
Methods that reduce the model complexity are
- choose a subset of regressors
- shrink regression coefficients towards zero
  - ridge, LASSO, elastic net
- reduce the dimension of the regressors
  - principal components analysis.
Linear regression may predict well if we include interactions and powers as potential regressors.
And the methods can be adapted to alternative loss functions for estimation.

3.1 Variance-bias trade-off

Consider the regression model
y = f(x) + u with E[u] = 0 and u ⊥ x.
For an out-of-estimation-sample point (y_0, x_0) the true prediction error is
E[(y_0 − f̂(x_0))²] = Var[f̂(x_0)] + {Bias(f̂(x_0))}² + Var(u).
The last term Var(u) is called the irreducible error
- we can do nothing about this.
So we need to minimize the sum of the variance and the bias-squared!
- more flexible models have less bias (good) and more variance (bad)
- this trade-off is fundamental to machine learning.

Variance-bias trade-off and shrinkage

Shrinkage is one method that is biased, but the bias may lead to lower squared error loss
- first show this for estimation of a parameter β
- then show this for prediction of y.
The mean squared error of a scalar estimator β̃ is
MSE(β̃) = E[(β̃ − β)²]
        = E[{(β̃ − E[β̃]) + (E[β̃] − β)}²]
        = E[(β̃ − E[β̃])²] + (E[β̃] − β)² + 2 × 0
        = Var(β̃) + Bias²(β̃)
- as the cross-product term 2 × E[(β̃ − E[β̃])(E[β̃] − β)] = 2 × constant × E[β̃ − E[β̃]] = 0.

Bias can reduce estimator MSE: a shrinkage example

Suppose the scalar estimator β̂ is unbiased for β with
- E[β̂] = β and Var[β̂] = v, so MSE(β̂) = v.
Consider the shrinkage estimator
- β̃ = aβ̂ where 0 ≤ a ≤ 1.
Bias: Bias(β̃) = E[β̃] − β = aβ − β = (a − 1)β.
Variance: Var[β̃] = Var[aβ̂] = a² Var(β̂) = a²v.
MSE(β̃) = Var[β̃] + Bias²(β̃) = a²v + (a − 1)²β².
MSE(β̃) < MSE(β̂) if β² < ((1 + a)/(1 − a)) v.
So MSE(β̃) is smaller than MSE(β̂) when β² is small relative to v.

Now consider prediction of y_0 = βx_0 + u where E[u] = 0
- using ỹ_0 = β̃x_0 where we treat the scalar x_0 as fixed.
Then the mean squared error of the prediction is
MSE(ỹ_0) = x_0² × MSE(β̃) + Var(u).
So bias in β̃ that reduces MSE(β̃) also reduces MSE(ỹ_0).

3.2 Shrinkage Methods

Shrinkage estimators minimize RSS (the residual sum of squares) with a penalty for model size
- this shrinks parameter estimates towards zero.
The extent of shrinkage is determined by a tuning parameter
- this is determined by cross-validation or e.g. AIC.
Ridge, LASSO and elastic net are not invariant to rescaling of the regressors, so first standardize
- so x_ij below is actually (x_ij − x̄_j)/s_j
- and demean y, so y_i below is actually y_i − ȳ
- x_i does not include an intercept, nor does the data matrix X
- we can recover the intercept β_0 as β̂_0 = ȳ.
So work with y = x'β + ε = β_1 x_1 + β_2 x_2 + ... + β_p x_p + ε.

3.3 Ridge Regression

The ridge estimator β̂_λ of β minimizes
Q_λ(β) = Σ_{i=1}^n (y_i − x_i'β)² + λ Σ_{j=1}^p β_j² = RSS + λ (||β||_2)²
- where λ ≥ 0 is a tuning parameter to be determined
- ||β||_2 = sqrt(Σ_{j=1}^p β_j²) is the L2 norm.
Equivalently the ridge estimator minimizes
Σ_{i=1}^n (y_i − x_i'β)² subject to Σ_{j=1}^p β_j² ≤ s.
The ridge estimator is
β̂_λ = (X'X + λI)^{-1} X'y.
Derivation: Q(β) = (y − Xβ)'(y − Xβ) + λβ'β
- ∂Q(β)/∂β = −2X'(y − Xβ) + 2λβ = 0
- ⇒ X'Xβ + λIβ = X'y
- ⇒ β̂_λ = (X'X + λI)^{-1} X'y.

More on Ridge

Features
- β̂_λ → β̂_OLS as λ → 0 and β̂_λ → 0 as λ → ∞
- best when many predictors are important with coefficients of similar size
- best when LS has high variance
- algorithms exist to quickly compute β̂_λ for many values of λ
- then choose λ by cross-validation.
Hoerl and Kennard (1970) proposed ridge as a way to reduce the MSE of β̂.
Ridge is the posterior mean for y ~ N(Xβ, σ²I) with prior β ~ N(0, γ²I)
- though γ is a specified prior parameter whereas λ is data-determined.
For a scalar regressor and no intercept, β̂_λ = a β̂_OLS where a = Σ_i x_i² / (Σ_i x_i² + λ)
- like the earlier example of β̃ = aβ̂.
A sketch that computes β̂_λ directly from the formula above follows.
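A minimal sketch of the ridge formula β̂_λ = (X'X + λI)^{-1}X'y applied to the standardized regressors x1-x3 and demeaned y; the value λ = 1 is an arbitrary assumption for illustration (in practice λ is chosen by cross-validation, e.g. with the ridgeregress command used later):

* Manual ridge at a fixed (assumed) lambda, after standardizing the x's and demeaning y
foreach v in x1 x2 x3 {
    quietly egen double z`v' = std(`v')
}
quietly summarize y
quietly generate double yd = y - r(mean)
matrix accum XX = zx1 zx2 zx3, noconstant          // X'X
matrix vecaccum yX = yd zx1 zx2 zx3, noconstant    // y'X (row vector)
scalar lambda = 1
matrix bridge = invsym(XX + lambda*I(3))*yX'
matrix list bridge                                 // ridge coefficients at this lambda

As λ → 0 this reproduces OLS on the standardized data, and as λ grows the coefficients shrink towards zero, as described above.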

3.4 LASSO (Least Absolute Shrinkage And Selection)

The LASSO estimator β̂_λ of β minimizes
Q_λ(β) = Σ_{i=1}^n (y_i − x_i'β)² + λ Σ_{j=1}^p |β_j| = RSS + λ ||β||_1
- where λ ≥ 0 is a tuning parameter to be determined
- ||β||_1 = Σ_{j=1}^p |β_j| is the L1 norm.
Equivalently the LASSO estimator minimizes
Σ_{i=1}^n (y_i − x_i'β)² subject to Σ_{j=1}^p |β_j| ≤ s.
Features
- best when a few regressors have β_j ≠ 0 and most β_j = 0
- leads to a more interpretable model than ridge.

LASSO versus Ridge (key figure from ISL)

LASSO is likely to set some coefficients to zero.

3.5 Elastic net

There are many extensions of LASSO. The elastic net combines ridge regression and LASSO with objective function
Q_{λ,α}(β) = Σ_{i=1}^n (y_i − x_i'β)² + λ Σ_{j=1}^p { α|β_j| + (1 − α)β_j² }.
- the ridge penalty averages correlated variables
- the LASSO penalty leads to sparsity.
Here I use the elasticregress package (Townsend 2018)
- ridgeregress (alpha = 0)
- lassoregress (alpha = 1)
- elasticregress.
K-fold cross-validation is used with default K = 10
- set the seed for replicability.
In part 3 I instead use the lassopack package.
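A minimal sketch of the ridge and elastic net commands from the same package; the numfolds() option is carried over from the lassoregress example on the next slide, and its availability for these two commands is an assumption to check against the installed help files:

* Ridge and elastic net via the user-written elasticregress package
set seed 10101
ridgeregress y x1 x2 x3, numfolds(5)       // alpha = 0: ridge
elasticregress y x1 x2 x3, numfolds(5)     // alpha chosen by cross-validation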

3.6 LASSO example

OLS was ŷ = 1.556 x1 + 0.471 x2 − 0.026 x3 + 2.532.
OLS on x1 and x2 is ŷ = 1.554 x1 + 0.468 x2 + 2.534.

. * LASSO with lambda determined by cross validation
. set seed 10101
. lassoregress y x1 x2 x3, numfolds(5)

LASSO regression                  Number of observations  =      40
                                  R-squared               =  0.2293
                                  alpha                   =  1.0000
                                  lambda                  =  0.2594
                                  Cross-validation MSE    =  9.3871
                                  Number of folds         =       5
                                  Number of lambda tested =     100

       y       Coef.
      x1      1.3505
      x2    .2834002
      x3           0
   _cons    2.621573

4. Dimension Reduction

Reduce from p regressors to M < p linear combinations of the regressors
- Form X* = XA where A is p × M and M < p
- y = β_0 + X*δ + u after dimension reduction
- y = β_0 + Xβ + u where β = Aδ.
Two methods mentioned in ISL
- 1. Principal components
  - use only X to form A (unsupervised)
- 2. Partial least squares
  - also use the relationship between y and X to form A (supervised)
  - I have not seen this used in practice.
For both, one should standardize the regressors as the methods are not scale invariant.
And often use cross-validation to determine M.

4.1 Principal Components Analysis (PCA)

Suppose X is normalized to have zero means, so the ij-th entry is x_ij − x̄_j.
The first principal component has the largest sample variance among all normalized linear combinations of the columns of the n × p matrix X
- the first component is Xh_1 where h_1 is p × 1
- normalize h_1 so that h_1'h_1 = 1
- then h_1 maximizes Var(Xh_1) = h_1'X'Xh_1 subject to h_1'h_1 = 1
- the maximum is the largest eigenvalue of X'X and h_1 is the corresponding eigenvector.
The second principal component has the largest variance subject to being orthogonal to the first, and so on.
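A minimal sketch of how the first principal component pc1 used in the next example might be constructed (pca and predict, score are official Stata; pca works off the correlation matrix by default, so the regressors are effectively standardized):

* First principal component of the three regressors
pca x1 x2 x3
predict double pc1, score        // score on the first principal component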

Principal Components Analysis Example

Compare the correlation coefficient from OLS on the first principal component (r = 0.4444) with OLS on all three regressors (r = 0.4871) and each single regressor.

. * Compare R from OLS on all three regressors, on pc1, on x1, on x2, on x3
. quietly regress y x1 x2 x3
. predict yhat
(option xb assumed; fitted values)
. correlate y yhat pc1 x1 x2 x3
(obs=40)

              y    yhat     pc1      x1      x2      x3
     y   1.0000
  yhat   0.4871  1.0000
   pc1   0.4219  0.8661  1.0000
    x1   0.4740  0.9732  0.8086  1.0000
    x2   0.3370  0.6919  0.7322  0.5077  1.0000
    x3   0.2046  0.4200  0.7824  0.4281  0.2786  1.0000

Principal Components Analysis (continued)

PCA is unsupervised so seems unrelated to y, but
- Elements of Statistical Learning says it does well in practice
- PCA has the smallest variance of any estimator that estimates the model y = Xβ + u with i.i.d. errors subject to the constraint Cβ = c where dim[C] ≤ dim[X]
- PCA discards the p − M smallest-eigenvalue components whereas ridge does not, though ridge does shrink towards zero the most for the smallest-eigenvalue components (ESL p.79).
Partial least squares is supervised
- partial least squares turns out to be similar to PCA
- especially if R² is low.

5. High-Dimensional Models

High-dimensional simply means p is large relative to n
- in particular p > n
- n could be large or small.
Problems with p > n:
- Cp, AIC, BIC and R² cannot be used
- due to multicollinearity we cannot identify the best model, just one of many good models
- cannot use regular OLS on the training set.
Solutions
- Forward stepwise, ridge, LASSO and PCA are useful in training
- Evaluate models using cross-validation or independent test data
  - using e.g. MSE or R².

6. Nonparametric and Semiparametric Regression
6.1 Nonparametric regression

Nonparametric regression is the most flexible approach
- but it is not practical for high p due to the curse of dimensionality.
Consider explaining y with a scalar regressor x
- we want f̂(x_0) for a range of values x_0.
With many observations with x_i = x_0 we would just use the average of y for those observations
- f̂(x_0) = (1/n_0) Σ_{i: x_i = x_0} y_i = Σ_{i=1}^n 1[x_i = x_0] y_i / Σ_{i=1}^n 1[x_i = x_0].
Rewrite this as
f̂(x_0) = Σ_{i=1}^n w(x_i, x_0) y_i, where w(x_i, x_0) = 1[x_i = x_0] / Σ_{j=1}^n 1[x_j = x_0].

Kernel-weighted local regression

In practice there are not many observations with x_i = x_0.
Nonparametric regression methods borrow from nearby observations
- k-nearest neighbors
  - average y_i for the k observations with x_i closest to x_0
- kernel-weighted local regression
  - use a weighted average of y_i with weights declining as |x_i − x_0| increases.
The original kernel regression estimate is
f̂(x_0) = Σ_{i=1}^n w(x_i, x_0, λ) y_i,
- where w(x_i, x_0, λ) = w((x_i − x_0)/λ) are kernel weights
- and λ is a bandwidth parameter to be determined.

Local constant and local linear regression

The local constant estimator is f̂(x_0) = α̂_0 where α̂_0 minimizes
Σ_{i=1}^n w(x_i, x_0, λ)(y_i − α_0)²
- this yields α̂_0 = Σ_{i=1}^n w(x_i, x_0, λ) y_i.
The local linear estimator is f̂(x_0) = α̂_0 where α̂_0 and β̂_0 minimize
Σ_{i=1}^n w(x_i, x_0, λ){y_i − α_0 − β_0(x_i − x_0)}².
Can generalize to local maximum likelihood that maximizes over θ_0
Σ_{i=1}^n w(x_i, x_0, λ) ln f(y_i, x_i, θ_0).
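A minimal sketch contrasting the local constant and local linear estimators using official Stata's lpoly; the variables y and z are those of the example on the next slide, and the 50-point grid is an assumption:

* Local constant (degree 0) versus local linear (degree 1) kernel-weighted fits
lpoly y z, degree(0) nograph n(50) generate(zgrid f0)   // local constant at 50 grid points
lpoly y z, degree(1) nograph at(zgrid) generate(f1)     // local linear at the same grid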

Local linear example

lpoly y z, degree(1)

[Figure: local polynomial smooth of y on z; kernel = epanechnikov, degree = 1, bandwidth = .49]

6.2 Curse of Dimensionality

Nonparametric methods do not extend well to multiple regressors.
Consider p-dimensional x broken into bins
- for p = 1 we might average y in each of 10 bins of x
- for p = 2 we may need to average over 10² = 100 bins of (x1, x2)
- and so on.
On average there may be few to no points with high-dimensional x_i close to x_0
- this is called the curse of dimensionality.
Formally, for local constant kernel regression with bandwidth λ
- the bias is O(λ²) and the variance is O(1/(nλ^p))
- the optimal bandwidth is O(n^{-1/(p+4)})
  - this gives asymptotic bias, so standard confidence intervals are not properly centered
- the convergence rate is then n^{2/(p+4)} << n^{0.5}.

6.3 Semiparametric Models

Semiparametric models provide some structure to reduce the nonparametric component from K dimensions to fewer dimensions.
- Econometricians focus on partially linear models and on single-index models.
- Statisticians use generalized additive models and projection pursuit regression.
Machine learning methods can outperform nonparametric and semiparametric methods
- so wherever econometricians use nonparametric and semiparametric regression in higher-dimensional models it may be useful to use ML methods.

Partially linear model

A partially linear model specifies
y_i = f(x_i, z_i) + u_i = x_i'β + g(z_i) + u_i
- in the simplest case z (or x) is scalar, but they could be vectors
- the nonparametric component is of the dimension of z.
The differencing estimator of Robinson (1988) provides a root-n consistent, asymptotically normal β̂ as follows
- E[y|z] = E[x|z]'β + g(z) as E[u|z] = 0 given E[u|x,z] = 0
- y − E[y|z] = (x − E[x|z])'β + u by subtracting
- so OLS estimate y − m̂_y = (x − m̂_x)'β + error.
Robinson proposed nonparametric kernel regression of y on z for m̂_y and of x on z for m̂_x (a sketch follows below)
- recent econometrics articles instead use a machine learner such as LASSO
- in general we need m̂ to converge at rate at least n^{1/4}.
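A minimal sketch of Robinson's differencing estimator for scalar x1 and scalar z (hypothetical variable names), with the conditional means estimated by kernel-weighted local linear regression via lpoly:

* Robinson (1988) partially linear model: y = x1*beta + g(z) + u
lpoly y z, degree(1) nograph at(z) generate(my)     // estimate of E[y|z] at the sample points
lpoly x1 z, degree(1) nograph at(z) generate(mx)    // estimate of E[x1|z]
generate double ytil = y - my
generate double xtil = x1 - mx
regress ytil xtil, noconstant vce(robust)           // beta-hat from the differenced regression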

Single-index model

Single-index models specify
f(x_i) = g(x_i'β)
- with g(·) determined nonparametrically
- this reduces the nonparametrics to one dimension.
We can obtain β̂ root-n consistent and asymptotically normal
- provided the nonparametric ĝ(·) converges at rate n^{1/4}.
The recent economics ML literature has instead focused on the partially linear model.

Generalized additive models and projection pursuit

Generalized additive models specify f(x) as a linear combination of scalar functions
f(x_i) = α + Σ_{j=1}^p f_j(x_ij)
- where x_j is the j-th regressor and f_j(·) is (usually) determined by the data
- the advantage is interpretability (due to each regressor appearing additively)
- can make it more nonlinear by including interactions such as x_i1 × x_i2 as a separate regressor.

Projection pursuit regression is additive in linear combinations of the x's
f(x_i) = Σ_{m=1}^M g_m(x_i'ω_m)
- additive in the derived features x'ω_m rather than in the x_j's
- the g_m(·) functions are unspecified and nonparametrically estimated
- this is a multi-index model, with the case M = 1 being a single-index model.

How can machine learning methods do better?

In theory there is scope for improving nonparametric methods.
k-nearest neighbors usually uses a fixed number of neighbors
- but it may be better to vary the number of neighbors with data sparsity.
Kernel-weighted local regression methods usually use a fixed bandwidth
- but it may be better to vary the bandwidth with data sparsity.
There may be an advantage to basing the neighbors in part on the relationship with y.

7. Flexible Regression

Basis function models
- global polynomial regression
- splines: step functions, regression splines, smoothing splines
- wavelets
- polynomial regression is global while the others break the range of x into pieces.
Other methods
- neural networks.

7.1 Basis Functions

Also called series expansions and sieves.
General approach (scalar x for simplicity)
y_i = β_0 + β_1 b_1(x_i) + ... + β_K b_K(x_i) + ε_i
- where b_1(·), ..., b_K(·) are basis functions that are fixed and known.
Global polynomial regression sets b_j(x_i) = x_i^j
- typically K = 3 or K = 4
- fits globally and can overfit at the boundaries.
Step functions: separately fit y in each interval x ∈ (c_j, c_{j+1})
- could be piecewise constant or piecewise polynomial.
Splines smooth so that the fit is not discontinuous at the cut points.
Wavelets are also basis functions, richer than Fourier series.

7.2 Regression Splines

The cubic regression spline is the standard.
Piecewise cubic model with K knots
- require f(x), f'(x) and f''(x) to be continuous at the K knots.
Then can do OLS with
f(x) = β_0 + β_1 x + β_2 x² + β_3 x³ + β_4 (x − c_1)_+³ + ... + β_{3+K} (x − c_K)_+³
- where z_+ = z if z > 0 and z_+ = 0 otherwise.
This is the lowest-degree regression spline where the graph of f̂(x) on x seems smooth and continuous to the naked eye.
There is no real benefit to a higher-order spline.
Regression splines overfit at the boundaries.
- A natural or restricted cubic spline is an adaptation that restricts the relationship to be linear past the lower and upper boundaries of the data.

Spline Example

Natural or restricted cubic spline with five knots at the 5, 27.5, 50, 72.5 and 95 percentiles
- mkspline zspline = z, cubic nknots(5) displayknots
- regress y zspline*

[Figure: natural cubic spline fit, y = a + f(z) + u, of f(z) against z]

7.3 Wavelets

Wavelets are used especially for signal processing and extraction
- they are richer than a Fourier series basis
- they can handle both smooth sections and bumpy sections of a series
- they are not used in cross-section econometrics but may be useful for some time series.
Start with a mother or father function ψ(x)
- an example is the Haar function ψ(x) = 1 for 0 ≤ x < 1/2, ψ(x) = −1 for 1/2 ≤ x < 1, and ψ(x) = 0 otherwise.
Then both translate by b and scale by a to give basis functions
ψ_{a,b}(x) = |a|^{-1/2} ψ((x − b)/a).

7.4 Neural Networks

A neural network is a richer model for f(x_i) than projection pursuit
- but unlike projection pursuit all functions are specified
- only the parameters need to be estimated.
A neural network involves a series of nested logit regressions.
A single-hidden-layer neural network explaining y by x has
- y depends on the z's (a hidden layer)
- the z's depend on the x's.
A neural network with two hidden layers explaining y by x has
- y depends on the w's (a hidden layer)
- the w's depend on the z's (a hidden layer)
- the z's depend on the x's.

Two-layer neural network

y depends on M z's and the z's depend on the p x's
f(x) = β_0 + z'β
z_m = 1 / (1 + exp[−(α_0m + x'α_m)]),  m = 1, ..., M.
More generally we may use
f(x) = h(T), where usually h(T) = T
T = β_0 + z'β
z_m = g(α_0m + x'α_m), where the usual choice is g(v) = 1/(1 + e^{−v}).
This yields the nonlinear model
f(x_i) = β_0 + Σ_{m=1}^M β_m × 1/(1 + exp[−(α_0m + x_i'α_m)]).

Neural Networks (continued)

Neural nets are good for prediction
- especially in speech recognition (Google Translate), image recognition, ...
- but very difficult (impossible) to interpret.
They require a lot of fine tuning - not off-the-shelf
- we need to determine the number of hidden layers, the number M of hidden units within each layer, and estimate the α's, β's, ....
Minimize the sum of squared residuals, but a penalty on the α's is needed to avoid overfitting
- since a penalty is introduced, standardize the x's to (0,1)
- best to have too many hidden units and then avoid overfit using the penalty
- initially back propagation was used
- now use gradient methods with different starting values and average the results, or use bagging.
Deep learning uses nonlinear transformations such as neural networks
- deep nets are an improvement on the original neural networks.

Neural Networks Example

This example uses the user-written Stata command brain (Doherr)

. * Example from help file for user-written brain command
. clear
. set obs 200
number of observations (_N) was 0, now 200
. gen x = 4*_pi/200*_n
. gen y = sin(x)
. brain define, input(x) output(y) hidden(20)
Defined matrices:
  input[4,1]
  output[4,1]
  neuron[1,22]
  layer[1,3]
  brain[1,61]
. quietly brain train, iter(500) eta(2)
. brain think ybrain
. sort x
. twoway (scatter y x) (lfit y x) (line ybrain x)

Neural Networks Example (continued)

We obtain

[Figure: scatter of y against x with the linear fit and the neural network prediction ybrain]

This figure from ESL is for classification with K categories.

8. Regression Trees and Random Forests: Overview

Regression trees sequentially split regressors x into regions that best predict y
- e.g., the first split is income < or > $12,000; the second split is on gender if income > $12,000; the third split is income < or > $30,000 (if female and income > $12,000).
Trees do not predict well
- due to high variance
- e.g. split the data in two and you can get quite different trees
- e.g. the first split determines future splits (a greedy method).
Better methods are then given
- bagging (bootstrap averaging) computes regression trees for different samples obtained by bootstrap and averages the predictions
- random forests use only a subset of the predictors in each bootstrap sample
- boosting grows trees based on the residuals from the previous stage
- bagging and boosting are general methods (not just for trees).

8.1 Regression Trees

Regression trees
- sequentially split the x's into rectangular regions in a way that reduces RSS
- then ŷ_i is the average of the y's in the region that x_i falls in
- with J blocks, RSS = Σ_{j=1}^J Σ_{i ∈ R_j} (y_i − ȳ_{R_j})².
Need to determine both the regressor j to split and the split point s (a brute-force sketch for a single regressor follows below).
- For any regressor j and split point s, define the pair of half-planes
  R_1(j, s) = {X | X_j < s} and R_2(j, s) = {X | X_j ≥ s}.
- Find the values of j and s that minimize
  Σ_{i: x_i ∈ R_1(j,s)} (y_i − ȳ_{R_1})² + Σ_{i: x_i ∈ R_2(j,s)} (y_i − ȳ_{R_2})²
  where ȳ_{R_1} is the mean of y in region R_1 (and similarly for R_2).
- Once this first split is found, split both R_1 and R_2 and repeat.
- Each split is the one that reduces RSS the most.
- Stop when e.g. there are fewer than five observations in each region.
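A minimal sketch of the first split for a single regressor x1 (assumed variable names), searching over the observed values of x1 for the split point s that minimizes the two-region RSS:

* Brute-force search for the best first split point of y on x1
scalar bests = .
quietly summarize y
scalar bestrss = r(Var)*(r(N)-1)                     // RSS with no split
levelsof x1, local(svals)
foreach s of local svals {
    quietly summarize y if x1 < `s'
    scalar rss1 = cond(r(N) > 1, r(Var)*(r(N)-1), 0)
    quietly summarize y if x1 >= `s'
    scalar rss2 = cond(r(N) > 1, r(Var)*(r(N)-1), 0)
    if rss1 + rss2 < bestrss {
        scalar bestrss = rss1 + rss2
        scalar bests = `s'
    }
}
display "best first split: x1 < " bests "   (RSS = " bestrss ")"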

Tree example from ISL page 308

(1) split X1 in two; (2) split the lowest X1 values on the basis of X2 into R1 and R2; (3) split the highest X1 values into two regions (R3 and R4/R5); (4) split the highest X1 values on the basis of X2 into R4 and R5.

Tree example from ISL (continued)

The left figure gives the tree. The right figure shows the predicted values of y.

Regression tree (continued)

The model is of the form f̂(x) = Σ_{j=1}^J ŷ_{R_j} × 1[x ∈ R_j], where ŷ_{R_j} is the mean of y in region R_j.
The approach is a top-down greedy approach
- top-down as it starts with the top of the tree
- greedy as at each step the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
This leads to overfitting, so prune
- use cost-complexity pruning (or weakest-link pruning)
- this penalizes for having too many terminal nodes
- see ISL equation (8.4).

Tree as alternative to k-NN or kernel regression

Figure from Athey and Imbens (2019), "Machine Learning Methods Economists Should Know About"
- the axes are x1 and x2
- note that the tree used the explanation of y in determining neighbors
- the tree may not do so well near the boundaries of the region
  - random forests form many trees, so a point is not always near a boundary.

Improvements to regression trees

Regression trees are easy to understand if there are few regressors.
But they do not predict as well as the methods given so far
- due to high variance (e.g. split the data in two and you can get quite different trees).
Better methods are given next
- bagging
  - bootstrap aggregating averages regression trees over many samples
- random forests
  - averages regression trees over many sub-samples
- boosting
  - trees build on preceding trees.

8.2 Bagging (Bootstrap Aggregating)

Bagging is a general method for improving prediction that works especially well for regression trees.
The idea is that averaging reduces variance.
So average regression trees over many samples
- the different samples are obtained by bootstrap resampling with replacement (so they are not completely independent of each other)
- for each sample obtain a large tree and prediction f̂_b(x)
- average all these predictions: f̂_bag(x) = (1/B) Σ_{b=1}^B f̂_b(x).
Get the test sample error by using the out-of-bag (OOB) observations not in the bootstrap sample
- Pr[i-th obs not in resample] = (1 − 1/n)^n → e^{-1} ≈ 0.368 ≈ 1/3
- this replaces cross-validation.
Interpretation of trees is now difficult, so
- record the total amount that RSS is decreased due to splits over a given predictor, averaged over all B trees
- a large value indicates an important predictor.

8.3 Random Forests

The B bagging estimates are correlated
- e.g. if a regressor is important it will appear near the top of the tree in each bootstrap sample
- the trees look similar from one resample to the next.
Random forests get bootstrap resamples (like bagging)
- but within each bootstrap sample use only a random sample of m < p predictors in deciding each split
- usually m ≈ √p
- this reduces the correlation across bootstrap resamples.
Simple bagging is a random forest with m = p.
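A minimal sketch using the user-written rforest package (Schonlau and Zou); the package name and the option names type(), iterations() and numvars() are assumptions taken from its documentation and should be checked against the installed version:

* Random forest regression: B trees, m = numvars() candidate regressors per split
set seed 10101
rforest y x1 x2 x3, type(reg) iterations(200) numvars(1)   // m close to sqrt(p) for p = 3
predict yh_rf_sketch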

Random Forests (continued)

Random forests are related to kernel regression and k-nearest neighbors
- as they use a weighted average of nearby observations
- but with a data-driven way of determining which nearby observations get weight
- see Lin and Jeon (JASA, 2006).
Susan Athey and coauthors are big on random forests.

8.4 Boosting

Boosting is also a general method for improving prediction.
Regression trees use a greedy algorithm.
Boosting uses a slower algorithm to generate a sequence of trees
- each tree is grown using information from previously grown trees
- and is fit on a modified version of the original data set
- boosting does not involve bootstrap sampling.
Specifically (with λ a penalty parameter)
- given the current model b, fit a decision tree to model b's residuals (rather than the outcome y)
- then update f̂(x) = previous f̂(x) + λ f̂^b(x)
- then update the residuals r_i = previous r_i − λ f̂^b(x_i)
- the boosted model is f̂(x) = Σ_{b=1}^B λ f̂^b(x).
The Stata add-on boost includes the file boost64.dll that needs to be manually copied into c:\ado\plus\.

Comparison of in-sample predictions

. * Compare various predictions in sample
. quietly regress ltotexp $zlist
. predict yh_ols
(option xb assumed; fitted values)
. summarize ltotexp yh*

    Variable    Obs       Mean   Std. Dev.        Min        Max
     ltotexp  2,955   8.059866   1.367592    1.098612   11.74094
       yh_rf  2,955   8.060393   .7232039     5.29125   10.39143
  yh_rf_half  2,955   8.053512   .7061742    5.540289   9.797389
    yh_boost  2,955   7.667966   .4974654    5.045572   8.571422
      yh_ols  2,955   8.059866    .654323    6.866516   10.53811

. correlate ltotexp yh*
(obs=2,955)

              ltotexp    yh_rf  yh_rf_~f  yh_boost   yh_ols
    ltotexp    1.0000
      yh_rf    0.7377   1.0000
 yh_rf_half    0.6178   0.9212    1.0000
   yh_boost    0.5423   0.8769    0.8381    1.0000
     yh_ols    0.4784   0.8615    0.8580    0.8666   1.0000

9. Classification: Overview

The y's are now categorical
- example: binary if two categories.
Interest lies in predicting y using ŷ (classification)
- whereas economists usually want Pr[y = j | x].
Use a (0,1) loss function rather than MSE or ln L
- 0 if correct classification
- 1 if misclassified.
Many machine learning applications are in settings where one can classify well
- e.g. reading car license plates
- unlike many economics applications.

Classification: Overview (continued)

Regression methods predict probabilities
- logistic regression, multinomial regression, k-nearest neighbors
- assign to the class with the highest predicted probability (Bayes classifier)
  - in the binary case ŷ = 1 if p̂ ≥ 0.5 and ŷ = 0 if p̂ < 0.5.
Discriminant analysis additionally assumes a normal distribution for the x's
- use Bayes theorem to get Pr[Y = k | X = x].
Support vector classifiers and support vector machines
- directly classify (no probabilities)
- are more nonlinear so may classify better
- use separating hyperplanes of X and extensions.
Regression trees, bagging, random forests and boosting can also be used for categorical data.

9.1 A Different Loss Function: Error Rate

Instead of MSE we use the error rate
- the number of misclassifications
Error rate = (1/n) Σ_{i=1}^n 1[y_i ≠ ŷ_i],
  - where for K categories y_i = 0, ..., K−1 and ŷ_i = 0, ..., K−1
  - and the indicator 1[A] = 1 if event A happens and = 0 otherwise.
The test error rate is for the n_0 observations in the test sample
Ave(1[y_0 ≠ ŷ_0]) = (1/n_0) Σ_{i=1}^{n_0} 1[y_{0i} ≠ ŷ_{0i}].
Cross-validation uses the number of misclassified observations, e.g. LOOCV is
CV_(n) = (1/n) Σ_{i=1}^n Err_i = (1/n) Σ_{i=1}^n 1[y_i ≠ ŷ_(−i)].

9.2 Logit

Directly model p(x) = Pr[y | x].
Logistic (logit) regression for the binary case obtains the MLE for
ln[ p(x) / (1 − p(x)) ] = x'β.
The logit model is a linear (in x) classifier
- ŷ = 1 if p̂(x) > 0.5
- i.e. if x'β̂ > 0.
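A minimal sketch of logit used as a classifier; a binary outcome d and regressors x1-x3 are assumed variable names:

* Logit classification with the 0.5 probability cutoff
logit d x1 x2 x3
predict double phat, pr                         // predicted probability Pr[d = 1 | x]
generate byte dhat = phat >= 0.5 if phat < .    // Bayes classifier
tabulate d dhat                                 // confusion matrix; off-diagonal share = error rate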

9.3 Linear and Quadratic Discriminant Analysis

Developed for classification problems such as: is a skull Neanderthal or Homo Sapiens, given various measures of the skull?
Discriminant analysis specifies a joint distribution for (Y, X).
Linear discriminant analysis with K categories
- assume X | Y = k is N(µ_k, Σ) with density f_k(x) = Pr[X = x | Y = k]
- and let π_k = Pr[Y = k].
Upon simplification, assignment to the class k with the largest Pr[Y = k | X = x] is equivalent to choosing the model with the largest discriminant function
δ_k(x) = x'Σ^{-1}µ_k − (1/2)µ_k'Σ^{-1}µ_k + ln π_k
- use µ̂_k = x̄_k, Σ̂ = the sample Var[x_k], and π̂_k = (1/N) Σ_{i=1}^N 1[y_i = k].
Called linear discriminant analysis as δ_k(x) is linear in x.
Quadratic discriminant analysis assumes X | Y = k is N(µ_k, Σ_k).

ISL Figure 4.9: Linear and Quadratic Boundaries

LDA uses a linear boundary to classify and QDA a quadratic.

9.4 Support Vector Classifier

Build on the LDA idea of a linear boundary to classify when K = 2.
Maximal margin classifier
- classify using a separating hyperplane (linear combination of X)
- if perfect classification is possible then there are an infinite number of such hyperplanes
- so use the separating hyperplane that is furthest from the training observations
- this distance is called the maximal margin.
Support vector classifier
- generalize the maximal margin classifier to the nonseparable case
- this adds slack variables to allow some y's to be on the wrong side of the margin
- Max_{β,ε} M (the margin - the distance from the separator to the training X's) subject to β'β = 1, y_i(β_0 + x_i'β) ≥ M(1 − ε_i), ε_i ≥ 0 and Σ_{i=1}^n ε_i ≤ C.

Support Vector Machines

The support vector classifier has a linear boundary
- f̂(x_0) = β_0 + Σ_{i=1}^n α_i x_0'x_i, where x_0'x_i = Σ_{j=1}^p x_{0j} x_{ij}.
The support vector machine has nonlinear boundaries
- f̂(x_0) = β_0 + Σ_{i=1}^n α_i K(x_0, x_i) where K(·) is a kernel
- polynomial kernel: K(x_0, x_i) = (1 + Σ_{j=1}^p x_{0j} x_{ij})^d
- radial kernel: K(x_0, x_i) = exp(−γ Σ_{j=1}^p (x_{0j} − x_{ij})²).
Can extend to K > 2 classes (see ISL ch. 9.4)
- one-versus-one or all-pairs approach
- one-versus-all approach.

ISL Figure 9.9: Support Vector Machine

In this example a linear or quadratic classifier won't work whereas SVM does.

Comparison of model predictions

The following compares the various category predictions.
SVM does best, but we did in-sample predictions here
- especially for SVM we should have training and test samples.

. * Compare various in-sample predictions
. correlate suppins yh_logit yh_knn yh_lda yh_qda yh_svm
(obs=3,064)

            suppins  yh_logit   yh_knn   yh_lda   yh_qda   yh_svm
  suppins    1.0000
 yh_logit    0.2505    1.0000
   yh_knn    0.3604    0.3575   1.0000
   yh_lda    0.2395    0.6955   0.3776   1.0000
   yh_qda    0.2294    0.6926   0.2762   0.5850   1.0000
   yh_svm    0.5344    0.3966   0.6011   0.3941   0.3206   1.0000

10. Unsupervised Learning

Challenging area: no y, only x.
An example is determining several types of individual based on responses to many psychological questions.
Principal components analysis
- already presented earlier.
Clustering methods
- k-means clustering
- hierarchical clustering.

10.1 Cluster Analysis: k-Means Clustering

The goal is to find homogeneous subgroups among the X.
K-means splits the data into K distinct clusters where the within-cluster variation is minimized.
Let W(C_k) be a measure of within-cluster variation
- minimize over C_1, ..., C_K the sum Σ_{k=1}^K W(C_k)
- with Euclidean distance, W(C_k) = (1/n_k) Σ_{i,i' ∈ C_k} Σ_{j=1}^p (x_ij − x_i'j)².
The global optimum requires considering K^n partitions. Instead use algorithm 10.1 (ISL p.388), which finds a local optimum
- run the algorithm multiple times with different seeds
- choose the optimum with the smallest Σ_{k=1}^K W(C_k).
A sketch using Stata's cluster kmeans command follows.
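A minimal sketch using official Stata's cluster kmeans; the regressors x1-x3 and K = 3 are assumptions for illustration:

* k-means clustering into K = 3 groups using Euclidean (L2) distance
set seed 10101
cluster kmeans x1 x2 x3, k(3) measure(L2) name(grp3)
tabulate grp3                    // cluster sizes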

ISL Figure 10.5

The data are (x1, x2) with K = 2, 3 and 4 clusters identified.

Hierarchical Clustering

Do not specify K.
Instead begin with n clusters (leaves) and combine clusters into branches up towards the trunk
- represented by a dendrogram
- eyeball it to decide the number of clusters.
Need a dissimilarity measure between clusters
- four types of linkage: complete, average, single and centroid.
For any clustering method
- it is a difficult problem to do unsupervised learning
- results can change a lot with small changes in method
- clustering on subsets of the data can provide a sense of robustness.

11. Conclusions

Guard against overfitting
- use K-fold cross-validation or penalty measures such as AIC.
Biased estimators can be better predictors
- shrinkage towards zero, such as ridge and LASSO.
For flexible models popular choices are
- neural nets
- random forests.
Though which method is best varies with the application
- and best are ensemble forecasts that combine different methods.
Machine learning methods can outperform nonparametric and semiparametric methods
- so wherever econometricians use nonparametric and semiparametric regression in higher-dimensional models it may be useful to use ML methods
- though the underlying theory still relies on assumptions such as sparsity.

12. References

Undergraduate / Masters level book
- ISL: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013), An Introduction to Statistical Learning: with Applications in R, Springer.
- free legal pdf at http://www-bcf.usc.edu/~gareth/ISL/
- $25 hardcopy via http://www.springer.com/gp/products/books/mycopy
Masters / PhD level book
- ESL: Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
- free legal pdf at http://statweb.stanford.edu/~tibs/ElemStatLearn/index.html
- $25 hardcopy via http://www.springer.com/gp/products/books/mycopy

References (continued)

A recent book is
- EH: Bradley Efron and Trevor Hastie (2016), Computer Age Statistical Inference: Algorithms, Evidence and Data Science, Cambridge University Press.
Interesting book: Cathy O'Neil (2016), Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.
My website has some material (including these slides)
- http://cameron.econ.ucdavis.edu/e240f/machinelearning.html
