Notes on Statistical Learning

John I. Marden

Copyright 2006

Contents

1 Introduction

2 Linear models
  2.1 Good predictions: Squared error loss and in-sample error
  2.2 Matrices and least-squares estimates
  2.3 Mean vectors and covariance matrices
  2.4 Prediction using least-squares
  2.5 Subset selection and Mallows' Cp
      2.5.1 Estimating the in-sample errors
      2.5.2 Finding the best subset
      2.5.3 Using R
  2.6 Regularization: Ridge regression
      2.6.1 Estimating the in-sample errors
      2.6.2 Finding the best λ
      2.6.3 Using R
  2.7 Lasso
      2.7.1 Estimating the in-sample errors
      2.7.2 Finding the best λ
      2.7.3 Using R

3 Linear Predictors of Non-linear Functions
  3.1 Polynomials
      3.1.1 Leave-one-out cross-validation
      3.1.2 Using R
      3.1.3 The cross-validation estimate
  3.2 Sines and cosines
      3.2.1 Estimating σ_e^2
      3.2.2 Cross-validation
      3.2.3 Using R
  3.3 Local fitting: Regression splines
      3.3.1 Using R
  3.4 Smoothing splines


      3.4.1 Using R
      3.4.2 An interesting result
  3.5 A glimpse of wavelets
      3.5.1 Haar wavelets
      3.5.2 An example of another set of wavelets
      3.5.3 Example using R
      3.5.4 Remarks

4 Model-based Classification
  4.1 The multivariate normal distribution and linear discrimination
      4.1.1 Finding the joint MLE
      4.1.2 Using R
      4.1.3 Maximizing over Σ
  4.2 Quadratic discrimination
      4.2.1 Using R
  4.3 The Akaike Information Criterion (AIC)
      4.3.1 Bayes Information Criterion (BIC)
      4.3.2 Example: Iris data
      4.3.3 Hypothesis testing
  4.4 Other exponential families
  4.5 Conditioning on X: Logistic regression

Chapter 1

Introduction

These notes are based on a course in statistical learning using the text The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (2001, first edition). Hence, everything throughout these pages implicitly uses that book as a reference. So keep a copy handy! But everything here is my own interpretation.

What is machine learning?

In artificial intelligence, machine learning involves some kind of machine (robot, computer) that modifies its behavior based on experience. For example, if a robot falls down every time it comes to a stairway, it will learn to avoid stairways. E-mail programs often learn to distinguish spam from regular e-mail. In statistics, machine learning uses statistical data to learn. Generally, there are two categories:

Supervised learning: The data consists of example (y, x)'s, the training data. The machine is a function built based on the data that takes in a new x and produces a guess of the corresponding y. It is prediction if the y's are continuous, and classification or categorization if the y's are categories.

Unsupervised learning is clustering. The data consists of example x's, and the machine is a function that groups the x's into clusters.

What is data mining?

Looking for relationships in large data sets. Observations are “baskets” of items. The goal is to see what items are associated with other items, or which items’ presence implies the presence of other items. For example, at Walmart, one may realize that people who buy socks also buy beer. Then Walmart would be smart to put some beer cases near the socks, or vice versa. Or if the government is spying on everyone’s e-mails, certain words (which I better not say) found together might cause the writer to be sent to Guantanamo.


The difference for a statistician between supervised machine learning and regular data analysis is that in machine learning, the statistician does not care about the estimates of parameters nor hypothesis tests nor which models fit best. Rather, the focus is on finding some function that does a good job of predicting y from x. Estimating parameters, fitting models, etc., may indeed be important parts of developing the function, but they are not the objective.

Chapter 2

Linear models

To ease into machine learning, we start with regular linear models. There is one dependent variable, the y, and p explanatory variables, the x's. The data, or training sample, consists of N independent observations:

(y1, x1), (y2, x2),..., (yN , xN ). (2.1)

For individual i, yi is the value of the one-dimensional dependent variable, and

x_i = (x_{i1}, x_{i2}, ..., x_{ip})′   (2.2)

is the p × 1 vector of values for the explanatory variables. Generally, the y_i's are continuous, but the x_{ij}'s can be anything numerical, e.g., 0-1 indicator variables, or functions of another variable (e.g., x, x^2, x^3). The linear model is

y_i = β_0 + β_1 x_{i1} + ··· + β_p x_{ip} + e_i.   (2.3)

The β_j's are parameters, usually unknown and to be estimated. The e_i's are the errors or residuals. We will assume that

• The e_i's are independent (of each other, and of the x_i's);

• E[e_i] = 0 for each i;

• Var[e_i] = σ_e^2 for each i.

There is also a good chance we will assume they are normally distributed. From STAT 424 and 425 (or other courses), you know what to do now: estimate the β_j's and σ_e^2, decide which β_j's are significant, do F-tests, look for outliers and other violations of the assumptions, etc.


Here, we may do much of that, but with the goal of prediction. Suppose (y^New, x^New) is a new point, satisfying the same model and assumptions as above (in particular, being independent of the observed x_i's). Once we have the estimates of the β_j's (based on the observed data), we predict y^New from x^New by

ŷ^New = β̂_0 + β̂_1 x_1^New + ··· + β̂_p x_p^New.   (2.4)

The prediction is good if ŷ^New is close to y^New. We do not know y^New, but we can hope. But the key point is

The estimates of the parameters are good if they give good predictions.

We don't care if the β̂_j's are close to the β_j's; we don't care about unbiasedness or minimum variance or significance. We just care whether we get good predictions.

2.1 Good predictions: Squared error loss and in-sample error

We want the predictions to be close to the actual (unobserved) value of the dependent variable, that is, we want ŷ^New close to y^New. One way to measure closeness is by using squared error:

(y^New − ŷ^New)^2.   (2.5)

Because we do not know y^New (yet), we might look at the expected value instead:

E[(Y^New − Ŷ^New)^2].   (2.6)

But what is that expected value over? Certainly Y^New, but the Y_i's and X_i's in the sample, as well as the X^New, could all be considered random. There is no universal answer, but for our purposes here we will assume that the x_i's are fixed, and all the Y_i's are random. The next question is what to do about x^New. If you have a particular x^New in mind, then use that:

E[(Y^New − Ŷ^New)^2 | X_1 = x_1, ..., X_N = x_N, X^New = x^New].   (2.7)

But typically you are creating a predictor for many new x's, and likely you do not know what they will be. (You don't know what the next 1000 e-mails you get will be.) A reasonable approach is to assume the new x's will look much like the old ones, hence you would look at the errors for N new x_i's being the same as the old ones. Thus we would have N new cases, (y_i^New, x_i^New), but where x_i^New = x_i. The N expected errors are averaged, to obtain what is called the in-sample error:

ERR_in = (1/N) Σ_{i=1}^N E[(Y_i^New − Ŷ_i^New)^2 | X_1 = x_1, ..., X_N = x_N, X_i^New = x_i].   (2.8)

In particular situations, you may have a more precise knowledge of what the new x's would be. By all means, use those values. We will drop the conditional part of the notation for simplicity.

2.2 Matrices and least-squares estimates

Ultimately we want to find estimates of the parameters that yield a low ERR_in. We'll start with the least squares estimate, then translate things to matrices. The estimates of the β_j's depend on just the training sample. The least squares estimates of the parameters are the b_j's that minimize the objective function

obj(b_0, ..., b_p) = Σ_{i=1}^N (y_i − b_0 − b_1 x_{i1} − ··· − b_p x_{ip})^2.   (2.9)

The function is a nice convex function in the bj’s, so setting the derivatives equal to zero and solving will yield the minimum. The derivatives are

∂/∂b_0 obj(b_0, ..., b_p) = −2 Σ_{i=1}^N (y_i − b_0 − b_1 x_{i1} − ··· − b_p x_{ip});

∂/∂b_j obj(b_0, ..., b_p) = −2 Σ_{i=1}^N x_{ij} (y_i − b_0 − b_1 x_{i1} − ··· − b_p x_{ip}),   j ≥ 1.   (2.10)

Write the equations in matrix form, starting with

( y_1 − b_0 − b_1 x_{11} − ··· − b_p x_{1p} )     ( y_1 )     ( 1  x_{11}  x_{12}  ···  x_{1p} ) ( b_0 )
( y_2 − b_0 − b_1 x_{21} − ··· − b_p x_{2p} )  =  ( y_2 )  −  ( 1  x_{21}  x_{22}  ···  x_{2p} ) ( b_1 )
(                    ⋮                      )     (  ⋮  )     ( ⋮    ⋮       ⋮            ⋮   ) (  ⋮  )
( y_N − b_0 − b_1 x_{N1} − ··· − b_p x_{Np} )     ( y_N )     ( 1  x_{N1}  x_{N2}  ···  x_{Np} ) ( b_p )

≡ y − Xb.   (2.11)

Take the two summations in equations (2.10) (without the −2's) and set to 0 to get

(1, ···, 1)(y − Xb) = 0;
(x_{1j}, ···, x_{Nj})(y − Xb) = 0,   j ≥ 1.   (2.12)

Note that the vectors in (2.12) on the left are the columns of X in (2.11), yielding

X′(y − Xb) = 0.   (2.13)

That equation is easily solved:

X′y = X′Xb   ⟹   b = (X′X)^{−1} X′y,   (2.14)

at least if X′X is invertible. If it is not invertible, then there will be many solutions. In practice, one can always eliminate some (appropriate) columns of X to obtain invertibility. Generalized inverses are available, too.

Summary. In the linear model

Y = Xβ + e,   (2.15)

where

β = (β_0, β_1, ..., β_p)′   and   e = (e_1, e_2, ..., e_N)′,   (2.16)

the least squares estimate of β, assuming X′X is invertible, is

β̂_LS = (X′X)^{−1} X′y.   (2.17)
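As a quick illustration of (2.17), the following sketch computes the least squares estimate directly from the normal equations and checks it against R's lm. The simulated data and names here are ours, not from the text; they are only for illustration.

# A minimal sketch of (2.17) on simulated data (made up for illustration).
set.seed(1)
N <- 50; p <- 3
X <- cbind(1, matrix(rnorm(N*p), N, p))      # N x (p+1), first column all 1's
beta <- c(2, 1, -1, 0.5)
y <- drop(X %*% beta) + rnorm(N, sd = 2)
betahat <- solve(t(X) %*% X, t(X) %*% y)     # solves X'X b = X'y
# Compare with R's built-in least squares fit:
cbind(betahat, coef(lm(y ~ X - 1)))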

2.3 Mean vectors and covariance matrices

Before calculating the ERR_in, we need some matrix results. If Z is a vector or a matrix, then so is its mean:

Vector:  E[Z] = E[(Z_1, ..., Z_K)′] = (E[Z_1], ..., E[Z_K])′;   (2.18)

Matrix:  E[Z] = E[ (Z_{kl})_{K×L} ] = ( E[Z_{kl}] )_{K×L},   (2.19)

i.e., the expectation is taken element by element. Turning to variances and covariances, suppose that Z is a K × 1 vector. There are K variances and K(K−1)/2 covariances among the Z_j's to consider, recognizing that σ_{jk} = Cov[Z_j, Z_k] = Cov[Z_k, Z_j]. By convention, we will arrange them into a matrix, the variance-covariance matrix, or simply covariance matrix of Z:

Σ = Cov[Z] = ( Var[Z_1]        Cov[Z_1, Z_2]   ···   Cov[Z_1, Z_K] )
             ( Cov[Z_2, Z_1]   Var[Z_2]        ···   Cov[Z_2, Z_K] )
             (      ⋮               ⋮           ⋱         ⋮        )
             ( Cov[Z_K, Z_1]   Cov[Z_K, Z_2]   ···   Var[Z_K]      ),   (2.20)

so that the elements of Σ are the σ_{jk}'s. The means and covariance matrices for linear (or affine) transformations of vectors and matrices work as for univariate variables. Thus if A, B and C are fixed matrices of suitable sizes, then

E[AZB + C] = A E[Z] B + C.   (2.21)

For a vector Z, and fixed matrix A and vector c,

Cov[AZ + c] = A Cov[Z] A′.   (2.22)

The c is constant, so does not add to the variance. Because sums of squares are central to our analyses, we often need to find

E[‖Z‖^2],   (2.23)

where for a K × 1 vector z,

‖z‖^2 = z′z = z_1^2 + ··· + z_K^2   (2.24)

is the squared norm. Using the fact that Var[W] = E[W^2] − E[W]^2,

E[‖Z‖^2] = E[Z_1^2 + ··· + Z_K^2]
         = E[Z_1^2] + ··· + E[Z_K^2]
         = E[Z_1]^2 + Var[Z_1] + ··· + E[Z_K]^2 + Var[Z_K]
         = ‖E[Z]‖^2 + trace(Cov[Z]),   (2.25)

because the trace of a matrix is the sum of the diagonals, which in the case of a covariance matrix are the variances.
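To make (2.25) concrete, here is a quick Monte Carlo check; the particular mean vector and covariance matrix are arbitrary choices for illustration, and MASS::mvrnorm is used to simulate the vector Z.

# Simulation check of (2.25): E||Z||^2 = ||E Z||^2 + trace(Cov Z).
set.seed(2)
library(MASS)                      # for mvrnorm
mu <- c(1, -2, 3)
Sigma <- matrix(c(2, 1, 0,
                  1, 3, 1,
                  0, 1, 1), 3, 3)
Z <- mvrnorm(100000, mu, Sigma)    # 100000 draws of a 3 x 1 vector Z
mean(rowSums(Z^2))                 # Monte Carlo estimate of E||Z||^2
sum(mu^2) + sum(diag(Sigma))       # ||E Z||^2 + trace(Cov Z)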

2.4 Prediction using least-squares

When considering the in-sample error for the linear model, we have the same model for the training sample and the new sample:

Y = Xβ + e   and   Y^New = Xβ + e^New.   (2.26)

The e_i's and e_i^New's are independent with mean 0 and variance σ_e^2. If we use the least-squares estimate of β in the prediction, we have

Ŷ^New = X β̂_LS = X(X′X)^{−1} X′ Y = HY,   (2.27)

where H is the "hat" matrix,

H = X(X′X)^{−1} X′.   (2.28)

Note that this matrix is idempotent, which means that

HH = H. (2.29)
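A one-line numerical check of (2.28)–(2.29), reusing the simulated X from the earlier sketch (any full-rank design matrix would do):

# Check that the hat matrix is idempotent (up to rounding error).
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H %*% H, H)        # TRUE
sum(diag(H))                 # trace(H) = p + 1; see (2.38) below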

The errors in prediction are the Y_i^New − Ŷ_i^New. Before getting to ERR_in, consider the mean and covariances of these errors. First,

E[Y] = E[Xβ + e] = Xβ,   E[Y^New] = E[Xβ + e^New] = Xβ,   (2.30)

because the expected values of the e's are all 0 and we are assuming X is fixed, and

E[Ŷ^New] = E[X(X′X)^{−1} X′ Y] = X(X′X)^{−1} X′ E[Y] = X(X′X)^{−1} X′ Xβ = Xβ,   (2.31)

because the X′X’s cancel. Thus,

E[Y^New − Ŷ^New] = 0_N   (the N × 1 vector of 0's).   (2.32)

This zero means that the errors are unbiased. They may be big or small, but on average right on the nose. Unbiasedness is ok, but it is really more important to be close. Next, the covariance matrices:

Cov[Y] = Cov[Xβ + e] = Cov[e] = σ_e^2 I_N   (the N × N identity matrix),   (2.33)

because the e_i's are independent, hence have zero covariance, and all have variance σ_e^2. Similarly,

Cov[Y^New] = σ_e^2 I_N.   (2.34)

Less similar,

Cov[Ŷ^New] = Cov[X(X′X)^{−1} X′ Y]
           = X(X′X)^{−1} X′ Cov[Y] X(X′X)^{−1} X′   (Eqn. (2.22))
           = X(X′X)^{−1} X′ σ_e^2 I_N X(X′X)^{−1} X′
           = σ_e^2 X(X′X)^{−1} X′ X(X′X)^{−1} X′
           = σ_e^2 X(X′X)^{−1} X′
           = σ_e^2 H.   (2.35)

Finally, for the errors, note that Y^New and Ŷ^New are independent, because the latter depends on the training sample alone. Hence,

Cov[Y^New − Ŷ^New] = Cov[Y^New] + Cov[Ŷ^New]   (notice the +)
                   = σ_e^2 I_N + σ_e^2 H
                   = σ_e^2 (I_N + H).   (2.36)

Now,

N · ERR_in = E[ ‖Y^New − Ŷ^New‖^2 ]
           = ‖E[Y^New − Ŷ^New]‖^2 + trace(Cov[Y^New − Ŷ^New])   (by (2.25))
           = trace(σ_e^2 (I_N + H))   (by (2.36) and (2.32))
           = σ_e^2 (N + trace(H)).   (2.37)

For the trace, recall that X is N × (p + 1), so that

trace(H) = trace(X(X′X)^{−1} X′) = trace((X′X)^{−1} X′X) = trace(I_{p+1}) = p + 1.   (2.38)

Putting that answer in (2.37) we obtain

ERR_in = σ_e^2 + σ_e^2 (p + 1)/N.   (2.39)

This expected in-sample error is a simple function of three quantities. We will use it as a benchmark. The goal in the rest of this section will be to find, if possible, predictors that have lower in-sample error. There's not much we can do about σ_e^2, since it is the inherent variance of the observations. Taking a bigger training sample will decrease the error, as one would expect. The one part we can work with is the p, that is, try to reduce p by eliminating some of the explanatory variables. Will that strategy work? It is the subject of the next subsection.
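A small simulation can be used to check (2.39); the sketch below reuses the simulated X, beta, N, and p from the earlier sketches, so the numbers are illustrative only.

# Monte Carlo check of (2.39): average in-sample prediction error over many
# training/new-sample pairs, with X held fixed.
set.seed(3)
sigma_e <- 2
reps <- 2000
errs <- replicate(reps, {
  y  <- drop(X %*% beta) + rnorm(N, sd = sigma_e)   # training responses
  yn <- drop(X %*% beta) + rnorm(N, sd = sigma_e)   # new responses at the same x's
  yhat <- X %*% solve(t(X) %*% X, t(X) %*% y)       # least-squares predictions
  mean((yn - yhat)^2)
})
mean(errs)                                          # Monte Carlo ERR_in
sigma_e^2 * (1 + (p + 1)/N)                         # formula (2.39)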

2.5 Subset selection and Mallows' Cp

In fitting linear models (or any models), one often looks to see whether some of the variables can be eliminated. The purpose is to find a simpler model that still fits. Testing for the significance of the β_j's, either singly or in groups (e.g., testing for interaction in analysis of variance), is the typical approach. Here, we will similarly try to eliminate variables, but based on predictive power, not significance.

We will use data on diabetes patients to illustrate.[1] There are N = 442 patients, and p = 10 baseline measurements, which are the predictors. The dependent variable is a measure of the progress of the disease one year after the baseline measurements were taken. The ten predictors include age, sex, BMI[2], blood pressure, and six blood measurements (hdl, ldl, glucose, etc.) denoted S1, ..., S6. The prediction problem is to predict the progress of the disease for the next year based on these measurements.

[1] The data can be found at http://www-stat.stanford.edu/~hastie/Papers/LARS/

[2] Body mass index, which is 703 × (Weight)/(Height)^2, if weight is in pounds and height in inches. If you are 5'5", your BMI is just your weight in pounds divided by 6.

Below is the output for the least squares fit of the linear model:

Estimate Std. Error t value Pr(>|t|) (Intercept) -334.56714 67.45462 -4.960 1.02e-06 *** AGE -0.03636 0.21704 -0.168 0.867031 SEX -22.85965 5.83582 -3.917 0.000104 *** BMI 5.60296 0.71711 7.813 4.30e-14 *** BP 1.11681 0.22524 4.958 1.02e-06 *** S1 -1.09000 0.57333 -1.901 0.057948 . S2 0.74645 0.53083 1.406 0.160390 S3 0.37200 0.78246 0.475 0.634723 S4 6.53383 5.95864 1.097 0.273459 S5 68.48312 15.66972 4.370 1.56e-05 *** S6 0.28012 0.27331 1.025 0.305990 --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 54.15 on 431 degrees of freedom


Multiple R-Squared: 0.5177, Adjusted R-squared: 0.5066 F-statistic: 46.27 on 10 and 431 DF, p-value: < 2.2e-16

The "Estimate"s are the β̂_j's. It looks like there are some variables that could be dropped from the model, e.g., Age and S3, which have very small t-statistics. Some others appear to be very necessary, namely Sex, BMI, BP and S5. It is not as clear for the other variables. Which subset should we use? The one that yields the lowest in-sample error.

The in-sample error for the predictor using all 10 variables is given in (2.39). What about the predictor that uses eight variables, leaving out Age and S3? Or the one that uses just BMI and BP? For any subset of variables, one can in principle calculate the in-sample error. Remember that the error has to be calculated assuming that the full model is true, even though the estimates are based on the smaller model. That is, equation (2.26) is still true, with the complete X.

Let X* denote the matrix with just the columns of x-variables that are to be used in the prediction, and let p* be the number of such variables. Using least squares on just X*, and using those estimates to predict Y^New, we obtain as in (2.27) and (2.28) the prediction

Ŷ^New* = H* Y,   where   H* = X*(X*′X*)^{−1} X*′.   (2.40)

The calculations proceed much as in Section 2.4. For the bias,

E[Ŷ^New*] = H* E[Y] = H* Xβ,   (2.41)

hence

E[Y^New − Ŷ^New*] = Xβ − H* Xβ = (I_N − H*) Xβ.   (2.42)

Notice that there may be bias, unlike in (2.32). In fact, the predictor is unbiased if and only if the β_j's corresponding to the explanatory variables left out are zero. That is, there is no bias if the left-out variables play no role in the true model. The covariance is

Cov[Ŷ^New*] = H* Cov[Y] H* = σ_e^2 H*,   (2.43)

so by independence of the new and the old,

Cov[Y^New − Ŷ^New*] = Cov[Y^New] + Cov[Ŷ^New*] = σ_e^2 (I_N + H*).   (2.44)

Denoting the in-sample error for this predictor by ERR_in*, we calculate

N · ERR_in* = E[ ‖Y^New − Ŷ^New*‖^2 ]
            = ‖Xβ − H* Xβ‖^2 + trace(σ_e^2 (I_N + H*))
            = β′X′(I_N − H*)Xβ + σ_e^2 (N + trace(H*)).   (2.45)

This time, trace(H*) = p* + 1. Compare this error to the one using all x's in (2.39):

ERR_in* = (1/N) β′X′(I_N − H*)Xβ + σ_e^2 + σ_e^2 (p* + 1)/N;
ERR_in  =                       0 + σ_e^2 + σ_e^2 (p + 1)/N;
Error   = Bias^2 + Inherent variance + Estimation variance.   (2.46)

The lesson is that by reducing the number of variables used for prediction, you can lower the variance, but at the risk of increasing the bias. Thus there is a bias/variance tradeoff. To quantify the tradeoff, in order to see whether it is a worthwhile one, one would need to know β and σ_e^2, which are unknown parameters. The next best choice is to estimate these quantities, or more directly, estimate the in-sample errors.

2.5.1 Estimating the in-sample errors

For machine learning in general, a key task is estimation of the prediction error. Chapter 7 in the text reviews a number of approaches. In the present situation, the estimation is not too difficult, although one must be careful. If we observed the Y^New, we could estimate the in-sample error directly:

ERRhat_in* = (1/N) ‖y^New − ŷ^New*‖^2.   (2.47)

(Of course, if we knew the new observations, we would not need to do any prediction.) We do observe the training sample, which we assume has the same distribution as the new vector, hence a reasonable estimate would be the observed error,

err* = (1/N) ‖y − ŷ^New*‖^2.   (2.48)

Is this estimate a good one? We can find its expected value, much as before.

E[err*] = (1/N) ( ‖E[Y − Ŷ^New*]‖^2 + trace(Cov[Y − Ŷ^New*]) ).   (2.49)

We know that E[Y] = Xβ, so using (2.41) and (2.42), we have that

‖E[Y − Ŷ^New*]‖^2 = ‖(I_N − H*)Xβ‖^2 = β′X′(I_N − H*)Xβ.   (2.50)

For the covariance, we note that Y and Ŷ^New* are not independent, since they are both based on the training sample. Thus

Cov[Y − Ŷ^New*] = Cov[Y − H*Y]
                = Cov[(I_N − H*)Y]
                = (I_N − H*) Cov[Y] (I_N − H*)′
                = σ_e^2 (I_N − H*),   (2.51)

and

trace(Cov[Y − Ŷ^New*]) = σ_e^2 (N − p* − 1).   (2.52)

Putting (2.50) and (2.52) into (2.49) yields the answer. Comparing to the actual ERR_in*:

E[err*] = (1/N) β′X′(I_N − H*)Xβ + σ_e^2 − σ_e^2 (p* + 1)/N;
ERR_in* = (1/N) β′X′(I_N − H*)Xβ + σ_e^2 + σ_e^2 (p* + 1)/N.   (2.53)

These equations reveal an important general phenomenon:

The error in predicting the observed data is an underestimate of the error in predicting the new individuals.

This effect should not be too surprising, as the predictor is chosen in order to do well with the observed data. To estimate ERR_in*, we have to bump up the observed error a bit. From (2.53),

ERR_in* = E[err*] + 2 σ_e^2 (p* + 1)/N.   (2.54)

If we can estimate σ_e^2, then we will be all set. The usual regression estimate is the one to use,

σ̂_e^2 = ‖y − ŷ‖^2 / (N − p − 1) = (Residual Sum of Squares) / (N − p − 1).   (2.55)

Here, ŷ = X β̂_LS, which is the same as ŷ^New in (2.27) when using all the predictors. The estimate is unbiased,

E[σ̂_e^2] = σ_e^2,   (2.56)

because

E[Y − Ŷ^New] = (I_N − H)Xβ = 0   (because HX = X),   (2.57)

and

trace(Cov[Y − Ŷ^New]) = σ_e^2 trace(I_N − H) = σ_e^2 (N − p − 1).   (2.58)

So from (2.54),

ERRhat_in* = err* + 2 σ̂_e^2 (p* + 1)/N   (2.59)

is an unbiased estimator for ERR_in*. (Just to emphasize: the first term is the error calculated using the subset of the explanatory variables under consideration, while the second term uses the residual variance estimated from the full set of predictors.) One way to think of this estimate is as follows:

Prediction error = Observed error + Penalty for estimating parameters. (2.60)

Turn to the diabetes example. Fitting the full model to the data, we find

σ̂_e^2 = (Residual Sum of Squares) / (N − p − 1) = 1263986 / (442 − 10 − 1) = 2932.682.   (2.61)

The output above already gave σ̂_e = 54.15, the "residual standard error." Here, err = 1263986/442 = 2859.697, so that the ERR_in for the full model is estimated by

ERRhat_in = err + 2 σ̂_e^2 (p + 1)/N = 2859.697 + 2 · 2932.682 · (11/442) = 3005.667.   (2.62)

Now we can see whether dropping some of the variables leads to a lower estimated error. Let us leave out Age and S3. The results:

Estimate Std. Error t value Pr(>|t|) (Intercept) -305.9214 31.1218 -9.830 < 2e-16 *** SEX -23.0730 5.7901 -3.985 7.92e-05 *** BMI 5.5915 0.7152 7.818 4.11e-14 *** BP 1.1069 0.2206 5.018 7.65e-07 *** S1 -0.8540 0.2817 -3.031 0.00258 ** S2 0.5542 0.3457 1.603 0.10968 S4 4.6944 4.4525 1.054 0.29232 S5 63.0731 10.9163 5.778 1.45e-08 *** S6 0.2781 0.2703 1.029 0.30401 --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 54.04 on 433 degrees of freedom Multiple R-Squared: 0.5175, Adjusted R-squared: 0.5086 F-statistic: 58.04 on 8 and 433 DF, p-value: < 2.2e-16

Here, p* = 8, which explains the degrees of freedom: N − p* − 1 = 442 − 9 = 433. The observed error is

err* = 1264715/442 = 2861.345.   (2.63)

The estimate of in-sample error is then

ERRhat_in* = err* + 2 σ̂_e^2 (p* + 1)/N = 2861.345 + 2 · 2932.682 · (9/442) = 2980.775.   (2.64)

That estimate is a little lower than the 3005.67 using all ten variables, which suggests that for prediction purposes, it is fine to drop those two variables.
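The numbers in (2.61)–(2.64) can be reproduced directly; the sketch below assumes the diabetes data have been read into a data frame diab with the dependent variable named Y, as described in Section 2.5.3.

# Reproducing (2.61)-(2.64) for the full model and the model without Age, S3.
N <- nrow(diab); p <- 10
fit.full <- lm(Y ~ ., data = diab)                 # all ten predictors
fit.sub  <- lm(Y ~ . - AGE - S3, data = diab)      # drop Age and S3
sigma2 <- sum(fit.full$resid^2)/(N - p - 1)        # (2.61)
errhat.full <- mean(fit.full$resid^2) + 2*sigma2*(p + 1)/N    # (2.62)
errhat.sub  <- mean(fit.sub$resid^2)  + 2*sigma2*(8 + 1)/N    # (2.64), p* = 8
c(errhat.full, errhat.sub)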

2.5.2 Finding the best subset

The method for finding the best subset of variables to use involves calculating the estimated in-sample error for each possible subset, then using the subset with the lowest estimated error. Or one could use one of the subsets which has an error close to the lowest. With p = 10, there are 2^10 = 1024 possible subsets. Running through them all might seem daunting, but computers are good at that. A clever "leaps-and-bounds" algorithm (Furnival and Wilson [1974]) does not need to try all possible subsets; it can rule out some of them as it goes along. In R, an implementation is the leaps function, which is what we use here. It actually calculates Mallows' Cp statistic for each subset, which is defined as

Cp* = (Residual Sum of Squares*) / σ̂_e^2 + 2(p* + 1) − N,   (2.65)

which is a linear function of ERRhat_in*.

Here are the results for some selected subsets, including the best:

  Subset        p*   err*       Penalty   ERRhat_in*
  0010000000     1   3890.457    26.54    3916.997
  0010000010     2   3205.190    39.81    3245.001
  0011000010     3   3083.052    53.08    3136.132
  0011100010     4   3012.289    66.35    3078.639
  0111001010     5   2913.759    79.62    2993.379
  0111100010     5   2965.772    79.62    3045.392
  0111110010     6   2876.684    92.89    2969.574  ***
  0111100110     6   2885.248    92.89    2978.139
  0111110110     7   2868.344   106.16    2974.504
  0111110111     8   2861.346   119.43    2980.776
  1111111111    10   2859.697   145.97    3005.667
                                                       (2.66)

The “Subset” column indicates by 1’s which of the ten variables are in the predictor. Note that as the number of variables increases, the observed error decreases, but the penalty increases. The last two subsets are those we considered above. The best is the one with the asterisks. Here is the regression output for the best:

Estimate Std. Error t value Pr(>|t|) (Intercept) -313.7666 25.3848 -12.360 < 2e-16 *** SEX -21.5910 5.7056 -3.784 0.000176 *** BMI 5.7111 0.7073 8.075 6.69e-15 *** BP 1.1266 0.2158 5.219 2.79e-07 *** S1 -1.0429 0.2208 -4.724 3.12e-06 *** S2 0.8433 0.2298 3.670 0.000272 *** S5 73.3065 7.3083 10.031 < 2e-16 *** --- Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 54.06 on 435 degrees of freedom Multiple R-Squared: 0.5149, Adjusted R-squared: 0.5082 F-statistic: 76.95 on 6 and 435 DF, p-value: < 2.2e-16

All the included variables are highly significant. Even though the model was not chosen on the basis of interpretation, one can then see which variables have a large role in predicting the progress of the patient, and the direction of the role, e.g., being fat and having high blood pressure is not good. It is still true that association does not imply causation.

2.5.3 Using R

To load the diabetes data into R, you can use

diab <- read.table("http://www-stat.stanford.edu/~hastie/Papers/LARS/diabetes.data",header=T)

We'll use leaps, which is an R package of functions that has to be installed and loaded into R before you can use it. In Windows, you go to the Packages menu, and pick Load package .... Pick leaps. If leaps is not there, you have to install it first. Go back to the Packages menu, and choose Install package(s) .... Pick a location near you, then pick leaps. Once that gets installed, you still have to load it. Next time, you should be able to load it without installing it first.

The function leaps will go through all the possible subsets (at least the good ones), and output their Cp's. For the diabetes data, the command is

diablp <- leaps(diab[,1:10],diab[,11],nbest=10)

The nbest=10 means it outputs the best 10 fits for each p∗. Then diablp contains

• diablp$which, a matrix with each row indicating which variables are in the model.

• diablp$Cp, the Cp statistics corresponding to those models (see the sketch below for converting these to estimated in-sample errors); and

• diablp$size, the number of variables in each model, that is, the p* + 1's.
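Since ERRhat_in* = RSS*/N + 2σ̂_e^2(p* + 1)/N and Cp* = RSS*/σ̂_e^2 + 2(p* + 1) − N, the two are related by ERRhat_in* = σ̂_e^2 (Cp* + N)/N, so the leaps output can be converted directly to estimated in-sample errors. A sketch, using the diab and diablp objects defined in this section:

# Convert the leaps Cp values to estimated in-sample errors.
sigma2 <- sum(lm(Y ~ ., data = diab)$resid^2)/(442 - 10 - 1)
errhat <- sigma2*(diablp$Cp + 442)/442
min(errhat)                                   # about 2969.57, as in (2.66)
diablp$which[which.min(errhat), ]             # the best subset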

To see the results, plot the size versus Cp:

plot(diablp$size,diablp$Cp,xlab="p*+1",ylab="Cp")

[Plot: Cp (roughly 0 to 400) versus p*+1 for the subsets found by leaps.]

It looks like p∗ + 1 = 7 is about right. To focus in a little:

plot(diablp$size,diablp$Cp,xlab="p*+1",ylab="Cp",ylim=c(0,20))

[Plot: the same Cp versus p*+1 plot, zoomed in to Cp between 0 and 20.]

To figure out which 6 variables are in that best model, find the "which" that has the smallest "Cp":

min(diablp$Cp)
diablp$which[diablp$Cp==min(diablp$Cp),]

The answers are that the minimum Cp =5.56, and the corresponding model is

FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE

That means the variables 2, 3, 4, 5, 6, and 9 are in the model. That includes sex, BMI, blood pressure, and three of the blood counts. To fit that model, use lm (for "linear model"):

diablm <- lm(Y ~ SEX+BMI+BP+S1+S2+S5,data=diab) summary(diablm)

Note: To fit the model with all the variables, you can use

diablm <- lm(Y ~ .,data=diab)

The “.” tells the program to use all variables (except for Y ) in the X.

2.6 Regularization: Ridge regression

Subset regression essentially involves setting some of the β̂_j's to 0, and letting the others be estimated using least squares. A less drastic approach is to allow all the estimates to be nonzero, but to try to constrain them in some way so that they do not become "too big." Regularization is a general term that describes methods that impose a penalty on estimates that try to go too far afield. In ridge regression, the penalty is the squared norm of the β vector. The ridge estimate of β is like the least squares estimate, but seeks to minimize the objective function

obj_λ(b) = ‖y − Xb‖^2 + λ‖b‖^2.   (2.67)

The λ is a nonnegative tuning parameter that must be chosen in some manner. If λ = 0, the objective function is the same as that for least squares. As λ increases, more weight is placed on the b_j's being small. If λ = ∞ (or thereabouts), then we would take b = 0_{p+1}. A medium λ yields an estimate somewhere "between" zero and the least squares estimate. Thus a ridge estimate is a type of shrinkage estimate.

A heuristic notion for why one would want to shrink the least squares estimate is that although the least squares estimate is an unbiased estimate of β, its squared length is an overestimate of ‖β‖^2:

E[‖β̂_LS‖^2] = ‖E[β̂_LS]‖^2 + trace(Cov[β̂_LS])
             = ‖β‖^2 + σ_e^2 trace((X′X)^{−1}).   (2.68)

If there is a great deal of multicollinearity in the explanatory variables, then X′X will be almost non-invertible, which means that (X′X)^{−1} will be very large. The ridge estimator of β for given λ is the b that minimizes the resulting obj_λ(b) in (2.67).

Finding the ridge estimate is no more difficult than finding the least squares estimate. We use a trick. Write the penalty part as

λ‖b‖^2 = ‖0_{p+1} − √λ I_{p+1} b‖^2.   (2.69)

Pretend that the zero vector represents some additional y's, and the √λ I_{p+1} matrix represents additional x's, so that

obj_λ(b) = ‖y − Xb‖^2 + ‖0_{p+1} − √λ I_{p+1} b‖^2 = ‖y_λ − X_λ b‖^2,   (2.70)

where

y_λ = ( y       )     and     X_λ = ( X           ).   (2.71)
      ( 0_{p+1} )                   ( √λ I_{p+1}  )

The ridge estimate of β is then

β̂_λ = (X_λ′ X_λ)^{−1} X_λ′ y_λ
     = (X′X + λ I_{p+1})^{−1} X′ y.   (2.72)

Originally, ridge regression was introduced to improve estimates when X′X is nearly non-invertible. By adding a small λ to the diagonals, the inverse becomes much more stable. The corresponding ridge predictor of the new individuals is

Ŷ_λ^New = X β̂_λ = X(X′X + λ I_{p+1})^{−1} X′ Y ≡ H_λ Y.   (2.73)

This H_λ is symmetric but not a projection matrix (it is not idempotent), unless λ = 0. Note that indeed, λ = 0 leads to the least squares estimate, and letting λ → ∞ drives the prediction to zero.

Two questions: How good is the ridge predictor? What should λ be? As in the previous section, we figure out the in-sample error, or at least an estimate of it, for each λ. We start with the means and covariances:

E[Ŷ_λ^New] = H_λ E[Y] = H_λ Xβ   and   Cov[Ŷ_λ^New] = H_λ Cov[Y] H_λ = σ_e^2 H_λ^2,   (2.74)

hence

E[Y^New − Ŷ_λ^New] = Xβ − H_λ Xβ = (I_N − H_λ) Xβ,   and   (2.75)

Cov[Y^New − Ŷ_λ^New] = σ_e^2 I_N + σ_e^2 H_λ^2 = σ_e^2 (I_N + H_λ^2).   (2.76)

The in-sample error is

ERR_in,λ = (1/N) E[ ‖Y^New − Ŷ_λ^New‖^2 ]
         = (1/N) ( ‖(I_N − H_λ)Xβ‖^2 + σ_e^2 trace(I_N + H_λ^2) )
         = (1/N) β′X′(I_N − H_λ)^2 Xβ + σ_e^2 + σ_e^2 trace(H_λ^2)/N.   (2.77)

Repeating part of equation (2.46), we can compare the errors of the subset and ridge predictors:

ERR_in,λ = (1/N) β′X′(I_N − H_λ)^2 Xβ + σ_e^2 + σ_e^2 trace(H_λ^2)/N;
ERR_in*  = (1/N) β′X′(I_N − H*)Xβ     + σ_e^2 + σ_e^2 (p* + 1)/N;
Error    = Bias^2 + Inherent variance + Estimation variance.   (2.78)

2.6.1 Estimating the in-sample errors

The larger the λ, the smaller the Hλ, which means as λ increases, the bias increases but the estimation variance decreases. In order to decide which λ is best, we have to estimate the ERRin,λ for all (or many) λ’s. Just as for subset regression, we start by looking at the error for the observed data:

err_λ = (1/N) ‖Y − Ŷ_λ^New‖^2 = (1/N) ‖Y − H_λ Y‖^2.   (2.79)

We basically repeat the calculations from (2.53):

E[Y − H_λ Y] = (I_N − H_λ) Xβ,   (2.80)

Cov[Y − H_λ Y] = Cov[(I_N − H_λ)Y]
               = σ_e^2 (I_N − H_λ)^2
               = σ_e^2 (I_N − 2H_λ + H_λ^2),   (2.81)

E[err_λ] = (1/N) β′X′(I_N − H_λ)^2 Xβ + σ_e^2 + σ_e^2 trace(H_λ^2)/N − 2σ_e^2 trace(H_λ)/N;
ERR_in,λ = (1/N) β′X′(I_N − H_λ)^2 Xβ + σ_e^2 + σ_e^2 trace(H_λ^2)/N.   (2.82)

To find an unbiased estimator of the in-sample error, we add 2σ̂_e^2 trace(H_λ)/N to the observed error. Here, σ̂_e^2 is the estimate derived from the regular least squares fit using all the variables, as before. To compare to the subset regression,

ERRhat_in,λ = err_λ + 2 σ̂_e^2 trace(H_λ)/N;
ERRhat_in*  = err*  + 2 σ̂_e^2 (p* + 1)/N;   (2.83)

Prediction error = Observed error + Penalty for estimating parameters.

If we put H* in for H_λ, we end up with the same in-sample error estimates. The main difference is that the number of parameters for subset regression, p* + 1, is replaced by trace(H_λ). This trace is often called the effective degrees of freedom,

edf(λ)= trace(Hλ). (2.84)

If λ = 0, then H_λ = H, so edf(0) = p + 1. As λ → ∞, H_λ → 0, because λ is in the denominator, hence edf(∞) = 0. Thus the effective degrees of freedom is somewhere between 0 and p + 1. By shrinking the parameters, you "decrease" the effective number of parameters, in a continuous manner.

2.6.2 Finding the best λ

We make a few modifications in order to preserve some "equivariance." That is, we do not want the predictions to be affected by the units of the variables: it should not matter whether height is in inches or centimeters, or whether temperature is in degrees Fahrenheit or centigrade or Kelvin. To that end, we make the following modifications:

1. Normalize y so that it has zero mean (that is, subtract ȳ from y);

2. Normalize the explanatory variables so that they have zero means and squared norms of 1.

With the variables normalized as in #1 and #2 above, we can eliminate the 1_N vector from X, and the β_0 parameter. #2 also implies that the diagonals of the X′X matrix are all 1's. After finding the best λ and the corresponding estimate, we have to untangle things to get back to the original x's and y's. The one change we need to make is to add 1 to the effective degrees of freedom, because we are surreptitiously estimating β_0.

For any particular λ, the calculations of err_λ and edf(λ) are easy enough using a computer. To find the best λ, one can choose a range of λ's, then calculate the ERRhat_in,λ for each one over a grid in that range. With the normalizations we made, the best λ is most likely reasonably small, so starting with a range of [0, 1] is usually fine.

We next tried the ridge predictor on the normalized diabetes data.

Here is a graph of the ERRhat_in,λ's versus λ:

[Plot: estimated errors (roughly 3005 to 3020) versus lambda, for λ from 0 to 0.20.]

Here are the details for some selected λ’s including the best:

  λ        edf(λ)   err_λ     penalty   ERRhat_in,λ
  0        11       2859.70   145.97    3005.67
  0.0074   10.38    2864.57   137.70    3002.28  ***
  0.04      9.46    2877.04   125.49    3002.54
  0.08      8.87    2885.82   117.72    3003.53
  0.12      8.44    2895.47   111.96    3007.43
  0.16      8.08    2906.69   107.22    3013.91
  0.20      7.77    2919.33   103.15    3022.48
                                                    (2.85)

Note the best λ is quite small, 0.0074. The first line is the least squares prediction using all the variables. We can compare the estimates of the coefficients for three of our models: least squares using all the variables, the best subset regression, and the best ridge regression. (These estimates are different from those in the previous section because we have normalized

the x’s.)

             Full LS    Subset     Ridge
  AGE         −10.01         0     −7.70
  SEX        −239.82   −226.51   −235.67
  BMI         519.85    529.88    520.97
  BP          324.38    327.22    321.31
  S1         −792.18   −757.93   −438.60
  S2          476.74    538.58    196.36
  S3          101.04         0    −53.74
  S4          177.06         0    136.33
  S5          751.27    804.19    615.37
  S6           67.63         0     70.41
  ‖β̂‖       1377.84   1396.56    1032.2
  edf             11         7     10.38
  err       2859.696   2876.68   2864.57
  penalty     145.97     92.89    137.70
  ERRhat     3005.67   2969.57   3002.28
                                             (2.86)

The best subset predictor is the best of these. It improves on the full model by about 35 points, whereas the best ridge improves on the full model by very little, about 3 points. Notice also that the best subset predictor is worse than the ridge one in observed error, but quite a bit better in penalty. The ridge estimator does have the smallest ‖β̂‖^2.

Comparing the actual coefficients, we see that all three methods give fairly similar estimates. The main difference between the full model and subset is that the subset model sets the four left-out beta's to zero. The ridge estimates are generally smaller in magnitude than the full model's, but follow a similar pattern.

Taking everything into consideration, the subset method looks best here. It is the simplest as well as having the lowest prediction error. Of course, one could use ridge regression on just the six variables in the best subset model.

Recovering the unnormalized estimates

To actually use the chosen predictor for new observations, one would want to go back to the results in the original units. If x_[j] is the original j-th column vector, then the normalized vector is

x_[j]^Norm = (x_[j] − x̄_[j] 1_N) / ‖x_[j] − x̄_[j] 1_N‖,   (2.87)

where x̄_[j] is the mean of the elements of x_[j]. Also, the normalized y is Y^Norm = Y − Ȳ 1_N. The fit to the normalized data is

Y^Norm = b_1^Norm x_[1]^Norm + ··· + b_p^Norm x_[p]^Norm,   (2.88)

which expands to

Y − Ȳ 1_N = b_1^Norm (x_[1] − x̄_[1] 1_N)/‖x_[1] − x̄_[1] 1_N‖ + ··· + b_p^Norm (x_[p] − x̄_[p] 1_N)/‖x_[p] − x̄_[p] 1_N‖
          = b_1 (x_[1] − x̄_[1] 1_N) + ··· + b_p (x_[p] − x̄_[p] 1_N),   (2.89)

where bj is the coefficient in the unnormalized fit,

b_j = b_j^Norm / ‖x_[j] − x̄_[j] 1_N‖.   (2.90)

Then collecting the terms multiplying the 1N vector,

Y = (Ȳ − b_1 x̄_[1] − ··· − b_p x̄_[p]) 1_N + b_1 x_[1] + ··· + b_p x_[p].   (2.91)

Thus the intercept for the unnormalized data is

b_0 = Ȳ − b_1 x̄_[1] − ··· − b_p x̄_[p].   (2.92)
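A sketch of (2.90) and (2.92) in R; here bnorm is a hypothetical name for the vector of p coefficients fit on the normalized data (for example, the beta.lambda computed in Section 2.6.3), and diab is the original data frame with the response in column 11.

# Convert normalized-scale coefficients back to the original units.
x0 <- as.matrix(diab[, 1:10])                     # original x's
y0 <- diab[, 11]                                  # original y
scales <- sqrt(colSums(sweep(x0, 2, colMeans(x0))^2))   # the norms in (2.87)
b  <- bnorm / scales                              # (2.90)
b0 <- mean(y0) - sum(b * colMeans(x0))            # (2.92)
c(b0, b)                                          # intercept and slopes, original units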

2.6.3 Using R

The diab data is the same as in Section 2.5.3. We first normalize the variables, calling the results x and y:

p <- 10
N <- 442
sigma2 <- sum(lm(Y ~ .,data=diab)$resid^2)/(N-p-1)
# sigma2 is the residual variance from the full model
y <- diab[,11]
y <- y-mean(y)
x <- diab[,1:10]
x <- sweep(x,2,apply(x,2,mean),"-")
x <- sweep(x,2,sqrt(apply(x^2,2,sum)),"/")

One approach is to perform the matrix manipulations directly. You first have to turn x into a matrix; right now it is a data frame.[3]

[3] One thing about R that drives me crazy is the distinction between data frames and matrices. For some purposes either will do, for some one needs a matrix, for others one needs a data frame. At least it is easy enough to change from one to the other, using as.data.frame or as.matrix.

x <- as.matrix(x)

For a given lambda, ridge regression proceeds as follows:


beta.lambda <- solve(t(x)%*%x+lambda*diag(p),t(x)%*%y)   # diag(p) is the pxp identity
h.lambda <- x%*%solve(t(x)%*%x+lambda*diag(p),t(x))
y.lambda <- x%*%beta.lambda
rss.lambda <- sum((y-y.lambda)^2)
err.lambda <- rss.lambda/N
edf.lambda <- sum(diag(h.lambda))+1
pen.lambda <- 2*sigma2*(edf.lambda)/N
errhat.lambda <- err.lambda + pen.lambda

These calculations work, but are not terrifically efficient. Another approach that works with the linear model fitter is to make use of the augmented data as in (2.71), where here we are leaving out the 1_N vector:

xx <- rbind(x,sqrt(lambda)*diag(p))
yy <- c(y,rep(0,10))
lm.lambda <- lm(yy~xx-1)

The lm.lambda will have the correct estimates of the coefficients, but the other output is not correct, since it is using the augmented data as well. The first N of the residuals are the correct residuals, hence

rss.lambda <- sum(lm.lambda$resid[1:N]^2)

The sum of squares of the last p residuals yields the λ‖β̂_λ‖^2. I don't know if there is a clever way to get the effective degrees of freedom. (See (2.95) for a method.)

You can skip this subsection

A more efficient procedure, if one is going to calculate the regression for many λ's, is to use the singular value decomposition of X (N × p now), which consists of writing

X = U Δ V′,   (2.93)

where U is an N × p matrix with orthonormal columns, Δ is a p × p diagonal matrix with diagonal elements δ_1 ≥ δ_2 ≥ ··· ≥ δ_p ≥ 0, and V is a p × p orthogonal matrix. We can then find

H_λ = X(X′X + λ I_p)^{−1} X′
    = U Δ V′ (V Δ U′ U Δ V′ + λ I_p)^{−1} V Δ U′
    = U Δ V′ (V (Δ^2 + λ I_p) V′)^{−1} V Δ U′    (U′U = I_p)
    = U Δ (Δ^2 + λ I_p)^{−1} Δ U′                (V′V = I_p)
    = U diag( δ_i^2/(δ_i^2 + λ) ) U′,   (2.94)

where the middle matrix is meant to be a diagonal matrix with the δ_i^2/(δ_i^2 + λ)'s down the diagonal. Then

edf(λ) = trace(H_λ) + 1
       = trace( U diag( δ_i^2/(δ_i^2 + λ) ) U′ ) + 1
       = trace( diag( δ_i^2/(δ_i^2 + λ) ) U′U ) + 1
       = Σ_{i=1}^p δ_i^2/(δ_i^2 + λ) + 1.   (2.95)

To find err_λ, we start by noting that U in (2.93) has p orthogonal columns. We can find N − p more columns, collecting them into the N × (N − p) matrix U_2, so that

Γ = (U  U_2)   is an N × N orthogonal matrix.   (2.96)

The predicted vector can then be rotated. Since Γ′ stacks U′ on top of U_2′, and U′U = I_p, U_2′U = 0,

Γ′ ŷ_λ^New = Γ′ U diag( δ_i^2/(δ_i^2 + λ) ) U′ y
           = ( diag( δ_i^2/(δ_i^2 + λ) ) U′ y ).   (2.97)
             (                0               )

Because the squared norm of a vector is not changed when multiplying by an orthogonal matrix,

N err_λ = ‖Γ′(y − ŷ_λ^New)‖^2
        = ‖ ( U′ y − diag( δ_i^2/(δ_i^2 + λ) ) U′ y ) ‖^2
            (                 U_2′ y                )
        = ‖ diag( λ/(δ_i^2 + λ) ) w ‖^2 + ‖U_2′ y‖^2,   where w = U′ y.   (2.98)

When λ = 0, we know we have the usual least squares fit using all the variables, hence N err_0 = RSS, i.e.,

‖U_2′ y‖^2 = RSS.   (2.99)

Then from (2.83),

ERRhat_in,λ = err_λ + 2 σ̂_e^2 trace(H_λ)/N
            = (1/N) ( RSS + Σ_{i=1}^p ( λ/(δ_i^2 + λ) )^2 w_i^2 ) + 2 σ̂_e^2 (1/N) ( Σ_{i=1}^p δ_i^2/(δ_i^2 + λ) + 1 ).   (2.100)

Why is equation (2.100) useful? Because the singular value decomposition (2.93) needs to be calculated just once, as do RSS and σ̂_e^2. Then w = U′y is easy to find, and all other elements of the equation are simple functions of these quantities. To perform these calculations in R, start with

N <- 442
p <- 10
s <- svd(x)
w <- t(s$u)%*%y
d2 <- s$d^2
rss <- sum(y^2)-sum(w^2)
s2 <- rss/(N-p-1)

Then to find the ERRhat_in,λ's for a given set of λ's, and plot the results, use

lambdas <- (0:100)/100
errhat <- NULL
for(lambda in lambdas) {
  rssl <- sum((w*lambda/(d2+lambda))^2)+rss
  edfl <- sum(d2/(d2+lambda))+1
  errhat <- c(errhat,(rssl + 2*s2*edfl)/N)
}
plot(lambdas,errhat,type='l',xlab="lambda",ylab="Estimated errors")

You might want to repeat, focussing on smaller λ, e.g., lambdas <- (0:100)/500. To find the best, it is easy enough to try a finer grid of values. Or you can use the optimize function in R. You need to define the function that, given λ, yields ERRhat_in,λ:

f <- function(lambda) {
  rssl <- sum((w*lambda/(d2+lambda))^2)+rss
  edf <- sum(d2/(d2+lambda))+1
  (rssl + 2*s2*edf)/N
}
optimize(f,c(0,.02))

The output gives the best λ (at least the best it found) and the corresponding error:

$minimum
[1] 0.007378992

$objective
[1] 3002.279

2.7 Lasso

The objective function in ridge regression (2.67) uses sums of squares for both the error term and the regularizing term, i.e., ‖b‖^2. Lasso keeps the sum of squares for the error, but looks at the absolute values of the b_j's, so that the objective function is

obj_λ^L(b) = ‖y − Xb‖^2 + λ Σ_{j=1}^p |b_j|.   (2.101)

Notice we are leaving out the intercept in the regularization part. One could leave it in.

Note: Both ridge and lasso can equally well be thought of as constrained estimation problems. Minimizing the ridge objective function for a given λ is equivalent to minimizing

‖y − Xb‖^2   subject to   ‖b‖^2 ≤ t   (2.102)

for some t. There is a one-to-one correspondence between λ's and t's (the larger the λ, the smaller the t). Similarly, lasso minimizes

‖y − Xb‖^2   subject to   Σ_{j=1}^p |b_j| ≤ t   (2.103)

for some t (not the same as that for ridge).

The lasso estimate of β is the b that minimizes obj_λ^L in (2.101). The estimate is not a simple linear function of y as before. There are various ways to compute the minimizer. One can use convex programming methods, because we are trying to minimize a convex function subject to linear constraints. In Efron et al. [2004][4], the authors present least angle regression, which yields a very efficient method for calculating the lasso estimates for all λ.

[4] http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

Here we discuss an inefficient approach (that is, do not try this at home), but one that may provide some insight. The function obj_λ^L(b) is strictly convex in b, because the error sum of squares is strictly convex, and each |b_j| is convex (though not strictly). Also, as any b_j goes to ±∞, the objective function goes to infinity. Thus there does exist a unique minimum. Denote it by β̂_λ^L. There are two possibilities:

1. obj_λ^L is differentiable with respect to each b_j at the minimum.

2. obj_λ^L is not differentiable with respect to at least one b_j at the minimum.

The objective function is differentiable with respect to b_j for all b_j except b_j = 0, because of the |b_j| part of the equation. Which means that if the objective function is not differentiable with respect to b_j at the minimum, β̂_λ,j^L = 0. As in the subset method, let b*, β̂_λ^L*, and X* contain just the elements for the coefficients not set to zero at the minimum. It must be that

∂/∂b_j* [ ‖y − X*b*‖^2 + λ Σ |b_j*| ]  evaluated at b* = β̂_λ^L*  equals 0   (2.104)

for the b_j*'s not set to zero. The derivative of the sum of squares part is the same as before, in (2.13), and

d/dz |z| = Sign(z)   for z ≠ 0,   (2.105)

hence setting the derivatives equal to zero results in the equation

−2 X*′(y − X*b*) + λ Sign(b*) = 0,   (2.106)

where the sign of a vector is the vector of signs. The solution is the estimate, so

X*′X* β̂_λ^L* = X*′y − (1/2) λ Sign(β̂_λ^L*)
⟹   β̂_λ^L* = (X*′X*)^{−1} ( X*′y − (1/2) λ Sign(β̂_λ^L*) ).   (2.107)

This equation shows that the lasso estimator is a shrinkage estimator as well. If λ = 0, we have the usual least squares estimate, and for positive λ, the estimates are decreased if positive and increased if negative. (Also, realize that this equation is really not an explicit formula for the estimate, since it appears on both sides.)

The non-efficient method for finding β̂_λ^L is, for given λ, for each subset, see if (2.107) can be solved. If not, forget that subset. If so, calculate the resulting obj_λ^L. Then choose the estimate with the lowest obj_λ^L. The important point to note is that lasso incorporates both subset selection and shrinkage.
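For intuition only, one can minimize (2.101) numerically for a single λ with a general-purpose optimizer; this is not the lars algorithm used in Section 2.7.2, and because the objective is not differentiable at zero, coefficients that should be exactly zero will only come out numerically close to zero. The sketch assumes the normalized x (a matrix) and y from Section 2.6.3.

# Brute-force minimization of the lasso objective (2.101) for one lambda.
lasso.obj <- function(b, lambda) sum((y - x %*% b)^2) + lambda*sum(abs(b))
lambda <- 1
fit <- optim(rep(0, 10), lasso.obj, lambda = lambda, method = "BFGS")
round(fit$par, 2)     # approximate lasso coefficients at this lambda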

2.7.1 Estimating the in-sample errors

Figuring out an exact expression for the in-sample error, ERR_in,λ^L, appears difficult, mainly because we do not have an explicit expression for the estimate or prediction. The paper Efron et al. [2004] has some results, including simulations as well as exact calculations in some cases (such as when the x-variables are orthogonal), which leads to the suggestion to use p* + 1 as the effective degrees of freedom in estimating the in-sample errors. Thus the formula is the same as that for subset selection:

ERRhat_in,λ^L = err_λ^L + 2 σ̂_e^2 (p* + 1)/N,   (2.108)

where

err_λ^L = (1/N) ‖y − X β̂_λ^L‖^2,   (2.109)

the observed error for the lasso predictor. Looking at (2.107), you can imagine the estimate (2.108) is reasonable if you ignore the λ Sign(β̂_λ^L*) part when finding the covariance of the prediction errors.

2.7.2 Finding the best λ

We will use the lars routine in R, which was written by Brad Efron and Trevor Hastie. As for ridge, the data is normalized. The program calculates the estimates for all possible λ in one fell swoop. The next plot shows all the estimates of the coefficients for the diabetes data:

[Plot (lars): standardized lasso coefficient paths versus |beta|/max|beta|, with the curves labeled by variable number.]

Note that the horizontal axis is not λ, but is

Σ_j |β̂_{j,λ}^L| / Σ_j |β̂_{j,LS}|,   (2.110)

the ratio of the sum of magnitudes of the lasso estimates to that of the full least squares estimates. For λ = 0, this ratio is 1 (since the lasso estimate equals the least squares estimate); as λ increases to infinity, the ratio decreases to 0.

Starting at the right, where the ratio is 1, we have the least squares estimates of the coefficients. As λ increases, we move left. The coefficients generally shrink, until at some point one hits 0. That one stays at zero for a while, then goes negative. Continuing, the coefficients shrink, every so often one hits zero and stays there. Finally, at the far left, all the coefficients are 0.

The next plot shows p* + 1 versus the estimated prediction error in (2.108). There are actually thirteen subsets, three with p* + 1 = 10.

[Plot: estimated error (roughly 3000 to 6000) versus p*+1 for the thirteen lasso stages.]

It is not easy to see, so we zoom in:

[Plot: the same estimated errors versus p*+1, zoomed in to the range 2990 to 3005.]

The best has p* + 1 = 8, which means it has seven variables with nonzero coefficients. The corresponding estimated error is 2991.58. The best subset predictor had 6 variables.

The next table adds the lasso to table (2.86):

             Full LS    Subset     Lasso      Ridge
  AGE         −10.01         0         0      −7.70
  SEX        −239.82   −226.51   −197.75    −235.67
  BMI         519.85    529.88    522.27     520.97
  BP          324.38    327.22    297.15     321.31
  S1         −792.18   −757.93   −103.95    −438.60
  S2          476.74    538.58         0     196.36
  S3          101.04         0   −223.92     −53.74
  S4          177.06         0         0     136.33
  S5          751.27    804.19    514.75     615.37
  S6           67.63         0     54.77      70.41
  Σ|β̂_j|    3459.98   3184.30   1914.56    2596.46
  ‖β̂‖       1377.84   1396.56    853.86     1032.2
  edf             11         7         8      10.38
  err        2859.70   2876.68   2885.42    2864.57
  penalty     145.97     92.89    106.16     137.70
  ERRhat     3005.67   2969.57   2991.58    3002.28
                                                       (2.111)

Again, the estimates of the first four coefficients are similar. Note that among the blood measurements, lasso sets #'s 2 and 4 to zero, while subset sets #'s 3, 4, and 6 to zero. Also, lasso's coefficient for S3 is unusually large. Lasso is meant to keep down the sum of absolute values of the coefficients, which it does, and it also has the smallest norm. Performancewise, lasso is somewhere between ridge and subset. It still appears that the subset predictor is best in this example.

2.7.3 Using R

The lars program is in the lars package, so you must load it, or maybe install it and then load it. Using the normalized x and y from Section 2.6.3, fitting all the lasso predictors and plotting the coefficients is accomplished easily:

diab.lasso <- lars(x,y)
plot(diab.lasso)

At each stage, represented by a vertical line on the plot, there is a set of coefficient estimates and the residual sum of squares. This data has 13 stages. The matrix diab.lasso$beta is a 13 × 10 matrix of coefficients, each row corresponding to the estimates for a given stage. To figure out p*, you have to count the number of nonzero coefficients in that row:

pstar <- apply(diab.lasso$beta,1,function(z) sum(z!=0))

The σ̂_e^2 is the same as before, so

errhat <- diab.lasso$RSS/N+2*sigma2*(pstar+1)/N
errhat
       0        1        2        3        4        5        6        7
5943.155 5706.316 3886.784 3508.205 3156.248 3075.372 3054.280 2991.584
       8        9       10       11       12
2993.267 3004.624 2994.646 2994.167 3005.667

The smallest occurs at the eighth stage (note the numbering starts at 0). Plotting the p* + 1 versus the estimated errors:

plot(pstar+1,errhat,xlab="p*+1",ylab="Estimated error")
# or, zooming in:
plot(pstar+1,errhat,xlab="p*+1",ylab="Estimated error",ylim=c(2990,3005))

The estimates for the best predictor are then in

diab.lasso$beta[8,]

Chapter 3

Linear Predictors of Non-linear Functions

Chapter 2 assumed that the mean of the dependent variables was a linear function of the explanatory variables. In this chapter we will consider non-linear functions. We start with just one x-variable, and consider the model

Yi = f(xi)+ ei, i =1,...,N, (3.1)

2 where the xi’s are fixed, and the ei’s are independent with mean zero and variances σe . A linear model would have f(xi)= β0 + β1xi. Here, we are not constraining f to be linear, or even any parametric function. Basically, f can be any function as long as it is sufficiently “smooth.” Exactly what we mean by smooth will be detailed later. Some examples appear in Figure 3.1. It is obvious that these data sets do not show linear relationships between the x’s and y’s, nor is it particularly obvious what kind of non-linear relationships are exhibited. From a prediction point of view, the goal is to find an estimated function of f, f, so that new y’s can be predicted from new x’s by f(x). Related but not identical goals include b • Curve-fitting: fit a smooth curve to theb data in order to have a good summary of the data; find a curve so that the graph “looks nice”;

• Interpolation: Estimate y for values of x not among the observed, but in the same range as the observed;

• Extrapolation: Estimate y for values of x outside the range of the observed x’s, a somewhat dangerous activity.

This chapter deals with “nonparametric” functions f, which strictly speaking means that we are not assuming a particular form of the function based on a finite number of parameters. Examples of parametric nonlinear functions:

f(x) = α e^{βx}   and   f(x) = sin(α + βx + γx^2).   (3.2)


[Figure 3.1 shows three example data sets: motorcycle acceleration versus time, a phoneme spectrum versus frequency, and U.S. birthrates versus year.]

Figure 3.1: Examples of non-linear data

Such models can be fit with least squares much as the linear models, although the derivatives are not simple linear functions of the parameters, and Newton-Raphson or some other numerical method is needed. The approach we take to estimating f in the nonparametric model is to use some sort of basis expansion of the functions on R. That is, we have an infinite set of known functions, h_1(x), h_2(x), ..., and estimate f using a linear combination of a subset of the functions, e.g.,

f̂(x) = β̂_0 + β̂_1 h_1(x) + ··· + β̂_m h_m(x).   (3.3)

We are not assuming that f is a finite linear combination of the h_j's, hence will always have a biased estimator of f. Usually we do assume that f can be arbitrarily well approximated by such a linear combination, that is, there is a sequence β_0, β_1, β_2, ..., such that

f(x) = β_0 + lim_{m→∞} Σ_{j=1}^m β_j h_j(x)   (3.4)

uniformly, at least for x in a finite range.

An advantage to using estimates as in (3.3) is that the estimated function is a linear one, linear in the h_j's, though not in the x. But with the x_i's fixed, we are in the same estimation bailiwick as the linear models in Chapter 2, hence ideas such as subset selection, ridge, lasso, and estimating prediction errors carry over reasonably easily. In the next few sections, we consider possible sets of h_j's, including polynomials, sines and cosines, splines, and wavelets.
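Because the fit (3.3) is linear in the h_j's, it can be computed with ordinary least squares on a design matrix whose columns are the h_j(x_i)'s. A minimal sketch, using made-up data and a small sine/cosine basis purely for illustration:

# Fitting a basis expansion by ordinary least squares: build the columns
# h_j(x) and regress y on them. The data and basis here are made up.
set.seed(4)
xg <- seq(0, 1, length = 100)
yg <- sin(2*pi*xg) + 0.5*cos(6*pi*xg) + rnorm(100, sd = 0.3)
Hmat <- cbind(h1 = sin(2*pi*xg), h2 = cos(2*pi*xg),
              h3 = sin(6*pi*xg), h4 = cos(6*pi*xg))
fit <- lm(yg ~ Hmat)            # intercept plus the four basis functions
fhat <- fitted(fit)             # the estimated f at the observed x's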

3.1 Polynomials

The estimate of f is a polynomial in x, where the challenge is to figure out the degree. In raw form, we have

h_1(x) = x,   h_2(x) = x^2,   h_3(x) = x^3, ....   (3.5)

(The Weierstrass Approximation Theorem guarantees that (3.4) holds.) The m-th degree polynomial fit is then

f̂(x) = β̂_0 + β̂_1 x + β̂_2 x^2 + ··· + β̂_m x^m.   (3.6)

It is straightforward to find the estimates β̂_j using the techniques from the previous chapter, where here

X = ( 1  x_1  x_1^2  ···  x_1^m )
    ( 1  x_2  x_2^2  ···  x_2^m )
    ( ⋮   ⋮     ⋮           ⋮   )
    ( 1  x_N  x_N^2  ···  x_N^m ).   (3.7)

Technically, one could perform a regular subset regression procedure, but generally one considers only fits allowing the first m coefficients to be nonzero, and requiring the rest to

be zero. Thus the only fits considered are

f̂(x_i) = β̂_0                                             (constant)
f̂(x_i) = β̂_0 + β̂_1 x_i                                   (linear)
f̂(x_i) = β̂_0 + β̂_1 x_i + β̂_2 x_i^2                       (quadratic)
  ⋮
f̂(x_i) = β̂_0 + β̂_1 x_i + β̂_2 x_i^2 + ··· + β̂_m x_i^m     (m-th degree)
  ⋮                                                        (3.8)

We will use the birthrate data to illustrate polynomial fits. The x's are the years from 1917 to 2003, and the y's are the births per 10,000 women aged twenty-three in the U.S.[1] Figure 3.2 contains the fits of several polynomials, from a cubic to an 80th-degree polynomial. It looks like the m = 3 and m = 5 fits are poor, m = 20 to 40 are reasonable, and m = 80 is overfitting, i.e., the curve is too jagged. If you are mainly interested in a good summary, you would choose your favorite fit visually. For prediction, we proceed as before by trying to estimate the prediction error. As for subset regression in (2.59), the estimated in-sample prediction error for the m-th-degree polynomial fit is, since p* = m,

\widehat{ERR}^m_{in} = err_m + 2\hat\sigma_e^2 \frac{m+1}{N}.   (3.9)

The catch is that \hat\sigma_e^2 is the residual variance from the "full" model, where here the full model has m = ∞ (or at least m = N − 1). Such a model fits perfectly, so the residuals and residual degrees of freedom will all be zero. There are several ways around this problem:

• Specify an upper bound M for m. You want the residual degrees of freedom, N − M − 1, to be sufficiently large to estimate the variance reasonably well, but M large enough to fit the data. For the birthrate data, you might take M = 20 or 30 or 40 or 50. It is good to take an M larger than you think is best, because you can count on the subset procedure to pick a smaller m as best. For M = 50, the \hat\sigma_e^2 = 59.37.

• Find the value at which the residual variances level off. For each m, find the residual variance. When the m in the fit is larger than or equal to the true polynomial degree, then the residual variance will be an unbiased estimator of \sigma_e^2. Thus as a function of m, the

¹The data up through 1975 can be found in the Data and Story Library at http://lib.stat.cmu.edu/DASL/Datafiles/Birthrates.html. See Velleman, P. F. and Hoaglin, D. C. (1981). Applications, Basics, and Computing of Exploratory Data Analysis. Belmont, CA: Wadsworth, Inc. The original data is from P. K. Whelpton and A. A. Campbell, "Fertility Tables for Birth Charts of American Women," Vital Statistics Special Reports 51, no. 1 (Washington D.C.: Government Printing Office, 1960, years 1917-1957) and National Center for Health Statistics, Vital Statistics of the United States Vol. 1, Natality (Washington D.C.: Government Printing Office, yearly, 1958-1975). The data from 1976 to 2003 are actually rates for women aged 20-24, found in the National Vital Statistics Reports Volume 54, Number 2, September 8, 2005, Births: Final Data for 2003; http://www.cdc.gov/nchs/data/nvsr/nvsr54/nvsr54_02.pdf.

Figure 3.2: Polynomial fits to the Birthrate data (m = 3, 5, 10, 20, 40, 80)

residual variance should settle to the right estimate. Figure 3.3 shows the graph for the birthrate data. The top plot shows that the variance is fairly constant from about m = 15 to m = 70. The bottom plot zooms in, and we see that from about m = 20 to 70, the variance is bouncing around 60.

• Use the local variance. Assuming that Y_1 and Y_2 are independent with the same mean and same variance \sigma_e^2,

\frac{1}{2} E[(Y_1 − Y_2)^2] = \frac{1}{2} Var[Y_1 − Y_2] = \sigma_e^2.   (3.10)

If the f is not changing too quickly, then consecutive Y_i's will have approximately the same means, hence an estimate of the variance is

\hat\sigma_e^2 = \frac{1}{2}\,\frac{1}{N−1} \sum_{i=1}^{N−1} (y_i − y_{i+1})^2.   (3.11)

For the birthrate data, this estimate is 50.74.

Whichever approach we take, the estimate of the residual variance is around 50 to 60. We will use 60 as the estimate in what follows, but you are welcome to try some other values. Figure 3.4 has the estimated prediction errors. The minimum 72.92 occurs at m = 26, but we can see that the error at m = 16 is almost as low at 73.18, and is much simpler, so either of those models is reasonable (as is the m = 14 fit). Figure 3.5 has the fits. The m = 16 fit is smoother, and the m = 26 fit is closer to some of the points.

Unfortunately, polynomials are not very good for extrapolation. Using the two polynomial fits, we have the following extrapolations.

Year    m = 16      m = 26         Observed
1911    3793.81     −2554538.00
1912    1954.05     −841567.70
1913    993.80      −246340.00
1914    521.25      −61084.75
1915    305.53      −11613.17
1916    216.50      −1195.05                       (3.12)
1917    184.82      183.14         183.1
2003    102.72      102.55         102.6
2004    123.85      −374.62
2005    209.15      −4503.11
2006    446.58      −26035.92
2007    1001.89     −112625.80
2008    2168.26     −407197.50

The ends of the data are 1917 and 2003, for which both predictors are quite good. As we move away from those dates, the 16th-degree polynomial deteriorates after getting a few years away from the data, while the 26th-degree polynomial gets ridiculous right away. The actual (preliminary) birthrate for 2004 is 101.8. The two fits did not predict this value well.
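As a quick check of the difference-based variance estimate (3.11), here is a minimal R sketch (not from the text), assuming the birthrates data frame has been read in as in Section 3.1.2:

y <- birthrates[,2]
N <- length(y)
sum(diff(y)^2) / (2 * (N - 1))   # the estimate in (3.11); about 50.74 for these data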

Figure 3.3: Residual variance as a function of m + 1

Figure 3.4: Estimated prediction error as a function of m + 1

Figure 3.5: Estimated predictions (m = 16 and m = 26)

3.1.1 Leave-one-out cross-validation

Another method for estimating prediction error is cross-validation. It is more widely applicable than the method we have used so far, but can be computationally intensive. The problem with using the observed error for estimating the prediction error is that the observed values are used for creating the prediction, hence are not independent. One way to get around that problem is to set aside the first observation, say, then try to predict it using the other N − 1 observations. We are thus using this first observation as a "new" observation, which is indeed independent of its prediction. We repeat leaving out just the second observation, then just the third, etc. Notationally, let [−i] denote calculations leaving out the i-th observation, so that

\hat Y_i^{[−i]} is the prediction of Y_i based on the data without (x_i, Y_i).   (3.13)

The leave-one-out cross-validation estimate of the prediction error is then

\widehat{ERR}_{in,cv} = \frac{1}{N} \sum_{i=1}^{N} (y_i − \hat y_i^{[−i]})^2.   (3.14)

It has the advantage of not needing a preliminary estimate of the residual error. For least squares estimates, the cross-validation prediction errors can be calculated quite simply by dividing the regular residuals by a factor,

y_i − \hat y_i^{[−i]} = \frac{y_i − \hat y_i}{1 − h_{ii}},   (3.15)

where \hat y_i is from the fit to all the observations, and h_{ii} is the i-th diagonal of the matrix H = X(X'X)^{−1}X'. (See Section 3.1.3.) Note that for each m, there is a different \hat y and H.

To choose a polynomial fit using cross-validation, one must find the \widehat{ERR}^m_{in,cv} in (3.14) for each m. Figure 3.6 contains the results. It looks like m = 14 is best here. The estimate of the prediction error is 85.31.

Looking more closely at the standardized residuals, a funny detail appears. Figure 3.7 plots the standardized residuals for the m = 26 fit. Note that the second one is a huge negative outlier. It starts to appear for m = 20 and gets worse. Figure 3.8 recalculates the cross-validation error without that residual. Now m = 27 is the best, with an estimated prediction error of 58.61, compared to the estimate for m = 26 of 71.64 using (3.9). These two estimates are similar, suggesting m = 26 or 27. The fact that the cross-validation error is smaller may be due in part to having left out the outlier.
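For a single fixed m, the shortcut (3.15) can also be obtained from base R's lm machinery; a minimal sketch (not from the text, which computes the h_ii's by hand in Section 3.1.2), assuming x and y hold the time index and birthrates as defined there:

m <- 16
fit <- lm(y ~ poly(x, m))
loo <- residuals(fit) / (1 - hatvalues(fit))  # the y_i - yhat_i^[-i] of (3.15)
mean(loo^2)                                   # the leave-one-out estimate (3.14) for this m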

3.1.2 Using R

The X matrix in (3.7) is not the one to use for computations. For any high-degree polynomial, we will end up with huge numbers (e.g., 87^{16} ≈ 10^{31}) and small numbers in the matrix, which leads to numerical inaccuracies.

Figure 3.6: Cross-validation estimates of prediction error

Figure 3.7: Standardized residuals for m = 26 (with and without observation 2)

Figure 3.8: Cross-validation estimates of prediction error without observation #2

A better matrix uses orthogonal polynomials, or in fact orthonormal ones. Thus we want the columns of the X matrix, except for the 1_N, to have mean 0 and norm 1, but also to have them orthogonal to each other in such a way that the first m columns still yield the m-th degree polynomials. To illustrate, without going into much detail, we use the Gram-Schmidt algorithm for x = (1, 2, 3, 4, 5)' and m = 2. Start with the raw columns,

1_5 = (1, 1, 1, 1, 1)',   x_{[1]} = (1, 2, 3, 4, 5)',   and   x_{[2]} = (1, 4, 9, 16, 25)'.   (3.16)

1 1 1  1   2   4  1 = 1 , x = 3 , and x = 9 . (3.16) 5   [1]   [2]          1   4   16         1   5   25              Let the first one as is, but subtract the means (3 and 11) from each of the other two:

x_{[1]}^{(2)} = (−2, −1, 0, 1, 2)',   and   x_{[2]}^{(2)} = (−10, −7, −2, 5, 14)'.   (3.17)

Now leave the first two alone, and make the third orthogonal to the second by applying the main Gram-Schmidt step,

u → u − \frac{u'v}{v'v}\, v,   (3.18)

with v = x_{[1]}^{(2)} and u = x_{[2]}^{(2)}:

x_{[2]}^{[3]} = (−10, −7, −2, 5, 14)' − \frac{60}{10}\,(−2, −1, 0, 1, 2)' = (2, −1, −2, −1, 2)'.   (3.19)

To complete the picture, divide the last two x's by their respective norms, to get

1_5 = (1, 1, 1, 1, 1)',   x_{[1]}^{Norm} = \frac{1}{\sqrt{10}}(−2, −1, 0, 1, 2)',   and   x_{[2]}^{Norm} = \frac{1}{\sqrt{14}}(2, −1, −2, −1, 2)'.   (3.20)

You can check that indeed these three vectors are orthogonal, and the last two orthonormal. For a large N and m, you would continue, at step k orthogonalizing the current (k + 1)st, ..., (m + 1)st vectors to the current k-th vector. Once you have these vectors, then the

fitting is easy, because the X_m for the m-th degree polynomial (leaving out the 1_N) just uses the first m vectors, and X_m'X_m = I_m, so that the estimates of the β_j's are just X_m'y, and H_m = X_m X_m'. Using the saturated model, i.e., (N − 1)st degree, we can get all the coefficients at once,

\hat\beta = X_{N−1}'y,   (\hat\beta_0 = \bar y).   (3.21)

Then the coefficients for the m-th order fit are the first m elements of \hat\beta. Also, the residual sum of squares equals the sum of squares of the left-out coefficients:

RSS_m = \sum_{j=m+1}^{N−1} \hat\beta_j^2,   (3.22)

from which it is easy to find the residual variances, err_m's, and estimated prediction errors.

The following commands in R will read in the data and find the estimated coefficients and predicted y's.

source("http://www.stat.uiuc.edu/~jimarden/birthrates.txt")

N <- 87
x <- 1:87
y <- birthrates[,2]

xx <- poly(x,86)

betah <- t(xx)%*%y

yhats <- sweep(xx,2,betah,"*")
yhats <- t(apply(yhats,1,cumsum)) + mean(y)
# yhats[,m] has the fitted y's for the m-th degree polynomial.

Plot the 16th-degree fit:

m <- 16
plot(birthrates,main=paste("m =",m))
lines(x+1916,yhats[,m])

You can plot all the fits sequentially using

for(m in 1:86) {
  plot(birthrates,main=paste("m =",m))
  lines(x+1916,yhats[,m])
  readline()
}

where you hit the enter key to see the next one. Next, use (3.22) to find the residual sums of squares, and plot the sequence of residual variances.

rss <- cumsum(betah[86:1]^2)[86:1]
sigma2hat <- rss/(86:1)
par(mfrow=c(2,1))
plot(1:86,sigma2hat,xlab="m+1",ylab="Residual variance",type='l')
plot(1:86,sigma2hat,xlab="m+1",ylab="Residual variance",type='l',ylim=c(0,70))

Using \hat\sigma_e^2 = 60, plot the \widehat{ERR}^m_{in}'s:

sigma2hat <- 60
errhat <- rss/N + 2*sigma2hat*(1:86)/N
plot(1:86,errhat,xlab="m+1",ylab="Estimated error")
plot(1:86,errhat,xlab="m+1",ylab="Estimated error",ylim=c(60,100))
abline(h=min(errhat))

The prediction of y for x's outside the range of the data is somewhat difficult when using orthogonal polynomials, since one does not know what they are for new x's. Fortunately, the function predict can help. To find the X matrices for the values −5, −4, ..., 0, 1, 87, 88, ..., 92, use

z <- c((-5):1,87:92)
x16 <- predict(poly(x,16),z)
p16 <- x16%*%betah[1:16]+mean(y)

x26 <- predict(poly(x,26),z)
p26 <- x26%*%betah[1:26]+mean(y)

The p16 and p26 then contain the predictions for the fits of degree 16 and 26, as in (3.12). For the cross-validation estimates, we first obtain the h_ii's, N of them for each m. The following creates an N × 85 matrix hii, where the m-th column has the h_ii's for the m-th degree fit.

hii <- NULL
for(m in 1:85) {
  h <- xx[,1:m]%*%t(xx[,1:m])
  hii <- cbind(hii,diag(h))
}

Then find the regular residuals, the (3.15)'s (called sresids), and the cross-validation error estimates (one for each m):

resids <- -sweep(yhats[,-86],1,y,"-")
sresids <- resids/(1-hii)
errcv <- apply(sresids^2,2,mean)
plot(1:85,errcv,xlab="m",ylab="Estimated error")
plot(6:20,errcv[6:20],xlab="m",ylab="Estimated error")
abline(h=min(errcv))

The errors look too big for m's over 25 or so. For example,

plot(x+1916,sresids[,26],xlab="Year",ylab="Standardized residuals",main="m = 26")
plot((x+1916)[-2],sresids[-2,26],xlab="Year",ylab="Standardized residuals",
     main="m = 26, w/o Observation 2")

Recalculating the error estimate leaving out the second residual yields

errcv2 <- apply(sresids[-2,]^2,2,mean)
plot(1:85,errcv2,xlab="m",ylab="Estimated error")
plot(8:40,errcv2[8:40],xlab="m",ylab="Estimated error")
abline(h=min(errcv2))

3.1.3 The cross-validation estimate

We will consider the leave-one-out prediction error for the first observation (y_1, x_1). The others work similarly. Indicate by "[−1]" the quantities without the first observation, so that

y^{[−1]} = \begin{pmatrix} y_2 \\ y_3 \\ \vdots \\ y_N \end{pmatrix},   X^{[−1]} = \begin{pmatrix} x_2' \\ x_3' \\ \vdots \\ x_N' \end{pmatrix},   \hat\beta^{[−1]} = (X^{[−1]'}X^{[−1]})^{−1}X^{[−1]'}y^{[−1]},   and   \hat y_1^{[−1]} = x_1'\hat\beta^{[−1]}.   (3.23)

We know that \hat\beta^{[−1]} is the b that minimizes

b y[−1] X[−1]b 2, (3.24) k − k but it is also the b that minimizes

(\hat y_1^{[−1]} − x_1'b)^2 + \|y^{[−1]} − X^{[−1]}b\|^2,   (3.25)

because it minimizes the first term (since it is zero when b = \hat\beta^{[−1]}) as well as the second term. Adding the terms, we have that \hat\beta^{[−1]} is the b that minimizes

b [−1] y1  y2  2 b . Xb . (3.26) k  .  − k      yN    56 CHAPTER 3. LINEAR PREDICTORS OF NON-LINEAR FUNCTIONS

Thus it must be that

\hat\beta^{[−1]} = (X'X)^{−1}X' \begin{pmatrix} \hat y_1^{[−1]} \\ y_2 \\ \vdots \\ y_N \end{pmatrix},   hence   X\hat\beta^{[−1]} = H \begin{pmatrix} \hat y_1^{[−1]} \\ y_2 \\ \vdots \\ y_N \end{pmatrix}.   (3.27)

The first element of X\hat\beta^{[−1]} is x_1'\hat\beta^{[−1]} = \hat y_1^{[−1]}, so using the first row of H, we have

\hat y_1^{[−1]} = h_{11}\hat y_1^{[−1]} + h_{12}y_2 + \cdots + h_{1N}y_N.   (3.28)

Solving,

\hat y_1^{[−1]} = \frac{h_{12}y_2 + \cdots + h_{1N}y_N}{1 − h_{11}}.   (3.29)

The cross-validation error estimate for the first observation is then

y_1 − \hat y_1^{[−1]} = y_1 − \frac{h_{12}y_2 + \cdots + h_{1N}y_N}{1 − h_{11}} = \frac{y_1 − (h_{11}y_1 + h_{12}y_2 + \cdots + h_{1N}y_N)}{1 − h_{11}} = \frac{y_1 − \hat y_1}{1 − h_{11}},   (3.30)

where \hat y_1 is the regular fit using all the data, which is (3.15).

The equation (3.27) is an example of the missing information principle for imputing missing data. Supposing y_1 is missing, we find a value for y_1 such that the value and its fit are the same. In this case, that value is \hat y_1^{[−1]}.
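A quick numerical check of the identity (3.30) in R (a sketch, not from the text; the toy data and degree are arbitrary):

set.seed(1)
xt <- 1:20
yt <- rnorm(20)                               # toy data, so as not to clobber the birthrate x and y
X1 <- cbind(1, poly(xt, 3))                   # a small design matrix
H1 <- X1 %*% solve(crossprod(X1)) %*% t(X1)   # the hat matrix
b1 <- lm.fit(X1[-1, ], yt[-1])$coefficients   # fit leaving out observation 1
yt[1] - sum(X1[1, ] * b1)                     # direct leave-one-out residual
(yt[1] - (H1 %*% yt)[1]) / (1 - H1[1, 1])     # the shortcut (3.30): the same number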

b 3.2 Sines and cosines

Data collected over time often exhibit cyclical behavior, such as the outside temperature over the course of a year. To such data one may consider fitting sine waves. Figure 3.9 shows sine waves of various frequencies covering 133 time points. The frequency is the number of complete cycles of the sine curve, so that this figure shows frequencies of 1 (at the top), 2, 3, 4, and 5. More than one frequency can be fit. In the temperature example, if one has hourly temperature readings over the course of a year, it would be reasonable to expect a large sine wave of frequency 1, representing the temperature cycle over the course of the year, and a smaller sine wave of frequency 365, representing the temperature cycle over the course of a day. In this section we will focus on the motorcycle acceleration data², exhibited in the top panel of Figure 3.1. It can be found in the MASS package for R. Here is the description:

²See Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric curve fitting. Journal of the Royal Statistical Society Series B 47, 1-52.

Figure 3.9: Some sine waves

Description:

A data frame giving a series of measurements of head acceleration in a simulated motorcycle accident, used to test crash helmets.

Format:

’times’ in milliseconds after impact

'accel' in g

The x-variable, time, is not exactly equally spaced, but we will use equally spaced time points as an approximation: 1, 2, ..., N, where N = 133 time points. We will fit a number of sine waves; deciding on which frequencies to choose is the challenge. A sine wave not only has a frequency, but also an amplitude α (the maximum height) and a phase φ. That is, for frequency k, the values of the sine curve at the data points are

α \sin\!\left(\frac{2\pi i}{N}k + φ\right),   i = 1, 2, ..., N.   (3.31)

The equation as written is not linear in the parameters α and φ. But we can rewrite it so that it is linear:

α \sin\!\left(\frac{2\pi i}{N}k + φ\right) = α \sin\!\left(\frac{2\pi i}{N}k\right)\cos(φ) + α \cos\!\left(\frac{2\pi i}{N}k\right)\sin(φ) = β_{k1}\sin\!\left(\frac{2\pi i}{N}k\right) + β_{k2}\cos\!\left(\frac{2\pi i}{N}k\right),   (3.32)

where the new parameters are the inverse polar coordinates of the old ones,

βk1 = α cos(φ) and βk2 = α sin(φ). (3.33)
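A quick numerical illustration of (3.32)-(3.33) in R (a sketch, not from the text; the values of α, φ, and k are arbitrary):

N <- 133; i <- 1:N; k <- 3                      # toy frequency
alpha <- 2; phi <- 0.7                          # toy amplitude and phase
b1 <- alpha * cos(phi); b2 <- alpha * sin(phi)  # (3.33)
max(abs(alpha * sin(2*pi*i*k/N + phi) -
        (b1 * sin(2*pi*i*k/N) + b2 * cos(2*pi*i*k/N))))   # essentially zero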

Now (3.32) is linear in (β_{k1}, β_{k2}). For a particular fit, let \mathcal{K} be the set of frequencies used in the fit. Then we fit the data via

\hat y_i = \hat\beta_0 + \sum_{k\in\mathcal{K}} \left[ \hat\beta_{k1}\sin\!\left(\frac{2\pi i}{N}k\right) + \hat\beta_{k2}\cos\!\left(\frac{2\pi i}{N}k\right) \right].   (3.34)

Note that for each frequency k, we either have both sine and cosine in the fit, or both out. Only integer frequencies k ≤ (N − 1)/2 are going to be used. When N is odd, using all those frequencies will fit the data exactly. Suppose there are just two frequencies in the fit, k and l. Then the matrix form of the equation (3.34) is

\hat y = \begin{pmatrix} 1 & \sin(\frac{2\pi 1}{N}k) & \cos(\frac{2\pi 1}{N}k) & \sin(\frac{2\pi 1}{N}l) & \cos(\frac{2\pi 1}{N}l) \\ 1 & \sin(\frac{2\pi 2}{N}k) & \cos(\frac{2\pi 2}{N}k) & \sin(\frac{2\pi 2}{N}l) & \cos(\frac{2\pi 2}{N}l) \\ \vdots & & & & \vdots \\ 1 & \sin(\frac{2\pi N}{N}k) & \cos(\frac{2\pi N}{N}k) & \sin(\frac{2\pi N}{N}l) & \cos(\frac{2\pi N}{N}l) \end{pmatrix} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_{k1} \\ \hat\beta_{k2} \\ \hat\beta_{l1} \\ \hat\beta_{l2} \end{pmatrix}.   (3.35)

As long as the frequencies in the model are between 1 and (N − 1)/2, the columns in the X matrix are orthogonal. In addition, each column (except the 1_N) has a squared norm of N/2. We divide the sines and cosines by \sqrt{N/2}, so that the X matrix we use will be

X = (1_N   X^*),   X^* = \frac{1}{\sqrt{N/2}} \begin{pmatrix} \sin(\frac{2\pi 1}{N}k) & \cos(\frac{2\pi 1}{N}k) & \sin(\frac{2\pi 1}{N}l) & \cos(\frac{2\pi 1}{N}l) \\ \sin(\frac{2\pi 2}{N}k) & \cos(\frac{2\pi 2}{N}k) & \sin(\frac{2\pi 2}{N}l) & \cos(\frac{2\pi 2}{N}l) \\ \vdots & & & \vdots \\ \sin(\frac{2\pi N}{N}k) & \cos(\frac{2\pi N}{N}k) & \sin(\frac{2\pi N}{N}l) & \cos(\frac{2\pi N}{N}l) \end{pmatrix}.   (3.36)

In general, there will be one sine and one cosine vector for each frequency in \mathcal{K}. Let K = #\mathcal{K}, so that X has 2K + 1 columns.

Choosing the set of frequencies \mathcal{K} to use in the fit is the same as the subset selection from Section 2.5, with the caveat that the columns come in pairs. Because of the orthonormality of the vectors, the estimates of the parameters are the same no matter which frequencies are in the fit. For the full model, with all ⌊(N − 1)/2⌋ frequencies³, the estimates of the coefficients are

\hat\beta_0 = \bar y,   \hat\beta^* = X^{*'}y,   (3.37)

since X^{*'}X^* = I. These \hat\beta^*_{kj}'s are the coefficients in the Fourier transform of the y's. Figure 3.10 has the fits for several sets of frequencies. Those with 3 and 10 frequencies look the most reasonable.

If N is odd, the residual sum of squares for the fit using the frequencies in \mathcal{K} is the sum of squares of the coefficients for the frequencies that are left out:

N odd:   RSS_\mathcal{K} = \sum_{k\notin\mathcal{K}} SS_k,   where   SS_k = \hat\beta_{k1}^2 + \hat\beta_{k2}^2.   (3.38)

If N is even, then there is one degree of freedom for the residuals of the full model. The space for the residuals is spanned by the vector (+1, −1, +1, −1, ..., +1, −1)', hence the residual sum of squares is

RSS_{Full} = \frac{(y_1 − y_2 + y_3 − y_4 ± \cdots + y_{N−1} − y_N)^2}{N}.   (3.39)

Then

N even:   RSS_\mathcal{K} = \sum_{k\notin\mathcal{K}} SS_k + RSS_{Full}.   (3.40)

We can now proceed as for subset selection, where for each subset \mathcal{K} of frequencies we estimate the prediction error using either the direct estimate or some cross-validation scheme. We need not search over all possible subsets of the frequencies, because by orthogonality we automatically know that the best fit with K frequencies will use the K frequencies with the best SS_k's. For the motorcycle data, N = 133, so there are 66 frequencies to

³⌊z⌋ is the largest integer less than or equal to z.

Figure 3.10: Some fits (k = 1, 3, 10, 50)

consider. Ranking the frequencies based on their corresponding sums of squares, from largest to smallest, produces the following table:

Rank   Frequency k   SS_k
1      1             172069.40
2      2             67337.03
3      3             7481.31
4      43            4159.51          (3.41)
5      57            2955.86
...
65     52            8.77
66     31            4.39

3.2.1 Estimating σ_e^2

In order to estimate the \widehat{ERR}_{in}, we need an estimate of σ_e^2. The covariance matrix of the \hat\beta_{kj}'s is σ_e^2 I. If we further assume the e_i's are normal, then these coefficient estimates are also normal and independent, and all with variance σ_e^2. Thus if a β_{kj} = 0, \hat\beta_{kj}^2 is σ_e^2 χ_1^2, and

(β_{k1}, β_{k2}) = (0, 0)   ⇒   SS_k = \hat\beta_{k1}^2 + \hat\beta_{k2}^2 ∼ σ_e^2 χ_2^2.   (3.42)

To estimate σ_e^2, we want to use the SS_k's for the frequencies whose coefficients are 0. It is natural to use the smallest ones, but which ones? QQ-plots are helpful in this task. A QQ-plot plots the quantiles of one distribution versus the quantiles of another. If the plot is close to the line y = x, then the two distributions are deemed to be similar. In our case we have a set of SS_k's, and wish to compare them to the χ_2^2 distribution. Let SS_(k) be the k-th smallest of the SS_k's. (Which is the opposite order of what is in table (3.41).) Suppose we wish to take the smallest m of these, where m will be presumably close to 66. Then among the sample

SS_(1), ..., SS_(m),   (3.43)

the (i/m)-th quantile is just SS_(i). We match this quantile with the (i/m)-th quantile of the χ_2^2 distribution, although to prevent (i/m) from being 1, and the quantile 0, we instead look at (i − 3/8)/(m + 1/4) (if m ≤ 10) or (i − 1/2)/m. For a given distribution function F, this quantile η_i satisfies

F(η_i) = \frac{i − 1/2}{m}.   (3.44)

The F(z) = 1 − e^{−z/2} in the χ_2^2 case.

Figure 3.11 shows the QQ-plots for m = 66, 65, 64, and 63, where the η_i's are on the horizontal axis, and the SS_(i)'s are on the vertical axis. We can see that the first two plots are not linear; clearly the largest two SS_k's should not be used for estimating σ_e^2. The other two plots look reasonable, but we will take the fourth as the best. That is, we can consider the SS_k's, leaving out the three largest, as a

Figure 3.11: QQ-plots (without the top 0, 1, 2, and 3 largest SS_k's; χ_2^2 quantiles on the horizontal axis)

sample from a σ_e^2 χ_2^2. The slope of the line should be approximately σ_e^2. We take the slope (fitting the least-squares line) to be our estimate of σ_e^2, which in this case turns out to be \hat\sigma_e^2 = 473.8.

Letting K be the number of frequencies used for a given fit, the prediction error can be estimated by

\widehat{ERR}_{in,K} = err_K + 2\hat\sigma_e^2 \frac{2K + 1}{N},   (3.45)

where err_K = RSS_\mathcal{K}/N is found in (3.38) with \mathcal{K} containing the frequencies with the K largest sums of squares. The next table has these estimates for the first twenty fits:

K    edf   \widehat{ERR}_{in,K}
0    1     2324.59
1    3     1045.08
2    5     553.04
3    7     511.04
4    9     494.01
5    11    486.04
6    13    478.14
7    15    471.32
8    17    465.51
9    19    461.16
10   21    457.25                (3.46)
11   23    454.19
12   25    451.41
13   27    449.51
14   29    449.12  ***
15   31    449.15
16   33    451.30
17   35    454.82
18   37    458.40
19   39    463.09

The fit using 14 frequencies has the lowest estimated error, although 12 and 13 are not much different. See Figure 3.12. It is very wiggly.

3.2.2 Cross-validation

Next we try cross-validation. The leave-one-out cross-validation estimate of error for y_i for a given fit is, from (3.15),

y_i − \hat y_i^{[−i]} = \frac{y_i − \hat y_i}{1 − h_{ii}}.   (3.47)

The H matrix is

H = (1_N   X^*) \begin{pmatrix} N & 0' \\ 0 & I \end{pmatrix}^{−1} (1_N   X^*)' = \frac{1}{N} 1_N 1_N' + X^* X^{*'},   (3.48)

Figure 3.12: The fit with K = 14 frequencies

where X^* has the columns for whatever frequencies are being entertained. The diagonals are thus

h_{ii} = \frac{1}{N} + \|x_i^*\|^2,   (3.49)

where x_i^* has the sines and cosines for observation i, divided by \sqrt{N/2} (as in (3.36)). Then

\|x_i^*\|^2 = \frac{1}{N/2} \sum_{k\in\mathcal{K}} \left[ \sin\!\left(\frac{2\pi i}{N}k\right)^2 + \cos\!\left(\frac{2\pi i}{N}k\right)^2 \right] = \frac{2}{N} K,   (3.50)

because sin² + cos² = 1. Hence the h_{ii}'s are the same for each i,

h_{ii} = \frac{2K + 1}{N}.   (3.51)

That makes it easy to find the leave-one-out estimate of the prediction error for the model with K frequencies:

\widehat{ERR}_{in,K,cv} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i − \hat y_i}{1 − h_{ii}} \right)^2 = \frac{1}{N}\,\frac{RSS_K}{(1 − (2K + 1)/N)^2} = \frac{N}{(N − 2K − 1)^2}\, RSS_K.   (3.52)

Figure 3.13 has the plot of K versus this prediction error estimate. For some reason, the best fit using this criterion has K = 64, which is ridiculous. But you can see that the error estimates level off somewhere between 10 and 20, and even 3 is much better than 0, 1, or 2. It is only beyond 60 that there is a distinct fall-off. I believe the reason for this phenomenon is that even though it seems like we search over 66 fits, we are implicitly searching over all 2^66 ≈ 10^20 fits, and leaving just one out at a time does not fairly address so many fits. (?)

I tried again using a leave-thirteen-out cross-validation, thirteen being about ten percent of N. This I did directly, randomly choosing thirteen observations to leave out, finding the estimated coefficients using the remaining 120 observations, then finding the prediction errors of the thirteen. I repeated the process 1000 times for each K, resulting in Figure 3.14. Now K = 4 is best, with K = 3 very close.

The table (3.53) has the first few frequencies, plus the standard error of the estimate. It is the standard error derived from repeating the leave-thirteen-out 1000 times. What this error estimate is estimating is the estimate we would have after trying all possible \binom{133}{13} subsamples. Thus the standard error estimates how far from this actual estimate we are. In any case, you can see that the standard error is large enough that one has no reason to suspect that K = 4 is better than K = 3, so we will stick with that one, resulting in Figure 3.15.

Figure 3.13: Leave-one-out error estimates

Figure 3.14: Leave-thirteen-out error estimates

K   \widehat{ERR}_{in,K,cv}   SE
1   1750.17                   17.02
2   1258.75                   14.59
3   1236.19                   14.83          (3.53)
4   1233.30                   13.38
5   1233.89                   13.15
6   1276.21                   13.35

To summarize, various methods chose 3, around 14, and around 64 as the K yielding the best prediction. Visually, K = 3 seems fine.

3.2.3 Using R

The Motorcycle Acceleration Data is in the MASS package, which is used in the book Modern Applied Statistics with S, by Venables and Ripley. It is a very good book for learning S and R, and modern applied statistics. You need to load the package. The data set is in mcycle. First, create the big X matrix as in (3.36), using all the frequencies:

N <- 133
x <- 1:N
y <- mcycle[,2]
theta <- x*2*pi/N
xx <- NULL
for(k in 1:66) xx <- cbind(xx,cos(k*theta),sin(k*theta))
xx <- xx/sqrt(N/2)
xx <- cbind(1,xx)

The \hat\beta_{kj}'s (not including \hat\beta_0, which is \bar y) are in

bhats <- t(xx[,-1])%*%y

To find the corresponding sums of squares, SS_k of (3.38), we need to sum the squares of consecutive pairs. The following first puts the coefficients into a K × 2 matrix, then gets the sum of squares of each row:

b2 <- matrix(bhats,ncol=2,byrow=T)
b2 <- apply(b2^2,1,sum)

For the QQ-plots, we plot the ordered SS_k's versus the quantiles of a χ_2^2. To get the m points (i − 1/2)/m as in (3.44), use ppoints(m).

ss <- sort(b2)  # The ordered SS_k's
m <- 66
plot(qchisq(ppoints(m),2),ss[1:m])
m <- 63
plot(qchisq(ppoints(m),2),ss[1:m])
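To turn the last QQ-plot into the estimate \hat\sigma_e^2 ≈ 473.8 used above, one can fit the least-squares line and take its slope. A sketch (not from the text; whether the text's line includes an intercept is not specified, so this is one reasonable choice):

m <- 63
qq <- qchisq(ppoints(m),2)
coef(lm(ss[1:m] ~ qq))[2]  # the slope, an estimate of sigma_e^2 (the text reports 473.8)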

Figure 3.15: The fit for K = 3

To find \widehat{ERR}_{in,K} in (3.45), we try K = 0, ..., 65. The rss sums up the smallest m sums of squares for each m, then reverses order so that the components are RSS_0, RSS_1, ..., RSS_65. If you wish to include K = 66, it fits exactly, so that the prediction error is just 2\hat\sigma_e^2 since edf = N.

edf <- 1+2*(0:65)  # The vector of (2K+1)'s
rss <- cumsum(ss)[66:1]
errhat <- rss/N+2*473.8*edf/N
plot(0:65,errhat)

For leave-one-out cross-validation, we just multiply the residual sums of squares by the appropriate factor from (3.52):

errcv <- N*rss/(N-edf)^2
plot(0:65,errcv)

Finally, for leave-13-out cv,

ii <- order(b2)[66:1]  # Frequencies in order
cv <- NULL
cvv <- NULL
for(k in 1:40) {
  jj <- c(1,2*ii[1:k],2*ii[1:k]+1)  # Indices of beta's for chosen frequencies
  s <- NULL
  for(i in 1:1000) {
    b <- sample(133,13)  # Picks 13 from 1 to 133
    xb <- xx[-b,jj]
    rb <- y[b] - xx[b,jj]%*%solve(t(xb)%*%xb,t(xb))%*%y[-b]
    s <- c(s,mean(rb^2))
  }
  cv <- c(cv,mean(s))
  cvv <- c(cvv,var(s))
}

The cv has the cross-validation estimates, and cvv has the variances, so that the standard errors are sqrt(cvv/1000). Because of the randomness, each time you run this routine you obtain a different answer. Other times I have run it, the best K was much higher than 3, like K = 20.

3.3 Local fitting: Regression splines

The fits used so far have been "global" in the sense that the basis functions h_k(x_i) cover the entire range of the x's. Thus, for example, in the birthrates example, the rates in the 1920's

affect the fits in the 1990's, and vice versa. Such behavior is fine if the trend stays basically the same throughout, but as one can see in the plots (Figure 3.1), different regions of the x values could use different fits. Thus "local" fits have been developed, wherein the fit at any x depends primarily on the nearby observations.

The simplest such fit is the regressogram (named by John Tukey), which divides the x-axis into a number of regions, then draws a horizontal line above each region at the average of the corresponding y's. Figure 3.16 shows the regressogram for the birthrate data, where there are (usually) five observations in each region. The plot is very jagged, but does follow the data well, and is extremely simple to implement. One can use methods from this chapter to decide on how many regions, and which ones, to use.

The regressogram fits the simplest polynomial to each region, that is, the constant. Natural extensions would be to fit higher-degree polynomials to each region (or sines and cosines). Figure 3.17 fits a linear regression to each of four regions (a lineogram⁴). The lines follow the data fairly well, except at the right-hand area of the third region.

One drawback to the lineogram, or higher-order analogs, is that the fits in the separate regions do not meet at the boundaries. One solution is to use a moving window, so that for any x, the x_i's within a certain distance of x are used for the fit. That route leads to kernel fits, which are very nice. We will look more carefully at splines, which fit polynomials to the regions but require them to be connected smoothly at the boundaries. It is as if one ties knots to connect the ends of the splines, so the x-values demarking the boundaries are called knots.

Figure 3.18 shows the linear spline, where the knots are at 1937.5, 1959.5, and 1981.5. The plot leaves out the actual connections, but one can imagine that the appropriate lines will intersect at the boundaries of the regions. The plot has sharp points. By fitting higher-order polynomials, one can require more smoothness at the knots. The typical requirement for degree-m polynomials is having m − 1 continuous derivatives:

Name of spline   Degree   Smoothness
Constant         0        None
Linear           1        Continuous
Quadratic        2        Continuous first derivatives          (3.54)
Cubic            3        Continuous second derivatives
...
m-ic             m        Continuous (m − 1)th derivatives

Figure 3.19 shows the cubic spline fit for the data. It is very smooth, so that one would not be able to guess by eye where the knots are. Practitioners generally like cubic splines. They provide a balance between smoothness and simplicity.

For these data, the fit here is not great, as it is too low around 1960, and varies too much after 1980. The fit can be improved by increasing either the degree or the number of knots, or both. Because we like the cubics, we will be satisfied with cubic splines, but

⁴Not a real statistical term. It is a real word, being some kind of motion X-ray.

Figure 3.16: Regressogram for birthrate data

Figure 3.17: Lineogram for birthrate data

Figure 3.18: Linear spline for birthrate data

Figure 3.19: Cubic spline for birthrate data

consider increasing the number of knots. The question then is to decide on how many knots. For simplicity, we will use equally-spaced knots, although there is no reason to avoid other spacings. E.g., for the birthrates, knots at least at 1938, 1960, and 1976 would be reasonable.

The effective degrees of freedom is the number of free parameters to estimate. With K knots, there are K + 1 regions. Let k_1 < k_2 < \cdots < k_K be the knots. Within the l-th region, the cubic fit is

a_l + b_l x + c_l x^2 + d_l x^3.   (3.55)

At knot l, the (l − 1)st and l-th regions' cubic polynomials have to match the value (so that the curve is continuous), the first derivative, and the second derivative:

a_{l−1} + b_{l−1}k_l + c_{l−1}k_l^2 + d_{l−1}k_l^3 = a_l + b_l k_l + c_l k_l^2 + d_l k_l^3,
b_{l−1} + 2c_{l−1}k_l + 3d_{l−1}k_l^2 = b_l + 2c_l k_l + 3d_l k_l^2,   and
2c_{l−1} + 6d_{l−1}k_l = 2c_l + 6d_l k_l.   (3.56)

Thus each knot contributes three linear constraints, meaning the effective degrees of freedom are edf(K knots, degree = 3) = 4(K + 1) − 3K = K + 4. For general degree m of the polynomials,

edf(K knots, degree = m) = (m + 1)(K + 1) − mK = K + m + 1.   (3.57)

An intuitive basis for the cubic splines is given by the following, where there are knots at k_1, k_2, ..., k_K:

Each of the functions

1, x, x^2, x^3, (x − k_1)^3_+, (x − k_2)^3_+, ..., (x − k_K)^3_+,   (3.60)

where z^3_+ = 0 if z < 0 and z^3_+ = z^3 if z ≥ 0, is indeed within the set of cubic splines, and the number of such functions is K + 4, the dimension of the space. As long as there are enough

distinct x's for these functions to be linearly independent, they must then constitute a basis for the cubic splines. Translating to the X matrix for given data, for K = 2 we have

X = \begin{pmatrix}
1 & x_1 & x_1^2 & x_1^3 & 0 & 0 \\
1 & x_2 & x_2^2 & x_2^3 & 0 & 0 \\
\vdots & & & & & \vdots \\
1 & x_a & x_a^2 & x_a^3 & 0 & 0 \\
1 & x_{a+1} & x_{a+1}^2 & x_{a+1}^3 & (x_{a+1} − k_1)^3 & 0 \\
\vdots & & & & & \vdots \\
1 & x_{a+b} & x_{a+b}^2 & x_{a+b}^3 & (x_{a+b} − k_1)^3 & 0 \\
1 & x_{a+b+1} & x_{a+b+1}^2 & x_{a+b+1}^3 & (x_{a+b+1} − k_1)^3 & (x_{a+b+1} − k_2)^3 \\
\vdots & & & & & \vdots \\
1 & x_N & x_N^2 & x_N^3 & (x_N − k_1)^3 & (x_N − k_2)^3
\end{pmatrix},   (3.61)

where there are a observations in the first region and b in the second. In practice, the so-called B-spline basis is used, which is an equivalent basis to the one in (3.61), but has some computational advantages. Whichever basis one uses, the fit for a given set of knots is the same.

Using the usual least-squares fit, we have the estimated prediction error for the cubic spline using K knots to be

\widehat{ERR}_{in,K} = err_K + 2\hat\sigma_e^2 \frac{K + 4}{N}.   (3.62)

Or, it is easy to find the leave-one-out cross-validation estimate. The results of the two methods are in Figure 3.20. For the Cp-type estimate, K = 36 knots minimizes the prediction error estimate at 66.27, but we can see there are many other smaller K's with similar error estimates. The K = 9 has estimate 66.66, so it is reasonable to take K = 9. Using cross-validation the best is K = 9, although many values up to 20 are similar. Figure 3.21 has the two fits K = 9 and K = 36. Visually, the smaller K looks best, though the K = 36 fit works better after 1980.

As we saw in (3.12), high-degree polynomials are not good for extrapolation. The cubic splines should be better than high-degree polynomials like 16 or 26, but can still have problems. Natural splines are cubic splines that try to alleviate some of the concerns with extrapolation by also requiring that outside the two extreme knots, the curve is linear. Thus we have four more constraints, two at each end (the quadratic and cubic coefficients being 0). The effective degrees of freedom for a natural spline with K knots is simply K.

The next table looks at some predictions beyond 2003 (the last date in the data set) using the same effective degrees of freedom of 13 for the polynomial, cubic spline, and natural spline

Figure 3.20: Estimates of prediction error for cubic splines (Cp and leave-one-out CV)

Figure 3.21: The cubic spline fits for 9 and 36 knots

fits:

Year   Polynomial      Spline    Natural spline   Observed
       degree = 12     K = 9     K = 13
2003   104.03          103.43    103.79           102.6
2004   113.57          102.56    102.95           (101.8)
2005   139.56          101.75    102.11                        (3.63)
2006   195.07          101.01    101.27
2007   299.80          100.33    100.43
2008   482.53          99.72     99.59
2009   784.04          99.17     98.75
2010   1260.96         98.69     97.91

From the plot in Figure 3.21, we see that the birthrates from about 1976 on are not changing much, declining a bit at the end. Thus the predictions for 2004 and on should be somewhat smaller than that for around 2003. The polynomial fit does a very poor job, as usual, but both the regular cubic spline and the natural spline look quite reasonable, though the future is hard to predict.

3.3.1 Using R

Once you obtain the X matrix for a given fit, you use whatever linear model methods you wish. First, load the splines package. Letting x be your x, and y be the y, to obtain the cubic B-spline basis for K knots, use

xx <- bs(x,df=K+3)

The effective degrees of freedom are K + 4, but bs does not return the 1_N vector, and it calls the number of columns it returns df. To fit the model, just use

lm(y~xx)

For natural splines, use ns, where for K knots, you use df = K+1:

xx <- ns(x,df=K+1)
lm(y~xx)
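To reproduce something like the Cp curve in Figure 3.20, one can loop over K. A minimal sketch (not from the text), assuming the birthrates data are loaded and using \hat\sigma_e^2 = 60 from Section 3.1:

library(splines)
x <- birthrates[,1]
y <- birthrates[,2]
N <- length(y)
errhat <- sapply(1:40, function(K) {
  fit <- lm(y ~ bs(x, df = K + 3))               # cubic spline with K knots
  mean(fit$residuals^2) + 2 * 60 * (K + 4) / N   # the estimate (3.62)
})
plot(1:40, errhat, xlab = "# Knots", ylab = "Estimated prediction error")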

The calls above pick the knots so that there are approximately the same numbers of observations in each region. If you wish to control where the knots are placed, then use the knots keyword. For example, for knots at 1936, 1960, and 1976, use

xx <- bs(x,knots=c(1936,1960,1976))

3.4 Smoothing splines

In the previous section, we fit splines using regular least squares. An alternative approach uses a penalty for non-smoothness, which has the effect of decreasing some of the regression coefficients, much as in ridge regression. The main task of this section will be to fit cubic splines, but we start with a simple cubic (or a spline with zero knots). That is, the fit we are after is of the form

\hat y_i = \hat f(x_i) = b_0 + b_1 x_i + b_2 x_i^2 + b_3 x_i^3.   (3.64)

We know how to find the usual least squares estimates of the b_j's, but now we want to add a regularization penalty on the non-smoothness. There are many ways to measure non-smoothness. Here we will try to control the second derivative of f, so that while we are happy with any slope, we do not want the slope to change too quickly. One common penalty is the integrated square of the second derivative:

\int (f''(x))^2 dx.   (3.65)

Suppose that the range of the xi’s is (0, 1). Then the objective function for choosing the b is

obj_λ(b) = \sum_{i=1}^{N} (y_i − f(x_i))^2 + λ \int_0^1 (f''(x))^2 dx.   (3.66)

It is similar to ridge regression, except that ridge tries to control the b_j's directly, i.e., the slopes. Here,

f''(x) = 2b_2 + 6b_3 x,   (3.67)

hence

\int_0^1 (f''(x))^2 dx = \int_0^1 (2b_2 + 6b_3 x)^2 dx = \int_0^1 (4b_2^2 + 24 b_2 b_3 x + 36 b_3^2 x^2) dx = 4b_2^2 + 12 b_2 b_3 + 12 b_3^2.   (3.68)

Thus, letting X be the matrix with the cubic polynomial arguments,

obj_λ(b) = \|y − Xb\|^2 + λ(4b_2^2 + 12 b_2 b_3 + 12 b_3^2).   (3.69)

Minimizing this objective function is a least squares task as in Section 2.2. The vector of derivatives with respect to the b_j's is

\begin{pmatrix} \partial obj_λ(b)/\partial b_0 \\ \partial obj_λ(b)/\partial b_1 \\ \partial obj_λ(b)/\partial b_2 \\ \partial obj_λ(b)/\partial b_3 \end{pmatrix} = −2X'(y − Xb) + λ \begin{pmatrix} 0 \\ 0 \\ 8b_2 + 12b_3 \\ 12b_2 + 24b_3 \end{pmatrix} = −2X'(y − Xb) + λ \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 8 & 12 \\ 0 & 0 & 12 & 24 \end{pmatrix} b.   (3.70)

Letting 00 0 0 00 0 0 Ω=   . (3.71) 0 0 8 12    0 0 12 24      setting the derivative vector in (3.70) to zero leads to

−2X'y + 2X'Xb + λΩb = −2X'y + (2X'X + λΩ)b = 0,   (3.72)

or

\hatβ_λ = \left(X'X + \frac{λ}{2}Ω\right)^{−1} X'y.   (3.73)

Compare this estimator to the ridge estimate in (2.72). It is the same, but with the identity instead of Ω/2. Note that with λ = 0, we have the usual least squares estimate of the cubic equation, but as λ → ∞, the estimates of β_2 and β_3 go to zero, meaning the f approaches a straight line, which indeed has zero second derivative.

To choose λ, we can use the Cp for estimating the prediction error. The theory is exactly the same as for ridge, in Section 2.6.1, but with the obvious change. That is,

\widehat{ERR}_{in,λ} = err_λ + 2\hat\sigma_e^2 \frac{edf(λ)}{N},   (3.74)

where

edf(λ) = trace\!\left(X\left(X'X + \frac{λ}{2}Ω\right)^{−1}X'\right).   (3.75)

Next, consider a cubic spline with K knots. Using the basis in (3.58) and (3.61), the penalty term contains the second derivatives within each region, squared and integrated:

\int_0^1 (f''(x))^2 dx = \int_0^{k_1} (2b_2 + 6b_3 x)^2 dx + \int_{k_1}^{k_2} (2b_2 + 6b_3 x + 6b_4(x − k_1))^2 dx + \cdots + \int_{k_K}^{1} (2b_2 + 6b_3 x + 6b_4(x − k_1) + \cdots + 6b_{K+3}(x − k_K))^2 dx.   (3.76)

It is a tedious but straightforward calculus exercise to find the penalty, but one can see that it will be quadratic in b_2, ..., b_{K+3}. More calculus will yield the derivatives of the penalty with respect to the b_j's, which will lead as in (3.70) to a matrix Ω. The estimate of β is

then (3.73), with the appropriate X and Ω. The resulting X\hatβ_λ is the smoothing spline. Choosing λ using Cp proceeds as in (3.74).

A common choice for the smoothing spline is to place a knot at each distinct value of x_i. This approach avoids having to try various K's, although if N is very large one may wish to pick a smaller K. We use all the knots for the birthrates data. Figure 3.22 exhibits the estimated prediction errors (3.74) for various λ's (or edf's). The best has edf = 26, with an estimated error of 63.60. There are other smaller edf's with almost as small errors. We will take edf = 18, which has an error estimate of 64.10. Figure 3.23 shows the fit. It is quite nice, somewhere between the fits in Figure 3.21.
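For the zero-knot case, (3.73) and (3.75) are easy to compute directly. A minimal R sketch (not from the text), assuming y holds the birthrates and x the years, rescaled to (0, 1) so that the Ω in (3.71) applies:

x01 <- (x - min(x)) / (max(x) - min(x))    # rescale x to (0,1), as assumed in (3.68)
X <- cbind(1, x01, x01^2, x01^3)
Omega <- matrix(c(0,0,0,0, 0,0,0,0, 0,0,8,12, 0,0,12,24), 4, 4, byrow=TRUE)
lambda <- 1                                 # an arbitrary trial value
A <- solve(t(X) %*% X + (lambda/2) * Omega)
betalam <- A %*% t(X) %*% y                 # the estimate (3.73)
edf <- sum(diag(X %*% A %*% t(X)))          # edf(lambda) in (3.75)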

Figure 3.22: Estimates of prediction error for smoothing spline

Figure 3.23: The smoothing spline with edf = 18

Figure 3.24 compares four fitting procedures for the birthrates by looking at the estimated errors as functions of the effective degrees of freedom. Generally, the smoothing splines are best, the polynomials worst, and the regular and natural splines somewhere in between. The latter two procedures have jagged plots, because as we change the number of knots, their placements change, too, so that the estimated errors are not particularly smooth as a function of edf. Among these polynomial-type fits, the smoothing spline appears to be the best one to use, at least for these data.

3.4.1 Using R

It is easy to use the smooth.spline function in R. As a default, it uses knots at the x_i's, or at least approximately so. You can also give the number of knots via the keyword nknots. The effective degrees of freedom are indicated by df. Thus to find the smoothing spline fit for edf = 18, use

x <- birthrates[,1]
y <- birthrates[,2]
ss <- smooth.spline(x,y,df=18)

Then ss$y contains the fitted y_i's. To plot the fit:

plot(x,y); lines(x,ss$y)
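A small follow-up sketch (not from the text) overlaying fits for a few edf values, to see how the smoothness changes:

plot(x, y, xlab="Year", ylab="Birthrate")
for(d in c(10, 18, 26)) lines(x, smooth.spline(x, y, df=d)$y, lty=match(d, c(10, 18, 26)))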

3.4.2 An interesting result

Suppose one does not wish to be restricted to cubic splines. Rather, one wishes to find the f among all functions with continuous second derivatives that minimizes the objective function (3.66). It turns out that the minimizer is a natural cubic spline with knots at the distinct values of the x_i's, which is what we have calculated above.

3.5 A glimpse of wavelets

The models in the previous sections assumed a fairly smooth relation between x and y, without spikes or jumps. In some situations, jumps and spikes are exactly what one is looking for, e.g., in earthquake data, or images where there are sharp boundaries between colors of pixels. Wavelets provide alternative bases to the polynomial- and trigonometric-based ones we have used so far. Truly understanding wavelets is challenging. Rather than making that attempt, we will look at the implementation of a particularly simple type of wavelet, then wave our hands in order to try more useful wavelets. Figure 3.25 is a motivating plot of Bush's approval rating, found at Professor Pollkatz's Pool Of Polls⁵.

⁵http://www.pollkatz.homestead.com/, the Bush Index.

Figure 3.24: Estimated errors for various fits (polynomial, cubic spline, natural spline, smoothing spline), as functions of the effective degrees of freedom

Figure 3.25: Bush's approval ratings (axis marks at 9/10/2001, 3/15/2003, 10/31/2004, 10/31/2005)

Though most of the curve is smooth, often trending down gently, there are notable spikes: Right after 9/11, when the Iraq War began, a smaller one when Saddam was captured, and a dip when I. Lewis ”Scooter” Libby was indicted. There is also a mild rise in Fall 2004, peaking at the election6. Splines may not be particularly successful at capturing these jumps, as they are concerned with low second derivatives.

3.5.1 Haar wavelets

The simplest wavelets are Haar wavelets, which generalize the bases used in regressograms. A regressogram (as in Figure 3.16) divides the x-variable into several regions, then places a horizontal line above each region, the height being the average of the corresponding y's. Haar wavelets do the same, but are flexible in deciding how wide each region should be. If the curve is relatively flat over a wide area, the fit will use just one or a few regions in that area. If the curve is jumpy in another area, there will be many regions used in that area. The wavelets adapt to the local behavior, being smooth where possible, jumpy where necessary.

Figure 3.26 shows four fits to the phoneme data⁷ in Figure 3.1. This graph is a periodogram based on one person saying "aa". Notice, especially in the bottom two plots, that the flat regions have different widths.

The book (Section 5.9) has more details on wavelets than will be presented here. To give the idea, we will look at the X matrix for Haar wavelets when N = 8, but it should be clear how to extend the matrix to any N that is a power of 2:

\left(\begin{array}{cccccccc}
1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & −1 & 0 & 0 & 0 \\
1 & 1 & −1 & 0 & 0 & 1 & 0 & 0 \\
1 & 1 & −1 & 0 & 0 & −1 & 0 & 0 \\
1 & −1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & −1 & 0 & 1 & 0 & 0 & −1 & 0 \\
1 & −1 & 0 & −1 & 0 & 0 & 0 & 1 \\
1 & −1 & 0 & −1 & 0 & 0 & 0 & −1 \\
\hline
\sqrt{8} & \sqrt{8} & \sqrt{4} & \sqrt{4} & \sqrt{2} & \sqrt{2} & \sqrt{2} & \sqrt{2}
\end{array}\right)   (3.77)

The unusual notation indicates that all elements in each column are divided by the square root in the last row, so that the columns all have norm 1. Thus X is an orthogonal matrix. Note that the third and fourth columns are basically like the second column, but with the range cut in half. Similarly, the last four columns are like the third and fourth, but with the range cut in half once more. This X then allows both global contrasts (i.e., the first half of the data versus the last half) and very local contrasts (y_1 versus y_2, y_3 versus y_4, etc.), and in-between contrasts.
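A minimal R sketch (not from the text) that builds the matrix in (3.77) and confirms X'X = I:

X <- cbind(rep(1, 8),
           rep(c(1, -1), each = 4),
           c(1, 1, -1, -1, 0, 0, 0, 0),
           c(0, 0, 0, 0, 1, 1, -1, -1),
           c(1, -1, 0, 0, 0, 0, 0, 0),
           c(0, 0, 1, -1, 0, 0, 0, 0),
           c(0, 0, 0, 0, 1, -1, 0, 0),
           c(0, 0, 0, 0, 0, 0, 1, -1))
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")  # divide each column by its norm
round(t(X) %*% X, 10)                      # the 8 x 8 identity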

⁶It looks like Bush's numbers go down unless he has someone to demonize: Bin Laden, Saddam, Saddam, Kerry.
⁷http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/phoneme.info

Figure 3.26: Haar wavelet fits to the phoneme data

The orthogonality of X means that the usual least squares estimate of the β is

\hatβ_{LS} = X'y.   (3.78)

The two most popular choices for deciding on the estimate of β to use in prediction with wavelets are subset regression and lasso regression. Subset regression is easy enough because of the orthonormality of the columns of X. Thus, assuming that β_0 is kept in the fit, the best fit using p* + 1 of the coefficients takes \hatβ_{LS,0} (which is \bar y), and the p* coefficients with the largest absolute values. Thus, as when using the sines and cosines, one can find the best fits for each p* directly once one orders the coefficients by absolute value. The lasso fits are almost as easy to obtain. The objective function is

obj_λ(b) = \|y − Xb\|^2 + λ \sum_{j=1}^{N−1} |b_j|.   (3.79)

Because X is orthogonal,

\|y − Xb\|^2 = \|X'y − b\|^2 = \|\hatβ_{LS} − b\|^2 = \sum_{j=0}^{N−1} (\hatβ_{LS,j} − b_j)^2,   (3.80)

hence

obj_λ(b) = (\hatβ_{LS,0} − b_0)^2 + \sum_{j=1}^{N−1} \left[ (\hatβ_{LS,j} − b_j)^2 + λ|b_j| \right].   (3.81)

The objective function decomposes into N components, each depending on just one b_j, so that the global minimizer \hat b can be found by minimizing each component

(\hatβ_{LS,j} − b_j)^2 + λ|b_j|   (3.82)

over b_j. To minimize such a function, note that the function is strictly convex in b_j, and differentiable everywhere except at b_j = 0. Thus the minimum is where the derivative is 0, if there is such a point, or at b_j = 0 if not. Now

\frac{\partial}{\partial b} \left[ (\hatβ − b)^2 + λ|b| \right] = −2(\hatβ − b) + λ\,Sign(b)   if b ≠ 0.   (3.83)

Setting that derivative to 0 yields

b = \hatβ − \frac{λ}{2}\,Sign(b).   (3.84)

There is no b as in (3.84) if |\hatβ| < λ/2, because then if b > 0, it must be that b < 0, and if b < 0, it must be that b > 0. Otherwise, b is \hatβ ± λ/2, where the ± is chosen to move the coefficient closer to zero. That is,

\hat b = \begin{cases} \hatβ − \frac{λ}{2} & if \hatβ > \frac{λ}{2} \\ 0 & if |\hatβ| ≤ \frac{λ}{2} \\ \hatβ + \frac{λ}{2} & if \hatβ < −\frac{λ}{2}. \end{cases}   (3.85)

Wavelet people like to call the above methods for obtaining coefficients thresholding. In our notation, a threshold level λ/2 is chosen, then one performs either hard or soft thresholding, being subset selection or lasso, respectively.

• Subset selection ≡ Hard thresholding. Choose the coefficients for the fit to be the least squares estimates whose absolute values are greater than λ/2;

• Lasso ≡ Soft thresholding. Choose the coefficients for the fit to be the lasso estimates as in (3.85).

Thus either method sets to zero any coefficient that does not meet the threshold. Soft thresholding also shrinks the remaining towards zero.
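A minimal R sketch of the two rules with threshold λ/2 (not from the text; the function names are invented for illustration):

hard <- function(b, thresh) ifelse(abs(b) > thresh, b, 0)
soft <- function(b, thresh) sign(b) * pmax(abs(b) - thresh, 0)
b <- c(-3, -0.4, 0.1, 2.5)
hard(b, 1)   # -3.0  0.0  0.0  2.5
soft(b, 1)   # -2.0  0.0  0.0  1.5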

3.5.2 An example of another set of wavelets

Haar wavelets are simple, but there are many other sets of wavelets with various properties that may be more flexible in fitting both global features and local features in a function. These other wavelets do not have recognizable functional forms, but digitized forms can be found. In general, a family of wavelets starts with two functions, a father wavelet and a mother wavelet. The father and mother wavelets are orthogonal and have squared integral of 1. The wavelets are then replicated by shifting or rescaling. For example, suppose ψ is the mother wavelet. It is shifted by integers, that is, for integer k,

ψ_{0,k}(x) = ψ(x − k).   (3.86)

These are the level 0 wavelets. The wavelet can be scaled by expanding or contracting the x-axis by a power of 2:

ψ_{j,0}(x) = 2^{j/2} ψ(2^j x).   (3.87)

Further shifting by integers k yields the level j wavelets,

ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k).   (3.88)

Note that k and j can be any integers, not just nonnegative ones. The amazing property of the entire set of wavelets, the father plus all levels of the mother, is that they are mutually orthogonal. (They also have squared integral of 1, which is not so amazing.) The Haar father and mother are the step functions

φ(x) = 1 if 0 < x ≤ 1 (and 0 otherwise),   ψ(x) = 1 if 0 < x ≤ 1/2, −1 if 1/2 < x ≤ 1 (and 0 otherwise).   (3.89)

Figure 3.27: Some Haar wavelets (level 0; level 1, k = 0, 1)

Figure 3.28: More Haar wavelets (level 2, k = 0, 1, 2, 3)

Figures 3.27 and 3.28 show some of the Haar wavelets. Figures 3.29 and 3.30 show wavelets from another family, the DaubEx 2 wavelets, used in the R package wavethresh⁸. The domains of the DaubEx 2 wavelets of a given level overlap, unlike the Haar wavelets. Thus the wavelets are not always contained within the range of x's for a given data set. One fix is to recycle the part of the wavelet that occurs after the last of the observed x's at the beginning of the x's. This is the approach in Figures 3.29 and 3.30.

3.5.3 Example Using R

We will fit DaubEx 2 wavelets to the phoneme data using wavethresh. The y-values are in the vector aa. There are three steps:

1. Decomposition. Find the least squares estimates of the coefficients:

w <- wd(aa)

This w contains the estimates in w[[2]], but in a non-obvious order. To see the coefficients for the level d wavelets, use accessD(w,level=d). To see the plot of the coefficients as in Figure 3.31, use plot(w).

2. Thresholding. Take the least squares estimates, and threshold them. Use hard or soft thresholding depending on whether you wish to keep the ones remaining at their least squares values, or decrease them by the threshold value. The default is hard thresholding. If you wish to input your threshold, lambda/2, use the following:

whard <- threshold(w,policy="manual",value=lambda/2)              # Hard
wsoft <- threshold(w,policy="manual",value=lambda/2,type="soft")  # Soft

The default, wdef <- threshold(w), uses the "universal" value of Donoho and Johnstone, which takes

λ = 2s\sqrt{2\log(N)},   (3.90)

where s is the sample standard deviation of the least squares coefficients. The idea is that if the β_j's are all zero, then their least squares estimates are iid N(0, σ_e^2), hence

E[\max_j |\hatβ_j|] ≈ σ_e \sqrt{2\log(N)}.   (3.91)

Coefficients with absolute value below that number are considered to have β_j = 0, and those above to have β_j ≠ 0, thus it makes sense to threshold at an estimate of that number.

3. Reconstruction. Once you have the thresholded estimates, you can find the fits \hat y using

⁸Author: Guy Nason; R-port: Arne Kovac (1997) and Martin Maechler (1999).

Figure 3.29: Some DaubEx 2 wavelets (level 0; level 1, k = 0, 1)

Figure 3.30: More DaubEx 2 wavelets (level 2, k = 0, 1, 2, 3)

Figure 3.31: Least squares estimates of the coefficients (wavelet decomposition coefficients by resolution level; Daub cmpct on ext. phase N = 2)

yhat <- wr(whard) # or wr(wsoft) or wr(wdef)

To choose λ, you could just use the universal value in (3.90). The Cp approach is the same as before, e.g., for sines and cosines or polynomials. For each value of λ you wish to try, find the fits, then from that find the residual sums of squares. Then as usual

\widehat{ERR}_{in,λ} = err_{in,λ} + 2\hat\sigma_e^2 \frac{edf}{N}.   (3.92)

We need some estimate of σ_e^2. Using the QQ-plot on the \hatβ_j^2's (which should be compared to χ_1^2's this time), we estimate that the top seven coefficients can be removed, and obtain \hat\sigma_e^2 = 2.60. The effective degrees of freedom is the number of coefficients not set to zero, plus the grand mean, which can be obtained using dof(whard). Figure 3.32 shows the results. Both have a minimum at edf = 23, where the threshold value was 3.12. The universal threshold is 15.46, which yielded edf = 11.

Figure 3.33 contains the fits of the best subset, best lasso, and universal fits. The soft threshold fit (lasso) is less spiky than the hard threshold fit (subset), and fairly similar to the fit using the universal threshold. All three fits successfully find the spike at the very left-hand side, and are overall preferable to the Haar fits in Figure 3.26.
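A sketch of this Cp search in R (not from the text; the grid of thresholds is arbitrary, and dof is used for the effective degrees of freedom as described above), assuming wavethresh is loaded and aa contains the phoneme data:

library(wavethresh)
w <- wd(aa)
N <- length(aa)
thresh <- seq(0.5, 16, by = 0.5)   # candidate values of lambda/2 (an arbitrary grid)
errhat <- sapply(thresh, function(th) {
  wt <- threshold(w, policy = "manual", value = th, type = "soft")
  mean((aa - wr(wt))^2) + 2 * 2.60 * dof(wt) / N   # (3.92) with sigma_e^2 = 2.60
})
plot(thresh, errhat, xlab = "Threshold (lambda/2)", ylab = "Estimated prediction error")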

3.5.4 Remarks

Remark 1. Wavelets themselves are continuous functions, while applying them to data requires digitizing them, i.e., using them on just a finite set of points. (The same is true of polynomials and sines and cosines, of course.) Because of the dyadic way in which wavelets are defined, one needs to digitize on N = 2^d equally-spaced values in order to have the orthogonality transfer. In addition, wavelet decomposition and reconstruction use a clever pyramid scheme that is vastly more efficient than the usual (X′X)^{−1}X′y approach. This scheme needs N = 2^d as well. So the question arises of what to do when N ≠ 2^d. Some ad hoc fixes are to remove a few observations from the ends, or tack on a few fake observations to the ends, if that will bring the number of points to a power of two. For example, in the birthrate data, N = 133, so removing five points leaves 2^7 = 128. Another possibility is to apply the wavelets to the first 2^d points, and again to the last 2^d points, then combine the two fits on the overlap. For example, the Bush approval numbers in Figure 3.25 have N = 122. Figure 3.34 shows the fits to the first 64 and last 64 points, where there is some discrepancy in the overlapping portion; a sketch of this splitting appears after the figure. In practice, one would average the fits in the overlap, or use the first fit up until point 61 and the second fit from point 62 to the end. In any case, note how well these wavelets pick up the jumps in approval at the crucial times.

Remark 2. The problems in the above remark spill over to using cross-validation with wavelets. Leaving one observation out, or randomly leaving out several, will ruin the efficient calculation properties. Nason [1996] has some interesting ways around the problem.

Figure 3.32: Prediction errors for the subset (hard thresholding) and lasso (soft thresholding) estimates, plotted against edf.

Figure 3.33: The DaubEx 2 fits to the phoneme data (panels: soft thresholding, hard thresholding, universal; aa versus index).

Figure 3.34: The fits to the Bush approval data (% approval, 9/10/2001 through 10/31/2005).
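The split fits in Figure 3.34 can be produced along the lines of the following sketch; the vector name approval and the use of the default threshold are my assumptions, not from the text.

n    <- length(approval)                      # 122 approval numbers
fit1 <- wr(threshold(wd(approval[1:64])))     # fit to the first 64 points
fit2 <- wr(threshold(wd(approval[(n-63):n]))) # fit to the last 64 points (points 59 to 122)
fit  <- c(fit1[1:61], fit2[4:64])             # first fit through point 61, then the second
                                              # (point 62 is the 4th point of the last block)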

When N is not too large, we can use the regular least squares formula (3.15) for leave-one-out cross-validation (or any cross-validation) as long as we can find the X matrix. Then because X*′X* = I_{p*+1},

h_ii = ||x_i*||²,  (3.93)

where x_i* is the i-th row of the X* matrix used in the fit. Because β̂_LS = X′y, one way to find the X matrix for the wavelets is to find the coefficients when y has components with 1 in the i-th place and the rest 0. Then the least squares coefficients form the i-th row of X. In R, for N = 256,

z <- diag(256) # I_256
xd2 <- NULL
for(i in 1:256) xd2 <- rbind(xd2,wd(z[,i])[[2]])
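A rough sketch of how xd2 could then be used for the leave-one-out calculation, for one candidate sub-model (this treats the wavelet columns as orthonormal and adds a normalized constant column for the grand mean; the subset keep is illustrative, roughly matching edf = 23):

keep <- order(abs(wd(aa)[[2]]), decreasing=TRUE)[1:22]  # 22 largest coefficients
xs   <- cbind(rep(1/sqrt(256),256), xd2[,keep])         # X* with (approximately) orthonormal columns
bhat <- crossprod(xs, aa)                               # least squares, since X*'X* = I
yhat <- xs %*% bhat                                     # fitted values
h    <- rowSums(xs^2)                                   # h_ii = ||x_i*||^2, as in (3.93)
mean(((aa - yhat)/(1 - h))^2)                           # leave-one-out estimate via (3.15)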

Unfortunately, as we saw in the sines and cosines example in Section 3.2.2, leave-one-out cross-validation for the phoneme data leads to the full model.

Chapter 4

Model-based Classification

Classification is prediction for which the yi’s take values in a finite set. Two famous examples are

• Fisher/Anderson Iris Data. These data have 50 observations on each of three iris species (setosa, versicolor, and virginica). (N = 150.) There are four variables: sepal length and width, and petal length and width. Figure 4.1 exhibits the data in a scatter plot matrix.

• Hewlett-Packard Spam Data¹. A set of N = 4601 emails were classified into either spam or not spam. Variables included various word and symbol frequencies, such as the frequency of the word "credit" or "George" or "hp." The emails were sent to George Forman (not the boxer) at Hewlett-Packard labs, hence emails with the words "George" or "hp" would likely indicate non-spam, while "credit" or "!" would suggest spam.

We assume the training sample consists of iid pairs (y_1, x_1), ..., (y_N, x_N), where the x_i's are p × 1 vectors of predictors, and the y_i's indicate which group the observation is from:

y_i ∈ {1, 2, ..., K}.  (4.1)

In the iris data, N = 150, p = 4 variables, and K = 3 groups (1 = setosa, 2 = versicolor, 3 = virginica). In the spam data, there are K = 2 groups, spam or not spam, N = 4601, and p = 57 explanatory variables. A predictor is a function G(x) that takes values in {1, ..., K}, so that for a new observation (y^New, x^New),

ŷ^New = G(x^New).  (4.2)

Usually, this G will also depend on some unknown parameters, in which case the predictions will be based on an estimated function, Ĝ. (To compare, in linear prediction from the previous two chapters, G(x) would correspond to β′x and Ĝ(x) would correspond to β̂′x.)

¹ http://www.ics.uci.edu/~mlearn/databases/spambase/spambase.DOCUMENTATION

Figure 4.1: Anderson/Fisher iris data, scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width: s = setosa, v = versicolor, g = virginica.

So far, we have considered squared error, E[||Y^New − Ŷ^New||²], as the criterion to be minimized, which is not necessarily reasonable for classification, especially when K > 2 and the order of the categories is not meaningful, such as for the iris species. There are many possibilities for loss functions, but the most direct is the chance of misclassification. Paralleling the development in Section 2.1, we imagine a set of new observations with the same x_i's as the old but new y_i's: (y_1^New, x_1), ..., (y_N^New, x_N). Then an error occurs if ŷ_i, which equals Ĝ(x_i), does not equal the true but unobserved y_i^New, that is,

Error_i = I[Y_i^New ≠ Ĝ(x_i)] = { 1 if Y_i^New ≠ Ĝ(x_i);  0 if Y_i^New = Ĝ(x_i) }.  (4.3)

The in-sample error is then the average of the conditional expectations of those errors,

ERR_in = (1/N) ∑_{i=1}^N E[ I[Y_i^New ≠ Ĝ(x_i)] | X_i = x_i ]
       = (1/N) ∑_{i=1}^N P[ Y_i^New ≠ Ĝ(x_i) | X_i = x_i ],  (4.4)

which is the conditional (on the x_i's) probability of misclassification using Ĝ. (The Y_i^New's are random, as is Ĝ. There is some ambiguity about whether the randomness in Ĝ is due to the unconditional distribution of the training set (the (Y_i, X_i)'s), or the conditional distributions Y_i | X_i = x_i. Either way is reasonable.)

A model for the data should specify the joint distribution of (Y, X), up to unknown parameters. One way to specify the model is to condition on y, which amounts to specifying the distribution of X within each group k. That is,

X | Y = k ∼ f_k(x) = f(x | θ_k),
P[Y = k] = π_k.  (4.5)

Here, π_k is the population proportion of those in group k, and f_k(x) is the density of X for group k. The second expression for f_k indicates that the distribution of X for each group is from the same family, but each group may have different parameter values, e.g., different means. If the parameters are known, then it is not difficult to find the best classifier G. Consider

P[Y_i^New ≠ G(x_i) | X_i = x_i] = 1 − P[Y_i^New = G(x_i) | X_i = x_i].  (4.6)

The G has values in {1, ..., K}, so to minimize (4.6), we maximize the final probability over G(x_i), which means finding the k to maximize

P[Y = k | X_i = x_i].  (4.7)

Using Bayes theorem, we have

P[Y = k | X_i = x_i] = P[X_i = x_i | Y = k] P[Y = k] / ( P[X_i = x_i | Y = 1] P[Y = 1] + ··· + P[X_i = x_i | Y = K] P[Y = K] )
                     = f_k(x_i) π_k / ( f_1(x_i) π_1 + ··· + f_K(x_i) π_K ).  (4.8)

Thus for any x,

G(x) = k that maximizes f_k(x) π_k / ( f_1(x) π_1 + ··· + f_K(x) π_K ).  (4.9)

The rest of this chapter considers assumptions on the f_k's and estimation of the parameters.
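Before turning to specific models, here is a generic sketch of the rule (4.9) in R; the function and argument names are illustrative, not from the text.

# classify x into the group with the largest value of f_k(x) pi_k / sum_j f_j(x) pi_j
bayes_class <- function(x, densities, priors) {
  post <- sapply(seq_along(densities), function(k) densities[[k]](x) * priors[k])
  which.max(post / sum(post))  # dividing by the sum does not change the argmax
}
# toy example: two univariate normal groups with equal priors
dens <- list(function(x) dnorm(x, mean = 0), function(x) dnorm(x, mean = 2))
bayes_class(1.2, dens, c(0.5, 0.5))  # returns 2, since 1.2 is closer to the second mean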

4.1 The multivariate normal distribution and linear discrimination

Looking at the iris data in Figure 4.1, it is not much of a stretch to see the data within each species being multivariate normal. In this section we take

X | Y = k ∼ N_p(µ_k, Σ),  (4.10)

that is, X for group k is p-variate multivariate normal with mean µ_k and covariance matrix Σ. (See Section 2.3.) Note that we are assuming different means but the same covariance matrix for the different groups. We wish to find the G in (4.9). The multivariate normal density is

f(x | µ, Σ) = ( 1 / ( (√(2π))^p |Σ|^{1/2} ) ) e^{−(1/2)(x − µ)′Σ^{−1}(x − µ)}.  (4.11)

The |Σ| indicates the determinant of Σ. We are assuming that Σ is invertible. Look at G in (4.9) with f_k(x) = f(x | µ_k, Σ). We want to find a simpler expression for defining G. Divide the numerator and denominator by the term for the last group, f_K(x)π_K. Then we have that

G(x) = k that maximizes e^{d_k(x)} / ( e^{d_1(x)} + ··· + e^{d_{K−1}(x)} + 1 ),  (4.12)

where

d_k(x) = −(1/2)(x − µ_k)′Σ^{−1}(x − µ_k) + log(π_k) + (1/2)(x − µ_K)′Σ^{−1}(x − µ_K) − log(π_K)
       = −(1/2)( x′Σ^{−1}x − 2µ_k′Σ^{−1}x + µ_k′Σ^{−1}µ_k − x′Σ^{−1}x + 2µ_K′Σ^{−1}x − µ_K′Σ^{−1}µ_K ) + log(π_k/π_K)
       = (µ_k − µ_K)′Σ^{−1}x − (1/2)(µ_k′Σ^{−1}µ_k − µ_K′Σ^{−1}µ_K) + log(π_k/π_K)
       = α_k + β_k′x,  (4.13)

the new parameters being

α_k = −(1/2)(µ_k′Σ^{−1}µ_k − µ_K′Σ^{−1}µ_K) + log(π_k/π_K)  and  β_k′ = (µ_k − µ_K)′Σ^{−1}.  (4.14)

Since the denominator in (4.12) does not depend on k, we can maximize the ratio by maximizing d_k, that is,

G(x) = k that maximizes d_k(x) = α_k + β_k′x.  (4.15)

(Note that d_K(x) = 0.) Thus the classifier is based on linear functions of the x. This G is fine if the parameters are known, but typically they must be estimated. There are a number of approaches, model-based and otherwise:

1. Maximum likelihood estimation (MLE), using the joint likelihood of the (y_i, x_i)'s;
2. MLE, using the conditional likelihood of the y_i | x_i's;
3. Minimizing an objective function not tied to the model.

This section takes the first tack. The next section looks at the second. The third approach encompasses a number of methods, to be found in future chapters. As an example, one could find the (possibly non-unique) (α_k, β_k)'s to minimize the number of observed misclassifications.

4.1.1 Finding the joint MLE

So let us look at the joint likelihood. We know that for each i, the density is f_k(x_i)π_k if y_i = k. Thus the likelihood for the complete training sample is

L(µ_1, ..., µ_K, Σ, π_1, ..., π_K ; (y_1, x_1), ..., (y_N, x_N)) = ∏_{i=1}^N [ f_{y_i}(x_i) π_{y_i} ]
  = [ ∏_{k=1}^K ∏_{y_i=k} π_k ] × [ ∏_{k=1}^K ∏_{y_i=k} f(x_i ; µ_k, Σ) ]
  = [ ∏_{k=1}^K π_k^{N_k} ] × [ ∏_{k=1}^K ∏_{y_i=k} f(x_i ; µ_k, Σ) ],  (4.16)

where N_k = #{y_i = k}. Maximizing over the π_k's is straightforward, keeping in mind that they sum to 1, yielding the MLE's

π̂_k = N_k / N,  (4.17)

as for the multinomial. Maximizing the likelihood over the µ_k's (for fixed Σ) is equivalent to minimizing, for each k,

∑_{y_i=k} (x_i − µ_k)′Σ^{−1}(x_i − µ_k).  (4.18)

Recalling that trace(AB) = trace(BA),

∑_{y_i=k} (x_i − µ_k)′Σ^{−1}(x_i − µ_k) = ∑_{y_i=k} trace( Σ^{−1} (x_i − µ_k)(x_i − µ_k)′ )
  = trace( Σ^{−1} ∑_{y_i=k} (x_i − µ_k)(x_i − µ_k)′ ).  (4.19)

Let x̄_k be the sample mean of the x_i's in the k-th group:

x̄_k = (1/N_k) ∑_{y_i=k} x_i.  (4.20)

Then

∑_{y_i=k} (x_i − µ_k)(x_i − µ_k)′ = ∑_{y_i=k} (x_i − x̄_k + x̄_k − µ_k)(x_i − x̄_k + x̄_k − µ_k)′
  = ∑_{y_i=k} (x_i − x̄_k)(x_i − x̄_k)′ + 2 ∑_{y_i=k} (x_i − x̄_k)(x̄_k − µ_k)′ + ∑_{y_i=k} (x̄_k − µ_k)(x̄_k − µ_k)′
  = ∑_{y_i=k} (x_i − x̄_k)(x_i − x̄_k)′ + ∑_{y_i=k} (x̄_k − µ_k)(x̄_k − µ_k)′,  (4.21)

since the cross term vanishes because ∑_{y_i=k} (x_i − x̄_k) = 0. Putting this equation into (4.19),

trace( Σ^{−1} ∑_{y_i=k} (x_i − µ_k)(x_i − µ_k)′ ) = trace( Σ^{−1} ∑_{y_i=k} (x_i − x̄_k)(x_i − x̄_k)′ ) + ∑_{y_i=k} (x̄_k − µ_k)′Σ^{−1}(x̄_k − µ_k).  (4.22)

The µ_k appears on the right-hand side only in the second term. That term is nonnegative and is minimized, uniquely at 0, by taking µ_k = x̄_k. We then have that the MLE is

µ̂_k = x̄_k.  (4.23)

Next, use (4.11), (4.22) and (4.23) to show that

log( ∏_{k=1}^K ∏_{y_i=k} f(x_i ; µ̂_k, Σ) ) = (constant) − (N/2) log(|Σ|) − (1/2) ∑_{k=1}^K trace( Σ^{−1} ∑_{y_i=k} (x_i − x̄_k)(x_i − x̄_k)′ )
  = (constant) − (N/2) log(|Σ|) − (1/2) trace( Σ^{−1} S ),  (4.24)

where

S = ∑_{k=1}^K ∑_{y_i=k} (x_i − x̄_k)(x_i − x̄_k)′.  (4.25)

It is not obvious, but (4.24) is maximized over Σ by taking Σ = S/N, that is, the MLE is

Σ̂ = (1/N) S.  (4.26)

See Section 4.1.3. This estimate is the pooled sample covariance matrix. (Although in order for it to be unbiased, one needs to divide by N − K instead of N.) Because

trace(Σ̂^{−1} S) = trace((S/N)^{−1} S) = N trace(I_p) = Np,  (4.27)

log( ∏_{k=1}^K ∏_{y_i=k} f(x_i ; µ̂_k, Σ̂) ) = (constant) − (N/2) log(|S/N|) − (1/2) Np.  (4.28)

Finally, the estimates of the coefficients in (4.14) are found by plugging in the MLE's of the parameters,

α̂_k = −(1/2)(µ̂_k′ Σ̂^{−1} µ̂_k − µ̂_K′ Σ̂^{−1} µ̂_K) + log(π̂_k/π̂_K)  and  β̂_k′ = (µ̂_k − µ̂_K)′ Σ̂^{−1}.  (4.29)

4.1.2 Using R

The iris data is in the data frame iris. You may have to load the datasets package. The first four columns form the N × p matrix of x_i's, N = 150, p = 4. The fifth column has the species, 50 each of setosa, versicolor, and virginica. The basic variables are then

x <- as.matrix(iris[,1:4])
y <- rep(1:3,c(50,50,50)) # gets vector (1,...,1,2,...,2,3,...,3)
K <- 3
N <- 150
p <- 4

The mean vectors and pooled covariance matrix are found using

m <- NULL
v <- matrix(0,ncol=p,nrow=p)
for(k in 1:K) {
  xk <- x[y==k,]
  m <- cbind(m,apply(xk,2,mean))
  v <- v + var(xk)*(nrow(xk)-1) # accumulates the numerator of the pooled covariance
}
v <- v/N        # the MLE, Sigma-hat = S/N
p <- table(y)/N # This finds the pi-hats (note: this reuses the name p from above).

Then m is p × K, column k containing x̄_k.

round(m,2)
             [,1] [,2] [,3]
Sepal.Length 5.01 5.94 6.59
Sepal.Width  3.43 2.77 2.97
Petal.Length 1.46 4.26 5.55
Petal.Width  0.25 1.33 2.03

round(v,3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        0.260       0.091        0.164       0.038
Sepal.Width         0.091       0.113        0.054       0.032
Petal.Length        0.164       0.054        0.181       0.042
Petal.Width         0.038       0.032        0.042       0.041

Next, plug these into (4.29).

alpha <- NULL
beta <- NULL
vi <- solve(v) # v inverse
for(k in 1:K) {
  a <- -(1/2)*(m[,k]%*%vi%*%m[,k]-m[,K]%*%vi%*%m[,K])+log(p[k]/p[K])
  # log(p[k]/p[K]) is 0 here, since the pi-hats are all equal
  alpha <- c(alpha,a)
  b <- vi%*%(m[,k]-m[,K])
  beta <- cbind(beta,b)
}

round(alpha,4)
[1] 18.4284 32.1589  0.0000
round(beta,4)
                 [,1]     [,2] [,3]
Sepal.Length  11.3248   3.3187    0
Sepal.Width   20.3088   3.4564    0
Petal.Length -29.7930  -7.7093    0
Petal.Width  -39.2628 -14.9438    0

One actually needs to find only the first K − 1 coefficients, because the Kth's are 0. To see how well this classification scheme works on the training set, we first find the d_k(x_i)'s for each i, then classify each observation according to its largest d_k.

dd <- NULL
for(k in 1:K) {
  dk <- alpha[k]+x%*%beta[,k]
  dd <- cbind(dd,dk)
}
dd[c(1,51,101),]
          [,1]      [,2] [,3]
1     97.70283  47.39995    0
51   -32.30503   9.29550    0
101 -120.12154 -19.14218    0

The last command prints out one observation from each species. We can see that these three are classified correctly, having k = 1, 2, and 3, respectively. To find the ŷ_i's for all the observations, use

yhat <- apply(dd,1,imax)

where imax is a little function I wrote to give the index of the largest value in a vector:

imax <- function(z) (1:length(z))[z==max(z)]

To see how close the predictions are to the true values, use the table command:

table(yhat,y)

     y
yhat  1  2  3
   1 50  0  0
   2  0 48  1
   3  0  2 49

Thus there were 3 observations misclassified, two versicolors were classified as virginica, and one virginica was classified as versicolor. Not too bad. The observed misclassification rate is

err = #{ŷ_i ≠ y_i} / N = 3/150 = 0.02.  (4.30)

Note that this estimate is likely to be optimistic (an underestimate) of ERR_in in (4.4), because it uses the same data to find the classifier and to test it. There are a number of ways to obtain a better estimate. Here we use cross-validation. The following function calculates the error for leaving observations out. The argument leftout is a vector with the indices of the observations you want left out; varin is a vector of the indices of the variables you want to use. It outputs the number of errors made in predicting the left-out observations. Note this function is specific to the iris data, with x as the X and y as the y.

cviris <- function(leftout,varin) {
  pstar <- length(varin)
  xr <- as.matrix(x[-leftout,varin])
  spr <- y[-leftout]   # group labels of the retained observations
  m <- NULL
  v <- matrix(0,pstar,pstar)
  for(i in 1:3) {
    xri <- as.matrix(xr[spr==i,])
    m <- cbind(m,apply(xri,2,mean))
    v <- v+var(xri)*(nrow(xri)-1)
  }

  vi <- solve(v/nrow(xr))
  dd <- NULL
  for(i in leftout) {
    xn <- x[i,varin]
    d0 <- NULL
    for(j in 1:3) {
      d0 <- c(d0,(xn-m[,j])%*%vi%*%(xn-m[,j]))
    }
    dd <- c(dd,imin(d0))  # classify to the nearest group mean
  }
  sum(dd!=y[leftout])
}

Here imin, analogous to imax, gives the index of the smallest value in a vector:

imin <- function(z) (1:length(z))[z==min(z)]

The leave-one-out cross-validation, using all four variables, can be found using a loop:

err <- NULL
for(i in 1:150) err <- c(err,cviris(i,1:4))
sum(err)/N
[1] 0.02

Interestingly, the cv estimate is the same as the observed error, in (4.30). Also, the same observations were misclassified.
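For comparison only (this is not the text's code), the lda function in R's MASS package implements the same linear discrimination, and its CV=TRUE option returns the leave-one-out classifications directly; the results should closely match the calculations above.

library(MASS)
fit <- lda(x, grouping = factor(y), CV = TRUE)  # leave-one-out classifications
table(fit$class, y)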

4.1.3 Maximizing over Σ

Lemma 1. Suppose a > 0 and S ∈ S_q, the set of q × q symmetric positive definite matrices. Then

g(Σ) = ( 1 / |Σ|^{a/2} ) e^{−(1/2) trace(Σ^{−1} S)}  (4.31)

is uniquely maximized over Σ ∈ S_q by

Σ̂ = (1/a) S,  (4.32)

and the maximum is

g(Σ̂) = ( 1 / |Σ̂|^{a/2} ) e^{−(1/2) aq}.  (4.33)

Proof. Because S is positive definite and symmetric, it has an invertible symmetric square root, S^{1/2}. Let λ = S^{−1/2} Σ S^{−1/2}, and from (4.31) write

g(Σ) = h(S^{−1/2} Σ S^{−1/2}),  where  h(λ) ≡ ( 1 / |S|^{a/2} ) ( 1 / |λ|^{a/2} ) e^{−(1/2) trace(λ^{−1})}  (4.34)

is a function of λ ∈ S_q. To find the λ that maximizes h, we need only consider the factor without the S, which can be written

( 1 / |λ|^{a/2} ) e^{−(1/2) trace(λ^{−1})} = [ ∏_{i=1}^q ω_i^{a/2} ] e^{−(1/2) ∑_{i=1}^q ω_i} = ∏_{i=1}^q [ ω_i^{a/2} e^{−(1/2) ω_i} ],  (4.35)

where ω_1 ≥ ω_2 ≥ ··· ≥ ω_q > 0 are the eigenvalues of λ^{−1}. The i-th term in the product is easily seen to be maximized over ω_i > 0 by ω̂_i = a. Because those ω̂_i's satisfy the necessary inequalities, the maximizer of (4.35) over λ is λ̂ = (1/a) I_q, and

h(λ̂) = ( 1 / |S|^{a/2} ) a^{aq/2} e^{−(1/2) a·trace(I_q)},  (4.36)

from which (4.33) follows. Also,

λ̂ = S^{−1/2} Σ̂ S^{−1/2}  ⇒  Σ̂ = S^{1/2} (1/a) I_q S^{1/2} = (1/a) S,  (4.37)

which proves (4.32). □

4.2 Quadratic discrimination

The linear discrimination in Section 4.1 was developed by assuming the distributions within each group have the same covariance matrix, Σ. Here we relax that assumption, but stick with the multivariate normal, so that for group k,

X | Y = k ∼ N_p(µ_k, Σ_k).  (4.38)

The best classifier is similar to the one in (4.15):

G(x) = k that maximizes d_k(x),  (4.39)

where this time

d_k(x) = −(1/2)(x − µ_k)′Σ_k^{−1}(x − µ_k) + log(π_k).  (4.40)

We could work it so that d_K(x) = 0, but it does not simplify things much. The procedure is very sensible. It calculates the distance x is from each of the K means, then classifies the observation into the closest group, modified a bit by the prior probabilities π_k. Notice that these discriminant functions are now quadratic rather than linear, hence Fisher termed this approach quadratic discrimination. To implement this classifier, we again need to estimate the parameters. Taking the MLE works as in Section 4.1, but we do not have a pooled covariance estimate. That is, for each k, as in (4.23),

µ̂_k = x̄_k,  (4.41)

and

Σ̂_k = (1/N_k) ∑_{y_i=k} (x_i − x̄_k)(x_i − x̄_k)′.  (4.42)

The MLE of π_k is again N_k/N, as in (4.17). Note that we can write the pooled estimate in (4.26) as

Σ̂ = π̂_1 Σ̂_1 + ··· + π̂_K Σ̂_K.  (4.43)

4.2.1 Using R

We again use the iris data. The only difference here from Section 4.1.2 is that we estimate three separate covariance matrices. Using the same setup as in that section, we first estimate the parameters:

m <- NULL
v <- vector("list",K) # To hold the covariance matrices
for(k in 1:K) {
  xk <- x[y==k,]
  m <- cbind(m,apply(xk,2,mean))
  v[[k]] <- var(xk)*(nrow(xk)-1)/nrow(xk)
}
p <- table(y)/N

Next, we find the estimated dk’s from (4.40):

dd <- NULL
for(k in 1:K) {
  dk <- apply(x,1,function(xi)
    -(1/2)*(xi-m[,k])%*%solve(v[[k]],xi-m[,k])+log(p[k]))
  dd <- cbind(dd,dk)
}
yhat <- apply(dd,1,imax)
table(yhat,y)

     y
yhat  1  2  3
   1 50  0  0
   2  0 47  0
   3  0  3 50

Three observations were misclassified again, err = 3/150 = 0.02. Leave-one-out cross-validation came up with an estimate of 4/150 = 0.0267, which is slightly worse than that for linear discrimination. It does not appear that the extra complication of having three covariance matrices improves the classification rate, but see Section 4.3.2.

4.3 The Akaike Information Criterion (AIC)

In almost all areas of statistical inference, one is confronted with choosing between a variety of models. E.g., in subset regression, we are choosing between the 2^p possible linear models based on the subsets of the x-variables. Similar considerations arise in factor analysis, time series, loglinear models, clustering, etc. In most cases, the task is to balance fit and complexity, that is, find a model that fits well but also has a small number of parameters. The C_p criterion in (2.59) and (2.60) is one possibility, at least for linear models with the least squares loss function. That criterion is based on the squared error loss in predicting a new observation, (y^New − ŷ^New)². Consider a general model, one not necessarily having to do with linear models or classification. Let the data be W_1, ..., W_N, iid with density f(w | θ). In our models, the w would be the pair (y, x). The goal is to predict a new observation W^New that is independent of the

other W_i's, but has the same distribution. (We are not given any part of this new observation, unlike the predictions we have been considering so far, in which x^New is given.) The prediction is the likeliest value of W^New, which is the value that maximizes the estimated density:

ŵ^New = w that maximizes f(w | θ̂),  (4.44)

where θ̂ is the MLE of θ based upon the data. The actual value w^New will be different from the predictor, but we hope it is "close," where by close we mean "has a high likelihood," rather than close in a Euclidean distance sense. Thus our utility is a function of f(w^New | θ̂). Turning this utility into a loss, the loss function is decreasing in the likelihood. Specifically, we take

Loss(w^New) = −2 log( f(w^New | θ̂) ).  (4.45)

The expected value of the loss is what we wish to minimize, so it takes the role of ERR:

ERR = −2 E[ log( f(W^New | θ̂) ) ].  (4.46)

The expected value is over W^New and the data W_1, ..., W_N, the latter through the randomness of θ̂, assuming that the true model is f(· | θ).

The observed analog of ERR plugs in the w_i's for w^New, then averages:

err = −(2/N) ∑_{i=1}^N log( f(w_i | θ̂) ).  (4.47)

As in (2.54) for subset selection in linear models, we expect err to be an underestimate of ERR since the former assesses the prediction of the same observations used to estimate θ. Thus we would like to adjust err up a little. We try to find, at least approximately,

Δ = ERR − E[err],  (4.48)

so that err + Δ would be a better estimate of ERR.

We will sketch the calculations for a regular q-dimensional exponential family. That is, we assume that f has density

f(w | θ) = a(w) e^{θ′T(w) − ψ(θ)},  (4.49)

where θ is the q × 1 natural parameter, T is the q × 1 natural sufficient statistic, and ψ(θ) is the normalizing constant. The mean vector and covariance matrix of T can be found by taking derivatives of ψ:

µ(θ) = E_θ[T(W)] = ( ∂ψ(θ)/∂θ_1, ∂ψ(θ)/∂θ_2, ..., ∂ψ(θ)/∂θ_q )′  (4.50)

and

Σ(θ) = Cov_θ[T(W)] = ( ∂²ψ(θ)/∂θ_i ∂θ_j )_{i,j=1,...,q} = ( ∂µ_j(θ)/∂θ_i )_{i,j=1,...,q}.  (4.51)

Turning to the likelihoods,

−2 log(f(w | θ)) = −2 θ′T(w) + 2 ψ(θ) − 2 log(a(w)),  (4.52)

hence

−2 log( f(w^New | θ̂) ) = −2 θ̂′T(w^New) + 2 ψ(θ̂) − 2 log(a(w^New));
−(2/N) ∑_{i=1}^N log( f(w_i | θ̂) ) = −2 θ̂′ t̄ + 2 ψ(θ̂) − (2/N) ∑_{i=1}^N log(a(w_i)),  (4.53)

where t̄ is the sample mean of the T(w_i)'s,

t̄ = (1/N) ∑_{i=1}^N T(w_i).  (4.54)

From (4.46),

ERR = −2 E[ log( f(W^New | θ̂) ) ]
    = −2 E[θ̂]′ E[T(W^New)] + 2 E[ψ(θ̂)] − 2 E[log(a(W^New))]
    = −2 E[θ̂]′ µ(θ) + 2 E[ψ(θ̂)] − 2 E[log(a(W))],  (4.55)

because W^New and θ̂ are independent, and µ(θ) is the mean of T(W^New). From (4.47),

E[err] = −(2/N) ∑_{i=1}^N E[ log( f(W_i | θ̂) ) ]
       = −2 E[θ̂′ T̄] + 2 E[ψ(θ̂)] − 2 E[log(a(W))],  (4.56)

because all the a(W_i)'s have the same expected value. Thus from (4.48),

Δ = −2 ( E[θ̂]′ µ(θ) − E[θ̂′ T̄] ) = 2 E[ θ̂′ (T̄ − µ(θ)) ].  (4.57)

Because E[T̄] = µ(θ), E[θ′(T̄ − µ(θ))] = 0, so we can add that in, to obtain

Δ = 2 E[ (θ̂ − θ)′ (T̄ − µ(θ)) ].  (4.58)

Now the MLE of θ has the property that the theoretical mean of T at the MLE equals the sample mean:

µ(θ̂) = t̄.  (4.59)

Expanding µ(θ̂) in a Taylor series about θ, since Σ contains the derivatives of µ as in (4.51),

µ(θ̂) = µ(θ) + Σ(θ*)(θ̂ − θ),  (4.60)

where θ* is between θ̂ and θ. By (4.59), (4.60) can be manipulated to

θ̂ − θ = Σ^{−1}(θ*)(t̄ − µ(θ)) ≈ Σ^{−1}(θ)(t̄ − µ(θ)).  (4.61)

The last step sees us approximating θ* by θ, which can be justified for large N. Inserting (4.61) into (4.58), we have

Δ ≈ 2 E[ (T̄ − µ(θ))′ Σ^{−1}(θ) (T̄ − µ(θ)) ]
  = 2 E[ trace( Σ^{−1}(θ) (T̄ − µ(θ))(T̄ − µ(θ))′ ) ]
  = 2 trace( Σ^{−1}(θ) E[ (T̄ − µ(θ))(T̄ − µ(θ))′ ] )
  = 2 trace( Σ^{−1}(θ) Cov[T̄] )
  = 2 trace( Σ^{−1}(θ) (Σ(θ)/N) )
  = 2 trace(I_q)/N
  = 2 q/N.  (4.62)

Finally, return to (4.48). With Δ ≈ 2q/N, we have that

AIC ≡ err + 2 q/N  (4.63)

is an approximately unbiased estimate of ERR. It is very close to the C_p statistic in (2.59), but does not have the σ̂_e². As in (2.60), the AIC balances observed error and number of parameters. This criterion is very useful in any situation where a number of models are being considered. One calculates the AIC's for the models, then chooses the one with the lowest AIC, or at least one not much larger than the lowest.
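As a small aside (not from the text), R's built-in AIC function is on the scale N × (4.63): AIC(fit) = −2 logLik + 2q, so dividing by N gives the version used here. A quick sketch with an arbitrary fitted model:

fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
N <- nrow(iris)
AIC(fit)/N                                                   # R's AIC, rescaled by N
-2*as.numeric(logLik(fit))/N + 2*attr(logLik(fit),"df")/N    # err + 2q/N, as in (4.63)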

4.3.1 Bayes Information Criterion (BIC)

An alternative criterion arises when using a Bayesian formulation in which each model consists of the density of W | θ, which is f(w | θ), plus a prior density ρ(θ) for the parameter. Instead of trying to estimate the likelihood, we try to estimate the marginal likelihood of the training sample:

f(w_1, ..., w_N) = ∫ f(w_1, ..., w_N | θ) ρ(θ) dθ.  (4.64)

I won't go into the derivation, but the estimate of −2 log(f) is

BIC = −2 log( f̂(w_1, ..., w_N) ) = N err + log(N) q,  (4.65)

where err is the same as before, in (4.47). The prior ρ does not show up in the end. It is assumed N is large enough that the prior is relatively uninformative. Note that BIC/N is the same as the AIC, but with log(N) in place of the "2." Thus for large N, the BIC tends to choose simpler models than the AIC.

An advantage of the BIC is that we can use it to estimate the posterior probabilities of the models. That is, suppose we have M models, each with its own density, set of parameters, and prior, (f_m(w_1, ..., w_N | θ_m), ρ_m(θ_m)). There is also Π_m, the prior probability that model m is the true one. (So ρ_m is the conditional density on θ_m given that model m is the true one.) Then the distribution of the data given model m is

f(Data | Model = m) = f_m(w_1, ..., w_N) = ∫ f_m(w_1, ..., w_N | θ_m) ρ_m(θ_m) dθ_m,  (4.66)

and the probability that model m is the true one given the data is

P[Model = m | Data] = f_m(w_1, ..., w_N) Π_m / ( f_1(w_1, ..., w_N) Π_1 + ··· + f_M(w_1, ..., w_N) Π_M ).  (4.67)

Using the estimate in (4.65), where BICm is that for model m, we can estimate the posterior probabilities,

P̂[Model = m | Data] = e^{−BIC_m/2} Π_m / ( e^{−BIC_1/2} Π_1 + ··· + e^{−BIC_M/2} Π_M ).  (4.68)

If, as is often done, one assumes that the prior probabilities of the models are all equal to 1/M, then (4.68) simplifies even more by dropping the Π_m's.
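Here is a minimal sketch of (4.68) in R with equal prior probabilities; the BIC values are placeholders, not numbers from the text.

bic  <- c(model1 = -505, model2 = -456)             # illustrative BIC values
lw   <- -bic/2
post <- exp(lw - max(lw)) / sum(exp(lw - max(lw)))  # subtract the max for numerical stability
round(post, 4)                                      # estimated posterior model probabilities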

4.3.2 Example: Iris data

In Sections 4.1 and 4.2 we considered two models for the iris data based on whether the covariances are equal. Here we use AIC and BIC to compare the models. For each model,

we need to find the likelihood, and figure out q, the number of free parameters. When using AIC or BIC, we need the joint likelihood of the wi =(yi, xi)’s:

f(y_i, x_i | θ) = f(x_i | y_i, θ) π_{y_i},  (4.69)

where θ contains the µ_k's, Σ_k's (or Σ), and π_k's. The conditional distributions of the X_i's given the Y_i's are multivariate normal as in (4.38) (see (4.11)), hence

−2 log(f(y_i, x_i | θ)) = −2 log(f(x_i | y_i, θ)) − 2 log(π_{y_i})
  = −2 log(f(x_i | µ_k, Σ_k)) − 2 log(π_k)  if y_i = k
  = log(|Σ_k|) + (x_i − µ_k)′Σ_k^{−1}(x_i − µ_k) − 2 log(π_k)  if y_i = k,  (4.70)

dropping the constant p log(2π), which is the same for all the models considered.

Averaging (4.70) over the observations, and inserting the MLE’s, yields the err for the model with different covariances:

err_Diff = (1/N) ∑_{k=1}^K ∑_{y_i=k} [ log(|Σ̂_k|) + (x_i − x̄_k)′Σ̂_k^{−1}(x_i − x̄_k) − 2 log(π̂_k) ]
  = (1/N) ∑_{k=1}^K N_k log(|Σ̂_k|) + (1/N) ∑_{k=1}^K ∑_{y_i=k} (x_i − x̄_k)′Σ̂_k^{−1}(x_i − x̄_k) − (2/N) ∑_{k=1}^K N_k log(π̂_k)
  = ∑_{k=1}^K π̂_k log(|Σ̂_k|) + (1/N) ∑_{k=1}^K ∑_{y_i=k} trace( Σ̂_k^{−1} (x_i − x̄_k)(x_i − x̄_k)′ ) − 2 ∑_{k=1}^K π̂_k log(π̂_k)
  = ∑_{k=1}^K π̂_k log(|Σ̂_k|) + (1/N) ∑_{k=1}^K trace( Σ̂_k^{−1} ∑_{y_i=k} (x_i − x̄_k)(x_i − x̄_k)′ ) − 2 ∑_{k=1}^K π̂_k log(π̂_k)
  = ∑_{k=1}^K π̂_k log(|Σ̂_k|) + (1/N) ∑_{k=1}^K trace( Σ̂_k^{−1} N_k Σ̂_k ) − 2 ∑_{k=1}^K π̂_k log(π̂_k)   (by (4.42))
  = ∑_{k=1}^K π̂_k log(|Σ̂_k|) + (1/N) ∑_{k=1}^K N_k p − 2 ∑_{k=1}^K π̂_k log(π̂_k)   (Σ̂_k is p × p)
  = ∑_{k=1}^K π̂_k log(|Σ̂_k|) + p − 2 ∑_{k=1}^K π̂_k log(π̂_k).  (4.71)

Under the model (4.10) that the covariance matrices are equal, the calculations are very similar but with Σ̂ in place of the three Σ̂_k's. The result is

err_Same = log(|Σ̂|) + p − 2 ∑_{k=1}^K π̂_k log(π̂_k).  (4.72)

The numbers of free parameters q for the two models are counted next (recall K = 3 and p = 4):

Model                    Covariance parameters    Mean parameters    π_k's       Total
Same covariance          p(p+1)/2 = 10            K·p = 12           K−1 = 2     q = 24
Different covariances    K·p(p+1)/2 = 30          K·p = 12           K−1 = 2     q = 44
                                                                                  (4.73)

There are only K − 1 free π_k's because they sum to 1. To find the AIC and BIC, we first obtain the log determinants:

log(|Σ̂_1|) = −13.14817
log(|Σ̂_2|) = −10.95514
log(|Σ̂_3|) = −9.00787
log(|Σ̂|)   = −10.03935.  (4.74)

For both cases, p = 4 and −2 ∑_{k=1}^K π̂_k log(π̂_k) = 2 log(3) = 2.197225. Then

err_Diff = (−13.14817 − 10.95514 − 9.00787)/3 + 4 + 2.197225 = −4.839835,
err_Same = −10.03935 + 4 + 2.197225 = −3.842125.  (4.75)

Thus the model with different covariance matrices has an average observed error about 1 lower than the model with the same covariance. Next we add in the penalties for the AIC (4.63) and BIC (4.65), where N = 150:

                           AIC                                 BIC/N
Different covariances      −4.839835 + 2(44/150) = −4.25       −4.839835 + log(150)(44/150) = −3.37
Same covariance            −3.842125 + 2(24/150) = −3.52       −3.842125 + log(150)(24/150) = −3.04
                                                                                           (4.76)

Even with the penalty added, the model with different covariances is chosen over the one with the same covariance by both AIC and BIC, although for BIC the difference is not as large. Using (4.68), we can estimate the posterior odds for the two models:

P̂[Diff | Data] / P̂[Same | Data] = e^{−BIC_Diff/2} / e^{−BIC_Same/2} = e^{−(1/2)·150×(−3.37)} / e^{−(1/2)·150×(−3.04)} ≈ e^{25}.  (4.77)

Thus the model with different covariances is close to infinitely more probable. In the R Sections 4.1.2 and 4.2.1, cross-validation barely chose the simpler model over that with three covariances, 0.02 versus 0.0267. Thus there seems to be a conflict between AIC/BIC and cross-validation. The conflict can be explained by noting that AIC/BIC are

trying to model the x_i's and y_i's jointly, while cross-validation tries to model the conditional distribution of the y_i's given the x_i's. The latter does not really care about the distribution of the x_i's, except to the extent it helps in predicting the y_i's.
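As a quick arithmetic check of (4.74)–(4.77), plugging in the numbers from the text (this is just a verification, not code from the text):

logdet <- c(-13.14817, -10.95514, -9.00787)       # log|Sigma-hat_k| from (4.74)
errDiff <- mean(logdet) + 4 + 2*log(3)            # (4.75): about -4.8398
errSame <- -10.03935   + 4 + 2*log(3)             # (4.75): about -3.8421
N <- 150
aic  <- c(Diff = errDiff + 2*44/N,      Same = errSame + 2*24/N)      # (4.76), AIC
bicN <- c(Diff = errDiff + log(N)*44/N, Same = errSame + log(N)*24/N) # (4.76), BIC/N
exp(N*(bicN["Same"] - bicN["Diff"])/2)            # posterior odds Diff:Same, about e^25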

Looking at (4.74), it appears that the difference between the models is mainly due to the first group's covariance matrix, whose log determinant is much lower than the others'. The first group is the setosa species, which is easy to distinguish from the other two species. Consider the two models without the setosas. Then we have

                           AIC      BIC/N
Different covariances      −4.00    −3.21
Same covariance            −3.82    −3.30
                                     (4.78)

Here, AIC chooses the different covariances, and BIC chooses the same. The posterior odds here are

P̂[Diff | Data] / P̂[Same | Data] = e^{−BIC_Diff/2} / e^{−BIC_Same/2} = e^{−(1/2)·100×(−3.21)} / e^{−(1/2)·100×(−3.30)} ≈ 0.013.  (4.79)

Thus the simpler model has estimated posterior probability of about 98.7%, suggesting strongly that whereas the covariance for the setosas is different from that for the other two, the versicolor and virginica covariance matrices can be taken to be equal.

4.3.3 Hypothesis testing

Although not directly aimed at classification, the likelihood ratio test of the hypothesis that the covariances are equal is based on the statistic

2 log(Λ) ≡ N (err_Same − err_Diff).  (4.80)

Under the null hypothesis, as N → ∞,

2 log(Λ) → χ²_{q_Diff − q_Same}.  (4.81)

When testing all three species, from (4.75),

2 log(Λ) = 150 × (−3.842125 + 4.839835) = 149.66.  (4.82)

The degrees of freedom for the chi-square are 44 − 24 = 20, so we strongly reject the null hypothesis that the three covariance matrices are equal. If we look at just virginica and versicolor, we find

2 log(Λ) = 100 × (−4.221285 + 4.595208) = 37.39  (4.83)

on 10 degrees of freedom, which is also significant.
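For reference, the chi-square tail probabilities for (4.82) and (4.83) can be checked directly in R:

pchisq(149.66, df = 20, lower.tail = FALSE)  # p-value when testing all three species
pchisq(37.39,  df = 10, lower.tail = FALSE)  # p-value for versicolor versus virginica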

4.4 Other exponential families

The multivariate normal is an exponential family as in (4.49). From (4.11), the natural sufficient statistic consists of all the x_j's and all the products x_j x_k. E.g., for p = 3,

T(x) = (x_1, x_2, x_3, x_1², x_1x_2, x_1x_3, x_2², x_2x_3, x_3²)′.  (4.84)

The natural parameter θ is a rather unnatural function of µ and Σ. Other exponential families have other statistics and parameters. More generally, suppose that the conditional distribution of X_i given Y_i is an exponential family distribution:

X_i | Y_i = k ∼ f(x | θ_k),  (4.85)

where

f(x | θ) = a(x) e^{θ′T(x) − ψ(θ)}.  (4.86)

The best classifier is again (4.9),

G(x) = k that maximizes f_k(x)π_k / ( f_1(x)π_1 + ··· + f_K(x)π_K )
     = k that maximizes e^{d_k(x)} / ( e^{d_1(x)} + ··· + e^{d_{K−1}(x)} + 1 ),  (4.87)

where now

d_k(x) = (θ_k′T(x) − ψ(θ_k) + log(π_k)) − (θ_K′T(x) − ψ(θ_K) + log(π_K)).  (4.88)

(The a(x) cancels.) These d_k's are, like (4.15), linear functions in the T. The classifier can therefore be written

G(x) = k that maximizes d_k(x) = α_k + β_k′T(x),  (4.89)

where

α_k = (−ψ(θ_k) + log(π_k)) − (−ψ(θ_K) + log(π_K))  and  β_k = θ_k − θ_K.  (4.90)

To implement the procedure, we have to estimate the α_k's and β_k's, which is not difficult once we have the estimates of the θ_k's and π_k's. These parameters can be estimated using maximum likelihood, where, as before, π̂_k = N_k/N. This approach depends on knowing the f in (4.86). In the next section we show how to finesse this estimation.

4.5 Conditioning on X: Logistic regression

Suppose the exponential family assumption (4.85) holds for X given Y = k. Then, as from (4.8) and (4.9), the best classifier chooses the k to maximize

P[Y = k | X = x] = p(k | x) = e^{d_k(x)} / ( e^{d_1(x)} + ··· + e^{d_{K−1}(x)} + 1 )  (4.91)

for the d_k's in (4.89). In this section, instead of estimating the parameters using maximum

likelihood of the (y_i, x_i)'s, we will estimate the (α_k, β_k)'s using conditional maximum likelihood, conditioning on the x_i's. The conditional likelihood for y_i is P[Y_i = y_i | X_i = x_i] from (4.91), so that for the entire training sample it is

L((α_1, β_1), ..., (α_{K−1}, β_{K−1}) | Data) = ∏_{i=1}^N p(y_i | x_i).  (4.92)

Bibliography

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

George M. Furnival and Robert W. Wilson, Jr. Regression by leaps and bounds. Technometrics, 16:499–511, 1974.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learn- ing. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

Guy P. Nason. Wavelet shrinkage by cross-validation. Journal of the Royal Statistical Society B, 58:463–479, 1996. URL citeseer.ist.psu.edu/nason96wavelet.html.
