

Breiman, L. (2001). Statistical Modeling: The Two Cultures

Statistical learning reading group
Alexander Ly
Psychological Methods, University of Amsterdam
Amsterdam, 27 October 2015

Overview
1. History of statistical/machine learning
2. Supervised learning
3. Two approaches to supervised learning
4. The general learning procedure
5. Model complexity
6. Conclusion


Definition of machine learning

Arthur Samuel (1959):
"Field of study that gives computers the ability to learn without
being explicitly programmed."

Tom M. Mitchell (1997):
"A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E."

In supervised learning, the correct outcome y_i is present to
supervise the learning.

History

- Evolved from artificial intelligence.
- Success stories (from the 1990s onwards):
  - Spam filters
  - Optical character recognition
  - Natural language processing (search engines)
  - Recommender systems: the Netflix challenge (2006), won by AT&T Labs Research


History

Summary:
- Success correlated with the rise of the internet and reinvented statistics.
- Machine learning terms for statistical concepts:

  Statistics        Machine learning
  ----------------  --------------------
  Estimation        Learning
  Data point        Example/Instance
  Regression        Supervised learning
  Classification    Supervised learning
  Covariate         Feature
  Response          Label

- Ignored by most statisticians, except for Breiman and
  Tibshirani, Hastie, Friedman (Efron, Stanford school).


Breiman 1928-2005

Career: PhD in probability (1954), consulting in the 1960s, professor
at UC Berkeley from the 1980s onwards.

Breiman's consulting experience:
- Prediction problems
- Live with the data before modelling
- The solution should be either an algorithmic or a data model
- Predictive accuracy is the criterion for the quality of the model
- Computers are necessary in practice

Examples of algorithmic techniques/models: classification and
regression trees, bagging, random forests. In the paper Breiman
focusses on supervised learning.

Problem statement: Supervised learning

Problem setting
Based on n pairs of data (x_1, y_1), ..., (x_n, y_n), where the x_i
are features and the y_i are correct labels, predict the future label
y_new given x_new.

Example (classification):
- Features: x_1 = (nose, mouth). Label: y_1 = yes, a face.
- Features: x_2 = (doorbell, hinge). Label: y_2 = not a face.

Problem statement: Supervised learning (continued)

The same problem setting, written as a model:

y = f(x) + \epsilon,        x (input) --> unknown f --> y (output)

For prediction: discover the unknown f that relates the features
(covariates) to the labels (dependent variables).


Machine learning

- Goal: prediction of new data (learn f from the data).
  - Predict future instances based on already observed data.
    Example: "Based on previous data, I predict that it will rain tomorrow."
  - Give a prediction accuracy.
    Example: "Based on previous data, I'm 67% sure that it will rain tomorrow."
- Approach: the data are always right; there are no models, only
  algorithms. Use data models or even algorithmic models to discover
  the relationship f provided by nature.
- Passive: the data are already collected.
- "Big" data.

The "standard" approach in psychology

- Goal: evaluation of theory (evaluate a known f).
- Approach: the model is right, the data could be wrong.
  Evaluate a theory by comparing two models.
- Active: confirmatory analysis based on the model
  (design of experiments, power analysis, etc.).
- "Small" data.

Breiman's critique of the data modelling approach

- Evaluation of theory? ✗ Many times there is no theory; most research
  has an exploratory flavour (80% according to Lakens).
- The model is right? ✗ What if the model is wrong?

  Wrong presumption
  Let the data be generated according to the data model f, where f is
  linear/logistic/...

  Uncritical use of such a model is wrong whenever the model assumption
  is wrong, for instance with bimodal data.
- "Small" data? ✗ Mechanical Turk, fMRI data, genetics, Cito, OSF,
  international many-labs collaborations, etc.


Breiman's critique of the data modelling approach

Wrong presumption
Let the data be generated according to the data model f, where f is
linear/logistic regression/...

y_obs = f(x_obs) + \epsilon,   \epsilon \sim N(0, \sigma^2),

where f is linear, f is logistic, or f is a "network".

The focus is on modelling the distribution of the error, not the f:
in ANOVA, for instance, sums of squared errors, R^2, etc. If the
errors are big, it is implied that the theory is bad; but quantifying
"big" requires assumptions on the error distribution.


Breiman's critique of the data modelling approach

Wrong presumption
Let the data be generated according to the data model f, where f is
linear/logistic regression/...

- To calculate sampling distributions, one is stuck with simple
  (linear) models.
- Conclusions are about the model's mechanism, not about nature's
  mechanism.
- If the model is a poor emulation of nature, the conclusions will be
  wrong.

Breiman's experience from consulting

- The classical (linear) models are tractable, but typically yield bad
  predictions.
- Live with the data before modelling.
- Algorithmic models are also good models.
- Predictive accuracy on a test set is the criterion for how good the
  model is.
- Computers are important.


Recall setting

Problem: with which f does nature generate the data?

y = f(x) + \epsilon,        x (input) --> unknown f --> y (output)

Goal:
- Prediction based on past data: learn (estimate) the unknown f.
- Give a prediction accuracy.

"Live with the data before modelling." The data are split:
- Training set: used to learn f from the data.
- Test set: used to estimate the prediction accuracy.


Recall setting

"Live with the data before modelling." The data are split in three parts:
- Training set: to learn f from the data.
- Validation set: to do model selection.
- Test set: to estimate the prediction accuracy.
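As a minimal sketch of this three-way split (Python with numpy; the
toy data, split sizes, and seed are illustrative assumptions, not from
the talk):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy data: n pairs (x_i, y_i) from a quadratic truth plus noise.
n = 100
x = rng.uniform(3, 7, size=n)
y = x**2 + 2 * x + rng.normal(0, 2, size=n)

# Shuffle once, then cut into training, validation, and test sets.
idx = rng.permutation(n)
train, val, test = idx[:60], idx[60:80], idx[80:]

x_train, y_train = x[train], y[train]
x_val, y_val = x[val], y[val]
x_test, y_test = x[test], y[test]
```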


General learning procedure

Recall that the data set consists of n pairs, say,

(x_1, y_1), (x_2, y_2), ..., (x_10, y_10).

Randomly select training samples, say, (x_1, y_1), ..., (x_8, y_8).
The training set is used to learn f from the data: fill in the true
x_{train,i}, y_{train,i} in

y_{train,i} = f(x_{train,i}) + \epsilon

and find that f for which the loss

\frac{1}{n_{train}} \sum_{i=1}^{n_{train}} \text{Loss}\big(y_{train,i}, f(x_{train,i})\big)

is smallest. Call the minimiser f_trained.
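Continuing the sketch above, f_trained can be found by minimising the
average squared loss over the training pairs. Here the candidate class
F is, as an assumption for illustration, the polynomials of degree
two, fitted with numpy's least-squares routine:

```python
import numpy as np

def average_loss(y, y_hat):
    # Empirical average of the squared loss over the given pairs.
    return np.mean((y - y_hat) ** 2)

# f_trained: the degree-2 polynomial with the smallest average
# squared loss on the training set (ordinary least squares).
f_trained = np.poly1d(np.polyfit(x_train, y_train, deg=2))

train_loss = average_loss(y_train, f_trained(x_train))
```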

General learning procedure

Use the remaining samples as test samples, say, (x_9, y_9), (x_10, y_10).
Fill in f_trained: apply it to each x_{test,i} to yield
y_implied = f_trained(x_{test,i}) and estimate the error between
y_implied and the true y_{test,i}:

\epsilon_{estim} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \text{Loss}\big(y_{test,i}, f_{trained}(x_{test,i})\big)   (the generalisation error).

The error \epsilon_{estim} so estimated serves as the prediction accuracy.

Generalise to a new instance: given the data
(x_1, y_1), ..., (x_10, y_10), (x_new, ?), the final answer for the
yet unseen features x_new uses f_trained to generate y_new:

y_new = f_trained(x_new) \pm \epsilon_{estim}.
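The corresponding test step, continuing the same sketch: apply
f_trained to the held-out test features and average the loss to get
\epsilon_{estim}, then report a prediction for a new instance. The ±
band below uses the root of the mean-squared loss, one reasonable
reading of the slide's y_new = f_trained(x_new) ± \epsilon_{estim}:

```python
# Generalisation error: average loss of f_trained on the test set.
y_implied = f_trained(x_test)
eps_estim = average_loss(y_test, y_implied)

# Final answer for a yet unseen feature x_new, with its accuracy.
x_new = 6.5
print(f"y_new = {f_trained(x_new):.1f} +/- {np.sqrt(eps_estim):.1f}")
```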


Remarks

- The only assumption is that the pairs (x_i, y_i) are iid, that is,
  that they are generated with the same (true, fixed, but unknown) f*.
- The data are assumed to be true.
- Specification of a loss function: the loss function replaces the
  assumptions on the error (typically, Gaussian error in "standard"
  statistics).
- Specification of the collection of functions F that we believe to
  be viable.

Loss functions

For categorical data, typically, the zero-one (all or nothing) loss:

\text{Loss}\big(y_i, f(x_i)\big) =
  \begin{cases} 0 & \text{if } y_i = f(x_i) \\
                1 & \text{if } y_i \neq f(x_i) \end{cases}

Note the hard rule: the observations y_i "supervise" the learning.

Loss functions

For continuous data, typically, the mean-squared error loss:

\text{Loss}\big(Y, f(x)\big) = E\big[(Y - f(x))^2\big].

Here E is the expectation (average) with respect to the true
relationship f* between X and Y. In practice the expectation E is
replaced by the empirical average with respect to the data being true:

\text{Loss}\big(Y, f(x)\big) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2.
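Both losses are one-liners in numpy (a sketch, vectorised over the
data):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # 1 whenever the predicted label misses, 0 otherwise, averaged.
    return np.mean(y != y_hat)

def mean_squared_loss(y, y_hat):
    # The empirical average replacing the expectation over the true f*.
    return np.mean((y - y_hat) ** 2)
```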


Bias-variance trade-off and overfitting

For continuous data, typically, the mean-squared error loss:

\text{Loss}\big(Y, f(x)\big) = E\big[(Y - f(x))^2\big]
  = \text{Var}\big(f(x)\big) + \text{Bias}\big(f(x)\big)^2 + \text{Var}(\epsilon).

Both Var and E, and thus the Bias, are with respect to the true f*.
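The decomposition follows by adding and subtracting the true f*(x); a
compact derivation (the cross terms vanish because E[\epsilon] = 0 and
\epsilon is independent of the fitted f(x)):

```latex
\begin{align*}
E\big[(Y - f(x))^2\big]
  &= E\big[(\epsilon + f^*(x) - f(x))^2\big] \\
  &= \mathrm{Var}(\epsilon) + E\big[(f^*(x) - f(x))^2\big] \\
  &= \mathrm{Var}(\epsilon) + \mathrm{Var}\big(f(x)\big)
     + \underbrace{\big(E[f(x)] - f^*(x)\big)^2}_{\mathrm{Bias}(f(x))^2}.
\end{align*}
```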

Overfitting

\text{Var}\big(f(x)\big)
  + \underbrace{\text{Bias}\big(f(x)\big)^2}_{\text{structural error}}
  + \underbrace{\text{Var}(\epsilon)}_{\text{unavoidable error}}

Recall that f_trained comes from minimising this loss. The more
complicated a candidate f, the smaller its fitted error: everything,
including the unavoidable noise, is treated as structural.
Problem: overfitting.

Overfitting occurs if the f s under consideration are too complicated.
An overcomplicated f generalises badly and is recognised by low bias
and high variance.


Overfitting: an overcomplicated f has low bias, high variance

The true data generating function is f*(x) = x^2 + 2x + \epsilon,
where \epsilon follows a Laplace distribution (thicker tails than the
normal).

[Figure: scatterplot of the raw data, x from 3 to 7, y from roughly 10 to 40.]

[Figure: the same raw data with the true function f* overlaid.]

Overfitting: an overcomplicated f has low bias, high variance

Fitted with a polynomial of order nine.

[Figure: the raw data with the wiggly degree-nine fit overlaid.]

Observe a new test sample/instance x_test = 6.5.

[Figure: the degree-nine fit with the new instance marked at x = 6.5.]


Overfitting: an overcomplicated f has low bias, high variance

Observe the new test sample/instance x_test = 6.5 with y_test = 66:
there is a large loss between y_test and y_implied.

[Figure: the degree-nine fit with the observed point (6.5, 66) far
away from the implied prediction.]
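A sketch reproducing this example: data from the quadratic truth with
Laplace noise, fitted with polynomials of degree nine and degree two,
then evaluated at the new instance x_test = 6.5. The sample size and
seed are illustrative; the degree-9 fit typically wiggles through the
noise and misses the new point badly:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Training data from the true f*(x) = x^2 + 2x with thick-tailed noise.
x = rng.uniform(3, 7, size=15)
y = x**2 + 2 * x + rng.laplace(0, 2, size=15)

# Overcomplicated candidate (degree 9) versus the true complexity (degree 2).
f9 = np.poly1d(np.polyfit(x, y, deg=9))
f2 = np.poly1d(np.polyfit(x, y, deg=2))

x_test = 6.5
print("degree 9 prediction:", f9(x_test))  # high variance: changes wildly per sample
print("degree 2 prediction:", f2(x_test))
print("noise-free truth:   ", x_test**2 + 2 * x_test)  # 55.25
```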

Overfitting: an overcomplicated f has low bias, high variance

Example conclusion: "your expected survival is 12 years \pm 50 years".
A meaningless prediction.

Model complexity

- The true data generation was done with a polynomial of degree 2;
  a polynomial of degree 9 is too complex. Hence, the collection of
  candidate f s should be restricted to those with max degree 2.
- In reality we don't know the "complexity" of the truth. How do we
  choose the collection of candidates F?


Two types of model complexity

1. The structure of f: linear, polynomial (but still a summation of
   terms, thus linear), which is nothing new, the same as in
   statistics; or neural networks (non-linear), support vector
   machines, generalised additive models, kernel smoothing, splines,
   reproducing kernel Hilbert spaces, regression trees.
2. The number of features: n < p (more features than data points).

"Standard" setting Regularisation

The functions f s in are linear F More features than data n < p. Hazard of overfitting Control the complexity with additional tuning parameter The functions f s in are linear (for instance, lasso) Hence, = F F F More data than features p < n. Example:

f (x)=✓ + ✓ x1 + ✓ x2 + ...+ ✓ xp + ✓ (1) 0 1 2 p | | Over complicated structure tuning parameter |{z} Here ✓ acts| as a penalty{z for complexity.} How to tune ? | |
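A minimal numpy/scipy sketch of the penalised objective in (1):
squared-error loss plus \lambda times the \ell_1-norm of the
coefficients, minimised numerically. Illustrative only; real lasso
implementations use specialised solvers such as coordinate descent:

```python
import numpy as np
from scipy.optimize import minimize

def penalised_loss(theta, X, y, lam):
    # Fit term: mean squared error of the linear model X @ theta ...
    mse = np.mean((y - X @ theta) ** 2)
    # ... plus the complexity penalty lambda * |theta|_1 (lasso).
    return mse + lam * np.sum(np.abs(theta))

def fit_lasso(X, y, lam):
    # Derivative-free Nelder-Mead copes with |theta| being
    # non-differentiable at zero; fine for a small-p sketch.
    theta0 = np.zeros(X.shape[1])
    return minimize(penalised_loss, theta0, args=(X, y, lam),
                    method="Nelder-Mead").x
```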


General learning procedure

Split the data into three sets:
- Training set: to learn f from the data.
- Validation set: to tune \lambda (model selection).
- Test set: to derive the prediction accuracy.

Partition \lambda: for the lasso, \lambda = 0 means no regularisation,
and as \lambda \to \infty only the constant function remains viable.
Say \lambda = 0, 2, 2^2, ..., 2^{10}. For each fixed \lambda follow the
general procedure to learn f_{\lambda,trained}; tuning \lambda with
the validation set means repeating the general procedure many times
with a different (fixed) \lambda. We get

F_val = \{ f_{0,trained}, f_{2,trained}, f_{2^2,trained}, ..., f_{2^{10},trained} \}.   (2)

Pick the f_{\lambda,trained} with the lowest average loss on the
validation set:

f_val = \arg\min_{F_val} \frac{1}{n_{val}} \sum_{i=1}^{n_{val}} \text{Loss}\big(f_{\lambda,trained}(x_{i,val}), y_{i,val}\big).
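The tuning loop as a sketch, reusing fit_lasso from above; X_train,
y_train, X_val, y_val, X_test, y_test are assumed to come from a
three-way split as before, now with a feature matrix X:

```python
import numpy as np

# The grid from the slide: lambda = 0, 2, 2^2, ..., 2^10.
lambdas = [0.0] + [2.0**k for k in range(1, 11)]

# One trained candidate per lambda, learned on the training set only.
candidates = {lam: fit_lasso(X_train, y_train, lam) for lam in lambdas}

# Model selection: lowest average loss on the validation set.
def val_loss(theta):
    return np.mean((y_val - X_val @ theta) ** 2)

best_lam = min(candidates, key=lambda lam: val_loss(candidates[lam]))
theta_val = candidates[best_lam]

# Prediction accuracy: estimated once, on the untouched test set.
eps_est = np.mean((y_test - X_test @ theta_val) ** 2)
```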


General learning procedure

Finally, estimate the prediction accuracy by averaging the loss on
the test set:

\epsilon_{est} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \text{Loss}\big(f_{val}(x_{i,test}), y_{i,test}\big).

Conclusion

- Gave the general idea of supervised learning: the roles of the
  training, validation, and test sets.
- Which specific classes F to take? This seminar discusses a couple
  of them: neural networks (non-linear), support vector machines,
  generalised additive models, kernel smoothing, splines, reproducing
  kernel Hilbert spaces, regression trees.
- How to minimise the loss? Technicalities.
- The selection method here was "select the best"; other methods
  include bagging and boosting.

Further organisation

- Changed the name from "machine learning reading group" to
  "statistical learning seminar": reading groups die, and a seminar
  does not require everyone to read everything.
- It does require a small peak of effort when preparing a talk.
- It is not necessary to understand everything; talks can be practical
  or theoretical.
- Still, it is good to read things in advance; websites with YouTube
  clips are also available.