Statistical Modeling: The

History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Overview Breiman, L (2001). Statistical modeling: The two cultures 1 History of statistical/machine learning Statistical learning reading group 2 Supervised learning Alexander Ly 3 Two approaches to supervised learning 4 The general learning procedure Psychological Methods 5 Model complexity University of Amsterdam 6 Conclusion Amsterdam, 27 October 2015 History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Definition of machine learning History Evolved from Arthur Samuel (1959) Artificial intelligence Field of study that gives computers the ability to learn without being explicitly programmed Pattern recognition Success stories (from 1990s onwards): Tom M. Mitchell (1997) Spam filters A computer program is said to learn from experience E with Optical character recognition respect to some class of tasks T and performance measure P, Natural language processing (Search engines) if its performance at tasks in T , as measured by P, improves Recommender systems with experience E. Netflix challenge (2006) won by AT&T labs research The correct outcome yi is present to supervise the learning History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion History History Summary: Success stories (from 1990s onwards): Success correlated with the rise of the internet and Spam filters reinvented statistics. Optical character recognition Machine learning terms for statistical concepts Natural language processing (Search engines) Ignored by most statisticians, except for the Breiman and Recommender systems Tibshirani, Hastie, Friedman (Efron, Stanford school) Netflix challenge (2006) won by AT&T labs research Summary: Statistics Machine learning Estimation Learning Success correlated with the rise of the internet and Data point Example/Instance reinvented statistics. Regression Supervised learning Machine learning terms for statistical concepts Classification Supervised learning Ignored by most statisticians, except for the Breiman and Covariate Feature Tibshirani, Hastie, Friedman (Efron, Stanford school) Response Label History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Breiman 1928 – 2005 Problem statement: Supervised learning Career: 1954 PhD in probability, 1960s Consulting, 1980s onwards professor at UC Berkeley Breiman’s consulting experience Problem setting Prediction problems Based on n pairs of data x1 ,..., xn , where x are features y1 yn i and y are correct labels, predict future labels y given x . Live with the data before modelling i new new Solution should be either an algorithmic or a data model. Example (classification): Predictive accuracy is the criterion for the quality of the Features: x1 =(nose, mouth). Label: y1 = yes, face model Features: x2 =(doorbell, hinge). Label: y2 = no, face Computers necessary in practice Examples of algorithmic techniques/models: Classification and regression trees, bagging, random forest. In the paper Breiman focusses on supervised learning History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Problem statement: Supervised learning Problem statement: Supervised learning Problem setting Based on n pairs of data x1 ,..., xn , where x are features y1 yn i and y are correct labels, predict future labels y given x . Problem setting i new new Based on n pairs of data x1 ,..., xn , where x are features y1 yn i and y are correct labels, predict future labels y given x . i new new y = f (x)+✏ x unknown f y For prediction: discover the unknown f that relates features (covariates) to labels (dependent variables). input output History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Goal: Machine learning Machine learning vs "standard" statistical inference culture Machine learning Predict future instances based on already observed data Goal: Prediction of new data (learn f from the data) Prediction based on past data. Approach: Data are always right, there are no models, only Example: "Based on previous data, I predict that it will rain algorithms tomorrow" Passive: Data are already collected Give a prediction accuracy. "Big" data Example: "Based on previous data, I’m 67% sure that it will "Standard" approach in psychology rain tomorrow" Goal: evaluation of theory (evaluate a known f ) Use data models or even algorithmic models to discover Approach: The model is right, the data could be wrong. the relationship f provided by nature Evaluate theory by comparing two models Active: Confirmatory analysis based on the model: Design of experiments, power analysis, etc etc "Small" data History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion "Standard" approach in psychology Breiman’s: Critique on the data modelling approach Goal: evaluation of theory (evaluate a known f ) X Many time no theory, most research has an exploratory flavour (80% according to Lakens) Approach: The model is right, the data could be wrong. Evaluate theory by comparing two models Wrong presumption X What if the model is wrong? Let data be generated according to the data model f , where f is Active: Confirmatory analysis based on the model: Design linear/logistic regression/... of experiments, power analysis, etc etc Example: Uncritical use of linear regression to, for instance, X Design of experiments, power analysis, etc etc wrong if the bimodal data. model assumption is wrong "Small" data X Mechanical turk, fMRI data, genetics, cito, OSF, international collaboration many labs, etc. History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Breiman’s: Critique on the data modelling approach Breiman’s: Critique on the data modelling approach Wrong presumption Wrong presumption Let data be generated according to the data model f , where f is linear/logistic regression/... Let data be generated according to the data model f , where f is linear/logistic regression/... Focus on modelling the sampling distribution of the error not the f : f is linear known y f ( x )=✏ (0, σ2) x f is logistic y − ⇠ N obs z}|{ obs |{z} f is a "network" Example: in ANOVA,|{z} sums of squared error, R2, etc. If the errors are big, it is implied that the theory is bad. To quantify "big" require sampling distribution. Validation set to do model selection Test set to estimate the prediction accuracy History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Breiman’s: Critique on the data modelling approach Breiman’s experience from consulting Wrong presumption Let data be generated according to the data model f , where f is The classical (linear) models are tractable, but typically linear/logistic regression/... yield bad predictions Live with the data before modelling To calculate sampling distributions stuck with simple Algorithmic models are also good models (Linear) Predictive accuracy on test set is the criterion for how good Conclusion are about the model mechanism, not about the model is nature’s mechanism Computers are important If model is a poor emulation of nature, conclusions will be wrong. History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Recall setting Recall setting Problem: Problem: With which f does nature generate data y = f (x)+✏ Goal: Prediction based on past data. Learn (estimate) unknown f x unknown f y Give a prediction accuracy. "Live with the data before modelling". Data split in three parts: Training set to learn f from the data input output Test set to estimate the prediction accuracy Test set is used to derive the prediction accuracy. Training set is used to learn f from the data. Test set is used to derive the prediction accuracy. History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion Recall setting Recall setting Problem: With which f does nature generate data Problem: With which f does nature generate data y = f (x)+✏ y = f (x)+✏ Goal: Goal: Prediction based on past data. Learn (estimate) unknown f Prediction based on past data. Learn (estimate) unknown f Give a prediction accuracy. Give a prediction accuracy. "Live with the data before modelling". Data split in three parts: "Live with the data before modelling". Data split in three parts: Training set to learn f from the data Training set to learn f from the data Validation set to do model selection Validation set To do model selection Test set to estimate the prediction accuracy History Supervised learning Two cultures Learning Model complexity Conclusion History Supervised learning Two cultures Learning Model complexity Conclusion General learning procedure General learning procedure Randomly select training samples, say, Recall data set

Statistical Modeling: The

Logistic Regression Trained with Different Loss Functions Discussion

Regularized Regression Under Quadratic Loss, Logistic Loss, Sigmoidal Loss, and Hinge Loss

The Central Limit Theorem in Differential Privacy

Bayesian Classifiers Under a Mixture Loss Function

A Unified Bias-Variance Decomposition for Zero-One And

Are Loss Functions All the Same?

Bayes Estimator Recap - Example

1 Basic Concepts 2 Loss Function and Risk

Maximum Likelihood Linear Regression

Lectures on Statistics

Lecture 5: Logistic Regression 1 MLE Derivation

Lecture 2: Estimating the Empirical Risk with Samples CS4787 — Principles of Large-Scale Machine Learning Systems