An Empirical Comparison of Learning Methods ++

Rich Caruana, Cornell University

joint work with Alex Niculescu-Mizil, Cristi Bucila, Art Munson


What is Supervised Learning?

• Training set of labeled cases:

      0 1 1 0 0   0.25 0.16 0.68   --> 0
      0 1 0 1 0   0.20 0.09 0.77   --> 0
      1 0 1 1 1   0.42 0.31 0.54   --> 1
      0 0 1 1 0   0.58 0.29 0.63   --> 1
      1 1 1 0 0   0.18 0.13 0.82   --> 0
      ...
      1 0 1 0 1   0.66 0.43 0.52   --> 1

• Learn a model that predicts the outputs in the train set from the input features
• Use the model to make predictions on (future) cases not used for training:

      0 0 1 1 1   0.64 0.38 0.55   --> ?


Sad State of Affairs: Supervised Learning

• Linear/polynomial Regression
• Logistic/Ridge regression
• K-nearest neighbor
• Linear perceptron
• Decision trees
• SVMs
• Neural nets
• Naïve Bayes
• Bayesian Neural nets
• Bayes Nets (Graphical Models)
• Bagging (bagged decision trees)
• Random Forests
• Boosting (boosted decision trees)
• Boosted Stumps
• ILP (Inductive Logic Programming)
• Rule learners (C2, …)
• Ripper
• Gaussian Processes
• …

• Each algorithm has many variations and free parameters:
  – SVM: margin parameter, kernel, kernel parameters (e.g. gamma), …
  – ANN: # hidden units, # hidden layers, learning rate, momentum, …
  – DT: splitting criterion, pruning options, smoothing options, …
  – KNN: K, distance metric, distance-weighted averaging, …

• Must optimize to each problem:
  – failure to optimize makes a superior algorithm inferior
  – optimization depends on the criterion: e.g., for kNN, k_accuracy ≠ k_ROC ≠ k_RMSE
  – optimization depends on the size of the train set

Sad State of Affairs: Supervised Learning

(the same long list of learning methods, repeated)

• We have been too successful at developing new (improved?) learning methods.


Sad State of Affairs: Supervised Learning

(the same list again, now with a large "?": which of these should you actually use?)

A Real Decision Tree

[figure: a small, readable decision tree with splits such as X < 0 and Y < 0 and Y/N branches]


Not All Decision Trees Are Intelligible

This is part of one high-performing decision tree.

[figure: a fragment of a very large decision tree over features x and y]

A Typical Neural Net

[figure: a multilayer neural net with a linear output unit]


Logistic Regression

[figure: the sigmoid function plotted from -15 to 15 and a scatter of + and - cases;
logistic regression uses a sigmoid output unit]

Sad State of Affairs: Supervised Learning

(the same list of methods and free parameters as before: every algorithm must be tuned to each problem)


Questions

• Is one algorithm "better" than the others?
• Are some learning methods best for certain loss functions?
  – SVMs for classification?
  – ANNs for regression or predicting probabilities?
• If no method(s) dominate, can we at least ignore some algorithms?
• Why are some methods good on loss X but poor on loss Y?
• How do different losses relate to each other?
• Are some losses "better" than others?
• …
• What should you use???

Data Sets

• 8 binary classification data sets (now 10 sets)
  – UCI: Adult, Cover Type, Letter.p1, Letter.p2
  – "Real": Pneumonia, Hyper Spectral, SLAC Particle Physics, Mg


Training Data

• 4k train sets
• 1k validation sets
• Large final test sets (usually 20k)


Performance Metrics

• Threshold Metrics:
  – Accuracy
  – F-Score
  – Lift
• Ordering/Ranking Metrics:
  – ROC Area
  – Average Precision
  – Precision/Recall Break-Even Point
• Probability Metrics:
  – Root-Mean-Squared-Error
  – Cross-Entropy
  – Probability Calibration

Normalized Scores

• Difficulty:
  – for some metrics, 1.00 is best (e.g. ACC)
  – for some metrics, 0.00 is best (e.g. RMS)
  – for some metrics, the baseline is 0.50 (e.g. AUC)
  – for some metrics, the best value depends on the data (e.g. Lift)
  – on some problems/metrics, 0.60 is excellent performance
  – on some problems/metrics, 0.99 is poor performance

• Solution, Normalized Scores:
  – baseline performance => 0.00
  – best observed performance => 1.00 (proxy for Bayes optimal)
  – puts all metrics/problems on equal footing
  (a small code sketch of this rescaling follows the next slide)


Massive Empirical Comparison

        10 learning methods
   X    100's of parameter settings per method
   X    5-fold cross validation
   =    10,000+ models trained per problem
   X    11 Boolean classification test problems
   =    110,000+ models
   X    9 performance metrics
   =    1,000,000+ model evaluations
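A minimal sketch of the normalized-score idea described above (not the authors' exact code): baseline performance maps to 0.0 and the best observed performance maps to 1.0, so metrics where lower is better (e.g. RMS) and metrics where higher is better (e.g. ACC, AUC) land on the same footing. The example numbers are illustrative only.

```python
def normalized_score(raw: float, baseline: float, best_observed: float) -> float:
    """Linearly rescale a raw metric value so baseline -> 0.0 and best -> 1.0.

    Works for both orientations: if the metric is 'lower is better'
    (best_observed < baseline) the formula still maps baseline to 0 and best to 1.
    """
    if best_observed == baseline:
        return 0.0  # degenerate case: nothing in the study beat the baseline
    return (raw - baseline) / (best_observed - baseline)

# Example: accuracy with a 0.75 majority-class baseline and 0.93 best observed.
print(normalized_score(0.89, baseline=0.75, best_observed=0.93))   # ~0.78
# Example: RMSE where the baseline predictor gets 0.45 and the best model 0.28.
print(normalized_score(0.33, baseline=0.45, best_observed=0.28))   # ~0.71
```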

Look at Predicting Probabilities First

• Why?
  – don't want to hit you with results for nine metrics all at once
  – if you can predict the correct conditional probabilities, you're done: all
    reasonable performance metrics are optimized by predicting the true probabilities
  – results for probabilities are interesting by themselves*

  * Alex Niculescu-Mizil won the best student paper award at ICML 2005 for this work
    on predicting probabilities.


Results on Test Sets (Normalized Scores)

Probability Metrics

Model        Squared Error   Cross-Entropy   Calibration   Mean
ANN              0.872           0.878           0.826     0.859
BAG-DT           0.875           0.901           0.637     0.804
RND-FOR          0.882           0.899           0.567     0.783
KNN              0.783           0.769           0.684     0.745
LOG-REG          0.614           0.620           0.678     0.637
DT               0.583           0.638           0.512     0.578
BST-DT           0.596           0.598           0.045     0.413
SVM [0,1]        0.484           0.447           0.000     0.310
BST-STMP         0.355           0.339           0.123     0.272
NAÏVE-B          0.271           0.000           0.000     0.090

• Best probabilities overall: Neural Nets, Bagged Decision Trees, Random Forests
• Not competitive: boosted decision trees and stumps (exponential loss), SVMs
  (standard hinge loss)
• SVM predictions scaled to [0,1] via simple min/max scaling

Bagged Decision Trees

• Draw 100 bootstrap samples of the data
• Train a tree on each sample -> 100 trees
• Un-weighted average of the trees' predictions, e.g.:

      (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / #Trees = 0.24

• Highly under-rated!


Bagging Results

[figure: performance of 100 bagged trees vs. a single tree]
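A minimal sketch of the bagged-trees recipe on the slide above (not the authors' code): draw bootstrap samples, train one unpruned probabilistic tree per sample, and average the predicted probabilities. The toy dataset and tree settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # toy data
rng = np.random.default_rng(0)

n_trees = 100
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])   # unpruned probabilistic tree
    trees.append(tree)

# Un-weighted average of the per-tree probability estimates.
p_bagged = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
print("first few bagged probabilities:", np.round(p_bagged[:5], 3))
```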

Random Forests (Bagged Trees ++)

• Draw 1000+ bootstrap samples of the data
• Draw a random sample of the available attributes at each split
• Train a tree on each sample/attribute set -> 1000+ trees
• Un-weighted average of the trees' predictions, e.g.:

      (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / #Trees = 0.24


Calibration & Reliability Diagrams

• If on 100 days the forecast is a 70% chance of rain, the predictions are well
  calibrated at p = 0.70 if it actually rains on about 70 of those 100 days
• Put cases with predicted values between 0 and 0.1 in the first bin, between 0.1 and
  0.2 in the second bin, …:

      0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

• For each bin, plot the mean predicted value against the true fraction of positives
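A sketch of the reliability-diagram construction just described (illustrative, not the authors' plotting code): bin predictions into ten equal-width bins and compare the mean predicted probability in each bin with the observed fraction of positives. The synthetic labels are generated to be perfectly calibrated, so the points should fall near the diagonal.

```python
import numpy as np

def reliability_curve(y_true, p_pred, n_bins=10):
    """Return (mean predicted value, fraction of positives) for each non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(p_pred[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

# A perfectly calibrated model puts these points on the diagonal y = x.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 10000)
y = (rng.uniform(0, 1, 10000) < p).astype(int)   # labels drawn with probability p
print(reliability_curve(y, p))
```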

Back to SVMs: Results on Test Sets

Probability Metrics

Model        Squared Error   Cross-Entropy   Calibration   Mean
ANN              0.872           0.878           0.826     0.859
BAG-DT           0.875           0.901           0.637     0.804
RND-FOR          0.882           0.899           0.567     0.783
KNN              0.783           0.769           0.684     0.745
LOG-REG          0.614           0.620           0.678     0.637
DT               0.583           0.638           0.512     0.578
BST-DT           0.596           0.598           0.045     0.413
SVM [0,1]        0.484           0.447           0.000     0.310
BST-STMP         0.355           0.339           0.123     0.272
NAÏVE-B          0.271           0.000           0.000     0.090

• Best probabilities overall: Neural Nets, bagged probabilistic trees, Random Forests
• Not competitive: boosted decision trees and stumps (with exponential loss), SVMs
  (with hinge loss)
• SVM predictions scaled to [0,1] via simple min/max scaling


SVM Reliability Plots

[figure: reliability plots for SVM predictions on several problems]

Platt Scaling by Fitting a Sigmoid

• Linear scaling of SVM predictions from [-∞, +∞] to [0,1] is bad
• Platt's Method [Platt 1999]: obtain posterior probabilities from SVMs by fitting a sigmoid
  – scale predictions by fitting a sigmoid on a validation set, using 3-fold CV and
    Bayes-motivated smoothing to avoid overfitting
  (a code sketch of the sigmoid fit appears after the results table below)


Results After Platt Scaling SVMs

Probability Metrics

Model        Squared Error   Cross-Entropy   Calibration   Mean
ANN              0.872           0.878           0.826     0.859
SVM-PLT          0.882           0.880           0.769     0.844
BAG-DT           0.875           0.901           0.637     0.804
RND-FOR          0.882           0.899           0.567     0.783
KNN              0.783           0.769           0.684     0.745
LOG-REG          0.614           0.620           0.678     0.637
DT               0.583           0.638           0.512     0.578
BST-DT           0.596           0.598           0.045     0.413
SVM [0,1]        0.484           0.447           0.000     0.310
BST-STMP         0.355           0.339           0.123     0.272
NAÏVE-B          0.271           0.000           0.000     0.090

• SVM probabilities are as good as Neural Net probabilities after scaling with Platt's Method
• SVMs are slightly better than Neural Nets on 2 of the 3 metrics!
• Would other learning methods benefit from calibration with Platt's Method?
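A hedged sketch of Platt-style calibration (after Platt 1999): fit a sigmoid p = 1 / (1 + exp(A*f + B)) that maps raw scores f to probabilities, using held-out predictions. This version simply fits a one-dimensional logistic regression on the scores, which captures the idea; Platt's original method additionally uses smoothed target values, omitted here for brevity. The names svm_scores_val / svm_scores_test in the usage note are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_val, y_val):
    """Fit the sigmoid on validation-set scores; return a score -> probability map."""
    lr = LogisticRegression(C=1e6)               # ~unregularized 1-D logistic fit
    lr.fit(scores_val.reshape(-1, 1), y_val)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# Usage: the scores would be decision_function outputs of an already-trained SVM.
# calibrate = fit_platt(svm_scores_val, y_val)
# p_test = calibrate(svm_scores_test)
```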

Results After Platt Scaling SVMs

(Acc = accuracy, FSc = F-score, ROC = ROC area, APR = average precision, BEP =
precision/recall break-even point, RMS = squared error, MXE = cross-entropy, Cal =
calibration)

Model      Acc    FSc    Lift   ROC    APR    BEP    RMS    MXE    Cal    Mean
ANN        0.817  0.875  0.947  0.963  0.926  0.929  0.872  0.878  0.826  0.892
SVM-PLT    0.823  0.851  0.928  0.961  0.931  0.929  0.882  0.880  0.769  0.884
BAG-DT     0.836  0.849  0.953  0.972  0.950  0.928  0.875  0.901  0.637  0.878
RND-FOR    0.844  0.845  0.958  0.977  0.957  0.948  0.882  0.899  0.567  0.875
KNN        0.759  0.839  0.914  0.937  0.893  0.898  0.783  0.769  0.684  0.831
BST-DT     0.861  0.885  0.956  0.977  0.958  0.952  0.596  0.598  0.045  0.758
DT         0.612  0.789  0.856  0.871  0.789  0.808  0.583  0.638  0.512  0.717
BST-STMP   0.701  0.782  0.898  0.926  0.871  0.854  0.355  0.339  0.123  0.650
LOG-REG    0.602  0.623  0.829  0.849  0.732  0.714  0.614  0.620  0.678  0.696
NAÏVE-B    0.414  0.637  0.746  0.767  0.698  0.689  0.271  0.000  0.000  0.469

• Boosted trees outperform everything else on 5 of the 6 non-probability metrics.
• But boosting predicts poor probabilities.


Results After Platt Scaling SVMs: best model on each problem and metric

Problem      Acc     FSc     Lift    ROC     APR     BEP     RMS     MXE     Cal
Adult        BST-ST  DT      BST-ST  BST-ST  BST-ST  BST-ST  BAG-DT  BAG-DT  LOGREG
Covtype      BST-DT  BST-DT  BST-DT  BST-DT  BST-DT  BST-DT  RF      BAG-DT  KNN
HyperSpec    ANN     ANN     ANN     BST-DT  BST-DT  BST-DT  ANN     ANN     ANN
Letter 1     BST-DT  BST-DT  SVM     BST-DT  BST-DT  BST-DT  KNN     KNN     ANN
Letter 2     BST-DT  BST-DT  KNN     BST-DT  BST-DT  BST-DT  KNN     KNN     ANN
Medis        BST-ST  NB      ANN     ANN     ANN     ANN     ANN     ANN     ANN
MG2          BAG-DT  DT      BST-ST  BAG-DT  BAG-DT  BAG-DT  BAG-DT  BAG-DT  ANN
SLAC         RF      SVM     BAG-DT  RF      BAG-DT  RF      BAG-DT  BAG-DT  ANN

Summary of Model Performances

Model      Best Count   Mean NS
ANN            17        0.892
SVM-PLT         2        0.884
BAG-DT         13        0.878
RND-FOR         4        0.875
KNN             6        0.831
BST-DT         19        0.758
DT              2        0.717
BST-STMP        7        0.650
LOG-REG         1        0.696
NAÏVE-B         1        0.469


Smart Model ≠ Good Probs

• A model can be accurate yet poorly calibrated
  – accuracy is only sensitive to which side of the threshold a case falls on
  – use a threshold ≠ 0.5 if the model is poorly calibrated

• A model can have good ROC (Google-like ordering) yet predict poor probabilities
  – ROC is insensitive to scaling/stretching
  – only the ordering has to be correct, not the probabilities

Ada Boosting

• Initialization:
  – weight all training samples equally

• Iteration (typically requires 100's to 1000's of iterations):
  – train a model on the (weighted) train set
  – compute the error of the model on the train set
  – increase the weights on the cases the model gets wrong

• Return final model:
  – carefully weighted prediction of each model

  (a code sketch of this loop appears after the next slide)


Why Boosting is Not Well Calibrated

• Predicted values are pushed away from 0 and 1
• Calibration becomes increasingly worse as boosting proceeds
• The shape of the reliability plot becomes sigmoidal
• Looks a lot like SVM predictions
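A minimal sketch of the AdaBoost loop described above (discrete AdaBoost with decision stumps; illustrative, not the exact configuration used in the study). The dataset, 200 rounds, and stump depth are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
y_pm = 2 * y - 1                      # labels in {-1, +1}
w = np.full(len(X), 1.0 / len(X))     # weight all training samples equally

stumps, alphas = [], []
for _ in range(200):                                  # typically 100's to 1000's of rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = 2 * stump.predict(X) - 1
    err = np.sum(w * (pred != y_pm)) / np.sum(w)      # weighted training error
    err = np.clip(err, 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)             # weight of this model in the final vote
    w *= np.exp(-alpha * y_pm * pred)                 # increase weight on mistakes
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final model: carefully weighted prediction of each base model.
score = sum(a * (2 * s.predict(X) - 1) for a, s in zip(alphas, stumps))
print("training accuracy of the boosted ensemble:", np.mean(np.sign(score) == y_pm))
```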

Consistent With Interpretations of Boosting

• Boosting is (almost) a maximum-margin method (Schapire et al. 1998; Rosset et al. 2004)
  – trades lower margin on easy cases for higher margin on harder cases

• Boosting is an additive model (Friedman, Hastie & Tibshirani 2000)
  – tries to fit the logit of the true conditional probabilities

• Boosting is an equalizer (Breiman 1998; Friedman, Hastie & Tibshirani 2000)
  – the weighted proportion of times an example is misclassified by the base learners
    tends to be the same for all training cases


Platt Scaling of Boosted Trees (7 problems)

Before (Ada Boost with exponential loss):
[figure: reliability plots for problems P1-P7]

After Platt Scaling:
[figure: reliability plots for problems P1-P7]

Results After Platt Scaling All Models

Model      Acc    FSc    Lift   ROC    APR    BEP    RMS    MXE    Cal    Mean
BST-DT     0.860  0.854  0.956  0.977  0.958  0.952  0.929  0.932  0.808  0.914
RND-FOR    0.866  0.871  0.958  0.977  0.957  0.948  0.892  0.898  0.702  0.897
ANN        0.817  0.875  0.947  0.963  0.926  0.929  0.872  0.878  0.826  0.892
SVM        0.823  0.851  0.928  0.961  0.931  0.929  0.882  0.880  0.769  0.884
BAG-DT     0.836  0.849  0.953  0.972  0.950  0.928  0.875  0.901  0.637  0.878
KNN        0.759  0.820  0.914  0.937  0.893  0.898  0.786  0.805  0.706  0.835
BST-STMP   0.698  0.760  0.898  0.926  0.871  0.854  0.740  0.783  0.678  0.801
DT         0.611  0.771  0.856  0.871  0.789  0.808  0.586  0.625  0.688  0.734
LOG-REG    0.602  0.623  0.829  0.849  0.732  0.714  0.614  0.620  0.678  0.696
NAÏVE-B    0.536  0.615  0.786  0.833  0.733  0.730  0.539  0.565  0.161  0.611

• After Platt Scaling, boosted trees are the best models overall across all metrics
• Neural nets are the best models overall if no calibration is applied post-training


Revenge of the Decision Tree!

Probability Metrics

Model        Squared Error   Cross-Entropy   Calibration   Mean
BST-DT           0.929           0.932           0.808     0.890
ANN              0.872           0.878           0.826     0.859
SVM-PLT          0.882           0.880           0.769     0.844
RND-FOR          0.892           0.898           0.702     0.831
BAG-DT           0.875           0.901           0.637     0.804
KNN              0.786           0.805           0.706     0.766
BST-STMP         0.740           0.783           0.678     0.734
LOG-REG          0.614           0.620           0.678     0.637
DT               0.586           0.625           0.688     0.633
NAÏVE-B          0.539           0.565           0.161     0.422

• Models that benefit from calibration:
  – SVMs
  – boosted decision trees
  – boosted stumps
  – random forests
  – Naïve Bayes
  – vanilla decision trees
• Models that do not benefit from calibration:
  – neural nets
  – bagged trees
  – logistic regression
  – MBL/KNN
• Boosting full trees dominates

Methods for Achieving Calibration

• Optimize directly to an appropriate criterion:
  – Boosting with log-loss (Collins, Schapire & Singer 2001)
  – SVMs trained to maximize likelihood (e.g. Wahba 1999); yields a non-sparse solution
  – no need for post-training calibration with these approaches

• Train models with the "usual" criterion and post-calibrate:
  – Logistic Correction: analytic method justified by Friedman et al.'s analysis
  – Platt Scaling (Platt 1999): the method used by Platt to calibrate SVMs by fitting
    a sigmoid; is a sigmoid the right calibration function for most learning methods?
  – Isotonic Regression: very general calibration method used by Zadrozny & Elkan (2001);
    fit with the PAV (Pair Adjacent Violators) algorithm (Ayer et al. 1955), which is
    optimal for squared error and runs in linear time


Boosting with Log-Loss

Probability Metrics

Model               Squared Error   Cross-Entropy   Calibration   Mean
BST-DT PLATT            0.929           0.932           0.808     0.890
ANN                     0.872           0.878           0.826     0.859
SVM-PLT                 0.882           0.880           0.769     0.844
RND-FOR                 0.892           0.898           0.702     0.831
BAG-DT                  0.875           0.901           0.637     0.804
KNN                     0.786           0.805           0.706     0.766
BST-STMP PLATT          0.740           0.783           0.678     0.734
LOG-REG                 0.614           0.620           0.678     0.637
DT                      0.586           0.625           0.688     0.633
BST-DT LOG-LOSS         0.581           0.590           0.231     0.467
BST-STMP LOG-LOSS       0.627           0.659           0.094     0.460
NAÏVE-B                 0.539           0.565           0.161     0.422

• Log-loss does improve the calibration of boosting, but:
  – it is most effective with weak models such as 1-level stumps
  – it is less effective with more complex models such as full decision trees (or even
    2-level stumps)
• Post-calibration with Platt's Method is far more effective, particularly with complex
  models such as full trees
• The best probabilities come from full trees boosted with exponential loss and then
  calibrated with Platt's Method

Isotonic Regression

(the Boosting with Log-Loss results above are repeated on this page)

• Basic assumption: there exists an isotonic (monotonically increasing) function m such
  that the targets are a noisy monotone transform of the model's predictions f(x):

      y_i = m(f(x_i)) + ε_i

• We want to find the isotonic function m that best fits the data, e.g. in the
  squared-error sense:

      m̂ = argmin_z Σ_i ( y_i − z(f(x_i)) )²,   with z ranging over isotonic functions

• Bianca Zadrozny and Charles Elkan (2001) were the first to use isotonic regression for
  calibration in the ML community

PAV Algorithm

[figure: the Pair Adjacent Violators algorithm repeatedly pools adjacent values that
violate monotonicity until the fitted calibration function is non-decreasing]


Isotonic Regression (7 problems)

• Before: [figure: reliability plots for problems P1-P7]
• After:  [figure: reliability plots for problems P1-P7 after isotonic regression]
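A sketch of isotonic-regression calibration in the spirit of Zadrozny & Elkan (2001): fit a monotonically non-decreasing map from raw scores to probabilities on held-out data. scikit-learn's IsotonicRegression implements the PAV fit; this is illustrative, not the authors' code, and scores_val / scores_test are assumed outputs of an already-trained model.

```python
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(scores_val, y_val):
    """Return a calibration map learned with PAV on held-out scores and labels."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores_val, y_val)
    return iso.predict    # callable: raw scores -> calibrated probabilities

# Usage:
# calibrate = fit_isotonic(scores_val, y_val)
# p_test = calibrate(scores_test)
```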

Platt Scaling (7 problems)

• Before: [figure: reliability plots for problems P1-P7]
• After:  [figure: reliability plots for problems P1-P7 after Platt Scaling]


Platt Scaling vs. Isotonic Regression

• Platt Scaling:       [figure: fitted calibration functions for problems P1-P7]
• Isotonic Regression: [figure: fitted calibration functions for problems P1-P7]


Summary: Before/After Calibration (Squared Error)

[figure: squared error before and after calibration for each learning method]

Summary

• Boosting full trees outperforms boosting weaker models
• Calibration via Platt Scaling or Isotonic Regression improves the probabilities from
  max-margin methods such as boosted trees and SVMs
• When boosting trees, calibration via Platt Scaling or Isotonic Regression is more
  effective than boosting with log-loss
• Platt Scaling is better with small data (< 1000 points)
• Isotonic Regression is better with large data (> 1000 points)
• Not all learning methods benefit from calibration
• Before calibration, well-tuned neural nets predict the best probabilities
• After calibration, boosted probabilistic decision trees predict the best probabilities


Where Does That Leave Us?

• Boosted Trees + Calibration are the best overall
• Are we done?
• No!

Best of the Best of the Best

(BEST = the best individual model selected separately for each problem and metric)

Model      Acc    FSc    Lift   ROC    APR    BEP    RMS    MXE    Cal    Mean
BEST       0.928  0.918  0.975  0.987  0.958  0.958  0.919  0.944  0.989  0.9533
BST-DT     0.860  0.854  0.956  0.977  0.958  0.952  0.929  0.932  0.808  0.914
RND-FOR    0.866  0.871  0.958  0.977  0.957  0.948  0.892  0.898  0.702  0.897
ANN        0.817  0.875  0.947  0.963  0.926  0.929  0.872  0.878  0.826  0.892
SVM        0.823  0.851  0.928  0.961  0.931  0.929  0.882  0.880  0.769  0.884
BAG-DT     0.836  0.849  0.953  0.972  0.950  0.928  0.875  0.901  0.637  0.878
KNN        0.759  0.820  0.914  0.937  0.893  0.898  0.786  0.805  0.706  0.835
BST-STMP   0.698  0.760  0.898  0.926  0.871  0.854  0.740  0.783  0.678  0.801
DT         0.611  0.771  0.856  0.871  0.789  0.808  0.586  0.625  0.688  0.734
LOG-REG    0.602  0.623  0.829  0.849  0.732  0.714  0.614  0.620  0.678  0.696
NAÏVE-B    0.536  0.615  0.786  0.833  0.733  0.730  0.539  0.565  0.161  0.611

• Selecting the best of all learning methods yields significantly better performance
• No one model type can do it all (yet)
• If you really need the best performance, you need to try multiple learning algorithms
• Looking at individual problems, there is even more variation in what works best:
  – boosted stumps outperform boosted trees on 3 of 8 problems!
  – logistic regression is the best model on one problem!

If we need to train all the models and pick the best, can we do better than just picking the best?


Current Ensemble Methods

• Bagging
• Boosting
• Random Forests
• Error Correcting Output Codes (ECOC)
• Average of multiple models
• Bayesian Model Averaging
• Stacking
• …

• Ensemble methods differ in:
  – how models are generated
  – how models are combined

  "A necessary and sufficient condition for an ensemble of classifiers to be more
   accurate than any of its individual members is if the classifiers are accurate
   and diverse."  -- Tom Dietterich (2000)

Normalized Scores of Ensembles

Model      Acc     FSc     Lift    ROC     APR     BEP     RMS     MXE     Cal     Mean
BAYESAVG   0.9258  0.8906  0.9785  0.9851  0.9773  0.9557  0.9504  0.9585  0.9871  0.9566
BEST       0.9283  0.9188  0.9754  0.9876  0.9588  0.9581  0.9194  0.9443  0.9891  0.9533
AVG_ALL    0.8363  0.8007  0.9815  0.9878  0.9721  0.9606  0.8271  0.8086  0.9856  0.9067
STACK_LR   0.2753  0.7772  0.8352  0.7992  0.7860  0.8469  0.3317 -0.9897  0.8221  0.4982
BST-DT     0.860   0.854   0.956   0.977   0.958   0.952   0.929   0.932   0.808   0.914
RND-FOR    0.866   0.871   0.958   0.977   0.957   0.948   0.892   0.898   0.702   0.897
ANN        0.817   0.875   0.947   0.963   0.926   0.929   0.872   0.878   0.826   0.892
SVM        0.823   0.851   0.928   0.961   0.931   0.929   0.882   0.880   0.769   0.884
BAG-DT     0.836   0.849   0.953   0.972   0.950   0.928   0.875   0.901   0.637   0.878
KNN        0.759   0.820   0.914   0.937   0.893   0.898   0.786   0.805   0.706   0.835
BST-STMP   0.698   0.760   0.898   0.926   0.871   0.854   0.740   0.783   0.678   0.801


New Ensemble Method: ES

• Train many different models:
  – different algorithms
  – different parameter settings
  – all trained on the same train set
  – all trained to their "natural" optimization criterion

• Add all the models to a library:
  – no model selection
  – no validation set
  – some models are bad, some are good, a few are excellent
  – yields a diverse set of models, some of which are good on almost any metric

• Forward stepwise model selection from the library:
  – start with an empty ensemble
  – try adding each model one at a time to the ensemble
  – commit the model that maximizes performance on the metric on a test set
  – repeat until performance stops getting better
  (a code sketch of this selection loop appears after the Forward Selection With
   Replacement slide below)

Basic Ensemble Selection Algorithm (worked example)

• Model library of 9 models, each scored by AUC on the 1k validation set:

      Model 1  0.8453     Model 4  0.8142     Model 7  0.9024
      Model 2  0.8726     Model 5  0.8453     Model 8  0.7034
      Model 3  0.9164     Model 6  0.8745     Model 9  0.8342

• Step 1: Model 3 has the best AUC (0.9164), so it becomes the initial ensemble.

• Step 2: each model is tentatively averaged into the current ensemble and the
  ensemble AUC is recomputed:

      Model 1 + ensemble = 0.8327     Model 6 + ensemble = 0.8832
      Model 2 + ensemble = 0.8702     Model 7 + ensemble = 0.9126
      Model 4 + ensemble = 0.9284     Model 8 + ensemble = 0.8245
      Model 5 + ensemble = 0.9047     Model 9 + ensemble = 0.9384

  Model 9 improves the ensemble the most, so it is committed (ensemble AUC 0.9384).

• Step 3: the same trial-and-commit step is repeated:

      Model 1 + ensemble = 0.8502     Model 6 + ensemble = 0.9424
      Model 2 + ensemble = 0.9243     Model 7 + ensemble = 0.9045
      Model 4 + ensemble = 0.8992     Model 8 + ensemble = 0.9243
      Model 5 + ensemble = 0.8090

  Model 6 is committed, giving an ensemble of Models 3, 9, and 6 with AUC 0.9424.

• Selection continues until performance on the hillclimb set stops improving.

Big Problem: Overfitting

• More models ==> a better chance of finding a combination with good performance on any
  given problem and metric, but …
• … also a better chance of overfitting to the hillclimb set

• Tricks to reduce overfitting:
  – eliminate inferior models: prevents mistakes
  – ensemble initialization: gives "inertia" to the initial ensemble
  – stepwise selection with replacement: makes the stopping point less critical
  – calibrate the models in the ensemble: all models speak the same language
  – bagged ensemble selection: reduces variance

• It is critical to take steps to reduce overfitting


1st Trick: Ensemble Initialization

• Instead of starting with an empty ensemble, initialize with the best N models

2nd Trick: Forward Selection With Replacement

• After initializing the ensemble with the best N models …

• Forward selection with replacement:
  – add each model one at a time to the ensemble (models are added by averaging predictions)
  – calculate the performance metric
  – commit the model that improves performance the most
  – repeat until the ensemble is too large (we typically use ~ 250 steps)
  – return the ensemble with the best performance on the validation set

• Models may be selected more than once:
  – models added 3 times have 3X the weight of models added once
  – this simple form of model weighting is less prone to overfitting
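A hedged sketch of the ensemble-selection procedure from the last few slides: greedy forward selection with replacement from a model library, hillclimbing a chosen metric on a validation set. library_preds is assumed to map model names to predicted probabilities on the hillclimb set; the metric, init_n, and max_steps are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_selection(library_preds, y_val, metric=roc_auc_score,
                       init_n=5, max_steps=250):
    names = list(library_preds)
    # 1st trick: initialize with the best N individual models.
    ranked = sorted(names, key=lambda n: metric(y_val, library_preds[n]), reverse=True)
    selected = ranked[:init_n]
    best_sel = list(selected)
    best_score = metric(y_val, np.mean([library_preds[n] for n in selected], axis=0))
    # 2nd trick: selection WITH replacement; a model chosen k times gets weight k.
    for _ in range(max_steps):
        scores = {}
        for n in names:   # try adding each model (possibly again) to the ensemble
            preds = [library_preds[m] for m in selected] + [library_preds[n]]
            scores[n] = metric(y_val, np.mean(preds, axis=0))
        pick = max(scores, key=scores.get)
        selected.append(pick)
        if scores[pick] > best_score:          # remember the best ensemble seen so far
            best_score, best_sel = scores[pick], list(selected)
    return best_sel, best_score
```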

3rd Trick: Bagged Ensemble Selection

• Draw a random sample of the models in the library (we use p = 0.5)
• Do ensemble selection from this sample of models
• Repeat N times (we use N = 20)
• The final model is the average of the N ensembles
  – each ensemble is a simple weighted average of the base-level models
  – the average of N such ensembles is therefore also a simple weighted average of the
    base-level models
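A sketch of the bagged variant just described, assuming the ensemble_selection() helper from the previous sketch: run selection on random halves of the model library and average the resulting ensembles. p = 0.5 and n_bags = 20 follow the slide; everything else is illustrative.

```python
import random

def bagged_ensemble_selection(library_preds, y_val, n_bags=20, p=0.5, seed=0):
    rng = random.Random(seed)
    names = list(library_preds)
    weights = {n: 0 for n in names}              # final model = weighted average of models
    for _ in range(n_bags):
        bag = {n: library_preds[n] for n in names if rng.random() < p}
        if not bag:
            continue
        sel, _ = ensemble_selection(bag, y_val)  # selection within this bag of models
        for n in sel:
            weights[n] += 1                      # each pick adds weight to that model
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items() if w > 0}
```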

Benefit of Bagged Selection

[figure: ensemble-selection performance with and without bagging of the model library]


Normalized Scores for ES

Model      Acc     FSc     Lift    ROC     APR     BEP     RMS     MXE     Cal     Mean
ES         0.9560  0.9442  0.9916  0.9965  0.9846  0.9786  0.9795  0.9808  0.9877  0.9777
BAYESAVG   0.9258  0.8906  0.9785  0.9851  0.9773  0.9557  0.9504  0.9585  0.9871  0.9566
BEST       0.9283  0.9188  0.9754  0.9876  0.9588  0.9581  0.9194  0.9443  0.9891  0.9533
AVG_ALL    0.8363  0.8007  0.9815  0.9878  0.9721  0.9606  0.8271  0.8086  0.9856  0.9067
STACK_LR   0.2753  0.7772  0.8352  0.7992  0.7860  0.8469  0.3317 -0.9897  0.8221  0.4982
BST-DT     0.860   0.854   0.956   0.977   0.958   0.952   0.929   0.932   0.808   0.914
RND-FOR    0.866   0.871   0.958   0.977   0.957   0.948   0.892   0.898   0.702   0.897
ANN        0.817   0.875   0.947   0.963   0.926   0.929   0.872   0.878   0.826   0.892
SVM        0.823   0.851   0.928   0.961   0.931   0.929   0.882   0.880   0.769   0.884
BAG-DT     0.836   0.849   0.953   0.972   0.950   0.928   0.875   0.901   0.637   0.878
KNN        0.759   0.820   0.914   0.937   0.893   0.898   0.786   0.805   0.706   0.835
BST-STMP   0.698   0.760   0.898   0.926   0.871   0.854   0.740   0.783   0.678   0.801

Normalized Scores for ES

(the same table as on the previous slide)


Ensemble Selection vs. Best: 3 NLP Problems

[figure: ensemble selection vs. the best single model on three NLP problems]

[Art Munson, Claire Cardie, Rich Caruana. EMNLP/HLT 2005]

Ensemble Selection Works, But Is It Worth It?

Computational Cost

• You have to train multiple models anyway:
  – models can be trained in parallel: different packages, different machines, at
    different times, by different people
  – just generate and collect (no optimization necessary, no test sets)
  – saves human effort: no need to examine/optimize models
  – ~ 48 hours on 10 workstations to train 2000 models with 5k train sets
  – the model library can be built before the optimization metric is known
  – anytime selection: no need to wait for all models

• Ensemble Selection itself is cheap:
  – each iteration considers adding any of the 2000+ models to the ensemble
  – adding a model is simple unweighted averaging of predictions; caching makes this
    very efficient
  – the performance metric is computed each time a model is added
  – for 250 iterations that is 250 * 2000 = 500,000 ensembles evaluated
  – ~ 1 minute on a workstation if the metric is not expensive

What Models are Used in Ensembles?

Proportion of the selected ensemble drawn from each model type, by metric (shown for
the ADULT and COV-TYPE problems):

ADULT       Acc   Fsc   Lft   Roc   Apr   Bep   Rms   Mxe   Sar   Avg
ANN        .071  .132  .101  .365  .430  .243  .167  .094  .573  .272
KNN        .020  .015  .586  .037  .029  .049  .000  .000  .049  .098
SVM        .001  .000  .110  .284  .274  .092  .057  .000  .022  .105
DT         .020  .035  .007  .088  .049  .019  .746  .867  .234  .258
BAG_DT     .002  .001  .000  .004  .005  .002  .006  .010  .014  .005
BST_DT     .110  .152  .025  .057  .032  .047  .024  .028  .075  .069
BST_STMP   .776  .666  .171  .166  .181  .548  .000  .000  .032  .317

COV-TYPE    Acc   Fsc   Lft   Roc   Apr   Bep   Rms   Mxe   Sar   Avg
ANN        .011  .010  .001  .038  .023  .009  .087  .097  .052  .041
KNN        .179  .166  .576  .252  .295  .202  .436  .427  .364  .362
SVM        .021  .016  .087  .104  .106  .051  .010  .013  .038  .056
DT         .061  .054  .012  .238  .242  .029  .408  .368  .200  .202
BAG_DT     .005  .006  .002  .010  .015  .006  .016  .022  .044  .016
BST_DT     .553  .613  .130  .278  .240  .644  .042  .073  .292  .358
BST_STMP   .170  .134  .194  .080  .080  .059  .000  .000  .009  .091

What Models are Used by ES?

• Most ensembles use 10-100 of the 2000 models
• Different models are selected for different problems
• Different models are selected for different metrics
• Most ensembles use a diversity of model types
• Most ensembles use different parameter settings
• The selected models often make sense:
  – neural nets for RMS and Cross-Entropy
  – max-margin methods for Accuracy
  – large k in kNN for AUC


ES Pros & Cons

• Disadvantages:
  – have to train many models
    (if you want the best, you were going to do it anyway; packages such as WEKA and
    MLC++ make it easier)
  – loss of intelligibility
  – no cool theory!

• Advantages:
  – can optimize to almost any performance metric
  – better performance than anything else we compared to

Ensemble Selection

• Good news:
  – a carefully selected ensemble that combines many models outperforms boosting,
    bagging, random forests, SVMs, neural nets, … (because it builds on top of them)

• Bad news:
  – the ensembles are too big, too slow, and too cumbersome to use for most applications

• Best ensemble:
  – takes ~ 1 GB to store the model
  – takes ~ 2 seconds to execute per test case!


Best Ensembles are Big and Ugly!

• The best ensemble for one problem/metric has 422 models:
  – 72 boosted trees (28,642 individual decision trees!)
  – 1 random forest (1024 decision trees)
  – 5 bagged trees (100 decision trees in each model)
  – 44 neural nets (2,200 hidden units total, > 100,000 weights)
  – 115 kNN models (both large and expensive!)
  – 38 SVMs (100's of support vectors in each model)
  – 26 boosted stump models (36,184 stumps total -- could compress)
  – 122 individual decision trees
  – …

Solution: Model Compression

• Train a simple model to mimic the complex model

• Pass large amounts of unlabeled data (synthetic data points or real unlabeled data)
  through the ensemble and collect its predictions
  – 100,000 to 10,000,000 synthetic training points
  – an extensional representation of the ensemble model

• Train a copycat model on this large synthetic train set to mimic the
  high-performance ensemble
  – e.g., train a neural net to mimic the ensemble-selection target
  – potential not only to perform as well as the target ensemble, but possibly to
    outperform it


Work In Progress (Cristi Bucila)

[figure: mimic-model performance compared against the best "vanilla" neural net and
the ensemble-selection target]
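A hedged sketch of the model-compression idea above: label a large pool of unlabeled points with the ensemble's predicted probabilities, then train a small neural net to reproduce those probabilities. The names ensemble_predict_proba and X_unlabeled are assumptions, and the MLP size is illustrative rather than the study's setting.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def compress(ensemble_predict_proba, X_unlabeled):
    # Extensional representation of the ensemble: its outputs on many unlabeled points.
    p_target = ensemble_predict_proba(X_unlabeled)           # soft targets in [0, 1]
    mimic = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
    mimic.fit(X_unlabeled, p_target)                          # copycat model
    return mimic

# At run time only the small mimic net is stored and executed:
# p = np.clip(mimic.predict(X_test), 0.0, 1.0)
```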

Results

• Neural nets trained to mimic high-performing bagged-tree models:
  – perform better than the target models on eight test problems and three test metrics
  – perform much better than any ANN we could train on the original data

• A massive experiment using ensemble-selection predictions and nine performance
  metrics is currently underway
  – getting ensemble predictions is much more expensive
  – we are willing to trade off cost at train time for speed and compactness at run time


Why Mimic with Neural Nets?

• Decision trees do not work well
  – the synthetic data must be very large because of recursive partitioning
  – mimic decision trees are enormous (depth > 1000 and > 10^6 nodes), making them
    expensive to store and compute
  – a single tree does not seem to model the ensemble accurately enough

• SVMs
  – the number of support vectors increases quickly with complexity

• Artificial neural nets
  – can model complex functions with a modest number of hidden units
  – can compress millions of training cases into thousands of weights
  – expensive to train, but execution cost is low (just matrix multiplies)
  – models with a few thousand weights have a small footprint

RMS Loss for a Simple 2-Parameter Model

[figure: RMS loss surface over the two model parameters]

How Important is it to Optimize to the Correct Performance Metric?

Loss on Six Metrics for the 2-Parameter Model

[figure: loss surfaces for six different metrics over the same two parameters]


COVTYPE: Calibration vs. Accuracy

[figure: calibration plotted against accuracy for models on the COVTYPE problem]

Scaling, Ranking, and Normalizing

• We have the performance of models M1, M2, M3, … , M22,000 on each of ten metrics
  (ACC, FSC, LFT, AUC, APR, BEP, RMS, MXE, CAL, SAR)

• Problem:
  – for some metrics, 1.00 is best (e.g. ACC)
  – for some metrics, 0.00 is best (e.g. RMS)
  – for some metrics, the baseline is 0.50 (e.g. AUC)
  – on some problems/metrics, 0.60 is excellent performance
  – on some problems/metrics, 0.99 is poor performance

• Solution 1: Normalized Scores
  – baseline performance => 0.00
  – best observed performance => 1.00 (proxy for Bayes optimal)
  – puts all metrics on equal footing

• Solution 2: scale by standard deviation

• Solution 3: rank correlation

Multi Dimensional Scaling

• The 10 metrics span a 2-5 dimension subspace


2-D Multi-Dimensional Scaling

[figure: 2-D MDS plots of the metrics for the Adult, Covertype, and Hyper-Spectral problems]

[figure: 2-D MDS plots for the Letter, Medis, and SLAC problems, under normalized
scores, scaling by standard deviation, and rank-correlation distance]

Correlation Analysis

• 2000 performances for each metric on each problem
• Correlation between all pairs of metrics:
  – 10 metrics
  – 45 pairwise correlations
• Average the correlations over the 7 test problems
• Computed both standard correlation and rank correlation; rank correlations are
  presented here


Rank Correlations

Metric   ACC   FSC   LFT   AUC   APR   BEP   RMS   MXE   CAL   SAR   Mean
ACC      1.00  0.87  0.85  0.88  0.89  0.93  0.87  0.75  0.56  0.92  0.852
FSC      0.87  1.00  0.77  0.81  0.82  0.87  0.79  0.69  0.50  0.84  0.796
LFT      0.85  0.77  1.00  0.96  0.91  0.89  0.82  0.73  0.47  0.92  0.832
AUC      0.88  0.81  0.96  1.00  0.95  0.92  0.85  0.77  0.51  0.96  0.861
APR      0.89  0.82  0.91  0.95  1.00  0.92  0.86  0.75  0.50  0.93  0.853
BEP      0.93  0.87  0.89  0.92  0.92  1.00  0.87  0.75  0.52  0.93  0.860
RMS      0.87  0.79  0.82  0.85  0.86  0.87  1.00  0.92  0.79  0.95  0.872
MXE      0.75  0.69  0.73  0.77  0.75  0.75  0.92  1.00  0.81  0.86  0.803
CAL      0.56  0.50  0.47  0.51  0.50  0.52  0.79  0.81  1.00  0.65  0.631
SAR      0.92  0.84  0.92  0.96  0.93  0.93  0.95  0.86  0.65  1.00  0.896

• The correlation analysis is consistent with the MDS analysis
• The ordering metrics have high correlations to each other
• ACC, AUC, and RMS have the best correlations of the metrics in their metric classes
• RMS has good correlation to the other metrics
• SAR has the best correlation to the other metrics
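A sketch of the metric-correlation and MDS analysis above (illustrative): given a matrix of model performances (rows = models, columns = metrics), compute pairwise Spearman rank correlations between metrics and embed the metrics in 2-D using one minus the rank correlation as the dissimilarity. The random performance matrix is a placeholder for the real normalized scores.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

metrics = ["ACC", "FSC", "LFT", "AUC", "APR", "BEP", "RMS", "MXE", "CAL", "SAR"]
perf = np.random.rand(2000, len(metrics))        # placeholder for real model performances

rank_corr, _ = spearmanr(perf)                   # 10 x 10 rank-correlation matrix
distance = 1.0 - rank_corr                       # turn correlation into a dissimilarity
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distance)
for name, (x, y) in zip(metrics, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")         # 2-D embedding of each metric
```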

Summary

• Predicting Probabilities:
  – Neural Nets, Bagged Trees, and Random Forests are the best models overall right out
    of the box
  – Calibration with Platt Scaling or Isotonic Regression yields better probabilities
    for Boosting, SVMs, Random Forests, Decision Trees, and Naïve Bayes
  – Where a sigmoid is appropriate, Platt Scaling is more effective with little data
  – Isotonic Regression is more powerful; use it when data is plentiful

• Empirical Comparison:
  – Calibrated Boosted Trees and Random Forests yield the best performance overall
  – Even after calibration, no one learning method does it all
  – The best method depends on the problem, the metric, and the train-set size
  – Picking the best model yields much better performance than any one method

• Ensemble Selection:
  – A carefully selected ensemble of models yields further improvements
  – Can optimize the ensemble to any performance metric

• Performance Metrics:
  – The 9 metrics span a 2-4 dimensional subspace
  – The ordering metrics cluster tightly: AUC ~ APR ~ BEP
  – RMS ~ MXE, but RMS is more centrally located. RMS is king!


Thank You!
