Probabilistic supervised learning
Frithjof Gressmann¹∗, Franz J. Király¹†, Bilal Mateen²‡, and Harald Oberhauser³§

¹Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, United Kingdom
²Warwick Medical School, University of Warwick, Coventry CV4 7AL, United Kingdom
³Mathematical Institute, University of Oxford, Andrew Wiles Building, Oxford OX2 6GG, United Kingdom
May 8, 2019
Abstract

Predictive modelling and supervised learning are central to modern data science. With predictions from an ever-expanding number of supervised black-box strategies - e.g., kernel methods, random forests, deep learning aka neural networks - being employed as a basis for decision-making processes, it is crucial to understand the statistical uncertainty associated with these predictions.
As a general means to approach the issue, we present an overarching framework for black-box prediction strategies that not only predict the target but also their own predictions’ uncertainty. Moreover, the framework allows for fair assessment and comparison of disparate prediction strategies. For this, we formally consider strategies capable of predicting full distributions from feature variables (rather than just a class or number), so-called probabilistic supervised learning strategies.
Our work draws from prior work including Bayesian statistics, information theory, and modern supervised machine learning, and in a novel synthesis leads to (a) new theoretical insights such as a probabilistic bias-variance decomposition and an entropic formulation of prediction, as well as to (b) new algorithms and meta-algorithms, such as composite prediction strategies, probabilistic boosting and bagging, and a probabilistic predictive independence test.
Our black-box formulation also leads (c) to a new modular interface view on probabilistic supervised learning and a modelling workflow API design, which we have implemented in the newly released skpro machine learning toolbox, extending the familiar modelling interface and meta-modelling functionality of sklearn. The skpro package provides interfaces for construction, composition, and tuning of probabilistic supervised learning strategies, together with orchestration features for validation and comparison of any such strategy - be it frequentist, Bayesian, or other.

arXiv:1801.00753v3 [stat.ML] 7 May 2019
∗[email protected]
†[email protected]
‡[email protected]
§[email protected]
Contents

1 Introduction
   1.1 Main ideas
   1.2 Technical contributions
   1.3 Relation to prior art
      1.3.1 Model assessment of probabilistic models via predictive likelihood
      1.3.2 Connections of predictive model assessment to information theory
      1.3.3 Bayesian/frequentist synthesis
      1.3.4 Toolbox API designs for supervised learning and probabilistic learning
2 The probabilistic prediction setting
   2.1 Notation and Conventions
   2.2 Statistical setting
   2.3 Classical supervised learning
   2.4 Probabilistic supervised learning
   2.5 Probabilistic loss functionals
   2.6 Recovering the classical setting
   2.7 Probabilistic losses for mean-variance predictions
   2.8 Short note: why taking (posterior) expectations is a bad idea
3 Model assessment and model selection
   3.1 Out-of-sample estimation of the generalization loss
   3.2 Model comparison and model performance
   3.3 Uninformed baseline: label density estimation
   3.4 Short note: why permutation of features/labels is not the uninformed baseline
   3.5 Classical baseline: residual density estimation
   3.6 Bayes type information criteria and empirical risk asymptotics for the log-loss
   3.7 No counterexample when out-of-sample
4 Mixed outcome distributions
   4.1 Integrating a discrete loss applied to binary cut-offs
   4.2 Convolution of the target with a kernel
   4.3 Kernel discrepancy losses
   4.4 Realizing that all computing is discrete
   4.5 Decomposition into continuous and discrete part
   4.6 Mixed-to-continuous adaptors
   4.7 Non-mixed distributions
5 Learning theory and model diagnostics
   5.1 Variations on Jensen
   5.2 Bayesics on predictions and posteriors
   5.3 Bias and variance in the probabilistic setting
   5.4 Information, predictability, and independence
   5.5 Transformations and probabilistic residuals
   5.6 Visual diagnostics of probabilistic models
6 Meta-algorithms for the probabilistic setting
   6.1 Target re-normalization and outlier capping for the log-loss
   6.2 Target adaptors and target composite strategies
      6.2.1 Conversion to point prediction
      6.2.2 Adaptors for point prediction strategies
      6.2.3 Adaptors for empirical samples such as from the Bayesian predictive posterior
      6.2.4 Adaptors for mixed distributions
      6.2.5 Conversion to parametric distributions
   6.3 Bagging and model averaging for probabilistic prediction strategies
   6.4 Boosting of probabilistic predictors
   6.5 A multi-variate test for independence
      6.5.1 Constructing a probabilistic predictive independence test
      6.5.2 Workflow meta-interfaces in the probabilistic setting
      6.5.3 Relation to kernel discrepancy based testing procedures
7 Probabilistic supervised learning algorithms
   7.1 Bayesian models
   7.2 Probabilistic Classifiers
   7.3 Heteroscedastic regression and prediction intervals
   7.4 Conditional Density Estimation
   7.5 Gaussian Processes
   7.6 Mixture density estimators, and mixtures of experts
   7.7 Quantile Regression and Quantile Estimation
8 An API for probabilistic supervised learning
   8.1 Overview
      8.1.1 General design principles
      8.1.2 Main use cases
      8.1.3 API design requirements
   8.2 Pythonic skpro interface design
   8.3 Core API
      8.3.1 Probabilistic estimators and distributions
      8.3.2 Integration of third-party prediction strategies
      8.3.3 Bayesian and probabilistic programming interfaces
      8.3.4 Model selection and performance metrics
      8.3.5 Code example
   8.4 Prediction strategies
      8.4.1 Baselines
      8.4.2 PyMC-interface
      8.4.3 Composite parametric prediction
   8.5 Meta-modelling API
   8.6 Workflow automation
9 Experiments
   9.1 Datasets
   9.2 Validation set-up
   9.3 Prediction strategies
      9.3.1 Primitive point prediction strategies
      9.3.2 Composition strategies
      9.3.3 Investigated prediction strategies
   9.4 Results
      9.4.1 Point prediction baselines
      9.4.2 Residual estimation
      9.4.3 Impact of residual transformations
      9.4.4 Impact of the parametric distribution type
      9.4.5 Direct use of an off-shelf Gaussian process strategy
   9.5 Further experiments
1. Introduction
In many application scenarios one wishes to make predictions for an outcome which is inherently uncertain, to an extent where even perfect prior knowledge does not allow a point prediction - for example, when predicting an event risk for an individual data point, e.g., a cancer patient’s survival/death or a debtor’s default. In other scenarios a point prediction is sufficient, but knowing the prediction’s uncertainty is crucial. And generally, knowing the expected variation in observations is preferable to not knowing it.
While the advantage of predictions with uncertainty certificates or, equivalently, probabilistic predictions, is probably universally acknowledged, the question of a suitable framework for model building and validation remains open. For instance, such a framework needs to be able to fairly compare a frequentist supervised learning model with a Bayesian model when both are applied to the task of predicting with an uncertainty guarantee.
In this context, one often hears the claim that Bayesian type learning is incomparable with the “classical” supervised learning paradigm, which is interpretationally more frequentist; or that one is in fact superior, say on the grounds that one paradigm naturally predicts probability densities and can take into account a-priori “beliefs”, or that the other paradigm is independent of model-specific assumptions and more “natural” due to the absence of arbitrary prior beliefs. We would argue that many of these points are moot, since from a scientific perspective any method should be judged by how it solves the task it is supposed to solve - namely, predicting the outcome and its uncertainty - and by nothing else. Assuming, of course, that the task is indeed prediction (and not parameter estimation or inference, say).
In terms of the set-up, what is required of such a unified framework for the probabilistic prediction task, is a recognition of pertinent ideas in both domains (Bayesian and frequentist), while reconciling them within the setting where an accurate prediction with uncertainty guarantees is of primary interest. Of particular relevance are the following aspects:
(i) Predicting a probability distribution, as is common in the Bayesian framework and considerably less frequent in the frequentist one, is desirable as it is part of a preferable task. For example, the smarter predictor for the outcome of a fair game of dice would predict a uniform probability for each outcome, rather than a fixed outcome at random - since forcing the predictor to make a point estimate is akin to forcing it to “forget” crucial information, namely knowledge of the result-generating mechanism.
(ii) Avoiding model-specific assumptions, especially in the comparison and assessment of models, as is more common in the frequentist framework and less so in the Bayesian one (e.g., BIC and WAIC being model class dependent), is preferable to making such assumptions, since validation of models by a procedure that is possibly specific to the type of model or model class considered is in constant danger of leading to circular argumentation.
(iii) Finally, the question of whether having a Bayesian prior in prediction is preferable or not (hence whether a Bayesian or frequentist type model is preferable) is not directly related to the prediction task itself. In particular, like anything else, it is relevant to solving the prediction task only indirectly, through practical merit in prediction itself - that is, accuracy of prediction, scientific parsimony/interpretability, reproducibility, and deployability, for instance.
For a synthesis of the above, consider the stylized thought experiment outlined in Figure 1.1. We thus argue that adoption of a black-box view together with an external validation workflow is the most pertinent choice in the given task-centric setting.
Figure 1.1: Thought experiment illustrating the domain-agnostic black-box rationale. Suppose we query predictions from two different black-box prediction strategies, method A - symbolized by the frequentist cat in a black box - and method B - symbolized by the Bayesian cat in a black box. Both have access to the same training data, and are tasked with producing probabilistic predictions for the same test cases. Said predictions are scrutinized and evaluated against subsequent ground truth observations. The key point to observe is that for a fair comparison with regard to performance on this prediction task, knowledge of the specific workings inside the black box (e.g., does the cat believe in so-called “priors” and if yes which types, or rather in the epistemological validity of rejecting a null hypothesis; or, whether the cat is still alive) is not only unnecessary, but in fact irrelevant to the external validity of either method, as it does not pertain to the task, which is probabilistic prediction (and not model inference). Phrased more precisely: the means by which the predictions were obtained have no bearing on whether the predictions are accurate, hence the means by which the predictions were obtained should play no role whatsoever in a fair assessment of whether the predictions were accurate, nor in a fair comparison of the two methods with respect to their predictions’ accuracy in the given task. The same argumentation applies to prediction strategies in any other cat-egory, e.g., Fisherian, deep learning aka neural networks, or any data-specific, situational heuristics.
Therefore, if one accepts the above arguments, a probabilistic supervised learning framework will:
1. solve the task of predicting probability distributions,
2. allow model-agnostic validation and comparison for “Bayesian” and “frequentist” predictive models alike, and
3. be easily implemented in a modelling (e.g., software) toolbox that unifies both types of methodology.
In the following, we present such a framework, inspired by motifs from both statistical paradigms and the common machine learning motif of methods as task-specific black-boxes to be studied from the “outside”. Our framework borrows substantially from prior art and is not new in terms of many of its components, but rather in terms of its composition, which we believe to be novel not only in an interpretative manner, but also in a way that allows easy collation, say in the form of a workflow toolbox for black-box learning and meta-learning with a modular interface to both frequentist and Bayesian learning algorithms. We would like to stress that unlike some of the previous literature, we do not attempt to “unify” frequentist or Bayesian statistics in general: our framework only applies in a predictive/supervised context, where many concepts from both worlds can be shown (below) to coincide or overlap. How a joint point of view may be achieved in general, say for parameter estimation or unsupervised learning, or whether it is even possible or desirable, is not entirely clear, but also outside this manuscript’s remit, to stress this again.
1.1. Main ideas Central to our approach is conceptually separating (1) the task that is to be solved (prediction, probabilis- tic) from (2) the way it is actually solved (Bayesian or frequentist) from (3) philosophical considerations (what probabilities “really mean”, or which empirically unvalidable beliefs are preferable). We think this is not only a good idea but necessary mental hygiene implied by the scientific method which separates (1) experiment, “testing whether the task is solved”, from (2) predictions to be tested, “method X will solve the task”, and (3) meta-empirical appraisal, “I like method X because of the ideas it is based on”.
(1) The task to be solved - predicting a probability distribution from features, given a batch of examples to learn the model from - naturally comes with the necessity of specifying a quantitative measure of whether and how well it has been solved. This measure is crucial to separate from the solution in order to avoid circular reasoning, say as in confirming beliefs that one already had to begin with (as in fully Bayesian Bayes factor validation). On a similar note, to enable assessment of any solution to the task it is necessary to avoid using an assessment method that applies only to certain types of models (such as BIC and WAIC for Bayesian models, or goodness-of-fit tests for specific frequentist models). Rephrased, we argue for what is called an “external” validation, as the process of checking a prediction should be independent of whether the model is frequentist or Bayesian, and of scientifically unvalidable beliefs (e.g., through a belief model). More concretely, among the popularly used Bayesian predictive validation criteria listed in the review paper of Vehtari and Ojanen [121], the only ones meeting these conditions are the M-open ones, in the terminology of the review, including the often-used Bayesian cross-validation via log-loss on the sample of posterior predictive distributions. The latter generalizes straightforwardly to performing out-of-sample (= independent test set) validation with a log-likelihood loss for any black-box method predicting a probability distribution (Bayesian or not); a special case is the log-loss commonly used in a frequentist classification scenario where class probabilities are predicted.
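As a minimal illustration of such model-agnostic, out-of-sample log-loss validation (a sketch under our own simplifying assumptions; the Gaussian predictors and function names are illustrative, not part of any particular toolbox):

```python
import math

def gaussian_logpdf(y, mu, sigma):
    """Log-density of N(mu, sigma^2) at y."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def empirical_log_loss(y_test, predicted_logpdfs):
    """Mean negative predictive log-likelihood on an independent test set.

    predicted_logpdfs: one callable per test point, mapping an observed
    label to the log-density the black-box assigned to it.
    """
    return -sum(lp(y) for y, lp in zip(y_test, predicted_logpdfs)) / len(y_test)

# Two black-box strategies predicting distributions for the same test points:
y_test = [0.1, -0.2, 0.05]
# Strategy A: standard normal prediction for every test point
preds_a = [lambda y: gaussian_logpdf(y, 0.0, 1.0) for _ in y_test]
# Strategy B: a sharper (and here better-calibrated) prediction
preds_b = [lambda y: gaussian_logpdf(y, 0.0, 0.2) for _ in y_test]

loss_a = empirical_log_loss(y_test, preds_a)
loss_b = empirical_log_loss(y_test, preds_b)
# Lower out-of-sample log-loss indicates the better probabilistic predictor,
# regardless of how the predictions were produced (Bayesian, frequentist, ...).
```

The evaluation only touches the predicted densities and the held-out labels, so the comparison is indifferent to what happens inside either black box.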
(2) Regarding the strategies solving the task, we adopt a pure black-box view where any algorithm that can predict probability distributions based on observation of example cases is admissible - similar to the (frequentist or Bayesian) supervised learning framework, where non-parametric or entirely algorithmically specified models are also allowed; note how this includes both frequentist and Bayesian type models.
For example, the former may include methods yielding a predictive variance, such as frequentist formulations of Gaussian processes or feature-dependent mixture models; the latter may include returning the prior or posterior predictive distribution as the prediction output of the black-box. The black-box predictor may have so-called tuning parameters, which in this setting are defined as any modelling choice that is not made based on the data (or, equivalently, anything that the user may need to externally set in a software framework). In particular, the prior(s) of a fully Bayesian model are considered to be black-box tuning parameters, together with the hyper-parameters in Bayesian terminology. While this does not agree with the usual Bayesian terminology or some of its philosophical interpretations, from a scientific or framework design perspective it appears practically reasonable, because a choice that is made independently of the data should be clearly exposed as such - independently of what that choice is, or how it is philosophically interpreted.
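To make the black-box view concrete, a minimal interface sketch (the class and method names below are our own illustration, not the skpro API): any constructor argument is a tuning parameter in the above sense - for a Bayesian strategy this would include the prior - and the feature-ignoring Gaussian baseline shows one strategy behind the interface.

```python
import math

class ProbabilisticEstimator:
    """Black-box interface: fit on examples, predict one distribution per test point.

    Anything passed to the constructor is a tuning parameter in the sense above:
    a modelling choice made independently of the data - for a Bayesian strategy,
    this would include the prior.
    """
    def __init__(self, **tuning_params):
        self.tuning_params = tuning_params

    def fit(self, X, y):
        raise NotImplementedError

    def predict_distribution(self, X):
        """Return, per row of X, a callable y -> predicted log-density at y."""
        raise NotImplementedError

class GaussianBaseline(ProbabilisticEstimator):
    """Feature-ignoring baseline: fits a single Gaussian to the training labels."""
    def fit(self, X, y):
        n = len(y)
        self.mu_ = sum(y) / n
        self.sigma_ = max(sum((v - self.mu_) ** 2 for v in y) / n, 1e-12) ** 0.5
        return self

    def predict_distribution(self, X):
        mu, s = self.mu_, self.sigma_
        def logpdf(y):
            return -0.5 * math.log(2 * math.pi * s * s) - (y - mu) ** 2 / (2 * s * s)
        return [logpdf for _ in X]

# usage: the caller never needs to know what is inside the box
model = GaussianBaseline().fit(X=[[0], [1], [2]], y=[1.0, 2.0, 3.0])
dists = model.predict_distribution([[5]])
```

A Bayesian strategy returning its posterior predictive log-density would plug into the same two-method interface, which is what makes external, model-agnostic validation possible.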
(3) The philosophical interpretation(s), and the belief of which one is most desirable, we leave with the reader, while we ourselves are inspired by scientific empiricism in the Baconian sense, which postulates validation-by-experiment and/or falsification-by-experiment. In particular, there should be a common (and, by statistical empiricism, quantitative) “experiment” which can tell us, given data, whether we should believe for all practical purposes that a reliable prediction has been made - yes/no/maybe and to which degree - independent of the method making the prediction (Bayesian, frequentist, or otherwise). With George Box, we acknowledge that “all models are wrong” (but some are useful, and some others are plain stupid), while any statistical model - or modelling process - that is “not even wrong” in the sense of Wolfgang Pauli (impossibility or absence of validation/falsification-by-experiment) should be rejected as pseudo-science. In particular, we do not claim to propose the one-and-only framework following these principles without the need to embrace a particular modelling philosophy, nor do we claim to propose the one-and-only framework that “finally unifies” the “opposing schools of thought” of “Bayesianism” and “frequentism” - we only provide a (not “the”) framework which supports reproducible, impartial scientific argumentation and empirical validation along the lines of Gelman and Hennig [41], using currently available tools and concepts, supported by a clean and transparent consensus workflow design, for the specific task of probabilistic supervised learning.
1.2. Technical contributions Our main contribution is providing a comprehensive framework for supervised learning in a probabilistic setting. More precisely, it consists of:
(i) An explicit mathematical formulation of the statistical set-up for probabilistic supervised learning, in Sections 2 to 2.5. While drawn mostly from motifs existing in literature, this seems to be the first instance in which the full picture is presented, in a model-independent black-box setting that allows for external validation and comparison of any predictive modelling strategy of the probabilistic type.
(ii) An identification of classical, or “point predictive”, supervised learning under different mean losses with specific probabilistic model classes depending on the choice of classical mean loss, in Sections 2.6 and 2.8. While reminiscent of the least-squares-is-max-likelihood observation in linear regression, this seems to be almost fully novel (and surprisingly so).
(iii) A suggested practical workflow for comparing and assessing any type of predictive model in a probabilistic setting, in Sections 3.1 to 3.5. While somewhat similar to Bayesian cross-validation, the workflow is agnostic of model type. It is further supported by quantification strategies which are shown to have frequentist and Bayesian interpretations at the same time, and by the definition of the concept of an “uninformed baseline”, which in the probabilistic setting is shown to be density estimation.
We would like to re-iterate that we are not claiming the suggested workflow to be the only one that potentially makes sense; our aim is rather to provide one that is workable for the probabilistic black-box setting (which has not been addressed comprehensively in literature so far).
However, through our set-up we are able to identify a number of problems and stylized failures of
wide-spread workflow motifs, including frequentist permutation baselines (Section 3.4), as well as AIC/BIC based model selection which may lead to probabilistic over-fitting (Section 3.6).
(iv) A number of theoretical results which expand upon motifs in classical supervised learning and establish links between learning theory for probabilistic prediction and information theory, including:
(iv.a) a probabilistic bias-variance trade-off (Section 5.3);
(iv.b) an information theoretic formulation of probabilistic prediction (Section 5.4), including Theorem 2 which shows equivalence of statistical independence and probabilistic predictability;
(iv.c) a definition of probabilistic residuals through measure theoretic pullback, compatible with the expected generalization loss (Section 5.5).
(v) Motivated by the above theoretical insights, a number of meta-learning approaches for probabilistic supervised learning, most notably:
(v.a) semi-quantitative diagnostic plots for probabilistic models (Section 5.6);
(v.b) loss reduction techniques for probabilistic models (Section 6.3), shown by the bias-variance trade-off (iv.a) to contain, as special cases, classical bagging and Bayesian model averaging;
(v.c) boosting of probabilistic models (Section 6.4), enabled through the concept of residuals (iv.c) and their optimization;
(v.d) predictive type hypothesis tests for independence of paired multi-variate observations, including two-sample tests, building on (iv.b), which yields a novel identification of probabilistic model validation as entropy estimation (Section 6.5).
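To illustrate the flavour of (v.b): the simplest probabilistic analogue of bagging fits the base strategy on bootstrap resamples and averages the predicted densities, yielding an equal-weight mixture. The sketch below is our own illustrative rendering of that mixture idea (function names are ours), not the paper's exact construction:

```python
import math
import random

def bagged_density(fit, X_train, y_train, x_test, n_bags=20, seed=0):
    """Fit the base strategy on bootstrap resamples and average predicted
    densities at x_test - an equal-weight mixture, which remains a valid
    probability density.

    fit: callable (X, y) -> (callable x -> (callable y -> density p(y|x))).
    """
    rng = random.Random(seed)
    n = len(y_train)
    members = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        members.append(fit([X_train[i] for i in idx], [y_train[i] for i in idx]))
    return lambda y: sum(m(x_test)(y) for m in members) / len(members)

def gaussian_fit(X, y):
    """Base strategy: feature-ignoring Gaussian density estimate of the labels."""
    mu = sum(y) / len(y)
    s = (sum((v - mu) ** 2 for v in y) / len(y) + 1e-9) ** 0.5
    return lambda x: lambda t: math.exp(-(t - mu) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

y_train = [float(v) for v in range(10)]
density = bagged_density(gaussian_fit, [[v] for v in range(10)], y_train, x_test=[0])
# the bagged mixture is peaked near the label mean and decays in the tails
```

Replacing the bootstrap members with posterior draws recovers a model-averaging flavour, which is why classical bagging and Bayesian model averaging appear as special cases of the same construction.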
(vi) A literature-based review of some influential probabilistic modelling paradigms from which probabilistic supervised learning strategies may be obtained (Section 7). Interestingly, most popular off-shelf algorithms (including Bayesian modelling and Gaussian Processes) appear to need at least minor modifications or adaptations before being applicable to the task of probabilistic prediction.
(vii) A workflow API design compatible with contemporary API design principles for machine learning toolboxes, suitable for a unified implementation of all the above (Section 8). The design is implemented in the newly released skpro toolbox for python, extending the widely used sklearn interface to probabilistic supervised learning, and also provides workflow orchestration functionality.
(viii) The framework, as implemented in the skpro package, is tested in a series of model validation and benchmarking experiments, which also shed some light on useful composite strategies built from point prediction primitives (Section 9).
We would like to stress that there are aspects of our proposed framework which have already been explicitly remarked upon or implicitly observed in prior literature. In addition to the general appraisal in the subsequent Section 1.3, we discuss such occurrences in the context of the novel suggestions whenever we are aware of a relevant and specific connection, including a reference to the relevant source. Ideas and results for which no source is cited are to our knowledge novel.
1.3. Relation to prior art

We briefly discuss where relevant ideas have previously appeared in literature, and in which form. We would like to give special credit to the excellent review papers of Arlot et al. [2] and Vehtari and Ojanen [121] which, to the best of our knowledge, cover extensively the state-of-art in modern statistical model selection and model validation, and to which much of our familiarity with relevant literature and existing techniques is due.
1.3.1. Model assessment of probabilistic models via predictive likelihood

Predictive likelihood based model assessment has appeared in four major branches of literature:
(a) In Bayesian literature, the seminal paper of Geisser and Eddy [40] may be seen to contain, in parts of the initial discussion, important ideas on how to jointly treat and compare probabilistic predictions arising from Bayesian and non-Bayesian models, including the use of the predictive likelihood on a statistically independent (test) sample, although it pre-dates the advent of machine learning and hence supervised prediction. The earlier publication by Roberts [99] may be seen as a more qualitative precursor of such ideas.
Generally, studies where a predictive log-likelihood is estimated on held-out data seem to be rare in the Bayesian branch; in the cases known to us, this is done for the purpose of variable selection in a classification setting [29, 35]. There seem to be no more recent instances of this idea. None of the publications cited in Section 5.1.3, subsection “k-fold-CV”, of the review [121] use the out-of-sample log-loss, instead opting to discuss a setting closer to frequentist model validation. However, it should be noted that in the same review (in a part similar to our Section 3.1) the authors discuss how this may apply to the probabilistic setting, making it the primary creditable source for this idea, despite the superficial appearance of only reviewing prior occurrences of the idea.
The rest of Bayesian literature emphasizes methodology which is mostly in-sample and bias-corrected. Gelman et al.
[43] compare the log-loss to such popular Bayesian model selection criteria, similar to our discussion in Section 3.6, and mention the expected/empirical log-loss, claiming that in machine learning literature it is “often” called the “mean log predictive density”. We were unable to find evidence for wider use except by the authors, although this may simply be due to such use taking place in actual applications of machine learning rather than in scientific literature proper.
(b) Probabilistic forecasting is a branch of statistics that has become quite popular with econometricians, environmental scientists and applied researchers interested in related topics [45]. The “prequential approach” of Dawid [22] is probably the origin of this branch of literature, mostly concerned with (temporal) forecasting rather than (static) supervised prediction, and an empirical log-likelihood loss is among the possible criteria for assessing density predictions [46]. More recently, Musio and Dawid [86] and subsequently [25, 24] use arbitrary strictly proper scoring rules for Bayesian model selection. Also found in this branch of research is an idea known as “calibration” [47], of which our suggestion of probabilistic residuals in Section 5.6 is reminiscent. While much of our work draws from the above, it should be noted that much of the above-mentioned theory does not directly transfer to the probabilistic supervised prediction setting, as the latter introduces dependence between observed value and predicted distribution through the features.
(c) In the machine learning literature, the log-loss (= log-likelihood out-of-sample) is used to measure predictive accuracy of probabilistic classification, mostly of two-class type; see [100, 129, 81] for instances that specifically discuss the log-loss from an evaluation perspective. In machine learning, it usually appears as one possible choice amongst many, with certain desirable smoothness properties.
It also frequently appears in software implementations and benchmarking toolboxes, the most prominent one being scikit-learn [91] (metrics.log_loss in version 0.19.1). More generally, logistic or logarithmic losses are also ubiquitous, in a somewhat more implicit manner, in any machine learning or statistical literature concerned with logistic or logit type regression or classification models.
(d) The conditional density estimation setting [101, 107] is mathematically and interpretationally quite close to “probabilistic supervised learning” as we present it. Modern literature on this topic is usually found in the machine learning domain and evaluates the goodness of the density estimate by out-of-sample or cross-validation estimates of the generalization log-loss [51, 113, 64]. In fact, Holmes et al. [64] already note quite precisely how conditional density estimation may be interpreted in a density predictive sense (compare Section 2.4), though without further establishing links to Bayesian or classical frequentist methodology. It is worth noting that much of the conditional density estimation literature seems to assume some smoothness¹ in the feature or independent variables, and exhibits a focus towards Nadaraya-Watson type estimators. Conditional density estimation and evaluation via log-loss is implemented in more recent
¹ usually through implicit or explicit juxtaposition of the word “kernel” with “density estimation”
versions of Weka [54] (namely as ConditionalDensityEstimator and InformationTheoreticEvaluationMetric, since version 3.7).
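For concreteness, the classification log-loss referenced in (c) is the negative mean log predicted probability of the observed class. The plain-Python sketch below mirrors what, e.g., scikit-learn's metrics.log_loss computes (omitting the probability clipping and label handling the library performs):

```python
import math

def log_loss(y_true, y_proba):
    """Out-of-sample log-loss for probabilistic classification.

    y_true: observed class indices; y_proba: one predicted probability
    vector per observation (each summing to one).
    """
    return -sum(math.log(p[c]) for c, p in zip(y_true, y_proba)) / len(y_true)

# a well-calibrated confident classifier scores lower (better) than a maximally
# uncertain one, which for two balanced classes scores log(2)
confident = log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]])
uninformed = log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]])
```

Note that this is exactly the out-of-sample predictive log-likelihood loss specialized to discrete class distributions, which is what makes the classification log-loss a special case of the probabilistic validation scheme of Section 3.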
1.3.2. Connections of predictive model assessment to information theory

The relationship between predictive model assessment and information theory has been studied previously and operationalised for use in model assessment. For example, one branch of the literature [98, 8, 92] has explored information theoretic distances in a fully parametric context, between (for example: generative) distributions which are not observed, as opposed to relating model performance expectations or asymptotics to such quantities in a non-parametric or black-box context.
In the context of the model-agnostic black-box approach which we are considering: from a theoretical perspective, the connections between information theoretic quantities and supervised prediction are well understood for the probabilistic classification [97] and the temporal (hence sample-correlated) forecasting setting [23], though the general probabilistic supervised learning setting appears to be unstudied. From a practical perspective, information theoretic quantities are often used in model evaluation without deeper theoretical consideration, e.g., as the “cross-entropy loss” in supervised classification.
1.3.3. Bayesian/frequentist synthesis

Reconciliation of the Bayesian and frequentist paradigms has been the topic of many (more or less influential) papers, with wildly different ideas on how to completely resolve the perceived ideological divide. The most prominent suggestion of this type is perhaps the “Bayes-non-Bayes compromise” spearheaded by Good [49]. As we do not attempt to carry out an overall synthesis and are only concerned with predictive model validation and comparison, we continue to discuss only literature pertaining specifically to such a context.
The most relevant reference is possibly the work by Guyon et al. [53], which provides an overview we believe to be very valuable, highlighting the crucial issues also discussed in our work and the need for a joint view in the predictive setting. Though the suggestions in its later parts seem closer to a least common multiple than to a greatest common denominator²; that is, the authors seem to attempt to make the black-box approach as all-encompassing as possible, to accommodate as many model selection and model validation strategies as possible. The crucial issue of comparing frequentist to Bayesian models remains largely unaddressed - as opposed to a general argumentative direction discussing how popular methodological approaches in both worlds are in essence the same while also being different.
For the context of model comparison, which in Sections 3.2 and 6.5 we relate to frequentist hypothesis testing and Bayes factor inference, it has been noted by Berger et al. [6] that in a relatively general setting these two may be shown to take an (almost) mathematically equivalent form, yielding a jointly frequentist and Bayesian justification for the proposed types of assessment quantifiers.
On a more general note, we agree with Efron [32] on the scientific premise that frequentist and Bayesian ideas need to be combined to address the challenges of the 21st century (though not necessarily in the form of the empirical Bayes approach, and expressly without claiming any “philosophical superiority” for either approach [31], see also paragraph 3 of Section 1.1 above).
1.3.4. Toolbox API designs for supervised learning and probabilistic learning
Major machine learning toolboxes such as Weka [63, 54], scikit-learn [91, 16], caret [75, 38] and mlr [9, 10] share, as central building principles of their API, the idea of object-oriented encapsulation of models or prediction strategies, and sometimes aspects of the validation process, with a consistent/unified and modular interface. Data scientific workflow API design in itself is a relatively novel area of research which does not seem to be explicitly pursued in its own right, a notable and in our opinion seminal exception being the
2which, in our opinion, is not the “ideal” to strive for either
mentioned work of Buitinck et al. [16]. This seeming lack of recognition in the literature stands in stark contrast to its enormous practical importance, especially noting that from a scientific perspective3, a solid workflow and modelling API is a necessary practical precondition for fair quantitative assessment and comparison of any type of data analytics methodology4, be it frequentist, Bayesian or otherwise.
The above, as modelling toolboxes for classical supervised learning, usually do not implement interfaces for probabilistic prediction (except classification or variance predictions); however, they provide the method encapsulation and workflow modularization necessary for model-agnostic model building and comparison.
On the other hand, there are a number of abstract model-building environments for probabilistic modelling such as WinBUGS [80], Stan [18], Church [125], Anglican [126], or Edward [118], and more generally a wealth of distinct approaches often subsumed more recently under the term “probabilistic programming” [50, 18]. These have in common that they usually implement a generic interface for Bayesian algorithms of a specific family (usually, but not always, of Markov Chain Monte Carlo type), thus are neither model nor domain agnostic. Also, by algorithm and interface design, they are usually not geared towards prediction, nor do they implement model validation and model comparison functionality5; however, their design usually includes a high-level interface for fitting and sampling from structured and hierarchical probability distribution models which contemporary supervised learning toolboxes are lacking.
In summary, there is to date no published modelling toolbox API design which is at the same time supervised/predictive, probabilistic, and domain-agnostic6, while there are some which are supervised and domain-agnostic but non-probabilistic, and some that are probabilistic but neither supervised nor domain-agnostic.
Acknowledgments
We thank Andrew Duncan for pointing out the relevance of characteristic and universal kernels to the discussion of convolution (losses) in Section 4.2. We thank Ioannis Kosmidis for the main idea leading to Section 4.5. We thank Yvo Pokern for very useful suggestions regarding notation (“law of”, conditioning, function and distribution spaces). We thank Yiting Su (苏怡婷) for reading the manuscript and pointing out mathematical typos, especially in Section 6.2.4. We thank Catalina Vallejos for pointing out to us the work of Geisser and Eddy [40].
Parts of this manuscript contain probabilistic parallels to deterministic prediction specific material presented in the manuscript [17] on which FK worked simultaneously. Parallels include choice of notation, proof strategies, and central concepts such as the best uninformed prediction functional. Most parallel material was re-stated, since crucial concepts and phenomena in the deterministic case, e.g., elicitation and point prediction variance, are replaced by different phenomena in the probabilistic case, e.g., properness and information theoretic divergence. Some of these parallels intersect with a third, unpublished manuscript, which is the same that is referred to in the acknowledgments of [17] as well.
FK acknowledges support by the Alan Turing Institute, under EPSRC grant EP/N510129/1.
3 and, from an esthetic or inventive perspective, finding the right API is no less of a challenging task than finding the right statistical model or mathematical proof, necessitating a similar combination of systematic thinking, creative insight, critical appraisal, and, of course, hard work. Probably for the same reason, i.e., lack of recognition in the literature, we are unable to point towards specific scientific API design literature as a precursor and inspiration of our API design suggestions beyond the above, simply since API design is usually documented only implicitly in toolbox manuals, or as a footnote in papers focusing on toolbox use rather than API design. Hence we wish to give credit to these API designs by the above mention of concomitant scientific publications despite API design not necessarily being discussed there, in particular since our work may also be seen as an attempt at ensuring the guiding API principles common to all the mentioned toolboxes, as made precise by Buitinck et al. [16], for frequentist and Bayesian methods alike.
4 as long as it is carried out on a computer, and not, say, on a punchcard machine or within someone’s imagination
5 ... or they implement “broken” model comparison functionality such as posterior predictive checks, see Section 2.8.
6 With the exception of skpro, of course, which we consider linked to this publication.
Authors’ Contributions
FK conceived the project and is responsible for the main approaches, ideas, proofs, manuscript structure, writing, and content, unless stated otherwise below.
BM contributed some of the main ideas in Section 5.6, particularly the idea to use cdf evaluations as probabilistic residuals.
HO conceived many ideas and statements concerned with linking probabilistic prediction and loss evaluation with kernel discrepancy metrics, most notably in Section 4.3, and the connection between predictive independence testing and kernel MMD based testing in Section 6.5. HO pointed out a number of helpful areas of prior research, most notably the lecture notes of Singh [108], the work of Reid and Williamson [97], and the work of Lopez-Paz and Oquab [79]. HO also generally contributed to the creative process through discussions in the mid-to-late stages of the project.
The suggested API for probabilistic learning and its implementation, as well as the respective experiments, are based on FG’s MSc thesis, submitted August 2017 at University College London, and written under the supervision of FK. FK provided the central ideas for the high-level API design; FG contributed ideas for its practical realisation, including the representation of the predicted distributions (Section 8.1.3); the final API design was conceived in joint collaboration of FG and FK. The respective manuscript Sections 8 and 9 are re-worked parts of FG’s thesis manuscript, jointly done by FG and FK. The skpro python package has been designed and written by FG, with partial guidance by FK. Experiments were conducted by FG, with partial guidance by FK. FG made suggestions for the improvement of composite prediction strategies (bounding/capping and adaptors) based on the experimental results.
Final copy-editing and proof-reading was done by all authors.
2. The probabilistic prediction setting
This section introduces the mathematical setting for probabilistic supervised prediction. As mentioned above in Section 1.3, some of the central ideas and concepts introduced may be found in prior art.
2.1. Notation and Conventions
The natural logarithm (with base e = exp(1)) will be denoted log. The non-negative reals will be denoted R+, the natural numbers N; both sets contain 0 by convention. The cardinality of a set S is denoted by #S. Unless stated otherwise, all sets are assumed measurable, and endowed with a canonical measure, compatible with sub-setting7.
For sets X, Y, we will denote the set of functions mapping from X to Y by [X → Y]. We will impose no further conditions of regularity on elements of [X → Y] unless stated explicitly in addition.
For any measurable space Y, we define Distr(Y) to be the space of distributions on Y in the following sense: Distr(Y) is the convex subset of probability measures within the space of bounded, signed measures on Y, which is a Banach space under the total variation norm. Under the further assumption on a p ∈ Distr(Y) that it is discrete or absolutely continuous, we will (by abuse of notation) identify p with the corresponding function in [Y → R+], which is a probability mass function or probability density function in such a case.
For a random variable X : Ω → X, we say that X takes values in X. Measurability of X (and composites) will always be implicitly assumed. Instead of saying “X ∈ [Ω → X]”, we will say “X takes values in X”, or abbreviatingly, “X t.v.i. X”. This will be done without specifying the probability space Ω, which will implicitly be assumed to exist and to be fixed (as usual, to the finest such space for all appearing variables).
For a random variable X t.v.i. X, we will denote by L(X) the canonically induced distribution of X in Distr(X), the “law of X”8. In the frequent statistical setting where X is discrete or absolutely continuous, L(X) may be identified with the probability mass function or probability density function of X, an element of [X → R+], with which we identify L(X) in such a case.
Considering the above, it is worth keeping in mind that “probabilistic supervised learning” will be concerned with elements of [X → Distr(Y)], i.e., functions mapping elements of X to distributions over Y, which will be interpreted as probabilistic prediction functionals. By the identification of Distr(Y) with elements of [Y → R+] in the discrete or absolutely continuous case, such a prediction functional may be identified with an element f of the set [X → [Y → R+]], which is a set of (prediction) functions having (distribution) functions as output. In particular, for x ∈ X, the output f(x) of such a function is a function itself, an element of [Y → R+]; hence for an additional y ∈ Y, we may consider the output of f(x) when given y as input, an element of R+ which we will denote by [f(x)](y) or f(x)(y), a notation consistent with the usual notation in typed lambda calculus or any other standard theory of types (such as Church’s), or with the iterated use of brackets in programming languages such as python or R.
On a somewhat less technical and more intuitive note, we advise the reader familiar with supervised learning to keep in mind that the label predicted by a probabilistic supervised learning method is a distribution, hence (if a pmf or pdf exists) also a function; therefore, a (trained) probabilistic supervised learning
7 Intuitively, this is meant to imply that “all integrals and expectations in the manuscript are well-defined”. We acknowledge that care needs to be taken when making statements theoretically precise in cases other than R^n endowed with the Lebesgue measure and derived discrete product spaces, but we opt to avoid introduction of a full measure-theoretic framework for the purpose of readability and conceptual clarity.
8 For X : Ω → X, this is canonical in the measure theoretic sense, i.e., L(X) is canonically induced as the push-forward of the Ω-measure to X.
method is a function that predicts a function as a label instead of a single label. The slightly unusual notation f(x)(y) will occur in situations where we evaluate the prediction rule f at a test feature point x, which yields a distribution f(x), and then evaluate this resulting distribution at an element y of the label range, which yields a probability or a density f(x)(y); the occurrence of such terms and objects is a consequence of this practically relevant situation, by which the slight mathematical overhead above is entailed.
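The iterated-bracket notation has a direct computational counterpart. As a purely illustrative sketch (all names are our own, not part of any toolbox interface), a probabilistic prediction functional may be written in python as a function returning a density function:

```python
# Illustrative sketch of the notation f(x)(y): the prediction functional f
# maps a feature x to a distribution f(x), identified with its pdf, which
# is itself a function that can then be evaluated at a label value y.
from math import exp, pi, sqrt

def f(x):
    """Toy prediction functional in [X -> [Y -> R+]]: the prediction
    at x is the Gaussian N(x, 1), represented by its density."""
    def density(y):
        return exp(-0.5 * (y - x) ** 2) / sqrt(2 * pi)
    return density

p = f(0.5)          # f(x): an element of Distr(Y), identified with [Y -> R+]
v = p(0.5)          # f(x)(y): an element of R+; here the density at the mode
print(round(v, 4))  # 0.3989 = 1/sqrt(2*pi)
```

The iterated call `f(0.5)(0.5)` is exactly the term f(x)(y) of the text.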
2.2. Statistical setting
We will consider supervised learning with its standard data generating process: there is a generative random variable (X, Y) t.v.i. X × Y. For z = (x, y) ∈ X × Y, the projection x onto X is called features or independent variables (or predictors, a term we will not use to avoid confusion with prediction functionals below); the projection y onto Y is called label or dependent variable or target variable. z itself is called a feature-label pair.
Usually, but not necessarily, one has X ⊆ R^n and Y ⊆ R; for a binary classification task, one considers Y = {−1, 1}, and for a regression task, one considers Y = R.
All data are assumed to be i.i.d.9, following the generative sampling distribution of (X, Y). Note that X and Y are not assumed to be independent of each other (otherwise labels would be independent of, hence completely unrelated to, the features).
Unless stated otherwise, it will be assumed that the conditional distribution Y|X is either discrete or absolutely continuous, and that it is always the same of the two, for all values which X may take. We will discuss the case of mixed distributions separately in designated sub-sections, as it will require additional care.
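As a toy illustration of this setting (a hedged sketch; the particular generative distribution below is entirely our own choice), i.i.d. feature-label pairs with dependent X and Y may be sampled as follows:

```python
# Toy i.i.d. data-generating process: features X uniform on [-1, 1],
# labels Y drawn from the conditional Y|X = x ~ N(sin(3x), 0.25),
# so X and Y are dependent, while the pairs (X_i, Y_i) are i.i.d.
import numpy as np

rng = np.random.default_rng(42)

def sample_feature_label_pairs(n):
    """Draw n i.i.d. feature-label pairs from the toy generative distribution."""
    x = rng.uniform(-1.0, 1.0, size=n)            # features, X = [-1, 1]
    y = np.sin(3 * x) + 0.5 * rng.standard_normal(n)  # labels, Y = R
    return x, y

x_train, y_train = sample_feature_label_pairs(200)
```

Each pair (x_i, y_i) is a feature-label pair in the sense above; the conditional law L(Y|X = x) is here absolutely continuous for every x.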
2.3. Classical supervised learning
In classical supervised learning, the estimation task is to “learn” a prediction functional of the type $ : X → Y. This prediction functional is estimated (“learnt”) on training data D = {(X1, Y1), ..., (XN, YN)} where the (Xi, Yi) ∼ (X, Y) are i.i.d.; the estimating process is modelled as a function-valued random variable f, t.v.i. [X → Y], which may depend on D but is assumed to be independent of anything else.
The estimator f is called a learning strategy, in contrast to a prediction functional such as $. The dependence of f on D need not be deterministic, but is generally interpreted to be algorithmically given in practical instances.
A good learning strategy f is supposed to yield a prediction functional which, on average (over D), is a good approximation of a putative relation of the type Y ≈ $(X). Quantitatively, the expected prediction error should be small on new data (X*, Y*) ∼ (X, Y) sampled from the generative distribution (independent of anything else), as quantified by a loss function L : Y × Y → R+ comparing predictions and true values. The best prediction functional minimizes the expected generalization loss,
ε_L(f) = E[L(f(X*), Y*)],
where expectations are taken over f, D, and (X*, Y*). The possibly most popular and most frequently used loss in this classical setting is the squared loss, L_sq(ŷ, y*) = (ŷ − y*)². More generally, a convex loss function is chosen; this means: y ↦ L(y, y*) is convex, and the value of L(ŷ, y*) is minimized (usually to zero, and possibly with y* considered fixed) for constant arguments ŷ, y* if and only if ŷ = y*, i.e., whenever the prediction is perfect.
(note: the setting above does not include, for the moment, losses which are not mean losses)
9by the i.i.d. assumption, this setting hence does not include time series forecasting, spatial prediction, transfer learning, etc. In particular, the main CAVEAT of the i.i.d. setting applies, which is limited generalizability outside the probabilistic sampling remit of X . In the probabilistic case this in particular also includes the uncertainty prediction itself. That is, outside the sampling remit of X , it is not only point predictions that may be wildly inaccurate, but also any estimate of uncertainty coming from a probabilistic method.
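For concreteness, the expected generalization loss may be approximated by Monte Carlo over fresh test pairs (X*, Y*); the sketch below (the generative distribution and the predictor are our own illustrative choices) uses the squared loss:

```python
# Monte Carlo estimate of the expected generalization loss under the
# squared loss L_sq(y_hat, y*) = (y_hat - y*)^2, for the toy generative
# distribution Y|X = x ~ N(2x, 1) and the predictor g(x) = 2x.
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def g(x):                      # a fixed (classical) prediction functional
    return 2.0 * x

x_star = rng.uniform(-1.0, 1.0, size=100_000)             # fresh test features
y_star = 2.0 * x_star + rng.standard_normal(x_star.size)  # fresh test labels

eps_hat = float(np.mean(squared_loss(g(x_star), y_star)))
# eps_hat is close to 1.0, the irreducible conditional variance of Y|X
```

Since g is here the true conditional mean, the estimate approaches the smallest achievable expected squared loss.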
2.4. Probabilistic supervised learning
Probabilistic supervised learning accounts for the fact that even the most perfect supervised prediction strategy is unable to predict well if the conditional random variable Y|X has a complicated (e.g., bimodal) or merely non-standard (e.g., non-normal) distribution. In such a case (which in many real world applications is the empirical standard case), it is preferable to predict the conditional random variable Y|X itself as opposed to a point estimate.
Hence we consider the estimation task of learning a prediction functional $ : X → Distr(Y), where Distr(Y) is the set of distributions supported on the set of possible outcomes Y. In the frequently considered probabilistic binary classification setting where Y = {−1, 1}, there is a one-to-one correspondence Distr(Y) ≅ [0, 1], where the right hand side number in [0, 1] is the predicted probability of the outcome 1, or one minus the probability of the outcome −1. In a regression setting, one has Distr(Y) = Distr(R), i.e., we are “predicting” a distribution on the real numbers. In either case, it is important to note that the output of the prediction functional $ is not a random variable, but an explicit distribution.
As in the classical setting, a prediction functional $ such as above will be estimated/learnt from training data D = {(X1, Y1), ..., (XN, YN)} where the (Xi, Yi) ∼ (X, Y) are i.i.d.; thus, the estimation process may be modelled by a function-valued random variable f, t.v.i. [X → Distr(Y)], which may depend on D and will be assumed to be independent of anything else.
As in the classical setting, we distinguish prediction functionals such as $ from prediction strategies such as f, whose dependence on D may be mediated through an abstract algorithm, and which formally are not functions but function-valued random variables.
Intuitively, the best prediction functional that can possibly be obtained from a probabilistic supervised learning strategy is the “oracle” $Y|X : x ↦ L(Y|X = x), i.e., the functional which for x predicts the distribution of the random variable Y conditioned on X being x (read L(Y|X = x) as “the law of Y|X = x”).
In order to quantify what makes a prediction functional that is distinct from $Y|X good in approximating it, we will consider loss functions L : Distr(Y) × Y → R with the following desired “minimality property”: the best prediction functional $Y|X should minimize the expected generalization loss, i.e.,
ε_L(f) := E[L(f(X), Y)],
where the total expectation over (X, Y, f) is taken. Note that while expectations are taken over f as well, they should not and cannot be taken over the output of f interpreted as a random variable, since the output is not a random quantity, but an explicit representation of the predictive distribution, which is crucial in assessing the prediction against ground truth observations.
A priori, it is not entirely clear whether a loss function with the minimality property even exists (it does, as shown in Section 2.5), or why expectations should not be taken over the output random variables (doing so prevents existence of such a loss function, see Section 2.8); these will be the first results in the subsequent sections which show that the so-defined setting of probabilistic supervised learning is (intuitively, mathematically, and empirically) reasonable.
2.5. Probabilistic loss functionals
In this section, we introduce losses for the probabilistic supervised learning setting presented above. We formalize desirable properties such as properness, which states that the perfect prediction minimizes the expected loss. We highlight a selection of losses that are proper in this sense. Definitions and exposition partly follow concepts and terminology in Gneiting and Raftery [46].
Definition 2.1. A probabilistic loss functional is a function L : P × Y → R ∪ {∞}, where P ⊆ Distr(Y) is a convex set of probability distributions10. The interpretation will be that L compares a prediction, the first
10 The requirement of P being convex is equivalent to expectations E[P] for random variables P t.v.i. P being well-defined, or, phrased differently, to arbitrary mixtures of measures contained in the set P still being contained in P. From a practical viewpoint, it also implies that all expectations over elements of Distr(Y) that are considered later in the manuscript exist and are well-defined. Well-definedness of the convexity stipulation on P is implied since Distr(Y) is also a convex space. That is, for any Bochner measurable
argument, to the observation, the second argument. Through the type of L, it is implicitly assumed that the prediction takes the form of a distribution in P, which encodes a predictive claim about how the second argument, the observation in Y, is sampled.
In this manuscript, we will usually consider P to be one of the following convex spaces:
(i) the set of absolutely continuous distributions on Y
(ii) the set of discrete distributions on Y
(iii) the set of mixed distributions on Y
(iv) the set of all distributions on Y
(v) one of the above with additionally specified integrability or regularity conditions on elements of P (e.g., squared-integrability)
It will always be assumed that L(Y|X = x) ∈ P for any x ∈ X, i.e., L is able to assess the best possible prediction.
Depending on the type of distributions contained in P, which by construction implies an assumption on what form both the predictions and the “true” conditionals may take, we will respectively call L an absolutely continuous probabilistic loss, discrete probabilistic loss, mixed probabilistic loss, and so on in analogy.
Definition 2.2. A probabilistic loss functional L : P × Y → R ∪ {∞} is called:
(i.a) convex if L(E[P], y) ≤ E[L(P, y)] for any random variable P t.v.i. P
(i.b) strictly convex if L(E[P], y) ≤ E[L(P, y)] for any random variable P t.v.i. P, and if equality holds if and only if P is a constant random variable
(ii.a) proper if E[L(L(Y), Y)] ≤ E[L(p, Y)] for any p ∈ P and any random variable Y t.v.i. Y where L(Y) ∈ P
(ii.b) strictly proper if E[L(L(Y), Y)] ≤ E[L(p, Y)] for any p ∈ P and any random variable Y t.v.i. Y where L(Y) ∈ P, and if equality holds, i.e., E[L(L(Y), Y)] = E[L(p, Y)], if and only if L(Y) and p are equal
(iii.a) local if L is a loss for continuous or discrete distributions, and L(p, y) = h(y, p(y)) for some function h : Y × R → R ∪ {∞}, where p(y) denotes the evaluation of the density or probability mass function (corresponding to) p at the value y (notational identification of Distr(Y) with [Y → R+])
(iii.b) strictly local if L is a loss for continuous or discrete distributions, and L(p, y) = g(p(y)) for some function g : R → R ∪ {∞}, where p(y) denotes the evaluation of the density or probability mass function (corresponding to) p at the value y (notational identification of Distr(Y) with [Y → R+])
(iv.a) global if L is not strictly local
(iv.b) strictly global if L is not local
By the usual convention, strictly [convex/proper/local/global] losses are also [convex/proper/local/global]. Hence, in particular, strictly local losses are never strictly global. We also note that convexity and properness may depend on the chosen domain P: restricting P may make a non-[convex/proper] loss [convex/proper], and a [convex/proper] loss strictly [convex/proper].
Intuitively, convexity implies that a certain prediction is better than a random one. Properness formalizes the desired minimality property discussed in the previous Section 2.4, namely that the perfect prediction minimizes the loss in expectation. Locality and globality describe whether predicted probabilities other than for the observed event are important. Strict locality is algorithmically attractive, as L can be easily computed for predictions obtained from a prediction functional f : X → Distr(Y) if the density which is the output of f is easy to evaluate at an arbitrary value, and theoretically attractive since it depends entirely on density values and not on a particular choice of Y. However, globality may be theoretically attractive for losses which consider predicted probabilities of close-by values, and will be important for mixed distributions.
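Properness can also be illustrated empirically. The following small simulation (not a proof; the particular distributions are our own illustrative setup) shows that, under Y ∼ N(0, 1), the expected log-loss is smallest for the true distribution, with the gap to a miscentred candidate approximating the Kullback-Leibler divergence:

```python
# Empirical illustration of strict properness of the log-loss:
# under Y ~ N(0, 1), E[-log p(Y)] is minimized at p = L(Y).
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(200_000)       # samples of Y t.v.i. R

def normal_log_loss(mu, samples):
    """Mean log-loss of the prediction N(mu, 1) on the given samples."""
    return float(np.mean(0.5 * (samples - mu) ** 2 + 0.5 * np.log(2 * np.pi)))

true_loss = normal_log_loss(0.0, y)    # the oracle prediction p = L(Y)
wrong_loss = normal_log_loss(0.5, y)   # a miscentred prediction
# the gap approximates KL(N(0,1) || N(0.5,1)) = 0.5**2 / 2 = 0.125
print(true_loss < wrong_loss)          # True
```

This is the empirical counterpart of (ii.b) above: expected loss differences are non-negative, and zero only for the true law.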
random variable P t.v.i. P ⊆ Distr(Y), the expectation E[P] is well-defined as an element of Distr(Y), as a Bochner integral. This is because E[||P||_TV] < ∞, which a short calculation using the closedness of Distr(Y) shows. Note that P is Bochner-measurable iff it is Borel-measurable and tight; this is fulfilled in essentially all practically relevant cases. However, if needed, one could use Gelfand-Pettis or Dunford integration instead.
Definition 2.3. We define two probabilistic loss functions, for both the continuous and the discrete case, which will be of particular interest to us:
(i) the log-likelihood loss (short: log-loss, sometimes also called: cross-entropy loss)
L_ℓ : (p, y) ↦ −log p(y).
(ii) the integrated squared loss (also called, in the continuous case: Gneiting loss, in the discrete case: multi-class Brier loss)
L_Gn : (p, y) ↦ −2 p(y) + ||p||₂².
As in Definition 2.2, p(y) denotes the evaluation of the density or probability mass function (corresponding to) p at the value y. In the discrete case, ||p||₂² := Σ_{y ∈ Y} p(y)². In the continuous case, ||p||₂² := ∫_Y p(y)² dy.
It is not immediately straightforward what the losses in Definition 2.3 measure, and, more specifically, the loss of what exactly. A justification for the nomenclature is given by the following Lemma 2.4:
Lemma 2.4. Consider the log-loss and the integrated squared loss in Definition 2.3. Let Y be an absolutely continuous or discrete random variable t.v.i. Y ⊆ R^n with pdf/pmf p_Y, and let p ∈ Distr(Y). Then it holds that:
(i) E[L_ℓ(p, Y)] − E[L_ℓ(p_Y, Y)] = ∫_Y p_Y(y) log(p_Y(y)/p(y)) dy
(ii) E[L_Gn(p, Y)] − E[L_Gn(p_Y, Y)] = ∫_Y (p_Y(y) − p(y))² dy
where the integrals ∫ are w.r.t. the Lebesgue measure on R^n in the absolutely continuous case, and w.r.t. the counting measure in the discrete case (that is, sums).
Proof. The statements follow from straightforward calculations after substituting definitions.
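In the discrete case, both losses of Definition 2.3 are one-liners; the following sketch (labels and probabilities chosen by us purely for illustration) evaluates them on Y = {−1, 1}:

```python
# The log-loss L_l(p, y) = -log p(y) and the integrated squared
# (multi-class Brier) loss L_Gn(p, y) = -2 p(y) + sum_z p(z)^2,
# for a discrete predicted distribution p represented as a dict.
from math import log

def log_loss(p, y):
    return -log(p[y])

def gneiting_loss(p, y):
    return -2.0 * p[y] + sum(prob ** 2 for prob in p.values())

p = {-1: 0.25, 1: 0.75}              # a prediction in Distr({-1, 1})
print(round(log_loss(p, 1), 4))      # 0.2877 = -log(0.75)
print(gneiting_loss(p, 1))           # -0.875 = -1.5 + (0.0625 + 0.5625)
```

Note the negative value of the integrated squared loss; as discussed further below, probabilistic losses need not be non-negative.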
Intuitively, Lemma 2.4 explains what the log-loss and the integrated squared loss measure, and explains some of the nomenclature: with the notation of Lemma 2.4, if Ŷ is a random variable with pdf/pmf p, then the RHS expression in (i) is the (information theoretic) Kullback-Leibler divergence of p w.r.t. p_Y, equivalently the cross-entropy of Ŷ w.r.t. Y minus the entropy of Y, which is known to be minimized iff p = p_Y (e.g., see Proposition 2.5 (i) for a proof). The RHS expression in (ii) is the integrated squared distance between p and p_Y, which is again minimized iff p = p_Y (this is almost immediate; also see Proposition 2.5 (ii) for a proof). The losses being, in expectation, well-known probabilistic distance measures implies that indeed the perfect prediction is the best one in terms of expected loss, as formalized through the properness property.
Further probabilistic losses, together with their properties, may be found in [46, 23], including the Bregman type losses of which both the log-loss and the integrated squared loss are special cases, see [23, Section 5.3]. Many of the below results generalize to these, though in order to keep the exposition simple, and since the algorithmic usefulness of the more general case is not entirely clear, we restrict the discussion below to the two losses in Definition 2.3 above.
We continue by stating central mathematical properties of the log-loss and the integrated squared loss, most importantly that they are strictly proper losses:
Proposition 2.5. (i) The log-likelihood loss is strictly convex, strictly proper, and strictly local.
(ii) The integrated squared loss is strictly convex, strictly proper, and global.
Proof. (i) Locality follows by definition. For convexity, consider a random variable P which takes values in the domain P ⊆ Distr(Y). Choose y ∈ Y and define Z := P(y). Note that Z is a real-valued random variable, and E[P](y) = E[P(y)] = E[Z]. Thus, by Jensen’s inequality, −log(E[Z]) ≤ E[−log(Z)]. Putting together all (in)equalities and the definition of L, one obtains
L(E[P], y) = −log(E[Z]) ≤ E[−log(Z)] = E[L(P, y)],
which proves convexity of L. Note that Jensen’s theorem also states that log(E[Z]) = E[log(Z)] if and only if Z is a constant random variable, which in the above implies strict convexity of L. For strict properness,
consider a random variable Y t.v.i. Y with probability mass or density function p_Y : Y → R+. Gibbs’ inequality states that
E[−log(p_Y(Y))] ≤ E[−log(p(Y))]
with equality if and only if the measures induced by p_Y and p agree.
(ii) Let Y be a Y-valued random variable with probability mass/density function p_Y. For strict properness, an explicit computation (similar to the one in Lemma 2.4) shows that
E[L_Gn(p, Y)] = ∫_Y (p_Y(y) − p(y))² dy − ||p_Y||₂²,
which assumes its minimal value −||p_Y||₂² if and only if p_Y = p Lebesgue/discrete-almost everywhere, which is the same as p_Y-almost everywhere as Y is assumed absolutely continuous or discrete.
The logarithmic loss is furthermore the only strictly local loss which is also strictly proper:
Proposition 2.6. The log-likelihood loss is the only loss function which is both strictly local and strictly proper for arbitrary choices of the prediction range Y and the predictive distribution space P ⊆ Distr(Y).
Proof. This is Theorem 2 in [7]. Note that the smoothness assumption is not necessary in its proof, as the differential equation may also be phrased as the integral equation g(x) = ∫_{−∞}^{x} 1/g(τ) dτ (where g : R → R is the function such that L(p, y) = g(p(y)), completely determining any local loss). Following through the proof, note that the integral equation only needs to hold for x in the range of possible pdf of Y|X, though this can be taken to be all of R+ since locality and strict properness for any choice of Y and P is required; hence the range of Y|X may be chosen to be arbitrary intervals [0, α] whose union is R+.
The losses which are local, but not strictly local, yet strictly proper may also be characterized and give rise to certain Bregman losses [23].
One important yet subtle theoretical point is that properness of a loss is defined in terms of unconditional distributions. Hence it is not immediately equivalent to the minimality property as discussed in the previous Section 2.4, which is conditional on the features from which the predictions are made. While this conditional variant of properness is more or less straightforwardly implied by the unconditional one, we state it explicitly as it establishes the relation to the task of probabilistic supervised learning:
Proposition 2.7. Let L : P × Y → R ∪ {∞} with P ⊆ Distr(Y) be a proper loss. Then, for any fixed x ∈ X, it holds that
L(Y|X = x) ∈ argmin_p E[L(p, Y)|X = x],
where the expectation is taken conditional on the event X = x, or equivalently w.r.t. the conditional random variable Y|X = x. The minimizer L(Y|X = x) is unique if L is strictly proper.
Proof. This is easily obtained from conditioning. Write p_Y := L(Y|X = x). Properness of L implies that
E[L(p_Y, Y|X = x)] ≤ E[L(p, Y|X = x)]
for any p ∈ P. Since L is deterministic (i.e., not a random quantity), the conditioning may be pulled out, yielding
E[L(p_Y, Y)|X = x] ≤ E[L(p, Y)|X = x].
The statement for strictly proper L follows from following through the inequality condition.
We will further expand upon the implications of this seemingly simple observation of Proposition 2.7, namely that proper losses also have the desired minimality property in the conditional setting, in Section 3.3.
We would also like to note that neither the log-loss nor the squared integrated loss, in general, retain from the classical set-up the property of being non-negative11. However, as the crucial step in model selection and assessment is (additive) comparison, we will see that this is not an important property to lose, especially since non-negativity holds for loss differences, as implied by the conditional version of Lemma 2.4.
Furthermore, losses that may not be split per test point, i.e., empirical losses that are not mean losses, may be considered, such as out-of-sample log-likelihoods which are dependent (across test data points). We will not consider those in this manuscript, since the right assumptions under which these are proper estimates for some generalization loss are unclear and plausibly technical12. Furthermore, the minimizer of the expected generalization loss, in any plausibly reasonable formalization of that case, would be dominated in log-loss by a prediction which is independent across test data points, as Proposition 5.4 below shows. We defer its proof until a point where the technical prerequisites are in place.
On a historical note: the use of the log-loss is classical; an early (possibly the first) occurrence for inference purposes may be found in the work of Good [48], while the squared integrated loss probably occurs first in [14] for the case of a discrete label domain. For the above concepts’ idea history (in the non-supervised predictive setting) see [46]; for further related prior art see Section 1.3.
2.6. Recovering the classical setting
The classical supervised setting with a proper mean loss may be recovered as follows: in the classical setting, i.e., where Y ⊆ R^n, one can build a probabilistic prediction functional f : X → Distr(Y) from a probability density/mass function p : Y → R+ and a classical, deterministic prediction functional g : X → Y, namely by setting [f(x)](t) = p(t − g(x)). The functional g prescribes the location of the prediction distribution, whose shape is fixed by p. While this choice seems somewhat arbitrary, there is an intimate correspondence between such predictors as measured by the log-loss, and combinations of point predictors g and a mean loss L, where the choice of mean loss L directly translates into a choice of distribution shape p. We begin with perhaps the most simple and illustrative instance to show the gist of this correspondence:
Example 2.8. Let g : X → Y be a (classical/point) prediction functional. Let ε be a Gaussian random variable with zero mean and variance σ². Define f : X → Distr(Y); x ↦ L(g(x) + ε). Then, a short explicit computation shows that

L_ℓ(f(x), y) = (1/(2σ²)) · L_sq(g(x), y) + (1/2) · log(2πσ²).

In particular, for this choice of ε the log-loss is the mean squared loss up to re-scaling and shift by a constant, hence both are equivalent for the purpose of assessing and comparing classical prediction functionals.
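The identity in Example 2.8 is easy to check numerically. The following sketch (all names and the chosen values of g(x), σ², and y are hypothetical) evaluates the log-loss of the Gaussian prediction and compares it to the re-scaled and shifted squared loss:

```python
import math

g_x, sigma2, y = 1.3, 0.7, 2.1   # hypothetical point prediction, variance, observation

# predicted density: Gaussian centred at the point prediction g(x)
def gauss_density(t):
    return math.exp(-(t - g_x) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

log_loss = -math.log(gauss_density(y))     # log-loss of the probabilistic prediction
squared = (g_x - y) ** 2                   # classical squared loss of the point prediction

rhs = squared / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)
assert abs(log_loss - rhs) < 1e-9          # the identity of Example 2.8
```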
More generally, supervised learning with any (classically) proper mean loss can be understood as predicting a fixed family of distributions where only the location is allowed to vary:
Proposition 2.9. Let g : X → Y be a (point) prediction functional, and let L : Y × Y → R+ be a point prediction loss (i.e., for the classical setting), such that the probability density

p(x) ∼ exp(−L(x, 0)),  i.e.,  p(x) = (1/C) · exp(−L(x, 0)) for a suitable normalizing constant C,

11 The "argument" for non-negativity, namely that −log p is always positive since p ≤ 1, is in general wrong, as densities can assume values larger than 1 (a fact which beginners in statistics are sometimes unaware of). However, the argument is correct for discrete distributions, where p ≤ 1 is in fact true; and following the discussion in Section 4, it can be shown to be correct in the general case in a certain sense.
12 It is not immediately apparent how to take asymptotics over N in a prediction strategy f : X^N → Distr(Y^N) that predicts non-independent probabilities; note that Distr(Y^N) ≠ Distr(Y)^N. If N is considered a parameter of f, then without further assumptions f could be anything, and in particular without sensible loss asymptotics in N. On the other hand, it is not possible, as in the independent setting, to evaluate a fixed f at any number of test data points, only at the fixed number of N test points. There are multiple additional assumptions by which either issue may be remedied, but none of those we were able to see seem canonical or promising additions to the framework. Though this may be an interesting future research avenue.
is well-defined. Let ε be a random variable with density p. Define f : X → Distr(Y); x ↦ L(g(x) + ε). Then,
L_ℓ(f(x), y) = L(g(x), y) + log C,

where C = ∫_Y exp(−L(t, 0)) dt is the normalizing constant from Proposition 2.9.
Proof. This follows from substitution of definitions and a short elementary computation.
Important example instances of the correspondence in Proposition 2.9: the choice of squared loss L : (ŷ, y) ↦ (ŷ − y)² corresponds to the choice of a Gaussian density p; the choice of absolute loss L : (ŷ, y) ↦ |ŷ − y| corresponds to the choice of a Laplacian density p.
Note that the computation in the example and the proof of Proposition 2.9 are exactly the same when interpreting ε as an additive, probabilistic error term in a (frequentist) generative model, say with a Gaussian or Laplacian distribution. However, in the probabilistic setting, we do not interpret ε as an error, but as part of the prediction: for one, the exact structure of ε is a (possibly unvalidated) assumption, and in fact a smarter method could be expected not only to predict the mean but also to output a predictive variance that may depend on the features (= the argument of f), a case which is neither covered nor modelled in the classical setting.
More precisely, we do not assume any (frequentist) generative model beyond the observable, or any class of (Bayesian) belief models, since neither is directly observable, hence neither is empirically validatable. Instead, we consider the in-practice observable (the data) as a "ground truth" in the empirical sense, i.e., a point of comparison for our prediction(s). In fact, the above discussion shows that the very common frequentist assumption of a homoscedastic noise term in the generative model may easily (mis-)lead one to disregard any meaningful probabilistic prediction (while it is also not entirely wrong, since it leads to useful special cases such as least squares regression).
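The Laplacian instance of the correspondence can be checked in the same way: for L(t, 0) = |t|, the normalizing constant of Proposition 2.9 is C = ∫ exp(−|t|) dt = 2, so the log-loss of the Laplace prediction should equal the absolute loss plus log 2. A minimal sketch (values hypothetical):

```python
import math

g_x, y = 0.4, -1.1               # hypothetical point prediction and observation
C = 2.0                          # normalizing constant: the integral of exp(-|t|) dt

def laplace_density(t):
    # predicted Laplace density centred at the point prediction g(x)
    return math.exp(-abs(t - g_x)) / C

log_loss = -math.log(laplace_density(y))
assert abs(log_loss - (abs(g_x - y) + math.log(C))) < 1e-9   # absolute loss + log C
```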
As a further remark of interest, Proposition 2.9 puts all classical mean losses, including the aforementioned mean squared error and mean absolute error, on the same scale by making them comparable through their identification with probabilistic supervised models, hence objects of the same kind that can be data-dependently compared on a common scale. This common scale is furthermore canonical since the log-loss is canonical, i.e., the only proper local loss by Proposition 2.6. Interestingly, Example 2.8 may already be found in lecture notes of Singh [108, Section 13.4], though this appears to be the only substantial part of those lecture notes making explicit reference to the supervised setting, and we were unable to find any continuation of the discussion in the literature.
We would also briefly like to stress why considering prediction functionals which predict delta distributions supported at a point prediction is not a valid way to recover the classical setting. While there is a 1:1 identification with classical prediction strategies, prediction of a delta distribution is incompatible with the more natural choices of probabilistic losses above (i.e., the ones in Definition 2.3). Thus considering delta predictions induces no correspondence between the whole settings, i.e., between combinations of prediction strategies and losses on the classical and probabilistic sides, only between the prediction strategies, while breaking the connection to the loss functional. Conversely, note how Proposition 2.9 yields exactly the same loss, identifying a pair of classical loss and classical prediction functional with a pair of probabilistic loss (the log-loss) and probabilistic prediction functional, where the identification on the level of the evaluated loss is algebraically exact.
2.7. Probabilistic losses for mean-variance predictions
The correspondence highlighted in Section 2.6, that is, between classical point predictions with a choice of loss and probabilistic predictions of a specific parametric form, may be further extended to a correspondence of certain probabilistic predictions to mean-variance predictions. Here, mean-variance predictions refer to predictions of only a mean and a variance rather than of a full distribution. We carry the correspondence out for the real univariate case.
Definition 2.10. We define the mean-variance likelihood-loss as
L_mv : (R × R+) × R → R,  ((μ, ν), y) ↦ ν⁻¹ · (μ − y)² + log ν.
As usual, the first argument of L_mv is interpreted as a prediction, namely of a pair (μ, ν) of a mean μ and a variance ν. The observation is the second argument y; in the case of a perfect prediction it is sampled from a distribution with the predicted mean and variance, without further specification of the distributional form.
For illustrative purposes, we first prove a proper-like property for the mean-variance likelihood-loss:
Proposition 2.11. Let Y be a random variable with finite expectation and variance. Then,
(E[Y], Var[Y]) = argmin_{(μ,ν)} E[L_mv((μ, ν), Y)].

That is, the perfect prediction is the unique minimizer of the expected mean-variance likelihood-loss.
Proof. An elementary computation shows that
E[L_mv((μ, ν), Y)] = ν⁻¹ · E[(μ − Y)²] + log ν = ν⁻¹ · Var[Y] + ν⁻¹ · (μ − E[Y])² + log ν.

For fixed ν, the right hand side is uniquely minimized by μ = E[Y], i.e., E[L_mv((E[Y], ν), Y)] ≤ E[L_mv((μ, ν), Y)] for arbitrary ν. The LHS is equal to ν⁻¹ · Var[Y] + log ν =: g(ν). Consider now the function h : (0, ∞) → R, z ↦ z · Var[Y] − log z, and note that g(ν) = h(ν⁻¹). The function h is smooth, with derivatives h′(z) = Var[Y] − z⁻¹ and h″(z) = z⁻². As h″ is always positive on the domain, the unique critical point of h, which is the zero of h′, i.e., z⁻¹ = Var[Y], is the unique minimum of h. Hence ν = Var[Y] is the unique minimum of E[L_mv((E[Y], ν), Y)], i.e., E[L_mv((E[Y], Var[Y]), Y)] ≤ E[L_mv((E[Y], ν), Y)] for any ν, and the minimizer is unique.
Putting the two inequalities together proves the claim.
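Proposition 2.11 also applies verbatim to the empirical distribution of a sample, which allows a simple numerical sanity check: for a deliberately non-Gaussian sample, the empirical mean and variance should minimize the empirical mean-variance likelihood-loss. A sketch (sample size and perturbations are arbitrary choices):

```python
import math
import random

def l_mv(mu, nu, y):
    # mean-variance likelihood-loss of Definition 2.10
    return (mu - y) ** 2 / nu + math.log(nu)

random.seed(0)
ys = [random.expovariate(1.0) for _ in range(10000)]    # deliberately non-Gaussian
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)

def risk(mu, nu):
    # empirical expected loss over the sample
    return sum(l_mv(mu, nu, y) for y in ys) / len(ys)

base = risk(mean, var)
for d_mu, d_nu in [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    assert base < risk(mean + d_mu, var + d_nu)         # (mean, var) is the minimizer
```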
The connection to probabilistic supervised learning may be established as follows:
Remark 2.12. Let g : X → R × R+ be a mean-variance prediction functional. Let ε be a standard normal random variable, i.e., a Gaussian random variable with zero mean and variance one. Define f : X → Distr(R); x ↦ L(g(x)₁ + √(g(x)₂) · ε). Then, a short explicit computation shows that

L_ℓ(f(x), y) = (1/2) · L_mv(g(x), y) + (1/2) · log(2π).

In particular, for this choice of predictive distribution the log-loss is the mean-variance likelihood-loss up to re-scaling and shift by a constant, hence both are equivalent for the purpose of assessing and comparing mean-variance prediction functionals.
In this context, Proposition 2.11 may be seen as a loss analogue of the well-known statement that sample mean and variance are maximum likelihood estimators under Gaussianity assumptions. However, we would like to make two additional observations:
(i) Proposition 2.11 holds without any assumptions on the distributional form of the true generative distribution (beyond finite expectation/variance). In particular, Gaussianity does not have to be assumed anywhere.
(ii) In particular, a full distributional prediction of a Gaussian with mean and variance parameters, evaluated by the log-loss, may equivalently be seen as a prediction of mean and variance only, evaluated by the mean-variance likelihood-loss. This statement is entirely unaffected by the true distributional form of what is predicted, which may or may not be Gaussian.
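Observation (ii) can be checked numerically: the log-loss of the Gaussian N(μ, ν) should equal half the mean-variance likelihood-loss plus (1/2) · log(2π), regardless of how the observation was generated. A sketch with hypothetical values:

```python
import math

mu, nu, y = 0.5, 2.0, 1.7        # hypothetical mean-variance prediction and observation

l_mv = (mu - y) ** 2 / nu + math.log(nu)                         # mean-variance loss
gauss_pdf = math.exp(-(y - mu) ** 2 / (2 * nu)) / math.sqrt(2 * math.pi * nu)
log_loss = -math.log(gauss_pdf)                                  # log-loss of N(mu, nu)

assert abs(log_loss - (0.5 * l_mv + 0.5 * math.log(2 * math.pi))) < 1e-9
```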
While at first glance it may seem strange that predicting a full distribution should be just as good as predicting only two key parameters, the reason is that we use likelihood-based losses for evaluation. The likelihood is closely linked to entropic measures, and the Gaussian is the maximum-entropy distribution among those with fixed mean and variance. Further connections between likelihood, entropy, and prediction are explored in Section 5.4.
2.8. Short note: why taking (posterior) expectations is a bad idea
An observant reader may wonder whether the functional substitution (and hence the added level of complication) in defining the loss function with a functional and a scalar argument is really necessary; perhaps defining a dummy random variable Z_p with density p and taking an expectation, say

L(p, y) := E_Z[L_sq(Z_p, y)] = E_Z[(Z_p − y)²],

might be a better and simpler idea, as one could now take any "classical" loss function inside the expectation brackets. This is indeed what is occasionally used in practice13 in a Bayesian context under the name
"posterior predictive checks" (PPC), where Z_p follows the sampling distribution of the posterior sample (and/or produces a finite posterior sample).
Unfortunately, this strategy is inappropriate for validating probabilistic predictive strategies, as it is logically broken: the deterministic, hence non-probabilistic, prediction of the mean E[Z_p] always outperforms the probabilistic one, due to Jensen's inequality and convexity of the classical loss. For the use of the classical squared loss specifically, as in the usual bias-variance decomposition (compare Proposition 5.8), one obtains more explicitly
L(p, y) = Var(Z_p) + (y − E_Z[Z_p])² ≥ (y − E_Z[Z_p])² = L_sq(E_Z[Z_p], y)

for arbitrary p. The main logical issue with PPC is that any non-trivial predictive distribution is dominated by predicting a point estimate (its mean), as measured by the PPC through L. Thus, adding predictive expressivity through the distribution confers no advantage over classical point predictions, and one could have restricted oneself to point predictions in the first place. Nota bene, this holds only as measured through PPC; thus the correct conclusion is that PPC is an unsuitable criterion for assessing probabilistic predictions, rather than that there is an issue with the task of probabilistic prediction itself14. On the contrary, modelling the predictive distribution accurately is certainly something that should be rewarded above not modelling it, if quantifying the uncertainty inherent to the prediction is important; of course, it should then be measured by a method appropriate for the probabilistic setting, such as a proper probabilistic loss functional.
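The domination can be illustrated with a simulated "posterior sample" standing in for Z_p: the PPC loss exceeds the squared loss of the point prediction E[Z_p] by exactly Var(Z_p), for any observation y. A sketch (the distribution and the observations are arbitrary choices):

```python
import random
import statistics

random.seed(1)
# a probabilistic prediction represented by a (hypothetical) posterior-style sample
z_sample = [random.gauss(0.0, 1.5) for _ in range(20000)]
z_mean = statistics.fmean(z_sample)
z_var = statistics.fmean([(z - z_mean) ** 2 for z in z_sample])

for y in (-2.0, 0.3, 4.1):                                        # arbitrary observations
    ppc_loss = statistics.fmean([(z - y) ** 2 for z in z_sample])  # E[(Z_p - y)^2]
    point_loss = (z_mean - y) ** 2                                 # L_sq(E[Z_p], y)
    assert abs(ppc_loss - (z_var + point_loss)) < 1e-6  # the decomposition above
    assert ppc_loss >= point_loss                       # Jensen: the point prediction wins
```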
Given the above, one may also wonder whether PPC may be modified to yield an appropriate method for assessing probabilistic predictions. For example, one could include moments of Z_p instead of the variation around y; but in order to distinguish arbitrary distributions from the optimal one, an infinity of moments is needed (since the set of possible distributions has infinite cardinality), which then indirectly leads back to substituting into the density (which is the Fourier transform of the characteristic function) by a detour with more restrictive convergence assumptions.
Some parts of this discussion may be found, in mathematically similar form, at the end of Section 3.1.2 of [121] (though the connection to the discussion in Section 2.6 seems to be new, as does the implied interpretation of this discussion with regard to the popular PPC methodology).
13 What is presented is common usage in Bayesian machine learning, see [118], as opposed to the term's use in the Bayesian statistics community, see for example [42], where PPC refers to a series of semi-graphical sanity checks rather than a quantitative, loss-based evaluation.
14 In particular, the argument sometimes heard in the context of machine learning, namely that point predictions are all that matter since they dominate probabilistic ones in PPC, is invalid, as it implicitly, and falsely, assumes the validity of PPC for assessing probabilistic predictions.
3. Model assessment and model selection
In this section, we study model assessment and model selection strategies in the probabilistic supervised learning setting. More precisely, we showcase algorithms to estimate ε(f) for a prediction functional or prediction strategy f, similar to those in the classical supervised learning setting. In this section, the loss L is fixed, and the generalization error ε = ε_L uses that fixed loss, unless stated otherwise.
3.1. Out-of-sample estimation of the generalization loss
For a prediction functional f taking values in [X → Distr(Y)], possibly estimated from training data, the expected generalization loss ε(f) can be straightforwardly estimated from an independent test sample T = {(X*_1, Y*_1), …, (X*_M, Y*_M)}, in a direct application of the usual out-of-sample/empirical estimation strategy for the generalization loss in the classical supervised setting. More precisely, in the out-of-sample setting one assumes that the test sample T and the estimated prediction functional f are independent. The usual estimator for the generalization loss is the simple (conditional) mean estimator
ε̂(f | T) = (1/M) · Σ_{i=1}^{M} L(f(X*_i), Y*_i)  for estimating  ε(f) = E[L(f(X), Y)].
As the feature-label pairs in T are i.i.d., the summands L(f(X*_i), Y*_i) are also i.i.d. after conditioning on f; hence the estimator ε̂ for estimating ε is an instance of estimating a sample mean, namely that of the (conditionally i.i.d.) sample of test losses L_i := L(f(X*_i), Y*_i). Thus, the estimation asymptotics of ε̂ will be governed by the size of the test set M, and the concrete model f. Summarizing this discussion more formally:
Proposition 3.1. Let f be a [X → Distr(Y)]-valued random variable (a prediction functional possibly depending on random training data through a learning strategy). Consider a test set T = {(X*_1, Y*_1), …, (X*_M, Y*_M)} with (X*_i, Y*_i) ∼ (X, Y) i.i.d., and independent of f (as a set, i.e., mutually). Then,

ε̂(f | T) := (1/M) · Σ_{i=1}^{M} L(f(X*_i), Y*_i)
is an estimator of the expected generalization loss ε(f) = E[L(f(X), Y)] (expectations taken over f and (X, Y)), with the following properties:
(i) ε̂(f | T) is an unbiased estimator of ε(f), i.e., E[ε̂(f | T)] = ε(f).
(ii) ε̂(f | T) | f is a consistent estimator of ε*(f) := E[L(f(X), Y) | f], i.e., when conditioning both the estimator and the estimated quantity on f. Important note: this is in general wrong without the conditioning on f.
(iii) √M · (ε̂(f | T) − ε*(f)) → N(0, Var(L(f(X), Y) | f)) in distribution as M → ∞, i.e., ε̂(f | T) converges at the O(1/√M) rate (conditional on f).
Proof. These are standard properties of the i.i.d. sample mean estimator, applied to ε*(f), which is a mean of the conditionally i.i.d. losses L(f(X*_i), Y*_i). (iii) is the classical central limit theorem for the i.i.d. sample mean.
Note that property (ii) of Proposition 3.1 intuitively says that ε̂ is a good estimator for assessing the goodness of a particular learnt prediction functional f, though not necessarily (and in practical situations usually not) a good estimator for assessing the goodness of the prediction strategy which led to f, i.e., the prediction strategy whose formal counterpart is the unconditional random variable f. An estimator of the latter kind can be obtained by replicates and/or re-sampling, akin to the frequently used cross-validation or bootstrapping schemes. However, none of those are specific to the probabilistic
losses or the probabilistic setting in general, nor are the properties in Proposition 3.1, which exposes a direct and straightforward correspondence to the classical setting. By virtue of this correspondence, the many open questions in cross-validation and bootstrapping also remain open for the probabilistic setting, and solving them in one setting will usually also solve them in the other. Since neither is the aim of this particular manuscript, we refer the reader to the respective meta-learning literature [e.g., 2] instead of explicitly substituting a probabilistic L into a number of well-known (or less well-known) statements for deterministic losses.
Finally, we would like to note that property (iii) of Proposition 3.1 guarantees model selection at the O(1/√M) asymptotic rate, as in the case of classical supervised learning, under the assumption that an algorithm for model selection has been specified.
3.2. Model comparison and model performance
Estimating the generalization loss for a single model is worthless without a point of comparison, which can be a different model or a baseline model. This is no different from the classical case, where measures such as the regression MSE or classification accuracy are also scientifically meaningless when given only for a single model without a baseline. Note that even though "zero" appears to be a canonical reference point in the classical setting, it is in general never reached, even by a perfect model, due to the residual noise, which is a data-dependent quantity (compare the term Var(ε) in Proposition 5.5 below). In any case, the crucial point of comparison is rather the "uninformed" model, a surrogate for guessing, that is, for not using any information about the features. This is because it is of much higher practical interest to determine whether a model is better than guessing than to determine whether it is worse than perfect (which it always will be anyway, except in the empirically unnatural noise-free setting). The performance of such an uninformed model, in terms of, say, MSE or accuracy in the classical setting, or the probabilistic losses (e.g., log-loss or squared integrated loss) in the probabilistic setting, will always be data-dependent, and hence needs to be estimated for an explicit comparison, or implicitly compared to.
Similarly, automated tuning of parameters or other meta-modelling schemes are also impossible without a proper way to compare the predictive goodness of different models, and to compare to an uninformed baseline.
The principle of model comparison in the probabilistic setting is simple and can be seen as a direct generalization of the classical setting. Suppose we would like to compare models/modelling strategies f_1, …, f_k, one or multiple of which are possibly state-of-the-art baselines or uninformed baselines.
We can do this on an independent test sample T = {(X*_1, Y*_1), …, (X*_M, Y*_M)}, by computing the samples of test errors

S_j = { L(f_j(X*_i), Y*_i) : i ∈ {1, …, M} }  for j = 1, …, k.

Recall that ε̂(f_j | T) is the mean of the sample S_j, whose elements are i.i.d. conditional on f_j (due to the assumption that T is an independent sample), and thus follows the normal large-sample asymptotic of the mean, as stated in Proposition 3.1. Further notice that the samples S_j are paired/grouped by the index i of L(f_j(X*_i), Y*_i).
It is now interesting to observe how frequentist and Bayesian approaches to model selection coincide on the paired samples S_j. Namely, writing μ̂_j := ε̂(f_j | T) for the mean of the sample S_j, one has:
(i) the difference in means μ̂_j − μ̂_i may be seen as a difference in evidence. In fact, if the comparison is by log-loss, i.e., L = L_ℓ, then this coincides with the (Bayesian) logarithmic Bayes ratio for the independent test set, often used in Bayesian model comparison of f_j and f_i (though often the test set is not necessarily independent).
(ii) the samples S_j and S_i are paired i.i.d. and follow a normal asymptotic, hence a (frequentist) hypothesis test for difference of means (such as the t-test) can be used for frequentist model comparison.
(iii) Note how such a (frequentist) hypothesis test assesses the plausibility of the (Bayesian) difference in evidence; in the case of the log-loss, whether the logarithmic Bayes ratio is different from zero (though a stylized purist Bayesian would not accept the concept of "significance" as a measure for how non-zero the ratio is).
Similarly, there is a correspondence between group-wise or portmanteau hypothesis tests and quantifications based on the collection of all pairwise Bayes factors. In the forecasting setting, this dual interpretation has already been noted in [46, Section 7] and in [24, Section 3], while the normal asymptotic of the μ̂_j is characteristic of the supervised prediction setting, as it is tied to the i.i.d.-ness of the test sample.
More generally, even though μ̂_j has a (generative, frequentist) normal asymptotic, i.i.d.-ness of the test sample allows the application of non-parametric pairwise or portmanteau hypothesis tests, such as the Wilcoxon signed-rank test for pairwise testing, and Cramér-von Mises or Anderson-Darling for portmanteau testing. Non-parametric testing may even be preferable to the application of Student-family tests, since in a finite-sample setting the generative distribution may deviate arbitrarily from the normal. Algorithm 1 gives pseudo-code for prototypical (single-split) validation.
Further, the samples S_i may be used to compute (frequentist) z-approximated confidence intervals for μ̂_i, via the standard error of the sample S_i. In our setting, such a confidence interval would (asymptotically) coincide with a (Bayesian) credible interval for μ̂_i, due to the normal asymptotic of μ̂_i. Algorithm 2 describes pseudo-code for a prototypical (single-split) strategy to estimate the μ̂_i and the frequentist standard errors σ̂_μ for the confidence intervals (a two-sided 95% confidence interval would be μ̂_i ± 1.96 · σ̂_μ).
Also note that the central limit asymptotic in Proposition 3.1 guarantees selection of the best prediction functional(s) in a (frequentist) generative setting if the comparison above is used for model selection (i.e., select the models for which no significantly better models exist).
Automated model tuning may similarly be done via the usual nested validation schemes, though parameter tuning may also be attempted through method-specific (e.g., likelihood-type or Bayesian) methodology.
Algorithm 1 Validation meta-algorithm: Group-wise comparison of probabilistic predictors.
Input: training data D, test data T = {(X*_1, Y*_1), …, (X*_M, Y*_M)}, predictors f_1, …, f_k, a grouped sample test
Output: which predictors are best/how they compare
1: Train f_1, …, f_k on D.
2: Use the trained f_j to make predictions p_ij := f_j(X*_i), i = 1, …, M, j = 1, …, k, for the labels in the test set.
3: Obtain grouped/paired samples S_j := { L(p_ij, Y*_i) : i ∈ {1, …, M} } for j = 1, …, k.
4: Run a paired/grouped test on the grouped/paired samples S_j. For example, if k = 2, for a comparison of two methods, run a Wilcoxon signed-rank test on the loss pairs (L(p_i1, Y*_i), L(p_i2, Y*_i)), i = 1, …, M, and report the significance level.
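A minimal executable sketch of Algorithm 1 for k = 2 on synthetic data (the predictors, the data-generating process, and the Gaussian log-loss helper are all hypothetical; a t-statistic on the paired loss differences stands in for the Wilcoxon signed-rank test, to keep the sketch dependency-free):

```python
import math
import random
import statistics

random.seed(2)

def log_loss_gauss(mu, var, y):
    # log-loss of a Gaussian prediction N(mu, var) at observation y
    return (y - mu) ** 2 / (2 * var) + 0.5 * math.log(2 * math.pi * var)

def make_pair():
    # synthetic feature-label pair with Y = X + noise
    x = random.gauss(0, 1)
    return x, x + random.gauss(0, 0.5)

test = [make_pair() for _ in range(500)]

# step 3: grouped loss samples for an informed predictor and an uninformed baseline
S1 = [log_loss_gauss(x, 0.25, y) for x, y in test]    # predicts N(x, 0.25)
S2 = [log_loss_gauss(0.0, 1.25, y) for _, y in test]  # constant N(0, 1.25)

# step 4: paired comparison on the loss differences
d = [a - b for a, b in zip(S1, S2)]
t_stat = statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
print(t_stat)  # markedly negative: the informed predictor has lower loss
```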
Algorithm 2 Validation meta-algorithm: Error estimate and frequentist confidence intervals for the predictive loss.
Input: training data D, test data T = {(X*_1, Y*_1), …, (X*_M, Y*_M)}, prediction strategy f
Output: an estimate μ̂ for ε(f), and an error estimate σ̂_μ for the standard error of this estimate
1: Train f on D.
2: Use the trained f to make predictions p_1 := f(X*_1), …, p_M := f(X*_M) for the labels in the test set.
3: Return ε̂ := (1/M) · Σ_{i=1}^{M} L(p_i, Y*_i) as the estimate.
4: Return σ̂_μ := [ (1/(M(M−1))) · Σ_{i=1}^{M} (L(p_i, Y*_i) − ε̂)² ]^{1/2}.
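A minimal executable sketch of Algorithm 2, again with a hypothetical Gaussian prediction strategy and synthetic test data:

```python
import math
import random

random.seed(3)

def log_loss_gauss(mu, var, y):
    # log-loss of a Gaussian prediction N(mu, var) at observation y
    return (y - mu) ** 2 / (2 * var) + 0.5 * math.log(2 * math.pi * var)

def make_pair():
    # synthetic feature-label pair with Y = X + noise
    x = random.gauss(0, 1)
    return x, x + random.gauss(0, 0.5)

test = [make_pair() for _ in range(1000)]

# step 2: predictions of the (hypothetical, already trained) strategy N(x, 0.25)
losses = [log_loss_gauss(x, 0.25, y) for x, y in test]
M = len(losses)

eps_hat = sum(losses) / M                                        # step 3: loss estimate
sigma_hat = math.sqrt(sum((l - eps_hat) ** 2 for l in losses) / (M * (M - 1)))  # step 4
ci = (eps_hat - 1.96 * sigma_hat, eps_hat + 1.96 * sigma_hat)    # two-sided 95% interval
print(round(eps_hat, 3), tuple(round(c, 3) for c in ci))
```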
3.3. Uninformed baseline: label density estimation
In scientific model assessment, a necessary argument for the usefulness of any model is its improvement over "uninformed guessing": even if model A improves upon model B, both could be worse than an uninformed guess, making the A vs. B comparison meaningless without a further comparison to such a "guessing" baseline.
For (both probabilistic and classical) supervised learning, we claim (and show) that the natural baselines are given by the class of prediction functionals that disregard the test features in making the prediction on the test set. It is not a-priori clear whether such a baseline should be able to make use of the training features, the training labels, or both; below we give a mathematical argument why the uninformed baseline is allowed to make use of the full training set. We start by formalizing the concept of uninformedness and related properties.
Definition 3.2. A prediction functional f : X → Distr(Y) is called uninformed if it is a constant functional, i.e., if f(x) = f(y) for all x, y ∈ X. We write u_p for the uninformed prediction functional u_p : x ↦ p. A prediction strategy, i.e., a [X → Distr(Y)]-valued random variable, is called uninformed if all the values it takes are uninformed (prediction functionals).
Note that uninformedness does not imply that for fixed x ∈ X, the evaluation f(x) is a constant random variable, as f may depend on the (random) training data D. However, the following characterizations hold:
Proposition 3.3. In the above setting, the following are equivalent.
(i) The strategy f is uninformed.
(ii) f(Z) is conditionally independent of Z, given f, for every X-valued random variable Z independent of f.
In particular, statement (i) implies that (but is in general not equivalent to)
(ii') f(Z) is conditionally independent of Z, given f.
Proof. (i) ⇒ (ii): Let Z be an arbitrary X-valued random variable independent of f. By definition of independence, it holds that f = f | Z, and in particular f(x) = f(x) | Z for an arbitrary constant x ∈ X. Since f is uninformed, it holds that f(x) = f(Z). Thus, f(Z) = f(Z) | Z, which implies (ii), because Z was arbitrary.
(ii) ⇒ (i): We prove this by contraposition. Suppose not-(i), i.e., f is not uninformed. Then, by definition, there are x, x′ ∈ X such that f(x) = f(x′) does not hold with probability one. Thus, a random variable Z which assumes the values x, x′ with probability 1/2 each yields an f(Z) that is not conditionally independent of Z given f, which is not-(ii).
We will show that the best uninformed prediction is given by the perfect density estimate:
Definition 3.4. We call the predictor $_Y := u_{L(Y)} = [x ↦ L(Y)] (the constant predictor which always predicts the law of Y) the best uninformed predictor.
For this, we relate performance of prediction strategies to prediction functionals:
Definition 3.5. Let f be a prediction strategy. We define [Ef] : x ↦ E[f(x)]. For any random variable Z, we define [Ef | Z] : x ↦ E[f(x) | Z].
Lemma 3.6. Let L be a convex loss. It holds that ε_L([Ef]) ≤ ε_L(f). If L is strictly convex, equality holds if and only if f is a prediction functional.
Proof. By Jensen's inequality and convexity of L, it holds that E[L(E[f(x)], Y)] ≤ E[L(f(x), Y)] for any prediction strategy f and any x ∈ X. Thus, it also holds that

ε_L([Ef]) = E[L([Ef](X), Y)] ≤ E[L(f(X), Y)] = ε_L(f).

The equality statement is given by the converse statement in Jensen's inequality.
Proposition 3.7. Let L be a convex, proper loss. The following are equal:
(i) ε_L($_Y)
(ii) min { ε_L(f) : f is an uninformed prediction functional }
(iii) min { ε_L(f) : f is an uninformed prediction strategy }
If L is also strictly proper, the minima in (ii) and (iii) are unique and only attained if f = $_Y (X-almost everywhere).
Proof. We begin by proving equivalence of (i), (ii), and (iii).
Denote by S_(i), S_(ii), S_(iii) the quantities defined in (i), (ii), (iii). Since $_Y is an uninformed prediction functional and all prediction functionals are prediction strategies, it holds that
S_(i) ≥ S_(ii) ≥ S_(iii).

S_(i) ≤ S_(ii): By properness of L, it holds that E[L(L(Y), Y)] ≤ E[L(p, Y)] for any p (in the considered domain); by definition of uninformed predictors this is equivalent to E[L($_Y(x), Y)] ≤ E[L(u_p(x), Y)] for any x ∈ X. Thus,

ε_L($_Y) = E[L($_Y(X), Y)] ≤ E[L(u_p(X), Y)] = ε_L(u_p).

This implies S_(i) ≤ S_(ii), since all elements of the set in (ii) are of the form u_p for some p.
S_(ii) ≤ S_(iii): Lemma 3.6 implies ε_L([Ef]) ≤ ε_L(f). Since [Ef], by definition, is a prediction functional, this implies that S_(ii) ≤ S_(iii).
All inequalities together prove the equivalence. Uniqueness in case L is strictly proper follows from the above, observing that in this case the inequalities are strict unless $_Y(x) = u_p(x) for (X-almost all) x ∈ X.
The main argument why the best uninformed prediction is the correct mathematical object for a baseline comparison is mathematical: as we will show later, for a strictly proper loss, outperforming the best
uninformed predictor $_Y is possible if and only if Y and X are not statistically independent:
Theorem 1. Let L be a strictly proper, strictly convex loss. Then, the following are equivalent:
(i) X and Y are statistically independent.
(ii) There is no prediction functional f : X → Distr(Y) such that ε_L(f) < ε_L($_Y). I.e., there is no (L-)better-than-uninformed prediction functional for predicting Y from X.
(iii) There is no prediction strategy f taking values in [X → Distr(Y)] such that ε_L(f) < ε_L($_Y). I.e., there is no (L-)better-than-uninformed prediction strategy for predicting Y from X.
Proof. This follows from the later Theorem 2 where the proof is carried out.
Phrased more intuitively, Theorem 1 implies that $_Y is the best possible baseline certifying "uninformedness" in the formal statistical sense, since any prediction that is better necessarily implies that non-baselines can exist, both in the self-referential sense and through necessarily implying statistical dependence of Y on X. Or, more positively: if any prediction strategies are baselines, then certainly those that are uninformed prediction functionals, since their prediction is obtained without looking at any data: not at the training data, by being fixed, and not at the test features, by being constant per Definition 3.2. On the other hand, Proposition 3.7 in combination with Theorem 1 states that being able to outperform all such baselines is equivalent to statistical dependence of Y on X.
We conclude with some remarks on the concept of uninformedness as given above. First, Proposition 3.7 may seem to show, at first glance, that, contrary to what is claimed above, using the smallest class of uninformed predictors, namely the fixed predictors which are entirely independent of any data, is a better choice in practice. However, this conclusion is a non-sequitur and in fact faulty: if uninformed predictors are a baseline to which one wishes to compare, considering a wider, not a smaller, class of such prediction strategies will give a more solid baseline. This is especially true in the estimation sense discussed in Section 5.4.
Second, the above argumentation exposes a scientifically sufficient set of baselines: namely, methods estimating the (unconditional) law L(Y) from the i.i.d. sample (X_1, Y_1), ..., (X_N, Y_N) ∼ (X, Y). It is interesting to note that while the class of uninformed baselines includes density estimators which perform this task by looking only at Y_1, ..., Y_N ∼ Y, density estimation is a strict subset, since the side information in the X_i may be used, and may be useful even if the task is an unconditional estimation of L(Y). This is for example easily seen in the situation where X is identical to Y up to some small noise terms.

On the topic of inference, we remark the following: if the goodness of the prediction is measured by the logarithmic loss function L_ℓ, goodness of prediction is measured by the likelihood of the predicted distribution. Thus, if it is known that Y follows a parametric distribution class, then (frequentist) maximum likelihood estimation, (Bayesian) parametric inference, or (hybrid) maximum a-posteriori estimation are possible uninformed estimators that may be baselines. For classification, i.e., a finite Y, density estimation is particularly simple; a loss-optimal uninformed prediction is achieved by predicting as probabilities the observed class frequencies (assuming all classes have been observed at least once in the training sample).

For the regression case and the case of general Y, three characteristic differences to common inference should be observed: First, even if a parametric model is known, the parameter itself is not of interest - since the distribution is the output. In particular, in frequentist inference, the prediction is not the estimate, but the distribution with the parameter substituted. In Bayesian inference, the prediction is not the posterior for the parameter(s), but the distribution with the parameter whose law is the posterior distribution. The probabilistic bias-variance decomposition in Proposition 5.8 implies that taking a point-mixture, i.e., predicting a fixed distribution conditional on the training sample - namely the posterior predictive distribution for Y (without modelling X) - has lower expected loss than sampling from the parameter posterior.
Second, in general, no parametric model is known, hence the uninformed prediction will take the form of non-parametric density estimation. As we will see in Section 3.6, non-parametric variants of classical maximum likelihood or Bayesian density estimation are both ill-defined when the set of possible distributions is completely unrestricted, due to potentially unbounded likelihood. However, when combined with out-of-sample tuning, as presented in Section 3.2 and often carried out in the conditional density estimation literature (see Section 1.3.1), that problem can be avoided, as illustrated in Section 3.7.
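The loss-optimal uninformed classification baseline mentioned in Section 3.3 above - predicting as probabilities the observed class frequencies, ignoring all features - is easy to state in code. The following is a minimal illustrative sketch (our function names, not the skpro implementation):

```python
from collections import Counter
import math

def fit_uninformed_classifier(y_train):
    """Uninformed baseline: predict observed class frequencies,
    ignoring all features (a constant prediction functional)."""
    counts = Counter(y_train)
    n = len(y_train)
    probs = {label: c / n for label, c in counts.items()}
    return lambda x: probs  # constant in the test feature x

def log_loss(pred, y_true):
    """Logarithmic loss L_log(p, y) = -log p(y)."""
    return -math.log(pred[y_true])

# toy training sample: class "a" twice as frequent as "b"
y_train = ["a", "a", "b", "a", "b", "a"]
f = fit_uninformed_classifier(y_train)
p = f(None)  # the prediction ignores the feature entirely
# p == {"a": 2/3, "b": 1/3}
```

Any informed strategy must beat this constant predictor's expected log-loss to demonstrate statistical dependence of Y on X, per Theorem 1.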
3.4. Short note: why permutation of features/labels is not the uninformed baseline

In the light of our suggestion in Section 3.3 to use constant predictors, hence density estimators, as uninformed baseline, a much more popular suggestion needs to be discussed: the permutation baseline. We hold that the permutation baseline is practically inappropriate and theoretically inadequate as a baseline, as we explain below.

The permutation baseline is found, in the earliest instance known to us, implemented in the scikit-learn toolbox [91], and has recently also been discussed in the context of hypothesis testing via prediction [79], as a predictive baseline (for classical supervised classification). Mathematically, the suggestion is to use as a pseudo-uninformed baseline the predictor f : X → Y (in the classical setting) or f : X → Distr(Y) (in the probabilistic setting) which has been obtained by learning on modified training data (X_{π(1)}, Y_1), ..., (X_{π(N)}, Y_N), where π is a uniformly sampled random permutation of the numbers {1, ..., N}. The baseline performance is then estimated by the performance of such an f, given a sensible choice of learning algorithm. Alternatively, labels are permuted but not feature vectors, or both are permuted independently; both variants are equivalent to the above as long as the sub-sampling for performance estimation is uniformly random.

The quite strong (enthymematic) argument in favour of using this f as a baseline is that by jumbling the order of the features X_i, no f can make sensible use of the X_i, because the relation to the Y_i is destroyed in the process. We argue that this is a non-sequitur and hence a non-argument: the only certain way to prevent prediction of a test label Y* by a test feature X* (without preventing prediction altogether) is, by definition of what this means, actually removing access to the test feature X*.
Leaving access to X* may still enable f to predict Y* well from X*, irrespective of what happened to the training sample.
One argument which actually shows inadequacy of the permutation baseline is that a theorem such as Theorem 1 cannot be true for a permutation version of "uninformed", since a prediction strategy not restricted in its image can always guess a good prediction functional. Even worse, there are situations where one need not even resort to guessing, where the true prediction functional may even be perfectly recovered from the permuted data by the very same prediction strategy which also works on the unpermuted data. We give a stylized example where this is the case:

Let (X, Y) := (1, 1) with probability 2/3 and (X, Y) := (−1, −1) with probability 1/3, and consider, as usual, i.i.d. samples (X_1, Y_1), ..., (X_N, Y_N) for training. A naive algorithm which simply assigns the most frequent label to the most frequent feature - such as the one sketched as Algorithm 3 - will make correct predictions with high probability. This is true no matter whether the permutation is applied or not, and is independent of the size of the dataset; note for example that none of the variables or decisions in Algorithm 3 change.
Algorithm 3 Count association classifier, fitting. This is a theoretical counterexample to the permutation baseline; do not use in practice.
Input: Training data D = {(X_1, Y_1), ..., (X_N, Y_N)}, where X_i and Y_i take values in {−1, 1}.
Output: a classical predictor f : X → Y
1: Let N_X ← #{X_i : X_i = 1}
2: Let N_Y ← #{Y_i : Y_i = 1}
3: If |N_X − N_Y| < |N − N_X − N_Y|, output f : x ↦ x
4: Else, output f : x ↦ −x

The counterexample may seem pathological, but in fact it is only stylized. Much less pathological examples may be easily constructed by replacing the numbers −1, 1 with well-separable clusters in whatever continuous or mixed variable space. Probabilistic versions of the examples can be obtained in straightforward ways (e.g., assigning small but non-zero probabilities to the "other" class).

Together with Theorem 1, one may conclude that defining "uninformedness" through the data and not through properties of the learning algorithm may be theoretically insufficient15, especially since it does not define a valid class of baselines without further modifications or assumptions, as the above examples and considerations show. In contrast to the permutation baseline criticized above, the "constant prediction" definition of uninformedness forbids f, by construction, to make use of the relevant information in prediction. The claim that a constant f is indeed prevented from using any learnt information about the prediction rule is mathematically proven in Theorem 1.
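A minimal Python rendition of Algorithm 3 (our names, illustrative only, as the algorithm itself warns) makes the permutation invariance explicit: the fitted rule depends only on the marginal counts N_X and N_Y, which permuting the features relative to the labels leaves unchanged:

```python
import random

def fit_count_association(D):
    """Count association classifier (stylized counterexample).
    D is a list of (x, y) pairs with x, y in {-1, +1}."""
    N = len(D)
    NX = sum(1 for x, _ in D if x == 1)
    NY = sum(1 for _, y in D if y == 1)
    # decide between f: x -> x and f: x -> -x from marginal counts only
    if abs(NX - NY) < abs(N - NX - NY):
        return lambda x: x
    return lambda x: -x

random.seed(0)
# (X, Y) = (1, 1) w.p. 2/3, (-1, -1) w.p. 1/3
data = [(1, 1) if random.random() < 2/3 else (-1, -1) for _ in range(200)]
f_plain = fit_count_association(data)

# "permutation baseline": shuffle the features relative to the labels
xs = [x for x, _ in data]
ys = [y for _, y in data]
random.shuffle(xs)
f_perm = fit_count_association(list(zip(xs, ys)))
# both recover the true rule y = x, since the counts NX, NY are unchanged
```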
3.5. Classical baseline: residual density estimation

The identification of classical supervised regression learning with certain homoscedastic probabilistic predictions, as described in Section 2.6, has direct algorithmic consequences: it allows classical supervised learning algorithms to be applied as possible solutions in the probabilistic setting. These classical algorithms, made probabilistic, yield an important class of baselines in the probabilistic setting, since comparison to them certifies whether and, if yes, to which extent the use of genuinely "probabilistic" models is necessary in addition.

The probabilistic form of classical prediction strategies is directly implied by the discussion in Section 2.6, which shows that there is a one-to-one equivalence between a [choice of a probabilistic predictor predicting a location-variable density] and a [choice of classical point-predictor and a choice of classical loss function]. For example, the probabilistic predictors always yielding a variance-1-Gaussian with variable mean are in one-to-one equivalence to the point-predictors evaluated by the mean squared error.
15 Though the use of a spuriously good baseline is not too damaging in practice, as it causes type II rather than type I errors: methods compared with a too-good baseline will be found less often to outperform the spuriously overperforming permutation baseline than they would be found to outperform a fair baseline.
The probabilistic predictors always yielding a variance-1-Laplace distribution with variable mean are in one-to-one equivalence to the point-predictors evaluated by the mean absolute error. Note that in this correspondence, multiplying the classical loss function by a constant is equivalent to scaling the variance of the location-variable density, hence the scaling parameter needs to be optimally determined to yield a fair baseline.

A general transformation of a classical prediction strategy to a probabilistic one may be carried out via the following meta-algorithm, which converts a density estimator and a classical prediction method into a probabilistic predictor by applying the density estimator to the residuals of the classical method, as outlined in stylized pseudo-code as Algorithm 4.
Algorithm 4 Meta-algorithm: obtaining a point prediction type baseline probabilistic regression strategy from a point prediction regression strategy and a distribution estimation strategy.
Input: Training data D = {(X_1, Y_1), ..., (X_N, Y_N)}, a classical prediction strategy g t.v.i. [X → Y], a probabilistic (for baseline: uninformed) predictor h t.v.i. [X → Distr(Y)].
Output: a probabilistic predictor f t.v.i. [X → Distr(Y)].
1: Train g on D.
2: Obtain residuals ρ_i := Y_i − g(X_i) for i = 1, ..., N.
3: Train h on the sample (X_1, ρ_1), ..., (X_N, ρ_N).
4: Return f := x ↦ [y ↦ [h(x)](y − g(x))].

The idea in Algorithm 4 can of course also be applied directly to boost a classical prediction strategy with any (not necessarily uninformed) probabilistic one, though the resulting prediction strategy is only a point prediction type baseline if the probabilistic prediction strategy h is a density estimation strategy, i.e., uninformed. Note that in that case, by the discussion in Section 3.3, the density estimation strategy may look at the full training set, including the training features X_i. If the classical loss is fixed in addition to the classical algorithm, parametric density estimators can be used to determine the scaling constant equivalent to the variance. For example, for the probabilistic version of "classical predictor with mean squared error", the probabilistic prediction strategy h should predict a
Gaussian with mean zero and with a variance estimated as the variance of the sample of residuals ρ_i.

We note that the above is reminiscent of some ideas suggested previously by Hansen [57], for example as part of the "two-step method" for conditional density estimation, though the full correspondence presented above appears to be novel. We also note that, interestingly, the correspondence allows one to indirectly "compare" different choices of classical loss functions via equivalence to the shape of the location-variable density, which is immediate in the probabilistic setting but quite counterintuitive in the classical one.

The above discussion may be modified slightly to yield a classical baseline for classification, by an idea similar to Algorithm 4, and recalling from Section 3.3 that the best uninformed baseline (hence the "density estimator") is predicting as probabilities the observed frequencies. The classification version simply predicts probabilities of observed frequencies, conditioned on the point prediction of the classical algorithm.
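For concreteness, a stylized Python sketch of Algorithm 4, with the hypothetical choices g = univariate least squares and h = zero-mean Gaussian density estimate on the residuals (our names and choices, not the skpro API):

```python
import math

def fit_residual_density_baseline(X, Y):
    """Algorithm 4, stylized: classical point predictor g (simple
    least-squares line) plus Gaussian density estimate h of residuals.
    Returns f mapping x to a predictive pdf y -> [h](y - g(x))."""
    n = len(X)
    # step 1: train the classical strategy g (univariate least squares)
    mx, my = sum(X) / n, sum(Y) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(X, Y)) \
        / sum((x - mx) ** 2 for x in X)
    g = lambda x: my + beta * (x - mx)
    # step 2: residuals rho_i = Y_i - g(X_i)
    rho = [y - g(x) for x, y in zip(X, Y)]
    # step 3: uninformed density estimate h = zero-mean Gaussian
    sigma2 = sum(r * r for r in rho) / n
    def pdf(z):
        return math.exp(-z * z / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    # step 4: shift the residual density by the point prediction
    return lambda x: (lambda y: pdf(y - g(x)))

X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [0.1, 1.1, 1.9, 3.2, 3.9]
f = fit_residual_density_baseline(X, Y)
density_at_truth = f(2.0)(1.9)  # predictive density near the observed value
```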
3.6. Bayes type information criteria and empirical risk asymptotics for the log-loss

In a probabilistic context, the Bayesian information criterion (BIC) is probably one of the most used model selection criteria. To training data D = {(X_1, Y_1), ..., (X_N, Y_N)} t.v.i. X × Y, a (usually parametric) model f : X → Distr(Y) is fitted. The Bayesian information quantity for the model f is then, formulated in our setting,

B_f := (1/N) · Σ_{i=1}^{N} L_ℓ(f(X_i), Y_i) + (k/2) · log N,

where k is the "number of parameters" in the model f. On the fixed batch of data D, among multiple possible models of type f, the one with smallest value B_f is preferred; in the criterion, the term (k/2) · log N
acts as a penalizer for "model complexity". Three important things should be noted about the Bayesian information criterion: First, there is no independent "test data" as in classical supervised learning or in our proposed approach to model selection. Second, the first term is nothing else than the training loss, or in-sample loss, or "empirical risk", as given in our proposed approach, i.e.,
B_f = ε̂(f | D) + (k/2) · log N.

Note that since D is the training data from which f was obtained as well (see the first note), D and f will not be independent, as would be the case for an independent test data set; hence the discussion in Section 3.1 will not apply (directly). Third - and this is maybe also a misunderstanding about the BIC quite frequent among practitioners of statistics - the criterion itself is only valid (or asymptotically valid) for models of a certain class, usually similar to linear or polynomial regression models, where k will be the number of coefficients. While for such models the folklore results hold, e.g., that BIC is an asymptotic estimate of the generalization error, they break down completely for models as simple as Gaussian mixture models, an example we work out for illustrative purposes: Our class of possible output distributions, in this example, consists of the parametric class
p(z | µ, σ) = (1/K) · Σ_{i=1}^{K} (1/√(2πσ_i²)) · exp( −(z − µ_i)² / (2σ_i²) ),

with univariate argument z ∈ R, and the "intuitive" number of free parameters is k = 2K (namely, K from the µ-s and K from the σ-s). Under these parametric assumptions, we can write f : x ↦ p(· | m(x), s(x)) with suitable m, s : X → R^K.

It is now not too difficult to see that the penalty term (k/2) · log N is fixed for any such f, while each summand of the empirical risk can become arbitrarily large in the negative by setting µ_i = X_j for some i, j and letting σ_i → 0. As this is true for any term while all terms are finite, B_f is not bounded below in terms of the choice of m and s. Note that there is no problem with making that choice algorithmically, as the training data have actually been seen, and all information to construct the predictions with unbounded (BIC) loss is readily available. Also, clearly (due to arbitrarily large discrepancy to any fixed "correct" prediction), such a prediction is nonsensical and can be shown to be arbitrarily bad in generalization.

The intuitive meaning of this example is, simply put, the phenomenon of overfitting in the probabilistic setting - it is not prevented by the BIC unless the model class is restricted, usually quite heavily. This is no different from the situation in classical supervised learning - empirical risk type regularizers (such as the analogue of BIC, training error plus the same regularisation term as in BIC) work well as long as one is within the remit of highly parametric regression models such as linear or polynomial regression, while this strategy completely breaks down for even slightly more advanced methods such as kernel regression, tree models or tree ensembles, or neural networks.

The same argument carries through, mutatis mutandis, for other popular model selection criteria, for which only the regularisation term changes. For Akaike's information criterion (AIC), the regularisation term is k instead of (k/2) · log N, which is likewise constant throughout the counterexample.
For the different versions of the WAIC, note that the log-loss is computed in-sample, with a model class dependent regularization term that may be upper bounded in the pathological example. Hence, in summary, and in the absence of another reliable estimate of the probabilistic generalization error in a model-independent setting, we hold that the out-of-sample estimate discussed in Section 3.1 is preferable, already on the basis of its elementary yet appealing (and, above all, proven) convergence guarantees. Further discussion of the relation of the information criteria to predictive model evaluation may be found in [43]; as stated above and demonstrated by the (counter-)example, assumptions on the model class need to be made for such relations to hold.
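The unboundedness in the Gaussian mixture example above is easy to reproduce numerically. The sketch below (our code, illustrative only) centres one mixture component on a training point and shrinks its width, driving the empirical log-loss term of B_f towards −∞ while the complexity penalty stays fixed:

```python
import math

def mixture_neg_log_lik(data, mus, sigmas):
    """Mean log-loss (negative log-likelihood / N) of an equal-weight
    univariate Gaussian mixture with means mus and std devs sigmas."""
    K = len(mus)
    total = 0.0
    for z in data:
        p = sum(
            math.exp(-(z - m) ** 2 / (2 * s * s))
            / (K * math.sqrt(2 * math.pi) * s)
            for m, s in zip(mus, sigmas)
        )
        total += -math.log(p)
    return total / len(data)

data = [0.3, 1.2, 2.7, 3.1]  # any training sample will do
# centre one component on the training point data[0], shrink its width;
# a second wide component keeps all other likelihoods positive
losses = [
    mixture_neg_log_lik(data, mus=[data[0], 1.5], sigmas=[s, 1.0])
    for s in (1.0, 1e-2, 1e-4, 1e-8)
]
# the empirical term decreases without bound as sigma -> 0, while the
# penalty (k/2) log N is constant, so the BIC is not bounded below
```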
3.7. No counterexample when out-of-sample

Even though BIC and AIC break down with overfitting in a model-agnostic setting, that does not necessarily imply that a proposed alternative, such as the one presented in our framework, is free of shortcomings. This means that even though the generalization error is reliably estimated, the selected model could, in principle, still be nonsensical in any finite sample setting, without any proof to the contrary. However, we briefly show why the proposed out-of-sample validation is not misled in a non-parametric set-up, differently from the information criteria outlined above. This is due to a universal lower bound on the expected generalization loss, which depends only on the law of (X, Y).

Suppose we are in the example of Section 3.6, or in any other situation where we would like to predict probabilistically. Suppose that from training data, a prediction strategy f which is a [X → Distr(Y)]-valued random variable (through dependency on the training data) has been learnt. By the bias-variance decomposition which we will prove later in Proposition 5.8 and the positivity constraints therein, for the expected generalization log-loss ε(f) = E[L_ℓ(f(X), Y)], it holds that ε(f) ≥ H(Y | X), where H(Y | X) is the conditional entropy of Y w.r.t. X. This is known to be a finite number for given X, Y (assuming square integrable densities). Since H(Y | X) does not depend on f, any expected generalization loss will be bounded below by it; for Y | X following "usual distributions", as well as any that are compactly supported, the variance of ε(f) is also finite.

Hence the pathology of unbounded expected generalization loss which may occur in-sample, as for the BIC in Section 3.6, will not occur out-of-sample. This may seem slightly paradoxical, since there are, similarly as in the in-sample case, predicted distributions with unbounded out-of-sample likelihood.
However, these are unlikely to occur, and contribute only finitely in expectation, when the distributions are obtained from a prediction strategy, as soon as said prediction strategy is statistically independent of the data it is tested on - which is arguably a very weak requirement, in comparison to the pathology it removes. Note that taking full expectations (i.e., over the training data) will not remove the unboundedness in the in-sample case either, since there f is again dependent on any training data point; hence the pathologically overfitting example in Section 3.6 may be constructed around any training data point, and the pathology appears in expectation as well as in realization.

Thus the only way to remove the model selection pathology from log-loss type in-sample measures, such as BIC or AIC, is to restrict the choice of prediction strategies (by the discussion in Section 3.6, effectively to variants of linear regression) - whereas the above reasoning shows that there is neither a pathology16 nor an artificial restriction on allowed prediction strategies in the proposed probabilistic supervised learning framework, which in many respects constitutes an extension of the state of the art in validation of supervised learning strategies.
16 Strictly speaking, it shows that there are no pathologies of the same type as those inherent to BIC/AIC/WAIC in a black-box setting, whereas the theorems in this manuscript show desirable properties, hence the absence of those pathologies which would be their negation. One feature of the proposed set-up which may be unusual is that empirical loss estimates on the same dataset may assume both negative and positive values, as H(Y | X) may be either, and is also unknown. However, this is, strictly speaking, not a pathology, as methods are always compared to each other rather than to a putative optimum, which is unknown, and practically never "zero" in any other classical setting either.
4. Mixed outcome distributions
Well-definedness and validity of the model assessment workflow outlined in Section 3, together with the guarantees given therein, is less clear if mixed distributions are allowed as predictions or as true conditionals Y | X. For example, the discussion in Section 3.7, which asserts lower boundedness of the expected generalization error, is no longer correct if one allows mixed "true" distributions Y | X, even if predictions are restricted to be discrete or continuous. Namely, in the case where conditionals Y | X = x can be mixed, a prediction functional f may put an arbitrarily high continuous density on a discrete point seen in the training set which can re-occur in the test set with positive probability - in this scenario, the out-of-sample loss is not lower bounded, as a yet higher density on the already seen data point will lead to a yet lower expected generalization loss. This phenomenon is crucially due to re-observations of points in the training set, which may happen with positive probability; this cannot happen for absolutely continuous distributions, where the probability of re-observing any data point is zero.

Even worse, the logarithmic and squared integrated losses are ill-defined for prediction strategies which are also able to predict mixed predictive distributions - such as Bayesian sampling from the predictive posterior, in which predictions take the form of said samples, which are sums of delta distributions through considering the posterior predictive sample as an empirical distribution.
To address the issues with mixed distributions, we summarize strategies to cope with true and predicted distributions which may be mixed on a domain Y ⊆ R^n. Unlike in the continuous and discrete case, a good solution to the mixed case appears to be an open question - more precisely, it is unclear how to define an algorithmically simple17, computationally tractable, strictly convex and strictly proper loss for the mixed case. We showcase multiple possible approaches, together with algorithmic considerations and an overview of their advantages and disadvantages.
4.1. Integrating a discrete loss applied to binary cut-offs
The most frequently found strategy in the literature applies to the case Y ⊆ R and obtains a proper loss by integrating up, over a cut-off x, a proper binary loss for predicting correctly whether the target value is smaller than x. More formally:
Definition 4.1. Let L : Distr(B) × B → R be a convex loss for boolean targets, i.e., for B = {yes, no}. Let Z be an R-valued random variable. We define a loss for mixed distributions (parameterized through their cdf) in P ⊆ Distr(R) as follows:

L^Z : P × R → R, (F, y) ↦ E[L(F_{≤Z}, y_{≤Z})],

where by F_{≤τ} we denote the pmf such that F_{≤τ}(yes) = F(τ) and F_{≤τ}(no) = 1 − F(τ); and where by y_{≤τ} we denote the corresponding boolean variable, i.e., "yes" if y ≤ τ and "no" if y > τ.

Remark 4.2. An alternative, mathematically more verbose but equivalent definition is the one usually found in the literature (see [46]):

L : (F, y) ↦ ∫_{−∞}^{y} w(τ) · L(F(τ), 0) dτ + ∫_{y}^{∞} w(τ) · L(F(τ), 1) dτ,

which is due to an alternative notation where L ∈ [R × {0, 1} → R], i.e., L takes a number (not a distribution) in the first argument, and a number (not "yes" or "no") in the second. The "continuous rank probability score" (CRPS) is obtained for the boolean loss function L : (p, y) ↦ (1 − p(y))² in our notation, which is L : (p_1, y) ↦ y(1 − p_1)² + (1 − y)p_1² in the more verbose alternative notation usually found in the literature.
17 as in: no integration/approximation necessary, optimally evaluable by an analytic formula
We would also like to note that the weight density w is sometimes not specified, or set to w = 1; in general, however, w is required to be a probability density for convergence of the integral/expectation. Lower boundedness of L together with w being a probability density implies lower boundedness of L^Z.
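For the CRPS specifically, the cut-off integral converges even with constant weight w = 1, and for a Gaussian predicted cdf a closed form is known in the literature. The following sketch (our code, not from the paper) checks a trapezoid-rule evaluation of the integral of Remark 4.2 against that closed form:

```python
import math

def Phi(x):  # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):  # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def crps_numeric(mu, sigma, y, lo=-12.0, hi=12.0, n=48000):
    """CRPS via the cut-off integral with w = 1:
    integral of (F(tau) - 1{y <= tau})^2 d tau, trapezoid rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        tau = lo + i * h
        F = Phi((tau - mu) / sigma)
        ind = 1.0 if y <= tau else 0.0
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * (F - ind) ** 2
    return total * h

def crps_closed_form(mu, sigma, y):
    """Closed form of the CRPS for a Gaussian predictive distribution."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * Phi(z) - 1.0) + 2.0 * phi(z)
                    - 1.0 / math.sqrt(math.pi))

a = crps_numeric(0.0, 1.0, 0.5)
b = crps_closed_form(0.0, 1.0, 0.5)
# a and b agree to several decimal places
```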
Proposition 4.3. Assume the notation of Definition 4.1. Assume that Z is absolutely continuous with everywhere positive (in particular: non-zero) probability density function w : R → R⁺. Then the following hold:
(i) If L is not constant, then L^Z (restricted to absolutely continuous predictions) is strictly global.

(ii) If L is proper, then so is L^Z.

(iii) L is strictly proper if and only if L^Z is.
Proof. (i) By definition of strict globality, it suffices to specify two cdf F, G and y ∈ R such that F(y) = G(y) but L^Z(F, y) ≠ L^Z(G, y). This is for example achieved by both F, G putting mass 1/2 on 0, and mass 1/2 on ±1, with a different sign for F, G.

(ii) Consider now a real random variable Y with cdf F_Y. Properness of L implies E[L(p_B, B)] ≤ E[L(p, B)] for any boolean/binary random variable B with pmf p_B, and arbitrary pmf p. That implies
E[L(F_{Y,≤z}, Y_{≤z})] ≤ E[L(F_{≤z}, Y_{≤z})]

for any real-valued cdf F and any z ∈ R. This (used as the inequality below) implies

L^Z(F_Y, Y) = E[L(F_{Y,≤Z}, Y_{≤Z})]
= E_Z E_{Y|Z}[L(F_{Y,≤Z}, Y_{≤Z}) | Z]
≤ E_Z E_{Y|Z}[L(F_{≤Z}, Y_{≤Z}) | Z]
= E[L(F_{≤Z}, Y_{≤Z})]
= L^Z(F, Y),

which implies the claim since F was arbitrary.

(iii) We make two preparatory notes. First, (iii.1), by the proof in part (ii), we have that
L^Z(F, Y) − L^Z(F_Y, Y) = E_Z[∆(F, F_Y, Z)] ≥ 0

for any cdf F and Y as in (ii), where ∆(F, F_Y, z) := E_Y[L(F_{≤z}, Y_{≤z}) − L(F_{Y,≤z}, Y_{≤z})]. It also holds that ∆(F, F_Y, z) ≥ 0.

Second, (iii.2), if F and F_Y are unequal, there is a set U ⊆ R of positive Lebesgue measure such that F(u) ≠ F_Y(u) for all u ∈ U. The set U also has positive Z-measure, since Z was assumed absolutely continuous with everywhere positive density.

We now carry out the proof of (iii); two directions need to be shown.

L s.p. ⇒ L^Z s.p.: First assume L is strictly proper, and F, F_Y are distinct. By strict properness of L, it holds that ∆(F, F_Y, z) > 0 whenever F(z) ≠ F_Y(z). By (iii.2), it holds that E[∆(F, F_Y, Z) | Z ∈ U] > 0, thus E[∆(F, F_Y, Z)] > 0 by (iii.1). Since the LHS of the latest expression is equal to L^Z(F, Y) − L^Z(F_Y, Y) by (iii.1), and F was arbitrary, this proves strict properness of L^Z.

L^Z s.p. ⇒ L s.p.: We prove this by contraposition. Assume L is not strictly proper. Then there is a binary distribution B with pmf p_B such that E[L(p_B, B)] = E[L(p, B)] while p ≠ p_B. Let B′ be the random variable corresponding to p. An elementary computation shows that E[L^Z(F_B, Y_B)] = E[L^Z(F_{B′}, Y_B)], where Y_B, Y_{B′} and F_B, F_{B′} are the real-valued random variables and their cdf obtained from identifying the values {yes, no} with {1, 0} (note that it is not a typo that Y_{B′} does not occur in the equation). Since p_B, p were distinct, F_B and F_{B′} are as well; thus L^Z is not strictly proper.
The expectation in Definition 4.1, or the integral in Remark 4.2, cannot be evaluated analytically in general, but may be approximated by numerical techniques, such as trapezoid rule based or Monte Carlo based ones. An interesting choice is taking Z to be equal to an independent copy of Y, and using the data samples as approximation samples. In the case where all predictions are samples, i.e., sums of delta distributions, the integrals may be turned into sums if integrals over w = p_Z may be easily computed.

Summarizing the above discussion: Advantages of the integration approach are the ability to generalize any loss function for binary distributions to the mixed distribution case. Disadvantages are the restriction to the univariate case, the need to choose Z, and the lack of an analytical evaluation procedure for the integrated loss.
4.2. Convolution of the target with a kernel

Another possible strategy is based on the observation that convolution with an absolutely continuous kernel renders mixed distributions continuous, for any Y ⊆ R^n. Comparing a convolved predicted distribution with a suitably noisy random variable thus removes the issue of having to deal with mixed random variables altogether.
Definition 4.4. Let p ∈ Distr(Y), and let q be an absolutely continuous distribution on Y. We denote by p ∗ q the convolution of p with q, i.e.,

(p ∗ q)(y) = ∫_Y p(τ) q(y − τ) dτ.

We denote by ∗q the operator of convolution with q, i.e.,

∗q : Distr(Y) → Distr(Y) : p ↦ p ∗ q.

For Q ⊆ Distr(Y), we call the operator ∗q injective if the natural restriction ∗q : Q → Distr(Y) is injective.

Definition 4.5. Let L : Q × Y → R be a convex loss for absolutely continuous predictions in Q ⊆ Distr(Y). Let Z be an absolutely continuous, Y-valued random variable with pdf p_Z. We define a loss for mixed distributions in P ⊆ Distr(Y) as follows:

L∗Z : P × Y → R, (p, y) ↦ E[L(p ∗ p_Z, y + Z)].

We call L∗Z a convolution loss, the Z-convolved loss L (e.g., the standard normal convolved log-loss).

We make two remarks: L∗Z is well-defined, since convolutions of mixed with absolutely continuous distributions are absolutely continuous. Also, note that Z may be chosen by the experimenter; in particular, an arbitrary number of values may be sampled from it. Samples from Z do not appear explicitly in an evaluation of L∗Z, in which Z figures only as an auxiliary in the construction.

The convolution process defines (strictly) proper losses, in case the absolutely continuous loss L was:
Proposition 4.6. Assume the notation of Definition 4.5. The following hold:
(i) If L is proper, then so is L∗Z.

(ii) Assume L is strictly proper. Then, L∗Z is strictly proper if and only if the convolution operator ∗p_Z is injective on mixed distributions.
Proof. (i) Let Y be a mixed random variable t.v.i. Y with distribution p_Y, and let p be any mixed distribution. By the definitions,
E[(L∗Z)(p, Y)] = E[L(p ∗ p_Z, Y + Z)], and E[(L∗Z)(p_Y, Y)] = E[L(p_Y ∗ p_Z, Y + Z)],
where Z is independent of Y, and total expectations are taken over both on the RHS. Since addition of independent random variables corresponds to convolution of densities, the density of Y + Z is equal to p_Y ∗ p_Z. Thus, by properness of L, it holds that
E[L(p ∗ p_Z, Y + Z)] ≥ E[L(p_Y ∗ p_Z, Y + Z)].

Putting together all equalities and inequalities yields E[(L∗Z)(p, Y)] ≥ E[(L∗Z)(p_Y, Y)], which yields the claim since p was arbitrary.

(ii) There are two directions to prove, both under the assumption that L is strictly proper.

∗p_Z injective ⇒ L∗Z strictly proper: Let p be a distribution such that E[(L∗Z)(p, Y)] = E[(L∗Z)(p_Y, Y)]. It suffices to show that p = p_Y, since p was arbitrary. By definition, the equality implies
E[L(p ∗ p_Z, Y + Z)] = E[(L∗Z)(p, Y)] = E[(L∗Z)(p_Y, Y)] = E[L(p_Y ∗ p_Z, Y + Z)].

Since the distribution of Y + Z is p_Y ∗ p_Z, strict properness of L implies that p_Y ∗ p_Z = p ∗ p_Z. Injectivity of ∗p_Z implies p_Y = p, which was the statement to prove.

L∗Z strictly proper ⇒ ∗p_Z injective: We show this by contraposition. Suppose ∗p_Z is not injective, i.e., there are p, q such that p ≠ q and p ∗ p_Z = q ∗ p_Z. Let Y be a random variable with distribution p. Then, E[L(p ∗ p_Z, Y + Z)] = E[L(q ∗ p_Z, Y + Z)], since p ∗ p_Z = q ∗ p_Z. But by definition, the LHS equals E[(L∗Z)(p, Y)], and the RHS equals E[(L∗Z)(q, Y)]. Since p is the distribution of Y and p ≠ q, this implies that L∗Z is not strictly proper, which was the statement to prove.

We make two remarks about Proposition 4.6: First, a natural question is whether examples of distributions whose convolution operator is injective exist. Indeed they do; examples are distributions which are at the same time characteristic, as a kernel functional, in the sense of [111]. For example, convolution with a Gaussian Z (one of the running examples in [111]) is injective, as it corresponds to the heat diffusion kernel, which is universal18. Second, one may ask whether L∗Z is not strictly proper if L is not. This is not straightforward to answer, since a non-injective ∗p_Z may cancel out the non-strictness in L, which in turn seems to be the simplest characterization of this situation: the locus of non-injectivity and the locus of non-properness need to agree, otherwise the arguments of Proposition 4.6 (i) or (ii) may be applied to the difference set.

Regarding algorithmic tractability, the only general way to evaluate E[L(p ∗ p_Z, y + Z)] appears to be numerical integration, for example of Monte Carlo type; such methods also come with their own uncertainty bounds (e.g., central limit theorem based), such as the adaptor algorithm presented later in Section 6.2.4 and natural extensions.
However, for specific L, specific Z, or specific forms of predicted distributions (such as empirical samples), less demanding or even analytic solutions may exist; the best practical approach does not seem clear.
Summarizing: Advantages of the convolution approach are the applicability to general (multi-variate real) Y and the ability to generalize any loss function for continuous distributions to the mixed distribution case. Disadvantages are the need to choose Z, and the lack of an analytical evaluation procedure for the convolved loss.
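For concreteness, the Monte Carlo evaluation of the convolved loss mentioned above can be sketched as follows, here for the log-loss, a Gaussian convolution Z, and a prediction given as an empirical sample; all names are illustrative, not skpro API, and the convolution of an empirical sample with a Gaussian is just a Gaussian mixture centred at the sample points:

```python
import numpy as np

def gaussian_pdf(t, mu, sigma):
    """Density of N(mu, sigma^2), evaluated elementwise."""
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def convolved_log_loss(sample, y, sigma=0.2, n_mc=5_000, seed=None):
    """Monte Carlo estimate of E_Z[-log (p * p_Z)(y + Z)], where p is the
    empirical distribution of `sample` and Z ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    t = y + rng.normal(0.0, sigma, size=n_mc)          # draws of y + Z
    # mixture density (p * p_Z)(t) = mean_i N(t; sample_i, sigma^2)
    dens = gaussian_pdf(t[:, None], sample[None, :], sigma).mean(axis=1)
    return -np.log(dens).mean()

# usage: a sample prediction drawn from N(0, 1), observation y = 0
rng = np.random.default_rng(0)
loss = convolved_log_loss(rng.normal(size=500), y=0.0, sigma=0.2, seed=1)
```

The Monte Carlo estimate itself carries sampling uncertainty, which central limit theorem arguments can quantify, as noted above.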
4.3. Kernel discrepancy losses
Another option in the mixed distribution case is given by so-called kernel discrepancy losses, first considered by Eaton [30].
18 A more elementary way to show injectivity is to use that convolution in the original domain is the same as multiplication in the Fourier domain. As the Fourier transform of a Gaussian is a Gaussian, which is strictly positive, division by its Fourier transform is well-defined; hence the inverse is also well-defined, as it may explicitly be obtained as [· ∗ pZ]^{−1}(f) = F^{−1}(F(f)/F(pZ)).
Definition 4.7. Let k : Y × Y → R be a symmetric positive definite kernel function^19. We define a loss for mixed distributions in P ⊆ Distr(Y) as follows:
Lk : P × Y → R, (p, y) ↦ −2 k(p, y) + k(p, p),
where, for P-valued arguments, we define
k(p, y) := ∫_Y p(z) k(z, y) dz, and k(p, p) := ∫_{Y^2} p(z) k(z, z′) p(z′) dz dz′.
We call Lk the kernel discrepancy loss (associated with the kernel k). Kernel discrepancy losses and convolved squared integrated losses are closely related. While in general neither is a special case of the other, there is a direct link between the two:
Lemma 4.8. Consider the squared integrated loss L_Gn : P × Y → R, (p, y) ↦ −2 p(y) + ‖p‖_2^2 (see Definition 2.3). Let Z be a continuous Y-valued random variable with density function pZ : Y → R^+. Define kZ : Y × Y → R, (y, y′) ↦ ∫_Y pZ(τ − y) pZ(τ − y′) dτ. Then, L_Gn ∗ Z = L_{kZ}.
Lemma 4.8 is interesting in two cases: in the case where Z ∼ N(0, σ^2) is Gaussian, so is kZ (also known as the squared exponential radial basis function kernel, or the heat kernel). In the limiting case σ → 0 of the convolution loss, or alternatively when taking the delta kernel k(y, y′) = δ(y − y′), with δ being the delta distribution, in the kernel discrepancy loss, the squared integrated loss L_Gn is recovered. However, it should be noted that the limiting case of the squared integrated loss is, strictly speaking, not^20 a special case of the kernel discrepancy loss in Definition 4.7, as the latter is a mixed distribution loss, whereas the former is a loss for continuous distributions, obtained as a limiting case but not as a special case (which would imply the squared integrated loss being valid for mixed distributions).
In the literature, it is also sometimes incorrectly asserted^21 that kernel discrepancy losses are strictly proper. This is untrue in general, for example for degenerate choices of k such as constant kernels, e.g., k : (y, y′) ↦ 42 (which yields a loss that is proper but not strictly proper). To characterize strict properness of kernel discrepancy losses, we introduce some results and concepts well-known in the kernel learning community. A foundational statement is the Moore-Aronszajn theorem:
Proposition 4.9. Let k : Y × Y → R be a symmetric positive definite kernel function. There is a map φ : Y → H, where H is a Hilbert space with inner product ⟨·, ·⟩_H : H × H → R, such that k(y, y′) = ⟨φ(y), φ(y′)⟩_H for all y, y′ ∈ Y. The map φ and the Hilbert space H are unique up to isomorphism; one possible choice is
φ : y ↦ k(y, ·) and ⟨k(y, ·), k(y′, ·)⟩_H = k(y, y′),
where, for y ∈ Y, k(y, ·) is short-hand notation for the element [z ↦ k(y, z)] of H.
Proof. This is asserted as statement 2.4 of [4], which is usually given as a reference as it credits Moore; however, the proof itself is only cited there and carried out in earlier work of Aronszajn [3]. Another, more recent reference is the book of Schölkopf and Smola [104].
19 Symmetric means: k(y, y′) = k(y′, y) for all y, y′ ∈ Y. One possible definition of positive definiteness is: all possible symmetric kernel matrices formed with k, i.e., matrices of the form K ∈ R^{N×N} where Kij = k(yi, yj) for y1, ..., yN ∈ Y, are positive semi-definite. Note the discrepancy in nomenclature between a positive definite kernel function and a positive semi-definite kernel matrix, which is standard and uniform across the literature.
20 unlike, for example, Dawid [23] states in the respective Section 6.3
21 for example, in Section 6.3 of [23]; interestingly, Gneiting and Raftery [46] do not make the false statement in the parallel Section 5.1
Definition 4.10. Keep the notation of Proposition 4.9. The asserted property of H is called the reproducing property. The Hilbert space H is called the RKHS (reproducing kernel Hilbert space) of k, and the map φ is called the feature map of k.
The correct notion corresponding to strict properness is characteristicness of the kernel; we follow Fukumizu et al. [39] and Sriperumbudur et al. [109] for the definition:
Definition 4.11. A symmetric positive definite kernel function k : Y × Y → R with RKHS H and feature map φ is called characteristic for P ⊆ Distr(Y) if the so-called mean embedding map
µk : P → H, p ↦ E_{Z∼p}[φ(Z)] = E_{Z∼p}[k(Z, ·)]
is injective.
We are now ready to state the properties of the kernel discrepancy loss.
Proposition 4.12. Assume the setting and notation of Definition 4.7. That is, consider a symmetric positive definite kernel function k : Y × Y → R and the associated kernel discrepancy loss Lk : P × Y → R with P ⊆ Distr(Y). Then, it always holds that Lk is a proper loss. Furthermore, the following are equivalent:
(i) Lk is strictly proper.
(ii) k is characteristic for P.
Proof. Let H be the RKHS and φ the feature map of k. Let p, pY ∈ P be any two distributions, and let Z, Z′ ∼ p and Y, Y′ ∼ pY be independent random variables. By definition of Lk (Definition 4.7) and µk (Definition 4.11), one has
E[Lk(p, Y)] = E[k(Z, Z′) − 2 k(Y, Z)]
= E[⟨φ(Z), φ(Z′)⟩_H − 2 ⟨φ(Y), φ(Z)⟩_H]
= ⟨E[φ(Z)], E[φ(Z′)]⟩_H − 2 ⟨E[φ(Y)], E[φ(Z)]⟩_H
= ⟨µk(p), µk(p)⟩_H − 2 ⟨µk(pY), µk(p)⟩_H
= ⟨µk(p), µk(p)⟩_H − 2 ⟨µk(pY), µk(p)⟩_H + ⟨µk(pY), µk(pY)⟩_H − ⟨µk(pY), µk(pY)⟩_H
= ⟨µk(p) − µk(pY), µk(p) − µk(pY)⟩_H − ⟨µk(pY), µk(pY)⟩_H
= ‖µk(p) − µk(pY)‖_H^2 − ‖µk(pY)‖_H^2,
where the first equality is the definition of Lk; the second uses the reproducing property in Proposition 4.9; the third uses independence of Y, Y′, Z, Z′ and linearity of E and ⟨·, ·⟩_H; and the rest are algebraic manipulations, where ‖·‖_H : H → R^+ is the canonical norm on H. Since the second term −‖µk(pY)‖_H^2 does not depend on p, the expected loss E[Lk(p, Y)] is minimized iff ‖µk(p) − µk(pY)‖_H^2 = 0, which happens iff µk(p) = µk(pY) since ‖·‖_H is a norm. In particular, E[Lk(p, Y)] is minimized if p = pY, which proves properness of Lk.
We now prove the equivalence.
(i) ⇒ (ii): as Lk is strictly proper, E[Lk(p, Y)] is minimized iff p = pY. By the above, p = pY happens iff µk(p) = µk(pY). Since both p and pY were arbitrarily chosen in P, which is the domain of µk, this is equivalent to stating that µk is injective, thus asserting that k is characteristic for P.
(ii) ⇒ (i): as k is characteristic, µk(p) = µk(pY) iff p = pY. Thus, by the above, E[Lk(p, Y)] is minimized iff p = pY. This is equivalent to stating that Lk is strictly proper, since both p and pY were arbitrarily chosen in P.
In case there is an identification of a convolution loss with a kernel discrepancy loss via Lemma 4.8, say in the case of a Gaussian convolution and the Gaussian kernel, Propositions 4.6 and 4.12 together relate injectivity of the convolution to characteristicness of the kernel.
Importantly, note that a number of popular choices of kernel functions, such as the Euclidean and polynomial kernels, have non-injective mean embeddings, hence are not characteristic, and thus do not give rise to a strictly proper kernel discrepancy loss. On the other hand, possible choices^22 for a strictly proper kernel discrepancy loss are the Gaussian kernel k(y, y′) = exp(−‖y − y′‖^2 / (2σ^2)) and the Laplace kernel k(y, y′) = exp(−λ ‖y − y′‖). While these (and others) give rise to strictly proper losses, note that a kernel must be chosen, including a kernel parameter, for which there does not seem to be a principled way of choosing.
More generally: Advantages of the kernel loss are applicability to general (multi-variate real) Y, and a simple analytic form for predictions which are discrete/pure (e.g., empirical samples or posterior samples).
Disadvantages are the need to choose a characteristic kernel, and the lack of an analytical evaluation procedure for the case of arbitrary mixed distributions.
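To illustrate the simple analytic form for empirical samples, a minimal sketch of the kernel discrepancy loss with a Gaussian kernel follows; the helper names are hypothetical, and for an empirical sample p both integrals in Definition 4.7 reduce to finite sums:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2)), all pairs."""
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / sigma ** 2)

def kernel_discrepancy_loss(sample, y, sigma=1.0):
    """L_k(p, y) = -2 k(p, y) + k(p, p) for the empirical distribution p of
    `sample`; for empirical samples both integrals reduce to finite sums."""
    k_py = rbf(sample, np.array([y]), sigma).mean()   # k(p, y)
    k_pp = rbf(sample, sample, sigma).mean()          # k(p, p)
    return -2.0 * k_py + k_pp

# a sample near the observation scores lower (better) than a distant one
rng = np.random.default_rng(0)
close = kernel_discrepancy_loss(rng.normal(size=1000), y=0.0)
far = kernel_discrepancy_loss(rng.normal(loc=5.0, size=1000), y=0.0)
```

Since the Gaussian kernel is characteristic, the loss so evaluated is strictly proper, while the kernel bandwidth sigma remains a free choice, as noted above.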
4.4. Realizing that all computing is discrete
It is a simple yet often overlooked fact that all arithmetic in computers is necessarily of finite accuracy, i.e., the set of representable states y ∈ Y is finite, which, following this argument, one may assume without loss of generality. Similarly, all integrals may be expressed as sums; the main point requiring care is that the implementation enforces the sum-to-one constraint of probability densities. We explain this in the example Y ⊆ R, where there is a well-ordering on Y and, due to finiteness of Y, for each y ∈ Y a unique antecessor (= next smallest element), which we denote by y⁻. Thus, for a predicted p ∈ Distr(R), one may consider the cumulative distribution function Fp := z ↦ P(Z ≤ z), where Z is a random variable with density p. We may obtain a distribution p′ ∈ Distr(Y) by setting ∆p(y) := Fp(y) − Fp(y⁻) if y is not the smallest element of Y, and ∆p(y) := Fp(y) otherwise, yielding a function ∆ : Distr(R) → Distr(Y). The reduction to the discrete case is obtained by considering, for any possibly occurring prediction functional f ∈ [X → Distr(R)], the prediction functional ∆ ◦ f ∈ [X → Distr(Y)] instead.
Classical approximation results may now be invoked to justify this choice, but since the data are also necessarily supported in Y, one may even argue that at no point an approximation is actually made or necessary, hence no such results are needed as theoretical justification.
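A minimal sketch of the map ∆ on a finite ordered grid follows; the bounded grid is a simplifying assumption compared to the full set of representable numbers, and to keep the sum-to-one constraint exactly, the remaining upper tail mass is folded into the largest grid point, which is an extra convention not made above:

```python
import numpy as np
from math import erf, sqrt

def discretize(cdf, grid):
    """The map Delta on a finite ordered grid: pmf(y) = F(y) - F(y^-), with
    the full lower tail assigned to the smallest grid point; the upper tail
    is folded into the largest point so the pmf sums to one exactly
    (an additional convention for bounded grids)."""
    F = np.array([cdf(y) for y in grid])
    pmf = np.diff(F, prepend=0.0)      # F(y) - F(y^-); F(y_min) for the first
    pmf[-1] += 1.0 - F[-1]             # fold the remaining upper tail in
    return pmf

# usage: discretize a standard normal prediction onto a grid
std_norm_cdf = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
grid = np.linspace(-5.0, 5.0, 1001)
pmf = discretize(std_norm_cdf, grid)
```

The resulting pmf is a valid discrete distribution concentrated around 0, as expected for a standard normal prediction.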
Unfortunately, most practical implementations of the above lead to numerical issues without a specific implementation of a probability type for mixed distributions. As an example, consider the case where one encodes distributions as probability mass functions on the 2^64 ≈ 1.8 · 10^19 different numbers which may be represented by an IEEE 754 double, and the case of a mixed distribution with mass 1/2 on 0, and otherwise being standard normal Gaussian. The successor of 0 is the value of significant precision, ca. 1.1 · 10^−16, which will carry a mass of ca. 0.5 · 10^−16, which is below the value of significant precision of the double. While the precision issue may eventually be circumvented by smart application of the logarithm for the strictly local log-loss, it is less clear how the discretization principle would work for a strictly global loss such as the squared integrated loss (e.g., for computing the term ‖p‖_2^2), since summation over 1.8 · 10^19 evaluations is intractable.
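The precision issue, and the logarithmic workaround suggested for the log-loss, can be seen directly at double precision:

```python
import math

# A mixed distribution with an atom of mass 1/2 at 0: adding the tiny mass
# carried by the successor of 0 (ca. 0.5e-16) to the atom is lost entirely,
# since it lies below the relative precision of an IEEE 754 double.
atom = 0.5
tiny = 0.5e-16
lost = (atom + tiny) == atom      # the sum rounds back to exactly 0.5

# Working with log-probabilities, as suggested for the log-loss, keeps the
# tiny mass representable without any loss of relative precision:
log_tiny = math.log(tiny)         # approximately -37.5
```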
Summarizing the above discussion:
Advantage of the discretization approach is potential applicability to any Y.
Disadvantage is that it entails numerical issues for which the existence of a practicable solution is unclear (though if the practical implementation issues can be solved, it might become the cleanest of all approaches).
22 A proof is obtained either by following [111] for characteristicness, or by considering the convolution/Fourier transformation argument as in Section 4.2.
4.5. Decomposition into continuous and discrete part
Another approach makes use of the fact that mixed distributions, i.e., distributions without a singular part, may be separated into an absolutely continuous and a discrete part by the Lebesgue decomposition theorem. Prediction of these two parts may then be studied separately.
More precisely, any Y-valued mixed random variable Y with density pY may be decomposed into:
(i) the pure/discrete locus Pure(pY) of pY, which is a countable subset of Y, and in most practically relevant cases finite;
(ii) a probability τ := P(Y ∈ Pure(pY)) for observing a value in the pure locus;
(iii) a continuous random variable Yc t.v.i. Y, or equivalently an absolutely continuous distribution on Y;
(iv) a discrete random variable Yd t.v.i. Pure(pY), or equivalently positive probabilities pYd(y) := P(Yd = y) for all y ∈ Pure(pY), with Σ_{y ∈ Pure(pY)} pYd(y) = 1.
If one writes Yb for the binary random variable indicating the event "Y ∈ Pure(pY)", taking the value 1 if it occurs and 0 otherwise, it is directly seen that Yb is Bernoulli distributed with parameter τ. The random variable Y can be decomposed by conditioning on Yb, in the sense that Y | (Yb = 1) = Yd and Y | (Yb = 0) = Yc. This corresponds to a graphical model with three components, namely Yb, Yc, and Yd.
When comparing a prediction with an observation, both possibly mixed, goodness of the prediction may be assessed through each of the three components; more precisely:
(i) whether the domain of Yd, i.e., the pure locus Pure(pY), has been correctly identified
(ii) whether the parameter τ of Yb is close to the truth
(iii) whether Yc is a good prediction for the continuous part
(iv) whether Yd is a good prediction for the discrete part
In particular, if the pure locus Pure(pY) is known, i.e., if task (i) is solved by an oracle or by a-priori knowledge, then models for tasks (ii), (iii), and (iv) may be estimated and assessed separately, composite models may be obtained by combining any separate models solving the three tasks, and all of the discussion of models in the discrete or absolutely continuous setting applies to the three tasks separately:
Proposition 4.13. Let Li : Pi × Yi → R, i ∈ {b, c, d}, be convex loss functions, where: Yb = {0, 1} and Pb = Distr(Yb); Yc = Y and Pc are the continuous distributions on Y; and Yd is the (known) pure locus and Pd = Distr(Yd). For a predicted distribution p, let pi, i ∈ {b, c, d}, be the predicted distributions for the three components, as above. Let αi ∈ R^+ be positive. Define
L : (p, y) ↦ αb Lb(pb, [y ∈ Yd]) + { αc Lc(pc, y) if y ∉ Yd ; αd Ld(pd, y) if y ∈ Yd }.
Considering L as a loss for predicted and true distributions with fixed pure locus Yd, the following are true:
(i) L is convex.
(ii) If all Li, i ∈ {b, c, d}, are proper, then so is L.
(iii) If all Li, i ∈ {b, c, d}, are strictly proper, then so is L.
Proof. We fix a "true" random variable Y with component variables (Yb, Yc, Yd), where Yb ∼ Bernoulli(τ), i.e., P(Y ∈ Yd) = τ. Also consider a "predicted" distribution p with component distributions (pb, pc, pd). By definition,
E[L(p, Y)] = αb E[Lb(pb, Yb)] + (1 − τ) αc E[Lc(pc, Yc)] + τ αd E[Ld(pd, Yd)].
(i) Let P = (Pb, Pc, Pd) be a random variable taking values in distributions with pure locus Yd, decomposed as above. From the definition, note that the decomposition of E[P] is (E[Pb], E[Pc], E[Pd]). By definition and by combining inequalities, positive linear combinations of convex losses are hence again convex.
(ii) Let pY be the distribution of Y, with component distributions (pY,b, pY,c, pY,d). Properness of the Li implies that pY,i is a minimizer of pi ↦ E[Li(pi, Yi)]. Thus, pY is a minimizer of p ↦ E[L(p, Y)], as the latter is a positive linear combination of the E[Li(pi, Yi)] (with fixed coefficients, since τ does not depend on p but on pY).
(iii) Let pY be the distribution of Y, with component distributions (pY,b, pY,c, pY,d). Strict properness of the Li implies that pY,i is the unique minimizer of pi ↦ E[Li(pi, Yi)]. Thus, pY is the unique minimizer of p ↦ E[L(p, Y)], as it is a positive linear combination with positive coefficients of the E[Li(pi, Yi)].
Unfortunately, the pure locus of the true distribution of Y is in general unknown, except in some practically relevant cases where it is clear from the predictive setting, e.g., if predictions are to be made on the interval Y = [0, 1] and only the two boundaries 0, 1 may occur with positive discrete probability. In the general case where the pure locus is unknown, it is not clear how assessment or estimation is to be achieved. The resultant problem is the same as the issue discussed in the introduction to this section, where a sample from the pure locus may occur only once in a putative training data set and re-appear in the test data set with positive probability.
This allows for the construction of pathological learning strategies whose expected out-of-sample loss is not bounded away from minus infinity (see Section 3.7), unless the case of "double observation" is sensibly addressed. While a number of straightforward strategies we can think of seem to take immediate care of this, they appear to be heuristics, even less principled than the full discretization approach presented in Section 4.4.
Summarizing the above discussion:
Advantages of the decomposition approach include general applicability and a direct link to two well-understood cases, the discrete and the continuous one.
Disadvantages are the arbitrary choice of the coefficients αi, and the lack of applicability as-is to the case where the pure/discrete part of the true distribution is unknown (though if this theoretical issue can be solved, it might become the cleanest of all approaches).
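Assuming a known pure locus, the composite loss of Proposition 4.13 can be sketched with log-losses for all three components; the function and parameter names, and the encodings of the component predictions, are illustrative assumptions:

```python
import numpy as np

def composite_log_loss(tau, pc_pdf, pd_pmf, pure_locus, y,
                       alphas=(1.0, 1.0, 1.0)):
    """Composite loss of Proposition 4.13 with log-losses as L_b, L_c, L_d.
    tau        : predicted probability that the label lies in the pure locus
    pc_pdf     : predicted continuous density (a callable)
    pd_pmf     : predicted pmf on the pure locus (a dict)
    pure_locus : the (known) pure locus Y_d, as a set"""
    ab, ac, ad = alphas
    in_pure = y in pure_locus
    # Bernoulli component: log-loss of tau against the event [y in Y_d]
    loss = ab * -np.log(tau if in_pure else 1.0 - tau)
    if in_pure:
        loss += ad * -np.log(pd_pmf[y])    # discrete component
    else:
        loss += ac * -np.log(pc_pdf(y))    # continuous component
    return loss

# usage: predicted mixture with mass 1/2 at 0, otherwise standard normal
std_normal = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)
loss_at_atom = composite_log_loss(0.5, std_normal, {0.0: 1.0}, {0.0}, y=0.0)
loss_off_atom = composite_log_loss(0.5, std_normal, {0.0: 1.0}, {0.0}, y=1.0)
```

Since all three component losses are strictly proper, Proposition 4.13 (iii) guarantees the composite is strictly proper for any positive choice of the αi.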
4.6. Mixed-to-continuous adaptors
A relatively simple solution to the problem of mixed distributions is to convert any prediction which is mixed into a prediction that is easier to treat, usually an absolutely continuous one. Using the notation of Section 4.2, one may for example consider
L̃(p, y) = L(p ∗ pZ, y),
i.e., converting a mixed prediction p into an absolutely continuous prediction p ∗ pZ.
However, the main thing to note is that L̃ will, in general, not be a proper (let alone strictly proper) loss. Thus, considering the convolution procedure above as the definition of a new loss L̃ is problematic. However, understanding the procedure as a modification to, or a component of, a prediction strategy producing p resp. p ∗ pZ is perfectly valid. This means that, as an approach, it is best understood in the context of composite prediction strategies rather than in the context of losses for the mixed case; hence we defer the discussion to Section 6, where meta-strategies for probabilistic prediction, including wrapped and composite strategies, will be considered.
Summarizing the above discussion:
Advantage of the adaptor approach is that it reduces evaluation of mixed density predictions to evaluation of continuous density predictions.
Disadvantage is that this approach wraps a mixed strategy inside a continuous one, and it is the goodness of the continuous strategy which is assessed directly, while the wrapped mixed strategy is assessed only indirectly.
4.7. Non-mixed distributions
A mathematically inclined reader will notice that the class of mixed distributions is still restrictive compared to the class of all distributions, since not all distributions are mixed. It is hence a-priori unclear whether there is a yet more general and relevant case, that of general distributions, which needs to be studied as well. However, we claim that while it might be mathematically interesting, studying the more general case of all distributions (on an arbitrary domain) could be practically irrelevant, due to the following: by Lebesgue's decomposition theorem, on any (measurable) domain Y, the predicted label distributions may be decomposed into an absolutely continuous measure, a discrete/pure measure, and a singular measure. Empirical estimation or inference with singular measures appears to be a widely open topic, with unclear benefit in predictive practice; moreover, while absolutely continuous and discrete measures are well understood, there is to our knowledge no classification or explicit characterization of singular measures. Thus, assuming the absence of a singular part may be sensible as long as there is neither a general theory of estimation for singular measures nor a practical use for it.
5. Learning theory and model diagnostics
This section presents a number of theoretical statements on model performance and learning, establishing parallels to the deterministic setting and information theoretical results:
(i) a probabilistic bias-variance trade-off, extending the trade-off(s) found in the deterministic setting
(ii) an information theoretical description of model performance, including results which equate unpredictability to statistical independence,
(iii) a discussion of residuals in the probabilistic setting.
5.1. Variations on Jensen
Before proving results in learning theory, we present some elementary yet helpful variations on Jensen's inequality for the case of conditioning. The statements in Lemma 5.1 are straightforward consequences of Jensen's inequality, but we state them for easy reference.
Lemma 5.1. Let X be a univariate real random variable taking only non-negative values. Let φ : R → R be a convex function^23. Then:
(i) φ(E[X]) ≤ E[φ(X)].
(ii) φ(E[X | Z]) ≤ E[φ(X) | Z] for any random variable Z t.v.i. Z.
(iii) φ(E[X]) ≤ E_Z[φ(E[X | Z])] ≤ E[φ(X)] for any random variable Z t.v.i. Z.
Furthermore, if φ is in addition strictly convex, then:
(i’) Equality holds in (i) if and only if X is constant.
(ii’) Equality holds in (ii) if and only if X = f (Z) for some function f : Z X. → (iii’) The left inequality in (iii) is an equality if and only if X = f (Z) for some function f : Z X (defined Z-almost-everywhere). The right inequality in (iii) is an equality if and only if X and Z are→ statistically independent.
Proof. (i) and (i’) is the usual statement of Jensen’s inequality. (ii) follows directly from (i) by conditioning on Z. More precisely, (i) implies (ii) for conditioning on any measurable Z0 Z, i.e., E[φ(X ) Z Z0] φ (E[X Z Z0]). Since Z0 is arbitrary, (ii) follows. (ii’) follows from⊆ (i’): in case Z |is∈ discrete,≤ the proof| ∈ is not technical: (i’) implies that E[φ(X ) Z = z] = φ (E[X Z = z]) if and only if X Z = z is constant, i.e., takes value f (z). For mixed or continuous| Z, the proof logic| is the same, though| requires a full measure theoretic approach (which we do not carry out here). (iii) the left inequality follows from taking total expectations (over Z) in (ii). The right inequality follows from applying (i) to the random variable EX Z (X Z), which is a function of Z. (iii’) for the left inequality follows from noting| that| equality holds if and only if the inequality in (ii) holds Z-almost always, and from (ii’). (iii’) for the right inequality follows from noting that by (i’), equality on the right hand side holds if and only if this function is constant, which is equal to all conditionals X Z Z0, Z0 Z being equal, which is equivalent to statistical independence of X and Z. | ∈ ⊆
We continue with conditional variants of Definition 3.5 and Lemma 3.6, for use in marginalization statements.
23 For example, [x ↦ L(x, y)] where L is a convex loss.
Definition 5.2. Let f be a prediction strategy. We define [Ef] : x ↦ E[f(x)]. For any random variable Z, we define [Ef | Z] : x ↦ E[f(x) | Z].
Lemma 5.3. Let L be a convex loss, and let Z be a random variable.
(i.a) It holds that εL(Ef) ≤ εL(f).
(i.b) If L is strictly convex, εL(Ef) = εL(f) if and only if f is a prediction functional.
(ii.a) It holds that εL(Ef | Z) ≤ εL(f).
(ii.b) If L is strictly convex, εL(Ef | Z) = εL(f) if and only if f is a functional with parameter Z, i.e., (Z-almost) all conditionals f | Z = z are constant prediction functionals.
Proof. All statements are direct applications of Lemma 5.1.
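The conditional Jensen chain in Lemma 5.1 (iii) can be checked by simulation; the following sketch (an illustrative choice of distributions, not taken from the text) uses φ(t) = t², Z uniform on {0, 2}, and X = Z + E with E ∼ Exp(1), for which the three terms of the chain are 4, 5, and 6, so both inequalities are strict:

```python
import numpy as np

# Monte Carlo check of the chain in Lemma 5.1 (iii):
#   phi(E[X]) <= E_Z[ phi(E[X | Z]) ] <= E[phi(X)],
# with phi(t) = t^2, Z uniform on {0, 2}, and X = Z + E, E ~ Exp(1).
rng = np.random.default_rng(0)
n = 200_000
z = 2.0 * rng.integers(0, 2, size=n)
x = z + rng.exponential(1.0, size=n)
phi = lambda t: t * t

outer = phi(x.mean())                                          # phi(E[X])
middle = np.mean([phi(x[z == v].mean()) for v in (0.0, 2.0)])  # E_Z[phi(E[X|Z])]
inner = phi(x).mean()                                          # E[phi(X)]
```

Here X is neither independent of Z nor a function of Z, so by (iii') both inequalities are strict, as the simulation confirms.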
5.2. Bayesics on predictions and posteriors
We discuss some advanced topics related to Bayesian posteriors. In the Bayesian paradigm, prediction and inference results take the form of a (Bayesian belief) posterior distribution. However, in the predictive setting, it is not necessarily clear what the most appropriate choice is for the object over which the posterior distribution is taken. Namely, we consider the following issues:
(i) When making a prediction for a label at a point x ∈ X, should the prediction be a posterior distribution over elements of Y, that is, an element of Distr(Y), usually called the "predictive posterior"? Or should it be a posterior distribution over predicted distributions in Distr(Y), that is, an element of Distr(Distr(Y))?
(ii) When predicting labels for multiple test points: should the predictive posterior be joint, i.e., over all N test points and an element of Distr(Y^N), or should there be one predictive posterior per test point (a marginal posterior), so that all predictions together are an element of Distr(Y)^N?
(iii) Combining the above, one may also ask whether, in the case of multiple test points, one should consider as Bayesian predictions elements of Distr(Y^N), Distr(Y)^N, Distr(Distr(Y)^N), or Distr(Distr(Y))^N.
Under the assumption that one accepts that external evaluation by proper losses is a reasonable quantifier of predictive quality, we argue, by a line of reasoning similar to Section 2.8, that for the purpose of prediction, marginal posteriors in Distr(Y) are the choice which is empirically validable; i.e., we claim that the answer to question (iii) is to only consider (independent) predictions in Distr(Y)^N. In particular, neither posteriors over distributions as in (i), nor joint test set posteriors as in (ii), should be considered.
It is noteworthy that the answers to these questions are not uniform across the Bayesian literature. For example, (i) is answered differently depending on whether the prediction is for a classification task (Y discrete) or for a regression task (Y = R). In the classification case, one may find posterior distributions over the predicted probability, i.e., in essence a posterior over possible values of the conditional law L(Y | X = x), parameterized through the probabilities P(Y = c) for all c ∈ Y. The posterior is hence an element of Distr(Distr(Y)).
On the other hand, in the regression case, the "(belief) distribution of (conditional) distributions", which is more natural in the Bayesian set-up but usually not properly expressible in common Bayesian notation, is usually integrated out to yield the "predictive posterior". Mathematically, instead of the law of a random variable P t.v.i. P ⊆ Distr(Y), which is an element of Distr(Distr(Y)), the law of its expectation E[P], an element of Distr(Y), is considered. Bayesian notation, in a semi-parametric case, usually denotes the latter law as ∫ p(y | θ) p(θ) dθ, while the former law is not notationally expressible (it would be the push-forward of p(θ) onto the set of all p(· | θ), which in our notation is P).
We proceed by describing the line of reasoning leading to our claims regarding points (i)-(iii). We note that the answer to (iii) already follows from arguing (i), i.e., no posterior distributions over predictive distributions, and (ii), i.e., no joints over test point predictions. Our key assumption is that expected loss, rather than a posterior over the predicted loss, is the main endpoint of evaluation, as in Section 3. One may of course dispute this assumption.
The first fact, i.e., that no posteriors over predictive distributions should be predicted, follows directly from Corollary 6.13, or from Lemma 5.3 (ii.b) where Z is taken to be the posterior over predicted distributions. It can also be seen to follow directly from the definition of convexity in Definition 2.2. Namely, all of these statements imply, in slightly different form, that for a convex loss, predicting a single fixed distribution will have lower or equal expected loss than predicting a posterior over distributions and then obtaining the expected loss by integration over the posterior.
The second fact, i.e., that predictions of joint probabilities of test labels are dominated, in expectation, by predictions of independent probabilities of test labels, may be derived as an application of Proposition 5.4. We would like to point out that this statement is specific to the log-loss.
Proposition 5.4. Consider a joint probabilistic prediction functional f : X^N → Distr(Y^N) and the corresponding marginal/independent prediction functional
f_q : X^N → Distr(Y)^N, (x1, ..., xN) ↦ ( yi ↦ ∫ f(x1, ..., xN)(y1, ..., yN) dy1 ... dy(i−1) dy(i+1) ... dyN ), for i = 1, ..., N.
In particular, assume that f is independent of the data (Xi, Yi), i = 1, ..., N. Then, for the expected losses, it holds that
E[−log f_q(X1, ..., XN)(Y1, ..., YN)] ≤ E[−log f(X1, ..., XN)(Y1, ..., YN)],
where (X1, Y1), ..., (XN, YN) ∼ (X, Y) i.i.d. (as in the usual set-up).
Proof. Define notation for the set of test features X∗ := {X1, ..., XN}, test labels Y∗ := {Y1, ..., YN}, and the evaluation of f at X∗, that is, P := f(X1, ..., XN) | X∗, where conditioning is on X∗, equivalently on all Xi. Define notation for X∗-conditional marginals and argument-conditional marginals of P as follows:
P(Y1, ..., Y(i−1)) := E_{Yi,...,YN | X∗}[P(Y1, ..., YN)],
P(Yi | Y1, ..., Y(i−1)) := P(Y1, ..., Yi)/P(Y1, ..., Y(i−1)),
or, more generally,
P(Yi, i ∈ I) := E_{Yi, i ∉ I | X∗}[P(Y1, ..., YN)],
P(Yi, i ∈ I | Yj, j ∈ J) := P(Yi, i ∈ I ∪ J)/P(Yj, j ∈ J).
Note that all expectations are conditional on the set of test features X∗. Note that, by definition, one has
log f_q(X1, ..., XN)(Y1, ..., YN) = Σ_{i=1}^N log P(Yi),   (5.1)
which is, up to the expectation and sign, the left hand side of the expression to prove. Also note that, by definition, it holds that
log f(X1, ..., XN)(Y1, ..., YN) = Σ_{i=1}^N log P(Yi | Y1, ..., Y(i−1)),   (5.2)
which is, up to the expectation and sign, the right hand side. Note that both expressions are conditional on X∗ by the above convention.
Now note that, by the marginal-conditional correspondence,
P(Yi) = E_{Y1,...,Y(i−1) | X∗}[P(Yi | Y1, ..., Y(i−1))],
hence, by Lemma 5.1 (ii) applied with the convex function −log, we may conclude that
log P(Yi) ≥ E_{Y1,...,Y(i−1) | X∗}[log P(Yi | Y1, ..., Y(i−1))].
Summing over i, we therefore obtain that
Σ_{i=1}^N log P(Yi) ≥ Σ_{i=1}^N E_{Y1,...,Y(i−1) | X∗}[log P(Yi | Y1, ..., Y(i−1))],
and by taking further expectation over Y∗, we obtain that
Σ_{i=1}^N E_{Y∗ | X∗}[log P(Yi)] ≥ Σ_{i=1}^N E_{Y∗ | X∗}[log P(Yi | Y1, ..., Y(i−1))],
and after flipping the sign and taking further expectation over X∗, the claim is obtained directly from substituting Equations 5.1 and 5.2.
We would like to remark: Proposition 5.4 implies that it is sufficient to consider prediction strategies of the type f : X^N → Distr(Y)^N instead of strategies f : X^N → Distr(Y^N). While this probably cannot be further reduced to the situation f : X → Distr(Y) without additional assumptions, it is unclear how to obtain reliable generalization estimates for a function of type f : X^N → Distr(Y)^N that looks at all test data samples at once. Considering a situation similar to the permutation baseline counterexample in Section 3.4 may lead to the conjecture that allowing a method to look at too many (e.g., more than one) test data points at once may be a form of "cheating", at least when it comes to evaluating the goodness of such a method.
It should also be pointed out that Proposition 5.4 covers only the setting in which the log-loss (the only strictly local strictly proper loss) is used for evaluating performance. While it is unclear what the corresponding statement for other losses would be (potentially giving rise to Stein-like paradoxes), this is already due to the simple issue that it is unclear how a general loss would be defined on a sample of feature-label pairs as opposed to a single point. As long as the definition implies Equation 5.2 in the proof (which is natural due to the likelihood interpretation of the log-loss), the straightforward generalization of Proposition 5.4 still holds.
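Proposition 5.4 can be illustrated numerically; in the following sketch (an illustrative construction, not from the text), the truth is N = 2 i.i.d. standard normal test labels, the joint prediction f is a bivariate Gaussian with a spurious correlation, and the marginalized prediction f_q is the product of its N(0, 1) marginals, which corresponds to rho = 0 below:

```python
import numpy as np

# Monte Carlo illustration of Proposition 5.4 for the log-loss: a joint
# Gaussian prediction with spurious correlation rho = 0.8 is dominated,
# in expected log-loss, by its product of marginals (rho = 0).
rng = np.random.default_rng(0)
y = rng.normal(size=(200_000, 2))     # the truth: two i.i.d. N(0, 1) labels

def neg_log_bigauss(y, rho):
    """-log density of a bivariate Gaussian with unit variances and corr rho."""
    q = (y[:, 0]**2 - 2.0*rho*y[:, 0]*y[:, 1] + y[:, 1]**2) / (1.0 - rho**2)
    return np.log(2.0*np.pi) + 0.5*np.log(1.0 - rho**2) + 0.5*q

loss_joint = neg_log_bigauss(y, 0.8).mean()   # joint prediction f
loss_marg = neg_log_bigauss(y, 0.0).mean()    # marginalized prediction f_q
```

Since the true labels are independent, any non-product joint prediction only adds expected log-loss, in line with the proposition.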
5.3. Bias and variance in the probabilistic setting
An elementary yet central motif in the theory of classical supervised learning is the well-known bias-variance trade-off. To briefly recapitulate in our notation: the classical bias-variance trade-off decomposes the expected squared loss (i.e., for the deterministic loss function L(ŷ, y) = (ŷ − y)^2) of a prediction functional f : X → Y, or of a prediction strategy f that is an [X → Y]-valued random variable. It is usually also assumed that the generative process (X, Y) taking values in X × Y corresponds to a homoscedastic noise process, i.e., that Y = g(X) + ε, where g : X → Y is a "true" functional, the random variable ε is centered noise, and X, ε, f are mutually independent (= independent noise, independent test set). The usual statement of the classical bias-variance trade-off (valid for both prediction functionals and prediction strategies f) is as follows:
Proposition 5.5. In the classical supervised learning setting as above, it holds that
$$\varepsilon(f \mid X) = \mathbb{E}\left[L_{\mathrm{sq}}(f(X), Y) \mid X\right] = \mathrm{Bias}(f \mid X)^2 + \mathrm{Var}(f \mid X) + \mathrm{Var}(\varepsilon),$$

where the expressions on the right hand side are defined as follows:
$$L_{\mathrm{sq}}(\hat{y}, y) = (\hat{y} - y)^2$$
$$\mathrm{Bias}(f \mid X) = \mathbb{E}[f(X) \mid X] - \varpi(X)$$
$$\mathrm{Var}(f \mid X) = \mathbb{E}[f(X)^2 \mid X] - \mathbb{E}[f(X) \mid X]^2,$$

and all expectations are taken with respect to both $f$ and the conditional random variable $Y \mid X$. All summands on the right hand side are non-negative numbers. Furthermore, it holds that $\varepsilon(f) = \mathrm{Var}(\varepsilon)$ if and only if $f(X) = \varpi(X)$ almost always, i.e., the prediction is perfect. In particular, the above is a proper additive decomposition of the expected generalization squared loss.
Proof. This follows from substitution of definitions, multiplying out, and using linearity of expectation (on both sides).
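For concreteness, the decomposition of Proposition 5.5 can be verified by simulation. The sketch below is illustrative only: the sine "true" functional, the noise level, and the deliberately biased constant-fit strategy are assumptions of the example, not taken from the text.

```python
# Monte Carlo check of the classical bias-variance decomposition at a
# fixed test point X = x0 (Proposition 5.5).
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):                 # assumed "true" functional
    return np.sin(x)

sigma = 0.3                     # noise std, so Var(eps) = sigma**2
x0 = 1.0                        # fixed test feature
M, n_train = 50_000, 20         # number of training-set draws, training size

# randomized strategy: fit a constant (the training-label mean) on each
# fresh training sample; this predictor is deliberately biased at x0
x_tr = rng.uniform(0.0, 2.0, size=(M, n_train))
y_tr = true_fn(x_tr) + rng.normal(0.0, sigma, size=(M, n_train))
preds = y_tr.mean(axis=1)       # one prediction f(x0) per training set

y_test = true_fn(x0) + rng.normal(0.0, sigma, size=M)

eps = np.mean((preds - y_test) ** 2)        # expected squared loss
bias2 = (preds.mean() - true_fn(x0)) ** 2   # Bias(f|X=x0)**2
var = preds.var()                           # Var(f|X=x0)
noise = sigma ** 2                          # Var(eps)

# the additive decomposition holds up to Monte Carlo error
assert abs(eps - (bias2 + var + noise)) < 5e-3
```

The randomness of the strategy here is the randomness of the training sample, matching the role of $f$ as a function-valued random variable in the proposition.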
We now establish a form in which this generalizes to our probabilistic setting, where we drop all of the above assumptions and return to the full probabilistic setting of Section 2. In this case, we wish to decompose the expected generalization loss of a prediction strategy $f$. No further assumptions, e.g., on the data generative process, are made.
Definition 5.6. Consider the probabilistic supervised learning setting with: a convex loss $L : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{P} \subseteq \mathrm{Distr}(\mathcal{Y})$; a data generative process $(X, Y)$ t.v.i. $\mathcal{X} \times \mathcal{Y}$; a prediction strategy $f$ t.v.i. $[\mathcal{X} \to \mathrm{Distr}(\mathcal{Y})]$. We define the quantities
$$\varepsilon_L(f \mid X) = \mathbb{E}\left[L(f(X), Y) \mid X\right]$$
$$\mathrm{Var}_L(f \mid X) = \mathbb{E}\left[L(f(X), Y) \mid X\right] - \mathbb{E}\left[L([\mathbb{E}f](X), Y) \mid X\right], \quad \text{with } \mathbb{E}f \text{ as in Definition 3.5}$$
$$\mathrm{Bias}_L(f \mid X) = \mathbb{E}\left[L([\mathbb{E}f](X), Y) \mid X\right] - \mathbb{E}\left[L(\varpi_{Y|X}(X), Y) \mid X\right]$$
$$\mathrm{DBias}_L(f \mid X) = \mathbb{E}\left[L([\mathbb{E}f](X), Y) \mid X\right] - \mathbb{E}\left[L([\mathbb{E}f](X), Y - \alpha) \mid X\right]$$
$$\mathrm{PBias}_L(f \mid X) = \mathbb{E}\left[L([\mathbb{E}f](X), Y - \alpha) \mid X\right] - \mathbb{E}\left[L(\varpi_{Y|X}(X), Y) \mid X\right]$$
$$\mathrm{Err}_L(Y \mid X) := \mathbb{E}_{Y|X}\left[L(\varpi_{Y|X}(X), Y) \mid X\right]$$

where $\alpha = \operatorname{argmin}_{\hat{\alpha} \in \mathbb{R}} \mathbb{E}\left[L([\mathbb{E}f](X), Y - \hat{\alpha})\right]$, and $\varpi_{Y|X} = [x \mapsto \mathcal{L}(Y \mid X = x)]$, i.e., the "true" density; all generalization errors are conditional on the feature variable $X$ and computed w.r.t. the joint data generating process $(X, Y)$. Note that, by definition, $\mathrm{Bias}_L(f \mid X) = \mathrm{DBias}_L(f \mid X) + \mathrm{PBias}_L(f \mid X)$, and

$$\varepsilon_L(f \mid X) = \mathrm{Bias}_L(f \mid X) + \mathrm{Var}_L(f \mid X) + \mathrm{Err}_L(Y \mid X),$$

hence the quantities induce a trade-off decomposition of the $X$-conditional expected generalization error. We call the defined random variables (all deterministic functions of $X$):
$\mathrm{Err}_L(Y \mid X)$ the irreducible generalization error at $X$,
$\mathrm{Var}_L(f \mid X)$ the prediction ($L$-)variance of $f$ at $X$,
$\mathrm{Bias}_L(f \mid X)$ the prediction ($L$-)bias of $f$ at $X$,
$\mathrm{DBias}_L(f \mid X)$ the location-specific part of $\mathrm{Bias}_L(f \mid X)$,
$\mathrm{PBias}_L(f \mid X)$ the shape-specific part of $\mathrm{Bias}_L(f \mid X)$.

We also define variants $\varepsilon_L(f \mid X = x) = \mathbb{E}\left[L(f(X), Y) \mid X = x\right]$, $\mathrm{Bias}_L(f \mid X = x) = \mathbb{E}\left[L([\mathbb{E}f](X), Y) \mid X = x\right] - \mathbb{E}\left[L(\varpi_{Y|X}(X), Y) \mid X = x\right]$, etcetera, where the conditioning is explicitly on an input $x \in \mathcal{X}$. Furthermore, we define unconditional variants $\mathrm{Err}_L(Y/X) = \mathbb{E}[\mathrm{Err}_L(Y \mid X)]$, $\mathrm{Bias}_L(f) = \mathbb{E}[\mathrm{Bias}_L(f \mid X)]$, etc., for which in the naming the qualifier "at $X$" is omitted (while the dependence on $L$, $X$, $Y$, and possibly $f$ remains).
Note that the two-step definition of the expected generalization error "L(f ) (through conditioning on X and then taking total expectations) coincides with our previous definition, due to the law of iterated expectation. The bias term has a well-known form for the log-loss (aka cross-entropy loss) and the squared integrated loss, closely related to their names:
Example 5.7. Let $f : \mathcal{X} \to \mathrm{Distr}(\mathcal{Y})$ be a prediction functional (e.g., $f = \mathbb{E}g$ for a prediction strategy $g$). Then:
(i) For the log-loss $L : (p, y) \mapsto -\log p(y)$, it holds that $\mathrm{Bias}_L(f \mid X) = D_{\mathrm{KL}}(\varpi_{Y|X}(X) \,\|\, f(X))$, where $D_{\mathrm{KL}}(p \,\|\, q) := \int_{\mathcal{Y}} p(y) \log \frac{p(y)}{q(y)} \, dy$ is the Kullback-Leibler divergence.
(ii) For the squared integrated loss $L : (p, y) \mapsto -2 p(y) + \|p\|_2^2$, it holds that $\mathrm{Bias}_L(f \mid X) = \| f(X) - \varpi_{Y|X}(X) \|_2^2$, where $\|p\|_2^2 := \int_{\mathcal{Y}} p(y)^2 \, dy$ is the squared $L^2$-norm of functions.

In the more general terminology of [23], $\mathrm{Bias}_L(f \mid X)$ is the ($L$-)divergence function of $f(X)$ w.r.t. $\varpi_{Y|X}(X)$, and $\mathrm{Bias}_L(f)$ is its expectation over $X$, which conversely could also be taken as a definition of a "conditional divergence function" for the case of a general loss. Important properties of the classical bias-variance trade-off transfer to the probabilistic case:

Proposition 5.8. Consider the probabilistic supervised learning setting with: a convex loss $L : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{P} \subseteq \mathrm{Distr}(\mathcal{Y})$; a data generative process $(X, Y)$ t.v.i. $\mathcal{X} \times \mathcal{Y}$; a prediction strategy $f$ taking values in $[\mathcal{X} \to \mathrm{Distr}(\mathcal{Y})]$. Consider the bias-variance decomposition of Definition 5.6:
$$\varepsilon_L(f \mid X) = \mathbb{E}\left[L(f(X), Y) \mid X\right] = \mathrm{PBias}_L(f \mid X) + \mathrm{DBias}_L(f \mid X) + \mathrm{Var}_L(f \mid X) + \mathrm{Err}_L(Y \mid X).$$

The right hand side expressions of the trade-off have the following properties:
(i.a) All summands on the RHS which depend on $f$, i.e., $\mathrm{DBias}_L(f \mid X)$, $\mathrm{PBias}_L(f \mid X)$, $\mathrm{Var}_L(f \mid X)$, are non-negative random variables.
(i.b) In particular, $\mathrm{DBias}_L(f)$, $\mathrm{PBias}_L(f)$, $\mathrm{Var}_L(f)$ are non-negative numbers.
(ii.a) $\varepsilon_L(f \mid X = x) = \mathrm{Err}_L(Y \mid X = x)$ if $f(x) = \varpi_{Y|X}(x)$, i.e., if the prediction at $x$ is perfect.
(ii.b) If $L$ is strictly proper, then $\varepsilon_L(f \mid X = x) = \mathrm{Err}_L(Y \mid X = x)$ if and only if $f(x) = \varpi_{Y|X}(x)$.
(iii.a) $\varepsilon(f) = \mathrm{Err}_L(Y/X)$ if $f = \varpi_{Y|X}$ ($X$-almost everywhere).
(iii.b) If $L$ is strictly proper, then $\varepsilon(f) = \mathrm{Err}_L(Y/X)$ if and only if $f = \varpi_{Y|X}$ ($X$-almost everywhere).
(iv.a) $\mathrm{Var}_L(f) = 0$ if $f$ is a prediction functional (i.e., a constant function-valued random variable).
(iv.b) If $L$ is strictly convex, then $\mathrm{Var}_L(f) = 0$ if and only if $f$ is a prediction functional.

In particular, the above is a proper decomposition of the expected loss (with desirable minimality properties).

Proof. Substitution and summation on the right hand side immediately yields the left hand side (by linearity of expectation).
(i.a) The claim for $\mathrm{Var}$ follows from Lemma 5.1 (iii), applied to conditioning on $X$. The claim for $\mathrm{DBias}$ follows from the definition of $\alpha$ and of the $\operatorname{argmin}$. The claim for $\mathrm{PBias}$ follows from properness of $L$, noting that $\varpi_{Y|X}(X)$ is the law of $Y \mid X$. (i.b) is a direct consequence, by taking total expectations.
(ii.a),(ii.b) follow directly from (strict) properness of L, noting that due to the conditioning, the first arguments are all fixed elements of Distr(Y).
(iii.a) is directly implied by (ii.a), integrating over $x$. We prove (iii.b) by contraposition. Assume $f(x) \neq \varpi_{Y|X}(x)$ for all $x \in U$, with a $U \subseteq \mathcal{X}$ of positive probability under $X$. Then, by (ii.b), $\varepsilon_L(f \mid X \in U) \neq \mathrm{Err}_L(Y \mid X \in U)$, and by (i.a), $\varepsilon_L(f \mid X \in U) - \mathrm{Err}_L(Y \mid X \in U) > 0$. Again using (i.a), this implies $\varepsilon_L(f) - \mathrm{Err}_L(Y/X) > 0$, which is the negation of the supposition to prove.
(iv.a) and (iv.b) are a rephrasing of Proposition 3.6, by substituting definitions.
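Returning to Example 5.7 (i): the identity between the log-loss prediction bias and the Kullback-Leibler divergence can be checked numerically for a single fixed $x$. The sketch below assumes Gaussian true and predicted densities (the parameter values are illustrative, not from the text) and uses the convention $D_{KL}(p\|q) = \int p \log(p/q)$ with the true density as first argument.

```python
# Numerical check: the excess expected log-loss of a predicted density over
# the true density equals their KL divergence (Example 5.7 (i)).
import numpy as np

def gauss(y, mu, s):
    return np.exp(-0.5 * ((y - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

mu_t, s_t = 0.0, 1.0   # "true" conditional density at x (assumed)
mu_f, s_f = 0.5, 1.3   # predicted density at x (assumed)

y = np.linspace(-12.0, 12.0, 200_001)
w = y[1] - y[0]
p_t, p_f = gauss(y, mu_t, s_t), gauss(y, mu_f, s_f)

# Bias_L = E[-log f(Y)] - E[-log varpi(Y)], expectations under the true density
bias = np.sum(p_t * (-np.log(p_f))) * w - np.sum(p_t * (-np.log(p_t))) * w

# closed-form KL divergence between the two Gaussians
kl = np.log(s_f / s_t) + (s_t**2 + (mu_t - mu_f)**2) / (2.0 * s_f**2) - 0.5

assert abs(bias - kl) < 1e-6
```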
Analogies between the probabilistic bias-variance trade-off given by Proposition 5.8 and the classical one in Proposition 5.5 may be obtained in the following ways: First, substitution of a suitable parametric Gaussian density as the output of $f$, as in Section 2.6, recovers the classical decomposition (up to scaling and constants, as well as the square in the bias term). Second, note that in both cases the variance terms $\mathrm{Var}$ are characterized as being independent of the ground truth (true labels or true distribution), while the bias terms are characterized by not depending on the randomness in the prediction strategy $f$ (e.g., the training set), but on $f$ only through its expectation $\mathbb{E}f$, which is a prediction functional. Both trade-offs have a bias term, $\mathrm{DBias}$ in the probabilistic case, that quantifies dislocation, i.e., the additive distance of the average prediction from the ground truth. Both trade-offs have a "minimum achievable expected error" $\mathrm{Err}$ corresponding to a perfect prediction, which is the generative noise term $\mathrm{Var}(\varepsilon)$ in the classical setting; in the probabilistic setting, the minimum error term corresponds to a conditional entropic quantity, as will be shown in Lemma 5.13 (iv) below. However, the additional term $\mathrm{PBias}$, which quantifies the mismatch between the shape of the predicted density and the shape of the true density, appears only in, and is thus characteristic for, the probabilistic setting.

An important practical consequence of Proposition 5.8 is Corollary 5.9 below, which is the probabilistic equivalent of the classical variance reduction lemma. For optimal readability in frequentist and Bayesian contexts alike, we give two (up to conditioning) equivalent formulations of the same statement.
Corollary 5.9. Consider the probabilistic supervised learning setting, as in Proposition 5.8. Let $f$ be a prediction strategy t.v.i. $[\mathcal{X} \to \mathrm{Distr}(\mathcal{Y})]$.

(i) It holds that
$$\mathrm{Var}_L(\mathbb{E}f \mid X) \le \mathrm{Var}_L(f \mid X), \qquad \mathrm{Bias}_L(\mathbb{E}f \mid X) = \mathrm{Bias}_L(f \mid X), \qquad \text{in particular } \varepsilon_L(\mathbb{E}f) \le \varepsilon_L(f).$$
(ii) Let $\theta$ be any random variable, and let $D$ be the training data on which $f$ depends. Let $f_{\setminus\theta} := \mathbb{E}_{\theta|D}[f \mid D]$, i.e., $f_{\setminus\theta}$ is $f$ with $\theta$ marginalized conditional on the training data $D$. Then,
$$\mathrm{Var}_L(f_{\setminus\theta} \mid X) \le \mathrm{Var}_L(f \mid X), \qquad \mathrm{Bias}_L(f_{\setminus\theta} \mid X) = \mathrm{Bias}_L(f \mid X), \qquad \text{in particular } \varepsilon_L(f_{\setminus\theta}) \le \varepsilon_L(f).$$

Proof. We prove the slightly more general formulation (ii); to obtain (i), remove the conditioning in the notation. It holds that $\mathrm{Bias}_L(f_{\setminus\theta} \mid X) = \mathrm{Bias}_L(f \mid X)$, since $\mathbb{E}[f_{\setminus\theta} \mid X] = \mathbb{E}[f \mid X]$ by construction. For $\mathrm{Var}_L(f_{\setminus\theta} \mid X) \le \mathrm{Var}_L(f \mid X)$, apply Lemma 5.3 (ii.a) to the expectation $\mathbb{E}_{\theta|D}$. The statement for $\varepsilon_L$ follows from the decomposition in Proposition 5.8.
Intuitively, Corollary 5.9 (ii) implies that when predicting using fully Bayesian models, it is better (or at least as good), in terms of any convex loss, to marginalize over free parameters ($\theta$ in the Corollary) instead of sampling them, or leaving them in as additional randomness in the prediction functional $f$. We expand upon further algorithmic consequences of this for bagging and model averaging in Section 6.3 below.
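This marginalization effect can be illustrated with a small simulation for the log-loss. The sketch below is an assumed toy setup, not from the text: the "posterior draws" of a location parameter are synthetic, and the mixture density plays the role of the marginalized strategy $f_{\setminus\theta}$.

```python
# Illustration of Corollary 5.9 (ii): marginalizing a free parameter can
# only improve the expected (convex) loss over plugging in a sampled value.
import numpy as np

rng = np.random.default_rng(1)

def gauss(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

thetas = rng.normal(0.0, 1.0, size=50)       # synthetic "posterior" draws
s = 1.0                                      # predictive std per draw
y_test = rng.normal(0.0, 1.2, size=100_000)  # held-out labels

# (a) sampled-parameter strategy: expected log-loss, averaged over draws
loss_sampled = np.mean([-np.log(gauss(y_test, th, s)).mean() for th in thetas])

# (b) marginalized strategy: the predictive density is the mixture over draws
mix = np.mean([gauss(y_test, th, s) for th in thetas], axis=0)
loss_marginal = -np.log(mix).mean()

# Jensen's inequality (concavity of log): marginalizing never hurts
assert loss_marginal <= loss_sampled
```

The inequality here is pointwise, by concavity of the logarithm, so it holds for any label sample; the Corollary states the population version for arbitrary convex losses.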
5.4. Information, predictability, and independence

In this section, we study the dependency relation between two random variables $X$, $Y$, t.v.i. $\mathcal{X}$, $\mathcal{Y}$, from the perspective of probabilistic predictability, and relate it to information theoretic quantities. Throughout this section, we consider a strictly proper, convex loss function $L : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$, applicable to the law of $Y$, i.e., we assume that $\mathcal{L}(Y) \in \mathcal{P}$. We note that the quantities defined below may depend on the choice of $L$. We will further write:
$p_Y := \mathcal{L}(Y)$ for the law of $Y$,
$p_{Y|X}(x) := \mathcal{L}(Y \mid X = x)$ for the law of $Y$ conditional on $X = x$.
As per the usual convention, the sub-indices $X$, $Y$, $Y|X$ are not considered values but part of the notation; thus expectations do not affect those sub-indices. We now define key information theoretic quantities$^{24}$:
Definition 5.10. Let (X , Y ) be a pair of random variables taking values as above. We define:
$$H(Y) = \mathbb{E}\left[L(p_Y, Y)\right]$$
$$H(Y \mid X = x) := \mathbb{E}_{Y|X}\left[L(p_{Y|X}(x), Y) \mid X = x\right] \quad \text{for } x \in \mathcal{X}$$
$$H(Y \mid X \in A) := \mathbb{E}_{Y|X}\left[L(p_{Y|X}(X), Y) \mid X \in A\right] \quad \text{for } A \subseteq \mathcal{X}$$
$$H(Y \mid X) := \mathbb{E}_{Y|X}\left[L(p_{Y|X}(X), Y) \mid X\right]$$
$$H(Y/X) := \mathbb{E}_X\left[H(Y \mid X)\right] = \mathbb{E}_X \mathbb{E}_{Y|X}\left[L(p_{Y|X}(X), Y) \mid X\right].$$

The value $H(Y)$ is called the ($L$-)entropy of $Y$. The value $H(Y \mid X = x)$ is called the conditional ($L$-)entropy of $Y$ conditioned on the event $X = x$, similarly $H(Y \mid X \in A)$. The random variable $H(Y \mid X)$ is called the feature-informed conditional ($L$-)entropy of $Y$ conditioned on $X$. We call $H(Y/X)$ the (total) conditional ($L$-)entropy of $Y$ above $X$. The implicit dependence on $L$ will not be made notationally explicit, only verbally where appropriate, but it will be understood.
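For the log-loss, the $L$-entropy $H(Y) = \mathbb{E}[-\log p_Y(Y)]$ is the differential entropy, which for a Gaussian has a well-known closed form. The following minimal sketch checks this by Monte Carlo; the Gaussian choice and parameter values are illustrative assumptions.

```python
# Monte Carlo check of Definition 5.10 for the log-loss: the L-entropy of a
# Gaussian Y should match the closed form 0.5 * log(2*pi*e*sigma^2).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 0.7

y = rng.normal(mu, sigma, size=1_000_000)
# log-density of the true law p_Y evaluated at the samples
log_p = -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

H_mc = -log_p.mean()                                  # E[L(p_Y, Y)], log-loss
H_closed = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

assert abs(H_mc - H_closed) < 5e-3
```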
We note that if $L$ is the logarithmic loss, then Definition 5.10 defines the classically familiar$^{25}$ entropic quantities; for the choice of the squared integrated loss, we obtain the integrated squared error measures familiar from density estimation. We will relate them to two types of optimal predictions:
Definition 5.11. We define two “best” prediction functionals:
(i) the best prediction functional $\varpi_{Y|X} : x \mapsto \mathcal{L}(Y \mid X = x) = p_{Y|X}(x)$.
(ii) the best uninformed prediction functional $\varpi_Y : x \mapsto \mathcal{L}(Y) = p_Y$.

We note that Definition 5.11 (i) agrees with the definition of the "perfect prediction" in Definition 5.6 of the bias-variance decomposition, and Definition 5.11 (ii) agrees with the definition of the best uninformed prediction functional given in Section 3.3. We repeat further definitions from Section 3.3 on the performance of predictors, to which we will relate the entropic quantities:
Definition 5.12. A prediction functional $f : \mathcal{X} \to \mathcal{P}$ is called:

(i) uninformed if $f$ is constant, i.e., if $f(x) = f(x')$ for all $x, x' \in \mathcal{X}$.
(ii) ($L$-)better-than-uninformed if $\varepsilon_L(f) < \varepsilon_L(\varpi_Y)$.
$^{24}$ The quantities in Definition 5.10 are related to, but not identical with, the key quantities in [23], even though they look similar at first glance. While Dawid [23] considers divergences between two different distributions on $\mathcal{Y}$, our divergences are of an $\mathcal{X}$-valued random variable to a $\mathcal{Y}$-valued random variable. Neither appears to be an obvious generalisation of the other: Dawid [23] considers distributions, while we consider potentially associated random variables, thus the former is not obtained by taking $X = Y$.
$^{25}$ This is correct with a somewhat non-standard choice regarding $H(Y \mid X)$, and the introduction of $H(Y/X)$, which is to resolve a notational clash that readers familiar with both information theory and statistics may spot or will have observed. Namely: in statistical notation, the conditional expectation $\mathbb{E}[Y \mid X]$ is a random variable that is a function of the value of $X$. As $H$ is an operator quite similar to the expectation $\mathbb{E}$, in notational concordance $H(Y \mid X)$ should behave in analogy, i.e., it should be a random variable which is a function of the value of $X$. In information theoretic notation, however, the conditional entropy $H(Y \mid X)$ is a number and not a random variable, which constitutes the notational clash. However, the information theoretic $H(Y \mid X)$ is nothing else than the expectation of the natural statistical definition of $H(Y \mid X)$. As the statistical variant will become important, we cannot simply discard it, but need to resolve the clash. To this end, in our notation, $H(Y \mid X)$ is a random variable which is a function of the value of $X$, and $H(Y/X)$ is its expectation, which equals $H(Y \mid X)$ in information theoretic notation.
A prediction strategy, i.e., a random variable t.v.i. $[\mathcal{X} \to \mathcal{P}]$, is called:

(i) uninformed if all realizations of $f$ are constant functionals, i.e., if $f(x) \mid f = f(x') \mid f$ for all $x, x' \in \mathcal{X}$.
(ii) ($L$-)better-than-uninformed if $\varepsilon_L(f) < \varepsilon_L(\varpi_Y)$.
As before, all prediction strategies are assumed independent of $(X, Y)$. The justification for the terminology "best" (uninformed or overall prediction functional) is given by the following result, which also relates the generalization performances of the best predictors to the entropy $H(Y)$ and the conditional entropy $H(Y/X)$, and the latter two to each other in a classically familiar way:

Lemma 5.13. The following hold for prediction functionals:
(i) $H(Y) = \inf_f \varepsilon(f)$, where the infimum is taken over uninformed predictors $f : \mathcal{X} \to \mathcal{P}$. The infimum is achieved as a minimum, by and only by the best uninformed predictor $\varpi_Y$.
(ii) $H(Y/X) = \inf_f \varepsilon(f)$, where the infimum is taken over all prediction functionals $f : \mathcal{X} \to \mathcal{P}$. The infimum is achieved as a minimum, by and only by the best predictor $\varpi_{Y|X}$.

In particular:
(iii) $H(Y) = \varepsilon(\varpi_Y)$,
(iv) $H(Y/X) = \varepsilon(\varpi_{Y|X}) = \mathrm{Err}_L(Y/X)$, and
(v) $H(Y/X) \le H(Y)$.

Proof. (i) Let $f \in [\mathcal{X} \to \mathcal{P}]$ be uninformed. By definition of $\varepsilon$, one has $\varepsilon(f) = \mathbb{E}[L(f(X), Y)]$. By definition of uninformed, there is $q$ such that $q = f(x)$ for all $x \in \mathcal{X}$, thus $\mathbb{E}[L(f(X), Y)] = \mathbb{E}[L(q, Y)]$. Since $L$ is strictly proper, the last expression is minimized in $q$ by taking $q = p_Y$, and the minimal value is exactly $H(Y)$, by definition of strict properness.
(ii) $\varepsilon(f) \ge H(Y/X)$ is implied by Proposition 5.8, taking expectations over $X$. The fact that the value $H(Y/X)$ is achieved by the choice $\varpi_{Y|X}$, i.e., that $H(Y/X) = \varepsilon(\varpi_{Y|X})$, follows from substitution of definitions. Strict properness of $L$ implies that this is the unique minimizer.
(iii) is a direct consequence of (i), and (iv) is a direct consequence of (ii).
(v) follows from the fact that the set over which (ii) minimizes includes the set over which (i) minimizes.
We would like to note that while the values $H(Y)$, $H(Y/X)$ may depend on the choice of the loss function $L$, the best uninformed predictor and the best predictor do not depend on that choice, as long as the loss is strictly proper, as has been assumed in this section. The only concept in Definition 5.12 that may depend on the choice of $L$ is that of being ($L$-)better-than-uninformed.
Theorem 2. The following are equivalent:
(i) X and Y are statistically independent.
(ii) There is no (L-)better-than-uninformed prediction strategy for predicting Y from X .
(iii) There is no (L-)better-than-uninformed prediction functional for predicting Y from X .
(iv) $\varepsilon(\varpi_{Y|X}) = \varepsilon(\varpi_Y)$.
(v) $\varpi_{Y|X} = \varpi_Y$.
(vi) $H(Y/X) = H(Y)$.
Proof. (i) $\Rightarrow$ (ii): Let $f$ be any prediction strategy, and let $q := \mathbb{E}[f(X)]$; note that $q \in \mathcal{P}$. Since $f$ is independent of $(X, Y)$ and $X$ is independent of $Y$ by assumption, $f(X)$ is independent of $Y$. By convexity of $L$ and Jensen's inequality (conditionally on $Y$), $L(q, Y) = L(\mathbb{E}[f(X) \mid Y], Y) \le \mathbb{E}[L(f(X), Y) \mid Y]$. Taking total expectations implies $\mathbb{E}[L(q, Y)] \le \mathbb{E}[L(f(X), Y)] = \varepsilon(f)$. Strict properness of $L$ implies $\mathbb{E}[L(p_Y, Y)] \le \mathbb{E}[L(q, Y)]$. Lemma 5.13 yields $\varepsilon(\varpi_Y) = \mathbb{E}[L(p_Y, Y)]$. Putting all inequalities together (in reverse order) yields $\varepsilon(\varpi_Y) \le \varepsilon(f)$, so $f$ is not ($L$-)better-than-uninformed. Since $f$ was arbitrary, (ii) is implied.
(ii) $\Rightarrow$ (iii): Any prediction functional is a (constant) prediction strategy, thus (ii) implies (iii).
(iii) $\Rightarrow$ (iv): This is implied by Lemma 5.13 (i) and (ii), since the set of uninformed prediction functionals is contained in the set of all prediction functionals.
(iv) $\Rightarrow$ (v): By Lemma 5.13 (i), $\varpi_Y$ is the unique prediction functional $f$ with $\varepsilon(f) = H(Y)$. By Lemma 5.13 (ii), $\varpi_{Y|X}$ is the unique prediction functional $g$ with $\varepsilon(g) = H(Y/X)$. Since $H(Y/X) = H(Y)$, the unique $f$ and the unique $g$ must coincide, hence $\varpi_{Y|X} = \varpi_Y$.
(v) $\Rightarrow$ (i): By definition of $\varpi_{Y|X}$ and $\varpi_Y$, the assumption (v) implies that $p_{Y|X}(x) = p_Y$ for any $x \in \mathcal{X}$. The latter is a possible definition of, or characterization of, statistical independence of $X$ and $Y$.
(v) $\Leftrightarrow$ (vi): By Lemma 5.13, $H(Y/X) = \varepsilon(\varpi_{Y|X})$ and $H(Y) = \varepsilon(\varpi_Y)$, which implies the claim by substitution.
Theorem 2 implies some practically interesting equivalences. Comparing with the practical rationale in Section 3.3, one may note that estimating $\varpi_{Y|X}$ is at the core of supervised learning, while estimating $\varpi_Y$ is density estimation; thus the best prediction functional may be considered to be the best supervised prediction, and the best uninformed prediction functional may be considered to be the best density estimator. Comparing the performance of the two, as in determining whether statement (iv) of Theorem 2 is true, is the scientifically necessary act of baseline comparison. Per the equivalences in Theorem 2, this act of model comparison is equivalent to determining whether $X$ and $Y$ are independent; the sub-task of estimating the performance of the best supervised learner and of the best density estimator becomes equivalent to estimating the entropy of $Y$ and the conditional entropy of $Y$ w.r.t. $X$. We return to a particularly interesting implication of these equivalences, namely the identification of statistical independence testing with comparison of probabilistic models, in Section 6.5.

For the remainder of the section, we focus on further connections with the classical (Shannon/differential) entropies, in the case where $L$ in Definition 5.10 is taken to be the log-loss and $(X, Y)$ are jointly absolutely continuous$^{26}$, t.v.i. $\mathcal{X} \times \mathcal{Y}$. For this case, we will in addition consider quantities directly related to $X$ and the respective marginal, joint, and conditional density functions $p_X$, $p_Y$, $p_{X,Y}$, $p_{Y|X}$, as per the usual notation.
Definition 5.14. In the situation above, we define:
$$H(Y) = \mathbb{E}\left[-\log p_Y(Y)\right]$$
$$H(X) = \mathbb{E}\left[-\log p_X(X)\right]$$
$$H(Y, X) = \mathbb{E}\left[-\log p_{X,Y}(X, Y)\right]$$
$$H(Y \mid X) := \mathbb{E}_{Y|X}\left[-\log p_{Y|X}(Y) \mid X\right]$$
$$H(Y/X) := \mathbb{E}\left[H(Y \mid X)\right] = \mathbb{E}_X \mathbb{E}_{Y|X}\left[-\log p_{Y|X}(Y) \mid X\right] = -\int_{\mathcal{X}} \int_{\mathcal{Y}} \log p_{Y|X}(y \mid x) \cdot p_{X,Y}(x, y) \, dy \, dx.$$

The value $H(X)$ is called the entropy of $X$. The value $H(Y, X)$ is called the joint entropy of $X$ and $Y$. All other quantities are called the same as in Definition 5.10.
$^{26}$ Absolute continuity is not an entirely necessary assumption, but it makes the exposition easier.
The quantities occurring in both Definitions 5.10 and 5.14 agree for the choice of $L$ as the log-loss in Definition 5.10; the quantities exclusive to Definition 5.14 all involve integration over a logarithm of a density over the value range of $X$. Classically, one has the following well-known relations between marginal and joint entropies:
Proposition 5.15. The following inequalities hold classically for the (Shannon/differential) entropies defined above:
(i) $H(X) \le H(X, Y)$ and $H(Y) \le H(X, Y)$.
(ii) $H(X, Y) \le H(X) + H(Y)$, and equality holds if and only if $X$ and $Y$ are statistically independent (as random variables).
If furthermore X resp. Y is discrete, then
(iii) $H(X) \ge 0$ resp. $H(Y) \ge 0$ (this is not necessarily true if $X$ resp. $Y$ is not discrete).
It is interesting to note how the symmetric statements in Proposition 5.15 translate to statements about the $X$-conditional (and loss-general) quantities in Definition 5.10:
Corollary 5.16. The following equalities and inequalities hold for the conditional entropy H(Y /X ):
(i) $H(Y/X) = H(Y, X) - H(X)$.
(ii) $H(Y/X) = H(Y)$ if and only if $X$ and $Y$ are statistically independent.
Proof. (i) Note that, by definition of the conditional density, $p_{Y|X}(y \mid x) = p_{X,Y}(x, y)/p_X(x)$. Hence

$$H(Y/X) = -\int_{\mathcal{X}} \int_{\mathcal{Y}} \log p_{Y|X}(y \mid x) \cdot p_{X,Y}(x, y) \, dy \, dx$$
$$= -\int_{\mathcal{X}} \int_{\mathcal{Y}} \log p_{X,Y}(x, y) \cdot p_{X,Y}(x, y) \, dy \, dx + \int_{\mathcal{X}} \int_{\mathcal{Y}} \log p_X(x) \cdot p_{X,Y}(x, y) \, dy \, dx$$
$$= -\int_{\mathcal{X}} \int_{\mathcal{Y}} \log p_{X,Y}(x, y) \cdot p_{X,Y}(x, y) \, dy \, dx + \int_{\mathcal{X}} \log p_X(x) \cdot p_X(x) \, dx$$
$$= H(Y, X) - H(X).$$

The statement (ii) then follows from Proposition 5.15 (ii).
It is interesting to note how the equivalence of (i) and (vi) in Theorem 2, i.e., the equivalence of statistical independence and equality of the conditional with the unconditional entropy, relates to the classical entropic independence statement, Proposition 5.15 (ii). Corollary 5.16 (ii) is an equivalent statement, obtained by subtracting $H(X)$ from both sides of the equality in Proposition 5.15 (ii) while noting Corollary 5.16 (i). In this latter characterization of independence via the conditional entropy $H(Y/X)$, the entropy of $X$ itself does not actually figure, in contrast to the characterization of Proposition 5.15 (ii) via the joint entropy $H(Y, X)$. This makes the conditional characterization of Corollary 5.16 (ii) much more appealing from a practical perspective, since it avoids estimation of the entropy of $X$ altogether. The formulation via strictly proper losses in Theorem 2 may arguably be considered even more appealing, as the estimation of entropies in a multivariate setting is replaced altogether by the much better understood task of prediction and predictive model validation, in the context of an arbitrary loss function. From a theoretical perspective, all links above are equally valuable, as they closely tie together predictive modelling, statistical independence testing, and information theory; methodological as well as theoretical insights about the one will hence immediately transfer to the other two. We would also like to note that the results linking the log-loss and the Shannon/differential entropic quantities may be generalized to Bregman losses and Bregman entropic quantities, which include the squared integrated loss as a special case; despite this being a straightforward generalization, we refrain from carrying it out, as it would make the discussion more technical and the link to classical quantities less clear.
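As a concrete closed-form illustration of these links, consider a bivariate Gaussian $(X, Y)$ with correlation $\rho$; this is an assumed toy model, not an example from the text. The conditional log-loss entropy drops below the marginal entropy exactly when $\rho \neq 0$, i.e., exactly when $X$ and $Y$ are dependent, in line with the equivalence of (i) and (vi) in Theorem 2.

```python
# Sketch: for bivariate Gaussian (X, Y) with correlation rho, comparing the
# best conditional predictor (entropy H(Y/X)) with the best uninformed one
# (entropy H(Y)) detects dependence, as in Theorem 2 and Corollary 5.16.
import numpy as np

def H_gauss(var):
    # differential entropy of a Gaussian with the given variance
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

sigma2 = 1.0
for rho in [0.0, 0.3, 0.9]:
    H_Y = H_gauss(sigma2)
    # residual variance of Y given X is sigma2 * (1 - rho**2)
    H_Y_given_X = H_gauss(sigma2 * (1.0 - rho**2))
    if rho == 0.0:
        assert np.isclose(H_Y_given_X, H_Y)   # independence: no gain
    else:
        assert H_Y_given_X < H_Y              # dependence: Y is predictable
```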
5.5. Transformations and probabilistic residuals

Here, we study how a differentiable transformation of the target variable affects a model and its performance. For this, we consider the above setting of generative random variables $(X, Y)$, with the additional assumption that $Y$ takes values in the reals, and that $Y$ and $Y \mid X$ have densities $p_Y$, $p_{Y|X}$ and continuous, piece-wise differentiable cdf $F_Y$, $F_{Y|X}$. Below, we briefly repeat some well-known facts about transformations of random variables:
Lemma 5.17. Let $Y$ be an absolutely continuous random variable taking values in a compact $\mathcal{Y} \subseteq \mathbb{R}$, with density $p_Y$. Let $T : \mathcal{Y} \to \mathcal{Z}$ be a (piece-wise) diffeomorphism$^{27}$, and let $t := \frac{dT(x)}{dx}$. Consider the random variable $Z := T(Y)$ with cdf $F_Z$. Then:
(i) Z is absolutely continuous, i.e., has a pdf pZ , and FZ is continuous, piece-wise differentiable.
(ii) It holds that $p_Z(z) = \frac{p_Y(T^{-1}(z))}{|t(T^{-1}(z))|}$.
(iii) $Z$ is uniform on $[0, 1]$ if and only if $T = F_Y$.

Next, we introduce notation for a transformed density such as in Lemma 5.17 (ii), and for prediction methods whose predicted density is transformed:
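Both parts of Lemma 5.17 can be checked by simulation. The sketch below assumes $Y \sim \mathrm{Exp}(1)$ and the diffeomorphism $T(y) = \log y$ on $y > 0$; these choices are illustrative, not from the text.

```python
# Numerical check of Lemma 5.17 (ii) and (iii) for Y ~ Exp(1), T(y) = log(y).
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(1.0, size=500_000)

# (ii): with T^{-1}(z) = exp(z) and t(y) = dT/dy = 1/y, the change-of-variables
# formula gives p_Z(z) = p_Y(exp(z)) * exp(z) = exp(z - exp(z))
z = np.log(y)
hist, edges = np.histogram(z, bins=200, range=(-8.0, 3.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
p_Z = np.exp(centers - np.exp(centers))
assert np.max(np.abs(hist - p_Z)) < 0.05

# (iii): the probability integral transform F_Y(Y) = 1 - exp(-Y) is uniform,
# so its mean and variance should be close to 1/2 and 1/12
u = 1.0 - np.exp(-y)
assert abs(u.mean() - 0.5) < 5e-3 and abs(u.var() - 1.0 / 12.0) < 5e-3
```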
Definition 5.18. Let $p : \mathcal{Y} \to \mathbb{R}_+$ be a probability density, and let $T : \mathcal{Y} \to \mathcal{Z}$, $U : \mathcal{Z} \to \mathcal{Y}$ be (piece-wise) diffeomorphisms.
(i) We write $T^{\sharp} p := \left[z \mapsto \frac{p(T^{-1}(z))}{|t(T^{-1}(z))|}\right]$, where $t := \frac{dT(x)}{dx}$, and call $T^{\sharp} p$ the push-forward of $p$ (through $T$). The induced map $T^{\sharp}$ is called the push-forward operator induced by $T$.
(ii) For a prediction functional $f \in [\mathcal{X} \to [\mathcal{Y} \to \mathbb{R}_+]]$, we define, in analogy, the push-forward of $f$ (through $T$) as $T^{\sharp} f := [x \mapsto T^{\sharp} f(x)]$, i.e., the push-forward operator $T^{\sharp}$ is applied to every element of the image.
(iii) For a prediction strategy, we apply $T^{\sharp}$ to every prediction functional which the strategy can take as a value; as such, the notation coincides with the standard meaning of $T^{\sharp} f$ where $f$ may be a random variable.
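The push-forward operator of Definition 5.18 can be sketched as a higher-order function on density functions. The names `pushforward` and `pushforward_functional` below are my own, and the exponential/log example is an assumed illustration.

```python
# Minimal sketch of the push-forward operator T-sharp of Definition 5.18,
# acting on density functions (i) and on prediction functionals (ii).
import numpy as np

def pushforward(p, T_inv, t):
    """Definition 5.18 (i): the density z -> p(T_inv(z)) / |t(T_inv(z))|."""
    return lambda z: p(T_inv(z)) / np.abs(t(T_inv(z)))

def pushforward_functional(f, T_inv, t):
    """Definition 5.18 (ii): apply the operator to every predicted density."""
    return lambda x: pushforward(f(x), T_inv, t)

# example: p = Exp(1) density, T(y) = log(y), T^{-1}(z) = exp(z), t(y) = 1/y
p = lambda y: np.exp(-y)
q = pushforward(p, np.exp, lambda y: 1.0 / y)

# the pushed-forward density should be exp(z) * exp(-exp(z))
z = np.linspace(-3.0, 2.0, 1001)
assert np.allclose(q(z), np.exp(z) * np.exp(-np.exp(z)))
```

A design choice worth noting: representing predicted distributions as callables makes $T^{\sharp}$ compose naturally with prediction functionals, mirroring how the operator is applied elementwise in (ii).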