Optimal Trees for Prediction and Prescription

by

Jack William Dunn

B.E.(Hons), University of Auckland (2014)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author...... Sloan School of Management May 18, 2018

Certified by...... Dimitris Bertsimas Boeing Professor of Operations Research Co-director, Operations Research Center Thesis Supervisor

Accepted by ...... Patrick Jaillet Dugald C. Jackson Professor Department of Electrical Engineering and Computer Science Co-Director, Operations Research Center

Optimal Trees for Prediction and Prescription by Jack William Dunn

Submitted to the Sloan School of Management on May 18, 2018, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research

Abstract

For the past 30 years, decision tree methods have been one of the most widely-used approaches in machine learning across industry and academia, due in large part to their interpretability. However, this interpretability comes at a price—the performance of classical decision tree methods is typically not competitive with state-of-the-art methods like random forests and gradient boosted trees. A key limitation of classical decision tree methods is their use of a greedy heuristic for training. The tree is therefore constructed one locally-optimal split at a time, and so the final tree as a whole may be far from global optimality. Motivated by the increase in performance of mixed-integer optimization methods over the last 30 years, we formulate the problem of constructing the optimal decision tree using discrete optimization, allowing us to construct the entire decision tree in a single step and hence find the single tree that best minimizes the training error. We develop high-performance local search methods that allow us to efficiently solve this problem and find optimal decision trees with both parallel (axes-aligned) and hyperplane splits. We show that our approach using modern optimization results in decision trees that improve significantly upon classical decision tree methods. In particular, across a suite of synthetic and real-world classification and regression examples, our methods perform similarly to random forests and boosted trees whilst maintaining the interpretability advantage of a single decision tree, thus alleviating the need to choose between performance and interpretability. We also adapt our approach to the problem of prescription, where the goal is to make optimal prescriptions for each observation. While constructing the tree, our method simultaneously infers the unknown counterfactuals in the data and learns to make optimal prescriptions. This results in a decision tree that optimizes both the predictive and prescriptive error, and delivers an interpretable solution that offers significant improvements upon the existing state-of-the-art in prescriptive problems.

Thesis Supervisor: Dimitris Bertsimas Title: Boeing Professor of Operations Research Co-director, Operations Research Center


Acknowledgments

Firstly, I would like to thank Dimitris Bertsimas for being not just an advisor, but also a mentor and true friend to me during my time at the ORC. I am constantly inspired by his visions, ambitions, and positivity, and it is undeniable that he has had a great and positive influence on my attitude to both research and life. I am continually impressed by his aptitude for selecting practically important problems—he asked me to think about optimizing decision trees during our first meeting, and I never would have imagined that we would still be meeting about them nearly four years later! It has been a truly rewarding experience to work with and be mentored by someone with such a genuine drive to pursue and, through research, address the things that matter most in life.

I would like to thank Rahul Mazumder and Nikos Trichakis for serving as the other two members of my thesis committee, and for their useful feedback and insights during my thesis proposal. Thanks also to Juan Pablo Vielma for his role on my general examination committee.

I am extremely grateful to all of my collaborators during my time at the ORC, including the many collaborations that were not included in the thesis. I would especially like to thank the following people: Colin Pawlowski and Daisy Zhuo for our work in the early years on Robust Classification, a project we never seemed to see the end of; Yuchen Wang for his collaborations on a variety of projects and for being one of the most prolific users of Optimal Trees; Lauren Berk for her assistance on our project with Massachusetts General Hospital; Nishanth Mundru for his creativity in our work in figuring out Optimal Prescriptive Trees; and Emma Gibson and Agni Orfanoudaki for their perseverance with Optimal Survival Trees. It has been an absolute pleasure working with so many talented people over the course of my PhD and I hope it can continue into the future. I also thank the rest of the wider ORC community for making the experience an enjoyable one.

My thanks also go to the various outside collaborators I have worked with. George Velmahos and Haytham Kaafarani at Massachusetts General Hospital were

both great inspirations and taught me a lot about the realities of working in healthcare. Tom Trikalinos provided a lot of valuable input in our work with diagnosing head trauma in children. Aris Paschalidis, Steffie van Loenhout and Korina Digalaki all spent summers working with me in various capacities, and I am thankful for their assistance and company.

One of the most demanding and yet rewarding aspects of the PhD has been developing and supporting the use of the Optimal Trees package within the ORC and beyond. I am extremely grateful to the roughly 30 students that have been using Optimal Trees in their own research, both for keeping me busy with bug reports and feature requests, and for providing the impetus and drive for continued improvement of the software. I would also like to thank Daisy Zhuo specifically for her contributions to the package, and for keeping me passionate and excited about the development process with her extreme enthusiasm for life.

Finally, I would like to acknowledge and thank my family for their unconditional support and encouragement both during my time here and over my whole life. I cannot thank you enough for the assistance and direction you have given me, and for your impact in making me who I am today. Last, but certainly not least, I would like to thank my partner Jenny, who despite the inherent challenges has managed to follow me over the pond to the US and continues to be my biggest supporter every day.

Contents

List of Figures 11

List of Tables 15

1 Introduction 17
   1.1 Overview of CART 18
   1.2 Limitations of CART 23
   1.3 The Remarkable Progress of MIO 26
   1.4 What is Tractable? 28
   1.5 Motivation, Philosophy and Goal 29
   1.6 Structure of the Thesis 31

2 Optimal Classification Trees with Parallel Splits 33
   2.1 Review of Classification Tree Methods 34
   2.2 A Mixed-Integer Optimization Approach 38
   2.3 A Local Search Approach 47
   2.4 A Method for Tuning Hyperparameters 57
   2.5 Experiments with Synthetic Datasets 67
   2.6 Experiments with Real-World Datasets 75
   2.7 Conclusions 83

3 Optimal Classification Trees with Hyperplane Splits 85
   3.1 OCT with Hyperplanes via MIO 86
   3.2 OCT with Hyperplanes via Local Search 90
   3.3 The Accuracy-Interpretability Tradeoff 98
   3.4 Experiments with Synthetic Datasets 101
   3.5 Experiments with Real-World Datasets 110
   3.6 Conclusions 114

4 Optimal Regression Trees with Constant Predictions 117
   4.1 Review of Regression Tree Methods 118
   4.2 ORT with Constant Predictions via MIO 119
   4.3 ORT with Constant Predictions via Local Search 123
   4.4 Experiments with Synthetic Datasets 126
   4.5 Experiments with Real-World Datasets 139
   4.6 Conclusions 145

5 Optimal Regression Trees with Linear Predictions 147
   5.1 ORT with Linear Predictions via MIO 149
   5.2 ORT with Linear Predictions via Local Search 152
   5.3 Experiments with Synthetic Datasets 155
   5.4 Experiments with Real-World Datasets 162
   5.5 Conclusions 166

6 Optimal Prescription Trees 169
   6.1 Optimal Prescriptive Trees 177
   6.2 Experiments with Synthetic Datasets 181
   6.3 Experiments with Real-World Datasets 190
   6.4 Conclusions 200

7 Diagnostics for Children After Head Trauma 201
   7.1 Data and Methods 202
   7.2 Results 205
   7.3 Discussion 212

8 Conclusion 215

Bibliography 217


List of Figures

1-1 An example of the CART algorithm applied to the “Iris” dataset that shows the recursive partitioning step-by-step. 20
1-2 The tree learned by CART for the “Iris” dataset. 21
1-3 The CART tree for the “Iris” dataset after conducting pruning. 22
1-4 An example of partitioning the “Iris” dataset with arbitrary hyperplanes rather than just splits parallel to the axes. 23
1-5 A tree for the “Iris” dataset with hyperplane splits. 23
1-6 An example where greedy tree induction fails to learn the truth in the data. 24
1-7 Peak supercomputer speed in GFlop/s (log scale) from 1994–2016. 28

2-1 An example of a decision tree. 35
2-2 An example of a strong split being hidden behind a weak split. 37
2-3 The maximal tree for depth 퐷 = 2. 39
2-4 Comparison of upper and lower bound evolution while solving MIO problem (2.24). 46
2-5 Training accuracy (%) and running time for each method on “Banknote Authentication” dataset. 49
2-6 Training accuracy (%) and running time for each method on “Banknote Authentication” dataset. 57
2-7 Example of pruning process following Algorithm 2.4 on the “Breast Cancer Wisconsin” dataset. 61
2-8 An example showing the instability of cost-complexity pruning. 63

2-9 Tuning results using batch pruning on “Breast Cancer Wisconsin” dataset. 65
2-10 Comparison of pruning methods on “Breast Cancer Wisconsin” dataset in the region where validation error is minimized. 66
2-11 Synthetic experiments showing the effect of training set size. 69
2-12 Synthetic experiments showing the effect of number of features. 70
2-13 Synthetic experiments showing the effect of number of classes. 71
2-14 Synthetic experiments showing the effect of number of random restarts. 72
2-15 Synthetic experiments showing the effect of feature noise. 73
2-16 Synthetic experiments showing the effect of label noise. 74
2-17 Mean out-of-sample accuracy for each method across all 60 datasets. 78
2-18 A comparison of CART and OCT trees on the “Banknote Authentication” dataset. 80
2-19 Improvement in out-of-sample accuracy for OCT over CART for each of the 60 datasets, according to ratio of 푛 and log(푝). 82

3-1 Training accuracy (%) and running time for each method on “Blood Transfusion Service Center” dataset. 91
3-2 Training accuracy (%) and running time for each method on “Blood Transfusion Service Center” dataset. 96
3-3 Comparison of optimal trees with parallel and hyperplane splits on the “Banknote Authentication” dataset. 100
3-4 Synthetic experiments showing the effect of training set size. 102
3-5 Synthetic experiments showing the effect of number of features. 103
3-6 Synthetic experiments showing the effect of number of classes. 104
3-7 Synthetic experiments showing the effect of number of random restarts. 104
3-8 Synthetic experiments showing the effect of feature noise. 105
3-9 Synthetic experiments showing the effect of label noise. 106
3-10 Synthetic experiments showing the effect of split density. 108
3-11 Synthetic experiments showing the effect of number of random hyperplane restarts. 109

3-12 Mean out-of-sample accuracy for each method across all 60 datasets. 111

4-1 CART regression tree predicting fuel consumption of automobiles using the “mtcars” dataset. 119
4-2 Training mean squared error and running time for each method on “Hybrid Vehicle Prices” dataset. 124
4-3 Training mean squared error and running time for each method on “Hybrid Vehicle Prices” dataset. 126
4-4 Synthetic experiments showing the effect of training set size. 127
4-5 Synthetic experiments showing the effect of number of features. 128
4-6 Synthetic experiments showing the effect of number of random restarts. 129
4-7 Synthetic experiments showing the effect of feature noise. 130
4-8 Synthetic experiments showing the effect of label noise. 131
4-9 Synthetic experiments showing the effect of training set size. 132
4-10 Synthetic experiments showing the effect of number of features. 133
4-11 Synthetic experiments showing the effect of number of random restarts. 134
4-12 Synthetic experiments showing the effect of feature noise. 135
4-13 Synthetic experiments showing the effect of label noise. 136
4-14 Synthetic experiments showing the effect of split density. 137
4-15 Synthetic experiments showing the effect of number of random hyperplane restarts. 138
4-16 Mean out-of-sample 푅2 for each method across all 26 datasets. 141
4-17 Mean out-of-sample 푅2 for each method across all 26 datasets. 143

5-1 Synthetic experiments showing the effect of training set size. 156
5-2 Synthetic experiments showing the effect of number of features. 156
5-3 Synthetic experiments showing the effect of number of random restarts. 157
5-4 Synthetic experiments showing the effect of feature noise. 157
5-5 Synthetic experiments showing the effect of label noise. 158
5-6 Synthetic experiments showing the effect of regression density. 159
5-7 Synthetic experiments showing the effect of split density. 160

5-8 Synthetic experiments showing the effect of regression density. 161
5-9 Mean out-of-sample 푅2 for each method across all 26 datasets. 164

6-1 Test prediction and personalization error as a function of 휇. 178
6-2 Performance results for Experiment 1. 185
6-3 Tree constructed by OPT(0.5)-L for an instance of Experiment 1. 186
6-4 Performance results for Experiment 2. 186
6-5 Tree constructed by OPT(0.5)-L for an instance of Experiment 2. 188
6-6 Performance results for Experiment 3. 189
6-7 Misclassification rate for warfarin dosing prescriptions as a function of training set size. 193
6-8 Comparison of methods for personalized diabetes management. The leftmost plot shows the overall mean change in HbA1c across all patients (lower is better). The center plot shows the mean change in HbA1c across only those patients whose prescription differed from the standard-of-care. The rightmost plot shows the proportion of patients whose prescription was changed from the standard-of-care. 196
6-9 Out-of-sample average personalized income as a function of inclusion rate. 198

7-1 PECARN rules for ciTBI in children younger than 2 years. 206
7-2 OCT for ciTBI in children younger than 2 years. 207
7-3 PECARN rules for ciTBI in children at least 2 years old. 208
7-4 OCT for ciTBI in children at least 2 years old. 208

List of Tables

2.1 Full results for CART and OCT at depth 10. 77
2.2 Mean out-of-sample accuracy (%) results for CART and OCT across all 60 datasets. 79
2.3 Number of datasets where CART and OCT were strongest across all 60 datasets. 81
3.1 Performance results for classification methods (with maximum depth 10) on each of the 60 datasets. 113
3.2 Pairwise significance tests for performance of classification methods (with maximum depth 10) across 60 datasets. 114
4.1 Full results for CART and ORT at depth 10. 140
4.2 Mean out-of-sample 푅2 results for CART and ORT across all 26 datasets. 142
4.3 Performance results for regression methods (with maximum depth 10) on each of the 26 datasets. 144
4.4 Pairwise significance tests for performance of regression methods (with maximum depth 10) across 26 datasets. 145
5.1 Performance results for regression methods (with maximum depth 10) on each of the 26 datasets. 163
5.2 Pairwise significance tests for performance of regression methods (with maximum depth 10) across 26 datasets. 165
6.1 Average personalized income on the test set for various methods, arranged in increasing order. 198
6.2 Average 푅2 on the test set for each method when estimating the personalized treatment effect. 200
7.1 Prospectively collected predictors in the PECARN cohort. 204
7.2 Cross-classification of predictions of higher and low risk and ciTBI status in children younger than 2 years. 210
7.3 Cross-classification of predictions of higher and low risk and ciTBI status in children 2 years and older. 210
7.4 Patients with ciTBI who were missed by the PECARN or OCT rules. 210
7.5 Predictive performance of OCT versus PECARN rules among children younger than 2 years. 211
7.6 Predictive performance of OCT versus PECARN rules among children at least 2 years old. 211

Chapter 1

Introduction

Machine Learning methods represent one of the most promising and widely applicable methodologies of our time, aspiring to tackle some of the world’s most important questions:

∙ What is intelligence?

∙ Can we cure some of the most significant human diseases like cancer and cardiovascular diseases using electronic medical records and genomic data?

∙ Can we improve human productivity, for example by creating self-driving cars?

Decision trees are one of the most widely-used techniques in Machine Learning for the central problems of regression and classification. The leading work for decision tree methods in classification is CART (Classification and Regression Trees), proposed by Breiman et al. [25]. This book has roughly 37,000 Google Scholar citations (as of February 2018) and it is one of the most cited works in the mathematical sciences. Guided by the training data (x푖, 푦푖), 푖 = 1, . . . , 푛, decision trees recursively partition the feature space for classification problems and assign a label to each resulting partition. The space is partitioned into boxes using splits parallel to the coordinate axes: Is 푥푖 < 푎 or 푥푖 ≥ 푎? The tree is then used to classify future points according to these splits and labels. The key advantage of decision trees over other methods is that they are very interpretable, and in many applications this interpretability is

often preferred over other less-interpretable methods with higher accuracy. In healthcare, for instance, decision trees resemble a doctor’s thinking in that doctors typically evaluate a patient’s medical condition one variable at a time, and thus are very useful for assisting doctors in their work. In contrast, other methods that are not consistent with human thinking are far less useful in this setting. In this chapter, we give an overview of the CART algorithm, outline some of the limitations of CART, discuss the remarkable progress of discrete optimization which is a key tool we use in the thesis, and outline the motivation, philosophy, objectives and structure of the thesis.

1.1 Overview of CART

CART takes a top-down approach to determining the partitions in the tree. Starting from the root node, a split is determined by solving an optimization problem to find the best split, before dividing the points according to the split and recursing on the two resulting child nodes. CART chooses the best split using an impurity measure, which is a non-linear quantity that measures the similarity of labels among points in a group. CART seeks the split that divides the points into two groups such that the sum of the impurities of each group is minimized. Typical choices for the impurity measure are the gini or twoing criteria for classification problems, and the mean squared error or mean absolute error for regression. This greedy, recursive partitioning process continues at each new node until one of the following stopping criteria is met:

∙ The node being partitioned has fewer points than the minimum allowed leaf

size, a parameter of CART denoted 푁min;

∙ The impurity of the node cannot be reduced by any candidate split, e.g., if all points in the node have the same label.

When the partitioning has concluded, each leaf node in the tree is assigned a label that is used for predicting the labels of new data points. The label for the leaf is

assigned using the labels of the training points that fall into this leaf node; typically the mode (most common) of these labels is used for classification problems, and the mean of these labels is used for regression problems.
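To make the split-selection step concrete, the following is a minimal sketch of this greedy search using the Gini impurity (an illustration in our own notation, not the actual CART implementation):

    import numpy as np

    def gini(y):
        """Gini impurity of a vector of class labels."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(X, y, n_min=1):
        """Search every (feature, threshold) pair and return the split that
        minimizes the weighted impurity of the two resulting groups."""
        n, num_features = X.shape
        best_feature, best_threshold, best_score = None, None, np.inf
        for j in range(num_features):
            for b in np.unique(X[:, j]):
                left = X[:, j] < b
                n_left, n_right = left.sum(), n - left.sum()
                if n_left < n_min or n_right < n_min:
                    continue  # split would violate the minimum leaf size
                score = (n_left * gini(y[left]) + n_right * gini(y[~left])) / n
                if score < best_score:
                    best_feature, best_threshold, best_score = j, b, score
        return best_feature, best_threshold, best_score

CART applies this kind of search at the current node, partitions the points by the winning split, and then recurses on each child until one of the stopping criteria above is met.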

To illustrate this process, in Figure 1-1 we provide an example of the CART algorithm applied to Fisher’s “Iris” dataset, a classic and well-known classification dataset from the UCI Machine Learning Repository [77]. Each of the 150 points in the dataset corresponds to an iris flower which is one of three species (“setosa”, “versicolor” or “virginica”). There are four measurements recorded for each flower: the length and width of the sepals and petals. The goal is to predict the species of iris using only the four physical measurements. In Figure 1-1a, we plot the data according to the petal length and width. Figures 1-1b–1-1e show the progression of the CART algorithm on this data. CART first splits on the petal length, perfectly separating the “setosa” species from the others, and so no further splitting is needed on the left side of this split. The second split is on the petal width, and largely separates the “versicolor” and “virginica” species from each other, although the separation is not perfect. The third and fourth splits refine this separation so that all partitions are pure. Figure 1-1f shows the final labels assigned to each partition of the space, which correspond to the leaf nodes on the CART tree. The final partitioning is also shown in tree form in Figure 1-2.

One of the primary concerns when growing the tree is to avoid overfitting the tree to the training data. It is possible to achieve very high accuracy on the training points with a large tree, but such a tree is then likely to perform poorly when making predictions on new data. We can restrict the degree of overfitting by controlling the tradeoff between the training accuracy and the number of splits in the tree, which is also known as the complexity of the tree.

CART controls this tradeoff between accuracy and complexity by introducing a penalty on the complexity of the tree. Each split in the tree is required to improve the accuracy by more than the value of the complexity parameter, denoted 훼, otherwise the split will not be included in the tree. By adjusting the value of 훼, we can change the complexity of the final tree and thus control the degree of overfitting to the training data.

Figure 1-1: An example of the CART algorithm applied to the “Iris” dataset that shows the recursive partitioning step-by-step.

(a) Data; (b) First split; (c) Second split; (d) Third split; (e) Fourth split; (f) Final predictions. Each panel plots Petal Width against Petal Length, with points colored by species (setosa, versicolor, virginica).

Figure 1-2: The tree learned by CART for the “Iris” dataset.

Petal Length < 2.45:
  yes: setosa
  no:  Petal Width < 1.75:
         yes: Petal Length < 4.95:
                yes: versicolor
                no:  Petal Width < 1.55:
                       yes: virginica
                       no:  versicolor
         no:  virginica

The penalty on complexity is applied retroactively after the tree has been grown. At each split in the tree, we consider replacing the split and its children with a single leaf instead, and calculate the change in accuracy resulting from this change. If the accuracy improvement due to the split is lower than 훼, we replace this split with a leaf. This process is repeated until all splits in the tree have an accuracy improvement of at least 훼. This procedure is known as pruning, because the weakest splits in the tree are pruned, leaving only important splits.

Returning to the example of Figure 1-1, we see that the fourth split, applied in Figure 1-1e, creates lower and upper partitions that misclassify 0 and 1 points, respectively. If this split were not applied, the single overall partition would predict “virginica” in line with the most common label, and so the resulting misclassification would be 2 points. This split has therefore only increased the accuracy by a single point, whereas the other splits in the tree give much larger improvements in accuracy, and so pruning the tree will enable us to remove this less significant split, depending on the tradeoff we desire between accuracy and complexity. The original tree is shown in Figure 1-2, and if we prune this tree with the complexity parameter 훼 = 0.01, we get the tree shown in Figure 1-3. In the pruned tree, the fourth split has indeed been removed, because the increase in accuracy of 1 point did not outweigh the cost of the additional complexity of another split.
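As a back-of-the-envelope check, assuming the accuracy improvement is measured as a fraction of the 150 training points, the fourth split reduces the misclassification count from 2 to 1, giving an improvement of

    (2 − 1)/150 ≈ 0.0067 < 훼 = 0.01,

so at this value of the complexity parameter the split does not pay for its added complexity, while the earlier splits, which each remove many more errors, comfortably survive the pruning.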

In practice, the value for the complexity parameter would typically be chosen using a hold-out validation set.

Figure 1-3: The CART tree for the “Iris” dataset after conducting pruning.

Petal Length < 2.45:
  yes: setosa
  no:  Petal Width < 1.75:
         yes: Petal Length < 4.95:
                yes: versicolor
                no:  virginica
         no:  virginica

In this case, we choose the particular value for 훼 that maximizes the accuracy of the tree on this validation set, rather than on the training set, with the goal of choosing the “right-sized tree” that performs best on new unseen data. The standard procedure for tuning the value of 훼 for CART is known as cost-complexity pruning, which we will introduce in Section 2.4.

Since we are trying to determine the tree with the best tradeoff between accuracy and interpretability, it is natural to search among the trees of a given complexity for the one with the best accuracy. For instance, using the methodology we develop in Chapter 2, we can verify that the tree in Figure 1-3 is the most accurate tree with three splits: it misclassifies only three points, and there are no other trees with three or fewer splits that have a lower misclassification score. We therefore refer to this tree as the optimal tree with three splits. Note that there can be many such optimal trees with the same number of splits and the minimal misclassification.

A natural extension to decision tree methods like CART is to consider splits that are not parallel to the axes. If we examine the example in Figure 1-1, we see that the CART tree uses two splits to separate the “versicolor” and “virginica” species. However, it seems that these points would also be relatively well-separated by a single diagonal split. Such a partitioning is shown in Figure 1-4, and the corresponding tree that generates these partitions is given in Figure 1-5. This tree with hyperplane splits only misclassifies two points, and is able to better discover the structure in the data, whereas the CART tree is approximating this structure with multiple axis-parallel splits. This results in the tree with hyperplane splits having both higher accuracy and one fewer split than the tree with parallel splits.

Figure 1-4: An example of partitioning the “Iris” dataset with arbitrary hyperplanes rather than just splits parallel to the axes.

(Scatter plot of Petal Width against Petal Length, with points colored by species: setosa, versicolor, virginica.)

Figure 1-5: A tree for the “Iris” dataset with hyperplane splits.

Petal Width < 0.80:
  yes: setosa
  no:  Petal Length + 2.62 · Petal Width < 8.92:
         yes: versicolor
         no:  virginica

Trees with hyperplane splits are typically more expensive to compute than those with parallel splits, because there are many more possible splits to consider when trying to find the optimal split during the recursion.
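As an illustration of how such a tree is applied to a new point, here is a toy sketch that hard-codes the split values shown in Figure 1-5 (these values are specific to this example):

    def classify_iris(petal_length, petal_width):
        """Apply the hyperplane tree of Figure 1-5 to a single flower."""
        if petal_width < 0.80:                          # parallel split at the root
            return "setosa"
        if petal_length + 2.62 * petal_width < 8.92:    # hyperplane split
            return "versicolor"
        return "virginica"

    print(classify_iris(4.5, 1.4))  # prints "versicolor"

Evaluating a hyperplane split at prediction time costs only one inner product per node; it is the training-time search over all possible split coefficients that makes these trees more expensive to construct.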

1.2 Limitations of CART

The main shortcoming of the top-down approach taken by CART and other popular decision tree methods like C4.5 [104] and ID3 [103] is its fundamentally greedy nature. Each split in the tree is determined in isolation without considering the possible impact of future splits in the tree. This can lead to trees that do not capture well the underlying characteristics of the dataset, potentially leading to weak performance when classifying future points. Figure 1-6 shows an example where a tree grown via top-down induction is very different from the true tree that generated the data. Another limitation of top-down induction methods is that they typically require

23 Figure 1-6: An example where greedy tree induction fails to learn the truth in the data. Figure 1-6a shows the data, and the true decision tree that generated the data is given in Figure 1-6b. Figures 1-6c–1-6f show the evolution of a tree created via greedy top-down induction. We see that the greedy induction chooses the wrong split at the very first step, and generates a final tree that has a significantly different structure to the true tree, despite having the same accuracy.

(a) Data; (b) Truth; (c) First greedy split; (d) Second greedy split; (e) Third greedy split; (f) Fourth greedy split. Each panel shows the 푥1–푥2 plane with points labeled × and ∙.

pruning to achieve trees that generalize well. Pruning is needed because top-down induction is unable to handle penalties (we call them complexity penalties) on how large the tree becomes while growing it, since powerful splits may be hidden behind weaker splits. This means that the best tree might not be found by the top-down method if the complexity penalty is too high and prevents the first weaker split from being selected.

The usual approach to resolve this is to grow the tree as deep as possible before pruning back up the tree using the complexity penalty. This avoids the problem of weaker splits hiding stronger splits, but it means that the training occurs in two phases, growing via a series of greedy decisions, followed by pruning. Lookahead heuristics such as IDX [94], LSID3 and ID3-k [43] also aim to resolve this problem of strong splits being hidden behind weak splits by finding new splits based on optimizing deeper trees rooted at the current leaf, rather than just optimizing a single

split. However, it is unclear whether these methods can lead to trees with better generalization ability and avoid the so-called look-ahead pathology of decision tree learning [89].

In classification problems, top-down induction methods typically optimize an impurity measure when selecting splits, rather than using the misclassification rate of training points. This seems odd when the misclassification rate is the final objective being targeted by the tree, and indeed is also the measure that is used when pruning the tree. Breiman et al. [25, p.97] explain why CART uses impurity measures in place of the more natural objective of misclassification:

. . . the [misclassification] criterion does not seem to appropriately reward splits that are more desirable in the context of the continued growth of the tree. . . . This problem is largely caused by the fact that our tree growing structure is based on a one-step optimization procedure.

Indeed, the misclassification criterion does not always work well within a greedy framework. This was the case in the example from Figure 1-6, where adding a split to Figure 1-6d to get to Figure 1-6e improves the impurity measure, but does not improve the misclassification error, and so we would be unable to proceed past the second greedy split if we were simply using the misclassification criterion. It seems natural to believe that growing the decision tree with respect to the final objective function would lead to better splits, but the use of top-down induction methods and their requirement for pruning prevents the use of this objective.

The natural way to address the limitations of CART is to form the entire decision tree in a single step, allowing each split to be determined with full knowledge of all other splits. This would result in the optimal decision tree for the training data. This is not a new idea; Breiman et al. [25, p.42] noted the potential for such a method:

Finally, another problem frequently mentioned (by others, not by us) is that the tree procedure is only one-step optimal and not overall optimal . . . If one could search all possible partitions . . . the two results might be quite different . . . At this stage of computer technology, an overall optimal

tree growing procedure does not appear feasible for any reasonably sized dataset.

The use of top-down induction and pruning in CART was therefore not due to a belief that such a procedure was inherently better, but instead was guided by practical limitations of the time, given the difficulty of finding an optimal tree. Indeed, it is well-known that the problem of constructing optimal binary decision trees is 푁푃-hard [64]. Nevertheless, there have been many efforts previously to develop effective ways of constructing optimal univariate decision trees using a variety of heuristic methods for growing the tree in one step, including linear optimization [7], continuous optimization [8], dynamic programming [37, 99], genetic algorithms [109], and more recently, evolutionary algorithms [55], and optimizing an upper bound on the tree error using stochastic gradient descent [93]. However, none of these methods have been able to produce certifiably optimal trees in practical times. A different approach is taken by the methods T2 [4], T3 [112] and T3C [117], a family of efficient enumeration approaches which create optimal non-binary decision trees of depths up to 3. However, trees produced using these enumeration schemes are not as interpretable as binary decision trees, and do not perform significantly better than current heuristic approaches [112, 117].

1.3 The Remarkable Progress of MIO

We believe the lack of success in developing an algorithm for optimal decision trees can be attributed to a failure to address correctly the underlying nature of the decision tree problem. Constructing a decision tree involves a series of discrete decisions—whether to split at a node in the tree, which variable to split on—and discrete outcomes—which leaf node a point falls into, whether a point is correctly classified in a classification problem—and as such, the problem of creating an optimal decision tree is best posed as a Mixed-Integer Optimization (MIO) problem. Continuous optimization methods have been widely used in statistics over the past 40 years, but MIO methods, which have been used to great effect in many other

fields, have not had the same impact on statistics. Despite the knowledge that many statistical problems have natural MIO formulations [1], there is the belief within the statistics/machine learning community that MIO problems are intractable even for small to medium instances, which was true in the early 1970s when the first continuous optimization methods for statistics were being developed.

However, the last twenty-five years have seen an incredible increase in the computational power of MIO solvers, and modern MIO solvers such as Gurobi [56] and CPLEX [65] are able to solve linear MIO problems of considerable size. To quantify this increase, Bixby [22] tested a set of MIO problems on the same computer using CPLEX 1.2, released in 1991, through CPLEX 11, released in 2007. The total speedup factor was measured to be more than 29,000 between these versions [22, 92]. Gurobi 1.0, an MIO solver first released in 2009, was measured to have similar performance to CPLEX 11. Version-on-version speed comparisons of successive Gurobi releases have shown a speedup factor of around 43 between Gurobi 7.0, released in 2016, and Gurobi 1.0 [22, 92, 57], so the combined machine-independent speedup factor in MIO solvers between 1991 and 2015 is approximately 1,250,000. This impressive speedup factor is due to incorporating both theoretical and practical advances into MIO solvers. Cutting plane theory, disjunctive programming for branching rules, improved heuristic methods, techniques for preprocessing MIOs, using linear optimization as a black box to be called by MIO solvers, and improved linear optimization methods have all contributed greatly to the speed improvements in MIO solvers [22]. Coupled with the increase in computer hardware during this same period as shown in Figure 1-7, a factor of approximately 1,560,000 [113], the overall speedup factor is approximately 2 trillion! This astonishing increase in MIO solver performance has enabled many recent successes when applying modern MIO methods to a selection of these statistical problems [18, 13, 14, 15].
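Multiplying the reported factors together (all figures are approximate):

    29,000 × 43 ≈ 1.25 × 10^6 for the solvers alone, and 1.25 × 10^6 × 1.56 × 10^6 ≈ 2 × 10^12,

which is the 2 trillion figure quoted above.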

The belief that MIO approaches to problems in statistics are not practically relevant was formed in the 1970s and 1980s and it was at the time justified. Given the astonishing speedup of MIO solvers and computer hardware in the last twenty-five years, the mindset of MIO as theoretically elegant but practically irrelevant is no

27 Figure 1-7: Peak supercomputer speed in GFlop/s (log scale) from 1994–2016.

(Axis ranges: 10^2 to 10^8 GFlop/s on the vertical log scale; 1995 to 2015 on the horizontal axis.)

longer supported. In this thesis, we extend this re-examination of statistics under a modern optimization lens by using MIO to formulate and solve the decision tree training problem, and provide empirical evidence of the success of this approach.

1.4 What is Tractable?

The computer science community has developed the notion that problems that are solvable by an algorithm in polynomial time (in the bits required to write the input of an instance of the problem), that is, they belong to the class 푃, are theoretically tractable. In contrast, the theory of 푁푃-completeness, see Garey and Johnson [51], was developed in the 1970s to explain that a problem that is 푁푃-hard cannot be solved in polynomial time unless 푃 = 푁푃. While it is still unknown whether 푃 = 푁푃, it is widely believed that 푃 ≠ 푁푃. Thus, the current belief is that theoretical tractability is equivalent to polynomial time solvability. In this way, optimal trees, being 푁푃-hard, are believed to be theoretically intractable. It is our belief, however, that a 2 trillion times speedup forces us to reconsider what is practically tractable. To motivate our discussion, let us give two examples. The simplex method developed by George Dantzig [38, 39] for linear optimization problems is not a polynomial time algorithm (for many variants of the method there

are instances that require exponential time to find an optimal solution). Yet, to this day it is an extremely practical algorithm that is used widely in practice for sizes involving millions of variables and constraints. In the same way, the traveling salesman problem is 푁푃-hard, yet problems with one million cities can be routinely solved in minutes. In other words, the predictions of complexity theory regarding tractability often have negative correlation with empirical evidence.

For this reason, we define the notion of practical tractability (practability). A problem is practically tractable (practable) if it can be solved in times and for sizes that are appropriate for the application that motivated the problem. Let us give some examples. What is relevant for an online trading problem is our ability to solve it in milliseconds for sizes of 500–1000 securities. An optimal tree for medical diagnosis needs to be able to be constructed in hours or days for sizes involving hundreds of thousands of patients, but it should be able to run online in seconds to diagnose a new patient. What is important is that polynomial solvability or 푁푃-hardness does not give relevant information about our ability to solve the problem in the real world. Given that the motivation of this thesis is to solve real world problems, we will use the notion of practical tractability alongside theoretical tractability when evaluating our algorithms.

1.5 Motivation, Philosophy and Goal

Decision trees have good but, because of the limitations we discussed, not excellent accuracy for prediction problems. For this reason Leo Breiman [29] introduced random forests, also extremely influential in mathematical sciences, having been cited 26,144 times (as of February 13, 2017). The idea of random forests is to create a large collection of trees each trained on a random collection of features and on a random subset of the training data. The final prediction is based on a majority vote among the trees in the forest for classification, and the mean prediction for regression. The accuracy of random forests improves upon the accuracy of a single tree, but at the expense of loss of interpretability. Jerry Friedman [49] proposed the idea of gradi-

29 ent boosting, which develops a collection of sequentially generated trees where tree

푇푘+1 is trained to predict the residuals of the collection of trees 푇1, . . . , 푇푘. Boosted trees have comparable or slightly better accuracy compared to random forests, but also sacrifice interpretability as they use a collection of trees to make a prediction as opposed to a single tree. Given the success of MIO, our discussion of practical tractability and of the state of the art of decision trees, natural questions arise:

1. Given that decision trees are found by greedy methods, can we find globally optimal trees in a practically tractable way?

2. Can we construct optimal trees in a practically tractable way in which we use hyperplane splits to partition the space?

3. Do optimal trees with either parallel or hyperplane splits lead to significantly improved accuracy compared to CART?

4. How does the accuracy of optimal trees compare to random forests and boosted trees?

5. Can we construct optimal regression trees in a practically tractable way in which we use either parallel or hyperplane splits and the prediction in each final partition (a leaf in the tree) is an affine function of the features, similar to linear regression?

6. Decision trees have been used only for prediction. Can we develop optimal classification and regression trees for prescription?

7. Most importantly, do optimal trees provide a significant advantage in real world settings?

Motivated by these questions we present in this thesis a theory for constructing, in a practically tractable way, optimal trees for prediction and prescription. In this effort, we are guided by the words of George Dantzig [39], who in the opening sentence of his book “Linear Programming and Extensions” writes

30 The final test of any theory is its capacity to solve the problems which originated it.

Specifically, in the following chapters we show

1. Optimal or near-optimal trees for classification and regression with either parallel or hyperplane splits can be constructed in a way that is both theoretically and practically tractable.

2. Optimal trees with parallel splits improve the prediction accuracy of CART, while still maintaining interpretability.

3. Optimal trees with hyperplane splits significantly improve the prediction accuracy of CART. We find that the accuracy of optimal trees with hyperplane splits is comparable to random forests and boosted trees. We feel this is materially relevant, since with one interpretable tree, we match or improve upon methods with state-of-the-art accuracy that require a large collection of trees, which are not interpretable.

4. We construct optimal prescription trees that advance machine learning in the direction of optimal decisions rather than predictions.

5. We provide evidence from real world applications that our methods provide significant advantages in accuracy or interpretability or both.

Our overall philosophy is to emphasize practical tractability and impact on real-world applications.

1.6 Structure of the Thesis

A chapter by chapter description of the thesis is as follows.
Chapter 2: Introduces optimal classification trees using parallel splits, provides solutions derived both using MIO and local improvement methods and presents results on accuracy in both synthetic and real world data sets.
Chapter 3: Discusses optimal classification trees using hyperplane splits and emphasizes how the method compares with random forests and boosted trees using both real and synthetic data sets.
Chapter 4: Contains optimal regression trees with constant predictions, where the prediction in each leaf of the tree is the average of all the values of the dependent variable among all data points that are included in the leaf of the tree, and compares how this approach improves upon the CART methodology using real and synthetic data.
Chapter 5: Deals with optimal regression trees with linear predictions, where the prediction in each leaf of the tree comes from a linear regression involving all the points that are included in the leaf, and presents evidence that they lead to significantly improved accuracy.
Chapter 6: Includes a discussion of optimal prescription trees that are generalizations of the prediction trees we presented in earlier chapters and aim to construct trees that lead to optimal decisions, and provides several examples that suggest that prescription trees provide an edge in taking optimal decisions.
Chapter 7: Provides a real world example of using optimal classification trees to diagnose head trauma that illustrates the edge these methods have over alternatives.
Chapter 8: Summarizes the key messages of the thesis and includes some closing thoughts.

Notation

Throughout the thesis, boldfaced lowercase letters (x, y, . . .) denote vectors, boldfaced capital letters (X, Y, . . .) denote matrices, and ordinary lowercase letters (푥, 푦) denote scalars. Calligraphic type (풫, 풮, . . .) denotes sets. The notation [푛] is used as shorthand for the set {1, . . . , 푛}.

Chapter 2

Optimal Classification Trees with Parallel Splits

In this chapter, we demonstrate that formulating the decision tree problem using MIO leads to a tractable approach and delivers practical solutions that significantly outperform classical approaches. We summarize the main results of this chapter below:

1. We present a novel formulation of the classical decision tree problem as an MIO problem that motivates our new classification method, Optimal Classification Trees (OCT).

2. Using a suite of tests with synthetic data comparing OCT against CART, we demonstrate that solving the decision tree problem to optimality yields trees that better reflect the ground truth in the data, refuting the belief that such optimal methods will simply overfit to the training set and not generalize well.

3. We demonstrate that our OCT method outperforms classical decision tree methods in practical applications. We comprehensively benchmark OCT against the state-of-the-art CART on a sample of 60 datasets from the UCI Machine Learning Repository. We show that OCT yields higher out-of-sample accuracy than CART, with average improvements of 1–2% over CART across all datasets,

depending on the depth of tree used, and that this difference is statistically significant at all depths.

4. To provide guidance to machine learning practitioners, we present a simple decision rule that predicts when OCT will offer the largest accuracy improvements over CART. If the number of points is small or the number of features is large, this rule is satisfied, indicating the problem has high difficulty. Across our sample of datasets where this rule is satisfied, OCT improves upon CART by an average of 2.59%, otherwise the mean improvement is 0.26%.

We note that there have been efforts in the past to apply MIO methods to classification problems. CRIO (Classification and Regression via Integer Optimization), proposed by Bertsimas and Shioda [16], uses MIO to partition and classify the data points. CRIO was not able to solve moderately-sized classification problems to provable optimality, and the practical improvement over CART was not significant. In contrast, in this thesis we present a very different MIO approach based around solving the same problem that CART seeks to solve, and this approach provides material improvement over CART for a variety of datasets.

2.1 Review of Classification Tree Methods

We are given the training data (X, y), containing 푛 observations (x푖, 푦푖), 푖 ∈ [푛], each with 푝 features x푖 ∈ R^푝 and a class label 푦푖 ∈ [퐾] indicating which of 퐾 possible labels is assigned to this point. We assume without loss of generality that the values for each dimension across the training data are normalized to the 0–1 interval, meaning each x푖 ∈ [0, 1]^푝. Decision tree methods seek to recursively partition [0, 1]^푝 to yield a number of hierarchical, disjoint regions that represent a classification tree. An example of a decision tree is shown in Figure 2-1. The final tree is comprised of branch nodes and leaf nodes:

∙ Branch nodes apply a split with parameters a and 푏. For a given point 푖, if

aᵀx푖 < 푏, the point will follow the left branch from the node, otherwise it takes the right branch. Most classical methods, including CART, produce univariate or axis-aligned decision trees which restrict the split to a single dimension, i.e., a single component of a will be 1 and all others will be 0.

∙ Leaf nodes are assigned a class that will determine the prediction for all data points that fall into the leaf node. The assigned class is usually taken to be the most-common class among points contained in the leaf node.

Figure 2-1: An example of a decision tree with two branch nodes, A and B, and three leaf nodes, 1, 2 and 3.

The root node A applies the split aᵀ퐴x푖 < 푏퐴: points satisfying it follow the left branch to branch node B, which applies aᵀ퐵x푖 < 푏퐵 and routes points to leaf 2 (if satisfied) or leaf 3 (otherwise), while points failing the split at A follow the right branch to leaf 1.
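To make the routing between branch and leaf nodes concrete, here is a minimal sketch of how a tree such as the one in Figure 2-1 produces a prediction; the nested-dictionary representation and the split values are ours, chosen purely for illustration:

    def predict(node, x):
        """Route a feature vector x down the tree until a leaf is reached.
        Branch nodes are dicts with keys "a", "b", "left", "right"; leaves have "label"."""
        while "label" not in node:
            # a'x < b sends the point down the left branch
            if sum(a_j * x_j for a_j, x_j in zip(node["a"], x)) < node["b"]:
                node = node["left"]
            else:
                node = node["right"]
        return node["label"]

    # The shape of the tree in Figure 2-1, with made-up axis-aligned splits on two features:
    tree = {
        "a": [1, 0], "b": 0.5,                        # node A splits on the first feature
        "left": {
            "a": [0, 1], "b": 0.3,                    # node B splits on the second feature
            "left": {"label": 2}, "right": {"label": 3},
        },
        "right": {"label": 1},
    }
    print(predict(tree, [0.2, 0.7]))  # first feature < 0.5 and second >= 0.3, so leaf 3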

Classical decision tree methods like CART, ID3, and C4.5 take a top-down approach to building the tree. At each step of the partitioning process, they seek to find a split that will partition the current region in such a way as to maximize a so-called splitting criterion. This criterion is often based on the label impurity of the data points contained in the resulting regions instead of minimizing the resulting misclassification error, which as discussed earlier is a byproduct of using a top-down induction process to grow the tree. The algorithm proceeds to recursively partition the two new regions that are created by the hyperplane split. The partitioning terminates once any one of a number of stopping criteria is met. The criteria for CART are as follows:

∙ It is not possible to create a split where each side of the partition has at least

a certain number of points, 푁min;

35 ∙ All points in the candidate node share the same class.

Once the splitting process is complete, a class label 푐 ∈ [퐾] is assigned to each region. This class will be used to predict the class of any points contained inside the region. As mentioned earlier, this assigned class will typically be the most common class among the points in the region. The final step in the process is pruning the tree in an attempt to avoid overfitting. The pruning process works upwards through the partition nodes from the bottom of the tree. The decision of whether to prune a node is controlled by the so-called complexity parameter, denoted by 훼, which balances the additional complexity of adding the split at the node against the increase in predictive accuracy that it offers. A higher complexity parameter leads to more and more nodes being pruned off, resulting in smaller trees. As discussed previously, this two-stage growing and pruning procedure for creating the tree is required when using a top-down induction approach, as otherwise the penalties on tree complexity may prevent the method from selecting a weaker split that then allows a selection of a stronger split in the next stage. In this sense, the strong split is hidden behind the weak split, and this strong split may be passed over if the growing-then-pruning approach is not used. An example of this phenomenon is shown in Figure 2-2.

Using the details of the CART procedure, we can state the problem that CART attempts to solve as a formal optimization problem. There are two parameters in this problem. The tradeoff between accuracy and complexity of the tree is controlled by the complexity parameter 훼, and the minimum number of points we require in any leaf node is given by 푁min. Given these parameters and the training data (x푖, 푦푖), 푖 ∈ [푛], we seek a tree T that solves the problem:

    min   푅(T) + 훼|T|                                    (2.1)
    s.t.  푁(ℓ) ≥ 푁min    ∀ℓ ∈ leaves(T),

where 푅(T) is the misclassification error of the tree T on the training data, |T| is the number of branch nodes in tree T, and 푁(ℓ) is the number of training points

Figure 2-2: An example of a strong split being hidden behind a weak split, motivating the grow-then-prune approach to tree construction. The data in Figure 2-2a is generated according to a simple AND boolean rule: points in the positive quadrant have label ∙, otherwise they are labeled ×. Figures 2-2b–2-2d show three possible classification trees for this data, in increasing order of tree complexity, showing the growth of the trees according to a top-down induction. The complexity 0 tree has no splits and simply predicts the majority class, with an error of 25%. Adding the single best split gives the complexity 1 tree, but does not improve the error rate. Adding the best second split gives the complexity 2 tree, which encodes the AND relationship exactly and reduces the error to 0%. The complexity 1 tree does not improve upon the error of the complexity 0 tree, and so this split will not overcome any penalty from the complexity parameter and thus does not lower the overall objective. This causes the induction process to terminate with the complexity 0 tree as the best solution, despite the accuracy of the complexity 2 tree. The second split is only valuable after the first split has been applied, and therefore is hidden behind the first split and cannot be reached by the greedy induction process. Growing and then pruning the tree in a two-stage process mitigates this limitation, as the tree is grown to full depth before considering whether the splits actually decrease the overall objective during pruning.

(a) Data; (b) Comp. 0: 25% error; (c) Comp. 1: 25% error; (d) Comp. 2: 0% error. The panels show the 푥1–푥2 plane and the candidate trees splitting on 푥1 < 0 and 푥2 < 0.

contained in leaf node ℓ. We refer to this problem as the optimal tree problem. Notice that if we can solve this problem in a single step we obviate the need to use an impurity measure when growing the tree, and also remove the need to prune the tree after creation, as we have already accounted for the complexity penalty while growing the tree.

We briefly note that our choice to use CART to define the optimal tree problem was arbitrary, and one could similarly define this problem based on another method like C4.5; we simply use this problem to demonstrate the advantages of taking a problem that is traditionally solved by a heuristic and instead solving it to optimality. Additionally, we note that the experiments of [90] found that CART and C4.5 did not differ significantly in any measure of tree quality, including out-of-sample accuracy, and so we do not believe our choice of CART over C4.5 to be an important one.
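For reference, the objective and constraint of problem (2.1) are cheap to evaluate for any candidate tree. A small sketch, assuming a nested-dictionary tree representation in which branch nodes carry "a", "b", "left" and "right" and leaves carry a "label" (as in the earlier prediction sketch):

    def tree_objective(tree, X, y, alpha):
        """Evaluate R(T) + alpha*|T|, where R(T) is the fraction of misclassified
        training points and |T| is the number of branch nodes."""
        def route(node, x):
            while "label" not in node:
                go_left = sum(a * v for a, v in zip(node["a"], x)) < node["b"]
                node = node["left"] if go_left else node["right"]
            return node["label"]

        def num_splits(node):
            return 0 if "label" in node else 1 + num_splits(node["left"]) + num_splits(node["right"])

        errors = sum(route(tree, x) != label for x, label in zip(X, y))
        return errors / len(y) + alpha * num_splits(tree)

Solving the optimal tree problem amounts to searching over all tree structures and split values for the one minimizing this quantity, subject to every leaf containing at least 푁min training points.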

2.2 A Mixed-Integer Optimization Approach

As mentioned previously, the top-down, greedy nature of state-of-the-art decision tree creation algorithms can lead to solutions that are only locally optimal. In this section, we first argue that the natural way to pose the task of creating the globally optimal decision tree is as an MIO problem, and then proceed to develop such a formulation. To see that the most natural representation for formulating the optimal decision tree problem (2.1) is using MIO, we note that at every step in tree creation, we are required to make a number of discrete decisions:

∙ At every new node, we must choose to either branch or stop.

∙ After choosing to stop branching at a node, we must choose a label to assign to this new leaf node.

∙ After choosing to branch, we must choose which of the variables to branch on.

∙ When classifying the training points according to the tree under construction, we must choose to which leaf node a point will be assigned such that the structure of the tree is respected.

Figure 2-3: The maximal tree for depth 퐷 = 2, with branch nodes 1–3 applying splits of the form a_푡ᵀx_푖 < 푏_푡 and leaf nodes 4–7. Branch nodes are red and leaf nodes are blue.

Formulating this problem using MIO allows us to model all of these discrete decisions in a single problem, as opposed to recursive, top-down methods that must consider these decision events in isolation. Modeling the construction process in this way allows us to consider the full impact of the decisions being made at the top of the tree, rather than simply making a series of locally optimal decisions, also avoiding the need for pruning and impurity measures.

We will next formulate the optimal tree creation problem (2.1) as an MIO problem. Consider the problem of trying to construct an optimal decision tree with a maximum depth of 퐷. Given this depth, we can construct the maximal tree of this depth with 푇 = 2^(퐷+1) − 1 nodes, which we index by 푡 ∈ [푇]. Figure 2-3 shows the maximal tree of depth 2.

We use the notation 푝(푡) to refer to the parent node of node 푡, and 풜(푡) to denote the set of ancestors of node 푡. We also define ℒ(푡) as the set of ancestors of 푡 whose left branch has been followed on the path from the root node to 푡, and similarly ℛ(푡) is the set of right-branch ancestors, such that 풜(푡) = ℒ(푡) ∪ ℛ(푡). For example, in

the tree in Figure 2-3, ℒ(5) = {1}, ℛ(5) = {2}, and 풜(5) = {1, 2}.
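As an illustration, these index sets can be computed directly from the node index 푡 under the numbering of Figure 2-3, where node 푡 has children 2푡 and 2푡 + 1. The following is a minimal Julia sketch; the helper names are illustrative and not part of any package.

# Parent of node t in the maximal tree, where node t has children 2t and 2t + 1.
parent_node(t::Int) = t ÷ 2

# Ancestors of node t, split into left-branch ancestors ℒ(t) and right-branch
# ancestors ℛ(t); their union is 풜(t).
function ancestor_sets(t::Int)
    left, right = Int[], Int[]
    while t > 1
        if iseven(t)                      # t is a left child of its parent
            push!(left, parent_node(t))
        else                              # t is a right child of its parent
            push!(right, parent_node(t))
        end
        t = parent_node(t)
    end
    return left, right
end

# For the depth-2 tree in Figure 2-3, ancestor_sets(5) returns ([1], [2]),
# matching ℒ(5) = {1} and ℛ(5) = {2}.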

We divide the nodes in the tree into two sets:

Branch nodes: Nodes 푡 ∈ 풯_퐵 = {1, . . . , ⌊푇/2⌋} apply a split of the form aᵀx < 푏. Points that satisfy this split follow the left branch in the tree, and those that do not follow the right branch.

Leaf nodes: Nodes 푡 ∈ 풯_퐿 = {⌊푇/2⌋ + 1, . . . , 푇} make a class prediction for each point that falls into the leaf node.

We track the split applied at node 푡 ∈ 풯_퐵 with variables a_푡 ∈ ℝ^푝 and 푏_푡 ∈ ℝ. In this chapter, we restrict our model to univariate decision trees (like CART), and so the hyperplane split at each node should only involve a single variable. This is enforced by setting the elements of a_푡 to be binary variables that sum to 1. We want to allow the option of not splitting at a branch node. We use the binary variables

푑푡 to track which branch nodes apply splits, where 푑푡 = 1 if node 푡 applies a split, and 푑푡 = 0 otherwise. If a branch node does not apply a split, then we model this by setting a푡 = 0 and 푏푡 = 0. This has the effect of forcing all points to follow the right split at this node, since the condition for the left split is 0 < 0 which is never satisfied. This allows us to stop growing the tree early without introducing new variables to account for points ending up at the branch node—instead we send them all the same direction down the tree to end up in the same leaf node. We enforce this with the following constraints:

∑_{푗=1}^{푝} 푎_{푗푡} = 푑_푡,    ∀푡 ∈ 풯_퐵,    (2.2)

0 ≤ 푏푡 ≤ 푑푡, ∀푡 ∈ 풯퐵, (2.3)

푎푗푡 ∈ {0, 1}, ∀푗 ∈ [푝], 푡 ∈ 풯퐵, (2.4)

where the second inequality is valid for 푏_푡, since we have assumed that each x_푖 ∈ [0, 1]^푝, and we know that a_푡 has one element that is 1 if and only if 푑_푡 = 1, with the remainder being 0. Therefore, it is always true that 0 ≤ a_푡ᵀx_푖 ≤ 푑_푡 for any 푖 and 푡, and we need only consider values for 푏_푡 in this same range.

Next, we will enforce the hierarchical structure of the tree. We restrict a branch node from applying a split if its parent does not also apply a split.

푑푡 ≤ 푑푝(푡), ∀푡 ∈ 풯퐵 ∖ {1}, (2.5)

where no such constraint is required for the root node.

We have constructed the variables that allow us to model the tree structure using MIO; we now need to track the allocation of points to leaves and the associated errors that are induced by this structure.

We introduce binary variables 푧푖푡 to track the points assigned to each leaf node, where 푧푖푡 = 1 if point 푖 is in node 푡, and 푧푖푡 = 0 otherwise. We also introduce binary variables 푙푡, where 푙푡 = 1 if leaf 푡 contains any points, and 푙푡 = 0 otherwise. We use these binary variables together to enforce a minimum number of points at each leaf, given by 푁min:

푧_{푖푡} ≤ 푙_푡,    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿,    (2.6)
∑_{푖=1}^{푛} 푧_{푖푡} ≥ 푁_min 푙_푡,    ∀푡 ∈ 풯_퐿.    (2.7)

We also force each point to be assigned to exactly one leaf:

∑_{푡∈풯_퐿} 푧_{푖푡} = 1,    ∀푖 ∈ [푛].    (2.8)

Finally, we apply constraints enforcing the splits that are required by the structure of the tree when assigning points to leaves:

a_푚ᵀx_푖 < 푏_푚 + 푀_1(1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℒ(푡),    (2.9)

a_푚ᵀx_푖 ≥ 푏_푚 − 푀_2(1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℛ(푡),    (2.10)

where 푀1 and 푀2 are sufficiently large constants such that the constraints are always

satisfied when 푧푖푡 = 0. We will discuss choosing values for these constants shortly, but first note that the constraints (2.9) use a strict inequality. This is not supported by MIO solvers, and so it must be converted into a form that does not use a strict inequality. To do this we can add a small constant 휖 to the left-hand-side of (2.9) and

change the inequality to be non-strict:

a_푚ᵀx_푖 + 휖 ≤ 푏_푚 + 푀_1(1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℒ(푡).    (2.11)

However, if 휖 is too small, this could cause numerical instabilities in the MIO solver, so we seek to make 휖 as big as possible without affecting the feasibility of any

valid solution to the problem. We can achieve this by specifying a different 휖_푗 for each feature 푗. The largest valid value is the smallest non-zero distance between adjacent values of this feature. To find this, we sort the values of the 푗th feature and take

휖_푗 = min{ 푥_푗^{(푖+1)} − 푥_푗^{(푖)} : 푥_푗^{(푖+1)} ≠ 푥_푗^{(푖)}, 푖 ∈ [푛 − 1] },

where 푥_푗^{(푖)} is the 푖th largest value in the 푗th feature. We can then use these values

for 휖 in the constraint, where the value of 휖푗 that is used is selected according to the feature we are using for this split:

a_푚ᵀ(x_푖 + 휖) ≤ 푏_푚 + 푀_1(1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℒ(푡).    (2.12)
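A minimal Julia sketch of this computation, with illustrative names, assuming the training data is stored as an 푛 × 푝 matrix X scaled to [0, 1] as above:

# Compute ε_j for each feature j: the smallest non-zero gap between adjacent
# sorted values of that feature, along with ε_max used for the big-M constant.
function feature_epsilons(X::AbstractMatrix{<:Real})
    n, p = size(X)
    eps = ones(Float64, p)                   # fallback of 1 for a constant feature (an assumption)
    for j in 1:p
        v = sort(X[:, j])
        gaps = [v[i+1] - v[i] for i in 1:n-1 if v[i+1] != v[i]]
        isempty(gaps) || (eps[j] = minimum(gaps))
    end
    return eps, maximum(eps)                 # (ε_1, ..., ε_p), ε_max
end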

We must also specify values for the big-푀 constants 푀_1 and 푀_2. As mentioned previously, we know that both a_푡ᵀx_푖 ∈ [0, 1] and 푏_푡 ∈ [0, 1], and so the largest possible value of a_푡ᵀ(x_푖 + 휖) − 푏_푡 is 1 + 휖_max, where 휖_max = max_푗{휖_푗}. We can therefore set 푀_1 = 1 + 휖_max. Similarly, the largest possible value of 푏_푡 − a_푡ᵀx_푖 is 1, and

so we can set 푀2 = 1. This gives the following final constraints that will enforce the splits in the tree:

a_푚ᵀ(x_푖 + 휖) ≤ 푏_푚 + (1 + 휖_max)(1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℒ(푡),    (2.13)

a_푚ᵀx_푖 ≥ 푏_푚 − (1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℛ(푡).    (2.14)

The objective is to minimize the misclassification error, so an incorrect label pre-

diction has cost 1, and a correct label prediction has cost 0. We set 푁푘푡 to be the

number of points of label 푘 in node 푡, and 푁푡 to be the total number of points in node

푡:

푁_{푘푡} = ∑_{푖: 푦_푖=푘} 푧_{푖푡},    ∀푡 ∈ 풯_퐿, 푘 ∈ [퐾],    (2.15)
푁_푡 = ∑_{푖=1}^{푛} 푧_{푖푡},    ∀푡 ∈ 풯_퐿.    (2.16)

We need to assign a label to each leaf node 푡 in the tree, which we denote with

푐푡 ∈ {1, . . . , 퐾}. It is clear that the optimal label to predict is the most common of the labels among all points assigned to the node:

푐_푡 = arg max_{푘∈[퐾]} {푁_{푘푡}}.    (2.17)

We will use binary variables 푐푘푡 to track the prediction of each node, where 푐푘푡 =

1{푐_푡 = 푘}. We must make a single class prediction at each leaf node that contains points:

∑_{푘=1}^{퐾} 푐_{푘푡} = 푙_푡,    ∀푡 ∈ 풯_퐿.    (2.18)

Since we know how to make the optimal prediction at each leaf 푡 using (2.17),

the optimal misclassification loss in each node, denoted 퐿_푡, is equal to the number of points in the node less the number of points of the most common label:

퐿_푡 = 푁_푡 − max_{푘∈[퐾]}{푁_{푘푡}} = min_{푘∈[퐾]}{푁_푡 − 푁_{푘푡}},    (2.19)

which can be linearized to give

퐿푡 ≥ 푁푡 − 푁푘푡 − 푀(1 − 푐푘푡), ∀푡 ∈ 풯퐿, 푘 ∈ [퐾], (2.20)

퐿푡 ≤ 푁푡 − 푁푘푡 + 푀푐푘푡, ∀푡 ∈ 풯퐿, 푘 ∈ [퐾], (2.21)

퐿푡 ≥ 0, ∀푡 ∈ 풯퐿, (2.22) where again 푀 is a sufficiently large constant that makes the constraint inactive depending on the value of 푐푘푡. Here, we can take 푀 = 푛 as a valid value.

The total misclassification cost is therefore ∑_{푡∈풯_퐿} 퐿_푡, and the complexity 퐶 of the tree is the number of splits included in the tree, given by 퐶 = ∑_{푡∈풯_퐵} 푑_푡. Following CART, we normalize the misclassification against the baseline accuracy, 퐿̂, obtained by simply predicting the most popular class for the entire dataset. This makes the effect of 훼 independent of the dataset size. This means the objective from problem (2.1) can be written:

(1/퐿̂) ∑_{푡∈풯_퐿} 퐿_푡 + 훼 · 퐶.    (2.23)

Putting all of this together gives the following MIO formulation for problem (2.1), which we call the Optimal Classification Trees (OCT) model:

min    (1/퐿̂) ∑_{푡∈풯_퐿} 퐿_푡 + 훼 · 퐶                                            (2.24)
s.t.   퐿_푡 ≥ 푁_푡 − 푁_{푘푡} − 푛(1 − 푐_{푘푡}),    ∀푡 ∈ 풯_퐿, 푘 ∈ [퐾],
       퐿_푡 ≤ 푁_푡 − 푁_{푘푡} + 푛푐_{푘푡},    ∀푡 ∈ 풯_퐿, 푘 ∈ [퐾],
       퐿_푡 ≥ 0,    ∀푡 ∈ 풯_퐿,
       푁_{푘푡} = ∑_{푖: 푦_푖=푘} 푧_{푖푡},    ∀푡 ∈ 풯_퐿, 푘 ∈ [퐾],
       푁_푡 = ∑_{푖=1}^{푛} 푧_{푖푡},    ∀푡 ∈ 풯_퐿,
       ∑_{푘=1}^{퐾} 푐_{푘푡} = 푙_푡,    ∀푡 ∈ 풯_퐿,
       퐶 = ∑_{푡∈풯_퐵} 푑_푡,
       a_푚ᵀx_푖 ≥ 푏_푚 − (1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℛ(푡),
       a_푚ᵀ(x_푖 + 휖) ≤ 푏_푚 + (1 + 휖_max)(1 − 푧_{푖푡}),    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿, 푚 ∈ ℒ(푡),
       ∑_{푡∈풯_퐿} 푧_{푖푡} = 1,    ∀푖 ∈ [푛],
       푧_{푖푡} ≤ 푙_푡,    ∀푖 ∈ [푛], 푡 ∈ 풯_퐿,
       ∑_{푖=1}^{푛} 푧_{푖푡} ≥ 푁_min 푙_푡,    ∀푡 ∈ 풯_퐿,
       ∑_{푗=1}^{푝} 푎_{푗푡} = 푑_푡,    ∀푡 ∈ 풯_퐵,
       0 ≤ 푏_푡 ≤ 푑_푡,    ∀푡 ∈ 풯_퐵,
       푑_푡 ≤ 푑_{푝(푡)},    ∀푡 ∈ 풯_퐵 ∖ {1},
       푧_{푖푡}, 푙_푡, 푐_{푘푡} ∈ {0, 1},    ∀푖 ∈ [푛], 푘 ∈ [퐾], 푡 ∈ 풯_퐿,
       푎_{푗푡}, 푑_푡 ∈ {0, 1},    ∀푗 ∈ [푝], 푡 ∈ 풯_퐵.

This model as presented is in a form that can be directly solved by any MIO solver.
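To illustrate, a condensed sketch of formulation (2.24) in the JuMP modeling language for Julia is shown below. This is only an illustrative reconstruction, not the implementation used in this thesis: the optimizer is passed in abstractly, the helpers feature_epsilons and ancestor_sets are the sketches given earlier in this section, and the remaining names simply mirror the notation above.

using JuMP

# Sketch of the OCT formulation (2.24); X is n × p in [0, 1], y has labels 1..K.
function build_oct(X, y, K, D, Nmin, α, optimizer)
    n, p = size(X)
    T  = 2^(D + 1) - 1
    TB = 1:(T ÷ 2)                                   # branch nodes
    TL = (T ÷ 2 + 1):T                               # leaf nodes
    eps, epsmax = feature_epsilons(X)                # per-feature ε, ε_max
    Lhat = n - maximum(count(==(k), y) for k in 1:K) # baseline (majority-class) error
    AL = Dict(t => ancestor_sets(t)[1] for t in TL)  # left-branch ancestors ℒ(t)
    AR = Dict(t => ancestor_sets(t)[2] for t in TL)  # right-branch ancestors ℛ(t)

    m = Model(optimizer)
    @variable(m, a[1:p, TB], Bin)
    @variable(m, 0 <= b[TB] <= 1)
    @variable(m, d[TB], Bin)
    @variable(m, z[1:n, TL], Bin)
    @variable(m, l[TL], Bin)
    @variable(m, c[1:K, TL], Bin)
    @variable(m, Nkt[1:K, TL] >= 0)
    @variable(m, Nt[TL] >= 0)
    @variable(m, Lt[TL] >= 0)

    @constraint(m, [t in TB], sum(a[j, t] for j in 1:p) == d[t])            # (2.2)
    @constraint(m, [t in TB], b[t] <= d[t])                                 # (2.3)
    @constraint(m, [t in TB; t > 1], d[t] <= d[t ÷ 2])                      # (2.5)
    @constraint(m, [i in 1:n], sum(z[i, t] for t in TL) == 1)               # (2.8)
    @constraint(m, [i in 1:n, t in TL], z[i, t] <= l[t])                    # (2.6)
    @constraint(m, [t in TL], sum(z[i, t] for i in 1:n) >= Nmin * l[t])     # (2.7)
    @constraint(m, [t in TL], sum(c[k, t] for k in 1:K) == l[t])            # (2.18)
    @constraint(m, [k in 1:K, t in TL],
        Nkt[k, t] == sum(z[i, t] for i in 1:n if y[i] == k))                # (2.15)
    @constraint(m, [t in TL], Nt[t] == sum(z[i, t] for i in 1:n))           # (2.16)
    @constraint(m, [k in 1:K, t in TL],
        Lt[t] >= Nt[t] - Nkt[k, t] - n * (1 - c[k, t]))                     # (2.20)
    @constraint(m, [k in 1:K, t in TL],
        Lt[t] <= Nt[t] - Nkt[k, t] + n * c[k, t])                           # (2.21)
    @constraint(m, [i in 1:n, t in TL, mm in AL[t]],                        # (2.13)
        sum(a[j, mm] * (X[i, j] + eps[j]) for j in 1:p)
            <= b[mm] + (1 + epsmax) * (1 - z[i, t]))
    @constraint(m, [i in 1:n, t in TL, mm in AR[t]],                        # (2.14)
        sum(a[j, mm] * X[i, j] for j in 1:p) >= b[mm] - (1 - z[i, t]))

    @objective(m, Min,
        sum(Lt[t] for t in TL) / Lhat + α * sum(d[t] for t in TB))
    return m
end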

The difficulty of the model is primarily the number of binary variables 푧_{푖푡}, which is 푛 · 2^퐷. Empirically we observe that we can find high-quality solutions in minutes for depths up to 4 for datasets with thousands of points. Beyond this depth or dataset size, the rate of finding solutions is slower, and more time is required. There are three hyperparameters that need to be specified: the maximum depth

퐷, the minimum leaf size 푁min, and the complexity parameter 훼.

Improving MIO Performance using Warm Starts

MIO solvers benefit greatly when supplied an integer-feasible solution as a warm start for the solution process. Injecting a strong warm start solution before starting the solver greatly increases the speed with which the solver is able to generate strong feasible solutions [17]. The warm start provides a strong initial upper bound on the optimal solution that allows more of the search tree to be pruned, and it also provides a starting point for local search heuristics. The benefit realized increases with the quality of the warm start, so it is desirable to be able to quickly and heuristically find a strong integer-feasible solution before solving. We already have access to good heuristic methods for the MIO problem (2.24); we can use CART to generate these warm start solutions. Given a solution from CART, it is simple to construct a corresponding feasible solution to the MIO problem (2.24) using the splits from CART to infer the values of the remaining variables in the MIO problem.
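For example, with a JuMP model like the sketch above, the split variables of a heuristic tree can be injected through JuMP's set_start_value; the array names A0, b0, d0 and the function name below are illustrative, and many MIO solvers will attempt to complete such a partial start on their own.

using JuMP

# Inject the splits of a heuristic (e.g. CART) tree as a warm start, assuming a
# model `m` built as in the sketch above and a heuristic solution given as:
#   A0[j, t] ∈ {0, 1}, b0[t] ∈ [0, 1], d0[t] ∈ {0, 1} for each branch node t.
function set_split_warmstart!(m, A0, b0, d0)
    a, b, d = m[:a], m[:b], m[:d]
    p, nbranch = size(A0)
    for t in 1:nbranch
        set_start_value(b[t], b0[t])
        set_start_value(d[t], d0[t])
        for j in 1:p
            set_start_value(a[j, t], A0[j, t])
        end
    end
    return m
end

# The remaining start values (z, l, c, N, L) can either be left for the solver to
# complete or filled in by routing the training points through the given splits.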

Figure 2-4: Comparison of upper and lower bound evolution while solving MIO problem (2.24) with and without warm starts for a tree of depth 퐷 = 2 for the Wine dataset with 푛 = 178 and 푝 = 13. [Panels: (a) without warm start; (b) with warm start; each plots the upper and lower bounds on the objective against time in seconds.]

By design, the parameters 푁min and 훼 have the same meaning in CART and in problem (2.24), so we can run CART with these parameters to generate a good solution. We must then prune the splits on the CART solution until it is below the maximum depth 퐷 for the MIO problem to ensure it is feasible.

We can also use MIO solutions that we have previously generated as warm starts. In particular, if we have a solution generated for a depth 퐷, this solution is a valid warm start for depth 퐷+1. This is important because the problem difficulty increases with the depth, so for larger depths it may be advantageous to run the MIO problem with a smaller depth to generate a strong warm start.

To demonstrate the effectiveness of warm starts, Figure 2-4 shows an example of the typical evolution of upper and lower bounds as we solve the MIO problem (2.24) for the optimal tree of depth 2 on the “Wine” dataset, which has 푛 = 178 and 푝 = 13. We see when no warm start is supplied, the upper bound decreases gradually until the eventual optimal solution is found after 270 seconds. An additional 1, 360 seconds are required to prove the optimality of this solution. When we inject a heuristic solution (in this case, the CART solution) as a warm start, the same optimal solution is found after just 105 seconds, and it takes an additional 230 seconds to prove optimality. The total time required to find and prove optimality in this example

decreases by a factor of 5 when the warm start is added, and the time taken to find the optimal solution decreases by a factor of around 2.5, showing that using high-quality warm start solutions can have a significant effect on the MIO solution times. Additionally, we see that the optimal tree solution has an objective with roughly half the error of the CART warm start, showing that the CART solutions can indeed be far from optimality in-sample. Finally, we observe that the majority of the time is spent in proving that the solution is optimal. A proof of optimality is good to have for comparison to other methods, but is not necessarily required in practice when evaluating a classifier (note that other methods make no claims as to their optimality). It is therefore a reasonable option to terminate the solve early once the best solution has remained unchanged for some time, because this typically indicates the solution is indeed optimal or close to it, and proving optimality may take far more time.

2.3 A Local Search Approach

In this section, we investigate and discuss limitations of the MIO-based approach to solving the Optimal Classification Trees problem and use this insight to develop a local-search heuristic for solving the problem that gives better solutions than the MIO approach in a fraction of the time.

Limitations of the MIO approach

The MIO formulation for the Optimal Classification Trees problem as presented in Section 2.2 can be implemented and passed to any mixed-integer optimization solver. However, in our experience the problem is unlikely to solve quickly, and empirically the solutions obtained after letting the solve run for long periods of time can be far from optimality and offer only small improvements over heuristic solutions like those from CART. We illustrate this observation with an example using the “Banknote Authentication” dataset from the UCI Machine Learning Repository [77]. This dataset has

푛 = 1372 points, 푝 = 4 dimensions, and 퐾 = 2 classes, so is roughly of low-to-moderate size and difficulty. We compared the performance of CART and OCT on this dataset, measuring the final training error to understand how well each method can solve Problem (2.1). We solved the OCT MIO problem using the Gurobi solver [56], with the CART solution as a warm start for the solution process, and imposed a two hour time limit for the solver to find improved solutions. Figure 2-5 shows the training accuracy of each method on this dataset as a function of the depth of the tree, as well as the corresponding running time. Recall that the size of the MIO problem increases with the maximum depth of the tree we are learning.

We can see that at depth 1, the solutions were identical, and the MIO approach was able to prove optimality after around 10 seconds. At all other depths, no proof of optimality was obtained, and in fact the best MIO lower bound on the training error did not increase from zero after the two hour time limit, and so we have no indication of how close to optimality these solutions might be. OCT was able to improve upon the CART solution in three of the remaining depths, with the largest improvement of about 3% at depth 3. There were no improvements at depths 4 and 6, and at depth 6, Gurobi did not even progress past the root node and begin the branch and bound process within the two hour time limit, due to the sheer size of the problem. In contrast, the CART solutions were found in under one second, so the MIO approach took 5–6 orders of magnitude longer to find solutions that in some cases offered small-to-moderate improvements, and importantly, offered no guarantee that better performance was impossible. This motivates the question of whether better performance is indeed possible, and also motivates a desire to find solutions that improve upon CART in times that are more competitive with CART.

The key reason the MIO problem is slow to solve is the sheer number of variables and constraints that arise in the formulation. In particular, the number of binary variables grows very fast with the number of training points and the depth of the

tree; specifically, there are 푛 · 2^퐷 binary variables 푧_{푖푡} to track the potential allocation of each point 푖 to each leaf 푡. For example, there are 1372 · 2^6 = 87,808 such binary variables for the largest case in the “Banknote Authentication” example above, which already proves hard to solve in hours. A typical real-world example might have 푛 in the tens or hundreds of thousands and depths larger than 10, giving over 100 million binary variables! Based on experiments with synthetically-generated data, we find that doubling the number of these 푧_{푖푡} variables (by increasing the depth by 1 or by doubling 푛) increases the MIO solve time by roughly an order of magnitude. This makes it unlikely that an MIO-based solution approach would be able to scale to the problem sizes that we would want to solve.

Figure 2-5: Training accuracy (%) and running time for each method on the “Banknote Authentication” dataset. [Plots of training accuracy and running time (s) against maximum depth for CART and OCT (MIO + Gurobi).]

The natural way to resolve this scaling issue is to try to reduce the problem size. One observation we can make is that the “core” decision variables that define a tree are the split variables A = {a_1, . . . , a_{⌊푇/2⌋}} and b = {푏_1, . . . , 푏_{⌊푇/2⌋}}. Once we are given these split variables, we can solve for the values of the remaining variables and evaluate the error in closed form, simply by dividing the points according to the splits and then assigning the labels to each leaf using the majority rule. We can frame this problem as follows:

min_{A,b}   error(A, b, X, y),    (2.25)

where X = {x1,..., x푛} and y = {푦1, . . . , 푦푛} are the training points and their labels, and error is a function that evaluates the objective cost of the splits A and b on the training data X and y.

Immediately, we can see that the number of variables in problem (2.25) is greatly

reduced compared to problem (2.24); we only have to optimize over A ∈ {0, 1}^{푝×⌊푇/2⌋} and b ∈ ℝ^{⌊푇/2⌋}. Most importantly, the number of variables no longer depends on the number of training points 푛. The complication comes from the non-linear function error being embedded in the objective, which means we cannot solve it using standard MIO methods. However, we would expect to solve this problem very fast if we had a way of optimizing this function over the reduced decision space.
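To make the reduced problem concrete, the following Julia sketch evaluates error(A, b, X, y) for a fixed maximal tree: each point is routed down the splits, each leaf predicts its majority label, and the complexity penalty counts the active splits. The function and argument names are illustrative rather than part of any implementation.

using LinearAlgebra

# Evaluate the objective of problem (2.25) for fixed splits A (p × ⌊T/2⌋) and b.
# A non-splitting node has A[:, t] = 0 and b[t] = 0, so every point fails the
# left-branch test (0 < 0) and is sent to the right child.
function split_error(A, b, X, y, K, α, Lhat)
    n, p = size(X)
    nbranch = size(A, 2)
    counts = zeros(Int, K, nbranch + 1)              # class counts per leaf
    for i in 1:n
        t = 1
        while t <= nbranch                           # route point i down the tree
            t = dot(view(A, :, t), view(X, i, :)) < b[t] ? 2t : 2t + 1
        end
        counts[y[i], t - nbranch] += 1
    end
    misclassified = sum(sum(counts[:, ℓ]) - maximum(counts[:, ℓ]) for ℓ in 1:nbranch+1)
    nsplits = count(t -> any(x -> x != 0, view(A, :, t)), 1:nbranch)
    return misclassified / Lhat + α * nsplits        # Lhat is the baseline error
end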

Optimization via local search

We will now present the algorithm we have developed for solving Problem (2.25). This local search procedure iteratively improves and refines the current solution until a local minimum is reached. This procedure is repeated from a number of different starting points with the goal of finding many different local minima in the expectation that the best of these is the global minimum, or very close to it. We begin with some notation and a description of functions used in the presentation of the algorithm:

∙ T푡 denotes the subtree rooted at the 푡th node of a tree T;

∙ X풜 and y풜 denote the subsets of the training data X and y corresponding to any index set 풜;

∙ shuffle(풜) randomizes the order of an index set 풜;

∙ nodes(T) returns a set containing the indices of all nodes in tree T;

∙ children(T) returns the two subtrees rooted at the lower and upper children of the root node of tree T;

∙ minleafsize(T, X, y) returns the minimum number of points contained in any leaf of tree T given data X and y;

∙ loss(T, X, y) is a function that evaluates the objective cost of the tree T on data X and y after the leaf predictions of T have been optimized to fit X and y.

Algorithm 2.1 LocalSearch
Input: Starting decision tree T; training data X, y
Output: Locally optimal decision tree
1: repeat
2:   error_prev ← loss(T, X, y)
3:   for all 푡 ∈ shuffle(nodes(T)) do
4:     ℐ ← {푖 : x_푖 is assigned by T to a leaf contained in T_푡}
5:     T_푡 ← OptimizeNodeParallel(T_푡, X_ℐ, y_ℐ)
6:     replace 푡th node in T with T_푡
7:   error_cur ← loss(T, X, y)
8: until error_prev = error_cur                    ◁ no further improvement possible
9: return T

For Optimal Classification Trees, the loss function corresponds to the objective of Problem (2.24), which we can write as:

loss(T, X, y) = (1/퐿̂) 퐿(T, X, y) + 훼 · complexity(T),    (2.26)

where 퐿 is the number of points among the data X, y that are misclassified by T when the labels on the leaves are chosen according to the majority rule with data X and y, 퐿̂ is the baseline error on the training data, 훼 is the complexity parameter, and complexity(T) is the number of splits in T.

The approach we take considers changing one node in the current solution at a time. To do this, we simply re-optimize the split at this node so that it is locally optimal. We loop through the nodes in a random order, re-optimizing the split at each node, until we have looped through every node in the tree without making any improvement. This tree is now a local minimum for Problem (2.25) with respect to our search procedure, so we stop and begin the procedure again with a new tree. This procedure is detailed in Algorithm 2.1, which depends on the supplementary functions given in Algorithms 2.2 and 2.3.

The local search procedure given in Algorithm 2.1 depends on the function OptimizeNodeParallel in Algorithm 2.2, which re-optimizes the root node of any subtree T to be locally optimal for the given training data X and y. There are three possible actions we take in this procedure to improve the root node. First, we can

consider replacing the current root node with a branch node that is locally optimal for the given training data, which is found using the BestParallelSplit function. The other options we consider are to delete the current node if it is a branch node, and instead use either the lower or upper child in its place. We calculate the new loss after performing each of these three substitutions at the root node, and perform the best of these if the resulting loss is lower than the current loss; otherwise we make no changes to the root.

Algorithm 2.2 OptimizeNodeParallel
Input: Subtree T to optimize; training data X, y
Output: Subtree T with optimized parallel split at root
1: if T is a branch then
2:   T_lower, T_upper ← children(T)
3: else
4:   T_lower, T_upper ← new leaf nodes
5: error_best ← loss(T, X, y)                          ◁ error of current root
6:
7: T_para, error_para ← BestParallelSplit(T_lower, T_upper, X, y)
8: if error_para < error_best then
9:   T, error_best ← T_para, error_para                ◁ replace with parallel split
10: error_lower ← loss(T_lower, X, y)
11: if error_lower < error_best then
12:   T, error_best ← T_lower, error_lower             ◁ replace with lower child
13: error_upper ← loss(T_upper, X, y)
14: if error_upper < error_best then
15:   T, error_best ← T_upper, error_upper             ◁ replace with upper child
16: return T

The function BestParallelSplit in Algorithm 2.3 finds the locally-optimal parallel split to use at any root branch node, given that the structures of the subtrees in the lower and upper children are fixed. To do this, we follow a procedure similar in nature to that used in the CART algorithm to exhaustively search all possible split locations. In each dimension of the data, we sort the points and consider placing the parallel split between each successive pair of points. For each split we consider, we evaluate the loss of tree T with this split at the root and all other splits unchanged. We take the split with the best such loss, which is thus a local optimum for the loss function, since we considered every possible distinct parallel split.

Algorithm 2.3 BestParallelSplit
Input: Lower and upper subtrees T_lower and T_upper to use as children of new split; training data X, y
Output: Subtree with best parallel split at root; error of best tree
1: 푛, 푝 ← size of X
2: error_best ← ∞
3: for 푗 = 1, . . . , 푝 do                               ◁ loop over all dimensions
4:   values ← {푋_푖푗 : 푖 = 1, . . . , 푛}
5:   sort values in ascending order
6:   for 푖 = 1, . . . , 푛 − 1 do                         ◁ loop over all split placements
7:     푏 ← (values_푖 + values_{푖+1})/2
8:     T ← branch node e_푗ᵀx < 푏 with children T_lower, T_upper
9:     if minleafsize(T) ≥ 푁_min then                   ◁ check feasibility
10:      error ← loss(T, X, y)
11:      if error < error_best then                     ◁ save split if better
12:        error_best ← error
13:        T_best ← T
14: return T_best, error_best

As mentioned earlier, we repeat the local search procedure for a number of starting trees, and take the tree with the best loss as the final solution. This allows us to better explore the search space for the global minimum and avoid getting trapped in local minima. For the starting trees, we generate trees using a similar method as for Random Forests [29]. Namely, the trees are generated as normal for CART, except at each stage of the tree growing process we are only allowed to consider a random subset of the features to split on. We have found that √푝 is a good size for this random subset, similar to Random Forests. Note that unlike Random Forests, we do not take a bootstrap sample of the training points, to ensure that the trees we generate satisfy the minimum leaf size when applied to the original training set, and thus are feasible starting points. This approach using Random Forests to generate starting points works well, since this produces trees that are of high quality individually, yet there is still significant variation across the set of trees generated from this process. CART, as a deterministic algorithm, would not be well-suited for this task. We conclude by observing that the local search for each starting solution is independent of all other starting points, and so the search process is trivially parallelizable

across the starting solutions. This means the speed increases approximately linearly with the number of cores available, allowing for very large speedups due to parallelization.
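A minimal Julia sketch of this parallelization, assuming hypothetical routines random_starting_tree and local_search that implement the starting-tree generation and Algorithm 2.1 described above, with local_search assumed to return a tree and its loss:

using Base.Threads

# Run R independent restarts of the local search in parallel, keeping the best.
function best_of_restarts(X, y, R)
    trees  = Vector{Any}(undef, R)
    losses = fill(Inf, R)
    @threads for r in 1:R                        # restarts are fully independent
        T0 = random_starting_tree(X, y)          # randomized CART-style start
        trees[r], losses[r] = local_search(T0, X, y)   # Algorithm 2.1
    end
    return trees[argmin(losses)]
end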

Complexity analysis of local search

We will now analyze the complexity of the local search procedure and show that it runs efficiently even in the worst case, and is not significantly more expensive than classical heuristics for parallel decision tree construction like CART. First, we observe that we require the points in sorted order for each feature in the BestParallelSplit function in Algorithm 2.3. This order does not change for the duration of the local search procedure, so we can simply presort the data in each of the 푝 features for a cost of 푂(푛푝 log 푛) as the first step in the algorithm.

We now analyze the cost of BestParallelSplit. For each feature of the data, we consider placing the split between each pair of points and evaluating the misclassification error for each split. However, we do not need to calculate the error from scratch at each split position, assuming that the error of the first tree considered is available. We search the splits in sorted order, and so we can update the error incrementally by simply switching which branch the next point is assigned to and updating the misclassification counts accordingly. This means that calculating the error inside the loop of BestParallelSplit has a cost of just 푂(퐾) using the majority rule, and thus the total cost of the loop is 푂(푛푝퐾) for evaluating the error of each split location between 푛 points in each of the 푝 features.

Next, we consider the OptimizeNodeParallel function in Algorithm 2.2. This first calculates the errors for the current, lower and upper subtrees, followed by running BestParallelSplit. Observe that the upper subtree is the same as the first tree considered inside BestParallelSplit, so we can pass the error of this upper tree into the function and avoid recalculating it as we assumed, meaning BestParallelSplit has a total cost of 푂(푛푝퐾). We will also assume for now that the errors of the current, lower and upper subtrees are all provided to OptimizeNodeParallel, so the total cost of this function is 푂(푛푝퐾).
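The following Julia sketch illustrates this incremental update in the simplest case, where both children of the candidate node are leaves: moving the split threshold past one point only moves that point's class count from one child to the other, so each candidate split costs 푂(퐾) to evaluate. In the full procedure the fixed lower and upper subtrees are reused in the same way; the names here are illustrative.

# Best parallel split on a single feature when both children are leaves, using
# O(K) incremental updates of the class counts as the threshold moves.
function best_split_on_feature(x::Vector, y::Vector{Int}, K::Int, Nmin::Int)
    n = length(x)
    order = sortperm(x)                          # presorted once in practice
    left, right = zeros(Int, K), zeros(Int, K)
    for label in y
        right[label] += 1                        # all points start in the right child
    end
    best_err, best_b = typemax(Int), NaN
    for i in 1:n-1
        lab = y[order[i]]
        left[lab] += 1; right[lab] -= 1          # O(1) move of one point
        x[order[i]] == x[order[i+1]] && continue # identical values: no valid threshold
        if min(i, n - i) >= Nmin                 # both leaves satisfy the minimum size
            err = (i - maximum(left)) + ((n - i) - maximum(right))   # O(K)
            if err < best_err
                best_err, best_b = err, (x[order[i]] + x[order[i+1]]) / 2
            end
        end
    end
    return best_b, best_err
end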

Finally, we consider the cost of the overall local search function in Algorithm 2.1. Inside the innermost loop, we calculate the assignment of points to leaves in the tree, which takes 푂(푛푇) time, where 푇 is the number of nodes in the tree, since each point can require 푂(푇) tests when determining which leaf contains it. We then run OptimizeNodeParallel for cost 푂(푛푝퐾) and update the tree T and current error at cost 푂(푇). We also assumed the errors of the current, lower and upper subtrees were made available to OptimizeNodeParallel, which can be obtained by calculating the assignment of points to leaves for each of the subtrees, again in 푂(푛푇) time, and then calculating the resulting misclassification cost using the majority rule at cost 푂(퐾푇). This means the total cost for a single iteration of this loop is 푂(푛푝퐾 + 푛푇 + 퐾푇) = 푂(푛푝퐾 + 푛푇) under the assumption that 푛 ≫ 퐾.

At each iteration of the local search, we calculate the error once for cost 푂(푛푇 + 퐾푇) before running this loop over each node in the tree, which gives an iteration cost of 푂(푛푇 + 퐾푇) + 푇 · 푂(푛푝퐾 + 푛푇) = 푂(푛푝퐾푇 + 푛푇²).

However, we can improve the runtime of this loop as follows. Suppose that we exit the loop after the first improvement we make, and then start the next local search iteration from scratch. This means that the tree is unchanged for the duration of the local search iteration, and so we can precompute the errors and point-to-leaf assignments of all subtrees in the tree T before the loop, at a cost of 푂(푛푇) for the assignments and 푂(퐾푇) for each of the 푂(푇) error calculations, for a total of 푂(푛푇 + 퐾푇²). This reduces the cost of the inner loop iteration to just 푂(푛푝퐾), making the total cost of a local search iteration 푂(푛푇 + 퐾푇²) + 푇 · 푂(푛푝퐾) = 푂(푛푝퐾푇), since the number of nodes in the tree is at most 푂(푛) with one point per leaf. This optimization has therefore removed the second term from the runtime.

If there is no penalty on complexity, 훼 = 0, we can bound the number of local search iterations. In this case, the error is just the misclassification score, which is integral, non-negative, and bounded above by 푛. Each iteration strictly decreases the error, so this implies at most 푛 iterations can occur before termination. This means the total cost of the local search procedure for a single starting solution is 푂(푛²푝퐾푇). We conduct this for a constant number 푅 of random restarts, which does not change

the complexity of the local search procedure. The cost of generating a random starting tree is the same as that of generating a CART tree, which is 푂(푛푝퐾푇): CART uses a function similar to BestParallelSplit as the innermost unit of work at a cost of 푂(푛푝퐾). This is run at each node in the tree, for a total cost of 푂(푛푝퐾푇).

The overall cost of the OCT algorithm is therefore 푂(푛푝 log 푛) for the presorting, 푂(푛푝퐾푇) for generating the starting solutions, and 푂(푛²푝퐾푇) for the local search. This final step dominates, giving a total cost of 푂(푛²푝퐾푇). The cost of CART is just 푂(푛푝 log 푛) for the presorting and 푂(푛푝퐾푇) for generating the tree, for a total cost of 푂(푛푝(퐾푇 + log 푛)).

In the absolute worst case, each split in the tree would separate a single point only, and so the number of nodes 푇 would be 푂(푛), giving a worst-case cost of 푂(푛³푝퐾). Under this worst case, the cost of CART would be 푂(푛²푝퐾), and so the additional power of OCT increases the cost simply by a factor of 푛. A set of more realistic assumptions for practical use is that the number of nodes in the tree would be 푂(log 푛) and the number of local search iterations required is also 푂(log 푛). Under this scenario, the cost of OCT is 푂(푛푝퐾 log² 푛) and the cost of CART is 푂(푛푝퐾 log 푛), and so OCT is only more expensive by a factor of log 푛. Empirically we have found that the number of local search iterations required remains almost constant as the number of points increases, and therefore assuming the number of iterations to be 푂(log 푛) is realistic and perhaps even conservative.

Does the local search procedure work?

We demonstrate the speed and effectiveness of this local-search approach by revisiting the earlier example that used the “Banknote Authentication” dataset. Figure 2-6 shows the results when we use the local-search approach to solve the OCT problem with 100 random restarts. We immediately see that the local-search approach generates solutions of much higher quality than both CART and the MIO-based OCT at the higher depths, and even reaches 0% error at depth 6. At depths 3 and higher, there is clear evidence that the solutions generated by the MIO approach were not optimal, as there is clearly great room for improvement. It takes around one second in the largest cases to construct and optimize all 100 of the random restarts using the local search, so the method runs in times within 1–2 orders of magnitude of CART, and is around 4 orders of magnitude faster than the MIO approach, which makes it a much more powerful and practical alternative to CART than the MIO approach. Finally, we note that these runtimes were measured using just a single CPU core. Since the performance increases linearly with the number of cores, running the same experiments on a typical machine with 16 cores available would give runtimes approximately the same as CART.

Figure 2-6: Training accuracy (%) and running time for each method on the “Banknote Authentication” dataset. [Plots of training accuracy and running time (s) against maximum depth for CART, OCT (MIO + Gurobi), and OCT (local search).]

2.4 A Method for Tuning Hyperparameters

In this section, we present an approach for tuning the hyperparameters in the Optimal Classification Trees problem that we have found empirically to give strong performance out-of-sample. The hyperparameters in the OCT problem are the complexity parameter 훼, the maximum depth of the tree 퐷, and the minimum leaf size 푁_min. Varying any or all of these parameters changes the degree to which the tree is optimized to fit the training data:

∙ The complexity parameter controls the tradeoff between the training error and the number of splits in the tree, ensuring that increasing the number of splits in the tree leads to a corresponding reduction in training error that is sufficiently large;

∙ The maximum depth controls the number of sequential splits that can be used to classify any point, to make sure the decision rules are not too complicated;

∙ The minimum leaf size controls the degree to which the tree can be fit to smaller groups of training points; increasing this parameter leads the tree to focus on the training points as clusters rather than the individual points.

In order to create a tree that performs well out-of-sample, we need to ensure that we do not overfit to the training data, which we can achieve by tuning the values of the hyperparameters to arrive at the “right-sized tree”. This is typically conducted by reserving some of the training data as a validation set, and using the performance of the trained model on the validation set to guide the choice of hyperparameter values. In this sense, the performance on the validation set is used as a proxy for the true out-of-sample performance, and we expect the hyperparameters that give the best performance on the validation set to also perform well on new data.

A standard approach to tuning hyperparameters is to use an exhaustive grid search, where every possible combination of the hyperparameters is used to train a model on the training set. We then evaluate each model on the validation set and choose the combination of hyperparameters that gave the model with the best performance. This approach works well for hyperparameters that take discrete values over a small range, such as the maximum depth 퐷, which is typically from 1–15. However, if the hyperparameter has a large range or is continuous (like the complexity parameter), it is not feasible (or perhaps even possible) to search every possible value. One potential remedy is to discretize the range of possible values and select the best from this set. The quality of a tuned value is therefore likely to depend on the granularity of the discretization, with more precise values requiring a finer mesh of values.

Fortunately, despite the complexity parameter being continuous-valued, it has a discrete effect on the tree. Namely, varying the complexity parameter with all other hyperparameters fixed changes the number of nodes in the tree; the maximum number of nodes is achieved with 훼 = 0, and increasing 훼 from here will reduce the number of nodes in the tree. For each tree, there is a corresponding interval for 훼 from which any value will generate this tree. We can therefore discretize the range of 훼 intelligently using these intervals and only test the “critical” values of 훼, avoiding any wasted effort in testing values for 훼 that would give trees we have already found.

This intelligent approach to discretization is the basis for the cost-complexity pruning used by CART [25]. This procedure first generates a tree with 훼 = 0, then prunes the splits of the tree one-by-one in order of strength to generate a decreasing sequence of trees. Next, the tree with the best performance on the validation set is identified, and the range of 훼 values that would give rise to this tree is determined. Finally, the midpoint of this interval is selected as the tuned value for 훼.

We will now reproduce the details of the cost-complexity pruning algorithm. At each step in the pruning process, we need to identify the weakest of the branch nodes

in the tree to prune and replace with a leaf node. Given a tree T, we let 푅(T푡) be the

misclassification error of the subtree rooted at the 푡th branch node, complexity(T_푡) be the number of splits in this subtree, and 푅(푡) be the misclassification error that would be realized if the 푡th branch node were replaced with a leaf. The change in objective function from replacing the 푡th branch node with a leaf is therefore

푅(푡) − [푅(T_푡) + 훼 · complexity(T_푡)],    (2.27)

where the first term is the error of the candidate leaf and the bracketed term is the error of the existing branch.

This substitution will improve the objective if (2.27) is negative, or equivalently if

훼 > (푅(푡) − 푅(T_푡)) / complexity(T_푡) =: 푔(푡).    (2.28)

The weakest branch node in the tree will be the one with the smallest value of 푔(푡), as this is the first node that will be replaced as we increase the complexity penalty 훼.

Algorithm 2.4 Prune
Input: Starting decision tree T
Output: Vector 훼 of critical values for the complexity parameter; vector v of corresponding validation errors for each critical value
1: 훼_1 ← 0
2: 푣_1 ← 푉(T)
3: 푖 ← 1
4: while complexity(T) > 0 do
5:   푖 ← 푖 + 1
6:   훼_푖 ← ∞
7:   for all 푡 ∈ branches(T) do
8:     푎 ← (푅(푡) − 푅(T_푡)) / complexity(T_푡)
9:     if 푎 < 훼_푖 then
10:      푡̂ ← 푡
11:      훼_푖 ← 푎
12:  replace branch node 푡̂ in T with a leaf node
13:  푣_푖 ← 푉(T)
14: return 훼, v

Algorithm 2.5 TuneCP
Input: Starting decision tree T
Output: Tuned value of 훼 that minimizes validation error
1: 훼, v ← Prune(T)                       ◁ Generate sequence of pruned trees
2: 푖 ← arg min_푗 푣_푗                      ◁ Find interval with minimal error
3: return (훼_푖 + 훼_{푖+1})/2               ◁ Return midpoint of interval

We therefore identify 푡̂ = arg min_푡 푔(푡) and replace this node with a leaf, and then continue pruning from the beginning. At each step in the process, we track the validation error of the current tree T, denoted by the function 푉(T), and the current critical value of 훼. This procedure is detailed in Algorithm 2.4.
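A small Julia sketch of one weakest-link step, written in terms of hypothetical accessors branch_nodes, leaf_error, subtree_error, and subtree_complexity with the meanings used above:

# Find the weakest branch node of T: the node with the smallest critical value
# g(t) from (2.28). Replacing it with a leaf gives the next tree in the sequence.
function weakest_link(T)
    t_best, g_best = nothing, Inf
    for t in branch_nodes(T)
        g = (leaf_error(T, t) - subtree_error(T, t)) / subtree_complexity(T, t)
        if g < g_best
            t_best, g_best = t, g
        end
    end
    return t_best, g_best    # g_best is the next critical value α_i in Algorithm 2.4
end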

Finally, we identify the tree with the smallest validation error, and use the corresponding critical values of 훼 to determine the final tuned value. Algorithm 2.5 describes this process.

Figure 2-7 demonstrates the cost-complexity pruning process as applied to a tree trained on the “Breast Cancer Wisconsin” dataset from the UCI Machine Learning Repository [77].

Figure 2-7: Example of the pruning process following Algorithm 2.4 on the “Breast Cancer Wisconsin” dataset. The starting tree in Figure 2-7a has the largest complexity and lowest training error. In each step of the pruning process, the weakest split in the tree is identified using Equation (2.28) and then replaced with a leaf. Pruning this split causes the complexity to drop and the training error to increase. Eventually we arrive at a tree that is just a single root node. For each tree in this pruning sequence we also calculate the corresponding error on the validation set. Our goal is to choose the complexity parameter 훼 that induces the right-sized tree, which is traditionally done by choosing the 훼 that would have led to the tree with the smallest validation error, as demonstrated by the process in Algorithm 2.5. [Panels: (a) complexity 6, train 2.7%, valid 4.1%; (b) complexity 5, train 2.9%, valid 3.5%; (c) complexity 4, train 3.5%, valid 4.7%; (d) complexity 3, train 4.3%, valid 3.5%; (e) complexity 1, train 7.2%, valid 7.6%; (f) complexity 0, train 35.0%, valid 35.1%.]

We might simply use the same cost-complexity pruning approach to tune the value of 훼 in the OCT model, but we have found empirically that this method does not lead to tuned values that perform well on new data.

To demonstrate this, Figure 2-8 shows the results of applying cost-complexity pruning to OCT trees on the same “Breast Cancer Wisconsin” dataset from Figure 2-7. Both plots correspond to OCT trees from different starting solutions that had 100% in-sample accuracy when trained with 훼 = 0. These plots show the validation error of the pruned tree as a function of the complexity parameter 훼, with each horizontal line in the plot corresponding to one of the trees in the pruning sequence, and the endpoints of this line identifying the range of values for 훼 that lead to this tree. In order to determine the tuned value of 훼, we want to find the value that minimizes the validation error on these plots.

Comparing the pruning curves in Figures 2-8a and 2-8b, we immediately see that the curves are significantly different. Despite both OCT trees having the same perfect in-sample accuracy, the value of 훼 that minimizes the validation error is different for each tree, and in fact there is no overlap between the minimizing values of 훼 for each tree. There are often many possible high-quality solutions that can arise when training the model with 훼 = 0, because the complexity of the tree is not penalized. We could get significantly different values for the tuned value of 훼 depending on which of these optimal trees we select as the tree for pruning. This makes the cost-complexity pruning an unstable procedure with high variance in the tuned values of 훼.

If we consider the curve for the tree in Figure 2-8b, we see there are two intervals in which the validation error reaches its minimum, raising the question of which of these intervals we should select. Another problem is that the validation error is not very smooth as a function of 훼, and even after identifying the interval that minimizes the validation error, we still have to select a single value as the final answer. Traditionally the midpoint of the interval is taken to be the final value, however there is little indication that choosing the midpoint is the best choice for the true minimizer, especially if the interval is large.

These problems in combination lead to estimates for the optimal value of 훼 that have inherently high variance, due both to the uncertainty in which tree will be selected as the basis for pruning, and also to the lack of granularity when selecting the 훼 with the best validation performance.

Figure 2-8: An example showing the instability of cost-complexity pruning with two OCT trees trained on the “Breast Cancer Wisconsin” dataset. Both trees have 100% in-sample accuracy when the complexity parameter is zero. The plots show the pruning curve generated by each tree using cost-complexity pruning. Despite having the same perfect in-sample accuracy, the pruning curves are significantly different and give different predictions for the value of the complexity parameter that minimizes the validation error. [Panels: (a) 1st OCT tree; (b) 2nd OCT tree; each plots validation error against the complexity parameter 훼.]

As a result, a procedure that uses this approach to tune the value of 훼 will have high variance, and thus the trees it generates will have reduced ability to perform well out-of-sample.

The instability in the cost-complexity pruning procedure for CART was observed by Breiman [27], alongside many other unstable methods. He proposed an approach for stabilizing such unstable procedures in the context of best-subset selection by perturbing the data many times and averaging the results of the models fit to each perturbed dataset. Similar approaches were later applied in the context of decision trees to give Bagging [26], Arcing [28] (now known as boosting), and Random Forests [29]. However, these approaches achieved stability by averaging the results of many decision trees rather than providing a stable approach for producing a single decision tree. As Breiman [27] noted:

An interesting research issue we are exploring is whether there is a more stable single-tree version of CART.

To resolve this instability in decision tree training, we now present batch cost-complexity pruning, an approach that reduces the variance in the tuning process for OCT. This approach resolves the question of which tree to use as the basis for pruning by selecting all of the high-quality trees generated in the local search, and in doing so, smooths the estimates of validation error as a function of 훼, enabling us to obtain a higher-precision estimate for the value of 훼 that maximizes performance on the validation set.

The key idea behind the batch cost-complexity pruning method is to use multiple trees together to conduct the pruning procedure. From the trees generated during the local search process, we select the 푘 trees with the best training error (empirically we have found setting 푘 = 푛/10 gives the best results, i.e., selecting the best 10% of the trees). For each of the trees in this subset, we conduct the standard cost-complexity pruning process, using Prune from Algorithm 2.4, and construct the curve of validation error as a function of the complexity parameter:

푓_푗(훼) = ∑_{푖=1}^{|훼^푗|} (푣_푖^푗 − 푣_{푖−1}^푗) 1{훼 ≥ 훼_푖^푗},    (2.29)

where 훼^푗 and v^푗 are the results of the Prune function on the 푗th tree, and we take 푣_0^푗 = 0 for convenience of notation. We then average the curves of all 푘 trees to obtain a single smoothed curve of mean validation error as a function of the complexity parameter:

푓(훼) = (1/푘) ∑_{푗=1}^{푘} 푓_푗(훼).    (2.30)

Algorithm 2.6 BatchTuneCP
Input: Vector trees of solutions from local search; batch size 푘
Output: Best validation error; tuned value of complexity parameter
1: Sort trees by training error
2: for 푗 = 1, . . . , 푘 do
3:   훼^푗, v^푗 ← Prune(trees_푗)
4:   푓_푗(훼) ← ∑_{푖=1}^{|훼^푗|} (푣_푖^푗 − 푣_{푖−1}^푗) 1{훼 ≥ 훼_푖^푗}
5: 푓(훼) ← (1/푘) ∑_{푗=1}^{푘} 푓_푗(훼)
6: 푣_best ← min_훼 푓(훼)
7: 풜 ← {훼̂ : 훼̂ ∈ arg min_훼 푓(훼)}                 ◁ All function minimizers
8: 훼_best ← (min_{훼∈풜} 훼 + max_{훼∈풜} 훼)/2         ◁ Midpoint of minimizers
9: return 푣_best, 훼_best

Figure 2-9: Tuning results using batch pruning on the “Breast Cancer Wisconsin” dataset. [Plot of validation error against the complexity parameter 훼.]

We then proceed to identify the minimizing range of 훼 values for this function and use the midpoint of this range as the final tuned value, in the same way as for standard cost-complexity pruning. The full algorithm is given in Algorithm 2.6.
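Because each 푓_푗 is a step function that only changes value at its own critical points, the averaged curve 푓 can be evaluated exactly on the union of all critical values. A Julia sketch of this computation follows; the names are illustrative, prune is assumed to return the vectors 훼^푗 and v^푗 from Algorithm 2.4, and training_error to return the training error of a tree.

# Batch cost-complexity pruning: average the k step functions f_j from (2.29)
# on the union of their critical values and return the best validation error
# together with the midpoint of the minimizing range of α, as in Algorithm 2.6.
function batch_tune_cp(trees, X, y, Xval, yval, k)
    best = sort(trees; by = T -> training_error(T, X, y))[1:k]
    curves = [prune(T, Xval, yval) for T in best]              # (αʲ, vʲ) pairs
    grid = sort(unique(vcat([αs for (αs, _) in curves]...)))   # all critical values
    f = [sum(step_value(αs, vs, α) for (αs, vs) in curves) / k for α in grid]
    v_best = minimum(f)
    minimizers = grid[f .== v_best]
    return v_best, (minimum(minimizers) + maximum(minimizers)) / 2
end

# Value of one pruning curve at α: the validation error of the last tree in the
# sequence whose critical value does not exceed α (the α_i are sorted and start at 0).
step_value(αs, vs, α) = vs[searchsortedlast(αs, α)]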

We will now illustrate the advantages of this algorithm on the same example as before. Figure 2-9 shows the batch pruning curve for the same “Breast Cancer Wisconsin” dataset as in Figure 2-8. We immediately see that the curve is a lot smoother and has a more obvious minimizing value. To make this comparison clearer, Figure 2-10 compares the curves generated by the standard and batch pruning methods in the region where the curves are minimized. We can see that the batch pruning curve has a well-defined minimizing point around 0.008, whereas the standard pruning curve indicates the minimizer would be around 0.004 if using the first tree, and anywhere from 0.006 to 0.042 for the second tree. This example shows the power of batch pruning in reducing the uncertainty in the pruning process. By combining the pruning of multiple high-quality trees, we are able to zero in on a very precise estimate for the optimal value of 훼.

Figure 2-10: Comparison of pruning methods on the “Breast Cancer Wisconsin” dataset in the region where validation error is minimized. [Plot of validation error against the complexity parameter 훼 for the standard pruning curves of the 1st and 2nd OCT trees and the batch pruning curve.]

Algorithm 2.7 Tune
Input: Maximum depth of tree 퐷_max; batch size 푘
Output: Tuned value for complexity parameter; tuned value for depth
1: 푣_best ← ∞
2: for 퐷 = 1, . . . , 퐷_max do
3:   T ← decision tree trained with maximum depth 퐷
4:   푣, 훼 ← BatchTuneCP(T, 푘)
5:   if 푣 < 푣_best then
6:     푣_best ← 푣
7:     훼_best ← 훼
8:     퐷_best ← 퐷
9: return 훼_best, 퐷_best

A procedure for tuning all hyperparameters

The batch cost-complexity pruning procedure that we have presented is a powerful method for tuning the value of 훼, and we will now present a method for tuning all of the hyperparameters in the OCT model that takes advantage of this batch process. This method simply embeds the batch pruning inside an exhaustive grid search over the remaining parameters. The full procedure is given in Algorithm 2.7.

In Algorithm 2.7, we only tune the complexity parameter 훼 and the depth 퐷. We have found empirically that there is little advantage to be gained in tuning the

minimum leaf size 푁min; the out-of-sample performance is largely the same if we

simply fix 푁min = 1, and so it is unlikely that tuning this parameter will be worth the additional cost. However, it is trivial to modify Algorithm 2.7 to tune 푁min by adding another outer loop over the candidate values for this parameter. As mentioned earlier, we have found empirically that 푘 = 푛/10 is a good batch size to use for this validation process, and leads to tuned values that perform well out-of-sample across a wide range of problems.

2.5 Experiments with Synthetic Datasets

In this section, we examine the performance of Optimal Classification Trees on a variety of synthetically-generated datasets in order to understand how effective the method is at discovering the underlying ground truth in a dataset whose structure is in fact described by a decision tree.

Experimental Setup

The experiments we conduct are adaptations of the experiments by Murthy and Salzberg [90]. These experiments use a single decision tree as the ground truth to generate datasets, and then different methods are used to induce decision trees on these datasets and are then compared to the ground truth tree to evaluate performance. To construct the ground truth, a decision tree of a specified depth is first created by choosing splits at random. The leaves of the resulting tree are then labeled such that no two leaves sharing a parent have the same label, otherwise this parent node could simply be replaced with a leaf of the same label without affecting the predictions of the overall tree.

The training and test datasets are created by generating each x_푖 as a uniform random vector, and then using the ground truth tree to assign the corresponding label to the point. The size of the training set is specified by the experiment, and the size of the test set is always 2^{퐷_true} · 2000, where 퐷_true is the depth of the ground truth tree as specified by the particular experiment. 25% of the training set is reserved for validation purposes in order to tune the complexity parameter 훼 and the maximum depth 퐷 using the procedure in Algorithm 2.7.

In order to determine the ability of each method to learn the true model described by the data, we record three measures of tree quality in all experiments:

∙ Out-of-sample accuracy: accuracy on the test set.

∙ True discovery rate (TDR): the proportion of splits in the ground truth tree that are also present in the generated tree. This measures the ability of the method to recover the true splits from the data.

∙ False discovery rate (FDR): the proportion of splits in the generated tree that are not present in the ground truth tree. This measures the amount of noise in the generated tree (a sketch of computing both measures follows this list).
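A minimal Julia sketch of these two measures, under the simplifying assumption that each tree is summarized by the set of its splits, here represented as (feature, threshold) pairs; the matching criterion and names are illustrative.

# True and false discovery rates of a learned tree against the ground truth,
# with each tree represented by the Set of its (feature, threshold) splits.
function discovery_rates(truth::Set, learned::Set)
    tdr = isempty(truth)   ? 1.0 : length(intersect(truth, learned)) / length(truth)
    fdr = isempty(learned) ? 0.0 : length(setdiff(learned, truth))   / length(learned)
    return tdr, fdr
end

# Example: discovery_rates(Set([(2, 0.5), (3, 0.1)]), Set([(2, 0.5), (1, 0.9)]))
# returns (0.5, 0.5).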

For each experiment, 400 random trees are generated as different ground truths, and training and test set pairs are created using each tree. The quality measures are calculated over all 400 instances of the problem, and in the figures we present the means of each measure, with the variation across instances indicated by the error ribbon around each line.

In these experiments we compare our OCT method to the standard CART heuristic. Note that [90] found that C4.5 and CART gave near-identical results for all experiments they conducted and reported only the C4.5 results, so our results for CART should also be similar to those that would be obtained using C4.5.

Each experiment varies one parameter while holding all others constant. Unless otherwise specified, the experiments are run with 푛 = 1000 training points in 푝 = 5 dimensions and 퐾 = 2 classes, the ground truth trees are depth 퐷 = 4, there is no noise added to the training data, and OCT is run with 1000 random restarts.

Figure 2-11: Synthetic experiments showing the effect of training set size. [Plots of out-of-sample accuracy, true discovery rate, and false discovery rate against the number of training points for CART and OCT, with the ground truth shown for reference.]

Implementation Details

The Optimal Classification Tree method as detailed in Section 2.3 was implemented in the Julia programming language, which is a high-level, high-performance dynamic programming language for technical computing [21]. We used Algorithm 2.7 to tune the hyperparameters. For testing the performance of CART, we used the rpart package [110] in the

R programming language [106]. The hyperparameters 푁min and 훼 correspond to the parameters minbucket and cp. As for OCT, minbucket was set to 1, and the cp was tuned using standard cost-complexity pruning as described in Section 2.4.

The effect of the amount of training data

The first experiment investigates the effect of the amount of training data relative to the complexity of the problem. Figure 2-11 shows a summary of the results. We see that both methods increase in out-of-sample accuracy as the training size increases, approaching 100% accuracy at the higher sizes. In all cases, OCT has higher accuracy than CART on training sets of the same size. For smaller training sets, the difference is about 1–3%, and this difference shrinks as the training set size increases. This shows that OCT performs stronger in data-poor regimes, while CART is able to eventually achieve comparable accuracies if supplied enough data. This offers clear evidence against the notion that optimal methods tend to overfit the training data in data-poor scenarios; in fact the optimal method performs stronger. The TDR of both methods increases as the training size increases, indicating that it is easier to recover the truth with more training data. OCT has a small but significant advantage in TDR of 2–5% over CART at all but the largest sizes. The biggest difference between the methods is in the FDR. As the training set size increases, CART’s FDR hovers between 40–50%, despite the increasing out-of-sample accuracy and TDR. This means that even though the CART trees are performing increasingly well for prediction, they are not improving their interpretability with more data, as around half of the splits in the trees generated are unrelated to the true data generation process. In stark contrast, the FDR of OCT falls significantly as the training size increases, leveling out at 10–15%. This demonstrates that OCT is much better able to identify the whole truth, and nothing but the truth, in the data.

Figure 2-12: Synthetic experiments showing the effect of number of features. [Plots of out-of-sample accuracy, true discovery rate, and false discovery rate against the number of features for CART and OCT, with the ground truth shown for reference.]

The effect of dimensionality

The second experiment considers the effect of the dimensionality of the problem, and the results are shown in Figure 2-12. We see that both methods lose accuracy as the problem dimension increases, reinforcing our intuition that the problem is more difficult at higher dimensions. At all dimensions, OCT has significantly higher accuracy than CART, and this difference grows from 1% at low dimensions to around 2% at higher dimensions. The TDR shows the same trend as the accuracy, decreasing as the problem difficulty increases, with OCT outperforming CART by about 3–8%. Again, there is a significant difference in the FDR of each method. CART has an FDR around 35–40%, while OCT is less than half this at around 15%, meaning the trees generated by OCT will be much more useful for interpretation. Finally, we note that the highest dimension we tested was 푝 = 1000, the same as the number of training points, 푛 = 1000, showing that OCT is able to perform well in a high-dimensional setting.

Figure 2-13: Synthetic experiments showing the effect of number of classes. (Panels: out-of-sample accuracy, true discovery rate, and false discovery rate vs. number of classes; series: CART, OCT, Truth.)

The effect of the number of classes

The third experiment considers the effect of the number of classes in the classification problem on the performance of each method. Figure 2-13 shows the results of this experiment. The out-of-sample accuracy of CART decreases slightly as the number of classes increases, whereas the accuracy of OCT remains nearly constant. This is an interesting result, because it seems to indicate that the greedy approach used by CART is better suited to two-class problems, and does not transfer perfectly to multi-class problems. OCT improves upon the TDR of CART by 3–5%, and more than halves the FDR, echoing the results of the previous experiments.

Figure 2-14: Synthetic experiments showing the effect of number of random restarts. (Panels: out-of-sample accuracy, true discovery rate, and false discovery rate vs. number of random restarts; series: CART, OCT, Truth.)

The effect of the number of random restarts

The fourth experiment measures the performance of OCT as a function of the number of random restarts used in the local search. The results are shown in Figure 2-14. We see that the number of random restarts has a large effect on the accuracy at first, but after around 100–1000 restarts the rate of improvement diminishes; the performance with 1000 and 100000 restarts is very close, with out-of-sample accuracies of 99.0% and 99.1%, respectively. The TDR is similar, as it increases initially, but eventually levels out. The trend is most apparent in the FDR, which decreases rapidly until 1000 restarts, where it is about half of CART’s FDR, and then decreases more slowly with more restarts. Together, these results seem to indicate that the minimum number of restarts we should be using is around 100–1000, as this is where the rates of improvement begin to decrease. After this point, the performance still improves with additional restarts, but at a much slower rate, so it may not be as worthwhile to continue training after this point, depending on the application.

The effect of noise in the training data

The fifth experiment considers the effect of adding noise to the features of the training data. We introduce noise by selecting a random 푓% of the training points and adding random noise 휖 ∼ 푈(−0.1, 0.1) to every feature of each of these points, where 푓 is the parameter of the experiment. Figure 2-15 shows the impact of increasing levels of this feature noise in the training data. As we would expect, the accuracies of each method decrease as the level of noise increases. At all levels of noise, OCT has accuracies that are 1–1.5% higher than CART. Even at the highest level of noise, where 25% of the training data are perturbed in all features by up to 10%, OCT has an out-of-sample accuracy of 98.4% and is significantly higher than CART, countering the idea that optimal methods are less robust to noise than current heuristic approaches due to problems of overfitting. Similar to the accuracy, the TDR decreases as the noise increases, but OCT is again able to better find the truth in the noise, having a 4–5% improvement in TDR. The FDR increases slightly for both methods with noise, as it becomes harder to extract only the truth from the data, but the FDR of CART is again roughly twice that of OCT.

Figure 2-15: Synthetic experiments showing the effect of feature noise. (Panels: out-of-sample accuracy, true discovery rate, and false discovery rate vs. feature noise (%); series: CART, OCT, Truth.)

The final experiment also examines the impact of adding noise to the training data, but this time to the labels rather than the features. Noise was added to the training set by flipping the labels of a random 푓% of the points. Figure 2-16 shows the impact of increasing levels of label perturbations in the training data. As the level of noise increases, the accuracies of both methods decrease, similar to the previous experiment and as would be expected. The difference in accuracies is relatively constant at around 1–1.5% for all levels of noise. The TDR shows a similar decreasing trend, with OCT outperforming CART by around 5%. The FDR of both methods again increases slightly with label noise, with OCT again having an FDR of about half that of CART.

Figure 2-16: Synthetic experiments showing the effect of label noise. (Panels: out-of-sample accuracy, true discovery rate, and false discovery rate vs. label noise (%); series: CART, OCT, Truth.)

To further investigate the effect of noise on our results, we conducted each of the experiments (apart from the final two) again with 5% feature and 5% label noise added to the training data. In every case, we found that the results of the experiments were nearly identical to the results without noise, apart from some slight decreases in out-of-sample accuracy and TDR for both methods due to the noise. The relative performance of the methods was almost exactly the same as on the noiseless data. For this reason, we have not included these results in full detail, but confirm that the trends we have shown indeed remain in the presence of noise.
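For concreteness, the two noise models used above can be sketched as follows. This is a minimal sketch: the label-flipping mechanism is assumed here to replace the label with a uniformly random different class, a detail the text does not pin down.

```julia
using Random

# Minimal sketch of the two noise models: feature noise adds U(-0.1, 0.1) to
# every feature of a random fraction f of the points; label noise flips the
# labels of a random fraction f of the points.
function add_feature_noise!(X, f; rng = Random.default_rng())
    n, p = size(X)
    idx = shuffle(rng, 1:n)[1:round(Int, f * n)]
    X[idx, :] .+= 0.2 .* rand(rng, length(idx), p) .- 0.1   # U(-0.1, 0.1) perturbation
    return X
end

function add_label_noise!(y, f, K; rng = Random.default_rng())
    n = length(y)
    idx = shuffle(rng, 1:n)[1:round(Int, f * n)]
    for i in idx
        y[i] = rand(rng, setdiff(1:K, y[i]))                # assumed: any other class
    end
    return y
end

X, y = rand(1000, 5), rand(1:2, 1000)
add_feature_noise!(X, 0.05)
add_label_noise!(y, 0.05, 2)
```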

Summary

We conclude by summarizing the collective findings of these tests with synthetic data. In every case, the trees generated by OCT were a closer match to the ground truth trees than those of CART. The out-of-sample accuracies of each method decreased with problem difficulty (decreasing training set size and increasing dimension) as we might expect, but OCT consistently gave accuracies 1–2% higher than CART. There was a similar trend in the true discovery rate, which represented the proportion of splits in the true tree that were correctly recovered, with OCT improving upon CART by about 2–5% across all the experiments. The biggest difference between the methods was in the false discovery rate, which represented the proportion of splits in the generated tree that were simply noise and unrelated to the true tree. The CART trees had false discovery rates of 35–40% in all experiments, whereas the OCT trees were typically less than half this amount at 15–20%. Importantly, these high rates for CART did not improve even as the training set became very large, indicating that CART is unable to filter out this noise from its trees, and CART trees inherently have a large fraction of irrelevant splits. In contrast, the false discovery rate of OCT decreases as the amount of data increases, so it is able to correctly learn the truth in the data without introducing unnecessary splits into the tree. Finally, we note that the significant advantage of OCT over CART persisted even in the presence of large levels of noise in the training data. This provides strong evidence against the widely-held belief that optimal methods are more inclined to overfit the training set at the expense of out-of-sample accuracy.

2.6 Experiments with Real-World Datasets

In this section, we compare the performance of Optimal Classification Trees and CART on several publicly available classification datasets widely-used within the statistics and machine learning communities.

Experimental Setup

In order to provide a comprehensive benchmark of real-world performance, we used a collection of 60 datasets obtained from the UCI Machine Learning Repository [77]. These datasets are used regularly for reporting and comparing the performance of different classification methods. The datasets we use have sizes up to hundreds of thousands of points, which we demonstrate can be handled by our methods. Each dataset was split into three parts: the training set (50%), the validation set (25%), and the testing set (25%). The training and validation sets were used to tune the values of the hyperparameters of each method. We then used these tuned values to train a tree on the combined training and validation sets, which we then evaluate against the testing set to determine the out-of-sample accuracy.

We used the same implementations for OCT and CART as in Section 2.5. We trained Optimal Classification Trees of maximum depths from 1–10 on all datasets with 푅 = 500 random restarts, using Algorithm 2.7 to tune the complexity parameter 훼 and the depth of the tree. The CART trees were obtained by fixing minbucket = 1 to match OCT, and tuning the complexity parameter using cost-complexity pruning in the usual way for CART. The resulting tree was then trimmed to the specified depth if required. This lets us evaluate the performance of all methods in the same depth-constrained scenario, which is often important for applications that require interpretability of the solutions, and also allows us to examine how the performances change with increasing depth. We also conducted experiments where we allowed the depth of the trees to be greater than 10; however, in nearly all cases this did not lead to deeper trees, because the validation process controls the depth of the trees and prevents overfitting with trees that are too deep. This means the results for trees of depth 10 are nearly identical to the results that would be obtained with no depth restriction. As such, we only report the results for trees with maximum depths from 1–10.

In order to minimize the effect of the particular splitting of the data into training, validation and testing sets, the entire process was conducted five times for each dataset, with a different splitting each time. The final out-of-sample accuracies were then obtained by averaging the results across these five runs.
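A minimal sketch of this evaluation protocol is given below; `tune_hyperparameters`, `train_tree`, and `accuracy` are hypothetical placeholders for the tuning and training routines of each method, not functions from the actual implementation.

```julia
using Random, Statistics

# Minimal sketch of the 50/25/25 split repeated over five random seeds.
function evaluate(X, y; nreps = 5)
    n = size(X, 1)
    accs = Float64[]
    for rep in 1:nreps
        idx = shuffle(MersenneTwister(rep), 1:n)
        ntrain, nvalid = round(Int, 0.5n), round(Int, 0.25n)
        train = idx[1:ntrain]
        valid = idx[ntrain+1:ntrain+nvalid]
        test  = idx[ntrain+nvalid+1:end]
        params = tune_hyperparameters(X[train, :], y[train], X[valid, :], y[valid])
        tree   = train_tree(X[[train; valid], :], y[[train; valid]]; params...)
        push!(accs, accuracy(tree, X[test, :], y[test]))
    end
    return mean(accs)   # mean out-of-sample accuracy over the five runs
end
```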

Optimal Classification Trees vs. CART

We now present an in-depth direct comparison of OCT and CART. Both methods seek to solve the same decision tree problem, so this comparison aims to show the effectiveness of solving the decision tree problem using modern optimization techniques to obtain more optimal solutions. The entire set of mean out-of-sample accuracies on each dataset for both methods at depth 10 is provided in Table 2.1, which reports the size and dimension of the dataset, the number of class labels 퐾, and the average out-of-sample accuracy of each method, along with the mean accuracy improvement for OCT over CART and the corresponding standard error. As we noted earlier, the results with no depth restriction were nearly identical to the results for depth 10, so these results for depth 10 should additionally be seen as the results for the case where the depth is unconstrained.

Table 2.1: Full results for CART and OCT at depth 10. The best method for each dataset is indicated in bold. Positive improvements are highlighted in blue, and negative in red.

Name  푛  푝  퐾  CART  OCT  Mean improvement
acute-inflammations-1  120  6  2  95.3  100.0  +4.67 ± 2.91
acute-inflammations-2  120  6  2  100.0  100.0  0.00 ± 0.00
balance-scale  625  4  3  78.1  79.4  +1.27 ± 1.08
banknote-authentication  1372  4  2  98.0  98.9  +0.87 ± 0.36
blood-transfusion  748  4  2  77.2  78.1  +0.86 ± 0.80
breast-cancer-diagnostic  569  30  2  90.9  92.7  +1.82 ± 0.28
breast-cancer-prognostic  194  32  2  75.5  75.5  0.00 ± 0.00
breast-cancer  683  9  2  95.0  95.1  +0.12 ± 0.47
car-evaluation  1728  15  4  91.3  92.4  +1.16 ± 1.18
chess-king-rook-vs-king-pawn  3196  37  2  98.9  99.3  +0.43 ± 0.11
climate-model-crashes  540  18  2  91.3  92.3  +1.04 ± 0.90
congressional-voting-records  232  16  2  98.6  98.6  0.00 ± 0.55
connectionist-bench-sonar  208  60  2  69.2  75.0  +5.77 ± 2.51
contraceptive-method-choice  1473  11  3  53.2  53.2  −0.05 ± 1.75
credit-approval  653  37  2  87.2  86.7  −0.49 ± 0.83
cylinder-bands  277  484  2  66.1  67.5  +1.45 ± 2.59
dermatology  358  34  6  95.1  95.3  +0.22 ± 0.90
echocardiogram  61  6  2  72.0  73.3  +1.33 ± 1.33
fertility  100  12  2  87.2  86.4  −0.80 ± 1.96
haberman-survival  306  3  2  73.5  73.2  −0.26 ± 0.26
hayes-roth  132  4  3  75.8  78.2  +2.42 ± 4.64
heart-disease-cleveland  297  18  5  57.6  55.5  −2.13 ± 1.00
hepatitis  80  19  2  84.0  82.0  −2.00 ± 3.00
hill-valley-noise  606  100  2  55.8  56.6  +0.79 ± 1.38
hill-valley  606  100  2  49.8  52.6  +2.78 ± 0.68
image-segmentation  210  19  7  77.7  86.8  +9.06 ± 0.71
indian-liver-patient  579  10  2  71.6  71.2  −0.41 ± 0.60
ionosphere  351  34  2  88.3  91.3  +2.99 ± 3.12
iris  150  4  3  93.5  94.6  +1.08 ± 1.08
magic-gamma-telescope  19020  10  2  84.9  84.7  −0.24 ± 0.26
mammographic-mass  830  10  2  81.7  80.5  −1.26 ± 0.71
monks-problems-1  124  11  2  74.8  87.7  +12.90 ± 6.77
monks-problems-2  169  11  2  59.5  60.0  +0.47 ± 0.47
monks-problems-3  122  11  2  94.2  92.9  −1.29 ± 1.29
mushroom  5644  76  2  100.0  100.0  +0.04 ± 0.04
optical-recognition  3823  64  10  88.6  88.0  −0.61 ± 0.15
ozone-level-detection-eight  1847  72  2  93.1  93.0  −0.09 ± 0.09
ozone-level-detection-one  1848  72  2  96.5  96.6  +0.13 ± 0.13
parkinsons  195  21  2  87.8  86.9  −0.82 ± 4.21
pen-based-recognition  7494  16  10  96.1  95.5  −0.61 ± 0.19
pima-indians-diabetes  768  8  2  72.3  73.0  +0.73 ± 0.98
planning-relax  182  12  2  71.1  71.1  0.00 ± 0.00
qsar-biodegradation  1055  41  2  82.4  83.5  +1.14 ± 1.63
seeds  210  7  3  89.4  88.3  −1.13 ± 1.28
seismic-bumps  2584  20  2  92.9  93.2  +0.31 ± 0.48
skin-segmentation  245057  3  2  99.7  99.9  +0.17 ± 0.02
soybean-small  47  37  4  100.0  100.0  0.00 ± 0.00
spambase  4601  57  2  91.5  93.2  +1.70 ± 0.34
spect-heart  80  22  2  61.0  62.0  +1.00 ± 4.00
spectf-heart  80  44  2  72.0  76.0  +4.00 ± 6.20
statlog-german-credit  1000  48  2  72.0  71.0  −1.04 ± 1.15
statlog-landsat  4435  36  6  85.3  85.7  +0.45 ± 0.57
teaching-assistant  151  52  3  56.8  63.2  +6.49 ± 2.78
thoracic-surgery  470  24  2  84.8  84.8  0.00 ± 0.00
thyroid-disease-ann  3772  21  3  99.7  99.7  −0.02 ± 0.04
thyroid-disease-new  215  5  3  92.8  94.0  +1.13 ± 0.75
tic-tac-toe-endgame  958  18  2  94.1  94.0  −0.08 ± 0.89
wall-following-robot-2  5456  2  4  100.0  100.0  0.00 ± 0.00
wall-following-robot-24  5456  4  4  100.0  100.0  0.00 ± 0.00
wine  178  13  3  84.9  94.2  +9.33 ± 2.47

Figure 2-17: Mean out-of-sample accuracy for each method across all 60 datasets. (Out-of-sample accuracy (%) vs. maximum depth of tree; series: CART, OCT.)

Table 2.2: Mean out-of-sample accuracy (%) results for CART and OCT across all 60 datasets.

Maximum depth  CART  OCT  Mean improvement  p-value
1  70.35  70.66  +0.31 ± 0.16  0.0481
2  75.78  77.92  +2.13 ± 0.32  ∼10^-10
3  79.27  81.17  +1.90 ± 0.38  ∼10^-6
4  80.85  82.91  +2.05 ± 0.37  ∼10^-7
5  81.75  83.59  +1.83 ± 0.38  ∼10^-6
6  82.61  83.98  +1.38 ± 0.31  ∼10^-5
7  82.76  84.35  +1.58 ± 0.30  ∼10^-7
8  83.22  84.36  +1.14 ± 0.28  ∼10^-4
9  83.33  84.45  +1.12 ± 0.27  ∼10^-4
10  83.46  84.57  +1.11 ± 0.27  ∼10^-4

Figure 2-17 shows the mean out-of-sample accuracy for each method across all 60 datasets as a function of the depth of the tree. These accuracies are also shown in Table 2.2, along with the mean difference between the methods and the associated p-value indicating the statistical significance of the difference. We see that OCT is stronger than CART at all depths, with an improvement over CART of about 2% at lower depths, shrinking to about 1% at higher depths, and this difference is statistically significant at all depths. We believe that the difference is larger for the shallower trees because the impact of making a bad choice high in the tree is higher for shallow trees than for deeper trees, where there is more scope for the tree to recover from a bad initial split with the subsequent splits. Note that even at depth 1, a significant difference is present between the methods, which is simply the effect of using the misclassification score to select the best split as opposed to an impurity measure. Finally, we see that CART at depth 10 (or no depth restriction) performs about the same as OCT at depth 4, showing that OCT can learn trees of the same accuracy as CART with far fewer splits. This indicates that OCT is able to better discover the truth in the data and give trees with truly meaningful splits.

Figure 2-18 provides an example that emphasizes the additional power and interpretability of OCT over CART. The trees shown in this figure were generated as part of these computational experiments for the "Banknote Authentication" dataset with one of the five random seeds tested. From our results for this seed, we identified the CART tree with the best out-of-sample accuracy, which was 98.3%. This tree is shown in Figure 2-18a, and is depth 7 with 15 splits. The best out-of-sample accuracy for OCT with this seed was 99.1%, halving the error compared to the best CART tree. This tree is shown in Figure 2-18b, and is depth 6 with 12 splits, so it not only provides a significant improvement in out-of-sample accuracy but also is slightly more interpretable since it is smaller. To provide a direct comparison of interpretability, we identified the depth where the OCT out-of-sample accuracy was closest to that of CART. This tree is shown in Figure 2-18c, and has the same out-of-sample accuracy as CART of 98.3% with depth 4 and 7 splits. OCT therefore achieves the same accuracy as CART with a tree that is approximately half the size, demonstrating the significant improvements in interpretability that are enabled by OCT.

Figure 2-18: A comparison of CART and OCT trees on the "Banknote Authentication" dataset. (a) CART (out-of-sample accuracy 98.3%; depth 7 with 15 splits). (b) OCT (out-of-sample accuracy 99.1%; depth 6 with 12 splits). (c) OCT (out-of-sample accuracy 98.3%; depth 4 with 7 splits).

Table 2.3: Number of datasets where CART and OCT were strongest across all 60 datasets.

Maximum depth  CART  OCT  Ties
1  9  25  26
2  10  33  17
3  15  34  11
4  12  38  10
5  15  38  7
6  15  37  8
7  15  38  7
8  15  37  8
9  13  38  9
10  18  34  8

Table 2.3 shows the number of times CART and OCT were the strongest method for a dataset across the sample of 60 datasets, broken down according to the maximum depth of the trees. We see that OCT performs strongest on significantly more datasets than CART. At depths 1 and 2, OCT wins about three times more often than CART, although there are a larger number of ties at these depths. At most other depths, OCT wins around 2.5 times more often than CART. The weakest performance for OCT relative to CART is at depth 10 (or with no depth restriction), yet it still performs strongest in about twice as many datasets as CART. These results show that OCT is able to outperform CART in a significant majority of datasets, in addition to having a higher mean out-of-sample accuracy across the sample of datasets.

We have established that OCT is more likely to outperform CART, and has a small yet significant gain in out-of-sample accuracy across the collection of datasets. Next, we consider the relative performance of the methods according to characteristics of the dataset in order to identify the types of problems where each method is more likely to outperform the other.

In the synthetic experiments of Section 2.5, we found that the improvement of OCT over CART increased with the problem difficulty, i.e., as the number of points decreased or as the number of features increased. We cannot easily investigate these effects individually for the real-world datasets since they are so varied in both 푛 and 푝 across the sample of datasets. Instead, we can construct a single measure for problem difficulty, and measure the change in accuracy improvement as a function of this measure. The metric we have derived is 푛/ log(푝), which decreases as either 푛 decreases or 푝 increases, so lower values of this measure indicate more difficult problems, in line with the findings of the synthetic experiments. We found empirically that including the log in the denominator achieves the best balance between the scaling with 푛 and 푝.

Figure 2-19: Improvement in out-of-sample accuracy for OCT over CART for each of the 60 datasets, according to the ratio of 푛 and log(푝). (OCT accuracy improvement (%) vs. difficulty metric 푛/ log(푝), log scale.)

Figure 2-19 presents the accuracy improvement of OCT against CART as a function of this difficulty metric. We also divide the datasets into two groups: those with difficulty metric lower than 100, and those greater than 100. We can see that OCT gives the largest improvements in accuracy over CART for datasets below this threshold. The mean accuracy improvement for datasets below the threshold is 2.59%, while for those above the threshold the mean improvement is 0.26%. This reinforces the results from the synthetic experiments that the degree of improvement of OCT over CART is correlated with the problem difficulty; for difficult problems with few data or many features, the improvement of OCT tends to be larger.
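As a small illustration, the difficulty metric and the grouping at the threshold of 100 can be computed as follows; the three example rows are taken from Table 2.1, while the full analysis of course uses all 60 datasets.

```julia
using Statistics

# Minimal sketch of the difficulty metric n / log(p) and the threshold of 100.
difficulty(n, p) = n / log(p)

n   = [120, 606, 19020]     # acute-inflammations-1, hill-valley, magic-gamma-telescope
p   = [6, 100, 10]
imp = [4.67, 2.78, -0.24]   # OCT accuracy improvement (%) from Table 2.1

d    = difficulty.(n, p)    # ≈ [67.0, 131.6, 8260.3]
hard = d .< 100             # "difficult" datasets below the threshold
mean(imp[hard]), mean(imp[.!hard])   # mean improvement in each group
```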

We conclude this section by summarizing the key results from experiments on real-world datasets. Across a wide range of datasets with varying characteristics, we have comprehensive evidence that OCT delivers a significant improvement in out-of-sample accuracy over CART. In particular, when using trees of maximum depth 4, OCT can achieve the same accuracy as CART with no depth restriction. This demonstrates that OCT can achieve the same level of performance as CART with much less complexity in the tree, making the trees superior for interpretation. Finally, we found further evidence to reinforce the findings of the synthetic experiments in Section 2.5 that the improvement of OCT over CART increases as the problem difficulty increases, meaning that OCT is much better at constructing trees for problems with fewer data or more features.

2.7 Conclusions

In this chapter, we have revisited the classical problem of decision tree creation under a modern optimization lens. We framed the problem that the CART algorithm solves as a traditional optimization problem, and presented a novel MIO formulation for creating optimal decision trees that overcomes the inherent limitations of CART and other methods that are based on top-down, greedy induction. We found empirically that the MIO formulation was too large to solve in practical times, and in a time-limited scenario delivered marginal improvements in solution quality but was not able to say anything about the optimality of these solutions after multiple hours. This motivated a new perspective on the problem to reduce the dimensionality, and led to a local search heuristic based around re-optimizing existing tree solutions one node at a time. This local search approach delivers solutions that significantly outperform both CART and the MIO-based approach, and runs in times within 1–2 orders of magnitude of CART.

We presented batch cost-complexity pruning, a new method for tuning the complexity parameter that gives much higher precision in choosing the optimal hyperparameter value compared to the traditional cost-complexity pruning. This new procedure combines the pruning results of multiple high-quality trees, which smoothens the resulting pruning curve and allows more accuracy in identifying the true minimizing value. We incorporated this method in a complete procedure for tuning the OCT hyperparameters that gives very strong out-of-sample performance without taking too long to tune.

Experiments with synthetic data provide strong evidence that OCT can better recover the true generating decision tree in the data, contrary to the popular belief that optimal methods will just overfit the training data at the expense of generalization ability. In particular, we find that around 35–40% of the splits in trees generated by CART bear no resemblance to the tree that generated the data, whereas this number is only around 15–20% for OCT, indicating the latter are much more reliable for interpretation. We also conducted comprehensive computational experiments with a sample of 60 real-world datasets, and found that OCT significantly outperformed CART across this sample, with an average improvement in out-of-sample accuracy of 1–2% depending on the depth of the tree used. We also found that the improvement of OCT over CART increased as the problem difficulty increased, namely as the number of points decreases or the number of features increases. This indicates that OCT is more capable of identifying the true structure in the data in scenarios where there is much more noise than signal, a common occurrence in practical applications.

These results provide comprehensive evidence that the optimal decision tree problem is tractable for practical applications and leads to significant improvements over the current state-of-the-art decision tree learners.

Chapter 3

Optimal Classification Trees with Hyperplane Splits

Classical state-of-the-art methods for constructing decision trees focus exclusively on trees with splits that are parallel to the axes, regardless of whether they are single-tree methods like CART and C4.5, or ensemble methods like Random Forests and Gradient Boosting. In many cases, removing this restriction that the splits be parallel to the axes could lead to significant improvements in the quality of the tree. However, despite work in this area, to date no method has been developed to find trees with such hyperplane splits that is both tractable on real problem sizes and leads to significant improvements over the axes-parallel case.

The key problem when trying to generate trees with hyperplane splits in the same way as axis-parallel splits is the large number of possible hyperplane splits. The axis-parallel algorithms find the optimal split at a node by conducting an exhaustive search of all 푛 · 푝 possible splits. When we remove the axis-parallel restriction, this number of possible splits increases to $2^p \binom{n}{p}$, because every subset of size 푝 from the 푛 points defines a 푝-dimensional hyperplane which can then be rotated slightly to divide the subset of 푝 points in all $2^p$ ways. This large increase in the number of possible splits makes an exhaustive search prohibitively expensive; the problem of finding the hyperplane split with the lowest misclassification error is NP-hard [59].

There have been many proposals that yield approximate solutions for the best hyperplane split. The first method for trees with hyperplane splits was CART with linear combinations (CART-LC) [25], which works by perturbing the coefficients in the hyperplane sequentially until a local optimum is attained. The main limitation of this method is that the search procedure is deterministic and always converges to the same optimum for a given set of points, which may be far from the global optimum. Another method using a perturbation-based approach is SADT (Simulated Annealing for Decision Trees) [58], which uses simulated annealing to perturb an existing hyperplane towards optimality. These perturbation approaches were combined to give OC1 [91], which first uses the deterministic procedure of CART-LC to find a local optimum, followed by the simulated annealing of SADT to escape this local optimum. This is coupled with searching from multiple randomized starting points to increase the chances of finding a high-quality local optimum. Most other approaches for trees with hyperplane splits apply other methods recursively to generate the splits in the tree, such as the approach of [116], support vector machines [9], linear discriminant analysis [82, 83], and Householder transformations [120]. Most of these approaches do not have easily accessible implementations that can be used on practically-sized datasets, and as such, the use of decision trees with hyperplanes in the statistics/machine learning community has been limited. We also note that these approaches share the same major flaw as univariate decision trees, in that the splits are formed one-by-one using top-down induction, and so the choice of split cannot be guided by the possible influence of future splits.

In this chapter, we consider generating optimal classification trees with hyperplanes. Specifically, we extend our approaches from Chapter 2 to generate trees with hyperplane splits, and show that these trees significantly improve upon state-of-the-art methods.

3.1 OCT with Hyperplanes via MIO

In Section 2.2, we developed an MIO formulation for finding the optimal classification tree. This formulation only considered decision trees that use a single variable in their splits at each node. In this section, we show that it is simple to extend our MIO formulation for axis-parallel trees to yield a problem for determining the optimal decision tree with hyperplanes. When viewed from an MIO perspective, the axis-parallel and hyperplane problems are very similar, and modeling the hyperplane problem only requires minor modifications to the formulation for the axis-parallel problem. This shows the flexibility and power of modeling the problem using MIO.

In a decision tree with hyperplanes, we are no longer restricted to choosing a single variable upon which to split, and instead can choose a hyperplane split at each node. The variables a푡 will be used to model the split at each node as before, except we relax (2.4) and instead choose $\mathbf{a}_t \in [-1, 1]^p$ at each branch node 푡. We must modify (2.2) to account for the possibility that these elements are negative by dealing with the absolute values instead:

$$\sum_{j=1}^{p} |a_{jt}| \le d_t, \qquad \forall t \in \mathcal{T}_B,$$

which can be linearized using auxiliary variables to track the value of |푎푗푡|:

$$\sum_{j=1}^{p} \hat{a}_{jt} \le d_t, \qquad \forall t \in \mathcal{T}_B,$$
$$\hat{a}_{jt} \ge a_{jt}, \qquad \forall t \in \mathcal{T}_B,\ j \in [p],$$
$$\hat{a}_{jt} \ge -a_{jt}, \qquad \forall t \in \mathcal{T}_B,\ j \in [p].$$

As before, these constraints force the split to be all zeros if 푑푡 = 0 and no split is applied, otherwise imposing no restriction on a푡.

We now have that $\mathbf{a}_t^T \mathbf{x}_i \in [-1, 1]$, so we replace (2.3) with:

$$-d_t \le b_t \le d_t, \qquad \forall t \in \mathcal{T}_B.$$

Now we consider the split constraints (2.9) and (2.14). Previously we had that the range of $(\mathbf{a}_t^T \mathbf{x}_i - b_t)$ was $[-1, 1]$, whereas it is now $[-2, 2]$. This means we need $M = 2$ to ensure that the constraint is trivially satisfied when 푧푖푡 = 0. The constraints therefore become:

$$\mathbf{a}_m^T \mathbf{x}_i < b_m + 2(1 - z_{it}), \qquad \forall i \in [n],\ t \in \mathcal{T}_L,\ m \in \mathcal{L}(t), \tag{3.1}$$
$$\mathbf{a}_m^T \mathbf{x}_i \ge b_m - 2(1 - z_{it}), \qquad \forall i \in [n],\ t \in \mathcal{T}_L,\ m \in \mathcal{R}(t). \tag{3.2}$$

Finally, as before we need to convert the strict inequality in (3.1) to a non-strict version. We do this by introducing a sufficiently small constant 휇:

$$\mathbf{a}_m^T \mathbf{x}_i + \mu \le b_m + (2 + \mu)(1 - z_{it}), \qquad \forall i \in [n],\ t \in \mathcal{T}_L,\ m \in \mathcal{L}(t).$$

Note that we need to include 휇 in the rightmost term to ensure the constraint is always satisfied when 푧푖푡 = 0. Unlike in the axis-parallel case, we do not choose a value for 휇 in an intelligent manner, and instead need to choose a small constant. This is because a hyperplane that separates two points can be rotated to come arbitrarily close to each point, and it is not always feasible to find the minimum separation distance like for axis-parallel splits, as this would require enumeration of all $2^p \binom{n}{p}$ hyperplane splits. We must take care when choosing the value of 휇. A value that is too small can lead to numerical issues in the MIO solver, while one that is too large reduces the size of the feasible region, potentially reducing the quality of the optimal solution. We take 휇 = 0.005 as a compromise between these extremes.

In the axis-parallel problem, we controlled the complexity of the tree by penalizing the number of splits. In the hyperplane regime, a single split may use multiple variables, and it seems natural to treat splits with greater numbers of variables as more complex than those with fewer. We can instead penalize the total number of variables used in the splits of the tree, which we note in the axis-parallel case is exactly the number of splits in the tree. To achieve this, we introduce binary variables 푠푗푡 to track if the 푗th feature is used in the 푡th split:

$$-s_{jt} \le a_{jt} \le s_{jt}, \qquad \forall t \in \mathcal{T}_B,\ j \in [p].$$

We must also make sure that the values of 푠푗푡 and 푑푡 are compatible. The following constraints ensure that 푑푡 = 1 if and only if any variable is used in the split:

$$s_{jt} \le d_t, \qquad \forall t \in \mathcal{T}_B,\ j \in [p],$$
$$\sum_{j=1}^{p} s_{jt} \ge d_t, \qquad \forall t \in \mathcal{T}_B.$$

Finally, we modify the definition of the complexity 퐶 to penalize the number of variables used across the splits in the tree rather than simply the number of splits:

$$C = \sum_{t \in \mathcal{T}_B} \sum_{j=1}^{p} s_{jt}.$$

Combining all of these changes yields the complete Optimal Classification Trees with Hyperplanes (OCT-H) model:

$$\begin{aligned}
\min \quad & \frac{1}{\hat{L}} \sum_{t \in \mathcal{T}_L} L_t + \alpha \cdot C && \text{(3.3)}\\
\text{s.t.} \quad & L_t \ge N_t - N_{kt} - n(1 - c_{kt}), && \forall k \in [K],\ t \in \mathcal{T}_L,\\
& L_t \le N_t - N_{kt} + n c_{kt}, && \forall k \in [K],\ t \in \mathcal{T}_L,\\
& L_t \ge 0, && \forall t \in \mathcal{T}_L,\\
& N_{kt} = \sum_{i:\, y_i = k} z_{it}, && \forall k \in [K],\ t \in \mathcal{T}_L,\\
& N_t = \sum_{i=1}^{n} z_{it}, && \forall t \in \mathcal{T}_L,\\
& \sum_{k=1}^{K} c_{kt} = l_t, && \forall t \in \mathcal{T}_L,\\
& C = \sum_{t \in \mathcal{T}_B} \sum_{j=1}^{p} s_{jt},\\
& \mathbf{a}_m^T \mathbf{x}_i + \mu \le b_m + (2 + \mu)(1 - z_{it}), && \forall i \in [n],\ t \in \mathcal{T}_L,\ m \in \mathcal{L}(t),\\
& \mathbf{a}_m^T \mathbf{x}_i \ge b_m - 2(1 - z_{it}), && \forall i \in [n],\ t \in \mathcal{T}_L,\ m \in \mathcal{R}(t),\\
& \sum_{t \in \mathcal{T}_L} z_{it} = 1, && \forall i \in [n],\\
& z_{it} \le l_t, && \forall i \in [n],\ t \in \mathcal{T}_L,\\
& \sum_{i=1}^{n} z_{it} \ge N_{\min} l_t, && \forall t \in \mathcal{T}_L,\\
& \sum_{j=1}^{p} \hat{a}_{jt} \le d_t, && \forall t \in \mathcal{T}_B,\\
& \hat{a}_{jt} \ge a_{jt}, && \forall t \in \mathcal{T}_B,\ j \in [p],\\
& \hat{a}_{jt} \ge -a_{jt}, && \forall t \in \mathcal{T}_B,\ j \in [p],\\
& -s_{jt} \le a_{jt} \le s_{jt}, && \forall t \in \mathcal{T}_B,\ j \in [p],\\
& s_{jt} \le d_t, && \forall t \in \mathcal{T}_B,\ j \in [p],\\
& \sum_{j=1}^{p} s_{jt} \ge d_t, && \forall t \in \mathcal{T}_B,\\
& -d_t \le b_t \le d_t, && \forall t \in \mathcal{T}_B,\\
& d_t \le d_{p(t)}, && \forall t \in \mathcal{T}_B \setminus \{1\},\\
& z_{it}, l_t, c_{kt} \in \{0, 1\}, && \forall i \in [n],\ k \in [K],\ t \in \mathcal{T}_L,\\
& d_t, s_{jt} \in \{0, 1\}, && \forall j \in [p],\ t \in \mathcal{T}_B.
\end{aligned}$$

Note that from this formulation we can easily obtain the OCT problem as a special case by restoring the binary constraints on a푡. The close relationship between the axis-parallel and hyperplane problems reinforces the notion that the MIO formulation is a natural way of viewing the decision tree problem.
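To make these modifications concrete, the following is a minimal JuMP sketch of only the hyperplane-specific variables and constraints of (3.3) for a toy instance; the leaf-assignment and loss parts of the model are omitted, and the sizes and solver choice are illustrative placeholders (the experiments in this thesis use Gurobi).

```julia
using JuMP
import HiGHS   # any JuMP-compatible MIO solver; Gurobi is used in the thesis experiments

p  = 4          # number of features (placeholder)
TB = 1:3        # branch nodes of a small tree (placeholder)

model = Model(HiGHS.Optimizer)
@variable(model, -1 <= a[1:p, TB] <= 1)      # hyperplane coefficients a_jt
@variable(model, ahat[1:p, TB] >= 0)         # auxiliary variables tracking |a_jt|
@variable(model, b[TB])                      # split thresholds
@variable(model, d[TB], Bin)                 # d_t = 1 if node t applies a split
@variable(model, s[1:p, TB], Bin)            # s_jt = 1 if feature j is used at node t

@constraint(model, [j in 1:p, t in TB], ahat[j, t] >=  a[j, t])
@constraint(model, [j in 1:p, t in TB], ahat[j, t] >= -a[j, t])
@constraint(model, [t in TB], sum(ahat[j, t] for j in 1:p) <= d[t])
@constraint(model, [j in 1:p, t in TB], -s[j, t] <= a[j, t])
@constraint(model, [j in 1:p, t in TB],  a[j, t] <= s[j, t])
@constraint(model, [j in 1:p, t in TB],  s[j, t] <= d[t])
@constraint(model, [t in TB], sum(s[j, t] for j in 1:p) >= d[t])
@constraint(model, [t in TB], -d[t] <= b[t])
@constraint(model, [t in TB],  b[t] <= d[t])

# Complexity term penalized in the objective: total number of features used.
@expression(model, C, sum(s[j, t] for j in 1:p, t in TB))
```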

3.2 OCT with Hyperplanes via Local Search

In this section, we extend the local search heuristic presented in Section 2.3 to generate trees with hyperplane splits, thereby avoiding the scaling issues of MIO-based approaches and allowing us to generate Optimal Classification Trees with Hyperplanes in a tractable and effective manner.

The MIO formulation for the OCT-H problem exhibits the same scaling issues as in the parallel split case. The number of variables in the formulation grows incredibly fast with the depth of the tree and number of training points, quickly causing the MIO problem to become intractable. Figure 3-1 shows the performance of this MIO formulation for OCT-H on the "Blood Transfusion Service Center" dataset from the UCI Machine Learning Repository [77], along with CART and OCT using local search. This small dataset has 푛 = 748 points with 푝 = 4 features and 퐾 = 2 classes. We can see at low depths that the OCT-H trees with hyperplane splits are significantly stronger than the other trees which use parallel splits, which we would expect as we know hyperplane splits are more powerful. However, as the depth increases, the accuracy does not improve as fast as for the parallel tree methods, and at depths 5 and 6 does not improve upon the CART warm start at all. This is a clear indication that the problem is too large and difficult for the solver to find improved solutions, as we expect hyperplane trees to be at least as powerful as the corresponding parallel trees at each depth. We also can see that the MIO-based method is again many orders of magnitude slower than the others, which limits its potential for practical application.

Figure 3-1: Training accuracy (%) and running time for each method on "Blood Transfusion Service Center" dataset. (Panels: training accuracy and time (s, log scale) vs. maximum depth 1–6; series: CART, OCT (local search), OCT-H (MIO + Gurobi).)

To resolve the performance issues of the MIO-based method, we will use a local search heuristic similar to the one presented in Section 2.3, but with modifications to allow generating trees with hyperplane splits. To do this, we simply augment the procedure OptimizeNodeParallel from Algorithm 2.2 to search for the best hyperplane split in addition to the existing steps.

We use a perturbation approach to optimize the coefficients in the hyperplane split, similar to that employed by CART-LC [25] and OC1 [91]. Underpinning this approach is the observation that it is possible to optimize the value of a single coefficient at a time using a linear search in 푂(푛) time. To see this, consider the hyperplane split defined by $\hat{\mathbf{a}}$ and $\hat{b}$. A point x푖 will follow the lower branch of this split if

$$\hat{V}_i \triangleq \sum_{j=1}^{p} \hat{a}_j x_{ij} - \hat{b} < 0, \tag{3.4}$$

otherwise the upper branch will be taken.

Now consider changing the value of a single coefficient in the 푘th feature from $\hat{a}_k$ to $a_k$. The point x푖 will now follow the lower branch of this split if

$$\hat{V}_i + (a_k - \hat{a}_k) x_{ik} < 0. \tag{3.5}$$

If $x_{ik} > 0$, the point 푖 follows the lower split if and only if

$$a_k < \frac{\hat{a}_k x_{ik} - \hat{V}_i}{x_{ik}}, \tag{3.6}$$

and otherwise we have $x_{ik} = 0$ (since the data are normalized to $[0, 1]$) and so this test reduces to $\hat{V}_i < 0$ for point 푖 to follow the lower split. Combining these, we have that point 푖 follows the lower split if and only if $a_k < U_{ik}$, where

$$U_{ik} = \begin{cases} \dfrac{\hat{a}_k x_{ik} - \hat{V}_i}{x_{ik}}, & x_{ik} > 0,\\[1ex] -\infty, & x_{ik} = 0,\ \hat{V}_i \ge 0,\\[0.5ex] +\infty, & x_{ik} = 0,\ \hat{V}_i < 0. \end{cases} \tag{3.7}$$

We calculate these thresholds 푈푖푘 for every training point at the node under consideration, giving us the critical values at which points will change from one side of the hyperplane to the other. We sort these threshold values and then efficiently optimize the value of 푎푘 over ℝ by scanning the critical values in order and moving points from the lower branch to the upper branch one-by-one, calculating the resulting error at each step. We return the value of 푎푘 that gave the lowest error, making this coefficient locally optimal. This search is nearly identical to the searches performed by both the standard CART algorithm and our procedure BestParallelSplit from Algorithm 2.3, except with the use of the values 푈푖푘 in place of 푥푖푘.
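The following is a minimal Julia sketch of this linear search for a single coefficient, under simplifying assumptions: the split is scored as if it had two leaf children using misclassification error recomputed from scratch at each candidate, whereas the actual procedure evaluates the full subtree loss and updates the error incrementally during the scan.

```julia
# Number of points not in the majority class of a leaf.
function misclassified(labels)
    isempty(labels) && return 0
    counts = Dict{eltype(labels),Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    return length(labels) - maximum(values(counts))
end

# Minimal sketch of optimizing a single coefficient a_k by scanning the
# critical values U_ik of (3.7).
function best_coefficient(X, y, a, b, k)
    n = size(X, 1)
    V = X * a .- b                                   # current values V̂_i
    U = [X[i, k] > 0 ? (a[k] * X[i, k] - V[i]) / X[i, k] :
         (V[i] >= 0 ? -Inf : Inf) for i in 1:n]      # critical values (3.7)
    candidates = sort(U)
    best_ak, best_err = a[k], Inf
    for m in 1:n-1
        c = (candidates[m] + candidates[m+1]) / 2    # place a_k between critical values
        isfinite(c) || continue
        lower = [V[i] + (c - a[k]) * X[i, k] < 0 for i in 1:n]
        err = misclassified(y[lower]) + misclassified(y[.!lower])
        if err < best_err
            best_ak, best_err = c, err
        end
    end
    return best_ak, best_err
end

X = rand(20, 3); y = rand(1:2, 20)
best_coefficient(X, y, [0.5, 0.2, -0.3], 0.4, 1)
```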

We also consider changing the hyperplane split by removing features from the split, which reduces the complexity and so may improve the overall loss. To do this, we calculate the values of each point corresponding to the split after the 푘th coefficient has been deleted:

$$W_{ik} \triangleq \hat{V}_i + \hat{b} - \hat{a}_k x_{ik}, \tag{3.8}$$

and therefore point 푖 follows the lower branch of the split with the 푘th coefficient deleted if $W_{ik} < b$. We can then sort these critical values and scan them in the same way as above to find the optimal value for 푏 that minimizes the loss function. This gives us the optimal threshold for the hyperplane after the 푘th coefficient is deleted, which we accept if this transformation improves the loss.

We repeat this search to optimize the coefficients by perturbation and deletion in turn until no improvement is possible for any coefficient, giving us a hyperplane split that is a local minimum for the error at the node. The full details of this search procedure are given in Algorithm 3.1.

We find empirically that there are typically many such local minima at a node in the tree, and so which minimum is reached depends on the starting solution. We address this by considering perturbations from many starting hyperplane splits, similar to how the local search algorithm considers many starting trees. The first solution we consider is always the best parallel split at that node, ensuring that our final solution is at least as good as this best parallel split. We then also use 퐻 random hyperplane splits as starting solutions, where 퐻 is a parameter of the problem. Increasing the number of hyperplane restarts will generally improve the quality of the local minimum found at the node, but will take longer to run. We have found that setting 퐻 to 5–10 gives the best tradeoff between accuracy and runtime.

Algorithm 3.1 BestHyperplaneSplit
Input: Starting subtree T; training data X, y
Output: Subtree T with high-quality hyperplane split at root; error of subtree T
1: repeat
2:   errorprev ← loss(T, X, y)
3:   errorbest ← errorprev; Tbest ← T                ◁ initialize best solution for this pass
4:   for 푘 = shuffle(1, . . . , 푝) do                ◁ loop over all dimensions
5:     values ← {푈푖푘 : 푖 = 1, . . . , 푛}             ◁ critical values for perturbing
6:     sort values in ascending order
7:     for 푖 = 1, . . . , 푛 − 1 do                   ◁ loop over all split placements
8:       푐 ← (values푖 + values푖+1)/2
9:       replace 푎푘 in split at root of T with 푐
10:      if minleafsize(T) ≥ 푁min then               ◁ check feasibility
11:        error ← loss(T, X, y)
12:        if error < errorbest then                 ◁ save split if better
13:          errorbest ← error
14:          Tbest ← T
15:    if 푎푘 ≠ 0 at root of T then                   ◁ can only delete if present
16:      values ← {푊푖푘 : 푖 = 1, . . . , 푛}           ◁ critical values for deleting
17:      sort values in ascending order
18:      for 푖 = 1, . . . , 푛 − 1 do                 ◁ loop over all split placements
19:        푏 ← (values푖 + values푖+1)/2
20:        replace 푎푘 in split at root of T with 0
21:        change split threshold at root of T to 푏
22:        if minleafsize(T) ≥ 푁min then             ◁ check feasibility
23:          error ← loss(T, X, y)
24:          if error < errorbest then               ◁ save split if better
25:            errorbest ← error
26:            Tbest ← T
27:   T ← Tbest                                      ◁ keep the best split found in this pass
28: until errorprev = errorbest                      ◁ no further improvement possible
29: return Tbest, errorbest

This process is detailed in Algorithm 3.2, and the full local search procedure for trees with hyperplane splits is then simply the same as shown in Algorithm 2.1 with OptimizeNodeParallel replaced with OptimizeNodeHyperplane.

Algorithm 3.2 OptimizeNodeHyperplane
Input: Subtree T to optimize; training data X, y
Output: Subtree T with best parallel or hyperplane split at root
1: if T is a branch then
2:   Tlower, Tupper ← children(T)
3: else
4:   Tlower, Tupper ← new leaf nodes
5: errorbest ← loss(T, X, y)                         ◁ error of current root
6:
7: Tpara, errorpara ← BestParallelSplit(Tlower, Tupper, X, y)
8: if errorpara < errorbest then
9:   T, errorbest ← Tpara, errorpara                 ◁ replace with parallel split
10: errorlower ← loss(Tlower, X, y)
11: if errorlower < errorbest then
12:   T, errorbest ← Tlower, errorlower              ◁ replace with lower child
13: errorupper ← loss(Tupper, X, y)
14: if errorupper < errorbest then
15:   T, errorbest ← Tupper, errorupper              ◁ replace with upper child
16:
17: for ℎ = 1, . . . , 퐻 + 1 do
18:   if ℎ = 1 then
19:     Tstart ← Tpara                               ◁ use best parallel split as first start
20:   else
21:     generate random split a, 푏
22:     Tstart ← branch node aᵀx < 푏 with children Tlower, Tupper
23:   Thyper, errorhyper ← BestHyperplaneSplit(Tstart, X, y)
24:   if errorhyper < errorbest then
25:     T, errorbest ← Thyper, errorhyper            ◁ replace with hyperplane split
26: return T

To examine the performance of this local search approach for hyperplane splits, we return to the example using the "Blood Transfusion Service Center" dataset. Figure 3-2 shows the performance of the OCT-H local search method with 퐻 = 0, 1, and 5 random hyperplane restarts, alongside CART, OCT, and OCT-H with MIO. We see that the OCT-H methods using local search improve significantly upon the accuracy of the other three methods. The accuracy of OCT-H increases with the number of random hyperplane restarts, as we would expect since the splits we are finding are closer to optimality. Interestingly, OCT-H with no hyperplane restarts does not perform any better than OCT at depth 1 and is outperformed by the MIO-based OCT-H. However, once a single restart is added, the local search matches the accuracy of the MIO-based method. This would indicate that simply perturbing the best parallel split to find a hyperplane does not deliver solutions of the highest quality, and we can improve upon this easily and cheaply by adding random hyperplane restarts.

Figure 3-2: Training accuracy (%) and running time for each method on "Blood Transfusion Service Center" dataset. (Panels: training accuracy and time (s, log scale) vs. maximum depth 1–6; series: CART, OCT (local search), OCT-H (MIO + Gurobi), OCT-H (LS, 0 restarts), OCT-H (LS, 1 restart), OCT-H (LS, 5 restarts).)

In terms of runtime, the OCT-H methods take longer than OCT as we would expect, and the runtime increases as the number of hyperplane restarts increases. These increases are small relative to the runtime of the MIO-based OCT-H, and are still reasonable for practical applications, since OCT-H with 5 hyperplane restarts and 100 tree restarts takes only 30 seconds, and delivers significant improvements in return for this increased runtime. We also note as before that these runtimes are for a single-core scenario; in a multi-core setting the runtimes could be just a few seconds.

Complexity analysis of hyperplane local search

We finish the presentation of the local search algorithm for the OCT-H problem by deriving its time complexity. Recall in Section 2.3 we showed that the complexities of CART and OCT were $O(n^2 p)$ and $O(n^3 p)$, respectively. First, we consider the BestHyperplaneSplit function in Algorithm 3.1, which is very similar to the BestParallelSplit in Algorithm 2.3. The biggest difference is that the 푈푖푘 and 푊푖푘 values have to be calculated using the current tree, and thus cannot be precomputed and presorted as in the parallel case. However, once we have these values sorted, the two inner loops are the same as for BestParallelSplit, and so have cost of $O(npK)$. We can calculate 푈푖푘 and 푊푖푘 together for all 푖 and 푘 in $O(np)$ time, and then sorting these in each feature takes $O(n \log n)$ time, for a total cost of $O(np) + p \cdot O(n \log n) = O(np \log n)$. This means the BestHyperplaneSplit function has a total cost of $O(np(K + \log n))$.

Next, we consider OptimizeNodeHyperplane from Algorithm 3.2, which is again nearly identical to its parallel counterpart OptimizeNodeParallel in Algorithm 2.2, which has a cost of $O(npK)$. The only difference is that we have added the final loop that generates hyperplane splits. This loop first generates a random hyperplane at cost $O(p)$, followed by running BestHyperplaneSplit at cost $O(np(K + \log n))$, so each iteration of this loop runs in $O(np(K + \log n))$ time. The loop runs for 퐻 + 1 iterations, where 퐻 is a constant and so does not affect the complexity. This means the total cost of OptimizeNodeHyperplane is simply $O(np(K + \log n))$.

Finally, we run the LocalSearch function from Algorithm 2.1 with OptimizeNodeHyperplane in place of OptimizeNodeParallel. Using a similar approach as in the parallel case, we can compute point-to-leaf assignments and errors of all subtrees outside the inner loop, at a cost of $O(npT + KT) = O(npT)$, where the new factor of 푝 relative to the parallel case comes from each split in the tree using up to 푝 features for the hyperplane rather than one for parallel splits. The inner loop runs over 푇 nodes as in the parallel case, giving a runtime of $O(npT) + T \cdot O(np(K + \log n)) = O(npKT + npT \log n)$ per local search iteration. Using a similar argument to before, we can bound the number of local search iterations by $O(n)$ if the complexity parameter 훼 is zero, giving a final overall runtime for the local search of $O(n^2 pT(K + \log n))$.

In the absolute worst case where 푇 is $O(n)$, the runtime of the local search for OCT-H would be $O(n^3 p(K + \log n))$, which is very similar to the runtime of the OCT local search of $O(n^3 pK)$, and so the addition of hyperplane splits has not significantly increased the worst-case cost.

Under our set of realistic assumptions that the number of nodes in the tree and the number of local search iterations are both $O(\log n)$, the OCT-H runtime would be $O(np \log^2 n\,(K + \log n))$, compared to $O(npK \log^2 n)$ for OCT and $O(npK \log n)$ for CART. As in the worst case, the only difference between OCT-H and OCT is replacing a factor of 퐾 with $(K + \log n)$.

3.3 The Accuracy-Interpretability Tradeoff

In this section, we discuss the tradeoff between the interpretability and accuracy of prediction models, first in the wider context of classification problems in general, and then in a comparison between decision trees with parallel and hyperplane splits.

The current state of the art in classification

For the past 30 years, CART has remained among the state-of-the-art methods for inducing decision trees. As a class of methods, decision tree learners are unparalleled in their interpretability; to quote Breiman, "On interpretability, trees rate an A+" [30]. They do not however achieve state-of-the-art accuracies that are competitive with the best methods for classification problems, although there are tree-based methods that do achieve such accuracies.

One such method is random forests [29], which works by generating a forest of CART trees, and then makes predictions by averaging the predictions of each tree in the forest. To increase variance among the trees in the forest, each tree is generated using a different bootstrap sample of the training points, and at each stage of the tree growing process, the feature selected for the split can only be chosen from a random sample of the features (typically of size √푝). The trees in the forest can be trained independently, and so the whole training process is trivially parallelizable and can achieve runtimes comparable to CART.

Another tree-based method that gives state-of-the-art accuracy is gradient-boosted trees [49]. Like random forests, tree boosting also generates a collection of trees, except the trees are generated iteratively. The prediction of the tree ensemble is given by a weighted average of the predictions of the trees inside, and each new tree is trained to fit the residuals of the current ensemble before being added to the ensemble with a weight that minimizes the overall training error of the ensemble. In this way, the boosting process iteratively refines the predictions of the ensemble by fitting trees that focus on the points in the training set where the error is currently largest. Tree boosting delivers accuracies that are among the highest for classification and regression problems, and there exist highly-optimized implementations such as XGBoost [35].

Both of these methods achieve state-of-the-art accuracies and are ensemble methods, meaning they average a large collection of trees rather than generating a single decision tree like CART. The interpretability of these methods is therefore vastly diminished. This leads to the current dilemma that machine learning practitioners face in applications: one can either use a method like CART that is interpretable but does not achieve a state-of-the-art accuracy, or one can use an ensemble method like random forests or tree boosting that delivers state-of-the-art accuracy but throws away interpretability.

Our aspiration is that Optimal Trees can allow us to do away with this tradeoff entirely. Our goal is for the improved accuracy of Optimal Trees to reach state-of-the-art levels, thus making them competitive with random forests and boosting in terms of accuracy whilst maintaining the interpretability of a single decision tree.

Interpretability of Parallel and Hyperplane Trees

If we consider a single split in a decision tree, there is no question that a parallel split is more easily interpreted than a hyperplane split, particularly as the number of features used in the hyperplane split increases. As humans, we typically cannot easily make sense of the linear combination of features in a hyperplane split; instead, we are better suited to interpreting just a single feature at a time. This means that when we consider a single split in a decision tree, the increased power of using a hyperplane split is balanced by the reduction in interpretability compared to a parallel split. However, although hyperplane splits are less interpretable than parallel splits, it is not necessarily the case that trees with hyperplane splits are less interpretable than those with parallel splits. This is because the increased power of the hyperplane splits can enable the tree to achieve equivalent accuracies to parallel trees with fewer splits, and so the complexity of the tree is reduced. This means we need to consider the tradeoff of the reduction in split interpretability against the gain in interpretability through reduction of tree complexity.

Figure 3-3: Comparison of optimal trees with parallel and hyperplane splits on the "Banknote Authentication" dataset. The depth shown for each split type is the smallest depth that gave a tree with training error below 0.5%. (a) Optimal tree with hyperplane splits (depth 2): the root split is 5.15푥1 + 2.41푥2 + 3.55푥3 − 0.04푥4 < 5.77, followed by the parallel split 푥3 < 8.89. (b) Optimal tree with parallel splits (depth 5 with 18 splits).

An example of such a scenario is shown in Figure 3-3, which shows two trees that were trained on the “Banknote Authentication” dataset from the UCI Machine Learning Repository [77], one with hyperplane splits and the other with parallel splits. We trained each of these trees by increasing the maximum depth until the training error fell below 0.5%. This gives us trees that have relatively few splits, yet still achieve near-perfect training error. We see that the tree with hyperplane splits is depth 2, and only has two splits, whereas the tree with parallel splits is depth 5 with 18 splits. Moreover, one of the splits in the hyperplane tree is actually a parallel split, so there is no reduction in interpretability. This clearly showcases the tradeoff between parallel and hyperplane trees; we can reduce the size and complexity of the tree by nearly 90% by using a single hyperplane split at the root, but in doing so we increase the complexity of this split reducing its interpretability.

The question of whether parallel or hyperplane trees are more appropriate for a given problem is very context-dependent. We expect hyperplane trees to have more power and hence higher accuracy, and this may come at the expense of some

interpretability. However, as the example of Figure 3-3 shows, it is not always true that a tree with hyperplane splits is less interpretable than one with parallel splits, due to the reduction in tree size permitted by the increased power of hyperplane splits. We also note that the OCT-H model rewards sparsity in the hyperplane splits by penalizing the number of features used in each split. This means that the splits in OCT-H trees may not reduce interpretability too greatly, as the majority of these splits typically end up using only a few features.

3.4 Experiments with Synthetic Datasets

In this section, we investigate the performance of Optimal Classification Trees with Hyperplanes on a suite of experiments using synthetically-generated data. The goal of these experiments is to measure the performance of OCT-H relative to both OCT and CART to see the impact of allowing hyperplane splits, and also relative to random forests and boosting to evaluate OCT-H against methods with state-of-the-art accuracy. The experimental setup is the same as in Section 2.5. We randomly generate decision trees and then use these trees to generate training and testing datasets. Unless otherwise mentioned, we use the same default parameters as in Section 2.5 and use H = 5 random hyperplane restarts in the OCT-H method. In each experiment, we compare the performance of CART, OCT and OCT-H, along with random forests (using the randomForest package in R [76]) and gradient-boosted trees (using the XGBoost library [35] with learning rate η = 0.1). The best depths for random forests and boosting were determined through validation, and we set the number of trees for random forests and boosting equal to the number of random restarts used by the Optimal Tree methods.
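As a rough illustration of this benchmark configuration, the following Python sketch shows how the ensemble baselines could be set up. The thesis itself uses the R randomForest package and the XGBoost library, so the scikit-learn and xgboost Python wrappers, the variable names, and the particular depth value here are illustrative assumptions only.

    # Illustrative analogue of the ensemble baselines (not the thesis code).
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier

    n_restarts = 100   # number of trees, matched to the restart budget of OCT/OCT-H
    max_depth = 6      # in the experiments the best depth is chosen by validation

    rf = RandomForestClassifier(n_estimators=n_restarts, max_depth=max_depth)
    boost = XGBClassifier(n_estimators=n_restarts, learning_rate=0.1,
                          max_depth=max_depth)
    # rf.fit(X_train, y_train); boost.fit(X_train, y_train)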

Ground truth trees with parallel splits

The first set of experiments we conduct is the same as in Section 2.5, except with the addition of the three new methods.

Figure 3-4: Synthetic experiments showing the effect of training set size.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus number of training points.]

There are two main aims in repeating these tests with the new methods. The first is to examine the performance of OCT-H on data that is generated from a parallel tree structure, and to see whether it is able to train trees that give performance similar to OCT. It seems possible that the additional flexibility and power of the OCT-H model could lead it to overfit the training data with trees of high complexity, so it is important to check whether this is the case. The other goal is to benchmark random forests and boosting on these tree-structured datasets to give additional context to the performance results of OCT and OCT-H.

The effect of the amount of training data

The first experiment again examines the effect of increasing training set size on the performance of each method. The results are shown in Figure 3-4. We can see that OCT and OCT-H have very similar performance in nearly all cases; the FDR of OCT-H is slightly higher than that of OCT for smaller training set sizes, indicating that the increased power leads to trees with slightly more noise when not enough data is available. However, this quickly vanishes as the number of points grows, and OCT-H even has a slightly lower FDR for the largest datasets, roughly three times smaller than that of CART. For very small training set sizes, boosting has significantly higher out-of-sample accuracy than all other methods, but past 500 training points the highest accuracies are shared by OCT, OCT-H and boosting. Random forests are weaker than these three methods, but still offer an advantage over CART that diminishes with increasing training set size.

Figure 3-5: Synthetic experiments showing the effect of number of features.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus number of features.]

The effect of dimensionality

The second experiment shows the performance of the methods as we change the number of features in the data. The results are presented in Figure 3-5. We can see that OCT, OCT-H and boosting all have roughly the same out-of-sample accuracy at all dimensions, while the accuracy of random forests quickly falls as the dimension increases. The TDR of OCT-H is initially similar to OCT, and then falls to the same level as CART at the higher numbers of features. Similarly, the FDR starts close to that of OCT, then quickly rises past CART to 50% at the highest dimension. This is unsurprising: the true splits contain just a single feature, so any hyperplane split with more than one feature is deemed incorrect, and as the number of features increases we would expect to see more splits involving multiple features because there are many more such splits to consider.

The effect of the number of classes

The third experiment varies the number of classes in the ground truth tree, and the results are shown in Figure 3-6. Again we see near-identical performance for OCT and OCT-H in all measures. For binary problems, boosting has high accuracy, but this accuracy decreases rapidly as the number of classes is increased, while the accuracy of OCT and OCT-H remains relatively constant.

Figure 3-6: Synthetic experiments showing the effect of number of classes.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus number of classes.]

This would seem to indicate that, like CART, boosting is not well-suited to problems with many classes. The accuracy of random forests also remains relatively constant as the number of classes increases, but at a significantly lower level, closer to CART than to OCT and OCT-H.

Figure 3-7: Synthetic experiments showing the effect of number of random restarts.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus number of random restarts.]

The effect of the number of random restarts

The fourth experiment examines the effect of the number of random restarts in the local search on the performance of OCT and OCT-H. For comparison, we also set the number of trees used in random forests and boosting to the same number, so that each method is training the same number of trees. Figure 3-7 shows the results.

Figure 3-8: Synthetic experiments showing the effect of feature noise.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus feature noise (%).]

We see that OCT-H performs worse than OCT in all measures at lower numbers of restarts, but largely catches up to OCT between 100 and 1000 restarts. This showcases that the increased complexity and power of OCT-H means it is not as stable as OCT until a moderate number of restarts is used. The accuracy of boosting at around 100 trees is at a similar level to OCT and OCT-H, but as more trees are added it does not continue to improve, unlike OCT and OCT-H. Random forests level out at around 100 trees at a performance close to that of CART. These results demonstrate that the Optimal Tree methods are able to make use of additional trees in an effective manner, whereas the ensemble methods appear to reach a peak and largely do not improve from there.

The effect of noise in the training data

The fifth experiment measures the effect of the amount of noise in the features of the data. As in the corresponding experiment in Section 2.5, we add uniform noise to f% of the training points. The results are shown in Figure 3-8. We can see that all methods decline in accuracy as the amount of feature noise increases, but boosting decreases faster than the other methods. With no noise, boosting performs comparably to OCT and OCT-H, but at the highest levels of noise it is significantly weaker than both, comparable to CART and random forests. The TDR of OCT and OCT-H is very similar, while OCT-H has a slightly worse FDR than OCT, which we might expect since the feature noise could make it more difficult to identify the correct split out of the many more possible hyperplane splits.

Figure 3-9: Synthetic experiments showing the effect of label noise.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus label noise (%).]

The final experiment considers the effect of adding label noise to the training data. This is again achieved by flipping the labels of a random f% of the points. Figure 3-9 shows the results. We can immediately see that boosting is very sensitive to the presence of label noise. When there is no noise, it performs strongest along with OCT and OCT-H, but at around 5% label noise it is tied for weakest with CART, before falling significantly further as more noise is added. Admittedly, the highest level of noise involves 25% of the points being labeled incorrectly, but OCT and OCT-H are still able to achieve 95% out-of-sample accuracy in this case, demonstrating that they can cope with such high levels of noise. As we saw in the previous test with feature noise, the FDR of OCT-H suffers as the amount of label noise increases, which is again likely due to the increased difficulty in exactly identifying the correct splits in such a noisy environment.

Summary for ground truth trees with parallel splits

We now summarize the results of these repeated experiments with parallel ground truth trees from Section 2.5. In all cases, OCT and OCT-H had near-identical results, indicating that the additional power of OCT-H is not leading it to overfit with more complicated trees, and instead it is learning the structure in the data with ability

similar to the parallel-only trees of OCT. The main exception was in the FDR of OCT-H as the number of features became very large, where the lack of training data relative to the number of possible hyperplane splits means it is very easy for OCT-H to find splits that are not the same as those in the true tree. We note that the accuracy of the trees did not suffer in this regime, just the FDR. This indicates that we should be cautious in interpreting hyperplane trees when the number of points is small relative to the dimension of the points. We found that boosting was typically the best of the other methods, with a large advantage over random forests, which in turn offered a small improvement over CART. However, despite its strength overall, boosting suffered in a few key settings, namely as the number of classes to predict increases, and in the presence of either feature or label noise. In all of these settings, OCT and OCT-H demonstrated significantly better capacity to deal with the increasing problem difficulty. Overall, OCT and OCT-H were able to match or beat the performance of boosting in all tests, indicating that we need not sacrifice interpretability to achieve the highest accuracy if we have reason to believe the data has an underlying parallel tree structure.

Ground truth trees with hyperplane splits

The second set of experiments measures the performance of the various methods on datasets that were generated from ground truth trees with hyperplane splits. To control the generation of these hyperplane ground truth trees, we use a new parameter called the maximum split density, which is the maximum number of features allowed to be used in any split of the tree. To generate a random hyperplane split in the tree, we generate a random integer between one and the maximum split density, and use this as the number of features to include in the split. We then randomly select which features to use, and generate a random value for each to use in the split. This gives a tree with hyperplane splits that have a variety of split densities, which is important because it allows us to examine whether OCT-H maintains sparsity and only uses as many features as are required by each split, or whether it simply overfits with very dense splits.
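As an illustration, the following is a minimal sketch of this split-generation step, assuming features are scaled to [0, 1] as in the rest of the experiments; the exact distributions used for the coefficients and threshold are not specified in the text, so the uniform draws below are an assumption.

    import numpy as np

    def random_hyperplane_split(p, max_density, rng=None):
        """Sketch: draw a random hyperplane split a'x < b over p features,
        using at most max_density nonzero coefficients."""
        rng = rng or np.random.default_rng()
        density = rng.integers(1, max_density + 1)          # number of features in the split
        features = rng.choice(p, size=density, replace=False)
        a = np.zeros(p)
        a[features] = rng.uniform(-1.0, 1.0, size=density)  # assumed coefficient distribution
        # choose the threshold so the split is non-trivial on [0, 1]^p (assumption)
        b = rng.uniform(a.clip(max=0).sum(), a.clip(min=0).sum())
        return a, b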

Figure 3-10: Synthetic experiments showing the effect of split density.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus maximum split density.]

The effect of split density

The first experiment measures the effect of the maximum split density in the ground truth tree on the performance of each method. The results are shown in Figure 3-10. When the maximum split density is 1, the ground truth trees have only parallel splits, so the performances are similar to the earlier experiments. We see that the out-of-sample accuracy of all methods decreases as the maximum split density increases, which we would expect because the problem is becoming significantly more difficult. CART and OCT have the largest decrease in accuracy with the split density, indicating that, unsurprisingly, parallel trees are not well-suited to learning a hyperplane tree structure. Despite this, OCT still offers an advantage over CART, showing the power of the optimization approach. Random forests and boosting also have a large decrease in accuracy as the maximum split density increases, but they both offer significant improvements over CART and OCT. This would seem to indicate that the aggregation approach used by these methods allows them to better approximate the hyperplane structure in the data, despite their use of parallel splits only. Finally, OCT-H has the highest accuracy by a significant margin, and only loses a small amount of accuracy as the maximum density increases, showing that it has enough power to learn the truth in the data well. OCT-H also significantly outperforms CART and OCT in both TDR and FDR, showing that it is able to learn the true hyperplane splits in the data, while the performance of OCT in both these measures falls as the maximum density increases due to the increased difficulty of the problem with respect to the fixed amount of data available.

Figure 3-11: Synthetic experiments showing the effect of number of random hyperplane restarts.

[Figure panels: out-of-sample accuracy, true discovery rate, and false discovery rate versus number of hyperplane restarts.]

The effect of the number of random hyperplane restarts

The next experiment considers the effect of the number of random hyperplane restarts H on the performance of OCT-H. Figure 3-11 shows the performance of the OCT-H method as H increases, with the maximum split density in the ground truth tree fixed at 5. We see that OCT-H initially has a similar accuracy to boosting and random forests when H = 0, and then there is a very significant increase in accuracy when H increases to 1. This indicates that simply perturbing the best parallel split does not find the best hyperplane splits, and it is much more powerful to perturb a randomized hyperplane split. As the number of hyperplane restarts is further increased, the performance improves slightly, but not nearly as significantly as the jump from H = 0 to H = 1. There is clear evidence that OCT-H should always be used with H ≥ 1, and the exact value used should depend on the amount of time available. We use these results as guidance to recommend that OCT-H be used with H = 5 to H = 10 in practice, as this is sufficient to capture most of the benefit of additional restarts without increasing the runtime too greatly. For specific problems, it may be sufficient to use a smaller value of H (e.g., H = 2 is probably good enough

in this synthetic example), but one should be wary of setting H too small and losing accuracy, hence our suggestion to start with H between 5 and 10.

Summary for ground truth trees with hyperplane splits

Both experiments involving ground truth trees with hyperplane splits demonstrated that OCT-H is able to perform well and learn the true tree structure even when the splits in the tree are dense. The ensemble methods (random forests and boosting) improved upon the single parallel-tree methods (OCT and CART), but were unable to match the performance of OCT-H. This is strong evidence that the aggregation approach of these methods is significantly less powerful for modeling non-parallel splits when compared to just modeling these splits directly.

We also saw that there was a significant increase in performance for OCT-H when changing from H = 0 hyperplane restarts to H = 1. The improvements for higher H were much less dramatic, yet still significant, and so we recommend that a value from 5–10 typically be used.

As seen with the parallel-split ground truth trees, we found that OCT-H beat the performance of the ensemble methods in all cases with hyperplane-split ground truth trees, demonstrating that we can achieve the highest accuracy with a model that directly models the true structure of the underlying data, and moreover, in doing so we generate a resulting model that is directly interpretable.

3.5 Experiments with Real-World Datasets

In this section, we benchmark the performance of Optimal Classification Trees with Hyperplanes against both CART and standard Optimal Classification Trees on the same sample of real-world datasets as in Section 2.6. We also provide wider context for these results by conducting comparisons with random forests and gradient boosting, two tree-based classification methods that achieve state-of-the-art accuracy.

Figure 3-12: Mean out-of-sample accuracy for each method across all 60 datasets.

[Figure panel: mean out-of-sample accuracy versus maximum tree depth, for CART, OCT, OCT-H, Random Forest, and Boosting.]

Experimental Setup

We used the same experimental setup and sample of 60 datasets as in Section 2.6 for these experiments. In addition to running CART and OCT as described in Section 2.6, we ran OCT-H with R = 100 random restarts and H = 10 hyperplane restarts, using Algorithm 2.4 to tune the values of the complexity parameter α and the depth of the tree. We also ran random forests and gradient boosted trees using the same implementations as described in Section 3.4.

Results and Discussion

Figure 3-12 shows the mean out-of-sample accuracy of all methods on the sample of 60 datasets as a function of the maximum depth of the tree. First, we compare OCT-H against the parallel-split methods OCT and CART in order to understand and quantify the additional power of hyperplane splits. We can see that OCT-H delivers significantly better out-of-sample accuracy than CART and OCT at all depths. OCT-H with trees of depth 2 outperforms CART at all depths, and performs about the same as OCT at depth 5. OCT-H at depth 3 significantly outperforms OCT at all depths. We also see that the accuracy of OCT-H stops

increasing after depth 4. These results reinforce the assertion in Section 3.3 that trees with hyperplane splits can achieve comparable (or indeed better) accuracies than those with parallel splits using much smaller trees, and therefore are not necessarily strictly less interpretable. Next, we include random forests and boosting in the comparison, in order to place the accuracy results of OCT and OCT-H in a wider context by comparing them to two methods that give state-of-the-art accuracies. We can see that OCT and random forests perform roughly the same up to depth 4, after which random forests have a small advantage. Boosting is the strongest method at depths 1 and 2, but OCT-H catches up at depth 3 and then has a small edge over boosting at depths 4 and larger. When the depth is 10 and is effectively unconstrained, we see that OCT-H is the strongest method with a small advantage over boosting, which in turn has a small advantage over random forests. There is then a larger gap down to OCT and then CART. Based on these aggregate results across all 60 datasets, it seems that OCT-H likely offers a small advantage over random forests and has comparable performance to boosting. We will now investigate these effects more extensively. Table 3.1 shows both the accuracy and the performance rank of each method on each of the datasets. These ranks allow us to understand how the methods perform relative to one another on each individual dataset rather than just in aggregate. The average rank at the bottom gives the mean rank of each method across all datasets. These ranks give evidence that CART is the weakest-performing method by a significant margin, followed by OCT, which is not surprising as these methods are limited to a single tree with parallel splits only. Boosting and random forests have the best average ranks, followed very closely by OCT-H. Next, we conduct pairwise tests between the methods to determine which have statistically significant differences in performance. Table 3.2 shows the results of these comparisons. We follow the approach recommended by [40] and [50] for comparing the results of multiple methods on multiple datasets. For each pair of methods, we test significance using the Wilcoxon signed-rank test, and the resulting p-values are then adjusted using the Holm-Bonferroni method [62] to control the familywise error rate.

Table 3.1: Performance results for classification methods (with maximum depth 10) on each of the 60 datasets. For each method, we show the out-of-sample accuracy (%) followed by the method's rank on the dataset in parentheses. A rank of 1 indicates the method had the best performance on a dataset and a rank of 5 indicates the method performed worst.

Dataset  CART  OCT  OCT-H  RF  Boosting
acute-inflammations-1  95.33 (5)  100.00 (1.5)  98.00 (3.5)  100.00 (1.5)  98.00 (3.5)
acute-inflammations-2  100.00 (2.5)  100.00 (2.5)  100.00 (2.5)  100.00 (2.5)  98.67 (5)
balance-scale  78.09 (5)  79.36 (4)  89.94 (2)  85.22 (3)  91.21 (1)
banknote-authentication  98.02 (5)  98.89 (4)  99.71 (1)  98.95 (3)  99.48 (2)
blood-transfusion  77.22 (5)  78.07 (3)  78.61 (1)  77.65 (4)  78.50 (2)
breast-cancer-diagnostic  90.91 (5)  92.73 (4)  94.83 (2)  94.55 (3)  96.08 (1)
breast-cancer-prognostic  75.51 (1.5)  75.51 (1.5)  74.29 (3)  74.29 (4)  70.20 (5)
breast-cancer  94.97 (5)  95.09 (4)  97.19 (2)  97.78 (1)  96.37 (3)
car-evaluation  91.25 (4)  92.41 (3)  97.82 (1)  86.44 (5)  96.94 (2)
chess-king-rook-vs-king-pawn  98.90 (4)  99.32 (3)  99.40 (1)  97.00 (5)  99.40 (2)
climate-model-crashes  91.26 (5)  92.30 (3)  93.04 (2)  92.30 (4)  94.22 (1)
congressional-voting-records  98.62 (2.5)  98.62 (2.5)  97.24 (5)  98.62 (2.5)  98.62 (2.5)
connectionist-bench-sonar  69.23 (5)  75.00 (4)  77.31 (3)  84.62 (1)  84.62 (2)
contraceptive-method-choice  53.22 (4)  53.17 (5)  55.93 (1)  53.66 (3)  55.12 (2)
credit-approval  87.24 (1)  86.75 (3)  86.01 (5)  87.24 (2)  86.63 (4)
cylinder-bands  66.09 (5)  67.54 (4)  72.75 (2)  75.36 (1)  72.46 (3)
dermatology  95.06 (5)  95.28 (4)  96.18 (3)  98.20 (1)  96.40 (2)
echocardiogram  72.00 (5)  73.33 (4)  76.00 (1)  74.67 (2.5)  74.67 (2.5)
fertility  87.20 (1)  86.40 (2.5)  82.40 (5)  86.40 (2.5)  85.60 (4)
haberman-survival  73.51 (1)  73.25 (2)  72.73 (3)  70.65 (5)  71.17 (4)
hayes-roth  75.76 (5)  78.18 (3.5)  78.18 (3.5)  80.00 (1)  79.39 (2)
heart-disease-cleveland  57.60 (3)  55.47 (5)  56.00 (4)  58.93 (1)  58.13 (2)
hepatitis  84.00 (2)  82.00 (3)  78.00 (5)  87.00 (1)  80.00 (4)
hill-valley-noise  55.76 (4)  56.56 (2)  78.15 (1)  56.16 (3)  53.64 (5)
hill-valley  49.80 (5)  52.58 (3)  98.41 (1)  51.26 (4)  57.22 (2)
image-segmentation  77.74 (5)  86.79 (4)  87.92 (3)  94.72 (1)  90.19 (2)
indian-liver-patient  71.59 (1)  71.17 (2)  70.90 (3)  70.34 (4)  70.07 (5)
ionosphere  88.28 (4)  91.26 (3)  86.67 (5)  94.71 (1)  93.56 (2)
iris  93.51 (5)  94.59 (3)  94.59 (3)  95.14 (1)  94.59 (3)
magic-gamma-telescope  84.91 (4)  84.66 (5)  86.88 (2)  86.37 (3)  88.38 (1)
mammographic-mass  81.74 (2.5)  80.48 (5)  81.16 (4)  81.74 (2.5)  82.80 (1)
monks-problems-1  74.84 (5)  87.74 (3)  98.06 (1)  78.71 (4)  87.74 (2)
monks-problems-2  59.53 (3.5)  60.00 (2)  88.84 (1)  48.84 (5)  59.53 (3.5)
monks-problems-3  94.19 (1)  92.90 (3.5)  93.55 (2)  92.90 (3.5)  90.97 (5)
mushroom  99.96 (5)  100.00 (2.5)  100.00 (2.5)  100.00 (2.5)  100.00 (2.5)
optical-recognition  88.63 (4)  88.02 (5)  91.94 (3)  97.53 (1)  97.38 (2)
ozone-level-detection-eight  93.06 (4)  92.97 (5)  93.19 (3)  93.58 (2)  93.88 (1)
ozone-level-detection-one  96.49 (4)  96.62 (3)  96.23 (5)  96.67 (2)  96.75 (1)
parkinsons  87.76 (2)  86.94 (4.5)  87.76 (3)  88.57 (1)  86.94 (4.5)
pen-based-recognition  96.15 (4)  95.54 (5)  97.75 (3)  98.58 (2)  99.24 (1)
pima-indians-diabetes  72.29 (5)  73.02 (4)  73.13 (3)  74.17 (2)  75.42 (1)
planning-relax  71.11 (1.5)  71.11 (1.5)  70.22 (3)  70.22 (4)  68.44 (5)
qsar-biodegradation  82.36 (5)  83.50 (4)  85.63 (3)  88.29 (1)  88.06 (2)
seeds  89.43 (4)  88.30 (5)  91.70 (1)  90.94 (2)  90.57 (3)
seismic-bumps  92.91 (4)  93.22 (2.5)  92.79 (5)  93.28 (1)  93.22 (2.5)
skin-segmentation  99.72 (5)  99.89 (3)  99.93 (2)  99.83 (4)  99.95 (1)
soybean-small  100.00 (3)  100.00 (3)  100.00 (3)  100.00 (3)  100.00 (3)
spambase  91.47 (5)  93.17 (4)  94.21 (2)  93.97 (3)  95.93 (1)
spect-heart  61.00 (4.5)  62.00 (3)  61.00 (4.5)  71.00 (2)  73.00 (1)
spectf-heart  72.00 (4)  76.00 (3)  62.00 (5)  77.00 (2)  79.00 (1)
statlog-german-credit  72.00 (4)  70.96 (5)  72.08 (3)  73.92 (2)  75.12 (1)
statlog-landsat  85.28 (5)  85.73 (4)  87.30 (3)  89.83 (2)  91.22 (1)
teaching-assistant  56.76 (4)  63.24 (2)  63.78 (1)  53.51 (5)  60.00 (3)
thoracic-surgery  84.79 (2.5)  84.79 (2.5)  84.44 (4)  85.13 (1)  84.10 (5)
thyroid-disease-ann  99.75 (1)  99.72 (2)  99.66 (3.5)  99.55 (5)  99.66 (3.5)
thyroid-disease-new  92.83 (5)  93.96 (4)  95.47 (3)  97.36 (1)  96.23 (2)
tic-tac-toe-endgame  94.06 (4)  93.97 (5)  95.73 (3)  98.83 (2)  99.08 (1)
wall-following-robot-2  100.00 (2)  100.00 (2)  100.00 (2)  99.97 (4.5)  99.97 (4.5)
wall-following-robot-24  100.00 (2.5)  100.00 (2.5)  100.00 (2.5)  100.00 (2.5)  99.97 (5)
wine  84.89 (5)  94.22 (4)  95.11 (3)  98.22 (1)  97.33 (2)

Average rank  3.758  3.375  2.775  2.533  2.558

Table 3.2: Pairwise significance tests for performance of classification methods (with maximum depth 10) across 60 datasets. The comparisons are presented in order of significance. For each comparison, the strongest-performing method is stated first. The p-values are found using the Wilcoxon signed-rank test, and the adjusted p-values are calculated using the Holm-Bonferroni method to account for multiple comparisons. All results above the dividing line are significant at the 95% level.

Comparison                    p-value   Adjusted p-value
Boosting vs. CART             ~10^-5    0.0003
Random Forest vs. CART        ~10^-4    0.0005
OCT-H vs. CART                0.0003    0.0021
Boosting vs. OCT              0.0005    0.0034
OCT-H vs. OCT                 0.0023    0.0135
OCT vs. CART                  0.0025    0.0135
Random Forest vs. OCT         0.0029    0.0135
OCT-H vs. Boosting            0.1507    0.4522
OCT-H vs. Random Forest       0.1932    0.4522
Boosting vs. Random Forest    0.6064    0.6064

The comparisons are presented in order of significance. We see that CART is statistically significantly weaker than OCT, and both CART and OCT are weaker than OCT-H, random forests, and boosting. The comparisons find no statistically significant difference in performance between OCT-H, boosting, and random forests. Together, these computational experiments with real datasets offer strong evidence that OCT-H is competitive with state-of-the-art classification methods in practical settings, with comparable accuracies and no statistically significant difference in performance to random forests and boosting. The key difference is that OCT-H achieves this performance with just a single decision tree, and so the resulting classifier is significantly more interpretable than the other methods.
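For readers who wish to reproduce this kind of comparison, a minimal sketch of such a testing procedure follows, assuming per-dataset accuracy vectors for each method; the Holm-Bonferroni step is written out by hand and the function and variable names are illustrative, not taken from the thesis.

    from itertools import combinations
    from scipy.stats import wilcoxon

    def pairwise_holm(accuracies):
        """accuracies: dict mapping method name -> list of per-dataset accuracies
        (same dataset order for every method). Returns raw and Holm-adjusted p-values."""
        pairs = list(combinations(accuracies, 2))
        raw = {}
        for a, b in pairs:
            # Wilcoxon signed-rank test on the paired per-dataset results
            stat, p = wilcoxon(accuracies[a], accuracies[b])
            raw[(a, b)] = p
        # Holm-Bonferroni: sort p-values, scale by decreasing factors, enforce monotonicity
        ordered = sorted(raw, key=raw.get)
        m = len(ordered)
        adjusted, running_max = {}, 0.0
        for i, pair in enumerate(ordered):
            running_max = max(running_max, (m - i) * raw[pair])
            adjusted[pair] = min(1.0, running_max)
        return raw, adjusted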

3.6 Conclusions

In this chapter, we extended our approach for constructing Optimal Classification Trees to also consider making splits that were not restricted to be parallel to the axes.

Prior attempts at creating decision trees with such splits were neither tractable nor able to deliver significant improvements over normal decision trees, and thus have not seen practical use in applications. The increased difficulty in finding these hyperplane splits arises because there is an exponential number of possible hyperplane splits to search over, rather than a linear number as in the axes-parallel case. We made minor modifications to the MIO formulation for Optimal Classification Trees to allow for these hyperplane splits, giving a formulation that can be solved to find the globally optimal classification tree with hyperplane splits. As with parallel splits, we found that the MIO formulation did not scale well to practical problem sizes, and so we adapted the local search procedure to also optimize hyperplane splits. The trees constructed using this approach are significantly stronger than those found using the MIO formulation with a large time limit, and are found in a fraction of the time. We conducted extensive experiments with both synthetic and real-world datasets in order to compare the performance of our optimal tree methods against the state-of-the-art in classification. The experiments with synthetic data demonstrated that our optimal trees are significantly stronger than random forests and boosted trees in cases where the true underlying structure in the data is a tree. Moreover, the experiments with 60 real-world datasets offer strong evidence that OCT-H also has comparable performance to random forests and boosted trees in practical settings. Together these results demonstrate that our optimal tree methods can deliver state-of-the-art performance for classification problems without sacrificing the key interpretability advantage of a single decision tree.


Chapter 4

Optimal Regression Trees with Constant Predictions

In the previous chapters, we have focused exclusively on training trees for classification problems. Another central problem in statistics and machine learning is regression, where the outcomes are no longer discrete categories, but instead are continuous-valued. Regression trees are harder to train than classification trees. In the leaf of a classification tree, it is simple and computationally cheap to calculate the optimal prediction using the majority rule, and this is clearly the best prediction rule to use. However, for a regression tree there are many options for predicting the outcomes of the points in the leaf, which range in computational cost, simplicity, and interpretability. As a result, it is not clear which prediction rule is best for regression problems, and for this reason regression trees have historically received less attention in the literature compared to classification trees. The most common prediction rule for regression trees is to use a constant prediction for all points in the leaf, which is found by calculating the mean outcome of the points in the leaf. This approach is used by CART and other methods [25, 72, 63]. Training a tree with constant predictions in the leaves amounts to fitting a piecewise-constant function to the training data. Constant predictions are chosen primarily for computational reasons, since calculating the mean outcome can be done very efficiently.

The resulting predictions are not as accurate as those of other regression methods, and therefore we might require deep trees to achieve a good level of accuracy on a dataset. In this chapter, we apply the Optimal Trees methodology for classification problems from Chapters 2 and 3 to the regression problem, yielding a procedure for generating Optimal Regression Trees with constant predictions in each leaf. Our goal is to investigate the effects of solving the regression tree problem optimally with constant predictions in each leaf, as such trees are directly comparable to the regression trees currently used in practice. We will consider trees with more sophisticated prediction functions in each leaf in Chapter 5.

4.1 Review of Regression Tree Methods

In regression problems, we are supplied training data (X, y) containing n observations (x_i, y_i), i ∈ [n], each with p features x_i ∈ R^p and an outcome y_i ∈ R. As with the classification problem, we will assume without loss of generality that the training values have been normalized to the unit interval, so that each x_i ∈ [0, 1]^p. Regression trees take a near-identical structure to classification trees. The only difference is that the prediction in each leaf is for the continuous outcome as opposed to a discrete class label. In this chapter we deal only with regression trees that have constant prediction functions in each leaf, so each point in a leaf will receive the same predicted outcome, just like in classification. Figure 4-1 shows an example of a regression tree trained on the “mtcars” dataset [60]. This dataset comes from Motor Trend magazine, and the goal is to predict the fuel consumption of automobiles based on variables describing the design and performance of the automobile. We can see that the tree gives us an interpretable and intuitive explanation of which factors lead to higher fuel consumption. The tree-growing procedure of CART is the same for regression as for classification, except that we use a different loss function when selecting a split. In each leaf, it calculates the squared loss that results from using the mean outcome in the leaf as the final prediction, because the mean is the constant prediction that minimizes the squared loss on the points in the leaf.

Figure 4-1: CART regression tree predicting fuel consumption (in miles per gallon) of automobiles using the “mtcars” dataset. “cyl” denotes the number of cylinders in the engine, “wt” denotes the weight of the vehicle (in 1000 pounds), and “hp” denotes the gross horsepower.

cyl < 5.00
├── yes: wt < 1.89
│        ├── yes: predict 31.57
│        └── no:  predict 24.13
└── no:  hp < 192.50
         ├── yes: predict 18.70
         └── no:  predict 12.48

It then selects the split that results in the lowest overall squared loss, and repeats recursively. As in classification, it stops when either the minimum bucket size N_min is reached, or when no further improvement in error is possible. Again mirroring classification, the final step in the process is pruning the tree to limit its complexity. This is achieved in exactly the same way as in classification, by using the complexity parameter to control the tradeoff between accuracy and complexity. For regression, the squared loss is used for both the training and pruning phases, rather than using a different loss function for each as in classification. Based on this procedure, we can see that CART also aims to solve Problem (2.1) when constructing regression trees, except that in this case R(T) represents the mean squared error of the tree on the training data, rather than the misclassification error.
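To make the greedy criterion concrete, here is a minimal sketch of selecting a single parallel split by squared loss, with the mean outcome used as the constant prediction on each side; the function and variable names are illustrative, this is not the CART implementation itself, and the naive double loop ignores the sorted-scan speedups used in practice.

    import numpy as np

    def best_parallel_split(X, y):
        """Sketch: return (feature, threshold, loss) minimizing the total squared
        loss when each side of the split predicts the mean outcome of its points."""
        n, p = X.shape
        best = (None, None, np.inf)
        for j in range(p):
            for b in np.unique(X[:, j]):
                left, right = y[X[:, j] < b], y[X[:, j] >= b]
                if len(left) == 0 or len(right) == 0:
                    continue
                # squared loss of a constant (mean) prediction = n_side * variance
                loss = len(left) * left.var() + len(right) * right.var()
                if loss < best[2]:
                    best = (j, b, loss)
        return best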

4.2 ORT with Constant Predictions via MIO

In this section, we follow the example of Optimal Classification Trees from Chapters 2 and 3 and formulate the task of constructing the globally optimal regression tree as an MIO problem. First, we note that the majority of the formulation for Optimal Classification Trees can simply be reused for training regression trees. The only changes that need to be made are to the variables and constraints that determine the objective. For the objective, we will consider the absolute and squared losses, the two most commonly used loss functions in regression.

At each leaf node t in the tree, we will make the same constant prediction, given by the continuous variable β_{0t} ∈ R. We then use the variable f_i to denote the fitted value that is predicted by the regression tree for each point i. This can be calculated using the following expression:

    f_i = \sum_{t \in \mathcal{T}_L} \beta_{0t} z_{it},    \forall i \in [n].    (4.1)

Since exactly one z_{it} = 1, indicating that leaf t contains point i, only the term corresponding to the prediction at that leaf will remain, and the corresponding prediction will be assigned to point i. This expression is nonlinear in the variables; however, we can linearize it as follows:

    -M_f (1 - z_{it}) \le f_i - \beta_{0t} \le M_f (1 - z_{it}),    \forall i \in [n], \ t \in \mathcal{T}_L,    (4.2)

where M_f is a sufficiently large constant. We can see that if z_{it} = 1, this forces f_i = β_{0t}, and otherwise the constraint has no effect. In order to specify a value for M_f, we need to identify the largest possible value of f_i − β_{0t}. We observe that at optimality, the fitted values f_i will lie in the range of y, and therefore so will the predictions β_{0t}. This means that we can select M_f = max_i y_i − min_i y_i without affecting the feasibility of any optimal solutions.

We now want to calculate the loss for each point i, denoted by the variable L_i, based on the difference between the fitted value f_i and the actual value y_i.

First, we consider the absolute loss. We can set L_i as the absolute loss for the ith point as follows:

    L_i = |f_i - y_i|,    \forall i \in [n],    (4.3)

which can be linearized to give

    L_i \ge f_i - y_i,    \forall i \in [n],    (4.4)
    L_i \ge -(f_i - y_i),    \forall i \in [n].    (4.5)

This linearization is valid under the knowledge that we will be minimizing the values of L_i in the objective, as in this case L_i will take the value of either f_i − y_i or −(f_i − y_i).

We can also use the squared loss. In this case, we set L_i as follows:

    L_i \ge (f_i - y_i)^2,    \forall i \in [n],    (4.6)

which is a quadratic constraint. As with the absolute loss, this inequality is valid because we are minimizing L_i in the objective, so the constraint becomes tight at optimality. Quadratic constraints, while more complex than linear constraints, are supported in recent versions of state-of-the-art MIO solvers, and so the problem with squared loss can be solved with the same solvers as the other formulations.

We want to minimize the total loss across all our predictions plus a penalty on tree complexity, which for trees with parallel splits gives:

    \min \; \frac{1}{\hat{L}} \sum_{i=1}^{n} L_i + \alpha \cdot C,    (4.7)

where we have normalized the total loss by the baseline loss \hat{L} obtained by making a single prediction for the entire training set. As for classification, this normalization is done to make the effect of α independent of the training set size.

The complete MIO formulation can be obtained by using these new constraints and variables to replace the classification elements either in Problem (2.24) for parallel splits or in Problem (3.3) for hyperplane splits. For instance, the following is the

complete Optimal Regression Trees (ORT) model with squared loss:

\begin{aligned}
\min \quad & \frac{1}{\hat{L}} \sum_{i=1}^{n} L_i + \alpha \cdot C && (4.8) \\
\text{s.t.} \quad
& L_i \ge (f_i - y_i)^2, && \forall i \in [n], \\
& f_i - \beta_{0t} \ge -M_f (1 - z_{it}), && \forall i \in [n], \ t \in \mathcal{T}_L, \\
& f_i - \beta_{0t} \le +M_f (1 - z_{it}), && \forall i \in [n], \ t \in \mathcal{T}_L, \\
& C = \sum_{t \in \mathcal{T}_B} d_t, \\
& \mathbf{a}_m^T \mathbf{x}_i \ge b_m - (1 - z_{it}), && \forall i \in [n], \ t \in \mathcal{T}_L, \ m \in \mathcal{R}(t), \\
& \mathbf{a}_m^T (\mathbf{x}_i + \boldsymbol{\epsilon}) \le b_m + (1 + \epsilon_{\max})(1 - z_{it}), && \forall i \in [n], \ t \in \mathcal{T}_L, \ m \in \mathcal{L}(t), \\
& \sum_{t \in \mathcal{T}_L} z_{it} = 1, && \forall i \in [n], \\
& z_{it} \le l_t, && \forall i \in [n], \ t \in \mathcal{T}_L, \\
& \sum_{i=1}^{n} z_{it} \ge N_{\min} l_t, && \forall t \in \mathcal{T}_L, \\
& \sum_{j=1}^{p} a_{jt} = d_t, && \forall t \in \mathcal{T}_B, \\
& 0 \le b_t \le d_t, && \forall t \in \mathcal{T}_B, \\
& d_t \le d_{p(t)}, && \forall t \in \mathcal{T}_B \setminus \{1\}, \\
& z_{it}, l_t \in \{0, 1\}, && \forall i \in [n], \ t \in \mathcal{T}_L, \\
& a_{jt}, d_t \in \{0, 1\}, && \forall j \in [p], \ t \in \mathcal{T}_B.
\end{aligned}

The size of these regression tree formulations is similar to that of their classification counterparts. As mentioned earlier, the formulations using the absolute loss remain linear MIO problems, whereas the squared loss leads to a quadratic MIO problem, which is marginally harder to solve than its linear counterpart but remains solvable with commercial MIO solvers like Gurobi. Together, these facts mean that these MIO formulations will have similar solution times to the corresponding classification problems.
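As an illustration of how the loss-related part of this formulation might be written for a solver, the following is a minimal gurobipy sketch of constraints (4.2), (4.6) and the normalized loss term of the objective. It assumes the leaf-assignment binaries z[i, t] and the remaining tree-structure constraints are built elsewhere; it is an indicative fragment under those assumptions, not the formulation used in the thesis.

    import gurobipy as gp
    from gurobipy import GRB

    def add_squared_loss_terms(model, z, y, leaves):
        """Sketch: add fitted values, leaf predictions, loss variables and the
        big-M linking constraints for an ORT model with squared loss.
        z[i, t] are existing binary leaf-assignment variables."""
        n = len(y)
        M_f = max(y) - min(y)                                # big-M from the range of y
        f = model.addVars(n, lb=min(y), ub=max(y), name="f")
        beta0 = model.addVars(leaves, lb=min(y), ub=max(y), name="beta0")
        L = model.addVars(n, lb=0.0, name="L")
        for i in range(n):
            for t in leaves:
                model.addConstr(f[i] - beta0[t] >= -M_f * (1 - z[i, t]))
                model.addConstr(f[i] - beta0[t] <= M_f * (1 - z[i, t]))
            # quadratic loss constraint, tight at optimality when minimized
            model.addConstr(L[i] >= (f[i] - y[i]) * (f[i] - y[i]))
        # baseline loss of a single (mean) prediction for the whole training set
        baseline = sum((yi - sum(y) / n) ** 2 for yi in y)
        return gp.quicksum(L[i] for i in range(n)) * (1.0 / baseline)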

4.3 ORT with Constant Predictions via Local Search

In this section, we modify the local search heuristic developed in Sections 2.3 and 3.2 to generate regression trees, allowing us to generate Optimal Regression Trees both without (ORT) and with (ORT-H) hyperplane splits.

The MIO formulations for the ORT and ORT-H problems in Section 4.2 share the same scaling problems as the formulations for classification trees. Driven by the number of training points and the depth of the tree, the size of the MIO formulation, and in particular the number of binary variables, grows very quickly and soon renders the formulation intractable.

Figure 4-2 demonstrates the scaling issues of the MIO approach. We trained CART and ORT trees on the “Hybrid Vehicle Prices” dataset made available in [78], which is a small dataset with n = 152 and p = 3 (after excluding a single categorical feature). We plot the training error and running time for both methods against the depth of the tree being trained. The MIO problem was limited to two hours of running time. We can see that both methods find the same solution at depth 1, with MIO providing a proof of optimality for this solution. At depth 2, MIO again solves the problem to optimality, and improves upon the error of the CART solution. At depths 3–6, however, the MIO solution remains unchanged from the CART warm start after the two-hour time limit has elapsed. As in classification, we believe it is likely that the CART solutions can actually be improved, but that the MIO solver is simply unable to make progress within the time limit. This motivates a search for more effective solution approaches.

Recall that for classification problems, we observed that once the points were assigned to leaves in the tree, the resulting predictions and accuracies could be computed in closed form by simply assigning the majority class in each leaf. This motivated our local-search heuristic, in which we modify the tree one split at a time and evaluate the objective function in closed form after each modification.

Regression trees also share this characteristic of closed-form solvability once the leaf assignment is fixed. For squared loss, we simply predict the mean of the labels in each leaf, and for absolute loss the median.

Figure 4-2: Training mean squared error and running time for each method on the “Hybrid Vehicle Prices” dataset.

[Figure panels: training MSE and running time versus maximum depth, for CART and ORT (MIO with Gurobi).]

This leads us to develop a similar local-search heuristic for regression trees by simply applying the same search procedure and substituting a different method for evaluating the loss in closed form. The full local-search procedure for Optimal Regression Trees is therefore obtained by following the approach in Algorithm 2.1 and changing the definition of the loss function L(T, X, y) in (2.26) to be the squared (or absolute) loss, with the predictions in the leaves of T chosen to be the mean (or median) of the points contained in each leaf according to the data X and y. We can then obtain the local search for Optimal Regression Trees with Hyperplanes by applying the modifications to Algorithm 2.1 described in Section 3.2.
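The closed-form evaluation that replaces the classification loss is simple enough to state directly; here is a minimal sketch for the squared-loss case, assuming an assignment of each training point to a leaf has already been computed (names are illustrative).

    import numpy as np

    def squared_loss_of_tree(y, leaf_assignment):
        """Sketch: total squared training loss of a regression tree with constant
        (mean) predictions, given the leaf index assigned to each point."""
        y = np.asarray(y, dtype=float)
        leaf_assignment = np.asarray(leaf_assignment)
        total = 0.0
        for leaf in np.unique(leaf_assignment):
            labels = y[leaf_assignment == leaf]
            # the mean is the loss-minimizing constant prediction, and its squared
            # loss equals the number of points times the label variance
            total += len(labels) * labels.var()
        return total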

Complexity analysis of local search for regression

We will now analyze the computational complexity of the local search procedures for constructing regression trees. Due to the similarity of these procedures to those for classification, we will focus here on highlighting the differences. First, we consider ORT and ORT-H with squared loss. In order to update the error, we are required to first calculate the mean of the labels in each leaf of the tree and then calculate the squared loss of this prediction on the points of each leaf. When doing this inside BestParallelSplit and BestHyperplaneSplit, we can

simply maintain the mean and variance of the labels of the points in each leaf, since the squared loss of the mean prediction on a subset of points is equivalent to the variance of the labels of the subset. This means that the error update amounts to updating the mean and variance after moving a single point between branches, which has a cost of O(1), compared to O(K) for classification. If instead we are calculating the error from scratch, we must calculate the assignment of points to leaves for a cost of O(nT) and then the mean of the labels in each leaf at a cost of O(T), for an overall cost of O(nT), compared to O(nT + KT) for OCT.
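A minimal sketch of this O(1) update, maintaining the count, mean, and sum of squared deviations for a leaf as points are added or removed (the class and method names are illustrative):

    class LeafStats:
        """Sketch: running count, mean, and sum of squared deviations for one leaf,
        so the leaf's squared loss (= sum of squared deviations) is always available."""
        def __init__(self):
            self.n, self.mean, self.ssd = 0, 0.0, 0.0

        def add(self, y):
            self.n += 1
            delta = y - self.mean
            self.mean += delta / self.n
            self.ssd += delta * (y - self.mean)      # Welford-style update

        def remove(self, y):
            delta = y - self.mean
            self.n -= 1
            if self.n == 0:
                self.mean, self.ssd = 0.0, 0.0
            else:
                self.mean -= delta / self.n
                self.ssd -= delta * (y - self.mean)  # reverse Welford update

        def squared_loss(self):
            return self.ssd

Moving a point from one branch to the other then amounts to one remove on the losing leaf and one add on the gaining leaf, which is the O(1) update described above.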

For ORT, this means that the BestParallelSplit function has a cost of O(np), and therefore, using the same approach as in Section 2.3, the cost of a local search iteration is O(nT + T²) + T · O(np) = O(npT), compared to O(npKT) for classification. If we make the same assumptions that the number of local search iterations is O(log n) and the size of the tree is also O(log n), we arrive at a total cost of O(np log² n). The cost of CART under similar assumptions is O(np log n), and so, as in classification, the cost of ORT is just a factor of O(log n) more than CART.

For ORT-H, the cost of BestHyperplaneSplit becomes O(np log n), which gives a cost per local search iteration of O(npT) + T · O(np log n). Under the same assumptions about the number of iterations and the size of the tree, the overall cost for ORT-H thus becomes O(np log³ n), which is O(log n) more than ORT. Next, we consider the cost of ORT and ORT-H with absolute loss. This requires us to calculate the median of the labels at each leaf, followed by the absolute loss. There are approaches for updating the median and absolute loss that run in O(log n) time [115]. This means that the cost of BestParallelSplit becomes O(np log n), and the cost per local search iteration is O(npT log n). Therefore the overall cost of ORT with absolute loss under our assumptions is O(np log³ n), an increase of O(log n) compared to the cost of ORT with squared loss. Similarly, the cost of ORT-H with absolute loss is O(np log⁴ n), an increase of O(log n) over ORT-H with squared loss. The key takeaway from this analysis is that ORT and ORT-H have the same complexity overhead with respect to CART as OCT and OCT-H when the squared loss is used. Both ORT and ORT-H incur an additional factor of O(log n) when using the absolute loss, which is a significant disadvantage compared to the squared loss.
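To give a flavour of the kind of data structure behind O(log n) median updates, here is a minimal sketch using the sortedcontainers package (an assumption on our part; the reference [115] describes other approaches). It maintains only the median; extending it to also maintain the absolute loss additionally requires tracking the sums of labels on each side of the median, which is omitted here.

    from sortedcontainers import SortedList

    class LeafMedian:
        """Sketch: maintain the median label of a leaf under O(log n) insertions
        and removals, as needed for the absolute-loss error updates."""
        def __init__(self):
            self.labels = SortedList()

        def add(self, y):
            self.labels.add(y)

        def remove(self, y):
            self.labels.remove(y)

        def median(self):
            n = len(self.labels)
            mid = n // 2
            if n % 2 == 1:
                return self.labels[mid]
            return 0.5 * (self.labels[mid - 1] + self.labels[mid])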

Figure 4-3: Training mean squared error and running time for each method on the “Hybrid Vehicle Prices” dataset.

[Figure panels: training MSE and running time versus maximum depth, for CART, ORT (MIO with Gurobi), and ORT (local search).]

Effectiveness of the local search procedure

We return to the example of the “Hybrid Vehicle Prices” dataset to demonstrate the effectiveness of the local search procedure when applied to regression problems. Figure 4-3 compares the performance of CART and ORT using both the MIO and local search approaches. We see that ORT with local search delivers significant improvements over CART at depths 2–6, at the cost of increasing the running time by about one order of magnitude. Moreover, we see at depth 2 that the local search solution coincides with the MIO solution, which we know to be provably optimal. This gives evidence that the solutions found by the local search procedure can indeed be optimal, albeit without a certificate of optimality.

4.4 Experiments with Synthetic Datasets

In this section, we evaluate the performance of Optimal Regression Trees with constant predictions, both with (ORT-H) and without (ORT) hyperplanes, on a collection of synthetic experiments. These experiments aim to provide a comparison of Optimal Regression Trees against both classical decision tree heuristics like CART and state-of-the-art methods like random forests and boosting.

Figure 4-4: Synthetic experiments showing the effect of training set size.

[Figure panels: out-of-sample R², true discovery rate, and false discovery rate versus number of training points, for CART and ORT.]

The experimental setup mirrors that for classification in Sections 2.5 and 3.4, using the same default parameters unless otherwise mentioned. We randomly generate decision trees with labels drawn randomly from U(0, 1). We then generate training and testing datasets in the same way as for classification, generating X randomly and using the ground truth decision tree to assign the corresponding label for each point.

We trained ORT and ORT-H with the squared loss due to the additional computational overhead of the local search when using the absolute loss.

Each method was tuned to maximize R² using the same procedures as in Sections 2.5 and 3.4. For each method, we report the out-of-sample R², the true discovery rate (TDR), and the false discovery rate (FDR).
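As a reference for the reported metrics, here is a minimal sketch of how they could be computed; the precise split-matching rule behind TDR and FDR is defined in Section 2.5 and is only loosely mirrored here, so the second function should be treated as an assumption.

    import numpy as np

    def out_of_sample_r2(y_true, y_pred):
        """R^2 = 1 - residual sum of squares / total sum of squares."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        ss_res = ((y_true - y_pred) ** 2).sum()
        ss_tot = ((y_true - y_true.mean()) ** 2).sum()
        return 1.0 - ss_res / ss_tot

    def discovery_rates(true_splits, found_splits):
        """Assumed definitions: TDR = fraction of ground-truth splits recovered by
        the learned tree; FDR = fraction of learned splits not in the ground truth.
        Splits are represented as hashable identifiers for illustration."""
        true_splits, found_splits = set(true_splits), set(found_splits)
        tdr = len(true_splits & found_splits) / len(true_splits) if true_splits else 1.0
        fdr = len(found_splits - true_splits) / len(found_splits) if found_splits else 0.0
        return tdr, fdr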

CART vs. ORT

We first present a comparison between CART and ORT. Both methods solve the same optimization problem to produce a single decision tree with constant predictions in the leaves, and therefore this comparison seeks to demonstrate the impact of solving the decision tree problem from a global perspective, rather than greedily.

Figure 4-5: Synthetic experiments showing the effect of number of features.

[Figure panels: out-of-sample R², true discovery rate, and false discovery rate versus number of features, for CART and ORT.]

The effect of the amount of training data

The first experiment shows the impact of the amount of training data relative to the complexity of the problem. The results are shown in Figure 4-4. We can see that both methods approach a perfect out-of-sample R² of 1 as the number of training points increases. However, ORT reaches this point with significantly fewer training points than CART, and has higher accuracy at all training set sizes. The difference in R² is largest for smaller training sets, at around 0.1, and this improvement diminishes as the training set size increases. This is clear evidence that ORT requires much less data than CART to learn the structure in the data. This result mirrors the findings of the corresponding experiment in Section 2.5, showing again that solving the decision tree problem with better optimization methods does not lead to overfitting, but rather to better solutions. The TDR is about the same for both methods, increasing as the training set size increases, as we would expect. As the training set size increases, CART's FDR remains roughly constant at around 40%, while the FDR of ORT falls to around 10%. This again reinforces the results for classification, demonstrating that Optimal Trees are much better able to identify the whole truth, and nothing but the truth, in the data.

Figure 4-6: Synthetic experiments showing the effect of number of random restarts.

[Figure panels: out-of-sample R², true discovery rate, and false discovery rate versus number of random restarts, for CART and ORT.]

The effect of dimensionality

The second experiment investigates the effect of the dimensionality of the problem, and the results are shown in Figure 4-5. The performance of both methods decreases with increasing dimension, reflecting the increased difficulty of the problem. The R² falls faster with dimension for ORT than for CART, with the difference increasing to 0.08 when p = 1000. ORT has a small but consistent improvement in TDR over CART. CART has an FDR of around 40–45% at all dimensions, while ORT has an FDR of around 10%, which in both cases is the same as the FDR for the n = 1000 case in the previous experiment. This shows that the ability of ORT to learn the truth in the data and not be misled by irrelevant features is unaffected by high dimensionality.

The effect of the number of random restarts

The third experiment investigates how the number of random restarts affects the performance of ORT. The results are shown in Figure 4-6. We see that the R² of ORT initially increases rapidly as the number of restarts is increased, but the rate of increase begins to slow after around 100 restarts. The TDR increases steadily as the number of restarts increases, overtaking CART at around 300 restarts. The FDR initially decreases sharply, and eventually levels out at around 10%, less than a quarter of CART's 45%. Echoing the corresponding results from classification, this experiment gives strong evidence that we should be using at least 100–1000 restarts to capture most of the advantage of Optimal Trees. Further increasing the number of restarts beyond this point continues to provide improvements in quality, but significantly increases the computation required.

Figure 4-7: Synthetic experiments showing the effect of feature noise.

[Figure panels: out-of-sample R², true discovery rate, and false discovery rate versus feature noise (%), for CART and ORT.]

The effect of noise in the training data

The fourth experiment considers the effect of adding noise to the features of the training data. As in the previous experiments for classification, we select a random f% of the training points and add random noise ε ~ U(−0.1, 0.1) to every feature of these points, where f is the parameter of the experiment. Figure 4-7 shows the impact of this feature noise in the training data. As expected, the R² of both methods decreases as we increase the noise in the data, with ORT having an advantage of around 0.04–0.05 at all noise levels. The TDR decreases in a similar fashion to the R² as the noise increases, with both methods performing similarly. The FDR increases slightly with increasing noise for both methods, but ORT has an FDR less than half that of CART. The final experiment also adds noise to the training data, but to the labels instead of the features. We introduce noise to the labels by adding i.i.d. random noise ~ N(0, f²) to each point, where f is the parameter of the experiment. Figure 4-8 shows the impact of increasing levels of such label noise in the training data. As in the previous experiment with noise, the R² of both methods decreases as the amount of noise is increased.

Figure 4-8: Synthetic experiments showing the effect of label noise.

[Figure panels: out-of-sample R², true discovery rate, and false discovery rate versus label noise (%), for CART and ORT.]

ORT outperforms CART in terms of R² by around 5–7%, with no apparent dependence on the amount of noise. The TDR also decreases with increasing noise, with ORT outperforming CART by around 3–4%. The FDR of CART is largely unaffected by the label noise, while the FDR of ORT increases slightly as the amount of noise increases, growing from around 13% to around 23%, compared to CART's FDR of around 42%.

As we did in Section 2.5 for classification, we repeated the first three experiments with the addition of both feature noise and label noise to confirm that the trends in the results presented are robust to noise. In each experiment, the results with noise are not significantly different from those without, other than slight decreases in performance due to the increased difficulty of the problems. This demonstrates that the advantage of ORT over CART is unchanged by the presence of noise, provided of course that the level of noise is not so high as to destroy any chance of recovering the truth in the data.

Together these experiments with noise offer strong evidence that Optimal Trees do not overfit to the noise in the data, but rather that by solving the problem optimally we are better able to filter through the noise and extract only the truth.

Figure 4-9: Synthetic experiments showing the effect of training set size.

1.0 1.00 1.00 2 R

0.9 0.75 0.75

0.8 0.50 0.50

0.7 0.25 0.25 True discovery rate discovery True False discovery rate discovery False Out−of−sample 0.6 0.00 0.00 102 102.5 103 103.5 104 102 102.5 103 103.5 104 102 102.5 103 103.5 104 Num. training points Num. training points Num. training points

CART ORT−H Boosting ORT Random Forest Truth

All Methods

Next, we compare ORT and ORT-H against the other tree-based methods for regression (CART, random forests, and boosting) in order to evaluate how Optimal Regression Trees perform relative to the methods used in practice.

Ground truth trees with parallel splits

First, we repeat the experiments with all methods present, so that we can examine the performance of ORT-H on data generated from trees with a parallel split structure, and compare the performance of CART, ORT, and ORT-H to random forests and boosting, which have state-of-the-art accuracy. This gives us a sense of the significance of the improvements of ORT and ORT-H over CART.

The effect of the amount of training data

The first experiment measures the effect of increasing the amount of available training data. The results are shown in Figure 4-9. We see that ORT and ORT-H share the highest out-of-sample 푅2 at all training set sizes other than 푛 = 100, where ORT, random forests, and boosting all perform similarly and ORT-H lags slightly. All methods eventually approach perfect out-of-sample performance with enough training data, but ORT and ORT-H require much less training data to reach this limit. All three tree methods have the same TDR, but CART has a significantly worse FDR than the optimal tree methods. The FDR of CART remains roughly constant at around 50% regardless of the training set size, indicating that half of the splits in the CART model are meaningless, which causes problems when attempting to interpret the tree in a meaningful way. On the other hand, the optimal tree methods have an FDR that goes to zero as the training set size increases, which, coupled with the TDR approaching one, indicates that these methods can successfully recover the whole truth in the data, and nothing but the truth.

Figure 4-10: Synthetic experiments showing the effect of number of features. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. number of features; methods: CART, ORT, ORT-H, Random Forest, Boosting, Truth.)

The effect of dimensionality

The second experiment measures the performance of the methods while changing the number of features in the dataset, and the results are shown in Figure 4-10. We see trends that are generally similar to those for classification in Figure 3-5. The out-of-sample 푅2 for all methods decreases as the number of features increases, as a consequence of the problem becoming more difficult. However, the performance of random forests falls much faster than the other methods, and this decreased performance is also accompanied by increased variance. This seems to indicate that random forests are ill-suited to regression problems with large numbers of features. ORT-H shows a drop in accuracy compared to ORT as the number of features increases, but this drop is much less significant than the corresponding drop seen for OCT-H relative to OCT.

Figure 4-11: Synthetic experiments showing the effect of number of random restarts. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. number of random restarts; methods: CART, ORT, ORT-H, Random Forest, Boosting, Truth.)

All three tree methods perform similarly in TDR. The FDR of CART and ORT remains constant with an increasing number of features, but the FDR for ORT-H begins to worsen. As with the out-of-sample 푅2, this deterioration as the number of features increases is less pronounced than it was for OCT-H compared to OCT.

The effect of the number of random restarts

Next, we consider varying the number of random restarts used in the optimal tree methods. For comparison purposes, we set the number of trees in both random forests and boosting equal to the number of random restarts, so that we measure how well each method performs as a function of how many trees are trained overall. The results are shown in Figure 4-11. The out-of-sample performance of ORT and ORT-H is the best, and largely levels out after 1,000 random restarts. Boosting levels out after around 100 trees, but performs significantly worse than the optimal tree methods. Random forests do not seem to be very stable, with performance that oscillates between that of CART and boosting and has much higher variance than the other methods. The TDR and FDR of ORT and ORT-H are very similar, with the TDR leveling out around 0.8, similar to CART, and the FDR decreasing towards zero, compared to around 0.45 for CART.

Figure 4-12: Synthetic experiments showing the effect of feature noise. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. feature noise (%); methods: CART, ORT, ORT-H, Random Forest, Boosting, Truth.)

The effect of noise in the training data

The next experiment considers adding feature noise to the training data, and Figure 4-12 shows the results. We can see that the out-of-sample performance of all methods falls as the noise increases, as we would expect. ORT is the best performing method and CART the worst, with random forests and boosting performing similarly, in between ORT and CART. ORT-H starts with performance similar to ORT, but with increasing noise falls to performance similar to random forests and boosting, indicating it has more trouble dealing with the noise than ORT. This is mirrored in the TDR and FDR, where the TDR of ORT-H is slightly lower than the others at high levels of noise, and the FDR increases towards that of CART as the noise is increased. This diminished performance of ORT-H as the noise increases can be attributed to it being a more flexible and powerful model, and thus it seems to be thrown off by the noise more than the simpler ORT. That said, ORT-H is still the second-best performing method in the test, and moreover the ground truth tree is a parallel tree, so it is perhaps unsurprising that ORT has better performance when the noise is increased and the problem becomes more difficult.

We also consider adding label noise to the training data, and the results of this experiment are shown in Figure 4-13. The results for all methods largely mirror those for classification in Figure 3-9, with the exception of random forests, which seem to perform significantly worse under label noise for regression compared to classification. The optimal tree methods and CART all decrease slightly in out-of-sample performance as the label noise increases, whereas random forests and boosting both deteriorate significantly as the label noise is increased. At the highest level of noise, the 푅2 for ORT is around 0.85, compared to around 0.7 for random forests and 0.55 for boosting, demonstrating that optimal tree methods can much better handle label noise in the training data.

Figure 4-13: Synthetic experiments showing the effect of label noise. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. label noise (%); methods: CART, ORT, ORT-H, Random Forest, Boosting, Truth.)

Summary for ground truth trees with parallel splits

We now summarize the results of these comparisons of all methods on data generated from ground truth trees with parallel splits. In most cases, ORT and ORT-H performed similarly and were the best among all the methods. There were some cases where the performance of ORT-H was reduced compared to ORT, but even in these cases it was still the second-best performing method. We also found some cases where random forests and boosting exhibited significantly reduced performance, namely under high levels of label noise and, for random forests, with increased numbers of features. The key takeaway from these results is that ORT and ORT-H achieve performance at least comparable to, and often stronger than, random forests and boosting when the underlying truth in the data follows a tree structure. We therefore do not need to choose between model interpretability and performance if we have reason to believe the data has an underlying tree structure.

Figure 4-14: Synthetic experiments showing the effect of split density. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. maximum split density; methods: CART, ORT, ORT-H, Random Forest, Boosting, Truth.)

Ground truth trees with hyperplane splits

Next, we consider experiments in which the data are generated according to a ground truth tree with hyperplane splits. These experiments aim to show the added power of ORT-H over ORT, and also to benchmark the performance of CART, random forests, and boosting on these tougher problems to provide a reference for the performance of the optimal tree methods.

The effect of split density

Our first experiment with data generated from trees with hyperplane splits considers changing the maximum number of variables appearing in these hyperplane splits, which we call the split density. The results are shown in Figure 4-14, and closely mirror the corresponding results for classification in Figure 3-10. Unsurprisingly, the performance of all methods falls as the split density increases and the problem becomes harder. We see that ORT-H is the best performing method, which is again unsurprising as it can explicitly model and learn hyperplane splits. Random forests and boosting are the next best performing methods, followed by CART and ORT, indicating that aggregating trees with parallel splits leads to better performance in cases where the parallel splits do not accurately describe the structure in the data.

Figure 4-15: Synthetic experiments showing the effect of number of random hyperplane restarts. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. number of hyperplane restarts; methods: CART, ORT, ORT-H, Random Forest, Boosting, Truth.)

The effect of the number of random hyperplane restarts

The second experiment with data generated by hyperplane trees aims to measure the effect of the number of random hyperplane restarts 퐻 on the performance of ORT-H. The data were generated from ground truth trees with hyperplane splits with a maximum split density of 5 (the same as the number of features 푝). The results shown in Figure 4-15 again closely match those for the corresponding test for classification in Figure 3-11. We see that the most important change in performance occurs when we go from no hyperplane restarts to just a single restart. After that, the performance is largely level, with only small increases as 퐻 is further increased. This offers strong evidence that we should always choose 퐻 > 0, and to be sure that the best performance is reached, 퐻 should likely be somewhere between 5 and 10, depending on how much computational power and time is available.

Summary for ground truth trees with hyperplane splits

We saw in both experiments where the ground truth was trees with hyperplane splits that ORT-H was able to perform better than the other methods and deal with the increased complexity of the problem. We saw that ORT-H had the best performance as the density of the true hyperplane splits increased, and also that random forests and boosting outperformed CART and ORT, but were unable to reach the performance of ORT-H. This shows that aggregation of trees with parallel splits can help to model more complicated structures in the data, but is not as powerful as a model that captures the structure in the data exactly. We also saw that adding just a single random hyperplane restart to the training process greatly increased the performance of ORT-H, just as it did for OCT-H.

4.5 Experiments with Real-World Datasets

In this section, we evaluate the performance of our Optimal Regression Trees against other methods across a wide sample of real-world datasets to compare their applicability in practical settings.

Experimental Setup

The setup for these experiments largely mirrors that used for classification in Sections 2.6 and 3.5. We used a diverse collection of 26 regression datasets from the UCI Machine Learning Repository [77], a data repository at the University of Porto [114], and a data repository at the University of Florida [121]. The datasets range in size from hundreds to tens of thousands of points, with the number of features typically in the tens.

As described in Section 2.6, we split each dataset three ways into training, validation, and testing sets (50/25/25%). We tuned the hyperparameters of each method using the training and validation sets before training the final model on the combined training and validation sets with the tuned hyperparameter values. We then evaluated the 푅2 of the resulting model on the test set to determine the out-of-sample performance. This procedure is repeated for five splittings of each dataset and the performances we report are averaged across these different instances.

We trained all methods with maximum depths 1–10, which in nearly every case was sufficiently deep; increasing the depth further did not improve the performance. The results for depth 10 can therefore be seen as the results when imposing no restriction on the maximum depth of each method.
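The following is a schematic sketch of this evaluation protocol (50/25/25 split, tuning on the validation set, refitting on the combined training and validation sets, and averaging the test 푅2 over five splittings). It uses scikit-learn's DecisionTreeRegressor purely as a placeholder estimator and depth as the only tuned hyperparameter; the thesis methods and the details of Algorithm 2.7 are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

def evaluate(X, y, depths=range(1, 11), n_repeats=5, seed=0):
    scores = []
    for rep in range(n_repeats):
        # 50% training, then the remainder split equally into validation and test (25%/25%)
        X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.5, random_state=seed + rep)
        X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed + rep)
        # tune the maximum depth on the validation set
        best_depth = max(depths, key=lambda d: r2_score(
            y_val, DecisionTreeRegressor(max_depth=d, random_state=0).fit(X_tr, y_tr).predict(X_val)))
        # refit on train + validation with the tuned depth, then score out-of-sample on the test set
        final = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(
            np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]))
        scores.append(r2_score(y_te, final.predict(X_te)))
    return float(np.mean(scores))
```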

Table 4.1: Full results for CART and ORT at depth 10. The best method for each dataset is indicated in bold. Positive improvements are highlighted in blue, and negative in red.

Name                     n      p    CART 푅2   ORT 푅2    Mean improvement
abalone                  4176   7    0.460     0.457     −0.003 ± 0.013
ailerons                 7153   40   0.746     0.736     −0.010 ± 0.010
airfoil-self-noise       1502   5    0.816     0.814     −0.002 ± 0.009
auto-mpg                 391    8    0.771     0.771     0.000 ± 0.011
automobile               158    51   0.581     0.590     +0.009 ± 0.089
cart-artificial          40767  10   0.948     0.948     0.000 ± 0.000
communities-and-crime    122    122  0.409     0.507     +0.098 ± 0.084
computer-hardware        208    36   0.789     0.858     +0.069 ± 0.093
concrete-compressive     102    7    0.453     0.364     −0.088 ± 0.072
concrete-flow            102    7    0.208     0.251     +0.043 ± 0.028
concrete-slump           102    7    0.172     0.217     +0.045 ± 0.023
cpu-act                  8191   21   0.970     0.966     −0.004 ± 0.002
cpu-small                8191   12   0.959     0.959     0.000 ± 0.001
elevators                8751   18   0.686     0.689     +0.003 ± 0.003
friedman-artificial      40767  10   0.846     0.852     +0.006 ± 0.001
housing                  505    13   0.705     0.750     +0.045 ± 0.058
hybrid-price             152    3    0.512     0.464     −0.048 ± 0.057
kin8nm                   8191   8    0.442     0.498     +0.056 ± 0.014
lpga-2008                156    6    0.594     0.624     +0.030 ± 0.052
lpga-2009                145    11   0.802     0.786     −0.016 ± 0.023
parkinsons-motor         5874   16   0.159     0.159     0.000 ± 0.021
parkinsons-total         5874   16   0.144     0.163     +0.019 ± 0.006
vote-for-clinton         2703   9    0.315     0.286     −0.028 ± 0.013
wine-quality-red         1598   11   0.286     0.263     −0.024 ± 0.014
wine-quality-white       4897   11   0.279     0.282     +0.003 ± 0.003
yacht-hydrodynamics      307    6    0.988     0.991     +0.003 ± 0.002

CART was trained with minbucket = 1 and cp determined using cost-complexity pruning. ORT was trained with minbucket = 1, 푅 = 100 random restarts, and cp tuned using the batch pruning procedure in Algorithm 2.6. ORT-H was trained with the same parameters as ORT along with 퐻 = 5 random hyperplane restarts. Random forests and boosted trees were trained as described in Section 3.4.

CART vs. ORT

First, we present a comparison of the relative performances of CART and ORT. These comparisons allow us to directly measure the impact of solving the regression tree construction problem with optimization methods rather than greedily. Table 4.1 shows the out-of-sample 푅2 on each dataset for both methods with maximum depth 10, as well as the average improvement in 푅2 of ORT over CART.

Figure 4-16: Mean out-of-sample 푅2 for CART and ORT across all 26 datasets, as a function of the maximum depth of the tree.

The mean out-of-sample 푅2 across all datasets is shown in Figure 4-16 for both CART and ORT as a function of the maximum depth of the tree. A comparison of the accuracies is also provided in Table 4.2, showing the average performance of each method, the average improvement of ORT over CART, and the associated p-value for the statistical significance of the difference. There is no difference in performance at depth 1, as CART is also optimal when only one split is present; unlike in classification, the CART and ORT loss functions are the same for regression. The methods perform similarly at depths 2 and 3, indicating that for very shallow trees the greediness of CART does not hurt its performance. We see that ORT has a small advantage over CART of around 0.01 from depths 4–10, although this difference is not statistically significant. This indicates that ORT is able to slightly improve upon the performance of CART for deeper trees, but it appears the key factor limiting performance is the structure of the model we are fitting (a tree with parallel splits and constant predictions) rather than how well it is being optimized. Similar to classification, ORT is able to achieve performance comparable to CART with shallower trees. For instance, the performance of CART with depth 10 (effectively no restriction) is matched by ORT with depth 6 or 7, meaning the resulting trees are easier to interpret.

Table 4.2: Mean out-of-sample 푅2 results for CART and ORT across all 26 datasets.

Maximum depth   CART    ORT     Mean improvement    p-value
1               0.312   0.312   0.000               -
2               0.465   0.467   +0.002 ± 0.006      0.7152
3               0.522   0.521   −0.001 ± 0.007      0.8298
4               0.542   0.556   +0.014 ± 0.008      0.0820
5               0.558   0.570   +0.012 ± 0.008      0.1507
6               0.565   0.577   +0.011 ± 0.008      0.1681
7               0.570   0.580   +0.010 ± 0.008      0.2034
8               0.573   0.584   +0.011 ± 0.008      0.1607
9               0.577   0.586   +0.009 ± 0.008      0.2330
10              0.579   0.586   +0.008 ± 0.008      0.3086

All methods

Next, we present a performance comparison of all methods considered. In addition to CART and ORT, we include results for ORT-H to investigate the impact of allowing hyperplane splits, as well as random forests and gradient boosted trees to compare our methods against those with state-of-the-art regression performance. Figure 4-17 shows the mean out-of-sample 푅2 for each method across all 26 datasets, as a function of the maximum permitted depth of the tree.

First, we compare ORT-H to the other single decision tree methods. We can see that ORT-H improves significantly upon both CART and ORT at all depths larger than 1, with an improvement in 푅2 of 0.05–0.10, depending on the depth. Moreover, ORT-H with depth 2 performs comparably to CART and ORT with no depth restriction. This again provides evidence for the idea introduced in Section 3.3 that a tree with hyperplane splits is not necessarily less interpretable than one with parallel splits, because the depths of the trees might be very different to achieve the same performance. Finally, we can see that the performance of ORT-H largely levels out after depth 5, indicating that we are reaching some limit on how much larger trees can help.

Next, we compare against the two remaining methods, which typically have state-of-the-art performance: random forests and boosting.

Figure 4-17: Mean out-of-sample 푅2 for each method across all 26 datasets, as a function of the maximum depth of the tree (methods: CART, ORT, ORT-H, Random Forest, Boosting).

We can see that the strongest performer at all depths is boosting, improving upon the average 푅2 of ORT-H by around 0.05. Random forests and ORT-H are comparable at depths up to 5, after which random forests continue improving and eventually reach performance comparable to boosting at depth 10, while ORT-H levels out. We conclude that while ORT-H is able to improve significantly upon the regression trees with parallel splits, the addition of hyperplane splits is not enough to close the gap with random forests and boosting, unlike in classification.

We will now quantify these comparisons using the same procedure for multiple comparisons that was used in Section 3.5. Table 4.3 shows both the 푅2 and performance rank of each method on each of the datasets, followed by the average rank of each method at the bottom of the table. We can see that boosting is the strongest method in terms of average rank, followed closely by random forests. There is then a larger gap to ORT-H, which is then followed by ORT and CART, which have roughly the same average rank.

Table 4.4 shows the significance of the differences between each pair of methods, again following the procedure in Section 3.5. We can see that neither boosting/random forests nor ORT/CART have statistically significant differences in performance, but all other pairs are significantly different. This reinforces our conclusion that while ORT-H improves upon the simple trees of ORT and CART, it does not reach the state-of-the-art performance exhibited by random forests and boosting.
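The multiple-comparison procedure behind Tables 4.4 and 5.2 can be sketched as follows, assuming the per-dataset out-of-sample 푅2 values of each method are available (random placeholders are used below): a Wilcoxon signed-rank test for each pair of methods, followed by a Holm-Bonferroni adjustment of the p-values.

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# placeholder per-dataset R^2 values; the real inputs would come from the experiments
methods = {name: rng.random(26) for name in ["CART", "ORT", "ORT-H", "RF", "Boosting"]}

pairs, pvals = [], []
for m1, m2 in combinations(methods, 2):
    stat, p = wilcoxon(methods[m1], methods[m2])   # paired, per-dataset comparison
    pairs.append((m1, m2))
    pvals.append(p)

# Holm-Bonferroni: multiply the k-th smallest p-value by (m - k + 1) and enforce monotonicity
order = np.argsort(pvals)
adjusted = np.empty(len(pvals))
running_max = 0.0
for rank, idx in enumerate(order):
    adj = min(1.0, (len(pvals) - rank) * pvals[idx])
    running_max = max(running_max, adj)
    adjusted[idx] = running_max

for (m1, m2), p, p_adj in sorted(zip(pairs, pvals, adjusted), key=lambda x: x[1]):
    print(f"{m1} vs. {m2}: p = {p:.4f}, adjusted p = {p_adj:.4f}")
```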

Table 4.3: Performance results for regression methods (with maximum depth 10) on each of the 26 datasets. For each method, we show the out-of-sample 푅2 followed by the method's rank on the dataset in parentheses. A rank of 1 indicates the method had the best performance on a dataset and a rank of 5 indicates the method performed worst.

Dataset                  CART        ORT         ORT-H       RF          Boosting
abalone                  0.460 (4)   0.457 (5)   0.519 (2)   0.477 (3)   0.536 (1)
ailerons                 0.746 (4)   0.736 (5)   0.806 (3)   0.832 (2)   0.844 (1)
airfoil-self-noise       0.816 (4)   0.814 (5)   0.863 (3)   0.910 (2)   0.941 (1)
auto-mpg                 0.771 (3)   0.771 (4)   0.766 (5)   0.847 (2)   0.849 (1)
automobile               0.581 (4)   0.590 (3)   0.551 (5)   0.732 (2)   0.736 (1)
cart-artificial          0.948 (1.5) 0.948 (1.5) 0.948 (4)   0.945 (5)   0.948 (3)
communities-and-crime    0.409 (4)   0.507 (3)   0.390 (5)   0.702 (1)   0.679 (2)
computer-hardware        0.789 (5)   0.858 (3)   0.797 (4)   0.929 (1)   0.929 (2)
concrete-compressive     0.453 (4)   0.364 (5)   0.569 (3)   0.690 (2)   0.792 (1)
concrete-flow            0.208 (5)   0.251 (4)   0.380 (3)   0.437 (1)   0.396 (2)
concrete-slump           0.172 (5)   0.217 (4)   0.570 (1)   0.359 (2)   0.272 (3)
cpu-act                  0.970 (4)   0.966 (5)   0.975 (3)   0.981 (2)   0.985 (1)
cpu-small                0.959 (4)   0.959 (5)   0.967 (3)   0.974 (2)   0.979 (1)
elevators                0.686 (5)   0.689 (4)   0.874 (2)   0.815 (3)   0.877 (1)
friedman-artificial      0.846 (5)   0.852 (4)   0.920 (2)   0.896 (3)   0.952 (1)
housing                  0.705 (5)   0.750 (4)   0.753 (3)   0.852 (2)   0.867 (1)
hybrid-price             0.512 (3)   0.464 (5)   0.509 (4)   0.615 (1)   0.577 (2)
kin8nm                   0.442 (5)   0.498 (4)   0.752 (2)   0.659 (3)   0.776 (1)
lpga-2008                0.594 (5)   0.624 (4)   0.647 (3)   0.770 (2)   0.775 (1)
lpga-2009                0.802 (4)   0.786 (5)   0.848 (3)   0.890 (1)   0.885 (2)
parkinsons-motor         0.159 (4)   0.159 (5)   0.241 (3)   0.333 (2)   0.351 (1)
parkinsons-total         0.144 (5)   0.163 (4)   0.283 (3)   0.334 (2)   0.361 (1)
vote-for-clinton         0.315 (4)   0.286 (5)   0.354 (3)   0.408 (1)   0.404 (2)
wine-quality-red         0.286 (4)   0.263 (5)   0.299 (3)   0.429 (1)   0.394 (2)
wine-quality-white       0.279 (5)   0.282 (4)   0.300 (3)   0.442 (2)   0.497 (1)
yacht-hydrodynamics      0.988 (5)   0.991 (3)   0.991 (4)   0.994 (2)   0.996 (1)

Average rank             4.250       4.173       3.154       2.000       1.423

Table 4.4: Pairwise significance tests for performance of regression methods (with maximum depth 10) across 26 datasets. The comparisons are presented in order of significance. For each comparison, the strongest-performing method is both bolded and stated first in the comparison. The p-values are found using the Wilcoxon signed-rank test, and the adjusted p-values are calculated using the Holm-Bonferroni method to account for multiple comparisons. All results above the dividing line are significant at the 95% level.

Comparison                      p-value       Adjusted p-value
Random Forest vs. CART          ∼ 10^{-7}     ∼ 10^{-6}
Random Forest vs. ORT           ∼ 10^{-7}     ∼ 10^{-6}
Boosting vs. CART               ∼ 10^{-7}     ∼ 10^{-6}
Boosting vs. ORT                ∼ 10^{-7}     ∼ 10^{-6}
Boosting vs. ORT-H              ∼ 10^{-4}     0.0002
ORT-H vs. CART                  ∼ 10^{-4}     0.0003
ORT-H vs. ORT                   0.0008        0.0033
Random Forest vs. ORT-H         0.0051        0.0154
----------------------------------------------------------
Boosting vs. Random Forest      0.0842        0.1684
ORT vs. CART                    0.3666        0.3666

We conclude by summarizing these computational results with real-world regression datasets. We found that ORT likely offers a small advantage in performance over CART, but this difference is not statistically significant. ORT-H improved significantly over both CART and ORT, showing that allowing for hyperplane splits leads to greatly increased performance. However, ORT-H was still significantly weaker than random forests and boosting for regression problems, unlike in classification, where the addition of hyperplane splits led to OCT-H having comparable performance with both boosting and random forests.

4.6 Conclusions

In this chapter, we applied the approach we developed for Optimal Classification Trees to the problem of regression. Following the approach of CART, we restricted our attention to regression trees that simply make a constant prediction in each leaf of the tree. We formulated the task of finding the Optimal Regression Tree using MIO, which required only slight modifications to the MIO formulation for Optimal Classification Trees. Empirically, we found that this MIO formulation was not practically solvable even on very small datasets, and so we adapted the same local search procedure that is used to train Optimal Classification Trees to also construct regression trees. This local search method runs much faster than the MIO formulation, and is able to deliver substantial improvements in objective value over the solutions found by the MIO solver with a generous time limit.

We again conducted extensive computational experiments to validate and benchmark the Optimal Regression Trees against existing methods. Our experiments with synthetic data largely mirrored the results of the corresponding synthetic experiments for classification; our Optimal Tree methods deliver significant improvements over CART in terms of out-of-sample 푅2, and also construct trees with significantly fewer false positive splits, meaning that they can be more safely interpreted. The experiments with the 26 real-world datasets showed that ORT improved slightly upon CART, with ORT-H further improving significantly upon both. However, unlike in classification, the addition of hyperplane splits did not close the gap with the state-of-the-art methods, which delivered higher performance than ORT-H.

Overall, the results of our experiments show that while our optimization approach and the addition of hyperplane splits add value to the regression tree problem, we are not able to reach state-of-the-art performance with these alone. This is likely due to our restriction that we make only constant predictions in the leaves, and so in the next chapter we will consider using more sophisticated prediction functions in the leaves.

Chapter 5

Optimal Regression Trees with Linear Predictions

CART and other classical methods for regression trees generate trees that make only constant predictions in each leaf of the tree. The predictions made by such trees correspond to a piecewise constant model over the training space, and as such these trees may have trouble accurately capturing the structure in the data without fitting very deep trees and sacrificing some interpretability.

The key reason that constant predictions are used in place of more sophisticated predictive models is computational speed. When finding the best split at a node, a predictive model must be fit in each leaf node when evaluating each candidate split. For a standard parallel split, this means we have to fit 푂(푛푝) regression models in order to find the best split, which can be prohibitively expensive if fitting the predictive model is non-trivial.

Most attempts in the literature to fit more comprehensive regression models in the leaves artificially limit the power of the regression tree to increase the computational tractability of their approach. The existing methods are also all greedy in nature, bringing with them all the inherent disadvantages we have demonstrated thus far. The simplest approach is to first train a regression tree with constant predictions and then apply the more sophisticated models in the leaves, thus avoiding the cost of repeatedly fitting models during training.

The main drawback to such approaches is the disconnect between the tree structure and the prediction functions; the tree tends to be larger than needed due to the limited power of constant predictions, and the splits are not chosen with the prediction functions in mind. M5 [105] uses a similar approach to this, where a tree with constant predictions is first trained. At each node in the tree, a linear regression model is fit, using as predictors only the variables that appear in splits below this node. If this new linear model outperforms the constant predictions of the subtree rooted at this node, the subtree is deleted and replaced with the linear model.

Another approach taken by many methods [34, 81, 63] is to separate the task of choosing the split variable from the task of choosing the split threshold. These methods use hypothesis testing to identify the split variable that is “most significant” in the model, and then proceed to find the optimal split on this variable. This means that only 푂(푛) predictive models need to be fit to decide upon a split. However, this speedup clearly comes at the cost of potentially making the wrong decision regarding which variable to split on.

It is also possible to reduce the computation required by limiting the power of the models fit in each leaf. Kim et al. [72] propose a method where each leaf model is limited to two variables (chosen by stepwise regression), which also has the benefit of improving the interpretability of the model. The main drawback is that the algorithm has no ability to fit more complicated regression models when it is advantageous to do so; it is always restricted in power.

Ideally, the regression models in each leaf should be robust to fluctuations in the data. This would ensure that they do not change too much when testing candidate splits, thus allowing us to zero in on the correct split location. Additionally, the models in the leaves should be only as complicated as necessary to fit the data in the leaf, and no more; if the data is well-described by few or no variables, this is the model that should be used, but if more complexity is needed, we should use a more detailed model. We desire simple models for interpretability and to avoid overfitting, but do not want to blindly impose these conditions and sacrifice performance. Finally, the algorithm needs to be designed in a fashion that permits efficient training of such models without requiring us to limit the power of the tree.

In this chapter, we detail how we extend the Optimal Trees framework to generate Optimal Regression Trees with linear predictions in the leaves. We demonstrate that the training process can be conducted efficiently and that the resulting trees have state-of-the-art performance whilst maintaining interpretability.

5.1 ORT with Linear Predictions via MIO

In this section, we will modify the MIO formulation for Optimal Regression Trees in Section 4.2 to generate trees with linear predictions rather than constant predictions.

In the previous MIO formulation, we used the variables 훽푡 to represent the constant predictions made in each leaf, and used these predictions to set the fitted values 푓푖 with the constraint (4.1). It is straightforward to extend this approach to use linear predictions instead. We will use the variables 훽푡 and 훽0푡 to track the prediction at each leaf, which takes the form

y(x) = \beta_t^T x + \beta_{0t}

We then update the constraint (4.1) with the new prediction function:

f_i = \sum_{t \in \mathcal{T}_L} (\beta_t^T x_i + \beta_{0t}) z_{it}, \qquad \forall i \in [n],    (5.1)

which linearizes in the same fashion to give

-M_f (1 - z_{it}) \le f_i - (\beta_t^T x_i + \beta_{0t}) \le M_f (1 - z_{it}), \qquad \forall i \in [n],\ t \in \mathcal{T}_L,    (5.2)

where again 푀푓 is a sufficiently large constant. We can obtain a bound on the value of 푀푓 by noting that the objective value of any feasible solution to the problem, denoted 푧퐹, provides an upper bound on the loss at any single point, since the optimal loss must be no larger than the loss of any feasible solution, and in the worst case all of the loss is placed on a single point. For squared loss, this leads to the inequality:

(f_i - y_i)^2 \le z_F, \ \forall i \in [n] \implies |f_i| \le \sqrt{z_F} + \max_i |y_i|

This implies that at optimality, 푓푖 and (훽푡Tx푖 + 훽0푡) can be at most 2(√푧퐹 + max푖 |푦푖|) apart, and so this is a valid value for 푀푓. We can apply similar reasoning to the absolute loss case to reach the value 푀푓 = 2(푧퐹 + max푖 |푦푖|). In both cases, we use the best feasible solution value available to us, of which there is always at least one, namely the baseline solution.
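As a small illustration of these bounds, the following sketch computes 푀푓 from the training labels and a feasible objective value; treating 푧퐹 as the unscaled training loss of a baseline solution is an assumption made here for simplicity.

```python
import numpy as np

def big_M_f(z_F, y, loss="squared"):
    """Valid M_f for the linking constraint (5.2), following the argument above."""
    if loss == "squared":
        return 2 * (np.sqrt(z_F) + np.max(np.abs(y)))
    if loss == "absolute":
        return 2 * (z_F + np.max(np.abs(y)))
    raise ValueError("unknown loss")

y = np.array([1.2, -0.4, 3.1, 0.0])
print(big_M_f(z_F=5.0, y=y))   # 2 * (sqrt(5) + 3.1)
```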

We also want the ability to restrict the regression predictions in each leaf to avoid overfitting and to assist with interpretability. We do this by penalizing the norm of the regression coefficient vector in each leaf, leading to the following objective:

\frac{1}{\hat{L}} \sum_{i=1}^{n} L_i + \alpha \cdot C + \lambda \sum_{t \in \mathcal{T}_L} \|\beta_t\|,    (5.3)

where the parameter 휆 controls the degree of regularization. We might choose the 퐿-1 norm to control the magnitude of the coefficients and lead to more robust solutions, so that they will not vary much as we make small changes inside the tree. In this case, the norm becomes a sum of absolute values, which can be linearized with standard linear optimization techniques. Alternatively, we can use the 퐿-0 norm to restrict the number of non-zero coefficients in each regression prediction to directly pursue interpretability. In this case, we use the binary variables 푟푗푡 to track whether the 푗th feature is used in the 푡th regression equation:

-M_r r_{jt} \le \beta_{jt} \le M_r r_{jt}, \qquad \forall j \in [p],\ t \in \mathcal{T}_L,    (5.4)

where 푀푟 is a sufficiently large constant. We do not have an explicit method of calculating a small, feasible value for this constant; it must be at least max푗,푡 |훽*푗푡|, where 훽* is the optimal solution. However, note that the data can be normalized, and so we expect the optimal solution to not be too large; therefore a relatively small constant should suffice in most cases. We then use the 푟푗푡 variables to calculate the value of the norm in the objective:

\frac{1}{\hat{L}} \sum_{i=1}^{n} L_i + \alpha \cdot C + \lambda \sum_{t \in \mathcal{T}_L} \sum_{j=1}^{p} r_{jt},    (5.5)

We obtain the complete formulation for Optimal Regression Trees with linear predictions by using these new variables and constraints in place of the corresponding constant counterparts in the formulations from Section 4.2. For instance, the following is the ORT formulation (4.2) with linear predictions and using 퐿-0 regularization:

\min \quad \frac{1}{\hat{L}} \sum_{i=1}^{n} L_i + \alpha \cdot C + \lambda \sum_{t \in \mathcal{T}_L} \sum_{j=1}^{p} r_{jt}    (5.6)
\text{s.t.} \quad L_i \ge (f_i - y_i)^2,  \quad \forall i \in [n],
    f_i - (\beta_t^T x_i + \beta_{0t}) \ge -M_f (1 - z_{it}),  \quad \forall i \in [n],\ t \in \mathcal{T}_L,
    f_i - (\beta_t^T x_i + \beta_{0t}) \le +M_f (1 - z_{it}),  \quad \forall i \in [n],\ t \in \mathcal{T}_L,
    -M_r r_{jt} \le \beta_{jt} \le M_r r_{jt},  \quad \forall j \in [p],\ t \in \mathcal{T}_L,
    C = \sum_{t \in \mathcal{T}_B} d_t,
    \mathbf{a}_m^T x_i \ge b_m - (1 - z_{it}),  \quad \forall i \in [n],\ t \in \mathcal{T}_L,\ m \in \mathcal{R}(t),
    \mathbf{a}_m^T (x_i + \boldsymbol{\epsilon}) \le b_m + (1 + \epsilon_{\max})(1 - z_{it}),  \quad \forall i \in [n],\ t \in \mathcal{T}_L,\ m \in \mathcal{L}(t),
    \sum_{t \in \mathcal{T}_L} z_{it} = 1,  \quad \forall i \in [n],
    z_{it} \le l_t,  \quad \forall i \in [n],\ t \in \mathcal{T}_L,
    \sum_{i=1}^{n} z_{it} \ge N_{\min} l_t,  \quad \forall t \in \mathcal{T}_L,
    \sum_{j=1}^{p} a_{jt} = d_t,  \quad \forall t \in \mathcal{T}_B,
    0 \le b_t \le d_t,  \quad \forall t \in \mathcal{T}_B,
    d_t \le d_{p(t)},  \quad \forall t \in \mathcal{T}_B \setminus \{1\},
    z_{it}, l_t \in \{0, 1\},  \quad \forall i \in [n],\ t \in \mathcal{T}_L,
    r_{jt} \in \{0, 1\},  \quad \forall j \in [p],\ t \in \mathcal{T}_L,
    a_{jt}, d_t \in \{0, 1\},  \quad \forall j \in [p],\ t \in \mathcal{T}_B.

This MIO formulation shares the complexity of the corresponding constant prediction counterpart, so the formulations with squared loss are quadratic MIO problems and those with absolute loss are linear MIO problems. The addition of the binary variables 푟푗푡 and more big-푀 constraints makes these models harder to solve than their constant prediction counterparts. These formulations also exhibit the same issues with scaling as the other MIO formulations, due to the number of variables required as the training set size increases.
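To make the structure of formulation (5.6) concrete, the following is a minimal sketch of the depth-1 special case (a single parallel split and two leaves) with squared loss and 퐿-0 regularization, written with the gurobipy modeling interface. The parameter values, the scalar 휖, and the synthetic data are illustrative assumptions rather than the thesis implementation; the model is only intended to show how the variables and constraints fit together.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.random((n, p))                          # features assumed normalized to [0, 1]
y = X @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.standard_normal(n)

L_hat = np.sum((y - y.mean()) ** 2)             # baseline loss used to scale the objective
lam, alpha_c = 0.01, 0.001                      # illustrative regularization / complexity weights
M_f = 2 * (np.sqrt(L_hat) + np.abs(y).max())    # big-M from the bound derived above
M_r, N_min, eps = 10.0, 1, 1e-3
leaves = [0, 1]                                 # leaf 0 = left branch, leaf 1 = right branch

m = gp.Model("ort_linear_depth1")
m.Params.TimeLimit = 30
a = m.addVars(p, vtype=GRB.BINARY, name="a")            # which feature is split on
b = m.addVar(lb=0.0, ub=1.0, name="b")                  # split threshold
d = m.addVar(vtype=GRB.BINARY, name="d")                # is the split applied?
z = m.addVars(n, leaves, vtype=GRB.BINARY, name="z")    # point-to-leaf assignment
l = m.addVars(leaves, vtype=GRB.BINARY, name="l")       # is the leaf non-empty?
r = m.addVars(p, leaves, vtype=GRB.BINARY, name="r")    # non-zero regression coefficients
beta = m.addVars(p, leaves, lb=-GRB.INFINITY, name="beta")
beta0 = m.addVars(leaves, lb=-GRB.INFINITY, name="beta0")
f = m.addVars(n, lb=-GRB.INFINITY, name="f")            # fitted values
L = m.addVars(n, lb=0.0, name="L")                      # per-point losses

for i in range(n):
    m.addConstr(L[i] >= (f[i] - y[i]) * (f[i] - y[i]))  # convex quadratic epigraph constraint
    m.addConstr(gp.quicksum(z[i, t] for t in leaves) == 1)
    for t in leaves:
        pred = gp.quicksum(beta[j, t] * X[i, j] for j in range(p)) + beta0[t]
        m.addConstr(f[i] - pred >= -M_f * (1 - z[i, t]))
        m.addConstr(f[i] - pred <= M_f * (1 - z[i, t]))
        m.addConstr(z[i, t] <= l[t])
    # routing: left leaf requires a^T(x_i + eps) <= b, right leaf requires a^T x_i >= b
    m.addConstr(gp.quicksum(a[j] * (X[i, j] + eps) for j in range(p))
                <= b + (1 + eps) * (1 - z[i, 0]))
    m.addConstr(gp.quicksum(a[j] * X[i, j] for j in range(p)) >= b - (1 - z[i, 1]))

for t in leaves:
    m.addConstr(gp.quicksum(z[i, t] for i in range(n)) >= N_min * l[t])
    for j in range(p):
        m.addConstr(beta[j, t] <= M_r * r[j, t])
        m.addConstr(beta[j, t] >= -M_r * r[j, t])

m.addConstr(gp.quicksum(a[j] for j in range(p)) == d)
m.addConstr(b <= d)

m.setObjective(gp.quicksum(L[i] for i in range(n)) * (1.0 / L_hat)
               + alpha_c * d
               + lam * gp.quicksum(r[j, t] for j in range(p) for t in leaves),
               GRB.MINIMIZE)
m.optimize()

if m.SolCount > 0:
    j_star = max(range(p), key=lambda j: a[j].X)
    print("split on feature", j_star, "at threshold", b.X)
```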

5.2 ORT with Linear Predictions via Local Search

In this section, we will adapt the same local search procedure as before to generate regression trees with linear predictions in an efficient manner.

As mentioned previously, one of the main challenges in efficiently constructing a regression tree with linear predictions is the increased cost of fitting regression models at each stage of the search process, rather than simply predicting the mean, which can be done in constant time. Classical methods for solving least-squares regression problems using QR decompositions or bidiagonalization require 푂(푛푝^2) time [23]. Moreover, there is limited ability to efficiently update the solution after small changes during the search procedure, and the solutions produced by this approach are dense and highly sensitive to noise in the data, meaning that the solution might change significantly after the addition or removal of a single point in a leaf. All of these are undesirable traits to embed inside the loss function of our local search.

Instead, our approach makes use of the GLMNet algorithm [48], which solves the standard elastic net regression problem:

\min_{\beta_0, \beta} \quad \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^T \beta)^2 + \lambda P_\alpha(\beta)    (5.7)

where

P_\alpha(\beta) = (1 - \alpha) \frac{1}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1    (5.8)

If we set 훼 = 1 to only use 퐿-1 regularization (yielding the LASSO [111]), and denote the fitted values with f_i = \beta_0 + x_i^T \beta, we can rewrite the problem in the simpler form

\min_{\beta_0, \beta} \quad \frac{1}{2n} \sum_{i=1}^{n} (y_i - f_i)^2 + \lambda \|\beta\|_1    (5.9)

The GLMNet algorithm solves this regression problem very efficiently using coordinate descent. The method iteratively updates each entry in the solution vector at very low cost by exploiting the sparsity of the solution vector; if there are 푘 non-zeros in the solution, the cost of a single coordinate-descent cycle is 푂(푘푝), plus 푂(푛푝) any time a zero entry of the solution vector becomes non-zero (which is rare due to the inherent robustness of the method). Compared to the least-squares cost of 푂(푛푝^2), the cost of this coordinate descent approach at 푂(푘푝) plus the occasional 푂(푛푝) represents a significant speedup, especially when the solution vector is very sparse (푘 ≪ 푝).

In addition to the fast updates, a coordinate-descent approach also permits warm-starting the solution process using an existing solution. This is crucial when combined with the inherent robustness of the solutions to this problem, as the solutions change very little at each stage of the search process, so it typically takes very few iterations to reach the optimal solution when warm-starting with the optimal solution from the previous step.
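The following is a plain textbook sketch of lasso coordinate descent with soft-thresholding and warm starts, illustrating why re-solving a leaf problem after a small change to its data is cheap. It is not the GLMNet code itself: this naive version costs 푂(푛푝) per cycle, whereas GLMNet's covariance updates and active-set strategy achieve the 푂(푘푝) behavior described above. It also assumes the columns of X are centered and scaled, and omits the intercept for brevity.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_cd(X, y, lam, beta=None, n_cycles=100, tol=1e-8):
    """Minimize (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()   # warm start if provided
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_cycles):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # rho is (1/n) * X_j^T times the partial residual that excludes feature j
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            if new_bj != beta[j]:
                resid += X[:, j] * (beta[j] - new_bj)      # O(n) residual update
                max_change = max(max_change, abs(new_bj - beta[j]))
                beta[j] = new_bj
        if max_change < tol:
            break
    return beta
```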

Now we will see how we can apply the coordinate-descent approach of GLMNet to Problem (5.6) with 퐿-1 regularization. Our overall objective is of the form

\min \quad \frac{1}{\hat{L}} \sum_{i=1}^{n} (y_i - f_i)^2 + \lambda \sum_{t \in \mathcal{T}_L} \|\beta_t\|_1,    (5.10)

that is, we want to minimize the squared loss of the fitted values 푓푖 with a regularization penalty on the regression coefficients 훽. Let ℐ푡 denote the set of points contained in leaf 푡. We can then rewrite the objective as

\min \quad \sum_{t \in \mathcal{T}_L} \left( \frac{1}{\hat{L}} \sum_{i \in \mathcal{I}_t} (y_i - f_i)^2 + \lambda \|\beta_t\|_1 \right).    (5.11)

Note that this objective is separable in 푡, and so we can solve for each leaf separately. The problem faced in each leaf is

\min \quad \frac{1}{\hat{L}} \sum_{i \in \mathcal{I}_t} (y_i - f_i)^2 + \lambda \|\beta_t\|_1,    (5.12)

which we seek to solve using the GLMNet algorithm. We can do so by rewriting this problem into the same form as (5.9):

\min \quad \frac{1}{2|\mathcal{I}_t|} \sum_{i \in \mathcal{I}_t} (y_i - f_i)^2 + \frac{\lambda \hat{L}}{2|\mathcal{I}_t|} \|\beta_t\|_1,    (5.13)

and therefore we can solve the regression problem at each leaf 푡 using the GLMNet algorithm with \lambda_{GLM} = \lambda \hat{L} / (2|\mathcal{I}_t|) on the subset of data contained in that leaf. Doing this in each leaf separately will yield regression coefficients that minimize the overall objective (5.10).

The local search therefore proceeds as for constant-prediction regression trees in Section 4.3, except that each time we are required to evaluate the loss of the tree, we update the current regression predictions in each leaf by performing coordinate-descent cycles for the problems (5.13) until convergence. We then sum the final objective values of these problems across the leaves (with each leaf weighted by 2|\mathcal{I}_t| / \hat{L}) to get the total loss of the tree.
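A sketch of this per-leaf evaluation is shown below, using scikit-learn's Lasso (whose objective over 푚 samples, (1/2푚)‖y − X훽‖² + 훼‖훽‖₁, matches (5.13) when 훼 = 휆퐿̂/(2|ℐ푡|)) with warm starts cached between local-search moves. The leaf-assignment structure and caching scheme here are assumptions for illustration, not the thesis implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def tree_loss(leaf_indices, X, y, lam, L_hat, leaf_models):
    """Sum over leaves of (1/L_hat)*SSE + lam*||beta_t||_1; leaf_models caches warm starts."""
    total = 0.0
    for t, idx in leaf_indices.items():
        X_t, y_t = X[idx], y[idx]
        alpha = lam * L_hat / (2 * len(idx))
        model = leaf_models.setdefault(
            t, Lasso(alpha=alpha, warm_start=True, max_iter=1000))
        model.alpha = alpha                      # |I_t| may change between moves
        model.fit(X_t, y_t)                      # warm-started coordinate descent
        sse = np.sum((y_t - model.predict(X_t)) ** 2)
        total += sse / L_hat + lam * np.abs(model.coef_).sum()
    return total
```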

Hyperparameter tuning

Using linear regression predictions in each leaf introduces the regularization parameter 휆 as a new tuning parameter that needs to be specified. It would be possible to use the same procedure as in Algorithm 2.7 and add an additional outer loop over possible values of 휆, but this would lead to an expensive two-dimensional grid search.

154 Instead, we have found the following procedure to give good results empirically, based on our observation that the effects of depth and 휆 on the solution quality are typically independent. First, the depth is tuned using Algorithm 2.7. We then fix this depth, and tune 휆 using Algorithm 2.7 with the loop over depth replaced by a loop over 휆. We take the best value of 휆 and the corresponding value of 훼 as the final tuned hyperparameter values.
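Schematically, this two-stage heuristic looks as follows; validation_r2 stands in for the validation procedure of Algorithm 2.7 and is not an actual interface of the thesis software.

```python
def tune_depth_then_lambda(validation_r2, depths, lambdas, default_lambda=0.01):
    # stage 1: tune the maximum depth with lambda held at a default value
    best_depth = max(depths, key=lambda d: validation_r2(depth=d, lam=default_lambda))
    # stage 2: fix the depth and tune lambda
    best_lambda = max(lambdas, key=lambda l: validation_r2(depth=best_depth, lam=l))
    return best_depth, best_lambda
```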

5.3 Experiments with Synthetic Datasets

In this section, we use experiments with synthetic datasets to examine the performance of Optimal Regression Trees with linear predictions. There are two main goals of these experiments. The first is to expose the trees with linear predictions to data where trees with constant predictions are optimal, and determine whether the trees with linear predictions overfit the data. The second goal is to evaluate the performance of trees with linear predictions on more complicated datasets where the trees with constant predictions do not perform strongly, and see whether the more powerful prediction functions in each leaf lead to better performance on such datasets.

The experimental setup is identical to that in Section 4.4, with the addition of the Optimal Regression Trees with linear predictions, both with (ORT-LH) and without (ORT-L) hyperplane splits. We used Algorithm 2.7 to tune ORT-L and ORT-LH, with outer loops over both the maximum depth and the regularization parameter 휆.

Ground truth trees with constant predictions

Our first aim is to compare the performance of the methods on datasets that have been generated according to a ground truth tree with constant predictions. This will allow us to determine whether permitting more powerful predictive models in the leaves causes the model to overfit the data.

Figure 5-1: Synthetic experiments showing the effect of training set size. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. number of training points; methods: CART, ORT, ORT-L, Random Forest, Boosting, Truth.)

Figure 5-2: Synthetic experiments showing the effect of number of features. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. number of features; methods: CART, ORT, ORT-L, Random Forest, Boosting, Truth.)

The effect of the amount of training data

In the first experiment we measure the impact of increasing amounts of training data on the performance of each method. The results are shown in Figure 5-1. We can see that in all three metrics, ORT and ORT-L perform near-identically, and are both the strongest methods in terms of out-of-sample 푅2.

The effect of dimensionality

The second experiment measures the effect of the number of features in the dataset on the performance of the methods. Figure 5-2 shows the results. Again we see that the performance of ORT and ORT-L is similar, with ORT-L actually performing slightly stronger than ORT in places.

Figure 5-3: Synthetic experiments showing the effect of number of random restarts. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. number of random restarts; methods: CART, ORT, ORT-L, Random Forest, Boosting, Truth.)

Figure 5-4: Synthetic experiments showing the effect of feature noise. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. feature noise (%); methods: CART, ORT, ORT-L, Random Forest, Boosting, Truth.)

The effect of the number of random restarts

The next experiment considers varying the number of random restarts used to train the optimal tree methods (and the number of trees used in the ensemble methods). The results are presented in Figure 5-3. ORT and ORT-L again perform similarly, although ORT-L has a sizable advantage in 푅2 for very small numbers of random restarts, as well as a slightly lower FDR over all values.

Figure 5-5: Synthetic experiments showing the effect of label noise. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. label noise (%); methods: CART, ORT, ORT-L, Random Forest, Boosting, Truth.)

The effect of noise in the training data

The final two experiments show the effect of noise in the training set on the performance of the methods. Figures 5-4 and 5-5 show the results for each method under the addition of increasing noise in the features and labels, respectively. We see in both experiments that the performance of ORT and ORT-L is roughly similar.

Summary for ground truth trees with constant predictions

In all of the experiments using data generated by ground truth trees with constant predictions, we observed that the performance of ORT and ORT-L was near-identical. This gives us strong evidence that despite the increased power of ORT-L, it is not at risk of overfitting data that are described by simpler models. In particular, our validation procedure is able to correctly identify both the correctly-sized tree and the correct degree of regularization in the linear predictions for a given dataset.

Ground truth trees with linear predictions

Next, we consider generating the data from ground truth trees with linear prediction models in the leaves. This will allow us to measure the ability of Optimal Trees with linear predictions to fit more complicated datasets where Optimal Trees with constant predictions perform poorly.

Figure 5-6: Synthetic experiments showing the effect of regression density. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. maximum regression density; methods: CART, ORT, ORT-L, Random Forest, Boosting, Truth.)

Figure 5-6 shows an experiment where we adjust the maximum density of the regression equations in each leaf of the ground truth tree. In each leaf of the ground truth tree, we generate a random linear prediction model with a maximum number of coefficients given by the regression density, and then use these models to generate the labels for the data. In terms of 푅2, we see that ORT-L performs consistently regardless of the regression density, indicating that we are able to identify both the correct tree and the correct amount of regularization in the linear regression models in the leaves. On the other hand, CART and ORT both drop off in performance as the regression density increases and their constant predictions become increasingly poor at modeling the truth in the data. Random forests and boosting also decrease in performance as the regression density increases, but this decrease is much slower than for CART and ORT, indicating that these methods are better able to approximate the linear relationships in the data. All three tree methods perform similarly in terms of TDR, but as we might expect, the FDR of the parallel tree methods increases sharply with the regression density, as more splits are required to model the linear trends in the data, and such splits are not related to the generating ground truth. Conversely, the FDR of ORT-L is constant as the regression density is increased, showing that it can learn the whole truth and nothing but the truth regardless of the complexity of the true linear regression models that generated the data.
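The following sketch shows one way to generate such data, under assumed choices (a fixed depth-2 ground-truth tree with parallel splits, the coefficient ranges, and the noise level) that are illustrative rather than the thesis settings: each leaf is given a random sparse linear model with at most `density` non-zero coefficients, and labels are produced by routing each point to its leaf and applying that leaf's model.

```python
import numpy as np

def make_leaf_model(p, density, rng):
    beta = np.zeros(p)
    support = rng.choice(p, size=rng.integers(1, density + 1), replace=False)
    beta[support] = rng.uniform(-2, 2, size=len(support))
    return beta, rng.uniform(-1, 1)              # (coefficients, intercept)

def generate(n, p, density, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((n, p))
    leaves = [make_leaf_model(p, density, rng) for _ in range(4)]
    # depth-2 ground-truth tree with parallel splits on features 0 and 1
    leaf_id = 2 * (X[:, 0] >= 0.5).astype(int) + (X[:, 1] >= 0.5).astype(int)
    y = np.array([X[i] @ leaves[t][0] + leaves[t][1] for i, t in enumerate(leaf_id)])
    return X, y + noise * rng.standard_normal(n)

X, y = generate(n=1000, p=10, density=3)
```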

Figure 5-7: Synthetic experiments showing the effect of split density. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. maximum split density; methods: CART, ORT, ORT-H, ORT-L, ORT-LH, Random Forest, Boosting, Truth.)

Ground truth trees with linear predictions and hyperplane splits

Finally, we consider generating ground truth trees that have both linear predictions in the leaves and hyperplane splits at the nodes. This allows us to benchmark the performance of ORT-LH and its ability to learn complicated structure in the data.

First, we consider varying the maximum density of the hyperplane splits in the ground truth tree, with the maximum density of the true linear regression equations in each leaf fixed at 5. This gives trees with complicated regression functions and varying complexities of hyperplane splits. The results are shown in Figure 5-7. We can see that when the split density is 1 and the tree has parallel splits only, the results are similar to the prior results for regression density 5 in Figure 5-6. We also see that ORT-LH performs similarly to ORT-L, indicating that ORT-LH is not overfitting despite allowing for hyperplane splits. As the split density increases, the problem becomes more difficult and the performance of all methods drops. However, the two hyperplane methods, ORT-H and ORT-LH, decrease more slowly in performance than the others. As expected, the strongest performing method is ORT-LH, indicating that it is able to effectively learn the structure of the trees with linear predictions and hyperplane splits.

For the second test, we fix the maximum split density to 3 and vary the density of the regression predictions in each leaf. Figure 5-8 shows the results. Again we see that ORT-LH is the strongest performing method, followed by ORT-H. The results are largely the same across all regression densities, which, combined with the previous test, indicates that the primary determinant of performance in this setting is the complexity of the splits in the tree rather than the complexity of the predictions in each leaf.

Figure 5-8: Synthetic experiments showing the effect of regression density. (Panels: out-of-sample 푅2, true discovery rate, and false discovery rate vs. maximum regression density; methods: CART, ORT, ORT-H, ORT-L, ORT-LH, Random Forest, Boosting, Truth.)

Summary

We conclude this section with a summary of the results across all the synthetic experiments conducted. First and most importantly, we saw that the different classes of trees were able to correctly learn the structure of ground truth trees of the same type, meaning that ORT-L could correctly learn when the truth had linear predictions, and ORT-LH performed well when the truth had both linear predictions and hyperplanes. This is clear evidence that our training procedure is able to effectively train trees of the class we are seeking.

Secondly, we found that the increased flexibility of trees with linear predictions and/or hyperplane splits did not lead them to overfit in scenarios where the true model was of a simpler class. For instance, ORT-L did not overfit when the true predictions were constant, and ORT-LH did not overfit when the ground truth tree had just parallel splits and linear predictions. This demonstrates that there is no concern with choosing a model class that is "too powerful" for a given dataset; the only tradeoff is the increased computational time.

Finally, we found that our methods compared favorably to both random forests and boosted trees in all scenarios, particularly as the ground truth models became more complicated.

5.4 Experiments with Real-World Datasets

In this section, we evaluate the performance of Optimal Regression Trees with Linear Predictions, both with and without hyperplane splits, on the same set of real-world datasets as in Section 4.5. This will allow us to evaluate the impact of allowing for the more complicated linear predictions in each leaf of the tree in a variety of practical examples.

Experimental Setup

The experimental setup and the sample of 26 datasets were the same for these experiments as in Section 2.6. We added ORT-L and ORT-LH to the pool of methods with 푅 = 100 random restarts, 퐻 = 5 random hyperplane restarts, and minbucket = 1, with the maximum depth 퐷 and regularization parameter 휆 tuned as described in Section 5.3.

Results

In Figure 5-9 we present the mean out-of-sample 푅2 across the 26 datasets for each method, as a function of the maximum tree depth allowed. Note that compared to Figure 4-17, the only difference is the addition of the ORT-L and ORT-LH methods. We can see that the two new methods perform very strongly. ORT-L has higher performance than boosting at low depths, but is overtaken by boosting as the depth increases. At depth 10, ORT-L is roughly equivalent to random forests and slightly worse than boosted trees. ORT-LH is the strongest performing method at small depths, and at larger depths performs comparably to boosted trees, ahead of both random forests and ORT-L.

Table 5.1: Performance results for regression methods (with maximum depth 10) on each of the 26 datasets. For each method, we show the out-of-sample 푅2 followed by the method's rank on the dataset in parentheses. A rank of 1 indicates the method had the best performance on a dataset and a rank of 7 indicates the method performed worst.

Dataset                  CART        ORT         ORT-H       ORT-L       ORT-LH      RF          Boosting
abalone                  0.460 (6)   0.457 (7)   0.519 (4)   0.545 (2)   0.549 (1)   0.477 (5)   0.536 (3)
ailerons                 0.746 (6)   0.736 (7)   0.806 (5)   0.845 (1)   0.836 (3)   0.832 (4)   0.844 (2)
airfoil-self-noise       0.816 (6)   0.814 (7)   0.863 (5)   0.893 (4)   0.896 (3)   0.910 (2)   0.941 (1)
auto-mpg                 0.771 (5)   0.771 (6)   0.766 (7)   0.845 (3)   0.838 (4)   0.847 (2)   0.849 (1)
automobile               0.581 (6)   0.590 (5)   0.495 (7)   0.688 (3)   0.631 (4)   0.732 (2)   0.736 (1)
cart-artificial          0.948 (3.5) 0.948 (3.5) 0.948 (6)   0.948 (1)   0.948 (2)   0.945 (7)   0.948 (5)
communities-and-crime    0.409 (6)   0.507 (5)   0.390 (7)   0.721 (2)   0.749 (1)   0.702 (3)   0.679 (4)
computer-hardware        0.789 (7)   0.858 (5)   0.797 (6)   0.955 (2)   0.973 (1)   0.929 (3)   0.929 (4)
concrete-compressive     0.453 (6)   0.364 (7)   0.569 (5)   0.862 (2)   0.873 (1)   0.690 (4)   0.792 (3)
concrete-flow            0.208 (7)   0.251 (6)   0.380 (5)   0.461 (2)   0.476 (1)   0.437 (3)   0.396 (4)
concrete-slump           0.172 (7)   0.217 (6)   0.570 (1)   0.260 (5)   0.277 (3)   0.359 (2)   0.272 (4)
cpu-act                  0.970 (6)   0.966 (7)   0.975 (5)   0.985 (2)   0.984 (3)   0.981 (4)   0.985 (1)
cpu-small                0.959 (6)   0.959 (7)   0.967 (5)   0.974 (4)   0.974 (2)   0.974 (3)   0.979 (1)
elevators                0.686 (7)   0.689 (6)   0.874 (4)   0.899 (2)   0.912 (1)   0.815 (5)   0.877 (3)
friedman-artificial      0.846 (7)   0.852 (6)   0.920 (4)   0.954 (2)   0.956 (1)   0.896 (5)   0.952 (3)
housing                  0.705 (7)   0.750 (5)   0.753 (4)   0.748 (6)   0.804 (3)   0.852 (2)   0.867 (1)
hybrid-price             0.512 (5)   0.464 (7)   0.509 (6)   0.620 (2)   0.648 (1)   0.615 (3)   0.577 (4)
kin8nm                   0.442 (7)   0.498 (6)   0.752 (3)   0.736 (4)   0.851 (1)   0.659 (5)   0.776 (2)
lpga-2008                0.594 (7)   0.624 (6)   0.647 (5)   0.863 (1)   0.851 (2)   0.770 (4)   0.775 (3)
lpga-2009                0.802 (6)   0.786 (7)   0.848 (5)   0.880 (4)   0.904 (1)   0.890 (2)   0.885 (3)
parkinsons-motor         0.159 (6)   0.159 (7)   0.241 (4)   0.235 (5)   0.281 (3)   0.333 (2)   0.351 (1)
parkinsons-total         0.144 (7)   0.163 (6)   0.283 (4)   0.249 (5)   0.316 (3)   0.334 (2)   0.361 (1)
vote-for-clinton         0.315 (6)   0.286 (7)   0.354 (5)   0.377 (4)   0.395 (3)   0.408 (1)   0.404 (2)
wine-quality-red         0.286 (6)   0.263 (7)   0.299 (5)   0.340 (3)   0.304 (4)   0.429 (1)   0.394 (2)
wine-quality-white       0.279 (7)   0.282 (6)   0.300 (5)   0.353 (3)   0.349 (4)   0.442 (2)   0.497 (1)
yacht-hydrodynamics      0.988 (7)   0.991 (4)   0.991 (6)   0.992 (3)   0.991 (5)   0.994 (2)   0.996 (1)

Average rank             6.250       6.096       4.923       2.962       2.346       3.077       2.346

Figure 5-9: Mean out-of-sample 푅2 across all 26 datasets for each method (CART, ORT, ORT-H, ORT-L, ORT-LH, Random Forest, Boosting), as a function of the maximum depth of the tree.

Table 5.1 presents the average out-of-sample 푅2 and performance rank for each method on each dataset at the maximum depth of 10, as well as the average rank of each method. We can see that the results for CART, ORT and ORT-H mirror those in Section 4.5: ORT improves slightly upon CART and ORT-H further upon ORT, but all three are still significantly less powerful than random forests. We see that ORT-L has an average rank similar to random forests and ORT-LH performs similarly to boosting.

To quantify the significance of the differences in average rank, Table 5.2 shows the p-values for pairwise comparisons between each pair of methods, appropriately adjusted to account for multiple comparisons as described in Section 3.5. We can see that there are two groups of methods that have statistically indistinguishable performance: CART/ORT and ORT-L/ORT-LH/Random Forest/Boosting. In particular, the two strongest methods, ORT-LH and boosting, have an unadjusted p-value of around 0.98, indicating they are nearly identical in performance.

Together, these results on real-world datasets give evidence that allowing for linear regression models in the leaves of the regression trees provides a significant and practical increase in performance. The methods with linear models in the leaves, ORT-L and ORT-LH, significantly outperform their counterparts with only constant predictions.

Table 5.2: Pairwise significance tests for performance of regression methods (with maximum depth 10) across the 26 datasets. The comparisons are presented in order of significance. For each comparison, the strongest-performing method is stated first in the comparison. The p-values are found using the Wilcoxon signed-rank test, and the adjusted p-values are calculated using the Holm-Bonferroni method to account for multiple comparisons. All results above the dividing line are significant at the 95% level.

Comparison                    p-value     Adjusted p-value
ORT-L vs. CART                ∼ 10^-8     ∼ 10^-6
ORT-LH vs. CART               ∼ 10^-8     ∼ 10^-6
ORT-LH vs. ORT                ∼ 10^-7     ∼ 10^-6
Random Forest vs. CART        ∼ 10^-7     ∼ 10^-6
Random Forest vs. ORT         ∼ 10^-7     ∼ 10^-6
Boosting vs. CART             ∼ 10^-7     ∼ 10^-6
Boosting vs. ORT              ∼ 10^-7     ∼ 10^-6
ORT-L vs. ORT                 ∼ 10^-7     ∼ 10^-6
ORT-LH vs. ORT-H              ∼ 10^-5     0.0003
Boosting vs. ORT-H            ∼ 10^-4     0.0004
ORT-H vs. CART                0.0002      0.0021
ORT-L vs. ORT-H               0.0010      0.0104
ORT-H vs. ORT                 0.0020      0.0177
Random Forest vs. ORT-H       0.0047      0.0375
--------------------------------------------------------
ORT-LH vs. ORT-L              0.0312      0.2185
Boosting vs. Random Forest    0.0842      0.5052
Boosting vs. ORT-L            0.2914      1.0000
ORT vs. CART                  0.3666      1.0000
ORT-LH vs. Random Forest      0.5995      1.0000
Random Forest vs. ORT-L       0.9007      1.0000
Boosting vs. ORT-LH           0.9801      1.0000

Moreover, the performance of these two methods is on par with random forests and boosting, with no statistical evidence of any difference in performance between the four methods.
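As a rough sketch of how such adjusted pairwise comparisons can be computed, the following uses SciPy's Wilcoxon signed-rank test together with a Holm-Bonferroni step-down adjustment; the per-dataset scores shown are illustrative placeholders, not values from Table 5.1.

```python
import numpy as np
from scipy.stats import wilcoxon

def holm_bonferroni(p_values):
    """Step-down Holm-Bonferroni adjustment of a collection of p-values."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1) and enforce monotonicity.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Illustrative usage: paired per-dataset scores for two methods (placeholder values).
r2_method_a = np.array([0.52, 0.81, 0.66, 0.90, 0.43])
r2_method_b = np.array([0.47, 0.80, 0.61, 0.91, 0.39])
stat, p_value = wilcoxon(r2_method_a, r2_method_b)    # paired signed-rank test
adjusted = holm_bonferroni([p_value, 0.02, 0.3])       # adjust a family of comparisons together
```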

5.5 Conclusions

In this chapter, we extended the classical regression tree model to use linear regression models in each leaf of the tree rather than simply using a constant prediction. We used the flexibility of MIO to design a formulation for Optimal Regression Trees with linear predictions that generates a tree with the combination of splits and regularized linear regression models that gives the best training error. This allows us to avoid having to restrict the power of the leaf regression models, and also ensures the regression models are only as complicated as necessary to explain the data well.

We adapted the local search procedure from earlier chapters to efficiently train these trees, using a fast coordinate descent scheme to update the regression models in the leaves as the local search proceeds.

We conducted experiments with both synthetic and real-world datasets to measure the impact of adding linear predictions to the leaves. We found that these trees are able to correctly learn structure in data generated by piecewise-linear models, and moreover that they performed better than random forests and boosting on such datasets. The experiments with real-world datasets provided comprehensive evidence of the lift in performance resulting from the addition of linear predictions to the leaves, and across the sample of datasets, the performance of these regression trees was indistinguishable from random forests and boosting.

Overall, our computational experiments demonstrate that the addition of linear predictions to the leaves of Optimal Regression Trees leads to a significant increase in performance, resulting in methods that are competitive with the state of the art. Together, Chapters 2–5 provide evidence that Optimal Trees are practically relevant for both classification and regression, with our best approaches for each problem type performing comparably to random forests and boosted trees, without sacrificing the

interpretability of a single decision tree.


Chapter 6

Optimal Prescription Trees

The proliferation in volume, quality, and accessibility of highly granular data has enabled decision makers in various domains to seek customized decisions at the individual level. This personalized decision making framework encompasses a multitude of applications. In online advertising, internet companies display advertisements to users based on the user’s search history, demographic information, geographic location, and other data they routinely collect from visitors to their website. Specifically targeting these advertisements by displaying them to appropriate users can maximize their probability of being clicked, and can improve revenue. In personalized medicine, we want to assign different drugs/treatment regimens/dosage levels to different patients depending on their demographics, past diagnosis history and genetic information in order to maximize medical outcomes for patients. By taking into account the heterogeneous responses to different treatments among different patients, personalized medicine aspires to provide individualized, highly effective treatments.

In this chapter, we consider the problem of prescribing the best option from among a set of predefined treatments to a given sample (patient or customer depending on context) as a function of the sample’s features. We have access to observational data of the form {(x푖, 푦푖, 푧푖)}, 푖 = 1, . . . , 푛, which comprises 푛 observations. Each data point (x푖, 푦푖, 푧푖) corresponds to the features x푖 ∈ R^푑 of the 푖th sample, the assigned treatment 푧푖 ∈ [푚], and the corresponding outcome 푦푖 ∈ R. We use 푦(1), . . . , 푦(푚) to denote the 푚 “potential outcomes” resulting from applying each of the 푚 respective treatments.

There are three key challenges for designing personalized prescriptions for each sample as a function of its observed features:

1. While we have observed the outcome of the administered treatment for each sample, we have not observed the counterfactual outcomes, that is, the outcomes that would have occurred had another treatment been administered. Note that if this information were known, then the prescription problem would reduce to a standard multi-class classification problem. We thus need to infer the counterfactual outcomes.

2. The vast majority of the available data is observational in nature as opposed to data from randomized trials. In a randomized trial, different samples are randomly assigned different treatments, while in an observational study, the assignment of treatments potentially, and often, depends on features of the sample. Different samples are thus more or less likely to receive certain treatments and may have different outcomes than others that were offered different treatments. Consequently, our approach needs to take into account the bias inherent in observational data.

3. Especially for personalized medicine, the proposed approach needs to be interpretable, that is, easily understandable by humans. Even in high-speed online advertising, one needs to demonstrate that the approach is fair, appropriate, and does not discriminate against people based on certain features such as race, gender, age, etc. In our view, interpretability is always highly desirable, and a necessity in many contexts.

We seek a function 휏 : R^푑 → [푚] that selects the best treatment 휏(x) out of the 푚 options given the sample features x. In doing so, we need to be both “optimal” and “accurate”. We thus consider two objectives:

1. Assuming that smaller outcomes 푦 are preferable (for example, sugar levels for personalized diabetes management), we want to minimize 퐸[푦(휏(x))], where the expectation is taken over the distribution of outcomes for a given treatment

policy 휏(x). Given that we only have access to the observed data, we rewrite this expectation as

\[
\sum_{i=1}^{n} \Bigl( y_i \,\mathbb{I}[\tau(\mathbf{x}_i) = z_i] \;+\; \sum_{t \neq z_i} \hat{y}_i(t)\,\mathbb{I}[\tau(\mathbf{x}_i) = t] \Bigr), \qquad (6.1)
\]

where 푦ˆ푖(푡) denotes the unknown counterfactual outcome that would be observed if sample 푖 were to be assigned treatment 푡. We refer to the objective function (6.1) as the prescription error.

2. We further want to design the treatment policy 휏(x) so that the counterfactual outcomes are accurately estimated. For this reason, our second objective is to minimize

\[
\sum_{i=1}^{n} \bigl( y_i - \hat{y}_i(z_i) \bigr)^2, \qquad (6.2)
\]

that is, we seek to minimize the squared prediction error for the observed data.

Given our desire for optimality and accuracy, we propose in this chapter to seek a policy 휏(x) that optimizes a convex combination of the two objectives (6.1) and (6.2):

\[
\mu \Biggl[ \sum_{i=1}^{n} \Bigl( y_i \,\mathbb{I}[\tau(\mathbf{x}_i) = z_i] + \sum_{t \neq z_i} \hat{y}_i(t)\,\mathbb{I}[\tau(\mathbf{x}_i) = t] \Bigr) \Biggr] + (1-\mu) \Biggl[ \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i(z_i) \bigr)^2 \Biggr], \qquad (6.3)
\]
where the prescription factor 휇 is a hyperparameter that controls the tradeoff between the prescription error and the prediction error.
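To make this combined objective concrete, the following is a minimal sketch (not the thesis implementation) of how (6.3) could be evaluated for a candidate policy once counterfactual estimates are available; the array layout and function name are illustrative assumptions.

```python
import numpy as np

def prescription_objective(y, z, y_hat, policy, mu):
    """Evaluate the convex combination (6.3) of prescription and prediction error.

    y      : (n,) observed outcomes
    z      : (n,) observed treatments, integers in {0, ..., m-1}
    y_hat  : (n, m) estimated counterfactual outcomes, y_hat[i, t] estimates y_i(t)
    policy : (n,) prescribed treatments tau(x_i)
    mu     : prescription factor in [0, 1]
    """
    n = len(y)
    idx = np.arange(n)
    # Prescription error (6.1): use the observed outcome when the policy agrees
    # with the assigned treatment, and the estimated counterfactual otherwise.
    outcome_under_policy = np.where(policy == z, y, y_hat[idx, policy])
    prescription_error = outcome_under_policy.sum()
    # Prediction error (6.2): squared error of the estimates on the observed data.
    prediction_error = ((y - y_hat[idx, z]) ** 2).sum()
    return mu * prescription_error + (1 - mu) * prediction_error
```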

Related Literature

In this section, we present some related approaches to personalization in the literature and how they relate to our work. We present some methodological papers by researchers in statistics and operations research, followed by a few papers in the medical literature.

Learning the outcome function for each treatment

A common approach in the literature is to estimate each sample’s outcome under each treatment, and recommend the treatment that predicts the best prognosis for that sample. Formally, this is equivalent to estimating the conditional expectation E[푌 | 푍 = 푡, 푋 = 푥] for each 푡 ∈ [푚], and assigning to each sample the treatment with the lowest predicted outcome. For instance, these conditional means could be estimated by separately regressing the outcomes against the covariates of the samples that received treatment 푡. This approach has been followed historically by several authors in clinical research (for example [45]), and more recently by researchers in statistics [102] and operations research [19]. The online version of this problem, called the contextual bandit problem, has been studied by several authors [75, 53, 6] in the multi-armed bandit literature [52]. These papers use variants of linear regression to estimate the outcome function for each arm while ensuring sufficient exploration, and pick the best treatment based on the 푚 predictions for a given sample.

In the context of personalized diabetes management, Bertsimas et al. [19] use carefully constructed 푘-nearest neighbors to estimate the counterfactuals, and prescribe the treatment option with the best predicted outcome if the expected improvement (over the status quo) exceeds a threshold 훿. The parameters 푘 and 훿 used as part of this approach are themselves learned from the data.

More generally, in the fields of operations research and management science, Bertsimas and Kallus [11] consider the problem of prescribing optimal decisions by directly learning from data. In this work, they adapt powerful machine learning methods and encode them within an optimization framework to solve a wide range of decision problems. In the context of revenue management and pricing, Bertsimas and Kallus [12] consider the problem of prescribing the optimal price by learning from historical demand and other side information, but taking into account that the demand data is observational. Specifically, historical demand data is available only for the observed price and is missing for the remaining price levels.

Effectively, regress-and-compare approaches inherently encode a personalization framework that consists of a (shallow) decision tree of depth one. To see this, consider a problem with 푚 arms, where this approach involves estimating functions 푓푖 for computing the outcomes of samples that received arm 푖, for each 1 ≤ 푖 ≤ 푚. This prescription mechanism can be represented as splitting the feature space into 푚 leaves, with the first leaf containing all the subjects who are recommended arm 1, and so on. The 푖-th leaf is given by the region {푥 ∈ R^푑 : 푓푖(푥) < 푓푗(푥) ∀푗 ̸= 푖, 1 ≤ 푗 ≤ 푚}. However, the individual functions 푓푖 can be highly nonlinear, which hurts interpretability. Additionally, using only the samples who were administered arm 푖 to compute each 푓푖 means that each of these computations uses only a subset of the training data, and that the 푓푖 do not interact with each other during learning, which can potentially lead to less effective decision rules.
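As a concrete illustration of this regress-and-compare structure, here is a minimal sketch; the use of scikit-learn random forest regressors is an assumption made purely for illustration, not a method prescribed by the text above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def regress_and_compare_fit(X, y, z, m):
    """Fit one outcome model f_t per arm, each using only the samples that received arm t."""
    models = []
    for t in range(m):
        mask = (z == t)
        models.append(RandomForestRegressor(n_estimators=100).fit(X[mask], y[mask]))
    return models

def regress_and_compare_prescribe(models, X_new):
    """Prescribe the arm whose fitted model predicts the lowest (best) outcome."""
    preds = np.column_stack([f.predict(X_new) for f in models])
    return preds.argmin(axis=1)
```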

Statistical learning based approaches

Another relatively recent approach involves reformulating this problem as a weighted multi-class classification problem based on imputed propensity scores, and using off-the-shelf methods/solvers available for such problems. Propensity scores are defined as the conditional probability of a sample receiving a particular treatment given his/her features [107]. Clearly, for a two-arm randomized control trial, these values are 0.5 for each sample. For two-armed studies where these scores are known, Zhou et al. [123] propose a weighted SVM-based approach to learn a classifier that prescribes one of the two treatment options. However, this analysis is restricted to settings where these scores are perfectly known and predefined in the trial design, e.g., randomized clinical trials (propensities are constant) or stratified designs (where the dependence of the treatment assignment on the covariates is known a priori). In observational studies, these probabilities are typically not known, and hence are usually estimated via maximum likelihood estimation. However, there are multiple proposed methods for estimating these scores, e.g., using machine learning [119] or primarily as covariate balancing [67], and the choice of method is not clear a priori. Once these probabilities are known or estimated, the average outcome is computed using approaches based on the inverse probability of treatment weighting estimator.

This involves multiplying the observed outcome by the inverse of the propensity score (this approach is also referred to as importance/rejection sampling in the machine learning literature). While this method has desirable asymptotic properties and low bias, dividing the outcome by the estimated probabilities may lead to unstable, high-variance estimates for small samples.
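As a rough sketch of the inverse probability weighting idea described above (assuming propensity estimates are already available; this is a generic estimator of the mean outcome under a policy, not the specific estimator of any one cited paper):

```python
import numpy as np

def ipw_policy_value(y, z, policy, propensity):
    """Inverse-probability-weighted estimate of the mean outcome under a policy.

    y          : (n,) observed outcomes
    z          : (n,) assigned treatments
    policy     : (n,) treatments the policy would prescribe
    propensity : (n,) estimated P(Z = z_i | X = x_i)
    Only samples whose assigned treatment matches the policy contribute,
    reweighted by the inverse of their estimated propensity.
    """
    match = (policy == z).astype(float)
    return np.mean(match * y / propensity)
```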

Tree based approaches

Continuing in the spirit of adapting machine learning approaches, Kallus [70] proposes personalization trees (and forests), which adapt regular classification trees [25] to directly optimize the prescription error. The key differences from our approach are that we modify our objective to account for the prediction error, and use the methodology introduced in Chapters 2–5 to design near-optimal trees, which improves performance significantly. Athey and Imbens [3] and Wager and Athey [118] also use a recursive splitting procedure of the feature space to construct causal trees and causal forests, respectively, which estimate the causal effect of a treatment for a given sample, or construct confidence intervals for the treatment effects, but not explicit prescriptions or recommendations, which is the main focus of this chapter. Also, causal trees (or forests) are designed exclusively for studies comparing binary treatments. Additional methods that build on causal forests are proposed in the recent work of Powers et al. [101], who develop nonlinear methods, e.g., causal MARS (Multivariate Additive Regression Splines), designed to provide better estimates of the personalized average treatment effect in high-dimensional problems. The causal MARS approach uses nonlinear functions, which are added to the basis in a greedy manner, as regressors for predicting outcomes via linear regression for each arm, but uses a common set of basis functions for both arms.

One of the advantages of these recent approaches (weighted classification or tree-based methods) over regress-and-compare approaches is that they use all of the training data rather than breaking down the problem into 푚 subproblems (where 푚 is the number of arms), each using a separate subset of the data. This key modification increases the efficiency of learning, which results in better estimates of

personalized treatment effects for smaller sizes of the training data.

Personalization in medicine

Heterogeneity in patient response and the potential benefits of personalized medicine have also been discussed in the medical literature. As an illustration of heterogeneity in responses, a certain drug that works for a majority of individuals may not be appropriate for other subsets of patients; e.g., in general, older patients tend to have poor outcomes independent of any treatment [79]. In an example of breast cancer, Gort et al. [54] find that even when patients receive identical treatments, heterogeneity of the disease at the molecular level may lead to varying clinical outcomes. Thus, personalized medicine can be thought of as a framework for utilizing all this information, past data, and patient-level characteristics to develop a rule that assigns treatments best suited for each patient. These treatment rules have provided high-quality recommendations, e.g., in cystic fibrosis [46] and mental illness [68], and can potentially lead to significant improvements in health outcomes and reduce costs.

Contributions

We propose an approach that generalizes the methods for predictive trees in Chapters 2–5 to create prescriptive trees that are interpretable, highly scalable, generalizable to multiple treatments, and either outperform out of sample or are comparable with several state-of-the-art methods on synthetic and real-world data. Specifically, the key characteristics of our method include:

∙ Interpretability: Decision trees are highly interpretable. The trees produced by our approach provide intuition on the important features that lead to a sample being assigned a particular treatment.

∙ Scalability: Similarly to predictive trees, prescriptive trees scale to problems with 푛 in the 100,000s and 푑 in the 10,000s in seconds when they use constant predictions in the leaves and in minutes when they use a linear model.

∙ Generalizable to multiple treatments: Prescriptive trees can be applied with multiple treatments. An important desired characteristic of a prescriptive algorithm is its generalizability to handle the case of more than two possible arms. As an example, a recent review by Baron et al. [5] found that almost 18% of published randomized control trials (RCTs) in 2009 were multi-arm clinical trials, where more than two new treatments are tested simultaneously. Multi-arm trials are attractive as they can greatly improve efficiency compared to traditional two-arm RCTs by reducing costs, speeding up recruitment of participants, and most importantly, increasing the chances of finding an effective treatment [98]. On the other hand, two-arm trials can force the investigator to make a potentially incorrect series of decisions on treatment, dose or assessment duration [98]. Rapid advances in technology have resulted in almost all diseases having multiple drugs at the same stage of clinical development, e.g., 771 drugs for various kinds of cancer are currently in the clinical pipeline [33]. This emphasizes the importance of methods that can handle trials with more than two treatment options.

∙ Highly effective prescriptions: In a series of experiments with real and synthetic data, we demonstrate that prescriptive trees either outperform or are comparable out of sample with several state-of-the-art methods.

Given their combination of interpretability, scalability, generalizability and performance, it is our belief that prescriptive trees are an attractive alternative for personalized decision making.

In this chapter, we describe the methodology for creating optimal prescriptive trees (OPT) and present improvements to this OPT methodology using improved counterfactual estimates. We also provide evidence of the benefits of this method with the help of synthetic and real world examples.

6.1 Optimal Prescriptive Trees

In this section, we motivate and present the OPT algorithm, which trains prescriptive trees to directly minimize the objective presented in Problem (6.3) using a decision rule that takes the form of a prescriptive tree; that is, a decision tree that prescribes a common treatment to all samples assigned to the same leaf. Our approach is to estimate the counterfactual outcomes using this prescriptive tree during the training process, and therefore jointly optimize the prescription and the prediction error.

Optimal Prescriptive Trees with Constant Predictions

Observe that a decision tree divides the training data into neighborhoods where the samples are similar. We propose using these neighborhoods as the basis for our counterfactual estimation. More concretely, we will estimate the counterfactual 푦ˆ푖(푡) using the outcomes 푦푗 for all samples 푗 with 푧푗 = 푡 that fall into the same leaf of the tree as sample 푖. An immediate method for estimation is to simply use the mean outcome of the relevant samples in this neighborhood, giving the following expression for 푦ˆ푖(푡):
\[
\hat{y}_i(t) = \frac{1}{|\{j : \mathbf{x}_j \in \mathcal{X}_{l(i)},\, z_j = t\}|} \sum_{j :\, \mathbf{x}_j \in \mathcal{X}_{l(i)},\, z_j = t} y_j, \qquad (6.4)
\]
where \(\mathcal{X}_{l(i)}\) is the leaf of the prescription tree into which x푖 falls. Substituting this back into Problem (6.3), we want to find a prescriptive tree 휏 that solves the following problem:
\[
\min_{\tau(\cdot)} \;\; \mu \Biggl[ \sum_{i=1}^{n} \Bigl( y_i\,\mathbb{I}[\tau(\mathbf{x}_i) = z_i] + \sum_{t \neq z_i} \frac{\sum_{j :\, \mathbf{x}_j \in \mathcal{X}_{l(i)},\, z_j = t} y_j}{|\{j : \mathbf{x}_j \in \mathcal{X}_{l(i)},\, z_j = t\}|}\, \mathbb{I}[\tau(\mathbf{x}_i) = t] \Bigr) \Biggr]
+ (1-\mu) \Biggl[ \sum_{i=1}^{n} \Bigl( y_i - \frac{1}{|\{j : \mathbf{x}_j \in \mathcal{X}_{l(i)},\, z_j = z_i\}|} \sum_{j :\, \mathbf{x}_j \in \mathcal{X}_{l(i)},\, z_j = z_i} y_j \Bigr)^{\!2} \Biggr]. \qquad (6.5)
\]
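A minimal sketch of the leaf-level computation implied by (6.4) and (6.5) follows: counterfactuals are estimated as within-leaf treatment means, and the prescribed treatment for a leaf is the one with the lowest total (observed or estimated) outcome, anticipating the training procedure described below. The function names and the handling of treatments with no samples in the leaf are illustrative assumptions.

```python
import numpy as np

def leaf_counterfactual_means(y_leaf, z_leaf, m):
    """Estimate y_hat(t) in a leaf as the mean outcome of its samples that received t (Eq. 6.4).

    Treatments with no samples in the leaf get an infinite estimate so they are never prescribed.
    """
    return np.array([y_leaf[z_leaf == t].mean() if np.any(z_leaf == t) else np.inf
                     for t in range(m)])

def leaf_best_treatment(y_leaf, z_leaf, m):
    """Prescribe the treatment with the lowest total (observed or estimated) outcome in the leaf."""
    means = leaf_counterfactual_means(y_leaf, z_leaf, m)
    totals = np.array([
        # Observed outcome for samples that received t, leaf-mean estimate otherwise.
        np.where(z_leaf == t, y_leaf, means[t]).sum()
        for t in range(m)
    ])
    return int(totals.argmin())
```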

We note that when 휇 = 1, we obtain the same objective function as Kallus [70], which means this objective is an unbiased and consistent estimator for the prescription error. We note that in this work they attempted to solve Problem (6.5) to global optimality using a MIO formulation based on an earlier version of Optimal Trees [10]. This approach did not scale beyond shallow trees and small datasets, and so they resorted to using a greedy CART-like heuristic to solve the problem instead. The approach we describe, using the local search approach presented in earlier chapters of the thesis, is practical and scales to large datasets in tractable times. When 휇 = 0, we obtain the objective function in Bertsimas and Dunn [10] that emphasizes prediction.

Figure 6-1: Test prediction and personalization error as a function of 휇.

Empirically, when 휇 = 1, we have observed that the resulting prescriptive trees lead to a high predictive error and an optimistic estimate of the prescriptive error that is not supported in out of sample experiments. Allowing 휇 to vary ensures that the tree predictions lead to a significant improvement of the out of sample predictive and prescriptive error.

To illustrate this observation, Figure 6-1 shows the average prediction and prescription errors as a function of 휇 for one of the synthetic experiments we conduct in Section 6.2. We see that using 휇 = 1 leads to very high prediction errors, as the prescriptions are learned without making sure the predicted outcomes are close to reality. More interestingly, we see that the best prescription error is not achieved at 휇 = 1. Instead, varying 휇 leads to improved prescription error, and for this particular example the lowest error is attained for 휇 in the range 0.5–0.8. This gives clear evidence that our choice of objective function is crucial for delivering better prescriptive

trees.

Training Prescriptive Trees

We apply the Optimal Trees framework to solve Problem (6.5) and find Optimal Prescriptive Trees. The core of the algorithm remains as described in Sections 2.3 and 3.2 for parallel and hyperplane splits, respectively, and we set Problem (6.5) as the loss function to be optimized in Problem (2.25). When evaluating the loss at each step of the coordinate descent, we calculate the estimates of the counterfactuals by finding the mean outcome for each treatment in each leaf among the samples in that leaf that received that treatment, using Equation (6.4). We determine the best treatment to assign at each leaf by summing up the outcomes (observed or counterfactual as appropriate) of all samples for each treatment, and then selecting the treatment with the lowest total outcome in the leaf. Finally, we calculate the two terms of the objective using the means and best treatments in each leaf, and add these terms with the appropriate weighting to calculate the total objective value.

In addition to the existing hyperparameters (푛min, 퐷max, 훼) we introduce the following new hyperparameters:

∙ 푛treatment: the minimum number of samples of a treatment 푡 we need at a leaf before we are allowed to prescribe treatment 푡 for that leaf. This is to avoid using counterfactual estimates that are derived from relatively few samples;

∙ 휇: the prescription factor that controls the tradeoff between prescription and prediction errors in the objective function.

Optimal Prescriptive Trees with Linear Predictions

The approach detailed above trains prescriptive trees by using the mean treatment outcomes in each leaf as the counterfactual estimates for the other samples in that leaf. There is nothing special about our choice to use the mean outcome other than ease of computation, and it seems intuitive that a better predictive model for the counterfactual estimates could lead to a better final prescriptive tree. In this section,

179 we propose using linear regression methods as the basis for counterfactual estimation inside the OPT framework.

Traditionally, tree-based models have eschewed linear regression models in the leaves due to the prohibitive cost of repeatedly fitting linear regression models during the training process, and instead have preferred to use simpler methods such as predicting the mean outcome in the leaf. However, as seen in Chapter 5, it is possible to train regression trees with linear regression models in each leaf by exploiting fast updates and coordinate descent to minimize the computational cost of fitting these models repeatedly. This provides a practical and tractable way of generating interpretable regression trees with more sophisticated prediction functions in each leaf.

We propose using this approach for fitting linear regression models from Chapter 5 for the estimation of counterfactuals in our prescriptive trees. To do this, in each leaf we fit a linear regression model for each treatment, using only the samples in that leaf that received the corresponding treatment. We will then use these linear regression models to estimate the counterfactuals for each sample/treatment pair as necessary, before proceeding to determine the best treatment overall in the leaf using the same approach as before.

Concretely, in each leaf ℓ of the tree we fit an elastic net model for each treatment 푡 using the relevant points in the leaf, {푖 : x푖 ∈ 풳ℓ, 푧푖 = 푡}, to obtain regression coefficients \(\beta_\ell^t\):
\[
\min_{\beta_\ell^t} \;\; \frac{1}{2\,|\{i : \mathbf{x}_i \in \mathcal{X}_\ell,\, z_i = t\}|} \sum_{i :\, \mathbf{x}_i \in \mathcal{X}_\ell,\, z_i = t} \bigl( y_i - (\beta_\ell^t)^T \mathbf{x}_i \bigr)^2 + \lambda\, P_\alpha(\beta_\ell^t), \qquad (6.6)
\]
where
\[
P_\alpha(\beta) = (1-\alpha)\,\tfrac{1}{2}\,\|\beta\|_2^2 + \alpha\,\|\beta\|_1. \qquad (6.7)
\]
We proceed to estimate the counterfactuals with the following equation:
\[
\hat{y}_i(t) = (\beta_{\ell(i)}^t)^T \mathbf{x}_i, \qquad (6.8)
\]
where ℓ(푖) is the leaf containing sample 푖. The overall objective function is therefore
\[
\min_{\tau(\cdot),\, \beta} \;\; \mu \Biggl[ \sum_{i=1}^{n} \Bigl( y_i\,\mathbb{I}[\tau(\mathbf{x}_i) = z_i] + \sum_{t \neq z_i} (\beta_{\ell(i)}^t)^T \mathbf{x}_i\, \mathbb{I}[\tau(\mathbf{x}_i) = t] \Bigr) \Biggr]
+ (1-\mu) \Biggl[ \sum_{i=1}^{n} \bigl( y_i - (\beta_{\ell(i)}^{z_i})^T \mathbf{x}_i \bigr)^2 + \lambda \sum_{t=1}^{m} \sum_{\ell} P_\alpha(\beta_\ell^t) \Biggr], \qquad (6.9)
\]
where the regression models 훽 are found by solving the elastic net problems (6.6) defined by the prescriptive tree. Note that we have included the elastic net penalty in the prediction accuracy term, mirroring the structure of the elastic net problem itself. This is so that our objective accounts for overfitting the 훽 coefficients in the same way as standard regression. We solve this problem using the approach from Section 5.2, modified to fit a regression model for each treatment in each leaf, rather than just a single regression model per leaf.

There are two additional hyperparameters in this model over OPT, namely the degree of regularization in the elastic net 휆 and the parameter 훼 controlling the tradeoff between the ℓ1 and ℓ2 norms in (6.7). We have found that we obtain strong results using only the ℓ1 norm, and so this is what we use in all experiments. We select the regularization parameter 휆 through validation.
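As a rough sketch of the linear counterfactual estimation within a single leaf, the following fits one ℓ1-regularized linear model per treatment, using scikit-learn's Lasso as a stand-in for the coordinate descent solver of Chapter 5; the inclusion of an intercept and the function names are simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_leaf_models(X_leaf, y_leaf, z_leaf, m, lam):
    """Fit one l1-regularized linear model per treatment, using only the samples
    in this leaf that received that treatment (cf. Problem (6.6) with alpha = 1)."""
    models = {}
    for t in range(m):
        mask = (z_leaf == t)
        if mask.sum() > 0:
            # scikit-learn's `alpha` plays the role of the regularization weight lambda here.
            models[t] = Lasso(alpha=lam).fit(X_leaf[mask], y_leaf[mask])
    return models

def estimate_counterfactuals(models, X_leaf, m):
    """Estimate y_hat_i(t) = (beta_t)^T x_i for every sample in the leaf (cf. Eq. 6.8)."""
    n = X_leaf.shape[0]
    y_hat = np.full((n, m), np.nan)   # treatments with no model in this leaf stay NaN
    for t, model in models.items():
        y_hat[:, t] = model.predict(X_leaf)
    return y_hat
```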

6.2 Experiments with Synthetic Datasets

In this section, we design simulations on synthetic datasets to evaluate and compare the performance of our proposed methods with other approaches. Since the data set is simulated, the counterfactuals are fully known, which enables us to compare with the ground truth. In the remainder of this section, we present our motivation behind these experiments, describe the data generating process and the methods we compare, followed by computational results and conclusions.

Motivation

The general motivation of these experiments is to investigate the performance of the OPT method for various choices of synthetic data. Specifically, as part of these experiments, we seek to answer the following questions.

1. How well does each method prescribe, i.e., compute the decision boundary {x ∈ R^푑 : 푦0(x) = 푦1(x)}?

2. How accurate are the predicted outcomes?

Experimental Setup

Our experimental setup is motivated by that of Powers et al. [101]. We generate 푛 data points x푖, 푖 = 1, . . . , 푛, where each x푖 ∈ R^푑. Each x푖 is generated i.i.d. such that the odd-numbered coordinates 푗 are sampled from 푥푖푗 ∼ Normal(0, 1), while the even-numbered coordinates 푗 are sampled from 푥푖푗 ∼ Bernoulli(0.5).

Next, we simulate the observed outcomes under each treatment. We restrict the scope of these simulations to two treatments (0 and 1) so that we can include in our comparison those methods that only support two treatments. For each experiment, we define a baseline function that gives the base outcome for each observation and an effect function that models the effect of the treatment being applied. Both of these are functions of the covariates x, centered and scaled to have zero mean and unit variance. The outcome 푦푡 under each treatment 푡 as a function of x is given by
\[
y_0(\mathbf{x}) = \mathrm{baseline}(\mathbf{x}) - \tfrac{1}{2}\,\mathrm{effect}(\mathbf{x}), \qquad
y_1(\mathbf{x}) = \mathrm{baseline}(\mathbf{x}) + \tfrac{1}{2}\,\mathrm{effect}(\mathbf{x}).
\]

Finally, we assign treatments to each observation. In order to simulate an observational study, we assign treatments based on the outcomes for each treatment so that treatment 1 is typically assigned to observations with a large outcome under treatment 0, which are likely to realize a greater benefit from this prescription. Concretely, we assign treatment 1 with the following probability:

\[
\mathbb{P}(Z = 1 \mid \mathbf{X} = \mathbf{x}) = \frac{e^{y_0(\mathbf{x})}}{1 + e^{y_0(\mathbf{x})}}.
\]

In the training set, we add noise 휖푖 ∼ Normal(0, 휎²) to the outcomes 푦푖 corresponding to the selected treatment. We consider three different experiments with different forms for the baseline and effect functions and differing levels of noise:

1. The first experiment has low noise, 휎 = 0.1, a linear baseline function, and a piecewise constant effect function:

\[
\mathrm{baseline}(\mathbf{x}) = x_1 + x_3 + x_5 + x_7 + x_8 + x_9 - 2, \qquad
\mathrm{effect}(\mathbf{x}) = 5\,\mathbb{1}(x_1 > 1) - 5.
\]

2. The second experiment has moderate noise, 휎 = 0.2, a constant baseline function, and a piecewise linear effect function:
\[
\mathrm{baseline}(\mathbf{x}) = 0, \qquad
\mathrm{effect}(\mathbf{x}) = 4\,\mathbb{1}(x_1 > 1)\,\mathbb{1}(x_3 > 0) + 4\,\mathbb{1}(x_5 > 1)\,\mathbb{1}(x_7 > 0) + 2\,x_8 x_9.
\]

3. The third experiment has high noise, 휎 = 0.5, a piecewise constant baseline function, and a quadratic effect function:

\[
\mathrm{baseline}(\mathbf{x}) = 5\,\mathbb{1}(x_1 > 1) - 5, \qquad
\mathrm{effect}(\mathbf{x}) = \tfrac{1}{2}\bigl( x_1^2 + x_2 + x_3^2 + x_4 + x_5^2 + x_6 + x_7^2 + x_8 + x_9^2 - 11 \bigr).
\]
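A minimal sketch of this data-generating process, instantiated with Experiment 1's baseline and effect functions (the centering and scaling of the baseline and effect functions to zero mean and unit variance is omitted here for brevity):

```python
import numpy as np

def generate_experiment_1(n=1000, d=20, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Odd coordinates (1-indexed) are standard normal, even coordinates are Bernoulli(0.5).
    X = np.empty((n, d))
    X[:, 0::2] = rng.normal(size=(n, d // 2 + d % 2))
    X[:, 1::2] = rng.binomial(1, 0.5, size=(n, d // 2))
    baseline = X[:, 0] + X[:, 2] + X[:, 4] + X[:, 6] + X[:, 7] + X[:, 8] - 2
    effect = 5 * (X[:, 0] > 1) - 5
    y0 = baseline - 0.5 * effect
    y1 = baseline + 0.5 * effect
    # Observational assignment: treatment 1 is more likely when the outcome under 0 is large.
    p1 = np.exp(y0) / (1 + np.exp(y0))
    z = rng.binomial(1, p1)
    # Observed (noisy) outcome corresponds to the assigned treatment.
    y = np.where(z == 1, y1, y0) + rng.normal(0, sigma, size=n)
    return X, y, z, y0, y1
```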

For each experiment, we generate training data with 푛 = 1,000 and 푑 = 20 as described above. We also generate a test set with 푛 = 60,000 using the same process, without adding noise. In the test set, we know the true outcome for each observation under each treatment, so we can identify the correct prescription for each observation. For each method, we train a model using the training set, and then use the model

to make prescriptions on the test set. We consider the following metrics for evaluating the quality of prescriptions:

∙ Treatment Accuracy: the proportion of the test set where the prescriptions are correct;

∙ Effect Accuracy: the 푅2 of the predicted effects, 푦(1)−푦(0), made by the model for each observation in the test set, compared against the true effect for each observation.

We run 100 simulations for each experiment and report the average values of treatment and effect accuracy on the test set.
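A minimal sketch of these two metrics, assuming (as above) that lower outcomes are better, so the correct prescription is the arm with the lower true outcome:

```python
import numpy as np

def treatment_accuracy(prescriptions, y0_true, y1_true):
    """Fraction of test samples where the prescribed arm matches the true best arm."""
    best = (y1_true < y0_true).astype(int)   # arm 1 is correct when its true outcome is lower
    return np.mean(prescriptions == best)

def effect_accuracy(effect_pred, y0_true, y1_true):
    """R^2 of the predicted effects y(1) - y(0) against the true effects."""
    effect_true = y1_true - y0_true
    ss_res = np.sum((effect_true - effect_pred) ** 2)
    ss_tot = np.sum((effect_true - effect_true.mean()) ** 2)
    return 1 - ss_res / ss_tot   # can be negative for very poor predictions
```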

Methods

We compare the following methods:

∙ Prescription Trees: We include four prescriptive tree approaches:

– Personalization trees, denoted PT (recall that these are the same as OPT with 휇 = 1 but trained with a greedy approach);

– OPT with 휇 = 1 and 휇 = 0.5, denoted OPT(1) and OPT(0.5), respectively;

– OPT with 휇 = 0.5 and with linear counterfactual estimation in each leaf, denoted OPT(0.5)-L.

∙ Regress-and-compare: We include three regress-and-compare approaches where the underlying regression uses either Optimal Regression Trees (ORT), LASSO regression or random forests, denoted RC–ORT, RC–LASSO and RC–RF, respectively. For each sample in the test set, we prescribe the treatment that leads to the lowest predicted outcome.

∙ Causal Methods: We include the method of causal forests [118] with the default parameters. While causal forests are intended to estimate the individual treatment effect, we use the sign of the estimated individual treatment effect to determine the choice of treatment. Specifically, we prescribe treatment 1 if the estimated treatment effect for that sample is negative, and 0 otherwise.

Figure 6-2: Performance results for Experiment 1, showing treatment accuracy and effect accuracy for each method (PT, OPT(1), OPT(0.5), OPT(0.5)-L, RC−ORT, RC−LASSO, RC−RF, CF).

We also tested causal MARS on all examples, but it performed similarly to causal forests, and hence was omitted from the results for brevity.

Notice that causal forests and OPTs are joint learning methods—the training data for these approaches is the whole sample that includes both the treatment and control groups, as opposed to regress-and-compare methods which split the data and develop separate models for observations with 푧 = 0 and 푧 = 1.

Results

Figure 6-2 shows the results for Experiment 1. In this experiment, the boundary function is piecewise constant and the individual outcome functions are both piecewise linear. The true decision boundary is 푥1 = 1, and the regions {푥1 > 1} and

{푥1 ≤ 1} each have constant treatment effect. The true response function in each of these regions is linear. OPT(0.5)-L outperforms all three regress-and-compare approaches and causal forests (CF) in both treatment and effect accuracy. There is a significant improvement from OPT(0.5) to OPT(0.5)-L with the addition of linear regression in the leaves, which is unsurprising as this models exactly the truth in the data. The poorest performing method is the greedy PT, which has both low treatment accuracy and very poor effect accuracy (note that the out-of-sample 푅2

Figure 6-3: Tree constructed by OPT(0.5)-L for an instance of Experiment 1 (a single split on 푥1 < 1.0011, prescribing treatment 1 on the True branch and treatment 0 on the False branch).

Figure 6-4: Performance results for Experiment 2, showing treatment accuracy and effect accuracy for each method.

can be negative). OPT(1) improves slightly in the treatment accuracy, but the effect accuracy is still poor. OPT(0.5) shows a large improvement in both the treatment and effect accuracies over PT and OPT(1), which demonstrates the importance of considering both the prescriptive and predictive components with the prescriptive factor 휇 in the objective function.

Figure 6-3 shows the tree for one of the instances of Experiment 1 under OPT(0.5)-L. Recall, the boundary function for this experiment was simply 푥1 = 1, which is correctly identified by the tree. This particular tree has a treatment accuracy of 0.99, reflecting the accuracy of the boundary function, and an effect accuracy of 0.90, showing that the linear regressions within each leaf provide high quality estimates of the outcomes for both treatments.

The results for Experiment 2 are shown in Figure 6-4. This experiment has a piecewise linear boundary with piecewise linear individual outcome functions, with moderate noise. OPT(0.5)-L is again the strongest performing method in both treatment and effect accuracies, followed by OPT(0.5) and Causal Forests. All prescriptive tree methods have good treatment accuracy, showing that these tree models are able

to effectively learn the indicator terms in the outcome functions of both arms. We again see that OPT(0.5) and OPT(0.5)-L improve upon the other tree methods, particularly in effect accuracy, as a consequence of incorporating the predictive term in the objective. The linear trends in the outcome functions of this experiment are not as strong as in Experiment 1, and so the improvement of OPT(0.5)-L over OPT(0.5) is not as large as before.

We observe that the joint learning methods perform better than the regress-and-compare methods in this example even though the outcome functions for the two treatment options do not have a common component (the baseline function is 0). We believe this is because both of the joint methods included here, causal forests and prescriptive trees, can learn local effects effectively. Note that the structure of the boundary function is such that the function is either constant or linear in different buckets.

We plot the tree from OPT(0.5)-L for an instance of this experiment in Figure 6-5. This particular tree has a treatment accuracy of 0.925, which indicates that it has learned the decision boundary effectively, along with an effect accuracy of 0.82. We make the following observations from this tree.

1. Recall that the true boundary function for this experiment only includes the variables 푥1, 푥3, 푥5, 푥7, 푥8, and 푥9, and none of the remaining variables from 푥2 to 푥20. We see that the tree does not include any of these remaining variables either, i.e., it has a zero false positive rate.

2. By inspecting the splits on the variables 푥1, 푥3, 푥5 and 푥7, we note that the tree has learned thresholds close to 0 for 푥3 and 푥7, and close to 1 for 푥1 and 푥5, which matches the ground truth for these variables.

3. Examining the tree more closely, we see that the prescriptions reflect the reality of which outcome is best. For example, when 푥1 ≥ 0.9971 and 푥3 ≥ −0.0123, the tree prescribes 0. This corresponds to the ground truth of the 4 𝟙(푥1 > 1) 𝟙(푥3 > 0) term becoming active, which makes it likely that treatment 1 leads to larger (worse) outcomes. We also see that the linear component in the outcome functions is reflected in the tree, as the tree assigns treatment 0 when

Figure 6-5: Tree constructed by OPT(0.5)-L for an instance of Experiment 2 (the tree splits on 푥1 < 0.9971 at the root, with further splits on 푥5 < 1.0005, 푥3 < −0.0123, 푥8 = 0, 푥7 < 0.0008, 푥9 < 0.7220, 푥9 < 0.6408, and 푥9 < 1.7479, prescribing treatment 0 or 1 in each leaf).

Figure 6-6: Performance results for Experiment 3, showing treatment accuracy and effect accuracy for each method.

푥9 is larger, which corresponds to the linear term in the outcome function being large.

4. Finally, we note that the tree has a split where both the terminal leaves prescribe the same treatment, which can initially seem odd. However, recall that the objective term contains both prescription and prediction errors, and a split like this can improve the prediction term in the objective, and hence the overall objective value, even though none of the prescriptions are changed.

Finally, Figure 6-6 shows the results from Experiment 3. This experiment has high noise and a nonlinear quadratic boundary. Overall, regress-and-compare random forest and causal forest are the best-performing methods, followed closely by OPT(0.5)-L, demonstrating that all three methods are capable of learning complicated nonlinear relationships, both in the outcome functions and in the decision boundary. The treatment accuracy is comparable for all prescriptive tree methods, but PT and OPT(1) have very poor effect accuracy. This again demonstrates the importance of controlling for the prediction error in the objective. In this experiment, we see that regress-and-compare random forests perform comparably to causal forests, which was not the case for the other two experiments. We believe that this is because the baseline function is relatively simple compared to the effect function, which leads to the absence of strong common trends within the two treatment outcome functions. This could make it more difficult to effectively learn from both groups jointly, mitigating the benefits of combining the groups in training.

Consequently, in this setting regress-and-compare methods could have performance closer to joint learning methods.

Summary of Synthetic Experiments

In terms of both prescriptive and predictive performance, as measured by the treatment and effect accuracy metrics, our method performs comparably with, or even outperforms, the state-of-the-art methods. Additionally, the main advantage of prescriptive trees is that they provide an explicit representation of the decision boundary, as opposed to the other methods where the boundary is only implied by the learned outcome functions. This lends credence to our claim that the trees are interpretable. In fact, as seen in our discussion of the trees obtained for Experiments 1 and 2 in Figures 6-3 and 6-5, the trees correctly learn the true decision boundary in the data.

We also found that regress-and-compare methods that fit separate functions for each treatment are generally outperformed by joint learning methods that learn from the entire dataset. We note that if there were an infinite amount of data and the regress-and-compare methods could learn the individual outcome functions perfectly, then they would also learn the decision boundary perfectly. However, for practical problems with finite sample sizes, we have strong evidence that their performance can be significantly worse than that of the joint learning methods.

6.3 Experiments with Real-World Datasets

In this section, we apply prescriptive trees to some real-world problems to evaluate the performance of our method in a practical setting. The first two problems, personalized warfarin dosing and personalized diabetes management, belong to the area of personalized medicine. Next, we provide personalized job training recommendations to individuals, and finally conclude with an example where we estimate the personalized treatment effect of high-quality child care specialist home visits on the future cognitive test scores of infants.

Personalized Warfarin Dosing

In this section, we test our algorithm in the context of personalized warfarin dosing. Warfarin is the most widely used oral anticoagulant agent worldwide. Its appropriate dose can vary by a factor of ten among patients and hence can be difficult to establish, with incorrect doses contributing to severe adverse effects [36]. Physicians who prescribe warfarin to their patients must constantly balance the risks of bleeding and clotting. The current guideline is to start the patient at 5 mg per day, and then vary the dosage based on how the patient reacts until a stable therapeutic dose is reached [69].

The publicly available dataset we use was collected and curated by staff at the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) and members of the International Warfarin Pharmacogenetics Consortium. One advantage of this dataset is that it gives us access to counterfactuals: it contains the true stable dose, found by physician-controlled experimentation, for each of 5,528 patients. The patient covariates include demographic information (sex, race, age, weight, height), diagnostic information (reason for treatment, e.g., deep vein thrombosis etc.), pre-existing diagnoses (indicators for diabetes, congestive heart failure, smoker status etc.), current medications (Tylenol etc.), and genetic information (presence of genotype polymorphisms of CYP2C9 and VKORC1). The correct dose of warfarin was split into three dose groups: low (≤ 3 mg/day), medium (> 3 and < 5 mg/day), and high (≥ 5 mg/day), which we consider as our three possible treatments 0, 1, and 2.

Our goal is to learn a policy that prescribes the correct dose of warfarin for each patient in the test set. In this dataset, we know the correct dose for each patient, and so we consider the following two approaches for learning the personalization policy.

Personalization when counterfactuals are known

Since we know the correct treatment 푧푖* for each patient, we can simply develop a prediction model that predicts the optimal treatment 푧* given covariates x. This is a standard multi-class classification problem, and so we can use off-the-shelf algorithms for this problem. Solving this classification problem gives us a bound on the performance of our prescriptive algorithms, as this is the best we could do if we had perfect information.

Personalization when counterfactuals are unknown

Since it is unlikely that a real-world dataset will consist of these optimal prescriptions, we reassign some patients in the training set to other treatments so that their assignment is no longer optimal. To achieve this, we follow the setup of Kallus [70], and assume that the doctor prescribes warfarin dosage according to the following probabilistic assignment model:

\[
\mathbb{P}(Z = t \mid \mathbf{X} = \mathbf{x}) = \frac{1}{S} \exp\!\left( \frac{(t-1)(BMI - \mu)}{\sigma} \right), \qquad (6.10)
\]
where 휇 and 휎 are the population mean and standard deviation of patients’ BMI respectively, and the normalizing factor is
\[
S = \sum_{t=1}^{3} \exp\!\left( \frac{(t-1)(BMI - \mu)}{\sigma} \right).
\]

We use this probabilistic model to assign each patient 푖 in the training set a new treatment 푧ˆ푖, and then set 푦푖 = 0 if 푧ˆ푖 = 푧푖, and 푦푖 = 1, otherwise. We proceed to train our methods using the training data (푥푖, 푦푖, 푧ˆ푖), 푖 = 1, . . . , 푛. This allows us to evaluate the performance of various prescriptive methods on data which is closer to real world observational data.
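A minimal sketch of this reassignment step (array names are assumptions; dose groups are indexed 1–3 to match the form of (6.10)):

```python
import numpy as np

def reassign_warfarin(bmi, z_true, seed=0):
    """Reassign treatments via the probabilistic model (6.10) and build binary outcomes.

    bmi    : (n,) body mass index of each training patient
    z_true : (n,) true optimal dose group, encoded 1, 2, 3
    Returns the new assignments z_hat and outcomes y (0 if the assignment is correct, 1 otherwise).
    """
    rng = np.random.default_rng(seed)
    mu, sigma = bmi.mean(), bmi.std()
    t = np.array([1, 2, 3])
    # Unnormalized probabilities exp((t - 1)(BMI - mu) / sigma) for each patient and dose group.
    logits = np.outer(bmi - mu, t - 1) / sigma
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    z_hat = np.array([rng.choice(t, p=p) for p in probs])
    y = (z_hat != z_true).astype(int)   # 0 when the sampled dose matches the true stable dose
    return z_hat, y
```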

Experiments

In order to test the efficacy with which our algorithm learns from observational data, we split the data into training and test sets, where we vary the size of the training set as 푛 = 500, 600, . . . , 2500, and the size of the test set is fixed as 푛test = 2500. We perform 100 replications of this experiment for each 푛, where we re-sample the

Figure 6-7: Misclassification rate for warfarin dosing prescriptions as a function of training set size (methods: Class−LR, Class−RF, PT, OPT(1), OPT(0.5), RC−LASSO, RC−RF).

training and test sets of the respective sizes without replacement each time. We report the misclassification (error) rate on the test set, noting that the full counterfactuals are available on the test set.

We compare the methods described in Section 6.2, but do not include OPT(0.5)-L as we did not observe any benefit when adding continuous estimates of the counterfactuals, possibly due to the discrete nature of the outcomes in the problem. We also do not include causal forests as the problem has more than two treatments. Additionally, to evaluate the performance of prescriptions when all outcomes are known, we treat the problem as a multi-class classification problem and solve it using off-the-shelf algorithms. We use random forests [29], denoted Class–RF, and logistic regression, denoted Class–LR.

In Figure 6-7, we present the out-of-sample misclassification rates for each approach. We see that, as expected, the classification methods perform the best, with random forests having the lowest overall error rate, reaching around 32% at 푛 = 2,500. This provides a concrete lower bound for the performance of the prescriptive approaches to be benchmarked against.

The greedy PT approach has stronger performance than the OPT methods at low 푛, but as 푛 increases this advantage disappears. At 푛 = 2,500, the OPT(1) algorithm outperforms PT by about 0.6%, which shows the improvement that is gained by

solving the prescriptive tree problem holistically rather than in a greedy fashion. OPT(0.5) improves further upon this by 0.6%, demonstrating the value achieved by accounting for the prediction error in addition to the prescriptive error. The trees generated by OPT(1) and OPT(0.5) were also smaller than those from PT, making them more easily interpretable.

Finally, the regress-and-compare approaches both perform similarly, outperforming all prescriptive tree methods. We note that this is the opposite result to that found by [70], where the prescriptive trees were the strongest. We suspect the discrepancy is because they did not include random forests or LASSO as regress-and-compare approaches, only CART, 푘-NN, logistic regression and OLS regression, which are all typically weaker methods for regression, and so the regressions inside the regress-and-compare were not as powerful, leading to diminished regress-and-compare performance. It is perhaps not surprising that the regress-and-compare approaches are more powerful in this example; they are able to choose the best treatment for each patient based on which treatment has the best prediction, whereas the prescription tree can only make prescriptions for each leaf, based on which treatment works well across all patients in the leaf. This added flexibility leads to more refined prescriptions, but at a complete loss of interpretability, which is a crucial aspect of the prescription tree.

Overall, our results show that there is a significant advantage to both solving the prescriptive tree problem with a view to global optimality, and accounting for the prediction error as well as the prescription error while optimizing the tree.

Personalized Diabetes Management

In this section, we apply our algorithms to personalized diabetes management using patient-level data from Boston Medical Center (BMC). This dataset consists of electronic medical records for more than 1.1 million patients from 1999 to 2014. We consider more than 100,000 patient visits for patients with type 2 diabetes during this period. Patient features include demographic information (sex, race, gender etc.), treatment history, and diabetes progression. This dataset was first considered in Bertsimas et al. [19], where the authors propose a 푘-nearest neighbors (푘NN)

regress-and-compare approach to provide personalized treatment recommendations for each patient from the 13 possible treatment regimens. We compare our prescriptive trees method to several regress-and-compare approaches, including the previously proposed 푘NN approach.

We follow the same experimental design as in Bertsimas et al. [19]. The data is split 50/50 into training and testing. The models are constructed using the training data and then used to make prescriptions on the testing data. The quality of the predictions on the testing data is evaluated using a 푘NN approach to impute the counterfactuals on the test set; we also considered imputing the counterfactuals using LASSO and random forests and found the results were not sensitive to the imputation method. We use the same three metrics to evaluate the various methods: the mean

HbA1c improvement relative to the standard of care; the percentage of visits for which the algorithm’s recommendations differed from the observed standard of care; and the mean HbA1c benefit relative to standard of care for patients where the algorithm’s recommendation differed from the observed care.

We varied the number of training samples from 1,000–50,000 (with the test set fixed) to examine the effect of the amount of training data on out-of-sample performance. We repeated this process for ten different splittings of the data into training and testing to minimize the effect of any individual split on our results.

In addition to the methods defined in Section 6.2, we compare the following approaches (a sketch of the three evaluation metrics follows this list):

∙ Baseline: The baseline method continues the current line of care for each patient.

∙ Oracle: For comparison purposes, we include an oracle method that selects the best outcome for each patient using the imputed counterfactuals on the test set. This method therefore represents the best possible performance on the data.

∙ Regress-and-compare: In addition to RC–LASSO and RC–RF, we include 푘-nearest neighbors regress-and-compare, denoted RC–푘NN, to match the approaches used in [19].
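As referenced above, here is a minimal sketch of the three evaluation metrics; the array names and the sign convention (a negative change means a larger HbA1c reduction than the standard of care) are assumptions.

```python
import numpy as np

def diabetes_metrics(prescribed, standard_of_care, hba1c_alg, hba1c_soc):
    """Three evaluation metrics for personalized diabetes management.

    prescribed       : (n,) regimen recommended by the algorithm for each visit
    standard_of_care : (n,) regimen under the observed standard of care
    hba1c_alg        : (n,) imputed change in HbA1c under the algorithm's recommendation
    hba1c_soc        : (n,) change in HbA1c under the standard of care
    """
    improvement = hba1c_alg - hba1c_soc                   # relative to standard of care; negative is better
    differed = prescribed != standard_of_care
    mean_improvement = improvement.mean()                 # mean HbA1c improvement over all visits
    conditional_benefit = improvement[differed].mean()    # benefit where the recommendation differed
    proportion_differed = differed.mean()                 # fraction of visits with a changed prescription
    return mean_improvement, conditional_benefit, proportion_differed
```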

Figure 6-8: Comparison of methods for personalized diabetes management (Baseline, Oracle, RC−kNN, RC−LASSO, RC−RF, OPT(0.5)) as a function of training set size. The leftmost plot shows the overall mean change in HbA1c across all patients (lower is better). The center plot shows the mean change in HbA1c across only those patients whose prescription differed from the standard-of-care. The rightmost plot shows the proportion of patients whose prescription was changed from the standard-of-care.

The results of the experiments are shown in Figure 6-8. We see that our results for the regress-and-compare methods mirror those of [19]; RC–푘NN is the best performing regression method for prescriptions, and its performance increases with more training data. RC–LASSO increases in performance with more data as well, but performs uniformly and significantly worse than 푘NN. RC–RF performs strongly with limited data, but does not improve as more training data becomes available. OPT(0.5) offers the best performance across all training set sizes. Compared to RC–푘NN, OPT(0.5) is significantly stronger at smaller training set sizes, supporting our intuition that it makes better use of the data by considering all treatments simultaneously rather than partitioning based on treatment. At higher training set sizes, the performance behaviors of RC–푘NN and OPT(0.5) become similar, suggesting that the methods may be approaching the performance limits of the dataset.

These computational experiments offer strong evidence that the prescriptions of OPT are at least as strong as those from RC–푘NN, and significantly stronger at smaller training set sizes. The other critical advantage is the increased interpretability

of OPT compared to RC–푘NN, which is itself already more interpretable than other regress-and-compare approaches. To interpret the RC–푘NN prescription for a patient, one must first find the set of nearest neighbors to this point among each of the possible treatments. Then, in each group of nearest neighbors, we must identify the set of common characteristics that determine the efficacy of the corresponding treatment on this group of similar patients. When interpreting the OPT prescription, the tree structure already describes the decision mechanism for the treatment recommendation, and is easily visualizable and readily interpretable.

Personalized Job Training

In this section, we apply our methodology to the Jobs dataset [74], a widely-used benchmark dataset in the causal inference literature, where the treatment is job training and the outcome is annual earnings after the training program. This dataset is obtained from a study based on the National Supported Work program1. The study consists of 297 and 425 individuals in the control and treated groups, respectively, where the treatment indicator 푧푖 is 1 if the subject received job training in 1976–77 and 0 otherwise. The dataset has seven covariates, which include age, education, race, marital status, whether the individual earned a degree, and prior earnings (earnings in 1975), and the outcome 푦푖 is 1978 annual earnings.

We split the full dataset into 70/30 training/testing samples, and averaged the results over 1000 such splits to report the out-of-sample average personalized income. Since the counterfactuals are not known for this example, we again employ the nearest-neighbor matching algorithm from [19] to impute the counterfactual values on the test set. Using these imputed values, we compute the cost of the policies prescribed by each method. Note that for this example, the higher the out-of-sample income, the better. We include causal forests (CF) in our comparison for this problem since it has only two treatment options. In Table 6.1, we present the average net personalized income on the test set, as prescribed by each of the five methods.

1Available at http://users.nber.org/~rdehejia/nswdata2.html

Table 6.1: Average personalized income on the test set for various methods, arranged in increasing order.

Method        Average income ($)   Standard error ($)
Baseline            5436.58               10.81
CF                  5880.70               17.92
RC–푘NN              5884.28               17.79
RC–RF               5892.43               17.78
RC–LASSO            5941.39               18.94
OPT(0.5)-L          5979.84               18.07
Oracle              7697.23               17.16

Figure 6-9: Out-of-sample average personalized income as a function of inclusion rate.

[Plot omitted: mean out-of-sample income versus inclusion rate for OPT(0.5)-L, RC–kNN, RC–RF, RC–LASSO, and CF.]

For each method, we only prescribe a treatment for an individual in the test set if the predicted treatment effect for that individual is higher than a certain value 훿 > 0, whose value we vary and choose such that it leads to the highest possible predicted average test-set income. We find the best such 훿 for each instance, and average the resulting prescription income over the 1000 realizations for each method. From the results, we see that OPT(0.5)-L obtains an average personalized income of $5980, which is higher than that of the other methods. The next closest method is RC–LASSO, which obtains an average income of $5941.

In Figure 6-9, we present the out-of-sample incomes as a function of the fraction of subjects for which the intervention is prescribed (the inclusion rate), which we obtain by varying the threshold 훿 described above. We see that the average income in the test set is highest for OPT(0.5)-L at all values of the inclusion rate, indicating that

our OPT method is best able to estimate the personalized treatment effect across all subjects. We also see that the income peaks at a relatively low inclusion rate, showing that we are able to easily identify a subset of the subjects with large treatment effect.
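To make the construction of Figure 6-9 concrete, the following is a minimal sketch of how the inclusion-rate curve can be traced by sweeping the threshold 훿. The names tau_hat (a method’s predicted treatment effects on the test set) and y_control, y_treated (the imputed outcomes under each option) are assumptions for illustration; in our experiments the threshold is chosen on predicted income, whereas this sketch simply evaluates each threshold with the imputed outcomes.

import numpy as np

def income_vs_inclusion(tau_hat, y_control, y_treated):
    # Sweep the threshold implicitly by treating the top-k individuals ranked by
    # predicted treatment effect, and record mean income and inclusion rate k/n.
    order = np.argsort(-tau_hat)              # largest predicted effect first
    n = len(tau_hat)
    inclusion, income = [], []
    for k in range(n + 1):
        treat = np.zeros(n, dtype=bool)
        treat[order[:k]] = True
        outcome = np.where(treat, y_treated, y_control)
        inclusion.append(k / n)
        income.append(outcome.mean())
    best = int(np.argmax(income))             # threshold giving the highest income
    return np.array(inclusion), np.array(income), float(inclusion[best])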

Estimating Personalized Treatment Effects for Infant Health

In this section, we apply our method to estimate the personalized treatment effect of high-quality child care specialist home visits on the future cognitive test scores of children. This dataset is based on the Infant Health and Development Program (IHDP) and was compiled by Hill [61]. Following Hill [61], the original randomized controlled trial was made imbalanced by removing a biased subset of the group that had specialist home visits. The final dataset consists of 139 and 608 subjects in the treatment and control groups respectively, with 푧푖 = 1 indicating treatment (specialist home visit), and a total of 25 covariates, which include child measurements such as birth weight, head circumference, weeks born pre-term, and sex; behaviors engaged in during the pregnancy, such as cigarette smoking and alcohol and drug consumption; and measurements on the mother at the time she gave birth, such as age, marital status, and educational attainment.

In this example we focus on estimating the individual treatment effect, since it has been acknowledged that the program has been successful in raising the test scores of treated children compared to the control group (see references in Hill [61]). The outcomes are simulated in such a way that the average treatment effect on the control subjects is positive (setting B in Hill [61], with no overlap). However, note that even though the sign and magnitude of the average treatment effect are known, there is still heterogeneity in the magnitudes of the individual treatment effects. In all our experiments, we split the data into training and test sets in a 90/10 ratio, and compute the error of the treatment effect estimates on the test set relative to the true noiseless outcomes (which are known). We average this value over 100 splits of the dataset, and compare the test set performance of each method. In Table 6.2, we present the means and standard errors of the 푅2 of the personalized treatment effect estimates on the test set, given by each of the four methods.

Table 6.2: Average 푅2 on the test set for each method when estimating the personalized treatment effect.

Method        Mean 푅2    Standard error
CF             0.543         0.015
RC–LASSO       0.639         0.018
RC–RF          0.704         0.013
OPT(0.5)-L     0.759         0.013

We see that OPT(0.5)-L obtains the highest average 푅2 value of 0.759, followed by RC–RF with 0.704. This again gives strong evidence that our OPT methods can deliver high-quality prescriptions whilst simultaneously maintaining interpretability.
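For completeness, the 푅2 reported in Table 6.2 can be computed as in the small sketch below, where tau_true denotes the noiseless treatment effects known from the simulation and tau_hat a method’s estimates on one test split (the reported values average this quantity over 100 splits); both names are illustrative.

import numpy as np

def treatment_effect_r2(tau_true, tau_hat):
    # R^2 of the estimated personalized treatment effects on a test split,
    # relative to predicting the mean true effect for every subject.
    ss_res = np.sum((tau_true - tau_hat) ** 2)
    ss_tot = np.sum((tau_true - np.mean(tau_true)) ** 2)
    return 1.0 - ss_res / ss_tot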

6.4 Conclusions

In this chapter, we presented an interpretable approach for personalizing treatments that learns from observational data. Our method relies on iterative splitting of the feature space, and can handle the case of more than two treatment options. We applied this method to synthetic and real-world datasets, and illustrated its superior prescriptive power compared to other state-of-the-art methods.

Chapter 7

Diagnostics for Children After Head Trauma

Between 2006 and 2010, an estimated 750,000 emergency department visits pertained to pediatric head trauma [85, 86]. Most children with head injury have apparently minor trauma [86, 44]. In children with minor head trauma, the main challenge is to identify clinically important intracranial injuries that necessitate immediate intervention or close observation. Currently, computed tomography (CT) is the standard for the rapid diagnosis of intracranial injury [47], but it is costly, may require sedation, and exposes patients to ionizing radiation, which may increase the risk of malignancies later in life [32, 31, 87]. However, patients without substantially altered mental status, e.g., with Glasgow Coma Scale (GCS) scores of 14 or 15, rarely suffer from clinically important traumatic brain injury (TBI) or have evidence of intracranial injury with CT imaging [73]. Avoiding needless CTs in such patients is highly desirable [73]. To this end, the Pediatric Emergency Care Applied Research Network (PECARN) has developed and validated rules for triaging which children with head trauma but without substantially altered mental status should receive head CT to diagnose clinically important TBI [73]. The PECARN prediction tools are widely used [122, 108] and have been independently validated multiple times [84, 66]. The PECARN tools are easy to memorize and apply, but their simplicity may come at a price in terms of maximum attainable predictive performance. Having easy-to-memorize predictive tools is desirable but hardly necessary, given the high degree of penetration of health information technologies and the wide use of clinical decision support tools in emergency departments in developed countries.

In this chapter, we use the methods from Chapter 2 to develop and validate tools to predict clinically important TBI (ciTBI) that are easy to implement and provide a performance edge over the PECARN tools [73]. The intention is to provide an example of how the improved performance of OCT can lead to materially significant impact in practice.

7.1 Data and Methods

Patients

We analyzed a prospective cohort of 42,412 children with head trauma who were examined between June 2004 and September 2006 in emergency medicine departments participating in PECARN [73]. The mean age was 7.1 years, ranging from 0 to 18. In this cohort, 6263 (15%) children suffered injuries with severe mechanisms, 37,961 (90%) had isolated head trauma, and 41,071 (97%) had a GCS score of 15.

This cohort was designed to develop and validate a classifier (predictive tool) to identify children at very low risk of ciTBI. Eligible patients presented to the emergency departments within 24 hours of a head injury. Patients who were imaged before admission, had trivial injury mechanisms, or had conditions complicating assessment (e.g., known brain tumors) were excluded, as were patients with GCS scores of 13 or less, ventricular shunts, or bleeding disorders. Further details about the cohort are given in [73].

We followed the original analysis and stratified the cohort into 10,718 (25%) “younger” (younger than 2 years and predominantly nonverbal) and 31,694 (75%) “older” (older than 2 years and predominantly verbal) patient strata. The rationale was that the injury mechanisms, the patients’ sensitivity to radiation, and (their guardians’) risk tolerance towards radiation-related side effects would likely differ between the two groups [73]. We randomly split patients into classifier training (푛 = 8574 and 푛 = 25,355, respectively, in the younger and older strata) and test (푛 = 2144 and 푛 = 6339, respectively, in the two strata) cohorts. Because the dataset was anonymized, we could not use the same training and test cohorts as in the original analysis [73].
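The stratified random split described above can be sketched as follows; this is an illustration only, with the 80/20 ratio implied by the reported counts and the seeding being assumptions.

import numpy as np

def split_stratum(indices, test_frac=0.2, seed=0):
    # Random training/test split within one age stratum; with the cohort sizes
    # above this yields roughly 8574/2144 (younger) and 25355/6339 (older).
    rng = np.random.default_rng(seed)
    perm = rng.permutation(np.asarray(indices))
    n_test = int(round(test_frac * len(perm)))
    return perm[n_test:], perm[:n_test]   # (training indices, test indices)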

Clinical outcome

The outcome of interest was ciTBI, defined a priori as death from TBI, neurosurgery, intubation for more than 24 hours, or hospital admission for at least 2 nights in patients with TBI-related CT findings. For detailed clinical definitions see [73].

Classifier training and test

Predictors

All predictors in the PECARN dataset were prospectively recorded. Predictors included attributes of the injury mechanism and the symptoms and signs assessed at presentation (see Table 7.1). Predictors that were defined conditional on levels of other predictors were entered in the analyses as main effects. For example, the duration of a seizure is defined only among patients with a history of post-injury seizures, and is coded as 0 for patients with no history of seizures.
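As an illustration of this coding, the sketch below sets the conditionally defined seizure-duration predictor to 0 whenever there is no history of seizures. The column names are illustrative, not the actual names in the PECARN dataset.

import pandas as pd

def encode_conditional(df):
    # Conditional predictors keep their recorded level when the parent predictor
    # is present, and are coded as 0 (the lowest level) otherwise.
    out = df.copy()
    out["seizure_duration"] = out["seizure_duration"].where(out["seizures"] == 1, 0)
    return out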

Classifiers

CART was used to derive the original PECARN rules. We developed optimal classification trees (OCT) using the methods of Chapter 2 to classify patients as having ciTBI or not. OCT was tuned to be as likely to miss a ciTBI case as the original PECARN tools are.
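One plausible way to operationalize this tuning is sketched below; this is an assumption on our part rather than a description of the exact procedure. Among candidate trees trained with different complexity or class-weight settings, we retain those whose training sensitivity is at least that of the reference rule and keep the most specific one.

import numpy as np

def pick_sensitivity_matched(candidates, X, y, target_sensitivity):
    # candidates: fitted classifiers exposing predict(); y: 1 if ciTBI, else 0.
    best, best_spec = None, -1.0
    for model in candidates:
        pred = model.predict(X)
        tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
        tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))
        sens, spec = tp / (tp + fn), tn / (tn + fp)
        if sens >= target_sensitivity and spec > best_spec:
            best, best_spec = model, spec
    return best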

Test

For each classifier, we estimated the sensitivity, specificity, positive and negative predictive values (PPV and NPV, respectively), and positive and negative likelihood ratios (LR+ and LR−, respectively) in the test cohorts.

Table 7.1: Prospectively collected predictors in the PECARN cohort. Listed in brackets is the coding of the predictor: p푥, polytomous with 푥 levels; o푥, ordinal with 푥 levels; c, continuous. Predictors defined conditional on the presence of other predictors are indented.

History:
  age [c]
  headache [p2]
    initiation [p4]
    severity [o4]
  amnesia [p3]
  loss of consciousness [p3]
    duration [o4]
  seizures [p2]
    initiation [p4]
    duration [o4]
  acting normal [p2]
  vomiting [p2]
    episodes [o4]
    first vomit [p4]
    last vomit [p4]
  dizziness [p2]

Physical:
  GCS 15 [p2]
  altered mental status [p2]
    agitated [p2]
    sleepy [p2]
    slow reacting [p2]
    repetitive questions [p2]
    other [p2]
  palpable skull fracture [p3]
    depressed [p2]
  basilar fracture signs [p2]
    hemotympanum [p2]
    CSF otorrhea [p2]
    CSF rhinorrhea [p2]
    raccoon eyes [p2]
    Battle’s sign [p2]
  scalp hematoma [p2]
    size [o3]
    location [p3]
  supraclavicular trauma [p2]
    face [p2]
    neck [p2]
    frontal scalp [p2]
    occipital scalp [p2]
    parietal scalp [p2]
    temporal scalp [p2]
  neurological deficit [p2]
    motor [p2]
    sensory [p2]
    pupil reactivity [p2]
    reflexes [p2]
    other [p2]
  other substantial injury [p2]
    extremities [p2]
    lacerations [p2]
    cervical spine [p2]
    torso [p2]
    intraabdominal [p2]
    pelvis [p2]
    other [p2]
  intoxication [p2]
  bulging anterior fontanelle [p2]

Other:
  intubated [p2]
  sedated [p2]
  pharmacologically paralyzed [p2]
  injury mechanism [p13]
  injury severity [o3]

For each metric, we also compared estimates from OCT and from the original PECARN tools, by estimating odds ratios for the probability metrics and ratios for the likelihood ratio metrics. Unless otherwise noted, all confidence intervals (CIs) are two-sided exact 95% CIs.
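These quantities follow directly from the 2x2 cross-classification of predictions and true ciTBI status (as in Tables 7.2 and 7.3). A minimal sketch of the point estimates is given below; the exact confidence intervals are omitted.

import numpy as np

def diagnostic_metrics(pred, truth):
    # pred, truth: binary arrays with 1 = predicted higher risk / ciTBI present.
    tp = np.sum((pred == 1) & (truth == 1)); fn = np.sum((pred == 0) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0)); tn = np.sum((pred == 0) & (truth == 0))
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return {"sensitivity": sens, "specificity": spec,
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
            "LR+": sens / (1 - spec), "LR-": (1 - sens) / spec}

def odds_ratio(p1, p2):
    # Odds ratio comparing two probability metrics, e.g. OCT vs PECARN specificity.
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

For example, applying odds_ratio to the training specificities in Table 7.5 (0.729 for OCT versus 0.524 for PECARN) gives approximately 2.44, matching the reported odds ratio.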

Handling of missing data

All 42,412 patients had outcome data. For each predictor in Table 7.1, we examined whether missing a value was associated with the outcome using logistic regressions. There was no evidence of strong associations between the outcome and missingness in the various predictors, suggesting that data are missing completely at random [80]. We imputed missing values using a novel method developed in [20] that chooses the imputed values to minimize the total distance of each datapoint to its 퐾 nearest neighbors (퐾 = 10, chosen through cross-validation). Such methods have demonstrated significant improvements in imputation accuracy on many real-world datasets compared to classical (approximate) methods such as the expectation-maximization algorithm and predictive mean matching [20].
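The sketch below illustrates the general idea of 퐾-nearest-neighbor imputation with a simple alternating scheme; it is a simplified stand-in for intuition only, not the optimization-based method of [20].

import numpy as np

def knn_impute(X, K=10, n_iters=5):
    # Start from column means, then repeatedly replace each missing entry with
    # the mean of that feature over the K nearest neighbours (Euclidean distance
    # on the current imputation).  O(n^2) distances: acceptable for a sketch.
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iters):
        d = np.linalg.norm(X_imp[:, None, :] - X_imp[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)
        nbrs = np.argsort(d, axis=1)[:, :K]
        for i, j in zip(*np.where(missing)):
            X_imp[i, j] = X_imp[nbrs[i], j].mean()
    return X_imp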

Sensitivity analyses

We examined whether results changed when only complete cases (children with no missing values) were used. We also compared OCT against CART trees that we developed on the training dataset using the same approach as in [73]. All results from these sensitivity analyses were congruent with those from the main analysis and are not shown.

7.2 Results

With the PECARN tool (Figure 7-1), patients younger than 2 years are considered at higher risk of ciTBI if they have signs of altered mental status, a palpable skull fracture, an occipital or parietal scalp hematoma, are not acting normally (per guardian), or suffered a severe mechanism of injury.

Figure 7-1: PECARN rules for ciTBI in children younger than 2 years.

[Tree diagram omitted. Starting from 98/10718 ciTBI (0.91%), patients are checked in turn for altered mental status; an occipital, parietal or temporal scalp haematoma; loss of consciousness for at least 5 seconds; a severe injury mechanism; a palpable or unclear skull fracture; and not acting normally per the parent. A positive answer at any split assigns higher risk, and the remaining node (1/5586 ciTBI, 0.02%) is labeled lower risk.]

Compared to the PECARN tool, the corresponding OCT in Figure 7-2 identified more children as being at low risk of ciTBI. For example, nonverbal children who have a parietal scalp hematoma are classified as being at higher risk of ciTBI with the PECARN tool, but they would be classified as having low risk of ciTBI by OCT if they also act normally and have no history of vomiting.

Children who are at least 2 years old are considered at higher risk of ciTBI with the PECARN tool if they exhibit signs of altered mental status or basilar skull fracture, have a history of loss of consciousness or of vomiting, have severe headache, or suffered a severe mechanism of injury (Figure 7-3). The corresponding OCT in Figure 7-4 identifies more children as being at lower risk compared to the PECARN tool. For example, a child with no signs of altered mental status who only has a history of vomiting would be classified as having low risk of ciTBI if there is no supraclavicular evidence of trauma (e.g., lacerations, bruises).



Figure 7-2: OCT for ciTBI in children younger than 2 years.

[Tree diagram omitted. The OCT splits on the need for repetitive questioning in the ED; scalp haematoma location; not acting normally per the parent, facial trauma, or other signs of altered mental status; history of loss of consciousness; parietal scalp trauma; dizziness; headache; history of vomiting; depressed palpable skull fracture; age in months; injury mechanism severity; and the timing of vomiting. Leaf nodes report ciTBI counts and rates and are labeled higher or lower risk.]

The counts of children predicted to be at higher versus low risk of ciTBI with OCT and the PECARN tools versus the true ciTBI status are shown in Tables 7.2 and 7.3 for the younger and older patient strata, respectively.

In the younger stratum, both tools classified every patient with ciTBI in the same way: they identified all 17 patients with ciTBI in the validation cohort, and the same 80 out of 81 patients in the training cohort. The single patient who was mistakenly predicted to be at low risk of ciTBI by both tools was a newborn who was hospitalized for at least 2 nights and had evidence of TBI in CT imaging. That child suffered a moderate severity trauma and had no recorded suggestive signs (patient A in Table 7.4). However, OCT outperformed the PECARN rule in correctly identifying more children at low risk of ciTBI. As shown in Table 7.5, for all performance metrics, the OCT was as good as or better than the PECARN tool in both cohorts.

Figure 7-3: PECARN rules for ciTBI in children at least 2 years old.

[Tree diagram omitted. Starting from 278/31694 ciTBI (0.88%), patients are checked in turn for altered mental status; history of loss of consciousness; history of vomiting; clinical signs of basilar skull fracture; a severe injury mechanism; and severe headache. A positive answer at any split assigns higher risk, and the remaining node (8/17766 ciTBI, 0.05%) is labeled lower risk.]

Figure 7-4: OCT for ciTBI in children at least 2 years old.

[Tree diagram omitted. The OCT splits on the need for repetitive questioning, not acting normally, or hemotympanum; verbal amnesia for the event; history of vomiting; raised scalp hematomas or swellings, frontal scalp trauma, signs of altered mental status, or being hit by an automobile while riding a bike; being hit by a moving vehicle while on foot; trauma above the clavicles or pre-verbal status; scalp trauma; headache at the ED or history of loss of consciousness; signs of intra-abdominal injury; substantial chest, back or flank injury; and palpable skull fracture. Leaf nodes report ciTBI counts and rates and are labeled higher or lower risk.]

With reference to correctly ruling out the presence of ciTBI, compared to the PECARN tool, the OCT had statistically significantly better specificity (by 21% [odds ratio 2.44] and 25% [odds ratio 3.19] in the training and test cohorts, respectively), at least as high a negative predictive value, and a more favorable negative likelihood ratio (28% and 32% smaller, respectively, albeit not statistically significantly so).

In the older stratum, in the training cohort, the PECARN tool missed 7 out of 244 patients with ciTBI, and the OCT missed 3 (Table 7.4). Two patients with ciTBI were missed by both tools. One of them, a 6 year old child who fell down the stairs and presented with a small frontal hematoma, received neurosurgery. The other, a 16 year old who suffered an assault and had signs of facial trauma, was hospitalized and had TBI findings in CT imaging. The OCT (but not PECARN) misclassified an 11 year old patient who had a bike accident, a history of brief loss of consciousness, and moderate headache. The PECARN tool (but not OCT) misclassified 5 patients aged between 6 and 15 years, all with moderate severity injury mechanisms and variable suggestive symptoms and signs (Table 7.4). Both tools classified every patient with ciTBI in the same way in the test set. They both missed a 26 month old preverbal child who was hospitalized for at least 2 nights and had evidence of TBI in CT imaging. This patient suffered a moderately severe injury falling from an elevation, and had only a medium sized parietal or temporal hematoma.

As was the case for the younger children, in children at least 2 years old the OCT was as good as or better than the PECARN tool in both cohorts for all performance metrics. Compared to the PECARN rules, OCT had statistically significantly better specificity (by 10% [odds ratio 1.54] and 3.6% [odds ratio 1.18] in the training and test cohorts, respectively), at least as high a negative predictive value, and a more favorable negative likelihood ratio (54% and 5% smaller, respectively, albeit not statistically significantly so; Table 7.6).

Table 7.2: Cross-classification of predictions of higher and low risk and ciTBI status in children younger than 2 years.

Cohort     ciTBI status   OCT high risk   OCT low risk   PECARN high risk   PECARN low risk
Training   Yes                  80               1               80                  1
Training   No                 2300            6193             4040               4453
Test       Yes                  17               0               17                  0
Test       No                  459            1668              995               1132

Table 7.3: Cross-classification of predictions of higher and low risk and ciTBI status in children 2 years and older.

Cohort     ciTBI status   OCT high risk   OCT low risk   PECARN high risk   PECARN low risk
Training   Yes                 241               3              237                  7
Training   No                 9020           16091            11629              13482
Test       Yes                  33               1               33                  1
Test       No                 1804            4501             2029               4276

Table 7.4: Patients with ciTBI who were missed by the PECARN or OCT rules.

ID; Cohort; Missed by; Age (verbal status); Mechanism of injury (severity); GCS; Recorded signs or symptoms; ciTBI
A; <2 y, training; OCT, PECARN rules; Newborn (preverbal); Object fell on head (moderate); 15; [None]; Hospitalization + TBI in CT
B; >=2 y, training; OCT, PECARN rules; 6 y (verbal); Fell down the stairs (moderate); 15; Small (<1 cm) frontal hematoma; Neurosurgery
C; >=2 y, training; OCT, PECARN rules; 16 y (verbal); Assault (moderate); 15; Facial trauma; Hospitalization + TBI in CT
D; >=2 y, training; PECARN rules; 11 y (verbal); Bike collision or fall from bike (moderate); 15; Large (>3 cm) frontal hematoma, moderate headache within 1 h, signs of injury in the flank and extremities; Hospitalization + TBI in CT
E; >=2 y, training; PECARN rules; 7 y (verbal); Wheeled transport crash other than bike or motor vehicle (moderate); 15; Medium (1–3 cm) frontal hematoma, dizziness, unclear palpation exam for skull fracture, facial trauma, neurological deficit (cranial nerve); Hospitalization + TBI in CT
F; >=2 y, training; PECARN rules; 6 y (verbal); Wheeled transport crash other than bike or motor vehicle (moderate); 15; Small (<1 cm) parietal or temporal hematoma, moderate headache within 1 h; Hospitalization + TBI in CT
G; >=2 y, training; PECARN rules; 15 y (verbal); Sports (moderate); 15; Large (>3 cm) frontal hematoma, moderate headache within 1 h, facial trauma; Hospitalization + TBI in CT
H; >=2 y, training; PECARN rules; 11 y (verbal); Sports (moderate); 15; Amnesia for the injury, facial trauma; Hospitalization + TBI in CT
I; >=2 y, training; OCT rules; 11 y (verbal); Bike collision or fall from bike (moderate); 15; Amnesia for the injury, loss of consciousness for 1–5 mins, moderate headache; Hospitalization + TBI in CT
J; >=2 y, test; OCT, PECARN rules; 26 mo (preverbal); Fell from an elevation (moderate); 15; Medium (1–3 cm) parietal or temporal hematoma; Hospitalization + TBI in CT

Table 7.5: Predictive performance of OCT versus PECARN rules among children younger than 2 years.

Metric        OCT                    PECARN                 Odds Ratio
Training
Sensitivity   98.8 (93.3, 100.0)     98.8 (93.3, 100.0)     1.00 (0.31, 3.21)
Specificity   72.9 (72.0, 73.9)      52.4 (51.4, 53.5)      2.44 (2.35, 2.54)
PPV           3.4 (2.7, 4.2)         1.9 (1.5, 2.4)         1.76 (1.68, 1.84)
NPV           100.0 (99.9, 100.0)    100.0 (99.9, 100.0)    1.39 (0.45, 4.33)
LR+           3.65 (3.50, 3.81)      2.08 (2.01, 2.15)      1.76 (1.68, 1.84)
LR-           0.02 (<0.005, 0.12)    0.02 (<0.005, 0.17)    0.72 (0.23, 2.24)
Test
Sensitivity   100.0 (80.5, 100.0)    100.0 (80.5, 100.0)    1.00 (0.15, 6.65)
Specificity   78.4 (76.6, 80.2)      53.2 (51.1, 55.4)      3.19 (2.91, 3.50)
PPV           3.6 (2.1, 5.7)         1.7 (1.0, 2.7)         2.16 (1.82, 2.57)
NPV           100.0 (99.8, 100.0)    100.0 (99.7, 100.0)    1.47 (0.25, 8.61)
LR+           4.50 (4.02, 5.04)      2.08 (1.90, 2.27)      2.16 (1.82, 2.57)
LR-           0.04 (<0.005, 0.54)    0.05 (<0.005, 0.80)    0.68 (0.12, 4.00)

Table 7.6: Predictive performance of OCT versus PECARN rules among children at least 2 years old.

Metric        OCT                    PECARN                 Odds Ratio
Training
Sensitivity   98.8 (96.4, 99.7)      97.1 (94.2, 98.8)      1.86 (0.84, 5.14)
Specificity   64.1 (63.5, 64.7)      53.7 (53.1, 54.3)      1.54 (1.50, 1.58)
PPV           2.6 (2.3, 2.9)         2.0 (1.8, 2.3)         1.31 (1.28, 1.35)
NPV           100.0 (99.9, 100.0)    99.9 (99.9, 100.0)     2.19 (1.00, 5.93)
LR+           2.75 (2.69, 2.81)      2.10 (2.04, 2.15)      1.31 (1.28, 1.35)
LR-           0.02 (0.01, 0.06)      0.05 (0.03, 0.11)      0.46 (0.17, 1.00)
Test
Sensitivity   97.1 (84.7, 99.9)      97.1 (84.7, 99.9)      1.00 (0.30, 3.38)
Specificity   71.4 (70.3, 72.5)      67.8 (66.7, 69.0)      1.18 (1.13, 1.24)
PPV           1.8 (1.2, 2.5)         1.6 (1.1, 2.2)         1.12 (1.03, 1.23)
NPV           100.0 (99.9, 100.0)    100.0 (99.9, 100.0)    1.05 (0.34, 3.31)
LR+           3.39 (3.16, 3.64)      3.02 (2.82, 3.23)      1.12 (1.03, 1.23)
LR-           0.04 (0.01, 0.28)      0.04 (0.01, 0.30)      0.95 (0.30, 2.94)

7.3 Discussion

TBI has been deemed a ‘serious public health concern’ by the Centers for Disease Control and Prevention (CDC) [44]. In the United States, public awareness about the importance and long-term implications of head injury has increased over recent years, and parents and guardians are more likely to bring children with head trauma to the emergency department for evaluation [86]. In children who have very low risk of ciTBI, avoiding superfluous CT imaging and the associated exposure to ionizing radiation reduces costs and the risk of long-term radiation-induced malignancies [31, 32, 87].

We developed algorithms that outperform the PECARN tool in correctly identifying patients without ciTBI, while identifying the same patients with ciTBI among children younger than 2 years, and more patients with ciTBI among children older than 2 years. The improvements are more pronounced among children younger than 2 years, who are predominantly preverbal and thus more difficult to evaluate, may be more likely to suffer clinically important injury even when the injury mechanism is not severe, and for whom concerns about radiation exposure are greater compared to older children. Using the OCT versus PECARN would correctly reclassify 21.2% (2276/10718) of children younger than 2 years and 8.9% (2830/31694) of older children. These improvements are likely clinically important and economically consequential.

The Children’s Head Injury Algorithm for the Prediction of Important Clinical Events (CHALICE) [41] and the Canadian Assessment of Tomography for Childhood Head Injury (CATCH) rule [95] are two other major clinical prediction rules. In a prospective external validation study in 1009 children (75% of whom were older than 2.6 years), the sensitivities of PECARN, CHALICE and CATCH were 100%, 84%, and 91%, respectively, and the specificities were 61%, 85%, and 44% [42]. If the odds ratios from our head-to-head comparison between OCT and PECARN (Tables 7.5 and 7.6) transfer to the validation study of Easter et al. [42], OCT would have sensitivity around 100% and specificity between 65 and 74%.

A systematic review and economic evaluation from the perspective of the National Health Service (NHS) in England and Wales suggests that using clinical prediction rules outperforms other strategies, such as relying on isolated clinical symptoms or signs, or subjecting all or no patients to CT imaging [97, 100]. Although the PECARN tool was the most sensitive of the examined clinical prediction rules, which included CHALICE [41], the Atabaki et al. tool [2], a University of California Davis tool [96], and the National Emergency X-Radiography Utilization Study (NEXUS) II [88], the cost-effectiveness analysis favored CHALICE and NEXUS-II over PECARN because of their higher specificity. Our OCT models were tuned to have at least as good sensitivity as PECARN, and they attain substantially higher specificity than PECARN. Thus the OCT rules developed here would likely be more competitive with CHALICE and NEXUS-II in the NHS’s cost-effectiveness evaluation [97, 100].

However, especially in the U.S., where there is no explicit health care budget, considerations of cost-effectiveness are less directly applicable. Instead, because the uncertainty about the role of CT imaging in children with apparently minor head trauma is pervasive, emphasis is given to shared decision making between providers, patients, and their families. The goal of shared decision making is to clarify what is known about the clinical benefits of testing and the long-term risks of the exposure to ionizing radiation during CT imaging, and to help patients and their families understand what their preferences are. Empirical evidence suggests that most parents prefer disclosure of risk before proceeding with CT imaging [24]. In addition, there is evidence that the decision to proceed with CT imaging is sensitive to preferences. For example, Karpas et al. surveyed 134 parents of children older than 2 years who presented with head trauma. After standardized education about the risks of increased exposure to ionizing radiation, most parents (푛 = 77) preferred observation to immediate CT scanning, 53 preferred immediate CT, and 3 indicated no preference [71].

The OCT rules are not as easy to memorize as the PECARN clinical prediction rules. However, we have demonstrated that they substantially outperform the PECARN tools in the training and test cohorts. The increased classification performance may outweigh the loss in the simplicity of the tool. Having simple prediction tools is desirable but hardly necessary, considering the ease with which algorithms can be integrated into electronic medical records and other information technology systems. We found no evidence of optimism bias for OCT in extensive bootstrap-based (resampling) analyses, suggesting that these results would likely transfer to new settings. Internal validation efforts are not foolproof substitutes for completely independent empirical validation of the tool in real clinical practice, as has been done with the original PECARN tool [66, 84]. To effectively support shared decision making, a patient decision aid should be developed to help patients and parents understand what options are available to them and to clarify their values and preferences about the outcomes anticipated with each option. Decision aids facilitate shared decision making by helping patients negotiate uncertainty and make choices that are best aligned with their stated and considered values. To our knowledge, such a decision aid does not exist for children with apparently mild head trauma. We believe that developing such a decision aid, and informing it with state-of-the-science tools such as the OCT models developed here, is an important future research need.

Chapter 8

Conclusion

The primary goal of this thesis was to develop methods for constructing optimal decision trees in a practically tractable way, in order to significantly advance the state of the art in the field of decision trees. Since the 1980s, greedy heuristic methods like CART have been the decision tree method of choice in both industry and academia, and further improvements to such methods have been incremental. Motivated by the incredible speedups in mixed-integer optimization over the past 30 years, we posed the task of creating the optimal decision tree in its natural form as a discrete optimization problem. This allowed the entire decision tree to be created in a single step, overcoming the limitations of the classical greedy heuristics. This MIO-based approach did not scale to the problem sizes that are relevant for practical impact, and so we developed a local search procedure for efficiently optimizing the tree, in both the theoretical and practical sense. These local search methods run very quickly and deliver high-quality solutions which empirically match the globally optimal solutions from the MIO solver in all cases where this solution is known to us. In addition to the standard axes-aligned (parallel) splits used by methods like CART, we also developed approaches for training trees with hyperplane splits in an efficient manner. These trees allow for greatly increased modeling power at the possible expense of some interpretability.

We first applied optimal trees to the prediction problems of classification and regression. For classification, we found that OCT improved significantly upon CART in synthetic and practical settings. OCT-H improves further and delivers performance similar to random forests and boosted trees. For regression problems, we found that ORT had small improvements over CART, and ORT-H further improved but did not reach the levels of random forests and boosting. We then adapted the optimal regression tree framework to allow the fitting of linear prediction models in the leaves, and the resulting methods ORT-L and ORT-LH achieved performance comparable to random forests and boosting. These results are significant as they permit the use of interpretable decision tree methods in practice without needing to sacrifice any performance compared to using random forests or boosting.

Next, we developed an approach for using optimal decision trees for the problem of prescription. This is a fundamentally different problem to prediction due to the absence of counterfactual data, and so we developed a procedure for imputing the missing counterfactual outcomes in the observational data during the training of the prescriptive tree, in order to obtain a tree that simultaneously minimizes prediction and prescription error. We showed on a variety of synthetic and practical examples that this approach leads to materially significant improvements in prescriptive performance compared to the existing state-of-the-art approaches in this area. In particular, we achieve competitive performance without sacrificing the interpretability of a single decision tree.

Finally, we presented a case study showing the impact that our improved methods can offer in a real-life practical setting. We applied OCT to the problem of diagnosing children with head trauma, a task that is currently addressed in the US using rules derived with CART, and for which interpretability is therefore crucial. We showed that our OCT approach was able to maintain the current levels of detection accuracy, whilst reducing the number of misdiagnoses, and hence the number of CT scans conducted, by 25–50%. This is just one example of many showing how the performance improvements delivered by the Optimal Trees approach can translate to significant impact in the real world.

Bibliography

[1] TS Arthanari and Yadolah Dodge. Mathematical Programming in Statistics, volume 341. Wiley, New York, 1981.

[2] Shireen M Atabaki, Ian G Stiell, Jeffrey J Bazarian, Karin E Sadow, Tien T Vu, Mary A Camarca, Scott Berns, and James M Chamberlain. A clinical decision rule for cranial computed tomography in minor pediatric head trauma. Archives of pediatrics & adolescent medicine, 162(5):439–445, 2008.

[3] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.

[4] Peter Auer, Robert C Holte, and Wolfgang Maass. Theory and applications of agnostic pac-learning with small decision trees. In Proceedings of the Twelfth International Conference on Machine Learning, pages 21–29, 1995.

[5] Gabriel Baron, Elodie Perrodeau, Isabelle Boutron, and Philippe Ravaud. Reporting of analyses from randomized controlled trials with multiple arms: a systematic review. BMC medicine, 11(1):84, 2013.

[6] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015. Working paper.

[7] Kristin P Bennett. Decision tree construction via linear programming. In M. Evans, editor, Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, pages 97–101, 1992.

[8] Kristin P Bennett and J Blue. Optimal decision trees. Rensselaer Polytechnic Institute Math Report, 214, 1996.

[9] Kristin P Bennett and Jennifer A Blue. A support vector machine approach to decision trees. In Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, volume 3, pages 2396–2401. IEEE, 1998.

[10] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, pages 1–44, 2017.

[11] Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. arXiv preprint arXiv:1402.5481, 2014.

[12] Dimitris Bertsimas and Nathan Kallus. Pricing from observational data. arXiv preprint arXiv:1605.02347, 2016.

[13] Dimitris Bertsimas and Angela King. An algorithmic approach to linear regression. Operations Research, to appear, 2015.

[14] Dimitris Bertsimas and Angela King. Logistic regression: From art to science. Operations Research Center, MIT, submitted for publication, 2015.

[15] Dimitris Bertsimas and Rahul Mazumder. Least quantile regression via modern optimization. The Annals of Statistics, 42(6):2494–2525, 2014.

[16] Dimitris Bertsimas and Romy Shioda. Classification and regression via integer optimization. Operations Research, 55(2):252–271, 2007.

[17] Dimitris Bertsimas and Robert Weismantel. Optimization over Integers. Dynamic Ideas, Belmont, Massachusetts, 2005.

[18] Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. Annals of Statistics, to appear, 2015.

[19] Dimitris Bertsimas, Nathan Kallus, Alex Weinstein, and Ying Daisy Zhuo. Personalized diabetes management using electronic medical records. Diabetes Care, 40(2):210–217, 2017.

[20] Dimitris Bertsimas, Colin Pawlowski, and Ying Zhuo. From predictive methods to missing data imputation: An optimization approach. Journal of Machine Learning Research, under review, 2017.

[21] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. arXiv preprint arXiv:1411.1607, 2014.

[22] Robert E Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, pages 107–121, 2012.

[23] Åke Björck. Numerical methods for least squares problems. SIAM, 1996.

[24] Kathy Boutis, William Cogollo, Jason Fischer, Stephen B Freedman, Guila Ben David, and Karen E Thomas. Parental knowledge of potential cancer risks from exposure to computed tomography. Pediatrics, 132(2):305–311, 2013.

[25] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, California, 1984.

[26] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

[27] Leo Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383, 1996.

[28] Leo Breiman. Arcing classifier (with discussion and a rejoinder by the author). The annals of statistics, 26(3):801–849, 1998.

[29] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[30] Leo Breiman et al. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001.

[31] D Brenner, C Elliston, E Hall, and W Berdon. Estimated risks of radiation-induced fatal cancer from pediatric CT. AJR. American journal of roentgenology, 176(2):289–296, February 2001.

[32] David J Brenner. Estimating cancer risks from pediatric CT: going from the qualitative to the quantitative. Pediatric radiology, 32(4):228–1– discussion 242–4, April 2002.

[33] Dalia Buffery. The 2015 oncology drug pipeline: innovation drives the race to cure cancer. American health & drug benefits, 8(4):216, 2015.

[34] Probal Chaudhuri, Wen-Da Lo, Wei-Yin Loh, and Ching-Ching Yang. Generalized regression trees. Statistica Sinica, pages 641–666, 1995.

[35] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. arXiv preprint arXiv:1603.02754, 2016.

[36] International Warfarin Pharmacogenetics Consortium et al. Estimation of the warfarin dose with clinical and pharmacogenetic data. N Engl J Med, 2009 (360):753–764, 2009.

[37] Louis Anthony Cox Jr, QIU Yuping, and Warren Kuehner. Heuristic least-cost computation of discrete classification functions with uncertain argument values. Annals of Operations Research, 21(1):1–29, 1989.

[38] G. B. Dantzig. Programming of interdependent activities: II mathematical model. Econometrica, 17:200–211, 1949.

[39] G. B. Dantzig. Linear programming and extensions. Princeton University Press and the RAND Corporation, 1963.

[40] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, 7(Jan):1–30, 2006.

[41] J Dunning, J Patrick Daly, JP Lomas, F Lecky, J Batchelor, and K Mackway-Jones. Derivation of the children’s head injury algorithm for the prediction of important clinical events decision rule for head injury in children. Archives of disease in childhood, 91(11):885–891, 2006.

[42] Joshua S Easter, Katherine Bakes, Jasmeet Dhaliwal, Michael Miller, Emily Caruso, and Jason S Haukoos. Comparison of PECARN, CATCH, and CHALICE rules for children with minor head injury: a prospective cohort study. Annals of emergency medicine, 64(2):145–52, 152.e1–5, August 2014.

[43] Saher Esmeir and Shaul Markovitch. Anytime learning of decision trees. The Journal of Machine Learning Research, 8:891–933, 2007.

[44] M Faul, L Xu, M M Wald, and V G Coronado. Traumatic brain injury in the United States: Emergency Department Visits, Hospitalizations and Deaths 2002–2006. Centers for Disease Control and Prevention, National Center for Injury Prevention and Control, 2010.

[45] Michael L Feldstein, Edwin D Savlov, and Russell Hilf. A statistical model for predicting response of breast cancer patients to cytotoxic chemotherapy. Cancer research, 38(8):2544–2548, 1978.

[46] Patrick A Flume, Brian P O’sullivan, Karen A Robinson, Christopher H Goss, Peter J Mogayzel Jr, Donna Beth Willey-Courand, Janet Bujan, Jonathan Finder, Mary Lester, Lynne Quittell, et al. Cystic fibrosis pulmonary guidelines: chronic medications for maintenance of lung health. American journal of respiratory and critical care medicine, 176(10):957–969, 2007.

[47] National Collaborating Centre for Acute Care (UK) et al. Head injury: triage, assessment, investigation and early management of head injury in infants, children and adults. 2007.

[48] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.

[49] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.

[50] Salvador Garcia and Francisco Herrera. An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research, 9(Dec):2677–2694, 2008.

[51] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York, 1979.

[52] John C Gittins. Multi-Armed Bandit Allocation Indices. Wiley, Chichester, UK, 1989.

[53] Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.

[54] Marjan Gort, Manda Broekhuis, Renée Otter, and Niek S Klazinga. Improvement of best practice in early breast cancer: actionable surgeon and hospital factors. Breast cancer research and treatment, 102(2):219–226, 2007.

[55] Thomas Grubinger, Achim Zeileis, and Karl-Peter Pfeiffer. evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of statistical software, 61(1):1–29, 2014. ISSN 1548-7660. doi: 10.18637/jss.v061.i01. URL https://www.jstatsoft.org/v061/i01.

[56] Gurobi Optimization Inc. Gurobi optimizer reference manual, 2015. http://www.gurobi.com.

[57] Gurobi Optimization Inc. Gurobi 7.0 performance benchmarks. http://www.gurobi.com/pdfs/benchmarks.pdf, 2016. Accessed 17 December 2016.

[58] David Heath, Simon Kasif, and Steven Salzberg. Induction of oblique decision trees. In IJCAI, pages 1002–1007. Citeseer, 1993.

[59] David George Heath. A geometric framework for machine learning. PhD thesis, Johns Hopkins University, 1993.

[60] Harold V Henderson and Paul F Velleman. Building multiple regression models interactively. Biometrics, pages 391–411, 1981.

[61] Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

[62] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, pages 65–70, 1979.

[63] Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006.

[64] Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.

[65] IBM ILOG CPLEX. V12.1 user’s manual, 2014.

[66] Kentaro Ide, Satoko Uematsu, Kenichi Tetsuhara, Satoshi Yoshimura, Takahiro Kato, and Tohru Kobayashi. External validation of the PECARN head trauma prediction rules in Japan. Academic Emergency Medicine, 2016.

[67] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.

[68] Thomas R Insel. Translating scientific opportunity into public health impact: a strategic plan for research on mental illness. Archives of General Psychiatry, 66(2):128–133, 2009.

221 [69] Amir Jaffer and Lee Bragg. Practical tips for warfarin dosing and monitoring. Cleveland Clinic journal of medicine, 70(4):361–371, 2003.

[70] Nathan Kallus. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning, pages 1789–1798, 2017.

[71] Anna Karpas, Marsha Finkelstein, and Samuel Reid. Which management strategy do parents prefer for their head-injured child: immediate computed tomography scan or observation? Pediatric emergency care, 29(1):30–35, 2013.

[72] Hyunjoong Kim, Wei-Yin Loh, Yu-Shan Shih, and Probal Chaudhuri. Visualizable and interpretable regression models with good prediction power. IIE Transactions, 39(6):565–579, 2007.

[73] Nathan Kuppermann, James F Holmes, Peter S Dayan, John D Hoyle, Shireen M Atabaki, Richard Holubkov, Frances M Nadel, David Monroe, Rachel M Stanley, Dominic A Borgialli, et al. Identification of children at very low risk of clinically-important brain injuries after head trauma: a prospective cohort study. The Lancet, 374(9696):1160–1170, 2009.

[74] Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American economic review, pages 604–620, 1986.

[75] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.

[76] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002. URL http://CRAN.R-project.org/doc/Rnews/.

[77] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics. uci.edu/ml.

[78] Dong-Joon Lim, Shabnam R Jahromi, Timothy R Anderson, and Anca-Alexandra Tudorie. Comparing technological advancement of hybrid electric vehicles (hev) in different market segments. Technological Forecasting and Social Change, 97:140–153, 2015.

[79] Ilya Lipkovich and Alex Dmitrienko. Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using sides. Journal of biopharmaceutical statistics, 24(1):130–153, 2014.

[80] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data. John Wiley & Sons, 2014.

[81] Wei-Yin Loh. Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, pages 361–386, 2002.

[82] Wei-Yin Loh and Yu-Shan Shih. Split selection methods for classification trees. Statistica sinica, pages 815–840, 1997.

[83] Asdrúbal López-Chau, Jair Cervantes, Lourdes López-García, and Farid García Lamont. Fisher’s decision tree. Expert Systems with Applications, 40(16):6283–6291, 2013.

[84] F Lorton, C Poullaouec, E Legallais, J Simon-Pimmel, M A Chêne, H Leroy, M Roy, E Launay, and C Gras-Le Guen. Validation of the PECARN clinical decision rule for children with minor head trauma: a French multicenter prospective study. Scandinavian journal of trauma, resuscitation and emergency medicine, 24(1):98, August 2016.

[85] Jennifer R Marin, Matthew D Weaver, Amber E Barnato, Jonathan G Yabes, Donald M Yealy, and Mark S Roberts. Variation in emergency department head computed tomography use for pediatric head trauma. Academic emergency medicine: official journal of the Society for Academic Emergency Medicine, 21(9):987–995, September 2014.

[86] Jennifer R Marin, Matthew D Weaver, Donald M Yealy, and Rebekah C Mannix. Trends in visits for traumatic brain injury to emergency departments in the United States. JAMA: the journal of the American Medical Association, 311(18):1917–1919, May 2014.

[87] Diana L Miglioretti, Eric Johnson, Andrew Williams, Robert T Greenlee, Sheila Weinmann, Leif I Solberg, Heather Spencer Feigelson, Douglas Roblin, Michael J Flynn, Nicholas Vanneman, and Rebecca Smith-Bindman. The use of computed tomography in pediatrics and the associated radiation exposure and estimated cancer risk. JAMA pediatrics, 167(8):700–707, August 2013.

[88] William R Mower, Jerome R Hoffman, Mel Herbert, Allan B Wolfson, Charles V Pollack Jr, Michael I Zucker, NEXUS II Investigators, et al. Developing a decision instrument to guide computed tomographic imaging of blunt head injury patients. Journal of Trauma and Acute Care Surgery, 59(4):954–959, 2005.

[89] Sreerama Murthy and Steven Salzberg. Lookahead and pathology in decision tree induction. In IJCAI, pages 1025–1033. Citeseer, 1995.

[90] Sreerama K Murthy and Steven Salzberg. Decision tree induction: How effective is the greedy heuristic? In KDD, pages 222–227, 1995.

[91] Sreerama K. Murthy, Simon Kasif, and Steven Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1–32, 1994.

[92] George L Nemhauser. Integer Programming: the Global Impact. Presented at EURO, INFORMS, Rome, Italy, 2013. http://euro-informs2013.org/data/http_/euro2013.org/wp-content/uploads/nemhauser.pdf, 2013. Accessed 9 September 2015.

223 [93] Mohammad Norouzi, Maxwell Collins, Matthew A Johnson, David J Fleet, and Pushmeet Kohli. Efficient non-greedy optimization of decision trees. In Advances in Neural Information Processing Systems, pages 1720–1728, 2015.

[94] Steven W Norton. Generating better decision trees. In IJCAI, volume 89, pages 800–805, 1989.

[95] Martin H Osmond, Terry P Klassen, George A Wells, Rhonda Correll, Anna Jarvis, Gary Joubert, Benoit Bailey, Laurel Chauvin-Kimoff, Martin Pusic, Don McConnell, et al. CATCH: a clinical decision rule for the use of computed tomography in children with minor head injury. Canadian Medical Association Journal, 182(4):341–348, 2010.

[96] Michael J Palchak, James F Holmes, Cheryl W Vance, Rebecca E Gelber, Bobbie A Schauer, Mathew J Harrison, Jason Willis-Shore, Sandra L Wootton-Gorges, Robert W Derlet, and Nathan Kuppermann. A decision rule for identifying children at low risk for brain injuries after blunt head trauma. Annals of emergency medicine, 42(4):492–506, 2003.

[97] A Pandor, S Goodacre, S Harnan, M Holmes, A Pickering, P Fitzgerald, A Rees, and M Stevenson. Diagnostic management strategies for adults and children with minor head injury: a systematic review and an economic evaluation. 2011.

[98] Mahesh KB Parmar, James Carpenter, and Matthew R Sydes. More multiarm randomised trials of superiority are needed. The Lancet, 384(9940):283, 2014.

[99] Harold J Payne and William S Meisel. An algorithm for constructing optimal binary decision trees. Computers, IEEE Transactions on, 100(9):905–916, 1977.

[100] Alastair Pickering, Susan Harnan, Patrick Fitzgerald, Abdullah Pandor, and Steve Goodacre. Clinical decision rules for children with minor head injury: a systematic review. Archives of Disease in Childhood, 96(5):414–421, May 2011.

[101] Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H. Shah, Trevor Hastie, and Robert Tibshirani. Some methods for heterogeneous treatment effect estimation in high dimensions. arXiv preprint arXiv:1707.00102v1, 2017. Working paper.

[102] Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.

[103] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[104] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. ISBN 1-55860-238-0.

[105] John R Quinlan et al. Learning with continuous classes. In 5th Australian joint conference on artificial intelligence, volume 92, pages 343–348. Singapore, 1992.

[106] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL http://www.R-project.org/.

[107] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, pages 41–55, 1983.

[108] Maura E Ryan, Susan Palasis, Gaurav Saigal, Adam D Singer, Boaz Karmazyn, Molly E Dempsey, Jonathan R Dillman, Christopher E Dory, Matthew Garber, Laura L Hayes, et al. ACR appropriateness criteria head trauma–child. Journal of the American College of Radiology, 11(10):939–947, 2014.

[109] Nguyen Hung Son. From optimal hyperplanes to optimal decision trees. Fundamenta Informaticae, 34(1, 2):145–174, 1998.

[110] Terry Therneau, Beth Atkinson, and Brian Ripley. rpart: Recursive Partitioning and Regression Trees, 2015. URL http://CRAN.R-project.org/package=rpart. R package version 4.1-9.

[111] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[112] Christos Tjortjis and John Keane. T3: a classification algorithm for data mining. In Lecture Notes in Computer Science, volume 2412, pages 50–55. Springer, 2002.

[113] Top500 Supercomputer Sites. Performance development. http://www.top500.org/statistics/perfdevel/, 2016. Accessed 17 December 2016.

[114] Luis Torgo. Regression data sets. http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html, 2017. Accessed 9 November 2017.

[115] Luís Fernando Raínho Alves Torgo. Inductive learning of tree-based regression models. PhD thesis, Universidade do Porto, 1999.

[116] Alfred Truong. Fast growing and interpretable oblique trees via logistic regression models. PhD thesis, University of Oxford, 2009.

[117] Panagiotis Tzirakis and Christos Tjortjis. T3C: improving a decision tree classification algorithm’s interval splits on continuous attributes. Advances in Data Analysis and Classification, pages 1–18, 2016.

[118] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (to appear), 2017.

[119] Daniel Westreich, Justin Lessler, and Michele Jonsson Funk. Propensity score estimation: machine learning and classification methods as alternatives to logistic regression. Journal of clinical epidemiology, 63(8):826, 2010.

[120] DC Wickramarachchi, BL Robertson, M Reale, CJ Price, and J Brown. HHCART: An oblique decision tree. Computational Statistics & Data Analysis, 96:12–23, 2016.

[121] Larry Winner. Miscellaneous data sets. http://www.stat.ufl.edu/~winner/datasets.html, 2017. Accessed 9 November 2017.

[122] Roger Zemek, S Duval, C Dematteo, B Solomon, M Keightley, M Osmond, et al. Guidelines for diagnosing and managing pediatric concussion. Ontario Neurotrauma Foundation, 2014.

[123] Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187, 2017.
