Diversity-promoting and Large-scale Learning for Healthcare

Pengtao Xie

CMU-ML-18-106
August 2018

Machine Learning Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA

Thesis Committee:
Eric P. Xing (Chair)
Ruslan Salakhutdinov
Pradeep Ravikumar
Ryan Adams (Princeton)
David Sontag (MIT)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2018 Pengtao Xie

This research was sponsored by: Office of Naval Research award number N000140910758; Air Force Office of Scientific Research award FA95501010247; National Science Foundation awards IIS1115313, IIS1111142, IIS1218282 and IIS1447676; Defense Advanced Research Projects Agency award FA872105C0003; Department of the Air Force award FA870215D0002; National Institutes of Health award P30DA035778; Pennsylvania Department of Health's Big Data for Better Health (BD4BH) award; and a grant from the University of Pittsburgh Medical Center.

Keywords: Diversity-promoting Learning, Large-scale Distributed Learning, Machine Learning for Healthcare, Regularization, Bayesian Priors, Generalization Error Analysis, System and Algorithm Co-design

Dedicated to my parents for their endless love, support, and encouragement

Abstract

In healthcare, a tsunami of medical data has emerged, including electronic health records, images, literature, etc. These data are heterogeneous and noisy, which renders clinical decision-making time-consuming, error-prone, and suboptimal. In this thesis, we develop machine learning (ML) models and systems for distilling high-value patterns from unstructured clinical data and making informed and real-time medical predictions and recommendations, to aid physicians in improving the efficiency of workflow and the quality of patient care. When developing these models, we encounter several challenges: (1) how to better capture infrequent clinical patterns, such as rare subtypes of diseases; (2) how to make the models generalize well on unseen patients; (3) how to promote the interpretability of the decisions; (4) how to improve the timeliness of decision-making without sacrificing its quality; and (5) how to efficiently discover massive clinical patterns from large-scale data?

To address challenges (1-4), we systematically study diversity-promoting learning, which encourages the components in ML models (1) to diversely spread out to give infrequent patterns a broader coverage, (2) to be imposed with structured constraints for better generalization performance, (3) to be mutually complementary for more compact representation of information, and (4) to be less redundant for better interpretability. The study is performed in both frequentist statistics and Bayesian statistics. In the former, we develop diversity-promoting regularizers that are empirically effective, theoretically analyzable, and computationally efficient, and propose a rich set of optimization algorithms to solve the regularized problems. In the latter, we propose Bayesian priors that can effectively entail an inductive bias of "diversity" among a finite or infinite number of components and develop efficient posterior inference algorithms. We provide theoretical analysis on why promoting diversity can better capture infrequent patterns and improve generalization. The developed regularizers and priors are demonstrated to be effective in a wide range of ML models.

To address challenge (5), we study large-scale learning. Specifically, we design efficient distributed ML systems by exploiting a system-algorithm co-design approach. Inspired by a sufficient factor property of many ML models, we design a peer-to-peer system – Orpheus – that significantly reduces communication and fault tolerance costs. We also provide theoretical analysis showing that algorithms executed on Orpheus are guaranteed to converge. The efficiency of our system is demonstrated in several large-scale applications.
We apply the proposed diversity-promoting learning (DPL) techniques and the distributed ML system to solve healthcare problems. In a similar-patient retrieval application, DPL shows great effectiveness in improving retrieval performance on infrequent diseases, enabling fast and accurate retrieval, and reducing overfitting. In a medical-topic discovery task, our Orpheus system is able to extract tens of thousands of topics from millions of documents in a few hours. Besides these two applications, we also design effective ML models for hierarchical multi-label tagging of medical images and automated ICD coding.

Acknowledgments

First and foremost, I would like to thank my PhD advisor Professor Eric Xing for his great advice, encouragement, support, and help throughout my journey at CMU. Eric's exceptional insights and broad vision have played an instrumental role in helping me to pick impactful problems. He has consistently put in his time and energy in guiding me through every challenge in research, from designing models to diagnosing errors in experiments, from writing papers to giving presentations. I am also grateful to him for allowing me the freedom to do research in a diverse range of fields that I am passionate about. His enthusiasm for research, his passion for life, and his work ethic have been a great source of inspiration and will continue to shape me in the future.

I would like to express my deepest gratitude to my other thesis committee members: Professors Ruslan Salakhutdinov, Pradeep Ravikumar, Ryan Adams, and David Sontag, for their invaluable feedback that has helped to improve many parts of this thesis. The insightful discussions with Russ and Pradeep inspired several ideas in this thesis. Ryan's early work on priors for diversity in generative latent variable models is the motivation for systematically studying diversity-promoting learning in this thesis. David's sharp insights and invaluable comments steered me in the right direction in the study of ML for healthcare.

I have been very fortunate to collaborate with many brilliant friends: Qirong Ho, Ruslan Salakhutdinov, Wei Wu, Barnabas Poczos, Aarti Singh, James Zou, Yaoliang Yu, Jun Zhu, Jianxin Li, Jin Kyu Kim, Hongbao Zhang, Baoyu Jing, Devendra Sachan, Yuntian Deng, Yi Zhou, Haoran Shi, Yichen Zhu, Hao Zhang, Shizhen Xu, Haoyi Zhou, Shuai Zhang, Yuan Xie, Yulong Pei, among many others. It has always been a great pleasure and a valuable learning experience to brainstorm ideas, design models and algorithms, debug code, and write papers with these talented people, and I am proud of the work we have done together over the last 5 years. I am also very thankful to the Sailing Lab and the Machine Learning Department for fostering a friendly and wonderful environment for research and life. I would like to thank Diane Stidle, Michelle Martin, Mallory Deptola, and the entire MLD staff for their warm help and kind support. I deeply appreciate many CMU faculty members for their wonderful courses and talks.

Lastly and most importantly, I am extremely grateful to my family. I could not have done this without the constant encouragement, unwavering support, and endless love from my parents.

Contents

1 Introduction ...... 1
1.1 Thesis Introduction and Scope ...... 1
1.2 Contributions ...... 5
1.2.1 Diversity-promoting Learning ...... 6
1.2.2 Large-scale Distributed Learning ...... 8
1.2.3 ML for Healthcare ...... 9
1.3 Related Works ...... 10
1.3.1 Diversity-promoting Learning ...... 10
1.3.2 Distributed Learning ...... 11
1.3.3 ML for Healthcare ...... 12

2 Diversity-promoting Learning I – Regularization ...... 14
2.1 Uncorrelation and Evenness: A Diversity-promoting Regularizer ...... 14
2.1.1 Uniform Eigenvalue Regularizer ...... 16
2.1.2 Case Studies ...... 19
2.1.3 A Projected Gradient Descent Algorithm ...... 20
2.1.4 Evaluation ...... 21
2.2 Convex Diversity-promoting Regularizers ...... 26
2.2.1 Nonconvex Bregman Matrix Divergence Regularizers ...... 27
2.2.2 Convex Bregman Matrix Divergence Regularizers ...... 28
2.2.3 A Proximal Algorithm ...... 31
2.2.4 Evaluation ...... 32
2.3 Angular Constraints for Improving Generalization Performance ...... 36
2.3.1 Angular Constraints ...... 37
2.3.2 An ADMM-based Algorithm ...... 38
2.3.3 Evaluation ...... 41
2.3.4 Appendix: Proofs and Details of Algorithms ...... 44
2.4 Diversity in the RKHS: Orthogonality-promoting Regularization of Kernel Methods ...... 52
2.4.1 Bregman Matrix Divergence Regularized Kernel Methods ...... 53
2.4.2 A Functional Gradient Descent Algorithm ...... 54
2.4.3 Evaluation ...... 57
2.4.4 Appendix: Details of Algorithms ...... 60
2.5 Diversity and Sparsity: Nonoverlapness-promoting Regularization ...... 62

2.5.1 Nonoverlap-promoting Regularization ...... 63
2.5.2 A Coordinate Descent Algorithm ...... 65
2.5.3 Evaluation ...... 66
2.5.4 Appendix: Details of Algorithms ...... 70

3 Diversity-promoting Learning II – Bayesian Inference ...... 84
3.1 Diversity-promoting Learning of Bayesian Parametric Models ...... 84
3.1.1 Mutual Angular Process ...... 85
3.1.2 Approximate Inference Algorithms ...... 87
3.1.3 Diversity-promoting Posterior Regularization ...... 89
3.1.4 Case Study: Bayesian Mixture of Experts Model ...... 90
3.1.5 Evaluation ...... 91
3.1.6 Appendix: Details of Algorithms ...... 94
3.2 Diversity-promoting Learning of Bayesian Nonparametric Models ...... 100
3.2.1 Infinite Mutual Angular Process ...... 101
3.2.2 Case Study: Infinite Latent Feature Model ...... 102
3.2.3 A Sampling-based Inference Algorithm ...... 102
3.2.4 Evaluation ...... 105

4 Diversity-promoting Learning III – Analysis ...... 110
4.1 Analysis of Better Capturing of Infrequent Patterns ...... 110
4.2 Analysis of Generalization Errors ...... 112
4.2.1 Generalization Error Analysis for the Angular Constraints ...... 112
4.2.2 Estimation Error Analysis for the Nonconvex Bregman Matrix Divergence Regularizers ...... 114
4.2.3 Estimation Error Analysis for the Convex Bregman Matrix Divergence Regularizers ...... 115
4.3 Appendix: Proofs ...... 116
4.3.1 Proof of Theorem 1 ...... 116
4.3.2 Proof of Theorem 2 ...... 123
4.3.3 Proof of Theorem 3 ...... 125
4.3.4 Proof of Theorem 4 ...... 131
4.3.5 Proof of Theorem 5 ...... 135

5 Large-scale Learning via System and Algorithm Co-design ...... 142
5.1 Sufficient Factor Property ...... 144
5.2 Orpheus: A Light-weight Peer-to-peer System ...... 146
5.2.1 Communication: Sufficient Factor Broadcasting (SFB) ...... 147
5.2.2 Fault Tolerance ...... 151
5.2.3 Programming Model ...... 152
5.2.4 Implementation ...... 155
5.2.5 Evaluation ...... 158
5.3 Poseidon: A Hybrid Architecture of SFB and Parameter Server ...... 165
5.3.1 Structure-aware Message Passing (SACP) Protocol ...... 165

5.3.2 Evaluation ...... 166
5.4 Convergence Analysis ...... 171
5.4.1 Appendix: Proof of Theorem 6 ...... 174

6 Applications in Healthcare ...... 177
6.1 Diversity-promoting Learning for Similar-patient Retrieval ...... 177
6.1.1 Experimental Settings ...... 178
6.1.2 Results ...... 181
6.2 Large-scale Distributed Learning for Medical Topic Discovery ...... 183
6.3 Hierarchical Multi-label Tagging of Medical Images ...... 184
6.3.1 Methods ...... 186
6.3.2 Evaluation ...... 192
6.3.3 Related Works ...... 196
6.4 A Neural Architecture for Automated ICD Coding ...... 197
6.4.1 Methods ...... 198
6.4.2 Evaluation ...... 202
6.4.3 Related Works ...... 206

7 Conclusions and Future Directions ...... 208
7.1 Contributions ...... 208
7.2 Conclusions ...... 209
7.3 Future Directions ...... 210

Bibliography 213

List of Figures

1.1 Different types of clinical data. ...... 2
1.2 Thesis goal: develop machine learning algorithms and systems to transform the raw clinical data into actionable insights. ...... 2
1.3 (Left) Frequencies of 8 antihypertensive medications. (Right) F1 scores on individual medications in a discharge medication prediction task. ...... 3
1.4 In this illustration, circles denote patterns and triangles denote the components. The size of a circle is proportional to the frequency of the corresponding pattern. (Left) Without diversification, ML models allocate a lot of components to cover the frequent patterns as best as possible. For the infrequent patterns, since they are weak signals, ML models tend to ignore them. (Right) With diversification, components that are originally cluttered around the frequent patterns are pushed apart. They diversely spread out. Some components that are originally cluttered around the frequent patterns are spared to cover the infrequent patterns. ...... 4
1.5 Thesis scope. To address the first four challenges, we study diversity-promoting learning, which encourages the components in ML models to be diverse. To address the fifth challenge, we study large-scale distributed learning by exploiting a system and algorithm co-design approach. Then we apply the methods developed in these two learning paradigms to solve problems in healthcare. ...... 5
1.6 Study scope of diversity-promoting learning. ...... 6

2.1 An overview of the studies of diversity-promoting regularization. ...... 15
2.2 Two views of the component matrix. ...... 16
2.3 When the principal directions (u1 and u2) are not aligned with the coordinate axes, the level of disparity between the eigenvalues (λ1 and λ2) indicates the correlation between random variables (components). ...... 16
2.4 When the principal directions (u1 and u2) are aligned with the coordinate axes, the magnitude of the eigenvalues represents the importance of components. ...... 17
2.5 Phone error rate on TIMIT with varying τ. ...... 43
2.6 Precision@10 versus the regularization parameter λ on MIMIC-III. ...... 59
2.7 (a) Under L1 regularization, the vectors are sparse, but their supports are overlapped; (b) under LDD regularization, the vectors are orthogonal, but their supports are overlapped; (c) under LDD-L1 regularization, the vectors are sparse and mutually orthogonal and their supports are not overlapped. ...... 63
2.8 Visualization of basis vectors. ...... 68
2.9 Overlap score versus the regularization parameter. ...... 70

3.1 A Bayesian network representation of the mutual angular process. ...... 86
3.2 Bayesian mixture of experts with mutual angular process. ...... 90
3.3 L2 error on the Yale test set versus the concentration parameter κ. ...... 109

4.1 Large and small circles denote frequent and infrequent classes respectively. The objective function of DML tends to make the inter-class distance large. (Left) The frequent classes are dominant in the training objective of DML, and hence are given higher priority for being pushed far away from each other. The infrequent classes are paid less attention to. Their distance ends up being small since DML is constrained by intra-class data pairs which need to have small distances. (Right) By promoting diversity, the infrequent classes are paid more attention to and their distance d2 is enlarged. As a result, the ratio between d1 and d2 decreases. ...... 111
5.1 Sufficient factor broadcasting. ...... 147
5.2 Random multicast. ...... 148
5.3 Cost of using SFB versus PS. K is the mini-batch size, J and D are the dimensions of W, and P is the number of workers. ...... 151
5.4 Expression graph. ...... 154
5.5 Software stack. ...... 155
5.6 Convergence curves in the MLR experiments. ...... 159
5.7 Scalability with more machines: (left) Orpheus-MLR; (right) Orpheus-LSTM. ...... 161
5.8 Breakdown of network waiting time (hours) and computation time (hours). ...... 162
5.9 How the system parameter Q in random multicast (left) and C in SF selection (right) affect the running time of Orpheus for MLR. ...... 163
5.10 (Left) Convergence time in Orpheus under deterministic, round-robin, and random broadcast for MLR and LSTM. (Right) Relative increase of convergence time when network connections fail. ...... 164
5.11 Illustration of three types of communication: (a) full-matrix communication via a centralized parameter server (PS); (b) sufficient factor broadcasting via a decentralized P2P scheme; (c) SF-based PS: from workers to the server, SFs are transferred; from the server to workers, full matrices are transmitted. ...... 166
5.12 Comparison of training different CNNs to convergence between Poseidon on 8 GPU nodes and Caffe on a single node: (a)-(b) cifar10 quick train test; (c)-(d) bvlc alexnet; (e)-(f) bvlc googlenet. Test errors are shown with respect to (left) training time and (right) training iterations. ...... 168
5.13 Speedups on throughput with different values of staleness, when training using Poseidon on 8 nodes, compared to Caffe on a single node: (a) AlexNet with batch size 256; (b) GoogLeNet with batch size 32. ...... 170
5.14 Speedups against single-machine Caffe in terms of throughput, under different numbers of GPU nodes: (a) AlexNet with a batch size of 256; (b) GoogLeNet with a batch size of 32. ...... 171

6.1 (a) Retrieval performance on frequent and infrequent diseases. (b) Retrieval AUCs on the training and test sets. ...... 178
6.2 (a) A medical image has multiple tags which are correlated clinically or biologically. (b) The medical concepts are organized into a hierarchical tree. (c) The abnormal regions (marked by red contours) are small. ...... 184
6.3 Our model consists of three major modules. Given the medical image, a contextual attention encoder is leveraged to learn a representation vector for this image. The tree-of-sequences LSTM encoder is used to learn a representation for each concept in the ontology. The image representation and concept representations are fed into the adversarial multi-label tagger to generate the tags for this image. ...... 186
6.4 Adversarial learning for capturing correlations among concepts. The discriminative network (DN) aims at distinguishing the labels produced by the predictive network (PN) from the groundtruth labels provided by physicians. The PN tries to make the two sets of labels indistinguishable. ...... 187
6.5 Tree-of-sequences LSTM. For each concept name, we build a sequential LSTM (SLSTM) to capture the semantics of words and the sequential structure among words. Over the concept hierarchy, we build a tree LSTM (TLSTM) to capture the hierarchical relationship among concepts. The encodings produced by the SLSTM are the inputs of the TLSTM. ...... 187
6.6 Contextual attention module. For each patch, an attention score is calculated by comparing the representation of this patch and that of the entire image. These patches are re-weighted using the corresponding attention scores to generate an attentional representation for this image. ...... 190
6.7 Attentional convolution. Each pixel is weighted by the attention score of the patch containing this pixel. Then the weighted pixels are convoluted with the filter weights. ...... 190
6.8 Architecture of the ICD coding model. ...... 199

List of Tables

2.1 Dataset statistics. ...... 22
2.2 Precision@10 (%) on three datasets. The Cars dataset has a single train/test split, hence the standard error is 0. UE-DML is our method. On the second panel (EUC, etc.) are well-established or state-of-the-art distance metric learning baselines. On the third panel (L2-DML, etc.) and the fourth panel (DCM-DML, etc.) are DML methods regularized by non-diversity regularizers and previously proposed diversity-promoting regularizers. ...... 23
2.3 Optimal number of components. ...... 24
2.4 Precision@10 (%) on frequent and infrequent diseases of the MIMIC-III dataset. ...... 25
2.5 Precision@10 (%) on three frequent and five infrequent diseases. The number next to a disease ID is its frequency. ...... 26
2.6 Average runtime (hours) of DML methods regularized by different diversity-promoting regularizers. ...... 26
2.7 Accuracy (%) on the two QA datasets. On the second panel (BIDAF, etc.), we compare different diversity-promoting regularizers. On the first panel (Kadlec et al., etc.) and the third panel (Dhingra et al., etc.) are other state-of-the-art baselines. ...... 27
2.8 Dataset statistics. ...... 32
2.9 Training time (hours) on seven datasets. On the second panel (DCM-NCDML, etc.) are NCDML methods regularized by previously proposed diversity-promoting regularizers. On the third panel (VND-NCDML, etc.) are NCDML methods regularized by our proposed nonconvex BMD regularizers. On the fourth panel (CSFN-CDML, etc.) are CDML methods regularized by our proposed convex BMD regularizers. ...... 34
2.10 On three imbalanced datasets – MIMIC, EICU, and Reuters – we show the mean AUC (averaged over 5 random train/test splits) on all classes (AUC-All), frequent classes (AUC-F), and infrequent classes (AUC-IF). On the second panel (EUC, etc.) are well-established or state-of-the-art baselines. On the third panel (ℓ2-CDML, etc.) are CDML methods regularized by non-diversity regularizers. On the fourth panel (DCM-NCDML, etc.) are NCDML methods regularized by previously proposed diversity-promoting regularizers. On the fifth panel (VND-NCDML, etc.) are NCDML methods regularized by our proposed nonconvex BMD regularizers. On the sixth panel (CSFN-CDML, etc.) are CDML methods regularized by our proposed convex BMD regularizers. ...... 73
2.11 AUC-All on 4 balanced datasets. ...... 74

2.12 The numbers of components that achieve the AUCs in Tables 6.1 and 2.11. ...... 74
2.13 The gap between training AUC and testing AUC. ...... 75
2.14 Classification accuracy (%) on three datasets. We compare the proposed ACs with existing diversity-promoting regularizers. ...... 75
2.15 Phone error rate (%) on the TIMIT test set. On the second panel (Kaldi, etc.), we compare AC with previously proposed diversity-promoting regularizers. On the first panel (Segmental NN, etc.) and the third panel (CTC, etc.) are other state-of-the-art baselines. ...... 76
2.16 Classification error (%) on the CIFAR-10 test set. ...... 77
2.17 Accuracy (%) on two QA datasets. ...... 78
2.18 Average runtime (hours). We compare AC with existing diversity-promoting regularizers. ...... 78
2.19 Dataset statistics. ...... 78
2.20 Retrieval precision@10 (%) on three datasets. On the first panel (EUC, etc.) are non-kernel DML methods. On the second panel (Tsang, etc.) are well-established or state-of-the-art kernel DML methods. On the third panel (KDML, etc.) are KDML methods regularized by existing regularizers. On the fourth panel (KDML-SFN-RTR, etc.) are KDML methods regularized by our proposed near-orthogonality regularizers. ...... 79
2.21 The number of RKHS functions achieving the precision@10 in Table 2.20. ...... 79
2.22 Retrieval precision@10 (%) on three frequent and five infrequent diseases in the MIMIC-III dataset. The number next to a disease ID is its frequency. Note that diseases D1–D3 are frequent diseases, while D4–D8 are infrequent ones. ...... 80
2.23 Convergence time (hours) on three datasets. ...... 80
2.24 Classification accuracy (%) on three datasets. ...... 80
2.25 Sensitivity and specificity for support recovery and error rate for prediction. ...... 80
2.26 Classification accuracy on the test sets of 20-News and RCV1, and the gap between training and test accuracy. ...... 81
2.27 Selected words of 9 exemplar basis vectors. ...... 81
2.28 Word-level perplexities on the PTB test set. On the second panel (PytorchLM, etc.), we compare LDD-L1 with the L1 regularizer and orthogonality-promoting regularizers. On the first panel (RNN, etc.) and the third panel (Pointer Sentinel LSTM, etc.) are other state-of-the-art baselines. ...... 82
2.29 Classification error (%) on the CIFAR-10 test set. ...... 83

3.1 Classification accuracy (%) on the Adult-9 dataset. ...... 92
3.2 Classification accuracy (%) on the SUN-Building dataset. ...... 92
3.3 Training time (hours) of different methods with K = 30. ...... 93
3.4 Accuracy on 9 subcategories of the CCAT category in the RCV1.Binary dataset. ...... 93
3.5 Classification accuracy (%) on two datasets. ...... 94
3.6 L2 test error. ...... 104
3.7 Test log-likelihood. ...... 104
3.8 Clustering accuracy (%). ...... 105
3.9 Normalized mutual information (%). ...... 105

3.10 Log-likelihood on the training and testing sets of Yale and AR. ...... 106
3.11 Number of latent features and L2 test errors. ...... 106
3.12 Number of features. ...... 107
3.13 Per-category precision@100 on the Reuters dataset. ...... 108
3.14 Visualization of features learned on the 20-News dataset. ...... 108
3.15 Runtime (hours) of RM-HMC and RW-MH. ...... 109

5.1 Different types of operators. ...... 154
5.2 The second and third columns show the convergence time (hours) of each system under a different number of machines. The fourth column shows the speedup of each system when the number of machines is increased from 12 to 34 (in MLR and TM) or from 12 to 40 (in LSTM). ...... 160
5.3 Convergence time (hours) of different system configurations. ...... 162
5.4 Convergence time (hours) under different configurations of fault tolerance and recovery. ...... 164
5.5 Speedups and converged top-1 accuracies of Poseidon when training models on 8 GPU nodes on the CIFAR-10 and ILSVRC-2012 datasets, compared to Caffe on a single machine. ...... 169
5.6 Comparison of image classification results on ImageNet-22K. ...... 170

6.1 On MIMIC and EICU, we show the mean AUC (averaged over 5 random train/test splits) on all diseases (AUC-All), frequent diseases (AUC-F), and infrequent diseases (AUC-IF). On the second panel (EUC, etc.) are well-established or state-of-the-art baselines. On the third panel (ℓ2-CDML, etc.) are CDML methods regularized by non-diversity regularizers. On the fourth panel (DCM-NCDML, etc.) are NCDML methods regularized by previously proposed diversity-promoting regularizers. On the fifth panel (VND-NCDML, etc.) are NCDML methods regularized by our proposed nonconvex BMD regularizers. On the sixth panel (CSFN-CDML, etc.) are CDML methods regularized by our proposed convex BMD regularizers. ...... 180
6.2 Area under ROC curve (AUC), number of projection vectors (NPV), and compactness score (CS, ×10−3). ...... 181
6.3 The gap between training AUC and testing AUC. ...... 182
6.4 Convergence time (hours) in Orpheus and baseline systems for the topic model. ...... 184
6.5 Average sensitivity and specificity. On the first, second, and third panels are baselines compared in the ablation studies of (1) adversarial learning for multi-label tagging, (2) the tree-of-sequences LSTM for capturing hierarchical relationships, and (3) contextual attention for identifying abnormalities. On the fourth panel are baselines for holistic comparison. ...... 193
6.6 Performance versus patch sizes and stride sizes. ...... 195
6.7 The diagnosis descriptions of a patient visit and the assigned ICD codes. Inside the parentheses are the descriptions of the codes. The codes are ranked according to descending importance. ...... 202

6.8 Weighted sensitivity and specificity on the test set. On the first panel are baselines for holistic comparison. On the second panel are baselines compared in the ablation study of the tree-of-sequences LSTM for capturing hierarchical relationships. On the third panel are baselines compared in the ablation study of adversarial learning for writing-style reconciliation, isotonic constraints for ranking, and attentional matching. ...... 204
6.9 Comparison of NDCG scores in the ablation study of isotonic constraints. ...... 205

Chapter 1

Introduction

1.1 Thesis Introduction and Scope

With the widespread adoption of electronic health record (EHR) systems, the rapid development of new technologies such as high-throughput medical imaging devices, low-cost genome profiling systems, networked and even wearable sensors, and mobile applications, and the rich accumulation of medical knowledge and discoveries in databases, a tsunami of medical and healthcare data has emerged. It was estimated that 153 exabytes (one exabyte equals one billion gigabytes) of healthcare data were produced in 2013 [431]. By 2020, an estimated 2,314 exabytes will be produced; from 2013 to 2020, this amounts to an overall rate of increase of at least 48 percent annually [431]. In addition to the sheer volume, the complexity of healthcare data is also overwhelming. As shown in Figure 1.1, it includes clinical notes, medical images, lab values, vital signs, etc., coming from multiple heterogeneous modalities including text, images, tabular data, time series, and graphs. This rich clinical data is becoming an increasingly important source of holistic and detailed information for both healthcare providers and receivers.

Collectively analyzing and digesting the rich information generated from multiple sources; uncovering the health implications, risk factors, and mechanisms underlying the heterogeneous and noisy data records at both the individual-patient and whole-population levels; and making clinical decisions, including diagnosis, triage, and treatment, thereupon are now routine activities expected to be conducted by medical professionals, including physicians, nurses, and pharmacists. As the amount and complexity of medical data rapidly grow, these activities are becoming increasingly difficult for human experts. The information overload makes medical analytics and decision-making time-consuming, error-prone, suboptimal, and less transparent. As a result, physicians, patients, and hospitals suffer a number of pain points, quality-wise and efficiency-wise. For example, in terms of quality, 250,000 Americans die each year from medical errors, which have become the third leading cause of death in the US [90]. Twelve million Americans are misdiagnosed each year [305]. Preventable medication errors impact more than 7 million patients and cost almost $21 billion annually [87]. Fifteen to 25 percent of patients are readmitted within 30 days, and readmissions are costly (e.g., $41.3 billion in 2011) [158]. In terms of efficiency, patients wait on average 6 hours in emergency rooms [12], and nearly 400,000 patients wait 24 hours or more. Physicians spend only 27 percent of their office day on direct clinical face time with patients [306].

Figure 1.1: Different types of clinical data.

Figure 1.2: Thesis goal: develop machine learning algorithms and systems to transform the raw clinical data into actionable insights.

The U.S. healthcare system wastes $750 billion annually due to unnecessary services, inefficient care delivery, excess administrative costs, etc. [308].

The advancement of machine learning (ML) technology opens up opportunities for next-generation computer-aided medical data analysis and data-driven clinical decision making, where machine learning algorithms and systems can be developed to automatically and collectively digest massive medical data such as electronic health records, images, behavioral data, and the genome, and to make data-driven and intelligent diagnostic predictions. An ML system can automatically analyze multiple sources of information with rich structure; uncover the medically meaningful hidden concepts from low-level records to aid medical professionals in easily and concisely understanding the medical data; and create a compact set of informative diagnostic procedures and treatment courses and make healthcare recommendations thereupon.

In this thesis, we aim at leveraging the power of machine learning to automatically distill insights from large-scale heterogeneous data for smart, data-driven medical predictions, recommendations, and decision-making, to assist physicians and hospitals in improving the quality and efficiency of healthcare (Figure 1.2). We develop machine learning algorithms and systems that turn the raw clinical data into actionable insights.

Figure 1.3: (Left) Frequencies of 8 antihypertensive medications. (Right) F1 scores on individual medications in a discharge medication prediction task.

Specifically, we focus on the following clinical applications: retrieving similar patients [376], discovering medical topics from large-scale texts, hierarchically tagging medical images, and automatically assigning ICD codes [378]. During the development of these algorithms, we identify several fundamental issues.

• How to better capture infrequent patterns? At the core of ML-based healthcare is discovering the latent patterns (e.g., topics in clinical notes, disease subtypes, phenotypes) underlying the observed clinical data. Under many circumstances, the frequency of patterns is highly imbalanced [376]: some patterns have very high frequency while others occur less frequently. For instance, Figure 1.3 (Left) shows the frequencies with which 8 antihypertensive medications are prescribed. Metoprolol and furosemide are used very frequently while the rest are used less often. Existing ML models lack the capability of capturing infrequent patterns. We applied a convolutional neural network (CNN) model to classify the aforementioned 8 antihypertensive medications [393]. Figure 1.3 (Right) shows the F1 scores (the higher, the better) on individual medications. As can be seen, while achieving high F1 scores on frequent medications, the CNN performs less well on the infrequent ones. Such a deficiency of existing models possibly results from the design of the objective functions used for training [371]. For example, a maximum likelihood estimator would reward itself by modeling the frequent patterns well, as they are the major contributors to the likelihood function; infrequent patterns contribute much less to the likelihood, so it is not very rewarding to model them well and they tend to be ignored. Figure 1.4 (Left) presents an illustration: since the dominant patterns, denoted by the two large circles, are the major contributors to the likelihood function, ML models allocate a number of triangles (components) to cover the large circles as best as possible, while the infrequent patterns, denoted by the small circles, tend to be ignored. Infrequent patterns are of crucial importance in clinical settings. For example, many infrequent diseases are life-threatening [363], and it is critical to capture them.

• How to alleviate overfitting? In certain clinical applications, the number of medical records available for training is limited. For example, when training a diagnostic model for an infrequent disease, we typically have no access to a sufficiently large number of patient cases due to the rarity of this disease. Under such circumstances, overfitting easily happens: the trained model works well on the training data but generalizes poorly on unseen patients. To alleviate overfitting, we need to incorporate prior beliefs about the model structure.

Figure 1.4: In this illustration, circles denote patterns and triangles denote the components. The size of a circle is proportional to the frequency of the corresponding pattern. (Left) Without diversification, ML models allocate a lot of components to cover the frequent patterns as best as possible. For the infrequent patterns, since they are weak signals, ML models tend to ignore them. (Right) With diversification, components that are originally cluttered around the frequent patterns are pushed apart. They diversely spread out. Some components that are originally cluttered around the frequent patterns are spared to cover the infrequent patterns.

• How to improve interpretability? Being interpretable and transparent is a must for an ML model to be willingly used by human physicians. Oftentimes, the patterns extracted by existing ML methods have a lot of redundancy and overlap [350], which makes them ambiguous and difficult to interpret. For example, in computational phenotyping from EHRs, it has been observed that the phenotypes learned by standard matrix and tensor factorization algorithms [350] have much overlap, causing confusion such as two similar treatment plans being learned for the same type of disease [350]. It is necessary to make the learned patterns distinct and interpretable.

• How to compress model size without sacrificing modeling power? In clinical practice, making a timely decision is crucial for improving patient outcomes. To achieve time efficiency, the size (specifically, the number of weight parameters) of ML models needs to be kept small. However, reducing the model size, which accordingly reduces the capacity and expressivity of the model, typically sacrifices modeling power and performance. It is technically appealing but challenging to compress model size without losing performance.

• How to efficiently learn large-scale models? In certain healthcare applications, both the model size and the data size are large, incurring substantial computation overhead that exceeds the capacity of a single machine. It is necessary to design and build distributed systems to efficiently train such models.

To solve the first four problems, we study diversity-promoting learning (DPL) [223, 367, 369, 371, 372, 373, 376, 379, 380, 381]. Many ML models are equipped with components, each aiming to capture a latent pattern and parameterized by a weight vector. For instance, in a topic model [49], the components are referred to as topics, aiming at discovering the semantics underlying documents; each topic is associated with a multinomial vector. In neural networks, the components are called hidden units; each unit, parameterized by a weight vector, is intended to capture a latent feature. DPL aims at encouraging the component vectors to be "diverse".

Figure 1.5: Thesis scope. To address the first four challenges, we study diversity-promoting learning, which encourages the components in ML models to be diverse. To address the fifth challenge, we study large-scale distributed learning by exploiting a system and algorithm co-design approach. We then apply the methods developed in these two learning paradigms to solve problems in healthcare.

First, regarding better capturing infrequent patterns: if the model components are biased to be far apart from each other, then one would expect such components to be less overlapping and less aggregated over frequent patterns; they diversely spread out to different regions of the pattern space and give the infrequent patterns a better chance to be captured [369], as illustrated in Figure 1.4 (Right). Second, regarding alleviating overfitting: promoting diversity imposes a structural constraint on model parameters, which reduces the model capacity and therefore improves generalization performance on unseen data [372]. Third, regarding interpretability: if components are encouraged to be distinct from each other and non-overlapping, then it is cognitively easy for a human to associate each component with an object or concept in the physical world [350]. Fourth, regarding performance-lossless model compression: "diversified" components bear less redundancy and are mutually complementary, making it possible to capture information sufficiently well with a small set of components [367]. To address the fifth problem, we design efficient distributed ML systems [370, 377, 384, 411] by exploiting a system-algorithm co-design approach: system design is tailored to the unique mathematical properties of ML algorithms, and algorithms can be re-designed to better exploit the system architecture. We apply the developed diversity-promoting learning techniques and distributed ML systems to healthcare applications. Figure 1.5 summarizes the scope of this thesis.

1.2 Contributions

Overall, the contributions of this thesis are made in three areas: diversity-promoting learning, large-scale distributed learning, and ML for healthcare.

Figure 1.6: Study scope of diversity-promoting learning.

1.2.1 Diversity-promoting Learning

This thesis represents the first systematic study of the diversity-promoting learning paradigm. In statistics, there are two paradigms: frequentist-style point estimation and Bayesian inference. Our study of diversity-promoting learning is tailored to these two paradigms. In frequentist-style point estimation, we develop empirically effective, theoretically analyzable, and computationally efficient regularization approaches to encourage model components to be diverse. In Bayesian inference, we design Bayesian priors that effectively entail an inductive bias of "diversity" among a finite or infinite number of components and develop efficient posterior inference algorithms. Finally, we provide theoretical analysis regarding the benefits of promoting diversity. Figure 1.6 summarizes the study scope of diversity-promoting learning.

Diversity-promoting Regularization

In diversity-promoting regularization, we made the following contributions.

• We propose to characterize "diversity" from two perspectives, uncorrelation and evenness, based on which we define a uniform eigenvalue regularizer (UER) [376]. Compared with previous diversity-promoting regularizers, the UER measures "diversity" in a global way, is insensitive to vector scaling, and is amenable to computation. We apply the UER to distance metric learning [383] and long short-term memory networks [163] and develop an efficient projected gradient descent algorithm. In various experiments, we demonstrate the effectiveness of the UER in better capturing infrequent patterns, reducing model size without sacrificing modeling power, and improving generalization performance. A toy sketch of the eigenvalue-evenness idea behind the UER appears after this list.

• Considering that the UER is nonconvex, which presents great challenges for optimization, we develop a family of convex diversity-promoting regularizers [379] based on the Bregman matrix divergence (BMD) [100], for which the global optimum is guaranteed to be achievable. We apply these regularizers to distance metric learning and develop an efficient proximal gradient algorithm [270]. In experiments, we demonstrate the advantages of the convex BMD (CBMD) regularizers over their nonconvex counterparts. First, because the global optimal solution is achievable, CBMD obtains better modeling performance. Second, unlike nonconvex

regularizers, which need multiple (random) restarts to reach a better local optimum, CBMD runs only once and hence is computationally more efficient.

• While the UER and convex BMD regularizers are empirically effective in alleviating overfitting, a theoretical analysis of their effectiveness is difficult to establish. In light of this, we propose a new regularization approach, angular constraints (ACs) [372], that is both empirically effective and theoretically analyzable. The analysis reveals that properly manipulating the ACs can achieve the lowest generalization errors. We develop an efficient algorithm based on ADMM [52] and demonstrate the empirical effectiveness of ACs in alleviating overfitting of deep neural networks and sparse coding [266]. A connection is also made between the ACs and the log-determinant divergence regularizer [379].

• We extend the study of diversity-promoting learning from finite-dimensional vectors to infinite-dimensional functions in the reproducing kernel Hilbert space (RKHS) [374]. We define Bregman matrix divergence regularizers on RKHS functions and solve the regularized problems using functional gradient descent [89]. On two case studies – kernel sparse coding [120] and kernel distance metric learning [335] – we demonstrate the merits of promoting diversity in the RKHS.

• We combine diversity-promoting regularization with sparsity-promoting regularization, which jointly bring about an effect of nonoverlapness [380], for the sake of selecting less-overlapped variables in ML problems that have multiple responses. We propose an LDD-L1 regularizer and derive efficient coordinate descent algorithms. Experiments on four ML models demonstrate the effectiveness of LDD-L1 in selecting less-overlapped variables and improving generalization performance.
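As promised above, the following toy sketch illustrates the eigenvalue-evenness intuition behind the UER. It is our own Python example, not the exact regularizer defined in Section 2.1.1, and `eigenvalue_evenness` is a hypothetical helper name: it measures how evenly the eigenvalues of the components' Gram matrix are distributed, which is the "uncorrelation and evenness" idea in a nutshell.

import numpy as np

def eigenvalue_evenness(A, eps=1e-12):
    # A: K x d matrix whose rows are the component vectors.
    # Returns the KL divergence from the uniform distribution to the
    # normalized eigenvalue spectrum of the Gram matrix A A^T.
    # Smaller values mean the eigenvalues are more even, i.e., the
    # components are closer to being uncorrelated and equally important.
    G = A @ A.T
    lam = np.maximum(np.linalg.eigvalsh(G), eps)
    p = lam / lam.sum()
    K = len(p)
    return float(np.sum((1.0 / K) * np.log((1.0 / K) / p)))

A_even = np.eye(4)                                        # orthonormal components
A_skew = np.vstack([np.eye(4)[:3], 0.99 * np.eye(4)[2]])  # two nearly identical components
print(eigenvalue_evenness(A_even))   # ~0: perfectly even spectrum
print(eigenvalue_evenness(A_skew))   # large: spectrum dominated by a few directions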

Diversity-promoting Bayesian Learning

In diversity-promoting Bayesian learning, we made the following contributions.

• We define a mutual angular process (MAP) [381], a Bayesian prior biased towards components that have large mutual angles. This prior facilitates the development of posterior inference algorithms based on variational inference [340], which is usually more efficient than sampling-based algorithms [127]. We apply this prior to a Bayesian mixture of experts model [354] and demonstrate its effectiveness and efficiency in experiments.

• To promote diversity in Bayesian nonparametric models [111], we extend the MAP to the infinite MAP (IMAP), which encourages infinitely many components to have large mutual angles. We apply the IMAP to an infinite latent feature model [143] and develop a posterior inference algorithm based on slice sampling [328] and Riemann manifold Hamiltonian Monte Carlo [129]. Experiments demonstrate the effectiveness of the IMAP.

Theoretical Analysis

We performed various analyses to formally understand the effectiveness of promoting diversity and made the following contributions.

• We analyze why the nonconvex Bregman matrix divergence (BMD) regularizers can better capture infrequent patterns [379]. In the context of distance metric learning, we define an

imbalance factor (the lower, the better) to characterize the performance on infrequent patterns. The analysis shows that decreasing the BMD regularizers reduces the upper bound of the imbalance factor and hence achieves better performance on infrequent patterns.

• We analyze how the angular constraints (ACs) affect the generalization error (the sum of the estimation error and the approximation error) of neural networks [372]. The analysis reveals that a stronger regularization reduces the estimation error and increases the approximation error. Properly tuning the regularization strength achieves the best tradeoff between these two types of errors and accordingly the optimal generalization performance on unseen data.

• We analyze how the nonconvex [374, 380] and convex [379] BMD regularizers affect the estimation error of distance metric learning. The analysis shows that decreasing these regularizers can effectively reduce the estimation error bound.

1.2.2 Large-scale Distributed Learning

The second part of this thesis studies large-scale learning. We design efficient distributed ML systems by exploiting a system-algorithm co-design approach. Specifically, inspired by a mathematical property – the sufficient factor (SF) property [370] – that is shared by many ML models parameterized by matrices, we design a peer-to-peer system [370, 377] that significantly reduces communication and fault tolerance costs (a toy illustration of the SF property follows the list below). We made the following contributions.

• For efficient communication, we propose (1) sufficient factor broadcasting (SFB) [370], which transfers small-sized vectors among machines for the synchronization of matrix-form parameters; (2) random multicast [377], where each machine randomly selects a subset of machines to communicate with in each clock; and (3) SF selection [377], which selects a subset of the most representative SFs to communicate. These techniques greatly reduce the number of network messages and the size of each message.

• For efficient fault tolerance, we propose to represent the parameter matrix using SFs and propose an incremental SF checkpoint (ISFC) scheme [377]. ISFC continuously saves the new SFs computed at each clock to stable storage, which greatly reduces disk I/O, avoids wasting compute cycles, and provides fine-grained (per-clock) rollbacks.

• We provide an easy-to-use programming interface [377] which can automatically identify the computation of SFs and the transformation from SFs to update matrices.

• We conduct convergence analysis of the SFB computation model [370]. The analysis shows that, though synchronized in a decentralized manner, the parameter replicas on different machines converge to the same optimum.

• We evaluate our system on three representative ML models and show that it achieves high efficiency and scales well with more machines.

• For ML models having both small-sized and large-sized update matrices, we propose a structure-aware message passing (SAMP) protocol [411], which is a hybrid communication approach between the centralized parameter server (PS) [226, 355] and decentralized SFB: the PS and SFB are utilized to transfer the small and large matrices respectively. The protocol leverages the best of these two communication schemes and significantly reduces the communication cost. We evaluate this protocol on convolutional neural networks.

On 8 GPU machines, SAMP improves the speedup from 3.9 to 5.7 on AlexNet [203] and from 3.8 to 4.8 on GoogLeNet.
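To make the sufficient factor (SF) property referenced above concrete, here is a minimal toy sketch (our own NumPy illustration, not Orpheus code) for the standard multinomial logistic regression setting: the per-example gradient of the weight matrix is an outer product of two vectors, so workers can exchange the two vectors instead of the full matrix.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Multinomial logistic regression with weight matrix W (J classes x D features).
# For one example (x, y), the gradient of the cross-entropy loss w.r.t. W is
# the outer product u v^T, where u = softmax(W x) - onehot(y) and v = x.
J, D = 1000, 2000
W = np.zeros((J, D))
x = np.random.randn(D)
y = 3

u = softmax(W @ x)
u[y] -= 1.0                    # first sufficient factor (length J)
v = x                          # second sufficient factor (length D)

# Sending u and v costs J + D numbers; sending the full update costs J * D.
full_update = np.outer(u, v)   # receivers can reconstruct the update locally
print(u.size + v.size, full_update.size)   # 3000 vs. 2000000

Sufficient factor broadcasting exploits exactly this gap: only the small vectors are transmitted, and each peer rebuilds the full update matrix locally.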

1.2.3 ML for Healthcare

In the third part of this thesis, we design ML models and apply the diversity-promoting and large-scale learning techniques developed in the first two parts to address critical problems in healthcare. We made the following contributions.

• We apply our diversity-promoting distance metric learning model [379] to similar-patient retrieval and demonstrate its effectiveness on two electronic health records datasets. Thanks to the ability of our method to better capture infrequent patterns, better retrieval performance is achieved on infrequent diseases. By promoting diversity, the dimension of the latent representations (and accordingly the time complexity of retrieval) can be reduced without sacrificing retrieval accuracy. This enables fast and accurate retrieval, which facilitates timely and sound clinical decision-making. Besides, our diversity-promoting regularizer effectively reduces overfitting.

• We implement a distributed topic model (TM) on our Orpheus system [377] and apply it to large-scale medical-topic discovery. Leveraging the sufficient factor (SF) property of the TM, Orpheus performs SF broadcasting and incremental SF checkpointing to significantly reduce communication and fault tolerance costs. Using 34 CPU machines, Orpheus-TM is able to learn 50K topics from 8.2M PubMed documents (vocabulary size 141K) in 5.4 hours, which is much faster than FlexiFaCT [207], a Hadoop-based model-parallel system (33.9 hours), and Bosen [355], a parameter-server-based data-parallel system (23.5 hours).

• To address the problem that medical images are difficult to index and search due to the lack of textual tags, we study multi-label hierarchical tagging of images. We propose an adversarial learning [134] strategy to capture the correlations among medical concepts, a tree-of-sequences LSTM model [378] to explore the hierarchical structure of the concept ontology, and a contextual attention model to localize abnormal regions. Experiments on a pathology dataset and a radiology dataset demonstrate the effectiveness of our methods.

• We study the automatic assignment of ICD codes based on physicians' free-form diagnosis descriptions, to reduce coding errors and costs [378]. A neural architecture is proposed, which consists of four ingredients: (1) tree-of-sequences LSTM encoding for simultaneously capturing the semantics and hierarchical relationship of codes, (2) adversarial learning for reconciling the different writing styles of diagnosis descriptions (DDs) and code descriptions (CDs), (3) isotonic constraints for incorporating the importance order among the assigned codes, and (4) attentional matching for performing many-to-one and one-to-many mappings from DDs to CDs. We demonstrate the effectiveness of the proposed methods on a clinical dataset with 59K patient visits.

1.3 Related Works

In this section, we present a general review of related works in diversity-promoting learning, large-scale distributed learning, and machine learning for healthcare. Additional related works are reviewed in individual sections.

1.3.1 Diversity-promoting Learning

Diversity-promoting regularization has been studied in ensemble learning [402], latent space modeling [365, 369, 429], and classification [242]. In ensemble learning, several studies [27, 209, 271, 402] explore how to select a diverse subset of base classifiers or regressors, with the aim of improving generalization performance. In latent space modeling, several works [29, 82, 429] encourage components to be mutually different for the sake of reducing redundancy and alleviating overfitting. In multi-way and hierarchical classification, several works [177, 242, 420] promote "diversity" among the coefficient vectors of classifiers, for the sake of ensuring model continuity and exploiting the semantic relationships embedded in the class hierarchy. In these works, various diversity-promoting regularizers have been proposed, based on the determinantal point process [204, 429], cosine similarity [29, 369, 402], and matrix covariance [82]. In the sequel, we present a brief review of them.

Determinantal point process (DPP)

DPP [204] was used by [246, 429] to encourage the topic vectors in latent Dirichlet allocation [49], the Gaussian mean vectors in a Gaussian mixture model, and the hidden units in a neural network to be "diverse". A DPP is defined over $K$ vectors: $p(\{\mathbf{a}_i\}_{i=1}^{K}) \propto \det(L)$, where $L$ is a $K \times K$ kernel matrix with $L_{ij} = k(\mathbf{a}_i, \mathbf{a}_j)$, $k(\cdot,\cdot)$ is a kernel function, and $\det(\cdot)$ denotes the determinant of a matrix. A configuration of $\{\mathbf{a}_i\}_{i=1}^{K}$ with larger probability is deemed to be more "diverse". The underlying intuition is that $\det(L)$ represents the volume of the parallelepiped formed by the vectors in the feature space associated with the kernel $k$: if the $K$ vectors are more mutually different, the volume is larger, which results in a larger $p(\{\mathbf{a}_i\}_{i=1}^{K})$. The shortcoming of the DPP is that it is sensitive to vector scaling: enlarging the magnitudes of the vectors results in a larger volume but does not essentially change the "diversity" of the vectors.
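As a toy numeric illustration of this scaling sensitivity (our own example with a simple linear kernel; not code from [204, 246, 429]), the sketch below computes $\det(L)$ for two orthogonal vectors and for the same vectors rescaled: the score changes even though the directions, and hence the "diversity", do not.

import numpy as np

def dpp_score(A, kernel=lambda x, y: float(x @ y)):
    # Unnormalized DPP diversity score det(L) for component vectors A (K x d),
    # with L[i, j] = k(a_i, a_j); a linear kernel is used for illustration.
    K = A.shape[0]
    L = np.array([[kernel(A[i], A[j]) for j in range(K)] for i in range(K)])
    return np.linalg.det(L)

A = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # two orthogonal unit vectors
print(dpp_score(A))                 # 1.0
print(dpp_score(3.0 * A))           # 81.0: rescaling alone inflates the score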

Pairwise cosine similarity

Several works [29, 369, 402] define diversity-promoting regularizers based on the pairwise cosine similarity among component vectors: smaller cosine similarity scores imply that the components are more different from each other, hence more "diverse". Cosine similarity is preferred over other distance or similarity measures because of its insensitivity to geometric transformations of vectors such as scaling, translation, and rotation. Given $K$ component vectors, the cosine similarity $s_{ij}$ between each pair of components $\mathbf{a}_i$ and $\mathbf{a}_j$ is computed as $s_{ij} = \mathbf{a}_i \cdot \mathbf{a}_j / (\|\mathbf{a}_i\|_2 \|\mathbf{a}_j\|_2)$. These scores are then aggregated to measure the overall "diversity" of all components. In [402], the scores are aggregated over all pairs $1 \le i < j \le K$. In [369], the aggregated score is defined as the mean of $\arccos(|s_{ij}|)$ minus the variance of $\arccos(|s_{ij}|)$; the variance term is utilized to encourage the vectors to evenly spread out in different directions. Xie et al. [365] define a related regularizer that is likewise aggregated over all pairs of components.
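The following minimal sketch (our own Python code; `mutual_angle_score` is a hypothetical helper name) implements the mean-minus-variance aggregation of pairwise angles described for [369] and shows that it is unchanged when the vectors are rescaled.

import numpy as np

def mutual_angle_score(A, eps=1e-12):
    # A: K x d matrix of component vectors.
    # Pairwise angles arccos(|s_ij|), aggregated as mean minus variance;
    # larger scores mean the components spread out more evenly.
    K = A.shape[0]
    angles = []
    for i in range(K):
        for j in range(i + 1, K):
            s = A[i] @ A[j] / (np.linalg.norm(A[i]) * np.linalg.norm(A[j]) + eps)
            angles.append(np.arccos(np.clip(abs(s), 0.0, 1.0)))
    angles = np.array(angles)
    return angles.mean() - angles.var()

A = np.random.randn(5, 10)
print(mutual_angle_score(A))         # scale-invariant:
print(mutual_angle_score(10.0 * A))  # essentially the same value as above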

These cosine-similarity-based regularizers have been applied to classifier ensembles, neural networks, and restricted Boltzmann machines [369]. However, such approaches can only capture second-order (pairwise) relations among vectors; representing higher-order relations, such as measuring the "diversity" of three vectors as a whole (without reducing it to pairwise dissimilarities), is beyond their capability.

Covariance

Two works [82, 242] define diversity-promoting regularizers based on matrix covariance. Malkin and Bilmes [242] compute the covariance matrix of the component vectors, $C = \frac{1}{K}\sum_{i=1}^{K}(\mathbf{a}_i - \boldsymbol{\mu})(\mathbf{a}_i - \boldsymbol{\mu})^{\top}$ where $\boldsymbol{\mu} = \frac{1}{K}\sum_{i=1}^{K}\mathbf{a}_i$, and then encourage diversity by maximizing the determinant of $C$. Similar to the DPP, this regularizer is sensitive to vector scaling. Cogswell et al. [82] design a regularizer to reduce the correlation among hidden units in neural networks. Given $N$ activation vectors $\{\mathbf{h}_i\}_{i=1}^{N}$ computed over $N$ data samples, where different dimensions of each vector correspond to different hidden units, they compute the sample covariance matrix of the hidden units, $C = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{h}_i - \boldsymbol{\mu})(\mathbf{h}_i - \boldsymbol{\mu})^{\top}$ where $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{h}_i$, and define the regularizer as $\|C\|_F^2 - \|\mathrm{diag}(C)\|_2^2$, which encourages the off-diagonal entries of $C$, i.e., the covariances between different hidden units, to be small. This regularizer is defined over intermediate variables (hidden activations in the neural network case) and influences the weight parameters only indirectly. Moreover, the number of intermediate variables can be much larger than the number of weight parameters (as is the case in convolutional neural networks), which renders this regularizer computationally inefficient.
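A minimal sketch of this covariance penalty (our own NumPy code mirroring the formula above; not the implementation of [82]):

import numpy as np

def decov_penalty(H):
    # H: N x K matrix of hidden activations (N samples, K hidden units).
    # Returns ||C||_F^2 - ||diag(C)||_2^2, i.e., the sum of squared
    # covariances between different hidden units.
    C = np.cov(H, rowvar=False, bias=True)   # K x K sample covariance (1/N normalization)
    return np.linalg.norm(C, 'fro') ** 2 - np.linalg.norm(np.diag(C)) ** 2

H = np.random.randn(1000, 8)
H[:, 1] = H[:, 0] + 0.01 * np.random.randn(1000)   # make two units nearly redundant
print(decov_penalty(H))   # noticeably larger than for independent units

Because the penalty is computed from the activations of the data samples rather than from the weights themselves, its cost grows with the number of hidden activations, which is the inefficiency noted above.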

1.3.2 Distributed Learning

Many distributed ML systems have been built recently, including (1) dataflow systems such as the Hadoop-based Mahout [93] and the Spark-based MLlib [407]; (2) graph computation frameworks such as Pregel [241] and GraphLab [132]; (3) parameter server (PS) architectures such as DistBelief [93], Project Adam [78], ParameterServer [226], Bosen PS [355], and GeePS [85]; and (4) hybrid systems such as TensorFlow [13] and MXNet [73]. These systems explore the tradeoffs among correctness of computation, ease of programmability, and efficiency of execution. Hadoop-Mahout and Spark-MLlib provide an easy-to-use MapReduce-style programming interface and strictly guarantee computational exactness using BSP. However, their speed of execution is typically slower than that of ML-specialized systems, partially because (1) BSP is sensitive to stragglers [162] and (2) parameter state is immutable, which does not support fast in-place updates. Frameworks based on graph abstractions [132, 241] support fine-grained scheduling of computation and flexible consistency models, which contribute to high efficiency. However, to run on these frameworks, an ML model needs to be abstracted as a graph, which is very difficult to achieve for matrix-parameterized models (MPMs). Parameter server architectures [78, 85, 94, 226, 355] and hybrid systems [13, 73] (which partially adopt the PS) offer a flexible and easy-to-use distributed shared-memory programming abstraction and achieve high efficiency by allowing mutable parameter states and in-place updates, fine-grained partitioning of computation, and flexible consistency models. However, when used to train MPMs, these systems communicate large matrices, which incurs high communication overhead.

Peer-to-peer (P2P) architectures have been investigated in distributed ML [222, 353, 370, 411]. Li et al. [222] propose to synchronize parameter replicas by exchanging parameter updates

Watcharapichat et al. [353] design a decentralized architecture that exchanges partial gradients of deep neural networks among machines, for the sake of saturating cluster resources. These two works transmit matrices over the network and do not leverage the SVs to reduce communication cost.

1.3.3 ML for Healthcare

With the prosperity of healthcare data such as electronic health records (EHRs), genomic data, and patient behavior data, and the growing need to extract knowledge and insights from these data, ML-based data-driven healthcare analytics has received much attention recently.

Predictive modeling Predictive modeling in data-driven healthcare is concerned with building machine learning models to predict diagnoses, prognoses, patient risk factors, readmission, disease onset, and so on. Wang et al. [344] studied disease prognosis by leveraging the information of clinically similar patient cohorts, which are retrieved using a local spline regression based similarity measure. Zhou et al. [422] proposed a top-k stability selection method to select the most informative features for patient risk prediction. Chen et al. [72] developed a cloud-based predictive modeling system for pediatric asthma readmission prediction. Choi et al. [79] developed a temporal model using recurrent neural networks to predict the diagnosis and medication categories for a subsequent visit. Razavian et al. [287] developed predictive models to predict the onset and assess the risk factors of type 2 diabetes. Razavian et al. [288] used LSTM [163] networks and convolutional networks for multi-task prediction of disease onset.

Natural language processing and understanding of clinical notes Clinical notes contain rich medical information. Many studies have developed natural language processing and machine learning methods to extract useful information from free-form clinical texts. Chen et al. [75] studied word sense disambiguation using support vector machines and active learning. Tang et al. [324] developed a temporal information extraction system that can identify events, temporal expressions, and their temporal relations in clinical texts. Tang et al. [323] studied clinical entity recognition in hospital discharge summaries using structural support vector machines. Gobbel et al. [130] designed a tool to assist in the annotation of medical texts, which leverages interactive training to reduce the annotation time without introducing bias. Tang et al. [325] performed a comparison study of three types of word representations, including clustering-based representations, distributional representations, and word embeddings, for biomedical named entity recognition. Halpern et al. [150] developed a bipartite probabilistic graphical model for joint prediction of clinical conditions from electronic medical records.

Computational phenotyping Computational phenotyping, which extracts high-level clinical concepts and patterns from EHRs, has been widely investigated. Chen et al. [75] integrated an uncertainty-sampling active learning approach with support vector machines for high-throughput phenotyping. Doan et al. [103] developed an information retrieval system that consists of text processing tools to standardize phenotype variables and information retrieval tools that support queries from users and return ranked results.

Ho et al. [161] developed a non-negative tensor factorization method to derive phenotype candidates with minimal human supervision. Wang et al. [350] proposed a constrained non-negative tensor factorization and completion method for phenotyping, which incorporates guidance constraints to align with existing medical knowledge and pairwise constraints for obtaining less-overlapped phenotypes. Halpern et al. [149] developed a phenotype library that uses both structured and unstructured data from EHRs to represent patients for real-time clinical decision support. Joshi et al. [187] proposed a non-negative matrix factorization method augmented with domain-specific constraints for automated phenotyping of multiple co-occurring medical conditions.

Patient similarity measure Effectively measuring the similarity between patients can help with a number of downstream applications such as diagnosis, prognosis, and treatment. Ebadollahi et al. [106] leveraged inter-patient similarity for retrieving patients who display similar trends in their physiological time-series data. Wang et al. [343] proposed a composite distance integration approach to combine the individual distance metrics learned from different physicians into a globally consistent metric. Sun et al. [315] proposed a locally supervised metric learning approach to learn a generalized Mahalanobis distance that is tailored toward physician feedback, and introduced an interactive metric learning method that can incrementally update an existing metric based on streaming feedback. Ng et al. [261] applied distance metric learning to find clinically similar patients and trained personalized predictive models on the retrieved patients.

Disease progression modeling Disease progression modeling is instrumental in early diagnosis and personalized care. Jackson et al. [173] developed a hidden Markov model for simultaneously estimating the transition rates and the probabilities of stage misclassification. Zhou et al. [421] proposed a fused group lasso approach for disease progression modeling with known biomarkers. Exarchos et al. [108] developed a dynamic Bayesian network to model the progression of coronary atherosclerosis. Wang et al. [348] proposed a probabilistic disease progression model that learns from discrete-time observations with non-equal intervals and discovers the full progression trajectory from a set of incomplete records.

Chapter 2

Diversity-promoting Learning I – Regularization

In this chapter, we study diversity-promoting learning in the context of frequentist statistics by developing regularization techniques. Figure 2.1 shows an overview of these studies. We begin by defining a uniform eigenvalue regularizer that simultaneously encourages uncorrelation and evenness among components [376]. This regularizer is nonconvex, which presents great challenges for optimization. In light of this, we develop convex Bregman matrix divergence regularizers for which the global optimum is achievable [379]. While the uniform eigenvalue regularizer and the convex Bregman matrix divergence regularizers are empirically effective, they are not theoretically amenable to the analysis of generalization errors, especially the approximation errors. Therefore, we propose a new regularization approach called angular constraints [372], which is not only empirically effective but also theoretically analyzable. These three regularizers promote diversity among finite-dimensional vectors. In the next work, we extend the study to infinite-dimensional functions in the reproducing kernel Hilbert space (RKHS) [373]. Finally, we combine diversity-promoting regularization with sparsity-promoting regularization [380], which jointly reduces the overlap among the supports of the component vectors.

2.1 Uncorrelation and Evenness: A Diversity-promoting Regularizer

We start with formally defining "diversity" [376]. Several diversity-promoting regularizers have been proposed, based upon the determinantal point process [204, 429], cosine similarity [29, 369, 402], and covariance [82, 242]. While these regularizers demonstrate notable efficacy, they have certain limitations, such as sensitivity to vector scaling [242, 429], inability to measure diversity in a global manner [29, 369, 402], and computational inefficiency [82]. To address these limitations, we propose a new diversity-promoting regularizer [376], gaining inspiration from principal component analysis [185], biological diversity [240], and information theory [83]. We characterize "diversity" by considering two factors: uncorrelation and evenness. Uncorrelation is a measure of how uncorrelated the components are: less correlation is equivalent to more diversity.

Figure 2.1: An overview of the studies of diversity-promoting regularization.

By encouraging the components to be uncorrelated, each of them can independently capture a unique pattern. Evenness is borrowed from biological diversity [240], which measures how equally important different species are in maintaining the ecological balance within an ecosystem. If no species dominates the others, the ecosystem is deemed more diverse. Likewise, in ML, we desire the components to play equally important roles, with no one dominating another, such that each component contributes significantly to the modeling of data. Specifically, we assign each component an "importance" score and encourage these scores to be even.

We study uncorrelation and evenness from a statistical perspective. The components are considered as random variables, and the eigenvalues of their covariance matrix can be leveraged to characterize these two factors. First, according to principal component analysis [185], the disparity of eigenvalues reflects the correlation among components: the more uniform the eigenvalues, the less correlated the components. Second, eigenvalues represent the variance along principal directions and can be used to measure the "importance" of components. Promoting uniform importance amounts to encouraging evenness among eigenvalues.

To promote uniformity among the eigenvalues, we encourage the discrete distribution parameterized by the normalized eigenvalues to have small Kullback-Leibler divergence with the uniform distribution, based on which we define a uniform eigenvalue regularizer (UER) [376]. We apply UER to two ML models – distance metric learning (DML) [383] and the long short-term memory (LSTM) network [163] – to encourage their components to be diverse, and develop an efficient projected gradient descent algorithm. Experiments on healthcare, image, and text data demonstrate that UER (1) greatly improves generalization performance; (2) better captures infrequent patterns; (3) reduces model size without sacrificing modeling power; (4) outperforms other diversity-promoting regularizers.


Figure 2.2: Two views of the component matrix: (a) component vectors; (b) components as random variables, with rows as samples and columns as dimensions.

Figure 2.3: When the principal directions (u1 and u2) are not aligned with the coordinate axis, the level of disparity between the eigenvalues (λ1 and λ2) indicates the correlation between random variables (components). (a) λ1 > λ2; (b) λ1 ≈ λ2.

2.1.1 Uniform Eigenvalue Regularizer

We start with characterizing the uncorrelation among components: treating the components as random variables and measuring their covariance, which is proportional to their correlation. Let $A \in \mathbb{R}^{d \times m}$ denote the component matrix whose $k$-th column is the parameter vector $a_k$ of component $k$. Alternatively, we can take a row view (Figure 2.2b) of $A$: each component is treated as a random variable and each row vector $\tilde{a}_i^\top$ can be seen as a sample drawn from the random vector formed by the $m$ components. Let $\mu = \frac{1}{d}\sum_{i=1}^{d} \tilde{a}_i = \frac{1}{d}A^\top \mathbf{1}$ be the sample mean, where the elements of $\mathbf{1} \in \mathbb{R}^d$ are all 1. We compute the empirical covariance matrix of the components as
$$G = \frac{1}{d}\sum_{i=1}^{d}(\tilde{a}_i - \mu)(\tilde{a}_i - \mu)^\top = \frac{1}{d}A^\top A - \Big(\frac{1}{d}A^\top\mathbf{1}\Big)\Big(\frac{1}{d}A^\top\mathbf{1}\Big)^\top. \qquad (2.1)$$
Imposing the constraint $A^\top\mathbf{1} = 0$, we have $G = \frac{1}{d}A^\top A$. Suppose $A$ is a full-rank matrix and $m < d$; then $G$ is a full-rank matrix with rank $m$.

Next, we show that the eigenvalues of $G$ play important roles in characterizing the uncorrelation and evenness of components. We start with uncorrelation. Let $G = \sum_{k=1}^{m}\lambda_k u_k u_k^\top$ be the eigen-decomposition, where $\lambda_k$ is an eigenvalue and $u_k$ is the associated eigenvector. As is well known in principal component analysis [185], an eigenvector $u_k$ of the covariance matrix $G$ represents a principal direction of the data points and the associated eigenvalue $\lambda_k$ tells the variability of the points along that direction. As shown in Figure 2.3a, the larger $\lambda_k$ is, the more spread out the points are along the direction $u_k$.


Figure 2.4: When the principal directions (u1 and u2) are aligned with the coordinate axis, the magnitude of the eigenvalues represents the importance of the components. (a) λ1 > λ2; (b) λ1 ≈ λ2.

When the eigenvectors (principal directions) are not aligned with the coordinate axis (as shown in Figure 2.3), the level of disparity among eigenvalues indicates the level of correlation among the $m$ components (random variables). The more different the eigenvalues are, the higher the correlation is. As shown in Figure 2.3a, $\lambda_1$ is about three times larger than $\lambda_2$ and there is a high correlation along the direction $u_1$. On the other hand, in Figure 2.3b, the two eigenvalues are close to each other and the points spread out evenly in both directions with negligible correlation. In light of this, we utilize the uniformity among the eigenvalues of $G$ to measure how uncorrelated the components are.

Secondly, we relate the eigenvalues to the other factor of diversity: evenness. When the eigenvectors are aligned with the coordinate axis (as shown in Figure 2.4), the components are uncorrelated. In this case, we bring in evenness to measure diversity. As stated earlier, we first need to assign each component an importance score. Since the eigenvectors are parallel to the coordinate axis, the eigenvalues reflect the variance of the components. Analogous to PCA, which posits that random variables with larger variance are more important, we use variance to measure importance. As shown in Figure 2.4a, component 1 has a larger eigenvalue $\lambda_1$ and accordingly larger variability, hence is more important than component 2. According to the evenness criterion, the components are more diverse if their importance scores match, which motivates us to encourage the eigenvalues to be uniform. As shown in Figure 2.4b, the two eigenvalues are close and the two components have roughly the same variability, hence are similarly important.

To sum up, we desire to encourage the eigenvalues to be even in both cases: (1) when the eigenvectors are not aligned with the coordinate axis, they are preferred to be even to reduce the correlation of components; (2) when the eigenvectors are aligned with the coordinate axis, they are encouraged to be even such that different components contribute equally in modeling the data.

Next, we discuss how to promote uniformity among the eigenvalues. The basic idea is: we normalize the eigenvalues onto a probability simplex and encourage the discrete distribution parameterized by the normalized eigenvalues to have small Kullback-Leibler (KL) [83] divergence with the uniform distribution. Given the eigenvalues $\{\lambda_k\}_{k=1}^{m}$, we first normalize them onto a probability simplex, $\hat{\lambda}_k = \frac{\lambda_k}{\sum_{j=1}^{m}\lambda_j}$, based on which we define a distribution on a discrete random variable $X \in \{1,\cdots,m\}$ with $p(X=k)=\hat{\lambda}_k$. In addition, to guarantee the eigenvalues are strictly positive, we require $A^\top A$ to be positive definite. To encourage $\{\hat{\lambda}_k\}_{k=1}^{m}$ to be uniform, we encourage the distribution $p(X)$ to be "close" to the uniform distribution $q(X=k)=\frac{1}{m}$, where the "closeness" is measured using the KL divergence $\mathrm{KL}(p\|q)$:

$$\sum_{k=1}^{m}\hat{\lambda}_k\log\frac{\hat{\lambda}_k}{1/m} = \frac{\sum_{k=1}^{m}\lambda_k\log\lambda_k}{\sum_{j=1}^{m}\lambda_j} - \log\sum_{j=1}^{m}\lambda_j + \log m. \qquad (2.2)$$
In this equation, $\sum_{k=1}^{m}\lambda_k\log\lambda_k$ is equivalent to $\mathrm{tr}((\frac{1}{d}A^\top A)\log(\frac{1}{d}A^\top A))$, where $\log(\cdot)$ denotes the matrix logarithm. To show this, note that $\log(\frac{1}{d}A^\top A) = \sum_{k=1}^{m}\log(\lambda_k)u_k u_k^\top$, according to the property of the matrix logarithm. Then $\mathrm{tr}((\frac{1}{d}A^\top A)\log(\frac{1}{d}A^\top A))$ equals $\mathrm{tr}\big((\sum_{k=1}^{m}\lambda_k u_k u_k^\top)(\sum_{k=1}^{m}\log(\lambda_k)u_k u_k^\top)\big)$, which equals $\sum_{k=1}^{m}\lambda_k\log\lambda_k$. According to the property of the matrix trace, we have $\mathrm{tr}(\frac{1}{d}A^\top A) = \sum_{k=1}^{m}\lambda_k$. Then the KL divergence can be turned into a diversity-promoting uniform eigenvalue regularizer (UER):
$$\frac{\mathrm{tr}\big((\frac{1}{d}A^\top A)\log(\frac{1}{d}A^\top A)\big)}{\mathrm{tr}(\frac{1}{d}A^\top A)} - \log\mathrm{tr}\Big(\frac{1}{d}A^\top A\Big), \qquad (2.3)$$
subject to $A^\top A \succ 0$ and $A^\top\mathbf{1} = 0$. In principle, the "closeness" between $p$ and $q$ can be measured by other distances such as the total variation distance, the Hellinger distance, etc. However, the resulting formula defined on the eigenvalues (like the one in Eq.(2.2)) is very difficult (if at all possible) to transform into a formula defined on $A$ (like the one in Eq.(2.3)). Consequently, it is very challenging to perform estimation of $A$. In light of this, we choose to use the KL divergence.

Compared with previous diversity-promoting regularizers, UER has the following benefits: (1) It measures the diversity of all components in a holistic way, rather than reducing it to pairwise dissimilarities as other regularizers [29, 369, 402] do. This enables UER to capture global relations among components. (2) Unlike determinant-based regularizers [242, 429] that are sensitive to vector scaling, UER is derived from normalized eigenvalues, where the normalization effectively removes scaling. (3) UER is amenable to computation. First, unlike the decorrelation regularizer [82], which is defined over data-dependent intermediate variables and thus incurs computational inefficiency, UER is directly defined on model parameters and is independent of the data. Second, unlike the regularizers proposed in [29, 369] that are non-smooth, UER is a smooth function. In general, smooth functions are more amenable to deriving optimization algorithms than non-smooth functions. The dominating computation in UER is the matrix logarithm, which does not substantially increase computational overhead as long as the number of components is not too large (e.g., less than 1000).

Compared with commonly used regularizers that are not designed for promoting diversity, such as L2, L1, and the trace norm [62], UER has the following limitations. First, it is computationally heavier: it involves an eigen-decomposition that incurs $O(m^3)$ time complexity, where $m$ is the number of components. Second, in modern ML most computation is conducted on GPUs; UER is not amenable to GPU implementation, which hinders its applicability. Third, it is a non-convex function, which renders the optimization of the regularized ML problems NP-hard. Fourth, it is less amenable to theoretical analysis than L2, L1, and the trace norm.

We apply UER to promote diversity in ML models. Let $L(A)$ denote the objective function of a model; then a UE-regularized problem can be defined as
$$\min_{A}\; L(A) + \lambda\left(\frac{\mathrm{tr}\big((\frac{1}{d}A^\top A)\log(\frac{1}{d}A^\top A)\big)}{\mathrm{tr}(\frac{1}{d}A^\top A)} - \log\mathrm{tr}\Big(\frac{1}{d}A^\top A\Big)\right) \quad \text{s.t.}\;\; A^\top\mathbf{1}=0,\;\; A^\top A \succ 0, \qquad (2.4)$$

where $\lambda$ is the regularization parameter. Similar to other diversity-promoting regularizers, UER is non-convex. Since $L(A)$ in most models is non-convex, adding UER does not substantially increase the difficulty of optimization.
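To make the computation in Eq.(2.3) concrete, the following is a minimal NumPy sketch (not the thesis implementation) that evaluates the UER from the eigenvalues of $\frac{1}{d}A^\top A$. It adds a small $\epsilon I$ perturbation for numerical stability, mirroring the perturbation introduced later in Section 2.1.3, and it assumes $A$ has already been row-centered so that $A^\top\mathbf{1}=0$.

```python
import numpy as np

def uer(A, eps=1e-6):
    """Uniform eigenvalue regularizer (Eq. 2.3), evaluated via the
    eigenvalues of (1/d) A^T A + eps*I. Assumes A is already centered."""
    d, m = A.shape                      # columns of A are the m components
    M = A.T @ A / d + eps * np.eye(m)   # (1/d) A^T A, perturbed to stay PD
    lam = np.linalg.eigvalsh(M)         # eigenvalues of the symmetric matrix
    s = lam.sum()                       # tr((1/d) A^T A + eps*I)
    return (lam * np.log(lam)).sum() / s - np.log(s)

# Toy check: orthonormal (diverse) components give a value close to -log(m),
# while rank-one (redundant) components give a much larger value.
rng = np.random.default_rng(0)
A_diverse = np.linalg.qr(rng.standard_normal((50, 5)))[0]
A_redundant = np.outer(rng.standard_normal(50), np.ones(5))
print(uer(A_diverse), uer(A_redundant))
```

The minimum value $-\log m$ is attained exactly when the eigenvalues are uniform, which is the behavior the regularizer rewards.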

Connection with the von Neumann entropy We make a connection between UER and the von Neumann entropy [39]. A matrix $M$ is referred to as a density matrix [39] if its eigenvalues are strictly positive and sum to one, equivalently, $M \succ 0$ and $\mathrm{tr}(M)=1$. The von Neumann entropy of $M$ is defined as $S(M) = -\mathrm{tr}(M\log M)$, which is essentially the Shannon entropy [83] of its eigenvalues. If the covariance matrix $G$ of the components is a density matrix, then we can use its von Neumann entropy to define a UER. To encourage the eigenvalues $\{\lambda_k\}_{k=1}^{m}$ of $G$ to be even, we directly encourage the KL divergence $\sum_{k=1}^{m}\lambda_k\log\frac{\lambda_k}{1/m} = \sum_{k=1}^{m}\lambda_k\log\lambda_k + \log m$ between the distribution parameterized by the eigenvalues (without normalization) and the uniform distribution to be small, which is equivalent to encouraging the Shannon entropy of the eigenvalues, $-\sum_{k=1}^{m}\lambda_k\log\lambda_k$, i.e., the von Neumann entropy of $G$, to be large. Then a new UER can be defined as the negative von Neumann entropy of $G$: $\mathrm{tr}((\frac{1}{d}A^\top A)\log(\frac{1}{d}A^\top A))$, subject to the constraints (1) $A^\top A \succ 0$; (2) $\mathrm{tr}(\frac{1}{d}A^\top A)=1$; (3) $A^\top\mathbf{1}=0$. This new UER is a special case of the one defined in Eq.(2.3), obtained by adding the constraint $\mathrm{tr}(\frac{1}{d}A^\top A)=1$.

Connection with the von Neumann divergence Next we make a connection between UER and the von Neumann divergence [206]. Given two positive definite matrices $X$ and $Y$, their von Neumann divergence is defined as $\mathrm{tr}(X\log X - X\log Y - X + Y)$, which measures the closeness between the two matrices. Given two vectors $x, y \in \mathbb{R}^m$, their generalized KL divergence can be defined as $\sum_{k=1}^{m} x_k\log(\frac{x_k}{y_k}) - (x_k - y_k)$, which measures the closeness between the two vectors. To encourage uniformity among the eigenvalues of the covariance matrix $G$, we can decrease the generalized KL divergence between these eigenvalues and an all-ones vector:

$$\sum_{k=1}^{m}\lambda_k\log\Big(\frac{\lambda_k}{1}\Big) - (\lambda_k - 1) = \mathrm{tr}\Big(\big(\tfrac{1}{d}A^\top A\big)\log\big(\tfrac{1}{d}A^\top A\big)\Big) - \mathrm{tr}\Big(\tfrac{1}{d}A^\top A\Big) + m, \qquad (2.5)$$
which is the von Neumann divergence between $G$ and an identity matrix. Hence, encouraging uniformity among the eigenvalues can be achieved by making $G$ close to an identity matrix under the von Neumann divergence.
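The identity in Eq.(2.5) is easy to verify numerically. The snippet below is a quick check (not from the thesis) using a random, centered component matrix; it only assumes the definitions of $G$ and the two divergences given above.

```python
import numpy as np

# Numerical check of Eq. (2.5): generalized KL between eigenvalues and the
# all-ones vector equals the von Neumann divergence between G and I.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
A -= A.mean(axis=0)                     # enforce A^T 1 = 0
G = A.T @ A / A.shape[0]                # covariance of the components
lam, U = np.linalg.eigh(G)

gen_kl = np.sum(lam * np.log(lam) - (lam - 1.0))
logG = (U * np.log(lam)) @ U.T          # matrix logarithm of G
von_neumann = np.trace(G @ logG - G + np.eye(4))
print(np.isclose(gen_kl, von_neumann))  # True up to floating-point error
```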

2.1.2 Case Studies

In this section, we apply the uniform eigenvalue regularizer to promote diversity in two ML models: distance metric learning (DML) [383] and long short-term memory (LSTM) [163] networks.

Distance metric learning Given data pairs labeled either as "similar" or as "dissimilar", DML [92, 146, 383] aims at learning a distance metric under which similar pairs would be placed close to each other and dissimilar pairs are separated apart. The learned distance can benefit a wide range of tasks, including retrieval, clustering, and classification. Following [357], we define the distance metric between $x, y \in \mathbb{R}^d$ as $\|A^\top x - A^\top y\|_2^2$, where $A \in \mathbb{R}^{d\times m}$ is a parameter matrix whose column vectors are called components.

Built upon the DML formulation in [367], a uniform-eigenvalue regularized DML (UE-DML) problem can be formulated as

$$\begin{aligned}\min_{A}\;& \sum_{(x,y)\in S}\|A^\top x - A^\top y\|_2^2 + \sum_{(x,y)\in D}\max\big(0,\, 1 - \|A^\top x - A^\top y\|_2^2\big) + \lambda\left(\frac{\mathrm{tr}\big((\frac{1}{d}A^\top A)\log(\frac{1}{d}A^\top A)\big)}{\mathrm{tr}(\frac{1}{d}A^\top A)} - \log\mathrm{tr}\Big(\frac{1}{d}A^\top A\Big)\right)\\ \text{s.t.}\;& A^\top\mathbf{1}=0,\;\; A^\top A \succ 0, \end{aligned} \qquad (2.6)$$
where $S$ and $D$ are the sets of similar and dissimilar pairs, respectively. The first and second terms in the objective function encourage similar pairs to have small distances and dissimilar pairs to have large distances, respectively. The learned metrics are applied for information retrieval.

Our UE-DML method is related to the matching networks (MNs) that Vinyals et al. [339] used for one-shot learning. First, in matching networks, distance metrics are utilized to calculate the similarity between data examples. Second, MNs are designed to achieve accurate classification performance on infrequent classes; likewise, UE-DML aims at better capturing infrequent patterns (e.g., classes). The difference is: MNs require the data examples to have class labels, while the classes in UE-DML are latent. Though in the experiments we use the ground-truth class labels to evaluate UE-DML's performance on infrequent classes, in practice these labels are oftentimes not accessible: UE-DML can only access data pairs labeled as being similar or dissimilar, rather than the class labels of individual examples.
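To illustrate how the pieces of Eq.(2.6) fit together, here is a small NumPy sketch that evaluates the UE-DML objective for a candidate $A$ (it is illustrative only, not the optimizer used in the thesis; `S_pairs`, `D_pairs`, and `lam` are assumed inputs).

```python
import numpy as np

def ue_dml_objective(A, S_pairs, D_pairs, lam, eps=1e-6):
    """Value of the UE-DML objective (Eq. 2.6) for a candidate A.
    S_pairs / D_pairs are lists of (x, y) feature-vector pairs."""
    def dist2(x, y):
        z = A.T @ (x - y)
        return float(z @ z)                       # ||A^T x - A^T y||_2^2

    sim_loss = sum(dist2(x, y) for x, y in S_pairs)           # pull similar pairs together
    dis_loss = sum(max(0.0, 1.0 - dist2(x, y)) for x, y in D_pairs)  # push dissimilar apart

    d, m = A.shape
    M = A.T @ A / d + eps * np.eye(m)
    ev = np.linalg.eigvalsh(M)
    uer = (ev * np.log(ev)).sum() / ev.sum() - np.log(ev.sum())
    return sim_loss + dis_loss + lam * uer
```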

Long short-term memory network The LSTM [163] network is a type of recurrent neural network that is better at capturing long-term dependencies in sequence modeling. At each time step $t$, where the input is $x_t$, there is an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$, and a hidden state $h_t$. The transition equations among them are

$$\begin{aligned} i_t &= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1} + b^{(i)})\\ f_t &= \sigma(W^{(f)}x_t + U^{(f)}h_{t-1} + b^{(f)})\\ o_t &= \sigma(W^{(o)}x_t + U^{(o)}h_{t-1} + b^{(o)})\\ c_t &= i_t \odot \tanh(W^{(c)}x_t + U^{(c)}h_{t-1} + b^{(c)}) + f_t \odot c_{t-1}\\ h_t &= o_t \odot \tanh(c_t), \end{aligned} \qquad (2.7)$$
where $W = \{W^{(s)} \mid s \in S = \{i,f,o,c\}\}$ and $U = \{U^{(s)} \mid s \in S\}$ are gate-specific weight matrices and $B = \{b^{(s)} \mid s \in S\}$ are bias vectors. The row vectors in $W$ and $U$ are treated as components. Let $L(W,U,B)$ denote the loss function of an LSTM network and $R(\cdot)$ denote the UER (including constraints); then a UE-regularized LSTM problem can be defined as

$$\min_{W,U,B}\; L(W,U,B) + \lambda\sum_{s\in S}\big(R(W^{(s)}) + R(U^{(s)})\big). \qquad (2.8)$$
The LSTM network is applied to cloze-style reading comprehension (CSRC) [154].
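A hedged PyTorch sketch of how the penalty in Eq.(2.8) can be attached to a standard LSTM layer is shown below. It assumes a single-layer `torch.nn.LSTM` (whose `weight_ih_l0` and `weight_hh_l0` stack the four gate blocks), treats each gate's rows as components as described above, and omits the centering constraint $A^\top\mathbf{1}=0$ for brevity; it is an illustration, not the configuration used in the thesis experiments.

```python
import torch

def uer_penalty(W, eps=1e-6):
    """Differentiable UER for one gate weight matrix. Rows of W are treated
    as components, so the component matrix A of Section 2.1.1 is W^T."""
    A = W.t()
    d, m = A.shape
    M = A.t() @ A / d + eps * torch.eye(m, device=A.device)
    lam = torch.linalg.eigvalsh(M)     # symmetric eigen-decomposition (autograd-friendly,
                                       # though gradients can be unstable if eigenvalues coincide)
    s = lam.sum()
    return (lam * lam.log()).sum() / s - s.log()

def ue_regularizer(lstm, lam):
    """Sum of UER penalties over the gate-specific W^(s) and U^(s) blocks (Eq. 2.8)."""
    reg = 0.0
    for W in (lstm.weight_ih_l0, lstm.weight_hh_l0):
        for gate_W in W.chunk(4, dim=0):   # input, forget, cell, output blocks (PyTorch order)
            reg = reg + uer_penalty(gate_W)
    return lam * reg

# Usage inside a training step (assumed names):
#   loss = task_loss + ue_regularizer(model.lstm, lam=1e-3)
#   loss.backward()
```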

2.1.3 A Projected Gradient Descent Algorithm

We develop a projected gradient descent (PGD) [53] algorithm to solve the UE-regularized problem in Eq.(2.4). The constraint $A^\top A \succ 0$ ensures that the eigenvalues of $A^\top A$ are positive, such that $\log(A^\top A)$ is well-defined.

However, it makes optimization very difficult. To address this issue, we add a small perturbation $\epsilon I$ to $A^\top A$, where $\epsilon$ is a close-to-zero positive scalar and $I$ is an identity matrix, to ensure $\log(A^\top A + \epsilon I)$ is always well-defined. Accordingly, the constraint $A^\top A \succ 0$ can be eliminated. The PGD algorithm iteratively performs three steps: (1) compute the (sub)gradient $\triangle A$ of the objective function; (2) update $A$ using gradient descent: $\tilde{A} \leftarrow A - \eta\triangle A$; (3) project $\tilde{A}$ onto the constraint set $\{A \mid A^\top\mathbf{1} = 0\}$.

In step (1), the derivative of $\mathrm{tr}((\frac{1}{d}A^\top A + \epsilon I)\log(\frac{1}{d}A^\top A + \epsilon I))$ is $\frac{2}{d}A\big(\log(\frac{1}{d}A^\top A + \epsilon I) + I\big)$. To compute the logarithm of $\frac{1}{d}A^\top A + \epsilon I$, we perform an eigen-decomposition of this matrix into $U\Lambda U^\top$, transform $\Lambda$ into another diagonal matrix $\tilde{\Lambda}$ where $\tilde{\Lambda}_{jj} = \log(\Lambda_{jj})$, and then compute $\log(\frac{1}{d}A^\top A + \epsilon I)$ as $U\tilde{\Lambda}U^\top$. The complexity of eigen-decomposing this $m$-by-$m$ matrix is $O(m^3)$. In our applications, $m$ is no more than 500, so $O(m^3)$ is not a big bottleneck. In addition, this matrix is symmetric, and the symmetry can be leveraged for fast eigen-decomposition. In our implementation, we use the MAGMA [7] library, which supports efficient eigen-decomposition of symmetric matrices on both CPUs and GPUs. In step (3), the projection operation amounts to solving the following problem:
$$\min_{A}\; \frac{1}{2}\|A - \tilde{A}\|_F^2 \quad \text{s.t.}\;\; A^\top\mathbf{1} = 0. \qquad (2.9)$$

According to the KKT conditions [53], we have $A - \tilde{A} + \mathbf{1}\lambda^\top = 0$ and $A^\top\mathbf{1} = 0$. Solving this system of equations, we get
$$A = \Big(I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top\Big)\tilde{A}, \qquad (2.10)$$
which centers the row vectors of $\tilde{A}$ so that they have zero mean.
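The following NumPy sketch ties the three steps together for one PGD iteration. It is illustrative only: for brevity the regularizer gradient shows just the $\mathrm{tr}(M\log M)$ piece discussed above (the quotient and $-\log\mathrm{tr}$ terms of the full UER are omitted), and the data-loss gradient `grad_loss` is assumed to be supplied by the model.

```python
import numpy as np

def pgd_step(A, grad_loss, lr=1e-3, lam=1e-3, eps=1e-6):
    """One projected-gradient-descent iteration of Section 2.1.3 (sketch)."""
    d, m = A.shape
    M = A.T @ A / d + eps * np.eye(m)

    # matrix logarithm via symmetric eigen-decomposition: U diag(log w) U^T
    w, U = np.linalg.eigh(M)
    logM = (U * np.log(w)) @ U.T

    # gradient of tr(M log M) with respect to A: (2/d) A (log M + I)
    grad_reg = (2.0 / d) * A @ (logM + np.eye(m))

    A_new = A - lr * (grad_loss + lam * grad_reg)   # gradient descent update
    A_new -= A_new.mean(axis=0, keepdims=True)      # projection of Eq. (2.10): A^T 1 = 0
    return A_new
```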

2.1.4 Evaluation

In this section, we present experimental results.

Datasets We used five datasets in the experiments: an electronic health record dataset, MIMIC-III [184]; two image datasets, Stanford-Cars [202] and Caltech-UCSD-Birds [358]; and two question answering (QA) datasets, CNN and DailyMail [154]. The first three were used for DML and the last two for LSTM. Their statistics are summarized in Table 2.1. MIMIC-III contains 58K hospital admissions of patients who stayed within the intensive care units at the Beth Israel Deaconess Medical Center between 2001 and 2012. Each admission has a primary diagnosis (a disease), which acts as the class label of this admission. There are 2833 unique diseases. We extracted 7207-dimensional features: (1) 2 dimensions from demographics, including age and gender; (2) 5300 dimensions from clinical notes, including 5000-dimensional bag-of-words (weighted using tf-idf) and 300-dimensional Word2Vec [256]; (3) 1905 dimensions from lab tests, where the zero-order, first-order, and second-order temporal features were extracted for each of the 635 lab items. In the extraction of bag-of-words features from clinical notes, we removed stop words, then counted the document frequency (DF) of the remaining words. We then selected the 5000 words with the largest DFs to form the dictionary. Based on this dictionary, we extracted tf-idf features.

            #Train   #Test   Dimension   #Class
MIMIC       40K      18K     7207        2833
Cars        8144     8041    4096        196
Birds       9000     2788    4096        200
CNN         380K     3198    –           –
DailyMail   879K     53K     –           –

Table 2.1: Dataset statistics.

In the extraction of Word2Vec features, we trained a 300-dimensional embedding vector for each word using an open-source tool1. To represent a clinical note, we averaged the embeddings of all words in the note. In lab tests, there are 635 test items in total. Each item is tested at different time points for each admission. For an item, we extracted three types of temporal features: (1) zero-order: averaging the values of this item measured at different time points; (2) first-order: taking the difference of values at every two consecutive time points $t$ and $t-1$, and averaging these differences; (3) second-order: for the sequence of first-order differences generated in (2), taking the difference (called the second-order difference) of values at every two consecutive time points $t$ and $t-1$, and averaging these second-order differences. If an item is missing for an admission, we set its zero-order, first-order, and second-order feature values to 0. For the two image datasets, we used the VGG16 [303] convolutional neural network trained on the ImageNet [95] dataset to extract features, which are the 4096-dimensional outputs of the second fully-connected layer. For MIMIC-III, Stanford-Cars, and Caltech-UCSD-Birds, the features were normalized using min-max normalization along each dimension. We used PCA to reduce the feature dimension to 1000. In the two QA datasets, each data example consists of a passage, a question, and an answer. The question is a cloze-style [154] task where an entity is replaced by a placeholder and the goal is to infer this missing entity (the answer) from all the possible entities appearing in the passage. For Stanford-Cars, CNN, and DailyMail, we used the single train/test split specified by the data providers; for the other two, five random splits were performed and the results were averaged over the five runs.
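The zero-, first-, and second-order temporal features described above reduce to a few lines of code. The sketch below is a minimal illustration (hypothetical helper, not the thesis pipeline) of how one lab item's measurement sequence for one admission maps to its three feature values.

```python
import numpy as np

def lab_item_features(values):
    """Zero-, first-, and second-order temporal features for one lab item.
    `values` is the time-ordered measurement sequence for one admission;
    a missing item maps to (0, 0, 0), as described in the text."""
    v = np.asarray(values, dtype=float)
    if v.size == 0:
        return 0.0, 0.0, 0.0
    zero = v.mean()                       # average value
    d1 = np.diff(v)                       # consecutive differences
    first = d1.mean() if d1.size else 0.0
    d2 = np.diff(d1)                      # differences of the first-order sequence
    second = d2.mean() if d2.size else 0.0
    return zero, first, second
```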

Experimental setup In the DML experiments, two data examples were labeled as similar if they belong to the same class and dissimilar otherwise. The learned distance metrics were applied for retrieval, whose performance was evaluated using precision@K: among the top K retrieved examples, what is the percentage of examples that share the same class label with the query example? We compared with two sets of regularizers: (1) diversity-promoting regularizers based on the determinant of the covariance matrix (DCM) [242], cosine similarity (CS) [402], the determinantal point process (DPP) [204, 429], InCoherence (IC) [29], mutual angles (MA) [369], and decorrelation (DC) [82]; (2) regularizers designed for other purposes, including the L2 norm for small norms, the L1 norm for sparsity, low-rankness [290], and dropout [282, 313]. All these regularizers were applied to the same DML formulation (Eq.(2.6) without the UER regularizer). In addition, we compared with the vanilla Euclidean distance (EUC) and other distance learning methods including information theoretic metric learning (ITML) [92], logistic discriminant metric learning (LDML) [146], and geometric mean metric learning (GMML) [405].

1 https://code.google.com/archive/p/word2vec/

                     MIMIC        Cars         Birds
DML                  72.5 ± 0.3   53.1 ± 0.0   55.9 ± 0.5
EUC                  58.3 ± 0.1   37.8 ± 0.0   43.2 ± 0.0
ITML [92]            69.3 ± 0.4   50.1 ± 0.0   52.9 ± 0.3
LDML [146]           70.9 ± 0.9   51.3 ± 0.0   52.1 ± 0.2
GMML [405]           71.2 ± 0.3   54.2 ± 0.0   53.7 ± 0.6
L2-DML               72.9 ± 0.1   53.4 ± 0.0   57.1 ± 0.4
L1-DML               72.6 ± 0.6   53.7 ± 0.0   56.4 ± 0.2
LowRank-DML [290]    72.5 ± 0.7   53.3 ± 0.0   56.1 ± 0.6
Dropout-DML [282]    73.1 ± 0.3   53.5 ± 0.0   56.6 ± 0.3
DCM-DML [242]        73.7 ± 0.4   57.1 ± 0.0   56.5 ± 0.4
CS-DML [402]         73.5 ± 0.5   55.7 ± 0.0   57.4 ± 0.2
DPP-DML [429]        74.2 ± 0.3   55.9 ± 0.0   56.9 ± 0.7
IC-DML [29]          74.3 ± 0.2   56.3 ± 0.0   57.8 ± 0.2
MA-DML [367]         73.6 ± 0.4   55.8 ± 0.0   58.2 ± 0.1
DC-DML [82]          72.6 ± 0.1   56.2 ± 0.0   56.2 ± 0.8
UE-DML               75.4 ± 0.3   58.2 ± 0.0   59.4 ± 0.2

Table 2.2: Precision@10 (%) on three datasets. The Cars dataset has a single train/test split, hence the standard error is 0. UE-DML is our method. On the second panel (EUC, etc.) are well-established or state-of-the-art distance metric learning baselines. On the third panel (L2-DML, etc.) and the fourth panel (DCM-DML, etc.) are DML methods regularized by non-diversity regularizers and by previously proposed diversity-promoting regularizers.

We used 5-fold cross-validation to tune the regularization parameter in $\{10^{-5}, 10^{-4}, \cdots, 10^{5}\}$ and the number of components in $\{50, 100, 200, \cdots, 500\}$. The best-tuned regularization parameters of UER are 0.001 for MIMIC and 0.01 for Cars and Birds. The best-tuned component numbers are 200 for MIMIC, 100 for Cars, and 200 for Birds. The learning rate of the PGD algorithm was set to 0.001. In the LSTM experiments, the network architecture and experimental settings followed those of the BIDirectional Attention Flow (BIDAF) [296] model, which consists of the following layers: character embedding, word embedding, contextual embedding, attention flow, modeling, and output. The contextual and modeling layers are based on LSTMs. In the character embedding layer, which is based on convolutional neural networks, 100 1D filters were used, each with a width of 5. The hidden state size was set to 100. AdaDelta [409] was used for optimization with a minibatch size of 48. Dropout [313] with probability 0.2 was used for all LSTM layers. The model was trained for 8 epochs, with an early stop when the validation accuracy started to drop. We compared UER with other diversity-promoting regularizers including DCM, CS, DPP, IC, MA, and DC.

Results Table 2.2 shows the retrieval precision (K = 10) on the three datasets, where we observe: (1) UE-DML achieves much better precision than DML, showing that UER is an effective regularizer for improving generalization performance; (2) UER outperforms other diversity-promoting regularizers, possibly due to its capability of capturing global relations among all components and its insensitivity to vector scaling; (3) diversity-promoting regularizers perform better than other types of regularizers such as L2, L1, low rank, and dropout, demonstrating the efficacy of inducing diversity; (4) UE-DML outperforms other popular distance learning methods such as ITML, LDML, and GMML.

                     MIMIC   Cars   Birds   Average
DML                  300     300    500     367
L2-DML               300     300    500     367
L1-DML               300     300    500     367
LowRank-DML [290]    400     300    400     367
Dropout-DML [282]    300     300    400     333
DCM-DML [242]        200     400    400     333
CS-DML [402]         300     100    300     233
DPP-DML [429]        200     300    300     267
IC-DML [29]          400     300    200     300
MA-DML [367]         300     200    300     267
DC-DML [82]          300     400    300     333
UE-DML               200     100    200     167

Table 2.3: Optimal number of components.

Table 2.3 shows the number of components that achieves the precision in Table 2.2. Compared with DML, UE-DML uses much fewer components to achieve better precision. For example, on the Cars dataset, UE-DML achieves a 58.2% precision with 100 components. In contrast, with more components (300), DML achieves a much lower precision (53.1%). This demonstrates that by encouraging the components to be diverse, UER is able to reduce model size without sacrificing modeling power. UER encourages equal "importance" among components such that each component plays a significant role in modeling data. As a result, it suffices to use a small number of components to achieve larger modeling power. Compared with other diversity-promoting regularizers, UER achieves better precision with fewer components, demonstrating its ability to better promote diversity.

Next, we verify whether "diversifying" the components in DML can better capture infrequent patterns. In the MIMIC-III dataset, we consider diseases as patterns and consider a disease as "frequent" if more than 1000 hospital admissions are diagnosed with this disease and "infrequent" otherwise. Table 2.4 shows the retrieval precision on frequent diseases and infrequent diseases. As can be seen, compared with the baselines, UE-DML achieves more improvement on infrequent diseases than on frequent diseases. This indicates that by encouraging the components to diversely spread out, UER is able to better capture infrequent patterns (diseases in this case) without compromising the performance on frequent patterns. On infrequent diseases, UE-DML outperforms other diversity-promoting methods, showing the advantage of UER over other diversity-promoting regularizers. To further verify this, we select the 3 most frequent diseases (hypertension, AFib, CAD) and randomly select 5 infrequent ones (helicobacter pylori, acute cholecystitis, joint pain-shlder, dysarthria, pressure ulcer), and show the precision@10 on each individual disease in Table 2.5. As can be seen, on the five infrequent diseases UE-DML achieves higher precision than the baselines, while on the three frequent diseases UE-DML achieves comparable precision.

                     Frequent     Infrequent
DML                  77.6 ± 0.2   64.2 ± 0.3
EUC                  58.7 ± 0.1   57.6 ± 0.2
ITML [92]            74.2 ± 0.6   61.3 ± 0.3
LDML [146]           76.1 ± 0.8   62.3 ± 0.9
GMML [405]           75.9 ± 0.1   63.5 ± 0.4
L2-DML               77.5 ± 0.3   65.4 ± 0.1
L1-DML               77.4 ± 0.5   64.8 ± 0.8
LowRank-DML [290]    77.7 ± 0.5   64.0 ± 0.8
Dropout-DML [282]    78.1 ± 0.2   64.9 ± 0.4
DCM-DML [242]        77.9 ± 0.4   66.8 ± 0.2
CS-DML [402]         78.0 ± 0.5   66.2 ± 0.7
DPP-DML [429]        77.3 ± 0.2   69.1 ± 0.5
IC-DML [29]          78.5 ± 0.3   67.4 ± 0.2
MA-DML [367]         76.8 ± 0.2   68.4 ± 0.4
DC-DML [82]          77.1 ± 0.1   65.3 ± 0.1
UE-DML               78.3 ± 0.3   70.7 ± 0.4

Table 2.4: Precision@10 (%) on frequent and infrequent diseases of the MIMIC-III dataset.

We empirically verified whether UER can promote uncorrelation and evenness. Given m component vectors, we computed the empirical correlation (cosine similarity) of every two vectors, then averaged these pairwise correlation scores to measure the overall correlation of the m vectors. We performed the study by learning distance metrics with 200 components on the MIMIC-III dataset. The average correlation under unregularized DML and UE-DML is 0.73 and 0.57, respectively. This shows that UER can reduce correlation. To measure evenness, we first measured the "importance" of components. For each component with a parameter vector $a$, we projected the training examples $\{x_i\}_{i=1}^{N}$ onto $a$, obtaining $\{x_i^\top a\}_{i=1}^{N}$, then used the variance of $\{x_i^\top a\}_{i=1}^{N}$ to measure the importance of this component. After that, we mapped these importance scores into a probability simplex using softmax. Finally, the evenness was measured by the KL divergence between the discrete distribution parameterized by these probabilities and a uniform distribution. A smaller KL divergence indicates larger evenness. On MIMIC-III with 200 components, the KL divergence under unregularized DML and UE-DML is 3.54 and 2.92, respectively. This suggests that our regularizer is able to encourage evenness.

Table 2.6 shows the runtime taken by the DML methods to reach convergence. Compared with unregularized DML, UE-DML does not increase the training time substantially. The relative increase is 11.2% on MIMIC, 15.4% on Cars, and 13.9% on Birds. The runtime of UE-DML is close to that of DML regularized by other diversity-promoting regularizers.

In the LSTM experiments, Table 2.7 shows state-of-the-art accuracy on the two QA datasets. Compared with the original BIDAF [296], our method UE-BIDAF achieves better accuracy, further demonstrating UER's ability to improve generalization performance. Besides, UER outperforms other regularizers.

                     1 (3566)   2 (3498)   3 (2757)   4 (204)    5 (176)    6 (148)    7 (131)    8 (121)
DML                  80.2±0.5   79.4±0.8   80.9±0.7   5.6±1.6    6.1±0.8    4.9±0.8    4.3±1.7    5.2±0.6
EUC                  66.4±1.2   69.6±1.2   61.8±0.4   7.2±1.0    6.9±0.6    3.1±1.4    6.8±1.2    2.4±0.7
ITML [92]            75.4±1.0   76.3±0.9   79.3±0.9   5.3±1.2    3.7±1.3    6.0±0.9    3.3±0.9    5.0±1.0
LDML [146]           77.0±0.8   75.7±1.0   78.1±0.6   3.2±1.3    5.6±1.8    3.8±1.1    6.7±1.4    5.0±1.3
GMML [405]           76.3±0.4   78.7±0.8   79.8±0.8   3.9±1.8    5.9±1.5    11.9±1.4   6.1±0.6    6.3±1.5
L2-DML               81.0±0.5   79.3±0.8   77.6±0.4   4.4±1.1    5.6±0.9    3.7±0.9    4.9±1.2    6.2±1.1
L1-DML               78.2±1.0   79.9±1.1   80.8±1.2   6.3±1.9    4.8±1.1    9.5±1.0    7.7±1.0    5.6±1.7
LowRank-DML [290]    79.6±0.4   79.7±0.5   75.4±0.5   3.1±1.0    9.2±1.2    5.5±1.4    4.8±0.6    4.5±1.5
Dropout-DML [282]    81.6±0.5   78.7±0.7   80.7±0.5   3.2±1.5    4.2±1.9    6.1±0.9    4.2±0.8    6.2±1.9
DCM-DML [242]        77.9±1.0   77.3±0.9   80.3±1.2   7.1±0.8    8.9±0.9    9.7±1.4    11.9±0.7   9.0±1.6
CS-DML [402]         80.0±0.5   80.3±0.7   80.8±0.6   9.4±1.3    4.8±1.7    8.9±1.9    9.7±0.7    9.0±1.0
DPP-DML [429]        79.8±0.8   77.6±0.2   77.4±0.7   10.1±1.1   10.3±0.8   8.8±1.7    11.7±1.2   8.4±1.3
IC-DML [29]          78.8±1.3   79.2±1.1   77.0±0.8   11.8±0.6   9.2±1.4    5.7±1.6    8.7±1.4    9.6±0.7
MA-DML [367]         77.3±1.1   80.1±1.0   81.0±0.7   11.5±1.1   9.9±1.1    4.9±1.1    7.6±1.2    10.4±1.4
DC-DML [82]          80.7±0.5   78.8±0.7   80.5±1.1   10.5±0.8   11.4±1.2   9.2±0.7    9.8±1.2    10.4±1.2
UE-DML               81.4±0.9   82.4±0.8   80.5±0.4   14.3±0.9   11.2±1.3   10.7±1.8   15.8±1.4   13.2±0.7

Table 2.5: Precision@10 (%) on three frequent and five infrequent diseases. The number next to a disease ID is its frequency.

                 MIMIC   Cars   Birds
DML              20.5    9.1    10.1
DCM-DML [242]    22.3    10.9   11.7
CS-DML [402]     20.9    9.7    10.5
DPP-DML [429]    22.6    10.6   11.2
IC-DML [29]      21.1    9.7    10.5
MA-DML [367]     21.3    9.4    10.6
DC-DML [82]      21.7    10.1   10.8
UE-DML           22.8    10.5   11.5

Table 2.6: Average runtime (hours) of DML methods regularized by different diversity-promoting regularizers.

2.2 Convex Diversity-promoting Regularizers

The UE regularizer is nonconvex and difficult to convexify. As a result, the UE-regularized ML problems are nonconvex, where achieving the global optimum is NP-hard. To obtain a better local optimum, the algorithm needs to be rerun multiple times, each with a different random initialization of the model, and the best solution is selected among them, which incurs substantial computational overhead. Even so, the obtained optimum is a local (and therefore inferior) one that might be far from the global optimum.

In this section, we design new diversity-promoting regularizers that make convex relaxation easy [379]. Similar to [74, 122, 342, 347, 367], we use near-orthogonality to represent "diversity": projection vectors are more "diverse" if they are closer to orthogonality. Near-orthogonality is promoted by encouraging the Gram matrix of the vectors to be close to an identity matrix, where the "closeness" is measured using the Bregman matrix divergence (BMD) [206]. As a result, a family of nonconvex BMD regularizers is defined. We then perform convex relaxations of them by exploring the properties of eigenvalues. We apply the convex BMD regularizers to distance metric learning [383] and develop efficient proximal gradient descent [270] algorithms.

                          CNN             DailyMail
                          Dev     Test    Dev     Test
Kadlec et al. [188]       68.6    69.5    75.0    73.9
Kobayashi et al. [197]    71.3    72.9    –       –
Sordoni et al. [309]      72.6    73.3    –       –
Trischler et al. [334]    73.4    74.0    –       –
Chen et al. [68]          73.8    73.6    77.6    76.6
Dhingra et al. [101]      73.0    73.8    76.7    75.7
Cui et al. [86]           73.1    74.4    –       –
Shen et al. [301]         72.9    74.7    77.6    76.6
BIDAF [296]               76.31   76.94   80.33   79.63
DCM-BIDAF [242]           76.36   76.98   80.51   79.68
CS-BIDAF [402]            76.43   77.10   80.37   79.71
DPP-BIDAF [429]           76.32   77.04   80.45   79.77
IC-BIDAF [29]             76.41   77.21   80.49   79.83
MA-BIDAF [369]            76.49   77.09   80.42   79.74
DC-BIDAF [82]             76.35   77.15   80.38   79.67
UE-BIDAF                  76.58   77.27   80.63   79.86
Dhingra et al. [101]      77.9    77.9    81.5    80.9
Dhingra et al. [102]      79.2    78.6    –       –

Table 2.7: Accuracy (%) on the two QA datasets. On the second panel (BIDAF, etc.), we compare different diversity-promoting regularizers. On the first panel (Kadlec et al., etc.) and the third panel (Dhingra et al., etc.) are other state-of-the-art baselines.

The algorithms only run once with a single initialization, hence are more efficient than existing nonconvex methods. Since the global optimum can be achieved, our methods can potentially yield more effective diversity-biased distance metrics. We apply the learned distance metrics to information retrieval on healthcare, text, image, and sensory data. Compared with nonconvex baselines, our methods achieve: (1) higher computational efficiency; (2) better retrieval performance with fewer projection vectors; (3) better performance on infrequent classes.

2.2.1 Nonconvex Bregman Matrix Divergence Regularizers

We begin with defining nonconvex regularizers based on the Bregman matrix divergence, then discuss how to convexify them. We first introduce another measure of "diversity" – near-orthogonality [379]: the component vectors are deemed more diverse if they are closer to being orthogonal. To encourage near-orthogonality between two vectors $a_i$ and $a_j$, one way is to make their inner product $a_i^\top a_j$ close to zero and their $\ell_2$ norms $\|a_i\|_2, \|a_j\|_2$ close to one. For $m$ vectors $A \in \mathbb{R}^{m\times d}$ (whose rows are the vectors), near-orthogonality can be achieved in the following manner: compute the Gram matrix $G$ where $G_{ij} = a_i^\top a_j$, then encourage $G$ to be close to an identity matrix. Off the diagonal, the entries of $G$ and $I$ are $a_i^\top a_j$ and zero, respectively; on the diagonal, they are $\|a_i\|_2^2$ and one, respectively. Making $G$ close to $I$ thus encourages $a_i^\top a_j$ to be close to zero and $\|a_i\|_2$ to be close to one, which encourages $a_i$ and $a_j$ to get close to being orthogonal.

One straightforward distance between $G$ and $I$ is based on the squared Frobenius norm (SFN): $\|G - I\|_F^2$. The SFN can be factorized into pairwise inner products: $\sum_{i=1}^{m}\sum_{j\neq i}(\langle a_i, a_j\rangle)^2 + \sum_{i=1}^{m}(\|a_i\|_2^2 - 1)^2$. As a result, it measures orthogonality in a pairwise manner. We conjecture that measuring orthogonality in a global manner is more desirable (the conjecture is validated in experiments). To achieve this goal, we resort to another measure: the Bregman matrix divergence (BMD) [100]. Let $\mathbb{S}^n$ denote the set of real symmetric $n\times n$ matrices. Given a strictly convex, differentiable function $\phi: \mathbb{S}^n \to \mathbb{R}$, the BMD is defined as $D_\phi(X, Y) = \phi(X) - \phi(Y) - \mathrm{tr}\big((\nabla\phi(Y))^\top(X - Y)\big)$, where $\mathrm{tr}(A)$ denotes the trace of matrix $A$. Different choices of $\phi(X)$ lead to different divergences. If $\phi(X) = \mathrm{tr}(X\log X - X)$, where $\log X$ denotes the matrix logarithm of $X$, the divergence becomes $D_{vnd}(X, Y) = \mathrm{tr}(X\log X - X\log Y - X + Y)$, which is referred to as the von Neumann divergence (VND) [336]. If $\phi(X) = -\log\det X$, where $\det(X)$ denotes the determinant of $X$, we get the log-determinant divergence (LDD) [206]: $D_{ldd}(X, Y) = \mathrm{tr}(XY^{-1}) - \log\det(XY^{-1}) - n$. Interestingly, SFN is a special case of BMD when $\phi(X) = \|X\|_F^2$. To encourage near-orthogonality among components, we encourage the BMD between their Gram matrix $AA^\top$ and an identity matrix $I$ to be small, which results in a family of BMD regularizers: $\Omega_\phi(A) = D_\phi(AA^\top, I)$. $\Omega_\phi(A)$ can be specialized to different instances, according to the choice of $D_\phi(\cdot,\cdot)$. Under VND, $\Omega_\phi(A)$ becomes
$$\Omega_{vnd}(A) = \mathrm{tr}\big(AA^\top\log(AA^\top) - AA^\top\big) + m. \qquad (2.11)$$

Under LDD, $\Omega_\phi(A)$ becomes
$$\Omega_{ldd}(A) = \mathrm{tr}(AA^\top) - \log\det(AA^\top) - m. \qquad (2.12)$$

Under SFN, $\Omega_\phi(A)$ becomes
$$\Omega_{sfn}(A) = \|AA^\top - I\|_F^2. \qquad (2.13)$$
Different from the SFN regularizer, VND and LDD do not admit a pairwise factorization and hence allow one to measure orthogonality globally. Unlike the DPP [204], LDD utilizes an additional term $\mathrm{tr}(G) = \sum_{i=1}^{m}\|a_i\|_2^2$ to control the magnitude of the vectors, thus avoiding the DPP's sensitivity to scaling. Similarly, VND and SFN are also insensitive to scaling since they encourage $\|a_i\|_2$ to be close to one. Applying these regularizers to ML models, e.g., distance metric learning (DML, Section 2.1.2), we define the following BMD-regularized DML (BMD-DML) problem:
$$\min_{A}\; \frac{1}{|S|}\sum_{(x,y)\in S}\|Ax - Ay\|_2^2 + \frac{1}{|D|}\sum_{(x,y)\in D}\max\big(0,\, 1 - \|Ax - Ay\|_2^2\big) + \lambda\Omega_\phi(A), \qquad (2.14)$$
which is nonconvex.
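For reference, the three nonconvex regularizers in Eqs.(2.11)-(2.13) can be computed directly from the eigenvalues of the Gram matrix. The following NumPy sketch (illustrative only, assuming $AA^\top$ is full rank) shows how.

```python
import numpy as np

def bmd_regularizers(A):
    """Nonconvex BMD regularizers of Eqs. (2.11)-(2.13) for A in R^{m x d},
    whose rows are the m component vectors."""
    m = A.shape[0]
    G = A @ A.T                                   # Gram matrix
    lam, U = np.linalg.eigh(G)                    # positive eigenvalues if G is full rank
    logG = (U * np.log(lam)) @ U.T                # matrix logarithm of G

    vnd = np.trace(G @ logG - G) + m              # von Neumann divergence to I
    ldd = np.trace(G) - np.sum(np.log(lam)) - m   # log-determinant divergence to I
    sfn = np.linalg.norm(G - np.eye(m), "fro") ** 2
    return vnd, ldd, sfn
```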

2.2.2 Convex Bregman Matrix Divergence Regularizers

Next, we discuss how to relax the nonconvex BMD regularizers into convex functions. The relaxations are based on the properties of eigenvalues. Given a full-rank matrix $A \in \mathbb{R}^{m\times d}$ ($m < d$), we know that $AA^\top \in \mathbb{R}^{m\times m}$ is a full-rank matrix with $m$ positive eigenvalues (denoted by $\lambda_1,\cdots,\lambda_m$) and $A^\top A \in \mathbb{R}^{d\times d}$ is a rank-deficient matrix with $d-m$ zero eigenvalues and $m$ positive eigenvalues that equal $\lambda_1,\cdots,\lambda_m$. For a general positive definite matrix $Z \in \mathbb{R}^{m\times m}$ whose eigenvalues are $\gamma_1,\cdots,\gamma_m$, we have $\|Z\|_F^2 = \sum_{j=1}^{m}\gamma_j^2$, $\mathrm{tr}(Z) = \sum_{j=1}^{m}\gamma_j$, and $\log\det Z = \sum_{j=1}^{m}\log\gamma_j$.

Next, we leverage these facts to seek convex approximations of the BMD regularizers.

Convex SFN regularizer The eigenvalues of $AA^\top - I_m$ are $\lambda_1-1,\cdots,\lambda_m-1$, and those of $A^\top A - I_d$ are $\lambda_1-1,\cdots,\lambda_m-1,-1,\cdots,-1$. According to the fact $\|Z\|_F^2 = \sum_{j=1}^{m}\gamma_j^2$, we have $\|A^\top A - I_d\|_F^2 = \sum_{j=1}^{m}(\lambda_j-1)^2 + \sum_{j=m+1}^{d}(-1)^2 = \|AA^\top - I_m\|_F^2 + d - m$. Let $M$ denote $A^\top A$; then the SFN regularizer $\|AA^\top - I_m\|_F^2$ equals $\|A^\top A - I_d\|_F^2 - d + m = \|M - I_d\|_F^2 - d + m$, where $m = \mathrm{rank}(A^\top A) = \mathrm{rank}(M)$. It is well known that the trace norm of a matrix is a convex envelope of its rank [311]. We use $\mathrm{tr}(M)$ to approximate $\mathrm{rank}(M)$ and get $\|AA^\top - I_m\|_F^2 \approx \|M - I_d\|_F^2 + \mathrm{tr}(M) - d$, where the right-hand side is a convex function. Dropping the constant, we get the convex SFN (CSFN) regularizer defined over $M$:

$$\hat{\Omega}_{sfn}(M) = \|M - I_d\|_F^2 + \mathrm{tr}(M). \qquad (2.15)$$

Convex VND regularizer Given $AA^\top = U\Lambda U^\top$ where $\Lambda_{jj} = \lambda_j$, according to the property of the matrix logarithm we have $\log(AA^\top) = U\hat{\Lambda}U^\top$ where $\hat{\Lambda}_{jj} = \log\Lambda_{jj}$. Then $(AA^\top)\log(AA^\top) - (AA^\top) = U(\Lambda\hat{\Lambda} - \Lambda)U^\top$, whose eigenvalues are $\{\Lambda_{jj}\log\Lambda_{jj} - \Lambda_{jj}\}_{j=1}^{m}$. Since $\mathrm{tr}(M) = \sum_{j=1}^{m}\lambda_j$, we have $\Omega_{vnd}(A) = \sum_{j=1}^{m}(\Lambda_{jj}\log\Lambda_{jj} - \Lambda_{jj}) + m$. For $A^\top A + \epsilon I_d$, where $\epsilon > 0$ is a small scalar and the matrix's eigenvalues are $\lambda_1+\epsilon,\cdots,\lambda_m+\epsilon,\epsilon,\cdots,\epsilon$, we have

$$\begin{aligned} D_{vnd}(A^\top A + \epsilon I_d,\, I_d) &= \mathrm{tr}\big((A^\top A + \epsilon I_d)\log(A^\top A + \epsilon I_d) - (A^\top A + \epsilon I_d)\big) + d\\ &= \sum_{j=1}^{m}\big((\lambda_j+\epsilon)\log(\lambda_j+\epsilon) - (\lambda_j+\epsilon)\big) + \sum_{j=m+1}^{d}(\epsilon\log\epsilon - \epsilon) + d\\ &= \sum_{j=1}^{m}\Big((\lambda_j+\epsilon)\big(\log\lambda_j + \log(1+\tfrac{\epsilon}{\lambda_j})\big) - (\lambda_j+\epsilon)\Big) + (d-m)(\epsilon\log\epsilon - \epsilon) + d\\ &= \sum_{j=1}^{m}\Big(\lambda_j\log\lambda_j - \lambda_j + \lambda_j\log(1+\tfrac{\epsilon}{\lambda_j}) + \epsilon\big(\log\lambda_j + \log(1+\tfrac{\epsilon}{\lambda_j})\big) - \epsilon\Big) + (d-m)(\epsilon\log\epsilon - \epsilon) + d\\ &= \Omega_{vnd}(A) - m + \sum_{j=1}^{m}\Big(\lambda_j\log(1+\tfrac{\epsilon}{\lambda_j}) + \epsilon\big(\log\lambda_j + \log(1+\tfrac{\epsilon}{\lambda_j})\big) - \epsilon\Big) + (d-m)(\epsilon\log\epsilon - \epsilon) + d. \end{aligned} \qquad (2.16)$$
Since $\epsilon$ is small, we have $\log(1+\tfrac{\epsilon}{\lambda_j}) \approx \tfrac{\epsilon}{\lambda_j}$. Then $\lambda_j\log(1+\tfrac{\epsilon}{\lambda_j}) \approx \epsilon$ and the last line in the above equation can be approximated as $\Omega_{vnd}(A) - m + d + O(\epsilon)$, and therefore

$$\Omega_{vnd}(A) \approx D_{vnd}(A^\top A + \epsilon I_d,\, I_d) + m - d, \qquad (2.17)$$
where $O(\epsilon)$ is close to zero since $\epsilon$ is small and can be dropped. Replacing $A^\top A$ with $M$, approximating $m$ with $\mathrm{tr}(M)$, and dropping the constant $d$, we get the convex VND (CVND) regularizer:

$$\hat{\Omega}_{vnd}(M) = D_{vnd}(M + \epsilon I_d,\, I_d) + \mathrm{tr}(M) \;\propto\; \mathrm{tr}\big((M + \epsilon I_d)\log(M + \epsilon I_d)\big), \qquad (2.18)$$
whose convexity is shown in [264].

Convex LDD regularizer Since $\mathrm{tr}(M) = \sum_{j=1}^{m}\lambda_j$ and $\log\det M = \sum_{j=1}^{m}\log\lambda_j$, we have $\Omega_{ldd}(A) = \sum_{j=1}^{m}\lambda_j - \sum_{j=1}^{m}\log\lambda_j - m$ and

$D_{ldd}(A^\top A + \epsilon I_d,\, I_d) = \sum_{j=1}^{m}\lambda_j + \epsilon d - (d-m)\log\epsilon - \sum_{j=1}^{m}\log(\lambda_j + \epsilon)$. Further,

$$\begin{aligned} D_{ldd}(A^\top A + \epsilon I_d,\, I_d) &= \sum_{j=1}^{m}\lambda_j + \epsilon d - (d-m)\log\epsilon - \sum_{j=1}^{m}\log(\lambda_j + \epsilon)\\ &= \sum_{j=1}^{m}\lambda_j + \epsilon d - (d-m)\log\epsilon - \sum_{j=1}^{m}\big(\log\lambda_j + \log(1+\tfrac{\epsilon}{\lambda_j})\big)\\ &\approx \sum_{j=1}^{m}(\lambda_j - \log\lambda_j) + m\log\epsilon - \sum_{j=1}^{m}\frac{\epsilon}{\lambda_j} + \epsilon d - d\log\epsilon\\ &= \Omega_{ldd}(A) + m + m\log\epsilon + O(\epsilon) - d\log\epsilon. \end{aligned} \qquad (2.19)$$
Dropping $O(\epsilon)$, we obtain

$$\Omega_{ldd}(A) = D_{ldd}(A^\top A + \epsilon I_d,\, I_d) - (\log\epsilon + 1)m + d\log\epsilon. \qquad (2.20)$$
Replacing $A^\top A$ with $M$, approximating $m$ with $\mathrm{tr}(M)$, and discarding constants, we obtain the convex LDD (CLDD) regularizer:

$$\hat{\Omega}_{ldd}(M) = D_{ldd}(M + \epsilon I_d,\, I_d) - (1 + \log\epsilon)\,\mathrm{tr}(M) \;\propto\; -\log\det(M + \epsilon I_d) + \Big(\log\frac{1}{\epsilon}\Big)\mathrm{tr}(M), \qquad (2.21)$$
where the convexity of $-\log\det(M + \epsilon I_d)$ is proved in [53]. In [92, 280], an information-theoretic regularizer based on the log-determinant divergence $D_{ldd}(M, I) = -\log\det(M) + \mathrm{tr}(M)$ is applied to encourage the Mahalanobis matrix to be close to an identity matrix. This regularizer requires $M$ to be full rank, while our convex LDD regularizer encourages $M$ to be low-rank by associating a large weight $\log\frac{1}{\epsilon}$ with the trace norm $\mathrm{tr}(M)$. Since $M = A^\top A$, reducing the rank of $M$ effectively reduces the number of components in $A$.
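The three convex relaxations are straightforward to evaluate on a PSD matrix $M$. The sketch below (illustrative, not the thesis code) computes CSFN, CVND, and CLDD via a single symmetric eigen-decomposition of $M + \epsilon I$.

```python
import numpy as np

def convex_bmd_regularizers(M, eps=1e-3):
    """Convex relaxations CSFN (Eq. 2.15), CVND (Eq. 2.18), CLDD (Eq. 2.21),
    evaluated on a PSD Mahalanobis matrix M."""
    d = M.shape[0]
    lam = np.linalg.eigvalsh(M + eps * np.eye(d))   # eigenvalues of M + eps*I (all > 0)
    csfn = np.linalg.norm(M - np.eye(d), "fro") ** 2 + np.trace(M)
    cvnd = np.sum(lam * np.log(lam))                # tr((M + eps I) log(M + eps I))
    cldd = -np.sum(np.log(lam)) + np.log(1.0 / eps) * np.trace(M)
    return csfn, cvnd, cldd
```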

Approximation errors We discuss the errors of the convex approximations, which come from two sources: one is the approximation of $\Omega_\phi(A)$ using $D_\phi(A^\top A + \epsilon I_d, I_d)$, where the error is controlled by $\epsilon$ and can be made arbitrarily small (by setting $\epsilon$ to be very small); the other is the approximation of the matrix rank using the trace norm. Though the error of the second approximation can be large, it has been both empirically and theoretically [62] demonstrated that decreasing the trace norm can effectively reduce the rank. As validated in our experiments, though bearing approximation errors, these convex regularizers yield better performance than the original nonconvex ones.

DML with convex BMD regularization Given these convex BMD (CBMD) regularizers (denoted uniformly by $\hat{\Omega}_\phi(M)$), we can relax the nonconvex BMD-DML problems into convex CBMD-DML formulations by replacing $\|Ax - Ay\|_2^2$ with $(x-y)^\top M(x-y)$ and replacing the nonconvex BMD regularizers $\Omega_\phi(A)$ with $\hat{\Omega}_\phi(M)$:

$$\begin{aligned}\min_{M}\;& \frac{1}{|S|}\sum_{(x,y)\in S}(x-y)^\top M(x-y) + \frac{1}{|D|}\sum_{(x,y)\in D}\max\big(0,\, 1 - (x-y)^\top M(x-y)\big) + \lambda\hat{\Omega}_\phi(M)\\ \text{s.t.}\;& M \succeq 0. \end{aligned} \qquad (2.22)$$
This convex problem facilitates optimization: the global optimum is guaranteed to be achievable.

2.2.3 A Proximal Gradient Descent Algorithm

We use the stochastic proximal subgradient descent algorithm [270] to solve the CBMD-DML problems. The algorithm iteratively performs the following steps until convergence: (1) randomly sample a mini-batch of data pairs, compute the subgradient $\triangle M$ of the data-dependent loss (the first and second terms in the objective function) defined on the mini-batch, then perform a subgradient descent update $\tilde{M} = M - \eta\triangle M$, where $\eta$ is a small stepsize; (2) apply the proximal operators associated with the regularizers $\hat{\Omega}_\phi(M)$ to $\tilde{M}$. In step (1), the gradient of the CVND regularizer is $\log(M + \epsilon I_d) + I_d$. To compute $\log(M + \epsilon I_d)$, we first perform an eigen-decomposition $M + \epsilon I_d = U\Lambda U^\top$, then take the log of every eigenvalue in $\Lambda$, resulting in a new diagonal matrix $\tilde{\Lambda}$, and finally compute $\log(M + \epsilon I_d)$ as $U\tilde{\Lambda}U^\top$. In the CLDD regularizer, the gradient of $\log\det(M + \epsilon I_d)$ is $(M + \epsilon I_d)^{-1}$, which can be computed by eigen-decomposition as well. In step (2), the proximal operator associated with the regularizer $\hat{\Omega}_\phi(M)$ is derived by minimizing $\frac{1}{2\eta}\|M - \tilde{M}\|_F^2 + \lambda\hat{\Omega}_\phi(M)$ subject to $M \succeq 0$. Let $\{\tilde{\lambda}_j\}_{j=1}^{d}$ be the eigenvalues of $\tilde{M}$ and $\{x_j\}_{j=1}^{d}$ be the eigenvalues of $M$; then this problem can be equivalently written as:

$$\begin{aligned}\min_{\{x_j\}_{j=1}^{d}}\;& \frac{1}{2\eta}\sum_{j=1}^{d}(x_j - \tilde{\lambda}_j)^2 + \lambda\sum_{j=1}^{d} h_\phi(x_j)\\ \text{s.t.}\;& x_j \ge 0,\;\; \forall j = 1,\cdots,d, \end{aligned} \qquad (2.23)$$
where $h_\phi(x_j)$ is a regularizer-specific scalar function. Further, this problem can be decomposed into $d$ independent problems: (a) $\min_{x_j} f(x_j) = \frac{1}{2\eta}(x_j - \tilde{\lambda}_j)^2 + \lambda h_\phi(x_j)$ subject to $x_j \ge 0$, for $j = 1,\cdots,d$, which can be solved individually.

SFN For SFN, where $\hat{\Omega}_\phi(M) = \|M - I_d\|_F^2 + \mathrm{tr}(M)$ and $h_\phi(x_j) = (x_j - 1)^2 + x_j$, problem (a) is simply a quadratic programming problem. The optimal solution is $x_j^* = \max\big(0,\, \frac{\tilde{\lambda}_j + \eta\lambda}{1 + 2\eta\lambda}\big)$.

VND For VND, where $\hat{\Omega}_\phi(M) = \mathrm{tr}((M + \epsilon I_d)\log(M + \epsilon I_d))$ and $h_\phi(x_j) = (x_j + \epsilon)\log(x_j + \epsilon)$, taking the derivative of the objective function $f(x_j)$ in problem (a) w.r.t. $x_j$ and setting it to zero gives $\eta\lambda\log(x_j + \epsilon) + x_j + \eta\lambda - \tilde{\lambda}_j = 0$. The root of this equation is $\eta\lambda\,\omega\big(\frac{\epsilon - \eta\lambda + \tilde{\lambda}_j}{\eta\lambda} - \log(\eta\lambda)\big) - \epsilon$, where $\omega(\cdot)$ is the Wright omega function [137]. If this root is negative, then the optimal $x_j$ is 0; if this root is positive, then the optimal $x_j$ could be either this root or 0. We pick the one that yields the lowest $f(x_j)$. Formally, $x_j^* = \operatorname{argmin}_{x_j} f(x_j)$, where $x_j \in \big\{\max\big(\eta\lambda\,\omega(\frac{\epsilon - \eta\lambda + \tilde{\lambda}_j}{\eta\lambda} - \log(\eta\lambda)) - \epsilon,\, 0\big),\, 0\big\}$.

LDD For LDD, where $\hat{\Omega}_\phi(M) = -\log\det(M + \epsilon I_d) + (\log\frac{1}{\epsilon})\mathrm{tr}(M)$ and $h_\phi(x_j) = -\log(x_j + \epsilon) + x_j\log\frac{1}{\epsilon}$, taking the derivative of $f(x_j)$ w.r.t. $x_j$ and setting it to zero yields a quadratic equation $x_j^2 + a x_j + b = 0$, where $a = \epsilon - \tilde{\lambda}_j - \eta\lambda\log\epsilon$ and $b = -\tilde{\lambda}_j\epsilon - \eta\lambda(1 + \epsilon\log\epsilon)$. The optimal solution is achieved either at a nonnegative root (if any) of this equation or at 0. We pick the one that yields the lowest $f(x_j)$. Formally, $x_j^* = \operatorname{argmin}_{x_j} f(x_j)$, where $x_j \in \big\{\max\big(\frac{-a + \sqrt{a^2 - 4b}}{2}, 0\big),\, \max\big(\frac{-a - \sqrt{a^2 - 4b}}{2}, 0\big),\, 0\big\}$.
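The per-eigenvalue proximal steps above are easy to implement. The sketch below is a hedged illustration, not the thesis implementation: the closed forms are re-derived from setting $f'(x_j)=0$ (so the bookkeeping constants follow that derivation), the Wright omega function comes from SciPy, and all candidates are compared by the scalar objective $f$ as described in the text.

```python
import numpy as np
from scipy.special import wrightomega

def prox_eigenvalue(lam_tilde, eta, lam, reg, eps=1e-3):
    """Solve problem (a) for one eigenvalue lam_tilde of M-tilde.
    reg is one of "sfn", "vnd", "ldd"."""
    if reg == "sfn":
        h = lambda x: (x - 1.0) ** 2 + x
        cands = [max((lam_tilde + eta * lam) / (1.0 + 2.0 * eta * lam), 0.0)]
    elif reg == "vnd":
        h = lambda x: (x + eps) * np.log(x + eps)
        z = (eps - eta * lam + lam_tilde) / (eta * lam) - np.log(eta * lam)
        root = eta * lam * float(wrightomega(z).real) - eps
        cands = [max(root, 0.0), 0.0]
    else:  # "ldd"
        h = lambda x: -np.log(x + eps) + x * np.log(1.0 / eps)
        a = eps - lam_tilde - eta * lam * np.log(eps)
        b = -lam_tilde * eps - eta * lam * (1.0 + eps * np.log(eps))
        disc = a * a - 4.0 * b
        cands = [0.0]
        if disc >= 0:
            cands += [max((-a + np.sqrt(disc)) / 2.0, 0.0),
                      max((-a - np.sqrt(disc)) / 2.0, 0.0)]
    f = lambda x: (x - lam_tilde) ** 2 / (2.0 * eta) + lam * h(x)
    return min(cands, key=f)
```

Applying this function to every eigenvalue of $\tilde{M}$ and recombining with the eigenvectors yields the projected, regularized update of step (2).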

          #Train   #Test   Dimension   #Class
MIMIC     40K      18K     1000        2833
EICU      53K      39K     1000        2175
Reuters   4152     1779    1000        49
News      11307    7538    1000        20
Cars      8144     8041    1000        196
Birds     9000     2788    1000        200
Act       7352     2947    561         6

Table 2.8: Dataset statistics.

Computational complexity In this algorithm, the major computational workload is the eigen-decomposition of $m$-by-$m$ matrices, with a complexity of $O(m^3)$. In our experiments, since $m$ is no more than 1000, $O(m^3)$ is not a big bottleneck. Besides, these matrices are symmetric, and this structure can be leveraged to speed up eigen-decomposition. In our implementation, we use the MAGMA [7] library, which supports efficient eigen-decomposition of symmetric matrices on GPUs. Note that the unregularized convex DML problem [383] also requires eigen-decomposition (of $M$); hence adding these CBMD regularizers does not substantially increase the computation cost.

2.2.4 Evaluation

In this section, we present experimental results on regularized distance metric learning, which demonstrate that, compared with nonconvex BMD regularizers, the proposed convex regularizers are computationally more efficient and more capable of capturing infrequent patterns and reducing model size without sacrificing modeling power.

Datasets We used seven datasets in the experiments: two electronic health record datasets, MIMIC-III [184] and EICU (version 1.1) [131]; two text datasets, Reuters [10] and 20-Newsgroups (News) [1]; two image datasets, Stanford-Cars (Cars) [202] and Caltech-UCSD-Birds (Birds) [358]; and one sensory dataset, 6-Activities (Act) [21]. The details of MIMIC-III, Cars, and Birds were introduced in Section 2.1.4. The class labels in MIMIC are the primary diagnoses of patients. The EICU dataset contains hospital admissions of patients who were treated as part of the Philips eICU program across intensive care units in the United States between 2014 and 2015. Each admission has a primary diagnosis (a disease), which acts as the class label of this admission. There are 2175 unique diseases. There are 474 lab test items and 48 vital sign items. Each admission has a past medical history, which is a collection of diseases; there are 2644 unique past diseases. We extracted the following features: (1) age and gender; (2) zero-, first-, and second-order temporal features of lab tests and vital signs; (3) past medical history, encoded as a binary vector: if an element of the vector is 1, the patient had the corresponding disease in the past. The original Reuters-21578 dataset contains 21578 documents in 135 classes. We removed documents that have more than one label and removed classes that have fewer than 3 documents, which left us 5931 documents and 48 classes. Documents in Reuters and News are represented with tf-idf vectors, where the vocabulary size is 5000.

dataset contains sensory recordings of 30 subjects performing 6 activities (which serve as the class labels). The features are 561-dimensional sensory signals. For all datasets except Act, the features are normalized using min-max normalization along each dimension. We used PCA to reduce the feature dimension to 1000 to lower the computational cost (performing eigen-decomposition of the Mahalanobis matrix has cubic complexity in the feature dimension). Since there is no standard train/test split, we performed five random splits and averaged the results over the five runs. Dataset statistics are summarized in Table 2.8.

Experimental settings Two examples are considered similar if they belong to the same class and dissimilar otherwise. The learned distance metrics were applied to retrieval: each test example was used to query the rest of the test examples based on the learned metric. If the distance between x and y is smaller than a threshold s and they have the same class label, then this is a true positive. By choosing different values of s, we obtained a receiver operating characteristic (ROC) curve. The retrieval performance was evaluated using the area under the ROC curve (AUC) [243]; higher is better. For AUC on infrequent classes, we used examples belonging to infrequent classes to query the entire test set (excluding the query). AUC on frequent classes was measured in a similar way. Note that the learned distance metrics can also be applied to other tasks such as clustering and classification; in this work, we focus on retrieval. We compared the proposed convex diversity-promoting regularizers CSFN, CVND, and CLDD with two sets of baseline regularizers. The first set aims at promoting near-orthogonality ("diversity") and is based on the determinant of the covariance matrix (DCM) [242], cosine similarity (CS) [402], determinantal point process (DPP) [204, 429], InCoherence (IC) [29], variational Gram function (VGF) [177, 420], decorrelation (DC) [82], mutual angles (MA) [369], squared Frobenius norm (SFN) [74, 114, 122, 347], von Neumann divergence (VND) (Section 2.2.1), log-determinant divergence (LDD) (Section 2.2.1), and the orthogonal constraint (OC) $AA^\top=I$ [233, 342]. These regularizers were applied to the nonconvex DML (NCDML) formulation in Eq.(2.14). Though VGF is convex, when it is used to regularize NCDML the overall problem is non-convex, and it is unclear how to seek a convex relaxation. The other set of regularizers are not designed particularly for promoting diversity but are commonly used, including the $\ell_2$ norm, $\ell_1$ norm [280], $\ell_{2,1}$ norm [399], trace norm (Tr) [234], information theoretic (IT) regularizer $-\log\det(M)+\mathrm{tr}(M)$ [92], and dropout [313]. All these regularizers were applied to the convex DML (CDML) formulation in Eq.(2.22). One common way of dealing with class imbalance is over-sampling (OS) [116], which repetitively draws samples from the empirical distributions of infrequent classes until all classes have the same number of samples. In addition, we compared with the vanilla Euclidean distance (EUC) and other distance learning methods, including large margin nearest neighbor (LMNN) metric learning [357], information theoretic metric learning (ITML) [92], logistic discriminant metric learning (LDML) [146], metric learning from equivalence constraints (MLEC) [200], geometric mean metric learning (GMML) [405], and independent Laplacian hashing with diversity (ILHD) [63]. LMNN [357] has a non-convex formulation and a convex formulation; we used the convex one. In GMML [405], the prior matrix was set to an identity matrix. ILHD [63] has several variants, among which we used ILTITF.

                  MIMIC   EICU   Reuters  News   Cars   Birds   Act
NCDML              62.1    66.6     5.2    11.0    8.4   10.1    3.4
CDML                3.4     3.7     0.3     0.6    0.5    0.6    0.2
DCM-NCDML [242]   424.7   499.2    35.2    65.6   61.8   66.2   17.2
CS-NCDML [402]    263.2   284.8    22.6    47.2   34.5   42.8   14.4
DPP-NCDML [429]   411.8   479.1    36.9    61.9   64.2   70.5   16.5
IC-NCDML [29]     265.9   281.2    23.4    46.1   37.5   45.2   15.3
DC-NCDML [82]     458.5   497.5    41.8    78.2   78.9   80.7   19.9
VGF-NCDML [177]   267.3   284.1    22.3    48.9   35.8   38.7   15.4
MA-NCDML [367]    271.4   282.9    23.6    50.2   30.9   39.6   17.5
OC-NCDML [233]    104.9   118.2     9.6    14.3   14.8   17.0    3.9
SFN-NCDML [74]    261.7   277.6    22.9    46.3   36.2   38.2   15.9
VND-NCDML         401.8   488.3    33.8    62.5   67.5   73.4   17.1
LDD-NCDML         407.5   483.5    34.3    60.1   61.8   72.6   17.9
CSFN-CDML          41.1    43.9     3.3     7.3    6.5    6.9    1.8
CVND-CDML          43.8    46.2     3.6     8.1    6.9    7.8    2.0
CLDD-CDML          41.7    44.5     3.4     7.5    6.6    7.2    1.8

Table 2.9: Training time (hours) on the seven datasets. The second panel (DCM-NCDML, etc.) contains NCDML methods regularized by previously proposed diversity-promoting regularizers. The third panel (VND-NCDML, etc.) contains NCDML methods regularized by our proposed nonconvex BMD regularizers. The fourth panel (CSFN-CDML, etc.) contains CDML methods regularized by our proposed convex BMD regularizers.

The NCDML-based methods except OC-NCDML were solved with stochastic subgradient descent (SSD); OC-NCDML was solved using the algorithm proposed in [359]. The CDML-based methods were solved with proximal SSD. The learning rate was set to 0.001. The minibatch size was set to 100 (50 similar pairs and 50 dissimilar pairs). We used 5-fold cross validation to tune the regularization parameter in $\{10^{-3},\cdots,10^{0}\}$ and the number of components of the NCDML methods in $\{50, 100, 200, \cdots, 500\}$. In CVND and CLDD, $\epsilon$ was set to $10^{-5}$. The margin $t$ was set to 1. In LMNN, the weighting parameter $\mu$ was set to 0.5. In GMML [405], the regularization parameter $\lambda$ was set to 0.1 and the step length $t$ of the geodesic was set to 0.3. In ILHD [63], the hash function was set to a kernel SVM [294] with a radial basis function kernel whose scale parameter was set to 0.1. Each method was implemented on top of GPUs using the MAGMA [7] library. The experiments were conducted on a GPU cluster with 40 machines. For computational efficiency, in the CDML-based methods we do not compute distances directly as $(x-y)^\top M(x-y)$. Given the learned matrix $M$ (which is of rank $m$), we decompose it as $M=L^\top L$ with $L\in\mathbb{R}^{m\times d}$. Let $U\Lambda U^\top$ be the eigen-decomposition of $M$, let $\lambda_1,\cdots,\lambda_m$ denote the $m$ nonzero eigenvalues, and let $u_1,\cdots,u_m$ denote the corresponding eigenvectors. Then $L$ is the transpose of $[\sqrt{\lambda_1}u_1,\cdots,\sqrt{\lambda_m}u_m]$. Given $L$, we transform each $d$-dimensional input feature vector $x$ into a new $m$-dimensional vector $Lx$ and then perform retrieval on the new vectors based on the Euclidean distance. Note that it is only when computing the squared Euclidean distance

between $Lx$ and $Ly$ that $\|Lx-Ly\|_2^2$ is equivalent to $(x-y)^\top M(x-y)$; for other distances or similarity measures between $Lx$ and $Ly$, such as the $L_1$ distance and cosine similarity, this equivalence does not hold. Performing retrieval based on $\|Lx-Ly\|_2^2$ is more efficient than retrieval based on $(x-y)^\top M(x-y)$ when $m$ is smaller than $d$: given $n$ test examples, the computational complexity of $\|Lx-Ly\|_2^2$-based retrieval is $O(nmd+n^2m)$, while that of $(x-y)^\top M(x-y)$-based retrieval is $O(n^2d^2)$.
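A minimal sketch of this retrieval scheme (illustrative variable names; `M` is assumed to be the learned low-rank Mahalanobis matrix):

```python
import numpy as np

def low_rank_transform(M, tol=1e-10):
    """Factor M = L^T L from its eigen-decomposition, keeping only the nonzero eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(M)
    keep = eigvals > tol
    # Rows of L: sqrt(lam_1) u_1^T, ..., sqrt(lam_m) u_m^T
    return np.sqrt(eigvals[keep])[:, None] * eigvecs[:, keep].T

def retrieve(L, query, candidates, threshold):
    """Indices of candidates whose squared Mahalanobis distance to the query is below threshold."""
    zq = L @ query                       # m-dimensional projection of the query
    Z = candidates @ L.T                 # n x m projections of the candidates
    d2 = np.sum((Z - zq) ** 2, axis=1)   # ||Lx - Ly||^2 equals (x - y)^T M (x - y)
    return np.where(d2 < threshold)[0]

# toy usage
d = 20
A = np.random.randn(5, d)
M = A.T @ A                              # rank-5 PSD metric as a stand-in
L = low_rank_transform(M)
X = np.random.randn(100, d)
print(retrieve(L, X[0], X, threshold=10.0))
```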

Results The training time taken by different methods to reach convergence is shown in Table 2.9. For the NCDML-based methods, we report the total time taken by the following computation: tuning the regularization parameter (4 choices) and the number of components (NCs, 6 choices) on a two-dimensional grid via 3-fold cross validation ($4\times6\times3=72$ experiments in total); for each of the 72 experiments, the algorithm restarted 5 times, each with a different initialization, and we picked the one yielding the lowest objective value. In total, the number of runs is $72\times5=360$. For the CDML-based methods, there is no need to restart multiple times or to tune the NCs, so the total number of runs is $4\times3=12$. As can be seen from the table, the proposed convex methods are much faster than the non-convex ones, due to the greatly reduced number of experimental runs, although for each single run the convex methods are less efficient than the non-convex ones due to the overhead of eigen-decomposition. The unregularized CDML takes the least time to train since it has no parameters to tune and runs only once. On average, the time of each single run of (CSFN,CVND,CLDD)-CDML is close to that of the unregularized CDML, since an eigen-decomposition is required regardless of the presence of the regularizers. Unregularized NCDML runs faster than the regularized NCDML methods because it does not need to tune the regularization parameter, which reduces the number of experimental runs by a factor of 4. Unregularized CDML runs faster than the regularized CDML methods because it needs to tune neither the regularization parameter nor the number of projection vectors, which reduces the number of experimental runs by a factor of 12. The (DCM,DPP,VND,LDD)-NCDML methods take longer than the other regularized NCDML methods since they need eigen-decompositions to compute the gradients. OC-NCDML has no regularization parameter to tune, hence its number of experimental runs is 4 times smaller than that of the other regularized NCDML methods. Next, we verify whether CSFN, CVND, and CLDD are able to better capture infrequent patterns. On the three datasets MIMIC, EICU, and Reuters, where the classes are imbalanced, we consider a class "frequent" if it contains more than 1000 examples and "infrequent" otherwise. We measure AUCs on all classes (AUC-All), infrequent classes (AUC-IF), and frequent classes (AUC-F). As can be seen, in most DML methods the AUCs on infrequent classes are worse than those on frequent classes, showing that DML is sensitive to class-frequency imbalance, tends to be biased towards frequent classes, and is less capable of capturing infrequent classes. This is in accordance with previous findings [369]. Adding our proposed CSFN, CVND, and CLDD regularizers to CDML greatly improves the AUCs on infrequent classes. This demonstrates that these convex regularizers can effectively capture infrequent patterns. By encouraging the components to be close to orthogonal, our methods reduce the redundancy among vectors. Mutually complementary vectors can achieve a broader coverage of latent features, including those associated with infrequent classes. As a result, these vectors improve the performance on infrequent classes.

Thanks to their convexity, our methods can achieve the global optimal solution and outperform the non-convex ones, which can only achieve a local optimum and hence a sub-optimal solution. (C)VND and (C)LDD outperform (C)SFN, possibly because they measure near-orthogonality in a global way while (C)SFN does so in a pairwise fashion. Comparing OS-(NCDML,CDML) with the unregularized NCDML/CDML, we can see that over-sampling indeed improves performance on infrequent classes. However, this improvement is less significant than that achieved by our methods. In general, the diversity-promoting (DP) regularizers outperform the non-DP regularizers, suggesting the effectiveness of promoting diversity. The orthogonal constraint (OC) [233, 342] imposes strict orthogonality, which may be so restrictive that it hurts performance. ILHD [63] learns binary hash codes, which makes retrieval more efficient; however, it achieves lower AUCs due to the quantization errors. (CSFN,CVND,CLDD)-CDML outperform popular DML approaches including LMNN, LDML, MLEC, and GMML, demonstrating their competitive standing in the DML literature. Next, we verify whether the proposed CDML methods are able to reduce model size without sacrificing modeling power. Table 2.12 shows the numbers of components that achieve the AUCs in Tables 6.1 and 2.11. For CDML-based methods, the number of components (NC) is the rank of the Mahalanobis matrix, since $M=A^\top A$. We define a compactness score (CS), which is the ratio between AUC-All and NC; a higher CS indicates a better ability to use fewer components to achieve a higher AUC. From Table 2.12, we see that the convex methods achieve larger CS than the baseline methods, demonstrating their better ability to reduce model size without losing performance. (C)VND and (C)LDD perform better than (C)SFN, demonstrating the necessity of promoting diversity in a global manner. The proposed convex diversity-promoting (DP) regularizers outperform the nonconvex DP regularizers, demonstrating their better ability to promote diversity thanks to their convex nature. Note that the reduced number of components improves the efficiency of retrieval, whose computational complexity grows linearly with this number. As can be seen from Tables 6.1 and 2.11, our methods (CSFN,CVND,CLDD)-CDML achieve the best AUC-All on the test set. Table 6.3 shows the difference between the training AUC and the testing AUC. Our methods have the smallest gap between training and testing AUCs, indicating that they are better capable of reducing overfitting and improving generalization performance.

2.3 Angular Constraints for Improving Generalization Performance

In the previous two sections, we have empirically demonstrated that diversity-promoting regularization can improve generalization performance. One intuitive explanation is that promoting diversity imposes a structural constraint on the model parameters, which reduces model capacity and therefore alleviates overfitting. However, a theoretical account of why larger "diversity" results in lower generalization error (the sum of estimation and approximation errors) is still missing. The uniform eigenvalue regularizer and the Bregman matrix divergence (BMD) regularizers studied previously are not amenable to such an analysis, especially for the approximation error. In this

section, we aim to bridge this gap by proposing a diversity-promoting approach that is both empirically effective and theoretically analyzable. Similar to Section 2.2, we continue to use near-orthogonality to represent "diversity", but via a different regularization approach – angular constraints (ACs) [372], where the angle between components is constrained to be close to $\pi/2$, which encourages the components to be close to being orthogonal. Using neural networks as a study case, we analyze how ACs affect their generalization error (Section 4.2.1). The analysis shows that the closer to $\pi/2$ the angles are, the smaller the estimation error is and the larger the approximation error is. The best tradeoff between these two errors can be explored by properly tuning the angles. We develop an algorithm based on the alternating direction method of multipliers (ADMM) [52] to solve the angle-constrained problems. In various experiments, we demonstrate that ACs improve generalization performance and outperform other diversity-promoting regularization approaches.

2.3.1 Angular Constraints
Similar to the BMD regularizers (Section 2.2), angular constraints (ACs) use near-orthogonality to characterize "diversity" and encourage the angles between component vectors to be close to $\pi/2$. The ACs require the absolute value of the cosine similarity between each pair of components to be less than or equal to a small value $\tau$, which leads to the following angle-constrained problem:

$$\begin{array}{ll}\min_{W} & L(W)\\ \text{s.t.} & \dfrac{|w_i\cdot w_j|}{\|w_i\|_2\|w_j\|_2}\le\tau,\quad \forall\,1\le i<j\le m,\end{array}\qquad(2.24)$$

where $W=\{w_i\}_{i=1}^m$ denotes the component vectors and $L(W)$ is the objective function of this problem. The parameter $\tau$ controls the level of near-orthogonality (or diversity): a smaller $\tau$ indicates that the vectors are closer to being orthogonal and hence more diverse. As will be shown later, representing diversity via angular constraints facilitates theoretical analysis and is empirically effective as well.
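For concreteness, the following small sketch checks the feasibility condition in Eq.(2.24) for a given set of component vectors (`W` and `tau` are illustrative names; `W` stores the components as rows):

```python
import numpy as np

def max_abs_cosine(W):
    """Largest |cosine similarity| over all pairs of rows of W."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = np.abs(Wn @ Wn.T)
    np.fill_diagonal(C, 0.0)
    return C.max()

def satisfies_ac(W, tau):
    """True if every pair of components satisfies the angular constraint |cos| <= tau."""
    return max_abs_cosine(W) <= tau

W = np.random.randn(50, 100)
print(max_abs_cosine(W), satisfies_ac(W, tau=0.5))
```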

Case study I: sparse coding Given a set of data samples $\{x_i\}_{i=1}^n$, where $x\in\mathbb{R}^d$, sparse coding (SC) [266] aims to use a set of "basis" vectors (referred to as a dictionary) $W=\{w_j\}_{j=1}^m$ to reconstruct the data samples. Each data sample $x$ is reconstructed by taking a sparse linear combination of the basis vectors, $x\approx\sum_{j=1}^m\alpha_j w_j$, where $\{\alpha_j\}_{j=1}^m$ are the linear coefficients (referred to as sparse codes) and most of them are zero. The reconstruction error is measured using the squared $\ell_2$ norm $\|x-\sum_{j=1}^m\alpha_j w_j\|_2^2$. To achieve sparsity among the coefficients, $\ell_1$ regularization is utilized: $\sum_{j=1}^m|\alpha_j|$. To avoid the degenerate case where most coefficients are zero and the basis vectors have large magnitude, $\ell_2$ regularization is applied to the basis vectors: $\|w_j\|_2^2$. Putting these pieces together, we learn the basis vectors and sparse codes (denoted by $A$) by minimizing the following objective function: $L(W,A)=\frac{1}{2}\sum_{i=1}^n\big(\|x_i-\sum_{j=1}^m\alpha_{ij}w_j\|_2^2+\lambda_1\sum_{j=1}^m|\alpha_{ij}|\big)+\lambda_2\sum_{j=1}^m\|w_j\|_2^2$. We can use the ACs to encourage the basis vectors to be "diverse".
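As a reference point for this objective, the sketch below evaluates $L(W,A)$ for given bases and codes (all names are illustrative; `X` holds the samples as rows, `W` the basis vectors as rows, and `A` the sparse codes):

```python
import numpy as np

def sc_objective(X, W, A, lam1, lam2):
    """0.5 * sum_i ( ||x_i - sum_j a_ij w_j||^2 + lam1 * ||a_i||_1 ) + lam2 * sum_j ||w_j||^2."""
    recon = A @ W                           # row i: sum_j a_ij w_j
    fit = 0.5 * np.sum((X - recon) ** 2)
    sparse = 0.5 * lam1 * np.abs(A).sum()   # the l1 term sits inside the 1/2-weighted sum
    ridge = lam2 * np.sum(W ** 2)
    return fit + sparse + ridge

X = np.random.randn(30, 64)
W = np.random.randn(10, 64)
A = np.random.randn(30, 10)
print(sc_objective(X, W, A, lam1=0.1, lam2=0.01))
```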

Case study II: neural networks In a neural network (NN) with $L$ hidden layers, each hidden layer $l$ is equipped with $m^{(l)}$ units, and each unit $i$ is connected with all units in layer $l-1$. Hidden unit $i$ at layer $l$ is parameterized by a weight vector $w_i^{(l)}$. These hidden units aim at capturing the latent features underlying the data. We can apply ACs to the weight vectors of the hidden units (within the same layer) to encourage diversity.

Connection with the log-determinant divergence regularizer In this section, we make a connection between the angular constraints and the log-determinant divergence (LDD) regularizer (Section 2.2.1). Let $s_{ij}=\frac{|\langle w_i,w_j\rangle|}{\|w_i\|_2\|w_j\|_2}$ be the absolute value of the cosine similarity between $w_i$ and $w_j$. The angular constraints can be equivalently written as

$$s(W)\le\tau,\qquad\text{where } s(W)=\max_{1\le i<j\le m}s_{ij}. \qquad(2.25)$$

It can be proved that the negative gradient of the LDD regularizer $\Omega_{\mathrm{ldd}}(W)$ is a descent direction of $s(W)$, which is formally stated in the following lemma.
Lemma 1. Let $\widehat{W}=\{\widehat{w}_i\}_{i=1}^m$ be a set of vectors where $\widehat{w}_i=w_i-\eta g_i$ and $g_i$ is the gradient of $\Omega_{\mathrm{ldd}}(W)$ w.r.t. $w_i$. Then $\exists\,\delta>0$ such that $\forall\,\eta\in(0,\delta)$, $s(\widehat{W})\le s(W)$.
This implies that $\Omega_{\mathrm{ldd}}(W)$ and $s(W)$ are closely aligned: decreasing $\Omega_{\mathrm{ldd}}(W)$ effectively decreases $s(W)$ and drives the angles among components toward $\pi/2$. In addition, it can be shown that the variance of the angles can be reduced by minimizing $\Omega_{\mathrm{ldd}}(W)$. Consider the set of angles $\{\theta_{ij}\}$, where $\theta_{ij}=\arccos(s_{ij})$ for all $i\neq j$, and let $\mathrm{var}(\{\theta_{ij}\})$ denote their variance. Then we have the following lemma.
Lemma 2. Let $\widehat{W}=\{\widehat{w}_i\}_{i=1}^m$ be a set of vectors where $\widehat{w}_i=w_i-\eta g_i$ and $g_i$ is the gradient of $\Omega_{\mathrm{ldd}}(W)$ w.r.t. $w_i$. Let $\hat{s}_{ij}=\frac{|\langle\widehat{w}_i,\widehat{w}_j\rangle|}{\|\widehat{w}_i\|_2\|\widehat{w}_j\|_2}$ and $\hat{\theta}_{ij}=\arccos(\hat{s}_{ij})$. Then $\exists\,\delta>0$ such that $\forall\,\eta\in(0,\delta)$, $\mathrm{var}(\{\hat{\theta}_{ij}\})\le\mathrm{var}(\{\theta_{ij}\})$.
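The following numeric sketch illustrates Lemma 1, assuming (as in Section 2.4.1) that $\Omega_{\mathrm{ldd}}(W)=\mathrm{tr}(G)-\log\det(G)$ with $G$ the Gram matrix of the component vectors; the step size and dimensions are illustrative:

```python
import numpy as np

def s_max_cosine(W):
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = np.abs(Wn @ Wn.T)
    np.fill_diagonal(C, 0.0)
    return C.max()

def ldd_gradient(W):
    """Gradient of tr(G) - log det(G) w.r.t. W, where G = W W^T (rows of W are the components)."""
    G = W @ W.T
    return 2.0 * W - 2.0 * np.linalg.solve(G, W)   # 2W - 2 G^{-1} W

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))
eta = 1e-3
W_new = W - eta * ldd_gradient(W)
print(s_max_cosine(W_new) <= s_max_cosine(W))      # expected True for sufficiently small eta
```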

2.3.2 An ADMM-based Algorithm
In this section, we develop an ADMM-based algorithm to solve the AC-regularized problems. To make the problem amenable to optimization, we first factorize each weight vector $w$ into its $\ell_2$ norm $g=\|w\|_2$ and direction $\widetilde{w}=\frac{w}{\|w\|_2}$. Under this factorization, $w$ can be reparameterized as $w=g\widetilde{w}$, where $g>0$ and $\|\widetilde{w}\|_2=1$. The problem defined in Eq.(2.24) can then be transformed into

$$\begin{array}{ll}\min_{\widetilde{W},G} & L(\widetilde{W},G)\\ \text{s.t.} & \forall j,\ g_j\ge0,\ \|\widetilde{w}_j\|_2=1\\ & \forall i\neq j,\ |\widetilde{w}_i\cdot\widetilde{w}_j|\le\tau,\end{array}\qquad(2.26)$$

where $\widetilde{W}=\{\widetilde{w}_j\}_{j=1}^m$ and $G=\{g_j\}_{j=1}^m$. We solve this new problem by alternating between $\widetilde{W}$ and $G$. With $\widetilde{W}$ fixed, the problem defined over $G$ is $\min_G L(G)$ subject to $\forall j,\ g_j\ge0$, which can be solved using projected gradient descent. With $G$ fixed, the sub-problem defined over $\widetilde{W}$

is

$$\begin{array}{ll}\min_{\widetilde{W}} & L(\widetilde{W})\\ \text{s.t.} & \forall j,\ \|\widetilde{w}_j\|_2=1\\ & \forall i\neq j,\ |\widetilde{w}_i\cdot\widetilde{w}_j|\le\tau,\end{array}\qquad(2.27)$$

which we solve using an ADMM-based algorithm. There are $R=m(m-1)$ pairwise constraints $|\widetilde{w}_i\cdot\widetilde{w}_j|\le\tau$. For the $r$-th constraint, let $p(r)$ and $q(r)$ be the indices of the first and second vector respectively, i.e., the $r$-th constraint is $|\widetilde{w}_{p(r)}\cdot\widetilde{w}_{q(r)}|\le\tau$. First, we introduce auxiliary variables $\{v_1^{(r)}\}_{r=1}^R$ and $\{v_2^{(r)}\}_{r=1}^R$ to rewrite the problem in Eq.(2.27) in an equivalent form. For each pairwise constraint $|\widetilde{w}_{p(r)}\cdot\widetilde{w}_{q(r)}|\le\tau$, we introduce two auxiliary vectors $v_1^{(r)}$ and $v_2^{(r)}$, and let $\widetilde{w}_{p(r)}=v_1^{(r)}$, $\widetilde{w}_{q(r)}=v_2^{(r)}$, $\|v_1^{(r)}\|_2=1$, $\|v_2^{(r)}\|_2=1$, $|v_1^{(r)}\cdot v_2^{(r)}|\le\tau$. We then obtain the following problem:

$$\begin{array}{ll}\min_{\widetilde{W},V} & L(\widetilde{W})\\ \text{s.t.} & \forall j,\ \|\widetilde{w}_j\|_2=1\\ & \forall r,\ \widetilde{w}_{p(r)}=v_1^{(r)},\ \widetilde{w}_{q(r)}=v_2^{(r)}\\ & \forall r,\ \|v_1^{(r)}\|_2=1,\ \|v_2^{(r)}\|_2=1,\ |v_1^{(r)}\cdot v_2^{(r)}|\le\tau,\end{array}\qquad(2.28)$$

where $V=\{(v_1^{(r)},v_2^{(r)})\}_{r=1}^R$. We then define the augmented Lagrangian, with Lagrange multipliers $Y=\{(y_1^{(r)},y_2^{(r)})\}_{r=1}^R$ and parameter $\rho$:

$$\begin{array}{ll}\min_{\widetilde{W},V,Y} & L(\widetilde{W})+\sum_{r=1}^R\Big(y_1^{(r)}\cdot(\widetilde{w}_{p(r)}-v_1^{(r)})+y_2^{(r)}\cdot(\widetilde{w}_{q(r)}-v_2^{(r)})+\frac{\rho}{2}\|\widetilde{w}_{p(r)}-v_1^{(r)}\|_2^2+\frac{\rho}{2}\|\widetilde{w}_{q(r)}-v_2^{(r)}\|_2^2\Big)\\ \text{s.t.} & \forall j,\ \|\widetilde{w}_j\|_2=1\\ & \forall r,\ \|v_1^{(r)}\|_2=1,\ \|v_2^{(r)}\|_2=1,\ |v_1^{(r)}\cdot v_2^{(r)}|\le\tau,\end{array}\qquad(2.29)$$

which can be solved by alternating among $\widetilde{W}$, $V$, and $Y$.

Solve $\widetilde{W}$ The sub-problem defined over $\widetilde{W}$ is

$$\begin{array}{ll}\min_{\widetilde{W}} & L(\widetilde{W})+\sum_{r=1}^R\Big(y_1^{(r)}\cdot\widetilde{w}_{p(r)}+y_2^{(r)}\cdot\widetilde{w}_{q(r)}+\frac{\rho}{2}\|\widetilde{w}_{p(r)}-v_1^{(r)}\|_2^2+\frac{\rho}{2}\|\widetilde{w}_{q(r)}-v_2^{(r)}\|_2^2\Big)\\ \text{s.t.} & \forall j,\ \|\widetilde{w}_j\|_2=1.\end{array}\qquad(2.30)$$

For sparse coding, we solve this sub-problem using coordinate descent: at each iteration, we update $\widetilde{w}_j$ while fixing the other variables. For neural networks, this sub-problem can be solved using projected gradient descent, which iteratively performs the following three steps: (1) compute the gradient of $\widetilde{w}_j$ using backpropagation; (2) perform a gradient descent update of $\widetilde{w}_j$; (3) project each vector onto the unit sphere: $\widetilde{w}_j\leftarrow\widetilde{w}_j/\|\widetilde{w}_j\|_2$.
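A minimal sketch of one such projected gradient step (`grad_fn` stands in for whatever routine, e.g. backpropagation, supplies the gradient of the sub-problem objective; all names are illustrative):

```python
import numpy as np

def projected_gradient_step(W_tilde, grad_fn, lr=0.01):
    """One update of the direction vectors: gradient step followed by projection onto the unit sphere."""
    grad = grad_fn(W_tilde)                      # (1) gradient of the sub-problem objective
    W_tilde = W_tilde - lr * grad                # (2) gradient descent update
    norms = np.linalg.norm(W_tilde, axis=1, keepdims=True)
    return W_tilde / np.maximum(norms, 1e-12)    # (3) project each row back onto the unit sphere

# toy usage with a quadratic stand-in objective L(W) = 0.5 * ||W - T||^2
T = np.random.randn(8, 16)
grad_fn = lambda W: W - T
W = np.random.randn(8, 16)
W = W / np.linalg.norm(W, axis=1, keepdims=True)
for _ in range(100):
    W = projected_gradient_step(W, grad_fn, lr=0.1)
```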

Solve $v_1^{(r)}, v_2^{(r)}$ The corresponding sub-problem is

$$\begin{array}{ll}\min_{v_1^{(r)},v_2^{(r)}} & -y_1^{(r)}\cdot v_1^{(r)}-y_2^{(r)}\cdot v_2^{(r)}+\frac{\rho}{2}\|\widetilde{w}_{p(r)}-v_1^{(r)}\|_2^2+\frac{\rho}{2}\|\widetilde{w}_{q(r)}-v_2^{(r)}\|_2^2\\ \text{s.t.} & \|v_1^{(r)}\|_2=1,\ \|v_2^{(r)}\|_2=1,\ v_1^{(r)}\cdot v_2^{(r)}\le\tau,\ -v_1^{(r)}\cdot v_2^{(r)}\le\tau.\end{array}\qquad(2.31)$$

Let $\gamma_1,\gamma_2,\lambda_1\ge0,\lambda_2\ge0$ be the KKT multipliers associated with the four constraints in this sub-problem. According to the KKT conditions [53], we have
$$-y_1^{(r)}+\rho(v_1^{(r)}-\widetilde{w}_{p(r)})+2\gamma_1 v_1^{(r)}+(\lambda_1-\lambda_2)v_2^{(r)}=0, \qquad(2.32)$$
$$-y_2^{(r)}+\rho(v_2^{(r)}-\widetilde{w}_{q(r)})+2\gamma_2 v_2^{(r)}+(\lambda_1-\lambda_2)v_1^{(r)}=0. \qquad(2.33)$$
We solve these two equations by examining four cases.

Case 1 First, we assume $\lambda_1=0$ and $\lambda_2=0$. Then $(\rho+2\gamma_1)v_1^{(r)}=y_1^{(r)}+\rho\widetilde{w}_{p(r)}$ and $(\rho+2\gamma_2)v_2^{(r)}=y_2^{(r)}+\rho\widetilde{w}_{q(r)}$. According to the primal feasibility conditions [53] $\|v_1^{(r)}\|_2=1$ and $\|v_2^{(r)}\|_2=1$, we know
$$v_1^{(r)}=\frac{y_1^{(r)}+\rho\widetilde{w}_{p(r)}}{\|y_1^{(r)}+\rho\widetilde{w}_{p(r)}\|_2},\qquad v_2^{(r)}=\frac{y_2^{(r)}+\rho\widetilde{w}_{q(r)}}{\|y_2^{(r)}+\rho\widetilde{w}_{q(r)}\|_2}. \qquad(2.34)$$
We then check whether the constraint $|v_1^{(r)}\cdot v_2^{(r)}|\le\tau$ is satisfied. If so, $v_1^{(r)}$ and $v_2^{(r)}$ are the optimal solution.
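A small sketch of this Case-1 update and the subsequent feasibility check (variable names mirror the symbols above and are otherwise illustrative):

```python
import numpy as np

def case1_update(y1, y2, w_p, w_q, rho, tau):
    """Candidate solution assuming both KKT multipliers lambda_1, lambda_2 are zero (Eq. 2.34).
    Returns (v1, v2) if |v1 . v2| <= tau holds, otherwise None (fall through to Cases 2-4)."""
    v1 = y1 + rho * w_p
    v1 = v1 / np.linalg.norm(v1)
    v2 = y2 + rho * w_q
    v2 = v2 / np.linalg.norm(v2)
    if abs(v1 @ v2) <= tau:
        return v1, v2
    return None

d = 10
y1, y2 = np.random.randn(d), np.random.randn(d)
w_p, w_q = np.random.randn(d), np.random.randn(d)
print(case1_update(y1, y2, w_p, w_q, rho=1.0, tau=0.5))
```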

Case 2 We assume $\lambda_1>0$ and $\lambda_2=0$. Then
$$(\rho+2\gamma_1)v_1^{(r)}+\lambda_1 v_2^{(r)}=y_1^{(r)}+\rho\widetilde{w}_{p(r)}, \qquad(2.35)$$
$$(\rho+2\gamma_2)v_2^{(r)}+\lambda_1 v_1^{(r)}=y_2^{(r)}+\rho\widetilde{w}_{q(r)}. \qquad(2.36)$$
According to the complementary slackness condition [53], we know $v_1^{(r)}\cdot v_2^{(r)}=\tau$. Taking the squared $\ell_2$ norm of both sides of Eq.(2.35), we get
$$(\rho+2\gamma_1)^2+\lambda_1^2+2(\rho+2\gamma_1)\lambda_1\tau=\|y_1^{(r)}+\rho\widetilde{w}_{p(r)}\|_2^2. \qquad(2.37)$$
Similarly, from Eq.(2.36), we get
$$(\rho+2\gamma_2)^2+\lambda_1^2+2(\rho+2\gamma_2)\lambda_1\tau=\|y_2^{(r)}+\rho\widetilde{w}_{q(r)}\|_2^2. \qquad(2.38)$$
Taking the inner product of the two vectors on the left-hand sides of Eqs.(2.35, 2.36), and likewise of those on the right-hand sides, we get
$$(2\rho+2\gamma_1+2\gamma_2)\lambda_1+\big((\rho+2\gamma_1)(\rho+2\gamma_2)+\lambda_1^2\big)\tau=(y_1^{(r)}+\rho\widetilde{w}_{p(r)})^\top(y_2^{(r)}+\rho\widetilde{w}_{q(r)}). \qquad(2.39)$$
Solving the system of equations consisting of Eqs.(2.37-2.39), we obtain the optimal values of $\gamma_1$, $\gamma_2$, and $\lambda_1$. Plugging them into Eq.(2.35) and Eq.(2.36), we obtain a solution of $v_1^{(r)}$ and $v_2^{(r)}$. We then check whether this solution satisfies $-v_1^{(r)}\cdot v_2^{(r)}\le\tau$. If so, it is an optimal solution. In Case 3, we discuss $\lambda_1=0,\lambda_2>0$; in Case 4, we discuss $\lambda_1>0,\lambda_2>0$. The corresponding problems can be solved in a similar way to Case 2.

Solve $y_1^{(r)}, y_2^{(r)}$ We simply perform the following updates:
$$y_1^{(r)}=y_1^{(r)}+\rho(\widetilde{w}_{p(r)}-v_1^{(r)}), \qquad(2.40)$$
$$y_2^{(r)}=y_2^{(r)}+\rho(\widetilde{w}_{q(r)}-v_2^{(r)}). \qquad(2.41)$$
Compared with a vanilla backpropagation algorithm, the major extra cost of this ADMM-based algorithm comes from solving the $R=m(m-1)$ pairs of vectors $\{v_1^{(r)},v_2^{(r)}\}_{r=1}^R$. Solving each pair incurs $O(m)$ cost, so the $R$ pairs bring in a total cost of $O(m^3)$. Such a cubic cost is also incurred by other diversity-promoting methods such as [29, 213]. In practice, $m$ is typically less than 1000, so this $O(m^3)$ cost does not substantially bottleneck computation, as we will validate in the experiments.
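Schematically, one outer ADMM pass interleaves the primal updates with these dual steps; in the hedged sketch below, `update_W_tilde` and `update_v_pair` stand in for the sub-problem solvers described above and are not spelled out:

```python
import numpy as np

def admm_pass(W_tilde, V1, V2, Y1, Y2, pairs, rho, tau, update_W_tilde, update_v_pair):
    """One ADMM sweep for the angle-constrained problem (Eq. 2.28).
    pairs[r] = (p, q) gives the indices of the r-th constrained pair."""
    # primal update of the direction vectors (projected gradient / coordinate descent)
    W_tilde = update_W_tilde(W_tilde, V1, V2, Y1, Y2, pairs, rho)
    for r, (p, q) in enumerate(pairs):
        # primal update of the auxiliary pair (KKT Cases 1-4)
        V1[r], V2[r] = update_v_pair(Y1[r], Y2[r], W_tilde[p], W_tilde[q], rho, tau)
        # dual updates, Eqs. (2.40)-(2.41)
        Y1[r] = Y1[r] + rho * (W_tilde[p] - V1[r])
        Y2[r] = Y2[r] + rho * (W_tilde[q] - V2[r])
    return W_tilde, V1, V2, Y1, Y2

# toy usage with trivial stand-in solvers
update_W = lambda W, V1, V2, Y1, Y2, pairs, rho: W / np.linalg.norm(W, axis=1, keepdims=True)
update_v = lambda y1, y2, wp, wq, rho, tau: (wp.copy(), wq.copy())
m, d = 4, 6
pairs = [(i, j) for i in range(m) for j in range(m) if i != j]   # R = m(m-1) ordered pairs
W = np.random.randn(m, d); W /= np.linalg.norm(W, axis=1, keepdims=True)
V1 = [W[p].copy() for p, q in pairs]; V2 = [W[q].copy() for p, q in pairs]
Y1 = [np.zeros(d) for _ in pairs]; Y2 = [np.zeros(d) for _ in pairs]
out = admm_pass(W, V1, V2, Y1, Y2, pairs, rho=1.0, tau=0.5, update_W, update_v) if False else \
      admm_pass(W, V1, V2, Y1, Y2, pairs, 1.0, 0.5, update_W, update_v)
```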

2.3.3 Evaluation In this section, we empirically show the effectiveness of the angular constraints. The theoretical analysis is deferred to Section 4.2.1.

Sparse Coding Following [392], we applied sparse coding to image feature learning. We used three datasets in the experiments: Scenes-15 [212], Caltech-256 [141], and UIUC-Sport [224]. For each dataset, five random train/test splits were performed and the results were averaged over the five runs. We extracted pixel-level dense SIFT [237] features with step size 8 and patch size 16. On top of the SIFT features, we used sparse coding methods to learn a dictionary and represented each SIFT feature as a sparse code. To obtain image-level features, we applied max-pooling [392] and spatial pyramid matching [212, 392] over the pixel-level sparse codes. A linear SVM was then applied to classify the images. We compared with other diversity-promoting regularizers, including determinant of covariance matrix (DCM) [242], cosine similarity (CS) [402], determinantal point process (DPP) [204, 429], InCoherence (IC) [29], and mutual angles (MA) [369]. We used 5-fold cross validation to tune $\tau$ in $\{0.3, 0.4, \cdots, 1\}$ and the number of basis vectors in $\{50, 100, 200, \cdots, 500\}$. The parameter $\rho$ in ADMM was set to 1. Table 2.14 shows the classification accuracy on the three datasets, from which we can see that, compared with unregularized SC, AC-SC greatly improves performance. For example, on the Sports dataset, AC improves the accuracy from 87.4% to 90.9%. This suggests that AC is effective in reducing overfitting and improving generalization performance. Compared with other diversity-promoting regularizers, AC achieves better performance, demonstrating its better efficacy in promoting diversity.

Neural Networks We evaluated AC on three types of neural networks: a fully-connected NN (FNN) for phone recognition [159], a CNN for image classification [203], and an RNN for question answering [296]. We report results on four datasets: TIMIT [11], CIFAR-10 [5], CNN [154], and DailyMail [154].

FNN for phone recognition The TIMIT dataset contains a total of 6300 sentences (5.4 hours), divided into a training set (462 speakers), a validation set (50 speakers), and a core test set (24 speakers). We used the Kaldi [277] toolkit to train the monophone system, which was utilized to perform forced alignment and to obtain labels for speech frames. The toolkit was also utilized to preprocess the data into log-filter banks. Among FNN-based methods, Karel's recipe in Kaldi achieves state-of-the-art performance. We applied AC to the FNN in this recipe. The inputs of the FNN are the FMLLR [117] features of 21 neighboring frames, which are mean-centered and normalized to have unit variance. The number of hidden layers was 4, and each layer had 1024 hidden units. Stochastic gradient descent (SGD) was used to train the network, with the learning rate set to 0.008. We compared with four diversity-promoting regularizers: CS, IC, MA, and DeCorrelation (DC) [82]. The regularization parameter of these methods was tuned in $\{10^{-6}, 10^{-5}, \cdots, 10^{5}\}$. The $\beta$ parameter in IC was set to 1. Table 2.15 shows state-of-the-art phone error rates (PER) on the TIMIT core test set. Methods in the first panel are mostly based on FNNs and perform less well than Kaldi. Methods in the third panel are all based on RNNs, which in general perform better than FNNs since they are able to capture the temporal structure in speech data. In the second panel, we compare AC with other diversity-promoting regularizers. Without regularization, the error is 18.53%. With AC, the error is reduced to 18.41%, which is very close to a strong RNN-based baseline – connectionist temporal classification (CTC) [139]. Besides, AC outperforms the other regularizers.

CNN for image classification The CIFAR-10 dataset contains 32x32 color images from 10 categories, with 50,000 images for training and 10,000 for testing. We used 5000 training images as the validation set to tune hyperparameters. The data was augmented by first zero-padding the images with 4 pixels on each side, then randomly cropping the padded images to produce 32x32 images. We applied AC to the wide residual network (WideResNet) [406], where the depth was set to 28 and the width to 10. SGD was used for training, with 200 epochs, initial learning rate 0.1, minibatch size 128, Nesterov momentum 0.9, dropout probability 0.3, and weight decay 0.0005. The learning rate was multiplied by 0.2 at epochs 60, 120, and 160. The reported performance is the median of 5 runs. We compared with CS, IC, MA, DC, and locally constrained decorrelation (LCD) [291]. Table 2.16 shows state-of-the-art classification errors on the test set. Compared with the unregularized WideResNet, which achieves an error of 3.89%, applying AC reduces the error to 3.63%. AC achieves a lower error than the other regularizers.

LSTM for question answering We applied AC to the long short-term memory (LSTM) [163] network (Section 2.1.2). The AC was applied to the row vectors of each gate-specific weight matrix. The LSTM was used for a question answering (QA) task on two datasets, CNN and DailyMail [154], each containing a training, development, and test set with 300k/4k/3k and 879k/65k/53k examples respectively. Each example consists of a passage, a question, and an answer. The question is a cloze-style task where an entity is replaced by a placeholder, and the goal is to infer this missing entity (the answer) from all the possible entities appearing in the passage. The LSTM architecture and experimental settings followed the BIDirectional Attention Flow (BIDAF) [296] model, which consists of the following layers: character embedding, word embedding, contextual embedding, attention flow, modeling, and output.
LSTM was applied in the contextual embedding and modeling layers. Character embedding was based on a one-dimensional convolutional neural network, where the number of filters was set to 100 and the width of the receptive field to 5. In the LSTM, the size of the hidden state was set to 100. Optimization was based on AdaDelta [409], where the minibatch size and initial learning rate were set to 48 and 0.5, respectively. The model was trained for 8 epochs. Dropout [313] with probability 0.2 was applied. We compared with four diversity-promoting regularizers: CS, IC, MA, and DC. Table 6.8 shows state-of-the-art accuracy on the two datasets. As can be seen, after applying AC to BIDAF, the accuracy is improved from 76.94% to 77.23% on the CNN test set and from 79.63% to 79.88% on the DailyMail test set. Among the diversity-promoting regularizers, AC achieves the highest accuracy.

Figure 2.5: Phone error rate on TIMIT with varying τ.

Sensitivity to the parameter τ Figure 2.5 shows how the phone error rates vary on the TIMIT core test set. As can be seen, the lowest test error is achieved under a moderate τ (= 0.75). Either a smaller or a larger τ degrades the performance. This suggests that τ effectively incurs a tradeoff between estimation and approximation errors. When τ is close to 0, the hidden units are close to being orthogonal, which yields much poorer performance. This confirms that the strict-orthogonality constraint proposed by [213] is too restrictive and is less favorable than a “soft” regularization approach.

Computational time We compare the computational time of neural networks under different regularizers. Table 2.18 shows the total runtime of FNNs on TIMIT and CNNs on CIFAR-10 with a single GTX TITAN X GPU, and the runtime of LSTM networks on the CNN dataset with 2 TITAN X GPUs. Compared with no regularization, AC incurs 18.2% extra time on TIMIT, 12.7% on CIFAR-10, and 14.8% on CNN. The runtime of AC is comparable to that of the other diversity-promoting regularizers.

2.3.4 Appendix: Proofs and Details of Algorithms

Proof of Lemma 1 To prove Lemma 1, we need the following lemma.

Lemma 3. Let $G$ be the Gram matrix defined on $W=\{w_i\}_{i=1}^m$ and let $g'_i$ be the gradient of $\det(G)$ w.r.t. $w_i$. Then $\langle g'_i,w_j\rangle=0$ for all $j\neq i$, and $\langle g'_i,w_i\rangle>0$.

Proof. We decompose $w_i$ into $w_i=w_i^{\parallel}+w_i^{\perp}$, where $w_i^{\parallel}$ is in the span of $W\setminus\{w_i\}$, i.e., $w_i^{\parallel}=\sum_{j\neq i}\alpha_j w_j$ with linear coefficients $\{\alpha_j\}_{j\neq i}$, and $w_i^{\perp}$ is orthogonal to $W\setminus\{w_i\}$, i.e., $\langle w_i^{\perp},w_j\rangle=0$ for all $j\neq i$. Let $c_j$ denote the $j$-th column of $G$. Subtracting $\sum_{j\neq i}\alpha_j c_j$ from the $i$-th column, we get

$$\det(G)=\begin{vmatrix}
\langle w_1,w_1\rangle & \cdots & 0 & \cdots & \langle w_1,w_m\rangle\\
\langle w_2,w_1\rangle & \cdots & 0 & \cdots & \langle w_2,w_m\rangle\\
\vdots & & \vdots & & \vdots\\
\langle w_i,w_1\rangle & \cdots & \langle w_i^{\perp},w_i\rangle & \cdots & \langle w_i,w_m\rangle\\
\vdots & & \vdots & & \vdots\\
\langle w_m,w_1\rangle & \cdots & 0 & \cdots & \langle w_m,w_m\rangle
\end{vmatrix}. \qquad(2.42)$$

Expanding the determinant along the $i$-th column, we get
$$\det(G)=\det(G_{-i})\,\langle w_i^{\perp},w_i\rangle, \qquad(2.43)$$
where $G_{-i}$ is the Gram matrix defined on $W\setminus\{w_i\}$. The functional gradient $g'_i$ of $\det(G)$ w.r.t. $w_i$ is then $\det(G_{-i})\,w_i^{\perp}$, which is orthogonal to $w_j$ for all $j\neq i$. Since $G$ is full rank, we know $\det(G)>0$, and from Eq.(2.43) it follows that $\langle w_i^{\perp},w_i\rangle$ is positive. Hence $\langle g'_i,w_i\rangle=\det(G_{-i})\langle w_i^{\perp},w_i\rangle>0$.

We are now ready to prove Lemma 1. We first compute $\hat{s}_{ij}$:
$$\hat{s}_{ij}=\frac{|\langle\widehat{w}_i,\widehat{w}_j\rangle|}{\|\widehat{w}_i\|\,\|\widehat{w}_j\|}=\frac{|\langle w_i-\eta g_i,\,w_j-\eta g_j\rangle|}{\|w_i-\eta g_i\|\,\|w_j-\eta g_j\|}. \qquad(2.44)$$
The functional gradient of $\log\det(G)$ w.r.t. $w_i$ is $g''_i=\frac{1}{\det(G)}g'_i$. According to Lemma 3 and the fact that $\det(G)>0$, we know $\langle g''_i,w_j\rangle=0$ for all $j\neq i$, and $\langle g''_i,w_i\rangle>0$. Then we have
$$\begin{aligned}
|\langle w_i-\eta g_i,\,w_j-\eta g_j\rangle|
&=|\langle w_i-\eta(2w_i-2g''_i),\,w_j-\eta(2w_j-2g''_j)\rangle|\\
&=|\langle(1-2\eta)w_i+2\eta g''_i,\,(1-2\eta)w_j+2\eta g''_j\rangle|\\
&=|(1-2\eta)^2\langle w_i,w_j\rangle+4\eta^2\langle g''_i,g''_j\rangle|\\
&=|\langle w_i,w_j\rangle|\,\big|(1-2\eta)^2+4\eta^2\langle g''_i,g''_j\rangle/\langle w_i,w_j\rangle\big|,
\end{aligned}\qquad(2.45)$$
and
$$\frac{1}{\|w_i-\eta g_i\|}=\frac{1}{\sqrt{\|w_i\|^2-2\eta\langle w_i,g_i\rangle+\eta^2\|g_i\|^2}}
=\frac{1}{\|w_i\|}\cdot\frac{1}{\sqrt{1-\frac{2\eta\langle w_i,g_i\rangle}{\|w_i\|^2}+\frac{\eta^2\|g_i\|^2}{\|w_i\|^2}}}. \qquad(2.46)$$
According to the Taylor expansion, we have
$$\frac{1}{\sqrt{1+x}}=1-\frac{1}{2}x+o(x). \qquad(2.47)$$
Then
$$\frac{1}{\sqrt{1-\frac{2\eta\langle w_i,g_i\rangle}{\|w_i\|^2}+\frac{\eta^2\|g_i\|^2}{\|w_i\|^2}}}
=1-\frac{1}{2}\Big(-\frac{2\eta\langle w_i,g_i\rangle}{\|w_i\|^2}+\frac{\eta^2\|g_i\|^2}{\|w_i\|^2}\Big)+o\Big(-\frac{2\eta\langle w_i,g_i\rangle}{\|w_i\|^2}+\frac{\eta^2\|g_i\|^2}{\|w_i\|^2}\Big), \qquad(2.48)$$
where
$$\frac{2\eta\langle w_i,g_i\rangle}{\|w_i\|^2}=\frac{2\eta\langle w_i,2w_i-2g''_i\rangle}{\|w_i\|^2}=4\eta-4\eta\frac{\langle w_i,g''_i\rangle}{\|w_i\|^2}. \qquad(2.49)$$
Hence
$$\frac{1}{\sqrt{1-\frac{2\eta\langle w_i,g_i\rangle}{\|w_i\|^2}+\frac{\eta^2\|g_i\|^2}{\|w_i\|^2}}}=1+2\eta-2\eta\frac{\langle w_i,g''_i\rangle}{\|w_i\|^2}+o(\eta), \qquad(2.50)$$
and
$$\frac{1}{\sqrt{1-\frac{2\eta\langle w_i,g_i\rangle}{\|w_i\|^2}+\frac{\eta^2\|g_i\|^2}{\|w_i\|^2}}}\cdot\frac{1}{\sqrt{1-\frac{2\eta\langle w_j,g_j\rangle}{\|w_j\|^2}+\frac{\eta^2\|g_j\|^2}{\|w_j\|^2}}}
=(1+2\eta)^2-2\eta\Big(\frac{\langle w_i,g''_i\rangle}{\|w_i\|^2}+\frac{\langle w_j,g''_j\rangle}{\|w_j\|^2}\Big)+o(\eta). \qquad(2.51)$$
Then
$$\begin{aligned}
\hat{s}_{ij}&=\frac{|\langle w_i,w_j\rangle|}{\|w_i\|\|w_j\|}\,\Big|(1-2\eta)^2+\frac{4\eta^2\langle g''_i,g''_j\rangle}{\langle w_i,w_j\rangle}\Big|\,\Big((1+2\eta)^2-2\eta\Big(\frac{\langle w_i,g''_i\rangle}{\|w_i\|^2}+\frac{\langle w_j,g''_j\rangle}{\|w_j\|^2}\Big)+o(\eta)\Big)\\
&=\frac{|\langle w_i,w_j\rangle|}{\|w_i\|\|w_j\|}\,\Big((1-2\eta)^2+\frac{4\eta^2\langle g''_i,g''_j\rangle}{\langle w_i,w_j\rangle}\Big)\,\Big((1+2\eta)^2-2\eta\Big(\frac{\langle w_i,g''_i\rangle}{\|w_i\|^2}+\frac{\langle w_j,g''_j\rangle}{\|w_j\|^2}\Big)+o(\eta)\Big)\\
&=s_{ij}\Big(1-2\eta\Big(\frac{\langle w_i,g''_i\rangle}{\|w_i\|^2}+\frac{\langle w_j,g''_j\rangle}{\|w_j\|^2}\Big)+o(\eta)\Big)\\
&<s_{ij},
\end{aligned}\qquad(2.52)$$
where in the second step the absolute value can be removed because there always exists a sufficiently small $\eta$ such that $(1-2\eta)^2+4\eta^2\langle g''_i,g''_j\rangle/\langle w_i,w_j\rangle>0$. Eq.(2.52) holds for all $i,j$; hence $s(\widehat{W})<s(W)$. The proof completes.

Proof of Lemma 2 To prove this lemma, we need the following lemma.

Lemma 4. Given a nondecreasing sequence $b=(b_i)_{i=1}^n$ and a strictly decreasing function $g(x)$ which satisfies $0\le g(b_i)\le\min\{b_{i+1}-b_i: i=1,2,\cdots,n-1,\ b_{i+1}\neq b_i\}$, define a sequence $c=(c_i)_{i=1}^n$ where $c_i=b_i+g(b_i)$. If $b_1<b_n$, then $\mathrm{var}(c)<\mathrm{var}(b)$, where $\mathrm{var}(\cdot)$ denotes the variance of a sequence. Furthermore, let $n'=\max\{j: b_j\neq b_n\}$ and define a sequence $b'=(b'_i)_{i=1}^n$ where $b'_i=b_i+g(b_n)+(g(b_{n'})-g(b_n))\mathbb{I}(i\le n')$ and $\mathbb{I}(\cdot)$ is the indicator function; then $\mathrm{var}(c)\le\mathrm{var}(b')<\mathrm{var}(b)$.

The intuition behind the proof of Lemma 2 is the following: when the step size $\eta$ is sufficiently small, we can ensure that the changes of the smaller angles (between consecutive iterations) are larger than the changes of the larger angles; Lemma 4 can then be used to prove that the variance decreases. Let $\theta_{ij}$ denote $\arccos(s_{ij})$. We sort the $\theta_{ij}$ in nondecreasing order and denote the resultant sequence by $\theta=(\theta_k)_{k=1}^n$; then $\mathrm{var}((\theta_{ij}))=\mathrm{var}(\theta)$. We use the same order to index $\hat{\theta}_{ij}=\arccos(\hat{s}_{ij})$ and denote the resultant sequence by $\hat{\theta}=(\hat{\theta}_k)_{k=1}^n$; then $\mathrm{var}((\hat{\theta}_{ij}))=\mathrm{var}(\hat{\theta})$. Let
$$g(\theta_{ij})=\begin{cases}\dfrac{2\cos(\theta_{ij})}{\sqrt{1-\cos^2(\theta_{ij})}}\,\eta, & \theta_{ij}<\frac{\pi}{2},\\ 0, & \theta_{ij}=\frac{\pi}{2};\end{cases}$$
then $g(\theta_{ij})$ is a strictly decreasing function. Let $\tilde{\theta}_k=\theta_k+g(\theta_k)$. It is easy to see that when $\eta$ is sufficiently small, $0\le g(\theta_k)\le\min\{\theta_{k+1}-\theta_k: k=1,2,\cdots,n-1,\ \theta_{k+1}\neq\theta_k\}$. According to Lemma 4, we have $\mathrm{var}(\tilde{\theta})<\mathrm{var}(\theta)$, where $\tilde{\theta}=(\tilde{\theta}_k)_{k=1}^n$. Furthermore, let $n'=\max\{j:\theta_j\neq\theta_n\}$ and $\theta'_k=\theta_k+g(\theta_n)+(g(\theta_{n'})-g(\theta_n))\mathbb{I}(k\le n')$; then $\mathrm{var}(\tilde{\theta})\le\mathrm{var}(\theta')<\mathrm{var}(\theta)$, where $\theta'=(\theta'_k)_{k=1}^n$. $\mathrm{var}(\theta')$ can be written as
$$\begin{aligned}
\mathrm{var}(\theta')&=\frac{1}{n}\sum_{i=1}^n\Big(\theta'_i-\frac{1}{n}\sum_{j=1}^n\theta'_j\Big)^2
=\frac{1}{n}\sum_{i=1}^n\Big(\theta_i+(g(\theta_{n'})-g(\theta_n))\mathbb{I}(i\le n')-\frac{1}{n}\sum_{j=1}^n\theta_j-\frac{n'}{n}(g(\theta_{n'})-g(\theta_n))\Big)^2\\
&=\mathrm{var}(\theta)+\frac{2}{n}\sum_{i=1}^n\Big(\theta_i-\frac{1}{n}\sum_{j=1}^n\theta_j\Big)(g(\theta_{n'})-g(\theta_n))\Big(\mathbb{I}(i\le n')-\frac{n'}{n}\Big)+\frac{1}{n}\sum_{i=1}^n(g(\theta_{n'})-g(\theta_n))^2\Big(\mathbb{I}(i\le n')-\frac{n'}{n}\Big)^2.
\end{aligned}\qquad(2.53)$$
Let $\lambda=\frac{2}{n}\sum_{i=1}^n(\theta_i-\frac{1}{n}\sum_{j=1}^n\theta_j)(\mathbb{I}(i\le n')-\frac{n'}{n})$, which can be further written as
$$\begin{aligned}
\lambda&=\frac{2}{n}\sum_{i=1}^{n'}\Big(\theta_i-\frac{1}{n}\sum_{j=1}^n\theta_j\Big)\Big(1-\frac{n'}{n}\Big)+\frac{2}{n}\sum_{i=n'+1}^{n}\Big(\theta_i-\frac{1}{n}\sum_{j=1}^n\theta_j\Big)\Big(-\frac{n'}{n}\Big)\\
&=\frac{2}{n}\Big(\sum_{i=1}^{n'}\theta_i-\frac{n'}{n}\sum_{j=1}^n\theta_j\Big)\Big(1-\frac{n'}{n}\Big)+\frac{2}{n}\Big(\sum_{i=n'+1}^{n}\theta_i-\frac{n-n'}{n}\sum_{j=1}^n\theta_j\Big)\Big(-\frac{n'}{n}\Big)\\
&=\frac{2n'(n-n')}{n^2}\Big(\frac{1}{n'}\sum_{i=1}^{n'}\theta_i-\frac{1}{n-n'}\sum_{i=n'+1}^{n}\theta_i\Big).
\end{aligned}\qquad(2.54)$$
As $(\theta_k)$ is nondecreasing and $\theta_{n'}\neq\theta_n$, we have $\lambda<0$. Let
$$\mu=\begin{cases}\dfrac{2\cos(\theta_{n'})}{\sqrt{1-\cos^2(\theta_{n'})}}-\dfrac{2\cos(\theta_n)}{\sqrt{1-\cos^2(\theta_n)}}, & \theta_n<\frac{\pi}{2},\\[2mm] \dfrac{2\cos(\theta_{n'})}{\sqrt{1-\cos^2(\theta_{n'})}}, & \theta_n=\frac{\pi}{2};\end{cases}$$
then $g(\theta_{n'})-g(\theta_n)=\mu\eta$ and $\mu>0$. Substituting $\lambda$ and $\mu$ into $\mathrm{var}(\theta')$, we obtain
$$\mathrm{var}(\theta')=\mathrm{var}(\theta)+\lambda\mu\eta+\frac{1}{n}\sum_{i=1}^n\Big(\mathbb{I}(i\le n')-\frac{n'}{n}\Big)^2\mu^2\eta^2=\mathrm{var}(\theta)+\lambda\mu\eta+o(\eta).$$
Note that $\lambda<0$ and $\mu>0$, so $\exists\,\delta_1>0$ such that $\eta<\delta_1\Rightarrow\mathrm{var}(\theta')<\mathrm{var}(\theta)+\frac{\lambda\mu}{2}\eta$. As $\mathrm{var}(\tilde{\theta})\le\mathrm{var}(\theta')$, we can draw the conclusion that $\mathrm{var}(\tilde{\theta})<\mathrm{var}(\theta)+\frac{\lambda\mu}{2}\eta$. Further,
$$\mathrm{var}(\hat{\theta})=\frac{1}{n}\sum_{i=1}^n\Big(\hat{\theta}_i-\frac{1}{n}\sum_{j=1}^n\hat{\theta}_j\Big)^2=\frac{1}{n}\sum_{i=1}^n\Big(\tilde{\theta}_i+o(\eta)-\frac{1}{n}\sum_{j=1}^n\tilde{\theta}_j+o(\eta)\Big)^2=\frac{1}{n}\sum_{i=1}^n\Big(\tilde{\theta}_i-\frac{1}{n}\sum_{j=1}^n\tilde{\theta}_j\Big)^2+o(\eta)=\mathrm{var}(\tilde{\theta})+o(\eta).$$
So $\exists\,\delta_2>0$ such that $\eta<\delta_2\Rightarrow\mathrm{var}(\hat{\theta})<\mathrm{var}(\tilde{\theta})-\frac{\lambda\mu}{4}\eta$. Let $\delta=\min\{\delta_1,\delta_2\}$; then
$$\eta<\delta\ \Rightarrow\ \mathrm{var}(\hat{\theta})<\mathrm{var}(\theta)+\frac{\lambda\mu}{4}\eta<\mathrm{var}(\theta)\ \Rightarrow\ \mathrm{var}(\hat{\theta})<\mathrm{var}(\theta).$$
The proof completes.

Proof of Lemma 4 The intuition behind the proof is that we can view the differences between corresponding elements of the sequences $b$ and $c$ as "updates", and these updates let the smaller elements "catch up" with the larger ones. Alternatively, we can obtain the new sequence $c$ through a series of updates: first, we update the whole sequence $b$ by the update value of the largest elements, so that the largest elements reach their final values; then we pick the elements smaller than the largest ones and update them by the update value of the second largest elements minus the previous update, so that the second largest elements reach their final values; and so on. In this manner we obtain a sequence of sequences, where the first sequence is $b$, the third sequence is $b'$, the last sequence is $c$, and adjacent sequences differ only by a simple update: all elements up to some position are increased by the same value, and the remaining elements are unchanged. We can prove that such a simple update decreases the variance under certain conditions, and use this to prove $\mathrm{var}(c)\le\mathrm{var}(b')<\mathrm{var}(b)$.

The formal proof starts here. Following the intuition above, we construct a sequence of sequences with decreasing variance, in which the variance of the first sequence is $\mathrm{var}(b)$ and the variance of the last sequence is $\mathrm{var}(c)$. We sort the unique values in $b$ in ascending order and denote the resultant sequence by $d=(d_j)_{j=1}^m$. Let $l(j)=\max\{i: b_i=d_j\}$ and $u(i)=\{j: d_j=b_i\}$. We construct a sequence of sequences $h^{(j)}=(h_i^{(j)})_{i=1}^n$, $j=1,2,\cdots,m+1$, as follows:
• $h_i^{(1)}=b_i$ for $i=1,2,\cdots,n$;
• $h_i^{(j+1)}=h_i^{(j)}$ for $j=1,2,\cdots,m$ and $l(m-j+1)<i\le n$;
• $h_i^{(2)}=h_i^{(1)}+g(d_m)$ for $1\le i\le l(m)$;
• $h_i^{(j+1)}=h_i^{(j)}+g(d_{m-j+1})-g(d_{m-j+2})$ for $j=2,3,\cdots,m$ and $1\le i\le l(m-j+1)$.
From the definition of $h^{(j)}$, we know $\mathrm{var}(h^{(1)})=\mathrm{var}(b)$. As $b_1<b_n$, we have $m\ge2$. We now prove that $\mathrm{var}(h^{(m+1)})=\mathrm{var}(c)$ and that $\mathrm{var}(h^{(j+1)})\le\mathrm{var}(h^{(j)})$ for all $j=1,2,\cdots,m$, with strict decrease for $j\ge2$.

First, we prove $\mathrm{var}(h^{(m+1)})=\mathrm{var}(c)$; in fact, we can prove $h^{(m+1)}=c$:
$$h_i^{(m+1)}=\sum_{j=1}^{m}(h_i^{(j+1)}-h_i^{(j)})+h_i^{(1)}=\sum_{j=1}^{m+1-u(i)}(h_i^{(j+1)}-h_i^{(j)})+\sum_{j=m+2-u(i)}^{m}(h_i^{(j+1)}-h_i^{(j)})+b_i.$$
As $j\ge m+2-u(i)\iff u(i)\ge m+2-j\iff d_{m+2-j}\le d_{u(i)}=b_i\iff l(m+1-j)<i$, we know that
$$h_i^{(j+1)}=\begin{cases}h_i^{(j)}, & j\ge m+2-u(i),\\ h_i^{(j)}+g(d_{m-j+1})-g(d_{m-j+2}), & 2\le j\le m+1-u(i),\\ h_i^{(j)}+g(d_m), & j=1.\end{cases}$$
So we have
$$h_i^{(m+1)}=\sum_{j=1}^{m+1-u(i)}(h_i^{(j+1)}-h_i^{(j)})+b_i=g(d_m)+\sum_{j=2}^{m+1-u(i)}\big(g(d_{m-j+1})-g(d_{m-j+2})\big)+b_i=g(d_m)+g(d_{u(i)})-g(d_m)+b_i=g(b_i)+b_i=c_i.$$
So $\mathrm{var}(h^{(m+1)})=\mathrm{var}(c)$.

Next, we prove that $\mathrm{var}(h^{(j+1)})<\mathrm{var}(h^{(j)})$ for $j\ge2$. First, we need to show that for any $j$, $h_i^{(j)}$ is a non-decreasing sequence in $i$. For this, it suffices to show that $\forall j\ge2$, $h_i^{(j+1)}-h_i^{(j)}<h_{l(m-j+1)+1}^{(j)}-h_{l(m-j+1)}^{(j)}$: from $h^{(j)}$ to $h^{(j+1)}$, the elements up to position $l(m-j+1)$ are increased by the same value and the elements after position $l(m-j+1)+1$ are unchanged, so if the $l(m-j+1)$-th element does not exceed the $(l(m-j+1)+1)$-th element, the order of the whole sequence is preserved during the update. The proof is as follows. For all $j\ge2$, $h_{l(m-j+1)+1}^{(j)}=\sum_{k=1}^{j-1}\big(h_{l(m-j+1)+1}^{(k+1)}-h_{l(m-j+1)+1}^{(k)}\big)+h_{l(m-j+1)+1}^{(1)}$. As $k\le j-1\Rightarrow l(m-k+1)\ge l(m-j+2)=l\big((m-j+1)+1\big)\ge l(m-j+1)+1$, from the definition of $h$ we know that
$$h_{l(m-j+1)+1}^{(k+1)}-h_{l(m-j+1)+1}^{(k)}=\begin{cases}g(d_{m-k+1})-g(d_{m-k+2}), & k\ge2,\\ g(d_m), & k=1.\end{cases}$$
So we have
$$h_{l(m-j+1)+1}^{(j)}=g(d_m)+\sum_{k=2}^{j-1}\big(g(d_{m-k+1})-g(d_{m-k+2})\big)+b_{l(m-j+1)+1}=g(d_{m-j+2})+b_{l(m-j+1)+1}.$$
From the definition of $l(\cdot)$, we have $b_{l(m-j+1)+1}=d_{m-j+2}$, so $h_{l(m-j+1)+1}^{(j)}=b_{l(m-j+1)+1}+g(b_{l(m-j+1)+1})$. Similarly, $h_{l(m-j+1)}^{(j)}=b_{l(m-j+1)}+g(b_{l(m-j+1)+1})=b_{l(m-j+1)}+g(b_{l(m-j+1)})-\big(g(d_{m-j+1})-g(d_{m-j+2})\big)$. So
$$h_{l(m-j+1)+1}^{(j)}-h_{l(m-j+1)}^{(j)}=b_{l(m-j+1)+1}-b_{l(m-j+1)}+\big(g(b_{l(m-j+1)+1})-g(b_{l(m-j+1)})\big)+\big(g(d_{m-j+1})-g(d_{m-j+2})\big).$$
As the function $g(x)$ is bounded between 0 and $b_{i+1}-b_i$, we have $g(b_{l(m-j+1)+1})-g(b_{l(m-j+1)})>-(b_{l(m-j+1)+1}-b_{l(m-j+1)})$. So
$$h_{l(m-j+1)+1}^{(j)}-h_{l(m-j+1)}^{(j)}>0+\big(g(d_{m-j+1})-g(d_{m-j+2})\big)=g(d_{m-j+1})-g(d_{m-j+2}). \qquad(2.55)$$
As $h_i^{(j+1)}-h_i^{(j)}$ is either 0 or $g(d_{m-j+1})-g(d_{m-j+2})$, which is positive, we have proved that $\forall j\ge2$, $h_i^{(j+1)}-h_i^{(j)}<h_{l(m-j+1)+1}^{(j)}-h_{l(m-j+1)}^{(j)}$. According to the discussion above, for a fixed $j$, $h_i^{(j)}$ is therefore a non-decreasing sequence.

We can now prove $\mathrm{var}(h^{(j+1)})<\mathrm{var}(h^{(j)})$ for $j\ge2$. If $j=1$, then $l(m)=n$ and $h_i^{(2)}-h_i^{(1)}=g(d_m)$ for all $i$, so $\mathrm{var}(h^{(2)})=\mathrm{var}(h^{(1)})$. For $j\ge2$, let $\Delta^{(j)}=g(d_{m-j+1})-g(d_{m-j+2})$ and $l=l(m-j+1)$. We first use the recursive definition of $h$ to express $h^{(j+1)}$ in terms of $h^{(j)}$:
$$\mathrm{var}(h^{(j+1)})=\frac{1}{n}\sum_{i=1}^n\Big(h_i^{(j+1)}-\frac{1}{n}\sum_{i=1}^n h_i^{(j+1)}\Big)^2=\frac{1}{n}\sum_{i=1}^n\Big(h_i^{(j)}+\mathbb{I}(i\le l)\Delta^{(j)}-\frac{1}{n}\sum_{i=1}^n h_i^{(j)}-\frac{l}{n}\Delta^{(j)}\Big)^2.$$
Expanding this expression with simple algebra, we have
$$\mathrm{var}(h^{(j+1)})=\mathrm{var}(h^{(j)})+\frac{l\Delta^{(j)}}{n}\Big[2\Big(\frac{1}{l}\sum_{i=1}^l h_i^{(j)}-\frac{1}{n}\sum_{i=1}^n h_i^{(j)}\Big)+\frac{n-l}{n}\Delta^{(j)}\Big]=\mathrm{var}(h^{(j)})+\frac{l\Delta^{(j)}}{n}\Big[2\Big(\frac{1}{l}\sum_{i=1}^l h_i^{(j)}-\frac{1}{n}\sum_{i=1}^l h_i^{(j)}-\frac{1}{n}\sum_{i=l+1}^n h_i^{(j)}\Big)+\frac{n-l}{n}\Delta^{(j)}\Big].$$
Noting that $h_{l+1}^{(j)}-h_l^{(j)}=\Delta^{(j)}$, we can further obtain
$$\mathrm{var}(h^{(j+1)})=\mathrm{var}(h^{(j)})+\frac{l\Delta^{(j)}}{n}\Big[2\Big(\frac{1}{l}\sum_{i=1}^l h_i^{(j)}-\frac{1}{n}\sum_{i=1}^l h_i^{(j)}-\frac{1}{n}\sum_{i=l+1}^n\big(h_i^{(j)}-h_{l+1}^{(j)}+h_l^{(j)}\big)\Big)-\frac{n-l}{n}\Delta^{(j)}\Big].$$
Since for a fixed $j$, $h_i^{(j)}$ is a non-decreasing sequence, we have $\forall i\ge l+1$, $h_i^{(j)}-h_{l+1}^{(j)}+h_l^{(j)}\ge h_{l+1}^{(j)}-h_{l+1}^{(j)}+h_l^{(j)}=h_l^{(j)}$, and $\frac{1}{l}\sum_{i=1}^l h_i^{(j)}\le h_l^{(j)}$. So $\frac{1}{l}\sum_{i=1}^l h_i^{(j)}-\frac{1}{n}\sum_{i=1}^l h_i^{(j)}-\frac{1}{n}\sum_{i=l+1}^n(h_i^{(j)}-h_{l+1}^{(j)}+h_l^{(j)})\le0$, and hence $\mathrm{var}(h^{(j+1)})\le\mathrm{var}(h^{(j)})-\frac{l\Delta^{(j)}}{n}\cdot\frac{n-l}{n}\Delta^{(j)}<\mathrm{var}(h^{(j)})$.

Putting the above results together, since $\mathrm{var}(h^{(j+1)})\le\mathrm{var}(h^{(j)})$ with strict inequality for $j\ge2$, $m\ge2$, and $\mathrm{var}(h^{(m+1)})=\mathrm{var}(c)$, we know that $\mathrm{var}(c)<\mathrm{var}(h^{(1)})=\mathrm{var}(b)$. Furthermore, let $n'=\max\{j: b_j\neq b_n\}$; then $\forall i$, $h_i^{(2)}=h_i^{(1)}+g(d_m)=b_i+g(b_n)$ and $h_i^{(3)}=h_i^{(2)}+(g(d_{m-1})-g(d_m))\mathbb{I}(i\le l(m-1))=b_i+g(b_n)+(g(b_{n'})-g(b_n))\mathbb{I}(i\le n')=b'_i$, so $\mathrm{var}(c)\le\mathrm{var}(b')<\mathrm{var}(b)$. The proof completes.

Solve $\widetilde{W}$ The sub-problem defined over $\widetilde{W}$ is

$$\begin{array}{ll}\min_{\widetilde{W}} & L(\widetilde{W})+\sum_{r=1}^R\Big(y_1^{(r)}\cdot\widetilde{w}_{p(r)}+y_2^{(r)}\cdot\widetilde{w}_{q(r)}+\rho\|\widetilde{w}_{p(r)}-v_1^{(r)}\|_2^2+\rho\|\widetilde{w}_{q(r)}-v_2^{(r)}\|_2^2\Big)\\ \text{s.t.} & \forall j,\ \|\widetilde{w}_j\|_2=1.\end{array}\qquad(2.56)$$

In sparse coding, this problem becomes

$$\begin{array}{ll}\min_{\widetilde{W}} & \frac{1}{2}\sum_{i=1}^n\big\|x_i-\sum_{j=1}^m\alpha_{ij}g_j\widetilde{w}_j\big\|_2^2+\lambda_2\sum_{j=1}^m\|g_j\widetilde{w}_j\|_2^2+\sum_{r=1}^R\Big(y_1^{(r)}\cdot\widetilde{w}_{p(r)}+y_2^{(r)}\cdot\widetilde{w}_{q(r)}+\rho\|\widetilde{w}_{p(r)}-v_1^{(r)}\|_2^2+\rho\|\widetilde{w}_{q(r)}-v_2^{(r)}\|_2^2\Big)\\ \text{s.t.} & \forall j,\ \|\widetilde{w}_j\|_2=1.\end{array}\qquad(2.57)$$

We solve this sub-problem using coordinate descent: at each iteration, we update $\widetilde{w}_j$ while fixing the other variables. The sub-problem defined on $\widetilde{w}_j$ can be written as

$$\begin{array}{ll}\min_{\widetilde{w}_j} & \frac{1}{2}\sum_{i=1}^n\|b_i-\alpha_{ij}g_j\widetilde{w}_j\|_2^2+\lambda_2\|g_j\widetilde{w}_j\|_2^2+\sum_{r=1}^R\Big(\mathbb{I}(p(r)=j)\big(y_1^{(r)}\cdot\widetilde{w}_{p(r)}+\rho\|\widetilde{w}_{p(r)}-v_1^{(r)}\|_2^2\big)+\mathbb{I}(q(r)=j)\big(y_2^{(r)}\cdot\widetilde{w}_{q(r)}+\rho\|\widetilde{w}_{q(r)}-v_2^{(r)}\|_2^2\big)\Big)\\ \text{s.t.} & \|\widetilde{w}_j\|_2=1,\end{array}\qquad(2.58)$$

where $b_i=x_i-\sum_{k\neq j}\alpha_{ik}g_k\widetilde{w}_k$. Let $\gamma$ be the KKT multiplier. According to the KKT conditions, we have
$$\sum_{i=1}^n\alpha_{ij}g_j(\alpha_{ij}g_j\widetilde{w}_j-b_i)+2\lambda_2 g_j^2\widetilde{w}_j+y_j+2\rho N\widetilde{w}_j-2\rho v_j+2\gamma\widetilde{w}_j=0 \qquad(2.59)$$
and
$$\|\widetilde{w}_j\|_2=1, \qquad(2.60)$$
where $y_j=\sum_{r=1}^R\big(\mathbb{I}(p(r)=j)y_1^{(r)}+\mathbb{I}(q(r)=j)y_2^{(r)}\big)$, $v_j=\sum_{r=1}^R\big(\mathbb{I}(p(r)=j)v_1^{(r)}+\mathbb{I}(q(r)=j)v_2^{(r)}\big)$, and $N=\sum_{r=1}^R\big(\mathbb{I}(p(r)=j)+\mathbb{I}(q(r)=j)\big)$. From Eq.(2.59), we get
$$\widetilde{w}_j=\frac{\sum_{i=1}^n\alpha_{ij}g_j b_i-y_j+2\rho v_j}{\sum_{i=1}^n\alpha_{ij}^2 g_j^2+2\lambda_2 g_j^2+2\rho N+2\gamma}. \qquad(2.61)$$
Plugging Eq.(2.61) into Eq.(2.60), we get the solution of $\gamma$; plugging $\gamma$ back into Eq.(2.61), we get the solution of $\widetilde{w}_j$.

Solve $v_1^{(r)}, v_2^{(r)}$ in Cases 3 and 4

Case 3 We assume $\lambda_1=0$ and $\lambda_2>0$. Then
$$(\rho+2\gamma_1)v_1^{(r)}-\lambda_2 v_2^{(r)}=y_1^{(r)}+\rho\widetilde{w}_{p(r)}, \qquad(2.62)$$
$$(\rho+2\gamma_2)v_2^{(r)}-\lambda_2 v_1^{(r)}=y_2^{(r)}+\rho\widetilde{w}_{q(r)}. \qquad(2.63)$$
According to the complementary slackness condition, we know $v_1^{(r)}\cdot v_2^{(r)}=-\tau$. Taking the squared $\ell_2$ norm of both sides of Eq.(2.62), we get
$$(\rho+2\gamma_1)^2+\lambda_2^2+2(\rho+2\gamma_1)\lambda_2\tau=\|y_1^{(r)}+\rho\widetilde{w}_{p(r)}\|_2^2. \qquad(2.64)$$
Similarly, from Eq.(2.63), we get
$$(\rho+2\gamma_2)^2+\lambda_2^2+2(\rho+2\gamma_2)\lambda_2\tau=\|y_2^{(r)}+\rho\widetilde{w}_{q(r)}\|_2^2. \qquad(2.65)$$
Taking the inner product of the two vectors on the left-hand sides of Eqs.(2.62, 2.63), and likewise of those on the right-hand sides, we get
$$-(2\rho+2\gamma_1+2\gamma_2)\lambda_2-\big((\rho+2\gamma_1)(\rho+2\gamma_2)+\lambda_2^2\big)\tau=(y_1^{(r)}+\rho\widetilde{w}_{p(r)})^\top(y_2^{(r)}+\rho\widetilde{w}_{q(r)}). \qquad(2.66)$$
Solving the system of equations consisting of Eqs.(2.64-2.66), we obtain the optimal values of $\gamma_1$, $\gamma_2$, and $\lambda_2$. Plugging them into Eq.(2.62) and Eq.(2.63), we obtain a solution of $v_1^{(r)}$ and $v_2^{(r)}$. We then check whether this solution satisfies $v_1^{(r)}\cdot v_2^{(r)}\le\tau$. If so, it is an optimal solution.

Case 4 We assume $\lambda_1>0$ and $\lambda_2>0$. Then, according to the complementary slackness condition, we know $v_1^{(r)}\cdot v_2^{(r)}=\tau$ and $v_1^{(r)}\cdot v_2^{(r)}=-\tau$, which can only hold when $\tau=0$. In this case, the sub-problem defined on $v_1^{(r)},v_2^{(r)}$ becomes

$$\begin{array}{ll}\min_{v_1^{(r)},v_2^{(r)}} & -y_1^{(r)}\cdot v_1^{(r)}-y_2^{(r)}\cdot v_2^{(r)}+\frac{\rho}{2}\|\widetilde{w}_{p(r)}-v_1^{(r)}\|_2^2+\frac{\rho}{2}\|\widetilde{w}_{q(r)}-v_2^{(r)}\|_2^2\\ \text{s.t.} & \|v_1^{(r)}\|_2=1,\ \|v_2^{(r)}\|_2=1,\ v_1^{(r)}\cdot v_2^{(r)}=0.\end{array}$$

Let $\gamma_1,\gamma_2,\gamma_3$ be the KKT multipliers associated with the three constraints in this sub-problem. According to the KKT conditions, we have
$$(2\gamma_1+\rho)v_1^{(r)}+\gamma_3 v_2^{(r)}=y_1^{(r)}+\rho\widetilde{w}_{p(r)}, \qquad(2.67)$$
$$(2\gamma_2+\rho)v_2^{(r)}+\gamma_3 v_1^{(r)}=y_2^{(r)}+\rho\widetilde{w}_{q(r)}. \qquad(2.68)$$
Taking the squared $\ell_2$ norm of both sides of Eq.(2.67), we get
$$(2\gamma_1+\rho)^2+\gamma_3^2=\|y_1^{(r)}+\rho\widetilde{w}_{p(r)}\|_2^2. \qquad(2.69)$$
Taking the squared $\ell_2$ norm of both sides of Eq.(2.68), we get
$$(2\gamma_2+\rho)^2+\gamma_3^2=\|y_2^{(r)}+\rho\widetilde{w}_{q(r)}\|_2^2. \qquad(2.70)$$
Taking the inner product of the two vectors on the left-hand sides of Eqs.(2.67, 2.68), and likewise of those on the right-hand sides, we get
$$(2\rho+2\gamma_1+2\gamma_2)\gamma_3=(y_1^{(r)}+\rho\widetilde{w}_{p(r)})^\top(y_2^{(r)}+\rho\widetilde{w}_{q(r)}). \qquad(2.71)$$
Solving the system of equations consisting of Eqs.(2.69-2.71), we obtain the optimal values of $\gamma_1$, $\gamma_2$, and $\gamma_3$. Plugging them into Eq.(2.67) and Eq.(2.68), we obtain a solution of $v_1^{(r)}$ and $v_2^{(r)}$, which is an optimal solution.

2.4 Diversity in the RKHS: Orthogonality-promoting Regularization of Kernel Methods

In the previous sections, we studied how to promote diversity among finite-dimensional vectors. In this section, we extend the study to infinite-dimensional functions [374] in the reproducing kernel Hilbert space (RKHS) [294], which is presumably more challenging. Kernel methods perform learning in reproducing kernel Hilbert spaces (RKHSs) of functions [294]. The RKHS represents a high-dimensional feature space that can capture nonlinear patterns in the lower-dimensional observed data. This Hilbert space is associated with a kernel function $k$, and the inner product in the RKHS can be implicitly computed by evaluating $k$ in the lower-dimensional input space (known as the kernel trick). Well-established kernel methods include the support vector machine [294], kernel principal component analysis [293], and kernel independent component analysis [25], to name a few. Even though their large model capacity leads to high representational power, it also incurs substantial risk of overfitting. One key ingredient in kernel methods is regularization, which reduces overfitting by controlling the complexity of the RKHS functions [251, 294]. Regularizers proposed previously, such as the RKHS norm, derivatives, Green functions, and splines, mostly focus on encouraging a small norm [251] and smoothness of the functions [294]. Notably, the most widely used regularizer is the squared RKHS norm. We are interested in whether diversity-promoting regularization can outperform the existing regularizers in reducing overfitting, and whether its ability to better capture infrequent patterns and to reduce model size without sacrificing modeling power carries over to the RKHS. Similar to Sections 2.2 and 2.3, we use near-orthogonality to characterize diversity. Similar to Section 2.2.1, to promote near-orthogonality among a set of RKHS functions $\{f_i\}_{i=1}^K$, we compute their Gram matrix $G$, where $G_{ij}=\langle f_i,f_j\rangle$, and encourage $G$ to be close to an identity matrix $I$, where the closeness is measured using Bregman matrix divergences [100]. We apply the proposed BMD regularizers to two kernel methods – kernel distance metric learning (KDML) [176, 335] and kernel sparse coding (KSC) [120] – and develop an optimization algorithm based on the alternating direction method of multipliers (ADMM) [52], where the RKHS functions are learned using functional gradient descent (FGD) [89]. Experimental results show that the proposed near-orthogonality regularizers (1) greatly improve the generalization performance of KDML and

KSC; (2) can reduce model size without sacrificing modeling power; (3) can better capture infrequent patterns in the data; and (4) outperform other orthogonality-promoting regularizers and the squared Hilbert norm.

2.4.1 Bregman Matrix Divergence Regularized Kernel Methods

We consider kernel methods that are parameterized by a set of RKHS functions $\mathcal{F}=\{f_i\}_{i=1}^K$. Examples include kernel principal component analysis (PCA) [293], kernel independent component analysis (ICA) [25], kernel distance metric learning [176, 335], and kernel sparse coding [120], to name a few. We promote diversity among these functions by encouraging them to be close to being orthogonal. Similar to the vector case in Section 2.2.1, we compute a Gram matrix $G$ where $G_{ij}=\langle f_i,f_j\rangle_{\mathcal{H}}$ and encourage $G$ to be close to an identity matrix $I$ by minimizing the Bregman matrix divergence (BMD) between $G$ and $I$. A family of BMD regularizers can be defined, based on (1) the squared Frobenius norm (SFN): $\|G-I\|_F^2$; (2) the von Neumann divergence (VND): $\mathrm{tr}(G\log G-G)$; (3) the log-determinant divergence (LDD): $\mathrm{tr}(G)-\log\det(G)$. To apply the VND and LDD regularizers, the Gram matrix $G$ is required to be positive definite. In our experiments, this condition is always satisfied since VND and LDD encourage the RKHS functions to be close to being orthogonal (and therefore linearly independent). We use these regularizers to encourage near-orthogonality among RKHS functions and define BMD-regularized kernel methods (BMD-KM):
$$\min_{\mathcal{F}}\ L(\mathcal{F})+\lambda\Omega(\mathcal{F}), \qquad(2.72)$$
where $L(\mathcal{F})$ is the objective function of the kernel method and $\lambda$ is the regularization parameter. Compared to kernel PCA and ICA, in which the functions are required to be strictly orthogonal, BMD-KM can be seen as a relaxed counterpart where the functions are encouraged to be close to, but not necessarily strictly, orthogonal. As we will demonstrate in the experiments, strict orthogonality can compromise performance in certain applications. We apply the BMD regularizers to two instances of kernel methods: kernel distance metric learning and kernel sparse coding.

Case study I: kernel distance metric learning (KDML) KDML [176, 335] is equipped with $K$ RKHS functions $\mathcal{F}=\{f_i\}_{i=1}^K$ that map a data example $x$ into a vector $h^{(x)}$ in a $K$-dimensional latent space, where $h_i^{(x)}=f_i(x)$. Given two examples $x$ and $y$, their distance is defined as $d_{\mathcal{F}}(x,y)=\|h^{(x)}-h^{(y)}\|_2^2$, which is parameterized by $\mathcal{F}$. Given $N$ training examples $\{x_n,y_n,t_n\}_{n=1}^N$, where $x_n$ and $y_n$ are similar if the label $t_n$ equals 1 and dissimilar if $t_n=0$, following [146], we learn the distance metric by minimizing $\sum_{n=1}^N\log\big(1+\exp((2t_n-1)d_{\mathcal{F}}(x_n,y_n))\big)$. Using $\Omega(\mathcal{F})$ to promote near-orthogonality, we obtain the BMD-regularized KDML (BMD-KDML) problem:

$$\min_{\mathcal{F}}\ \sum_{n=1}^N\log\big(1+\exp((2t_n-1)d_{\mathcal{F}}(x_n,y_n))\big)+\lambda\Omega(\mathcal{F}). \qquad(2.73)$$
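A small numeric sketch of the data term in Eq.(2.73), assuming the $K$ function values $f_i(\cdot)$ have already been evaluated and stacked into vectors $h^{(x)}$ (all names are illustrative):

```python
import numpy as np

def kdml_loss(HX, HY, T):
    """Data term of Eq.(2.73): sum_n log(1 + exp((2 t_n - 1) * ||h(x_n) - h(y_n)||^2)).
    HX, HY: N x K arrays of RKHS-function values; T: binary similarity labels."""
    d = np.sum((HX - HY) ** 2, axis=1)
    z = (2 * T - 1) * d
    return np.sum(np.logaddexp(0.0, z))   # numerically stable log(1 + exp(z))

N, K = 100, 8
HX, HY = np.random.randn(N, K), np.random.randn(N, K)
T = np.random.randint(0, 2, size=N)
print(kdml_loss(HX, HY, T))
```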

Case study II: kernel sparse coding (KSC) In KSC [120], a data example $x$ is mapped to $k(x,\cdot)$ in the RKHS induced by the kernel function $k(\cdot,\cdot)$, and a dictionary of RKHS functions $\mathcal{F}=\{f_i\}_{i=1}^K$ is learned to reconstruct $k(x,\cdot)$. Given the training data $\{x_n\}_{n=1}^N$, $\mathcal{F}$ can be learned by minimizing $\frac{1}{2}\sum_{n=1}^N\|k(x_n,\cdot)-\sum_{i=1}^K a_{ni}f_i\|_{\mathcal{H}}^2+\lambda_1\sum_{n=1}^N\|a_n\|_1+\frac{\lambda_2}{2}\sum_{i=1}^K\|f_i\|_{\mathcal{H}}^2$, where the reconstruction error is measured by the squared Hilbert norm and $a_n$ are the linear coefficients. The $\ell_1$ regularizer $\|a_n\|_1$ is applied to encourage the coefficients to be sparse. To avoid the degenerate case where the RKHS functions have large norm while the coefficients are close to zero, the squared Hilbert norm regularizer $\|f_i\|_{\mathcal{H}}^2$ is applied to the RKHS functions to keep their magnitude small. Adding $\Omega(\mathcal{F})$, we obtain the BMD-regularized KSC (BMD-KSC) problem:

$$\min_{\mathcal{F},A}\ \frac{1}{2}\sum_{n=1}^N\Big\|k(x_n,\cdot)-\sum_{i=1}^K a_{ni}f_i\Big\|_{\mathcal{H}}^2+\lambda_1\sum_{n=1}^N\|a_n\|_1+\frac{\lambda_2}{2}\sum_{i=1}^K\|f_i\|_{\mathcal{H}}^2+\lambda_3\Omega(\mathcal{F}), \qquad(2.74)$$

where $A$ denotes all the sparse codes.
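When each dictionary function is kept in the RTR form $f_i=\sum_n c_{in}k(x_n,\cdot)$ (as happens in the algorithm of the next subsection), the Hilbert-norm reconstruction error in Eq.(2.74) can be evaluated through the kernel matrix alone. A hedged sketch under that assumption (`C` holds the coefficient vectors, `A` the codes, `Kmat` the kernel matrix; all names illustrative):

```python
import numpy as np

def ksc_reconstruction_error(Kmat, C, A):
    """0.5 * sum_n || k(x_n,.) - sum_i a_ni f_i ||_H^2 with f_i = sum_m C[i, m] k(x_m, .).
    Uses <k(x_n,.), f_i> = (Kmat C^T)[n, i] and <f_i, f_j>_H = (C Kmat C^T)[i, j]."""
    KC = Kmat @ C.T            # N x K cross inner products
    G = C @ Kmat @ C.T         # K x K Gram matrix of the dictionary functions
    per_sample = np.diag(Kmat) - 2.0 * np.sum(A * KC, axis=1) + np.einsum('ni,ij,nj->n', A, G, A)
    return 0.5 * per_sample.sum()

# toy usage with an RBF kernel
X = np.random.randn(40, 5)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Kmat = np.exp(-0.5 * sq)
C = np.random.randn(6, 40) * 0.1   # 6 dictionary functions in RTR form
A = np.random.randn(40, 6) * 0.1
print(ksc_reconstruction_error(Kmat, C, A))
```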

2.4.2 A Functional Gradient Descent Algorithm
In this section, we develop an ADMM-based [52] algorithm to solve the BMD-KM problem. First, by introducing auxiliary variables $\widehat{\mathcal{F}}=\{\widehat{f}_i\}_{i=1}^K$, which are a set of RKHS functions, and $A\in\mathbb{R}^{K\times K}$, we rewrite the BMD-KM problem in an equivalent form that is amenable to developing ADMM-based algorithms:

$$\begin{array}{ll}\min_{\mathcal{F},\widehat{\mathcal{F}},A} & L(\mathcal{F})+\lambda D_\phi(A,I)\\ \text{s.t.} & \forall i,\ f_i=\widehat{f}_i\\ & \forall i,j,\ \langle f_i,\widehat{f}_j\rangle=A_{ij},\ \langle f_i,\widehat{f}_j\rangle=A_{ji},\end{array}\qquad(2.75)$$

where $A$ is required to be positive definite when $D_\phi(A,I)$ is the VND or LDD regularizer. The constraints $\langle f_i,\widehat{f}_j\rangle=A_{ij}$ and $\langle f_i,\widehat{f}_j\rangle=A_{ji}$ imply that $A$ is symmetric. We define the augmented Lagrangian with parameter $\rho>0$:
$$L(\mathcal{F})+\lambda D_\phi(A,I)+\sum_{i=1}^K\langle g_i,f_i-\widehat{f}_i\rangle+\sum_{i=1}^K\sum_{j=1}^K\Big(P_{ij}(\langle f_i,\widehat{f}_j\rangle-A_{ij})+Q_{ij}(\langle f_i,\widehat{f}_j\rangle-A_{ji})+\frac{\rho}{2}(\langle f_i,\widehat{f}_j\rangle-A_{ij})^2+\frac{\rho}{2}(\langle f_i,\widehat{f}_j\rangle-A_{ji})^2\Big),$$
where $G=\{g_i\}_{i=1}^K$ is another set of RKHS functions and $P,Q\in\mathbb{R}^{K\times K}$ are Lagrange multipliers. We then minimize this Lagrangian by alternating among $\mathcal{F}$, $\widehat{\mathcal{F}}$, $G$, $A$, $P$, $Q$.

Solve $A$ Given $H\in\mathbb{R}^{K\times K}$ where $H_{ij}=\langle f_i,\widehat{f}_j\rangle$, we learn $A$ by minimizing $\lambda D_\phi(A,I)-\langle P,A\rangle-\langle Q^\top,A\rangle+\frac{\rho}{2}\|H-A\|_F^2+\frac{\rho}{2}\|H^\top-A\|_F^2$, which is a convex problem. $D_\phi(A,I)$ has three cases, which we discuss separately. When $D_\phi(A,I)$ is the VND, we first perform an eigen-decomposition of $D=P+Q^\top+\rho(H+H^\top)$: $D=\Phi\Sigma\Phi^{-1}$. The optimal solution of $A$ can then be obtained as $A=\Phi\widehat{\Sigma}\Phi^{-1}$, where
$$\widehat{\Sigma}_{ii}=\frac{\lambda\,\omega\!\big(\frac{\Sigma_{ii}}{\lambda}-\log\frac{\lambda}{2\rho}\big)}{2\rho} \qquad(2.76)$$
and $\omega(\cdot)$ is the Wright omega function [137]. It can be shown that $A$ is positive definite. When $D_\phi(A,I)$ is the LDD, the optimal solution is
$$A=-\frac{1}{2}B+\frac{1}{2}\sqrt{B^2-4C}, \qquad(2.77)$$
where $B=\frac{1}{\rho}\big(\lambda I-P-Q^\top-\rho(H+H^\top)\big)$ and $C=-\frac{\lambda}{\rho}I$. It can be verified that $A$ is positive definite. When $D_\phi(A,I)$ is the SFN, the optimal solution for $A$ is
$$A=\big(2\lambda I+P+Q^\top+\rho(H+H^\top)\big)/(2\lambda+2\rho). \qquad(2.78)$$
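To illustrate these closed-form A-updates, the sketch below implements the SFN case (Eq. 2.78) and the VND case (Eq. 2.76) via an eigen-decomposition of D; it is a minimal sketch following the formulas as reconstructed above, with illustrative inputs:

```python
import numpy as np
from scipy.special import wrightomega

def update_A_sfn(H, P, Q, lam, rho):
    """Closed-form A for the SFN regularizer, Eq.(2.78)."""
    K = H.shape[0]
    return (2 * lam * np.eye(K) + P + Q.T + rho * (H + H.T)) / (2 * lam + 2 * rho)

def update_A_vnd(H, P, Q, lam, rho):
    """A for the VND regularizer, Eq.(2.76): eigen-decompose D = P + Q^T + rho (H + H^T),
    then map each eigenvalue through the Wright omega function."""
    D = P + Q.T + rho * (H + H.T)
    D = 0.5 * (D + D.T)                     # symmetrize for a stable eigen-decomposition
    sigma, Phi = np.linalg.eigh(D)
    sigma_hat = lam * np.real(wrightomega(sigma / lam - np.log(lam / (2 * rho)))) / (2 * rho)
    return (Phi * sigma_hat) @ Phi.T

K = 5
H = np.random.randn(K, K)
P, Q = np.random.randn(K, K), np.random.randn(K, K)
print(update_A_sfn(H, P, Q, lam=1.0, rho=1.0))
print(update_A_vnd(H, P, Q, lam=1.0, rho=1.0))
```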

Solve $f_i$. We solve $f_i$ by minimizing $\Upsilon = L(\mathcal{F}) + \langle g_i, f_i\rangle + \sum_{j=1}^K (P_{ij}+Q_{ij})\langle f_i,\hat f_j\rangle + \frac{\rho}{2}\sum_{j=1}^K \big((\langle f_i,\hat f_j\rangle - A_{ij})^2 + (\langle f_i,\hat f_j\rangle - A_{ji})^2\big)$. The first issue we need to address is how to represent $f_i$. When $f\in\mathcal{F}$ is regularized by the RKHS norm, according to the representer theorem [294], the optimal solution $f^*$ can be expressed as a linear combination of kernel functions evaluated at the training data: $f^*(x)=\sum_{n=1}^N \alpha_n k(x_n,x)$, which we refer to as the representer theorem representation (RTR). This endows $f^*$ with an explicit parameterization that greatly eases learning: the search space of $f^*$ is reduced from the infinite-dimensional RKHS $\mathcal{H}$ to the $N$-dimensional space of coefficients $\{\alpha_n\}_{n=1}^N$. However, in $\Upsilon$, due to the presence of inner products between $f_i$ and other functions, the representer theorem does not hold and $f_i$ does not admit an RTR form.

To address this issue, we learn $f_i$ directly using functional gradient descent [89]. A functional $F:\mathcal{H}\to\mathbb{R}$ maps functions in $\mathcal{H}$ to real numbers. A functional gradient $\nabla F[f]$ is defined implicitly as the linear term of the change in a functional due to a small perturbation $\epsilon g$ of its input: $F[f+\epsilon g] = F[f] + \epsilon\langle \nabla F[f], g\rangle + O(\epsilon^2)$. Of particular interest is the evaluation functional $F_x[f]$, which is parameterized by an input vector $x$ and evaluates $f$ at $x$: $F_x[f]=f(x)$. The functional gradient of $F_x[f]$ is $k(x,\cdot)$ [89], where $k$ is the kernel associated with the RKHS. The gradient of an inner product functional $F_g[f]=\langle f,g\rangle$ is $g$.

The form of $L(\mathcal{F})$ depends on the specific kernel method. Here we consider KDML, where $L(\mathcal{F})$ is given in Eq.(2.73). In KDML, $f_i$ appears in two types of functionals: evaluation functionals in $d_{\mathcal{F}}(x_n,y_n)$ (such as $f_i(x_n)$) and inner product functionals (such as $\langle g_i,f_i\rangle$). The functional gradient of $\Upsilon$ is
\[
\triangle f_i = 2\sum_{n=1}^N \sigma\big((2t_n-1)d_{\mathcal{F}}(x_n,y_n)\big)(2t_n-1)\big(f_i(x_n)-f_i(y_n)\big)\big(k(x_n,\cdot)-k(y_n,\cdot)\big) + g_i + \sum_{j=1}^K \big(P_{ij}+Q_{ij}+\rho(2\langle f_i,\hat f_j\rangle - A_{ij}-A_{ji})\big)\hat f_j,
\]
where $\sigma(x)=1/(1+\exp(-x))$ is the sigmoid function. Given this functional gradient, we perform gradient descent to update $f_i$ until convergence: $f_i \leftarrow f_i - \eta\triangle f_i$, where $\eta$ is the learning rate. In the algorithm, we initialize $f_i$, $g_i$, and $\hat f_j$ as zero functions. Then, as will be proven later, during the execution of the algorithm $f_i$, $g_i$, and $\hat f_j$ are always in RTR form, so updating $f_i$ amounts to updating the linear coefficients in the RTR.
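To illustrate how this update operates on the RTR coefficients, here is a small NumPy sketch of one functional-gradient step on $f_i$. It is an illustrative simplification under our own naming and data-layout assumptions (the anchor set is taken to be the stacked pair points), not the thesis implementation:

```python
import numpy as np

def fgd_step_fi(i, alpha, g_coef, beta, Kz, t, P, Q, A, rho, eta):
    """One functional-gradient step on f_i in RTR coordinates (a sketch).

    Every RKHS function is written as sum_m c_m k(z_m, .), where the anchors
    z_1..z_{2N} are the stacked pair points [x_1..x_N, y_1..y_N].
    Kz: (2N, 2N) Gram matrix over the anchors; alpha, g_coef, beta: RTR
    coefficients of {f_i}, {g_i}, {fhat_j}, each of shape (K, 2N);
    t: array of 0/1 pair labels of length N.
    """
    N = len(t)
    Hx = alpha @ Kz[:, :N]                 # f_j(x_n) for all j, n
    Hy = alpha @ Kz[:, N:]                 # f_j(y_n)
    d = np.sum((Hx - Hy) ** 2, axis=0)     # d_F(x_n, y_n)
    s = 2 * t - 1
    w = 2.0 / (1.0 + np.exp(-s * d)) * s * (Hx[i] - Hy[i])
    # In the anchor basis, k(x_n,.) - k(y_n,.) has one-hot coefficients (e_n, -e_{N+n}),
    # so the data part of the functional gradient has coefficient vector [w, -w].
    grad = np.concatenate([w, -w]) + g_coef[i]
    inner = beta @ Kz @ alpha[i]           # <fhat_j, f_i> for all j
    grad += beta.T @ (P[i] + Q[i] + rho * (2 * inner - A[i] - A[:, i]))
    alpha = alpha.copy()
    alpha[i] = alpha[i] - eta * grad       # gradient step on the coefficients of f_i
    return alpha
```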

Solve $\hat f_j$. The sub-problem defined over $\hat f_j$ is
\[
\min_{\hat f_j}\; -\langle g_j,\hat f_j\rangle + \sum_{i=1}^K\Big((P_{ij}+Q_{ij})\langle f_i,\hat f_j\rangle + \frac{\rho}{2}\big((\langle f_i,\hat f_j\rangle - A_{ij})^2 + (\langle f_i,\hat f_j\rangle - A_{ji})^2\big)\Big), \qquad (2.79)
\]
which is a convex problem. Setting the derivative of the objective function to zero, we get the equation
\[
\Big(2\rho\sum_{i=1}^K f_i\otimes f_i\Big)\hat f_j = \sum_{i=1}^K \big(\rho(A_{ij}+A_{ji}) - (P_{ij}+Q_{ij})\big)f_i + g_j, \qquad (2.80)
\]
where $\otimes$ denotes the outer product in the RKHS. As will be proven later, $\hat f_j$, $g_j$, and $f_i$ are all in RTR form. Let $\Phi=[k(x_1,\cdot),\cdots,k(x_N,\cdot)]$; then $f_i=\Phi a_i$, $\sum_{i=1}^K\big(\rho(A_{ij}+A_{ji})-(P_{ij}+Q_{ij})\big)f_i + g_j = \Phi b$, and $\hat f_j=\Phi c$, where $a_i$, $b$, $c$ are coefficient vectors. $a_i$ and $b$ are known and $c$ is to be estimated. Eq.(2.80) can then be written as

\[
\Big(2\rho\sum_{i=1}^K a_i a_i^\top\Big)\Phi^\top\Phi\, c = b, \qquad (2.81)
\]

where $(\Phi^\top\Phi)_{ij}=k(x_i,x_j)$ and $c=\big((2\rho\sum_{i=1}^K a_ia_i^\top)\Phi^\top\Phi\big)^{-1}b$. In practice, inverting the $N\times N$ matrix $(2\rho\sum_{i=1}^K a_ia_i^\top)\Phi^\top\Phi$ is computationally prohibitive when $N$ is large; in that case, we can switch to a stochastic FGD method to solve this problem. The update rules for $P$, $Q$, and $g_j$ are simple:

Update P. $P = P + \rho(H - A)$.

Update Q. $Q = Q + \rho(H - A^\top)$.

Update $g_j$. $g_j = g_j + \rho(f_j - \hat f_j)$.
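Returning to the $\hat f_j$ step, a minimal sketch of solving the linear system in Eq.(2.81) for the RTR coefficients $c$ follows (a least-squares solve is used because the matrix can be rank-deficient; the names and data layout are illustrative, not from the thesis code):

```python
import numpy as np

def solve_fbj_coefficients(alpha, Kz, b, rho):
    """Solve Eq. (2.81) for the RTR coefficients c of fhat_j; a sketch.

    alpha: (K, N) RTR coefficients a_i of the functions f_i (one per row);
    Kz: (N, N) kernel matrix Phi^T Phi; b: (N,) coefficient vector of the
    right-hand side of Eq. (2.80) in the same basis.
    """
    S = 2.0 * rho * alpha.T @ alpha           # 2*rho * sum_i a_i a_i^T, shape (N, N)
    # S @ Kz has rank at most K, so a plain inverse may not exist; use least squares.
    c, *_ = np.linalg.lstsq(S @ Kz, b, rcond=None)
    return c
```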

RTR form of the RKHS functions. Next, we prove that as long as the RKHS functions $f_i$, $g_i$, and $\hat f_j$ are initialized to zero, they remain in RTR form during the entire execution of the algorithm. We prove this by induction. For the base case (iteration $t=0$), these functions are all zero, hence admitting the RTR form. For the inductive step, assuming the statement holds at iteration $t-1$, we prove it holds at iteration $t$. We begin with $f_i$, which is solved by functional gradient descent (FGD). At iteration $t$, the input of the algorithm is $f_i^{(t-1)}$ and the output is $f_i^{(t)}$. The first term of the functional gradient $\triangle f_i$ is in RTR form, and so are $g_i$ and $\hat f_j$ (by the inductive hypothesis); hence $\triangle f_i$ is in RTR form. Starting from $f_i^{(t-1)}$, which is in RTR form by the inductive hypothesis, $f_i$ is updated iteratively as $f_i^{(s)} \leftarrow f_i^{(s-1)} - \eta\triangle f_i^{(s-1)}$ (where $s$ indexes the FGD iterations), resulting in an $f_i^{(t)}$ that is also in RTR form.

Next, we prove that if $g_j^{(t-1)}$ and $f_i^{(t-1)}$ are in RTR form, so is $\hat f_j$. The proof is similar to that of the representer theorem [294]. We decompose $\hat f_j$ into $\hat f_j^{\parallel} + \hat f_j^{\perp}$, where $\hat f_j^{\parallel}$ lies in $S=\{\sum_{n=1}^N \alpha_n k(x_n,\cdot) \mid \{\alpha_n\}_{n=1}^N \subset\mathbb{R}\}$ (i.e., is in RTR form) and $\hat f_j^{\perp}$ is perpendicular to $S$, hence $\langle g_j,\hat f_j^{\perp}\rangle = 0$ and $\langle f_i,\hat f_j^{\perp}\rangle = 0$. The problem defined in Eq.(2.79) can thus be equivalently written as
\[
\min_{\hat f_j^{\parallel}}\; -\langle g_j,\hat f_j^{\parallel}\rangle + \sum_{i=1}^K\Big((P_{ij}+Q_{ij})\langle f_i,\hat f_j^{\parallel}\rangle + \frac{\rho}{2}\big((\langle f_i,\hat f_j^{\parallel}\rangle - A_{ij})^2 + (\langle f_i,\hat f_j^{\parallel}\rangle - A_{ji})^2\big)\Big). \qquad (2.82)
\]

Hence the optimal solution of $\hat f_j$ is in RTR form. For $g_j$, from its update equation $g_j = g_j + \rho(f_j - \hat f_j)$, it is easy to see that if $f_j^{(t-1)}$ and $\hat f_j^{(t-1)}$ are in RTR form, so is $g_j^{(t)}$. Note that these RKHS functions are in RTR form because of the algorithmic procedure (namely, initializing the functions to zero and using FGD to solve for $f$) rather than because of the representer theorem; with a different initialization, $f$ may not be in RTR form.

Scalable representation of the RKHS functions based on random Fourier features. When the RKHS functions are in RTR form, an $O(N^2D)$ computational cost is incurred, where $N$ is the number of training examples and $D$ is the input feature dimension. On large datasets, this is not scalable. In this section, we investigate a scalable representation of RKHS functions based on random Fourier features (RFFs) [284]. Given a shift-invariant kernel $k(x,y)=k(x-y)$, such as the radial basis function (RBF) kernel, it can be approximated with RFFs: $k(x,y)=\langle k(x,\cdot),k(y,\cdot)\rangle \approx z(x)^\top z(y)$, where $z(x)\in\mathbb{R}^Q$ is the RFF transformation of $x$ and can be seen as an approximation of $k(x,\cdot)$. $z(x)$ is generated as follows: (1) compute the Fourier transform $p(\omega)$ of the kernel $k$; (2) draw $Q$ i.i.d. samples $\omega_1,\cdots,\omega_Q\in\mathbb{R}^D$ from $p(\omega)$ and $Q$ i.i.d. samples $b_1,\cdots,b_Q\in\mathbb{R}$ from the uniform distribution on $[0,2\pi]$; (3) let $z(x)=\sqrt{2/Q}\,[\cos(\omega_1^\top x + b_1),\cdots,\cos(\omega_Q^\top x + b_Q)]^\top$.

For $f\in\mathcal{H}$, where $\mathcal{H}$ is an RKHS induced by a shift-invariant kernel, we know that $f(x)=\langle f, k(x,\cdot)\rangle$. Using $z(x)$ to approximate $k(x,\cdot)$ and $\mathbf{w}\in\mathbb{R}^Q$ to approximate $f$, we get $f(x)\approx \mathbf{w}^\top z(x)$. As such, the infinite-dimensional function $f$ can be approximately represented by the finite-dimensional vector $\mathbf{w}$, and the BMD-KM problem defined in Eq.(2.72) can be written as

\[
\min_{\mathbf{W}}\; L(\mathbf{W}) + \lambda\Omega(\mathbf{W}), \qquad (2.83)
\]

where $\mathbf{W}=\{\mathbf{w}_i\}_{i=1}^K$ and $\Omega(\mathbf{W})=D_\phi(G,I)$ with $G_{ij}=\mathbf{w}_i^\top\mathbf{w}_j$. Learning can now be conducted over $\mathbf{W}$, and the computational cost is reduced from $O(N^2D)$ to $O(NQ)$, where $Q$, the number of RFFs, is much smaller than $ND$.
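A minimal sketch of steps (1)-(3) for the RBF kernel $\exp(-\gamma\|x-y\|_2^2)$, whose spectral density is a Gaussian (the function name and interface are illustrative, not part of the thesis code):

```python
import numpy as np

def rff_map(X, gamma, Q, seed=0):
    """Random Fourier features z(x) for the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    # For this kernel, the Fourier transform p(omega) is a Gaussian with
    # covariance 2*gamma*I, so each omega_q is drawn i.i.d. from N(0, 2*gamma*I).
    Omega = rng.normal(scale=np.sqrt(2 * gamma), size=(D, Q))
    b = rng.uniform(0.0, 2 * np.pi, size=Q)
    return np.sqrt(2.0 / Q) * np.cos(X @ Omega + b)     # one row z(x) per example

# An RKHS function f is then replaced by a weight vector w in R^Q:
# f(x) ~= rff_map(x, gamma, Q) @ w, and k(x, y) ~= z(x) @ z(y).
```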

2.4.3 Evaluation

In this section, we present experimental results on BMD-KDML and BMD-KSC.

Datasets We used six datasets in the experiments: an electronic health record dataset, MIMIC-III [184], and five image datasets, Stanford-Cars (Cars) [202], Caltech-UCSD-Birds (Birds) [358], Scenes-15 [212], Caltech-256 [141], and UIUC-Sports [224]. The first three were used for KDML and the last three for KSC. Their statistics are summarized in Table 2.19. For each dataset, five random train/test splits were performed, and the results were averaged over the five runs. For the MIMIC-III dataset, we extracted features from demographics (including age and gender), clinical notes (including bag-of-words and Word2Vec [256]), and lab tests (including zero-order, first-order, and second-order temporal features); the total feature dimension is 7207. The features for the Cars and Birds datasets were extracted using the VGG16 [303] convolutional neural network trained on the ImageNet [95] dataset, taken as the outputs of the second fully-connected layer with 4096 dimensions. For Scenes-15, Caltech-256, and UIUC-Sports, we extracted pixel-level dense SIFT [237] features, where the step size and patch size were 8 and 16, respectively.

Experimental setup For BMD-KDML and BMD-KSC, we experimented with the six combinations of three regularizers (SFN, LDD, and VND) and two representations of RKHS functions (RTR and RFF). In the DML experiments, two data examples were labeled as similar if they belong to the same class and dissimilar otherwise. The learned distance metrics were applied to retrieval, whose performance was evaluated using precision@k: precision@k is defined as $n/k$, where $n$ is the number of examples among the top $k$ retrieved examples that have the same class label as the query. We compared with four groups of baseline methods: (1) KDML (Eq.(2.73) without the regularizer $\Omega(\mathcal{F})$) and its variants under different regularizers, including the squared Hilbert norm (SHN), DPP [429], and Angle [369]; (2) other kernel DML methods, including those proposed in [335] (Tsang) and [176] (Jain), multiple-kernel DML (MK-DML) [345], and pairwise constrained component analysis (PCCA) [252]; (3) non-kernel metric learning (ML) methods, including information-theoretic ML (ITML) [92], logistic discriminant ML (LDML) [146], DML with eigenvalue optimization (DML-Eig) [398], information-theoretic semi-supervised ML via entropy regularization (Seraph) [265], and geometric mean ML (GMML) [405]; and (4) the Euclidean distance (EUC). For the methods in group (1), the RKHS functions are represented in RTR form. In the sparse coding experiments, on top of the SIFT features we used kernel sparse coding to learn a set of RKHS functions and represented each SIFT feature as a sparse code; to obtain image-level features, we applied max-pooling [392] and spatial pyramid matching [212, 392] over the pixel-level sparse codes. The following baselines were compared with: sparse coding (SC) [392], unregularized kernel SC (KSC) [120], and KSC regularized by SHN, DPP, and Angle; in these methods, the RKHS functions are represented in RTR form. We used 5-fold cross-validation to tune the regularization parameters in $\{10^{-5}, 10^{-4}, \cdots, 10^{5}\}$, the number of RKHS functions in $\{50, 100, 200, \cdots, 500\}$, and the RFF dimension in $\{1, 2, \cdots, 10\}\times D$, where $D$ is the feature dimension of the input data. The kernel function was chosen to be the radial basis function (RBF) $\exp(-\gamma\|x-y\|_2^2)$, where the scale parameter $\gamma$ was tuned in $\{10^{-3}, 10^{-2}, \cdots, 10^{3}\}$. The parameter $\rho$ in the ADMM-based algorithm was set to 1, and the learning rate in functional gradient descent was set to 0.001.

Results Table 2.20 shows the retrieval precision@10 on three datasets, from which we observe the following. First, the BMD-KDML methods, i.e., KDML-(SFN,VND,LDD)-(RTR,RFF), greatly outperform the unregularized and SHN-regularized KDML, which demonstrates that near-orthogonality regularization is an effective way to reduce overfitting. Second, the BMD regularizers (SFN, VND, and LDD) outperform the other near-orthogonality regularizers (DPP and Angle), possibly because they are insensitive to vector scaling and amenable to optimization. Third, VND and LDD achieve comparable performance and outperform SFN, possibly because they measure near-orthogonality in a global way while SFN does so in a pairwise fashion. Fourth, the RFF representation of RKHS functions performs comparably to RTR, despite being an approximation. Finally, the BMD-KDML methods achieve better performance than the non-kernel DML methods and the other kernel DML methods, suggesting their competitive ability in learning effective distance metrics.

Table 2.21 shows the number of RKHS functions under which the precision@10 in Table 2.20 is achieved. It can be seen that the BMD-KDML methods utilize far fewer functions than KDML while achieving better precision@10. For instance, KDML-VND-RTR achieves 77.1% precision@10 with 100 functions on the MIMIC-III dataset, while KDML achieves 73.8% precision@10 with 300 functions. These results demonstrate the ability of the BMD regularizers to reduce model size without sacrificing modeling power. By encouraging the functions to be near-orthogonal, the BMD regularizers decrease the redundancy among the functions and make them highly complementary.

As a result, a small number of such functions are able to capture the patterns in the data sufficiently well. In addition, the BMD regularizers achieve better precision@10 with fewer functions than the other near-orthogonality regularizers (DPP and Angle), suggesting their better efficacy in promoting near-orthogonality.

In the next experiment, we investigate whether near-orthogonality regularization can better capture infrequent patterns. We select 3 frequent diseases (patterns) and 5 infrequent ones from the MIMIC-III dataset; a disease is regarded as frequent if the number of hospital admissions diagnosed with it is greater than 300. Table 2.22 shows the precision@10 on the 8 diseases, from which we observe that: (1) on the 5 infrequent diseases (labeled D4–D8), the BMD-KDML methods achieve much higher precision@10 than the unregularized KDML, suggesting that by encouraging the functions to be close to orthogonal, the BMD regularizers can better capture infrequent patterns; (2) on the 3 frequent diseases (labeled D1–D3), the precision@10 achieved by the BMD-KDML methods is comparable with that of the unregularized KDML, indicating that the BMD regularizers do not compromise modeling efficacy on the frequent patterns. On the infrequent diseases, the BMD-KDML methods also outperform KDML-DPP and KDML-Angle, suggesting that the BMD regularizers are better at promoting near-orthogonality than DPP and Angle.

Figure 2.6: Precision@10 versus the regularization parameter λ on MIMIC-III.

Further, Figure 2.6 shows how the precision@10 on MIMIC-III varies as we increase the regularization parameter λ in KDML-VND-RTR. As can be seen, the best precision@10 is achieved under a modest λ. A very large λ would make the functions strictly orthogonal, as kernel PCA and ICA do, which would result in excessively strong regularization and therefore poor performance.

We also compare the convergence time of BMD-KDML under the two representations, RTR and RFF. As shown in Table 2.23, RFF results in much faster convergence since this representation does not depend on the training data. While computationally efficient, RFF does not sacrifice much modeling power: Table 2.20 shows that RFF achieves precision@10 comparable to RTR.

Next, we present the kernel sparse coding results. Table 2.24 shows the classification accuracy on three datasets, from which we observe results similar to those in the KDML experiments. First, KSC-(SFN,VND,LDD)-(RTR,RFF) achieve better accuracy than the unregularized and SHN-regularized KSC. Second, the BMD regularizers outperform the other near-orthogonality regularizers, DPP and Angle. Third, VND and LDD are superior to SFN. Fourth, the RFF representation of RKHS functions performs comparably to RTR. Finally, the BMD-KSC methods outperform the non-kernel SC methods and the other kernel SC methods. These observations demonstrate the efficacy of the BMD regularizers in reducing overfitting and promoting near-orthogonality.

2.4.4 Appendix: Details of Algorithms

Detailed Derivation of Solving A

Given $H\in\mathbb{R}^{K\times K}$ with $H_{ij}=\langle f_i,\hat f_j\rangle$, the sub-problem defined over $A$ is

\[
\min_A\; \lambda D_\phi(A,I) - \langle P,A\rangle - \langle Q^\top,A\rangle + \frac{\rho}{2}\|H-A\|_F^2 + \frac{\rho}{2}\|H^\top-A\|_F^2. \qquad (2.84)
\]

Dφ(A, I) has three cases, which we discuss separately. When Dφ(A, I) is the squared Frobenius norm, the problem becomes

\[
\min_A\; \lambda\|A-I\|_F^2 - \langle P,A\rangle - \langle Q^\top,A\rangle + \frac{\rho}{2}\|H-A\|_F^2 + \frac{\rho}{2}\|H^\top-A\|_F^2. \qquad (2.85)
\]
Taking the derivative and setting it to zero, we get the optimal solution for $A$:

\[
A = \big(2\lambda I + P + Q^\top + \rho(H+H^\top)\big)\big/(2\lambda+2\rho). \qquad (2.86)
\]

When Dφ(A, I) is the log-determinant divergence, the problem is specialized to

\[
\begin{aligned}
\min_A\;& \lambda\big(\mathrm{tr}(A) - \log\det(A)\big) - \langle P,A\rangle - \langle Q^\top,A\rangle + \frac{\rho}{2}\|H-A\|_F^2 + \frac{\rho}{2}\|H^\top-A\|_F^2 \\
\text{s.t.}\;& A \succ 0.
\end{aligned} \qquad (2.87)
\]
Taking the derivative of the objective function w.r.t. $A$ and setting it to zero, we get

\[
A^2 + \tfrac{1}{\rho}\big(\lambda I - P - Q^\top - \rho(H+H^\top)\big)A - \tfrac{\lambda}{\rho}I = 0. \qquad (2.88)
\]

Let $B=\frac{1}{\rho}\big(\lambda I - P - Q^\top - \rho(H+H^\top)\big)$ and $C=-\frac{\lambda}{\rho}I$; then Eq.(2.88) can be written as

\[
A^2 + BA + C = 0. \qquad (2.89)
\]

According to [156], since $B$ and $C$ commute, the solution of this equation is
\[
A = -\tfrac{1}{2}B + \tfrac{1}{2}\sqrt{B^2 - 4C}. \qquad (2.90)
\]

Taking an eigendecomposition $B=\Phi\Sigma\Phi^{-1}$, we can compute $A$ as $A=\Phi\widehat\Sigma\Phi^{-1}$, where $\widehat\Sigma$ is a diagonal matrix with
\[
\widehat\Sigma_{kk} = -\tfrac{1}{2}\Sigma_{kk} + \tfrac{1}{2}\sqrt{\Sigma_{kk}^2 + \tfrac{4\lambda}{\rho}}.
\]

Since $\sqrt{\Sigma_{kk}^2 + 4\lambda/\rho} > \Sigma_{kk}$, we know $\widehat\Sigma_{kk} > 0$. Hence $A$ is positive definite.

When $D_\phi(A,I)$ is the von Neumann divergence, the problem becomes
\[
\begin{aligned}
\min_A\;& \lambda\,\mathrm{tr}(A\log A - A) - \langle P,A\rangle - \langle Q^\top,A\rangle + \frac{\rho}{2}\|H-A\|_F^2 + \frac{\rho}{2}\|H^\top-A\|_F^2 \\
\text{s.t.}\;& A \succ 0.
\end{aligned} \qquad (2.91)
\]
Setting the gradient of the objective function w.r.t. $A$ to zero, we get
\[
\lambda\log A + 2\rho A = P + Q^\top + \rho(H+H^\top). \qquad (2.92)
\]
Let $D = P + Q^\top + \rho(H+H^\top)$. We perform an eigendecomposition $D=\Phi\Sigma\Phi^{-1}$ and parameterize $A$ as $A=\Phi\widehat\Sigma\Phi^{-1}$; we then obtain

\[
\lambda\log A + 2\rho A = \Phi\big(\lambda\log\widehat\Sigma + 2\rho\widehat\Sigma\big)\Phi^{-1}. \qquad (2.93)
\]
Plugging this equation into Eq.(2.92), we get the following equation:

\[
\lambda\log\widehat\Sigma + 2\rho\widehat\Sigma = \Sigma, \qquad (2.94)
\]
which amounts to solving $K$ independent one-variable equations of the form

\[
\lambda\log\widehat\Sigma_{ii} + 2\rho\widehat\Sigma_{ii} = \Sigma_{ii}, \qquad (2.95)
\]
where $i=1,\cdots,K$. This equation has a closed-form solution

\[
\widehat\Sigma_{ii} = \frac{\lambda\,\omega\big(\Sigma_{ii}/\lambda - \log(\lambda/(2\rho))\big)}{2\rho}, \qquad (2.96)
\]
where $\omega(\cdot)$ is the Wright omega function [137]. Due to the presence of the logarithm, $\widehat\Sigma_{ii}$ is required to be positive, and the solution always exists since the ranges of $\lambda\log\widehat\Sigma_{ii} + 2\rho\widehat\Sigma_{ii}$ and $\Sigma_{ii}$ are both $(-\infty,\infty)$. Hence $A$ is guaranteed to be positive definite.
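Putting the three cases together, a small NumPy/SciPy sketch of the closed-form $A$-update follows (illustrative only, not the thesis implementation; it symmetrizes $D$ before the eigendecomposition, which is an assumption of this sketch):

```python
import numpy as np
from scipy.special import wrightomega
from scipy.linalg import sqrtm

def solve_A(H, P, Q, lam, rho, div="vnd"):
    """Closed-form A-update for the SFN, LDD, and VND cases (Eqs. 2.76-2.78); a sketch."""
    K = H.shape[0]
    I = np.eye(K)
    if div == "sfn":                         # squared Frobenius norm, Eq. (2.78)
        return (2 * lam * I + P + Q.T + rho * (H + H.T)) / (2 * lam + 2 * rho)
    if div == "ldd":                         # log-determinant divergence, Eq. (2.77)
        B = (lam * I - P - Q.T - rho * (H + H.T)) / rho
        C = -(lam / rho) * I
        return np.real(-0.5 * B + 0.5 * sqrtm(B @ B - 4 * C))
    # von Neumann divergence, Eq. (2.76): solve lam*log(s) + 2*rho*s = sigma per eigenvalue.
    D = P + Q.T + rho * (H + H.T)
    sigma, Phi = np.linalg.eigh((D + D.T) / 2)          # symmetrized in this sketch
    s = lam * np.real(wrightomega(sigma / lam - np.log(lam / (2 * rho)))) / (2 * rho)
    return Phi @ np.diag(s) @ Phi.T
```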

An ADMM-based Algorithm for BMD-KSC

The BMD-KSC model has two sets of parameters: the sparse codes $\{a_n\}_{n=1}^N$ and a dictionary of RKHS functions $\{f_i\}_{i=1}^K$. We use a coordinate descent algorithm to learn these two parameter sets, which iteratively performs the following two steps until convergence: (1) fixing $\{f_i\}_{i=1}^K$, solve for $\{a_n\}_{n=1}^N$; (2) fixing $\{a_n\}_{n=1}^N$, solve for $\{f_i\}_{i=1}^K$.

We first discuss step (1). The sub-problem defined over $a_n$ is
\[
\min_{a_n}\; \frac{1}{2}\Big\|k(x_n,\cdot) - \sum_{i=1}^K a_{ni}f_i\Big\|_{\mathcal{H}}^2 + \lambda_1\|a_n\|_1. \qquad (2.97)
\]
Note that $\|k(x_n,\cdot) - \sum_{i=1}^K a_{ni}f_i\|_{\mathcal{H}}^2 = k(x_n,x_n) - 2a_n^\top h + a_n^\top G a_n$, where $h\in\mathbb{R}^K$ with $h_i=\langle f_i, k(x_n,\cdot)\rangle$ and $G\in\mathbb{R}^{K\times K}$ with $G_{ij}=\langle f_i,f_j\rangle$. This is a standard lasso [331] problem and can be solved using many algorithms.

Next we discuss step (2), which learns $\{f_i\}_{i=1}^K$ using the ADMM-based algorithm outlined above. The updates of all variables are the same as those in BMD-KDML, except for $f_i$. Let $b_{ni} = k(x_n,\cdot) - \sum_{j\ne i} a_{nj}f_j$; the sub-problem defined over $f_i$ is

\[
\begin{aligned}
\min_{f_i}\;& \frac{1}{2}\sum_{n=1}^N \|b_{ni} - a_{ni}f_i\|_{\mathcal{H}}^2 + \frac{\lambda_2}{2}\|f_i\|_{\mathcal{H}}^2 + \langle g_i, f_i\rangle + \sum_{j=1}^K P_{ij}\langle f_i,\hat f_j\rangle + \sum_{j=1}^K Q_{ij}\langle f_i,\hat f_j\rangle \\
&+ \frac{\rho}{2}\sum_{j=1}^K\big(\langle f_i,\hat f_j\rangle - A_{ij}\big)^2 + \frac{\rho}{2}\sum_{j=1}^K\big(\langle f_i,\hat f_j\rangle - A_{ji}\big)^2.
\end{aligned} \qquad (2.98)
\]
This problem can be solved with functional gradient descent. The functional gradient of the objective function w.r.t. $f_i$ is

\[
\sum_{n=1}^N a_{ni}(a_{ni}f_i - b_{ni}) + \lambda_2 f_i + g_i + \sum_{j=1}^K\big(P_{ij} + Q_{ij} + 2\rho\langle f_i,\hat f_j\rangle - \rho(A_{ij}+A_{ji})\big)\hat f_j. \qquad (2.99)
\]
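To make step (1) concrete, here is a minimal ISTA-style sketch of the lasso subproblem in Eq.(2.97), assuming the kernel quantities $h$ and $G$ have been precomputed (the function name and iteration budget are illustrative, not from the thesis code):

```python
import numpy as np

def solve_sparse_code(h, G, lam1, n_iter=200):
    """Proximal-gradient sketch of the lasso subproblem in Eq. (2.97).

    h[i] = <f_i, k(x_n, .)> and G[i, j] = <f_i, f_j>; returns the code a_n.
    """
    a = np.zeros(len(h))
    step = 1.0 / max(np.linalg.eigvalsh(G).max(), 1e-8)   # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = G @ a - h                                   # gradient of 0.5 a^T G a - a^T h
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)   # soft-thresholding prox
    return a
```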

2.5 Diversity and Sparsity: Nonoverlapness-promoting Regularization

In this section, we combine diversity with sparsity, which jointly bring in an effect of nonoverlapness [380], for the purpose of selecting less-overlapped variables. Variable selection [331] is a classic problem in machine learning, widely used to find important explanatory factors and to improve the generalization performance and interpretability of ML models.

Among the many criteria for evaluating model quality, two are typically considered: (1) accuracy of prediction on unseen data and (2) interpretability of the model. For (2), scientists prefer a simpler model because it sheds more light on the relationship between the response and the covariates. Parsimony [331] is an especially important issue when the number of predictors is large: with many predictors, we often would like to determine a smaller subset that exhibits the strongest effects.

To produce accurate predictions while selecting a subset of important factors, regularization-based variable-selection methods [331, 404, 428] have been widely studied. The most notable one is $\ell_1$ regularization [331], which encourages the model coefficients to be sparse. Its variants include the $\ell_1/\ell_2$ norm [404], which induces group sparsity, and the elastic net [428], which encourages strongly correlated predictors to enter or leave the model together, among many others.

In many ML problems, multiple responses are to be predicted from the same set of covariates. For example, in multi-task classification, the classifiers of $m$ classes are built on top of a shared feature set, and each classifier has a class-specific coefficient vector. In topic modeling [49], multiple topics are learned over the same vocabulary, and each topic has its own multinomial distribution over the words. Different responses are relevant to different subsets of covariates: an education topic is relevant to words like student, university, and professor, while a political topic is relevant to words like government, president, and election. To account for the difference between responses when performing variable selection, we desire the selected variables for different responses to be less overlapped.

The problem is formally formulated as follows. Consider $m$ responses sharing $d$ covariates. Each response has a specific $d$-dimensional weight vector $\mathbf{w}$, where each dimension corresponds to a covariate. Let $s(\mathbf{w})=\{k \mid w_k\ne 0\}$ (the support of $\mathbf{w}$) index the selected variables for a response. For any two responses $i$ and $j$, we desire their selected variables $s(\mathbf{w}_i)$ and $s(\mathbf{w}_j)$ to be less overlapped, where the overlap is measured by $\frac{|s(\mathbf{w}_i)\cap s(\mathbf{w}_j)|}{|s(\mathbf{w}_i)\cup s(\mathbf{w}_j)|}$. To achieve this effect, we propose a regularizer that simultaneously encourages different weight vectors to be close to orthogonal and each vector to be sparse, which jointly encourages the vectors' supports to have small overlap. Empirically, we verify that minimizing this regularizer reduces overlap among the selected variables. We apply this regularizer to four models: multiclass logistic regression, distance metric learning, sparse coding, and deep neural networks. Efficient algorithms are derived to solve the regularized problems; in particular, we develop an algorithm based on ADMM and coordinate descent for regularized sparse coding. In experiments, we demonstrate the empirical effectiveness of this regularizer in selecting non-overlapping variables and improving generalization.

Figure 2.7: (a) Under L1 regularization, the vectors are sparse, but their supports are overlapped; (b) under LDD regularization, the vectors are orthogonal, but their supports are overlapped; (c) under LDD-L1 regularization, the vectors are sparse and mutually orthogonal, and their supports are not overlapped.

Related works Variable selection based on regularization has been widely studied. The lasso [331] uses the $\ell_1$ norm to encourage the coefficient vector of a linear regression model to be sparse; it can recover the exact support of a sparse model from data generated by that model if the covariates are not too correlated [417]. The elastic net [428] uses a weighted sum of the $\ell_1$ and $\ell_2$ norms to encourage strongly correlated variables to be co-selected. The group lasso [404] uses the $\ell_1/\ell_2$ penalty, defined as the sum of the $\ell_2$ norms of sub-weight-vectors corresponding to predefined groups, to select groups of variables; it recovers the support of a model if the support is a union of groups and the covariates of different groups are not too correlated. Zhao et al. [418] proposed composite absolute penalties for hierarchical selection of covariates, e.g., when one has a hierarchy over the covariates and wants to select a covariate only if its ancestors in the hierarchy are also selected. Graphical lasso [113] uses a matrix $\ell_1$ norm for covariance (or neighborhood) selection. In the context of group variable selection, several works [24, 174, 418] consider cases where variables from different groups are (non)overlapping. Their problem settings are different from ours: in their problems, the (non)overlapping structure is with respect to groups and is known a priori, while in our problem it is with respect to different responses and is unknown.

2.5.1 Nonoverlap-promoting Regularization

We assume the model has $m$ responses, each parameterized by a weight vector. For a vector $\mathbf{w}$, its support $s(\mathbf{w})$ is defined as $\{i \mid w_i\ne 0\}$, the indices of the nonzero entries of $\mathbf{w}$; the support contains the indices of the selected variables. We first define a score $\tilde o(\mathbf{w}_i,\mathbf{w}_j)$ to

63 measure the overlap between selected variables of two responses:

\[
\tilde o(\mathbf{w}_i,\mathbf{w}_j) = \frac{|s(\mathbf{w}_i)\cap s(\mathbf{w}_j)|}{|s(\mathbf{w}_i)\cup s(\mathbf{w}_j)|}, \qquad (2.100)
\]
which is the Jaccard index of the supports. The smaller $\tilde o(\mathbf{w}_i,\mathbf{w}_j)$ is, the less overlapped the two sets of selected variables are. For $m$ variable sets, the overlap score is defined as the average of the pairwise scores:
\[
o(\{\mathbf{w}_i\}_{i=1}^m) = \frac{1}{m(m-1)}\sum_{i\ne j}\tilde o(\mathbf{w}_i,\mathbf{w}_j). \qquad (2.101)
\]
This score function is not smooth, which makes it difficult to optimize if used directly as a regularizer. Instead, we propose a smooth function that is motivated by $\tilde o(\mathbf{w}_i,\mathbf{w}_j)$ and achieves a similar effect. The basic idea is: to encourage small overlap, we can encourage (1) each vector to have a small number of nonzero entries and (2) the intersection of the supports to be small. To realize (1), we use an L1 regularizer to encourage the vectors to be sparse. To realize (2), we encourage the vectors to be close to orthogonal: for two sparse vectors that are close to orthogonal, their supports land on different positions, so the intersection of their supports is small. To promote orthogonality, similar to Section 2.2.1, we encourage the Gram matrix $G$ ($G_{ij}=\mathbf{w}_i^\top\mathbf{w}_j$) of the weight vectors to be close to an identity matrix $I$ and measure the "closeness" between the two matrices using the log-determinant divergence (LDD), which results in the LDD regularizer $\mathrm{tr}(G)-\log\det(G)$. Combining the orthogonality-promoting LDD regularizer with the sparsity-promoting L1 regularizer, we obtain the following LDD-L1 regularizer:

\[
\Omega(\mathbf{W}) = \mathrm{tr}(G) - \log\det(G) + \gamma\sum_{i=1}^m \|\mathbf{w}_i\|_1, \qquad (2.102)
\]
where $\gamma$ is a tradeoff parameter between the two terms. As verified in the experiments, this regularizer can effectively promote nonoverlap. A formal analysis of the relationship between Eq.(2.102) and Eq.(2.101) is left for future study.

It is worth noting that neither L1 nor LDD alone is sufficient to reduce overlap. As illustrated in Figure 2.7a, where only L1 is applied, the two vectors are sparse but their supports are completely overlapped. In Figure 2.7b, where only the LDD regularizer is applied, the two vectors are very close to orthogonal, but their supports are completely overlapped since they are dense. In Figure 2.7c, where the LDD-L1 regularizer is used, the two vectors are sparse and close to orthogonal; as a result, their supports do not overlap.

We apply LDD-L1 to four ML models: multiclass logistic regression, distance metric learning, sparse coding, and neural networks (a small code sketch of the overlap score and the regularizer follows the list below).
• Multiclass logistic regression (MLR) aims at classifying a data example $x\in\mathbb{R}^d$ (whose features are treated as covariates) into one of $m$ classes (treated as responses). It is parameterized by a coefficient matrix $\mathbf{W}\in\mathbb{R}^{d\times m}$ whose $i$-th column is the coefficient vector of class $i$, where $d$ is the feature dimension of $x$. In inference, MLR calculates $p=\mathrm{softmax}(\mathbf{W}^\top x + b)\in\mathbb{R}^m$, where $p_i$ denotes the probability that $x$ belongs to class $i$ and $b\in\mathbb{R}^m$ is a bias vector; $x$ is assigned to the class with the largest probability. Given $N$ training examples $\{x_n, y_n\}_{n=1}^N$, MLR learns $\mathbf{W}$ by minimizing the cross-entropy loss between $p_n$ and the ground-truth class label $y_n$. The LDD-L1 regularizer can be applied to encourage the coefficient vectors of different classes to have less-overlapped supports.
• Distance metric learning (DML) has wide applications in classification, clustering, and information retrieval [92, 146, 383]. Given data pairs labeled as similar or dissimilar, DML aims at learning a distance metric under which similar pairs are placed close to each other and dissimilar pairs are separated apart. Following [357], we define the distance metric between $x,y\in\mathbb{R}^d$ as $\|\mathbf{W}^\top x - \mathbf{W}^\top y\|_2^2$, where $\mathbf{W}\in\mathbb{R}^{d\times m}$ contains $m$ projection vectors, which are treated as responses; the features of a data example are treated as covariates. Given $N$ training examples $\{x_n, y_n, t_n\}_{n=1}^N$, where $x_n$ and $y_n$ are similar if the label $t_n$ equals 1 and dissimilar if $t_n=0$, following [146], we learn the distance metric by minimizing $\sum_{n=1}^N \log(1+\exp((2t_n-1)\|\mathbf{W}^\top x_n - \mathbf{W}^\top y_n\|_2^2))$. Using LDD-L1 to promote nonoverlap among the projection vectors in $\mathbf{W}$, we obtain the LDD-L1 regularized DML problem:
\[
\min_{\mathbf{W}}\; \sum_{n=1}^N \log\big(1+\exp\big((2t_n-1)\|\mathbf{W}^\top(x_n - y_n)\|_2^2\big)\big) + \lambda\Omega(\mathbf{W}). \qquad (2.103)
\]
• Sparse coding (SC) and deep neural networks (DNNs) have been introduced in Section 2.3.1. In SC, the features of data examples are treated as covariates and the basis vectors are treated as responses; we apply the LDD-L1 regularizer to encourage the supports of the basis vectors to have small overlap. In DNNs, the hidden units in a layer are treated as a group of responses, and their weight vectors are encouraged to have less-overlapped supports by the LDD-L1 regularizer. In the experiments, we study two popular instances of DNNs: the long short-term memory (LSTM) network [163] and the convolutional neural network (CNN).
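As referenced above, here is a small NumPy sketch of the overlap score in Eq.(2.101) and the LDD-L1 regularizer in Eq.(2.102), assuming the weight vectors are the columns of W and that the Gram matrix is positive definite (the names are illustrative, not from the thesis code):

```python
import numpy as np

def overlap_score(W, tol=0.0):
    """Average pairwise Jaccard overlap of the supports (Eqs. 2.100-2.101)."""
    supports = [set(np.flatnonzero(np.abs(w) > tol)) for w in W.T]   # one response per column
    m = len(supports)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                union = supports[i] | supports[j]
                total += len(supports[i] & supports[j]) / max(len(union), 1)
    return total / (m * (m - 1))

def ldd_l1(W, gamma):
    """LDD-L1 regularizer of Eq. (2.102): tr(G) - logdet(G) + gamma * sum_i |w_i|_1."""
    G = W.T @ W                                   # assumes W has full column rank
    _, logdet = np.linalg.slogdet(G)
    return np.trace(G) - logdet + gamma * np.sum(np.abs(W))
```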

2.5.2 A Coordinate Descent Algorithm

We solve the LDD-L1-regularized MLR, NN, and DML problems using proximal gradient descent [270]. The proximal operation is with respect to the L1 term in LDD-L1. The algorithm iteratively performs the following three steps until convergence: (1) compute the gradient of $L(\mathbf{W}) + \lambda(\mathrm{tr}(\mathbf{W}^\top\mathbf{W}) - \log\det(\mathbf{W}^\top\mathbf{W}))$, where $L(\mathbf{W})$ is the loss function of the unregularized ML model and $\mathrm{tr}(\mathbf{W}^\top\mathbf{W}) - \log\det(\mathbf{W}^\top\mathbf{W})$ is the LDD term in LDD-L1; (2) perform a gradient descent update of $\mathbf{W}$; (3) apply the proximal operator of the L1 regularizer to $\mathbf{W}$.

For LDD-L1-SC, we solve it by alternating between $A$ and $\mathbf{W}$: (1) update $A$ with $\mathbf{W}$ fixed; (2) update $\mathbf{W}$ with $A$ fixed; the two steps alternate until convergence. With $\mathbf{W}$ fixed, the sub-problem defined over $A$ is $\min_A \frac{1}{2}\|X-\mathbf{W}A\|_F^2 + \lambda_1|A|_1$, which decomposes into $n$ lasso problems (P1): for $i=1,\cdots,n$, $\min_{a_i}\frac{1}{2}\|x_i-\mathbf{W}a_i\|_2^2 + \lambda_1|a_i|_1$, where $a_i$ is the coefficient vector of the $i$-th sample. The lasso can be solved by many algorithms, such as proximal gradient descent (PGD). Fixing $A$, the sub-problem defined over $\mathbf{W}$ is (P2): $\min_{\mathbf{W}}\frac{1}{2}\|X-\mathbf{W}A\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{W}\|_F^2 + \lambda_4|\mathbf{W}|_1 + \frac{\lambda_3}{2}(\mathrm{tr}(\mathbf{W}^\top\mathbf{W}) - \log\det(\mathbf{W}^\top\mathbf{W}))$. We solve this problem using

65 an ADMM-based algorithm. First, we write the problem into an equivalent form:

\[
\begin{aligned}
\min_{\mathbf{W},\widetilde{\mathbf{W}}}\;& \frac{1}{2}\|X-\mathbf{W}A\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{W}\|_F^2 + \lambda_4|\widetilde{\mathbf{W}}|_1 + \frac{\lambda_3}{2}\big(\mathrm{tr}(\mathbf{W}^\top\mathbf{W}) - \log\det(\mathbf{W}^\top\mathbf{W})\big) \\
\text{s.t.}\;& \mathbf{W} = \widetilde{\mathbf{W}}.
\end{aligned} \qquad (2.104)
\]

Then we write down the augmented Lagrangian function (P3): $\frac{1}{2}\|X-\mathbf{W}A\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{W}\|_F^2 + \lambda_4|\widetilde{\mathbf{W}}|_1 + \langle U, \mathbf{W}-\widetilde{\mathbf{W}}\rangle + \frac{\rho}{2}\|\mathbf{W}-\widetilde{\mathbf{W}}\|_F^2 + \frac{\lambda_3}{2}\big(\mathrm{tr}(\mathbf{W}^\top\mathbf{W}) - \log\det(\mathbf{W}^\top\mathbf{W})\big)$. We minimize this Lagrangian function by alternating among $\mathbf{W}$, $\widetilde{\mathbf{W}}$, and $U$.

Update W The subproblem defined on W is (P4):

\[
\min_{\mathbf{W}}\; \frac{1}{2}\|X-\mathbf{W}A\|_F^2 + \frac{\lambda_2}{2}\|\mathbf{W}\|_F^2 + \langle U,\mathbf{W}\rangle + \frac{\lambda_3}{2}\big(\mathrm{tr}(\mathbf{W}^\top\mathbf{W}) - \log\det(\mathbf{W}^\top\mathbf{W})\big) + \frac{\rho}{2}\|\mathbf{W}-\widetilde{\mathbf{W}}\|_F^2, \qquad (2.105)
\]
which can be solved using a coordinate descent (CD) algorithm. In each iteration of CD, one basis vector is chosen for update while the others are fixed; without loss of generality, assume it is $\mathbf{w}_1$. The loss function defined over $\mathbf{w}_1$ is
\[
\frac{1}{2}\sum_{i=1}^n \Big\|x_i - \sum_{l=2}^m a_{il}\mathbf{w}_l - a_{i1}\mathbf{w}_1\Big\|_2^2 + \frac{\lambda_2+\lambda_3}{2}\|\mathbf{w}_1\|_2^2 - \frac{\lambda_3}{2}\log\det(\mathbf{W}^\top\mathbf{W}) + \mathbf{u}^\top\mathbf{w}_1 + \frac{\rho}{2}\|\mathbf{w}_1-\widetilde{\mathbf{w}}_1\|_2^2.
\]
The optimal solution can be obtained via the following procedure: (1) compute $M = I - \mathbf{W}_{\neg 1}(\mathbf{W}_{\neg 1}^\top\mathbf{W}_{\neg 1})^{-1}\mathbf{W}_{\neg 1}^\top$, where $\mathbf{W}_{\neg 1}=[\mathbf{w}_2,\cdots,\mathbf{w}_m]$; (2) perform the eigendecomposition $M=U\Sigma U^\top$; (3) solve the scalar quadratic equation $\gamma\sum_{s=m}^d (U^\top b)_s^2 = (\gamma c - \lambda_3)^2$ w.r.t. $\gamma$, where $c=\sum_{i=1}^n a_{i1}^2 + \lambda_2 + \lambda_3 + \rho$ and $b=\sum_{i=1}^n a_{i1}\big(x_i - \sum_{l=2}^m a_{il}\mathbf{w}_l\big) - \mathbf{u} + \rho\widetilde{\mathbf{w}}_1$; (4) compute $\mathbf{w}_1$ as

\[
\mathbf{w}_1 = \gamma U(\gamma c I - \lambda_3\Sigma)^{-1}U^\top b. \qquad (2.106)
\]

The detailed derivation is deferred to Section 2.5.4.

Update $\widetilde{\mathbf{W}}$. The subproblem defined over $\widetilde{\mathbf{W}}$ is (P5): $\min_{\widetilde{\mathbf{W}}}\;\lambda_4|\widetilde{\mathbf{W}}|_1 - \langle U,\widetilde{\mathbf{W}}\rangle + \frac{\rho}{2}\|\mathbf{W}-\widetilde{\mathbf{W}}\|_F^2$, which is a lasso-type problem and can be solved using proximal gradient descent [270].

Update $U$. The update equation of $U$ is simple: $U = U + (\mathbf{W} - \widetilde{\mathbf{W}})$.
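For the LDD-L1-regularized MLR, DML, and NN problems described at the beginning of this subsection, one proximal-gradient step can be sketched as follows; the loss gradient is supplied by the specific model, the same soft-thresholding operator serves as the L1 proximal step used above, and the names are illustrative rather than the thesis implementation:

```python
import numpy as np

def ldd_l1_pgd_step(W, grad_loss, lam, gamma, eta):
    """One proximal-gradient step for an LDD-L1-regularized model; a sketch.

    W: (d, m) weight matrix; grad_loss: gradient of the unregularized loss at W;
    lam, gamma: regularization and tradeoff parameters; eta: step size.
    """
    G = W.T @ W                                   # assumes W has full column rank
    # Gradient of the smooth part lam * (tr(G) - logdet(G)): 2*lam*(W - W G^{-1}).
    grad = grad_loss + 2 * lam * (W - W @ np.linalg.inv(G))
    Z = W - eta * grad
    # Proximal operator of the L1 term lam*gamma*|W|_1: elementwise soft-thresholding.
    return np.sign(Z) * np.maximum(np.abs(Z) - eta * lam * gamma, 0.0)
```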

2.5.3 Evaluation

In this section, we present experimental results.

Algorithm 1: Algorithm for solving the LDD-L1-SC problem.
  Initialize W and A
  repeat
    Update A with W fixed, by solving n lasso problems (P1)
    repeat
      repeat
        for i ← 1 to m do
          Update the i-th column vector w_i of W using Eq.(2.106)
        end for
      until convergence of problem (P4)
      Update W̃ by solving the lasso problem (P5)
      U ← U + (W − W̃)
    until convergence of problem (P3)
  until convergence of the LDD-L1-SC problem

Simulation Study The simulation study was performed on the multiclass logistic regression model. We set the number of classes to 10, and each class is relevant to 5 variables; the variables of different classes have no overlap. We generated 1000 data samples from a multivariate Gaussian distribution with zero mean and identity covariance. In the coefficient vector of each class, the entries corresponding to the relevant variables were sampled uniformly from $[-1,1]$ and the remaining entries were set to zero. Given a generated sample $x$ and the generated coefficient matrix $\mathbf{W}\in\mathbb{R}^{10\times 50}$, the class label of $x$ was determined as $y=\arg\max_k\,[\mathbf{W}x+b]_k$, where $b\in\mathbb{R}^{10}$ is a randomly generated bias vector whose entries were sampled independently from a univariate normal distribution. We split the data into train/validation/test sets with 600/200/200 examples respectively. The regularization parameter was tuned on the validation set to achieve the best prediction performance. We generated 50 simulated datasets, and the performance was averaged over these 50 datasets. We compared our method with L1 regularization [331] and the elastic net [428]; we did not compare with LDD since it is not able to select variables.

Following [193], we use sensitivity (true positive rate) and specificity (true negative rate) to measure the performance of recovering the true supports of the coefficient vectors, shown in the second and third columns of Table 2.25. Our method outperforms the baselines by a large margin: LDD-L1 encourages the supports of different weight vectors to have less overlap, which makes it more suitable for selecting nonoverlapping variables. We also compare the prediction errors of the different methods, shown in the fourth column of Table 2.25. LDD-L1 achieves the lowest error rate: since the variables selected by our method are closer to the ground truth, the predictions based upon these selected variables are more accurate.

Experiments on Real Data We applied the LDD-L1 regularizer to three ML models and four datasets and verified whether it is able to improve generalization performance. In each experiment, the hyperparameters were tuned on the validation set.

Sparse coding for text representation learning The SC experiments were conducted on two text datasets: 20-Newsgroups (20-News) [1] and Reuters Corpus Volume 1 (RCV1) [10]. The 20-News dataset contains newsgroup documents belonging to 20 categories, where 11314, 3766, and 3766 documents were used for training, validation, and testing respectively. The original RCV1 dataset contains documents belonging to 103 categories; following [60], we chose the largest 4 categories, which contain 9625 documents, to carry out the study. The numbers of training, validation, and testing documents are 5775, 1925, and 1925 respectively. For both datasets, stopwords were removed and all words were lower-cased. The top 1000 words with the highest document frequency were selected to form the vocabulary. We used tf-idf to represent documents, and the feature vector of each document was normalized to have unit L2 norm. For 20-News, the number of basis vectors in LDD-L1-SC was set to 50, and λ1, λ2, λ3, and λ4 were set to 1, 1, 0.1, and 0.001 respectively. For RCV1, the number of basis vectors was set to 200, and λ1, λ2, λ3, and λ4 were set to 0.01, 1, 1, and 1 respectively. We compared LDD-L1 with LDD-only and L1-only. To evaluate model performance quantitatively, we applied the dictionary learned on the training data to infer the linear coefficients (A in SC) of test documents, then performed k-nearest-neighbors (KNN) classification on A.

Table 2.26 shows the classification accuracy on the test sets of 20-News and RCV1 and the gap between training and test accuracy. Without regularization, SC achieves a test accuracy of 0.592 on 20-News, which is lower than the training accuracy by 0.119, suggesting that overfitting to the training data occurs. With LDD-L1 regularization, the test accuracy improves to 0.612 and the gap between training and test accuracy shrinks to 0.099, demonstrating the ability of LDD-L1 to alleviate overfitting. Though LDD alone and L1 alone improve test accuracy and reduce the train/test gap, they perform less well than LDD-L1, which indicates that for overfitting reduction, encouraging nonoverlap is more effective than solely promoting orthogonality or solely promoting sparsity. Similar observations hold on the RCV1 dataset; interestingly, the test accuracy achieved by LDD-L1-SC on RCV1 is better than the training accuracy.

Table 2.27 shows the selected variables (words with nonzero weights) for 9 exemplar basis vectors learned by LDD-L1-SC on the 20-News dataset. From the selected words, we can see that basis vectors 1–9 represent the following semantics respectively: crime, faith, job, war, university, research, service, religion, and Jews. The selected words of different basis vectors have no overlap. As a result, it is easy to associate each vector with a unique concept; in other words, the vectors are easy to interpret.

Figure 2.8: Visualization of basis vectors.

Figure 2.8 visualizes the learned vectors, where the black dots denote the vectors' supports. As can be seen, the supports of different basis vectors land on different words and their overlap is small.

LSTM for language modeling We applied LSTM networks [163] to learn language models on the Penn Treebank (PTB) dataset [244], which consists of 923K training, 73K validation, and 82K test words. Following [254], the top 10K words with the highest frequency were selected to form the vocabulary; all other words were replaced with a special token UNK. The LSTM network architecture follows the word-level language model (PytorchLM) provided in PyTorch [3]. The number of hidden layers was set to 2, the embedding size was 1500, and the size of the hidden state was 1500. Following [279], the word embedding and softmax weights were tied. The number of training epochs was 40, dropout with probability 0.65 was used, the initial learning rate was 20, the gradient clipping threshold was 0.25, and the mini-batch size was 20. In LSTM training, the network was unrolled for 35 iterations. Perplexity was used to evaluate language modeling performance (lower is better). The weight parameters were initialized uniformly in [-0.1, 0.1] and the bias parameters were initialized to 0. We compared with the following regularizers: (1) the L1 regularizer; (2) orthogonality-promoting regularizers based on cosine similarity (CS) [402], incoherence (IC) [29], mutual angle (MA) [369], decorrelation (DC) [82], angular constraint (AC) (Section 2.3.1), and LDD (Section 2.2.1).

Table 2.28 shows the perplexity on the PTB test set. Without regularization, PytorchLM achieves a perplexity of 72.3. With LDD-L1 regularization, the perplexity is significantly reduced to 71.1, showing that LDD-L1 can effectively improve generalization performance. Compared with the sparsity-promoting L1 regularizer and the orthogonality-promoting regularizers, LDD-L1, which promotes nonoverlap by simultaneously promoting sparsity and orthogonality, achieves lower perplexity. For the convenience of readers, we also list the perplexity achieved by other state-of-the-art deep learning models; the LDD-L1 regularizer can be applied to these models as well to potentially boost their performance.

CNN for image classification The CNN experiments were performed on the CIFAR-10 dataset [5], which consists of 32×32 color images belonging to 10 categories, with 50,000 images for training and 10,000 for testing; 5,000 training images were used as the validation set for hyperparameter tuning. We augmented the dataset by first zero-padding the images with 4 pixels on each side, then randomly cropping the padded images to reproduce 32×32 images. The CNN architecture follows that of the wide residual network (WideResNet) [406], with depth and width set to 28 and 10 respectively. The networks were trained using SGD for 200 epochs; the learning rate was set to 0.1 initially and was dropped by a factor of 0.2 at epochs 60, 120, and 160; the minibatch size was 128; and the Nesterov momentum was 0.9. The dropout probability was 0.3 and the L2 weight decay was 0.0005. Model performance was measured using the error rate, taken as the median of 5 runs. We compared with (1) the L1 regularizer and (2) orthogonality-promoting regularizers including CS, IC, MA, DC, AC, LDD, and the one based on locally constrained decorrelation (LCD) [291].

Table 2.29 shows classification errors on the CIFAR-10 test set. Compared with the unregularized WideResNet, which achieves an error rate of 3.89%, the proposed LDD-L1 regularizer greatly reduces the error to 3.60%. LDD-L1 outperforms the L1 regularizer and the orthogonality-promoting regularizers, demonstrating that encouraging nonoverlap is more effective than encouraging sparsity alone or orthogonality alone for improving generalization performance. The error rates achieved by other state-of-the-art methods are also listed.

Figure 2.9: Overlap score versus the regularization parameter.

LDD-L1 and Nonoverlap We verified whether the LDD-L1 regularizer is able to promote nonoverlap. The study was performed on the SC model and the 20-News dataset, with the number of basis vectors set to 50. For 5 choices of the regularization parameter of LDD-L1, $\{10^{-4}, 10^{-3}, \cdots, 1\}$, we ran the LDD-L1-SC model until convergence and measured the overlap score (defined in Eq.(2.101)) of the basis vectors. The tradeoff parameter $\gamma$ inside LDD-L1 was set to 1.

Figure 2.9 shows that the overlap score consistently decreases as the regularization parameter of LDD-L1 increases, which implies that LDD-L1 can effectively encourage nonoverlap. As a comparison, we replaced LDD-L1 with LDD-only and L1-only and measured the overlap scores. For LDD-only, the overlap score remains 1 as the regularization parameter increases, which indicates that LDD alone is not able to reduce overlap: under LDD-only, the vectors remain dense, which renders their supports completely overlapped. Under the same regularization parameter, LDD-L1 achieves a lower overlap score than L1-only, which suggests that LDD-L1 is more effective in promoting nonoverlap. Given that $\gamma$ (the tradeoff parameter associated with the L1 norm in LDD-L1) was set to 1, the same regularization parameter $\lambda$ imposes the same level of sparsity for both LDD-L1 and L1-only. Since LDD-L1 additionally encourages the vectors to be mutually orthogonal, the intersection between the vectors' supports is small, which consequently results in small overlap. This is not the case for L1-only, which is hence less effective in reducing overlap.

2.5.4 Appendix: Details of Algorithms

In each iteration of the CD algorithm, one basis vector is chosen for update while the others are fixed. Without loss of generality, we assume it is $\mathbf{w}_1$. The sub-problem defined over $\mathbf{w}_1$ is

\[
\min_{\mathbf{w}_1}\; \frac{1}{2}\sum_{i=1}^n\Big\|x_i - \sum_{l=2}^m a_{il}\mathbf{w}_l - a_{i1}\mathbf{w}_1\Big\|_2^2 + \frac{\lambda_2+\lambda_3}{2}\|\mathbf{w}_1\|_2^2 - \frac{\lambda_3}{2}\log\det(\mathbf{W}^\top\mathbf{W}) + \mathbf{u}^\top\mathbf{w}_1 + \frac{\rho}{2}\|\mathbf{w}_1-\widetilde{\mathbf{w}}_1\|_2^2. \qquad (2.107)
\]
To obtain the optimal solution, we take the derivative of the objective function and set it to zero. First, we discuss how to compute the derivative of $\log\det(\mathbf{W}^\top\mathbf{W})$ w.r.t. $\mathbf{w}_1$. According to the chain rule, we have
\[
\frac{\partial \log\det(\mathbf{W}^\top\mathbf{W})}{\partial \mathbf{w}_1} = 2\mathbf{W}\big((\mathbf{W}^\top\mathbf{W})^{-1}\big)_{:,1}, \qquad (2.108)
\]
where $((\mathbf{W}^\top\mathbf{W})^{-1})_{:,1}$ denotes the first column of $(\mathbf{W}^\top\mathbf{W})^{-1}$. Let $\mathbf{W}_{\neg 1}=[\mathbf{w}_2,\cdots,\mathbf{w}_m]$; then
\[
\mathbf{W}^\top\mathbf{W} = \begin{bmatrix} \mathbf{w}_1^\top\mathbf{w}_1 & \mathbf{w}_1^\top\mathbf{W}_{\neg 1} \\ \mathbf{W}_{\neg 1}^\top\mathbf{w}_1 & \mathbf{W}_{\neg 1}^\top\mathbf{W}_{\neg 1} \end{bmatrix}. \qquad (2.109)
\]
According to the block matrix inversion formula
\[
\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} \widetilde A & \widetilde B \\ \widetilde C & \widetilde D \end{bmatrix}, \qquad (2.110)
\]
where $\widetilde A=(A-BD^{-1}C)^{-1}$, $\widetilde B=-(A-BD^{-1}C)^{-1}BD^{-1}$, $\widetilde C=-D^{-1}C(A-BD^{-1}C)^{-1}$, and $\widetilde D=D^{-1}+D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1}$, we have that $((\mathbf{W}^\top\mathbf{W})^{-1})_{:,1}$ equals $[a\;\, b^\top]^\top$, where

\[
a = \big(\mathbf{w}_1^\top\mathbf{w}_1 - \mathbf{w}_1^\top\mathbf{W}_{\neg 1}(\mathbf{W}_{\neg 1}^\top\mathbf{W}_{\neg 1})^{-1}\mathbf{W}_{\neg 1}^\top\mathbf{w}_1\big)^{-1}, \qquad (2.111)
\]

\[
b = -(\mathbf{W}_{\neg 1}^\top\mathbf{W}_{\neg 1})^{-1}\mathbf{W}_{\neg 1}^\top\mathbf{w}_1\, a. \qquad (2.112)
\]
Then
\[
\mathbf{W}\big((\mathbf{W}^\top\mathbf{W})^{-1}\big)_{:,1} = \begin{bmatrix}\mathbf{w}_1 & \mathbf{W}_{\neg 1}\end{bmatrix}\begin{bmatrix} a \\ b\end{bmatrix} = \frac{M\mathbf{w}_1}{\mathbf{w}_1^\top M\mathbf{w}_1}, \qquad (2.113)
\]
where
\[
M = I - \mathbf{W}_{\neg 1}(\mathbf{W}_{\neg 1}^\top\mathbf{W}_{\neg 1})^{-1}\mathbf{W}_{\neg 1}^\top. \qquad (2.114)
\]
We thus obtain the full gradient of the objective function in Eq.(2.107):

\[
\sum_{i=1}^n a_{i1}\Big(a_{i1}\mathbf{w}_1 + \sum_{l=2}^m a_{il}\mathbf{w}_l - x_i\Big) + (\lambda_2+\lambda_3)\mathbf{w}_1 - \lambda_3\frac{M\mathbf{w}_1}{\mathbf{w}_1^\top M\mathbf{w}_1} + \rho(\mathbf{w}_1-\widetilde{\mathbf{w}}_1) + \mathbf{u}. \qquad (2.115)
\]
Setting the gradient to zero, we get

\[
\Big(\Big(\sum_{i=1}^n a_{i1}^2 + \lambda_2+\lambda_3+\rho\Big)I - \lambda_3 M/(\mathbf{w}_1^\top M\mathbf{w}_1)\Big)\mathbf{w}_1 = \sum_{i=1}^n a_{i1}\Big(x_i - \sum_{l=2}^m a_{il}\mathbf{w}_l\Big) - \mathbf{u} + \rho\widetilde{\mathbf{w}}_1. \qquad (2.116)
\]
Let $\gamma = \mathbf{w}_1^\top M\mathbf{w}_1$, $c = \sum_{i=1}^n a_{i1}^2 + \lambda_2+\lambda_3+\rho$, and $b = \sum_{i=1}^n a_{i1}(x_i - \sum_{l=2}^m a_{il}\mathbf{w}_l) - \mathbf{u} + \rho\widetilde{\mathbf{w}}_1$; then $(cI - \frac{\lambda_3}{\gamma}M)\mathbf{w}_1 = b$ and $\mathbf{w}_1 = (cI - \frac{\lambda_3}{\gamma}M)^{-1}b$. Let $U\Sigma U^\top$ be the eigendecomposition of $M$; we have
\[
\mathbf{w}_1 = \gamma U(\gamma c I - \lambda_3\Sigma)^{-1}U^\top b. \qquad (2.117)
\]
Then
\[
\begin{aligned}
\mathbf{w}_1^\top M\mathbf{w}_1 &= \gamma^2 b^\top U(\gamma c I - \lambda_3\Sigma)^{-1}U^\top U\Sigma U^\top U(\gamma c I - \lambda_3\Sigma)^{-1}U^\top b \\
&= \gamma^2 b^\top U(\gamma c I - \lambda_3\Sigma)^{-1}\Sigma(\gamma c I - \lambda_3\Sigma)^{-1}U^\top b
= \gamma^2\sum_{s=1}^d \frac{(U^\top b)_s^2\,\Sigma_{ss}}{(\gamma c - \lambda_3\Sigma_{ss})^2} = \gamma.
\end{aligned} \qquad (2.118)
\]
The matrix $A = \mathbf{W}_{\neg 1}(\mathbf{W}_{\neg 1}^\top\mathbf{W}_{\neg 1})^{-1}\mathbf{W}_{\neg 1}^\top$ is idempotent, i.e., $AA=A$, and its rank is $m-1$. By the properties of idempotent matrices, the first $m-1$ eigenvalues of $A$ equal one and the rest equal zero. Therefore, the first $m-1$ eigenvalues of $M=I-A$ equal zero and the rest equal one. Based on this property, Eq.(2.118) can be simplified as

\[
\gamma\sum_{s=m}^d \frac{(U^\top b)_s^2}{(\gamma c - \lambda_3)^2} = 1. \qquad (2.119)
\]
After simplification, this is a quadratic equation in $\gamma$ with a closed-form solution. We then plug the solution of $\gamma$ into Eq.(2.117) to obtain the solution of $\mathbf{w}_1$.
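A compact sketch of this $\mathbf{w}_1$ update: it solves the quadratic in $\gamma$, reconstructs $\mathbf{w}_1$ via Eq.(2.117), and, since the text does not specify which root to keep, evaluates the sub-problem objective (2.107) at both roots and keeps the better one (an implementation choice of this sketch; names and data layout are illustrative, not from the thesis code):

```python
import numpy as np

def cd_update_w1(X, A, W, w1_tilde, u, lam2, lam3, rho):
    """Coordinate-descent update of the first basis vector (Eqs. 2.116-2.119); a sketch.

    X: (d, n) data; A: (m, n) codes; W: (d, m) dictionary whose first column is
    updated; w1_tilde, u: (d,) ADMM variables attached to this column.
    """
    d = X.shape[0]
    a1, A_rest, W_rest = A[0], A[1:], W[:, 1:]
    # Projection onto the orthogonal complement of span(W_rest).
    M = np.eye(d) - W_rest @ np.linalg.solve(W_rest.T @ W_rest, W_rest.T)
    R = X - W_rest @ A_rest                        # residual without w1's contribution
    c = a1 @ a1 + lam2 + lam3 + rho
    b = R @ a1 - u + rho * w1_tilde
    beta2 = b @ M @ b                              # sum of (U^T b)_s^2 over unit eigenvalues of M
    # gamma * beta2 = (gamma*c - lam3)^2  <=>  c^2 g^2 - (2 c lam3 + beta2) g + lam3^2 = 0
    roots = np.roots([c ** 2, -(2 * c * lam3 + beta2), lam3 ** 2])
    best_w1, best_val = None, np.inf
    for g in np.real(roots[np.abs(np.imag(roots)) < 1e-9]):
        if g <= 0 or abs(g * c - lam3) < 1e-12:
            continue
        # Equivalent to Eq. (2.117), using that M has only 0/1 eigenvalues.
        w1 = (b - M @ b) / c + g * (M @ b) / (g * c - lam3)
        val = (0.5 * np.sum((R - np.outer(w1, a1)) ** 2)
               + 0.5 * (lam2 + lam3) * (w1 @ w1)
               - 0.5 * lam3 * np.log(max(w1 @ M @ w1, 1e-12))
               + u @ w1 + 0.5 * rho * np.sum((w1 - w1_tilde) ** 2))
        if val < best_val:
            best_w1, best_val = w1, val
    return best_w1
```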

72 MIMIC EICU Reuters AUC-All AUC-F AUC-IF AUC-All AUC-F AUC-IF AUC-All AUC-F AUC-IF NCDML 0.634 0.654 0.608 0.671 0.690 0.637 0.949 0.963 0.916 CDML 0.641 0.659 0.617 0.677 0.693 0.652 0.952 0.961 0.929 EUC 0.559 0.558 0.562 0.583 0.584 0.581 0.887 0.888 0.885 LMNN [357] 0.628 0.643 0.609 0.662 0.678 0.633 0.943 0.951 0.913 LDML [146] 0.619 0.638 0.594 0.667 0.678 0.647 0.934 0.946 0.906 MLEC [200] 0.621 0.633 0.605 0.679 0.692 0.656 0.927 0.933 0.916 GMML [405] 0.607 0.621 0.588 0.668 0.679 0.648 0.931 0.938 0.905 ILHD [63] 0.577 0.590 0.560 0.637 0.652 0.610 0.905 0.919 0.893 `2-CDML 0.648 0.664 0.627 0.695 0.706 0.676 0.955 0.967 0.930 `1-CDML [280] 0.643 0.666 0.615 0.701 0.715 0.677 0.953 0.964 0.948 `2,1-CDML [399] 0.646 0.658 0.630 0.703 0.727 0.661 0.963 0.970 0.936 Tr-CDML [234] 0.659 0.672 0.642 0.696 0.709 0.673 0.961 0.966 0.934 IT-CDML [92] 0.653 0.675 0.626 0.692 0.705 0.668 0.954 0.964 0.920 Dropout-CDML [282] 0.647 0.660 0.630 0.701 0.718 0.670 0.959 0.968 0.937 OS-CDML [116] 0.649 0.665 0.626 0.689 0.711 0.679 0.957 0.961 0.938 DCM-NCDML [242] 0.652 0.662 0.639 0.706 0.717 0.686 0.962 0.976 0.943 CS-NCDML [402] 0.661 0.676 0.641 0.712 0.736 0.670 0.967 0.973 0.954 DPP-NCDML [429] 0.659 0.679 0.632 0.714 0.725 0.695 0.958 0.971 0.937 IC-NCDML [29] 0.660 0.674 0.642 0.711 0.728 0.685 0.972 0.984 0.954 DC-NCDML [82] 0.648 0.666 0.625 0.698 0.711 0.675 0.965 0.977 0.960 VGF-NCDML [177] 0.657 0.673 0.634 0.718 0.730 0.697 0.974 0.985 0.952 MA-NCDML [367] 0.659 0.670 0.644 0.721 0.733 0.703 0.975 0.983 0.959 OC-NCDML [233] 0.651 0.663 0.636 0.705 0.716 0.685 0.955 0.966 0.931 OS-NCDML [116] 0.639 0.658 0.614 0.675 0.691 0.641 0.951 0.962 0.928 SFN-NCDML [74] 0.662 0.677 0.642 0.724 0.736 0.701 0.973 0.984 0.947 VND-NCDML 0.667 0.676 0.655 0.733 0.748 0.706 0.976 0.983 0.971 LDD-NCDML 0.664 0.674 0.651 0.731 0.743 0.711 0.973 0.981 0.964 CSFN-CDML 0.668 0.679 0.653 0.728 0.741 0.705 0.978 0.991 0.968 CVND-CDML 0.672 0.678 0.664 0.735 0.744 0.718 0.984 0.996 0.982 CLDD-CDML 0.669 0.678 0.658 0.739 0.750 0.719 0.981 0.993 0.980 Table 2.10: On three imbalanced datasets – MIMIC, EICU, Reuters, we show the mean AUC (averaged on 5 random train/test splits) on all classes (AUC-All), frequent classes (AUC-F), and infrequent classes (AUC-IF). On the second panel (EUC, etc.) are well established or state of the art baselines. On the third panel (`2-CDML, etc.) are CDML methods regularized by non-diversity regularizers. On the fourth panel (DCM-NCDML, etc.) are NCDML methods regularized by previously proposed diversity-promoting regularizers. On the fifth panel (VND- NCDML, etc.) are NCDML methods regularized by our proposed nonconvex BMD regularizers. On the sixth panel (CSFN-CDML, etc.) are CDML methods regularized by our proposed convex BMD regularizers.

73 News Cars Birds Act NCDML 0.757 0.714 0.851 0.949 CDML 0.769 0.722 0.855 0.952 EUC 0.645 0.663 0.764 0.875 LMNN [357] 0.731 0.728 0.832 0.912 LDML [146] 0.748 0.706 0.847 0.937 MLEC [200] 0.761 0.725 0.814 0.917 GMML [405] 0.738 0.707 0.817 0.925 ILHD [63] 0.711 0.686 0.793 0.898 `2-CDML 0.774 0.728 0.872 0.958 `1-CDML [280] 0.791 0.725 0.868 0.961 `2,1-CDML [399] 0.783 0.728 0.861 0.964 Tr-CDML [234] 0.785 0.731 0.875 0.955 IT-CDML [92] 0.771 0.724 0.858 0.967 Dropout-CDML [282] 0.787 0.729 0.864 0.962 OS-CDML [116] 0.779 0.732 0.869 0.963 DCM-NCDML [242] 0.773 0.736 0.882 0.964 CS-NCDML [402] 0.803 0.742 0.895 0.971 DPP-NCDML [429] 0.797 0.751 0.891 0.969 IC-NCDML [29] 0.801 0.740 0.887 0.967 DC-NCDML [82] 0.786 0.728 0.860 0.958 VGF-NCDML [177] 0.806 0.747 0.894 0.974 MA-NCDML [367] 0.815 0.743 0.898 0.968 OC-NCDML [233] 0.779 0.727 0.875 0.956 OS-NCDML [116] 0.764 0.716 0.855 0.950 SFN-NCDML [74] 0.808 0.749 0.896 0.970 VND-NCDML 0.814 0.754 0.902 0.972 LDD-NCDML 0.816 0.751 0.904 0.971 CSFN-CDML 0.813 0.753 0.905 0.972 CVND-CDML 0.822 0.755 0.908 0.973 CLDD-CDML 0.819 0.759 0.913 0.971 Table 2.11: AUC-All on 4 balanced datasets.

MIMIC EICU Reuters News Cars Birds Act NC CF NC CF NC CF NC CF NC CF NC CF NC CF NCDML 300 2.1 400 1.7 300 3.2 300 2.5 300 2.4 500 1.7 200 4.7 CDML 247 2.6 318 2.1 406 2.3 336 2.3 376 1.9 411 2.1 168 5.7 LMNN [357] 200 3.1 400 1.7 400 2.4 300 2.4 400 1.8 500 1.7 300 3.0 LDML [146] 300 2.1 400 1.7 400 2.3 200 3.7 300 2.4 400 2.1 300 3.1 MLEC [200] 487 1.3 493 1.4 276 3.4 549 1.4 624 1.2 438 1.9 327 2.8 GMML [405] 1000 0.6 1000 0.7 1000 0.9 1000 0.7 1000 0.7 1000 0.8 1000 0.9 ILHD [63] 100 5.8 100 6.4 50 18.1 100 7.1 100 6.9 100 7.9 50 18.0 `2-CDML 269 2.4 369 1.9 374 2.6 325 2.4 332 2.2 459 1.9 179 5.4 `1-CDML [280] 341 1.9 353 2.0 417 2.3 317 2.5 278 2.6 535 1.6 161 6.0 `2,1-CDML [399] 196 3.3 251 2.8 288 3.3 316 2.5 293 2.5 326 2.6 135 7.1 Tr-CDML [234] 148 4.5 233 3.0 217 4.4 254 3.1 114 6.4 286 3.1 129 7.4 IT-CDML [92] 1000 0.7 1000 0.7 1000 1.0 1000 0.8 1000 0.7 1000 0.9 1000 1.0 Dropout-CDML [282] 183 3.5 284 2.5 315 3.0 251 3.1 238 3.1 304 2.8 147 6.5 DCM-NCDML [242] 100 6.5 300 2.4 100 9.6 200 3.9 200 3.7 300 2.9 100 9.6 CS-NCDML [402] 200 3.3 200 3.6 200 4.8 100 8.0 100 7.4 200 4.5 50 19.4 DPP-NCDML [429] 100 6.6 200 3.6 100 9.6 100 8.0 200 3.8 200 4.5 100 9.7 IC-NCDML [29] 200 3.3 200 3.6 200 4.9 100 8.0 200 3.7 100 8.9 100 9.7 DC-NCDML [82] 200 3.2 300 2.3 200 4.8 200 3.9 200 3.6 200 4.3 100 9.6 VGF-NCDML [177] 200 3.3 200 3.6 200 4.9 100 8.1 200 3.7 200 4.5 100 9.7 MA-NCDML [367] 200 3.3 200 3.6 100 9.8 100 8.2 100 7.4 200 4.5 50 19.4 SFN-NCDML [74] 100 6.6 200 3.6 100 9.7 100 8.1 100 7.5 200 4.5 50 19.4 OC-NCDML [233] 100 6.5 100 7.1 50 19.1 50 15.6 100 7.3 100 8.8 50 19.1 VND-NCDML 100 6.7 100 7.3 50 19.5 100 8.1 100 7.5 100 9.0 50 19.4 LDD-NCDML 100 6.6 200 3.7 100 9.7 100 8.2 100 7.5 100 9.0 50 19.4 CSFN-CDML 143 4.7 209 3.5 174 5.6 87 9.3 62 12.1 139 6.5 64 15.2 CVND-CDML 53 12.7 65 11.3 61 16.0 63 13.0 127 5.9 92 9.9 68 14.3 CLDD-CDML 76 8.8 128 5.8 85 11.5 48 17.1 91 8.3 71 12.9 55 17.7 Table 2.12: The numbers of components that achieve the AUCs in Table 6.1 and 2.11.

74 MIMIC EICU Reuters News Cars Birds Act NCDML 0.175 0.145 0.043 0.095 0.149 0.075 0.045 CDML 0.187 0.142 0.045 0.087 0.124 0.066 0.042 LMNN [357] 0.183 0.153 0.031 0.093 0.153 0.073 0.013 LDML [146] 0.159 0.139 0.034 0.079 0.131 0.072 0.068 MLEC [200] 0.162 0.131 0.042 0.088 0.151 0.039 0.043 GMML [405] 0.197 0.157 0.051 0.063 0.118 0.067 0.036 ILHD [63] 0.164 0.162 0.048 0.077 0.117 0.045 0.059 `2-CDML 0.184 0.136 0.037 0.072 0.105 0.053 0.041 `1-CDML [280] 0.173 0.131 0.042 0.064 0.113 0.061 0.026 `2,1-CDML [399] 0.181 0.129 0.034 0.073 0.121 0.044 0.024 Tr-CDML [234] 0.166 0.138 0.024 0.076 0.111 0.058 0.037 IT-CDML [92] 0.174 0.134 0.033 0.061 0.109 0.036 0.013 Dropout-CDML [282] 0.182 0.140 0.021 0.076 0.114 0.063 0.024 OS-CDML [116] 0.166 0.133 0.032 0.063 0.108 0.057 0.031 DCM-NCDML [242] 0.159 0.131 0.035 0.069 0.127 0.064 0.035 CS-NCDML [402] 0.163 0.135 0.031 0.083 0.103 0.045 0.033 DPP-NCDML [429] 0.147 0.140 0.038 0.067 0.117 0.072 0.041 IC-NCDML [29] 0.155 0.127 0.018 0.075 0.116 0.074 0.029 DC-NCDML [82] 0.164 0.123 0.023 0.082 0.125 0.051 0.033 VGF-NCDML [177] 0.158 0.136 0.014 0.064 0.136 0.035 0.028 MA-NCDML [367] 0.143 0.128 0.023 0.078 0.102 0.031 0.042 OC-NCDML [233] 0.161 0.142 0.032 0.061 0.111 0.063 0.034 OS-NCDML [116] 0.169 0.137 0.015 0.083 0.119 0.058 0.042 SFN-NCDML [74] 0.153 0.126 0.022 0.069 0.127 0.043 0.028 VND-NCDML 0.148 0.135 0.019 0.078 0.116 0.067 0.035 LDD-NCDML 0.146 0.121 0.017 0.054 0.111 0.036 0.021 CSFN-CDML 0.142 0.124 0.019 0.062 0.092 0.043 0.019 CVND-CDML 0.137 0.115 0.008 0.055 0.094 0.038 0.013 CLDD-CDML 0.131 0.118 0.012 0.058 0.089 0.026 0.016 Table 2.13: The gap of training AUC and testing AUC.

Scene Caltech Sports SC [392] 83.6 ± 0.2 42.3 ± 0.4 87.4 ± 0.5 DCM-SC [242] 85.4 ± 0.5 44.7 ± 0.8 89.6 ± 0.1 CS-SC [402] 84.8 ± 0.6 45.4 ± 0.5 88.3 ± 0.3 DPP-SC [429] 84.6 ± 0.3 43.5 ± 0.3 88.1 ± 0.2 IC-SC [29] 85.5 ± 0.1 43.9 ± 0.7 90.2 ± 0.7 MA-SC [369] 86.1 ± 0.5 45.6 ± 0.4 89.7 ± 0.4 AC-SC 86.5 ± 0.7 46.1 ± 0.3 90.9 ± 0.3 Table 2.14: Classification accuracy (%) on three datasets. We compare the proposed ACs with existing diversity-promoting regularizers.

75 Network Error Segmental NN [14] 21.9 MCRBM [88] 20.5 DSR [326] 19.9 Rectifier NN [332] 19.8 DBN [313] 19.7 Shallow CNN [22] 19.5 Structured DNN [391] 18.8 Posterior Modeling [278] 18.5 Kaldi [277] 18.53 CS-Kaldi [402] 18.48 IC-Kaldi [29] 18.46 MA-Kaldi [369] 18.51 DC-Kaldi [82] 18.50 AC-Kaldi 18.41 CTC [139] 18.4 RNN Transducer [139] 17.7 Attention RNN [80] 17.6 CTC+SCRF [239] 17.4 Segmental RNN [238] 17.3 RNNDrop [259] 16.9 CNN [333] 16.7 Table 2.15: Phone error rate (%) on the TIMIT test set. On the second panel (Kaldi, etc.), we compare AC with previously proposed diversity-promoting regularizers. On the first panel (Segmental NN, etc.) and the third panel (CTC, etc.) are other state of the art baselines.

76 Network Error Maxout [135] 9.38 NiN [229] 8.81 DSN [217] 7.97 Highway Network [314] 7.60 All-CNN [310] 7.25 ResNet [153] 6.61 ELU-Network [81] 6.55 LSUV [257] 5.84 Fract. Max-Pooling [138] 4.50 WideResNet [406] 3.89 CS-WideResNet [402] 3.81 IC-WideResNet [29] 3.85 MA-WideResNet [369] 3.68 DC-WideResNet [82] 3.77 LCD-WideResNet [291] 3.69 AC-WideResNet 3.63 ResNeXt [382] 3.58 PyramidNet [307] 3.48 DenseNet [169] 3.46 PyramidSepDrop [388] 3.31 Table 2.16: Classification error (%) on the CIFAR-10 test set.

77 CNN DailyMail Dev Test Dev Test Hermann et al. [154] 61.6 63.0 70.5 69.0 Hill et al. [157] 63.4 6.8 - - Kadlec et al. [188] 68.6 69.5 75.0 73.9 Kobayashi et al. [197] 71.3 72.9 - - Sordoni et al. [309] 72.6 73.3 - - Trischler et al. [334] 73.4 74.0 - - Chen et al. [68] 73.8 73.6 77.6 76.6 Cui et al. [86] 73.1 74.4 - - Shen et al. [301] 72.9 74.7 77.6 76.6 BIDAF [296] 76.31 76.94 80.33 79.63 CS-BIDAF [402] 76.43 77.10 80.37 79.71 IC-BIDAF [29] 76.41 77.21 80.49 79.83 MA-BIDAF [369] 76.49 77.09 80.42 79.74 DC-BIDAF [82] 76.35 77.15 80.38 79.67 AC-BIDAF 76.62 77.23 80.65 79.88 Dhingra et al. [101] 77.9 77.9 81.5 80.9 Dhingra et al. [102] 79.2 78.6 – – Table 2.17: Accuracy (%) on two QA datasets.

TIMIT CIFAR-10 CNN No regularization 1.1 6.3 69.4 CS [402] 1.2 6.8 74.8 IC [29] 1.2 6.7 76.1 MA [369] 1.3 7.0 78.6 DC [82] 1.5 7.6 82.9 LCD [291] – 6.8 – AC 1.3 7.1 79.7 Table 2.18: Average runtime (hours). We compare AC with existing diversity-promoting regu- larizers.

#Train #Test Dimension #Class MIMIC-III 40K 18K 7207 2833 Cars 8144 8041 4096 196 Birds 9000 2788 4096 200 Scenes-15 3140 1345 – 15 Caltech-256 20846 8934 – 256 UIUC-Sports 1254 538 – 8 Table 2.19: Dataset statistics.

78 MIMIC-III Cars Birds EUC 58.3 ± 0.1 37.8 ± 0.0 43.2 ± 0.0 ITML [92] 69.3 ± 0.4 50.1 ± 0.0 52.9 ± 0.3 LDML [146] 70.9 ± 0.9 51.3 ± 0.0 52.1 ± 0.2 DML-Eig [398] 70.6 ± 0.7 50.7 ± 0.0 53.3 ± 0.8 Seraph [265] 71.7 ± 0.2 53.6 ± 0.0 52.9 ± 0.2 GMML [405] 71.2 ± 0.3 54.2 ± 0.0 53.7 ± 0.6 Tsang [335] 73.5 ± 0.2 55.8 ±0.0 53.9 ± 0.4 MKDML [345] 75.1 ± 0.9 53.5 ±0.0 54.4 ± 0.1 Jain [176] 74.9 ± 1.1 53.9 ±0.0 55.9 ± 0.6 PCCA [252] 73.4 ± 0.5 56.4 ±0.0 55.1 ± 0.9 KDML 73.8 ± 0.9 54.9 ±0.0 54.7 ± 0.5 KDML-SHN [294] 74.2 ± 0.6 55.4 ±0.0 54.8 ± 0.9 KDML-DPP [429] 75.5 ± 0.8 56.4 ±0.0 57.3 ± 0.3 KDML-Angle [369] 75.9 ± 0.2 56.8 ±0.0 57.1 ± 0.6 KDML-SFN-RTR 76.3 ± 0.7 56.6 ± 0.0 56.4 ± 0.1 KDML-VND-RTR 77.1 ± 0.6 57.7 ± 0.0 58.9 ± 0.7 KDML-LDD-RTR 76.7 ± 0.3 57.4 ± 0.0 59.2 ± 0.3 KDML-SFN-RFF 75.9 ± 0.1 56.5 ± 0.0 56.0 ± 0.2 KDML-VND-RFF 76.9 ± 0.4 57.2 ± 0.0 58.8 ± 0.6 KDML-LDD-RFF 76.8 ± 0.8 57.1 ± 0.0 58.5 ± 0.4 Table 2.20: Retrieval precision@10 (%) on three datasets. On the first panel (EUC, etc.) are non-kernel DML methods. On the second panel (Tsang, etc.) are well established or state of the art kernel DML methods. On the third panel (KDML, etc.) are KDML methods regularized by existing regularizers. On the fourth panel (KDML-SFN-RTR, etc.) are KDML methods regularized by our proposed near-orthogonality regularizers.

MIMIC-III Cars Birds Average KDML 300 400 300 333 KDML-SHN [294] 300 400 300 333 KDML-DPP [429] 200 300 300 267 KDML-Angle [369] 200 300 200 233 KDML-SFN-RTR 200 200 200 200 KDML-VND-RTR 100 200 200 167 KDML-LDD-RTR 100 200 200 167 KDML-SFN-RFF 100 200 200 167 KDML-VND-RFF 100 200 200 167 KDML-LDD-RFF 100 200 200 167

Table 2.21: The number of RKHS functions achieving the precision@10 in Table 2.20

79 D1 (3566) D2 (3498) D3 (2757) D4 (204) D5 (176) D6 (148) D7 (131) D8 (121) Tsang [335] 80.3 ± 0.3 82.8 ± 0.7 81.9 ± 0.4 6.3 ± 0.7 3.9 ± 0.5 4.5 ± 0.5 6.7 ± 0.9 5.1 ± 0.2 MKDML [345] 83.1 ± 0.2 83.4 ± 0.7 82.3 ± 0.6 3.7 ± 1.2 5.5 ± 0.1 9.3 ± 0.8 10.0 ± 0.5 3.7 ± 0.4 Jain [176] 82.7 ± 0.7 84.6 ± 0.5 82.9 ± 0.4 7.2 ± 0.4 8.2 ± 0.4 3.4 ± 0.9 6.2 ± 0.7 8.7 ± 0.3 PCCA [252] 82.2 ± 0.2 82.1 ± 0.6 82.1 ± 0.3 9.4 ± 0.8 7.7 ± 0.2 5.2 ± 0.4 6.1 ± 0.1 3.2 ± 0.4 KDML 82.6 ± 0.7 83.9 ± 1.2 81.7 ± 0.6 7.4 ± 1.0 5.3 ± 0.9 5.7 ± 0.3 3.8 ± 0.8 3.5 ± 0.4 KDML-SHN [294] 82.1 ± 0.5 83.6 ± 0.4 82.4 ± 0.9 8.3 ± 0.1 5.1 ± 0.8 4.7 ± 0.2 3.4 ± 0.9 3.7 ± 0.8 KDML-DPP [429] 83.4 ± 0.4 84.7 ± 0.7 82.7 ± 1.0 11.5 ± 0.3 9.7 ± 0.5 10.4 ± 0.4 7.3 ± 0.2 7.9 ± 0.1 KDML-Angle [369] 83.7 ± 0.1 84.3 ± 0.1 81.8 ± 0.3 10.6 ± 0.2 10.2 ± 0.8 9.5 ± 0.6 8.8 ± 0.5 7.2 ± 0.3 KDML-SFN-RTR 82.5 ± 0.8 84.2 ± 1.2 82.2 ± 0.1 15.0 ± 0.3 13.4 ± 0.1 13.8 ± 0.2 12.1 ± 0.6 10.9 ± 0.5 KDML-VND-RTR 83.9 ± 0.9 84.5 ± 0.8 82.6 ± 0.2 15.5 ± 0.9 16.2 ± 0.7 14.3 ± 0.7 12.4 ± 0.8 14.7 ± 0.8 KDML-LDD-RTR 83.7 ± 0.1 83.8 ± 0.9 82.2 ± 0.6 14.8 ± 0.4 14.2 ± 0.1 13.7 ± 0.3 10.3 ± 0.1 12.8 ± 0.2 KDML-SFN-RFF 82.3 ± 0.9 84.1 ± 0.6 81.1 ± 0.5 15.1 ± 0.2 14.9 ± 0.5 15.2 ± 0.9 13.5 ± 0.1 10.8 ± 0.3 KDML-VND-RFF 83.4 ± 0.2 83.6 ± 0.5 82.7 ± 0.4 15.2 ± 0.5 15.6 ± 0.9 10.6 ± 0.8 12.0 ± 1.1 10.3 ± 0.7 KDML-LDD-RFF 82.9 ± 0.2 84.0 ± 1.0 82.5 ± 0.9 14.4 ± 0.9 13.9 ± 1.2 15.4 ± 0.5 13.9 ± 0.2 14.4 ± 0.1 Table 2.22: Retrieval precision@10 (%) on three frequent and five infrequent diseases in the MIMIC-III dataset. The number next to a disease ID is its frequency. Note that diseases D1–D3 are frequent diseases, while that D4–D8 are infrequent ones.

MIMIC-III Cars Birds KDML-SFN-RTR 69.4 17.2 18.6 KDML-VND-RTR 69.8 17.4 18.9 KDML-LDD-RTR 69.9 17.5 18.9 KDML-SFN-RFF 12.6 2.7 2.9 KDML-VND-RFF 12.9 2.8 3.1 KDML-LDD-RFF 12.8 2.8 3.1 Table 2.23: Convergence time (hours) on three datasets.

Scenes-15 Caltech-256 UIUC-Sports SC [266] 83.6 ± 0.2 42.3 ± 0.4 87.4 ± 0.5 KSC [120] 85.4 ± 0.5 44.7 ± 0.8 88.2 ± 0.1 KSC-SHN [294] 85.8 ± 0.6 45.4 ± 0.5 88.3 ± 0.3 KSC-DPP [429] 86.3 ± 0.3 47.3 ± 0.8 89.3 ± 0.2 KSC-Angle [369] 86.8 ± 0.1 46.1 ± 0.8 89.5 ± 0.5 KSC-SFN-RTR 87.1 ± 0.5 47.2 ± 0.5 89.9 ± 0.3 KSC-VND-RTR 87.9 ± 0.7 48.6 ± 0.3 91.3 ± 0.5 KSC-LDD-RTR 87.4 ± 0.4 48.1 ± 0.6 90.7 ± 0.2 KSC-SFN-RFF 86.8 ± 0.6 46.5 ± 0.1 89.5 ± 0.7 KSC-VND-RFF 87.5 ± 0.5 48.2 ± 0.4 90.4 ± 0.8 KSC-LDD-RFF 87.2 ± 0.1 47.8 ± 0.2 90.2 ± 0.4 Table 2.24: Classification accuracy (%) on three datasets.

Sensitivity Specificity Error rate L1 [331] 0.76 0.71 0.31 Elastic Net [428] 0.74 0.72 0.30 LDD-L1 0.82 0.75 0.24 Table 2.25: Sensitivity and specificity for support recovery and error rate for prediction.

80 20-News RCV1 Method Test Gap Test Gap SC [266] 0.592 0.119 0.872 0.009 LDD-SC 0.605 0.108 0.886 0.005 L1-SC [331] 0.606 0.105 0.897 0.005 LDD-L1-SC 0.612 0.099 0.909 -0.015 Table 2.26: Classification accuracy on the test sets of 20-News and RCV1, and the gap between training and test accuracy.

Basis Vector Selected Words 1 crime, guns 2 faith, trust 3 worked, manager 4 weapons, citizens 5 board, uiuc 6 application, performance, ideas 7 service, quality 8 bible, moral 9 christ, jews, land, faq Table 2.27: Selected words of 9 exemplar basis vectors.

81 Network Test RNN [253] 124.7 RNN+LDA [253] 113.7 Deep RNN [273] 107.5 Sum-Product Network [76] 100.0 RNN+LDA+KN-5+Cache [253] 92.0 LSTM (medium) [408] 82.7 CharCNN [194] 78.9 LSTM (large) [408] 78.4 Variational LSTM [115] 73.4 PytorchLM [3] 72.3 CS-PytorchLM [402] 71.8 IC-PytorchLM [29] 71.9 MA-PytorchLM [369] 72.0 DC-PytorchLM [82] 72.2 AC-PytorchLM 71.5 LDD-PytorchLM 71.6 L1-PytorchLM [331] 71.8 LDD-L1-PytorchLM 71.1 Pointer Sentinel LSTM [250] 70.9 Ensemble of 38 Large LSTMs [408] 68.7 Variat. LSTM Ensem. [115] 68.7 Variational RHN [426] 68.5 Variational LSTM + REAL [170] 68.5 Neural Architecture Search [427] 67.9 Variational RHN + RE [170] 66.0 Variational RHN + WT [426] 65.4 Variational RHN+WT+Dropout [426] 64.4 Architecture Search + WT V1 [427] 64.0 Architecture Search + WT V2 [427] 62.4 Table 2.28: Word-level perplexities on the PTB test set. On the second panel (PytorchLM, etc.), we compare LDD-L1 with the L1 regularizer and orthogonality-promoting regularizers. On the first panel (RNN, etc.) and the third panel (Pointer Sentinel LSTM, etc.) are other state of the art baselines.

82 Network Error Maxout [135] 9.38 NiN [229] 8.81 DSN [217] 7.97 Highway Network [314] 7.60 All-CNN [310] 7.25 ResNet [153] 6.61 ELU-Network [81] 6.55 LSUV [257] 5.84 Fract. Max-Pooling [138] 4.50 WideResNet [406] 3.89 CS-WideResNet [402] 3.81 IC-WideResNet [29] 3.85 MA-WideResNet [369] 3.68 DC-WideResNet [82] 3.77 LCD-WideResNet [291] 3.69 AC-WideResNet 3.63 LDD-WideResNet 3.65 L1-WideResNet [331] 3.81 LDD-L1-WideResNet 3.60 ResNeXt [382] 3.58 PyramidNet [307] 3.48 DenseNet [169] 3.46 PyramidSepDrop [388] 3.31 Table 2.29: Classification error (%) on the CIFAR-10 test set.

83 Chapter 3

Diversity-promoting Learning II – Bayesian Inference

In the last chapter, we have studied diversity-promoting learning under a frequentist-style regu- larized optimization framework, where the component vectors are deterministic and are learned via point estimation [352]. In this chapter, we study how to promote diversity under an alterna- tive learning paradigm: Bayesian inference [46, 172, 260], where the components are considered as random variables of which a posterior distribution shall be computed from data under certain priors. Compared with point estimation, Bayesian learning offers complementary benefits. First, it offers a “model-averaging” [46, 172] effect for ML models when they are used for decision- making and prediction because the parameters shall be integrated under a posterior distribution, thus can potentially alleviate overfitting on training data. Second, it provides a natural way to quantify uncertainties of model parameters, and downstream decisions and predictions made thereupon [46, 172, 260]. Affandi et al. [16] investigated the “diversification” of Bayesian mod- els using the determinantal point process (DPP) [204] prior. DPP has two drawbacks. First, it is not applicable to Bayesian nonparametric models where the number of components is infinite. Second, it is not amenable to developing variational-inference [340] based posterior inference algorithms which are usually more efficient than sampling-based [127] methods. In this chapter, we develop new diversity-promoting priors to address these limitations. Bayesian models can be classified into parametric models where the number of components is finite and nonparamet- ric [111] models where the number of components is infinite in principle and can dynamically grow with data. To promote diversity among a finite number of components in parametric mod- els, we propose a prior called mutual angular process [371] that encourage the components to have large mutual angles. In nonparametric models, we study an infinite mutual angular pro- cess [381] which encourages infinitely many components to be diverse.

3.1 Diversity-promoting Learning of Bayesian Parametric Mod- els

We start with promoting diversity in Bayesian parametric models where the number of com- ponents is finite and fixed. We investigate two approaches: (1) prior control, which defines

84 diversity-promoting priors and uses them to affect the posterior via the Bayes rule; (2) posterior regularization, which directly performs diversity-promoting regularization over post-data distri- butions [424]. These two approaches have complementary advantages which will be discussed in detail below. Following Section 2.3, we adopt a notion of diversity that component vectors are more diverse provided the pairwise angles between them are larger. In the prior control method, we define a mutual angular process (MAP) [371], which assigns higher probability density to components that have larger mutual angles. Specifically, we build a Bayesian network [198] where the nodes represent the directional vectors of the components and the local probabilities are parameterized by von Mises-Fisher [245] distributions that entail an inductive bias towards vectors with larger mutual angles. The MAP is amenable for approximate posterior inference of model components. In particular, they facilitate variational inference (VI) [340], which is usually more efficient than sampling-based [127] methods. In contrast, it is difficult to derive VI-based algorithms for the determinant point process (DPP) [204] prior. Bardenet and Titsias [30] derived bounds on the likelihood of a DPP and use these bounds for variational inference. This method is applicable when DPP is used to define the likelihood. However, when DPP is applied as a prior of com- ponents in Bayesian models, it is unclear how to apply this method for variational inference of components’ posterior. It is not always the case that a Bayesian prior can be defined to capture a certain desired effect. For example, Xie et al. [369] posit that one ingredient of diversity is evenness: the vectors are desired to uniformly spread out to different directions and each vector is evenly different from all other vectors. To achieve this, they defined a regularizer to encourage the variance of angles between vectors to be small. However, it is very difficult to define a Bayesian prior to achieve such an effect. The major technical difficulty lies in how to ensure the prior integrates to one. To address this issue, we adopt a posterior regularization approach [424], in which a diversity- promoting regularizer is directly imposed over the post-data distributions to encourage diversity and the regularizer can be flexibly defined to accommodate various desired diversity-promoting goals. We apply these two approaches to “diversify” the “experts” (components) in the Bayesian mixture of experts model [354]. Posterior inference algorithms based on both variational in- ference and Markov chain Monte Carlo sampling are developed. Experiments demonstrate the effectiveness and efficiency of our methods.

3.1.1 Mutual Angular Process We start with defining diversity-promoting Bayesian priors. We desire the priors to have two traits. First, to favor diversity, they assign a higher density to components having larger mutual angles. Second, the priors should facilitate posterior inference. In Bayesian learning, the easiness of posterior inference relies heavily on the prior [47, 341]. One possible solution is to turn a diversity-promoting regularizer Ω(A) (e.g., the uniform eigenvalue regularizer in Section 2.1) 1 into a distribution p(A) = Z exp(Ω(A)) based on the Gibbs measure [195], where Z is the partition function guaranteeing p(A) integrates to one. However, it is not sure whether Z = R A exp(Ω(A))dA is finite, i.e., whether p(A) is proper. When an improper prior is utilized in Bayesian learning, the posterior is also highly likely to be improper, except in a few special cases

85 …... a1 a2 a3 aK

Figure 3.1: A Bayesian network representation of the mutual angular process.

[352]. Performing inference on improper posteriors is problematic. At a high-level, we define such a prior in a sequential manner: each time we add one com- ponent to the component pool and encourage the new component to be different from the exist- ing ones. This sequential manner has two benefits. First, it can easily define both parametric and nonparametric priors by controlling the number of components to be added. Repeating the adding process K times, we obtain a parametric prior defined on K components. Likewise, a nonparametric prior on an infinite set of components can be obtained by repeating the adding process infinitely many times. Second, priors defined in this sequential manner can be nat- urally decomposed into a sequence of “simpler” distributions, which makes the derivation of variational-inference-based algorithms relatively easier. Following [369], we adopt a notion of diversity that component vectors are more diverse pro- vided the pairwise angles between them are larger. Given a component vector a, let g = kak2 be its magnitude and ˜a be the directional vector (k˜ak2 = 1), where a = g˜a. Noting that the an- gle between two vectors is invariant to their magnitudes, we can construct the angle-based prior by sequentially adding the directional vectors (DVs) and encouraging the new DV to have large angles with the existing ones. We use a Bayesian network (BNs) (shown in Figure 3.1) to model this adding process, where nodes represent the DVs and the parents of ˜ai are ˜a1, ··· , ˜ai−1. To en- courage large angles between ˜ai and ˜a1, ··· , ˜ai−1, we define a local probability based on the von Mises-Fisher (vMF) [245] distribution. The probability density function of the vMF distribution > p is f(x) = Cp(κ) exp(κµ x), where the random variable x ∈ R lies on a p − 1 dimensional sphere (kxk2 = 1), µ is the mean direction with kµk2 = 1, κ > 0 is the concentration param- eter, and Cp(κ) is the normalization constant. The vMF-based local probability p(˜ai|pa(˜ai)) is defined as   Pi−1 !> a˜j p(˜a |pa(˜a )) = C (κ) exp κ − j=1 a˜ , (3.1) i i p  Pi−1 i k j=1 a˜jk2

> which encourages large angles: ˜aj ˜ai is the cosine of the angle between ˜ai and ˜aj; if ˜ai has i−1 Pi−1 > larger angles with {˜aj}j=1, then the average negative cosine similarity (− j=1 ˜aj) ˜ai would be K larger; accordingly p(˜ai|pa(˜ai)) would be larger. For the magnitudes {gi}i=1 of the components, we sample them independently from a gamma distribution with shape parameter α1 and rate parameter α2. The generative process of A is summarized as follows: • Draw ˜a1 ∼ vMF(µ0, κ) Pi−1 j=1 a˜j • For i = 2, ··· ,K, draw ˜ai ∼ vMF(− Pi−1 , κ) k j=1 a˜j k2

• For i = 1, ··· ,K, draw gi ∼ Gamma(α1, α2)

86 • For i = 1, ··· ,K, ai = ˜aigi The probability distribution over A can be written as !  i−1 > α α −1 P a˜ 1 1 −giα2 > QK j=1 j QK α2 gi e p(A) = Cp(κ) exp(κµ0 ˜a1) i=2 Cp(κ) exp κ − Pi−1 a˜i i=1 Γ(α ) . k j=1 a˜j k2 1 (3.2) R According to the factorization theorem [198] of Bayesian networks, it is easy to verify A p(A)dA = 1, thus p(A) is a proper prior. When inferring the posterior of model components using a variational inference method, we Pi−1 need to compute the expectation of 1/k j=1 ˜aj||2 appearing in the local probability p(˜ai|pa(˜ai)), which is extremely difficult. To address this issue, we define an alternative local probability that achieves similar modeling effect as p(˜ai|pa(˜ai)), but greatly facilitates variational inference:

Pi−1 > pˆ(˜ai|pa(˜ai)) ∝ exp(κ(− j=1 a˜j) a˜i) > !  Pi−1 a˜  Pi−1 j=1 j ∝ exp κk j=1 a˜jk2 − Pi−1 a˜i k j=1 a˜j k2 > ! (3.3)  Pi−1 a˜  Pi−1 Pi−1 j=1 j = Cp(κk j=1 a˜jk2) exp κk j=1 a˜jk2 − Pi−1 a˜i k j=1 a˜j k2 Pi−1 Pi−1 > = Cp(κk j=1 a˜jk2) exp(κ(− j=1 a˜j) a˜i),

Pi−1 Pi−1 which is another vMF distribution with mean direction − j=1 a˜j/k j=1 a˜jk2 and concentra- Pi−1 Pi−1 > tion parameter κk j=1 a˜jk2. This local probability is proportional to (− j=1 a˜j) a˜i, which measures the negative cosine similarity between a˜i and its parents. Thereby, pˆ(˜ai|pa(˜ai)) still encourages large mutual angles. The difference between pˆ(˜ai|pa(˜ai)) and p(˜ai|pa(˜ai)) is that Pi−1 in pˆ(˜ai|pa(˜ai)) the term k j=1 a˜jk2 is moved from the denominator to the normalizer, thus we Pi−1 can avoid computing the expectation of 1/k j=1 a˜jk2. Though it incurs a new problem that Pi−1 we need to compute the expectation of log Cp(κk j=1 a˜jk2), which is also difficult due to the complex form of the Cp(·) function, we managed to resolve this problem as detailed in Section 3.1.2. We refer to the mutual angular process defined in Eq.(3.2) as type-I MAP and that with local probability defined in Eq.(3.3) as type-II MAP.

3.1.2 Approximate Inference Algorithms We develop algorithms to infer the posteriors of components under the MAP priors. Since exact posteriors are intractable, we resort to approximate inference techniques. Two main paradigms of approximate inference methods are: (1) variational inference (VI) [340]; (2) Markov chain Monte Carlo (MCMC) sampling [127]. These two approaches possess benefits that are mutually complementary. MCMC can achieve a better approximation of the posterior than VI since it generates samples from the true posterior while VI seeks an approximation of the posterior. However, VI can be computationally more efficient [164]. In this section, we focus on deriving the VI-based algorithm and leave the details of MCMC-based algorithms (which is relatively straightforward) to the appendix.

87 The basic idea of VI [340] is to use a “simpler” variational distribution q(A) to approximate the true posterior by minimizing the Kullback-Leibler divergence between these two distribu- tions, which is equivalent to maximizing the following variational lower bound w.r.t q(A):

Eq(A)[log p(D|A)] + Eq(A)[log p(A)] − Eq(A)[log q(A)], (3.4) where p(A) is the MAP prior and p(D|A) is the data likelihood. Here we choose q(A) to be QK a mean field variational distribution q(A) = k=1 q(a˜k)q(gk), where q(a˜k) = vMF(a˜k|aˆk, κˆ) and q(gk) = Gamma(gk|rk, sk). Given the variational distribution, we first compute the an- alytical expression of the variational lower bound, in which we particularly discuss how to compute Eq(A)[log p(A)]. If choosing p(A) to be the type-I MAP prior (Eq.(3.2)), we need Pi−1 j=1 a˜j > to compute E[(− Pi−1 ) a˜i] which is very difficult to deal with due to the presence of k j=1 a˜j k2 Pi−1 1/k j=1 a˜jk2. Instead we choose the type-II MAP prior. Under type-II MAP, we need to Pi−1 compute Eq(A)[log Zi] for all i, where Zi = 1/Cp(κk j=1 a˜jk2) is the partition function of p(˜ai|pa(˜ai)). The analytical form of this expectation is difficult to derive as well due to the com- xp/2−1 plexity of the Cp(x) function: Cp(x) = p/2 where Ip/2−1(x) is the modified Bessel (2π) Ip/2−1(x) function of the first kind at order p/2 − 1 [245]. To address this issue, we derive an upper bound of log Zi and compute the expectation of the upper bound, which is relatively easy to do. Consequently, we obtain a further lower bound of the variational lower bound and learn the variational and model parameters w.r.t the new lower bound. Now we proceed to derive the R Pi−1 upper bound of log Zi, which equals log exp(κ(− j=1 ˜aj) · ˜ai)d˜ai. Applying the inequality log R exp(x)dx ≤ γ +R log(1+exp(x−γ))dx [50], where γ is a variational parameter, we have R Pi−1 log Zi ≤ γ + log(1 + exp(κ(− j=1 ˜aj) · ˜ai − γ)d˜ai. (3.5)

−x −ξ x−ξ 1/2−g(ξ) 2 2 Then applying the inequality log(1 + e ) ≤ log(1 + e ) − 2 − 2ξ (x − ξ ) [50], where ξ is another variational parameter and g(ξ) = 1/(1 + exp(−ξ)), we have log Zi R −ξ Pi−1 1/2−g(ξ) Pi−1 2 2 ≤ γ + [log(1 + e ) − (κ( j=1 ˜aj) · ˜ai + γ − ξ)/2 − 2ξ ((κ( j=1 ˜aj) · ˜ai + γ) − ξ )]d˜ai. (3.6) R 2π(p+1)/2 Finally applying the following integrals on a high-dimensional sphere: (1) 1dy = p+1 , kyk2=1 Γ( 2 ) R > R > 2 2 2π(p+1)/2 (2) x ydy = 0, (3) (x y) dy ≤ kxk p+1 , we get kyk2=1 kyk2=1 2 Γ( 2 )

log Zi 1/2−g(ξ) 2 Pi−1 2 2π(p+1)/2 −ξ ξ−γ 1/2−g(ξ) 2 2 2π(p+1)/2 ≤ − 2ξ κ k j=1 ˜ajk2 p+1 + γ + [log(1 + e ) + 2 + 2ξ (ξ − γ )] p+1 . Γ( 2 ) Γ( 2 ) (3.7) The expectation of this upper bound is much easier to compute. Specifically, we need to tackle Pi−1 2 Eq(A)[k j=1 ˜ajk2], which can be computed as Pi−1 2 Eq(A)[k j=1 ˜ajk2] Pi−1 > Pi−1 Pi−1 > = Eq(A)[ j=1 ˜aj ˜aj + j=1 k6=j ˜aj ˜ak] Pi−1 > Pi−1 Pi−1 > (3.8) = j=1 tr(Eq(˜aj )[˜aj˜aj ]) + j=1 k6=j Eq(˜aj )[˜aj] Eq(˜ak)[˜ak] Pi−1 Pi−1 Pi−1 > = j=1 tr(cov(˜aj)) + j=1 k=1 Eq(˜aj )[˜aj] Eq(˜ak)[˜ak],

88 where [˜a ] = A (ˆκ)aˆ , cov(˜a ) = h(ˆκ) I + (1 − 2 ν+1 h(ˆκ) − h2(ˆκ))aˆ aˆ>, h(ˆκ) = Iν+1(ˆκ) , Eq(˜aj ) j p j j κˆ κˆ j j Iν (ˆκ) Ip/2(ˆκ) Ap(ˆκ) = and ν = p/2 − 1. Ip/2−1(ˆκ)

3.1.3 Diversity-promoting Posterior Regularization In practice, one may desire to achieve more than one diversity-promoting effects. For example, the mutual angular regularizer [369] aims at encouraging the pairwise angles between compo- nents to have not only large mean, but also small variance such that the components are uniformly “different” from each other and evenly spread out to different directions in the space. It would be extremely difficult, if ever possible, to define a proper prior that can accommodate all desired effects. To overcome such inflexibility of the prior control method, we resort to a posterior reg- ularization approach [424]. Instead of designing a Bayesian prior to encode the diversification desideratum and indirectly influencing the posterior, posterior regularization directly imposes a control over the post-data distributions to achieve diversity. The basic idea of posterior regularization [424] is: formulate the posterior estimation prob- lem as an optimization problem defined over a family of distributions; add a regularizer, which encodes the desideratum, on the distributions to be optimized. In general, designing a regular- izer is much easier than designing a proper prior. Giving the prior π(A) and the data likelihood p(D|A), computing the posterior p(A|D) is equivalent to solving the following optimization problem [424] supq(A) Eq(A)[log p(D|A)π(A)] − Eq(A)[log q(A)], (3.9) where q(A) is any valid probability distribution. To bring in diversity into the posterior, we impose a diversity-promoting regularizer R(q(A)) over q(A) and solve the following regularized problem: supq(A) Eq(A)[log p(D|A)π(A)] − Eq(A)[log q(A)] + λR(q(A)), (3.10) where λ is a tradeoff parameter. Here we present a specific example of R(q(A)) while noting that many other choices are applicable. Gaining insight from [369], we define R(q(A)) as

2 K K K K K K ! K 1 P P 1 P P 1 P P (3.11) Ω({Eq(ai)[ai]}i=1) = K(K−1) θij − γ K(K−1) θij − K(K−1) θpq , i=1 j6=i i=1 j6=i p=1 q6=p

|E[ai]·E[aj ]| where θij = arccos( ) is the angle measuring the dissimilarity between [ai] and kE[ai]k2kE[aj ]k2 E E[aj], and the regularizer is defined as the mean of pairwise angles minus their variance. The intuition behind this regularizer is: if the mean of angles is larger (indicating these vectors are more different from each other on the whole) and the variance of the angles is smaller (indicating these vectors evenly spread out to different directions), then these vectors are more diverse. While posterior regularization is more flexible, it lacks some strengths possessed by the prior control method. First, prior control is a more natural way of incorporating prior knowledge, with solid theoretical foundation. Second, prior control can facilitate sampling-based algorithms that are not applicable in posterior regularization.

89   z k 0

K 1  y 2    0 x k K  N 1

2

Figure 3.2: Bayesian mixture of experts with mutual angular process.

3.1.4 Case Study: Bayesian Mixture of Experts Model In this section, we apply the two approaches developed above to “diversify” the Bayesian mixture of experts model (BMEM) [354].

BMEM with mutual angular process The mixture of experts model (MEM) [186] has been widely used for machine learning tasks where the distribution of input data is so complicated that a single model (“expert”) cannot be effective for all the data. The MEM model assumes that the input data inherently belongs to multiple latent groups and one single “expert” is allocated to each group to handle the data therein. Here we consider a binary classification task. Let N D = {(xi, yi)}i=1 be the training data where x is the input feature vector and yi ∈ {1, 0} is the class label. MEM assumes there are K latent experts where each expert is a classifier with a coefficient vector β. Given a test sample x, it first goes through a gate function that decides which expert is best suitable to classify this sample and the decision is made in a probabilistic way. A discrete variable z is utilized to indicate the selected expert and the probability of assigning > PK > sample x to expert k (z = k) is exp(ηk x)/ j=1 exp(ηj x) where ηk is a coefficient vector characterizing the selection of expert k. Given the selected expert, the sample is classified using the coefficient vector β corresponding to that expert. As described in Figure 3.2, the generative N process of {(xi, yi)}i=1 is as follows • For i = 1, ··· ,N > exp(ηk xi) Draw zi ∼ Multi(ζ), ζk = PK > j=1 exp(ηj xi) 1 Draw yi ∼ Bernoulli( 1+exp(−β> x ) ) zi i K K As of now, the model parameters B = {βk}k=1 and H = {ηk}k=1 are deterministic variables. Next we place a prior over them to enable Bayesian learning [354] and desire this prior to be able to promote diversity among the experts to retain the advantages of “diversification” as stated

90 before. The mutual angular processes can be applied to achieve this goal

> !  Pi−1 ˜  α α −1 βj α 1 g 1 e−giα2 > ˜ QK j=1 ˜ QK 2 i p(B) = Cp(κ) exp(κµ0 β1) i=2 Cp(κ) exp κ − Pi−1 ˜ βi i=1 Γ(α ) , || j=1 βj ||2 1 (3.12) !  i−1 > ω ω −1 P η˜ 1 1 −hiω2 > QK j=1 j QK ω2 hi e p(H) = Cp(κ) exp(κξ0 η˜1) i=2 Cp(κ) exp κ − Pi−1 η˜i i=1 Γ(ω ) , || j=1 η˜j ||2 1 (3.13) ˜ where βk = gkβk and ηk = hkη˜k.

BMEM with diversity-promoting posterior regularization As an alternative approach, the diversity among the experts can be imposed by placing the mutual angular regularizer (Eq.(3.11)) N over the post-data posteriors. The latent variables in BMEM include B, H, z = {zi}i=1. The post-data distribution over them is defined as q(B, H, z) = q(B)q(H)q(z). For compu- QK ˜ tational tractability, we define q(B) and q(H) to be: q(B) = k=1 q(βk)q(gk) and q(H) = QK ˜ k=1 q(η˜k)q(hk) where q(βk), q(η˜k) are von Mises-Fisher distributions and q(gk), q(hk) are QN gamma distributions, and define q(z) to be multinomial distributions: q(z) = i=1 q(zi|φi) where φi is a multinomial vector. The priors over B and H are specified to be: π(B) = QK ˜ QK ˜ k=1 p(βk)p(gk) and π(H) = k=1 p(η˜k)p(hk) where p(βk), p(η˜k) are vMF distributions and p(gk), p(hk) are gamma distributions. Under such a parameterization, we solve the following diversity-promoting posterior regularization problem

N supq(B,H,z) Eq(B,H,z)[log p({yi}i=1, z|B, H)π(B, H)] − Eq(B,H,z)[log q(B, H, z)] K K (3.14) +λ Ω({ ˜ [β˜ ]} ) + λ Ω({ [η˜ ]} ). 1 Eq(βk) k k=1 2 Eq(η˜k) k k=1 Note that other parameterizations are also valid, such as placing Gaussian priors over B and H and setting q(B), q(H) to be Gaussian.

3.1.5 Evaluation Using the Bayesian mixture of experts model as an instance, we conducted experiments to verify the effectiveness and efficiency of our proposed approaches.

Datasets We used two binary-classification datasets. The first one is the Adult-9 [275] dataset, which has ∼33K training instances and ∼16K testing instances. The feature dimension is 123. The other dataset is SUN-Building compiled from the SUN [364] dataset, which contains ∼6K building images and 7K non-building images randomly sampled from other categories, where 70% of images were used for training and the rest for testing. We used SIFT-based [236] bag-of- words to represent the images with a dimension of 1000.

Experimental settings To understand the effects of diversification in Bayesian learning, we compared with the following methods: (1) mixture of experts model (MEM) with L2 regular- ization (MEM-L2) where the L2 regularizer is imposed over “experts” independently; (2) MEM

91 K 5 10 20 30 MEM-L2 82.6 83.8 84.3 84.7 MEM-MAR [369] 85.3 86.4 86.6 87.1 BMEM-G 83.4 84.2 84.9 84.9 BMEM-DPP [429] 85.1 85.8 86.5 86.1 BMEM-MAP-I 87.1 88.3 88.6 88.9 BMEM-MAP-II 86.4 87.8 88.1 88.4 BMEM-PR 86.2 87.9 88.7 88.1 Table 3.1: Classification accuracy (%) on the Adult-9 dataset.

K 5 10 20 30 MEM-L2 76.2 78.8 79.4 79.7 MEM-MAR [369] 81.3 82.1 82.7 82.3 BMEM-G 76.5 78.6 80.2 80.4 BMEM-DPP [429] 80.4 81.6 83.7 84.5 BMEM-MAP-I 82.1 83.6 85.3 85.2 BMEM-MAP-II 80.9 82.8 84.9 84.1 BMEM-PR 81.7 84.1 83.8 84.9 Table 3.2: Classification accuracy (%) on the SUN-Building dataset. with diversity-promoting mutual angular regularization [369] (MEM-MAR); (3) Bayesian MEM with a Gaussian prior (BMEM-G) where the “experts” are independently drawn from a Gaussian prior; (4) BMEM with a diversity-promoting determinantal point process [429] prior (BMEM- DPP); (5) BMEM with type-(I,II) mutual angular process priors (BMEM-MAP-I, BMEM-MAP- II); (6) BMEM with posterior regularization (BMEM-PR). BMEM-MAP-I was inferred with MCMC sampling and BMEM-MAP-II was inferred with variational inference. In BMEM- DPP [204], a Metropolis-Hastings sampling algorithm was adopted (variational inference and Gibbs sampling [16] are not applicable). The key parameters involved in the above methods are: (1) the regularization parameter λ in MEM-L2, MEM-MAR, and BMEM-PR; (2) the concen- tration parameter κ in the mutual angular processes and variational distributions; (3) the scale parameter of the radial basis function kernel in DPP. All parameters were tuned using 5-fold cross validation. Besides internal comparison, we also compared with 5 well-established clas- sification methods achieving state of the art performance. They are: (1) kernel support vector machine (KSVM) [58]; (2) (RF) [55]; (3) deep neural network (DNN) [160]; (4) infinite SVM (ISVM) [423].

Results Table 3.1 and 3.2 show the classification accuracy under different number of “experts” on the Adult-9 and SUN-Building dataset respectively. From these two tables, we observe that: (1) MAP-(I,II) outperform Gaussian, demonstrating the effectiveness of promoting diversity in Bayesian models. (2) MAP-(I,II) outperform DPP, showing their better capability in promoting diversity than DPP. (3) Bayesian learning achieves better performance than point estimation, which is manifested by comparing BMEM-G with MEM-L2, and comparing BMEM-MAP-

92 Adult-9 SUN-Building BMEM-DPP 8.2 11.7 BMEM-MAP-I 7.5 10.5 BMEM-MAP-II 2.9 4.1 BMEM-PR 3.3 4.9 Table 3.3: Training time (hours) of different methods with K = 30.

Category ID C18 C17 C12 C14 C22 C34 C23 C32 C16 Number of Documents 5281 4125 1194 741 611 483 262 208 192 BMEM-G Accuracy (%) 87.3 88.5 75.7 70.1 71.6 64.2 55.9 57.4 51.3 BMEM-MAP-I Accuracy (%) 88.1 86.9 74.7 72.2 70.5 67.3 68.9 70.1 65.5 Relative Improvement (%) 1.0 -1.8 -1.3 2.9 -1.6 4.6 18.9 18.1 21.7 Table 3.4: Accuracy on 9 subcategories of the CCAT category in the RCV1.Binary dataset.

(I,II)/BMEM-PR with MEM-MAR. (4) BMEM-MAP-I works better than BMEM-MAP-II and BMEM-PR. The possible reason is that BMEM-MAP-I inferred with MCMC draws samples from the true posterior while BMEM-MAP-II and BMEM-PR inferred with variational inference (VI) seek an approximation of the posterior. However, BMEM-MAP-II is computationally more efficient than BMEM-MAP-I. Table 3.3 compares the time (hours) taken by each method to achieve convergence, with K set to 30. BMEM-MAP-II is more efficient than BMEM-MAP-I, due to the higher efficiency of VI than MCMC. Since VI is not applicable for DPP, BMEM-DPP is inferred using MCMC sampling, hence is less efficient than BMEM-MAP-II. To verify whether our methods can better capture infrequent patterns, from the RCV1 [220] dataset we picked up a subset of documents such that the frequencies of categories have a power- law distribution. Specifically, we chose documents from 9 subcategories (shown in the 1st row of Table 3.4) of the CCAT category as the positive instances, and randomly sampled 15K documents from non-CCAT categories as negative instances. The 2nd row shows the number of documents in each of the 9 categories. The distribution of these document frequencies is in a power-law fashion, where frequent categories (such as C18 and C17) have a lot of documents while infre- quent categories (such as C32 and C16) have a small number of documents. The 3rd and 4th row show the accuracy achieved by BMEM-G and BMEM-MAP-I on each category. The 5th row shows the relative improvement of BMEM-MAP-I over BMEM-G, which is defined as Amap−Ag , Ag where Amap and Ag denote the accuracy achieved by BMEM-MAP-I and BMEM-G respectively. While achieving accuracy comparable to BMEM-G on the frequent categories, BMEM-MAP-I obtains much better performance on the infrequent categories. For example, the relative im- provements on infrequent categories C32 and C16 are 18.1% and 21.7%. This demonstrates that BMEM-MAP-I can effectively capture infrequent patterns. Our proposed diversity-promoting methods can reduce model size without sacrificing mod- eling power. As can be seen from Table 3.1 and 3.2, using a smaller number of components, BMEM-MAP-(I,II) and BMEM-PR achieve accuracy that is comparable to or even better than the non-diversified BMEM-G model. For example, on the Adult-9 dataset (Table 3.2), with 5

93 Adult-9 SUN-Building KSVM [58] 85.2 84.2 RF [55] 87.7 85.1 DNN [160] 87.1 84.8 ISVM [423] 85.8 82.3 BMEM-MAP-I 88.9 85.3 BMEM-MAP-II 88.4 84.9 BMEM-PR 88.7 84.9 Table 3.5: Classification accuracy (%) on two datasets. experts BMEM-MAP-I is able to achieve an accuracy of 87.1%, which cannot be achieved by BMEM-G with even 30 experts. Table 3.5 presents the comparison with state of the art classification methods. Our methods outperform the baselines.

3.1.6 Appendix: Details of Algorithms Variational Inference for Type I MAP In this section, we present details on how to derive the variational lower bound

Eq(A)[log p(D|A)] + Eq(A)[log p(A)] − Eq(A)[log q(A)], (3.15) where the variational distribution q(A) is chosen to be

K QK Q q(A) = k=1 q(a˜k)q(gk) = vMF(a˜k|aˆk, κˆ)Gamma(gk|rk, sk). (3.16) k=1

Among the three expectation terms, Eq(A)[log p(A)] and Eq(A)[log q(A)] are model-independent and we discuss how to compute them in this section. Eq(A)[log p(D|A)] depends on the specific model and a concrete example will be given later on. First we introduce some equalities and inequalities. Let a ∼ vMF(µ, κ), then Ip/2(κ) (I) [a] = Ap(κ)µ where Ap(κ) = , and Iv(·) denotes the modified Bessel function of E Ip/2−1(κ) the first kind at order v. (II) cov(a) = h(κ) I + (1 − 2 ν+1 h(κ) − h2(κ))µµT , where h(κ) = Iν+1(κ) and ν = p/2 − 1. κ κ Iν (κ) Please refer to [15] for the derivation of E[a] and cov(a). T 2 T (III) E[a a] = tr(cov(a)) + Ap(κ)µ µ.

Proof E[tr(aT a)] = E[tr(aaT )] = tr(E[aaT ]) = tr(cov(a) + E[a]E[a]T ) = tr(cov(a)) + tr(E[a]E[a]T ) (3.17) 2 T = tr(cov(a)) + Ap(κ)µ µ Let g ∼ Gamma(α, β), then α (IV) E[g] = β .

94 (V) E[log g] = ψ(α) − log β. PK PK R R (VI) log k=1 exp(xk) ≤ γ + k=1 log(1 + exp(xk − γ)) and log exp(x)dx ≤ γ + log(1 + exp(x − γ))dx, where γ is a variational parameter. See [50] for the proof. −x −ξ x−ξ 1/2−g(ξ) 2 2 x ξ x−ξ (VII) log(1 + e ) ≤ log(1 + e ) − 2 − 2ξ (x − ξ ), log(1 + e ) ≤ log(1 + e ) + 2 − 1/2−g(ξ) 2 2 2ξ (x − ξ ), where ξ is a variational parameter and g(ξ) = 1/(1 + exp(−ξ)). See [50] for the proof. R 2π(p+1)/2 (VIII) 1dy = p+1 , which is the surface area of the p-dimensional unit sphere. Γ(·) is kyk2=1 Γ( 2 ) the gamma function. (IX)R xT ydy = 0, which can be shown according to the symmetry of the unit sphere. kyk2=1 R T 2 2 2π(p+1)/2 (X) (x y) dy ≤ kxk p+1 . kyk2=1 2 Γ( 2 )

Proof R (xT y)2dy kyk2=1 = kxk2 R (( x )T y)2dy 2 kyk2=1 kxk2 = kxk2 R (eT y)2dy 2 kyk2=1 1 (according to the symmetry of unit sphere) (3.18) ≤ kxk2 R 1dy 2 kyk2=1 2 2π(p+1)/2 = kxk2 p+1 Γ( 2 )

Given these equalities and inequalities, we can upper bound log Zi and use this upper bound to lower-bound Eq(A)[log p(A)]

Eq(A)[log p(A)] QK i−1 QK = Eq(A)[log p(˜a1) i=2 p(˜ai|{˜aj}j=1) i=1 q(gi)] K i−1 K α α −1 exp(κ(− P ˜a )·˜a ) 1 1 −giα2 Q j=1 j i Q α2 gi e = q(A)[log p(˜a1) ] E Zi Γ(α1) i=2 i=1 T PK Pi−1 −ξi ξi−γi ≥ κµ0 Eq(˜a1)[˜a1] + i=2(−κ j=1 Eq(˜aj )[˜aj] · Eq(˜ai)[˜ai] − γi − (log(1 + e ) + 2 (p+1)/2 (p+1)/2 1/2−g(ξi) 2 2 2π 1/2−g(ξi) 2 Pi−1 2 2π + (ξ − γ )) p+1 + κ q(A)[k ˜ajk ] p+1 ) + K(α1 log α2 2ξi i i 2ξi E j=1 2 Γ( 2 ) Γ( 2 ) K P − log Γ(α1)) + (α1 − 1)Eq(gi)[log gi] − α2Eq(gi)[gi] + const i=1 T PK 2 Pi−1 −ξi ξi−γi ≥ κAp(ˆκ)µ0 ˆa1 + i=2(−κAp(ˆκ) j=1 ˆaj · ˆai − γi − (log(1 + e ) + 2 (p+1)/2 1/2−g(ξi) 2 2 2π 1/2−g(ξi) 2 2 Pi−1 Pi−1 Pi−1 + (ξ − γ )) p+1 + κ (A (ˆκ) ˆaj · ˆak + (tr(Λj) 2ξi i i 2ξi p j=1 k6=j j=1 Γ( 2 ) (p+1)/2 2 T 2π PK ri +A (ˆκ)ˆa ˆaj)) p+1 ) + K(α1 log α2 − log Γ(α1)) + (α1 − 1)(ψ(ri) − log(si)) − α2 + const, p j i=1 si Γ( 2 ) (3.19) h(ˆκ) ν+1 2 T where Λj = κˆ I + (1 − 2 κˆ h(ˆκ) − h (ˆκ))ˆajˆaj . The other expectation term Eq(A)[log q(A)] can be computed as

Eq(A)[log q(A)] K Q = Eq(A)[log vMF(a˜k|aˆk, κˆ)Gamma(gk|rk, sk)] k=1 (3.20) K P 2 = κAˆ p(ˆκ)kαˆkk2 + rk log sk − log Γ(rk) + (rk − 1)(ψ(rk) − log(sk)) − rk. k=1

95 Variational Inference for BMEM-MAP-II In this section, we discuss how to derive the variational lower bound for BMEM-MAP-II. The K K N latent variables are {βk}k=1, {ηk}k=1, and {zn}n=1. The joint distribution of all variables is

K K N p({βk}k=1, {ηk}k=1, {xn, yn, zn}n=1) N N N K N N K K K = p({yn}n=1|{xn}n=1, {zn}n=1, {βk}k=1)p({zn}n=1|{xn}n=1, {ηk}k=1)p({βk}k=1)p({ηk}k=1). (3.21) To perform variational inference, we employ a mean field variational distribution

K K N Q = q({βk}k=1, {ηk}k=1, {zn}n=1) K Q QN = q(βk)q(ηk) n=1 q(zn) k=1 K N Q ˜ ˆ Q = vMF(βk|βk, κˆ)Gamma(gk|rk, sk)vMF(η˜k|ηˆk, κˆ)Gamma(hk|tk, uk) q(zn|φn). k=1 n=1 (3.22) Accordingly, the variational lower bound is

K K N K K N EQ[log p({βk}k=1, {ηk}k=1, {xn, yn, zn}n=1)] − EQ[log q({βk}k=1, {ηk}k=1, {zn}n=1)] N N N K N N K = EQ[log p({yn}n=1|{xn}n=1, {zn}n=1, {βk}k=1)] + EQ[log p({zn}n=1|{xn}n=1, {ηk}k=1)] K K K K +EQ[log p({βk}k=1)]) + EQ[log p({ηk}k=1)]) − EQ[log q({βk}k=1)] − EQ[log q({ηk}k=1)] N −EQ[log q({zn}n=1)], (3.23) K K where EQ[log p({βk}k=1)] and EQ[log p({ηk}k=1)] can be lower bounded in a similar way as that K K in Eq.(3.19). EQ[log q({βk}k=1)] and EQ[log q({ηk}k=1)] can be computed in a similar manner as that in Eq.(3.20). Next we discuss how to compute the remaining expectation terms.

N K N N K N Compute EQ[log p({zn}n=1|{ηk}k=1, {xn}n=1)] First, p({zn}n=1|{ηk}k=1, {xn}n=1) is defined as N K N p({zn}n=1|{ηk}k=1, {xn}n=1) N Q K = p(zn|xn, {ηk}k=1) n=1 (3.24) K Q T znk N [exp(ηk xn)] = Q k=1 . PK exp(ηTx ) n=1 j=1 j n

96 N K N log p({zn}n=1|{ηk}k=1, {xn}n=1) can be lower bounded as

N K N log p({zn}n=1|{ηk}k=1, {xn}n=1) N K P P T PK T = znkηk xn − log( j=1 exp(ηj xn)) n=1 k=1 N K P P T PK T = znkhkη˜k xn − log( j=1 exp(ηj xn)) n=1 k=1 N K P P T PK T ≥ znkhkη˜k xn − cn − j=1 log(1 + exp(ηj xn − cn))(Using Inequality VI) n=1 k=1 N K T cn−η xn−dnj P P T PK −dnj j 1/2−g(dnj ) T 2 2 ≥ znkhkη˜ xn − cn − [log(1 + e ) − − ((η xn − cn) − d )] k j=1 2 2dnj j nj n=1 k=1 (Using Inequality VII) (3.25) N K N The expectation of log p({zn}n=1|{ηk}k=1, {xn}n=1) can be lower bounded as

N K N E[log p({zn}n=1|{ηk}k=1, {xn}n=1)] N K P P tk T PK −dnj tj T = Ap(ˆκ) φnk ηˆ xn − cn − [log(1 + e ) − (cn − Ap(ˆκ) ηˆ xn − dnj)/2 uk k j=1 uj j n=1 k=1 2 1/2−g(dnj ) tj +tj T T tj T 2 2 − ( 2 E[η˜j xnxn η˜j] − 2cnAp(ˆκ) ηˆj xn + cn − dnj)], 2dnj uj uj (3.26) where T T E[η˜k xnxn η˜k] T T = E[tr(η˜k xnxn η˜k)] T T = E[tr(xnxn η˜kη˜k )] (3.27) T T = tr(xnxn E[η˜kη˜k ]) T T = tr(xnxn (E[η˜k]E[η˜k] + cov(η˜k))).

N K N N K N Compute E[log p({yn}n=1|{βk}k=1, {zn}n=1)] p({yn}n=1|{βk}k=1, {zn}n=1) is defined as

N K N p({yn}n=1|{βk}k=1, {zn}n=1) N Q K = p(yn|zn, {βk}k=1) n=1 (3.28) N Q 1 = K . Q T znk n=1 [1+exp(−(2yn−1)βk xn)] k=1

N K N log p({yn}n=1|{βk}k=1, {zn}n=1) can be lower bounded by

N K N log p({yn}n=1|{βk}k=1, {zn}n=1) N K P P T = − znk log(1 + exp(−(2yn − 1)βk xn)) n=1 k=1 N K P P −enk T 1/2−g(enk) T 2 2 ≥ znk[− log(1 + e ) + ((2yn − 1)β xn − enk)/2 + ((β xn) − e )]. k 2enk k nk n=1 k=1 (3.29)

97 N K N E[log p({yn}n=1|{βk}k=1, {zn}n=1)] can be lower bounded by

N K N E[log p({yn}n=1|{βk}k=1, {zn}n=1)] N K r +r2 P P −enk rk ˆT 1/2−σ(enk) k k ˜T T ˜ 2 ≥ φnk[− log(1 + e ) + (Ap(ˆκ) β xn − enk)/2 + ( 2 [β xnx βk] − e )] sk k 2enk s E k n nk n=1 k=1 k (Using Inequality VII) (3.30) ˜T T ˜ where E[βk xnxn βk] can be computed in a similar way as that in Eq.(3.27).

Compute E[log q(zi)] K X E[log q(zi)] = φik log φik. (3.31) k=1 In the end, we can get a lower bound of the variational lower bound, then learn all the pa- rameters by optimizing the lower bound via coordinate ascent. In each iteration, we select one parameter x and fix all other parameters, which leads to a sub-problem defined over x. Then we optimize the sub-problem w.r.t x. For some parameters, the optimal solution of the sub-problem is in closed form. If not the case, we optimize x using the gradient ascent method. This pro- cess iterates until convergence. We omit the detailed derivation here since it only involves basic algebra and calculus, which can be done straightforwardly.

MCMC Sampling for Type-I MAP The type-I MAP prior is not amenable for variational inference. Instead, we use an alternative approximation inference method — Markov chain Monte Carlo (MCMC) [127] sampling, which draws samples directly from the true posterior distribution and uses the samples to represent the posterior. Specifically we choose the Metropolis-Hastings (MH) algorithm [127] which gener- ates samples from an adaptive proposal distribution, computes acceptance probabilities based on the unnormalized true posterior, and uses the acceptance probabilities to decide whether a sam- ple should be accepted or rejected. The most commonly used proposal distribution is based on the random walk: the newly proposed sample t + 1 comes from a random perturbation around K K the previous sample t. For the directional variables {˜ai}i=1 and magnitude variables {gi}i=1, we define the proposal distributions to be a von Mises-Fisher distribution and a normal distribution respectively: (t+1) (t)  (t+1) (t) q(˜ai |˜ai ) = Cp(ˆκ) exp κˆ˜ai · ˜ai   (t+1) (t) (g(t+1)−g(t))2 (3.32) q(g |g ) = √1 exp − i i . i i σ 2π 2σ2

(t+1) gi is required to be positive, but the Gaussian distribution may generate non-positive samples. To address this problem, we adopt a truncated sampler [360] which repeatedly draws samples until a positive value is obtained. Under such a truncated sampling scheme, the MH acceptance ratio needs to be modified accordingly. The MAP is parameterized by several deterministic parameters including κ, µ0, α1, α2. Among them, we tune κ manually via cross validation and learn the others via an expectation

98 maximization (EM) framework. Let x denote the observed data, z denote all random variables, and θ denote the deterministic parameters {µ0, α1, α2}. EM is an algorithm aiming at learning θ by maximizing the log-likelihood p(x; θ) of data. It iteratively performs two steps until con- vergence. In the E step, θ is fixed and the posterior p(z|x) is inferred using the aforementioned Metropolis-Hastings algorithm. In the M step, θ is learned by optimizing a lower bound of the log-likelihood Ep(z|x)[log p(x, z; θ)], where the expectation is computed w.r.t the posterior p(z|x) inferred in the E step. The parameters κˆ and σ in the proposal distributions are tuned manually.

Algorithm for Posterior-Regularized BMEM In this section, we present the algorithm details of the posterior-regularized BMEM. Recall that the problem is

N supq(B,H,z) Eq(B,H,z)[log p({yi}i=1, z|B, H)π(B, H)] − Eq(B,H,z)[log q(B, H, z)] K K (3.33) +λ Ω({ ˜ [β˜ ]} ) + λ Ω({ [η˜ ]} ), 1 Eq(βk) k k=1 2 Eq(η˜k) k k=1

K K N where B = {βk}k=1, H = {ηk}k=1, and z = {zi}i=1 are latent variables and the post-data distribution over them is defined as q(B, H, z) = q(B)q(H)q(z). For computational tractability, QK ˜ QK we define q(B) and q(H) to be: q(B) = k=1 q(βk)q(gk) and q(H) = k=1 q(η˜k)q(hk) where ˜ q(βk), q(η˜k) are von-Mises Fisher distributions and q(gk), q(hk) are gamma distributions, and QN define q(z) to be multinomial distributions: q(z) = i=1 q(zi|φi) where φi is a multinomial QK ˜ vector. The priors over B and H are specified to be: π(B) = k=1 p(βk)p(gk) and π(H) = QK ˜ k=1 p(η˜k)p(hk) where p(βk), p(η˜k) are von-Mises Fisher distributions and p(gk), p(hk) are gamma distributions. The objective in Eq.(3.33) can be further written as

N K [log p({y } , z|B, H)π(B, H)] − [log q(B, H, z)] + λ Ω({ ˜ [β˜ ]} ) Eq(B,H,z) i i=1 Eq(B,H,z) 1 Eq(βk) k k=1 K +λ2Ω({Eq(η˜k)[η˜k]}k=1) N = Eq(B,z)[log p({yi}i=1|z, B)] + Eq(H,z)[log p(z|H)] + Eq(H)[log π(H)] + Eq(B)[log π(B)] K − [log q(B)] − [log q(H)] − [log q(z)] + λ Ω({ ˜ [β˜ ]} ) Eq(B) Eq(H) Eq(z) 1 Eq(βk) k k=1 K +λ2Ω({Eq(η˜k)[η˜k]}k=1). (3.34) N Among these expectation terms, Eq(B,z)[log p({yi}i=1|z, B)] can be computed via Eq.(3.28-3.30) and Eq(H,z)[log p(z|H)] can be computed via Eq.(3.24-3.27). Eq(H)[log π(H)], Eq(B)[log π(B)], Eq(B)[log q(B)], Eq(H)[log q(H)] can be computed in a way similar to that in Eq.(3.20). Eq(z)[log q(z)] can be computed via Eq.(3.31). Given all these expectations, we can get an analytical expression of the objective in Eq.(3.33) and learn the parameters by optimizing this objective. Regarding K K how to optimize the mutual angular regularizers Ω({ ˜ [β˜ ]} ) and Ω({ [η˜ ]} ), Eq(βk) k k=1 Eq(η˜k) k k=1 please refer to [369] for details.

99 3.2 Diversity-promoting Learning of Bayesian Nonparamet- ric Models

In the last section, we study how to promote diversity among a finite number of components in parametric models. In this section, we investigate how to achieve this goal in Bayesian nonpara- metric (BNP) [111] models where the component number is infinite in principle, by extending the mutual angular process (Section 3.1.1) to an infinite MAP (IMAP). Different from parametric models where the component number is set to an finite value and does not change throughout the entire execution of algorithm, in BNP models the number of components is unlimited and can reach infinite in principle. As more data accumulates, new components are dynamically added. Compared with parametric models, BNP models possess the following advantages: (1) they are highly flexible and adaptive: if new data cannot be well modeled by existing components, new components are automatically invoked; (2) in BNP mod- els, the “best” number of components is determined according to the fitness to data, rather than being manually set which is a challenging task even for domain experts. A BNP model consists of an infinite number of components, each parameterized by a vector. For example, in the Dirichlet process Gaussian mixture model (DP-GMM) [48, 286], the compo- nents are called clusters, each parameterized with a Gaussian mean vector. In the Indian buffet process latent feature model (IBP-LFM) [143], the components are called features, each parame- terized by a weight vector. Given infinitely many components, BNP models design some proper mechanism to select one or a finite subset of components to model each observed data example. For example, in DP-GMM, a Chinese restaurant process (CRP) [18] is designed to assign each data example to one of the infinitely many clusters. In IBP-LFM, an Indian buffet process (IBP) [143] is utilized to select a finite set of features from the infinite feature pool to reconstruct each data example. A BNP model typically consists of two priors. One is a base distribution from which the parameter vectors of components are drawn. The other is a stochastic process – such as CRP and IBP – which designates how to select components to model data. The IMAP prior proposed in this section belongs to the first regime. It is commonly assumed that the parameter vectors of components are independently drawn from the same base distribution. For example, in both DP-GMM and IBP-LFM, the mean vectors and weight vectors are independently drawn from a Gaussian distribution. In our work, we aim at designing a prior that encourages the com- ponent vectors to be mutually different and “diverse”, under which the component vectors are not independent anymore, which presents great challenges for posterior inference. To “diversify” BNP models, we extend the mutual angular process to an infinite MAP (IMAP) that encourages infinitely many components to have large angles. In this prior, the components are mutually dependent, which is challenging to perform posterior inference. We develop an efficient posterior inference algorithm based on slice sampling [329] and Riemann manifold Hamiltonian Monte Carlo [129]. We apply the IMAP to induce diversity in the infinite latent feature model (ILFM) [142] and experiments on various datasets demonstrate that the IMAP is able to (1) achieve better performance with fewer components, (2) better capture infrequent patterns, and (3) reduce overfitting.

100 Related works Several works studied how to promote diversity in Bayesian nonparametric models. In Xu et al. [386], the component number is treated as a random variable. To obtain a unknown number of “diverse” components, they first sample a component number K from a prior, then generate K “diverse” component vectors using DPP. They apply this approach to “diversify” the cluster centers in the Dirichlet process mixture model (DPMM) [111] and feature allocation vectors in the infinite latent feature model (ILFM). However, it is unclear how to apply this method to “diversify” the weight vectors of the latent features in ILFM when it is coupled with the Indian buffet process [142]. They use a reversible jump MCMC algorithm for posterior inference, which is not scalable for high-dimensional problems. Xie and Xu [366] proposed a Bayesian repulsive Gaussian mixture model where the cluster centers in DPMM are encouraged to be “diverse”. Their approach for inducing diversity is not applicable to the ILFM model.

3.2.1 Infinite Mutual Angular Process In this section, we aim at designing a Bayesian nonparametric prior to promote diversity among infinitely many components, which is very challenging and presumably much more difficult than inducing diversity among a finite set of components. Taking the determinantal point process (DPP) [204] as example, in the finite case, a kernel matrix is computed over the component vectors and the determinant of this matrix is employed as a measure of diversity. In the infinite case, the kernel matrix would have an infinite number of rows and columns, whose determinant cannot be defined. One possible reason that renders DPP fails to be extended to the infinite case is that DPP measures the diversity of components in a batch mode: the similarity between each pair of com- ponents is measured in one shot, then these similarity scores is aggregated into a diversity score (via the determinant). In the infinite case, the number of similarity scores is infinite. How to aggregate infinitely many scores into one single score and meanwhile guarantee the prior distri- bution derived thereafter is proper (integrating to one) is a highly challenging issue. As discussed in Section 3.1.1, one way to solve this problem is to bring in diversity in a sequential mode: each time we add one component to the component pool and encourage the new component to be different from the existing ones. The number of existing components is always finite, hence the (dis)similarity measurement always occurs between one component (the new one) and a fi- nite set of components (the existing ones), which stands in contrast with the batch mode where each component is measured against infinitely many components. This adding process can be repeated infinitely many times, providing a natural way to accommodate an infinite number of components into the BNP models. Similar to the mutual angular process in Section 3.1.1, we use a Bayesian network (BN) [198] to model the adding process of components and design local probability at each node of the BN to encourage the components to have large mutual angles. Repeating the adding process infinitely many times, we end up with a prior that encourages an infinite number of components to have large mutual angles

∞ ∞ Y p({eai}i=1) = p(ea1) p(eai|pa(eai)), (3.35) i=2

101 ∞ where {eai}i=1 are the directional vectors of components. The factorization theorem [198] of ∞ ∞ Bayesian networks ensures that p({eai}i=1) integrates to one. The magnitudes {gi}i=1 do not affect angles, which can be generated independently from a gamma distribution. ∞ The generative process of components {ai}i=1 can be summarized as follows: • Sample ea1 ∼ vMF(µ0, κ) Pi−1 j=1 aej • For i = 2, ··· , ∞, sample ai ∼ vMF(− Pi−1 , κ) e k j=1 aej k2 • For i = 1, ··· , ∞, sample gi ∼ Gamma(α1, α2) • For i = 1, ··· , ∞, ai = eaigi ∞ The probability distribution of {ai}i=1 can be written as !  i−1 > α α −1 P a 1 1 −giα2 ∞ > Q∞ j=1 ej Q∞ α2 gi e p({ai}i=1) = Cp(κ) exp(κµ0 a1) i=2 Cp(κ) exp κ − Pi−1 ai i=1 Γ(α ) . e k j=1 aej k2 e 1 (3.36)

3.2.2 Case Study: Infinite Latent Feature Model In this section, using the infinite latent feature model (ILFM) [142] as a study case, we show how to promote diversity among the components therein with the IMAP prior. Given a set of data N D examples X = {xn}n=1 where xn ∈ R , ILFM aims at invoking a finite subset of features from ∞ an infinite feature collection A = {ak}k=1 to construct these data examples. Each feature (which D is a component in this BNP model) is parameterized by a vector ak ∈ R . For each data example xn, a subset of features are selected to construct it. The selection is denoted by a binary vector ∞ zn ∈ {0, 1} where znk = 1 denotes the k-th feature is invoked to construct the n-th example and ∞ znk = 0 otherwise. Given the parameter vectors of features {ak}k=1 and the selection vector zn, P∞ 2 the example xn can be represented as: xn ∼ N ( k=1 znkak, σ I). The binary selection vectors N Z = {zn}n=1 can be either drawn from an Indian buffet process (IBP) [143] or a stick-breaking construction [329]. Let µk be the prior probability that feature k is present in a data example and the features are permuted such that their prior probabilities are in a decreasing ordering: µ(1) > µ(2) > ··· . According to the stick-breaking construction, these prior probabilities can Qk be generated in the following way: νk ∼ Beta(α, 1), µ(k) = νkµ(k−1) = l=1 νl. Given µ(k), the binary indicator znk is generated as znk|µ(k) ∼ Bernoulli(µ(k)). To reduce the redundancy among the features, we impose the IMAP over their parameter vectors A to encourage them to be mutually different, which results in an IMAP-LFM model.

3.2.3 A Sampling-based Inference Algorithm In this section, we develop a sampling algorithm to infer the posteriors of A and Z in the IMAP- LFM model. Two major challenges need to be addressed. First, the prior over A is not conjugate to the likelihood function p(x). Second, the parameter vectors A are usually high-dimensional, rendering slow mixing. To address the first challenge, we adopt a slicing sampling algorithm ∗ [329]. This algorithm introduces an auxiliary slice variable s|Z, µ(1:∞) ∼ uniform[0, µ ], where ∗ µ = min{1, min µk} is the prior probability of the last active feature. A feature k is active k:∃n,znk=1

102 Algorithm 2: A manifold Hamiltonian Monte Carlo algorithm for sampling ea. 01. v ∼ N (0, I) > 02. v ← v − (I − eaea )v h ← log p(a|rest) − 1 v>v 03. e 2 ∗ 04. ea ← ea 05. for τ = 1, ··· ,T do  ∗ 06. v ← v + ∇ ∗ log p(a |rest) 2 ae e > 07. v ← v − eaea v ∗ ∗ −1 08. ea ← cos(kvk2)ea + kvk2 sin(kvk2)v ∗ 09. v ← −kvk2 sin(kvk2)ea + cos(kvk2)v  ∗ 10. v ← v + ∇ ∗ log p(a |rest) 2 ae e > 11. v ← v − eaea v 12. end for h∗ ← log p(a∗|rest) − 1 v>v 13. e 2 14. u ∼ uniform(0, 1) 15. if u < exp(h∗ − h) ∗ 16. ea ← ea 17. end if

if there exists an example n such that znk = 1 and is inactive otherwise. In the sequel, we discuss the sampling of other variables.

∗ + Sample new features Let K be the maximal feature index with µ(K∗) > s and K be the index such that all active features have index k < K+ (K+ itself would be an inactive feature). If the new value of s makes K∗ ≥ K+, then we draw K∗ − K+ + 1 new (inactive) features, including the parameter vectors and prior probabilities. The prior probabilities {µ(k)} are drawn PN 1 n α−1 N sequentially from p(µ(k)|µ(k−1)) ∝ exp(α n=1 n (1 − µ(k)) )µ(k) (1 − µ(k)) I(0 ≤ µ(k) ≤ µ(k−1)) using the adaptive rejection sampling (ARS) [128] method. The parameter vectors are drawn sequentially from !  k−1 > α α −1 P a 1 1 −giα2 k−1 k−1 j=1 ej α2 gi e p(ak|{aj}j=1 ) = p(ak|{aj}j=1 )p(gk) = Cp(κ) exp κ − Pk−1 ak Γ(α ) , e e k j=1 aej k2 e 1 (3.37) k−1 where we draw eak from p(eak|{eaj}j=1 ) which is a von Mises-Fisher distribution and draw gk from a gamma distribution, then multiply eak and gk together since they are independent. For each new feature k, the corresponding binary selection variables z:,k are initialized to zero.

+ mk−1 Sample existing µ(k) (1 ≤ k ≤ K − 1) We sample µ(k) from p(µ(k)|rest) ∝ µ(k) (1 − N−mk PN µ(k)) I(µ(k+1) ≤ µ(k) ≤ µ(k−1)), where mk = n=1 znk.

∗ ∗ Sample znk (1 ≤ n ≤ N, 1 ≤ k ≤ K ) Given s, we only need to sample znk for k ≤ K from µ(k) K+ K+ p(znk = 1|rest) ∝ µ∗ p(xn|zn,¬k, znk = 1, {aj}j=1), where p(xn|zn,¬k, znk = 1, {aj}j=1) =

103 Dataset IBP-LFM PYP-LFM IMAP-LFM Yale 447±7 432±3 419±4 Block (×10−2) 6.3±0.4 5.8±0.1 4.4±0.2 AR 926±4 939±11 871±7 EEG (+2.1 × 106) 5382±34 3731±15 575±21 Piano (×10−4) 5.3±0.1 5.7±0.2 4.2±0.2 Table 3.6: L2 test error. Dataset IBP-LFM PYP-LFM IMAP-LFM Yale -16.4±0.3 -14.9 ±0.4 -12.7±0.1 Block-Image -2.1±0.2 -1.8±0.1 -1.4±0.1 AR -13.9±0.3 -14.6±0.7 -8.5±0.4 EEG -14133±54 -12893 ±73 -9735±32 Piano -6.8±0.6 -6.9±0.2 -4.2±0.5 Table 3.7: Test log-likelihood.

PK+ N (xn|ak + j6=k znjaj, σI) and zn,¬k denotes all other elements in z except the k-th one and aj = eajgj.

+ Sample ak (k = 1, ··· ,K ) We draw ak = eakgk from the following conditional probability

N K+ Q K+ p(eakgk|rest) ∝ p(eakgk|{aj}j6=k) p(xn|zn,1:K+ , {aj}j6=k, eakgk), (3.38) n=1

K+ k−1 QK+ j−1 K+ where p(eakgk|{aj}j6=k) ∝ p(eakgk|{ai}i=1 ) j=k+1 p(aj|{ai}i6=k , eakgk) and p(xn|zn,1:K+ , {aj}j6=k, PK+ eakgk) = N (xn|eakgk + j6=k znjaj, σI). In the vanilla IBP latent feature model [143], the prior over ak is a Gaussian distribution, which is conjugate to the Gaussian likelihood function. In that case, the posterior p(ak|rest) is again a Gaussian, from which samples can be easily drawn. But K+ in Eq.(3.38), the posterior does not have a closed form expression since the prior p(eakgk|{aj}j6=k) is no longer a conjugate prior, making the sampling very challenging. We sample eak and gk separately. gk can be efficiently sampled using the Metropolis-Hastings (MH) [152] algorithm. For eak which is a random vector, the sampling is much more diffi- cult. The random walk based MH algorithm suffers slow mixing when the dimension of eak is large (which is typically the case). In addition, eak lies on a unit sphere. The sampling algorithm should preserve this geometric constraint. To address these two issues, we study a Riemann manifold Hamiltonian Monte Carlo (RM-HMC) method [59, 129]. HMC lever- ages the Hamiltonian dynamics to produce distant proposals for the Metropolis-Hastings al- gorithm, enabling a faster exploration of the state space and faster mixing. The RM-HMC algorithm introduces an auxiliary vector v and defines a Hamiltonian function H(eak, v) = − log p(a | ) + log |G(a )| + 1 v>G(a )−1v G ek rest ek 2 ek , where is the metric tensor associated with the Riemann manifold, which in our case is a unit sphere. After a transformation of the coordi-

104 Dataset IBP-LFM PYP-LFM IMAP-LFM Reuters 45.4±0.3 43.1±0.4 48.2±0.6 TDT 48.3±0.7 47.5±0.3 53.2±0.4 20-News 21.5±0.1 23.7±0.2 25.2±0.1 15-Scenes 22.7±0.2 21.9±0.4 25.3±0.2 Caltech-101 11.6±0.4 12.1±0.1 14.7±0.2 Table 3.8: Clustering accuracy (%).

Dataset IBP-LFM PYP-LFM IMAP-LFM Reuters 41.7±0.5 38.6±0.2 45.4±0.4 TDT 44.2±0.1 46.7±0.3 49.6±0.6 20-News 38.9±0.8 35.2±0.5 44.6±0.9 15-Scenes 42.1±0.7 44.9±0.8 47.5±0.2 Caltech-101 34.2±0.4 36.8±0.4 40.3±0.3 Table 3.9: Normalized mutual information (%). nate system, H(eak, v) can be re-written as 1 H(a , v) = − log p(a|rest) + v>v. (3.39) ek e 2

Pk−1 a + j=1 ej > PK p(ak|rest) needs not to be normalized and log p(ak|rest) ∝ κ(− Pk−1 ) ak + j=k+1 e e k j=1 aej k2 e j−1 P a +a g + i6=k ei ek k > PN 1 PK > 1 2 κ(− Pj−1 ) aj + n=1 σ (xn − j6=k znjaj) akgk − 2σ kakgkk2. A new sample of k i6=k aei+aekgkk2 e e e eak can be generated by approximately solving a system of differential equations characterizing the Hamiltonian dynamics on the manifold [129]. Following [59], we solve this problem based upon geodesic flow, which is shown in Line 6-11 in Algorithm 2. Line 6 performs an update ∗ of v according to the Hamiltonian dynamics, where ∇ ∗ log p(a |rest) denotes the gradient of ae e ∗ ∗ log p(ea |rest) w.r.t ea . Line 7 performs the transformation of the coordinate system. Line 8-9 calculate the geodesic flow on the unit sphere. Line 10-11 repeat the update of v as done in ∗ Line 6-7. These procedures are repeated T times to generate a new sample ea , which then goes through an acceptance/rejection procedure (Line 3, 13-17).

3.2.4 Evaluation We evaluated the effectiveness of the IMAP prior in alleviate overfitting, reducing model size without sacrificing modeling power, and capturing infrequent patterns, on a wide range of datasets.

Datasets We used ten datasets from different domains including texts, images, audio, and EEG signals: Yale [125], Block-Images [361], AR [249], EEG [165], Piano [276], Reuters [10], TDT [4], 20-News [1], 15-Scenes [212], and Caltech-101 [110]. The first five datasets were represented as raw data without feature extraction. The documents in Reuters, TDT, and 20- News were represented with bag-of-words vectors, weighted using tf-idf. The images in 15-

105 Yale AR Method Train Test Train Test IBP-LFM [142] -9.2 -16.4 -6.7 -13.9 IMAP-LFM -9.8 -12.7 -7.2 -8.5 Table 3.10: Log-likelihood on the training and testing sets of Yale and AR.

Yale AR # Features L2 Error # Features L2 Error IBP-LFM [142] 162 445 175 922 IMAP-LFM 165 419 176 871 Table 3.11: Number of latent features and L2 test errors.

Scenes [212] and Caltech-101 [110] were represented with visual bag-of-words vectors based on the SIFT [237] features. The train/test split of each dataset was 70%/30%. The results were averaged over five random splits.

Experimental setup For each dataset, we used the IMAP-LFM model to learn the latent fea- tures A on the training set, then used A to reconstruct the test data. The reconstruction per- formance was measured using log-likelihood (the larger, the better). Meanwhile, we used A to infer the representations Z of test data and performed data clustering on Z. Clustering perfor- mance was measured using accuracy and normalized mutual information (NMI) (the higher, the better) [61]. We compared with two baselines: the Indian buffet process LFM (IBP-LFM) [142] and the Pitman-Yor process LFM (PYP-LFM) [327]. In IBP-LFM, posterior inference was per- formed with the slice sampling algorithm [329]. For PYP-LFM inference, we used the algorithm proposed in [327]. To attain a better local optimal, for each experiment run, we performed 5 random restarts and chose the best solution. The initialization for all models were the same. Following [105], all datasets were centered to have zero mean. We used grid search and 5-fold cross-validation to tune the hyper-parameters for all methods. In our method, κ in Eq.(3.36) was set to 1. σ2 was set to 0.25ˆσ where σˆ is the standard deviation of data across all dimensions. α was set to 2. The hyper-parameters of the RM-HMC algorithm followed those in [59].

Results We first verify whether the features learned by IMAP-LFM are diverse. We computed the average of the cosine similarities between each pair of latent features. Features with smaller average cosine similarity (ACS) are considered to be more diverse. On the Reuters dataset, the ACS for IBP-LFM, PYP-LFM, and IMAP-LFM are 0.72, 0.69, 0.61, respectively. IMAP- LFM also achieves smaller ACS on other datasets. This shows that IMAP effectively promotes diversity. Table 3.6 and 3.7 present the L2 error and log-likelihood (mean±standard error) on the test set of the first five datasets. The standard errors were computed from the 5 random train/test splits for each dataset. As can be seen, IMAP-LFM achieves much lower L2 error and higher log-likelihood than IBP-LFM and PYP-LFM. Table 3.8 and 3.9 show the clustering accuracy and NMI (mean±standard error) on the last 5 datasets which have class labels. IMAP-LFM

106 Dataset IBP-LFM [142] PYP-LFM [327] IMAP-LFM Yale 201±5 220±8 165±4 Block-Image 8±2 9±4 11±4 AR 257±11 193±5 176±8 EEG 14±2 9±2 12±1 Piano 37±4 34±6 28±3 Reuters 354±12 326±5 294±7 TDT 297±6 311±9 274±3 20-News 442±8 408±3 369±5 15-Scenes 192±3 218±5 171±8 Caltech-101 127±7 113±6 96±6 Table 3.12: Number of features. outperforms the two baseline methods with a large margin. Regarding why IMAP-LFM outperforms the baselines, we conjecture the reasons are two- fold. First, IMAP is more capable of alleviating overfitting. To verify this, we show the log- likelihood on the training and testing sets of Yale and AR in Table 3.10. Compared with IBP- LFM, IMAP-LFM achieves lower training log-likelihood and higher test log-likelihood. In IMAP-LFM, the gap between training and testing log-likelihood is smaller than that in IBP- LFM. This demonstrates that IMAP can better reduce overfitting. In both IBP-LFM and PYP- LFM, the weight vectors of latent features are drawn independently from a Gaussian distribution, which is unable to characterize the relations among features. In contrast, IMAP-LFM imposes a structure over the features, encouraging them to be “diverse” and less redundant. This diversity- biased structural constraint reduces model complexity of LFM, thus alleviating overfitting on the training data and achieving better reconstruction of the test data. Second, “diversified” features presumably have higher representational power and are able to capture richer information and subtle aspects of data, thus achieving a better modeling effect. The L2 error and log-likelihood depend on two major factors: (1) model size, which is pro- portional to the number of latent features; (2) the level of “diversity” among the latent features. To clearly see the effect of (2), we removed the impact of (1) by tuning the hyperparameters of IBP-LFM and IMAP-LFM so that they have roughly the same feature number. Table 3.11 shows the number of features learned on the training set and the L2 error on the test set of Yale and AR. IMAP-LFM has almost the same feature number as IBP-LFM, but achieves significantly lower L2 error on the test set. This indicates that diversified features are better than non-diversified features. Table 3.12 shows the number of features (mean±standard error) obtained by each model when the algorithm converges. Analyzing Table 3.6-3.9 and 3.12 simultaneously, we see that IMAP-LFM uses much fewer features to achieve better performance than the baselines. For instance, on the Reuters dataset, with 294 features, IMAP-LFM achieves a 48.2% clustering accuracy. In contrast, IBP-LFM uses 60 more features, but achieves 2.8% (absolute) lower accu- racy. This suggests that IMAP is able to reduce the size of LFM (the number of features) without sacrificing modeling power. Because of IMAP’s diversity-promoting mechanism, the learned features bear less redundancy and are highly complementary to each other. Each feature captures

107 Categories 1 2 3 4 5 6 7 8 9 Frequencies of categories 3713 2055 321 298 245 197 142 114 110 Precision@100 (%) of IBP 73.7 45.1 7.5 8.3 6.9 7.2 2.6 3.8 3.4 Precision@100 (%) of IMAP 81.5 78.4 36.2 37.8 29.1 20.4 8.3 13.8 11.6 Relative improvement (%) 11 74 382 355 321 183 219 263 241 Table 3.13: Per-category precision@100 on the Reuters dataset.

IBP-LFM [142] IMAP-LFM Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 government saddam nuclear turkish game president clinton olympic school space house white iraqi soviet gold clinton government team great operation baghdad clinton weapons government team legal lewinsky hockey institute shuttle weapons united united weapons season years nuclear good program research tax time work number april state work baseball study life years president spkr enemy man baghdad minister gold japanese satellite white baghdad president good number church weapons ball office launch united iraq people don work white india medal reading lunar state un baghdad citizens baseball united years april level device bill lewinsky state due years iraqi white winter number program Table 3.14: Visualization of features learned on the 20-News dataset. a significant amount of information. As a result, a small number of such features are sufficient to model data well. In contrast, the features in IBP-LFM and PYP-LFM are drawn independently from a base distribution, which lacks the mechanism to reduce redundancy. IMAP achieves more significant reduction of feature numbers on datasets that have larger feature dimensions. This is possibly because higher-dimensional data contains more redundancy, giving IMAP a larger room to improve. To verify whether IMAP helps to better capture infrequent patterns, on the learned features of the Reuters dataset we performed a retrieval task and measured the precision@100 on each category. For each test document, we retrieved 100 documents from the training set based on the Euclidean distance. Precision@100 is defined as n/100, where n is the number of retrieved documents that share the same category label with the query document. We treat each category as a pattern and define its frequency as the number of documents belonging to it. A category with more than 1000 documents is labeled as being frequent. Table 3.13 shows the per-category precision. The last row shows the relative improvement of IMAP-LFM over IBP-LFM, defined as (Pimap −Pibp)/Pibp. As can be seen, on the infrequent categories 3-9, IMAP-LFM achieves much better precision than IBP-LFM, while on the frequent categories 1 and 2, their performance are comparable. This demonstrates that IMAP is able to better capture infrequent patterns without losing the modeling power on frequent patterns. On the 20-News dataset, we visualized the learned features. For a latent feature with param- eter vector a, we picked up the top 10 words corresponding to the largest values in a. Table 3.14 shows 5 exemplar features learned by IBP-LFM and IMAP-LFM. As can be seen, the features learned by IBP-LFM have much overlap and redundancy and are hard to distinguish, whereas those learned by IMAP-LFM are more diverse. Figure 3.3 shows how the L2 reconstruction error on the Yale test dataset varies as the pa- rameter κ increases. κ controls the level of “diversity”. A larger κ results in more “diversity”.

108 0.01 451 0.1 428 1 419 10 442 100 466

480

460

440

420 L2 L2 Error 400 380 0.01 0.1 1 10 100 Concentration Parameter

Figure 3.3: L2 error on the Yale test set versus the concentration parameter κ.

Block-Images 20-News RW-MH 7.6 93.7 RM-HMC 4.4 58.7 Table 3.15: Runtime (hours) of RM-HMC and RW-MH.

As can be seen from this figure, the L2 error decreases as κ increases from 0.01 to 1, which sug- gests that increasing the “diversity” among latent features is beneficial. This is possibly because the “diversity” effect acts as a type of inductive bias that controls the complexity of the model class, which thereby improves the generalization performance on unseen data. However, further increasing κ from 1 to 100 causes the L2 error to increase. A possible explanation is: if κ is too large, the diversity-promoting inductive bias is excessively strong, which compromises the quality of data-fitting. We compared the convergence speed of the RM-HMC algorithm (Algorithm 2) with a random- walk based Metropolis Hastings (RW-MH) algorithm. The runtime is given in Table 3.15. As can be seen, RM-HMC is much more efficient than RW-MH, since it leverages the gradient informa- tion of p(eak|rest) to search for “better” proposals of eak. Compared with IBP-LFM, IMAP-LFM has an additional overhead of running the RM-HMC algorithm to sample ea. Fortunately, RM- HMC converges very quickly, therefore this extra overhead is not substantial. For example, on the Block-Images dataset, the runtime of IBP-LFM and IMAP-LFM is 3.6 and 4.4 hours, respec- tively. On the 20-News dataset, the runtime of IBP-LFM and IMAP-LFM is 43.3 and 58.1 hours, respectively.

109 Chapter 4

Diversity-promoting Learning III – Analysis

In the previous two chapters, we have demonstrated the effectiveness of the proposed regularizers and Bayesian priors in (1) better capturing infrequent patterns, (2) achieving better generalization performance, and (3) reducing model size without sacrificing modeling power, via empirical studies. In this chapter, we provide theoretical analysis on why these regularizers/priors can achieve such effects.

4.1 Analysis of Better Capturing of Infrequent Patterns

On the distance metric learning model, we analyze why encouraging diversity using the noncon- vex Bregman matrix divergence (NCBMD) regularizers [379] (Section 2.2.1) can better capture infrequent patterns [379]. In the analysis, we first define a score (the lower, the better) to quantify how well the infrequent patterns are captured. Then we upper bound this score using increasing functions of the NCBMD regularizers. Putting the two pieces together, we can conclude that reducing the NCBMD regularizers leads to better capturing of the infrequent patterns. The analysis focuses on the following projection matrix:   ∗ 1 X 2 1 X 2 A = arg min ES,D  kAx − Ayk2 + max(0, τ − kAx − Ayk2) + γΩφ(A) . A |S| |D| (x,y)∈S (x,y)∈D (4.1) We assume the data examples in S and D are from K classes (which are considered as patterns), where class k has a distribution πk and the corresponding expectation is µk. Let pk denote the prior probability of drawing a data example from class k. We assume these classes are K “imbalanced” in the sense that the variance of prior probabilities {pk}k=1 is large. A class is considered as being “frequent” if it has a large p and “infrequent” if otherwise. We use µk to represent class k and define the Mahalanobis distance between two classes j and k as: djk = > ∗> ∗ (µj − µk) A A (µj − µk). An imbalance factor (IF; the lower, the better) is defined to quantify how well the infrequent

110 Figure 4.1: Large and small circles denote frequent and infrequent classes respectively. The objective function of DML tends to make the inter-class distance to be large. (Left) The frequent classes are dominant in the training objective of DML, and hence are given higher priority for being pushed far away from each other. The infrequent classes are paid less attention to. Their distance ends up being small since DML is constrained by inner-class data pairs which need to have small distances. (Right) By promoting diversity, the infrequent classes are paid more attention to and their distance d2 is enlarged. As a result, the ratio between d1 and d2 decreases. classes are captured: max d η = j6=k jk , (4.2) minj6=k djk which is the ratio between the largest and smallest inter-class distance. Figure 4.1 illustrates why η can reflect how well the infrequent classes are captured. For two frequent classes (denoted by large circles), since they have more training examples and hence contributing more in learning ∗ A , DML intends to make their distance (d1) large; whereas for two infrequent classes (denoted by small circles), since they contribute less in learning (and DML is constrained by similar pairs which need to have small distances), their distance (d2) may end up being small. Consequently, if classes are imbalanced, some between-class distances can be large while others small, resulting in a large IF (the ratio between d1 and d2), as shown in Figure 4.1(left). By promoting-diversity, the DML model pays more attention to infrequent classes and tries to make their inter-class distance (d2) large, as shown Figure 4.1(right). As a result, the IF decreases. To sum up, if the infrequent patterns are better captured, the imbalance factor is smaller. Next, we upper bound the IF using increasing functions of the NCBMD regularizers. We > ∗ define ξk = Ex∼πk [supkvk2=1 |v (x − µk)|] and ξ = maxk ξk. Further, we assume A has full rank R (which is the number of the projection vectors), and let UΛU> denote the eigen- ∗ ∗> decomposition of A A , where Λ = diag(λ1, λ2, ··· λR) with λ1 ≥ λ2 ≥ · · · ≥ λR. The basic idea of deriving the upper bounds is as follows. First, we define G = span{µj −µk : j 6= k} and prove that G ⊂ span(A∗>) under certain conditions where span(A∗>) denotes the column space of matrix A∗>. Next, we show that if G ⊂ span(A∗>) holds, we can bound η with the condition number of A∗A∗>. Finally, we bound this condition number with the NCBMD regularizers, including the von Neumann divergence (VND) regularizer and the log-determinant divergence (LDD) regularizer. Putting the pieces together, we obtain NCBMD-based upper bounds of the IF. 2 2 Theorem 1. Let C denote the ratio between maxj6=kkµj − µkk2 and minj6=kkµj − µkk2 and

111 assume maxj,k kµj − µkk2 ≤ B0. Suppose the regularization parameter γ and distance margin K K τ are sufficiently large: γ ≥ γ0 and τ ≥ τ0, where γ0 and τ0 depend on {pk}k=1 and {µk}k=1. If p 2 R ≥ K −1 and ξ ≤ (−B0 + B0 + λK−1βK−1/(2tr(Λ))/4, then we have the following bounds for the IF. ∗ ∗ • For the VND regularizer Ωvnd(A ), if Ωvnd(A ) ≤ 1, the following bound of the IF η holds:

∗ η ≤ Cg(Ωvnd(A )),

where g(·) is an increasing function defined in the following way. Let f(c) = c1/(c+1)(1+1/c), which is strictly increasing on (0, 1] and strictly decreasing on [1, ∞) and let f −1(c) be the inverse function of f(c) on [1, ∞), then g(c) = f −1(2 − c) for c < 1. ∗ • For the LDD regularizer Ωldd(A ), we have

∗ η ≤ 4CeΩldd(A ).

∗ As can be seen, the bounds are increasing functions of the NCBMD regularizers Ωvnd(A ) ∗ and Ωldd(A ). Decreasing these regularizers would reduce the upper bounds of the imbalance factor, hence abridging the gap between infrequent and frequent classes. For the squared Frobe- nius norm (SFN) regularizer which is also an instance of the NCBMD regularizers, such a bound cannot be derived.

4.2 Analysis of Generalization Errors

In this section, we analyze why diversity-promoting regularization can improve the generaliza- tion performance on unseen data.

4.2.1 Generalization Error Analysis for the Angular Constraints Using neural networks as a study case, we analyze how the angular constraints [372] (Section 2.3) affect the generalization performance. The generalization error of a hypothesis f represented ∗ with a neural network is defined as L(f) = E(x,y)∼p∗ [`(f(x), y)], where p is the distribution of the input-output pair (x, y) and `(·, ·) is the loss function. The training error is Lb(f) = 1 Pn (i) (i) ∗ n i=1 `(f(x ), y ), where n is the number of training samples. Let f ∈ argminf∈F L(f) be ˆ the true risk minimizer and f ∈ argminf∈F Lb(f) be the empirical risk minimizer. We aim at ana- lyzing the generalization error L(fˆ) of the empirical risk minimizer fˆ. L(fˆ) can be decomposed into L(fˆ) = L(fˆ) − L(f ∗) + L(f ∗), where L(fˆ) − L(f ∗) is the estimation error and L(f ∗) is the approximation error. For simplicity, we start with a “simple” fully connected network with one hidden layer of m units, used for univariate regression (one output unit) with squared loss. Analysis for more com- plicated NNs with multiple hidden layers can be achieved in a straightforward way by cascading our analysis for this “simple” NN. Let x ∈ Rd be the input vector and y be the response value. d For simplicity, we assume max{kxk2, |y|} ≤ 1. Let wj ∈ R be the weights connecting the j-th hidden unit with the input units, with kwjk2 ≤ C. Let α be a vector where αj is the weight

112 connecting the hidden unit j to the output unit, with kαk2 ≤ B. We assume the activation func- tion h(t) applied on the hidden units is Lipschitz continuous with a constant L. Commonly used activation functions such as rectified linear h(t) = max(0, t), tanh h(t) = (et − e−t)/(et + e−t), and sigmoid h(t) = 1/(1 + e−t) are all Lipschitz continuous with L = 1, 1, 0.25, respectively. Consider the hypothesis set

m X > F = {x 7→ αjh(wj x) | kαk2 ≤ B, kwjk2 ≤ C, ∀i 6= j, |wi · wj| ≤ τkwik2kwjk2}. j=1

The estimation error represents how well the algorithm is able to learn. We bound it in the following way. We first upper bound the estimation error using the Rademacher complexity (RC) [32] R(F) of F, then further bound R(F) using the hyperparameter τ in the angular constraints. When bounding the RC, we leverage the composition property [32]: if a scalar function φ is Lipschitz with constant Lφ and satisfies φ(0) = 0, then R(φ ◦ F) ≤ 2LφR(F). Lφ involves pairwise interactions between the weight vectors of hidden units, thus can be upper bounded using τ. Putting the pieces together, we obtain the following estimation error bound. Theorem 2. Let the h be L-Lipschitz continuous and the loss `(ˆy, y) = 1 2 2 (ˆy − y) . Then, with probability at least 1 − δ: √ γ2p2 ln(4/δ) + 4γB(2CL + |h(0)|) m L(fˆ) − L(f ∗) ≤ √ , (4.3) n √ where γ = 1 + BCLp(m − 1)τ + 1 + mB|h(0)|. Note that γ, hence the above bound on estimation error, decreases as τ becomes smaller. The bound goes to zero as n (sample size) goes to infinite. The inverse square root dependence on n matches existing results [32]. We note that it is straightforward to extend our bound to any bounded Lipschitz continuous loss `. The approximation error indicates how capable the hypothesis set F is to approximate a target 2 R function g = E[y|x], where the error is measured by minf∈F kf −gkL2 and kf −gkL2 = (f(x)− g(x))2P (dx). Following [31], we assume the target function g satisfies certain smoothness con- R dition that is expressed in the first moment of its Fourier representation: kωk2|g˜(ω)|dω ≤ B/2 where g˜(ω) is the Fourier representation of g(x). To study the approximation error, we first de- ¯ Pm > ¯ fine an auxiliary function class F = {x 7→ j=1 αjh(wj x) | kαk2 ≤ B, kwjk2 ≤ C}. F differs from F in that the angular constraints ∀i 6= j, |wi · wj| ≤ τkwik2kwjk2 are removed. ¯ ¯ ¯ ¯ ¯ Note that for f ∈ F and f ∈ F, we have kg −fkL2 ≤ kg −fkL2 +kf −fkL2 . For kf −fkL2 , we ¯ derive an upper bound which is a quantity involving τ. For kg − fkL2 , its upper bound has been derived in [31]. Consequently, we obtain the upper bound of kg − fkL2 , which is given in the following theorem. In order to derive explicit constants we restrict h to be the sigmoid function, but other Lipschitz continuous activation function can be similarly handled. π 2 −θ −t Theorem 3. Let C > 1, m ≤ 2(b θ c + 1), where θ = arccos(τ), and h(t) = 1/(1 + e ). Then, there is a function f ∈ F such that

√ min(3mθ,π) √1 1+2 ln C kf − gkL2 ≤ B( m + C ) + 2 mBC sin( 2 ).

113 This theorem implies that whether to use the angular constraint (AC) or not has a significant influence on the approximate error bound: without using AC (τ = 1), the bound is a decreasing function of m (the number of hidden units); using AC (τ < 1), the bound increases with m. This striking phrase-change indicates the impact of AC. Given a fixed m, the bound decreases with τ, implying that a stronger regularization (smaller τ) incurs larger approximation error. When τ = 1, the second term in the bound vanishes and the bound is reduced to the one in [31], which is a decreasing function of m (and C, the upper bound on the weights). When τ < 1, the second π 2 −θ term increases with m up to the upper bound 2(b θ c + 1). This is because a larger number of hidden units bear a larger difficulty in satisfying the pairwise ACs, which causes the function space F to shrink rapidly; accordingly, the approximation power of F decreases quickly. The analysis in the two theorems shows that τ incurs a tradeoff between the estimation error and the approximation error: decreasing τ reduces the estimation error and enlarges the approx- imation error. Since the generalization error is the sum of the two errors, τ has an optimal value to yield the minimal generalization error.

4.2.2 Estimation Error Analysis for the Nonconvex Bregman Matrix Di- vergence Regularizers In this section, we analyze how the nonconvex Bregman matrix divergence (BMD) regulariz- ers [379] (Section 2.2.1) affect the estimation error of ML models. We use the distance metric learning (DML) model to perform the study. In DML, the hypothesis function is u(x, y) = > 2 kW (x − y)k2 and the loss function ` is the logistic loss `(u(x, y), t) = log(1 + exp((2t − > 2 1)u(x, y))). Let U = {u :(x, y) 7→ kW (x − y)k2, Ω(W) ≤ τ} denote the hypothesis set and A = {` :(x, y, t) 7→ `(u(x, y), t), u ∈ U} denote the loss class, which is the composition of the loss function with each of the hypotheses. In U, we add the constraint Ω(W) ≤ τ to incorporate the impact of the BMD regularizers Ω(W). τ controls the strength of regularization. A smaller τ entails stronger promotion of diversity. τ is controlled by the regularization parameter λ in Eq.(2.14). Increasing λ reduces τ. Given the joint distribution p∗ of the input data pair (x, y) and the binary label t indicating whether this data pair is similar or dissimilar, the risk of the hypothesis u is L(u) = E(x,y,t)∼p∗ [`(u(x, y), t)]. Its empirical counterpart (training error) can be 1 PN defined as Lb(u) = N n=1 `(u(xn, yn), tn). The estimation error of a hypothesis u is defined as L(u) − Lb(u), which represents how well the algorithm can learn and usually depends on the complexity of the hypothesis class and the number of training examples. To upper bound the estimation error, we define a capacity variable on W. > Definition 1 (Capacity Variable). Let π1, ··· , πm be the eigenvalues of W W. Then the capac- ity variable is defined as: m X C(W) = |πj − 1|. j=1 The following inequality helps us to understand the intuitive meaning of C(W): 1 kW>W − I k ≤ C(W). (4.4) m m 1

114 1 > > m kW W − Imk1 measures the closeness between the Gram matrix W W and an identity matrix using L1 norm. The smaller this quantity is, the closer to being orthogonal the vectors in W are. Being an upper bound of this quantity, C(W) essentially determines the level of near-orthogonality among vectors. We first prove that the estimation error can be upper bounded by an increasing function of the capacity variable. Then we show that the capacity variable can be upper bounded by an increasing function of the nonconvex BMD regularizers, including the log-determinant divergence (LDD) regularizer, the von Neumann divergence (VND) regularizer, and the squared Frobenius norm (SFN) regularizer. Combining these two steps we reveal the relationship between the estimation errors and the BMD regularizers, which is given in the following theorem.

Theorem 4. Suppose sup(x,y)kx−yk2 ≤ B0 and any column vector w of W satisfies kwk2 ≤ D. With probability at least 1 − δ, we have: • For the LDD regularizer,

q m(g−1 (τ/m)m+1) q 2 ldd 2 −1 2 log(1/δ) (4.5) L(u) − Lb(u) ≤ 4B0 D N + [B0 m(gldd(τ/m) + 1) + 1] N ,

where gldd(x) = x − log(x + 1). • For the VND regularizer,

q −1 q 2 m(gvnd(τ/m)m+1) 2 −1 2 log(1/δ) L(u) − Lb(u) ≤ 4B0 D N + [B0 m(gvnd(τ/m) + 1) + 1] N , (4.6) where gvnd(x) = (x + 1) log(x + 1) − x. • For the SFN regularizer,

r 2 2 p p τm 4B m + (B m + 1) 2 log(1/δ) L(u) − Lb(u) ≤ (4 + 2 log(1/δ))B2 + √ . (4.7) N N From these estimation error bounds (EEBs), we can see two major implications. First, BMD regularizers can effectively control the EEBs. Increasing the strength of BMD regularization (by enlarging λ) reduces τ, which decreases√ the EEBs since they are increasing functions of τ. Second, the EEBs converge with rate O(1/ N), where N is the number of training data pairs. This rate matches with that in [37, 338].

4.2.3 Estimation Error Analysis for the Convex Bregman Matrix Diver- gence Regularizers In this section, we analyze how the convex BMD (CBMD) regularizers [379] (Section 2.2.2) affect the estimation error of CBMD-regularized distance metric learning problem (Eq.(2.22)). Following [338], we use distance-based error to measure the quality of a Mahalanobis distance matrix M. Given the sample S and D where the total number of data pairs is m = |S| + |D|, the 1 P > 1 P empirical error is defined as Lb(M) = |S| (x,y)∈S (x−y) M(x−y)+ |D| (x,y)∈D max(0, τ − > ∗ (x − y) M(x − y)) and the expected error is L(M) = ES,D[Lb(M)]. Let Mc be the optimal ∗ matrix learned by minimizing the empirical error: Mc = argminMLb(M). We are interested in

115 how well Mc∗ performs on unseen data. The performance is measured using the estimation error: ∗ ∗ E = L(Mc ) − Lb(Mc ). To incorporate the impact of the CBMD regularizers Ωφ(M), we define the hypothesis class of M to be M = {M  0 : Ωφ(M) ≤ C}. The upper bound C controls the strength of regularization. A smaller C entails stronger promotion of orthogonality. C is controlled by the regularization parameter γ in Eq.(4.1). Increasing γ reduces C. We perform the estimation error analysis as follows. We first establish an upper bound of the estimation error based on the Rademacher complexity R(M) of M. Then we upper bound R(M) using the trace of M. Finally, we derive upper bound of the trace based on the CBMD regularizers. Combining the three steps together, we establish upper bounds of the estimation error based on the CBMD regularizers, including the convex von Neumann divergence (CVND) regularizer, the convex log-determinant divergence (CLDD) regularizer, and the convex squared Frobenius norm (CSFN) regularizer: > Theorem 5. Suppose supkvk2≤1,(x,y)∈S |v (x − y)| ≤ B, then with probability at least 1 − δ, we have: • For the CVND regularizer,

2 2 p √1 E ≤ (4B C + max(τ, B C) 2 log(1/δ)) m . • For the CLDD regularizer,

2 4B C C−D p √1 E ≤ ( log(1/)−1 + max(τ, log(1/)−1 ) 2 log(1/δ)) m . • For the CSFN regularizer, √ 2 p √1 E ≤ (2B min(2C, C) + max(τ, C) 2 log(1/δ)) m . From these estimation error bounds (EEBs), we can see two major implications. First, CBMD regularizers can effectively control the EEBs. Increasing the strength of CBMD regularization (by enlarging γ) reduces C, which decreases the√ EEBs since they are all increasing functions of C. Second, the EEBs converge with rate O(1/ m), where m is the number of training data pairs. This rate matches with that in [37, 338].

4.3 Appendix: Proofs

4.3.1 Proof of Theorem 1 Proof Sketch We make the following two assumptions. • The size of the similar and dissimilar set |S| and |D| are fixed. • A∗ has full row rank R.

Denote the K classes as C1, C2, ···CK . The probability that a sample is drawn from the k-th PK class is pk, and k=1 pk = 1. Denote the class membership of an example x as c(x). Denote PK 2 the probability that x ∈ Cj, y ∈ Ck where (x, y) ∈ D as pjk = pjpk/(1 − l=1 pl ). Define

116 √ the singular value decomposition (SVD) of the matrix A∗ as U ΛV> where U ∈ RR×R, Λ ∈ R×R D×R ∗> ∗ > R , and V ∈ R . Λ = diag(λ1, λ2, ··· λR). Then A A = VΛV . Denote V = > ∗> ∗ PR > 2 > ∗> ∗ [v1, v2, ··· vR]. Then ∀z = x − y, we have z A A z = r=1 λr(vr z) . We see z A A z can be written as a sum of R terms. Inspired by this, we define a vector function α(·) as α(u) = PK > 2 > 2 j,k=1 pjk(u (µj − µk)) . This function measures the weighted sum of (u (µj − µk)) across all classes. Define G = span{µj − µk : j 6= k}. Definition 2 (feature values and feature vectors). For a linear space G, define vectors w1, w2, ··· , wK−1 and positive real numbers β1, β2, ··· , βK−1 as

w1 = arg min α(u), β1 = α(w1), kuk=1,u∈G

wr = arg min α(u), βr = α(wr), ∀r > 1 kuk = 1, u ∈ G u ⊥ wj , ∀j < r

∀r > K − 1, define βr = 0, and wr as an arbitrary vector which has norm 1 and is orthogonal to w1, w2, ··· , wr−1. w1, w2, ··· , wK−1 are called feature vectors of G, and β1, β2, ··· , βK−1 are called feature values of G. We give a condition on the regularizers.

Condition 1. For a regularizer Ωφ(·), there exists a unique matrix function ϕ(·) such that for any A∗, ∗ ∗ ∗> Ωφ(A ) = ϕ(A A ) = ϕ(Λ). The VND and LDD regularizer satisfy this condition. For the VND regularizer, ϕ(Λ) = tr(Λ log Λ − Λ) + R; for the LDD regularizer, ϕ(Λ) = tr(Λ) − log det(Λ) − R. The SFN regularizer does not satisfy this condition. Now we have enough preparation to give the following lemma. It shows that the linear space G can be recovered if the second moment of noise is smaller than a certain value. Lemma 5. Suppose R ≥ K − 1, maxj∈k kµj − µkk2 ≤ B√0, and the regularization parameter γ 2 −B0+ B0 +γK−1βK−1/(2tr(Λ)) and distance margin τ satisfy γ ≥ γ0, τ ≥ τ0. If ξ ≤ 4 , then G ⊂ span(A∗>). (4.8)

∗> ∗> Here span(A ) denotes the column space of matrix A . Both λ0 and τ0 depend on µ1, µ2, ··· , µK and p1, p2, ··· , pK . The next lemma shows that if Eq.(4.8) holds, we can bound the imbalance factor η with the condition number of A∗A∗> (denoted by cond(A∗A∗>)). Note that the BMD regularizers ∗ ∗ ∗> Ωφ(A ) encourage A A to be close to an identity matrix, i.e., encouraging the condition number to be close to 1. Lemma 6. If Eq.(4.8) holds, and there exists a real function g such that

∗ ∗> ∗ cond(A A ) ≤ g(Ωφ(A )), then we have the following bound for the imbalance factor

2 ∗ maxj6=kkµj − µkk η ≤ g(Ωφ(A )) 2 . minj6=kkµj − µkk

117 Next, we derive the explicit forms of g for the VND and LDD regularizers. ∗ 1/(c+1) Lemma 7. For the VND regularizer Ωvnd(A ), define f(c) = c (1 + 1/c), then f(c) is strictly increasing on (0, 1] and strictly decreasing on [1, ∞). Define the inverse function of f(·) −1 ∗ on [1, ∞) as f (·). Then if Ωvnd(A ) < 1, we have ∗ ∗> −1 ∗ cond(A A ) ≤ f (2 − Ωvnd(A )). ∗ For the LDD regularizer Ωldd(A ), we have ∗ cond(A∗A∗>) ≤ 4eΩldd(A ). Combining Lemma 5, 6, and 7, we finish the proof of Theorem 1.

Proof of Lemma 5 In order to prove Lemma 5, we first need some auxiliary lemmas on the properties of the function α(·). Denote µjk = µj − µk, ∀j 6= k. Lemma 8. Suppose u1, u2, ··· ur and v1, v2, ··· vr are two sets of standard orthogonal vectors d in R , and span(u1, u2, ··· ur) = span(v1, v2, ··· vr), then we have r r X X α(ul) = α(vl). l=1 l=1 Proof. By the definition of these two sets of vectors, there exists a r × r standard orthogonal matrix B = (bjk), such that (u1, u2, ··· ur) = (v1, v2, ··· vr)B. Then we have r r r X X X X > 2 α(ul) = pjk(( blsvs) µjk) l=1 l=1 j6=k s=1 r r X X X > > = pjk blsbltvs µjkvt µjk l=1 j6=k s,t=1 r r r r X X > 2 X 2 X X > > X = pjk(vs µjk) bls + pjk vs µjkvt µjk blsblt. s=1 j6=k l=1 j6=k s,t=1 l=1

Pr 2 Pr Since B is a standard orthogonal matrix, we have ∀s, l=1 bls = 1 and ∀s 6= t, l=1 blsblt = 0. Further, we have r r X X α(ul) = α(vl). l=1 l=1

d Lemma 9. For any positive integer r, any set of standard orthogonal vectors u1, u2, ··· ur ∈ R , and real numbers γ1 ≥ γ2 ≥ · · · ≥ γr ≥ 0, we have r r X X γlα(ul) ≤ γlβl, (4.9) l=1 l=1 where βl is the l-th feature value.

118 Proof. We first prove the situation that γ1 = γ2 = ··· = γr = 1, i.e., r r X X α(ul) ≤ βl. (4.10) l=1 l=1 We prove it by induction on r. For r = 1, by the definition of feature values and feature vectors, Eq.(4.10) holds. Now supposing Eq.(4.10) holds for r = s, we prove it holds for r = s + 1 by contradiction. If Eq.(4.10) does not hold, then there exist standard orthogonal d vectors u1, u2, ··· us+1 ∈ R , such that s+1 s+1 X X α(ul) > α(wl), (4.11) l=1 l=1 where wl are feature vectors. Since the dimension of span(u1, u2, ··· us+1) is s + 1, there exists we s+1 ∈ span(u1, u2, ··· us+1), such that we s+1 ⊥ wl, ∀1 ≤ l ≤ s. By the definition of the feature vector ws+1, we have s+1 s X X α(wl) ≥ α(wl) + α(we s+1). (4.12) l=1 l=1

Let we 1, we 2, ··· we s+1 be a set of standard orthogonal basis of span(u1, u2, ··· us+1), by Lemma 8, we have s+1 s+1 X X α(ul) = α(we l). (4.13) l=1 l=1 Combining equation (4.11), (4.12) and (4.13) we get

s+1 s X X α(we l) > α(wl) + α(we s+1). l=1 l=1 Thus we have s s X X α(we l) > α(wl). l=1 l=1

This contradicts with our induction assumption. The proof for the γ1 = γ2 = ··· = γr = 1 case completes. Next, we prove the situation that γl are not all equal to 1, by utilizing Eq.(4.10).

r r−1 l r X X X X γlα(ul) = [(γl − γl+1) α(ut)] + γr α(ut) l=1 l=1 t=1 t=1 r−1 l r X X X ≤ [(γl − γl+1) βt] + γr βt l=1 t=1 t=1 r X ≤ γlβl l=1 The proof completes.

119 Note that in Lemma 9, r can be larger than the number of nonzero feature values K − 1. This will be used in the proof of Lemma 5 later on. Another auxiliary lemma needed to prove Lemma 5 is given below.

Lemma 10. Suppose w0 ∈ G, define a linear space H = {v ∈ G : v ⊥ w0}. Then there 0 0 0 are K − 2 nonzero feature values of H. Denote them as β1, β2, ··· βK−2, then ∀ r ≤ K − 2, ∀ γ1 ≥ γ2 ≥ · · · ≥ γr ≥ 0, r r X 0 X γlβl ≤ γlβl. l=1 l=1 Proof. Note that the dimension of H is K − 2, then there are K − 2 nonzero feature values. The feature vectors of H are also standard orthogonal vectors of the linear space G. By Lemma 9, we Pr 0 Pr have l=1 γlβl ≤ l=1 γlβl, ∀ r ≤ K − 2. Now we are ready to prove Lemma 5.

Proof. (of Lemma 5) We conduct the proof by contradiction. Assuming√ Eq.(4.8) does not hold, we prove A∗ can not be the global optimal solution. Let U ΛV> be the SVD of A∗. Define √W = (w1, w2, ··· wR) as a matrix whose columns contain the feature vectors. Let > ∗ Ae = U ΛW . Then by Condition 1, we have Ωφ(A ) = Ωφ(Ae ). Define h 1 X 1 X i L(A) = E kAx − Ayk2 + max(0, τ − kAx − Ayk2) . |S| 2 |D| 2 (x,y)∈S (x,y)∈D

Assuming Eq.(4.8) does not hold, we prove L(A∗) > L(Ae ), i.e., A∗ is not the optimal solution. We consider two cases: ξ = 0 and ξ 6= 0. Define h(A∗, ξ) = L(A∗) and h(Ae , ξ) = L(Ae ). When ξ = 0, we have: h 1 X 1 X i h(A∗, 0) = E kA∗µ − A∗µ k2 + max(0, τ − kA∗µ − A∗µ k2) |S| c(x) c(y) 2 |D| c(x) c(y) 2 (x,y)∈S (x,y)∈D X ∗ 2 = pjk max(0, τ − kA (µj − µk)k2), j6=k and X 2 h(Ae , 0) = pjk max(0, τ − kAe (µj − µk)k2). j6=k

Since Eq.(4.8) does not hold by assumption, there exists a w0 satisfying w0 ∈ G and w0 ∈/ ∗ span(A ). Denote H = {v ∈ G : v ⊥ w0} and its K − 2 nonzero feature values as 0 0 0 ∗ 0 0 β1, β2, ··· βK−2. ∀u ∈ span(A ), let u be the projection of u to the space H and u is rescaled to have norm 1. Then α(u0) ≥ α(u). Thus, ∀r, the r-th feature value of span(A∗) is no larger than Pr 0 Pr the r-th feature value of G. By Lemma 10, we have l=1 γlβl ≤ l=1 γlβl. By the definition of feature values, we have

R R X ∗ 2 X X 0 pjkkA (µj − µk)k2 = γlα(al) ≤ γlβl. j6=k l=1 l=1

120 Since H has only K − 2 nonzero feature values, we have

R K−2 K−2 K−1 X 0 X 0 X X X 2 γlβl = γlβl ≤ γlβl = γlα(wl)−γK−1βK−1 = pjkkAe (µj−µk)k2−γK−1βK−1. l=1 l=1 l=1 l=1 j6=k So we have

X 2 X ∗ 2 pjkkAe (µj − µk)k2 ≥ pjkkA (µj − µk)k2 + γK−1βK−1. j6=k j6=k

Next, we establish a relationship between h(A∗, 0) and h(Ae , 0), which is given in the fol- lowing lemma.

Lemma 11. There exist constants τ0, γ0 which are determined by p1, p2, ··· pK and µ1, µ2, ··· , µK , such that if τ ≥ τ0, γ ≥ γ0, then we have

∗ 1 h(A , 0) − h(Ae , 0) > γK−1βK−1. 2

2 ∗ 2 ∗ Proof. If kAe (µj − µk)k2 ≤ τ and kA (µj − µk)k2 ≤ τ for all j 6= k, we have h(A , 0) − h(Ae , 0) = γK−1βK−1. Since maxj6=k kµj − µkk2 = B0, we have

∗ 2 2 2 2 kA (µj − µk)k2 ≤ tr(Λ)B0 , kAe (µj − µk)k2 ≤ tr(Λ)B0 , ∀j 6= k. (4.14)

2 Select τ0 such that τ0 ≥ K(1 + 0)B0 , where 0 is any positive constant. For the VND and LDD regularizers, as γ → ∞, Λ → IR. Thereby, there exists γ0, such that if γ ≥ γ0, ∀j, |λj − 1| ≤ . Hence, if γ ≥ γ0, τ ≥ τ0, 2 2 tr(Λ)B0 ≤ K(1 + 0)B0 ≤ τ0. Combining this inequality with Eq.(4.14), we finish the proof. Now we continue to prove Lemma 5. In Lemma 11, we have already proved that h(A∗, 0) is strictly larger than h(Ae , 0). We then prove that if the noise is smaller than a certain value, h(A∗, ξ) is strictly larger than h(Ae , ξ). By the definition of ξ, we have |h(A∗, ξ) − h(A∗, 0)| 1 X 1 X ≤E kA∗(x − y)k2 + E [kA∗(x − y)k2 − kA∗(µ − µ )k2] |S| 2 |D| 2 c(x) c(y) 2 (x,y)∈S (x,y)∈D 2 2 ≤4tr(Λ)ξ + (4B0ξ + 4ξ )tr(Λ) 2 =8ξ tr(Λ) + 4B0ξtr(Λ). (4.15) Similarly, we have

∗ ∗ 2 |h(A , ξ) − h(A , 0)| ≤ 8ξ tr(Λ) + 4B0ξtr(Λ). (4.16) √ 2 −B0+ B0 +γK−1βK−1/(2tr(Λ)) Combining Lemma 11 with Eq.(4.15) and Eq.(4.16), we have if ξ ≤ 4 , then L(A∗) > L(Ae ), i.e., A∗ is not the global optimal solution. By contradiction, Eq.(4.8) holds. The proof completes.

121 Proof of Lemma 6 Proof. For any vector u ∈ G, since the condition of Lemma 5 is satisfied, we have u ∈ ∗ ∗> ∗ > span(A ). Recall A A = VΓV and V = [v1, v2, ··· vR]. We can denote u as u = PR PR 2 kuk j=1 tjvj, where j=1 tj = 1. Then we have ∀u ∈ G,

R R > ∗> ∗ X 2 X 2 2 2 u A A u = hvj, ui λj = kuk tj λj ≤ kuk λ1. j=1 j=1

> ∗> ∗ 2 Similarly, we have u A A u ≥ kuk λR. Noting ∀j 6= k, µj − µk ∈ G, we have

2 ∗ ∗> maxj6=kkµj − µkk η ≤ cond(A A ) 2 . minj6=kkµj − µkk

∗ ∗> ∗ Combining this inequality with cond(A A ) ≤ g(Ωφ(A )), we complete the proof.

Proof of Lemma 7 Proof. We first prove the result about the VND regularizer. Define a scalar function s(x) = x log x − x + 1 and denote cond(A∗A∗>) = c. Since s0(x) = log x, and s(1) = 0, we have

R ∗ X Ωvnd(A ) = s(λj) j=1

≥ s(λ1) + s(λR) λ λ λ = λ log λ − λ + 1 log 1 − 1 + 2. 1 1 1 c c c

λ1 λ1 λ1 Define F (λ1, c) = λ1 log λ1 − λ1 + c log c − c + 2. We aim at maximizing c, so ∂ F (λ1, c) = 0. ∂λ1

log c This equation has a unique solution: log λ1 = c+1 . Therefore we have 1 c1/(c+1)(1 + ) ≥ 2 − Ω (A∗). c vnd

1/(c+1) 1 0 log c 1/(c+1) 0 Define f(c) = c (1 + c ). Its derivative is: f (c) = − c(c+1) c . Analyzing f (c), we know that f(c) increases on (0, 1], decreases on [1, ∞), and f(1) = 2. Also we have the following limits: lim f(c) = 0, lim f(c) = 1. c→0 c→∞ −1 ∗ We denote the inverse function of f(·) on [1, ∞) as f (·). Then for any Ωvnd(A ) < 1, we have

∗ ∗> −1 ∗ cond(A A ) ≤ f (2 − Ωvnd(A )).

122 ∗ Next we prove the result for the LDD regularizer Ωldd(A ). Define a scalar function s(x) = ∗ ∗> 0 1 x − log x − 1 and denote cond(A A ) = c. Since s (x) = 1 − x and s(1) = 0, we have

R ∗ X Ωldd(A ) = s(λj) j=1

≥ s(λ1) + s(λR) λ λ = λ − log λ + 1 − log 1 − 2. 1 1 c c Therefore we have 1 log c ≤ Ω (A∗) + 2 log λ − λ (1 + ) + 2 ldd 1 1 c ∗ ≤ Ωldd(A ) + 2 log λ1 − λ1 + 2 ∗ ≤ Ωldd(A ) + 2 log 2 − 2 + 2 ∗ = Ωldd(A ) + 2 log 2.

The third inequality is obtained from the following fact: the scalar function log x − x gets its maximum when x = 2. Further, we have

∗ c ≤ 4eΩldd(A ).

The proof completes.

4.3.2 Proof of Theorem 2 A well established result in learning theory is that the estimation error can be upper bounded by the Rademacher complexity [32]. We start from the Rademacher complexity, seek a further upper bound of it, and show how the diversity of the hidden units affects this upper bound. The Rademacher complexity Rn(` ◦ F) of the function class F composed with the loss function ` is defined as n 1 X R (` ◦ F) = sup σ `(f(x(i)), y(i)) , (4.17) n E n i f∈F i=1 (i) (i) n where {σi} are independent and uniform over {−1, 1} and {(x , y )}i=1 are i.i.d samples drawn from p∗. Note that for later convenience, we take the absolute value in the above definition. We can use the Rademacher complexity to upper bound the estimation error, as shown in Lemma 12. Lemma 12 ([298, Theorem 26.5]). With probability at least 1 − δ: r 2 log(4/δ) L(fˆ) − L(f ∗) ≤ 4R (` ◦ F) + 2γ , (4.18) n n for γ ≥ supx,y,f |`(f(x), y)|.

123 Our analysis starts from this lemma and we seek a further upper bound on Rn(` ◦ F). The analysis needs an upper bound of the Rademacher complexity of the hypothesis set F, which is given in Lemma 13. ¯ ¯ Lemma 13. Let Rn(F) denote the Rademacher complexity of the hypothesis set F = {f(x) = m P > αjh(wj x) | kαk2 ≤ B, kwjk2 ≤ C, kxk2 ≤ 1}, and h be L-Lipschitz, then j=1 rm R (F¯) ≤ B(2CL + |h(0)|) . (4.19) n n Proof. The proof is analogous to that of [32, Theorem 18]. First, we have from [298, Lemma 26.11] that the Rademacher complexity of the class of linear functions L = {x 7→ w>x : d kwk2 ≤ C, w ∈ R } is √ Rn(L) ≤ C/ n. (4.20) Since composing with a 1-Lipschitz function h that vanishes at 0 can only increase the Rademacher complexity by a factor of two [216, Theorem 4.12], we have √ Rn(h ◦ L) ≤ Rn((h − h(0)) ◦ L) + |h(0)|/ n (4.21) √ ≤ (2CL + |h(0)|)/ n. (4.22) Lastly, since taking the (absolute) convex hull of a set of functions does not increase the Rademacher complexity [32, Theorem 12.2], we have √ rm R (F¯) ≤ R ( mB · absconv(h ◦ L)) ≤ B(2CL + |h(0)|) . (4.23) n n n We note that if instead we consider m 0 X T F = {f(x) = αjh(wj x) | kαk1 ≤ B, kwjk1 ≤ C, kxk2 ≤ 1}, j=1 q 0 2 ln(2d) then a similar argument confirms that Rn(F ) ≤ B(2CL + |h(0)|) n . In addition, we need the following uniform bound on the output of the network.

Lemma 14. For any kxk2 ≤ 1, kαk2 ≤ B and L-Lipschitz continuous h, we have

m √ √ X > sup αjh(wj x) ≤ BCL 1 − τ + mτ + mB|h(0)|. (4.24) |w>w | i j j=1 kwj k2≤C,∀i6=j, ≤τ kwik2kwj k2

> > Proof. Indeed, let uj = h(wj x) − h(0). Then, |uj| ≤ L|wj x|, hence

m X > αj[h(wj x) − h(0)] ≤ kαk2kuk2 (4.25) j=1 > ≤ BLkW xk2 (4.26)

≤ BLkW ksp, (4.27)

124 where W = [w1,..., wm], and kW ksp is the spectral norm, which we bound as follows:

2 > > kW ksp = max x W W x (4.28) kxk2≤1 X > ≤ max |xi||xj||wi wj| (4.29) kxk2≤1 i,j X 2 2 X ≤ max (1 − τ) xi kwik2 + τ |xi||xj|kwik2kwjk2 (4.30) kxk2≤1 i i,j ≤ C2[(1 − τ) + mτ]. (4.31)

Plugging the bound on kW ksp into (4.27) and applying the triangle inequality completes the proof.

1 2 We are now ready to prove Theorem 2. The square loss function `y(t) = `(t, y) = 2 (t − y) is Lipschitz continuous with constant γ = sup |f(x) − y| ≤ sup(|f(x)| + |y|) (4.32) kxk ≤1,|y|≤1,f∈F x,y,f 2 √ √ ≤ 1 + BCL 1 − τ + mτ + mB|h(0)|, (4.33) where the last inequality follows from Lemma 14. Applying the contraction property [298, Lemma 26.9] of the Rademacher complexity, we have √ √ Rn(` ◦ F) ≤ (1 + BCL 1 − τ + mτ + mB|h(0)|)Rn(F) (4.34) √ √ ¯ ≤ (1 + BCL 1 − τ + mτ + mB|h(0)|)Rn(F). (4.35) Using Lemma 13 and Lemma 12 the claim follows. In our proof we bound the Rademacher complexity of F using that of F¯ which simply ignores the pairwise correlation constraints. This step is likely very loose and it would be interesting to refine the bound so that it enjoys a reasonable dependence on τ.

4.3.3 Proof of Theorem 3 We need the following approximation error bound due to [31]: Lemma 15 ([31, Theorem 3]). Consider the function class m ¯ X T F = {x 7→ αjh( ¯wj x) | kαk2 ≤ B, k ¯wjk2 ≤ C, kxk2 ≤ 1}, (4.36) j=1 where the activation function h(t) = 1/(1 + e−t). Then, for any square integratable function g with Z kωk2|g˜(ω)|dω ≤ B/2, ω where g˜(ω) is the Fourier representation of g(x), there exists f¯ ∈ F¯ such that 1 1 + 2 ln C kg − f¯k 2 ≤ B(√ + ). (4.37) L m C

125 We remark that [31] used instead the `1 constraint kαk1 ≤ B in the definition of the function ¯ class F. For our purpose the `2 constraint on α is more natural, and it is clear that we can ¯ only increase the function class F by using the `2 constraint. As C and m tend to infinity, the above bound indicates that any square integratable function g can be approximated (under the ¯ L2 metric) by the function class F. This is the√ well-known universal approximation property of neural networks. Moreover,√ if we choose C ≥ m ln m, then the approximation error decreases at the usual rate 1/ m. Our key idea is to approximate the function class F¯ by the function class F, where the pairwise angle constraint θjk ≥ θ is enforced. Of course, this is possible only when m is not too large. π π 2 −θ m Lemma 16. Let θ = arccos(τ) ∈ (0, 2 ) and m ≤ 2(b θ c + 1). For any unit vectors ( ¯wj)j=1, m there exist unit vectors (wj)j=1 such that

∀j 6= k ∈ {1, ··· , m}, |wj · wk| ≤ τ, (4.38)

∀j ∈ {1, ··· , m}, wj · ¯wj ≥ cos[min(3mθ, π)]. (4.39) Assuming Lemma 16 holds for now we can proceed to prove Theorem 3. According to ¯ Pm > ¯ Lemma 15, there exists some f = j=1 αjh( ¯wj x) ∈ F such that 1 1 + 2 ln C kg − f¯k 2 ≤ B(√ + ). (4.40) L m C Since ¯ ¯ kg − fkL2 ≤ kg − fkL2 + kf − fkL2 , (4.41)

Pm > m we need only bound the last term as follows. Let us define f = j=1 αjh(wj x), where (wj)j=1 are given in Lemma 16 and scaled so that kwjk2 = k ¯wjk2. Straightforward calculation shows: Z ¯ 2 ¯ 2 kf − fkL2 = (f(x) − f(x)) dµ(x) (4.42) kxk2≤1 Z !2 X  > >  = αj h(wj x) − h( ¯wj x) dµ(x) (4.43) kxk2≤1 j Z 2 X > > 2 ≤ kαk2 |wj x − ¯wj x| dµ(x) (4.44) kxk2≤1 j 2 X 2 ≤ B kwj − ¯wjk2 (4.45) j 2 X 2 > = 2B kwjk2 − wj ¯wj (4.46) j 2 X 2 ≤ 2B (1 − cos[min(3mθ, π)]) kwjk2 (4.47) j min(3mθ, π) ≤ 4mB2C2 sin2 . (4.48) 2

126 Substituting back to (4.41) we have ¯ ¯ kg − fkL2 ≤ kg − fkL2 + kf − fkL2 (4.49) 1 1 + 2 ln C √ min(3mθ, π) ≤ B(√ + ) + 2 mBC sin , (4.50) m C 2 which was to be proved.

Proof of Lemma 16 The following simple lemma will be useful in our proof.

Lemma 17. For any three unit vectors u1, u2, and u3:

∠(u1, u3) ≤ ∠(u1, u2) + ∠(u2, u3), (4.51) with the convention ∠(x, y) ∈ [0, π).

Proof. Without loss of generality, let u2 = e1. Let θ12 = ∠(u1, u2), θ23 = ∠(u2, u3), and θ13 = ∠(u1, u3). Then, u1 = cos θ12 · u2 + sin θ12 · v where the first element of the unit vector v equals 0. Similarly, u3 = cos θ23 · u2 + sin θ23 · w where the first element in the unit vector w equals 0. Therefore,

> cos θ13 = u1 u3 (4.52) > = (cos θ12 · u2 + sin θ12 · v) (cos θ23 · u2 + sin θ23 · w) (4.53) > = cos θ12 cos θ23 + sin θ12 sin θ23v w (4.54)

≥ cos θ12 cos θ23 − sin θ12 sin θ23 (4.55)

= cos(θ12 + θ23). (4.56)

If θ12 + θ23 < π, arccos(cos(θ12 + θ23)) = θ12 + θ23. As arccos(·) is monotonically decreasing, we have θ13 ≤ θ12 + θ23. Otherwise, θ13 < π ≤ θ12 + θ23. We now begin to prove Lemma 16. Let φ(a, b) = arccos( a·b ), ρ(a, b) = arccos( |a·b| ). kak2kbk2 kak2kbk2 π 2 −θ We begin our proof by considering a 2-dimensional case (d = 2). Let k = b θ c. Define an index set I = {−(k + 1), −k, ··· , −1, 1, 2, ··· , k + 1}. We define a set of vectors (ei)i∈I : π π ei = (sin θi, cos θi), where θi ∈ (− 2 , 2 ) is defined as follows: θ θ = sgn(i)( + (|i| − 1)θ). (4.57) i 2 From the definition we have:

∀i 6= j ∈ I, ρ(ei, ej) ≥ θ, π θ π 3 − 2 + 2 ≤ θ−(k+1) < − 2 + 2 θ, (4.58) π 3 π θ 2 − 2 θ < θk+1 ≤ 2 − 2 .

And we can further verify that ∀i ∈ I, there exist different i1, ··· , i2k+1 ∈ I\i such that π π φ(ei, eij ) ≤ jθ. For any e = (sin β, cos β) with β ∈ [− 2 , 2 ], we can find an i ∈ I such 3 that φ(ei, e) ≤ 2 θ:

127 • π 3 If β ≥ θk+1, taking i = k + 1, we have φ(ei, e) ≤ 2 − θk+1 < 2 θ. • 3 If β ≤ θ−(k+1), taking i = −(k + 1), we also have φ(ei, e) ≤ 2 θ. β− θ • 2 3 Otherwise, taking i = sgn(β)d θ e, we also have φ(ei, e) ≤ θ < 2 θ.

Recall that for any i, there exist different i1, ··· , i2k+1 ∈ I\i such that φ(ei, eij ) ≤ jθ, and using π π Lemma 17, we can draw the conclusion that for any e = (sin β, cos β) with β ∈ [− 2 , 2 ], there 3 1 0 m exist different i1, ··· , i2k+2 such that φ(ei, eij ) ≤ 2 θ + (j − 1)θ = (j + 2 )θ. For any (wj)j=1, 0 0 π π assume wj = kwjk2(sin βj, cos βj), and assume βj ∈ [− 2 , 2 ]. Using the above conclusion, 0 0 3 0 for w1, we can find some r1 such that φ(w1, er1 ) ≤ 2 θ. For w2, we can find different i1, i2 0 3 3 0 3 such that φ(w2, ei1 ) ≤ 2 θ < ( 2 + 1)θ and φ(w2, ei2 ) ≤ ( 2 + 1)θ. So we can find r2 6= r1 0 3 such that φ(w2, er2 ) ≤ ( 2 + 1)θ. Following this scheme, we can find rj ∈/ {r1, ··· , rj−1} and 0 1 φ(wj, erj ) ≤ (j + 2 )θ < 3mθ for j = 1, ··· , m, as we have assumed that m ≤ 2(k + 1). Let 0 m wj = kwjk2erj , then we have constructed (wj)j=1 such that

0 ∀j ∈ {1, ··· , m}, φ(wj, wj) ≤ 3mθ, (4.59) 0 ∀j ∈ {1, ··· , m}, kwjk2 = kwjk2, (4.60)

∀j 6= k, ρ(wj, wk) ≥ θ. (4.61)

π π Note that we have assumed ∀j = 1, ··· , m, βj ∈ [− 2 , 2 ]. In order to show that the conclusion 0 3 π holds for general wj, we need to consider the cases where βj ∈ [− 2 π, − 2 ]. For those cases, 0 0 π π 00 0 0 0 we can let βj = βj + π, then βj ∈ [− 2 , 2 ]. Let wj = kwjk2(sin βj, cos βj), we can find 00 0 an erj such that φ(wj , erj ) ≤ mθ following the same procedure. Let wj = −kwjk2erj , then 0 00 φ(wj, wj) = φ(wj , erj ) ≤ 2mθ. Since ρ(−erj , ek) = ρ(erj , ek), the ρ(wj, wk) ≥ θ condition is still satisfied. Also noting that φ(a, b) ≤ π for any a, b, we complete the proof for the 2- dimensional case. Now we consider a general d-dimensional case. Similar to the 2-dimensional case, we con- struct a set of vectors with unit L2 norm such that the pairwise angles ρ(wj, wk) ≥ θ for j 6= k. We perform the construction in two phases. In the first phase, we construct a sequence of unit vector sets indexed by I = {−(k + 1), ··· , −1, 1, ··· , k + 1}:

d ∀i ∈ I, Ei = {e ∈ R |kek2 = 1, e · (1, 0, ··· , 0) = cos θi}, (4.62) θ where θi = sgn(i)( 2 + (|i| − 1)θ) is defined in Eq.(4.57). It can be shown that ∀i 6= j, ∀ei ∈ Ei, ej ∈ Ej, ρ(ei, ej) ≥ θ. (4.63)

The proof is as follows. First, we write ei as ei = (cos θi, 0, ··· , 0) + ri, where krik2 = | sin θi|. Similarly, ej = (cos θj, 0, ··· , 0) + rj, where krjk2 = | sin θj|. Hence we have

ei · ej = cos θi cos θj + ri · rj. (4.64) Hence

cos(ρ(ei, ej)) = |ei · ej| (4.65)

≤ cos θi cos θj + | sin θi sin θj| (4.66)

= max(cos(θi + θj), cos(θi − θj)). (4.67)

128 We have shown in the 2-dimensional case that cos(θi + θj) ≥ cos θ and cos(θi − θj) ≥ cos θ, hence ρ(ei, ej) ≥ θ. In other words, we have proved that for any two vectors from Ei and Ej, their pairwise angle is lower bounded by θ. Now we proceed to construct a set of vectors for each Ei such that the pairwise angles can also be lower bounded by θ. The construction is as follows. First, we claim that for any Ei, if W ⊂ E satisfies

∀wj 6= wk ∈ W, φ(wj, wk) ≥ θ, (4.68)

n then |W | is finite. In order to prove that, we first define B(x, r) = {y ∈ R : ky − xk2 < r}. θ 1−cos 2 Then Ei ⊂ ∪e∈Ei B(e, θ ). According to the definition of Ei, it is a compact set. Therefore 1+cos 2 the open cover has a finite subcover. Therefore we have ∃V ⊂ Ei with |V | being finite and

θ ! 1 − cos 2 Ei ⊂ ∪v∈V B v, θ . (4.69) 1 + cos 2

θ 1−cos 2 Furthermore, we can verify that ∀v ∈ V, ∀e1, e2 ∈ B(v, θ ), φ(e1, e2) ≤ θ. So if W ⊂ Ei 1+cos 2 θ 1−cos 2 satisfies ∀wj 6= wk ∈ W, φ(wj, wk) ≥ θ, then for each v ∈ V , |B(v, θ ) ∩ W| = 1. As 1+cos 2 W ⊂ Ei, we have

|W | = |W ∩ Ei| (4.70) !! 1 − cos θ = W ∩ ∪ B v, 2 (4.71) v∈V θ 1 + cos 2 ! 1 − cos θ = ∪ W ∩ B v, 2 (4.72) v∈V θ 1 + cos 2 ! X 1 − cos θ ≤ W ∩ B v, 2 (4.73) 1 + cos θ v∈V 2 X ≤ 1 (4.74) v∈V = |V | . (4.75) Therefore, we have proved that |W | is finite. Using this fact, we can construct a sequence of vectors wj ∈ Ei(j = 1, ··· , l) in the following way: 1. Let w1 ∈ Ei be any vector in Ei.

2. For j ≥ 2, let wj ∈ Ei be any vector satisfying

∀k = 1, ··· , j − 1, φ(wj, wk) ≥ θ, (4.76)

∃k ∈ {1, ··· , j − 1}, φ(wj, wk) = θ, (4.77) until we cannot find such vectors any more. 3. As we have proved that |W | is finite, the above process will end in finite steps. Assume that the last vector we found is indexed by l.

129 We will verify that such constructed vectors satisfy

∀j 6= k ∈ {1, ··· , l}, ρ(wj, wk) ≥ θ. (4.78)

According to the construction, φ(wj, wk) ≥ θ. As ρ(wj, wk) = min(φ(wj, wk), π−φ(wj, wk)), we only need to show that π − φ(wj, wk) ≥ θ. To show this is true, we use the definition of Ei to write wj as wj = (cos θi, 0, ··· , 0) + rj, where krjk2 = | sin θi|. Similarly, wk = 2 (cos θi, 0, ··· , 0) + rk, where krkk2 = | sin θi|. Therefore cos(φ(wj, wk)) = wj · wk ≥ cos θi − 2 sin θi = cos(2θi) ≥ cos(π − θ), where the last inequality follows from the construction of θi. So π − φ(wj, wk) ≥ θ, the proof for ρ(wj, wk) ≥ θ is completed. Now we will show that ∀e ∈ Ei, we can find j ∈ {1, ··· , l} such that φ(e, wj) ≤ θ. We prove it by contradiction: assume that there exists e such that minj∈{1,··· ,l} φ(e, wj) > θ, then as Ej is a connected set, there is a path q : t ∈ [0, 1] → Ej connecting e to w1. When t = 0, the path starts at q(0) = e; when t = 1, the path ends at q(1) = w1. We define functions rj(t) = φ(q(t), wj) for t ∈ [0, 1] and j = 1, ··· , l. It is straightforward to see that rj(t) is continuous, hence minj(rj(t)) ∗ is also continuous. As minj(rj(0)) > θ and minj(rj(0)) = 0 < θ, there exists t ∈ (0, 1) such ∗ that minj(rj(0)) = θ. Then q(t ) satisfies Condition 4.76, which contradicts the construction in W as the construction only ends when we cannot find such vectors. Hence we have proved that

∀e ∈ Ei, ∃j ∈ {1, ··· , l}, φ(e, wj) ≤ θ. (4.79)

Now we can proceed to prove the main lemma. For each i ∈ I, we use Condition 4.76 to construct a sequence of vectors wij. Then such constructed vectors have pairwise angles greater d than or equal to θ. Then for any e ∈ R with kek2 = 1, we write e in sphere coordinates as e = Qd (cos r1, sin r1 cos r2, ··· , j=1 sin rj). Use the same method as we did for the 2-dimensional 3 0 Qd case, we can find θi such that |θi−r| ≤ 2 θ. Then e = (cos θi, sin θi cos r2, ··· , sin θi j=2 sin rj) ∈ 0 3 0 Ei. It is easy to verify that φ(e, e ) = |θi −r| ≤ 2 θ. As e ∈ Ei, there exists wij as we constructed 0 0 0 5 such that φ(e , wij) ≤ θ. So φ(e, wij) ≤ φ(e, e ) + φ(e , wij) ≤ 2 θ < 3θ. So we have proved d that for any e ∈ R with kek2 = 1, we can find wij such that φ(e, wij) < 3θ. ∗ For any wij, assume i + 1 ∈ I, we first project wij to w ∈ Ei+1. We have proved that ∗ 3 ∗ 0 0 φ(wij, w ) ≤ 2 θ and also proved that we can find wi+1,j ∈ Ei+1 such that φ(wi+1,j , w ) ≤ θ. 5 0 0 So we have found wi+1,j such that φ(wij, wi+1,j ) ≤ 2 θ < 3θ. We can use similar scheme to prove that ∀wij, there exist different wi1,j1 ··· , wi2k+1,j2k+1 such that (ir, jr) 6= (i, j) and

φ(wij, wir,jr ) ≤ 3rθ. Following the same proof as the 2-dimensional case, we can prove that if m m ≤ 2k + 1, then we can find a set of vectors (wj)j=1 such that

0 ∀j ∈ {1, ··· , m}, φ(wj, wj) ≤ min(3mθ, π), (4.80) 0 ∀j ∈ {1, ··· , m}, kwjk2 = kwjk2, (4.81)

∀j 6= k, ρ(wj, wk) ≥ θ. (4.82)

The proof completes.

130 4.3.4 Proof of Theorem 4 Proof Sketch

d×m Define W = {W ∈ R |Ω(W) ≤ τ} and Ce(W) = supW∈W C(W). The following lemma shows that the estimation error can be upper bounded using C(W).

Lemma 18. Suppose sup(x,y)kx−yk2 ≤ B0 and kwk2 ≤ D, then with probability at least 1−δ, we have q q 2 (Ce(W)+1)m 2 2 log(1/δ) (4.83) L(u) − Lb(u) ≤ 4B0 D N + [B0 (Ce(W) + m) + 1] N .

The upper bound is an increasing function of Ce(W). The following lemma shows that C(W) can be upper bounded by an increasing function of the LDD regularizer Ωldd(W). Lemma 19. Let g(x) = x − log(x + 1). Then we have

−1 C(W) ≤ g (Ωldd(W)/m)m. where g−1(·) is the inverse function of g on [0, ∞) and is an increasing function. −1 Since Ωldd(W) ≤ τ, we have C(W) ≤ g (τ/m)m for any W ∈ W, i.e., Ce(W) ≤ g−1(τ/m)m. Substituting this inequality into Lemma 18, we obtain the bound in Eq.(4.5) in Theorem 4.

Proof of Lemma 18

2 2 Proof. Let U = {u :(x, y) → kW(x − y)k2} be the set of hypothesis u(x, y) = kW(x − y)k2, and R(U) be the Rademacher complexity [32] of U which is defined as:

N 1 X R(U) = E sup σ kW(x − y )k2, SN ,σ n n n n 2 u∈U n=1 where SN = ((x1, y1, t1), (x2, y2, t2) ··· (xn, yn, tN )) are the training examples, σn ∈ {−1, 1} are the Rademacher variables, and σ = (σ1, σ2, ··· σN ). Lemma 20 shows that the estimation error can be bounded using the Rademacher complexity. Its proof is adapted from [32]. Readers only need to notice x + 1 is an upper bound of log(1 + exp(x)) for x > 0. Lemma 20. With probability at least 1 − δ, we have r 0 2 2 log(1/δ) L(u) − Lb(u) ≤ 2R(U) + sup (kW (x − y)k2 + 1) . (4.84) x,y,W0∈W N

0 2 We then bound R(U) and supx,y,W0∈W kW (x − y)k2. The result is in the following lemma.

Lemma 21. Suppose sup(x,y)kx − yk2 ≤ B0 and kwk2 ≤ D, then we have √ 2B2D mq R(U) ≤ 0√ Ce(W) + 1, N

131 and 0 2 2 sup kW (x − y)k2 ≤ B0 (Ce(W) + m). x,y,W0∈W Pm Proof. We first give bound on R(U). Let R(S) = {s :(x, y) → j=1 |hwj, x − yi|, W ∈ W} Pm be the set of hypothesis s(x, y) = j=1 |hwj, x − yi|. Denote |hwj, xn − yni| = hwj, an,j(xn − yn)i, where an,j ∈ {−1, 1}. Then

N m 1 X X R(S) = ESN ,σ sup σn hwj, an,j(xn − yn)i. (4.85) W∈W N n=1 j=1

We first bound R(S).

m N 1 X X R(S) = ESN ,σ sup hwj, σnan,j(xn − yn)i W∈W N j=1 n=1 m N 1 X X = ESN ,σ sup h wj, σnan,j(xn − yn)i W∈W N j=1 n=1 m N 1 X X ≤ ESN ,σ sup k wjk2k σnan,j(xn − yn)k2 W∈W N j=1 n=1 v v u m m u N N 1 u X X u X X = ESN ,σ sup th wj, wjith σnan,j(xn − yn), σnan,j(xn − yn)i. W∈W N j=1 j=1 n=1 n=1 (4.86)

Applying Jensen’s inequality to Eq.(4.86), we have v v u m u N N 1 u X u X X R(S) ≤ ESN sup t |hwj, wki|tEσh σnan,j(xn − yn), σnan,j(xn − yn)i. W∈W N j,k=1 n=1 n=1 (4.87) Pm Combining Eq.(4.87) with the inequality j,k=1 |hwj, wki − δj,k| ≤ mC(W), we have v u N 1 p uX R(S) ≤ E sup mC(W) + mt kx − y k SN N n n 2 W∈W n=1 √ m p ≤ √ sup (C(W) + 1)B0. N W∈W Let w denote any column vector of W ∈ W and x denote any data example. According to the

132 composition property of Rademacher complexity (Theorem 12 in [32]), we have

R(U) ≤ 2 suphw, xiR(S) w,x

≤ 2 supkwk2B0R(S) w √ 2B2D mq ≤ 0√ Ce(W) + 1. N

0 2 Next we give bound on supx,y,W0∈W kW (x − y)k2.

m 0 2 X 0 0 2 sup kW (x − y)k2 ≤ sup hwj, wji supkx − yk2 0 0 x,y,W ∈W W ∈W j=1 (x,y) 0> 0 2 = sup tr(W W ) supkx − yk2 W0∈W (x,y) 2 ≤ (Ce(W) + m)B0

Combining Lemma 21 with Lemma 20, we complete the proof of Lemma 18.

Proof of Lemma 19 Proof. The function g(x) = x−log(x+1) is decreasing on(−1, 0], increasing on [0, ∞), g(0) = 0, and g(−t) > g(t) for ∀0 ≤ t < 1. We have

> > Ωldd(W) = tr(W W) − log det(W W) − m m X = g(πj − 1) j=1 m X ≥ g(|πj − 1|) j=1 m 1 X ≥ g( |π − 1|)m m j j=1 = g(C(W)/m)m.

The first inequality is due to g(−t) > g(t), and the second inequality can be attained by Jensen’s inequality. Finally we have g(C(W)/m)m ≤ Ωldd(W). Thus, we have −1 C(W) ≤ g (Ωldd(W)/m)m.

133 Extend the Analysis to the VND Regularizer Next, we extend the estimation error analysis to the VND regularizer. The following lemma shows that C(W) can be upper bounded by an increasing function of the VND regularizer Ωvnd(W). Lemma 22. Define a real function g = (x + 1) log(x + 1) − x, which is increasing on [0, ∞). Then we have: −1 C(W) ≤ mg (Ωvnd(W)/m), where g−1 is the inverse function of g. Proof. g is decreasing on (−1, 0] and increasing on [0, ∞). g(0) = 0, g(−t) > g(t) for ∀0 ≤ t < 1. We have

Ωvnd(W) = tr(G log G) − tr(G) + m m X = g(πj − 1) j=1 m X ≥ g(|πj − 1|) j=1 m 1 X ≥ mg( |π − 1|) m j j=1 = mg(C(W)/m),

> m where G = W W and {πj}j=1 are the eigenvalues of G. The first inequality is due to g(−t) > g(t), and the second inequality can be attained by Jensen’s inequality. Finally we have

g(C(W)/m) ≤ Ωvnd(W). Thus, we have −1 C(W) ≤ mg (Ωvnd(W)/m).

Combining Lemma 22 with Lemma 18, we obtain the bound in Eq.(4.6) in Theorem 4.

Analysis of the SFN Regularizer For SFN, the analysis approach is different. Instead of leveraging the intermediate capacity variable, we directly bound the estimation error with SFN. The following lemma gives an upper bound of the Rademacher complexity.

Lemma 23. Suppose sup(x,y)kx − yk ≤ B, then we have 2√ 2B m > √ R(U) ≤ √ (kW W − IkF + m), N and 0 2 √ > 2 sup kW (x − y)k2 ≤ sup ( mkW W − IkF + m)B . x,y,W0∈W W0∈W

134 Proof. We first prove the bound of R(U). Recall the definition of R(S) in Eq.(4.85). By the proof of Lemma 21, we have v u m B u X R(S) ≤ √ ESN sup t |hwj, wki|, (4.88) W∈W N j,k=1

R(U) ≤ 2 sup kwjk2BR(S). (4.89) W∈W > > > By the√ properties of matrix norm, we have kW WkF ≤ kW W − IkF + kIkF ≤ kW W − IkF + m. Further, by Jensen’s inequality, we have v u m q √ q √ u X 4 2 > 2 > sup t |hwj, wki| ≤ m kW WkF ≤ m kW W − IkF + m. (4.90) W∈W j,k=1

Also note that q q √ p > > sup kwjk2 = hwj, wji ≤ kW WkF ≤ kW W − IkF + m. (4.91) W∈W Combining Eq.(4.88) and Eq.(4.91), we have

2√ 2B m > √ R(U) ≤ √ (kW W − IkF + m). N

0 2 We then drive the bound on supx,y,W0∈W kW (x − y)k2.

0 2 > 2 sup kW (x − y)k2 ≤ sup tr(W W) supkx − yk x,y,W0∈W W∈W (x,y) q > 2 2 ≤ mkW WkF B √ > 2 ≤ ( mkW W − IkF + m)B

2 Combining Lemma 23 with Lemma 20 and noting that kW − IkF ≤ τ, we complete the proof of the estimation error bound w.r.t SFN given in Eq.(4.7).

4.3.5 Proof of Theorem 5 Proof Sketch Part of the proof is tailored to the CVND regularizer. Extensions to CSFN and CLDD are given later. The proof is based on the Rademacher complexity (RC) [32], which measures the com- plexity of a hypothesis class. The Rademacher complexity R(M) of the function class M is defined as:

135 m 1 X > R(M) = ES,D,σ sup σi(xi − yi) M(xi − yi), M∈M m i=1 where m is the number of data pairs in the training data (m = |S| + |D|), σi ∈ {−1, 1} is the Rademacher variable, and σ = (σ1, σ2, ··· σm). We first establish an upper bound of the estimation error based on RC. Intuitively, a less- complicated hypothesis class generalizes better on unseen data. Then we upper bound the RC based on the CBMD regularizers. Combining the two steps together, we establish upper bounds of the estimation error based on the CBMD regularizers. The following lemma presents the RC-based upper bound of the estimation error. Its proof is adapted from [32]. Lemma 24. With probability at least 1 − δ, we have r 2 log(1/δ) sup (L(M) − Lb(M)) ≤ 2R(M) + max(τ, sup (x − y)>M(x − y)) . M∈M (x, y) ∈ S m M ∈ M (4.92) For the second term in the bound, it is easy to verify

> 2 sup (x − y) M(x − y) ≤ sup tr(M) sup kx − yk2. (4.93) (x, y) ∈ S M∈M (x,y)∈S M ∈ M

Now we focus on the first term. We denote z = x − y, zi = xi − yi. > Lemma 25. Suppose supkvk2≤1,z |v z| ≤ B, then we have

2B2 R(M) ≤ √ sup tr(M). (4.94) m M∈M

We next show that tr(M) can be bounded by the CVND regularizer Ωbvnd(M). Lemma 26. For the convex VND regularizer Ωbvnd(M), for any positive semidefinite matrix M, we have tr(M) ≤ Ωbvnd(M). ∗ ∗ Combining Lemma 24, 25, 26 and Eq.(4.93) and noting that E = L(Mc )−Lb(Mc ) ≤ supM∈M(L(M)− Lb(M)) and Ωbvnd(M) ≤ C (C is the upper bound in the hypothesis class M), we complete the proof of the first bound in Theorem 5. In the sequel, we present detailed proofs of these lemmas and extend the results to CSNF and CLDD.

Proof of Lemma 25 Proof. For any M ∈ M, denote its spectral decomposition as M = VΠV>, where V is a standard orthogonal matrix and Π is a diagonal matrix. Denote V = (v1, v2, ··· vD), Π =

136 diag(π1, π2, ··· πD), then we have " m # 1 X T R(M) = ES,D,σ sup σizi Mzi M∈M m i=1 " m D # 1 X X > 2 = ES,D,σ sup σi πj(vj zi) m M∈M i=1 j=1 " D m # 1 X X > 2 = ES,D,σ sup πj σi(vj zi) m M∈M j=1 i=1 " D m # 1 X X > 2 = ES,D,σ sup πj sup σi(v zi) m M∈M j=1 kvk2≤1 i=1 D m 1 X X > 2 = ES,D,σ sup πj sup σi(v zi) m Π j=1 kvk2≤1 i=1 m 1 X > 2 ≤ sup tr(M)ES,D,σ sup σi(v zi) . m M∈M kvk2≤1 i=1 > 2 > > Since (v z) is Lipschitz continuous w.r.t v z with constant 2 supkvk2≤1,z v z, according to the composition property [32] of Rademacher complexity on Lipschitz continuous functions, we have m 1 > X > R(M) ≤ 2 sup (v z) sup tr(M)ES,D,σ sup σiv zi m M∈M kvk2≤1,z kvk2≤1 i=1 m B X > = 2 sup tr(M)ES,D,σ sup σiv zi m M∈M kvk2≤1 i=1 m B X ≤ 2 sup tr(M)ES,D,σ sup kvk2k σizik2 m M∈M kvk2≤1 i=1 v u m B u X 2 = 2 sup tr(M)ES,D,σt( σizi) . m M∈M i=1 By Jensen’s inequality, we have v u m B u X 2 R(M) ≤ 2 sup tr(M)ES,DtEσ( σizi) m M∈M i=1 v u m B uX 2 =≤ 2 sup tr(M)ES,Dt zi m M∈M i=1 2B2 ≤ √ sup tr(M). m M∈M

137 Proof of lemma 26 Proof. By the definition of the convex VND regularizer, we have

Ωbvnd(M) =Γvnd(M + ID, ID) + tr(M)

=tr[(M + ID) log(M + ID) − (M + ID) log ID − (M + ) + ID] + tr(M) D D X X = [(πj + ) log(πj + ) − (πj + ) + 1] + πj j=1 j=1 D X = [(λj + ) log(λj + ) −  + 1]. j=1

PD Denote π¯ = ( j=1 πj)/D = tr(M)/D, then by Jensen’s inequality, we have

D X (λj + ) log(λj + ) ≥ D(¯π + ) log(¯π + ). j=1

Since ∀x ∈ R+, x − 1 ≤ x log x, we have π¯ +  − 1 ≤ (¯π + ) log(¯π + ) D 1 X ≤ (λ + ) log(λ + ) D j j j=1 1 ≤ Ωbvnd(M) +  − 1. D Therefore we have tr(M) ≤ Ωbvnd(M).

Estimation Error Bound for the Convex SFN Regularizer In this section we prove estimation error bounds for the convex SFN regularizer. The CSFN is composed of two parts. One is the squared Frobenius norm of M − ID and the other is the trace of M. We have already established a relationship between tr(M) and R(M). Now we analyze the relationship between kM − IDkF and R(M), which is given in the following lemma. > Lemma 27. Suppose supkvk2≤1,z |v z| ≤ B, then we have

B2 R(M) ≤ √ sup kM − IDkF . (4.95) m M∈M

138 Proof. Denote M(j, k) = ajk, and δjk = I{j=k}, zi = (zi1, zi2, ··· zid), then we have

" m # 1 X X R(M) = ES,D,σ sup ajk σizijzik m M∈M j,k i=1 " m m # 1 X X X X = ES,D,σ sup (ajk − δjk) σizijzik + δjk σizijzik m M∈M j,k i=1 j,k i=1  v  u m 1 uX X 2 ≤ ES,D,σ sup kM − IDkF t ( σizijzik)  . m M∈M j,k i=1

Here the inequality is attained by Cauchy’s inequality. Applying Jensen’s inequality, we have v  u m 1 u X X 2 R(M) ≤ sup kM − IDkF ES,D tEσ ( σizijzik)  m M∈M j,k i=1   s 1 X 2 2 = √ sup kM − IDkF ES,D  zijzik . m M∈M j,k

Recalling the definition of B, we have

B2 R(M) ≤ √ sup kM − IDkF . m M∈M

We now bound the estimation error with the convex SFN regularizer, which is given in the following lemma. > Lemma 28. Suppose supkvk2≤1,z |v z| ≤ B, then with probability at least 1 − δ, we have r 2B2 q 2 log(1/δ) sup (L(M) − Lb(M)) ≤ √ min(2Ωbsfn(M), Ωbsfn(M)) + max(τ, Ωbsfn(M)) . M∈M m m

Proof. For the convex SFN regularizer Ωbsfn(M), we have tr(M) ≤ Ωbsfn(M) and kM − IDk ≤ Ωbsfn(M). By Eq.(4.93), we have

> 2 sup (x − y) M(x − y) ≤ sup Ωbsfn(M)B . (4.96) (x, y) ∈ S M∈M M ∈ M

By Lemma 25 and 27, we have

B2 q R(M) ≤ √ min(2Ωbsfn(M), Ωbsfn(M)). (4.97) m

139 Substituting Eq.(4.97) and Eq.(4.96) into Lemma 24, we have r 2B2 q 2 log(1/δ) sup (L(M) − Lb(M)) ≤ √ min(2Ωbsfn(M), Ωbsfn(M)) + max(τ, Ωbsfn(M)) . M∈M m m

∗ ∗ Noting that E = L(Mc ) − Lb(Mc ) ≤ supM∈M(L(M) − Lb(M)) and Ωbsfn(M) ≤ C, we 2 √ q 2 log(1/δ) 2√B conclude E ≤ m min(2C, C) + max(τ, C) m .

Estimation Error Bound for the Convex LDD Regularizer

Starting from Lemma 24, we bound R(M) and supM∈M tr(M) which are given in the following two lemmas. > Lemma 29. Suppose supkvk2≤1,z |v z| ≤ B, then we have

B Ωbldd(M) R(M) ≤ √ . m log(1/) − 1

Proof. We first perform some calculation on the convex LDD regularizer.

Ωbldd(M) = Γldd(M + ID, ID) − (1 + log )tr(M) −1 −1 = tr((M + ID)ID ) − log det((M + ID)ID ) − D − (1 + log )tr(M) PD PD PD (4.98) = j=1(πj + ) − j=1 log(πj + ) − D − (1 + log ) j=1 πj 1 PD PD = log(  ) j=1 πj − j=1 log(πj + ) − D(1 − ). Now we upper bound the Rademacher complexity using the CLDD regularizer.

1 " D m # 1 log(  ) X X > 2 log( )R(M) = ES,D,σ sup πj σi(vj zi)  m M∈M j=1 i=1 D m 1 X 1 X > 2 ≤ ES,D,σ sup [(log( )πj − log(πj + )) + log(πj + )] sup σi(v zi) . m Π  j=1 kvk2≤1 i=1

Similar to the proof of Lemma 25, we have

2 D 1 2√B P 1 log(  )R(M) ≤ m supΠ j=1[(log(  )πj − log(πj + )) + log(πj + )] 2 D (4.99) 2√B P ≤ m [supM∈M Ωbldd(M) + supM∈M j=1 log(πj + )].

PD PD Denoting A = j=1 log(πj + ), we bound A with Ωbldd(M). Denoting π¯ = ( j=1 πj)/D = tr(M)/D, by Jensen’s inequality, we have

A ≤ D log(¯π + ), (4.100)

140 then π¯ ≥ eA/D − . Replacing π¯ with A in Eq.(4.98), we have

A/D Ωbldd(M) ≥D log(1/)(e − ) − A − D(1 − ) A ≥D log(1/)( + 1 − ) − A − D(1 − ) D 1 =(log(1/) − 1)A + [log( ) − 1]D(1 − ).  Further, Ωbldd(M) A ≤ 1 − D(1 − ). (4.101) log(  ) − 1 Substituting this upper bound of A into Eq.(4.99), we have

2 2B sup Ωbldd(M) R(M) ≤ √ M∈M . m log(1/) − 1

The next lemma shows the bound of tr(M). Lemma 30. For any positive semidefinite matrix M, we have

Ωbldd(M) − D tr(M) ≤ 1 . log(  ) − 1 Proof.

Ωbldd(M) ≥D log(1/)¯π − D log(¯π + ) − D(1 − ) ≥D log(1/)¯π + D(1 − π¯) − D(1 − ) =D[log(1/) − 1]¯π + D.

Then Ωbldd(M) − D tr(M) = Dπ¯ ≤ 1 . log(  ) − 1

Combining Lemma 29, 30, and 24, we get the following estimation error bound w.r.t the convex LDD regularizer. > Lemma 31. Suppose supkvk2≤1,z |v z| ≤ B, then with probability at least 1 − δ, we have

2 r 4B Ωbldd(M)  Ωbldd(M) − D 2 log(1/δ) sup (L(M) − Lb(M)) ≤ √ + max τ, 1 . M∈M m [log(1/) − 1] log(  ) − 1 m

141 Chapter 5

Large-scale Learning via System and Algorithm Co-design

In this chapter, we present another line of research in this thesis: large-scale machine learning, where we design efficient distributed systems [370, 377] for healthcare applications at scale. To execute ML algorithms efficiently on a distributed system, it is important to perform system and algorithm co-design. On one hand, ML algorithms possess unique properties such as iterativeness and error tolerance [162], which can be leveraged to design efficient systems. On the other hand, based on the features of the system such as parallelism and (a)synchronicity, ML algorithms can be adjusted for more efficient execution. Through a principled system and algorithm co-design, we build the Orpheus system for train- ing matrix-parameterized models (MPMs), a very general class of ML models including multi- class logistic regression (MLR), deep neural networks, sparse coding [266], collaborative filter- ing [54], topic models [49], etc. In MPMs, model parameters can be represented by a matrix W. For example, in MLR, rows of W represent the classification coefficient vectors corresponding to different classes; whereas in SC rows of W correspond to the basis vectors used for recon- structing the observed data. A learning algorithm, such as stochastic gradient descent (SGD), would iteratively compute an update ∆W from data, to be aggregated with the current version of W. Learning MPMs in large scale ML problems is challenging: ML application scales have risen dramatically, a good example being the ImageNet [95] compendium with millions of im- ages grouped into tens of thousands of classes. To ensure fast running times when scaling up MPMs to such large problems, it is desirable to turn to distributed computation; however, a unique challenge to MPMs is that the parameter matrix grows rapidly with problem size, causing straightforward parallelization strategies to perform less ideally. Consider a data-parallel algo- rithm, in which every worker uses a subset of the data to update the parameters — a common paradigm is to synchronize the full parameter matrix and update matrices amongst all workers [78, 93, 94, 136, 222, 304]. However, this synchronization can quickly become a bottleneck: take MLR for example, in which the parameter matrix W is of size J × D, where J is the number of classes and D is the feature dimensionality. In one application of MLR to Wikipedia [272], J = 325k and D > 10, 000, thus W contains several billion entries (tens of GBs of mem- ory). Because typical computer cluster networks can only transfer a few GBs per second at the

142 most, inter-machine synchronization of W can dominate and bottleneck the actual algorithmic computation. In addition to the communication overhead, checkpointing these matrices for fault tolerance purposes incurs substantial disk IO. In recent years, many distributed frameworks have been developed for large scale machine learning, including bulk synchronous parallel (BSP) systems such as Hadoop [93] and Spark [407], graph computation frameworks such as GraphLab [132], and bounded-asynchronous key- value stores such as TensorFlow [13], Bosen [162, 355], Project Adam [78], and [226]. When using these systems to learn MPMs, it is common to transmit the full parameter matrices W and/or matrix updates ∆W between machines, usually in a server-client style [78, 93, 94, 136, 222, 304]. As the matrices become larger due to increasing problem sizes, so do communication costs and synchronization delays — hence, reducing such costs is a key priority when using these frameworks. To facilitate system-algorithm co-design, we begin by investigating the structure of matrix parameterized models. Many MPMs bear the following property: when the parameter matrix W is optimized with SGD [78, 94, 162] or stochastic dual coordinate ascent (SDCA) [167, 299], the update 4W computed over one (or a few) data sample(s) is of low-rank, e.g. it can be written as the outer product of two vectors u and v: 4W = uv>. The vectors u and v are sufficient factors (SF, meaning that they are sufficient to reconstruct the update matrix 4W). A rich set of models [78, 218, 266, 383] fall into this family: for instance, when solving an MLR problem using SGD, the stochastic gradient is 4W = uv>, where u is the prediction probability vector and v is the feature vector. Similarly, when solving an `2 regularized MLR problem using SDCA, the update matrix 4W also admits such as a structure, where u is the update vector of a dual variable and v is the feature vector. Other models include neural networks [78], distance metric learning [383], sparse coding [266], non-negative matrix factorization [218], and principal component analysis, to name a few. Leveraging this property, we propose a computation model called sufficient factor broad- casting (SFB). The basic idea is to send sufficient factors (SFs) between workers, which then reconstruct matrix updates 4W locally, thus greatly reducing inter-machine parameter com- munication. This stands in contrast to the well-established parameter server idiom [78, 226], a centralized design where workers maintain a “local” image of the parameters W, which are synchronized with a central parameter image W (stored on the “parameter servers”). In existing parameter server designs, the (small, low-rank) updates 4W are accumulated into the central parameter server’s W, and the low-rank structure of each update 4W is lost in the process. Thus, the parameter server can only transmit the (large, full-rank) matrix W to the workers, inducing extra communication that could be avoided. We address this issue by a decentralized, peer-to-peer architecture that performs SFB, where each worker keeps its own image of the pa- rameters W (either in memory or on local disk), and sends sufficient factors to other workers for parameter synchronization. SFB is highly communication-efficient; transmission costs are linear in the dimensions of the parameter matrix, and the resulting faster communication greatly reduces waiting time in synchronous systems (e.g. 
Hadoop and Spark), or improves parameter freshness in (bounded) asynchronous systems (e.g. GraphLab, Petuum-PS and [226]). Most im- portantly, such savings are achieved without compromising the mathematical equivalence of the new ML solution to the original one. SFs have been used to speed up some (but not all) network communication in deep learning [78]; our work differs primarily in that we always transmit SFs,

143 never full matrices. In addition to the SFB computation model, we propose a battery of system designs (SD) and algorithm designs (ADs) for efficient communication, light-weight fault tolerance and recovery, and easiness of programming. In communication, an AD – SF selection and a SD – random mul- ticast are used to further reduce the size of each network message and the number of messages. In fault tolerance, based on an AD – using SFs to represent parameter states, an SD – incremen- tal SF checkpoint (ISFC) – is used, which continuously saves new SFs computed at each clock to stable storage. Compared with checkpointing parameter matrices, ISFC greatly reduces disk IO, avoids compute-cycle waste, and provides fine-grained (per-clock) rollbacks. In addition, we provide an easy-to-use programming model where the generation of SFs is automatically identified rather than being specified by users, which reduces users’ programming efforts. We provide theoretical analysis which shows that the algorithms executed using SFB are guaranteed to converge. We also extend SFB to a hybrid communication model where decen- tralized SFB and centralized parameter server are used together to train convolutional neural networks. Evaluations on a wide range of ML applications demonstrate the efficiency and scala- bility of our systems.

5.1 Sufficient Factor Property

The core goal of sufficient factor broadcasting (SFB) is to reduce network communication costs for matrix-parameterized models; specifically, those that follow an optimization formulation

N 1 P (P) min N fi(Wai) + h(W), (5.1) W i=1

J×D where the model is parametrized by a matrix W ∈ R . The loss function fi(·) is typically N defined over a set of training samples {(ai, bi)}i=1, with the dependence on bi being suppressed. We allow fi(·) to be either convex or nonconvex, smooth or nonsmooth (with subgradient every- where); examples include `2 loss and multiclass logistic loss, amongst others. The regularizer h(W) is assumed to admit an efficient proximal operator proxh(·). For example, h(·) could be an indicator function of convex constraints, `1-, `2-, trace-norm, to name a few. The vectors ai and bi can represent observed features, supervised information (e.g., class labels in classifica- tion, response values in regression), or even unobserved auxiliary information (such as sparse codes in sparse coding [266]) associated with data sample i. The key property we exploit below ranges from the matrix-vector multiplication Wai. This optimization problem (P) can be used to represent a rich set of ML models [78, 218, 266, 383], such as the following: • Distance metric learning (DML) [383] improves the performance of other ML algorithms, by learning a new distance function that correctly represents similar and dissimilar pairs of data samples; this distance function is a matrix W that can have billions of parameters or more, depending on the data sample dimensionality. The vector ai is the difference of the feature vectors in the i-th data pair and fi(·) can be either a quadratic function or a hinge loss function, depending on the similarity/dissimilarity label bi of the data pair. In both cases, h(·) can be an `1-, `2-, trace-norm regularizer or simply h(·) = 0 (no regularization).

144 • Sparse coding (SC) [266] learns a dictionary of basis from data, so that the data can be re- represented sparsely (and thus efficiently) in terms of the dictionary. In SC, W is the dictionary matrix, ai are the sparse codes, bi is the input feature vector, and fi(·) is a quadratic function [266]. To prevent the entries in W from becoming too large, each column Wk must satisfy kWkk2 ≤ 1. In this case, h(W) is an indicator function which equals 0 if W satisfies the constraints and equals ∞ otherwise. To solve the optimization problem (P), it is common to employ either (proximal) stochastic gradient descent (SGD) [78, 94, 162, 222] or stochastic dual coordinate ascent (SDCA) [167, 299], both of which are popular and well-established parallel optimization techniques.

Proximal SGD In proximal SGD, a stochastic estimate of the gradient, 4W, is first computed over one data sample (or a mini-batch of samples), in order to update W via W ← W − η 4 W

(where η is the learning rate). Following this, the proximal operator proxηh(·) is applied to W. Notably, the stochastic gradient 4W in (P) can be written as the outer product of two vectors > ∂f(Wai,bi) 4W = uv , where u = , v = ai, according to the chain rule. Later, we will show ∂(Wai) that this low rank structure of 4W can greatly reduce inter-worker communication.

Stochastic DCA SDCA applies to problems (P) where fi(·) is convex and h(·) is strongly con- vex (e.g. when h(·) contains the squared `2 norm); it solves the dual problem of (P), via stochastic J×N coordinate ascent on the dual variables. Introducing the dual matrix U = [u1,..., uN ] ∈ R D×N and the data matrix A = [a1,..., aN ] ∈ R , the dual problem of (P) can be written as

N 1 P ∗ ∗ 1 > (D) min N fi (−ui) + h ( N UA ), (5.2) U i=1

∗ ∗ where fi (·) and h (·) are the Fenchel conjugate functions of fi(·) and h(·), respectively. The primal-dual matrices W and U are connected by W = ∇h∗(Z), where the auxiliary matrix 1 > Z := N UA . Algorithmically, we need to update the dual matrix U, the primal matrix W, and the auxiliary matrix Z: every iteration, we pick a random data sample i, and compute the stochastic update 4ui by minimizing (D) while holding {uj}j6=i fixed. The dual variable is > updated via ui ← ui − 4ui, the auxiliary variable via Z ← Z − 4uiai , and the primal variable via W ← ∇h∗(Z). Similar to SGD, the update of Z is also the outer product of two vectors: 4ui and ai, which can be exploited to reduce communication cost.

Sufficient factor property in SGD and SDCA In both SGD and SDCA, the parameter matrix update can be computed as the outer product of two vectors — we call these sufficient factors (SFs). This property can be leveraged to improve the communication efficiency of distributed ML systems: instead of communicating parameter/update matrices among machines, we can communicate the SFs and reconstruct the update matrices locally at each machine. Because the SFs are much smaller in size, synchronization costs can be dramatically reduced. More generally, the update matrix 4W may not be exactly rank-1, but still of very low rank. For example, when each machine uses a mini-batch of size K, 4W is of rank at most K; in restricted Boltzmann machines, the update of the weight matrix is computed from four vectors

145 > > u1, v1, u2, v2 as u1v1 −u2v2 , i.e. rank-2; for the BFGS algorithm [40], the update of the inverse Hessian is computed from two vectors u, v as αuu> −β(uv> +vu>), i.e. rank-3. Even when the update matrix 4W is not genuinely low-rank, to reduce communication cost, it might still make sense to send only a certain low-rank approximation. We intend to investigate these possibilities in future work.

5.2 Orpheus: A Light-weight Peer-to-peer System

In this section, we present a peer-to-peer framework Orpheus where the system design is driven by the sufficient factor property. Orpheus is a decentralized system that executes data-parallel [78, 85, 94, 226, 355] distributed training of matrix-parameterized ML models. Orpheus runs on a group of worker machines connected via a peer-to-peer (P2P) network. Unlike the client-server architectures including the parameter server [78, 85, 94, 162, 226, 355], machines in Orpheus play equal roles without server/client asymmetry and every pair of machines can communicate. Each machine holds one shard of the data and a replica of the model parameters. Machines synchronize their model replicas to ensure consistency, by exchanging parameter-(pre)updates via network communication. Under this general framework, Orpheus applies a battery of system- algorithm co-designs to achieve efficiency in communication and fault tolerance. For efficient communication, the main idea is to represent the parameter update matrices by their corresponding SFs, which can be understood as “pre-updates”, meaning that the actual update matrices must be computed on each client upon receiving fresh SFs, and the update ma- trices themselves are never transmitted. Since the size of SFs is much smaller than matrices, the communication cost can be substantially reduced. Under a P2P architecture, in addition to void transmitting update matrices, Orpheus can also avoid transmitting parameter matrices, while still achieving synchrony. Besides, random multicast, under which each machine sends SFs to a randomly-chosen subset of machines, is leveraged to reduce the number of messages. SF selec- tion, which chooses a subset of representative SFs to communicate, is used to further reduce the size of each message. Orpheus uses incremental SF checkpoint for fault tolerance, motivated by the fact that the parameter states can be represented as a dynamically growing set of SFs. Machines continuously save the new SFs computed in each logical time onto stable storage. To recover a parameter state, Orpheus transforms the saved SFs into a matrix. Compared with checkpointing parameter matrices [13, 407], saving vectors requires much less disk IO and does not require the application program to halt. Besides, the parameters can be rollbacked to the state in any logical time. In programming abstraction, the SFs are explicitly exposed such that system-level optimiza- tions based on SFs can be exploited. Orpheus is able to automatically identify the symbolic expressions representing SFs and updates, relieving users’ burden to manually specify them. Orpheus supports two consistency models: bulk synchronous parallel (BSP) [42] and stale- ness synchronous parallel (SSP) [162]. BSP sets a global barrier at each clock. A worker cannot proceed to the next clock until all workers reach this barrier. SSP allows workers to have different paces as long as their difference in clock is no more than a user-defined staleness threshold.

146 u1,v1 Worker 1 Worker 2 Worker 1 Worker 2 u2,v2 W1 u1,v1 W W W2 u2,v2 u2,v2 u ,v u ,v u ,v u ,v Server u ,v W u ,v 3 3 1 1 4 4 2 2 3 3 1 1 1 u ,v 3 3 W W u4,v4 4 x u4,v4 W W3 u3,v3 Worker 3 Worker 4 Worker 3 Worker 4 u4,v4 (a) (b) Figure 5.1: Sufficient factor broadcasting.

5.2.1 Communication: Sufficient Factor Broadcasting (SFB) Orpheus explores system-algorithm co-design to perform efficient communication. To ensure the consistency among different parameter replicas, the updates computed at different machines need to be exchanged. One popular system architecture that enables this is parameter server (PS) [78, 85, 94, 226, 355], which conceptually consists of a server machine that maintains shared state of the parameters and a set of worker machines each having a local cache of the parameters. In PS, the updates computed at worker machines are aggregated at the server and applied to the shared state. The shared state is subsequently sent back to workers to refresh their local caches. When PS is used to train MPMs, the update and parameter matrices – which could contain billions of elements [214] – are transferred, incurring substantial communication overhead.

Sufficient Factor Broadcasting Leveraging the SF property of the update matrix in problems (P) and (D), we propose a sufficient factor broadcasting (SFB) computation model that supports efficient (low-communication) dis- tributed learning of the parameter matrix W. We assume a setting with P workers, each of which holds a data shard and a copy of the parameter matrix1 W. Stochastic updates to W are gen- erated via proximal SGD or SDCA, and communicated between machines to ensure parameter consistency. In proximal SGD, on every iteration, each worker p computes SFs (up, vp), based on one data sample xi = (ai, bi) in the worker’s data shard. The worker then broadcasts (up, vp) to all other workers; once all P workers have performed their broadcast (and have thus received all SFs), they re-construct the P update matrices (one per data sample) from the P SFs, and apply them to update their local copy of W. Finally, each worker applies the proximal operator proxh(·). When using SDCA, the above procedure is instead used to broadcast SFs for the auxil- iary matrix Z, which is then used to obtain the primal matrix W = ∇h∗(Z). Figure 5.1 illustrates the SFB operation: 4 workers compute their respective SFs (u1, v1), ... , (u4, v4), which are then

1For simplicity, we assume each worker has enough memory to hold a full copy of the parameter matrix W. If W is too large, one can either partition it across multiple machines [94, 226], or use local disk storage (i.e. out of core operation). We plan to investigate these strategies as future work.

147 1 1

6 2 6 2

5 3 5 3

4 4 t=1 t=2

Figure 5.2: Random multicast.

broadcast to the other 3 workers. Each worker p uses all 4 SFs (u1, v1),..., (u4, v4) to exactly > reconstruct the update matrices 4Wp = upvp , and update their local copy of the parameter P4 > matrix: Wp ← Wp − q=1 uqvq . While the above description reflects synchronous execution, it is easy to extend to (bounded) asynchronous execution. The communication cost of transmitting SFs is O(J + D) which is linear in matrix dimen- sions while that of transmitting update matrices (UMs) is O(JD) which is quadratic in matrix dimensions. Hence SF-transfer can greatly reduce communication overhead. The transformation from SFs to a UM is mathematically exact, without compromising computational correctness. In PS, the one-sided communication cost from workers to the server can be reduced by trans- mitting SFs [78]: each worker sends the new SFs to the server, where the received SFs are transformed to UMs to update the shared parameter state. However, since the parameter matrix cannot be computed from a few SFs, from server to workers the newly-updated parameters needs to be sent as a matrix, which still incurs high communication overhead. To avoid transmitting parameter matrices, Orpheus adopts a decentralized peer-to-peer architecture where worker ma- chines synchronize their parameter replicas by exchanging updates in the form of SFs. Unlike PS, the P2P architecture does not maintain the shared parameter state and can completely avoid transmitting any type of matrices. While SF-transfer greatly reduces communication cost, it increases computation overhead. Each group of SFs are transformed into the same update multiple times (on different receivers). However, in-memory computation is usually much more efficient than inter-machine network communication, especially with the advent of GPU computing, hence the reduction in commu- nication cost overshadows the increase of computation overhead.

Random Multicast While P2P transfer of SFs greatly reduces the size of each message (from a matrix to a few vectors), its limitation is SFs need to be sent from each machine to every other machine, which renders the number of messages per clock to be quadratic in the number of machines P . To address this issue, Orpheus adopts random multicast: in each clock, each machine randomly selects Q(Q < P − 1) machines to send SFs to. This cuts messages per clock from O(P 2) to t O(PQ). Figure 5.2 shows an example. In each iteration t, an update Up generated by machine t p is sent only to machines that are directly connected with p (and the update Up takes effect at

148 t iteration t + 1). The effect of Up is indirectly and eventually transmitted to every other machine q, via the updates generated by machines sitting between p and q in the topology. This happens at iteration t+τ, for some delay τ > 1 that depends on Q and the location of p and q in the network topology. Consequently, the P machines will not have the exact same parameter image W, even under bulk synchronous parallel execution — yet this does not empirically (Section 5.2.5) compromise algorithm accuracy as long as Q is not too small. We hypothesize that this property is related to the tolerance of ML algorithms to bounded-asynchronous execution and random error, and we provide a theoretical analysis in Section 5.4. Two random selection methods are provided. One is uniform selection: each machine has the same probability to be selected. The other is prioritized selection for load-balancing purpose. Each machine is assigned a priority score based on its progress (measured by clock). A machine with faster progress (higher priority) is selected with higher probability and receives more SFs from slower machines. It spends more compute cycles to consume these remote SFs and slows down the computation of new SFs, giving the slower machines a grace time to catch up. Unlike a deterministic multicast topology [222] where each machine communicates with a fixed set of machines throughout the application run, random multicast provides several bene- fits. First, dynamically changing the topology in each clock gives every two machines a chance to communicate directly, which facilitates more symmetric synchronization. Second, random multicast is more robust to network connection failures since the failure of a network connec- tion between two machines will not affect their communication with other one. Third, random multicast makes resource elasticity simpler to implement: adding and removing machines re- quire minimal coordination with existing ones, unlike a deterministic topology which must be modified every time a worker joins or leaves.

SF Selection In ML practice, parameter updates are usually computed over a small batch (whose size typically ranges from tens to hundreds) of examples. At each clock, a batch of K training examples are selected and an update is generated with respect to each example. When represented as matrices, these K updates can be aggregated into a single matrix to communicate. Hence the commu- nication cost is independent of K. However, this is not the case in sufficient factors transfer: the K SFGs cannot be aggregated into one single SFG; they must be transferred individually. Therefore, communication cost grows linearly with K. To alleviate this cost, Orpheus provides SF selection (SFS), which chooses a subset of C SFGs (where C < K) – that best represent the entire batch – to communicate. We design an efficient sampling-based algorithm called joint matrix column subset selection (JMCSS) to perform SFS. Given the P matrices X(1), ··· ,X(P ) where X(p) stores the p-th SF of all SFGs, JMCSS selects a subset of non-redundant column vectors from each matrix to approxi- mate the entire matrix. The selection of columns in different matrices are tied together, i.e., if the i-th column is selected in one matrix, for all other matrices their i-th column must be selected as (p) well to atomically form an SFG. Let I = {i1, ··· , iC } index the selected SFGs and SI be a ma- trix whose columns are from X(p) and indexed by I. The goal is to find out the optimal selection PP (p) (p) (p) † (p) I such that the following approximation error is minimized: p=1 kX − SI (SI ) X k2,

149 Algorithm 3: Joint matrix column subset selection. (p) P Input: {X }p=1 (p) (p) (p) Initialize: ∀ p, X0 = X , S0 = [] for t ∈ {1,...,C} do (p) P Compute the squared L2 norm of column vectors in {Xt−1}p=1 Sample a column index it (p) (p) (p) (p) (p) (p) ∀ p, Xt ← Xt−1/{xit }, St ← St−1 ∪ {xit } (p) (p) (p) (p) † (p) ∀ p, Xt ← Xt − St (St ) Xt end for (p) P Output: {S }p=1

(p) † (p) where (SI ) is the pseudo-inverse of SI . Finding the exact solution of this problem is NP-hard. To address this issue, we develop a sampling-based method (Algorithm 3), which is an adaptation of the iterative norm sampling algorithm [99]. Let S(p) be a dynamically growing matrix that stores the column vectors to be se- (p) (p) (p) (p) lected from X and St denote the state of S at iteration t. Accordingly, X is dynamically (p) shrinking and its state is denoted by Xt . At the t-th iteration, an index it is sampled and the it-th (p) P (p) P column vectors are taken out from {X }p=1 and added to {S }p=1. it is sampled in the fol- (p) P lowing way. First, we compute the squared L2 norm of each column vector in {Xt−1}p=1. Then QP (p) 2 (p) sample it (1 ≤ it ≤ K +1−t) with probability proportional to p=1 kxit k2, where xit denotes (p) (p) the it-th column vector in Xt−1. The selected column vector is removed from Xt−1 and added to (p) (p) (p) (p) (p) (p) † (p) St−1. Then a back projection is utilized to transform Xt : Xt ← Xt −St (St ) Xt . After (p) P C iterations, we obtain the selected SFs contained in {S }p=1 and pack them into SFGs, which are subsequently sent to other machines. Under JMCSS, the aggregated update generated from the C SFGs is close to that computed from the entire batch. Hence SFS does not compromise parameter-synchronization quality. The selection of SFs is pipelined with their computation and communication to increase throughput. Two FIFO queues (denoted by A and B) containing SFs are utilized for coordination. The computation thread adds newly-computed SFs into queue A. The selection thread dequeues SFs from A, executes the selection and adds the selected SFs to queue B. The communication thread dequeues SFs from B and sends them to other machines. The three modules operate asynchronously: for each one, as long as its input queue is not empty and output queue is not full, the operation continues. The two queues can be concurrently accessed by their producer and consumer.

Cost Analysis Figure 5.3 compares the communications, space and time (to apply updates to W) costs of peer- to-peer SFB, against parameter server (PS) architectures [94, 226, 355]. For SFB with a broad- casting scheme, in each mini-batch, every worker broadcasts K SF pairs (u, v) to P − 1 other workers, i.e. O(P 2K(J + D)) values are sent per iteration — linear in matrix dimensions J, D,

150 Computational Total comms, per iter W storage per ma- W update time, per Model chine iter SFB (peer-to-peer, O(P 2K(J + D)) O(JD) O(P 2KJD) broadcast) SFB (peer-to-peer, O(P QK(J + D)) O(JD) O(P QKJD) multicast) PS (client-server) O(PJD) O(JD) O(PJD) at server, O(PKJD) at clients Figure 5.3: Cost of using SFB versus PS. K is the mini-batch size, J, D are the dimensions of W, and P is the number of workers. and quadratic in P . For SFB with a multicast scheme, every worker communicates SF pairs with Q < P peers, hence the communication cost is reduced to O(P QK(J + D)). Because SF pairs cannot be aggregated before transmission, the cost has a dependency on K. In contrast, the com- munication cost in PS is O(PJD), linear in P , quadratic in matrix dimensions, and independent of K. For both SFB and PS, the cost of storing W is O(JD) on every machine. As for the time taken to update W per iteration, PS costs O(PJD) at the server (to aggregate P client up- date matrices) and O(PKJD) at the P clients (to aggregate K updates into one update matrix). By comparison, SFB bears a cost of O(P 2KJD) under broadcasting and O(P QKJD) under multicast due to the additional overhead of reconstructing each update matrix P or Q times. Compared with PS, SFB achieves communication savings by paying an extra computation cost. In a number of practical scenarios, such a tradeoff is worthwhile. Consider large problem scales where min(J, D) ≥ 10000, and moderate mini-batch sizes 1 ≤ K ≤ 100 (as studied in this work); when using a moderate number of machines (around 10-100), the O(P 2K(J + D)) communications cost of SFB is lower than the O(PJD) cost for PS, and the relative benefit of SFB improves as the dimensions J, D of W grow. In data center scale computing environments with thousands of machines, we can adopt the multicast scheme. As for the time needed to apply updates to W, it turns out that the additional cost of reconstructing each update matrix P or Q times in SFB is negligible in practice — we have observed in our experiments that the time spent computing SFs, as well as communicating SFs over the network, greatly dominates the cost of reconstructing update matrices using SFs. Overall, the communication savings dominate the added computational overhead, which we validated in experiments.

5.2.2 Fault Tolerance In this section, further exploring the SF update property, we propose to represent the parameter matrix using SFs. Based on such a representation, a light-weight fault tolerance approach is developed.

SF-based Representation of Parameters We first show the parameter matrix can be represented as a set of SFs. At clock T , the parameter PT state WT is mathematically equal to W0 + t=1 4Wt where 4Wt is the update matrix computed

151 at clock t and W0 is the initialization of the parameters. As noted earlier, 4Wt can be computed from an SFG Gt: 4Wt = h(Gt), using a transformation h. Following the standard practice of initializing ML models using randomization, we randomly generate an SFG G0, then let W0 = PT h(G0). To this end, the parameter state can be represented as WT = t=0 h(Gt), using a set of SFGs.

Incremental SF Checkpoint Based on the SF-representation (SFR) of parameters and inspired by the asynchronous and in- cremental checkpointing methods [17, 263], Orpheus provides an incremental SF checkpoint (ISFC) mechanism for fault tolerance and recovery: each machine continuously saves the new SFGs computed in each clock to stable storage and restores the parameters from the saved SFGs when machine failure happens. Unlike existing systems [13, 407] which checkpoint large matri- ces, saving small vectors consume much less disk bandwidth. To reduce the frequency of disk write, the SFGs generated after each clock are not immediately written onto the disk, but staged in the host memory. When a large batch of SFGs are accumulated, Orpheus writes them together. ISFC does not require the application program to halt while checkpointing the SFs. The IO thread reads the SFs and the computing thread writes the parameter matrix. There is no read/write conflict. In contrast, in matrix-based checkpointing, the IO thread reads the parameter matrix, which requires the computation thread to halt to ensure consistency, incurring waste of compute cycles. ISFC is able to rollback the parameters to the state at any clock. To obtain the state at clock T , Orpheus collects the SFGs computed up to T and transforms them into a parameter matrix. This granularity is much more fine-grained than checkpointing parameter matrices. Since saving large-sized matrices to disk is time-consuming, the system can only afford to perform a check- point periodically and the parameter states between two checkpoints are lost. The restore(T) API is used for recovery where T is a user-specified clock which the parameters are to be roll- backed to. The default T is the latest clock.

5.2.3 Programming Model The Orpheus programming model provides a data abstraction called sufficient factor group (SFG) and two user-defined functions that generate and consume SFGs to update model parameters. Each SFG contains a set of SFs that are generated with respect to one data example and atomi- cally produces a parameter update. The SFs are immutable and dense, and their default type is float. Inside an SFG, each SF has an index. To program an Orpheus application, users specify two functions: (1) compute svg which takes the current parameter state and one data example as inputs and computes vectors that collectively form an SFG; (2) compute update which takes an SFG and produces a parameter update. These two functions are invoked by the Orpheus engine to perform data-parallel distributed ML: each of the P machines holds one shard of the training data and a replica of parameters; different parameter replicas are synchronized across machines to retain consistency (consistency means different replicas are encouraged to be as close as possible). Every machine executes a sequence of operations iteratively: in each clock, a small batch of training examples are randomly selected from the data shard and compute svg

152 Algorithm 4: Execution semantics. Allocate an empty SFG set S = {} Select a small batch of training examples X for each x ∈ X Compute an SFG s = compute svg (W (old), x) Add s to S end for Send S to other machines Receive the SFG set T from other machines for each s ∈ S ∪ T Compute an update u = compute update(s) Update parameters W (new) ← W (old) + u end for is invoked to compute an SFG with respect to each example; the SFGs are then sent to other machines for parameter synchronization; compute update is invoked to transform locally- generated SFGs and remotely-received SFGs into updates which are subsequently added to the parameter replica. The execution semantics (per-clock) of the Orpheus engine is shown in Algo- rithm 4. Unlike existing systems which directly compute parameter updates from training data, Orpheus breaks this computation into two steps and explicitly exposes the intermediate SFs to users, which enables SF-based system-level optimizations to be exploited. Below shows how these two functions are implemented in MLR. The inputs of the compute svg function include the parameter replica Parameters and a data example Data and the output is a SFG. A SFG is declared via SFG([d1, ··· , dJ ]) where dj is the length of the j-th SF. In MLR, an SFG contains two SFs: the first one is the difference between the prediction vector softmax(W*smp.feats) and the label vector smp.label; the second one is the feature vector smp.feats. The update matrix is computed as the outer product between the two SFs. def compute_svg(Parameters W, Data smp): svg=SFG([W.nrows, W.ncols]) x=softmax(W*smp.feats)-smp.label svg.sv[0]=x svg.sv[1]=smp.feats return svg def compute_update(SFG svg): return outproduct(svg.sv[0],svg.sv[1])

Automatic Identification of SFs and Updates When ML models are trained using gradient descent or quasi-Newton algorithms, the computa- tion of SFGs and updates can be automatically identified by the Orpheus engine, which relieves users from writing the two functions compute svg and compute update. The only input required from users is a symbolic expression of the loss function, which is in general much easier

153 W cross softmax * a b entropy f x y

Figure 5.4: Expression graph.

Type Inputs Outputs Examples 1 vector scalar L2 norm 2 vector vector softmax 3 vector, vector scalar cross entropy 4 vector, vector vector addition, subtraction 5 matrix, vector vector matrix-vector product Table 5.1: Different types of operators. to program compared with the two functions. Note that this is not an extra burden: in most ML applications, users need to specify this loss function to measure the progress of execution. The identification procedure of SFs depends on the optimization algorithm – either gradient descent or quasi-Newton – specified by the users for minimizing the loss function. For both algo- rithms, automatic differentiation techniques [13, 34] are needed to compute the gradient of vari- ables. Given the symbolic expression of the loss function, such as f=cross entropy(softmax (W*x),y) in MLR, the Orpheus engine first parses it into an expression graph [43] as shown in Figure 5.4. In the graph, circles denote variables including terminals such as W, x, y and in- termediate ones such as a=W*x, b=softmax(a); boxes denote operators applied to variables. According to their inputs and outputs, operators can be categorized into different types, shown in Table 5.1. Given the expression graph, Orpheus uses automatic differentiation to compute the symbolic expressions of the gradient ∂f/∂z of f w.r.t to each unknown variable z (either a terminal or an intermediate one). The computation is executed recursively in the backward direction of the graph. For example, in Figure 5.4, to obtain ∂f/∂a, we first compute ∂f/∂b, then transform it into ∂f/∂a using an operator-specific matrix A. For a type-2 operator (e.g., softmax) in Table 5.1, Aij = ∂bj/∂ai. If W is involved in a type-5 operator (Table 5.1) which takes W and a vector x as inputs and produces a vector a and the gradient descent algorithm is used to minimize the loss function, then the SFG contains two SFs which can be automatically identified: one is ∂f/∂a and the other is x. Accordingly, the update of W can be automatically identified as the outer product of the two SFs. If quasi-Newton methods are used to learn ML models parameterized by a vector x, Orpheus can automatically identify the SFs of the update of the approximated Hessian matrix W . First of all, automatic differentiation is applied to compute the symbolic expression of the gradient g(x) = ∂f/∂x. To identify the SFs at clock k, we plug in the states xk+1 and xk of the parameter vector in clock k + 1 and k into g(x) and calculate a vector yk = g(xk+1) − g(xk). We also

Figure 5.5: Software stack. From top to bottom: an ML application layer (MLR, TM, deep learning, quasi-Newton), a service layer (automatic identification, SF selection, fault tolerance), a P2P communication layer (broadcast, multicast), and a message passing library (MPI, ZMQ).

compute another vector s_k = x_{k+1} − x_k. Then, based on s_k, y_k, and W_k (the state of W at clock k), we can identify the SFs, which depend on the specific quasi-Newton algorithm. For BFGS, the procedure is: (1) set y_k ← y_k/√(y_k^T s_k); (2) compute v_k = W_k s_k; (3) set v_k ← v_k/√(s_k^T v_k). The SFs are then identified as y_k and v_k, and the update of W_k is computed as y_k y_k^T − v_k v_k^T. For DFP, the procedure is: (1) set s_k ← s_k/√(y_k^T s_k); (2) compute v_k = W_k y_k; (3) set v_k ← v_k/√(y_k^T v_k). The SFs are then identified as s_k and v_k, and the update of W_k is computed as s_k s_k^T − v_k v_k^T.
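As an illustration of the quasi-Newton case, the following numpy sketch (our own simplification, not Orpheus code) derives the two BFGS sufficient factors from the gradient function g, the parameter states x_k and x_{k+1}, and the current approximation W_k, following the three steps above; the function names are hypothetical.

import numpy as np

def bfgs_sfs(grad, x_prev, x_next, W):
    # Returns the two sufficient factors whose outer-product difference
    # y_tilde y_tilde^T - v_tilde v_tilde^T is the update of W.
    y = grad(x_next) - grad(x_prev)       # y_k = g(x_{k+1}) - g(x_k)
    s = x_next - x_prev                   # s_k = x_{k+1} - x_k
    y_tilde = y / np.sqrt(y @ s)          # step (1): y_k <- y_k / sqrt(y_k^T s_k)
    v = W @ s                             # step (2): v_k = W_k s_k
    v_tilde = v / np.sqrt(s @ v)          # step (3): v_k <- v_k / sqrt(s_k^T v_k)
    return y_tilde, v_tilde

def bfgs_update(y_tilde, v_tilde):
    # The dense update is reconstructed only when applying it to a local replica.
    return np.outer(y_tilde, y_tilde) - np.outer(v_tilde, v_tilde)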

5.2.4 Implementation

Orpheus is a decentralized system where workers are symmetric, running the same software stack, which is conceptually divided into three layers (Figure 5.5): (1) an ML application layer including ML programs implemented on top of Orpheus, such as multiclass logistic regression, topic models, deep learning models, etc.; (2) a service layer for automatic identification of SFs, SF selection, fault tolerance, etc.; (3) a P2P communication layer for sufficient factor transfer and random multicast.

The major modules in the Orpheus engine include: (1) an interpreter that automatically identifies the symbolic expressions of SFs and parameter updates; (2) an SF generator that selects training examples from the local data shard and computes an SFG for each example using the symbolic expressions of SFs produced by the interpreter; (3) an SF selector that chooses a small subset of the most representative SFs out of those computed by the generator for communication; (4) a communication manager that transfers the SFs chosen by the selector using broadcast or random multicast and receives remote SFs; (5) an update generator that computes update matrices from locally-generated and remotely-received SFs and updates the parameter matrix; (6) a central coordinator for periodic centralized synchronization, parameter-replica rotation, and elasticity.

Heterogeneous computing
The Orpheus programming interface exposes a rich set of operators, such as matrix multiplication, vector addition, and softmax, through which users write their ML programs. To support heterogeneous computing, each operator has a CPU implementation and a GPU implementation built upon highly optimized libraries such as Eigen, cuBLAS, and cuDNN. In the GPU implementation, Orpheus performs kernel fusion, which combines a sequence of kernels into a single one to reduce the number of kernel launches, each of which bears a large overhead. The Orpheus engine generates a dependency graph of operators by parsing the user's program and traverses the graph to fuse consecutive operators into one CUDA kernel.
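The fusion pass can be thought of as a traversal that merges chains of consecutive operators. The sketch below is our own plain-Python simplification of only the grouping step (the actual CUDA code generation is out of scope): an operator is merged into its producer's group when the two form a simple chain, i.e., the operator has a single producer and that producer has no other consumer.

def fuse_chains(ops, inputs, consumers):
    # ops: operators in topological order.
    # inputs[o]: operators whose outputs o consumes (terminals excluded).
    # consumers[o]: operators that consume o's output.
    group_of, groups = {}, []
    for op in ops:
        ins = inputs.get(op, [])
        if len(ins) == 1 and len(consumers.get(ins[0], [])) == 1 and ins[0] in group_of:
            group = group_of[ins[0]]      # extend the producer's chain
        else:
            group = []                    # start a new fusion group
            groups.append(group)
        group.append(op)
        group_of[op] = group
    return groups  # each group would be compiled into a single fused kernel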

Elasticity
Orpheus is elastic to resource adjustment: adding new machines and preempting existing machines do not interrupt the current execution. To add a new machine, the central coordinator executes the following steps: (1) launching the Orpheus engine and application program on the new machine; (2) averaging the parameter replicas of the existing machines and placing the averaged parameters on the new machine; (3) taking a chunk of training data from each existing machine and assigning the data to the new machine; (4) adding the new machine to the P2P network. When an existing machine is preempted, it is removed from the P2P network and its data shard is re-distributed to the other machines.

Periodic centralized synchronization
Complementary to the P2P decentralized parameter synchronization, Orpheus performs a centralized synchronization periodically. The central coordinator sets a global barrier every R clocks. When all workers reach this barrier, the coordinator calls the AllReduce(average) interface to average the parameter replicas and set each replica to the average. After that, workers perform decentralized synchronization until the next barrier. Centralized synchronization effectively removes the discrepancy among parameter replicas that accumulates during decentralized execution, and it does not incur substantial communication cost since it is invoked only once every R clocks.
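A minimal sketch of this periodic averaging, assuming an MPI-backed deployment and the mpi4py bindings rather than Orpheus's own communication layer (the message passing library in Figure 5.5 is MPI/ZMQ):

import numpy as np
from mpi4py import MPI

def maybe_centralized_sync(W, clock, R, comm=MPI.COMM_WORLD):
    # Every R clocks, all workers hit the barrier implied by the collective call
    # and replace their replica with the average of all replicas.
    if clock % R == 0:
        total = np.empty_like(W)
        comm.Allreduce(W, total, op=MPI.SUM)   # sum of replicas across workers
        W = total / comm.Get_size()            # set every replica to the average
    return W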

Rotation of parameter replicas
Orpheus adopts data parallelism, where each worker has access to one shard of the data. Since computation is usually much faster than communication, the updates computed locally are much more frequent than those received remotely. This renders the updating of parameters imbalanced: a parameter replica is updated more frequently based on the local data residing on the same machine than based on the data shards on other machines. This is another cause of out-of-synchronization. To address this issue, Orpheus performs parameter-replica rotation, which enables each parameter replica to explore all data shards on different machines. Logically, the machines are connected via a ring network. Parameter replicas rotate along the ring periodically (every S iterations) while each data shard stays on the same machine during the entire execution. We choose to rotate the parameters rather than the data since the size of the parameters is much smaller than that of the data. A centralized coordinator sets a barrier every S iterations. When all workers reach the barrier, it invokes the Rotate API, which triggers the rotation of parameter replicas.
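A sketch of the rotation step, again assuming an MPI-style runtime (mpi4py): every S iterations each worker passes its replica to the next machine on the logical ring and receives the replica of the previous machine, while the data shards stay in place.

from mpi4py import MPI

def maybe_rotate_replica(W, it, S, comm=MPI.COMM_WORLD):
    rank, size = comm.Get_rank(), comm.Get_size()
    if it % S == 0:
        # Send the local replica to the next worker on the ring and receive
        # the replica of the previous worker in one combined call.
        W = comm.sendrecv(W, dest=(rank + 1) % size, source=(rank - 1) % size)
    return W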

Data prefetching
The loading of training data from CPU to GPU is overlapped with the SF generator via a data queue. The next batches of training examples are prefetched into the queue

while the generator is processing the current one. In certain applications, each training example is associated with a data-dependent variable (DDV). For instance, in topic models, each document has a topic proportion vector. The states of DDVs need to be maintained throughout execution. Training examples and their DDVs are stored in consecutive host/device memory for locality and are prefetched together. At the end of a clock, the GPU buffer storing the examples is immediately ready for overwriting. The DDVs are swapped from GPU memory to host memory, which is pipelined using a DDV queue.
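The prefetching pipeline can be sketched with a bounded queue and a background loader thread. This is a plain-Python simplification of the overlap pattern (the real system moves example/DDV buffers between host and GPU memory):

import queue
import threading

def start_prefetcher(batch_iter, depth=2):
    q = queue.Queue(maxsize=depth)      # bounded queue of prefetched batches
    def loader():
        for batch in batch_iter:        # e.g., (examples, ddvs) pairs
            q.put(batch)                # blocks when the queue is full
        q.put(None)                     # sentinel: no more batches
    threading.Thread(target=loader, daemon=True).start()
    return q

def consume(q, process_batch):
    while True:
        batch = q.get()                 # the next batch is already loaded
        if batch is None:
            break
        process_batch(batch)            # SF generation overlaps with loading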

Hardware/software-aware SF transfer
Orpheus provides a communication library for efficient message broadcasting. It contains a collection of broadcast methods designed for different hardware and software configurations, including (1) whether the communication is CPU-to-CPU or GPU-to-GPU; (2) whether InfiniBand [6] is available; (3) whether the consistency model is BSP or SSP. A minimal sketch of the plain CPU-to-CPU paths is given after the list below.
• CPU-to-CPU, BSP. In this case, we use the MPI_Allgather [8] routine to perform all-to-all broadcast. In each clock, it gathers the SFs computed by each machine and distributes them to all machines. MPI_Allgather is a blocking operation (i.e., control does not return to the application until the receiving buffer is ready to receive SFs from all machines). This is in accordance with the BSP consistency model, where the execution cannot proceed to the next clock until all machines reach the global barrier.
• CPU-to-CPU, SSP. Under SSP, each machine is allowed to compute and broadcast SFs at a different pace. To enable this, the all-to-all broadcast is decomposed into multiple one-to-all broadcasts. Each machine separately invokes the MPI_Bcast routine to broadcast its messages to the others. MPI_Bcast is a blocking operation: the next message cannot be sent until the current one finishes. This guarantees that the SFs are received in order: SFs generated at clock t arrive earlier than those generated at clock t+1. This order is important for the correctness of ML applications: the updates generated earlier should be applied first.
• CPU-to-CPU, BSP, InfiniBand. An all-gather operation is executed by leveraging the Remote Direct Memory Access (RDMA) feature [318] provided by InfiniBand, which supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, without going through the operating system. The recursive doubling (RD) [145] algorithm is used to implement all-gather, where pairs of processes exchange their SFs via point-to-point communication. In each iteration, the SFs collected during all previous iterations are included in the exchange. RDMA is used for the point-to-point transfers during the execution of RD.
• CPU-to-CPU, SSP, InfiniBand. Each machine performs one-to-all broadcast separately, using the hardware-supported broadcast (HSB) in InfiniBand. HSB is topology-aware: packets are duplicated by the switches only when necessary; therefore, network traffic is reduced by avoiding multiple identical packets traveling through the same physical link. The limitation of HSB is that messages can be dropped or arrive out of order, which degrades the correctness of ML execution. To retain reliability and in-order delivery, another layer of network protocol [232] is added on top of HSB, where (1) receivers send ACKs back to the root machine to confirm message delivery; (2) a message is re-transmitted using point-to-point reliable communication if no ACK is received before a timeout; (3) receivers use a continuous clock counter to detect out-of-order messages and put them in order.
• GPU-to-GPU. To reduce the latency of inter-machine SF transfer between two GPUs, we use the GPUDirect RDMA [2] feature provided by CUDA, which allows network adapters to directly read from or write to GPU device memory without staging through host memory. Between two network adapters, the SFs are communicated using the methods listed above.
Similar to broadcast, several multicast methods tailored to different system configurations are provided.
• CPU-to-CPU. We use MPI group communication primitives for CPU-to-CPU multicast. In each clock, MPI_Comm_split is invoked to split the communicator MPI_COMM_WORLD into a target group (containing the selected machines) and a non-target group. Then the message is broadcast to the target group.
• CPU-to-CPU, InfiniBand. The efficient but unreliable multicast method supported by InfiniBand at the hardware level and a reliable point-to-point network protocol are used together. InfiniBand combines the selected machines into a single multicast address and sends the message to it. Point-to-point re-transmission is issued if no ACK is received before a timeout. Since the selection of receivers is random, no machine receives messages from another machine in consecutive clocks, making it difficult to detect out-of-order messages. We adopt a simple approach: a message is discarded if it arrives late.
• GPU-to-GPU. GPUDirect RDMA is used to copy buffers from GPU memory to the network adapter. The communication between network adapters is then handled using the two methods given above.
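As promised above, here is a minimal mpi4py sketch of the two plain CPU-to-CPU paths (BSP via all-gather, SSP via per-sender broadcasts). It illustrates the communication pattern only and is not the Orpheus communication library itself.

from mpi4py import MPI

comm = MPI.COMM_WORLD

def bsp_exchange(local_sfgs):
    # BSP: a blocking all-to-all gather; returns the SFG lists of every machine,
    # so the clock cannot advance until all machines have contributed.
    return comm.allgather(local_sfgs)

def ssp_broadcast(local_sfgs, root):
    # SSP: each machine separately broadcasts its own SFGs (all ranks call bcast
    # with the sender as root); blocking per-message sends preserve clock order.
    return comm.bcast(local_sfgs, root=root)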

5.2.5 Evaluation

We evaluate Orpheus on three ML applications: multiclass logistic regression (MLR), topic model (TM) [207], and a long short-term memory (LSTM) [163] network.

Experimental Setup

Cluster setup
We used two clusters: (1) a CPU cluster with 34 machines, each with 64 cores and 128 GB memory, connected by FDR10 InfiniBand; (2) a GPU cluster with 40 machines, each with one TitanX GPU and 64 GB memory, connected by 40Gbps Ethernet. We trained MLR and TM on the CPU cluster and LSTM on the GPU cluster. Unless otherwise noted, the experiments were performed using all machines in each cluster.

Baseline systems
We compared with: (1) Spark-MLR from Spark MLlib-2.0.0, which is based on an L-BFGS algorithm (Spark-MLR also has an SGD-based implementation, which converges more slowly than L-BFGS); (2) Bosen-MLR and Bosen-TM: MLR and TM implemented using a recent parameter server (PS), Bosen [355], with additional system features (e.g., parameter-update filtering) borrowed from another PS [226]; (3) TensorFlow-MLR and TensorFlow-LSTM: a TensorFlow-1.0 implementation of MLR and a TensorFlow-1.7 implementation of LSTM, which are partially based on PS; (4) MXNet-MLR and MXNet-LSTM: an MXNet-0.7 [73] implementation of MLR and an MXNet-1.1 implementation of LSTM, using PS-based data parallelism; (5) Gopal-MLR [136]: a model-parallel MLR based on Hadoop; (6) FlexiFaCT-TM [207]: a Hadoop implementation of TM. Note that Spark and TensorFlow implement a probabilistic topic model, latent Dirichlet allocation [49], which is different from the non-probabilistic TM in Orpheus.

Figure 5.6: Convergence curves (objective value vs. training time in hours) in the MLR experiments, for Spark, Gopal, TensorFlow, Bosen, MXNet, SFB, and Orpheus.

Utilization of system features
All system features except elasticity were used in the experiments. Most broadcast and multicast methods were utilized, since our experiments span various hardware-level configurations (CPU, GPU, InfiniBand, Ethernet) and software-level settings (broadcast, multicast, BSP, SSP).

Datasets
Three datasets were used in the experiments: Wikipedia [272], PubMed [9], and 1B-Word [67]. The Wikipedia dataset contains ∼2.4 million documents from 325K classes; documents are represented with 20K-dimensional bag-of-words vectors. The PubMed dataset contains 8.2M documents and ∼0.74B words, with a vocabulary size of 141K. The 1B-Word dataset contains ∼0.8B words with a vocabulary size of ∼0.8M. The MLR, TM, and LSTM experiments were conducted on the Wikipedia, PubMed, and 1B-Word datasets, respectively.

ML and system hyper-parameter setup
The topic number in TM and the state-vector dimension in LSTM were set to 50K and 40K, respectively. As a result, the parameter matrices in MLR, TM, and LSTM have sizes of 325K×20K, 50K×141K, and 40K×40K, containing ∼6.5B, ∼7.1B, and ∼1.6B entries, respectively. The parameters in all applications were trained using the SGD algorithm with a mini-batch size of 100. In SF selection, the number of selected SFs was set to 25. In random multicast, the number of destinations each machine sends messages to was set to 4. The consistency model was set to SSP with a staleness value of 4.

Overall Results

Comparison with other systems
We first compare the convergence speed of these systems. In this comparison, no system uses fault tolerance. By convergence, we mean that the loss function value (e.g., the cross-entropy loss in MLR, the negative log-likelihood in TM and LSTM) on the training set levels off. A better system takes less time to converge. In each of our experiments, different systems converged to the same loss value. Figure 5.6 shows the convergence curves for MLR (34 machines). The second and third columns of Table 5.2 show the convergence time of each system under different numbers of machines.

MLR (# CPU machines)      12     34     Speedup
Spark                     37.3   19.6   1.9
Gopal                     31.7   15.1   2.1
TensorFlow-1.0            16.9   8.9    1.9
Bosen                     15.7   7.1    2.2
MXNet-0.7                 12.8   8.4    1.5
SFB                       5.1    2.7    1.9
Orpheus                   4.4    1.9    2.3

TM (# CPU machines)       12     34     Speedup
FlexiFaCT                 61.1   33.9   1.8
Bosen                     49.4   23.5   2.1
SFB                       20.1   9.7    2.1
Orpheus                   13.2   5.4    2.4

LSTM (# GPU machines)     12     40     Speedup
TensorFlow-1.7            14.2   5.9    2.4
MXNet-1.1                 12.5   4.6    2.7
Orpheus                   11.9   4.1    2.9

Table 5.2: The second and third columns show the convergence time (hours) of each system under different numbers of machines. The fourth column shows the speedup of each system when the number of machines is increased from 12 to 34 (in MLR and TM) or from 12 to 40 (in LSTM).

On all three models, Orpheus converges faster than the baseline systems. We first compare Orpheus with parameter server (PS) based systems, including Bosen [355], TensorFlow-1.0 [13], and MXNet-0.7 [73]. On MLR, with 34 CPU machines, the speedup of Orpheus over Bosen, TensorFlow-1.0, MXNet-0.7, and SFB is 3.7x, 4.7x, 4.4x, and 1.4x, respectively. On TM, with 34 CPU machines, Orpheus is 4.4x and 1.8x faster than Bosen and SFB, respectively. On LSTM, with 40 GPU machines, Orpheus is 1.4x faster than TensorFlow-1.7 and 1.1x faster than MXNet-1.1. As we show later, these speedups are achieved mainly because Orpheus is more efficient in communication. Compared with SFB, Orpheus reduces the number of transmitted SFs via SF selection and provides a random multicast scheme that works better than the deterministic multicast scheme used in SFB. Similar to SFB, Orpheus transmits small vectors instead of large matrices, which greatly reduces network traffic compared with PS-based systems.


Figure 5.7: Scalability with more machines: (left) Orpheus-MLR; (right) Orpheus-LSTM.

Next, we compare Orpheus with the rest of the baseline systems. Using 34 CPU machines, Orpheus is 10.3x and 7.9x faster than Spark and Gopal on MLR, and 6.3x faster than FlexiFaCT on TM. Gopal outperforms Spark possibly because it uses a better distributed algorithm based on model parallelism. However, it is slower than the PS systems due to the disk I/O overhead of Hadoop; so is FlexiFaCT, which is also Hadoop-based. Spark is at least two times slower than the PS systems (implemented in C++) due to the overhead incurred by resilient distributed datasets and the Java virtual machine.

Scalability
We evaluated how Orpheus scales as the number of machines increases. The results on MLR and LSTM are shown in Figure 5.7. With 34 CPU machines, Orpheus achieves a 27.3x speedup on MLR; with 40 GPU machines, it achieves a 32.2x speedup on LSTM. The scalability of Orpheus is better than that of the baseline systems, as shown in the fourth column of Table 5.2, where we measured the speedups for MLR and TM when the number of CPU machines increases from 12 to 34 and the speedups for LSTM when the number of GPU machines increases from 12 to 40. For example, in the LSTM experiments with 40 GPU machines, Orpheus achieves a speedup of 2.9 compared with using 12 machines.

Evaluation of individual components
In this section, we evaluate the impact of each individual component. We compare the following systems: (1) Matrix+PS: synchronizing parameter copies by transmitting full matrices using a parameter server (PS) architecture; (2) SFB: SF broadcasting; (3) SFB+SFS: adding SF selection (SFS) to SFB; (4) SFB+SFS+RM: adding random multicast (RM); (5) SFB+SFS+RM+PCS: adding periodic centralized synchronization (PCS); (6) SFB+SFS+RM+PCS+PRR: adding parameter-replica rotation (PRR). In these systems, no checkpointing is used. The number of machines in MLR, TM, and LSTM is set to 34, 34, and 40, respectively. Table 5.3 shows the convergence time of these systems, from which we make the following observations. First, SFB is much more efficient than Matrix+PS: it is 2.6x, 2.4x, and 2x faster on MLR, TM, and LSTM, respectively. The reason is that SFB transmits small vectors while Matrix+PS transmits large matrices, so the communication cost of SFB is much smaller.

                          MLR    TM     LSTM
Matrix+PS                 7.1    23.5   16.8
SFB                       2.7    9.7    8.6
SFB+SFS                   2.3    7.9    6.9
SFB+SFS+RM                2.1    6.6    5.5
SFB+SFS+RM+PCS            2.0    5.9    4.9
SFB+SFS+RM+PCS+PRR        1.9    5.4    4.1

Table 5.3: Convergence time (hours) of different system configurations.

Second, adding SFS to SFB further reduces the runtime by 15%, 19%, and 20% on MLR, TM, and LSTM, respectively. SFS selects a subset of SFGs for communication, which reduces network traffic. Third, incorporating RM reduces the runtime of SFB+SFS by 9%, 16%, and 20% on the three applications. Under RM, in each clock each machine selects a subset of machines to send SFs to, which reduces the number of network messages. Fourth, adding PCS further speeds up convergence: via PCS, the incoherence among different parameter copies is alleviated, which reduces noise and improves convergence quality. Fifth, comparing the last two rows of the table confirms the effectiveness of PRR in reducing convergence time. Under PRR, each parameter replica has the chance to explore all data shards on different machines, which facilitates symmetric updating of the parameters.

Figure 5.8: Breakdown of network waiting time (hours) and computation time (hours) for Matrix+PS, SFB, SFB+SFS, and SFB+SFS+RM.

Figure 5.8 shows the breakdown of network waiting time and computation time for four configurations. Compared with Matrix+PS, SFB greatly reduces network waiting time by avoiding the transmission of matrices. It slightly increases computation time since the same SFG needs to be converted into an update matrix on each worker. Adding SFS further decreases network time since it reduces the number of transmitted SFs; SFS increases computation time because of the overhead of executing the JMCSS algorithm (Algorithm 3). The network time is further reduced by RM, which decreases the number of network messages; RM has little impact on computation time. In the sequel, we present more detailed evaluations of several key components.


Figure 5.9: How the system parameter Q in random multicast (left) and C in SF selection (right) affect the running time of Orpheus for MLR.

Random multicast (RM)
Figure 5.9 (left) shows the convergence time of MLR under varying Q, the number of destinations each machine sends messages to. As can be seen, communicating with all machines (i.e., Q = 33) incurs much more running time than using random multicast (Q = 4), which demonstrates the effectiveness of RM in improving communication efficiency. The efficiency results from RM's ability to reduce the number of network messages. However, Q cannot be too small; otherwise, the running time increases (e.g., when Q = 1) due to severe synchronization delays. We compared RM with two multicast schemes: (1) deterministic multicast (DM) [222], where each sending machine sends messages to a fixed set of 4 receiving machines; the set of 4 receiving machines is different for each sending machine and is chosen to balance network load across the cluster and prevent network hot spots; (2) round-robin multicast (RRM) [222, 353], where every two workers communicate with each other periodically according to a deterministic circular order. Figure 5.10 (left) shows the convergence time of MLR and LSTM under DM, RRM, and RM. In both applications, RM takes less time to converge. Compared with DM, RM uses a randomly changing multicast topology to give each pair of machines some chance to communicate directly, thus facilitating more symmetric (and hence faster) synchronization of parameter replicas. Compared with RRM, the randomness of RM facilitates faster "mixing" [148] of parameter copies and hence speeds up convergence. To examine the robustness of RM against network connection failures, we simulated the effect that in each clock the connections between 10% of machine pairs are "broken" randomly. Figure 5.10 (right) shows the relative increase of convergence time when failures happen. As can be seen, the relative increase under RM is much smaller than that under DM and RRM, confirming that RM is more robust due to its random nature.

SF selection
Figure 5.9 (right) shows the convergence time for MLR under varying C, the number of selected SFs. Compared with using the entire batch (C = 100), selecting a subset (e.g., C = 25) of SFs to communicate significantly speeds up convergence, thanks to the reduced network traffic. On the other hand, C cannot be too small, which otherwise incurs large approximation errors in the parameter updates. For example, the convergence time under C = 5 is worse than that under C = 100.

Figure 5.10: (Left) Convergence time in Orpheus under deterministic, round-robin, and random multicast for MLR and LSTM. (Right) Relative increase of convergence time when network connections fail.

                          MLR    LSTM
No checkpoint             1.9    4.1
Matrix-based checkpoint   2.4    5.2
SF-based checkpoint       1.9    4.2
Matrix-based recovery     2.9    6.7
SF-based recovery         2.3    4.8

Table 5.4: Convergence time (hours) under different configurations of fault tolerance and recovery.

Fault tolerance and recovery
We compare the following configurations: (1) no checkpoint; (2) matrix-based checkpoint: every 100 clocks (choosing a number smaller than 100 would entail more disk-I/O waiting), Orpheus saves the parameter matrix onto disk; when saving matrices, the computation halts to ensure consistency; (3) SF-based checkpoint; (4) matrix-based recovery: we simulated the effect that in each clock each machine fails with a probability of 0.01; when a failure happens, recovery is performed based on the checkpointed matrices; (5) SF-based recovery: machine failure is simulated in the same way as in (4) and recovery is based on the saved SFs. For (2) and (3), no machine failure happens.

Table 5.4 shows the convergence time of MLR (34 machines) and LSTM (40 machines) under the different configurations. From this table, we observe the following. First, compared with no checkpointing, the matrix-based checkpoint method substantially increases the convergence time, while our SF-based checkpointing incurs very little increase. The reasons are twofold: (1) saving vectors consumes much less disk bandwidth than saving matrices; (2) checkpointing SFs does not halt computation, wasting no compute cycles, which is not the case when checkpointing matrices. Second, in the case where machine failure happens, matrix-based recovery slows down convergence more than SF-based recovery. This is because in matrix-based recovery the parameters can only be rolled back to a state that is saved every 100 clocks: if a failure happens at clock 199, the parameters are rolled back to the state saved at clock 100, and the computation from clock 101 to 198 is wasted. In SF-based recovery, however, the parameter state after every clock is preserved; the system can always roll back to the latest parameter state (e.g., the state after clock 198) and no computation is wasted.
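The contrast between the two recovery schemes can be sketched as follows. This is our own simplification, assuming the two SFs of every SFG in every clock are appended to an in-memory log and replayed on recovery (in the real system the log would live on disk):

import numpy as np

def sf_checkpoint(log, clock, sfgs):
    # Append the small vectors of this clock; computation does not halt.
    log.append((clock, [(u.copy(), v.copy()) for (u, v) in sfgs]))

def sf_recover(W0, log):
    # Replay every logged clock, so the state after the last completed clock
    # is recovered and no finished computation is wasted.
    W = W0.copy()
    for _, sfgs in log:
        for u, v in sfgs:
            W += np.outer(u, v)
    return W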

5.3 Poseidon: A Hybrid Architecture of SFB and Parameter Server

Many ML models are parameterized by multiple matrices. For example, in a convolutional neural network (CNN), the convolution layers and fully-connected layers have their own layer-specific weight matrices. Some of these matrices are large (thousands of rows and columns at least) while others are small (hundreds of rows/columns). For instance, in the first convolutional layer of AlexNet [203], the weight matrix of the 96 filters has size 96×363, which is relatively small, while that of the fully-connected (FC) layer is 4096×4096, which is much larger.

When these weight matrices are trained using stochastic gradient descent (SGD), their updates can be constructed from vectors, i.e., they have the sufficient factor property. Take the J × D weight matrix W between two FC layers l_i and l_{i+1} as an example. In the forward pass, one data sample is fed into the network and the activations of layer l_i are produced as a_i. During back-propagation (BP), an error message e_{i+1}, which is a J-dimensional vector, is passed back from l_{i+1} to l_i. The gradient ΔW can thus be exactly reconstructed from the two vectors e_{i+1} and a_i: ΔW = e_{i+1} a_i^T. As a result, these weight matrices can be learned using the sufficient factor broadcasting (SFB) architecture of Orpheus.

In Section 5.2.1, we presented a cost analysis of the SFB and parameter server (PS) architectures [94, 226, 355]. The analysis shows that SFB outperforms PS if the matrix size is significantly larger (at least 10 times) than the mini-batch size in SGD and the number of worker machines, and that PS has an advantage over SFB otherwise. In CNNs, both cases exist, which motivates us to design a hybrid communication protocol that adopts both SFB and PS, based on which a distributed deep learning system called Poseidon [411] is built.
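For concreteness, a small numpy check of the sufficient factor property of an FC layer, written under the notation above (this is an illustration, not Poseidon code):

import numpy as np

# Small dimensions for illustration; in AlexNet's FC layers J = D = 4096.
J, D = 6, 4
a_i = np.random.randn(D)       # activations of layer l_i from the forward pass
e_next = np.random.randn(J)    # J-dimensional error message from layer l_{i+1}

# The dense gradient of W is exactly the outer product of the two vectors,
# so only e_next (length J) and a_i (length D) need to cross the network.
delta_W = np.outer(e_next, a_i)
assert delta_W.shape == (J, D)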

5.3.1 Structure-aware Communication Protocol (SACP)

We propose the SACP [411] protocol, which hybridizes the client-server PS scheme with the peer-to-peer SFB scheme, for GPU-based distributed deep learning. In addition to PS and SFB, SACP also supports SF-based PS (SFPS) [78]. SFPS is much like a PS except that, instead of transmitting update matrices from workers to the server as PS does, SFPS transmits the SFs generated on the workers to the server and transforms these SFs into update matrices at the server. Figure 5.11 illustrates the three communication schemes.

Figure 5.11: Illustration of the three types of communication: (a) full-matrix communication via a centralized parameter server (PS); (b) sufficient factor broadcasting via a decentralized P2P scheme; (c) SF-based PS: from workers to the server, SFs are transferred; from the server to workers, full matrices are transmitted.

SACP is structure-aware, as it intelligently determines the optimal communication method according to the matrix size J × D, the SGD batch size K, and the number of workers P. Recall that under the bulk synchronous parallel consistency model, where all workers proceed at the same pace, the communication costs of PS, SFB, and SFPS are 2PJD, 2(P − 1)K(J + D), and PK(J + D) + PJD, respectively. In general, SFB and SFPS have a lower cost than PS when the matrix size is large and the mini-batch size and the number of workers are small. In a CNN, a convolution layer has a J × D weight matrix, where J is the number of filters and D is the size of the local receptive field; J and D are usually small. For example, in the first convolution layer of AlexNet, J and D are 96 and 363, respectively. These filters slide over a grid of locations in each image. At each location, an update matrix of the weights is generated, which can be written as the outer product of two vectors (SFs). Let L denote the number of locations in each image; then the costs of SFB and SFPS become 2(P − 1)KL(J + D) and PKL(J + D) + PJD. The value of L can be several thousand or more (e.g., in the first convolution layer of AlexNet, L equals 3136), which renders the costs of SFB and SFPS larger than the cost of PS, which is independent of L. Therefore, we choose PS for synchronizing the weight matrices in convolution layers. For fully-connected (FC) layers, the weight matrices are in general much larger than those in convolutional layers; in AlexNet, J and D are 4096 in the FC layers. Different from the convolutional layers, where multiple update matrices are generated at different locations of an image, in FC layers there is only one update matrix per image. The mini-batch size K is usually around 100. Putting these pieces together, we conclude that in most cases the costs of SFB and SFPS are lower than that of PS. Hence, we choose either SFB or SFPS to communicate the matrices in FC layers. To choose between SFB and SFPS, we directly compare their costs, 2(P − 1)K(J + D) and PK(J + D) + PJD, and select the one yielding the lower cost. Algorithm 5 summarizes how SACP works.

5.3.2 Evaluation

We evaluated Poseidon on image classification tasks using three datasets: CIFAR-10 [5], ILSVRC-2012 [95], which contains about one million ImageNet [95] images from 1000 classes, and ImageNet-22K, which contains all ImageNet images from 22K classes. The experiments show that Poseidon significantly accelerates the training of CNNs while guaranteeing correct convergence. Ablation studies were performed to justify the effectiveness of SACP.

Algorithm 5: Structure-aware communication protocol (SACP).
At iteration t on worker p:
  Input: layer l_i, J × D gradient matrix ΔW_i^p, number of workers P, batch size K.
  Task: push out the gradient ΔW_i^p and then update W_i.
  if l_i is not an FC layer then
      Send ΔW_i^p to the master node.
      Synchronize the updated W_i from the master node.
  else
      Recast ΔW_i^p into two SFs, i.e., ΔW_i^p = u_i^p (v_i^p)^T.
      if 2(P − 1)K(J + D) ≤ PK(J + D) + PJD then
          Broadcast u_i^p, v_i^p to all other workers.
          Receive the SFs u_i^j, v_i^j, j ≠ p, from all other workers.
          Update W_i: W_i ← W_i + Σ_j u_i^j (v_i^j)^T.
      else
          Send u_i^p, v_i^p to the master node.
          Synchronize the updated W_i from the master node.
      end if
  end if
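The decision logic of Algorithm 5 fits in a few lines. The sketch below is ours, using the per-layer cost expressions as stated above (with the SFB cost read as 2(P − 1)K(J + D)); it only picks the communication scheme and does not perform any communication.

def choose_scheme(is_fc, J, D, K, P):
    # Convolution layers: small J x D matrices but many per-location updates,
    # so full-matrix synchronization via the parameter server is cheaper.
    if not is_fc:
        return "PS"
    # FC layers: compare the costs of SFB and SF-based PS (SFPS).
    cost_sfb = 2 * (P - 1) * K * (J + D)
    cost_sfps = P * K * (J + D) + P * J * D
    return "SFB" if cost_sfb <= cost_sfps else "SFPS"

# Example: AlexNet's 4096 x 4096 FC layer with batch size K = 100 and P = 8 workers.
print(choose_scheme(True, 4096, 4096, 100, 8))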

Cluster Configuration
We conducted all experiments on the PRObE Susitna cluster [235], where each node has a 4×16-core 2.1GHz AMD Opteron 6272 CPU, 128GB of RAM, and an NVIDIA Tesla K20C GPU with 4799MB memory. All cluster nodes have shared access to a network file system (NFS) with 1x Hitachi 1.0 TB HDD and 2x Hitachi 3.0 TB HDD. We used the 40GbE network. The Caffe framework [182] (the Oct 2014 version) was used as the single-node baseline, and it was modified using Poseidon's client library API for distributed execution.

Image Classification
We demonstrate Poseidon's performance on three benchmark datasets.

Classification on CIFAR-10
We first evaluate Poseidon on the CIFAR-10 [5] dataset, which contains 32 × 32 images from 10 classes; each class has 6K images. An official train/test split is provided, where 50K images are used for training and 10K for testing. We employed the built-in cifar10_quick solver and the cifar10_quick_train_test network structure in Caffe3, consisting of 3 convolutional (CONV) layers and 1 fully-connected (FC) layer followed by a 10-way softmax classifier. This network has 145,578 parameters in total. It converges to 70% test accuracy after 4 epochs of training on a single machine without decreasing the learning rate.

3github.com/BVLC/caffe/tree/master/examples/cifar10.


Figure 5.12: Comparison of training different CNNs to convergence between Poseidon on 8 GPU nodes and Caffe on a single node: (a)-(b) cifar10_quick_train_test; (c)-(d) bvlc_alexnet; (e)-(f) bvlc_googlenet. Test errors are shown with respect to (left) training time and (right) training iterations.

We deployed Poseidon onto 8 Susitna nodes. Since a larger batch size usually hurts SGD performance, we reduced the batch size from 100 to 50 and slightly decreased the base learning rate from 0.01 to 0.007, while keeping the other hyperparameter settings in the cifar10_quick solver unchanged. For better comparison, in the distributed setting we used the bulk synchronous parallel consistency model. Similar to the single-machine setting, we trained the network to convergence without adjusting the learning rate; the test accuracy reaches nearly 75%. Figure 5.12(a)-(b) plots how the test error decreases with training time and with training iterations for Poseidon on 8 nodes and Caffe on a single node. Under the same setting, single-machine Caffe takes more than 4 times the training time to converge to 70% accuracy, while Poseidon quickly converges to 72% in 19 seconds and attains a higher accuracy of 75% in 25 seconds with 8 GPU nodes.

Model            Speedup   Top-1 accuracy
CIFAR-10 quick   4×        75%
AlexNet          4.5×      56.5%
GoogLeNet        4×        67.1%

Table 5.5: Speedups and converged top-1 accuracies of Poseidon when training models on 8 GPU nodes on the CIFAR-10 and ILSVRC-2012 datasets, compared to Caffe on a single machine.

Classification on ILSVRC-2012
We then experiment on ImageNet ILSVRC-2012, consisting of 1.28 million training images and 50K validation images in 1,000 categories. We downsampled all images to 256 × 256 × 3 before feeding them into the networks, and report the top-1 accuracy on the validation set. We evaluate Poseidon on AlexNet [203] and GoogLeNet [319]. AlexNet has 5 CONV layers, 2 FC layers, and a 1000-class softmax classifier, containing 61.3 million parameters in total; GoogLeNet has 22 layers and 5 million parameters. For fair comparison, we employed the open implementations of AlexNet4 and GoogLeNet5 provided in Caffe, denoted bvlc_alexnet and bvlc_googlenet. In single-machine training of AlexNet, we used the standard solver in Caffe, with a batch size of 256 and 70 epochs; the learning rate was decreased 3 times during training. For GoogLeNet, we employed the quick solver, which uses the polynomial learning rate policy, and trained for 60 epochs with a batch size of 32. In the distributed setting, we deployed both AlexNet and GoogLeNet onto 8 GPU nodes using Poseidon, keeping the network structure and the batch size exactly the same as in the single-machine setting. Specifically, for AlexNet, we trained on 8 nodes for about 60K iterations, with the base learning rate set to 0.005, decreased 5 times during the entire training process. For GoogLeNet, we used a standard step learning rate policy, setting the base learning rate to 0.005 and decreasing it 90 times during training. Figure 5.12(c)-(d) and Figure 5.12(e)-(f) show the performance of training AlexNet and GoogLeNet, respectively, using Poseidon on a GPU cluster of 8 nodes, compared to single-machine Caffe. For AlexNet, Poseidon attains 56.5% top-1 accuracy on the validation set after 27 hours of training, achieving a 4.5× speedup over single-machine Caffe, which needs 5 days to reach the same accuracy. For GoogLeNet, Poseidon converges to 67.1% top-1 accuracy after 130 hours of training, whereas single-machine Caffe only achieves 50% top-1 accuracy after 250 hours of training and 57% after nearly 350 hours (Poseidon needs less than 48 hours to reach 50% and 75 hours to reach 57%, a nearly 5× speedup). The speedups achieved by Poseidon in terms of algorithm convergence are summarized in Table 5.5. In addition, we report the speedups in throughput (i.e., the number of images processed per second) in Figure 5.13 when training AlexNet and GoogLeNet using Poseidon on different numbers of GPU nodes, compared to single-machine Caffe. The stale synchronous parallel [162] consistency model was used, and the throughput speedups are compared under different settings of the staleness value.

4github.com/BVLC/caffe/tree/master/models/bvlc_alexnet. 5github.com/BVLC/caffe/tree/master/models/bvlc_googlenet.


Figure 5.13: The speedups on throughput with different values of staleness, when training using Poseidon on 8 nodes, compared to Caffe on a single node. (a) AlexNet with batch size 256, and (b) GoogLeNet with batch size 32.

Framework                     Data                                                       # machines/cores         Time      Train acc.   Test acc.
Poseidon                      7.1M ImageNet-22K images for training, 7.1M for testing    8/8 GPUs                 3 days    41%          23.7%
Adam [78]                     7.1M ImageNet-22K images for training, 7.1M for testing    62 machines/?            10 days   N/A          29.8%
MXNet [73]                    All ImageNet-22K images for training, no testing results   1/4 GPUs                 8.5 days  37.2%        N/A
Le et al. [214] w/ pretrain   7.1M ImageNet-22K and 10M unlabeled images for training,   1,000/16,000 CPU cores   3 days    N/A          15.8%
                              7.1M for testing

Table 5.6: Comparison of image classification results on ImageNet-22K.

Classification on ImageNet-22K
ImageNet-22K is the largest public dataset for image classification, including 14,197,087 labeled images from 21,841 categories; it is rarely touched by the research community due to its massive data size and complexity. We experiment on ImageNet-22K to demonstrate the scalability of Poseidon. As no official test data exist for evaluation, following previous settings [78, 94, 214], we randomly split the whole set into two parts, using the first 7.1 million images for training and the rest for testing. Similar to ILSVRC-2012, we resized all images to 256 × 256 and report the top-1 test accuracy. We designed an AlexNet-like architecture. Specifically, the CNN takes a random 227 × 227 crop of the original image as input and forwards it through 5 CONV layers and 2 FC layers before prediction. The CNN has convolution filters of sizes 7×7, 5×5, and 3×3. Similar to AlexNet, the first, second, and fifth CONV layers are followed by max pooling layers of size 3 × 3 with stride 2. Two FC layers with 3,000 neurons each are put at the top of the network, followed by a softmax layer for 21,841-way classification. We trained the CNN with data parallelism by equally partitioning the training data across 8 GPU nodes. The batch size was set to 256. The network was trained using the step learning rate policy, with a base learning rate of 0.005 that is decreased 6 times during training. Table 5.6 compares our results to those achieved by previous works, including Adam [78], MXNet [73], and Le et al. [214].


Figure 5.14: Speedups against single-machine Caffe in terms of throughput, under different numbers of GPU nodes: (a) AlexNet with a batch size of 256; (b) GoogLeNet with a batch size of 32.

Note that at this point a completely fair comparison between different frameworks is not possible, because the experimental protocol for ImageNet-22K is not standardized, the source code is not fully available, and large variations exist in system configurations, models, and implementation details. However, in general Poseidon achieves a competitive accuracy of 23.7% with shorter training time and fewer machines. Our system achieves 23.7% accuracy, which is close to that achieved by Adam [78], using a model whose size is close to that of Adam, but it takes much less training time (30% of Adam's) and far fewer machines (13% of Adam's). Compared with MXNet, which achieves 37.2% training accuracy with 8.5 days of training on 4 GPUs, our system achieves a higher training accuracy (41%) with 3 days of training on 8 GPUs.

Internal Comparison
In this section, we conduct an internal comparison to validate the effectiveness of SACP in reducing communication cost for GPU-based distributed deep learning. We measure the speedups against single-machine Caffe in terms of throughput, defined as the number of images processed per second. Figure 5.14 compares the speedups for training AlexNet and GoogLeNet under the following two settings with different numbers of nodes: (1) w/o SACP: weight matrices in all layers are synchronized using PS; (2) w/ SACP: SACP is enabled. We used the BSP consistency model and set the batch size to 256 for AlexNet and 32 for GoogLeNet (different batch sizes lead to slightly different throughput speedups). With SACP enabled, the throughput speedup is improved. In particular, when training on 8 nodes, SACP increases the speedup of AlexNet training from about 4 to about 6, a 50% improvement. For GoogLeNet, which has fewer FC layers, SACP achieves approximately a 20% improvement in speedup.

5.4 Convergence Analysis

In this section, we analyze the convergence of algorithms executed using SFB. We first study the convergence of mini-batch SGD under full broadcasting SFB for nonconvex optimization. Since SFB is a peer-to-peer decentralized computation model, we need to show that parameter copies

on different workers converge to the same limiting point without centralized coordination, even under delays in communication due to bounded asynchronous execution. In this respect, our analysis differs from that of centralized parameter server systems [162], which instead show convergence of the global parameters on the central server.

More specifically, we are interested in solving the optimization problem $\min_{\mathbf{W}} f(\mathbf{W})$, where $f(\mathbf{W})$ corresponds to the loss over a set of training samples. Consider a distributed system with $P$ machines; each machine $p$ keeps a local variable $\mathbf{W}_p$. We assume that the total loss $f$ is distributed among the $P$ worker machines, i.e., $f = \sum_{p=1}^{P} f_p$, where $f_p$ denotes the loss stored on the $p$-th machine. Given a parameter matrix $\mathbf{W}$ and a random variable $\xi$, each machine $p$ has access to a stochastic gradient oracle $G_p(\mathbf{W}, \xi)$ that satisfies $\mathbb{E}_\xi G_p(\mathbf{W}, \xi) = \nabla f_p(\mathbf{W})$. We further assume that every machine calls the stochastic oracle $K$ times at each iteration ($K$ is also referred to as the mini-batch size) and generates the stochastic gradient $\frac{1}{K}\sum_{j=1}^{K} G_p(\mathbf{W}_p^c, \xi_{c,j})$, where $\mathbf{W}_p^c$ denotes the local variable of machine $p$ at iteration $c$. At each iteration, machine $p$ updates its local variable by aggregating the stochastic gradients from all machines and performing a proximal update. Specifically, the aggregated stochastic gradient on machine $p$ at iteration $c$ is $G_p^c = \sum_{q=1}^{P} \frac{1}{K}\sum_{j=1}^{K} G_q\big(\mathbf{W}_q^{\tau_p^q(c)}, \xi_{\tau_p^q(c),j}\big)$, where $\tau_p^q(c)$ is the number of iterations machine $q$ has transmitted to machine $p$ when machine $p$ conducts its $c$-th iteration. We assume that $\tau_p^p(c) = c$ and $0 \le c - \tau_p^q(c) \le s$, i.e., machine $p$ has access to the fresh information on itself and receives delayed updates from machine $q$ that are within $s$ iterations. Denoting the stepsize by $\eta_c > 0$, the local update rule of machine $p$ is expressed as
$$\mathbf{W}_p^{c+1} = \mathbf{W}_p^c - \eta_c G_p^c, \quad \text{where } 0 \le c - \tau_p^q(c) \le s. \tag{5.3}$$
This formulation, in which $s$ is the maximum "staleness" allowed between any update and any worker, covers bulk synchronous parallel broadcasting ($s = 0$) and bounded-asynchronous broadcasting ($s > 0$). We also define an undelayed version of the aggregated stochastic gradient, $G^c = \sum_{q=1}^{P} \frac{1}{K}\sum_{j=1}^{K} G_q(\mathbf{W}_q^c, \xi_{c,j})$. We adopt the following assumptions for our analysis.

Assumption 1. (1) For all $p = 1,\dots,P$, $f_p$ is continuously differentiable and $f$ is bounded from below; (2) $\nabla f$ and $\nabla f_p$ are Lipschitz continuous with constants $L_f$ and $L_p$, respectively, and let $L = \sum_{p=1}^{P} L_p$; (3) there exist $B, \sigma > 0$ such that for all $p$ and $c$, we have $\|\mathbf{W}_p^c\|_F^2 \le B$ and $\mathbb{E}\big\|G^c - \sum_{p=1}^{P} \nabla f_p(\mathbf{W}_p^c)\big\|_F^2 \le \frac{\sigma^2}{KP}$ almost surely.

The assumptions in items 1 and 2 are standard smoothness assumptions for gradient methods. In the case of full synchronization (i.e., $\mathbf{W}_p^c = \mathbf{W}^c$ for all $p$), the assumption in item 3 reduces to the standard bounded-variance assumption for mini-batch SGD methods. Our analysis is based on the following auxiliary update sequence:
$$\mathbf{W}^{c+1} = \mathbf{W}^c - \eta_c G^c. \tag{5.4}$$
Compared to the local update in Eq. (5.3) on machine $p$, this auxiliary update uses the undelayed aggregated stochastic gradient $G^c$ generated by all machines, instead of the delayed one $G_p^c$. We show that all local parameter sequences $\{\mathbf{W}_p^c\}$ are asymptotically consistent with this auxiliary sequence $\{\mathbf{W}^c\}$, and that their limit points are all stationary points of $f$.

Theorem 6. Denote by $\{\mathbf{W}_p^c\}$, $p = 1,\dots,P$, and $\{\mathbf{W}^c\}$ the local sequences and the auxiliary sequence generated by SFB, respectively. Let Assumption 1 hold. Then, with learning rate $\eta_c = O\!\big(\frac{KP}{(Ls + L_f)\sigma^2 \sqrt{c}}\big)$, we have

• $\lim_{c\to\infty} \max_p \|\mathbf{W}^c - \mathbf{W}_p^c\|_F = 0$, i.e., the maximal disagreement between all local sequences and the auxiliary sequence converges to 0;

• $\min_{c\le C} \mathbb{E}\big\|\sum_{p=1}^{P} \nabla f_p(\mathbf{W}_p^c)\big\|^2 \le O\!\Big(\sqrt{\tfrac{(Ls+L_f)\sigma^2}{KPC}}\Big)$. Moreover, there exists a common subsequence of $\{\mathbf{W}_p^c\}$ and $\{\mathbf{W}^c\}$ that converges almost surely to a stationary point.

Intuitively, Theorem 6 says that, given a properly chosen learning rate, all local worker parameters $\{\mathbf{W}_p^c\}$ eventually converge to stationary points (i.e., local minima) of the objective function $f$, despite the fact that SF transmission can be delayed by up to $s$ iterations. Thus, SFB learning is robust even under bounded-asynchronous communication (such as SSP). Our analysis differs from [41] in two ways: (1) Bertsekas and Tsitsiklis [41] explicitly maintain a consensus model, which would require transmitting the parameter matrix among worker machines, a communication bottleneck that we avoid; (2) we allow subsampling on each worker machine. Accordingly, our theoretical guarantee is probabilistic, instead of the deterministic one in [41].

We now consider the multicast setting, where each machine updates its local variable by aggregating the stochastic gradients from a subset of the other machines. Specifically, we denote by $\delta_p^q(c) \in \{0, 1\}$ whether machine $p$ and machine $q$ are connected at the $c$-th iteration on machine $p$. The aggregated stochastic gradient on machine $p$ at iteration $c$ is denoted $G_p^c(\delta) = \sum_{q=1}^{P} \delta_p^q(c)\, \frac{1}{K}\sum_{j=1}^{K} G_q\big(\mathbf{W}_q^{\tau_p^q(c)}, \xi_{\tau_p^q(c),j}\big)$. We assume that each machine is connected to $Q$ other machines at each iteration, i.e., $\sum_{q=1}^{P} \delta_p^q(c) = Q$ for all $p$ and $c$. Denoting the stepsize by $\eta_c > 0$, the local update rule of machine $p$ is expressed as

$$\mathbf{W}_p^{c+1} = \mathbf{W}_p^c - \eta_c G_p^c(\delta), \quad \text{where } 0 \le c - \tau_p^q(c) \le s, \;\; \sum_{q=1}^{P} \delta_p^q(c) = Q. \tag{5.5}$$

We obtain the following result for the multicast setting.

Corollary 1. Denote by $\{\mathbf{W}_p^c\}$, $p = 1,\dots,P$, and $\{\mathbf{W}^c\}$ the local sequences and the auxiliary sequence generated by multicast SFB, respectively, and assume that each machine aggregates the stochastic gradients from $Q < P$ other machines at every iteration. Let Assumption 1 hold. Then, with learning rate $\eta_c \equiv O\!\big(\frac{1}{L(P-Q)C}\big)$, we have
$$\min_{c=1,\dots,C} \mathbb{E}\Big\|\sum_{p=1}^{P} \nabla f_p(\mathbf{W}_p^c)\Big\|^2 \le O\!\Big(L(P-Q) + \frac{(Ls + L_f)\sigma^2}{KPL(P-Q)C}\Big).$$

In the multicast setting, we need to use a smaller stepsize $\eta_c = O\!\big(\frac{1}{L(P-Q)C}\big)$ to compensate for the errors caused by multicast SFB. The algorithm eventually converges to an $O(L(P-Q))$ neighborhood of a stationary point, i.e., if we broadcast the stochastic gradient updates to more machines, the variable sequence gets closer to a stationary point. In the experiments, we observe that SFB under multicast converges to the same objective value as SFB under broadcast with comparable convergence speed, while multicast significantly reduces the communication cost.
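To make the delayed update rule in Eq. (5.3) concrete, here is a toy numpy simulation (ours, for illustration only; it is not part of the analysis) of P workers running delayed aggregation on a small least-squares problem. Each worker sees its own fresh state and states of the other workers that are delayed by at most s iterations.

import numpy as np

rng = np.random.default_rng(0)
P, s, d, steps, eta = 4, 2, 5, 400, 0.002

# Each worker holds one quadratic piece f_p(w) = 0.5 * ||A_p w - b_p||^2.
A = [rng.standard_normal((10, d)) for _ in range(P)]
b = [rng.standard_normal(10) for _ in range(P)]
grad = lambda p, w: A[p].T @ (A[p] @ w - b[p])

W = [np.zeros(d) for _ in range(P)]           # local replicas W_p
history = [[w.copy() for w in W]]             # states at every past iteration

for c in range(steps):
    new_W = []
    for p in range(P):
        g = np.zeros(d)
        for q in range(P):
            # tau_p^p(c) = c; other machines' states may be stale (Eq. 5.3).
            delay = 0 if q == p else rng.integers(0, s + 1)
            tau = max(0, c - delay)
            g += grad(q, history[tau][q])
        new_W.append(W[p] - eta * g)
    W = new_W
    history.append([w.copy() for w in W])

# The replicas approach one another despite the delays.
print(max(np.linalg.norm(W[p] - W[0]) for p in range(P)))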

173 5.4.1 Appendix: Proof of Theorem 6 PP c c Proof. Throughout, we denote gc := p=1 ∇fp(Wp) = EG . By the descent lemma [41], we have L f(Wc+1) − f(Wc) ≤ hWc+1 − Wc, ∇f(Wc)i + f kWc+1 − Wck2 2 F (i) L = −η hGc, ∇f(Wc)i + f η2kGck2, c 2 c

c+1 c c where (i) follows from the observation that W − W = −ηcG . Taking expectations on both c c 2 2 σ2 sides, and note that EG = gc, EkG k ≤ kgck + KP , we obtain that L η2 L η2σ2 f(Wc+1) − f(Wc) ≤ −η hg , ∇f(Wc)i + f c kGck2 + f c E E E c 2 E F 2KP L η2 L η2σ2 = −η kg k2 − η hg , ∇f(Wc) − g i + f c kGck2 + f c E c E c c 2 E 2KP P 2 2 Lf X Lf η σ ≤ ( η2 − η ) kg k2 + η kg k L kWc − Wck + c . 2 c c E c cE c p p 2KP p=1 (5.6)

c c c Pc−1 t c Pc−1 t Next, we bound the term kWp − W k. Note that W = t=0 ηtG , Wp = t=0 ηtGp, t both of which are aggregations of the stochastic gradient updates. Since Gp contains the delayed c c stochastic gradients with maximum staleness s, the discrepancy between Wp and W is at most the updates {Gc−s+1,..., Gc−1}, and we obtain that

c−1 c c X t kW − Wpk ≤ ηtkG k. (5.7) t=c−s+1

We note that this is only a coarse estimate for simplicity of the derivation. In general, less discrepancy may occur and does not affect the order of the final convergence rate. With Eq.(5.7), Eq.(5.6) can be further bounded as

c+1 c Ef(W ) − Ef(W ) c−1 2 2 Lf X Lf η σ ≤ ( η2 − η ) kg k2 + L η η kg kkGtk + c 2 c c E c E c t c 2KP t=c−s+1 c−1 2 2 Lf X Lf η σ ≤ ( η2 − η ) kg k2 + L [η2kg k2 + η2kGtk2] + c 2 c c E c E c c t 2KP t=c−s+1 c−1 2 2 2 Lf + 2Ls X σ Lf η σ ≤ ( η2 − η ) kg k2 + L η2[ kGtk2 + ] + c . 2 c c E c t E KP 2KP t=c−s+1

174 PC−1 Pc−1 PC−1 Telescoping over c from 0 to C −1 and note that c=0 t=c−s+1 rt ≤ s c=0 rc for the positive sequence {rt}, we obtain that

0 inf f(W) − Ef(W ) C−1 C−1 2 C−1 2 2 X Lf + 4Ls X σ X Lf η σ ≤ ( η2 − η ) kg k2 + Ls η2 + c 2 c c E c c KP 2KP c=0 c=0 c=0 C−1 C−1 2 X Lf + 4Ls X (Ls + Lf )σ ≤ ( η2 − η ) kg k2 + η2 . (5.8) 2 c c E c c KP c=0 c=0 It then follows that

2 0 (Ls+Lf )σ PC−1 2 2 f(W ) − inf f(W) + KP c=0 ηc min kgck ≤ c=1,...,C E PC−1 Lf +4Ls 2 c=0 (ηc − 2 ηc ) 2 f(W0) − inf f(W) + (Ls+Lf )σ PC−1 η2 ≈ KP c=0 c , PC−1 c=0 ηc where we ignore the higher√ order term in the last equation for simplicity, as we will use a dimin- ishing stepsize ηc = O(1/ c) and this does not affect the order of the final estimate. Now we can apply [35, Theorem 4.2] to the above equation to conclude that

r 0 2 2 [f(W ) − inf f(W)](Ls + Lf )σ min Ekgck ≤ , c=1,...,C KPC with the choice of stepsize

4[f(W0) − inf f(W)]KP 1 √ ηc = 2 . (5.9) (Ls + Lf )σ c

2 c c Hence we must have lim infc→∞ Ekgck = 0. By our assumption, {Wp}p,c and {W }c are t bounded and hence have limit points. Also note that the gradient of fj is continuous. Thus, kG k √1 c c is bounded. Since ηc = c → 0, Eq.(5.7) implies that kW − Wpk → 0. This further implies c c c c that {W } and {Wp} share the same set of limit points. Lastly, by the fact that kW −Wpk → 0 c 2 and the Lipschitz continuity of ∇fp, we conclude that lim inf Ek∇f(W )k → 0. Thus, there c c exists a common subsequence of {W } and {Wp} that converges to a stationary point almost surely.

Proof of Corollary 1

The proof strategy is similar to that of Theorem 6. The main difference is the bound on $\|\mathbf{W}^c - \mathbf{W}_p^c\|$. In the multicast setting, we have $\mathbf{W}^c = \sum_{t=0}^{c-1}\eta_t G^t$ and $\mathbf{W}_p^c = \sum_{t=0}^{c-1}\eta_t G_p^t(\delta)$. Besides the discrepancy of gradients caused by the delay $s$, the multicast scheme makes $G_p^t(\delta)$ aggregate only $Q$ stochastic gradients from other machines at each iteration, resulting in a $P - Q$ stochastic-gradient discrepancy compared to $G^t$. Thus, we obtain
$$\|\mathbf{W}^c - \mathbf{W}_p^c\| \le \sum_{t=c-s+1}^{c-1}\eta_t\|G^t\| + \sum_{t=1}^{c-s}\eta_t(P - Q)R, \tag{5.10}$$
where $R$ is an absolute constant that bounds the norm of the gradients. The rest of the proof follows that of Theorem 6. Specifically, with Eq. (5.10), Eq. (5.6) further becomes
$$\begin{aligned}
\mathbb{E}f(\mathbf{W}^{c+1}) - \mathbb{E}f(\mathbf{W}^c)
&\le \Big(\frac{L_f}{2}\eta_c^2 - \eta_c\Big)\mathbb{E}\|g_c\|^2 + L\,\mathbb{E}\Big[\eta_c\|g_c\|\sum_{t=c-s+1}^{c-1}\eta_t\|G^t\|\Big] + L\,\mathbb{E}\Big[\eta_c\|g_c\|\sum_{t=1}^{c-s}\eta_t(P-Q)R\Big] + \frac{L_f\eta_c^2\sigma^2}{2KP}\\
&\le \Big(\frac{L_f}{2}\eta_c^2 - \eta_c\Big)\mathbb{E}\|g_c\|^2 + L\sum_{t=c-s+1}^{c-1}\mathbb{E}\big[\eta_c^2\|g_c\|^2 + \eta_t^2\|G^t\|^2\big] + L\sum_{t=1}^{c-s}\big[\eta_c^2\,\mathbb{E}\|g_c\|^2(P-Q)R + \eta_t^2(P-Q)R\big] + \frac{L_f\eta_c^2\sigma^2}{2KP}\\
&\le \Big(\frac{L_f + 2L(c(P-Q)R + s)}{2}\eta_c^2 - \eta_c\Big)\mathbb{E}\|g_c\|^2 + L\sum_{t=c-s+1}^{c-1}\eta_t^2\Big[\mathbb{E}\|G^t\|^2 + \frac{\sigma^2}{KP}\Big] + L(P-Q)R\sum_{t=1}^{c-s}\eta_t^2 + \frac{L_f\eta_c^2\sigma^2}{2KP}.
\end{aligned}$$

Telescoping over $c$ from $0$ to $C-1$, and noting that $\sum_{c=0}^{C-1}\sum_{t=c-s+1}^{c-1} r_t \le s\sum_{c=0}^{C-1} r_c$ and $\sum_{c=0}^{C-1}\sum_{t=1}^{c-s} r_t \le (C-s)\sum_{c=0}^{C-1} r_c$ for a positive sequence $\{r_t\}$, we obtain
$$\inf f(\mathbf{W}) - \mathbb{E}f(\mathbf{W}^0) \le \sum_{c=0}^{C-1}\Big(\frac{L_f + 2L(c(P-Q)R + s)}{2}\eta_c^2 - \eta_c\Big)\mathbb{E}\|g_c\|^2 + L(P-Q)R\sum_{c=1}^{C-s}\eta_c^2 + \sum_{c=0}^{C-1}\frac{(Ls + L_f)\sigma^2}{KP}\eta_c^2.$$
It then follows that
$$\min_{c=1,\dots,C}\mathbb{E}\|g_c\|^2 \le \frac{f(\mathbf{W}^0) - \inf f(\mathbf{W}) + L(P-Q)R\sum_{c=1}^{C-s}\eta_c^2 + \frac{(Ls+L_f)\sigma^2}{KP}\sum_{c=0}^{C-1}\eta_c^2}{\sum_{c=0}^{C-1}\big(\eta_c - \frac{L_f + 2L(c(P-Q)R + s)}{2}\eta_c^2\big)} \approx \frac{f(\mathbf{W}^0) - \inf f(\mathbf{W}) + \big[L(P-Q)R + \frac{(Ls+L_f)\sigma^2}{KP}\big]\frac{\beta^2}{C}}{\beta - \frac{\beta^2 L(P-Q)R}{2}},$$
where we have chosen the constant stepsize $\eta_c \equiv \beta/C$. Setting $\beta = \frac{1}{L(P-Q)R}$, we obtain
$$\min_{c=1,\dots,C}\mathbb{E}\|g_c\|^2 \le O\Big(L(P-Q) + \frac{(Ls + L_f)\sigma^2}{KPL(P-Q)C}\Big).$$

Chapter 6

Applications in Healthcare

In this chapter, we design machine learning models for several healthcare applications: (1) retrieving similar patients, (2) discovering topics from medical texts, (3) tagging pathology and radiology images, and (4) assigning ICD codes, which involve various data modalities, including texts, images, time series, and tabular data. We apply the diversity-promoting learning (DPL) techniques developed in Chapters 2-4 to the first application, to better capture infrequent patterns, reduce overfitting, and shrink model size without sacrificing modeling power, and we apply the large-scale distributed learning systems developed in Chapter 5 to the second application, to improve training efficiency.

6.1 Diversity-promoting Learning for Similar-patient Retrieval

Patient similarity measurement (PSM) [315, 343], which decides whether two patients are similar or dissimilar based on their electronic health records (EHRs), is a critical task for patient cohort identification and finds wide applications in clinical decision-making. For instance, with an effective similarity measure in hand, one can perform case-based retrieval of similar patients for a target patient, who can be subsequently diagnosed and treated by synthesizing the diagnosis outcomes and treatment courses of the retrieved patients. Other applications powered by patient similarity include classification of epidemiological data on hepatic steatosis [155], patient risk prediction [261, 281], personalized treatment for hypercholesterolemia [413], personalized mortality prediction [219], and near-term prognosis [106, 344], to name a few. Several approaches have applied distance metric learning (DML) [383] to learn patient-similarity measures, where physicians provide feedback on whether two patients are similar or dissimilar, and the algorithm learns a distance metric such that similar patients have small distance while dissimilar patients are separated far apart.

When using DML for patient similarity measurement, we encountered the following problems. First, it performs less well on infrequent diseases. As an example, the top and bottom plots in Figure 6.1(a) show the frequencies of eight diseases and the retrieval performance (precision@K) on individual diseases. As can be seen, the performance on infrequent diseases is much worse than that on frequent diseases. Second, in clinical decision-making, timeliness is very important. For PSM-based applications, how to retrieve similar patients efficiently is critical.

Figure 6.1: (a) Retrieval performance on frequent and infrequent diseases. (b) Retrieval AUCs on the training and test sets.

The time complexity of retrieval increases linearly with the dimension of the latent representations learned in DML. An immediate solution for fast retrieval is to reduce the latent dimension. However, this would sacrifice the expressivity of DML and compromise retrieval performance. Third, DML generalizes less well on unseen patients. As can be seen from Figure 6.1(b), the performance on the test set is much worse than that on the training set. To address these issues, we resort to diversity-promoting regularization, which has been demonstrated to better capture infrequent patterns, reduce model size without sacrificing modeling power, and improve generalization performance (Chapters 2-4). Among the various diversity-promoting regularizers, we choose the convex Bregman matrix divergence (BMD) regularizers (Section 2.2.2) since they are guaranteed to achieve the globally optimal solution in optimization. Specifically, we solve the following problem:

$$\begin{array}{ll}
\min_{\mathbf{M}} & \frac{1}{|S|}\sum_{(\mathbf{x},\mathbf{y})\in S} (\mathbf{x}-\mathbf{y})^\top \mathbf{M}(\mathbf{x}-\mathbf{y}) + \frac{1}{|D|}\sum_{(\mathbf{x},\mathbf{y})\in D} \max\big(0,\, 1 - (\mathbf{x}-\mathbf{y})^\top \mathbf{M}(\mathbf{x}-\mathbf{y})\big) + \lambda\widehat{\Omega}_\phi(\mathbf{M})\\
\text{s.t.} & \mathbf{M} \succeq 0. \qquad (6.1)
\end{array}$$
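To make the optimization problem concrete, below is a minimal NumPy sketch of the objective value in Eq. (6.1). The convex BMD regularizer of Section 2.2.2 is left as a plug-in callable; the stand-in regularizer shown here and all function and variable names are illustrative assumptions, not the thesis implementation, and the PSD constraint on M is assumed to be enforced separately (e.g., by projecting onto the PSD cone after each update).

    import numpy as np

    def dml_objective(M, similar_pairs, dissimilar_pairs, reg, lam):
        """Value of the objective in Eq. (6.1).

        M               : (d, d) Mahalanobis matrix (assumed PSD)
        similar_pairs   : list of (x, y) feature-vector pairs labeled similar
        dissimilar_pairs: list of (x, y) feature-vector pairs labeled dissimilar
        reg             : callable implementing a BMD regularizer Omega(M)
        lam             : regularization weight lambda
        """
        def sq_dist(x, y):
            d = x - y
            return d @ M @ d                      # (x - y)^T M (x - y)

        sim_term = np.mean([sq_dist(x, y) for x, y in similar_pairs])
        dis_term = np.mean([max(0.0, 1.0 - sq_dist(x, y)) for x, y in dissimilar_pairs])
        return sim_term + dis_term + lam * reg(M)

    # Example plug-in regularizer: squared distance of M from the identity,
    # a simple stand-in for the convex BMD regularizers of Section 2.2.2.
    def sfn_like_reg(M):
        return np.sum((M - np.eye(M.shape[0])) ** 2)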

6.1.1 Experimental Settings

The evaluation has been reported in Section 2.2.4. For readers' convenience, we re-summarize it here. Two clinical datasets are used: MIMIC-III [184] and EICU [131]. MIMIC-III contains 58K hospital admissions of patients who stayed within the intensive care units at the Beth Israel Deaconess Medical Center between 2001 and 2012. We extracted 7207-dimensional features: (1) 2 dimensions from demographics, including age and gender; (2) 5300 dimensions from clinical notes, including 5000-dimensional bag-of-words (weighted using tf-idf) and 300-dimensional word2vec [256]; (3) 1905 dimensions from lab tests, where zero-order, first-order, and second-order temporal features are extracted for each of the 635 lab items. In the

extraction of bag-of-words features from clinical notes, we removed stop words, counted the document frequency (DF) of the remaining words, and selected the 5000 words with the largest DF to form the dictionary. Based on this dictionary, we extracted tf-idf features. In the extraction of word2vec features, we trained a 300-dimensional embedding vector for each word using an open-source word2vec tool1. To represent a clinical note, we averaged the embeddings of all words in this note. In the lab tests, there are 635 test items in total. An item is tested at different time points for each admission. For each item, we extracted three types of temporal features: (1) zero-order: averaging the values of this item measured at different time points; (2) first-order: taking the difference of values at every two consecutive time points t and t−1, and averaging these differences; (3) second-order: for the sequence of first-order differences generated in (2), taking the difference (called the second-order difference) of values at every two consecutive time points t and t−1, and averaging these second-order differences. If an item is missing in an admission, we set its zero-order, first-order, and second-order feature values to 0.

The EICU dataset contains hospital admissions of patients who were treated as part of the Philips eICU program across intensive care units in the United States between 2014 and 2015. There are 474 lab test items and 48 vital sign items. Each admission has a past medical history, which is a collection of diseases; there are 2644 unique past diseases. We extracted the following features: (1) age and gender; (2) zero-, first-, and second-order temporal features of lab tests and vital signs; (3) past medical history, encoded as a binary vector: if an element in the vector is 1, then the patient had the corresponding disease in the past.

For both datasets, the features were normalized using min-max normalization along each dimension. We used PCA to reduce the feature dimension to 1000 to lower the computational complexity (performing eigendecomposition of the Mahalanobis matrix bears cubic complexity in terms of the feature dimension). In either dataset, each admission has a primary diagnosis (a disease). MIMIC-III and EICU have 2833 and 2175 unique diseases respectively. Two admissions are considered similar if they have the same primary diagnosis and dissimilar otherwise. In similar-patient retrieval, we used each test example to query the rest of the test examples based on the learned distance metric. If the distance between x and y is smaller than a threshold s and they have the same primary diagnosis, then this is a true positive. By choosing different values of s, we obtained a receiver operating characteristic (ROC) curve. We evaluated the retrieval performance using the area under the ROC curve (AUC) [243]; higher is better.

We compared the proposed convex BMD regularizers with two sets of baseline regularizers. The first set aims at promoting orthogonality and is based on the determinant of covariance (DC) [242], cosine similarity (CS) [402], determinantal point process (DPP) [204, 429], InCoherence (IC) [29], variational Gram function (VGF) [177, 420], decorrelation (DeC) [82], mutual angles (MA) [369], squared Frobenius norm (SFN) [74, 114, 122, 347], von Neumann divergence (VND), log-determinant divergence (LDD), and the orthogonal constraint (OC) $\mathbf{A}\mathbf{A}^\top = \mathbf{I}$ [233, 342].
All these regularizers were applied to the non-convex projection matrix-based DML (PDML, Eq.(2.14) in Section 2.2.1). The other set of regularizers is not designed particularly for promoting orthogonality but is commonly used, including the $\ell_2$ norm, $\ell_1$ norm [280], $\ell_{2,1}$ norm [399], trace norm (Tr) [234], information-theoretic (IT) regularizer $-\log\det(\mathbf{M}) + \mathrm{tr}(\mathbf{M})$ [92], and dropout (Drop) [313]. All these regularizers were applied to the

1https://code.google.com/archive/p/word2vec/

Method                  MIMIC                        EICU
                        AUC-All  AUC-F  AUC-IF       AUC-All  AUC-F  AUC-IF
NCDML                   0.634    0.654  0.608        0.671    0.690  0.637
CDML                    0.641    0.659  0.617        0.677    0.693  0.652

EUC                     0.559    0.558  0.562        0.583    0.584  0.581
LMNN [357]              0.628    0.643  0.609        0.662    0.678  0.633
LDML [146]              0.619    0.638  0.594        0.667    0.678  0.647
MLEC [200]              0.621    0.633  0.605        0.679    0.692  0.656
GMML [405]              0.607    0.621  0.588        0.668    0.679  0.648
ILHD [63]               0.577    0.590  0.560        0.637    0.652  0.610

ℓ2-CDML                 0.648    0.664  0.627        0.695    0.706  0.676
ℓ1-CDML [280]           0.643    0.666  0.615        0.701    0.715  0.677
ℓ2,1-CDML [399]         0.646    0.658  0.630        0.703    0.727  0.661
Tr-CDML [234]           0.659    0.672  0.642        0.696    0.709  0.673
IT-CDML [92]            0.653    0.675  0.626        0.692    0.705  0.668
Dropout-CDML [282]      0.647    0.660  0.630        0.701    0.718  0.670
OS-CDML [116]           0.649    0.665  0.626        0.689    0.711  0.679

DCM-NCDML [242]         0.652    0.662  0.639        0.706    0.717  0.686
CS-NCDML [402]          0.661    0.676  0.641        0.712    0.736  0.670
DPP-NCDML [429]         0.659    0.679  0.632        0.714    0.725  0.695
IC-NCDML [29]           0.660    0.674  0.642        0.711    0.728  0.685
DC-NCDML [82]           0.648    0.666  0.625        0.698    0.711  0.675
VGF-NCDML [177]         0.657    0.673  0.634        0.718    0.730  0.697
MA-NCDML [367]          0.659    0.670  0.644        0.721    0.733  0.703
OC-NCDML [233]          0.651    0.663  0.636        0.705    0.716  0.685
OS-NCDML [116]          0.639    0.658  0.614        0.675    0.691  0.641
SFN-NCDML [74]          0.662    0.677  0.642        0.724    0.736  0.701

VND-NCDML               0.667    0.676  0.655        0.733    0.748  0.706
LDD-NCDML               0.664    0.674  0.651        0.731    0.743  0.711

CSFN-CDML               0.668    0.679  0.653        0.728    0.741  0.705
CVND-CDML               0.672    0.678  0.664        0.735    0.744  0.718
CLDD-CDML               0.669    0.678  0.658        0.739    0.750  0.719

Table 6.1: On MIMIC and EICU, we show the mean AUC (averaged on 5 random train/test splits) on all diseases (AUC-All), frequent diseases (AUC-F), and infrequent diseases (AUC-IF). On the second panel (EUC, etc.) are well-established or state-of-the-art baselines. On the third panel (ℓ2-CDML, etc.) are CDML methods regularized by non-diversity regularizers. On the fourth panel (DCM-NCDML, etc.) are NCDML methods regularized by previously proposed diversity-promoting regularizers. On the fifth panel (VND-NCDML, etc.) are NCDML methods regularized by our proposed nonconvex BMD regularizers. On the sixth panel (CSFN-CDML, etc.) are CDML methods regularized by our proposed convex BMD regularizers.

Mahalanobis distance-based DML (MDML, Eq.(6.1) in Section 2.2.2). In addition, we compared with the vanilla Euclidean distance (EUC) and other distance learning methods including large margin nearest neighbor (LMNN) metric learning, information-theoretic metric learning (ITML) [92], logistic discriminant metric learning (LDML) [146], metric learning from equivalence constraints (MLEC) [200], geometric mean metric learning (GMML) [405], and independent Laplacian hashing with diversity (ILHD) [63].

For computational efficiency, in MDML-based methods, we did not use $(\mathbf{x}-\mathbf{y})^\top\mathbf{M}(\mathbf{x}-\mathbf{y})$ to compute distances directly. Given the learned matrix $\mathbf{M}$ (which is of rank $k$), we can decompose it into $\mathbf{L}^\top\mathbf{L}$ where $\mathbf{L}\in\mathbb{R}^{k\times d}$. Let $\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$ be the eigendecomposition of $\mathbf{M}$, let $\lambda_1,\cdots,\lambda_k$ denote the $k$ nonzero eigenvalues, and let $\mathbf{u}_1,\cdots,\mathbf{u}_k$ denote the corresponding eigenvectors. Then $\mathbf{L}$ is the transpose of $[\sqrt{\lambda_1}\mathbf{u}_1,\cdots,\sqrt{\lambda_k}\mathbf{u}_k]$. Given $\mathbf{L}$, we can use it to transform each input $d$-dimensional feature vector $\mathbf{x}$ into a new $k$-dimensional vector $\mathbf{L}\mathbf{x}$, then perform retrieval on the new vectors based on Euclidean distance.

Method          MIMIC                 EICU
                AUC    NPV    CS      AUC    NPV    CS
PDML            0.634  300    2.1     0.671  400    1.7
MDML            0.641  247    2.6     0.677  318    2.1
EUC             0.559  1000   0.6     0.583  1000   0.6
LMNN            0.628  200    3.1     0.662  400    1.7
LDML            0.619  300    2.1     0.667  400    1.7
MLEC            0.621  487    1.3     0.679  493    1.4
GMML            0.607  1000   0.6     0.668  1000   0.7
ILHD            0.577  100    5.8     0.637  100    6.4
MDML-ℓ2         0.648  269    2.4     0.695  369    1.9
MDML-ℓ1         0.643  341    1.9     0.701  353    2.0
MDML-ℓ2,1       0.646  196    3.3     0.703  251    2.8
MDML-Tr         0.659  148    4.5     0.696  233    3.0
MDML-IT         0.653  1000   0.7     0.692  1000   0.7
MDML-Drop       0.647  183    3.5     0.701  284    2.5
PDML-DC         0.652  100    6.5     0.706  300    2.4
PDML-CS         0.661  200    3.3     0.712  200    3.6
PDML-DPP        0.659  100    6.6     0.714  200    3.6
PDML-IC         0.660  200    3.3     0.711  200    3.6
PDML-DeC        0.648  200    3.2     0.698  300    2.3
PDML-VGF        0.657  200    3.3     0.718  200    3.6
PDML-MA         0.659  200    3.3     0.721  200    3.6
PDML-OC         0.651  100    6.5     0.705  100    7.1
PDML-SFN        0.662  100    6.6     0.724  200    3.6
PDML-VND        0.667  100    6.7     0.733  100    7.3
PDML-LDD        0.664  100    6.6     0.731  200    3.7
MDML-CSFN       0.668  143    4.7     0.728  209    3.5
MDML-CVND       0.672  53     12.7    0.735  65     11.3
MDML-CLDD       0.669  76     8.8     0.739  128    5.8

Table 6.2: Area under ROC curve (AUC), number of projection vectors (NPV), and compactness score (CS, ×10^-3).

Note that only when computing the Euclidean distance between $\mathbf{L}\mathbf{x}$ and $\mathbf{L}\mathbf{y}$ do we have that $\|\mathbf{L}\mathbf{x}-\mathbf{L}\mathbf{y}\|_2^2$ is equivalent to $(\mathbf{x}-\mathbf{y})^\top\mathbf{M}(\mathbf{x}-\mathbf{y})$; for other distances or similarity measures between $\mathbf{L}\mathbf{x}$ and $\mathbf{L}\mathbf{y}$, such as the L1 distance and cosine similarity, this does not hold. Performing retrieval based on $\|\mathbf{L}\mathbf{x}-\mathbf{L}\mathbf{y}\|_2^2$ is more efficient than retrieval based on $(\mathbf{x}-\mathbf{y})^\top\mathbf{M}(\mathbf{x}-\mathbf{y})$ when $k$ is smaller than $d$: given $m$ test examples, the computational complexity of $\|\mathbf{L}\mathbf{x}-\mathbf{L}\mathbf{y}\|_2^2$-based retrieval is $O(mkd+m^2k)$, while that of $(\mathbf{x}-\mathbf{y})^\top\mathbf{M}(\mathbf{x}-\mathbf{y})$-based retrieval is $O(m^2d^2)$.
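As a rough illustration of the retrieval speedup described above (factoring $\mathbf{M} = \mathbf{L}^\top\mathbf{L}$ via the eigendecomposition and retrieving with Euclidean distances on the projected vectors), here is a NumPy sketch; the helper names and the threshold-based interface are our own, not taken from the thesis code.

    import numpy as np

    def low_rank_factor(M, tol=1e-8):
        """Factor a PSD matrix M of rank k as M = L^T L with L of shape (k, d)."""
        lam, U = np.linalg.eigh(M)          # eigendecomposition M = U diag(lam) U^T
        keep = lam > tol                    # keep the k nonzero eigenvalues
        return np.sqrt(lam[keep])[:, None] * U[:, keep].T   # rows are sqrt(lam_i) u_i^T

    def retrieve(L, query, database, threshold):
        """Indices of database items within `threshold` of the query under M."""
        q = L @ query                       # project the d-dim query to k dims once
        Z = database @ L.T                  # (m, k) projected database
        d2 = np.sum((Z - q) ** 2, axis=1)   # ||Lx - Ly||_2^2 = (x - y)^T M (x - y)
        return np.where(d2 < threshold)[0]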

6.1.2 Results

First, we verify whether CSFN, CVND, and CLDD are able to improve the performance on infrequent diseases. On MIMIC and EICU, we consider a disease "frequent" if it contains more than 1000 examples, and "infrequent" otherwise. We measure AUCs on all diseases (AUC-All), infrequent diseases (AUC-IF), and frequent diseases (AUC-F). As can be seen from Table 6.1, in most DML methods, the AUCs on infrequent diseases are worse than those on frequent diseases, showing that DML is sensitive to the imbalance of diseases' frequencies, tends to be biased towards frequent diseases, and is less capable of capturing infrequent diseases. This is in accordance with previous findings [369]. Adding our proposed CSFN, CVND, and CLDD regularizers to CDML, the AUCs on infrequent diseases are greatly improved. By encouraging the components to be close to orthogonal, our methods reduce the redundancy among vectors. Mutually complementary vectors can achieve a broader coverage of latent features, including those associated with infrequent diseases.

Method                 MIMIC    EICU
NCDML                  0.175    0.145
CDML                   0.187    0.142
LMNN [357]             0.183    0.153
LDML [146]             0.159    0.139
MLEC [200]             0.162    0.131
GMML [405]             0.197    0.157
ILHD [63]              0.164    0.162
ℓ2-CDML                0.184    0.136
ℓ1-CDML [280]          0.173    0.131
ℓ2,1-CDML [399]        0.181    0.129
Tr-CDML [234]          0.166    0.138
IT-CDML [92]           0.174    0.134
Dropout-CDML [282]     0.182    0.140
OS-CDML [116]          0.166    0.133
DCM-NCDML [242]        0.159    0.131
CS-NCDML [402]         0.163    0.135
DPP-NCDML [429]        0.147    0.140
IC-NCDML [29]          0.155    0.127
DC-NCDML [82]          0.164    0.123
VGF-NCDML [177]        0.158    0.136
MA-NCDML [367]         0.143    0.128
OC-NCDML [233]         0.161    0.142
OS-NCDML [116]         0.169    0.137
SFN-NCDML [74]         0.153    0.126
VND-NCDML              0.148    0.135
LDD-NCDML              0.146    0.121
CSFN-CDML              0.142    0.124
CVND-CDML              0.137    0.115
CLDD-CDML              0.131    0.118

Table 6.3: The gap of training AUC and testing AUC.

As a result, these vectors improve the performance on infrequent diseases. Thanks to their convexity, our methods can achieve the globally optimal solution and outperform the non-convex ones, which can only achieve a local optimum and hence a sub-optimal solution. (C)VND and (C)LDD outperform (C)SFN, possibly because they measure near-orthogonality in a global way while (C)SFN does so in a pairwise fashion. Comparing OS-(NCDML,CDML) with the unregularized NCDML/CDML, we can see that over-sampling indeed improves performance on infrequent diseases; however, this improvement is less significant than that achieved by our methods. In general, the diversity-promoting (DP) regularizers outperform the non-DP regularizers, suggesting the effectiveness of promoting diversity. The orthogonal constraint (OC) [233, 342] imposes strict orthogonality, which may be too restrictive and hurt performance. ILHD [63] learns binary hash codes, which makes retrieval more efficient; however, it achieves lower AUCs due to quantization errors. (CSFN,CVND,CLDD)-CDML outperform popular DML approaches including LMNN, LDML, MLEC, and GMML, demonstrating their competitive standing in the DML literature.

Table 6.2 shows the area under the ROC curve (AUC) and the number of projection vectors (NPV) that achieves each AUC. For MDML-based methods, the NPV equals the rank of the Mahalanobis matrix since $\mathbf{M} = \mathbf{A}^\top\mathbf{A}$. We define a compactness score (CS), which is the ratio between AUC and NPV. A higher CS indicates achieving a higher AUC using fewer projection vectors. From Table 6.2, we can see that on both datasets, MDML-(CSFN,CVND,CLDD) achieve larger CSs than the baseline methods, demonstrating their better capability in learning

compact distance metrics. CSFN, CVND, and CLDD perform better than the non-convex regularizers, and CVND and CLDD perform better than CSFN. The reduction of NPV improves the efficiency of retrieval since the computational complexity grows linearly with this number. Table 6.3 shows the difference between training AUC and testing AUC. Our methods have the smallest gap between training and testing AUCs. This indicates that our methods are better able to reduce overfitting and improve generalization performance.

6.2 Large-scale Distributed Learning for Medical Topic Discovery

Discovering medical topics from clinical documents has many applications, such as consumer medical search [84], mining FDA drug labels [44], and investigating drug repositioning opportunities [45], to name a few. In practice, the clinical text corpus can contain millions of documents and the medical dictionary can comprise hundreds of thousands of terminologies. These large-scale documents contain rich medical topics, whose number can be in the tens of thousands. Efficiently discovering so many topics from such a large dataset is computationally challenging. We resort to the Orpheus system (Section 5.2) that we have developed.

We begin by introducing how the topics are learned. We represent each document with a bag-of-words vector $\mathbf{d}\in\mathbb{R}^V$, where $V$ is the dictionary size. Similar to [207], we assume each document is approximately a linear combination of $K$ topics: $\mathbf{d}\approx\mathbf{W}\boldsymbol{\theta}$. $\mathbf{W}\in\mathbb{R}^{V\times K}$ is the topic matrix, where each topic is represented with a vector $\mathbf{w}\in\mathbb{R}^V$; $w_v\ge 0$ denotes the association strength between the $v$-th word and this topic, and $\sum_{v=1}^V w_v = 1$. $\boldsymbol{\theta}\in\mathbb{R}^K$ are the linear combination weights, satisfying $\theta_k\ge 0$ and $\sum_{k=1}^K\theta_k = 1$; $\theta_k$ denotes how relevant topic $k$ is to this document. Given the unlabeled documents $\{\mathbf{d}_i\}_{i=1}^N$, we learn the topics by solving the following problem [207]:

$$\begin{aligned}
\min_{\{\boldsymbol{\theta}_i\}_{i=1}^N,\,\mathbf{W}}\ & \frac{1}{2}\sum_{i=1}^N \|\mathbf{d}_i - \mathbf{W}\boldsymbol{\theta}_i\|_2^2\\
\text{s.t.}\ & \forall k = 1,\cdots,K,\ v = 1,\cdots,V,\quad W_{kv}\ge 0,\ \textstyle\sum_{v=1}^V W_{kv} = 1; \qquad (6.2)\\
& \forall i = 1,\cdots,N,\ k = 1,\cdots,K,\quad \theta_{ik}\ge 0,\ \textstyle\sum_{k=1}^K \theta_{ik} = 1,
\end{aligned}$$

where $\{\boldsymbol{\theta}_i\}_{i=1}^N$ denotes all the linear coefficients. This problem can be solved by alternating between $\{\boldsymbol{\theta}_i\}_{i=1}^N$ and $\mathbf{W}$. The $N$ sub-problems defined on $\{\boldsymbol{\theta}_i\}_{i=1}^N$ can be solved independently by each machine based on the data shard it has. No inter-machine communication is needed. For the sub-problem defined on $\mathbf{W}$, we use the Orpheus system to solve it. Each machine maintains a local copy of $\mathbf{W}$ and the different copies are synchronized to ensure convergence. The projected stochastic gradient descent algorithm is applied, which iteratively performs two steps: (1) stochastic gradient descent; (2) projection onto the probability simplex. In the first step, the stochastic gradient matrix computed over one document can be written as the outer product of two vectors: $(\mathbf{W}\boldsymbol{\theta}_i - \mathbf{d}_i)\boldsymbol{\theta}_i^\top$. In other words, this problem has a sufficient factor (SF) property and fits into the sufficient factor broadcasting framework. In each iteration, each machine computes a set of SFs, sends them to other machines, and uses the SFs to update its own parameter copy. On the receiver side, the received SFs are

System            Runtime (hours)
FlexiFaCT [207]   33.9
Bosen [355]       23.5
Orpheus           5.4

Table 6.4: Convergence time (hours) in Orpheus and baseline systems for the topic model.


Figure 6.2: (a) A medical image has multiple tags which are correlated clinically or biologically. (b) The medical concepts are organized into a hierarchical tree. (c) The abnormal regions (marked by red contours) are small.

converted to gradient matrices, which are applied to the receiver's parameter copy. A projection operation follows the gradient descent update.

We applied the Orpheus-based topic model (TM) to learn medical topics from the PubMed [9] dataset, which contains 8.2M documents and ∼0.74B words. The dictionary size is 141K. The number of topics was set to 50K and the mini-batch size to 100. In random multicast, the number of destinations each machine sends a message to was set to 4. In SF selection, the number of selected SFs was set to 25. The consistency model was set to SSP with a staleness value of 4. Table 6.4 shows the convergence time of TM under Orpheus and two baseline systems: Bosen [355] – a parameter server framework – and FlexiFaCT [207] – a Hadoop implementation of TM. As can be seen, Orpheus is much more efficient than the baseline systems. With 34 CPU machines, the training of Orpheus-TM can be finished in 5.4 hours.
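To make the sufficient-factor update described above concrete, here is a minimal NumPy sketch of one stochastic step on the topic matrix: the two sufficient factors are computed locally, the gradient is rebuilt on the receiver as their outer product, and each topic vector is projected back onto the probability simplex. The function names, the per-column projection of W, and the single-document step are illustrative assumptions, not the Orpheus implementation.

    import numpy as np

    def project_simplex(v):
        """Euclidean projection of v onto the probability simplex {w >= 0, sum w = 1}."""
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1)
        return np.maximum(v - theta, 0.0)

    def sufficient_factors(W, theta_i, d_i):
        """SFs of the stochastic gradient on one document: gradient = u v^T."""
        u = W @ theta_i - d_i       # V-dimensional residual
        v = theta_i                 # K-dimensional combination weights
        return u, v

    def apply_sf_update(W, u, v, lr):
        """Receiver side: rebuild the gradient from the SFs, step, then project."""
        W -= lr * np.outer(u, v)                  # full V x K gradient from two vectors
        for k in range(W.shape[1]):               # project each topic vector
            W[:, k] = project_simplex(W[:, k])
        return W

Communicating the two vectors u and v instead of the dense V × K gradient is what makes the sufficient factor property attractive at this scale.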

6.3 Hierarchical Multi-label Tagging of Medical Images

Medical images generated from radiography, computed tomography (CT) scans, magnetic resonance imaging (MRI), ultrasound, biopsy, etc., are widely used in hospitals and clinics for diagnosis, treatment, and surgery. Once read, these images are dumped into the picture archiving and communication system (PACS). Lacking accurate and rich textual labels, these images are difficult to index and search. As a result, the utilization of these images is under-explored. To

address this issue, we study the automatic tagging of medical images, where the obtained textual tags improve the accessibility of images. For example, physicians can search for images that are tagged with "chest x-ray" and "pneumonia" and perform a cohort study on them.

Automatically tagging clinical images with medical concepts is challenging. First, a medical image usually contains rich information. As a result, it can be simultaneously tagged with multiple concepts. Figure 6.2a shows an example where the image is tagged with "capillary", "fibroblast", and "mononuclear". These concepts have strong clinical or biological correlations. For example, mononuclear cells and fibroblasts interact in scleroderma [133], and capillaries and fibroblasts have a dynamic interaction in the heart [51]. Such correlations can be exploited to improve tagging accuracy. Here is an example: since capillaries are visually similar to Mallory bodies, it is difficult to distinguish the two based on image features alone; but Mallory bodies have little correlation with fibroblasts and mononuclear cells, so by leveraging the concept correlations we can correctly choose capillary rather than Mallory body as the label. Technically, how to capture these correlations is nontrivial.

Second, medical concepts are usually organized by physicians into a hierarchical ontology. At the top of the hierarchy are general concepts; from top to bottom, each concept is divided into more fine-grained sub-concepts. Figure 6.2b shows an example. In this concept tree, each node represents a disease, whose children represent subtypes of this disease. For instance, meningomyelocele and anencephaly are both subtypes of neural tube defects. This hierarchical structure can be leveraged to improve tagging accuracy. On one hand, if A and B are both children of C, then it is unlikely that A and B would simultaneously tag an image. For instance, low-power microscopic, middle-power microscopic, and high-power microscopic are children of microscopic, and a pathology image can only be taken with one of these microscopic techniques. On the other hand, if the distance between A and B in the concept tree is smaller than that between A and C and we know A is the correct label, then B is more likely to be a correct label than C, since concepts with smaller distance are more relevant. For example, abscess is closer to phagocytosis than to omphalocele: the former two are both under the sub-tree rooted at inflammation. As a result, if an image is tagged with abscess, it is more likely to be tagged with phagocytosis than with omphalocele. How to exploit the hierarchical structure among concepts for better tagging is technically demanding.

Third, in images showing abnormalities, the abnormal regions are usually very small, occupying a small proportion of the entire image. As shown in Figure 6.2c, the dark round areas (marked with red contours) show lymph nodes which are involved by the neoplasm. It is difficult to tag these abnormalities because the abnormal regions are so small. How to detect these regions and properly tag them is challenging.

We aim at addressing these three challenges. To cope with the first challenge, we design an adversarial learning [134] approach where a discriminative network is used to distinguish the predicted labels from physician-provided labels while the predictive network tries to make them indistinguishable (i.e., one cannot tell whether a label vector is predicted or human-provided).
Such indiscernibility ensures the correlations among physician-provided labels are transferred to the predicted labels. To cope with the second challenge, we design a tree-of-sequences long short-term memory (LSTM) model which uses sequential LSTMs [163] to encode the individual concepts and a tree-structured LSTM [320, 368] to capture the hierarchical relationship among concepts. To address the third challenge, we design a contextual attention mechanism which calculates an importance score for each image patch according to its contrast with the entire image. Patches with larger importance are paid more attention in predicting the tags. We demonstrate the effectiveness of our methods on two medical image datasets.

Figure 6.3: Our model consists of three major modules. Given the medical image, a contextual attention encoder is leveraged to learn a representation vector for this image. The tree-of-sequences LSTM encoder is used to learn a representation for each concept in the ontology. The image representation and concept representations are fed into the adversarial multi-label tagger to generate the tags for this image.

6.3.1 Methods

Figure 6.3 shows the overview of our model, where the inputs include a medical image and an ontology of medical concepts, and the outputs are multiple concepts that can be used to tag this image. In the concept ontology, only the leaf nodes are used for tagging. The model consists of three major modules: (1) a contextual attention encoder, which takes the image as input, localizes abnormal regions which are given higher importance weights, and generates a weighted representation for this image; (2) a tree-of-sequences LSTM encoder, which takes the concept ontology as input, uses sequential LSTMs to encode individual concepts, and utilizes a bidirectional tree-structured LSTM to incorporate the hierarchical relationship among concepts into their representations; (3) an adversarial multi-label tagger, which takes the representations of the image and the concepts as inputs, incorporates label correlations using adversarial learning, and outputs multiple concepts to tag this image.

Adversarial Multi-label Tagging

It is often the case that a medical image can be tagged with multiple medical concepts that exhibit clinical or biological correlations. These correlations can be leveraged to distinguish concepts that are difficult to differentiate using visual clues alone. In this section, we present an adversarial learning approach to capture such correlations for better multi-label tagging.

When assigning multiple concepts to an image, we need to consider two orthogonal factors: (1) concept-image relevance – how relevant the content of the image is to each concept; (2) concept-concept correlation – how strongly these concepts are correlated. To achieve (1), we build a predictive network which takes the image representation $\mathbf{x}$ (produced by the contextual attention encoder) and the representation $\mathbf{c}_i$ of each concept $i$ (produced by the tree-of-sequences


Figure 6.4: Adversarial learning for capturing correlations among concepts. The discriminative network (DN) aims at distinguishing the labels produced by the predictive network (PN) from the groundtruth labels provided by physicians. The PN tries to make the two sets of labels indistinguishable.


Figure 6.5: Tree-of-sequences LSTM. For each concept name, we build a sequential LSTM (SLSTM) to capture the semantics of words and the sequential structure among words. Over the concept hierarchy, we build a tree LSTM (TLSTM) to capture the hierarchical relationship among concepts. The encodings produced by the SLSTM are the inputs of the TLSTM.

LSTM encoder) as inputs and calculates a score $s_i$ that measures how relevant this image is to this concept. To achieve (2), we leverage adversarial learning [134]. The basic idea is to use a discriminative network to tell which labels are produced by the predictive network and which are provided by the physicians. The predictive network tries to produce labels in a way that the discriminative network cannot tell whether they are predicted or human-provided. Such indiscriminability transfers the correlations manifested in human-provided labels into those predicted by the network.

Figure 6.4 illustrates the details. The training set is divided into two partitions: A and B. Partition A is used for learning the relevance between images and concepts. Partition B is used for learning the correlation among concepts. The groundtruth tagging of each image is represented by a binary vector $\mathbf{t}\in\mathbb{R}^k$, where $k$ is the number of unique leaf concepts in the ontology and $t_i = 1$ denotes that the $i$-th concept is utilized by the physician to label this image. For each image

$\mathbf{x}$ in A, the predictive network (PN) predicts a score vector $\mathbf{s} = f_P(\mathbf{x};\mathbf{W}_P)$, where $s_i \in [0,1]$ denotes the confidence that this image should be tagged with concept $i$ and $f_P(\cdot;\mathbf{W}_P)$ denotes the PN parameterized by weight parameters $\mathbf{W}_P$. Then a sigmoid cross entropy (SCE) loss is defined on $\mathbf{s}$ and $\mathbf{t}$ to measure the discrepancy between the prediction and the groundtruth. For each image in B, the predictive network similarly generates the score vector $\mathbf{s}$. We use a discriminative network (DN) to differentiate the predicted score vectors $\{\mathbf{s}^{(n)}\}_{n=1}^{N_B}$ from the groundtruth label vectors $\{\mathbf{t}^{(n)}\}_{n=1}^{N_B}$, where $N_B$ is the number of images in B. This is a binary classification task. The input to the DN (denoted by $f_D(\cdot;\mathbf{W}_D)$ where $\mathbf{W}_D$ are the weight parameters) is a $k$-dimensional vector and the output is the probability that the vector is predicted. As for the predictive network, it aims at making $\{\mathbf{s}^{(n)}\}_{n=1}^{N_B}$ indistinguishable from $\{\mathbf{t}^{(n)}\}_{n=1}^{N_B}$, such that the correlations reflected in $\{\mathbf{t}^{(n)}\}_{n=1}^{N_B}$ can be transferred to $\{\mathbf{s}^{(n)}\}_{n=1}^{N_B}$. Overall, we solve the following optimization problem:

$$\min_{\mathbf{W}_P}\ \mathcal{L}_{SCE}(\mathbf{W}_P, A) + \max_{\mathbf{W}_D}\ -\mathcal{L}_{BC}(\mathbf{W}_P, \mathbf{W}_D, B). \qquad (6.3)$$

$\mathcal{L}_{SCE}(\mathbf{W}_P, A)$ is the sigmoid cross-entropy loss defined on the training set A:

$$\mathcal{L}_{SCE}(\mathbf{W}_P, A) = \sum_{n=1}^{N_A} \ell\big(f_P(\mathbf{x}^{(n)};\mathbf{W}_P),\, \mathbf{t}^{(n)}\big), \qquad (6.4)$$

where $\ell(\mathbf{s},\mathbf{t}) = -\sum_{i=1}^k \big[t_i \log s_i + (1-t_i)\log(1-s_i)\big]$ is the SCE loss on a single image and $N_A$ is the number of images in A. $\mathcal{L}_{BC}(\mathbf{W}_P,\mathbf{W}_D,B)$ is the binary classification loss defined on training set B:

$$\mathcal{L}_{BC}(\mathbf{W}_P,\mathbf{W}_D,B) = \sum_{n=1}^{2N_B} \ell\big(f_D(\mathbf{a}^{(n)};\mathbf{W}_D),\, b^{(n)}\big), \qquad (6.5)$$

where $\mathbf{a}^{(n)}$ can be either a predicted score vector $f_P(\mathbf{x};\mathbf{W}_P)$ or a physician-provided label vector $\mathbf{t}$; in the former case $b=1$, in the latter case $b=0$. In the overall loss, the PN learns its weight parameters $\mathbf{W}_P$ by minimizing the SCE loss and maximizing the BC loss. The DN learns its weight parameters $\mathbf{W}_D$ by minimizing the BC loss. Note that the SCE loss and the BC loss cannot be defined on the same training images: by overfitting the training set, the SCE loss can make the predicted label vector identical to the groundtruth vector, in which case it is futile to distill concept correlations using the BC loss. In AL, designing the architecture of the prediction and discrimination networks is required; however, this is in general much easier than designing a comprehensive objective function for multi-label classification. In our experiments, we simply use common CNNs and feedforward NNs.

Previously, adversarial learning was applied to domain adaptation [119]: a discriminator is learned to judge whether an example is from the source domain or the target domain, and an encoder is learned so that, after being encoded, a sample cannot be identified as coming from the source or the target domain. By doing this, the discrepancy between the two domains is eliminated, so that data from the source domain can be utilized to improve the task in the target domain. In our approach of using adversarial learning for multi-label classification, a similar idea is explored: adversarial learning is used to eliminate the discrepancy between the predicted labels and the groundtruth labels in terms of label-correlation patterns, so that the correlations existing in the groundtruth labels can be transferred to the predicted labels.
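The interplay of the two losses in Eqs. (6.4)-(6.5) can be sketched as follows, with the predictive and discriminative networks abstracted as callables; this is a minimal illustration in NumPy, and the function names, the mini-batch interface, and the sign convention for the two returned objectives are our assumptions rather than the thesis code.

    import numpy as np

    def sce_loss(s, t, eps=1e-12):
        """Sigmoid cross-entropy between predicted scores s and groundtruth t (Eq. 6.4)."""
        s = np.clip(s, eps, 1.0 - eps)
        return -np.sum(t * np.log(s) + (1.0 - t) * np.log(1.0 - s))

    def bc_loss(prob_predicted, b, eps=1e-12):
        """Discriminator loss on one label vector (Eq. 6.5).

        prob_predicted: DN output, probability that the vector came from the PN
        b             : 1 if the vector was predicted, 0 if physician-provided
        """
        p = np.clip(prob_predicted, eps, 1.0 - eps)
        return -(b * np.log(p) + (1.0 - b) * np.log(1.0 - p))

    def losses_on_minibatch(f_P, f_D, batch_A, batch_B):
        """PN objective = SCE on A minus BC on B (the PN maximizes the BC loss);
        DN objective = BC on B (the DN minimizes it)."""
        L_sce = sum(sce_loss(f_P(x), t) for x, t in batch_A)
        L_bc = sum(bc_loss(f_D(f_P(x)), 1) + bc_loss(f_D(t), 0) for x, t in batch_B)
        return L_sce - L_bc, L_bc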

Tree-of-sequences LSTM Encoder

In this section, we aim at learning embedding vectors for the medical concepts. Each concept has a name (a sequence of words) that tells the semantics of this concept. We use a sequential LSTM (SLSTM) to encode this name. Meanwhile, in the hierarchical ontology, the concepts possess hierarchical relationships. To capture such relationships, on top of the encodings generated by the SLSTMs, we build a tree LSTM (TLSTM) [320] along the hierarchy, ending up with a tree-of-sequences LSTM model (Figure 6.5).

Sequential LSTM

A sequential LSTM (SLSTM) [163] network is a special type of recurrent neural network that (1) learns the latent representation (which usually reflects certain semantic information) of words and (2) models the sequential structure among words. In the word sequence, each word $t$ is allocated an SLSTM unit, which consists of the following components: input gate $\mathbf{i}_t$, forget gate $\mathbf{f}_t$, output gate $\mathbf{o}_t$, memory cell $\mathbf{c}_t$, and hidden state $\mathbf{s}_t$. These components (vectors) are computed as follows:

$$\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}^{(i)}\mathbf{s}_{t-1} + \mathbf{U}^{(i)}\mathbf{x}_t + \mathbf{b}^{(i)})\\
\mathbf{f}_t &= \sigma(\mathbf{W}^{(f)}\mathbf{s}_{t-1} + \mathbf{U}^{(f)}\mathbf{x}_t + \mathbf{b}^{(f)})\\
\mathbf{o}_t &= \sigma(\mathbf{W}^{(o)}\mathbf{s}_{t-1} + \mathbf{U}^{(o)}\mathbf{x}_t + \mathbf{b}^{(o)})\\
\mathbf{c}_t &= \mathbf{i}_t\odot\tanh(\mathbf{W}^{(c)}\mathbf{s}_{t-1} + \mathbf{U}^{(c)}\mathbf{x}_t + \mathbf{b}^{(c)}) + \mathbf{f}_t\odot\mathbf{c}_{t-1}\\
\mathbf{s}_t &= \mathbf{o}_t\odot\tanh(\mathbf{c}_t),
\end{aligned} \qquad (6.6)$$

where $\mathbf{x}_t$ is the embedding vector of word $t$, $\mathbf{W}$, $\mathbf{U}$ are component-specific weight matrices, and $\mathbf{b}$ are bias vectors.
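A minimal NumPy sketch of one SLSTM step following Eq. (6.6); the dictionary layout of the parameters is an assumption made purely for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def slstm_step(params, s_prev, c_prev, x_t):
        """One step of the sequential LSTM in Eq. (6.6).

        params        : dict with W_g, U_g, b_g for each gate g in {i, f, o, c}
        s_prev, c_prev: previous hidden state and memory cell
        x_t           : embedding vector of the current word
        """
        pre = lambda g: params['W_' + g] @ s_prev + params['U_' + g] @ x_t + params['b_' + g]
        i = sigmoid(pre('i'))
        f = sigmoid(pre('f'))
        o = sigmoid(pre('o'))
        c = i * np.tanh(pre('c')) + f * c_prev   # new memory cell
        s = o * np.tanh(c)                       # new hidden state
        return s, c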

Tree-of-sequences LSTM

We use a bi-directional tree LSTM (TLSTM) [320, 368] to capture the hierarchical relationships among concepts. The inputs of this LSTM include the tree structure of the concept ontology and the hidden states of individual concepts produced by the SLSTMs. It consists of a bottom-up TLSTM and a top-down TLSTM, which produce two hidden states $\mathbf{h}_\uparrow$ and $\mathbf{h}_\downarrow$ at each node in the tree. In the bottom-up TLSTM, an internal node (representing a concept C, having $M$ children) is comprised of the following components: an input gate $\mathbf{i}_\uparrow$, an output gate $\mathbf{o}_\uparrow$, a memory cell $\mathbf{c}_\uparrow$, a hidden state $\mathbf{h}_\uparrow$, and $M$ child-specific forget gates $\{\mathbf{f}_\uparrow^{(m)}\}_{m=1}^M$, where $\mathbf{f}_\uparrow^{(m)}$ corresponds to the $m$-th child. The transition equations among components are:

$$\begin{aligned}
\mathbf{i}_\uparrow &= \sigma\Big(\textstyle\sum_{m=1}^M \mathbf{W}_\uparrow^{(i,m)}\mathbf{h}_\uparrow^{(m)} + \mathbf{U}^{(i)}\mathbf{s} + \mathbf{b}_\uparrow^{(i)}\Big)\\
\forall m,\quad \mathbf{f}_\uparrow^{(m)} &= \sigma\big(\mathbf{W}_\uparrow^{(f,m)}\mathbf{h}_\uparrow^{(m)} + \mathbf{U}^{(f,m)}\mathbf{s} + \mathbf{b}_\uparrow^{(f,m)}\big)\\
\mathbf{o}_\uparrow &= \sigma\Big(\textstyle\sum_{m=1}^M \mathbf{W}_\uparrow^{(o,m)}\mathbf{h}_\uparrow^{(m)} + \mathbf{U}^{(o)}\mathbf{s} + \mathbf{b}_\uparrow^{(o)}\Big)\\
\mathbf{u}_\uparrow &= \tanh\Big(\textstyle\sum_{m=1}^M \mathbf{W}_\uparrow^{(u,m)}\mathbf{h}_\uparrow^{(m)} + \mathbf{U}^{(u)}\mathbf{s} + \mathbf{b}_\uparrow^{(u)}\Big)\\
\mathbf{c}_\uparrow &= \mathbf{i}_\uparrow\odot\mathbf{u}_\uparrow + \textstyle\sum_{m=1}^M \mathbf{f}_\uparrow^{(m)}\odot\mathbf{c}_\uparrow^{(m)}\\
\mathbf{h}_\uparrow &= \mathbf{o}_\uparrow\odot\tanh(\mathbf{c}_\uparrow),
\end{aligned} \qquad (6.7)$$

where $\mathbf{s}$ is the SLSTM hidden state that encodes the name of the concept C. $\{\mathbf{h}_\uparrow^{(m)}\}_{m=1}^M$ and $\{\mathbf{c}_\uparrow^{(m)}\}_{m=1}^M$ are the bottom-up TLSTM hidden states and memory cells of the children. $\mathbf{W}$, $\mathbf{U}$, $\mathbf{b}$


Figure 6.6: Contextual attention module. For each patch, an attention score is calculated by comparing the representation of this patch and that of the entire image. These patches are re-weighted using the corresponding attention scores to generate an attentional representation for this image.


Figure 6.7: Attentional convolution. Each pixel is weighted by the attention score of the patch containing this pixel. Then the weighted pixels are convolved with the filter weights.

are component-specific weight matrices and bias vectors. For a leaf node having no children, its only input is the SLSTM hidden state $\mathbf{s}$ and no forget gates are needed. The transition equations are:

$$\begin{aligned}
\mathbf{i}_\uparrow &= \sigma(\mathbf{U}^{(i)}\mathbf{s} + \mathbf{b}_\uparrow^{(i)})\\
\mathbf{o}_\uparrow &= \sigma(\mathbf{U}^{(o)}\mathbf{s} + \mathbf{b}_\uparrow^{(o)})\\
\mathbf{u}_\uparrow &= \tanh(\mathbf{U}^{(u)}\mathbf{s} + \mathbf{b}_\uparrow^{(u)})\\
\mathbf{c}_\uparrow &= \mathbf{i}_\uparrow\odot\mathbf{u}_\uparrow\\
\mathbf{h}_\uparrow &= \mathbf{o}_\uparrow\odot\tanh(\mathbf{c}_\uparrow).
\end{aligned} \qquad (6.8)$$

In the top-down TLSTM, a non-root node has the following components: an input gate $\mathbf{i}_\downarrow$, a forget gate $\mathbf{f}_\downarrow$, an output gate $\mathbf{o}_\downarrow$, a memory cell $\mathbf{c}_\downarrow$, and a hidden state $\mathbf{h}_\downarrow$. The transition

equations are:

$$\begin{aligned}
\mathbf{i}_\downarrow &= \sigma(\mathbf{W}_\downarrow^{(i)}\mathbf{h}_\downarrow^{(p)} + \mathbf{b}_\downarrow^{(i)})\\
\mathbf{f}_\downarrow &= \sigma(\mathbf{W}_\downarrow^{(f)}\mathbf{h}_\downarrow^{(p)} + \mathbf{b}_\downarrow^{(f)})\\
\mathbf{o}_\downarrow &= \sigma(\mathbf{W}_\downarrow^{(o)}\mathbf{h}_\downarrow^{(p)} + \mathbf{b}_\downarrow^{(o)})\\
\mathbf{u}_\downarrow &= \tanh(\mathbf{W}_\downarrow^{(u)}\mathbf{h}_\downarrow^{(p)} + \mathbf{b}_\downarrow^{(u)})\\
\mathbf{c}_\downarrow &= \mathbf{i}_\downarrow\odot\mathbf{u}_\downarrow + \mathbf{f}_\downarrow\odot\mathbf{c}_\downarrow^{(p)}\\
\mathbf{h}_\downarrow &= \mathbf{o}_\downarrow\odot\tanh(\mathbf{c}_\downarrow),
\end{aligned} \qquad (6.9)$$

where $\mathbf{h}_\downarrow^{(p)}$ and $\mathbf{c}_\downarrow^{(p)}$ are the top-down TLSTM hidden state and memory cell of the parent of this node. For the root node, which has no parent, $\mathbf{h}_\downarrow$ cannot be computed using the above equations. Instead, we set $\mathbf{h}_\downarrow$ to $\mathbf{h}_\uparrow$ (the bottom-up TLSTM hidden state generated at the root node). $\mathbf{h}_\uparrow$ captures the semantics of all concepts in this ontology, which is then propagated downwards to each individual concept via the top-down TLSTM dynamics.

We concatenate the hidden states of the two directions to obtain the bidirectional TLSTM encoding of each concept, $\mathbf{h} = [\mathbf{h}_\uparrow; \mathbf{h}_\downarrow]$. The bottom-up TLSTM composes the semantics of children (representing sub-concepts) and merges them into the current node, which hence captures the child-to-parent relationship. The top-down TLSTM makes each node inherit the semantics of its parent, which captures the parent-to-child relation. As a result, the hierarchical relationship among concepts is encoded in the hidden states.
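To make the bottom-up pass of Eqs. (6.7)-(6.8) concrete, here is a NumPy sketch of the update at one node. The parameter layout (per-child input weight matrices, with a single shared U_f for the forget-gate input, which is a simplification of the per-child U^(f,m) in Eq. (6.7)) is an illustrative assumption, not the thesis implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bottom_up_node(params, s, child_h, child_c):
        """Bottom-up tree LSTM update (Eq. 6.7) at a node with M children.

        params  : dict; params['W_g'] is a list of per-child matrices for gate g,
                  params['U_g'] and params['b_g'] are shared input weights/biases
        s       : SLSTM encoding of this concept's name
        child_h : list of the children's bottom-up hidden states
        child_c : list of the children's bottom-up memory cells
        """
        M = len(child_h)
        gate_sum = lambda g: sum(params['W_' + g][m] @ child_h[m] for m in range(M))

        i = sigmoid(gate_sum('i') + params['U_i'] @ s + params['b_i'])
        o = sigmoid(gate_sum('o') + params['U_o'] @ s + params['b_o'])
        u = np.tanh(gate_sum('u') + params['U_u'] @ s + params['b_u'])
        f = [sigmoid(params['W_f'][m] @ child_h[m] + params['U_f'] @ s + params['b_f'])
             for m in range(M)]
        c = i * u + sum(f[m] * child_c[m] for m in range(M))
        h = o * np.tanh(c)
        return h, c

With M = 0 (empty child lists), the child sums vanish and the update reduces to the leaf-node case of Eq. (6.8).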

Contextual Attention

When reading a medical image, physicians care more about the regions that show abnormalities. These abnormalities are usually indicators of diseases and prompt the physicians to take treatment actions. The abnormal regions are usually small, which makes them difficult to detect and tag. It is therefore important to localize these regions and encourage the tagger to pay more attention to them. The abnormal regions can be spotted by comparing them with their context – the entire image: the majority of an image typically contains normal tissue, whose visual appearance differs from the abnormalities. This observation motivates us to design a contextual attention mechanism that calculates the level of abnormality in each region by contrasting this region with the entire image.

Figure 6.6 presents an illustration. We first divide the input image into patches. For each patch, we use a patch encoder to extract its representation. For the entire image (which is deemed the context), an image encoder is used to capture the holistic information of the image. For each patch, its representation and the image representation are concatenated and fed into an attention module which generates a score indicating how abnormal this patch is. We then use these attention scores to weight pixels. The weighted pixels are fed into an attentional encoder to generate an attentional representation for this image. For simplicity, we assume the patches are non-overlapping, hence each pixel belongs to exactly one patch. The value of a pixel is weighted by the attention score of the patch containing it. In the attentional encoder, attentional convolution is performed: the filters take weighted pixels as inputs to calculate the feature map. Figure 6.7 shows an example. The 6×6 image is divided into 4 patches, whose attention scores are 0.6, 0.1, 0.3, and 0.7. When a 3×3 filter (denoted by a red box) is applied, each pixel in the receptive field

is weighted using the attention score of the patch containing this pixel. Then the convolution is performed between the weighted pixels and the filter weights, by calculating $c = f\big(\sum_{i=1}^n a_i p_i w_i + b\big)$, where $\{a_i\}$ are the attention scores, $\{p_i\}$ are the pixel values, and $\{w_i\}$ are the filter weights. Extending the method to overlapping patches is straightforward: the weight of a pixel is calculated by averaging the attention scores of the overlapping patches that contain this pixel.
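A rough NumPy sketch of the attentional convolution for non-overlapping patches: every pixel is scaled by the attention score of its patch before an ordinary valid convolution with a single filter. The function signature, the choice of tanh for the nonlinearity f, and the assumption that the image dimensions are multiples of the patch size are ours, made for illustration only.

    import numpy as np

    def attentional_conv(image, patch_scores, patch, filt, bias):
        """Attentional convolution with one filter (valid mode, stride 1).

        image        : (H, W) array of pixel values; H, W assumed divisible by `patch`
        patch_scores : (H // patch, W // patch) attention score per non-overlapping patch
        patch        : side length of the square patches
        filt         : (kh, kw) filter weights
        bias         : scalar bias b
        """
        # Scale every pixel by the attention score of the patch that contains it.
        weights = np.kron(patch_scores, np.ones((patch, patch)))
        weighted = image * weights

        kh, kw = filt.shape
        H, W = image.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                out[r, c] = np.sum(weighted[r:r + kh, c:c + kw] * filt) + bias
        return np.tanh(out)   # f(.) instantiated as tanh here; any nonlinearity works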

6.3.2 Evaluation

Datasets

We used two medical image datasets obtained from a teaching hospital. The first dataset contains 47933 pathology images, which are split into a training set with 33553 images, a validation set with 7190 images, and a test set with 7190 images. The second dataset contains 39827 radiology images, which are split into train/validation/test sets with 27879, 5974, and 5974 images respectively. Labels in the first dataset are organized by doctors into a 4-level hierarchy. The numbers of labels in levels 1-4 are 16, 38, 137, and 249 respectively. The number of leaf nodes is 272. The average number of labels per image is 3.7. Each label has a name. The minimum, average, and maximum numbers of words in the label names are 1, 5.3, and 9 respectively. Labels in the second dataset are organized by doctors into a 3-level hierarchy. The numbers of labels in levels 1-3 are 12, 51, and 196 respectively. The number of leaf nodes is 211. The average number of labels per image is 3.1. The minimum, average, and maximum numbers of words in the label names are 1, 4.4, and 8 respectively. Each image is labeled by at least 3 doctors and the labels are decided by majority vote. The kappa statistic (range [-1,1]) for measuring inter-annotator reliability is 0.83, indicating high consistency among annotators. Since it is too costly to annotate fine-grained hierarchical labels for all raw images in the PACS, we did not use all images there. The distribution of class frequencies is imbalanced: some classes appear in many images while others are less frequent. The number of tags per image follows a Gaussian distribution.

Experimental Setup

Data augmentation by cropping and rotation was applied. The images were resized to 3000×2000. Each image was divided into 16×16 overlapping patches with a stride of 8. In the contextual attention encoder, both the image encoder and patch encoder were set to ResNet-50 [153], where the 1000-dimensional vector produced by the fully-connected layer is used as the representation of an image or a patch. The attention module was a feed-forward network with 2 hidden layers of 300 and 100 units respectively; the activation function was ReLU. The attentional encoder was also set to ResNet-50. In the tree-of-sequences LSTM encoder, the hidden state size of both the sequential and tree-structured LSTMs was set to 100. The size of the word embeddings was set to 128. In the adversarial multi-label tagger, 70% of the training images were used as set A to learn label-image relevance; the rest were used as set B to learn label correlations. Both the predictive and discriminative networks were feed-forward networks with 2 hidden layers of 300 and 100 units respectively. We used AdaGrad [255] with a learning rate of 0.1 and a batch size of 32 to learn model parameters. In LSTM training, the network was unrolled for 60 iterations.

Method                 Pathology                   Radiology
                       Sensitivity  Specificity    Sensitivity  Specificity
CRF [419]              0.30         0.34           0.48         0.50
SPEN [36]              0.32         0.34           0.46         0.48
CNN-RNN [346]          0.31         0.35           0.48         0.51
DPP [375]              0.32         0.34           0.47         0.49
No-AL                  0.30         0.33           0.46         0.48

LET [38]               0.30         0.35           0.48         0.50
HD-CNN [389]           0.31         0.35           0.46         0.49
HybridNet [166]        0.32         0.36           0.45         0.49
B-CNN [425]            0.30         0.34           0.48         0.51
No-TLSTM               0.29         0.34           0.45         0.49
Bottom-up TLSTM        0.33         0.36           0.48         0.51

LPA [385]              0.30         0.35           0.50         0.51
SA [71]                0.31         0.35           0.49         0.49
No-CA                  0.30         0.34           0.48         0.49

DTHMLC [337]           0.20         0.23           0.39         0.40
DenseNet-LSTM [396]    0.27         0.29           0.44         0.46
OS [171]               0.22         0.23           0.41         0.44

Our full method        0.33         0.37           0.50         0.52

Table 6.5: Average sensitivity and specificity. On the first, second, and third panels are baselines compared in the ablation studies of (1) adversarial learning for multi-label tagging, (2) tree-of-sequences LSTM for capturing hierarchical relationships, and (3) contextual attention for identifying abnormalities. On the fourth panel are baselines for holistic comparison.

For performance evaluation, we used sensitivity (true positive rate) and specificity (true negative rate), which are the most widely used evaluation metrics in clinical trials. To address the imbalance of class frequencies, in the loss function in Eq.(6.4) we gave infrequent classes larger weights, where the weight of a class is proportional to the reciprocal of the frequency of this class.
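The class re-weighting just described (a weight proportional to the reciprocal of each class's frequency) can be sketched as follows; the normalization to mean 1 is our own choice for illustration, not a detail specified in the thesis.

    import numpy as np

    def class_weights(label_matrix):
        """label_matrix: (num_images, num_classes) binary groundtruth matrix."""
        freq = label_matrix.sum(axis=0).astype(float)   # number of images per class
        freq = np.maximum(freq, 1.0)                    # guard against empty classes
        w = 1.0 / freq                                  # reciprocal of frequency
        return w * (len(w) / w.sum())                   # normalize so weights average to 1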

Ablation Study

We perform an ablation study to verify the effectiveness of each module.

Adversarial learning for multi-label tagging To evaluate the efficacy of adversarial learning (AL), we removed it from the model: in Figure 6.4, the branch associated with training set B (including the discriminative network and the binary classification loss) was taken away, and all training images were put into set A to learn label-image relevance. In addition, we compared with four state-of-the-art methods proposed for capturing label correlations in multi-label classification, by replacing AL with each of them while keeping the other modules intact. These baselines include (1) conditional random field (CRF) [419]; (2) structured prediction energy network (SPEN) [36]; (3) CNN-RNN [346]; (4) determinantal point process (DPP) [205, 375]. In

SPEN, the local and global energy networks are chosen to be feed-forward networks consisting of two hidden layers, with 300 and 150 units in each layer. Gradient descent was applied to make predictions, where the momentum and learning rate were set to 0.95 and 0.01 respectively. In CNN-RNN, the size of the hidden states in the LSTM was set to 128. The network was trained using SGD with momentum 0.9, weight decay 1e-4, and dropout rate 0.5. In the DPP baseline, we feed the image representation produced by the contextual attention encoder and the concept representations produced by the tree-of-sequences LSTM encoder into a conditional kernel function represented by a label-input dependency network (LIDN) and a label-correlation network (LCN). The LIDN is configured to be a fully-connected network with 2 hidden layers, where the numbers of units in the first and second layers are 200 and 100 respectively and the activation function is ReLU. The LCN also has two hidden layers, with 100 and 50 units respectively.

The first panel of Table 6.5 shows the results, from which we observe the following. First, after AL is removed (denoted by No-AL), the sensitivity and specificity drop significantly. No-AL ignores label correlations and selects labels purely based on their relevance to the image, hence leading to inferior performance. Second, compared with the four baselines designed for capturing label correlations, our full method (which uses AL) achieves better performance, suggesting that AL is more effective in capturing correlations. These baselines design explicit objective functions to encode correlations. The correlations might be very complicated and diverse, which makes them difficult to capture with a single objective function. Instead, our approach implicitly learns such correlations by making the predicted labels indistinguishable from the groundtruth ones; it has more flexibility and expressivity to capture various types of correlations. The advantage of DPP is that it is able to capture high-order correlations among labels while performing exact learning of model parameters (without approximation) with cubic computational complexity (in terms of the number of unique classes); its drawback is that exact inference cannot be achieved in polynomial time, and approximate inference may incur large errors. The disadvantage of CRF is that neither inference nor parameter learning can be performed exactly.

Tree-of-sequences LSTM To evaluate this module, we compared with two configurations: (1) No-TLSTM, which removes the tree LSTM and directly uses the hidden states produced by the sequential LSTM as the final representations of concepts; (2) Bottom-up TLSTM, which removes the hidden states generated by the top-down TLSTM. In addition, we compared with four hierarchical classification baselines, including (1) hierarchical deep CNN (HD-CNN) [389], (2) HybridNet [166], (3) Branch CNN (B-CNN) [425], and (4) label embedding tree (LET) [38], by using them to replace the bidirectional tree LSTM while keeping the other modules untouched. The second panel of Table 6.5 shows the sensitivity and specificity achieved by these methods. We make the following observations. First, removing the tree LSTM (No-TLSTM) greatly degrades performance, since the hierarchical relationship among labels is ignored. Second, the bottom-up tree LSTM alone performs less well than our full method that uses the bi-directional tree LSTM. This demonstrates the necessity of the top-down TLSTM, which ensures that every two labels are connected by directed paths and can more expressively capture label relations in the hierarchy. Third, our full method outperforms the four baselines. The possible reason is that our method directly builds the labels' hierarchical relationship into their representations, while the baselines perform representation learning and relationship capturing separately.

                 Pathology                              Radiology
        Size    IoU   Sensitivity  Specificity    IoU   Sensitivity  Specificity
Patch    8      0.25  0.32         0.36           0.40  0.47         0.49
        16      0.29  0.33         0.37           0.41  0.49         0.50
        32      0.28  0.32         0.34           0.43  0.50         0.52
        64      0.24  0.30         0.31           0.39  0.47         0.48
Stride   4      0.28  0.30         0.37           0.40  0.48         0.51
         8      0.29  0.33         0.37           0.43  0.50         0.52
        16      0.27  0.31         0.36           0.42  0.47         0.50

Table 6.6: Performance versus patch sizes and stride sizes.

Contextual attention In the evaluation of this module, we compared with No-CA, which removes the contextual attention (CA) module, and with two other attention models: (1) label-patch attention (LPA) [69, 385, 394, 400]; (2) scale attention (SA) [71]. The third panel of Table 6.5 shows the performance scores achieved by these methods. As can be seen, the sensitivity and specificity under No-CA are lower than under our full method which uses CA, which demonstrates the effectiveness of contextual attention. CA is able to localize the small abnormal regions and pay more attention to them, which leads to the successful tagging of abnormalities. Compared with the other attention baselines, our full method with CA achieves better performance. This indicates that, for medical images, the contextual information embodied in the entire image is very valuable for distilling "attention".

We asked doctors to label the abnormal regions in 100 pathology and 100 radiology images. We compared the abnormal regions detected by our contextual attention model with the groundtruth using Intersection over Union (IoU). We also evaluated how the patch size affects performance. Table 6.6 shows the IoU and the sensitivity/specificity of tagging under different patch sizes (with the stride size set to 8). As can be seen, a patch size in the middle ground (that best matches the scale of the abnormal regions) yields the best abnormality detection and tagging performance. We also evaluated how the stride size of overlapping patches affects performance. Fixing the patch size to 16, we tried stride sizes of 4, 8, and 16 (equivalent to no overlap). As can be seen from the table, overlapping patches with a stride size of 8 work better than non-overlapping ones.

Holistic Comparison with Other Baselines

In addition to evaluating the three modules individually, we also compared the entire model with three other baselines: (1) decision trees for hierarchical multi-label classification (DTHMLC) [337], (2) the DenseNet-LSTM designed in [396] for chest x-ray classification, and (3) occlusion sensitivity (OS) used in [171] for abnormality localization in x-rays. The fourth panel of Table 6.5 shows the comparison with these baselines. As can be seen, our approach achieves much better performance on both datasets than the baselines. DTHMLC performs less well, probably because it lacks the ability to learn deep visual features. DenseNet-LSTM lacks the ability to explore the concept hierarchy. OS cannot be trained in an end-to-end fashion, which leads to inferior performance.

We evaluated the effects of tagging on image search. We used the learned models to tag

images in the PACS. For either pathology or radiology images, we randomly sampled 100 tags as queries, then performed retrieval by tag matching. We compared with a baseline: retrieval by checking whether the query tag is contained in the image reports. Retrieval performance was evaluated using precision@10: among the top 10 retrieved images, how many are relevant to the query (relevance is labeled by doctors). The precision@10 achieved by report-based retrieval is 0.24 for pathology and 0.32 for radiology. Our tagging method improves the precision@10 to 0.29 for pathology and 0.40 for radiology.

On the pathology dataset, our method is significantly better than CRF, No-AL, LET, B-CNN, No-TLSTM, LPA, No-CA, DTHMLC, DenseNet-LSTM, and OS with p-value < 0.01, and is significantly better than SPEN, CNN-RNN, DPP, HD-CNN, HybridNet, and SA with p-value < 0.05. On the radiology dataset, our method is significantly better than SPEN, DPP, No-AL, HD-CNN, HybridNet, No-TLSTM, No-CA, DTHMLC, DenseNet-LSTM, and OS with p-value < 0.01, and is significantly better than CRF, CNN-RNN, LET, B-CNN, Bottom-up TLSTM, and SA with p-value < 0.05.

6.3.3 Related works

Medical image categorization Many works [107, 208, 230, 321] have been devoted to the classification of medical images, including skin cancer images [107], chest x-rays [349], and pathology images [178], to name a few. In [396], label correlation is considered for more accurate classification of multi-label chest x-ray images. The localization of abnormality in chest x-rays is studied in [171] using the occlusion sensitivity method [410]. The problem setting studied in our work is different from the existing ones: we simultaneously consider two important facts, (1) each image may have multiple labels, and (2) the classes have a hierarchical structure. The existing works consider neither or only one of them.

Multi-label classification (MLC) In computer vision and machine learning, MLC has been widely studied [56, 57, 123, 124, 189, 190, 285, 317, 351, 362, 387, 390, 415, 416]. In MLC, one fundamental problem is how to capture the correlations among labels. Many approaches have been proposed based on graphical models [36, 70, 126, 147, 151, 201, 227, 316, 322, 412], latent space projection [36, 180, 183, 221, 228, 356, 397], and class chaining [77, 289, 346]. To retain computational efficiency, these methods typically make strong assumptions, such as that the order of correlation is less than three [70, 126, 151, 201, 227], that the dependency relationship among labels is linear [147], or that the labels are independent conditioned on a latent space [36, 228, 356, 397]. Some of these methods [316, 322, 412] lack the flexibility to perform deep visual feature learning in an end-to-end manner. The existing methods define explicit objective functions to encode label correlations. In reality, the correlations might take a variety of complicated forms which are unknown to the human modelers; as a result, it is difficult to specify a comprehensive objective that captures all of them. In this work, we propose a different approach: instead of manually defining such an objective, we use adversarial learning to implicitly transfer the correlations manifested in human-provided labels into the predicted labels by making the two types of labels indistinguishable.

Hierarchical classification Leveraging the hierarchical structure among classes to improve classification accuracy has been widely studied [38, 97, 98, 166, 247, 389]. Many methods [28, 121, 166, 231, 389, 425, 430] propose to learn coarse-to-fine classifiers or features along the class hierarchy. In [38], a method is proposed to learn label embeddings where the learning is guided by the label tree. Our work also aims at embedding labels into a latent space and uses LSTM networks to capture long-range nonlinear dependencies among labels. Several works [98, 312, 414] encode the hierarchical relationship among labels as structural constraints, Bayesian priors, or potential functions in conditional random fields. A variety of works [96, 140, 181, 225, 248] study how to automatically learn the tree structure among labels. Several studies [33, 64, 65, 337] have been conducted on hierarchical multi-label classification based on decision trees, neural networks, and hierarchical support vector machines.

Visual attention Visual attention [23, 26, 69, 71, 91, 258, 297, 300, 385, 400] has been widely utilized in computer vision. In many works [69, 385, 394, 400], the attention is computed between an image patch and a label (such as a sentence, a word, or an attribute) by comparing their similarity. In a video-based activity recognition task, Sharma et al. [300] use the previous frame to predict the importance of regions in the current frame. Chen et al. [71] design an attention model which takes images at different scales as inputs and predicts the importance of regions at different positions and scales.

6.4 A Neural Architecture for Automated ICD Coding

The International Classification of Diseases (ICD) is a healthcare classification system maintained by the World Health Organization [268]. It provides a hierarchy of diagnostic codes of diseases, disorders, injuries, signs, symptoms, etc. It is widely used for reporting diseases and health conditions, assisting in medical reimbursement decisions, and collecting morbidity and mortality statistics, to name a few.

While ICD codes are important for making clinical and financial decisions, medical coding – which assigns proper ICD codes to a patient visit – is time-consuming, error-prone, and expensive. Medical coders review the diagnosis descriptions written by physicians in the form of textual phrases and sentences and (if necessary) other information in the electronic medical record of a clinical episode, then manually attribute the appropriate ICD codes by following the coding guidelines [267]. Several types of errors frequently occur. First, the ICD codes are organized in a hierarchical structure: for a node representing a disease C, the children of this node represent the subtypes of C. In many cases, the difference between disease subtypes is very subtle, and it is common for human coders to select incorrect subtypes. Second, when writing diagnosis descriptions, physicians often utilize abbreviations and synonyms, which causes ambiguity and imprecision when the coders are matching ICD codes to those descriptions [302]. Third, in many cases, several diagnosis descriptions are closely related and should be mapped to a single ICD code; however, inexperienced coders may code each disease separately. Such errors are called unbundling. The cost incurred by coding errors and the financial investment spent on improving coding quality are estimated to be $25 billion per year in the US [109, 210].

To reduce coding errors and cost, we aim at building an ICD coding model which automatically and accurately translates free-text diagnosis descriptions into ICD codes. To achieve this goal, several technical challenges need to be addressed. First, there exists a hierarchical structure among the ICD codes, which can be leveraged to improve coding accuracy. On one hand, if codes A and B are both children of C, it is unlikely that A and B are simultaneously assigned to a patient. On the other hand, if the distance between A and B in the code tree is smaller than that between A and C, and we know A is a correct code, then B is more likely than C to be a correct code, since codes with smaller distance are more clinically relevant. How to exploit this hierarchical structure for better coding is technically demanding. Second, the diagnosis descriptions and the textual descriptions of ICD codes are written in quite different styles even when they refer to the same disease. In particular, the textual description of an ICD code is formally and precisely worded, while diagnosis descriptions are usually written by physicians in an informal and ungrammatical way, with telegraphic phrases, abbreviations, and typos. Third, it is required that the assigned ICD codes be ranked according to their relevance to the patient, and how to correctly determine this order is technically nontrivial. Fourth, as stated earlier, there does not necessarily exist a one-to-one mapping between diagnosis descriptions and ICD codes, and human coders should consider the overall health condition when assigning codes. In many cases, two closely related diagnosis descriptions need to be mapped onto a single combination ICD code; conversely, physicians may write two health conditions into one diagnosis description, which should then be mapped onto two ICD codes.

In this section, we design a neural architecture to automatically perform ICD coding given the diagnosis descriptions. This architecture uses a tree-of-sequences LSTM network (Section 6.3.1) to simultaneously capture the hierarchical relationship among codes and the semantics of each code. An adversarial learning approach is utilized to reconcile the heterogeneous writing styles of diagnosis descriptions and ICD code descriptions. We use isotonic constraints to preserve the importance order among codes and develop an algorithm based on ADMM and isotonic projection to solve the constrained problem. An attentional matching mechanism is employed to perform many-to-one and one-to-many mappings between diagnosis descriptions and codes. On a clinical dataset with 59K patient visits, we demonstrate the effectiveness of the proposed methods.

6.4.1 Methods

Figure 6.8 shows an overview of our approach. The proposed ICD coding model consists of five modules. The model takes the ICD-code tree and the diagnosis descriptions (DDs) of a patient as inputs and assigns a set of ICD codes to the patient. The encoder of DDs generates a latent representation vector for each DD. The encoder of ICD codes is the tree-of-sequences long short-term memory (LSTM) network introduced in Section 6.3.1. It takes the textual descriptions of the ICD codes and their hierarchical structure as inputs and produces a latent representation for each code. The representation aims at simultaneously capturing the semantics of each code and the hierarchical relationship among codes. By incorporating the code hierarchy, the model can avoid selecting codes that are subtypes of the same disease and promote the selection of codes that are clinically correlated. The writing styles of DDs and code descriptions (CDs) are largely different, which makes the matching between a DD and a CD error-prone. To address this issue, we develop an adversarial learning approach to reconcile the writing styles. On top of the latent representation vectors of the descriptions, we build a discriminative network to distinguish which ones are DDs and which are CDs. The encoders of DDs and CDs try to make such a discrimination impossible. By doing this, the learned representations become independent of the writing styles and facilitate more accurate matching. The representations of DDs and CDs are fed into an attentional matching module to perform code assignment. This attentional mechanism allows multiple DDs to be matched to a single code and allows a single DD to be matched to multiple codes. During training, we incorporate the order of importance among codes as isotonic constraints. These constraints regulate the model's weight parameters so that codes with higher importance are given larger prediction scores.

Figure 6.8: Architecture of the ICD coding model. The diagnosis descriptions (e.g., "1. Pneumonia, 2. Acute kidney failure, ...") and the ICD-code tree are fed into the encoder of diagnosis descriptions and the tree-of-sequences LSTM encoder of ICD-code descriptions; their outputs pass through the adversarial reconciliation module and the attentional matching module (trained under isotonic constraints) to produce the assigned ICD codes (e.g., V31.00, 775.6, 765.18, 770.6).

Representation Learning

We use the tree-of-sequences LSTM network (Section 6.3.1) to encode the ICD codes. Each code has a description (a sequence of words) that tells the semantics of this code. A sequential LSTM (SLSTM) is used to encode this description. To capture the hierarchical relationship among codes, we build a tree-structured LSTM (TLSTM) [320] along the code tree. The inputs of this LSTM include the code hierarchy and the hidden states of individual codes produced by the SLSTMs. It consists of a bottom-up TLSTM and a top-down TLSTM, which produce two hidden states h↑ and h↓ at each node in the tree.

For the diagnosis descriptions of a patient, we use an SLSTM network to encode each description individually. The weight parameters of this SLSTM are tied with those of the SLSTM used for encoding code descriptions.
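To make the encoder concrete, below is a minimal PyTorch sketch of a tree-of-sequences encoder. It is an illustrative approximation under stated assumptions, not the exact cell equations of Section 6.3.1: the class name, the mean-pooled SLSTM summary, and the child-sum-style aggregation of children's states are simplifications introduced here for readability. The default sizes (hidden dimension 100, embedding dimension 200) mirror the settings reported in the experiments.

import torch
import torch.nn as nn

class TreeOfSequencesEncoder(nn.Module):
    """Minimal sketch: encode each code description with a bidirectional
    sequential LSTM, then run a bottom-up and a top-down pass over the code
    tree so that every code representation reflects the hierarchy."""
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.slstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.cell_up = nn.LSTMCell(2 * hidden_dim, hidden_dim)    # bottom-up pass
        self.cell_down = nn.LSTMCell(2 * hidden_dim, hidden_dim)  # top-down pass

    def encode_description(self, token_ids):
        # token_ids: LongTensor of shape (1, seq_len); mean-pool the SLSTM states.
        states, _ = self.slstm(self.embed(token_ids))
        return states.mean(dim=1)                                 # (1, 2*hidden_dim)

    def forward(self, descriptions, children, root):
        # descriptions: {code: token-id tensor}; children: {code: [child codes]}
        seq = {code: self.encode_description(t) for code, t in descriptions.items()}
        h_up, c_up, h_down = {}, {}, {}

        def bottom_up(node):
            # Child-sum-style aggregation of the children's bottom-up states.
            hs = torch.zeros(1, self.cell_up.hidden_size)
            cs = torch.zeros(1, self.cell_up.hidden_size)
            for child in children.get(node, []):
                bottom_up(child)
                hs, cs = hs + h_up[child], cs + c_up[child]
            h_up[node], c_up[node] = self.cell_up(seq[node], (hs, cs))

        def top_down(node, parent_state):
            h_down[node], c = self.cell_down(seq[node], parent_state)
            for child in children.get(node, []):
                top_down(child, (h_down[node], c))

        bottom_up(root)
        zero = torch.zeros(1, self.cell_down.hidden_size)
        top_down(root, (zero, zero))
        # Final code representation: concatenation of the two passes.
        return {code: torch.cat([h_up[code], h_down[code]], dim=-1) for code in seq}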

Attentional Matching

Next, we introduce how to map the DDs to codes. We denote the hidden representations of DDs and codes as $\{h_m\}_{m=1}^{M}$ and $\{u_n\}_{n=1}^{N}$ respectively, where $M$ is the number of DDs of one patient and $N$ is the total number of codes in the dataset. The mapping from DDs to codes is not one-to-one. In many cases, a code is assigned only when a certain combination of $K$ ($1 < K \le M$) diseases simultaneously appears within the $M$ DDs, and the value of $K$ depends on this code. Among the $K$ diseases, their importance in determining the assignment of this code differs. For the remaining $M - K$ DDs, we can consider their importance scores to be zero. We use a soft-attention mechanism [26] to calculate these importance scores. For a code $u_n$, the importance of a DD $h_m$ to $u_n$ is calculated as $a_{nm} = u_n^{\top} h_m$. We normalize the scores $\{a_{nm}\}_{m=1}^{M}$ of all DDs into a probabilistic simplex using the softmax operation: $\tilde{a}_{nm} = \exp(a_{nm}) / \sum_{l=1}^{M} \exp(a_{nl})$. Given these normalized importance scores $\{\tilde{a}_{nm}\}_{m=1}^{M}$, we use them to weight the representations of the DDs and obtain a single attentional vector of the $M$ DDs: $\hat{h}_n = \sum_{m=1}^{M} \tilde{a}_{nm} h_m$. Then we concatenate $\hat{h}_n$ and $u_n$ and use a linear classifier to predict the probability that code $n$ should be assigned: $p_n = \mathrm{sigmoid}(w_n^{\top} [\hat{h}_n; u_n] + b_n)$, where the coefficients $w_n$ and the bias $b_n$ are specific to code $n$.

We train the weight parameters $\Theta$ of the proposed model using the data of $L$ patient visits. $\Theta$ includes the sequential LSTM weights $W_s$, the tree LSTM weights $W_t$, and the weights $W_p$ in the final prediction layer. Let $c^{(l)} \in \mathbb{R}^{N}$ be a binary vector where $c_n^{(l)} = 1$ if the $n$-th code is assigned to this patient and $c_n^{(l)} = 0$ otherwise. $\Theta$ can be learned by minimizing the following prediction loss:
$$\min_{\Theta} \; L_{\mathrm{pred}}(\Theta) = \sum_{l=1}^{L} \sum_{n=1}^{N} \mathrm{CE}(p_n^{(l)}, c_n^{(l)}), \qquad (6.10)$$
where $p_n^{(l)}$ is the predicted probability that code $n$ is assigned to patient visit $l$; it is a function of $\Theta$. $\mathrm{CE}(\cdot, \cdot)$ is the cross-entropy loss.
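The computation above can be written compactly in matrix form. The following is a minimal sketch; the tensor shapes and the per-code weight parameterization are illustrative assumptions consistent with the formulas, not the exact implementation.

import torch
import torch.nn as nn

class AttentionalMatcher(nn.Module):
    """Sketch of attentional matching: for each code u_n, weight the patient's
    M DD vectors by softmax(u_n^T h_m) and predict the assignment probability
    from [h_hat_n ; u_n] with a code-specific linear classifier."""
    def __init__(self, num_codes, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(num_codes, 2 * dim) * 0.01)  # w_n
        self.b = nn.Parameter(torch.zeros(num_codes))                  # b_n

    def forward(self, H, U):
        # H: (M, dim) DD representations of one patient; U: (N, dim) code reps.
        scores = U @ H.t()                    # a_{nm} = u_n^T h_m, shape (N, M)
        attn = torch.softmax(scores, dim=1)   # normalize over the M DDs
        H_hat = attn @ H                      # (N, dim) attentional DD summary
        feats = torch.cat([H_hat, U], dim=1)  # (N, 2*dim) = [h_hat_n ; u_n]
        return torch.sigmoid((feats * self.w).sum(dim=1) + self.b)  # p_n

# The prediction loss of Eq.(6.10) is then the binary cross entropy between
# these probabilities and the groundtruth code indicators c^{(l)}, e.g.:
# loss = nn.functional.binary_cross_entropy(p, c.float())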

Adversarial Reconciliation of Writing Styles

We use an adversarial learning [134] approach to reconcile the different writing styles of diagnosis descriptions (DDs) and code descriptions (CDs). The basic idea is: if, after being encoded, a description cannot be discerned to be a DD or a CD, then the difference in their writing styles has been eliminated. We build a discriminative network which takes the encoding vector of a description as input and tries to identify it as a DD or a CD. The encoders of DDs and CDs adjust their weight parameters so that such a discrimination is difficult for the discriminative network to achieve. Consider all the descriptions $\{(t_r, y_r)\}_{r=1}^{R}$, where $t_r$ is a description and $y_r$ is a binary label: $y_r = 1$ if $t_r$ is a DD and $y_r = 0$ otherwise. Let $f(t_r; W_s)$ denote the sequential LSTM (SLSTM) encoder parameterized by $W_s$. This encoder is shared by the DDs and CDs. Note that for CDs, a tree LSTM is further applied on top of the encodings produced by the SLSTM. We use the SLSTM encoding vectors of CDs as the input of the discriminative network rather than the TLSTM encodings, since the latter are irrelevant to writing styles. Let $g(f(t_r; W_s); W_d)$ denote the discriminative network parameterized by $W_d$. It takes the encoding vector $f(t_r; W_s)$ as input and produces the probability that $t_r$ is a DD. Adversarial learning is performed by solving this problem:
$$\max_{W_s} \min_{W_d} \; L_{\mathrm{adv}} = \sum_{r=1}^{R} \mathrm{CE}\big(g(f(t_r; W_s); W_d), y_r\big). \qquad (6.11)$$
The discriminative network tries to differentiate DDs from CDs by minimizing this classification loss, while the encoder maximizes this loss so that DDs and CDs are not distinguishable.
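Below is a minimal sketch of one alternating update for this min-max problem. The encoder, discriminator, optimizers, and batch format are hypothetical placeholders; the discriminator is assumed to end with a sigmoid; and in the full model the encoder's adversarial term is combined with the prediction loss of Eq.(6.10), as in Eq.(6.12).

import torch
import torch.nn as nn

bce = nn.BCELoss()

def adversarial_step(encoder, discriminator, texts, is_dd, enc_opt, disc_opt, lam=0.1):
    """One alternating min-max update for Eq.(6.11).
    texts: a batch of DD/CD inputs accepted by `encoder`;
    is_dd: float tensor with the labels y_r (1 for DD, 0 for CD)."""
    # Discriminator step: minimize the DD-vs-CD classification loss.
    with torch.no_grad():
        reps = encoder(texts)                  # f(t_r; W_s), detached here
    disc_loss = bce(discriminator(reps).squeeze(-1), is_dd)
    disc_opt.zero_grad(); disc_loss.backward(); disc_opt.step()

    # Encoder step: maximize the same loss (descend on its negation),
    # weighted by the tradeoff parameter lambda as in Eq.(6.12).
    enc_loss = -lam * bce(discriminator(encoder(texts)).squeeze(-1), is_dd)
    enc_opt.zero_grad(); enc_loss.backward(); enc_opt.step()
    return disc_loss.item(), enc_loss.item()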

Isotonic Constraints

Next, we incorporate the importance order among ICD codes. For the $D^{(l)}$ codes assigned to patient $l$, without loss of generality, we assume the importance order is $1 \succ 2 \succ \cdots \succ D^{(l)}$ (the order is given by human coders as the groundtruth in the MIMIC-III dataset). We use the predicted probability $p_i^{(l)}$ ($1 \le i \le D^{(l)}$) defined in Section 6.4.1 to characterize the importance of code $i$. To incorporate the order, we impose an isotonic constraint on the probabilities, $p_1^{(l)} \ge p_2^{(l)} \ge \cdots \ge p_{D^{(l)}}^{(l)}$, and solve the following problem:

$$\begin{array}{ll} \min_{\Theta} & L_{\mathrm{pred}}(\Theta) + \max_{W_d}\big(-\lambda L_{\mathrm{adv}}(W_s, W_d)\big) \\ \mathrm{s.t.} & p_1^{(l)} \ge p_2^{(l)} \ge \cdots \ge p_{D^{(l)}}^{(l)}, \quad \forall\, l = 1, \cdots, L \end{array} \qquad (6.12)$$

where the probabilities $p_i^{(l)}$ are functions of $\Theta$ and $\lambda$ is a tradeoff parameter. We develop an algorithm based on the alternating direction method of multipliers (ADMM) [52] to solve the problem defined in Eq.(6.12). Let $p^{(l)}$ be a $|D^{(l)}|$-dimensional vector whose $i$-th element is $p_i^{(l)}$. We first write the problem into an equivalent form

$$\begin{array}{ll} \min_{\Theta} & L_{\mathrm{pred}}(\Theta) + \max_{W_d}\big(-\lambda L_{\mathrm{adv}}(W_s, W_d)\big) \\ \mathrm{s.t.} & p^{(l)} = q^{(l)}, \quad q_1^{(l)} \ge q_2^{(l)} \ge \cdots \ge q_{|D^{(l)}|}^{(l)}, \quad \forall\, l = 1, \cdots, L \end{array} \qquad (6.13)$$
Then we write down the augmented Lagrangian

$$\begin{array}{ll} \min_{\Theta, q, v} & L_{\mathrm{pred}}(\Theta) + \max_{W_d}\big(-\lambda L_{\mathrm{adv}}(W_s, W_d)\big) + \sum_{l=1}^{L} \Big( \langle p^{(l)} - q^{(l)}, v^{(l)} \rangle + \frac{\rho}{2} \| p^{(l)} - q^{(l)} \|_2^2 \Big) \\ \mathrm{s.t.} & q_1^{(l)} \ge q_2^{(l)} \ge \cdots \ge q_{|D^{(l)}|}^{(l)}, \quad \forall\, l = 1, \cdots, L \end{array} \qquad (6.14)$$

We solve this problem by alternating among $\{p^{(l)}\}_{l=1}^{L}$, $\{q^{(l)}\}_{l=1}^{L}$, and $\{v^{(l)}\}_{l=1}^{L}$. The sub-problem defined over $q^{(l)}$ is

$$\begin{array}{ll} \min_{q^{(l)}} & -\langle q^{(l)}, v^{(l)} \rangle + \frac{\rho}{2} \| p^{(l)} - q^{(l)} \|_2^2 \\ \mathrm{s.t.} & q_1^{(l)} \ge q_2^{(l)} \ge \cdots \ge q_{|D^{(l)}|}^{(l)} \end{array} \qquad (6.15)$$
which is an isotonic projection problem and can be solved via the algorithm proposed in [403]. With $\{q^{(l)}\}_{l=1}^{L}$ and $\{v^{(l)}\}_{l=1}^{L}$ fixed, the sub-problem becomes $\min_{\Theta} L_{\mathrm{pred}}(\Theta) + \max_{W_d}(-\lambda L_{\mathrm{adv}}(W_s, W_d))$, which can be solved using stochastic gradient descent (SGD). The update of $v^{(l)}$ is simple: $v^{(l)} \leftarrow v^{(l)} + \rho (p^{(l)} - q^{(l)})$.
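Completing the square in Eq.(6.15) shows that the q-update is the Euclidean projection of $p^{(l)} + v^{(l)}/\rho$ onto the set of non-increasing vectors. The sketch below implements this projection with the pool-adjacent-violators algorithm together with the dual update; the thesis uses the projection algorithm of [403], so this routine is only an illustrative substitute.

import numpy as np

def project_nonincreasing(z):
    """Pool-adjacent-violators sketch: Euclidean projection of z onto
    {q : q_1 >= q_2 >= ... >= q_n}. Adjacent blocks are merged whenever a
    later block's mean exceeds an earlier block's mean."""
    blocks = []  # each block: [sum, count]
    for value in z:
        blocks.append([float(value), 1])
        while len(blocks) > 1 and \
                blocks[-1][0] / blocks[-1][1] > blocks[-2][0] / blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def admm_q_and_v_updates(p, q, v, rho=1.0):
    """One pass of the q- and v-updates for a single patient visit;
    p, q, v are NumPy vectors of length |D^{(l)}|."""
    q = project_nonincreasing(p + v / rho)   # q-update: isotonic projection
    v = v + rho * (p - q)                    # dual (v) update
    return q, v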

Diagnosis Descriptions
1. Prematurity at 35 4/7 weeks gestation
2. Twin number two of twin gestation
3. Respiratory distress secondary to transient tachypnea of the newborn
4. Suspicion for sepsis ruled out

Assigned ICD Codes
1. V31.00 (Twin birth, mate liveborn, born in hospital, delivered without mention of cesarean section)
2. 765.18 (Other preterm infants, 2,000-2,499 grams)
3. 775.6 (Neonatal hypoglycemia)
4. 770.6 (Transitory tachypnea of newborn)
5. V29.0 (Observation for suspected infectious condition)
6. V05.3 (Need for prophylactic vaccination and inoculation against viral hepatitis)

Table 6.7: The diagnosis descriptions of a patient visit and the assigned ICD codes. Inside the parentheses are the descriptions of the codes. The codes are ranked according to descending importance.

6.4.2 Evaluation

In this section, we present experimental results.

Dataset

We performed the study on the publicly available MIMIC-III dataset [184], which contains de-identified electronic health records (EHRs) of 58,976 patient visits to the Beth Israel Deaconess Medical Center from 2001 to 2012. Each EHR has a clinical note called the discharge summary, which contains multiple sections of information, such as ‘discharge diagnosis’, ‘past medical history’, etc. From the ‘discharge diagnosis’ and ‘final diagnosis’ sections, we extract the diagnosis descriptions (DDs) written by physicians. Each DD is a short phrase or a sentence articulating a certain disease or condition. Medical coders perform ICD coding mainly based on DDs. Following this practice, we set the inputs of the automated coding model to be the DDs, while acknowledging that other information in the EHRs is also valuable and is referred to by coders for code assignment. For simplicity, we leave the incorporation of non-DD information to future work.

Each patient visit is assigned a list of ICD codes, ranked in descending order of importance and relevance. For each visit, the number of codes is usually not equal to the number of diagnosis descriptions. These groundtruth codes serve as the labels to train our coding model. The entire dataset contains 6,984 unique codes, each of which has a textual description describing a disease, symptom, or condition. The codes are organized into a hierarchy where the top-level codes correspond to general diseases while the bottom-level ones represent specific diseases. In the code tree, children of a node represent subtypes of a disease. Table 6.7 shows the DDs and codes of an exemplar patient.

Experimental Settings

Out of the 6,984 unique codes, we selected the 2,833 codes with the highest frequencies to perform the study. We split the data into train/validation/test sets with 40k/7k/12k patient visits respectively. The hyperparameters were tuned on the validation set. The SLSTMs are bidirectional, and dropout with probability 0.5 [313] was used. The size of the hidden states in all LSTMs was set to 100. The word embeddings were trained on the fly and their dimension was set to 200. The tradeoff parameter λ was set to 0.1 and the parameter ρ in the ADMM algorithm was set to 1. In the SGD algorithm for solving $\min_{\Theta} L_{\mathrm{pred}}(\Theta) + \max_{W_d}(-\lambda L_{\mathrm{adv}}(W_s, W_d))$, we used the Adam [196] optimizer with an initial learning rate of 0.001 and a mini-batch size of 20.

Sensitivity (true positive rate) and specificity (true negative rate) were used to evaluate the code-assignment performance. We calculated these two scores for each individual code on the test set, then took a weighted average across all codes, with weights proportional to the codes' frequencies. To evaluate the ranking performance on codes, we used the normalized discounted cumulative gain (NDCG) [179].
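As a concrete reference for this evaluation protocol, here is a small sketch of the weighted sensitivity/specificity computation; the array layout and the frequency-proportional weighting are assumptions consistent with the description above.

import numpy as np

def weighted_sensitivity_specificity(y_true, y_pred, code_freq):
    """y_true, y_pred: binary arrays of shape (num_visits, num_codes);
    code_freq: array of shape (num_codes,) with each code's frequency."""
    tp = ((y_pred == 1) & (y_true == 1)).sum(axis=0).astype(float)
    fn = ((y_pred == 0) & (y_true == 1)).sum(axis=0).astype(float)
    tn = ((y_pred == 0) & (y_true == 0)).sum(axis=0).astype(float)
    fp = ((y_pred == 1) & (y_true == 0)).sum(axis=0).astype(float)
    sensitivity = tp / np.maximum(tp + fn, 1.0)   # per-code true positive rate
    specificity = tn / np.maximum(tn + fp, 1.0)   # per-code true negative rate
    weights = code_freq / code_freq.sum()         # proportional to frequency
    return float((weights * sensitivity).sum()), float((weights * specificity).sum())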

Ablation Study

We performed an ablation study to verify the effectiveness of each module in our model. To evaluate module X, we remove it from the model without changing the other modules and denote such a baseline by No-X. The comparisons of No-X with the full model are given in Table 6.8.

Tree-of-sequences LSTM To evaluate this module, we compared with two configurations: (1) No-TLSTM, which removes the tree LSTM and directly uses the hidden states produced by the sequential LSTM as the final representations of codes; and (2) Bottom-up TLSTM, which removes the hidden states generated by the top-down TLSTM. In addition, we compared with four hierarchical classification baselines, namely (1) hierarchical network (HierNet) [389], (2) HybridNet [166], (3) branch network (BranchNet) [425], and (4) label embedding tree (LET) [38], by using them to replace the bidirectional tree LSTM while keeping the other modules untouched. Table 6.8 shows the average sensitivity and specificity scores achieved by these methods on the test set. We make the following observations. First, removing the tree LSTM largely degrades performance: the sensitivity and specificity of No-TLSTM are 0.23 and 0.28 respectively, while our full model (which uses the bidirectional TLSTM) achieves 0.29 and 0.33 respectively. The reason is that No-TLSTM ignores the hierarchical relationship among codes. Second, the bottom-up tree LSTM alone performs less well than the bidirectional tree LSTM. This demonstrates the necessity of the top-down TLSTM, which ensures that every two codes are connected by directed paths and can more expressively capture code relations in the hierarchy. Third, our method outperforms the four baselines. The possible reason is that our method directly builds the codes' hierarchical relationship into their representations, while the baselines learn representations and capture hierarchical relationships separately.

Next, we present some qualitative results. For a patient (admission ID 147798) having a DD ‘E Coli urinary tract infection’, without the tree LSTM, two sibling codes, 585.2 (chronic kidney disease, stage II (mild)) – which is the groundtruth – and 585.4 (chronic kidney disease, stage IV (severe)), are simultaneously assigned, possibly because their textual descriptions are very similar (they differ only in the level of severity). This is incorrect because 585.2 and 585.4 are children of 585 (chronic kidney disease) and the severity level of this disease cannot simultaneously be mild and severe. After the tree LSTM is added, the false prediction of 585.4 is eliminated, which demonstrates the effectiveness of the tree LSTM in incorporating one constraint induced by the code hierarchy: among the nodes sharing the same parent, only one should be selected.

For patient 197205, No-TLSTM assigns the following codes: 462 (subacute sclerosing panencephalitis), 790.29 (other abnormal glucose), 799.9 (unspecified viral infection), and 285.21 (anemia in chronic kidney disease). Among these codes, the first three are the groundtruth and the fourth one is incorrect (the groundtruth is 401.9 (unspecified essential hypertension)). Adding the tree LSTM fixes this error. The average distance between 401.9 and the rest of the groundtruth codes is 6.2; for the incorrectly assigned code 285.21, such a distance is 7.9. This demonstrates that the tree LSTM is able to capture another constraint imposed by the hierarchy: codes with smaller tree-distance are more likely to be assigned together.

Method                    Sensitivity   Specificity
Larkey and Croft [211]       0.15          0.17
Franz et al. [112]           0.19          0.21
Pestian et al. [274]         0.12          0.21
Kavuluru et al. [191]        0.09          0.11
Kavuluru et al. [192]        0.21          0.25
Koopman et al. [199]         0.18          0.20

LET [38]                     0.23          0.29
HierNet [389]                0.26          0.30
HybridNet [166]              0.25          0.31
BranchNet [425]              0.25          0.29
No-TLSTM                     0.23          0.28
Bottom-up TLSTM              0.27          0.31

No-AL                        0.26          0.31
No-IC                        0.24          0.29
No-AM                        0.27          0.29
Our full model               0.29          0.33

Table 6.8: Weighted sensitivity and specificity on the test set. The first panel contains baselines for holistic comparison. The second panel contains baselines compared in the ablation study of the tree-of-sequences LSTM for capturing hierarchical relationships. The third panel contains baselines compared in the ablation studies of adversarial learning for writing-style reconciliation, isotonic constraints for ranking, and attentional matching.

Adversarial learning To evaluate the efficacy of adversarial learning (AL), we remove it from the full model and refer to this baseline as No-AL. Specifically, in Eq.(6.12), the loss term $\max_{W_d}(-\lambda L_{\mathrm{adv}}(W_s, W_d))$ is taken away. Table 6.8 shows the results, from which we observe that after AL is removed, the sensitivity and specificity drop from 0.29 and 0.33 to 0.26 and 0.31 respectively. No-AL does not reconcile the different writing styles of diagnosis descriptions (DDs) and code descriptions (CDs). As a result, a DD and a CD that have similar semantics may be mismatched because their writing styles are different. For example, a patient (admission ID 147583) has a DD ‘h/o DVT on anticoagulation’, which contains the abbreviation DVT (deep vein thrombosis). Due to the presence of this abbreviation, it is difficult to assign a proper code to this DD since the textual descriptions of codes do not contain abbreviations. With adversarial learning, our model correctly maps this DD to a groundtruth code: 443.9 (peripheral vascular disease, unspecified). Without AL, this code is not selected. As another example, a DD ‘coronary artery disease, STEMI, s/p 2 stents placed in RCA’ was given to patient 148532. This DD is written informally and ungrammatically and contains very detailed information, e.g., ‘s/p 2 stents placed in RCA’. Such a writing style is quite different from that of CDs. With AL, our model successfully matches this DD to a groundtruth code: 414.01 (coronary atherosclerosis of native coronary artery). In contrast, No-AL fails to achieve this.

Position    2      4      6      8
No-IC      0.27   0.26   0.23   0.20
IC         0.32   0.29   0.27   0.23

Table 6.9: Comparison of NDCG scores in the ablation study of isotonic constraints.

Isotonic constraint (IC) To evaluate this component, we remove the ICs from Eq.(6.12) during training and denote this baseline as No-IC. We used NDCG to measure the ranking performance, which is calculated in the following way. Consider a testing patient visit $l$ whose groundtruth ICD codes are $M^{(l)}$. For any code $c$, we define the relevance score of $c$ to $l$ as 0 if $c \notin M^{(l)}$ and as $|M^{(l)}| - r(c)$ otherwise, where $r(c)$ is the groundtruth rank of $c$ in $M^{(l)}$. We rank codes in descending order of their predicted probabilities and obtain the predicted rank of each code. We calculated the NDCG scores at positions 2, 4, 6, and 8 based on the relevance scores and predicted ranks; they are shown in Table 6.9. As can be seen, using IC achieves much higher NDCG than No-IC, which demonstrates the effectiveness of IC in capturing the importance order among codes. We also evaluated how IC affects the sensitivity and specificity of code assignment. As can be seen from Table 6.8, No-IC degrades the two scores from 0.29 and 0.33 to 0.24 and 0.29 respectively, which indicates that IC helps train a model that assigns codes more correctly. This is because IC encourages codes that are highly relevant to the patient to be ranked at top positions, which prevents the selection of irrelevant codes.
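For clarity, the sketch below implements the NDCG computation as described in this paragraph, using the linear gain $|M^{(l)}| - r(c)$ and a log2 discount; the exact discount convention of [179] may differ slightly.

import math

def ndcg_at_k(pred_scores, gt_codes_ranked, k):
    """pred_scores: dict code -> predicted probability for one patient visit;
    gt_codes_ranked: groundtruth codes in descending importance order."""
    M = len(gt_codes_ranked)
    relevance = {c: M - r for r, c in enumerate(gt_codes_ranked, start=1)}
    predicted = sorted(pred_scores, key=pred_scores.get, reverse=True)[:k]
    dcg = sum(relevance.get(c, 0) / math.log2(i + 1)
              for i, c in enumerate(predicted, start=1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0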

Attentional matching (AM) In the evaluation of this module, we compared with a baseline – No-AM – which performs an unweighted average of the $M$ DDs, $\hat{h}_n = \frac{1}{M} \sum_{m=1}^{M} h_m$, concatenates $\hat{h}_n$ with $u_n$, and feeds the concatenated vector into the final prediction layer. From Table 6.8, we can see that our full model (with AM) outperforms No-AM, which demonstrates the effectiveness of attentional matching. In determining whether a code should be assigned, different DDs have different importance weights. No-AM ignores such weights and therefore performs less well.

AM can correctly perform the many-to-one mapping from multiple DDs to a CD. For example, patient 190236 was given two DDs: ‘renal insufficiency’ and ‘acute renal failure’. AM maps them to a combined ICD code, 403.91 (hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease), which is in the groundtruth provided by medical coders. In contrast, No-AM fails to assign this code. On the other hand, AM is able to correctly map a DD to multiple CDs. For example, a DD ‘congestive heart failure, diastolic’ was given to patient 140851. AM successfully maps this DD to two codes: (1) 428.0 (congestive heart failure, unspecified) and (2) 428.30 (diastolic heart failure, unspecified). Without AM, this DD is mapped only to 428.0.

Holistic comparison with other baselines In addition to evaluating the four modules individually, we also compared our full model with the ICD coding baselines proposed in [112, 191, 192, 199, 211, 274]. Table 6.8 shows the results. As can be seen, our approach achieves much better sensitivity and specificity scores. The reason our model works better is two-fold. First, our model is based on deep neural networks, which arguably have better modeling power than the linear methods used in the baselines. Second, our model is able to capture the hierarchical relationship and the importance order among codes, can alleviate the discrepancy in writing styles, and allows flexible many-to-one and one-to-many mappings from DDs to CDs. These merits are not possessed by the baselines.

6.4.3 Related Works

Larkey and Croft [211] studied the automatic assignment of ICD-9 codes to dictated inpatient discharge summaries, using a combination of three classifiers: k-nearest neighbors, relevance feedback, and Bayesian independence classifiers. This method assigns a single code to each patient visit; however, in clinical practice, each patient is usually assigned multiple codes. Franz et al. [112] investigated the automated coding of German-language free-text diagnosis phrases. This approach performs a one-to-one mapping between diagnosis descriptions and ICD codes, which is not in accordance with coding practice, where one-to-many and many-to-one mappings widely exist [267]. Pestian et al. [274] studied the assignment of ICD-9 codes to radiology reports. Kavuluru et al. [191] proposed an unsupervised ensemble approach to automatically perform ICD-9 coding based on textual narratives in electronic health records (EHRs). Kavuluru et al. [192] developed multi-label classification and learning-to-rank approaches for ICD-9 code assignment of in-patient visits based on EHRs. Koopman et al. [199] explored the automatic ICD-10 classification of cancers from free-text death certificates. These methods did not consider the hierarchical relationship or importance order among codes.

The tree LSTM network was first proposed by [320] to model the constituent or dependency parse trees of sentences. Teng and Zhang [330] extended the unidirectional tree LSTM to a bidirectional one. Xie and Xing [368] proposed a sequence-of-trees LSTM network to model a passage. In that network, a sequential LSTM is used to compose a sequence of tree LSTMs: the tree LSTMs are built on the constituent parse trees of individual sentences and the sequential LSTM is built on the sequence of sentences. Our proposed tree-of-sequences LSTM network differs from these previous works in two ways. First, it is applied to a code tree to capture the hierarchical relationship among codes. Second, it uses a tree LSTM to compose a hierarchy of sequential LSTMs.

Adversarial learning [134] has been widely applied to image generation [134], domain adaptation [118], feature learning [104], and text generation [401], to name a few. In this work, we use adversarial learning to mitigate the discrepancy between the writing styles of two types of sentences. The attention mechanism has been widely used in machine translation [26], image captioning [385], reading comprehension [295], text classification [395], etc. In this work, we compute attention scores between sentences to perform many-to-one and one-to-many mappings.

Chapter 7

Conclusions and Future Directions

In this thesis, we focused on developing diversity-promoting learning methods and large-scale distributed learning systems for automatic, data-driven medical predictions, recommendations, and decision-making, to assist physicians and hospitals in improving the quality and efficiency of healthcare.

7.1 Contributions

We recapitulate the key contributions.

Diversity-promoting Learning
• In frequentist learning, we proposed a number of diversity-promoting regularizers, including the uniform-eigenvalue regularizer (UER), nonconvex and convex Bregman matrix divergence (BMD) regularizers, angular constraints (ACs), and nonoverlap-promoting regularization. We discussed their advantages, disadvantages, and inter-connections. We validated their effectiveness in a wide spectrum of ML models, including distance metric learning (DML), the long short-term memory (LSTM) network, the convolutional neural network (CNN), the feedforward neural network (FNN), sparse coding (SC), and multiclass logistic regression (MLR). We extended the study of diversity-promoting regularization (DPR) from finite-dimensional vector spaces to infinite-dimensional reproducing kernel Hilbert spaces and investigated the nonoverlap-promoting effect of DPR when jointly used with sparsity-promoting regularization.
• In Bayesian learning, we proposed diversity-promoting priors, including the mutual angular process (MAP) and the infinite MAP. We validated their effectiveness on two models: the Bayesian mixture of experts model (BMEM) and the infinite latent feature model (ILFM).
• In theoretical analysis, we formally justified why promoting diversity can better capture infrequent patterns, using the nonconvex BMD regularizers as study cases. To understand why DPR can improve generalization performance, we conducted analysis showing that (1) angular constraints can exploit the tradeoff between estimation error and approximation error, and (2) reducing the nonconvex and convex BMD regularizers can reduce estimation errors.
• In algorithm development, we derived efficient optimization and Bayesian inference methods, based on projected gradient descent, proximal gradient descent, functional gradient descent, coordinate ascent, ADMM, KKT conditions, variational inference, and MCMC sampling, to solve the problems regularized or biased by diversity-promoting regularizers or priors.

Large-scale Distributed Learning
• We proposed a sufficient factor broadcasting (SFB) computation model by exploiting the sufficient factor property of a large family of ML models. SFB can greatly reduce communication cost. We also proposed a hybrid computation model that leverages the best of SFB and parameter-server architectures.
• We provided theoretical guarantees on the convergence of algorithms executed using SFB. The analysis was extended to partial SFB, where each machine communicates with a subset of neighbors.
• Based on SFB, we built a distributed system that provides efficient communication, lightweight fault tolerance, and an easy-to-use programming interface. We validated its efficiency and scalability on three large-scale ML models containing billions of parameters.

Applications in Healthcare
• In terms of problems, we studied similar-patient retrieval, medical topic discovery, medical image tagging, and automated ICD coding.
• In terms of methods, we applied the diversity-promoting and distributed learning techniques to two of the applications. For the other two applications, we proposed tree-of-sequences encoding for hierarchy modeling, adversarial learning for multi-label classification, contextual attention for abnormality localization, adversarial learning for writing-style reconciliation, isotonic constraints for order preservation, etc.
• In terms of evaluation, we validated the effectiveness of our methods on various clinical datasets, including electronic health records from the ICU, radiology and pathology images, and medical literature, to name a few.

7.2 Conclusions

We have the following conclusions.
• Promoting diversity can effectively capture infrequent patterns. This is justified both empirically and theoretically. The experiments on (1) DML regularized by UER and by nonconvex and convex BMD, (2) BMEM biased by the MAP, and (3) ILFM biased by the infinite MAP demonstrate that promoting diversity improves the performance on infrequent patterns. In theory, we proved that decreasing the nonconvex BMD regularizers reduces an imbalance factor (IF) in DML, and a smaller IF indicates better capturing of infrequent patterns.
• Promoting diversity can improve generalization performance, as validated in experiments and theory. In each section of Chapters 2 and 3, the evaluation shows that adding diversity-promoting regularizers or priors improves the performance on test data. In theory, our analysis showed that (1) a stronger angular constraint yields smaller estimation error and larger approximation error, and hence can explore the tradeoff between these two types of errors and seek the optimal generalization error; and (2) decreasing the nonconvex and convex BMD regularizers reduces the estimation error.
• Promoting diversity can improve interpretability. It is demonstrated that by encouraging diversity using the LDD-L1 regularizer and the infinite MAP prior, the basis vectors in sparse coding and the latent features in the infinite latent feature model become more mutually distinct and have less overlap, making them easier to interpret.
• Promoting diversity can reduce model size without sacrificing modeling power. In the following experiments: UER-regularized DML, nonconvex/convex BMD-regularized DML, BMD-regularized kernel DML, MAP-biased BMEM, and infinite-MAP-biased ILFM, it is shown that with diversity-promoting regularization, an ML model can achieve better performance with fewer components.
• System-algorithm co-design is essential for building efficient distributed ML systems, as evidenced in the Orpheus system. On one hand, we leveraged an algorithmic insight – the sufficient factor property – to drive the design of efficient communication and lightweight fault tolerance. On the other hand, we designed new algorithms (e.g., joint matrix column subset selection) to support the realization of system designs (e.g., sufficient factor selection). Empirical evaluation on various large-scale applications demonstrates the benefits of system-algorithm co-design.
• When applied to healthcare applications, the proposed diversity-promoting and distributed learning techniques demonstrate great effectiveness in solving the problems mentioned in the introduction chapter.
• The additionally proposed ML techniques for healthcare applications are effective, as demonstrated in extensive evaluations. The tree-of-sequences LSTM model can effectively capture the hierarchical relationship among classes. Adversarial learning is successfully applied to capture correlations among classes and to reconcile the discrepancy between writing styles. Contextual attention shows great efficacy in identifying abnormal regions in medical images. Isotonic constraints effectively improve ranking performance.

7.3 Future Directions

We identify the following directions for future work.

Diversity-promoting Learning
• Promoting structured diversity. The diversity-promoting regularizers/priors developed in this thesis do not consider the structural relationships among components. In many ML models, the components admit a certain structure, such as multiview, grouping, or hierarchy. For example, in cross-modal distance metric learning [283], each feature modality has a separate set of components. In clustered multi-task learning [175], the tasks (components) are partitioned into latent groups. In hierarchical topic models [144], the topics (components) are organized into a tree. In the presence of such structures, how to promote diversity deserves investigation.
• Provably consistent parameter estimation. Most diversity-promoting regularizers developed in this thesis are nonconvex, rendering the regularized problems nonconvex and therefore vulnerable to local optima. Further, they are difficult, if not impossible, to convexify. It is intriguing to study whether the global optima of these nonconvex problems can be achieved. We plan to investigate methods of moments and spectral algorithms [168], which have been demonstrated to be computationally efficient and provably consistent estimators for a number of nonconvex problems [20, 66, 269].
• Posterior contraction analysis. In Bayesian models, it is interesting to analyze how diversity-promoting priors affect the contraction rate of the components' posterior. Inspired by [262], we plan to study how fast the convex hull G of the estimated components contracts to the convex hull G0 of the "true" components, where the "closeness" between G and G0 can be measured by the "minimum-matching" Euclidean distance or the Hausdorff metric. In the mutual angular process, the concentration parameter κ represents the level of diversity. Our goal is to establish a relationship between κ and the contraction rate.
• Information geometry analysis. In probabilistic models, there are two types of manifolds. First, these models essentially define distributions over the observed data, and these distributions have a manifold structure [215] which is studied by information geometry [19]. Second, the components in ML models also lie on a manifold. We are interested in how diversity-promoting regularizers/priors affect the component manifold and, further, how they affect the distribution manifold (indirectly via the component manifold).

Large-scale Distributed Learning
• Model parallelism. In this thesis, we assume the parameter matrix (PM) can fit into the memory of each machine, which may not hold when the PM is excessively large. One way to address this issue is to divide the PM into sub-matrices so that each sub-matrix can fit into the memory of one machine; multiple machines then collaboratively update a single PM via model parallelism. Meanwhile, we retain data parallelism by adopting a hierarchical architecture: the machines are divided into groups, each group holds one replica of the PM and one data shard, and machines within the same group hold different sub-matrices of the PM replica. Data parallelism is executed between machine groups and model parallelism is conducted within each group.

Applications in Healthcare
• Never-ending construction of medical knowledge graphs. A medical knowledge graph (MKG), which consists of medical entities (e.g., symptoms, diseases, medications) and their relationships (e.g., medication A can treat disease B; medications A and C have adverse interactions), is a fundamental asset that empowers many clinical applications, such as medical named entity recognition, symptom checking, and medication-prescription safety alerts, to name a few. Manually building an MKG is time-consuming and not scalable. There have been efforts to automatically construct MKGs by analyzing literature and clinical notes [292]. However, they lack a mechanism to continuously grow the MKGs from medical texts that are newly generated every day. We plan to build a reinforcement learning system that constantly digests the new medical texts and ceaselessly expands and refines the MKG.
• Discharge summary generation. When discharging patients, physicians need to collect a large amount of information accumulated during the inpatient stay (e.g., hospital course, problem list, medications, test results) and consolidate it into a discharge summary. This process is very time-consuming. It would be helpful to build ML software that automatically pulls data from different sources, extracts key information from the raw data, and summarizes this information into a highly precise and readable discharge note.

Bibliography

[1] 20-Newsgroups dataset. http://www.daviddlewis.com/resources/ testcollections/reuters21578/. 2.2.4, 2.5.3, 3.2.4 [2] GPUDirect RDMA. http://docs.nvidia.com/cuda/gpudirect-rdma/ #axzz4eWBuf9Rj. 5.2.4 [3] Pytorch word language model. https://github.com/pytorch/examples/ tree/master/word_language_model. 2.5.3 [4] TDT dataset. http://www.itl.nist.gov/iad/mig/tests/tdt/2004/. 3.2.4 [5] CIFAR-10 dataset. https://www.cs.toronto.edu/˜kriz/cifar.html. 2.3.3, 2.5.3, 5.3.2, 5.3.2 [6] InfiniBand. http://www.infinibandta.com. 5.2.4 [7] MAGMA matrix algebra on GPU. In http://icl.cs.utk.edu/magma/. 2.1.3, 2.2.3, 2.2.4 [8] MPI. https://www.mpich.org/. 5.2.4 [9] PubMed. https://catalog.data.gov/dataset/pubmed. 5.2.5, 6.2 [10] Reuters dataset. http://www.daviddlewis.com/resources/ testcollections/reuters21578/. 2.2.4, 2.5.3, 3.2.4 [11] TIMIT dataset. https://catalog.ldc.upenn.edu/LDC93S1. 2.3.3 [12] Emergency department pulse report 2010: Patient perspectives on american health care. 2010. 1.1 [13] Mart´ın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation, 2016. 1.3.2, 5, 5.2, 5.2.2, 5.2.3, 5.2.5 [14] Ossama Abdel-Hamid, Li Deng, Dong Yu, and Hui Jiang. Deep segmental neural net- works for . Interspeech, 2013. 2.3.3 [15] Sachin Abeywardana. Expectation and covariance of von Mises-Fisher distribution. In https://sachinruk.github.io/blog/von-Mises-Fisher/, 2015. 3.1.6 [16] Raja Hafiz Affandi, Emily Fox, and Ben Taskar. Approximate inference in continuous determinantal processes. In Advances in Neural Information Processing Systems, 2013.

213 3, 3.1.5 [17] Saurabh Agarwal, Rahul Garg, Meeta S Gupta, and Jose E Moreira. Adaptive incremental checkpointing for massively parallel systems. In Proceedings of the 18th annual interna- tional conference on Supercomputing, pages 277–286. ACM, 2004. 5.2.2 [18] David J Aldous. Exchangeability and related topics. Springer, 1985. 3.2 [19] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007. 7.3 [20] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgar- sky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014. 7.3 [21] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge L Reyes-Ortiz. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In Ambient assisted living and home care. 2012. 2.2.4 [22] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, 2014. 2.3.3 [23] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. International Conference on Learning Representations, 2015. 6.3.3 [24] Francis R Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in neural information processing systems, pages 105–112, 2009. 2.5 [25] Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of machine Learning research, 2002. 2.4, 2.4.1 [26] Dzmitry Bahdanau, Kyunghyun Cho, and . Neural machine translation by jointly learning to align and translate. In International Conference on Learning Represen- tations, 2014. 6.3.3, 6.4.1, 6.4.3 [27] Robert E Banfield, Lawrence O Hall, Kevin W Bowyer, and W Philip Kegelmeyer. En- semble diversity measures and their application to thinning. Information Fusion, 2005. 1.3.1 [28] Hichem Bannour and Celine´ Hudelot. Hierarchical image annotation using semantic hi- erarchies. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2431–2434. ACM, 2012. 6.3.3 [29] Yebo Bao, Hui Jiang, Lirong Dai, and Cong Liu. Incoherent training of deep neural networks to de-correlate bottleneck features for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6980–6984. IEEE, 2013. 1.3.1, 1.3.1, 2.1, 2.1.1, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.3.2, 2.3.3, 2.5.3, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.3.3, 2.3.3, 2.3.3, 2.3.3, 2.5.3, 6.1.1, 6.1.2 [30]R emi´ Bardenet and Michalis Titsias. Inference for determinantal point processes without spectral knowledge. In Advances in Neural Information Processing Systems, pages 3393– 3401, 2015. 3.1 [31] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal

214 function. IEEE Transactions on Information Theory, 1993. 4.2.1, 4.2.1, 4.3.3, 15, 4.3.3 [32] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003. 4.2.1, 4.2.1, 4.3.2, 4.3.2, 4.3.2, 4.3.4, 4.3.4, 4.3.5, 4.3.5 [33] Zafer Barutcuoglu, Robert E Schapire, and Olga G Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836, 2006. 6.3.3 [34] Fred´ eric´ Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, , Ar- naud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. : new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012. 5.2.3 [35] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient meth- ods for convex optimization. Operations Research Letters, 2003. 5.4.1 [36] David Belanger and Andrew McCallum. Structured prediction energy networks. In Inter- national Conference on Machine Learning, 2016. 6.3.2, 6.3.2, 6.3.3 [37] A. Bellet and A. Habrard. Robustness and generalization for metric learning. Neurocom- puting, 2015. 4.2.2, 4.2.3 [38] Samy Bengio, Jason Weston, and David Grangier. Label embedding trees for large multi- class tasks. In Advances in Neural Information Processing Systems, pages 163–171, 2010. 6.3.2, 6.3.2, 6.3.3, 6.4.2 [39] Ingemar Bengtsson and Karol Zyczkowski. Geometry of quantum states: an introduction to quantum entanglement. Cambridge University Press, 2007. 2.1.1 [40] Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999. 5.1 [41] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Nu- merical Methods. Prentice-Hall, 1989. 5.4, 5.4.1 [42] Kanishka Bhaduri, Ran Wolff, Chris Giannella, and Hillol Kargupta. Distributed decision- tree induction in peer-to-peer systems. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2008. 5.2 [43] Christian H Bischof, Paul D Hovland, and Boyana Norris. On the implementation of automatic differentiation tools. Higher-Order and Symbolic Computation, 2008. 5.2.3 [44] Halil Bisgin, Zhichao Liu, Hong Fang, Xiaowei Xu, and Weida Tong. Mining fda drug labels using an unsupervised learning technique-topic modeling. In BMC bioinformatics, volume 12, page S11. BioMed Central, 2011. 6.2 [45] Halil Bisgin, Zhichao Liu, Reagan Kelly, Hong Fang, Xiaowei Xu, and Weida Tong. In- vestigating drug repositioning opportunities in fda drug labels through topic modeling. 13 (15):S6, 2012. 6.2 [46] Christopher M Bishop and Michael E Tipping. Bayesian regression and classification. Nato Science Series sub Series III Computer And Systems Sciences, 2003. 3 [47] David Blei and John Lafferty. Correlated topic models. In Advances in neural information processing systems, 2006. 3.1.1

215 [48] David M Blei and Michael I Jordan. Variational inference for Dirichlet process mixtures. Bayesian analysis, 2006. 3.2 [49] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 2003. 1.1, 1.3.1, 2.5, 5, 5.2.5 [50] Guillaume Bouchard. Efficient bounds for the , applications to inference in hybrid models. 2007. 3.1.2, 3.1.2, 3.1.6 [51] Stephanie LK Bowers, Thomas K Borg, and Troy A Baudino. The dynamics of fibroblast– myocyte–capillary interactions in the heart. Annals of the New York Academy of Sciences, 1188(1):143–152, 2010. 6.3 [52] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and sta- tistical learning via alternating direction method of multipliers. Foundations and Trends R in Machine Learning, 2011. 1.2.1, 2.3, 2.4, 2.4.2, 6.4.1 [53] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. 2.1.3, 2.1.3, 2.2.2, 2.3.2, 2.3.2, 2.3.2 [54] John S Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive al- gorithms for collaborative filtering. In Proceedings of the Fourteenth conference on Un- certainty in artificial intelligence, pages 43–52. Morgan Kaufmann Publishers Inc., 1998. 5 [55] Leo Breiman. Random forests. Machine Learning, 2001. 3.1.5, 3.1.5 [56] Serhat S Bucak, Pavan Kumar Mallapragada, Rong Jin, and Anil K Jain. Efficient multi- label ranking for multi-class learning: application to object recognition. In The IEEE International Conference on Computer Vision, 2009. 6.3.3 [57] Serhat Selcuk Bucak, Rong Jin, and Anil K Jain. Multi-label learning with incomplete class assignments. In IEEE Conference on Computer Vision and , 2011. 6.3.3 [58] Christopher JC Burges. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 1998. 3.1.5, 3.1.5 [59] Simon Byrne and Mark Girolami. Geodesic monte carlo on embedded manifolds. Scan- dinavian Journal of Statistics, 2013. 3.2.3, 3.2.3, 3.2.4 [60] Deng Cai and Xiaofei He. Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering, 2012. 2.5.3 [61] Deng Cai, Xiaofei He, and Jiawei Han. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering, 2005. 3.2.4 [62] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimiza- tion. Communications of the ACM, 2012. 2.1.1, 2.2.2 [63] Miguel A Carreira-Perpinan´ and Ramin Raziperchikolaei. An ensemble diversity ap- proach to supervised binary hashing. In Advances in Neural Information Processing Sys- tems, 2016. 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [64] Ricardo Cerri, Rodrigo C Barros, and Andre´ CPLF De Carvalho. Hierarchical multi-label

216 classification using local neural networks. Journal of Computer and System Sciences, 80 (1):39–56, 2014. 6.3.3 [65] Ricardo Cerri, Rodrigo C Barros, Andre´ CPLF de Carvalho, and Yaochu Jin. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC bioinformatics, 17(1):373, 2016. 6.3.3 [66] Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013. 7.3 [67] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. Conference of the International Speech Communication Association, 2014. 5.2.5 [68] Danqi Chen, Jason Bolton, and Christopher D Manning. A thorough examination of the CNN/Daily mail reading comprehension task. Annual Meeting of the Association for Computational Linguistics, 2016. 2.1.4, 2.3.3 [69] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answer- ing. arXiv preprint arXiv:1511.05960, 2015. 6.3.2, 6.3.3 [70] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. International Conference on Learning Representations, 2014. 6.3.3 [71] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016. 6.3.2, 6.3.2, 6.3.3 [72] Robert Chen, Hang Su, Mohammed Khalilia, Sizhe Lin, Yue Peng, Tod Davis, Daniel A Hirsh, Elizabeth Searles, Javier Tejedor-Sojo, Michael Thompson, et al. Cloud-based predictive modeling system and its application to asthma readmission prediction. In AMIA Annual Symposium Proceedings, 2015. 1.3.3 [73] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. NIPS 2015 Workshop, 2015. 1.3.2, 5.2.5, 5.2.5, 5.3.2, 5.3.2 [74] Yong Chen, Hui Zhang, Yongxin Tong, and Ming Lu. Diversity regularized latent semantic match for hashing. Neurocomputing, 2017. 2.2, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [75] Yukun Chen, Hongxin Cao, Qiaozhu Mei, Kai Zheng, and Hua Xu. Applying active learning to supervised word sense disambiguation in MEDLINE. Journal of the American Medical Informatics Association, 2013. 1.3.3, 1.3.3 [76] Wei-Chen Cheng, Stanley Kok, Hoai Vu Pham, Hai Leong Chieu, and Kian Ming A Chai. Language modeling with sum-product networks. In Fifteenth Annual Conference of the International Speech Communication Association, 2014. 2.5.3

217 [77] Weiwei Cheng, Eyke Hullermeier,¨ and Krzysztof J Dembczynski. Bayes optimal mul- tilabel classification via probabilistic classifier chains. In International Conference on Machine Learning, 2010. 6.3.3 [78] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: building an efficient and scalable deep learning training system. In USENIX Sym- posium on Operating Systems Design and Implementation, 2014. 1.3.2, 5, 5.1, 5.2, 5.2.1, 5.2.1, 5.3.1, 5.3.2, 5.3.2, 5.3.2 [79] Edward Choi, Mohammad Taha Bahadori, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. Machine Learning for Healthcare Conference, 2015. 1.3.3 [80] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Informa- tion Processing Systems, 2015. 2.3.3 [81] Djork-Arne´ Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). International Conference on Learn- ing Representations, 2016. 2.3.3, 2.5.3 [82] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra. Reducing overfitting in deep networks by decorrelating representations. International Conference on Learning Representations, 2015. 1.3.1, 1.3.1, 2.1, 2.1.1, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.3.3, 2.5.3, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.3.3, 2.3.3, 2.3.3, 2.5.3, 6.1.1, 6.1.2 [83] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012. 2.1, 2.1.1, 2.1.1 [84] Steven P Crain, Shuang-Hong Yang, Hongyuan Zha, and Yu Jiao. Dialect topic modeling for improved consumer medical search. In AMIA Annual Symposium Proceedings, volume 2010, page 132. American Medical Informatics Association, 2010. 6.2 [85] Henggang Cui, Hao Zhang, Gregory R Ganger, Phillip B Gibbons, and Eric P Xing. Geeps: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In European Conference on Computer Systems, 2016. 1.3.2, 5.2, 5.2.1 [86] Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention- over-attention neural networks for reading comprehension. Annual Meeting of the Asso- ciation for Computational Linguistics, 2017. 2.1.4, 2.3.3 [87] Brianna A Da Silva and Mahesh Krishnamurthy. The alarming reality of medication error: a patient case and review of pennsylvania and national data. Journal of community hospital internal medicine perspectives, 6(4):31758, 2016. 1.1 [88] George Dahl, Abdel-rahman Mohamed, Geoffrey E Hinton, et al. Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in neural information processing systems, 2010. 2.3.3 [89] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Neural Information Process- ing Systems, 2014. 1.2.1, 2.4, 2.4.2

218 [90] Michael Daniel and Martin A Makary. Medical errorthe third leading cause of death in the us. Bmj, 353(i2139):476636183, 2016. 1.1 [91] Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 2017. 6.3.3 [92] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information- theoretic metric learning. In Proceedings of the 24th international conference on Machine learning, 2007. 2.1.2, 2.1.4, 2.1.4, 2.1.4, 2.2.2, 2.2.4, 2.4.3, 2.5.1, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.4.3, 6.1.1, 6.1.2 [93] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008. 1.3.2, 5 [94] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012. 1.3.2, 5, 5.1, 5.2, 5.2.1, 1, 5.2.1, 5.3, 5.3.2 [95] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. 2.1.4, 2.4.3, 5, 5.3.2 [96] Jia Deng, Sanjeev Satheesh, Alexander C Berg, and Fei Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In Advances in Neural Information Processing Systems, pages 567–575, 2011. 6.3.3 [97] Jia Deng, Jonathan Krause, Alexander C Berg, and Li Fei-Fei. Hedging your bets: Opti- mizing accuracy-specificity trade-offs in large scale visual recognition. In IEEE Confer- ence on Computer Vision and Pattern Recognition, pages 3450–3457. IEEE, 2012. 6.3.3 [98] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. In European Conference on Computer Vision, pages 48–64. Springer, 2014. 6.3.3 [99] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrix ap- proximation. In Approximation, Randomization, and Combinatorial Optimization. Algo- rithms and Techniques, pages 292–303. Springer, 2006. 5.2.1 [100] Inderjit S Dhillon and Joel A Tropp. Matrix nearness problems with bregman divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120–1146, 2007. 1.2.1, 2.2.1, 2.4 [101] Bhuwan Dhingra, Hanxiao Liu, William W Cohen, and Ruslan Salakhutdinov. Gated- attention readers for text comprehension. Annual Meeting of the Association for Compu- tational Linguistics, 2017. 2.1.4, 2.3.3 [102] Bhuwan Dhingra, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Linguistic knowledge as memory for recurrent neural networks. arXiv preprint arXiv:1703.02620, 2017. 2.1.4, 2.3.3 [103] Son Doan, Ko-Wei Lin, Mike Conway, Lucila Ohno-Machado, Alex Hsieh,

219 Stephanie Feudjio Feupe, Asher Garland, Mindy K Ross, Xiaoqian Jiang, Seena Farzaneh, et al. PhenDisco: phenotype discovery system for the database of genotypes and pheno- types. Journal of the American Medical Informatics Association, 2014. 1.3.3 [104] Jeff Donahue, Philipp Krahenb¨ uhl,¨ and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016. 6.4.3 [105] Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In International conference on machine learning, 2009. 3.2.4 [106] Shahram Ebadollahi, Jimeng Sun, David Gotz, Jianying Hu, Daby Sow, Chalapathy Neti, et al. Predicting patients trajectory of physiological data using temporal trends in similar patients: A system for near-term prognostics. In AMIA Annual Symposium, 2010. 1.3.3, 6.1 [107] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017. 6.3.3 [108] Konstantinos P Exarchos, Themis P Exarchos, Christos V Bourantas, Michail I Papafak- lis, Katerina K Naka, Lampros K Michalis, Oberdan Parodi, and Dimitrios I Fotiadis. Prediction of coronary atherosclerosis progression using dynamic Bayesian networks. In International Conference of the IEEE Engineering in Medicine and Biology Society, 2013. 1.3.3 [109] Richard´ Farkas and Gyorgy¨ Szarvas. Automatic construction of rule-based ICD-9-CM coding systems. BMC bioinformatics, 2008. 6.4 [110] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding, 2007. 3.2.4 [111] Thomas S Ferguson. A bayesian analysis of some nonparametric problems. The annals of statistics, pages 209–230, 1973. 1.2.1, 3, 3.2, 3.2 [112] Pius Franz, Albrecht Zaiss, Stefan Schulz, Udo Hahn, and Rudiger¨ Klar. Automated coding of diagnoses–three methods compared. In Proceedings of the AMIA Symposium, page 250. American Medical Informatics Association, 2000. 6.4.2, 6.4.2, 6.4.3 [113] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance esti- mation with the graphical lasso. Biostatistics, 9(3):432–441, 2008. 2.5 [114] Xiping Fu, Brendan McCane, Steven Mills, and Michael Albert. Nokmeans: Non- orthogonal k-means hashing. In Asian Conference on Computer Vision, 2014. 2.2.4, 6.1.1 [115] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016. 2.5.3 [116] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part

220 C (Applications and Reviews), 2012. 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [117] Mark JF Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer speech & language, 12(2):75–98, 1998. 2.3.3 [118] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropaga- tion. In International Conference on Machine Learning, pages 1180–1189, 2015. 6.4.3 [119] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Franc¸ois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016. 6.3.1 [120] Shenghua Gao, Ivor Wai-Hung Tsang, and Liang-Tien Chia. Kernel sparse representa- tion for image classification and face recognition. In European Conference on Computer Vision, 2010. 1.2.1, 2.4, 2.4.1, 2.4.1, 2.4.3, 2.4.3 [121] Tianshi Gao and Daphne Koller. Discriminative learning of relaxed hierarchy for large- scale visual recognition. In The IEEE International Conference on Computer Vision, pages 2072–2079. IEEE, 2011. 6.3.3 [122] Tiezheng Ge, Kaiming He, and Jian Sun. Graph cuts for supervised binary coding. In European Conference on Computer Vision, 2014. 2.2, 2.2.4, 6.1.1 [123] Xin Geng and Longrun Luo. Multilabel ranking with inconsistent rankers. In IEEE Con- ference on Computer Vision and Pattern Recognition, 2014. 6.3.3 [124] Marian George and Christian Floerkemeier. Recognizing products: A per-exemplar multi- label image classification approach. In European Conference on Computer Vision, 2014. 6.3.3 [125] Athinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE transactions on pattern analysis and machine intelligence, 23(6):643–660, 2001. 3.2.4 [126] Nadia Ghamrawi and Andrew McCallum. Collective multi-label classification. In ACM International Conference on Information and Knowledge Management, 2005. 6.3.3 [127] Walter R Gilks. Markov chain Monte Carlo. Wiley Online Library, 2005. 1.2.1, 3, 3.1, 3.1.2, 3.1.6 [128] Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 1992. 3.2.3 [129] Mark Girolami and Ben Calderhead. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2011. 1.2.1, 3.2, 3.2.3, 3.2.3 [130] Glenn T Gobbel, Jennifer Garvin, Ruth Reeves, Robert M Cronin, Julia Heavirland, Jenifer Williams, Allison Weaver, Shrimalini Jayaramaraja, Dario Giuse, Theodore Sper- off, et al. Assisted annotation of medical free text using RapTAT. Journal of the American Medical Informatics Association, 2014. 1.3.3

221 [131] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet. Circulation, 2000. 2.2.4, 6.1.1 [132] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Pow- ergraph: Distributed graph-parallel computation on natural graphs. In USENIX Sympo- sium on Operating Systems Design and Implementation, 2012. 1.3.2, 5 [133] Roberto Gonzalez-Amaro, Donato Alarcon-Segovia, Jorge Alcocer-Varela, Lino Diaz De Leon, and Yvonne Rosenstein. Mononuclear cell-fibroblast interactions in sclero- derma. Clinical immunology and immunopathology, 46(3):412–420, 1988. 6.3 [134] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. 1.2.3, 6.3, 6.3.1, 6.4.1, 6.4.3 [135] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Ben- gio. Maxout networks. Proceedings of the 30th International Conference on Machine Learning, 2013. 2.3.3, 2.5.3 [136] Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. In International Conference on Machine Learning, 2013. 5, 5.2.5 [137] Rudolf Gorenflo, Yuri Luchko, and Francesco Mainardi. Analytical properties and appli- cations of the Wright function. Fractional Calculus and Applied Analysis, 1999. 2.2.3, 2.4.2, 2.4.4 [138] Benjamin Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014. 2.3.3, 2.5.3 [139] Alex Graves, Abdel-rahman Mohamed, and . Speech recognition with deep recurrent neural networks. In Ieee international conference on Acoustics, speech and signal processing, 2013. 2.3.3, 2.3.3 [140] Gregory Griffin and Pietro Perona. Learning and using taxonomies for fast visual cate- gorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008. 6.3.3 [141] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007. 2.3.3, 2.4.3 [142] Thomas Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2005. 3.2, 3.2, 3.2.2, 3.2.3, 3.2.3, 3.2.4, 3.2.4, 3.2.4 [143] Thomas L Griffiths and Zoubin Ghahramani. The indian buffet process: An introduction and review. Journal of Machine Learning Research, 12(Apr):1185–1224, 2011. 1.2.1, 3.2, 3.2.2, 3.2.3 [144] Thomas L Griffiths, Michael I Jordan, Joshua B Tenenbaum, and David M Blei. Hier- archical topic models and the nested chinese restaurant process. In Advances in neural information processing systems, pages 17–24, 2004. 7.3

222 [145] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel comput- ing, 1996. 5.2.4 [146] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Is that you? metric learning approaches for face identification. In IEEE International Conference on Computer Vision. IEEE, 2009. 2.1.2, 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.4.1, 2.4.3, 2.5.1, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.4.3, 6.1.1, 6.1.2 [147] Yuhong Guo and Suicheng Gu. Multi-label classification using conditional dependency networks. In International Joint Conference on Artificial Intelligence, 2011. 6.3.3 [148] Bernhard Haeupler. Simple, fast and deterministic gossip and rumor spreading. Journal of the ACM, 62(6):47, 2015. 5.2.5 [149] Yoni Halpern, Steven Horng, Youngduck Choi, and David Sontag. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Med- ical Informatics Association, 2016. 1.3.3 [150] Yoni Halpern, Steven Horng, and David Sontag. Clinical tagging with joint probabilistic models. Machine Learning for Healthcare Conference, 2016. 1.3.3 [151] Bharath Hariharan, Lihi Zelnik-Manor, Manik Varma, and Svn Vishwanathan. Large scale max-margin multi-label classification with priors. In International Conference on Machine Learning, 2010. 6.3.3 [152] W Keith Hastings. Monte carlo sampling methods using markov chains and their applica- tions. Biometrika, 1970. 3.2.3 [153] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 2.3.3, 2.5.3, 6.3.2 [154] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 2015. 2.1.2, 2.1.4, 2.3.3, 2.3.3, 2.3.3 [155] Tommy Hielscher, Myra Spiliopoulou, Henry Volzke,¨ and Jens-Peter Kuhn.¨ Using partic- ipant similarity for the classification of epidemiological data on hepatic steatosis. In IEEE 27th International Symposium on Computer-Based Medical Systems, pages 1–7. IEEE, 2014. 6.1 [156] Nicholas J Higham and Hyun-Min Kim. Solving a quadratic matrix equation by Newton’s method with exact line searches. SIAM Journal on Matrix Analysis and Applications, 2001. 2.4.4 [157] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children’s books with explicit memory representations. International Conference on Learning Representations, 2015. 2.3.3 [158] Anika L Hines, Marguerite L Barrett, H Joanna Jiang, and Claudia A Steiner. Conditions with the largest number of adult hospital readmissions by payer, 2011: statistical brief# 172. 2014. 1.1

223 [159] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012. 2.3.3 [160] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006. 3.1.5, 3.1.5 [161] Joyce C Ho, Joydeep Ghosh, Steve R Steinhubl, Walter F Stewart, Joshua C Denny, Bradley A Malin, and Jimeng Sun. Limestone: High-throughput candidate phenotype generation via tensor factorization. Journal of biomedical informatics, 2014. 1.3.3 [162] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric Xing. More effective distributed ML via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems, 2013. 1.3.2, 5, 5.1, 5.2, 5.3.2, 5.4 [163] Sepp Hochreiter and Jurgen¨ Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 1.2.1, 1.3.3, 2.1, 2.1.2, 2.1.2, 2.3.3, 2.5.1, 2.5.3, 5.2.5, 6.3, 6.3.1 [164] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013. 3.1.2 [165] Ulrich Hoffmann, Jean-Marc Vesin, Touradj Ebrahimi, and Karin Diserens. An efficient p300-based brain–computer interface for disabled subjects. Journal of Neuroscience meth- ods, 167(1):115–125, 2008. 3.2.4 [166] Saihui Hou, Yushan Feng, and Zilei Wang. VegFru: A domain-specific dataset for fine- grained visual categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 541–549, 2017. 6.3.2, 6.3.2, 6.3.3, 6.4.2 [167] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sun- dararajan. A dual coordinate descent method for large-scale linear SVM. In International conference on machine learning, 2008. 5, 5.1 [168] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012. 7.3 [169] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2.3.3, 2.5.3 [170] Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word clas- sifiers: A loss framework for language modeling. International Conference on Learning Representations, 2017. 2.5.3 [171] Mohammad Tariqul Islam, Md Abdul Aowal, Ahmed Tahseen Minhaz, and Khalid Ashraf. Abnormality detection and localization in chest x-rays using deep convolutional neural networks. arXiv preprint arXiv:1705.09850, 2017. 6.3.2, 6.3.2, 6.3.3 [172] T Jaakkola and Michael I Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, 1997. 3

224 [173] Christopher H Jackson, Linda D Sharples, Simon G Thompson, Stephen W Duffy, and Elisabeth Couto. Multistate Markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician), 2003. 1.3.3 [174] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th international conference on machine learning, pages 433–440, 2009. 2.5 [175] Laurent Jacob, Jean-philippe Vert, and Francis R Bach. Clustered multi-task learning: A convex formulation. In Advances in neural information processing systems, pages 745– 752, 2009. 7.3 [176] P. Jain, B. Kulis, J. Davis, and I. Dhillon. Metric and kernel learning using a linear trans- formation. Journal of machine Learning research, 2012. 2.4, 2.4.1, 2.4.1, 2.4.3, 2.4.3 [177] Amin Jalali, Lin Xiao, and Maryam Fazel. Variational Gram functions: Convex analysis and optimization. Siam Journal on Optimization, 2017. 1.3.1, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [178] Andrew Janowczyk and Anant Madabhushi. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology infor- matics, 7, 2016. 6.3.3 [179] Kalervo Jarvelin¨ and Jaana Kekal¨ ainen.¨ Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002. 6.4.2 [180] Shuiwang Ji, Lei Tang, Shipeng Yu, and Jieping Ye. Extracting shared subspace for multi- label classification. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008. 6.3.3 [181] Yangqing Jia, Joshua T Abbott, Joseph L Austerweil, Thomas Griffiths, and Trevor Dar- rell. Visual concept learning: Combining machine vision and bayesian generalization on concept hierarchies. In Advances in Neural Information Processing Systems, pages 1842–1850, 2013. 6.3.3 [182] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir- shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multi- media, pages 675–678. ACM, 2014. 5.3.2 [183] Liping Jing, Liu Yang, Jian Yu, and Michael K Ng. Semi-supervised low-rank mapping learning for multi-label classification. In IEEE Conference on Computer Vision and Pat- tern Recognition, 2015. 6.3.3 [184] A. Johnson, T. Pollard, L. Shen, L. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Celi, and R. Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3, 2016. 2.1.4, 2.2.4, 2.4.3, 6.1.1, 6.4.2 [185] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002. 2.1, 2.1.1 [186] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural computation, 1994. 3.1.4 [187] Shalmali Joshi, Suriya Gunasekar, David Sontag, and Joydeep Ghosh. Identifiable pheno-

225 typing using constrained non-negative matrix factorization. Machine Learning for Health- care Conference, 2016. 1.3.3 [188] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. Annual Meeting of the Association for Computa- tional Linguistics, 2016. 2.1.4, 2.3.3 [189] Atsushi Kanehira and Tatsuya Harada. Multi-label ranking from positive and unlabeled data. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 6.3.3 [190] Feng Kang, Rong Jin, and Rahul Sukthankar. Correlated label propagation with applica- tion to multi-label learning. In IEEE Conference on Computer Vision and Pattern Recog- nition, 2006. 6.3.3 [191] Ramakanth Kavuluru, Sifei Han, and Daniel Harris. Unsupervised extraction of diagnosis codes from emrs using knowledge-based and extractive text summarization techniques. In Canadian conference on artificial intelligence, pages 77–88. Springer, 2013. 6.4.2, 6.4.2, 6.4.3 [192] Ramakanth Kavuluru, Anthony Rios, and Yuan Lu. An empirical evaluation of approaches in assigning diagnosis codes to electronic medical records. Artificial intelligence in medicine, 65(2):155–166, 2015. 6.4.2, 6.4.2, 6.4.3 [193] Seyoung Kim and Eric P Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eqtl mapping. Annals of Applied Statistics, 6(3):1095–1117, 2012. 2.5.3 [194] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016. 2.5.3 [195] Ross Kindermann, James Laurie Snell, et al. Markov random fields and their applications. American Mathematical Society Providence, 1980. 3.1.1 [196] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Interna- tional Conference for Learning Representations, 2015. 6.4.2 [197] Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. Dynamic entity rep- resentation with max-pooling improves machine reading. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, 2016. 2.1.4, 2.3.3 [198] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and tech- niques. MIT press, 2009. 3.1, 3.1.1, 3.2.1, 3.2.1 [199] Bevan Koopman, Guido Zuccon, Anthony Nguyen, Anton Bergheim, and Narelle Grayson. Automatic icd-10 classification of cancers from free-text death certificates. In- ternational journal of medical informatics, 84(11):956–965, 2015. 6.4.2, 6.4.2, 6.4.3 [200] M Kostinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition, 2012. 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [201] Philipp Krahenb¨ uhl¨ and Vladlen Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. In Advances in Neural Information Processing Systems, pages

226 109–117, 2011. 6.3.3 [202] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV Workshops, 2013. 2.1.4, 2.2.4, 2.4.3 [203] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing sys- tems, pages 1097–1105, 2012. 1.2.2, 2.3.3, 5.3, 5.3.2 [204] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foun- dations and Trends in Machine Learning, 2012. 1.3.1, 1.3.1, 2.1, 2.1.4, 2.2.1, 2.2.4, 2.3.3, 3, 3.1, 3.1.5, 3.2.1, 6.1.1 [205] Alex Kulesza and Ben Taskar. Learning determinantal point processes. The Conference on Uncertainty in Artificial Intelligence, 2012. 6.3.2 [206] Brian Kulis, Maty´ as´ A Sustik, and Inderjit S Dhillon. Low-rank kernel learning with bregman matrix divergences. Journal of Machine Learning Research, 10:341–376, 2009. 2.1.1, 2.2, 2.2.1 [207] Abhimanu Kumar, Alex Beutel, Qirong Ho, and Eric P Xing. Fugue: Slow-worker- agnostic distributed learning for big models on big data. In International Conference on Artificial Intelligence and Statistics, 2014. 1.2.3, 5.2.5, 5.2.5, 6.2, 6.2, 6.2 [208] Ashnil Kumar, Jinman Kim, David Lyndon, Michael Fulham, and Dagan Feng. An en- semble of fine-tuned convolutional neural networks for medical image classification. IEEE journal of biomedical and health informatics, 21(1):31–40, 2017. 6.3.3 [209] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier en- sembles and their relationship with the ensemble accuracy. Machine learning, 2003. 1.3.1 [210] Dee Lang. Consultant report-natural language processing in the health care industry. Cincinnati Children’s Hospital Medical Center, Winter, 2007. 6.4 [211] Leah S Larkey and W Bruce Croft. Combining classifiers in text categorization. In Pro- ceedings of the 19th annual international ACM SIGIR conference on Research and devel- opment in information retrieval, pages 289–297. ACM, 1996. 6.4.2, 6.4.2, 6.4.3 [212] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Com- puter Vision and Pattern Recognition, volume 2, pages 2169–2178, 2006. 2.3.3, 2.4.3, 2.4.3, 3.2.4 [213] Quoc V Le, Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang Wei Koh, and Andrew Y Ng. Tiled convolutional neural networks. In Neural Information Processing Systems, 2010. 2.3.2, 2.3.3 [214] Quoc V Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning, pages 507–514, 2012. 5.2.1, 5.3.2, 5.3.2, 5.3.2 [215] Guy Lebanon. Riemannian geometry and statistical machine learning. 7.3

227 [216] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer, 1991. 4.3.2 [217] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015. 2.3.3, 2.5.3 [218] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 1999. 5, 5.1 [219] Joon Lee, David M Maslove, and Joel A Dubin. Personalized mortality prediction driven by electronic medical data and a patient similarity metric. PloS one, 10(5), 2015. 6.1 [220] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004. 3.1.5 [221] Cheng Li, Bingyu Wang, Virgil Pavlu, and Javed Aslam. Conditional Bernoulli mixtures for multi-label classification. In International Conference on Machine Learning, 2016. 6.3.3 [222] Hao Li, Asim Kadav, Erik Kruus, and Cristian Ungureanu. MALT: distributed data- parallelism for existing ML applications. In European Conference on Computer Systems, 2015. 1.3.2, 5, 5.1, 5.2.1, 5.2.5 [223] Jianxin Li, Haoyi Zhou, Pengtao Xie, and Yingchun Zhang. Improving the generalization performance of multi-class SVM via angular regularization. International Joint Confer- ence on Artificial Intelligence, 2017. 1.1 [224] Li-Jia Li and Li Fei-Fei. What, where and who? classifying events by scene recognition. In International Conference on Computer Vision, 2007. 2.3.3, 2.4.3 [225] Li-Jia Li, Chong Wang, Yongwhan Lim, David M Blei, and Li Fei-Fei. Building and using a semantivisual image hierarchy. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3336–3343. IEEE, 2010. 6.3.3 [226] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josi- fovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation, 2014. 1.2.2, 1.3.2, 5, 5.2, 5.2.1, 1, 5.2.1, 5.2.5, 5.3 [227] Qiang Li, Maoying Qiao, Wei Bian, and Dacheng Tao. Conditional graphical lasso for multi-label image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 6.3.3 [228] Xin Li, Feipeng Zhao, and Yuhong Guo. Conditional restricted Boltzmann machines for multi-label learning with incomplete labels. In International Conference on Artificial Intelligence and Statistics, 2015. 6.3.3 [229] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. International Conference on Learning Representations, 2014. 2.3.3, 2.5.3 [230] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken,

228 and Clara I Sanchez.´ A survey on deep learning in medical image analysis. Medical Image Analysis, 2017. 6.3.3 [231] Baoyuan Liu, Fereshteh Sadeghi, Marshall Tappen, Ohad Shamir, and Ce Liu. Probabilis- tic label trees for efficient large scale image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 843–850, 2013. 6.3.3 [232] Jiuxing Liu, Amith R Mamidala, and Dhabaleswar K Panda. Fast and scalable MPI- level broadcast using InfiniBand’s hardware multicast support. In Parallel and Distributed Processing Symposium, 2004. 5.2.4 [233] Wei Liu, Steven Hoi, and Jianzhuang Liu. Output regularized metric learning with side information. European conference on computer vision, 2008. 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [234] Wei Liu, Cun Mu, Rongrong Ji, Shiqian Ma, John R Smith, and Shih-Fu Chang. Low- rank similarity metric learning in high dimensions. In AAAI Conference on Artificial Intelligence, 2015. 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [235] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. Stronger semantics for low-latency geo-replicated storage. In the 10th USENIX Symposium on Networked Systems Design and Implementation, pages 313–328. USENIX, 2013. 5.3.2 [236] David G Lowe. Object recognition from local scale-invariant features. In The proceedings of the seventh IEEE international conference on Computer vision, volume 2, pages 1150– 1157. Ieee, 1999. 3.1.5 [237] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004. 2.3.3, 2.4.3, 3.2.4 [238] Liang Lu, Lingpeng Kong, Chris Dyer, Noah A Smith, and Steve Renals. Segmental recurrent neural networks for end-to-end speech recognition. Interspeech, 2016. 2.3.3 [239] Liang Lu, Lingpeng Kong, Chris Dyer, and Noah A Smith. Multi-task learning with CTC and segmental CRF for speech recognition. Interspeech, 2017. 2.3.3 [240] Anne E Magurran. Measuring biological diversity. John Wiley & Sons, 2013. 2.1 [241] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In ACM SIGMOD International Conference on Management of data, 2010. 1.3.2 [242] Jonathan Malkin and Jeff Bilmes. Ratio semi-definite classifiers. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4113–4116. IEEE, 2008. 1.3.1, 1.3.1, 2.1, 2.1.1, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.3.3, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [243] Christopher D Manning, Prabhakar Raghavan, Hinrich Schutze,¨ et al. Introduction to information retrieval, volume 1. 2008. 2.2.4, 6.1.1 [244] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313– 330, 1993. 2.5.3

229 [245] Kanti V Mardia and Peter E Jupp. Directional statistics, volume 494. John Wiley & Sons, 2009. 3.1, 3.1.1, 3.1.2 [246] Zelda Mariet and Suvrit Sra. Diversity networks. International Conference on Learning Representations, 2015. 1.3.1 [247] Marcin Marszalek and Cordelia Schmid. Semantic hierarchies for visual object recogni- tion. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2007. 6.3.3 [248] Marcin Marszałek and Cordelia Schmid. Constructing category hierarchies for visual recognition. In European Conference on Computer Vision, pages 479–491. Springer, 2008. 6.3.3 [249] Aleix M Mart´ınez and Avinash C Kak. Pca versus lda. IEEE transactions on pattern analysis and machine intelligence, 23(2):228–233, 2001. 3.2.4 [250] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. International Conference on Learning Representations, 2017. 2.5.3 [251] Charles A Micchelli and Massimiliano Pontil. Learning the kernel function via regular- ization. Journal of machine Learning research, 2005. 2.4 [252] A. Mignon and F. Jurie. PCCA: A new approach for distance learning from sparse pairwise constraints. In IEEE Conference on Computer Vision and Pattern Recognition, 2012. 2.4.3, 2.4.3 [253] Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network lan- guage model. IEEE Spoken Language Technology conference, 12:234–239, 2012. 2.5.3 [254] Toma´sˇ Mikolov, Martin Karafiat,´ Luka´sˇ Burget, Jan Cernockˇ y,` and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010. 2.5.3 [255] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. International Conference on Learning Representations Workshop, 2013. 6.3.2 [256] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013. 2.1.4, 2.4.3, 6.1.1 [257] Dmytro Mishkin and Jiri Matas. All you need is a good init. International Conference on Learning Representations, 2015. 2.3.3, 2.5.3 [258] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014. 6.3.3 [259] Taesup Moon, Heeyoul Choi, Hoshik Lee, and Inchul Song. Rnndrop: A novel dropout for RNNs in ASR. In IEEE Automatic Speech Recognition and Understanding Workshop, 2015. 2.3.3 [260] Radford M Neal. Bayesian learning for neural networks. Springer Science & Business Media, 2012. 3

230 [261] Kenney Ng, Jimeng Sun, Jianying Hu, and Fei Wang. Personalized predictive modeling and risk factor identification using patient similarity. AMIA Summits on Translational Science Proceedings, page 132, 2015. 1.3.3, 6.1 [262] XuanLong Nguyen et al. Posterior contraction of the population polytope in finite admix- ture models. Bernoulli, 21(1):618–646, 2015. 7.3 [263] Bogdan Nicolae and Franck Cappello. Ai-ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing. In Proceedings of the 22nd interna- tional symposium on High-performance parallel and distributed computing, pages 155– 166. ACM, 2013. 5.2.2 [264] Michael A Nielsen and Isaac L Chuang. Quantum computation and Quantum information. 2.2.2 [265] G. Niu, B. Dai, M. Yamada, and M. Sugiyama. Information-theoretic semisupervised metric learning via entropy regularization. Neural Computation, 2012. 2.4.3, 2.4.3 [266] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 1997. 1.2.1, 2.3.1, 2.4.3, 2.5.3, 5, 5.1 [267] Kimberly J O’malley, Karon F Cook, Matt D Price, Kimberly Raiford Wildes, John F Hurdle, and Carol M Ashton. Measuring diagnoses: Icd code accuracy. Health services research, 2005. 6.4, 6.4.3 [268] World Health Organization et al. International classification of diseases:[9th] ninth revi- sion, basic tabulation list with alphabetic index. World Health Organization, 1978. 6.4 [269] Ankur P Parikh, Le Song, and Eric P Xing. A spectral algorithm for latent tree graphical models. In Proceedings of the 28th International Conference on Machine Learning, pages 1065–1072, 2011. 7.3 [270] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Opti- mization, 1(3):127–239, 2014. 1.2.1, 2.2, 2.2.3, 2.5.2, 2.5.2 [271] Ioannis Partalas, Grigorios Tsoumakas, and Ioannis P Vlahavas. Focused ensemble selec- tion: A diversity-based method for greedy ensemble selection. In European Conference on Artificial Intelligence, 2008. 1.3.1 [272] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Gali- nari. LSHTC: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015. 5, 5.2.5 [273] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. International Conference on Learning Representations, 2014. 2.5.3 [274] John P Pestian, Christopher Brew, Paweł Matykiewicz, Dj J Hovermale, Neil Johnson, K Bretonnel Cohen, and Włodzisław Duch. A shared task involving multi-label classifi- cation of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 97–104. Association for Compu- tational Linguistics, 2007. 6.4.2, 6.4.2, 6.4.3

231 [275] John Platt et al. Fast training of support vector machines using sequential minimal opti- mization. Advances in Kernel Methods Support Vector Learning, 3, 1999. 3.1.5 [276] Graham E Poliner and Daniel PW Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal on Applied Signal Processing, (1):154–154, 2007. 3.2.4 [277] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagen- dra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011. 2.3.3, 2.3.3 [278] Rohit Prabhavalkar, Tara N Sainath, David Nahamoo, Bhuvana Ramabhadran, and Dimitri Kanevsky. An evaluation of posterior modeling techniques for phonetic recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013. 2.3.3 [279] Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, pages 157–163, 2017. 2.5.3 [280] Guo-Jun Qi, Jinhui Tang, Zheng-Jun Zha, Tat-Seng Chua, and Hong-Jiang Zhang. An ef- ficient sparse metric learning in high-dimensional space via l1-penalized log-determinant regularization. In International Conference on Machine Learning, 2009. 2.2.2, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [281] Buyue Qian, Xiang Wang, Nan Cao, Hongfei Li, and Yu-Gang Jiang. A relative simi- larity based method for interactive patient risk prediction. Data Mining and Knowledge Discovery, 29(4):1070–1093, 2015. 6.1 [282] Qi Qian, Juhua Hu, Rong Jin, Jian Pei, and Shenghuo Zhu. Distance metric learning using dropout: a structured regularization approach. In ACM SIGKDD international conference on Knowledge discovery and data mining, 2014. 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [283] Novi Quadrianto and Christoph Lampert. Learning multi-view neighborhood preserving projections. In Proceedings of the 28 th International Conference on Machine Learning, pages 425–432, 2011. 7.3 [284] Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007. 2.4.2 [285] Viresh Ranjan, Nikhil Rasiwasia, and CV Jawahar. Multi-label cross-modal retrieval. In The IEEE International Conference on Computer Vision, 2015. 6.3.3 [286] Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in neural information processing systems, 1999. 3.2 [287] Narges Razavian, Saul Blecker, Ann Marie Schmidt, Aaron Smith-McLallen, Somesh Nigam, and David Sontag. Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data, 2015. 1.3.3 [288] Narges Razavian, Jake Marcus, and David Sontag. Multi-task prediction of disease onsets from longitudinal lab tests. Machine Learning for Healthcare Conference, 2016. 1.3.3 [289] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for

232 multi-label classification. Machine learning, 2011. 6.3.3 [290] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank so- lutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3): 471–501, 2010. 2.1.4, 2.1.4, 2.1.4, 2.1.4 [291] Pau Rodr´ıguez, Jordi Gonzalez,` Guillem Cucurull, Josep M Gonfaus, and Xavier Roca. Regularizing cnns with locally constrained decorrelations. International Conference on Learning Representations, 2017. 2.3.3, 2.5.3, 2.3.3, 2.3.3 [292] Maya Rotmensch, Yoni Halpern, Abdulhakim Tlimat, Steven Horng, and David Sontag. Learning a health knowledge graph from electronic medical records. Scientific Reports, 2017. 7.3 [293] B. Scholkopf,¨ A. Smola, and K. Muller.¨ Kernel principal component analysis. In Interna- tional Conference on Artificial Neural Networks, 1997. 2.4, 2.4.1 [294] Bernhard Scholkopf¨ and Alexander J Smola. Learning with kernels: Support vector ma- chines, regularization, optimization, and beyond. MIT press, 2002. 2.2.4, 2.4, 2.4.2, 2.4.2, 2.4.3, 2.4.3, 2.4.3, 2.4.3 [295] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016. 6.4.3 [296] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. International Conference on Learning Repre- sentations, 2017. 2.1.4, 2.1.4, 2.1.4, 2.3.3, 2.3.3, 2.3.3 [297] Pierre Sermanet, Andrea Frome, and Esteban Real. Attention for fine-grained categoriza- tion. arXiv preprint arXiv:1412.7054, 2014. 6.3.3 [298] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From The- ory to Algorithms. Cambridge University Press, 2014. 12, 4.3.2, 4.3.2 [299] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 2013. 5, 5.1 [300] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. International Conference on Learning Representations, 2015. 6.3.3 [301] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017. 2.1.4, 2.3.3 [302] Joanna E Sheppard, Laura CE Weidner, Saher Zakai, Simon Fountain-Polley, and Judith Williams. Ambiguous abbreviations: an audit of abbreviations in paediatric note keeping. Archives of disease in childhood, 2008. 6.4 [303] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large- scale image recognition. International Conference on Learning Representations, 2015. 2.1.4, 2.4.3 [304] Vikas Sindhwani and Amol Ghoting. Large-scale distributed non-negative sparse coding and sparse dictionary learning. In ACM SIGKDD International Conference on Knowledge

233 Discovery and Data Mining, 2012. 5 [305] Hardeep Singh, Ashley ND Meyer, and Eric J Thomas. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving us adult populations. BMJ Qual Saf, pages bmjqs–2013, 2014. 1.1 [306] Christine Sinsky, Lacey Colligan, Ling Li, Mirela Prgomet, Sam Reynolds, Lindsey Goed- ers, Johanna Westbrook, Michael Tutty, and George Blike. Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties. Annals of internal medicine, 165(11):753–760, 2016. 1.1 [307] Leslie N Smith and Nicholay Topin. Deep convolutional neural network design patterns. IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2.3.3, 2.5.3 [308] Mark Smith, Robert Saunders, Leigh Stuckhardt, J Michael McGinnis, et al. Best care at lower cost. 1.1 [309] Alessandro Sordoni, Philip Bachman, Adam Trischler, and Yoshua Bengio. Iterative al- ternating neural attention for machine reading. arXiv preprint arXiv:1606.02245, 2016. 2.1.4, 2.3.3 [310] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. International Conference on Learn- ing Representations, 2014. 2.3.3, 2.5.3 [311] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, 2005. 2.2.2 [312] Nitish Srivastava and Ruslan R Salakhutdinov. Discriminative transfer learning with tree- based priors. In Advances in Neural Information Processing Systems, pages 2094–2102, 2013. 6.3.3 [313] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Jour- nal of Machine Learning Research, 15(1):1929–1958, 2014. 2.1.4, 2.1.4, 2.2.4, 2.3.3, 2.3.3, 6.1.1, 6.4.2 [314] Rupesh Kumar Srivastava, Klaus Greff, and Jurgen¨ Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015. 2.3.3, 2.5.3 [315] Jimeng Sun, Fei Wang, Jianying Hu, and Shahram Edabollahi. Supervised patient simi- larity measure of heterogeneous patient records. ACM SIGKDD Explorations Newsletter, 2012. 1.3.3, 6.1 [316] Liang Sun, Shuiwang Ji, and Jieping Ye. Hypergraph spectral learning for multi-label classification. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008. 6.3.3 [317] Ganesh Sundaramoorthi and Byung-Woo Hong. Fast label: Easy and efficient solution of joint multi-label and estimation problems. In IEEE Conference on Computer Vision and Pattern Recognition, 2014. 6.3.3 [318] Sayantan Sur, Uday Kumar Reddy Bondhugula, Amith Mamidala, H-W Jin, and Dha- baleswar K Panda. High performance RDMA based all-to-all broadcast for InfiniBand

234 clusters. In International Conference on High-Performance Computing, 2005. 5.2.4 [319] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Jen-Hao Rick Chang, et al. Going deeper with . In The IEEE Conference on Computer Vision and Pattern Recognition, 2014. 5.3.2 [320] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic repre- sentations from tree-structured long short-term memory networks. Annual Meeting of the Association for Computational Linguistics, 2015. 6.3, 6.3.1, 6.3.1, 6.4.1, 6.4.3 [321] Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, R Todd Hurst, Christopher B Kendall, Michael B Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging, 35(5):1299– 1312, 2016. 6.3.3 [322] Mingkui Tan, Qinfeng Shi, Anton van den Hengel, Chunhua Shen, Junbin Gao, Fuyuan Hu, and Zhen Zhang. Learning graph structure for multi-label image classification via clique generation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015. 6.3.3 [323] Buzhou Tang, Hongxin Cao, Yonghui Wu, Min Jiang, and Hua Xu. Recognizing clinical entities in hospital discharge summaries using structural support vector machines with word representation features. BMC medical informatics and decision making, 2013. 1.3.3 [324] Buzhou Tang, Yonghui Wu, Min Jiang, Yukun Chen, Joshua C Denny, and Hua Xu. A hybrid system for temporal information extraction from clinical text. Journal of the Amer- ican Medical Informatics Association, 2013. 1.3.3 [325] Buzhou Tang, Hongxin Cao, Xiaolong Wang, Qingcai Chen, and Hua Xu. Evaluating word representation features in biomedical named entity recognition tasks. BioMed re- search international, 2014. 1.3.3 [326] Hao Tang, Weiran Wang, Kevin Gimpel, and Karen Livescu. Discriminative segmental cascades for feature-rich phone recognition. In IEEE Automatic Speech Recognition and Understanding Workshop, 2015. 2.3.3 [327] Yee W Teh and Dilan Gorur. Indian buffet processes with power-law behavior. In Ad- vances in neural information processing systems, 2009. 3.2.4, 3.2.4 [328] Yee Whye Teh and Zoubin Ghahramani. Stick-breaking construction for the Indian buffet process. 2007. 1.2.1 [329] Yee Whye Teh and Zoubin Ghahramani. Stick-breaking construction for the Indian buffet process. 2007. 3.2, 3.2.2, 3.2.3, 3.2.4 [330] Zhiyang Teng and Yue Zhang. Bidirectional tree-structured lstm with head lexicalization. arXiv preprint arXiv:1611.06788, 2016. 6.4.3 [331] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996. 2.4.4, 2.5, 2.5, 2.5.3, 2.5.3, 2.5.3, 2.5.3 [332]L aszl´ o´ Toth.´ Phone recognition with deep sparse rectifier neural networks. In IEEE

235 International Conference on Acoustics, Speech and Signal Processing, 2013. 2.3.3 [333]L aszl´ o´ Toth.´ Combining time-and frequency-domain convolution in convolutional neural network-based phone recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2014. 2.3.3 [334] Adam Trischler, Zheng Ye, Xingdi Yuan, and Kaheer Suleman. Natural language com- prehension with the epireader. Conference on Empirical Methods in Natural Language Processing, 2016. 2.1.4, 2.3.3 [335] Ivor W Tsang, James T Kwok, C Bay, and H Kong. Distance metric learning with kernels. In International Conference on Artificial Neural Networks, 2003. 1.2.1, 2.4, 2.4.1, 2.4.1, 2.4.3, 2.4.3 [336] Koji Tsuda, Gunnar Ratsch,¨ and Manfred K Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Re- search, 2005. 2.2.1 [337] Celine Vens, Jan Struyf, Leander Schietgat, Sasoˇ Dzeroski,ˇ and Hendrik Blockeel. Deci- sion trees for hierarchical multi-label classification. Machine Learning, 2008. 6.3.2, 6.3.2, 6.3.3 [338] Nakul Verma and Kristin Branson. Sample complexity of learning mahalanobis distance metrics. In Advances in Neural Information Processing Systems, pages 2584–2592, 2015. 4.2.2, 4.2.3, 4.2.3 [339] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016. 2.1.2 [340] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends R in Machine Learning, 2008. 1.2.1, 3, 3.1, 3.1.2 [341] Chong Wang and David M Blei. Variational inference in nonconjugate models. Journal of Machine Learning Research, 2013. 3.1.1 [342] Daixin Wang, Peng Cui, Mingdong Ou, and Wenwu Zhu. Deep multimodal hashing with orthogonal regularization. In International Joint Conference on Artificial Intelligence, 2015. 2.2, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [343] Fei Wang, Jimeng Sun, and Shahram Ebadollahi. Integrating distance metrics learned from multiple experts and its application in patient similarity assessment. In SIAM data mining conference, 2011. 1.3.3, 6.1 [344] Fei Wang, Jianying Hu, and Jimeng Sun. Medical prognosis based on patient similarity and expert feedback. In International Conference on Pattern Recognition, 2012. 1.3.3, 6.1 [345] J. Wang, H. Do, A. Woznica, and A. Kalousis. Metric learning with multiple kernels. In Neural Information Processing Systems, 2011. 2.4.3, 2.4.3 [346] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. CNN- RNN: A unified framework for multi-label image classification. In IEEE Conference on

236 Computer Vision and Pattern Recognition, 2016. 6.3.2, 6.3.2, 6.3.3 [347] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for large-scale search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012. 2.2, 2.2.4, 6.1.1 [348] Xiang Wang, David Sontag, and Fei Wang. Unsupervised learning of disease progression models. In ACM SIGKDD international conference on Knowledge discovery and data mining, 2014. 1.3.3 [349] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly- supervised classification and localization of common thorax diseases. Computer Vision and Pattern Recognition, 2017. 6.3.3 [350] Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen, Bradley A Malin, and Jimeng Sun. Rubik: Knowledge guided tensor factorization and completion for health data analytics. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015. 1.1, 1.3.3 [351] Zengmao Wang, Bo Du, Lefei Zhang, Liangpei Zhang, Meng Fang, and Dacheng Tao. Multi-label active learning based on maximum correntropy criterion: Towards robust and discriminative labeling. In European Conference on Computer Vision, 2016. 6.3.3 [352] Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013. 3, 3.1.1 [353] Pijika Watcharapichat, Victoria Lopez Morales, Raul Castro Fernandez, and Peter Piet- zuch. Ako: Decentralised deep learning with partial gradient exchange. In ACM Sympo- sium on Cloud Computing, 2016. 1.3.2, 5.2.5 [354] Steve Waterhouse, David MacKay, Tony Robinson, et al. Bayesian methods for mixtures of experts. Advances in Neural Information Processing Systems, 1996. 1.2.1, 3.1, 3.1.4, 3.1.4 [355] Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R Ganger, Phillip B Gibbons, Garth A Gibson, and Eric P Xing. Managed communication and con- sistency for fast data-parallel iterative analytics. In ACM Symposium on Cloud Computing, 2015. 1.2.2, 1.2.3, 1.3.2, 5, 5.2, 5.2.1, 5.2.1, 5.2.5, 5.2.5, 5.3, 6.2, 6.2 [356] Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. CNN: Single-label to multi-label. arXiv preprint arXiv:1406.5726, 2014. 6.3.3 [357] Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, 2005. 2.1.2, 2.2.4, 2.5.1, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [358] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P Perona. Caltech- ucsd birds. 2010. 2.1.4, 2.2.4, 2.4.3 [359] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality con- straints. Mathematical Programming, 2013. 2.2.4 [360] Darren Wilkinson. Metropolis hastings mcmc when the proposal and target have differing

237 support. In https://darrenjw.wordpress.com/2012/06/04/metropolis-hastings-mcmc-when- the-proposal-and-target-have-differing-support/, 2015. 3.1.6 [361] Frank Wood and Thomas L Griffiths. Particle filtering for nonparametric bayesian matrix factorization. Advances in Neural Information Processing Systems, 19:1513, 2007. 3.2.4 [362] Baoyuan Wu, Siwei Lyu, and Bernard Ghanem. ML-MG: multi-label learning with miss- ing labels using a mixed graph. In The IEEE International Conference on Computer Vision, 2015. 6.3.3 [363] Wei Wu, Eugene Bleecker, Wendy Moore, William W Busse, Mario Castro, Kian Fan Chung, William J Calhoun, Serpil Erzurum, Benjamin Gaston, Elliot Israel, et al. Un- supervised phenotyping of severe asthma research program participants using expanded lung data. Journal of Allergy and Clinical Immunology, 133(5):1280–1288, 2014. 1.1 [364] Jianxiong Xiao, James Hays, Krista Ehinger, Aude Oliva, Antonio Torralba, et al. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Com- puter Vision and Pattern Recognition, 2010. 3.1.5 [365] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. International Conference on Artificial Intelligence and Statistics, 2017. 1.3.1, 1.3.1 [366] Fangzheng Xie and Yanxun Xu. Bayesian repulsive gaussian mixture model. arXiv preprint arXiv:1703.09061, 2017. 3.2 [367] Pengtao Xie. Learning compact and effective distance metrics with diversity regulariza- tion. In European Conference on Machine Learning, 2015. 1.1, 2.1.2, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.2, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2 [368] Pengtao Xie and Eric Xing. A constituent-centric neural architecture for reading compre- hension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1405–1414, 2017. 6.3, 6.3.1, 6.4.3 [369] Pengtao Xie, Yuntian Deng, and Eric Xing. Diversifying restricted boltzmann machine for document modeling. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1315–1324. ACM, 2015. 1.1, 1.3.1, 1.3.1, 2.1, 2.1.1, 2.1.4, 2.1.4, 2.2.4, 2.2.4, 2.3.3, 2.4.3, 2.5.3, 2.3.3, 2.3.3, 2.3.3, 2.3.3, 2.4.3, 2.4.3, 2.4.3, 2.5.3, 3.1, 3.1.1, 3.1.3, 3.1.3, 3.1.5, 3.1.5, 3.1.5, 3.1.6, 6.1.1, 6.1.2 [370] Pengtao Xie, Jin Kyu Kim, Yi Zhou, Qirong Ho, Abhimanu Kumar, Yaoliang Yu, and Eric Xing. Distributed machine learning via sufficient factor broadcasting. In Conference on Uncertainty in artificial intelligence, 2016. 1.1, 1.2.2, 1.3.2, 5 [371] Pengtao Xie, Jun Zhu, and Eric Xing. Diversity-promoting Bayesian learning of latent variable models. In International Conference on Machine Learning, 2016. 1.1, 1.1, 3, 3.1 [372] Pengtao Xie, Yuntian Deng, Yi Zhou, Abhimanu Kumar, Yaoliang Yu, James Zou, and Eric P. Xing. Learning latent space models with angular constraints. In Proceedings of the 34th International Conference on Machine Learning, pages 3799–3810, 2017. 1.1, 1.2.1, 1.2.1, 2, 2.3, 4.2.1 [373] Pengtao Xie, Barnabas Poczos, and Eric P Xing. Near-orthogonality regularization in kernel methods. In Conference on Uncertainty in Artificial Intelligence, 2017. 1.1, 2

238 [374] Pengtao Xie, Barnabas Poczos, and Eric P Xing. Near-orthogonality regularization in kernel methods. 2017. 1.2.1, 1.2.1, 2.4 [375] Pengtao Xie, Ruslan Salakhutdinov, Luntian Mou, and Eric P. Xing. Deep determinantal point process for large-scale multi-label classification. In The IEEE International Confer- ence on Computer Vision, 2017. 6.3.2, 6.3.2 [376] Pengtao Xie, Aarti Singh, and Eric P. Xing. Uncorrelation and evenness: a new diversity- promoting regularizer. In Proceedings of the 34th International Conference on Machine Learning, pages 3811–3820, 2017. 1.1, 1.1, 1.1, 1.2.1, 2, 2.1 [377] Pengtao Xie, Jin Kyu Kim, Qirong Ho, Yaoliang Yu, and Eric P. Xing. Orpheus: Efficient distributed machine learning via system and algorithm co-design. Symposium of Cloud Computing, 2018. 1.1, 1.2.2, 1.2.3, 5 [378] Pengtao Xie, Haoran Shi, Ming Zhang, and Eric P. Xing. A neural architecture for au- tomated icd coding. The 56th Annual Meeting of the Association for Computational Lin- guistics, 2018. 1.1, 1.2.3 [379] Pengtao Xie, Wei Wu, Yichen Zhu, and Eric P. Xing. Orthogonality-promoting distance metric learning: Convex relaxation and theoretical analysis. International Conference on Machine Learning, 2018. 1.1, 1.2.1, 1.2.1, 1.2.3, 2, 2.2, 2.2.1, 4.1, 4.2.2, 4.2.3 [380] Pengtao Xie, Hongbao Zhang, and Eric P. Xing. Nonoverlap-promoting variable selection. International Conference on Machine Learning, 2018. 1.1, 1.2.1, 1.2.1, 2, 2.5 [381] Pengtao Xie, Jun Zhu, and Eric P. Xing. Diversity-promoting bayesian learning of latent variable models. Journal of Machine Learning Research, 2018. 1.1, 1.2.1, 3 [382] Saining Xie, Ross Girshick, Piotr Dollar,´ Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2.3.3, 2.5.3 [383] Eric P Xing, Michael I Jordan, Stuart Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems, 2002. 1.2.1, 2.1, 2.1.2, 2.1.2, 2.2, 2.2.3, 2.5.1, 5, 5.1, 6.1 [384] Eric P Xing, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015. 1.1 [385] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudi- nov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption gen- eration with visual attention. In International Conference on Machine Learning, 2015. 6.3.2, 6.3.2, 6.3.3, 6.4.3 [386] Yanxun Xu, Peter Muller,¨ and Donatello Telesca. Bayesian inference for latent biologic structure with determinantal point processes (dpp). Biometrics, 72(3):955–964, 2016. 3.2 [387] Xiangyang Xue, Wei Zhang, Jie Zhang, Bin Wu, Jianping Fan, and Yao Lu. Correlative multi-label multi-instance image annotation. In The IEEE International Conference on Computer Vision, 2011. 6.3.3

[388] Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Deep pyramidal residual networks with separated stochastic depth. arXiv preprint arXiv:1612.01230, 2016. 2.3.3, 2.5.3
[389] Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2740–2748, 2015. 6.3.2, 6.3.2, 6.3.3, 6.4.2
[390] Hao Yang, Joey Tianyi Zhou, and Jianfei Cai. Improving multi-label learning with missing labels by structured semantic correlations. In European Conference on Computer Vision, 2016. 6.3.3
[391] J Yang, Anton Ragni, Mark JF Gales, and Kate M Knill. Log-linear system combination using structured support vector machines. In Interspeech, 2016. 2.3.3
[392] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009. 2.3.3, 2.4.3
[393] Yuan Yang, Pengtao Xie, Xin Gao, Carol Cheng, Christy Li, Zhang Hongbao, and Eric P. Xing. Predicting discharge medications at admission time based on deep learning. arXiv preprint arXiv:1711.01386, 2017. 1.1
[394] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016. 6.3.2, 6.3.3
[395] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016. 6.4.3
[396] Li Yao, Eric Poblenz, Dmitry Dagunts, Ben Covington, Devon Bernard, and Kevin Lyman. Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501, 2017. 6.3.2, 6.3.2, 6.3.3
[397] Chih-Kuan Yeh, Wei-Chieh Wu, Wei-Jen Ko, and Yu-Chiang Frank Wang. Learning deep latent spaces for multi-label classification. 2017. 6.3.3
[398] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. Journal of Machine Learning Research, 2012. 2.4.3, 2.4.3
[399] Yiming Ying, Kaizhu Huang, and Colin Campbell. Sparse metric learning via smooth optimization. In Advances in Neural Information Processing Systems, 2009. 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 6.1.1, 6.1.2
[400] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 6.3.2, 6.3.3
[401] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017. 6.4.3

[402] Yang Yu, Yu-Feng Li, and Zhi-Hua Zhou. Diversity regularized machine. In International Joint Conference on Artificial Intelligence, 2011. 1.3.1, 1.3.1, 2.1, 2.1.1, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.3.3, 2.5.3, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.3.3, 2.3.3, 2.3.3, 2.3.3, 2.5.3, 6.1.1, 6.1.2
[403] Yao-Liang Yu and Eric P Xing. Exact algorithms for isotonic regression and related. In Journal of Physics: Conference Series, volume 699, page 012016. IOP Publishing, 2016. 6.4.1
[404] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006. 2.5, 2.5
[405] Pourya Habib Zadeh, Reshad Hosseini, and Suvrit Sra. Geometric mean metric learning. 2016. 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.4.3, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.4.3, 6.1.1, 6.1.2
[406] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. British Machine Vision Conference, 2016. 2.3.3, 2.5.3, 2.3.3
[407] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation, 2012. 1.3.2, 5, 5.2, 5.2.2
[408] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014. 2.5.3
[409] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012. 2.1.4, 2.3.3
[410] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014. 6.3.3
[411] Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao Xie, and Eric P Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. USENIX Annual Technical Conference, 2017. 1.1, 1.2.2, 1.3.2, 5.3, 5.3.1
[412] Min-Ling Zhang and Kun Zhang. Multi-label learning by exploiting label dependency. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010. 6.3.3
[413] Ping Zhang, Fei Wang, Jianying Hu, and Robert Sorrentino. Towards personalized medicine: leveraging patient similarity and drug similarity analytics. AMIA Joint Summits on Translational Science, 2014. 6.1
[414] Bin Zhao, Fei Li, and Eric P Xing. Large-scale category structure aware image categorization. In Advances in Neural Information Processing Systems, pages 1251–1259, 2011. 6.3.3
[415] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2015. 6.3.3

[416] Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang. Deep region and multi-label learning for facial action unit detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. 6.3.3
[417] Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006. 2.5
[418] Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497, 2009. 2.5
[419] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015. 6.3.2, 6.3.2
[420] Dengyong Zhou, Lin Xiao, and Mingrui Wu. Hierarchical classification via orthogonal transfer. International Conference on Machine Learning, 2011. 1.3.1, 2.2.4, 6.1.1
[421] Jiayu Zhou, Jun Liu, Vaibhav A Narayan, and Jieping Ye. Modeling disease progression via fused sparse group lasso. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012. 1.3.3
[422] Jiayu Zhou, Jimeng Sun, Yashu Liu, Jianying Hu, and Jieping Ye. Patient risk prediction model via top-k stability selection. In SIAM Conference on Data Mining, 2013. 1.3.3
[423] Jun Zhu, Ning Chen, and Eric P Xing. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In International Conference on Machine Learning, 2011. 3.1.5, 3.1.5
[424] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research, 2014. 3.1, 3.1.3
[425] Xinqi Zhu and Michael Bain. B-CNN: Branch convolutional neural network for hierarchical classification. arXiv preprint arXiv:1709.09890, 2017. 6.3.2, 6.3.2, 6.3.3, 6.4.2
[426] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. International Conference on Machine Learning, 2016. 2.5.3
[427] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. International Conference on Learning Representations, 2017. 2.5.3
[428] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. 2.5, 2.5, 2.5.3
[429] James Y Zou and Ryan P Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems, 2012. 1.3.1, 1.3.1, 2.1, 2.1.1, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.1.4, 2.2.4, 2.3.3, 2.4.3, 2.2.4, 2.2.4, 2.2.4, 2.2.4, 2.4.3, 2.4.3, 2.4.3, 3.1.5, 3.1.5, 3.1.5, 6.1.1, 6.1.2
[430] Alon Zweig and Daphna Weinshall. Exploiting object hierarchy: Combining models from different category levels. In The IEEE International Conference on Computer Vision, pages 1–8. IEEE, 2007. 6.3.3

[431] Matt Zwolenski, Lee Weatherill, et al. The digital universe: Rich data and the increasing value of the internet of things. Australian Journal of Telecommunications and the Digital Economy, 2(3):47, 2014. 1.1
