Diversity-Promoting and Large-Scale Machine Learning for Healthcare

Total Pages: 16

File Type: PDF, Size: 1020 KB

Diversity-Promoting and Large-Scale Machine Learning for Healthcare

Pengtao Xie
CMU-ML-18-106
August 2018

Machine Learning Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA

Thesis Committee:
Eric P. Xing, Chair
Ruslan Salakhutdinov
Pradeep Ravikumar
Ryan Adams (Princeton & Google Brain)
David Sontag (MIT)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2018 Pengtao Xie

This research was sponsored by: Office of Naval Research award number N000140910758; Air Force Office of Scientific Research award FA95501010247; National Science Foundation awards IIS1115313, IIS1111142, IIS1218282, and IIS1447676; Defense Advanced Research Projects Agency award FA872105C0003; Department of the Air Force award FA870215D0002; National Institutes of Health award P30DA035778; Pennsylvania Department of Health's Big Data for Better Health (BD4BH) award; and a grant from the University of Pittsburgh Medical Center.

Keywords: Diversity-promoting Learning, Large-scale Distributed Learning, Machine Learning for Healthcare, Regularization, Bayesian Priors, Generalization Error Analysis, System and Algorithm Co-design

Dedicated to my parents for their endless love, support, and encouragement

Abstract

In healthcare, a tsunami of medical data has emerged, including electronic health records, images, literature, etc. These data are heterogeneous and noisy, which renders clinical decision-making time-consuming, error-prone, and suboptimal. In this thesis, we develop machine learning (ML) models and systems for distilling high-value patterns from unstructured clinical data and making informed and real-time medical predictions and recommendations, to aid physicians in improving the efficiency of workflow and the quality of patient care.
When developing these models, we encounter several challenges: (1) How to better capture infrequent clinical patterns, such as rare subtypes of diseases? (2) How to make the models generalize well on unseen patients? (3) How to promote the interpretability of the decisions? (4) How to improve the timeliness of decision-making without sacrificing its quality? (5) How to efficiently discover massive clinical patterns from large-scale data?

To address challenges (1-4), we systematically study diversity-promoting learning, which encourages the components in ML models (1) to diversely spread out to give infrequent patterns a broader coverage, (2) to be imposed with structured constraints for better generalization performance, (3) to be mutually complementary for more compact representation of information, and (4) to be less redundant for better interpretability. The study is performed in both frequentist statistics and Bayesian statistics. In the former, we develop diversity-promoting regularizers that are empirically effective, theoretically analyzable, and computationally efficient, and propose a rich set of optimization algorithms to solve the regularized problems. In the latter, we propose Bayesian priors that can effectively entail an inductive bias of "diversity" among a finite or infinite number of components and develop efficient posterior inference algorithms. We provide theoretical analysis on why promoting diversity can better capture infrequent patterns and improve generalization. The developed regularizers and priors are demonstrated to be effective in a wide range of ML models.

To address challenge (5), we study large-scale learning. Specifically, we design efficient distributed ML systems by exploiting a system-algorithm co-design approach. Inspired by a sufficient factor property of many ML models, we design a peer-to-peer system, Orpheus, that significantly reduces communication and fault-tolerance costs.
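As an illustration of the general idea behind the sufficient factor property (a minimal sketch, not the exact Orpheus protocol described later in the thesis): for many models, the gradient of a weight matrix is a rank-one outer product of two vectors, so workers can exchange the two small factors instead of the full matrix and let each receiver reconstruct the update locally. The variable names and sizes below are illustrative assumptions.

```python
import numpy as np

def full_update(u, v):
    # Naive approach: materialize the full J x D gradient matrix and transmit
    # all J * D values over the network.
    return np.outer(u, v)

def sufficient_factor_update(u, v):
    # Sufficient factor approach: transmit only the two factors (J + D values);
    # each receiver reconstructs the rank-one update via an outer product.
    return u, v

# Toy example: a 1000 x 500 parameter matrix.
J, D = 1000, 500
rng = np.random.default_rng(0)
u, v = rng.standard_normal(J), rng.standard_normal(D)

sent_full = J * D         # values transmitted by the naive scheme
sent_factors = J + D      # values transmitted by the sufficient factor scheme

# The receiver recovers exactly the same update from the two factors.
W_full = full_update(u, v)
u_recv, v_recv = sufficient_factor_update(u, v)
W_reconstructed = np.outer(u_recv, v_recv)
assert np.allclose(W_full, W_reconstructed)
print(f"communication reduced by {sent_full / sent_factors:.0f}x")
```

For this toy size, the factor representation is over two orders of magnitude smaller than the dense matrix, which is the property the system exploits to cut communication cost.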
We also provide theoretical analysis showing that algorithms executed on Orpheus are guaranteed to converge. The efficiency of our system is demonstrated in several large-scale applications.

We apply the proposed diversity-promoting learning (DPL) techniques and the distributed ML system to solve healthcare problems. In a similar-patient retrieval application, DPL shows great effectiveness in improving retrieval performance on infrequent diseases, enabling fast and accurate retrieval, and reducing overfitting. In a medical-topic discovery task, our Orpheus system is able to extract tens of thousands of topics from millions of documents in a few hours. Besides these two applications, we also design effective ML models for hierarchical multi-label tagging of medical images and automated ICD coding.

Acknowledgments

First and foremost, I would like to thank my PhD advisor Professor Eric Xing for his great advice, encouragement, support, and help throughout my journey at CMU. Eric's exceptional insights and broad vision have played an instrumental role in helping me to pick impactful problems. He has consistently put in his time and energy in guiding me through every challenge in research, from designing models to diagnosing errors in experiments, from writing papers to giving presentations. I am also grateful to him for allowing me the freedom to do research in a diverse range of fields that I am passionate about. His enthusiasm for research, his passion for life, and his work ethic have been a great source of inspiration and will continue to shape me in the future.

I would like to express my deepest gratitude to my other thesis committee members: Professors Ruslan Salakhutdinov, Pradeep Ravikumar, Ryan Adams, and David Sontag, for their invaluable feedback that has helped to improve many parts of this thesis. The insightful discussions with Russ and Pradeep inspired several ideas in this thesis.
Ryan's early work on priors for diversity in generative latent variable models is the motivation for systematically studying diversity-promoting learning in this thesis. David's sharp insights and invaluable comments steered me in the right direction in the study of ML for healthcare.

I have been very fortunate to collaborate with many brilliant friends: Qirong Ho, Ruslan Salakhutdinov, Wei Wu, Barnabas Poczos, Aarti Singh, James Zou, Yaoliang Yu, Jun Zhu, Jianxin Li, Jin Kyu Kim, Hongbao Zhang, Baoyu Jing, Devendra Sachan, Yuntian Deng, Yi Zhou, Haoran Shi, Yichen Zhu, Hao Zhang, Shizhen Xu, Haoyi Zhou, Shuai Zhang, Yuan Xie, Yulong Pei, among many others. It has always been a great pleasure and a valuable learning experience to brainstorm ideas, design models and algorithms, debug code, and write papers with these talented people, and I am proud of the work we have done together over the last five years. I am also very thankful to the Sailing Lab and the Machine Learning Department for fostering a friendly and wonderful environment for research and life. I would like to thank Diane Stidle, Michelle Martin, Mallory Deptola, and the entire MLD staff for their warm help and kind support. I deeply appreciate many CMU faculty members for their wonderful courses and talks.

Lastly and most importantly, I am extremely grateful to my family. I could not have done this without the constant encouragement, unwavering support, and endless love from my parents.

Contents

1 Introduction
  1.1 Thesis Introduction and Scope
  1.2 Contributions
    1.2.1 Diversity-promoting Learning
    1.2.2 Large-scale Distributed Learning
    1.2.3 ML for Healthcare
  1.3 Related Works
    1.3.1 Diversity-promoting Learning
    1.3.2 Distributed Learning
    1.3.3 ML for Healthcare

2 Diversity-promoting Learning I – Regularization
  2.1 Uncorrelation and Evenness: A Diversity-promoting Regularizer
    2.1.1 Uniform Eigenvalue Regularizer
    2.1.2 Case Studies
    2.1.3 A Projected Gradient Descent Algorithm
    2.1.4 Evaluation
  2.2 Convex Diversity-promoting Regularizers
    2.2.1 Nonconvex Bregman Matrix Divergence Regularizers
    2.2.2 Convex Bregman Matrix Divergence Regularizers
    2.2.3 A Proximal Gradient Descent Algorithm
    2.2.4 Evaluation
  2.3 Angular Constraints for Improving Generalization Performance
    2.3.1 Angular Constraints
    2.3.2 An ADMM-based Algorithm
    2.3.3 Evaluation
    2.3.4 Appendix: Proofs and Details of Algorithms
  2.4 Diversity in the RKHS: Orthogonality-promoting Regularization of Kernel Methods
    2.4.1 Bregman Matrix Divergence Regularized Kernel Methods
    2.4.2 A Functional Gradient Descent Algorithm
    2.4.3 Evaluation
    2.4.4 Appendix: Details of Algorithms
  2.5 Diversity and Sparsity: Nonoverlapness-promoting Regularization
    2.5.1 Nonoverlap-promoting Regularization
    2.5.2 A Coordinate Descent Algorithm
    2.5.3 Evaluation
    2.5.4 Appendix: Details of Algorithms

3 Diversity-promoting Learning II – Bayesian Inference
  3.1 Diversity-promoting Learning of Bayesian Parametric Models
    3.1.1 Mutual Angular Process
    3.1.2 Approximate Inference Algorithms
    3.1.3 Diversity-promoting Posterior Regularization
    3.1.4 Case Study: Bayesian Mixture of Experts Model
    3.1.5 Evaluation
    3.1.6 Appendix: Details of Algorithms
  3.2 Diversity-promoting Learning of Bayesian Nonparametric Models
    3.2.1 Infinite Mutual Angular Process
    3.2.2 Case Study: Infinite Latent Feature Model
    3.2.3 A Sampling-based Inference Algorithm
    3.2.4 Evaluation

4 Diversity-promoting Learning III – Analysis
  4.1 Analysis of Better Capturing of Infrequent Patterns
  4.2 Analysis of Generalization Errors
    4.2.1 Generalization Error Analysis for the Angular Constraints
    4.2.2 Estimation Error Analysis for the Nonconvex Bregman Matrix Divergence Regularizers
    4.2.3 Estimation Error Analysis for the Convex Bregman Matrix Divergence Regularizers
  4.3 Appendix: Proofs
    4.3.1 Proof of Theorem 1
    4.3.2 Proof of Theorem 2
    4.3.3 Proof of Theorem 3
    4.3.4 Proof of Theorem 4
    4.3.5 Proof of Theorem 5

5 Large-scale Learning via System and Algorithm Co-design
  5.1 Sufficient Factor Property
  5.2 Orpheus: A Light-weight Peer-to-peer System
    5.2.1 Communication: Sufficient Factor Broadcasting (SFB)