WEAK SUPERVISION FROM HIGH-LEVEL ABSTRACTIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Braden Jay Hancock August 2019

© 2019 by Braden Jay Hancock. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/ns523jd4552

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Chris Ré, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Dan Jurafsky

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Percy Liang

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

The interfaces for interacting with models are changing. Consider, for example, that while computers run on 1s and 0s, that is no longer the level of abstraction we use to program most computers. Instead, we use higher-level abstractions such as assembly language, high-level languages, or declarative languages to more efficiently convert our objectives into code. Similarly, most machine learning models are trained with "1s and 0s" (individually labeled examples), but we need not limit ourselves to interacting with them at this low level. Instead, we can use higher-level abstractions to more efficiently convert our domain knowledge into the inputs our models require. In this work, we show that weak supervision from high-level abstractions can be used to train high-performance machine learning models. At each of three different levels of abstraction, we describe a system we built to enable such interaction. We begin with Snorkel, which elevates label generation from a manual process to a programmatic one. With this system, domain experts encode their knowledge in potentially noisy and correlated black-box functions called labeling functions. These functions can then be automatically denoised and applied to unlabeled data to create large training sets quickly. Next, with Fonduer we enable an abstraction one step higher where advanced primitives defined over multiple modalities (visual, textual, structural, and tabular) allow users to programmatically supervise over richly formatted data (e.g., PDFs with tables and formatting). Finally, in BabbleLabble we show that we can even utilize supervision given in the form of natural language explanations, maintaining the benefits of programmatic supervision while removing the burden of writing code. For all of these systems, we demonstrate their effectiveness with empirical results and present real-world use cases where they have enabled rapid development of machine learning applications, including in bio-medicine, commerce, and defense.

Acknowledgments

My Ph.D. experience was rich and rewarding, and for that I owe a great debt to a great many people.

I am deeply grateful to my advisor, Chris Ré. In an advisor I hoped for a master in identifying problems where progress would translate into real-world impact. I absolutely found this in Chris, but much more. From him I internalized valuable lessons such as: focus on process, not products; a paper is a receipt for good work, not the good work itself; and the reward for hard work is always more hard work. No one works harder than Chris, and it was an honor to be a part of his lab during my tenure at Stanford.

I am grateful to Percy Liang and Dan Jurafsky, whose classes and insights during my first year at Stanford gave me a love for natural language processing and appreciation for education at the highest level.

I owe a great deal to my fellow students and labmates in the Hazy Research group. It takes a village to raise a researcher, and innumerable paper swaps, whiteboard discussions, and late night hackathons together have made me the researcher I am today. In particular, I would like to acknowledge my fellow Ph.D. students Alex Ratner and Paroma Varma, who were with me through it all.

I have had outstanding mentors at every stage of my education who took chances on me when they certainly did not have to: John Clark at AFRL before I knew a thing about programming, Christopher Mattson at BYU before I knew a thing about research, Mark Dredze at Johns Hopkins before I knew a thing about NLP, Vijay Gadepally at MIT Lincoln Laboratory before I knew a thing about machine learning, Hongrae Lee at before I knew a thing about production environments, and Antoine Bordes and Jason Weston at Facebook before I knew a thing about dialogue.

My research would not have been possible without the financial support of my funders: the NSF Graduate Research Fellowship and Stanford Finch Family Fellowship especially, but also the DOE, NIH, ONR, DARPA, member companies of Stanford DAWN, and many other organizations supporting the Hazy Research group.

Finally and most of all, I am grateful for my wife Lauren and daughters Annie and Pippa. When we moved to Stanford with a 10-day-old baby, I had hope but no assurance that I would be able to keep up with the rigorous demands of a top-tier doctoral program while simultaneously learning how to raise a family. I could not have anticipated then how many times more beautiful and rich my experience would be because it was shared with them.

Contents

Abstract iv

Acknowledgments v

1 Introduction 1

2 Weak Supervision from Code 4
2.1 Introduction ...... 4
2.2 Snorkel Architecture ...... 8
2.2.1 A Language for Weak Supervision ...... 10
2.2.2 Generative Model ...... 14
2.2.3 Discriminative Model ...... 15
2.3 Weak Supervision Tradeoffs ...... 15
2.3.1 Modeling Accuracies ...... 16
2.3.2 Modeling Structure ...... 20
2.4 Evaluation ...... 23
2.4.1 Applications ...... 24
2.4.2 User Study ...... 31
2.5 Extensions & Next Steps ...... 34
2.5.1 Extensions for Real-World Deployments ...... 34
2.5.2 Multi-Task Weak Supervision ...... 34
2.5.3 Future Directions ...... 35
2.6 Related Work ...... 35
2.7 Conclusion ...... 36

3 Weak Supervision from Primitives 38
3.1 Introduction ...... 38
3.2 Background ...... 42
3.2.1 Knowledge Base Construction ...... 42
3.2.2 Recurrent Neural Networks ...... 43
3.3 The Fonduer Framework ...... 45
3.3.1 Fonduer’s Data Model ...... 45
3.3.2 User Inputs and Fonduer’s Pipeline ...... 46
3.3.3 Fonduer’s Programming Model for KBC ...... 49
3.4 KBC in Fonduer ...... 50
3.4.1 Candidate Generation ...... 50
3.4.2 Multimodal LSTM Model ...... 51
3.4.3 Multimodal Supervision ...... 54
3.5 Experiments ...... 54
3.5.1 Experimental Settings ...... 54
3.5.2 Experimental Results ...... 56
3.5.3 Ablation Studies ...... 58
3.6 User Study ...... 61
3.7 Extensions ...... 63
3.8 Related Work ...... 64
3.9 Conclusion ...... 64

4 Weak Supervision from Natural Language 65
4.1 Introduction ...... 65
4.2 The BabbleLabble Framework ...... 67
4.2.1 Explanations ...... 68
4.2.2 Semantic Parser ...... 68
4.2.3 Filter Bank ...... 69
4.2.4 Label Aggregator ...... 70
4.2.5 Discriminative Model ...... 71
4.3 Experimental Setup ...... 71
4.3.1 Datasets ...... 72
4.3.2 Experimental Settings ...... 73
4.4 Experimental Results ...... 73
4.4.1 High Bandwidth Supervision ...... 73
4.4.2 Utility of Incorrect Parses ...... 74
4.4.3 Using LFs as Functions or Features ...... 75
4.5 Related Work and Discussion ...... 76
4.6 Extensions ...... 77

5 Discussion & Conclusion 78
5.1 Advantages of Programmatic Supervision ...... 78
5.2 Limitations ...... 79
5.3 The Supervision Stack ...... 80
5.4 Conclusion ...... 81

A Snorkel Appendix 82
A.1 Additional Material for Sec. 3.1 ...... 82
A.1.1 Minor Notes ...... 82
A.1.2 Proof of Proposition 1 ...... 82
A.1.3 Proof of Proposition 2 ...... 84
A.1.4 Proof of Proposition 3 ...... 85

B Fonduer Appendix 88
B.1 Data Programming ...... 88
B.1.1 Components of Data Programming ...... 88
B.1.2 Theoretical Guarantees ...... 89
B.2 Extended Feature Library ...... 89
B.3 Fonduer at Scale ...... 91
B.3.1 Data Caching ...... 91
B.3.2 Data Representations ...... 92
B.4 Future Work ...... 93
B.5 GwasKB Web Interface ...... 95

C BabbleLabble Appendix 97
C.1 Predicate Examples ...... 97
C.2 Sample Explanations ...... 99

List of Tables

2.1 Modeling advantage Aw attained using a generative model for several applications in Snorkel (Section 2.4.1), the upper bound Ã∗ used by our optimizer, the modeling strategy selected by the optimizer—either majority vote (MV) or generative model (GM)—and the empirical label density dΛ ...... 18
2.2 Number of labeling functions, fraction of positive labels (for binary classification tasks), number of training documents, and number of training candidates for each task ...... 24
2.3 Evaluation of Snorkel on relation extraction tasks from text. Snorkel’s generative and discriminative models consistently improve over distant supervision, measured in F1, the harmonic mean of precision (P) and recall (R). We compare with hand-labeled data when available, coming within an average of 1 F1 point ...... 25
2.4 Number of candidates in the training, development, and test splits for each dataset ...... 26
2.5 Evaluation on cross-modal experiments. Labeling functions that operate on or represent one modality (text, crowd workers) produce training labels for models that operate on another modality (images, text), and approach the predictive performance of large hand-labeled training datasets ...... 28
2.6 Comparison between training the discriminative model on the labels estimated by the generative model, versus training on the unweighted average of the LF outputs. Predictive performance gains show that modeling LF noise helps ...... 29
2.7 Labeling function ablation study on CDR. Adding different types of labeling functions improves predictive performance ...... 30
2.8 Self-reported skill levels—no previous experience (New), beginner (Beg.), intermediate (Int.), and advanced (Adv.)—for all user study participants ...... 31

3.1 Summary of the datasets used in our experiments ...... 55
3.2 End-to-end quality in terms of precision, recall, and F1 score for each application compared to the upper bound of state-of-the-art systems ...... 57
3.3 End-to-end quality vs. existing knowledge bases ...... 57
3.4 Comparing approaches to featurization based on Fonduer’s data model ...... 61

3.5 Comparing the features of SRV and Fonduer ...... 61
3.6 Comparing document-level RNN and Fonduer’s deep-learning model on a single ELECTRONICS relation ...... 61

4.1 Predicates in the grammar supported by BabbleLabble’s rule-based semantic parser ...... 69
4.2 The total number of unlabeled training examples (a pair of annotated entities in a sentence), labeled development examples (for hyperparameter tuning), labeled test examples (for assessment), and the fraction of positive labels in the test split ...... 72
4.3 F1 scores obtained by a classifier trained with BabbleLabble (BL) using 30 explanations or with traditional supervision (TS) using the specified number of individually labeled examples. BabbleLabble achieves the same F1 score as traditional supervision while using fewer user inputs by a factor of over 5 (Protein) to over 100 (Spouse) ...... 73
4.4 The number of LFs generated from 30 explanations (pre-filters), discarded by the filter bank, and remaining (post-filters), along with the percentage of LFs that were correctly parsed from their corresponding explanations ...... 74
4.5 F1 scores obtained using BabbleLabble with no filter bank (BL-FB), as normal (BL), and with a perfect parser (BL+PP) simulated by hand ...... 75
4.6 F1 scores obtained using explanations as functions for data programming (BL) or features (Feat), optionally with no discriminative model (-DM) or using a perfect parser (+PP) ...... 76

B.1 Features from Fonduer’s feature library. Example values are drawn from the example candidate in Figure 3.1. Capitalized prefixes represent the feature templates and the remainder of the string represents a feature’s value ...... 90

List of Figures

1.1 Similar to the way we program computers, we can program our machine learning models using higher-level abstractions than individual bits or labels ...... 2

2.1 In Snorkel, rather than labeling training data by hand, users write labeling functions, which programmatically label data points or abstain. These labeling functions have different unknown accuracies and correlations. Snorkel automatically models and combines their outputs using a generative model, then uses the resulting probabilistic labels to train a discriminative model ...... 5
2.2 In Example 2.1.1, training data is labeled by sources of differing accuracy and coverage. Two key challenges arise in using this weak supervision effectively. First, we need a way to estimate the unknown source accuracies to resolve disagreements. Second, we need to pass on this critical lineage information to the end model being trained ...... 6
2.3 An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs’ outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network ...... 7
2.4 Labeling functions take as input a Candidate object, representing a data point to be classified. Each Candidate is a tuple of Context objects, which are part of a hierarchy representing the local context of the Candidate ...... 10
2.5 Labeling functions expressing pattern-matching, heuristic, and distant supervision approaches, respectively, in Snorkel’s Jupyter notebook interface, for the Spouses example. Full code is available in Snorkel’s Intro tutorial ...... 12
2.6 The data viewer utility in Snorkel, showing candidate spouse relation mentions from the Spouses example, composed of person-person mention pairs ...... 13
2.7 A plot of the modeling advantage, i.e., the improvement in label accuracy from the generative model, as a function of the number of labeling functions (equivalently, the label density) on a synthetic dataset ...... 17

2.8 The predicted (Ã∗) and actual (Aw) advantage of using the generative labeling model (GM) over majority vote (MV) on the CDR application as the number of LFs is increased. At 9 LFs, the optimizer switches from choosing MV to choosing GM; this leads to faster modeling in early development cycles, and more accurate results in later cycles ...... 20
2.9 Predictive performance of the generative model and number of learned correlations versus the correlation threshold ε. The selected elbow point achieves a good tradeoff between predictive performance and computational cost (linear in the number of correlations). Left: simulation of structure learning correcting the generative model. Middle: the CDR task. Right: all user study labeling functions for the Spouses task ...... 20
2.10 Precision-recall curves for the relation extraction tasks. The top plots compare a majority vote of all labeling functions, Snorkel’s generative model, and Snorkel’s discriminative model. They show that the generative model improves over majority vote by providing more granular information about candidates, and that the discriminative model can generalize to candidates that no labeling functions label. The bottom plots compare the discriminative model trained on an unweighted combination of the labeling functions, hand supervision (when available), and Snorkel’s discriminative model. They show that the discriminative model benefits from the weighted labels provided by the generative model, and that Snorkel is competitive with hand supervision, particularly in the high-precision region ...... 25
2.11 The increase in end model performance (measured in F1 score) for different amounts of unlabeled data, measured in the number of candidates. We see that as more unlabeled data is added, the performance increases ...... 29
2.12 Predictive performance attained by our 14 user study participants using Snorkel. The majority (57%) of users matched or exceeded the performance of a model trained on 7 hours (2,500 instances) of hand-labeled data ...... 32
2.13 The profile of the best performing user by F1 score was an MS or Ph.D. degree in any field, strong Python coding skills, and intermediate to advanced experience with machine learning. Prior experience with text mining added no benefit ...... 32
2.14 We bucketed labeling functions written by user study participants into three types—pattern-based, distant supervision, and complex. Participants tended to mainly write pattern-based labeling functions, but also universally expressed more complex heuristics as well ...... 33

3.1 A KBC task to populate relation HasCollectorCurrent(Transistor Part, Current) from datasheets. Part and Current mentions are in blue and green, respectively ...... 39
3.2 An overview of Fonduer KBC over richly formatted data. Given a set of richly formatted documents and a series of lightweight inputs from the user, Fonduer extracts facts and stores them in a relational database ...... 43
3.3 Fonduer’s data model ...... 45

3.4 Tradeoff between (a) quality and (b) execution time when pruning the number of candidates using throttlers ...... 51
3.5 An illustration of Fonduer’s multimodal LSTM for candidate (SMBT3904, 200) in Figure 3.1 ...... 52
3.6 Average F1 score over four relations when broadening the extraction context scope in ELECTRONICS ...... 58
3.7 The impact of each modality in the feature library ...... 59
3.8 Study of different supervision resources’ effect. Metadata includes structural, tabular, and visual modalities ...... 62
3.9 F1 quality over time with 95% confidence intervals (left). Modality distribution of user LFs (right) ...... 62

4.1 In BabbleLabble, the user provides a natural language explanation for each labeling decision. These explanations are parsed into labeling functions that convert unlabeled data into a large labeled dataset for training a classifier ...... 66
4.2 Natural language explanations are parsed into candidate labeling functions (LFs). Many incorrect LFs are filtered out automatically by the filter bank. The remaining functions provide heuristic labels over the unlabeled dataset, which are aggregated into one noisy label per example, yielding a large, noisily-labeled training set for a classifier ...... 67
4.3 Valid parses are found by iterating over increasingly large subspans of the input looking for matches among the right hand sides of the rules in the grammar. Rules are either lexical (converting tokens into symbols), unary (converting one symbol into another symbol), or compositional (combining many symbols into a single higher-order symbol). A rule may optionally ignore unrecognized tokens in a span (denoted here with a dashed line) ...... 68
4.4 An example and explanation for each of the three datasets ...... 72
4.5 Incorrect LFs often still provide useful signal. On top is an incorrect LF produced for the Disease task that had the same accuracy as the correct LF. On bottom is a correct LF from the Spouse task and a more accurate incorrect LF discovered by randomly perturbing one predicate at a time as described in Section 4.4.2. (Person 2 is always the second person in the sentence) ...... 74
4.6 When logical forms of natural language explanations are used as functions for data programming (as they are in BabbleLabble), performance can improve with the addition of unlabeled data, whereas using them as features does not benefit from unlabeled data ...... 75
4.7 The Babble Labble interface ...... 77

5.1 Abstract schematics of labeling rate for (a) manual labeling and (b) programmatic labeling approaches...... 79

B.1 The GWASkb web application hosted at http://gwaskb.stanford.edu for exploring the contents of the GWASkb knowledge base created with Fonduer ...... 95
B.2 Users can search by genotype (e.g., rs7329174) or phenotype (e.g., breast cancer) and see all related studies and associations with links to the corresponding articles for further exploration ...... 96

Chapter 1

Introduction

Training a machine learning model for a new application requires three primary components: a model to train, hardware to train on, and data to train with. With the proliferation of cloud computing products, users around the world have access to state-of-the-art hardware for mere cents per hour.123 Similarly, model zoos456, industry standards7, and heavily supported open source frameworks [Paszke et al., 2017, Abadi et al., 2015] have made state-of-the-art models readily available for use. Consequently, the bottleneck to obtaining high quality in new machine learning applications has increasingly become obtaining the necessary training data. Furthermore, exacerbating this bottleneck is the fact that the predominant process for generating training data (manually labeling examples one-by-one) is so low-level. We can compare the process of training or “programming” machine learning models to that of programming computers. For example, even though computers are programmed with individual bits (1s and 0s), that is no longer the level of abstraction we use to write the programs our computers run. Instead, we use higher abstractions that compile down into that form (Figure 1.1).

• We use low-level assembly code to write multiple bytes at a time.
• We use high-level languages like C++ to allow access to more advanced concepts.
• And even higher, we use declarative languages like SQL where users can simply describe what they want, and the code that will be executed gets written automatically.

Compare this process to that of training supervised machine learning models via labeled examples. For many problems, while the data may be very complex, the labels themselves are quite low-level. Consider, for example, that the labels in the common binary classification setting are, like bits, also 1s and 0s.

1https://aws.amazon.com/pricing/
2https://azure.microsoft.com/en-us/pricing/
3https://cloud.google.com/pricing/
4https://www.tensorflow.org/hub
5https://pytorch.org/hub
6https://github.com/huggingface/pytorch-transformers
7https://onnx.ai/


Figure 1.1: Similar to the way we program computers, we can program our machine learning models using higher-level abstractions than individual bits or labels.

However, even though our models — like computers — are trained on low-level inputs, that does not have to be the interface we use to program them.

• We can programmatically generate labels using code.
• We can expose advanced primitives (or "helper functions") to make it easier to supervise complex concepts efficiently.
• And we can support natural language supervision, where users convey their domain expertise with words and the labels are generated automatically.
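To make these three interfaces concrete, the snippet below sketches the same piece of domain knowledge expressed at each level. It is purely illustrative; the helper function and explanation format shown here are hypothetical stand-ins, not the actual interfaces of the systems described in later chapters.

# Illustrative only: the same piece of domain knowledge expressed at three
# levels of abstraction. Helper names here are hypothetical, not real APIs.

# (1) Supervision from code: a hand-written labeling function.
def lf_causes(sentence):
    return 1 if "causes" in sentence else 0   # 0 means abstain

# (2) Supervision from primitives: the same heuristic via a higher-level helper.
def contains_word(word):
    return lambda sentence: 1 if word in sentence else 0
lf_causes_v2 = contains_word("causes")

# (3) Supervision from natural language: the heuristic stated as an explanation,
# to be compiled into a labeling function by a semantic parser (Chapter 4).
explanation = 'Label True because the word "causes" appears in the sentence.'

print(lf_causes("magnesium causes quadriplegia"),
      lf_causes_v2("magnesium causes quadriplegia"))   # -> 1 1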

In this work, we present three systems built at increasingly high levels of abstraction for supervising machine learning models, corresponding to the three levels just described. For each system, we present real-world use cases where it has been applied, including in defense, commerce, and medicine. The thesis of this dissertation is that weak supervision from high-level abstractions can be used to train high-performance machine learning models. Weak supervision comes in many forms, and can be loosely defined as using cheaper, higher-level, and/or potentially noisier inputs than ground truth labels as supervision. Perhaps the most commonly recognized form is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels [Bunescu and Mooney, 2007, Mintz et al., 2009b, Alfonseca et al., 2012b]. Additional forms include crowdsourced labels [Yuen et al., 2011, Quinn and Bederson, 2011], using individual rules and heuristics to label data [Zhang et al., 2017, Rekatsinas et al., 2017a], and others [Zaidan and Eisner, 2008b, Liang et al., 2009b, Mann and McCallum, 2010, Stewart and Ermon, 2017]. In this work, we focus in particular on the setting where we have access to multiple weak supervision sources, which generally leads to better performance than using any particular source on its own.

We begin with Snorkel8 in Chapter 2, a first-of-its-kind system that enables users to train state-of-the-art models without labeling any training data by hand. Instead, users write programmatic labeling functions that express arbitrary heuristics, which can have unknown and varying accuracies and correlations. Snorkel denoises these sources without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experiences collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations with government agencies and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

Next, with Fonduer9 in Chapter 3 we enable an abstraction one step higher where advanced primitives allow users to easily supervise over multiple modalities: visual, textual, structural, and tabular. We analyze the contributions of this abstraction in the context of knowledge base construction (KBC) from richly formatted data (e.g., PDFs or webpages where formatting and layout convey information beyond the raw text). We compare Fonduer against state-of-the-art KBC approaches in four different domains. We find that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base—and in some cases produces up to 1.87× the number of correct entries—compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer’s programming model, showing that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.

Third, with BabbleLabble10 in Chapter 4 we show that we can even utilize supervision given in the form of natural language explanations, maintaining the benefits of programmatic supervision while removing the burden of writing code. With BabbleLabble, annotators provide a natural language explanation for each labeling decision that they make. A semantic parser compiles these explanations into labeling functions composed of relevant primitives. On three relation extraction tasks, we find that users are able to train classifiers with comparable F1 scores 5–100× faster by providing explanations instead of just labels. Furthermore, given the inherent imperfection of labeling functions, we find that a simple rule-based semantic parser suffices.

Finally, in Chapter 5 we discuss the advantages and limitations of programmatic supervision at large, exploring the ramifications of the supervision stack which we have proposed.

8https://github.com/snorkel-team/snorkel
9https://github.com/HazyResearch/fonduer
10https://github.com/HazyResearch/babble

Chapter 2

Weak Supervision from Code

We begin by attempting to answer the question of whether we can effectively train machine learning models without labels generated individually by hand. We accomplish this with Snorkel, a system for combining weak supervision sources to rapidly create training data. As we will see throughout this dissertation, the primary abstraction utilized by Snorkel, the labeling function, will serve as the fundamental unit in subsequent higher-level interfaces. To use the computer analogy, labeling functions will serve as the x86 of the supervision stack.

The work in this chapter is the result of collaboration with Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Jared Dunnmon, Frederic Sala, Shreyash Pandey, Christopher Ré, and many others. It draws on content from the following Snorkel-related publications: [Ratner et al., 2016b, 2017b, 2018, 2019b,a, Bach et al., 2017, 2019].

2.1 Introduction

In the last several years, there has been an explosion of interest in machine-learning-based systems across industry, government, and academia, with an estimated spend this year of $12.5 billion [idc, 2017]. A central driver has been the advent of deep learning techniques, which can learn task-specific representations of input data, obviating what used to be the most time-consuming development task: feature engineering. These learned representations are particularly effective for tasks like natural language processing and image analysis, which have high-dimensional, high-variance input that is impossible to fully capture with simple rules or hand-engineered features [Graves and Schmidhuber, 2005, Deng et al., 2009]. However, deep learning has a major upfront cost: these methods need massive training sets of labeled examples to learn from—often tens of thousands to millions to reach peak predictive performance [Sun et al., 2017]. Such training sets are enormously expensive to create, especially when domain expertise is required. For example, reading scientific papers, analyzing intelligence data, and interpreting medical images all require labeling by trained subject matter experts (SMEs).

Figure 2.1: In Snorkel, rather than labeling training data by hand, users write labeling functions, which programmatically label data points or abstain. These labeling functions have different unknown accuracies and correlations. Snorkel automatically models and combines their outputs using a generative model, then uses the resulting probabilistic labels to train a discriminative model.

Moreover, we observe from our engagements with collaborators like research labs and major technology companies that problem specifications (e.g., class definitions or granularities) tend to change as projects progress, necessitating re-labeling. Some big companies are able to absorb this cost, hiring large teams to label training data [Metz, 2016, Eadicicco, 2017, Davis et al., 2013]. Other practitioners utilize classic techniques like active learning [Settles, 2012], transfer learning [Pan and Yang, 2010], and semi-supervised learning [Chapelle et al., 2009] to reduce the number of training labels needed. However, the bulk of practitioners are increasingly turning to some form of weak supervision: cheaper sources of labels that are noisier or heuristic. The most popular form is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels [Bunescu and Mooney, 2007, Mintz et al., 2009b, Alfonseca et al., 2012b]. Other forms include crowdsourced labels [Yuen et al., 2011, Quinn and Bederson, 2011], individual rules and heuristics for labeling data [Zhang et al., 2017, Rekatsinas et al., 2017a], and others [Zaidan and Eisner, 2008b, Liang et al., 2009b, Mann and McCallum, 2010, Stewart and Ermon, 2017]. While these sources are inexpensive, they often have limited accuracy and coverage.

Ideally, we would combine the labels from many weak supervision sources to increase the accuracy and coverage of our training set. However, two key challenges arise in doing so effectively. First, sources will overlap and conflict, and to resolve their conflicts we need to estimate their accuracies and correlation structure, without access to ground truth. Second, we need to pass on critical lineage information about label quality to the end model being trained.

Example 2.1.1. In Figure 2.2, we obtain labels from a high accuracy, low coverage Source 1, and from a low accuracy, high coverage Source 2, which overlap and disagree (split-color points). If we take an unweighted majority vote to resolve conflicts, we end up with null (tie-vote) labels. If we could correctly estimate the source accuracies, we would resolve conflicts in the direction of Source 1. We would still need to pass this information on to the end model being trained. Suppose that we took labels from Source 1 where available, and otherwise took labels from Source 2. Then, the expected training set accuracy would be 60.3%—only marginally better than the weaker source. Instead we should represent training label lineage in end model training, weighting labels generated by high-accuracy sources more.

Figure 2.2: In Example 2.1.1, training data is labeled by sources of differing accuracy and coverage. Two key challenges arise in using this weak supervision effectively. First, we need a way to estimate the unknown source accuracies to resolve disagreements. Second, we need to pass on this critical lineage information to the end model being trained.
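For reference, the 60.3% figure follows from a weighted average over the two sources' coverage, assuming the 1k points labeled by Source 1 fall within the 100k points labeled by Source 2:

\[
\frac{1{,}000 \times 0.90 + 99{,}000 \times 0.60}{100{,}000} = \frac{60{,}300}{100{,}000} = 0.603 .
\]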

In recent work, we developed data programming as a paradigm for addressing both of these challenges by modeling multiple label sources without access to ground truth, and generating probabilistic training labels representing the lineage of the individual labels. We prove that, surprisingly, we can recover source accuracy and correlation structure without hand-labeled training data [Ratner et al., 2016a, Bach et al., 2017]. However, there are many practical aspects of implementing and applying this abstraction that have not been previously considered. We present Snorkel, the first end-to-end system for combining weak supervision sources to rapidly create training data. We built Snorkel as a prototype to study how people could use data programming, a fundamentally new approach to building machine learning applications. Through weekly hackathons and office hours held at Stanford University over the past year, we have interacted with a growing user community around Snorkel’s open source implementation.1 We have observed SMEs in industry, science, and government deploying Snorkel for knowledge base construction, image analysis, bioinformatics, fraud detection, and more. From this experience, we have distilled three principles that have shaped Snorkel’s design:

1. Bring All Sources to Bear: The system should enable users to opportunistically use labels from all available weak supervision sources.

2. Training Data as the Interface to ML: The system should model label sources to produce a single, probabilistic label for each data point and train any of a wide range of classifiers to generalize beyond those sources.

3. Supervision as Interactive Programming: The system should provide rapid results in response to user supervision. We envision weak supervision as the REPL-like interface for machine learning.

1http://snorkel.stanford.edu

Figure 2.3: An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs’ outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network.

Our work makes the following technical contributions:

A Flexible Interface for Sources: We observe that the heterogeneity of weak supervision strategies is a stumbling block for developers. Different types of weak supervision operate on different scopes of the input data. For example, distant supervision has to be mapped programmatically to specific spans of text. Crowd workers and weak classifiers often operate over entire documents or images. Heuristic rules are open ended; they can leverage information from multiple contexts simultaneously, such as combining information from a document’s title, named entities in the text, and knowledge bases. This heterogeneity was cumbersome enough to completely block users of early versions of Snorkel. To address this challenge, we built an interface layer around the abstract concept of a labeling function (LF). We developed a flexible language for expressing weak supervision strategies and supporting data structures. We observed accelerated user productivity with these tools, which we validated in a user study where SMEs build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling.

Tradeoffs in Modeling of Sources: Snorkel learns the accuracies of weak supervision sources without access to ground truth using a generative model [Ratner et al., 2016a]. Furthermore, it also learns correlations and other statistical dependencies among sources, correcting for dependencies in labeling functions that skew the estimated accuracies [Bach et al., 2017]. This paradigm gives rise to previously unexplored tradeoff spaces between predictive performance and speed. The natural first question is: when does modeling the accuracies of sources improve predictive performance? Further, how many dependencies, such as correlations, are worth modeling? We study the tradeoffs between predictive performance and training time in generative models for weak supervision. While modeling source accuracies and correlations will not hurt predictive performance, we present a theoretical analysis of when a simple majority vote will work just as well. Based on our conclusions, we introduce an optimizer for deciding when to model accuracies of labeling functions, and when learning can be skipped in favor of a simple majority vote. Further, our optimizer automatically decides which correlations to model among labeling functions. This optimizer correctly predicts the advantage of generative modeling over majority vote to within 2.16 accuracy points on average on our evaluation tasks and accelerates pipeline executions by up to 1.8×. It also enables us to gain 60%–70% of the benefit of correlation learning while saving up to 61% of training time (34 minutes per execution).

First End-to-End System for Data Programming: Snorkel is the first system to implement our recent work on data programming [Ratner et al., 2016a, Bach et al., 2017]. Previous ML systems that we and others developed [Zhang et al., 2017] required extensive feature engineering and model specification, leading to confusion about where to inject relevant domain knowledge. While programming weak supervision seems superficially similar to feature engineering, we observe that users approach the two processes very differently. Our vision—weak supervision as the sole port of interaction for machine learning—implies radically different workflows, requiring a proof of concept. Snorkel demonstrates that this paradigm enables users to develop high-quality models for a wide range of tasks. We report on two deployments of Snorkel, in collaboration with the U.S. Department of Veterans Affairs and Stanford Hospital and Clinics, and the U.S. Food and Drug Administration, where Snorkel improves over heuristic baselines by an average 110%. We also report results on four open-source datasets that are representative of other Snorkel deployments, including bioinformatics, medical image analysis, and crowdsourcing; on which Snorkel beats heuristics by an average 153% and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

2.2 Snorkel Architecture

Snorkel’s workflow is designed around data programming [Ratner et al., 2016a, Bach et al., 2017], a fundamentally new paradigm for training machine learning models using weak supervision, and proceeds in three main stages (Figure 2.3):

1. Writing Labeling Functions: Rather than hand-labeling training data, users of Snorkel write labeling functions, which allow them to express various weak supervision sources such as patterns, heuristics, external knowledge bases, and more. This was the component most informed by early interactions (and mistakes) with users over the last year of deployment, and we present a flexible interface and supporting data model.

2. Modeling Accuracies and Correlations: Next, Snorkel automatically learns a generative model over the labeling functions, which allows it to estimate their accuracies and correlations. This step uses no ground-truth data, learning instead from the agreements and disagreements of the labeling functions. We observe that this step improves end predictive performance 5.81% over Snorkel with unweighted label combination, and anecdotally that it streamlines the user development experience by providing actionable feedback about labeling function quality.

3. Training a Discriminative Model: The output of Snorkel is a set of probabilistic labels that can be used to train a wide variety of state-of-the-art machine learning models, such as popular deep learning models. While the generative model is essentially a re-weighted combination of the user-provided labeling functions—which tend to be precise but low-coverage—modern discriminative models can retain this precision while learning to generalize beyond the labeling functions, increasing coverage and robustness on unseen data.
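The three stages can be pictured with the following minimal, runnable sketch on toy data. The unweighted-average combiner is only a stand-in for the generative model of Section 2.2.2, and no end model is trained; it is meant to show the shape of the pipeline, not Snorkel's implementation.

# A minimal, runnable sketch of Snorkel's three-stage workflow on toy data.
# The unweighted-average combiner below is only a stand-in for the generative
# model of Section 2.2.2, and no end model is trained here (Section 2.2.3).

def lf_keyword_causes(sentence):                  # stage (1): labeling functions
    return 1 if "causes" in sentence else 0       # 0 = abstain

def lf_keyword_not(sentence):
    return -1 if "not associated" in sentence else 0

sentences = [
    "magnesium causes quadriplegia",
    "the drug is not associated with headaches",
    "aspirin was administered",
]
lfs = [lf_keyword_causes, lf_keyword_not]

# stage (1): apply LFs over unlabeled data -> label matrix (m x n)
label_matrix = [[lf(s) for lf in lfs] for s in sentences]

# stage (2): combine LF outputs into one probabilistic label per data point.
# (Stand-in: unweighted average mapped to [0, 1]; Snorkel learns LF accuracies.)
def combine(row):
    votes = [v for v in row if v != 0]
    return 0.5 if not votes else (sum(votes) / len(votes) + 1) / 2

probabilistic_labels = [combine(row) for row in label_matrix]
print(probabilistic_labels)   # -> [1.0, 0.0, 0.5]

# stage (3): these soft labels would now be used to train any discriminative
# model (e.g., a deep neural network) with a noise-aware loss.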

Next we set up the problem Snorkel addresses and describe its main components and design decisions.

Setup: Our goal is to learn a parameterized classification model hθ that, given a data point x ∈ X, predicts its label y ∈ Y, where the set of possible labels Y is discrete. For simplicity, we focus on the binary setting Y = {−1, 1}, though we include a multi-class application in our experiments. For example, x might be a medical image, and y a label indicating normal versus abnormal. In the relation extraction examples we look at, we often refer to x as a candidate. In a traditional supervised learning setup, we would learn hθ by fitting it to a training set of labeled data points. However, in our setting, we assume that we only have access to unlabeled data for training. We do assume access to a small set of labeled data used during development, called the development set, and a blind, held-out labeled test set for evaluation. These sets can be orders of magnitude smaller than a training set, making them economical to obtain.

The user of Snorkel aims to generate training labels by providing a set of labeling functions, which are black-box functions, λ : X → Y ∪ {∅}, that take in a data point and output a label, where we use ∅ to denote that the labeling function abstains. Given m unlabeled data points and n labeling functions, Snorkel applies the labeling functions over the unlabeled data to produce a matrix of labeling function outputs Λ ∈ (Y ∪ {∅})^{m×n}. The goal of the remaining Snorkel pipeline is to synthesize this label matrix Λ—which may contain overlapping and conflicting labels for each data point—into a single vector of probabilistic training labels Ỹ = (ỹ1, ..., ỹm), where ỹi ∈ [0, 1]. These training labels can then be used to train a discriminative model. Next, we introduce the running example of a text relation extraction task as a proxy for many real-world knowledge base construction and data analysis tasks:

Example 2.2.1. Consider the task of extracting mentions of adverse chemical-disease relations from the biomedical literature (see CDR task, Section 2.4.1). Given documents with mentions of chemicals and diseases tagged, we refer to each co-occurring (chemical, disease) mention pair as a candidate extraction, which we view as a data point to be classified as either true or false. For example, in Figure 2.2, we would have two candidates with true labels y1 = True and y2 = False:

x_1 = Causes("magnesium", "quadriplegic")
x_2 = Causes("magnesium", "preeclampsia")

Data Model: A design challenge is managing complex, unstructured data in a way that enables SMEs to write labeling functions over it. In Snorkel, input data is stored in a context hierarchy. It is made up of context types connected by parent/child relationships, which are stored in a relational database and made available via an object-relational mapping (ORM) layer built with SQLAlchemy.2 Each context type represents a conceptual component of data to be processed by the system or used when writing labeling functions; for example a document, an image, a paragraph, a sentence, or an embedded table. Candidates—i.e., data points x—are then defined as tuples of contexts (Figure 2.4).

Figure 2.4: Labeling functions take as input a Candidate object, representing a data point to be classified. Each Candidate is a tuple of Context objects, which are part of a hierarchy representing the local context of the Candidate.

Example 2.2.2. In our running CDR example, the input documents can be represented in Snorkel as a hierarchy consisting of Documents, each containing one or more Sentences, each containing one or more Spans of text. These Spans may also be tagged with metadata, such as Entity markers identifying them as chemical or disease mentions (Figure 2.4). A candidate is then a tuple of two Spans.
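The context hierarchy and candidate abstraction can be pictured with a few plain data classes. This is a simplified sketch for illustration only, not Snorkel's actual ORM-backed Context classes.

# Simplified sketch of the context hierarchy (not Snorkel's actual ORM classes).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Document:
    name: str
    sentences: List["Sentence"]

@dataclass
class Sentence:
    words: List[str]

@dataclass
class Span:
    sentence: Sentence
    start: int           # word offsets into the parent sentence
    end: int
    entity_type: str     # e.g., "Chemical" or "Disease"

    def get_word_range(self) -> Tuple[int, int]:
        return self.start, self.end

# A candidate is a tuple of contexts, here a (chemical, disease) Span pair.
Candidate = Tuple[Span, Span]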

2.2.1 A Language for Weak Supervision

Snorkel uses the core abstraction of a labeling function to allow users to specify a wide range of weak supervision sources such as patterns, heuristics, external knowledge bases, crowdsourced labels, and more. This higher-level, less precise input is more efficient to provide (see Section 3.6), and can be automatically denoised and synthesized, as described in subsequent sections. In this section, we describe our design choices in building an interface for writing labeling functions, which we envision as a unifying programming language for weak supervision. These choices were informed to a large degree by our interactions—primarily through weekly office hours—with Snorkel users in bioinformatics, defense, industry, and other areas over the past year.3 For example, while we initially intended to have a more complex structure for labeling functions, with manually specified types and correlation structure, we quickly found that simplicity in this respect was critical to usability (and not empirically detrimental to our ability to model their outputs). We also quickly discovered that users wanted either far more expressivity or far less of it, compared to our first library of function templates. We thus trade off expressivity and efficiency by allowing users to write labeling functions at two levels of abstraction: custom Python functions and declarative operators.

2https://www.sqlalchemy.org/
3http://snorkel.stanford.edu#users

Hand-Defined Labeling Functions: In its most general form, a labeling function is just an arbitrary snippet of code, usually written in Python, which accepts as input a Candidate object and either outputs a label or abstains. Often these functions are similar to extract-transform-load scripts, expressing basic patterns or heuristics, but may use supporting code or resources and be arbitrarily complex. Writing labeling functions by hand is supported by the ORM layer, which maps the context hierarchy and associated metadata to an object-oriented syntax, allowing the user to easily traverse the structure of the input data.

Example 2.2.3. In our running example, we can write a labeling function that checks if the word "causes" appears between the chemical and disease mentions. If it does, it outputs True if the chemical mention is first and False if the disease mention is first. If "causes" does not appear, it outputs None, indicating abstention:

def LF_causes(x):
    cs, ce = x.chemical.get_word_range()
    ds, de = x.disease.get_word_range()
    if ce < ds and "causes" in x.parent.words[ce+1:ds]:
        return True
    if de < cs and "causes" in x.parent.words[de+1:cs]:
        return False
    return None

We could also write this with Snorkel’s declarative interface:

LF_causes = lf_search("{{1}}.*\Wcauses\W.*{{2}}", reverse_args=False)
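A pattern-based template like lf_search could be implemented in a few lines. The sketch below is a hypothetical implementation under assumed candidate attributes (parent_text, arg1, arg2); it is not Snorkel's actual declarative operator.

import re

# Hypothetical sketch of a pattern-based labeling function template.
# It returns an LF that labels True if the pattern matches with the arguments
# in their given order, False if it matches with them reversed, and abstains
# otherwise. The candidate attributes used here are assumptions.
def lf_search(pattern, reverse_args=False):
    def lf(x):
        text = x.parent_text   # assumed: the text of the candidate's sentence
        fwd = pattern.replace("{{1}}", re.escape(x.arg1)).replace("{{2}}", re.escape(x.arg2))
        rev = pattern.replace("{{1}}", re.escape(x.arg2)).replace("{{2}}", re.escape(x.arg1))
        if re.search(fwd, text):
            return False if reverse_args else True
        if re.search(rev, text):
            return True if reverse_args else False
        return None            # abstain
    return lf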

Declarative Labeling Functions: Snorkel includes a library of declarative operators that encode the most common weak supervision function types, based on our experience with users over the last year. The semantics and syntax of these operators is simple and easily-customizable, consisting of two main types: (i) labeling function templates, which are simply functions that take one or more arguments and output a single labeling function; and (ii) labeling function generators, which take one or more arguments and output a set of labeling functions (described below). These functions capture a range of common forms of weak supervision, for example:

• Pattern-based: Pattern-based heuristics embody the motivation of soliciting higher information density input from SMEs. For example, pattern-based heuristics encompass feature annotations [Zaidan and Eisner, 2008b] and pattern-bootstrapping approaches [Hearst, 1992, Gupta and Manning, 2014] (Example 2.2.3).

• Distant supervision: Distant supervision generates training labels by heuristically aligning data points with an external knowledge base, and is one of the most popular forms of weak supervision [Mintz et al., 2009b, Alfonseca et al., 2012b, Hoffmann et al., 2011a]. CHAPTER 2. WEAK SUPERVISION FROM CODE 12

• Weak classifiers: Classifiers that are insufficient for our task—e.g., limited coverage, noisy, biased, and/or trained on a different dataset—can be used as labeling functions.

• Labeling function generators: One higher-level abstraction that we can build on top of labeling functions in Snorkel is labeling function generators, which generate multiple labeling functions from a single resource, such as crowdsourced labels and distant supervision from structured knowledge bases (Example 2.2.4).

Example 2.2.4. A challenge in traditional distant supervision is that different subsets of knowledge bases have different levels of accuracy and coverage. In our running example, we can use the Comparative Toxicogenomics Database (CTD)4 as distant supervision, separately modeling different subsets of it with separate labeling functions. For example, we might write one labeling function to label a candidate True if it occurs in the “Causes” subset, and another to label it False if it occurs in the “Treats” subset. We can write this using a labeling function generator,

LFs_CTD = Ontology(ctd, {"Causes": True, "Treats": False})

which creates two labeling functions. In this way, generators can be connected to large resources and create hundreds of labeling functions with a line of code.
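The generator pattern itself is simple to sketch. The following illustrative implementation assumes a knowledge base represented as a dictionary mapping subset names to sets of (chemical, disease) string pairs, and candidates with chemical_text and disease_text attributes; it is not Snorkel's actual Ontology operator.

# Illustrative sketch of a labeling function generator (not Snorkel's Ontology).
# `kb` is assumed to map subset names to sets of (chemical, disease) pairs.
def ontology_lfs(kb, subset_labels):
    lfs = []
    for subset, label in subset_labels.items():
        pairs = kb[subset]
        def lf(x, pairs=pairs, label=label):       # bind loop variables
            if (x.chemical_text, x.disease_text) in pairs:
                return label
            return None                            # abstain
        lf.__name__ = f"LF_{subset}"
        lfs.append(lf)
    return lfs

# Usage, mirroring the example above:
# LFs_CTD = ontology_lfs(ctd, {"Causes": True, "Treats": False})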

Figure 2.5: Labeling functions expressing pattern-matching, heuristic, and distant supervision approaches, respectively, in Snorkel’s Jupyter notebook interface, for the Spouses example. Full code is available in Snorkel’s Intro tutorial.5

4http://ctdbase.org/
5https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro

Figure 2.6: The data viewer utility in Snorkel, showing candidate spouse relation mentions from the Spouses example, composed of person-person mention pairs.

Interface Implementation: Snorkel’s interface is designed to be accessible to subject matter expert (SME) users without advanced programming skills. All components run in Jupyter iPython notebooks,6 including writing labeling functions.7 Users can therefore write labeling functions as arbitrary Python functions for maximum flexibility (Figure 2.5). We also provide a library of labeling function primitives and generators to more declaratively program weak supervision, and a viewer utility (Figure 2.6) that displays candidates, and also supports annotation, e.g., for constructing a small held-out test set for end evaluation.

Execution Model: Since labeling functions operate on discrete candidates that are labeled independently, their execution is embarrassingly parallel. If Snorkel is connected to a relational database that supports simultaneous connections, e.g., PostgreSQL, then the master process (usually the notebook kernel) distributes the primary keys of the candidates to be labeled to Python worker processes. The workers independently read from the database to materialize the candidates via the ORM layer, then execute the labeling functions over them. The labels are returned to the master process which persists them via the ORM layer. Collecting the labels at the master is more efficient than having workers write directly to the database, due to table-level locking.

Snorkel includes a Spark8 integration layer, enabling labeling functions to be run across a cluster. Once the set of candidates is cached as a Spark data frame, only the closure of the labeling functions and the resulting labels need to be communicated to and from the workers. This is particularly helpful in Snorkel’s iterative workflow. Distributing a large unstructured data set across a cluster is relatively expensive, but only has to be performed once. Then, as users refine their labeling functions, they can be rerun efficiently.

6http://jupyter.org/
7Note that all code is open source and available—with tutorials, blog posts, workshop lectures, and other material—at snorkel.stanford.edu.
8https://spark.apache.org/
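Because each candidate is labeled independently, the labeling step parallelizes with standard tools. The sketch below is a simplified stand-in for the database-backed master/worker execution described above, using Python's multiprocessing module and in-memory toy data.

# Simplified sketch of parallel labeling function execution (no database/ORM).
from multiprocessing import Pool

def lf_causes(sentence):
    return 1 if "causes" in sentence else 0

def lf_negation(sentence):
    return -1 if "no evidence" in sentence else 0

LFS = [lf_causes, lf_negation]          # module-level so workers can see them

def label_one(sentence):
    # Each worker labels a candidate independently with every LF.
    return [lf(sentence) for lf in LFS]

if __name__ == "__main__":
    candidates = ["magnesium causes quadriplegia",
                  "no evidence links aspirin to headaches"] * 1000
    with Pool(processes=4) as pool:
        label_matrix = pool.map(label_one, candidates, chunksize=100)
    print(len(label_matrix), label_matrix[0])    # -> 2000 [1, 0]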

2.2.2 Generative Model

The core operation of Snorkel is modeling and integrating the noisy signals provided by a set of labeling functions. Using the recently proposed approach of data programming [Ratner et al., 2016a, Bach et al., 2017], we model the true class label for a data point as a latent variable in a probabilistic model. In the simplest case, we model each labeling function as a noisy “voter” which is independent—i.e., makes errors that are uncorrelated with the other labeling functions. This defines a generative model of the votes of the labeling functions as noisy signals about the true label.

We can also model statistical dependencies between the labeling functions to improve predictive performance. For example, if two labeling functions express similar heuristics, we can include this dependency in the model and avoid a “double counting” problem. We observe that such pairwise correlations are the most common, so we focus on them in this paper (though handling higher order dependencies is straightforward). We use our structure learning method for generative models [Bach et al., 2017] to select a set C of labeling function pairs (j, k) to model as correlated (see Section 2.3.2).

Now we can construct the full generative model as a factor graph. We first apply all the labeling functions to the unlabeled data points, resulting in a label matrix Λ, where Λi,j = λj(xi). We then encode the generative model pw(Λ, Y) using three factor types, representing the labeling propensity, accuracy, and pairwise correlations of labeling functions:

\[
\phi^{\mathrm{Lab}}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} \neq \emptyset\}, \qquad
\phi^{\mathrm{Acc}}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = y_i\}, \qquad
\phi^{\mathrm{Corr}}_{i,j,k}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}, \quad (j, k) \in C
\]

For a given data point xi, we define the concatenated vector of these factors for all the labeling functions j = 1, ..., n and potential correlations C as φi(Λ, Y), and the corresponding vector of parameters w ∈ R^{2n+|C|}. This defines our model:

\[
p_w(\Lambda, Y) = Z_w^{-1} \exp\left( \sum_{i=1}^{m} w^{T} \phi_i(\Lambda, y_i) \right),
\]
where Zw is a normalizing constant. To learn this model without access to the true labels Y, we minimize the negative log marginal likelihood given the observed label matrix Λ:

\[
\hat{w} = \arg\min_{w} \, -\log \sum_{Y} p_w(\Lambda, Y).
\]
We optimize this objective by interleaving stochastic gradient descent steps with Gibbs sampling ones, similar to contrastive divergence [Hinton, 2002]; for more details, see [Ratner et al., 2016a, Bach et al., 2017].

We use the Numbskull library (https://github.com/HazyResearch/numbskull), a Python NUMBA-based Gibbs sampler. We then use the predictions,

\[
\tilde{Y} = p_{\hat{w}}(Y \mid \Lambda),
\]
as probabilistic training labels.
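To make the role of the learned parameters concrete, the following sketch computes probabilistic labels under the independent, accuracy-only special case of this model, where P(y_i | Λ_i) is proportional to exp of the summed weights of the labeling functions that voted for y_i. It is a simplified stand-in for the full factor graph (it ignores correlation factors and the parameter learning step itself), with labels encoded as {-1, 0, +1}.

    import numpy as np

    def probabilistic_labels(L, w):
        # L: (m, n) label matrix with entries in {-1, 0, +1}; 0 denotes an abstention.
        # w: (n,) accuracy weights (log-odds) for the n labeling functions.
        # Returns P(y = +1 | Lambda_i) for each data point under the independent model.
        score_pos = (L == 1) @ w   # total weight of labeling functions voting +1
        score_neg = (L == -1) @ w  # total weight of labeling functions voting -1
        return np.exp(score_pos) / (np.exp(score_pos) + np.exp(score_neg))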

2.2.3 Discriminative Model

The end goal in Snorkel is to train a model that generalizes beyond the information expressed in the labeling functions. We train a discriminative model hθ on our probabilistic labels Y˜ by minimizing a noise-aware variant of the loss l(hθ(xi), y), i.e., the expected loss with respect to Y˜:

\[
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{m} \mathbb{E}_{y \sim \tilde{Y}}\left[ l(h_{\theta}(x_i), y) \right].
\]
A formal analysis shows that as we increase the amount of unlabeled data, the generalization error of discriminative models trained with Snorkel will decrease at the same asymptotic rate as traditional supervised learning models do with additional hand-labeled data [Ratner et al., 2016a], allowing us to increase predictive performance by adding more unlabeled data. Intuitively, this property holds because as more data is provided, the discriminative model sees more features that co-occur with the heuristics encoded in the labeling functions.
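As an illustration, in the binary case this noise-aware objective is simply an expected log loss with respect to the probabilistic labels. The sketch below is a generic NumPy version under that assumption, where y_tilde holds P(y = +1) from the generative model and p holds the discriminative model's predicted probabilities.

    import numpy as np

    def noise_aware_loss(p, y_tilde, eps=1e-8):
        # p:       (m,) predicted P(y = +1 | x_i) from the discriminative model
        # y_tilde: (m,) probabilistic training labels P(y = +1 | Lambda_i)
        # Expected log loss, taking the expectation over y ~ y_tilde for each point.
        return -np.mean(y_tilde * np.log(p + eps) + (1 - y_tilde) * np.log(1 - p + eps))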

Example 2.2.5. The CDR data contains the sentence, “Myasthenia gravis presenting as weakness after magnesium administration.” None of the 33 labeling functions we developed vote on the corresponding Causes(magnesium, myasthenia gravis) candidate, i.e., they all abstain. However, a deep neural network trained on probabilistic training labels from Snorkel correctly identifies it as a true mention.

Snorkel provides connectors for popular machine learning libraries such as TensorFlow [Abadi et al., 2015], allowing users to exploit commodity models like deep neural networks that do not require hand-engineering of features and have robust predictive performance across a wide range of tasks.

2.3 Weak Supervision Tradeoffs

We study the fundamental question of when—and at what level of complexity—we should expect Snorkel’s generative model to yield the greatest predictive performance gains. Understanding these performance regimes can help guide users, and introduces a tradeoff space between predictive performance and speed. We characterize this space in two parts: first, by analyzing when the generative model can be approximated by an unweighted majority vote, and second, by automatically selecting the complexity of the correlation structure to model. We then introduce a two-stage, rule-based optimizer to support fast development cycles.


2.3.1 Modeling Accuracies

The natural first question when studying systems for weak supervision is, “When does modeling the accuracies of sources improve end-to-end predictive performance?” We study that question in this subsection and propose a heuristic to identify settings in which this modeling step is most beneficial.

Tradeoff Space

We start by considering the label density dΛ of the label matrix Λ, defined as the mean number of non-abstention labels per data point. In the low-density setting, sparsity of labels will mean that there is limited room for even an optimal weighting of the labeling functions to diverge much from the majority vote. Conversely, as the label density grows, known theory confirms that the majority vote will eventually be optimal [Li et al., 2013]. It is the middle-density regime where we expect to most benefit from applying the generative model. We start by defining a measure of the benefit of weighting the labeling functions by their true accuracies—in other words, the predictions of a perfectly estimated generative model—versus an unweighted majority vote:

Definition 1. (Modeling Advantage) Let the weighted majority vote of n labeling functions on data point xi be denoted as \(f_w(\Lambda_i) = \sum_{j=1}^{n} w_j \Lambda_{i,j}\), and the unweighted majority vote (MV) as \(f_1(\Lambda_i) = \sum_{j=1}^{n} \Lambda_{i,j}\), where we consider the binary classification setting and represent an abstaining vote as 0. We define the modeling advantage Aw as the improvement in accuracy of fw over f1 for a dataset:

\[
A_w(\Lambda, y) = \frac{1}{m} \sum_{i=1}^{m} \Big( \mathbb{1}\{ y_i f_w(\Lambda_i) > 0 \,\wedge\, y_i f_1(\Lambda_i) \leq 0 \} \;-\; \mathbb{1}\{ y_i f_w(\Lambda_i) \leq 0 \,\wedge\, y_i f_1(\Lambda_i) > 0 \} \Big)
\]

In other words, Aw is the number of times fw correctly disagrees with f1 on a label, minus the number of times it incorrectly disagrees. Let the optimal advantage A* = Aw* be the advantage using the optimal weights w* (WMV*).
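The quantities above are simple to evaluate empirically. The sketch below computes the label density dΛ and the modeling advantage Aw for a given label matrix, ground-truth vector, and weight vector, using the same {-1, 0, +1} label encoding as before.

    import numpy as np

    def modeling_advantage(L, y, w):
        # L: (m, n) label matrix in {-1, 0, +1}; y: (m,) true labels in {-1, +1};
        # w: (n,) labeling function weights (use np.ones(n) for the unweighted vote).
        f_w = L @ w                            # weighted majority vote scores
        f_1 = L.sum(axis=1)                    # unweighted majority vote scores
        density = (L != 0).sum(axis=1).mean()  # label density d_Lambda
        gained = np.mean((y * f_w > 0) & (y * f_1 <= 0))  # WMV correct, MV not
        lost = np.mean((y * f_w <= 0) & (y * f_1 > 0))    # MV correct, WMV not
        return gained - lost, density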

Additionally, let:

\[
\bar{\alpha}^{*} = \frac{1}{n} \sum_{j=1}^{n} \alpha^{*}_{j} = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{1 + \exp(-w^{*}_{j})}
\]
be the average accuracies of the labeling functions. To build intuition, we start by analyzing the optimal advantage for three regimes of label density (see Figure 2.7):

Figure 2.7: A plot of the modeling advantage, i.e., the improvement in label accuracy from the generative model, as a function of the number of labeling functions (equivalently, the label density) on a synthetic dataset (a class-balanced dataset of m = 1000 data points with binary labels, and n independent labeling functions with average accuracy 75% and a fixed 10% probability of voting). We plot the advantage obtained by a learned generative model (GM), Aw; by an optimal model, A*; the upper bound Ã* used in our optimizer; and the low-density bound (Proposition 1). The low- and high-density regimes favor majority vote (MV), while the mid-density regime favors the generative model (GM).

Low Label Density: In this sparse setting, very few data points have more than one non-abstaining label; only a small number have multiple conflicting labels. We have observed this occurring, for example, in the early stages of application development. We see that with non-adversarial labeling functions (w*_j > 0), even an optimal generative model (WMV*) can only disagree with MV when there are disagreeing labels, which will occur infrequently. We see that the expected optimal advantage will have an upper bound that falls quadratically with label density:

Proposition 1. (Low-Density Upper Bound) Assume that P(Λi,j ≠ 0) = pl ∀i, j, and w*_j > 0 ∀j. Then, the expected label density is d̄ = npl, and

\[
\mathbb{E}_{\Lambda, y, w^{*}}\left[ A^{*} \right] \;\leq\; \bar{d}^{\,2}\, \bar{\alpha}^{*} \left( 1 - \bar{\alpha}^{*} \right) \tag{2.1}
\]

Proof Sketch: We bound the advantage above by computing the expected number of pairwise disagreements; for details, see the Appendix of the extended online version (https://arxiv.org/abs/1711.10160).

High Label Density: In this setting, the majority of the data points have a large number of labels. For example, we might be working in an extremely high-volume crowdsourcing setting, or an application with many high-coverage knowledge bases as distant supervision. Under modest assumptions—namely, that the average labeling function accuracy ᾱ* is greater than 50%—it is known that the majority vote converges exponentially to an optimal solution as the average label density d̄ increases, which serves as an upper bound for the expected optimal advantage as well:

Table 2.1: Modeling advantage Aw attained using a generative model for several applications in Snorkel (Section 2.4.1), the upper bound Ã* used by our optimizer, the modeling strategy selected by the optimizer—either majority vote (MV) or generative model (GM)—and the empirical label density dΛ.

Dataset     Aw (%)   Ã* (%)   Modeling Strategy   dΛ
Radiology   7.0      12.4     GM                  2.3
CDR         4.9      7.9      GM                  1.8
Spouses     4.4      4.6      GM                  1.4
Chem        0.1      0.3      MV                  1.2
EHR         2.8      4.8      GM                  1.2

Proposition 2. (High-Density Upper Bound) Assume that P(Λi,j ≠ 0) = pl ∀i, j, and that ᾱ* > 1/2. Then:

\[
\mathbb{E}_{\Lambda, y, w^{*}}\left[ A^{*} \right] \;\leq\; e^{-2 p_l \left( \bar{\alpha}^{*} - \frac{1}{2} \right)^{2} \bar{d}} \tag{2.2}
\]

Proof: This follows from an application of Hoeffding’s inequality; for details, see Appendix A.1.

Medium Label Density: In this middle regime, we expect that modeling the accuracies of the labeling functions will deliver the greatest gains in predictive performance because we will have many data points with a small number of disagreeing labeling functions. For such points, the estimated labeling function accuracies can heavily affect the predicted labels. We indeed see gains in the empirical results using an independent generative model that only includes accuracy factors φ^Acc_{i,j} (Table 2.1). Furthermore, the guarantees in [Ratner et al., 2016a] establish that we can learn the optimal weights, and thus approach the optimal advantage.

Automatically Choosing a Modeling Strategy

The bounds in the previous subsection imply that there are settings in which we should be able to safely skip modeling the labeling function accuracies, simply taking the unweighted majority vote instead. However, in practice, the overall label density dΛ is insufficiently precise to determine the transition points of interest, given a user time-cost tradeoff preference (characterized by the advantage tolerance parameter γ in Algorithm 1). We show this in Table 2.1 using our application data sets from Section 2.4.1. For example, we see that the Chem and EHR label matrices have equivalent label densities; however, modeling the labeling function accuracies has a much greater effect for EHR than for Chem.

Instead of simply considering the average label density dΛ, we develop a best-case heuristic based on the ratio of positive to negative labels for each data point. This heuristic serves as an upper bound on the true expected advantage, and thus we can use it to determine when we can safely skip training the generative model (see Algorithm 1). Let \(c_y(\Lambda_i) = \sum_{j=1}^{n} \mathbb{1}\{\Lambda_{i,j} = y\}\) be the count of labels of class y for xi, and assume that the true labeling function weights lie within a fixed range, wj ∈ [wmin, wmax], and have

a mean w̄ (in our implementation, we fix these at defaults of (wmin, w̄, wmax) = (0.5, 1.0, 1.5), which corresponds to assuming labeling functions have accuracies between 62% and 82%, with an average accuracy of 73%). Then, define:

\[
\Phi(\Lambda_i, y) = \mathbb{1}\{ c_y(\Lambda_i)\, w_{\max} \geq c_{-y}(\Lambda_i)\, w_{\min} \}
\]
\[
\tilde{A}^{*}(\Lambda) = \frac{1}{m} \sum_{i=1}^{m} \sum_{y \in \pm 1} \mathbb{1}\{ y f_1(\Lambda_i) \leq 0 \} \, \Phi(\Lambda_i, y) \, \sigma\!\left( 2 f_{\bar{w}}(\Lambda_i)\, y \right)
\]

where σ(·) is the sigmoid function, fw̄ is the majority vote with all weights set to the mean w̄, and Ã*(Λ) is the predicted modeling advantage used by our optimizer. Essentially, we are taking the expected counts of instances in which a weighted majority vote could possibly flip the incorrect predictions of the unweighted majority vote under best-case conditions, which is an upper bound for the expected advantage:

Proposition 3. (Optimizer Upper Bound) Assume that the labeling functions have accuracy parameters (log-odds weights) wj ∈ [wmin, wmax], and have E[w] = w̄. Then:

\[
\mathbb{E}_{y, w^{*}}\left[ A^{*} \mid \Lambda \right] \;\leq\; \tilde{A}^{*}(\Lambda) \tag{2.3}
\]

Proof Sketch: We upper-bound the modeling advantage by the expected number of instances in which WMV* is correct and MV is incorrect. We then upper-bound this by using the best-case probability of the weighted majority vote being correct given (wmin, wmax).
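This upper bound Ã*(Λ) is cheap to compute directly from the label matrix. The sketch below is one possible implementation under the default weight range (wmin, w̄, wmax) = (0.5, 1.0, 1.5) mentioned above, again with labels encoded in {-1, 0, +1}; it is an illustration, not the exact implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def optimizer_upper_bound(L, w_min=0.5, w_bar=1.0, w_max=1.5):
        # L: (m, n) label matrix in {-1, 0, +1}. Returns the heuristic A~*(Lambda).
        f_1 = L.sum(axis=1)  # unweighted majority vote score per data point
        total = 0.0
        for y in (+1, -1):
            c_y = (L == y).sum(axis=1)       # votes for class y
            c_other = (L == -y).sum(axis=1)  # votes for the opposite class
            phi = (c_y * w_max >= c_other * w_min)  # best case: WMV could flip to y
            mv_wrong = (y * f_1 <= 0)               # MV does not already predict y
            score = w_bar * (c_y - c_other)         # equals f_wbar(Lambda_i) * y
            total += np.sum(mv_wrong * phi * sigmoid(2 * score))
        return total / L.shape[0]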

We apply Ã* to a synthetic dataset and plot it in Figure 2.7. Next, we compute Ã* for the labeling matrices from experiments in Section 2.4.1, and compare with the empirical advantage of the trained generative models (Table 2.1). (Note that in Section 2.4, due to known negative class imbalance in relation extraction problems, we default to a negative value if majority vote yields a tie-vote label of 0; our reported F1 score metric therefore hides instances in which the generative model learns to correctly, or incorrectly, break ties. In Table 2.1, however, we do count such instances as improvements over majority vote, as these instances have an effect on the training of the end discriminative model: they yield additional training labels.) We see that our approximate quantity Ã* serves as a correct guide in all cases for determining which modeling strategy to select, which for the mature applications reported on is indeed most often the generative model. However, we see that while EHR and Chem have equivalent label densities, our optimizer correctly predicts that Chem can be modeled with majority vote, speeding up each pipeline execution by 1.8×.

Accelerating Initial Development Cycles

We find in our applications that the optimizer can save execution time especially during the initial cycles of iterative development. To illustrate this empirically, in Figure 2.8 we measure the modeling advantage of the generative model versus a majority vote of the labeling functions on increasingly large random subsets of the CDR labeling functions. We see that the modeling advantage grows as the number of labeling functions increases, and that our optimizer approximation closely tracks it; thus, the optimizer can save execution time by choosing to skip the generative model and run majority vote instead during the initial cycles of iterative development.


Figure 2.8: The predicted (Ã*) and actual (Aw) advantage of using the generative labeling model (GM) over majority vote (MV) on the CDR application as the number of LFs is increased. At 9 LFs, the optimizer switches from choosing MV to choosing GM; this leads to faster modeling in early development cycles, and more accurate results in later cycles.

Figure 2.9: Predictive performance of the generative model and number of learned correlations versus the correlation threshold ε. The selected elbow point achieves a good tradeoff between predictive performance and computational cost (linear in the number of correlations). Left: simulation of structure learning correcting the generative model. Middle: the CDR task. Right: all user study labeling functions for the Spouses task.


2.3.2 Modeling Structure

In this subsection, we consider modeling additional statistical structure beyond the independent model. We study the tradeoff between predictive performance and computational cost, and describe how to automatically select a good point in this tradeoff space.

Structure Learning We observe many Snorkel users writing labeling functions that are statistically dependent. Examples we have observed include:

• Functions that are variations of each other, such as checking for matches against similar regular expressions.

• Functions that operate on correlated inputs, such as raw tokens of text and their lemmatizations.

• Functions that use correlated sources of knowledge, such as distant supervision from overlapping knowledge bases.

Modeling such dependencies is important because they affect our estimates of the true labels. Consider the extreme case in which not accounting for dependencies is catastrophic:

Example 2.3.1. Consider a set of 10 labeling functions, where 5 are perfectly correlated, i.e., they vote the same way on every data point, and 5 are conditionally independent given the true label. If the correlated labeling functions have accuracy α = 50% and the uncorrelated ones have accuracy β = 99%, then the maximum likelihood estimate of their accuracies according to the independent model is αˆ = 100% and βˆ = 50%.

Specifying a generative model to account for such dependencies by hand is impractical for three reasons. First, it is difficult for non-expert users to specify these dependencies. Second, as users iterate on their labeling functions, their dependency structure can change rapidly, like when a user relaxes a labeling function to label many more candidates. Third, the dependency structure can be dataset specific, making it impossible to specify a priori, such as when a corpus contains many strings that match multiple regular expressions used in different labeling functions. We observed users of earlier versions of Snorkel struggling for these reasons to construct accurate and efficient generative models with dependencies. We therefore seek a method that can quickly identify an appropriate dependency structure from the labeling function outputs Λ alone. Naively, we could include all dependencies of interest, such as all pairwise correlations, in the generative model and perform parameter estimation. However, this approach is impractical. For 100 labeling functions and 10,000 data points, estimating parameters with all possible correlations takes roughly 45 minutes. When multiplied over repeated runs of hyperparameter searching and development cycles, this cost greatly inhibits labeling function development. We therefore turn to our method for automatically selecting which dependencies to model without access to ground truth [Bach et al., 2017]. It uses a pseudolikelihood estimator, which does not require any sampling or other approximations to compute the objective gradient exactly. It is much faster than maximum likelihood estimation, taking 15 seconds to select pairwise correlations to be modeled among 100 labeling functions with 10,000 data points. However, this approach relies on a selection threshold hyperparameter ε, which induces a tradeoff space between predictive performance and computational cost.
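As a rough illustration of the kind of output such structure selection produces, the sketch below picks candidate correlated pairs by thresholding the empirical correlation between labeling function output columns. This is a deliberately simplified stand-in for the pseudolikelihood-based method of [Bach et al., 2017], not that method itself, but it exposes the same threshold-style knob ε.

    import numpy as np
    from itertools import combinations

    def select_correlated_pairs(L, epsilon):
        # L: (m, n) label matrix in {-1, 0, +1}. Returns pairs (j, k) of labeling
        # functions whose outputs are empirically correlated above the threshold.
        corr = np.corrcoef(L, rowvar=False)  # (n, n) correlation of LF output columns
        return [(j, k) for j, k in combinations(range(L.shape[1]), 2)
                if abs(corr[j, k]) >= epsilon]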

Tradeoff Space

Such structure learning methods, whether pseudolikelihood or likelihood-based, crucially depend on a selection threshold ε for deciding which dependencies to add to the generative model. Fundamentally, the choice of ε determines the complexity of the generative model (specifically, ε is both the coefficient of the ℓ1 regularization term used to induce sparsity, and the minimum absolute weight in log scale that a dependency must have to be selected). We study the tradeoff between predictive performance and computational cost that this induces.


We find that generally there is an “elbow point” beyond which the number of correlations selected—and thus the computational cost—explodes, and that this point is a safe tradeoff point between predictive performance and computation time.

Predictive Performance: At one extreme, a very large value of ε will not include any correlations in the generative model, making it identical to the independent model. As ε is decreased, correlations will be added. At first, when ε is still high, only the strongest correlations will be included. As these correlations are added, we observe that the generative model’s predictive performance tends to improve. Figure 2.9, left, shows the result of varying ε in a simulation where more than half the labeling functions are correlated. After adding a few key dependencies, the generative model resolves the discrepancies among the labeling functions. Figure 2.9, middle, shows the effect of varying ε for the CDR task. Predictive performance improves as ε decreases until the model overfits. Finally, we consider a large number of labeling functions that are likely to be correlated. In our user study (described in Section 2.4.2), participants wrote labeling functions for the Spouses task. We combined all 125 of their functions and studied the effect of varying ε. Here, we expect there to be many correlations since it is likely that users wrote redundant functions. We see in Figure 2.9, right, that structure learning surpasses the best performing individual’s generative model (50.0 F1).

Computational Cost: Computational cost is correlated with model complexity. Since learning in Snorkel is done with a Gibbs sampler, the overhead of modeling additional correlations is linear in the number of correlations. The dashed lines in Figure 2.9 show the number of correlations included in each model versus ε. For example, on the Spouses task, fitting the parameters of the generative model at ε = 0.5 takes 4 minutes, and fitting its parameters with ε = 0.02 takes 57 minutes. Further, parameter estimation is often run repeatedly during development for two reasons: (i) fitting generative model hyperparameters using a development set requires repeated runs, and (ii) as users iterate on their labeling functions, they must re-estimate the generative model to evaluate them.

Automatically Choosing a Model

Based on our observations, we seek to automatically choose a value of ε that trades off between predictive performance and computational cost using the labeling functions’ outputs Λ alone. Including ε as a hyperparameter in a grid search over a development set is generally not feasible because of its large effect on running time. We therefore want to choose ε before other hyperparameters, without performing any parameter estimation. We propose using the number of correlations selected at each value of ε as an inexpensive indicator. The dashed lines in Figure 2.9 show that as ε decreases, the number of selected correlations follows a pattern. Generally, the number of correlations grows slowly at first, then hits an “elbow point” beyond which the number explodes, which fits the assumption that the correlation structure is sparse. In all three cases, setting ε to this elbow point is a safe tradeoff between predictive performance and computational cost. In cases where performance grows consistently (left and right), the elbow point achieves most of the predictive performance gains at a small fraction of the computational cost.

For example, on Spouses (right), choosing ε = 0.08 achieves a score of 56.6 F1—within one point of the best score—but only takes 8 minutes for parameter estimation. In cases where predictive performance eventually degrades (middle), the elbow point also selects a relatively small number of correlations, giving a 0.7 F1 point improvement and avoiding overfitting. Performing structure learning for many settings of ε is inexpensive, especially since the search needs to be performed only once before tuning the other hyperparameters. On the large number of labeling functions in the Spouses task, structure learning for 25 values of ε takes 14 minutes. On CDR, with a smaller number of labeling functions, it takes 30 seconds. Further, if the search is started at a low value of ε and increased, it can often be terminated early, when the number of selected correlations reaches a low value. Selecting the elbow point itself is straightforward. We use the point with greatest absolute difference from its neighbors, but more sophisticated schemes can also be applied [Satopaa et al., 2011]. Our full optimization algorithm for choosing a modeling strategy and (if necessary) correlations is shown in Algorithm 1.

Algorithm 1: Modeling Strategy Optimizer
  Input: Label matrix Λ ∈ (Y ∪ {∅})^{m×n}, advantage tolerance γ, structure search resolution η
  Output: Modeling strategy
  if Ã*(Λ) < γ then
      return MV
  end if
  Structures ← []
  for i from 1 to 1/(2η) do
      ε ← i · η
      C ← LearnStructure(Λ, ε)
      Structures.append((|C|, ε))
  end for
  ε ← SelectElbowPoint(Structures)
  return GM
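A minimal Python rendering of Algorithm 1 might look like the following. Here learn_structure stands in for the structure learning procedure of [Bach et al., 2017] (for instance, a thresholded selection like the earlier sketch) and optimizer_upper_bound computes Ã*(Λ) as sketched above; both names, and the simple neighbor-difference elbow rule, are illustrative choices rather than the exact implementation.

    def select_elbow_point(structures):
        # structures: list of (num_correlations, epsilon) pairs ordered by epsilon.
        # Pick the interior point whose correlation count differs most from its neighbors.
        counts = [c for c, _ in structures]
        diffs = [abs(counts[i] - counts[i - 1]) + abs(counts[i + 1] - counts[i])
                 for i in range(1, len(counts) - 1)]
        return structures[1 + diffs.index(max(diffs))][1]

    def choose_modeling_strategy(L, gamma, eta, learn_structure):
        # Returns ("MV", None) or ("GM", epsilon), mirroring Algorithm 1.
        if optimizer_upper_bound(L) < gamma:
            return "MV", None
        structures = []
        for i in range(1, int(1 / (2 * eta)) + 1):
            epsilon = i * eta
            structures.append((len(learn_structure(L, epsilon)), epsilon))
        return "GM", select_elbow_point(structures)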

2.4 Evaluation

We evaluate Snorkel by drawing on deployments developed in collaboration with users. We report on two real-world deployments and four tasks on open-source data sets representative of other deployments. Our evaluation is designed to support the following three main claims:

• Snorkel outperforms distant supervision baselines. In distant supervision [Mintz et al., 2009b], one of the most popular forms of weak supervision used in practice, an external knowledge base is heuristically aligned with input data to serve as noisy training labels. By allowing users to easily incorporate a broader, more heterogeneous set of weak supervision sources—for example, pattern matching, structure-based, and other more complex heuristics—Snorkel exceeds models trained via distant supervision by an average of 132%.

Table 2.2: Number of labeling functions, fraction of positive labels (for binary classification tasks), number of training documents, and number of training candidates for each task.

Task        # LFs   % Pos.   # Docs   # Candidates
Chem        16      4.1      1,753    65,398
EHR         24      36.8     47,827   225,607
CDR         33      24.6     900      8,272
Spouses     11      8.3      2,073    22,195
Radiology   18      36.0     3,851    3,851
Crowd       102     -        505      505

• Snorkel approaches hand supervision. We see that by writing tens of labeling functions, we were able to approach or match results using hand-labeled training data which took weeks or months to assemble, coming within 2.11% of the F1 score of hand supervision on relation extraction tasks and an average 5.08% accuracy or AUC on cross-modal tasks, for an average 3.60% across all tasks.

• Snorkel enables a new interaction paradigm. We measure Snorkel’s efficiency and ease-of-use by reporting on a user study of biomedical researchers from across the U.S. These participants learned to write labeling functions to extract relations from news articles as part of a two-day workshop on learning to use Snorkel, and matched or outperformed models trained on hand-labeled training data, showing the efficiency of Snorkel’s process even for first-time users.

We now describe our results in detail. First, we describe the six applications that validate our claims. We then show that Snorkel’s generative modeling stage helps to improve the predictive performance of the discriminative model, demonstrating that it is 5.81% more accurate when trained on Snorkel’s probabilistic labels versus labels produced by an unweighted average of labeling functions. We also validate that the ability to incorporate many different types of weak supervision incrementally improves results with an ablation study. Finally, we describe the protocol and results of our user study.

2.4.1 Applications

To evaluate the effectiveness of Snorkel, we consider several real-world deployments and tasks on open-source datasets that are representative of other deployments in information extraction, medical image classification, and crowdsourced sentiment analysis. Summary statistics of the tasks are provided in Table 2.2.

Discriminative Models: One of the key bets in Snorkel’s design is that the trend of increasingly powerful, open-source machine learning tools (e.g., models, pre-trained word embeddings and initial layers, automatic tuners, etc.) will only continue to accelerate. To best take advantage of this, Snorkel creates probabilistic training labels for any discriminative model with a standard loss function.

Table 2.3: Evaluation of Snorkel on relation extraction tasks from text. Snorkel’s generative and discriminative models consistently improve over distant supervision, measured in F1, the harmonic mean of precision (P) and recall (R). We compare with hand-labeled data when available, coming within an average of 1 F1 point.

            Distant Supervision       Snorkel (Gen.)
Task        P      R      F1          P      R      F1     Lift
Chem        11.2   41.2   17.6        78.6   21.6   33.8   +16.2
EHR         81.4   64.8   72.2        77.1   72.9   74.9   +2.7
CDR         25.5   34.8   29.4        52.3   30.4   38.5   +9.1
Spouses     9.9    34.8   15.4        53.5   62.1   57.4   +42.0

            Snorkel (Disc.)                  Hand Supervision
Task        P      R      F1     Lift        P      R      F1
Chem        87.0   39.2   54.1   +36.5       -      -      -
EHR         80.2   82.6   81.4   +9.2        -      -      -
CDR         38.8   54.3   45.3   +15.9       39.9   58.1   47.3
Spouses     48.4   61.6   54.2   +38.8       47.8   62.5   54.2

Figure 2.10: Precision-recall curves for the relation extraction tasks (Chem, EHR, CDR, and Spouses). The top plots compare a majority vote of all labeling functions, Snorkel’s generative model, and Snorkel’s discriminative model. They show that the generative model improves over majority vote by providing more granular information about candidates, and that the discriminative model can generalize to candidates that no labeling functions label. The bottom plots compare the discriminative model trained on an unweighted combination of the labeling functions, hand supervision (when available), and Snorkel’s discriminative model. They show that the discriminative model benefits from the weighted labels provided by the generative model, and that Snorkel is competitive with hand supervision, particularly in the high-precision region.

In the following experiments, we control for end model selection by using currently popular, standard choices across all settings. For text modalities, we choose a bidirectional long short-term memory (LSTM) sequence model [Graves and Schmidhuber, 2005], and for the medical image classification task we use a 50-layer ResNet [He et al., 2015] pre-trained on the ImageNet object classification dataset [Deng et al., 2009].

Table 2.4: Number of candidates in the training, development, and test splits for each dataset.

Task        # Train.   # Dev.   # Test
Chem        65,398     1,292    1,232
EHR         225,607    913      604
CDR         8,272      888      4,620
Spouses     22,195     2,796    2,697
Radiology   3,851      385      385
Crowd       505        63       64

Both models are implemented in TensorFlow [Abadi et al., 2015] and trained using the Adam optimizer [Kingma and Ba, 2014], with hyperparameters selected via random grid search using a small labeled development set. Final scores are reported on a held-out labeled test set. See the full version [Ratner et al., 2017a] for details. A key takeaway of the following results is that the discriminative model generalizes beyond the heuristics encoded in the labeling functions (as in Example 2.2.5). In Section 2.4.1, we see that on relation extraction applications the discriminative model improves performance over the generative model primarily by increasing recall by 43.15% on average. In Section 2.4.1, the discriminative model classifies entirely new modalities of data to which the labeling functions cannot be applied.

Data Set Details Additional information about the sizes of the datasets is included in Table 2.4. Specifically, we report the size of the (unlabeled) training set and hand-labeled development and test sets, in terms of number of candidates. Note that the development and test sets can be orders of magnitude smaller than the training sets. Labeled development and test sets were either used when already available as part of a benchmark dataset, or labeled with the help of our collaborators, limited to a maximum of several hours of labeling time. Note that test sets were labeled by individuals not involved with labeling function development to keep the test sets properly blinded.

Relation Extraction from Text

We first focus on four relation extraction tasks on text data, as it is a challenging and common class of problems that are well studied and for which distant supervision is often considered. Predictive performance is summarized in Table 2.3, and precision-recall curves are shown in Figure 2.10. We briefly describe each task.

Scientific Articles (Chem): With modern online repositories of scientific literature, such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) for biomedical articles, research results are more accessible than ever before. However, actually extracting fine-grained pieces of information in a structured format and using this data to answer specific questions at scale remains a significant open challenge for researchers.

To address this challenge in the context of drug safety research, Stanford and U.S. Food and Drug Administration (FDA) collaborators used Snorkel to develop a system for extracting chemical reagent and reaction product relations from PubMed abstracts. The goal was to build a database of chemical reactions that researchers at the FDA can use to predict unknown drug interactions. We used the chemical reactions described in the Metacyc database [Caspi et al., 2016] for distant supervision.

Electronic Health Records (EHR): As patients’ clinical records increasingly become digitized, researchers hope to inform clinical decision making by retrospectively analyzing large patient cohorts, rather than conducting expensive randomized controlled studies. However, much of the valuable information in electronic health records (EHRs)—such as fine-grained clinical details, practitioner notes, etc.—is not contained in standardized medical coding systems, and is thus locked away in the unstructured text notes sections. In collaboration with researchers and clinicians at the U.S. Department of Veterans Affairs, Stanford Hospital and Clinics (SHC), and the Stanford Center for Biomedical Informatics Research, we used Snorkel to develop a system to extract structured data from unstructured EHR notes. Specifically, the system’s task was to extract mentions of pain levels at precise anatomical locations from clinician notes, with the goal of using these features to automatically assess patient well-being and detect complications after medical interventions like surgery. To this end, our collaborators created a cohort of 5,800 patients from SHC EHR data, with visit dates between 1995 and 2015, resulting in 500K unstructured clinical documents. Since distant supervision from a knowledge base is not applicable, we compared against regular-expression-based labeling previously developed for this task.

Chemical-Disease Relations (CDR): We used the 2015 BioCreative chemical-disease relation dataset [Wei et al., 2015b], where the task is to identify mentions of causal links between chemicals and diseases in PubMed abstracts. We used all pairs of chemical and disease mentions co-occurring in a sentence as our candidate set. We used the Comparative Toxicogenomics Database (CTD) [Davis et al., 2016] for distant supervision, and additionally wrote labeling functions capturing language patterns and information from the context hierarchy. To evaluate Snorkel’s ability to discover previously unknown information, we randomly removed half of the relations in CTD and evaluated on candidates not contained in the remaining half.

Spouses: Our fourth task is to identify mentions of spouse relationships in a set of news articles from the Signal Media dataset [Corney et al., 2016a]. We used all pairs of person mentions (tagged with SpaCy’s NER module, https://spacy.io/) co-occurring in the same sentence as our candidate set. To obtain hand-labeled data for evaluation, we crowdsourced labels for the candidates via Amazon Mechanical Turk, soliciting labels from three workers for each example and assigning the majority vote. We then wrote labeling functions that encoded language patterns and distant supervision from DBpedia [Lehmann et al., 2014].

Table 2.5: Evaluation on cross-modal experiments. Labeling functions that operate on or represent one modality (text, crowd workers) produce training labels for models that operate on another modality (images, text), and approach the predictive performance of large hand-labeled training datasets.

Task              Snorkel (Disc.)   Hand Supervision
Radiology (AUC)   72.0              76.2
Crowd (Acc)       65.6              68.8

Cross-Modal: Images & Crowdsourcing

In the cross-modal setting, we write labeling functions over one data modality (e.g., a text report, or the votes of crowdworkers) and use the resulting labels to train a classifier defined over a second, totally separate modality (e.g., an image or the text of a tweet). This demonstrates the flexibility of Snorkel, in that the labeling functions (and by extension, the generative model) do not need to operate over the same domain as the discriminative model being trained. Predictive performance is summarized in Table 2.5.

Abnormality Detection in Lung Radiographs (Rad): In many real-world radiology settings, there are large repositories of image data with corresponding narrative text reports, but limited or no labels that could be used for training an image classification model. In this application, in collaboration with radiologists, we wrote labeling functions over the text radiology reports, and used the resulting labels to train an image classifier to detect abnormalities in lung X-ray images. We used a publicly available dataset from the OpenI biomedical image repository (http://openi.nlm.nih.gov/) consisting of 3,851 distinct radiology reports—composed of unstructured text and Medical Subject Headings (MeSH, https://www.nlm.nih.gov/mesh/meshhome.html) codes—and accompanying X-ray images.

Crowdsourcing (Crowd): We trained a model to perform sentiment analysis using crowdsourced annotations from the weather sentiment task from Crowdflower (https://www.crowdflower.com/data/weather-sentiment/). In this task, contributors were asked to grade the sentiment of often-ambiguous tweets relating to the weather, choosing between five categories of sentiment. Twenty contributors graded each tweet, but due to the difficulty of the task and lack of crowdworker filtering, there were many conflicts in worker labels. We represented each crowdworker as a labeling function—showing Snorkel’s ability to subsume existing crowdsourcing modeling approaches—and then used the resulting labels to train a text model over the tweets, for making predictions independent of the crowd workers.
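Concretely, converting crowd annotations into Snorkel's input format amounts to building a label matrix with one column per worker. The sketch below does this for a hypothetical table of (tweet_id, worker_id, label) votes, with abstentions (0) wherever a worker did not grade a tweet; the binarized {-1, +1} encoding of the five sentiment categories is an illustrative simplification.

    import numpy as np

    def crowd_votes_to_label_matrix(votes, num_tweets, num_workers):
        # votes: iterable of (tweet_id, worker_id, label) tuples with labels in {-1, +1}.
        # Each worker becomes one "labeling function" column in the label matrix.
        L = np.zeros((num_tweets, num_workers), dtype=int)  # 0 = worker abstained
        for tweet_id, worker_id, label in votes:
            L[tweet_id, worker_id] = label
        return L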

Effect of Generative Modeling

An important question is the significance of modeling the accuracies and correlations of the labeling functions on the end predictive performance of the discriminative model (versus Section 2.3, where we only considered the effect on the accuracy of the generative model). We compare Snorkel with a simpler pipeline that skips the generative modeling stage and trains the discriminative model on an unweighted average of the labeling functions’ outputs. Table 2.6 shows that the discriminative model trained on Snorkel’s probabilistic labels consistently predicts better, improving 5.81% on average. These results demonstrate that the discriminative model effectively learns from the additional signal contained in Snorkel’s probabilistic training labels over simpler modeling strategies.

Table 2.6: Comparison between training the discriminative model on the labels estimated by the generative model, versus training on the unweighted average of the LF outputs. Predictive performance gains show that modeling LF noise helps.

Task          Disc. Model on Unweighted LFs   Disc. Model   Lift
Chem          48.6                            54.1          +5.5
EHR           80.9                            81.4          +0.5
CDR           42.0                            45.3          +3.3
Spouses       52.8                            54.2          +1.4
Crowd (Acc)   62.5                            65.6          +3.1
Rad. (AUC)    67.0                            72.0          +5.0

Figure 2.11: The increase in end model performance (measured in F1 score) for different amounts of unlabeled data (CDR, Spouses, and EHR), measured in the number of candidates on a log scale. We see that as more unlabeled data is added, the performance increases.

Scaling with Unlabeled Data

One of the most exciting potential advantages of using a programmatic supervision approach as in Snorkel is the ability to incorporate additional unlabeled data, which is often cheaply available. Recently proposed theory characterizing the data programming approach predicts that discriminative model generalization risk (i.e., predictive performance on the held-out test set) should improve with additional unlabeled data, at the same asymptotic rate as in traditional supervised methods with respect to labeled data [Ratner et al., 2016a]. That is, with a fixed amount of effort writing labeling functions, we could then get improved discriminative model performance simply by adding more unlabeled data.

Table 2.7: Labeling function ablation study on CDR. Adding different types of labeling functions improves predictive performance.

LF Type                 P      R      F1     Lift
Text Patterns           42.3   42.4   42.3
+ Distant Supervision   37.5   54.1   44.3   +2.0
+ Structure-based       38.8   54.3   45.3   +1.0

We validate this theoretical prediction empirically on three of our datasets (Figure 2.11). We see that by adding additional unlabeled data—in these datasets, candidates from additional documents—we get significant improvements in the end discriminative model performance, with no change in the labeling functions. For example, in the EHR experiment, where we had access to a large unlabeled corpus, we were able to achieve significant gains (8.1 F1 score points) in going from 100 to 50 thousand documents. Further empirical validation of these strong unlabeled scaling results can be found in follow-up work using Snorkel in a range of application domains, including aortic valve classification in MRI videos [Fries et al., 2018], industrial-scale content classification at Google [Bach et al., 2019], fine-grained named entity recognition [Ratner et al., 2019a], radiology image triage [Khandwala et al., 2017], and others. Based on both this empirical validation, and feedback from Snorkel users in practice, we see this ability to leverage available unlabeled data without any additional user labeling effort as a significant advantage of the proposed weak supervision approach.

Labeling Function Type Ablation

We also examine the impact of different types of labeling functions on end predictive performance, using the CDR application as a representative example of three common categories of labeling functions:

• Text Patterns: Basic word, phrase, and regular expression labeling functions.

• Distant Supervision: External knowledge bases mapped to candidates, either directly or filtered by a heuristic.

• Structure-Based: Labeling functions expressing heuristics over the context hierarchy, e.g., reasoning about position in the document or relative to other candidates.

We show an ablation in Table 2.7, sorting by stand-alone score. We see that distant supervision adds recall at the cost of some precision, as we would expect, but ultimately improves F1 score by 2 points; and that structure-based labeling functions, enabled by Snorkel’s context hierarchy data representation, add an additional F1 point.

Table 2.8: Self-reported skill levels—no previous experience (New), beginner (Beg.), intermediate (Int.), and advanced (Adv.)—for all user study participants.

Subject             New   Beg.   Int.   Adv.
Python              0     3      8      4
Machine Learning    5     1      4      5
Info. Extraction    2     6      5      2
Text Mining         3     6      4      2

2.4.2 User Study

We conducted a formal study of Snorkel to (i) evaluate how quickly subject matter expert (SME) users could learn to write labeling functions, and (ii) empirically validate the core hypothesis that writing labeling functions is more time-efficient than hand-labeling data. Users were given instruction on Snorkel, and then asked to write labeling functions for the Spouses task described in the previous subsection.

Participants: In collaboration with the Mobilize Center [Ku et al., 2015], an NIH-funded Big Data to Knowledge (BD2K) center, we distributed a national call for applications to attend a two-day workshop on using Snorkel for biomedical knowledge base construction. Selection criteria included a strong biomedical project proposal and little-to-no prior experience using Snorkel. In total, 15 researchers were invited to attend out of 33 team applications submitted, with varying backgrounds in bioinformatics, clinical informatics, and data mining from universities, companies, and organizations around the United States (one participant declined to write labeling functions, so their score is not included in our analysis). The education demographics included 6 bachelors, 4 masters, and 5 Ph.D. degrees. All participants could program in Python, with 80% rating their skill as intermediate or better; 40% of participants had little-to-no prior exposure to machine learning; and 53-60% had no prior experience with text mining or information extraction applications (Table 2.8).

Protocol: The first day focused entirely on labeling functions, ranging from theoretical motivations to details of the Snorkel API. Over the course of 7 hours, participants were instructed in a classroom setting on how to use and evaluate models developed using Snorkel. Users were presented with 4 tutorial Jupyter notebooks providing skeleton code for evaluating labeling functions, along with a small labeled development candidate set, and were given 2.5 hours of dedicated development time in aggregate to write their labeling functions. All workshop materials are available online (https://github.com/HazyResearch/snorkel/tree/master/tutorials/workshop).

Baseline: To compare our users’ performance against models trained on hand-labeled data, we collected a large hand-labeled dataset via Amazon Mechanical Turk (the same set used in the previous subsection). We then split this into 15 datasets representing 7 hours worth of hand-labeling time each—based on the crowd-worker average of 10 seconds per label—simulating the alternative scenario where users skipped both instruction and labeling function development sessions and instead spent the full day hand-labeling data. Partitions were created by drawing a uniform random sample of 2,500 labels from the total Amazon Mechanical Turk-generated Spouse dataset. For 15 such random samples, the mean F1 score was 20.9 (min: 11.7, max: 29.5). Scaling to 55 random partitions, the mean F1 score was 22.5 (min: 11.7, max: 34.1).

Figure 2.12: Predictive performance attained by our 14 user study participants using Snorkel. The majority (57%) of users matched or exceeded the performance of a model trained on 7 hours (2,500 instances) of hand-labeled data.

Figure 2.13: The profile of the best-performing users by F1 score: an MS or Ph.D. degree in any field, strong Python coding skills, and intermediate to advanced experience with machine learning. Prior experience with text mining added no benefit.

Results: Our key finding is that labeling functions written in Snorkel, even by SME users, can match or exceed a traditional hand-labeling approach. The majority (8) of subjects matched or outperformed these hand-labeled data models. The average Snorkel user’s score was 30.4 F1, and the average hand-supervision score was 20.9 F1. The best performing user model scored 48.7 F1, 19.2 points higher than the best supervised model using hand-labeled data. The worst participant scored 12.0 F1, 0.3 points higher than the lowest hand-labeled model. The full distribution of scores by participant, and broken down by participant background, compared against the baseline models trained with hand-labeled data are shown in Figures 2.12 and 2.14 respectively.

Additional Details We note that participants only needed to create a fairly small set of labeling functions to achieve the reported performances, writing a median of 10 labeling functions (with a minimum of 2, and a maximum of 15). In general, these labeling functions had a simple form; for example, here are two from our user study:

Figure 2.14: Labeling function types by user. We bucketed labeling functions written by user study participants into three types—pattern-based, distant supervision, and complex. Participants tended to mainly write pattern-based labeling functions, but also universally expressed more complex heuristics as well.

    import re

    def LF_fictional(c):
        # Label negative (-1) if the sentence suggests a fictional or acted relationship.
        fictional = ("played the husband", "played the wife", "plays the husband",
                     "plays the wife", "acting role")
        if re.search("|".join(fictional), c.get_parent().text, re.I):
            return -1
        else:
            return 0

    def LF_family(c):
        # Label negative (-1) if a non-spouse family or friend term appears between the
        # two person mentions.
        family = {"business partner", "son", "daughter", "father", "dad", "mother", "mom",
                  "children", "child", "twins", "cousin", "friend", "girlfriend",
                  "boyfriend", "sister", "brother"}
        if len(family.intersection(get_between_tokens(c))) > 0:
            return -1
        else:
            return 0

Participant labeling functions had a median length of 2 lines of Python code (min: 2, max: 12). We grouped participant-designed functions into three types:

1. Pattern-based (regular expressions, small term sets)

2. Distant Supervision (interacts with a knowledge base)

3. Complex (misc. heuristics, e.g. counting PERSON named entity tags, comparing last names of a pair of PERSON entities)

On average, 58% of participants’ labeling functions were pattern-based (min: 25%, max: 82%). The best labeling function design strategy used by participants appeared to be defining small term sets correlated with positive and negative labels. Participants with the lowest F1 scores tended to design labeling functions with low coverage of negative labels. This is a common difficulty encountered when designing labeling functions, as writing heuristics for negative examples is sometimes counter-intuitive. Users with the highest overall F1 scores wrote 1-2 high coverage negative labeling functions and several medium-to-high accuracy positive labeling functions.

We note that the best single participant’s pipeline achieved an F1 score of 48.7, compared to the authors’ score of 54.2. User study participants favored pattern-based labeling functions; the most common design was creating small positive and negative term sets. Author labeling functions were similar, but were more accurate overall (e.g., better pattern matching).

2.5 Extensions & Next Steps

In this section, we briefly discuss extensions and use cases of Snorkel that have been developed since its initial release, as well as next steps and future directions more broadly. One such extension, higher-level interfaces, we defer to later chapters of this work.

2.5.1 Extensions for Real-World Deployments

Since its release, Snorkel has been used at organizations such as the Stanford Hospital, Google, Intel, Microsoft, Facebook, Alibaba, NEC, BASF, Toshiba, and Accenture; in the fight against human trafficking as part of DARPA’s MEMEX program; and in production at several large technology companies. With various teams at the Stanford School of Medicine, we have worked to extend the cross-modal radiology application described in Section 2.4 to a range of other similar cross-modal medical problems, which has involved building robust interfaces for various multi-modal clinical data and formats [Khandwala et al., 2017]. In collaboration with several teams at Google, we recently developed a new version of Snorkel, Snorkel DryBell, to interface with Google’s organizational weak supervision resources and compute infrastructure, and enable weak supervision at industrial scale [Bach et al., 2019].

2.5.2 Multi-Task Weak Supervision

Many real-world use cases of machine learning involve multiple related classification tasks—both because there are multiple tasks of interest, and because available weak supervision sources may in fact label different related tasks. Handling this multi-task weak supervision setting has been the focus of recent work on a new version of Snorkel, Snorkel MeTaL (https://github.com/HazyResearch/metal), which handles labeling functions that label different tasks, and in turn can be used to supervise popular multi-task learning (MTL) discriminative models [Ratner et al., 2017b, 2019a]. For example, we might be aiming to train a fine-grained named entity recognition (NER) system which tags specific types of people, places, and things, and have access to both fine-grained labeling functions—e.g., that label doctors versus lawyers—and coarse-grained ones, e.g., that label people versus organizations. By representing these as different logically-related tasks, we can model and combine these multi-granularity labeling functions using this new multi-task version of Snorkel.

2.5.3 Future Directions

In addition to working on the core directions outlined—real-world deployment and multi-task supervision—several other directions are natural and exciting extensions of Snorkel. One is the extension to other classic machine learning settings, such as structured prediction, regression, and anomaly detection settings. Another direction is extending the possible output signature of labeling functions to include continuous values, probability distributions, or other more complex outputs. The extension of the core modeling techniques—for example, learning labeling function accuracies that are conditioned on specific subsets of the data, or jointly learning the generative and discriminative models—also provides exciting avenues for future research. Another practical and interesting direction is exploring integrations with other complementary techniques for dealing with the lack of hand-labeled training data (see Other Forms of Supervision in Section 2.6). One example is active learning [Settles, 2012], in which the goal is to intelligently sample data points to be labeled; in our setting, we could intelligently select sets of data points to show to the user when writing labeling functions—e.g., data points not labeled by existing labeling functions—and potentially with interesting visualizations and graphical interfaces to aid and direct this development process. Another interesting direction is formalizing the connection between labeling functions and transfer learning [Pan and Yang, 2010], and making more formal and practical connections to semi-supervised learning [Chapelle et al., 2009].
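For instance, one simple way to direct labeling function development along these lines would be to surface the candidates that the current labeling functions say the least about; the sketch below ranks candidates by how few non-abstaining votes they receive in the label matrix, and is purely an illustrative example of the idea, not a proposed algorithm.

    import numpy as np

    def least_covered_candidates(L, k=10):
        # L: (m, n) label matrix in {-1, 0, +1}. Returns indices of the k candidates
        # with the fewest non-abstaining labels, e.g., to show a user writing new LFs.
        coverage = (L != 0).sum(axis=1)
        return np.argsort(coverage)[:k]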

2.6 Related Work

This section is an overview of techniques for managing weak supervision, many of which are subsumed in Snorkel. We also contrast it with related forms of supervision.

Combining Weak Supervision Sources: The main challenge of weak supervision is how to combine multiple sources. For example, if a user provides two knowledge bases for distant supervision, how should a data point that matches only one knowledge base be labeled? Some researchers have used multi-instance learning to reduce the noise in weak supervision sources [Riedel et al., 2010b, Hoffmann et al., 2011a], essentially modeling the different weak supervision sources as soft constraints on the true label, but this approach is limited in that it requires using a specific end model that supports multi-instance learning. Researchers have therefore considered how to estimate the accuracy of label sources without a gold standard with which to compare—a classic problem [Dawid and Skene, 1979]—and combine these estimates into labels that can be used to train an arbitrary end model. Much of this work has focused on crowdsourcing, in which workers have unknown accuracy [Dalvi et al., 2013, Joglekar et al., 2015, Zhang et al., 2016b]. Such methods use generative probabilistic models to estimate a latent variable—the true class label—based on noisy observations. Other methods use generative models with hand-specified dependency structures to label data for specific modalities, such as topic models for text [Alfonseca et al., 2012b] or denoising distant supervision sources [Takamatsu et al., 2012b, Roth and Klakow, 2013b]. Other techniques for estimating latent class labels given noisy observations include spectral methods [Parisi et al., 2014].

latent class labels given noisy observations include spectral methods [Parisi et al., 2014]. Snorkel is distin- guished from these approaches because its generative model supports a wide range of weak supervision sources, and it learns the accuracies and correlation structure among weak supervision sources without ground truth data.

Other Forms of Supervision: Work on semi-supervised learning considers settings with some labeled data and a much larger set of unlabeled data, and then leverages various domain- and task-agnostic assump- tions about smoothness, low-dimensional structure, or distance metrics to heuristically label the unlabeled data [Chapelle et al., 2009]. Work on active learning aims to automatically estimate which data points are optimal to label, thereby hopefully reducing the total number of examples that need to be manually anno- tated [Settles, 2012]. Transfer learning considers the strategy of repurposing models trained on different datasets or tasks where labeled training data is more abundant [Pan and Yang, 2010]. Another class of supervision includes self-training [Scudder, 1965, Agrawala, 1970] and co-training [Blum and Mitchell, 1998], which involve training a model or pair of models on data that they labeled themselves. Weak supervision is distinct in that the goal is to solicit input directly from SMEs, but at a higher level of abstraction and/or in an inherently noisier form. Snorkel is focused on managing weak supervision sources, but combining its methods with these other types of supervision is straightforward.

Related Data Management Problems: Researchers have considered related problems in data manage- ment, such as data fusion [Dong and Srivastava, 2015, Rekatsinas et al., 2017b] and truth discovery [Li et al., 2015b]. In these settings, the task is to estimate the reliability of data sources that provide assertions of facts and determine which facts are likely true. Many approaches to these problems use probabilistic graphical models that are related to Snorkel’s generative model in that they represent the unobserved truth as a latent variable, e.g., the latent truth model [Zhao et al., 2012]. Our setting differs in that labeling functions assign labels to user-provided data, and they may provide any label or abstain, which we must model. Work on data fusion has also explored how to model user-specified correlations among data sources [Pochampally et al., 2014]. Snorkel automatically identifies which correlations among labeling functions to model.

2.7 Conclusion

Snorkel provides a new paradigm for soliciting and managing weak supervision to create training data sets. In Snorkel, users provide higher-level supervision in the form of labeling functions that capture domain knowledge and resources, without having to carefully manage the noise and conflicts inherent in combining weak supervision sources. Our evaluations demonstrate that Snorkel significantly reduces the cost and difficulty of training powerful machine learning models while exceeding prior weak supervision methods and approaching the quality of large, hand-labeled training sets. Snorkel’s deployments in industry, research labs, and government agencies show that it has real-world impact, offering developers an improved way to build models. Thus, we confidently affirm that weak supervision from the high-level abstraction of labeling

functions can be used to train high-performance machine learning models.

Chapter 3

Weak Supervision from Primitives

In Chapter 2, we answered the question of whether it is possible to train high-quality machine learning models without labels generated by hand. Indeed, we found that users can supervise with code instead, using the programming abstraction of labeling functions to programmatically generate labels. However, many types of data are not easily supervised with just a couple of lines of Python code. For example, while pattern-matching LFs based on regular expressions are convenient and often effective for supervising raw text documents, they are incapable of capturing many of the patterns that humans rely on when parsing information from richly formatted documents (e.g., PDFs, spreadsheets, webpages, etc.), such as visual alignments, formatting cues, and document structure. To successfully extend Snorkel to this very difficult setting, we created Fonduer, a system which raises the level of abstraction once more from writing custom labeling logic in code snippets to composing labeling functions from rich primitives generated a priori with an advanced parser. This alleviates much of the cognitive and coding burden on subject matter experts so that they can focus more on expressing their domain expertise and less on generating hundreds or thousands of lines of code to expose the types of relationships they would like to refer to at supervision time. We explore this setting with richly formatted data in the context of knowledge base construction, once again demonstrating that weak supervision from an even higher-level abstraction (rich programmatic primitives) can be used to train high-performance machine learning models. The work in this chapter is the result of collaboration with Sen Wu, Luke Hsiao, Xiao Cheng, Theodoros Rekatsinas, Phil Levis, Christopher Ré, and others. It draws on content from the following Fonduer-related publications: [Wu et al., 2018, Kuleshov et al., 2019].

3.1 Introduction

Knowledge base construction (KBC) is the process of populating a database with information from data such as text, tables, images, or video. Extensive efforts have been made to build large, high-quality knowledge bases (KBs), such as Freebase [Bollacker et al., 2008], YAGO [Suchanek et al., 2008], IBM [Brown et al.,

Figure 3.1: A KBC task to populate relation HasCollectorCurrent(Transistor Part, Current) from datasheets. Part and Current mentions are in blue and green, respectively.

2013, Ferrucci et al., 2010], PharmGKB [Hewett et al., 2002], and Google Knowledge Graph [Singhal, 2012]. Traditionally, KBC solutions have focused on relation extraction from unstructured text [Shin et al., 2015, Madaan et al., 2016, Nakashole et al., 2011, Yahya et al., 2014]. These KBC systems already support a broad range of downstream applications such as information retrieval, question answering, medical diagnosis, and data visualization. However, troves of information remain untapped in richly formatted data, where relations and attributes are expressed via combinations of textual, structural, tabular, and visual cues. In these scenarios, the semantics of the data are significantly affected by the organization and layout of the document. Examples of richly formatted data include webpages, business reports, product specifications, and scientific literature. We use the following example to demonstrate KBC from richly formatted data.

Example 3.1.1 (HasCollectorCurrent). We are given a collection of transistor datasheets (like the one shown in Figure 3.1), and we want to build a KB of their maximum collector currents.1 The output KB can power a tool that verifies that transistors do not exceed their maximum ratings in a circuit. Figure 3.1 shows how relevant information is located in both the document header and table cells and how their relationship is expressed using semantics from multiple modalities.

The heterogeneity of signals in richly formatted data poses a major challenge for existing KBC systems. The above example shows how KBC systems that focus on text data—and adjacent textual contexts such as

1 Transistors are semiconductor devices often used as switches or amplifiers. Their electrical specifications are published by manufacturers in datasheets.

sentences or paragraphs—can miss important information due to this breadth of signals in richly formatted data. We review the major challenges of KBC from richly formatted data.

Challenges. KBC on richly formatted data poses a number of challenges beyond those present with un- structured data: (1) accommodating prevalent document-level relations, (2) capturing the multimodality of information in the input data, and (3) addressing the tremendous data variety.

Prevalent Document-Level Relations We define the context of a relation as the scope of information that needs to be considered when extracting the relation. Context can range from a single sentence to a whole document. KBC systems typically limit the context to a few sentences or a single table, assuming that relations are expressed relatively locally. However, for richly formatted data, many relations rely on the context of a whole document.

Example 3.1.2 (Document-Level Relations). In Figure 3.1, transistor parts are located in the document header (boxed in blue), and the collector current value is in a table cell (boxed in green). Moreover, the interpretation of some numerical values depends on their units reported in another table column (e.g., 200 mA).

Limiting the context scope to a single sentence or table misses many potential relations—up to 97% in the ELECTRONICS application. On the other hand, considering all possible entity pairs in a document as candidates renders the extraction problem computationally intractable due to the combinatorial explosion of candidates.

Multimodality Classical KBC systems model input data as unstructured text [Madaan et al., 2016, Shin et al., 2015, Mintz et al., 2009a]. In richly formatted data, semantics are expressed via multiple different modalities or classes of features— textual, structural, tabular, and visual.

Example 3.1.3 (Multimodality). In Figure 3.1, important information (e.g., the transistor names in the header) is expressed in larger, bold fonts (displayed in yellow). Furthermore, the meaning of a table entry depends on other entries with which it is visually or tabularly aligned (shown by the red arrow). For instance, the semantics of a numeric value is specified by an aligned unit.

Semantics from different modalities can vary significantly but can convey complementary information.

Data Variety With richly formatted data, there are two primary sources of data variety: (1) format variety (e.g., file or table formatting) and (2) stylistic variety (e.g., linguistic variation).

Example 3.1.4 (Data Variety). In Figure 3.1, numeric intervals are expressed as “-65 . . . 150,” but other datasheets show intervals as “-65 ∼ 150,” or “-65 to 150.” Similarly, tables can be formatted with a variety of spanning cells, header hierarchies, and layout orientations.

Data variety requires KBC systems to adopt data models that are generalizable and robust against heterogeneous input data.

Our Approach. We introduce Fonduer, a machine-learning-based system for KBC from richly formatted data. Fonduer takes as input richly formatted documents, which may be of diverse formats, including PDF, HTML, and XML. Fonduer parses the documents and analyzes the corresponding multimodal, document-level contexts to extract relations. The final output is a knowledge base with the relations classified to be correct. Fonduer’s machine-learning-based approach must tackle a series of technical challenges.

Technical Challenges The challenges in designing Fonduer are: (1) Reasoning about relation candidates that are manifested in heterogeneous formats (e.g., text and tables) and span an entire document requires Fonduer’s machine-learning model to analyze heterogeneous, document-level context. While deep-learning models such as recurrent neural networks [Bahdanau et al., 2014] are effective with sentence- or paragraph-level context [Li et al., 2015a], they fall short with document-level context, such as contexts that span both textual and visual features (e.g., information conveyed via fonts or alignment). Developing such models is an open challenge and active area of research [LeCun et al., 2015]. (2) The heterogeneity of contexts in richly formatted data magnifies the need for large amounts of training data. Manual annotation is prohibitively expensive, especially when domain expertise is required. At the same time, human-curated KBs, which can be used to generate training data, may exhibit low coverage or not exist altogether. Alternatively, weak supervision sources can be used to programmatically create large training sets, but it is often unclear how to consistently apply these sources to richly formatted data. Whereas patterns in unstructured data can be identified based on text alone, expressing patterns consistently across different modalities in richly formatted data is challenging. (3) Considering candidates across an entire document leads to a combinatorial explosion of possible candidates, and thus random variables, which need to be considered during learning and inference. This leads to a fundamental tension between building a practical KBC system and learning accurate models that exhibit high recall. In addition, the combinatorial explosion of possible candidates results in a large class imbalance, where the number of “True” candidates is much smaller than the number of “False” candidates. Therefore, techniques that prune candidates to balance running time and end-to-end quality are required.

Technical Contributions Our main contributions are as follows: (1) To account for the breadth of signals in richly formatted data, we design a new data model that preserves structural and semantic information across different data modalities. The role of Fonduer’s data model is twofold: (a) to allow users to specify multimodal domain knowledge that Fonduer leverages to automate the KBC process over richly formatted data, and (b) to provide Fonduer’s machine-learning model with the necessary representation to reason about document-wide context (see Section 3.3). (2) We empirically show that existing deep-learning models [Zhang et al., 2016a] tailored for text information extraction (such as long short-term memory (LSTM) networks [Hochreiter and Schmidhuber, 1997]) struggle to capture the multimodality of richly formatted data. We introduce a multimodal LSTM network that combines textual context with universal features that correspond to structural and visual properties of the input documents. These features are captured by Fonduer’s data model and are generated automatically (see Section 3.4.2). We also introduce a series of data layout optimizations to ensure the scalability of Fonduer to

millions of document-wide candidates (see Appendix B.3). (3) Fonduer introduces a programming model in which no development cycles are spent on feature engineering. Users only need to specify candidates, the potential entries in the target KB, and provide lightweight supervision rules which capture a user’s domain knowledge to programmatically label subsets of candidates, which are used to train Fonduer’s deep-learning model (see Section 3.4.3). We conduct a user study to evaluate Fonduer’s programming model. We find that when working with richly formatted data, users rely on semantics from multiple modalities of the data, including both structural and textual information in the document. Our study demonstrates that given 30 minutes, Fonduer’s programming model allows users to attain F1 scores that are 23 points higher than supervision via manual labeling candidates (see Section 3.6).

Summary of Results. Fonduer-based systems are in production in a range of academic and industrial use cases, including a major online retailer. Fonduer introduces several advancements over prior KBC systems (see Appendix 3.8): (1) In contrast to prior systems that focus on adjacent textual data, Fonduer can extract document-level relations expressed in diverse formats, ranging from textual to tabular formats; (2) Fonduer reasons about multimodal context, i.e., both textual and visual characteristics of the input documents, to extract more accurate relations; (3) In contrast to prior KBC systems that rely heavily on feature engineering to achieve high quality [Ré et al., 2014], Fonduer obviates the need for feature engineering by extending a bidirectional LSTM—a strong, standard deep-learning baseline in natural language processing—to obtain a representation needed to automate relation extraction from richly formatted data. We evaluate Fonduer in four real-world applications of richly formatted information extraction and show that Fonduer enables users to build high-quality KBs, achieving an average improvement of 41 F1 points over state-of-the-art KBC systems.

3.2 Background

3.2.1 Knowledge Base Construction

The input to a KBC system is a corpus of documents. The output of the system is a relational database containing facts extracted from the input and stored in an appropriate schema. To describe this process, we adopt terminology from the KBC community. There are four types of objects that play key roles in KBC systems: (1) entities, (2) relations, (3) entity mentions, and (4) relation mentions. An entity e in a knowledge base represents a distinct real-world person, place, or object. Entities can be grouped into different entity types T1, T2,..., Tn. Entities also participate in relationships. A relationship between n entities is represented as an n-ary relation R(e1, e2,..., en) and is described by a schema

SR(T1, T2, ..., Tn), where ei ∈ Ti. A mention m is a span of text that refers to an entity. A relation mention candidate (referred to as a candidate in this chapter) is an n-ary tuple c = (m1, m2, ..., mn) that represents a potential instance of a relation R(e1, e2, ..., en). A candidate classified as true is called a relation mention, denoted by rR.

Example 3.2.1 (KBC). Consider the HasCollectorCurrent task in Figure 3.1. Fonduer takes a corpus of transistor datasheets as input and constructs a KB containing the (Transistor Part, Current) binary relation as output. Parts like SMBT3904 and Currents like 200mA are entities. The spans of text that read “SMBT3904” and “200” (boxed in blue and green, respectively) are mentions of those two entities, and together they form a candidate. If the evidence in the document suggests that these two mentions are related, then the output KB will include the relation mention (SMBT3904, 200mA) of the HasCollectorCurrent relation.

Knowledge base construction is defined as follows. Given a set of documents D and a KB schema

SR(T1, T2,..., Tn), where each Ti corresponds to an entity type, extract a set of relation mentions rR from D, which populate the schema’s relational tables.
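To make these definitions concrete, the following sketch shows one simple way to represent mentions and candidates as tuples; the class names and field values here are purely illustrative and are not Fonduer's actual data structures.

from typing import NamedTuple, Tuple

class Mention(NamedTuple):
    text: str        # the span of text, e.g., "SMBT3904"
    context_id: str  # pointer to the sentence or cell that contains it

class Candidate(NamedTuple):
    mentions: Tuple[Mention, ...]  # (m1, ..., mn), one mention per relation argument

# A potential instance of HasCollectorCurrent(Transistor Part, Current)
cand = Candidate((Mention("SMBT3904", "header-sentence-1"),
                  Mention("200", "table-cell-7")))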

Figure 3.2: An overview of Fonduer KBC over richly formatted data. Given a set of richly formatted documents and a series of lightweight inputs from the user, Fonduer extracts facts and stores them in a relational database.

Like other machine-learning-based KBC systems [Carlson et al., 2010, Shin et al., 2015], Fonduer converts KBC to a statistical learning and inference problem: each candidate is assigned a Boolean random variable that can take the value “True” if the corresponding relation mention is correct, or “False” otherwise. In machine-learning-based KBC systems, each candidate is associated with certain features that provide evidence for the value that the corresponding random variable should take. Machine-learning-based KBC systems use machine learning to maximize the probability of correctly classifying candidates, given their features and ground truth examples.

3.2.2 Recurrent Neural Networks

The machine-learning model we use in Fonduer is based on a recurrent neural network (RNN). RNNs have obtained state-of-the-art results in many natural-language processing (NLP) tasks, including information extraction [Graves et al., 2009, 2013, Wu et al., 2016]. RNNs take sequential data as input. For each element in the input sequence, the information from previous inputs can affect the network output for the current element.

For sequential data $\{x_1, \ldots, x_T\}$, the structure of an RNN is mathematically described as:

$$h_t = f(x_t, h_{t-1}), \qquad y = g(\{h_1, \ldots, h_T\})$$

where $h_t$ is the hidden state for element $t$, and $y$ is the representation generated by the sequence of hidden states $\{h_1, \ldots, h_T\}$. Functions $f$ and $g$ are nonlinear transformations. For RNNs, we have $f = \tanh(W_h x_t + U_h h_{t-1} + b_h)$, where $W_h$ and $U_h$ are parameter matrices and $b_h$ is a vector. Function $g$ is typically task-specific.
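The recurrence above is straightforward to express in code. The following is a small illustrative sketch in NumPy (not Fonduer's implementation), using mean pooling as a stand-in for the task-specific function g:

import numpy as np

def rnn_forward(xs, W_h, U_h, b_h):
    # xs: list of input vectors x_1, ..., x_T; returns hidden states h_1, ..., h_T
    h = np.zeros(U_h.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_h @ x + U_h @ h + b_h)  # h_t = f(x_t, h_{t-1})
        hs.append(h)
    return hs

def g(hs):
    # y = g({h_1, ..., h_T}); mean pooling here, purely for illustration
    return np.mean(hs, axis=0)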

Long Short-term Memory LSTM [Hochreiter and Schmidhuber, 1997] networks are a special type of RNN that introduce new structures referred to as gates, which control the flow of information and can capture long-term dependencies. There are three types of gates: input gates $i_t$ control which values are updated in a memory cell; forget gates $f_t$ control which values remain in memory; and output gates $o_t$ control which values in memory are used to compute the output of the cell. The final structure of an LSTM is given below.

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \circ \tanh(c_t)$$

where $c_t$ is a cell state vector, the $W$, $U$, and $b$ are parameter matrices and vectors, $\sigma$ is the sigmoid function, and $\circ$ is the Hadamard product.

Bidirectional LSTMs consist of forward and backward LSTMs. The forward LSTM $f^F$ reads the sequence from $x_1$ to $x_T$ and calculates a sequence of forward hidden states $(h_1^F, \ldots, h_T^F)$. The backward LSTM $f^B$ reads the sequence from $x_T$ to $x_1$ and calculates a sequence of backward hidden states $(h_1^B, \ldots, h_T^B)$. The final hidden state for the sequence is the concatenation of the forward and backward hidden states, e.g., $h_i = [h_i^F, h_i^B]$.

Attention Previous work explored using pooling strategies to train an RNN, such as max pooling [Verga et al., 2016], which compresses the information contained in potentially long input sequences to a fixed-length internal representation by considering all parts of the input sequence impartially. This compression of information can make it difficult for RNNs to learn from long input sequences. In recent years, the attention mechanism has been introduced to overcome this limitation by using a soft word-selection process that is conditioned on the global information of the sentence [Bahdanau et al., 2014]. That is, rather than squashing all information from a source input (regardless of its length), this mechanism allows an RNN to pay more attention to the subsets of the input sequence where the most relevant information is concentrated. Fonduer uses a bidirectional LSTM with attention to represent textual features of relation candidates from the documents. We extend this LSTM with features that capture other data modalities.
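For reference, a single step of the LSTM cell defined above can be sketched as follows (illustrative NumPy code, not Fonduer's implementation); the dictionary p holds the parameter matrices and bias vectors, and * denotes the element-wise (Hadamard) product:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t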

Figure 3.3: Fonduer’s data model.

3.3 The Fonduer Framework

An overview of Fonduer is shown in Figure 3.2. Fonduer takes as input a collection of richly formatted documents and a collection of user inputs. It follows a machine-learning-based approach to extract relations from the input documents. The relations extracted by Fonduer are stored in a target knowledge base. We introduce Fonduer’s data model for representing different properties of richly formatted data. We then review Fonduer’s data processing pipeline and describe the new programming paradigm introduced by Fonduer for KBC from richly formatted data. The design of Fonduer was strongly guided by interactions with collaborators (see the user study in Section 3.6). We find that to support KBC from richly formatted data, a unified data model must: • Serve as an abstraction for system and user interaction. • Capture candidates that span different regions (e.g. sections of pages) and data modalities (e.g., textual and tabular data). • Represent the formatting variety in richly formatted data sources in a unified manner. Fonduer introduces a data model that satisfies these requirements.

3.3.1 Fonduer’s Data Model

Fonduer’s data model is a directed acyclic graph (DAG) that contains a hierarchy of contexts, whose structure reflects the intuitive hierarchy of document components. In this graph, each node is a context (represented as boxes in Figure 3.3). The root of the DAG is a Document, which contains Section contexts. Each Section is divided into: Texts, Tables, and Figures. Texts can contain multiple Paragraphs; Tables and Figures can

contain Captions; Tables can also contain Rows and Columns, which are in turn made up of Cells. Each context ultimately breaks down into Paragraphs that are parsed into Sentences. In Figure 3.3, a downward edge indicates a parent-contains-child relationship. This hierarchy serves as an abstraction for both system and user interaction with the input corpus. In addition, this data model allows us to capture candidates that come from different contexts within a document. For each context, we also store the textual contents, pointers to the parent contexts, and a wide range of attributes from each modality found in the original document. For example, standard NLP pre-processing tools are used to generate linguistic attributes, such as lemmas, parts of speech tags, named entity recognition tags, dependency paths, etc., for each Sentence. Structural and tabular attributes of a Sentence, such as tags, and row/column information, and parent attributes, can be captured by traversing its path in the data model. Visual attributes for the document are recorded by storing bounding box and page information for each word in a Sentence.

Example 3.3.1 (Data Model). The data model representing the PDF in Figure 3.1 contains one Section with three children: a Text for the document header, a Text for the description, and a Table for the table itself (with 10 Rows and 4 Columns). Each Cell links to both a Row and Column. Texts and Cells contain Paragraphs and Sentences.

Fonduer’s multimodal data model unifies inputs of different formats, which addresses the data variety that comes from variations in format. To construct the DAG for each document, we extract all the words in their original order. For structural and tabular information, we use tools such as Poppler2 to convert an input file into HTML format; for visual information, such as coordinates and bounding boxes, we use a PDF printer to convert an input file into PDF. If a conversion occurred, we associate the multimodal information in the converted file with all extracted words. We align the word sequences of the converted file with their originals by checking if both their characters and number of repeated occurrences before the current word are the same. Fonduer can recover from conversion errors by using the inherent redundancy in signals from other modalities. This DAG structure also simplifies the variation in format that comes from table formatting.
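The alignment step can be sketched roughly as follows; this is a simplified illustration of the idea rather than Fonduer's actual code. Each word is keyed by its characters together with the number of identical words seen before it, and the two keyed sequences are then matched.

from collections import defaultdict

def align_words(original_words, converted_words):
    # Map indices in the original word sequence to indices in the converted one
    def keyed(words):
        seen = defaultdict(int)
        keys = []
        for w in words:
            keys.append((w, seen[w]))  # (characters, occurrences seen so far)
            seen[w] += 1
        return keys

    converted_index = {key: i for i, key in enumerate(keyed(converted_words))}
    return {i: converted_index.get(key) for i, key in enumerate(keyed(original_words))}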

Takeaways. Fonduer unifies a diverse variety of document formats, types of contexts, and modality semantics into one model in order to address variety inherent in richly formatted data. Fonduer’s data model serves as the formal representation of the intermediate data utilized in all future stages of the extraction process.

3.3.2 User Inputs and Fonduer’s Pipeline

The Fonduer processing pipeline follows three phases. We briefly describe each phase in turn and focus on the user inputs required by each phase. Fonduer’s internals are described in Section 3.4.

2 https://poppler.freedesktop.org

(1) KBC Initialization The first phase in Fonduer’s pipeline is to initialize the target KB where the extracted relations will be stored. During this phase, Fonduer requires the user to specify a target schema that corresponds to the relations to be extracted. The target schema SR(T1,..., Tn) defines a relation R to be extracted from the input documents. An example of such a schema is provided below.

Example 3.3.2 (Relation Schema). An example SQL schema for the relation in Figure 3.1 is:

CREATE TABLE HasCollectorCurrent(
    TransistorPart varchar,
    Current varchar
);

Fonduer uses the user-specified schema to initialize an empty relational database where the output KB will be stored. Furthermore, Fonduer iterates over its input corpus and transforms each document into an instance of Fonduer’s data model to capture the variety and multimodality of richly formatted documents.

(2) Candidate Generation In Phase 2, Fonduer extracts relation candidates from the input documents. Here, users are required to provide two types of input functions: (1) matchers and (2) throttlers.

Matchers To generate candidates for relation R, Fonduer requires that users define matchers for each distinct mention type in schema SR. Matchers are how users specify what a mention looks like. In Fonduer, matchers are Python functions that accept a span of text as input—which has a reference to its data model—and output whether or not the match conditions are met. Matchers range from simple regular expressions to complex functions that account for signals across multiple modalities of the input data and can also incorporate existing methods such as named-entity recognition.

Example 3.3.3 (Matchers). From the HasCollectorCurrent relation in Figure 3.1, users define matchers for each type of the schema. A dictionary of valid transistor parts can be used as the first matcher. For maximum current, users can exploit the pattern that these values are commonly expressed as a numerical value between 100 and 995 for their second matcher.

import re

# Use a dictionary to match transistor parts
def transistor_part_matcher(span):
    return 1 if span in part_dictionary else 0

# Use RegEx to extract numbers between [100, 995]
def max_current_matcher(span):
    return 1 if re.match('[1-9][0-9][0-5]', span) else 0

Throttlers Users can optionally provide throttlers, which act as hard filtering rules to reduce the number of candidates that are materialized. Throttlers are also Python functions, but rather than accepting spans of text as input, they operate on candidates, and output whether or not a candidate meets the specified condition. Throttlers limit the number of candidates considered by Fonduer.

Example 3.3.4 (Throttler). Continuing the example shown in Figure 3.1, the user provides a throttler, which only keeps candidates whose Current has the word “Value” as its column header.

def value_in_column_header(cand):
    return 1 if 'Value' in header_ngrams(cand.current) else 0

Given the input matchers and throttlers, Fonduer extracts relation candidates by traversing its data model representation of each document. By applying matchers to each leaf of the data model, Fonduer can generate sets of mentions for each component of the schema. The cross-product of these mentions produces candidates:

Candidate(id_candidate, mention_1, ..., mention_n), where mentions are spans of text and contain pointers to their context in the data model of their respective document. The output of this phase is a set of candidates, C.
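Putting the pieces of this phase together, candidate generation can be sketched roughly as below. The helper leaf_spans is a hypothetical accessor standing in for the traversal of the data model; the sketch is illustrative, not Fonduer's implementation.

from itertools import product

def generate_candidates(document, matchers, throttlers):
    # One mention set per argument of the relation schema
    mention_sets = [
        [span for span in leaf_spans(document) if matcher(span)]
        for matcher in matchers
    ]
    # The cross-product of the mention sets yields the raw candidates
    candidates = list(product(*mention_sets))
    # Throttlers prune candidates that fail the hard filtering rules
    return [c for c in candidates if all(t(c) for t in throttlers)]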

(3) Training a Multimodal LSTM for KBC In this phase, Fonduer trains a multimodal LSTM network to classify the candidates generated during Phase 2 as “True” or “False” mentions of target relations. Fonduer’s multimodal LSTM combines both visual and textual features. Recent work has also proposed the use of LSTMs for KBC but has focused only on textual data [Zhang et al., 2016a]. In Section 3.5.3, we experimentally demonstrate that state-of-the-art LSTMs struggle to capture the multimodal characteristics of richly formatted data, and thus, obtain poor-quality KBs. Fonduer uses a bidirectional LSTM (see Section 3.2.2) to capture textual features and extends it with additional structural, tabular, and visual features captured by Fonduer’s data model. The LSTM used by Fonduer is described in Section 3.4.2. Training in Fonduer is split into two sub-phases: (1) a multimodal featurization phase and (2) a phase where supervision data is provided by the user.

Multimodal Featurization Fonduer traverses its internal data model instance for each document and automatically generates features that correspond to structural, tabular, and visual modalities as described in Section 3.4.2. These constitute a bare-bones feature library (called feature_lib, below), which augments the textual features learned by the LSTM. All features are stored in a relation:

Features(id_candidate, LSTM_textual, feature_lib_others)

No user input is provided in this step. Fonduer obviates the need for feature engineering and shows that incorporating multimodal information is key to achieving high-quality relation extraction.

Supervision To train its multimodal LSTM, Fonduer requires that users provide some form of supervision. Collecting sufficient training data for multi-context deep-learning models is a well-established challenge. As stated by LeCun et al. [2015], taking into account a context of more than a handful of words for text-based deep-learning models requires very large training corpora. To soften the burden of traditional supervision, Fonduer uses a supervision paradigm referred to as data programming [Ratner et al., 2016b]. Data programming is a human-in-the-loop paradigm for training machine-learning systems. In data programming, users only need to specify lightweight functions, referred to as labeling functions (LFs), that programmatically assign labels to the input candidates. A detailed overview of data programming is provided in Appendix B.1. While existing work on data programming [Ratner et al.,

2017b] has focused on labeling functions over textual data, Fonduer paves the way for specifying labeling functions over richly formatted data. Fonduer requires that users specify labeling functions that label the candidates from Phase 2. Labeling functions in Fonduer are Python functions that take a candidate as input and assign +1 to label it as “True,” −1 to label it as “False,” or 0 to abstain.

Example 3.3.5 (Labeling Functions). Looking at the datasheet in Figure 3.1, users can express patterns such as having the Part and Current y-aligned on the visual rendering of the page. Similarly, users can write a rule that labels a candidate whose Current is in the same row as the word “current” as “True.”

# Rule-based LF based on visual information
def y_axis_aligned(cand):
    return 1 if cand.part.y == cand.current.y else 0

# Rule-based LF based on tabular content
def has_current_in_row(cand):
    return 1 if 'current' in row_ngrams(cand.current) else 0

As shown in Example 3.3.5, Fonduer’s internal data model allows users to specify labeling functions that capture supervision patterns across any modality of the data (see Section 3.4.3). In our user study, we find that it is common for users to write labeling functions that span multiple modalities and consider both textual and visual patterns of the input data (see Section 3.6). The user-specified labeling functions, together with the candidates generated by Fonduer, are passed as input to Snorkel [Ratner et al., 2017b], a data-programming engine, which converts the noisy labels generated by the input labeling functions to denoised labeled data used to train Fonduer’s multimodal LSTM model (see Appendix B.1).

Classification Fonduer uses its trained LSTM to assign a marginal probability to each candidate. The last layer of Fonduer’s LSTM is a softmax classifier (described in Section 3.4.2) that computes the probability of a candidate being a “True” relation. In Fonduer, users can specify a threshold over the output marginal probabilities to determine which candidates will be classified as “True” (those whose marginal probability of being true exceeds the specified threshold) and which are “False” (those whose marginal probability falls below the threshold). This threshold depends on the requirements of the application. Applications that require high accuracy can set a high threshold value to ensure only candidates with a high probability of being “True” are classified as such. As shown in Figure 3.2, supervision and classification are typically executed over several iterations as users develop a KBC application. This feedback loop allows users to quickly receive feedback and improve their labeling functions, and avoids the overhead of rerunning candidate extraction and materializing features (see Section 3.6).
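The thresholding itself is simple. As a sketch, assuming a hypothetical mapping marginals from candidates to their predicted probabilities:

def classify(marginals, threshold=0.5):
    # Split candidates into "True" and "False" relation mentions by marginal probability
    true_mentions = [c for c, p in marginals.items() if p > threshold]
    false_mentions = [c for c, p in marginals.items() if p <= threshold]
    return true_mentions, false_mentions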

3.3.3 Fonduer’s Programming Model for KBC

Fonduer is the first system to provide the necessary abstractions and mechanisms to enable the use of weak

supervision as a means to train a KBC system for richly formatted data. Traditionally, machine-learning-based KBC focuses on feature engineering to obtain high-quality KBs. This requires that users rerun feature extraction, learning, and inference after every modification of the feature set. With Fonduer’s machine-learning approach, features are generated automatically. This puts emphasis on (1) specifying the relation candidates and (2) providing multimodal supervision rules via LFs. This approach allows users to leverage multiple sources of supervision to address data variety introduced by variations in style better than traditional manual labeling [Shin et al., 2015]. Fonduer’s programming paradigm removes the need for feature engineering and introduces two modes of operation for Fonduer applications: (1) development and (2) production. During development, LFs are iteratively improved, in terms of both coverage and accuracy, through error analysis as shown by the blue arrows in Figure 3.2. LFs are applied to a small sample of labeled candidates and evaluated by the user on their accuracy and coverage (the fraction of candidates receiving non-zero labels). To support efficient error analysis, Fonduer enables users to easily inspect the resulting candidates and provides a set of LF metrics, such as coverage, conflict, and overlap, which provide users with a rough assessment of how to improve their LFs. Our users generated a sufficiently tuned set of LFs in about 20 iterations (see Section 3.6). In production, the finalized LFs are applied to the entire set of candidates, and learning and inference are performed only once to generate the final KB. On average, only a small number of LFs are needed to achieve high-quality KBC. For example, in the ELECTRONICS application, 16 LFs, on average, are sufficient to achieve an average F1 score of over 75. We also find that tabular and visual signals are particularly valuable forms of supervision for KBC from richly formatted data, and are complementary to traditional textual signals (see Section 3.6).
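The LF metrics used during this error analysis can be computed directly from the matrix of LF outputs. The following is a minimal sketch, assuming a NumPy array L of shape (number of candidates, number of LFs) with entries in {-1, 0, +1}, where 0 means abstain; it is illustrative rather than the system's actual implementation.

import numpy as np

def lf_metrics(L):
    labeled = L != 0
    n, m = L.shape
    coverage = labeled.mean(axis=0)  # fraction of candidates each LF labels
    overlap = np.zeros(m)
    conflict = np.zeros(m)
    for j in range(m):
        others = np.delete(np.arange(m), j)
        others_labeled = labeled[:, others].any(axis=1)
        # Labeled by LF j and by at least one other LF
        overlap[j] = np.mean(labeled[:, j] & others_labeled)
        # Labeled by LF j while some other LF assigns a different non-abstain label
        disagrees = ((L[:, others] != 0) & (L[:, others] != L[:, [j]])).any(axis=1)
        conflict[j] = np.mean(labeled[:, j] & disagrees)
    return coverage, overlap, conflict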

3.4 KBC in Fonduer

Here, we focus on the implementation of each component of Fonduer. In Appendix B.3 we discuss a series of optimizations that enable Fonduer’s scalability to millions of candidates.

3.4.1 Candidate Generation

Candidate generation from richly formatted data relies on access to document-level contexts, which is provided by Fonduer’s data model. Due to the significantly increased context needed for KBC from richly formatted data, naïvely materializing all possible candidates is intractable as the number of candidates grows combinatorially with the number of relation arguments. This combinatorial explosion can lead to performance issues for KBC systems. For example, in the ELECTRONICS domain, just 100 documents can generate over 1M candidates. In addition, we find that the majority of these candidates do not express true relations, creating a significant class imbalance that can hinder learning performance [Japkowicz and Stephen, 2002]. To address this combinatorial explosion, Fonduer allows users to specify throttlers, in addition to matchers, to prune away excess candidates. We find that throttlers must:

• Maintain high accuracy by only filtering negative candidates.
• Seek high coverage of the candidates.

Throttlers can be viewed as a knob that allows users to trade off precision and recall and promote scalability by reducing the number of candidates to be classified during KBC.


Figure 3.4: Tradeoff between (a) quality and (b) execution time when pruning the number of candidates using throttlers.

Figure 3.4 shows how using throttlers affects the quality-performance tradeoff in the ELECTRONICS domain. We see that throttling significantly improves system performance. However, increased throttling does not monotonically improve quality since it hurts recall. This tradeoff captures the fundamental tension between optimizing for system performance and optimizing for end-to-end quality. When no candidates are pruned, the class imbalance resulting from many negative candidates to the relatively small number of positive candidates harms quality. Therefore, as a rule of thumb, we recommend that users apply throttlers to balance negative and positive candidates. Fonduer provides users with mechanisms to evaluate this balance over a small holdout set of labeled candidates.

Takeaways. Fonduer’s data model is necessary to perform candidate generation with richly formatted data. Pruning negative candidates via throttlers to balance negative and positive candidates not only ensures the scalability of Fonduer but also improves the precision of Fonduer’s output.

3.4.2 Multimodal LSTM Model

We now describe Fonduer’s deep-learning model in detail. Fonduer’s model extends a bidirectional LSTM (Bi-LSTM), a strong, standard deep-learning baseline for NLP, with a simple set of dynamically generated features that capture semantics from the structural, tabular, and visual modalities of the data model. A detailed

list of these features is provided in Appendix B.2. In Section 3.5.3, we perform an ablation study demonstrating that non-textual features are key to obtaining high-quality KBs. We find that the quality of the output KB deteriorates up to 33 F1 points when non-textual features are removed. Figure 3.5 illustrates Fonduer’s LSTM. We now review each component of Fonduer’s LSTM.

Figure 3.5: An illustration of Fonduer’s multimodal LSTM (a Bi-LSTM with attention combined with an extended feature library of structural, tabular, and visual features) for the candidate (SMBT3904, 200) in Figure 3.1.

Bidirectional LSTM with Attention Traditionally, the primary source of signal for relation extraction comes from unstructured text. In order to understand textual signals, Fonduer uses an LSTM network to extract textual features. For mentions, Fonduer builds a Bi-LSTM to get the textual features of the mention from both directions of the sentences containing the candidate. For sentence $s_i$ containing the $i$th mention in the document, the textual features $h_{ik}$ of each word $w_{ik}$ are encoded by both a forward (denoted by superscript $F$ in the equations) and a backward (superscript $B$) LSTM, which summarizes information about the whole sentence with a focus on $w_{ik}$. This takes the structure:

$$h_{ik}^{F} = \mathrm{LSTM}(h_{i(k-1)}^{F}, \Phi(s_i, k))$$
$$h_{ik}^{B} = \mathrm{LSTM}(h_{i(k+1)}^{B}, \Phi(s_i, k))$$
$$h_{ik} = [h_{ik}^{F}, h_{ik}^{B}]$$

where $\Phi(s_i, k)$ is the word embedding [Turian et al., 2010], which is the representation of the semantics of the $k$th word in sentence $s_i$.

The textual feature representation for a mention, $t_i$, is calculated by the following attention mechanism to model the importance of different words from the sentence $s_i$ and to aggregate the feature representation of those words to form a final feature representation:

$$u_{ik} = \tanh(W_w h_{ik} + b_w)$$
$$\alpha_{ik} = \frac{\exp(u_{ik}^{\top} u_w)}{\sum_j \exp(u_{ij}^{\top} u_w)}$$
$$t_i = \sum_j \alpha_{ij} u_{ij}$$

where $W_w$, $u_w$, and $b_w$ are parameter matrices and vectors, $u_{ik}$ is the hidden representation of $h_{ik}$, and $\alpha_{ik}$ models the importance of each word in the sentence $s_i$. Special candidate markers (shown in red in Figure 3.5) are added to the sentences to draw attention to the candidates themselves. Finally, the textual features of a candidate are the concatenation of its mentions’ textual features $[t_1, \ldots, t_n]$.
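The attention pooling above amounts to only a few lines of code. A minimal illustrative sketch in NumPy (not Fonduer's implementation), where H stacks the hidden states of one sentence:

import numpy as np

def attention_pool(H, W_w, b_w, u_w):
    # H: hidden states h_{i1}, ..., h_{iT} as an array of shape (T, d)
    U = np.tanh(H @ W_w.T + b_w)                   # u_{ik} = tanh(W_w h_{ik} + b_w)
    scores = U @ u_w                               # u_{ik}^T u_w
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over the words of the sentence
    return (alpha[:, None] * U).sum(axis=0)        # t_i = sum_j alpha_{ij} u_{ij}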

Extended Feature Library Features for structural, tabular, and visual modalities are generated by leveraging the data model, which preserves each modality’s semantics. For each candidate, such as the candidate (SMBT3904, 200) shown in Figure 3.5, Fonduer locates each mention in the data model and traverses the DAG to compute features from the modality information stored in the nodes of the graph. For example, Fonduer can traverse sibling nodes to add tabular features such as featurizing a node based on the other mentions in the same row or column. Similarly, Fonduer can traverse the data model to extract structural features from tags stored while parsing the document along with the hierarchy of the document elements themselves. We review each modality: Structural features. These provide signals intrinsic to a document’s structure. These features are dynami- cally generated and allow Fonduer to learn from structural attributes, such as parent and sibling relationships and XML/HTML tag metadata found in the data model (shown in yellow in Figure 3.5). The data model also allows Fonduer to track structural distances of candidates, which helps when a candidate’s mentions are visually distant, but structurally close together. Specifically, featurizing a candidate with the distance to the lowest common ancestor in the data model is a positive signal for linking table captions to table contents. Tabular features. These are a subset of structural features since tables are common structures inside documents and have high information density. Table features are drawn from the grid-like representation of rows and columns stored in the data model, shown in green in Figure 3.5. In addition to the tabular location of mentions, Fonduer also featurizes candidates with signals such as being in the same row/column. For example, consider a table that has cells with multiple lines of text; recording that two mentions share a row captures a signal that a visual alignment feature could easily miss. Visual features. These provide signals observed from a visual rendering of a document. In cases where tabular or structural features are noisy—including nearly all documents converted from PDF to HTML by generic tools—visual features can provide a complementary view of the dependencies among text. Visual features encode many highly predictive types of semantic information implicitly, such as position on a page, which may imply when text is a title or header. An example of this is shown in red in Figure 3.5.
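To give a flavor of how such features are read off the data model, the following sketch emits a few structural, tabular, and visual features for a candidate. The accessors row_ngrams, col_ngrams, and lowest_common_ancestor_depth are hypothetical names standing in for data model traversals; this is not Fonduer's actual feature library.

def extended_features(cand):
    part, current = cand.part, cand.current
    # Tabular: words sharing the Current mention's row and column
    for w in row_ngrams(current):
        yield "CURRENT_ROW_WORD_" + w
    for w in col_ngrams(current):
        yield "CURRENT_COL_WORD_" + w
    # Structural: depth of the lowest common ancestor of the two mentions in the DAG
    yield "LCA_DEPTH_" + str(lowest_common_ancestor_depth(part, current))
    # Visual: whether the two mentions are vertically aligned on the rendered page
    if part.y == current.y:
        yield "Y_ALIGNED"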

Training All parameters of Fonduer’s LSTM are jointly trained, including the parameters of the Bi-LSTM

as well as the weights of the last softmax layer that correspond to additional features.

Takeaways. To achieve high-quality KBC with richly formatted data, it is vital to have features from multiple data modalities. These features are only obtainable through traversing and accessing modality attributes stored in the data model.

3.4.3 Multimodal Supervision

Unlike KBC from unstructured text, KBC from richly formatted data requires supervision from multiple modalities of the data. In richly formatted data, useful patterns for KBC are more sparse and hidden in non-textual signals, which motivates the need to exploit overlap and repetition in a variety of patterns over multiple modalities. Fonduer’s data model allows users to directly express correctness using textual, structural, tabular, or visual characteristics, in addition to traditional supervision sources like existing KBs. In the ELECTRONICS domain, over 70% of labeling functions written by our users are based on non-textual signals. It is acceptable for these labeling functions to be noisy and conflict with one another. Data programming theory (see Appendix B.1.2) shows that, with a sufficient number of labeling functions, data programming can still achieve quality comparable to using manually labeled data. In Section 3.5.3, we find that using metadata in the ELECTRONICS domain, such as structural, tabular, and visual cues, results in a 66 F1 point increase over using textual supervision sources alone. Using both sources gives a further increase of 2 F1 points over metadata alone. We also show that supervision using information from all modalities, rather than textual information alone, results in an increase of 43 F1 points, on average, over a variety of domains. Using multiple supervision sources is crucial to achieving high-quality information extraction from richly formatted data.

Takeaways. Supervision using multiple modalities of richly formatted data is key to achieving high end-to- end quality. Like multimodal featurization, multimodal supervision is also enabled by Fonduer’s data model and addresses stylistic data variety.

3.5 Experiments

We evaluate Fonduer in four applications: ELECTRONICS, ADVERTISEMENTS, PALEONTOLOGY, and GENOMICS—each containing several relation extraction tasks. We seek to answer: (1) how does Fonduer compare against both state-of-the-art KBC techniques and manually curated knowledge bases? and (2) how does each component of Fonduer affect end-to-end extraction quality?

3.5.1 Experimental Settings

Datasets. The datasets used for evaluation vary in size and format. Table 3.1 shows a summary of these datasets.

Table 3.1: Summary of the datasets used in our experiments.

Dataset   Size     #Docs   #Rels   Format
ELEC.     3GB      7K      4       PDF
ADS.      52GB     9.3M    4       HTML
PALEO.    95GB     0.3M    10      PDF
GEN.      1.8GB    589     4       XML

Electronics The ELECTRONICS dataset is a collection of single bipolar transistor specification datasheets from over 20 manufacturers, downloaded from Digi-Key.3 These documents consist primarily of tables and express relations containing domain-specific symbols. We focus on the relations between transistor part numbers and several of their electrical characteristics. We use this dataset to evaluate Fonduer on data consisting primarily of tables and numerical data.

Advertisements The ADVERTISEMENTS dataset contains webpages that may contain evidence of human trafficking activity. These webpages may provide prices of services, locations, contact information, physical characteristics of the victims, etc. We extract attributes associated with these trafficking ads. The output is deployed in production and is used by law enforcement agencies. This is a heterogeneous dataset containing millions of webpages over 692 web domains in which users create customized ads, resulting in 100,000s of unique layouts. We use this dataset to examine the robustness of Fonduer to the presence of significant data variety.

Paleontology The PALEONTOLOGY dataset is a collection of well-curated paleontology journal articles on fossils and ancient organisms. Here, we extract relations between paleontological discoveries and their corresponding physical measurements. These papers often contain tables spanning multiple pages. Thus, achieving high quality in this application requires linking content in tables to the text that references it, which can be separated by 20 pages or more in the document. We use this dataset to test Fonduer’s ability to draw candidates from document-level contexts.

Genomics The GENOMICS dataset is a collection of open-access biomedical papers on gene-wide association studies (GWAS) from the manually curated GWAS Catalog [Welter et al., 2014]. Here, we extract relations between single-nucleotide polymorphisms and human phenotypes found to be statistically significant. This dataset is published in XML format, thus, we do not have visual representations. We use this dataset to evaluate how well the Fonduer framework extracts relations from data that is published natively in a tree-based format.

Comparison Methods. We use two different methods to evaluate the quality of Fonduer’s output: the upper bound of state-of-the-art KBC systems (Oracle) and manually curated knowledge bases (Existing Knowledge Bases).

3 https://www.digikey.com

Oracle Existing state-of-the-art information extraction (IE) methods focus on either textual data or semi-structured and tabular data. We compare Fonduer against both types of IE methods. Each IE method can be split into (1) a candidate generation stage and (2) a filtering stage, the latter of which eliminates false positive candidates. For comparison, we approximate the upper bound of quality of three state-of-the-art information extraction techniques by experimentally measuring the recall achieved in the candidate generation stage of each technique and assuming that all candidates found using a particular technique are correct. That is, we assume the filtering stage is perfect by assuming a precision of 1.0.
• Text: We consider IE methods over text [Shin et al., 2015, Madaan et al., 2016]. Here, candidates are extracted from individual sentences, which are pre-processed with standard NLP tools to add part-of-speech tags, linguistic parsing information, etc.
• Table: For tables, we use an IE method for semi-structured data [Barowy et al., 2015]. Candidates are drawn from individual tables by utilizing table content and structure.
• Ensemble: We also implement an ensemble (proposed in [Dong et al., 2014]) as the union of candidates generated by Text and Table.

Existing Knowledge Base We use existing knowledge bases as another comparison method. The ELECTRONICS application is compared against the transistor specifications published by Digi-Key, while GENOMICS is compared to both GWAS Central [Beck et al., 2014] and GWAS Catalog [Welter et al., 2014], which are the most comprehensive collections of GWAS data and widely-used public datasets. Knowledge bases such as these are constructed using a combination of manual entry, web aggregation, paid third-party services, and automation tools.

Fonduer Details. Fonduer is implemented in Python, with database operations being handled by PostgreSQL. All experiments are executed in Jupyter Notebooks on a machine with four CPUs (each CPU is a 14-core 2.40 GHz Xeon E5-4657L), 1 TB RAM, and 12 × 3 TB hard drives, with the Ubuntu 14.04 operating system.

3.5.2 Experimental Results

Oracle Comparison

We compare the end-to-end quality of Fonduer to the upper bound of state-of-the-art systems. In Table 3.2, we see that Fonduer outperforms these upper bounds for each dataset. In ELECTRONICS, Fonduer results in a significant improvement of 71 F1 points over a text-only approach. In contrast, ADVERTISEMENTS has a higher upper bound with text than with tables, which reflects how advertisements rely more on text than the largely numerical tables found in ELECTRONICS. In the PALEONTOLOGY dataset, which depends on linking references from text to tables, the unified approach of Fonduer results in an increase of 43 F1 points over the Ensemble baseline. In GENOMICS, all candidates are cross-context, preventing both the text-only and the table-only approaches from finding any valid candidates.

Table 3.2: End-to-end quality in terms of precision, recall, and F1 score for each application compared to the upper bound of state-of-the-art systems.

Sys.     Metric   Text    Table   Ensemble   Fonduer
ELEC.    Prec.    1.00    1.00    1.00       0.73
         Rec.     0.03    0.20    0.21       0.81
         F1       0.06    0.40    0.42       0.77
ADS.     Prec.    1.00    1.00    1.00       0.87
         Rec.     0.44    0.37    0.76       0.89
         F1       0.61    0.54    0.86       0.88
PALEO.   Prec.    0.00    1.00    1.00       0.72
         Rec.     0.00    0.04    0.04       0.38
         F1       0.00*   0.08    0.08       0.51
GEN.     Prec.    0.00    0.00    0.00       0.89
         Rec.     0.00    0.00    0.00       0.81
         F1       0.00#   0.00#   0.00#      0.85
* Text did not find any candidates.
# No full tuples could be created using Text or Table alone.

Table 3.3: End-to-end quality vs. existing knowledge bases.

System                        ELEC.       GEN.
Knowledge Base                Digi-Key    GWAS Central   GWAS Catalog
# Entries in KB               376         3,008          4,023
# Entries in Fonduer          447         6,420          6,420
Coverage                      0.99        0.82           0.80
Accuracy                      0.87        0.87           0.89
# New Correct Entries         17          3,154          2,486
Increase in Correct Entries   1.05×       1.87×          1.42×

Existing Knowledge Base Comparison

We now compare Fonduer against existing knowledge bases for ELECTRONICS and GENOMICS. No manually curated KBs are available for the other two datasets. In Table 3.3, we find that Fonduer achieves high coverage of the existing knowledge bases, while also correctly extracting novel relation entries with over 85% accuracy in both applications. In ELECTRONICS, Fonduer achieved 99% coverage and extracted an additional 17 correct entries not found in Digi-Key’s catalog. In the GENOMICS application, we see that Fonduer provides over 80% coverage of both existing KBs and finds 1.87× and 1.42× more correct entries than GWAS Central and GWAS Catalog, respectively.

Takeaways. Fonduer achieves over 41 F1 points higher quality on average when compared against the upper bound of state-of-the-art approaches. Furthermore, Fonduer attains over 80% of the data in existing public knowledge bases while providing up to 1.87× the number of correct entries with high accuracy.

Figure 3.6: Average F1 score over four relations when broadening the extraction context scope in ELECTRONICS (Sentence 0.06, Table 0.30, Page 0.66, Document 0.77).

3.5.3 Ablation Studies

We conduct ablation studies to assess the effect of context scope, multimodal features, featurization approaches, and multimodal supervision on the quality of Fonduer. In each study, we change one component of Fonduer and hold the others constant.

Context Scope Study

To evaluate the importance of addressing the non-local nature of candidates in richly formatted data, we analyze how the different context scopes contribute to end-to-end quality. We limit the extracted candidates to four levels of context scope in ELECTRONICS and report the average F1 score for each. Figure 3.6 shows that increasing context scope can significantly improve the F1 score. Considering document context gives an additional 71 F1 points (12.8×) over sentence contexts and 47 F1 points (2.6×) over table contexts. The positive correlation between quality and context scope matches our expectations, since larger context scope is required to form candidates jointly from both table content and surrounding text. We see a smaller increase of 11 F1 points (1.2×) in quality between page and document contexts since many of the ELECTRONICS relation mentions are presented on the first page of the document.

Takeaways. Semantics can be distributed in a document or implied in its structure, thus requiring larger context scope than the traditional sentence-level contexts used in previous KBC systems.

Feature Ablation Study

We evaluate Fonduer’s multimodal features. We analyze how different features benefit information extraction from richly formatted data by comparing the effects of disabling one feature type while leaving all other types enabled, and report the average F1 scores of each configuration in Figure 3.7.

Figure 3.7: The impact of each modality in the feature library (average F1 for each dataset when all features are enabled versus when structural, visual, textual, or tabular features are disabled in turn).

We find that removing a single feature set resulted in drops of 2 F1 points (no textual features in PALEONTOLOGY) to 33 F1 points (no textual features in ADVERTISEMENTS). While it is clear in Figure 3.7 that each application depends on different feature types, we find that it is necessary to incorporate all feature types to achieve the highest extraction quality.

The characteristics of each dataset affect how valuable each feature type is to relation classification. The ADVERTISEMENTS dataset consists of webpages that often use tables to format and organize information—many relations can be found within the same cell or phrase. This heavy reliance on textual features is reflected by the drop of 33 F1 points when textual features are disabled. In ELECTRONICS, both components of the (part, attribute) tuples we extract are often isolated from other text. Hence, we see a small drop of 5 F1 points when textual features are disabled. We see a drop of 21 F1 points when structural features are disabled in the PALEONTOLOGY application due to its reliance on structural features to link between formation names (found in text sections or table captions) and the table itself. Finally, we see similar decreases when disabling structural and tabular features in the GENOMICS application (24 and 29 F1 points, respectively). Because this dataset is published natively in XML, structural and tabular features are almost perfectly parsed, which results in similar impacts of these features.

Takeaways. It is necessary to utilize multimodal features to provide a robust, domain-agnostic description for real-world data.

Featurization Study

We compare Fonduer’s multimodal featurization with: (1) a human-tuned multimodal feature library that leverages Fonduer’s data model, requiring feature engineering; (2) a Bi-LSTM with attention model; this RNN considers textual features only; (3) a machine-learning-based system for information extraction, referred to as SRV, which relies on HTML features [Freitag, 1998]; and (4) a document-level RNN [Li et al., 2015a], which learns a representation over all available modes of information captured by Fonduer’s data model. We find that:
• Fonduer’s automatic multimodal featurization approach produces results that are comparable to manually-tuned feature representations requiring feature engineering. Fonduer’s neural network is able to extract relations with a quality comparable to the human-tuned approach in all datasets, differing by no more than 2 F1 points (see Table 3.4).
• Fonduer’s RNN significantly outperforms a standard, out-of-the-box Bi-LSTM. The F1 score obtained by Fonduer’s multimodal RNN model is 1.7× to 2.2× higher than that of a typical Bi-LSTM (see Table 3.4).
• Fonduer outperforms extraction systems that leverage HTML features alone. Table 3.5 shows a comparison between Fonduer and SRV [Freitag, 1998] in the ADVERTISEMENTS domain—the only one of our datasets with HTML documents as input. Fonduer’s features capture more information than SRV’s HTML-based features, which only capture structural and textual information. This results in 2.3× higher quality.
• Using a document-level RNN to learn a single representation across all possible modalities results in neural networks with structures that are too large and too unique to batch effectively. This leads to slow runtime during training and poor-quality KBs. In Table 3.6, we compare the performance of a document-level RNN [Li et al., 2015a] and Fonduer’s approach of appending non-textual information in the last layer of the model (a sketch of this late-fusion design follows this list). As shown, Fonduer’s multimodal RNN obtains an F1 score that is almost 3× higher while being three orders of magnitude faster to train.
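To make that last comparison concrete, the following is a minimal PyTorch-style sketch of the late-fusion design referenced above: the candidate text is encoded with a Bi-LSTM and the structural, tabular, and visual features are concatenated only at the final layer. The module and parameter names are illustrative assumptions, not Fonduer’s actual implementation.

import torch
import torch.nn as nn

class LateFusionRelationClassifier(nn.Module):
    """Sketch: encode candidate text with a Bi-LSTM, then append the
    non-textual (structural/tabular/visual) features just before the
    output layer, instead of threading them through the recurrent encoder."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100,
                 n_other_feats=500, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # The final layer sees both the text representation and the
        # non-textual multimodal features (late fusion).
        self.out = nn.Linear(2 * hidden_dim + n_other_feats, n_classes)

    def forward(self, token_ids, other_feats):
        # token_ids: (batch, seq_len); other_feats: (batch, n_other_feats),
        # e.g., indicators for row/column headers, page position, HTML tags.
        x = self.embed(token_ids)
        _, (h, _) = self.lstm(x)
        text_repr = torch.cat([h[0], h[1]], dim=1)  # forward + backward final states
        return self.out(torch.cat([text_repr, other_feats], dim=1))

Keeping the sequence model over text only is what keeps the network small and batchable, in contrast to a document-level RNN over every modality.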

Takeaways. Direct feature engineering is unnecessary when utilizing deep learning as a basis to obtain the feature representation needed to extract relations from richly formatted data.

Supervision Ablation Study

We study how quality is affected when using only textual LFs, only metadata LFs, and the combination of the two sets of LFs. Textual LFs only operate on textual modality characteristics while metadata LFs operate on structural, tabular, and visual modality characteristics. Figure 3.8 shows that applying metadata-based LFs achieves higher quality than traditional textual-level LFs alone. The highest quality is achieved when both types of LFs are used. In ELECTRONICS, we see an increase of 66 F1 points (9.2×) when using metadata LFs and a 3 F1 point (1.04×) improvement over metadata LFs when both types are used.

Table 3.4: Comparing approaches to featurization based on Fonduer’s data model.

Sys.     Metric   Human-tuned   Bi-LSTM w/ Attn.   Fonduer
ELEC.    Prec.    0.71          0.42               0.73
         Rec.     0.82          0.50               0.81
         F1       0.76          0.45               0.77
ADS.     Prec.    0.88          0.51               0.87
         Rec.     0.88          0.43               0.89
         F1       0.88          0.47               0.88
PALEO.   Prec.    0.92          0.52               0.76
         Rec.     0.37          0.15               0.38
         F1       0.53          0.23               0.51
GEN.     Prec.    0.92          0.66               0.89
         Rec.     0.82          0.41               0.81
         F1       0.87          0.47               0.85

Table 3.5: Comparing the features of SRV and Fonduer.

Feature Model   Precision   Recall   F1
SRV             0.72        0.34     0.39
Fonduer         0.87        0.89     0.88

Because the ELECTRONICS dataset relies more heavily on distant signals, LFs that can label correctness based on column or row header content significantly improve extraction quality. The ADVERTISEMENTS application benefits equally from metadata and textual LFs. Yet, we get an increase of 20 F1 points (1.2×) when both types of LFs are applied. The PALEONTOLOGY and GENOMICS applications show more moderate increases of 40 (4.6×) and 40 (1.8×) F1 points, respectively, when using both types over only textual LFs.

3.6 User Study

Traditionally, ground truth data is created through manual annotation, crowdsourcing, or other time-consuming methods and then used as data for training a machine-learning model. In Fonduer, we use the data-programming model for users to programmatically generate training data, rather than needing to perform manual annotation—a human-in-the-loop approach. In this section we qualitatively evaluate the effectiveness of our approach compared to traditional human labeling and observe the extent to which users leverage non-textual semantics when labeling candidates.

We conducted a user study with 10 users, where each user was asked to complete the relation extraction task of extracting maximum collector-emitter voltages from the ELECTRONICS dataset. Using the same experimental settings, we compare the effectiveness of two approaches for obtaining training data: (1) manual annotations (Manual) and (2) using labeling functions (LF).

Table 3.6: Comparing document-level RNN and Fonduer’s deep-learning model on a single ELECTRONICS relation.

Learning Model       Runtime during Training (secs/epoch)   Quality (F1)
Document-level RNN   37,421                                  0.26
Fonduer              48                                      0.65

Figure 3.8: Study of different supervision resources’ effect. Metadata includes structural, tabular, and visual modalities.

Figure 3.9: F1 quality over time with 95% confidence intervals (left). Modality distribution of user LFs (right).

We selected users with a basic knowledge of Python but no expertise in the ELECTRONICS domain. Users completed a 20-minute walk-through to familiarize themselves with the interface and procedures. To minimize the effect of cognitive fatigue and familiarity with the task, half of the users performed the task of manually annotating training data first, then the task of writing labeling functions, while the other half performed the tasks in the reverse order. We allotted 30 minutes for each task and evaluated the quality that was achieved using each approach at several checkpoints. For manual annotations, we evaluated every five minutes. We plotted the quality achieved by each user’s labeling functions each time the user performed an iteration of supervision and classification as part of Fonduer’s iterative approach. We filtered out two outliers and report results of eight users.

In Figure 3.9 (left), we report the quality (F1 score) achieved by the two different approaches. The average F1 achieved using manual annotation was 0.26 while the average F1 score using labeling functions was 0.49,

an improvement of 1.9×. We found with statistical significance that all users were able to achieve higher F1 scores using labeling functions than manually annotating candidates, regardless of the order in which they performed the approaches. There are two primary reasons for this trend. First, labeling functions provide a larger set of training data than manual annotations by enabling users to apply patterns they find in the data programmatically to all candidates—a natural desire they often vocalized while performing manual annotation. On average, our users manually labeled 285 candidates in the allotted time, while the labeling functions they created labeled 19,075 candidates. Users provided seven labeling functions on average. Second, labeling functions tend to allow Fonduer to learn more generic features, whereas manual annotations may not adequately cover the characteristics of the dataset as a whole. For example, labeling functions are easily applied to new data.

In addition, we found that for richly formatted data, users relied less on textual information—a primary signal in traditional KBC tasks—and more on information from other modalities, as shown in Figure 3.9 (right). Users utilized the semantics from multiple modalities of the richly formatted data, with 58.5% of their labeling functions using tabular information. This reflects the characteristics of the ELECTRONICS dataset, which contains information that is primarily found in tables. In our study, the most common labeling functions in each modality were the following (illustrative sketches of each style appear after the list):
• Tabular: labeling a candidate based on the words found in the same row or column.
• Visual: labeling a candidate based on its placement in a document (e.g., which page it was found on).
• Structural: labeling a candidate based on its tag names.
• Textual: labeling a candidate based on the textual characteristics of the voltage mention (e.g., magnitude).
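The sketches below illustrate what LFs in each of these styles might look like for the collector-emitter voltage task. The candidate attributes and the helper functions (get_row_ngrams, get_page, get_ancestor_tag_names) are stand-ins for accessors over Fonduer’s data model and should be read as assumptions rather than the system’s exact API; a return value of 0 means the LF abstains.

# Assumptions: each candidate c pairs a part mention c.part with an attribute
# mention c.attr, and c.attr_text holds the attribute's raw text. The get_*
# helpers are hypothetical accessors over Fonduer's data model.

def lf_row_header_mentions_vce(c):
    # Tabular: label positive if the candidate's row header mentions the
    # target attribute (here, the maximum collector-emitter voltage).
    row_words = set(get_row_ngrams(c.attr))
    return 1 if {"collector-emitter", "vceo"} & row_words else 0

def lf_late_page(c):
    # Visual: specification tables rarely appear deep inside a datasheet.
    return -1 if get_page(c.attr) > 2 else 0

def lf_heading_tag(c):
    # Structural: part numbers inside heading tags are usually document
    # titles rather than table entries.
    return -1 if "h1" in get_ancestor_tag_names(c.part) else 0

def lf_voltage_magnitude(c):
    # Textual: implausibly large numbers are unlikely to be this voltage.
    return -1 if float(c.attr_text) > 1000 else 0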

Takeaways. We found that when working with richly formatted data, users relied heavily on non-textual signals to identify candidates and weakly supervise the KBC system. Furthermore, leveraging weak supervision allowed users to create knowledge bases more effectively than traditional manual annotations alone.

3.7 Extensions

The GENOMICS application described in this chapter was explored in greater detail in a follow-up Nature Communications article [Kuleshov et al., 2019]. In that work, we introduced GWASkb, a machine-compiled knowledge base of over 6000 genotype–phenotype associations, constructed using Fonduer. There we include a more in-depth analysis of the types of associations that are recovered by Fonduer and missed by existing manually curated databases. We also create a web interface4 (see Appendix B.5) for searching the contents of GWASkb by genotype or phenotype.

4http://gwaskb.stanford.edu/

3.8 Related Work

Context Scope Existing KBC systems often restrict candidates to specific context scopes such as single sentences [Madaan et al., 2016, Yahya et al., 2014] or tables [Carlson et al., 2010]. Others perform KBC from richly formatted data by ensembling candidates discovered using separate extraction tasks [Dong et al., 2014, Govindaraju et al., 2013], which overlooks candidates composed of mentions that must be found jointly from document-level context scopes.

Multimodality Information extraction systems for unstructured data utilize only textual features [Mintz et al., 2009a]. Recognizing the need to represent layout information as well when working with richly formatted data, various additional feature libraries have been proposed. Some have relied predominantly on structural features, usually in the context of web tables [Tengli et al., 2004, Penn et al., 2001, Pinto et al., 2002, Freitag, 1998]. Others have built systems that rely only on visual information [Gatterbauer et al., 2007, Yang and Zhang, 2001]. There have been instances of visual information being used to supplement a tree-based representation of a document [Kovacevic et al., 2002, Cosulschi et al., 2004], but these systems were designed for other tasks, such as document classification and page segmentation. By utilizing our deep-learning-based featurization approach, which supports all of these representations, Fonduer obviates the need to focus on feature engineering and frees the user to iterate over the supervision and learning stages of the framework.

Supervision Sources Distant supervision is one effective way to programmatically create training data for use in machine learning. In this paradigm, facts from existing knowledge bases are paired with unlabeled documents to create noisy or “weakly” labeled training examples [Mintz et al., 2009a, Min et al., 2013, Nguyen and Moschitti, 2011, Angeli et al., 2014]. In addition to existing knowledge bases, crowdsourcing [Gao et al., 2011] and heuristics from domain experts [Pasupat and Liang, 2014] have also proven to be effective weak supervision sources. In our work, we show that by incorporating all kinds of supervision in one framework in a noise-aware way, we are able to achieve high quality in knowledge base construction. Furthermore, through our programming model, we empower users to add supervision based on intuition from any modality of data.

3.9 Conclusion

In this chapter, we study how to extract information from richly formatted data. We show that the key challenges of this problem are (1) prevalent document-level relations, (2) multimodality, and (3) data variety. To address these, we propose Fonduer, the first KBC system for richly formatted information extraction. We describe Fonduer’s data model, which enables users to perform candidate extraction, multimodal featurization, and multimodal supervision through a simple programming model. We evaluate Fonduer on four real-world domains and show an average improvement of 41 F1 points over the upper bound of state-of-the-art approaches. In some domains, Fonduer extracts up to 1.87× the number of correct relations compared to expert-curated public knowledge bases.

Chapter 4

Weak Supervision from Natural Language

While moving up the stack from manual to programmatic supervision makes supervising a task much more scalable, it also makes it less accessible: anyone can click yes or no when providing a label, but not everyone can code. One way to rectify this is to move up to an even higher level interface. We accomplish this with the system BabbleLabble, demonstrating that the very high-level input of natural language can be effectively compiled down the supervision stack into inputs suitable for training high-quality models. The work in this chapter is the result of collaboration with Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, Christopher Ré, and others. It draws on content from the following BabbleLabble-related publications: [Hancock et al., 2017, 2018].

4.1 Introduction

The standard protocol for obtaining a labeled dataset is to have a human annotator view each example, assess its relevance, and provide a label (e.g., positive or negative for binary classification). However, this only provides one bit of information per example. This invites the question: how can we get more information per example, given that the annotator has already spent the effort reading and understanding an example? Previous works have relied on identifying relevant parts of the input such as labeling features [Druck et al., 2009, Raghavan et al., 2005, Liang et al., 2009a], highlighting rationale phrases in text [Zaidan and Eisner, 2008a, Arora and Nyberg, 2009], or marking relevant regions in images [Ahn et al., 2006]. But there are certain types of information which cannot be easily reduced to annotating a portion of the input, such as the absence of a certain word, or the presence of at least two words. In this work, we tap into the power of natural language and allow annotators to provide supervision to a classifier via natural language explanations. Specifically, we propose a framework in which annotators provide a natural language explanation for each label they assign to an example (see Figure 4.1).


Example: “Both cohorts showed signs of optic nerve toxicity due to ethambutol.”

Label: Does this chemical cause this disease? Y / N

Explanation: Why do you think so? “Because the words ‘due to’ occur between the chemical and the disease.”

Labeling Function:
def lf(x):
    return 1 if "due to" in between(x.chemical, x.disease) else 0

Figure 4.1: In BabbleLabble, the user provides a natural language explanation for each labeling decision. These explanations are parsed into labeling functions that convert unlabeled data into a large labeled dataset for training a classifier.

These explanations are parsed into logical forms representing labeling functions (LFs), functions that heuristically map examples to labels [Ratner et al., 2016b]. The labeling functions are then executed on many unlabeled examples, resulting in a large, weakly-supervised training set that is then used to train a classifier.

Semantic parsing of natural language into logical forms is recognized as a challenging problem and has been studied extensively [Zelle and Mooney, 1996, Zettlemoyer and Collins, 2005, Liang et al., 2011, Liang, 2016]. One of our major findings is that in our setting, even a simple rule-based semantic parser suffices, for three reasons. First, we find that the majority of incorrect LFs can be automatically filtered out either semantically (e.g., is it consistent with the associated example?) or pragmatically (e.g., does it avoid assigning the same label to the entire training set?). Second, LFs near the gold LF in the space of logical forms are often just as accurate (and sometimes even more accurate). Third, techniques for combining weak supervision sources are built to tolerate some noise [Alfonseca et al., 2012a, Takamatsu et al., 2012a, Ratner et al., 2017b]. The significance of this is that we can deploy the same semantic parser across tasks without task-specific training. We show how we can tackle a real-world biomedical application with the same semantic parser used to extract instances of spouses.

Our work is most similar to that of Srivastava et al. [2017], who also use natural language explanations to train a classifier, but with two important differences. First, they jointly train a task-specific semantic parser and classifier, whereas we use a simple rule-based parser. In Section 4.4, we find that in our weak supervision framework, the rule-based semantic parser and the perfect parser yield nearly identical downstream performance. Second, while they use the logical forms of explanations to produce features that are fed directly to a classifier, we use them as functions for labeling a much larger training set. In Section 4.4, we show that using functions yields a 9.5 F1 improvement (26% relative improvement) over features, and that the F1 score scales with the amount of available unlabeled data.

Figure 4.2: Natural language explanations are parsed into candidate labeling functions (LFs). Many incorrect LFs are filtered out automatically by the filter bank. The remaining functions provide heuristic labels over the unlabeled dataset, which are aggregated into one noisy label per example, yielding a large, noisily-labeled training set for a classifier.

We validate our approach on two existing datasets from the literature (extracting spouses from news articles and disease-causing chemicals from biomedical abstracts) and one real-world use case with our biomedical collaborators at OccamzRazor to extract protein-kinase interactions related to Parkinson’s disease from text. We find empirically that users are able to train classifiers with comparable F1 scores up to 100× faster when they provide natural language explanations instead of individual labels. Our code is available at https://github.com/HazyResearch/babble.

4.2 The BabbleLabble Framework

The BabbleLabble framework converts natural language explanations and unlabeled data into a noisily-labeled training set (see Figure 4.2). There are three key components: a semantic parser, a filter bank, and a label aggregator. The semantic parser converts natural language explanations into a set of logical forms representing labeling functions (LFs). The filter bank removes as many incorrect LFs as possible without requiring ground truth labels. The remaining LFs are applied to unlabeled examples to produce a matrix of labels. This label matrix is passed into the label aggregator, which combines these potentially conflicting and overlapping labels into one label for each example. The resulting labeled examples are then used to train an arbitrary discriminative model.

Figure 4.3: Valid parses are found by iterating over increasingly large subspans of the input looking for matches among the right hand sides of the rules in the grammar. Rules are either lexical (converting tokens into symbols), unary (converting one symbol into another symbol), or compositional (combining many symbols into a single higher-order symbol). A rule may optionally ignore unrecognized tokens in a span.

4.2.1 Explanations

To create the input explanations, the user views a subset S of an unlabeled dataset D (where |S| ≪ |D|) and provides for each input xi ∈ S a label yi and a natural language explanation ei, a sentence explaining why the example should receive that label. The explanation ei generally refers to specific aspects of the example (e.g., in Figure 4.2, the location of a specific string “his wife”).

4.2.2 Semantic Parser

The semantic parser takes a natural language explanation ei and returns a set of LFs (logical forms or labeling functions) {f1,..., fk} of the form fi : X → {−1, 0, 1} in a binary classification setting, with 0 representing abstention. We emphasize that the goal of this semantic parser is not to generate the single correct parse, but rather to have coverage over many potentially useful LFs.1 We choose a simple rule-based semantic parser that can be used without any training. Formally, the parser uses a set of rules of the form α → β, where α can be replaced by the token(s) in β (see Figure 4.3 for example rules). To identify candidate LFs, we recursively construct a set of valid parses for each span of the explanation, based on the substitutions defined by the grammar rules. At the end, the parser returns all valid parses (LFs in our case) corresponding to the entire explanation. We also allow an arbitrary number of tokens in a given span to be ignored when looking for a matching rule. This improves the ability of the parser to handle unexpected input, such as unknown words or typos, since the portions of the input that are parseable can still result in a valid parse. For example, in Figure 4.3, the word “person” is ignored.
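To make the mechanism concrete, here is a minimal sketch of this kind of bottom-up span parsing. The actual parser is built on SippyCup and additionally supports unary rules, rule templates, and skipping unrecognized tokens inside a span; the toy grammar below is only an assumption for illustration.

# Toy grammar: lexical rules map single tokens to symbols; binary rules
# combine two adjacent symbols into a higher-order symbol with a semantics.
LEXICAL = {
    "label": ("LABEL", None),
    "true": ("BOOL", 1),
    "false": ("BOOL", -1),
    "because": ("BECAUSE", None),
}
BINARY = {
    ("LABEL", "BOOL"): ("LABELED", lambda _, b: b),
    ("LABELED", "BECAUSE"): ("LF_HEAD", lambda lab, _: lab),
    # ... compositional rules for conditions (ARGLIST, comparisons, etc.) go here
}

def parse(tokens):
    n = len(tokens)
    # chart[i][j] holds every (symbol, semantics) parse covering tokens[i:j]
    chart = [[[] for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        chart[i][i + 1].append(LEXICAL.get(tok, ("TOKEN", tok)))
    for length in range(2, n + 1):          # increasingly large subspans
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):       # split point
                for sym1, sem1 in chart[i][k]:
                    for sym2, sem2 in chart[k][j]:
                        if (sym1, sym2) in BINARY:
                            sym, combine = BINARY[(sym1, sym2)]
                            chart[i][j].append((sym, combine(sem1, sem2)))
    return chart[0][n]                      # all parses spanning the whole input

print(parse("label false because".split()))  # -> [('LF_HEAD', -1)]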

1Indeed, we find empirically that an incorrect LF nearby the correct one in the space of logical forms actually has higher end-task accuracy 57% of the time (see Section 4.4.2).

Predicate                           Description
bool, string, int, float,          Standard primitive data types
  tuple, list, set
and, or, not, any, all, none       Standard logic operators
=, ≠, <, ≤, >, ≥                   Standard comparison operators
lower, upper, capital, all_caps    Return True for strings of the corresponding case
starts_with, ends_with, substring  Return True if the first string starts/ends with or contains the second
person, location, date, number,    Return True if a string has the corresponding NER tag
  organization
alias                              A frequently used list of words may be predefined and referred to with an alias
count, contains, intersection      Operators for checking size, membership, or common elements of a list/set
map, filter                        Apply a functional primitive to each member of a list/set to transform or filter the elements
word_distance, character_distance  Return the distance between two strings by words or characters
left, right, between, within       Return as a string the text that is left/right/within some distance of a string or between two designated strings

Table 4.1: Predicates in the grammar supported by BabbleLabble’s rule-based semantic parser.

All predicates included in our grammar (summarized in Table 4.1) are provided to annotators, with minimal examples of each in use (Appendix C.1). Importantly, all rules are domain independent (e.g., all three relation extraction tasks that we tested used the same grammar), making the semantic parser easily transferable to new domains. Additionally, while this paper focuses on the task of relation extraction, in principle the BabbleLabble framework can be applied to other tasks or settings by extending the grammar with the necessary primitives (e.g., adding primitives for rows and columns to enable explanations about the alignments of words in tables). To guide the construction of the grammar, we collected 500 explanations for the Spouse domain from workers on Amazon Mechanical Turk and added support for the most commonly used predicates. These were added before the experiments described in Section 4.4. The grammar contains a total of 200 rule templates.

4.2.3 Filter Bank

The input to the filter bank is a set of candidate LFs produced by the semantic parser. The purpose of the filter bank is to discard as many incorrect LFs as possible without requiring additional labels. It consists of two

classes of filters: semantic and pragmatic.

Recall that each explanation ei is collected in the context of a specific labeled example (xi, yi). The semantic filter checks for LFs that are inconsistent with their corresponding example; formally, any LF f for which f(xi) ≠ yi is discarded. For example, in the first explanation in Figure 4.2, the word “right” can be interpreted as either “immediately” (as in “right before”) or simply “to the right.” The latter interpretation results in a function that is inconsistent with the associated example (since “his wife” is actually to the left of person 2), so it can be safely removed.

The pragmatic filters remove LFs that are constant, redundant, or correlated. For example, in Figure 4.2, LF_2a is constant, as it labels every example positively (since all examples contain two people from the same sentence). LF_3b is redundant, since even though it has a different syntax tree from LF_3a, it labels the training set identically and therefore provides no new signal. Finally, out of all LFs from the same explanation that pass all the other filters, we keep only the most specific (lowest coverage) LF. This prevents multiple correlated LFs from a single example from dominating.

As we show in Section 4.4, over three tasks, the filter bank removes over 95% of incorrect parses, and the incorrect ones that remain have average end-task accuracy within 2.5 points of the corresponding correct parses.
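A simplified sketch of the two filter classes is given below. It checks exact label-signature duplicates as a stand-in for the redundancy and correlation checks and omits the most-specific-LF-per-explanation step, so it should be read as an illustration rather than the exact filter bank.

def apply_filter_bank(candidate_lfs, example, label, unlabeled_examples):
    """Return the candidate LFs that survive the semantic and pragmatic filters."""
    # Semantic filter: an LF must label its own originating example correctly.
    lfs = [lf for lf in candidate_lfs if lf(example) == label]

    survivors, seen_signatures = [], set()
    for lf in lfs:
        signature = tuple(lf(x) for x in unlabeled_examples)
        if len(set(signature)) == 1:      # constant: same output on every example
            continue
        if signature in seen_signatures:  # redundant: duplicates another LF's labels
            continue
        seen_signatures.add(signature)
        survivors.append(lf)
    return survivors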

4.2.4 Label Aggregator

The label aggregator combines multiple (potentially conflicting) suggested labels from the LFs into a single probabilistic label per example. Concretely, if m LFs pass the filter bank and are applied to n examples, the label aggregator implements a function f : {−1, 0, 1}^{m×n} → [0, 1]^n. A naive solution would be to use a simple majority vote, but this fails to account for the fact that LFs can vary widely in accuracy and coverage. Instead, we use data programming [Ratner et al., 2016b], which models the relationship between the true labels and the output of the labeling functions as a factor graph. More specifically, given the true labels Y ∈ {−1, 1}^n (latent) and label matrix Λ ∈ {−1, 0, 1}^{m×n} (observed) where

Λ_{i,j} = LF_i(x_j), we define two types of factors representing labeling propensity and accuracy:

φ^{Lab}_{i,j}(Λ, Y) = 1{Λ_{i,j} ≠ 0}          (4.1)
φ^{Acc}_{i,j}(Λ, Y) = 1{Λ_{i,j} = y_j}.       (4.2)

Denoting the vector of factors pertaining to a given data point x_j as φ_j(Λ, Y) ∈ R^{2m}, define the model:

p_w(Λ, Y) = Z_w^{−1} exp( Σ_{j=1}^{n} w · φ_j(Λ, Y) ),          (4.3)

where w ∈ R^{2m} is the weight vector and Z_w is the normalization constant. To learn this model without

knowing the true labels Y, we minimize the negative log marginal likelihood given the observed labels Λ:

ŵ = argmin_w − log Σ_Y p_w(Λ, Y),          (4.4)

using SGD and Gibbs sampling for inference, and then use the marginals p_ŵ(Y | Λ) as probabilistic training labels. Intuitively, we infer accuracies of the LFs based on the way they overlap and conflict with one another. Since noisier LFs are more likely to have high conflict rates with others, their corresponding accuracy weights in w will be smaller, reducing their influence on the aggregated labels.
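For intuition, the sketch below shows the shape of the computation on the label matrix Λ. It replaces the learned factor-graph model of Equations 4.1–4.4 with a crude stand-in that scores each LF against a provisional majority vote and then takes an accuracy-weighted vote, so it illustrates the interface, not the actual inference procedure.

import numpy as np

def aggregate_labels(L):
    """L: (m, n) label matrix with entries in {-1, 0, +1}, where 0 = abstain.
    Returns an estimate of P(y = +1) for each of the n examples."""
    majority = np.sign(L.sum(axis=0))            # provisional label per example
    weights = np.zeros(L.shape[0])
    for i, row in enumerate(L):
        voted = row != 0
        if voted.any():
            # agreement of this LF's non-abstentions with the provisional labels
            acc = (row[voted] == majority[voted]).mean()
            weights[i] = np.clip(acc, 0.05, 0.95)
    scores = (weights[:, None] * L).sum(axis=0)  # accuracy-weighted vote
    return 1.0 / (1.0 + np.exp(-scores))         # squash to a probability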

4.2.5 Discriminative Model

The noisy training set that the label aggregator outputs is used to train an arbitrary discriminative model. One advantage of training a discriminative model on the task instead of using the label aggregator as a classifier directly is that the label aggregator only takes into account those signals included in the LFs. A discriminative model, on the other hand, can incorporate features that were not identified by the user but are nevertheless informative.2 Consequently, even examples for which all LFs abstained can still be classified correctly. Additionally, passing supervision information from the user to the model in the form of a dataset—rather than hard rules—promotes generalization in the new model (rather than memorization), similar to distant supervision [Mintz et al., 2009a]. On the three tasks we evaluate, using the discriminative model averages 4.3 F1 points higher than using the label aggregator directly. For the results reported in this paper, our discriminative model is a simple logistic regression classifier with generic features defined over dependency paths.3 These features include unigrams, bigrams, and trigrams of lemmas, dependency labels, and part of speech tags found in the siblings, parents, and nodes between the entities in the dependency parse of the sentence. We found this to perform better on average than a biLSTM, particularly for the traditional supervision baselines with small training set sizes; it also provided easily interpretable features for analysis.
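Training the discriminative model on these probabilistic labels amounts to minimizing the expected loss under the aggregator’s marginals. A hedged scikit-learn sketch of this idea (an assumed tool choice; the experiments in this chapter use the Snorkel implementation) duplicates each example as a positive and a negative, weighted by its marginal probability:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_noise_aware(X, p_pos):
    """X: (n, d) feature matrix; p_pos: (n,) probabilistic labels P(y = +1 | Lambda)
    from the label aggregator. Weighting each duplicated example by its class
    marginal makes the objective the expected log loss under the noisy labels."""
    X_rep = np.vstack([X, X])
    y_rep = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    w_rep = np.concatenate([p_pos, 1.0 - p_pos])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_rep, y_rep, sample_weight=w_rep)
    return clf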

4.3 Experimental Setup

We evaluate the accuracy of BabbleLabble on three relation extraction tasks, which we refer to as Spouse, Disease, and Protein. The goal of each task is to train a classifier for predicting whether the two entities in an example are participating in the relationship of interest, as described below.

2We give an example of two such features in Section 4.4.3.
3https://github.com/HazyResearch/treedlib

Spouse
Example: “They include Joan Ridsdale, a 62-year-old payroll administrator from County Durham who was hit with a €16,000 tax bill when her husband Gordon died.” (person 1, person 2)
Explanation: “True, because the phrase ‘her husband’ is within three words of person 2.”

Disease
Example: “Young women on replacement estrogens for ovarian failure after cancer therapy may also have increased risk of endometrial carcinoma and should be examined periodically.” (chemical, disease)
Explanation: “True, because ‘risk of’ comes before the disease.”

Protein
Example: “Here we show that c-Jun N-terminal kinases JNK1, JNK2 and JNK3 phosphorylate tau at many serine/threonine-prolines, as assessed by the generation of the epitopes of phosphorylation-dependent anti-tau antibodies.” (protein, kinase)
Explanation: “True, because at least one of the words ‘phosphorylation’, ‘phosphorylate’, ‘phosphorylated’, ‘phosphorylates’ is found in the sentence and the number of words between the protein and kinase is smaller than 8.”

Figure 4.4: An example and explanation for each of the three datasets.

Task      Train    Dev     Test    % Pos.
Spouse    22,195   2,796   2,697   8%
Disease   6,667    773     4,101   23%
Protein   5,546    1,011   1,058   22%

Table 4.2: The total number of unlabeled training examples (a pair of annotated entities in a sentence), labeled development examples (for hyperparameter tuning), labeled test examples (for assessment), and the fraction of positive labels in the test split.

4.3.1 Datasets

Statistics for each dataset are reported in Table 4.2, with one example and one explanation for each given in Figure 4.4 and additional explanations shown in Appendix C.2.

In the Spouse task, annotators were shown a sentence with two highlighted names and asked to label whether the sentence suggests that the two people are spouses. Sentences were pulled from the Signal Media dataset of news articles [Corney et al., 2016b]. Ground truth data was collected from Amazon Mechanical Turk workers, accepting the majority label over three annotations. The 30 explanations we report on were sampled randomly from a pool of 200 that were generated by 10 graduate students unfamiliar with BabbleLabble.

In the Disease task (also called CDR in Chapter 2), annotators were shown a sentence with highlighted names of a chemical and a disease and asked to label whether the sentence suggests that the chemical causes the disease. Sentences and ground truth labels came from a portion of the 2015 BioCreative chemical-disease relation dataset [Wei et al., 2015a], which contains abstracts from PubMed. Because this task requires specialized domain expertise, we obtained explanations by having someone unfamiliar with BabbleLabble translate from Python to natural language labeling functions from an existing publication that explored applying weak supervision to this task [Ratner et al., 2017b].

The Protein task was completed in conjunction with OccamzRazor, a neuroscience company targeting biological pathways of Parkinson’s disease. For this task, annotators were shown a sentence from the relevant biomedical literature with highlighted names of a protein and a kinase and asked to label whether or not the kinase influences the protein in terms of a physical interaction or phosphorylation. The annotators had domain expertise but minimal programming experience, making BabbleLabble a natural fit for their use case.

            BL     TS
# Inputs    30     30     60     150    300    1,000   3,000   10,000
Spouse      50.1   15.5   15.9   16.4   17.2   22.8    41.8    55.0
Disease     42.3   32.1   32.6   34.4   37.5   41.9    44.5    -
Protein     47.3   39.3   42.1   46.8   51.0   57.6    -       -
Average     46.6   28.9   30.2   32.5   35.2   40.8    43.2    55.0

Table 4.3: F1 scores obtained by a classifier trained with BabbleLabble (BL) using 30 explanations or with traditional supervision (TS) using the specified number of individually labeled examples. BabbleLabble achieves the same F1 score as traditional supervision while using fewer user inputs by a factor of over 5 (Protein) to over 100 (Spouse).

4.3.2 Experimental Settings

Text documents are tokenized with spaCy.4 The semantic parser is built on top of the Python-based implementation SippyCup.5 On a single core, parsing 360 explanations takes approximately two seconds. We use existing implementations of the label aggregator, feature library, and discriminative classifier described in Sections 4.2.4–4.2.5 provided by the open-source project Snorkel [Ratner et al., 2017b]. Hyperparameters for all methods we report were selected via random search over thirty configurations on the same held-out development set. We searched over learning rate, batch size, L2 regularization, and the subsampling rate (for improving balance between classes).6 All reported F1 scores are the average value of 40 runs with random seeds and otherwise identical settings.

4.4 Experimental Results

We evaluate the performance of BabbleLabble with respect to its rate of improvement by number of user inputs, its dependence on correctly parsed logical forms, and the mechanism by which it utilizes logical forms.

4.4.1 High Bandwidth Supervision

In Table 4.3 we report the average F1 score of a classifier trained with BabbleLabble using 30 explanations or traditional supervision with the indicated number of labels. On average, it took the same amount of time to collect 30 explanations as 60 labels.7 We observe that in all three tasks, BabbleLabble achieves a given F1 score with far fewer user inputs than traditional supervision, by as much as 100 times in the case of the Spouse task. Because explanations are applied to many unlabeled examples, each individual input from the user can implicitly contribute many (noisy) labels to the learning algorithm.

4https://github.com/explosion/spaCy
5https://github.com/wcmac/sippycup
6Hyperparameter ranges: learning rate (1e-2 to 1e-4), batch size (32 to 128), L2 regularization (0 to 100), subsampling rate (0 to 0.5)
7Zaidan and Eisner [2008a] also found that collecting annotator rationales in the form of highlighted substrings from the sentence only doubled annotation time.

Explanation: “False, because a word starting with ‘improve’ appears before the chemical.”
LF_1a (correct parse, accuracy 84.6%): return -1 if any(w.startswith("improv") for w in left(x.person2)) else 0
LF_1b (incorrect parse, accuracy 84.6%): return -1 if "improv" in left(x.person2) else 0

Explanation: “True, because ‘husband’ occurs right before person 1.”
LF_2a (correct parse, accuracy 13.6%): return 1 if "husband" in left(x.person1, dist==1) else 0
LF_2b (incorrect parse, accuracy 66.2%): return 1 if "husband" in left(x.person2, dist==1) else 0

Figure 4.5: Incorrect LFs often still provide useful signal. On top is an incorrect LF produced for the Disease task that had the same accuracy as the correct LF. On bottom is a correct LF from the Spouse task and a more accurate incorrect LF discovered by randomly perturbing one predicate at a time as described in Section 4.4.2. (Person 2 is always the second person in the sentence.)

           Pre-filters       Discarded        Post-filters
           LFs    Correct    Sem.    Prag.    LFs    Correct
Spouse     156    10%        19      118      19     84%
Disease    102    23%        34      40       28     89%
Protein    122    14%        44      58       20     85%

Table 4.4: The number of LFs generated from 30 explanations (pre-filters), discarded by the filter bank, and remaining (post-filters), along with the percentage of LFs that were correctly parsed from their corresponding explanations.

We also observe, however, that once the number of labeled examples is sufficiently large, traditional supervision once again dominates, since ground truth labels are preferable to noisy ones generated by labeling functions. However, in domains where there is much more unlabeled data available than labeled data (which in our experience is most domains), we can gain in supervision efficiency from using BabbleLabble.

Of those explanations that did not produce a correct LF, 4% were caused by the explanation referring to unsupported concepts (e.g., one explanation referred to “the subject of the sentence,” which our simple parser doesn’t support). Another 2% were caused by human errors (the correct LF for the explanation was inconsistent with the example). The remainder were due to unrecognized paraphrases (e.g., the explanation said “the order of appearance is X, Y” instead of a supported phrasing like “X comes before Y”).

4.4.2 Utility of Incorrect Parses

In Table 4.4, we report LF summary statistics before and after filtering. LF correctness is based on exact match with a manually generated parse for each explanation. Surprisingly, the simple heuristic-based filter bank successfully removes over 95% of incorrect LFs in all three tasks, resulting in final LF sets that are 86% correct on average. Furthermore, among those LFs that pass through the filter bank, we found that the average difference in end-task accuracy between correct and incorrect parses is less than 2.5%. Intuitively, the filters are effective because it is quite difficult for an LF to be parsed from the explanation, label its own example correctly (passing the semantic filter), and not label all examples in the training set with the same label or identically to another LF (passing the pragmatic filter).

           BL-FB   BL     BL+PP
Spouse     15.7    50.1   49.8
Disease    39.8    42.3   43.2
Protein    38.2    47.3   47.4
Average    31.2    46.6   46.8

Table 4.5: F1 scores obtained using BabbleLabble with no filter bank (BL-FB), as normal (BL), and with a perfect parser (BL+PP) simulated by hand.

Figure 4.6: When logical forms of natural language explanations are used as functions for data programming (as they are in BabbleLabble), performance can improve with the addition of unlabeled data, whereas using them as features does not benefit from unlabeled data.

We went one step further: using the LFs that would be produced by a perfect semantic parser as starting points, we searched for “nearby” LFs (LFs differing by only one predicate) with higher end-task accuracy on the test set and succeeded 57% of the time (see Figure 4.5 for an example). In other words, when users provide explanations, the signals they describe provide good starting points, but they are actually unlikely to be optimal. This observation is further supported by Table 4.5, which shows that the filter bank is necessary to remove clearly irrelevant LFs, but with that in place, the simple rule-based semantic parser and a perfect parser have nearly identical average F1 scores.

4.4.3 Using LFs as Functions or Features

Once we have relevant logical forms from user-provided explanations, we have multiple options for how to use them. Srivastava et al. [2017] propose using these logical forms as features in a linear classifier, essentially using a traditional supervision approach with user-specified features. We choose instead to use them as functions for weakly supervising the creation of a larger training set via data programming [Ratner et al., 2016b]. In Table 4.6, we compare the two approaches directly, finding that the data programming approach outperforms a feature-based one by 9.5 F1 points on average with the rule-based parser, and by 4.5 points with a perfect parser.

           BL-DM   BL     BL+PP   Feat   Feat+PP
Spouse     46.5    50.1   49.8    33.9   39.2
Disease    39.7    42.3   43.2    40.8   43.8
Protein    40.6    47.3   47.4    36.7   44.0
Average    42.3    46.6   46.8    37.1   42.3

Table 4.6: F1 scores obtained using explanations as functions for data programming (BL) or features (Feat), optionally with no discriminative model (-DM) or using a perfect parser (+PP).

We attribute this difference primarily to the ability of data programming to utilize a larger feature set and unlabeled data. In Figure 4.6, we show how the data programming approach improves with the number of unlabeled examples, even as the number of LFs remains constant. We also observe qualitatively that data programming exposes the classifier to additional patterns that are correlated with our explanations but not mentioned directly. For example, in the Disease task, two of the features weighted most highly by the discriminative model were the presence of the trigrams “could produce a” or “support diagnosis of” between the chemical and disease, despite none of these words occurring in the explanations for that task. In Table 4.6 we see a 4.3 F1 point improvement (10%) when we use the discriminative model that can take advantage of these features rather than applying the LFs directly to the test set and making predictions based on the label aggregator’s outputs.

4.5 Related Work and Discussion

Our work has two themes: modeling natural language explanations/instructions and learning from weak supervision. The closest body of work is on “learning from natural language.” As mentioned earlier, Srivastava et al. [2017] convert natural language explanations into classifier features (whereas we convert them into labeling functions). Goldwasser and Roth [2011] convert natural language into concepts (e.g., the rules of a card game). Ling and Fidler [2017] use natural language explanations to assist in supervising an image captioning model. Weston [2016], Li et al. [2016] learn from natural language feedback in a dialogue. Wang et al. [2017] convert natural language definitions to rules in a semantic parser to build up progressively higher-level concepts.

We lean on the formalism of semantic parsing [Zelle and Mooney, 1996, Zettlemoyer and Collins, 2005, Liang, 2016]. One notable trend is to learn semantic parsers from weak supervision [Clarke et al., 2010, Liang et al., 2011], whereas our goal is to obtain weak supervision signal from semantic parsers.

The broader topic of weak supervision has received much attention; we mention some works most related to relation extraction. In distant supervision [Craven et al., 1999, Mintz et al., 2009a] and multi-instance learning [Riedel et al., 2010a, Hoffmann et al., 2011b], an existing knowledge base is used to (probabilistically) impute a training set. Various extensions have focused on aggregating a variety of supervision sources by learning generative models from noisy labels [Alfonseca et al., 2012a, Takamatsu et al., 2012a, Roth and Klakow, 2013a, Ratner et al., 2016b, Varma et al., 2017].

Figure 4.7: The BabbleLabble interface.

Finally, while we have used natural language explanations as input to train models, they can also be output to interpret models [Krening et al., 2017, Lei et al., 2016]. More generally, from a machine learning perspective, labels are the primary asset, but they are a low bandwidth signal between annotators and the learning algorithm. Natural language opens up a much higher-bandwidth communication channel. We have shown that weak supervision from the high-level abstraction of natural language can be used to train high-performance models in relation extraction (where one explanation can be “worth” 100 labels), and it would be interesting to extend our framework to other tasks and more interactive settings.

4.6 Extensions

The graphical user interface to the BabbleLabble system shown in Figure 4.7 was featured as a demonstration at NeurIPS 2017 [Hancock et al., 2017].8 With this interface, users interact with the system as described above, but with additional summary statistics being reported with each explanation that is submitted, such as total explanation count, dataset coverage, current performance on the development set, etc. Users can also immediately view examples labeled correctly or incorrectly by a given parse, alternative parses considered and filtered, etc. Qualitatively, we observe that this interface increases both the quantity and quality of supervision sources generated by users in a given amount of time. We believe that user-focused application interfaces such as this will play a significant role in furthering real-world adoption of weak supervision approaches.

8https://www.youtube.com/watch?v=YBeAX-deMDg

Chapter 5

Discussion & Conclusion

The primary difference between a traditional labeling approach and the weak supervision approaches described in this dissertation is that the former is manual while the latter are programmatic. This fundamental difference comes with a number of potential advantages and limitations. In this chapter we discuss these considerations and offer concluding thoughts.

5.1 Advantages of Programmatic Supervision

The first advantage of programmatic supervision is speed. With manual supervision, labeling time is proportional to the amount of time it takes a human to label an example. On the other hand, when supervision is encoded in labeling functions—or higher-level abstractions that compile down into labeling functions—labeling time is proportional to the time it takes a program to label an example. While there is a startup cost to creating these labeling functions—writing a labeling function will almost certainly take longer than labeling an example—once they exist, the cost of labeling additional examples is marginal, making it possible to label far more (potentially orders of magnitude more) examples when using a programmatic labeling approach. Because of this, a programmatic labeling approach is often most advantageous when the amount of available data to label is very large and the problem is one that will benefit from a larger training set.

The second advantage of programmatic supervision is cost. This follows naturally from the same reason stated above: once labeling functions are generated, labeling additional examples is not only fast, but also cheap when compared to the cost of paying for additional manual annotations. Furthermore, for some tasks and datasets, “embarrassingly parallel” scaling of annotation through crowdsourcing is not an option, either due to privacy constraints (e.g., financial transactions) or lack of expertise (e.g., medical images). In these cases, the ability to magnify the supervision ability of a small number of qualified individuals via programmatic supervision is particularly advantageous.

A third advantage of programmatic supervision is dynamism. Unlike static benchmark tasks with frozen datasets and objectives, real-world problems often experience changes in the task schema (e.g., a two-class

78 CHAPTER 5. DISCUSSION & CONCLUSION 79 Labels Labels write run programs programs

Time Time (a) Manual Labeling (b) Programmatic Labeling

Figure 5.1: Abstract schematics of labeling rate for (a) manual labeling and (b) programmatic labeling approaches. problem becoming a three-class problem), task definition (e.g., what was previously classified as “illegal” becoming “legal” after legislation passes), or simply shifts in the underlying data distribution over time. When a training dataset needs to be updated to reflect such changes, a manual approach may require revisiting every label individually. On the other hand, if supervision is programmatic, these updates can often be achieved by simply adding or updating a small number of labeling functions related to the change. In other words, when domain knowledge is stored in programs instead of individually labels, if our labels are no longer valid, we haven’t necessarily lost our supervision. Finally, a fourth advantage of programmatic supervision is that it promotes transparency. When an annotator provides a label for an example, we generally do not receive any additional information suggesting the cause for that label being given. When labels are programmatically generated, however, we have label provenance information that allows us to observe which labeling functions ultimately contributed to that label. Thus, if a source of bias is discovered in our dataset, it may be possible to identify the sources (i.e., the labeling functions) responsible for that bias and remove or update them while leaving the rest of our supervision intact. Related to this is the notion of label auditability. If, for example, it is illegal for a bank to take into account gender or race when considering credit limit increases, by examining the source code of the contributing labeling functions, we can confirm that none of the disallowed attributes were used explicitly. Note, however, that this does not necessarily guarantee that the model will not learn a biased representation due to other correlations in the data or biases encoded in less transparent labeling functions such as wrapped third-party models.
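As a minimal illustration of this provenance property, the sketch below (plain Python in the spirit of a labeling-function interface; the candidate fields and the audit helper are hypothetical, not Snorkel's actual API) shows how each programmatic label can be traced back to the functions that produced it.

# Minimal sketch: programmatic labels carry provenance back to their sources.
# The candidate schema and the audit helper are illustrative only.
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1

def lf_contains_spouse_word(candidate):
    """Vote POSITIVE if a spouse-indicating word appears between the two people."""
    spouse_words = {"wife", "husband", "spouse", "married"}
    return POSITIVE if spouse_words & set(candidate["between_tokens"]) else ABSTAIN

def lf_many_people(candidate):
    """Vote NEGATIVE if the sentence mentions many people (likely a list, not a couple)."""
    return NEGATIVE if candidate["num_people"] > 3 else ABSTAIN

labeling_functions = [lf_contains_spouse_word, lf_many_people]

def label_with_provenance(candidate):
    """Return non-abstaining votes along with the name of the LF that cast each one."""
    votes = [(lf.__name__, lf(candidate)) for lf in labeling_functions]
    return [(name, v) for name, v in votes if v != ABSTAIN]

example = {"between_tokens": ["and", "his", "wife"], "num_people": 2}
print(label_with_provenance(example))  # [('lf_contains_spouse_word', 1)]

Because every vote is attributed to a named function, removing or updating a problematic source of supervision amounts to editing a single function rather than relabeling examples.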

5.2 Limitations

The programmatic supervision approaches we have described require three main ingredients: first, a set of labeling functions that users can write; second, an appropriate model to train; and third, a preferably large

amount of data to label. While there are many current and exciting future directions for enabling users to more easily write labeling functions in diverse circumstances, there are some tasks or datasets where encoding domain knowledge in this form is particularly difficult. For example, a particular domain may have few attributes that can be referred to when writing labeling functions; there may be few third-party resources (preprocessors, classifiers, etc.) available for reuse; or the interactions between attributes may be very complex, making it difficult to write succinct labeling functions with sufficient accuracy.

Next, the approaches we describe implicitly rely on easily available discriminative models so that a user can focus on providing supervision, rather than feature engineering or model architecture design. For common modalities such as text, images, audio, and video, there are increasingly commoditized architectures available as open source. However, as machine learning is applied to increasingly broad problems, there will very likely be domains that have not reached this same level of maturity, where simpler models are preferred and the relative benefit of larger training sets is diminished.

Finally, programmatic supervision benefits from settings where unlabeled data is readily available, as demonstrated empirically in the experiments described in this work. Where available data for labeling is limited, or where all data has already been labeled and relabeling requirements are infrequent, a user may be better served by focusing on other areas of the problem.

5.3 The Supervision Stack

Finally, we note that to the best of our knowledge, we are the first to make the analogy between the programming stack and the supervision stack as described in Chapter 1. We have found this analogy to have an outsized impact on our own research pursuits and recognize its potential to lead to future research directions. One way in which the parallel stack analogy is particularly appropriate is that for the three abstractions described in this dissertation, the higher abstractions truly do compile down into the lower ones, just like programming languages. For example, the explanations in BabbleLabble are parsed into advanced primitives like those supported by Fonduer, and the resulting functions are instances of labeling functions compatible with Snorkel. However, one might consider even higher levels up the supervision stack, where supervision signal is collected not from human language, but from human behavior—e.g., having users interact with a system and inferring labeling functions based on their implicit supervision with clicks and views. In this case, the higher-level abstraction may skip certain intermediate ones (e.g., natural language) on the way down the stack, suggesting that the hierarchy we explore in this dissertation is merely one instantiation of many such potential stacks.

Another growing trend in machine learning is that of transfer learning [Pan and Yang, 2009, Weiss et al., 2016], wherein a model is pre-trained with training data corresponding to one task (an auxiliary task), then fine-tuned with the training data from another (the primary task). In computer vision [Girshick et al., 2014, Long et al., 2015], natural language processing [Devlin et al., 2018, Radford et al., 2019], and other domains,

this has proven to be an effective way to decrease the amount of training data required for the primary task to obtain a given level of performance, or at the very least to speed up the training process [He et al., 2018]. In the context of our supervision stack, rather than corresponding to another abstraction layer in a given stack, transfer learning is akin to utilizing the labels generated by a separate nearby stack. Ultimately, however, the use of transfer learning is orthogonal to weak supervision and the two can easily be combined, as weak supervision focuses on the creation of training labels and transfer learning focuses on the use of multiple sets of training labels, regardless of the processes that created them.

5.4 Conclusion

Decades of learning have gone into designing best practices and improved abstractions for general purpose programming. As the supervision for machine learning models becomes less bit-like and more program-like in nature, we have the opportunity to transfer many of those same lessons to this exciting and blossoming field. In this dissertation, we have described a supervision stack, where higher-level interfaces can be built up and compiled down like a programming stack, affording flexible and powerful abstractions for interacting with our models. Ultimately, we hope and expect to see weak supervision from these high-level abstractions empowering individuals, magnifying their ability to supervise powerful models in high-impact problems.

Appendix A

Snorkel Appendix

A.1 Additional Material for Sec. 3.1

A.1.1 Minor Notes

Note that for the independent generative model (i.e., |C| = 0), the weight corresponding to the accuracy factor, wj, for labeling function j is just the log-odds of its accuracy:

$$\alpha_j = P(\Lambda_{i,j} = 1 \mid Y_i = 1, \Lambda_{i,j} \neq 0) = \frac{P(\Lambda_{i,j} = 1,\, Y_i = 1,\, \Lambda_{i,j} \neq 0)}{P(Y_i = 1,\, \Lambda_{i,j} \neq 0)} = \frac{\exp(w_j)}{\exp(w_j) + \exp(-w_j)}$$
$$\implies w_j = \frac{1}{2}\log\left(\frac{\alpha_j}{1 - \alpha_j}\right)$$

Also note that the accuracy we consider is conditioned on the labeling function not abstaining, i.e.,:

$$P(\Lambda_{i,j} = 1 \mid Y_i = 1) = \alpha_j \cdot P(\Lambda_{i,j} \neq \emptyset)$$

because a separate factor $\phi^{\mathrm{Lab}}_{i,j}$ captures how often each labeling function votes.
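As a quick numerical illustration of the log-odds relation above (a sketch added here for convenience, not part of the original derivation), the snippet below round-trips between a weight $w_j$ and its implied accuracy $\alpha_j$.

# Small numerical check of alpha_j = exp(w)/(exp(w)+exp(-w)) and w_j = (1/2) log-odds(alpha_j).
import math

w_j = 0.8
alpha_j = math.exp(w_j) / (math.exp(w_j) + math.exp(-w_j))   # accuracy implied by the weight
w_back = 0.5 * math.log(alpha_j / (1 - alpha_j))             # invert: w_j = (1/2) log(alpha/(1-alpha))
assert abs(w_back - w_j) < 1e-12
print(f"alpha_j = {alpha_j:.4f}, recovered w_j = {w_back:.4f}")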

A.1.2 Proof of Proposition 1

In this proposition, our goal is to obtain a simple upper bound for the expected optimal advantage $\mathbb{E}_{\Lambda,y,w^*}[A^*]$ in the low label density regime. We consider a simple model where all the labeling functions have a fixed


probability of emitting a non-zero label,

$$P(\Lambda_{i,j} \neq \emptyset) = p_l \quad \forall\, i, j \tag{A.1}$$
and that the labeling functions are all non-adversarial, i.e., they all have accuracies greater than 50%, or equivalently,

$$w_j^* > 0 \quad \forall\, j \tag{A.2}$$

First, we start by only counting cases where the optimal weighted majority vote (WMV*)—i.e., the predictions of the generative model with perfectly estimated weights—is correct and the majority vote (MV) is incorrect, which is an upper bound on the modeling advantage:

\begin{align*}
\mathbb{E}_{\Lambda,y,w^*}[A_{w^*}(\Lambda, y)]
&= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y_i, w^*}\big[\mathbf{1}\{y_i f_{w^*}(\Lambda_i) > 0 \,\wedge\, y_i f_1(\Lambda_i) \le 0\} \\
&\qquad\qquad\quad - \mathbf{1}\{y_i f_{w^*}(\Lambda_i) \le 0 \,\wedge\, y_i f_1(\Lambda_i) > 0\}\big] \\
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y_i, w^*}\big[\mathbf{1}\{y_i f_{w^*}(\Lambda_i) > 0 \,\wedge\, y_i f_1(\Lambda_i) \le 0\}\big]
\end{align*}
Next, by (A.2), the only way that WMV* and MV could possibly disagree is if there is at least one disagreeing pair of labels:

$$\mathbb{E}_{\Lambda,y,w^*}[A^*(\Lambda, y)] \le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y}\big[\mathbf{1}\{c_1(\Lambda_i) > 0 \,\wedge\, c_{-1}(\Lambda_i) > 0\}\big]$$
where $c_y(\Lambda_i) = \sum_{j=1}^{n} \mathbf{1}\{\Lambda_{i,j} = y\}$, in other words, the counts of positive or negative labels for a given data point $x_i$. Then, we can bound this by the expected number of disagreeing, non-abstaining pairs of labels:

\begin{align*}
\mathbb{E}_{\Lambda,y,w^*}[A^*(\Lambda, y)]
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y}\Bigg[\sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \mathbf{1}\{\Lambda_{i,j} \neq \Lambda_{i,k} \,\wedge\, \Lambda_{i,j}, \Lambda_{i,k} \neq 0\}\Bigg] \\
&= \frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \mathbb{E}_{\Lambda_i, y}\big[\mathbf{1}\{\Lambda_{i,j} \neq \Lambda_{i,k} \,\wedge\, \Lambda_{i,j}, \Lambda_{i,k} \neq 0\}\big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \sum_{y' \in \pm 1} \sum_{\lambda \in \pm 1} P(\Lambda_{i,j} = \lambda,\, \Lambda_{i,k} = -\lambda,\, y_i = y')
\end{align*}

Since we are considering the independent model, $\Lambda_{i,j} \perp \Lambda_{i,k \neq j} \mid y_i$, we have that:

\begin{align*}
P(\Lambda_{i,j} = \lambda,\, \Lambda_{i,k} = -\lambda,\, y_i = \lambda)
&= P(\Lambda_{i,j} = \lambda \mid y_i = \lambda)\, P(\Lambda_{i,k} = -\lambda \mid y_i = \lambda)\, P(y_i = \lambda) \\
&= \alpha_j (1 - \alpha_k)\, p_l^2\, P(y_i = \lambda)
\end{align*}

Thus we have:

\begin{align*}
\mathbb{E}_{\Lambda,y,w^*}[A^*(\Lambda, y)]
&\le p_l^2 \sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \big(\alpha_j(1 - \alpha_k) + (1 - \alpha_j)\alpha_k\big) \\
&= p_l^2 \sum_{j=1}^{n}\sum_{k \neq j} \alpha_j(1 - \alpha_k) \\
&\le p_l^2 \sum_{j=1}^{n}\sum_{k=1}^{n} \alpha_j(1 - \alpha_k) \\
&= n^2 p_l^2\, \bar{\alpha}(1 - \bar{\alpha}) = \bar{d}^{\,2}\, \bar{\alpha}(1 - \bar{\alpha})
\end{align*}
where we have defined the average labeling function accuracy as $\bar{\alpha}$, and where the label density is defined as $\bar{d} = p_l n$. Thus we have shown that the expected advantage scales at most quadratically in the label density.

A.1.3 Proof of Proposition 2

Recall that for data point i with true label yi ∈ {−1, 1}, Λi,j ∈ {−1, 0, 1} is a random variable representing the output label of the jth labeling function, with accuracy αj and fixed labeling propensity βj = pl:

\begin{align*}
\alpha_j &= P(\Lambda_{i,j} = 1 \mid y_i = 1,\, \Lambda_{i,j} \neq 0) = P(\Lambda_{i,j} = -1 \mid y_i = -1,\, \Lambda_{i,j} \neq 0) \\
\beta_j &= P(\Lambda_{i,j} \neq 0) = p_l
\end{align*}
where recall that we model the labeling functions as having class-symmetric accuracies; thus, without loss of generality, we consider $y_i = 1$. Consider the average, which is proportional to the unweighted majority vote $f_1$:

$$\bar{\Lambda}_i = \frac{1}{n}\sum_{j=1}^{n} \Lambda_{i,j} = \frac{1}{n} f_1(\Lambda_i)$$

Applying Hoeffding’s inequality, we have for any t > 0:

$$P\left(\bar{\Lambda}_i - \mathbb{E}\big[\bar{\Lambda}_i \mid y_i = 1\big] \le -t \;\middle|\; y_i = 1\right) \le \exp\left(-\tfrac{1}{2} n t^2\right)$$

By linearity of expectation, we have:

\begin{align*}
\mathbb{E}\big[\bar{\Lambda}_i \mid y_i = 1\big]
&= \frac{1}{n}\sum_{j=1}^{n} \mathbb{E}[\Lambda_{i,j} \mid y_i = 1] \\
&= \frac{1}{n}\sum_{j=1}^{n} \beta_j (2\alpha_j - 1) \\
&= p_l (2\bar{\alpha} - 1)
\end{align*}

where $\bar{\alpha} = \frac{1}{n}\sum_{j} \alpha_j$ is the average labeling function accuracy. Thus, assuming that $\bar{\alpha} > 0.5$, we can set $t = p_l(2\bar{\alpha} - 1)$ and get:

$$P\left(\bar{\Lambda}_i \le 0 \mid y_i = 1\right) \le \exp\left(-\tfrac{1}{2} n\, p_l^2\, (2\bar{\alpha} - 1)^2\right)$$

Re-writing slightly, we get:

$$P\left(y_i f_1(\Lambda_i) \le 0\right) \le \exp\left(-2 p_l \left(\bar{\alpha} - \tfrac{1}{2}\right)^2 \bar{d}\right)$$

Thus, we have a bound for the error rate of the unweighted majority vote in terms of the label density $\bar{d}$, which is in turn an upper bound for the optimal advantage $A^*$.
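The bound above can be checked numerically; the following minimal Monte Carlo sketch (added here for illustration, not part of the original proof) assumes that all labeling functions share the same accuracy $\bar{\alpha}$ and labeling propensity $p_l$, and counts ties as errors, matching $y_i f_1(\Lambda_i) \le 0$.

# Monte Carlo sanity check of the majority-vote error bound (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
n, p_l, alpha, trials = 20, 0.3, 0.75, 200_000

y = 1  # class-symmetric accuracies, so fix the true label WLOG
votes = rng.random((trials, n)) < p_l              # which LFs emit a non-zero label
correct = rng.random((trials, n)) < alpha          # whether a vote is correct
labels = np.where(votes, np.where(correct, y, -y), 0)
f1 = labels.sum(axis=1)                            # unweighted majority vote

empirical = np.mean(y * f1 <= 0)                   # ties counted as errors
d_bar = p_l * n
bound = np.exp(-2 * p_l * (alpha - 0.5) ** 2 * d_bar)
print(f"empirical error {empirical:.4f} <= bound {bound:.4f}")

With these settings the empirical error is well below the (loose) exponential bound, as expected.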

A.1.4 Proof of Proposition 3

In this proposition, our goal is to find a tractable upper bound on the conditional modeling advantage, i.e., the modeling advantage given the observed label matrix $\Lambda$. This will be useful because, given our label matrix, we can compute this quantity and, when it is small, safely skip learning the generative model and just use an unweighted majority vote (MV) of the labeling functions. We assume in this proposition that the true weights of the labeling functions lie within a fixed range, $w_j \in [w_{\min}, w_{\max}]$ with $w_{\min} > 0$, and have a mean $\bar{w}$. For notational convenience, let

$$y' = \begin{cases} 1 & f_1(\Lambda) > 0 \\ 0 & f_1(\Lambda) = 0 \\ -1 & f_1(\Lambda) < 0 \end{cases}$$

We start with the expected advantage, and upper-bound by the expected number of instances in which WMV* is correct and MV is incorrect (note that for tie votes, we simply upper bound by trivially assuming an expected advantage of one):

\begin{align*}
\mathbb{E}_{w^*, y}[A^*(\Lambda, y) \mid \Lambda]
&= \mathbb{E}_{w^*,\, y \sim P(\cdot \mid \Lambda, w^*)}[A_{w^*}(\Lambda, y)] \\
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*,\, y \sim P(\cdot \mid \Lambda_i, w^*)}\big[\mathbf{1}\{y_i \neq y_i'\}\, \mathbf{1}\{y_i' f_{w^*}(\Lambda_i) \le 0\}\big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*}\Big[\mathbb{E}_{y \sim P(\cdot \mid \Lambda_i, w^*)}\big[\mathbf{1}\{y_i \neq y_i'\}\big]\, \mathbf{1}\{y_i' f_{w^*}(\Lambda_i) \le 0\}\Big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*}\big[P(y_i \neq y_i' \mid \Lambda_i, w^*)\, \mathbf{1}\{y_i' f_{w^*}(\Lambda_i) \le 0\}\big]
\end{align*}
Next, define:

$$\Phi(\Lambda_i, y'') = \mathbf{1}\{c_{y''}(\Lambda_i)\, w_{\max} \ge c_{-y''}(\Lambda_i)\, w_{\min}\}$$
i.e., this is an indicator for whether WMV* could possibly output $y''$ as a prediction under best-case circumstances. We use this in turn to upper-bound the expected modeling advantage again:

\begin{align*}
\mathbb{E}_{w^*,\, y \sim P(\cdot \mid \Lambda, w^*)}[A_{w^*}(\Lambda, y)]
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*}\big[P(y_i \neq y_i' \mid \Lambda_i, w^*)\, \Phi(\Lambda_i, -y_i')\big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \Phi(\Lambda_i, -y_i')\, \mathbb{E}_{w^*}\big[P(y_i \neq y_i' \mid \Lambda_i, w^*)\big] \\
&\le \frac{1}{m}\sum_{i=1}^{m} \Phi(\Lambda_i, -y_i')\, P(y_i \neq y_i' \mid \Lambda_i, \bar{w})
\end{align*}
Now, recall that, for $y' \in \pm 1$:

\begin{align*}
P(y_i = y' \mid \Lambda_i, w)
&= \frac{P(y_i = y',\, \Lambda_i \mid w)}{\sum_{y'' \in \pm 1} P(y_i = y'',\, \Lambda_i \mid w)} \\
&= \frac{\exp\left(w^T \phi_i(\Lambda_i, y_i = y')\right)}{\sum_{y'' \in \pm 1} \exp\left(w^T \phi_i(\Lambda_i, y_i = y'')\right)} \\
&= \frac{\exp\left(w^T \Lambda_i\, y'\right)}{\exp\left(w^T \Lambda_i\right) + \exp\left(-w^T \Lambda_i\right)} \\
&= \sigma\left(2 f_w(\Lambda_i)\, y'\right)
\end{align*}

where $\sigma(\cdot)$ is the sigmoid function. Note that we are considering a simplified independent generative model with only accuracy factors; however, in this discriminative formulation the labeling propensity factors would drop out anyway since they do not depend on $y$, so their omission is just for notational simplicity. Putting this all together, removing the $y_i'$ placeholder and simplifying notation to match the main body of the paper, we have:

$$\mathbb{E}_{w^*, y}[A^*(\Lambda, y) \mid \Lambda] \le \frac{1}{m}\sum_{i=1}^{m} \sum_{y \in \pm 1} \mathbf{1}\{y f_1(\Lambda_i) \le 0\}\, \Phi(\Lambda_i, y)\, \sigma\left(2 y f_{\bar{w}}(\Lambda_i)\right) = \tilde{A}^*(\Lambda)\,.$$

Appendix B

Fonduer Appendix

B.1 Data Programming

Machine-learning-based KBC systems rely heavily on ground truth data (called training data) to achieve high quality. Traditionally, manual annotations or incomplete KBs are used to construct training data for machine-learning-based KBC systems. However, these resources are either costly to obtain or may have limited coverage over the candidates considered during the KBC process. To address this challenge, Fonduer builds upon the newly introduced paradigm of data programming [Ratner et al., 2016b], which enables both domain experts and non-domain experts alike to programmatically generate large training datasets by leveraging multiple weak supervision sources and domain knowledge.

In data programming, which provides a framework for weak supervision, users provide weak supervision in the form of user-defined functions, called labeling functions. Each labeling function provides potentially noisy labels for a subset of the input data, and these labels are combined to create large, potentially overlapping sets of labels which can be used to train a machine-learning model. Many different weak-supervision approaches can be expressed as labeling functions. This includes strategies that use existing knowledge bases, individual annotators' labels (as in crowdsourcing), or user-defined functions that rely on domain-specific patterns and dictionaries to assign labels to the input data.

The aforementioned sources of supervision can have varying degrees of accuracy, and may conflict with each other. Data programming relies on a generative probabilistic model to estimate the accuracy of each labeling function by reasoning about the conflicts and overlap across labeling functions. The estimated labeling function accuracies are in turn used to assign a probabilistic label to each candidate. These labels are used alongside a noise-aware discriminative model to train a machine-learning model for KBC.

B.1.1 Components of Data Programming

The main components in data programming are as follows:


Candidates A set of candidates C to be probabilistically classified.

Labeling Functions Labeling functions are used to programmatically provide labels for training data. A labeling function is a user-defined procedure that takes a candidate as input and outputs a label. Labels can be as simple as true or false for binary tasks, or one of many classes for more complex multiclass tasks. Since each labeling function is applied to all candidates and labeling functions are rarely perfectly accurate, there may be disagreements between them. The labeling functions provided by the user for binary classification can be more formally defined as follows. For each labeling function $\lambda_i$ and $r \in C$, we have $\lambda_i : r \mapsto \{-1, 0, 1\}$, where $+1$ or $-1$ denotes a candidate as "True" or "False" and $0$ abstains. The output of applying a set of $l$ labeling functions to $k$ candidates is the label matrix $\Lambda \in \{-1, 0, 1\}^{k \times l}$.

Output Data-programming frameworks output a confidence value $p$ for the classification of each candidate as a vector $Y \in \{p\}^k$. To perform data programming in Fonduer, we rely on a data-programming engine, Snorkel [Ratner et al., 2017b]. Snorkel accepts candidates and labels as input and produces marginal probabilities for each candidate as output. These input and output components are stored as relational tables. Their schemas are detailed in Section 3.3.
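As a concrete illustration of these components, the following minimal sketch (plain Python with NumPy; the candidates and labeling functions are hypothetical placeholders, not Fonduer's actual interface) assembles the label matrix $\Lambda$ from a handful of labeling functions.

# Minimal sketch of building the label matrix Lambda from labeling functions.
# Candidates and labeling functions are illustrative placeholders only.
import numpy as np

candidates = [
    {"text": "headaches induced by the drug", "chemical": "drug", "disease": "headaches"},
    {"text": "the drug did not cause nausea", "chemical": "drug", "disease": "nausea"},
]

def lf_induced_by(c):
    # Vote True if an "induced by" pattern appears in the sentence.
    return 1 if "induced by" in c["text"] else 0

def lf_negation(c):
    # Vote False if a negation word appears in the sentence.
    return -1 if " not " in c["text"] else 0

lfs = [lf_induced_by, lf_negation]

# Lambda has one row per candidate (k) and one column per labeling function (l).
Lambda = np.array([[lf(c) for lf in lfs] for c in candidates])
print(Lambda)  # [[ 1  0]
               #  [ 0 -1]]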

B.1.2 Theoretical Guarantees

While data programming uses labeling functions to generate noisy training data, it theoretically achieves a learning rate similar to methods that use manually labeled data [Ratner et al., 2016b]. In the typical supervised-learning setup, users are required to manually label $\tilde{O}(\epsilon^{-2})$ examples for the target model to achieve an expected loss of $\epsilon$. To achieve this rate, data programming only requires the user to specify a constant number of labeling functions that does not depend on $\epsilon$. Let $\beta$ be the minimum coverage across labeling functions (i.e., the probability that a labeling function provides a label for an input point) and $\gamma$ be the minimum reliability of labeling functions, where $\gamma = 2a - 1$ with $a$ denoting the accuracy of a labeling function. Then under the assumptions that (1) labeling functions are conditionally independent given the true labels of input data, (2) the number of user-provided labeling functions is at least $\tilde{O}(\gamma^{-3}\beta^{-1})$, and (3) there are $k = \tilde{O}(\epsilon^{-2})$ candidates, data programming achieves an expected loss of $\epsilon$. Despite the strict assumptions with respect to labeling functions, we find that using data programming to develop KBC systems for richly formatted data leads to high-quality KBs (across diverse real-world applications) even when some of the data-programming assumptions are not met (see Section 3.5.2).

B.2 Extended Feature Library

Fonduer augments a bidirectional LSTM with features from an extended feature library in order to better model the multiple modalities of richly formatted data. In addition, these extended features can provide signals drawn from large contexts since they can be calculated using Fonduer's data model of the document rather

Table B.1: Features from Fonduer's feature library. Example values are drawn from the example candidate in Figure 3.1. Capitalized prefixes represent the feature templates and the remainder of the string represents a feature's value.

[Table B.1 lists each feature template with an example value, a description, its arity (unary or binary), and its feature type (structural, tabular, or visual). Structural templates include TAG_, HTML_ATTR_, PARENT_TAG_, PREV_SIB_TAG_, NEXT_SIB_TAG_, NODE_POS_, ANCESTOR_CLASS_, ANCESTOR_TAG_, ANCESTOR_ID_, COMMON_ANCESTOR_, and LOWEST_ANCESTOR_DEPTH_. Tabular templates include CELL_, ROW_NUM_, COL_NUM_, ROW_SPAN_, COL_SPAN_, ROW_HEAD_, COL_HEAD_, SAME_TABLE, SAME_TABLE_ROW_DIFF_, SAME_TABLE_COL_DIFF_, SAME_TABLE_MANHATTAN_DIST_, SAME_CELL, WORD_DIFF_, CHAR_DIFF_, SAME_PHRASE, DIFF_TABLE, DIFF_TABLE_ROW_DIFF_, DIFF_TABLE_COL_DIFF_, and DIFF_TABLE_MANHATTAN_DIST_. Visual templates include ALIGNED_, PAGE_, SAME_PAGE, HORZ_ALIGNED, VERT_ALIGNED, VERT_ALIGNED_LEFT, VERT_ALIGNED_RIGHT, and VERT_ALIGNED_CENTER.]

a All N-grams are 1-grams by default. b This feature was not present in the example candidate. The values shown are example values from other documents. c In this example, the mention is 200, which forms part of the feature prefix. The value is shown in square brackets.

than being limited to a single sentence or table. In Section 3.5, we find that including multimodal features is critical to achieving high-quality relation extraction. Our extended feature library serves as a baseline example of these types of features that can be easily enhanced in the future. However, even with these baseline features, our users have built high-quality knowledge bases for their applications. The extended feature library consists of a baseline set of features from the structural, tabular, and visual modalities. Table B.1 lists the details of the extended feature library. Features are represented as strings, and each feature space is then mapped into a one-dimensional bit vector for each candidate, where each bit represents whether the candidate has the corresponding feature.

B.3 Fonduer at Scale

We use two optimizations to enable Fonduer’s scalability to millions of candidates (see Section 3.3.1): (1) data caching and (2) data representations that optimize data access during the KBC process. Such optimizations are standard in database systems. Nonetheless, their impact on KBC has not been studied in detail. Each candidate to be classified by Fonduer’s LSTM as “True” or “False” is associated with a set of mentions (see Section 3.3.2). For each candidate, Fonduer’s multimodal featurization generates features that describe each individual mention in isolation and features that jointly describe the set of all mentions in the candidate. Since each mention can be associated with many different candidates, we cache the featurization of each mention. Caching during featurization results in a 100× speed-up on average in the ELECTRONICS domain yet only accounts for 10% of this stage’s memory usage. Recall from Section 3.3.3 that Fonduer’s programming model introduces two modes of operation: (1) development, where users iteratively improve the quality of labeling functions without executing the entire pipeline; and (2) production, where the full pipeline is executed once to produce the knowledge base. We use different data representations to implement the abstract data structures of Features and Labels (a structure that stores the output of labeling functions after applying them over the generated candidates). Implementing Features as a list-of-lists structure minimizes runtime in both modes of operation since it accounts for sparsity. We also find that Labels implemented as a coordinate list during the development mode are optimal for fast updates. A list-of-lists implementation is used for Labels in production mode.

B.3.1 Data Caching

With richly formatted data, which frequently requires document-level context, thousands of candidates need to be featurized for each document. Candidate features from the extended feature library are computed at both the mention level and the relation level by traversing the data model and accessing modality attributes. Because each mention is part of many candidates, naïve featurization of candidates can result in the redundant computation of thousands of mention features. This pattern highlights the value of data caching when performing multimodal featurization on richly formatted data.

Traditional KBC systems that operate on single sentences of unstructured text pragmatically assume that only a small number of candidates will need to be featurized for each sentence and do not cache mention features as a result.

Example B.3.1 (Inefficient Featurization). In Figure 3.1, the transistor part mention MMBT3904 could match with up to 15 different numerical values in the datasheet. Without caching, features would be unnecessarily recalculated 14 times, once for each candidate. In real documents, hundreds of feature calculations would be wasted.

In Example B.3.1, eliminating unnecessary feature computations can improve performance by an order of magnitude. To optimize the feature-generation process, Fonduer implements a document-level caching scheme for mention features. The first computation of a mention feature requires traversing the data model. Then, the result is cached for fast access if the feature is needed again. All features are cached until all candidates in a document are fully featurized, after which the cache is flushed. Because Fonduer operates on documents atomically, caching a single document at a time improves performance without adding significant memory overhead. In the ELECTRONICS application, we find that caching achieves over 100× speed-up on average and in some cases even over 1000×, while only accounting for approximately 10% of the memory footprint of the featurization stage.
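The document-level caching pattern can be sketched as follows (a minimal illustration in plain Python; the feature computation and document structure below are hypothetical stand-ins for Fonduer's data-model traversal, not its actual implementation).

# Minimal sketch of document-level caching of mention features.
from functools import lru_cache

def featurize_document(document):
    calls = {"mention_featurizations": 0}

    @lru_cache(maxsize=None)  # cache lives only for the duration of this document
    def mention_features(mention):
        calls["mention_featurizations"] += 1
        # Stand-in for an expensive traversal of the data model
        # (structural / tabular / visual attributes of the mention).
        return (f"LEN_{len(mention)}", f"UPPER_{mention.isupper()}")

    features = {}
    for cand_id, (m1, m2) in document["candidates"].items():
        # Mention-level features are computed once per mention and reused across
        # candidates; relation-level features are computed per candidate.
        features[cand_id] = mention_features(m1) + mention_features(m2) + (
            f"SAME_ROW_{document['row'][m1] == document['row'][m2]}",)
    return features, calls["mention_featurizations"]

doc = {
    "candidates": {0: ("MMBT3904", "200"), 1: ("MMBT3904", "40"), 2: ("MMBT3904", "6")},
    "row": {"MMBT3904": 1, "200": 5, "40": 5, "6": 2},
}
feats, n_calls = featurize_document(doc)
print(n_calls)  # 4 unique mentions featurized instead of 6 without caching

Because the cache is scoped to a single document and discarded afterward, the speed-up comes without a persistent memory cost.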

Takeaways. When performing feature generation from richly formatted data, caching the intermediate results can yield over 1000× improvements in featurization runtime without adding significant memory overhead.

B.3.2 Data Representations

The Fonduer programming model involves two modes of operation: (1) development and (2) production. In development, users iteratively improve the quality of their labeling functions through error analysis, without executing the full pipeline as in previous techniques such as incremental KBC [Shin et al., 2015]. Once the labeling functions are finalized, the Fonduer pipeline is only run once in production. In both modes of operation, Fonduer produces two abstract data structures (Features and Labels, as described in Section 3.3). These data structures have three access patterns: (1) materialization, where the data structure is created; (2) updates, which include inserts, deletions, and value changes; and (3) queries, where users can inspect the features and labels to make informed updates to labeling functions. Both Features and Labels can be viewed as matrices, where each row represents annotations for a candidate (see Section 3.3.2). Features are dynamically named during multimodal featurization but are static for the lifetime of a candidate. Labels are statically named in classification but updated during development. Typically, Features are sparse: in the ELECTRONICS application, each candidate has about 100 features while the number of unique features can be more than 10M. Labels are also sparse, where the number of unique labels is the number of labeling functions.

The data representation that is implemented to store these abstract data structures can significantly affect overall system runtime. In the ELECTRONICS application, multimodal featurization accounts for 50% of end-to-end runtime, while classification accounts for 15%. We discuss two common sparse matrix representations that can be materialized in a SQL database.

• List of lists (LIL): Each row stores a list of (column_key, value) pairs. Zero-valued pairs are omitted. An entire row can be retrieved in a single query. However, updating values requires iterating over sublists.

• Coordinate list (COO): Rows store (row_key, column_key, value) triples. Zero-valued triples are omitted. With COO, multiple queries must be performed to fetch a row's attributes. However, updating values takes constant time.

The choice of data representation for Features and Labels reflects their different access patterns, as well as the mode of operation. During development, Features are materialized once, but frequently queried during the iterative KBC process. Labels are updated each time a user modifies labeling functions. In production, Features' access pattern remains the same. However, Labels are not updated once users have finalized their set of labeling functions.

From the access patterns in the Fonduer pipeline, and the characteristics of each sparse matrix representation, we find that implementing Features as an LIL minimizes runtime in both production and development. Labels, however, should be implemented as COO to support fast insertions during iterative KBC and reduce runtimes for each iteration. In production, Labels can also be implemented as LIL to avoid the computation overhead of COO. In the ELECTRONICS application, we find that LIL provides 1.4× speed-up over COO in production and that COO provides over 5.8× speed-up over LIL when adding a new labeling function.
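The trade-off between the two representations can be illustrated with a standard sparse-matrix library; the sketch below uses SciPy purely as an analogy (Fonduer materializes Features and Labels as relational tables in a SQL database, not as SciPy matrices).

# Illustrative sketch of the LIL vs. COO trade-off with SciPy sparse matrices.
import numpy as np
from scipy.sparse import lil_matrix, coo_matrix

k, l = 5, 3  # candidates x labeling functions

# LIL: efficient row access, i.e., querying all labels of one candidate at once ...
labels_lil = lil_matrix((k, l), dtype=np.int8)
labels_lil[0, 1] = 1
print(labels_lil.getrow(0).toarray())   # fast row query

# ... while COO is built from (row, col, value) triples, so appending the votes
# of a newly added labeling function is a cheap constant-time append per vote.
rows, cols, vals = [0, 2], [1, 1], [1, -1]
rows += [1]; cols += [2]; vals += [1]   # new LF's vote appended as another triple
labels_coo = coo_matrix((vals, (rows, cols)), shape=(k, l), dtype=np.int8)
print(labels_coo.toarray())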

Takeaways. We find that Labels should be implemented as a coordinate list during development, which supports fast updates for supervision, while Features should use a list of lists, which provides faster query times. In production, both Features and Labels should use a list-of-list representation.

B.4 Future Work

Being able to extract information from richly formatted data enables a wide range of applications, and represents a new and interesting research direction. While we have demonstrated that Fonduer can already obtain high-quality knowledge bases in several applications, we recognize that many interesting challenges remain. We briefly discuss some of these challenges.

Data Model One challenge in extracting information from richly formatted data comes directly at the data level—we cannot perfectly preserve all document information. Future work in parsing, OCR, and computer vision has the potential to improve the quality of Fonduer's data model for complex table structures and figures. For example, improving the granularity of Fonduer's data model to be able to identify axis titles,

legends, and footnotes could provide additional signals to learn from and additional specificity for users to leverage while using the Fonduer programming model.

Deep-Learning Model Fonduer’s multimodal recurrent neural network provides a prototypical automated featurization approach that achieves high quality across several domains. However, future developments for incorporating domain-specific features could strengthen these models. In addition, it may be possible to expand our deep-learning model to perform additional tasks (e.g., identifying candidates) to simplify the Fonduer pipeline.

Programming Model Fonduer currently exposes a Python interface to allow users to provide weak supervision. However, further research in user interfaces for weak supervision could bolster user efficiency in Fonduer. For example, allowing users to use natural language or graphical interfaces in supervision may result in improved efficiency and reduced development time through a more powerful programming model. Similarly, feedback techniques like active learning [Settles, 2012] could empower users to more quickly recognize classes of candidates that need further disambiguation with LFs.

B.5 GwasKB Web Interface

Figure B.1: The GWASkb web application hosted at http://gwaskb.stanford.edu for exploring the contents of the GWASkb knowledge base created with Fonduer.

Figure B.2: Users can search by genotype (e.g., rs7329174) or phenotype (e.g., breast cancer) and see all related studies and associations with links to the corresponding articles for further exploration.

Appendix C

BabbleLabble Appendix

C.1 Predicate Examples

Below are the predicates in the rule-based semantic parser grammar, each of which may have many supported paraphrases, only one of which is listed here in a minimal example.

Logic
and: X is true and Y is true
or: X is true or Y is true
not: X is not true
any: Any of X or Y or Z is true
all: All of X and Y and Z are true
none: None of X or Y or Z is true

Comparison
=: X is equal to Y
≠: X is not Y
<: X is smaller than Y
≤: X is no more than Y
>: X is larger than Y
≥: X is at least Y

Syntax
lower: X is lowercase
upper: X is upper case
capital: X is capitalized


all_caps: X is in all caps
starts_with: X starts with "cardio"
ends_with: X ends with "itis"
substring: X contains "-induced"

Named-entity Tags
person: A person is between X and Y
location: A place is within two words of X
date: A date is between X and Y
number: There are three numbers in the sentence
organization: An organization is right after X

Lists
list: (X, Y) is in Z
set: X, Y, and Z are true
count: There is one word between X and Y
contains: X is in Y
intersection: At least two of X are in Y
map: X is at the start of a word in Y
filter: There are three capitalized words to the left of X
alias: A spouse word is in the sentence ("spouse" is a predefined list from the user)

Position
word_distance: X is two words before Y
char_distance: X is twenty characters after Y
left: X is before Y
right: X is after Y
between: X is between Y and Z
within: X is within five words of Y
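To make the connection to labeling functions concrete, the following minimal sketch (plain Python; the helper predicate and candidate fields are illustrative, not the actual output of the semantic parser) shows the kind of function that an explanation built from the within predicate compiles into.

# Minimal sketch: a labeling function corresponding to an explanation such as
# "Label true because 'married' occurs within five words of person X."
def within(tokens, word, anchor_index, k):
    """Predicate 'within': word occurs within k tokens of the anchor position."""
    lo, hi = max(0, anchor_index - k), anchor_index + k + 1
    return word in tokens[lo:hi]

def lf_married_near_x(candidate):
    tokens = candidate["tokens"]
    return 1 if within(tokens, "married", candidate["x_index"], 5) else 0

cand = {"tokens": ["Alice", "and", "Bob", "were", "married", "in", "2010"], "x_index": 0}
print(lf_married_near_x(cand))  # 1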

C.2 Sample Explanations

The following are a sample of the explanations provided by users for each task.

Spouse Users referred to the first person in the sentence as “X” and the second as “Y”.

Label true because "and" occurs between X and Y and "marriage" occurs one word after person1.

Label true because person Y is preceded by ‘beau’.

Label false because the words "married", "spouse", "husband", and "wife" do not occur in the sentence.

Label false because there are more than 2 people in the sentence and "actor" or "actress" is left of person1 or person2.

Disease

Label true because the disease is immediately after the chemical and ’induc’ or ’assoc’ is in the chemical name.

Label true because a word containing ’develop’ appears somewhere before the chemical, and the word ’following’ is between the disease and the chemical.

Label true because "induced by", "caused by", or "due to" appears between the chemical and the disease."

Label false because "none", "not", or "no" is within 30 characters to the left of the disease.

Protein

Label true because "Ser" or "Tyr" are within 10 characters of the protein.

Label true because the words "by" or "with" are between the protein and kinase and the words "no", "not" or "none" are not in between the protein and kinase APPENDIX C. BABBLELABBLE APPENDIX 100

and the total number of words between them is smaller than 10.

Label false because the sentence contains "mRNA", "DNA", or "RNA".

Label false because there are two "," between the protein and the kinase with less than 30 characters between them.

Bibliography

Worldwide semiannual cognitive/artificial intelligence systems spending guide. Technical report, International Data Corporation, 2017.

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2015.

A. K. Agrawala. Learning with a probabilistic teacher. IEEE Transactions on Infomation Theory, 16:373–379, 1970.

L. V. Ahn, R. Liu, and M. Blum. Peekaboom: a game for locating objects in images. In Conference on Human Factors in Computing Systems (CHI), pages 55–64, 2006.

E. Alfonseca, K. Filippova, J. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Association for Computational Linguistics (ACL), pages 54–59, 2012a.

E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Meeting of the Association for Computational Linguistics (ACL), 2012b.

G. Angeli, S. Gupta, M. Jose, C. Manning, C. Ré, J. Tibshirani, J. Wu, S. Wu, and C. Zhang. Stanford’s 2014 slot filling systems. TAC KBP, 695, 2014.

S. Arora and E. Nyberg. Interactive annotation learning with indirect feature voting. In Association for Computational Linguistics (ACL), pages 55–60, 2009.

S. Bach, B. He, A. Ratner, and C. Ré. Learning the structure of generative models without labeled data. In International Conference on Machine Learning (ICML), 2017.

S. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi, R. Kuchhal, C. Ré, and R. Malkin. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. Arxiv, 2019.


D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

D. Barowy, S. Gulwani, T. Hart, and B. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In ACM SIGPLAN Notices, volume 50, pages 218–228. ACM, 2015.

T. Beck, R. Hastings, S. Gollapudi, R. Free, and A. Brookes. GWAS Central: a comprehensive resource for the comparison and interrogation of genome-wide association studies. EJHG, 22(7):949–952, 2014.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Workshop on Computational Learning Theory (COLT), 1998.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD), pages 1247–1250, 2008.

E. Brown, E. Epstein, J. Murdock, and T. Fin. Tools and methods for building Watson. IBM Research. Abgerufen am, 14:2013, 2013.

R. C. Bunescu and R. J. Mooney. Learning to extract relations from the Web using minimal supervision. In Meeting of the Association for Computational Linguistics (ACL), 2007.

A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr, and T. M. Mitchell. Toward an architecture for never-ending language learning. In Association for the Advancement of Artificial Intelligence (AAAI), 2010.

R. Caspi, R. Billington, L. Ferrer, H. Foerster, C. Fulcher, I. Keseler, A. Kothari, M. Krummenacker, M. La- tendresse, L. Mueller, Q. Ong, S. Paley, P. Subhraveti, D. Weaver, and P. Karp. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research, 44(D1):D471–D480, 2016.

O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. Adaptive Computation and Machine Learning. MIT Press, 2009.

J. Clarke, D. Goldwasser, M. Chang, and D. Roth. Driving semantic parsing from the world’s response. In Computational Natural Language Learning (CoNLL), pages 18–27, 2010.

D. Corney, D. Albakour, M. Martinez, and S. Moussa. What do a million news articles look like? In Workshop on Recent Trends in News Information Retrieval, 2016a.

D. Corney, D. Albakour, M. Martinez-Alvarez, and S. Moussa. What do a million news articles look like? In NewsIR@ ECIR, pages 42–47, 2016b.

Mirel Cosulschi, Nicolae Constantinescu, and Mihai Gabroveanu. Classification and comparison of information structures from a web page. Annals of the University of Craiova-Mathematics and Computer Science Series, 31, 2004.

M. Craven, J. Kumlien, et al. Constructing biological knowledge bases by extracting information from text sources. In ISMB, pages 77–86, 1999.

N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In International World Wide Web Conference (WWW), 2013.

A. Davis, C. Grondin, R. Johnson, D. Sciaky, B. King, R. McMorran, J. Wiegers, T. Wiegers, and C. Mattingly. The comparative toxicogenomics database: update 2017. Nucleic Acids Research, 2016.

A. Davis et al. A CTD–Pfizer collaboration: Manual curation of 88,000 scientific articles text mined for drug–disease and drug–phenotype interactions. Database, 2013.

A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society C, 28(1):20–28, 1979.

J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 601–610, 2014.

X. L. Dong and D. Srivastava. Big Data Integration. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2015.

G. Druck, B. Settles, and A. McCallum. Active learning by labeling features. In Empirical Methods in Natural Language Processing (EMNLP), pages 81–90, 2009.

L. Eadicicco. Baidu’s on the future of artificial intelligence, 2017. Time [Online; posted 11-January-2017].

D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. Murdock, E. Nyberg, J. Prager, et al. Building watson: An overview of the deepqa project. AI magazine, 31(3):59–79, 2010.

D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In AAAI/IAAI, pages 517–523, 1998.

J. Fries, P. Varma, V. Chen, K. Xiao, H. Tejeda, P. Saha, J. Dunnmon, H. Chubb, S. Maskatia, M. Fiterau, S. Delp, E. Ashley, C. Ré, and J. Priest. Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences. bioRxiv, 2018. doi: 10.1101/339630. URL https://www.biorxiv.org/content/early/2018/08/22/339630.

H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26(3):10–14, 2011.

W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak. Towards domain-independent information extraction from web tables. In WWW, pages 71–80, 2007.

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

D. Goldwasser and D. Roth. Learning from natural instructions. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1794–1800, 2011.

V. Govindaraju, C. Zhang, and C. Ré. Understanding tables in context using standard nlp toolkits. In ACL, pages 658–664, 2013.

A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.

A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.

A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

S. Gupta and C. Manning. Improved pattern learning for bootstrapped entity extraction. In CoNLL, 2014.

B. Hancock, S. Wang, P. Varma, P. Liang, and C. Ré. Babble labble: Learning from natural language explanations. In Advances in Neural Information Processing Systems (NeurIPS) Demonstrations, 2017.

B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré. Training classifiers with natural language explanations. In Association for Computational Linguistics (ACL), 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

K. He, R. Girshick, and P. Dollár. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.

M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Interational Conference on Computational linguistics, pages 539–545, 1992.

M. Hewett, D. Oliver, D. Rubin, K. Easton, J. Stuart, R. Altman, and T. Klein. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Research, 30(1):163–165, 2002.

G. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8): 1771–1800, 2002.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Meeting of the Association for Computational Linguistics (ACL), 2011a.

R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Association for Computational Linguistics (ACL), pages 541–550, 2011b.

N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6 (5):429–449, 2002.

M. Joglekar, H. Garcia-Molina, and A. Parameswaran. Comprehensive and reliable crowd assessment algorithms. In International Conference on Data Engineering (ICDE), 2015.

N. Khandwala, A. Ratner, J. Dunnmon, R. Goldman, M. Lungren, D. Rubin, and C. Ré. Cross-modal data programming for medical images. NIPS ML4H Workshop, 2017.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

M. Kovacevic, M. Diligenti, M. Gori, and V. Milutinovic. Recognition of common areas in a web page using visual information: a possible application in a page classification. In ICDM, pages 250–257, 2002.

S. Krening, B. Harrison, K. M. Feigh, C. L. Isbell, M. Riedl, and A. Thomaz. Learning from explanations using sentiment and advice in RL. IEEE Transactions on Cognitive and Developmental Systems, 9(1): 44–55, 2017.

J. Ku, J. Hicks, T. Hastie, J. Leskovec, C. Ré, and S. Delp. The Mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility. Journal of the American Medical Informatics Association, 22(6):1120–1125, 2015.

Volodymyr Kuleshov, Jialin Ding, Christopher Vo, Braden Hancock, Alexander Ratner, Yang Li, Christopher Ré, Serafim Batzoglou, and Michael Snyder. A machine-compiled database of genome-wide association studies. Nature communications, 10(1):3341, 2019.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.

T. Lei, R. Barzilay, and T. Jaakkola. Rationalizing neural predictions. In Empirical Methods in Natural Language Processing (EMNLP), 2016.

H. Li, B. Yu, and D. Zhou. Error rate analysis of labeling by crowdsourcing. In ICML Workshop: Machine Learning Meets Crowdsourcing. Atalanta, Georgia, USA, 2013.

J. Li, T. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In ACL, pages 1106–1115, 2015a.

J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Learning through dialogue interactions. arXiv preprint arXiv:1612.04936, 2016.

Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. A survey on truth discovery. SIGKDD Explor. Newsl., 17(2), 2015b.

P. Liang. Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59, 2016.

P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009a.

P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009b.

P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599, 2011.

H. Ling and S. Fidler. Teaching machines to describe images via natural language feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

A. Madaan, A. Mittal, G. Mausam, G. Ramakrishnan, and S. Sarawagi. Numerical relation extraction with minimal supervision. In AAAI, pages 2764–2771, 2016.

G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11:955–984, 2010.

C. Metz. Google’s hand-fed AI now gives answers, not just search results, 2016. Wired [Online; posted 29-November-2016].

B. Min, R. Grishman, L. Wan, C. Wang, and D. Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In North American Association for Computational Linguistics (NAACL), pages 777–782, 2013.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Association for Computational Linguistics (ACL), pages 1003–1011, 2009a.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Meeting of the Association for Computational Linguistics (ACL), 2009b.

N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236. ACM, 2011.

T. Nguyen and A. Moschitti. End-to-end relation extraction using distant supervision from external semantic repositories. In HLT, pages 277–282, 2011.

S. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22 (10):1345–1359, 2009.

S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences of the USA, 111(4):1253–1258, 2014.

P. Pasupat and P. Liang. Zero-shot entity extraction from web pages. In ACL (1), pages 391–401, 2014.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch, 2017.

G. Penn, J. Hu, H. Luo, and R. McDonald. Flexible web document analysis for delivery to narrow-bandwidth devices. In ICDAR, volume 1, pages 1074–1078, 2001.

D. Pinto, M. Branstein, R. Coleman, W. Croft, M. King, W. Li, and X. Wei. Quasm: a system for question answering using semi-structured data. In JCDL, pages 46–55, 2002.

R. Pochampally, Anish Das Sarma, X. L. Dong, A. Meliou, and D. Srivastava. Fusing data with correlations. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2014.

A. J. Quinn and B. B. Bederson. Human computation: A survey and taxonomy of a growing field. In ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), 2011.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.

H. Raghavan, O. Madani, and R. Jones. Interactive feature selection. In International Joint Conference on Artificial Intelligence (IJCAI), volume 5, pages 841–846, 2005.

A. Ratner, C. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Neural Information Processing Systems (NIPS), 2016a.

A. Ratner, S. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. CoRR, abs/1711.10160, 2017a. URL http://arxiv.org/abs/1711.10160.

A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. In Very Large Data Bases (VLDB), number 3, pages 269–282, 2017b.

A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. Ré. Snorkel metal: Weak supervision for multi-task learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, page 3. ACM, 2018.

A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré. Training complex models with multi-task weak supervision. AAAI, 2019a.

A. Ratner, B. Hancock, and C. Ré. The role of massively multi-task and weak supervision in software 2.0. In Conference on Innovative Data Systems Research, 2019b.

A. J. Ratner, C. M. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems (NeurIPS), pages 3567–3575, 2016b.

C. Ré, A. Sadeghian, Z. Shan, J. Shin, F. Wang, S. Wu, and C. Zhang. Feature engineering for knowledge base construction. IEEE Data Engineering Bulletin, 2014.

T. Rekatsinas, X. Chu, I. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017a.

T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2017b.

S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 148–163, 2010a.

S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010b.

B. Roth and D. Klakow. Combining generative and discriminative model scores for distant supervision. In Empirical Methods in Natural Language Processing (EMNLP), pages 24–29, 2013a.

B. Roth and D. Klakow. Combining generative and discriminative model scores for distant supervision. In Conference on Empirical Methods on Natural Language Processing (EMNLP), 2013b.

V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan. Finding a “kneedle” in a haystack: Detecting knee points in system behavior. In International Conference on Distributed Computing Systems Workshops, 2011.

H. J. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11:363–371, 1965.

B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.

J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. In Very Large Data Bases (VLDB), number 11, pages 1310–1321, 2015.

A. Singhal. Introducing the knowledge graph: things, not strings. Official Google Blog, 2012.

S. Srivastava, I. Labutov, and T. Mitchell. Joint concept learning and semantic parsing from natural language explanations. In Empirical Methods in Natural Language Processing (EMNLP), pages 1528–1537, 2017.

R. Stewart and S. Ermon. Label-free supervision of neural networks with physics and other domain knowledge. In AAAI Conference on Artificial Intelligence (AAAI), 2017.

F. Suchanek, G. Kasneci, and G. Weikum. YAGO: A large ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

S. Takamatsu, I. Sato, and H. Nakagawa. Reducing wrong labels in distant supervision for relation extraction. In Association for Computational Linguistics (ACL), pages 721–729, 2012a.

S. Takamatsu, I. Sato, and H. Nakagawa. Reducing wrong labels in distant supervision for relation extraction. In Meeting of the Association for Computational Linguistics (ACL), 2012b.

A. Tengli, Y. Yang, and N. Ma. Learning table extraction from examples. In International Conference on Computational Linguistics (COLING), page 987, 2004.

J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Association for Computational Linguistics (ACL), pages 384–394, 2010.

P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, and C. Ré. Socratic learning: Augmenting generative models to incorporate latent subsets in training data. arXiv preprint arXiv:1610.08123, 2017.

P. Verga, D. Belanger, E. Strubell, B. Roth, and A. McCallum. Multilingual relation extraction using compositional universal schema. In HLT-NAACL, pages 886–896, 2016.

S. I. Wang, S. Ginn, P. Liang, and C. D. Manning. Naturalizing a programming language via interactive learning. In Association for Computational Linguistics (ACL), 2017.

C.-H. Wei, Y. Peng, R. Leaman, A. P. Davis, C. J. Mattingly, J. Li, T. C. Wiegers, and Z. Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the fifth BioCreative challenge evaluation workshop, pages 154–166, 2015a.

C.-H. Wei, Y. Peng, R. Leaman, A. P. Davis, C. J. Mattingly, J. Li, T. C. Wiegers, and Z. Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In BioCreative Challenge Evaluation Workshop, 2015b.

K. Weiss, T. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 42:D1001–D1006, 2014.

J. E. Weston. Dialog-based language learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 829–837, 2016.

S. Wu, L. Hsiao, X. Cheng, B. Hancock, T. Rekatsinas, P. Levis, and C. Ré. Fonduer: Knowledge base construction from richly formatted data. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2018.

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

M. Yahya, S. Whang, R. Gupta, and A. Halevy. ReNoun: Fact extraction for nominal attributes. In Empirical Methods in Natural Language Processing (EMNLP), pages 325–335, 2014.

Y. Yang and H. Zhang. HTML page analysis based on visual cues. In International Conference on Document Analysis and Recognition (ICDAR), pages 859–864, 2001.

M.-C. Yuen, I. King, and K.-S. Leung. A survey of crowdsourcing systems. In Privacy, Security, Risk and Trust (PASSAT) and International Conference on Social Computing (SocialCom), 2011.

O. F. Zaidan and J. Eisner. Modeling annotators: A generative approach to learning from annotator rationales. In Empirical Methods in Natural Language Processing (EMNLP), 2008a.

O. F. Zaidan and J. Eisner. Modeling annotators: A generative approach to learning from annotator rationales. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008b.

J. M. Zelle and R. J. Mooney. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055, 1996.

L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI), pages 658–666, 2005.

C. Zhang, C. Ré, M. Cafarella, C. De Sa, A. Ratner, J. Shin, F. Wang, and S. Wu. DeepDive: Declarative knowledge base construction. Commun. ACM, 60(5):93–102, 2017.

Y. Zhang, A. Chaganty, A. Paranjape, D. Chen, J. Bolton, P. Qi, and C. Manning. Stanford at TAC KBP 2016: Sealing pipeline leaks and understanding Chinese. In Text Analysis Conference (TAC), 2016a.

Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Journal of Machine Learning Research, 17:1–44, 2016b.

B. Zhao, B. Rubinstein, J. Gemmell, and J. Han. A Bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550–561, 2012.