Incorporating Domain Knowledge in Latent Topic Models
INCORPORATING DOMAIN KNOWLEDGE IN LATENT TOPIC MODELS

by

David Michael Andrzejewski

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the UNIVERSITY OF WISCONSIN–MADISON

2010

© Copyright by David Michael Andrzejewski 2010
All Rights Reserved

For my parents and my Cho.

ACKNOWLEDGMENTS

Obviously none of this would have been possible without the diligent advising of Mark Craven and Xiaojin (Jerry) Zhu. Taking a bioinformatics class from Mark as an undergraduate initially got me excited about the power of statistical machine learning to extract insights from seemingly impenetrable datasets. Jerry's enthusiasm for research and relentless pursuit of excellence were a great inspiration for me. On countless occasions, Mark and Jerry have selflessly donated their time and effort to help me develop better research skills and to improve the quality of my work. I would have been lucky to have even a single advisor as excellent as either Mark or Jerry; I have been extremely fortunate to have them both as co-advisors.

My committee members have also been indispensable. Jude Shavlik has always brought an emphasis on clear communication and solid experimental technique, which will hopefully stay with me for the rest of my career. Michael Newton helped me understand the modeling issues in this research from a statistical perspective. Working with prelim committee member Ben Liblit gave me the exciting opportunity to apply machine learning to a very challenging problem in the debugging work presented in Chapter 4. I also learned a lot about how other computer scientists think from meetings with Ben. Optimization guru Ben Recht provided great ideas and insights about issues of scalability in practical applications of machine learning.

The Computation and Informatics in Biology and Medicine (CIBM) program at the University of Wisconsin–Madison provided invaluable support, both financial and otherwise, for the majority of my graduate career. CIBM gave me the opportunity to get involved in the exchange of ideas at the crossroads of computational and biological sciences in a variety of venues.

The machine learning community at Wisconsin was a constant source of advice, expertise, and camaraderie. In particular, it was great to interact with everyone who participated in the AI and various other reading groups. My graduate school experience was also greatly enriched by interactions and collaborations with past and present members of the Zhu and Craven research groups. In alphabetical order, members of the Craven lab during my studies were Debbie Chasman, Deborah Muganda, Keith Noto, Irene Ong, Yue Pan, Soumya Ray, Burr Settles, Adam Smith, and Andreas Vlachos, while members of the Zhu group included Bryan Gibson, Nate Fillmore, Andrew Goldberg, Lijie Heng, Kwang-Sung Jun, Ming Li, Junming Sui, Jurgen Van Gael, and Zhiting Xu.

Biological collaborators Brandi Gancarz (Ahlquist lab) and Ron Stewart (Thomson lab) helped keep this research focused on real problems. The biological annotations in Chapter 5 were provided by Brandi, and Ron is the biological expert collaborator for the text mining application in Chapter 7. The LogicLDA work presented in Chapter 7 benefited from helpful discussions with Markov Logic experts Daniel Lowd, Sriraam Natarajan, and Sebastian Riedel.
My summer internship working with Alice Zheng at Microsoft Research was a great experience where I learned a lot about how research gets applied to real-world problems. Working with David Stork on the computer analysis of Mondrian paintings was very interesting, and definitely taught me to broaden my horizons with respect to what constitutes a machine learning problem.

My parents and family have always been supportive of my plan to stay in school (nearly) forever, and for that I am incredibly grateful. Finally, Cho has been with me through the entire process: from graduate school applications right up through the final defense. Her patience and love have meant everything to me and made this all possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 Introduction
  1.1 A motivating example
  1.2 Latent topic modeling
  1.3 Challenges in topic modeling
  1.4 Adding domain knowledge
  1.5 Overview

2 Background
  2.1 Overview
  2.2 Latent Dirichlet Allocation (LDA)
    2.2.1 Model definition
    2.2.2 Inference
    2.2.3 The number of topics

3 Related work
  3.1 LDA variants
    3.1.1 LDA+X
    3.1.2 φ-side
    3.1.3 θ-side
    3.1.4 Summary and relation to this thesis
  3.2 Augmented clustering
    3.2.1 Constrained clustering
    3.2.2 Interactive clustering

4 DeltaLDA
  4.1 Cooperative Bug Isolation (CBI)
  4.2 DeltaLDA
  4.3 Experiments
    4.3.1 Clustering runs by cause of failure
    4.3.2 Identifying root causes of failure
  4.4 Discussion
  4.5 Summary

5 Topic-in-set knowledge
  5.1 Collapsed Gibbs sampling with z-labels
  5.2 Experiments
    5.2.1 Concept expansion
    5.2.2 Concept exploration
  5.3 Principled derivation
  5.4 Summary

6 Dirichlet Forest priors
  6.1 Encoding Must-Link
  6.2 Encoding Cannot-Link
  6.3 The Dirichlet Forest prior
  6.4 Inference and estimation
    6.4.1 Sampling z
    6.4.2 Sampling q
    6.4.3 Estimating φ and θ
  6.5 Experiments
    6.5.1 Synthetic data
    6.5.2 Wish corpus
    6.5.3 Yeast corpus
  6.6 Summary

7 LogicLDA
  7.1 The LogicLDA model
    7.1.1 Undirected LDA
    7.1.2 Logic background
    7.1.3 LogicLDA model
  7.2 Inference
    7.2.1 Collapsed Gibbs Sampling (CGS)
    7.2.2 MaxWalkSAT (MWS)
    7.2.3 Alternating Optimization with MWS+LDA (M+L)
    7.2.4 Alternating Optimization with Mirror Descent (Mir)
  7.3 Experiments
    7.3.1 Synthetic Must-Links and Cannot-Links (S1, S2)
    7.3.2 Mac vs PC (Mac)
    7.3.3 Comp.* newsgroups (Comp)
    7.3.4 Congress (Con)
    7.3.5 Polarity (Pol)
    7.3.6 Human development genes (HDG)
    7.3.7 Evaluation of inference
  7.4 Encoding LDA variants
    7.4.1 Concept-Topic Model
    7.4.2 Hidden Topic Markov Model
    7.4.3 Restricted topic models
  7.5 Summary

8 Conclusion
  8.1 Summary
  8.2 Software
  8.3 Conclusion and future directions
LIST OF REFERENCES

APPENDICES
  Appendix A: Collapsed Gibbs sampling derivation for ∆LDA
  Appendix B: Collapsed Gibbs sampling for Dirichlet Forest LDA
  Appendix C: Collapsed Gibbs sampling derivation for LogicLDA

LIST OF TABLES

0.1 Symbols used in this thesis (part one).
0.2 Symbols used in this thesis (part two).
0.3 Symbols used in this thesis (part three).
0.4 Symbols used in this thesis (part four).
1.1 Example wishes.
1.2 Corpus-wide word frequencies and learned topic word probabilities.
1.3 Example learned topics which may not be meaningful or useful to the user.
2.1 Example generated document d1, with N1 = 6.
3.1 Overview of LDA variant families discussed in this chapter.
4.1 Schema mapping between text and statistical debugging.
4.2 Statistical debugging datasets. For each program, the number of bug topics is set to the ground-truth number of bugs.
4.3 Rand indices showing similarity between computed clusterings of failing runs and true partitionings by cause of failure (complete agreement is 1.0, complete disagreement is 0.0).
5.1 Standard LDA and z-label topics learned from a corpus of PubMed abstracts, where the goal is to learn topics related to translation. Concept seed words are bolded; other words judged relevant to the target concept are italicized.
5.2 z-label topics learned from an entity-tagged corpus of.