Rich and Scalable Models for Text

ABSTRACT

Title of dissertation: RICH AND SCALABLE MODELS FOR TEXT
Thang Dai Nguyen, Doctor of Philosophy, 2019

Dissertation directed by:
Professor Jordan Boyd-Graber, Department of Computer Science and Institute for Advanced Computer Studies
Professor Philip Resnik, Department of Linguistics and Institute for Advanced Computer Studies

Topic models have become essential tools for uncovering hidden structure in big data. However, the most popular topic model algorithm, Latent Dirichlet Allocation (LDA), and its extensions suffer from sluggish performance on big datasets. Recently, the machine learning community has attacked this problem with spectral learning approaches such as the method of moments with tensor decomposition or matrix factorization. The anchor word algorithm of Arora et al. [2013] has emerged as a more efficient approach to a large class of topic modeling problems. The anchor word algorithm is fast, and it comes with a provable theoretical guarantee: it converges to a global solution given a sufficient number of documents.

In this thesis, we present a series of spectral models based on the anchor word algorithm that serve a broader class of datasets and provide richer, more flexible modeling capacity. First, we improve the anchor word algorithm by incorporating rich priors in the form of appropriate regularization terms. Our new regularized anchor word algorithms produce higher topic quality and provide the flexibility to incorporate informed priors, making it possible to discover topics better suited to external knowledge.

Second, we enrich the anchor word algorithm with metadata-based word representations for labeled datasets. Our new supervised anchor word algorithm runs very fast and predicts better than supervised topic models such as supervised LDA on three sentiment datasets.
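To make the anchor word approach concrete, the following is a minimal sketch of anchor-based topic recovery. The function names (`find_anchors`, `recover_topics`) and the toy data are illustrative, and plain non-negative least squares stands in for the exponentiated-gradient solver of Arora et al. [2013]; this is a simplified illustration of the idea, not the dissertation's implementation.

```python
import numpy as np
from scipy.optimize import nnls

def find_anchors(Q_bar, k):
    """Greedily pick k rows of the row-normalized co-occurrence matrix
    that are farthest from the span of the rows chosen so far
    (a simplified farthest-point / Gram-Schmidt search)."""
    anchors = []
    residual = Q_bar.copy()
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)
        a = int(np.argmax(norms))
        anchors.append(a)
        u = residual[a] / np.linalg.norm(residual[a])
        residual = residual - np.outer(residual @ u, u)  # project out direction u
    return anchors

def recover_topics(Q, anchors):
    """Express each word's co-occurrence profile as a non-negative
    combination of the anchor rows, then renormalize the result into
    per-topic word distributions (columns sum to 1)."""
    p_w = Q.sum(axis=1)                       # word marginal probabilities
    Q_bar = Q / p_w[:, None]                  # row-normalized co-occurrences
    S = Q_bar[anchors]                        # k x V basis of anchor rows
    C = np.vstack([nnls(S.T, row)[0] for row in Q_bar])  # V x k coefficients
    A = C * p_w[:, None]                      # unnormalized topic-word weights
    return A / A.sum(axis=0, keepdims=True)

# Toy example: a tiny synthetic symmetric co-occurrence matrix.
rng = np.random.default_rng(0)
Q = rng.random((20, 20))
Q = (Q + Q.T) / 2
Q /= Q.sum()
anchors = find_anchors(Q / Q.sum(axis=1, keepdims=True), k=3)
topics = recover_topics(Q, anchors)
print(topics.shape)  # (20, 3): one distribution over 20 words per topic
```

Because the heavy work operates on the V x V co-occurrence matrix rather than on every token of every document, the cost is largely independent of corpus size once Q is built, which is the source of the algorithm's speed.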
Also, sentiment anchor words, which play a vital role in generating sentiment topics, provide cues for understanding sentiment datasets better than unsupervised topic models do.

Lastly, we examine ALTO, an active learning framework with a static topic overview, and investigate the usability of supervised topic models for active learning. We develop a new, dynamic active learning framework that combines the informativeness and representativeness of documents using dynamically updated topics from our fast supervised anchor word algorithm. Experiments on three multi-class datasets show that our new framework consistently improves classification accuracy over ALTO.

RICH AND SCALABLE MODELS FOR TEXT

by Thang Dai Nguyen

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2019

Advisory Committee:
Professor Jordan Boyd-Graber, Chair/Co-Advisor
Professor Philip Resnik, Co-Advisor
Professor Douglas W. Oard
Professor Naomi Feldman
Professor Furong Huang

© Copyright by Thang Dai Nguyen 2019

Acknowledgments

I owe my deepest gratitude to the many wonderful people whose help and support have made this dissertation possible. Spending seven years at the University of Maryland part-time while working full-time has been a very challenging experience, but it also shows how fortunate I have been to reach this milestone in my life. This would not have been possible without the support and guidance of the following people.

First and foremost, I wholeheartedly thank my co-advisors, Jordan Boyd-Graber and Philip Resnik, for their continuous guidance and advice throughout my graduate study. They have been a fantastic team, providing as much support as I needed; their beautiful lectures built my foundation for research; their wisdom helped me cross the boundary to graduation.
In particular, I would like to thank Jordan for accepting me as a graduate student at the iSchool; without that first step, I could not have transferred to become a graduate student in the Department of Computer Science at the University of Maryland. Jordan has patiently laid the bricks for my research, guiding me from my first ACL conference talk to my last research equation. I am deeply grateful to Philip for guiding me through hardship; every time I was stuck with research, whether emotionally or technically, Philip was the first person I came to for advice. I also miss our lunches at the Lilit Cafe a lot.

I am fortunate to know many amazing professors whose research has been inspirational for my dissertation. I would like to thank Professor Hal Daumé III, Professor Jimmy Lin, Professor Thomas Goldstein, Professor Kevin Seppi, and Professor Eric Ringger for intellectual conversations. My special thanks go to Professor Naomi Feldman, Professor Douglas W. Oard, and Professor Furong Huang for agreeing to serve on my thesis committee and spending their precious time reading my dissertation.

Many thanks also go to the wonderful friends who have coauthored and collaborated with me on various papers. I thank Yuening Hu, Jeff Lund, William Armstrong, and Leonardo Claudino for working with me on experiments and writing, and for continually providing feedback and insights.

I am fortunate to be part of a fantastic community of researchers: the Computational Linguistics and Information Processing Lab (CLIP). Many ideas and much inspiration came from discussions with Mohit Iyyer, Alvin Grissom II, Viet-An Nguyen, Mossaab Bagdouri, He He, Ke Zhai, Khanh Nguyen, Ning Gao, Jyothi Vinjumur, Jinfeng Rao, Sudha Rao, Weiwei Yang, and Snigdha Chaturvedi. I also want to especially thank the many UMIACS staff members for their help and the Computer Science professors for their fascinating lectures. As a part-time student, my Ph.D.
would not have been possible without the constant support of my managers at work. I especially thank Dr. Olivier Bodenreider and Lee Peters at the National Library of Medicine (NLM) for their continuous trust and encouragement, which allowed me to take time off work to attend classes while still progressing and performing well at work.

Finally, I want to thank my family for their love and support. I thank my parents for their unconditional love; they always believe in me and encourage me to pursue my dreams. I thank my sister for her advice and encouragement. And most importantly, I thank my beloved wife Hang and our beautiful sons, Thomas and Ryan. My wife has always been wonderful, being there whenever I am high or low in this adventure; my sons have been a big push for me to move forward. I know that, together, we can accomplish anything in life.

Table of Contents

Acknowledgements  ii
Table of Contents  iv
List of Tables  vii
List of Figures  viii

1 Introduction  1
  1.1 Statistical Machine Learning in The Age of Big Data  1
  1.2 Natural Language Processing  7
  1.3 Challenges of Applying Machine Learning to Natural Language  11
    1.3.1 Scalability Challenge  12
    1.3.2 Variability Challenge  12
    1.3.3 Interactivity Challenge  14
  1.4 Challenges of Applying Topic Models  15
  1.5 Contributions to the Anchor Word: Addressing Scalability, Variability, and Interactivity  18
    1.5.1 Incorporating Priors into Scalable Anchor Topic Models for Robustness and Extensibility  20
    1.5.2 Uncovering Insights from Labeled Documents with Supervised Anchor Topic Models  21
    1.5.3 The Usability of Supervised Topic Models for Active Learning  24
  1.6 Main Technical Contributions  25

2 Topic Modeling Foundations  28
  2.1 What is Topic Modeling?  31
    2.1.1 Topic Definition  31
    2.1.2 Latent Dirichlet Allocation  33
  2.2 Inference Methods for Topic Models  40
    2.2.1 Gibbs Sampling  40
    2.2.2 Variational Inference  47
  2.3 Topic Model Evaluation  53
    2.3.1 Document Held-out Likelihood  54
    2.3.2 Topic Interpretability Metric  56
    2.3.3 Task-Based and Other Evaluation Metrics  59
  2.4 LDA Extensions  60
    2.4.1 Scaling Up Topic Models  60
    2.4.2 Adding Supervision  62
    2.4.3 Incorporating Domain Knowledge into Topic Models  66
  2.5 Spectral Methods for Topic Models  69
    2.5.1 Non-negative Matrix Factorization  71
    2.5.2 Other Spectral Methods  76
  2.6 Unsupervised Anchor Word Topic Models  77
    2.6.1 Anchor Method is faster than Gibbs & VI  83
    2.6.2 Contributions  85

3 Regularized Anchor Word Topic Models  87
  3.1 Introduction  87
    3.1.1 A Brief Overview of Regularization  89
    3.1.2 The Importance of Priors in Topic Modeling  94
    3.1.3 Chapter Structure  97
  3.2 L2 Anchor: Improving Robustness and Variability  97
    3.2.1 Objective Function with L2 Regularization  98
  3.3 Beta Anchor: Improving Anchor Topic Quality  100
    3.3.1 Objective Function with Beta Regularization  101
    3.3.2 Optimizing the Beta Objective Function  103
  3.4 Data Used in Experimentation  104
  3.5 Regularization Improving Topic Models  105
    3.5.1 Grid Search for Parameters on Development Set  107
    3.5.2 Model Heldout Likelihood  108
    3.5.3 Topic
