Neural Methods Towards Concept Discovery from Text Via Knowledge Transfer
Neural Methods Towards Concept Discovery from Text via Knowledge Transfer

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Manirupa Das, B.E., M.S.
Graduate Program in Computer Science and Engineering
The Ohio State University
2019

Dissertation Committee:
Prof. Rajiv Ramnath, Advisor
Prof. Eric Fosler-Lussier, Advisor
Prof. Huan Sun

© Copyright by Manirupa Das 2019

ABSTRACT

Novel contexts, consisting of a set of terms referring to one or more concepts, often arise in real-world querying scenarios such as a complex search query into a document retrieval system or a nuanced subjective natural language question. The concepts in these queries may not directly refer to entities or canonical concept forms occurring in any fact-based or rule-based knowledge source such as a knowledge base or ontology. Thus, in addressing the complex information needs expressed by such novel contexts, systems using only such sources can fall short. Moreover, hidden associations meaningful in the current context may not exist in a single document, but in a collection, between matching candidate concepts having different surface realizations, via alternate lexical forms. These may refer to underlying latent concepts, i.e., existing or conceived concepts or semantic classes that are accessible only via their surface forms. Inferring these latent concept associations in an implicit manner, by transferring knowledge from the same domain (within a collection) or from across domains (different collections), can potentially better address such novel contexts. Thus latent concept associations may act as a proxy for a novel context. This research hypothesizes that leveraging hidden associations between latent concepts may help to address novel contexts in a downstream recommendation task, and that knowledge transfer methods may aid and augment this process.
With novel contexts and latent concept associations as the foundation, I define the process of concept discovery from text by two steps: first, "matching" the novel context to an appropriate hidden relation between latent concepts, and second, "retrieving" the surface forms of the matched related concept as the discovered terms or concept. Our prior study provides insight into how the transfer of knowledge within and across domains can help to learn associations between concepts, informing downstream prediction and recommendation tasks. In developing prediction models to explain factors affecting newspaper subscriber attrition or "churn", a set of "global" coarse-grained concepts or topics were learned on a corpus of web documents from a News domain, and later "inferred" on a parallel corpus of user search query logs belonging to a Clicklog domain. This process was then repeated in reverse, and the topics learned on both domains were used in turn as features in models predicting customer churn. The results in terms of the most predictive topics from the twin prediction tasks then allow us to reason about and understand how related factors across domains provide complementary signals to explain customer engagement.

This dissertation focuses on three main research contributions to improve semantic matching for downstream recommendation tasks via knowledge transfer. First, I employ a phrasal embedding-based generalized language model (GLM) to rank the other documents in a collection against each "query document", as a pseudo-relevance feedback (PRF)-based scheme for generating semantic tag recommendations. This effectively leads to knowledge transfer "within" a domain by way of inferring related terms or fine-grained concepts for semantic tagging of documents, from existing documents in a collection.
These semantic tags, when used downstream in query expansion for information retrieval, in both direct and pseudo-relevance feedback query settings, give statistically significant improvement over baseline models that use word embedding-based or human expert-based query expansion terms. Next, inspired by the recent success of sequence-to-sequence neural models in delivering the state of the art in a wide range of NLP tasks, I broaden the scope of the phrasal embedding-based generalized language model to develop a novel end-to-end sequence-to-set framework (Seq2Set) with neural attention, for learning document representations for semantically tagging a large collection of documents with no previous labels, in an inductive transfer learning setting via self-taught learning. Seq2Set extends the use case of the previous GLM framework from an unsupervised PRF-based query expansion task setting to supervised and semi-supervised task settings for automated text categorization via multi-label prediction. Using the Seq2Set framework, we obtain statistically significant improvement over both the previous phrasal GLM framework for the unsupervised query expansion task and the current state of the art for the automated text categorization task, in both the supervised and the semi-supervised settings. The final contribution is to learn to answer complex, subjective, specific queries, given a source domain of labeled "answered questions" about products, by using a target domain of rich but unlabeled opinion data, i.e., "product reviews", via the novel application of neural domain adaptation in a transductive transfer learning setting. We learn to classify both labeled answers to questions and unlabeled review sentences via shared feature learning for appropriate knowledge transfer across the two domains, outperforming state-of-the-art baseline systems for sentence pair modeling tasks.
Thus, given training labels on answer data, and by leveraging potential hidden associations between concepts in review and answer data, and between reviews and query text, we are able to infer suitable answers from review text. We employ strategies such as maximum likelihood estimation-based neural generalized language modeling, sequence-to-set multi-label prediction with self-attention, and neural domain adaptation in these works. Combining these strategies with distributional semantics-based representations of surface forms of concepts, within neural frameworks that can facilitate knowledge transfer within and across domains, we demonstrate improved semantic matching in downstream recommendation tasks, e.g., in finding related terms to address novel contexts in complex user queries, in a step towards really "finding what is meant" via concept discovery from text.

For Ryka, Rakesh, Ma and Baba

ACKNOWLEDGMENTS

First and foremost, I am deeply grateful to have been mentored and guided by my advisors, Professors Rajiv Ramnath and Eric Fosler-Lussier. Thanks to Rajiv, for giving me the opportunity to pursue my studies with a great degree of independence and autonomy. Thank you for believing in my research, for believing in me, and for supporting me in every aspect along the way. Your marvelous ability to abstract out the essence of a problem to pinpoint exactly where I needed to focus, as well as to help break down a seemingly complex problem into manageable pieces so that a solution became feasible, has been invaluable to me. Be it while developing and writing any work, especially my candidacy proposal or journal articles, but most of all my dissertation, or be it with managing collaborations related to my research such as with The Columbus Dispatch or Nationwide Children's Hospital, your patient listening, insightful advice and guidance, and your supportive and helpful feedback have been truly valuable, and made the biggest difference.
Any words to describe what your support and your insights have meant, to my entire effort and to crystallizing the work, would fall short. Most of all, thanks for being the Jupiter to my Earth's orbit, deflecting stray objects that could derail that orbit and keeping me on track. For all of this and more I will be eternally grateful to you. I have also been most fortunate to have been advised by Prof. Eric Fosler-Lussier. You are not just my favorite classroom teacher of all things AI; you somehow manage to balance this with great intuition in your advising for very diverse areas of research in speech and language processing. I will take away great memories of all the intellectual as well as informal discussions about possible machine learning models and their enhancements, both from lab meetings and our one-on-ones, where I got to learn so much from you. Your insistence on the highest standards for any work has enabled me to push my own boundaries in developing and performing the kind of advanced research that even a year ago I didn't think I was capable of. While advising me in developing any research plans, you ensured I critically evaluated every method and assumption, produced the highest quality of work, and always pushed me to challenge my own perceived limitations as a researcher. Your keen attention to every detail pertaining to any work, right from concept and formulation to results and presentation, has instilled in me, I hope, the highest standard of work ethic, which I hope to carry forward into the world. For this, and every other lesson, I will be forever indebted to you. A PhD is a momentous journey of intellectual self-realization, of venturing into lands unknown, asking difficult questions, developing along the way the skills necessary for problem solving, learning courage in the face of uncertainty, course correcting, maybe finding