<<

Neural Methods Towards Concept Discovery from Text via Knowledge Transfer

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Manirupa Das, B.E., M.S.
Graduate Program in Computer Science and Engineering

The Ohio State University 2019

Dissertation Committee:

Prof. Rajiv Ramnath, Advisor
Prof. Eric Fosler-Lussier, Advisor
Prof. Huan Sun

© Copyright by

Manirupa Das

2019

ABSTRACT

Novel contexts, consisting of a set of terms referring to one or more concepts, often arise in real-world querying scenarios such as a complex search query into a document retrieval system or a nuanced, subjective natural language question. The concepts in these queries may not directly refer to entities or canonical concept forms occurring in any fact-based or rule-based knowledge source such as a knowledge base or ontology. Thus, in addressing the complex information needs expressed by such novel contexts, systems using only such sources can fall short. Moreover, hidden associations meaningful in the current context may exist not within a single document but across a collection, between matching candidate concepts having different surface realizations via alternate lexical forms. These may refer to underlying latent concepts, i.e., existing or conceived concepts or semantic classes that are accessible only via their surface forms. Inferring these latent concept associations in an implicit manner, by transferring knowledge from the same domain – within a collection, or from across domains

(different collections), can potentially better address such novel contexts. Thus, latent concept associations may act as a proxy for a novel context. This research hypothesizes that leveraging hidden associations between latent concepts may help to address novel contexts in a downstream recommendation task, and that knowledge transfer methods may aid and augment this process. With novel contexts and latent concept associations as the foundation, I define the process of concept discovery from text in two steps:

first, “matching” the novel context to an appropriate hidden relation between latent concepts, and second, “retrieving” the surface forms of the matched related concept as the discovered terms or concept.

Our prior study provides insight into how the transfer of knowledge within and across domains can help to learn associations between concepts, informing downstream prediction and recommendation tasks. In developing prediction models to explain factors affecting newspaper subscriber attrition or “churn”, a set of “global” coarse-grained concepts or topics were learned on a corpus of web documents from a News domain, and later “inferred” on a parallel corpus of user search query logs belonging to a Clicklog domain. This process was then repeated in reverse, and the topics learned on both domains were used, in turn, as features in models predicting customer churn.

The results in terms of the most predictive topics from the twin prediction tasks then allow us to reason about and understand how related factors across domains provide complementary signals to explain customer engagement.

This dissertation focuses on three main research contributions to improve semantic matching for downstream recommendation tasks via knowledge transfer. First, I employ a phrasal embedding-based generalized language model (GLM) to rank the other documents in a collection against each “query document”, as a pseudo relevance feedback (PRF)–based scheme for generating semantic tag recommendations.

This effectively leads to knowledge transfer “within” a domain by way of inferring related terms or fine-grained concepts for semantic tagging of documents, from existing documents in a collection. These semantic tags, when used downstream in query expansion, both in direct and pseudo-relevance feedback query settings, give statistically significant improvements over baseline models that use

embedding-based or human expert-based terms. Next, inspired by the recent success of sequence-to-sequence neural models in delivering the state of the art in a wide range of NLP tasks, I broaden the scope of the phrasal embedding-based generalized language model to develop a novel end-to-end sequence-to-set framework

(Seq2Set) with neural attention, for learning document representations for semantically tagging a large collection of documents with no previous labels, in an inductive transfer learning setting via self-taught learning. Seq2Set extends the use case of the previous GLM framework from an unsupervised PRF–based query expansion task setting to supervised and semi-supervised task settings for automated text categorization via multi-label prediction. Using the Seq2Set framework, we obtain statistically significant improvements over both the previous phrasal GLM framework for the unsupervised query expansion task and the current state of the art for the automated text categorization task, in both the supervised and semi-supervised settings.

The final contribution is to learn to answer complex, subjective, specific queries, given a labeled source domain of “answered questions” about products and an unlabeled target domain of rich opinion data, i.e., “product reviews”, by the novel application of neural domain adaptation in a transductive transfer learning setting. We learn to classify both labeled answers to questions and unlabeled review sentences via shared feature learning for appropriate knowledge transfer across the two domains, outperforming state-of-the-art baseline systems for sentence pair modeling tasks. Thus, given training labels on answer data, and by leveraging potential hidden associations between concepts in review and answer data, and between reviews and query text, we are able to infer suitable answers from review text. We employ strategies such as maximum likelihood estimation-based neural generalized language

modeling, sequence-to-set multi-label prediction with self-attention, and neural domain adaptation in these works. Combining these strategies with distributional semantics–based representations of surface forms of concepts, within neural frameworks that can facilitate knowledge transfer within and across domains, we demonstrate improved semantic matching in downstream recommendation tasks, e.g. in finding related terms to address novel contexts in complex user queries, in a step towards really “finding what is meant” via concept discovery from text.

For Ryka, Rakesh, Ma and Baba

ACKNOWLEDGMENTS

First and foremost, I am deeply grateful to have been mentored and guided by my advisors, Professors Rajiv Ramnath and Eric Fosler-Lussier. Thanks to Rajiv, for giving me the opportunity to pursue my studies with a great degree of independence and autonomy. Thank you for believing in my research, for believing in me, and for supporting me in every aspect along the way. Your marvelous ability to abstract out the essence of a problem to pinpoint exactly where I needed to focus, as well as to help break down a seemingly complex problem into manageable pieces so that a solution became feasible, has been invaluable to me. Be it while developing and writing any work, especially my candidacy proposal or journal articles, but most of all my dissertation, or be it with managing collaborations related to my research such as with The Columbus

Dispatch or Nationwide Children's Hospital, your patient listening, insightful advice and guidance, and your supportive and helpful feedback have been truly valuable, and made the biggest difference. Any words to describe what your support and your insights have meant, to my entire effort and to crystallizing the work, would fall short.

Most of all, thanks for being the Jupiter to my Earth orbit, deflecting off stray objects that could derail that orbit and keeping me on track. For all of this and more I will be eternally grateful to you. I have also been most fortunate to have been advised by

Prof. Eric Fosler-Lussier. You are not just my favorite classroom teacher of all things

AI, you somehow manage to balance this with great intuition in your advising for very

diverse areas of research in speech and language processing. I will take away great memories of all the intellectual as well as informal discussions about possible machine learning models and their enhancements, both from lab meetings and our one-on-ones, where I got to learn so much from you. Your insistence on the highest standards for any work has enabled me to push my own boundaries in developing and performing the kind of advanced research that even a year ago I didn't think I was capable of.

While advising me in developing any research plans, you ensured I critically evaluated every method and assumption, produced the highest quality of work, and always pushed me to challenge my own perceived limitations as a researcher. Your keen attention to every detail pertaining to any work, right from concept and formulation to results and presentation, has, I hope, instilled in me the highest standard of work ethic, which I hope to carry forward into the world. For this, and every other lesson, I will be forever indebted to you.

A PhD is a momentous journey of intellectual self-realization, of venturing into lands unknown, asking difficult questions, developing along the way the skills necessary for problem solving, learning courage in the face of uncertainty, course correcting, maybe

finding satisfactory answers, hopefully making meaningful contributions into the body of human knowledge, and at the end, of great personal reward. This journey would not have been possible for me but for the support of the most important people in my life. To my family – my husband, daughter, and mothers on both sides, for their unwavering love and support. Your smiles, laughter, and silent sacrifices have made all the difference in this journey. To Rakesh, for always being supportive of anything I wanted to do, for always being there for me through thick and thin, through the many highs and lows that PhD studies entail, being genuinely happy at any little victory and

always with a kind word, or dry humor, or sage advice during perceived failure, making it that much easier to deal with it – you really have been the rock and pillar of support in my life as a returning graduate student. I have been able to even undertake, and today, complete this journey, only because of your support. To my late father who instilled in me the love of science and engineering, and a thinking mindset to ask questions. Your inspiring words, your love and your ever-supportive encouragement and belief in me have single-handedly put me where I am today. Just the memory of our many philosophical discussions and interactions and your progressive ideas about how women should strive to achieve their true potential and play a greater role, has been the fuel to my intellectual pursuits to this day, and will always be into the future.

I wish you were here today to witness the culmination of this journey, and I am sure you are. I know you have been guiding me all along as I undertook and fulfilled this journey. To my mother, even though you didn't understand every aspect of my work or the reasons behind my pursuits, your willingness to provide mental and emotional support, and the wisdom and strength you provided through trying times during this journey has ensured I am where I stand today. To both my mother and my mother-in-law for their unconditional love and support to my family and to my child, during the times they could be here with us, through the hectic schedules of our daily lives – anything said of what your love and support has meant to us, would not do justice. To my darling Ryka, from the day I became your mother, since you have walked outside of me, you intuitively knew all that was happening in my life as a student at every stage. Your smiles, your jokes, your laughter, your stories, even your tantrums, your pride in me and your belief that mommy is going to be a computer scientist someday, but most of all your unbridled enthusiasm for anything “life”, your curiosity, and your

sun-shiny happiness to brighten up the dullest day have been the greatest source of inspiration and strength for me. I want you to always know that this whole journey and effort has been entirely for you, and is most fondly dedicated to you. I just hope with the teleportation headpiece you are building out of the cardboard box with your painting of the solar system on it, that you don't ever teleport too far away from me.

I also want to thank my extended family and friends for all of their support and good wishes for me throughout.

Apart from family, I do have a long list of people to remember and thank, and my apologies in advance if I miss anybody. To my lab mates from both of the CSE research labs I have been affiliated with, who are the smartest and funniest bunch that

I have had the good fortune of knowing. In alphabetical order: Adam, Deblin, Denis,

Peter and Prashant of SLaTe lab; and Kayhan, Renhao and Sobhan of CETI lab. Your company and camaraderie have truly been the highlight of my experience during my graduate studies that have made the toughest of days somewhat bearable. We have shared the silence of solidarity in each other’s failures and the hoorays and fist bumps in each other’s successes. To members of the SLaTe lab: Thanks to Deblin for providing me with insight on the nuances of various neural architectures that I would often consider for my design choices, with your deep expertise on the same. Whiteboard and non-whiteboard discussions with you and Denis, and also Prashant, Peter and Adam have helped me come up with a solution many a time. Thank you Denis for being a dependable lab mate and colleague whom I could turn to with a question or thought on just about anything at any time, from research questions to existential dilemmas or philosophical reverie, and from biomedical dataset questions, to system configurations and passwords. Our discussion about “latent concepts” – a term I would like to credit

you with, is what eventually led to the formulation of one of the main ideas in my dissertation. Thanks to Adam, besides being a great lab mate to contemplate about the ironies of life with, and fellow coffee brewer contributing to technical thoughts and discussions, for helping change a flat tire I had close to an important deadline, saving me many precious hours that day. Thanks to Peter for your assistance with GPU-related shenanigans and your much-appreciated encouragement of me in my endeavors; thanks to Prashant for our Transformer and Meditation-related conversations, and to you both, as well as the others in the lab, for helpful discussions on unpacking the unresolved mysteries of deep neural networks. From the beers, cheers and lunches to celebrate victories and defenses, to our lab meetings that have been enlightening, engaging and entertaining all at once, thank you all for the great memories. To members of the

CETI lab: Thanks to Sobhan for being a great lab mate to share a space with, for conversations ranging from research perspectives and CSE to world politics, for good discussions surrounding internships and navigating the PhD, for any lab space-related decisions and for any technical discussions. It really has been wonderful to have both you and Kayhan as colleagues. Thanks to Renhao for great conversations on various

NLP and AI topics, and what Statistics classes to take, and for being an awesome partner on some fun projects and collaborations. Your awesome spirit in the face of adversity inspires me in my life as well. It is hard to put into words the precious time spent at both the research labs I have been a part of, and the role it has played in shaping myself for a career in Science.

To present and former students, some of whom became friends and trusted colleagues that I can always depend on for a great answer, or even the right answer.

Firstly, to Annatala Wolf, with whom I learnt some of the first nuts and bolts of data

mining, who can build an in-memory index from scratch at the drop of a hat, and who I honestly believe is more deserving of a PhD than I am, to this day, you inspire me. To the illustrious women alumni of my lab and partner labs, Preethi

Jyothi, Preethi Raghavan and Ilana Heintz, who have been super role models of women in Science and Tech for me to draw inspiration from. To Anupama Nandi, who I've learned so much from, and who I have my bets on to solve some of the toughest problems in CSE some day. To you and Jeniya for being great study partners for qualifiers. To

Kristin Barber, for sharing all-important notes with me and also being a fun pal to study for qualifiers with. To the resident NLP gurus of CSE, Jeniya and Mounica, you guys have been so great to share a conversation or a light moment with. To the rest of the ladies of the group “PhD Women in CSE”, Tara, Ziyu, Moniba, Soumya, apart from the others previously mentioned, whom I have had the good fortune of hanging out more with only recently – your awesome spirit inspires me and makes me optimistic for the future. To illustrious seniors and alumni of CSE, Dave Fuhry,

Joo-Kyung Kim, Chaitanya Shivade, Satyajeet Raje and Sandesh Swamy, thank you for the enlightening and helpful conversations on various topics ranging from modeling strategies, algorithms and datasets, to career path decision making.

To my collaborators and mentors at Nationwide Children’s Hospital – Drs Simon

Lin, Yungui Huang and Steve Rust. Your thoughtful leadership, advice and guidance have played a huge role in my success as a PhD student at OSU. Heartfelt thanks to

Dr Lin for first giving me the cool GeneRIF project opportunity that I was able to use for my Machine Learning class, and then the internship opportunity that led to this collaboration, and for all the support thereafter, that led to a publication. Thank you to Dr Huang for your guidance and support throughout, and for the editorial

help whenever needed. Thanks to Dr Rust for your guidance and support and your extended time in explaining robust ROC analysis to us, providing us helpful feedback for further improving the evaluations for the Seq2set project. Sincere thanks to Dr Rust, also, for insights on concept discovery, feedback on extended experiments, and coordinating to help provide expert–based validations for my final thesis work.

Many heartfelt thanks to Dr Samuel Yang for providing these evaluations with a quick turnaround time. Thanks to you all for the editorial help and feedback for any papers.

To Soheil, for the invaluable time and assistance with ElasticSearch issues and dataset setup related activities, it was indeed a pleasure working with a congenial and helpful colleague such as yourself. Thank you so much to my mentors and managers during my internships at Amazon, Rahul Bhagat, Estevam Hruschka and Tian Chen, for the great insights, advice and lessons learnt from the perspective of modeling and metrics during my time there, that have helped in my research as well, back at the

University. To my mentors at The Columbus Dispatch Company, Jax Zachariah, Nikhil

H., and Jason Cotter, and colleagues John Valentine, John Schafer, Benjamin Becker, and Brian Espin, it was great to be part of the Big Data Initiative and be able to work with you on this research collaboration, and for my internship and research project thereafter. I took back many valuable lessons on the variety of data, infrastructure, metrics and query logs available within a media group and learnt much about modeling to derive insights for the same.

To my collaborator on the Seq2set work, Juanxi Li, whose Master’s project was intended to be carved out from this work, a huge applause and thanks to you for your dedicated contribution at every level of this project, and props for the depth of machine learning and deep learning expertise you developed along the way. Your

willingness to take up a challenge, deciphering the most complex of models and taking ownership of the project accelerated the many different paths we were able to explore and the multitude of experiments we were able to run, including incorporating the latest polysemous or contextualized models. It was a pleasure to be your mentor, great to work with you as a team, and to push the boundaries of the project as we learnt better strategies. It was easy to communicate ideas to you to convert to relevant results and it was awesome to get to learn alongside you. I wish you the very best of luck for your future endeavors and hope you choose to take up a PhD someday, which I am sure you would also do great at. To my collaborators on the product-related work, thank you to Zhen Wang, Evan Jaffe and

Madhuja Chattopadhyay. To Zhen Wang, for his keen technical expertise and taking ownership of a critical piece of the project to ensure results, and as a colleague for providing practical suggestions and sensible advice on various matters when needed, apart from also the much-needed humor during crunch times. Thank you also for the great brainstorming provided for designs to improve the model. To Evan Jaffe for being a dependable colleague who has not only great expertise, but also great insight on various modeling aspects and the literature. For all the great support provided, from figuring out ways to evaluate difficult baselines, to providing analysis, being my data annotation partner in crime, and mentoring Madhuja, thank you Evan.

To Madhuja for your energy and enthusiasm in being a part of this project and making important contributions like getting the results on an important baseline and providing annotations for the next phase. It was indeed my pleasure to get to work with you all as colleagues.

To Prof Micha, thank you for being the perfect Computational Linguistics minor area advisor to me – I couldn't have wished for a better one. I learned so much about machine learning and NLP tasks and tricks from you, from both your highly engaging

Comp Ling II class and Language and Vision seminar, and from in-person discussions.

You have made the time for me when I needed it most and I can attribute much of the early crucial success for submission or publication to your detailed suggestions and guidance. At various times thereafter you always made the time to provide me feedback when I was contemplating designs for projects or needed to review any ideas.

For all of the guidance and support you have provided, I will be always indebted to you. Thank you also to awesome colleagues at the weekly Clippers meeting, for some great discussions and feedback sessions.

Thanks to Prof Alan Ritter for being on my candidacy committee, for posing thoughtful questions that helped to improve my proposal, and for giving me helpful feedback on a paper draft that I believe helped to get it accepted. Thank you

Prof Huan Sun for being on my dissertation committee, for your flexibility and prompt responses during scheduling of my defense and for your thoughtful questions that ultimately helped to further improve my thesis write-up. I also gained a lot from attending your seminar class. Thank you to Prof Marie de Marneffe for being a mentor and for supporting conversations that have helped me. I very much enjoyed your Computational Linguistics class and got a lot from your seminars. Thanks to Prof Arnab Nandi for helping write the winning abstract that led to my first paper acceptance out of

OSU, and for helpful feedback along the way for that specific work and for the PhD in general. To the great folks at OSC who have enabled my experiments and kept things going when resources were low and stakes were high.

Thanks to Prof Srini for the most interesting first seminar and class I took, which eventually got me interested in and drew me into the field of information retrieval, for any guidance and support along the way, and for inviting me to do my first conference talk dry-run with your group; and to Prof Belkin for an unforgettable class in Machine

Learning that I really enjoyed and learned much from. To Profs Gagan, Neelam, Steve

Lai, Paul Sivilotti, Yusu Wang, Tim Long and Ken Supowit whose classes or seminars

I either took or sat in on and learned so much from, thank you for instilling great fundamentals in your students. To the awesome TAs, lecturers and course coordinators of CSE 2111, this was a great learning experience and a great team experience for me, so, a big thank you to Ms Lori Rice, Dr Diana Kline, Ms Catherine Mc Kinley, Clair

Farris and Mark Jackson – I will never forget you. To Ms Lynn, Ms Kitty, Ms Catrena and Ms Tamera in Dreese Office 395, I know you are happy for me as I graduate, and

I want to sincerely thank you for all your help and support throughout my many years in the department.

Finally I want to express my gratitude for the wondrous gift that is this beautiful life, that has made this journey even possible. While the objective through this whole program was to design algorithms that might make machines understand and work more effectively for us for various tasks, I came to recognize even more that being human is perhaps the most wonderful experience there may be, and that this gift must therefore be used and handled with the utmost responsibility and care.

VITA

October 31, 1978 ...... Born - Darjeeling, West Bengal, India

June, 2000 ...... B.E. Computer Science & Engineering, National Institute of Technology, Goa, India
December, 2003 ...... M.S. Computer & Information Science, University of Mississippi, Oxford, MS, USA
August, 2013-present ...... Graduate Teaching Associate, The Ohio State University, Ohio, USA.

PUBLICATIONS

Research Publications

Manirupa Das, Micha Elsner, Arnab Nandi, and Rajiv Ramnath, “TopChurn: Maximum entropy churn prediction using topic models over heterogeneous signals”. Proceedings of the 24th International Conference on World Wide Web, 291–297, 2015, ACM.

Manirupa Das, Renhao Cui, David R Campbell, Gagan Agrawal, and Rajiv Ramnath, “Towards methods for systematic research on big data”. 2015 IEEE International Conference on Big Data (Big Data), 2072–2081, 2015, IEEE.

Manirupa Das, Eric Fosler-Lussier, Simon Lin, Soheil Moosavinasab, David Chen, Steve Rust, Yungui Huang, and Rajiv Ramnath, “Phrase2VecGLM: Neural generalized language model–based semantic tagging for complex query reformulation in medical IR”. Proceedings of the BioNLP 2018 workshop, 118–128, 2018.

Manirupa Das, Juanxi Li, Eric Fosler-Lussier, Simon Lin, Soheil Moosavinasab, Steve Rust, Yungui Huang, and Rajiv Ramnath, “Sequence-to-Set Semantic Tagging: End-to-End Multi-label Prediction using Neural Attention for Complex Query Reformulation and Automated Text Categorization”. arXiv preprint, arXiv:1910.2898805, 2019.

Manirupa Das, Zhen Wang, Evan Jaffe, Madhuja Chattopadhyay, Eric Fosler-Lussier, and Rajiv Ramnath, “Learning to Answer Subjective, Specific Product-Related Queries using Customer Reviews by Adversarial Domain Adaptation”. arXiv preprint arXiv:1910.08270, 2019.

Manirupa Das, and Renhao Cui, “Comparison of Quality Indicators in User-generated Content Using Social Media and Scholarly Text”. arXiv preprint arXiv:1910.11399, 2019.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in:
Artificial Intelligence: Prof. Eric Fosler-Lussier
Software Systems: Prof. Rajiv Ramnath
Computational Linguistics: Prof. Micha Elsner

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... vi

Acknowledgments ...... vii

Vita ...... xvii

List of Tables ...... xxv

List of Figures ...... xxvii

1. INTRODUCTION ...... 1

1.1 A Brief Synopsis of Meaning ...... 1
1.2 Motivation and Background ...... 5
1.3 The Role of Context in Semantics ...... 10
1.3.1 Contexts for Knowledge Representation in KBs ...... 10
1.3.2 Contexts in Distributional Models of Representation ...... 16
1.3.3 Semantic Lifting ...... 19
1.4 Concepts, in Context ...... 21
1.4.1 Concepts ...... 24
1.4.2 Latent Concepts ...... 25
1.4.3 Context ...... 27
1.4.4 Shared Context ...... 28
1.4.5 Novel Context ...... 28
1.5 Problem Statement and Research Task ...... 28
1.5.1 Problem Definition ...... 29
1.5.2 What is Concept Discovery? ...... 36
1.6 Main Research Contributions ...... 42

1.6.1 Exploring Topic Models Over Complementary Signals to Explain Subscriber Engagement ...... 42
1.6.2 A Phrasal Embedding-based Generalized Language Model for Semantic Tagging for Query Expansion ...... 43
1.6.3 Sequence-to-Set Semantic Tagging with Neural Attention for Complex Query Reformulation and Automated Text Categorization ...... 44
1.6.4 Learning to address complex product-related queries with product reviews by neural domain adaptation ...... 45

2. LITERATURE SURVEY AND SIGNIFICANCE OF THIS RESEARCH . 48

2.1 Perspectives on Similarity ...... 49
2.1.1 Respects for Similarity ...... 49
2.1.2 From Frequency to Meaning ...... 52
2.1.3 Distributional and Vector Space Models of Semantics ...... 54
2.1.4 Hypotheses for Word- and Document-level semantics ...... 55
2.1.5 ...... 56
2.1.6 Latent Relational Analysis ...... 57
2.1.7 Latent Dirichlet Allocation (LDA) ...... 59
2.2 Related Works in Concept and Relation Extraction ...... 61
2.2.1 Early Work in Concept Discovery from Text ...... 61
2.2.2 Multi-way classification of semantic relations between pairs of nominals ...... 62
2.2.3 Distant Supervision and Missing Data Modeling ...... 64
2.2.4 Knowledge Base Completion ...... 66
2.2.5 Semantic Inference – Learning Selectional Preferences ...... 67
2.2.6 Paraphrase Detection, Identification and Extraction ...... 69
2.2.7 Named Entity Recognition for Relation Extraction ...... 71
2.2.8 Automated Text Categorization and Semantic Tagging ...... 73
2.3 Knowledge Transfer and Natural Language Processing ...... 76
2.3.1 A Taxonomy of Knowledge Transfer ...... 77
2.3.2 Natural Language Processing – An Overview and Tasks ...... 80
2.3.3 Transfer Learning for NLP – Language Model Pre-training on Sesame Street ...... 84

3. EXPLORING TOPIC MODELS OVER COMPLEMENTARY SIGNALS TO EXPLAIN NEWSPAPER SUBSCRIBER ENGAGEMENT ...... 87

3.1 Introduction ...... 87
3.2 Motivation and Background ...... 90

3.3 Related Work ...... 92
3.4 Dataset ...... 95
3.5 Churn Prediction ...... 97
3.5.1 Methodology ...... 97
3.5.2 Data Preparation and Topic Modeling ...... 98
3.5.3 Label Generation for Predictive Modeling ...... 100
3.5.4 Exploratory Trend Analysis ...... 101
3.5.5 Feature generation ...... 102
3.5.6 Experimental Results ...... 103
3.5.7 Feature Performance ...... 105
3.6 Extended Analysis - Explaining Customer Engagement ...... 108
3.6.1 Exploring Topic Models to Explain User Engagement ...... 108
3.7 Correlation Analysis ...... 110
3.8 Parameterized exploration ...... 113
3.9 Investigating Complementarity of Signals for Insights into Causes ...... 115
3.10 Discussion ...... 118
3.11 Conclusion and Future Work ...... 121

4. PHRASE2VECGLM: NEURAL GENERALIZED LANGUAGE MODEL–BASED SEMANTIC TAGGING FOR COMPLEX QUERY REFORMULATION ...... 127

4.1 Introduction ...... 127
4.2 Motivation and Background ...... 130
4.3 Problem Definition ...... 133
4.3.1 Dataset and Task ...... 133
4.4 Semantic tag recommendation models ...... 135
4.5 Query Expansion ...... 137
4.6 Research Methodology ...... 138
4.7 A Phrasal Embedding-based General LM for semantic tag recommendation by knowledge transfer ...... 141
4.8 Algorithm ...... 144
4.9 Data Pre-processing for Phrasal LM ...... 145
4.10 Experimental Setup ...... 146
4.11 Evaluation using an ElasticSearch Index ...... 148
4.12 Experimental Results and Discussion ...... 150

5. SEQUENCE-TO-SET SEMANTIC TAGGING: END-END NEURAL ATTENTION–BASED TERM TRANSFER TOWARDS COMPLEX QUERY REFORMULATION AND AUTOMATED CATEGORIZATION OF TEXT ...... 154

5.1 Introduction ...... 156
5.2 Related Work ...... 158
5.3 Methodology ...... 161
5.4 Sequence-based Document Encoders ...... 162
5.4.1 doc2vec encoder ...... 162
5.4.2 Deep Averaging Network encoder ...... 163
5.4.3 LSTM and BiLSTM encoders ...... 163
5.4.4 BiLSTM with Attention encoder ...... 164
5.4.5 GRU and BiGRU encoders ...... 165
5.4.6 Transformer self-attentional encoder ...... 165
5.4.7 CNN encoder ...... 166
5.5 Sequence-to-Set Architecture ...... 167
5.6 Training and Inference Setup ...... 168
5.7 Unsupervised Task Setting – Semantic Tagging for Query Expansion ...... 170
5.7.1 Dataset - TREC CDS 2016 ...... 170
5.7.2 Experimental Designs with Word Embedding Models ...... 171
5.7.3 Experiments ...... 172
5.7.4 Discussion ...... 174
5.8 Supervised Task Setting – Automated Text Categorization ...... 175
5.8.1 Dataset – Delicious ...... 176
5.8.2 Experiments ...... 177
5.9 Semi-Supervised Task Setting – Automated Text Categorization ...... 179
5.9.1 Dataset – Ohsumed ...... 179
5.9.2 Experiments ...... 179
5.10 Conclusion ...... 182

6. LEARNING TO ANSWER SUBJECTIVE, SPECIFIC PRODUCT-RELATED QUERIES WITH CUSTOMER REVIEWS BY NEURAL DOMAIN ADAPTATION ...... 183

6.1 Introduction ...... 183
6.2 Motivation and Background ...... 186
6.3 Related Work ...... 189
6.3.1 Addressing subjective product-related queries with reviews ...... 189
6.3.2 Modeling Sentence Pairs ...... 191
6.3.3 BiCNN and ABCNN-3 ...... 191
6.3.4 Reasoning for ...... 193
6.3.5 Neural Domain Adaptation ...... 194
6.3.6 Proposed Model – Domain Adversarial Neural Network ...... 194
6.4 Dataset ...... 198
6.5 Experiments ...... 199

6.6 Results and Discussion ...... 200
6.6.1 ABCNN models: BiCNN and ABCNN-3 ...... 200
6.6.2 Attentive LSTM for RTE model ...... 201
6.6.3 DANN model ...... 202
6.7 Evaluation, Conclusion and Future work ...... 204

7. EVALUATION, CONCLUSION AND FUTURE DIRECTIONS ...... 207

7.1 Expert Evaluation of Main Ideas ...... 208
7.2 Summary of Contributions ...... 213
7.2.1 Exploring complementary signals for subscriber churn prediction via topic models ...... 213
7.2.2 Phrasal neural generalized language model for semantic tagging ...... 215
7.2.3 Sequence-to-set semantic tagging for complex query reformulation via semantic tagging and automated text categorization ...... 217
7.2.4 Learning to answer subjective, specific questions with customer reviews by neural domain adaptation ...... 219
7.3 Future Directions ...... 220
7.3.1 Jointly learning multi-label prediction and NER for matching to novel contexts ...... 220
7.3.2 Multi-task learning in domain adversarial setting for question answering ...... 221
7.3.3 Provide linguistic resource to advance research in Concept Discovery ...... 222
7.3.4 How does this research impact the NLP community ...... 223
7.4 Final remarks ...... 224

Appendices 226

A. PROBABILITY AND INFORMATION THEORY ...... 226

A.0.1 Probability fundamentals ...... 226
A.0.2 Maximum Likelihood Estimation ...... 229
A.0.3 Cross Entropy ...... 230
A.0.4 Perplexity ...... 231

B. NEURAL NETWORKS AND MACHINE LEARNING ...... 233

B.0.1 Linear Regression ...... 233
B.0.2 Logistic Regression ...... 235
B.0.3 Multi-Class Classification by Softmax ...... 236

B.0.4 Feed-Forward Neural Networks and Other Neural Models ...... 237
B.0.5 Gradient Descent ...... 241
B.0.6 Back-propagation ...... 242

C. KNOWLEDGE REPRESENTATION FOR KBS AND OTHER RELATED AREAS ...... 245

C.0.1 Document Summarization ...... 245
C.0.2 Knowledge Representation for Relation Extraction ...... 247
C.0.3 Never-Ending Language Learning ...... 249

Bibliography ...... 251

LIST OF TABLES

Table Page

3.1 Performance for TopChurnWEB models with parametrized exploration . 116

3.2 Inferred Models Performance based on best parameters learned for TopChurnWEB ...... 116

4.1 Results for Query Expansion by different methods using unigram and phrasal GLM–generated terms, directly and in feedback loop. Asterisks indicate inferred measures (Voorhees, 2014). Boldface values indicate statistical significance at p << 0.01 over previous result or baseline. . . 149

4.2 UMLS Concept Unique Identifier(CUI)-based Latent Concept mappings and Possible Relations for unigramGLM ...... 152

4.3 UMLS Concept Unique Identifier(CUI)-based Latent Concept mappings and Possible Relations for Phrase2VecGLM ...... 153

5.1 Results on IR for best Seq2set models, in an unsupervised PRF– based QE setting. Boldface indicates statistical significance @p<<0.01 over previous...... 173

5.2 UMLS Concept Unique Identifier(CUI)-based Latent Concept mappings and Possible Relations for Seq2set ...... 175

6.1 Dataset Statistics for Open-Ended, Multi-Answer Q-A Pairs and Reviews matched on 128K unique ASINs from the Amazon product review dataset McAuley and Yang (2016)...... 199

6.2 Label Proportions (0/1 for unrelated/related) in each dataset for a total number of instances...... 200

6.3 Results from Experiments with the Baseline Sentence-Pair models: Conditional-encoding-based Attentive LSTM, BiCNN and ABCNN-3. Boldface indicates best performance in that row of results while * indicates best performing for a particular model on target domain evaluation across individual categories or combined...... 201

6.4 Results from experiments with the DANN model. Boldface with * indicates best results with domain adaptation on target domain evaluations ...... 203

6.5 Examples of Q-R target-only inference on positive examples that the DANN model gets right...... 206

6.6 Comparison of QAR-Net system with the DANN model for PRQA. . . 206

6.7 Latent Concept mappings and Possible Relations between question and answer sentence concepts ...... 206

7.1 Expert Judgments on UMLS CUI-based Latent Concept mappings and Possible Relations for unigramGLM ...... 209

7.2 Expert Judgments on UMLS CUI-based Latent Concept mappings and Possible Relations for Phrase2VecGLM ...... 210

7.3 Expert Judgments on UMLS CUI-based Latent Concept mappings and Possible Relations for the Seq2set framework ...... 210

7.4 Expert judgments for concept tags for documents, for Concept Yes-or-No, where no latent concept mapping via UMLS CUI could be found ...... 211

7.5 Expert Judgments for concept tags having no UMLS CUI-based mappings to Latent Concepts, and their Possible Relations to query concepts for the Seq2set framework ...... 213

LIST OF FIGURES

Figure Page

1.1 Skip-gram model architecture (Mikolov et al., 2013a) ...... 17

1.2 Example of Hidden Associations between Latent Concepts in Context. . 34

2.1 An Overview of Different Settings of Transfer Learning by Pan and Yang (2010) ...... 78

3.1 Overview of Dataset ...... 95

3.2 Customer transactions with memo text ...... 96

3.3 Top ranking topics by dataset with counts ...... 101

3.4 Topic models for DROPPED complaints ...... 103

3.5 WEB and ClickLog Feature Integration for Churn Prediction ...... 104

3.6 Feature Categories for Churn models ...... 104

3.7 Churn Prediction, K =3 ...... 106

3.8 Churn Prediction, K =6 ...... 107

3.9 ROC curves for top performing churn models on TopChurnTRANS . . . 107

3.10 ROC – TopChurnWEB topics+sentiment ...... 108

3.11 Most informative features, K =3 ...... 123

3.12 Feature comparison with K =6 ...... 124

3.13 Correlation analysis of predictive features in TopChurn models ...... 125

3.14 Comparison between topics learned from WEB-NEWS and fit/inferred on WEB-CLICKLOGS ...... 126

4.1 Example of Hidden Associations between Latent Concepts for the Medical Domain present in the UMLS ontology...... 128

4.2 Sample query ’topic’ from the TREC 2016 challenge, showing a clinical note with patient history, at 3 levels of granularity: Note, Description and Summary...... 139

5.1 Overview of Sequence-to-Set Framework. (a) Method for training document or query representations, (b) Method for Inference via term transfer for semantic tagging; Document Sequence Encoders: (c) Deep Averaging encoder; (d) LSTM last hidden state, GRU encoders; (e) BiLSTM last hidden state, BiGRU (shown in dotted box), BiLSTM attended hidden states encoders; and (f) Transformer self-attentional encoder Alammar (2018)...... 168

5.2 A comparison of document labeling performance of Seq2set versus MLTM ...... 177

5.3 Seq2Set – supervised text categorization task setting ROC AUC on del.icio.us dataset for best performing models ...... 178

5.4 Seq2Set – supervised and semi-supervised text categorization task set- ting ROC AUC on Ohsumed dataset for best performing models . . . 180

6.1 Example of Hidden Associations between Latent Concepts in a corpus of Product-related Questions, Answers and Reviews...... 185

6.2 BiCNN baseline of Yin et al. (2015) ...... 192

6.3 Conditional encoding-based Attentive LSTM for textual entailment of Rocktäschel et al. (2015)...... 193

6.4 Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the feature extractor via a gradient reversal layer Ajakan et al. (2014); Ganin et al. (2016)...... 196

6.5 Architecture of the adapted DANN model for Question-Answer/Review sentence pairs ...... 196

CHAPTER 1

INTRODUCTION

What magical trick makes us intelligent? The trick is that there is no trick. The power of intelligence stems from our vast diversity, not from any single, perfect principle.

– Marvin Minsky, The Society of Mind, (Minsky, 1991, p.308)

Of the various widely accepted definitions of the term semantics, the following is the one adopted in this dissertation: “Semantics is the branch of linguistics and logic concerned with meaning. There are a number of branches and sub-branches of semantics, including formal semantics, which studies the logical aspects of meaning, such as sense, reference, implication, and logical form, lexical semantics, which studies word meanings and word relations, and conceptual semantics, which studies the cognitive structure of meaning.”

1.1 A Brief Synopsis of Meaning

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean - neither more nor less.” “The question is,” said Alice,

“whether you can make words mean so many different things.”

– Lewis Carroll, Through the Looking-Glass, 1871. (Jurafsky and Martin,

2014)

As any good introductory course in computational linguistics may suggest, the three levels of meaning to consider in the computation of semantics are: 1) Lexical Semantics

- which deals with the individual meanings of words, i.e. it is concerned with systematic relations in the meanings of words, and in recurring patterns among different meanings of the same word; 2) Compositional Semantics - which looks at how the meanings of sentences or utterances are formed from the meanings of their constituent words; and

3) Discourse or Pragmatics - which concerns itself with how the meaning of a text or a discourse is formed from the meanings of its sentences or utterances, along with other facts about the context and the world, considering non-literal meaning (de Marneffe et al., 2015; Thomason, 2012).

Compositional (Sentential) Semantics may be further thought of as belonging in two categories. The first, compositional formal semantics, is based on the idea that the “meaning of a sentence is a function of the meanings of its parts”. In this case, sentential semantics, thus being representable as a composition of propositional logical forms, lends itself to machines being able to directly interpret the meaning of a sentence and use it for certain tasks (Szabó, 2017; de Marneffe et al., 2015). This idea forms the basis for the discussion in section 1.3.1 about how contexts could be represented in a generalizable way within a knowledge base. The second approach to sentential compositional semantics is based on a vector-space model of representing semantics using frequency-based co-occurrence counts (Turney and Pantel, 2010), satisfying the distributional hypothesis that “a word is characterized by the company it keeps” (Firth, 1957) and

“words that occur in similar contexts tend to have similar meanings” (Harris, 1954).

The treatment of contexts using this approach is covered in more detail in section 1.3.2.

“The unit of meaning is a sense - thus one word can have multiple meanings. A sense is a representation of one aspect of the meaning of a word. Thus most non-rare words have multiple meanings.” Relations between word senses may be synonymy, antonymy, hypernymy and hyponymy, and meronymy. Synonymy is when two words have the same sense. Formally, two words may be synonyms if they can be substituted for each other without changing the associated truth conditions – i.e. provide the same propositional meaning. Antonyms are senses that are opposite in one feature of their meaning; otherwise they are very similar, e.g. some define a binary opposition, like on/off, living/dead, or lie at opposite ends of a scale, like long/short or fast/slow, and yet others are reversives, e.g. rise/fall, up/down. Hyponymy, also known as the is-a relation, is an asymmetric, transitive relation between senses; thus if X is a hyponym of Y, it denotes a subclass of Y. The inverse relation is hypernymy. Sometimes we may distinguish instance hyponyms from subclass hyponyms.

Meronymy is an asymmetric, transitive relation between senses, where X is a meronym of Y if it denotes a part of Y, and the inverse relation is holonymy (de Marneffe et al.,

2015). Again, for some applications we distinguish part meronyms, e.g. porch, wheel, leg, substance meronyms, e.g. rubber, water, wood, and member meronyms, e.g. professor (holonym: faculty), tree (holonym: forest), player (holonym: team) (de Marneffe et al., 2015; Fellbaum, 1998).
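To make these sense relations concrete, the short sketch below queries WordNet (Fellbaum, 1998) through the NLTK interface. This is a minimal, illustrative sketch only: the probe words “tree” and “long” are arbitrary choices, and it assumes the NLTK WordNet corpus has been downloaded.

```python
# Minimal sketch of the sense relations discussed above, via NLTK's WordNet
# interface. Probe words are arbitrary; requires the WordNet corpus data.
from nltk.corpus import wordnet as wn

tree = wn.synsets("tree")[0]           # first sense of "tree" (the plant)

# Hypernymy / hyponymy: the is-a hierarchy around this sense.
print(tree.hypernyms())                # e.g. a woody_plant synset
print(tree.hyponyms()[:3])             # a few subclasses of tree

# Meronymy / holonymy: part-whole relations for the same sense.
print(tree.part_meronyms())            # parts of a tree, e.g. trunk, limb
print(tree.member_holonyms())          # wholes it belongs to, e.g. forest

# Antonymy is defined on lemmas (surface forms) rather than on synsets.
long_lemma = wn.synsets("long", pos=wn.ADJ)[0].lemmas()[0]
print(long_lemma.antonyms())           # e.g. the lemma for "short"
```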

According to Manning and Schütze: “Many tasks like text understanding and information retrieval would be greatly helped by statistical NLP if we could automatically

acquire meaning. Unfortunately, how to represent meaning in a way that can be operationally used by an automatic system is a largely unsolved problem. For this reason, most work on acquiring semantic properties of words has focused mainly on using various measures of semantic similarity. Automatically acquiring a relative measure of how similar a new word is to known words (or how dissimilar) is much easier than determining what the meaning actually is. Despite its limitations, semantic similarity is still a useful metric to have as it is most often used for generalization to related terms during an inference process, under the assumption that semantically similar words behave similarly. Similarity-based generalization is a close relative of class-based generalization, where, in similarity-based generalization, we only consider the closest neighbors in generalizing to the word of interest, whereas in class-based generalization, we consider the whole class of elements that the word of interest is most likely to be a member of. Semantic similarity is also used for query expansion for information retrieval.” (Manning et al., 1999)

Further, Manning and Schütze note that “semantic similarity” is not as intuitive a notion as it may seem at first. For some, semantic similarity is an extension of synonymy and refers to cases of near-synonymy like the pair dwelling:abode. Often semantic similarity refers to the notion that two words are from the same “semantic domain” or “topic”. By this understanding, words are similar if they refer to entities in the world that are likely to co-occur, like doctor, nurse, fever, and intravenous, words that can refer to quite different entities, or even be members of different syntactic categories. Miller and Charles attempt to put the notion of semantic similarity on a more solid footing by investigating semantic and contextual similarity for pairs of

nouns that vary from high to low semantic similarity. In this study, semantic similarity is estimated by subjective ratings; and contextual similarity is estimated by the method of sorting sentential contexts. The results show an inverse linear relationship between similarity of meaning and the discriminability of contexts, which indicates that judgments of semantic similarity can be explained by the degree of contextual interchangeability or the degree to which one word can be substituted for another, in context (Miller and Charles, 1991; Manning et al., 1999).

According to Manning et al. (1999), ambiguity presents a problem for all notions of semantic similarity. Thus if a word is semantically similar to one sense of an ambiguous word, then it is rarely semantically similar to the other sense. For example, litigation is similar to the “legal” sense of suit, but not to the “clothes” sense. Thus when applied to ambiguous words, semantically similar usually means “similar to the appropriate sense” or, as in this dissertation, “related in context”. A large class of measures of semantic similarity is best conceptualized as measures of vector similarity. Thus the two words whose semantic similarity we want to compute are represented as vectors in some multi-dimensional space.
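The following is a minimal sketch of this vector-space view, under the assumption of a toy three-sentence corpus: each word is represented by its windowed co-occurrence counts, and semantic similarity is taken to be the cosine of the angle between the resulting count vectors. The corpus, the window size, and the probe words are illustrative assumptions, not data used elsewhere in this work.

```python
# Minimal sketch of vector-space semantic similarity: represent each word by
# its window-based co-occurrence counts over a toy corpus, then compare words
# by cosine similarity. Corpus, window size and probe words are arbitrary.
from collections import Counter, defaultdict
import math

corpus = [
    "the doctor prescribed a hypotensive drug for high blood pressure",
    "the nurse measured elevated glucose level and high blood sugar",
    "litigation over the lawsuit kept the lawyer in court",
]
window = 2
cooc = defaultdict(Counter)
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                cooc[w][toks[j]] += 1

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Words sharing contexts ("high", "blood") come out closer than unrelated ones.
print(cosine(cooc["sugar"], cooc["pressure"]))   # high similarity
print(cosine(cooc["sugar"], cooc["court"]))      # low similarity
```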

1.2 Motivation and Background

Traditional knowledge graphs driven by knowledge bases can represent facts about and capture relationships among entities very well, thus performing quite accurately in fact-based information retrieval or question answering (Fader et al., 2014). However, novel contexts, consisting of a new set of terms referring to one or more concepts, may appear in a real-world querying scenario in the form of a natural language question or a search query into a document retrieval system. These may not directly refer to existing

entities or surface form concepts occurring in the relations within a knowledge base.

Thus, in addressing these novel contexts, such as those appearing in nuanced subjective queries, these systems can fall short. This is because hidden relations, meaningful in the current context, may exist in a collection between candidate latent concepts or entities that have different surface realizations via alternate lexical forms (we define

latent concepts in section 1.4.2), but which are not currently present in a curated knowledge source such as a knowledge base or an ontology. For example, as later outlined in section 1.5.1, alternate surface lexical forms like “high blood sugar” and “elevated glucose level”, both referring to the underlying latent concept Hyperglycemia, become “related” by virtue of a relation like sameAs, whereas lexical surface forms such as “hypotensive drug”, referring to a latent concept like Antihypertensives, and “high blood pressure”, referring to a latent concept like Hypertensive disease, become related by virtue of a hidden relation like treatsFor between these two latent concepts. Thus, these hidden relations between latent concepts that exist in the collection can potentially help to better address a particular novel context appearing in a query, e.g. “treatments for high blood pressure”, to surface meaningful related terms such as “hypotensive drug”.
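Purely for illustration, the sketch below shows one way such surface-form-to-latent-concept mappings and hidden relations could be used to surface related terms for a query context. The mappings, the relation names (sameAs, treatsFor), and the expand helper are toy assumptions made for this example only; they are not a real ontology and not the learned, unsupervised method developed in later chapters.

```python
# Illustrative-only sketch: alternate surface forms map to latent concepts,
# hidden relations link latent concepts, and a query phrase is expanded with
# surface forms of related concepts. All names here are toy assumptions.
surface_to_concept = {
    "high blood sugar": "Hyperglycemia",
    "elevated glucose level": "Hyperglycemia",        # sameAs via shared concept
    "high blood pressure": "Hypertensive disease",
    "hypotensive drug": "Antihypertensives",
}
relations = {("Antihypertensives", "treatsFor", "Hypertensive disease")}

def expand(query_phrase):
    """Return surface forms whose latent concept relates to the query's concept."""
    concept = surface_to_concept.get(query_phrase)
    related = {h for (h, _, t) in relations if t == concept}
    related |= {t for (h, _, t) in relations if h == concept}
    related.add(concept)                               # sameAs: same latent concept
    return [s for s, c in surface_to_concept.items()
            if c in related and s != query_phrase]

print(expand("high blood pressure"))   # -> ['hypotensive drug']
```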

Here “latent concept”, more formally defined in section 1.4.2 with examples, refers to the nominalized sense of an underlying concept that may not be explicitly mentioned in a collection and may or may not exist in some ontology or knowledge source, but appeals to a common-sense understanding of the real world. In this dissertation we aim to infer these latent concept relations in an implicit manner, in a prediction or recommendation task setting. We want to leverage knowledge available for transfer from within the same domain or collection, or from across domains, in a completely

unsupervised manner with no reference to any external fact-based knowledge sources.

We believe this can help in addressing novel contexts appearing in the form of queries into the same or a new domain. Therefore, this research attempts to model directly for these potential relations or novel associations between latent concepts, by leveraging distributional semantics-based approaches for representation of the concepts, along with strategies to enable inference of these associations.

I proceed to develop the proposed research ideas in this dissertation as follows: I

first describe an exploratory study wherein knowledge transfer occurs across two domains (News and Clicklogs) by means of topics learnt by LDA on one domain and inferred on the other domain, as described in detail in section 3.9. This gives us some key insights on how a downstream prediction task for predicting customer engagement is affected. It also provides useful lessons for our ensuing research involving query terms and customer review–based answers, in that it motivates the predominant use of distributional methods in these following works to enable learning of interpretable latent concepts at a more local level within a collection (as opposed to a global topic distribution across a collection). This, we believe, allows for inference of more meaningful concept associations. This ultimately serves as the means to infer “related” novel associated concepts, e.g., for a query document, by knowledge transfer from the rest of the documents in a collection, i.e. the same domain, useful in turn for downstream document retrieval for lengthy complex queries that are expanded by these novel associated concepts.
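At a high level, the cross-domain topic transfer in that exploratory study can be summarized by the hedged sketch below, which uses gensim's LDA implementation: topics are learned on a source corpus and then inferred as per-document topic mixtures on a target corpus, yielding features for a downstream classifier. The toy documents, the topic count K, and the choice of gensim are illustrative assumptions rather than the exact pipeline of Chapter 3.

```python
# Hedged sketch of LDA-based knowledge transfer: learn K "global" topics on a
# source domain (e.g. news articles), then infer topic mixtures for
# target-domain texts (e.g. click-log queries) as downstream features.
# Toy data and gensim are illustrative choices, not the actual pipeline.
from gensim import corpora, models

source_docs = [["local", "news", "city", "council", "budget"],
               ["sports", "team", "season", "game", "score"],
               ["subscription", "delivery", "billing", "complaint"]]
target_docs = [["billing", "complaint", "delivery"],     # click-log "queries"
               ["game", "score", "season"]]

dictionary = corpora.Dictionary(source_docs)             # shared vocabulary map
source_bow = [dictionary.doc2bow(d) for d in source_docs]

K = 3
lda = models.LdaModel(source_bow, num_topics=K, id2word=dictionary,
                      passes=10, random_state=0)

# Transfer: infer the source-domain topics on target-domain documents,
# giving a K-dimensional topic-mixture feature vector per document.
for d in target_docs:
    bow = dictionary.doc2bow(d)
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```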

Thus, by making no assumptions whatsoever as to which contexts various concepts may be associated in, or what those associations may be, the overall research goal is to directly map to one another, alternate lexical forms of latent concepts, associated

in a certain context. This is without attempting to “label the relationship” that exists between them, without explicitly labeling the latent concepts or tying them to their

“exact” occurrence in some knowledge base, and without attempting to derive any inference rules from associations thus learned between concepts for inclusion into some knowledge base. Additionally, the goal is to do so in a fully unsupervised manner using no external fact or rule-based knowledge sources except the collection(s) in question, and still demonstrate gains via improved ranking precision in various recommendation task settings.

Consider the following example from a product-related answered questions and reviews database. This example possibly illustrates how latent concepts, described in sections 1.4.2 and 1.4 may be able to capture novel associations that are relevant in some new context appearing e.g. in a question, thus motivating the development of the proposed research ideas.

Here we consider questions, answers, and reviews related to a product titled Animal

Planet’s Big Tub of Dinosaurs, 40+ Piece Set which is a toy dinosaur playset for kids, having many toy dinosaur models and simulated landscapes. A question appears on the customer Q&A forum as follows:

Q: Would you recommend this for a tomboy age 9?

A couple of the top-rated helpful answers to this question are as follows, and we can easily see how well they answer the question or not. While the top-rated first answer is somewhat relevant to the question, the top-rated second answer is less relevant

(considering the question is about a tomboy (a girl by definition) and the answer is about a grandson (boy)).

A: 1) This was a Christmas present for our granddaughter age almost five (she

LOVED it). Her father (our son) loved dinosaurs (and still does at 30+). Really depends on the kid:)

A: 2) Yes, I would. They even have little plastic rocks. I bought this as a gift for my cousin’s grandson and he loves it.

Now consider some product reviews posted for the same item that were not written to address any particular question. Among a couple of reviews that might serve as potentially good answers or supplemental answers, the first one really stands out:

R: 1) “round about december 22nd my three year old daughter started telling ev- eryone she was asking santa for a dinosaur for christmas . this was a surprise because a- shes the most feminine child ever and b- i had long since finished my christmas shopping and wasnt planning on buying another thing . however i didnt want her to lose faith in santa and ended up searching amazon to see what i could come up with . this toy got my attention because it was a whole host of dinosaurs with a mat for them to play on . my daughter loved it for weeks she would get it out on a daily basis to arrange the dinosaurs on their terrain. from a girls perspective its really cute in that there are small versions of the big dinosaurs included so kids can group mommy dinosaurs with baby dinosaurs. she doesn’t get it out quite as often now that its been a few months but when she does shes every bit as delighted. and the dinosaurs are such high quality i know my ten month old son will be enjoying it in no time.”

R: 2) “i bought these for my 2 yr old for easter and he loves this set its a great deal and comes with a ton of dinosaurs and love that the name of the dinosaur is printed on the belly. the mat that it comes with is a cheesy piece of plastic and not to scale and we rarely use it but a great value just for all the dinos and the storage bin.”

In this example, the nominal tomboy may be thought of as referring to a latent concept “boyish child”. Latent concepts in review 1 may be “girlish child”, referred to by the noun phrase feminine child, “toys mostly for boys”, referred to by the noun dinosaur, and “christmas wishlist”, referred to by the noun phrase asking santa. If a machine or intelligent agent is able to infer these latent concepts from the text, and thus also infer a relation like < “girlish child”, “has christmas wishlist”, “toys mostly for boys” > that is able to relate some of these latent concepts, then we can imagine that it may be able to make effective use of this relation to address the novel context of “recommendation for tomboy”. By reasoning in this manner with all of the text, it can actually find a more relevant answer from product reviews, as opposed to predicting a known correct answer in the most accurate way, alluding to some notion of discovery from the text. Chapter 6 describes and more precisely formulates such an inference task in greater detail.

In the following sections we discuss the role of context towards representation in KBs and in vector space models, and put this into perspective for outlining the problem statement for this research, providing the required definitions, along with references to motivating examples in the specific areas of the proposed work.

1.3 The Role of Context in Semantics

1.3.1 Contexts for Knowledge Representation in KBs

(Guha, 1991) defines contexts as “rich objects within a domain that cannot be fully described but that we can make statements about”. This was one of the earliest works in representing semantics about domain knowledge from the standpoint of employing such contexts in a generalizable way for use in knowledge bases, consequently allowing effective reasoning and inference with them. According to Guha, in a “Symbolic Model of AI”, there is a repository of knowledge, the Knowledge Base (KB), and a set of procedures which operate on this to produce intelligent behavior. Inputs to the system are added to the KB. They are translated by an appropriate front end into the language of the KB before being added to the KB. This KB may use a declarative encoding, and the procedures carry out deductions. The system has some domain of competence, and the overall goal of AI is to build a system whose domain of competence is comparable to that of humans.

Thus, the KB primarily contains the system’s knowledge about its domain of competence, where occasionally the KB might also have some meta-knowledge about how to use this knowledge. The KB consists of a set of expressions (sentences) in a certain vocabulary, where each sentence conveys some truth about the domain and the meaningfulness and truth of each sentence is independent of the presence or absence of other sentences. Thus in a sense, these are assumed to be eternal or universal sentences, i.e. all foreseeable dependencies of the sentence have been made explicit in the sentence.

In the context of communicative versus KB expressions, Guha states that sentences in the KB of this model are very different from natural language (NL) utterances, where NL utterances are far from being universal. They usually make a number of assumptions and depend very heavily on situational factors to convey their intended meaning. These situational factors include not just the previous utterances, but also broader factors such as the goals of the discoursants, the socio-cultural setting of the discourse, etc.

In the context of choosing a vocabulary for the knowledge base, Guha states that “One of the first things one needs to do in representing any domain is to pick a vocabulary for encoding the KB’s knowledge of that domain. This vocabulary should allow expression of the phenomena we expect to find. The choice of the vocabulary can therefore exclude certain phenomena from consideration by the KB. But if certain phenomena are excluded, this constitutes an assumption that these phenomena are not important or do not exist. The decision to exclude these could either be by design or simply accidental. Thus it is almost inevitable that certain parts of the domain will be overlooked in the representation process.”

An example is given of a theory of commercial transactions, to represent the concept of buying and selling, where we decide that to refer to the event of X buying Y from Z, we use the term (Buy X Y Z). We realize that this is insufficient since it cannot distinguish between two separate events involving X buying Y from Z; i.e., if Z sells Y to X, then buys Y back from X and sells it to him again, both the first and third transactions will be (Buy X Y Z) and we cannot distinguish between them (Guha, 1991).

Consider that we add an extra argument to extend our vocabulary, to represent the time at which the sale took place; to refer to X buying Y from Z at time T1, we use the predicate (Buy X Y Z T1). But now suppose we have two sales of Y to X at the same time; our vocabulary again does not allow us to distinguish between them. We could of course further refine our vocabulary to cover these phenomena. But in the process we also make the vocabulary more awkward to use, requiring a trade-off between the part of the domain which can be covered and the usability of the vocabulary, which seems undesirable. It also seems that with a little searching, we can always uncover inexpressible phenomena. But at some point, we have to finalize the vocabulary and carry on with the representation. As humans we can quite easily conceptualize these inexpressible phenomena; although we may not be able to predict what will happen, we can still understand it. Thus Guha argues that there is a certain “upward compatibility” of our vocabularies which our programs should have, and the first step towards designing these upwardly compatible vocabularies is to capture the fact that the choice of the vocabulary incorporates certain assumptions (Guha, 1991).

It is almost inevitable that there will be some limitations in the vocabulary and that there will be some assumptions behind the theory. When these limitations are eventually discovered, extending the KB to deal with these phenomena could require a reworking of relevant parts of the KB. This is undesirable in that we would like a more graceful way of extending the KB to deal with hitherto unexpected phenomena (Guha, 1991). The problem is that the naive theory by itself does not contain all the information associated with it. The information associated with a naive theory of commercial transactions includes not just the axioms describing buying and selling, but also information about the assumptions made by the theory, when these assumptions are reasonable, when this theory is applicable, etc. These are “external” to the axioms constituting the theory itself.

The contextual effects on an expression are often so rich that they cannot be captured completely in the logic. In Guha’s work, contexts are incorporated as objects in the ontology in the following way: by using statements of propositional logic such as the formula ist(NaiveMoneyMt A1), where the context is denoted by the symbol NaiveMoneyMt and is supposed to capture everything that is not in a statement A1, that is required to make A1 a meaningful statement representing what it is intended to state. Here, ist stands for “is true in” and the first argument to ist denotes a context as defined by (Guha, 1991).

The idea is that A1 itself need not be a completely de-contextualized or universal statement. It might depend on some contextual aspects that have not yet been specified and these aspects are to be captured by the context argument. Indeed, it might not be possible to ever completely list all of these context dependencies. At any time, we might have only a partial description of the context and this is why contexts are assumed to be rich objects. In other words, the context object can be thought of as the reification of context dependencies of the sentences associated with the context (Guha, 1991).

Another way of looking at the context argument is from a domain standpoint, as follows. Imagine a robot which, in the course of its duties, has to deal with a certain new domain. In order to do this, it examines the domain and writes a set of sentences representing this domain. If we view this action of the robot as a function that computes the representation, then the domain itself is of course an argument to this function. So when the robot writes ist(<context> <the theory>), the second argument to ist accounts for the domain argument and the context argument accounts for the fact that there were these other factors influencing the representation. Thus Guha argues that we are, in a sense, in a position similar to that of the robot, except that we are writing sentences for the computer and not for ourselves, and there are similar factors influencing our representation, so the context argument in the sentences we may write is meant to capture the effect of these factors (Guha, 1991).

Therefore, if all we could do with the new syntax was to partition the KB into separate theories, this would not be very useful; though different theories in the KB might have different context dependencies, there has to be some relation between them. We would like to be able to provide at least “partial descriptions of these contexts” and reason with them to combine information from different contexts, to ensure that a theory is not used inappropriately, or to extend a theory to cover new phenomena, etc. (Guha, 1991). Thus, in a pure KB operational sense, Guha notes that if a new theory GTT is introduced that is a little more explicit than A1 in NaiveMoneyMt, so that we can now, e.g., make a distinction between different kinds of costs associated with the original example of financial transactions, we say that the theory GTT has been “partially decontextualized” or “relatively decontextualized”, and he refers to this process as Lifting.

We contrast and establish parallels between this idea and the semantic lifting described in section 1.3.3. Given a problem, we might not have a context with just the right theory for solving it, requiring axioms from different theories. Thus, Guha notes that we should be able to create a new context, lift the relevant axioms from these two theories into this context, enter this context and solve the problem (Guha, 1991). These two theories might make different assumptions and use different vocabularies. Thus the lifting process needs to perform the requisite relative de-contextualization. Traditionally, research on problem solving has assumed that the representation (theory) is a given and has been set up before the problem solving starts. The goal of the create, lift, enter, problem-solve, exit sequence is to bring the choice of the theory used to solve a problem within the scope of the problem solving (Guha, 1991).
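To make the ist construct and the lifting operation above concrete, the following is a minimal illustrative sketch, written as our own schematic rendering rather than Guha’s formalism or any particular KB system’s API; the context names, assertions, and function signatures are placeholders:

    # Minimal illustrative sketch of Guha-style contexts (hypothetical names/structure):
    # a context is a named collection of assertions, ist(c, a) checks whether an
    # assertion is taken to hold in context c, and lift() copies relevant axioms
    # from source contexts into a newly created problem-solving context.
    from typing import Dict, Set, Tuple

    Assertion = Tuple[str, ...]                     # e.g. ("Buy", "X", "Y", "Z", "T1")
    contexts: Dict[str, Set[Assertion]] = {
        "NaiveMoneyMt": {("Buy", "X", "Y", "Z")},
        "TimedMoneyMt": {("Buy", "X", "Y", "Z", "T1")},
    }

    def ist(context: str, assertion: Assertion) -> bool:
        """ist(c, a): assertion a is taken to be true in context c."""
        return assertion in contexts.get(context, set())

    def lift(sources, target: str) -> None:
        """Create a new context and copy (relatively de-contextualize) axioms into it."""
        contexts.setdefault(target, set())
        for src in sources:
            contexts[target] |= contexts.get(src, set())

    lift(["NaiveMoneyMt", "TimedMoneyMt"], "ProblemContext")
    print(ist("ProblemContext", ("Buy", "X", "Y", "Z")))    # True

The create, lift, enter, solve, exit sequence then amounts to constructing "ProblemContext", reasoning with its lifted axioms, and discarding it afterwards.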

1.3.2 Contexts in Distributional Models of Representation

Distributed representations of words in a vector space help learning algorithms achieve better performance in natural language processing tasks by grouping similar words (Mikolov et al., 2013b). One of the earliest uses of word representations, due to Rumelhart, Hinton, and Williams (Rumelhart et al., 1988), has since been applied to statistical language modeling with considerable success (Bengio et al., 2003).

The follow-up work includes applications to automatic speech recognition, machine translation, and a wide range of NLP tasks (Mikolov et al., 2013b; Le and Mikolov, 2014). Recently, Mikolov et al. (2013a) introduced the Skip-gram model, an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data. Unlike most of the previously used neural network architectures for learning word vectors, training of the Skip-gram model does not involve dense matrix multiplications. Fig 1.1 shows the Skip-gram model architecture, where the training objective is to learn word vector representations that are good at predicting the nearby words.

This makes the training extremely efficient: an optimized single-machine implementation can train on more than 100 billion words in one day (Mikolov et al., 2013a).

The word representations computed using neural networks are very interesting because the learned vectors explicitly encode many linguistic regularities and patterns, many of which can be represented as linear translations. For example, the result of the vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector (Mikolov et al., 2013b). They also come up with several extensions of the original Skip-gram model, as word representations can be limited by their inability to represent idiomatic phrases that are not compositions of the individual words. For example, “Boston Globe” is a newspaper, and so it is not a natural combination of the meanings of “Boston” and “Globe”.

Figure 1.1: Skip-gram model architecture (Mikolov et al., 2013a)

Therefore, using vectors to represent whole phrases makes the Skip-gram model considerably more expressive (Mikolov et al., 2013b). For this, they first identify a large number of phrases using a data-driven approach, and then treat the phrases as individual tokens during training (Mikolov et al., 2013b). To evaluate the quality of the phrase vectors, they develop a test set of analogical reasoning tasks that contains both words and phrases. A typical analogy pair from the test set is “Montreal”:“Montreal Canadiens” :: “Toronto”:“Toronto Maple Leafs”, which is considered to have been answered correctly if the nearest representation to vec(“Montreal Canadiens”) - vec(“Montreal”) + vec(“Toronto”) is vec(“Toronto Maple Leafs”). Finally, another interesting property of the Skip-gram model is that simple vector addition can often produce meaningful results, e.g., vec(“Russia”) + vec(“river”) is close to vec(“Volga River”), and vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).

This compositionality suggests that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations (Mikolov et al., 2013b).
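The analogy and composition examples above can be reproduced with any pretrained Skip-gram embeddings. The sketch below uses the gensim library purely for illustration; "vectors.bin" is a placeholder file name, and the neighbors returned depend entirely on the embeddings used:

    # Illustrative only: reproducing the vector-arithmetic examples with pretrained
    # Skip-gram embeddings via gensim. "vectors.bin" is a placeholder path to any
    # word2vec-format embedding file.
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # vec("Madrid") - vec("Spain") + vec("France") is expected to be near vec("Paris")
    print(kv.most_similar(positive=["Madrid", "France"], negative=["Spain"], topn=1))

    # Additive compositionality: vec("Russia") + vec("river")
    print(kv.most_similar(positive=["Russia", "river"], topn=3))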

In works to follow, Le et al. (Le and Mikolov, 2014) propose Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of text. The texts can be of variable length, ranging from sentences to documents. The name Paragraph Vector emphasizes the fact that the method can be applied to variable-length pieces of text, anything from a phrase or sentence to a large document. In this model, the vector representation is trained to be useful for predicting words in a paragraph. More precisely, they concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by stochastic gradient descent and back-propagation (Le and Mikolov, 2014). While paragraph vectors are unique among paragraphs, the word vectors are shared (Le and Mikolov, 2014). At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence. Their empirical results on several text classification and sentiment analysis tasks show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representation, achieving state-of-the-art results on those tasks (Le and Mikolov, 2014).
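A minimal sketch of this train-then-infer workflow using gensim's Doc2Vec follows; it is an independent reimplementation of Paragraph Vector, not the original authors' code, and the toy corpus and hyperparameters are placeholders:

    # Minimal sketch of the Paragraph Vector workflow with gensim's Doc2Vec
    # (an independent implementation of Le and Mikolov, 2014). Corpus and
    # hyperparameters here are illustrative placeholders.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        "high blood sugar measured after meals",
        "elevated glucose level in the morning",
        "patient started on a hypotensive drug",
    ]
    docs = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

    model = Doc2Vec(docs, vector_size=50, window=3, min_count=1, epochs=40)

    # At prediction time, a new "paragraph" vector is inferred with word vectors fixed,
    # then compared against the trained document vectors.
    query_vec = model.infer_vector("blood sugar is high".split())
    print(model.dv.most_similar([query_vec], topn=2))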

1.3.3 Semantic Lifting

The term Semantic Lifting has been used from the perspective of automated knowledge extraction, to refer to the process of associating content items with suitable semantic objects as “metadata”, to turn “unstructured” content items into semantic knowledge resources (Interactive Knowledge Stack, 2012). It refers to a way to extract knowledge from content in an automated way using unstructured as well as structured data in ontologies, within complex knowledge domains. Thus semantic lifting can make explicit “hidden” metadata in content items. Structured content provides implicit semantics through the structure definition, e.g. table definitions in relational databases, XML schemas, field definitions for addressbooks, etc., from an application standpoint, where application programs are designed to “know” how to interpret the structures and the data within. Unstructured content consists of images, texts, videos, music, and web pages composed of various types of media items that are meaningful only to humans but not to machines. To become meaningful to and “discoverable” by machines, content must be described semantically by metadata, e.g. what the text or image is about. In the case of mixed content, structured databases may be used to store unstructured content types, such as texts, images etc., and documents can be composed of unstructured content items such as free text and images as well as more structured information, e.g. tables and charts. Also, content metadata may exist in many forms, viz. free text descriptions, descriptive content-related keywords or tags from fixed vocabularies or in free form, taxonomic and classificatory labels, media-specific metadata, e.g. mime-types, encoding, language, bit rate, and even media-type specific structured metadata schemes for photo, image and video content, and content-related structured knowledge markup, e.g. to specify what objects are shown in an image or mentioned in a text, and what the actors are doing, etc. Formal semantic metadata may also be a part of the system, where data representation uses a formalism that defines the concept of (logical) entailment for reasoning, with a formal semantic interpretation, viz. soundness: conclusions are valid entailments; completeness: every valid entailment can be deduced; and decidability: a procedure exists to determine whether a conclusion can be deduced. The various embodiments of formal semantic metadata could be via Knowledge Representation Systems, Description Logics and standard representations such as RDF, OWL, etc. On the other hand, “Semantics in Content Management Systems (CMS)” provides various methods to include metadata, e.g. organizing content in hierarchies, hierarchical taxonomies, attachment of properties to content items for metadata, and content type definitions with inheritance. These methods are used in CMS systems in ad-hoc fashion without clear semantics. Therefore no well-defined reasoning is possible. In these systems, semantic lifting may thus be used as content enhancement to support data exchange and seamless interoperability between different sub-systems. From the standpoint of large CMS systems, semantic lifting usage would include: (1) the import of external sources/content/documents via automatic extraction and analysis, e.g. for indexing; (2) integration of external content into the content repository; (3) transformation to ensure content matches internal CMS structures and metadata schemes; (4) cross-referencing/linking among CMS content items and external content; (5) detection of related or additional content; and (6) adding pointers/links to related or additional content (Interactive Knowledge Stack, 2012).

Azzini et al. (2013) employ the idea of semantic lifting in a process mining (a process management technique) context. They perform process mining based on the computation of frequencies among event dependencies to reconstruct the workflow of concurrent systems, and apply appropriate semantic lifting in their framework to extract knowledge from event and workflow logs recorded by an information system. In particular, they show how applying an appropriate semantic lifting to the event and workflow log may help to discover the process that is actually being executed. In this case, the semantic lifting procedure corresponds to all the transformations of low-level system logs carried out in order to achieve a conceptual description of business process instances, without knowing the business process a priori; by semantic lifting they are thus able to demonstrate its effectiveness for data loss prevention, aimed at preventing the loss of critical business information in companies (Azzini et al., 2013).

1.4 Concepts, in Context

In this section I present different versions of how the term concept has been defined, drawing from various disciplines, and then present the definition we adopt for purposes of this research.

The Merriam-Webster dictionary defines the term “concept” as: (1) something conceived in the mind, e.g. a thought or notion; (2) an abstract or generic idea generalized from particular instances, e.g. the basic concepts of psychology, the concept of gravity.

Carey et al. state that concepts are the fundamental building blocks of our thoughts and beliefs. They play an important role in all aspects of cognition. Concepts arise as abstractions or generalizations from experience; from the result of a transformation of existing ideas; or from innate properties. Thus, a concept is instantiated (reified) by all of its actual or potential instances, whether these are things in the real world or other ideas (Carey, 1999).

Concepts are studied as components of human cognition in the cognitive science disciplines of linguistics, psychology and philosophy, where ongoing debate asks whether all cognition must occur through concepts. Concepts are used as formal tools or models in mathematics, computer science, databases and artificial intelligence, where they are sometimes called classes, schema or categories. In informal use, the word concept often just means any idea, but formally it involves the abstraction component. Concepts are sometimes known by other names in everyday language such as “kinds”, “types” or “sorts”, as in “an oak is a kind of tree”, or “this object is a kind of tree” (Carey, 1999; Margolis and Laurence, 1999).

Further, concepts can be organized into a hierarchy, higher levels of which are termed “superordinate” and lower levels termed “subordinate”. Additionally, there is the “basic” or “middle” level at which people will most readily categorize a concept, which is also the level at which we deal with concepts in this research proposal. For example, a basic-level concept would be “chair”, with its superordinate, “furniture”, and its subordinate, “easy chair” (Eysenck, 2006).

Merrill (1977) provides an extensive treatment of concepts and their types from an instructional design standpoint, where concepts must be taught. According to Merrill, a concept is a set of specific objects, symbols, or events which are grouped together on the basis of shared characteristics and which can be referenced by a particular name or symbol. Most of the words in any given language refer to classes or categories of symbols, objects or events rather than to particular instances of these categories. Usually it is necessary to use modifying words to make one of these general class words refer to a particular instance. Thus, the word “cat” refers to a set of objects which share particular characteristics such that one can learn to distinguish cats from dogs, rabbits or any other animal. Thus any given cat represents a member of the general set “cat”, where in order to refer to a specific cat, one must use modifying words to make the general word specific. Thus, “my cat”, “that cat in the zoo”, “the yellow cat with the short tail” illustrate the use of modifying words to make the general category “cat” specific to a particular instance. The word instance is a general term used to refer to both members and non-members of a concept class. There are two kinds of instances: examples and non-examples. An example is an instance which is a member of the concept under consideration. Thus sandstone and shale are examples of the concept class “sedimentary rock” (Merrill and Tennyson, 1977).

“Object Concepts” are further defined as concepts that exist in time and space and can easily be represented by drawings, photographs, models or the object itself, such as things that one might find in a school textbook, e.g. strait, immigrant, whale, rivers etc., while “Symbol Concepts” consist of particular kinds of words, numbers, marks and numerous other items that represent or describe objects, events or their relationships, either real or imagined. To illustrate, these may be symbol concepts such as noun, verb, predicate, paragraph from language arts, or these may be symbol concepts like those from mathematics, such as the unknown quantity, equation, whole number, fraction, decimal etc. Still other “Event Concepts” are defined as interactions of objects in a particular way in a particular period of time, and some such concept names would include e.g. birthday party, censorship, diagnosis, digestion, football game etc. (Merrill and Tennyson, 1977).

1.4.1 Concepts

In contrast with the above ideas, however, we define the idea of a concept from solely a cognitive and language part-of-speech standpoint. Thus, in this dissertation we formulate our definition of a concept based around the nominalized sense of a real-world idea or unit of cognition, expressed within a natural language utterance or text. But first, to establish some common ground, we define what a term is – we define a “term” to mean either a word or a phrase. Thus, more concretely,

Definition: A concept, in this work, refers to nominalized senses of words or phrases (terms) that exist at the mid to “leaf” level of either a conceived hierarchy, or of an existing ontology, primarily as a hyponym of some higher level hypernym concept, and possibly itself having other hyponyms and synonyms.

Examples of concepts, as defined in this work, are therefore surface form manifestations as singleton or multi-term nominal instances of higher-level hypernyms or semantic classes, and may be instances such as school bus, shoes, apple pie, crumbly chocolate chip cookie, ibuprofen and atrial fibrillation. Concepts, by definition in this work, are therefore surface forms of a more generic, abstract or KB-defined hypernym-level concept.

A related concept, to a given concept, is thus a word or phrase in a nominal sense that represents a hyponym of a semantically related, paradigmatically parallel (Turney and Pantel, 2010; Fellbaum, 1998) hypernym. Thus, related concepts to shoes may be terms like sandals, flip-flops, boots or clogs, where either all share a common hypernym like “Footwear”, or come from possibly related hypernyms like “Open Footwear” and “Closed Footwear”.

From the perspective of reasoning with large amounts of unstructured texts for knowledge discovery, where a multitude of relations may hold between related concepts (Hendrickx et al., 2009), we may primarily be interested in concepts that are taxonomically similar and paradigmatically parallel but are still semantically related, and we hope to find a construct that allows for bridging between the two. This work hypothesizes that Contexts may be able to provide such a bridge, and hopes to demonstrate that it is indeed novel contexts that allow for the “transfer” of the notion of relatedness between concepts at two ends, either of which may be existing or conceived, and which may otherwise have been unrelated. For relatedness described in this way, it then becomes necessary to define the idea of a latent concept, as used in this work.

1.4.2 Latent Concepts

Definition: We define a latent concept as a term, or set of terms c = {ta, tb, tc, ...}, such that, in conjunction, this set of terms, or single term, refers in the nominal sense to either an existing (observed) or conceived (abstract) concept. In other words, these could be concepts or semantic classes accessible only via their surface forms and can be seen as possible “hypernyms” to instance or subclass hyponyms. Examples of latent concepts could be “Squishy Objects”, for which surface form manifestations could be marshmallows and cookie dough, or “Taking treatment for back pain”, which could have surface forms like ibuprofen and acupuncture. However, another example could also be where a set of terms occurring in the text within a small neighborhood or context window, like C1 = {“motorized vehicle”, “one-wheeled”}, in conjunction may refer to a latent concept like “segway”, where the literal term segway itself is not explicitly mentioned in the text, but is referred to by this set.

Thus a “latent concept”, just like any other concept, may or may not currently exist in the concept hierarchy of some known ontology or knowledge resource, but is reified as a real-world idea on account of the usage of the actual term or set of terms, i.e. the surface forms, representing it.

Thus, “Hyperglycemia” and “Taking medication for high blood pressure” (both present in the UMLS ontology – CUI: C0020456 and CUI: C2054151 respectively), the abstract “Squishy Objects” (potentially not present in any ontology or knowledge resource) and the set C2 = {“personalized”, “two-wheeled”, “vehicle”}, referring to a concept like “bicycle” or “motorcycle” when these literal terms are not explicitly present in the text, may all be examples of latent concepts, surface forms of which can occur in novel contexts, or as a result of reasoning with some text, either by using or not using a knowledge base lookup.
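One simple way to operationalize the intuition that a term set such as C1 or C2 points to a latent concept, assuming pretrained embeddings are available, is to average the terms' vectors and retrieve nearest vocabulary neighbors. The sketch below, with a placeholder embedding file, only illustrates this intuition and is not the method developed in later chapters:

    # Illustrative sketch (not the dissertation's actual method): mapping a set of
    # surface terms to candidate latent concepts by averaging their embeddings and
    # retrieving nearest vocabulary neighbors. "vectors.bin" is a placeholder.
    import numpy as np
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def candidate_latent_concepts(terms, topn=5):
        """Average the term vectors and return the nearest vocabulary entries."""
        vecs = [kv[t] for t in terms if t in kv]
        if not vecs:
            return []
        centroid = np.mean(vecs, axis=0)
        return kv.similar_by_vector(centroid, topn=topn)

    # C2 = {"personalized", "two-wheeled", "vehicle"} might surface neighbors
    # like "bicycle" or "motorcycle", depending entirely on the embeddings used.
    print(candidate_latent_concepts(["personalized", "two-wheeled", "vehicle"]))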

As will be described in upcoming chapters, specifically through the introductions accompanying figures 4.1 and 6.1, for the respective choices of recommendation tasks, viz. neural semantic tagging and neural domain adaptation–based question answering, we seek to illustrate the utility of uncovering latent concept associations toward knowledge discovery.

But before describing the research contributions in subsequent chapters, we first provide some more definitions to establish a common vocabulary for the rest of this dissertation, which will be helpful in formulating the problem statement.

1.4.3 Context

Definition: We further expand on Guha’s definition of contexts outlined in section 1.3.1, from the notion of “rich objects within a domain that cannot be fully described”, to capture the notion of a neighborhood. Thus, for purposes of this dissertation, we define a context as simply a neighborhood, or a scope, i.e. “a boundary within a certain frame of reference”, within a system or a collection currently being examined, within which some concepts exist. Thus, in a textual scenario, a context is simply a region of text containing some concepts. In Figure 1.2 this is represented by the blue dotted ovals.

Some examples of contexts by this definition may be a sentence or paragraph within a document, a document snippet, an individual document, a cluster of documents, a corpus, an image, a small area within an image, a collection of images, a particular branch within an ontology or taxonomy, a full taxonomy, and other similar neighborhoods.

Figures 4.1 and 6.1 better illustrate this idea, where the light blue ovals represent “contexts” that contain various concepts or other contexts, orange ovals represent “latent concepts” (if dashed border) or actual concepts (if solid border) present in an ontology, the green ovals are hyponyms, synonyms or surface form lexical variants of the concepts they represent, the purple solid lines represent different surface forms being related by way of the shared context they occur in (see section 1.4.4), and the red dashed lines represent the “hidden relation” between latent concepts as a result of a purple line relating two corresponding surface forms.

1.4.4 Shared Context

Definition: Drawing on the definition of a context, a shared context is simply a scope that is shared by certain concepts. A shared context may exist independently of other contexts, or it may span different contexts, or subsume other contexts, in order for concepts to share this context; but it is shared purely by virtue of the concepts occurring within its scope.

Thus, words within a sentence can have as the shared context, the whole sentence, the containing paragraph, the containing document, and may also have shared context with other paragraphs from other documents that contain the same or similar words.

1.4.5 Novel Context

Definition: We define a novel context as a scope of terms containing a single concept or a set of concepts which have not been observed thus far, as co-occurring exactly in this form, in the system or collection under consideration, and which is thus representative of some new context.

Thus, examples of novel contexts may be a partial or fully formed sentence of a natural language query, or any text snippet, or any image/other signal snippet, that we have not observed thus far with respect to a collection. Novel contexts may or may not relate to latent concepts occurring within a system under consideration.

1.5 Problem Statement and Research Task

Given the ground covered thus far, we are now ready to formulate our problem definition.

1.5.1 Problem Definition

Research Statement: The primary research task in this dissertation is to show how leveraging associations between latent concepts (section 1.4.2) may help to address novel contexts (section 1.4.5) by discovering related concepts in some downstream recommendation task setting, and how knowledge transfer methods may significantly aid and augment this process of discovery. I seek to achieve this by finding evidence to support the following main ideas.

• Main Idea 1: We may leverage the assumption that hidden associations may

exist between latent concepts expressed via their surface forms, to develop models

that learn to predict novel associative terms in documents, that may improve

query matching.

Thus, hidden associations or latent relations, not currently present in a knowledge base, may exist between the same or related “latent concepts”, expressed via their corresponding surface form concepts or entities, in a large labeled or unlabeled collection. Thus,

E.g. 1. lexical form1 = hypotensive drug, lexical form2 = high blood pressure,

latent concept1 = “Antihypertensives”,

latent concept2 = “Hypertensive disease”,

latent relation = treatsFor

E.g. 2. lexical form1 = high blood sugar, lexical form2 = elevated glucose level,

latent concept1 = “Hyperglycemia”,

latent concept2 = “Hyperglycemia”,

latent relation = sameAs.

Here we must note that the terms in the first example refer to different latent concepts related by a potential relation like treatsFor, whereas the terms in the second example are meant to refer to the same concept, where we may view the potential relation or hidden association to be sameAs. This is intended for purposes of this research, where we neither seek to extract relations pre-defined within an established knowledge resource, nor do we seek to give an exact label to potential relations that may exist between concepts. At times, due to polysemy, latent concepts represented by their surface forms may even have multiple relations between them. Thus we are not trying to define the exact relation between two concepts. Rather, we believe that by leveraging any potential relations that may exist between the latent concepts, and as a result between their representative surface forms, we may be able to retrieve the target surface forms we are interested in.

Thus, we hope to take advantage of potential relations or hidden associations that we believe may exist between latent concepts as they freely occur in the observed text data (as surface noun phrases), so as to discover a set of surface form concepts in a collection (e.g. documents or a corpus), each of which may bear some relation to another set of surface forms seen in the novel context (query or question). This, we believe, is akin to using the out-of-vocabulary relations made possible by concepts in a novel context and concepts seen in the observed data, to find related, within-vocabulary concepts and entities in the data.

Prior related works on hypernym discovery, notably due to Ritter et al. (2009), have pointed out the limitations of carrying out that task using lexical resources such as WordNet (Miller, 1995), given its limited coverage of proper nouns: using WordNet on a test set of noun phrases randomly selected from their Web corpus, they found only 17% of the proper nouns and 64% of the common nouns were covered by WordNet, with missing information even for noun phrases that are found in WordNet. E.g., WordNet has an entry for “Mel Gibson” as an instance of the class “actor”, a subclass of “human”, but does not indicate that he is a “man”. Further, one would need to use other knowledge sources or extractions from a corpus to find that Mel Gibson is also an instance of “movie star”, “Australian”, or “Catholic”.

They further point out that previous methods using mainly lexical patterns to discover hypernyms suffer from limited precision and recall. Indeed, in some of the earliest research in automatically discovering hypernyms from text (Hearst, 1992), using what came to be called Hearst patterns, she presented manually discovered lexical patterns (regular expression-based grammar parses) that detect hypernyms C for entities E, which are fairly good evidence that an entity E is a subclass or member of C. However, due to the local nature of these patterns, they are quite prone to errors, and don’t scale or generalize very well. E.g., a sentence with “... urban birds in cities such as pigeons ...” matches the pattern “C such as E” with C bound to city and E bound to pigeon, leading to city being extracted as a hypernym of pigeon, i.e. hypernymOf(pigeon) = city, which is false.

Approaches such as Ritter et al. (2009), that incorporate lexical patterns for the task of hypernym discovery, define hypernymy as a relation between surface forms (i.e. literal E and C) and not “synsets” as in WordNet. This is in contrast, we believe, to our approach, where, similar to WordNet, we define any potential relation, including hypernymy, to hold between “latent concepts” and, through them, between their respective surface forms, as outlined by main idea 1. This abstraction, coupled with removing the necessity to specify or determine an exact relation, we believe, allows for context to play a greater role in showing us if there may exist “a relation” between two latent concepts. It is important to note here that our primary research goal is to show improved performance in downstream tasks, which we believe is made possible by matching to the right concepts in target text via latent concept relations. Thus we want to find the right concepts in context to enable this goal, leading us towards a process for concept discovery; however, the goal is not concept or hypernym discovery by itself, primarily just for purposes of vocabulary expansion, resolution, or knowledge base population, but for improved downstream recommendation. Thus, we believe that the potential relations that we hypothesize may exist between latent concepts play a similar role to the lexical patterns used by past approaches for tasks such as hypernym discovery, except that our methods for realizing these patterns may be more generic and dynamic, owing to the more local and global context-based neural approaches that we employ, as evidenced also by recent more effective approaches to hypernym discovery (Zhang et al., 2018, 2016; Hassan et al., 2018), and owing to thus leveraging shared contexts to infer potential relatedness. This leads us to main idea 2.

• Main Idea 2: Shared Contexts, such as those made possible e.g. via neural language modeling–based vector space models, may be used to leverage the possible hidden relations or novel associations between latent concepts, expressed via their surface forms, as opposed to modeling directly for the latent concepts.

As described earlier, we believe that because a shared context may exist independently of other contexts, or may span different contexts, or subsume other contexts, words within a sentence can have as shared context the whole sentence, the containing paragraph, the containing document, or the containing collection (domain). We believe words within a query sentence, for example, may also have shared context with paragraphs or portions from other documents that contain the same or related words. Thus this idea suggests that these shared contexts could be used to find related texts to a starting text, wherein the concepts appearing in both the texts may bear some relation.

• Main Idea 3: Directly inferring the hidden associations between latent concepts

can lead to discovery-based insights in prediction and recommendation tasks,

e.g. IR and QA, by alleviating the need to obtain intermediate task-specific

relationship labels.

Chapters 4, 5 and 6, we believe, thus represent methods to better elicit these free-form relations possible between the surface forms of the abstract latent concepts, without explicitly training on pre-defined relations or aiming to predict or name them.

This, we believe, is made possible via the designed training objectives to upweight, learn encodings for, rank, or adapt related texts appropriately, but with the aim of finally presenting the texts bearing the appropriate concepts in a retrieved document or answer, having some relation to the concepts in some novel, user-generated context, such as a complex clinical query or a specific product-related question.

Figure 1.2: Example of Hidden Associations between Latent Concepts in Context.

Thus, as an example, the alternate lexical forms high blood sugar and elevated glucose level may exist in separate documents in a collection, both referring to the same underlying latent concept “hyperglycemia”, where the literal text hyperglycemia may not occur explicitly anywhere in that collection. This example, we believe, represents a sameAs relationship (corresponding to a binary predicate of the same name) between the alternate lexical forms high blood sugar and elevated glucose level, for a latent concept “hyperglycemia” that does exist in the UMLS ontology. Examples of such latent concept associations specific to each research work in this dissertation will be expanded upon in detail in associated illustrations in the Chapter 5 and 6 introductions and via the examples at the end of those chapters.

However, now consider a case, as depicted in Figure 1.2, where marshmallows and water balloons are surface forms for a latent concept like “Squishy Objects”, and smores and jell-o are surface forms for a latent concept like “Desserts”, where an explicit relation like madeFrom between smores and marshmallows, or their corresponding latent concepts, may not exist in a knowledge base. Similarly, the idea that some “Squishy Objects” (marshmallows) may be semantically similar to some types of “Desserts” (jell-o), due to both individual surface forms implicitly satisfying a predicate like madeOfGelatin, may be the result, we believe, of an implicit hidden relation between the two latent concepts “Squishy Objects” and “Desserts”. Here madeOfGelatin, we believe, may act as a hidden association referred to by a concept within a novel context (such as jell-o from Context 3), alluding to some real-world likelihood that some desserts made of gelatin might be squishy objects. This rule or relation may not only be absent from a knowledge base, but may be difficult to add to a knowledge base using existing methods such as distant supervision (e.g. section 2.2.3), if the entities, concepts or semantic classes being referred to do not already exist in that knowledge base or some ontology.

This is akin to the problem of certain semantic classes satisfying a selectional preferential constraint (section 2.2.5), in which the relation that holds between the two classes is not known a priori but for which the “context in which the two appear related, acts as a proxy”. However, we know that a query regarding water balloons should give us items from a collection (which may consist of documents or answers) relating only to the latent concept “Squishy Objects”, whereas a more complex query or “novel context”, regarding “gelatinous squishy objects that are desserts”, should ideally give us items containing the correct surface forms, i.e. marshmallows and jell-o, involving latent concepts from two portions of a “conceived hierarchy”, given that marshmallows and jell-o are both “Squishy Objects” and “Desserts”, and as a result of the hidden relation madeOfGelatin between these latent concepts.

Because the same context (e.g. Context 3, which is about gelatin-based desserts) may relate multiple latent concepts (e.g. “Squishy Objects” and “Desserts”) (section 1.4.2; Medin et al., 1993), and also because the same latent concept, like “Squishy Objects”, may occur in multiple contexts (e.g. as the surface forms marshmallows versus water balloons) (section 1.4.3; Medin et al., 1993; Turney and Pantel, 2010), this research attempts to implicitly model these potential relations or hidden associations between latent concepts, by leveraging distributional semantics–based approaches for representation of the concepts, along with strategies to enable inference of these hidden relations, via techniques such as maximum likelihood–based language modeling and neural domain adaptation methods for effective knowledge transfer, thus further augmenting semantic matching to enable discovery of related surface form concepts in novel contexts.

This phenomenon of potential hidden associations existing between latent concepts in a given context is better depicted in Figure 1.2.
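As a purely schematic rendering of how a shared context can surface such a hidden association without ever naming the relation, the sketch below ranks collection items against a novel query context by cosine similarity of averaged term embeddings. The vocabulary and random stand-in embeddings are hypothetical; with trained embeddings, the items sharing the query's context would be expected to rank higher:

    # Purely illustrative: ranking collection items against a novel query context by
    # embedding similarity, so that items sharing context with the query surface
    # without any explicit relation label. Embeddings here are random stand-ins.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["gelatinous", "squishy", "dessert", "marshmallows", "jell-o",
             "water", "balloons", "objects"]
    emb = {w: rng.normal(size=16) for w in vocab}   # stand-in for trained embeddings

    def encode(text):
        vecs = [emb[w] for w in text.lower().split() if w in emb]
        return np.mean(vecs, axis=0) if vecs else np.zeros(16)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    query = "gelatinous squishy objects that are desserts"
    items = ["marshmallows and jell-o", "water balloons"]
    ranked = sorted(items, key=lambda d: cosine(encode(query), encode(d)), reverse=True)
    print(ranked)   # with real embeddings, items sharing the query's context rank higher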

1.5.2 What is Concept Discovery?

Thus we ask what concept discovery might look like in practice. We attempt to provide a perspective on this process, based on the problem definition outlined in this research.

Concept Discovery, we believe, is a step towards finding what we mean, in a certain context.

One insight is that it may be more towards finding “related” concepts from an analogic perspective, not just for purposes of attaching to an existing ontology or hierarchy, but to satisfy a completely new relation that holds true in a certain, generally unseen, context. Two well-known forms of expressing an information need are via a short phrase-based search query, such as one issued to a search engine or document retrieval system, or a natural language question, such as one posed on an online community question answering forum.

As a trivial example, ordinarily unrelated entities or concepts (noun phrases, in this work) such as “scarlet macaw” and “platypus” occurring in separate documents d1 and d2 may become related by a novel context such as “exotic pets” that may occur as terms in a query or as a phrase in a document dp which could be related to both d1 and d2. Now, if by some means, documents d1 and d2 were semantically tagged with the phrase “exotic pets” via dp, those documents would surface in the event of such a query.
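A toy schematic of this tagging idea follows: a tag phrase from a related document dp is propagated to d1 and d2 when their similarity to dp exceeds a threshold, so that a later query on that phrase retrieves them. The similarity scores below are assumed values rather than model outputs, and this is not the tagging method developed in later chapters:

    # Toy schematic of propagating a tag phrase from a related document (dp) to
    # other documents (d1, d2) above a similarity threshold, so that a query on the
    # phrase later retrieves them. Similarity scores here are assumed, not computed.
    docs = {
        "d1": "the scarlet macaw is a striking parrot kept by some enthusiasts",
        "d2": "the platypus is occasionally kept in specialised wildlife facilities",
        "dp": "a guide to exotic pets, from unusual birds to rare mammals",
    }
    tag_phrase = "exotic pets"
    assumed_similarity_to_dp = {"d1": 0.71, "d2": 0.63}   # stand-ins for model scores

    tags = {doc_id: set() for doc_id in docs}
    tags["dp"].add(tag_phrase)
    for doc_id, score in assumed_similarity_to_dp.items():
        if score > 0.5:                      # propagate the tag to related documents
            tags[doc_id].add(tag_phrase)

    query = "exotic pets"
    hits = [doc_id for doc_id, t in tags.items() if query in t]
    print(hits)   # ['d1', 'd2', 'dp'] — all three documents now surface for the query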

Similarly, consider seemingly disparate concepts such as “paper products” and “handheld electronics” that may not bear any known relation to one another in a well-established ontology or knowledge graph. But if what we really “meant to find” were the “most-ordered products on Amazon”, then there suddenly appears an intuitive relation between these ordinarily unrelated concepts that are neither “semantically associated” nor “taxonomically similar” according to the definitions given by Turney and Pantel (2010). However, in the absence of an exhaustive database of all possible relations, it may be near impossible to uncover such relations by modeling explicitly for them.

Rather, if we approach this problem from the perspective of finding (discovering) concepts that may be related to a given concept not only in a similarity sense but additionally also in an analogic sense, in some context, where the context itself may represent another concept, then this could increase the odds of capturing rare but useful relations, in order to find and arrive at what is needed.

E.g., if X and Y are similar with respect to many properties, then what is known about X may well transfer to Y. In fact, one reason to say “X and Y are similar” instead of “X and Y are similar with respect to properties P1, P2, and so forth” is that one may wish to leave open the possibility that unknown properties are shared by X and Y. By making a nonspecific similarity claim about X and Y, one explicitly creates an expectation for new commonalities to be discovered (Gelman and Wellman, 1991; Medin and Ortony, 1989; Wellman and Gelman, 1988) (Medin, 1989).

As research in analogy suggests, what is crucial in analogical reasoning is not overall similarity but relational or structural similarity (Gentner, 1983, 1989). Some terms are defined to make this point clear. First, one may distinguish between attributional and relational similarity. Roughly, the distinction is as follows: attributes are predicates taking one argument (e.g., X is red, X is large), whereas relations are predicates taking two or more arguments (e.g., X collides with Y, X is larger than Y). Attributes are used to state properties of objects; relations express relations between objects or propositions. Gentner (1983) has argued that relational similarity has a special status in analogical reasoning, so that when one suggests that “an atom is like the solar system”, the intended meaning involves relations such as “revolves around”, “more massive than”, and “exerts forces”, rather than attributes such as “hot” or “yellow”.

According to Gentner’s structure mapping theory (Gentner, 1983), interpreting an analogy is fundamentally a matter of finding a common relational structure. Thus the objects in the two domains are placed in correspondence on the basis of holding like roles in the relational structure, not on the basis of intrinsic attributional similarity. Thus here, matches and mismatches in object attributes can be neglected.

Thus concept discovery, we believe, may be thought of as simply the process of finding related concepts by way of leveraging the hidden associations that would serve to make two concepts related. Thus concept discovery may be broadly characterized as follows:

• Any two concepts may be related, given the right context. This context therefore serves to define the relation between the two concepts.

• A hidden relation is made possible only by a “novel context” that is not previously seen, defined or considered in the context of the data.

• A hidden relation may hold between latent concepts via their surface forms, which are not necessarily canonical concepts or entities in an existing knowledge resource.

• Only these hidden relations or associations between latent concepts can serve to “discover” related concepts in a given context.

• Within-vocabulary concepts and entities present in the data serve to address novel contexts, which may be seen as bringing the out-of-vocabulary relations or predicates that need to be satisfied.

• Concept discovery can thus be viewed as a two-step process: “matching” a novel context to the appropriate hidden relation between latent concepts, and “retrieving” the surface forms of the matched related concept as the “discovered” terms or concept.

• In contrast to related tasks such as paraphrase identification, where the underlying relation is on the lines of “sameAs”, in concept discovery the scope may be much broader, potentially covering “any” relation made possible by an unseen context.

• Concept discovery may be broader in scope than fact-based relation extraction between entities or canonical concept forms derived from existing knowledge resources for purposes of knowledge base population or completion.

• The “goodness” or “effectiveness” of discovered concepts can only be measured by proxy downstream tasks such as semantic tagging for information retrieval or subjective question answering (QA) (as opposed to fact-based QA), and other similar recommendation tasks.

We contrast our outlined process of concept discovery for improved downstream recommendation with the task of hypernym detection, well studied in the literature. In this respect, Camacho-Collados et al. (2018) point out that evaluation benchmarks for modeling hypernymy have been designed such that in most cases they are reduced to binary classification, where a system has to decide whether a hypernymic relation holds between a given candidate pair of terms. Criticisms of this experimental setting underscore the fact that supervised systems tend to benefit from the inherent modeling of the datasets in the hypernym detection task, leading to lexical memorization phenomena (Shwartz et al., 2016). In this respect, recent works have attempted to alleviate this issue by including a graded scale for evaluating the degree of hypernymy of a given pair (Vulić et al., 2017). Importantly, Espinosa-Anke et al. (2016) proposed to frame the problem as “hypernym discovery”, where, “given the search space of a domain’s vocabulary” and “given an input term”, the goal is to discover its best (list of) candidate hypernyms. This formulation addresses one of the main drawbacks of the evaluation criterion earlier described, and better frames the evaluated systems within downstream real-world applications (Camacho-Collados et al., 2018). This improved formulation is similar in spirit to how we have described the process of concept discovery in the context of this research, where automated discovery of related concepts, it is expected, may lead to improved semantic matching in downstream recommendation tasks, and may therefore be best evaluated as such, in terms of gains observed in those tasks, as outlined in the final characterization of this process above. Espinosa-Anke et al. (2016) also provide a good intuition as to why knowledge transfer methods may aid this process of automated concept discovery towards improved downstream recommendation. Also, given that more recent, established works in hypernym discovery (Espinosa-Anke et al., 2016; Camacho-Collados et al., 2018; Zhang et al., 2018), an area that is related to our work, define as their goal, and operate such that they aim, to “discover” the best (set of) candidate hypernyms for “input concepts or entities”, given the search space of “a domain’s pre-defined vocabulary”, this provides us the intuition for referring to this process as concept discovery or automated concept discovery in the context of this research.

From an operational standpoint, with respect to a traditional supervised model training and prediction setup, or in an unsupervised language modeling setup, this may translate to achieving better generalization to out-of-vocabulary query terms, such as via lowered perplexity, or to better generalization to unseen data from the same or different domains.

Thus, in this work, the concepts involved may themselves be ordinarily known, but what is being discovered are simply other related concepts that bear a new or previously unknown relation to the original concept in some new context, this relation being made possible by concepts occurring in a novel context, and thus serving to bring us one step closer to perhaps finding what we mean.

1.6 Main Research Contributions

This section provides a brief introduction to the main research contributions, as they relate to the background literature in chapter 2 and the problem statement defined in section 1.5.

1.6.1 Exploring Topic Models Over Complementary Signals to Explain Subscriber Engagement

Customer churn, also known as the attrition rate, is a well-studied problem, especially in industries like telecommunications, retail CRM, the social Web, and to a great extent in online and gaming communities (Archambault et al., 2013; Coussement and Van den Poel, 2008; Resnick et al., 1994; Idiro Analytics, 2017a,b; De Bock and Van den Poel, 2010; Kawale et al., 2009; Tarng et al., 2009; Debeauvais et al., 2011; Karnstedt et al., 2011; Dror et al., 2012; Dave et al., 2013). With the advent of social media and news aggregator websites, newspapers large and small face the challenge of an ever declining print readership, resulting in “customer churn” (Pew Research Centers, 2013). The Columbus Dispatch Printing Company (The Dispatch) is a newspaper with a circulation of 1.2 million subscribers. In this work, our objective was to explore various structured and unstructured data available within the Dispatch, from its print and on-line properties, to gain insight into various factors affecting user engagement. These insights were used to come up with predictive models of customer churn, using features mined from both transactional databases and Web-based textual data, to determine which factors most impact user engagement.

We thus build several models of subscriber churn on this data, providing a comparison of the feature sets of these models with respect to their predictiveness, in particular juxtaposing the web-based topic and sentiment features from WEB-NEWS against WEB-CLICKLOG data, which we find to be complementary sources of signal. We highlight particularly how "topics" transfer across the two sources in the context of churn prediction. All of our models based on temporal, unigram and Web-based topic and sentiment features show statistical significance and improved prediction over baseline models that utilize only transaction metadata, showing that features mined from Web news content and on-line user activity do have influence on newspaper subscriber engagement. The insights drawn on this data by means of topic transfer across sources for prediction inform the subsequent works, where the goal is to improve performance on recommendation tasks by transfer of latent concepts within and across data domains. Also, the coarse granularity and bag-of-words nature of topic modeling approaches, which can limit the discovery process, is addressed in the next work, where distributed models of representation are employed with the aim of large-scale unsupervised discovery (Mikolov et al., 2013b; Le and Mikolov, 2014).

1.6.2 A Phrasal Embedding-based Generalized Language Model for Semantic Tagging for Query Expansion

In this work, we draw on the insights from the modeling experiences of the previous work. We demonstrate that our proposed method of semantic tagging via phrasal embedding GLM (Phrase2VecGLM)–based document ranking, which makes no use of external knowledge sources and is used for document and thus query expansion in both direct and pseudo-relevance feedback settings, can prove an effective means of addressing complex queries such as those in clinical decision support systems (Roberts et al., 2016b). We run experiments with this model in both direct query expansion and pseudo-relevance feedback–based query expansion settings. In all of our experiments our system gives better retrieval performance on query expansion than even human-provided query expansion terms, and surpasses various other baselines making use of resolved concept terms from the UMLS ontology. This work is especially useful for addressing the lack of keyword coverage across all documents in a collection.
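As a rough illustration of the pseudo-relevance feedback loop underlying this setup, the sketch below draws expansion terms from the top-ranked feedback documents and re-issues the expanded query. It is a simplified sketch, not the actual Phrase2VecGLM scoring function; `score_doc` and `candidate_phrases` are hypothetical helpers standing in for the GLM document scorer and the phrase selection step.

```python
def prf_expand_query(query_terms, collection, score_doc, candidate_phrases,
                     fb_docs=10, fb_terms=5):
    """Pseudo-relevance feedback: retrieve, treat the top documents as relevant,
    pick expansion phrases from them, and re-rank with the expanded query.

    `score_doc(query_terms, doc)` is assumed to return a retrieval score
    (in our setting this would be the phrasal-embedding GLM score);
    `candidate_phrases(doc)` is assumed to return phrases of `doc` ranked
    by closeness to the query representation.
    """
    # Initial retrieval pass.
    ranked = sorted(collection, key=lambda d: score_doc(query_terms, d),
                    reverse=True)
    feedback = ranked[:fb_docs]

    # Collect candidate expansion phrases from the feedback documents.
    expansion = []
    for doc in feedback:
        expansion.extend(candidate_phrases(doc)[:fb_terms])

    expanded_query = list(query_terms) + expansion
    # Second retrieval pass with the expanded query.
    return sorted(collection, key=lambda d: score_doc(expanded_query, d),
                  reverse=True)
```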

1.6.3 Sequence-to-Set Semantic Tagging with Neural Attention for Complex Query Reformulation and Automated Text Categorization

Next, I propose an alternate generative approach to semantic tagging by multi-label prediction, drawing on similar approaches to document summarization and machine translation (Sutskever et al., 2014; Bahdanau et al., 2014), using a sequence-to-set–based formulation of Phrase2VecGLM, which we call Seq2Set. This amounts to an inductive transfer learning setting for the transfer of latent concepts within the same domain, where no document labels are available in the source domain for the unsupervised task setting, so that models have to be trained by self-taught learning, bootstrapped with TF-IDF–based pseudolabels derived from the corpus itself. The inductive aspect arises from the fact that this Seq2Set framework is then adapted into two additional settings for multi-label prediction, viz. supervised and semi-supervised, each being evaluated on an alternate text categorization task having a known set of tags for documents. The Seq2Set model employs sequential neural models and an attention mechanism in an encoder-decoder framework to learn a semantic space that takes the sequence-based encoding of a document and maximizes the likelihood of the correct semantic tags being predicted by the decoder. It thus provides an end-to-end trainable neural architecture for semantic tagging of documents. Like Phrase2VecGLM, this model is also evaluated on the same document retrieval task for the unsupervised query expansion setting via semantic tagging of documents on the 2016 TREC CDS dataset, and in two additional settings, supervised and semi-supervised, for an automated text categorization task with a known set of tags on two additional datasets – the del.icio.us folksonomy and Ohsumed abstracts respectively – surpassing the known state-of-the-art baseline for this task on those datasets.
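The self-taught bootstrapping step mentioned above can be illustrated with a minimal sketch that derives TF-IDF–based pseudolabels, i.e. the top-weighted terms of each document, to serve as training targets when no gold tags exist. This uses scikit-learn's TfidfVectorizer and is only an approximation of the pipeline described in Chapter 5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_pseudolabels(documents, tags_per_doc=5):
    """Return, for each document, its top TF-IDF terms as pseudo semantic tags."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)        # (n_docs, n_terms)
    vocab = vectorizer.get_feature_names_out()
    pseudolabels = []
    for row in tfidf:
        weights = row.toarray().ravel()
        top = weights.argsort()[::-1][:tags_per_doc]
        pseudolabels.append([vocab[i] for i in top if weights[i] > 0])
    return pseudolabels

# Example usage: tags = tfidf_pseudolabels(corpus); tags[i] is the target set
# of pseudo-tags used to train the sequence-to-set decoder for document i.
```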

1.6.4 Learning to Address Complex Product-related Queries with Product Reviews by Neural Domain Adaptation

The final research contribution is to learn to answer complex subjective queries about products, using product reviews as an alternate rich source of data, as described in the original work on this problem by McAuley and Yang (2016). This task represents a transductive transfer learning setting, as we develop a solution using neural domain adaptation, with the source domain being represented by answered questions (having labels), and the target domain of rich opinion data being represented by product reviews (having no labels), which is different from the source domain yet similar, via the novel application of the domain adversarial training approach of Ganin et al. (2016) to effect knowledge transfer across domains (McAuley and Yang, 2016; Ganin et al., 2016). Thus we now aim to effectively transfer latent concepts across domains in finding answers to specific, subjective product-related questions. Our experiments show that the proposed neural domain adaptation approach for question-answer/review sentence pair classification via domain adversarial training performs well for learning to answer specific product-related questions with customer reviews, as it is able to efficiently leverage plentiful unlabeled review data during training to better generalize to review data at inference time, outperforming numerous baseline models that cannot easily incorporate such data.
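The core mechanism of the domain adversarial training of Ganin et al. (2016), as applied here, is a gradient reversal layer between the shared encoder and the domain classifier. The PyTorch sketch below shows only that mechanism; the module and parameter names are illustrative and do not reproduce our full architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients on backward,
    so the shared encoder is pushed to make source and target domains indistinguishable."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

class DANNHead(nn.Module):
    """Label predictor for question/review sentence pairs plus an adversarial domain classifier."""
    def __init__(self, feat_dim, n_classes=2, n_domains=2):
        super().__init__()
        self.label_clf = nn.Linear(feat_dim, n_classes)
        self.domain_clf = nn.Linear(feat_dim, n_domains)

    def forward(self, features, lamb=1.0):
        # Label loss is computed only on labeled (source) examples;
        # domain loss is computed on both source and target examples.
        return self.label_clf(features), self.domain_clf(grad_reverse(features, lamb))
```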

As will be further elaborated in the chapters to follow, we aim to demonstrate that by leveraging distributional semantics–based approaches for the representation of candidate concepts, alongside strategies such as maximum likelihood–based language modeling, sequence-to-sequence modeling with attention, and neural domain adaptation, we can enable inference of novel associations between latent concepts by the effective transfer of knowledge within and across domains. In this way, this research aims to augment semantic matching to enable knowledge discovery in novel contexts in downstream recommendation tasks.

The remainder of this dissertation is organized as follows: Chapter 2 provides an extensive survey of related works as they relate to the problem statement in section 1.5, contrasting them with our work or explaining how they served to support the development of the various works in this dissertation relating to the research hypotheses in section 1.4. Chapters 3 through 6 provide detailed accounts of the problem definition, architecture, setup, experiments and discussion of results pertaining to each of the main research contributions highlighted in the immediately preceding sections, while providing insights into how the particular works serve to support the original research hypotheses. Finally, Chapter 7 summarizes the contributions and lays the groundwork for future work.

CHAPTER 2

LITERATURE SURVEY AND SIGNIFICANCE OF THIS RESEARCH

This chapter provides an overview of the related research areas investigated that are crucial in grounding how the chosen tasks address the research hypotheses identified in the problem definition statement in 1.5. The following sections continue with discussions that are germane to the development of this research. These include the following topics: (a) perspectives on similarity, covered from the standpoint of cognition and the "computation of semantic matching"; this section serves to provide context for the development of the terminology (definitions) and research statement formulation in sections 1.4 and 1.5 of this thesis; (b) distributional and vector space semantics via latent semantic analysis, latent relational analysis and latent Dirichlet allocation; these areas serve to highlight the role that our neural representational approaches and methods for latent concept analysis play in the research tasks; (c) knowledge transfer in its various schemes and settings; this section aims to highlight methods regarding the key aspects of this research, i.e. how and in what kinds of task settings we may hope to learn and transfer latent concepts, including an overview of natural language processing, owing to the nature of our tasks, and of transfer learning within NLP, such as large-scale pre-trained word embedding models; and finally, (d) various related works on concept and relation extraction, such as knowledge base completion, early methods for concept discovery from text, distant supervision, multi-way classification of semantic relations in past SemEval shared tasks, paraphrase detection and extraction, and natural language understanding tasks like semantic inference via learning selectional preferences and named entity recognition. Appendix C highlights further areas related to this research in terms of knowledge representation, and relevant projects such as never-ending language learning. All of these areas represent tasks, methods or paradigms that are instrumental in the development of the research contributions presented, and are intended to highlight how they still do not fully address some of the challenges in semantic matching, such as finding hidden relations between latent concepts, as identified earlier in this dissertation. Thus, the goal in providing this extensive background is to highlight fundamental concepts that have been applied, and to position the research contributions presented in later chapters in the light of these related areas. For background regarding aspects of probability and information theory, neural networks, various neural models and machine learning, which were applied to and form the basic building blocks of the methods described in this dissertation, the reader is referred to Appendices A and B respectively.

2.1 Perspectives on Similarity

2.1.1 Respects for Similarity

In their seminal 1993 article, Medin et al. submit that similarity is one of the most central theoretical constructs in psychology. It pervades theories of cognition.

Transfer of learning is said to hinge crucially on the similarity of the transfer situation to the original training context (Osgood, 1949; Thorndike, 1931; Medin et al., 1993).

An important Gestalt principle of perceptual organization is that similar things will tend to be grouped together. Gestalt psychologists argued that these principles exist because the mind has an innate disposition to perceive patterns in the stimulus based on certain rules (Spelke, 1990). These principles are organized into five categories: Proximity, Similarity, Continuity, Closure, and Connectedness (Spelke, 1990). The likelihood of successfully remembering depends on the similarity of the original encoding to those operations during retrieval (Roediger, 1990; Medin et al., 1993).

Citing the body of research associated with criticisms of similarity, Murphy and Medin (1985) note that "the relative weighting of a feature (as well as the relative importance of common and distinctive features) varies with the stimulus context and task, so that there is no unique answer to the question of how similar is one object to another" (p. 296). They argued that similarity is too flexible to define categories and that it is more like a dependent than an independent variable: "The explanatory work is on the level of determining which attributes will be selected, with similarity being at least as much a consequence as a cause of conceptual coherence" (Murphy and Medin, 1985, p. 296). Philosopher Nelson Goodman (1972) argued that similarity, like motion, requires a frame of reference (Medin et al., 1993), so that just as one has to say what something is moving in relation to, one also must specify in what respects two things are similar. For example, if Mary says that John is similar to Bill, one may have no idea what she means until she adds the observation that they both are avid chess players. Thus, similarity seems to disappear when it is analyzed closely, because the meaning is conveyed by the specific respects, not by the general notion of similarity. On a broad level, one may distinguish at least three distinct types of similarity: similarity as measured indirectly, direct judgments of similarity, and similarity as a theoretical construct.

As an example of addressing similarity indirectly, people may be asked to identify individual confusable stimuli, and the pattern of confusion errors may reveal underlying similarities (Medin et al., 1993).

Murphy and Medin suggest that similarity is too unconstrained to ground other cognitive processes such as categorization, by the following argument: similarity is assumed to be based on matching and mismatching properties or predicates. Two things are similar to the extent that they share predicates and dissimilar to the extent that predicates apply to one entity but not the other. However, any two things share an arbitrary number of predicates and differ from each other in an arbitrary number of ways (Goodman, 1972; Watanabe, 1969). The only way to make similarity non-arbitrary is to constrain the predicates that apply or enter into the computation of similarity. It is these constraints, and not some abstract principle of similarity, that should enter one's accounts of induction, categorization, and problem solving. To gloss over the need to identify these constraints by appealing to similarity is to ignore the central issue. Nelson Goodman's critique suggests that just how serious the respects problem is may depend on both the goals of researchers and the domain in question.

We consider specifically the implications of people's judgments of the typicality or goodness-of-example of instances of a concept, which have been shown to vary with the context provided (e.g., Roth and Shoben, 1983). Barsalou (1982) has demonstrated that similarity judgments also vary when particular contexts are specified. For example, a snake and a raccoon were judged much less similar when no explicit context was given than when the context of pets was provided. The general idea is that the context tends to activate or make salient context-related properties, and, to the extent that examples being judged share values of these activated properties, their similarity is increased.

Respects for Similarity finds expression in, and forms an underlying basis for, each of the works outlined in this dissertation that support the research hypotheses while addressing the main idea of transfer of latent concepts and recovering hidden associations between them.

2.1.2 From Frequency to Meaning

Turney and Pantel (2010) provide the following perspectives on similarity from a vector space representation standpoint. Pair–pattern matrices are suited to measuring the similarity of semantic relations between pairs of words; that is, relational similarity. In contrast, word–context matrices are suited to measuring attributional similarity. The distinction between attributional and relational similarity, explored in depth by Gentner (1983), suggests that the attributional similarity between two words a and b, sim_a(a, b) ∈ R, depends on the degree of correspondence between the properties of a and b, while the relational similarity between two pairs of words a : b and c : d, sim_r(a : b, c : d) ∈ R, depends on the degree of correspondence between the relations of a : b and c : d. The term semantic relatedness in computational linguistics (Budanitsky and Hirst, 2001) corresponds to attributional similarity in cognitive science (Gentner, 1983). Examples are synonyms (bank and trust company), meronyms (car and wheel), antonyms (hot and cold), and words that are functionally related or frequently associated (pencil and paper) (Turney and Pantel, 2010).

Although we might not usually regard antonyms as similar, antonyms have a high degree of attributional similarity (e.g. hot and cold, black and white). We prefer the term attributional similarity to the term semantic relatedness, because attributional similarity emphasizes the contrast with relational similarity, whereas semantic relatedness could be confused with relational similarity. In computational linguistics, the term semantic similarity is applied to words that share a hypernym (car and bicycle are semantically similar, because they share the hypernym vehicle) (Resnik, 1995).

Thus semantic similarity is a specific type of attributional similarity, and the term taxonomical similarity may be preferred to the term semantic similarity, as the latter may be misleading. Intuitively, however, both attributional and relational similarity involve meaning, so both deserve to be called semantic similarity. Words are semantically associated if they tend to co-occur frequently (e.g., bee and honey). Also, words may be taxonomically similar and semantically associated

(doctor and nurse), taxonomically similar but not semantically associated (horse and platypus), semantically associated but not taxonomically similar (cradle and baby), or neither semantically associated nor taxonomically similar (calculus and candy) (Turney and Pantel, 2010).

Schütze and Pedersen (1993) defined two ways that words can be distributed in a corpus of text: if two words tend to be neighbours of each other, then they are syntagmatic associates; if two words have similar neighbours, then they are paradigmatic parallels. Syntagmatic associates are often different parts of speech, whereas paradigmatic parallels are usually the same part of speech. Syntagmatic associates tend to be semantically associated (bee and honey are often neighbours); paradigmatic parallels tend to be taxonomically similar (doctor and nurse have similar neighbours) (Turney and Pantel, 2010). This section highlights the various ideas about the possible ways in which relationships can exist between concepts, which are integral to the thought process behind the development of each of the works detailed in this dissertation.
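These two notions of similarity can be made concrete with word embeddings. The sketch below treats attributional similarity as the cosine of two word vectors and approximates relational similarity by comparing pair-difference vectors; the offset-based proxy is only an illustration, since Turney and Pantel operationalize relational similarity with pair–pattern matrices. The `embed` lookup is an assumed placeholder for any trained embedding model.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def attributional_similarity(embed, a, b):
    # sim_a(a, b): correspondence between the properties of a and b.
    return cosine(embed(a), embed(b))

def relational_similarity(embed, a, b, c, d):
    # sim_r(a:b, c:d): approximated here by comparing the offset vectors a-b and c-d.
    return cosine(embed(a) - embed(b), embed(c) - embed(d))

# e.g. relational_similarity(embed, "mason", "stone", "carpenter", "wood")
```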

2.1.3 Distributional and Vector Space Models of Semantics

Vector Space Models (VSMs) have several attractive properties. VSMs extract knowledge automatically from a given corpus, and thus require much less labour than other approaches to semantics, such as hand-coded knowledge bases and ontologies (Turney and Pantel, 2010). VSMs perform well on tasks that involve measuring the similarity of meaning between words, phrases, and documents. Most search engines use VSMs to measure the similarity between a query and a document (Manning et al., 2008). The leading algorithms for measuring semantic relatedness use VSMs (Pantel and Lin, 2002a; Rapp, 2003; Turney, Littman, Bigham, and Shnayder, 2003). The leading algorithms for measuring the similarity of semantic relations also use VSMs (Lin and Pantel, 2001; Turney, 2006; Nakov and Hearst, 2008). Pair–pattern matrices are suited to measuring the similarity of semantic relations between pairs of words; that is, relational similarity. In contrast, word–context matrices are suited to measuring attributional similarity. The distinction between attributional and relational similarity has been explored in depth by Gentner (1983) (Turney and Pantel, 2010).

The theme that unites the various forms of VSMs can be stated as the statistical semantics hypothesis: statistical patterns of human word usage can be used to figure out what people mean (Furnas et al., 1983). This general hypothesis underlies several more specific hypotheses, such as the bag of words hypothesis, the distributional hypothesis, the extended distributional hypothesis, and the latent relation hypothesis (Turney and Pantel, 2010).

Distributional and vector space semantics forms the basis for how word and phrase representations are learned, and also for how these representations may be updated in the process of model learning, in Chapters 4, 5 and 6 of this dissertation.

2.1.4 Hypotheses for Word- and Document-level semantics

Turney and Pantel (2010) outline five hypotheses, which they interpret in terms of vector spaces for deriving various semantic and attributional similarities, citing origins for each.

Statistical semantics hypothesis: (Statistical patterns of human word usage can be used to figure out what people mean (Furnas et al., 1983).)

– If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings. They assume this to be a general hypothesis that subsumes the four more specific hypotheses that follow (Turney and Pantel, 2010).

Bag-of-words hypothesis: (The frequencies of words in a document tend to indicate the relevance of the document to a query (Salton et al., 1975).)

– If documents and pseudo-documents (queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings (Turney and Pantel, 2010).

Distributional hypothesis: (Words that occur in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957; Deerwester et al., 1990).)

– If words have similar row vectors in a word–context matrix, then they tend to have similar meanings (Turney and Pantel, 2010).

Extended distributional hypothesis: (Patterns that co-occur with similar pairs tend to have similar meanings (Lin and Pantel, 2001b).)

– If patterns have similar column vectors in a pair–pattern matrix, then they tend to express similar semantic relations (Turney and Pantel, 2010). This hypothesis is closely related to what this research is aimed at, in terms of transfer of latent concepts.

Latent relational hypothesis: (Pairs of words that co-occur in similar patterns tend to have similar semantic relations (Turney et al., 2003; Turney and Littman, 2003).)

– If word pairs have similar row vectors in a pair–pattern matrix, then they tend to have similar semantic relations (Turney and Pantel, 2010). This hypothesis is also closely related to our research goal of aiming to capture hidden relations between latent concepts.
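The distributional hypothesis above can be illustrated directly with a small word–context co-occurrence matrix, where words with similar row vectors tend to have similar meanings. This minimal sketch assumes plain tokenized sentences and omits the weighting (e.g., PPMI) and dimensionality reduction that practical VSMs typically apply.

```python
from collections import defaultdict
import numpy as np

def word_context_matrix(sentences, window=2):
    """Rows are words, columns are context words, cells are co-occurrence counts."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    M[index[w], index[s[j]]] += 1
    return M, index

def row_similarity(M, index, w1, w2):
    """Cosine similarity of two row vectors, i.e. distributional similarity of two words."""
    u, v = M[index[w1]], M[index[w2]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```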

2.1.5 Latent Semantic Analysis

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. Latent semantic indexing (LSI) is an indexing and retrieval method that uses the mathematical technique of singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts (Deerwester, 1988; Deerwester et al., 1990).

Deerwester et al. found an elegant way to improve similarity measurements with a mathematical operation on the term–document matrix, X, based on linear algebra.

The operation is truncated Singular Value Decomposition (SVD), also called thin SVD.

Deerwester et al. briefly mentioned that truncated SVD can be applied to both document similarity and word similarity, but their focus was document similarity. Landauer and Dumais (1997) applied truncated SVD to word similarity, achieving human-level scores on multiple-choice questions from the Test of English as a Foreign Language (TOEFL). Truncated SVD applied to document similarity is called Latent Semantic Indexing (LSI), but it is called Latent Semantic Analysis (LSA) when applied to word similarity (Deerwester et al., 1990). As further elaborated in section 4.2, in the context of our problem definition, LSA, LSI, and a related approach, LDA-based topic modeling (Blei et al., 2003; Griffiths and Steyvers, 2004), present certain drawbacks, viz. they only consider word co-occurrences at the level of full documents to model term associations (which may not always be reliable), instead of smaller neighborhoods within documents. Furthermore, these are parameterized approaches, where the number of concepts, or topics K, is fixed, and the final topics learnt are available as bags of words or n-grams from which topic labels must yet be inferred by an expert, motivating the use of more fine-grained local approaches.
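The truncated SVD at the heart of LSI/LSA can be sketched with scikit-learn as follows. This is a minimal illustration only; the number of latent dimensions k is arbitrary here and must be smaller than the vocabulary size of the fitted collection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsi_document_vectors(documents, k=100):
    """Project TF-IDF document vectors into a k-dimensional latent 'concept' space."""
    X = TfidfVectorizer().fit_transform(documents)   # documents x terms
    svd = TruncatedSVD(n_components=k)               # X is approximated by U_k * Sigma_k * V_k^T
    doc_vecs = svd.fit_transform(X)                  # rows: documents in the latent space
    return doc_vecs, svd
```

Cosine similarity between rows of `doc_vecs` then gives LSI-style document similarity; applying the same decomposition to a word–context matrix instead yields LSA-style word similarity.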

2.1.6 Latent Relational Analysis

Turney, in 2005, introduced the notion of Latent Relational Analysis (LRA) as a method for measuring relational similarity that has potential applications in many areas, including word sense disambiguation, machine translation, and information retrieval (Turney, 2005). Consider word sense disambiguation, for example. In isolation, the word "plant" could refer to an industrial plant or a living organism. Suppose the word "plant" appears in some text near the word "food".

A typical approach to disambiguating "plant" would compare the attributional similarity of "food" and "industrial plant" to the attributional similarity of "food" and "living organism" (Banerjee and Pedersen, 2002). In this case, the decision may not be clear, since industrial plants often produce food and living organisms often serve as food. Relational similarity is correspondence between relations, in contrast with attributional similarity (Medin et al., 1993). When two words have a high degree of attributional similarity, we say they are synonymous. When two pairs of words have a high degree of relational similarity, we say they are analogous. For example, the word pair mason : stone is analogous to the pair carpenter : wood; the relation between mason and stone is highly similar to the relation between carpenter and wood. Past work on semantic similarity measures, LSA for instance, had mainly been concerned with attributional similarity, and not with similarity between two relations (Furnas et al., 1983). In contrast, the VSM approach of Turney and Littman (2005), in which the Vector Space Model (VSM) of information retrieval was adapted to the task of measuring relational similarity, achieved a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In this approach the VSM represents the relation between a pair of words by a vector of frequencies of predefined patterns in a large corpus. LRA served to extend the VSM approach in three ways: (1) the patterns were derived automatically from the corpus (not predefined), (2) Singular Value Decomposition (SVD) was used to smooth the frequency data (as in LSA), and (3) automatically generated synonyms were used to explore reformulations of the word pairs, with a thesaurus being used to explore these reformulations. LRA achieved 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying noun-modifier relations, LRA achieved similar gains over the VSM. LRA thus performed better than the VSM approach when evaluated on both problems: SAT word analogy questions and the task of classifying noun-modifier expressions, where the VSM represented the relation between a pair of words with a vector in which the elements are based on the frequencies of 64 hand-built patterns in a large corpus (Turney, 2005). However, the LRA approach is still bound to the use of a search engine and a thesaurus for alternate lexical form extraction and semantic enrichment respectively, and is generally not scalable to the kind of implicit novel relation extraction desirable for addressing novel contexts.

2.1.7 Latent Dirichlet Allocation (LDA)

Blei et al. (2003) argue that while bag-of-words models like TF-IDF have appealing features, e.g. a basic identification of sets of words that are discriminative for documents in the collection, the approach provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. This led to improved compression-based methods such as LSI (Deerwester et al., 1990), which captured maximum variance in TF-IDF featurized document data by singular value decomposition of a term-document matrix. However, this still left open the development of a generative probabilistic model of text corpora with which to study the ability of LSI to recover aspects of the generative model from data. This led to the development of pLSI (Hofmann, 1999), which models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics". Thus each word is generated from a single topic, and different words in a document may be generated from different topics.

According to Blei et al. (2003), while Hofmann's work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers, which leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set. Moreover, in these "bag-of-words" models, where the order in which words appear in a document does not matter, there is an assumption of exchangeability for words and, although less often stated, also for documents, where the specific ordering of documents in a corpus can be neglected. Thus the documents are conditionally independent and identically distributed.

Latent Dirichlet allocation (LDA) was therefore developed by Blei et al. (2003) in order to capture these notions of exchangeability and address the shortcomings of pLSI, where the goal is to find a probabilistic model of a corpus that not only assigns high probability to members of the corpus, but also assigns high probability to other "similar" documents. Thus, LDA is a "generative" probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over "latent topics", where each topic is characterized by a distribution over words. The main difference with pLSI is that the Dirichlet priors on the document-topic and word-topic distributions lend this model to better generalization over unseen documents, giving lowered perplexity at inference time. LDA can also be viewed as a dimensionality reduction technique in the spirit of LSI, but with proper underlying generative probabilistic semantics that make sense for the type of data that it models.

Since exact inference is intractable for LDA, the parameters of LDA are usually estimated using Gibbs sampling (Porteous et al., 2008). This technique forms the basis for the topic modeling methods used in Chapter 3.
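For concreteness, a minimal sketch of fitting an LDA topic model over a document collection is shown below, using scikit-learn's variational implementation rather than the Gibbs sampler cited above; note that the number of topics K must be fixed in advance, which is one of the drawbacks discussed earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda(documents, n_topics=50, top_words=10):
    """Fit LDA and return document-topic mixtures plus top words per topic."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)            # each row: a document's topic mixture
    vocab = vectorizer.get_feature_names_out()
    topics = [[vocab[i] for i in comp.argsort()[::-1][:top_words]]
              for comp in lda.components_]           # topics as bags of top words
    return doc_topic, topics
```

The returned document-topic mixtures are the kind of "global" topic features that can be inferred on a parallel corpus and fed into downstream prediction models, as done in Chapter 3.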

2.2 Related Works in Concept and Relation Extraction

2.2.1 Early Work in Concept Discovery from Text

One of the earliest works on concept discovery from text was by Lin and Pantel (2001), who employ a clustering approach to mine for appropriate concepts. The main challenge cited was that broad-coverage lexical sources such as WordNet often included many rare senses of words, while missing domain-specific senses. They present a clustering algorithm called CBC (Clustering-By-Committee) that automatically discovers concepts from text by initially finding a set of tight clusters called committees that are well scattered in the similarity space. Most clustering algorithms like K-means or K-medoids represent a cluster by the centroid of all its members, but when averaging over all elements in a cluster, the centroid may be unduly influenced by elements that only marginally belong to the cluster or by elements that also belong to other clusters. In the CBC algorithm, the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members, viewed as a committee that determines which other elements belong to the cluster (Lin and Pantel, 2001a). They then proceed by assigning elements to their most similar cluster. By carefully choosing committee members, the features of the centroid tend to be the more typical features of the target class, or concept, such as "state capital". CBC can handle a large number of elements, a large number of output clusters, and a large sparse feature space, as it discovers clusters using well-scattered tight clusters or committees. In their experiments, they showed that CBC outperforms several well-known hierarchical, partitional, and hybrid clustering algorithms in cluster quality (Lin and Pantel, 2001a). This section serves to provide an overview of how concept discovery was thought of, defined and implemented in past works.
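The key difference from a plain centroid can be shown in a small sketch: a committee-style centroid averages only a tight, mutually similar subset of cluster members rather than all of them. This greatly simplifies CBC, whose committee discovery is itself a multi-phase clustering procedure, and is given here only to illustrate the intuition.

```python
import numpy as np

def committee_centroid(member_vectors, committee_size=5):
    """Average only the most mutually similar members (the 'committee'),
    so marginal or ambiguous members do not drag the centroid.
    Assumes at least two member vectors."""
    X = np.asarray(member_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T                                   # pairwise cosine similarities
    avg_sim = (sims.sum(axis=1) - 1.0) / (len(X) - 1)
    committee = np.argsort(avg_sim)[::-1][:committee_size]
    return X[committee].mean(axis=0)
```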

2.2.2 Multi-way classification of semantic relations between pairs of nominals

Hendrickx et al. (2009) introduce a new task, multi-way classification of mutually exclusive semantic relations between pairs of common nominals, which was included as part of the SemEval-2010 shared task, as a way of addressing the main challenges in the extraction of semantic relations from English text and the shortcomings of previous data sets and shared tasks. Semantic relations between pairs of words are an interesting case of semantic knowledge that lends itself to robust , both as an end in itself and as an intermediate step in a variety of NLP applications. It can guide the recovery of useful facts about the world, the interpretation of a sentence, or even discourse processing. The automatic recognition of semantic relations can have many applications, such as information extraction (IE), document summarization, machine translation, construction of thesauri and semantic networks, and also information retrieval and question answering. It can also facilitate auxiliary tasks such as word sense disambiguation, language modeling, paraphrasing or recognizing textual entailment (Hendrickx et al., 2009).

A fundamental question in relation classification is: "whether the relations between nominals should be considered out of context or in context". When one looks at real data, it becomes clear that context does indeed play a role. Consider, for example, the noun wood shed: it may refer either to a shed made of wood, or to a shed of any material used to store wood. This ambiguity is likely to be resolved in particular contexts. In fact, most NLP applications will want to determine not all possible relations between two words, but rather the relation between two instances in a particular context. While the integration of context is common in the field of IE (cf. work in the context of ACE), much of the existing literature on relation extraction considers word pairs out of context (thus, types rather than tokens). They highlight a major limitation of the SemEval-2007 Task 4: the data set avoided the challenge of defining a single unified standard classification scheme by creating seven separate training and test sets, one for each semantic relation, which made the relation recognition task on each data set a simple binary (positive/negative) classification task that clearly does not easily transfer to practical NLP settings, where any relation can hold between a pair of nominals which occur in a sentence or a discourse (Hendrickx et al., 2009).

Thus they design and define an inventory of semantic relations between nominals, which should ideally be exhaustive (it should allow the description of relations between any pair of nominals) and mutually exclusive (each pair of nominals in context should map onto only one relation). The nine relations defined for this task are: Cause-Effect, Instrument-Agency, Product-Producer, Content-Container, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection, and Communication-Topic. They add a tenth element to this set, the pseudo-relation OTHER, which stands for any relation that is not one of the nine explicitly annotated relations. This is motivated by modelling considerations: presumably, the data for OTHER will be very non-homogeneous, and including it forces any model of the complete data set to correctly identify the decision boundaries between the individual relations and "everything else". This encourages good generalization behaviour to larger, noisier data sets commonly seen in real-world applications (Hendrickx et al., 2009).

2.2.3 Distant Supervision and Missing Data Modeling

Modern models of relation extraction for tasks like Automatic Content Extraction (ACE) are based on supervised learning of relations from small hand-labeled corpora. The NIST ACE RDC 2003 and 2004 corpora, for example, include over 1,000 documents in which pairs of entities have been labeled with 5 to 7 major relation types and 23 to 24 sub-relations, totaling 16,771 relation instances (Mintz et al., 2009). ACE systems then extract a wide variety of lexical, syntactic, and semantic features, and use supervised classifiers to label the relation mention holding between a given pair of entities in a test set sentence, optionally combining relation mentions (Mintz et al., 2009). Supervised relation extraction suffers from a number of problems, however. Labeled training data is expensive to produce and thus limited in quantity. Also, because the relations are labeled on a particular corpus, the resulting classifiers tend to be biased toward that text domain (Mintz et al., 2009). An alternative approach, purely unsupervised information extraction, extracts strings of words between entities in large amounts of text, and clusters and simplifies these word strings to produce relation-strings. Unsupervised approaches can use very large amounts of data and extract very large numbers of relations, but the resulting relations may not be easy to map to relations needed for a particular knowledge base. A third approach has been to use a very small number of seed instances or patterns to do bootstrap learning. These seeds are used with a large corpus to extract a new set of patterns, which are used to extract more instances, which are used to extract more patterns, in an iterative fashion. The resulting patterns often suffer from low precision and semantic drift.

Mintz et al. (2009) propose an alternative paradigm, distant supervision, that combines some of the advantages of each of these approaches. This approach does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms and allowing the use of corpora of any size. Their experiments used Freebase (Bollacker et al., 2008), a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, they find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier that is able to extract 10,000 instances of 102 relations at high precision. They also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression (Mintz et al., 2009).
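The labeling heuristic at the core of distant supervision can be sketched in a few lines. This is a simplification of the Mintz et al. pipeline: entity matching here is naive string containment and no lexical or syntactic features are extracted.

```python
def distant_supervision_examples(sentences, kb_pairs):
    """kb_pairs: dict mapping (head, tail) -> relation, e.g. drawn from Freebase.
    Returns (sentence, head, tail, relation) training examples via the
    distant-supervision assumption that any sentence mentioning both entities
    of a known pair expresses that pair's relation."""
    examples = []
    for sent in sentences:
        for (head, tail), relation in kb_pairs.items():
            if head in sent and tail in sent:
                examples.append((sent, head, tail, relation))
    return examples

# kb = {("Barack Obama", "Hawaii"): "born_in"}
# distant_supervision_examples(["Barack Obama was born in Hawaii ."], kb)
```

The missing-data modeling of Ritter et al., discussed next, relaxes exactly this hard assumption, since many such sentence-level labels are noisy.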

Ritter et al. (2013) further expand on this work by modeling missing data in distant supervision for information extraction. Distant supervision algorithms learn information extraction models given only large, readily available databases and text collections, where heuristics are used for generating labeled data, for example assuming that facts not contained in the database are not mentioned in the text, and that facts in the database must be mentioned at least once. Ritter et al. propose a new latent-variable approach that models missing data by introducing a joint model of information extraction and missing data, which relaxes the hard constraints used in previous work to generate heuristic labels, and provides a natural way to incorporate side information through a missing data model, for instance modeling the intuition that text will often mention rare entities which are likely to be missing in the database (Ritter et al., 2013).

This section is relevant for comparison with our works from the standpoint of directly modeling missing data, which may help in reasoning about inferring relations that can then be extracted and used for purposes of knowledge base completion in traditional information extraction settings.

2.2.4 Knowledge Base Completion

The goal of the Knowledge Base Completion (KBC) task is to fill in the missing piece of information in an incomplete triple in a knowledge base. For instance, given a query <Donald Trump, president_of, ?>, one should predict that the target entity is USA. More formally, given a set of entities ε and a set of binary relations R over these entities, a knowledge base (sometimes also referred to as a knowledge graph) can be specified by a set of triples <h, r, t>, where h, t ∈ ε are head and tail entities respectively and r ∈ R is a relation between them. In entity KBC the task is to predict either the tail entity given a query <h, r, ?>, or the head entity given <?, r, t>. Not only can this task be useful for testing the generic ability of a system to reason over a knowledge base, but it can also find use in expanding existing incomplete knowledge bases by deducing new entries from existing ones. Projects such as Wikidata (https://www.wikidata.org/) or, earlier, Freebase (Bollacker et al., 2008) have successfully accumulated a formidable amount of knowledge in the form of <entity1, relation, entity2> triples. Given this vast body of knowledge, it would be extremely useful to teach machines to reason over such knowledge bases, and one possible way to test such reasoning is knowledge base completion (KBC) (Kadlec et al., 2017). Here too, although the goal is the discovery of related concepts, the focus is mainly on fact-based reasoning for relation completion towards knowledge base population, rather than finding related concepts in some new context.
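To make the scoring side of KBC concrete, the sketch below ranks candidate tail entities for a query <h, r, ?> with a TransE-style translation score. It is only illustrative of the embedding-based family of methods evaluated by Kadlec et al. (2017), not any specific system; `entity_emb` and `relation_emb` are assumed lookup tables of learned vectors.

```python
import numpy as np

def transe_rank_tails(h, r, entity_emb, relation_emb, candidates):
    """Score each candidate tail t by -||e_h + e_r - e_t|| and return a ranking.

    entity_emb / relation_emb are assumed dicts from names to vectors,
    e.g. learned from <h, r, t> triples in Wikidata or Freebase."""
    q = entity_emb[h] + relation_emb[r]
    scored = [(t, -float(np.linalg.norm(q - entity_emb[t]))) for t in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```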

2.2.5 Semantic Inference – Learning Selectional Preferences

Semantic inference is a key component for advanced natural language understanding. However, existing collections of automatically acquired inference rules have shown disappointing results when used in applications such as textual entailment and question answering (Bhagat et al., 2007). Several important applications, including question answering (Moldovan et al., 2003; Harabagiu and Hickl, 2006), information extraction (Giuliano et al., 2006), and textual entailment (Szpektor et al., 2004), already rely heavily on inference. Thus several researchers have created resources for enabling semantic inference, e.g. WordNet (Fellbaum, 1998) and Cyc (Lenat, 1995). Although important and useful, these resources primarily contain prescriptive inference rules such as "X divorces Y" => "X married Y". In practical NLP applications, however, plausible inference rules such as "X married Y" => "X dated Y" are very useful. This, along with the difficulty and labor-intensiveness of generating exhaustive lists of rules, has led researchers to focus on automatic methods for building inference resources such as inference rule collections and paraphrase collections (Barzilay and McKeown, 2001; Bhagat et al., 2007; Lin and Pantel, 2001a; Szpektor et al., 2004).

Using these resources in applications has been hindered by the large number of incorrect inferences they generate, either because of altogether incorrect rules or because of blind application of plausible rules without considering the context of the relations or the senses of the words. What is missing is knowledge about the admissible argument values for which an inference rule holds, called Inferential Selectional Preferences. Selectional preference models aim to learn inferential selectional preferences for filtering out incorrect inference rules. However, learning selectional preferences for inference rules still limits the scope of this enrichment to the use of knowledge bases of such inference rules, and does not offer a fully unsupervised automated solution. Despite robust systems for learning selectional preferences, many issues must still be addressed; for example, the models need to be improved in order to address the problem of incorrect inference rules, as well as the issue of antonymy relations that are not addressed by the distributional hypothesis (Bhagat et al., 2007). Thus while such inference rule bases are valuable for reasoning with large texts, they may not always generalize to new relations that become plausible in unseen contexts.

Selectional preferences, in a nutshell, are basically common sense knowledge of "what will do what to what" (Neubig, 2017). For example, "I ate salad with a fork" is perfectly sensible with "a fork" being a tool, and "I ate salad with my friend" also makes sense, with "my friend" being a companion. On the other hand, "I ate salad with a backpack" doesn't make much sense because a backpack is neither a tool for eating nor a companion. Such selectional preference violations can lead to nonsensical sentences and can also span an arbitrary length, due to the fact that subjects, verbs, and objects can be separated by a great distance. The selectional preferences of a predicate are the set of semantic classes that its arguments can belong to (Wilks, 1973). Resnik gave an information-theoretic formulation of the idea (Resnik, 1996).

Pantel et al. (2007) extended this idea to non-verbal relations by defining the relational selectional preferences (RSPs) of a binary relation p as the set of semantic classes C(x) and C(y) of words that can occur in positions x and y respectively. Here, the sets of semantic classes C(x) and C(y) can be obtained either from a manually created taxonomy like WordNet, as proposed in the previous approaches above, or by using automatically generated classes from the output of a word clustering algorithm. For example, given a relation like "X likes Y", its RSPs from WordNet could be {individual, social group, ...} for X and {individual, food, activity, ...} for Y (Bhagat et al., 2007). Our work towards concept discovery, although similar in spirit in being concerned with finding plausible arguments to predicates, pre-eliminates implausible matches to a great extent, solely due to the pre-condition of the novel context that we impose, which acts as a proxy for the relation or predicate potentially relating two concepts.
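Resnik's information-theoretic formulation referenced above can be summarized as follows (standard notation, reproduced here only for intuition): the selectional preference strength S(p) of a predicate p for an argument position, and the selectional association A(p, c) of p with a semantic class c, are

\[
S(p) \;=\; \sum_{c} P(c \mid p)\,\log\frac{P(c \mid p)}{P(c)}, \qquad
A(p, c) \;=\; \frac{1}{S(p)}\,P(c \mid p)\,\log\frac{P(c \mid p)}{P(c)}.
\]

Classes c with high A(p, c) are the preferred argument fillers of p; the RSP formulation of Pantel et al. generalizes this from verbal predicates to arbitrary binary relations.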

2.2.6 Paraphrase Detection, Identification and Extraction

Paraphrases are textual expressions that convey the same meaning using different surface words (Bhagat et al., 2007). For example, consider the following sentences:

Google acquired YouTube. (1)

Google completed the acquisition of YouTube. (2)

Since they convey the same meaning, sentences (1) and (2) are sentence-level paraphrases, and the phrases "acquired" and "completed the acquisition of" in (1) and (2) respectively are phrasal paraphrases (Bhagat and Ravichandran, 2008). Paraphrases provide a way to capture the variability of language and hence play an important role in many natural language processing (NLP) applications. For example, in question answering, paraphrases have been used to find multiple patterns that pinpoint the same answer (Ravichandran and Hovy, 2002); in statistical machine translation, they have been used to find translations for unseen source language phrases (Callison-Burch et al., 2006); in multi-document summarization, they have been used to identify phrases from different sentences that express the same information (Barzilay et al., 1999); and in information retrieval they have been used for query expansion (Anick and Tipirneni, 1999; Bhagat and Ravichandran, 2008). Learning paraphrases requires one to ensure identity of meaning. Since there are no adequate semantic interpretation systems available today, paraphrase acquisition techniques use some other mechanism as a kind of "pivot" to (help) ensure semantic identity. Each pivot mechanism selects phrases with similar meaning in a different characteristic way. A popular method, the so-called distributional similarity, is based on the dictum of Zelig Harris, "you shall know the words by the company they keep": given highly discriminating left and right contexts, only words with very similar meaning will be found to fit in between them.

For paraphrasing, this has often been used to find syntactic transformations in parse trees that preserve (semantic) meaning (Bhagat and Ravichandran, 2008). In contrast with our work, paraphrase identification is very similar in spirit to concept discovery as we define it, except that in this case the underlying relation is assumed to be along the lines of "sameAs", whereas for concept discovery the scope is much broader and potentially covers "any" relation made possible primarily by a novel context. Through their work on large-scale acquisition of paraphrases from surface patterns, Bhagat and Ravichandran (2008) demonstrate that highly precise surface paraphrases can be obtained from a very large monolingual corpus, made scalable by recent advances in theoretical computer science. They also show that these paraphrases can be used to obtain high-precision extraction patterns for information extraction. However, they still believed that more work needed to be done to improve system recall towards developing a minimally supervised, easy to implement, and scalable relation extraction system, indicating the need for implicit relation extraction with minimal use of external knowledge (Bhagat and Ravichandran, 2008). Complementary to previous methods, Xu et al. (2014) are able to capture lexically divergent paraphrases using a principled latent variable modeling approach that jointly models paraphrase relations between word and sentence pairs, assuming only sentence-level paraphrase annotations during learning, by incorporating a multi-instance learning assumption: that two sentences under the same topic are paraphrases if they contain at least one word pair, which they call an anchor pair, that is indicative of a sentential paraphrase. This model design also allows exploitation of arbitrary features and linguistic resources, such as part-of-speech features and a normalization lexicon, to discriminatively determine whether word pairs are paraphrastic anchors or not. Their model differs from previous approaches in that its primary design goal and motivation is targeted towards short, lexically diverse text on the social web (Xu et al., 2014).

Whether considering sentential or phrasal paraphrases as described in this section, the same argument as before holds: the primary focus of this dissertation is on capturing relations and analogies that go beyond "sameAs".

2.2.7 Named Entity Recognition for Relation Extraction

Named entity recognition (NER) is a challenging learning problem, often being the first step in a pipeline towards entity-relation extraction on large corpora. Although this is not a task that we explicitly target in this dissertation, it is closely related to our problem definition in the context of learning from unannotated corpora, and hence deserves special mention. On the one hand, in most languages and domains, there is only a very small amount of supervised training data available. On the other, there are few constraints on the kinds of words that can be names, so generalizing from this small sample of data is difficult. Thus state-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. As a result, carefully constructed orthographic features and language-specific knowledge resources, such as gazetteers, are widely used for solving this task; these are costly to develop, making NER a challenge to adapt. Thus, unsupervised learning from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. Recently, Lample et al. (2016) proposed neural architectures for NER that use no language-specific resources or features beyond a small amount of supervised training data and unlabeled corpora. Their models capture two main intuitions: first, since names often consist of multiple tokens, reasoning jointly over tagging decisions for each token is important (this is similar to the intuition behind the phrasal embedding–based GLM described in Chapter 4); second, token-level evidence for "being a name" includes both orthographic evidence (what does the word being tagged as a name look like) and distributional evidence (where does the word being tagged tend to occur in a corpus). Thus, to capture orthographic sensitivity, they use character-based word representations, and to capture distributional sensitivity, they combine these with distributional representations (Mikolov et al., 2013b).

Their BiLSTM-CRF based model for this task, which employs character and word embeddings, obtains state-of-the-art NER performance in Dutch, German, and Spanish, and comes very near the state of the art in English, without any hand-engineered features or gazetteers. Recent work such as that of Bhatia et al. (2018) centers on effectively adapting these neural NER architectures towards low-resource settings using parameter transfer methods for various clinical notes datasets important for medical text analysis. They complement a standard hierarchical NER model with a general transfer learning framework consisting of parameter sharing between the source and target tasks. To mitigate the problem of exhaustively searching for model optimization, they propose Dynamic Transfer Networks (DTN), a gated architecture which learns the appropriate parameter sharing scheme between source and target datasets, and they are able to show significant improvement over baseline architectures when training on various public medical and non-medical datasets and transferring to other datasets, also mitigating exponential search for an optimal model. However, this work does not target the problem of inferring hidden relations on out-of-vocabulary concepts or entities.

2.2.8 Automated Text Categorization and Semantic Tagging

In a view from the earlier decade, text categorization (TC) is generally the activity of labeling natural language texts with thematic categories from a predefined set, and is a content-based document management task related to information retrieval (IR). It began to be applied in many contexts, ranging from document indexing based on a controlled vocabulary, to document filtering, automated metadata generation, word sense disambiguation, population of hierarchical catalogues of Web resources, and in general any application requiring document organization (Sebastiani, 2002). According to the comprehensive survey on machine learning for automated text categorization due to Sebastiani (2002), until the late '80s the most popular approach to text categorization in an "operational" (i.e., real-world applications) community was a knowledge engineering (KE) one, consisting of manually defining a set of "rules" encoding expert knowledge on how to classify documents under the given categories. In the '90s this approach lost popularity (especially in the research community) in favor of the machine learning (ML) paradigm, according to which a general inductive process automatically builds an automatic text classifier by learning, from a set of pre-classified documents, the characteristics of the categories of interest. The advantages of this approach are an accuracy comparable to that achieved by human experts, and a considerable savings in terms of expert labor power, since no intervention from either knowledge engineers or domain experts is needed for the construction of the classifier or for its porting to a different set of categories. Current-day text categorization is therefore a discipline at the crossroads of ML and IR, and as such it shares a number of characteristics with other tasks such as information/knowledge extraction from texts and text mining. There is still considerable debate on where the exact border between these disciplines lies, and the terminology is still evolving. "Text mining" is increasingly used to denote all the tasks that, by analyzing large quantities of text and detecting usage patterns, try to extract probably useful (although only probably correct) information. According to this view, text categorization is an instance of text mining. While text categorization enjoys a rich literature now, it is still fairly scattered.

Sebastiani (2002) is the task of assigning a Boolean value to each pair ⟨d, c⟩ ∈ D × C, where D is a domain of documents and C = {c1, ..., c|C|} is a set of predefined categories.

The pair ⟨dj, ci⟩ may get a value of True or False, indicating a decision to file dj under ci or not. Text categorization comes in different flavors, such as single-label versus multi-label categorization, "category-pivoted" versus "document-pivoted" categorization, and "hard" versus "ranking" categorization. In document-pivoted categorization, given a dj we want to find all the ci under which it should be "filed",

while with category-pivoted categorization, given ci we want to find all the documents that should be filed under it. In Chapters 4 and 5 of this dissertation our models do not assume a fixed set of categories and are meant to include more over time, so we dynamically try to achieve both category- and document-pivoted categorization in our unsupervised task setting, with one view feeding the other. Also, while a complete automation of the

TC task requires a True or False decision for each pair ⟨dj, ci⟩, a partial automation of this process simply ranks the categories in C according to their estimated appropriateness to dj, without taking any "hard" decision on any of them. Such a ranked list would be of great help to a human expert in charge of taking the final categorization decision. In our work in Chapters 4 and 5 we also perform this type of document ranking within our pipeline for automated categorization.

In addition, we also perform "hard" categorization of documents in our supervised and semi-supervised task settings of Chapter 5 to validate our proposed model framework. This directly contrasts with, and allows a direct comparison against, one of the more recent state-of-the-art ML-based automated text categorization works due to Soleimani and Miller (2016), specifically through the application of LDA-based topic modeling. Here they propose a semi-supervised multi-label topic model (MLTM) for jointly achieving document and sentence-level class inferences. Under this model, each sentence is associated with only a subset of the document's labels (including possibly none of them), with the label set of the document being the union of the labels of all of its sentences. For training, they use both labeled documents and, typically, a larger set of unlabeled documents, performing experiments on the del.icio.us folksonomy dataset and the Ohsumed dataset originally developed for TREC for evaluating information

filtering. The MLTM model, in a semi-supervised fashion, discovers the topics present,

learns associations between topics and class labels, uses this to predict labels for new

(or unlabeled) documents, and determines label associations for each sentence in every document. Like our models of Chapters 4 and 5, theirs does not require any ground-truth labels on sentences or documents. The fact that this fairly recent work in automated text categorization outperforms several benchmark methods with respect to both document and sentence-level classification, and that our Seq2set framework of Chapter 5 outperforms MLTM in the supervised and semi-supervised task settings, gives us reason to believe that we are on a reasonable path towards achieving concept discovery in novel contexts for downstream recommendation tasks.

2.3 Knowledge Transfer and Natural Language Processing

Chin's (2013) dissertation provides a comprehensive account of all aspects of knowledge transfer. Thus some of its relevant sections are reproduced verbatim below, to aid the discussion of the methods applied in this research.

Human beings learn from prior experiences, and so can automated systems, e.g. we

find it is easier to learn French after having learned Spanish or to learn ballroom dancing after having learned figure skating. Transferring knowledge from one situation to another related situation often increases the speed and quality of learning. This observation is relevant to human learning, as well as machine learning. The goal of knowledge transfer is to train a system to recognize and apply knowledge acquired from previous tasks to new tasks or new domains. An effective knowledge transfer system facilitates the learning processes for novel tasks, where little information is available

(Chin, 2013).

Knowledge transfer, more commonly known as transfer learning and domain adaptation, is premised on the idea that the generalization of a learned model may occur across tasks. In contrast, traditional machine learning limits the generalization to being within a task.

Extensive research literature on transfer learning refers to it with many related names, e.g., inductive learning, multi-task learning, reinforcement learning, lifelong learning, knowledge transfer, domain transfer (or adaptation), knowledge reuse, information reuse, classifier reuse, and auxiliary classifier selection. Traditionally, transfer learning assumes that transfer occurs among related tasks. However, the relatedness between tasks may not be apparent at the very outset, and what knowledge to transfer, or how exactly to apply it, may not be obvious either. For example, we may wonder whether the transfer of learning can be achieved between using a fork and using a pair of chopsticks (Chin, 2013).

2.3.1 A Taxonomy of Knowledge Transfer

A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold.

For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much expensive data-labeling effort (Pan and Yang, 2010).

Pan and Yang (2010) provide a good taxonomical overview of the relationship between the different settings for knowledge transfer, as depicted in Figure 2.1.

Figure 2.1: An Overview of Different Settings of Transfer Learning by Pan and Yang (2010)

Thus, if the transfer is observed, we may want to know what contributes to the transfer of learning between the two tasks, so that we can develop improved models of learning (Chin,

2013). Here, as shown in Figure 2.1, the term transductive transfer learning refers to the case where the source and target tasks are the same, although the domains may be different. In this definition of the transductive transfer learning setting, we only require that part of the unlabeled target data be seen at training time in order to obtain the marginal probability for the target data; the term transductive emphasizes that in this type of transfer learning the tasks must be the same and there must be some unlabeled data available in the target domain. This corresponds exactly to the task of domain adaptation, where the difference lies between the marginal probability distributions of the source and target data; i.e., the tasks are the same but the domains are different. In contrast, in the inductive transfer learning setting, the target task is different from the source task, regardless of whether the source and target domains are the same or not. In this case, some labeled data in the target domain are required to induce an objective predictive model fT(·) for use in the target domain, and depending on how much labeled data is available, this may correspond to the multi-task learning setting or a self-taught learning setting

(Pan and Yang, 2010).
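In the notation of Pan and Yang (2010), a domain D = {X, P(X)} is a feature space together with a marginal distribution over it, and a task T = {Y, f(·)} is a label space together with a predictive function. The two settings above can then be summarized compactly as follows (a paraphrase of their definitions, not a new formalism):

• Transductive transfer learning (domain adaptation): T_S = T_T but D_S ≠ D_T, with some unlabeled target-domain data available at training time to estimate the target marginal P(X_T).

• Inductive transfer learning: T_S ≠ T_T, whether or not D_S = D_T, with some labeled target-domain data required to induce the predictive model f_T(·).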

Maximizing the utility of information provides opportunities to improve the process of knowledge discovery. In the field of machine learning and natural language processing

(NLP), obtaining training labels is often expensive, while an enormous amount of unlabeled data are often available (e.g. learning to answer product-related questions with product reviews where labeled data is unavailable). Therefore, maximizing the utility of available label information would benefit the learning process. In this context,

this work aims to better solve the prediction and recommendation–based research tasks at hand by first abstracting out, with respect to each task, what types of knowledge transfer (specifically with respect to inferring latent concept relations) may help to improve or augment the solution for these tasks (Chin, 2013).

2.3.2 Natural Language Processing – An Overview and Tasks

Ruder's (2019) thesis provides an excellent background on the area of natural language processing, which is the primary focus area of this dissertation; hence I reproduce much of the description provided therein verbatim below, to spare the reader the task of referring back to that document. Natural language processing (NLP) aims to teach computers to understand natural language. As the facility for language is abstract, we try to define NLP by way of its tasks, which generally map a text to linguistic structures that encode its meaning (Smith, 2011). Thus we describe various NLP tasks, ordered roughly from low-level syntactic tasks to high-level semantic tasks. (Ruder, 2019)

Part-of-speech (POS) tagging – POS tagging is the task of tagging each word in a text with its corresponding part-of-speech. A part-of-speech is a category of words with similar grammatical properties. Common English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc. Part-of-speech tagging is difficult as many words can have multiple parts of speech. Parts of speech can be arbitrarily fine-grained and are typically based on the chosen tag set. The most common tag set, used by the Penn Treebank (Taylor et al., 2003), comprises 36 tags.

However, POS tags vary greatly between languages due to cross-lingual differences.

The creation of a “universal” tagset has been an ongoing development: Petrov and

McDonald (2012) proposed a tag set of 12 coarse-grained categories, while the current tag set of Universal Dependencies 2.0 (Nivre et al., 2016) contains 17 tags. When applying a standard POS tagger to a new domain, current models particularly struggle with word–tag combinations that have not been seen before. (Ruder, 2019)

Chunking – Chunking, also known as shallow parsing, aims to identify continuous spans of tokens that form syntactic units. Chunks are different from parts of speech as they typically represent higher-order structures such as noun phrases or verb phrases.

Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of chunks. O is used for tokens that are not part of a chunk. Both POS tagging and chunking act mostly on the grammatical and syntactic level, compared to the following tasks, which capture more of the semantic and meaning-related aspects of the text. (Ruder, 2019)

Named entity recognition (NER) – NER is the task of detecting and tagging entities in text with their corresponding type, such as PERSON or LOCATION. BIO notation is also used for this task, with O representing non-entity tokens. Entity categories are pre-defined and differ based on the application. Common categories are person names, organizations, locations, time expressions, monetary values, etc. NER is a common component of information extraction systems in many domains. Despite high numbers (around 92–93 F1) on the canonical CoNLL-2003 newswire dataset, current

NER systems do not generalize well to new domains. NER corpora for specialized domains such as medical and biochemical articles (Song et al., 2005; Krallinger and

Valencia, 2005) have consequently been developed.

Semantic role labelling (SRL) – SRL aims to model the predicate-argument structure of a sentence. It assigns each word or phrase to its corresponding semantic role

such as an agent, goal, or result. The FrameNet project (Baker et al., 1998) defined the

first large lexicon consisting of frames such as Apply heat and associated roles, known as frame elements, such as Cook, Food, and Heating instrument. Predicates that evoke this frame, e.g. "fry", "bake", and "boil", are known as lexical units. PropBank

(Kingsbury and Palmer, 2002) added a layer of semantic roles to the Penn Treebank.

(Ruder, 2019) While this is a core NLP task closely related to the process of concept discovery outlined in this thesis, it is not specifically geared toward out-of-vocabulary relations and concepts.

Dependency parsing – Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical structure and defines the relationships between "head" words, and words which modify those heads. Dependency parsing differs from constituency parsing, which focuses on identifying phrases and the recursive structure of a text. A typed dependency structure labels each relationship between words, while an untyped dependency structure only indicates which words depend on each other. Dependency parsing is used in many downstream applications such as coreference resolution, question answering, and information extraction, as the relations between the head and its dependents serve as an approximation to the semantic relationships between a predicate and its arguments. (Ruder, 2019)

Topic classification – Topic classification and sentiment analysis are text classification tasks. They assign a category not to every word but to contiguous sequences of words, such as a sentence or document. Topic classification aims to assign topics that depend on the chosen dataset, typically focusing on the news domain. As certain keywords are often highly predictive of specific topics, word order is less important for this task. Consequently, traditional BOW models with tf-idf weighted unigram and bigram features are often employed as strong baselines. (Ruder, 2019)
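As a minimal sketch of such a baseline (the documents and topic labels below are invented for illustration and do not come from any dataset used in this dissertation), tf-idf weighted unigram and bigram features can be fed to a linear classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["the council approved the school levy",
        "the team won the championship game",
        "new tax proposal debated in city council",
        "star striker traded before the playoffs"]
topics = ["politics", "sports", "politics", "sports"]

# tf-idf over unigrams and bigrams, followed by a linear (maximum entropy) classifier
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(docs, topics)
print(baseline.predict(["city council votes on new levy"]))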

Sentiment analysis – Sentiment analysis is the task of classifying the polarity of a given text. Usually this polarity is binary (positive or negative) or ternary (positive, negative, or neutral). Most datasets belong to domains that contain a large number of emotive texts such as movie and product reviews or tweets. In review domains, star ratings are generally used as a proxy for sentiment. Sentiment analysis has become one of the most popular tasks in NLP. Variants of the task employ different notions of sentiment and require determining the sentiment with respect to a particular target.

(Ruder, 2019)

Language modeling – A language model is a probability distribution over sequences of words. Language modeling aims to predict, at each position in a text, the next word given the preceding words. It is the simplest yet most effective unsupervised learning model for prediction, as it only requires access to the raw text. Despite its simplicity, language modeling is fundamental to many advances in NLP and has many concrete practical applications such as intelligent keyboards, email response suggestion, auto-correction, etc.

(Ruder, 2019)
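In its standard formulation, a language model factorizes the probability of a word sequence with the chain rule and is trained to estimate each conditional from raw text:

P(w_1, ..., w_T) = ∏_{t=1}^{T} P(w_t | w_1, ..., w_{t-1}),

so that predicting the next word at every position amounts to estimating the factor P(w_t | w_1, ..., w_{t-1}).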

Since language modeling is, among the core NLP tasks, the best candidate for handling out-of-vocabulary items, it is the core task most closely related to this work. Chapters 4 and 5 of this dissertation therefore revolve mainly around language modeling tasks aimed at improving downstream semantic matching.

2.3.3 Transfer Learning for NLP – Language Model Pre-training on Sesame Street

Recently, language model pre-training has been shown to be effective for improving many natural language processing tasks. We describe below some of the recent contextualized or language model–based word representation learning algorithms that we have either experimented with or considered using for our various works. Since these are the standard methods for achieving transfer across multiple core NLP tasks and datasets in recent times, they deserve mention in the context of this dissertation, although it is important to note that our means of achieving transfer of latent concepts in the works outlined in this thesis has been solely through architectural design and modeling considerations for the tasks being solved, rather than through the application of pre-trained language models, as it is also computationally prohibitive to adapt some of these models. There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.

The feature-based approach, exemplified by ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The

fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT)

(Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

Probabilistic FastText – The Probabilistic FastText (PFT) word embedding model of Athiwaratkun et al. (2018) represents each word with a Gaussian mixture density, where the mean of a mixture component is given by the sum of n-gram vectors; this representation can

capture multiple word senses, sub-word structure, and uncertainty information. This model outperforms the n-gram averaging of FastText (Bojanowski et al., 2017), achieving state-of-the-art performance on several word similarity and disambiguation benchmarks. The probabilistic word representations with flexible sub-word structures can achieve multi-sense representations that also give rich semantics for rare words. This makes them well suited to generalizing to rare and out-of-vocabulary words, a primary goal throughout this dissertation, thus motivating the use of PFT-based word vector pre-training over regular FastText for training many of our Seq2set framework models of Chapter 5.
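As a sketch of the underlying representation (the notation is illustrative and simplified relative to the paper's exact parameterization), each word w is modeled as a Gaussian mixture density rather than a point vector:

f_w(x) = Σ_{i=1}^{K} p_{w,i} N(x; μ_{w,i}, Σ_{w,i}),

where, for a designated component, the mean μ_{w,i} is given by the sum of the vectors of the character n-grams of w, so that sub-word structure is shared across words. Word similarity is then scored between two such densities (e.g., via an expected likelihood kernel), which is why rare and out-of-vocabulary words still receive informative, multi-sense representations through their shared n-grams.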

ELMo – One main consideration when attempting to capture context surrounding words in variable-length documents is to use contextualized word embeddings that can explicitly capture the language model underlying sentences within a document.

ELMo (Embeddings from Language Models) word vectors (Peters et al., 2018) present such a choice, where the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large corpus. The representations are a function of all of the internal layers of the biLM. Using linear combinations of the vectors derived from each internal state has shown marked improvements on various downstream NLP tasks, because the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax. In our work in Chapter 5 we use the ELMo API1 to generate embeddings fine-tuned for our corpus, with dimension settings of 50 and 100, using only the top-layer final representations.

1https://allennlp.org/elmo
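For illustration only, a minimal sketch of this step using the AllenNLP ElmoEmbedder (the option and weight file names are hypothetical placeholders for the corpus–fine-tuned 50- or 100-dimensional configurations; this is not the exact pipeline of Chapter 5):

from allennlp.commands.elmo import ElmoEmbedder

# Hypothetical file names standing in for a custom, corpus-fine-tuned ELMo
# configuration (e.g., a 50-dimensional setting); the defaults are 1024-dimensional.
elmo = ElmoEmbedder("elmo_options_dim50.json", "elmo_weights_dim50.hdf5")

tokens = ["novel", "contexts", "often", "arise", "in", "complex", "queries"]
layers = elmo.embed_sentence(tokens)  # numpy array of shape (3 layers, num_tokens, dim)
top_layer = layers[-1]                # keep only the top-layer final representations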

BERT – The Bidirectional Encoder Representations from Transformers (BERT) word representation model alleviates the unidirectionality constraint in models like

ELMo or OpenAI GPT by using a "masked language model" (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows pre-training a deep bidirectional Transformer. In addition to the masked language model, they also use a "next sentence prediction" task that jointly pre-trains text-pair representations. BERT is the first

fine-tuning–based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures and advancing the state of the art for eleven NLP tasks. Our experiments with downstream fine-tuning of pre-trained BERT language model–based embeddings for some of our Seq2set framework models of Chapter 5 do not, however, yield significantly improved results when compared to using the regular Skip-gram–based word vectors of Mikolov et al. (2013a).
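To make the MLM objective concrete, the sketch below masks a random subset of input token ids and records the prediction targets; it is a simplification of BERT's actual procedure, which selects about 15% of tokens and replaces a fraction of those with random or unchanged tokens rather than always using the [MASK] id:

import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Return (masked inputs, labels); labels are -100 at positions that are not predicted."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok      # the model must recover the original vocabulary id here
            inputs[i] = mask_id  # replace the input token with [MASK]
    return inputs, labels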

This concludes our survey of the specific areas from the literature pertaining to the architectural and modeling considerations for the works covered in this dissertation. The following chapters provide a detailed account of each of the main research contributions.

CHAPTER 3

EXPLORING TOPIC MODELS OVER COMPLEMENTARY SIGNALS TO EXPLAIN NEWSPAPER SUBSCRIBER ENGAGEMENT

3.1 Introduction

Customer churn refers to when a customer (player, subscriber, user, etc.) ceases his or her relationship with a company. Online businesses typically treat a customer as churned once a particular amount of time has elapsed since the customer's last interaction with the site or service. With the emergence of social media and news aggregators on the Web, the print newspaper industry is faced with a declining subscriber base resulting in customer churn (Pew Research Centers, 2013). It is therefore critical for newspapers to predict and mitigate churn. The Columbus Dispatch Printing Company

(The Dispatch) is a newspaper with a circulation of 1.2 million subscribers. In this work, our objective was to explore various structured and unstructured data available within the Dispatch, from its print and on-line properties, to gain insight into various factors affecting user engagement (in this work, as we see in later sections, these factors are the equivalent of topics learned from the data). These insights were used to come up with predictive models of customer churn, using features mined from both

transactional databases and Web-based textual data, to determine which factors most impact subscriber engagement.

Any available Web data, indicative either of user activity, e.g. search clicklogs, or of readership, e.g. Web news, could be a rich source of signal for the task of churn prediction, hence these sources were included in our study to determine factors affecting churn. The intuition behind our use of clicklogs is to find out how digital user behavior impacts print subscription and vice versa, and clicklogs could provide important types of signal related to traffic patterns and preferences of users, e.g. searches for items that: (i) cannot be easily or directly found in news or advertisements, (ii) are closely related to news or advertised products, or (iii) are unrelated to news (Newman et al., 2006; Nandi and Bernstein, 2009). Similarly, Web news offers the subscriber nearly the same content as print. We therefore use Web news as a source of signal to look at the impact of environmental context on consumer behavior, hypothesizing that top ranking news items may provide valuable cues into user engagement, and hence may be correlated (Csikszentmihalyi et al., 2013; Jin et al., 2005; Mooney and Roy,

2000; Resnick et al., 1994). We therefore mine these Web information sources to build several models of subscriber churn on this data, using a maximum entropy model that can incorporate a large variety of features, and an unsupervised learning approach such as LDA-based probabilistic topic modeling, well-suited to the task of learning topic features from the Web data. This resulted in TopChurn (Das et al., 2015b) – a novel system that uses these topic models (Blei et al., 2003; Steyvers and Griffiths,

2007; Newman et al., 2006) as a means of extracting dominant features from Web NEWS and CLICKLOG data and employs a maximum entropy model for churn prediction, by identifying features most indicative of subscribers likely to drop subscription within a

specified time frame. All of our models based on temporal, unigram, and Web-based topic and sentiment features show statistical significance and improved predictability over baseline models that utilize only transaction metadata, showing that features mined from Web news content and on-line user activity do influence newspaper subscriber engagement. Thus our experiments validate that these models significantly outperform baselines for various prediction windows (Berger et al., 1996; Klein and

Manning, 2003).

We then extend these analyses by finding correlations of the top-ranked features from our various models with churn. Based on these insights and additional news data, we obtain refined Web models of churn by tuning parameters of the topic models. Further, we hypothesize that news and click-log Web data, which have significant temporal overlap in the subscriber base, complement each other in interesting ways for the same topics.

In particular, we perform topic inference on click logs based on topic models learned on news data, to identify interesting trends from these complementary signals, further validated by performing sentiment analysis on the topics in our inferred models.

Our topic inference models on the complementary signals from web data, representing the NEWS and CLICKLOG domains, further enhance our analyses of which factors, whether temporal, transactional or derived from web usage and sentiment, directly affect newspaper subscriber engagement. We use these augmented analyses to present several key insights into customer engagement. We thus provide a comparison of the feature sets of these models with respect to their predictiveness, in particular juxtaposing the web-based topic and sentiment features from NEWS and CLICKLOG sources, which we find to be complementary sources of signal, highlighting particularly how topics transfer across the two sources in the context of churn prediction.

The insights drawn from this data by means of topic transfer across sources for prediction inform the subsequent works, where the goal is to improve performance on recommendation tasks by transferring latent concepts within and across data domains. Also, the coarse granularity and bag-of-words nature of topic modeling, which can limit the discovery process, is addressed in the next work, where distributed models of representation are employed with the aim of large-scale unsupervised discovery (Mikolov et al., 2013b; Le and Mikolov, 2014). Thus this work represents a set of experiments employing transfer learning via inferred topic models for prediction that serve to inform much of the subsequent direction this research has taken. In the following sections we will therefore focus only on the knowledge transfer–based concept learning aspect of our analyses, which is relevant to this dissertation.

3.2 Motivation and Background

Customer churn, or the customer attrition rate, has been a well-studied problem in industries like telecommunications, retail CRM, the social Web, and to a great extent in online and gaming communities (Archambault et al., 2013; Coussement and Van den

Poel, 2008; Resnick et al., 1994; De Bock and Van den Poel, 2010; Kawale et al., 2009;

Tarng et al., 2009; Debeauvais et al., 2011; Karnstedt et al., 2011; Dror et al., 2012;

Dave et al., 2013). The Columbus Dispatch Printing Company, referred to from here on as

The Dispatch, is a newspaper with a circulation of 1.2 million subscribers. Being the only mainstream newspaper in its home city, it comprises a news daily and weekly, several websites for local news, sports and events, associated blogs, and a couple of

TV stations, all part of the Dispatch Media Group, that has at its disposal large amounts of content, shared by all its outlets and distributed through multiple modes

(print, Web and mobile), all delivered by a common server infrastructure. Some of the problems the Dispatch faced with respect to print circulation subscription were: losing existing subscribers, no new or younger subscribers, and a slow turn-around time for changing the business model. Thus the organization wanted to use its diverse reserves of electronic and Web data to come up with better business models that can drive subscriber engagement and retention.

Since The Dispatch operates in a captive market, where its strategy is to offer similar content across all its delivery vehicles, our hypothesis is that traffic across the various modes of delivery, and hence the datasets we consider, are correlated. Our research objective here, therefore, was to jointly explore these structured and unstructured datasets to gain insight into various factors affecting user engagement.

To our knowledge, our work is the first to study patterns of both on-line and offline behavior of customers, by tying together Web and relational databases of user activity, for the task of predicting customer churn, in contrast to previous works in this space, e.g. Coussement and Van den Poel (2008), which only looks at features from transactional data of newspaper subscribers. Our experimental results confirm our intuition for using Web features to model subscriber churn and demonstrate the value of extracting signal from the Web, both transactional and non-transactional. In addition, our human-guided approach to improving our models of churn prediction via parametrized exploration of the models and topic inference on complementary signals from web data further enhances our analyses of which factors, whether temporal, transactional or derived from web usage and sentiment, directly affect newspaper subscriber engagement.

3.3 Related Work

Notable previous work, such as that of Iwata et al. (2006) on extending subscription periods, efficiently extracts information from log data and purchase histories using Cox proportional hazards models and survival analysis techniques to find frequent purchase patterns in users with long subscription periods, infer these users' interests, and use them to improve recommendations for new users. Coussement et al. apply an SVM classifier to construct a churn model for newspaper subscription using mainly transaction- and metadata-based features, comparing two parameter-selection techniques (Coussement and Van den Poel, 2008). An often-used performance criterion in churn prediction is lift (De Bock and Van den Poel, 2010), which measures how many times a classification model improves identification of potential churners over random guessing. De Bock et al. investigate the use of probability estimation trees

(PETs) and alternative fusion rules to improve the lift performance of four well-known ensemble classifiers (De Bock and Van den Poel, 2010).

Previous work on churn in online communities such as that of Dror et al. focuses on predicting churn in new users, specifically within their first week of activity on a popular CQA website (Dror et al., 2012), while others such as Karnstedt et al. explore the relation between a user's value within an on-line social network community, constituted from various user features, and the probability of the user churning (Karnstedt et al.,

2011). Kawale et al. use influence vector models (Kawale et al., 2009), taking into account online players' game engagement and social influence from neighbors via influence propagation, in an MMORPG setting, for gamer churn prediction. Dave et al. use a time-spent-based model (Dave et al., 2013), speed of discovery, and information

theoretical analyses to find a subset of informative recommendations that are most indicative of user retention in an on-line personalized content discovery setting.

Maximum entropy is a powerful statistical model, which has been widely applied in information retrieval and text mining (Klein and Manning, 2003; Jin et al., 2005). The goal of a maximum entropy model is to find a probability distribution, which satisfies all the constraints in the observed data while maintaining maximum entropy (Jin et al.,

2005). One of the advantages of such a model is that it enables the unification of information from multiple knowledge sources into one framework. Each knowledge source can be considered as a set of constraints in the model. From the intersection of all these constraints, a probability distribution with the highest entropy can be learned

(Jin et al., 2005). As a first step towards finding ways to solve the overall problem of customer retention, given the subscribers of the paywall websites and print services of the Dispatch, we want to determine what criteria or features in the usage data or engagement statistics contribute to continued subscription, and to identify the features and the chronological sequence of events, if any, that allow us to predict when a user is about to discontinue their subscription.

The rich body of literature surrounding analysis of web content using probabilistic topic models (Steyvers and Griffiths, 2007; Blei et al., 2003; Griffiths and Steyvers,

2004), primarily work done by Newman et al. for analyzing topics and entities in news articles (Newman et al., 2006), informs our decision to use topic models as a means to reason about readership and engagement from the news signal. A rich source of usage patterns lies within web server access logs, which contain information from both search and navigational queries. Mining of query logs (Silvestri, 2010) is studied extensively by Silvestri et al., mainly from both a user's and a search engine's

standpoint for web usage optimization purposes, but many of the principles of query log analysis, such as the nature and type of queries, basic statistics such as query popularity, average query length, and distance between repeated queries, and approximating user intent with query expansion and suggestion, inform much of our work with the use of server query logs as a signal for churn prediction. Interesting works such as

Qurk (Marcus et al., 2011), a crowd-sourced database for query processing by Marcus et al., and the useful human-assisted, semi-automated process described by Wei et al. for converting existing web pages to pages suitable for mobile viewing, to aid companies low on development personnel (Wei et al., 2013), inform our notion of an iterative human-analyst-in-the-loop approach for parametrized exploration of our predictive models so as to drive the best possible insights. In this regard, our prior work on undertaking systematic research on Big Data, particularly by contrasting qualitative and quantitative methods, puts into greater perspective the approach we take for the design of experiments required in this particular study, and the subsequent analysis of the outcomes thereafter (Das et al., 2015a).

Ours is the first work we find in the literature that looks at harnessing both transactional and Web data within a large media organization, studying patterns of both online and offline customer behavior, and using both unsupervised and supervised learning approaches for incorporating features into a maximum entropy framework to predict newspaper subscriber churn. This study contributes to the existing literature by investigating the effectiveness of the maximum entropy–based method (Klein and

Manning, 2003; Daumé III, 2004; Holmes et al., 1994) in extracting signal from heterogeneous sources of data such as a transactional database (structured) and the Web

Figure 3.1: Overview of Dataset

(environmental) (Csikszentmihalyi et al., 2013), while employing an unsupervised approach such as LDA-based topic modeling to extract the most informative features

(Blei et al., 2003; Steyvers and Griffiths, 2007) for this task.

3.4 Dataset

The Dispatch dataset comprises large amounts of transactional, content and server access log data from various divisions of the enterprise, like news websites and print and digital subscriptions, and includes news stories, blog content and comments data from thirteen different websites, and newspaper subscriber transactional data with subscription history, viz. current status of a subscription, time-stamped transactions for start, stop, update, or renewal of a subscription, and associated memo text, as shown in Figure 3.2. The experiments are performed on the print circulation and subscription history portion of the dataset, integrating features derived from Web data, such as activity, stories and sentiment, to create models of churn based on linguistic, temporal and transaction metadata features.

Figure 3.2: Customer transactions with memo text

For exploratory analysis and predictive modeling, as well as for studying complementary signals, we focus mainly on two orthogonal components of this data: 1)

TopChurnTRANS — a structured, relational database of transactional information containing user complaint text that is tied directly to user requests and user satisfaction, and 2) TopChurnWEB — unstructured textual data from the Web, comprising two complementary sources: (i) TopChurnWEB-NEWS — i.e. news stories and blog articles from 13 websites, and (ii) TopChurnWEB-CLICKLOGS — server access logs, also from 13 websites, reflecting user activity and website traffic on various days. Both these types of Web data in combination are referred to as TopChurnWEB. Figure 3.1 gives an overview of the dataset accessible to us that was used in our work for the extended analysis with complementary models of WEB-NEWS and WEB-CLICKLOGS. Additionally, the subscriber transactions in TopChurnTRANS are as shown in Figure 3.2, with the actual account ids, transaction ids and names de-identified. There are 5 types of Status an account may be in, given by the TRANS-TYPE field, viz. START, STOP, PRODCHG,

COMPLAINT and RESTART.

3.5 Churn Prediction

3.5.1 Methodology

The maximum entropy model is a discriminative probabilistic approach, where features are the elementary pieces of evidence that link aspects of what we observe, d, with a category, c, that we want to predict. As noted in Section 3.3, the goal of a maximum entropy model is to find the probability distribution that satisfies all the constraints in the observed data while maintaining maximum entropy, allowing information from multiple knowledge sources, each viewed as a set of constraints, to be unified in one framework (Klein and Manning, 2003; Jin et al., 2005). Maxent models make it easy to include large numbers of non-independent features and generally handle overlapping features well. We want a distribution which is uniform (high entropy) except in the specific ways we require, thus minimizing commitment, or maximizing entropy, while resembling some reference distribution (i.e., the data) (Klein and Manning, 2003).

Adding constraints (features) lowers the maximum entropy of the distribution for the model, raising the maximum likelihood of the data and bringing the distribution away from uniform and closer to the actual distribution of the data (Klein and Manning,

2003). Thus in this work, we use the maximum entropy model, or logistic regression trained for a binary classification problem, using sets of features developed from text in complaints, web news, and search click logs. We formalize the task of churn prediction

as a supervised learning experiment for binary classification. Given the history of a customer account up to date D, we learn one of two class labels, ACTIVE or DROPPED, i.e., we predict the status of the account (whether the customer will remain subscribed or not) at date D+K. We investigate two timescales, K = 3 and K = 6 months. The details of data preparation, label generation, topic modeling and feature set generation are outlined in Das et al. (2015b) and in the following sections.
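As a minimal sketch of this setup (toy feature values only; the actual feature sets are those described in Section 3.5.5 and Figure 3.6, and this is not our production pipeline), the maximum entropy model can be trained as a regularized logistic regression:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# One row per account history up to date D; illustrative columns:
# [number of complaints, average complaint sentiment, recent STOP-then-RESTART flag]
X = np.array([[5, 0.1, 1], [0, 0.8, 0], [4, 0.2, 1], [1, 0.7, 0]])
# 1 = DROPPED at date D+K, 0 = ACTIVE at date D+K
y = np.array([1, 0, 1, 0])

maxent = LogisticRegression(max_iter=1000).fit(X, y)
print(roc_auc_score(y, maxent.predict_proba(X)[:, 1]))  # training AUC, for illustration only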

3.5.2 Data Preparation and Topic Modeling

The subscriber transactions in TopChurnTRANS are as shown in Figure 3.2, with the actual account ids, transaction ids and names de-identified to protect privacy. There are 5 types of Status an account may be in, given by the TRANS-TYPE field, viz. START,

STOP, PRODCHG, COMPLAINT and RESTART. The data file of 3.72 million transactions is sorted by the transaction start date, so all transactions for an account are in sequential order. We use this transaction data to obtain the sequence of status changes pertaining to an account and generate account histories, one per subscriber account, each corresponding to a single instance in our training data, and create features from the counts, metadata and content of any complaints occurring for that account. Further, we use a portion of the status sequence corresponding to an account history to generate the gold standard label, described in detail in Section 3.5.3. Other textual features such as sentiment scores and unigram frequencies Bird (2006); Documentation (2014), and metadata features, are also calculated from the complaint text in the TopChurnTRANS dataset, and included as needed in the various models. The text in the TopChurnWEB (NEWS and CLICKLOGS) datasets is processed via unsupervised learning. We use Stanford

NLP’s TMT v0.4 Group (2014) to create topic models for the TopChurnWEB news and

search click log data and also calculate sentiment scores for the same Documentation

(2014). Since we intended to use the Web data to predict churn for subscribers in our database, we had to identify elements that would allow an appropriate join between the two datasets, as the ACCOUNT element identifying a customer is present only in

TopChurnTRANS but not in TopChurnWEB (NEWS and CLICKLOGS).

The only elements that allow this are the timestamps that are shared across these datasets, so we normalize dates across the board in order to facilitate this join. The TopChurnWEB dataset was thus summarized as follows:

• TopChurnWEB-NEWS, comprising > 100K articles over 180 days, was summarized by grouping all news articles occurring within a given day and formatting the dates to join correctly with dates in our subscriber transactions database, so as to generate one concatenated document per day, containing titles and content. We then run topic modeling on this set of documents and get topic distributions over 50 topics for each date for the news.

• We parse 3.4TB of TopChurnWEB-CLICKLOGS server log data to produce a unique 4-tuple of the form {Requestor IP, TimeStamp, SearchTerms, SearchQuery} (Nandi and Bernstein, 2009). Our parsed data produced on average about 1500 Web searches for each day of the available time period, on the 13 websites of the Dispatch. These are grouped similarly to the news, producing a single concatenated document of Web searches for a given day, on which we again ran topic modeling to get distributions over 50 topics for each date for the clicklogs. A minimal sketch of this per-day grouping and topic modeling appears below.
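The sketch below illustrates the per-day grouping, topic learning, and topic-transfer (inference) steps using gensim's LDA; it is a toy stand-in for the Stanford TMT v0.4 pipeline we actually used, with invented tokens and dates:

from gensim import corpora, models

# One concatenated, tokenized pseudo-document per day for the NEWS domain;
# CLICKLOGS documents are built the same way from the parsed search terms.
news_docs = {"2014-01-01": ["council", "school", "levy", "snow", "closings"],
             "2014-01-02": ["buckeyes", "game", "traffic", "downtown", "parade"]}
clicklog_docs = {"2014-01-01": ["school", "closings", "snow", "emergency"]}

dictionary = corpora.Dictionary(news_docs.values())
news_corpus = [dictionary.doc2bow(tokens) for tokens in news_docs.values()]

# Learn topics on NEWS (50 topics in our experiments; 2 here to keep the toy small) ...
lda = models.LdaModel(news_corpus, id2word=dictionary, num_topics=2, passes=10)

# ... then infer those same topics on the CLICKLOGS document for a matching date,
# which is the topic-transfer step used later for the complementary-signal analysis.
for day, tokens in clicklog_docs.items():
    print(day, lda[dictionary.doc2bow(tokens)])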

Our processed dataset contains 605K unique customer accounts, all of which were active at some time within the last three years. Roughly 41% of them were ACTIVE at

date D-3 and 59% were DROPPED, and 40% of them were ACTIVE at date D-6 and 60% were DROPPED. We did not balance the dataset as we wanted to keep our models as close to the real world as possible. Some basic status change statistics are as follows. Of the 246628 ACTIVE customers in the dataset, 11% changed status from STOP to RESTART once, 74% never stopped, and 15% changed status from STOP to RESTART more than once. Of the 359073 DROPPED customers in the dataset, 6% changed status from STOP to RESTART once, 79% stopped at some point never to restart, and 15% changed from

STOP to RESTART more than once.

3.5.3 Label Generation for Predictive Modeling

In order to make our models predictive, and since we do not know what actual state an account will be in K time steps out in the future, we model this by carving out the prediction time window K from the current data itself, as follows: 1) For our experiments we consider the unit of time to be months, so that prediction windows are set to 3 months, 6 months, etc. Since this is set up as a binary classification problem, we use a partial sequence of the most recent states an account goes through, in chronological order, to generate one of two class labels, ACTIVE or DROPPED. 2) Since not all account histories are of the same length, we first normalize the history for a given account by generating a time-step for each month in the account's history until the last recorded transaction date. For each date in the history we then map a status as follows: if the timestamp is an actual transaction date in the data, we use its original status; for every interim date we generate a filler status of either TRUE or FALSE, such that if the previous status in the sequence was a STOP or FALSE, then the current filler time-step status is set to FALSE; otherwise it is set to TRUE.

Figure 3.3: Top ranking topics by dataset with counts

3) Once this normalized history is produced, with the sequence of states the account went through, we hold out the statuses for the last K time steps for label generation. Thus this partial sequence used for label generation has length equal to the prediction interval K; e.g., it is the last 3 time-steps of account history for K = 3. 4) An instance that has in its partial sequence either a STOP|FALSE as the very last status, or a STOP|FALSE followed by nothing other than one or more COMPLAINTs (a subscriber may complain even after they have dropped), gets the label DROPPED. Any other combination of the five possible statuses, in any order, gets an ACTIVE label. 5) Once the partial sequence has been used for label generation, it is discarded and never used in the final model.
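A minimal sketch of the labeling rule in step 4, assuming the normalized monthly history of steps 1–3 has already been built (illustrative only, not our exact implementation):

def label_account(monthly_statuses, k):
    """Assign ACTIVE/DROPPED from the last k statuses of a normalized monthly history."""
    window = monthly_statuses[-k:]          # the held-out partial sequence
    # a subscriber may complain even after dropping, so trailing COMPLAINTs are ignored
    while window and window[-1] == "COMPLAINT":
        window = window[:-1]
    if window and window[-1] in ("STOP", "FALSE"):
        return "DROPPED"
    return "ACTIVE"

# e.g. label_account(["START", "TRUE", "TRUE", "STOP", "FALSE", "COMPLAINT"], 3) -> "DROPPED"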

3.5.4 Exploratory Trend Analysis

Prior to creating models of churn and running experiments on different datasets, we wanted to gain insight into some of the reasons that subscribers either complain most, or express satisfaction, and ask if there might be certain trends within the news

or Web searches that might affect readership in print or online. For this we ran LDA-based topic modeling on the datasets Newman et al. (2006); Steyvers and Griffiths

(2007); Griffiths and Steyvers (2004); Blei et al. (2003); Group (2014), annotated the topics, and joined with our processed account histories such that all Web news and all Web searches are mapped to account activity by matching timelines, to obtain a topic-account association. We then ranked the topics that co-occur most frequently with ACTIVE and DROPPED accounts.

Figure 3.3 shows the highest ranking topics associated with accounts for the TopChurnTRANS,

TopChurnWEB-NEWS, and TopChurnWEB-CLICKLOGS datasets. Section 3.6 further details what insights these topic rankings bring into our experiments. Further, topic analysis done on a smaller subset of the data containing complaint text for only the DROPPED subscriptions shows finer-grained topics, specific to only accounts that have a STOP status, indicative of reasons why a subscriber may stop their subscription either temporarily or longer term. This analysis reveals several reasons for DROPPED subscriptions, such as "Vacation and Job Loss", "Product Offers, Subscription Downgrades" and "Moving, Will Call to Resume", as shown in Figure 3.4.

3.5.5 Feature Generation

We generate features from the text and metadata fields of the TopChurnTRANS and TopChurnWEB datasets and have the following categories of feature sets shown in

Figure 3.6, where K denotes the churn prediction window. The features calculated for the TopChurnTRANS dataset consist of six categories: Complaints-Statistics,

Service Keywords, Temporal: Short and Long-term effects, Status History,

TopK-Unigrams (we pick k = 100) and Sentiment (Documentation, 2014).

Figure 3.6 lists, for each feature category (complaints-based, temporal, top-100 unigram, sentiment, topic, and service-based features), the individual features and the churn models in which they are used. Figure 3.4 shows three of the finer-grained topics learned over DROPPED-account complaint text: Topic 3 ("Vacation, Job Loss"), Topic 6 ("Product Offers, Subscription Downgrades") and Topic 11 ("Moving, Will Call to Resume").

Figure 3.4: Topic models for DROPPED complaints

TopChurnWEB-based feature sets consist of 2 categories: (i) a Topic Distribution

(Group, 2014) over a set of 50 topics, and (ii) Sentiment scores (Documentation, 2014), for processed text in both WEB-NEWS and WEB-CLICKLOGS. Figure 3.5 depicts how these topic features are temporally joined with subscriber transaction–based features to obtain models for predicting churn.

3.5.6 Experimental Results

Several experiments were run for the prediction of dropped subscriptions with different sets of features active, as shown in Figures 3.7 and 3.8. The results for our experimental models that were run for K = 3 and K = 6 are detailed in Figures 3.7

and 3.8, with performance shown in terms of overall Accuracy, Precision of the +ve

(DROPPED) class, Error rate, the Area Under ROC (AUC) and the F1 performance

measure of churn prediction. The experimental results for our churn models using the

TopChurnTRANS dataset show the temporal, unigram and status history models to be

clear winners for all prediction windows, with status history outperforming all models

Figure 3.5: WEB and ClickLog Topic Model Feature Integration for Churn Prediction

Figure 3.6: Feature Categories for Churn models

with an AUC = 0.91 for 3-months-out prediction. Predictive power declines across the board for 6-month prediction for TopChurnTRANS, but this is expected given there is greater uncertainty with larger K. However, for experiments with the TopChurnWEB dataset, comparing purely text-based models of Web news and clicklogs against an augmented baseline having a complaint sentiment feature added to complaints-YN, we

find the opposite effect, i.e., not only does inclusion of the WEB features increase predictive ability over this new baseline, but predictive power actually increases slightly for the larger interval K; thus the model with all NEWS and CLICKLOGS topic and sentiment features added to complaint sentiment contributes the most, with AUC = 0.69 for

K = 6; we leave this effect as a topic for future investigation. Another interesting finding, from the complaintstats model, is that subscribers with >= 3 but <= 5 complaints tend towards dropping their subscription, subscribers who never complained almost always drop, and subscribers with > 5 complaints almost always remain ACTIVE; these, we confirm, are stable subscribers who may just be routine callers or vacationers.

3.5.7 Feature Performance

We also analysed feature performance in order to gain more insight into which features are most informative for churn prediction. For this we calculate the Information Gain of each feature for predicting the target variable, for each of our churn models. Figures 3.11 and 3.12 show the top contributing features sorted in descending order of Information Gain, with the number of features for a model in parentheses. For individual feature performance, we find that for the WEB models, sentiment features from CLICKLOGS rank much higher than those for NEWS, where all topics rank higher than sentiment.

Figure 3.7: Churn Prediction, K = 3

For the temporal models, we find that the information gain of features flips depending on whether short- or long-term effects are being studied, so previously less-informative features become more predictive for larger K; e.g., AvgComplaintGap and NumComplaints start to have more influence in our short-term-effects temporal model for K = 6. Intuitively, we do expect the temporal models to be more sensitive to such effects, which our results in fact show to be true. This analysis also reveals that topics found co-occurring highly with our +ve DROPPED class during the exploratory phase, e.g. <topic05: 'Song, Lyrics, Movie, Download'> and <topic15: 'Today Stories, Boxscore Standings, House Fire, Firefighters, Police'> from

CLICKLOGS, are actually very predictive in our experiments, showing that Web searches tend to be both quite unrelated and highly related to the news.
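The ranking itself can be reproduced with any standard estimator of the mutual information (information gain) between a feature and the label; a minimal sketch with scikit-learn, using toy values and treating the features as discrete:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# X: feature matrix for one churn model; y: labels (1 = DROPPED, 0 = ACTIVE)
X = np.array([[0, 1, 3], [1, 0, 0], [0, 1, 5], [1, 0, 1]])
y = np.array([1, 0, 1, 0])

scores = mutual_info_classif(X, y, discrete_features=True)
ranking = sorted(zip(["feat_a", "feat_b", "feat_c"], scores),
                 key=lambda pair: pair[1], reverse=True)
print(ranking)  # features in descending order of information gain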

Figure 3.8: Churn Prediction, K = 6

Figure 3.9: ROC curves for top performing churn models on TopChurnTRANS. The panels, left to right, are baseline (complaints-YN), baseline+all_temporal, baseline+topKunigrams and baseline+status_sequence, with AUC = 0.64, 0.77, 0.779 and 0.91 respectively for K = 3 (top row), and AUC = 0.65, 0.74, 0.782 and 0.81 for K = 6 (bottom row).

Figure 3.10: ROC curves for the TopChurnWEB topics+sentiment models. The panels, left to right, are baseline (complaint-YN + complaint_sentiment), baseline+webnews (news + news_sentiment) and baseline+weball (news + clicklogs + all sentiment), with AUC = 0.662, 0.674 and 0.686 for K = 3 (top row), and AUC = 0.664, 0.675 and 0.69 for K = 6 (bottom row).

Also, as seen from the topic rankings (Figure 3.3) and the feature-performance analysis, local news topics have more predictive value over national news topics for this dataset, both of which also ranked highly during exploratory trend analysis.

3.6 Extended Analysis – Explaining Customer Engagement

3.6.1 Exploring Topic Models to Explain User Engagement

Next, in keeping with our original objective, we extend our analysis for insight discovery to new incoming web data, with new sets of experiments for the following analyses:

1. Correlation Analysis: We first perform correlation analysis of the high-ranked features from the original runs of our predictive models. Based on the directionality of the coefficients we undertake a deeper dive, especially into the web models, as these are information-rich sources to understand user engagement.

2. Parameterized Exploration: We specifically perform a parameterized exploration separately on the click log data and the news data, with different parameter settings such as number of topics, minimum length of words in a topic, number of iterations to convergence, etc., qualitatively looking at the topics learned as well as their distribution over time, to identify the potentially "best" number of topics for each type of web model. It must be noted that for this analysis we incorporate an additional cut of news data, in addition to the set originally used in TopChurn (Das et al., 2015b), to enable greater overlap in the time frame between news and click logs.

3. Investigating Complementarity of Data Sources: We then perform topic inference on click logs from topic models learned on the news data, and vice versa, and compare how various topics affect readership or activity over time across both data sources.

4. Causal Analysis: Finally, from the results obtained, we perform sentiment analysis on the documents pertaining to certain selected high-performing features or topics, in order to gain further insights into readership or engagement specific to the two complementary data sources.

We now explain each of these analyses in further detail. For the purposes of this dissertation, we describe and highlight the experimentation and findings only for the portions of these experiments relating to topic inference on complementary signals, i.e., NEWS and CLICKLOGS, for topic transfer, providing only summaries for the remainder of the analyses.

3.7 Correlation Analysis

For each of our models, we derive the values of the Pearson's correlation coefficient for each of the covariates; the results are listed in the additional Tendency field shown in Figure 3.13. This directionality information provides more direct insight into the potential causes of the observed effects.
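As a concrete illustration of how the Tendency field could be derived, the sketch below computes the Pearson correlation of each covariate against the DROPPED label. It reuses the toy churn_df frame from the Information Gain sketch in the previous section, so the names and values are again illustrative rather than the actual data.

# Minimal sketch: per-covariate Pearson correlation against DROPPED, and the
# resulting Tendency sign; reuses the toy `churn_df` from the earlier sketch.
from scipy.stats import pearsonr

rows = []
for feat in churn_df.columns.drop("DROPPED"):
    r, p = pearsonr(churn_df[feat], churn_df["DROPPED"])
    rows.append((feat, round(r, 5), "+ve" if r > 0 else "-ve", round(p, 4)))

for feat, r, sign, p in sorted(rows, key=lambda x: x[1]):
    print(f"{feat:16s} {r:+.5f}  {sign}  (p={p})")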

Some of the effects that stand out from the correlation analysis are as follows:

1. A summary-statistic based feature that is positively correlated with DROPPED turns out to be SSTOP_RESTART_YNGT1 (a STOP followed by a RESTART in the near past, i.e., 3 months), for the temporal+all model, which clearly shows that a user who has recently stopped their subscription and restarted is more likely to drop.

2. The baseline feature complaints-YN is always negatively correlated with DROPPED, with the highest strength in the case of both WEB models. This is consistent with findings from the previous work, where this feature was highly predictive, indicating that transactional activity on an account, even if it is a complaint, predicts an account that is likely to remain active.

3. For the topic-based features derived from the WEB-CLICKLOGS model, a number of topics show negative correlation with DROPPED, indicating greater user interest; these include topic 02 (whose label mentions Dom Tiberi) and the other negatively correlated CLICKLOGS topics listed in Figure 3.13.

4. For the topic-based features derived from the WEB-NEWS model, several topics likewise show negative correlation with DROPPED, indicating greater user interest; these include a financial topic, a social/financial topic mentioning inflation, a topic mentioning a Center, and a topic mentioning insurance, among others listed in Figure 3.13.

5. In both cases we find that topics impacting "daily life", such as events of interest and social/law-and-order issues, disasters, and scandals, take precedence in user interest, as corroborated by both data sources, the Search log and News contexts. We do find that the CLICKLOGS have more of a local flavor, with searches for event schedules and social/law-and-order incidents, whereas the top-ranked NEWS topics have more of a national flavor; these also relate to current events of financial and social importance, which underlines both the "temporal" and the "controversial" aspects of news consumption and of top-ranked WEB topic features in general.

6. We also find that sentiment-based features such as AvgSentiment and MaxSentiment are on average more predictive for our WEB-NEWS models than for the WEB-CLICKLOG models, which matches our intuition, as we expect sentiment to be more important in news consumption than in searches. However, the relatively high predictiveness (albeit slightly lower than for WEB-NEWS) of sentiment features for searches gives a key insight, revealing perhaps something about the nature of web searches itself: that searches are highly motivated by certain user interests that appear more time-sensitive in nature.

Thus our correlation analyses via this case study reveal many interesting findings about the temporal and web models. The web topic-based models reveal that the two signals appear to complement one another: for both NEWS and CLICKLOGS we find that similar types of themes, related to events (local, national, and political) and social issues such as law and order (investigations or trials), and disasters, either natural (tornados) or man-made (such as acts of terror), play a greater role in user engagement. This warrants studying the topic features of the web models from these complementary sources more closely, perhaps by direct comparison of topics from one source with the same topics from the other. Thus, a parameterized exploration of these web models, and investigating their complementarity by topic inference on one of the sources given topic models learned from the other source, emerged as possible ways to further study these effects by juxtaposing the topic features with one another, which is what we did next.

3.8 Parameterized Exploration

Here we carry out an iterative exploration process in which we perform parameter tuning on the topic models for WEB-NEWS and WEB-CLICKLOGS. While all of our web models created in the initial round in TopChurn (Das et al., 2015b) come out statistically significant using 50 learned topics, we wanted to see if these models could be improved upon, leading finally to better insights. Thus, using a script for parameter tuning run on the news corpus (combining previous and newer news data), and then separately on the click log data, we estimate a "desired" number of topics for the web models using each type of signal. This script (Group, 2014) splits the document collection into two subsets: one used for training the models, the other used for evaluating their perplexity on unseen data (Brown et al., 1992). Perplexity is scored on the evaluation documents by first splitting each document in half. The per-document topic distribution is estimated on the first half of the words (Group, 2014). The model then computes an average of how surprised it was by the words in the second half of the document, where surprise is measured as the number of equi-probable word choices, on average, with lower numbers meaning a surer model (Group, 2014). The perplexity scores are of course not comparable across corpora, because they would be affected by different vocabulary sizes. However, they can certainly be used to compare models trained on the same data, as in this case, since we do this once for the news data and once for the clicklogs. A note of caution here is that models with better perplexity scores don't always produce more interpretable topics, or topics better suited to a particular task. Perplexity scores can only be used as stable measures for picking among alternatives, for lack of a better option (Group, 2014).
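The actual tuning used the MALLET-based script cited above (Group, 2014); the snippet below is only a rough gensim-based analogue of the same idea, sweeping the number of topics and scoring perplexity on a held-out split. The toy corpus and parameter values are placeholders.

# Rough gensim analogue of the held-out perplexity sweep described above;
# the toy corpus stands in for the tokenized NEWS or CLICKLOGS documents.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = ([["school", "closing", "weather", "snow"],
         ["game", "score", "season", "team"],
         ["economy", "market", "inflation", "jobs"]] * 200)

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
split = int(0.8 * len(bow))
train, heldout = bow[:split], bow[split:]

for num_topics in (80, 100, 120):
    lda = LdaModel(corpus=train, id2word=dictionary, num_topics=num_topics,
                   passes=5, random_state=0)
    bound = lda.log_perplexity(heldout)        # per-word likelihood bound
    print(f"topics={num_topics:4d}  held-out perplexity={2 ** (-bound):.1f}")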

In general, we expect the perplexity to go down as the number of topics increases, but the successive decreases in perplexity get smaller and smaller. A good rule of thumb is to pick a number of topics that produces reasonable output (by manual inspection of the topic summary generated for a model). After some initial tuning of parameters by hand, and varying the number of topics from 30 all the way up to 140, we found that anything below 60 was too coarse-grained for our combined news corpus as well as the clicklogs. Thus, in our final parameter tuning experiments we only vary the number of topics between 80 and 120 in steps of 20, and the minimum word length for the topic model words between 5 and 6. Parameters such as topic smoothing and term smoothing were initially also varied; however, these did not yield any noticeable differences in the topics output, so they were held constant at their default values. The new models used for running the logistic regression classifier were all generated using the same procedure, where each learned topic model, for its specific parameter combination setting, was joined with the transaction data for the 3- and 6-month baseline prediction models, as done earlier in TopChurn (Das et al., 2015b).

All of these generated web models are also statistically significant against our chosen baselines for the 3- and 6-month windows, with all of the 3-month models outperforming the 3-month baseline of AUC = 0.64, with AUC >= 0.673 for WEB-NEWS and AUC >= 0.662 for WEB-CLICKLOGS, and all of the 6-month models beating the 6-month baseline of AUC = 0.65, with AUC >= 0.673 for WEB-NEWS and AUC >= 0.665 for WEB-CLICKLOGS. From our experiments we find that the models with number of topics N = 100 perform the best over both prediction windows for both WEB-NEWS and WEB-CLICKLOGS data, for a minimum word length setting of mlf = 6. However, since web search queries tend to be short and we did not want to miss potentially important words that could affect the topic distributions, we also run experiments with mlf = 5 for WEB-CLICKLOGS; intuitively, the better performing models for this setting turn out to be the ones with N = 120, for both prediction windows, as more words being admitted into the vocabulary leads to a finer-grained topic distribution. However, the performance of these models is still lower than that of the models with mlf = 6. Hence, from the parameterized exploration phase, we go into the next step of our analysis with the notion that number of topics N = 100 and mlf = 6 are the best learned parameter settings for our purposes, leading also to computationally more reasonably sized models. Table 3.1 outlines the performance of each model along with the corresponding parameter settings.

3.9 Investigating Complementarity of Signals for Insights into Causes

In the ensuing step, we make use of the findings from the previous step to choose a topic setting that can work well for both news and click logs, so as to use the "most suitable" number of topics to perform topic inference: we first learn topic models on one dataset and then use this set of learned topics for inference on a different dataset, i.e., to learn topic distributions for the same set of topics on the other dataset. Our hypothesis from the correlation analysis is that the web news and click logs data sources complement each other in interesting ways on the same customer

Table 3.1: Performance for TopChurnWEB models with parameterized exploration

Data Source     Prediction Window   Num. of Topics   Min. Word Length   AUC
WEB-NEWS        3                   80               6                  0.6732*
WEB-NEWS        3                   100              6                  0.6730
WEB-NEWS        3                   120              6                  0.6730
WEB-NEWS        6                   80               6                  0.6738
WEB-NEWS        6                   100              6                  0.6743*
WEB-NEWS        6                   120              6                  0.6727
WEB-CLICKLOGS   3                   80               6                  0.6627
WEB-CLICKLOGS   3                   100              6                  0.6645*
WEB-CLICKLOGS   3                   120              6                  0.6631
WEB-CLICKLOGS   6                   80               6                  0.6649
WEB-CLICKLOGS   6                   100              6                  0.6664*
WEB-CLICKLOGS   6                   120              6                  0.6653
WEB-CLICKLOGS   3                   80               5                  0.6628
WEB-CLICKLOGS   3                   100              5                  0.6630
WEB-CLICKLOGS   3                   120              5                  0.6639*
WEB-CLICKLOGS   6                   80               5                  0.6650
WEB-CLICKLOGS   6                   100              5                  0.6651
WEB-CLICKLOGS   6                   120              5                  0.6655*

Table 3.2: Inferred Models Performance based on best parameters learned for TopChurnWEB

Topic Model Dataset   Topic Inference Dataset   Prediction Window   Number of Topics   AUC
WEB-NEWS              WEB-CLICKLOGS             3                   80                 0.6621
WEB-NEWS              WEB-CLICKLOGS             3                   100                0.6618
WEB-NEWS              WEB-CLICKLOGS             3                   120                0.6623*
WEB-CLICKLOGS         WEB-NEWS                  3                   80                 0.6708
WEB-CLICKLOGS         WEB-NEWS                  3                   100                0.6720
WEB-CLICKLOGS         WEB-NEWS                  3                   120                0.6725*

base, which can provide novel insights into user engagement. Thus, in this step of the investigation, we perform topic modeling on the news data, which we then infer on the click logs, and then repeat the same process in reverse. We use just one prediction window, 3 months, and fix the minimum word length to 6 for our inferred topic models, as these were run mainly to aid with the following analysis, for further exploration of the two WEB datasets in tandem. Table 3.2 outlines our findings for the performance of each of these web models derived by topic inference. As we can see, the inferred models with number of topics N = 120 have the best performance, with AUC = 0.6623 and AUC = 0.6725 for NEWS and CLICKLOGS respectively, which are also statistically significant against the baseline of AUC = 0.64 for K = 3 months. Comparing Tables 3.1 and 3.2, although the parameterized exploration experiments revealed that web models with N = 100 gave better performance overall, models with N = 120 came in a close second. Thus we choose the inferred topic distributions from this particular experiment, with number of topics N = 120, for further analysis in the next step.
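A bare-bones sketch of the cross-source inference step is shown below: topics are learned on one corpus and then inferred, with the same topic identifiers, on the other. The gensim calls and toy corpora are illustrative; the actual experiments used the MALLET-based tooling cited earlier.

# Sketch of topic transfer: learn topics on the NEWS corpus, infer the same topic
# identifiers on the CLICKLOGS corpus. Corpora here are tiny illustrative stand-ins.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

news_docs = ([["economy", "market", "inflation"],
              ["zimmerman", "trial", "verdict"],
              ["weather", "school", "closing"]] * 100)
clicklog_docs = ([["zimmerman", "trial"],
                  ["school", "closing", "weather"]] * 100)

dictionary = Dictionary(news_docs)                        # vocabulary fixed by NEWS
news_bow = [dictionary.doc2bow(d) for d in news_docs]
lda = LdaModel(news_bow, id2word=dictionary, num_topics=120,
               passes=5, random_state=0)

clicklog_bow = [dictionary.doc2bow(d) for d in clicklog_docs]
inferred = [lda.get_document_topics(d, minimum_probability=0.0) for d in clicklog_bow]
# inferred[i] is a list of (topic_id, weight) pairs over the *same* topic identifiers,
# which can then be aggregated by date to produce trend lines like those in Figure 3.14.
print(inferred[0][:5])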

Using the conclusions from the previous experiments of parameter tuning and inferred topics for model selection, our goal in this investigation is to use the "best" model from the previous findings for qualitative analysis of interesting as well as predictive topics from the two complementary signals, to arrive at novel insights from the data, which can then be further validated by sentiment analysis done on a subset of documents from those topics. The charts in Figure 3.14 illustrate some of the insights we were able to mine about readership and engagement by performing topic inference, transferring from NEWS to CLICKLOGS and vice versa, owing to the complementarity of these two signals. In all of the charts, the Y-axis represents the weight (probability) of the particular trend (topic) for the given date, and the X-axis represents the timeline.

3.10 Discussion

Here we find quite a few interesting effects, some of which are actual hidden patterns in the data and others of which are an effect of the modeling process itself. As we can see from Figures 3.14a and 3.14b, for the <Trayvon Martin, George Zimmerman Trial> topic (topic 066), which is the one directly inferred on the click logs from the news for this subject, the signal is rather weak in the click logs, albeit spread throughout the timeline, so that it is hard to see the correlation between news and click logs by looking at only this directly inferred topic feature in isolation. However, we know from the modeling process that, while inferred topics have the same identifiers as in the training dataset, because the inference dataset differs from the dataset on which the model was originally trained, the inference dataset may make use of each learned topic in a way that is not exactly the same as how the topics were used during training (Group, 2014): the term usages and distributions, and the topic distributions over documents, for the learned vs. inferred topics are also different, although the greatest possible overlap is attempted during inference. The way the topics were actually used in the inferred model can be determined by qualitatively analyzing the words in each of the inferred topics.

Analyzing the term usage statistics for the inferred model in this way, we find another topic mixture (topic 037, whose label includes the term Memorial) that has significant web search activity and in which both the terms "George" and "Zimmerman" occur. When we consider this more mixed topic, the correlation

between the two web data sources, for this particular topic learned from the news and inferred on the click logs, becomes more prominent, and can then be used to reason about user engagement for this topic from both sources. If we contrast Figures 3.14a and 3.14b, we find that while what might be the pure component of the Zimmerman Trial topic does show up in Figure 3.14a, a lot of this probability mass may have been subtracted out and added on to other topic mixtures, such as topic 037, which may actually be a noise topic. A noise topic may be one that attracts a lot of terms (and thus weight from the topic distribution) from purer topics by virtue of being similar to a lot of topics at once. Thus, it is possible that we need to empirically evaluate using inferred models with N = 100 or N = 140, instead of the N = 120 we derived, to get a purer signal in terms of analyzing correlatedness for that particular topic. This leads to the conclusion that it may not be so much that an effect is not present, as that noise components may steal weight from purer components, which is a direct effect of the modeling process and its constraints.

However, this may not be the case for all topics, such as the more prominent ones, as indicated by some of our other results. The other interesting findings in regard to this observation are, as seen in Figures 3.14c-f, that directly inferred (same identifier) topics like <Economy> (topic 001) tend to be more popular in the news than in web searches, as clearly indicated by the trend lines, which complement each other quite nicely, as the news presence for this topic seems to be backed up by lower-weighted albeit consistent web search activity. This may also allude to the fact that mundane topics like the Economy are not a topic of web searches, or to the fact that the newspaper already covers this topic well, so that users looking for any related information are able to find it easily. However, the opposite seems to be true of topics such as <Season, Scores, Players, Teams> or <Weather, School Closings, Community, Church>. Sports, Weather, and Community related topics, e.g. with respect to school shutdowns, tend to be well searched for, as indicated by Figures 3.14e and 3.14f for our model, which is an important insight for a newspaper that needs to drive up readership and user engagement: the newspaper may want to ensure that information related to these types of topics is always covered regularly with high quality, and is also easy for users to find. Additionally, we find that topics that we deem potentially sensitive or controversial, such as <Boston Marathon Bombings> and <George Zimmerman Trial>, have much purer components in the topic models, tending to have a lot of weight around the period when the topic peaked in the news; such topics also tend to be searched for not only during this peak period, but consistently over a longer period of time, even after the topic is no longer in the news.

Also, sentiment analysis of documents for these sensitive topics shows that these topics as a whole often carry a negative polarity, while also scoring higher on their subjectivity score.
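The polarity and subjectivity scoring referred to here could be done, for example, with TextBlob (the toolkit used later in Chapter 4 for noun-phrase extraction); the snippet below is a small illustrative sketch over made-up document strings, not the actual corpus.

# Illustrative sketch: average polarity / subjectivity for the documents of one topic.
from textblob import TextBlob

topic_docs = [
    "The bombing shocked the city and left residents fearful and angry.",
    "Officials praised the swift response of first responders after the tragedy.",
]

sentiments = [TextBlob(doc).sentiment for doc in topic_docs]   # (polarity, subjectivity)
avg_polarity = sum(s.polarity for s in sentiments) / len(sentiments)
avg_subjectivity = sum(s.subjectivity for s in sentiments) / len(sentiments)
print(f"avg polarity = {avg_polarity:+.3f}, avg subjectivity = {avg_subjectivity:.3f}")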

All of our models based on temporal, unigram, and Web-based topic and sentiment features show statistical significance and improved predictability over baseline models that utilize only transaction metadata, showing that features mined from Web news content and on-line user activity do have influence on newspaper subscriber engagement, and that our cross-domain topic model inference approach helped us learn to select the best model, due to potentially learning the overlap between bags of words for the same topic in a different domain. We thus carry the valuable lessons learned from this type of topic transfer to investigate complementarity of signals into the primary focus of this dissertation, i.e., the discovery of novel associations between latent concepts.

3.11 Conclusion and Future Work

A large media organization typically has at its disposal large amounts of historical transaction data with user complaints, updates to subscriptions, published content on the Web, and on-line user activity from search click logs. We build several models of subscriber churn on such a dataset, providing a comparison of the feature sets of these models with respect to their ability to predict subscriber churn, in particular juxtaposing the web-based topic and sentiment features from the NEWS and CLICKLOG sources, which we find to be complementary sources of signal in this context. All of our models based on temporal, unigram, and Web-based topic and sentiment features show statistical significance and improved predictability over baseline models that utilize only transaction metadata, showing that features mined from Web news content and on-line user activity do have influence on newspaper subscriber engagement and, ultimately, churn. Further, we find that an iterative human-analyst-in-the-loop approach for parameterized exploration of the data can reveal further novel insights from complementary signals, as more data comes into the system over time.

There is substantial potential for future work in a number of research directions.

One would be to develop and validate the effectiveness of enhanced churn models by incorporating semantically enriched content from transactional data and the Web. In addition, in the future, we may want to use topic models to investigate how news and click log search correlates with complaint topics and sentiment from the transactions.

Recent advancements in deep learning methods, such as distributed representations of words Mikolov et al. (2013b) and sentences Le and Mikolov (2014), could also potentially be employed in lieu of topic models to effectively mine these heterogeneous and complementary signals of newspaper data for churn prediction. We therefore plan to develop neural models Bengio et al. (2003); Le and Mikolov (2014) of complaints text and also news and search text, by training vector representations of these signals, to aid in future prediction and recommendation tasks.

We would also like to confirm certain trends we found in our models for prediction, e.g., the fact that our web models tend to be more predictive for larger prediction windows, by performing more experiments with many more time windows to investigate this effect. Finally, in the future we hope to incorporate a social aspect into this analysis, by integrating comments from Twitter data posted during the same time as the news or web searches (this time perhaps also including the navigational queries that we have left out of the data thus far in our analyses), and to investigate what kinds of trends emerge, which we would validate via potential user studies or crowd-sourcing.

We believe that the ultimate next step for our work would be to scale our models for prediction into an online setting with guided inputs from a human expert, so as to be able to recommend or project, in a real-time manner, the specific topic areas within news, web searches, and user complaints that are driving user engagement at the present time, or within a specified time period.

Figure 3.11: Most informative features, K = 3

Model (num. features)              Feature               Infogain   Rank
base+complaintstats+3 (3)          complaints-gt-5-YN    0.0421     1
                                   complaints-3-5-YN     0.0132     2
base+complaintsentiment+3 (3)      AvgSentiment          0.0314     1
                                   MaxSentiment          0.0302     2
base+statusseq+3 (2)               StatusSeq             0.6708     1
                                   complaints-YN         0.0706     2
base+temporal+short+3 (7)          complaints-YN         0.0706     0
                                   TotalTrans            0.0626     1
                                   NumComplaints         0.0391     2
                                   AvgComplaintGap       0.0272     3
base+temporal+long+3 (7)           TotalTrans            0.1311     1
                                   complaints-YN         0.0706     2
                                   NumComplaints         0.0685     3
                                   AvgGapComplaints      0.0604     4
base+temporal+all+3 (13)           LongTermTotalTrans    0.1311     1
                                   complaints-YN         0.0706     2
                                   NumComplaints         0.0688     3
                                   ShortTermTotalTrans   0.0626     4
                                   AvgComplaintGap       0.0604     5
base+web+clicklog+all+3 (54)       NumMatches            0.0172     1
                                   topic05               0.017      2
                                   topic15               0.0167     6
                                   MaxSentiment          0.0166     9
                                   AvgSentiment          0.0166     10
                                   topic33               0.0164     13
base+web+news+all+3 (54)           NumMatches            0.0228     1
                                   topic32               0.0113     8
                                   topic03               0.0112     11
                                   topic00               0.0105     22
                                   ...
                                   AvgSentiment          0.0084     53
                                   MaxSentiment          0.0083     54
base+web+news+sentiment+3 (4)      NumMatches            0.0228     1
                                   AvgSentiment          0.0084     2
                                   MaxSentiment          0.0083     3
base+topKunigrams+3 (110)          missed-YN             0.0659     1
                                   todays-YN             0.0633     2
                                   redelivered-YN        0.0452     3
                                   paper-YN              0.0434     4

Figure 3.12: Feature comparison with K = 6

Model (num. features)         Feature            Infogain   Rank
base+temporal+long+6 (7)      TotalTrans         0.0831     1
                              complaints-YN      0.0721     2
                              NumComplaints      0.0659     3
                              AvgComplaintGap    0.0595     4
                              NumProdChg         0.0586     5
base+temporal+short+6 (7)     complaints-YN      0.0720     1
                              NumComplaints      0.0343     2
                              AvgComplaintGap    0.0247     3
                              TotalTrans         0.0197     4
                              NumProdChg         0.0177     5
base+temporal+all+6 (13)      TotalTrans         0.0831     1
                              complaints-YN      0.0720     2
                              NumComplaints      0.0661     3
                              AvgComplaintGap    0.0595     4
                              NumProdChg         0.0588     5
base+sentiment+6 (3)          complaints-YN      0.0722     1
                              AvgSentiment       0.0317     2
                              MaxSentiment       0.0309     3

Model (num. features)             Feature               Infogain    Rank   Tendency
base+complaintstats+3 (3)         complaints-YN         0.0706      1      -ve
                                  complaints-gt-5-YN    0.0421      2      -ve
                                  complaints-3-5-YN     0.0132      3      -ve
base+complaintsentiment+3 (3)     complaints-YN         0.0706      1      -ve
                                  AvgSent               0.0314      2      -ve
                                  MaxSent               0.0302      3      -ve
base+temporal+short+3 (7)         complaints-YN         0.07059     0      -ve
                                  TotalTrans            0.06261     1      -ve
                                  NumComplaints         0.03911     2      -ve
                                  AvgGapComplaints      0.02716     3      -ve
base+temporal+long+3 (7)          TotalTrans            0.1311      1      -ve
                                  complaints-YN         0.07058     2      -ve
                                  NumComplaints         0.06854     3      -ve
                                  AvgGapComplaints      0.0604      4      -ve
base+temporal+all+3 (13)          LongTermTotalTrans    0.13112     1      -ve
                                  complaints-YN         0.0706      2      -ve
                                  NumComplaints         0.06882     3      -ve
                                  ShortTermTotalTrans   0.06261     4      -ve
                                  AvgGapComplaints      0.06042     5      -ve
                                  SSTOP_RESTART_YNGT1   0.00203     12     +ve
base+web+clicklog+all+3 (54)      complaints-YN         -0.31383    1      -ve
                                  topic21               -0.05995    2      -ve
                                  topic04               -0.05984    3      -ve
                                  topic29               -0.05814    4      -ve
                                  topic09               -0.05292    7      -ve
                                  topic48               -0.05201    8      -ve
                                  topic02               -0.05002    10     -ve
                                  topic34               -0.04607    14     -ve
                                  topic28               -0.03712    16     -ve
                                  topic35               -0.03595    18     -ve
                                  topic15               -0.02718    27     +ve
                                  MaxSentiment          0.03204     29     +ve
                                  AvgSentiment          0.03205     30     +ve
                                  topic11               0.05275     36     +ve
                                  topic33               0.06055     41     +ve
                                  topic37               0.06454     46     +ve
                                  topic19               0.07088     51     +ve
                                  topic13               0.07122     52     +ve
                                  topic05               0.07570     53     +ve
base+web+news+all+3 (54)          complaints-YN         -0.31383    1      -ve
                                  topic13               -0.07919    2      -ve
                                  topic03               -0.05207    3      -ve
                                  topic08               -0.04990    4      -ve
                                  topic09               -0.04989    5      -ve
                                  topic23               -0.04896    6      -ve
                                  topic24               -0.04717    7      -ve
                                  topic39               -0.04327    8      -ve
                                  topic41               -0.04212    9      -ve
                                  topic12               -0.04203    10     -ve
                                  AvgSentiment          0.00739     19     +ve
                                  MaxSentiment          0.04943     38     +ve
                                  topic34               0.03205     47     +ve
                                  topic40               0.07151     51     +ve
                                  topic35               0.07204     53     +ve
                                  topic32               0.09957     54     +ve
base+web+news+sentiment+3 (4)     complaints-YN         0.07069     1      -ve
                                  NumMatches            0.0228      2      +ve
                                  AvgSentiment          0.00836     3      +ve
                                  MaxSentiment          0.00831     4      +ve

Figure 3.13: Correlation analysis of predictive features in TopChurn models

Fig. 3.14 a. Trayvon Martin, George Zimmerman Trial — Topic 066
Fig. 3.14 b. Trayvon Martin, George Zimmerman Trial — Topics 066, 037
Fig. 3.14 c. Economy — Topic 001
Fig. 3.14 d. Boston Bombings — Topic 036
Fig. 3.14 e. Season, Scores, Players, Teams — Topic 073
Fig. 3.14 f. Weather, School Closings, Community, Church — Topic 031

Figure 3.14: Comparison between topics learned from WEB-NEWS and fit/inferred on WEB-CLICKLOGS

CHAPTER 4

PHRASE2VECGLM: NEURAL GENERALIZED LANGUAGE MODEL–BASED SEMANTIC TAGGING FOR COMPLEX QUERY REFORMULATION

4.1 Introduction

As explained in Section 1.4.2, "latent concept" refers to the nominalized sense of an underlying concept that may not be explicitly mentioned in a collection and may or may not exist in some ontology or knowledge source, but appeals to a common-sense understanding of the real world. E.g., the alternate lexical forms high blood sugar and elevated glucose level may exist in separate documents in a collection, both referring to the same underlying latent concept hyperglycemia, which may not occur explicitly anywhere in that collection. This example represents a sameAs relationship between the alternate lexical forms high blood sugar and elevated glucose level (corresponding to a binary predicate of the same name) for a latent concept hyperglycemia, which does exist in an ontology called the UMLS. Figure 4.1 illustrates exactly this idea, and serves to provide some intuition regarding how the proposed solution addresses the given problem definition.

Figure 4.1: Example of Hidden Associations between Latent Concepts for the Medical Domain present in the UMLS ontology.

Here, the light blue ovals represent different Contexts as defined in Section 1.5, the orange ovals represent Concepts, in this case identified in the UMLS ontology with a unique CUI identifier, and the small green ovals represent hyponym Concepts, regarded as atoms in the UMLS. The purple arrows represent the possible co-occurrence of hyponyms in a shared context, thus giving rise to a novel association or latent relation, such as being administered in between the concepts "Antihypertensive Agents" and "Taking Medications...", and is treatment for between the concepts "Taking Medications..." and "Episodic hypertension", shown by the broken red arrows.

Knowledge graphs, traditionally driven by knowledge bases, represent facts about entities and tend to capture their relationships very well, achieving state-of-the-art performance in fact-based information retrieval. However, in domains such as medical information retrieval, where addressing the specific information needs of complex queries requires understanding the query intent by capturing novel associations between such latent concepts as shown in Figure 4.1, these systems can fall short. In this work, we develop a novel, completely unsupervised, neural generalized language model-based document ranking approach to semantic tagging of documents. Using the document to be tagged as a query into the model, we retrieve candidate phrases from its top-ranked related documents, and thus associate every document with novel related concepts extracted from the text. Extending the word embedding-based generalized language model due to Ganguly et al. (2015a) to employ phrasal embeddings, we use the semantic tags thus obtained in downstream query expansion, both directly and in feedback-loop settings. Our method, evaluated using the TREC 2016 clinical decision support challenge dataset, shows statistically significant improvement, not only over various baselines that use standard MeSH terms and UMLS concepts for query expansion, but also over baselines using human expert–assigned concept tags for the queries, run on top of a standard Okapi BM25 based document retrieval system.

4.2 Motivation and Background

A central tenet of classical information retrieval is that the user is driven by an "information need", defined by Shneiderman, Byrd, and Croft as "the perceived need for information that leads to someone using an information retrieval system in the first place". Furthermore, in a taxonomy of web searches that can be extended to any large-scale document retrieval system, an "informational search" is one where the intent is to acquire some information assumed to be present on one or more web pages (Shneiderman et al., 1997; Broder, 2002). The purpose of such queries is to find information assumed to be available on the web or document retrieval system, in a static form, i.e., no further interaction is predicted, except reading. By static form we mean that the target document is not created in response to the user query. This distinction is important to make, since the blending of results characteristic of third-generation search engines might lead to dynamic pages. In any case, informational queries are closest to classic IR, and this is the type of querying scenario we primarily address in this work. What is notable about informational queries, say for example on the Web, is that some are extremely wide, for instance cars or San Francisco, while some are narrow, for instance "normocytic anemia" or "Scoville heat units". It is interesting to note that in almost 15% of all searches the desired target is a good collection of links or documents on the subject, rather than a single good document, i.e., a good hub rather than a good authority (Broder, 2002; Kleinberg, 1999).

The hypothesis is that, similar to human experts who can determine the aboutness of an unseen document by recollecting meaningful latent concepts gleaned via shared contexts learned from past experience, a completely unsupervised machine learning model could be trained to associate documents within a large collection with meaningful concepts "discovered" (Li et al., 2011; Halpin et al., 2007; Bhagat and Hovy, 2013; Xu et al., 2014; Kholghi et al., 2015) by fully leveraging shared contexts within and between documents that are found to be "related" (Turney and Pantel, 2010; Pantel et al., 2007; Bhagat and Ravichandran, 2008), to better address the vocabulary gap between potential user queries and the documents, thereby naturally augmenting results for downstream retrieval or question answering tasks (Lin and Pantel, 2001a; Diekema et al., 2003; McAuley and Yang, 2016). To our knowledge, ours is the first work that employs word and phrase-level embeddings in a language modeling approach, as opposed to a VSM-based approach (Sordoni et al., 2014), to semantically tag documents with appropriate concepts for use in retrieval tasks (Kholghi et al., 2015; De Vine et al., 2014; Zhang et al., 2016; Zuccon et al., 2015; Sordoni et al., 2014; Tuarob et al., 2013), in contrast to previous similar approaches to document categorization for retrieval based on clustering (Lin and Pantel, 2002, 2001b), LDA-based topic modeling (Blei et al., 2003; Griffiths and Steyvers, 2004; Tuarob et al., 2013), and supervised or active learning approaches (Kholghi et al., 2015).

A major problem with approaches like LSA (Deerwester et al., 1990) (refer to the Ch. 2 sections) and LDA-based topic modeling (refer to the Ch. 2 sections) (Blei et al., 2003; Griffiths and Steyvers, 2004) is that, unlike more local methods like CBOW or skip-gram (Mikolov et al., 2013b), they only consider word co-occurrences at the level of full documents to model term associations, instead of smaller neighborhoods within documents, which may not always be reliable. Furthermore, these are parameterized approaches, where the number of topics K is fixed, and the final topics learned are available as bags of words or n-grams from which topic labels must still be inferred by an expert. Contrast this with a tagging approach for semantic labeling of documents, where a much more intuitive, directly human-interpretable meaning is attached to the resource. Thus, our approach, using word and phrasal embeddings in a language model framework that takes into account local co-occurrence information of terms in the top-ranked documents retrieved in response to a query (corresponding to the relevance feedback step in IR), may lead to better modeling of query versus document term dependencies (Ganguly et al., 2015b) by better capturing the latent concepts represented in the documents, and therefore lends itself to direct unsupervised extraction of meaningful terms related to a query document or query.

Thus, in this work, we develop a novel, fully unsupervised, language model–based ranking method that uses top-ranked candidate phrases for semantic tagging of documents, thus associating them with novel related concepts extracted from the text. For this we extend the word embedding-based generalized language model due to Ganguly et al. (2015b) to employ phrasal embeddings, and evaluate its performance on an IR task using the TREC 2016 clinical decision support challenge dataset. Our semantic tag recommendation method, used for query expansion both directly and via a feedback loop, shows statistically significant improvement not just over various baselines that use standard MeSH terms and UMLS concepts for query expansion, but also over baselines using expert–assigned concept tags for queries, all evaluated on top of a standard Okapi BM25 based document retrieval system. The main contributions of this work are as follows:

1. We present a novel use for a language modeling approach that leverages shared context between documents within a collection via learnt embeddings, finding the right trade-off between the local context around each term versus its global context within the collection, to discover meaningful term associations.

2. Our method is fully non-parametric, making no assumptions about the underlying distribution of the data, and fully unsupervised, in that it includes no outside sources of knowledge in the training, leveraging instead the shared contexts within the document collection itself, via word and phrasal embeddings, mimicking a human that potentially reads through the documents in the collection and uses the seen information to make relevant concept tag judgments.

3. Our method presents a black-box approach for tagging any corpus of documents with meaningful concepts, treating it as a closed system. Thus the concept associations can be pre-computed offline or periodically, as new documents are added to the collection, and can reside outside of the document retrieval system, allowing it to be plugged into any such system, or for the underlying retrieval system to be changed.

4.3 Problem Definition

4.3.1 Dataset and Task

The TREC Clinical Decision Support (CDS) task track investigates techniques for linking medical records to information relevant for patient care, where the goal is to evaluate biomedical literature retrieval systems for providing answers to generic clinical questions about patient cases. In order to make biomedical information more accessible and to meet the requirements for the meaningful use of electronic health records, a goal of clinical decision support systems is to anticipate the needs of physicians by linking medical cases with information relevant for patient care; the track thus seeks to simulate the requirements of such systems and encourage the creation of tools and resources necessary for their implementation (Roberts et al., 2016a).

For the 2016 TREC CDS challenge, actual electronic health records (EHR) of patients, in the form of case reports as shown in Figure 4.2, are used, where a case report typically describes a challenging medical case: for our purposes, a complex, subjective query whose information need we want to meet. The query topics in the challenge, 30 in total, corresponding to such case reports, are divided into 3 topic types: Diagnosis, Test, and Treatment. The target document collection for the track is the Open Access Subset of PubMed Central (PMC). PMC is an online digital database of freely available full-text biomedical literature. Because documents are constantly being added to PMC, to ensure the consistency of the collection the 2016 task provides a snapshot of the open access subset from March 28, 2016, containing 1.25 million articles consisting of title, keywords, abstract, and body sections.

In this work, we develop our semantic tag recommendation method for query expansion as a blackbox system, using only a subset of the entire collection corresponding to a set of 100K documents for which partial human judgments, and therefore inferred-measures evaluation (Voorhees, 2014), are made available by TREC CDS, and we evaluate our method on a separate search engine setup using an ElasticSearch (Gormley and Tong, 2015) instance that indexes all of the 1.25 million PMC articles provided by the TREC CDS challenge (Chen et al., 2016). Our unsupervised document tagging method, as outlined in Section 4.6, only employs the abstract field of the 100K PMC articles for developing the general language model–based document ranking for tag recommendation, used for subsequent query expansion.

4.4 Semantic tag recommendation models

Search, especially complex search, using only a few keywords from the query is not usually very efficient. In general, a particular information need may consist of various topics and as such can be represented with different keywords, which may not coincide exactly with the terms entered in the query by the user (Rivas et al., 2014; Lin and Pantel, 2001b,a; Bhagat and Ravichandran, 2008). The user query can include keywords that are not present in documents, but documents could be relevant because they have other words with the same meaning (Rivas et al., 2014; Turney and Pantel, 2010; Lin and Pantel, 2001b,a; Bhagat and Ravichandran, 2008). One way to alleviate the issue of vocabulary mismatch could be via learning semantic classes or related candidate concepts in the text (Lin and Pantel, 2001b, 2002; Xu et al., 2014; Bhagat and Ravichandran, 2008) and subsequently tagging documents or content with these semantic concept tags, which can then be used as a means for either query-document keyword matching or for query expansion, to facilitate downstream retrieval or question answering tasks (Li et al., 2011; Tuarob et al., 2013; Halpin et al., 2007; Lin and Pantel, 2001a; McAuley and Yang, 2016).

Li et al. take an approach similar to ours, in that they attempt to emulate human tagging behavior to recommend tags by considering the concepts contained in documents. However, their concept space is derived from Wikipedia, the largest knowledge repository, and they model the likelihood that a tag represents a KB-based concept in the document (Li et al., 2011). Halpin et al., in their detailed study of the dynamics of human tagging behavior, rightly observe that tags provide the link between the users of a system and the resources or concepts they search for, hence serving as a methodology for information retrieval. Also, tagging allows for much greater adaptability in organizing information than do formal classification systems such as ontologies (Halpin et al., 2007). Thus "groups of users do not have to agree on a hierarchy of tags or detailed taxonomy, they only need to agree, in a general sense, on the meaning of a tag enough to label similar content with terms for there to be shared value" (Mathes, 2010; Halpin et al., 2007). Given the right context, such as complex subjective queries (McAuley and Yang, 2016; Luo et al., 2008) or adaptive retrieval (Teo et al., 2016), tagging may be able to retrieve the data more efficiently than classifying (Mathes, 2010; Halpin et al., 2007). As Butterfield (Butterfield, 2004) points out, "Free typing loose associations is just a lot easier than making a decision about the degree of match to a pre-defined category (especially hierarchical ones). It's like 90% of the value of a proper taxonomy, but 10 times simpler" (Butterfield, 2004; Halpin et al., 2007). Liu et al. (Liu et al., 2011; Tuarob et al., 2013) propose a tag recommendation model using machine translation (Zuccon et al., 2015), but their algorithm basically trains the translation model to translate the textual description of a document in the training set into its tags. Tuarob et al. (Tuarob et al., 2013) develop algorithms for automatic annotation of metadata over a large-scale distributed metadata management system, where they transform the problem of making large digital repositories of complex resources readily searchable into a tag recommendation problem with a controlled tag library, and propose two variants of an algorithm for recommending tags, one employing a TF-IDF-based document similarity measure and the other using an LDA topic modeling-based document similarity measure for generating the library. Kholghi et al. contrast the performance of an active learning framework vs. a fully supervised approach towards the automated extraction of medical concepts from clinical free-text reports, for reducing manual annotation effort (Kholghi et al., 2015).

4.5 Query Expansion

Using query expansion (QE) techniques, which are widely employed for improving the efficiency of textual information retrieval systems, a query is reformulated to improve retrieval performance and obtain additional relevant documents, by expanding the original query with additional relevant terms and re-weighting the terms in the expanded query, helping to overcome vocabulary mismatch issues, including words in queries with the same or related meaning (Zhang et al., 2016; Zuccon et al., 2015; Kholghi et al., 2015; Rivas et al., 2014). Automatic query expansion techniques have been extensively used in information retrieval as a means of addressing the vocabulary gap between queries and documents; these can be further categorized as either global or local. While global techniques rely on analysis of a whole collection to discover word relationships, local techniques emphasize analysis of the top-ranked documents retrieved for a query (Xu and Croft, 2000; Manning et al., 2009). Global methods include: (a) query expansion/reformulation with a thesaurus such as WordNet, (b) query expansion via automatic thesaurus generation, and (c) techniques like spelling correction (Manning et al., 2009). Local methods adjust a query relative to the documents that initially appear to match the query, which is the basic idea behind our language modeling approach to semantic tagging. The basic local methods are: (a) relevance feedback, (b) pseudo-relevance feedback (a.k.a. blind relevance feedback), and (c) (global) indirect relevance feedback (Manning et al., 2009), of which relevance feedback is one of the most used and most successful approaches, where the system returns a revised set of results based on user feedback on an initial set of results. Our proposed method thus tries to mimic this human user behavior via pseudo-relevance feedback, pre-tagging documents semantically so as to later aid downstream retrieval for direct or refined querying.

4.6 Research Methodology

In contrast to VSM-based query expansion methods like QEM by Sordoni et al. and many others cited in the neural IR literature, we do not project the queries and documents into the same semantic space (Sordoni et al., 2014; Zhang et al., 2016). In fact, we do not model queries at all, modeling solely the documents in the collection and training completely unsupervised on the dataset to leverage shared contexts, using the top-ranked phrasal candidate latent concepts output by our model to semantically tag documents, so that our model training is not fitted to specific queries in the dataset. Rather, the potential latent concepts describing a particular document are learnt via our phrasal GLM, and may change or increase with more documents (data) coming in. Not only does this make our model output humanly interpretable via the actual document concept-tags, it also makes our model truly generalized, catering to any new unseen queries and leading to some gain from automated QE (Mathes, 2010; Halpin et al., 2007; Kholghi et al., 2015).

This is particularly useful in the specific use case of medical information retrieval, in making clinical decisions, where physicians often seek out information about how to best care for their patients (Luo et al., 2008; Kholghi et al., 2015). Thus, information relevant to a physician may be for a variety of clinical tasks such as determining a

Figure 4.2: Sample query 'topic' from the TREC 2016 challenge, showing a clinical note with patient history, at 3 levels of granularity: Note, Description, and Summary.

patient's most likely diagnosis given a list of symptoms, deciding on the most effective treatment plan for a patient, and determining whether a particular test is indicated. Sometimes physicians can find the information they seek in published biomedical literature; however, given the volume of existing literature and the rapid pace at which new research is published, locating the most relevant and timely information for a particular clinical need can be a daunting and time-consuming task (Roberts et al., 2016a), given that there may be no single right answer. Thus our work makes novel use of a phrasal-embedding based (De Vine et al., 2014) language modeling approach to semantically tag documents with phrases representative of latent concepts within those documents (Kholghi et al., 2015; De Vine et al., 2014), and to make the collection more readily searchable (Halpin et al., 2007; Li et al., 2011) by using these tags for query expansion (Zhang et al., 2016; Sordoni et al., 2014) in downstream information retrieval. Additionally, our method treats all queries in our dataset as unseen at test time, on which our actual results and gains are reported.

In this work, a concept, defined exactly as in Section 1.4, is extracted from the corpus according to the following heuristic: it is chosen to be a candidate term or phrase scored as important by a chosen metric, for later use in our algorithm, either as a query term representing a query document (Yang et al., 2009), or as a concept tag for a target document. Thus our concepts, primarily extracted noun phrases, vary from a single unigram term to up to three unigrams, as employed by our phrasal embedding-based language model.

We hypothesize that, since word embedding techniques use the information around the local context of each word to derive the embeddings, using these within an LM to derive terms or concepts that may be closely associated with a given document in the collection, despite no lexical overlap between the query and that document (Ganguly et al., 2015b), and further extending the model to use embeddings of candidate noun phrases, could leverage such shared contexts toward query expansion, potentially augmenting both: 1) the global context analysis in IR, leading to better downstream retrieval performance from direct query expansion, and 2) the local context analysis from top-ranked documents, leading to better query refinement within a relevance feedback loop (Su et al., 2015; Xu and Croft, 2000).

Using the phrasal embedding based general language model described in Section 4.7, we generate top-ranked document sets for each query document. We subsequently select concepts to tag query documents with from the top-ranked sets. We apply our language model based latent concept discovery to query expansion both directly as well as via relevance feedback, evaluating the expanded queries on a separate ElasticSearch-indexed search engine setup, showing improvement in both (Chen et al., 2016; Gormley and Tong, 2015).

4.7 A Phrasal Embedding-based General LM for semantic tag recommendation by knowledge transfer

We hypothesize that for automatically associating documents with meaningful concepts extracted from the text, which are not necessarily present in the target document (query document), leveraging shared local contexts between documents through the possible use of word or document embeddings (Mikolov et al., 2013b; Goldberg and Levy, 2014; Le and Mikolov, 2014) could provide the necessary advantage. To this end, we find that the word-embedding based general language model of Ganguly et al. (Ganguly et al., 2015b), designed specifically for IR, which models term dependencies using the vector embeddings of terms, lends itself exactly to this. Thus we extend this model in our work to include embeddings of candidate noun phrases, which could ultimately better help to rank documents matching a query document. These ranked matching documents are then used in a final step of concept selection to tag our query document. The rest of this section outlines the details of this model.

One general approach to IR treats the estimation of the probability of a document given a query, P(d|q), as a normalized product of query term probabilities for each document (Zhai and Lafferty, 2004). That is,

$$P(d|q) = \frac{P(q|d)\,P(d)}{\sum_{d' \in C} P(q|d')\,P(d')} \propto P(q|d)\,P(d) = P(d)\prod_{q' \in q} P(q'|d)$$

(Ganguly et al., 2015b). In the simplest case, P(d) is assumed to be uniform, and so does not affect document ranking (Zhai and Lafferty, 2004). It is important to note here that, since we use this document ranking approach to rank other documents with respect to a query document $d_q$, the q in our formulations really refers to a query document $d_q$. Thus,

$$P(d|q) = \prod_{q' \in q} \lambda \hat{P}(q'|d) + (1-\lambda)\,\hat{P}(q'|C) = \prod_{q' \in q} \lambda\,\frac{tf(q',d)}{|d|} + (1-\lambda)\,\frac{cf(q')}{|C|} \qquad (4.1)$$

where Equation 4.1 represents the Jelinek–Mercer method of smoothing (Zhai and Lafferty, 2004; Ganguly et al., 2015b), which involves a linear interpolation of the estimated maximum likelihood probabilities $\hat{P}(q'|d)$ of generating the query term q' from the document d, and $\hat{P}(q'|C)$ of generating the query term q' from the collection, using a coefficient λ to control the influence of each; here tf(q', d) denotes the term frequency of query term q' in document d, cf(q') denotes the collection frequency of term q', and |d| and |C| denote document length and collection size respectively.
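As a small worked illustration of Equation 4.1, the snippet below evaluates the smoothed probability of a single query term with made-up counts; the value of λ and the frequencies are purely illustrative.

# Toy evaluation of Equation 4.1 (Jelinek-Mercer smoothing) for a single query term.
lam = 0.6                              # interpolation coefficient lambda
tf_qd, doc_len = 3, 150                # tf(q', d) and |d|
cf_q, coll_size = 1_200, 5_000_000     # cf(q') and |C|

p_q_given_d = lam * (tf_qd / doc_len) + (1 - lam) * (cf_q / coll_size)
print(f"P(q'|d) = {p_q_given_d:.6f}")  # 0.6*0.0200 + 0.4*0.00024 = 0.012096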

However, this still leads to poor probability estimation for query terms that do not appear in the document, due to the assumption that the query terms are sampled independently from either the document or the collection. Ganguly et al. (2015b) smooth this probability estimation by proposing a generalization of the model in Equation 4.1: instead of assuming that the terms are mutually independent during the sampling process, they relax this assumption to propose a generative process in which a noisy channel may transform a term t, sampled from a document d or the collection C with probabilities α and β respectively, into a query term q'. Thus, by this model we have:

$$\prod_{q' \in q} P(q'|d) = \prod_{q' \in q} \Big[\, \lambda P(q'|d) + \alpha \sum_{t \in d} \hat{P}_{sim\_doc}(q', t|d) + \beta \sum_{t \in d} \hat{P}_{sim\_Coll}(q', t|d) + (1-\lambda-\alpha-\beta)\,P(q'|C) \Big] \qquad (4.2)$$

Here P(q'|d) and P(q'|C) are the same as direct term sampling without transformation, from either the document d or the collection C, by the regular LM in Equation 4.1, when t = q'.

However, when $t \neq q'$, we may sample the term t either from document d or from collection C, where the term t is transformed to q'. When t is sampled from d, since the probability of selecting a query term q', given the sampled term t, is proportional to the similarity of q' with t, where sim(q', t) is the cosine similarity between the vector representations of q' and t, and P(d) is the sum of the similarity values between all term pairs occurring in document d, the document term transformation probability can be estimated as:

$$\hat{P}_{sim\_doc}(q', t|d) = \frac{sim(q', t)}{P(d)} \cdot \frac{tf(t, d)}{|d|} \qquad (4.3)$$

Similarly, when t is sampled from C, where for the normalization constant, instead of considering all (q', t) pairs in C, we restrict to a small neighbourhood of, say, 3 terms around the query term q', i.e., $N_{q'}$, to reduce the effect of noisy terms, the collection term transformation probability can be estimated as:

$$\hat{P}_{sim\_Coll}(q', t|d) = \frac{sim(q', t)}{\sum_{t' \in N_{q'}} sim(q', t')} \cdot \frac{cf(t)}{|C|} \qquad (4.4)$$

Equation 4.2 combines all of these term transformation events by denoting the probability of observing a query term q' without transformation (the standard LM) as λ, that of a document sampling–based transformation as α, and the probability of a collection sampling–based transformation as β.

Thus, per Equations 4.1 and 4.2, deriving the posterior probabilities P(d|q) for ranking documents with respect to a query involves maximizing the conditional log likelihood (equivalent to minimizing the negative log likelihood) of the query terms in a query q given the document d, i.e.:

$$\arg\min_{\alpha,\beta} \; -\sum_{q' \in q} \log P(q'|d) \qquad (4.5)$$

In the model given by Equations 4.2 and 4.5, the term t being sampled from a document or the collection represents a unigram word embedding. Our main contributions in extending this model are: (1) rather than changing the matching function of the IR system, we use the transformation model to select new query expansion terms; we do this by extending the vocabulary to include noun phrases of up to length three, extracted from the text; and (2) we thus adapt the model to incorporate phrase-based LMs, which require phrasal embeddings. The term t in Equations 4.3 and 4.4 may therefore no longer represent a unigram word embedding, but may be the same-length vector representation of an extracted noun phrase c. The vocabulary now consists of additional phrase terms, introducing more contextually meaningful terms into the set used in term similarity calculations, and thus into candidate-concept matching by the extended LM, which we believe gives additional leverage in final query term expansion via LM–based document ranking. Thus the query enhancement mechanism and the IR system are decoupled and can exist as separate blackbox systems.
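To make the ranking computation concrete, the following is a minimal sketch of how a document could be scored against a query document under Equations 4.1-4.4, assuming embeddings (unigram and phrase keys alike) in a dict-like vectors mapping, a collection-frequency table cf, and the hyperparameters λ, α, β. All names, the toy call at the end, and the simplified normalizations are assumptions for illustration, not the exact implementation.

# Sketch of the phrasal GLM score (Eqs. 4.1-4.4) for one (query document, document)
# pair; `vectors`, `cf`, `coll_size` and the hyperparameters are assumed inputs.
import numpy as np

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def glm_neg_log_likelihood(query_terms, doc_terms, vectors, cf, coll_size,
                           lam=0.5, alpha=0.2, beta=0.1, neighborhood=3):
    d_len = len(doc_terms)
    doc_tf = {t: doc_terms.count(t) for t in set(doc_terms)}
    doc_vecs = [vectors[t] for t in doc_tf if t in vectors]
    # P(d): sum of pairwise similarities of terms in the document (normalizer of Eq. 4.3)
    p_d = sum(cos_sim(u, v) for u in doc_vecs for v in doc_vecs) or 1.0

    nll = 0.0
    for q in query_terms:
        p = lam * doc_tf.get(q, 0) / d_len                           # direct, from d
        p += (1 - lam - alpha - beta) * cf.get(q, 0) / coll_size     # direct, from C
        if q in vectors:
            sims = [(t, cos_sim(vectors[q], vectors[t]))
                    for t in doc_tf if t in vectors and t != q]
            # Eq. 4.3: a document term t transformed into q
            p += alpha * sum(s / p_d * doc_tf[t] / d_len for t, s in sims)
            # Eq. 4.4: transformation restricted to a small neighborhood N_q of q
            top = sorted(sims, key=lambda x: x[1], reverse=True)[:neighborhood]
            norm = sum(s for _, s in top) or 1.0
            p += beta * sum(s / norm * cf.get(t, 0) / coll_size for t, s in top)
        nll -= np.log(p + 1e-12)
    return nll   # lower is better; documents are ranked by this score per Eq. 4.5

# Toy call with 2-d embeddings and made-up frequencies:
vecs = {"hypertension": np.array([1.0, 0.2]), "blood_pressure": np.array([0.9, 0.3]),
        "insulin": np.array([0.1, 1.0])}
score = glm_neg_log_likelihood(["hypertension"], ["blood_pressure", "insulin"],
                               vecs, cf={"blood_pressure": 50, "insulin": 80},
                               coll_size=10_000)
print(round(score, 4))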

4.8 Algorithm

Our algorithm for semantic tag recommendation for query expansion broadly works by discovering concepts that are similar in the shared local contexts in which they occur within certain documents, and that, based on a threshold criterion, are used to tag a document. The algorithm consists of two main parts: 1) a document scoring and ranking module that directly applies the phrasal embedding–based generalized language model described in Section 4.7, and 2) a concept selection module that tags the query document with concepts coming from the set of top-ranked documents matching it in step 1. We implement two variations of the concept selection scheme: (i) selecting the top TF-IDF term from each of the top-K matching documents as a set of diverse concepts representative of the query document, and (ii) selecting the top-similar concept terms matching each of the representative query document terms, using similarities computed over the top-ranked set of documents. A sketch of this two-part procedure is given below.
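The following sketch outlines the two modules, using concept selection variation (i); the glm_score and tfidf arguments are placeholders standing in for the GLM document-scoring function of the previous sections and the pre-computed TF-IDF statistics.

def tag_query_document(query_doc, collection, glm_score, tfidf, top_k=10, tags_per_doc=1):
    """Part 1: rank every other document in the collection against the query
    document using the (unigram or phrasal) GLM.  Part 2: select concept tags
    from the top-ranked matches (variation (i): top TF-IDF term per document)."""
    # Part 1: document scoring and ranking via the GLM
    scored = [(glm_score(query_doc, d), d) for d in collection if d is not query_doc]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_docs = [d for _, d in scored[:top_k]]

    # Part 2, variation (i): one top TF-IDF term from each top-ranked document,
    # yielding a diverse set of concepts representative of the query document
    tags = []
    for d in top_docs:
        ranked_terms = sorted(tfidf[d].items(), key=lambda kv: kv[1], reverse=True)
        tags.extend(term for term, _ in ranked_terms[:tags_per_doc])
    return tags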

4.9 Data Pre-processing for Phrasal LM

We first pre-process the documents in our collection by lower-casing and lemmatizing the text and removing most punctuation, such as commas, periods and ampersands, keeping however the hyphens, in order to retain hyphenated unigrams, and also keeping semi-colons and colons for context. We use regular expressions to retain periods that occur within a decimal value, replacing these with the string "decimal", which then gets its own vector representation. Since we implement both unigram and phrasal embedding–based GLM models, we process the same text corpus accordingly for each. For the unigram model, our tokens are single or hyphenated words in the corpus. For the phrasal model, we perform an additional step of iterating through each document in the corpus, extracting the noun phrases in each using the textblob (Loria, 2014) toolkit. This at times gave phrases of up to length six, so we only admit ones of size up to three, which may include some hyphenated words, to avoid tiny frequency counts. We then plug these extracted phrases back into the documents to obtain a phrase-based corpus for training that has both unigrams and phrases. We then precompute various document- and collection-level statistics such as raw counts, term frequencies (phrase frequencies for the phrasal corpus), IDF and TF-IDF (Sparck Jones, 1972) for the terms and phrases. Following this, we generate various embedding models (Mikolov et al., 2013b) for both our unigram and phrasal corpora, with different vector lengths and context windows, using the gensim (Řehůřek and Sojka, 2010a) package on the processed text. In particular, we generate word embeddings trained with the skip-gram model with negative sampling (Mikolov et al., 2013b) with a vector length of 50 and a context window of 4, and also length 100 with a context window of 5. We also train with the CBOW learning model with negative sampling (Mikolov et al., 2013b) to generate embeddings of length 200 with a context window of 7, but we report all of our results on experiments run with the models having an embedding length of 50. Our algorithm assumes that the document and collection statistics as well as the embedding models are already computed and available. A condensed sketch of this pipeline is shown below.
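A condensed sketch of the phrasal pre-processing and embedding step might look as follows; the toy one-document corpus is only for illustration, and note that gensim ≥ 4 uses the vector_size argument (older versions use size).

from textblob import TextBlob
from gensim.models import Word2Vec

def phrasify(doc_text, max_len=3):
    """Replace extracted noun phrases of length 2..max_len with single
    underscore-joined tokens, so phrases get their own embeddings."""
    text = doc_text.lower()
    for noun_phrase in TextBlob(text).noun_phrases:
        tokens = noun_phrase.split()
        if 1 < len(tokens) <= max_len:
            text = text.replace(noun_phrase, "_".join(tokens))
    return text.split()

# Toy corpus standing in for the processed document collection
corpus = ["Albendazole is used to treat a strongyloides stercoralis infection."]
sentences = [phrasify(doc) for doc in corpus]

# Skip-gram with negative sampling, vector length 50, context window 4
model = Word2Vec(sentences, vector_size=50, window=4, sg=1, negative=5, min_count=1)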

4.10 Experimental Setup

We run two different sets of experiments: (1) direct query expansion of the 30 queries in the TREC dataset, using UMLS concepts (Manual, 2008) for our augmented baselines, and (2) feedback loop–based query expansion, where we use the concept tags of a subset of the top articles returned for the Summary Text–based queries run against an ElasticSearch index as query expansion terms (here MeSH terms–based QE (Luo et al., 2008; Tuarob et al., 2013) is an augmented baseline). We evaluate both types of runs against our ElasticSearch (ES) index setup described in Section 4.11.

For direct query expansion we take all granularity levels of the query topics described in Section 4.3.1, i.e. Summary, Description and Notes text, and feed these into our GLMs, obtaining the top-K ranked documents for each query and drawing our query expansion concept tags from this set according to the algorithm described in Section 4.8. For our augmented query baselines, we use UMLS terms within the above query texts, as UMLS is very effective in finding optimal phrase boundaries (Chen et al., 2016).

For the relevance feedback–based query expansion, we take the top 10–15 documents returned by our ES index setup for each of the Summary Text queries, and use the concept tags assigned to each of these top returned documents by our unigram and phrasal GLMs as the concept tags for query expansion of the original query. We then re-run these expanded queries through the ES search engine to record the retrieval performance. The MeSH terms used for the augmented baseline in the feedback loop case are directly available for a majority of the PMC articles from the TREC dataset itself.

Additionally, to evaluate our feedback loop method against a human judgments–based baseline, we use Expert Term annotations for the query topics, available from a 2016 submission to this challenge (Chen et al., 2016), where 3 physicians were invited to participate in a manual query expansion experiment. Each physician was assigned 10 out of the 30 query topics from the 2016 challenge. Based on the note, each physician provided a list of 2 to 4 key-phrases. The key-phrases did not have to be part of the note, but could be derived from the physician's knowledge after reading the note (Chen et al., 2016). The search keywords for the query topics thus manually provided by these domain experts were used to retrieve corresponding matching PMC article IDs from the PubMed domain. The expert then spot-checked the top-ranked articles to see if these were mostly relevant. If so, they finalized the keywords assigned. Otherwise, they kept fine-tuning the keywords until they got a desired set of results, simulating exactly the adaptive decision support (relevance feedback loop) in IR.

We also develop an interpolated model with a coefficient γ that interpolates between the unigram and phrasal models; it achieves performance comparable to the phrasal model but does not outperform the other models by itself, hence we do not report those results here. Because the challenge uses a graded relevance scale to judge the performance of retrieval systems, we report our results using the inferred measures (Voorhees, 2014) for normalized discounted cumulative gain (NDCG) and Precision at 10 (P@10). Although the TREC CDS 2016 query set is categorized into three topic types, for Diagnosis, Tests and Treatment, we do not divide our evaluation runs into three corresponding sets, evaluating our method's performance on the entire TREC query set instead.

4.11 Evaluation using an ElasticSearch Index

For the search engine–based evaluation of our proposed method, we replicated an ElasticSearch (ES) instance setup, with the required settings, used in a 2016 challenge submission (Chen et al., 2016). Among the different ranking algorithms available, BM25 (with parameters k1=3 and b=0.75) was selected for our setup owing to the slightly better performance observed over the others, with a logical OR querying model implemented and the minimum-percentage-match criterion in ES for search queries set at 15% of the keywords matched for a document (Chen et al., 2016). Since our GLM outlined in Section 4.7 uses the abstract field of the article for query expansion, we boosted the abstract field 4 times and the title field 2 times in our ES search setup. A minimal sketch of such a setup is shown below.
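For reference, a minimal sketch of such an ES configuration and a boosted, OR-semantics query using the official Python client might look like the following; the index name "pmc" and the example query string are assumptions, while the parameter values follow the description above.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Index with BM25 similarity (k1=3, b=0.75) as the default ranking algorithm
es.indices.create(index="pmc", body={
    "settings": {"index": {"similarity": {"default": {"type": "BM25", "k1": 3, "b": 0.75}}}},
    "mappings": {"properties": {"title": {"type": "text"},
                                "abstract": {"type": "text"},
                                "body": {"type": "text"}}},
})

# Logical OR query with a 15% minimum-should-match criterion,
# boosting the abstract field 4x and the title field 2x
expanded_query = "dyspnea chest pain troponin"   # placeholder expanded query terms
response = es.search(index="pmc", body={
    "query": {"multi_match": {
        "query": expanded_query,
        "fields": ["abstract^4", "title^2", "body"],
        "operator": "or",
        "minimum_should_match": "15%",
    }}
})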

Query Expansion Method                                      Query Text   NDCG**   P@10**

Direct setting:
BM25+Standard LM (Jelinek-Mercer sm.) (baseline)            Summary      0.0475   0.1172
BM25+Phrase2Vec (no GLM) (baseline)                         Summary      0.0932   0.2267
BM25+DescUMLS (augmented baseline)                          Summary      0.1070   0.2299
BM25+DescUMLS+unigramGLM (model)                            Summary      0.1010   0.2414
BM25+None (baseline)                                        Summary      0.1060   0.2489
BM25+SumUMLS (augmented baseline)                           Summary      0.1466   0.2644
BM25+SumUMLS+unigramGLM (model)                             Summary      0.1387   0.2817

Feedback Loop (PRF-based) setting:
BM25+Standard LM (Jelinek-Mercer sm.) (baseline)            Summary      0.0265   0.0867
BM25+Phrase2Vec (no GLM) (baseline)                         Summary      0.0662   0.1318
BM25+MeSH QE Terms (baseline)                               Summary      0.0970   0.2294
BM25+None (baseline)                                        Summary      0.1060   0.2489
BM25+Human Expert QE Terms (aug. baseline)                  Summary      0.1029   0.2511
BM25+unigramGLM QE Terms (model)                            Summary      0.1173   0.2792*
BM25+Phrase2VecGLM QE Terms (model)                         Summary      0.1159   0.2872*

PRF-based Ensembled Models:
BM25+unigramGLM Terms+Phrase2VecGLM Terms (baseline)        Summary      0.1057   0.2756
BM25+SumUMLS+unigramGLM+Phrase2VecGLM QE Terms (model)      Summary      0.1206   0.3091*

Table 4.1: Results for Query Expansion by different methods using unigram and phrasal GLM–generated terms, directly and in feedback loop. Asterisks indicate inferred measures (Voorhees, 2014). Bold-face values indicate statistical significance at p << 0.01 over the previous result or baseline.

4.12 Experimental Results and Discussion

This section outlines our results obtained with the various experimental runs described in Section 4.10, as shown in Table 4.1. The hyper-parameters for our best performing models were empirically determined and set to (λ, α, β) = (0.2, 0.3, 0.2) for the word embedding–based GLM and (λ, α, β) = (0.2, 0.4, 0.2) for the phrasal embedding–based GLM, similar to those reported by Ganguly et al. (2015).

All models were evaluated for statistical significance against the respective baselines using a two-sided Wilcoxon signed rank test at p << 0.01, indicated by bold-face values² where found to be significant. As seen from the results, our unigram and phrasal GLM–based methods for query expansion appear quite promising for both direct query expansion and feedback loop–based decision support. For both methods, our trivial baseline used just the Summary text for the query terms with no expanded set of terms. To summarize some of our key findings: 1) For direct query expansion, UMLS concepts found within the Summary, Description and Notes text of the query itself were used as augmented baselines. Of these, the Notes UMLS–based expansion worked rather poorly (we attribute this to extra noise concepts in the lengthy Notes text). 2) Though the Description text–based UMLS terms did worse than our vanilla Summary text baseline, the Description UMLS terms run through the unigram GLM to obtain expanded terms did significantly better than the Description UMLS terms alone, indicating that our method helps improve term expansion. 3) For direct query expansion, the biggest gain against the baseline was observed for the Summary text UMLS terms run through the unigram GLM to obtain expanded terms, with a P@10 value of 0.2817.

² A method is statistically significant against the previously listed method, or the baseline, by a two-sided Wilcoxon signed rank test, with p << 0.01.

The phrasal model performed comparably to the unigram model for direct query expansion, but did not beat it. 4) For the feedback loop–based query expansion method, we had two separate human judgment–based baselines: one using the MeSH terms available from PMC for the top 15 documents returned in a first round of querying the ES index with the Summary text, and the other based on the expert annotations of the 30 query topics as described in Section 4.10. The MeSH terms baseline obtained a P@10 of 0.2294, even less than our vanilla Summary Text baseline with no expanded terms, while our Expert Terms baseline beat this baseline significantly. 5) One reason for the lower performance of the MeSH terms model, we believe, is the lack of MeSH term coverage for all the documents chosen. 6) Our unigram GLM–based expanded terms from the top 15 documents returned by the Summary Text beat the Expert Terms baseline quite significantly with a P@10 of 0.2792, and were in turn outperformed by the phrasal GLM–based expanded terms model with a P@10 of 0.2872. 7) Finally, our combined model using the unigram + phrasal GLM terms from the top 15 documents off of the Summary text beat our vanilla baseline, and was in turn outperformed by our very best combined terms model, which generated unigram + phrasal GLM–based terms for the top 15 documents for each query, off of the Summary + Summary UMLS concepts, obtaining a P@10 of 0.3091. As an example to illustrate, a set of concept tags learned by our unigramGLM model may look like: , and for the phrasalGLM model we may have: .

Further, to provide evidence for Main Idea 1, we manually evaluate the inferred semantic tags using the UMLS ontology to find latent concept abstractions of these concept tags that correspond to UMLS CUIs, and we reason about possible relations between these latent concepts. Tables 4.2 and 4.3 attempt to show meaningful latent concept mappings, via UMLS CUIs, to the inferred concept tags, as well as meaningful potential relations between them, to provide support for Main Idea 1; expert evaluation–based analysis of these is discussed in Section 7.1. Main Idea 2, by virtue of being incorporated into the design of the system itself, is an assumption made in the implementation and experiments for this system. Finally, the fact that our experimental results in Table 4.1 show improved semantic matching for QE in downstream IR provides evidence, we believe, for Main Idea 3, while also giving us related concept tags in the query context.

Query terms / Concept tags | UMLS CUIs / Latent Concepts | Related Concept tags | UMLS CUIs / Latent Concepts | Possible Relations
dementia | [C0497327 Dementia]; [C0002395 Alzheimer's Disease] | alzheimers | [C0002395 Alzheimer's Disease]; [C0036341 Schizophrenia] | sameAs; causedBy
cognitive | [C1516691 Cognitive]; [C0009240 Cognition]; [C0009241 Cognition Disorders] | behavioral | [C0004927 Behavior]; [C0004941 Behavioral Symptoms] | isAbout
bp elevated | [C0020538 Hypertensive disease] | diabetes | [C0011849 Diabetes]; [C0011849 Diabetes Mellitus] | expressionsOfDiseases

Table 4.2: UMLS Concept Unique Identifier (CUI)–based Latent Concept mappings and Possible Relations for unigramGLM

Query terms / Concept tags | UMLS CUIs / Latent Concepts | Related Concept tags | UMLS CUIs / Latent Concepts | Possible Relations
albendazole | [C0001911 Albendazole] | Strongyloides stercoralis | [C0038462 Strongyloides stercoralis] | treatsFor
eosinophilic ascites | [C4703536 Eosinophilic ascites] | corticosteroid therapy | [C0149783 Steroid therapy] | mayBeTreatableBy
parasitic infection | [C3686777 Parasitic infection of digestive tract] | case of hyperinfection | [C0038463 Strongyloidiasis]; [C4524208 Strongyloides stercoralis] | mayBeAFormOf (hyperinfection may be a form of parasitic infection)

Table 4.3: UMLS Concept Unique Identifier (CUI)–based Latent Concept mappings and Possible Relations for Phrase2VecGLM

CHAPTER 5

SEQUENCE-TO-SET SEMANTIC TAGGING: END-END NEURAL ATTENTION–BASED TERM TRANSFER TOWARDS COMPLEX QUERY REFORMULATION AND AUTOMATED CATEGORIZATION OF TEXT

The previous chapter focused on using embeddings of words and phrases in a neural generalized language model framework to relax the independence assumption of traditional language models and achieve better generalization to out-of-vocabulary terms. However, that approach cannot capture long-range dependencies and context arising from the entire sequence of tokens within a document, nor can it automatically learn weights over specific portions of the sequence, or over particular features that may be more predictive of relevant tags, via a mechanism such as neural attention – advances that more recent neural frameworks have successfully exploited for various tasks. In this chapter we therefore introduce a document-level inductive transfer learning method via self-taught learning, as we do not have original labels or semantic tags on the documents, in which we aim to predict semantic tags for documents directly via an end-to-end neural approach employing sequential encoder models and attention. Here, the dataset for the source and target domains is the same; however, for the unsupervised setting of the task, labels are unavailable in the source domain, and the objective is to improve the performance of the semantic tagging task over the previous neural GLM–based implementation, which employed only representations of words and phrases, by extending the encoding learning to whole documents.

Inspired by the recent success of sequence-to-sequence neural models in delivering the state of the art on a wide range of NLP tasks, and building upon the successes of the neural GLM–based model described in the previous chapter, we explore a novel sequence-to-set architecture with neural attention for learning document representations that can effect term transfer for semantically tagging a large collection of documents. We demonstrate that our proposed method can be effective both in a unique unsupervised setting with no document labels, which uses no external knowledge resources and only corpus-derived term statistics to drive the training, and in a supervised multi-label classification setup for automated document categorization. Further, we show that semi-supervised training using our architecture on large amounts of unlabeled data can augment performance on the automated document categorization task when limited labeled data is available.

We evaluate our method on supervised, semi-supervised and unsupervised setups of semantic tagging of documents via multi-label classification, using document collections having various levels of human-derived tag labels. We evaluate our framework on a collection of tagged documents from the del.icio.us folksonomy for the supervised task, a tagged collection of MEDLINE abstracts from the Ohsumed dataset for the semi-supervised task (using only a fraction of the labeled data for training, and the majority of the data without human-annotated tags), as well as the TREC 2016 clinical decision support challenge dataset, which has no human-derived document labels, for the unsupervised task. For the unsupervised task we evaluate the semantic tags inferred by our system via complex query reformulation of clinical queries in this same dataset, in a pseudo-relevance feedback–based query expansion setting. Our approach of generating document encodings with our sequence-to-set models for inference of semantic tags gives, to the best of our knowledge, the state of the art when evaluated on query expansion for the TREC challenge dataset, on top of a standard Okapi BM25–based document retrieval system, for the unsupervised task, and statistically significant AUC gains over the baseline for both the supervised and semi-supervised tasks on the respective datasets.

In this work I aim to demonstrate evidence for hypotheses 1, 2 and 3 (Section 1.5.1) of the research statement, namely: that hidden associations not currently present in a knowledge base may exist in a large collection between certain surface forms of concepts; that shared contexts may be leveraged to directly learn these hidden associations, as opposed to modeling directly for the latent concepts; and that this type of direct learning of associations between latent concepts, such as by learning semantic tags for the document collection, can lead to discovery-based insights for recommendation systems such as the IR task that we evaluate in this chapter. A significant portion of this chapter is in submission at AAAI 2020.

5.1 Introduction

Tremendous strides have been made in recent times by neural machine learning models for reasoning with texts on a wide variety of NLP tasks. In particular, sequence-to-sequence (seq2seq) neural models, often employing attention mechanisms, have been largely successful in delivering the state of the art for tasks such as machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), handwriting synthesis Graves (2013), image captioning Xu et al. (2015), speech recognition Chorowski et al. (2015) and document summarization Cheng and Lapata (2016). Inspired by these successes, we aimed to harness the power of sequential encoder-decoder architectures with attention to train end-to-end differentiable models that learn the best possible representation of the input documents in a collection while being predictive of a set of key terms that best describe each document. These representations are later used to "transfer" a relevant but diverse set of key terms from the most related documents, for semantic tagging of the original input documents, aiding downstream query refinement for IR by pseudo-relevance feedback Xu and Croft (2000); Cao et al. (2008).

Novel contexts may often arise in complex querying scenarios, such as in evidence-based medicine (EBM) involving biomedical literature, that may not explicitly refer to entities or canonical concept forms occurring in any fact- or rule-based knowledge source such as an ontology like the UMLS Bodenreider (2004). Moreover, hidden associations between candidate concepts meaningful in the current context may not exist within a single document, but within the collection, via alternate lexical forms. Our objective, therefore, is to implicitly learn and encode the novel contexts that may relate two or more documents by learning the best possible intrinsic document representations, incorporating both local contextual information on word usage occurring within a document and global information on term usage occurring across the document collection. We aim to achieve this by using word embeddings trained on the whole corpus in our models, within a novel sequence-to-set pipeline, detailed below, that offers a generic paradigm for unsupervised semantic tagging of documents in any domain.

To this end, and to the best of our knowledge, we are the first to employ a novel, completely unsupervised end-to-end neural attention-based document representation learning approach, using no external labels, in order to achieve the most meaningful term transfer between related documents, i.e. semantic tagging of documents, in a pseudo-relevance feedback Xu and Croft (2000)–based setting. This may also be seen as a method of document expansion as a means of obtaining query refinement terms for downstream IR. The following sections give an account of our specific architectural considerations in achieving an end-to-end neural framework for semantic tagging of documents using their representations, followed by a discussion of the results obtained with this approach.

5.2 Related Work

Pseudo-relevance feedback (PRF), a local context analysis method for automatic query expansion, is extensively studied in IR research as a means of addressing the word mismatch between queries and documents. It adjusts a query relative to the documents that initially appear to match it, with the main assumption that the top-ranked documents in the first retrieval result contain many useful terms that can help discriminate relevant documents from irrelevant ones Xu and Croft (2000); Cao et al. (2008). It is motivated by relevance feedback (RF), a well-known IR technique that modifies a query based on the relevance judgments of the retrieved documents Salton et al. (1990). RF typically adds common terms from the relevant documents to a query and re-weights the expanded query based on term frequencies in the relevant documents relative to the non-relevant ones. Thus in PRF we find an initial set of most relevant documents; then, assuming that the top k ranked documents are relevant, RF is done as before, without manual interaction by the user. The added terms are, therefore, common terms from the top-ranked documents.

To this end, Cao et al. (2008) employ term classification for retrieval effectiveness, in a supervised setting, to select the most relevant terms. Palangi et al. (2016) employ a deep sentence embedding approach using LSTMs and show improvement over standard sentence embedding methods, but as a means of directly deriving encodings of queries and documents for use in IR, and not as a method for QE by PRF. In another approach, Xu et al. (2017) train autoencoder representations of queries and documents to enrich the feature space for learning-to-rank, and show gains in retrieval performance over pre-trained rankers; but this is a fully supervised setup where the queries are seen at train time. Pfeiffer et al. (2018) also use an autoencoder-based approach for actual query refinement in pharmacogenomic document retrieval. However, here too their document ranking model uses the encoding of the query and the document for training the ranker, hence the queries are not unseen with respect to the documents during training. They mention that their work can be improved upon by the use of seq2seq-based approaches. In this sense, i.e. with respect to QE by PRF and learning a sequential document representation for document ranking, our work is most similar to Pfeiffer et al. (2018). However, the queries are completely unseen in our case, and we use only the documents in the corpus to train our neural document language models from scratch in a completely unsupervised way.

Classic sequence-to-sequence models like that of Sutskever et al. (2014) demonstrate the strength of recurrent models such as the LSTM in capturing short- and long-range dependencies while learning effective encodings for the end task. Works such as Graves (2013); Bahdanau et al. (2014); Rocktäschel et al. (2015) further stress the key role that attention, and multi-headed attention (Vaswani et al., 2017), can play in solving the end task. We use these insights in our work.

Significant research in the biomedical IR domain has primarily targeted the extraction of entity relations, such as Protein-Protein relations, Drug-Drug interactions, and Protein-Gene or Mutation-Gene relations. For example, in a recent application of self-attention to medical relation extraction, Verga et al. (2017) employ a self-attention encoder to form pairwise predictions between all entity-pair mentions; entity-pairwise pooling aggregates mention-pair scores to make a final prediction, which is further improved by jointly training to predict named entities. However, in another approach expanding on this work that also jointly learns named-entity recognition (NER) and relation extraction (RE), Singh and Bhatia (2019) note that their analysis reveals that, by focusing their encoder mostly on intra-sentence relations, Verga et al. (2017) often wrongly marked many cross-sentence relations as negative, especially when the two target entities are connected by a string of logic spanning multiple sentences. They address this issue with a model based on the hypothesis that two target entities, whether intra-sentence or cross-sentence, could also be explicitly connected by a third context token, which they explicitly model in their work. In this respect, our work is closely related to Singh and Bhatia (2019), as we implicitly model this context that acts as a pivot for the existence of second-order relations between concepts or entities, which ultimately decides the degree of relatedness between two documents.

According to the detailed report provided for this dataset and task in Roberts et al. (2016b), all of the systems described perform direct query reweighting aside from supervised term expansion, and are highly tuned to the clinical queries in this dataset. In a related medical IR challenge, Roberts et al. (2017) specifically mention that with only six partially annotated queries for system development, it is likely that systems were either under- or over-tuned on these queries. Since the setup of the seq2set framework is an attempt to model the PRF-based query expansion method of its closest related work, Das et al. (2018), where the effort is also to train a neural generalized language model for unsupervised semantic tagging, we choose this system as the benchmark against which to compare our end-to-end approach for the same task.

5.3 Methodology

Drawing on sequence-to-sequence modeling approaches for text classification, e.g. textual entailment Rocktäschel et al. (2015), and machine translation Sutskever et al. (2014); Bahdanau et al. (2014), we adapt from these settings into a sequence-to-set framework for learning representations of input documents, in order to derive a meaningful set of terms, or semantic tags, drawn from a closely related set of documents, that expand the original documents. These document expansion terms are then used downstream for query reformulation via PRF, for unseen queries. We employ an end-to-end framework for unsupervised representation learning of documents using TFIDF-based pseudo-labels (Figure 5.1(a)) and a separate cosine similarity–based ranking module for semantic tag inference (Figure 5.1(b)).

We employ various encoding methods, viz. doc2vec, Deep Averaging, and sequential models such as LSTM, GRU, BiGRU, BiLSTM, BiLSTM with Attention and the Transformer with self-attention, detailed in Section 5.4 and Figure 5.1(c)-(f), for learning fixed-length input document representations in our framework. We apply methods like DAN (Iyyer et al., 2015), LSTM, and BiLSTM as our baselines, and formulate attentional models, including a self-attentional Transformer-based one (Vaswani et al., 2017), as our proposed augmented document encoders.

Further, we hypothesize that a sequential, bi-directional or attentional encoder coupled with a decoder which acts as a neural language model that conditions on the encoder output v (similar to an approach by Kiros et al. (2015) for a different purpose), would enable learning of the optimal semantic tags in our unsupervised query expansion setting, while modeling directly for this task in an end-end neural framework. The following sections describe our setup.

5.4 Sequence-based Document Encoders

We describe below the different neural models that we use for the sequence encoder, as part of our encoder-decoder architecture for deriving semantic tags for documents.

5.4.1 doc2vec encoder

doc2vec is the unsupervised algorithm due to Le and Mikolov (2014) that learns fixed-length representations of variable-length documents, representing each document by a dense vector trained to predict surrounding words in contexts sampled from that document. We derive these doc2vec encodings by pre-training on our corpus. We then use them directly as features for inferring semantic tags per Figure 5.1(b), without training them within our framework against the loss objectives. We expect this to be a strong document encoding baseline for capturing the semantics of documents. TFIDF Terms is our other baseline, where we do not train within the framework but rather use the top-k neighbor documents' TFIDF pseudo-labels as the semantic tags for the query document.

5.4.2 Deep Averaging Network encoder

The Deep Averaging Network (DAN) for text classification due to Iyyer et al. (2015), Figure 5.1(c), is formulated as a neural bag-of-words encoder model for mapping an input sequence of tokens X to one of k labels. Here v is the output of a composition function g, in this case averaging, applied to the sequence of word embeddings v_w for w ∈ X. For our multi-label classification problem, v is fed to a sigmoid layer to obtain scores for each independent classification. We expect this to be another strong document encoder given results in the literature, and it proves in practice to be so. A minimal sketch of such an encoder is shown below.
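The following is a minimal sketch of such a neural bag-of-words encoder with a sigmoid prediction layer, in the spirit of Iyyer et al. (2015); the layer sizes and vocabulary size here are illustrative rather than our tuned configuration.

import tensorflow as tf

def build_dan(vocab_size=10000, embed_dim=50, hidden_dim=128, num_labels=10000):
    """Deep Averaging Network: average the word embeddings of a document
    (composition function g), pass the average through a feed-forward layer,
    and score each label independently with a sigmoid layer."""
    tokens = tf.keras.Input(shape=(None,), dtype="int32")
    embedded = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(tokens)
    v = tf.keras.layers.GlobalAveragePooling1D()(embedded)   # g = averaging over tokens
    hidden = tf.keras.layers.Dense(hidden_dim, activation="relu")(v)
    scores = tf.keras.layers.Dense(num_labels, activation="sigmoid")(hidden)
    return tf.keras.Model(tokens, scores)

dan = build_dan()
dan.compile(optimizer="adam", loss="binary_crossentropy")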

5.4.3 LSTM and BiLSTM encoders

LSTMs Hochreiter and Schmidhuber (1997), by design, encompass memory cells that can store information for a long period of time and are therefore capable of learning and remembering over long and variable sequences of inputs. In addition to three types of gates, i.e. input, forget, and output gates, that control the flow of information into and out of these cells, LSTMs have a hidden state vector h_t^l and a memory vector c_t^l. At each time step, corresponding to a token of the input document, the LSTM can choose to read from, write to, or reset the cell using explicit gating mechanisms. Thus the LSTM is able to learn a language model for the entire document, encoded in the hidden state of the final timestep, which we use as the document encoding given to the prediction layer. By the same token, owing to the bi-directional processing of its input, a BiLSTM-based document representation is expected to be even more robust at capturing document semantics than the LSTM, with respect to its prediction targets. Here, the document representation used for final classification is the concatenation of the forward and backward hidden state outputs from the final step, [→h_t ; ←h_t], depicted by the dotted box in Fig. 5.1(e).

5.4.4 BiLSTM with Attention encoder

In addition, we also propose a BiLSTM with attention–based document encoder, where the output representation is the weighted combination of the concatenated hidden states at each time step. Thus we learn an attention-weighted representation at the final output as follows. Let X ∈ R^{d×L} be a matrix consisting of the output vectors [h_1, ..., h_L] that the BiLSTM produces when reading the L tokens of the input document. Each word representation h_i is obtained by concatenating the forward and backward hidden states, i.e. h_i = [→h_i ; ←h_i]. Here d is the size of the embeddings and hidden layers. The attention mechanism produces a vector α of attention weights and a weighted representation r of the input, via:

M = tanh(W^x X),   M ∈ R^{d×L}    (5.1)

α = softmax(w^T M),   α ∈ R^L    (5.2)

r = X α^T,   r ∈ R^d    (5.3)

Here, the intermediate attention representation m_i (i.e. the i-th column vector in M) of the i-th word in the input document is obtained by applying a non-linearity to the matrix of output vectors X, and the attention weight for the i-th word in the input is the result of a weighted combination (parametrized by w) of the values in m_i. Thus r ∈ R^d is the attention-weighted representation of the word and phrase tokens in an input document, used in optimizing the training objective in downstream multi-label classification, as shown by the final attended representation r in Figure 5.1(e). A NumPy sketch of this computation is given below.
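Equations (5.1)–(5.3) translate almost directly into code; the NumPy sketch below uses randomly initialized parameters standing in for the learned W^x and w, purely to illustrate the shapes involved.

import numpy as np

def attend(X, Wx, w):
    """X: (d, L) matrix of BiLSTM outputs h_1..h_L (forward and backward states
    concatenated); Wx: (d, d) projection; w: (d,) attention parameter vector."""
    M = np.tanh(Wx @ X)                           # Eq. (5.1): M in R^{d x L}
    scores = w @ M                                # (L,) unnormalized weights
    alpha = np.exp(scores - scores.max())         # Eq. (5.2): softmax over tokens
    alpha = alpha / alpha.sum()
    r = X @ alpha                                 # Eq. (5.3): r in R^d
    return r, alpha

d, L = 100, 12                                    # illustrative sizes
X = np.random.randn(d, L)
r, alpha = attend(X, 0.1 * np.random.randn(d, d), 0.1 * np.random.randn(d))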

5.4.5 GRU and BiGRU encoders

A Gated Recurrent Unit (GRU) is a type of recurrent unit in recurrent neural networks (RNNs) that aims at tracking long-term dependencies while keeping the gradients in a reasonable range. In contrast to the LSTM, a GRU has only 2 gates: a reset gate and an update gate. First proposed by Chung et al. (2014, 2015) to make each recurrent unit adaptively capture dependencies at different time scales, the GRU, however, does not have any mechanism to control the degree to which its state is exposed, exposing the whole state each time. In the LSTM unit, the amount of the memory content that is seen, or used by other units in the network, is controlled by the output gate, while the GRU exposes its full content without any control. Since the GRU has a simpler structure, models using GRUs generally converge faster than LSTMs; hence they are faster to train and may give better performance in some cases for sequence modeling tasks. The BiGRU has the same structure as the GRU, except constructed for bi-directional processing of the input, depicted by the dotted box in Fig. 5.1(e).

5.4.6 Transformer self-attentional encoder

Recently, the Transformer encoder-decoder architecture due to Vaswani et al. (2017), based on a self-attention mechanism in the encoder and decoder, has achieved the state of the art on machine translation tasks at a fraction of the computation cost. Based entirely on attention, and replacing the recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention, it has outperformed most previously reported ensembles on the task. Thus we hypothesize that this self-attention-based model could learn the most efficient semantic representations of documents for our unsupervised task. Since our models use tensorflow Abadi et al. (2016), a natural choice was document representation learning using the Transformer model's available tensor2tensor API. Apart from the computational advantages of this model, we hoped to leverage its capability of capturing semantics over varying lengths of context in the input document, afforded by multi-headed self-attention, Figure 5.1(f). Self-attention is realized in this architecture by training 3 matrices, made up of vectors corresponding to a Query vector, a Key vector and a Value vector for each token in the input sequence. The output of each self-attention layer is a summation of weighted Value vectors that passes on to a feed-forward neural network. Position-based encodings that replace recurrences help lend more parallelism to the computations and make them faster. Multi-headed self-attention further lends the model the ability to focus on different positions in the input, with multiple sets of Query/Key/Value weight matrices, which we hypothesize should result in the most effective document representation, among all the models, for our downstream task.

5.4.7 CNN encoder

Inspired by the success of Kim (2014) in employing CNN architectures for achieving gains in NLP tasks, we also employ a CNN-based encoder in the seq2set framework. Kim (2014) trains a simple CNN with a layer of convolution on top of pre-trained word vectors, with a sequence of n embeddings concatenated to form a matrix input. Filters of different sizes, representing various context windows over neighboring words, are then applied to this input, over each possible window of words in the sequence, to obtain feature maps. This is followed by a max-over-time pooling operation that takes the maximum value of each feature map as the feature corresponding to a particular filter. The model then combines these features to form a penultimate layer, which is passed to a fully connected softmax layer whose output is the probability distribution over labels. In the case of seq2set, these features are passed to a sigmoid layer for final multi-label prediction, using cross-entropy loss or a combination of cross-entropy and LM losses. We use filters of sizes 2, 3, 4 and 5. Like our other encoders, we fine-tune the document representations learnt. A sketch of such an encoder is given below.
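As a sketch of such an encoder (filter counts and vocabulary size are illustrative), a Kim (2014)-style convolutional encoder with parallel filter sizes 2–5, max-over-time pooling and a sigmoid multi-label layer could be built as follows.

import tensorflow as tf

def build_cnn_encoder(vocab_size=10000, embed_dim=50, num_filters=64, num_labels=10000):
    """Kim (2014)-style CNN: parallel convolutions over the embedded token
    sequence with window sizes 2-5, max-over-time pooling, then a sigmoid
    layer for multi-label prediction."""
    tokens = tf.keras.Input(shape=(None,), dtype="int32")
    embedded = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)
    pooled = []
    for size in (2, 3, 4, 5):
        conv = tf.keras.layers.Conv1D(num_filters, size, activation="relu")(embedded)
        pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))   # max-over-time
    features = tf.keras.layers.Concatenate()(pooled)                # penultimate layer
    scores = tf.keras.layers.Dense(num_labels, activation="sigmoid")(features)
    return tf.keras.Model(tokens, scores)

cnn = build_cnn_encoder()
cnn.compile(optimizer="adam", loss="binary_crossentropy")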

5.5 Sequence-to-Set Architecture

Task Definition: For each query document d_q in a given collection of documents D = {d_1, d_2, ..., d_N}, represented by a set of k keywords or labels, e.g. k terms in d_q derived from the top-|V| TFIDF-scored terms, find an alternate set of the k most relevant terms coming from documents "most related" to d_q from elsewhere in the collection. These serve as semantic tags for expanding d_q.

A document to be tagged is regarded as a query document d_q; its semantic tags are generated via PRF, and these terms will in turn be used for PRF–based expansion of unseen queries in downstream IR. Thus d_q could represent an original complex query text or a document in the collection.

In the following sections we describe the building blocks used in the setup for the baseline and proposed models for sequence-to-set semantic tagging as described in the task definition.

Figure 5.1: Overview of Sequence-to-Set Framework. (a) Method for training document or query representations, (b) Method for Inference via term transfer for semantic tagging; Document Sequence Encoders: (c) Deep Averaging encoder; (d) LSTM last hidden state, GRU encoders; (e) BiLSTM last hidden state, BiGRU (shown in dotted box), BiLSTM attended hidden states encoders; and (f) Transformer self-attentional encoder Alammar (2018).

5.6 Training and Inference Setup

The overall architecture for sequence-to-set semantic tagging consists of two phases, as depicted in the block diagrams in Figures 5.1(a) and 5.1(b): the first for training the input representations of documents, and the second for inference, to achieve term transfer for semantic tagging. As shown in Figure 5.1(a), the proposed model architecture first learns the appropriate feature representations of documents in a first pass of training, by taking in the tokens (various embeddings listed in Section 5.7.2) of an input document sequentially, using the document's pre-determined top-k TFIDF-scored terms as the pseudo-class labels for the input instance, i.e. prediction targets for a sigmoid layer for multi-label classification. The training objective is to maximize the probability of these k terms, i.e. y_p = (t_1, t_2, ..., t_k) ∈ V:

\arg\max_{\theta} P(y_p = (t_1, t_2, ..., t_k) \in V \mid v; \theta)    (5.4)

given the document's encoding v. For computational efficiency, we take V to be the list of the top-10K TFIDF-scored terms from our corpus, thus |V| = 10,000. k is taken as 3, so each document is initially labeled with 3 terms. The sequential model is then trained with the k-hot 10K-dimensional label vector as the targets for the sigmoid classification layer, employing a couple of alternative training objectives. The first, typical for multi-label classification, minimizes a categorical cross-entropy loss, which for a single training instance with ground-truth label set y_p is:

L_{CE}(\hat{y}_p) = \sum_{i=1}^{|V|} y_i \log(\hat{y}_i)    (5.5)

Since our goal is to obtain document representations that are most predictive of their assigned terms, and that can also be predictive of semantic tags not present in the document, we also consider a language model–based loss objective, converting our decoder into a neural language model. Thus, we employ a training objective that maximizes the conditional log likelihood of the label terms L_d of a document d_q given the document's representation v, i.e. P(L_d|d_q) (where y_p = L_d ∈ V). This amounts to minimizing the negative log likelihood of the label representations conditioned on the document encoding.

Thus,

P(L_d|d_q) = \prod_{l \in L_d} P(l|d_q), \;\; \text{i.e., we minimize} \; - \sum_{l \in L_d} \log\big(P(l|d_q)\big)    (5.6)

Since P(l|d_q) ∝ exp(vec(l)·vec(d_q)), where vec(l) = v_l and vec(d_q) = v, this is equivalent to minimizing:

L_{LM}(\hat{y}_p) = - \sum_{l \in L_d} \log\big(\exp(v_l \cdot v)\big)    (5.7)

Equation (5.7) represents our language model loss objective. We run experiments training with both losses (Equations (5.5) & (5.7)), as well as a variant that is a summation of both, with a hyper-parameter α used to tune the language model component of the total loss objective; a sketch of this combined loss is given below.
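A sketch of the summed objective for one batch follows, assuming the label (term) embeddings and document encodings are available as tensors; the batch reduction and the default α value are assumptions made for the example, not our exact training code.

import tensorflow as tf

def combined_loss(y_true, y_pred, doc_enc, label_emb, alpha=0.5):
    """y_true: (batch, |V|) k-hot pseudo-label targets; y_pred: (batch, |V|)
    sigmoid outputs; doc_enc: (batch, d) document encodings v; label_emb:
    (|V|, d) embeddings v_l of the vocabulary terms used as labels."""
    # Cross-entropy component, Eq. (5.5)
    ce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    # Language-model component, Eq. (5.7): since log(exp(v_l . v)) = v_l . v,
    # the loss is the negative sum of label/document dot products
    dots = tf.matmul(doc_enc, label_emb, transpose_b=True)      # (batch, |V|)
    y_true = tf.cast(y_true, dots.dtype)
    lm = -tf.reduce_mean(tf.reduce_sum(y_true * dots, axis=1))
    return ce + alpha * lm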

5.7 Unsupervised Task Setting – Semantic Tagging for Query Expansion

5.7.1 Dataset - TREC CDS 2016

The 2016 TREC CDS challenge dataset makes available actual (de-identified) electronic health records (EHR) of patients, in the form of case reports, typically describing a challenging medical case. Such a case report represents a query in our system, having a complex information need. There are 30 queries in this dataset, corresponding to such case reports, at 3 levels of granularity, i.e. Note, Description and Summary text, as described in Roberts et al. (2016b). The target document collection is the Open Access Subset of PubMed Central (PMC), containing 1.25 million articles consisting of title, keywords, abstract and body sections. We use a subset of 100K of these articles, for which human relevance judgments are made available by TREC, for training. Final evaluation, however, is done on an ElasticSearch index built on top of the entire collection of 1.25 million PMC articles.

5.7.2 Experimental Designs with Word Embedding Models

The following section describes the various word representation algorithms we experimented with for initializing our setup in the unsupervised task setting.

Skip-Gram word2vec: We generate word embeddings trained with the skip-gram model with negative sampling Mikolov et al. (2013b), with dimension settings of 50 with a context window of 4, and also 300 with a context window of 5, using the gensim package³ Řehůřek and Sojka (2010b).

Probabilistic FastText: The Probabilistic FastText (PFT) word embedding model of Athiwaratkun et al. (2018) represents each word with a Gaussian mixture density, where the mean of a mixture component, given by the sum of n-grams, can capture multiple word senses, sub-word structure, and uncertainty information. This model outperforms the n-gram averaging of FastText, obtaining state-of-the-art performance on several word similarity and disambiguation benchmarks. The probabilistic word representations with flexible sub-word structures can achieve multi-sense representations that also give rich semantics for rare words. This makes them very suitable for generalizing to rare and out-of-vocabulary words, motivating us to opt for PFT-based word vector pre-training⁴ over regular FastText.

ELMo: Another consideration was to use embeddings that can explicitly capture the language model underlying the sentences within a document. ELMo (Embeddings from Language Models) word vectors Peters et al. (2018) presented such a choice, where the vectors are derived from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus. The representations are a function of all of the internal layers of the biLM. Using linear combinations of the vectors derived from each internal state has shown marked improvements on various downstream NLP tasks, because the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax. Using the API⁵ we generate ELMo embeddings fine-tuned for our corpus with dimension settings of 50 and 100, using only the top-layer final representations.

³ https://radimrehurek.com/gensim/
⁴ https://github.com/benathi/multisense-prob-fasttext
⁵ https://allennlp.org/elmo

A discussion of the results from each set of experiments is outlined in the following section and summarized in Table 5.1.

5.7.3 Experiments

We ran several sets of experiments with various document encoders, employing word embedding schemes like skip-gram (Mikolov et al., 2013a), Probabilistic FastText (Athiwaratkun et al., 2018) and ELMo (Peters et al., 2018). The experimental setup used is the same as for Phrase2VecGLM Das et al. (2018), the only other known system for this dataset that performs unsupervised semantic tagging of documents by PRF for downstream query expansion. Thus we take this system as the current state-of-the-art system baseline, while our non-attention-based document encoding models constitute our standard baselines. Our document-TFIDF representation–based query expansion forms yet another baseline. Summary text UMLS terms for use in our augmented models are available to us via the UMLS Java Metamap API, similar to Chen et al. (2016).


The first was a set of experiments with our different models using the Summary Text as the base query. Following this, we ran experiments with our models using the Summary Text + Summary UMLS terms as the "augmented" query. We use the Adam optimizer Kingma and Ba (2014) for training our models. After several rounds of hyper-parameter tuning, the batch size was set to 128, dropout to 0.3, and the prediction layer was fixed to sigmoid; the loss function was switched between cross-entropy and the summation of the cross-entropy and LM losses, and models were trained with early stopping.

Unsupervised QE Systems on Base Query (Model on Summary Text)                      inf. P@10
BM25+Seq2Set-doc2vec (baseline)                                                    0.0794
BM25+Seq2Set-TFIDF Terms (baseline)                                                0.2000
BM25+MeSH QE Terms (baseline)                                                      0.2294
BM25+Human Expert QE Terms (baseline)                                              0.2511
BM25+unigramGLM+Phrase2VecGLM ensemble (system baseline) (Das et al., 2018)        0.2756
BM25+unigramGLM+Phrase2VecGLM+Seq2Set-GRU (LM only loss) ensemble (model)          0.3222*

Supervised QE Systems on Base Query
BM25+ETH Zurich-ETHSummRR                                                          0.3067
BM25+Fudan Univ. DMIIP-AutoSummary1                                                0.4033*

Unsupervised QE Systems on Augmented Query (Model on Summary + Sum. UMLS terms)
BM25+Seq2Set-doc2vec (baseline)                                                    0.1345
BM25+Seq2Set-TFIDF Terms (baseline)                                                0.3000
BM25+unigramGLM+Phrase2VecGLM ensemble (system baseline) (Das et al., 2018)        0.3091
BM25+Seq2Set-BiGRU (LM only loss) (model)                                          0.3333*
BM25+Seq2Set-Transformer (Xent.+LM) (model)                                        0.4333*

Table 5.1: Results on IR for best Seq2set models, in an unsupervised PRF–based QE setting. Boldface indicates statistical significance at p << 0.01 over the previous result.

5.7.4 Discussion

Results from the various Seq2Set encoder models on the base (Summary Text) and augmented (Summary Text + Sum. UMLS terms) queries are outlined in Table 5.1. Evaluating on the base query, a Seq2Set-Transformer model beats the other Seq2Set encoders and the TFIDF, MeSH QE terms and Expert QE terms baselines; and a Seq2Set-GRU model outperforms the Phrase2VecGLM baseline by ensemble, with a P@10 of 0.3222. On the augmented query, the Seq2Set-BiGRU and Seq2Set-Transformer models outperform the other Seq2Set encoders and the Phrase2VecGLM unsupervised QE system baseline, with a P@10 of 0.4333. The best performing supervised QE systems for this dataset, tuned on all 30 queries, range between 0.35 and 0.4033 P@10 Roberts et al. (2016b), better than the unsupervised QE systems on the base query, but surpassed by the best Seq2Set-based models on the augmented query even without ensembling. Semantic tags from a best-performing model do appear to pick terms relating to certain conditions, e.g.: .

Further, just as in the case of the Phrase2VecGLM system, in order to find evidence for Main Idea 1, we manually evaluate the inferred semantic tags using the UMLS ontology to find latent concept abstractions of these concept tags that correspond to UMLS CUIs, and we reason about possible relations between them. Here too, Table 5.2 attempts to show meaningful latent concept mappings, via UMLS CUIs, to the concept tags inferred by the Seq2set system, and also meaningful potential relations between them, for which expert evaluation–based analysis is discussed in Section 7.1. Main Idea 2, applied in the design of the system itself, is incorporated into the implementation and experiments for this system. Finally, the fact that our experimental results from Table 5.1 show improved semantic matching in downstream IR via the Seq2set framework, in the unsupervised task setting of semantic tagging for QE, provides evidence, we believe, for Main Idea 3, while also giving us related concept tags in the context of the query document.

Query terms / Concept tags | UMLS CUIs / Latent Concepts | Related Concept tags | UMLS CUIs / Latent Concepts | Possible Relations
obesity | [C0028754 Obesity] | dyslipidaemia | [C0242339 Dyslipidemias]; [C3160761 Diabetic dyslipidaemia] | isTypeOf; mayBeCausedBy
diabetes | [C0011847 Diabetes] | hyperglycemia | [C0020456 Hyperglycemia] | comorbidityWith
pulmonary-hypertension | [C0020542 Pulmonary Hypertension] | bmi | [C1305855 Body mass index] | mayBeCorrelatedWith
children | [C0008059 Child] | subjects | [C0080105 Research Subject] | mayBeTypeOf

Table 5.2: UMLS Concept Unique Identifier (CUI)–based Latent Concept mappings and Possible Relations for Seq2set

5.8 Supervised Task Setting – Automated Text Categorization

The Seq2set framework's unsupervised semantic tagging setup is primarily applicable in settings where no pre-existing document labels are available. In such a scenario, of unsupervised semantic tagging of a large document collection, the Seq2set framework therefore consists of separate training and inference steps, to infer tags from other documents after the encodings have been learnt. We therefore conduct a series of extensive evaluations in the manner described in the previous section, using a downstream QE task, in order to validate our method. However, when a tagged document collection is available, where the set of document labels is already known, we can learn to predict tags from this set of known labels on a new set of similar documents. In order to generalize our Seq2set approach to such other tasks and setups, we therefore aim to validate the performance of our framework on such a labeled dataset of tagged documents, which is equivalent to adapting the Seq2set framework for a supervised setup. In this setup we only need to use the training module of the Seq2set framework shown in Figure 5.1(a), and measure tag prediction performance on a held-out set of documents. For this evaluation, we choose to work with the popular Delicious (del.icio.us) folksonomy dataset, the same as that used by Soleimani and Miller (2016), in order to make an appropriate comparison with their framework, which is also evaluated on a similar document multi-label prediction task.

5.8.1 Dataset – Delicious

The Delicious dataset contains tagged web pages retrieved from the social bookmarking site delicious.com. As in Ramage et al. (2011), there are 20 common tags used as class labels: reference, design, programming, internet, computer, web, java, writing, English, grammar, style, language, books, education, philosophy, politics, religion, science, history and culture. The training set consists of 8250 documents and the test set consists of 4000 documents.

Figure 5.2: A comparison of document labeling performance of Seq2set versus MLTM

5.8.2 Experiments

We then run Seq2set-based training for our 8 different encoder models on the training set for the 20 labels, and perform evaluation on the test set, measuring sentence-level ROC AUC on the labeled documents in the test set (a sketch of the metric computation is shown below).
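Evaluation here reduces to standard multi-label ROC AUC; a sketch using scikit-learn is shown below, where the random matrices merely stand in for the true test-set label matrix and the encoder's sigmoid scores.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(4000, 20))    # placeholder: 4000 test docs, 20 tags
y_score = rng.random(size=(4000, 20))           # placeholder: sigmoid scores per tag

micro_auc = roc_auc_score(y_true, y_score, average="micro")
per_label_auc = roc_auc_score(y_true, y_score, average=None)
print(f"micro-averaged ROC AUC: {micro_auc:.3f}")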

Figure 5.2 shows the comparison of the ROC AUC scores obtained when training Seq2set- and MLTM-based models for this task with various labeled-data proportions. Here we clearly see that Seq2set has quite an advantage over the MLTM state of the art, statistically significantly surpassing MLTM when trained with greater than 25% of the labeled dataset.

Figure 5.3 shows the ROC AUC for the best performing models from the Seq2set framework on the del.icio.us dataset. Figures 5.3(a)–5.3(c) show an AUC of 0.85 obtained with our best performing GRU, BiLSTM with Attention and Transformer–based models respectively, while our very best result on multi-label prediction with the del.icio.us dataset was due to a BiGRU–based encoder model trained with a sigmoid–based prediction layer on cross-entropy loss, with a batch size of 64 and dropout set to 0.3, obtaining an ROC AUC of 0.86 and quite significantly surpassing MLTM for this task and dataset.

Figure 5.3: Seq2Set – supervised text categorization task setting, ROC AUC on the del.icio.us dataset for the best performing models: (a) Seq2Set – GRU; (b) Seq2Set – BiLSTM with Attention; (c) Seq2Set – Transformer; (d) Seq2Set – BiGRU.

5.9 Semi-Supervised Task Setting – Automated Text Categorization

We then seek to further validate how well the Seq2set framework can leverage large-scale pre-training on unlabeled data, given only a small amount of labeled data for training, to improve prediction performance on a held-out set of these known labels. This amounts to a semi-supervised evaluation of the Seq2set framework. In this setup, we perform the training and evaluation of Seq2set similarly to the supervised setup, except that we have an added step of pre-training the multi-label prediction on large amounts of unlabeled document data in exactly the same way as in the unsupervised setup.

5.9.1 Dataset – Ohsumed

We employ the Ohsumed dataset available from the TREC Information Filtering tracks of years 87-91, and the version of the labeled Ohsumed dataset used by Soleimani and Miller (2016) for evaluation, to allow an appropriate comparison with their MLTM system, which is also evaluated on this dataset. The version of the Ohsumed dataset due to Soleimani and Miller (2016) consists of 11122 training and 5388 test documents, each assigned to one or more of 23 MeSH disease-category labels. Almost half of the documents have more than one label.

5.9.2 Experiments

We first train and test our framework on the labeled subset of the Ohsumed data from Soleimani and Miller (2016), similarly to the supervised setup described in the previous section. This evaluation gives a statistically significant ROC AUC of 0.951, over the 0.90 AUC of the MLTM system of Soleimani and Miller (2016), for a best performing Transformer–based Seq2set model.

Figure 5.4: Seq2Set – supervised and semi-supervised text categorization task setting, ROC AUC on the Ohsumed dataset for the best performing models: (a) Seq2Set – BiGRU encoder with Cross Entropy–based Softmax prediction; (b) Seq2Set – CNN encoder with Cross Entropy–based Softmax prediction; (c) Seq2Set – Transformer encoder with Cross Entropy–based Sigmoid prediction, 4 layers, 10 attention heads, dropout=0.2; (d) Seq2Set – Transformer encoder with Cross Entropy–based Softmax prediction, 4 layers, 10 attention heads, and no dropout (=0).

Next we experiment with the semi-supervised setting, where we first train the Seq2set framework models on a large number of documents that do not have pre-existing labels. This pre-training is performed in exactly the same fashion as in the unsupervised setup.

180 a similar fashion as the unsupervised setup. Thus we first preprocess the Ohsumed data from years 87-90 to obtain a top-1000 TFIDF score–based vocabulary of tags, and pseudo-labels all the documents in the training set with these. Our training and evaluation for the semi-supervised setup consists of 3 phases: Phase 1: We employ our seq2set framework (using any one of our encoder models) for multi-label predic- tion on this pseudo-labeled data, having an output prediction layer of 1000 having a penultimate fully-connected layer of dimension 23, same as the number of labels in the

Ohsumed dataset; Phase 2: After pre-training with pseudolabels we discard the final layer and continue to train labeled Ohsumed dataset from 91 by 5-fold cross-validation with early stopping. Phase 3: This is the final evaluation step of our semi-supervised trained Seq2set model on the labeled Ohsumed test dataset used by Soleimani and

Miller (2016). This constitutes simply inferring predicted tags using the trained model on the test data. As shown in Figure 5.4 (a)-(d), our evaluation of the Seq2set frame- work for the Ohsumed dataset, comparing supervised and semi-supervised training setups, yields an ROC AUC of 0.94 for our best performing semi-supervised–trained model of Fig. 5.4(d), compared to the various supervised trained models for the same dataset shown in Figures 5.4 (a)-(c) with an ROC AUC of 0.93. The best performing semi-supervised model at AUC 0.90 again involves a Transformer–based encoder using a softmax layer for prediction, with 4 layers, 10 attention heads, and no dropout. While this result on the semi-supervised training experiments statistically significantly out- performs the MLTM baseline having an ROC AUC of 0.90 on the Ohsumed dataset, it only slightly surpasses the best supervised models. Thus we believe that more

181 experiments with the Seq2set framework could yield an even better performing semi- supervised–trained model, that very significantly outperforms the supervised task for the same dataset.

5.10 Conclusion

We develop a novel sequence-to-set based neural architecture for training document representations using no external supervision labels, for pseudo-relevance feedback– based unsupervised semantic tagging of a large collection of documents. We find that in this unsupervised task setting of semantic tagging for PRF-based query expansion, an unsupervised term prediction mechanism that jointly optimizes both prediction of the TFIDF–based document pseudo-labels and the log likelihood of the labels given the document encoding, surpasses previous methods like Phrase2VecGLM that uses neural generalized language models for the same. In all experimental setups, our ini- tial hypothesis that the bi-directional or self-attentional models could learn the most efficient semantic representations of documents when coupled with a loss more effective than cross-entropy at reducing language model perplexity of document encodings, is corroborated. We demonstrate the effectiveness of our approach in each setup, i.e. for the unsupervised setting in a downstream medical IR challenge task, achieving to the best of our knowledge, the state-of-the-art on unsupervised QE via PRF-based seman- tic tagging surpassing Phrase2VecGLM (Das et al., 2018); and for both, the supervised and semi-supervised settings where we beat the state-of-art MLTM baseline (Soleimani and Miller, 2016) for multi-label prediction for automated text categorization for a set of known labels on a held out set of documents.

182 CHAPTER 6

LEARNING TO ANSWER SUBJECTIVE, SPECIFIC PRODUCT-RELATED QUERIES WITH CUSTOMER REVIEWS BY NEURAL DOMAIN ADAPTATION

In the previous chapter, we saw that inductive transfer by self-taught learning helped to discover related concepts that served to augment a complex query in a document retrieval setting. Here the input spaces of the source domains and the target domains are the same. However when the source and target domains are different yet related for the same task, and labeled data is only available in the source domain, we see in this chapter how a transductive transfer learning approach might be helpful in achieving the desired semantic matching by finding hidden associations between related latent concepts.

6.1 Introduction

Again, as per the definition of “latent concept” highlighted in section 1.4.2, consider e.g. a collection of online customer reviews and answered questions that we will leverage in a real-world question answering scenario. Note that concepts from this collection may or may not occur in any particular existing ontology or knowledge base. Here, as shown in Figure 6.1, consider the question being asked about a product that requires an

183 antenna, i.e. “what antennas are recommended?”. We have two different contexts within the collection pertaining to this question; Context1 that is a document (answer)

“about” a hypernym like Antenna properties, containing concept mentions such as “vertical polarized” or “longer antenna”, and another context Context2 perhaps from some review documents that are “about” or refer to a hypernym like Antenna makes and models for which, “Firestick II Antenna” or “Fiberglass Whip Antenna”, e.g., are hyponyms. These hyponyms in turn, may be hypernyms for concept mentions such as “Firestick II with the tunable tip” and “Fiberglass Wilson”. Here, the concept

“longer antenna” from Context1 may satisfy a relation like is desirable property of for the concept “Firestick II with the tunable tip” from Context2, and the same relation may also hold between the concept “vertical polarized” from Context1 and “Fiberglass

Wilson” from Context2. If this relation were to be implicitly learned between the concepts, then the question of “which antennas are recommended?” might bring up the review documents about the related hypernym concepts such as “Firestick II Antenna” or “Fiberglass Whip Antenna”. Thus, although these free-form concepts do not occur in their hypernym-hyponym forms in some pre-existing ontology, a process of learning the latent relations that are directly matched to a novel context such as “recommended antennas for a product”, which is the question in this case, may serve to surface the related concepts.

Online customer reviews on large-scale e-commerce websites, represent a rich and varied source of opinion data, often providing subjective qualitative assessments of such product usage that can help potential customers to discover features that meet their personal needs and preferences. Thus they have the potential to automatically answer specific queries about products, and to address the problems of answer starvation and

184 'what antennas are recommended?’

Association, Relation 1: Firestik II Antenna (y) Antenna Properties (x) Firestik antenna cable quality antenna Firestik II with the longer tune-able tip antenna

vertical Context 2: document 2 = reviews, quarter wave polarized “about” Antenna makes & models whip antenna CB antenna Fiberglass Whip Antenna (z)

Wideband Context 1: document 1 = answer 1, HF antenna “about” Antenna properties Fiberglass Wilson Lil Wilson Wil

Figure 6.1: Example of Hidden Associations between Latent Concepts in a corpus of Product-related Questions, Answers and Reviews.

answer augmentation on associated consumer Q & A forums, by providing good answer alternatives. In this work, we explore several recently successful neural approaches to modeling sentence pairs, that could better learn the relationship between questions and ground truth answers, and thus help infer reviews that can best answer a ques- tion or augment a given answer. In particular, we hypothesize that our neural domain adaptation-based approach, due to its ability to additionally learn domain-invariant features from a large number of unlabeled, unpaired question-review samples, would perform better than our proposed baselines, at answering specific, subjective product- related queries with reviews. We validate this hypothesis using a small gold standard

185 dataset of reviews evaluated by human experts, surpassing our chosen baselines. More- over, our approach, using no labeled question-review sentence pair data for training, gives performance at par with another method utilizing labeled question-review sam- ples for the same task.

In this work I aim to demonstrate evidence primarily for hypotheses 2 and of the re- search statement, namely, that: (i) shared contexts may be leveraged to directly learn hidden associations between surface forms of latent concepts not necessarily in any knowledge base, but in datasets from different yet related domains in a transductive transfer learning setting such as ours, as opposed to modeling directly for the latent concepts across domains, and (ii) that this type of direct learning of associations be- tween latent concepts across domains by domain adaptation can lead to novel or near true answers in a downstream recommendation task such as the QA system that we evaluate in this chapter.

A significant portion of this chapter is under submission to TACL.

6.2 Motivation and Background

General question-answering (QA), in the context of opinion and qualitative as- sessments available to consumers via Q & A forums on product-based e-commerce websites, is a challenging open problem, e.g. consider a real-world question such as:

“Is the Canon EOS Rebel T5i worth the extra $200+ dollars to get as a starter camera, or should I just go with the cheaper T3i?”. Many such questions cannot be answered directly using knowledge bases constructed from product descriptions alone (McAuley and Yang, 2016), but clearly rely on personal experiences of others, for a satisfactory answer. Some questions, especially on newer items may not be immediately answered

186 – which can lead to “answer starvation”. Many questions have short, unrelated, or incomplete answers – these are candidates for “answer augmentation”, via plausible answer alternatives.

Product-related question answering (PRQA) has emerged as a new research area, different from traditional QA and community question answering (CQA) tasks, owing to the immense popularity of e-commerce websites, where answers to potential cus- tomer questions often involves opinions and experiences from different users, found in the plentiful customer reviews that can help discover products or features for a more personalized experience (Yu and Lam, 2018; Wan and McAuley, 2016). Besides a bi- nary “good” or “bad” assessment, product reviews tend to provide a wide range of :

(i) personal experiences; (ii) subjective qualitative assessments, (iii) unique use-cases or failure scenarios. Moreover, massive volume and range of opinions makes review systems difficult to navigate (Wan and McAuley, 2016). This opinion data raises two interesting questions: (i) How can we help users navigate massive volumes of consumer opinions to address “specific queries”? (ii) How can we build an end-end system that simultaneously leverages the labeled question-answer data from QA forums, with abun- dant unlabeled reviews that may carry valuable information that can answer a specific product-related question? Thus, inspired by Wan and McAuley (2016), McAuley and

Yang (2016) and a plethora of available architectures, we define the task of learning to answer product-related questions with reviews as:

“To be able to respond to specific, subjective, product-related queries automatically with reviews, to address problems such as answer starvation and answer augmentation by providing suitable answer alternatives, leveraging available signal from answer sentence

187 data to enable learning of relevant review sentences with minimal supervision, that can address a question.”

Given recent advances in sentence pair modeling and question answering (Sharp et al., 2016; Rockt¨aschel et al., 2015; Yin et al., 2015) via end-end neural approaches

(Rajpurkar et al., 2016; Nguyen et al., 2016; Sukhbaatar et al., 2015) we attempt to adapt an architecture that is most suitable for our problem setting. One key idea to note in the two types of user-generated sentences used in our task is that answer sentences and review sentences, though both capable of addressing user questions, are both created with very different intent and purpose. The former is specifically targeted toward some or all aspects of a question, and the latter providing some personal expe- riences and qualitative assessments regarding a product that may or may not satisfy some user question. Thus answer and review sentences come from two very different distributions. We hypothesize therefore, that being able to automatically learn the no- tion of a “good” or “correct” answer, i.e. learn features that constitute such an answer by leveraging the commonalities between correct answers and suitable review sentences, might be key to our solution. In this context, we find neural domain adaptation Ajakan et al. (2014); Ganin et al. (2016) to present an attractive alternative suitable for our task in a minimally supervised setting, and therefore decide to adapt this approach for our solution comparing against other strong baselines. To our knowledge, we are the

first to address product-related question answering by identifying answers from unla- beled review data with no supervision signal on the reviews and deriving only weak supervision from labeled question-answer data, in a neural domain adaptation setting.

188 6.3 Related Work

6.3.1 Addressing subjective product-related queries with re- views

“Mixture-of-Experts” frameworks combine several weak learners by aggregating their outputs with weighted confidence scores. In their 2016 work, McAuley & Yang show that such a model can be adapted to simultaneously identify relevant reviews and combine them to answer complex queries, by treating reviews as experts that either support or oppose a particular response. Bi-linear models Chu and Park (2009) can help to address the issue of questions and reviews being from different domains drawing from very different vocabularies, by learning complex mappings between words in one corpus and words in another (or more generally between arbitrary feature spaces), which can be regarded as a form of domain adaptation. McAuley and Yang (2016) thus develop a “mixture-of-experts”-based bilinear model, called MoQA, to simultaneously learn which customer opinions are relevant to the query, as well as a prediction function that allows each review opinion to ’vote’ on the response, in proportion to its relevance.

These relevance and prediction functions are learned automatically from large corpora of training queries and reviews .

Yu and Lam (2018) develop an answer prediction framework which consists of two components, viz., an aspect analytics model and a predictive answer model. Given a product category, the aim of the aspect analytics model is to detect and capture latent aspects from a collection of review texts in an unsupervised manner. To this end they employ a 3-order Autoencoder to model aspects from review texts in the same product category and learn aspect-specific embeddings for reviews. This aspect analytics model generates aspect distributions and embeddings of reviews capturing

189 hidden semantic features associated with certain aspects. The predictive answer model captures intricate relationships among question texts, review texts, and yes-no answers reporting answer prediction numbers surpassing McAuley and Yang (2016). However, this work caters only to yes/no questions in the dataset being considered (we use the same one), and does not involve learning matching features for open-ended questions, which may have subjective, opinion-based answers, which are the main focus in our work.

Chen et al. (2019) propose an answer identification framework from reviews which employs a multi-task attentive network, called QAR-net, leveraging both large-scale user generated question-answer data and manually labeled question-review data to achieve this goal. Multi-task learning can be an effective learning paradigm for boost- ing the performance of tasks with insufficient training instances by training jointly with related tasks having abundant training data, and they couple this paradigm with At- tention to obtain a network that allows a “question focus” to attend to various “answer patterns” across answer and review sentences. However they utilize manually labeled question-review samples for training purposes; thus it is not a completely unsupervised approach. Our work, however, being similar in setting to Chen et al. (2019), differs from works like Moqa in that MoQA uses reviews as supporting data for answer pre- diction (i.e. uses review sentences as supporting “experts” for “Yes” or “No” binary questions; ranking answers before non-answers for open-ended questions), and does not try to actually identify potential answers from review sentences. To our knowledge, we are the first to address product-related question answering by identifying answers from reviews in a completely unsupervised manner for the question-review data, i.e. based

190 on no supervision labels on the reviews data and only using the available supervision from question-answer data, in a neural domain adaptation setting.

6.3.2 Modeling Sentence Pairs

Text pairs can exhibit various relations, including paraphrase, entailment, question- answer, translation and more. Early systems designed to model these relations based on lexical overlap or word pairs often fail to generalize to unseen word pairs, and can have difficulty learning synonymy effectively due to sparse features. Neural networks with dense text embeddings can more effectively learn synonymy and other relations, with attention helping to extend learning beyond word pairs to phrasal pairs (Bahdanau et al., 2014).

6.3.3 BiCNN and ABCNN-3

Convolutional neural networks (CNNs) have been used effectively for a variety of natural language processing tasks (Kim, 2014), following earlier successes in image recognition (Krizhevsky et al., 2012). The hierarchic nature of language is a natural parallel to local structure in image data, where nearby words often form meaningful phrases in the way nearby pixels form meaningful sub-units of the complete image.

The ABCNN (Yin et al., 2015) set of models is known to be effective at modeling sentence pair relations for tasks including answer selection, paraphrase identification, and textual entailment. In our work, the most effective reported versions of ABCNN without and with attention viz. BiCNN and ABCNN-3 respectively, are selected for use as a baseline for answer- or review-selection. The baseline BiCNN used in this work consists of two weight-sharing CNNs, each processing one of the two sentences, and a

191 Figure 6.2: BiCNN baseline of Yin et al. (2015)

final logistic regression layer at the top that solves the sentence pair task by making a sentence pair binary labeling decision.

While the non-attention-based BiCNN model is shown to have performance com- parable to the full ABCNN models with fewer parameters, we choose ABCNN-3 as an additional baseline to evaluate against, as it combines the strengths of their other two attention-based models ABCNN-1 and ABCNN-2, allowing the attention mechanism to operate both on the convolution and on the pooling parts of a convolution-pooling block in that architecture. We adapt these models for our work from a third-party implementation6.

6http://github.com/galsang/ABCNN

192 6.3.4 Reasoning for Textual Entailment

Figure 6.3: Conditional encoding-based Attentive LSTM for textual entailment of Rockt¨aschel et al. (2015).

Attention-based neural networks have recently demonstrated success in a wide range of tasks ranging from handwriting synthesis (Graves, 2013), machine translation (Bah- danau et al., 2014), to image captioning (Xu et al., 2015), and speech recognition

(Chorowski et al., 2015) to sentence summarization (Rush et al., 2015). The idea is to allow the model to attend over past output vectors (see Figure 6.3), thereby miti- gating the LSTM’s cell state bottleneck. More precisely, an LSTM with attention for recognizing textual entailment (RTE) does not need to capture the whole semantics of

193 the premise in its cell state. Instead, it is sufficient to output vectors while reading the premise and accumulating a representation in the cell state that informs the second

LSTM which of the output vectors of the premise it needs to attend over to determine the RTE class (Rockt¨aschel et al., 2015). We believe that this type of sentence pair model lends itself well to a QA setting where a potential answer bears some level of en- tailment relation with elements of a question, hence we choose this as another baseline model to evaluate against.

6.3.5 Neural Domain Adaptation

Top-performing deep neural architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation (DA) often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. As the training progresses, the approach promotes the emergence of “deep” features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. This adaptation behaviour could be achieved in almost any feed-forward model by augmenting it with few standard layers and a simple new gradient reversal layer as depicted in Figure 6.4. The resulting augmented architecture is thus trained using standard back-propagation of Ajakan et al. (2014); Ganin and

Lempitsky (2015).

6.3.6 Proposed Model – Domain Adversarial Neural Network

Given our task of predicting relevant product reviews that can answer a specific product question or provide additional detail, we hypothesize that our task is well- suited for and can benefit from domain adaptation, in which the data at training and

194 test time come from similar but different distributions; in our case, ground truth an- swers and reviews, because here we want to employ a representation learning approach for effective transfer of information in reviews that is “related” or “well-matched”, to

“good answers” to specific questions. For such transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. Thus our proposed model for this work is the one adapted from

Ganin et al. (2016) but with the input as sentence-pair data, i.e. question-answer and question-review pairs, and we expect this model to do better than our other chosen sentence pair baselines. In our experiments the labeled answer sentences to ques- tions represent the source domain data and unpaired (unlabeled) reviews represent the target domain data. Our proposed approach to domain adaptation is thus to train on large amounts of labeled data from the source domain (labeled Q-A pairs) and large amount of unlabeled (previously unpaired) data from the target domain, i.e reviews, as no labeled target domain data is necessary (Ganin and Lempitsky, 2015).

Our experiences with this model is further described in the Experiments.

The Domain Adversarial Neural Network (DANN) of Ajakan et al. (2014); Ganin et al. (2016) performs domain adaptation by optimizing a minimax training objective that simultaneously learns to do well at task-specific label prediction while doing poorly at domain label prediction. This is motivated by the theory on domain adaptation by

Ben-David et al. (2010) that a transferable feature is one for which an algorithm cannot learn to identify the domain of origin of the input observation. In the DANN model this is achieved by adversarial training of a domain classifier by reversing the gradient from it during backpropagation. Our adaptation of this model passes in labeled question- answer (QA) pairs and unlabeled question-review (QR) pairs thereby learning to answer

195 Figure 6.4: Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the feature extractor via a gradient reversal layer Ajakan et al. (2014); Ganin et al. (2016).

Class-Labeled Question-Answer Pairs / QA QA labels Unlabeled (0) classifier Question Question-Review invariant invariant Pairs (DA setting) – BiLSTM Encoder Domain-Labeled Question-Answer Pairs / Answer/ representation Question-Review Domain Domain Labels Review Pairs Classifier Shareddomain BiLSTM Gradient Flip

Figure 6.5: Architecture of the adapted DANN model for Question-Answer/Review sentence pairs

a question and learning the domain-invariant features between two domains, i.e. review sentences and answer sentences, at the same time. To be specific, there is an encoder to learn the vector representation for the question and the answer/review separately.

Afterwards, the representations are concatenated together to represent the paired data

196 (“question + review” and “question + answer”). Only “question + answer” pairs are input into the label predictor which in our case is the QA classifier, to predict whether the answer could really answer the question. Both of “question + answer” pairs and

“question + review” pairs are input into a Domain classifier to predict whether the current input is a answer or a review. Both, the QA classifier and Domain classifier use cross entropy loss for back-propagation. The main idea of domain adaptation here is reversal of the gradient coming from the domain classifier as shown in Figure 6.5. That is to say, when the gradients of domain classifier are back-propagated to the encoder, their negative values are actually used to update the parameters of the encoder. The point is to eliminate the influence of domain-specific features learned by the domain classifier and keep the domain-invariant features. Thus, in the inference stage, when a review is input into the QA classifier, the encoder will learn some features which are similar to answer domain and the QA classifier will predict whether the review can answer the question or not. We hypothesize that this approach can greatly help with answering open-ended product-related questions with reviews.

Our implementation of the DANN model uses separate bidirectional LSTMs to encode question and answer or review sentence separately to get their vector repre- sentation Vq,Va/r and concatenates them to get the vector representation for the pair,

[Vq; Va/r]. The inputs to the model are the 300 dimensional GloVe vectors with random uniform sampled vectors for unknown words. The hidden state for the paired LSTM is also 300-dimensional. The QA classifier and domain classifier are each two-layer

(512-256) feed forward fully-connected networks. A gradient flip layer is added before the domain classifier. When calculating the loss, we use a mask operation to calculate the softmax loss of the QA classifier only for QA pair inputs.

197 6.4 Dataset

We use the newer Q & A dataset made publicly available by McAuley and Yang

(2016)7, the authors of the original work on answering product-related questions with reviews, which was developed off of the original SNAP dataset for Amazon product reviews and ratings 8. This dataset consists of paired questions and answers and de- duplicated product reviews on Amazon, across 24 different product categories. The question answer data, total around 1.4 million answered questions. 56.1% of the questions are binary, i.e. having “Yes” or “No” answers, and the rest constitute “Open-

Ended” questions with more subjective or specific answers. For our work we select only the set of Open-Ended (OE) questions that may include multiple answers to each question, from 6 product categories, viz. Automotive, Baby, Electronics, Home

& Kitchen, Sports & Outdoors and Tools and Home Improvement. Product reviews and metadata from Amazon, total 142.8 million spanning May 1996 - July 2014, and includes reviews (e.g. ratings, text, helpfulness votes), and product metadata (e.g. descriptions, category). After cleaning and matching by product id (ASIN), we have a total of 128K unique ASINs, that match a total of 1,06,1402 OE Q & A pairs, with 2,852,954 reviews, as shown in Table 6.1. Important to note here is that the datasets of QA pairs are created in a way such that review answer sentences are only selected from the matching subset of reviews corresponding to the same product from the category that the question belongs to. We also have a small hand-labeled dataset of 1725 Question-Review pairs acquired from the authors of the original work

McAuley and Yang (2016) that corresponds to approximately 300 pairs for each of

7http://jmcauley.ucsd.edu/data/amazon/ 8https://snap.stanford.edu/data/web-Amazon.html

198 Category Q-A pairs Reviews Automotive 18,214 20,474 Baby 40,429 160,793 Electronics 472,678 1,689,189 Home & Kitchen 283,637 551,683 Sports & Outdoors 140,120 296,338 Tools & Home 106,324 134,477 Total 1,061,402 2,852,954

Table 6.1: Dataset Statistics for Open-Ended, Multi-Answer Q-A Pairs and Re- views matched on 128K unique ASINs from the Amazon product review dataset McAuley and Yang (2016).

these 6 categories. This data was generated by human experts for evaluating their models, which we use for automated target domain-only evaluation of our models.

6.5 Experiments

In order to evaluate the hypothesis that domain adaptation improves candidate review sentence answer selection, this work evaluates a variety of models for our de-

fined task, with and without domain adaptation. Two baselines for sentence pair modeling are evaluated. These include: (1) a CNN-based Siamese network architec- ture due to Yin et al. (2015) – we experiment with both, their non-attentional BiCNN and attention-based ABCNN-3 models, and (2) a Reasoning for Textual Entailment

(RTE) model employing a conditional encoding-based attentive LSTM architecture

Rockt¨aschel et al. (2015). Question-answer data and a small set of question-review data from McAuley and Yang (2016) are used to train and evaluate the models re- spectively. Given gold answers and reviews for product-related queries, the answers are taken as the Source domain and reviews are taken as the Target domain.

199 Category Training Set Eval Source Data Eval Target Data Auto 49.93%/50.07% 50.01%/49.99% 66.67%(0)/33.33%3(1) (300 instances); Baby 49.86%/50.14% 49.86%/50.14% 66.67%(0)/33.33%(1) (300 instances); Electronics 49.84%/50.16% 49.85%/50.15% 66.67%(0)/33.33%(1) (252 instances); Combined 80.22%/19.78% 50%/50% 77.22%(0)/22.78%(1) (1725 instances);

Table 6.2: Label Proportions (0/1 for unrelated/related) in each dataset for a total number of instances.

Since our datasets for training and evaluation are fairly balanced, we report the best accuracy for evaluations with each of our models. Table 6.2 shows the proportion of

0/1 labels in each of our datasets. Each system was trained on labeled question-answer pairs for Auto, Baby and Electronics product categories and a larger Combined dataset involving each of the 6 categories listed in Table 6.1, for a total of 4 trained models per system, and scored on 4 sets of evaluations, one each on source and target domain test sets for each model.

6.6 Results and Discussion

6.6.1 ABCNN models: BiCNN and ABCNN-3

ABCNN results provide a baseline evaluation for question answering that does not involve domain adaptation. Because labeled sentence pairs are required for training, there is no easy way to incorporate unlabeled review data, which is plentiful but not targeted to any specific question. Results for both the BiCNN and ABCNN-3 models in Table 6.3 show strong performance when evaluated on the in-domain answer data,

200 Model on Train Eval Attn. LSTM BiCNN ABCNN-3 Data Data Data Accuracy Accuracy Accuracy Auto Source Source 50.12% 70.0% 70.0% Auto Source Target 66.67%* 53.0% 52.9%* Baby Source Source 50.37% 71.0% 71.4% Baby Source Target 61.33% 54.0%* 51.4% Electronics Source Source 60.41% 74.0% 72.08% Electronics Source Target 66.67%* 53.0% 52.05% Combined Source Source 52.34% 69.0% 70.55% Combined Source Target 53.73% 64.0%* 61.44%*

Table 6.3: Results from Experiments with the Baseline Sentence-Pair models: Conditional-encoding-based Attentive LSTM, BiCNN and ABCNN-3. Bold indicate best performance in that row of results while * indicates best performing for a partic- ular model on target domain evaluation across individual categories or combined.

but a marked decrease when evaluated on out-of-domain review data, with BiCNN still giving the best performance on target domain evaluation for the Combined dataset at

64.0%.

6.6.2 Attentive LSTM for RTE model

For modeling sentence pairs using textual entailment, we adapt the version of the conditional encoding-based attentive LSTM neural architecture of Rockt¨aschel et al.

(2015), that uses the encoding for the question representation learnt by the first LSTM, with the cell state of the first LSTM, as conditional input into the second LSTM that models the answer or review sentence, and which then learns an attended representation over the conditional input to generate the final representation for QA classification9

9Our implementation of this model is taken from a third-party implementation – https://github.com/shyamupa/snli-entailment for the same, and adapted to work for our dataset, and QA-based class labels instead of entailment.

201 10. Table 6.3 outlines the results from the experiments with this model by individual category and also with the Combined training dataset for the source-only and target- only task, using the conditional encoding-based LSTM with attention adapted from the reasoning for textual entailment task. Our experiments show that the RTE model gives the best performance on target-domain review data for each individual category, getting 66.67%, 61.33% and 66.67% on Auto, Baby and Electronics categories, but with ABCNN models doing much better on target domain evaluations for the Combined dataset. It is perhaps worth noting that the RTE model consistently fares worse on the source domain task when trained on source domain data as compared to out-of-domain target classification task, across categories, with no domain adaptation. This indicates perhaps that the conditional attention-based mechanism of this RTE model is able to better generalize for the out-of-domain task in the absence of domain adaptation.

6.6.3 DANN model

The results from experiments with running the DANN model are listed in Table

6.4. When we do not use domain adaptation, the DANN is simply a binary QA classification system to answer product-related queries, trained only on Q-A pairs.

When using domain adaptation, the model is trained on both Q-A and Q-R pairs, indicated by the red dotted box in Figure 6.5. We evaluate our final model on source and target test sets, for Q-A pairs and Q-R pairs. As we can see from the Table 6.4, after using domain adaptation, the performance on Q-R pairs is improved greatly which demonstrates that adding Q-R pairs via domain adversarial training can help answer

10In order to run our experiments with this model, we tweaked the format of our dataset to get it to resemble SNLI format, but in our case we only had binary ’entailment’ and ’contradiction’ labels to represent ’yes’ and ’no’ answers and had no ’neutral’ class label in the data. Both question-answer and question-review pairs were prepared in this way for passage through this model.

202 DANN Model Train Data Eval Data Accuracy on Data Train Data Eval Data Accuracy Auto Source (QA pairs) Source (QA pairs) 64.55% Auto Source (QA pairs) Target (QR pairs) 37.33% Baby Source (QA pairs) Source (QA pairs) 72.19% Baby Source (QA pairs) Target (QR pairs) 35.66% Electronics Source (QA pairs) Source (QA pairs) 75.19% Electronics Source (QA pairs) Target (QR pairs) 40.47% Combined (No domain-adapt) Source (QA pairs) Source (QA pairs) 83.28% Combined (No domain-adapt) Source (QA pairs) Target (QR pairs) 50.11% Combined Source, Target (Domain-adapt) (QA, QR pairs) Source (QA pairs) 73.95%* Combined Source, Target (Domain-adapt) (QA, QR pairs) Target (QR pairs) 77.17%*

Table 6.4: Results from experiments with the DANN model. Bold * indicate best results with domain adaptation on target domain evaluations

product queries using out-of-domain review sentences. As seen in the experimental results from our baseline sentence pair models and our proposed version of the DANN model in Tables 6.3 and 6.4, we see that the DANN model outperforms the baselines on the out-of-domain Q-R pair classification task, validating our original hypothesis.

The best performing ABCNN model evaluated on the source domain classification task gets a 70.55% accuracy on a held out test set of 10K Q-A samples, and 64.0% on Q-

R classification with our human-labeled evaluation-only dataset from McAuley et al., of 1740 samples. Somewhat surprising is that our best performing RTE-based model gave a best accuracy of 66.67% for Electronics and Auto categories on out-of-domain

Q-R classification, with Q-R classification significantly surpassing Q-A classification performance on all categories including Combined when trained on source domain

203 data for RTE. Our DANN-based model was able to beat both of these baselines with an accuracy of 77.17% on Q-R classification, with large amount of unlabeled reviews incorporated into the domain adversarial training.

6.7 Evaluation, Conclusion and Future work

Table 6.5 shows some examples from Q-R inference using the DANN model, where it learned to correctly classify relevant review sentences that can answer a question. The bold with italics texts highlight the concepts identified across question and inferred answer sentences, and the plain bold text highlights the possible effects of domain adaptation on inferring common or similar features between the related concepts, e.g. between hooks that might “slide” or “stay in place” and Mommy Hooks that are

“not stationary, always sliding around”. Similarly, between this device that may be

“attached to a surge protector” or “used with a power strip” and this item that you can “plug a surge protector power strip into” and then “plug all devices into surge protector strip”. In addition, as seen in Table 6.6, when compared with the

QAR-Net system (Chen et al., 2019) for the same task, our method gives comparable performance on F-1 score and significantly higher precision, but for a combined dataset

(1 million+ samples with all the categories) and trained with unlabeled question- review samples using domain adaptation, compared to data from only Electronics and

Cellphones & Accessories categories for QAR-Net, which uses labeled question-review instances for training. We believe this demonstrates the suitability and promise of our approach compared to other methods for this task.

As done for the previous systems in Chapters 4 and 5, in order to show evidence for Main Idea 1, we perform manual analysis for potential matching concepts found

204 in the inferred review–based answer sentences against those in the question, using the

WordNet ontology as one source. We map concepts to WordNet hypernyms for poten- tial latent concepts, and we reason about the possible relations between them. Thus,

Table 6.7 attempts to show meaningful latent concept mappings between concepts in question and answer sentences as found by the system, and also meaningful potential relations between them, via use of WordNet or external resources as applicable. Since we are able to obtain these mappings and show the relations as listed in Table 6.7, we believe, this provides evidence for Main Idea 1. Again, Main Idea 2, about lever- aging shared contexts, is reflected in the design choice for the system itself, in this case application of neural domain adaptation across labeled and unlabeled disparate data distributions. Finally, the fact that our experimental results from Tables 6.4 &

6.6 show improved semantic matching in the chosen product-based question answering task, provides evidence, we believe, for Main Idea 3.

Our proposed neural domain adaptation approach for question-answer/review sen- tence pair classification via domain adversarial training thus shows good results for learning to answer specific customer questions with product reviews. It is able to use plentiful unlabeled review data during training to better generalize to review data at inference time, outperforming numerous baseline models that cannot easily incorpo- rate such data. Future work may involve incorporating unsupervised objectives on review data to further improve the model. Our novel approach to the PRQA task thus leverages previously unused data without requiring explicit supervision, using domain adaptation to identify product-specific answers in product reviews.

205 Question Review Sentence Since the hooks attach with velcro, do “I originally purchased the Mommy Hooks they slide or do they stay in place ? for our stroller and loved the durability of the metal, but ended up hating how big and clunky they are, and they are not station- ary, always sliding around.” Does this device offer sure protection “I mean this should go directly to your out- ? Can it be attached to a surge pro- let and you can plug the surge protector tector ? Any problems using a pow- power strip into this item and all of your erstrip with it ( for electronics) ? devices into the surge protector strip.”

Table 6.5: Examples of Q-R target-only inference on positive examples that the DANN model gets right.

System Precision Recall F-1 score QAR-Net 53.85% 60.67% 57.05% DANN 64.11% 52.46% 56.23%

Table 6.6: Comparison of QAR-Net system with the DANN model for PRQA.

Concepts WordNet/Other Concepts WordNet/Other Possible in Question Latent Concepts in Answer Latent Concepts Relations hooks [WN hypernyms: Mommy [WN hypernyms: hasBranded catch, fastener]; Hooks catch, fastener]; Type device [WN hypernyms: item [WN hypernyms: variationsOf artifact, entity]; part, entity]; ObjectType surge [WN hypernyms: surge [WN hypernyms: variationsOf protector suppressor, entity]; protector suppressor, entity]; Suppressor power strip power [Wiki hypernym: surge [WN hypernyms: variationsOf strip “multiple socket protector suppressor, entity]; Electrical power board”]; strip Extension

Table 6.7: Latent Concept mappings and Possible Relations between question and answer sentence concepts

206 CHAPTER 7

EVALUATION, CONCLUSION AND FUTURE DIRECTIONS

Addressing “novel contexts” has always been a challenge in a variety of natural language processing and knowledge discovery tasks such as information retrieval for complex queries and question answering for subjective, specific, nuanced natural lan- guage questions. We hypothesize that these novel contexts may be best resolved by a process of concept discovery as outlined by the Main Ideas in the research statement.

Through this dissertation we show how the proposed systems employing methods to facilitate knowledge transfer can aid in the process of finding related concepts in con- text, or concept discovery, within and across domains. This process, we expect, can then work to better address these novel contexts in various unsupervised and semi- supervised task settings, enabling improved semantic matching for user intent. In addition to providing evidence for the Main Ideas outlined in the research statement through these systems, and also through analysis by expert validations which is pre- sented in the section to follow, we introduce and lay the groundwork for a process of concept discovery that can inform future research directions.

207 7.1 Expert Evaluation of Main Ideas

This section is intended to provide initial evidence supporting some of the main ideas surrounding the research statement. To recapitulate, our first main idea was that, “we may leverage the assumption that hidden associations may exist between latent concepts expressed via their surface forms, to develop models that can learn to predict novel associative terms in documents, that may improve query matching”.

To this end, in order to find evidence that hidden associations or latent relations, not currently present in a knowledge base, may exist between same or related “latent con- cepts” expressed via their corresponding surface forms concepts or entities, we ask an expert to provide their judgments for possible hidden relations between concepts in query and target documents that were associated as a result of experiments with the systems described in Chapters 4 and 5. The human expert is a resident doctor in the medical profession who we believe is better able to provide such judgment pertaining to documents from the biomedical domain. Two sets of expert-based validation were con- ducted: 1) To verify if the presented terminology may constitute a concept (biomedical or otherwise), based on the definition of a concept in Section 1.5, and, 2) To verify the existence of possible relations or hidden associations between concepts (again, may be a biomedical relation or otherwise). In the first, the expert was presented with some concept tags obtained from running experiments on the Seq2set framework where those specific terms were not found in the UMLS, although present in the collection and in- ferred as tags on certain documents. We clarified that some of these terms while all lowercase, may represent acronyms. We also obtained their reasoning for the specific judgments. We found at times the reasoning did not follow the exact definitions we

208 provided. For the second validation, we presented the expert with concept pairs re- sulting from experiments ran on the Phrase2VecGLM and Seq2set frameworks. Here the starting concept term is from a query or query document, and the inferred concept tag is from the collection. We then associated these concepts with their correspond- ing UMLS CUI-based latent concepts. Given these latent concept mappings, and our reasoning for the possible relations that exist between them, we asked the expert to evaluate these potential relations for plausibility, by again a Yes or No judgment.

Outlined below are a summary of the results from such expert-based validation experiments. For the unigram version of the Phrase2VecGLM system, we have:

Query terms Related Concept tags Possible Expert (with associated (with associated Relations Judg- UMLS CUI–based UMLS CUI–based ments Latent Concepts) Latent Concepts) dementia alzheimers sameAs Yes [C0497327 Dementia]; [C0002395 Alzheimer’s causedBy Yes [C0002395 Alzheimer’s Disease]; Disease]; [C0036341 Schizophrenia]; cognitive behavioral isAbout No [C1516691 Cognitive]; [C0004927 Behavior]; [C0009240 Cognition]; [C0004941 Behavioral [C0009241 Cognition Disorders]; Symptoms]; bp diabetes elevated [C0020538 Hypertensive disease]; [C0011849 Diabetes Expressions Mellitus]; OfDiseases Yes

Table 7.1: Expert Judgments on UMLS CUI-based Latent Concept mappings and Possible Relations for unigramGLM

For the phrasal GLM, we have:

209 Query terms Related Concept tags Possible Expert (with associated (with associated Relations Judg- UMLS CUI–based UMLS CUI–based ments Latent Concepts) Latent Concepts) albendazole Strongyloides treatsFor Yes [C0001911 Albendazole]; stercoralis [C0038462 Strongyloides stercoralis]; eosinophilic ascites corticosteroid therapy mayBe [C4703536 Eosinophilic [C0149783 Steroid therapy]; TreatableBy Yes ascites]; Strongyloides stercoralis]; parasitic infection case hyperinfection [C3686777 Parasitic [C4524208 infection of Strongylodiasis mayBe digestive tract]; hyperinfection]; AFormOf Yes

Table 7.2: Expert Judgments on UMLS CUI-based Latent Concept mappings and Possible Relations for Phrase2VecGLM

And for the Seq2Set framework, we have,

Query terms Related Concept tags Possible Expert (with associated (with associated Relations Judg- UMLS CUI–based UMLS CUI–based ments Latent Concepts) Latent Concepts) obesity dyslipidaemia isTypeOf Yes [C0028754 Obesity]; [C0242339 Dyslipidemias]; [C3160761 Diabetic mayBe dyslipidaemia]; CausedBy Yes diabetes hyperglycemia comorbidity [C0011847 Diabetes]; [C0020456 Hyperglycemia]; With Yes pulmonary bmi mayBe hypertension [C1305855 Body mass CorrelatedWith Yes [C0020542 Pulmonary index]; Hypertension]; children subjects mayBe [C0008059 Child]; [C0080105 Research Subject]; TypeOf Yes

Table 7.3: Expert Judgments on UMLS CUI-based Latent Concept mappings and Possible Relations for the Seq2set framework

210 Concept Is this a Sentence from PubMed article Comments terms Concept? concept appeared in / Rea- Yes/No soning Expert for judge- Judgment ment heartprints Yes Two dimensional density plots called Appears to heartprints correlate characteristic fea- be a metric tures of the dynamics of premature ventricular complexes and the sinus rate . Heartprints show distinctive characteristics in individual patients . smily-illness Yes We tested SMILY-Illness in patients Appears to with inflammatory rheumatic diseases be a screen- and then translated it into num- ing tool bers decimals languages pi-ocd Yes Poor-insight obsessive-compulsive dis- order is severe form of OCD where the typically obsessive features of intrusive egodystonic feelings and thoughts are absent . conscripts No In the present study levels of alcohol Conscripts use and abuse were measured in sam- = enlisted ple of numbers decimals male training members conscripts of the Hellenic Navy . doege-potter Yes Doege-Potter syndrome is paraneoplas- It’s a syn- tic syndrome characterized by non-islet drome cell tumor hypoglycemia secondary to solitary fibrous tumor . ntid No Drugs with a narrow TI (NTIDs) have It’s an ad- a narrow window between their effec- jective tive doses and those at which they pro- duce adverse toxic effects .

Table 7.4: Expert judgments for concept tags for documents, for Concept Yes-or-No, where no latent concept mapping via UMLS CUI could be found.

211 Thus tables 7.1, 7.2, and 7.3, where most of the possible relations between latent concept CUIs represented by the concept terms, have been validated as plausible by a medical expert, provides initial evidence, we believe, towards Main Idea 1.

Next, table 7.4 shows the Yes-or-No responses obtained for concepts assigned as tags to documents where no UMLS CUI could be found corresponding to those con- cepts. Additionally, Table 7.5 shows expert judgments on possible relations between concept pairs where one of them is a concept tag occurring in Table 7.4, that has no correponding CUI in the UMLS, but is validated as being a concept. The fact that most of these potential relations in Table 7.5 were found to be valid by an expert coupled with the fact that these non-existent CUI terms actually represent concepts, also validated by an expert, give initial evidence for Main Idea 1, and also provide sup- port for the research statement which aims to show that leveraging these associations between latent concepts, may help to address novel contexts by discovering related concepts in some downstream recommendation task setting. Since concepts such as heartprints, smily-illness, doege-potter, pi-ocd etc., that are not currently found in the UMLS, were inferred by the system as semantic tags for documents, by one of the better performing models from the Seq2Set framework, this we believe provides greater support for Main Idea 3, as well as alludes to the possibility as outlined in the research statement, that automated discovery of related concepts in the downstream task may be at the heart of better matching to, or better addressing of, novel contexts, and shows some promise for future work in this direction.

Having thus attempted to show some supporting evidence for the Main Ideas and the research statement, we now proceed, in this final chapter, to provide a summary

212 Query terms Related Concept tags Possible Expert (with associated (with associated Relations Judg- UMLS CUI–based UMLS CUI–based ments Latent Concepts) Latent Concepts) dcm heartprints measured [C0007193 [No UMLS CUI, ByIndicator Yes Cardiomyopathy, Dilated]; “A class of Metrics”]; leiomyosarcoma smily-illness associated [C0279986 Childhood [No UMLS CUI, “A class WithDiseases Leiomyosarcoma]; of Screening Tools”]; AffectingLungs Yes pleuropulmonary doege-potter associated [C1266144 [No UMLS CUI, WithDiseases Pleuropulmonary blastoma]; “A class of syndrome”]; AffectingLungs Yes compression pi-ocd comorbidity [C0027743 Nerve [No UMLS CUI, With No compression syndrome]; “A class of Disorders”];

Table 7.5: Expert Judgments for concept tags having no UMLS CUI-based mappings to Latent Concepts, and their Possible Relations to query concepts for the Seq2set framework

account of the main contributions, giving insights into further possible enhancements in each area, as well as future directions for this research.

7.2 Summary of Contributions

7.2.1 Exploring complementary signals for subscriber churn prediction via topic models

Given a rapidly declining print subscriber base for a large newspaper, in this work we mine different Web information sources in parallel with transactional data to build several models of subscriber churn, using a maximum entropy model that can incorpo- rate a variety of features, and an unsupervised learning approach such as LDA-based probabilistic topic modeling, well-suited to the task of learning topic features from the

Web data. To our knowledge our work is the first to study patterns of both on-line and

213 offline behavior of customers, by tying together Web and relational databases of user activity, for the task of predicting customer churn, in contrast to previous works in this space that only look at features from transactional data of newspaper subscribers.

We then extend these analyses by finding correlations of the top-ranked features from our various models, with churn. Based on these insights and more Web news data, we obtain refined web models of churn by tuning parameters on topic models. Fur- ther, we hypothesize that news and click logs web data that have significant temporal overlap in the subscriber base, complement each other in interesting ways for the same topics. In particular, we perform topic inference on click logs based on topic models learned on news data, and vice versa, to identify interesting trends from these complementary signals. The interesting trends are further validated by performing sen- timent analysis on the topics in our inferred models. Our topic inference models on the complementary signals from web data, representing the NEWS and CLICKLOG domains, further enhance our analyses about which factors, whether temporal, transactional or derived from web news page usage and sentiment, directly affect newspaper subscriber engagement. We use these augmented analyses to present several key insights into customer engagement by providing a comparison of the feature sets of these models, with respect to their predictiveness, in particular, juxtaposing the web-based topic and sentiment features from NEWS and CLICKLOG sources which we find to be complemen- tary in nature, highlighting particularly how topics transfer across the two sources, in the context of churn prediction.

Our experimental results confirm our intuition for using Web features to model subscriber churn and demonstrate the value of extracting signal from the Web, both

214 transactional and non-transactional. In addition, our human-guided approach to im- proving our models of churn prediction via parametrized exploration of the models and topic inference on complementary signals from web data, further enhance our analyses about which factors, whether temporal, transactional or derived from web usage and sentiment, directly affect newspaper subscriber engagement. The insights drawn on this data by means of topic transfer across sources for prediction, informs the sub- squent works where the goal is to improve the performance on recommendation tasks by transfer of latent concepts within and across data domains. Thus this work represents a set of experiments employing transfer learning via inferred topic models for prediction, that serve to inform much of the subsequent direction of this research has taken.

7.2.2 Phrasal neural generalized language model for semantic tagging

In this work, in keeping with the research statement of section 1.5, we hypothesize that, similar to human experts who can determine the “aboutness” of an unseen document by recollecting meaningful concepts gleaned from shared contexts across current and past experiences, a completely unsupervised machine learning model could be trained to associate documents within a large collection with meaningful concepts

“discovered” by fully leveraging shared contexts within and between documents that are found to be “related” (Turney and Pantel, 2010; Pantel et al., 2007; Bhagat and

Ravichandran, 2008; Li et al., 2011; Bhagat and Hovy, 2013; Xu et al., 2014; Kholghi et al., 2015). This, we believe, could better address the vocabulary gap between potential user queries and documents, thereby naturally augmenting results for downstream

retrieval or question answering tasks (Lin and Pantel, 2001a; Diekema et al., 2003;

McAuley and Yang, 2016). To our knowledge, ours is the first work that employs word and phrase-level embeddings in a language modeling approach as opposed to a VSM-based approach (Sordoni et al., 2014), to semantically tag documents with appropriate concepts for use in retrieval tasks (Kholghi et al., 2015; De Vine et al., 2014; Zhang et al., 2016; Zuccon et al., 2015; Sordoni et al., 2014; Tuarob et al., 2013), in contrast to previous similar approaches to document categorization for retrieval, based on clustering (Halpin et al., 2007; Lin and Pantel, 2002, 2001b), LDA-based topic modeling

(Blei et al., 2003; Griffiths and Steyvers, 2004; Tuarob et al., 2013) and supervised or active learning approaches (Kholghi et al., 2015).

We hypothesize that since word embedding techniques use the information around the local context of each word to derive the embeddings, using these within a language model (Ganguly et al., 2015b) to derive terms or concepts that may be closely associated with a given document in the collection, despite no lexical overlap between the query and a given document, and further, extending the model to use embeddings of candidate noun phrases, could leverage such shared contexts toward query expansion, potentially augmenting both: 1) the global context analysis in IR, leading to better downstream retrieval performance from direct query expansion, and 2) the local context analysis from top-ranked documents, leading to better query refinement within a pseudo-relevance feedback loop (Su et al., 2015; Xu and Croft, 2000). Extending the word embedding–based language model due to Ganguly et al. (2015b), the phrasal embeddings used in our language model are intended to provide better support for meaning while serving to relax the independence assumption between term occurrence events in standard language models, by updating probabilities using similarity scores

between query terms and query documents, and query terms and their nearest neighbors, thus leading to a “generalized” language model providing greater smoothing over seen and unseen terms, that we expect could generalize better to out-of-vocabulary query terms.
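To make the smoothing idea concrete, the sketch below shows one way such an embedding-based interpolation could be computed; it is a simplified illustration rather than the exact Phrase2VecGLM formulation, and the mixing weights lam and alpha, along with the function and argument names, are illustrative assumptions.

import numpy as np

def glm_term_probability(term, doc_terms, embeddings, collection_tf, collection_len,
                         lam=0.6, alpha=0.3):
    """Illustrative smoothed P(term | doc) for an embedding-based generalized LM.

    Interpolates (1) the in-document maximum-likelihood estimate,
    (2) an embedding-similarity term over the document's words/phrases, and
    (3) a collection-level background model.  The exact weighting scheme of the
    model described above differs; lam and alpha are illustrative mixing weights.
    """
    # (1) maximum-likelihood estimate within the document
    p_ml = doc_terms.count(term) / max(len(doc_terms), 1)

    # (2) similarity-based "semantic" evidence: how close is the term's embedding
    #     to the words/phrases actually observed in the document?
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    if term in embeddings:
        sims = [max(cosine(embeddings[term], embeddings[w]), 0.0)
                for w in doc_terms if w in embeddings]
        p_sim = (sum(sims) / len(sims)) if sims else 0.0
    else:
        p_sim = 0.0

    # (3) background collection model, which smooths over unseen terms
    p_coll = collection_tf.get(term, 0) / max(collection_len, 1)

    return lam * p_ml + alpha * p_sim + (1 - lam - alpha) * p_coll

Ranking the other documents in a collection against a query document then reduces to scoring each candidate under such a query-likelihood-style model.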

Using this phrasal embedding–based generalized language model, we generate top-ranked document sets for each query document. We subsequently select concepts to tag query documents with from the top-ranked sets. We apply our language model–based latent concept discovery to query expansion in both “direct” as well as “relevance feedback” settings, evaluating the expanded queries on a separate ElasticSearch-indexed search engine setup (Chen et al., 2016; Gormley and Tong, 2015), where we demonstrate statistically significant improvements in both settings over baselines using both human expert–derived terms and UMLS ontology–derived terms for query expansion.
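As a rough illustration of the concept selection step, the snippet below picks candidate expansion or tag terms from the top-K documents retrieved for a query document by aggregating TF-IDF weights over the feedback set; the function name and the simple scoring rule are illustrative stand-ins rather than the exact procedure used in this work.

from collections import Counter
import math

def prf_expansion_terms(top_k_docs, collection_df, n_docs, n_terms=10):
    """Pick expansion/tag terms from a pseudo-relevance feedback set.

    top_k_docs: list of tokenized top-ranked documents for one query document.
    collection_df: document frequency of each term over the whole collection.
    A simplified TF-IDF-based selection, used here only to illustrate the idea.
    """
    scores = Counter()
    for doc in top_k_docs:
        tf = Counter(doc)
        for term, freq in tf.items():
            idf = math.log((1 + n_docs) / (1 + collection_df.get(term, 0)))
            scores[term] += (freq / max(len(doc), 1)) * idf
    return [t for t, _ in scores.most_common(n_terms)]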

7.2.3 Sequence-to-set semantic tagging for complex query reformulation via semantic tagging and automated text categorization

Inspired by the recent success of sequence-to-sequence neural models in delivering the state-of-the-art in a wide range of NLP tasks, and building upon the successes of the neural GLM-based model of the previous work, we explore a novel sequence-to-set architecture with neural attention for learning document representations that can effect term transfer for semantically tagging a large collection of documents. We demonstrate that our proposed method can be effective in both a supervised multi-label classification setup for text categorization, as well as in a unique unsupervised setting with no document labels that uses no external knowledge resources and only corpus-derived term statistics to drive the training. Further, we show that semi-supervised training

using our architecture on large amounts of unlabeled data can augment performance, or at least perform at par, on the text categorization task, making our framework very useful for this task when only limited labeled data is available.

To this end and to the best of our knowledge we are the first to employ a novel, completely unsupervised end-to-end neural attention-based document representation learning approach, using no external labels, in order to achieve the most meaningful

“term transfer” between related documents, i.e. semantic tagging of documents, in a pseudo-relevance feedback–based setting (Xu and Croft, 2000). This may also be seen as a method of “document expansion” as a means for obtaining query refinement terms for downstream IR. We are the first to attempt to model the pseudo-relevance feedback

PRF process using an end-to-end neural approach based on “seq2seq” architectures, except adapted and reformulated as Seq2set to directly generate semantic tags for documents. This occurs in a two-step fashion, by first learning document encodings using only corpus-derived supervision labels and no human labels, followed by a simple inference of the semantic tags from the top-K similar documents’ TFIDF-based labels. Second, by combining a language modeling–based loss objective, where we minimize the negative log likelihood of the labels given the document, with a cross-entropy objective for multiple label prediction while training promising encoder architectures, we see results that surpass previous systems for unsupervised

PRF-based semantic tagging, including those using neural LM–based approaches “without” the use of UMLS query expansion terms, and other heavily supervised systems

(i.e. systems tuned on queries that we treat as unseen) “with” the use of UMLS terms.

Thus, we demonstrate the effectiveness of our approach in each setup, i.e. for the unsupervised setting in a downstream medical IR challenge task, achieving to the best

of our knowledge, the state-of-the-art on unsupervised QE via PRF-based semantic tagging, surpassing Phrase2VecGLM (Das et al., 2018); and for both the supervised and semi-supervised settings, where we beat the state-of-the-art MLTM baseline (Soleimani and Miller, 2016) for multi-label prediction for automated text categorization for a set of known labels on a held-out set of documents.
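The combination of objectives described above can be sketched as follows; this PyTorch snippet is only an illustrative reconstruction, and the tensor shapes, names and the lm_weight mixing coefficient are assumptions rather than the exact Seq2set training objective.

import torch
import torch.nn.functional as F

def seq2set_style_loss(label_logits, label_token_logits, multi_hot_targets,
                       label_token_targets, lm_weight=0.5):
    """Illustrative combination of the two objectives described above.

    label_logits:        (batch, n_labels) scores for multi-label prediction.
    label_token_logits:  (batch, seq_len, vocab) scores for generating the label
                         terms, giving an LM-style negative log likelihood.
    lm_weight is an illustrative mixing coefficient, not the thesis setting.
    """
    # Cross entropy for multi-label (one-vs-all) prediction
    multilabel_loss = F.binary_cross_entropy_with_logits(
        label_logits, multi_hot_targets.float())

    # LM-style loss: negative log likelihood of the label tokens given the document
    lm_loss = F.cross_entropy(
        label_token_logits.reshape(-1, label_token_logits.size(-1)),
        label_token_targets.reshape(-1))

    return lm_weight * lm_loss + (1 - lm_weight) * multilabel_loss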

7.2.4 Learning to answer subjective, specific questions with customer reviews by neural domain adaptation

In keeping with the research statement of section 1.5, in this work we explore finding hidden associations between latent concepts across two related domains to solve a question answering task. However, since we have labeled data in only one domain, i.e. answered questions, and no labeled data in the other domain, i.e. product reviews, we resort to neural domain adaptation as a means of transductive transfer learning to solve this task. Since online customer reviews on large-scale e-commerce websites represent a rich and varied source of opinion data, they often provide subjective qualitative assessments of product usage that can help potential customers to discover features that meet their personal needs and preferences. Thus they have the potential to automatically answer specific queries about products, and to address the problems of answer starvation and answer augmentation on associated consumer Q & A forums, by providing good answer alternatives. In this work, we explore and compare our approach with several recently successful neural approaches to modeling sentence pairs that could better learn the relationship between questions and ground truth answers, and thus help infer reviews that can best answer a question or augment a given answer.

In particular, we hypothesize that our neural domain adaptation-based approach, due

to its ability to additionally learn domain-invariant features from a large number of unlabeled, unpaired question-review samples, would perform better than our proposed baselines at answering specific, subjective product-related queries with reviews. We validate this hypothesis using a small gold standard dataset of reviews evaluated by human experts, surpassing our chosen baselines. Moreover, our approach, using no labeled question-review sentence pair data for training, gives performance at par with another method utilizing labeled question-review samples for the same task.

7.3 Future Directions

We briefly discuss a few potential future directions for improving the knowledge transfer methods presented, in order to extend their use and apply them towards learning new or related tasks.

7.3.1 Jointly learning multi-label prediction and NER for matching to novel contexts

There have been recent advances in state-of-the-art neural architectures for

NER, such as the BiLSTM-CRF due to Lample et al. (2016) and variants thereof, and for fine-grained named entity recognition with attention due to Shimaoka et al.

(2016). There have also been similar advances in relation extraction between all entity mention pairs (Verga et al., 2017) and second-order relation extraction by explicit context conditioning (Singh and Bhatia, 2019). In order to bridge the gap between out-of-vocabulary context matching and within-vocabulary concepts, while also achieving desired precision levels at detecting first- and second-order relations between concepts and entities in the data, one suggested future direction is to perform joint training of these two tasks, for example within the Seq2set framework. Here, a Seq2set encoder

could be trained alongside a character- and word-level BiLSTM-BiLSTM-CRF branch for NER, thus achieving multi-label prediction and entity recognition at the same time, where one task could inform and improve the other. The architecture could also be augmented similarly with sub-networks for known relation extraction as in Verga et al.

(2017); Singh and Bhatia (2019) to be able to simultaneously learn better models for both seen and unseen contexts for NER and multi-label prediction tasks.

7.3.2 Multi-task learning in domain adversarial setting for question answering

In our current setup for domain adversarial training, for learning to answer product questions with customer reviews (Chapter 6), we train on large amounts of unlabeled review data to learn shared domain-invariant feature representations for reviews and answer sentences. Here, evaluation is still on the same classification task for which we have labeled data. However, for the out-of-domain review sentences that are rich in opinion and sentiment, “hard” 0 or 1 classification decisions may not always lead to the best answers. In this case, and also given the large amounts of unlabeled review data, it may be better to learn a “ranking” of review sentences that can best answer a given question, to again minimize sifting through predicted positive answers from reviews. In this scenario it seems plausible to adapt our training setting to learn two different tasks, i.e. learning to classify labeled positive answers correctly, while also learning to “rank” candidate review sentence answers. Since learning to rank requires labeled data, special strategies may need to be employed to achieve this in both unsupervised and semi-supervised settings, by having a small amount of labeled review

data. Unsupervised ranking learning on reviews may be done by pre-generating pseudo-labels, and semi-supervised learning may leverage strategies like temporal ensembling

(Laine and Aila, 2016). Because classification and ranking represent two very different yet complementary tasks, we believe that such multi-task learning in the domain adversarial setting can serve as an example of a design choice for achieving greater hidden association matching between latent concepts, across domains, for other related pairs of tasks in other cross-domain problem settings.

7.3.3 Provide a linguistic resource to advance research in Concept Discovery

In performing an analysis of the large amount of opinion-based review sentence data from different product categories (McAuley and Yang, 2016), with a view towards further generalizing the model architecture outlined in Chapter 6, a few key insights have emerged. In looking at examples of review sentences that can answer specific product-related questions, there are: (1) some sentences that contain just a single span of text that can exactly answer a question, reminiscent of the SQuAD task due to Rajpurkar et al. (2016) or the Machine Reading Comprehension task; (2) some sentences that in their entirety can answer every aspect of a question, and are thus very suitable for a ranking task; and (3) review sentences, contiguous or non-contiguous, within an individual review that individually cannot answer a question, but together can perfectly answer the question (i.e. a multi-sentence answer), in the spirit of extractive summarization, similar to Cheng and

Lapata (2016). This represents a continuum of tasks, like answer-span identification, ranking and extractive summarization, for the same problem, that can be addressed

by the same data, in the context of the broader scope of concept discovery for addressing novel contexts by matching to hidden associations between latent concepts.

Thus one proposal is to have such a linguistic resource at our disposal that contains annotations for various related tasks for the same problem, in one real-world dataset, namely answer-span identification, ranking and extractive summarization, to cite the examples mentioned above. This type of resource, we believe, could inspire more research into the area of concept discovery as outlined in this dissertation, for example by leading to the formulation of associated shared tasks that are meaningful in the context of solving some previously undescribed real-world problem involving natural language understanding.

7.3.4 How does this research impact the NLP community

As highlighted as one of the main ideas surrounding concept discovery in this dissertation, the “goodness” or “effectiveness” of discovered concepts can only be measured by proxy downstream tasks such as semantic tagging for information retrieval or subjective question answering (QA) (as opposed to fact-based QA), and other similar recommendation tasks. In our effort to show support for the main ideas surrounding the research statement outlined in section 1.5, and to lay the groundwork for the process of concept discovery in context, we demonstrate improved semantic matching in the chosen tasks against several strong baselines. Given the effectiveness of our methods for these chosen tasks, coupled with how we position the ideas outlined in this dissertation against the existing literature on knowledge base completion, named entity recognition, relation extraction, missing data modeling, paraphrase identification, inferring selectional preferences and multi-way classification of semantic relations

between pairs of nominals, we leave it as an open question for the community to assess the value of addressing out-of-vocabulary relations defined by unseen contexts with within-vocabulary concepts in the observed data, within a framework of concept discovery. In light of the initial evidence demonstrated for the main ideas in this dissertation, we believe that the methods outlined towards achieving concept discovery could be adapted to serve traditional information extraction (IE) tasks such as knowledge base population, expanding KBs considerably when presented with novel contexts. In particular, the implicit relations leveraged in our works to arrive at target concept surface forms could be attached to a new or existing meaningful lexical form, i.e. a formal relation or predicate, to be added to the knowledge base and aid in discovery of new contexts where the relation may hold, which in turn may lead to discovery of more related concepts. While this is similar in theory to NELL (Mitchell et al., 2015), the main distinction is that this research has attempted to address bridging the gap for when there is a new relation made possible by a novel context, that may be useful towards

finding other similarly related concepts that satisfy the relation, but that we cannot yet extract, since it is not currently present in the text. Our hope is that the main research ideas outlined in this dissertation make it into the mainstream of NLP and form the basis for systems that aim to lend better generalization and out-of-vocabulary matching capability to various NLP tasks.

7.4 Final remarks

As the next generation of natural language processing applications emerge with increasing focus on natural language understanding and generation, it has become even

more important than before for these systems to have the capability to efficiently reason about out-of-vocabulary contexts using concepts from the given data with minimal supervision. In this context, we introduced the idea of “latent concepts” and the process of concept discovery via models developed for solving the chosen recommendation tasks in inductive and transductive transfer learning settings, showing their effectiveness in improving downstream semantic matching. We also discussed ways in which these systems may be further advanced in their capability by jointly learning various related tasks. We hope that the ideas, methods and future directions presented in this dissertation can provide the inspiration for the coming generations of NLP and knowledge discovery applications to leverage and adapt these, not just for the specific application areas covered, i.e. information retrieval and question answering, in order to better capture and address user intent for what is really meant, but also integrated into the core NLP pipeline.

APPENDIX A

PROBABILITY AND INFORMATION THEORY

Probability theory provides us with the basis to discuss and analyze many of the methods presented throughout this dissertation that are probabilistic in nature, such as language modeling. For example, a statistical language model is a probability distribution over sequences of words. Using probability theory, we can make statements about how likely it is that an event such as a sequence of words will occur, given that other events (words or sequences) have already occurred. In this work, we seek to computationally model such likelihoods, although both Bayesian and frequentist methods are employed in the various works throughout this dissertation (Russell and Norvig, 2016, p. 491).

Information theory similarly provides us with tools to describe the information encoded in events and measures to characterize a difference in information. The latter gives us quantities such as cross entropy that we often seek to minimize using machine learning methods.

A.0.1 Probability fundamentals

Random variable – In statistics and probability theory, a random variable is described informally as a variable whose values depend on outcomes of a random phenomenon. The domain of a random variable is a sample space Ω, which is interpreted

as the set of possible outcomes of a random phenomenon, e.g. a coin toss. A random variable is thus a function which allows for “probabilities” to be assigned to the potential values in its sample space. The range of these values depends on some underlying random process. For example, in the case of a coin toss, only two possible outcomes are considered, viz. heads or tails, and the range consists of probabilities assigned to each of these outcomes. A random variable has a probability distribution, which specifies the probability of each of its values. Random variables can be discrete, that is, taking any of a specified finite or countable list of values, defined by a “probability mass function” that characterizes the random variable’s probability distribution; or continuous, taking any numerical value in an interval or collection of intervals, defined by a “probability density function” that characterizes the random variable’s probability distribution; or a mixture of both types. One of the underlying themes in this dissertation is to generally define systems that allow us to model the probability distribution of such random variables.

PMF and PDF – A random variable can be discrete, which means that it takes on a finite number of values. For discrete random variables, we define its probability distribution with a probability mass function (PMF), typically denoted as P . The probability mass function maps a value of a random variable to the probability of the random variable taking on that value. P (X = 5) thus indicates the probability of X taking on the value 5, such as the roll of a six-sided die that takes on the value 5. Every event x ∈ X has a probability 0 ≤ P (x) ≤ 1. An impossible event has a probability of

0, while an event that is guaranteed to happen has a probability of 1.

If a random variable is able to take on any value in an interval, it is said to be continuous and we use a probability density function (PDF), usually designated as

p to specify its probability distribution. In contrast to PMFs, PDFs do not provide the probability of a specific event. The PDF is used to specify the probability of the random variable falling within a particular “range of values”, as opposed to taking on any one value. In fact, the probability of any specific point within the interval is 0.

Rather, we measure the probability of being in an infinitesimally small region with volume δx, by p(x)δx. (Ruder, 2019)

Both PMFs and PDFs are normalized: for a PMF, all probabilities must sum to 1, i.e. \sum_{x \in X} P(x) = 1, while for a PDF, the probabilities must integrate to 1, i.e. \int_x p(x)\,dx = 1. Bayes' Rule – In machine learning, we often would like to estimate the probability P(B|A) given knowledge of P(A|B). Using the axiom of conditional probability we have:

P(B|A) = \frac{P(A, B)}{P(A)} \qquad (A.1)

Given that, by the axiom of joint probability, P(A, B) = P(A|B)P(B), and if we also know P(B), we can derive P(B|A) as:

P(B|A) = \frac{P(A|B)\,P(B)}{P(A)} \qquad (A.2)

P (B|A) is known as the posterior probability of B, P (A|B) is the likelihood of A given B, P (B) is the prior probability of B, and P (A) is the evidence. Equation

A.2 is also known as Bayes’ rule and is at the heart of Bayesian methods for machine learning.
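A small numeric illustration of equation A.2, with made-up probabilities, is shown below.

# A small numeric check of Bayes' rule (equation A.2), with made-up numbers.
# Suppose 1% of documents are relevant (P(B) = 0.01), a matcher fires on 90% of
# relevant documents (P(A|B) = 0.9) and on 5% of all documents (P(A) = 0.05).
p_B, p_A_given_B, p_A = 0.01, 0.9, 0.05
p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)  # 0.18: the posterior probability of relevance given a match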

Bayesian Inference based upon Bayes’ rule is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evi- dence or information becomes available. Bayesian inference is an important technique

in statistics, and Bayesian updating is particularly important in the dynamic analysis of a sequence of data. In Chapter 4 we see these principles from probability theory at play. The other chapters take on more of a frequentist approach, where only empirical evidence comes into play in the various methods employed for decision making.

(Ruder, 2019)

A.0.2 Maximum Likelihood Estimation

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a distribution by maximizing a likelihood function, so that under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.

As well as being a useful tool for parameter estimation in our current context, the method of maximum likelihood can be applied to a great variety of other statistical problems, such as curve fitting, for example. In general maximum likelihood estimates have nice theoretical properties as well.

Suppose that random variables X_1, ..., X_n have a joint density or frequency function f(x_1, x_2, ..., x_n | θ). Given observed values X_i = x_i, where i = 1, ..., n, the likelihood of θ as a function of x_1, x_2, ..., x_n is defined as likelihood(θ) = f(x_1, x_2, ..., x_n | θ).

Note that we consider the joint density as a function of θ rather than as a function of the xi . If the distribution is discrete, so that f is a frequency function, the likelihood function gives the probability of observing the given data as a function of the parameter

θ. The maximum likelihood estimate (MLE) of θ is that value of θ that maximizes the likelihood—that is, makes the observed data “most probable” or “most likely.”

(Rice, 2006). Later chapters of this dissertation, including Chapter 4, make heavy use of this principle, while in Chapter 3 MLE is the underlying principle for deriving the topic models used in much of that work.
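As a brief illustration of the principle, the snippet below numerically maximizes the likelihood of a Bernoulli parameter on made-up coin-flip data and checks it against the closed-form MLE, the sample mean.

import numpy as np
from scipy.optimize import minimize_scalar

# Maximum likelihood estimation for a Bernoulli parameter, on made-up data.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # ten observed coin flips

def neg_log_likelihood(theta):
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Numerical maximization of the likelihood ...
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
# ... agrees with the closed-form MLE, the sample mean.
print(res.x, x.mean())  # both approximately 0.7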

A.0.3 Cross Entropy

Information entropy is the average rate at which information is produced by a stochastic source of data (Shannon, 1948). The measure of information entropy associated with each possible data value is the negative logarithm of the probability mass function for the value:

H(P) = -\sum_{i} P_i \log P_i \qquad (A.3)

When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more “information” (“surprisal”) than when the source data produces a high-probability value. The amount of information conveyed by each event defined in this way becomes a random variable whose expected value is the information entropy.

Cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if the “coding scheme” (such as the number of bits) used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.

The cross entropy for the distributions p and q over a given set is defined as follows:

H(p, q) = \mathbb{E}_p[-\log q] \qquad (A.4)

Thus, for discrete probability distributions p and q that take on the same values x ∈ X, it is given as

H(p, q) = -\sum_{x \in X} p(x) \log q(x) \qquad (A.5)

Cross entropy is a measure heavily used in practice for the optimization of various machine learning algorithms, as the loss objective we hope to minimize, and it is integral to the methods detailed in Chapters 5 and 6 of this dissertation.
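A direct computation of equation A.5 for small, made-up distributions is shown below (np.log gives the value in nats; np.log2 would give bits).

import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x) for discrete distributions (equation A.5)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))

p = [0.5, 0.25, 0.25]                      # true distribution
print(cross_entropy(p, p))                 # equals the entropy H(p) when q == p
print(cross_entropy(p, [0.8, 0.1, 0.1]))   # larger when q diverges from p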

A.0.4 Perplexity

In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample. The perplexity of a discrete probability distribution p is defined as:

2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}

where H(p) is the entropy (in bits) of the distribution and x ranges over events. In natural language processing, perplexity is a way of evaluating language models (Brown et al., 1992).

Using the definition of perplexity for a probability model, one might find, for example, that the average sentence x_i in the test sample could be coded in 190 bits (i.e., the test sentences had an average log-probability of -190). This would give an enormous model perplexity of 2^190 per sentence. However, it is more common to normalize for sentence length and consider only the number of bits per word. Thus, if the test sample's sentences comprised a total of 1,000 words, and could be coded using a total of 7.95 bits per word, one could report a model perplexity of 2^{7.95} ≈ 247 per word. In

other words, the model is as confused on test data as if it had to choose uniformly and independently among 247 possibilities for each word (Brown et al., 1992). In Chapter

3, perplexity is used as a measure of performance for topic inference across domains.
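The per-word perplexity calculation described above can be reproduced with a few lines of code; the numbers are the made-up ones from the example.

import numpy as np

def perplexity(token_log_probs, base=2.0):
    """Per-token perplexity from a model's log-probabilities (in the given base)."""
    avg_log_prob = np.mean(token_log_probs)
    return base ** (-avg_log_prob)

# e.g. 1,000 test words coded at 7.95 bits per word on average -> perplexity ~247
log_probs = np.full(1000, -7.95)     # log2 probabilities, one per word
print(round(perplexity(log_probs)))  # 247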

APPENDIX B

NEURAL NETWORKS AND MACHINE LEARNING

We now provide background for neural network models, starting with linear regression for machine learning, so as to give an overview of the methods that form the basis for much of the work in Chapters 4–6 of this dissertation.

B.0.1 Linear Regression

Regression with linear models is concerned with the class of linear functions of continuous-valued inputs. The simplest case is regression with a univariate linear function, or “fitting a straight line”, with input x and output y having the form y = w_1 x + w_0, where w_1 and w_0 are real-valued coefficients to be learned. The value of y is changed by changing the relative weight of one term or another. We define w to be the weight vector [w_0, w_1], and define:

h_{\mathbf{w}}(x) = w_1 x + w_0. \qquad (B.1)

Consider some data in an x, y plane, with each point representing the size in square feet and the price of a house on sale. The task of finding the hw that best fit these data is called linear regression. To fit a line to this data, we need to find the values

of the weights [w_0, w_1] that minimize the empirical loss. It is traditional to use the squared loss function, L_2, summed over all the training examples, and given as:

Loss(h_{\mathbf{w}}) = \sum_{j=1}^{N} L_2(y_j, h_{\mathbf{w}}(x_j)) = \sum_{j=1}^{N} (y_j - h_{\mathbf{w}}(x_j))^2 = \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 \qquad (B.2)

(Russell and Norvig, 2016, Chapter 18)

We can easily extend to multivariate linear regression problems, in which each example xj is an n-element vector. The hypothesis space is now a set of functions of the form:

h_{\mathbf{w}}(\mathbf{x}_j) = w_0 + w_1 x_{j,1} + \cdots + w_n x_{j,n} = w_0 + \sum_i w_i x_{j,i}. \qquad (B.3)

The intercept term w0 stands out as different from the others, which we handle by a dummy input attribute, xj,0 always set to 1. h then just becomes the dot product of the weights and the input vector or the matrix product of the transpose of the weights and the input vector, i.e.

h_{\mathbf{w}}(\mathbf{x}_j) = \mathbf{w} \cdot \mathbf{x}_j = \mathbf{w}^{\top} \mathbf{x}_j = \sum_i w_i x_{j,i} \qquad (B.4)

Both in the univariate and multivariate case, the best vector of weights w* minimizes the squared-error loss over the examples:

\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \sum_j L_2(y_j, \mathbf{w} \cdot \mathbf{x}_j) \qquad (B.5)

(Russell and Norvig, 2016, Chapter 18)

Gradient descent, which we describe in section B.0.5, will be used to reach the unique minimum of the loss function in both the univariate and multivariate cases, where the update equation for each weight w_i is:

w_i \leftarrow w_i + \alpha \sum_j x_{j,i}\,(y_j - h_{\mathbf{w}}(\mathbf{x}_j)) \qquad (B.6)
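A minimal sketch of batch gradient descent for the univariate case, using update rule B.6 on made-up house-size and price data, is given below.

import numpy as np

# Batch gradient descent for univariate linear regression (equations B.2 and B.6),
# on made-up house-size / price data.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])   # size (thousands of sq. ft.)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # price (hundreds of thousands)

w0, w1, alpha = 0.0, 0.0, 0.05
for _ in range(5000):
    err = y - (w1 * x + w0)
    # update rule B.6: w_i <- w_i + alpha * sum_j x_{j,i} (y_j - h_w(x_j))
    w0 += alpha * err.sum()
    w1 += alpha * (x * err).sum()

print(w0, w1)   # close to the least-squares line fit by np.polyfit(x, y, 1)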

B.0.2 Logistic Regression

Having a hard threshold function as the output of a linear classifier causes problems in that the hypothesis h_w is not differentiable, being in fact a discontinuous function of its inputs and weights, making learning very unpredictable. Further, the linear classifier always gives a completely confident prediction of 1 or 0, even for examples that are very close to the boundary. All of these issues can be resolved to a great extent by softening the threshold function – approximating the hard threshold with a continuous, differentiable function. The logistic function, given by:

Logistic(z) = \frac{1}{1 + e^{-z}} \qquad (B.7)

has more convenient mathematical properties. With the logistic function replacing the threshold function, we now have:

h_{\mathbf{w}}(\mathbf{x}) = Logistic(\mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}} \qquad (B.8)

The output of the logistic function, being a value between 0 and 1, can be interpreted as the probability of belonging to the class labeled 1. Thus we transform the output of linear regression into a probability by “squashing” it to be in the interval (0, 1) using the logistic or sigmoid function σ. The process of fitting the weights of this model to

minimize the loss on a dataset is called logistic regression. Although there is no easy closed-form solution to find the optimal value of w with this model, the gradient descent computation is straightforward. We use the L2 loss function, g to stand for the logistic function, and g′ for its derivative. We have:

\frac{\partial}{\partial w_i} Loss(\mathbf{w}) = \frac{\partial}{\partial w_i} (y - h_{\mathbf{w}}(\mathbf{x}))^2 = -2\,(y - h_{\mathbf{w}}(\mathbf{x})) \times g'(\mathbf{w} \cdot \mathbf{x}) \times x_i \qquad (B.9)

The derivative g′ of the logistic function satisfies g′(z) = g(z)(1 − g(z)), so we have

g'(\mathbf{w} \cdot \mathbf{x}) = g(\mathbf{w} \cdot \mathbf{x})\,(1 - g(\mathbf{w} \cdot \mathbf{x})) = h_{\mathbf{w}}(\mathbf{x})\,(1 - h_{\mathbf{w}}(\mathbf{x})) \qquad (B.10)

Thus the weight update for minimizing the loss is

w_i \leftarrow w_i + \alpha\,(y - h_{\mathbf{w}}(\mathbf{x})) \times h_{\mathbf{w}}(\mathbf{x})\,(1 - h_{\mathbf{w}}(\mathbf{x})) \times x_i \qquad (B.11)

In a linearly separable case, logistic regression is somewhat slower to converge, but behaves much more predictably. When the data are noisy and nonseparable, logistic regression converges much more quickly and reliably. Due to its many advantages it has become one of the most popular classification techniques for a large number of applications, ranging from medicine and market analysis to credit scoring and natural language processing (Russell and Norvig, 2016, Chapter 18).

B.0.3 Multi-Class Classification by Softmax

The probability produced by logistic regression discussed in the previous section is calculated as follows:

\hat{p}(y = 1 \mid \mathbf{x}; \theta) = \hat{y} = \sigma(\theta^{\top} \mathbf{x}). \qquad (B.12)

Specifying the probability of one of these classes determines the probability of the other class, as the output random variable follows a Bernoulli distribution (Rice, 2006,

Chapter 2). For multi-class classification with C classes, we learn a separate set of weights θ_i ∈ θ for the label y_i of the i-th class. We then use the softmax function to squash the values to obtain a categorical distribution:

\hat{p}(y_i \mid \mathbf{x}; \theta) = \frac{e^{\theta_i^{\top} \mathbf{x}}}{\sum_{j=1}^{C} e^{\theta_j^{\top} \mathbf{x}}} \qquad (B.13)

where the denominator normalizes the prediction by summing over the scores for all C classes, to obtain a probability distribution. Recalling our definition of cross entropy, for binary classification, cross entropy is then given by:

H(p, \hat{p}; \mathbf{x}) = -(1 - y)\log(1 - \hat{y}) - y \log \hat{y} \qquad (B.14)

(Ruder, 2019)
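The softmax and cross-entropy computations above can be illustrated as follows, for a single example with made-up class scores.

import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Scores theta_i^T x for C = 3 classes of a single example (made-up numbers).
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)             # a categorical distribution (equation B.13)
print(probs, probs.sum())           # sums to 1

y = np.array([1, 0, 0])             # one-hot true label
print(-np.sum(y * np.log(probs)))   # categorical cross-entropy loss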

B.0.4 Feed-Forward Neural Networks and Other Neural Models

Neural networks or universal function approximators have the striking quality that they can compute any function to a desired degree of approximation. No matter what the function, there is guaranteed to be a neural network so that for every possible input, x, the value f(x) (or some close approximation) is output from the network.

This result holds even if the function has many inputs, f = f(x_1, ..., x_m), and many outputs (Nielsen, 2015). The universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; although it does not touch upon the algorithmic learnability of these parameters (Csáji,

2001). Inspired by basic findings of neuroscience – that mental activity consists primarily of electrochemical activity in networks of brain cells or neurons, some of the earliest

AI work aimed to create artificial neural networks, where a neuronal unit “fires” when a linear combination of its inputs exceeds some hard or soft threshold – that is, it implements a linear classifier like the one described in the previous section. Thus a neural network is a collection of such units connected together, where the properties of the network are determined by its topology and the properties of the “neurons”

(Russell and Norvig, 2016, Chapter 18). Neural networks can be seen as compositions of functions. In fact, we can view the basic machine learning models described so far, linear regression and logistic regression, as simple instances of a neural network (Ruder,

2019).

Neural networks are composed of nodes or units connected by directed links. A link from unit i to unit j serves to propagate an activation a_i from i to j. Each link also has a weight w_{i,j} associated with it, which determines the strength and the sign of the connection. Just as in linear regression models, each unit has a dummy input a_0 = 1 with an associated weight w_{0,j}. Each unit j first computes a weighted sum of its inputs:

in_j = \sum_{i=0}^{n} w_{i,j}\,a_i \qquad (B.15)

Then it applies an activation function g to this sum to derive the output:

a_j = g(in_j) = g\Big(\sum_{i=0}^{n} w_{i,j}\,a_i\Big) \qquad (B.16)

The activation function g is typically either a hard threshold with a 0/1 output, in which case the unit is called a perceptron, or a soft threshold such as a logistic function, in which case the term sigmoid perceptron is used. Both of these non-linear

activation functions ensure the important property that the entire network of units can represent a nonlinear function. As discussed in the section on logistic regression, the logistic activation function has the added advantage of being differentiable (Russell and Norvig, 2016, Chapter 18).

Once we have this mathematical model for individual neurons, the next step is to connect them together to form a network. There are two fundamentally distinct ways to do this. A feed-forward neural network has connections only in one direction – that is, it forms a directed acyclic graph. Every node receives input from upstream nodes and delivers output to downstream nodes; there are no loops. A feed-forward network represents a function of its current input; thus it has no internal state other than the weights themselves. A recurrent network (RNN), on the other hand, feeds its outputs back into its own inputs (Russell and Norvig, 2016, Chapter 18). An RNN can also be seen as a feed-forward neural network with a dynamic number of hidden layers that are all set to have the same parameters. In contrast to a regular feed-forward neural network, however, it accepts a new input at every “layer” or time step.

Specifically, the RNN maintains a hidden state ht, which represents its “memory” of the contents of the sequence at each time step t. At every time step, the RNN performs the following operation:

h_t = \sigma_h(W_h x_t + U_h h_{t-1} + b_h)
y_t = \sigma_y(W_y h_t + b_y) \qquad (B.17)

where σ_h and σ_y are activation functions. The RNN applies a transformation U_h to modify the previous hidden state h_{t-1} and a transformation W_h to the current input x_t, which yields the new hidden state h_t (Ruder, 2019). Hence recurrent networks, unlike feed-forward networks, can support short-term memory, which makes them interesting

models of the brain (Russell and Norvig, 2016). The most elementary neural network for sequential input is the recurrent neural network. As the text data used in our experiments is sequential, we will be using sequential neural models that can process a sequence of inputs, as seen in Chapters 5 and 6.

Long short-term memory – Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are nowadays preferred over plain RNNs for dealing with sequential data, as they can retain information for longer time spans, which is necessary for modelling long-term dependencies common in natural language. The LSTM can be seen as a more sophisticated RNN cell that introduces mechanisms to decide what should be “remembered” and “forgotten”. The LSTM augments the RNN with a forget gate f_t, an input gate i_t, and an output gate o_t, which are all functions of the current input x_t and the previous hidden state h_{t-1}. These gates interact with the previous cell state c_{t-1}, the current input, and the current cell state c_t, and enable the model to selectively retain or overwrite information. The entire model is defined as follows:

f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \qquad (B.18)
c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \circ \sigma_h(c_t)

where σ_g is the sigmoid activation function, σ_c and σ_h are tanh activation functions, and ◦ is element-wise multiplication. LSTM cells can be stacked in multiple layers. In many cases, as in many of our experiments, it is more suitable to use a bi-directional

LSTM, or BiLSTM (Graves, 2013), which runs separate LSTMs forward and backward

over a sequence. The hidden state h_t in the case of a BiLSTM is the concatenation of the hidden states from the forward and backward LSTMs at time step t, i.e. h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]

(Ruder, 2019).

Apart from the LSTM and BiLSTM recurrent models used for processing of sequential text input, Chapter 5 provides detail about other computationally efficient variants of recurrent models such as the GRU and BiGRU (section 5.4.5) and the more recent

Transformer (section 5.4.6) model due to Vaswani et al. (2017), which achieves greater computational speedup alongside better model performance across several natural language tasks by finding an efficient way to eliminate recurrences, approximating them through position-based transformations of the input in feed-forward layers at each time step.
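As an illustration of how such a bi-directional recurrent encoder is typically assembled in practice, the PyTorch sketch below encodes a batch of token-id sequences with a BiLSTM; the vocabulary size and dimensions are illustrative, and this is not the exact configuration used in our experiments.

import torch
import torch.nn as nn

# Minimal sketch: encoding a batch of token-id sequences with a BiLSTM.
vocab_size, embed_dim, hidden_dim = 10000, 128, 256

embed = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (4, 32))   # (batch=4, seq_len=32)
outputs, (h_n, c_n) = bilstm(embed(tokens))

# outputs: (4, 32, 2 * hidden_dim) -- the forward and backward hidden states are
# concatenated at every time step, matching h_t = [h_t_forward ; h_t_backward].
print(outputs.shape)   # torch.Size([4, 32, 512])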

B.0.5 Gradient Descent

Gradient descent is central to learning in the neural networks discussed in section B.0.4, where each layer represents a multivariate linear regression over the input. The parameter α is a learning rate when we are trying to minimize loss in a learning problem, and can be fixed to a constant or decay over time as the learning proceeds. (Russell and Norvig, 2016, Chapter 18)

Thus gradient descent is an efficient method to minimize an objective function

J(θ). It updates the model's parameters θ ∈ R^d in the opposite direction of the gradient ∇_θ J(θ) of the function. The gradient during learning is the vector containing all the partial derivatives \frac{\partial}{\partial \theta_i} J(\theta); the i-th element of the gradient is the partial derivative of J(θ) with respect to θ_i. Gradient descent then updates the parameters:

\theta = \theta - \alpha \nabla_{\theta} J(\theta) \qquad (B.19)

where again the learning rate α determines the magnitude of an update to the parameters. (Ruder, 2019)

B.0.6 Back-propagation

In neural networks with multiple layers, interactions arise among the learning problems when the network has multiple outputs. In such cases, we think of the network as implementing a vector function \mathbf{h}_{\mathbf{w}} as opposed to a scalar function h_{\mathbf{w}}. While a simple perceptron network decomposes into m separate learning problems for an m-output problem, in a multi-layer network all outputs depend on all of the weights, including those feeding intermediate hidden layers, so updates to those weights will depend on the errors in each of the outputs. However, in the case of any loss function that is additive across the components of the error vector y − h_w, such as the L2 squared-error loss, this dependency for a weight w is simply:

\frac{\partial}{\partial w} Loss(\mathbf{w}) = \frac{\partial}{\partial w} \lvert \mathbf{y} - \mathbf{h}_{\mathbf{w}}(\mathbf{x}) \rvert^2 = \sum_k \frac{\partial}{\partial w} (y_k - a_k)^2 \qquad (B.20)

where the index k ranges over nodes in the output layer. Each term in the final summation is then just the gradient of the loss for the kth output, computed as if the other outputs did not exist. Thus we can decompose the m-output learning problem into m learning problems in this case as well, provided we add up the gradient contributions from each of them when updating the weights. (Russell and Norvig, 2016, Chapter 18)

The complication that arises from the addition of hidden layers to a network is that whereas the error y−hw at the output layer is clear, the error at the hidden layers seems mysterious because the training data do not say what values those hidden nodes should

have. Fortunately, it turns out that by the method of back-propagation (Rumelhart et al., 1988) we can propagate the errors from the output layer to the hidden layers.

The back-propagation process thus emerges directly from a derivation of the overall error gradient. Thus for multiple output units, if Err_k is the kth component of the

error vector y − h_w, we can define a modified error Δ_k = Err_k × g′(in_k), so that the weight-update rule becomes:

w_{j,k} \leftarrow w_{j,k} + \alpha \times a_j \times \Delta_k \qquad (B.21)

To update the connections between the input units and the hidden units, we define a quantity analogous to the error term for output nodes. Here is where we do the error back-propagation. The idea is that the hidden node j is “responsible” for some fraction of the error ∆k in each of the output nodes to which it connects. Thus the

Δ_k values are divided according to the strength of the connection between the hidden node and the output node, and are propagated back to provide the Δ_j values for the hidden layer. The propagation rule for the delta values is:

\Delta_j = g'(in_j) \sum_k w_{j,k}\,\Delta_k \qquad (B.22)

Now, the weight update rule for the weights between the inputs and the hidden layer is essentially identical to the update rule for the output layer, i.e.:

w_{i,j} \leftarrow w_{i,j} + \alpha \times a_i \times \Delta_j \qquad (B.23)

Thus the backpropagation process can be summarized as follows:

• Compute the ∆ values for the output units, using the observed error

• Starting with the output layer, repeat the following for each layer in the network,

until the earliest hidden layer is reached:

243 – Propagate the ∆ values back to the previous layer

– Update the weights between the two layers

(Russell and Norvig, 2016, Chapter 18)

The back-propagation algorithm lies at the heart of training all of the neural models that are integral to the works described in Chapters 4, 5 and 6.
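A tiny worked example of these update rules is sketched below: a two-layer sigmoid network trained with plain batch back-propagation on the XOR problem, using the dummy always-1 input for bias weights as described in section B.0.4. The network size, learning rate and iteration count are illustrative.

import numpy as np

# Tiny two-layer network trained with the back-propagation rules above
# (equations B.21-B.23) on the XOR problem, with a sigmoid activation g.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def with_bias(a):                        # append the dummy a_0 = 1 input
    return np.hstack([a, np.ones((a.shape[0], 1))])

W1 = rng.normal(size=(3, 4))             # input (+bias) -> 4 hidden units
W2 = rng.normal(size=(5, 1))             # hidden (+bias) -> 1 output unit
g = lambda z: 1.0 / (1.0 + np.exp(-z))
alpha = 0.5

for _ in range(20000):
    a1 = g(with_bias(X) @ W1)                       # hidden activations
    a2 = g(with_bias(a1) @ W2)                      # output activation
    delta2 = (y - a2) * a2 * (1 - a2)               # Delta_k = Err_k * g'(in_k)
    delta1 = (delta2 @ W2[:4].T) * a1 * (1 - a1)    # propagate Delta back (B.22)
    W2 += alpha * with_bias(a1).T @ delta2          # weight updates (B.21, B.23)
    W1 += alpha * with_bias(X).T @ delta1

# Outputs approach [0, 1, 1, 0]; convergence can depend on the random initialization.
print(a2.round(2).ravel())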

APPENDIX C

KNOWLEDGE REPRESENTATION FOR KBS AND OTHER RELATED AREAS

C.0.1 Document Summarization

According to a detailed survey by Das and Martins (2007), the subfield of summarization has been investigated by the NLP community for many decades. Radev et al.

(2002) define a summary as “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that”. This simple definition captures three important aspects that characterize research on automatic summarization:

• Summaries may be produced from a single document or multiple documents,

• Summaries should preserve important information,

• Summaries should be short

Das and Martins (2007) point out that even if there is unanimous agreement on these points, it seems from the literature that any attempt to provide a more elaborate definition for the task would result in disagreement within the community. In fact,

many approaches differ in the manner of their problem formulations. Some common terms relating to summarization are: extraction is the procedure of identifying important sections of the text and producing them verbatim; abstraction aims to produce important material in a new way; fusion combines extracted parts coherently; and compression aims to throw out unimportant sections of the text (Radev et al., 2002).

While “extractive summarization” is mainly concerned with what the summary content should be, usually relying solely on extraction of sentences, “abstractive summarization” puts strong emphasis on the form, aiming to produce a grammatical summary, which usually requires advanced language generation techniques. In a paradigm more tuned to information retrieval (IR), one can also consider “topic-driven” summarization, which assumes that the summary content depends on the preference of the user and can be assessed via a query, making the final summary focused on a particular topic. There is also single-document summarization, which is primarily extractive, and multi-document summarization, which may be both extractive and abstractive (Das and

Martins, 2007).

In a recent neural approach to extractive summarization, Cheng and Lapata (2016) develop a general framework for single-document summarization composed of a hierarchical document encoder and an attention-based extractor, alleviating the need for hand-engineered features and showing performance comparable to the state-of-the-art without access to linguistic annotation. They develop two classes of models based on sentence and word extraction. The main ideas behind their work are the creation of hierarchical neural structures that reflect the nature of the summarization task and generation by extraction. Although extractive methods yield naturally grammatical

summaries, the selected sentences make for long summaries containing many redundancies. Thus they also develop a model based on word extraction, which seeks to find a subset of words in D and their optimal ordering so as to form a summary. They formulate word extraction as a language generation task with an output vocabulary restricted to the original document. Their language modeling based approach and LM–based training objective for the word extraction task, and other related works, provided inspiration for some of the choices for the Seq2set model of Chapter 5. However, traditional summarization in NLP is also not directly targeted at solving the problem of novel contexts outlined in this thesis, although it could make use of concept discovery for its target; it is related from the standpoint of semantic tags for documents, the main focus of Chapters 4 and 5, which could also be viewed as very precise document summaries.

C.0.2 Knowledge Representation for Relation Extraction

Another very related work in the context of the main focus of this dissertation is the move towards a consolidated Open Knowledge Representation for Multiple Texts due to Wities et al. (2017). This effort represents the “explicit” version of hidden relation extraction between latent concepts or, as the authors refer to them, “co-referring” entities and concepts within a context. This work suggests that a common consolidation step and a corresponding knowledge representation should be part of a “standard” semantic processing pipeline, to be shared by downstream applications. Thus they propose an Open Knowledge Representation (OKR) framework that captures the information expressed jointly in multiple texts while relying solely on the terminology appearing in those texts, without requiring pre-defined external knowledge resources

or schemata. They advocate the need to consolidate information originating from multiple texts in applications that summarize multiple texts into some structure, such as multi-document summarization and knowledge-base population. For this there is currently no systematic solution, and the burden of integrating information across multiple texts is delegated to downstream applications, leading to partial solutions which are geared to specific applications. To this end they develop an “Open” KR framework where coreference clusters are proposed as a handle on real-world entities and facts, while still being self-contained within the textual realm. They annotate a set of 1257 tweets from 27 clusters taken from News events, where all core NLP tasks are annotated in parallel over the same texts. The hope is that, gradually, fine-grained semantic phenomena may be addressed, such as factuality, attribution and modeling sub-events and cross-event relationships.

Singh and Bhatia (2019) develop a method to perform what they term “second-order relation extraction” given first-order relation scores, using explicit context conditioning.

They surpass the current state-of-the-art BRAN model (Verga et al., 2017), which scores relations between all entity mention pairs in a dataset of biomedical abstracts. While these works are successful at explicitly modeling and extracting potential relations between existing entity mention pairs across sentence boundaries, they do not address the problem of mapping a “new” or unseen context to a potential relation that might exist between concepts or entities in text and retrieving related concepts in this new context.

C.0.3 Never-Ending Language Learning

Despite the success of machine learning for a wide variety of tasks in AI, from spam

filtering to speech recognition, and from credit card fraud detection to face recognition,

Mitchell et al. (2015) point out that the ways in which computers learn today still remain surprisingly narrow compared to human learning. To this end, they introduce a massive project that automatically learns to extract unary and binary relations by reading the Web – in an alternative paradigm for machine learning that more closely models the diversity, competence and cumulative nature of human learning, called never-ending learning (Mitchell et al., 2015). This better reflects the more ambitious and encompassing type of learning performed by humans, and is built on the hypothesis that we will never truly understand machine or human learning until we can build computer programs that, like people,

• learn many different types of knowledge or functions,

• from years of diverse, mostly self-supervised experience,

• in a staged curricular fashion, where previously learned knowledge enables learn-

ing further types of knowledge,

• where self-reflection and the ability to formulate new representations and new

learning tasks enable the learner to avoid stagnation and performance plateaus.

(Mitchell et al., 2015)

Thus, as a case study for the never-ending learning paradigm, the Never-Ending Language Learner (NELL) has been learning to read the web 24 hours/day since January

2010, and so far has acquired a knowledge base with over 80 million confidence-weighted

beliefs (e.g., servedWith(tea, biscuits)). NELL has also learned millions of features and parameters that enable it to read these beliefs from the web. Additionally, it has learned to reason over these beliefs to infer new beliefs, and is able to extend its ontology by synthesizing new relational predicates. NELL can be tracked online at http://rtw.ml.cmu.edu, and followed on Twitter at @CMUNELL (Mitchell et al.,

2015).

Even though NELL expands the coverage of open-domain facts and relations to a great degree, as is the case with any ontology or knowledge resource, systems using it are limited to only the rules present within that resource. These may not always be able to address a novel context appearing in a query into a particular collection, although the knowledge to address this query may very well be contained within the collection itself.

BIBLIOGRAPHY

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pages 265–283.

Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., and Marchand, M. (2014). Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446.

Alammar, J. (2018). The illustrated transformer. Online; posted June 27, 2018.

Anick, P. G. and Tipirneni, S. (1999). The paraphrase search assistant: terminological feedback for iterative information seeking. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 153–159. ACM.

Archambault, D., Hurley, N., and Tu, C. T. (2013). Churnvis: visualizing mobile telecommunications churn on a social network with attributes. In Advances in Social Networks Analysis and Mining (ASONAM), 2013 IEEE/ACM International Conference on, pages 894–901. IEEE.

Athiwaratkun, B., Wilson, A. G., and Anandkumar, A. (2018). Probabilistic fasttext for multi-sense word embeddings. arXiv preprint arXiv:1806.02901.

Azzini, A., Braghin, C., Damiani, E., and Zavatarelli, F. (2013). Using semantic lifting for improving process mining: a data loss prevention system case study. In SIMPDA, pages 62–73.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Banerjee, S. and Pedersen, T. (2002). An adapted lesk algorithm for word sense disambiguation using wordnet. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 136–145. Springer.

Barzilay, R. and McKeown, K. R. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the 39th annual meeting on Association for Computational Linguistics, pages 50–57. Association for Computational Linguistics.

Barzilay, R., McKeown, K. R., and Elhadad, M. (1999). Information fusion in the context of multi-document summarization. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 550–557. Association for Computational Linguistics.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. (2010). A theory of learning from different domains. Machine learning, 79(1-2):151–175.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.

Berger, A. L., Pietra, V. J. D., and Pietra, S. A. D. (1996). A maximum entropy approach to natural language processing. Computational linguistics, 22(1):39–71.

252 Bhagat, R. and Hovy, E. (2013). What is a paraphrase? Computational Linguistics,

39(3):463–472.

Bhagat, R., Pantel, P., Hovy, E. H., and Rey, M. (2007). Ledir: An unsupervised

algorithm for learning directionality of inference rules. In EMNLP-CoNLL, pages

161–170.

Bhagat, R. and Ravichandran, D. (2008). Large scale acquisition of paraphrases for

learning surface patterns. In ACL, volume 8, pages 674–682.

Bhatia, P., Arumae, K., and Celikkaya, B. (2018). Dynamic transfer learning for named

entity recognition. arXiv preprint arXiv:1812.05288.

Bird, S. (2006). Nltk: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 69–72. Association for Computational Linguistics.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM.

Broder, A. (2002). A taxonomy of web search. In ACM SIGIR Forum, volume 36, pages 3–10. ACM.

Brown, P. F., Pietra, V. J. D., Mercer, R. L., Pietra, S. A. D., and Lai, J. C. (1992). An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

Butterfield, S. (2004). Folksonomy: Social classification.

Callison-Burch, C., Koehn, P., and Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 17–24. Association for Computational Linguistics.

Camacho-Collados, J., Bovi, C. D., Anke, L. E., Oramas, S., Pasini, T., Santus, E., Shwartz, V., Navigli, R., and Saggion, H. (2018). SemEval-2018 task 9: Hypernym discovery. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 712–724.

Cao, G., Nie, J.-Y., Gao, J., and Robertson, S. (2008). Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 243–250. ACM.

Carey, S. (1999). Knowledge acquisition: Enrichment or conceptual change. Concepts: Core Readings, pages 459–487.

Chen, L., Guan, Z., Zhao, W., Zhao, W., Wang, X., Zhao, Z., and Sun, H. (2019). Answer identification from product reviews for user questions by multi-task attentive networks.

Chen, W., Moosavinasab, S., Zemke, A., Prinzbach, A., Rust, S., Huang, Y., and Lin, S. (2016). Evaluation of a machine learning method to rank PubMed Central articles for clinical relevancy: NCH at TREC 2016 CDS. TREC 2016 Clinical Decision Support Track.

Cheng, J. and Lapata, M. (2016). Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.

Chin, S.-C. (2013). Knowledge transfer: what, how, and why. Iowa Research Online - Theses and Dissertations.

Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585.

Chu, W. and Park, S.-T. (2009). Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th international conference on World Wide Web, pages 691–700. ACM.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2015). Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075.

Coussement, K. and Van den Poel, D. (2008). Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Systems with Applications, 34(1):313–327.

Csáji, B. C. (2001). Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24:48.

Csikszentmihalyi, M., Mehl, M. R., and Conner, T. S. (2013). Handbook of research methods for studying daily life. Guilford Publications.

Das, D. and Martins, A. F. (2007). A survey on automatic text summarization. Literature Survey for the Language and Statistics II course at CMU, 4(192-195):57.

Das, M., Cui, R., Campbell, D. R., Agrawal, G., and Ramnath, R. (2015a). Towards methods for systematic research on big data. In Big Data (Big Data), 2015 IEEE International Conference on, pages 2072–2081. IEEE.

Das, M., Elsner, M., Nandi, A., and Ramnath, R. (2015b). Topchurn: Maximum entropy churn prediction using topic models over heterogeneous signals. In Proceedings of the 24th International Conference on World Wide Web, pages 291–297. ACM.

Das, M., Fosler-Lussier, E., Lin, S., Moosavinasab, S., Chen, D., Rust, S., Huang, Y., and Ramnath, R. (2018). Phrase2vecglm: Neural generalized language model–based semantic tagging for complex query reformulation in medical IR. In Proceedings of the BioNLP 2018 workshop, pages 118–128.

Daumé III, H. (2004). Notes on CG and LM-BFGS optimization of logistic regression. Paper available at http://pub.hal3.name#daume04cg-bfgs, implementation available at http://hal3.name/megam, 198:282.

Dave, K. S., Vaingankar, V., Kolar, S., and Varma, V. (2013). Timespent based models for predicting user retention. In Proceedings of the 22nd international conference on World Wide Web, pages 331–342. International World Wide Web Conferences Steering Committee.

De Bock, K. W. and Van den Poel, D. (2010). Ensembles of probability estimation trees for customer churn prediction. In Trends in Applied Intelligent Systems, pages 57–66. Springer.

de Marneffe, M., MacCartney, B., Potts, C., and Jurafsky, D. (2015). Computational Linguistics I lectures - lexical semantics. University Lecture.

De Vine, L., Zuccon, G., Koopman, B., Sitbon, L., and Bruza, P. (2014). Medical semantic similarity with a neural language model. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1819–1822. ACM.

Debeauvais, T., Nardi, B., Schiano, D. J., Ducheneaut, N., and Yee, N. (2011). If you build it they might stay: Retention mechanisms in World of Warcraft. In Proceedings of the 6th International Conference on Foundations of Digital Games, pages 180–187. ACM.

Deerwester, S. (1988). Improving information retrieval with latent semantic indexing.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Diekema, A., Yilmazel, O., Chen, J., Harwell, S., He, L., and Liddy, E. D. (2003). What do you mean? Finding answers to complex questions. In New Directions in Question Answering, pages 87–93.

Documentation, T. (2014). Textblob: Simplified text processing. URL http://textblob.readthedocs.org/en/dev/.

Dror, G., Pelleg, D., Rokhlenko, O., and Szpektor, I. (2012). Churn prediction in new users of Yahoo! Answers. In Proceedings of the 21st international conference companion on World Wide Web, pages 829–834. ACM.

Espinosa-Anke, L., Camacho-Collados, J., Delli Bovi, C., and Saggion, H. (2016). Supervised distributional hypernym discovery via domain adaptation. In Conference on Empirical Methods in Natural Language Processing; 2016 Nov 1-5; Austin, TX. Red Hook (NY): ACL; 2016. p. 424-35. ACL (Association for Computational Linguistics).

Eysenck, M. W. (2006). Fundamentals of cognition. Psychology Press.

Fader, A., Zettlemoyer, L., and Etzioni, O. (2014). Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1156–1165. ACM.

Fellbaum, C. (1998). WordNet. Wiley Online Library.

Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis.

Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. (1983). Human factors and behavioral science: Statistical semantics: Analysis of the potential performance of key-word information systems. The Bell System Technical Journal, 62(6):1753–1806.

Ganguly, D., Roy, D., Mitra, M., and Jones, G. J. (2015a). A word embedding based generalized language model for information retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 795–798. ACM.

Ganguly, D., Roy, D., Mitra, M., and Jones, G. J. (2015b). A word embedding based generalized language model for information retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 795–798. ACM.

Ganin, Y. and Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35.

Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2):155–170.

Gentner, D. (1989). Analogical learning. Similarity and Analogical Reasoning, 199.

Giuliano, C., Lavelli, A., and Romano, L. (2006). Exploiting shallow linguistic information for relation extraction from biomedical literature. In EACL, volume 18, pages 401–408. Trento, Italy.

Goldberg, Y. and Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Gormley, C. and Tong, Z. (2015). Elasticsearch: The Definitive Guide. O'Reilly Media, Inc.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.

Group, T. S. N. L. P. (2014). Stanford topic modeling toolbox. URL http://nlp.stanford.edu/downloads/tmt/tmt-0.4/.

Guha, R. V. (1991). Contexts: a formalization and some applications, volume 101. Stanford University, Stanford, CA.

Halpin, H., Robu, V., and Shepherd, H. (2007). The complex dynamics of collaborative tagging. In Proceedings of the 16th international conference on World Wide Web, pages 211–220. ACM.

Harabagiu, S. and Hickl, A. (2006). Methods for using textual entailment in open-domain question answering. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 905–912. Association for Computational Linguistics.

Harris, Z. S. (1954). Distributional structure. Word, 10(2-3):146–162.

Hassan, A. Z., Vallabhajosyula, M. S., and Pedersen, T. (2018). UMDuluth-CS8761 at SemEval-2018 task 9: Hypernym discovery using Hearst patterns, co-occurrence frequencies and word embeddings. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 914–918, New Orleans, Louisiana. Association for Computational Linguistics.

Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics - Volume 2, pages 539–545. Association for Computational Linguistics.

Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Ó Séaghdha, D., Padó, S., Pennacchiotti, M., Romano, L., and Szpakowicz, S. (2009). SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94–99. Association for Computational Linguistics.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 289–296. Morgan Kaufmann Publishers Inc.

Holmes, G., Donkin, A., and Witten, I. H. (1994). Weka: A machine learning workbench. In Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference on, pages 357–361. IEEE.

Idiro Analytics, E. (2017a). Retaining customers. URL http://www.idiro.com/products-services/retaining-customers/.

Idiro Analytics, E. (2017b). Rotational churn. URL http://www.idiro.com/products-services/rotational-churn/.

Interactive Knowledge Stack, I. (2012). Lecture presentation - semantic lifting. [Online; posted 09-October-2012].

Iwata, T., Saito, K., and Yamada, T. (2006). Recommendation method for extending subscription periods. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 574–579. ACM.

Iyyer, M., Manjunatha, V., Boyd-Graber, J., and Daumé III, H. (2015). Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1681–1691.

Jin, X., Zhou, Y., and Mobasher, B. (2005). A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 612–617. ACM.

Jurafsky, D. and Martin, J. H. (2014). Speech and language processing, volume 3. Pearson London.

Kadlec, R., Bajgar, O., and Kleindienst, J. (2017). Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744.

Karnstedt, M., Rowe, M., Chan, J., Alani, H., and Hayes, C. (2011). The effect of user features on churn in social networks. In Proceedings of the 3rd International Web Science Conference, page 23. ACM.

Kawale, J., Pal, A., and Srivastava, J. (2009). Churn prediction in MMORPGs: A social influence based approach. In Computational Science and Engineering, 2009. CSE'09. International Conference on, volume 4, pages 423–428. IEEE.

Kholghi, M., Sitbon, L., Zuccon, G., and Nguyen, A. (2015). Active learning: a step towards automating medical concept extraction. Journal of the American Medical Informatics Association, 23(2):289–296.

Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingsbury, P. and Palmer, M. (2002). From treebank to propbank. In LREC, pages 1989–1993. Citeseer.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Klein, D. and Manning, C. (2003). Maxent models, conditional estimation, and optimization. HLT-NAACL 2003 Tutorial.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632.

Krallinger, M. and Valencia, A. (2005). Text-mining and information-retrieval services for molecular biology. Genome Biology, 6(7):224.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Laine, S. and Aila, T. (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196.

Lenat, D. B. (1995). Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Li, C., Datta, A., and Sun, A. (2011). Semantic tag recommendation using concept model. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 1159–1160. ACM.

Lin, D. and Pantel, P. (2001a). Discovery of inference rules for question-answering. Natural Language Engineering, 7(4):343–360.

Lin, D. and Pantel, P. (2001b). Induction of semantic classes from natural language text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 317–322. ACM.

Lin, D. and Pantel, P. (2002). Concept discovery from text. In Proceedings of the 19th international conference on Computational linguistics - Volume 1, pages 1–7. Association for Computational Linguistics.

Liu, Z., Chen, X., and Sun, M. (2011). A simple word trigger method for social tag suggestion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1577–1588. Association for Computational Linguistics.

Loria, S. (2014). Textblob: simplified text processing. Secondary TextBlob: Simplified Text Processing.

Luo, G., Tang, C., Yang, H., and Wei, X. (2008). Medsearch: a specialized search engine for medical information retrieval. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 143–152. ACM.

Manning, C., Raghavan, P., and Schütze, H. (2009). Introduction to information retrieval.

Manning, C. D., Schütze, H., et al. (1999). Foundations of statistical natural language processing, volume 999. MIT Press.

Manual, N. U. K. S. (2008). National Library of Medicine. Bethesda, Maryland.

Marcus, A., Wu, E., Karger, D. R., Madden, S., and Miller, R. C. (2011). Crowdsourced databases: Query processing with people. CIDR.

Margolis, E. and Laurence, S. (1999). Concepts: core readings. MIT Press.

Mathes, A. (2010). Folksonomies: Cooperative classification and communication through shared metadata, 2004. URL http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.

McAuley, J. and Yang, A. (2016). Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web, pages 625–635. International World Wide Web Conferences Steering Committee.

Medin, D. L. (1989). Concepts and conceptual structure. American Psychologist, 44(12):1469.

Medin, D. L., Goldstone, R. L., and Gentner, D. (1993). Respects for similarity. Psychological Review, 100(2):254.

Merrill, M. D. and Tennyson, R. D. (1977). Concept teaching: An instructional design guide. Englewood Cliffs, NJ: Educational Technology.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Miller, G. A. (1995). Wordnet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

Minsky, M. (1991). Society of mind: a response to four reviews. Artificial Intelligence, 48(3):371–396.

Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 1003–1011. Association for Computational Linguistics.

Mitchell, T. M., Cohen, W. W., Hruschka Jr, E. R., Talukdar, P. P., Betteridge, J., Carlson, A., Mishra, B. D., Gardner, M., Kisiel, B., Krishnamurthy, J., et al. (2015). Never ending learning. In AAAI, pages 2302–2310.

Moldovan, D., Paşca, M., Harabagiu, S., and Surdeanu, M. (2003). Performance issues and error analysis in an open-domain question answering system. ACM Transactions on Information Systems (TOIS), 21(2):133–154.

Mooney, R. J. and Roy, L. (2000). Content-based book recommending using learning for text categorization. In Proceedings of the fifth ACM conference on Digital libraries, pages 195–204. ACM.

Nandi, A. and Bernstein, P. A. (2009). Hamster: using search clicklogs for schema and taxonomy matching. Proceedings of the VLDB Endowment, 2(1):181–192.

Neubig, G. (2017). Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619.

Newman, D., Chemudugunta, C., Smyth, P., and Steyvers, M. (2006). Analyzing entities and topics in news articles using statistical topic models. In Intelligence and Security Informatics, pages 93–104. Springer.

Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., and Deng, L. (2016). MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Nielsen, M. A. (2015). Neural networks and deep learning, volume 25. Determination Press, San Francisco, CA, USA.

Nivre, J., De Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., et al. (2016). Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1659–1666.

Osgood, C. E. (1949). The similarity paradox in human learning: A resolution. Psychological Review, 56(3):132.

Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., and Ward, R. (2016). Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(4):694–707.

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Pantel, P., Bhagat, R., Coppola, B., Chklovski, T., and Hovy, E. H. (2007). ISP: Learning inferential selectional preferences. In HLT-NAACL, pages 564–571.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Petrov, S. and McDonald, R. (2012). Overview of the 2012 shared task on parsing the web.

Pew Research Centers, J. P. a. (2013). Newspapers turning ideas into dollars.

Pfeiffer, J., Broscheit, S., Gemulla, R., and Göschl, M. (2018). A neural autoencoder approach for document ranking and query refinement in pharmacogenomic information retrieval. In Proceedings of the BioNLP 2018 workshop, pages 87–97.

Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., and Welling, M. (2008). Fast collapsed Gibbs sampling for latent dirichlet allocation. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 569–577. ACM.

Radev, D., Winkel, A., and Topper, M. (2002). Multi document centroid-based text summarization. In ACL 2002. Citeseer.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding with unsupervised learning. Technical report, OpenAI.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Ramage, D., Manning, C. D., and Dumais, S. (2011). Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 457–465. ACM.

Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th annual meeting on Association for Computational Linguistics, pages 41–47. Association for Computational Linguistics.

Řehůřek, R. and Sojka, P. (2010a). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Řehůřek, R. and Sojka, P. (2010b). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. (1994). Grouplens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on Computer supported cooperative work, pages 175–186. ACM.

Resnik, P. (1996). Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1):127–159.

Rice, J. A. (2006). Mathematical statistics and data analysis. Cengage Learning.

Ritter, A., Soderland, S., and Etzioni, O. (2009). What is this, anyway: Automatic hypernym discovery. In AAAI Spring Symposium: Learning by Reading and Learning to Read, pages 88–93.

Ritter, A., Zettlemoyer, L., Etzioni, O., et al. (2013). Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics, 1:367–378.

Rivas, A. R., Iglesias, E. L., and Borrajo, L. (2014). Study of query expansion techniques and their application in the biomedical information retrieval. The Scientific World Journal, 2014.

Roberts, K., Gururaj, A. E., Chen, X., Pournejati, S., Hersh, W. R., Demner-Fushman, D., Ohno-Machado, L., Cohen, T., and Xu, H. (2017). Information retrieval for biomedical datasets: the 2016 bioCADDIE dataset retrieval challenge. Database, 2017.

Roberts, K., Simpson, M. S., Voorhees, E. M., and Hersh, W. R. (2016a). Overview of the TREC 2015 clinical decision support track. In TREC.

Roberts, K., Voorhees, E., Demner-Fushman, D., and Hersh, W. R. (2016b). Overview of the TREC 2016 clinical decision support track. Online; posted August 2016.

Rocktäschel, T., Grefenstette, E., Hermann, K. M., Kočiský, T., and Blunsom, P. (2015). Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Roediger, H. L. (1990). Implicit memory: Retention without remembering. American Psychologist, 45(9):1043.

Ruder, S. (2019). Neural Transfer Learning for Natural Language Processing. PhD thesis, National University of Ireland, Galway.

Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1.

Rush, A. M., Chopra, S., and Weston, J. (2015). A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Russell, S. J. and Norvig, P. (2016). Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited.

Salton, G., Buckley, C., and Smith, M. (1990). On the application of syntactic methodologies in automatic text analysis. Information Processing & Management, 26(1):73–92.

Salton, G., Wong, A., and Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423.

Sharp, R., Surdeanu, M., Jansen, P., Clark, P., and Hammond, M. (2016). Creating causal embeddings for question answering with minimal supervision. arXiv preprint arXiv:1609.08097.

Shimaoka, S., Stenetorp, P., Inui, K., and Riedel, S. (2016). Neural architectures for fine-grained entity type classification. arXiv preprint arXiv:1606.01341.

Shneiderman, B., Byrd, D., and Croft, W. B. (1997). Clarifying search: A user-interface framework for text searches. D-Lib Magazine, 3(1):18–20.

Shwartz, V., Santus, E., and Schlechtweg, D. (2016). Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. arXiv preprint arXiv:1612.04460.

Silvestri, F. (2010). Mining query logs: Turning search usage data into knowledge. Foundations and Trends in Information Retrieval, 4(1-2):1–174.

Singh, G. and Bhatia, P. (2019). Relation extraction using explicit context conditioning. arXiv preprint arXiv:1902.09271.

Smith, N. A. (2011). Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274.

Soleimani, H. and Miller, D. J. (2016). Semi-supervised multi-label topic models for document classification and sentence labeling. In Proceedings of the 25th ACM international on conference on information and knowledge management, pages 105–114. ACM.

Song, Y., Kim, E., Lee, G. G., and Yi, B.-K. (2005). Posbiotm-ner: a trainable biomedical named-entity recognition system. Bioinformatics, 21(11):2794–2796.

Sordoni, A., Bengio, Y., and Nie, J.-Y. (2014). Learning concept embeddings for query expansion by quantum entropy minimization. In AAAI, volume 14, pages 1586–1592.

Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.

Spelke, E. S. (1990). Principles of object perception. Cognitive Science, 14(1):29–56.

Steyvers, M. and Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440.

Su, Y., Yang, S., Sun, H., Srivatsa, M., Kase, S., Vanni, M., and Yan, X. (2015). Exploiting relevance feedback in knowledge graph search. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.

Sukhbaatar, S., Weston, J., Fergus, R., et al. (2015). End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Szabó, Z. G. (2017). Compositionality. In Zalta, E. N., editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, summer 2017 edition.

Szpektor, I., Tanev, H., Dagan, I., and Coppola, B. (2004). Scaling web-based acquisition of entailment relations. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Tarng, P.-Y., Chen, K.-T., and Huang, P. (2009). On prophesying online gamer departure. In Network and Systems Support for Games (NetGames), 2009 8th Annual Workshop on, pages 1–2. IEEE.

Taylor, A., Marcus, M., and Santorini, B. (2003). The Penn Treebank: an overview. In Treebanks, pages 5–22. Springer.

Teo, C. H., Nassif, H., Hill, D., Srinivasan, S., Goodman, M., Mohan, V., and Vishwanathan, S. (2016). Adaptive, personalized diversity for visual discovery. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 35–38. ACM.

Thomason, R. H. (2012). What is semantics? Richard H. Thomason's UMich EECS Departmental Homepage.

Thorndike, E. L. (1931). Human learning.

Tuarob, S., Pouchard, L. C., and Giles, C. L. (2013). Automatic tag recommendation for metadata annotation using probabilistic topic modeling. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries, pages 239–248. ACM.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. arXiv preprint cs/0508053.

Turney, P. D. and Littman, M. L. (2003). Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS), 21(4):315–346.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Verga, P., Strubell, E., Shai, O., and McCallum, A. (2017). Attending to all mention pairs for full abstract biological relation extraction. arXiv preprint arXiv:1710.08312.

Voorhees, E. M. (2014). The effect of sampling strategy on inferred measures. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 1119–1122. ACM.

Vulić, I., Gerz, D., Kiela, D., Hill, F., and Korhonen, A. (2017). Hyperlex: A large-scale evaluation of graded lexical entailment. Computational Linguistics, 43(4):781–835.

Wan, M. and McAuley, J. (2016). Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 489–498. IEEE.

Wei, C., Lee, H., Molnar, L., Herold, M., Ramnath, R., and Ramanathan, J. (2013). Assisted human-in-the-loop adaptation of web pages for mobile devices. In Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual, pages 118–123. IEEE.

Wilks, Y. (1973). Preference semantics. Technical report, Stanford University, CA, Department of Computer Science.

Wities, R., Shwartz, V., Stanovsky, G., Adler, M., Shapira, O., Upadhyay, S., Roth, D., Martínez-Cámara, E., Gurevych, I., and Dagan, I. (2017). A consolidated open knowledge representation for multiple texts. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 12–24.

Xu, B., Lin, H., Lin, Y., and Xu, K. (2017). Learning to rank with query-level semi-supervised autoencoders. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2395–2398. ACM.

Xu, J. and Croft, W. B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1):79–112.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Xu, W., Ritter, A., Callison-Burch, C., Dolan, W. B., and Ji, Y. (2014). Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2:435–448.

Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P., Koudas, N., and Papadias, D. (2009). Query by document. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 34–43. ACM.

Yin, W., Schütze, H., Xiang, B., and Zhou, B. (2015). Abcnn: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193.

Yu, Q. and Lam, W. (2018). Review-aware answer prediction for product-related questions incorporating aspects. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 691–699. ACM.

Zhai, C. and Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179–214.

Zhang, Y., Rahman, M. M., Braylan, A., Dang, B., Chang, H.-L., Kim, H., McNamara, Q., Angert, A., Banner, E., Khetan, V., et al. (2016). Neural information retrieval: A literature review. arXiv preprint arXiv:1611.06792.

Zhang, Z., Li, J., Zhao, H., and Tang, B. (2018). SJTU-NLP at SemEval-2018 task 9: Neural hypernym discovery with term embeddings. arXiv preprint arXiv:1805.10465.

Zuccon, G., Koopman, B., Bruza, P., and Azzopardi, L. (2015). Integrating and evaluating neural word embeddings in information retrieval. In Proceedings of the 20th Australasian Document Computing Symposium, page 12. ACM.
