CASE-QA: Context and Syntax Embeddings for Question Answering on Stack Overflow

Ezra Winston
Committee member: Graham Neubig
Advisor: William Cohen

Abstract

Question answering (QA) systems rely on both knowledge bases and unstructured text corpora. Domain-specific QA presents a unique challenge, since relevant knowledge bases are often lacking and unstructured text is difficult to query and parse. This project focuses on the QUASAR-S dataset (Dhingra et al., 2017), constructed from the community QA site Stack Overflow. QUASAR-S consists of Cloze-style questions about software entities and a large background corpus of community-generated posts, each tagged with relevant software entities. We incorporate the tag entities as context for the QA task and find that modeling co-occurrence of tags and answers in posts leads to significant accuracy gains. To this end, we propose CASE, a hybrid of an RNN language model and a tag-answer co-occurrence model, which achieves state-of-the-art accuracy on the QUASAR-S dataset. We also find that this approach, modeling both question sentences and context-answer co-occurrence, is effective for other QA tasks. Using only language and co-occurrence modeling on the training set, CASE is competitive with the state-of-the-art method on the SPADES dataset (Bisk et al., 2016), which uses a knowledge base.

1 Introduction

Question answering (QA) is a long-standing goal of AI research. Factoid QA is the task of providing short answers, such as people, places, or dates, to questions posed in natural language. Systems for factoid QA have broadly fallen into two categories: those using knowledge bases (KBs) and those using unstructured text. While KB approaches benefit from structured information, QA tasks which require domain-specific knowledge present a unique challenge, since relevant knowledge bases are often lacking. Text-based approaches which query unstructured sources have improved greatly with recent advances in machine reading comprehension, but effective combination of search and reading systems is an active research challenge.

This project focuses on the QUASAR-S dataset (Dhingra et al., 2017), constructed from the community QA site Stack Overflow. QUASAR-S consists of Cloze-style (fill-in-the-blank) questions about software entities and a large background corpus of community-generated posts, each tagged with relevant software entities. Effectively answering these highly domain-specific questions requires deep understanding of the background corpus. One way to leverage the background posts corpus for QA is to train a language model of posts, creating training questions similar to the Cloze questions by treating entities in posts as answer entities. In this project, we find that additionally modeling co-occurrence of tags and answers in posts greatly aids the QA task. For example, a post about Java and the Eclipse integrated development environment appears with tags java, compilation, and java-7 and contains the sentence:

    You can use the eclipse ide for the purpose of refactoring.

We create a training question by treating eclipse as the answer entity and refer to the tags as the context entities. We use both the sentence $q$ and the context entities $c$ to predict the answer $a$, modeling $P(a \mid q, c)$.

This project proposes CASE, a hybrid of a recurrent neural network language model (RNN-LM) of question sentences, modeling $P(a \mid q)$, and a context-answer co-occurrence model of $P(a \mid c)$. Factoid questions can often be viewed as consisting of both a question sentence and one or more context entities. For example, the SPADES corpus (Bisk et al., 2016) contains questions about Freebase entities like "USA has elected blank, our first African-American president", where we take USA to be the context entity and the desired answer entity is Barack Obama. We show that this view leads to a useful division of responsibility: the presence of the context model allows the RNN-LM to focus on the "type" of the answer entity based on question syntax.

This project makes the following original contributions:

• We propose CASE, a hybrid language/context model, and instantiate it using an RNN-LM and a simple count-based co-occurrence context model.

• We show that CASE makes more effective use of background knowledge than both pure language modeling and search-and-read baselines, obtaining state-of-the-art performance on QUASAR-S.

• We demonstrate that on the SPADES dataset, where no background text corpus is available, CASE still obtains results comparable to state-of-the-art knowledge-based methods, without using a knowledge base. We then combine the co-occurrence counts with the best existing model to obtain a new state of the art.

• Finally, we provide qualitative analysis of the entity embeddings produced by CASE, showing that they encode entity "type" information while ignoring semantic differences, which is of potential use for other tasks.

2 Background & Related Work

2.1 Problem Definition

We take an instance of the QA-with-context task to be a tuple $(c, q, i)$ where $c = \{c_1, \ldots, c_m\}$ is a set of one or more context entities, question sentence $q$ has words $w_1, \ldots, w_n$, and the answer $a$ appears at index $i$, i.e. $w_i = a$. At test time the answer entity $a$ is replaced with a blank and the task is to identify it. That is, we wish to model $P(a \mid c, q \setminus w_i)$.
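To make this setup concrete, here is a minimal Python sketch of one such instance, built from the Eclipse example above. The class and field names, and the @blank placeholder token, are our own illustrative assumptions, not the dataset's actual serialization.

```python
from dataclasses import dataclass

@dataclass
class QAWithContextInstance:
    """One QA-with-context instance (c, q, i): context entities,
    question words, and the index of the answer word."""
    context: set[str]      # c = {c_1, ..., c_m}, e.g. Stack Overflow tags
    question: list[str]    # q = w_1, ..., w_n
    answer_index: int      # i, with w_i = a at training time

    @property
    def answer(self) -> str:
        return self.question[self.answer_index]

    def masked(self) -> list[str]:
        """Test-time view: the answer entity is replaced with a blank,
        and the task is to recover it from (c, q \\ w_i)."""
        q = list(self.question)
        q[self.answer_index] = "@blank"   # hypothetical placeholder token
        return q

# The Eclipse example from the introduction: the entity "eclipse" is
# the answer and the post's tags are the context entities.
ex = QAWithContextInstance(
    context={"java", "compilation", "java-7"},
    question="you can use the eclipse ide for the purpose of refactoring".split(),
    answer_index=4,
)
assert ex.answer == "eclipse"
print(ex.masked())  # ['you', 'can', 'use', 'the', '@blank', 'ide', ...]
```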
2.2 Question Answering

Research into both text-based and knowledge-based QA has recently centered on deep-learning approaches. For example, memory networks have proven an effective way to reason over KBs (e.g. Bordes et al., 2015). However, the relative sparsity of even the largest KBs has motivated a turn to unstructured text data such as Wikipedia articles. Such data is available in abundance but can prove challenging to retrieve and parse. Text-based approaches (e.g. Chen et al., 2017; Dhingra et al., 2017) typically follow a search-and-read paradigm, involving a search stage, in which relevant documents are retrieved, and a reading stage, in which retrieved passages are read for the correct answer. Much research has focused on the reading stage, with many datasets (e.g. Rajpurkar et al., 2016) developed for the reading comprehension task. Effectively trading off between query recall and reading accuracy is the subject of current research (Dhingra et al., 2017).

To our knowledge, little work has focused on incorporating background knowledge for QA via language modeling, although an RNN-LM is provided as a baseline on the QUASAR-S dataset (Dhingra et al., 2017). When applicable, this approach has the benefit of access to much larger training sets than either KB or search-and-read approaches, since it can be trained on natural-language sources that are orders of magnitude larger than existing QA training sets. In addition, the language-modeling approach does not depend on achieving the fine balance between query and reading systems required for search-and-read.

2.3 Language Modeling

Given a sequence $S$ consisting of words $w_1, \ldots, w_{k-1}$ (and sometimes words $w_{k+1}, \ldots, w_K$), the language modeling task is to model $P(w_k \mid S)$. Neural network language models such as those using LSTMs and GRUs have shown increasingly good performance (see Chung et al., 2014 for a comparison). Following Dhingra et al. (2017), we adopt a BiGRU model for modeling the question sentence $q$.

RNN-LMs have trouble modeling long-range topical context as well as predicting rare words. We find that explicitly incorporating predictions based on context entities (e.g. tags in Stack Overflow, or Freebase entities in SPADES) is critical for the QA-with-context task, since the correct answer entity can be largely dictated by the context entities. Several approaches to incorporating long-range context in RNN-LMs have emerged and led to better language modeling performance. Following the terminology of Wang and Cho (2015), these either employ early fusion, in which a context vector is concatenated with each RNN input (Ghosh et al., 2016; Mikolov and Zweig, 2012), or late fusion, in which a context vector is used as a bias before the output nonlinearity of the RNN cell (Wang and Cho, 2015). We employ an approach most related to late fusion, adding a context vector as a bias to the RNN output in logit space, prior to the softmax.
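As an illustration of where this bias enters, here is a minimal sketch. It assumes both scores are already expressed over the answer vocabulary; the function and variable names are ours, not from any existing implementation.

```python
import torch
import torch.nn.functional as F

def combine_in_logit_space(rnn_logits: torch.Tensor,
                           context_bias: torch.Tensor) -> torch.Tensor:
    """Add a context-derived score vector as a bias to the RNN output
    in logit space, prior to the softmax. This differs from late fusion
    proper, which biases inside the RNN cell before its output
    nonlinearity.

    rnn_logits:   (batch, vocab) unnormalized language-model scores
    context_bias: (batch, vocab) context-derived scores
    Returns normalized log-probabilities over the vocabulary.
    """
    return F.log_softmax(rnn_logits + context_bias, dim=-1)
```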
Related to our approach, Arthur et al. (2016) incorporate discrete lexicons into neural translation models by using them as a bias in the output softmax, finding that this compensates where neural translation models fail at translating rare but important words. Neubig and Dyer (2016) present a framework for hybridizing neural and n-gram language models, one instantiation of which involves neural interpolation between n-gram predictions and RNN-LM predictions. Also related to our approach is TopicRNN, a generative language model that combines a neural variational topic model over past words with an RNN language model of the current sentence (Dieng et al., 2016).

3 Model

For the language model $f$ we use a BiGRU over the question sentence: we embed the question words and concatenate the forward and backward GRU outputs at the answer index:

$$x = [W_1 w_1, \ldots, W_1 w_K]$$
$$h = [\mathrm{fGRU}(x)_{i-1};\ \mathrm{bGRU}(x)_{i+1}]$$
$$\log f(q, \cdot) = W_2 h$$

where the $w_k$ are one-hot encoded and $\mathrm{fGRU}(x)$ and $\mathrm{bGRU}(x)$ are the sequential outputs of the forward and backward GRUs.
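A minimal PyTorch sketch of this scorer follows. The module name, dimensions, and boundary handling are assumptions; this excerpt does not specify the original hyperparameters.

```python
import torch
import torch.nn as nn

class BiGRUCloze(nn.Module):
    """Score answers for a Cloze blank at index i: embed the question
    words (W1), run forward/backward GRUs, concatenate the forward
    output at i-1 with the backward output at i+1, and project to
    vocabulary logits (W2), i.e. log f(q, .) = W2 h."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # W1
        self.gru = nn.GRU(emb_dim, hid_dim, bidirectional=True,
                          batch_first=True)
        self.out = nn.Linear(2 * hid_dim, vocab_size)            # W2

    def forward(self, word_ids: torch.Tensor, i: int) -> torch.Tensor:
        # Assumes 1 <= i <= K-2, i.e. the blank is not at a boundary.
        x = self.embed(word_ids)                # (batch, K, emb_dim)
        states, _ = self.gru(x)                 # (batch, K, 2 * hid_dim)
        fwd = states[:, i - 1, :self.gru.hidden_size]   # fGRU(x)_{i-1}
        bwd = states[:, i + 1, self.gru.hidden_size:]   # bGRU(x)_{i+1}
        h = torch.cat([fwd, bwd], dim=-1)
        return self.out(h)                      # unnormalized log f(q, .)
```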
For the context model $g$ we use simple co-occurrence counts calculated from the training set. Specifically, given context entities $c = \{c_1, \ldots, c_m\}$ we compute

$$g(c, a) = \operatorname{avg}_i \frac{\#(a, c_i)}{\#(c_i)}.$$

In other words, for each context entity, we compute the empirical probability of co-occurrence with the answer entity, and then average over context entities.
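These counts can be tabulated in a single pass over the training set. The following sketch uses hypothetical helper names and a toy corpus; the actual system's vocabulary handling and any smoothing may differ.

```python
from collections import Counter
from itertools import product

def build_counts(training_posts):
    """Tabulate #(c_i) and #(a, c_i) over the training set, where each
    post contributes its tag set (context entities) and the answer
    entities appearing in its sentences."""
    tag_counts, pair_counts = Counter(), Counter()
    for tags, answers in training_posts:   # (set of tags, set of entities)
        for t in tags:
            tag_counts[t] += 1
        for a, t in product(answers, tags):
            pair_counts[(a, t)] += 1
    return tag_counts, pair_counts

def g(context, answer, tag_counts, pair_counts):
    """g(c, a): average, over context entities, of the empirical
    co-occurrence probability #(a, c_i) / #(c_i)."""
    probs = [pair_counts[(answer, t)] / tag_counts[t]
             for t in context if tag_counts[t] > 0]
    return sum(probs) / len(probs) if probs else 0.0

# Toy usage with the Eclipse example:
posts = [({"java", "compilation", "java-7"}, {"eclipse", "java"})]
tc, pc = build_counts(posts)
print(g({"java", "java-7"}, "eclipse", tc, pc))  # 1.0 in this toy corpus
```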