CASE-QA: Context and Syntax Embeddings for Question Answering on Stack Overflow

Ezra Winston
Committee member: Graham Neubig
Advisor: William Cohen

Abstract

Question answering (QA) systems rely on both knowledge bases and unstructured text corpora. Domain-specific QA presents a unique challenge, since relevant knowledge bases are often lacking and unstructured text is difficult to query and parse. This project focuses on the QUASAR-S dataset (Dhingra et al., 2017), constructed from the community QA site Stack Overflow. QUASAR-S consists of Cloze-style questions about software entities and a large background corpus of community-generated posts, each tagged with relevant software entities. We incorporate the tag entities as context for the QA task and find that modeling co-occurrence of tags and answers in posts leads to significant accuracy gains. To this end, we propose CASE, a hybrid of an RNN language model and a tag-answer co-occurrence model which achieves state-of-the-art accuracy on the QUASAR-S dataset. We also find that this approach — modeling both question sentences and context-answer co-occurrence — is effective for other QA tasks. Using only language and co-occurrence modeling on the training set, CASE is competitive with the state-of-the-art method on the SPADES dataset (Bisk et al., 2016), which uses a knowledge base.

1 Introduction

Question answering (QA) is a long-standing goal of AI research. Factoid QA is the task of providing short answers — such as people, places, or dates — to questions posed in natural language. Systems for factoid QA have broadly fallen into two categories: those using knowledge bases (KBs) and those using unstructured text. While KB approaches benefit from structured information, QA tasks which require domain-specific knowledge present a unique challenge since relevant knowledge bases are often lacking. Text-based approaches which query unstructured sources have improved greatly with recent advances in machine reading comprehension, but effective combination of search and reading systems is an active research challenge.

This project focuses on the QUASAR-S dataset (Dhingra et al., 2017), constructed from the community QA site Stack Overflow. QUASAR-S consists of Cloze-style (fill-in-the-blank) questions about software entities and a large background corpus of community-generated posts, each tagged with relevant software entities. Effectively answering these highly domain-specific questions requires deep understanding of the background corpus. One way to leverage the background post corpus for QA is to train a language model on posts, creating training questions similar to the Cloze questions by treating entities in posts as answer entities. In this project, we find that additionally modeling co-occurrence of tags and answers in posts greatly aids the QA task. For example, a post about Java and the Eclipse integrated development environment appears with tags java, compilation, and java-7 and contains the sentence:

    You can use the eclipse ide for the purpose of refactoring.

We create a training question by treating eclipse as the answer entity and refer to the tags as the context entities. We use both the sentence q and the context entities c to predict the answer a, modeling P(a | q, c).
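To make the construction of such training questions concrete, the short sketch below turns a tagged post sentence into Cloze-style instances by holding out each known software entity in turn. This is only an illustrative sketch, not the actual QUASAR-S pipeline: the ClozeInstance container, the BLANK placeholder token, and the entity_vocab set are assumptions introduced here.

from dataclasses import dataclass
from typing import List

BLANK = "@blank"  # hypothetical placeholder token for the held-out answer

@dataclass
class ClozeInstance:
    context: List[str]   # tag entities, e.g. ["java", "compilation", "java-7"]
    question: List[str]  # post sentence with the answer replaced by BLANK
    answer: str          # held-out answer entity, e.g. "eclipse"

def make_training_instances(sentence, tags, entity_vocab):
    """Create one Cloze-style training question per entity occurrence in the sentence."""
    instances = []
    for i, word in enumerate(sentence):
        if word in entity_vocab:
            question = sentence[:i] + [BLANK] + sentence[i + 1:]
            instances.append(ClozeInstance(context=list(tags), question=question, answer=word))
    return instances

# The running example from the text: eclipse becomes the answer, the tags the context.
sentence = "you can use the eclipse ide for the purpose of refactoring".split()
tags = ["java", "compilation", "java-7"]
print(make_training_instances(sentence, tags, entity_vocab={"eclipse"}))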
This project proposes CASE, a hybrid of a recurrent neural network language model (RNN-LM) of question sentences, modeling P(a | q), and a context-answer co-occurrence model of P(a | c). Factoid questions can often be viewed as consisting of both a question sentence and one or more context entities. For example, the SPADES corpus (Bisk et al., 2016) contains questions about Freebase entities like "USA has elected blank, our first African-American president", where we take USA to be the context entity and the desired answer entity is Barack Obama. We show that this view leads to a useful division of responsibility: the presence of the context model allows the RNN-LM to focus on the "type" of the answer entity based on question syntax.

This project makes the following original contributions:

• We propose CASE, a hybrid language/context model, and instantiate it using an RNN-LM and a simple count-based co-occurrence context model.

• We show that CASE makes more effective use of background knowledge than both pure language modeling and search-and-read baselines, obtaining state-of-the-art performance on QUASAR-S.

• We demonstrate that on the SPADES dataset, where no background text corpus is available, CASE still obtains results comparable to state-of-the-art knowledge-based methods, without using a knowledge base. We then combine the co-occurrence counts with the best existing model to obtain a new state of the art.

• Finally, we provide qualitative analysis of the entity embeddings produced by CASE, showing that they encode entity "type" information while ignoring semantic differences, which is of potential use for other tasks.

2 Background & Related Work

2.1 Problem Definition

We take an instance of the QA-with-context task to be a tuple (c, q, i), where c = {c_1, ..., c_m} is a set of one or more context entities, question sentence q has words w_1, ..., w_n, and the answer a appears at index i, i.e. w_i = a. At test time the answer entity a is replaced with a blank and the task is to identify it. That is, we wish to model P(a | c, q \ w_i).

2.2 Question Answering

Research into both text-based and knowledge-based QA has recently centered on deep-learning approaches. For example, memory networks have proven an effective way to reason over KBs (e.g. Bordes et al., 2015). However, the relative sparsity of even the largest KBs has motivated a turn to unstructured text data such as Wikipedia articles. Such data is available in abundance but can prove challenging to retrieve and parse. Text-based approaches (e.g. Chen et al., 2017; Dhingra et al., 2017) typically follow a search-and-read paradigm, involving a search stage, in which relevant documents are retrieved, and a reading stage, in which retrieved passages are read for the correct answer. Much research has focused on the reading stage, with many datasets (e.g. Rajpurkar et al., 2016) developed for the reading comprehension task. Effectively trading off between query recall and reading accuracy is the subject of current research (Dhingra et al., 2017).

To our knowledge, little work has focused on incorporating background knowledge for QA via language modeling, although an RNN-LM is provided as a baseline on the QUASAR-S dataset (Dhingra et al., 2017). When applicable, this approach has the benefit of access to much larger training sets than either KB or search-and-read approaches, since it can be trained on natural-language sources that are orders of magnitude larger than existing QA training sets. In addition, the language-modeling approach does not depend on achieving the fine balance between query and reading systems required for search-and-read.

2.3 Language Modeling

Given a sequence S consisting of words w_1, ..., w_{k-1} (and sometimes words w_{k+1}, ..., w_K), the language modeling task is to model P(w_k | S). Neural network language models such as those using LSTMs and GRUs have shown increasingly good performance (see Chung et al., 2014 for a comparison). Following Dhingra et al. (2017), we adopt a BiGRU model for modeling the question sentence q.
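As a concrete sketch of such a BiGRU question model, the code below (written in PyTorch purely for illustration; the report does not commit to a framework at this point) scores answer candidates at the blank position by concatenating the forward state just before the blank with the backward state just after it, in the spirit of the equations given further below. The class name, layer sizes, and the assumption that the blank is not the first or last token are mine.

import torch
import torch.nn as nn

class BiGRUQuestionModel(nn.Module):
    """Minimal BiGRU scorer for a Cloze blank: concatenate the forward GRU state
    before the blank with the backward GRU state after it, then project to
    logits over the answer vocabulary."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, question_ids, blank_index):
        # question_ids: (batch, seq_len) word ids with a placeholder at blank_index
        states, _ = self.bigru(self.embed(question_ids))   # (batch, seq_len, 2 * hidden_dim)
        hidden = states.size(-1) // 2
        fwd = states[:, blank_index - 1, :hidden]           # forward GRU output, word before the blank
        bwd = states[:, blank_index + 1, hidden:]           # backward GRU output, word after the blank
        return self.out(torch.cat([fwd, bwd], dim=-1))      # logits over the answer vocabulary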
RNN-LMs have trouble modeling long-range topical context as well as predicting rare words. We find that explicitly incorporating predictions based on context entities (e.g. tags in Stack Overflow, or Freebase entities in SPADES) is critical for the QA-with-context task, since the correct answer entity can be largely dictated by the context entities. Several approaches to incorporating long-range context in RNN-LMs have emerged and led to better language modeling performance. Following the terminology of Wang and Cho (2015), these either employ early fusion, in which a context vector is concatenated with each RNN input (Ghosh et al., 2016; Mikolov and Zweig, 2012), or late fusion, in which a context vector is used as a bias before the output nonlinearity of the RNN cell (Wang and Cho, 2015).

We employ an approach most closely related to late fusion, adding a context vector as a bias to the RNN output in logit space, prior to the softmax. Related to our approach, Arthur et al. (2016) incorporate discrete lexicons into neural translation models by using them as a bias in the output softmax, finding that this compensates where neural translation models fail at translating rare but important words. Neubig and Dyer (2016) present a framework for hybridizing neural and n-gram language models, one instantiation of which involves neural interpolation between n-gram predictions and RNN-LM predictions. Also related to our approach is TopicRNN, a generative language model that combines a neural variational topic model over past words with an RNN language model of the current sentence (Dieng et al., 2016).

For the question model f(q, ·), we embed the words of q and concatenate the forward and backward GRU outputs at the answer index:

    x = [W_1 w_1, ..., W_1 w_K]
    h = [fGRU(x)_{i-1}; bGRU(x)_{i+1}]
    log f(q, ·) = W_2 h

where the w_k are one-hot encoded and fGRU(x) and bGRU(x) are the sequential outputs of the forward and backward GRUs.

For the context model g we use simple co-occurrence counts calculated from the training set. Specifically, given context entities c = {c_1, ..., c_m}, we compute

    g(c, a) = avg_i #(a, c_i) / #(c_i).

In other words, for each context entity, we compute the empirical probability of co-occurrence with the answer entity, and then average over context entities.
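The sketch below shows one way the count-based context model g and the logit-space bias could be computed. The per-post counting granularity, the log transform, and the scalar weight are assumptions made for illustration; this excerpt does not spell out exactly how CASE combines the question model f with g, so the combination shown is just one plausible late-fusion-style choice.

from collections import Counter, defaultdict
import numpy as np

def build_cooccurrence(posts):
    """posts: iterable of (tags, entities_in_post) pairs from the training corpus.
    Returns count tables approximating #(a, c_i) and #(c_i) for the context model g."""
    pair_counts = defaultdict(Counter)   # pair_counts[tag][entity] ~ #(entity, tag)
    tag_counts = Counter()               # tag_counts[tag] ~ #(tag)
    for tags, entities in posts:
        for tag in tags:
            tag_counts[tag] += 1
            for entity in set(entities):
                pair_counts[tag][entity] += 1
    return pair_counts, tag_counts

def g(context_tags, answer, pair_counts, tag_counts):
    """g(c, a) = average over context entities c_i of #(a, c_i) / #(c_i)."""
    ratios = [pair_counts[t][answer] / tag_counts[t] for t in context_tags if tag_counts[t] > 0]
    return float(np.mean(ratios)) if ratios else 0.0

def biased_logits(lm_logits, context_tags, answer_vocab, pair_counts, tag_counts,
                  weight=1.0, eps=1e-8):
    """Late-fusion-style combination: add (log) co-occurrence scores as a bias to the
    RNN-LM logits before the softmax. The exact weighting used by CASE may differ."""
    bias = np.array([g(context_tags, a, pair_counts, tag_counts) for a in answer_vocab])
    return lm_logits + weight * np.log(bias + eps)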