Duplicate Question Identification by Integrating Framenet with Neural

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Duplicate Question Identification by Integrating FrameNet with Neural Networks Xiaodong Zhang,1 Xu Sun,1 Houfeng Wang1,2 1 MOE Key Lab of Computational Linguistics, Peking University, Beijing, 100871, China 2 Collaborative Innovation Center for Language Ability, Xuzhou, Jiangsu, 221009, China {zxdcs, xusun, wanghf}@pku.edu.cn Abstract • Q3: How can I go downtown from the airport? • Q4: How can I go downtown from the park? There are two major problems in duplicate question identification, namely lexical gap and essential constituents match- First, different people tend to use different words and ing. Previous methods either design various similarity fea- phrases to express the same meaning. For Q1 and Q2, al- tures or learn representations via neural networks, which though they are semantically equivalent, there are only a few try to solve the lexical gap but neglect the essential con- overlapped words. If only considering the surface form of stituents matching. In this paper, we focus on the essential the questions without leveraging any semantic information, constituents matching problem and use FrameNet-style se- it is hard to identify that they are duplicate. This problem mantic parsing to tackle it. Two approaches are proposed to is called lexical gap. Second, for Q3 and Q4, although most integrate FrameNet parsing with neural networks. An ensemble approach combines a traditional model with manually de- words are overlapped and there is only one different word, signed features and a neural network model. An embedding they are not duplicate. There are two essential constituents, approach converts frame parses to embeddings, which are points of departure and destination, in the two questions. The combined with word embeddings at the input of neural net- departure places are different, therefore the answers for one works. Experiments on Quora question pairs dataset demon- question are useless for the other one. It is hard for a model strate that the ensemble approach is more effective and out- to classify them as non-duplicate because their surface forms performs all baselines.1 are so similar. We refer to this problem as essential constituents matching. Distributed representation is an effective way to tackle the Introduction lexical gap problem. Researchers have designed various sim- Duplicate question identification (DQI) aims to compare two ilarity features based on word embeddings (Franco-Salvador questions and identify whether they are semantically equiv- et al. 2016), or acquired representations of questions via neu- alent or not, i.e., a binary classification problem. It is a vi- ral networks and then calculated their similarity (Santos et tal task for community question answering (CQA). With an al. 2015; Lei et al. 2016). Although much effort has been automatic DQI method, a CQA forum can merge duplicate paid to the lexical gap problem, there is little research on the questions so as to organize questions and answers more ef- essential constituents matching problem, which is also vital ficiently. Besides, by retrieving questions that are semanti- to DQI. Previous approaches are generally based on simi- cally equivalent to a question presented by a user, an auto- larity. Therefore they are unlikely to classify Q3 and Q4 as matic QA system can answer the user’s question with an- non-duplicate. The words and sentence patterns of the two swers of the retrieved questions. questions are so similar that representations learned by neu- There are two major problems in DQI, namely lexical gap ral networks are likely to be similar. There should be a way (or called semantic gap) and essential constituents matching. to model the matching of essential constituents in question Essential constituents of a question refer to constituents that pairs explicitly. The unmatched essential constituents can are important to the meaning of the question. A constituent provide strong clues for predicting question pairs as non- contains two parts, name and value. For example, for a ques- duplicate. tion asking route, there is usually a destination constituent. It is non-trivial to extract essential constituents in a ques- Four questions are listed below to explain the two prob- tion. A common way is to define constituent categories by lems. The first two questions are duplicate, and the last two experts and label some questions by annotators to get a la- are non-duplicate. beled dataset. Then, a supervised sequence labeling model can be trained on the dataset to extract essential constituents, • Q1: What is the most populous state in the USA? which is similar to named entity recognition task (Sang • Q2: Which state in the United States has the most people? and Meulder 2003). However, defining and labeling essen- Copyright c 2018, Association for the Advancement of Artificial tial constituents in open domain are impractical, consum- Intelligence (www.aaai.org). All rights reserved. ing too much time and funding. Fortunately, there is a cor- 1The code is available at https://github.com/zxdcs/DQI relation between essential constituents and semantic units 6061 in a semantic parse. Hence, we parse questions using a parent. By viewing the name and LUs of a frame as the name FrameNet (Baker, Fillmore, and Lowe 1998) parser and ap- and value of an essential constituent, a frame can be easily proximate essential constituents by frames. The essential converted to an essential constituent. This is the main reason constituents matching problem is transformed into a frame why FrameNet-style parsing is used. matching problem. In this way, manual labeling is avoided We find that the FrameNet parsing cannot cover all es- and all we need is a frame parser, which is publicly available sential constituents in questions, which is because of both on the Internet. the incomplete coverage of FrameNet and unsatisfying per- In this paper, we use FrameNet parsing for essential con- formance of the parser. A major missing is some location stituents matching and neural networks for handling lexical constituents. For example, in question “What is the best gap. Two approaches are proposed to integrate FrameNet travel website in Spain?”, the word Spain is not included parses with neural networks, namely ensemble approach and in any frame. To overcome this shortcoming, named enti- embedding approach. In the ensemble approach, two mod- ties in questions are recognized by an automatic recognizer els are trained separately and their outputs are combined. For and they are fitted into the FrameNet structure. Specifically, FrameNet parses, two kinds of features on the word and the the word Spain is recognized as a geo-political entity (GPE). frame level are designed to measure the matching degrees Hence, a frame with GPE as the name and Spain as the LU of essential constituents. A gradient boosting decision tree is constructed. (GBDT) (Friedman 2001) classifier is trained on these features. For neural networks, any kinds of neural networks that take two sentences as input can be used. In the embedding Neural Networks Model approach, a unified model is proposed. Each kind of frame is assigned to an embedding. The frame embeddings are con- Neural networks models (NNMs) with word embeddings as catenated with word embeddings at the input of neural net- input are ideal models to handle the lexical gap problem. works. Consequently, the representations learned by neural Because our focus is how to leverage frame parsing rather networks can include essential constituents information. than proposing a novel NNM, we directly use some off-the- shelf models that perform well on similar tasks. In exper- Frame Parsing iments, many different kinds of neural networks are tried. The FrameNet project (Baker, Fillmore, and Lowe 1998) is a Here we only introduce a basic one as an example, that is semantic database of English, which contains about 200,000 a Siamese network (Bromley et al. 1993) consisting of two manually annotated sentences linked to more than 1,200 se- bidirectional long short-term memory (BLSTM) networks. mantic frames. It is based on a theory of meaning called The structure is shown in Figure 1. Frame Semantics (Fillmore 1976). The basic idea is that the meaning of most words can be understood on the basis of se- y mantic frames, which are represented by three major compo- nents: frame, frame elements (FEs) and lexical units (LUs). Table 1 lists the parse of Q3 using a FrameNet parser MLP called SEMAFOR (Kshirsagar et al. 2015). The first column lists all words in the question. In the rest columns, each col- J umn represents a frame. The word that triggers a frame is marked by bold text, and other items represent FEs of the frame. The question contains three frames, including capa- H1 H2 bility, motion and buildings. Take the motion frame as an example, it is evoked by LU go and contains three FEs, i.e., theme, goal and source. The FE goal is filled by LU downtown. Capability Motion Buildings E1 E2 How Entity Q1 Q2 can Capability … the most populous… … has the most people… I Event Theme go Motion Figure 1: The structure of the BLSTM model. downtown Goal from At first, a question Q =[w1,w2, ..., wn] is mapped the Source to an embedding matrix E via lookup table, i.e. E = airport Buildings [e(w1),e(w2), ..., e(wn)], where e(wt) is the word embed- ? ding of wt. Then a BLSTM (Hochreiter and Schmidhuber 1997; Graves 2012) is employed to learn contextual repre- Table 1: FrameNet-style parsing of a question. sentations of the embeddings and these representations are reduced to a fixed-length representation H. The gates, cell The resemblance of frame and essential constituent is ap- and output of LSTM are calculated as follows: 6062 For example, if fi is Capability in Table 1, ri is {How, can, I}.

Duplicate Question Identification by Integrating Framenet with Neural

WELDA: Enhancing Topic Models by Incorporating Local Word Context

Generative Topic Embedding: a Continuous Representation of Documents

Linked Data Triples Enhance Document Relevance Classification

Document and Topic Models: Plsa

Improving Distributed Word Representation and Topic Model by Word-Topic Mixture Model

Topic Modeling: Beyond Bag-Of-Words

Semantic N-Gram Topic Modeling

Modeling Topical Relevance for Multi-Turn Dialogue Generation

Text Analysis of QC/Issue Tracker by Natural Language Processing (NLP)

Topic Modeling Approaches for Understanding COVID-19

Distributional Approaches to Semantic Analysis

Topic Modeling for Natural Language Understanding