The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Knowledge and Cross-Pair Pattern Guided Semantic Matching for Question Answering

Zihan Xu,1,2 Hai-Tao Zheng,1,2∗ Shaopeng Zhai,3 Dong Wang1,2
1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Department of Computer Science and Technology, Tsinghua University, Beijing, China
3School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University
{xu-zh17, wangd18}@mails.tsinghua.edu.cn, [email protected], [email protected]
∗Corresponding Author

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Semantic matching is a basic problem in natural language processing, but it is far from solved because of the differences between the pairs for matching. In question answering (QA), answer selection (AS) is a popular semantic matching task, usually reformulated as a paraphrase identification (PI) problem. However, QA is different from PI because the question and the answer are not synonymous sentences and not strictly comparable. In this work, a novel knowledge and cross-pair pattern guided semantic matching system (KCG) is proposed, which considers both knowledge and pattern conditions for QA. We apply explicit cross-pair matching based on Graph Convolutional Network (GCN) to help KCG recognize general domain-independent Q-to-A patterns better. And with the incorporation of domain-specific information from knowledge bases (KB), KCG is able to capture and explore various relations within Q-A pairs. Experiments show that KCG is robust against the diversity of Q-A pairs and outperforms the state-of-the-art systems on different answer selection tasks.

Introduction

Semantic matching is a basic problem in natural language processing. Many tasks are essentially a semantic matching problem, such as information retrieval (IR), question answering (QA) and paraphrase identification (PI) (Li, Xu, et al. 2014). In QA, the matching of the question with the most proper answer from a set of candidates, which is known as answer selection (AS), remains challenging due to the diversity of Q-A pairs.

Usually, AS is reformulated as a PI problem. Methods can be divided into three categories based on the general model structures: Siamese networks (Feng et al. 2015; Yang, Yih, and Meek 2015), attentive networks (Santos et al. 2016; Yin et al. 2016) and Compare-Aggregate networks (Wang and Jiang 2017; Bian et al. 2017). However, these methods are all based on the comparing framework. It appears that by optimizing the likelihood of two text sequences being a matched pair based on their similarity, neural models assign high probability to those with the same words, phrases or other patterns. For example, considering Q1 in Figure 1, it is difficult to pick out the true answer based on Q-A similarity, since the comparing framework is likely to be misled by words with identical attributes (marked by color).

In fact, QA is different from PI because the question and the answer are not synonymous sentences and not strictly comparable. Instead of being semantically equivalent, Q-A pairs usually form a continuity in meaning (and generally fall into certain Q-to-A patterns). Therefore, the basic semantic matching strategy for answer selection has room for improvement. In this work, we study the answer selection problem based on fundamental characteristics of QA itself. We observe that a sentence will be considered as a proper answer only if it meets two basic conditions. First, the information in the sentence must be relevant to that in the question (knowledge condition). Second, the structure of the sentence should correspond to the question structure (pattern condition). The knowledge condition ensures that the sentence is "telling the truth", while the pattern condition checks whether the sentence is "responding to" the question.

There is growing interest in the study of the knowledge condition. Some recent work leverages Wikipedia (Chen et al. 2017), knowledge bases (Shen et al. 2018) or other external resources to provide background information for QA. However, content correlation with the question is not sufficient for an answer. As can be seen from Figure 1, the listed candidate answers to Q1 are more reliable than others because they are all statements of fact, and are all about the entity "8 track" in the question. However, among these candidates, A1-1 and A1-2 are giving irrelevant answers, but are also likely to be assigned with high probability, unless extra semantic parsing or information extraction methods (entity linking and relation detection) are conducted as in KBQA (Yao and Van Durme 2014).

However, if the pattern condition is followed when designing AS models, irrelevant answers will be avoided with few auxiliary tasks. As in the example, Q1 and Q2 both contain "what year was ... invented". Meanwhile, A1-3 shares the same pattern "... was created in (year) ..." with A2-1. These pairs are in a common Q-to-A pattern. If the pattern is explicitly learned to guide the answer selection process of Q1, the correct answer will be picked out more easily.

Figure 1: QA pairs from the WikiQA corpus. Intra-pair and cross-pair matches are denoted by colors and underlines respectively.

Taking both the pattern and knowledge conditions into consideration, we propose a novel knowledge and cross-pair pattern guided system (KCG) for answer selection. First, the idea of cross-pair similarity (Zanzotto and Moschitti 2006) is applied to learning general Q-to-A patterns. To be specific, for each candidate answer A of the question Q, in addition to the common intra-pair comparison between Q and A, matching is conducted between this pair P = (Q, A) and other Q-A pairs. Therefore, global information is incorporated into each single matching pair.

To explicitly model cross-pair dependencies, we regard Q-A pairs as nodes and build a graph around the idea that similar Q-A pairs are close to each other, and turn AS into a node classification problem based on Graph Convolutional Networks (GCN). GCN has been demonstrated as one of the most effective approaches for semi-supervised learning (Kipf and Welling 2017) because of its ability to exploit connectivity patterns between labeled and unlabeled data. Therefore, we find it a good fit for capturing global correlations and learning cross-pair patterns. With the correlation matrix which guides information propagation among nodes, the classification process retains semantic structures in the embedding space, where related concepts are neighbors.

In order to meet the knowledge condition, multi-view attention is utilized to capture interactive features within the Q-A pair (intra-pair matching part). To be specific, we adopt both textual attention from words and knowledge-based attention from entities to enhance the representation learning of the Q-A pair with the Compare-Aggregate network. Therefore, the model implements comparison on word, sentence and knowledge levels, thus learning more comprehensive intra-pair information.

Our main contributions include:

• We propose a universal semantic matching strategy for question answering. Different from the traditional comparing framework, we apply both intra-pair and cross-pair matching, thus enabling our system to learn not only multi-view information between the question and the answer, but also global Q-to-A pattern information.

• In order to learn cross-pair patterns, we propose the Q-A pair graph, and conduct node classification with GCN to capture global correlations. To the best of our knowledge, this is the first study to model the QA corpus as a graph to perform a GCN-based post-procedure, which may expand the application of graph neural networks on textual data.

• The proposed system considers both knowledge and pattern conditions for QA, and outperforms the state-of-the-art results on different answer selection tasks.

Related Work

Deep Semantic Matching Semantic matching is usually solved with the score of semantic recall (similarity computation) based on the comparing framework. Deep semantic matching starts with Siamese networks (Feng et al. 2015; Yang, Yih, and Meek 2015). These models use the same structure to encode the semantic sequences separately for matching. Then more interaction between sequences has been introduced by soft-attention (Santos et al. 2016; Yin et al. 2016). Further, some interaction-based networks (Hu et al. 2014; Pang et al. 2016; Wang and Jiang 2017) are proposed, most of which conduct the matching process before further representation learning.

However, these methods are all based on the intra-pair comparing framework. Cross-pair similarity was proposed by (Zanzotto and Moschitti 2006) in the textual entailment task. They devised a tree kernel based on cross-pair similarity for Support Vector Machines (SVM). Recently, (Tymoshenko and Moschitti 2018) combined the tree kernels with word-based kernels for AS. In this paper, we introduce GCN to model cross-pair dependencies, since GCN is naturally good at exploiting connectivity patterns through incorporating neighborhood information.

Application of GCN on NLP GCN is a simplified graph neural network (GNN), first introduced by (Kipf and Welling 2017) to perform semi-supervised classification. In NLP, GCN is mainly explored in tasks such as semantic role labeling (Marcheggiani and Titov 2017), machine translation (Bastings et al. 2017) and relation classification (Li, Jin, and Luo 2018) to encode syntactic structures. (Lai et al. 2019) introduce word lattices (directed graphs) into Chinese QA, but their basic model relies on Siamese intra-pair matching. Besides, the above applications usually require that the data itself exhibit a natural graph structure, and mainly build the graphs inside sentences.

(Yao, Mao, and Luo 2019) first model a whole corpus as a graph where documents and words are regarded as nodes. However, the graph is built on traditional features like word co-occurrence, which may ignore word orders useful for text classification. Our graph, based on sentence pair representations and correlations, is easy to build and effective to model cross-pair dependencies. With this design and high-quality node embeddings, the application of GCN on textual data without pre-defined graph structures can be extended.

Figure 2: The proposed knowledge and cross-pair pattern guided semantic matching system (KCG).

Knowledge and Cross-Pair Pattern Guided Semantic Matching

In this part, we elaborate on KCG for semantic matching in question answering, as shown in Figure 2. The cross-pair and intra-pair parts are trained independently, and final decisions are simply made based on a weighted sum of the predictions to reduce dependency on parameters.

Cross-Pair Learning

Graph Convolutional Networks The essential idea of GCN is to update node representations by propagating information among nodes. Formally, for a graph $G = (V, E)$, $V$ ($|V| = n$) and $E$ are sets of nodes and edges respectively. Every node is assumed to be connected to itself, i.e., $(v, v) \in E$ for any $v$. $X \in \mathbb{R}^{n \times m}$ is the feature matrix containing the features of all nodes, where $m$ is the dimension of feature vectors. $A$ is the adjacency matrix of $G$ and $D$ is the degree matrix, where $D_{ii} = \sum_j A_{ij}$. The layerwise propagation rule is defined as:

$$Z^{(j+1)} = \rho(\tilde{A} Z^{(j)} W^{(j)}), \quad (1)$$

where $j$ denotes the layer number and $Z^{(0)} = X$. $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix and $W^{(j)}$ is a trainable weight matrix. $\rho$ is an activation function, e.g., ReLU. Higher order neighborhood information can be introduced by stacking multiple GCN layers.

GCN-Based Cross-Pair Learning For graph-based learning, the key challenge is to exploit graph structures and data features to improve learning performance. In fields where GCN is widely used (e.g., social networks, citation networks or knowledge graphs), the data usually exhibits a natural graph structure. However, there is no pre-defined graph structure in textual QA. Thus, the building of the graph is a crucial problem. In order to model cross-pair dependencies, we build the graph over Q-A pairs (Figure 3), where each node is represented by the sentence pair embedding. The correlation matrix $P$ is computed based on cosine similarity between embedding vectors. We binarize the matrix by a threshold $\tau$ and get:

$$A_{ij} = \begin{cases} 0, & \text{if } P_{ij} < \tau \\ 1, & \text{if } P_{ij} \geq \tau \end{cases}, \quad (2)$$

where $A$ is the binary correlation matrix.

The built graph is fed into GCN, and the output of the penultimate layer is passed to a softmax layer:

$$Z = \mathrm{softmax}(\tilde{A} X W), \quad (3)$$

The loss function is defined as the cross-entropy error over all labeled Q-A pairs:

$$\mathcal{L} = -\sum_{p \in \mathcal{Y}_P} \sum_{f=1}^{F} Y_{pf} \ln Z_{pf}, \quad (4)$$

where $\mathcal{Y}_P$ is the set of indices of labeled Q-A pairs, $Y$ is the label indicator matrix, and $F$ is the dimension of the output, which is equal to the number of classes.
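To make Eqs. (1)-(4) concrete, the following sketch builds the binarized correlation graph from Q-A pair embeddings and runs a two-layer GCN forward pass with a masked cross-entropy loss. This is a minimal NumPy sketch written for this article, not the authors' released code; the toy feature matrix, threshold and weight shapes are illustrative assumptions.

```python
import numpy as np

def build_graph(X, tau):
    """Binary correlation matrix A from Q-A pair embeddings X (Eq. 2)."""
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    P = normed @ normed.T                    # pairwise cosine similarity
    A = (P >= tau).astype(np.float64)        # binarize by threshold tau
    np.fill_diagonal(A, 1.0)                 # every node connects to itself
    return A

def normalize_adj(A):
    """Symmetric normalization: A_tilde = D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def softmax(H):
    e = np.exp(H - H.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A_tilde, X, W0, W1):
    """Two propagation steps (Eq. 1) with a softmax output layer (Eq. 3)."""
    Z1 = np.maximum(0.0, A_tilde @ X @ W0)   # Z^{(1)} = ReLU(A_tilde X W^{(0)})
    return softmax(A_tilde @ Z1 @ W1)

def masked_cross_entropy(Z, Y, labeled):
    """Cross-entropy over labeled Q-A pairs only (Eq. 4)."""
    return -np.sum(Y[labeled] * np.log(Z[labeled] + 1e-12))

# toy usage: 6 Q-A pair nodes, 8-dim features, 2 classes (pair / non-pair)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
A_tilde = normalize_adj(build_graph(X, tau=0.2))  # low tau so the toy graph has edges
Z = gcn_forward(A_tilde, X, rng.normal(size=(8, 16)), rng.normal(size=(16, 2)))
loss = masked_cross_entropy(Z, np.eye(2)[[0, 1, 0, 1, 0, 1]], labeled=[0, 1, 2, 3])
```

In the paper, node features are sentence-pair embeddings of a whole QA corpus and validation/test labels are masked during training; here both are replaced by random toy data purely to show the data flow.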


Figure 3: Cross-pair learning with the Q-A pair graph and GCN. Colors denote different classes.

In order to increase flexibility and label efficiency, we apply an Auto-Regressive (AR) filter to get an improved GCN (IGCN) (Li et al. 2019), which replaces $\tilde{A}$ with $\hat{A}$:

$$Z^{(1)} = \rho(\hat{A} X W^{(0)}), \quad (5)$$

where $\hat{A} = p_{ar}(L) = (I + \alpha L)^{-1}$ is the AR filter and is approximated with a polynomial expansion:

$$(I + \alpha L)^{-1} = \frac{1}{1+\alpha} \sum_{i=0}^{\infty} \left[\frac{\alpha}{1+\alpha} \tilde{A}\right]^{i}, \quad (\alpha > 0), \quad (6)$$

where $L = D - A$ is the graph Laplacian. $\hat{A} X = \frac{1}{1+\alpha} T^{(k)}$ is computed iteratively with:

$$T^{(0)} = O,\; T^{(1)} = X,\; \ldots,\; T^{(i+1)} = X + \frac{\alpha}{1+\alpha} \tilde{A} T^{(i)}, \quad (7)$$

and $k = 4\alpha$ is enough according to (Li et al. 2019).

IGCN can achieve label efficiency by using the exponent $k$ to conveniently adjust the filter strength. In this way, it can maintain a shallow structure with a reasonable number of trainable parameters to avoid overfitting.

Intra-Pair Learning

The intra-pair matching part is under the Compare-Aggregate framework, where vector representations of small units (such as words) of sentences are compared to capture interactive features, and then aggregated to calculate the final relevance score. In order to learn more comprehensive intra-pair information, we compute both textual attention $E^c$ and knowledge-based attention $E^e$ for the model:

$$E^{c}_{ij} = \mathrm{score}(Q^{c}_{i}, A^{c}_{j}) = Q^{c}_{i} \cdot A^{c}_{j}, \qquad E^{e}_{ij} = \mathrm{score}(Q^{e}_{i}, A^{e}_{j}) = \tanh(Q^{e}_{i} U A^{e}_{j}), \quad (8)$$

where $Q^c$ and $A^c$ are pre-trained word embedding matrices of the question and the answer respectively, and $Q^e$ and $A^e$ are knowledge embedding matrices based on entity linking results and knowledge graph (KG) entity embeddings. $U$ is a parameter matrix to be learned.

Then with the softmax computation along the dimension $j$, we get $E^{c,q}_{ij}$ and $E^{e,q}_{ij}$. Similarly, $E^{c,a}_{ij}$ and $E^{e,a}_{ij}$ are computed along $i$. We merge textual and knowledge-based attention, and obtain the final attention vectors:

$$E^{q}_{ij} = \frac{\exp(E^{c,q}_{ij} + E^{e,q}_{ij})}{\sum_{k=1}^{l_a} \exp(E^{c,q}_{ik} + E^{e,q}_{ik})}, \qquad E^{a}_{ij} = \frac{\exp(E^{c,a}_{ij} + E^{e,a}_{ij})}{\sum_{k=1}^{l_q} \exp(E^{c,a}_{kj} + E^{e,a}_{kj})}. \quad (9)$$

Attention-weighted sums are computed as $H^{c,q}_{i} = \sum_{j=1}^{l_a} E^{q}_{ij} A^{c}_{j}$ and $H^{c,a}_{j} = \sum_{i=1}^{l_q} E^{a}_{ij} Q^{c}_{i}$ for textual representations. Analogously, knowledge-based representations are $H^{e,q}_{i} = \sum_{j=1}^{l_a} E^{q}_{ij} A^{e}_{j}$ and $H^{e,a}_{j} = \sum_{i=1}^{l_q} E^{a}_{ij} Q^{e}_{i}$.

Then unit-level comparisons match each unit of one sequence with a weighted version of its counterpart:

$$T^{c,q}_{i} = \mathrm{CMP}(Q^{c}_{i}, H^{c,q}_{i}) = Q^{c}_{i} \otimes H^{c,q}_{i}, \qquad T^{c,a}_{j} = \mathrm{CMP}(A^{c}_{j}, H^{c,a}_{j}) = A^{c}_{j} \otimes H^{c,a}_{j}, \quad (10)$$

where $\otimes$ is the element-wise multiplication. Analogously for knowledge-based representations, we get $T^{e,q}_{i}$ and $T^{e,a}_{j}$.

Further, the aggregation process is conducted based on CNN as suggested in (Bian et al. 2017):

$$R^{c,q} = \mathrm{AGG}([T^{c,q}_{1}, \ldots, T^{c,q}_{l_q}]) = \mathrm{CNN}([T^{c,q}_{1}, \ldots, T^{c,q}_{l_q}]), \qquad R^{c,a} = \mathrm{AGG}([T^{c,a}_{1}, \ldots, T^{c,a}_{l_a}]) = \mathrm{CNN}([T^{c,a}_{1}, \ldots, T^{c,a}_{l_a}]). \quad (11)$$

Similarly, we get final knowledge-based representations $R^{e,q}$ and $R^{e,a}$. Then outputs of the Aggregation Layer are concatenated to predict the probability that $Q$ and $A$ form a pair by a fully connected layer with sigmoid.
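As a worked illustration of the intra-pair flow in Eqs. (8)-(11), the sketch below computes textual and knowledge-based attention, merges them, forms the attention-weighted counterparts, and compares element-wise. It is a simplified NumPy sketch: the CNN aggregation of Eq. (11) is replaced by mean pooling, and entity embeddings are assumed to be token-aligned with the word embeddings, so treat it as illustrative rather than the paper's exact layers.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_pair_features(Qc, Ac, Qe, Ae, U):
    """Qc (lq,d) / Ac (la,d): word embeddings; Qe (lq,de) / Ae (la,de): entity
    embeddings; U (de,de): learned bilinear weights."""
    Ec = Qc @ Ac.T                         # textual attention scores (Eq. 8)
    Ee = np.tanh(Qe @ U @ Ae.T)            # knowledge-based attention scores (Eq. 8)
    # normalize each view along j (answer units) / i (question units), then merge (Eq. 9)
    Eq = softmax(softmax(Ec, 1) + softmax(Ee, 1), axis=1)
    Ea = softmax(softmax(Ec, 0) + softmax(Ee, 0), axis=0)
    # attention-weighted counterparts
    Hcq, Heq = Eq @ Ac, Eq @ Ae            # answer summaries for each question unit
    Hca, Hea = Ea.T @ Qc, Ea.T @ Qe        # question summaries for each answer unit
    # unit-level element-wise comparison (Eq. 10)
    Tcq, Tca = Qc * Hcq, Ac * Hca
    Teq, Tea = Qe * Heq, Ae * Hea
    # Eq. (11) aggregates each T sequence with a CNN; mean pooling stands in here
    return np.concatenate([T.mean(axis=0) for T in (Tcq, Tca, Teq, Tea)])

# toy usage: a 5-word question and a 9-word answer with 4-dim embeddings
rng = np.random.default_rng(1)
feat = intra_pair_features(rng.normal(size=(5, 4)), rng.normal(size=(9, 4)),
                           rng.normal(size=(5, 4)), rng.normal(size=(9, 4)),
                           rng.normal(size=(4, 4)))
```

The concatenated vector `feat` plays the role of the Aggregation Layer output that the paper feeds into a fully connected layer with sigmoid.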

Experiments

Experimental Settings

Datasets We evaluate our model on two widely adopted QA benchmark datasets: WikiQA (Yang, Yih, and Meek 2015) and TrecQA (Wang, Smith, and Mitamura 2007). WikiQA is an open domain factoid answer selection benchmark. We adopt the standard setup (Yang, Yih, and Meek 2015) of only considering questions with correct answers for evaluation. TrecQA has clean and raw versions. The clean version removes questions that have only positive/negative answers or no answers. We evaluate on the clean version as noted by (Rao, He, and Lin 2016). The details of these two datasets are shown in Table 1. Evaluation measures are mean average precision (MAP) and mean reciprocal rank (MRR).

Table 1: Summary statistics of datasets.

Dataset | Type | Questions | QA Pairs | %Correct | Nodes | Edges
WikiQA | Train | 873 | 8672 | 12.0 | 12.2K | 21.9M
WikiQA | Dev | 126 | 1130 | 12.4 | |
WikiQA | Test | 243 | 2351 | 12.5 | |
TrecQA (original/cleaned) | Train | 1229/1160 | 53417/53313 | 12.0/11.8 | 12.5K | 40.3M
TrecQA (original/cleaned) | Dev | 82/65 | 1148/1117 | 19.3/18.4 | |
TrecQA (original/cleaned) | Test | 100/68 | 1517/1442 | 18.7/17.2 | |

Common Training Setup For intra-pair learning, we use pre-trained GloVe embeddings (Pennington, Socher, and Manning 2014) for text and TransE (Bordes et al. 2013) embeddings for entities, with a subset of Freebase (Bollacker et al. 2008), FB5M (4,904,397 entities, 7,523 relations and 22,441,880 facts), as the KG following (Shen et al. 2018). We adopt listwise learning and use KL-divergence as the loss function as suggested in (Bian et al. 2017).

For the cross-pair part, graph node features are BERT (Devlin et al. 2019) embeddings of Q-A pairs, and the threshold τ is tuned to balance between quantity and quality of edges. A graph is built on a whole QA corpus (under-sampled on TrecQA) to capture cross-pair patterns (summarized in Table 1), with labels of the validation and testing sets masked following (Yao, Mao, and Luo 2019). Since negative answers are less useful for learning explicit Q-A patterns, and may introduce noise during propagation (if not randomly selected, typical wrong Q-A patterns may also help), we also apply sample masks on training data to tune the negative sampling rate r. We set the filter parameter α = 10 for WikiQA and α = 1 for TrecQA according to label rates (Li et al. 2019). GCN applies the first-order convolutional filter to integrate graph and feature information. For stacked GCN, the hidden layer size is set to 64. Adam (Kingma and Ba 2014) is adopted for training and the model with the lowest training loss in 400 steps is selected (Li et al. 2019). Other hyperparameters are shown in Table 2.

Table 2: Hyperparameters.

Name | Definition | Intra-Pair | Cross-Pair
λ | Learning rate | 0.001 | 0.01
p | Dropout rate | 0.2 | 0.5
L2 | L2 normalization | 0 | 0.0005
m | Batch size | 4 | 1
w | Conv. size | [1,2,3,4,5] | 1
h | Hidden layer size | 300 | (64)
τ | Edge threshold | - | 0.95
r | Neg. rate | - | 1:1

Results and Analysis

Comparison with the State of the Art Experimental results are summarized in Table 3 and nine baselines are adopted. Among them, CNN-Cnt (Yang, Yih, and Meek 2015) and HyperQA (Tay, Tuan, and Hui 2018a) are under the Siamese framework; AP-CNN (Santos et al. 2016), IWAN (Shen, Yang, and Deng 2017) and MCAN-FM (Tay, Tuan, and Hui 2018b) are attentive networks; KABLSTM (Shen et al. 2018) is a knowledge-aware attentive network; BiMPM (Wang, Hamza, and Florian 2017) and DCA (Bian et al. 2017) are built on the Compare-Aggregate architecture. SUM (Tymoshenko and Moschitti 2018) also applies both intra-pair and cross-pair learning, but in a more traditional way: it computes cross-pair relations with scalar products and ensembles different kernel-based SVM classifiers. For KCG, we implement it by using learned knowledge representations for aggregation (KCGeca), or only applying knowledge to the Attention Layer and Comparison Layer (KCGec) to enhance Q-A pair representations.

We observe that KCG demonstrates significant gains over the baselines based on the intra-pair comparing frameworks, and outperforms the state-of-the-art systems on both WikiQA and TrecQA. The improvement on WikiQA is much more obvious than on TrecQA. TrecQA includes editor-generated questions and candidate answer sentences selected by word matching (Wang, Smith, and Mitamura 2007), while WikiQA is constructed in a natural and realistic manner based on query logs and Wikipedia pages (Yang, Yih, and Meek 2015). Therefore, WikiQA is more lexically diverse and closer to real-world scenarios.

Ablation Study In order to analyze the effectiveness of different factors, we also report ablation tests in terms of discarding the cross-pair matching part (w/o IGCN) and knowledge graph information (w/o KG) respectively. The bottom of Table 3 shows ablation results on KCGec.

First, leaving out cross-pair matching (w/o IGCN) impacts the performance, and the drop is more significant on WikiQA (0.784 to 0.768 for MAP). It suggests that cross-pair matching helps to improve the performance on complex cases where intra-pair models may be insufficient. In fact, (Yih et al. 2013) found that simple word matching outperforms many sophisticated approaches on TrecQA. Therefore, intra-pair matching alone performs well on that dataset.

Second, note that KG is also a main contributor to the performance, which indicates the importance of background information for QA tasks. Knowledge-aware models enrich the representation learning of Q-A pairs with external knowledge. Aggregating learned knowledge representations with sentence representations in the end (KCGeca), however, does not ensure further improvement. This may depend on the entity distributions in specific datasets.

Third, we have the hypothesis that KCG manages to meet both knowledge and pattern conditions, thus handling the AS problem based on the nature of QA. Experiments suggest that KCG substantially outperforms some well-designed models under PI comparing frameworks, and models only focusing on one condition (Shen et al. 2018; Tymoshenko and Moschitti 2018).

Comparison between Different Graph-Based Methods We also conduct experiments to compare the effects of different graph-based methods for cross-pair learning. Results are presented in Table 4. The classic label propagation (Zhou et al. 2004) contributes little to the raw model. This method only makes predictions based on the graph structure, which is inadequate without representation learning of Q-A pairs. The model with GCN achieves better results because of the first-order convolutional filter which integrates graph and feature information. However, GCN usually needs stacked layers to increase smoothness, and thus it is difficult to train with fewer labels due to high model complexity.

Table 3: Results on WikiQA and TrecQA datasets.

Framework | Method | WikiQA MAP | WikiQA MRR | TrecQA MAP | TrecQA MRR
Siamese | CNN-Cnt (Yang, Yih, and Meek 2015) | 0.652 | 0.665 | 0.695 | 0.763
Siamese | HyperQA (Tay, Tuan, and Hui 2018a) | 0.712 | 0.727 | 0.784 | 0.865
Attentive | AP-CNN (Santos et al. 2016) | 0.689 | 0.696 | 0.753 | 0.851
Attentive | KABLSTM (Shen et al. 2018) | 0.732 | 0.749 | 0.804 | 0.885
Attentive | IWAN (Shen, Yang, and Deng 2017) | 0.733 | 0.750 | 0.822 | 0.889
Attentive | MCAN-FM (Tay, Tuan, and Hui 2018b) | - | - | 0.838 | 0.904
Compare-Aggregate | BiMPM (Wang, Hamza, and Florian 2017) | 0.718 | 0.731 | 0.802 | 0.875
Compare-Aggregate | DCA (Bian et al. 2017) | 0.754 | 0.764 | 0.821 | 0.899
Compare-Aggregate | SUM (Tymoshenko and Moschitti 2018) | 0.762 | 0.776 | 0.777 | 0.869
Intra-Cross | KCGeca | 0.772 | 0.786 | 0.857 | 0.908
Intra-Cross | KCGec | 0.784 | 0.802 | 0.841 | 0.902
Intra-Cross | w/o IGCN | 0.768 | 0.782 | 0.832 | 0.900
Intra-Cross | w/o KG | 0.763 | 0.778 | 0.828 | 0.889

Table 4: Results of replacing the cross-pair learning part by different graph-based methods on WikiQA.

Cross-Pair Part | Layers | MAP | MRR
- | - | 0.768 | 0.782
LP (Zhou et al. 2004) | 1 | 0.770 | 0.785
LP (Zhou et al. 2004) | 2 | 0.771 | 0.785
GCN (Kipf and Welling 2017) | 1 | 0.774 | 0.787
GCN (Kipf and Welling 2017) | 2 | 0.773 | 0.788
IGCN (Li et al. 2019) | 1 | 0.784 | 0.802
IGCN (Li et al. 2019) | 2 | 0.774 | 0.787

Figure 4: Test MAP results on WikiQA by varying training data proportions on ABCNN and IGCN.

Table 5: The facilitation of cross-pair learning to various matching or binary classification models on WikiQA.

Intra-Pair | Cross-Pair | MAP | MRR
TextCNN (Kim 2014) | - | 0.528 | 0.542
TextCNN | IGCN | 0.664 | 0.683
Transformer (Vaswani et al. 2017) | - | 0.641 | 0.644
Transformer | IGCN | 0.668 | 0.681
Dilated CNN (Yu and Koltun 2016) | - | 0.658 | 0.659
Dilated CNN | IGCN | 0.677 | 0.690
ABCNN (Yin et al. 2016) | - | 0.692 | 0.713
ABCNN | IGCN | 0.721 | 0.742
ABCNN | Transformer | 0.693 | 0.714

KCG with IGCN achieves the best performance. IGCN improves GCN with low-pass graph convolutional filters to generate smooth and representative features for subsequent classification. Through flexibly adjusting the filter strength, it can significantly reduce trainable parameters and effectively prevent overfitting.

Note that two-layer models do not show better performance. Two layers allow exchange of information among nodes that are at maximum two steps away. However, noise can be introduced from some randomly selected negative samples. Constraints may be put on the negative sampling of answers to further improve the performance, which we leave for future work.

Additional Analysis on GCN-Based Post-Procedure To further analyze the effectiveness and potential of the GCN-based post-procedure, we reimplement several classic intra-pair matching or binary classification models with the procedure on WikiQA, as shown in Table 5. Generally, applying GCN-based cross-pair learning brings a performance boost. In particular, the typical attentive network ABCNN (Yin et al. 2016) achieves competitive results with some strong baselines (Table 3). The attentive pooling mechanism in ABCNN considers local relations within pairs, but it does not cover global correlations as in cross-pair learning.

Considering that deep neural networks themselves may implicitly capture cross-pair similarity during training, we replace the cross-pair part with a classification model based on Transformer (Vaswani et al. 2017) (the last line). However, the change leads to a drop in the results, which further demonstrates the effectiveness of the immediate cross-pair modeling approach. The results also suggest that model integration is not the key factor of good performance here.

Further, in order to evaluate the label efficiency of the procedure, we test it alone with different proportions of training data. Figure 4 compares IGCN cross-pair learning with ABCNN on 6%, 12%, 18% and 24% of the WikiQA training set. Note that IGCN can achieve better MAP with limited training data, which is similar to the result in (Kipf and Welling 2017), where GCN performs well with a low label rate. The results again suggest that our graph preserves global Q-to-A pattern information, and GCN can make better use of the corpus through propagating information among nodes.

The GCN-based post-procedure incorporates global information and brings progress over different basic models. In the full-batch setup, the adjacency matrix in the sparse form has a linear relationship with non-zero elements, and the feature matrix grows linearly with the dataset size. To expand to large datasets, graph partitioning (Abbas et al. 2018) is usually adopted. Simple trade-off methods like adjusting the edge threshold and using fewer training samples are also effective due to the label efficiency of the method.
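For readers who want to trace the IGCN filtering of Eqs. (5)-(7) end to end, the sketch below implements the iterative approximation of the auto-regressive filter and a single filtered classification layer. It is a NumPy sketch under the same assumptions as the earlier graph example (a precomputed normalized adjacency `A_tilde` and node features `X`); it is not the authors' implementation.

```python
import numpy as np

def ar_filter(A_tilde, X, alpha):
    """Approximate (I + alpha * L)^{-1} X with the recurrence of Eq. (7):
    T(1) = X, T(i+1) = X + alpha/(1+alpha) * A_tilde @ T(i), so that
    A_hat @ X = T(k) / (1 + alpha), with k = 4 * alpha (Li et al. 2019)."""
    k = max(int(4 * alpha), 1)
    T = X.copy()                                # T^{(1)}
    for _ in range(k - 1):
        T = X + (alpha / (1.0 + alpha)) * (A_tilde @ T)
    return T / (1.0 + alpha)

def igcn_predict(A_tilde, X, W, alpha):
    """One shallow IGCN layer: filtered features, a linear map (Eq. 5), softmax out."""
    H = ar_filter(A_tilde, X, alpha) @ W
    e = np.exp(H - H.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# toy usage with the filter strength used for WikiQA in the paper (alpha = 10)
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 8))
A_tilde = np.eye(6)                             # stand-in for a real normalized graph
probs = igcn_predict(A_tilde, X, rng.normal(size=(8, 2)), alpha=10.0)
```

Raising or lowering `alpha` (and hence `k`) strengthens or weakens the smoothing, which is how the paper trades filter strength against label rate without deepening the network.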

Table 6: Examples of answer selection results.

ID | Question | KCG | DCA
1 | What is the color puce? | Puce (often misspelled as "puse", "peuse" or "peuce") is defined in the United States as a brownish-purple color. | The colors in the boxes at right are two of the various shades and varieties of puce.
2 | Who set the world record for women for high jump? | Stefka Kostadinova (Bulgaria) has held the women's world record since 1987, also the longest-held record in the event. | The high jump is a track and field athletics event in which competitors must jump over a horizontal bar placed at measured heights...
3 | How many numbers are on a credit card? | An ISO/IEC 7812 card number is typically 16 digits in length. | Bank card numbers are allocated in accordance with ISO/IEC 7812.
4 | What is the formula for calcium nitrate? | Calcium nitrate, also called Norgessalpeter (Norwegian saltpeter), is an inorganic compound with the formula Ca(NO3)2. | Nitrocalcite is the name for a mineral which is a hydrated calcium nitrate that forms as an efflorescence...
5 | Who are the members of the climax blues band? | The original members were guitarist/vocalist Peter Haycock, guitarist Derek Holt, keyboardist Arthur Wood, bassist Richard Jones... | The Climax Blues Band (originally known as the Climax Chicago Blues Band) were formed in Stafford, England in 1968.

Case Study Considering the variety of Q-A pairs, some top-ranked answers are demonstrated in Table 6 for further analysis. We reimplement DCA (Bian et al. 2017), a baseline model with the Compare-Aggregate architecture, to support comparisons in the case study. We use bold fonts to denote intra-pair matches and underlines to mark cross-pair ones.

We can see that DCA does not handle some cases well because of its sensitiveness to intra-pair key words, while KCG is generally more robust against the diversity of Q-A pairs. KCG considers both pattern and knowledge conditions. With the incorporation of global Q-to-A pattern information, KCG works more effectively on different question structures. Besides, knowledge information helps KCG to capture various relations within Q-A pairs, and prevents it from over-learning some uncommon Q-to-A patterns.

Conclusion and Future Work

Answer selection is a basic semantic matching problem in QA. In this paper, we propose a novel system named KCG for AS, which considers both knowledge and pattern conditions. KCG applies both intra-pair and cross-pair learning. For intra-pair matching, it makes comparisons on both textual and knowledge information within the Q-A pair. For cross-pair matching, it explores global information with the QA-pair graph and GCN, thus incorporating more general Q-to-A patterns and making better use of the corpus. Compared with multiple baselines, KCG achieves state-of-the-art results on both WikiQA and TrecQA. For future work, we will construct heterogeneous graphs with both sentences and entities to better model cross-pair dependencies, and extend KCG to more general semantic matching scenarios.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (Grant No. 61773229 and 61972219), Shenzhen Giiso Information Technology Co. Ltd., the National Natural Science Foundation of Guangdong Province (Grant No. 2018A030313422), and the Overseas Cooperation Research Fund of Graduate School at Shenzhen, Tsinghua University (Grant No. HW2018002).

References

Abbas, Z.; Kalavri, V.; Carbone, P.; and Vlassov, V. 2018. Streaming graph partitioning: an experimental study. Proceedings of the VLDB Endowment 11(11):1590-1603.

Bastings, J.; Titov, I.; Aziz, W.; Marcheggiani, D.; and Simaan, K. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1957-1967.

Bian, W.; Li, S.; Yang, Z.; Chen, G.; and Lin, Z. 2017. A compare-aggregate model with dynamic-clip attention for answer selection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 1987-1990. ACM.

Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 1247-1250. ACM.

Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, 2787-2795.

Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1870-1879.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186.

Feng, M.; Xiang, B.; Glass, M. R.; Wang, L.; and Zhou, B. 2015. Applying deep learning to answer selection: A study and an open task. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 813-820. IEEE.

Hu, B.; Lu, Z.; Li, H.; and Chen, Q. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, 2042-2050.

Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746-1751.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.

Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Lai, Y.; Feng, Y.; Yu, X.; Wang, Z.; Xu, K.; and Zhao, D. 2019. Lattice CNNs for matching based Chinese question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 6634-6641.

Li, Q.; Wu, X.-M.; Liu, H.; Zhang, X.; and Guan, Z. 2019. Label efficient semi-supervised learning via graph filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9582-9591.

Li, Y.; Jin, R.; and Luo, Y. 2018. Classifying relations in clinical narratives using segment graph convolutional and recurrent neural networks (Seg-GCRNs). Journal of the American Medical Informatics Association 26(3):262-268.

Li, H.; Xu, J.; et al. 2014. Semantic matching in search. Foundations and Trends in Information Retrieval 7(5):343-469.

Marcheggiani, D., and Titov, I. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1506-1515.

Pang, L.; Lan, Y.; Guo, J.; Xu, J.; Wan, S.; and Cheng, X. 2016. Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

Rao, J.; He, H.; and Lin, J. 2016. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 1913-1916. ACM.

Santos, C. d.; Tan, M.; Xiang, B.; and Zhou, B. 2016. Attentive pooling networks. arXiv preprint arXiv:1602.03609.

Shen, Y.; Deng, Y.; Yang, M.; Li, Y.; Du, N.; Fan, W.; and Lei, K. 2018. Knowledge-aware attentive neural network for ranking question answer pairs. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 901-904. ACM.

Shen, G.; Yang, Y.; and Deng, Z.-H. 2017. Inter-weighted alignment network for sentence pair modeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1179-1189.

Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018a. Hyperbolic representation learning for fast and efficient neural question answering. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 583-591. ACM.

Tay, Y.; Tuan, L. A.; and Hui, S. C. 2018b. Multi-cast attention networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2299-2308. ACM.

Tymoshenko, K., and Moschitti, A. 2018. Cross-pair text representations for answer sentence selection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2162-2173.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000-6010. Curran Associates Inc.

Wang, S., and Jiang, J. 2017. A compare-aggregate model for matching text sequences. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Wang, Z.; Hamza, W.; and Florian, R. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 4144-4150. AAAI Press.

Wang, M.; Smith, N. A.; and Mitamura, T. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 22-32.

Yang, Y.; Yih, W.-t.; and Meek, C. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2013-2018.

Yao, X., and Van Durme, B. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 956-966.

Yao, L.; Mao, C.; and Luo, Y. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 7370-7377.

Yih, W.-t.; Chang, M.-W.; Meek, C.; and Pastusiak, A. 2013. Question answering using enhanced lexical semantic models. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1744-1753.

Yin, W.; Schütze, H.; Xiang, B.; and Zhou, B. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics 4:259-272.

Yu, F., and Koltun, V. 2016. Multi-scale context aggregation by dilated convolutions. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Zanzotto, F. M., and Moschitti, A. 2006. Automatic learning of textual entailments with cross-pair similarities. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 401-408.

Zhou, D.; Bousquet, O.; Lal, T. N.; Weston, J.; and Schölkopf, B. 2004. Learning with local and global consistency. In Advances in Neural Information Processing Systems, 321-328.
