Joint Training of Candidate Extraction and Answer Selection for Reading Comprehension

Zhen Wang Jiachen Liu Xinyan Xiao Yajuan Lyu Tian Wu Baidu Inc., Beijing, China {wangzhen24, liujiachen, xiaoxinyan, lvyajuan, wutian}@baidu.com

arXiv:1805.06145v1 [cs.CL] 16 May 2018

Abstract

While sophisticated neural-based techniques have been developed in reading comprehension, most approaches model the answer in an independent manner, ignoring its relations with other answer candidates. This problem can be even worse in open-domain scenarios, where candidates from multiple passages should be combined to answer a single question. In this paper, we formulate reading comprehension as an extract-then-select two-stage procedure. We first extract answer candidates from passages, then select the final answer by combining information from all the candidates. Furthermore, we regard candidate extraction as a latent variable and train the two-stage process jointly with reinforcement learning. As a result, our approach has improved the state-of-the-art performance significantly on two challenging open-domain reading comprehension datasets. Further analysis demonstrates the effectiveness of our model components, especially the information fusion of all the candidates and the joint training of the extract-then-select procedure.

1 Introduction

Teaching machines to read and comprehend human languages is a long-standing objective in natural language processing. In order to evaluate this ability, reading comprehension (RC) is designed to answer questions through reading relevant passages. In recent years, RC has attracted intense interest. Various advanced neural models have been proposed along with newly released datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Dunn et al., 2017; Dhingra et al., 2017b; He et al., 2017).

Q    Rum, lime, and cola drink make a .
A    Cuba Libre
P1   , the custom of mixing lime with rum for a cooling drink on a hot Cuban day, has been around a long time.
P2   recipe for a Daiquiri, a classic rum and lime drink that every bartender should know.
P3   Hemingway Special Daiquiri: Daiquiris are a family of cocktails whose main ingredients are rum and lime juice.
P4   A homemade Cuba Libre Preparation To make a Cuba Libre properly, fill a highball glass with ice and half fill with cola.
P5   The difference between the Cuba Libre and Rum is a lime wedge at the end.

Table 1: The answer candidates are in a bold font. The key information is marked in italic, which should be combined from different text pieces to select the correct answer "Cuba Libre".

Most existing approaches mainly focus on modeling the interactions between questions and passages (Dhingra et al., 2017a; Seo et al., 2017; Wang et al., 2017), paying less attention to information concerning answer candidates. However, when humans solve this problem, we often first read each piece of text, collect some answer candidates, then focus on these candidates and combine their information to select the final answer. This collect-then-select process can be more significant in open-domain scenarios, which require the combination of candidates from multiple passages to answer one single question. This phenomenon is illustrated by the example in Table 1.

With this motivation, we formulate an extract-then-select two-stage architecture to simulate the above procedure. The architecture contains two components: (1) an extraction model, which generates answer candidates, (2) a selection model, which combines all these candidates and finds out the final answer. However, answer candidates to be focused on are often unobservable, as most RC datasets only provide golden answers. Therefore, we treat candidate extraction as a latent variable and train these two stages jointly with reinforcement learning (RL).

In conclusion, our work makes the following contributions:

1. We formulate open-domain reading comprehension as a two-stage procedure, which first extracts answer candidates and then selects the final answer. With joint training, we optimize these two correlated stages as a whole.

2. We propose a novel answer selection model, which combines the information from all the extracted candidates using an attention-based correlation matrix. As shown in experiments, the information fusion is greatly helpful for answer selection.

3. With the two-stage framework and the joint training strategy, our method significantly surpasses the state-of-the-art performance on two challenging public RC datasets Quasar-T (Dhingra et al., 2017b) and SearchQA (Dunn et al., 2017).

2 Related Work

In recent years, reading comprehension has made remarkable progress in methodology and dataset construction. Most existing approaches mainly focus on modeling sophisticated interactions between questions and passages, then use the pointer networks (Vinyals et al., 2015) to directly model the answers (Dhingra et al., 2017a; Wang and Jiang, 2017; Seo et al., 2017; Wang et al., 2017). These methods prove to be effective in existing closed-domain datasets (Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016).

More recently, open-domain RC has attracted increasing attention (Nguyen et al., 2016; Dunn et al., 2017; Dhingra et al., 2017b; He et al., 2017) and raised new challenges for question answering techniques. In these scenarios, a question is paired with multiple passages, which are often collected by exploiting unstructured documents or web data. Aforementioned approaches often rely on recurrent neural networks and sophisticated attentions, which are prohibitively time-consuming if passages are concatenated altogether. Therefore, some work tried to alleviate this problem in a coarse-to-fine schema. Wang et al. (2018a) combined a ranker for selecting the relevant passage and a reader for producing the answer from it. However, this approach only depended on one passage when producing the answer, hence put great demands on the precision of both components. Worse still, this framework cannot handle the situation where multiple passages are needed to answer correctly. In consideration of evidence aggregation, Wang et al. (2018b) proposed a re-ranking method to resolve the above issue. However, their re-ranking stage was totally isolated from the candidate extraction procedure. Being different from the re-ranking perspective, we propose a novel selection model to combine the information from all the extracted candidates. Moreover, with reinforcement learning, our candidate extraction and answer selection models can be learned in a joint manner. Trischler et al. (2016) also proposed a two-step extractor-reasoner model, which first extracted K most probable single-token answer candidates and then compared the hypotheses with all the sentences in the passage. However, in their work, each candidate was considered in isolation, and their objective only took the ground truths into account, in contrast to our RL treatment.

The training strategy employed in our paper is reinforcement learning, which is inspired by recent work applying it to question answering. The above mentioned coarse-to-fine framework (Choi et al., 2017; Wang et al., 2018a) treated sentence selection as a latent variable and jointly trained the sentence selection module with the answer generation module via RL. Shen et al. (2017) modeled the multi-hop reasoning procedure with a termination state to decide when it is adequate to produce an answer. RL is suitable to capture this stochastic behavior. Hu et al. (2018) merely modeled the extraction process, using F1 as rewards in addition to maximum likelihood estimation. RL was utilized in their training process, as the F1 measure is not differentiable.

3 Two-stage RC Framework

In this work, we mainly consider the open-domain extractive reading comprehension. In this scenario, a given question $Q$ is paired with multiple passages $\mathcal{P} = \{P_1, P_2, \dots, P_N\}$, based on which we aim to find out the answer $A$. Moreover, the golden answers are almost always subspans of some passages in $\mathcal{P}$. Our main framework consists of two parts, which are: (1) extracting answer candidates $\mathcal{C} = \{C_1, C_2, \dots, C_M\}$ from passages $\mathcal{P}$ and (2) selecting the final answer $A$ from candidates $\mathcal{C}$. This process is illustrated in Figure 1. We design different models for each part and optimize them as a whole with joint reinforcement learning.

Figure 1: Two-stage RC Framework. The first part extracts candidates (denoted with circles) from all the passages. The second part establishes interactions among all these candidates to select the final answer. The different gray scales of dashed lines between candidates represent different intensities of interactions.

3.1 Candidate Extraction

We build candidate set $\mathcal{C}$ by independently extracting $K$ candidates from each passage $P_i$ according to the following distribution:

$$p(\mathcal{C} \mid Q, \mathcal{P}) = \prod_{i=1}^{N} p(\{C_{ij}\}_{j=1}^{K} \mid Q, P_i), \qquad \mathcal{C} = \bigcup_{i=1}^{N} \{C_{ij}\}_{j=1}^{K} \tag{1}$$

where $C_{ij}$ denotes the $j$th candidate extracted from the $i$th passage. $K$ is set as a constant number in our formulation. Taking $K$ as 2 for an example, we denote each probability shown on the right side of Equation 1 through sampling without replacement:

$$p(\{C_{i1}, C_{i2}\}) = \frac{p(C_{i1})\,p(C_{i2})}{1 - p(C_{i1})} + \frac{p(C_{i1})\,p(C_{i2})}{1 - p(C_{i2})} \tag{2}$$

where we neglect $Q, P_i$ to abbreviate the conditional distributions in Equation 1.

Consequently, the basic block of our candidate extraction stage turns out to be the distribution of each candidate $p(C_{ij} \mid Q, P_i)$. In the rest of this subsection, we will elaborate on the model architecture concerning candidate extraction, which is displayed in Figure 2.

Figure 2: Candidate Extraction Model Architecture.

Question & Passage Representation Firstly, we embed the question $Q = \{x_Q^k\}_{k=1}^{l_Q}$ and its relevant passage $P = \{x_P^t\}_{t=1}^{l_P} \in \mathcal{P}$ with word vectors to form $\mathbf{Q} \in \mathbb{R}^{d_w \times l_Q}$ and $\mathbf{P} \in \mathbb{R}^{d_w \times l_P}$ respectively, where $d_w$ is the dimension of word embeddings, and $l_Q$ and $l_P$ are the lengths of $Q$ and $P$. We then feed $\mathbf{Q}$ and $\mathbf{P}$ to a bidirectional LSTM to form their contextual representations $H_Q \in \mathbb{R}^{d_h \times l_Q}$ and $H_P \in \mathbb{R}^{d_h \times l_P}$:

$$H_Q = \mathrm{BiLSTM}(\mathbf{Q}), \qquad H_P = \mathrm{BiLSTM}(\mathbf{P}) \tag{3}$$

Question & Passage Interaction Modeling the interactions between questions and passages is a critical step in reading comprehension. Here, we adopt the attention mechanism similar to (Lee et al., 2016) to generate question-dependent passage representation $\tilde{H}_P$. Assume $H_Q = \{h_Q^k\}_{k=1}^{l_Q}$, $H_P = \{h_P^t\}_{t=1}^{l_P}$, we have:

$$\alpha_{tk} = \frac{e^{h_Q^k \cdot h_P^t}}{\sum_{k'=1}^{l_Q} e^{h_Q^{k'} \cdot h_P^t}}, \quad 1 \le k \le l_Q,\ 1 \le t \le l_P$$
$$\tilde{h}_P^t = \sum_{k=1}^{l_Q} \alpha_{tk}\, h_Q^k, \quad 1 \le t \le l_P$$
$$\tilde{H}_P = \{\tilde{h}_P^t\}_{t=1}^{l_P} \tag{4}$$

After concatenating two kinds of passage representations $H_P$ and $\tilde{H}_P$, we use another bidirectional LSTM to get the final representation of every position in passage $P$ as $G_P \in \mathbb{R}^{d_g \times l_P}$:

$$G_P = \mathrm{BiLSTM}([H_P; \tilde{H}_P]) \tag{5}$$

Candidate Scoring Then we use two linear transformations $w_b \in \mathbb{R}^{1 \times d_g}$ and $w_e \in \mathbb{R}^{1 \times d_g}$ to calculate the begin and the end scores for each position:

$$\{b_P^t\}_{t=1}^{l_P} = b_P = w_b G_P, \qquad \{e_P^t\}_{t=1}^{l_P} = e_P = w_e G_P \tag{6}$$
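As a sanity check on Equation 2, the following minimal sketch evaluates the pair probability for a toy per-passage candidate distribution and confirms that the probabilities of all unordered pairs sum to one. The function name `pair_probability` and the toy distribution `p` are illustrative assumptions, not part of the original model code.

```python
from itertools import combinations


def pair_probability(p, a, b):
    """Probability of drawing the unordered pair {a, b} in two draws
    without replacement from the distribution p (Equation 2)."""
    # First term: draw a, then b from the renormalized remainder.
    # Second term: draw b, then a from the renormalized remainder.
    return p[a] * p[b] / (1.0 - p[a]) + p[a] * p[b] / (1.0 - p[b])


# Hypothetical distribution p(C | Q, P_i) over three candidate spans.
p = [0.5, 0.3, 0.2]

# Probability of every unordered candidate pair, keyed by index pair.
pairs = {pair: pair_probability(p, *pair)
         for pair in combinations(range(len(p)), 2)}
total = sum(pairs.values())
```

Because each unordered pair collects both draw orders, `total` equals 1 up to floating-point error, so Equation 2 defines a proper distribution over candidate sets of size 2.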

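The question-to-passage attention of Equation 4 can be sketched in plain Python as follows. This is only an illustration on lists of floats, assuming the contextual vectors are already computed; the function name `question_aware_passage` is hypothetical, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the softmax value.

```python
import math


def question_aware_passage(h_q, h_p):
    """h_q: list of l_Q question vectors, h_p: list of l_P passage vectors.
    Returns (alpha, h_tilde) with alpha[t][k] and h_tilde[t] as in Equation 4."""
    dim = len(h_q[0])
    alpha, h_tilde = [], []
    for h_t in h_p:
        # Dot-product score between passage position t and every question position k.
        scores = [sum(a * b for a, b in zip(h_k, h_t)) for h_k in h_q]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        alpha.append(weights)
        # Weighted sum of question vectors gives the question-dependent vector.
        h_tilde.append([sum(w * h_k[d] for w, h_k in zip(weights, h_q))
                        for d in range(dim)])
    return alpha, h_tilde
```

Each row of `alpha` is a softmax over question positions, so every passage position receives a convex combination of the question vectors.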
At last, we model the probability of every subspan in passage $P$ as a candidate $C = \{x_P^t\}_{t=C_b}^{C_e}$ according to its begin and end position:

$$p(C \mid Q, P) = \frac{\exp(b_P^{C_b} + e_P^{C_e})}{\sum_{k=1}^{l_P} \sum_{t=k}^{l_P} \exp(b_P^k + e_P^t)} \tag{7}$$

In this definition, the probabilities of all the valid answer candidates are already normalized.

3.2 Answer Selection

As the second part of our framework, the answer selection model finds out the most probable answer by calculating $p(C \mid Q, P, \mathcal{C})$ for each candidate $C \in \mathcal{C}$. The model architecture is illustrated in Figure 3.

Figure 3: Answer Selection Model Architecture.

Notably, the selection model receives candidate set $\mathcal{C}$ as additional information. This more focused information allows the model to combine evidence from all the candidates, which would be useful for selecting the best answer.

For ease of understanding, we briefly describe the selection stage as follows. After being extracted from a single passage, a candidate borrows information from other candidates across different passages. With this global information, the passage is reread to confirm the correctness of the candidate further. The following are details about the selection model.

Question Representation Questions are fundamental for finding out the correct answer. As for the extraction model, we embed the question $Q$ with word vectors to form $\mathbf{Q} \in \mathbb{R}^{d_w \times l_Q}$. Then we use a bidirectional LSTM to establish its contextual representation:

$$S_q = \mathrm{BiLSTM}(\mathbf{Q}) \tag{8}$$

A max-pooling operation across all the positions follows to get the condensed vector representation:

$$r_q = \mathrm{MaxPooling}(S_q) \tag{9}$$

Passage Representation Assume the candidate $C$ is extracted from the passage $P \in \mathcal{P}$. To be informed of $C$, we first build the representation of $P$. For every word in $P$, three kinds of features are utilized:

• Word embedding: each word expresses its basic feature with the word vector.

• Common word: the feature has value 1 when the word occurs in the question, otherwise 0.

• Question independent representation: the condensed representation $r_q$.

With these features, information not only in $Q$ but also in $P$ is considered. By concatenating them, we get $r_P^t$ corresponding to every position $t$ in passage $P$. Then with another bidirectional LSTM, we fuse these features to form the contextual representation of $P$ as $S_P \in \mathbb{R}^{d_s \times l_P}$:

$$R_P = \{r_P^t\}_{t=1}^{l_P}, \qquad S_P = \mathrm{BiLSTM}(R_P) \tag{10}$$

Candidate Representation Candidates provide more focused information for answer selection. Therefore, for each candidate, we first build its independent representation according to its position in the passage, then construct the candidates fused representation through combination of other correlated candidates.

Given the candidate $C = \{x_P^t\}_{t=C_b}^{C_e}$ in the passage $P$, we extract its corresponding span from $S_P = \{s_P^t\}_{t=1}^{l_P}$ to form $S_C = \{s_P^t\}_{t=C_b}^{C_e}$ as its contextual encoding. Moreover, we calculate its condensed vector representation through its begin and end positions:

$$r_C = \tanh(W_b s_P^{C_b} + W_e s_P^{C_e}) \tag{11}$$

where $W_b \in \mathbb{R}^{d_c \times d_s}$, $W_e \in \mathbb{R}^{d_c \times d_s}$.

To model the interactions among all the answer candidates, we calculate the correlations of the candidate $C$, which is assumed to be indexed by $j$ in $\mathcal{C}$, with others $\{C_m\}_{m=1, m \ne j}^{M}$ via attention mechanism:

$$V_{jm} = w_v \tanh(W_c r_C + W_o r_{C_m}) \tag{12}$$

where $W_c \in \mathbb{R}^{d_c \times d_c}$, $W_o \in \mathbb{R}^{d_c \times d_c}$ and $w_v \in \mathbb{R}^{1 \times d_c}$ are linear transformations to capture the intensity of each interaction.

In this way, we form a correlation matrix $V \in \mathbb{R}^{M \times M}$, where $M$ is the total number of candidates. With the correlation matrix, for the candidate $C$, we normalize its interactions via a softmax operation, which emphasizes the influence of stronger interactions:

$$\alpha_m = \frac{e^{V_{jm}}}{\sum_{m'=1, m' \ne j}^{M} e^{V_{jm'}}} \tag{13}$$

To take into account different influences of all the other candidates, it is sensible to generate a candidates fused representation according to the above normalized interactions:

$$\tilde{r}_C = \sum_{m=1, m \ne j}^{M} \alpha_m r_{C_m} \tag{14}$$

In this formulation, all the other candidates contribute their influences to the fused representation by their interactions with $C$, thus information from different passages is gathered altogether. In our experiments, this kind of information fusion is the key point for performance improvements.

Passage Advanced Representation As more focused information of the candidate $C$ is available, we are provided with a better way to confirm its correctness by rereading its corresponding passage $P$. Specifically, we equip each position $t$ in $P$ with the following advanced features:

• Passage contextual representation: the former passage representation $s_P^t$.

• Candidate-dependent passage representation: replace $H_Q$ with $S_C$ and $H_P$ with $S_P$ in Equation 4 to model the interactions between candidates and passages to form $\tilde{s}_P^t$.

• Candidate related distance feature: the relative distance to the candidate $C$ can be a reference of the importance of each position.

• Candidate independent representation: use $r_C$ to consider the concerned candidate $C$.

• Candidates fused representation: use $\tilde{r}_C$ to consider all the other candidates interacting with the concerned candidate $C$.

With these features, we capture the information from the question, the passages and all the candidates. By concatenating them, we get $u_P^t$ in every position in the passage $P$. Combining these features with a bidirectional LSTM, we get:

$$U_P = \{u_P^t\}_{t=1}^{l_P}, \qquad F_P = \mathrm{BiLSTM}(U_P) \tag{15}$$

Answer Scoring At last, the max pooling of each dimension of $F_P$ is performed, resulting in a condensed vector representation, which contains all the concerned information in a candidate:

$$z_C = \mathrm{MaxPooling}(F_P) \tag{16}$$

The final score of this candidate as the answer is calculated via a linear transformation, which is then normalized across all the candidates:

$$s = w_z z_C, \qquad p(C \mid Q, P, \mathcal{C}) = \frac{e^{s}}{\sum_{k=1}^{M} e^{s_k}} \tag{17}$$

3.3 Joint Training with RL

In our formulation, the answer candidate set influences the result of answer selection to a large extent. However, with only golden answers provided in the training data, it is not apparent which candidates should be considered further.

To alleviate the above problem, we treat candidate extraction as a latent variable and jointly train the extraction model and the selection model with reinforcement learning. Formally, in the extraction and selection stages, two kinds of actions are modeled. The action space for the extraction model is to select from different candidate sets, which is formulated by Equation 1. The action space for the selection model is to select from all extracted candidates, which is formulated by Equation 17. Our goal is to select the final answer that leads to a high reward. Inspired by Wang et al. (2018a), we define the reward of a candidate to reflect its accordance with the golden answer:

$$r(C, A) = \begin{cases} 2 & \text{if } C = A \\ f1(C, A) & \text{else if } C \cap A \ne \emptyset \\ -1 & \text{else} \end{cases} \tag{18}$$

where $f1(\cdot, \cdot) \in [0, 1]$ is the function to measure word-level F1 score between two sequences. Incorporating this reward can alleviate the overstrict requirements set by traditional maximum likelihood estimation as well as keep consistent with our evaluation methods in experiments.

The learning objective is to maximize the expected reward modeled by our framework, that is, to minimize the following loss, where $\theta$ stands for all the parameters involved:

$$L(\theta) = -\mathbb{E}_{\mathcal{C} \sim P(\mathcal{C} \mid Q, \mathcal{P})}\big[\mathbb{E}_{C \sim P(C \mid Q, P, \mathcal{C})}\, r(C, A)\big] = -\mathbb{E}_{\mathcal{C} \sim P(\mathcal{C} \mid Q, \mathcal{P})}\Big[\sum_{C} P(C \mid Q, P, \mathcal{C})\, r(C, A)\Big] \tag{19}$$

Following the REINFORCE algorithm, we approximate the gradient of the above objective with a sampled candidate set, $\mathcal{C} \sim P(\mathcal{C} \mid Q, \mathcal{P})$, resulting in the following form:

$$\nabla L(\theta) \approx -\sum_{C} \nabla P(C \mid Q, P, \mathcal{C})\, r(C, A) - \nabla \log P(\mathcal{C} \mid Q, \mathcal{P}) \Big[\sum_{C} P(C \mid Q, P, \mathcal{C})\, r(C, A)\Big] \tag{20}$$

4 Experiments

4.1 Datasets

We evaluate our models on two publicly available open-domain RC datasets, which are commonly adopted in related work.

Quasar-T (Dhingra et al., 2017b) consists of 43,000 open-domain trivia questions and corresponding answers obtained from various internet sources. Each question is paired with 100 sentence-level passages retrieved from ClueWeb09 (Callan et al., 2009) based on Lucene.

SearchQA (Dunn et al., 2017) starts from existing question-answer pairs, which are crawled from J!Archive, and is augmented with text snippets retrieved by Google, resulting in more than 140,000 question-answer pairs with each pair having 49.6 snippets on average.

The detailed statistics of these two datasets are shown in Table 2.

            #q(train)   #q(dev)   #q(test)   #p
Quasar-T    28,496      3,000     3,000      100
SearchQA    99,811      13,893    27,247     50

Table 2: The statistics of our experimental datasets. #q represents the number of questions for each split of the datasets. #p is the number of passages for each question.

4.2 Model Settings

We initialize word embeddings with the 300-dimensional Glove vectors¹. All the bidirectional LSTMs hold 1 layer and 100 hidden units. All the linear transformations take the size of 100 as output dimension. The common word feature and the candidate related distance feature are embedded with vectors of dimension 4 and 50 respectively. By default, we set $K$ as 2 in Equation 1, which means each passage generates two candidates based on the extraction model.

For ease of training, we first initialize our models by maximum likelihood estimation and fine-tune them with RL. A similar training strategy is commonly employed when an RL process is involved (Ranzato et al., 2015; Li et al., 2016a; Hu et al., 2018). To pre-train the extraction model, we only use passages containing ground truths as training data. The log likelihood of Equation 7 is taken as the training objective for each question and passage pair. After pre-training the extraction model, we use it to generate two top-scoring candidates from each passage, forming the training data to pre-train our selection model, and maximize the log likelihood of Equation 17 as our second objective. In pre-training, we use the batch size of 30 for the extraction model, 20 for the selection model and RMSProp (Tieleman and Hinton, 2012) with an initial learning rate of 2e-3. In fine-tuning with RL, we use the batch size of 5 and RMSProp with an initial learning rate of 1e-4. Also, we use a dropout rate of 0.1 in each training procedure.

4.3 Experimental Results

In addition to results of previous work, we add two baselines to demonstrate the effectiveness of our framework. The first baseline only applies the extraction model to score the answers, which is aimed at explaining the importance of the selection model. The second one only uses the pre-trained extraction model and selection model to illustrate the benefits from our joint training schema.

The often used evaluation metrics for extractive RC are exact match (EM) and F1 (Rajpurkar et al., 2016). The experimental results on Quasar-T and SearchQA are shown in Table 3.
¹ http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip

                                               Quasar-T        SearchQA
                                               EM     F1       EM     F1
GA (Dhingra et al., 2017a)                     26.4   26.4     -      -
BIDAF (Seo et al., 2017)                       25.9   28.5     28.6   34.6
AQA (Buck et al., 2018)                        -      -        38.7   45.6
R3 (Wang et al., 2018a)                        35.3   41.7     49.0   55.3
Re-Ranker (Wang et al., 2018b)
  Strength-Based Re-Ranker (Probability)       36.1   42.4     50.4   56.5
  Strength-Based Re-Ranker (Counting)          37.1   46.7     54.2   61.6
  Coverage-Based Re-Ranker                     40.6   49.1     53.6   60.6
  Full Re-Ranker                               42.3   49.6     57.0   63.2
Our Methods
  Extraction Model                             35.4   41.6     44.7   51.2
  Extraction + Selection (Isolated Training)   41.6   49.5     49.7   56.6
  Extraction + Selection (Joint Training)      45.9   53.9     58.3   64.2

Table 3: Experimental results on the test set of Quasar-T and SearchQA. Full re-ranker is the ensemble of three different re-rankers in (Wang et al., 2018b).

As seen from the results on Quasar-T, our quite simple extraction model alone almost reaches the state-of-the-art result compared with other methods without re-rankers. The combination of the extraction and selection models exceeds our extraction baseline by a great margin, and also results in performance surpassing the best single re-ranker in (Wang et al., 2018b). This result illustrates the necessity of introducing the selection model, which incorporates information from all the candidates. In the end, by joint training with RL, our method produces better performance even compared with the ensemble of three different re-rankers.

On SearchQA, we find that our extraction model alone does not perform that well compared with the state-of-the-art model without re-rankers. However, the improvement brought by our selection model, whether trained in isolation or jointly, still demonstrates the importance of our two-stage framework. Not surprisingly, comparing the results, our isolated training strategy still lags behind the single re-ranker proposed in (Wang et al., 2018b), partly because of the deficiency of our extraction model. However, uniting our extraction and selection models with RL makes up the disparity, and the performance surpasses the ensemble of three different re-rankers, let alone the result of any single re-ranker.

4.4 Further Analysis

Effect of Features in Selection Model As the incorporation of the selection model improves the overall performance significantly, we conduct ablation analysis on Quasar-T to prove the effectiveness of its major components. As shown in Table 4, all these components modeling the selection procedure play important roles in our final architecture.

Quasar-T                                        EM     F1
Extraction + Selection (Joint Training)         45.9   53.9
 -question representation                       42.5   50.5
 -question and passage common words             41.0   48.7
 -candidate independent representation          44.5   53.3
 -candidate related distance feature            44.7   53.0
 -candidate dependent passage representation    44.4   52.3
 -candidates fused representation               39.2   45.8

Table 4: Ablation results concerning the selection model on the test set of Quasar-T. Obviously, candidates fused representation is the most evident feature when modeling the answer selection procedure.

Specifically, introducing the independent representation of the question and its common words with the passage seems an efficient way to consider the information of questions, which is consistent with previous work (Li et al., 2016b; Chen et al., 2017).

As for features related to candidates, the incorporation of the candidate independent information contributes to the final result more or less. These features include candidate-dependent passage representation, candidate independent representation and candidate related distance feature.

Most importantly, the candidates fused representation, which combines the information from all the candidates, demonstrates its indispensable role in candidate modeling, with a performance drop of 6.7 EM and 8.1 F1 points when discarded (Table 4). This phenomenon also verifies the necessity of our extract-then-select procedure, showing the importance of combining information scattered in different text pieces when picking out the final answer.

Example for Candidates Fused Representation We conduct a case study to demonstrate the importance of candidates fused information further. In Table 5, each candidate only partly matches the description of the question in its independent context. To correctly answer the question, information in P7 and P10 should be combined. In experiments, our selection model provides the correct answer, while the wrong candidate "Daiquiri", a different kind of cocktail, is selected if candidates fused representation is discarded. The attention map established when modeling the fusion of candidates (corresponding to Equation 13) in this example is illustrated in Figure 4, in which we can see the interactions among all the candidates from different passages. In this figure, it is obvious that the interaction of "Cuba Libre" in P7 and P10 is the key point to answer the question correctly.

Q     Cocktails : Rum , lime , and cola drink make a .
A     Cuba Libre
P1    In Nicaragua , when it is mixed using Flor de Caña -LRB- the national brand of rum -RRB- and cola , it is called a Nica Libre .
P2    The drink ... Daiquiri The custom of mixing lime with rum for a cooling drink on a hot Cuban day has been around a long time .
P3    If you only learn to make two cocktails , the Manhattan should be one of them .
P4    Daiquiri Cocktail recipe for a Daiquiri , a classic rum and lime drink that every bartender should know .
P5    Hemingway Special Daiquiri : Daiquiris are a family of cocktails whose main ingredients are rum and lime juice .
P6    In the Netherlands the drink is commonly called Baco , from the two ingredients of Bacardi rum and cola .
P7    A homemade Cuba Libre Preparation To make a Cuba Libre properly , fill a highball glass with ice and half fill with cola .
P8    Bacardi Cocktail Cocktail recipe for a Bacardi Cocktail , a classic cocktail of Bacardi rum , lemon or lime juice and grenadine Roy Rogers -LRB- non-alcoholic -RRB- Cocktail recipe for a Roy Rogers ,
P9    Margarita Cocktail recipe for a Margarita , a popular refreshing tequila and lime drink for summer .
P10   The difference between the Cuba Libre and Rum is a lime wedge at the end .

Table 5: An example from Quasar-T to illustrate the necessity of fused information. Candidates extracted from passages are in a bold font. To correctly answer the question, information in P7 and P10 should be combined.

Figure 4: The attention map generated when modeling candidates fused representations for the example in Table 5. (Heat map over the extracted candidates, including Nica Libre, Daiquiri, Manhattan, Baco, Bacardi, Margarita and Cuba Libre; color scale from 0.0 to 1.0.)

Effect of Candidate Number The candidate extraction stage plays an important role in deciding what information should be focused on further. Therefore, we also test the influence of different $K$ when extracting candidates from each passage. The results are shown in Table 6. Taking $K = 1$ degrades the performance, which conforms to the expectation, as fewer correct candidates are available in this stricter situation. However, taking $K = 3$ does not improve the performance further. Although a larger $K$ means a higher possibility to include good answers, it raises more challenges for the selection model to pick out the correct one from a more varied set of candidates.

Quasar-T   EM     F1
K=1        43.9   52.4
K=2        45.9   53.9
K=3        45.8   53.9

Table 6: Different number of extracted candidates results in different final performance on the test set of Quasar-T.

5 Conclusion

In this paper, we formulate the problem of RC as a two-stage process, which first generates candidates with an extraction model, then selects the final answer by combining the information from all the candidates. Furthermore, we treat candidate extraction as a latent variable and jointly train these two stages with RL. Experiments on public open-domain RC datasets Quasar-T and SearchQA show the necessity of introducing the selection model and the effectiveness of fusing candidates information when modeling. Moreover, our joint training strategy leads to significant improvements in performance.

Acknowledgments

This work is supported by the National Basic Research Program of China (973 program, No. 2014CB340505). We thank Ying Chen and anonymous reviewers for valuable feedback.

References

Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. 2018. Ask the right questions: Active question reformulation with reinforcement learning. In ICLR.

Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. Clueweb09 data set.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada, pages 1870–1879.

Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada, pages 209–220.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017a. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada, pages 1832–1846.

Bhuwan Dhingra, Kathryn Mazaitis, and William W Cohen. 2017b. Quasar: Datasets for question answering by search and reading. arXiv preprint arXiv:1707.03904.

Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.

Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2017. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. pages 1693–1701.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.

Minghao Hu, Yuxing Peng, and Xipeng Qiu. 2018. Reinforced mnemonic reader for machine comprehension. In IJCAI.

Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016a. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1192–1202.

Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016b. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv preprint arXiv:1607.06275.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016).

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 2383–2392.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pages 1047–1055.

Tijmen Tieleman and Geoffrey Hinton.
2012. Lecture 6.5-rmsprop: Divide the gradient by a running av- erage of its recent magnitude. COURSERA: Neural networks for machine learning 4(2):26–31. Adam Trischler, Zheng Ye, Xingdi Yuan, Philip Bach- man, Alessandro Sordoni, and Kaheer Suleman. 2016. Natural language comprehension with the epireader. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Pro- cessing. Association for Computational Linguistics, Austin, Texas, pages 128–137. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Con- ference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. pages 2692–2700. Shuohang Wang and Jing Jiang. 2017. Machine com- prehension using match-lstm and answer pointer. In ICLR. Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R3: Reinforced reader-ranker for open-domain question answering. In AAAI. Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2018b. Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching net- works for reading comprehension and question an- swering. In Proceedings of the 55th Annual Meet- ing of the Association for Computational Linguis- tics. volume 1, pages 189–198.