Multi-hop Question Generation with Graph Convolutional Network

Dan Su, Yan Xu, Wenliang Dai, Ziwei Ji, Tiezheng Yu, Pascale Fung Center for Artificial Intelligence Research (CAiRE) Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong {dsu, yxucb, wdaiai, zjiad, tyuah}@connect.ust.hk, [email protected]

Abstract Paragraph A: Marine Tactical Air Command Squadron 28 (Location T) is a United States Marine Multi-hop Question Generation (QG) aims to Corps aviation command and control unit based generate answer-related questions by aggre- at Marine Corps Air Station Cherry Point (Location gating and reasoning over multiple scattered evidence from different paragraphs. It is a C) ... more challenging yet under-explored task com- Paragraph B: Marine Corps Air Station Cherry pared to conventional single-hop QG, where Point (Location C) ... is a United States Marine the questions are generated from the sentence Corps airfield located in Havelock, North Car- containing the answer or nearby sentences in olina (Location H), USA ... the same paragraph without complex reason- Answer: Havelock, North Carolina (Location H) ing. To address the additional challenges Question: What city is the Marine Air Control in multi-hop QG, we propose Multi-Hop En- coding Fusion Network for Question Genera- Group 28 (Location T) located in? tion (MulQG), which does context encoding in multiple hops with Graph Convolutional Table 1: An example of multi-hop QG in the Hot- Network and encoding fusion via an Encoder potQA (Yang et al., 2018) dataset. Given the answer Reasoning Gate. To the best of our knowl- is Location H, to ask where is T located, the model edge, we are the first to tackle the challenge needs a bridging evidence to know that T is located in of multi-hop reasoning over paragraphs with- C, and C is located in H (T → C → H). This is done out any sentence-level information. Empiri- by multi-hop reasoning. cal results on HotpotQA dataset demonstrate the effectiveness of our method, in comparison with baselines on automatic evaluation metrics. single-hop reasoning (Zhou et al., 2017; Zhao et al., Moreover, from the human evaluation, our pro- 2018). Little effort has been put in multi-hop QG, posed model is able to generate fluent ques- which is a more challenging task. Multi-hop QG re- tions with high completeness and outperforms quires aggregating several scattered evidence spans the strongest baseline by 20.8% in the multi- from multiple paragraphs, and reasoning over them hop evaluation. The code is publicly available to generate answer-related, factual-coherent ques- at https://github.com/HLTCHKUST/MulQG. tions. It can serve as an essential component in 1 Introduction education systems (Heilman and Smith, 2010; Lind- berg et al., 2013; Yao et al., 2018), or be applied Question Generation (QG) is a task to automati- in intelligent systems (Shum et al., cally generate a question from a given context and, 2018; Pan et al., 2019). It can also combine with optionally, an answer. Recently, we have observed question answering (QA) models as dual tasks to an increasing interest in text-based QG (Du et al., boost QA systems with reasoning ability (Tang 2017; Zhao et al., 2018; Scialom et al., 2019; Nema et al., 2017). et al., 2019; Zhang and Bansal, 2019). Intuitively, there are two main additional chal- Most of the existing works on text-based QG lenges needed to be addressed for multi-hop QG. focus on generating SQuAD-style (Rajpurkar et al., The first challenge is how to effectively iden- 2016; Puri et al., 2020) questions, which are gen- tify scattered pieces of evidence that can con- erated from the sentence containing the answer nect the reasoning path of the answer and ques- or nearby sentences in the same paragraph, via tion (Chauhan et al., 2020). As the example shown

4636 Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4636–4647 November 16 - 20, 2020. c 2020 Association for Computational Linguistics in Table1, to generate a question asking about “Ma- • To the best of our knowledge, we are the first rine Air Control Group 28” given only the answer to tackle the challenge of multi-hop reasoning “Havelock, North Carolina”, we need the bridging over paragraphs without any sentence-level evidence like “Marine Corps Air Station Cherry information in QG tasks. Point”. The second challenge is how to reason over • We propose a new and effective framework for multiple pieces of scattered evidence to generate Multi-hop QG, to do context encoding in mul- factual-coherent questions. tiple hops(steps) with Graph Convolutional Previous works mainly focus on single-hop QG, Network (GCN). which use neural network based approaches with • We show the effectiveness of our method on the sequence-to-sequence (Seq2Seq) framework. both automatic evaluation and human evalu- Different architectures of encoder and decoder have ation, and we make the first step to evaluate been designed (Nema et al., 2019; Zhao et al., 2018) the model performance in multi-hop aspect. to incorporate the information of answer and con- text to do single-hop reasoning. To the best of 2 Methodology our knowledge, none of the previous works ad- dress the two challenges we mentioned above for The intuition is drawn from human’s multi-hop multi-hop QG task. The only work on multi-hop question generation process (Davey and McBride, QG (Chauhan et al., 2020) uses multi-task learning 1986). Firstly, given the answer and context, we with an auxiliary loss for sentence-level supporting skim to establish a general understanding of the fact prediction, requiring supporting fact sentences texts. Then, we find the mentions of entities in in different paragraphs being labeled in the training or correlated to the answer from the context, and data. While labeling those supporting facts requires analyse nearby sentences to extract useful evidence. heavy human labor and is time-consuming, their Besides, we may also search for linked information method cannot be applied to general multi-hop QG in other paragraphs to gain a further understanding cases without supporting facts. of the entities. Finally, we coherently fuse our knowledge learned from the previous steps and In this paper, we propose a novel architecture start to generate questions. named Multi-Hop Encoding Fusion Network for Question Generation (MulQG) to address the afore- To mimic this process, we develop our MulQG mentioned challenges for multi-hop QG. First of framework. The encoding stage is achieved by a all, it extends the Seq2Seq QG framework from novel Multi-hop Encoder. At the decoding stage, sing-hop to multi-hop for context encoding. Ad- we use maxout pointer decoder as proposed in Zhao ditionally, it leverages a Graph Convolutional Net- et al.(2018). The overview of the framework is work (GCN) on an answer-aware dynamic entity shown in Figure1. graph, which is constructed from entity mentions in answer and input paragraphs, to aggregate the po- 2.1 Multi-hop Encoder tential evidence related to the questions. Moreover, Our Multi-hop Encoder includes three modules: we use different attention mechanisms to imitate (1) Answer-aware context encoder (2) GCN-based the reasoning procedures of human beings in multi- entity-aware answer encoder (3) Gated encoder hop generation process, the details are explained in reasoning layer. Section2. The context and answer are split into word-level We conduct the experiments on the multi-hop tokens and denoted as c = {c1, c2, ..., cn} and QA dataset HotpotQA (Yang et al., 2018) with a = {a1, a2, ..., am}, respectively. Each word our model and the baselines. The proposed model is represented by the pre-trained GloVe embed- outperforms the baselines with a significant im- ding (Pennington et al., 2014). Furthermore, for the provement on automatic evaluation results, such as words in context, we also append the answer tag- BLEU (Papineni et al., 2002). The human evalua- ging embeddings as described in Zhao et al.(2018). tion results further validate that our proposed model The context and answer embeddings are fed into is more likely to generate multi-hop questions with two bidirectional LSTM-RNNs separately to obtain Fluency Answerability d×n high quality in terms of , and their initial contextual representations C0 ∈ R Completeness d×m scores. and A0 ∈ R , in which d is the hidden state Our contributions are summarized as follows: dimension in LSTM.

4637 Multi-hop Encoder Decoder Encoder Reasoning Gate

Answer-aware Context Encoder Attention Layer Entity-aware Answer Encoder

residual

Bi-Attn Maxout Pointer Generator Entity Graph Answer-aware Context Encoder

LSTM LSTM LSTM Encoder Encoder Decoder Word Embeddings + Answer Tag Answer Context

Figure 1: Overview of our MulQG framework. In the encoding stage, we pass the initial context encoding C0 and answer encoding A0 to the Answer-aware Context Encoder to obtain the first context encoding C1, then C1 and A0 will be used to update a multi-hop answer encoding A1 via the GCN-based Entity-aware Answer Encoder, and † we use A1 and C1 back to the Answer-aware Context Encoder to obtain C2. The final context encoding Cfinal are obtained from the Encoder Reasoning Gate which operates over C1 and C2, and will be used in the max-out based decoding stage.

mask d×n C1 = BiLSTM([C˜1; C0]) ∈ R (6) GCN Firstly, we compute an alignment matrix S (Eq.1), and normalize it column-wise and row- wise to get two attention matrices S0 (Eq.2) and S00 ... (Eq.3). S0 represents the relevance of each answer token over the context, and S00 represents the rele- vance of each context token over the answer. The 0 new answer representation A0 w.r.t. the context is obtained by Eq.4. Next, the answer dependent Bidirectional Attention context representation is calculated by concatenat- ing old and new answer representations and times Figure 2: The illustration of GCN-based Entity-aware 00 Answer Encoder. the attention weight matrix S (Eq.5). Finally, to deeply incorporate the interaction between answer and context, we feed the answer dependent rep- 2.1.1 Answer-aware Context Encoder resentation C˜1 combined with original C0 into a bi-directional LSTM and obtain the answer-aware Inspired by the co-attention reasoning mecha- context encoding C (Eq.6). nism in previous machine reading comprehension 1 works (Xiong et al., 2016), we compute the answer- 2.1.2 GCN-based Entity-aware Answer aware context representation via the following Encoder steps: As shown in Figure2, in order to obtain the multi- hop answer representation, we first compute the entity encoding from the answer-aware context en- T n×m S = C0 A0 ∈ R (1) coding C1, then we apply GCN to propagate multi- S0 = softmax(S) ∈ Rn×m (2) hop information on the answer-aware sub-graph. Finally we obtain the updated answer encoding A S00 = softmax(ST ) ∈ Rm×n (3) 1 via bi-attention mechanism. 0 0 d×m A = C0 · S ∈ R (4) 0 Entity Graph Construction The entity graph ˜ 0 00 2d×n C1 = [A0; A0] · S ∈ R (5) is constructed with the name entities in context

4638 as nodes, where we use BERT-based name entity entities on the sub-graph as memories to update recognition model to recognize name entities from our multi-hop answer encoding A1 via: the context. The edges are created for the entity A1 = BiAttention(A0,EM ) (10) pairs if they are in the same sentence, or appear in the same paragraphs. We also connect the enti- 2.1.3 Encoder Reasoning Gate ties from each paragraph title to entities within the We apply a gated feature fusion module on the same paragraph. answer-aware context representations C1 and C2 from previous context encoder hops, to keep and Entity Encoding With the answer-aware context forget information to form the final context repre- encoding C obtained from Answer-aware Context 1 sentation Cfinal via: Encoder, we use a mapping matrix M to calcu- late the entity encoding. M is a binary matrix Cfinal = gt C1 +(1−gt) C2 (11) T T T where Mi,j = 1 if the i-th token in the context gt = σ(w2 C2 +w1 C1 +w0 C0 +b) (12) is within the span of the j-th entity. Each entity’s 2.2 Maxout Pointer Decoder encoding will be calculated via a mean-max pool- ing applied over it’s corresponding context token Uni-directional LSTM model is utilized as the de- 2d×g encoding span. E0 = {e1, e2, ..., eg} ∈ R , coder of our model. Moreover, we introduce the where g is the number of entities, and 2d is the Maxout Pointer proposed by Zhao et al.(2018) dimension since we directly concatenate the mean- into the decoder for sake of reducing the repeti- pooling and max-pooling encoding. tions in the generation. Pointer Generator enables the decoder to generate the next output token by Answer-aware GCN First we calculate an either computing from the generative probabilistic answer-aware sub-graph, where irrelevant entities distribution over the vocabulary or copying from are masked out, only those entity nodes related the input sequence. To compute the copy score, to answer are allowed to disseminate informa- the attention over the input sequence which has a tion. Similar to Xiao et al.(2019), a soft mask vocabulary of V from the current decoder hidden M = [m1, m2, ..., mg] is calculated via Eq.7, state is leveraged. For the Maxout Pointer Gener- where each mi indicate the relatedness of the entity ator, instead of leveraging all the attention score i to the answer, and then apply M on the original over the input tokens, only the maximal is taken graph entities to obtain answer-aware dynamic sub into consideration to avoid the repetitions caused entities graph Esub via Eq.8. by the input tokens (as it’s shown in Eq. 13, where a annotates the decoder-encoder attention score). T 1×g t,k M = σ(a0 · V · E0) ∈ R (7) Esub = M · E0 (8)  max a , yt ∈ V copy  t,k sc = k,where xk=yt (13) where V is a linear projection matrix and a0 is the −inf , otherwise mean pooling over answer encoding A0, and σ is sigmoid function. 2.3 Breadth-First Search Loss Then we calculate the answer-aware sub-graph’s In addition to the cross-entropy loss, we also in- attention matrix as described in Velickoviˇ c´ et al. troduce Breadth-First Search (BFS) Loss (Xiao g×g (2017) AG = {αi,j} ∈ R , where αi,j repre- et al., 2019) which is a weakly supervised loss to sents the information that will be assigned from further assist the training procedure. Given the entity i to it’s neighbor j, and obtain the one-layer answer entities, we conduct the BFS over the adja- information propagation over the sub-graph via: cent matrices of the entity graph we build to obtain heuristic masks as a weak supervision signal. The E1 = ReLU(AG · Esub) (9) BFS loss is calculated via binary cross-entropy loss between the predicted soft masks M in GCN-based The computation from Eq.9 can be repeated for Entity-aware Answer Encoder (Section 2.1.2) and multiple times to obtain multi-hop entity represen- the heuristic masks using Eq. 14 to encourage the tation EM . model to learn the answer-aware dynamic entity graph better. Multi-hop Answer Encoding we use bi- attention mechanism (Seo et al., 2016) regarding Loss = LCrossEntropy + λLBFS (14) 4639 n-gram Answer- Model QBLEU4 BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-L METEOR ability RefNet1 29.79 19.58 14.41 11.10 30.94 18.59 51.80 70.40 (Nema et al., 2019) MP-GSN∗ 34.38 23.00 17.05 13.18 31.85 19.67 48.10 64.60 (Zhao et al., 2018) MulQG 40.08 26.58 19.61 15.11 35.35 20.24 53.90 72.70 MulQG + BFS loss 40.15 26.71 19.73 15.20 35.30 20.51 54.00 72.80

Table 2: Performance comparison between our MultQG model and state-of-the-art QG models on HotpotQA test set. 1The results are obtained with the original implementation of RefNet model. We also follows all the hyper- parameter settings as they are described in the paper.

n-gram Answer- Setting QBLEU4 BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-L METEOR ability MulQG 40.08 26.58 19.61 15.11 35.35 20.24 53.90 72.70 (our model) MulQG 37.55 25.44 18.95 14.70 34.21 20.56 53.60 72.10 (1-layer GCN) w/o GEAEnc 36.62 24.80 18.50 14.36 33.53 20.39 52.10 70.50 w/o GEAEnc + ACEnc 37.85 26.19 20.15 16.21 33.35 17.86 53.40 71.90 w/o ERG 36.33 24.47 18.14 14.01 33.44 20.28 53.20 71.70 w/o GEAEnc + ACEnc + ERG 34.01 22.95 17.09 13.26 31.90 19.90 52.40 70.70

Table 3: Ablation Study of QG performances on HotpotQA test set, with different encoder modules removed. (Here GEAEnc: Graph-based Entity-aware Answer Encoder, ACEnc: Answer-aware Context Encoder†, ERG: Encoder Reasoning Gate) where λ here is a heuristic number and can be because of their high relevance with our task and selected using cross-validation. relatively superior performance:

3 Experiment MP-GSN is the first QG model to demonstrate 3.1 Dataset a large improvement with paragraph-level inputs To demonstrate the performance of our model, we for single-hop QG proposed by Zhao et al.(2018). conduct the experiments using HotpotQA (Yang While they conducted their experiments on SQuAD et al., 2018) dataset in an opposite manner. In the (Rajpurkar et al., 2016), we use exactly the same QG task, paragraphs and the answers are consid- experiment settings provided in their configuration ered as input, while the corresponding questions file on HotpotQA dataset. are the expected output. HotpotQA is a multi- hop question answering dataset, which contains Wikipedia-based question-answer pairs, with each RefNet is the first work that has reported results question requiring multi-hop reasoning across mul- on HotpotQA dataset for QG proposed by Nema tiple paragraphs to infer the answer. There are et al.(2019). However, their inputs based on the mainly two types of multi-hop reasoning in the gold supporting sentences, which contains the facts HotpotQA dataset: bridge and comparison. Fo- related to the multi-hop question, and no paragraph- cusing on the multi-hop ability of our model, we level results have been shown. We experiment with filter out all the yes/no data samples in the dataset the code they released on paragraphs-level, and test and run our experiments using the remaining corre- their model’s performance on both their validation sponding train and test set, which consists of 73k set and test set of HotpotQA dataset. questions in the training set and 8k in the test set. We also fine-tuned large pre-trained lan- guage models UniLM (Dong et al., 2019) and 3.2 Baselines BART (Lewis et al., 2019) on the multi-hop QG Since multi-hop QG has been under explored so task as comparison benchmark, to further show the far, there are very few existing baselines for our effectiveness of our method. The details and the comparison. We choose the following two models results will be covered in Appendix.

4640 Model Fluency Answerability Completeness Multi-hop Baseline 2.26 (0.50) 2.08 (0.87) 2.30 (0.79) 51.5% Ours 2.46 (0.43) 2.49 (0.61) 2.83 (0.33) 72.3% Human 2.57 (0.43) 2.67 (0.41) 2.86 (0.26) 81.2%

Table 4: The results of Human Evaluation. The mean values and the standard deviations of the first three evaluation scores, along with the percentage of questions assessed as multi-hop type are shown above.

3.3 Implementation Details regard to all those measuring metrics, which indi- cates that the multi-hop procedure can significantly Our word embeddings are initialized by boost the quality of the encoding representations glove.840B.300d (Pennington et al., 2014) and thus improve the multi-hop question genera- and we keep our vocab size as 45000. We use tion performance. Also the BFS loss can further two-layer bi-directional LSTMs for encoder and improve the system performance by encouraging two-layer uni-directional LSTMs for decoder, and learning the answer-aware dynamic entity graph the hidden size is 300 for all the models. We use better, which is a key and bottleneck module in the stochastic gradient descent (SGD) as the optimizer. MulQG model. The initial learning rate is 0.1, and it is reduced during the training stage using a cosine annealing 3.4.3 Ablation Study scheduler (Loshchilov and Hutter, 2016). The To further evaluate and investigate the performance batch size is 12 and the beam size is 10. We set of different components in our model, we perform the dropout probability for LSTM to 0.2 and 0.3 the ablation study. As we can see from Table3, for GCN. The maximum number of epochs is set both the GCN-based entity-aware answer encoder to 20. We set the maximum number of entities module and Gated Context Reasoning module are in each context to 80, and we use a two-layer important to the model. Each of them provides a GCN in our GCN-based answer encoder module. relative contribution of 2%-3% for overall perfor- After training the model for 10 epochs, we further mance improvement. fine-tune the MulQG model with the help of BFS loss, where the λ in Eq.14 is set to 0.5. w/o GEAEnc: Without GCN-based Entity- aware Answer Encoder, answer-related multi-hop 3.4 Automatic Evaluation evidence information cannot be identified. With- out multi-hop answer encoding being updated, next 3.4.1 Metrics step’s answer-aware context encoding will be af- We use the metrics in previous work on single-hop fected and thus the performance will drop a lot. QG to evaluate the generation performance of our model, with n-gram similarity metrics BLEU1 (Pa- w/o GEAEnc + ACEnc: The performance con- pineni et al., 2002), ROUGE-L (LIN, 2004), and tinues to decrease but not that much. This matches METEOR using the package released in Lavie with our expectation, since without an informative and Denkowski(2009). We also quantify the input A1 containing multi-hop information from QBLEU4 (Nema and Khapra, 2018a) and answer- the GCN-based Entity-aware Answer Encoder, the † ability score of our models, which was shown to Answer-aware Context Encoder cannot generate correlate significantly better with human judge- an informative C2. Thus remove it won’t hurt the ments (Nema and Khapra, 2018b). performance that much. w/o ERG: When we remove the Encoder Rea- 3.4.2 Results and Analysis soning Gate, the performance drops by around 3% Table2 shows the performance of various models in BLEU-1. This also matches our intuition since on the HotpotQA test set. We report the both results without effective feature reasoning and fusion, all of the experiments on our proposed model before the previous encoders cannot generate effective rep- and after fine-tuning with auxiliary BFS loss. As resentations. Thus the generation performance will it’s shown in the table, our MulQG model perform be affected. much better than the two baselines methods, with w/o GEAEnc + ACEnc + ERG: Without the 1https://github.com/Maluuba/nlg-eval three modules, the performance directly drops to

4641 0 -0.0018 House of Many Ways Example I Paragraph A: House of Many Ways is a young adult fantasy novel written by Diana Wynne Jones. 0 0.0007 Diana Wynne Jones :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: :::The:::::story::is:::set::in::::the :::::same world:::::::as::::::::“Howl’s::::::::Moving ::::::Castle” and “Castle in the Air”. 0 -0.0004 Howl's Moving Castle Paragraph B: Howl’s Moving Castle is a fantasy novel by British author Diana Wynne Jones. ... In 0 -0.0017 Castle in the Air 2004 it was adapted as an animated film of the same name, which was nominated for the academy award for best-animated feature. 0 -0.0010 Howl's Moving Castle Answer: academy award for best animated feature 0 0.0012 British Baseline: house of many ways is a young adult fantasy novel written by diana wynne jones , the 0 0.0009 Diana Wynne Jones story is set in the same world as ” howl ’s moving castle Ours: what award was the author of the book house of many ways nominated for ? Academy Award for Best 1 0.0007 Human: house of many ways is a young adult fantasy novel set in the same world as a novel Animated Feature that was adapted as an animated film of the same name and nominated for what ? Example II 1 -0.0028 Prudence Jane Goward Paragraph A: Prudence Jane Goward ( born 2 September 1952 in Adelaide ), an Australian politician, ... she has previously served as the minister for mental health, minister for medical 0 0.3688 Australian research, and assistant minister for health between April 2015 and January 2017. 0 -0.4160 Mental Health ... Goward is a member of the new south wales legislative assembly representing Goulburn for the liberal party of Australia since 2007. 0 -0.0602 Medical Research Paragraph B: Goulburn is an electoral district of the legislative assembly in the Australian state of New South Wales :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 0 0.0361 ::::new :::::south::::::wales.::It::is:::::::::::represented ::by::::Pru:::::::Goward:::of :::the::::::liberal:::::party:::of :::::::::Australia. Legislative Assembly Answer: jane goward 0 0.0170 Goulburn Baseline: goulburn is an electoral district of the legislative assembly in the australian state of 0 -0.2563 Liberal Party of Australia new south wales , it is represented by pru goward of the liberal party of australia Ours: which member of the electoral district of goulburn has previously served as the minister 0 0.1605 2007 for mental health? Human: which australian politician represented electoral district of goulburn Example III

Paragraph A: Jeremy Lee Renner (born January 7, 1971) is an :::::::::American:::::actor. ... He was nominated 0 0.0048 Jeremy Lee Renner for the academy award for best supporting actor for his much-praised performance in 0 0.1639 American “The Town”. Paragraph B: Arrival is a 2016 American science fiction film directed by Denis Villeneuve ... It Academy Award for Best :::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 0 -0.0356 stars Amy Adams, Jeremy Renner, and Forest Whitaker, ... Supporting Actor Answer: jeremy renner 0 0.4070 2016 Baseline: which american actor starred in the 2016 american science fiction film directed by 0 0.1450 American denis villeneuve ? Ours: which star of the movie arrival was nominated for the academy award for best sup- 0 0.5070 Denis Villeneuve porting actor for his performance in “ the town ”? 0 0.4875 Amy Adams Human: name the actor who has acted in the film arrival and who has been nominated for the 1 1.0000 Jeremy Renner academy award for best supporting actor for the film “ the town ” ?

Figure 3: Case study of three examples from the HotpotQA test set. The left part of the figure shows the importance of the entitie nodes, where he left column in red indicates the answer entities and the right colume in blue displays the importance of the entities of graph reasoning at the starting point by the shade of color. The tables show the generated questions from different models along with the corresponding paragraphs and the answer. Moreover, we

highlight the reasoning paths of our proposed model in green for a more intuitive display. We also use:::::wavy ::::lines to mark out the snippets of the paragraphs that the questions generated by the MP-GSN model derive from.

single-hop QG system level, which proves the con- the question, while Completeness only focuses on tributions of the whole proposed model. the sentence completeness. Answerability mainly indicates the relationship between the answers and MulQG (1-layer GCN): When apply 1-layer the generated questions. For the first three index, GCN and only allow information propagation being the score for each data sample could be chosen limited to each node’s neighbor, the answer-related from {1,2,3} in comparison with the other samples evidences might not be able to be fully obtained, generated from the other two models with the same thus the performance are not as good as our 2-layer input, where a higher score indicates a better perfor- GCN-based model. mance on that matrix, For the multi-hop evaluation, we only carry out binary discrimination. We ran- 3.5 Human Evaluation domly sample 100 data samples from the test set. Human evaluation is conducted to further analyze Ten annotators are asked in total to evaluate them the performance of our model (Table4). We com- on the aforementioned four metrics. Each sample pare the generated questions from MP-GSN model, is evaluated by three different annotators. our model and gold ones on four metrics: Fluency, To present a more convincing analysis, we con- Completeness, Answerability and whether the gen- duct the t-test on the human evaluation results. All erated questions are multi-hop question or not. the reported results between our proposed model Fluency emphasizes the grammar correctness of and the baseline are statistically significant with a p-

4642 value<0.05. We also calculate the inter-annotator tively fluent outputs. agreement using Fleiss’ Kappa (Fleiss, 1971) mea- sure and achieve high agreement scores on the pro- 4 Related Work posed model. We observe that our MulQG model Question Generation Early single-hop QG use largely outperforms the MP-GSN model in terms rule-based methods to transform sentences to ques- of Fluency, Answerability and Completeness with tions (Labutov et al., 2015; Lindberg et al., 2013). more stable quality. Moreover, our model tends Recently neural network based approaches adopt to generate more complete question and achieve the sequence-to-sequence (Seq2Seq) based frame- comparable completeness score with the human work, with different types of encoders and de- annotations. For the multi-hop evaluation, we out- coders have been designed (Zhou et al., 2017; perform the strongest baseline by 20.8% on the Nema et al., 2019; Zhao et al., 2018). Zhao et al. multi-hop evaluation. (2018) proposes to incorporate paragraph level con- tent by using Gated Self Attention and Maxout 3.6 Case Study pointer networks, while Nema et al.(2019) pro- We present a case study comparing between the poses a model which contains two decoders where strong baseline MP-GSN model, our model and the the second decoder refines the question generated human annotations. Three cases are presented in by the first decoder using . Figure3. In the first two examples, it’s clearly There are different ways to attend answer infor- shown in the examples that the baseline model mation to the context encoding stage. Zhou et al. tends to copy a contiguous and long span of context (2017) and Liu et al.(2019) directly concatenate as the generation, while our proposed model per- answer tagging with the context embedding, while forms better in this aspect. we observe that since Nema et al.(2019) also applies bi-attention mech- the supporting fact information is not leveraged anism proposed by Seo et al.(2016) for QA to do in our method, the generated questions from our answer-aware context representation. Chen et al. model may show a different reasoning path with (2019) is the most recent work which proposes that for the gold question. There could be multiple a reinforcement learning based graph-to-sequence ways to construct a multi-hop question given the (Graph2Seq) model which use a bidirectional graph same input. So the generations may be much dif- encoder on a syntax-based graph for QG, while they ferent from the gold label, although they are still still focus on the single-hop QG. correct questions, which could be indicated from Multi-hop QA Popular Graph Nueral Network the first two examples. This phenomenon causes a (GNN) frameworks, such as graph convolutional lower score in automatic matrices, such as BLEU networks (Kipf and Welling, 2016), graph atten- and METEOR, but we note that the generated ques- tion network (Velickoviˇ c´ et al., 2017), and graph tions still follow the multi-hop scheme and can be recurrent network (Song et al., 2018) have been ex- answered with the given answers. plored and showed promising results on multi-hop In Example III, we show the data sample in an QA task that requiring reasoning. Xiao et al.(2019) easier mode. In this case, while the answer entity is proposes a dynamic fused graph network to work in one paragraph, a similar entity (annotated with on multi-hop QA on the HotpotQA dataset. De Cao orange color) also appears in another paragraph, et al.(2018) proposes an entity-GCN method to rea- which gives a strong clue of the reasoning path and son over across multiple documents for multi-hop makes it easier for the model to attend to both para- QA on the WIKIHOP dataset (Welbl et al., 2018). graphs. The generations from our model and the human annotation show almost the same reason- 5 Conclusion ing path. However, we observe that the question generated by MP-GSN model still tends to attend Multi-hop QG task is more challenging and worthy to the entities that are closer to the answer entities. of exploration compared to conventional single- Moreover, for the human annotation in Example I hop QG. To address the additional challenges in and Example III, the gold questions have a problem multi-hop QG, we propose MulQG, which does with fluency, which is harmful for the QG mod- multi-hop context encoding with Graph Convolu- els, but interestingly, even with training using these tional Network and encoding fusion via a Gated labels, our model is still capable of generating rela- Reasoning module. To the best of our knowledge,

4643 we are the first to tackle the challenge of multi-hop Igor Labutov, Sumit Basu, and Lucy Vanderwende. reasoning over paragraphs without any sentence- 2015. Deep questions without deep understanding. level information. The model performance on Hot- In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the potQA dataset demonstrates its effectiveness on 7th International Joint Conference on Natural Lan- aggregating scattered pieces of evidence across the guage Processing (Volume 1: Long Papers), pages paragraphs and fusing information effectively to 889–898. generate multi-hop questions. The strong reason- Alon Lavie and Michael J Denkowski. 2009. The ing ability of the Multi-hop Encoder in the MulQA meteor metric for automatic evaluation of machine model can potentially be leveraged in complex gen- translation. Machine translation, 23(2-3):105–115. eration tasks for the future work. Mike Lewis, Yinhan Liu, Naman Goyal, Mar- jan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. References Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and Hardik Chauhan, Asif Ekbal, Pushpak Bhattacharyya, comprehension. arXiv preprint arXiv:1910.13461. et al. 2020. Reinforced multi-task approach for multi-hop question generation. arXiv preprint C-Y LIN. 2004. Rouge: A package for automatic arXiv:2004.02143. evaluation of summaries. In Proc. of Workshop on Text Summarization Branches Out, Post Conference Yu Chen, Lingfei Wu, and Mohammed J Zaki. 2019. Workshop of ACL 2004. Reinforcement learning based graph-to-sequence model for natural question generation. arXiv David Lindberg, Fred Popowich, John Nesbit, and Phil preprint arXiv:1908.04942. Winne. 2013. Generating natural language ques- tions to support learning on-line. In Proceedings of Beth Davey and Susan McBride. 1986. Effects the 14th European Workshop on Natural Language of question-generation training on reading com- Generation, pages 105–114. prehension. Journal of Educational Psychology, 78(4):256. Bang Liu, Mingjun Zhao, Di Niu, Kunfeng Lai, Yancheng He, Haojie Wei, and Yu Xu. 2019. Learn- Nicola De Cao, Wilker Aziz, and Ivan Titov. 2018. ing to generate questions by learningwhat not to gen- Question answering by reasoning across documents erate. In The World Wide Web Conference, pages with graph convolutional networks. arXiv preprint 1106–1118. ACM. arXiv:1808.09920. Ilya Loshchilov and Frank Hutter. 2016. Sgdr: Stochas- Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xi- tic gradient descent with warm restarts. arXiv aodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, preprint arXiv:1608.03983. and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understand- Preksha Nema and Mitesh M Khapra. 2018a. Towards ing and generation. In Advances in Neural Informa- a better metric for evaluating question generation tion Processing Systems, pages 13063–13075. systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Xinya Du, Junru Shao, and Claire Cardie. 2017. Learn- pages 3950–3959. ing to ask: Neural question generation for reading comprehension. In Proceedings of the 55th An- Preksha Nema and Mitesh M. Khapra. 2018b. Towards nual Meeting of the Association for Computational a better metric for evaluating question generation Linguistics (Volume 1: Long Papers), pages 1342– systems. In Proceedings of the 2018 Conference on 1352. Empirical Methods in Natural Language Processing, pages 3950–3959, Brussels, Belgium. Association Joseph L Fleiss. 1971. Measuring nominal scale agree- for Computational Linguistics. ment among many raters. Psychological bulletin, 76(5):378. Preksha Nema, Akash Kumar Mohankumar, Mitesh M Khapra, Balaji Vasan Srinivasan, and Balaraman Michael Heilman and Noah A. Smith. 2010. Good Ravindran. 2019. Let’s ask again: Refine network question! statistical ranking for question genera- for automatic question generation. In Proceedings tion. In Human Language Technologies: The 2010 of the 2019 Conference on Empirical Methods in Annual Conference of the North American Chap- Natural Language Processing and the 9th Interna- ter of the Association for Computational Linguistics, tional Joint Conference on Natural Language Pro- pages 609–617, Los Angeles, California. Associa- cessing (EMNLP-IJCNLP), pages 3305–3314. tion for Computational Linguistics. Boyuan Pan, Hao Li, Ziyu Yao, Deng Cai, and Huan Thomas N Kipf and Max Welling. 2016. Semi- Sun. 2019. Reinforced dynamic reasoning for con- supervised classification with graph convolutional versational question generation. arXiv preprint networks. arXiv preprint arXiv:1609.02907. arXiv:1907.12667.

4644 Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Caiming Xiong, Victor Zhong, and Richard Socher. Jing Zhu. 2002. Bleu: a method for automatic eval- 2016. Dynamic coattention networks for question uation of machine translation. In Proceedings of answering. arXiv preprint arXiv:1611.01604. the 40th annual meeting on association for compu- tational linguistics, pages 311–318. Association for Zhilin Yang, Peng Qi, Saizheng Zhang, , Computational Linguistics. William Cohen, Ruslan Salakhutdinov, and Christo- pher D Manning. 2018. Hotpotqa: A dataset for Jeffrey Pennington, Richard Socher, and Christopher D diverse, explainable multi-hop question answering. Manning. 2014. Glove: Global vectors for word rep- In Proceedings of the 2018 Conference on Empiri- resentation. In Proceedings of the 2014 conference cal Methods in Natural Language Processing, pages on empirical methods in natural language process- 2369–2380. ing (EMNLP), pages 1532–1543. Kaichun Yao, Libo Zhang, Tiejian Luo, Lili Tao, and Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad YanJun Wu. 2018. Teaching machines to ask ques- Shoeybi, and Bryan Catanzaro. 2020. Training ques- tions. In Proceedings of the 27th International Joint tion answering models from synthetic data. arXiv Conference on Artificial Intelligence, pages 4546– preprint arXiv:2002.09599. 4552. AAAI Press. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Shiyue Zhang and Mohit Bansal. 2019. Address- Percy Liang. 2016. Squad: 100,000+ questions for ing semantic drift in question generation for semi- machine comprehension of text. In Proceedings of supervised question answering. In Proceedings of the 2016 Conference on Empirical Methods in Natu- the 2019 Conference on Empirical Methods in Nat- ral Language Processing, pages 2383–2392. ural Language Processing and the 9th International Joint Conference on Natural Language Processing Thomas Scialom, Benjamin Piwowarski, and Jacopo (EMNLP-IJCNLP), pages 2495–2509. Staiano. 2019. Self-attention architectures for answer-agnostic neural question generation. In Pro- Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa ceedings of the 57th Annual Meeting of the Asso- Ke. 2018. Paragraph-level neural question genera- ciation for Computational Linguistics, pages 6027– tion with maxout pointer and gated self-attention net- 6032. works. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and pages 3901–3910. Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, arXiv:1611.01603. Hangbo Bao, and Ming Zhou. 2017. Neural ques- tion generation from text: A preliminary study. In Heung-Yeung Shum, Xiao-dong He, and Di Li. 2018. National CCF Conference on Natural Language From eliza to xiaoice: challenges and opportunities Processing and Chinese Computing, pages 662–671. with social chatbots. Frontiers of Information Tech- Springer. nology & Electronic Engineering, 19(1):10–26. Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. A graph-to-sequence model for amr-to-text generation. arXiv preprint arXiv:1805.02473. Duyu Tang, Nan Duan, Tao Qin, Zhao Yan, and Ming Zhou. 2017. Question answering and ques- tion generation as dual tasks. arXiv preprint arXiv:1706.02027. Petar Velickoviˇ c,´ Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903. Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transac- tions of the Association for Computational Linguis- tics, 6:287–302. Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. arXiv preprint arXiv:1905.06933.

4645 A Appendix A.1 Detailed Experiment Settings We run our experiments on 1 GeForce® GTX 1080 Ti GPU, with batch size to 12. The average run- time for our model is around 7500s for one epoch. The total numbers of parameters for our model is : 84250510, while we freeze the word embedding parameters, so our total number of parameters need to be optimized is 57250510. We run the baselines also on the same computing environment, using the configuration file they provided. For the Maxout Pointer baseline, we use a batch size of 16 to fit with our GPU memory.

A.2 Comparison with fine-tuning large pre-trained language models In order to further show the effectiveness of our method, we further fine-tuned UniLM (Dong et al., 2019) and BART (Lewis et al., 2019) on the multi- hop QG task. UniLM and BART has obtained state- of-the-art performance on the summarization tasks and also on question generation task on SQuAD (Rajpurkar et al., 2016) dataset. As we can see from Table A1, the performance of our model is on par with fine-tuning the large- pretrained models on the multihop QG tasks. While our model is much more light-weight and can pro- vide explicit reasoning interpretability.

4646 Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-L METEOR Finetune-UniLM(l48 p0 b1) 42.37 29.95 22.61 17.61 40.34 25.48 Finetune-BART(test.hypo.l32 p0 b5) 41.41 30.90 24.39 19.75 36.13 25.20 MulQG 40.08 26.58 19.61 15.11 35.35 20.24 MulQG + BFS loss 40.15 26.71 19.73 15.20 35.30 20.51

Table A1: Performance comparison between our MultQG model and fine-tuning state-of-the-art large pre-trained models on HotpotQA test set.

4647