Arxiv:1906.07132V1 [Cs.CL] 17 Jun 2019
Total Page:16
File Type:pdf, Size:1020Kb
Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA Yichen Jiang and Mohit Bansal UNC Chapel Hill fyichenj, [email protected] n o i Abstract t What was the father of Kasper Schmeichel s e voted to be by the IFFHS in 1992? u Multi-hop question answering requires a Q model to connect multiple pieces of evidence Kasper Peter Schmeichel (] ; born 5 November 1986) is a Danish professional footballer who plays as a scattered in a long context to answer the ques- g n i goalkeeper ... He is the son of former Manchester n s o tion. In this paper, we show that in the multi- c United and Danish international goalkeeper s o a e hop HotpotQA (Yang et al., 2018) dataset, D Peter Schmeichel. n i R a n h the examples often contain reasoning shortcuts e Peter Bolesław Schmeichel MBE (] ; born 18 C d l through which models can directly locate the o November 1963) is a Danish former professional answer by word-matching the question with a G footballer who played as a goalkeeper, and was voted sentence in the context. We demonstrate this the IFFHS World's Best Goalkeeper in 1992 and 1993. issue by constructing adversarial documents Edson Arantes do Nascimento (] ; born 23 October 1940), that create contradicting answers to the short- known as Pelé (] ), is a retired Brazilian professional footballer who played as a forward. In 1999, he was cut but do not affect the validity of the origi- r voted World Player of the Century by IFFHS. o t s c c nal answer. The performance of strong base- a o r t Kasper Hvidt (born 6 February 1976 in Copenhagen) D s line models drops significantly on our adver- i D is a Danish retired handball goalkeeper, who lastly played sarial evaluation, indicating that they are in- for KIF Kolding and previous Danish national team. ... deed exploiting the shortcuts rather than per- Hvidt was also voted as Goalkeeper of the Year forming multi-hop reasoning. After adversar- March 20, 2009, second place was Thierry Omeyer ... ial training, the baseline’s performance im- R. Bolesław Kelly MBE (] ; born 18 November 1963) proves but is still limited on the adversarial is a Danish former professional footballer who played as a Defender, and was voted the IFFHS evaluation. Hence, we use a control unit that Doc World's Best Defender in 1992 and 1993. dynamically attends to the question at differ- Adversarial ent reasoning hops to guide the model’s multi- Prediction: World's Best Goalkeeper (correct) hop reasoning. We show that this 2-hop model Prediction under adversary: IFFHS World's Best Defender trained on the regular data is more robust to Figure 1: HotpotQA example with a reasoning short- the adversaries than the baseline model. Af- cut, and our adversarial document that eliminates this ter adversarial training, this 2-hop model not shortcut to necessitate multi-hop reasoning. only achieves improvements over its counter- part trained on regular data, but also outper- forms the adversarially-trained 1-hop baseline. We hope that these insights and initial im- necessary to answer the question is concentrated arXiv:1906.07132v1 [cs.CL] 17 Jun 2019 provements will motivate the development of in a single sentence or located closely in a single new models that combine explicit composi- paragraph (Q: “What’s the color of the sky?”, Con- 1 tional reasoning with adversarial training. text: “The sky is blue.”, Answer: “Blue”). Such 1 Introduction datasets emphasize the role of matching and align- ing information between the question and the con- The task of question answering (QA) requires the text (“sky!sky, color!blue”). Previous works model to answer a natural language question by have shown that models with strong question- finding relevant information in a given natural lan- aware context representation (Seo et al., 2017; guage context. Most QA datasets require single- Xiong et al., 2017) can achieve super-human per- hop reasoning only, which means that the evidence formance on single-hop QA tasks like SQuAD 1Our code and data are publicly available at: (Rajpurkar et al., 2016, 2018). https://github.com/jiangycTarheel/ Adversarial-MultiHopQA Recently, several multi-hop QA datasets, such as QAngaroo (Welbl et al., 2017) and Hot- model is confused by our adversary and predicts potQA (Yang et al., 2018), have been proposed the wrong answer (“World’s Best Defender”). Our to further assess QA systems’ ability to perform experiments further reveal that when strong su- composite reasoning. In this setting, the informa- pervision of the supporting facts that contain the tion required to answer the question is scattered in evidence is applied, the baseline achieves a sig- the long context and the model has to connect mul- nificantly higher score on the adversarial dev set. tiple evidence pieces to pinpoint to the final an- This is because the strong supervision encourages swer. Fig.1 shows an example from the HotpotQA the model to not only locate the answer but also dev set, where it is necessary to consider infor- find the evidence that completes the first reason- mation in two documents to infer the hidden rea- ing hop and hence promotes robust multi-hop rea- son of soning chain “Kasper Schemeichel −−−−! Peter soning behavior from the model. We then train voted as Schemeichel −−−−−! World’s Best Goalkeeper” the baseline with supporting fact supervision on that leads to the final answer. However, in this our generated adversarial training set (adv-train) example, one may also arrive at the correct an- and observe significant improvement on adv-dev. swer by matching a few keywords in the question However, the result is still poor compared to the (“voted, IFFHS, in 1992”) with the corresponding model’s performance on the regular dev set be- fact in the context without reasoning through the cause this single-hop model is not well-designed first hop to find “father of Kasper Schmeichel”, to perform multi-hop reasoning. as neither of the two distractor documents con- To motivate and analyze some new multi-hop tains sufficient distracting information about an- reasoning models, we propose an initial architec- other person “voted as something by IFFHS in ture by incorporating the recurrent control unit 1992”. Therefore, a model performing well on the from Hudson and Manning(2018), which dynam- existing evaluation does not necessarily suggest its ically computes a distribution over question words strong compositional reasoning ability. To truly at each reasoning hop to guide the multi-hop bi- promote and evaluate a model’s ability to perform attention. In this way, the model can learn to multi-hop reasoning, there should be no such “rea- put the focus on “father of Kasper Schmeichel” at soning shortcut” where the model can locate the the first step and then attend to “voted by IFFHS answer with single-hop reasoning only. This is a in 1992” in the second step to complete this 2- common pitfall when collecting multi-hop exam- hop reasoning chain. When trained on the regu- ples and is difficult to address properly. lar data, this 2-hop model outperforms the single- In this work, we improve the original HotpotQA hop baseline in the adversarial evaluation, indi- distractor setting2 by adversarially generating bet- cating improved robustness against adversaries. ter distractor documents that make it necessary to Furthermore, this 2-hop model, with or without perform multi-hop reasoning in order to find the supporting-fact supervision, can benefit from ad- correct answer. As shown in Fig.1, we apply versarial training and achieve better performance phrase-level perturbations to the answer span and on adv-dev compared to the counterpart trained the titles in the supporting documents to create the with the regular training set, while also outper- adversary with a new title and a fake answer to forming the adversarially-trained baseline. Over- confuse the model. With the adversary added to all, we hope that these insights and initial improve- the context, it is no longer possible to locate the ments will motivate the development of new mod- correct answer with the single-hop shortcut, which els that combine explicit compositional reasoning now leads to two possible answers (“World’s Best with adversarial training. Goalkeeper” and “World’s Best Defender”). We 2 Adversarial Evaluation evaluate the strong “Bi-attention + Self-attention” model (Seo et al., 2017; Wang et al., 2017) from 2.1 The HotpotQA Task Yang et al.(2018) on our constructed adversar- The HotpotQA dataset (Yang et al., 2018) is ial dev set (adv-dev), and find that its EM score composed of 113k human-crafted questions, each drops significantly. In the example in Fig.1, the of which can be answered with facts from two Wikipedia articles. During the construction of 2HotpotQA has a fullwiki setting as an open-domain QA task. In this work, we focus on the distractor setting as it pro- the dataset, the crowd workers are asked to come vides a less noisy environment to study machine reasoning. up with questions requiring reasoning about two Question: Where is the company that Sachin Warrier worked for as a software engineer headquartered? Supporting Doc 1: Sachin Warrier Original answer: Title: Sachin Warrier is a playback singer and composer in Mumbai Tata Consultancy Services the Malayalam cinema industry from Kerala. (Step 1) (Step 2) He became notable with the song "Muthuchippi Poloru" Generate fake Sample from the film Thattathin Marayathu. He made his debut answer title with the movie Malarvaadi Arts Club. He was working Advesarial answer: New title: as a software engineer in Tata Consultancy Services Delhi Valencia Street Circuit in Kochi.