Active Reading Comprehension: a Dataset for Learning the Question-Answer Relationship Strategy
Total Page:16
File Type:pdf, Size:1020Kb
Active Reading Comprehension: A dataset for learning the Question-Answer Relationship strategy Diana Galvan Tohoku University Graduate School of Information Sciences Sendai, Japan [email protected] Abstract evaluation can result in the false impression of weak understanding (i.e., misunderstanding of the Reading comprehension (RC) through ques- text, the question, or both) or the absence of re- tion answering is a useful method for evaluat- quired knowledge. However, the system could ing if a reader understands a text. Standard ac- curacy metrics are used for evaluation, where a have failed to arrive at the correct answer for some high accuracy is taken as indicative of a good other reasons. For example, consider the reading understanding. However, literature in quality comprehension task shown in Figure1. For the learning suggests that task performance should question “What were the consequences of Eliza- also be evaluated on the undergone process to beth Choy's parents and grandparents being ‘more answer. The Question-Answer Relationship advanced for their times’?” the correct answer (QAR) is one of the strategies for evaluating is in the text but it is located in different sen- a reader’s understanding based on their ability to select different sources of information de- tences. If the system only identifies “They wanted pending on the question type. We propose the their daughters to be educated” as an answer, it creation of a dataset to learn the QAR strategy would be judged to be incorrect when it did not with weak supervision. We expect to comple- fail at finding the answer, it failed at connecting ment current work on reading comprehension it with the fact “we were sent to schools away by introducing a new setup for evaluation. from home” (linking problem). Similarly, any an- swer the system infers from the text for the ques- 1 Introduction tion “What do you think are the qualities of a war Computer system researchers have long been try- heroine?” would be wrong because the answer is ing to imitate human cognitive skills like mem- not in the text, it relies exclusively on background ory (Hochreiter and Schmidhuber, 1997; Chung knowledge (wrong choice of information source). et al., 2014) and attention (Vaswani et al., 2017). We propose to adopt the thesis that reading is not These skills are essential for a number of Natural a passive process by which readers soak up words Language Processing (NLP) tasks including read- and information from the text, but an active pro- ing comprehension (RC). Until now, the method cess1 by which they predict, sample, and confirm for evaluating a system’s understanding imitated or correct their hypotheses about the text (Weaver, the common classroom setting where students are 1988). One of these hypotheses is which source evaluated based on their number of correct an- of information the question requires. The reader swers. In the educational assessment literature might think it is necessary to look in the text to this is known as product-based evaluation and is then realize she could have answered without even one of the performance-based assessments types reading or, on the contrary, try to think of an an- (McTighe and Ferrara, 1994). However, there swer even though it is directly in the text. For is an alternative form: process-based evaluation. this reason, Raphael(1982) devised the Question- Process-based evaluation does not emphasize the Answer Relation (QAR) strategy, a technique to output of the activity. This assessment aims to help the reader decide the most suitable source know the step-by-step procedure followed to re- of information as well as the level of reasoning solve a given task. 1Not to be confused with active learning, a machine learn- When a reading comprehension system is not ing concept for a series of methods that actively participate in able to identify the correct answer, product-based the collection of training examples. (Thompson et al., 1999) 106 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 106–112 Florence, Italy, July 28 - August 2, 2019. c 2019 Association for Computational Linguistics Interviewer: Mrs. Choy, would you like to tell us something about your background before the Japanese invasion? In the text Elizabeth Choy: 1. Oh, it will go back quite a long way, you Right there question: When did Elizabeth Choy come know, because I came to Singapore in December 1929 for to Singapore for higher education? higher education. 2. I was born in North Borneo which is Sabah Answer: December 1929 now. 3. My ancestors were from China. 4. They went to Hong Kong, and from Hong Kong, they came to Malaysia. 5. They Think and search: What were the consequences of started plantations, coconut plantations, rubber plantations. 6. Elizabeth Choy’s parents and grandparents being ‘more My parents and grandparents were more advanced for their advanced for their times’? times and when they could get on a bit, they wanted their Answer: They wanted their daughters to be educated daughters to be educated too. so they sent them to schools away from home. 7. So, we were sent to schools away from home. 8. First, we went to Jesselton which is Kota Kinabalu now. 9. There was a girls’ school run by English missionaries. 10. My aunt and I In my head were there for half a year. 11. And then we heard there was another better school – bigger school in Sandakan also run by Author and me: What do you think of Elizabeth Choy's English missionaries. 12. So we went to Sandakan as character from the interview? boarders. On my own: What do you think are the qualities of a 13. When we reached the limit, that is, we couldn’t study war heroine? anymore in Malaysia, we had to come to Singapore for higher education. And I was very lucky to be able to get into the Convent of the Holy Infant Jesus where my aunt had been for a year already. Figure 1: Example of reading comprehension applying the Question-Answer Relationship strategy to categorize the questions. needed based on the question type. type of reasoning. As a result, the good perfor- In this work, we introduce a new evaluation mance of a model drops when it is tested on a setting for reading comprehension systems. We different domain. For example, the knowledge- overview the QAR strategy as an option to move able reader of Mihaylov and Frank(2018) was beyond a scenario where only the product of com- trained to solve questions from children narra- prehension is evaluated and not the process. We tive texts that require commonsense knowledge, discuss our proposed approach to create a new achieving competitive results. However, it did not dataset for learning the QAR strategy using exist- perform equally well when tested on basic science ing reading comprehension datasets. questions that also required commonsense knowl- edge (Mihaylov et al., 2018). Systems have been 2 Related work able to match human performance but it has also been proven by Jia and Liang(2017) that they can Reading comprehension is an active research area be easily fooled with adversarial distracting sen- in NLP. It is composed of two main components: tences that would not change the correct answer text and questions. This task can be found in many or mislead humans. possible variations: setups where no options are given and the machine has to come up with an The motivation behind introducing adversarial answer (Yang et al., 2015; Weston et al., 2015; examples for evaluating reading comprehension is Nguyen et al., 2016; Rajpurkar et al., 2016) and to discern to what extent systems truly understand setups where the question has multiple choices language. Mudrakarta et al.(2018) followed the and the machine needs to choose one of them steps of Jia and Liang(2017) proposing a tech- (Richardson et al., 2013; Hill et al., 2015; On- nique to analyze the sensitivity of a model to ques- ishi et al., 2016; Mihaylov et al., 2018). In either tion words, with the aim to empower investigation case, standard accuracy metrics are used to eval- of reading models’ performance. With the same uate systems based on the number of correct an- goal in mind, we propose a process-based evalu- swers retrieved; a product-based evaluation. In ad- ation that will favor a closer examination of the dition to this evaluation criteria, the current read- process taken by current systems to solve a read- ing comprehension setting constrains systems to ing comprehension task. In the educational assess- be trained on a particular domain for a specific ment literature, this approach is recommended to 107 identify the weaknesses of a student. If we trans- thinking is required to relate the information pro- fer this concept to computers, we would be able vided in the text with background information. Fi- to focus on the comprehension tasks a computer is nally, On my own questions ask the reader to only weak in, regardless of the data in which the system use their background knowledge to come up with has been trained. an answer. Figure1 shows how QAR is applied to a reading comprehension task. Note that for 3 Question-answer relationship both In the text questions, one can easily match the Raphael(1982) devised the Question-Answer Re- words in the question with the words in the text. lationship as a way of improving children reading However, Think and search goes beyond matching performance across grades and subject areas. This ability; the reader should be able to conclude that approach reflects the current concept of reading as the information in sentences 6 and 7 are equally re- an active process influenced by characteristics of quired to answer the question being asked.