Arxiv:1904.09679V3 [Cs.CL] 17 Dec 2019 Hind Their English Counterparts

Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension Kai Sun1∗ Dian Yu2 Dong Yu2 Claire Cardie1 1Cornell University, Ithaca, NY 2Tencent AI Lab, Bellevue, WA [email protected], {yudian, dyu}@tencent.com, [email protected] Abstract Machine reading comprehension (MRC) tasks have attracted substantial attention from both Machine reading comprehension tasks re- academia and industry. These tasks require a ma- quire a machine reader to answer questions chine reader to answer questions relevant to a relevant to the given document. In this pa- given document provided as input (Poon et al., per, we present the first free-form multiple- 2010; Richardson et al., 2013). In this paper, we Choice Chinese machine reading Compre- hension dataset (C3), containing 13,369 focus on free-form multiple-choice MRC tasks — documents (dialogues or more formally given a document, select the correct answer option written mixed-genre texts) and their associ- from all options associated with a free-form ques- ated 19,577 multiple-choice free-form question, which is not limited to a single question type tions collected from Chinese-as-a-second- such as cloze-style questions formed by removing language examinations. a span or a sentence in a text (Hill et al., 2016; Ba- We present a comprehensive analysis of the jgar et al., 2016; Mostafazadeh et al., 2016; Xie prior knowledge (i.e., linguistic, domain- et al., 2018; Zheng et al., 2019) or close-ended specific, and general world knowledge) questions that can be answered with a minimal an- needed for these real-world problems. We swer (e.g., yes or no (Clark et al., 2019)). implement rule-based and popular neural methods and find that there is still a sig- Researchers have developed a variety of free- nificant performance gap between the best form multiple-choice MRC datasets that contain performing model (68:5%) and human read- a significant percentage of questions focusing on ers (96:0%), especially on problems that re- the implicitly expressed facts, events, opinions, or quire prior knowledge. We further study emotions in the given text (Richardson et al., 2013; the effects of distractor plausibility and data Lai et al., 2017; Ostermann et al., 2018; Khashabi augmentation based on translated relevant et al., 2018; Sun et al., 2019a). Generally, we re- datasets for English on model performance. We expect C3 to present great challenges quire the integration of our own prior knowledge to existing systems as answering 86:8% of and the information presented in the given text to questions requires both knowledge within answer these questions, posing new challenges for and beyond the accompanying document, MRC systems. However, until recently, progress and we hope that C3 can serve as a platform in the development of techniques for addressing to study how to leverage various kinds of this kind of MRC task for Chinese has lagged be- prior knowledge to better understand a given arXiv:1904.09679v3 [cs.CL] 17 Dec 2019 hind their English counterparts. A primary reason written or orally oriented text. C3 is avail- able at https://dataset.org/c3/. is that most previous work focuses on constructing MRC datasets for Chinese in which most answers are either spans (Cui et al., 2016; Li et al., 2016; 1 Introduction Cui et al., 2018a; Shao et al., 2018) or abstractive texts (He et al., 2017) merely based on the infor- “Language is, at best, a means of directing mation explicitly expressed in the provided text. others to construct similar-thoughts from their own prior knowledge.” With a goal of developing similarly challenging, but free-form multiple-choice datasets, and pro- Adams and Bruce(1982) moting the development of MRC techniques for ∗ Part of this work was conducted when K. S. was an Chinese, we introduce the first free-form multiple- intern at the Tencent AI Lab, Bellevue, WA. Choice Chinese machine reading Comprehension dataset (C3) that not only contains multiple types man performance. Our hope is that C3 can serve of questions but also requires both the informa- as a platform for researchers interested in study- tion in the given document and prior knowledge ing how to leverage different types of prior knowl- to answer questions. In particular, for assessing edge for in-depth text comprehension and facili- model generalizability across different domains, tate future work on crosslingual and multilingual 3 3 C includes a dialogue-based task CD in which the machine reading comprehension. given document is a dialogue, and a mixed-genre 3 2 Related Work task CM in which the given document is a mixed- genre text that is relatively formally written. All Traditionally, MRC tasks have been designed to problems are collected from real-world Chinese- be text-dependent (Richardson et al., 2013; Her- as-a-second-language examinations carefully de- mann et al., 2015): they focus on evaluating com- signed by experts to test the reading comprehen- prehension of machine readers based on a given sion abilities of language learners of Chinese. text, typically by requiring a model to answer We perform an in-depth analysis of what kinds questions relevant to the text. This is distinguished of prior knowledge are needed for answering from many question answering (QA) tasks (Fader questions correctly in C3 and two representative et al., 2014; Clark et al., 2016), in which no ground free-form multiple-choice MRC datasets for En- truth document supporting answers is provided glish (Lai et al., 2017; Sun et al., 2019a), and to with each question, making them relatively less what extent. We find that solving these general- suitable for isolating improvements to MRC. We domain problems requires linguistic knowledge, will first discuss standard MRC datasets for En- domain-specific knowledge, and general world glish, followed by MRC/QA datasets for Chinese. knowledge, the latter of which can be further bro- English. Much of the early MRC work focuses ken down into eight types such as arithmetic, con- on designing questions whose answers are spans notation, cause-effect, and implication. These from the given documents (Hermann et al., 2015; free-form MRC datasets exhibit similar character- Hill et al., 2016; Bajgar et al., 2016; Rajpurkar istics in that (i) they contain a high percentage et al., 2016; Trischler et al., 2017; Joshi et al., (e.g., 86:8% in C3) of questions requiring knowl- 2017). As a question and its answer are usu- edge gained from the accompanying document as ally in the same sentence, state-of-the-art meth- well as at least one type of prior knowledge and (ii) ods (Devlin et al., 2019) have outperformed hu- regardless of language, dialogue-based MRC tasks man performance on many such tasks. To in- tend to require more general world knowledge and crease task difficulty, researchers have explored less linguistic knowledge compared to tasks ac- a number of options including adding unanswer- companied with relatively formally written texts. able (Trischler et al., 2017; Rajpurkar et al., Specifically, compared to existing MRC datasets 2018) or conversational (Choi et al., 2018; Reddy for Chinese (He et al., 2017; Cui et al., 2018b), C3 et al., 2019) questions that might require rea- requires more general world knowledge (57:3% of soning (Zhang et al., 2018a), and designing ab- questions) to arrive at the correct answer options. stractive answers (Nguyen et al., 2016; Kociskˇ y` We implement rule-based and popular neural et al., 2018; Dalvi et al., 2018) or (question, an- approaches to the MRC task and find that there swer) pairs that involve cross-sentence or cross- is still a significant performance gap between the document content (Welbl et al., 2018; Yang et al., best-performing model (68:5%) and human read- 2018). In general, most questions concern the ers (96:0%), especially on problems that require facts that are explicitly expressed in the text, mak- prior knowledge. We find that the existence of ing these tasks possible to measure the level of wrong answer options that highly superficially fundamental reading skills of machine readers. match the given text plays a critical role in in- Another research line has studied MRC tasks, creasing the difficulty level of questions and the usually in a free-form multiple-choice form, con- demand for prior knowledge. Furthermore, addi- taining a significant percentage of questions that tionally introducing 94k training instances based focus on the understanding of the implicitly ex- on translated free-form multiple-choice datasets pressed facts, events, opinions, or emotions in the for English can only lead to a 4:6% improvement given text (Richardson et al., 2013; Mostafazadeh in accuracy, still far from closing the gap to hu- et al., 2016; Khashabi et al., 2018; Lai et al., 2017; Chinese Task Document Question Answer Question English Counterpart Genre Type Type Size Question Answering QS (Cheng et al., 2016) N/A free-form multiple-choice 0:6K ARC (Clark et al., 2016) MCQA (Guo et al., 2017a) N/A free-form multiple-choice 14:4K ARC (Clark et al., 2016) MedQA (Zhang et al., 2018b) N/A free-form multiple-choice 235:2K ARC (Clark et al., 2016) GeoSQA (Huang et al., 2019) N/A free-form multiple-choice 4:1K DD (Lally et al., 2017) Machine Reading Comprehension PD (Cui et al., 2016) news cloze extractive 876:7K CNN/Daily (Hermann et al., 2015) CFT (Cui et al., 2016) books cloze extractive 3:6K CBT (Hill et al., 2016) CMRC 2018 (Cui et al., 2018b) Wiki free-form extractive 19:1K SQuAD (Rajpurkar et al., 2016) DuReader (He et al., 2017) web free-form abstractive ≈ 200K MS MARCO (Nguyen et al., 2016) ChID (Zheng et al., 2019) mixed-genre cloze multiple-choice 728:7K CLOTH (Xie et al., 2018) 3 CM (this work) mixed-genre free-form multiple-choice 10:0K RACE (Lai et al., 2017) 3 CD (this work) dialogue free-form multiple-choice 9:6K DREAM (Sun et al., 2019a) Table 1: Comparison of C3 and representative Chinese question answering and machine reading comprehension tasks.

Load more