Yiming Cui, Ting Liu, Ziqing Yang, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, Guoping Hu
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension

Yiming Cui, Ting Liu, Ziqing Yang, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, Guoping Hu
Research Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology, China
Joint Laboratory of HIT and iFLYTEK Research (HFL), Beijing, China
COLING 2020

OUTLINE
• Introduction
• The Proposed Dataset
  • Data collection, candidate generation
• Baseline Systems
• Experiments
  • Evaluation metrics, human performance, baseline performance
• Conclusion & Future Work

INTRODUCTION
• Comprehending human language is an essential ability for A.I.
• Machine Reading Comprehension (MRC) has been a trending task in recent NLP research

INTRODUCTION
• Machine Reading Comprehension (MRC)
  • Read and comprehend a given article, then answer questions based on it
• Types of MRC datasets
  • Cloze-style: CNN/DailyMail (Hermann et al., 2015), CBT (Hill et al., 2015), PD&CFT (Cui et al., 2016)
  • Span-extraction: SQuAD (Rajpurkar et al., 2016), CMRC 2018 (Cui et al., 2019)
  • Choice-selection: MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), C3 (Sun et al., 2019)
  • Conversational: CoQA (Reddy et al., 2018), QuAC (Choi et al., 2018)
  • …
• Problem: current MRC datasets lack evaluation of sentence-level inference ability

INTRODUCTION
• Contributions
  • We propose a new machine reading comprehension task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC), which aims to test sentence-level inference ability.
  • We release CMRC 2019, a challenging Chinese dataset consisting of 100K blanks, to evaluate the SC-MRC task.
  • Experiments on several state-of-the-art Chinese pre-trained language models show that there is still much room for them to surpass human performance, indicating that the proposed data is challenging.

DATASET CONSTRUCTION
• Task definition of SC-MRC
  • A natural extension of cloze-style machine reading comprehension
  • Instead of filling a word or an entity into the blank, the machine is required to fill in a sentence
• Steps for SC-MRC data generation
  • Passage selection
  • Cloze generation (yielding the correct order of sentence IDs)
  • (optional) Fake candidate generation

DATASET CONSTRUCTION
• Passage Selection
  • Genre of raw data: Chinese children's books (fairy tales, narratives, etc.)
  • Passage length restricted to the range of 500 ~ 750 characters
    • Too short: only a few blanks can be generated
    • Too long: harder for the model to process (most pre-trained language models support a maximum length of only 512)
  • Finally, we obtained 10K passages and split them into train/dev/test

DATASET CONSTRUCTION
• Cloze Generation
  • SC-MRC data can be generated automatically
  • To ensure high-quality cloze generation, we also apply the following rules:
    • The first sentence is always skipped, as it usually contains important topic information.
    • Sentences are selected based on comma or period marks, restricted to a range of 10 to 30 characters. The comma or period at the end of each candidate sentence is removed.
    • If a part of a long sentence is selected, no other parts of it are chosen, to avoid too many consecutive blanks.
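The cloze-generation rules above can be sketched as follows. This is a simplified illustration, not the authors' actual pipeline: the exact selection policy (e.g. which qualifying clause is chosen per sentence) is assumed here.

```python
import re

def generate_cloze(passage, min_len=10, max_len=30):
    """Blank out at most one clause per sentence, following the stated rules:
    skip the first sentence, split candidates on comma/period marks, keep
    only clauses of 10-30 characters, and drop the trailing punctuation."""
    sentences = [s for s in re.split(r"(?<=。)", passage) if s]
    answers, result = [], []
    for idx, sent in enumerate(sentences):
        chosen = None
        if idx > 0:  # the first sentence is always skipped
            for clause in (c for c in re.split(r"[，。]", sent) if c):
                if min_len <= len(clause) <= max_len:
                    chosen = clause  # take at most one part of a sentence
                    break            # to avoid consecutive blanks
        if chosen is not None:
            answers.append(chosen)   # punctuation already stripped by the split
            sent = sent.replace(chosen, f"[BLANK{len(answers)}]", 1)
        result.append(sent)
    return "".join(result), answers
```

For example, on a two-sentence passage the first sentence is left intact and a 10-to-30-character clause of the second becomes `[BLANK1]`, with the de-punctuated clause recorded as the answer.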
DATASET CONSTRUCTION
• Fake Candidates
  • To increase the difficulty, we also include fake candidates to confuse the system
  • A good fake candidate should have the following characteristics:
    • Its topic should be the same as that of the passage.
    • If it contains named entities, they should also appear in the passage.
    • It should NOT be a machine-generated sentence, or it would be very easy for the machine to pick the fake one out.

DATASET CONSTRUCTION
• Fake Candidates
  • Minimize the cost of human annotation → automatic generation
  • Our approach: extract a sentence from the same story that does not appear in the current passage
  • [Figure: example of extracting fake candidates from the story "Snow White"]

DATASET CONSTRUCTION
• Statistics
  • The resulting dataset contains over 100K blanks.
  • Note that we DO NOT include fake candidates in the training set.
  • We want to test the generalization of MRC systems trained without annotated fake candidates.

BASELINE SYSTEM
• Pre-trained Language Model (PLM) Baseline
  • Replace blanks with special tokens [unused{Num}], where Num ranges from 0 to (#blanks − 1)
  • Concatenate each candidate with the passage as the input sequence:
    [CLS] Candidate_i [SEP] Passage [SEP]

BASELINE SYSTEM
• Main Model
  • Use BERT to calculate, for each candidate, the probability of each blank
  • [Figure: BERT takes [CLS] Candidate_i [SEP] Passage [SEP] as input and outputs softmaxed scores over the blanks [BLANK1]…[BLANK4], e.g. 0.3 / 0.1 / 0.2 / 0.4, compared against the label]
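The input construction described above can be sketched as follows. Function and marker names are assumptions for illustration, not the authors' code:

```python
def build_inputs(passage_with_blanks, candidates, blank_marker="[BLANK]"):
    """Replace the i-th blank with BERT's reserved token [unused{i}], then
    pair every candidate with the passage as
    [CLS] Candidate_i [SEP] Passage [SEP]."""
    passage = passage_with_blanks
    i = 0
    while blank_marker in passage:
        # Replace one blank at a time so each gets its own index.
        passage = passage.replace(blank_marker, f"[unused{i}]", 1)
        i += 1
    return [f"[CLS] {cand} [SEP] {passage} [SEP]" for cand in candidates]
```

A real implementation would pass these sequences through a BERT tokenizer; the model then produces, for each candidate sequence, a softmaxed score over the [unused{i}] blank positions.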
BASELINE SYSTEM
• Decoding Phase
  • For each blank, choose the candidate that has the largest probability

                 Blank1  Blank2  Blank3  Blank4
    Candidate1    0.1     0.2     0.4     0.3      Blank1 → Candidate4
    Candidate2    0.2     0.4     0.1     0.2      Blank2 → Candidate2
    Candidate3    0.3     0.1     0.2     0.4      Blank3 → Candidate1
    Candidate4    0.4     0.3     0.3     0.1      Blank4 → Candidate3

EXPERIMENTS
• Evaluation Metrics
  • Question-level Accuracy (QAC)
    QAC = (# correct predictions / # total blanks in dataset) × 100%
  • Passage-level Accuracy (PAC)
    PAC = (# correct passages / # total passages in dataset) × 100%

EXPERIMENTS
• Human Performance
  • 100 passages are annotated by three annotators (on both dev & test)
  • The QAC and PAC scores are averaged over the annotators to obtain estimated human performance:
    QAC = (QAC_1 + QAC_2 + QAC_3) / 3
    PAC = (PAC_1 + PAC_2 + PAC_3) / 3

EXPERIMENTS
• Setups
  • Backbone models
    • BERT-base, Multilingual BERT (Devlin et al., 2018)
    • BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext (base/large) (Cui et al., 2019)
  • Hyperparameter settings
    • Initial learning rate of 3e-5, maximum sequence length of 512
    • Batch size of 24 with 3 training epochs
  • Implemented in PyTorch with the Transformers (Wolf et al., 2019) library on NVIDIA V100 GPUs

EXPERIMENTS
• Results: Baselines
  • Human performance reaches nearly 95% QAC and 75~80% PAC.
  • Baseline systems achieve high QAC scores but much lower PAC scores.
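The greedy decoding step and the QAC/PAC metrics can be sketched as follows, using the example score matrix from the decoding slide. This mirrors the described procedure, not the authors' exact code:

```python
# Softmaxed scores from the decoding example: rows = candidates, cols = blanks.
scores = [
    [0.1, 0.2, 0.4, 0.3],  # Candidate1
    [0.2, 0.4, 0.1, 0.2],  # Candidate2
    [0.3, 0.1, 0.2, 0.4],  # Candidate3
    [0.4, 0.3, 0.3, 0.1],  # Candidate4
]

def decode(scores):
    """For each blank (column), pick the candidate (row) with the largest probability."""
    n_blanks = len(scores[0])
    return [max(range(len(scores)), key=lambda c: scores[c][b])
            for b in range(n_blanks)]

def qac(preds, golds):
    """Question-level Accuracy: correctly filled blanks / total blanks × 100%."""
    correct = sum(p == g for pr, gd in zip(preds, golds) for p, g in zip(pr, gd))
    total = sum(len(gd) for gd in golds)
    return 100.0 * correct / total

def pac(preds, golds):
    """Passage-level Accuracy: passages with ALL blanks correct / total passages × 100%."""
    return 100.0 * sum(pr == gd for pr, gd in zip(preds, golds)) / len(golds)

# decode(scores) → [3, 1, 0, 2], i.e. Blank1→Candidate4, Blank2→Candidate2,
# Blank3→Candidate1, Blank4→Candidate3, matching the slide.
```

Note that this greedy scheme decides each blank independently, so the same candidate can be assigned to several blanks; the "dynamic filling without trial-and-error" mentioned under future work points at this limitation.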
EXPERIMENTS
• Results: Comparison with Top Submissions
  • Top submissions adopt data augmentation, ensembling, etc.
  • These improve the baseline substantially, but the PAC metric is still far from human performance.

CONCLUSION & FUTURE WORK
• Conclusion
  • We propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC)
  • We release the CMRC 2019 dataset for evaluating SC-MRC ability
  • The baselines as well as the top submissions are still far from human performance
• Future Work
  • Designing an effective way to dynamically fill candidates into blanks without trial-and-error
  • SC-MRC could perhaps also serve as a way to enhance the power of pre-trained language models

ACKNOWLEDGMENT
• We would like to thank
  • The anonymous reviewers for their valuable comments on our work
• Supporting Funds
  • National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011, and 61772153
• SQuAD Team
  • For sharing their website template

SUBMISSIONS
• Official Submissions for the Test Set
  • As in previous editions of CMRC, we do not release the test set.
  • Submit your system to the CodaLab Worksheet to obtain its performance on our hidden test set.

SUBMISSIONS
• Official Leaderboard: https://ymcui.github.io/cmrc2019/

USEFUL RESOURCES
• CMRC 2017 (Cui et al., LREC 2018)
  • https://github.com/ymcui/cmrc2017
• CMRC 2018 (Cui et al., EMNLP 2019)
  • https://github.com/ymcui/cmrc2018
• Chinese BERT-wwm, RoBERTa (Cui et al., Findings of EMNLP 2020)
  • https://github.com/ymcui/Chinese-BERT-wwm