
A Sentence Cloze Dataset for Chinese Machine Reading Comprehension

Yiming Cui, Ting Liu, Ziqing Yang, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, Guoping Hu

Research Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology, China Joint Laboratory of HIT and iFLYTEK Research (HFL), Beijing, China

COLING 2020. Stay safe, stay well.

OUTLINE

• Introduction

• The Proposed Dataset

• Data collection, candidate generation

• Baseline Systems

• Experiments

• Evaluation metrics, human performance, baseline performance

• Conclusion & Future Work

Y. Cui, T. Liu, Z. Yang, Z. Chen, W. Ma, W. Che, S. Wang, G. Hu. CMRC 2019 Dataset.

INTRODUCTION

• Comprehending human language is an essential capability for A.I.

• Machine Reading Comprehension (MRC) has been a trending task in recent NLP research


• Machine Reading Comprehension (MRC)

• To read and comprehend a given article and answer the questions based on it

• Type of MRC Datasets

• Cloze-style: CNN/DailyMail (Hermann et al., 2015), CBT (Hill et al., 2015), PD&CFT (Cui et al., 2016)

• Span-extraction: SQuAD (Rajpurkar et al., 2016), CMRC 2018 (Cui et al., 2019)

• Choice-selection: MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), C³ (Sun et al., 2019)

• Conversational: CoQA (Reddy et al., 2018), QuAC (Choi et al., 2018)

• …

• Problem: current MRC datasets lack evaluation of sentence-level inference ability


• Contributions

• We propose a new machine reading comprehension task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC), which aims to test the ability of sentence-level inference.

• We release a challenging Chinese dataset CMRC 2019, which consists of 100K blanks, to evaluate the SC-MRC task.

• Experiments on several state-of-the-art Chinese pre-trained language models show that there is still much room for them to surpass human performance, indicating the proposed data is challenging.

DATASET CONSTRUCTION

• Task Definition of SC-MRC

• A natural extension to cloze-style machine reading comprehension

• Instead of filling in a word or an entity, we require the machine to fill in a sentence for each blank

• Steps for SC-MRC data generation

• Passage selection

• Cloze generation (answer: correct order of sentence IDs)

• (optional) Fake candidate generation


• Passage Selection

• Genre of raw data: Chinese children’s books (fairy tales, narratives, etc.)

• Restrict the passage length to the range of 500–750 characters

• Too short: there would be only a few blanks.

• Too long: harder for the model to process (most pre-trained language models only support a maximum sequence length of 512).

• Finally, we obtained 10K passages and split them into train/dev/test
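The length filter above can be sketched in a few lines (the 500–750 character bounds come from the slide; function and variable names are ours):

```python
# Passage-length filter sketch; bounds are from the slide, names are illustrative.
MIN_CHARS, MAX_CHARS = 500, 750

def select_passages(raw_passages):
    """Keep only passages whose character count lies in [MIN_CHARS, MAX_CHARS]."""
    return [p for p in raw_passages if MIN_CHARS <= len(p) <= MAX_CHARS]
```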


• Cloze Generation

• SC-MRC data can be generated with an automatic method

• To ensure a high-quality cloze generation, we also apply the following rules

• The first sentence, which usually contains important topic information, is always skipped.

• Candidate sentences are segmented on comma or period marks and restricted to 10–30 characters. The trailing comma or period is removed from each candidate sentence.

• If a part of a long sentence is selected, we do not choose other parts of it, to avoid too many consecutive blanks.
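A minimal sketch of these rules, assuming clause-level segmentation on Chinese comma/period marks (the first-sentence skip and consecutive-blank handling are simplified here, and all names are ours):

```python
import re

def generate_cloze(passage, min_len=10, max_len=30):
    """Turn a passage into a sentence-cloze sample (simplified sketch).

    Rules from the slide: skip the leading clause, segment on Chinese
    comma/period, blank clauses of 10-30 characters, and never blank
    two consecutive clauses.
    """
    clauses = [c for c in re.split(r"[，。]", passage) if c]
    blanked, answers = [], []
    prev_was_blank = True  # True initially, so the first clause is never blanked
    for clause in clauses:
        if not prev_was_blank and min_len <= len(clause) <= max_len:
            answers.append(clause)  # trailing comma/period already stripped by split
            blanked.append("[BLANK]")
            prev_was_blank = True
        else:
            blanked.append(clause)
            prev_was_blank = False
    return "，".join(blanked), answers
```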


• Fake Candidates

• To increase the difficulty, we also include fake candidates to confuse the system

• A good fake candidate should have the following characteristics.

• The topic of the fake candidate should be the same as the passage.

• Any named entities in a fake candidate should also appear in the passage.

• It should NOT be a machine-generated sentence; otherwise it would be very easy for the machine to pick the fake one out.


• Fake Candidates

• To minimize human annotation cost, we generate fake candidates automatically

• Our approach: extract a sentence from the same story, but from a different passage

(Figure: fake candidates for a Snow White passage are drawn from other passages of the same story.)
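A toy sketch of this strategy, assuming the story is available as a list of passages (the sentence segmentation is ours, reusing the 10–30 character candidate range; names are illustrative):

```python
def fake_candidates(story_passages, current_idx, k=2):
    """Draw distractor sentences from OTHER passages of the same story, so the
    topic and entities match while every sentence stays human-written."""
    pool = []
    for i, passage in enumerate(story_passages):
        if i == current_idx:
            continue  # never draw from the passage being blanked
        pool.extend(s for s in passage.split("。") if 10 <= len(s) <= 30)
    return pool[:k]
```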


• Statistics

• The resulting dataset contains over 100K blanks.

• Note that we DO NOT include fake candidates in the training set.

• We want to test the generalization of MRC systems trained without annotated fake candidates.

BASELINE SYSTEM

• Pre-trained Language Model (PLM) Baseline

• Replace blanks with special tokens [unused{Num}], where Num ranges from 0 to (number of blanks − 1)

• Concatenate each candidate with the passage as input sequence

[CLS] Candidate_i [SEP] Passage [SEP]
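A string-level sketch of this input construction (real tokenizers add the [CLS]/[SEP] tokens themselves; the helper name and the [BLANK] placeholder are ours):

```python
def build_candidate_inputs(passage_with_blanks, candidates):
    """Replace each [BLANK] with a reserved token [unused0], [unused1], ...,
    then pair every candidate with the rewritten passage."""
    text, num = passage_with_blanks, 0
    while "[BLANK]" in text:
        text = text.replace("[BLANK]", f"[unused{num}]", 1)
        num += 1
    return [f"[CLS] {cand} [SEP] {text} [SEP]" for cand in candidates]
```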


• Main Model

• Use BERT to calculate the probability of each candidate filling each blank

(Figure: BERT takes [CLS] Candidate_i [SEP] Passage [SEP] as input; softmaxed scores over the blank positions [BLANK1]…[BLANK4], e.g. 0.3 / 0.1 / 0.2 / 0.4, are compared against the label.)


• Decoding Phase

• For each blank, we choose the candidate that has the largest probability

             Blank1  Blank2  Blank3  Blank4
Candidate1    0.1     0.2     0.4     0.3
Candidate2    0.2     0.4     0.1     0.2
Candidate3    0.3     0.1     0.2     0.4
Candidate4    0.4     0.3     0.3     0.1

Blank1 → Candidate4
Blank2 → Candidate2
Blank3 → Candidate1
Blank4 → Candidate3
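The decoding step is a per-blank argmax over the candidate scores; a sketch reproducing the example above (note this greedy scheme can, in general, assign the same candidate to several blanks):

```python
def decode(scores):
    """scores[i][j]: softmaxed probability that candidate i fills blank j.
    Each blank independently takes its highest-scoring candidate (1-indexed)."""
    n_cands, n_blanks = len(scores), len(scores[0])
    return {j + 1: max(range(n_cands), key=lambda i: scores[i][j]) + 1
            for j in range(n_blanks)}

scores = [
    [0.1, 0.2, 0.4, 0.3],  # Candidate1
    [0.2, 0.4, 0.1, 0.2],  # Candidate2
    [0.3, 0.1, 0.2, 0.4],  # Candidate3
    [0.4, 0.3, 0.3, 0.1],  # Candidate4
]
# decode(scores) -> {1: 4, 2: 2, 3: 1, 4: 3}: Blank1 -> Candidate4, etc.
```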

EXPERIMENTS

• Evaluation Metrics

• Question-level Accuracy (QAC)

QAC = (# correct predictions) / (# total blanks in dataset) × 100%

• Passage-level Accuracy (PAC)

PAC = (# correct passages) / (# total passages in dataset) × 100%
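Both metrics can be sketched as follows, with predictions and answers given per passage (here a passage counts toward PAC only when every one of its blanks is correct; function names are ours):

```python
def qac(predictions, answers):
    """Question-level accuracy: % of individual blanks filled correctly."""
    correct = sum(p == a for preds, golds in zip(predictions, answers)
                  for p, a in zip(preds, golds))
    total = sum(len(golds) for golds in answers)
    return 100.0 * correct / total

def pac(predictions, answers):
    """Passage-level accuracy: % of passages with every blank correct."""
    all_right = sum(preds == golds for preds, golds in zip(predictions, answers))
    return 100.0 * all_right / len(answers)
```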


• Human Performance

• 100 passages each from dev and test, annotated by three annotators

• Average the QAC and PAC scores to obtain estimated human performance

QAC_human = (QAC_Annotator1 + QAC_Annotator2 + QAC_Annotator3) / 3

PAC_human = (PAC_Annotator1 + PAC_Annotator2 + PAC_Annotator3) / 3


• Setups

• Backbone models

• BERT-base, Multi-lingual BERT (Devlin et al., 2018)

• BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext (base/large) (Cui et al., 2019)

• Hyperparameter settings

• Initial learning rate of 3e-5, maximum sequence length of 512

• Batch size of 24 with 3 training epochs

• Implemented in PyTorch with the Transformers library (Wolf et al., 2019), trained on NVIDIA V100 GPUs
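For reference, the reported settings gathered into a plain config sketch (key names are ours; values are from the slide):

```python
# Fine-tuning hyperparameters from the slide, collected as a config dict.
TRAIN_CONFIG = {
    "learning_rate": 3e-5,   # initial learning rate
    "max_seq_length": 512,   # maximum input length for the PLM
    "batch_size": 24,
    "num_epochs": 3,
}
```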


• Results: Baselines

• Human performance reaches nearly 95% QAC and 75–80% PAC.

• Baseline systems achieved high scores on QAC but not on PAC.


• Results: Comparison with Top Submissions

• Top submissions adopt data augmentation, ensembling, etc.

• These improve substantially over the baselines, but PAC is still far from human performance.

CONCLUSION & FUTURE WORK

• Conclusion

• Propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC)

• Release the CMRC 2019 dataset for evaluating SC-MRC ability

• Baselines as well as top submissions still fall far short of human performance

• Future Work

• Designing an effective way to dynamically fill candidates into blanks without trial and error.

• SC-MRC could perhaps serve as a task to enhance the power of pre-trained language models.

ACKNOWLEDGMENT

• We would like to thank

• Anonymous reviewers for their valuable comments on our work

• Supporting Funds

• National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011, and 61772153

• SQuAD Team

• For sharing their website template

SUBMISSIONS

• Official Submissions for Test Set

• As in previous editions of CMRC, we do not release the test set.

• Submit your system to the CodaLab Worksheet to get its performance on our hidden test set.


• Official Leaderboard: https://ymcui.github.io/cmrc2019/

USEFUL RESOURCES

• CMRC 2017 (Cui et al., LREC 2018)

• https://github.com/ymcui/cmrc2017

• CMRC 2018 (Cui et al., EMNLP 2019)

• https://github.com/ymcui/cmrc2018

• Chinese BERT-wwm, RoBERTa (Cui et al., Findings of EMNLP 2020)

• https://github.com/ymcui/Chinese-BERT-wwm

REFERENCES

• Wanxiang Che, Zhenghua Li, and Ting Liu. 2010. LTP: A Chinese language technology platform. In Proceedings of COLING 2010: Demonstrations, pages 13–16.

• Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for Chinese reading comprehension. In Proceedings of COLING 2016, pages 1777–1786.

• Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of ACL 2017, pages 593–602.

• Yiming Cui, Ting Liu, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2018. Dataset for the first evaluation on Chinese machine reading comprehension. In Proceedings of LREC 2018, pages 2721–2725.

• Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019a. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

• Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2019b. A span-extraction dataset for Chinese machine reading comprehension. In Proceedings of EMNLP-IJCNLP 2019, pages 5886–5891.

• Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. Findings of EMNLP 2020.

• Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL 2019, pages 4171–4186.

• Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Proceedings of ACL 2017, pages 1832–1846.


• Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 37–46.

• Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.

• Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.

• Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of IJCAI-18, pages 4099–4106.

• Rudolf Kadlec, Martin Schmid, Ondřej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. In Proceedings of ACL 2016, pages 908–918.

• Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of EMNLP 2017, pages 796–805.

• Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. ArXiv, abs/1607.06275.

• Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

• Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

• Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016, pages 2383–2392.

• Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of ACL 2018, pages 784–789.

• Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.


• Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, and Sam Tsai. 2018. DRCD: a Chinese machine reading comprehension dataset. arXiv preprint arXiv:1806.00920.

• Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2019. Probing prior knowledge needed in challenging Chinese machine reading comprehension. ArXiv, abs/1904.09679.

• Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905.

• Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL 2017, pages 189–198.

• Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of ACL 2018, pages 1705–1714.

• Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

• Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

• Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.

• Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

• Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

• Chujie Zheng, Minlie Huang, and Aixin Sun. 2019. ChID: A large-scale Chinese IDiom dataset for cloze test. In Proceedings of ACL 2019, pages 778–787.

THANK YOU! https://github.com/ymcui/cmrc2019

[email protected]