KLUE: Korean Language Understanding Evaluation
Sungjoon Park* (Upstage, KAIST), Jihyung Moon* (Upstage), Sungdong Kim* (NAVER AI Lab), Won Ik Cho* (Seoul National University),
Jiyoon Han† (Yonsei University), Jangwon Park, Chisung Song, Junseong Kim (Scatter Lab),
Yongsook Song (KyungHee University), Taehwan Oh† (Yonsei University), Joohong Lee (Scatter Lab), Juhyun Oh† (Seoul National University),
Sungwon Lyu (Kakao Enterprise), Younghoon Jeong (Sogang University), Inkwon Lee (Sogang University), Sangwoo Seo (Scatter Lab), Dongjun Lee,
Hyunwoo Kim (Seoul National University), Myeonghwa Lee (KAIST), Seongbo Jang (Scatter Lab), Seungwon Do, Sunkyoung Kim (KAIST),
Kyungtae Lim (Hanbat National University), Jongwon Lee, Kyumin Park (KAIST), Jamin Shin (Riiid AI Research), Seonghyun Kim,
Lucy Park (Upstage), Alice Oh** (KAIST), Jung-Woo Ha** (NAVER AI Lab), Kyunghyun Cho** (New York University)

*Equal Contribution. A description of each author's contribution is available at the end of the paper.
**Corresponding Authors.
†Work done at Upstage.

Abstract

We introduce the Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of 8 Korean natural language understanding (NLU) tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, so that anyone can access the datasets without restriction. With ethical considerations in mind, we carefully design the annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models for each task. We furthermore release the pretrained language models (PLMs) KLUE-BERT and KLUE-RoBERTa to help reproduce the baseline models on KLUE and thereby facilitate future research. We make a few interesting observations from the preliminary experiments using the proposed KLUE benchmark suite, already demonstrating its usefulness. First, we find that KLUE-RoBERTa-LARGE outperforms other baselines, including multilingual PLMs and existing open-source Korean PLMs. Second, we see minimal degradation in performance even when we replace personally identifiable information in the pretraining corpus, suggesting that privacy and NLU capability are not at odds with each other. Lastly, we find that using BPE tokenization in combination with morpheme-level pre-tokenization is effective for tasks involving morpheme-level tagging, detection, and generation. In addition to accelerating Korean NLP research, our comprehensive documentation on creating KLUE will facilitate creating similar resources for other languages in the future. KLUE is available at https://klue-benchmark.com/.

Contents

1 Introduction
  1.1 Summary
2 Source Corpora
  2.1 Corpora Selection Criteria
  2.2 Selected Corpora
    2.2.1 Potential Concerns
  2.3 Preprocessing
  2.4 Task Assignment
3 KLUE Benchmark
  3.1 Topic Classification (TC)
    3.1.1 Dataset Construction
    3.1.2 Evaluation Metric
    3.1.3 Related Work
    3.1.4 Conclusion
  3.2 Semantic Textual Similarity (STS)
    3.2.1 Dataset Construction
    3.2.2 Evaluation Metrics
    3.2.3 Related Work
    3.2.4 Conclusion
  3.3 Natural Language Inference (NLI)
    3.3.1 Dataset Construction
    3.3.2 Evaluation Metric
    3.3.3 Related Work
    3.3.4 Conclusion
  3.4 Named Entity Recognition (NER)
    3.4.1 Dataset Construction
    3.4.2 Evaluation Metrics
    3.4.3 Related Work
    3.4.4 Conclusion
  3.5 Relation Extraction (RE)
    3.5.1 Data Construction
    3.5.2 Evaluation Metrics
    3.5.3 Related Work
    3.5.4 Conclusion
  3.6 Dependency Parsing (DP)
    3.6.1 Dataset Construction
    3.6.2 Evaluation Metrics
    3.6.3 Related Work
    3.6.4 Conclusion
  3.7 Machine Reading Comprehension (MRC)
    3.7.1 Dataset Construction
    3.7.2 Evaluation Metrics
    3.7.3 Analysis
    3.7.4 Related Work
    3.7.5 Conclusion
  3.8 Dialogue State Tracking (DST)
    3.8.1 Dataset Construction
    3.8.2 Evaluation Metrics
    3.8.3 Analysis
    3.8.4 Related Work
    3.8.5 Conclusion
4 Pretrained Language Models
  4.1 Language Models
  4.2 Existing Language Models
5 Fine-tuning Language Models
  5.1 Task-Specific Architectures
    5.1.1 Single Sentence Classification
    5.1.2 Sentence Pair Classification / Regression
    5.1.3 Multiple-Sentence Slot-Value Prediction
    5.1.4 Sequence Tagging
  5.2 Fine-Tuning Configurations
  5.3 Evaluation Results
  5.4 Analysis of Models
6 Ethical Considerations
  6.1 Copyright and Accessibility
  6.2 Toxic Content
  6.3 Personally Identifiable Information
7 Related Work
8 Discussion
9 Conclusion
Index
A Dev Set Results

1 Introduction

A major factor behind the recent success of pretrained language models, such as BERT [30] and its variants [82, 22, 49] as well as GPT-3 [110] and its variants [111, 76, 9], has been the availability of well-designed benchmark suites for evaluating their effectiveness in natural language understanding (NLU). GLUE [133] and SuperGLUE [132] are representative examples of such suites and were designed to evaluate diverse aspects of NLU, including syntax, semantics, and pragmatics. The research community has embraced GLUE and SuperGLUE and has made rapid progress in developing better model architectures as well as learning algorithms for NLU.

The success of GLUE and SuperGLUE has sparked interest in building similar standardized benchmark suites for other languages, in order to better measure progress in NLU beyond English. Such efforts have been pursued along two directions. First, various groups around the world have independently created language-specific benchmark suites: a Chinese version of GLUE (CLUE [142]), a French version of GLUE (FLUE [72]), an Indonesian variant [137], an Indic version [57], and a Russian variant of SuperGLUE [125]. Second, some have relied on machine and human translation of existing benchmark suites, which were often created initially in English, to build multilingual versions of them. These include, for instance, XGLUE [78] and XTREME [54].
Although the latter approach scales much better than the former, it often fails to capture societal aspects of NLU and introduces various artifacts arising from translation. We therefore build a new benchmark suite for evaluating NLU in Korean, which is the 13th most widely used language in the world according to [34] but lacks a unified NLU benchmark suite. Instead of starting from existing benchmark tasks or corpora, we build this benchmark suite from the ground up: we determine and collect base corpora, identify a set of benchmark tasks, design appropriate annotation protocols, and finally validate the collected annotations. This allows us to preemptively address and avoid properties that may have undesirable consequences, such as copyright infringement, annotation artifacts, social biases, and privacy violations. In the rest of this section, we summarize the decisions and principles behind the creation of KLUE.

1.1 Summary

In designing the Korean Language Understanding Evaluation (KLUE) benchmark, we aim to make KLUE: 1) cover diverse