Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models

Tommi Jauhiainen
Department of Digital Humanities
University of Helsinki

Heidi Jauhiainen
Department of Digital Humanities
University of Helsinki

Krister Lindén
Department of Digital Humanities
University of Helsinki

Abstract

This paper describes the language identification systems used by the SUKI team in the Discriminating between the Mainland and Taiwan variation of Mandarin Chinese (DMT) and the German Dialect Identification (GDI) shared tasks, which were held as part of the third VarDial Evaluation Campaign. The DMT shared task included two separate tracks, one for the simplified Chinese script and one for the traditional Chinese script. We submitted three runs on both tracks of the DMT task as well as on the GDI task. We won the traditional Chinese track using Naive Bayes with language model adaptation, came second on GDI with an adaptive version of the HeLI 2.0 method, and third on the simplified Chinese track using again the adaptive Naive Bayes.

1 Introduction

The third VarDial Evaluation Campaign (Zampieri et al., 2019) included three shared tasks on language, dialect, and language variety identification. The Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT) task concentrated on finding differences between the varieties of Mandarin Chinese written in mainland China and Taiwan. The task included two tracks, one for the simplified script and another for the traditional one. The German Dialect Identification (GDI) task was already the third of its kind (Zampieri et al., 2017, 2018). In GDI 2019, the task was to distinguish between four Swiss-German dialects. The third task was that of Cuneiform Language Identification (CLI), but we did not participate in it as we were partly responsible for creating its dataset (Jauhiainen et al., 2019a).

We evaluated several language identification methods using the development sets of the DMT and GDI tasks. Our best submissions were created using a language model (LM) adaptation technique similar to the one we used in the second VarDial Evaluation Campaign (Zampieri et al., 2018). In that Evaluation Campaign, we used the HeLI language identification method (Jauhiainen et al., 2016) together with a new LM adaptation approach, winning the Indo-Aryan Language Identification (ILI) and the GDI 2018 shared tasks with a wide margin (Jauhiainen et al., 2018b,c). After the second Evaluation Campaign, we developed a new version of the HeLI method and further refined the LM adaptation technique (Jauhiainen et al., 2019b). With the HeLI 2.0 method and the refined adaptation technique, we came second in the GDI 2019 shared task using only character 4-grams as features. Furthermore, we had implemented several baseline language identifiers for the CLI shared task (Jauhiainen et al., 2019a). One of them was a Naive Bayes (NB) identifier using variable-length character n-grams, which fared better than the HeLI method on the CLI dataset. We modified our LM adaptation technique to be used with the NB classifier, and this fared better than the adaptive HeLI 2.0 method on both of the Chinese datasets. With the adaptive NB identifier, we won the traditional Chinese track and came third on the simplified one.

In this paper, we first go through some related work in Section 2, after which we introduce the datasets and the evaluation setup used in the DMT and the GDI shared tasks (Section 3). We then use the training and the development sets to evaluate our baseline methods (Sections 4.1 and 4.2) and the HeLI 2.0 method (Section 4.3), after which we evaluate the efficiency of our LM adaptation procedure with the HeLI 2.0 and NB methods in Sections 4.4 and 4.5. Finally, we introduce and discuss the results of our official submissions (Section 5) and give some conclusions and ideas for future work (Section 6).

2 Related work

In this section, we introduce some background information on previous studies in language identification in general, language identification in the context of the Chinese and German languages, as well as LM adaptation.

2.1 Language identification in texts

Language identification (LI) is the task of identifying the language of a text. The same methods which are used for LI are generally also used for dialect and language variety identification. A comprehensive survey of language identification in general has been published in arXiv by Jauhiainen et al. (2018d).

The series of shared tasks in language identification began in 2014 with the Discriminating Between Similar Languages (DSL) shared task (Zampieri et al., 2014), and similar tasks have been arranged each year since (Zampieri et al., 2015; Malmasi et al., 2016; Zampieri et al., 2017, 2018). It is notable that, so far, deep neural networks have not gained an upper hand when compared with the more linear classification methods (Çöltekin and Rama, 2017; Medvedeva et al., 2017; Ali, 2018).

2.2 Chinese dialect identification

In the DMT shared task, the text material in the dataset is UTF-8 encoded. Before the widespread use of UTF-8, different encodings for different scripts were widely used. Li and Momoi (2001) discussed methods for automatically detecting the encoding of documents for which the encoding was unknown. They present two tables showing distributional results for Chinese characters. In their research, they found that the 4096 most frequent characters in simplified Chinese encoded in GB2312 cover 99.93 percent of all text, and they report that earlier results for traditional Chinese in Big5 encoding are very similar, with the 4096 most frequent characters covering 99.91% of text.

Huang and Lee (2008) used a bag-of-words method to distinguish between the Mainland, Singapore, and Taiwan varieties of Mandarin Chinese. They reached an accuracy of 0.929.

Brown (2012) displays a confusion matrix of four varieties of the Chinese macrolanguage as part of his LI experiments for 923 languages. The Gan variety was among the languages with the highest error rates of all languages.

Huang et al. (2014) show how light verbs have different distributional tendencies in the Mainland and Taiwan varieties of Mandarin Chinese. Using K-Means clustering, they show that the varieties can be differentiated.

Xu et al. (2016) describe an approach to distinguish between several varieties of Mandarin Chinese: Mainland, Hong Kong, Taiwan, Macao, Malaysia, and Singapore. In another study (Xu et al., 2018), they used support vector machines (SVM) to distinguish between Gan Chinese dialects.

2.3 German dialect identification

The GDI 2019 task was already the third of its kind (Zampieri et al., 2017, 2018). In 2017, we did not participate in the shared task, which was won using an SVM meta-classifier ensemble with words and character n-grams from one to six as features (Malmasi and Zampieri, 2017). We won the 2018 edition using the HeLI method with LM adaptation and character 4-grams as features (Jauhiainen et al., 2018b). We were the only ones employing LM adaptation and won with a wide margin over the second system, which was an SVM ensemble using both character and word n-grams (Benites et al., 2018).

For a more complete overview of dialect identification for the German language, we refer the reader to our recent paper where we used LM adaptation with the datasets from the GDI 2017 and 2018 shared tasks (Jauhiainen et al., 2019b). Our experiments using a refined LM adaptation scheme with the HeLI 2.0 method produced the best published identification results for both of the datasets.

2.4 Language model adaptation

Language model (LM) adaptation is a technique in which the language models used by a language identifier are modified during the identification process. It is especially advantageous when there is a clear domain (topic, genre, idiolect, time period, etc.) difference between the texts used as training data and the text being identified (Jauhiainen et al., 2018a,b,c). If an adaptation technique is successful, the language identifier learns the peculiarities of the new text and is better able to classify it into the given language categories.

In the shared tasks of the VarDial Evaluation Campaigns, we are provided with the complete test sentence collection at once. This means that we can additionally choose in which order we learn from the test data and even process the same test sentences several times before providing the final language labels.

The LM adaptation technique and the confidence measure we use in the systems described in this article are similar to those used earlier in speech language identification by Chen and Liu (2005) and Zhong et al. (2007). The adaptation technique is an improved version of the one we used in our winning submissions at the second VarDial Evaluation Campaign (Jauhiainen et al., 2018b,c). For a more complete overview of the subject, we refer the reader to our recent article dedicated to language model adaptation for language and dialect identification of text, where we also introduce the improved LM adaptation technique used in this paper (Jauhiainen et al., 2019b).

3 Test setup

In the 2019 shared tasks, the participants were provided with separate training and development sets. All the tracks were closed ones, so no external information was to be used in preparing the language identification systems. The training and development sets were released approximately a month before the test set release. When the test sets were released, the participants had two days to submit their predictions on the tracks. The texts in the development portions could be used as additional training data when processing the test sets, and we did so in each case.

The evaluation measure used in both of the shared tasks was the macro F1-score, and we used it also when comparing the different methods we used with the development data.
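As a concrete reference for the evaluation measure mentioned above, the macro F1-score averages the per-class F1-scores without weighting them by class frequency. The following minimal Python sketch shows how it might be computed locally for development-set comparisons, assuming scikit-learn is available; the organizers' own scoring scripts may differ, and the labels below are purely hypothetical.

    # Macro F1: compute the F1-score separately for each label and take the
    # unweighted mean, so small dialect classes count as much as large ones.
    from sklearn.metrics import f1_score

    # Hypothetical gold labels and system predictions for a few GDI lines.
    gold = ["BE", "BS", "LU", "ZH", "BE", "LU"]
    predicted = ["BE", "BS", "ZH", "ZH", "BE", "BS"]

    print(f1_score(gold, predicted, average="macro"))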

3.1 DMT datasets

The scripts commonly used in mainland China and Taiwan are different. In Taiwan, the traditional Chinese script is commonly used, whereas in mainland China the simplified version is the official one (Chen et al., 1996; Huang et al., 2000; McEnery and Xiao, 2003). In order to concentrate on those differences between the two varieties of Mandarin Chinese, Putonghua (Mainland China) and Guoyu (Taiwan), that are not related to the script, the texts used for the DMT task had been transformed to the same script. In the simplified track, the Taiwanese texts originally written in the traditional script had been converted into the simplified script, and in the traditional track, the mainland texts originally in the simplified script had been converted to the traditional script. The conversion had been made using a tool called "OpenCC" (https://github.com/BYVoid/OpenCC).

The texts used as the source for the datasets were news articles from mainland China and from Taiwan. The participants were provided with training and development sets for both simplified and traditional scripts. Both datasets had been tokenized by inserting whitespace characters between individual words. Furthermore, all punctuation had been removed. The average length of words in all DMT training sets was c. 1.7 characters. The training sets contained 9,385 sentences and the development sets consisted of an additional 1,000 sentences for each variety. The test sets had 1,000 sentences as well for each variety.

3.2 GDI dataset

The GDI dataset consisted of transcribed speech utterances in four Swiss German dialects. More detailed information about the source of the texts for the GDI datasets, the ArchiMob corpus, is provided by Samardžić et al. (2016). In 2018, the GDI dataset included additional unknown dialects, which were left out in 2019. The sizes of the training and the development sets can be seen in Table 1. The average length of words in the training set was 5.5 characters. The test set contained 4,743 utterances comprising 42,699 words. As of this writing, we are not aware of the distribution of the dialects in the test set.

Variety (code)   Training   Development
Bern (BE)          27,968         7,381
Basel (BS)         26,927         9,462
Lucerne (LU)       28,979         8,650
Zurich (ZH)        28,833         8,086

Table 1: List of the Swiss German varieties used in the datasets distributed for the 2019 GDI shared task. The sizes are in words.

The training, development, and test sets included two files in addition to the speech transcriptions. The first file included normalized forms for each dialectal form in the data. The second file included 400-dimensional iVectors representing the acoustic features of the original speech data, as the text data was transliterated speech. We did not use either of the two additional files in our experiments.

4 Experiments using the development data

We set out to tackle the GDI and the DMT shared tasks with the system based on the HeLI 2.0 method and LM adaptation that we had used for the GDI 2017, GDI 2018, and ILI datasets between the 2018 and 2019 VarDial Evaluation Campaigns (Jauhiainen et al., 2019b).

For the CLI shared task, we had implemented three new baseline identifiers, one of which, a Naive Bayes identifier, managed to overcome the traditional HeLI method when distinguishing between Sumerian and six Akkadian dialects (Jauhiainen et al., 2019a). We were, hence, also interested to see how well these baseline identifiers would perform in the DMT and GDI tasks.

4.1 Simple scoring and the sum of relative frequencies

The first baseline method in the CLI shared task was the simple scoring method. In simple scoring, the frequency information of individual features in the training set is ignored, and each time a feature from a language model dom(O(C_g)) is encountered in the text M, the language g is given one point. The survey article by Jauhiainen et al. (2018d) gives the following Equation 1 for simple scoring:

R_{\mathrm{simple}}(g, M) = \sum_{i=1}^{l_M^F} \begin{cases} 1, & \text{if } f_i \in \mathrm{dom}(O(C_g)) \\ 0, & \text{otherwise} \end{cases} \qquad (1)

where l_M^F is the number of individual features in the line M and f_i is its ith feature. The language g gaining the highest score R is selected as the predicted language.

The second baseline implementation for the CLI used the sum of relative frequencies of character n-grams of varying length. The method is very similar to simple scoring but, instead of simply adding a global constant to the score each time the feature is found in the language model, the observed relative frequency in the respective language's training corpus is added.

In both methods, the only parameter to be decided using the development data was the range of the character n-grams used. These character n-grams can span word boundaries, and thus long n-grams can contain several words. We experimented with a range from 1 to 20 characters, and the best attained macro F1-scores on the development sets are listed in Tables 2, 3, and 4. In the end, we did not submit any results using the first two baseline methods, as the third baseline method, the product of relative frequencies, was clearly superior to them.
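To make the two baselines concrete, the following Python sketch shows how they could be implemented. It is our own illustrative code, not the competition system, and the helper names (ngrams, train, score_simple, score_sum_rel_freq) are ours; the n-grams are taken over the whole line, so they may span word boundaries, as described above.

    from collections import Counter

    def ngrams(text, n_min, n_max):
        # All overlapping character n-grams of lengths n_min..n_max taken over
        # the whole line, so the n-grams may span word boundaries.
        return [text[i:i + n] for n in range(n_min, n_max + 1)
                for i in range(len(text) - n + 1)]

    def train(corpus_by_language, n_min, n_max):
        # Relative frequencies of the character n-grams for each language.
        models = {}
        for lang, text in corpus_by_language.items():
            counts = Counter(ngrams(text, n_min, n_max))
            total = sum(counts.values())
            models[lang] = {f: c / total for f, c in counts.items()}
        return models

    def score_simple(models, line, n_min, n_max):
        # Simple scoring (Equation 1): one point for every feature of the line
        # that is found in the language model; the highest score wins.
        feats = ngrams(line, n_min, n_max)
        return {g: sum(1 for f in feats if f in m) for g, m in models.items()}

    def score_sum_rel_freq(models, line, n_min, n_max):
        # Sum of relative frequencies: add the observed relative frequency of
        # each feature instead of a constant; the highest score wins.
        feats = ngrams(line, n_min, n_max)
        return {g: sum(m.get(f, 0.0) for f in feats) for g, m in models.items()}

    # Example usage:
    # scores = score_simple(models, line, 1, 6)
    # predicted = max(scores, key=scores.get)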

4.2 Product of relative frequencies (NB)

Our third baseline method in CLI was the product of relative frequencies. The method is basically the same as Naive Bayes using the observed relative frequencies of character n-grams as probabilities. As with the two previous methods, these character n-grams can span word boundaries. Similarly to the sum of relative frequencies method, we calculate the relative frequencies for different n-grams from the training corpus, but instead of adding them together, we multiply them as in Equation 2:

R_{\mathrm{prod}}(g, M) = \prod_{i=1}^{l_M^F} \frac{c(C_g, f_i)}{l_{C_g^F}} \qquad (2)

where c(C_g, f_i) is the count of the feature f_i in the training corpus C_g of the language g and l_{C_g^F} is the total number of features in that corpus. The practical implementation uses a sum of logarithms instead, as computers normally cannot handle the extremely small numbers produced by multiplying the observed probabilities of complete sentences. As smoothing, in case c(C_g, f_i) was equal to zero, we used 1 instead and multiplied the resulting logarithmic value by the penalty modifier p_mod. The penalty modifier and the character n-gram range used were optimized using the development set. As mentioned earlier, the NB classifier bested the other baseline methods, as can be seen in Tables 2, 3, and 4.
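A log-space Python sketch of this scorer is given below; it reuses the illustrative ngrams helper from the previous listing and is again our own reconstruction of the idea rather than the submitted system. The default p_mod of 1.3 is the value reported for the DMT development sets in Tables 2 and 3.

    import math
    from collections import Counter

    def train_counts(corpus_by_language, n_min, n_max):
        # Keep raw n-gram counts and the corpus size per language so that the
        # unseen-feature penalty can be derived from the corpus size.
        models = {}
        for lang, text in corpus_by_language.items():
            counts = Counter(ngrams(text, n_min, n_max))
            models[lang] = (counts, sum(counts.values()))
        return models

    def score_nb(models, line, n_min, n_max, p_mod=1.3):
        # Product of relative frequencies (Equation 2) computed as a sum of
        # base-10 logarithms. For an unseen n-gram the count is replaced by 1
        # and the resulting log value is multiplied by the penalty modifier.
        feats = ngrams(line, n_min, n_max)
        scores = {}
        for g, (counts, total) in models.items():
            s = 0.0
            for f in feats:
                c = counts.get(f, 0)
                if c > 0:
                    s += math.log10(c / total)
                else:
                    s += math.log10(1.0 / total) * p_mod
            scores[g] = s
        return scores

    # The language with the highest (least negative) score is predicted:
    # scores = score_nb(models, line, 1, 15)
    # predicted = max(scores, key=scores.get)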

Method                           n-gram range     Smoothing   Splits (k)   Epochs   CMmin   F1 dev
Naive Bayes with LM adaptation   1–15             1.3         max          1        0.45    0.9225
Naive Bayes                      1–15             1.3         -            -        -       0.9215
Simple scoring                   1–15             -           -            -        -       0.8970
HeLI 2.0 with LM adaptation      1–2 + infinite   1.01        max          1        -       0.8909
HeLI 2.0                         1–2 + infinite   1.01        -            -        -       0.8859
Sum of rel. freq.                5–15             -           -            -        -       0.8204

Table 2: Simplified Chinese. The macro F1-scores attained by different methods on the development set. A "max" in the column indicating the number of splits means that k was equal to the number of lines in the evaluation data.

Method                           n-gram range     Smoothing   Splits (k)   Epochs   CMmin   F1 dev
Naive Bayes with LM adaptation   1–14             1.3         4            1        0       0.9295
Naive Bayes                      1–14             1.3         -            -        -       0.9285
HeLI 2.0 with LM adaptation      1–2 + infinite   1.12        max          1        0.42    0.9160
HeLI 2.0                         1–2 + infinite   1.12        -            -        -       0.9145
Simple scoring                   1–6              -           -            -        -       0.9015
Sum of rel. freq.                5–15             -           -            -        -       0.8247

Table 3: Traditional Chinese. The macro F1-scores attained by different methods on the development set. A "max" in the column indicating the number of splits means that k was equal to the number of lines in the evaluation data.

Method                           n-gram range   Smoothing   Splits (k)   Epochs   CMmin   F1 dev
HeLI 2.0 with LM adaptation      4              1.12        9            112      0.15    0.8657
Naive Bayes with LM adaptation   2–6            1.08        40           96       0.16    0.8442
HeLI 2.0                         4              1.12        -            -        -       0.6658
Naive Bayes                      2–6            1.08        -            -        -       0.6475
Simple scoring                   2–7            -           -            -        -       0.5865
Sum of rel. freq.                6–15           -           -            -        -       0.5049

Table 4: GDI 2019. The macro F1-scores attained by different methods on the development set. A "max" in the column indicating the number of splits means that k was equal to the number of lines in the evaluation data.

4.3 HeLI 2.0

In the HeLI method, we calculate a score for each word using relative frequencies of words or character n-grams. The length of the character n-grams to use, or whether to use the word itself, is decided individually for each word encountered in the text to be identified. The whole text gets the average of the scores of the individual words, thus giving equal value to long and short words. No information spanning word boundaries is used.

We have recently introduced a version of HeLI which we decided to call 2.0, as enough changes to the HeLI method had already accumulated (Jauhiainen et al., 2019b). HeLI 2.0 differs from the HeLI method described by Jauhiainen et al. (2016) in three ways. Firstly, we now always use all of the possible training material and use neither rank- nor relative frequency-based cut-offs. Secondly, we changed how we calculate the smoothing value. In HeLI, we used a global penalty value for all language models. In HeLI 2.0, we calculate the penalty value relative to the size of each individual language model using a global penalty modifier p_mod (this is the same smoothing method which we used with the product of relative frequencies). Thirdly, when selecting the range of character n-grams to use in HeLI 2.0, the minimum size for n can be higher than one.

In this description, we define a word as a character n-gram of infinite size generated from an individual word. The model sizes used are optimized using a development corpus, as is the possible use of n-grams of infinite size. The variable n_max is the maximum length and n_min the minimum length of the used character n-grams.

The corpus derived from the training data containing only the word-internal n-grams of the size n for the language g is called C_g^n (the beginning and the end of a word are marked using whitespaces). The values v_{C_g^n}(f) for the n-gram f are calculated for each language g, as shown in Equation 3:

v_{C_g^n}(f) = \begin{cases} -\log_{10}\!\left(\frac{c(C_g^n, f)}{l_{C_g^n}}\right), & \text{if } c(C_g^n, f) > 0 \\ -\log_{10}\!\left(\frac{1}{l_{C_g^n}}\right) \cdot p_{\mathrm{mod}}, & \text{if } c(C_g^n, f) = 0 \end{cases} \qquad (3)

where c(C_g^n, f) is the number of n-grams f in C_g^n. The domain dom(O(C^n)) is the set of all character n-grams of length n found in the models of any of the languages g ∈ G. Separately for each individual word t on the line M to be identified, we determine the length n of the character n-grams we use. The word t is divided into overlapping character n-grams of the length n. The length n is the highest for which at least one of the character n-grams generated from the word t is found in dom(O(C^n)). However, if an individual n-gram f generated from the word t is not found in dom(O(C^n)), it is discarded at this point. The number of retained n-grams is l_t^F. The score for individual words t on the line M is then calculated as in Equation 4.

v_g(t) = \frac{1}{l_t^F} \sum_{i=1}^{l_t^F} v_{C_g^n}(f_i) \qquad (4)

The whole line M is then scored as in Equation 5:

R_g(M) = \frac{\sum_{i=1}^{l_M^T} v_g(t_i)}{l_M^T} \qquad (5)

where l_M^T is the number of words in the line M. The predicted language g of the line M is the one having the lowest score. We optimized the penalty modifier p_mod as well as the minimum and maximum sizes of the n-grams, i.e. n_min and n_max, using the development set. Whether or not to use the character n-grams of infinite size (words) was also decided with the development set. The best results attained by the HeLI 2.0 method on each of the development sets can be seen in Tables 2, 3, and 4.
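To illustrate the scoring just described, here is a simplified Python sketch under our reading of Equations 3 to 5; it is not the authors' released implementation. The data structures word_models[g][n] (word-internal n-gram counts), totals[g][n] (the corpus sizes l_{C_g^n}), and domains[n] (the union dom(O(C^n)) over all languages) are assumed to have been built from the training data beforehand, and n-grams of infinite size (whole words) are omitted for brevity.

    import math

    def char_ngrams(word, n):
        # Word-internal character n-grams; the word is padded with spaces so
        # that its beginning and end are marked.
        padded = " " + word + " "
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def v_value(count, total, p_mod):
        # Equation 3: negative base-10 logarithm of the relative frequency;
        # the penalty for unseen n-grams is scaled by p_mod.
        if count > 0:
            return -math.log10(count / total)
        return -math.log10(1.0 / total) * p_mod

    def score_line(line, langs, word_models, totals, domains,
                   n_min, n_max, p_mod):
        # Equations 4 and 5: for each word, back off to the longest n-gram
        # length for which at least one n-gram of the word is known in any
        # language, average the n-gram values per word, and average the word
        # scores over the line. The language with the lowest score wins.
        words = line.split()
        line_scores = {g: 0.0 for g in langs}
        for word in words:
            n = next((k for k in range(n_max, n_min - 1, -1)
                      if any(f in domains[k] for f in char_ngrams(word, k))),
                     None)
            if n is None:
                continue  # no n-gram of this word is known in any language
            feats = [f for f in char_ngrams(word, n) if f in domains[n]]
            for g in langs:
                vals = [v_value(word_models[g][n].get(f, 0), totals[g][n], p_mod)
                        for f in feats]
                line_scores[g] += sum(vals) / len(vals)
        n_words = max(len(words), 1)
        return {g: s / n_words for g, s in line_scores.items()}

    # scores = score_line(line, langs, word_models, totals, domains, 1, 2, 1.01)
    # predicted = min(scores, key=scores.get)   # lowest score = best match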
4.4 HeLI 2.0 with adaptive language models

The fifth method we used in the experiments with the development data was the domain-adaptive version of HeLI 2.0. We used a similar LM adaptation method in the shared tasks of VarDial 2018, clearly winning the GDI and ILI tasks. For Jauhiainen et al. (2019b), we devised an improved version of the adaptation method, which is used here.

In order to select the best material to be used in LM adaptation, we need a confidence measure which indicates the best identified lines. In Jauhiainen et al. (2019b), we evaluated three confidence measures, and the score difference between the best and the second best scoring languages, CM_BS, proved to be the best performing one. The confidence measure is calculated as in Equation 6:

CM_{BS}(M) = R_h(M) - R_g(M) \qquad (6)

where g is the best and h the second best scoring language.

For the shared tasks, we get the complete set of lines to be identified as one collection. We denote an individual line M, as before, and the set of lines is denoted M_C. In the adaptation algorithm, we first perform a preliminary identification using the HeLI 2.0 method for each line M of the development or the test set M_C. The number of lines a to process simultaneously in adaptation is the number of lines in M_C divided by the number of splits k. The number of splits is optimized using the development set. For each line, we also calculate the confidence measure CM_BS. We remove the a most confident lines from M_C and mark them as finally identified with the given language labels. Then we add the information from the finally identified lines to their respective language models. Then we use the new language models to re-identify the lines remaining in M_C, again using the a most confident lines to augment the language models. This process is repeated until all the lines in M_C have been removed. In the iterative version of the adaptation method, the whole adaptation process is repeated several times (epochs).

For all submissions but one, we used a confidence threshold when deciding whether the information from a line was added to the language models, that is, not all lines were always used for adaptation (we did not use the confidence threshold with HeLI 2.0 using LM adaptation for the simplified Chinese script). The confidence threshold, CM_min, was also optimized on the development set (the confidence threshold was used with NB and LM adaptation for the traditional Chinese script, but it got optimized to zero). The number of splits, k, and the number of epochs were also optimized using the development set.

The best results using HeLI 2.0 with LM adaptation on each development set can be seen in Tables 2, 3, and 4. This was the best performing method with the GDI development set, but behind NB with the traditional Chinese and even behind simple scoring with the simplified Chinese.

4.5 Naive Bayes with adaptive language models

As our NB implementation seemed to outperform the HeLI 2.0 method in some experiments, we implemented a method using it together with the same LM adaptation scheme we used with HeLI 2.0 in the previous section. The best results using the Naive Bayes with LM adaptation on each development set can be seen in Tables 2, 3, and 4.
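Below is a schematic Python version of this adaptation loop as we have just described it — a sketch, not the actual system. The functions identify (any scorer returning per-language scores, such as the NB or HeLI 2.0 scorers sketched earlier) and update_models (which adds the features of a line to the model of a given language) are assumed to exist; one call corresponds to a single epoch.

    def confidence(scores, lowest_is_best=False):
        # Equation 6: difference between the best and the second best score.
        ordered = sorted(scores.values(), reverse=not lowest_is_best)
        return abs(ordered[0] - ordered[1])

    def adapt_epoch(lines, models, identify, update_models,
                    k_splits, cm_min, lowest_is_best=False):
        # One epoch of language model adaptation: identify the remaining
        # lines, accept the a most confident ones (a = len(lines) / k_splits),
        # add them to the language models if their confidence reaches cm_min,
        # and re-identify the rest with the updated models.
        labels = [None] * len(lines)
        a = max(1, len(lines) // k_splits)
        remaining = list(range(len(lines)))
        while remaining:
            scored = []
            for i in remaining:
                s = identify(models, lines[i])
                best = (min if lowest_is_best else max)(s, key=s.get)
                scored.append((confidence(s, lowest_is_best), i, best))
            scored.sort(reverse=True)                      # most confident first
            accepted, rest = scored[:a], scored[a:]
            for cm, i, lang in accepted:
                labels[i] = lang                           # final label for line i
                if cm >= cm_min:
                    update_models(models, lines[i], lang)  # adaptation step
            remaining = [i for _, i, _ in rest]
        return labels

    # With k_splits equal to the number of lines ("max" in Tables 2 and 3),
    # a single line is accepted and learned from per round.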

5 Results and discussion

The participants were allowed to submit three separate runs to each of the two tracks of the DMT shared task, as well as to the one track of the GDI shared task. For each track, we submitted results using the HeLI 2.0 with LM adaptation, the Naive Bayes, and the NB with LM adaptation, thus using all the submissions available to us. The parameters we used with each method were the same as with the respective development sets.

A total of seven teams provided language identification results for the DMT shared task and six teams for the GDI shared task. Tables 5, 6, and 7 show the macro F1-scores of our submitted runs on the test sets. Additionally, the tables show the methods and features used by the other teams together with their F1-scores. We submitted the runs using the team name "SUKI", which is the same we have used in the previous years. The results of the other participating teams were collected from the results packages provided by the organizers after the competition. The system description papers of the other teams were not available at the time of writing, and the identity of the other participants was also unknown. We were, however, provided with a short description of each system (as part of the submissions, the participants were asked to provide a short description.txt file), which we used to compile these results.

The simplified Chinese track was won by a team called "hezhou" with a clear margin over the second and the third submissions. According to their system description, the "hezhou" team used a variety of features learned from outside sources, such as a pretrained BERT model for Chinese and word embeddings trained on People's Daily News. Their results are interesting, but they cannot be directly compared with the ones provided by the other teams, as the track was supposed to be a closed one.

If we discount the results of the "hezhou" team, the two top places in all three tracks of the DMT and GDI shared tasks were divided between our "SUKI" team and the team "tearsofjoy". "tearsofjoy" used a two-stage SVM approach in all of their top runs. After the first stage, the most confidently identified sentences were added to the training data, this step thus functioning as an LM adaptation scheme similar to ours. They also submitted results using an SVM ensemble without adaptation, and the respective score differences are similar to those of our systems with and without LM adaptation.

Interestingly, the HeLI 2.0 with adaptation is better than Naive Bayes with adaptation on GDI and vice versa in the DMT. In DMT, the optimal character n-gram range for NB was up to 15 characters, which spans several words. In the experiments with the simplified Chinese development data, even the simple scoring method performed better than the HeLI 2.0 method with LM adaptation. The optimal maximum length of character n-grams was 15 characters also when using the simple scoring method. In the Chinese data, 15 characters span on average five words. From these results we could surmise that the poor performance of the HeLI method in the DMT shared task was at least partly due to its inability to capture features spanning several words.

There is a notable inconsistency in our test results with the simplified Chinese: the HeLI 2.0 with LM adaptation performs almost as well as the NB with LM adaptation. This might be due to the fact that we had forgotten to use the confidence threshold with the simplified Chinese development set for the HeLI method, and therefore we did not use one with the test set either. It could very well be that the use of a confidence threshold was disadvantageous with both of the Chinese test sets. Our winning submission for the traditional Chinese track used the NB with LM adaptation and a confidence threshold of 0 (using a confidence threshold of 0 was the result of optimization with the development set).

The fact that we did not come in first place in the GDI 2019 shared task is undoubtedly partly due to the fact that we did not use the provided iVector files at all in the classification task, unlike the other top three teams. Using the information from the iVectors together with our language identifier implementations would not have been trivial.

At the time of writing this article, the participants do not have access to the correct language labels of the test sets, which hinders a detailed error analysis.

Team/run            Method                               Features                             F1 dev   F1 test
hezhou              Ensemble of BERT, LSTM, SVM, ...     word-embeddings, word n-grams, ...   -        0.8929
tearsofjoy, run 1   SVM with LM adaptation               ch. n-grams 1–4, word n-grams 1–2    -        0.8738
SUKI, run 1         Naive Bayes with LM adaptation       ch. n-grams 1–15                     0.9225   0.8735
SUKI, run 3         HeLI 2.0 with LM adaptation          ch. n-grams 1–2, words               0.8909   0.8710
SUKI, run 2         Naive Bayes                          ch. n-grams 1–15                     0.9215   0.8685
itsalexyang         Ensemble of Naive Bayes and BiLSTM   ch. n-grams 2–3, word embeddings     -        0.8530
tearsofjoy, run 2   SVM ensemble                         ch. n-grams 2–5, words               -        0.8445
Adaptcenter         Ensemble of CNN and ?                ? and words                          -        0.8124
ghpaetzold          RNN                                  characters                           -        0.7934
gretelliz92         NN with 4 dense layers               TF-IDF vectors                       -        0.7496

Table 5: Simplified Chinese. The macro F1-scores attained by submitted methods on the test set. The results of our submissions are bolded.

Team/run            Method                               Features                             F1 dev   F1 test
SUKI, run 1         Naive Bayes with LM adaptation       ch. n-grams 1–14                     0.9295   0.9085
hezhou              Ensemble of BERT, LSTM, SVM, ...     word-embeddings, word n-grams, ...   -        0.9009
tearsofjoy, run 1   SVM with LM adaptation               ch. n-grams 1–4, words               -        0.8844
SUKI, run 2         Naive Bayes                          ch. n-grams 1–14                     0.9285   0.8815
SUKI, run 3         HeLI 2.0 with LM adaptation          ch. n-grams 1–2, words               0.9160   0.8712
itsalexyang         Ensemble of Naive Bayes and BiLSTM   ch. n-grams 2–3, word embeddings     -        0.8687
tearsofjoy, run 2   SVM ensemble                         ch. n-grams 2–5, words               -        0.8643
Adaptcenter         Ensemble of CNN and ?                ? and words                          -        0.8317
ghpaetzold          RNN                                  characters                           -        0.7959
gretelliz92         NN with 4 dense layers               TF-IDF vectors                       -        0.7484

Table 6: Traditional Chinese. The macro F1-scores attained by different methods on the test set. The results of our submissions are bolded.

Team/run            Method                       Features                                    F1 dev   F1 test
tearsofjoy, run 2   SVM with LM adapt.           ch. n-grams 1–5, word n-grams 1–2, iVect.   -        0.7593
SUKI, run 1         HeLI 2.0 with LM adapt.      ch. 4-grams                                 0.8657   0.7541
benf                SVM ens. with LM adapt.      various ch. and word level TF-IDF, iVect.   -        0.7455
SUKI, run 3         Naive Bayes with LM adapt.   ch. n-grams 2–6                             0.8442   0.7451
tearsofjoy, run 3   SVM ens.                     ch. n-grams 2–5, words, iVect.              -        0.6517
SUKI, run 2         Naive Bayes                  ch. n-grams 2–6                             0.6475   0.6460
BAM                 Ens. of CNN, LSTM, and KRR   ?                                           -        0.6255
dkosmajac           Ens. of QDA and RF           textual + iVect.                            -        0.5616
ghpaetzold          RNN                          characters                                  -        0.5575

Table 7: GDI 2019. The macro F1-scores attained by different methods on the test set. The results of our submissions are bolded.

6 Conclusions and future work

The two varieties of Chinese used in the DMT shared task seem to be distinguishable from each other quite well. Whether this is due to more structural or more functional differences is left to be determined by experts in Chinese.

We were happy to see that some of the other teams had taken notice of the success of our LM adaptation scheme in the ILI and the GDI 2018 shared tasks. In the GDI 2019 shared task, the use of some sort of an LM adaptation procedure was of paramount importance; the macro F1-scores rose to a completely different level when it was used. The use of LM adaptation did not have such a high importance in the DMT shared task as it did in GDI 2019, but it still always improved the results.

If we discount the "hezhou" submission, the results of both of the shared tasks once more indicate that deep neural networks do not reach the same accuracy in language identification as the SVM, NB, or HeLI methods.

We think that the poor results of the HeLI 2.0 method on the Chinese data were partly due to the shortness of the words and the importance of information spanning word boundaries. We would like to experiment with giving the HeLI method access to a larger context to verify that this indeed is the case. We will also seek to find a way to incorporate external information, such as that provided by the iVector files in GDI 2019, into the task of language identification of text.

Acknowledgments

This work has been partly funded by the Kone Foundation and FIN-CLARIN. We also thank the anonymous reviewer of an earlier paper for the idea to use character n-grams of infinite size in the formulaic description of the HeLI 2.0 method instead of words.

References

Mohamed Ali. 2018. Character level convolutional neural network for German dialect identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 172–177, Santa Fe, USA.

Fernando Benites, Ralf Grubenmann, Pius von Däniken, Dirk von Grünigen, Jan Deriu, and Mark Cieliebak. 2018. Twist Bytes - German dialect identification with data mining optimization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 218–227, Santa Fe, USA.

Ralf D. Brown. 2012. Finding and Identifying Text in 900+ Languages. Digital Investigation, 9:S34–S43.

Çağrı Çöltekin and Taraka Rama. 2017. Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 146–155, Valencia, Spain.

Keh-Jiann Chen, Chu-Ren Huang, Li-Ping Chang, and Hui-Li Hsu. 1996. SINICA CORPUS: Design Methodology for Balanced Corpora. In Language, Information and Computation: Selected Papers from the 11th Pacific Asia Conference on Language, Information and Computation: 20-22 December 1996, Seoul, pages 167–176, Seoul, Korea. Kyung Hee University.

Yingna Chen and Jia Liu. 2005. Language Model Adaptation and Confidence Measure for Robust Language Identification. In Proceedings of the International Symposium on Communications and Information Technologies 2005 (ISCIT 2005), volume 1, pages 270–273, Beijing, China.

Chu-Ren Huang, Feng-Yi Chen, Keh-Jiann Chen, Zhao-ming Gao, and Kuang-Yu Chen. 2000. Sinica treebank: Design criteria, annotation guidelines, and on-line interface. In Proceedings of the Second Workshop on Chinese Language Processing: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 12, CLPW '00, pages 29–37, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chu-Ren Huang and Lung-Hao Lee. 2008. Contrastive Approach towards Text Source Classification based on Top-Bag-of-Word Similarity. In Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, pages 404–410, Cebu City, Philippines.

Chu-Ren Huang, Jingxia Lin, Menghan Jiang, and Hongzhi Xu. 2014. Corpus-based study and identification of Mandarin Chinese light verb variations. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial), pages 1–10, Dublin, Ireland.

Tommi Jauhiainen, Heidi Jauhiainen, Tero Alstola, and Krister Lindén. 2019a. Language and Dialect Identification of Cuneiform Texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, Minnesota, USA.

Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2018a. HeLI-based Experiments in Discriminating Between Dutch and Flemish Subtitles. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 137–144, Santa Fe, NM.

Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2018b. HeLI-based Experiments in Swiss German Dialect Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 254–262, Santa Fe, NM.

Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2018c. Iterative Language Model Adaptation for Indo-Aryan Language Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 66–75, Santa Fe, NM.

Tommi Jauhiainen, Krister Lindén, and Heidi Jauhiainen. 2016. HeLI, a Word-Based Backoff Method for Language Identification. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 153–162, Osaka, Japan.

Tommi Jauhiainen, Krister Lindén, and Heidi Jauhiainen. 2019b. Language Model Adaptation for Language and Dialect Identification of Text. arXiv preprint arXiv:1903.10915.

Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2018d. Automatic Language Identification in Texts: A Survey. arXiv preprint arXiv:1804.08186.

Shanjian Li and Katsuhiko Momoi. 2001. A Composite Approach to Language/Encoding Detection. In Nineteenth International Unicode Conference (IUC19), San Jose, California, USA.

Shervin Malmasi and Marcos Zampieri. 2017. German dialect identification in interview transcriptions. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 164–169, Valencia, Spain.

Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, and Jörg Tiedemann. 2016. Discriminating Between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects, pages 1–14, Osaka, Japan.

A. M. McEnery and R. Z. Xiao. 2003. The Lancaster Corpus of Mandarin Chinese.

Maria Medvedeva, Martin Kroon, and Barbara Plank. 2017. When sparse traditional models outperform dense neural networks: the curious case of discriminating between similar languages. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 156–163, Valencia, Spain.

Tanja Samardžić, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob - A corpus of spoken Swiss German. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 4061–4066, Portorož, Slovenia.

Fan Xu, Mingwen Wang, and Maoxi Li. 2016. Sentence-level Dialects Identification in the Greater China Region. International Journal on Natural Language Computing (IJNLC), 5(6).

Fan Xu, Mingwen Wang, and Maoxi Li. 2018. Building Parallel Monolingual Gan Chinese Dialect Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 244–249, Miyazaki, Japan. European Language Resources Association (ELRA).

Marcos Zampieri, Liling Tan, Nikola Ljubešić, and Jörg Tiedemann. 2014. A Report on the DSL Shared Task 2014. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 58–67, Dublin, Ireland.

Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, and Preslav Nakov. 2015. Overview of the DSL Shared Task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), pages 1–9, Hissar, Bulgaria.

Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. 2017. Findings of the VarDial Evaluation Campaign 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 1–15, Valencia, Spain.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, Chris van der Lee, Stefan Grondelaers, Nelleke Oostdijk, Antal van den Bosch, Ritesh Kumar, Bornini Lahiri, and Mayank Jain. 2018. Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Santa Fe, USA.

Marcos Zampieri, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei Butnaru, and Tommi Jauhiainen. 2019. A Report on the Third VarDial Evaluation Campaign. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). Association for Computational Linguistics.

Shan Zhong, Yingna Chen, Chunyi Zhu, and Jia Liu. 2007. Confidence measure based incremental adaptation for online language identification. In Proceedings of the International Conference on Human-Computer Interaction (HCI 2007), pages 535–543, Beijing, China.