arXiv:2004.13939v2 [cs.CL] 30 Apr 2020

Evaluating Transformer-Based Multilingual Text Classification

Sophie Groenwold∗, Samhita Honnavalli∗, Lily Ou∗, Aesha Parekh∗, Sharon Levy, Diba Mirza, William Yang Wang
University of California, Santa Barbara
{sophiegroenwold, shonnavalli, lilyou, aeshaparekh}@ucsb.edu
{sharonlevy, dimirza, william}@cs.ucsb.edu

∗ Equal contribution.

Abstract

As NLP tools become ubiquitous in today's technological landscape, they are increasingly applied to languages with a variety of typological structures. However, NLP research does not focus primarily on typological differences in its analysis of state-of-the-art language models. As a result, NLP tools perform unequally across languages with different morphological and syntactic structures. Through a detailed discussion of word order typology, morphological typology, and comparative linguistics, we identify which variables most affect language modeling efficacy; in addition, we calculate word order and morphological similarity indices to aid our empirical study. We then use this background to support our analysis of an experiment we conduct using multi-class text classification on eight languages and eight models.

1 Introduction

Historically, NLP research has focused on a select group of English-similar languages for which there exists abundant training data, and thus favors typological characteristics shared by those languages (Joshi et al., 2020; Bender, 2009). However, the increasing prevalence of NLP technologies has made it imperative that models perform equally across all languages, regardless of the availability of a language's text corpora. Current NLP tools are multilingual in the sense that there is nothing explicitly preventing them from being utilized for different languages, given that there exist annotated data (Mielke et al., 2019). However, this does not mean that all NLP tools perform fairly across languages. While it is generally recognized that linguistic properties should be taken into account during the creation of NLP models, there is little consensus on which linguistic features most influence a language's modeling performance. Therefore, we seek to identify these typological attributes.

Towards this end, we first provide a comprehensive background of linguistic aspects that we hypothesize affect language modeling efficacy, broken down into three main categories: word order typology, morphological typology, and comparative linguistics (the comparison of languages based on language family). To quantify our observations from the first two categories, we borrow a metric from the related linguistics literature: the similarity index, which is used to describe how similar a target language's features are to English. We calculate similarity indices for both word order and morphological features.

We then apply these observations to the results of an empirical study we carry out of the task of text classification, which demonstrates the performance of eight models, each trained on texts in one of eight languages (Chinese, English, French, German, Italian, Japanese, Russian, and Spanish) of the MLDoc corpus (Schwenk and Li, 2018).

Of the metrics we consider, we find that language family provides the best indication of language modeling results. We conclude that it remains imperative for NLP researchers to consider the typological properties of linguistic similarity outlined, while measures of similarity between two languages must be refined.

Thus, our contributions include:

• A summary of significant linguistic features, written from a perspective pertinent to NLP, and a categorization of major languages based on those features.

• A comprehensive experiment that demonstrates a disparity in the performance of eight models, where each model is trained and tested on a single language for varying sizes of data with equal label distribution.

• An analysis of language modeling efficacy in the contexts of word order typology, morphological typology, and comparative linguistics, including an illustration of the relationship between language-pair similarity and performance.

In this paper, we build upon the aforementioned work by evaluating a text classification task on eight languages, where for each task we use the same language for both training and testing. This allows us to pinpoint linguistic features that aid modeling in a general context, as opposed to a cross-lingual context. Additionally, we expand our discussion of linguistics to include both word order and morphology, both of which we quantify by using a metric from the related linguistic literature.

2 Related Work

Mielke et al. (2019) investigates NLP language-agnosticism by evaluating a difficulty parameter with recurrent neural network language models. This work has found that word inventory size, or the number of unique tokens in the training set, had the highest correlation with their difficulty parameter (Mielke et al., 2019). This is in opposition to their previous hypothesis that morphological counting complexity (MCC), a simple metric to measure the extent to which a given language utilizes morphological inflection, is a primary factor in determining language modeling performance (Cotterell et al., 2018). We wish to build on this work by introducing increasingly nuanced indices for measuring language-pair similarity in both word order and morphological aspects.

Much of the prior research in analyzing typology from a modeling perspective focuses on the linguistic learning of BERT (Devlin et al., 2019), and as much of our analysis is also centered on our BERT results, we will do so as well here. Previous work has sought to pinpoint what typological aspects BERT learns through cross-lingual transfer tasks. Pires et al. (2019) evaluates mBERT in this context and hypothesizes that lexical overlap and typological similarity improve cross-lingual transfer. To assess typological similarity, they use six word order features adopted from WALS (Pires et al., 2019; Dryer and Haspelmath, 2013). In this work, the authors note that mBERT transfers between languages with entirely different scripts, and thus no lexical overlap. Building on this, K et al. (2020) disproves that word-piece similarity, or lexical overlap, improves or otherwise affects cross-lingual abilities. Instead, they propose that structural similarities are the sole language attributes that determine modeling efficacy, but do not elaborate on the specific linguistic properties that may be involved (word order, morphological typology, and word frequency, for example) (K et al., 2020).

3 Linguistic Background

We approach our analysis of the language modeling disparity from a linguistic perspective; as such, we break our discussion into three main areas: word order typology, morphological typology, and comparative linguistics. For further research, it may help the reader to know that the former two lie within the subfield of linguistic typology (which examines languages based on their structural features), whereas the latter (which studies the historical development and family categorization of languages) is a different subfield of linguistics.

3.1 Word Order

Word order typology is the study of the ordering of constituents within a sentence. In our discussion of word order, we will focus chiefly on constituent word order, where a constituent is defined as a stand-alone unit of language (for example, a word or phrase); however, within word order typology there do exist other points of analysis, such as modifier order, that we will not consider due to their perceived limited effect on modeling efficacy. We will also discuss word order flexibility, which is the frequency at which a sentence will vary from its dominant order. A summary of these details can be found in Table 1.

3.1.1 Constituent Order

Nearly all languages have a dominant word order that is the most frequently used ordering of a sentence's subject, verb, and object (Comrie, 1989). Given the flexibility of word order in a certain language (see Section 3.1.2 below), a sentence in that language might vary from the dominant order because of situational constraints, to place emphasis, or to convey emotion. Since the dataset we use in our empirical study is composed of formal news stories, we expect our samples to adhere to each

Language   Constituent Order   Analytic or Synthetic   Agglutinative or Fusional   Language Family
English    SVO                 Analytic                N/A                         Indo-European: Germanic
Chinese    SVO                 Analytic                N/A                         Sino-Tibetan: Sinitic
French     SVO                 Synthetic               Fusional                    Indo-European: Romance
Italian    SVO                 Synthetic               Fusional                    Indo-European: Romance
Spanish    SVO                 Synthetic               Fusional                    Indo-European: Romance
Japanese   SOV                 Synthetic               Agglutinative               Japonic: Japanese
Russian    SVO                 Synthetic               Fusional                    Indo-European: Slavic
German     SOV & SVO           Synthetic               Fusional                    Indo-European: Germanic

Table 1: Major topics of discussion in Section 3 by language, ordered by word order from most rigid (English) to most flexible (German).

language's dominant word order more than what is perhaps representative of the language in other contexts (e.g. informally).

The constituent orders present in this dataset are subject-verb-object (SVO) for English, Spanish, French, Russian, Italian, and Chinese and subject-object-verb (SOV) for Japanese. German is not categorized as either. There are other word ordering schemes present in the world's languages, such as verb-subject-object (VSO) and object-verb-subject (OVS), that are not represented in this corpus.

We note two language-specific details here. Although most languages can be classified into a single dominant word order, the degree of "dominance" the dominant word order has is related to the language's word order flexibility. Although we will discuss word order flexibility shortly, it is important to note that Russian, as a language with flexible word order, is still considered an SVO language despite that ordering being less strict than other languages (Dryer, 2013). However, German varies between SVO (in main clauses without an auxiliary verb) and SOV (in clauses with an auxiliary verb and in subordinate clauses) equally and is thus categorized as lacking a dominant verb order (Dryer, 2013).

3.1.2 Word Order Flexibility

Without delving too deep into the nuances of word order flexibility, we wish to provide some general observations. For our purposes, we simplify word order flexibility to a spectrum with rigid word order (where the constituent and modifier word orders follow strict grammatical rules) on one end and flexible word order (where morphological marking is grammatically necessary) on the other (Dryer, 2013; Sakel, 2015). However, when discussing word order flexibility it is imperative to identify which specific elements are flexible, and thus we include this when necessary.

Of the languages in this dataset, English has the most rigid word order, as modern English always follows a subject-verb format, which is usually realized in an SVO setting (Bozsahin, 2020). Chinese follows similar constituent order patterns to English, with greater degrees of flexibility in verb-object order (Gao, 2008). Next, French, Italian, and Spanish are generally regarded as having similar levels of word order flexibility; however, related work views the three languages in terms of their grammatization, and thus we can view French as the most rigid of the three, followed by Italian and then Spanish, though the margins of difference between the three are low (Lahousse and Lamiroy, 2012). Japanese has more flexibility than the languages discussed previously, as it frequently uses both SOV and OSV word orders (Bozsahin, 2020). Finally, as mentioned above, Russian depends heavily on morphological marking despite following an SVO constituent order, and German does not follow a single constituent order at all, thus making it the most flexible (Dryer, 2013).

3.2 Morphological Typology

While in the previous section we used word order typology to look at the ordering of constituents within a sentence or phrase, we now turn to morphological typology to examine the internal structure of words, i.e. the patterns that involve the creation and structure of words, primarily by their morphemes (Genetti, 2014). A morpheme is the smallest semantic unit in language, and differs from a constituent in that it cannot convey meaning alone; for instance, the word "unbelievable" would be considered a constituent, whereas "un", "believe", and "able" are morphemes.

Having a thorough understanding of morphemes and the different ways they are utilized across languages is essential for researchers in NLP because many contemporary NLP models use word embeddings as the underlying representation for language, using primarily Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) architectures. Recent work in this area has developed word embedding methods based on morphemes, instead of constituents (Luong et al., 2013; Cotterell et al., 2016). Considering the prevalence of these methods and their importance in word embedding processes, we further investigate the morphological diversity of the languages in this dataset.

Traditionally, linguists have sought to classify languages morphologically along two axes: analytic/synthetic and agglutinative/fusional (Garland, 2006). The former describes the number of morphemes per word, and the latter the clarity of distinctions between morphemes. A summary of these details can be found in Table 1.

3.2.1 Analytic - Synthetic Axis

Given that inflection is a modification of a word for a particular grammatical category, we can define an analytic language as containing fewer morphemes per word and using less inflection, and a synthetic language as containing more morphemes per word and using more inflection. Of this dataset, English and Chinese are analytic, while the remaining languages fall within the synthetic spectrum. Chinese is more so than English, as it formerly was seen as an isolating language (an extreme of analytic, where there is no use of inflection and words are little more than morphemes); however, today it is considered to be analytic because its words commonly contain more than one morpheme (Bybee, 1997). English is more moderately analytic and is thus commonly categorized as fusional, which is typically reserved for synthetic languages.

Because analytic languages use little inflection, information is instead conveyed through tools such as word order, helper words, and context. Thus, there is a correlation between increased word order rigidity and the analytic categorization.

3.2.2 Agglutinative - Fusional Axis

Within synthetic languages, agglutinative languages have easily discernible morphemes, and morphemes typically contain a single feature; fusional languages use a single inflectional morpheme to signify multiple grammatical meanings. Six languages in this dataset (French, Spanish, Italian, Russian, German, and Japanese) are synthetic, where all are fusional except Japanese, which is agglutinative. Therefore, Japanese follows a more regular morphology, whereas the fusional languages may be considered more morphologically complex (with German, then Russian, having the highest degrees of inflection).

3.3 Comparative Linguistics

Another beneficial angle to compare languages from is comparative linguistics, which traces the development of modern natural languages through historical comparison and categorizes them based on common origin into language families.

First, we examine the families of the languages of this dataset. Of the eight languages we used, six belong to the West European branch of the Indo-European language family; English and German are Germanic languages; Spanish, Italian, and French are Romance languages; and Russian is a Slavic language. Chinese descends from the Sino-Tibetan family, and Japanese from the Japonic family (Gordon and Grimes, 2019).

This dataset acutely under-represents languages from other language families. For instance, West European languages are but one branch of the Indo-European superfamily; a significant sister family to the West European languages is the Indo-Iranian branch (which includes Persian from the Iranian sub-family and Hindi from the Indic sub-family). We also have no languages from the Niger-Congo and Afro-Asiatic families (which are the two major superfamilies of Africa and together make up 27% of the known world languages), the Austronesian families (at 17.7%), the Trans-New Guinea families (at 6.8%), or considerable others (Gordon and Grimes, 2019).

Looking at familial origins is beneficial for easy categorization of languages; however, they are not always indicative of typological properties, which are intuitively more important for a given language's modeling efficacy. For example, although both English and German are classified as Germanic languages, they are different when compared in both word order and morphological typology (see Table 1); this is because English has come into areal contact with more languages than German, which is considered more archaic. Thus, Modern English typology more closely resembles French typology, as it has more recent influence from French and other Romance languages than from its Germanic roots.

English - Russian    92.86
English - Spanish    86.21
English - Italian    85.19
English - Chinese    70.73
English - French     65.52
English - German     51.72
English - Japanese   30.44

Table 2: Word order similarity indices of the languages in the MLDoc corpus, from greatest to least.

English - German     83.33
English - Russian    66.67
English - French     58.33
English - Japanese   58.33
English - Spanish    50
English - Chinese    50

Table 3: Morphological similarity indices of the languages in the MLDoc corpus, from greatest to least.2

1 English is excluded, as the similarity indices displayed in the table are computed with respect to English.
2 We do not include a morphological similarity index for Italian, as there are only two WALS morphology features encoded for both Italian and English.

4 Measures of Language Similarity

For clarity of analysis, we use two similarity indices that we have adopted and modified from the related linguistic literature (Comrie, 2016). We use the relevant features from the World Atlas of Language Structures, or WALS, where a feature is defined as a structural property of language that differs across languages (Dryer and Haspelmath, 2013). For each similarity index, we use only WALS entries that are categorized as "morphology" or "word order" features, as appropriate.

If there is any difference in language performance across NLP models, we expect it to correlate with how similar a language is to English. We anticipate this simply because many NLP technologies have been made with the underlying assumption of English typology. Thus, to measure English-similarity, evaluate this hypothesis, and identify the typological characteristics that have the greatest correlation with modeling performance, we utilize two similarity metrics: the word order similarity index and the morphological similarity index. Each similarity index is determined using WALS data and is computed as follows. Given two languages, we first count the number of features that are documented for both; we then count the instances of equal categorization for each feature. For instance, in finding the English - Chinese morphological similarity index, we identify 12 morphology features that WALS has recorded for both, and six that contain the same value. Thus the English - Chinese morphological similarity index is 50. A complete table of word order similarity indices and morphological similarity indices between English and the remaining seven languages in this dataset can be found in Tables 2 and 3, respectively.1

We note that it is perhaps problematic to measure languages based on their similarity to English. However, we do so to demonstrate an existing disparity in language modeling efficacy; by no means are we suggesting that it is ethical for English to be the accepted standard in broader applications of NLP.

5 Multilingual Disparity in Text Classification

We train eight models using texts in each of eight languages, and compare their performance on the task of multi-class text classification. First, we control for the amount of training data across all languages so that any difference in performance can be attributed to either model design or linguistic factors. We then vary the amounts of data used to train these models to observe the possible effects of minimal training data. Additionally, by using the MLDoc corpus we ensure that the distribution of labels in all datasets is identical.

We analyze performance based on the accuracy and F1-score. Results for these metrics, in addition to precision and recall, are displayed in the Appendix.

5.1 Text Classification Models

We investigate eight commonly used text classification models, using default hyperparameters


Figure 1: Macro F1-score for each of the eight languages in MLDoc when trained on size 10,000, for both BERT-base-cased and mBERT-base-cased.3

given by open-source libraries: Linear Logistic Regression (LLR) (Bishop, 2006), Multinomial Naive Bayes (MNB) (Domingos and Pazzani, 1997), and Linear Support Vector Machines (linearSVC) (Fan et al., 2008) models from scikit-learn; and Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), Bi-directional Long Short Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997), and Convolutional Neural Network (CNN) (Dumoulin and Visin, 2016) models from Keras.

We also use both BERT and Multilingual BERT (mBERT) (Devlin et al., 2019). For BERT, we use a pre-trained BERT model specific to the language at hand. We use BERT for English and Chinese, Italian BERT4, German BERT5, FlauBERT for French (Le et al., 2019), BETO for Spanish (Cañete et al., 2020), RuBERT for Russian (Kuratov and Arkhipov, 2019), and bert-base-japanese6 for Japanese. All models are fine-tuned for the task of text classification using our training data, with a batch size of 16, maximum sequence length of 250, and learning rate of 2e-5, as recommended by the related work (Devlin et al., 2019), for 1 epoch. This allows us to expand the scope of data for language performance differences on a wide range of models.

5.2 Corpus

These models are evaluated on a multilingual subset of the Reuters RCV2, MLDoc (Schwenk and Li, 2018). The Reuters corpus contains news stories in English, German, Spanish, Italian, French, Russian, Chinese, and Japanese. They are sorted into four groups based on the primary subject of each story: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets) (Lewis et al., 2004). For each language, the data is split into multiple training sets (1000, 5000, and 10000 news stories)3, a development set (1000 news stories), and a test set (4000 news stories), for which MLDoc guarantees uniform class distributions. Thus, we will analyze model performance on the task of categorizing text samples into one of the four classes for each language.

Admittedly, there are disadvantages to using MLDoc: the corpus is not parallel and nearly 10% of articles are duplicates (Eriksson, 2016); additionally, the languages included are not representative of maximal typological diversity. However, we justify the use of MLDoc as follows. The corpus is representative of a real-world application of NLP technologies. The duplicates noted by Eriksson (2016) are distributed randomly and thus will not affect the results for any single language. Lastly, although it may be possible to use a dataset that is perhaps more representative of the linguistic variety possible, such as Wikipedia, that would sacrifice either the balance or presence of labels provided by MLDoc.

3 With the exception of Spanish (9458 documents) and Russian (5216 documents), the maximum dataset sizes provided by MLDoc.
4 https://github.com/dbmdz/berts#italian-bert
5 https://github.com/dbmdz/berts#german-bert
6 https://github.com/cl-tohoku/bert-japanese
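The non-BERT classifiers above come from scikit-learn and Keras with default hyperparameters; a minimal sketch of one such baseline (a linear SVM over bag-of-words features on the four MLDoc classes) might look like the following, where the toy documents are invented stand-ins for MLDoc news stories rather than the actual corpus:

```python
# Sketch of a default-hyperparameter linearSVC baseline for 4-class
# text classification. Documents and labels are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented stand-ins for news stories and their MLDoc topic labels.
texts = ["shares rose on strong earnings",
         "parliament passed the budget bill",
         "the central bank cut interest rates",
         "factory output grew last quarter"]
labels = ["MCAT", "GCAT", "ECAT", "CCAT"]

# TF-IDF features feeding a linear SVM, both with library defaults.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["the bank lowered rates again"])[0])
```

In the paper's setup, the same pipeline shape would be fit once per language on MLDoc training splits of size 1000, 5000, and 10000.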


Figure 2: Comparison between average F1-score with mBERT, on training sizes 1000, 5000, and 10,000 for each language in the MLDoc dataset. The first letter of each language has been used to denote it.7

Figure 3: Average F1-score with mBERT, on training size 10,000, of each language family in the MLDoc dataset. The x-axis corresponds to Slavic, Japonic, Sino-Tibetan, Romance, and Germanic.7
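The similarity-index computation described in Section 4 — count the WALS features documented for both languages, then take the percentage whose values agree — can be sketched as follows. The feature IDs and values below are illustrative stand-ins, not the paper's actual WALS extraction:

```python
# Minimal sketch of the Section 4 similarity index.
# Each language is a dict mapping WALS feature ID -> categorized value.

def similarity_index(lang_a, lang_b):
    """Percentage of features documented for both languages whose
    categorization is equal; None if no feature is shared."""
    shared = [f for f in lang_a if f in lang_b]   # documented for both
    if not shared:
        return None
    matches = sum(lang_a[f] == lang_b[f] for f in shared)
    return 100 * matches / len(shared)

# Hypothetical morphology-feature values for illustration only.
english = {"26A": "strongly suffixing", "29A": "syncretic", "30A": "two"}
chinese = {"26A": "little affixation", "29A": "no marking", "30A": "two"}

# One of the three shared features agrees, giving an index of 100/3.
print(similarity_index(english, chinese))
```

Note that, as the paper observes in Section 6.4, this formula weights every WALS feature equally, which may understate the importance of individual sub-attributes.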

5.3 Results

Most models display similar trends in their unequal performance based on language, where models trained on German, English, French, and Spanish consistently outperform models trained on Italian, Russian, Chinese, and Japanese (see Appendix). The highest performing language across the models we tested is German, closely followed by English. Spanish, French, and Italian follow closely, at times outperforming English and German. Inconsistent with Russian's high similarity index (see Table 2), our results show that Russian is the lowest performing European language. This may be because it is morphologically distant from English (see Table 1); in addition, it has flexible word order (second only to German) and high morphological complexity. Another contributing factor could be that the MLDoc dataset provides a maximum training size of 5216 for Russian, in contrast to the 9000 to 10000 provided for all other languages. As illustrated in Figure 2, languages appear to learn at varying rates; for example, the F1-score for English appears to increase by a similar amount going from 1000 to 5000 versus when going from 5000 to 10000. On the other hand, Spanish increases at a faster rate going from 1000 to 5000, but nearly plateaus in the segment going from 5000 to 10000. If Russian were to follow a similar learning pattern to English, it would be likely that the model's underperformance on Russian could be attributed to insufficient data.

We note that BERT displays higher scores than mBERT when measured by macro F1-score for every language except English and Russian (for which the differences are negligible: 0.03 for both). However, it is interesting to note that the largest increases in F1-score from mBERT to BERT occur in Spanish, German, French, and Italian (see Figure 1). This indicates that these languages gain the most performance when in use by a model that has been pretrained only by the same language, and thus benefit the least from the use of mBERT, which has been trained on multiple languages.

For the non-BERT models, training on Chinese largely produces the least effective results, with the margin of difference between the accuracy of Chinese and the top-performing language generally being between 20 to 30 percent for most models. Japanese is not significantly better, and at times underperforms Chinese, especially when measured by precision. The macro and weighted recall and F1-scores for Chinese are often below 70%, which is rarely the case with any of the European languages. The highest precision, recall, and F1-scores always correspond to German, French, and Spanish (see Appendix for further details).

7 With the exception of Spanish (9,458 documents) and Russian (5,216 documents), the maximum dataset sizes provided by MLDoc.

6 Analysis

For each of the eight models with which we tested our languages, we now take the average F1-score on a training size of 10,000. The top four performing languages across our models were German, English, French, and Spanish, which we classify as high-performing languages. Thus, we will consider the remaining four – Italian, Russian, Chinese, and Japanese – as low-performing languages. We also include discussion of comparative linguistics within our analysis, for which Figure 3 provides a useful guide.

Figure 4: Correlation between the word order similarity index and average F1-scores on mBERT for each language in MLDoc on training size 10,000. The first letter of each language has been used.

6.1 Effects of Resource Availability

We note that there is a generally positive trend for all languages' performance when trained on increasingly large datasets, as can be seen in Figure 2. Although this is not a new observation, it is a noteworthy one, as it indicates that because larger training sizes improve modeling efficacy, it is worthwhile for future work to expand datasets from predominately underrepresented languages. We expect languages with lower initial performance to increase more rapidly when given larger training sizes. For instance, it is intuitive that Spanish improves more than German when moving from training size 1000 to 5000 (8.2% versus 3.5% with mBERT), since Spanish has a lower initial value than German (85.1% and 92.6% with mBERT, respectively). However, there are exceptions to this trend. We note that Italian, Russian, Japanese, French, and Spanish increase at faster rates than English, German, and Chinese, and that these distinctions are not in line with our categorizations of high and low-performing languages.

6.2 Features of High-Performing Languages

Overall, it is surprising that German is the highest performing language of the MLDoc dataset, given that it is dissimilar to English in our primary orders of typology: constituent order, word order flexibility, and morphological categorization. Whereas English is a rigid SVO language, German is flexible and is categorized as both SVO and SOV; in addition, English is an analytic language, whereas German is synthetic. It is also important to note that German is consistently the highest performing language across all the models of our study, despite their significantly different architectures (although it is occasionally surpassed by French, Spanish, or English, and then only by a small margin).
is a noteworthy one, as it indicates that, because larger training sizes improve modeling efficacy, it is worthwhile for future work to expand datasets from predominantly underrepresented languages. We expect languages with lower initial performance to increase more rapidly when given larger training sizes. For instance, it is intuitive that Spanish improves more than German when moving from training size 1000 to 5000 (8.2% versus 3.5% with mBERT), since Spanish has a lower initial value than German (85.1% and 92.6% with mBERT, respectively). However, there are exceptions to this trend. We note that Italian, Russian, Japanese, French, and Spanish increase at faster rates than English, German, and Chinese, and that these distinctions are not in line with our categorizations of high- and low-performing languages.

6.2 Features of High-Performing Languages

Overall, it is surprising that German is the highest-performing language of the MLDoc dataset, given that it is dissimilar to English in our primary orders of typology: constituent order, word order flexibility, and morphological categorization. Whereas English is a rigid SVO language, German is flexible and is categorized as both SVO and SOV; in addition, English is an analytic language, whereas German is synthetic. It is also important to note that German is consistently the highest-performing language across all the models of our study, despite their significantly different architectures (although it is occasionally surpassed by French, Spanish, or English).

Of the other consistently high-performing languages (French, Spanish, and English), typological properties are much more consistent. All three have an SVO constituent order. Although French and Spanish are synthetic fusional languages while English is analytic, English still contains a moderate amount of inflection, and thus there is not a substantial amount of morphological difference between the three.

We noted previously from a comparative linguistics standpoint that although English and German are both from the Germanic sub-family, Modern English more closely resembles French because of its use of loanwords and grammatical patterns from Old French (see Section 3.3). Thus it is not entirely unexpected that English, French, and Spanish have roughly equivalent performance results. What is more difficult to account for is the high performance of German, despite its relative typological irregularity and morphological complexity.

6.3 Features of Low-Performing Languages

Although we mentioned that Italian, Russian, Chinese, and Japanese are the lower four performing languages, this is not without some interesting caveats.

First, it is important to note that Italian falls under our category of "low-performing" because our modeling results with Italian are lower than the four languages discussed in Section 6.1. However, the difference between Italian and the "high-performing" languages is smaller than the difference between Italian and the other languages of this section: Russian, Chinese, and Japanese. This is not particularly surprising, as Italian has an SVO constituent order, a fusional morphological structure, and has historically developed under similar circumstances to our other Romance languages, French and Spanish.

Next, Chinese has different performance results depending on the model that uses it: for the majority of our models, it is the lowest-performing language, yet for BERT its margin of difference with the top-performing language is 2.9%, and 3.3% for mBERT (both on training sizes of 10,000). Additionally, for BERT and mBERT, the top-performing language is English rather than German; and as can be seen from Table 1, English and Chinese have similar syntactic and morphological structures. Thus, we conclude that BERT's architecture is more responsive to typological features than the other models we have included in this study. In a similar manner, we conclude that since linguistic features seem to be weighted less in our non-BERT models, these models are more dependent on other language-specific elements; for example, written Chinese and Japanese are composed of logograms, while the remaining languages utilize phonemic writing.

6.4 Similarity Index Correlations

When comparing word order similarity indices and morphological similarities against our F1-score data for mBERT, we find inconclusive results. Word order similarity seems to have an inverse correlation with modeling efficacy, with Japanese as an obvious outlier (see Figure 4). Morphological similarity does not have a discernible correlation (see Figure 5).

[Figure 5: Correlation between the morphological similarity index and average F1-scores on mBERT for each language in MLDoc on training size 10,000. The first letter of each language has been used. Italian is not included in Figure 5; see Section 4.]

This may be because some of our indices seem counterintuitive. For instance, given that both English and German are Germanic languages, we may consider them to be similar in a general sense. However, we would hardly expect a high similarity index in a morphological context. Consulting Table 1, we see that English is an analytic language, while German is synthetic and fusional; and English is the most rigid language of our dataset, while German is the most flexible (as we have noted, increased word order flexibility is a good indication of high morphological complexity). Additionally, Figure 4 displays an inverse correlation between word order similarity and mBERT, which is also illogical.

Looking into the specific features that we include as factors in the morphological similarity index, we hypothesize that certain morphological attributes might hold greater importance to language similarity than others (our formula for similarity indices placed equal weight on each WALS feature). For example, we find that for feature 29A, Syncretism in Verbal Person/Number Marking, English and the low-performing languages are dissimilar, whereas the high-performing languages are similar (Baerman and Brown, 2013). In future work, it may be worthwhile to look into such sub-attributes.

We attribute much of this to the features available in WALS. However, as is visible in Figure 3, language family is still indicative of model performance. Because languages from the same family are more likely to have derived similar typological characteristics, we conclude that linguistic factors still largely determine modeling efficacy.
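To make the equal-weight construction concrete, the sketch below computes a similarity index as the fraction of features on which a language agrees with English. The feature values are hypothetical placeholders (not the paper's actual WALS data), and `similarity_index` is an illustrative helper name; the only point carried over from our formula is that every feature contributes equally.

```python
# Sketch of an equal-weight similarity index over WALS-style features.
# Feature values below are HYPOTHETICAL placeholders, not the paper's data.
WALS = {
    "English":  {"81A": "SVO",     "morph": "analytic",      "29A": "no syncretism"},
    "German":   {"81A": "SVO/SOV", "morph": "fusional",      "29A": "no syncretism"},
    "Japanese": {"81A": "SOV",     "morph": "agglutinative", "29A": "syncretic"},
}

def similarity_index(lang, reference="English"):
    """Fraction of features on which `lang` agrees with the reference
    language; each feature carries equal weight."""
    feats = WALS[reference].keys()
    matches = sum(WALS[lang][f] == WALS[reference][f] for f in feats)
    return matches / len(feats)

print(similarity_index("German"))    # agrees on 1 of 3 placeholder features
print(similarity_index("English"))   # identical to itself: 1.0
```

A weighted variant would simply multiply each feature's agreement by a per-feature weight before summing, which is the refinement suggested above for sub-attributes such as 29A.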
7 Conclusion

We have provided a thorough investigation of linguistics from an NLP perspective and demonstrated that the variance in language modeling efficacy can be attributed to typological differences in the languages modeled. We have also shown that language family is a strong indication of how well a language will perform. Although we show little correlation between word order or morphological similarity and the strength of a language's modeling results, we note that language families share typological attributes, and thus conclude that a more nuanced definition of language similarity is necessary to evaluate this point. It is crucial that developers keep these findings in mind while creating and evaluating NLP tools.

Acknowledgments

We thank Dr. Bernard Comrie (University of California, Santa Barbara) for his expertise regarding cross-lingual typology and discussions on the topic, and Anusha Anand (University of South Carolina) for providing us with direction on which linguistic features to research.

References

Matthew Baerman and Dunstan Brown. 2013. Syncretism in verbal person/number marking. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Emily M. Bender. 2009. Linguistically naïve != language independent: Why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction Between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?, ILCL '09, pages 26–32, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.

Cem Bozsahin. 2020. Word order, word order flexibility and the lexicon.

Joan L. Bybee. 1997. Semantic aspects of morphological typology. In Joan L. Bybee, John Haiman, and Sandra A. Thompson, editors, Essays on Language Function and Language Type, pages 25–37. John Benjamins Publishing.

José Cañete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. To appear in PML4DC at ICLR 2020.

B. Comrie. 1989. Language Universals and Linguistic Typology: Syntax and Morphology. University of Chicago Press.

Bernard Comrie. 2016. Measuring language typicality, with special reference to the Americas, pages 363–384.

Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1651–1660, Berlin, Germany. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Pedro Domingos and Michael Pazzani. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130.

Matthew S. Dryer. 2013. Order of subject, object and verb. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning.

Robin Eriksson. 2016. Quality assessment of the Reuters vol. 2 multilingual corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1813–1819, Portorož, Slovenia. European Language Resources Association (ELRA).

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874.

Qian Gao. 2008. Word order in Mandarin: Reading and speaking. In Proceedings of the 20th North American Conference on Chinese Linguistics.

Jennifer Garland. 2006. Morphological typology and the complexity of nominal morphology in Sinhala. In Proceedings from the Workshop on Sinhala Linguistics.

Carol Genetti. 2014. How Languages Work: An Introduction to Languages and Linguistics. Cambridge University Press.

R. G. Gordon and B. F. Grimes. 2019. Ethnologue: Languages of the World, 22nd edition. SIL International, Dallas, Texas.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. To appear in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.

Yuri Kuratov and Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for Russian language. CoRR, abs/1905.07213.

Karen Lahousse and Béatrice Lamiroy. 2012. Word order in French, Spanish and Italian: A grammaticalization account. Folia Linguistica, 2:387–415.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2019. FlauBERT: Unsupervised language model pre-training for French.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics.

Sebastian J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Jeanette Sakel. 2015. Study Skills for Linguistics. Understanding Language series.

Holger Schwenk and Xian Li. 2018. A corpus for multilingual document classification in eight languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

Appendix
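The tables that follow report each metric twice: a macro average (unweighted mean over classes) and a weighted average (mean weighted by each class's support). A minimal, dependency-free sketch of the distinction, using illustrative labels rather than MLDoc data:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 aggregated two ways: macro (unweighted mean over
    classes) and weighted (mean weighted by class support) -- the two
    numbers reported as "macro, weighted" in the tables below."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))
        pred_c = sum(p == c for p in y_pred)   # predicted as class c
        true_c = support[c]                    # actually class c
        prec = tp / pred_c if pred_c else 0.0
        rec = tp / true_c if true_c else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted

# Toy example with an imbalanced class distribution:
macro, weighted = f1_scores(["C", "C", "C", "E"], ["C", "C", "E", "E"])
print(round(macro, 4), round(weighted, 4))  # -> 0.7333 0.7667
```

The gap between the two columns in the tables is therefore a rough signal of class imbalance in a language's test set: the further apart macro and weighted values are, the more unevenly the classifier performs across classes.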

Linear Logistic Regression

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.678 | 0.737, 0.739 | 0.565, 0.678 | 0.549, 0.678 |
| 1000 | English | 0.899 | 0.900, 0.901 | 0.899, 0.899 | 0.899, 0.899 |
| 1000 | French | 0.899 | 0.899, 0.899 | 0.899, 0.899 | 0.899, 0.899 |
| 1000 | German | 0.907 | 0.909, 0.908 | 0.908, 0.907 | 0.908, 0.907 |
| 1000 | Italian | 0.851 | 0.854, 0.853 | 0.852, 0.851 | 0.851, 0.851 |
| 1000 | Japanese | 0.682 | 0.745, 0.748 | 0.680, 0.682 | 0.683, 0.682 |
| 1000 | Russian | 0.835 | 0.834, 0.836 | 0.835, 0.835 | 0.834, 0.835 |
| 1000 | Spanish | 0.925 | 0.918, 0.926 | 0.919, 0.925 | 0.918, 0.925 |
| 5000 | Chinese | 0.736 | 0.824, 0.789 | 0.613, 0.736 | 0.593, 0.736 |
| 5000 | English | 0.931 | 0.931, 0.931 | 0.931, 0.931 | 0.931, 0.931 |
| 5000 | French | 0.939 | 0.939, 0.939 | 0.939, 0.939 | 0.939, 0.939 |
| 5000 | German | 0.937 | 0.938, 0.937 | 0.937, 0.937 | 0.938, 0.937 |
| 5000 | Italian | 0.892 | 0.893, 0.892 | 0.892, 0.892 | 0.892, 0.892 |
| 5000 | Japanese | 0.747 | 0.782, 0.785 | 0.746, 0.747 | 0.749, 0.746 |
| 5000 | Russian | 0.836 | 0.849, 0.843 | 0.821, 0.836 | 0.829, 0.83 |
| 5000 | Spanish | 0.899 | 0.907, 0.902 | 0.880, 0.899 | 0.890, 0.899 |
| 5216 | Russian | 0.836 | 0.848, 0.842 | 0.821, 0.836 | 0.829, 0.836 |
| 9458 | Spanish | 0.906 | 0.917, 0.911 | 0.890, 0.906 | 0.899, 0.906 |
| 10000 | Chinese | 0.768 | 0.834, 0.802 | 0.635, 0.768 | 0.609, 0.768 |
| 10000 | English | 0.935 | 0.935, 0.935 | 0.934, 0.935 | 0.934, 0.935 |
| 10000 | French | 0.945 | 0.945, 0.945 | 0.944, 0.945 | 0.944, 0.945 |
| 10000 | German | 0.929 | 0.931, 0.929 | 0.929, 0.929 | 0.929, 0.929 |
| 10000 | Italian | 0.849 | 0.858, 0.858 | 0.849, 0.849 | 0.848, 0.849 |
| 10000 | Japanese | 0.774 | 0.794, 0.797 | 0.773, 0.774 | 0.776, 0.774 |

Multinomial Naive Bayes

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.727 | 0.797, 0.755 | 0.600, 0.727 | 0.573, 0.727 |
| 1000 | English | 0.902 | 0.903, 0.903 | 0.902, 0.902 | 0.902, 0.902 |
| 1000 | French | 0.902 | 0.905, 0.905 | 0.902, 0.902 | 0.902, 0.902 |
| 1000 | German | 0.920 | 0.922, 0.922 | 0.921, 0.920 | 0.919, 0.920 |
| 1000 | Italian | 0.844 | 0.845, 0.844 | 0.846, 0.844 | 0.844, 0.844 |
| 1000 | Japanese | 0.736 | 0.735, 0.737 | 0.733, 0.736 | 0.732, 0.736 |
| 1000 | Russian | 0.753 | 0.749, 0.763 | 0.733, 0.753 | 0.737, 0.753 |
| 1000 | Spanish | 0.925 | 0.929, 0.927 | 0.910, 0.925 | 0.918, 0.925 |
| 5000 | Chinese | 0.776 | 0.647, 0.738 | 0.642, 0.776 | 0.615, 0.776 |
| 5000 | English | 0.922 | 0.923, 0.923 | 0.921, 0.922 | 0.921, 0.922 |
| 5000 | French | 0.933 | 0.933, 0.933 | 0.932, 0.933 | 0.932, 0.933 |
| 5000 | German | 0.955 | 0.956, 0.955 | 0.955, 0.955 | 0.955, 0.955 |
| 5000 | Italian | 0.862 | 0.866, 0.864 | 0.862, 0.862 | 0.862, 0.862 |
| 5000 | Japanese | 0.798 | 0.806, 0.809 | 0.798, 0.798 | 0.799, 0.798 |
| 5000 | Russian | 0.781 | 0.777, 0.787 | 0.759, 0.781 | 0.764, 0.781 |
| 5000 | Spanish | 0.899 | 0.918, 0.907 | 0.872, 0.899 | 0.884, 0.899 |
| 5216 | Russian | 0.776 | 0.787, 0.760 | 0.781, 0.781 | 0.764, 0.781 |
| 9458 | Spanish | 0.904 | 0.921, 0.911 | 0.880, 0.904 | 0.891, 0.904 |
| 10000 | Chinese | 0.789 | 0.657, 0.745 | 0.651, 0.789 | 0.621, 0.789 |
| 10000 | English | 0.927 | 0.928, 0.928 | 0.926, 0.927 | 0.927, 0.927 |
| 10000 | French | 0.937 | 0.937, 0.937 | 0.936, 0.937 | 0.937, 0.937 |
| 10000 | German | 0.959 | 0.959, 0.959 | 0.959, 0.959 | 0.959, 0.959 |
| 10000 | Italian | 0.862 | 0.866, 0.865 | 0.863, 0.862 | 0.863, 0.862 |
| 10000 | Japanese | 0.815 | 0.823, 0.826 | 0.815, 0.815 | 0.816, 0.815 |

LinearSVC

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.682 | 0.653, 0.709 | 0.584, 0.682 | 0.584, 0.682 |
| 1000 | English | 0.895 | 0.897, 0.897 | 0.895, 0.895 | 0.895, 0.895 |
| 1000 | French | 0.912 | 0.912, 0.913 | 0.912, 0.912 | 0.912, 0.912 |
| 1000 | German | 0.915 | 0.916, 0.916 | 0.916, 0.915 | 0.916, 0.915 |
| 1000 | Italian | 0.859 | 0.861, 0.859 | 0.859, 0.859 | 0.859, 0.859 |
| 1000 | Japanese | 0.721 | 0.747, 0.751 | 0.719, 0.721 | 0.723, 0.721 |
| 1000 | Russian | 0.833 | 0.832, 0.833 | 0.831, 0.833 | 0.831, 0.833 |
| 1000 | Spanish | 0.932 | 0.925, 0.934 | 0.929, 0.932 | 0.927, 0.932 |
| 5000 | Chinese | 0.701 | 0.753, 0.754 | 0.569, 0.701 | 0.566, 0.701 |
| 5000 | English | 0.889 | 0.889, 0.891 | 0.889, 0.889 | 0.889, 0.889 |
| 5000 | French | 0.901 | 0.902, 0.902 | 0.901, 0.901 | 0.901, 0.901 |
| 5000 | German | 0.915 | 0.916, 0.916 | 0.916, 0.915 | 0.916, 0.915 |
| 5000 | Italian | 0.852 | 0.851, 0.852 | 0.847, 0.852 | 0.848, 0.852 |
| 5000 | Japanese | 0.726 | 0.745, 0.747 | 0.724, 0.726 | 0.729, 0.726 |
| 5000 | Russian | 0.818 | 0.824, 0.819 | 0.798, 0.818 | 0.808, 0.818 |
| 5000 | Spanish | 0.902 | 0.901, 0.902 | 0.839, 0.902 | 0.866, 0.902 |
| 5216 | Russian | 0.824 | 0.827, 0.826 | 0.804, 0.824 | 0.813, 0.824 |
| 9458 | Spanish | 0.901 | 0.883, 0.898 | 0.742, 0.901 | 0.792, 0.901 |
| 10000 | Chinese | 0.736 | 0.757, 0.759 | 0.577, 0.736 | 0.564, 0.736 |
| 10000 | English | 0.881 | 0.882, 0.882 | 0.881, 0.881 | 0.881, 0.881 |
| 10000 | French | 0.906 | 0.906, 0.906 | 0.905, 0.906 | 0.905, 0.906 |
| 10000 | German | 0.918 | 0.918, 0.918 | 0.918, 0.918 | 0.918, 0.918 |
| 10000 | Italian | 0.849 | 0.847, 0.850 | 0.818, 0.849 | 0.829, 0.849 |
| 10000 | Japanese | 0.723 | 0.748, 0.751 | 0.722, 0.723 | 0.726, 0.723 |

LSTM

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.414 | 0.519, 0.636 | 0.350, 0.414 | 0.272, 0.414 |
| 1000 | English | 0.679 | 0.672, 0.672 | 0.678, 0.679 | 0.673, 0.679 |
| 1000 | French | 0.565 | 0.563, 0.566 | 0.562, 0.565 | 0.562, 0.565 |
| 1000 | German | 0.820 | 0.829, 0.827 | 0.821, 0.820 | 0.820, 0.820 |
| 1000 | Italian | 0.666 | 0.672, 0.672 | 0.668, 0.666 | 0.667, 0.666 |
| 1000 | Japanese | 0.249 | 0.333, 0.339 | 0.254, 0.249 | 0.106, 0.249 |
| 1000 | Russian | 0.727 | 0.724, 0.724 | 0.723, 0.727 | 0.722, 0.727 |
| 1000 | Spanish | 0.764 | 0.828, 0.857 | 0.759, 0.764 | 0.752, 0.764 |
| 5000 | Chinese | 0.490 | 0.379, 0.460 | 0.396, 0.490 | 0.316, 0.490 |
| 5000 | English | 0.846 | 0.847, 0.847 | 0.845, 0.846 | 0.845, 0.846 |
| 5000 | French | 0.882 | 0.884, 0.885 | 0.882, 0.882 | 0.882, 0.882 |
| 5000 | German | 0.909 | 0.910, 0.909 | 0.909, 0.908 | 0.909, 0.909 |
| 5000 | Italian | 0.870 | 0.872, 0.871 | 0.870, 0.870 | 0.871, 0.870 |
| 5000 | Japanese | 0.251 | 0.619, 0.622 | 0.256, 0.251 | 0.110, 0.251 |
| 5000 | Russian | 0.787 | 0.789, 0.789 | 0.777, 0.787 | 0.781, 0.787 |
| 5000 | Spanish | 0.848 | 0.837, 0.851 | 0.830, 0.848 | 0.831, 0.848 |
| 5216 | Russian | 0.749 | 0.758, 0.760 | 0.735, 0.749 | 0.740, 0.749 |
| 9458 | Spanish | 0.764 | 0.787, 0.778 | 0.714, 0.764 | 0.726, 0.764 |
| 10000 | Chinese | 0.469 | 0.255, 0.313 | 0.379, 0.466 | 0.295, 0.469 |
| 10000 | English | 0.865 | 0.870, 0.871 | 0.865, 0.865 | 0.866, 0.865 |
| 10000 | French | 0.879 | 0.879, 0.879 | 0.878, 0.879 | 0.878, 0.879 |
| 10000 | German | 0.922 | 0.922, 0.922 | 0.923, 0.922 | 0.922, 0.922 |
| 10000 | Italian | 0.746 | 0.768, 0.765 | 0.743, 0.746 | 0.722, 0.746 |
| 10000 | Japanese | 0.241 | 0.255, 0.257 | 0.253, 0.241 | 0.105, 0.241 |

BiLSTM

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.728 | 0.684, 0.747 | 0.704, 0.728 | 0.688, 0.728 |
| 1000 | English | 0.800 | 0.798, 0.799 | 0.799, 0.800 | 0.800, 0.800 |
| 1000 | French | 0.855 | 0.859, 0.860 | 0.855, 0.855 | 0.855, 0.855 |
| 1000 | German | 0.877 | 0.878, 0.877 | 0.878, 0.877 | 0.878, 0.877 |
| 1000 | Italian | 0.781 | 0.781, 0.781 | 0.781, 0.781 | 0.780, 0.781 |
| 1000 | Japanese | 0.568 | 0.693, 0.696 | 0.563, 0.568 | 0.525, 0.568 |
| 1000 | Russian | 0.809 | 0.803, 0.813 | 0.809, 0.809 | 0.805, 0.809 |
| 1000 | Spanish | 0.901 | 0.891, 0.903 | 0.893, 0.901 | 0.891, 0.901 |
| 5000 | Chinese | 0.740 | 0.616, 0.705 | 0.612, 0.740 | 0.589, 0.740 |
| 5000 | English | 0.885 | 0.889, 0.890 | 0.885, 0.885 | 0.885, 0.885 |
| 5000 | French | 0.893 | 0.893, 0.894 | 0.892, 0.893 | 0.892, 0.893 |
| 5000 | German | 0.925 | 0.927, 0.926 | 0.926, 0.925 | 0.926, 0.925 |
| 5000 | Italian | 0.843 | 0.854, 0.853 | 0.842, 0.843 | 0.843, 0.843 |
| 5000 | Japanese | 0.623 | 0.735, 0.738 | 0.617, 0.623 | 0.594, 0.623 |
| 5000 | Russian | 0.825 | 0.837, 0.832 | 0.815, 0.826 | 0.822, 0.826 |
| 5000 | Spanish | 0.911 | 0.910, 0.912 | 0.899, 0.911 | 0.902, 0.911 |
| 5216 | Russian | 0.817 | 0.821, 0.818 | 0.806, 0.817 | 0.811, 0.817 |
| 9458 | Spanish | 0.899 | 0.901, 0.900 | 0.884, 0.899 | 0.890, 0.899 |
| 10000 | Chinese | 0.824 | 0.824, 0.824 | 0.826, 0.824 | 0.825, 0.824 |
| 10000 | English | 0.922 | 0.920, 0.922 | 0.907, 0.922 | 0.913, 0.922 |
| 10000 | French | 0.930 | 0.927, 0.933 | 0.926, 0.930 | 0.925, 0.930 |
| 10000 | German | 0.940 | 0.940, 0.940 | 0.940, 0.940 | 0.940, 0.940 |
| 10000 | Italian | 0.873 | 0.880, 0.880 | 0.872, 0.873 | 0.873, 0.873 |
| 10000 | Japanese | 0.631 | 0.675, 0.682 | 0.629, 0.631 | 0.624, 0.631 |

CNN

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.699 | 0.666, 0.739 | 0.687, 0.699 | 0.662, 0.699 |
| 1000 | English | 0.879 | 0.880, 0.880 | 0.879, 0.879 | 0.879, 0.879 |
| 1000 | French | 0.888 | 0.892, 0.893 | 0.887, 0.888 | 0.888, 0.888 |
| 1000 | German | 0.912 | 0.912, 0.911 | 0.912, 0.912 | 0.912, 0.912 |
| 1000 | Italian | 0.829 | 0.829, 0.829 | 0.829, 0.829 | 0.829, 0.829 |
| 1000 | Japanese | 0.604 | 0.640, 0.646 | 0.598, 0.604 | 0.595, 0.604 |
| 1000 | Russian | 0.818 | 0.813, 0.823 | 0.820, 0.818 | 0.814, 0.818 |
| 1000 | Spanish | 0.925 | 0.917, 0.926 | 0.921, 0.925 | 0.919, 0.925 |
| 5000 | Chinese | 0.729 | 0.636, 0.711 | 0.606, 0.729 | 0.589, 0.729 |
| 5000 | English | 0.926 | 0.925, 0.926 | 0.925, 0.926 | 0.925, 0.926 |
| 5000 | French | 0.939 | 0.938, 0.939 | 0.938, 0.939 | 0.938, 0.939 |
| 5000 | German | 0.948 | 0.948, 0.948 | 0.948, 0.948 | 0.948, 0.948 |
| 5000 | Italian | 0.887 | 0.891, 0.889 | 0.886, 0.887 | 0.888, 0.887 |
| 5000 | Japanese | 0.633 | 0.733, 0.736 | 0.626, 0.633 | 0.601, 0.633 |
| 5000 | Russian | 0.852 | 0.856, 0.853 | 0.845, 0.852 | 0.849, 0.852 |
| 5000 | Spanish | 0.928 | 0.928, 0.928 | 0.915, 0.928 | 0.920, 0.928 |
| 5216 | Russian | 0.869 | 0.874, 0.871 | 0.863, 0.869 | 0.867, 0.869 |
| 9458 | Spanish | 0.916 | 0.921, 0.919 | 0.903, 0.916 | 0.909, 0.916 |
| 10000 | Chinese | 0.759 | 0.576, 0.701 | 0.626, 0.759 | 0.598, 0.759 |
| 10000 | English | 0.930 | 0.931, 0.931 | 0.930, 0.930 | 0.930, 0.930 |
| 10000 | French | 0.943 | 0.943, 0.943 | 0.942, 0.943 | 0.942, 0.943 |
| 10000 | German | 0.955 | 0.955, 0.955 | 0.955, 0.955 | 0.955, 0.955 |
| 10000 | Italian | 0.900 | 0.900, 0.900 | 0.900, 0.900 | 0.900, 0.900 |
| 10000 | Japanese | 0.658 | 0.690, 0.700 | 0.654, 0.659 | 0.661, 0.658 |

BERT-Base-Cased

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.881 | 0.888, 0.885 | 0.836, 0.881 | 0.855, 0.880 |
| 1000 | English | 0.876 | 0.881, 0.881 | 0.875, 0.876 | 0.875, 0.876 |
| 1000 | French | 0.926 | 0.926, 0.926 | 0.926, 0.926 | 0.926, 0.926 |
| 1000 | German | 0.938 | 0.941, 0.940 | 0.938, 0.938 | 0.939, 0.938 |
| 1000 | Italian | 0.819 | 0.821, 0.822 | 0.821, 0.819 | 0.817, 0.819 |
| 1000 | Japanese | 0.795 | 0.796, 0.796 | 0.794, 0.795 | 0.793, 0.795 |
| 1000 | Russian | 0.805 | 0.798, 0.807 | 0.792, 0.805 | 0.794, 0.805 |
| 1000 | Spanish | 0.916 | 0.912, 0.918 | 0.910, 0.916 | 0.910, 0.916 |
| 5000 | Chinese | 0.939 | 0.942, 0.939 | 0.915, 0.939 | 0.926, 0.939 |
| 5000 | English | 0.943 | 0.944, 0.945 | 0.943, 0.943 | 0.943, 0.943 |
| 5000 | French | 0.963 | 0.963, 0.963 | 0.963, 0.967 | 0.963, 0.963 |
| 5000 | German | 0.973 | 0.974, 0.974 | 0.973, 0.973 | 0.973, 0.973 |
| 5000 | Italian | 0.895 | 0.900, 0.899 | 0.895, 0.895 | 0.896, 0.895 |
| 5000 | Japanese | 0.910 | 0.910, 0.911 | 0.909, 0.910 | 0.909, 0.910 |
| 5000 | Russian | 0.882 | 0.885, 0.884 | 0.880, 0.882 | 0.882, 0.882 |
| 5000 | Spanish | 0.962 | 0.959, 0.963 | 0.960, 0.962 | 0.959, 0.962 |
| 5216 | Russian | 0.885 | 0.890, 0.886 | 0.878, 0.885 | 0.883, 0.885 |
| 9458 | Spanish | 0.948 | 0.952, 0.949 | 0.937, 0.948 | 0.943, 0.948 |
| 10000 | Chinese | 0.936 | 0.938, 0.937 | 0.912, 0.936 | 0.923, 0.936 |
| 10000 | English | 0.967 | 0.967, 0.967 | 0.967, 0.967 | 0.967, 0.967 |
| 10000 | French | 0.971 | 0.971, 0.971 | 0.971, 0.971 | 0.971, 0.971 |
| 10000 | German | 0.978 | 0.978, 0.978 | 0.978, 0.978 | 0.978, 0.978 |
| 10000 | Italian | 0.912 | 0.915, 0.914 | 0.912, 0.912 | 0.913, 0.912 |
| 10000 | Japanese | 0.923 | 0.922, 0.923 | 0.922, 0.923 | 0.922, 0.923 |

mBERT-Base-Cased

| Training Size | Language | Accuracy | Precision (macro, weighted) | Recall (macro, weighted) | F1-Score (macro, weighted) |
| 1000 | Chinese | 0.907 | 0.891, 0.909 | 0.901, 0.907 | 0.895, 0.907 |
| 1000 | English | 0.838 | 0.852, 0.852 | 0.836, 0.838 | 0.834, 0.838 |
| 1000 | French | 0.858 | 0.868, 0.867 | 0.856, 0.858 | 0.855, 0.858 |
| 1000 | German | 0.926 | 0.926, 0.926 | 0.927, 0.926 | 0.926, 0.926 |
| 1000 | Italian | 0.750 | 0.754, 0.756 | 0.755, 0.750 | 0.742, 0.750 |
| 1000 | Japanese | 0.783 | 0.789, 0.792 | 0.783, 0.783 | 0.780, 0.783 |
| 1000 | Russian | 0.792 | 0.806, 0.797 | 0.793, 0.792 | 0.797, 0.792 |
| 1000 | Spanish | 0.851 | 0.850, 0.853 | 0.824, 0.851 | 0.830, 0.851 |
| 5000 | Chinese | 0.929 | 0.929, 0.930 | 0.908, 0.929 | 0.917, 0.929 |
| 5000 | English | 0.896 | 0.903, 0.904 | 0.895, 0.896 | 0.895, 0.896 |
| 5000 | French | 0.951 | 0.951, 0.951 | 0.951, 0.951 | 0.951, 0.951 |
| 5000 | German | 0.961 | 0.961, 0.961 | 0.961, 0.961 | 0.961, 0.961 |
| 5000 | Italian | 0.908 | 0.911, 0.910 | 0.908, 0.908 | 0.909, 0.908 |
| 5000 | Japanese | 0.900 | 0.901, 0.902 | 0.899, 0.900 | 0.900, 0.900 |
| 5000 | Russian | 0.880 | 0.891, 0.886 | 0.873, 0.880 | 0.879, 0.880 |
| 5000 | Spanish | 0.933 | 0.933, 0.933 | 0.921, 0.933 | 0.926, 0.933 |
| 5216 | Russian | 0.886 | 0.893, 0.889 | 0.881, 0.886 | 0.886, 0.886 |
| 9458 | Spanish | 0.933 | 0.931, 0.937 | 0.931, 0.933 | 0.929, 0.933 |
| 10000 | Chinese | 0.937 | 0.945, 0.939 | 0.907, 0.937 | 0.923, 0.937 |
| 10000 | English | 0.970 | 0.970, 0.970 | 0.970, 0.970 | 0.970, 0.970 |
| 10000 | French | 0.962 | 0.962, 0.962 | 0.962, 0.962 | 0.962, 0.962 |
| 10000 | German | 0.968 | 0.968, 0.968 | 0.968, 0.968 | 0.968, 0.968 |
| 10000 | Italian | 0.905 | 0.909, 0.909 | 0.905, 0.905 | 0.905, 0.905 |
| 10000 | Japanese | 0.916 | 0.915, 0.917 | 0.915, 0.916 | 0.916, 0.916 |
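As a sanity check, the percentage differences quoted in Sections 6.1 and 6.3 can be recomputed from the mBERT accuracy values transcribed above:

```python
# Accuracy values copied from the mBERT-Base-Cased table above.
mbert_acc = {
    (1000, "Spanish"): 0.851, (5000, "Spanish"): 0.933,
    (1000, "German"):  0.926, (5000, "German"):  0.961,
    (10000, "English"): 0.970, (10000, "Chinese"): 0.937,
}

# Spanish improves more than German from training size 1000 to 5000 ...
spanish_gain = mbert_acc[(5000, "Spanish")] - mbert_acc[(1000, "Spanish")]
german_gain = mbert_acc[(5000, "German")] - mbert_acc[(1000, "German")]
print(f"Spanish: +{spanish_gain:.1%}, German: +{german_gain:.1%}")
# -> Spanish: +8.2%, German: +3.5%

# ... and Chinese trails the top mBERT language (English) at size 10,000.
margin = mbert_acc[(10000, "English")] - mbert_acc[(10000, "Chinese")]
print(f"Margin: {margin:.1%}")  # -> Margin: 3.3%
```

These match the 8.2%/3.5% improvement figures from Section 6.1 and the 3.3% mBERT margin for Chinese from Section 6.3.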