Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park, University of Illinois, [email protected]
Katherine J. Zhang, Carnegie Mellon University, [email protected]
Coleman Haley, Johns Hopkins University, [email protected]

Kenneth Steimel, Indiana University, [email protected]
Han Liu, University of Chicago∗, [email protected]
Lane Schwartz, University of Illinois, [email protected]

arXiv:2012.06262v1 [cs.CL] 11 Dec 2020

Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.¹ We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.

1 Introduction

With most research in Natural Language Processing (NLP) directed at a small subset of the world's languages, whether the techniques developed are truly language-agnostic is often not known. Because the vast majority of research focuses on English, with Chinese a distant second (Mielke, 2016), neither of which is morphologically rich, the impact of morphology on NLP tasks for various languages is not entirely understood.

Several studies have investigated this issue in the context of language modeling by comparing a number of languages, but found conflicting results. Gerz et al. (2018) and Cotterell et al. (2018) find that morphological complexity is predictive of language modeling difficulty, while Mielke et al. (2019) conclude that simple statistics of a text, like the number of types, explain differences in modeling difficulty, rather than morphological measures.

This paper revisits this issue by increasing the number of languages considered and augmenting the kind and number of morphological features used. We train language models for 92 languages from a corpus of Bibles fully aligned at the verse level and measure language modeling performance using surprisal (the negative log-likelihood) per verse (see §4.5). We investigate how this measure is correlated with 12 linguist-generated morphological features and four corpus-based measures of morphological complexity.

Additionally, we contend that the relation between segmentation method, morphology, and language modeling performance needs further investigation. Byte-Pair Encoding (BPE; Shibata et al., 1999) is widely used in NLP tasks, including machine translation (Sennrich et al., 2016), as an unsupervised information-theoretic method for segmenting text data into subword units. Variants of BPE or closely related methods such as WordPiece (Kudo, 2018) are frequently employed by state-of-the-art pretrained language models (Liu et al., 2019; Radford et al., 2019; Devlin et al., 2019; Yang et al., 2019). However, BPE and other segmentation methods may vary in how closely they capture morphological segments for a given language, which may affect language modeling performance.

Therefore, this paper focuses on the following two research questions:

1. Does a language's morphology influence language modeling difficulty?
2. If so, how do different segmentation methods interact with morphology?

In order to answer the first question, we train models using data sets segmented by characters and BPE units. Our results show that BPE language modeling surprisal is significantly correlated with measures of morphological typology and complexity. This suggests that BPE segments are ineffective in mitigating the effect of morphology in language modeling.

As for the second question, we consider more linguistically-motivated segmentation methods to compare with BPE: Morfessor (Creutz and Lagus, 2007) and Finite-State Transducers (FSTs) (see §4.3). Our comparison of the models using the different segmentation methods shows that Morfessor reduces the impact of morphology for more languages than BPE. FST-based segmentation methods outperform the other segmentation methods when available. These results suggest that morphologically motivated segmentations improve cross-linguistic language modeling.

∗ Work done while at University of Colorado Boulder.
¹ https://github.com/hayleypark/MorphologyMatters

2 Modeling difficulty across languages

Studies have demonstrated that different languages may be unequally difficult to model and have tested the relations between such modeling difficulty and the morphological properties of languages, using different segmentation methods.

Vania and Lopez (2017) compared the effectiveness of word representations based on different segmentation methods in modeling 10 languages with various morphological typologies. They trained word-level language models but utilized segmentation methods to create word embeddings that include segment-level information. Comparing character, BPE, and Morfessor segmentations, they concluded that character-based representations were most effective across languages, with BPE always outperforming Morfessor. However, models based on hand-crafted morphological analyses outperformed all other segmentation methods by a wide margin.

Gerz et al. (2018) trained n-gram and neural language models over 50 languages and argued that the type of morphological system is predictive of model performance. Their results show that languages differ with regard to modeling difficulty. They attributed the differences among languages to four types of morphological systems: isolating, fusional, introflexive, and agglutinative. While they found a significant association between the morphological type and modeling difficulty, Type-Token Ratio (TTR) was the most predictive of language modeling performance.

Cotterell et al. (2018) arrived at a similar conclusion modeling 21 languages using the Europarl corpus (Koehn, 2005). When trained with n-gram and character-based Long Short-Term Memory (LSTM) models, the languages showed different modeling difficulties, which were correlated with a measure of morphology, Morphological Counting Complexity (MCC), or the number of inflectional categories (Sagot, 2013).

However, Mielke et al. (2019) failed to reproduce the correlation with MCC when they increased the scope to 69 languages, utilizing a Bible corpus (Mayer and Cysouw, 2014). They also reported no correlation with measures of morphosyntactic complexity such as head-POS entropy (Dehouck and Denis, 2018) and other linguist-generated features (Dryer and Haspelmath, 2013). Rather, they found that simpler statistics, namely the number of types and the number of characters per word, correlate with language model surprisal using BPE and character segmentation, respectively.

3 Morphological measures

Different measures of morphology are used to represent a language's morphology.

3.1 Linguist-generated measures

The most linguistically-informed measures of morphology involve expert descriptions of languages. The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) has been used frequently in the literature to provide typological information. WALS is a large database of linguistic features gathered from descriptive materials, such as reference grammars. It contains 144 chapters in 11 areas, including phonology, morphology, and word order. Each chapter describes a feature with categorical values and lists the languages that have each value. However, not all languages in the database have data for all the features, and for some languages there is no data at all.

The studies reviewed in §2 all relied on this expert-description approach to quantify morphological properties. Gerz et al. (2018) focused on WALS descriptions of inflectional synthesis of verbs, fusion, exponence, and flexivity, while Mielke et al. (2019) looked at two WALS features, 26A "Prefixing vs. Suffixing in Inflectional Morphology" and 81A "Order of Subject, Object and Verb." Cotterell et al. (2018) used UniMorph (Kirov et al., 2018), instead of WALS, to calculate MCC. Vania and Lopez (2017) did not cite any databases but provided descriptions of four morphological types (fusional, agglutinative, root-and-pattern, and reduplication) and categorized 10 languages into these types.

A major issue with this approach to representing morphology is that there is not enough expert data available to enable comparisons across many different languages. In fact, Mielke et al. (2019) chose their two WALS features because data for these features existed for most of their languages. Moreover, Bentz et al. (2016) showed that their WALS-based measure had lower correlations with other measures of morphological complexity due to this issue of missing data.

3.2 Corpus-based measures

In contrast, corpus-based measures of morphology can be easily calculated on a given data set. These measures include the number of types, Type-Token Ratio (TTR), Moving-Average TTR (MATTR; Covington and McFall, 2010), and Mean Length of Word (MLW). The exact definition of the measures may vary depending on the study, but we define them as in Table 1, where a word token is a string separated by spaces in the training set after tokenization but before segmentation.

Measure  Definition
Types    Number of unique word tokens
TTR      Number of unique word tokens divided by total number of word tokens
MATTR    Average TTR calculated over a moving window of 500 word tokens
MLW      Average number of characters per word token

Table 1: Corpus-based measures of morphology defined for this study. These measures are calculated on tokenized data sets before applying any segmentation method.

While some studies (e.g., Mielke et al., 2019) consider these measures simple statistics of a corpus, other studies have found that they can be used as approximate measures of morphological complexity. Kettunen (2014) showed that TTR, MATTR, and MLW can capture the overall ranking of morphological complexity generated by information-theoretic and expert-generated measures of morphological complexity. Bentz et al. (2016) compared different measures of morphological complexity for 519 languages across 101 families and showed a strong correlation between all measures, which were based on corpus statistics, linguistic expertise, information theory, and translation alignment. They argued that corpus-based measures, including TTR, and other measures of morphological complexity can be used interchangeably. In addition, Gerz et al. (2018) showed that TTR is influenced by the morphological typology of a language. According to them, isolating languages tend to have small TTR values and are often easier to model, while the opposite is true for agglutinative languages.

Given the previous literature, we utilize these corpus-based measures, as well as expert-generated WALS features, as a proxy for morphological differences among languages in our study.

4 Methods

We design our experiments to test whether a language's morphology is correlated with language model performance, depending on the segmentation method. We represent a language's morphology using WALS features and corpus statistics. We train language models for Bible translations in 92 languages based on five different segmentation methods: character, BPE, Morfessor, and FST with BPE or Morfessor back-off strategies (FST+BPE and FST+Morfessor). We use surprisal per verse (Mielke et al., 2019) as the evaluation metric to compare language modeling performance across different languages and different segmentation methods. Additionally, we quantify the difference in surprisal per verse between segmentation methods to compare the relative strength of each segmentation method with regard to morphological complexity.

4.1 Data

Our data consist of 145 Bible translations in 92 languages covering 22 language families,² fully aligned at the verse level. The majority of the data came verse-aligned from Mielke et al. (2019) (original data from Mayer and Cysouw, 2014). We added more Bibles from another corpus (Christodoulopoulos and Steedman, 2014) and from online Bible resources (see Appendix A for more information). We refer to each language by its ISO 639-3 code when applicable.

We followed Mielke et al. (2019)'s method to split the data into training, development, and test sets: the verse-aligned data were divided into blocks of 30 verses, with the first five verses being assigned to the development set, the next five to the test set, and the rest to the training set. The resulting training set had 16,926 verses, while the development and test sets had 4,225 verses each.

It should be noted that both Mielke et al. (2019) and Christodoulopoulos and Steedman (2014) provided tokenized data. We tokenized the newly added Bibles using Mielke and Eisner (2019)'s tokenizer, following Mielke et al. (2019). When both tokenized and untokenized versions were available, we included the tokenized versions only.

² For each language, we report the family assigned by WALS (Dryer and Haspelmath, 2013): 6 Afro-Asiatic, 1 Algic, 1 Altaic, 2 Austro-Asiatic, 6 Austronesian, 1 Aymaran, 3 Dravidian, 4 Eskimo-Aleut, 1 Guaicuruan, 33 Indo-European, 1 Japanese, 1 Korean, 1 Mande, 6 Mayan, 6 Niger-Congo, 4 Quechuan, 5 Sino-Tibetan, 1 Songhay, 1 Tai-Kadai, 2 Tupian, 2 Uralic, 2 Uto-Aztecan, 2 creoles.

ID   Name
20A  Fusion of Selected Inflectional Formatives
21A  Exponence of Selected Inflectional Formatives
21B  Exponence of Tense-Aspect-Mood Inflection
22A  Inflectional Synthesis of the Verb
23A  Locus of Marking in the Clause
24A  Locus of Marking in Possessive Noun Phrases
25A  Locus of Marking: Whole-language Typology
25B  Zero Marking of A and P Arguments
26A  Prefixing vs. Suffixing in Inflectional Morphology
27A  Reduplication
28A  Case Syncretism
29A  Syncretism in Verbal Person/Number Marking

Table 2: The 12 morphological features in WALS.
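The block-wise split described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' script; in particular, how a trailing partial block is handled may differ from their exact code.

```python
def split_verses(verses):
    """Split verse-aligned data into train/dev/test by blocks of 30 verses:
    within each block, verses 1-5 go to the development set, verses 6-10 to
    the test set, and the remaining 20 to the training set."""
    dev, test, train = [], [], []
    for i, verse in enumerate(verses):
        pos = i % 30  # position within the current 30-verse block
        if pos < 5:
            dev.append(verse)
        elif pos < 10:
            test.append(verse)
        else:
            train.append(verse)
    return train, dev, test
```

Applied to 60 verses, this yields 40 training, 10 development, and 10 test verses, mirroring the 20:5:5 ratio per block.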
We chose to replace characters that occurred only once with a special UNK symbol. Mielke et al. (2019) applied this procedure to characters that appear fewer than 25 times in the training set, except for Chinese, where only singleton characters were replaced. Because we added several languages where the original strategy would have resulted in removing too much data, we preprocessed singleton characters across the board.

We also corrected several errors present in the data. For example, the Bible translations in Shona (sna) and Telugu (tel) were mis-coded as Shan (shn) and Tecpatlán Totonac (tcw), respectively.

4.2 Morphological measures selected

In this paper, we adopt two approaches to representing a language's morphology. First, we rely on expert descriptions of languages in WALS, manually augmenting the database to rectify the issue of missing data. Second, we utilize corpus-based measures like TTR to represent the morphological complexity of a given language.

WALS features  While some previous studies (e.g., Gerz et al., 2018; Vania and Lopez, 2017) categorized relatively well-known languages into a small number of morphological types, such categorization is not always clear. Some other studies (e.g., Cotterell et al., 2018; Mielke et al., 2019) selected a small number of available typological features to compare, but their conclusions were at odds, possibly calling for exploration of other measures. Therefore, we consider all available morphological features described by WALS to explore which features affect language modeling and how. Instead of making theoretical claims about morphological typology, we explore which typological features make a language's morphology more complex for LSTM language models.

To that end, we augmented the existing WALS database by consulting reference grammars for each language. Of the 92 languages in our corpus, six were not in the WALS database.³ In addition, many of the languages in the database had missing data for some features. For example, we had no data for any of the morphological features of Afrikaans (afr). We manually assigned missing features where possible, following the descriptions in the relevant WALS chapters regarding the procedures used to assign feature values to languages.

Of the almost 200 features in WALS, the editors of the database labeled 12 of them as morphological features. Therefore, we considered these 12 features, listed in Table 2 and described below,⁴ to test the hypothesis that morphological complexity correlates with modeling difficulty.

Feature 20A describes how closely grammatical markers (inflectional formatives) are phonologically connected to a host word or stem. The markers can be isolating, concatenative, or even nonlinear (i.e., ablaut and tone).

Features 21A and 21B measure the exponence of selected grammatical markers. Exponence refers to the number of categories that a single morpheme expresses. For 21A, the selected grammatical markers were case markers. For 21B, they were tense-aspect-mood (TAM) markers.

Feature 22A measures how many grammatical categories may appear on verbs in a language. These categories include tense-aspect-mood, negation, voice, and agreement.

Features 23A through 25B describe the existence and locus of marking in different kinds of phrases. A phrase may have marking on its head, its dependent(s), both, or neither. In full clauses, the verb is the head, and the subject and object arguments are dependents. In possessive noun phrases, the possessed noun is the head while the possessor is the dependent.

Feature 26A measures the degree to which languages use prefixes versus suffixes in their inflectional morphology. Feature 27A describes which languages use reduplication productively and whether or not both full and partial reduplication are used.

Both Features 28A and 29A measure syncretism. Syncretism occurs when a single inflected form corresponds to more than one function. 28A measures case syncretism specifically, while 29A measures syncretism in the subject agreement marking of verbs.

³ ikt, lat, nch, tbz, wbm, zom
⁴ See https://wals.info/chapter for more details and examples of these features.

Types, TTR, MATTR, and MLW  We calculated the number of types, TTR, Moving-Average TTR, and Mean Length of Word using an adapted script from the Python module LexicalRichness.⁵ We used a window size of 500 for Moving-Average TTR, following previous studies (e.g., Kettunen, 2014). The definitions of the measures are found in Table 1. All measures were calculated based on the word tokens in the training set before applying any segmentation method.

4.3 Segmentation methods

We chose to train only open-vocabulary language models for fair comparison. Word-level models predict UNK for out-of-vocabulary word tokens and as a result cannot be fairly compared with character- and subword-level models. Specifically, we trained language models using five segmentation methods: character, BPE, Morfessor, FST+BPE, and FST+Morfessor. These segmentation methods provide a way to segment any given text into smaller pieces, some of which approximate morphemes. A morpheme is the smallest meaning-bearing morphological unit, while a morph is the surface representation of one or more morphemes. Linguistically-motivated methods like Morfessor and FSTs are designed with the goal of producing subword segments that are closely aligned with the true morphs comprising a word. While BPE was not designed with morpheme segmentation in mind, its resulting subwords are commonly believed to align with morphs to some degree, because morph subsequences are frequent in the data.

Segmenting words into morphs may reduce the impact of rich morphology, as highly inflected words can be broken into smaller pieces that are likely to contribute similar meanings across contexts in the corpus. Table 3 provides examples of the segmentation methods we used to train language models. The original verse is provided for reference only and is not used to train any models.

Character  We trained character-based language models, following previous studies (Mielke et al., 2019; Gerz et al., 2018; Cotterell et al., 2018). Character language models are trained to predict the next character given the preceding context, and the vocabulary includes an underscore ⟨_⟩ to denote word boundaries.

BPE  We trained BPE-based language models, following Mielke et al. (2019). Starting with character segmentation, BPE operations combine characters into larger chunks based on their frequencies to create units somewhere between characters and words, with the number of merge operations as the hyperparameter (Sennrich et al., 2016). We used 0.4 × the number of types as the number of merges, as Mielke et al. (2019) reported that to be most effective with their corpus.⁶
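The corpus-based measures of Table 1 can be computed directly from the definitions. The sketch below is an illustrative reimplementation rather than the LexicalRichness-based script used in the study, and the MATTR loop is naive O(n·w) rather than the incremental algorithm a real implementation would use:

```python
def corpus_measures(tokens, window=500):
    """Compute Types, TTR, MATTR, and MLW over a list of word tokens (Table 1).
    Tokens are whitespace-separated strings from the tokenized training set,
    before any subword segmentation is applied."""
    types = len(set(tokens))
    ttr = types / len(tokens)
    mlw = sum(len(t) for t in tokens) / len(tokens)
    # Moving-Average TTR: mean of the TTRs of every full window of `window` tokens.
    n_windows = len(tokens) - window + 1
    if n_windows < 1:  # corpus shorter than the window: fall back to plain TTR
        mattr = ttr
    else:
        mattr = sum(
            len(set(tokens[i:i + window])) / window for i in range(n_windows)
        ) / n_windows
    return types, ttr, mattr, mlw
```

On a toy corpus such as `["a", "b", "a", "b"]` with a window of 2, this yields 2 types, a TTR of 0.5, a MATTR of 1.0 (every 2-token window contains two distinct tokens), and an MLW of 1.0.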

⁵ https://github.com/LSYS/LexicalRichness
⁶ Additional static numbers of merge operations were also tested, with nearly identical results.

Segmentation    Example
Tokenized       Yuhannanın kardeşi Yakubu kılıçla öldürdü .
Character       Yuhannanın_kardeşi_Yakubu_kılıçla_öldürdü_.
BPE             Yuhan@@ nanın kardeşi Yakubu kılıçla öldürdü .
Morfessor       Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öldürdü .
FST+BPE         Yuhan@@ nanın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .
FST+Morfessor   Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .

Table 3: Turkish examples of the different segmentation methods. An English translation is "And he killed James the brother of John with the sword" (Acts 12:2). The FST does not produce an analysis for Yuhannanın ("John's"), for which the BPE or Morfessor back-off was used. The segmentation created by human experts was the same as FST+Morfessor. ⟨@@⟩ denotes subword segmentation, while ⟨_⟩ encodes the space between word tokens for character segmentation.

BPE language models are trained to predict the next BPE unit. The double at sign ⟨@@⟩ is used to indicate segments that are not word-final.

Morfessor  Morfessor (Creutz and Lagus, 2007) is a word segmentation method explicitly designed for morphological segmentation. The default implementation utilizes a unigram language model to find morph-like constructs. While, like BPE, this approach is information-theoretic, it selects segments top-down and includes a prior term for the length of segments, regularizing segments to be more plausible morphemes.

Using the default settings of Morfessor 2.0 (Virpioja et al., 2013), we trained Morfessor on the training set and applied the segmentation to all data sets. Just as with BPE, the language models are trained to predict the next morph unit.

FST  While segmentation based on BPE and Morfessor may or may not resemble actual morphemes, morpheme segmentation with Finite-State Transducers (FSTs) provides a knowledge-based method to segment a text into morphemes. Finite-state morphological analyzers are rule-based systems that take a surface string as input and produce all possible morphological analyses as output. To use FSTs for segmentation, we changed existing morphological analyzers into segmenters and developed a heuristic to select one analysis for a given word token. FSTs for Plains Cree (Arppe et al., 2014–2019), German (Schmid et al., 2004), English (Axelson et al., 2015), Finnish (Pirinen, 2015), Indonesian (Larasati et al., 2011), Cuzco Quechua (Vilca et al., 2012), and Turkish (Çağrı Çöltekin, 2014, 2010) were used as morphological segmenters.

Most FSTs are designed to provide analyses for surface forms, not morphological segmentations. Fortunately, morpheme boundaries are frequently part of FSTs due to their relevance for lexico-phonological phenomena. By modifying the FST before the cleanup rules that remove morpheme boundaries can apply, we create a morphological segmenter that takes in a surface form and returns the surface form with morpheme boundary markers. If the analyzer already provides segmentations, the transducer is used as-is. For example, the Turkish FST produces a morphological analysis for the surface form kılıçla ("with the sword") in the example in Table 3, analyzing it as the lemma kılıç plus an instrumental case marker. Instead of producing such an analysis for the given word, the segmenter produces the segmented surface form kılıç@@ la, which is what the FST segmentation methods use.

Because an FST may return multiple analyses or segmentations for a single word, a heuristic method was used to determine which segmentation to select. In general, we chose the segmentation with the fewest segments. However, the English segmenter based on Axelson et al. (2015) always returns the input string itself as a possible segmentation if it is covered by the analyzer. For example, walks would produce two segmentations in the English segmenter: walks and walk@@ s. For this segmenter, we selected the segmentation with the fewest segments excluding the input string itself (e.g., choosing walk@@ s over walks).

When an FST produces no analysis for a given word, as in the case of Yuhannanın ("John's") in Table 3, we adopt the FST-augmented BPE segmentation (FST+BPE) and FST-augmented Morfessor segmentation (FST+Morfessor), where we fall back to BPE or Morfessor segmentation whenever FST segmentation is unavailable. As shown in the table, FST+BPE and FST+Morfessor differ only in the segmentation of the unanalyzed word.
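The fewest-segments heuristic and the back-off behavior can be sketched as follows. The `TOY_FST` lookup table is purely hypothetical and stands in for a real FST segmenter (e.g., one compiled with HFST or foma); only the selection logic mirrors the description above.

```python
# Toy stand-in for an FST segmenter's output: word -> candidate segmentations.
TOY_FST = {
    "walks": ["walks", "walk@@ s"],   # the English FST also returns the input itself
    "kılıçla": ["kılıç@@ la"],
}

def fst_segment(word):
    """Return all candidate segmentations for `word`, or [] if unanalyzable."""
    return TOY_FST.get(word, [])

def choose_segmentation(word, exclude_input=False):
    """Pick the candidate with the fewest segments; None if the FST fails."""
    candidates = fst_segment(word)
    if exclude_input:  # used for the English segmenter described above
        candidates = [c for c in candidates if c != word]
    if not candidates:
        return None
    return min(candidates, key=lambda c: len(c.split("@@ ")))

def segment_token(word, backoff, exclude_input=False):
    """FST segmentation with back-off (FST+BPE or FST+Morfessor)."""
    chosen = choose_segmentation(word, exclude_input)
    return chosen if chosen is not None else backoff(word)
```

For example, `segment_token("walks", bpe, exclude_input=True)` yields `walk@@ s`, while an unanalyzable word like Yuhannanın is passed to the BPE or Morfessor back-off unchanged.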
For this particular verse, the human segmentation agrees with the FST+Morfessor segmentation. FST+BPE and FST+Morfessor models are trained just like BPE or Morfessor models, to predict the next subword unit.

4.4 Models

Following Mielke et al. (2019), we trained the Long Short-Term Memory (LSTM) models introduced by Merity et al. (2018) for each of the segmentation methods. Three LSTM models, using character, BPE, and Morfessor segmentation, were trained for all languages. For a select group of languages, we also trained models using FST+BPE and FST+Morfessor units. The neural architecture consisted of an initial embedding layer, multiple LSTM layers, and a linear decoder layer. For our particular experiments, we adopted the hyperparameters from Mielke et al. (2019) (see Merity et al., 2018, for their character PTB settings). The batch size used for the character models was 128, with 500 epochs of training. All other models used a batch size of 40 and were trained for 200 epochs.

4.5 Metrics

Surprisal per verse  One major evaluation metric for language models is the negative log-likelihood on a test set. The negative log-likelihood, or surprisal, is the amount of information a language model needs to generate the next unit. Following Mielke et al. (2019), we define the surprisal at the verse level, where NLL(v_ij) = −log₂ p(v_ij) for a verse v_ij (the ith verse in language j). Since each verse is intended to express the same meaning across languages, differences in per-verse surprisal across languages primarily indicate differences in cross-linguistic language model quality (rather than differences in meaning content).

For each language j, we average the negative log-likelihood across the 4,225 verses in the test set, making L_j = (1/4225) Σ_{i=1}^{4225} NLL(v_ij).

Surprisal difference  Additionally, we quantify the difference between segmentation methods in language modeling performance, as shown in Equation 1. This quantity compares the relative strength of one segmentation method to another.

    ∆_{S_j1, S_j2} = (L_j1 − L_j2) / ((1/2)(L_j1 + L_j2))    (1)

S_j1 and S_j2 are the two segmentation methods to compare, and L_j1 and L_j2 represent the surprisal per verse for the language models based on the two segmentation methods. If ∆_{S_j1, S_j2} is positive, S_j1 resulted in a higher surprisal than S_j2, and S_j2 was more effective in modeling the given language.

5 Results

We now present results from our experiments. We report the strong association between several morphological features and surprisal per verse for BPE language models, compared to language models based on other segmentation methods. Then, we show the trade-offs between different segmentation methods and how they interact with morphological complexity. Our assumption is that, if a segmentation method reduces the impact of morphology, the surprisal values of language models based on that segmentation will have weaker correlations with measures of morphology.

5.1 Correlation studies with character and BPE models

We investigated correlations between surprisal per verse and various measures of morphology (i.e., WALS features, number of types, TTR, Moving-Average TTR, and Mean Length of Word). Benjamini and Hochberg (1995)'s procedure was used to control the false discovery rate, so only p ≤ (8/15) · 0.05 (≈ 0.027) is considered significant.

WALS features  We tested for association between surprisal and each selected WALS feature with the Kruskal–Wallis test, or one-way ANOVA on ranks. This non-parametric test was chosen because the distribution of surprisal values did not meet the assumption of normality. A significant test result in this context means that there are significant differences in the median surprisal values between categories for a given feature. In order for the test to be effective, only feature values with a sample size ≥ 5 were tested.

For the character models, no features showed a significant association with surprisal. However, for the BPE models, half of the morphological features had a significant association with surprisal.
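The two metrics of §4.5 follow directly from their definitions. A minimal sketch, where `verse_nlls` stands for a trained model's per-verse negative log-likelihoods (in bits):

```python
def surprisal_per_verse(verse_nlls):
    """Average negative log-likelihood over the test verses of one language (L_j)."""
    return sum(verse_nlls) / len(verse_nlls)

def surprisal_difference(l_j1, l_j2):
    """Equation 1: surprisal difference between two segmentation methods,
    normalized by the mean of the two surprisal values."""
    return (l_j1 - l_j2) / (0.5 * (l_j1 + l_j2))
```

For instance, surprisals of 110 and 90 bits per verse give ∆ = 20/100 = 0.2, a positive value indicating that the second segmentation method modeled the language more effectively.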
in language modeling performance as shown in Equation 1. This quantity compares the relative performance of two segmentation methods (see §4.5).

These features were 21A "Exponence of Selected Inflectional Formatives," 23A "Locus of Marking in the Clause," 24A "Locus of Marking in Possessive Noun Phrases," 25A "Locus of Marking: Whole-language Typology," 25B "Zero Marking of A and P Arguments," and 29A "Syncretism in Verbal Person/Number Marking."

For the features shown to have an effect on the BPE surprisal, we calculated the effect sizes and performed post-hoc comparisons to determine which categories were significantly different. In this context, effect size (η²) indicates the proportion of variance in surprisal per verse explained by each WALS feature, and η² ≥ 0.14 is considered a large effect (Tomczak and Tomczak, 2014). The p-values and effect sizes are summarized in Table 4. The effect size was large for all of the significant features except for 25B.

  Segmentation   ID    p-value   η²
  BPE            21A   1.3e-05   0.28
  BPE            23A   6.7e-06   0.28
  BPE            24A   2.2e-04   0.228
  BPE            25A   6.5e-05   0.253
  BPE            25B   0.014     0.06
  BPE            29A   2.0e-04   0.198
  Morfessor      21A   0.009     0.109
  Morfessor      23A   0.002     0.135
  Morfessor      26A   0.022     0.064
  Morfessor      29A   0.024     0.072

Table 4: p-values and effect sizes of WALS features that showed a significant effect on surprisal per verse. Effect sizes ≥ 0.14 are considered large.

For Feature 21A, the median surprisal value for languages with no case was significantly lower than the median value for other types. Similarly, for 23A, the median surprisal value for languages with no marking was significantly lower than the value for other types. In the cases of both 24A and 25A, languages with double marking had higher surprisal values than those with single or no marking. For 25B, languages with non-zero marking had slightly higher surprisal values than those with zero marking. Lastly, for 29A, languages without syncretism had higher surprisal values than those with syncretism or with no marking. In general, less inflectional morphology was associated with lower surprisal, and more inflectional morphology with higher surprisal.

Corpus-based measures A similar trend emerged for corpus-based measures of morphological complexity. The surprisal per verse of BPE models was highly correlated with type count, type-token ratio (TTR), Moving-Average TTR (MATTR), and Mean Length of Word (MLW). With character models, in contrast, the correlations were weak and often insignificant. These results suggest that BPE segmentation was ineffective in reducing the impact of morphological complexity. Table 5 summarizes the correlation coefficients and the corresponding p-values. For the character-based models, only the number of types and MATTR showed a significant Spearman's rank-order correlation, and those correlations were rather weak. In contrast, the BPE models presented strong correlations with all of the corpus-based measures at any reasonable alpha value (p < 10⁻¹⁶). The number of types showed the strongest correlation, followed by TTR, MATTR, and MLW in that order.

  Segmentation   Measure   Spearman's ρ
  Character      Types     0.19*
  Character      TTR       0.15
  Character      MATTR     0.17*
  Character      MLW       0.06
  BPE            Types     0.80***
  BPE            TTR       0.76***
  BPE            MATTR     0.68***
  BPE            MLW       0.61***
  Morfessor      Types     0.50***
  Morfessor      TTR       0.44***
  Morfessor      MATTR     0.39***
  Morfessor      MLW       0.30***

Table 5: Correlation between surprisal per verse per segmentation method and morphological complexity measures. *p < 0.027, ***p < 0.0005.

5.2 Comparison with Morfessor and Finite-State Transducer models

We trained language models using three additional segmentation methods: Morfessor, FST+BPE, and FST+Morfessor. Because Morfessor is an unsupervised method, we were able to use it to segment all languages, but we were able to generate FST segmentations for only a few languages. As such, we compare the character, BPE, and Morfessor models for all languages before looking into the subset of languages where the FST methods were available.

Figure 1: Pairwise comparisons of surprisal per verse values for character, BPE, and Morfessor models. For the majority of the languages, Morfessor segmentation resulted in lower surprisal per verse than character or BPE segmentation.

Morfessor models Morfessor segmentation performed better than both character and BPE segmentation for the majority of languages. Figure 1 shows pairwise comparisons of the surprisal per verse values of a given language under different segmentation strategies. As shown in the plot on the left, the relative strength of BPE and character segmentation is not clear: BPE segmentation produced slightly better results for 49 of the 92 languages, but character segmentation produced much lower surprisal values for the rest. In contrast, Morfessor clearly outperformed character and BPE segmentation for most of the languages, as shown in the middle and right plots. Only 12 of the 92 languages had higher surprisal values with Morfessor segmentation than with character segmentation, while a total of 66 languages performed better with Morfessor segmentation than with BPE.

FST models When available, an FST segmentation method resulted in the best performance. Figure 2 displays the surprisal of FST+BPE and FST+Morfessor models in comparison to the segmentation methods discussed above. For all seven languages, either FST+BPE or FST+Morfessor segmentation (or both) shows a clear decrease in surprisal per verse compared to the BPE and Morfessor segmentations.
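The corpus-based measures in Table 5 can all be computed directly from tokenized text. The sketch below is illustrative rather than our exact implementation; in particular, the MATTR window length is a free parameter here (Covington and McFall, 2010, define MATTR over fixed-length sliding windows).

```python
def complexity_measures(tokens, window=500):
    """Corpus-based morphological complexity measures for a list of tokens."""
    n = len(tokens)
    types = len(set(tokens))           # number of distinct word forms
    ttr = types / n                    # type-token ratio
    if n >= window:
        # Moving-average TTR: mean TTR over all fixed-length sliding windows
        mattr = sum(len(set(tokens[i:i + window])) / window
                    for i in range(n - window + 1)) / (n - window + 1)
    else:
        mattr = ttr                    # fall back for very short texts
    mlw = sum(len(t) for t in tokens) / n   # mean length of word, in characters
    return {"Types": types, "TTR": ttr, "MATTR": mattr, "MLW": mlw}
```

Higher values of all four measures loosely track richer morphology: a language that inflects heavily produces more distinct word forms per token and longer words.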
In addition, Morfessor models' surprisal per verse showed weaker correlations with measures of morphology. Only four WALS features showed a significant association with the Morfessor models: 21A "Exponence of Selected Inflectional Formatives," 23A "Locus of Marking in the Clause," 26A "Prefixing vs. Suffixing in Inflectional Morphology," and 29A "Syncretism in Verbal Person/Number Marking." The effect sizes were also much smaller than those for the BPE models, as shown in Table 4. Just as with the BPE models, the median surprisal for languages with no marking was much lower than the surprisal for other types for Features 21A, 23A, and 29A. For 26A, there was only a significant difference between weakly suffixing languages and strongly prefixing languages, with strongly prefixing languages having a lower median surprisal per verse. As shown in Table 5, corpus-based statistics still showed significant correlations with the surprisal per verse of Morfessor models, but the correlations were moderate compared to those of the BPE models.

Figure 2: Surprisal per verse per segmentation method, including FST segmentation methods. FST+BPE or FST+Morfessor models outperform all other models.

5.3 Surprisal difference and morphological complexity

In order to look into the effect of morphological complexity on the relative strength of a given segmentation method, we conducted correlation studies with the difference between the surprisal per verse for pairs of segmentation methods (the ∆ values as defined in §4.5). We considered only the measures of morphological complexity that were continuous variables (i.e., number of types, TTR, MATTR, and MLW).

As shown in Table 6, all of the corpus-based statistics were highly correlated with the ∆ values. The correlations range from moderate to high using Spearman's ρ (0.50 < ρ < 0.95). Even though the strength of the correlations varied slightly, number of types, TTR, MATTR, and MLW all showed a similar correlation with the difference statistics. They all had a positive correlation with ∆(BPE, char). This indicates that the more morphologically complex a language is, the better it is modeled with character segmentation compared to BPE segmentation. Similarly, there were positive correlations between the morphological measures and ∆(Morfessor, char), suggesting that character segmentation works better than Morfessor in modeling morphologically complex languages. ∆(BPE, Morfessor) also had positive correlations with the complexity measures, meaning that languages with higher morphological complexity tend to record lower surprisal values with Morfessor segmentation than with BPE. While BPE and Morfessor models outperformed character models on average, as shown in §5.2, the positive correlations with ∆(Morfessor, char) and ∆(BPE, char) suggest that character segmentation outperformed BPE and Morfessor segmentation for languages with very rich morphology.

  Difference           Measure   Spearman's ρ
  ∆(BPE, char)         Types     0.95***
  ∆(BPE, char)         TTR       0.92***
  ∆(BPE, char)         MATTR     0.77***
  ∆(BPE, char)         MLW       0.74***
  ∆(Morfessor, char)   Types     0.71***
  ∆(Morfessor, char)   TTR       0.66***
  ∆(Morfessor, char)   MATTR     0.50***
  ∆(Morfessor, char)   MLW       0.53***
  ∆(BPE, Morfessor)    Types     0.86***
  ∆(BPE, Morfessor)    TTR       0.86***
  ∆(BPE, Morfessor)    MATTR     0.80***
  ∆(BPE, Morfessor)    MLW       0.75***

Table 6: Correlation between surprisal differences and morphological complexity measures for character, BPE, and Morfessor models. All p-values < 10⁻¹¹.

These results are supported by Figure 3, where the surprisal per verse for the different segmentation models is plotted against MATTR (the same trend was captured when we plotted the other corpus-based measures). For languages with lower MATTR, BPE and Morfessor perform better than character segmentation. However, for languages with higher MATTR, character and Morfessor models outperform BPE.

Figure 3: Surprisal per verse plotted against Moving-Average TTR for character, BPE, and Morfessor segmentation methods. Lines indicate the regression estimate with 95% confidence intervals.

6 Discussion

Our results show that BPE models' surprisal per verse is highly correlated with a language's morphology, represented by several WALS features and corpus-based measures. Morfessor shows weaker correlations with such measures and records better performance for most of the languages. FST-based models outperform the others when available. In this section, we discuss the implications of these findings in the context of previous work and future research.

6.1 Morphology and surprisal

In accordance with the prior work discussed in §2, we found differences in modeling difficulty between languages. The correlation studies in §5 provide evidence that morphology is a substantial contributing factor to these differences. Six WALS (Dryer and Haspelmath, 2013) morphology features showed an association with the surprisal per verse of BPE language models, and corpus-based statistics like number of types and MATTR showed strong correlations with BPE surprisal, supporting the relationship between modeling difficulty and morphological complexity.

Our conclusion that a language's morphology impacts language modeling difficulty agrees with Cotterell et al. (2018) and Gerz et al. (2018), but is at odds with Mielke et al. (2019). We included languages known for their rich morphology, such as Western Canadian Inuktitut (ikt) and Central Alaskan Yup'ik (esu), which may have increased the variation in morphological complexity in the corpus.
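The rank correlations reported in Tables 5 and 6 require no special tooling: Spearman's ρ is Pearson correlation computed on ranks, and with no ties it reduces to the familiar 1 − 6Σd²/(n(n²−1)). The sketch below uses illustrative values, not our data.

```python
def rank(xs):
    # Average ranks (handles ties); rank 1 = smallest value.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1           # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Pearson correlation of the two rank vectors.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-language surprisal differences and MATTR values
delta_bpe_char = [-5.5, 5.2, 22.5, 16.2, -4.6]
mattr = [0.61, 0.68, 0.81, 0.77, 0.55]
rho = spearman_rho(delta_bpe_char, mattr)  # 0.9 for this toy data
```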
We also augmented the WALS data by consulting reference grammars, which allowed us to consider 11 more morphological WALS features than Mielke et al. (2019). We found that the one morphological feature Mielke et al. (2019) considered, 26A "Prefixing vs. Suffixing in Inflectional Morphology," indeed showed no correlation with BPE surprisal. However, our results show that there are aspects of morphology that affect surprisal that were not considered before.

Previous work, such as Gerz et al. (2018), focused only on aspects of morphology that they believed a priori would predict language model performance. In contrast, our study tested all of the morphological features listed in WALS, and tested each of them individually. We found that two of the four features in Gerz et al. (2018), 20A "Fusion of Selected Inflectional Formatives" and 22A "Inflectional Synthesis of the Verb," showed no association with language model performance. Additionally, we found several features that did affect language modeling performance, specifically locus of marking and syncretism, which were not mentioned in the literature. These results show that the features tied to morphological complexity in previous work are not necessarily the same features that affect language modeling.

In addition to these differences in results, our interpretation of corpus-based statistics like TTR also diverges from previous work. While Mielke et al. (2019) reported high correlations between language model performance and such statistics, they considered them only as simple statistics of the data. In fact, our results replicate Mielke et al. (2019) in that the number of types was the most predictive of BPE language model surprisal among all the variables considered. However, we argue that corpus-based statistics can be used as an approximate measure of morphological complexity, based on previous studies: these corpus-based measures are reported to capture the overall ranking of morphological complexity (Kettunen, 2014; Bentz et al., 2016) and can be interpreted in relation to morphological typology (Gerz et al., 2018). We also believe our results indicate that TTR and the WALS features capture similar information. For example, the positive correlation of ∆(BPE, Morfessor) with the corpus-based measures corresponds to the smaller effect sizes of WALS features found for Morfessor compared to BPE. This indicates a lesser effect of rich morphology on Morfessor models than on BPE models.

6.2 Segmentation methods

While the primary goal of this work is to analyze the relation of a language's morphology to language modeling performance, we found this to be entangled with the level and method of segmentation. Our results show that there is significant variation in the effectiveness of segmentation methods cross-linguistically, and they suggest challenges to the status quo methods of subword segmentation in particular. While the subword segmentation methods we used generally outperformed character-level segmentation, the higher a language's TTR, the smaller the difference in surprisal for both BPE and Morfessor, suggesting that these methods are less effective at segmenting languages with highly complex morphology. Of pre-existing methods, we found Morfessor to have the lowest surprisal per verse for most of the languages considered. Morfessor's weaker correlations with WALS features and other measures like TTR suggest that its better performance may be due to a better ability to model languages with a wider range of morphological attributes.
This is in line with Bostrom and Durrett (2020), who showed that Unigram LM (Kudo, 2018), a segmentation algorithm similar to Morfessor, often outperforms BPE and produces more morph-like segmentations in the context of language model pretraining in English and Japanese.

However, Morfessor was significantly outperformed by character segmentation for a small subset of languages (amh, arz, ayr, cmn, esu, heb, ike, ikt, kal, quh, tel, xho; BPE outperformed Morfessor for cmn and heb). Many of these languages have been classified as polysynthetic, suggesting that perhaps Morfessor is ill-suited for such languages (see Klavans, 2018; Tyers and Mishchenkova, 2020; Mager et al., 2018, for discussions of the challenges polysynthetic languages pose for NLP tasks).

Additionally, for a typologically diverse subset of languages for which we could obtain FST morphological segmenters, we considered novel segmentation methods: FST+BPE and FST+Morfessor. We found that this simple extension of BPE and Morfessor with morphological information achieved the lowest surprisal per verse in all available languages. The overall success of combining statistical segmentation with FSTs further confirms the impact of morphology on language modeling and yields significant promise for the use of segmentation based on linguistic morphological information.

Acknowledgments

This paper builds on our prior work for the 2019 Sixth Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology (JSALT 2019) (Schwartz et al., 2020). We thank the organizers of the workshop and the members of our workshop team on Neural Polysynthetic Language Modeling for inspiring us to pursue this research direction. Our special thanks to Rebecca Knowles, Christo Kirov, Lori Levin, Chi-kiu (Jackie) Lo, and the TACL reviewers and editors for their feedback on our manuscript. We thank Ata Tuncer for his assistance with Turkish segmentation. This work utilizes resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign.

7 Conclusion

A language's morphology is strongly associated with language modeling surprisal for BPE-segmented language models. BPE model surprisal is associated with 6 of the 12 studied WALS morphology features, indicating that there are aspects of some languages' morphology that BPE does not help mitigate. Strong correlations with corpus-based measures of morphology such as TTR further suggest that the more types available in a language (often by means of rich morphology), the harder it is to model based on BPE units. Morfessor, which was designed with morpheme induction in mind, performs better for most languages and shows less association with morphological features. When available, the linguistically-informed method of FST-augmented BPE or Morfessor segmentation performs best, indicating further promise for using linguistic knowledge to combat the effects of morphology on language model surprisal.

These conclusions were only possible through manual augmentation of typological databases and expansion of the studied languages. Future efforts could adopt our approach for other areas of language. Using linguistically-informed resources across many languages is an avenue for improving neural models in NLP in both design and analysis.

References

Antti Arppe, Atticus Harrigan, Katherine Schmirler, Lene Antonsen, Trond Trosterud, Sjur Nørstebø Moshagen, Miikka Silfverberg, Arok Wolvengrey, Conor Snoek, Jordan Lachler, Eddie Antonio Santos, Jean Okimāsis, and Dorothy Thunder. 2014–2019. Finite-state transducer-based computational model of Plains Cree morphology.

Eric Axelson, Sam Hardwick, Krister Lindén, Kimmo Koskenniemi, Flammie Pirinen, Miikka Silfverberg, and Senka Drobac. 2015. Helsinki finite-state technology resources.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. 2016. A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153, Osaka, Japan. The COLING 2016 Organizing Committee.

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. CoRR, cs.CL/2004.03720v1.

Çağrı Çöltekin. 2010. A freely available morphological analyzer for Turkish. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Çağrı Çöltekin. 2014. A set of open source tools for Turkish natural language processing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Christos Christodoulopoulos and Mark Steedman. 2014. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49:1–21.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3:1–3:34.

Mathieu Dehouck and Pascal Denis. 2018. A framework for understanding the role of morphology in universal dependency parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2864–2870, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Kimmo Kettunen. 2014. Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics, 21(3):223–245.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Judith L. Klavans. 2018. Computational challenges for polysynthetic languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 1–11, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. 2011. Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In Cerstin Mahlow and Michael Piotrowski, editors, Systems and Frameworks for Computational Morphology, pages 119–129. Springer Berlin Heidelberg, Berlin, Heidelberg.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, cs.CL/1907.11692v1.

Manuel Mager, Elisabeth Mager, Alfonso Medina-Urrea, Ivan Vladimir Meza Ruiz, and Katharina Kann. 2018. Lost in translation: Analysis of information loss during machine translation between polysynthetic and fusional languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 73–83, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3158–3163, Reykjavik, Iceland. European Language Resources Association (ELRA).

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. CoRR, cs.CL/1803.08240v1.

Sabrina J. Mielke. 2016. Language diversity in ACL 2004–2016.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Sabrina J. Mielke and Jason Eisner. 2019. Spell once, summon anywhere: A two-level open-vocabulary language model. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6843–6850.

Tommi A. Pirinen. 2015. Omorfi — free and open source morphological lexical database for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 313–315, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity, Paris, France. Surrey Morphology Group.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), pages 1263–1266, Lisbon, Portugal. European Language Resources Association (ELRA).

Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud'hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, and Zhisong Zhang. 2020. Neural polysynthetic language modelling. CoRR, cs.CL/2005.05477v2.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical report, Department of Informatics, Kyushu University.

Maciej Tomczak and Ewa Tomczak. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences, 1(21):19–25.

Francis Tyers and Karina Mishchenkova. 2020. Dependency annotation of noun incorporation in polysynthetic languages. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 195–204, Barcelona, Spain (Online). Association for Computational Linguistics.

Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2016–2027, Vancouver, Canada. Association for Computational Linguistics.

Hugo David Calderon Vilca, Flor Cagniy Cárdenas Mariñó, and Edwin Fredy Mamani Calderon. 2012. Analizador morfológico de la lengua Quechua basado en software libre Helsinki finite-state transducer (HFST).

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor baseline. Technical report, Aalto University; Aalto-yliopisto.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Hanna Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.

A Data

We began with the data used in Mielke et al. (2019). This was originally a subset of a Bible corpus (Mayer and Cysouw, 2014) which is no longer publicly available. We excluded constructed languages (epo, tlh) from the data, keeping a total of 104 verse-aligned Bibles in 60 languages [9] in 12 language families. To increase the number of languages and language families represented, we added 41 Bibles in 32 languages to the data. 13 Bible translations in 13 languages [10] were sourced from Christodoulopoulos and Steedman (2014). In addition, we included 28 Bible translations in 21 languages scraped from various online sources. Two of the scraped Bibles were in Spanish (spa) and Telugu (tel), languages which were already included in the Bible corpora (Mayer and Cysouw, 2014; Christodoulopoulos and Steedman, 2014). These translations were included because the new Spanish Bible was a parallel source for the Paraguayan Guaraní (gug) translation, and the Telugu Bible obtained from Mielke et al. (2019) was originally mislabeled as Tecpatlán Totonac (tcw). The Central Alaskan Yup'ik (esu) Bible was from https://bibles.org. 26 Bibles in 19 languages [11] were from http://bible.com. The Greenlandic (kal) Bible was obtained from http://old.bibelselskabet.dk.

[9] afr, aln, arb, arz, ayr, bba, ben, bqc, bul, cac, cak, ceb, ces, cmn, cnh, cym, dan, deu, ell, eng, fin, fra, guj, gur, hat, hrv, hun, ind, ita, kek, kjb, lat, lit, mah, mam, mri, mya, nld, nor, plt, poh, por, qub, quh, quy, quz, ron, rus, som, tbz, tel, tgl, tpi, tpm, ukr, vie, wal, wbm, xho, zom
[10] als, amh, dje, heb, isl, jpn, kor, pck, slk, slv, spa, swe, tha
[11] crk, gug, gui, hin, ike, ikt, kan, mal, mar, nch, nep, nhe, pes, pol, sna, spa, tel, tob, tur
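Verse-level alignment of the kind described above reduces, in the simplest case, to intersecting verse identifiers across translations, so that every model is trained and evaluated on the same set of verses. A minimal sketch (the verse-ID format here is hypothetical):

```python
def align_verses(bibles):
    # bibles: dict mapping a language code to {verse_id: verse_text}.
    # Keep only verse IDs present in every translation.
    shared = set.intersection(*(set(b) for b in bibles.values()))
    return {lang: {v: text[v] for v in sorted(shared)}
            for lang, text in bibles.items()}
```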