Morphology Matters: A Multilingual Language Modeling Analysis

Hyunji Hayley Park, University of Illinois, [email protected]
Katherine J. Zhang, Carnegie Mellon University, [email protected]
Coleman Haley, Johns Hopkins University, [email protected]

Kenneth Steimel, Indiana University, [email protected]
Han Liu, University of Chicago∗, [email protected]
Lane Schwartz, University of Illinois, [email protected]

arXiv:2012.06262v1 [cs.CL] 11 Dec 2020

Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.¹ We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.

1 Introduction

With most research in Natural Language Processing (NLP) directed at a small subset of the world's languages, whether the techniques developed are truly language-agnostic is often not known. Because the vast majority of research focuses on English, with Chinese a distant second (Mielke, 2016), neither of which is morphologically rich, the impact of morphology on NLP tasks for various languages is not entirely understood.

Several studies have investigated this issue in the context of language modeling by comparing a number of languages, but found conflicting results. Gerz et al. (2018) and Cotterell et al. (2018) find that morphological complexity is predictive of language modeling difficulty, while Mielke et al. (2019) conclude that simple statistics of a text, like the number of types, explain differences in modeling difficulty, rather than morphological measures.

This paper revisits this issue by increasing the number of languages considered and augmenting the kind and number of morphological features used. We train language models for 92 languages from a corpus of Bibles fully aligned at the verse level and measure language modeling performance using surprisal (the negative log-likelihood) per verse (see §4.5). We investigate how this measure is correlated with 12 linguist-generated morphological features and four corpus-based measures of morphological complexity.

Additionally, we contend that the relation between segmentation method, morphology, and language modeling performance needs further investigation. Byte-Pair Encoding (BPE; Shibata et al., 1999) is widely used in NLP tasks, including machine translation (Sennrich et al., 2016), as an unsupervised information-theoretic method for segmenting text data into subword units. Variants of BPE or closely related methods such as WordPiece (Kudo, 2018) are frequently employed by state-of-the-art pretrained language models (Liu et al., 2019; Radford et al., 2019; Devlin et al., 2019; Yang et al., 2019). However, BPE and other segmentation methods may vary in how closely they capture morphological segments for a given language, which may affect language modeling performance.

Therefore, this paper focuses on the following two research questions:

1. Does a language's morphology influence language modeling difficulty?
2. If so, how do different segmentation methods interact with morphology?

In order to answer the first question, we train models using data sets segmented by characters and BPE units. Our results show that BPE language modeling surprisal is significantly correlated with measures of morphological typology and complexity. This suggests that BPE segments are ineffective in mitigating the effect of morphology in language modeling.

As for the second question, we consider more linguistically-motivated segmentation methods to compare with BPE: Morfessor (Creutz and Lagus, 2007) and Finite-State Transducers (FSTs) (see §4.3). Our comparison of the models using the different segmentation methods shows that Morfessor reduces the impact of morphology for more languages than BPE. FST-based segmentation methods outperform the other segmentation methods when available. These results suggest that morphologically motivated segmentations improve cross-linguistic language modeling.

∗ Work done while at University of Colorado Boulder.
¹ https://github.com/hayleypark/MorphologyMatters

2 Modeling difficulty across languages

Studies have demonstrated that different languages may be unequally difficult to model and have tested the relations between such modeling difficulty and the morphological properties of languages, using different segmentation methods.

Vania and Lopez (2017) compared the effectiveness of word representations based on different segmentation methods in modeling 10 languages with various morphological typologies. They trained word-level language models but utilized segmentation methods to create word embeddings that include segment-level information. Comparing character, BPE, and Morfessor segmentations, they concluded that character-based representations were most effective across languages, with BPE always outperforming Morfessor. However, models based on hand-crafted morphological analyses outperformed all other segmentation methods by a wide margin.

Gerz et al. (2018) trained n-gram and neural language models over 50 languages and argued that the type of morphological system is predictive of model performance. Their results show that languages differ with regard to modeling difficulty. They attributed the differences among languages to four types of morphological systems: isolating, fusional, introflexive, and agglutinative. While they found a significant association between the morphological type and modeling difficulty, Type-Token Ratio (TTR) was the most predictive of language modeling performance.

Cotterell et al. (2018) arrived at a similar conclusion modeling 21 languages using the Europarl corpus (Koehn, 2005). When trained with n-gram and character-based Long Short-Term Memory (LSTM) models, the languages showed different modeling difficulties, which were correlated with a measure of morphology, Morphological Counting Complexity (MCC), or the number of inflectional categories (Sagot, 2013).

However, Mielke et al. (2019) failed to reproduce the correlation with MCC when they increased the scope to 69 languages, utilizing a Bible corpus (Mayer and Cysouw, 2014). They also reported no correlation with measures of morphosyntactic complexity such as head-POS entropy (Dehouck and Denis, 2018) and other linguist-generated features (Dryer and Haspelmath, 2013). Rather, they found that simpler statistics, namely the number of types and the number of characters per word, correlate with language model surprisal using BPE and character segmentation, respectively.

3 Morphological measures

Different measures of morphology are used to represent a language's morphology.

3.1 Linguist-generated measures

The most linguistically-informed measures of morphology involve expert descriptions of languages. The World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) has been used frequently in the literature to provide typological information. WALS is a large database of linguistic features gathered from descriptive materials, such as reference grammars. It contains 144 chapters in 11 areas, including phonology, morphology, and word order. Each chapter describes a feature with categorical values and lists the languages that have each value. However, not all languages in the database have data for all the features, and for some languages there is no data at all.

The studies reviewed in §2 all relied on this expert-description approach to quantify morphological properties. Gerz et al. (2018) focused on WALS descriptions of inflectional synthesis of verbs, fusion, exponence, and flexivity, while Mielke et al. (2019) looked at two WALS features, 26A "Prefixing vs. Suffixing in Inflectional Morphology" and 81A "Order of Subject, Object and Verb." Cotterell et al. (2018) used UniMorph (Kirov et al., 2018), instead of WALS, to calculate MCC. Vania and Lopez (2017) did not cite any databases but provided descriptions of four morphological types (fusional, agglutinative, root-and-pattern, and reduplication) and categorized 10 languages into these types.

A major issue with this approach to representing morphology is that there is not enough expert data available to enable comparisons across many different languages. In fact, Mielke et al. (2019) chose their two WALS features because data for these features existed for most of their languages. Moreover, Bentz et al. (2016) showed that their WALS-based measure had lower correlations with other measures of morphological complexity due to this issue of missing data.

3.2 Corpus-based measures

In contrast, corpus-based measures of morphology can be easily calculated on a given data set. These measures include the number of types, Type-Token Ratio (TTR), Moving-Average TTR (MATTR; Covington and McFall, 2010), and Mean Length of Word (MLW). The exact definition of the measures may vary depending on the study, but we define them as in Table 1, where a word token is a string separated by spaces in the training set after tokenization but before segmentation.

Measure  Definition
Types    Number of unique word tokens
TTR      Number of unique word tokens divided by total number of word tokens
MATTR    Average TTR calculated over a moving window of 500 word tokens
MLW      Average number of characters per word token

Table 1: Corpus-based measures of morphology defined for this study. These measures are calculated on tokenized data sets before applying any segmentation method.

While some studies (e.g., Mielke et al., 2019) consider these measures simple statistics of a corpus, other studies have found that they can be used as approximate measures of morphological complexity. Kettunen (2014) showed that TTR, MATTR, and MLW can capture the overall ranking of morphological complexity generated by information-theoretic and expert-generated measures of morphological complexity. Bentz et al. (2016) compared different measures of morphological complexity for 519 languages across 101 families and showed a strong correlation between all measures, which were based on corpus statistics, linguistic expertise, information theory, and translation alignment. They argued that corpus-based measures, including TTR, and other measures of morphological complexity can be used interchangeably. In addition, Gerz et al. (2018) showed that TTR is influenced by the morphological typology of a language. According to them, isolating languages tend to have small TTR values and are often easier to model, while the opposite is true for agglutinative languages.

Given the previous literature, we utilize these corpus-based measures, as well as expert-generated WALS features, as a proxy for morphological differences among languages in our study.

4 Methods

We design our experiments to test whether a language's morphology is correlated with language model performance, depending on the segmentation method. We represent a language's morphology using WALS features and corpus statistics. We train language models for Bible translations in 92 languages based on five different segmentation methods: character, BPE, Morfessor, and FST with BPE or Morfessor back-off strategies (FST+BPE and FST+Morfessor). We use surprisal per verse (Mielke et al., 2019) as the evaluation metric to compare language modeling performance across different languages and different segmentation methods. Additionally, we quantify the difference in surprisal per verse between segmentation methods to compare the relative strength of each segmentation method with regard to morphological complexity.

4.1 Data

Our data consist of 145 Bible translations in 92 languages covering 22 language families,² fully aligned at the verse level. The majority of the data came verse-aligned from Mielke et al. (2019) (original data from Mayer and Cysouw, 2014). We added more Bibles from another corpus (Christodoulopoulos and Steedman, 2014) and from online Bible resources (see Appendix A for more information). We refer to each language by its ISO 639-3 code when applicable.

We followed Mielke et al. (2019)'s method to split the data into training, development, and test sets: the verse-aligned data were divided into blocks of 30 verses, with the first five verses being assigned to the development set, the next five to the test set, and the rest to the training set. The resulting training set had 16,926 verses, while the development and test sets had 4,225 verses each.

It should be noted that both Mielke et al. (2019) and Christodoulopoulos and Steedman (2014) provided tokenized data. We tokenized the newly added Bibles using Mielke and Eisner (2019)'s tokenizer, following Mielke et al. (2019). When both tokenized and untokenized versions were available, we included the tokenized versions only.

² For each language, we report the family assigned by WALS (Dryer and Haspelmath, 2013): 6 Afro-Asiatic, 1 Algic, 1 Altaic, 2 Austro-Asiatic, 6 Austronesian, 1 Aymaran, 3 Dravidian, 4 Eskimo-Aleut, 1 Guaicuruan, 33 Indo-European, 1 Japanese, 1 Korean, 1 Mande, 6 Mayan, 6 Niger-Congo, 4 Quechuan, 5 Sino-Tibetan, 1 Songhay, 1 Tai-Kadai, 2 Tupian, 2 Uralic, 2 Uto-Aztecan, 2 creoles.

ID   Name
20A  Fusion of Selected Inflectional Formatives
21A  Exponence of Selected Inflectional Formatives
21B  Exponence of Tense-Aspect-Mood Inflection
22A  Inflectional Synthesis of the Verb
23A  Locus of Marking in the Clause
24A  Locus of Marking in Possessive Noun Phrases
25A  Locus of Marking: Whole-language Typology
25B  Zero Marking of A and P Arguments
26A  Prefixing vs. Suffixing in Inflectional Morphology
27A  Reduplication
28A  Case Syncretism
29A  Syncretism in Verbal Person/Number Marking

Table 2: The 12 morphological features in WALS.
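The block-wise split described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' script; in particular, how a trailing partial block is handled may differ from their exact code.

```python
def split_verses(verses):
    """Split verse-aligned data into train/dev/test by blocks of 30 verses:
    within each block, verses 1-5 go to the development set, verses 6-10 to
    the test set, and the remaining 20 to the training set."""
    dev, test, train = [], [], []
    for i, verse in enumerate(verses):
        pos = i % 30  # position within the current 30-verse block
        if pos < 5:
            dev.append(verse)
        elif pos < 10:
            test.append(verse)
        else:
            train.append(verse)
    return train, dev, test
```

Applied to 60 verses, this yields 40 training, 10 development, and 10 test verses, mirroring the 20:5:5 ratio per block.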
We chose to replace characters that occurred only once with a special UNK symbol. Mielke et al. (2019) applied this procedure to characters that appear fewer than 25 times in the training set, except for Chinese, where only singleton characters were replaced. Because we added several languages where the original strategy would have resulted in removing too much data, we preprocessed singleton characters across the board.

We also corrected several errors present in the data. For example, the Bible translations in Shona (sna) and Telugu (tel) were mis-coded as Shan (shn) and Tecpatlán Totonac (tcw), respectively.

4.2 Morphological measures selected

In this paper, we adopt two approaches to representing a language's morphology. First, we rely on expert descriptions of languages in WALS, manually augmenting the database to rectify the issue of missing data. Second, we utilize corpus-based measures like TTR to represent the morphological complexity of a given language.

WALS features  While some previous studies (e.g., Gerz et al., 2018; Vania and Lopez, 2017) categorized relatively well-known languages into a small number of morphological types, such categorization is not always clear. Some other studies (e.g., Cotterell et al., 2018; Mielke et al., 2019) selected a small number of available typological features to compare, but their conclusions were at odds, possibly calling for exploration of other measures. Therefore, we consider all available morphological features described by WALS to explore which features affect language modeling and how. Instead of making theoretical claims about morphological typology, we explore which typological features make a language's morphology more complex for LSTM language models.

To that end, we augmented the existing WALS database by consulting reference grammars for each language. Of the 92 languages in our corpus, six were not in the WALS database.³ In addition, many of the languages in the database had missing data for some features. For example, we had no data for any of the morphological features of Afrikaans (afr). We manually assigned missing features where possible, following the descriptions in the relevant WALS chapters regarding the procedures used to assign feature values to languages.

Of the almost 200 features in WALS, the editors of the database labeled 12 of them as morphological features. Therefore, we considered these 12 features, listed in Table 2 and described below,⁴ to test the hypothesis that morphological complexity correlates with modeling difficulty.

Feature 20A describes how closely grammatical markers (inflectional formatives) are phonologically connected to a host word or stem. The markers can be isolating, concatenative, or even nonlinear (i.e., ablaut and tone).

Features 21A and 21B measure the exponence of selected grammatical markers. Exponence refers to the number of categories that a single morpheme expresses. For 21A, the selected grammatical markers were case markers. For 21B, they were tense-aspect-mood (TAM) markers.

Feature 22A measures how many grammatical categories may appear on verbs in a language. These categories include tense-aspect-mood, negation, voice, and agreement.

Features 23A through 25B describe the existence and locus of marking in different kinds of phrases. A phrase may have marking on its head, its dependent(s), both, or neither. In full clauses, the verb is the head, and the subject and object arguments are dependents. In possessive noun phrases, the possessed noun is the head while the possessor is the dependent.

Feature 26A measures the degree to which languages use prefixes versus suffixes in their inflectional morphology. Feature 27A describes which languages use reduplication productively and whether or not both full and partial reduplication are used.

Both Features 28A and 29A measure syncretism. Syncretism occurs when a single inflected form corresponds to more than one function. 28A measures case syncretism specifically, while 29A measures syncretism in the subject agreement marking of verbs.

³ ikt, lat, nch, tbz, wbm, zom
⁴ See https://wals.info/chapter for more details and examples of these features.

Types, TTR, MATTR, and MLW  We calculated the number of types, TTR, Moving-Average TTR, and Mean Length of Word using an adapted script from the Python module LexicalRichness.⁵ We used a window size of 500 for Moving-Average TTR, following previous studies (e.g., Kettunen, 2014). The definitions of the measures are found in Table 1. All measures were calculated based on the word tokens in the training set before applying any segmentation method.

4.3 Segmentation methods

We chose to train only open-vocabulary language models for fair comparison. Word-level models predict UNK for out-of-vocabulary word tokens and as a result cannot be fairly compared with character- and subword-level models. Specifically, we trained language models using five segmentation methods: character, BPE, Morfessor, FST+BPE, and FST+Morfessor. These segmentation methods provide a way to segment any given text into smaller pieces, some of which approximate morphemes. A morpheme is the smallest meaning-bearing morphological unit, while a morph is the surface representation of one or more morphemes. Linguistically-motivated methods like Morfessor and FSTs are designed with the goal of producing subword segments that are closely aligned with the true morphs comprising a word. While BPE was not designed with morpheme segmentation in mind, its resulting subwords are commonly believed to align with morphs to some degree, because morph subsequences are frequent in the data.

Segmenting words into morphs may reduce the impact of rich morphology, as highly inflected words can be broken into smaller pieces that are likely to contribute similar meanings across contexts in the corpus. Table 3 provides examples of the segmentation methods we used to train language models. The original verse is provided for reference only and is not used to train any models.

Character  We trained character-based language models, following previous studies (Mielke et al., 2019; Gerz et al., 2018; Cotterell et al., 2018). Character language models are trained to predict the next character given the preceding context, and the vocabulary includes an underscore ⟨_⟩ to denote word boundaries.

BPE  We trained BPE-based language models, following Mielke et al. (2019). Starting with character segmentation, BPE operations combine characters into larger chunks based on their frequencies to create units somewhere between characters and words, with the number of merge operations as the hyperparameter (Sennrich et al., 2016). We used 0.4 × the number of types as the number of merges, as Mielke et al. (2019) reported that to be most effective with their corpus.⁶
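The corpus-based measures of Table 1 can be computed directly from the definitions. The sketch below is an illustrative reimplementation rather than the LexicalRichness-based script used in the study, and the MATTR loop is naive O(n·w) rather than the incremental algorithm a real implementation would use:

```python
def corpus_measures(tokens, window=500):
    """Compute Types, TTR, MATTR, and MLW over a list of word tokens (Table 1).
    Tokens are whitespace-separated strings from the tokenized training set,
    before any subword segmentation is applied."""
    types = len(set(tokens))
    ttr = types / len(tokens)
    mlw = sum(len(t) for t in tokens) / len(tokens)
    # Moving-Average TTR: mean of the TTRs of every full window of `window` tokens.
    n_windows = len(tokens) - window + 1
    if n_windows < 1:  # corpus shorter than the window: fall back to plain TTR
        mattr = ttr
    else:
        mattr = sum(
            len(set(tokens[i:i + window])) / window for i in range(n_windows)
        ) / n_windows
    return types, ttr, mattr, mlw
```

On a toy corpus such as `["a", "b", "a", "b"]` with a window of 2, this yields 2 types, a TTR of 0.5, a MATTR of 1.0 (every 2-token window contains two distinct tokens), and an MLW of 1.0.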

⁵ https://github.com/LSYS/LexicalRichness
⁶ Additional static numbers of merge operations were also tested, with nearly identical results.

Segmentation    Example
Tokenized       Yuhannanın kardeşi Yakubu kılıçla öldürdü .
Character       Yuhannanın_kardeşi_Yakubu_kılıçla_öldürdü_.
BPE             Yuhan@@ nanın kardeşi Yakubu kılıçla öldürdü .
Morfessor       Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öldürdü .
FST+BPE         Yuhan@@ nanın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .
FST+Morfessor   Yuhanna@@ nın kardeş@@ i Yakub@@ u kılıç@@ la öl@@ dür@@ dü .

Table 3: Turkish examples of the different segmentation methods. An English translation is "And he killed James the brother of John with the sword" (Acts 12:2). The FST does not produce an analysis for Yuhannanın ("John's"), for which the BPE or Morfessor back-off was used. The segmentation created by human experts was the same as FST+Morfessor. ⟨@@⟩ denotes subword segmentation, while ⟨_⟩ encodes the space between word tokens for character segmentation.

BPE language models are trained to predict the next BPE unit. The double at sign ⟨@@⟩ is used to indicate segments that are not word-final.

Morfessor  Morfessor (Creutz and Lagus, 2007) is a word segmentation method explicitly designed for morphological segmentation. The default implementation utilizes a unigram language model to find morph-like constructs. While, like BPE, this approach is information-theoretic, it selects segments top-down and includes a prior term for the length of segments, regularizing segments to be more plausible morphemes.

Using the default settings of Morfessor 2.0 (Virpioja et al., 2013), we trained Morfessor on the training set and applied the segmentation to all data sets. Just as with BPE, the language models are trained to predict the next morph unit.

FST  While segmentation based on BPE and Morfessor may or may not resemble actual morphemes, morpheme segmentation with Finite-State Transducers (FSTs) provides a knowledge-based method to segment a text into morphemes. Finite-state morphological analyzers are rule-based systems that take a surface string as input and produce all possible morphological analyses as output. To use FSTs for segmentation, we changed existing morphological analyzers into segmenters and developed a heuristic to select one analysis for a given word token. FSTs for Plains Cree (Arppe et al., 2014–2019), German (Schmid et al., 2004), English (Axelson et al., 2015), Finnish (Pirinen, 2015), Indonesian (Larasati et al., 2011), Cuzco Quechua (Vilca et al., 2012), and Turkish (Çağrı Çöltekin, 2014, 2010) were used as morphological segmenters.

Most FSTs are designed to provide analyses for surface forms, not morphological segmentations. Fortunately, morpheme boundaries are frequently part of FSTs due to their relevance for lexico-phonological phenomena. By modifying the FST before the cleanup rules that remove morpheme boundaries can apply, we create a morphological segmenter that takes in a surface form and returns the surface form with morpheme boundary markers. If the analyzer already provides segmentations, the transducer is used as-is. For example, the Turkish FST produces a morphological analysis for the surface form kılıçla ("with the sword") in the example in Table 3, analyzing it as the lemma kılıç plus an instrumental case marker. Instead of producing such an analysis for the given word, the segmenter produces the segmented surface form kılıç@@ la, which is what the FST segmentation methods use.

Because an FST may return multiple analyses or segmentations for a single word, a heuristic method was used to determine which segmentation to select. In general, we chose the segmentation with the fewest segments. However, the English segmenter based on Axelson et al. (2015) always returns the input string itself as a possible segmentation if it is covered by the analyzer. For example, walks would produce two segmentations in the English segmenter: walks and walk@@ s. For this segmenter, we selected the segmentation with the fewest segments excluding the input string itself (e.g., choosing walk@@ s over walks).

When an FST produces no analysis for a given word, as in the case of Yuhannanın ("John's") in Table 3, we adopt the FST-augmented BPE segmentation (FST+BPE) and FST-augmented Morfessor segmentation (FST+Morfessor), where we fall back to BPE or Morfessor segmentation whenever FST segmentation is unavailable. As shown in the table, FST+BPE and FST+Morfessor differ only in the segmentation of the unanalyzed word.
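The fewest-segments heuristic and the back-off behavior can be sketched as follows. The `TOY_FST` lookup table is purely hypothetical and stands in for a real FST segmenter (e.g., one compiled with HFST or foma); only the selection logic mirrors the description above.

```python
# Toy stand-in for an FST segmenter's output: word -> candidate segmentations.
TOY_FST = {
    "walks": ["walks", "walk@@ s"],   # the English FST also returns the input itself
    "kılıçla": ["kılıç@@ la"],
}

def fst_segment(word):
    """Return all candidate segmentations for `word`, or [] if unanalyzable."""
    return TOY_FST.get(word, [])

def choose_segmentation(word, exclude_input=False):
    """Pick the candidate with the fewest segments; None if the FST fails."""
    candidates = fst_segment(word)
    if exclude_input:  # used for the English segmenter described above
        candidates = [c for c in candidates if c != word]
    if not candidates:
        return None
    return min(candidates, key=lambda c: len(c.split("@@ ")))

def segment_token(word, backoff, exclude_input=False):
    """FST segmentation with back-off (FST+BPE or FST+Morfessor)."""
    chosen = choose_segmentation(word, exclude_input)
    return chosen if chosen is not None else backoff(word)
```

For example, `segment_token("walks", bpe, exclude_input=True)` yields `walk@@ s`, while an unanalyzable word like Yuhannanın is passed to the BPE or Morfessor back-off unchanged.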
For this particular verse, the human segmentation agrees with the FST+Morfessor segmentation. FST+BPE and FST+Morfessor models are trained just like BPE or Morfessor models, to predict the next subword unit.

4.4 Models

Following Mielke et al. (2019), we trained the Long Short-Term Memory (LSTM) models introduced by Merity et al. (2018) for each of the segmentation methods. Three LSTM models, using character, BPE, and Morfessor segmentation, were trained for all languages. For a select group of languages, we also trained models using FST+BPE and FST+Morfessor units. The neural architecture consisted of an initial embedding layer, multiple LSTM layers, and a linear decoder layer. For our particular experiments, we adopted the hyperparameters from Mielke et al. (2019) (see Merity et al., 2018, for their character PTB settings). The batch size used for the character models was 128, with 500 epochs of training. All other models used a batch size of 40 and were trained for 200 epochs.

4.5 Metrics

Surprisal per verse  One major evaluation metric for language models is the negative log-likelihood on a test set. The negative log-likelihood, or surprisal, is the amount of information a language model needs to generate the next unit. Following Mielke et al. (2019), we define the surprisal at the verse level, where NLL(v_ij) = −log₂ p(v_ij) for a verse v_ij (the ith verse in language j). Since each verse is intended to express the same meaning across languages, differences in per-verse surprisal across languages primarily indicate differences in cross-linguistic language model quality (rather than differences in meaning content).

For each language j, we average the negative log-likelihood across the 4,225 verses in the test set, making L_j = (1/4225) Σ_{i=1}^{4225} NLL(v_ij).

Surprisal difference  Additionally, we quantify the difference between segmentation methods in language modeling performance, as shown in Equation 1. This quantity compares the relative strength of one segmentation method to another.

    ∆_{S_j1, S_j2} = (L_j1 − L_j2) / ((1/2)(L_j1 + L_j2))    (1)

S_j1 and S_j2 are the two segmentation methods to compare, and L_j1 and L_j2 represent the surprisal per verse for the language models based on the two segmentation methods. If ∆_{S_j1, S_j2} is positive, S_j1 resulted in a higher surprisal than S_j2, and S_j2 was more effective in modeling the given language.

5 Results

We now present results from our experiments. We report the strong association between several morphological features and surprisal per verse for BPE language models, compared to language models based on other segmentation methods. Then, we show the trade-offs between different segmentation methods and how they interact with morphological complexity. Our assumption is that, if a segmentation method reduces the impact of morphology, the surprisal values of language models based on that segmentation will have weaker correlations with measures of morphology.

5.1 Correlation studies with character and BPE models

We investigated correlations between surprisal per verse and various measures of morphology (i.e., WALS features, number of types, TTR, Moving-Average TTR, and Mean Length of Word). Benjamini and Hochberg (1995)'s procedure was used to control the false discovery rate, so only p ≤ (8/15) · 0.05 (≈ 0.027) is considered significant.

WALS features  We tested for association between surprisal and each selected WALS feature with the Kruskal–Wallis test, or one-way ANOVA on ranks. This non-parametric test was chosen because the distribution of surprisal values did not meet the assumption of normality. A significant test result in this context means that there are significant differences in the median surprisal values between categories for a given feature. In order for the test to be effective, only feature values with a sample size ≥ 5 were tested.

For the character models, no features showed a significant association with surprisal. However, for the BPE models, half of the morphological features had a significant association with surprisal.
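The two metrics of §4.5 follow directly from their definitions. A minimal sketch, where `verse_nlls` stands for a trained model's per-verse negative log-likelihoods (in bits):

```python
def surprisal_per_verse(verse_nlls):
    """Average negative log-likelihood over the test verses of one language (L_j)."""
    return sum(verse_nlls) / len(verse_nlls)

def surprisal_difference(l_j1, l_j2):
    """Equation 1: surprisal difference between two segmentation methods,
    normalized by the mean of the two surprisal values."""
    return (l_j1 - l_j2) / (0.5 * (l_j1 + l_j2))
```

For instance, surprisals of 110 and 90 bits per verse give ∆ = 20/100 = 0.2, a positive value indicating that the second segmentation method modeled the language more effectively.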
in language modeling performance as shown in Equation 1. This quantity compares the relative performance of two segmentation methods (see §4.5).

These features were 21A "Exponence of Selected Inflectional Formatives," 23A "Locus of Marking in the Clause," 24A "Locus of Marking in Possessive Noun Phrases," 25A "Locus of Marking: Whole-language Typology," 25B "Zero Marking of A and P Arguments," and 29A "Syncretism in Verbal Person/Number Marking."

For the features shown to have an effect on the BPE surprisal, we calculated the effect sizes and performed post-hoc comparisons to determine which categories were significantly different. In this context, effect size (η²) indicates the proportion of variance in surprisal per verse explained by each WALS feature, and η² ≥ 0.14 is considered a large effect (Tomczak and Tomczak, 2014). The p-values and effect sizes are summarized in Table 4. The effect size was large for all of the significant features except for 25B.

  Segmentation   ID    p-value   η²
  BPE            21A   1.3e-05   0.28
  BPE            23A   6.7e-06   0.28
  BPE            24A   2.2e-04   0.228
  BPE            25A   6.5e-05   0.253
  BPE            25B   0.014     0.06
  BPE            29A   2.0e-04   0.198
  Morfessor      21A   0.009     0.109
  Morfessor      23A   0.002     0.135
  Morfessor      26A   0.022     0.064
  Morfessor      29A   0.024     0.072

Table 4: p-values and effect sizes of WALS features that showed a significant effect on surprisal per verse. Effect sizes ≥ 0.14 are considered large.

For Feature 21A, the median surprisal value for languages with no case was significantly lower than the median value for other types. Similarly, for 23A, the median surprisal value for languages with no marking was significantly lower than the value for other types. In the cases of both 24A and 25A, languages with double marking had higher surprisal values than those with single or no marking. For 25B, languages with non-zero marking had slightly higher surprisal values than those with zero marking. Lastly, for 29A, languages without syncretism had higher surprisal values than those with syncretism or with no marking. In general, less inflectional morphology was associated with lower surprisal, and more inflectional morphology with higher surprisal.

Corpus-based measures A similar trend emerged for corpus-based measures of morphological complexity. The surprisal per verse of BPE models was highly correlated with type count, type-token ratio (TTR), Moving-Average TTR (MATTR), and Mean Length of Word (MLW). With character models, in contrast, the correlations were weak and often insignificant. These results suggest that BPE segmentation was ineffective in reducing the impact of morphological complexity. Table 5 summarizes the correlation coefficients and the corresponding p-values. For the character-based models, only the number of types and MATTR showed a significant Spearman's rank-order correlation, and those correlations were rather weak. In contrast, the BPE models presented strong correlations with all of the corpus-based measures at any reasonable alpha value (p < 10⁻¹⁶). The number of types showed the strongest correlation, followed by TTR, MATTR, and MLW in that order.

  Segmentation   Measure   Spearman's ρ
  Character      Types     0.19*
  Character      TTR       0.15
  Character      MATTR     0.17*
  Character      MLW       0.06
  BPE            Types     0.80***
  BPE            TTR       0.76***
  BPE            MATTR     0.68***
  BPE            MLW       0.61***
  Morfessor      Types     0.50***
  Morfessor      TTR       0.44***
  Morfessor      MATTR     0.39***
  Morfessor      MLW       0.30***

Table 5: Correlation between surprisal per verse per segmentation method and morphological complexity measures. *p < 0.027, ***p < 0.0005.

5.2 Comparison with Morfessor and Finite-State Transducer models

We trained language models using three additional segmentation methods: Morfessor, FST+BPE, and FST+Morfessor. Because Morfessor is an unsupervised method, we were able to use it to segment all languages, but we were able to generate FST segmentations for only a few languages. As such, we compare the character, BPE, and Morfessor models for all languages before looking into the subset of languages where the FST methods were available.

Figure 1: Pairwise comparisons of surprisal per verse values for character, BPE, and Morfessor models. For the majority of the languages, Morfessor segmentation resulted in lower surprisal per verse than character or BPE segmentation.

Morfessor models Morfessor segmentation performed better than both character and BPE segmentation for the majority of languages. Figure 1 shows pairwise comparisons of the surprisal per verse values of a given language under different segmentation strategies. As shown in the plot on the left, the relative strength of BPE and character segmentation is not clear: BPE segmentation produced slightly better results for 49 of the 92 languages, but character segmentation produced much lower surprisal values for the rest. In contrast, Morfessor clearly outperformed character and BPE segmentation for most of the languages, as shown in the middle and right plots. Only 12 of the 92 languages had higher surprisal values with Morfessor segmentation than with character segmentation, while a total of 66 languages performed better with Morfessor segmentation than with BPE.

FST models When available, an FST segmentation method resulted in the best performance. Figure 2 displays the surprisal of FST+BPE and FST+Morfessor models in comparison to the segmentation methods discussed above. For all seven languages, either FST+BPE or FST+Morfessor segmentation (or both) shows a clear decrease in surprisal per verse compared to the BPE and Morfessor segmentations.
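The corpus-based measures in Table 5 can all be computed directly from tokenized text. The sketch below is illustrative rather than our exact implementation; in particular, the MATTR window length is a free parameter here (Covington and McFall, 2010, define MATTR over fixed-length sliding windows).

```python
def complexity_measures(tokens, window=500):
    """Corpus-based morphological complexity measures for a list of tokens."""
    n = len(tokens)
    types = len(set(tokens))           # number of distinct word forms
    ttr = types / n                    # type-token ratio
    if n >= window:
        # Moving-average TTR: mean TTR over all fixed-length sliding windows
        mattr = sum(len(set(tokens[i:i + window])) / window
                    for i in range(n - window + 1)) / (n - window + 1)
    else:
        mattr = ttr                    # fall back for very short texts
    mlw = sum(len(t) for t in tokens) / n   # mean length of word, in characters
    return {"Types": types, "TTR": ttr, "MATTR": mattr, "MLW": mlw}
```

Higher values of all four measures loosely track richer morphology: a language that inflects heavily produces more distinct word forms per token and longer words.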
In addition, Morfessor models' surprisal per verse showed weaker correlations with measures of morphology. Only four WALS features showed a significant association with the Morfessor models: 21A "Exponence of Selected Inflectional Formatives," 23A "Locus of Marking in the Clause," 26A "Prefixing vs. Suffixing in Inflectional Morphology," and 29A "Syncretism in Verbal Person/Number Marking." The effect sizes were also much smaller than those for the BPE models, as shown in Table 4. Just as with the BPE models, the median surprisal for languages with no marking was much lower than the surprisal for other types for Features 21A, 23A, and 29A. For 26A, there was only a significant difference between weakly suffixing languages and strongly prefixing languages, with strongly prefixing languages having a lower median surprisal per verse. As shown in Table 5, corpus-based statistics still showed significant correlations with the surprisal per verse of Morfessor models, but the correlations were moderate compared to those of the BPE models.

Figure 2: Surprisal per verse per segmentation method, including FST segmentation methods. FST+BPE or FST+Morfessor models outperform all other models.

5.3 Surprisal difference and morphological complexity

In order to look into the effect of morphological complexity on the relative strength of a given segmentation method, we conducted correlation studies with the difference between the surprisal per verse for pairs of segmentation methods (the ∆ values as defined in §4.5). We considered only the measures of morphological complexity that were continuous variables (i.e., number of types, TTR, MATTR, and MLW).

As shown in Table 6, all of the corpus-based statistics were highly correlated with the ∆ values. The correlations range from moderate to high using Spearman's ρ (0.50 < ρ < 0.95). Even though the strength of the correlations varied slightly, number of types, TTR, MATTR, and MLW all showed a similar correlation with the difference statistics. They all had a positive correlation with ∆(BPE, char). This indicates that the more morphologically complex a language is, the better it is modeled with character segmentation compared to BPE segmentation. Similarly, there were positive correlations between the morphological measures and ∆(Morfessor, char), suggesting that character segmentation works better than Morfessor in modeling morphologically complex languages. ∆(BPE, Morfessor) also had positive correlations with the complexity measures, meaning that languages with higher morphological complexity tend to record lower surprisal values with Morfessor segmentation than with BPE. While BPE and Morfessor models outperformed character models on average, as shown in §5.2, the positive correlations with ∆(Morfessor, char) and ∆(BPE, char) suggest that character segmentation outperformed BPE and Morfessor segmentation for languages with very rich morphology.

  Difference           Measure   Spearman's ρ
  ∆(BPE, char)         Types     0.95***
  ∆(BPE, char)         TTR       0.92***
  ∆(BPE, char)         MATTR     0.77***
  ∆(BPE, char)         MLW       0.74***
  ∆(Morfessor, char)   Types     0.71***
  ∆(Morfessor, char)   TTR       0.66***
  ∆(Morfessor, char)   MATTR     0.50***
  ∆(Morfessor, char)   MLW       0.53***
  ∆(BPE, Morfessor)    Types     0.86***
  ∆(BPE, Morfessor)    TTR       0.86***
  ∆(BPE, Morfessor)    MATTR     0.80***
  ∆(BPE, Morfessor)    MLW       0.75***

Table 6: Correlation between surprisal differences and morphological complexity measures for character, BPE, and Morfessor models. All p-values < 10⁻¹¹.

These results are supported by Figure 3, where the surprisal per verse for the different segmentation models is plotted against MATTR (the same trend was captured when we plotted the other corpus-based measures). For languages with lower MATTR, BPE and Morfessor perform better than character segmentation. However, for languages with higher MATTR, character and Morfessor models outperform BPE.

Figure 3: Surprisal per verse plotted against Moving-Average TTR for character, BPE, and Morfessor segmentation methods. Lines indicate the regression estimate with 95% confidence intervals.

6 Discussion

Our results show that BPE models' surprisal per verse is highly correlated with a language's morphology, represented by several WALS features and corpus-based measures. Morfessor shows weaker correlations with such measures and records better performance for most of the languages. FST-based models outperform the others when available. In this section, we discuss the implications of these findings in the context of previous work and future research.

6.1 Morphology and surprisal

In accordance with the prior work discussed in §2, we found differences in modeling difficulty between languages. The correlation studies in §5 provide evidence that morphology is a substantial contributing factor to these differences. Six WALS (Dryer and Haspelmath, 2013) morphology features showed an association with the surprisal per verse of BPE language models, and corpus-based statistics like number of types and MATTR showed strong correlations with BPE surprisal, supporting the relationship between modeling difficulty and morphological complexity.

Our conclusion that a language's morphology impacts language modeling difficulty agrees with Cotterell et al. (2018) and Gerz et al. (2018), but is at odds with Mielke et al. (2019). We included languages known for their rich morphology, such as Western Canadian Inuktitut (ikt) and Central Alaskan Yup'ik (esu), which may have increased the variation in morphological complexity in the corpus.
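The rank correlations reported in Tables 5 and 6 require no special tooling: Spearman's ρ is Pearson correlation computed on ranks, and with no ties it reduces to the familiar 1 − 6Σd²/(n(n²−1)). The sketch below uses illustrative values, not our data.

```python
def rank(xs):
    # Average ranks (handles ties); rank 1 = smallest value.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1           # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Pearson correlation of the two rank vectors.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical per-language surprisal differences and MATTR values
delta_bpe_char = [-5.5, 5.2, 22.5, 16.2, -4.6]
mattr = [0.61, 0.68, 0.81, 0.77, 0.55]
rho = spearman_rho(delta_bpe_char, mattr)  # 0.9 for this toy data
```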
We also augmented the WALS data by consulting reference grammars, which allowed us to consider 11 more morphological WALS features than Mielke et al. (2019). We found that the one morphological feature Mielke et al. (2019) considered, 26A "Prefixing vs. Suffixing in Inflectional Morphology," indeed showed no correlation with BPE surprisal. However, our results show that there are aspects of morphology that affect surprisal that were not considered before.

Previous work, such as Gerz et al. (2018), focused only on aspects of morphology that they believed a priori would predict language model performance. In contrast, our study tested all of the morphological features listed in WALS, and tested each of them individually. We found that two of the four features in Gerz et al. (2018), 20A "Fusion of Selected Inflectional Formatives" and 22A "Inflectional Synthesis of the Verb," showed no association with language model performance. Additionally, we found several features that did affect language modeling performance, specifically locus of marking and syncretism, which were not mentioned in the literature. These results show that the features tied to morphological complexity in previous work are not necessarily the same features that affect language modeling.

In addition to these differences in results, our interpretation of corpus-based statistics like TTR also diverges from previous work. While Mielke et al. (2019) reported high correlations between language model performance and such statistics, they considered them only as simple statistics of the data. In fact, our results replicate Mielke et al. (2019) in that the number of types was the most predictive of BPE language model surprisal among all the variables considered. However, we argue that corpus-based statistics can be used as an approximate measure of morphological complexity, based on previous studies: these corpus-based measures are reported to capture the overall ranking of morphological complexity (Kettunen, 2014; Bentz et al., 2016) and can be interpreted in relation to morphological typology (Gerz et al., 2018). We also believe our results indicate that TTR and the WALS features capture similar information. For example, the positive correlation of ∆(BPE, Morfessor) with the corpus-based measures corresponds to the smaller effect sizes of WALS features found for Morfessor compared to BPE. This indicates a lesser effect of rich morphology on Morfessor models than on BPE models.

6.2 Segmentation methods

While the primary goal of this work is to analyze the relation of a language's morphology to language modeling performance, we found this to be entangled with the level and method of segmentation. Our results show that there is significant variation in the effectiveness of segmentation methods cross-linguistically, and they suggest challenges to the status quo methods of subword segmentation in particular. While the subword segmentation methods we used generally outperformed character-level segmentation, the higher a language's TTR, the smaller the difference in surprisal for both BPE and Morfessor, suggesting that these methods are less effective at segmenting languages with highly complex morphology. Of pre-existing methods, we found Morfessor to have the lowest surprisal per verse for most of the languages considered. Morfessor's weaker correlations with WALS features and other measures like TTR suggest that its better performance may be due to a better ability to model languages with a wider range of morphological attributes.
This is in line with Bostrom and Durrett (2020), who showed that Unigram LM (Kudo, 2018), a segmentation algorithm similar to Morfessor, often outperforms BPE and produces more morph-like segmentations in the context of language model pretraining in English and Japanese.

However, Morfessor was significantly outperformed by character segmentation for a small subset of languages (amh, arz, ayr, cmn, esu, heb, ike, ikt, kal, quh, tel, xho; BPE outperformed Morfessor for cmn and heb). Many of these languages have been classified as polysynthetic, suggesting that perhaps Morfessor is ill-suited for such languages (see Klavans, 2018; Tyers and Mishchenkova, 2020; Mager et al., 2018, for discussions of the challenges polysynthetic languages pose for NLP tasks).

Additionally, for a typologically diverse subset of languages for which we could obtain FST morphological segmenters, we considered novel segmentation methods: FST+BPE and FST+Morfessor. We found that this simple extension of BPE and Morfessor with morphological information achieved the lowest surprisal per verse in all available languages. The overall success of combining statistical segmentation with FSTs further confirms the impact of morphology on language modeling and yields significant promise for the use of segmentation based on linguistic morphological information.

Acknowledgments

This paper builds on our prior work for the 2019 Sixth Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology (JSALT 2019) (Schwartz et al., 2020). We thank the organizers of the workshop and the members of our workshop team on Neural Polysynthetic Language Modeling for inspiring us to pursue this research direction. Our special thanks to Rebecca Knowles, Christo Kirov, Lori Levin, Chi-kiu (Jackie) Lo, and the TACL reviewers and editors for their feedback on our manuscript. We thank Ata Tuncer for his assistance with Turkish segmentation. This work utilizes resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign.

7 Conclusion

A language's morphology is strongly associated with language modeling surprisal for BPE-segmented language models. BPE model surprisal is associated with 6 of the 12 studied WALS morphology features, indicating that there are aspects of some languages' morphology that BPE does not help mitigate. Strong correlations with corpus-based measures of morphology such as TTR further suggest that the more types available in a language (often by means of rich morphology), the harder it is to model based on BPE units. Morfessor, which was designed with morpheme induction in mind, performs better for most languages and shows less association with morphological features. When available, the linguistically-informed method of FST-augmented BPE or Morfessor segmentation performs best, indicating further promise for using linguistic knowledge to combat the effects of morphology on language model surprisal.

These conclusions were only possible through manual augmentation of typological databases and expansion of the studied languages. Future efforts could adopt our approach for other areas of language. Using linguistically-informed resources across many languages is an avenue for improving neural models in NLP in both design and analysis.

References

Antti Arppe, Atticus Harrigan, Katherine Schmirler, Lene Antonsen, Trond Trosterud, Sjur Nørstebø Moshagen, Miikka Silfverberg, Arok Wolvengrey, Conor Snoek, Jordan Lachler, Eddie Antonio Santos, Jean Okimāsis, and Dorothy Thunder. 2014–2019. Finite-state transducer-based computational model of Plains Cree morphology.

Eric Axelson, Sam Hardwick, Krister Lindén, Kimmo Koskenniemi, Flammie Pirinen, Miikka Silfverberg, and Senka Drobac. 2015. Helsinki finite-state technology resources.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. 2016. A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153, Osaka, Japan. The COLING 2016 Organizing Committee.

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. CoRR, cs.CL/2004.03720v1.

Çağrı Çöltekin. 2010. A freely available morphological analyzer for Turkish. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Çağrı Çöltekin. 2014. A set of open source tools for Turkish natural language processing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Christos Christodoulopoulos and Mark Steedman. 2014. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49:1–21.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):3:1–3:34.

Mathieu Dehouck and Pascal Denis. 2018. A framework for understanding the role of morphology in universal dependency parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2864–2870, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Kimmo Kettunen. 2014. Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics, 21(3):223–245.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Judith L. Klavans. 2018. Computational challenges for polysynthetic languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 1–11, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. 2011. Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. In Cerstin Mahlow and Michael Piotrowski, editors, Systems and Frameworks for Computational Morphology, pages 119–129. Springer Berlin Heidelberg, Berlin, Heidelberg.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, cs.CL/1907.11692v1.

Manuel Mager, Elisabeth Mager, Alfonso Medina-Urrea, Ivan Vladimir Meza Ruiz, and Katharina Kann. 2018. Lost in translation: Analysis of information loss during machine translation between polysynthetic and fusional languages. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 73–83, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3158–3163, Reykjavik, Iceland. European Language Resources Association (ELRA).

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. CoRR, cs.CL/1803.08240v1.

Sabrina J. Mielke. 2016. Language diversity in ACL 2004–2016.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Sabrina J. Mielke and Jason Eisner. 2019. Spell once, summon anywhere: A two-level open-vocabulary language model. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6843–6850.

Tommi A. Pirinen. 2015. Omorfi — free and open source morphological lexical database for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 313–315, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity, Paris, France. Surrey Morphology Group.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), pages 1263–1266, Lisbon, Portugal. European Language Resources Association (ELRA).

Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud'hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, and Zhisong Zhang. 2020. Neural polysynthetic language modelling. CoRR, cs.CL/2005.05477v2.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical report, Department of Informatics, Kyushu University.

Maciej Tomczak and Ewa Tomczak. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. Trends in Sport Sciences, 1(21):19–25.

Francis Tyers and Karina Mishchenkova. 2020. Dependency annotation of noun incorporation in polysynthetic languages. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 195–204, Barcelona, Spain (Online). Association for Computational Linguistics.

Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2016–2027, Vancouver, Canada. Association for Computational Linguistics.

Hugo David Calderon Vilca, Flor Cagniy Cárdenas Mariñó, and Edwin Fredy Mamani Calderon. 2012. Analizador morfológico de la lengua Quechua basado en software libre Helsinki finite-state transducer (HFST).

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor baseline. Technical report, Aalto University; Aalto-yliopisto.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Hanna Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.

A Data

We began with the data used in Mielke et al. (2019). This was originally a subset of a Bible corpus (Mayer and Cysouw, 2014) which is no longer publicly available. We excluded constructed languages (epo, tlh) from the data, keeping a total of 104 verse-aligned Bibles in 60 languages [9] in 12 language families. To increase the number of languages and language families represented, we added 41 Bibles in 32 languages to the data. 13 Bible translations in 13 languages [10] were sourced from Christodoulopoulos and Steedman (2014). In addition, we included 28 Bible translations in 21 languages scraped from various online sources. Two of the scraped Bibles were in Spanish (spa) and Telugu (tel), languages which were already included in the Bible corpora (Mayer and Cysouw, 2014; Christodoulopoulos and Steedman, 2014). These translations were included because the new Spanish Bible was a parallel source for the Paraguayan Guaraní (gug) translation, and the Telugu Bible obtained from Mielke et al. (2019) was originally mislabeled as Tecpatlán Totonac (tcw). The Central Alaskan Yup'ik (esu) Bible was from https://bibles.org. 26 Bibles in 19 languages [11] were from http://bible.com. The Greenlandic (kal) Bible was obtained from http://old.bibelselskabet.dk.

[9] afr, aln, arb, arz, ayr, bba, ben, bqc, bul, cac, cak, ceb, ces, cmn, cnh, cym, dan, deu, ell, eng, fin, fra, guj, gur, hat, hrv, hun, ind, ita, kek, kjb, lat, lit, mah, mam, mri, mya, nld, nor, plt, poh, por, qub, quh, quy, quz, ron, rus, som, tbz, tel, tgl, tpi, tpm, ukr, vie, wal, wbm, xho, zom
[10] als, amh, dje, heb, isl, jpn, kor, pck, slk, slv, spa, swe, tha
[11] crk, gug, gui, hin, ike, ikt, kan, mal, mar, nch, nep, nhe, pes, pol, sna, spa, tel, tob, tur
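Verse-level alignment of the kind described above reduces, in the simplest case, to intersecting verse identifiers across translations, so that every model is trained and evaluated on the same set of verses. A minimal sketch (the verse-ID format here is hypothetical):

```python
def align_verses(bibles):
    # bibles: dict mapping a language code to {verse_id: verse_text}.
    # Keep only verse IDs present in every translation.
    shared = set.intersection(*(set(b) for b in bibles.values()))
    return {lang: {v: text[v] for v in sorted(shared)}
            for lang, text in bibles.items()}
```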