Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages
Ehsaneddin Asgari¹,² and Hinrich Schütze¹
¹Center for Information and Language Processing, LMU Munich, Germany
²Applied Science and Technology, University of California, Berkeley, CA, USA, [email protected]
arXiv:1704.08914v2 [cs.CL] 14 Sep 2017

Abstract

We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting, to the best of our knowledge, the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: We only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

1 Introduction

Significant linguistic resources such as machine-readable lexicons and part-of-speech (POS) taggers are available for at most a few hundred languages. This means that the majority of the languages of the world are low-resource. Low-resource languages like Fulani are spoken by tens of millions of people and are politically and economically important; e.g., to manage a sudden refugee crisis, NLP tools would be of great benefit. Even "small" languages are important for the preservation of the common heritage of humankind that includes natural remedies and linguistic and cultural diversity that can potentially enrich everybody. Thus, developing analysis methods for low-resource languages is one of the most important challenges of NLP today.

We address this challenge by proposing a new method for analyzing what we call superparallel corpora, corpora that are by an order of magnitude more parallel than corpora that have been available in NLP to date. The corpus we work with in this paper is the Parallel Bible Corpus (PBC) that consists of translations of the New Testament in 1169 languages. Given that no NLP analysis tools are available for most of these 1169 languages, how can we extract the rich information that is potentially hidden in such superparallel corpora?

The method we propose is based on two hypotheses.

H1 Existence of overt encoding. For any important linguistic distinction f that is frequently encoded across languages in the world, there are a few languages that encode f overtly on the surface.

H2 Overt-to-overt and overt-to-non-overt projection. For a language l that encodes f, a projection of f from the "overt languages" to l in the superparallel corpus will identify the encoding that l uses for f, both in cases in which the encoding that l uses is overt and in cases in which the encoding that l uses is non-overt.

Based on these two hypotheses, our method proceeds in 5 steps.

1. Selection of a linguistic feature. We select a linguistic feature f of interest.

Running example: We select past tense as feature f.

2. Heuristic search for head pivot. Through a heuristic search, we find a language l_h that contains a head pivot p_h that is highly correlated with the linguistic feature of interest.

Running example: "ti" in Seychelles Creole (CRS). CRS "ti" meets our requirements for a head pivot well, as will be verified empirically in §3. First, "ti" is a surface marker: it is easily identifiable through whitespace tokenization and it is not ambiguous, e.g., it does not have a second meaning apart from being a grammatical marker. Second, "ti" is a good marker for past tense in terms of both "precision" and "recall". CRS has mandatory past tense marking (as opposed to languages in which tense marking is facultative) and "ti" is highly correlated with the general notion of past tense.

This does not mean that every clause that a linguist would regard as past tense is marked with "ti" in CRS. For example, some tense-aspect configurations that are similar to English present perfect are marked with "in" in CRS, not with "ti" (e.g., ENG "has commanded" is translated as "in ordonn").

Our goal is not to find a head language and a head pivot that is a perfect marker of f. Such a head pivot probably does not exist; or, more precisely, linguistic features are not completely rigorously defined. In a sense, one of the significant contributions of this work is that we provide more rigorous definitions of past tense across languages; e.g., "ti" in CRS is one such rigorous definition of past tense and it automatically extends (through projection) to 1000 languages in the superparallel corpus.

3. Projection of head pivot to larger pivot set. Based on an alignment of the head language to the other languages in the superparallel corpus, we project the head pivot to all other languages and search for highly correlated surface markers, i.e., we search for additional pivots in other languages. This projection to more pivots achieves three goals. First, it makes the method more robust. Relying on a single pivot would result in many errors due to the inherent noisiness of linguistic data and because several components we use (e.g., alignment of the languages in the superparallel corpus) are imperfect. Second, as we discussed above, the head pivot does not necessarily have high "recall"; our example was that CRS "ti" is not applied to certain clauses that would be translated using present perfect in English. Thus, moving to a larger pivot set increases recall. Third, as we will see below, the pivot set can be leveraged to create a fine-grained map of the linguistic feature. Consider clauses referring to eventualities in the past that English speakers would render in past progressive, present perfect and simple past tense. Our hope is that the pivot set will cover these distinctions, i.e., one of the pivots marks past progressive, but not present perfect and simple past, another pivot marks present perfect, but not the other two, and so on. It is beyond the scope of this paper to verify that we can produce such an analysis for all linguistic features, but a promising example of this type of map, including distinctions like progressive and perfective aspect, is given in §4.

Running example: We compute the correlation of "ti" with words in other languages and select the 100 highest correlated words as pivots. Examples of pivots we find this way are Torres Strait Creole "bin" (from English "been") and Tzotzil "laj". "laj" is a perfective marker, e.g., "Laj meltzaj -uk" 'LAJ be-made subj' means "It's done being built" (Aissen, 1987).

4. Projection of pivot set to all languages. Now that we have a large pivot set, we project the pivots to all other languages to search for linguistic devices that express the linguistic feature f. Up to this point, we have made the assumption that it is easy to segment text in all languages into pieces of a size that is not too small (individual characters of the Latin alphabet would be too small) and not too large (entire sentences as tokens would be too large). Segmentation on standard delimiters is a good approximation for the majority of languages, but not for all: it undersegments some (e.g., the polysynthetic language Inuit) and oversegments others (e.g., languages that use punctuation marks as regular characters).

For this reason, we do not employ tokenization in this step. Rather, we search for character n-grams (2 ≤ n ≤ 6) to find linguistic devices that express f. This implementation of the search procedure is a limitation: there are many linguistic devices that cannot be found using it, e.g., templates in templatic morphology. We leave addressing this for future work (§7).

Running example: We find "-ed" for English and "-te" for German as surface features that are highly correlated with the 100 past tense pivots.

5. Linguistic analysis. The result of the previous steps is a superparallel corpus that is richly annotated with information about linguistic feature f. This structure can be exploited for the analysis of a single language l_i that may be the focus of a linguistic investigation. Starting with the character n-grams that were found in the step "projection of pivot set to all languages", we can explore their use and function, e.g., for the mined n-gram "-ed" in English (assuming English is the language l_i and it is unfamiliar to us). Many of the other 1000 languages provide annotations of linguistic feature f for l_i: both the languages that are part of the pivot set (e.g., Tzotzil "laj") and the mined n-grams in other languages that we may have some knowledge of (e.g., "-te" in German).

We can also use the structure we have generated for typological analysis across languages following the work of Michael Cysouw. He [...] parallel corpora by overcoming several limiting factors. The most important is that Cysouw's method is only applicable if markers of the relevant linguistic feature are recognizable on the surface in all languages. In contrast, we only assume that markers of the relevant linguistic feature are recognizable on the surface in a small number of languages.
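
To make the projection in step 4 concrete, the following is a minimal sketch of how character n-grams (2 ≤ n ≤ 6) in a target language could be ranked by their correlation with a pivot over aligned verses. It rests on assumptions that are not taken from the paper: the phi coefficient stands in for whatever correlation measure the authors use, a single pivot replaces the 100-pivot set, alignment is reduced to verse index, and the function names (`char_ngrams`, `phi`, `rank_ngrams`) and the four-verse toy corpus are hypothetical, with invented placeholder strings for the head language.

```python
from collections import defaultdict
from math import sqrt


def char_ngrams(text, n_min=2, n_max=6):
    """Return the set of character n-grams of text with n_min <= n <= n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def phi(x, y):
    """Phi coefficient of two equal-length boolean sequences (assumed measure)."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def rank_ngrams(pivot_flags, verses, top_k=10):
    """Rank the n-grams of a target language by correlation with the pivot."""
    occurrences = defaultdict(set)  # n-gram -> indices of verses containing it
    for idx, verse in enumerate(verses):
        for gram in char_ngrams(verse):
            occurrences[gram].add(idx)
    scored = []
    for gram, idxs in occurrences.items():
        flags = [i in idxs for i in range(len(verses))]
        scored.append((phi(pivot_flags, flags), gram))
    # Highest correlation first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy data (invented for illustration, not taken from the PBC).
# Verse i of each language is assumed to translate the same content.
head_verses = ["x ti y", "x y", "z ti w", "z w"]  # placeholder head language
english_verses = ["he walked home", "he walks home", "they talked", "they talk"]

# The head pivot is found by whitespace tokenization, as in step 2.
pivot_flags = ["ti" in verse.split() for verse in head_verses]
top = rank_ngrams(pivot_flags, english_verses)
```

On this toy corpus, "ed" is among the n-grams that co-occur perfectly with the pivot, together with longer overlapping strings such as "ked" and "alked". This illustrates why a real run would need much more data, and probably some filtering of nested n-grams, before a marker like "-ed" stands out on its own.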