Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages
Ehsaneddin Asgari (1,2) and Hinrich Schütze (1)
(1) Center for Information and Language Processing, LMU Munich, Germany
(2) Applied Science and Technology, University of California, Berkeley, CA, USA

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 113–124, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Abstract

We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting, to the best of our knowledge, the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: We only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

1 Introduction

Significant linguistic resources such as machine-readable lexicons and part-of-speech (POS) taggers are available for at most a few hundred languages. This means that the majority of the languages of the world are low-resource. Low-resource languages like Fulani are spoken by tens of millions of people and are politically and economically important; e.g., to manage a sudden refugee crisis, NLP tools would be of great benefit. Even "small" languages are important for the preservation of the common heritage of humankind that includes natural remedies and linguistic and cultural diversity that can potentially enrich everybody. Thus, developing analysis methods for low-resource languages is one of the most important challenges of NLP today.

We address this challenge by proposing a new method for analyzing what we call superparallel corpora, corpora that are by an order of magnitude more parallel than corpora that have been available in NLP to date. The corpus we work with in this paper is the Parallel Bible Corpus (PBC) that consists of translations of the New Testament in 1169 languages. Given that no NLP analysis tools are available for most of these 1169 languages, how can we extract the rich information that is potentially hidden in such superparallel corpora?

The method we propose is based on two hypotheses.

H1 Existence of overt encoding. For any important linguistic distinction f that is frequently encoded across languages in the world, there are a few languages that encode f overtly on the surface.

H2 Overt-to-overt and overt-to-non-overt projection. For a language l that encodes f, a projection of f from the "overt languages" to l in the superparallel corpus will identify the encoding that l uses for f, both in cases in which the encoding that l uses is overt and in cases in which the encoding that l uses is non-overt.

Based on these two hypotheses, our method proceeds in 5 steps.

1. Selection of a linguistic feature. We select a linguistic feature f of interest.

Running example: We select past tense as feature f.

2. Heuristic search for head pivot. Through a heuristic search, we find a language l_h that contains a head pivot p_h that is highly correlated with the linguistic feature of interest.

Running example: "ti" in Seychelles Creole (CRS). CRS "ti" meets our requirements for a head pivot well as will be verified empirically in §3. First, "ti" is a surface marker: it is easily identifiable through whitespace tokenization and it is not ambiguous, e.g., it does not have a second meaning apart from being a grammatical marker. Second, "ti" is a good marker for past tense in terms of both "precision" and "recall". CRS has mandatory past tense marking (as opposed to languages in which tense marking is facultative) and "ti" is highly correlated with the general notion of past tense.

This does not mean that every clause that a linguist would regard as past tense is marked with "ti" in CRS. For example, some tense-aspect configurations that are similar to English present perfect are marked with "in" in CRS, not with "ti" (e.g., ENG "has commanded" is translated as "in ordonn").

Our goal is not to find a head language and a head pivot that is a perfect marker of f. Such a head pivot probably does not exist; or, more precisely, linguistic features are not completely rigorously defined. In a sense, one of the contributions of this work is that we provide more rigorous definitions of past tense across languages; e.g., "ti" in CRS is one such rigorous definition of past tense and it automatically extends (through projection) to 1000 languages in the superparallel corpus.

3. Projection of head pivot to larger pivot set. Based on an alignment of the head language to the other languages in the superparallel corpus, we project the head pivot to all other languages and search for highly correlated surface markers, i.e., we search for additional pivots in other languages. This projection to more pivots achieves three goals. First, it makes the method more robust. Relying on a single pivot would result in many errors due to the inherent noisiness of linguistic data and because several components we use (e.g., alignment of the languages in the superparallel corpus) are imperfect. Second, as we discussed above, the head pivot does not necessarily have high "recall"; our example was that CRS "ti" is not applied to certain clauses that would be translated using present perfect in English. Thus, moving to a larger pivot set increases recall. Third, as we will see below, the pivot set can be leveraged to create a fine-grained map of the linguistic feature. Consider clauses referring to eventualities in the past that English speakers would render in past progressive, present perfect and simple past tense. Our hope is that the pivot set will cover these distinctions, i.e., one of the pivots marks past progressive, but not present perfect and simple past, another pivot marks present perfect, but not the other two and so on. An example of this type of map, including distinctions like progressive and perfective aspect, is given in §4.

Running example: We compute the correlation of "ti" with words in other languages and select the 100 most highly correlated words as pivots. Examples of pivots we find this way are Torres Strait Creole "bin" (from English "been") and Tzotzil "laj". "laj" is a perfective marker, e.g., "Laj meltzaj -uk" 'LAJ be-made subj' means "It's done being built" (Aissen, 1987).

4. Projection of pivot set to all languages. Now that we have a large pivot set, we project the pivots to all other languages to search for linguistic devices that express the linguistic feature f. Up to this point, we have made the assumption that it is easy to segment text in all languages into pieces of a size that is not too small (individual characters of the Latin alphabet would be too small) and not too large (entire sentences as tokens would be too large). Segmentation on standard delimiters is a good approximation for the majority of languages, but not for all: it undersegments some (e.g., the polysynthetic language Inuit) and oversegments others (e.g., languages that use punctuation marks as regular characters).

For this reason, we do not employ tokenization in this step. Rather we search for character n-grams (2 ≤ n ≤ 6) to find linguistic devices that express f. This implementation of the search procedure is a limitation: there are many linguistic devices that cannot be found using it, e.g., templates in templatic morphology. We leave addressing this for future work (§7).

Running example: We find "-ed" for English and "-te" for German as surface features that are highly correlated with the 100 past tense pivots.

5. Linguistic analysis. The result of the previous steps is a superparallel corpus that is richly annotated with information about linguistic feature f. This structure can be exploited for the analysis of a single language l_i that may be the focus of a linguistic investigation. Starting with the character n-grams that were found in the step "projection of pivot set to all languages", we can explore their use and function, e.g., for the mined n-gram "-ed" in English (assuming English is the language l_i and it is unfamiliar to us). Many of the other 1000 languages provide annotations of linguistic feature f for l_i: both the languages that are part of the pivot set (e.g., Tzotzil "laj") and the mined n-grams in other languages that we may have some knowledge about (e.g., "-te" in German).

We can also use the structure we have generated for typological analysis across languages following the work of Michael Cysouw (Cysouw, 2014; §5). Our method is an advancement computationally over Cysouw's work because our

2 SuperPivot: Description of method

1. Selection of a linguistic feature. The linguistic feature of interest f is selected by the person who performs a SuperPivot analysis, i.e., by a linguist, NLP researcher or data scientist.
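To make the projection idea from the introduction (steps 3 and 4) concrete, the following is a minimal sketch, not the authors' implementation. It assumes a verse-aligned corpus stored as dictionaries keyed by verse id, uses the phi coefficient of a 2x2 co-occurrence table as a stand-in association score (the statistic actually used is not stated in this section), and collapses the intermediate pivot set by scoring character n-grams directly against the verses marked by a single head pivot. All function names and the toy strings are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def phi(marked, candidate, n_verses):
    """Phi coefficient of two verse-id sets over n_verses aligned verses."""
    a = len(marked & candidate)   # verses containing both units
    b = len(candidate - marked)   # candidate unit only
    c = len(marked - candidate)   # pivot only
    d = n_verses - a - b - c      # neither
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return 0.0 if denom == 0 else (a * d - b * c) / denom


def words(text):
    """Whitespace tokens, as used to locate the head pivot."""
    return set(text.split())


def char_ngrams(text, n_min=2, n_max=6):
    """All character n-grams with n_min <= n <= n_max (no tokenization)."""
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}


def score_units(marked, corpus, units_of):
    """Map each surface unit of `corpus` to its phi score against `marked`."""
    index = defaultdict(set)
    for verse_id, text in corpus.items():
        for unit in units_of(text):
            index[unit].add(verse_id)
    return {u: phi(marked, ids, len(corpus)) for u, ids in index.items()}


# Invented toy strings keyed by verse id; NOT real corpus data.
head = {1: "li ti ale", 2: "li ale", 3: "zot ti manze", 4: "zot manze"}
target = {1: "he walked", 2: "he walks", 3: "they talked", 4: "they talk"}

# Verses in which the head pivot occurs as a whitespace token.
marked = {v for v, text in head.items() if "ti" in words(text)}

# Rank target-language character n-grams by association with the pivot.
ngram_scores = score_units(marked, target, char_ngrams)
best = sorted(ngram_scores, key=lambda u: (-ngram_scores[u], u))[:5]
print(best)
```

Passing `words` instead of `char_ngrams` to `score_units` gives the word-level variant that corresponds to collecting additional pivots in other languages. On a corpus this small many n-grams tie at the top score, which is one reason a larger pivot set and a larger corpus matter in practice.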