Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages

Ehsaneddin Asgari¹,² and Hinrich Schütze¹

¹Center for Information and Language Processing, LMU Munich, Germany
²Applied Science and Technology, University of California, Berkeley, CA, USA
[email protected] [email protected]

Abstract

We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: we only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 113–124, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

1 Introduction

Significant linguistic resources such as machine-readable dictionaries and part-of-speech (POS) taggers are available for at most a few hundred languages. This means that the majority of the languages of the world are low-resource. Low-resource languages like Fulani are spoken by tens of millions of people and are politically and economically important; e.g., to manage a sudden refugee crisis, NLP tools would be of great benefit. Even “small” languages are important for the preservation of the common heritage of humankind that includes natural remedies and linguistic and cultural diversity that can potentially enrich everybody. Thus, developing analysis methods for low-resource languages is one of the most important challenges of NLP today.

We address this challenge by proposing a new method for analyzing what we call superparallel corpora, corpora that are by an order of magnitude more parallel than corpora that have been available in NLP to date. The corpus we work with in this paper is the Parallel Bible Corpus (PBC) that consists of translations of the New Testament in 1169 languages. Given that no NLP analysis tools are available for most of these 1169 languages, how can we extract the rich information that is potentially hidden in such superparallel corpora?

The method we propose is based on two hypotheses.

H1 Existence of overt encoding. For any important linguistic distinction f that is frequently encoded across languages in the world, there are a few languages that encode f overtly on the surface.

H2 Overt-to-overt and overt-to-non-overt projection. For a language l that encodes f, a projection of f from the “overt languages” to l in the superparallel corpus will identify the encoding that l uses for f, both in cases in which the encoding that l uses is overt and in cases in which the encoding that l uses is non-overt.

Based on these two hypotheses, our method proceeds in 5 steps.

1. Selection of a linguistic feature. We select a linguistic feature f of interest. Running example: We select past tense as feature f.

2. Heuristic search for head pivot. Through a heuristic search, we find a language lh that contains a head pivot ph that is highly correlated with the linguistic feature of interest.

Running example: “ti” in Seychelles Creole (CRS). CRS “ti” meets our requirements for a head pivot well, as will be verified empirically in §3. First, “ti” is a surface marker: it is easily identifiable through whitespace tokenization and it is not ambiguous, e.g., it does not have a second meaning apart from being a grammatical marker. Second, “ti” is a good marker for past tense in terms of both “precision” and “recall”. CRS has mandatory past tense marking (as opposed to languages in which tense marking is facultative) and “ti” is highly correlated with the general notion of past tense.

This does not mean that every clause that a linguist would regard as past tense is marked with “ti” in CRS. For example, some tense-aspect configurations that are similar to English present perfect are marked with “in” in CRS, not with “ti” (e.g., ENG “has commanded” is translated as “in ordonn”).

Our goal is not to find a head language and a head pivot that is a perfect marker of f. Such a head pivot probably does not exist; or, more precisely, linguistic features are not completely rigorously defined. In a sense, one of the contributions of this work is that we provide more rigorous definitions of past tense across languages; e.g., “ti” in CRS is one such rigorous definition of past tense and it automatically extends (through projection) to 1000 languages in the superparallel corpus.

3. Projection of head pivot to larger pivot set. Based on an alignment of the head language to the other languages in the superparallel corpus, we project the head pivot to all other languages and search for highly correlated surface markers, i.e., we search for additional pivots in other languages. This projection to more pivots achieves three goals. First, it makes the method more robust. Relying on a single pivot would result in many errors due to the inherent noisiness of linguistic data and because several components we use (e.g., alignment of the languages in the superparallel corpus) are imperfect. Second, as we discussed above, the head pivot does not necessarily have high “recall”; our example was that CRS “ti” is not applied to certain clauses that would be translated using present perfect in English. Thus, moving to a larger pivot set increases recall. Third, as we will see below, the pivot set can be leveraged to create a fine-grained map of the linguistic feature. Consider clauses referring to eventualities in the past that English speakers would render in past progressive, present perfect and simple past tense. Our hope is that the pivot set will cover these distinctions, i.e., one of the pivots marks past progressive, but not present perfect and simple past, another pivot marks present perfect, but not the other two and so on. An example of this type of map, including distinctions like progressive and perfective aspect, is given in §4.

Running example: We compute the correlation of “ti” with words in other languages and select the 100 most highly correlated words as pivots. Examples of pivots we find this way are Torres Strait Creole “bin” (from English “been”) and Tzotzil “laj”. “laj” is a perfective marker, e.g., “Laj meltzaj -uk” ‘LAJ be-made subj’ means “It’s done being built” (Aissen, 1987).

4. Projection of pivot set to all languages. Now that we have a large pivot set, we project the pivots to all other languages to search for linguistic devices that express the linguistic feature f. Up to this point, we have made the assumption that it is easy to segment text in all languages into pieces of a size that is not too small (individual characters of the Latin alphabet would be too small) and not too large (entire sentences as tokens would be too large). Segmentation on standard delimiters is a good approximation for the majority of languages – but not for all: it undersegments some (e.g., Inuit) and oversegments others (e.g., languages that use punctuation marks as regular characters).

For this reason, we do not employ tokenization in this step. Rather we search for character n-grams (2 ≤ n ≤ 6) to find linguistic devices that express f. This implementation of the search procedure is a limitation – there are many linguistic devices that cannot be found using it, e.g., templates in templatic morphology. We leave addressing this for future work (§7).

Running example: We find “-ed” for English and “-te” for German as surface features that are highly correlated with the 100 past tense pivots.

5. Linguistic analysis. The result of the previous steps is a superparallel corpus that is richly annotated with information about linguistic feature f. This structure can be exploited for the analysis of a single language li that may be the focus of a linguistic investigation. Starting with the character n-grams that were found in the step “projection of pivot set to all languages”, we can explore their use and function, e.g., for the mined n-gram “-ed” in English (assuming English is the language li and it is unfamiliar to us). Many of the other 1000 languages provide annotations of linguistic feature f for li: both the languages that are part of the pivot set (e.g., Tzotzil “laj”) and the mined n-grams in other languages that we may have some knowledge about (e.g., “-te” in German).
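The projection idea behind H1 and H2 can be illustrated with a toy example. The sketch below is ours, not the authors' implementation: it uses a synthetic four-verse "corpus" and a deliberately naive count-difference score (the paper itself uses a χ² score, described in §2) to recover the target-language token most associated with an overt source-language marker.

```python
from collections import Counter

def project_marker(verses_overt, verses_target, marker):
    """Toy overt-to-overt projection (H2): rank tokens of the target
    language by how much more often they occur in verses whose
    source-language translation contains the overt marker."""
    with_marker = Counter()
    without_marker = Counter()
    for src, tgt in zip(verses_overt, verses_target):
        counter = with_marker if marker in src.split() else without_marker
        counter.update(set(tgt.split()))
    scores = {tok: with_marker[tok] - without_marker[tok]
              for tok in with_marker}
    return max(scores, key=scores.get)

# Synthetic "superparallel" fragment: language A marks past tense with
# the particle "ti", language B with the particle "bin" (both invented
# verse texts, loosely modeled on the Creole examples in the paper).
lang_a = ["mwan ti ale", "mwan ale", "zot ti manze", "zot manze"]
lang_b = ["i bin go", "i go", "dem bin kaikai", "dem kaikai"]
print(project_marker(lang_a, lang_b, "ti"))  # → bin
```

On real data the count difference would be replaced by a proper association statistic, but the mechanics – verse-aligned co-occurrence between an overt marker and candidate tokens – are the same.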

We can also use the structure we have generated for typological analysis across languages following the work of Michael Cysouw ((Cysouw, 2014), §5). Our method is an advancement computationally over Cysouw’s work because our method scales to thousands of languages as we demonstrate below.

Running example: We sketch the type of analysis that our new method makes possible in §4.

The above steps “1. heuristic search for head pivot” and “2. projection of head pivot to larger pivot set” are based on H1: we assume the existence of overt coding in a subset of languages. The above steps “2. projection of head pivot to larger pivot set” and “3. projection of pivot set to all languages” are based on H2: we assume that overt-to-overt and overt-to-non-overt projection is possible.

In the rest of the paper, we will refer to the method that consists of steps 1 to 5 as SuperPivot: “linguistic analysis of SUPERparallel corpora using surface PIVOTs”.

We make three contributions. (i) Our basic hypotheses are H1 and H2. (H1) For an important linguistic feature, there exist a few languages that mark it overtly and easily recognizably. (H2) It is possible to project overt markers to overt and non-overt markers in other languages. Based on these two hypotheses we design SuperPivot, a new method for analyzing highly parallel corpora, and show that it performs well for the crosslingual analysis of the linguistic phenomenon of tense. (ii) Given a superparallel corpus, SuperPivot can be used for the analysis of any low-resource language represented in that corpus. In the supplementary material, we present results of our analysis for three tenses (past, present, future) for 1163 languages.¹ An evaluation of accuracy is presented in Table 2. (iii) We extend Michael Cysouw’s method of typological analysis using parallel corpora by overcoming several limiting factors. The most important is that Cysouw’s method is only applicable if markers of the relevant linguistic feature are recognizable on the surface in all languages. In contrast, we only assume that markers of the relevant linguistic feature are recognizable on the surface in a small number of languages.

2 SuperPivot: Description of method

1. Selection of a linguistic feature. The linguistic feature of interest f is selected by the person who performs a SuperPivot analysis, i.e., by a linguist, NLP researcher or data scientist. Henceforth, we will refer to this person as the linguist. In this paper, f ∈ F = {past, present, future}.

2. Heuristic search for head pivot. There are several ways for finding the head language and the head pivot. Perhaps the linguist knows a language that has a good head pivot. Or she is a trained typologist and can find the head pivot by consulting the typological literature.

In this paper, we use our knowledge of English and an alignment from English to all other languages to find head pivots. (See below for details on alignment.) We define a “query” in English and search for words that are highly correlated to the query in other languages. For future tense, the query is simply the word “will”, so we search for words in other languages that are highly correlated with “will”. For present tense, the query is the union of “is”, “are” and “am”. So we search for words in other languages that are highly correlated with the “merger” of these three words. For past tense, we POS tag the English part of PBC and merge all words tagged as past tense into one past tense word.² We then search for words in other languages that are highly correlated with this artificial past tense word.

As an additional constraint, we do not select the most highly correlated word as the head pivot, but the most highly correlated word in a Creole language. Our rationale is that Creole languages are more regular than other languages because they are young and have not accumulated “historical baggage” that may make computational analysis more difficult.

Table 1 lists the three head pivots for F.

3. Projection of head pivot to larger pivot set. We first use fast_align (Dyer et al., 2013) to align the head language to all other languages in the corpus. This alignment is on the word level. We compute a score for each word in each language based on the number of times it is aligned to the head pivot, the number of times it is aligned to another word and the total frequencies of head pivot and word. We use χ² (Casella and Berger, 2008) as the score throughout this paper. Finally, we select the k words as pivots that have the highest association score with the head pivot. We impose the constraint that we only select one pivot per language. So as we go down the list, we skip pivots from languages for which we already have found a pivot. We set k = 100 in this paper. Table 1 gives the top 10 pivots.

4. Projection of pivot set to all languages. As discussed above, the process so far has been based on tokenization. To be able to find markers that cannot be easily detected on the surface (like “-ed” in English), we identify non-tokenization-based character n-gram features in step 4.

The immediate challenge is that without tokens, we have no alignment between the languages anymore. We could simply assume that the occurrence of a pivot has scope over the entire verse. But this is clearly inadequate, e.g., for the sentence “I arrived yesterday, I’m staying today, and I will leave tomorrow”, it is incorrect to say that it is marked as past tense (or future tense) in its entirety. Fortunately, the verses in the New Testament mostly have a simple structure that limits the variation in where a particular piece of content occurs in the verse. We therefore make the assumption that a particular relative position in language l1 (e.g., the character at relative position 0.62) is aligned with the same relative position in l2 (i.e., the character at relative position 0.62). This is likely to work for a simple example like “I arrived yesterday, I’m staying today, and I will leave tomorrow” across languages.

In our analysis of errors, we found many cases where this assumption breaks down. A well-known problematic phenomenon for our method is the difference between, say, VSO and SOV languages: the first class puts the verb at the beginning, the second at the end. However, keep in mind that we accumulate evidence over k = 100 pivots and then compute aggregate statistics over the entire corpus. As our evaluation below shows, the “linear alignment” assumption does not seem to do much harm given the general robustness of our method.

One design element that increases robustness is that we find the two positions in each verse that are most highly (resp. least highly) correlated with the linguistic feature f. Specifically, we compute the relative position x of each pivot that occurs in the verse and apply a Gaussian filter (σ = 6 where the unit of length is the character), i.e., we set p(x) ≈ 0.066 (0.066 is the density of a Gaussian with σ = 6 at x = 0) and center a bell curve around x. The total score for a position x is then the sum of the filter values at x summed over all occurring pivots. Finally, we select the positions xmin and xmax with lowest and highest values for each verse.

χ² is then computed based on the number of times a character n-gram occurs in a window of size w around xmax (positive count) and in a window of size w around xmin (negative count). Verses in which no pivot occurs are used for the negative count in their entirety. The top-ranked character n-grams are then output for analysis by the linguist. We set w = 20.

5. Linguistic analysis. We now have created a structure that contains rich information about the linguistic feature: for each verse we have relative positions of pivots that can be projected across languages. We also have maximum positions within a verse that allow us to pinpoint the most likely place in the vicinity of which linguistic feature f is marked in all languages. This structure can be used for the analysis of individual low-resource languages as well as for typological analysis. We will give an example of such an analysis in §4.

3 Data, experiments and results

3.1 Data

We use a New Testament subset of the Parallel Bible Corpus (PBC) (Mayer and Cysouw, 2014) that consists of 1556 translations of the Bible in 1169 unique languages. We consider two languages to be different if they have different ISO 639-3 codes.

The translations are aligned on the verse level. However, many translations do not have complete coverage, so that most verses are not present in at least one translation. One reason for this is that sometimes several consecutive verses are merged, so that one verse contains material that is in reality not part of it and the merged verses may then be missing from the translation. Thus, there is a trade-off between the number of parallel translations and the number of verses they have in common. Although some preprocessing was done by the authors of the resource, many translations are not preprocessed. For example, Japanese is not tokenized. We also observed some incorrectness and sparseness in the metadata. One example is that one Fijian translation (see §4) is tagged “fij_hindi”, but it is Fijian, not Fiji Hindi.

¹ We exclude six of the 1169 languages because they do not share enough verses with the rest.
² Past tense is defined as tags BED, BED*, BEDZ, BEDZ*, DOD*, VBD, DOD. We use NLTK (Bird, 2006).
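The two scores at the heart of steps 3 and 4 can be sketched as follows. This is our own minimal reconstruction, not the authors' code: the exact 2×2 contingency layout for the χ² score and the normalization of the position filter are assumptions, though the σ = 6 Gaussian does reproduce the peak density ≈ 0.066 mentioned in §2.

```python
import math

def chi2_score(n11, n10, n01, n00):
    """Pearson chi-square for a 2x2 contingency table, e.g.
    n11 = alignments of the candidate word to the head pivot,
    n10 = other alignments of the head pivot,
    n01 = other alignments of the candidate word,
    n00 = all remaining alignments (this layout is our assumption)."""
    n = n11 + n10 + n01 + n00
    chi2 = 0.0
    for obs, row, col in ((n11, n11 + n10, n11 + n01),
                          (n10, n11 + n10, n10 + n00),
                          (n01, n01 + n00, n11 + n01),
                          (n00, n01 + n00, n10 + n00)):
        exp = row * col / n
        if exp > 0:
            chi2 += (obs - exp) ** 2 / exp
    return chi2

def extreme_positions(verse_len, pivot_positions, sigma=6.0):
    """Sum a Gaussian bump (sigma = 6 characters; peak density
    1/(6*sqrt(2*pi)) ~ 0.066) around each projected pivot position and
    return (x_min, x_max): the least and most feature-correlated
    character positions of the verse."""
    def bump(d):
        return math.exp(-d * d / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    scores = [sum(bump(x - p) for p in pivot_positions)
              for x in range(verse_len)]
    return scores.index(min(scores)), scores.index(max(scores))
```

For example, with a single pivot projected to character position 10 of a 40-character verse, `extreme_positions(40, [10])` returns position 10 as x_max and the most distant position, 39, as x_min; the character n-grams in a w = 20 window around each would then feed the positive and negative χ² counts.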

We use the 7958 verses with the best coverage across languages.

3.2 Experiments

1. Selection of a linguistic feature. We conduct three experiments for the linguistic features past tense, present tense and future tense.

2. Heuristic search for head pivot. We use the queries described in §2 for finding the following three head pivots. (i) Past tense head pivot: “ti” in Seychellois Creole (CRS) (McWhorter, 2005). (ii) Present tense head pivot: “ta” in Papiamentu (PAP) (Andersen, 1990). (iii) Future tense head pivot: “bai” in Tok Pisin (TPI) (Traugott, 1978; Sankoff, 1990).

3. Projection of head pivot to larger pivot set. Using the method described in §2, we project each head pivot to a set of k = 100 pivots. Table 1 gives the top 10 pivots for each tense.

4. Projection of pivot set to all languages. Using the method described in §2, we compute highly correlated character n-gram features, 2 ≤ n ≤ 6, for all 1163 languages.

See §4 for the last step of SuperPivot: 5. Linguistic analysis.

3.3 Evaluation

We rank n-gram features and retain the top 10, for each linguistic feature, for each language and for each n-gram size. We process 1556 translations. Thus, in total, we extract 1556 × 5 × 10 n-grams. Table 2 shows Mean Reciprocal Rank (MRR) for 10 languages. The rank for a particular ranking of n-grams is the first n-gram that is highly correlated with the relevant tense; e.g., character subsequences of the name “Paulus” are evaluated as incorrect, the subsequence “-ed” in English as correct for past. MRR is averaged over all n-gram sizes, 2 ≤ n ≤ 6. Chinese has consistent tense marking only for future, so results are poor. Russian and Polish perform poorly because their central grammatical category is aspect, not tense. The poor performance on Arabic is due to the limits of character n-gram features for a “templatic” language.

During this evaluation, we noticed a surprising amount of variation within translations of one language; e.g., top-ranked n-grams for some German translations include names like “Paulus”. We suspect that for literal translations, linear alignment (§2) yields good n-grams. But many translations are free, e.g., they change the sequence of clauses. This deteriorates mined n-grams. See §7.

A reviewer points out that simple baselines may be available if all we want to do is compute features highly associated with past tense as evaluated in Table 2. As one such baseline, they suggested to first perform a word alignment with the head pivot and then search for highly associated features in the words that were aligned with the head pivot. We implemented this baseline and measured its performance. Indeed, the results were roughly comparable to the more complex method that we evaluate in Table 2.

However, our evaluation was not designed to be a direct evaluation of our method, but only meant as a relatively easy way of getting a quantitative sense of the accuracy of our results. The core result of our method is a corpus in which each language annotates each other language. This is only meaningful on the token or context level, not on the word level. For example, recognizing “-ed” as a possible past tense marker in English and applying it uniformly throughout the corpus would result in the incorrect annotation of the adjective “red” as a past tense form. In our proposed method, this will not happen since the annotation proceeds from reliable pivots to less reliable features, not the other way round. Nevertheless, we agree with the reviewer that we do not make enough use of “type-level” features in our method (type-level features of non-pivot languages) and this is something we plan to address in the future.

4 A map of past tense

To illustrate the potential of our method we select five out of the 100 past tense pivots that give rise to large clusters of distinct combinations. Specifically, starting with CRS, we find other pivots that “split” the set of verses that contain the CRS past tense pivot “ti” into two parts that have about the same size. This gives us two sets. We now look for a pivot that splits one of these two sets about evenly and so on. After iterating four times, we arrive at five pivots: CRS “ti”, Fijian (FIJ) “qai”, Hawaiian Creole (HWC) “wen”, Torres Strait Creole (TCS) “bin” and Tzotzil (TZO) “laj”.

Figure 1 shows a t-SNE (Maaten and Hinton, 2008) visualization of the large clusters of combinations that are found for these five languages, including one cluster of verses that do not contain any of the five pivots.

[Figure 1 appears here: six t-SNE panels titled “Verses marked in CRS”, “Verses marked in FIJ”, “Verses marked in HWC”, “Verses marked in TCS”, “Verses marked in TZO” and “Verses not marked in any of the five languages”. The shared legend lists the cluster combinations of crs:ti, fij:qai, hwc:wen, tcs:bin, tzo:laj and no_marker.]

Figure 1: A map of past tense based on the largest clusters of verses with particular combinations of the past tense pivots from Seychellois Creole (CRS), Fijian (FIJ), Hawaiian Creole (HWC), Torres Strait Creole (TCS) and Tzotzil (TZO). For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language. The sixth subfigure highlights verses not marked by any of the five pivots.
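The clusters behind Figure 1 are, at their core, verses grouped by the exact subset of the five pivots that marks them; t-SNE is then used to lay the large combination clusters out in two dimensions. A toy sketch of the grouping step, with invented verse data:

```python
from collections import Counter

PIVOTS = ("crs:ti", "fij:qai", "hwc:wen", "tcs:bin", "tzo:laj")

def combination_clusters(verse_markers):
    """Group verses by which of the five pivots mark them. Each
    distinct combination is one candidate cluster; verses marked by
    none of the pivots fall into the no_marker cluster."""
    clusters = Counter()
    for markers in verse_markers.values():
        key = tuple(p for p in PIVOTS if p in markers) or ("no_marker",)
        clusters[key] += 1
    return clusters

# Invented verse-to-marker assignments for illustration only.
toy = {
    "v1": {"crs:ti", "tcs:bin"},
    "v2": {"crs:ti", "tcs:bin"},
    "v3": {"fij:qai"},
    "v4": set(),
}
print(combination_clusters(toy).most_common(1))
# → [(('crs:ti', 'tcs:bin'), 2)]
```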

            past                            present                          future
      code  language           pivot  code  language           pivot  code  language           pivot
HPs   CRS   Seychelles C.      ti     PAP   Papiamentu         ta     TPI   Tok Pisin          bai
      GUX   Gourmanchéma       den    NOB   Norwegian Bokmål   er     LID   Nyindrou           kameh
      MAW   Mampruli           daa    HIF   Fiji Hindi         hei    GUL   Sea Island C.      gwine
      GFK   Patpatar           ga     AFR   Afrikaans          is     TGP   Tangoa             pa
      YAL   Yalunka            yi     DAN   Danish             er     BUK   Bugawac            oc
      TOH   Gitonga            di     SWE   Swedish            är     BIS   Bislama            bambae
      DGI   Northern Dagara    tι     EPO   Esperanto          estas  PIS   Pijin              bae
      BUM   Bulu (Cameroon)    nga    ELL   Greek              είναι  APE   Bukiyip            eke
      TCS   Torres Strait C.   bin    HIN   Hindi              haai   HWC   Hawaiian C.        goin
      NDZ   Ndogo              gìι    NAQ   Khoekhoe           ra     NHR   Nharo              gha

Table 1: Top ten past, present, and future tense pivots extracted from 1163 languages. HPs = head pivots. C. = Creole
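The one-pivot-per-language constraint behind Table 1 (step 3 in §2) amounts to a greedy scan down the χ²-ranked candidate list; the sketch below is ours, and the candidate words and scores are invented for illustration.

```python
def select_pivots(scored_words, k=100):
    """Greedy pivot selection: walk down the association-ranked list of
    (score, language, word) triples, keep at most one pivot per
    language, and stop once k pivots have been selected."""
    pivots, seen_langs = [], set()
    for score, lang, word in sorted(scored_words, reverse=True):
        if lang not in seen_langs:
            pivots.append((lang, word))
            seen_langs.add(lang)
        if len(pivots) == k:
            break
    return pivots

# Illustrative candidates: the second TCS word is skipped because a
# TCS pivot was already selected.
ranked = [(9.1, "tcs", "bin"), (8.7, "tzo", "laj"),
          (8.2, "tcs", "i"), (7.9, "fij", "qai")]
print(select_pivots(ranked, k=3))
# → [('tcs', 'bin'), ('tzo', 'laj'), ('fij', 'qai')]
```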

language past present future all CRS “ti”. CRS has a set of markers that can be Arabic 1.00 0.39 0.77 0.72 systematically combined, in particular, a progres- Chinese 0.00 0.00 0.87 0.29 sive marker “pe” that can be combined with the English 1.00 1.00 1.00 1.00 past tense marker “ti”. As a result, past progres- French 1.00 1.00 1.00 1.00 sive sentences in CRS are generally marked with German 1.00 1.00 1.00 1.00 “ti”. Example: “43004031 Meanwhile, the disci- Italian 1.00 1.00 1.00 1.00 ples were urging Jesus, ‘Rabbi, eat something.”’ Persian 0.77 1.00 1.00 0.92 “crs bible 43004031 Pandan sa letan, bann disip ti Polish 1.00 1.00 0.58 0.86 pe sipliy Zezi, ‘Met! Manz en pe.”’ Russian 0.90 0.50 0.62 0.67 The other four languages do not consistently use Spanish 1.00 1.00 1.00 1.00 the pivot for marking the past progressive; e.g., all 0.88 0.79 0.88 0.85 HWC uses “was begging” in 43004031 (instead of “wen”) and TCS uses “kip tok strongwan” ‘keep Table 2: MRR results for step 4. See text for de- talking strongly’ in 43004031 (instead of “bin”). tails. FIJ “qai”. This pivot means “and then”. It is highly correlated with past tense in the New This figure is a map of past tense for all 1163 Testament because most sequential descriptions languages, not just for CRS, FIJ, HWC, TCS and of events are descriptions of past events. But TZO: once the interpretation of a particular clus- there are also some non-past sequences. Example: ter has been established based on CRS, FIJ, HWC, “eng newliving 44009016 And I will show him TCS and TZO, we can investigate this cluster in how much he must suffer for my name’s sake.” the 1164 other languages by looking at the verses “fij hindi 44009016 Au na qai vakatakila vua na that are members of this cluster. 
This methodol- levu ni ka e na sota kaya e na vukuqu.” This ogy supports the empirical investigation of ques- verse is future tense, but it continues a temporal se- tions like “how is progressive past tense expressed quence (it starts in the preceding verse) and there- in language X”? We just need to look up the clus- fore FIJ uses “qai”. The pivots of the other four ter(s) that correspond to progressive past tense, languages are not general markers of temporal se- look up the verses that are members and retrieve quentiality, so they are not used for the future. the text of these verses in language X. HWC “wen”. HWC is less explicit than the To give the reader a flavor of the distinctions other four languages in some respects and more that are reflected in these clusters, we now list phe- explicit in others. It is less explicit in that not nomena that are characteristic of verses that con- all sentences in a sequence of past tense sentences tain only one of the five pivots; these phenomena need to be marked explicitly with “wen”, resulting identify properties of one language that the other in some sentences that are indistinguishable from four do not have. present tense. On the other hand, we found many

119 cases of phrases in the other four languages Similar maps for present and future tenses are that refer implicitly to the past, but are trans- presented in the supplementary material. lated as a verb with explicit past tense marking in HWC. Examples: “hwc 2000 40026046 Da guy 5 Related work who wen set me up . . . ” ‘the guy who WEN set me up’, “eng newliving 40026046 . . . my betrayer Our work is inspired by (Cysouw, 2014; Cysouw . . . ”; “hwc 2000 43008005 . . . Moses wen tell us and Walchli¨ , 2007); see also (Dahl, 2007;W alchli¨ , in da Rules . . . ” ‘Moses WEN tell us in the rules’, 2010). Cysouw creates maps like Figure1 by “eng newliving 43008005 The law of Moses says manually identifying occurrences of the proper . . . ”; “hwc 2000 47006012 We wen give you guys noun “Bible” in a parallel corpus of Jehovah’s our love . . . ”, “eng newliving 47006012 There is Witnesses’ texts. Areas of the map correspond no lack of love on our part . . . ”. In these cases, the to semantic roles, e.g., the Bible as actor (it tells other four languages (and English too) use a noun you to do something) or as object (it was printed). phrase with no tense marking that is translated as This is a definition of semantic roles that is com- a tense-marked clause in HWC. plementary to and different from prior typologi- cal research because it is empirically grounded in While preparing this analysis, we realized that real language use across a large number of lan- HWC “wen” unfortunately does not meet one of guages. It allows typologists to investigate tradi- the criteria we set out for pivots: it is not unam- tional questions from a new perspective. biguous. In addition to being a past tense marker The field of typology is important for both the- (derived from standard English “went”), it can also oretical (Greenberg, 1960; Whaley, 1996; Croft, be a conjunction, derived from “when”. 
This ambiguity is the cause for some noise in the clusters marked for presence of HWC "wen" in the figure.

TCS "bin". Conditionals are one pattern we found in verses that are marked with TCS "bin", but are not marked for past tense in the other four languages. Example: "tcs bible 46015046 Wanem i bin kam pas i da nomal bodi ane den da spiritbodi i bin kam apta." 'what came first is the normal body and then the spirit body came after', "eng newliving 46015046 What comes first is the natural body, then the spiritual body comes later." Apparently, "bin" also has a modal aspect in TCS: generic statements that do not refer to specific events are rendered using "bin" in TCS whereas the other four languages (and also English) use the default unmarked tense, i.e., present tense.

TZO "laj". This pivot indicates perfective aspect. The other four past tense pivots are not perfective markers, so there are verses that are marked with "laj", but not marked with the past tense pivots of the other four languages. Example: "tzo huixtan 40010042 . . . ja'ch-ac'bat bendición yu'un hech laj spas . . . " (literally "a blessing . . . LAJ make"), "eng newliving 40010042 . . . you will surely be rewarded." Perfective aspect and past tense are correlated in the real world since most events that are viewed as simple wholes are in the past. But future events can also be viewed this way, as the example shows.

2002) and computational (Heiden et al., 2000; Santaholma, 2007; Bender, 2009, 2011) linguistics. Typology is concerned with all areas of linguistics: morphology (Song, 2014), syntax (Comrie, 1989; Croft, 2001; Croft and Poole, 2008; Song, 2014), semantic roles (Hartmann et al., 2014; Cysouw, 2014), semantics (Koptjevskaja-Tamm et al., 2007; Dahl, 2014; Wälchli and Cysouw, 2012; Sharma, 2009), etc. Typological information is important for many NLP tasks including discourse analysis (Myhill and Myhill, 1992), information retrieval (Pirkola, 2001), POS tagging (Bohnet and Nivre, 2012), parsing (Bohnet and Nivre, 2012; McDonald et al., 2013), machine translation (Hajič et al., 2000; Kunchukuttan and Bhattacharyya, 2016) and morphology (Bohnet et al., 2013).

Tense is a central phenomenon in linguistics and the languages of the world differ greatly in whether and how they express tense (Traugott, 1978; Bybee and Dahl, 1989; Dahl, 1985, 2000, 2007, 2014; Santos, 2004).

Low resource. Even resources with the widest coverage like the World Atlas of Language Structures (WALS) (Dryer et al., 2005) have little information for hundreds of languages. Many researchers have taken advantage of parallel information for extracting linguistic knowledge in low-resource settings (Resnik et al., 1997; Resnik, 2004; Mihalcea and Simard, 2005; Mayer and Cysouw, 2014;
Christodouloupoulos and Steedman, 2015; Lison and Tiedemann, 2016).

Parallel projection. Parallel projection across languages has been used for a variety of NLP tasks. Machine translation aside, which is the most natural task on parallel corpora (Brown et al., 1993), parallel projection has been used for sense disambiguation (Ide, 2000), parsing (Hwa et al., 2005), paraphrasing (Bannard and Callison-Burch, 2005), part-of-speech tagging (Mukerjee et al., 2006), coreference resolution (de Souza and Orăsan, 2011), event marking (Nordrum, 2015), morphological segmentation (Chung et al., 2016), bilingual analysis of linguistic marking (McEnery and Xiao, 1999; Xiao and McEnery, 2002), as well as language classification (Asgari and Mofrad, 2016; Östling and Tiedemann, 2017).

6 Discussion

Our motivation is not to develop a method that can then be applied to many other corpora. Rather, our motivation is that many of the more than 1000 languages in the Parallel Bible Corpus are low-resource and that providing a method for creating the first richly annotated corpus (through the projection of annotation we propose) for many of these languages is a significant contribution.

The original motivation for our approach is provided by the work of the typologist Michael Cysouw. He created the same type of annotation as we do, but he produced it manually whereas we use automatic methods. The structure of the annotation and its use in linguistic analysis, however, is the same as what we provide.

The basic idea behind the utility of the final outcome of SuperPivot is that the 1163 languages all richly annotate each other. As long as there are a few among the 1163 languages that have a clear marker for a linguistic feature f, this marker can be projected to all other languages to richly annotate them. For any linguistic feature, there is a good chance that a few languages clearly mark it. Of course, this small subset of languages will be different for every linguistic feature.

Thus, even for extremely resource-poor languages for which at present no annotated resources exist, SuperPivot will make available richly annotated corpora that should advance linguistic research on these languages.

7 Conclusion

We presented SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We showed that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produced analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extended existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: we only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

8 Future directions

There are at least two future directions that seem promising to us.

• Creating a common map of tense along the lines of Figure 1, but unifying the three tenses.

• Addressing shortcomings of the way we compute alignments: (i) generalizing character n-grams to more general features, so that templates in templatic morphology, reduplication and other more complex manifestations of linguistic features can be captured; (ii) using n-gram features of different lengths to account for differences among languages, e.g., shorter ones for Chinese, longer ones for English; (iii) segmenting verses into clauses and performing alignment not on the verse level (which caused many errors in our experiments) but on the clause level; (iv) using global information more effectively, e.g., by extracting alignment features from automatically induced bi- or multilingual lexicons.

Acknowledgments

We gratefully acknowledge financial support from Volkswagenstiftung and fruitful discussions with Fabienne Braune, Michael Cysouw, Alexander Fraser, Annemarie Friedrich, Mohsen Mahdavi Mazdeh, Mohammad R.K. Mofrad, and Benjamin Roth. We are indebted to Michael Cysouw for the Parallel Bible Corpus.
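The projection idea from the discussion (a few languages that overtly mark a feature can annotate all the others through verse alignment) can be illustrated in a few lines of Python. The miniature verse-aligned corpus, language codes, and pivot string below are toy stand-ins, not the actual PBC data or the SuperPivot implementation.

```python
# Toy sketch of overt-to-non-overt projection over a verse-aligned corpus.
# The corpus below is invented illustration data (verse ID -> translations);
# the real method operates on the Parallel Bible Corpus with learned pivots.
corpus = {
    "46015046": {
        "tcs": "wanem i bin kam pas i da nomal bodi",
        "eng": "what comes first is the natural body",
    },
    "40010042": {
        "tcs": "yu i go kisim gutpela pe",
        "eng": "you will surely be rewarded",
    },
}

def project(corpus, overt_lang, pivot, target_lang):
    """Label each verse of target_lang for a feature, using the presence
    of an overt pivot token in the aligned verse of overt_lang."""
    labels = {}
    for verse_id, translations in corpus.items():
        if overt_lang in translations and target_lang in translations:
            marked = pivot in translations[overt_lang].split()
            labels[verse_id] = (translations[target_lang], marked)
    return labels

# Project the TCS past-tense pivot "bin" onto the aligned English verses.
labels = project(corpus, overt_lang="tcs", pivot="bin", target_lang="eng")
print(labels["46015046"])  # ('what comes first is the natural body', True)
```

In the real setting the pivot is a character n-gram found by SuperPivot rather than a hand-picked token, and projection runs from several overt languages at once; this sketch only shows the verse-alignment bookkeeping.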

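Future direction (ii) above, n-gram features of different lengths per language, could look like the following minimal sketch; the per-language length ranges are invented for illustration and are not values from the paper.

```python
def char_ngrams(text, n_min, n_max):
    """All character n-grams of text with lengths n_min..n_max."""
    s = text.replace(" ", "_")  # make word boundaries visible to n-grams
    return [s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

# Illustrative per-language length ranges (shorter for Chinese, longer
# for English); the specific numbers are assumptions of this sketch.
NGRAM_RANGES = {"zho": (1, 2), "eng": (2, 6)}

def features(verse, lang):
    n_min, n_max = NGRAM_RANGES.get(lang, (2, 4))  # assumed default range
    return char_ngrams(verse, n_min, n_max)

print(features("他说", "zho"))  # ['他', '说', '他说']
```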
References

Judith L. Aissen. 1987. Tzotzil Clause Structure. Springer.

Roger W. Andersen. 1990. Papiamentu tense-aspect, with special attention to discourse. Pidgin and creole tense-mood-aspect systems, pages 59–96.

Ehsaneddin Asgari and Mohammad R. K. Mofrad. 2016. Comparing fifty natural languages and twelve genetic languages using word embedding language divergence (WELD) as a quantitative measure of language distance. In Proceedings of the NAACL-HLT Workshop on Multilingual and Cross-lingual Methods in NLP, pages 65–74.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 597–604. Association for Computational Linguistics.

Emily M. Bender. 2009. Linguistically naïve != language independent: why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?, pages 26–32. Association for Computational Linguistics.

Emily M. Bender. 2011. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26.

Steven Bird. 2006. NLTK: the Natural Language Toolkit. In Proceedings of the COLING/ACL Interactive Presentation Sessions, pages 69–72. Association for Computational Linguistics.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465. Association for Computational Linguistics.

Bernd Bohnet, Joakim Nivre, Igor Boguslavsky, Richárd Farkas, Filip Ginter, and Jan Hajič. 2013. Joint morphological and syntactic analysis for richly inflected languages. Transactions of the Association for Computational Linguistics, 1:415–428.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Joan L. Bybee and Östen Dahl. 1989. The creation of tense and aspect systems in the languages of the world. John Benjamins, Amsterdam.

George Casella and Roger L. Berger. 2008. Statistical Inference. Thomson.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.

Bernard Comrie. 1989. Language universals and linguistic typology: Syntax and morphology. University of Chicago Press.

William Croft. 2001. Radical construction grammar: Syntactic theory in typological perspective. Oxford University Press on Demand.

William Croft. 2002. Typology and universals. Cambridge University Press.

William Croft and Keith T. Poole. 2008. Inferring universals from grammatical variation: Multidimensional scaling for typological analysis. Theoretical Linguistics, 34(1):1–37.

Michael Cysouw. 2014. Inducing semantic roles. Perspectives on semantic roles, pages 23–68.

Michael Cysouw and Bernhard Wälchli. 2007. Parallel texts: using translational equivalents in linguistic typology. STUF – Sprachtypologie und Universalienforschung, 60(2):95–99.

Östen Dahl. 1985. Tense and aspect systems. Basil Blackwell.

Östen Dahl. 2000. Tense and Aspect in the Languages of Europe. Walter de Gruyter.

Östen Dahl. 2007. From questionnaires to parallel corpora in typology. STUF – Sprachtypologie und Universalienforschung, 60(2):172–181.

Östen Dahl. 2014. The perfect map: Investigating the cross-linguistic distribution of TAME categories in a parallel corpus. Aggregating Dialectology, Typology, and Register Contents Analysis. Linguistic Variation in Text and Speech. Linguae & litterae, 28:268–289.

Matthew S. Dryer, David Gil, Bernard Comrie, Hagen Jung, Claudia Schmidt, et al. 2005. The World Atlas of Language Structures. Oxford University Press.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, pages 644–648.

Joseph H. Greenberg. 1960. A quantitative approach to the morphological typology of language. International Journal of American Linguistics, 26(3):178–194.

Jan Hajič, Jan Hric, and Vladislav Kuboň. 2000. Machine translation of very close languages. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 7–12. Association for Computational Linguistics.

Iren Hartmann, Martin Haspelmath, and Michael Cysouw. 2014. Identifying semantic role clusters and alignment types via microrole coexpression tendencies. Studies in Language, 38(3):463–484.

Serge Heiden, Sophie Prévost, Benoit Habert, Helka Folch, Serge Fleury, Gabriel Illouz, Pierre Lafon, and Julien Nioche. 2000. TypTex: Inductive typological text classification by multivariate statistical analysis for NLP systems tuning/evaluation. In Second International Conference on Language Resources and Evaluation.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(03):311–325.

Nancy Ide. 2000. Cross-lingual sense determination: Can it work? Computers and the Humanities, 34(1):223–234.

Maria Koptjevskaja-Tamm, Martine Vanhove, and Peter Koch. 2007. Typological approaches to lexical semantics. Linguistic Typology, 11(1):159–185.

Anoop Kunchukuttan and Pushpak Bhattacharyya. 2016. Faster decoding for subword level phrase-based SMT between related languages. arXiv preprint arXiv:1611.00354.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. Oceania, 135(273):40.

Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. 2013. Universal dependency annotation for multilingual parsing. In ACL (2), pages 92–97.

Tony McEnery and Richard Xiao. 1999. Domains, text types, aspect marking and English-Chinese translation. Languages in Contrast, 2(2):211–229.

John H. McWhorter. 2005. Defining creole. Oxford University Press.

Rada Mihalcea and Michel Simard. 2005. Parallel texts. Natural Language Engineering, 11(03):239–246.

Amitabha Mukerjee, Ankit Soni, and Achla M. Raina. 2006. Detecting complex predicates in Hindi using POS projection across parallel corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 28–35. Association for Computational Linguistics.

John Myhill and Myhill. 1992. Typological discourse analysis: Quantitative approaches to the study of linguistic function. Blackwell, Oxford.

Lene Nordrum. 2015. Exploring spontaneous-event marking through parallel corpora: Translating English ergative intransitive constructions into Norwegian and Swedish. Languages in Contrast, 15(2):230–250.

Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 644–649. Association for Computational Linguistics.

Ari Pirkola. 2001. Morphological typology of languages for IR. Journal of Documentation, 57(3):330–348.

Philip Resnik. 2004. Exploiting hidden meanings: Using bilingual text for monolingual annotation. Computational Linguistics and Intelligent Text Processing, pages 283–299.

Philip Resnik, Mari Broman Olsen, and Mona Diab. 1997. Creating a parallel corpus from the book of 2000 tongues. In Proceedings of the Text Encoding Initiative 10th Anniversary User Conference (TEI-10). Citeseer.

Gillian Sankoff. 1990. The grammaticalization of tense and aspect in Tok Pisin and Sranan. Language Variation and Change, 2(03):295–312.

Marianne Elina Santaholma. 2007. Grammar sharing techniques for rule-based multilingual NLP systems. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA).

Diana Santos. 2004. Translation-based corpus studies: Contrasting English and Portuguese tense and aspect systems, volume 50. Rodopi.

Devyani Sharma. 2009. Typological diversity in new Englishes. English World-Wide, 30(2):170–195.

Jae Jung Song. 2014. Linguistic typology: Morphology and syntax. Routledge.

José Guilherme Camargo de Souza and Constantin Orăsan. 2011. Can projected chains in parallel corpora help coreference resolution? In Discourse Anaphora and Anaphor Resolution Colloquium, pages 59–69. Springer.

Elizabeth Closs Traugott. 1978. On the expression of spatio-temporal relations in language. Universals of Human Language, 3:369–400.

Bernhard Wälchli. 2010. The consonant template in synchrony and diachrony. Baltic Linguistics, 1.

Bernhard Wälchli and Michael Cysouw. 2012. Lexical typology through similarity semantics: Toward a semantic map of motion verbs. Linguistics, 50(3):671–710.

Lindsay J. Whaley. 1996. Introduction to typology: the unity and diversity of language. Sage Publications.

R. Z. Xiao and A. M. McEnery. 2002. A corpus-based approach to tense and aspect in English-Chinese translation. In The 1st International Symposium on Contrastive and Translation Studies between Chinese and English.
