
Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages

Ehsaneddin Asgari¹,² and Hinrich Schütze¹

¹Center for Information and Language Processing, LMU Munich; ²Applied Science and Technology, University of California, Berkeley, CA, USA; [email protected]

arXiv:1704.08914v2 [cs.CL] 14 Sep 2017

Abstract

We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: We only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

1 Introduction

Significant linguistic resources such as machine-readable lexicons and part-of-speech (POS) taggers are available for at most a few hundred languages. This means that the majority of the languages of the world are low-resource. Low-resource languages like Fulani are spoken by tens of millions of people and are politically and economically important; e.g., to manage a sudden refugee crisis, NLP tools would be of great benefit. Even "small" languages are important for the preservation of the common heritage of humankind that includes natural remedies and linguistic and cultural diversity that can potentially enrich everybody. Thus, developing analysis methods for low-resource languages is one of the most important challenges of NLP today.

We address this challenge by proposing a new method for analyzing what we call superparallel corpora, corpora that are by an order of magnitude more parallel than corpora that have been available in NLP to date. The corpus we work with in this paper is the Parallel Bible Corpus (PBC) that consists of translations of the New Testament in 1169 languages. Given that no NLP analysis tools are available for most of these 1169 languages, how can we extract the rich information that is potentially hidden in such superparallel corpora?

The method we propose is based on two hypotheses.

H1 Existence of overt encoding. For any important linguistic distinction f that is frequently encoded across languages in the world, there are a few languages that encode f overtly on the surface.

H2 Overt-to-overt and overt-to-non-overt projection. For a language l that encodes f, a projection of f from the "overt languages" to l in the superparallel corpus will identify the encoding that l uses for f, both in cases in which the encoding that l uses is overt and in cases in which the encoding that l uses is non-overt.

Based on these two hypotheses, our method proceeds in 5 steps.

1. Selection of a linguistic feature. We select a linguistic feature f of interest.

Running example: We select past tense as feature f.

2. Heuristic search for head pivot. Through a heuristic search, we find a language $l_h$ that contains a head pivot $p_h$ that is highly correlated with the linguistic feature of interest.

Running example: "ti" in Seychelles Creole (CRS). CRS "ti" meets our requirements for a head pivot well, as will be verified empirically in §3. First, "ti" is a surface marker: it is easily identifiable through whitespace tokenization and it is not ambiguous, e.g., it does not have a second meaning apart from being a grammatical marker. Second, "ti" is a good marker for past tense in terms of both "precision" and "recall". CRS has mandatory past tense marking (as opposed to languages in which tense marking is facultative) and "ti" is highly correlated with the general notion of past tense.

This does not mean that every clause that a linguist would regard as past tense is marked with "ti" in CRS. For example, some tense-aspect configurations that are similar to English present perfect are marked with "in" in CRS, not with "ti" (e.g., "has commanded" is translated as "in ordonn").

Our goal is not to find a head language and a head pivot that is a perfect marker of f. Such a head pivot probably does not exist; or, more precisely, linguistic features are not completely rigorously defined. In a sense, one of the significant contributions of this work is that we provide more rigorous definitions of past tense across languages; e.g., "ti" in CRS is one such rigorous definition of past tense and it automatically extends (through projection) to 1000 languages in the superparallel corpus.

3. Projection of head pivot to larger pivot set. Based on an alignment of the head language to the other languages in the superparallel corpus, we project the head pivot to all other languages and search for highly correlated surface markers, i.e., we search for additional pivots in other languages. This projection to more pivots achieves three goals. First, it makes the method more robust. Relying on a single pivot would result in many errors due to the inherent noisiness of linguistic data and because several components we use (e.g., alignment of the languages in the superparallel corpus) are imperfect. Second, as we discussed above, the head pivot does not necessarily have high "recall"; our example was that CRS "ti" is not applied to certain clauses that would be translated using present perfect in English. Thus, moving to a larger pivot set increases recall. Third, as we will see below, the pivot set can be leveraged to create a fine-grained map of the linguistic feature. Consider clauses referring to eventualities in the past that English speakers would render in past progressive, present perfect and simple past tense. Our hope is that the pivot set will cover these distinctions, i.e., one of the pivots marks past progressive, but not present perfect and simple past, another pivot marks present perfect, but not the other two and so on. It is beyond the scope of this paper to verify that we can produce such an analysis for all linguistic features, but a promising example of this type of map, including distinctions like progressive and perfective aspect, is given in §4.

Running example: We compute the correlation of "ti" with words in other languages and select the 100 highest correlated words as pivots. Examples of pivots we find this way are Torres Strait Creole "bin" (from English "been") and Tzotzil "laj". "laj" is a perfective marker, e.g., "Laj meltzaj -uk" 'LAJ be-made subj' means "It's done being built" (Aissen, 1987).

4. Projection of pivot set to all languages. Now that we have a large pivot set, we project the pivots to all other languages to search for linguistic devices that express the linguistic feature f. Up to this point, we have made the assumption that it is easy to segment text in all languages into pieces of a size that is not too small (individual characters of the Latin alphabet would be too small) and not too large (entire sentences as tokens would be too large). Segmentation on standard delimiters is a good approximation for the majority of languages – but not for all: it undersegments some (e.g., Inuit) and oversegments others (e.g., languages that use punctuation marks as regular characters).

For this reason, we do not employ tokenization in this step. Rather we search for character n-grams (2 ≤ n ≤ 6) to find linguistic devices that express f. This implementation of the search procedure is a limitation – there are many linguistic devices that cannot be found using it, e.g., templates in templatic morphology. We leave addressing this for future work (§7).

Running example: We find "-ed" for English and "-te" for German as surface features that are highly correlated with the 100 past tense pivots.

5. Linguistic analysis. The result of the previous steps is a superparallel corpus that is richly annotated with information about linguistic feature f. This structure can be exploited for the analysis of a single language $l_i$ that may be the focus of a linguistic investigation. Starting with the character n-grams that were found in the step "projection of pivot set to all languages", we can explore their use and function, e.g., for the mined n-gram "-ed" in English (assuming English is the language $l_i$ and it is unfamiliar to us). Many of the other 1000 languages provide annotations of linguistic feature f for $l_i$: both the languages that are part of the pivot set (e.g., Tzotzil "laj") and the mined n-grams in other languages that we may have some knowledge of (e.g., "-te" in German).

We can also use the structure we have generated for typological analysis across languages following the work of Michael Cysouw. He has pioneered a new methodology for typology ((Cysouw, 2014), §5). We do not contribute any innovations to typology in this paper, but our method is a significant advancement computationally over Cysouw's work because we overcome many of his limiting assumptions. Most importantly, our method scales to thousands of languages as we demonstrate below whereas Cysouw worked on a few dozen.

Running example: We sketch the type of analysis that our new method makes possible in §4.

The above steps "2. heuristic search for head pivot" and "3. projection of head pivot to larger pivot set" are based on H1: we assume the existence of overt coding in a subset of languages.

The above steps "3. projection of head pivot to larger pivot set" and "4. projection of pivot set to all languages" are based on H2: we assume that overt-to-overt and overt-to-non-overt projection is possible.

In the rest of the paper, we will refer to the method that consists of steps 1 to 5 as SuperPivot: "linguistic analysis of SUPERparallel corpora using surface PIVOTs".

We make three contributions. (i) Our basic hypotheses are H1 and H2. (H1) For an important linguistic feature, there exist a few languages that mark it overtly and easily recognizably. (H2) It is possible to project overt markers to overt and non-overt markers in other languages. Based on these two hypotheses we design SuperPivot, a new method for analyzing highly parallel corpora, and show that it performs well for the crosslingual analysis of the linguistic phenomenon of tense. (ii) Given a superparallel corpus, SuperPivot can be used for the analysis of any low-resource language represented in that corpus. In the supplementary material, we present results of our analysis for three tenses (past, present, future) for 1163¹ languages. An evaluation of accuracy is presented in Table 2. (iii) We extend Michael Cysouw's pioneering work on typological analysis using parallel corpora by overcoming several limiting factors. The most important is that Cysouw's method is only applicable if markers of the relevant linguistic feature are recognizable on the surface in all languages. In contrast, we only assume that markers of the relevant linguistic feature are recognizable on the surface in a small number of languages.

¹ We exclude six of the 1169 languages because they do not share enough verses with the rest.

2 SuperPivot: Description of method

1. Selection of a linguistic feature. The linguistic feature of interest f is selected by the person who performs a SuperPivot analysis, i.e., by a linguist, NLP researcher or data scientist. Henceforth, we will refer to this person as the linguist. In this paper, f ∈ F = {past, present, future}.

2. Heuristic search for head pivot. There are several ways for finding the head language and the head pivot. Perhaps the linguist knows a language that has a good head pivot. Or she is a trained typologist and can find the head pivot by consulting the typological literature.

In this paper, we use our knowledge of English and an alignment from English to all other languages to find head pivots. (See below for details on alignment.) We define a "query" in English and search for words that are highly correlated to the query in other languages. For future tense, the query is simply the word "will", so we search for words in other languages that are highly correlated with "will". For present tense, the query is the union of "is", "are" and "am". So we search for words in other languages that are highly correlated with the "merger" of these three words. For past tense, we POS tag the English part of PBC and merge all words tagged as past tense into one past tense word.² We then search for words in other languages that are highly correlated with this artificial past tense word.

² Past tense is defined as tags BED, BED*, BEDZ, BEDZ*, DOD*, VBD, DOD. We use NLTK (Bird, 2006).

As an additional constraint, we do not select the most highly correlated word as the head pivot, but the most highly correlated word in a Creole language. Our rationale is that Creole languages are more regular than other languages because they are young and have not accumulated "historical baggage" that may make computational analysis more difficult.

Table 1 lists the three head pivots for F.

3. Projection of head pivot to larger pivot set. We first use fast_align (Dyer et al., 2013) to align the head language to all other languages in the corpus. This alignment is on the word level.

We compute a score for each word in each language based on the number of times it is aligned to the head pivot, the number of times it is aligned to another word and the total frequencies of head pivot and word. We use χ² (Casella and Berger, 2008) as the score throughout this paper. Finally, we select the k words as pivots that have the highest association score with the head pivot.

We impose the constraint that we only select one pivot per language. So as we go down the list, we skip pivots from languages for which we already have found a pivot. We set k = 100 in this paper. Table 1 gives the top 10 pivots.

4. Projection of pivot set to all languages. As discussed above, the process so far has been based on tokenization. To be able to find markers that cannot be easily detected on the surface (like "-ed" in English), we identify non-tokenization-based character n-gram features in step 4.

The immediate challenge is that without tokens, we have no alignment between the languages anymore. We could simply assume that the occurrence of a pivot has scope over the entire verse. But this is clearly inadequate, e.g., for the sentence "I arrived yesterday, I'm staying today, and I will leave tomorrow", it is incorrect to say that it is marked as past tense (or future tense) in its entirety. Fortunately, the verses in the New Testament mostly have a simple structure that limits the variation in where a particular piece of content occurs in the verse. We therefore make the assumption that a particular relative position in language $l_1$ (e.g., the character at relative position 0.62) is aligned with the same relative position in $l_2$ (i.e., the character at relative position 0.62). This is likely to work for a simple example like "I arrived yesterday, I'm staying today, and I will leave tomorrow" across languages.

In our analysis of errors, we found many cases where this assumption breaks down. A well-known problematic phenomenon for our method is the difference between, say, VSO and SOV languages: the first class puts the verb at the beginning, the second at the end. However, keep in mind that we accumulate evidence over k = 100 pivots and then compute aggregate statistics over the entire corpus. As our evaluation below shows, the "linear alignment" assumption does not seem to do much harm given the general robustness of our method.

One design element that increases robustness is that we find the two positions in each verse that are most highly (resp. least highly) correlated with the linguistic feature f. Specifically, we compute the relative position x of each pivot that occurs in the verse and apply a Gaussian filter (σ = 6 where the unit of length is the character), i.e., we center a bell curve at x whose peak value is p(x) ≈ 0.066 (0.066 is the density of a Gaussian with σ = 6 at its mean). The total score for a position x is then the sum of the filter values at x summed over all occurring pivots. Finally, we select the positions $x_{\min}$ and $x_{\max}$ with lowest and highest values for each verse.

χ² is then computed based on the number of times a character n-gram occurs in a window of size w around $x_{\max}$ (positive count) and in a window of size w around $x_{\min}$ (negative count). Verses in which no pivot occurs are used for the negative count in their entirety. The top-ranked character n-grams are then output for analysis by the linguist. We set w = 20.

5. Linguistic analysis. We now have created a structure that contains rich information about the linguistic feature: for each verse we have relative positions of pivots that can be projected across languages. We also have maximum positions within a verse that allow us to pinpoint the most likely place in the vicinity of which linguistic feature f is marked in all languages. This structure can be used for the analysis of individual low-resource languages as well as for typological analysis. We will give an example of such an analysis in §4.

6. Hierarchical clusterings of markers and languages. As an additional evaluation, we worked on hierarchical clusterings of past, present and future pivots. As detailed in §2.4, we represent each verse by a vector of length 100 showing which pivot markers are used to express this verse. The other way of looking at these data is that for each marker we have an occurrence distribution over verses and we may exploit these data to measure the distance between markers. For the purpose of comparing two markers, we propose calculation of the Jensen-Shannon divergence between the normalized occurrence distributions over verses:

\[ D_{m_{p_i}, m_{p_j}} = \mathrm{JSD}(\hat{m}_{p_i}, \hat{m}_{p_j}), \]

where $\hat{m}_{p_i}$ and $\hat{m}_{p_j}$ are the normalized occurrence distributions over verses. We compare the obtained distance between markers with the genetic distance of their corresponding languages using WALS information (Dryer et al., 2005). For visualization purposes, we perform Unweighted Pair Group Method with Arithmetic Mean (UPGMA) hierarchical clustering on the pairwise distance matrix of the markers for each tense separately (Johnson, 1967).

In addition to clustering of pivot markers for each tense separately, we performed the same comparison for all top markers of 1107 languages³ and take the average distances of languages in past, present, and future marking. This allows us to compare the average tense behavior of languages:

\[ D_{l_i, l_j} = \frac{1}{3}\left(\mathrm{JSD}_{\text{past}} + \mathrm{JSD}_{\text{present}} + \mathrm{JSD}_{\text{future}}\right). \]

³ We exclude languages that have fewer than 7000 verses in common with the pivot language to ensure quality of marker.

3 Data, experiments and results

3.1 Data

We use a New Testament subset of the Parallel Bible Corpus (PBC) (Mayer and Cysouw, 2014) that consists of 1556 translations of the Bible in 1169 unique languages. We consider two languages to be different if they have different ISO 639-3 codes.

The translations are aligned on the verse level. However, many translations do not have complete coverage, so that most verses are not present in at least one translation. One reason for this is that sometimes several consecutive verses are merged, so that one verse contains material that is in reality not part of it and the merged verses may then be missing from the translation. Thus, there is a trade-off between number of parallel translations and number of verses they have in common. Although some preprocessing was done by the authors of the resource, many translations are not preprocessed. For example, Japanese is not tokenized. We also observed some incorrectness and sparseness in the metadata. One example is that one Fijian translation (see §4) is tagged fij hindi, but it is Fijian, not Fiji Hindi.

We use the 7958 verses with the best coverage across languages.

3.2 Experiments

1. Selection of a linguistic feature. We conduct three experiments for the linguistic features past tense, present tense and future tense.

2. Heuristic search for head pivot. We use the queries described in §2 for finding the following three head pivots. (i) Past tense head pivot: "ti" in Seychellois Creole (CRS) (McWhorter, 2005). (ii) Present tense head pivot: "ta" in Papiamentu (PAP) (Andersen, 1990). (iii) Future tense head pivot: "bai" in Tok Pisin (TPI) (Traugott, 1978; Sankoff, 1990).

3. Projection of head pivot to larger pivot set. Using the method described in §2, we project each head pivot to a set of k = 100 pivots. Table 1 gives the top 10 pivots for each tense.

| | past: code | language | pivot | present: code | language | pivot | future: code | language | pivot |
|---|---|---|---|---|---|---|---|---|---|
| head pivot | CRS | Seychelles C. | ti | PAP | Papiamentu | ta | TPI | Tok Pisin | bai |
| | GUX | Gourmanchéma | den | NOB | Norwegian Bokmål | er | LID | Nyindrou | kameh |
| | MAW | Mampruli | daa | HIF | Fiji Hindi | hei | GUL | Sea Island C. | gwine |
| | GFK | Patpatar | ga | AFR | Afrikaans | is | TGP | Tangoa | pa |
| | YAL | Yalunka | yi | DAN | Danish | er | BUK | Bugawac | oc |
| | TOH | Gitonga | di | SWE | Swedish | är | BIS | Bislama | bambae |
| | DGI | Northern Dagara | tι | EPO | Esperanto | estas | PIS | Pijin | bae |
| | BUM | Bulu (Cameroon) | nga | ELL | Greek | ίναι | APE | Bukiyip | eke |
| | TCS | Torres Strait C. | bin | HIN | Hindi | haai | HWC | Hawaiian C. | goin |
| | NDZ | Ndogo | gìι | NAQ | Khoekhoe | ra | NHR | Nharo | gha |

Table 1: Top ten past, present, and future tense pivots extracted from 1163 languages. C. = Creole

4. Projection of pivot set to all languages. Using the method described in §2, we compute highly correlated character n-gram features, 2 ≤ n ≤ 6, for all 1163 languages.

See §4 for the last step of SuperPivot: 5. Linguistic analysis.

3.3 Evaluation

We rank n-gram features and retain the top 10, for each linguistic feature, for each language and for each n-gram size. We process 1556 translations. Thus, in total, we extract 1556 × 5 × 10 n-grams.

Table 2 shows Mean Reciprocal Rank (MRR) for 10 languages. The rank for a particular ranking of n-grams is the position of the first n-gram that is highly correlated with the relevant tense; e.g., character subsequences of the name "Paulus" are evaluated as incorrect, the subsequence "-ed" in English as correct for past. MRR is averaged over all n-gram sizes, 2 ≤ n ≤ 6. Chinese has consistent tense marking only for future, so results are poor. Russian and Polish perform poorly because their central grammatical category is aspect, not tense. The poor performance on Arabic is due to the limits of character n-gram features for a "templatic" language.

| language | past | present | future | all |
|---|---|---|---|---|
| Arabic | 1.00 | 0.39 | 0.77 | 0.72 |
| Chinese | 0.00 | 0.00 | 0.87 | 0.29 |
| English | 1.00 | 1.00 | 1.00 | 1.00 |
| French | 1.00 | 1.00 | 1.00 | 1.00 |
| German | 1.00 | 1.00 | 1.00 | 1.00 |
| Italian | 1.00 | 1.00 | 1.00 | 1.00 |
| Persian | 0.77 | 1.00 | 1.00 | 0.92 |
| Polish | 1.00 | 1.00 | 0.58 | 0.86 |
| Russian | 0.90 | 0.50 | 0.62 | 0.67 |
| Spanish | 1.00 | 1.00 | 1.00 | 1.00 |
| all | 0.88 | 0.79 | 0.88 | 0.85 |

Table 2: MRR results for step 4. See text for details.

During this evaluation, we noticed a surprising amount of variation within translations of one language; e.g., top-ranked n-grams for some German translations include names like "Paulus". We suspect that for literal translations, linear alignment (§2) yields good n-grams. But many translations are free, e.g., they change the sequence of clauses. This degrades the mined n-grams. See §7.

Hierarchical clusterings of markers. Hierarchical clusterings of past, present and future pivots using JSD between the normalized occurrence distributions over verses are shown in Figure 1, Figure 2, and Figure 3 for past, present, and future tenses respectively. In addition to the marker clusterings, the average tense behavior clustering of 1107 languages is depicted in Figure 4. In these figures languages are colored based on their language families using WALS (Dryer et al., 2005); languages without family information on WALS are uncolored. We observed that most past and future pivot markers belong to the Niger-Congo family and present markers are mostly within the Indo-European family. It can be seen that in many cases languages of the same family behave similarly in tense marking. For instance, in past tense marking Oto-Manguean languages use almost the same marker, ni, with small writing variations (Figure 1). Although Tezoatlán Mixtec did not have a record on WALS, since its marker is the same as in other Oto-Manguean languages and works almost identically to ni in Oto-Manguean languages, we may guess this language is also Oto-Manguean, which turned out to be true when we performed further searches.⁴ There were many such cases for which we could guess the family of a language based on tense marking similarities in Figure 1, Figure 2 and Figure 3.

⁴ https://www.ethnologue.com/language/mxb

We use normalized JSD (0 ≤ JSD ≤ 1) for comparison of each pair of languages/markers; this allows us to investigate whether a simple threshold of 0.5 accurately predicts whether two languages are genetically related or not. The results are summarized in Table 3. Although the average tense marking divergence has a low recall, it reaches a precision of 0.36, where random chance is 1/103 ≈ 0.01. Thus, if the divergence of tense marking is low, the languages are far more likely than chance to be genetically related. This conclusion is supported by Figure 4 where many small clusters of nodes have the same color. This suggests that our method may help in completion of WALS.

4 A map of past tense

To illustrate the potential of our method we select five out of the 100 past tense pivots that give rise to large clusters of distinct combinations. Starting with CRS, we find other pivots that "split" the set of verses that contain the CRS past tense pivot "ti" into two parts that have about the same size. This gives us two sets. We now look for a pivot that splits one of these two sets about evenly and so on. After iterating four times, we arrive at five pivots: CRS "ti", Fijian (FIJ) "qai", Hawaiian Creole (HWC) "wen", Torres Strait Creole (TCS) "bin" and Tzotzil (TZO) "laj".

Figure 5 shows a t-SNE (Maaten and Hinton, 2008) visualization of the large clusters of combinations that are found for these five languages, including one cluster of verses that do not contain any of the five pivots.

This figure is a map of past tense for all 1163 languages, not just for CRS, FIJ, HWC, TCS and TZO: once the interpretation of a particular cluster has been established based on CRS, FIJ, HWC, TCS and TZO, we can investigate this cluster in the 1164 other languages by looking at the verses

| | avg tense | past | present | future |
|---|---|---|---|---|
| with WALS record | 696 lang., 103 fam. | 55 lang., 15 fam. | 70 lang., 17 fam. | 44 lang., 16 fam. |
| total | 1107 | 100 | 100 | 100 |
| accuracy | 0.93 | 0.55 | 0.81 | 0.58 |
| precision | 0.36 | 0.18 | 0.75 | 0.16 |
| recall | 0.01 | 0.59 | 0.37 | 0.61 |
| TNR | 0.99 | 0.55 | 0.96 | 0.58 |

Table 3: Similarity prediction results based on coordinated marking of verses. Only languages with records on WALS are included in this evaluation. TNR: true negative rate.
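The four measures reported in Table 3 can be illustrated with a small sketch. This is not the authors' code: the function name `threshold_metrics`, the language codes, the family labels and the distances below are all invented for illustration. Only the decision rule (normalized JSD below 0.5 predicts "genetically related") and the four measures come from the text.

```python
from itertools import combinations


def threshold_metrics(distance, family, threshold=0.5):
    """Score the rule 'distance < threshold => same family' over all pairs.

    distance: dict mapping frozenset({a, b}) -> normalized JSD in [0, 1]
    family:   dict mapping language code -> family name (hypothetical labels)
    """
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(family), 2):
        predicted = distance[frozenset((a, b))] < threshold
        related = family[a] == family[b]
        if predicted and related:
            tp += 1
        elif predicted:
            fp += 1
        elif related:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "tnr": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy data, invented for illustration only.
family = {"aaa": "F1", "bbb": "F1", "ccc": "F2", "ddd": "F2"}
distance = {
    frozenset(("aaa", "bbb")): 0.2,  # related, below threshold   -> true positive
    frozenset(("ccc", "ddd")): 0.7,  # related, above threshold   -> false negative
    frozenset(("aaa", "ccc")): 0.4,  # unrelated, below threshold -> false positive
    frozenset(("aaa", "ddd")): 0.9,  # unrelated, above threshold -> true negative
    frozenset(("bbb", "ccc")): 0.8,  # true negative
    frozenset(("bbb", "ddd")): 0.6,  # true negative
}
metrics = threshold_metrics(distance, family)
print(metrics)
```

Because unrelated pairs vastly outnumber related pairs when there are many families, accuracy and TNR can be high even when recall is very low, which is the pattern in the "avg tense" column of Table 3.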

i n

>

c e

t ³

i x i i ì

i u n n

n q M

i

> >

> n > e

c d ī c n c

c > í e

e n n

e t e t n

t c

t a k x

x > i

x i e a i >

a t

> a

z G c M

y x

e M

n i c

c

o M l e a

a e

e n n

j t

ë a u e

i i

k i > l g

a M t E g

n n

x >

a

M

q o

g s

o i n a l b

c x l

t

á i

o n e > s

d a l

g > e t

n i i

> u e > i

i

a

> >

> t M

> k

t

>

l )

i ú i

e

> M u

o b d h u

t e

>

i )

a

P a e

n

n l

> c a a

s

u l >

ó d l h

i

d i

g

d á P n u

a

g n í

k o s y a m i a w l

e > > e o

d n

z b

n n o n

k t e

t h

k

r o n o v a

> i

c z e > t i

r

a i n e l

p

l

> o > d c

a o

h M

a a o

> r r u

c a o

u i

> i e

o

a t

> n r

> o e s m n

a H s o

l U

r r

m i d

c e a t

c >

i T o C o C

) z i K

i n e a e

o a

H n

r n i b

a A o

e r n r u

a o h

o a a t

a n a

i a a T

k i

m t

b a

S o u

a Y e N m

g K c F

o

a d

q a

L d

r

K

a i (

u a

a u

a

g

M S

M m

o K a

> n P a

w c r r

I y i

w e

i o

q O

o k t

> l

N i a > é a

k a

s C

o

a

K

n Y

g o

r i

M

> S ( S h

g

o v l

g f

n

m

o S e a

> c

u ɩ

a s r t

a a

U A a u >

g

g r n >

i l

(

a

l

j a h e

g C

i

i p a >

n N u r x

a

a g i

n A a o m r

a C j

u

F a a

m B m m

y

a a

m

g o r

n >

r

T w i >

> a a

a T

n

l S

a r

> u

w y l

)

g y

l

k a e a e a

o M o

K t

a > a

n a g s

k

a b

L

r a

n G

a

w e D o

t a j

N o

> p

a

n t k

w S n

C n

n r e n

f a

w a

K a

a

o u e a x

e u

C P K

b l

h h

c a

c

n

i

á t

D a

e l

n e

l r

b Y r m

> >

> o

N u

> e ix a

r t > p

a e k

n N

a l a e s e

n e

g > y

b a > u e

R b a

u '

n j

f >

i e m b c

e

N u m

i >

r W

t

a

o h jo ' a

N

a

a

>

u

k d

r a

r C n i é

n r G

c

g

t

a a

' e D o o a

n

> b

i v

e

K w é a m

o Q

h ú

s u c

d > e

k K

e

g

n l o m

o

i

D i d >

e

n

e (

b r

a

t x > B

a

a I > F o

d

o >

i

n a

c l

e

M n >

l

g

e a

o a

l o

o

n k

a

o Y F

e n r

a

n a

e

M u

n r

y h

g

C t o f

e M t

a C

n

n b

u >

d

a

i >

a

n

M o

i >

d

y i a

a

c Arawakan

i S

a o

> p

t

u m

i l i

g o

h

L M

a d n u

s h i t

l

o p

H >

n

g

T C i

n m Oto-Manguean

a a

n

E

a a

g

a

S m

d

e T n ọ l

n n h o > o

a e t

r

e i > k

r Mayan S

m

C G a >

d

k

n k

w I

a u a a j

c

i m

N a

a k > Niger-Congo

m

n E

i

a m

b

J

e w

a

> s a

a a r o d

i

> unknown

r K

> i

r K

e u a

r p x a

f u a

e n

e T r

ɛ

n Indo-European

l > a

a

F

>

m u

t s

>

u u ì

f i é

r

a

n S g

i a

B

d

d > a

a a Khoe-Kwadi

> o y

M

a g

o >

b o

b )

o

o d a

B i

M N n

b

n a

r h >

e i Mande

h G

t ( m

u i o

e l t

Figure 1: Clustering of 100 pivot past tense markers. Each node is colored based on its family information. Languages with no record in WALS are left white. The clustering is based on the Jensen-Shannon divergence (JSD) of markers in marking 5960 verses of the Bible. We observed that most pivot past and future markers belong to the Niger-Congo family, whereas present markers are mostly within the Indo-European family. In many cases, languages of the same family behave similarly in tense marking. For instance, in past tense marking, Oto-Manguean languages use almost the same marker, with small spelling variations (Figure 1). Although Tezoatlán Mixtec did not have a record in WALS, its marker is the same as that of other Oto-Manguean languages and works almost identically to “ni” in Oto-Manguean languages, so we may guess that this language is also Oto-Manguean, which turned out to be true when we performed further searches.

[Figure 1 legend (language families) includes: Tupian, Algic, Austronesian, Guaicuruan, Misumalpan, Eastern Sudanic, Eskimo-Aleut]

that are members of this cluster. This methodology supports the empirical investigation of questions like “how is progressive past tense expressed in language X?” We just need to look up the cluster(s) that correspond to progressive past tense, look up the verses that are members, and retrieve the text of these verses in language X.

To give the reader a flavor of the distinctions that are reflected in these clusters, we now list phenomena that are characteristic of verses that contain only one of the five pivots; these phenomena identify properties of one language that the other four do not have.

CRS “ti”. CRS has a set of markers that can be systematically combined, in particular, a progressive marker “pe” that can be combined with the past tense marker “ti”. As a result, past progressive sentences in CRS are generally marked with “ti”. Example: “43004031 Meanwhile, the disciples were urging Jesus, ‘Rabbi, eat something.’” “crs bible 43004031 Pandan sa letan, bann disip ti pe sipliy Zezi, ‘Met! Manz en pe.’” The other four languages do not consistently use the pivot for marking the past progressive; e.g., HWC uses “was begging” in 43004031 (instead of “wen”) and TCS uses “kip tok strongwan” ‘keep talking strongly’ in 43004031 (instead of “bin”).

FIJ “qai”. This pivot means “and then”. It is highly correlated with past tense in the New Testament because most sequential descriptions of events are descriptions of past events. But there are also some non-past sequences. Example:

Figure 2: Clustering of 100 pivot present tense markers. Each node is colored based on its family information. Languages with no record in WALS are left white. The clustering is based on the JSD of markers in marking 6590 verses of the Bible. In many cases, languages of the same family behave similarly in tense marking.

[Figure 2 legend (language families): Sino-Tibetan, Cariban, Arawakan, Mayan, Basque, Songhay, Niger-Congo, unknown, Indo-European, Khoe-Kwadi, Mande, Uralic, Central Sudanic, Austro-Asiatic, Austronesian, Chibchan, Afro-Asiatic]
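The figure captions describe clustering pivot markers by the JSD of how they mark verses. The following is a minimal sketch of that quantity, not the authors' code. It assumes that each marker is represented by a per-verse count vector that is normalized into a probability distribution over verses; the marker names `m1`, `m2`, `m3` and their counts are hypothetical toy values used only for illustration.

```python
import math
from itertools import combinations


def to_distribution(verse_counts):
    """Normalize per-verse marker counts into a probability distribution."""
    total = sum(verse_counts)
    if total == 0:
        raise ValueError("marker does not occur in any verse")
    return [count / total for count in verse_counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Toy data (hypothetical): counts of three markers over four verses.
toy_counts = {
    "m1": [1, 1, 0, 0],
    "m2": [1, 1, 0, 0],
    "m3": [0, 0, 1, 1],
}
distributions = {name: to_distribution(c) for name, c in toy_counts.items()}

# Pairwise divergences; a table like this could be passed to any
# hierarchical clustering routine.
pairwise = {
    (a, b): jsd(distributions[a], distributions[b])
    for a, b in combinations(sorted(distributions), 2)
}
```

With base-2 logarithms the divergence lies between 0 and 1: markers that occur in exactly the same verses get 0, and markers that never share a verse get 1.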

“eng newliving 44009016 And I will show him how much he must suffer for my name’s sake.” “fij hindi 44009016 Au na qai vakatakila vua na levu ni ka e na sota kaya e na vukuqu.” This verse is future tense, but it continues a temporal sequence (it starts in the preceding verse) and therefore FIJ uses “qai”. The pivots of the other four languages are not general markers of temporal sequentiality, so they are not used for the future.

HWC “wen”. HWC is less explicit than the other four languages in some respects and more explicit in others. It is less explicit in that not all sentences in a sequence of past tense sentences need to be marked explicitly with “wen”, resulting in some sentences that are indistinguishable from present tense. On the other hand, we found many cases of noun phrases in the other four languages that refer implicitly to the past, but are translated as a verb with explicit past tense marking in HWC. Examples: “hwc 2000 40026046 Da guy who wen set me up . . . ” ‘the guy who WEN set me up’, “eng newliving 40026046 . . . my betrayer . . . ”; “hwc 2000 43008005 . . . Moses wen tell us in da Rules . . . ” ‘Moses WEN tell us in the rules’, “eng newliving 43008005 The law of Moses says . . . ”; “hwc 2000 47006012 We wen give you guys our love . . . ”, “eng newliving 47006012 There is no lack of love on our part . . . ”. In these cases, the other four languages (and English too) use a noun phrase with no tense marking that is translated as a tense-marked clause in HWC.

While preparing this analysis, we realized that HWC “wen” unfortunately does not meet one of the criteria we set out for pivots: it is not unambiguous. In addition to being a past tense marker (derived from standard English “went”), it can also be a conjunction, derived from “when”. This ambiguity causes some noise in the clusters marked for presence of HWC “wen” in the figure.

TCS “bin”. Conditionals are one pattern we found in verses that are marked with TCS “bin”, but are not marked for past tense in the other four languages. Example: “tcs bible 46015046 Wanem i bin kam pas i da nomal bodi ane den da spiritbodi i bin kam apta.” ‘what came first is the normal body and then the spirit body came after’, “eng newliving 46015046 What comes first is the natural body, then the spiritual body comes later.” Apparently, “bin” also has a modal aspect in TCS: generic statements that do not refer to specific events are rendered using “bin” in TCS whereas the other four languages (and also English) use the default unmarked tense, i.e., present tense.

TZO “laj”. This pivot indicates perfective aspect. The other four past tense pivots are not perfective markers, so there are verses that are marked with “laj”, but not marked with the past tense pivots of the other four languages. Example: “tzo huixtan 40010042 . . . ja’ch-ac’bat bendicion yu’un hech laj spas . . . ” (literally “a blessing . . . LAJ make”), “eng newliving 40010042 . . . you will surely be rewarded.” Perfective aspect and past are correlated in the real world since most events that are viewed as simple wholes are in the past. But future events can also be viewed this way, as the example shows.

Similar maps for present and future tenses are presented in Figure 6 and Figure 7.

Figure 3: Clustering of 100 pivot future tense markers. Each node is colored based on its family information. Languages with no record in WALS are left white. The clustering is based on the JSD of markers in marking 5733 verses of the Bible. In many cases, languages of the same family behave similarly in tense marking.

[Figure 3 legend (language families): Sino-Tibetan, Oto-Manguean, Mayan, Mosetenan, Niger-Congo, unknown, Kuot, Indo-European, Khoe-Kwadi, Torricelli, Mande, Tai-Kadai, Austro-Asiatic, Austronesian, Afro-Asiatic, Hmong-Mien]

5 Related work

Our work is inspired by (Cysouw, 2014; Cysouw and Wälchli, 2007); see also (Dahl, 2007; Wälchli, 2010). Cysouw creates maps like Figure 5 by manually identifying occurrences of the proper noun “Bible” in a parallel corpus of Jehovah’s Witnesses’ texts. Areas of the map correspond to semantic roles, e.g., the Bible as actor (it tells you to do something) or as object (it was printed). This is a definition of semantic roles that is complementary to and different from prior typological research because it is empirically grounded in real language use across a large number of languages. It allows typologists to investigate traditional questions from a radically new perspective.

The field of typology is important for both theoretical (Greenberg, 1960; Whaley, 1996; Croft, 2002) and computational (Heiden et al., 2000; Santaholma, 2007; Bender, 2009, 2011) linguistics. Typology is concerned with all areas of linguistics: morphology (Song, 2014), syntax (Comrie, 1989; Croft, 2001; Croft and Poole, 2008; Song, 2014), semantic roles (Hartmann et al., 2014; Cysouw, 2014), semantics (Koptjevskaja-Tamm et al., 2007; Dahl, 2014; Wälchli and Cysouw, 2012; Sharma, 2009) etc. Typological information is important for many NLP tasks including discourse analysis (Myhill and Myhill, 1992), information retrieval (Pirkola, 2001), POS tagging (Bohnet and Nivre, 2012), parsing (Bohnet and Nivre, 2012; McDonald et al., 2013), machine translation (Hajič et al., 2000; Kunchukuttan and Bhattacharyya, 2016) and morphology (Bohnet et al., 2013).

Tense is a central phenomenon in linguistics and the languages of the world differ greatly in whether and how they express tense (Traugott, 1978; Dahl, 1985, 2000, 2007, 2014; Bybee and Dahl, 1989; Santos, 2004).

Low resource. Even resources with the widest coverage like the World Atlas of Language Structures (WALS) (Dryer et al., 2005) have little information for hundreds of languages. Many researchers have taken advantage of parallel information for extracting linguistic knowledge in low-resource settings (Resnik et al., 1997; Resnik, 2004; Mihalcea and Simard, 2005; Mayer and Cysouw, 2014;

[Figure: Clustering of 1107 languages based on their average divergence in past, present, and future markings. Nodes are labeled with language names; the language-family legend includes Barbacoan, Sino-Tibetan, Quechuan, Cariban, Mangrida, Arawakan, Altaic, Oto-Manguean, Jivaroan, Nadahup, Yele, Mayan, Basque, Songhay, Niger-Congo, Indo-European, and others.]
a a Nambikuaran S n A m y V t S la u a a e e- (T e w rn h b l N n B o n o te ra G ga a a g e g ou L es a st n y W w o nd o) f o T e i o tl d i o p W nd w L ec a C a m a a th S u a h pi N l u ah u N a sh ow o ah a o y a Dagan S N n n L r u n L n a ya an I u th co a cá p a i n n s e a a ar ia S do ha a M o C W g al e is n i st i h n t in N s e er xt ic u I B sa E g al s n e M H s p ie a ia D c i ru W e m n i B u u n b n Chiquito t a a l D r a o ka y u i U e is o N ag an n n K w e ab ya nd a W M a b e M N u B s e S an B afi t- to i c e a C Eastern Sudanic in i M m en ra ab n B a b t k r ia V u c e ra A g ic l e l U r r l T ag g d L rd o a t u l a o i a e h a to m a r n m d G m u n n P ia ia b n A ah a a N a u n n a ta r si n ta lu Afro-Asiatic N e r o o m S p e x c A r n o s P a i d th o g g E S d h M h n S in n n c F a o er a i lc ia w la ts e i i la n s e n o e e s C j th K aa t a L c i e n a i i l e Ir I td ro ia M r H li u a T u a s sh D í ib i rd a F ri i n s b n Dogon l a k d i P F m a M r i e i sh e m n M a i t a rn l r ia y o n e F e b e a a H st G a s n L y n e w te a N u a g in S l i n o n n d W a tv a S o ( g a u a i n r m K n s M tv ia T a a t East Geelvink Bay L h n a e a a n s S e a n L o i k a n n t n n I o t n d y i d s n ia s c it T a r E i h T th h e e a a F n is J o i k o n d a l h a u m a o n d n u o c k C m r u p g a th P e a a s a o T t i z v n E la a n a S L C a S g s i ( M n Hmong-Mien lo i h s c C C r c A in y i a ô i h z S a n e p c n t x i a v e ix a W w te t a e e n n a r j a ia l C a ia F n h M e R d n ) B M a is ) l d n S re 'I t n y n 3 G s h o A y o v e á a n a h i m r o c l N 5 i i O u z r i le t p 4 t I j a a a a ir a S 1 a m r z a t n b c E e Cacua-Nukak ) A i l o L a e s u r á i i n ) C to o N s h m a n c g ( S g ig a s ti a L il y a M li k rn u D m E u a r c s e e g a o w m i ín i h e h n y g r P w a x r t a r la B o e o b n C e r l i o jo g a G o l g G u a N h t a i P n y W s o e i Eskimo-Aleut n N u n T q l n e u K i u e u o o a 
i d V y a P a s - c i b t t e A n v a u d a Z A la a e t n i s k n n i l r r e d p i a M c te t n a c A e d e a h e n u K r E a m i w S n o a é p H

e c u ( K r i u ) K a l e a i a w m a P s a M c i i l R i a o g c i y w o ta t y n a K b o T W u m e a C h h l k g t r r a a i k e o n l l h la a o m l M e e b a K A o d h p a a n w P a e á s a m g M t l n E a t e C t d S Z a I d e o e a i ( o l n A s m n c a n g u h n G r h n s o C a k e o r M n e a o n y a a lf - u a a h b H n a t K j s O H d a m u s a A U i e a a m M k i a a ' r m i t u F a b l F k r 'o i a i S m n g n e K i o H y m e n b h a d D a o e g n r h o B n a e T k i i a h l I n a i e T i w h a o i n S m h u g q a a n a s I k a d ) D e u d n n B c C h o N m u r a l e l r a a c a d u h o M a a e l l k a t a c u n e k a a a f t W j n f t g d o r c n h P i

a l S i M M a M n fi A a o o o t

s B u e u a ) a n d u u u a n l a n M a E l u r a r F L o o Y k o t o n G s a l T t e C i A a t N p a l

S a c r a e u s o p u d n l e T r h T n n e N i c i Z t e

r r T n a á w r i g l i y o c

o y l u C t i e n S

i n e a c i u a c o a n h t n R n i

h U a a l i i a i p u N S b e t C r a a a l r n a o t t a o ( t u a a m t c - E a a e u n h W x o

t C n u M b L t

a a T E s r e P G i n u r e e

h t w u y a r K e d o a z T c o m I n t S m G u M n o y a e t a o e h n l e T n a i h W a a e s L a n a t n e G n n n a

e e n S G o l k

n Y B a k p c n r a h n s s p a a c G a i a

l k a o R a

A o i e a n M e r n o T a n t a

t W i t g w e T e e t t a p u w i T a n w e X M u k g s i o C l t C n e s n o N í - r e a w u I z o

w o e t p o

i o a n t a l l a l o t Y a o e o p n t h ' C a M i e u r e a h t

r a d n a u r c a o z n M o z g l a x o a b a a n a l t i Y t r B n A ) a b S K w t n l n P i g n l Q z n a n u N i u e t n o y o i a B s

- s c i o a a o l d e r o a l O a a t a n P b y c u u i a a a t j y e n o r u o b o e e e H a T g a o n b t i

e r c o a n y n P e o o a M k n n b ( H H a n n a a A r l E i n H e n g o b t h a C o t

i k t n a ( S r y n e M i u t t n b d J P r C i i e e

o n z T y g a P a n e Y o a

o t l c a p d u s l n u u x l n e e e a a K - n l c c o i a u n w P a n e u a a u t i M e u a n a a a n p B e k r Y i l n a a G a c P h M i i t y i u e

M a w l a e a c u l p u w r w ó p e c M

s o H l a r t e a ( l Z g c E i a

' a n c G i m

a Q n M a u a i o g i e N e S a a i n u p A a t o f t C T y u a K y e a G d S b v l

N a s e M P a a s a n b u o M c a N m k i o n r M a t a h h u n d e e o a i a s s A g u o k a l O l a M i e e Z t u a u V w n

k a N a a k s i N h d m u i r r c u t g l g n x t l n P u n f e i y a n m

u H u g g b e k b u o a p t n a n D P g a a e g e N m o t g o o c l a a l e b l e C ( s n f C u i n i

C l a a u a a j p t n i a M n o n t u M o e a b a o T l m o l v o n w o e P a h a r N a i e o a S a a e r a a a u ( u x l a a a A é h e d B i m A y k u o i - P s ( o e e s n r O h p r g r r e c k C w r t a n y M t o i

h n ( a l i A u s m m u r c C p u C o A a n F a l E

c o T c p a O x b h C t k a w d c A d c a i i n - t n l i s o s P a K n n o M u a u m k a H e o i r i K t a n H a U h y r W n B u g i u o c an x a a N a i g N e a n o M i c T P i e a e o r a a l s m u I i t u i c o

p e W a u a n K i a N i g k o a e n K c a

h g r n e n M h t b c a a n W a a - n e m a n o m a - b e n h B i q Y o t t t b a n i d a b v l c z G a i p S i a n r a G t u r A l p I k a a a W W a y a K e I p g e i p c a é

e N a u a i b o W I l h b r a a M e r y A g a a i a ) r Z k m a t r r i A k - a e i t a ó a B A i S e u m h a e a c l a y a l N H g u n r s i r m u p S P m l e u s a h t S a a p o d t k u i g e o n L a p a l n Z u c t c i u g n a Y

o i n n o e a d o W k a a l t n u x K a N o n a i a l l t n a M

s a N c n r i r a a i á D t q i p y s a o l f e K a e n l e s L u o a s M e t m e p i

t a i a n P D a e u p n A i i h i C i M a r c ea w e i B m e a t m i i d u h N h n a w a B a a T B b a t a á K o M K y K x a G o n a t S f r r y o P o i n n n e g o t B r a h e e i T o v o A o T a

a a a T e l a c l m b a u t í T y m n a a u i T a i l l P o o e u r á n o p e N a r a c i C s u u n n n K D l I r l a g i o t s c B u t m i h n a E e l u i N s e o a K t n a E a e K e c G y i - a n u b d a a a o a o i a e a a l a b T a d a l o u u s u n T k g y a n h g d a e r i g c a M u e l a a o o Z s g r e o m o s l u a o - o a S K a Z t l a g e P c Z g a a M t t a y n e a e r m e - W a ) H r Ai u u p k a h n s r h t N k a p e t e u r m n i y a g n k r r a c a v i e o e a r x B G r h i e n a e d a r T g i p M A a r t l a n u r B o t h M T a h r u a K h C t e y t a m y e a ) h G p i a r e a K c a o t t h a a p b i a M t C a r o w e s y a i v e o s i i e g u o n l w a w y o c l s i C M i k t u O u g a l m i a a n i m o a r N a o p d M l a e u a n n r d q p n a k C D S B r n i i - s G i a e p j t o r a o a y C a o a a u t o c n a A l W l i z u b n M a B u n P i y i r u K y i a i h a a o a I g A o l á g m a H T W a

i a

a W N T o H e a a r h u o g W z o o C a u o a j y o a k o l a a n e m d n r a i u P a i n a o G h a I r u D e a A a t e - a s S c a u Z o a o e i t y r m i k t t t K i A a a t e U i a S r e c n o f f t r s t i a m l a - B l n r r m C n y c B z n a e a n T o n e F g S m S i Q T l o n ) e w o t n e t e ñ u d u

a u B t o u n ( S R p u i N a a t o A n u n r i a h u a g p a c A g e c s m t b c l n a c t a e i o c u e Q l b e n c M g i n n u a u e e e n a m h r t a b h a n u d e S h o i l a o i h l u i n n C T l n i t m t n e a a a t a K a K c o N i e i a a l u c Y l a b o e p m A m o a n L t d h r n - c a e n B r T a K r n fl h c a t b e p S u u n l t g o u B G s u a i P n e F - e a S p h a t u a i o á a a - M u M g h e o o a e Z a r e b i h i n l o i

v a l p a a n H - M á r c p c Z ) e t r B z - o a C p t a r K t A t m e a r c N a m u a h Y N e l S t a i N A i s g r a C i l h y g ) M a a n a d T o a a s l Q l c t a g i x o p c n r ( H a M u i o l a M i l n p t u - a y u a n c d a a e o o w a a a W k a c a t h a i i s a h o W o u C a i a i P m r r a t a H a i l h c o t r a a m - n i K N h x A m l r T s a t m h í l s u i K x i

u e Y J e a e h a n Z a a o t i e n h h l S r Y q m Z A c t r a a a A k r z l h s e q a a i u o c c a u o i A

n t a o u e u J s c a h i i n S t p s u n K a u é h i o b p A t m g A k h r i q i w a o z á s a G k C s a h a o T e u S t p a e t e o D bg s l M n b z a t n U S i a u e i a P E o y t n l C a h r C a o a e o N g g o a g o c a c ca l m T e r a a u u e r n I i A P s u u u l e C t e n v e e M u g i ) a - t k r n a M t a a n s s e b a n m n o b a M a i l e k u n e a I n j a a a u l e M s a l n a g n ú a e A o S v i n n r P r u g o S s u a ó i - m d n n a t B u C o n h i i í A I K a m a u d l a v u h A l á z v n n a n u g o C h n e t R S i C t a a e g a g Q u p d l O M r C J g a O e p i a n n i A p u j a W Y l u i a p l i t o n T u i o l e S ü o h ) a r o a h s i n a e F a i n o P a a J a W t c l a a e c t n d r M u d 0.02 o b a a u a a u h ã M c w b l a a h s n e J n B K a a b a s o u a u i n a i L k n a - s r a r e g K a M a n l m Q n a A a g a r S a o U o X K B M g a n C n ( t u Z N a b h m n i K K g s n i I e a a i i o S o o R u u a r o m o O c i a C e h n o P a r t e a n fi a i M g a c m q o u i e o m d t u a r k a b a s C n e a o e a n n C m a m t d n i r U b p y N a S t p t a C n a M g u n u a n a a i a g d a a l a L c a l O l e g h s i r e s í u c n a a C k b K u c a h u a p o t a o r v o g q i a M n l a l h s a Y a p t i e i D - d i o g e r Q y i B n m i a L a n c r e i a n e n H i T N u a a P B l t S o a i n n n a e a r n C g o o J i a i l c C n o n u e A m k a a M e r e s W A r r G a n o a e s a i o t á í i T t r y h r u r o o n s o g a n y i e Z a n n M u c a e a b a o n a u n u r z a a d a m u h T r i E r i g a N t t n T p a c M j I í S l a n i n i g L a E o o b U u d lé n o ' t í F n r n h u A M m t e o T n é i e í D o i y p t R a t G u b n c th r m h h a a H L a a e in e M w z e a u g h E T t e d r c K o n r c é r i l a i á n M o e a a ó s n a c a t b S o r a A M b x e n g e u - s n G C h w c i F a l te O i a a ʼ n B h la a a Z ü a k i a S a g m s e rn z h r a n S u n e s c e o u Z a i r l e o a p B l a p s i í B b p k o a Figureo n e 4: Clustering of 1107 
languages based on the average Jensen-ShannonG a p w a divergence in past, C o t t p a r a r O e l M A o t y Y a - g e l p á O e g ê é A n iv n l te c M l h ip th n e r M e u u a c ta r a t ia c Z o k n c B i iy i u r e o a n Z a k a e M u k b B l p a r o Q u N T G a o i ik m e i a a a u p t v B u i a g fi K o e a b b g ' a r a t c B n i i a a r e v m h n O M h a c I a c i Y B u n n A l i c a T m a í M o m M N o o in a s h - o s t ta a i T c o a c o o e N k a c q o r p v a r A o r m t n e o K e n a i a c P h d c b h a P b f a a w ú o a a U ri m e n e a M n r a l a s o o present, and future marking of their respective top markers. Each nodeA ish e coloredo based on its fam- M i k o t x b C b H A S i t u o o ro o l x e K a r n u a in K S t c d a n o a r S á s a a e N y li e e o n k D a s b c a m M s u a K u e u r o a o n i m e t u o g I B r a n h n h c I a y o m h o e a a i e o n m a r C r g K h r Q u a y o M o w C n p s n a rn N u i a o K ra b e n é e a u a a ri tl a C c tu g th a w H h m h h n S a r d a I á u T o n M a n b c a N a - k g ik o a é k a u b M lk r V a n á o u te é i C -H r a i y e T a ily information. Languages with no record on WALS remained white. 
It canS S beip a in l seen that many small h tn i h o in o u u c T a n a l W p i q v te e m e i a n a a tu s A w o h z F n e e K g g a a a se l a w B S la J D A w a M i m a i a a iy P ra li M aw tl i ta i C ij i u á h T n i D a m m s C h n u u á B im h in H n m a id i a o g H a an n S n n c C u y é a o a o e e a u Y d b o nt S n S st h N m n a t tr a e e u n a a N n c Z an hi nd l g g clusters of nodes have the same color, which together with our quantitative evaluationw C supports that if M a ir H q ar a j ra la e d S l u I si im B M a ay o z i u sa U u n is N g al k K o e K o a T in a o g in m o e y K en rn at a b s er e h B T e e N u th a w (N T b lo u y o C a a ab o N o n in d n m a li S ri e at an a i ru ig k y h l b b T u u C gh w ia d m a i N an ) U u al H l A a a M p rn se ta F U pu o te e n ra E la m divergence of tense marking is low the languages are very likely to be genetically related.N s r o n a u e h nc a T g a E d s C i st o a a ne sc e k Iw n M li co ra o rn P a s n a L T i al B ba aa ár in H e a K si a f aj h a ón w u n T a j C w Z b ot N ua ai o ui G m se ta 'i q d aw e k g W Pi ue B rm ta A e T dg u n an st ic i B la irí y N er Q u n a a ga ig n 'a N na B ak o a e K nj a B re C ri a ob ro o al an nj 'a y tr P o l A n N M i ba e a T o a T dg l C ac or r d G ze in M re th a ik lt g n s er (N y al e oa í St n ig o D m ar ra D e de a e it ag ri S um t C a a) a án ao P re ra P av g K at o X an a a pa le al y in ta B an ba ga r s ne G Q ng U oo st E P T aq N we st la am K et im th ad po p rio K or a o K ul l N a de ru m uw i M N m a K ss am M é d en o ng S ez xi og M a l e Q qu co o g aa se ue it O Ix N s al lw ré al to il Ku sa L t Christodouloupoulos and Steedman, 2015; Lison lel corpora: a set of labels that is available for a a C ar Ot m a am 1 o re O om i P ch a o t i Nt ob a C le om m rr ha Fr i Bi wa va en ek o T ca ch B i th an no oz So go L ko rn G C a a he i hu K ut i ton j So ed K C ga ep na w ho S wa ar pi Ts la go D a'a L yu ara an j e 
and Tiedemann, 2016). is projected to via alignment links withinx the i b m M I 2 D m ga in ka a n sh i S ak B aso an gli en aa a ni n En S B ou X me ye le ch an a fo Ar ris eo ren Ju ba Ig o Cr F an M bo M o d ole C Y ala ri lan re ol M or y K Is C G ora am ub ea t ian o do p a S fu uc urm M rul Ba t L B an ix i ain ål ul A ch tec S km u ( fr ém ute Bo rsk Ca ika a L V r an no m parallel corpus. labels can either bea obtainedi y e ans 1 Tik eg N roo rw ian K n) No eg on orw Ek ni N ish Ku aju an h ni M ra k D glis ma B an nko En Ro uli din ax (G ka Vl ish ) ) Y ha ed an ian 53- 00 al na) Sw ani an 14 -15 un lb Alb k ( 00 ka A eg ree (11 Sus 5.1 Parallel corpora and annotation h G sh T u G ern gli Hi upu through manual annotation or through anod En analy- U ri ri M le ole lish Y rak Mo idd Cre ng oco La tu M an l E bo wo aiti rio ué i' H e K Did eliz Za a B a iwa ob e La M efar shi Far Ma nda to K ru Ve men Sou ah pia ou the H ua Pa ndr p rn So aus projection yi oke P u a N -L ueb th F L rop l a sis module that may be available for , but not a M li 1 A am ix an n Tim K tec M nya r ug on a'a udu on M so M u-G Y ur of ane ut M we Kan sha Irig í h kan ' yab enya Ny aey Ka u K ank aha ila ole W mba amb Nyo siki n M S ro T eroo out Kut am heas Sh u L C a te o for . 
We interpret label here broadly, includ-rm rn P na 2 Za u L ebla enje Suau Na zo T huat Nin rinit l In general, parallel corpora are a resource of im- rib ar og Igna io D ien ré Mac cian Iu M ada higu o i bo M eng Lob n Bo So z a uther uth N ulu So Huam No deb anga alíes rth N ele H ) M -Dos debe Zou geria argo de M le ing, e.g., part of speech labels, morphologicalktags (Ni s- G a Yaro ayo H usii M Chin wilc uán Tha dim au a-Lau uco Q raka Te gkab Sou ricoc uech inan th Bo ha Q ua mense importance in natural language process- M nu N liv uec ma orth ian Q hua Ter Chin Hual Boliv uech akha laga H ian Q ua H krou uánu uech Adiou Nort co Q ua h Jun uechu Hdi ín Qu a odo e echua Jur M hines Kamb Nan C aata and segmentation boundaries, sense labels, moodMin aw g D Tatar Hmon Bashk un Kara- ir G Kara Kalpa ing at least since Brown et al. (1993)’s work on Wa cha k e) y-Bal arauk anguag kar P macrol Uzbe alay ( C k M se huvash Javane Easter ese ngo) So n Mari Achin ufo c of Co uthern ire Seno epubli South Altai Supy ocratic R Azerba labels, event labels, syntactic analysis and corefer-o (Dem ijan Mon Aze i eng rbaijani Mangs Koore Lahu Gali te machine translation and they are widely used. tec bi Car chang co Mix C ib A na Peñas herokee Magdale Arabela Kouya Osse iyobe Ru tic M raní ssia Buri Mbyá Gua at araboro Oirat ence. 
We can only cite a small subset of papersE us-astern K Khakas Waama Chim Lel ündü borazo Hig emi M bila hland Quic In addition to machine translation, other applica- Nigeria Mam Cañar High hua land Quich Konkomba Ayacuc ua ho Quechua Kamwe Cusco Que ojolabal Ajyí chua T ninka Apuruc Obo Manobo Ucayali ayali -Yurúa Ashéni Nakanai P nka ing annotation projection published in the last two ichis Ashénin Dan ka w Guinea) Huichol Kara (Papua Ne Ama (Pa pua New Guinea) tions include typology (Asgari and Mofrad, 2016; Guinea Kpelle Chachi Baatonum Woun Meu Tsonga Alamblak Romanian Mayo Croatian Gwahatike Kutep G decades: McEnery and Xiao (1999), Ide (2000),Japanese umuz Northern Paiute Naasioi Malaviya et al., 2017) and paraphrase mining Southern Pastaza Quechua Telugu Luang Kamula Bola Fore Avokaya Mountain Koiali Wandala Kwoma Yarowsky et al. (2001), Xiao and McEnery (2002),Sundanese Lambayeque Quechua Kenga Sharanahua Kaiwá minahua (Bannard and Callison-Burch, 2005). Galela Ya Northern Oaxaca Nahuatl Nadëb Sukuma Mbula Achuar-Shiwiar Sepik Iwam Chuave Halia Asháninka Diab and Resnik (2002), Hwa et al. 
(2005), Muk-Mokole Awiyaana Lukpa Kosena Wè Northern Páez Lote ng Annotation projection is a specific use of paral- South ern Tama Giziga East L Kyrgyz ampung Api Shuar Bislama Kazakh Gumatj Dja ash Quechua mbarrpuyng Huaylas Anc Cent u ash Quechua ral Mazahu nchucos Anc ua Duri a Northern Co rímac Quech Eastern Apu chua Gilberte Wanca Que se Huaylla Imbongu Bariai Bo-Ung Tamil gu Sio Umbu-Un Kube Burarra iawari Tinputz Somba-S B Borong ambam e Kon Map o (Sierr awa Pinot a Leone R epa Nac ) mane Toraja ional M uhu-Sa -Sa'dan ixtec G abak Ngawn N T Chin -Yegha achelh Korafe a Am it Yareb bulas uha Merey Yawey epet Chortí Sel T kyom imne Ae Ta Yipma wala ng Sout l Hene hern S Anga jabi Ramo amo Pun aaina mbisa Kekc Hua o C hí Yopn huuk a K ese Hay uanua otec Serbi i Zap M an tzach otec arsh Ya Zap l M alles gocho atmu buk e Zoo I a Ju o agan kun T Key aili Ku aku do K waat m Le ña Chu aay avine mbu C ufa Shill rung Usar S uk otec ilaca Zap a L yoa lálag arun achi pan Ya Agu i Is xío Z Mix Bauz thmu apo tec o Iz s Za tec Gam ere pote wa Gu c Ojib de eta We kwe stern Mar bu Mun Pe Lim l D dan nan jona ii i Pa n Ma inka hua yoy hén epe e Gap ao As rn T Mix apa Ifug rthe ila ui Kim iwa ao No Juqu Yaq M ré bwa otu Oji a Sou ern ehu A the Sev Tep n nu rn B lco hua M fo irif ichi epe ec uya or ach rn T pot Ca ng Tl aste Za ewa cua the lley t K a M Sou Va es ew elpa lula W st K o Co laco Ea nib ma n T -Co ra Go ltep ster ibo Co kan ec We hip esa ra Ku a Chi S er Co H pan nan ta T yar ga el g M tec San Na pan on al El am nna To g ay P -Ba o ba er Bo Pop Ham n K ti' yria wa i-Z ma N ny om ru yab am K Du ew W wa a ebr est H go Ka ern on G diw Bu Ko ba ar éu ki saa i A o dno Ma dh gu n M Sin o Ta aca a ay kia tec nob -M aí Te o o san pal M peu es A kú a xi Y ru Y m la C ndu nka aw u u Di ri Bu a ica M rn aa C hi tec ste L r hi d e ifo M qu thw Bir u u ita ou a am K rle no S alb Bu na ad M Ba E az ng se an ya re Öm Ejj D en be M ie a us K gä c e un N na C nta to le en w To e M 
tra ai tla Y ga op l B an o i U án o ap ato er pp M nto P D ka i Ta er ay k ara a'd G laa Ne a m M t or nd ca A léa Tu on ig xa ga ec D w tal To an at o ali o to D az fán G bu If na M o ís ar ug c az C c Zu ifu ao Dí are im lg n e P ar Qu o a D h u A io -G pa en om d te e la T B ier B el p mz Ja rr o e ec ek Ca on B ko C og o us ba hi D an K a ru na o uc yè a nt S T bi S m ec ro a er A an be To K m r M ra Kh ua a l h K i te tra ec en a o n u ar M y De Ce Q K an a ap l ca w li P n ó M ar a go he ar gg ar m 'g n ac o T kw a H ja S o p ha a ra u a M A om se A d i av C lh rn r o R ba o e a te O ko se o u C H es rn A e e D to hi ste gu ld S el ka n W a tu fu pi a o s E or ul o Iy r P F H e o an le Iy jw ri Se c W o a e te i o 'w 'ja ig po rh O lo uj C N a ga e w f w h Z is ag o D a a o tla tt ur d M ita C ro i ha G ra a a m ho te M h t lo hu M s m ro C e o c a a an a t t B C ue nt L m a ri e a Q ar ra B am a eb d rr a a e a ra S n A m y V te ( S la n hu ba la e - T e w er a a N n B o no o st ar t G g a a go en go u L e T s in oy W w d ) f o W d e L c tl a dm i o ap n hw e ua C p N la t S h ua N h is w ou a h a ay h o S N a n L or u an L n an a an I u th c a á p ay ri n n s e o ac a a ia S do ha a M o C W g al e is n i st i h n It in N s e e x ic u B a E g a si rn te H ss ie la a D c M i ru W pe m n B u u n b in at l D ar a o k y gu ni U e is o a N a a e a n n K w d b ya n a W M M a b B e N u n s e S a B afi t- to ni c e a C i bi M m e ra a n B a b nt k r ia c V c e r A g i u e al U r ar l T ag lg d rd o h at o u l a o L a e u t P m a r n im d G m h n an ia ia n A a ra i n N at u n n b ta e rs o a lu a S N p e x c A o m n o s P a i d rt o g g E S d h M h h n in n n c o e a S c ia w la ts e F a l rn is l n o e e s C i it a a te a L c i e n ji h K a e Ir I td o a M a H il u la T u ar i h ri i r a is is D ís b in d l F r n M a k b d i P F m a n r i e i sh n e m a M a i to a r l r i e y n e F e b s a a H st G a e n L y n e w lt ia N u a g in S a n o n n d W tv ia n S o ( g a 
M a v a r m K n u L t i h T a a a st a n s e n a e a L o i n S k a n n n t n a I o t n y i d s in i s c i T d a r E n sh T t h te e a a F a i h J o h i k o n d u l c a u m a o n n h o k C m r p g d ta t P e a a u a o T i z v n E la a s n a S L C a g s i ( M lo ri h S s c C C n a c e A i y i a ô i h z S n x n p c n t x a v e i ja W w t t a e e in n a r h e ia l C a i B F M n s e a R n S d a a i ) M l d re 'I n ) n y n 3 G s h o A y o v te lá N a 5 in a h i m r l o c t p 4 t i O u z ra ia e i a a Ir j a a c r o S 1 m ) A i z a t n b E e o L a e o s u r lá i ic n ) C t S g N s h m a n g ( ig a s t n a L il y a i M li k r u D m E u a c s e e g a o m ri í h e n g r P w w a n ix r th y a B o b n r la ir o l o e o C e G o g u j a g a N t l i G P n y W s o h n N a u n T q l e in e u K i u e u o o i d V b y a P a s - a c i a u A l t t e A n v s d a a a e t n i k n n i c l r r e d p i a e M te t n a c A u d e E h e n K w S n r a é a p m i K r e o ) K a u H ( i u l e a i a w m ia P o s a M c i i l y w R o ta W u t y g c i n a K b o T m e a C h h i l k g r r a l a a o m k e M o e e b n la l h a K a A o d h s p a a a n w P t l e E a t C m g M Z I n d e o e e ti d S ( a o l A s m n c a a n g h n G r h s C k u e r e a o n n y o a a a f o M u n a a b l n a t j s O H h m u H s a K e a a d a M k i a U i r i t m u F a b l F a 'k r m 'o ig a i e K m y n e n D a b i o o e H a m d h n B e k i g n r o n a T i i i o h l I n i n S m h e T g w h a a a a n a s I k a c d ) D e d q n r B l e C h o u N m u a a h o c l r a d u a t a u n e M a a e l l a k a c f t k W n a t g d a M l o r j f n h P i i M a e n fi A a o o o u u t s B u a u a a n n l d u a n M L E l o u a r F n Y r k o e t o i A G s a l T t C c r a e N p a s l p r h d n T u n o n e r T i e i g n a y N i c r y Z t c e C w i u i o i i u n a c S a h R a l o a n N a b e i U i n p a S l r u t t C i r a a ( t a a a t o M a x E m h c n u G i b a o n t r e t P w L e o K r y a e T e d o m m a z c t t M S a o I u n s n a i W n a l y a e n n n 
a e e G a a t n p n Y B a a k S o n l a G a n a k n e a r n o o e k i M T a t a a a W p a w i t n g w n T s X M u k g e e T i s a n i o o o z t C n l a l n I a p a o - o o M o p d n t h ' Y a z a g i a o a e u r e u a h t l a M o a a o t P n B i g n A ) K w n t l z t i u n S u y e n n a a a o - s o B s d t t n P b y l O a a i a e o g y e u e j u a a n P r T e a b o M e n b t i o o A k l n c i e n a n o b n r C g r i k n n n a ( o t C i y e T i u t o n i o e a y J a P e s l a p n u K n g P i n t n M e e - a a n k u a a n a c n p B e n c a i P r Y M i i t y u l M w r s H a a r a w l ó p ' E l l a o n S G a a i ( a M a g o i y e C i i t y S a b v i n u e K a u o N m i a l b M n a a d e M o a n h r u N k l a V i w n e e a a k g u g i N a h r k i t s u n r f i m l d l g D b o k p m n o P u n g s l f b l a a o a e g n a n e u a a e N ( l a n o o o i j a é a b a M l m w l a o e m a a u S a a v o a A k u o i B r s e ( o i r s t g w r O n e C m n l p ( y M A a o C A t k l E i s a d O n A C a n i a n C m a H e i i P d T w a u t k W u a o U r h y B m i g N T c e a a g n o W P n x t u I a a a o i c a r g k a g r b o e M n a a u e i - n m h e n h a Y a i o b l S p I k n a I u r t t i a t a n r a p g p G N a a i a y r a r y A a c - i ) i e o e a i h t B A y a t r a A a e a a S m a h i S l n s r s e L S P a p a Z o u ó n k u n g a c a l p n l k a a l u e Y a i n u i K o i W L n s á t e p a y a t D u e K o a M a i n l m p a a a u p u c n B A i e m h e a l d N h n e t a B a b G i M M a B a T a K S a r K i v t B a e o e i o n n a í A r o T a t y m a T b e n u T l a i a a i T s c n i l n e c p N B D r m n C o l u t o N l a o a a u E n t e - i n t n a e a h u i y b d e a a a l n a i a l b r e k u i d u e a T a a c a a n g y t a e o a i o o l a e Z - K e o P W a S Z m e k a m ) M t - Z e a e r u g

c h m r N p r r a n v u n e t a a B a g a e k r r o e k r e

h r x g p r T G a o t a h i y i t A r u t i y m a a a a b e e t e u w t h M ) p i e C a a o a s i w o o a y g i M o u r e t C i c o i i k m a n o l a n n a i a a p i d i q p D r - d r S i r r A a n o W a a n c a j t l u a a p i a l n n a b W P y i l a i I g a T a z á W a h N H a

T z a u o i r g u W o a k a g d

l n P e G o i o m a e s S I t e a a U a o a i s y K i A m i a t n i t e - e f a l o m f n a t C a y t n F o t e e e Q T T u R u u S

a ( e ) u n p o e r a A u a g a a n a c l b i A M u i c c g c a a n u e b m h t a d h n S l n n u a n i K e m o N t t a i m l a c e - o n l

n d B a n e c t r r a K p n b u o B G o e a n g á u a i a h o M o b e a l H o l v c a - ) Z t z t a a A c r h A e

i S i N g M i n r l d a

c g x ( M a u M o i t p u

e c d h s u o a i h H i a

a x A l t u i s

e a n e

l Y q m r c Z

a

i

u n

h a u

b g p z C a

u o g

b

a t o e

g m c

I e

)

0.04 Verses marked in CRS Verses marked in FIJ

crs:ti hwc:wen tcs:bin crs:ti hwc:wen tcs:bin crs:ti fij:qai tcs:bin tzo:laj crs:ti fij:qai tcs:bin tzo:laj crs:ti tcs:bin crs:ti tcs:bin crs:ti hwc:wen tcs:bin tzo:laj crs:ti hwc:wen tcs:bin tzo:laj crs:ti fij:qai hwc:wen tcs:bin tzo:laj crs:ti fij:qai hwc:wen tcs:bin tzo:laj crs:ti fij:qai tcs:bin crs:ti fij:qai tcs:bin no_marker no_marker

Verses marked in HWC Verses marked in TCS

crs:ti hwc:wen tcs:bin crs:ti hwc:wen tcs:bin crs:ti fij:qai tcs:bin tzo:laj crs:ti fij:qai tcs:bin tzo:laj crs:ti tcs:bin crs:ti tcs:bin crs:ti hwc:wen tcs:bin tzo:laj crs:ti hwc:wen tcs:bin tzo:laj crs:ti fij:qai hwc:wen tcs:bin tzo:laj crs:ti fij:qai hwc:wen tcs:bin tzo:laj crs:ti fij:qai tcs:bin crs:ti fij:qai tcs:bin no_marker no_marker

Verses marked in TZO Verses not marked in neither of languages

crs:ti hwc:wen tcs:bin crs:ti hwc:wen tcs:bin crs:ti fij:qai tcs:bin tzo:laj crs:ti fij:qai tcs:bin tzo:laj crs:ti tcs:bin crs:ti tcs:bin crs:ti hwc:wen tcs:bin tzo:laj crs:ti hwc:wen tcs:bin tzo:laj crs:ti fij:qai hwc:wen tcs:bin tzo:laj crs:ti fij:qai hwc:wen tcs:bin tzo:laj crs:ti fij:qai tcs:bin crs:ti fij:qai tcs:bin no_marker no_marker

Figure 5: A map of past tense based on the largest clusters of verses with particular combinations of the past tense pivots from Seychellois Creole (CRS), Fijian (FIJ), Hawaiian Creole (HWC), Torres Strait Creole (TCS) and Tzotzil (TZO). For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language. The sixth subfigure highlights verses not marked by any of the five pivots. Verses marked in AFR Verses marked in ISL

[Figure 6 panels: Verses marked in AFR; Verses marked in ISL; Verses marked in PAP; Verses marked in RRO; Verses marked in URD; Unmarked verses. Legend entries are pivot combinations (e.g., "pap:ta rro:hanona") and "no_marker".]

Figure 6: A map of present tense based on the largest clusters of verses with particular combinations of the present tense pivots from Papiamentu (PAP), Waima (RRO), Afrikaans (AFR), Urdu (URD) and Icelandic (ISL). For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language. The sixth subfigure highlights verses not marked by any of the five pivots.

[Figure 7 panels: Verses marked in KLV; Verses marked in MSA; Verses marked in QUC; Verses marked in TPI; Verses marked in TTE; Unmarked verses. Legend entries are pivot combinations (e.g., "klv:dereh msa:akan quc:na tpi:bai tte:kani") and "no_marker".]

Figure 7: A map of future tense based on the largest clusters of verses with particular combinations of the future tense pivots from Bwanabwana (TTE), Tok Pisin (TPI), Quiché (QUC), Malay (MSA) and Maskelynes (KLV). For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language. The sixth subfigure highlights verses not marked by any of the five pivots.

erjee et al. (2006), Padó and Lapata (2009), Das and Petrov (2011), de Souza and Orăsan (2011), Nordrum (2015), Marasović and Frank (2016) and Agić et al. (2016).

Of particular relevance is work that projects tense: Spreyer and Frank (2008), Xue et al. (2013), Loaiciga et al. (2014), Zhang and Xue (2014) and Friedrich and Gateva (2017).

In contrast to this previous work, the labels we project in this paper are neither the result of human annotation nor the result of annotation computed by an NLP analysis module. Instead, we interpret words in L1 as annotation labels (words like CRS "ti" and TZO "laj") and project these word annotation labels to another language L2.

6 Discussion

Our motivation is not to develop a method that can then be applied to many other corpora. Rather, our motivation is that many of the more than 1000 languages in the Parallel Bible Corpus are low-resource and that providing a method for creating the first richly annotated corpus (through the projection of annotation we propose) for many of these languages is a significant contribution.

The original motivation for our approach is provided by the work of the typologist Michael Cysouw. He created the same type of annotation as we do, but he produced it manually whereas we use automatic methods. The structure of the annotation and its use in linguistic analysis, however, are the same as what we provide.

The utility of the final outcome of SuperPivot rests on the idea that the 1163 languages all richly annotate each other. As long as a few of the 1163 languages have a clear marker for a linguistic feature f, this marker can be projected to all other languages to richly annotate them. For any linguistic feature, there is a good chance that a few languages clearly mark it. Of course, this small subset of languages will be different for every linguistic feature.

Thus, even for extremely resource-poor languages for which at present no annotated resources exist, SuperPivot will make available richly annotated corpora that should advance linguistic research on these languages.

7 Conclusion

We presented SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We showed that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produced analysis results for more than 1000 languages, conducting, to the best of our knowledge, the largest crosslingual computational study performed to date. We extended existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: we only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

8 Future directions

There are at least two future directions that seem promising to us.

• Creating a common map of tense along the lines of Figure 5, but unifying the three tenses

• Addressing shortcomings of the way we compute alignments: (i) generalizing character n-grams to more general features, so that templates in templatic morphology, reduplication and other more complex manifestations of linguistic features can be captured; (ii) using n-gram features of different lengths to account for differences among languages, e.g., shorter ones for Chinese, longer ones for English; (iii) segmenting verses into clauses and performing alignment not on the verse level (which caused many errors in our experiments), but on the clause level instead; (iv) using global information more effectively, e.g., by extracting alignment features from automatically induced bi- or multilingual lexicons.

Acknowledgments

We gratefully acknowledge financial support from Volkswagenstiftung and fruitful discussions with Fabienne Braune, Michael Cysouw, Alexander Fraser, Annemarie Friedrich, Mohsen Mahdavi Mazdeh, Mohammad R.K. Mofrad, Yadollah Yaghoobzadeh, and Benjamin Roth. We are indebted to Michael Cysouw for the Parallel Bible Corpus.

References

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Alonso Martínez, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics 4:301–312.

Judith L. Aissen. 1987. Tzotzil Clause Structure. Springer.

Roger W Andersen. 1990. Papiamentu tense-aspect, with special attention to discourse. Pidgin and creole tense-mood-aspect systems pages 59–96.

Ehsaneddin Asgari and Mohammad R. K. Mofrad. 2016. Comparing fifty natural languages and twelve genetic languages using word embedding language divergence (WELD) as a quantitative measure of language distance. In Workshop on Multilingual and Cross-lingual Methods in NLP. pages 65–74.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, pages 597–604.

Emily M Bender. 2009. Linguistically naïve != language independent: why nlp needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?. Association for Computational Linguistics, pages 26–32.

Emily M Bender. 2011. On achieving and evaluating language-independence in nlp. Linguistic Issues in Language Technology 6(3):1–26.

Steven Bird. 2006. Nltk: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics, pages 69–72.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pages 1455–1465.

Bernd Bohnet, Joakim Nivre, Igor Boguslavsky, Richárd Farkas, Filip Ginter, and Jan Hajič. 2013. Joint morphological and syntactic analysis for richly inflected languages. Transactions of the Association for Computational Linguistics 1:415–428.

Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 19(2):263–311.

Joan L Bybee and Östen Dahl. 1989. The creation of tense and aspect systems in the languages of the world. John Benjamins Amsterdam.

George Casella and Roger L. Berger. 2008. Statistical Inference. Thomson.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the bible in 100 languages. Language resources and evaluation 49(2):375–395.

Bernard Comrie. 1989. Language universals and linguistic typology: Syntax and morphology. University of Chicago press.

William Croft. 2001. Radical construction grammar: Syntactic theory in typological perspective. Oxford University Press on Demand.

William Croft. 2002. Typology and universals. Cambridge University Press.

William Croft and Keith T Poole. 2008. Inferring universals from grammatical variation: Multidimensional scaling for typological analysis. Theoretical linguistics 34(1):1–37.

Michael Cysouw. 2014. Inducing semantic roles. Perspectives on semantic roles pages 23–68.

Michael Cysouw and Bernhard Wälchli. 2007. Parallel texts: using translational equivalents in linguistic typology. STUF-Sprachtypologie und Universalienforschung 60(2):95–99.

Östen Dahl. 1985. Tense and aspect systems. Basil Blackwell.

Östen Dahl. 2000. Tense and Aspect in the Languages of Europe. Walter de Gruyter.

Östen Dahl. 2007. From questionnaires to parallel corpora in typology. STUF-Sprachtypologie und Universalienforschung 60(2):172–181.

Östen Dahl. 2014. The perfect map: Investigating the cross-linguistic distribution of tame categories in a parallel corpus. Aggregating Dialectology, Typology, and Register Contents Analysis. Linguistic Variation in Text and Speech. Linguae & litterae 28:268–289.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA. pages 600–609.

José Guilherme Camargo de Souza and Constantin Orăsan. 2011. Can projected chains in parallel corpora help coreference resolution? In Discourse Anaphora and Anaphor Resolution Colloquium. Springer, pages 59–69.

Mona T. Diab and Philip Resnik. 2002. An unsupervised method for word sense tagging using parallel corpora. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. pages 255–262.

Matthew S Dryer, David Gil, Bernard Comrie, Hagen Jung, Claudia Schmidt, et al. 2005. The world atlas of language structures. Oxford University Press.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA. pages 644–648.

Annemarie Friedrich and Damyana Gateva. 2017. Classification of telicity using cross-linguistic annotation projection. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 2549–2555.

Joseph H Greenberg. 1960. A quantitative approach to the morphological typology of language. International journal of American linguistics 26(3):178–194.

Jan Hajič, Jan Hric, and Vladislav Kuboň. 2000. Machine translation of very close languages. In Proceedings of the sixth conference on Applied natural language processing. Association for Computational Linguistics, pages 7–12.

Iren Hartmann, Martin Haspelmath, and Michael Cysouw. 2014. Identifying semantic role clusters and alignment types via microrole coexpression tendencies. Studies in Language. International Journal sponsored by the Foundation "Foundations of Language" 38(3):463–484.

Serge Heiden, Sophie Prévost, Benoit Habert, Helka Folch, Serge Fleury, Gabriel Illouz, Pierre Lafon, and Julien Nioche. 2000. Typtex: Inductive typological text classification by multivariate statistical analysis for nlp systems tuning/evaluation. In Maria Gavrilidou, George Carayannis, Stella Markantonatou, Stelios Piperidis, Gregory Stainhaouer (eds) Second International Conference on Language Resources and Evaluation. pages p–141.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural language engineering 11(03):311–325.

Nancy Ide. 2000. Cross-lingual sense determination: Can it work? Computers and the Humanities 34(1):223–234.

Stephen C Johnson. 1967. Hierarchical clustering schemes. Psychometrika 32(3):241–254.

Maria Koptjevskaja-Tamm, Martine Vanhove, and Peter Koch. 2007. Typological approaches to lexical semantics. Linguistic typology 11(1):159–185.

Anoop Kunchukuttan and Pushpak Bhattacharyya. 2016. Faster decoding for subword level phrase-based smt between related languages. arXiv preprint arXiv:1611.00354.

Pierre Lison and Jörg Tiedemann. 2016. Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation.

Sharid Loaiciga, Thomas Meyer, and Andrei Popescu-Belis. 2014. English-french verb phrase alignment in europarl for tense translation modeling. In The Ninth Language Resources and Evaluation Conference. EPFL-CONF-198442.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov):2579–2605.

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 2519–2525.

Ana Marasović and Anette Frank. 2016. Multilingual modal sense classification using a convolutional neural network. In Workshop on Representation Learning for NLP. pages 111–120.

Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel bible corpus. 135(273):40.

Ryan T McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. 2013. Universal dependency annotation for multilingual parsing. In ACL (2). pages 92–97.

Tony McEnery and Richard Xiao. 1999. Domains, text types, aspect marking and english-chinese translation. Languages in Contrast 2(2):211–229.

John H McWhorter. 2005. Defining creole. Oxford University Press.

Rada Mihalcea and Michel Simard. 2005. Parallel texts. Natural Language Engineering 11(03):239–246.

Amitabha Mukerjee, Ankit Soni, and Achla M Raina. 2006. Detecting complex predicates in hindi using pos projection across parallel corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties. Association for Computational Linguistics, pages 28–35.

John Myhill. 1992. Typological discourse analysis: Quantitative approaches to the study of linguistic function. Blackwell Oxford.

Lene Nordrum. 2015. Exploring spontaneous-event marking through parallel corpora: Translating english ergative intransitive constructions into norwegian and swedish. Languages in Contrast 15(2):230–250.

Sebastian Padó and Mirella Lapata. 2009. Cross-lingual annotation projection for semantic roles. J. Artif. Intell. Res. 36:307–340. https://doi.org/10.1613/jair.2863.

Ari Pirkola. 2001. Morphological typology of languages for ir. Journal of Documentation 57(3):330–348.

Philip Resnik. 2004. Exploiting hidden meanings: Using bilingual text for monolingual annotation. Computational Linguistics and Intelligent Text Processing pages 283–299.

Philip Resnik, Mari Broman Olsen, and Mona Diab. 1997. Creating a parallel corpus from the book of 2000 tongues. In Proceedings of the Text Encoding Initiative 10th Anniversary User Conference (TEI-10). Citeseer.

Gillian Sankoff. 1990. The grammaticalization of tense and aspect in tok pisin and sranan. Language Variation and Change 2(03):295–312.

Marianne Elina Santaholma. 2007. Grammar sharing techniques for rule-based multilingual nlp systems. Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA).

Diana Santos. 2004. Translation-based corpus studies: Contrasting English and Portuguese tense and aspect systems. 50. Rodopi.

Devyani Sharma. 2009. Typological diversity in new englishes. English World-Wide 30(2):170–195.

Jae Jung Song. 2014. Linguistic typology: Morphology and syntax. Routledge.

Kathrin Spreyer and Anette Frank. 2008. Projection-based acquisition of a temporal labeller. In IJCNLP. pages 489–496.

Elizabeth Closs Traugott. 1978. On the expression of spatio-temporal relations in language. Universals of human language 3:369–400.

Bernhard Wälchli. 2010. The template in synchrony and diachrony. Baltic linguistics 1.

Bernhard Wälchli and Michael Cysouw. 2012. Lexical typology through similarity semantics: Toward a semantic map of motion. Linguistics 50(3):671–710.

Lindsay J Whaley. 1996. Introduction to typology: the unity and diversity of language. Sage Publications.

RZ Xiao and AM McEnery. 2002. A corpus-based approach to tense and aspect in english-chinese translation. In The 1st International Symposium on Contrastive and Translation Studies between Chinese and English.

Nianwen Xue, Yuchen Zhang, and Yaqin Yang. 2013. Distant annotation of chinese tense and modality. In Workshop on Annotation of Modal Meaning in Natural Language (WAMM). Association for Computational Linguistics, pages 47–55. http://aclanthology.coli.uni-saarland.de/pdf/W/W13/W13-0307.pdf.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the first international conference on Human language technology research. Association for Computational Linguistics, pages 1–8.

Yuchen Zhang and Nianwen Xue. 2014. Automatic inference of the tense of chinese events using implicit linguistic information. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1902–1911. http://www.aclweb.org/anthology/D14-1204.