Using Global Constraints and Reranking to Improve Cognates Detection

Michael Bloodgood
Department of Computer Science
The College of New Jersey
Ewing, NJ 08628
[email protected]

Benjamin Strauss
Computer Science and Engineering Dept.
The Ohio State University
Columbus, OH 43210
[email protected]

This paper was published in the Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1983-1992, Vancouver, Canada, July 2017. (c) 2017 Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1181

Abstract

Global constraints and reranking have not been used in cognates detection research to date. We propose methods for using global constraints by performing rescoring of the score matrices produced by state of the art cognates detection systems. Using global constraints to perform rescoring is complementary to state of the art methods for performing cognates detection and results in significant performance improvements beyond current state of the art performance on publicly available datasets with different language pairs and various conditions such as different levels of baseline state of the art performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past.

1 Introduction

This paper presents an effective method for using global constraints to improve performance for cognates detection. Cognates detection is the task of identifying words across languages that have a common origin. Automatic cognates detection is important to linguists because cognates are needed to determine how languages evolved. Cognates are used for protolanguage reconstruction (Hall and Klein, 2011; Bouchard-Côté et al., 2013). Cognates are also important for cross-language dictionary look-up and can improve the quality of machine translation, word alignment, and bilingual lexicon induction (Simard et al., 1993; Kondrak et al., 2003).

A word is traditionally considered cognate with another only if both words proceed from the same ancestor. Nonetheless, in line with the conventions of previous research in computational linguistics, we adopt a broader definition. We use the word 'cognate' to denote, as in (Kondrak, 2001): "...words in different languages that are similar in form and meaning, without making a distinction between borrowed and genetically related words; for example, English 'sprint' and the Japanese borrowing 'supurinto' are considered cognate, even though these two languages are unrelated." These broader criteria are motivated by the ways scientists develop and use cognate identification algorithms in natural language processing (NLP) systems. For cross-lingual applications, the advantage of such technology is the ability to identify words for which similarity in meaning can be accurately inferred from similarity in form; it does not matter whether the similarity in form stems from strict genetic relationship or later borrowing (Mericli and Bloodgood, 2012).

Cognates detection has received a lot of attention in the literature. Research on the use of statistical learning methods to build systems that can automatically perform cognates detection has yielded many interesting and creative approaches for gaining traction on this challenging task. Currently, the highest-performing state of the art systems detect cognates based on a combination of multiple sources of information. Some of the most indicative sources of information discovered to date are word context information, phonetic information, word frequency information, temporal information in the form of word frequency distributions across parallel time periods, and word burstiness information. See section 3 for fuller explanations of each of these sources of information that state of the art systems currently use. Scores for all pairs of words from language L1 x language L2 are generated by computing component scores based on these sources of information and then combining them in an appropriate manner. Simple methods of combination give equal weighting to each score, while state of the art performance is obtained by learning an optimal set of weights from a small seed set of known cognates. Once the full matrix of scores is generated, the word pairs with the highest scores are predicted as being cognates.

The methods we propose in the current paper consume as input the final score matrix that state of the art methods create. We test whether our methods can improve performance by generating new rescored matrices, rescoring all of the pairs of words by taking into account global constraints that apply to cognates detection. Thus, our methods are complementary to previous methods for creating cognates detection systems. Using global constraints and performing rescoring to improve cognates detection has not been explored before. We find that rescoring based on global constraints improves performance significantly beyond current state of the art levels.

The cognates detection task is an interesting task to apply our methods to for a few reasons:

• It's a challenging unsolved task where ongoing research is frequently reported in the literature trying to improve performance;

• There is significant room for improvement in performance;

• It has a global structure in its output classifications since if a word lemma¹ w_i from language L1 is cognate with a word lemma w_j from language L2, then w_i is not cognate with any other word lemma from L2 different from w_j and w_j is not cognate with any other word lemma w_k from L1;

• There are multiple standard datasets freely and publicly available that have been worked on with which to compare results;

• Different datasets and language pairs yield initial score matrices with very different qualities. Some of the score matrices built using the existing state of the art best approaches yield performance that is quite low (11-point interpolated average precision of only approximately 16%) while other language pairs and datasets have state of the art score matrices that are already able to achieve 11-point interpolated average precision of 57%.

¹A lemma is a base form of a word. For example, in English the words 'baked' and 'baking' would both map to the lemma 'bake'. Lemmatizing software exists for many languages and lemmatization is a standard preprocessing task conducted before cognates detection.

Although we are not aware of work using global constraints to perform rescoring to improve cognates detection, there are related methodologies for reranking in different settings. Methodologically related work includes past work in structured prediction and reranking (Collins, 2002; Collins and Roark, 2004; Collins and Koo, 2005; Taskar et al., 2005a,b). Note that in these past works, there are many instances with structured outputs that can be used as training data to learn a structured prediction model. For example, a seminal application in the past was using online training with structured perceptrons to learn improved systems for performing various syntactic analyses and tagging of sentences such as POS tagging and base noun phrase chunking (Collins, 2002). Note that in those settings the unit at which there are structural constraints is a sentence. Also note that there are many sentences available, so that online training methods such as discriminative training of structured perceptrons can be used to learn structured predictors effectively in those settings. In contrast, for the cognates setting the unit at which there are structural constraints is the entire set of cognates for a language pair, and there is only one such unit in existence (for a given language pair). We call this a single overarching global structure to make the distinction clear. The method we present in this paper deals with a single overarching global structure on the predictions of all instances in the entire problem space for a task. For this type of setting, there is only a single global structure in existence, contrasted with the situation of there being many sentences each imposing a global structure on the tagging decisions for that individual sentence. Hence, previous structured prediction methods that require numerous instances, each having a structured output on which to train parameters via methods such as perceptron training, are inapplicable to the cognates setting. In this paper we present methods for rescoring effectively in settings with a single overarching global structure and show their applicability to improving the performance of cognates detection. Still, we note that philosophically our method builds on previous structured prediction methods, since in both cases there is a similar intuition in that we're using higher-level structural properties to inform and accordingly alter our system's predictions of values for subitems within a structure.

In section 2 we present our methods for performing rescoring of matrices based on global constraints such as those that apply for cognates detection. The key intuition behind our approach is that the scoring of word pairs for cognateness ought not be made independently as is currently done, but rather that global constraints ought to be taken into account to inform and potentially alter system scores for word pairs based on the scores of other word pairs. In section 3 we provide results of experiments testing the proposed methods on the cognates detection task on multiple datasets with multiple language pairs under multiple conditions. We show that the new methods complement and effectively improve performance over state of the art performance achieved by combining the major research breakthroughs that have taken place in cognates detection research to date. Complete precision-recall curves are provided that show the full range of performance improvements over the current state of the art that are achieved. Summary measurements of performance improvements, depending on the language pair and dataset, range from 6.73 absolute MaxF1 percentage points to 16.75 absolute MaxF1 percentage points and from 5.58 absolute 11-point interpolated average precision percentage points to 17.19 absolute 11-point interpolated average precision percentage points. Section 4 discusses the results and possible extensions of the method. Section 5 wraps up with the main conclusions.

2 Algorithm

While our focus in this paper is on using global constraints to improve cognates detection, we believe that our method is useful more generally. We therefore abstract out some of the specifics of cognates detection and present our algorithm more generally in this section, with the hope that it will be able to be used in the future for other applications in addition to cognates detection. None of our abstraction harms understanding of our method's applicability to cognates detection, and the fact that the method may be more widely beneficial does not in any way detract from the utility we show it has for improving cognates detection.

A common setting is where one has a set X = {x_1, x_2, ..., x_n} and a set Y = {y_1, y_2, ..., y_n}, where the task is to extract (x, y) pairs such that (x, y) are in some relation R. Here are examples:

• X might be a set of states and Y might be a set of cities and the relation R might be "is the capital of";

• X might be a set of images and Y might be a set of people's names and the relation R might be "is a picture of";

• X might be a set of English words and Y might be a set of French words and the relation R might be "is cognate with".

A common way these problems are approached is that a model is trained that can score each pair (x, y), and those pairs with scores above a threshold are extracted. We propose that often the relation will have a tendency, or a hard constraint, to satisfy particular properties, and that this ought to be utilized to improve the quality of the extracted pairs.

The approach we put forward is to re-score each (x, y) pair by utilizing scores generated for other pairs and our knowledge of properties of the relation being extracted. In this paper, we present and evaluate methods for improving the scores of each (x, y) pair for the case when the relation is known to be one-to-one and discuss extensions to other situations.

The current approach is to generate a matrix of scores for each candidate pair as follows:

              | s_{x_1,y_1}  ...  s_{x_1,y_n} |
  Score_X,Y = |     ...      ...      ...     |    (1)
              | s_{x_n,y_1}  ...  s_{x_n,y_n} |

Then those pairs with scores above a threshold are predicted as being in the relation. We now describe methods for sharpening the scores in the matrix by utilizing the fact that there is an overarching global structure on the predictions.

2.1 Reverse Rank

We know that if (x_i, y_j) ∈ R, then (x_k, y_j) ∉ R for k ≠ i when R is 1-to-1. We define reverse_rank(x_i, y_j) = |{x_k ∈ X | s_{x_k,y_j} ≥ s_{x_i,y_j}}|. Intuitively, a high reverse rank means that there are lots of other elements of X that score better to y_j than x_i does; this could be evidence that (x_i, y_j) is not in R and ought to have a lower score. Alternatively, if there are very few or no other elements of X that score better to y_j than x_i does, this could be evidence that (x_i, y_j) is in R and ought to have a higher score. In accord with this intuition, we use reverse rank as the basis for rescaling our scores as follows:

  score_RR(x_i, y_j) = s_{x_i,y_j} / reverse_rank(x_i, y_j).    (2)

2.2 Forward Rank

Analogous to reverse rank, another basis we can use for adjusting scores is the forward rank. We define forward_rank(x_i, y_j) = |{y_k ∈ Y | s_{x_i,y_k} ≥ s_{x_i,y_j}}|. We then scale the scores analogously to how we did with reverse ranks, via an inverse linear function.²

²For both reverse rank and forward rank we also experimented with exponential decay and step functions, but found that simple division by the ranks worked as well as or better than any of those more complicated methods.

2.3 Combining Reverse Rank and Forward Rank

For combining reverse rank and forward rank, we present results of experiments doing it two ways. The first is a 1-step approach:

  score_RR_FR_1step(x_i, y_j) = s_{x_i,y_j} / product,    (3)

where

  product = reverse_rank(x_i, y_j) × forward_rank(x_i, y_j).    (4)

The second combination method involves first computing the reverse rank and re-adjusting every score based on the reverse ranks. Then in a second step the new scores are used to compute forward ranks, and those scores are adjusted based on the forward ranks. We refer to this method as RR_FR_2step.

2.4 Maximum Assignment

If one makes the assumption that all elements in X and Y are present and have their partner element in the other set present with no extra elements, and the sets are not too large, then it is interesting to compute what the 'maximal assignment' would be, using the Hungarian Algorithm to optimize:

  max_{Z ⊆ X×Y}  Σ_{(x,y) ∈ Z} score(x, y)
  s.t. (x_i, y_j) ∈ Z ⇒ (x_k, y_j) ∉ Z, ∀k ≠ i,
       (x_i, y_j) ∈ Z ⇒ (x_i, y_k) ∉ Z, ∀k ≠ j.    (5)

We do this on datasets where the assumptions hold and see how close our methods get to the Hungarian maximal assignment at similar points of the precision-recall curves. For our larger datasets where the assumptions don't hold, the Hungarian either can't complete due to limited computational resources or it functioned poorly in comparison with the performance of our reverse rank and forward rank combination methods.

3 Experiments

Our goal is to test whether using the global structure algorithms we described in section 2 can significantly boost performance for cognates detection. To test this hypothesis, our first step is to implement a system that uses state of the art research results to generate the initial score matrices as a current state of the art system would currently do for this task. To that end, we implemented a baseline state of the art system that uses the information sources that previous research has found to be helpful for this task, such as phonetic information, word context information, temporal context information, word frequency information, and word burstiness information (Kondrak, 2001; Mann and Yarowsky, 2001; Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Irvine and Callison-Burch, 2013). Consistent with past work (Irvine and Callison-Burch, 2013), we use supervised training to learn the weights for combining the various information sources. The system combines the sources of information by using weights learned by an SVM (Support Vector Machine) on a small seed training set of cognates³ to optimize performance. This baseline system obtains state of the art performance on cognates detection. Using this state of the art system as our baseline, we investigated how much we could improve performance beyond current state of the art levels by applying the rescoring algorithm we described in section 2. We performed experiments on three language pairs: French-English, German-English, and Spanish-English, with different text corpora used as training and test data. The different language pairs and datasets have different levels of performance in terms of their baseline current state of the art score matrices. In the next few subsections, we describe our experimental details.

³The small seed set was randomly selected and less than 20% in all cases. It was not used for testing. Note that using this data to optimize performance of the baseline system makes our baseline even stronger and makes it even harder for our new rescoring method to achieve larger improvements.

3.1 Lemmatization

We used morphological analyzers to convert the words in text corpora to lemma form. For English, we used the NLTK WordNetLemmatizer (Bird et al., 2009). For French, German, and Spanish we used the TreeTagger (Schmid, 1994).

3.2 Word Context Information

We used the N-Gram corpus (Michel et al., 2010). For English we used the English 2012 Google 5-gram corpus, for French the French 2012 Google 5-gram corpus, for German the German 2012 Google 5-gram corpus, and for Spanish the Spanish 2012 Google 5-gram corpus. From these corpora we compute word context similarity scores across languages using Rapp's method (Rapp, 1995, 1999). The intuition behind this method is that cognates are more likely to occur in correlating context windows, and this statistic, inferred from large amounts of data, captures this correlation.

3.3 Frequency Information

The intuition is that over large amounts of data cognates should have similar relative frequencies. We compute our relative frequencies by using the same corpora mentioned in the previous subsection.

3.4 Temporal Information

The intuition is that cognates will have similar temporal distributions (Klementiev and Roth, 2006). To compute the temporal similarity we use newspaper data and convert it to simple daily word counts. For each word in the corpora the word counts create a time series vector. The Fourier transform is computed on the time series vectors. Spearman rank correlation is computed on the transform vectors. For English we used the English Gigaword Fifth Edition⁴. For French we used French Gigaword Third Edition⁵. For Spanish we used Spanish Gigaword First Edition⁶. The German news corpora were obtained by web crawling http://www.tagesspiegel.de/ and extracting the news articles.

⁴Linguistic Data Consortium Catalog No. LDC2011T07
⁵Linguistic Data Consortium Catalog No. LDC2011T10
⁶Linguistic Data Consortium Catalog No. LDC2006T12

3.5 Word Burstiness

The intuition is that cognates will have similar burstiness measures (Church and Gale, 1995). For word burstiness we used the same corpora as for the temporal information.

3.6 Phonetic Information

The intuition is that cognates will have correspondences in how they are pronounced. For this, we compute a measurement based on Normalized Edit Distance (NED).

3.7 Combining Information Sources

We combine the information sources by using a linear Support Vector Machine to learn weights for each of the information sources from a small seed training set of cognates. So our final score assigned to a candidate cognate pair (x, y) is:

  score(x, y) = Σ_{m ∈ metrics} w_m score_m(x, y),    (6)

where metrics is the set of measurements, such as phonetic similarity measurements, word burstiness similarity, relative frequency similarity, etc., that were explained in subsections 3.2 through 3.6; w_m is the learned weight for metric m; and score_m(x, y) is the score assigned to the pair (x, y) by metric m.

The scores thus assigned represent a state of the art approach for filling in the matrix identified in equation 1. At this point the matrix of scores would be used to predict cognates. We now turn to evaluation of the use of the global constraint rescoring methods from section 2 for improving performance beyond the state of the art levels.

3.8 Using Global Constraints to Rescore

For our cognates data we used the French-English pairs from (Bergsma and Kondrak, 2007) and the German-English and Spanish-English pairs from (Beinborn et al., 2013). Figure 1 shows the precision-recall⁷ curves for French-English, Figure 2 shows the performance for German-English, and Figure 3 shows the performance for Spanish-English.

⁷Precision and recall are the standard measures used for systems that perform search. Precision is the percentage of predicted cognates that are indeed cognate. Recall is the percentage of cognates that are predicted as cognate. We vary the threshold that determines cognateness to generate all points along the precision-recall curve. We start with a very high threshold enabling precision of 100% and lower the threshold until recall of 100% is reached. In particular, we sort the test examples by score in descending order and then go down the list of scores in order to complete the entire precision-recall curve.
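The rank-based rescoring of section 2 is simple to implement on top of any score matrix. The following is a minimal pure-Python sketch (our illustration, not the authors' code) of reverse rank, forward rank, and the two combination methods corresponding to equations (2) through (4); note that the `≥`-based counts mean a rank is always at least 1.

```python
def reverse_rank(S, i, j):
    """|{x_k in X : s_{x_k,y_j} >= s_{x_i,y_j}}| -- how many rows score at
    least as well against column j as row i does (section 2.1)."""
    return sum(1 for k in range(len(S)) if S[k][j] >= S[i][j])

def forward_rank(S, i, j):
    """|{y_k in Y : s_{x_i,y_k} >= s_{x_i,y_j}}| -- the column-wise analogue
    (section 2.2)."""
    return sum(1 for k in range(len(S[i])) if S[i][k] >= S[i][j])

def rescore_rr(S):
    """Equation (2): divide each score by its reverse rank."""
    return [[S[i][j] / reverse_rank(S, i, j) for j in range(len(S[i]))]
            for i in range(len(S))]

def rescore_rr_fr_1step(S):
    """Equations (3)-(4): divide by the product of reverse and forward rank."""
    return [[S[i][j] / (reverse_rank(S, i, j) * forward_rank(S, i, j))
             for j in range(len(S[i]))] for i in range(len(S))]

def rescore_rr_fr_2step(S):
    """RR_FR_2step: rescore by reverse rank first, then rescore the resulting
    matrix by forward rank."""
    S1 = rescore_rr(S)
    return [[S1[i][j] / forward_rank(S1, i, j) for j in range(len(S1[i]))]
            for i in range(len(S1))]
```

For example, a pair that is the best match in both its row and its column has reverse rank and forward rank 1 and keeps its score, while pairs dominated by many competitors are pushed down.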

Figure 1: Precision-Recall Curves for French-English. Baseline denotes state of the art performance.

Figure 2: Precision-Recall Curves for German-English. Baseline denotes state of the art performance.
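The Max Assignment point shown in the figures comes from solving the optimization in equation (5). For illustration only, here is a brute-force solver over all one-to-one assignments on a toy matrix (our sketch, not the authors' code; the paper uses the Hungarian Algorithm, which solves the same problem in polynomial time rather than factorial time):

```python
from itertools import permutations

def max_assignment(S):
    """Maximize the total score over one-to-one row/column assignments
    (equation (5)) by exhaustive search.  Only feasible for tiny n."""
    n = len(S)
    best_total, best_pairs = float("-inf"), None
    for perm in permutations(range(n)):  # perm[i] = column assigned to row i
        total = sum(S[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total = total
            best_pairs = [(i, perm[i]) for i in range(n)]
    return best_total, best_pairs
```

In practice one would use an off-the-shelf Hungarian implementation (e.g., `scipy.optimize.linear_sum_assignment`) instead of this exhaustive search.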

Note that state of the art performance (denoted in the figures as Baseline) has very different performance across the three datasets, but in all cases the systems from section 2 that incorporate global constraints and perform rescoring greatly exceed current state of the art performance levels. The Max Assignment is really just the single point that the Hungarian Algorithm finds. We drew lines connecting it, but keep in mind that those lines are just connecting the single point to the endpoints. Max Assignment Score traces the precision-recall curve back from the Max Assignment by steadily increasing the threshold so that only points in the maximum assignment set with scores above the increasing threshold are predicted as cognate.

Figure 3: Precision-Recall Curves for Spanish-English. Baseline denotes state of the art performance.

For the non-max assignment curves, it is sometimes helpful to compute a single metric summarizing important aspects of the full curve. For this purpose, MaxF1 and 11-point interpolated average precision are often used. MaxF1 is the F1 measure (i.e., harmonic mean of precision and recall) at the point on the precision-recall curve where F1 is highest. The interpolated precision p_interp at a given recall level r is defined as the highest precision level found for any recall level r' ≥ r:

  p_interp(r) = max_{r' ≥ r} p(r').    (7)

The 11-point interpolated average precision (11-point IAP) is then the average of the p_interp at r = 0.0, 0.1, ..., 1.0. Table 1 shows these performance measures for French-English, Table 2 shows the results for German-English, and Table 3 shows the results for Spanish-English. In all cases, using global structure greatly improves upon the state of the art baseline performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      54.92     50.99
RR            62.94     59.62
RR_FR_1STEP   68.35     64.42
RR_FR_2STEP   69.72     67.29

Table 1: French-English Performance. BASELINE indicates current state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      21.38     16.25
RR            22.71     17.80
RR_FR_1STEP   28.68     22.37
RR_FR_2STEP   28.11     21.83

Table 2: German-English Performance. BASELINE indicates current state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      56.26     57.03
RR            68.52     69.33
RR_FR_1STEP   70.66     71.47
RR_FR_2STEP   73.01     74.22

Table 3: Spanish-English Performance. BASELINE indicates current state of the art performance.

In (Bergsma and Kondrak, 2007), for French-English data a result of 66.5 11-point IAP is reported for a situation where word alignments from a bitext are available, and a result of 77.7 11-point IAP is reported for a situation where translation pairs are available in large quantities. The setting considered in the current paper is much more challenging since it does not use bilingual dictionaries or word alignments from bitexts. The setting in the current paper is the one mentioned as future work on page 663 of (Bergsma and Kondrak, 2007): "In particular, we plan to investigate approaches that do not require the bilingual dictionaries or bitexts to generate training data."

Note that the evaluation thus far is a bit artificial for real cognates detection because in a real setting you wouldn't only be selecting matches for relatively small subsets of words that are guaranteed to have a cognate on the other side. Such was the case for our evaluation, where the French-English set had approx. 600 cognate pairs, the German-English set had approx. 1000 pairs, and the Spanish-English set had approx. 3000 pairs. In a real setting, the system would have to consider words that don't have a cognate match in the other language and not only words that were hand-selected and guaranteed to have cognates. We are not aware of others evaluating according to this much more difficult condition, but we think it is important to consider, especially given the potential impacts it could have on the global structure methods we've put forward. Therefore, we run a second set of evaluations where we take the ten thousand most common words in our corpora for each of our languages, which contain many of the cognates from the standard test sets, and we add in any remaining words from the standard test sets that didn't make it into the top ten thousand. We then repeat each of the experiments under this much more challenging condition. With approx. ten thousand squared candidates, i.e., approx. 100 million candidates, to consider for cognateness, this is a large data condition.

The Hungarian didn't run to completion on two of the datasets due to limited computational resources. On French-English it completed, but achieved poorer performance than any of the other methods. This makes sense, as it is designed for when there really is a bipartite matching to be found, as in the artificial yet standard cognates evaluation that was just presented. When confronted with large amounts of words that create a much denser space and have no match at all on the other side, the all or nothing assignments of the Hungarian are not ideal. The reverse rank and forward rank rescoring methods are still quite effective in improving performance, although not by as much as they did in the small data results from above.

Figure 4 shows the full precision-recall curves for French-English for the large data condition, Figure 5 shows the curves for German-English for the large data condition, and Figure 6 shows the results for Spanish-English for the large data condition.

Figure 4: Precision-Recall Curves for French-English (large data). Note that Baseline denotes state of the art performance.

Figure 5: Precision-Recall Curves for German-English (large data). Note that Baseline denotes state of the art performance.

Figure 6: Precision-Recall Curves for Spanish-English (large data). Note that Baseline denotes state of the art performance.

Tables 4 through 6 show the summary metrics for the three language pairs for the large data experiments. We can see that the reverse rank and forward rank methods of taking into account the global structure of interactions among predictions are still helpful, providing large improvements in performance even in this challenging large data condition over strong state of the art baselines that make cognate predictions independently of each other and don't do any rescoring based on global constraints.

METHOD        MAX F1    11-POINT IAP
BASELINE      55.08     51.35
RR            60.88     58.79
RR_FR_1STEP   65.87     63.55
RR_FR_2STEP   65.76     65.26

Table 4: French-English Performance (large data). BASELINE indicates state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      21.25     16.17
RR            24.78     19.13
RR_FR_1STEP   30.72     24.97
RR_FR_2STEP   30.34     24.86

Table 5: German-English Performance (large data). BASELINE indicates state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      54.75     54.55
RR            62.52     61.42
RR_FR_1STEP   66.45     65.89
RR_FR_2STEP   66.38     65.50

Table 6: Spanish-English Performance (large data). BASELINE indicates state of the art performance.

4 Discussion

We believe that this work opens up new avenues for further exploration. A few of these include the following:

• investigating the utility of applying and extending the method to other applications such as Information Extraction applications, many of which have similar global constraints as cognates detection;

• investigating how to handle other forms of global structure, including tendencies that are not necessarily hard constraints;

• developing more theory to more precisely understand some of the nuances of using global structure when it's applicable, and making connections with other areas of machine learning such as semi-supervised learning, active learning, etc.; and

• investigating how to have a machine learn that global structure exists and learn what form of global structure exists.
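As a concrete reference for the evaluation protocol of section 3.8, the MaxF1 and 11-point IAP summary metrics can be computed from a ranked candidate list as follows. This is our own sketch, not the authors' evaluation code; it assumes the candidates are sorted by score in descending order with binary gold labels, and applies the interpolation of equation (7).

```python
def pr_points(ranked_labels, total_positives):
    """Walk down the ranked list (as in footnote 7), emitting a
    (precision, recall) point after each successive prediction."""
    points, tp = [], 0
    for rank, label in enumerate(ranked_labels, start=1):
        tp += label
        points.append((tp / rank, tp / total_positives))
    return points

def max_f1(points):
    """MaxF1: the highest harmonic mean of precision and recall on the curve."""
    return max((2 * p * r / (p + r) for p, r in points if p + r > 0),
               default=0.0)

def iap_11pt(points):
    """11-point IAP: average of p_interp(r) for r = 0.0, 0.1, ..., 1.0,
    where p_interp(r) = max precision at any recall level >= r (eq. 7)."""
    levels = [i / 10 for i in range(11)]
    interp = []
    for r in levels:
        candidates = [p for p, rec in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / len(levels)
```

For instance, a ranked list with labels [1, 0, 1] over two gold cognates yields the points (1.0, 0.5), (0.5, 0.5), and (0.667, 1.0), giving a MaxF1 of 0.8.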

5 Conclusions

Cognates detection is an interesting and challenging task. Previous work has yielded state of the art approaches that create a matrix of scores for all word pairs based on optimized weighted combinations of component scores computed on the basis of various helpful sources of information such as phonetic information, word context information, temporal context information, word frequency information, and word burstiness information. However, when assigning a score to a word pair, the current state of the art methods do not take into account scores assigned to other word pairs. We proposed a method for rescoring the matrix that current state of the art methods produce by taking into account the scores assigned to other word pairs.

The methods presented in this paper are complementary to existing state of the art methods, easy to implement, computationally efficient, and practically effective in improving performance by large amounts. Experimental results reveal that the new methods significantly improve state of the art performance in multiple cognates detection experiments conducted on standard freely and publicly available datasets with different language pairs and various conditions, such as different levels of baseline performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past.

References

Lisa Beinborn, Torsten Zesch, and Iryna Gurevych. 2013. Cognate production using character-based machine translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, Nagoya, Japan, pages 883–891. http://www.aclweb.org/anthology/I13-1112.

Shane Bergsma and Grzegorz Kondrak. 2007. Alignment-based discriminative string similarity. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pages 656–663. http://www.aclweb.org/anthology/P07-1083.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110:4224–4229. https://doi.org/10.1073/pnas.1204678110.

Kenneth W. Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering 1:163–190.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Michael Collins and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics 31(1):25–70. https://doi.org/10.1162/0891201053630273.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume. Barcelona, Spain, pages 111–118. https://doi.org/10.3115/1218955.1218970.

David Hall and Dan Klein. 2011. Large-scale cognate recovery. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Edinburgh, Scotland, UK, pages 344–354. http://www.aclweb.org/anthology/D11-1032.

Ann Irvine and Chris Callison-Burch. 2013. Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 518–523. http://www.aclweb.org/anthology/N13-1056.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Sydney, Australia, pages 817–824. https://doi.org/10.3115/1220175.1220278.

Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL '01, pages 1–8. http://www.aclweb.org/anthology/N/N01/N01-1014.pdf.

Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003 – Short Papers, Volume 2. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL-Short '03, pages 46–48. http://www.aclweb.org/anthology/N/N03/N03-2016.pdf.

Gideon S. Mann and David Yarowsky. 2001.
Mul- ceedings of the 2002 Conference on Empirical tipath translation lexicon induction via bridge lan- Methods in Natural Language Processing. Asso- guages. In Proceedings of the second meet- ciation for Computational Linguistics, pages 1–8. ing of the North American Chapter of the As- https://doi.org/10.3115/1118693.1118694. sociation for Computational Linguistics on Lan- guage technologies. Association for Computa- Ben Taskar, Lacoste-Julien Simon, and Klein Dan. tional Linguistics, Stroudsburg, PA, USA, NAACL 2005b. A discriminative matching approach ’01, pages 1–8. http://www.aclweb.org/ to word alignment. In Proceedings of Hu- anthology/N/N01/N01-1020.pdf. man Language Technology Conference and Con- ference on Empirical Methods in Natural Lan- Benjamin S. Mericli and Michael Bloodgood. 2012. guage Processing. Association for Computational Annotating cognates and etymological origin in Tur- Linguistics, Vancouver, British Columbia, Canada, kic languages. In Proceedings of the First Workshop pages 73–80. http://www.aclweb.org/ on Language Resources and Technologies for Tur- anthology/H/H05/H05-1010.pdf. kic Languages. European Language Resources As- sociation, Istanbul, Turkey, pages 47–51. http: //arxiv.org/abs/1501.03191.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2010. Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182. https://doi.org/10.1126/science.1199644.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Cambridge, Massachusetts, USA, pages 320–322. https://doi.org/10.3115/981658.981709.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, USA, pages 519–526. https://doi.org/10.3115/1034678.1034756.

Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20. Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1–7. http://www.aclweb.org/anthology/W/W02/W02-2026.pdf.

Helmut Schmid. 1994. Part-of-speech tagging with neural networks. In COLING. pages 172–176. http://www.aclweb.org/anthology/C/C94/C94-1027.pdf.

Michel Simard, George F. Foster, and Pierre Isabelle. 1993. Using cognates to align sentences in bilingual corpora. In Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing - Volume 2. IBM Press, CASCON ’93, pages 1071–1082.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005a. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning. ACM, pages 896–903.
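To make the rescoring idea described in the conclusions concrete, the following is a minimal illustrative sketch. It is not the RR or RR_FR algorithms evaluated in the paper; it only demonstrates the general principle that a pair's final score should depend on the scores assigned to competing pairs. Here each cell of the score matrix is adjusted by its margin over the best competitor in the same row and column, reflecting a soft one-to-one global constraint. The `rescore` function and the example matrix are hypothetical.

```python
def rescore(matrix):
    """Rescore each (source word, target word) pair by its margin over
    the strongest competing pair in the same row and column.

    A pair that dominates both its row and its column keeps a positive
    score; a pair outranked by some competitor is pushed below zero.
    This makes every rescored value depend on scores assigned to other
    word pairs, unlike baseline systems that score pairs independently.
    """
    n, m = len(matrix), len(matrix[0])
    rescored = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Best competing score among other targets for source i.
            row_rival = max(matrix[i][k] for k in range(m) if k != j)
            # Best competing score among other sources for target j.
            col_rival = max(matrix[k][j] for k in range(n) if k != i)
            rescored[i][j] = matrix[i][j] - max(row_rival, col_rival)
    return rescored


# Hypothetical 3x3 score matrix: rows are source-language words,
# columns are target-language words.
scores = [
    [0.9, 0.4, 0.1],
    [0.8, 0.7, 0.2],
    [0.1, 0.3, 0.6],
]
print(rescore(scores))
```

Note that cell (1, 0) starts with a high raw score (0.8) but is pushed below zero after rescoring, because source word 0 is an even stronger candidate (0.9) for the same target word; a per-pair scorer would have no way to make that adjustment.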