Using Global Constraints and Reranking to Improve Cognates Detection

Michael Bloodgood
Department of Computer Science
The College of New Jersey
Ewing, NJ 08628
[email protected]

Benjamin Strauss
Computer Science and Engineering Dept.
The Ohio State University
Columbus, OH 43210
[email protected]

This paper was published in the Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1983-1992, Vancouver, Canada, July 2017. (c) 2017 Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1181

Abstract

Global constraints and reranking have not been used in cognates detection research to date. We propose methods for using global constraints by performing rescoring of the score matrices produced by state of the art cognates detection systems. Using global constraints to perform rescoring is complementary to state of the art methods for performing cognates detection and results in significant performance improvements beyond current state of the art performance on publicly available datasets with different language pairs and various conditions such as different levels of baseline state of the art performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past.

1 Introduction

This paper presents an effective method for using global constraints to improve performance for cognates detection. Cognates detection is the task of identifying words across languages that have a common origin. Automatic cognates detection is important to linguists because cognates are needed to determine how languages evolved. Cognates are used for protolanguage reconstruction (Hall and Klein, 2011; Bouchard-Côté et al., 2013). Cognates are also important for cross-language dictionary look-up and can improve the quality of machine translation, word alignment, and bilingual lexicon induction (Simard et al., 1993; Kondrak et al., 2003).

A word is traditionally considered cognate with another only if both words proceed from the same ancestor. Nonetheless, in line with the conventions of previous research in computational linguistics, we adopt a broader definition. We use the word 'cognate' to denote, as in (Kondrak, 2001): "...words in different languages that are similar in form and meaning, without making a distinction between borrowed and genetically related words; for example, English 'sprint' and the Japanese borrowing 'supurinto' are considered cognate, even though these two languages are unrelated." These broader criteria are motivated by the ways scientists develop and use cognate identification algorithms in natural language processing (NLP) systems. For cross-lingual applications, the advantage of such technology is the ability to identify words for which similarity in meaning can be accurately inferred from similarity in form; it does not matter whether the similarity in form stems from strict genetic relationship or later borrowing (Mericli and Bloodgood, 2012).

Cognates detection has received a lot of attention in the literature. Research on the use of statistical learning methods to build systems that can automatically perform cognates detection has yielded many interesting and creative approaches for gaining traction on this challenging task. Currently, the highest-performing state of the art systems detect cognates based on a combination of multiple sources of information. Some of the most indicative sources of information discovered to date are word context information, phonetic information, word frequency information, temporal information in the form of word frequency distributions across parallel time periods, and word burstiness information. See section 3 for fuller explanations of each of these sources of information that state of the art systems currently use. Scores for all pairs of words from language L1 x language L2 are generated by computing component scores based on these sources of information and then combining them in an appropriate manner. Simple methods of combination give equal weighting to each score, while state of the art performance is obtained by learning an optimal set of weights from a small seed set of known cognates. Once the full matrix of scores is generated, the word pairs with the highest scores are predicted as being cognates.

The methods we propose in the current paper consume as input the final score matrix that state of the art methods create. We test whether our methods can improve performance by generating new rescored matrices, rescoring all of the pairs of words by taking into account global constraints that apply to cognates detection. Thus, our methods are complementary to previous methods for creating cognates detection systems. Using global constraints and performing rescoring to improve cognates detection has not been explored before. We find that rescoring based on global constraints improves performance significantly beyond current state of the art levels.

The cognates detection task is an interesting task to apply our methods to for a few reasons:

• It's a challenging unsolved task where ongoing research is frequently reported in the literature trying to improve performance;

• There is significant room for improvement in performance;

• It has a global structure in its output classifications since if a word lemma¹ w_i from language L1 is cognate with a word lemma w_j from language L2, then w_i is not cognate with any other word lemma from L2 different from w_j and w_j is not cognate with any other word lemma w_k from L1;

• There are multiple standard datasets freely and publicly available that have been worked on with which to compare results;

• Different datasets and language pairs yield initial score matrices with very different qualities. Some of the score matrices built using the existing state of the art best approaches yield performance that is quite low (11-point interpolated average precision of only approximately 16%) while other language pairs and datasets have state of the art score matrices that are already able to achieve 11-point interpolated average precision of 57%.

¹A lemma is a base form of a word. For example, in English the words 'baked' and 'baking' would both map to the lemma 'bake'. Lemmatizing software exists for many languages and lemmatization is a standard preprocessing task conducted before cognates detection.

Although we are not aware of work using global constraints to perform rescoring to improve cognates detection, there are related methodologies for reranking in different settings. Methodologically related work includes past work in structured prediction and reranking (Collins, 2002; Collins and Roark, 2004; Collins and Koo, 2005; Taskar et al., 2005a,b). Note that in these past works, there are many instances with structured outputs that can be used as training data to learn a structured prediction model. For example, a seminal application in the past was using online training with structured perceptrons to learn improved systems for performing various syntactic analyses and tagging of sentences such as POS tagging and base noun phrase chunking (Collins, 2002). Note that in those settings the unit at which there are structural constraints is a sentence. Also note that there are many sentences available, so that online training methods such as discriminative training of structured perceptrons can be used to learn structured predictors effectively in those settings. In contrast, for the cognates setting the unit at which there are structural constraints is the entire set of cognates for a language pair, and there is only one such unit in existence (for a given language pair). We call this a single overarching global structure to make the distinction clear. The method we present in this paper deals with a single overarching global structure on the predictions of all instances in the entire problem space for a task. For this type of setting, there is only a single global structure in existence, contrasted with the situation of there being many sentences each imposing a global structure on the tagging decisions for that individual sentence. Hence, previous structured prediction methods that require numerous instances, each having a structured output on which to train parameters via methods such as perceptron training, are inapplicable to the cognates setting. In this paper we present methods for rescoring effectively in settings with a single overarching global structure and show their applicability to improving the performance of cognates detection. Still, we note that philosophically our method builds on previous structured prediction methods, since in both cases there is a similar intuition in that we're using higher-level structural properties to inform and accordingly alter our system's predictions of values for subitems within a structure.

In section 2 we present our methods for performing rescoring of matrices based on global constraints such as those that apply for cognates detection. The key intuition behind our approach is that the scoring of word pairs for cognateness ought not be made independently as is currently done, but rather that global constraints ought to be taken into account to inform and potentially alter system scores for word pairs based on the scores of other word pairs. In section 3 we provide results of experiments testing the proposed methods on the cognates detection task on multiple datasets with multiple language pairs under multiple conditions. We show that the new methods complement and effectively improve performance over state of the art performance achieved by combining the major research breakthroughs that have taken place in cognates detection research to date. Complete precision-recall curves are provided that show the full range of performance improvements over the current state of the art that are achieved. Summary measurements of performance improvements, depending on the language pair and dataset, range from 6.73 absolute MaxF1 percentage points to 16.75 absolute MaxF1 percentage points and from 5.58 absolute 11-point interpolated average precision percentage points to 17.19 absolute 11-point interpolated average precision percentage points. Section 4 discusses the results and possible extensions of the method. Section 5 wraps up with the main conclusions.

2 Algorithm

While our focus in this paper is on using global constraints to improve cognates detection, we believe that our method is useful more generally. We therefore abstract out some of the specifics of cognates detection and present our algorithm more generally in this section, with the hope that it will be able to be used in the future for other applications in addition to cognates detection. None of our abstraction harms understanding of our method's applicability to cognates detection, and the fact that the method may be more widely beneficial does not in any way detract from the utility we show it has for improving cognates detection.

A common setting is where one has a set X = {x_1, x_2, ..., x_n} and a set Y = {y_1, y_2, ..., y_n}, where the task is to extract (x, y) pairs such that (x, y) are in some relation R. Here are examples:

• X might be a set of states and Y might be a set of cities and the relation R might be "is the capital of";

• X might be a set of images and Y might be a set of people's names and the relation R might be "is a picture of";

• X might be a set of English words and Y might be a set of French words and the relation R might be "is cognate with".

A common way these problems are approached is that a model is trained that can score each pair (x, y), and those pairs with scores above a threshold are extracted. We propose that often the relation will have a tendency, or a hard constraint, to satisfy particular properties, and that this ought to be utilized to improve the quality of the extracted pairs.

The approach we put forward is to re-score each (x, y) pair by utilizing scores generated for other pairs and our knowledge of properties of the relation being extracted. In this paper, we present and evaluate methods for improving the scores of each (x, y) pair for the case when the relation is known to be one-to-one and discuss extensions to other situations.

The current approach is to generate a matrix of scores for each candidate pair as follows:

              | s_{x_1,y_1}  ...  s_{x_1,y_n} |
  Score_X,Y = |     ...      ...      ...     |    (1)
              | s_{x_n,y_1}  ...  s_{x_n,y_n} |

Then those pairs with scores above a threshold are predicted as being in the relation. We now describe methods for sharpening the scores in the matrix by utilizing the fact that there is an overarching global structure on the predictions.

2.1 Reverse Rank

We know that if (x_i, y_j) ∈ R, then (x_k, y_j) ∉ R for k ≠ i when R is 1-to-1. We define reverse_rank(x_i, y_j) = |{x_k ∈ X | s_{x_k,y_j} ≥ s_{x_i,y_j}}|. Intuitively, a high reverse rank means that there are lots of other elements of X that score better to y_j than x_i does; this could be evidence that (x_i, y_j) is not in R and ought to have a lower score. Alternatively, if there are very few or no other elements of X that score better to y_j than x_i does, this could be evidence that (x_i, y_j) is in R and ought to have a higher score. In accord with this intuition, we use reverse rank as the basis for rescaling our scores as follows:

  score_RR(x_i, y_j) = s_{x_i,y_j} / reverse_rank(x_i, y_j).    (2)

2.2 Forward Rank

Analogous to reverse rank, another basis we can use for adjusting scores is the forward rank. We define forward_rank(x_i, y_j) = |{y_k ∈ Y | s_{x_i,y_k} ≥ s_{x_i,y_j}}|. We then scale the scores analogously to how we did with reverse ranks, via an inverse linear function.²

²For both reverse rank and forward rank we also experimented with exponential decay and step functions, but found that simple division by the ranks worked as well as or better than any of those more complicated methods.

2.3 Combining Reverse Rank and Forward Rank

For combining reverse rank and forward rank, we present results of experiments doing it two ways. The first is a 1-step approach:

  score_RR_FR_1step(x_i, y_j) = s_{x_i,y_j} / product,    (3)

where

  product = reverse_rank(x_i, y_j) × forward_rank(x_i, y_j).    (4)

The second combination method involves first computing the reverse rank and re-adjusting every score based on the reverse ranks. Then in a second step the new scores are used to compute forward ranks, and those scores are adjusted based on the forward ranks. We refer to this method as RR_FR_2step.

2.4 Maximum Assignment

If one makes the assumption that all elements in X and Y are present and have their partner element in the other set present with no extra elements, and the sets are not too large, then it is interesting to compute what the 'maximal assignment' would be, using the Hungarian Algorithm to optimize:

  max_{Z ⊆ X×Y}  Σ_{(x,y) ∈ Z} score(x, y)
  s.t. (x_i, y_j) ∈ Z ⇒ (x_k, y_j) ∉ Z, ∀k ≠ i,
       (x_i, y_j) ∈ Z ⇒ (x_i, y_k) ∉ Z, ∀k ≠ j.    (5)

We do this on datasets where the assumptions hold and see how close our methods get to the Hungarian maximal assignment at similar points of the precision-recall curves. For our larger datasets where the assumptions don't hold, the Hungarian either can't complete due to limited computational resources or it functioned poorly in comparison with the performance of our reverse rank and forward rank combination methods.

3 Experiments

Our goal is to test whether using the global structure algorithms we described in section 2 can significantly boost performance for cognates detection. To test this hypothesis, our first step is to implement a system that uses state of the art research results to generate the initial score matrices as a current state of the art system would currently do for this task. To that end, we implemented a baseline state of the art system that uses the information sources that previous research has found to be helpful for this task, such as phonetic information, word context information, temporal context information, word frequency information, and word burstiness information (Kondrak, 2001; Mann and Yarowsky, 2001; Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Irvine and Callison-Burch, 2013). Consistent with past work (Irvine and Callison-Burch, 2013), we use supervised training to learn the weights for combining the various information sources. The system combines the sources of information by using weights learned by an SVM (Support Vector Machine) on a small seed training set of cognates³ to optimize performance. This baseline system obtains state of the art performance on cognates detection. Using this state of the art system as our baseline, we investigated how much we could improve performance beyond current state of the art levels by applying the rescoring algorithm we described in section 2. We performed experiments on three language pairs: French-English, German-English, and Spanish-English, with different text corpora used as training and test data. The different language pairs and datasets have different levels of performance in terms of their baseline current state of the art score matrices. In the next few subsections, we describe our experimental details.

³The small seed set was randomly selected and less than 20% in all cases. It was not used for testing. Note that using this data to optimize performance of the baseline system makes our baseline even stronger and makes it even harder for our new rescoring method to achieve larger improvements.

3.1 Lemmatization

We used morphological analyzers to convert the words in text corpora to lemma form. For English, we used the NLTK WordNetLemmatizer (Bird et al., 2009). For French, German, and Spanish we used the TreeTagger (Schmid, 1994).

3.2 Word Context Information

We used the N-Gram corpus (Michel et al., 2010). For English we used the English 2012 Google 5-gram corpus, for French the French 2012 Google 5-gram corpus, for German the German 2012 Google 5-gram corpus, and for Spanish the Spanish 2012 Google 5-gram corpus. From these corpora we compute word context similarity scores across languages using Rapp's method (Rapp, 1995, 1999). The intuition behind this method is that cognates are more likely to occur in correlating context windows, and this statistic, inferred from large amounts of data, captures this correlation.

3.3 Frequency Information

The intuition is that over large amounts of data cognates should have similar relative frequencies. We compute our relative frequencies by using the same corpora mentioned in the previous subsection.

3.4 Temporal Information

The intuition is that cognates will have similar temporal distributions (Klementiev and Roth, 2006). To compute the temporal similarity we use newspaper data and convert it to simple daily word counts. For each word in the corpora the word counts create a time series vector. The Fourier transform is computed on the time series vectors. Spearman rank correlation is computed on the transform vectors. For English we used the English Gigaword Fifth Edition⁴. For French we used French Gigaword Third Edition⁵. For Spanish we used Spanish Gigaword First Edition⁶. The German news corpora were obtained by web crawling http://www.tagesspiegel.de/ and extracting the news articles.

⁴Linguistic Data Consortium Catalog No. LDC2011T07
⁵Linguistic Data Consortium Catalog No. LDC2011T10
⁶Linguistic Data Consortium Catalog No. LDC2006T12

3.5 Word Burstiness

The intuition is that cognates will have similar burstiness measures (Church and Gale, 1995). For word burstiness we used the same corpora as for the temporal information.

3.6 Phonetic Information

The intuition is that cognates will have correspondences in how they are pronounced. For this, we compute a measurement based on Normalized Edit Distance (NED).

3.7 Combining Information Sources

We combine the information sources by using a linear Support Vector Machine to learn weights for each of the information sources from a small seed training set of cognates. So our final score assigned to a candidate cognate pair (x, y) is:

  score(x, y) = Σ_{m ∈ metrics} w_m score_m(x, y),    (6)

where metrics is the set of measurements, such as phonetic similarity measurements, word burstiness similarity, relative frequency similarity, etc., that were explained in subsections 3.2 through 3.6; w_m is the learned weight for metric m; and score_m(x, y) is the score assigned to the pair (x, y) by metric m.

The scores thus assigned represent a state of the art approach for filling in the matrix identified in equation 1. At this point the matrix of scores would be used to predict cognates. We now turn to evaluation of the use of the global constraint rescoring methods from section 2 for improving performance beyond the state of the art levels.

3.8 Using Global Constraints to Rescore

For our cognates data we used the French-English pairs from (Bergsma and Kondrak, 2007) and the German-English and Spanish-English pairs from (Beinborn et al., 2013). Figure 1 shows the precision-recall⁷ curves for French-English, Figure 2 shows the performance for German-English, and Figure 3 shows the performance for Spanish-English.

⁷Precision and recall are the standard measures used for systems that perform search. Precision is the percentage of predicted cognates that are indeed cognate. Recall is the percentage of cognates that are predicted as cognate. We vary the threshold that determines cognateness to generate all points along the precision-recall curve. We start with a very high threshold enabling precision of 100% and lower the threshold until recall of 100% is reached. In particular, we sort the test examples by score in descending order and then go down the list of scores in order to complete the entire precision-recall curve.
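The rank-based rescoring of section 2 is simple to implement on top of any score matrix. The following is a minimal pure-Python sketch (our illustration, not the authors' code) of reverse rank, forward rank, and the two combination methods corresponding to equations (2) through (4); note that the `≥`-based counts mean a rank is always at least 1.

```python
def reverse_rank(S, i, j):
    """|{x_k in X : s_{x_k,y_j} >= s_{x_i,y_j}}| -- how many rows score at
    least as well against column j as row i does (section 2.1)."""
    return sum(1 for k in range(len(S)) if S[k][j] >= S[i][j])

def forward_rank(S, i, j):
    """|{y_k in Y : s_{x_i,y_k} >= s_{x_i,y_j}}| -- the column-wise analogue
    (section 2.2)."""
    return sum(1 for k in range(len(S[i])) if S[i][k] >= S[i][j])

def rescore_rr(S):
    """Equation (2): divide each score by its reverse rank."""
    return [[S[i][j] / reverse_rank(S, i, j) for j in range(len(S[i]))]
            for i in range(len(S))]

def rescore_rr_fr_1step(S):
    """Equations (3)-(4): divide by the product of reverse and forward rank."""
    return [[S[i][j] / (reverse_rank(S, i, j) * forward_rank(S, i, j))
             for j in range(len(S[i]))] for i in range(len(S))]

def rescore_rr_fr_2step(S):
    """RR_FR_2step: rescore by reverse rank first, then rescore the resulting
    matrix by forward rank."""
    S1 = rescore_rr(S)
    return [[S1[i][j] / forward_rank(S1, i, j) for j in range(len(S1[i]))]
            for i in range(len(S1))]
```

For example, a pair that is the best match in both its row and its column has reverse rank and forward rank 1 and keeps its score, while pairs dominated by many competitors are pushed down.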

Figure 1: Precision-Recall Curves for French-English. Baseline denotes state of the art performance.

Figure 2: Precision-Recall Curves for German-English. Baseline denotes state of the art performance.
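The Max Assignment point shown in the figures comes from solving the optimization in equation (5). For illustration only, here is a brute-force solver over all one-to-one assignments on a toy matrix (our sketch, not the authors' code; the paper uses the Hungarian Algorithm, which solves the same problem in polynomial time rather than factorial time):

```python
from itertools import permutations

def max_assignment(S):
    """Maximize the total score over one-to-one row/column assignments
    (equation (5)) by exhaustive search.  Only feasible for tiny n."""
    n = len(S)
    best_total, best_pairs = float("-inf"), None
    for perm in permutations(range(n)):  # perm[i] = column assigned to row i
        total = sum(S[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total = total
            best_pairs = [(i, perm[i]) for i in range(n)]
    return best_total, best_pairs
```

In practice one would use an off-the-shelf Hungarian implementation (e.g., `scipy.optimize.linear_sum_assignment`) instead of this exhaustive search.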

Note that state of the art performance (denoted in the figures as Baseline) has very different performance across the three datasets, but in all cases the systems from section 2 that incorporate global constraints and perform rescoring greatly exceed current state of the art performance levels. The Max Assignment is really just the single point that the Hungarian Algorithm finds. We drew lines connecting it, but keep in mind that those lines are just connecting the single point to the endpoints. Max Assignment Score traces the precision-recall curve back from the Max Assignment by steadily increasing the threshold so that only points in the maximum assignment set with scores above the increasing threshold are predicted as cognate.

Figure 3: Precision-Recall Curves for Spanish-English. Baseline denotes state of the art performance.

For the non-max assignment curves, it is sometimes helpful to compute a single metric summarizing important aspects of the full curve. For this purpose, MaxF1 and 11-point interpolated average precision are often used. MaxF1 is the F1 measure (i.e., harmonic mean of precision and recall) at the point on the precision-recall curve where F1 is highest. The interpolated precision p_interp at a given recall level r is defined as the highest precision level found for any recall level r' ≥ r:

  p_interp(r) = max_{r' ≥ r} p(r').    (7)

The 11-point interpolated average precision (11-point IAP) is then the average of the p_interp at r = 0.0, 0.1, ..., 1.0. Table 1 shows these performance measures for French-English, Table 2 shows the results for German-English, and Table 3 shows the results for Spanish-English. In all cases, using global structure greatly improves upon the state of the art baseline performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      54.92     50.99
RR            62.94     59.62
RR_FR_1STEP   68.35     64.42
RR_FR_2STEP   69.72     67.29

Table 1: French-English Performance. BASELINE indicates current state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      21.38     16.25
RR            22.71     17.80
RR_FR_1STEP   28.68     22.37
RR_FR_2STEP   28.11     21.83

Table 2: German-English Performance. BASELINE indicates current state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      56.26     57.03
RR            68.52     69.33
RR_FR_1STEP   70.66     71.47
RR_FR_2STEP   73.01     74.22

Table 3: Spanish-English Performance. BASELINE indicates current state of the art performance.

In (Bergsma and Kondrak, 2007), for French-English data a result of 66.5 11-point IAP is reported for a situation where word alignments from a bitext are available, and a result of 77.7 11-point IAP is reported for a situation where translation pairs are available in large quantities. The setting considered in the current paper is much more challenging since it does not use bilingual dictionaries or word alignments from bitexts. The setting in the current paper is the one mentioned as future work on page 663 of (Bergsma and Kondrak, 2007): "In particular, we plan to investigate approaches that do not require the bilingual dictionaries or bitexts to generate training data."

Note that the evaluation thus far is a bit artificial for real cognates detection because in a real setting you wouldn't only be selecting matches for relatively small subsets of words that are guaranteed to have a cognate on the other side. Such was the case for our evaluation, where the French-English set had approx. 600 cognate pairs, the German-English set had approx. 1000 pairs, and the Spanish-English set had approx. 3000 pairs. In a real setting, the system would have to consider words that don't have a cognate match in the other language and not only words that were hand-selected and guaranteed to have cognates. We are not aware of others evaluating according to this much more difficult condition, but we think it is important to consider, especially given the potential impacts it could have on the global structure methods we've put forward. Therefore, we run a second set of evaluations where we take the ten thousand most common words in our corpora for each of our languages, which contain many of the cognates from the standard test sets, and we add in any remaining words from the standard test sets that didn't make it into the top ten thousand. We then repeat each of the experiments under this much more challenging condition. With approx. ten thousand squared candidates, i.e., approx. 100 million candidates, to consider for cognateness, this is a large data condition.

The Hungarian didn't run to completion on two of the datasets due to limited computational resources. On French-English it completed, but achieved poorer performance than any of the other methods. This makes sense, as it is designed for when there really is a bipartite matching to be found, as in the artificial yet standard cognates evaluation that was just presented. When confronted with large amounts of words that create a much denser space and have no match at all on the other side, the all or nothing assignments of the Hungarian are not ideal. The reverse rank and forward rank rescoring methods are still quite effective in improving performance, although not by as much as they did in the small data results from above.

Figure 4 shows the full precision-recall curves for French-English for the large data condition, Figure 5 shows the curves for German-English for the large data condition, and Figure 6 shows the results for Spanish-English for the large data condition.

Figure 4: Precision-Recall Curves for French-English (large data). Note that Baseline denotes state of the art performance.

Figure 5: Precision-Recall Curves for German-English (large data). Note that Baseline denotes state of the art performance.

Figure 6: Precision-Recall Curves for Spanish-English (large data). Note that Baseline denotes state of the art performance.

Tables 4 through 6 show the summary metrics for the three language pairs for the large data experiments. We can see that the reverse rank and forward rank methods of taking into account the global structure of interactions among predictions are still helpful, providing large improvements in performance even in this challenging large data condition over strong state of the art baselines that make cognate predictions independently of each other and don't do any rescoring based on global constraints.

METHOD        MAX F1    11-POINT IAP
BASELINE      55.08     51.35
RR            60.88     58.79
RR_FR_1STEP   65.87     63.55
RR_FR_2STEP   65.76     65.26

Table 4: French-English Performance (large data). BASELINE indicates state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      21.25     16.17
RR            24.78     19.13
RR_FR_1STEP   30.72     24.97
RR_FR_2STEP   30.34     24.86

Table 5: German-English Performance (large data). BASELINE indicates state of the art performance.

METHOD        MAX F1    11-POINT IAP
BASELINE      54.75     54.55
RR            62.52     61.42
RR_FR_1STEP   66.45     65.89
RR_FR_2STEP   66.38     65.50

Table 6: Spanish-English Performance (large data). BASELINE indicates state of the art performance.

4 Discussion

We believe that this work opens up new avenues for further exploration. A few of these include the following:

• investigating the utility of applying and extending the method to other applications such as Information Extraction applications, many of which have similar global constraints as cognates detection;

• investigating how to handle other forms of global structure, including tendencies that are not necessarily hard constraints;

• developing more theory to more precisely understand some of the nuances of using global structure when it's applicable, and making connections with other areas of machine learning such as semi-supervised learning, active learning, etc.; and

• investigating how to have a machine learn that global structure exists and learn what form of global structure exists.
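As a concrete reference for the evaluation protocol of section 3.8, the MaxF1 and 11-point IAP summary metrics can be computed from a ranked candidate list as follows. This is our own sketch, not the authors' evaluation code; it assumes the candidates are sorted by score in descending order with binary gold labels, and applies the interpolation of equation (7).

```python
def pr_points(ranked_labels, total_positives):
    """Walk down the ranked list (as in footnote 7), emitting a
    (precision, recall) point after each successive prediction."""
    points, tp = [], 0
    for rank, label in enumerate(ranked_labels, start=1):
        tp += label
        points.append((tp / rank, tp / total_positives))
    return points

def max_f1(points):
    """MaxF1: the highest harmonic mean of precision and recall on the curve."""
    return max((2 * p * r / (p + r) for p, r in points if p + r > 0),
               default=0.0)

def iap_11pt(points):
    """11-point IAP: average of p_interp(r) for r = 0.0, 0.1, ..., 1.0,
    where p_interp(r) = max precision at any recall level >= r (eq. 7)."""
    levels = [i / 10 for i in range(11)]
    interp = []
    for r in levels:
        candidates = [p for p, rec in points if rec >= r]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / len(levels)
```

For instance, a ranked list with labels [1, 0, 1] over two gold cognates yields the points (1.0, 0.5), (0.5, 0.5), and (0.667, 1.0), giving a MaxF1 of 0.8.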

5 Conclusions

Cognates detection is an interesting and challenging task. Previous work has yielded state of the art approaches that create a matrix of scores for all word pairs based on optimized weighted combinations of component scores computed on the basis of various helpful sources of information such as phonetic information, word context information, temporal context information, word frequency information, and word burstiness information. However, when assigning a score to a word pair, the current state of the art methods do not take into account scores assigned to other word pairs. We proposed a method for rescoring the matrix that current state of the art methods produce by taking into account the scores assigned to other word pairs.

The methods presented in this paper are complementary to existing state of the art methods, easy to implement, computationally efficient, and practically effective in improving performance by large amounts. Experimental results reveal that the new methods significantly improve state of the art performance in multiple cognates detection experiments conducted on standard freely and publicly available datasets with different language pairs and various conditions, such as different levels of baseline performance and different data size conditions, including with more realistic large data size conditions than have been evaluated with in the past.

References

Lisa Beinborn, Torsten Zesch, and Iryna Gurevych. 2013. Cognate production using character-based machine translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, Nagoya, Japan, pages 883–891. http://www.aclweb.org/anthology/I13-1112.

Shane Bergsma and Grzegorz Kondrak. 2007. Alignment-based discriminative string similarity. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, pages 656–663. http://www.aclweb.org/anthology/P07-1083.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences 110:4224–4229. https://doi.org/10.1073/pnas.1204678110.

Kenneth W. Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering 1:163–190.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Michael Collins and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics 31(1):25–70. https://doi.org/10.1162/0891201053630273.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume. Barcelona, Spain, pages 111–118. https://doi.org/10.3115/1218955.1218970.

David Hall and Dan Klein. 2011. Large-scale cognate recovery. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Edinburgh, Scotland, UK, pages 344–354. http://www.aclweb.org/anthology/D11-1032.

Ann Irvine and Chris Callison-Burch. 2013. Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 518–523. http://www.aclweb.org/anthology/N13-1056.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Sydney, Australia, pages 817–824. https://doi.org/10.3115/1220175.1220278.

Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL '01, pages 1–8. http://www.aclweb.org/anthology/N/N01/N01-1014.pdf.

Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates can improve statistical translation models. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Companion Volume of the Proceedings of HLT-NAACL 2003 – Short Papers, Volume 2. Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL-Short '03, pages 46–48. http://www.aclweb.org/anthology/N/N03/N03-2016.pdf.

Gideon S. Mann and David Yarowsky. 2001.
Mul- ceedings of the 2002 Conference on Empirical tipath translation lexicon induction via bridge lan- Methods in Natural Language Processing. Asso- guages. In Proceedings of the second meet- ciation for Computational Linguistics, pages 1–8. ing of the North American Chapter of the As- https://doi.org/10.3115/1118693.1118694. sociation for Computational Linguistics on Lan- guage technologies. Association for Computa- Ben Taskar, Lacoste-Julien Simon, and Klein Dan. tional Linguistics, Stroudsburg, PA, USA, NAACL 2005b. A discriminative matching approach ’01, pages 1–8. http://www.aclweb.org/ to word alignment. In Proceedings of Hu- anthology/N/N01/N01-1020.pdf. man Language Technology Conference and Con- ference on Empirical Methods in Natural Lan- Benjamin S. Mericli and Michael Bloodgood. 2012. guage Processing. Association for Computational Annotating cognates and etymological origin in Tur- Linguistics, Vancouver, British Columbia, Canada, kic languages. In Proceedings of the First Workshop pages 73–80. http://www.aclweb.org/ on Language Resources and Technologies for Tur- anthology/H/H05/H05-1010.pdf. kic Languages. European Language Resources As- sociation, Istanbul, Turkey, pages 47–51. http: //arxiv.org/abs/1501.03191.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2010. Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182. https://doi.org/10.1126/science.1199644.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Cambridge, Massachusetts, USA, pages 320–322. https://doi.org/10.3115/981658.981709.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, USA, pages 519–526. https://doi.org/10.3115/1034678.1034756.

Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20. Association for Computational Linguistics, Stroudsburg, PA, USA, pages 1–7. http://www.aclweb.org/anthology/W/W02/W02-2026.pdf.

Helmut Schmid. 1994. Part-of-speech tagging with neural networks. In COLING. pages 172–176. http://www.aclweb.org/anthology/C/C94/C94-1027.pdf.

Michel Simard, George F. Foster, and Pierre Isabelle. 1993. Using cognates to align sentences in bilingual corpora. In Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing - Volume 2. IBM Press, CASCON ’93, pages 1071–1082.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005a. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning. ACM, pages 896–903.
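To make the rescoring idea described in the conclusions concrete, the following is a minimal illustrative sketch. It is not the RR or RR_FR algorithms evaluated in the paper; it only demonstrates the general principle that a pair's final score should depend on the scores assigned to competing pairs. Here each cell of the score matrix is adjusted by its margin over the best competitor in the same row and column, reflecting a soft one-to-one global constraint. The `rescore` function and the example matrix are hypothetical.

```python
def rescore(matrix):
    """Rescore each (source word, target word) pair by its margin over
    the strongest competing pair in the same row and column.

    A pair that dominates both its row and its column keeps a positive
    score; a pair outranked by some competitor is pushed below zero.
    This makes every rescored value depend on scores assigned to other
    word pairs, unlike baseline systems that score pairs independently.
    """
    n, m = len(matrix), len(matrix[0])
    rescored = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Best competing score among other targets for source i.
            row_rival = max(matrix[i][k] for k in range(m) if k != j)
            # Best competing score among other sources for target j.
            col_rival = max(matrix[k][j] for k in range(n) if k != i)
            rescored[i][j] = matrix[i][j] - max(row_rival, col_rival)
    return rescored


# Hypothetical 3x3 score matrix: rows are source-language words,
# columns are target-language words.
scores = [
    [0.9, 0.4, 0.1],
    [0.8, 0.7, 0.2],
    [0.1, 0.3, 0.6],
]
print(rescore(scores))
```

Note that cell (1, 0) starts with a high raw score (0.8) but is pushed below zero after rescoring, because source word 0 is an even stronger candidate (0.9) for the same target word; a per-pair scorer would have no way to make that adjustment.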