An Etymological Approach to Cross-Language Orthographic Similarity
Total Page:16
File Type:pdf, Size:1020Kb
An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian Alina Maria Ciobanu, Liviu P. Dinu Faculty of Mathematics and Computer Science, University of Bucharest Center for Computational Linguistics, University of Bucharest [email protected],[email protected] Abstract spondences are the most popular approaches em- ployed for establishing relationships between lan- In this paper we propose a computational guages. Barbanc¸on et al. (2013) emphasize the va- method for determining the orthographic riety of computational methods used in this field, similarity between Romanian and related and state that the differences in datasets and ap- languages. We account for etymons and proaches cause difficulties in the evaluation of the cognates and we investigate not only the results regarding the reconstruction of the phylo- number of related words, but also their genetic tree of languages. Linguistic phylogeny forms, quantifying orthographic similari- reconstruction proves especially useful in histor- ties. The method we propose is adaptable ical and comparative linguistics, as it enables the to any language, as far as resources are analysis of language evolution. Ringe et al. (2002) available. propose a computational method for evolutionary 1 Introduction tree reconstruction based on a “perfect phylogeny” Language relatedness and language change across algorithm; using a Bayesian phylogeographic ap- space and time are two of the main questions of the proach, Alekseyenko et al. (2012), continuing the historical and comparative linguistics (Rama and work of Atkinson et al. (2005), model the expan- Borin, 2014). Many comparative methods have sion of the Indo-European language family and been used to establish relationships between lan- find support for the hypothesis which places its guages, to determine language families and to re- homeland in Anatolia; Atkinson and Gray (2006) construct their proto-languages (Durie and Ross, analyze language divergence dates and argue for 1996). If grouping of languages in linguistic fam- the usage of computational phylogenetic meth- ods in the question of Indo-European age and ori- ilies is generally accepted, the relationships be- 1 tween languages belonging to the same family are gins. Using modified versions of Swadesh’s lists , periodically investigated. In spite of the fact that Dyen et al. (1992) investigate the classification of linguistic literature abounds in claims of classifi- Indo-European languages by applying a lexicosta- cation of natural languages, the degrees of similar- tistical method. ity between languages are far from being certain. The similarity of languages is interesting not In many situations, the similarity of natural lan- only for historical and comparative linguistics, guages is a fairly vague notion, both linguists and but for machine translation and language acqui- non-linguists having intuitions about which lan- sition as well. Scannell (2006) and Hajicˇ et al. guages are more similar to which others. McMa- (2000) argue for the possibility of obtaining a bet- hon and McMahon (2003) and Rama and Borin ter translation quality using simple methods for (2014) note that the computational historical lin- very closely related languages. Koppel and Ordan guistics did not receive much attention until the (2011) study the impact of the distance between beginning of the 1990s, and argue for the necessity languages on the translation product and conclude of development of quantitative and computational that it is directly correlated with the ability to dis- methods in this field. tinguish translations from a given source language from non-translated text. Some genetically re- 1.1 Related Work lated languages are so similar to each other, that According to Campbell (2003), the methods based on comparisons of cognate lists and sound corre- 1http://www.wordgumbo.com/ie/cmp/iedata.txt 1047 Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1047–1058, October 25-29, 2014, Doha, Qatar. c 2014 Association for Computational Linguistics speakers of such languages are able to communi- man experts (Rama and Borin, 2014)). Our ap- cate without prior instruction (Gooskens, 2007). proach implies a detailed investigation which ac- Gooskens et al. (2008) analyze several phonetic counts not only for the number of related words, and lexical predictors and their conclusion is that as it is usually done in lexicostatistics (where the lexical similarity can be seen as a predictor of lan- relationships between languages are determined guage intelligibility. The impact of language sim- based on the percentage of related words), but also ilarities in the process of second language acquisi- for their forms, quantifying orthographic similari- tion is argued by the contrastive analysis hypoth- ties. We employ three string similarity metrics for esis, which claims that where similarities between a finer-grained analysis, as related words in dif- the first and the second language occur, the acqui- ferent languages do not have identical forms and sition would be easier compared with the situation their partial similarity implies different degrees of in which there were differences between the two recognition and comprehensibility. For example, languages (Benati and VanPatten, 2011). the Romanian word luna˘ (moon) is closer to its Latin etymon luna than the word batr˘ anˆ (old) to 1.2 Our Approach its etymon veteranus, and the Romanian word vantˆ (wind) is closer to its French cognate pair vent than Although there are multiple aspects that are rele- the word castel (castle) to its cognate pair chateauˆ . vant in the study of language relatedness, such as the orthographic, phonetic, syntactic and semantic In this paper we investigate the orthographic differences, in this paper we focus only on the or- similarity between Romanian and related lan- thographic similarity. The orthographic approach guages. Romanian is a Romance language, be- relies on the idea that sound changes leave traces longing to the Italic branch of the Indo-European in the orthography, and alphabetic character cor- language family, and is of particular interest re- respondences represent, to a fairly large extent, garding its geographic setting. It is surrounded sound correspondences (Delmestri and Cristianini, by Slavic languages and its relationship with the 2010). big Romance kernel was difficult. Besides gen- In this paper we propose an orthographic simi- eral typological comparisons that can be made larity method focused on etymons (direct sources between any two or more languages, Romanian of the words in a foreign language) and cognates can be studied based on comparisons of ge- (words in different languages having the same ety- netic and geographical nature, participating in nu- mology and a common ancestor). In a broadly ac- merous areally-based similarities that define the cepted sense, the higher the similarity degree be- Balkan convergence area. Joseph (1999) states tween two languages, the closer they are. that, regarding the genetic relationships, Roma- One of our motivations is that when people en- nian can be studied in the context of those lan- counter a language for the first time in written guages most closely related to it and that the form, it is most likely that they can distinguish and well-studied Romance languages enable compar- individualize words which resemble words from isons that might not be possible otherwise, within their native language. These words are proba- less well-documented families of languages. The bly either inherited from their mother tongue (ety- position of Romanian within the Romance fam- mons), or have a common ancestor with the words ily is controversial (McMahon and McMahon, in their language (cognates). 2003): either marginal or more integrated within the group, depending on the versions of the cog- Our first goal is, given a corpus C, to automat- nate lists that are used in the analysis. ically detect etymons and cognates. In Section 2 we propose a dictionary-based approach to auto- In Section 3.1 we apply our method on Roma- matically extract related words, and a method for nian in different stages of its evolution, running computing the orthographic similarity of natural our experiments on high-volume corpora from languages. Most of the traditional approaches in three historical periods: the period approximately this field focus either on etymology detection or between 1642 and 1743, the second half of 19th on cognate identification, most of them reporting century (1870 - 1889), and the present period. In results only on small sets of cognate pairs (usually Section 3.2 we make use of a fourth corpus, Eu- manually determined lists of about 200 cognates, roparl, with a double goal: on the one hand, to for which the cognate judgments are made by hu- check if degrees of similarity between Romanian 1048 and other languages in the present period are con- C Lingua sistent across two different corpora, and on the etymology other hand, to investigate whether there are dif- etymology ferences between the overall degrees of similarity Nlingua cognates obtained for the entire corpus and those obtained cognates in various experiments at sentence level. The con- clusions of our paper are outlined in Section 4. λ 2 Methodology and Algorithm Nwords - Nlingua λ In this section we introduce a technique for deter- λ mining the orthographic similarity of languages. λ In order to obtain accurate results, we investigate