Identifying Cognate Sets Across Dictionaries of Related Languages

Total Page:16

File Type:pdf, Size:1020Kb

Identifying Cognate Sets Across Dictionaries of Related Languages Identifying Cognate Sets Across Dictionaries of Related Languages Adam St Arnaud David Beck Grzegorz Kondrak Dept of Computing Science Dept of Linguistics Dept of Computing Science University of Alberta University of Alberta University of Alberta [email protected] [email protected] [email protected] Abstract 1992). While cognates are valuable to linguists, their identification is a time-consuming process, We present a system for identifying cog- even for experts, who have to sift through hun- nate sets across dictionaries of related lan- dreds or even thousands of words in related lan- guages. The likelihood of a cognate re- guages. The languages that are the least well stud- lationship is calculated on the basis of a ied, and therefore the ones in which historical lin- rich set of features that capture both pho- guists are most interested, often lack cognate in- netic and semantic similarity, as well as formation. the presence of regular sound correspon- A number of computational methods have been dences. The similarity scores are used proposed to automate the process of cognate iden- to cluster words from different languages tification. Many of the systems focus on iden- that may originate from a common proto- tifying cognates within classes of semantically word. When tested on the Algonquian lan- equivalent words, such as Swadesh lists of basic guage family, our system detects 63% of concepts. Those systems, which typically con- cognate sets while maintaining cluster pu- sider only the phonetic or orthographic forms of rity of 70%. words, can be further divided into the ones that operate on language pairs (pairwise) vs. multilin- 1 Introduction gual approaches. However, because of seman- Cognates are words in related languages that have tic drift, many cognates are no longer exact syn- originated from the same word in an ancestor lan- onyms, which severely limits the effectiveness of guage; for example English earth and German such systems. For example, a cognate pair like En- Erde. On average, cognates display higher pho- glish bite and French fendre “to split” cannot be netic and semantic similarity than random word detected because these words are listed under dif- pairs between languages that are indisputably re- ferent basic meanings in the Comparative Indoeu- lated (Kondrak, 2013). The term cognate is some- ropean Database (Dyen et al., 1992). In addition, times used within computational linguistics to de- the number of basic concepts is typically small. note orthographically similar words that have the In this paper, we address the challenging task of same meaning (Nakov and Tiedemann, 2012). In identifying cognate sets across multiple languages this work, however, we adhere to the strict linguis- directly from dictionary lists representing related tic definition of cognates and aim to distinguish languages, by taking into account both the forms them from lexical borrowings by detecting regular of words and their dictionary definitions (c.f. Fig- sound correspondences. ure1). Our methods are designed for less-studied Cognate information between languages is crit- languages — we assume only the existence of ba- ical to the field of historical and comparative lin- sic dictionaries containing a substantial number of guistics, where it plays a central role in determin- word forms in a semi-phonetic notation, with the ing the relations and structures of language fami- meaning of words conveyed using one of the major lies (Trask, 1996). Automated phylogenetic recon- languages. Such dictionaries are typically created structions often rely on cognate information as in- before Bible translations, which have been accom- put (Bouchard-Cotˆ e´ et al., 2013). The percentage plished for most of the world’s languages. of shared cognates can also be used to estimate the While our approach is unsupervised, assuming time of pre-historic language splits (Dyen et al., no cognate sets from the analyzed language fam- 2519 Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics aːyweːpiwin ease, rest kaːskipiteːw he pulls him scraping C mihkweːkin red cloth C aːyweːpiwin ease, rest … 42 O aːnweːpiwin rest, repose ahkawaːpiwa he watches meškweːkenwi red woolen handcloth F yoːwe earlier, before … C mihkweːkin red cloth F meškweːkenwi red woolen handcloth 1725 M mæhkiːkan red flannel iːnekænok they are so big, tall O miskweːkin red cloth kaːskeponæːw he scratches him M mæhkiːkan red flannel … C kaːskipiteːw he pulls him scraping aːnweːpiwin rest, repose M kaːskeponæːw he scratches him 872 kaːškipin scrape, claw O kaːškipin scrape, claw O miskweːkin red cloth … Figure 1: Example of multilingual cognate set identification across four Algonquian dictionaries: Cree (C), Fox (F), Menominee (M) and Ojibwa (O). Cognate set numbers are shown on the right. ily to start with, it incorporates supervised ma- netic forms. Turchin et al.(2010) apply a heuris- chine learning models that either leverage cognate tic based on consonant classes to identify the ratio data from unrelated families, or use self-training of cognate pairs to non-cognate pairs between lan- on subsets of likely cognate pairs. We derive guages in an effort to determine the likelihood that two types of models to classify pairs of words they are related. Ciobanu and Dinu(2013) find across languages as either cognate or not. The cognate pairs by referring to dictionaries contain- language-independent general model employs a ing etymological information. Rama(2015) ex- number of features defined on both word forms periments with features motivated by string ker- and definitions, including word vector representa- nels for pairwise cognate classification. tions. The additional specific models exploit regu- A more challenging version of the task is to lar sound correspondences between specific pairs cluster cognates within lists of words that have of languages. The scores from the general and identical definitions. Hauer and Kondrak(2011) specific models inform a clustering algorithm that use confidence scores from a binary classifier that constructs the proposed cognate sets. incorporates a variety of string similarity features We evaluate our system on dictionary lists that to guide an average score clustering algorithm. represent four indigenous North American lan- Hall and Klein(2010, 2011) define generative guages from the Algonquian family. On the task models that model the evolution of words along of pairwise classification, we achieve a 42% error a phylogeny according to automatically learned reduction with respect to the state of the art. On sound laws in the form of parametric edit dis- the task of multilingual clustering, our system de- tances. List and Moran(2013) propose an ap- tects 63% of gold sets, while maintaining a cluster proach based on sound class alignments and an av- purity score of 70%. The system code is publicly 1 erage score clustering algorithm. List et al.(2016) available. extend the approach to include partial cognates 2 Related Work within word lists. Cognate identification that considers semantic Most previous work in automatic cognate identi- information is a less-studied problem. Again, the fication only consider words as cognates if they task can be framed as either a pairwise classifi- have identical definitions. As such, they make lim- cation or multi-lingual clustering. In a pairwise ited or no use of semantic information. The sim- context, Kondrak(2004) describes a system for plest variant of this task is to make pairwise cog- identifying cognates between language dictionar- nate classifications based on orthographic or pho- ies which is based on phonetic similarity, com- 1https://github.com/ajstarna/SemaPhoR plex multi-phoneme correspondences, and seman- 2520 tic information. The method of Wang and Sitbon LCSR is the longest common subsequence ra- • (2014) employs word sense disambiguation com- tio of the words. bined with classic string similarity measures for Alignment score reflects an overall phonetic finding cognate pairs in parallel texts to aid lan- • similarity, provided by the ALINE phonetic guage learners. aligner (Kondrak, 2009). Finally, very little has been published on creat- ing cognate sets based on both phonetic and se- Consonant match returns the number of • mantic information, which is the task that we fo- aligned consonants normalized by the num- cus on in this paper. Kondrak et al.(2007) com- ber of consonants in the longer word. bine phonetic and sound correspondence scores For example, consider the words meskweˇ :kenwi with a simple semantic heuristic, and create cog- and mæhki:kan (meSkwekenwi and mEhkikan in nate sets by using graph-based algorithms on con- ASJP notation) from cognate set 1725 in Figure1. nected components. Steiner et al.(2011) aim at The corresponding values for the above four fea- a fully automated approach to the comparative tures are 0.364, 0.364, 0.523, and 0.714, respec- method, including cognate set identification and tively. language phylogeny construction. Neither of those The semantic features refer to the dictionary systems and datasets are publicly available for the definitions of words. We assume that the defini- purpose of direct comparison to our method. tions are provided in a single meta-language, such as English or Spanish. We consider not only a def- 3 Methods inition in its entirety, but also its sub-definitions, In this section, we describe the design of our which are separated by commas and semicolons. language-independent general model, as well as We distinguish between a closed class of about the language-specific models. Given a pair of 300 stop words, which express grammatical re- words from related languages, the models produce lationships, and an open class of content words, a score that reflects the likelihood of the words which carry a meaning. Filtering out stopwords re- being cognate. The models are implemented as duces the likelihood of spurious matches between Support Vector Machine (SVM) classifiers via the dictionary definitions.
Recommended publications
  • Interdisciplinary Approaches to Stratifying the Peopling of Madagascar
    INTERDISCIPLINARY APPROACHES TO STRATIFYING THE PEOPLING OF MADAGASCAR Paper submitted for the proceedings of the Indian Ocean Conference, Madison, Wisconsin 23-24th October, 2015 Roger Blench McDonald Institute for Archaeological Research University of Cambridge Correspondence to: 8, Guest Road Cambridge CB1 2AL United Kingdom Voice/ Ans (00-44)-(0)1223-560687 Mobile worldwide (00-44)-(0)7847-495590 E-mail [email protected] http://www.rogerblench.info/RBOP.htm This version: Makurdi, 1 April, 2016 1 Malagasy - Sulawesi lexical connections Roger Blench Submission version TABLE OF CONTENTS TABLE OF CONTENTS................................................................................................................................. i ACRONYMS ...................................................................................................................................................ii 1. Introduction................................................................................................................................................. 1 2. Models for the settlement of Madagascar ................................................................................................. 2 3. Linguistic evidence...................................................................................................................................... 2 3.1 Overview 2 3.2 Connections with Sulawesi languages 3 3.2.1 Nouns..............................................................................................................................................
    [Show full text]
  • Cognate Words in Mehri and Hadhrami Arabic
    Cognate Words in Mehri and Hadhrami Arabic Hassan Obeid Alfadly* Khaled Awadh Bin Mukhashin** Received: 18/3/2019 Accepted: 2/5/2019 Abstract The lexicon is one important source of information to establish genealogical relations between languages. This paper is an attempt to describe the lexical similarities between Mehri and Hadhrami Arabic and to show the extent of relatedness between them, a very little explored and described topic. The researchers are native speakers of Hadhrami Arabic and they paid many field visits to the area where Mehri is spoken. They used the Swadesh list to elicit their data from more than 20 Mehri informants and from Johnston's (1987) dictionary "The Mehri Lexicon and English- Mehri Word-list". The researchers employed lexicostatistical techniques to analyse their data and they found out that Mehri and Hadhrmi Arabic have so many cognate words. This finding confirms Watson (2011) claims that Arabic may not have replaced all the ancient languages in the South-Western Arabian Peninsula and that dialects of Arabic in this area including Hadhrami Arabic are tinged, to a greater or lesser degree, with substrate features of the Pre- Islamic Ancient and Modern South Arabian languages. Introduction: three branches including Central Semitic, Historically speaking, the Semitic language Ethiopian and Modern south Arabian languages family from which both of Arabic and Mehri (henceforth MSAL). Though Arabic and Mehri descend belong to a larger family of languages belong to the West Semitic, Arabic descends called Afro-Asiatic or Hamito-Semitic that from the Central Semitic and Mehri from includes Semitic, Egyptian, Cushitic, Omotic, (MSAL) which consists of two branches; the Berber and Chadic (Rubin, 2010).
    [Show full text]
  • Comparing the Cognate Effect in Spoken and Written Second Language Word Production
    1 Short title: Cognate effect in spoken and written word production Comparing the cognate effect in spoken and written second language word production 1 2 1 Merel Muylle , Eva Van Assche & Robert J. Hartsuiker 1Department of Experimental Psychology, Ghent University, Ghent, Belgium 2 Thomas More University of Applied Sciences, Antwerp, Belgium Address for correspondence: Merel Muylle Department of Experimental Psychology Ghent University Henri Dunantlaan 2 B-9000 Gent (Belgium) E-mail: [email protected] 2 Abstract Cognates – words that share form and meaning between languages – are processed faster than control words. However, it is unclear whether this effect is merely lexical (i.e., central) in nature, or whether it cascades to phonological/orthographic (i.e., peripheral) processes. This study compared the cognate effect in spoken and typewritten production, which share central, but not peripheral processes. We inquired whether this effect is present in typewriting, and if so, whether its magnitude is similar to spoken production. Dutch-English bilinguals performed either a spoken or written picture naming task in English; picture names were either Dutch-English cognates or control words. Cognates were named faster than controls and there was no cognate-by-modality interaction. Additionally, there was a similar error pattern in both modalities. These results suggest that common underlying processes are responsible for the cognate effect in spoken and written language production, and thus a central locus of the cognate effect. Keywords: bilingualism, word production, cognate effect, writing 3 Converging evidence suggests that bilinguals activate both their mother tongue (L1) and their second language (L2) simultaneously when processing linguistic information (e.g., Dijkstra & Van Heuven, 2002; Van Hell & Dijkstra, 2002).
    [Show full text]
  • 1 Roger Schwarzschild Rutgers University 18 Seminary Place New
    Roger Schwarzschild Rutgers University 18 Seminary Place New Brunswick, NJ 08904 [email protected] to appear in: Recherches Linguistiques de Vincennes November 30, 2004 ABSTRACT In some languages, measure phrases can appear with non-compared adjectives: 5 feet tall. I address three questions about this construction: (a) Is the measure phrase an argument of the adjective or an adjunct? (b) What are we to make of the markedness of this construction *142lbs heavy? (c) Why is it that the markedness disappears once the adjective is put in the comparative (2 inches taller alongside 2lbs heavier)? I claim that because degree arguments are ‘functional’, the measure phrase has to be an adjunct and not a syntactic argument of the adjective. Like event modifiers in extended NP’s and in VPs, the measure phrase predicates of a degree argument of the adjective. But given the kind of meaning a measure phrase must have to do its job in comparatives and elsewhere, it is not of the right type to directly predicate of a degree argument. I propose a lexically governed type-shift which applies to some adjectives allowing them to combine with a measure phrase. KEY WORDS adjective, measure phrase, degree, functional category, lexical, adjunct, argument, antonymy. Measure Phrases as Modifiers of Adjectives1 1. Introduction There is a widely accepted account of expressions like five feet tall according to which the adjective tall denotes a relation between individuals and degrees of height and the measure phrase, five feet, serves as an argument of the adjective, saturating the degree-place in the relation2.
    [Show full text]
  • Palatalization/Velar Softening: What It Is and What It Tells Us About the Nature of Language Morris Halle
    Palatalization/Velar Softening: What It Is and What It Tells Us about the Nature of Language Morris Halle This study proposes an account both of the consonantal changes in- volved in palatalization/velar softening and of the fact that this change is encountered before front vowels. The change is a straightforward case of feature assimilation provided that segments/phonemes are viewed as complexes of features organized into the ‘‘bottle brush’’ model illustrated in example (4) and elsewhere in the text, and that the universal set of features includes, in addition to the familiar binary features, six unary features, which specify the designated articulator(s) for every segment (not only for consonants). Keywords: phonetics, palatalization, velar softening, features, articula- tors For Ken Stevens, friend and colleague, on his 80th birthday My purpose in this study is to present an account of the very common alternation between dorsal and coronal consonants often referred to as palatalization or velar softening. This alternation, exemplified by English electri[k] ϳ electri[s]ity, occurs most often before front vowels. In spite of its extremely common occurrence in the languages of the world, to this time there has been no proper account of palatalization that would relate it to the other properties of language, in particular, to the fact that it is found most commonly before front vowels. In the course of working on these remarks, it became clear to me that palatalization raises numerous theoretical questions about which there is at present no agreement among phonologists. Since these include matters of the most fundamental importance for phonology, I start by exten- sively discussing the main issues and the views on them that seem to me most persuasive at this time (section 1).
    [Show full text]
  • Jicaque As a Hokan Language Author(S): Joseph H
    Jicaque as a Hokan Language Author(s): Joseph H. Greenberg and Morris Swadesh Source: International Journal of American Linguistics, Vol. 19, No. 3 (Jul., 1953), pp. 216- 222 Published by: The University of Chicago Press Stable URL: http://www.jstor.org/stable/1263010 Accessed: 11-07-2017 15:04 UTC REFERENCES Linked references are available on JSTOR for this article: http://www.jstor.org/stable/1263010?seq=1&cid=pdf-reference#references_tab_contents You may need to log in to JSTOR to access the linked references. JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://about.jstor.org/terms The University of Chicago Press is collaborating with JSTOR to digitize, preserve and extend access to International Journal of American Linguistics This content downloaded from 12.14.13.130 on Tue, 11 Jul 2017 15:04:26 UTC All use subject to http://about.jstor.org/terms JICAQUE AS A HOKAN LANGUAGE JOSEPH H. GREENBERG AND MORRIS SWADESH COLUMBIA UNIVERSITY 1. The problem 2. The phonological equivalences in Hokan 2. Phonological note have been largely established by Edward 3. Cognate list Sapir's work.3 The Jicaque agreements are 4. Use of lexical statistics generally obvious. A special point is that 5.
    [Show full text]
  • Cognet: a Large-Scale Cognate Database
    CogNet: a Large-Scale Cognate Database Khuyagbaatar Batsuren† Gábor Bella† Fausto Giunchiglia†§ DISI, University of Trento, Trento, Italy† Jilin University, Changchun, China§ {k.batsuren; gabor.bella; fausto.giunchiglia}@unitn.it Abstract IELex2, or ABVD (Greenhill et al., 2008), cover only the small set of 225 Swadesh basic concepts, This paper introduces CogNet, a new, although with an extremely wide coverage of up to large-scale lexical database that provides 4000 languages. Secondly, in these databases, lex- cognates—words of common origin and ical entries that belong to scripts other than Latin meaning—across languages. The database or Cyrillic mostly appear in phonetic transcription currently contains 3.1 million cognate pairs instead of using their actual orthographies in their across 338 languages using 35 writing sys- tems. The paper also describes the automated original scripts. These limitations prevent such re- method by which cognates were computed sources from being used in real-world computa- from publicly available wordnets, with an tional tasks on written language. accuracy evaluated to 94%. Finally, statistics This paper describes CogNet, a new large-scale, and early insights about the cognate data high-precision, multilingual cognate database, as are presented, hinting at a possible future well as the method used to build it. Our main exploitation of the resource1 by various fields technical contributions are (1) a general method of lingustics. to detect cognates from multilingual lexical re- sources, with precision and recall parametrable ac- 1 Introduction cording to usage needs; (2) a large-scale cognate database containing 3.1 million word pairs across Cognates are words in different languages that 338 languages, generated with the method above; share a common origin and the same meaning, (3) WikTra, a multilingual transliteration dictio- such as the English letter and the French lettre.
    [Show full text]
  • Phonetics in Phonology
    Phonetics in Phonology John J. Ohala University of California, Berkeley At least since Trubetzkoy (1933, 1939) many have thought of phonology and phonetics as separate, largely autonomous, disciplines with distinct goals and distinct methodologies. Some linguists even seem to doubt whether phonetics is properly part of linguistics at all (Sommerstein 1977:1). The commonly encountered expression ‘the interface between phonology and phonetics’ implies that the two domains are largely separate and interact only at specific, proscribed points (Ohala 1990a). In this paper I will attempt to make the case that phonetics is one of the essential areas of study for phonology. Without phonetics, I would maintain, (and allied empirical disciplines such as psycholinguistics and sociolinguistics) phonology runs the risk of being a sterile, purely descriptive and taxonomic, discipline; with phonetics it can achieve a high level of explanation and prediction as well as finding applications in areas such as language teaching, communication disorders, and speech technology (Ohala 1991). 1. Introduction The central task within phonology (as well as in speech technology, etc.) is to explain the variability and the patterning -- the “behavior” -- of speech sounds. What are regarded as functionally the ‘same’ units, whether word, syllable, or phoneme, show considerable physical variation depending on context and style of speaking, not to mention speaker-specific factors. Documenting and explaining this variation constitutes a major challenge. Variability is evident in several domains: in everyday speech where the same word shows different phonetic shapes in different contexts, e.g., the release of the /t/ in tea has more noise than that in toe when spoken in isolation.
    [Show full text]
  • Working Paper on Exonyms
    UNITED NATIONS E/CONF.98/CRP.32 ECONOMIC AND SOCIAL COUNCIL 13 July 2007 Ninth United Nations Conference on the Standardization of Geographical Names New York, 21 - 30 August 2007 Item 10 of the provisional agenda* Exonyms Working Paper on Exonyms Submitted by Lebanon** * E/CONF.98/1. ** Prepared by Amal Husseini (Lebanon), Head of Aerial Photography Section in the Photogrammetry Department, Lebanese Army – Geographic Affairs Directorate. WORKING PAPER ON EXONYMS (Submitted under item #10 of the Provisional Agenda E/CONF.98/1) By Amal HUSSEINI (LEBANON) Head of Aerial photography section in the photogrammetry department Lebanese army – Geographic Affairs Directorate Survey Engineering Diploma – ESGT - Le Mans (France). High Studies Diploma in Remote Sensing & GIS - Paul Sabatier University (Toulouse III) – (France). The problem of the standardization of geographical names, of which the discussion of exonyms forms a large part, is an extensive and complex one. United Nations resolutions recommend the reduction of the number of exonyms as far and as quickly as possible. An exonym is a name for a place that is not used within that place by the local inhabitants, or a name for a people or language that is not used by the people or language to which it refers. The name used by the people or locals themselves is an endonym or autonym 1. For all Semitic people, including Arabic people, place names are related to an expression of a religious or an emotional thought. Like wishing and seeking blessings. Geographic names are as old as humanity, and more mysterious, because they date back to far periods when human discovered the agriculture as food resources.
    [Show full text]
  • First Name Americanization Patterns Among Twentieth-Century Jewish Immigrants to the United States
    City University of New York (CUNY) CUNY Academic Works All Dissertations, Theses, and Capstone Projects Dissertations, Theses, and Capstone Projects 2-2017 From Rochel to Rose and Mendel to Max: First Name Americanization Patterns Among Twentieth-Century Jewish Immigrants to the United States Jason H. Greenberg The Graduate Center, City University of New York How does access to this work benefit ou?y Let us know! More information about this work at: https://academicworks.cuny.edu/gc_etds/1820 Discover additional works at: https://academicworks.cuny.edu This work is made publicly available by the City University of New York (CUNY). Contact: [email protected] FROM ROCHEL TO ROSE AND MENDEL TO MAX: FIRST NAME AMERICANIZATION PATTERNS AMONG TWENTIETH-CENTURY JEWISH IMMIGRANTS TO THE UNITED STATES by by Jason Greenberg A dissertation submitted to the Graduate Faculty in Linguistics in partial fulfillment of the requirements for the degree of Master of Arts in Linguistics, The City University of New York 2017 © 2017 Jason Greenberg All Rights Reserved ii From Rochel to Rose and Mendel to Max: First Name Americanization Patterns Among Twentieth-Century Jewish Immigrants to the United States: A Case Study by Jason Greenberg This manuscript has been read and accepted for the Graduate Faculty in Linguistics in satisfaction of the thesis requirement for the degree of Master of Arts in Linguistics. _____________________ ____________________________________ Date Cecelia Cutler Chair of Examining Committee _____________________ ____________________________________ Date Gita Martohardjono Executive Officer THE CITY UNIVERSITY OF NEW YORK iii ABSTRACT From Rochel to Rose and Mendel to Max: First Name Americanization Patterns Among Twentieth-Century Jewish Immigrants to the United States: A Case Study by Jason Greenberg Advisor: Cecelia Cutler There has been a dearth of investigation into the distribution of and the alterations among Jewish given names.
    [Show full text]
  • Testing the Efficacy of a Cognate Curriculum
    CREATE 2009 10/5/09 Empirical and Theoretical Background: Considerable previous work (García, 1991; Nagy, TESTING THE EFFICACY OF 1997; National Reading Panel, 2000; Verhoeven, 1990) suggests that one major determinant of poor A COGNATE CURRICULUM reading comprehension, for Latino children and for other lagging readers, is low vocabulary. Lack of knowledge of the middle and lower Maria Carlo, University of Miami frequency 'academic' words encountered in Diane August, Center for Applied Linguistics middle and secondary school texts impedes comprehension of those texts, which in turn Chris Barr, University of Houston impedes the natural process of learning new Ana Pazos-Rego, St. Thomas University word meanings from exposure during reading (Stanovich, 1986). Austin, CREATE Conference, October 5-6, 2009 Continued… Research Questions One strategy believed to be successful in promoting the rapid acquisition of vocabulary by ELLs involves Can an intervention developed to teach teaching children about the morphological structure of cognate awareness to Spanish/English words. bilingual 3rd and 5th graders improve their Researchers believe that it is beneficial for ELLs if instruction on the structural analysis of words includes learning of English words that have cognate making students aware of the cross-linguistic status in Spanish? morphological relationships between words in their two languages (García & Nagy, 1993; Nagy, García, Does the cognate recognition strategy transfer Durgunoglu, & Hancin-Bhatt,1993; Jiménez et al., 1996; to other cognates that have not been Nation 2001). instructed? This involves making students aware of words that are cognates (words that are spelt alike and have similar meanings in two languages), and making them aware of similarities between derivational morphemes in the two languages (e.g, motivación-motivation).
    [Show full text]
  • Identification of Cognates and Recurrent Sound Correspondences
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Directory of Open Access Journals Identification of Cognates and Recurrent Sound Correspondences in Word Lists Grzegorz Kondrak Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada. E-mail: [email protected]. ABSTRACT. Identification of cognates and recurrent sound correspondences is a component of two principal tasks of historical linguistics: demonstrating the relatedness of languages, and reconstructing the histories of language families. We propose methods for detecting and quan- tifying three characteristics of cognates: recurrent sound correspondences, phonetic similarity, and semantic affinity. The ultimate goal is to identify cognates and correspondences directly from lists of words representing pairs of languages that are known to be related. The proposed solutions are language independent, and are evaluated against authentic linguistic data. The results of evaluation experiments involving the Indo-European, Algonquian, and Totonac lan- guage families indicate that our methods are more accurate than comparable programs, and achieve high precision and recall on various test sets. The results also suggest that combining various types of evidence substantially increases cognate identification accuracy. RÉSUMÉ. L’identification de mots apparentés et des correspondances de sons récurrents inter- vient dans deux des principales tâches de la linguistique historique: démontrer des filiations linguistiques et reconstruire l’histoire des familles de langues. Nous proposons des méthodes de détection et de quantification de trois caractéristiques des mots apparentés: les correspon- dances de sons récurrents, la ressemblance phonétique et l’affinité sémantique. Le but ultime est d’identifier les mots apparentés et les correspondances directement à partir de listes de mots représentant des paires des langues dont la filiation est connue.
    [Show full text]