Discovering Lexical Similarity Through Articulatory Feature-Based Phonetic Edit Distance
Recommended publications
-
Phonetic Matching Toolkit with State-Of-The-Art Meta-Soundex
PHONETIC MATCHING TOOLKIT WITH STATE-OF-THE-ART META-SOUNDEX ALGORITHM (ENGLISH AND SPANISH). A Thesis Presented to The Faculty of the Department of Computer Science, Sam Houston State University, In Partial Fulfillment of the Requirements for the Degree of Master of Science, by Keerthi Koneru, December, 2016. APPROVED: Cihan Varol, PhD, Thesis Director; Narasimha Shashidhar, PhD, Committee Member; Bing Zhou, PhD, Committee Member; John B. Pascarella, PhD, Dean, College of Science and Engineering Technology. DEDICATION I dedicate this dissertation to all my family members who had always motivated and supported me to successfully achieve this. A special dedication to my father, Koneru Paramesh Babu, whom I regard as the most influential person in my life, who always showed me the right direction and responsible for grooming me into a better individual. Last but not the least, I would like to dedicate to my brother Sridhar Reddy and my friend Subash Kumar, who constantly motivated me to work comprehensively and effectively, which helped me a lot in successfully achieving this milestone. ABSTRACT Koneru, Keerthi, Phonetic matching toolkit with state-of-the-art Meta-Soundex algorithm (English and Spanish). Master of Science (Computing and Information Science), December, 2016, Sam Houston State University, Huntsville, Texas. Researchers confront major problems while searching for various kinds of data in large imprecise databases, as they are not spelled correctly or in the way they were expected to be spelled. As a result, they cannot find the word they sought. -
The Romance Advantage — the Significance of the Romance Languages As a Pathway to Multilingualism
ISSN 1799-2591 Theory and Practice in Language Studies, Vol. 8, No. 10, pp. 1253-1260, October 2018 DOI: http://dx.doi.org/10.17507/tpls.0810.01 The Romance Advantage — The Significance of the Romance Languages as a Pathway to Multilingualism Kathleen Stein-Smith Fairleigh Dickinson University, Metropolitan Campus, Teaneck, NJ, USA Abstract—As 41M in the US speak a Romance language in the home, it is necessary to personally and professionally empower L1 speakers of a Romance language through acquisition of one or more additional Romance languages. The challenge is that Romance language speakers, parents, and communities may be unaware of both the advantages of bilingual and multilingual skills and also of the relative ease in developing proficiency, and even fluency, in a second or third closely related language. In order for students to maximize their Romance language skills, it is essential for parents, educators, and other language stakeholders to work together to increase awareness, to develop curriculum, and to provide teacher training -- especially for Spanish-speakers, who form the vast majority of L1 Romance language speakers in the US, to learn additional Romance languages. Index Terms—romance languages, bilingual education, multilingualism, foreign language learning, romance advantage I. INTRODUCTION The Romance languages, generally considered to be French, Spanish, Italian, Portuguese, and Romanian, and in addition, regional languages including Occitan and Catalan, developed from Latin over a significant period of time and across a considerable geographic area. The beginnings of the Romance languages can be traced to the disappearance of the Roman Empire, along with Latin, its lingua franca. -
Introducing Micrela: Predicting Mutual Intelligibility Between Closely Related Languages in Europe
INTRODUCING MICRELA: PREDICTING MUTUAL INTELLIGIBILITY BETWEEN CLOSELY RELATED LANGUAGES IN EUROPE. Vincent J. van Heuven (i, ii, iii), Charlotte S. Gooskens (ii), Renée van Bezooijen (ii). Affiliations: (i) Pannon Egyetem, Veszprém, Hungary; (ii) Groningen University, The Netherlands; (iii) Leiden University, The Netherlands. 1. Introduction Wikipedia lists 91 indigenous languages that are spoken in Europe today. Forty-two are currently being used by more than one million speakers. Only 24 are recognized as official and working languages of the European Union. Most of these belong to one of three major families within the Indo-European phylum, i.e. Germanic, Romance and Slavic languages. Languages and language varieties (dialects) within one family have descended from a common ancestor language which has become more diversified over the past centuries through innovations. Generally, the greater the geographic distance and historical depth (how long ago did language A undergo an innovation that language B was not part of), the less the two languages resemble one another, and – it is commonly held – the more difficult it will be for speakers of language A to be understood by listeners of language B and vice versa. A working hypothesis would then be that the longer ago two related languages split apart, the less they resemble one another and the smaller their mutual intelligibility. We are involved in a fairly large research project that was set up to test these hypotheses. The basic idea was to first measure the level of mutual intelligibility between all pairs of languages within a family, then compute the degree of structural linguistic similarity between the members, and try to predict the observed level of intelligibility from the linguistic distances measured. -
Does Linguistic Similarity Affect Early Simultaneous Bilingual Language Acquisition?
Journal of Language Contact 13 (2021) 482-500 brill.com/jlc Does Linguistic Similarity Affect Early Simultaneous Bilingual Language Acquisition? Anja Gampe, Post-Doc, Institute of Socio-Economics, University of Duisburg-Essen, Duisburg, Germany. Antje Endesfelder Quick, Lecturer, Institute of British Studies, University of Leipzig, Leipzig, Germany. Moritz M. Daum, Professor, Department of Psychology and Jacobs Center for Productive Youth Development, University of Zurich, Zurich, Switzerland. Abstract It is well established that L2 acquisition is faster when the L2 is more closely related to the learner’s L1. In the current study we investigated whether language similarity has a comparable facilitative effect in early simultaneous bilingual children. The similarity between each bilingual child’s two languages was determined using phonological and typological scales. We compared the vocabulary size of bilingual toddlers learning different pairs of languages. Results show that the vocabulary size of bilingual children is indeed influenced by similarity: the more similar the languages, the larger the children’s vocabulary. Keywords bilingual language acquisition – language similarity – vocabulary development – lexicon size © Anja Gampe, et al., 2021 | doi:10.1163/19552629-13030001 This is an open access article distributed under the terms of the prevailing cc-by-nc license at the time of publication. 1 Introduction Most of the world’s children grow up learning two or more languages simultaneously (Grosjean, 1982). Input is naturally divided between the two languages, and consequently the frequency of input is less than 100% for each of the languages (Unsworth, 2015). -
Endangered Languages and Languages in Danger IMPACT: Studies in Language and Society Issn 1385-7908
IMPACT: Studies in Language and Society 42. Endangered Languages and Languages in Danger: Issues of documentation, policy, and language rights. Luna Filipović and Martin Pütz. JOHN BENJAMINS PUBLISHING COMPANY. IMPACT: Studies in Language and Society issn 1385-7908 IMPACT publishes monographs, collective volumes, and text books on topics in sociolinguistics. The scope of the series is broad, with special emphasis on areas such as language planning and language policies; language conflict and language death; language standards and language change; dialectology; diglossia; discourse studies; language and social identity (gender, ethnicity, class, ideology); and history and methods of sociolinguistics. For an overview of all books published in this series, please see http://benjamins.com/catalog/impact General Editors: Ana Deumert (University of Cape Town), Kristine Horner (University of Sheffield). Advisory Board: Peter Auer (University of Freiburg), Marlis Hellinger (University of Frankfurt am Main), Jan Blommaert (Ghent University), Elizabeth Lanza (University of Oslo), Annick De Houwer (University of Erfurt), William Labov (University of Pennsylvania), J. Joseph Errington (Yale University), Peter L. Patrick (University of Essex), Anna Maria Escobar (University of Illinois at Urbana), Jeanine Treffers-Daller (University of the West of England), Guus Extra (Tilburg University), Victor Webb (University of Pretoria). Volume 42 Endangered Languages and Languages in Danger. 
Issues of documentation, policy, and language rights. Edited by Luna Filipović (University of East Anglia) and Martin Pütz (University of Koblenz-Landau). John Benjamins Publishing Company, Amsterdam / Philadelphia. The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ansi z39.48-1984. -
Using Phonetic Knowledge in Tools and Resources for Natural Language Processing and Pronunciation Evaluation
Using phonetic knowledge in tools and resources for Natural Language Processing and Pronunciation Evaluation. Gustavo Augusto de Mendonça Almeida. Master dissertation submitted to the Instituto de Ciências Matemáticas e de Computação (ICMC-USP), in partial fulfillment of the requirements for the degree of the Master Program in Computer Science and Computational Mathematics. FINAL VERSION. Concentration Area: Computer Science and Computational Mathematics. Advisor: Profa. Dra. Sandra Maria Aluisio. USP – São Carlos, May 2016. Catalog record prepared by the Biblioteca Prof. Achille Bassi and the Seção Técnica de Informática, ICMC/USP, with data supplied by the author: Almeida, Gustavo Augusto de Mendonça. A539u. Using phonetic knowledge in tools and resources for Natural Language Processing and Pronunciation Evaluation / Gustavo Augusto de Mendonça Almeida; advisor Sandra Maria Aluisio. – São Carlos – SP, 2016. 87 p. Dissertation (Master's, Graduate Program in Computer Science and Computational Mathematics) – Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, 2016. 1. Template. 2. Qualification monograph. 3. Dissertation. 4. Thesis. 5. Latex. I. Aluisio, Sandra Maria, advisor. II. Title. ICMC-USP GRADUATE SERVICE. Deposit date: Signature: Gustavo Augusto de Mendonça Almeida. Utilizando conhecimento fonético em ferramentas e recursos de Processamento de Língua Natural e Treino de Pronúncia. Dissertation presented to the Instituto de Ciências Matemáticas e de Computação (ICMC-USP) as part of the requirements for the degree of Master of Science in Computer Science and Computational Mathematics. REVISED VERSION. Concentration Area: Computer Science and Computational Mathematics. Advisor: Profa. -
Evaluating Language Statistics: the Ethnologue and Beyond a Report Prepared for the UNESCO Institute for Statistics
Evaluating Language Statistics: The Ethnologue and Beyond A report prepared for the UNESCO Institute for Statistics John C. Paolillo School of Informatics, Indiana University Assisted by Anupam Das Department of Linguistics, Indiana University March 31, 2006 0. Introduction How many languages are there in the world? In a region or a particular country? How many speakers does a given language have? Are there more speakers of English or Mandarin? How are the numbers of these speakers changing, in the world, in a country or on the Internet? Linguists are often asked questions such as these, whether by members of other disciplines, lay-people, or policy makers. Yet despite the interest in and obvious importance of these questions, they are not easy questions to answer, and there are few sources one can turn to for definitive answers. Since the early 1990s, new awareness of a number of language-related issues has foregrounded the need for good answers to these questions. On the one hand, there is the economic trend of globalization, which requires people from a variety of different countries, ethnicities, cultures and language backgrounds to communicate with one another. Globalization has been accompanied by claims about the economic importance of one language vis-a-vis another, and the importance of specific languages in global communication functions or for scientific and cultural exchange. Such discussions have led to re-evaluations of the status of many languages in a range of contexts, such as the role of English globally and in the European Union, and the role of Mandarin Chinese in the Pacific Rim and on the Internet. -
The Music in Plain Speech and Writing
The Music in Plain Speech and Writing PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sun, 19 May 2013 11:51:46 UTC Contents: CMU Pronouncing Dictionary; Arpabet; International Phonetic Alphabet; Phonetic transcription; Phonemic orthography; Pronunciation; Syllable; Allophone; Homophone; Rhyme; Half rhyme; Internal rhyme; Assonance; Literary consonance; Alliteration; Mnemonic; Phonetic algorithm; Metaphone; Soundex; Rhyme Genie. References: Article Sources and Contributors; Image Sources, Licenses and Contributors. Article Licenses: License. CMU Pronouncing Dictionary. Developer(s): Carnegie Mellon University. Stable release: 0.7a / February 18, 2008. Development status: Maintained. Available in: English. License: Public Domain. The CMU Pronouncing Dictionary (also known as cmudict) is a public domain pronouncing dictionary created by Carnegie Mellon University (CMU). It defines a mapping from English words to their North American pronunciations, and is commonly used in speech processing applications such as the Festival Speech Synthesis System and the CMU Sphinx speech recognition system. The latest release is 0.7a, which contains 133,746 entries (from 123,442 baseforms). Database Format The database is distributed as a text file of the format word <two spaces> pronunciation. If there are multiple pronunciations available for a word, all subsequent entries are followed by an index in parentheses. The pronunciation is encoded using a modified form of the Arpabet system. The difference is stress marks on vowels with levels 0, 1, 2; not all entries have stress, however. -
Tai Kadai Languages of India: a Probe Into the Seventh Language Family
Tai Kadai languages of India: A probe into the Seventh Language Family Anvita Abbi Jawaharlal Nehru University, New Delhi India is represented by six language families: Indo-Aryan, Dravidian, Austroasiatic and Tibeto-Burman in mainland India, and two, Austronesian (the Angan group, i.e. Jarawa-Onge) (Blevins 2008) and Great Andamanese (Abbi 2006, 2008), in the Andaman Islands. Each of these language families is historically and typologically distinct from the others. The Northeast of India is marked by intense linguistic diversity and functional multilingualism as it houses speakers from the Indo-Aryan, Tibeto-Burman and Austroasiatic Mon-Khmer languages. In addition, there are endangered languages of the Tai Kadai group such as Khamyang, Phake, Turung, Aiton, Nora (an extinct language), and the nearly extinct language Tai-Ahom. The linguistic structures of the languages belonging to the Tai Kadai group are very distinct from those of Tibeto-Burman, although contact with Burmese and with an Indo-Aryan language such as Assamese cannot be ruled out. The current paper draws attention to the distinctive features of the Tai-Kadai group of languages spoken in India as well as those which are the result of areal pressures. Some of the typologically distinct markers for the Tai languages seem to be a contour tone system, isolating morphology, SVO word order (Das 2014), lack of distinction between alienable and inalienable possession, presence of associative plurals, presence of enclitics, and words indicating tense, aspect, and mood attached at the end of the verb phrase, i.e. adjunct to the complements. This presentation is an attempt to draw examples from various Tai Kadai languages but particularly from Tai Khamti and Tai Ahom languages to present a case for a distinct language family from the rest of the six language families of India both genealogically and typologically. -
Introduction to Chinese Writing
Introduction to Chinese Writing Sunday, March 27, 2011 BC Carney 102 • 3pm-4:50pm Stephen M. Hou Splash Spring 2011 Boston College How Many People Speak Chinese? http://en.wikipedia.org/wiki/File:Native_speakers_in_the_World.jpg 27 Mar 2011 BC Splash Spring 2011 - Chinese Writing - S.M. Hou •Chinese is one of the world’s oldest continuously used writing systems. •More people speak/read Chinese than any other language in the world. •Classical Chinese was the lingua franca of East Asia until the 20th century. •Until the invention of European movable type printing by Gutenberg (1439), the majority of the world’s written and printed material was in Chinese characters. Sample Chinese Writing • One-minute exercise: What are some distinguishing characteristics you observe with Chinese writing? 我聯合國人民同茲決心,欲免後世 再遭今代人類兩度身歷慘不堪言之 戰禍,重申基本人權、人格尊嚴與 價值、以及男女與大小各國平等權 利之信念… WE THE PEOPLE OF THE UNITED NATIONS DETERMINED to save succeeding generations from the scourge of war, which twice in our lifetime has brought untold sorrow to mankind, and to regain faith in fundamental human rights, in the dignity and worth of the human person, in the equal rights of men and women and of nations large and small… Preamble to the UN Charter -
Dialect Intelligibility
11 Dialect Intelligibility CHARLOTTE GOOSKENS 11.1 Introduction The present chapter focuses on the communicative consequences of dialectal variation. One of the main functions of language is to enable communication, not only between speakers of the same variety but also between people using different accents, dialects or closely related languages. Most research on dialect intelligibility is relatively recent, especially when it comes to actual testing. The methods for testing and measuring are getting more and more sophisticated, including web-based experiments that allow for collecting large amounts of data, opening up new approaches to the subject. Some of the first to develop a methodology to test dialect intelligibility were American structuralists (e.g., Hickerson, Turner, and Hickerson 1952; Pierce 1952; Voegelin and Harris 1951), who tried to establish mutual intelligibility among related indigenous American languages around the middle of the previous century. They used the so-called recorded text testing (RTT) method. This methodology has been standardized and is still being used, for example in the context of literacy programs where a single orthography has to be developed that serves multiple closely-related language varieties (Casad 1974; Nahhas 2006). Since then, numerous intelligibility investigations have been carried out with various aims, for instance to resolve issues that concern language planning and policies, second-language learning, and language contact. Data about distances between varieties and detailed knowledge about intelligibility can also be important in sociolinguistic studies. Varieties that have strong social stigma attached to them could unfairly be deemed hard to understand (Giles and Niedzielski 1998; Wolff 1959). -
Automated Dating of the World's Language Families Based On
Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2011 Automated Dating of the World’s Language Families Based on Lexical Similarity Holman, Eric W ; Brown, Cecil H ; Wichmann, Søren ; Müller, André ; Velupillai, Viveka ; Hammarström, Harald ; Sauppe, Sebastian ; Jung, Hagen ; Bakker, Dik ; Brown, Pamela ; Belyaev, Oleg ; Urban, Matthias ; Mailhammer, Robert ; List, Johann-Mattis ; Egorov, Dmitry Abstract: This paper describes a computerized alternative to glottochronology for estimating elapsed time since parent languages diverged into daughter languages. The method, developed by the Automated Similarity Judgment Program (ASJP) consortium, is different from glottochronology in four major respects: (1) it is automated and thus is more objective, (2) it applies a uniform analytical approach to a single database of worldwide languages, (3) it is based on lexical similarity as determined from Levenshtein (edit) distances rather than on cognate percentages, and (4) it provides a formula for date calculation that mathematically recognizes the lexical heterogeneity of individual languages, including parent languages just before their breakup into daughter languages. Automated judgments of lexical similarity for groups of related languages are calibrated with historical, epigraphic, and archaeological divergence dates for 52 language groups. The discrepancies between estimated and calibration dates are found to be on average 29% as large as the estimated dates themselves, a figure that does not differ significantly among language families. As a resource for further research that may require dates of known level of accuracy, we offer a list of ASJP time depths for nearly all the world’s recognized language families and for many subfamilies.
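The abstract above rests on Levenshtein (edit) distance as a measure of lexical similarity. As a rough illustration only, the following Python sketch computes the plain edit distance between two word forms and a length-normalized variant. The function names and the normalization by the longer word's length are assumptions made for this example; the abstract does not specify how ASJP normalizes or aggregates distances, so this should not be read as the consortium's procedure.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length.
    This normalization is an assumption chosen for illustration."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical word pair, compared as plain character strings.
print(levenshtein("hand", "hond"), normalized_distance("hand", "hond"))
```

Dividing by word length keeps long words from dominating an average taken over a word list. A feature-based phonetic variant would replace the fixed substitution cost of 1 with a cost derived from articulatory differences between the two sounds.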