Using Morphemes from Agglutinative Languages Like Quechua and Finnish to Aid in Low-Resource Translation
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
Master in Cognitive Science and Language Master’s Thesis September 2018 Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis Alejandro Ramírez Atrio Supervisor: Toni Badia Abstract In this thesis we study the effect of word reordering as preprocessing for Cross-Lingual Sentiment Analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles the one from our source language (English). Our original expectation was that a Long Short Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with different word order. We hypothesized that the more the word order of any of our target languages resembles the one of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our Part of Speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora: CoStEP and Tatoeba. The results suggest that the bilingual word embeddings that we are training our Long Short Term Memory model with do not improve any English word order learning by part of the model when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in result score even when applying a random reordering to them making them almost unintelligible, neither when classifying between 2 options (positive-negative) nor between 4 (strongly positive, positive, negative, strongly negative). -
Reconsidering the “Isolating Protolanguage Hypothesis” in the Evolution of Morphology Author(S): Jaïmé Dubé Proceedings
Reconsidering the “isolating protolanguage hypothesis” in the evolution of morphology Author(s): Jaïmé Dubé Proceedings of the 37th Annual Meeting of the Berkeley Linguistics Society (2013), pp. 76-90 Editors: Chundra Cathcart, I-Hsuan Chen, Greg Finley, Shinae Kang, Clare S. Sandy, and Elise Stickles Please contact BLS regarding any further use of this work. BLS retains copyright for both print and screen forms of the publication. BLS may be contacted via http://linguistics.berkeley.edu/bls/. The Annual Proceedings of the Berkeley Linguistics Society is published online via eLanguage, the Linguistic Society of America's digital publishing platform. Reconsidering the Isolating Protolanguage Hypothesis in the Evolution of Morphology1 JAÏMÉ DUBÉ Université de Montréal 1 Introduction Much recent work on the evolution of language assumes explicitly or implicitly that the original language was without morphology. Under this assumption, morphology is merely a consequence of language use: affixal morphology is the result of the agglutination of free words, and morphophonemic (MP) alternations arise through the morphologization of once regular phonological processes. This hypothesis is based on at least two questionable assumptions: first, that the methods and results of historical linguistics can provide a window on the evolution of language, and second, based on the claim that some languages have no morphology (the so-called isolating languages), that morphology is not a necessary part of language. The aim of this paper is to suggest that there is in fact no basis for what I will call the Isolating Proto-Language Hypothesis (henceforth IPH), either on historical or typological grounds, and that the evolution of morphology remains an interesting question. -
Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
RANLPStud 2011 Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011) 13 September, 2011 Hissar, Bulgaria STUDENT RESEARCH WORKSHOP ASSOCIATED WITH THE INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING’2011 PROCEEDINGS Hissar, Bulgaria 13 September 2011 ISBN 978-954-452-016-8 Designed and Printed by INCOMA Ltd. Shoumen, BULGARIA ii Preface The Recent Advances in Natural Language Processing (RANLP) conference, already in its eight year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we decided to give arena to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by -
An Empirical Test of the Agglutination Hypothesis1
An Empirical Test of the Agglutination Hypothesis1 Martin Haspelmath Abstract In this paper, I approach the agglutination-fusion distinction from an empirical point of view. Although the well-known morphological typology of languages (isolating, agglutinating, flexive/fusional, incorporating) has often been criticized as empty, the old idea that there are (predominantly) agglutinating and (predominantly) fusional languages in fact makes two implicit predictions. First, agglutination/fusion is characteristic of whole languages rather than individual con- structions; second, the various components of agglutination/fusion correlate with each other. The (unstated, but widely assumed) Agglutination Hypothesis can thus be formulated as follows: (i) First prediction: If a language is agglutinating/fusional in one area of its mor- phology (e.g. in nouns, or in the future tense), it shows the same type elsewhere. (ii) Second prediction: If a language is agglutinating/fusional with respect to one of the three agglutination parameters (a-c) (and perhaps others), it shows the same type with respect to the other two parameters: (a) separation/cumulation, (b) morpheme invariance/morpheme variability, (c) affix uniformity/affix sup- pletion. Ireportonastudyofthenominalandverbalinflectionalmorphologyofa reasonably balanced world-wide sample of 30 languages, applying a variety of measures for the agglutination parameters and determining whether they are cross- linguistically significant. The results do not confirm the validity of the Agglutination Hypothesis, and the current evidence suggests that “agglutination” is just one way of trying to capture the strangeness of non-Indo-European languages, which all look alike to Eurocentric eyes. M. Haspelmath Max-Planck-Institut fur¨ evolutionare¨ Anthropologie, Leipzig 1Earlier versions of this paper were presented at the 3rd conference of the Association for Linguistic Typology (Amsterdam 1999) and at the 9th International Morphology Meeting (Vienna 2000). -
ALW2), Pages 1–10 Brussels, Belgium, October 31, 2018
EMNLP 2018 Second Workshop on Abusive Language Online Proceedings of the Workshop, co-located with EMNLP 2018 October 31, 2018 Brussels, Belgium Sponsors Primary Sponsor Platinum Sponsors Gold Sponsors Silver Sponsors Bronze Sponsors ii c 2018 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-948087-68-1 iii Introduction Interaction amongst users on social networking platforms can enable constructive and insightful conversations and civic participation; however, on many sites that encourage user interaction, verbal abuse has become commonplace. Abusive behavior such as cyberbullying, hate speech, and scapegoating can poison the social climates within online communities. The last few years have seen a surge in such abusive online behavior, leaving governments, social media platforms, and individuals struggling to deal with the consequences. As a field that works directly with computational analysis of language, the NLP community is uniquely positioned to address the difficult problem of abusive language online; encouraging collaborative and innovate work in this area is the goal of this workshop. The first year of the workshop saw 14 papers presented in a day-long program including interdisciplinary panels and active discussion. In this second edition, we have aimed to build on the success of the first year, maintaining a focus on computationally detecting abusive language and encouraging interdisciplinary work. Reflecting the growing research focus on this topic, the number of submissions received more than doubled from 22 in last year’s edition of the workshop to 48 this year. -
A Massively Parallel Corpus: the Bible in 100 Languages
Lang Resources & Evaluation DOI 10.1007/s10579-014-9287-y ORIGINAL PAPER A massively parallel corpus: the Bible in 100 languages Christos Christodouloupoulos • Mark Steedman Ó The Author(s) 2014. This article is published with open access at Springerlink.com Abstract We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora. Keywords Parallel corpus Á Multilingual corpus Á Comparative corpus linguistics 1 Introduction Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. One of the main uses of the latter kind is as training material for statistical machine translation (SMT), where large amounts of aligned data are standardly used to learn word alignment models between the lexica of two languages (for example, in the Giza?? system of Och and Ney 2003). Another interesting use of parallel corpora in NLP is projected learning of linguistic structure. In this approach, supervised data from a resource-rich language is used to guide the unsupervised learning algorithm in a target language. Although there are some techniques that do not require parallel texts (e.g. Cohen et al. 2011), the most successful models use sentence-aligned corpora (Yarowsky and Ngai 2001; Das and Petrov 2011). C. Christodouloupoulos (&) Department of Computer Science, UIUC, 201 N. -
A Corpus-Based Study of Unaccusative Verbs and Auxiliary Selection
To be or not to be: A corpus-based study of unaccusative verbs and auxiliary selection Master's Thesis Presented to The Faculty of the Graduate School of Arts and Sciences Brandeis University Department of Computer Science James Pustejovsky, Advisor In Partial Fulfillment of the Requirements for Master's Degree By Richard A. Brutti Jr. May 2012 c Copyright by Richard A. Brutti Jr. 2012 All Rights Reserved ABSTRACT To be or not to be: A corpus-based study of unaccusative verbs and auxiliary selection A thesis presented to the Department of Computer Science Graduate School of Arts and Sciences Brandeis University Waltham, Massachusetts Richard A. Brutti Jr. Since the introduction of the Unaccusative Hypothesis (Perlmutter, 1978), there have been many further attempts to explain the mechanisms behind the division in intransitive verbs. This paper aims to analyze and test some of theories of unac- cusativity using computational linguistic tools. Specifically, I focus on verbs that exhibit split intransitivity, that is, verbs that can appear in both unaccusative and unergative constructions, and in determining the distinguishing features that make this alternation possible. Many formal linguistic theories of unaccusativity involve the interplay of semantic roles and temporal event markers, both of which can be analyzed using statistical computational linguistic tools, including semantic role labelers, semantic parses, and automatic event classification. I use auxiliary verb selection as a surface-level indicator of unaccusativity in Italian and Dutch, and iii test various classes of verbs extracted from the Europarl corpus (Koehn, 2005). Additionally, I provide some historical background for the evolution of this dis- tinction, and analyze how my results fit into the larger theoretical framework. -
Complex Adpositions and Complex Nominal Relators Benjamin Fagard, José Pinto De Lima, Elena Smirnova, Dejan Stosic
Introduction: Complex Adpositions and Complex Nominal Relators Benjamin Fagard, José Pinto de Lima, Elena Smirnova, Dejan Stosic To cite this version: Benjamin Fagard, José Pinto de Lima, Elena Smirnova, Dejan Stosic. Introduction: Complex Adpo- sitions and Complex Nominal Relators. Benjamin Fagard, José Pinto de Lima, Dejan Stosic, Elena Smirnova. Complex Adpositions in European Languages : A Micro-Typological Approach to Com- plex Nominal Relators, 65, De Gruyter Mouton, pp.1-30, 2020, Empirical Approaches to Language Typology, 978-3-11-068664-7. 10.1515/9783110686647-001. halshs-03087872 HAL Id: halshs-03087872 https://halshs.archives-ouvertes.fr/halshs-03087872 Submitted on 24 Dec 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Public Domain Benjamin Fagard, José Pinto de Lima, Elena Smirnova & Dejan Stosic Introduction: Complex Adpositions and Complex Nominal Relators Benjamin Fagard CNRS, ENS & Paris Sorbonne Nouvelle; PSL Lattice laboratory, Ecole Normale Supérieure, 1 rue Maurice Arnoux, 92120 Montrouge, France [email protected] -
The Translation Equivalents Database (Treq) As a Lexicographer’S Aid
The Translation Equivalents Database (Treq) as a Lexicographer’s Aid Michal Škrabal, Martin Vavřín Institute of the Czech National Corpus, Charles University, Czech Republic E-mail: [email protected], [email protected] Abstract The aim of this paper is to introduce a tool that has recently been developed at the Institute of the Czech National Corpus, the Treq (Translation Equivalents) database, and to explore its possible uses, especially in the field of lexicography. Equivalent candidates offered by Treq can also be considered as potential equivalents in a bilingual dictionary (we will focus on the Latvian–Czech combination in this paper). Lexicographers instantly receive a list of candidates for target language counterparts and their frequencies (expressed both in absolute numbers and percentages) that suggest the probability that a given candidate is functionally equivalent. A significant advantage is the possibility to click on any one of these candidates and immediately verify their individual occurrences in a given context; and thus more easily distinguish the relevant translation candidates from the misleading ones. This utility, which is based on data stored in the InterCorp parallel corpus, is continually being upgraded and enriched with new functions (the recent integration of multi-word units, adding English as the primary language of the dictionaries, an improved interface, etc.), and the accuracy of the results is growing as the volume of data keeps increasing. Keywords: InterCorp; Treq; translation equivalents; alignment; Latvian–Czech dictionary 1. Introduction The aim of this paper is to introduce one of the tools that has been developed recently at the Institute of the Czech National Corpus (ICNC) and which could be especially helpful to lexicographers: namely, the Treq translation equivalents database1. -
Diminutive and Augmentative Functions of Some Luganda Noun Class Markers Samuel Namugala MA Thesis in Linguistics Norwegian Un
Diminutive and Augmentative Functions of some Luganda Noun Class Markers Samuel Namugala MA Thesis in Linguistics Norwegian University of Science and Technology (NTNU) Faculty of Humanities Department of Language and Literature Trondheim, April, 2014 To my parents, Mr. and Mrs. Wampamba, and my siblings, Polycarp, Lydia, Christine, Violet, and Joyce ii Acknowledgements I wish to express my gratitude to The Norwegian Government for offering me a grant to pursue the master’s program at NTNU. Without this support, I would perhaps not have achieved my dream of pursuing the master’s degree in Norway. Special words of thanks go to my supervisors, Professor Kaja Borthen and Professor Assibi Amidu for guiding me in writing this thesis. Your scholarly guidance, constructive comments and critical revision of the drafts has made it possible for me to complete this thesis. I appreciate the support and the knowledge that you have shared with me. I look forward to learn more from you. My appreciation also goes to my lecturers and the entire staff at the Department of Language and Literature. I am grateful to Professor Lars Hellan, Assoc. Professor Dorothee Beermann, Professor Wim Van Dommelen, and Assoc. Professor Jardar Abrahamsen for the knowledge you have shared with me since I joined NTNU. You have made me the linguist that I desired to be. I also wish to thank the authors that didn’t mind to help me when contacted for possible relevant literature for my thesis. My appreciation goes to Prof. Nana Aba Appiah Amfo (University of Ghana), Assistant Prof. George J. Xydopoulos (Linguistics School of Philology, University of Patras, Greece), Prof. -
The Pile: an 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe Charles Foster Jason Phang Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy EleutherAI [email protected] Abstract versity leads to better downstream generalization capability (Rosset, 2019). Additionally, large-scale Recent work has demonstrated that increased training dataset diversity improves general language models have been shown to effectively cross-domain knowledge and downstream gen- acquire knowledge in a novel domain with only eralization capability for large-scale language relatively small amounts of training data from that models. With this in mind, we present the domain (Rosset, 2019; Brown et al., 2020; Carlini Pile: an 825 GiB English text corpus tar- et al., 2020). These results suggest that by mix- geted at training large-scale language mod- ing together a large number of smaller, high qual- els. The Pile is constructed from 22 diverse ity, diverse datasets, we can improve the general high-quality subsets—both existing and newly cross-domain knowledge and downstream general- constructed—many of which derive from aca- demic or professional sources. Our evalua- ization capabilities of the model compared to mod- tion of the untuned performance of GPT-2 and els trained on only a handful of data sources. GPT-3 on the Pile shows that these models struggle on many of its components, such as To address this need, we introduce the Pile: a academic writing. Conversely, models trained 825:18 GiB English text dataset designed for train- on the Pile improve significantly over both ing large scale language models. -
Fusion, Exponence, and Flexivity in Hindukush Languages
Fusion, exponence, and flexivity in Hindukush languages An areal-typological study Hanna Rönnqvist Department of Linguistics Independent Project for the Degree of Master 30 HE credits General Linguistics Spring term 2015 Supervisor: Henrik Liljegren Examiner: Henrik Liljegren Fusion, exponence, and flexivity in Hindukush languages An areal-typological study Hanna Rönnqvist Abstract Surrounding the Hindukush mountain chain is a stretch of land where as many as 50 distinct languages varieties of several language meet, in the present study referred to as “The Greater Hindukush” (GHK). In this area a large number of languages of at least six genera are spoken in a multi-linguistic setting. As the region is in part characterised by both contact between languages as well as isolation, it constitutes an interesting field of study of similarities and diversity, contact phenomena and possible genealogical connections. The present study takes in the region as a whole and attempts to characterise the morphology of the many languages spoken in it, by studying three parameters: phonological fusion, exponence, and flexivity in view of grammatical markers for Tense-Mood-Aspect, person marking, case marking, and plural marking on verbs and nouns. The study was performed with the perspective of areal typology, employed grammatical descriptions, and was in part inspired by three studies presented in the World Atlas of Language Structures (WALS). It was found that the region is one of high linguistic diversity, even if there are common traits, especially between languages of closer contact, such as the Iranian and the Indo-Aryan languages along the Pakistani-Afghan border where purely concatenative formatives are more common.