arXiv:1812.10464v2 [cs.CL] 25 Sep 2019

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Mikel Artetxe
University of the Basque Country (UPV/EHU)∗
[email protected]

Holger Schwenk
Facebook AI Research
[email protected]

∗ This work was performed during an internship at Facebook AI Research.

Abstract

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder and the multilingual test set are available at https://github.com/facebookresearch/LASER.

1 Introduction

While the recent advent of deep learning has led to impressive progress in Natural Language Processing (NLP), these techniques are known to be particularly data hungry, limiting their applicability in many practical scenarios. An increasingly popular approach to alleviate this issue is to first learn general language representations on unlabeled data, which are then integrated in task-specific downstream systems. This approach was first popularized by word embeddings (Mikolov et al., 2013b; Pennington et al., 2014), but has recently been superseded by sentence-level representations (Peters et al., 2018; Devlin et al., 2019). Nevertheless, all these works learn a separate model for each language and are thus unable to leverage information across different languages, greatly limiting their potential performance for low-resource languages.

In this work, we are interested in universal language agnostic sentence embeddings, that is, vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task. The motivations for such representations are multiple: the hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (typically English) to another, and the possibility to handle code-switching. To that end, we train a single encoder to handle multiple languages, so that semantically similar sentences in different languages are close in the embedding space.

While previous work in multilingual NLP has been limited to either a few languages (Schwenk and Douze, 2017; Yu et al., 2018) or specific applications like typology prediction (Malaviya et al., 2017) or machine translation (Neubig and Hu, 2018), we learn general purpose sentence representations for 93 languages (see Table 1). Using a single pre-trained BiLSTM encoder for all the 93 languages, we obtain very strong results in various scenarios without any fine-tuning, including cross-lingual natural language inference (XNLI dataset), cross-lingual classification (MLDoc dataset), bitext mining (BUCC dataset) and a new multilingual similarity search dataset we introduce covering 112 languages. To the best of our knowledge, this is the first exploration of general purpose massively multilingual sentence representations across a large variety of tasks.

2 Related work

Following the success of word embeddings (Mikolov et al., 2013b; Pennington et al., 2014), there has been an increasing interest in learning continuous vector representations of longer linguistic units like sentences (Le and Mikolov, 2014; Kiros et al., 2015). These sentence embeddings are commonly obtained using a Recurrent Neural Network (RNN) encoder, which is typically trained in an unsupervised way over large collections of unlabelled corpora. For instance, the skip-thought model of Kiros et al. (2015) couples the encoder with an auxiliary decoder, and trains the entire system to predict the surrounding sentences over a collection of books. It was later shown that more competitive results could be obtained by training the encoder over labeled Natural Language Inference (NLI) data (Conneau et al., 2017). This was later extended to multitask learning, combining different training objectives like that of skip-thought, NLI and machine translation (Cer et al., 2018; Subramanian et al., 2018).

While the previous methods consider a single language at a time, multilingual representations have attracted considerable attention in recent times. Most of this research focuses on cross-lingual word embeddings (Ruder et al., 2017), which are commonly learned jointly from parallel corpora (Gouws et al., 2015; Luong et al., 2015). An alternative approach that is becoming increasingly popular is to separately train word embeddings for each language, and map them to a shared space based on a bilingual dictionary (Mikolov et al., 2013a; Artetxe et al., 2018a) or even in a fully unsupervised manner (Conneau et al., 2018a; Artetxe et al., 2018b). Cross-lingual word embeddings are often used to build bag-of-word representations of longer linguistic units by taking their respective (IDF-weighted) average (Klementiev et al., 2012; Dufter et al., 2018). While this approach has the advantage of requiring weak or no cross-lingual signal, it has been shown that the resulting sentence embeddings work poorly in practical cross-lingual transfer settings (Conneau et al., 2018b).

A more competitive approach that we follow here is to use a sequence-to-sequence encoder-decoder architecture (Schwenk and Douze, 2017; Hassan et al., 2018). The full system is trained end-to-end on parallel corpora akin to multilingual neural machine translation (Johnson et al., 2017): the encoder maps the source sequence into a fixed-length vector representation, which is used by the decoder to create the target sequence. This decoder is then discarded, and the encoder is kept to embed sentences in any of the training languages. While some proposals use a separate encoder for each language (Schwenk and Douze, 2017), sharing a single encoder for all languages also gives strong results (Schwenk, 2018).

Nevertheless, most existing work is either limited to few, rather close languages (Schwenk and Douze, 2017; Yu et al., 2018) or, more commonly, considers pairwise joint embeddings with English and one foreign language (España-Bonet et al., 2017; Guo et al., 2018). To the best of our knowledge, existing work on learning multilingual representations for a large number of languages is limited to word embeddings (Ammar et al., 2016; Dufter et al., 2018) or specific applications like typology prediction (Malaviya et al., 2017) or machine translation (Neubig and Hu, 2018), ours being the first paper exploring general purpose massively multilingual sentence representations.

All the previous approaches learn a fixed-length representation for each sentence. A recent research line has obtained very strong results using variable-length representations instead, consisting of contextualized embeddings of the words in the sentence (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). For that purpose, these methods train either an RNN or self-attentional encoder over unannotated corpora using some form of language modeling. A classifier can then be learned on top of the resulting encoder, which is commonly further fine-tuned during this supervised training. Concurrent to our work, Lample and Conneau (2019) propose a cross-lingual extension of these models, and report strong results in cross-lingual natural language inference, machine translation and language modeling. In contrast, our focus is on scaling to a large number of languages, for which we argue that fixed-length approaches provide a more versatile and compatible representation form.¹ Also, our approach achieves strong results without task-specific fine-tuning, which makes it interesting for tasks with limited resources.

¹ For instance, there is not always a one-to-one correspondence among words in different languages (e.g. a single word of a morphologically complex language might correspond to several words of a morphologically simple language), so having a separate vector for each word might not transfer as well across languages.
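
One practical consequence of fixed-length representations, as argued above, is that any two sentences can be compared with a single vector operation regardless of their language or length. The following is a minimal sketch of cosine-similarity search over such embeddings. The three-dimensional vectors, the sentence labels and the helper names `cosine` and `nearest` are all hypothetical, chosen only for illustration; they are not output of the encoder described in this paper, whose embeddings are much larger.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two fixed-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def nearest(query: Vector, candidates: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the candidate label with the highest cosine similarity."""
    if not candidates:
        raise ValueError("no candidates given")
    scored = ((label, cosine(query, vec)) for label, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy embeddings (NOT produced by a real encoder).
english = {
    "en: the cat sleeps": [0.9, 0.1, 0.0],
    "en: stock prices fell": [0.0, 0.2, 0.9],
}
# Pretend this is the embedding of a Spanish sentence about a sleeping cat.
query = [0.8, 0.2, 0.1]

best, score = nearest(query, english)
print(best, round(score, 3))
```

With a shared multilingual encoder, the same lookup would in principle work across any pair of training languages, which is the setting the similarity search experiments target.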
[Figure 1 diagram: ENCODER (BPE embeddings, BiLSTM layers, max pooling producing the sentence embedding) and DECODER (LSTM with softmax outputs, fed with the sentence embedding, BPE embeddings and language ID).]

Figure 1: Architecture of our system to learn multilingual sentence embeddings.

3 Proposed method

We use a single, language agnostic BiLSTM encoder to build our sentence embeddings, which is coupled with an auxiliary decoder and trained on parallel corpora. From Section 3.1 to 3.3, we describe its architecture, our training strategy to scale to 93 languages, and the training data used for that purpose.

3.1 Architecture

Figure 1 illustrates the architecture of the proposed system, which is based on Schwenk (2018). […] resulting sentence representations (after concatenating both directions) are 1024 dimensional. The decoder has always one layer of dimension 2048. The input embedding size is set to 320, while the language ID embedding has 32 dimensions.

3.2 Training strategy

In preceding work (Schwenk and Douze, 2017; Schwenk, 2018), each input sentence was jointly translated into all other languages. However, this approach has two obvious drawbacks when trying to scale to a large number of languages. First, it […]
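
The max pooling step named in Figure 1 can be illustrated with a small sketch: the per-token outputs of the forward and backward directions are concatenated, and an element-wise maximum over time yields one vector whose size does not depend on sentence length. This is a toy illustration in plain Python with made-up two-dimensional states, not the authors' implementation; the function names are hypothetical, and the per-direction size of 512 is only inferred here from the stated 1024 dimensions after concatenating both directions.

```python
from typing import List

Vector = List[float]

# Sizes stated in Section 3.1; PER_DIRECTION is an inference (1024 / 2),
# not a figure quoted from the text.
SENTENCE_DIM = 1024
PER_DIRECTION = SENTENCE_DIM // 2


def max_pool(states: List[Vector]) -> Vector:
    """Element-wise maximum over time steps."""
    if not states:
        raise ValueError("need at least one time step")
    dim = len(states[0])
    if any(len(state) != dim for state in states):
        raise ValueError("all states must have the same dimension")
    return [max(column) for column in zip(*states)]


def sentence_embedding(forward: List[Vector], backward: List[Vector]) -> Vector:
    """Concatenate both directions per token, then max-pool over tokens."""
    if len(forward) != len(backward):
        raise ValueError("both directions must cover the same tokens")
    joined = [f + b for f, b in zip(forward, backward)]
    return max_pool(joined)


# Made-up encoder outputs for a three-token sentence, two units per direction.
forward = [[0.1, -0.4], [0.7, 0.2], [0.3, 0.0]]
backward = [[-0.5, 0.6], [0.2, 0.1], [0.4, -0.3]]

embedding = sentence_embedding(forward, backward)
print(embedding)       # [0.7, 0.2, 0.4, 0.6]
print(len(embedding))  # 4 here; 2 * PER_DIRECTION = 1024 at the stated sizes
```

Because the maximum is taken over the time axis, a three-token and a thirty-token sentence both map to a vector of the same size, which is what makes the representation fixed-length.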