Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning


Zurich Open Repository and Archive
University of Zurich, Main Library
Strickhofstrasse 39, CH-8057 Zurich
www.zora.uzh.ch

Year: 2018

Exploiting alignment in multiparallel corpora for applications in linguistics and language learning

Graën, Johannes

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-153213
Dissertation, Published Version

Originally published at:
Graën, Johannes. Exploiting alignment in multiparallel corpora for applications in linguistics and language learning. 2018, University of Zurich, Faculty of Arts.

Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning

Thesis presented to the Faculty of Arts and Social Sciences of the University of Zurich for the degree of Doctor of Philosophy

by Johannes Graën

Accepted in the spring semester 2018 on the recommendation of the doctoral committee:
Prof. Dr. Martin Volk (main supervisor)
Prof. Dr. Marianne Hundt
Prof. Dr. Stefan Evert

Zurich, 2018

Abstract

This thesis exploits the automatic identification of semantically corresponding units in parallel and multiparallel corpora, which is referred to as alignment. Multiparallel corpora are text collections in more than two languages that comprise reciprocal translations. The contributions of this thesis are threefold:

• First, we prepare a large multiparallel corpus by adding several layers of annotation and alignment. Annotation is first performed on each language individually, while alignment is applied to two or more languages; for the latter case, we use the term multilingual alignment. We show that word alignment on parallel corpora can improve language-specific annotation by means of disambiguation.

• Our second contribution is the development and evaluation of prototypical algorithms for multilingual alignment on both the sentence and the word level. As languages vary considerably with regard to how content is realized in sentences and words, multilingual alignment needs to be represented by a hierarchical structure rather than by the bidirectional links that are the prevailing representation of bilingual alignment.

• Third, based on our corpus, we show how word alignment in combination with different types of annotation can be employed to benefit linguists and language learners, among others.

All tools developed in the context of this thesis, in particular the publicly available web applications, are driven by efficient database queries on a complex data structure.
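The second contribution above rests on a representational point: bilingual word alignment is conventionally stored as pairwise links between units of two languages, whereas multilingual alignment calls for a hierarchical structure that groups corresponding units in several languages at once. The following is a minimal sketch of that contrast; the class names and the example tokens are hypothetical illustrations, not the data model actually used in the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TokenSpan:
    """A set of token positions in one sentence of one language (illustrative only)."""
    lang: str                   # language code, e.g. "en", "de", "fr"
    sentence_id: int            # sentence identifier within the corpus
    positions: Tuple[int, ...]  # token indices within that sentence


@dataclass(frozen=True)
class BilingualLink:
    """Prevailing representation of bilingual word alignment: one link per language pair."""
    source: TokenSpan
    target: TokenSpan


@dataclass
class AlignmentNode:
    """Hierarchical multilingual alignment: one node groups corresponding spans
    in several languages and may contain finer-grained sub-alignments."""
    spans: Dict[str, TokenSpan] = field(default_factory=dict)
    children: List["AlignmentNode"] = field(default_factory=list)


# One three-language correspondence expressed as a single node,
# instead of three separate pairwise links (en-de, en-fr, de-fr).
node = AlignmentNode(spans={
    "en": TokenSpan("en", 1, (4,)),     # e.g. "agreed"
    "de": TokenSpan("de", 1, (3, 7)),   # e.g. "stimmte ... zu" (separated particle verb)
    "fr": TokenSpan("fr", 1, (5, 6)),   # e.g. "était d'accord"
})
```

In this reading, a bilingual alignment is simply the special case of a node with two spans: pairwise links can always be derived from such a node, whereas reconstructing a consistent multilingual grouping from independently produced pairwise links is not always possible.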
Zusammenfassung

Gegenstand dieser Dissertation ist die Auswertung von semantischen Korrespondenzrelationen in parallelen und multiparallelen Korpora, welche als Alignierungen bezeichnet werden. Multiparallele Korpora sind Textsammlungen wechselseitiger Übersetzungen zwischen mehr als zwei Sprachen. Diese Arbeit umfasst drei Beiträge:

• Zum einen die Aufbereitung eines grossen multiparallelen Korpus durch Hinzufügen mehrerer Annotations- und Alignierungsebenen. Während jede Sprache zuerst separat annotiert wird, erstreckt sich die Alignierung über zwei oder mehr Sprachen. Letzteren Fall bezeichnen wir als multilinguale Alignierung. Wir zeigen, dass Wortalignierung in parallelen Korpora helfen kann, die sprachspezifischen Annotationen mittels Disambiguierung zu verbessern.

• Zum anderen die Entwicklung und Evaluierung prototypischer Algorithmen für multilinguale Alignierung sowohl auf Satz- als auch auf Wortebene. Aufgrund der starken Variation zwischen Sprachen bezüglich der Realisierung von Inhalt in Sätze und Wörter benötigt multilinguale Alignierung zur Darstellung der Korrespondenzen eine hierarchische Struktur anstelle von bidirektionalen Verbindungen, wie sie bei bilingualer Alignierung üblich sind.

• Des Weiteren zeigen wir, wie Wortalignierung in Verbindung mit verschiedenen Annotationsarten zum Nutzen u.a. von Linguisten und Sprachlernern eingesetzt werden kann.

Allen Werkzeugen, die im Rahmen dieser Dissertation entwickelt wurden, insbesondere den öffentlich verfügbaren Webanwendungen, liegen effiziente Datenbankanfragen auf einer komplexen Datenstruktur zugrunde.

Acknowledgments

First of all, I wish to thank my supervisor Martin Volk, who guided me through the initial troubles, gave me room to realize my own ideas and had the necessary confidence in me to finish this big project of mine. I am likewise indebted to my colleagues in the Sparcling project, Marianne Hundt, Simon Clematide and Elena Callegaro, and our student collaborators Dolores Batinic, Christof Bless and Mathias Müller.¹ Other, smaller contributions are indicated at the beginning of each chapter. I am thankful to Stefan Evert for accepting our invitation to become part of my doctoral committee.

During the last years, I had time to become acquainted with the other members of the Institute of Computational Linguistics. I really appreciate the supportive atmosphere that gave rise to interesting and fruitful discussions, often accompanied by chocolate and delicious pastry. I would like to give special thanks to Noah Bubenhofer, Tilia Ellendorff, Anne Göhring, Samuel Läubli, Laura Mascarell, Jeannette Roth, Gerold Schneider and Don Tuggener. Aside from people in Zurich, I am particularly grateful to the Språkbanken group in Gothenburg for kindly receiving me in spring 2017.

The technical parts described in this thesis were quite demanding. My thanks therefore go to our institute for making it possible to acquire new servers and for letting me implement my concept for a new computer cluster that allows for distributed computing. This would not have been possible without the comprehensive support of the technicians of the Department of Informatics: Hanspeter Kunz, Beat Rageth and Enrico Solcà.

Los agradecimientos más importantes suelen venir últimos. Quisiera darles las gracias a mis amigos de Barcelona, particularmente a Alicia Burga, Graham Coleman, Gabriela Ferraro y Simon Mille, sin los cuales probablemente no hubiese empezado un doctorado. Les expreso mi gratitud a mi familia y a mi novia Mónica por el soporte incondicional durante todos estos años.

¹ The Sparcling project was kindly funded by the Swiss National Science Foundation under grant 105215_146781/1.

Contents

Abstract
Zusammenfassung
Acknowledgments
1 Introduction
    1.1 The Sparcling Project
    1.2 Research Questions
    1.3 Outline
2 Parallel Text Corpora
    2.1 Monolingual Corpora
    2.2 Parallel Corpora
    2.3 Multiparallel Corpora
        2.3.1 Our CoStEP Corpus
3 Corpus Annotation
    3.1 Tokenization
        3.1.1 Cutter: Our Flexible Tokenizer for Many Languages
    3.2 Part-of-speech Tagging and Lemmatization
        3.2.1 Interlingual Lemma Disambiguation
        3.2.2 Particle Verbs in German
    3.3 Dependency Parsing
    3.4 Database Corpus
4 Alignment Methods
    4.1 Text Alignment
    4.2 Sentence Alignment
        4.2.1 Approaches
        4.2.2 Evaluating Sentence Alignment
    4.3 Multilingual Sentence Alignment
        4.3.1 Our Approach to Multilingual Sentence Alignment
        4.3.2 Evaluation
    4.4 Word Alignment
        4.4.1 Approaches
        4.4.2 Evaluating Word Alignment
    4.5 Multilingual Word Alignment
        4.5.1 Our Approach to Multilingual Word Alignment
        4.5.2 Evaluation and Outlook
5 Linguistic Applications of Word Alignment
    5.1 Overlap of Lemma Alignment Distributions as Measure for Semantic Relatedness
    5.2 Multilingual Translation Spotting
    5.3 Phraseme Identification
    5.4 Backtranslating Prepositions for Prediction of Language Learners' Transfer Errors
6 Conclusions
Appendices
Appendix A Linguistic Annotation
    A.1 Universal Dependency Labels
    A.2 Our Hierarchical Alignment Tool
Appendix B Alignment Quality
    B.1 Relation of Alignment Error Rate (AER) and F1-Score
Appendix C Data Sets from Joint Measures
    C.1 Semantic Relatedness of German Particle Verbs
    C.2 Generated Recommendations for Learners of English of Different L1 Backgrounds
        C.2.1 Verb Preposition Combinations
        C.2.2 Adjective Preposition Combinations

Chapter 1
Introduction

Collections of texts, known as text corpora, have long been of interest to linguists. They have served, for instance, as a source for lexicographers compiling dictionaries, or for linguists and historians investigating language change over time.

Parallel text corpora, parallel corpora for short, sometimes also referred to as bitexts, are text collections in two or more languages in which textual units, such as articles, sentences or words, in one language correspond to textual units of the same kind in another language. If there is more than one other language, we will refer to these collections as multiparallel corpora. The term parallel corpus covers only translated material, not collections of texts that are merely related in terms of content. The latter are called comparable corpora, since they describe the same topic in a comparable way without the texts necessarily being translations of each other. Wikipedia articles, as an example of comparable corpora, deal with the same topic in several languages and can either be translations of one or more existing articles or be written independently of corresponding articles in other languages (for an overview see Plamada and Volk 2013).

The size of typical corpora impedes manual examination and, hence, calls for automatic processing. Natural language processing (NLP) deals with the automated treatment of natural language, predominantly
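The distinction between parallel, multiparallel and comparable corpora drawn above can be made concrete with a small sketch: a multiparallel corpus stores tuples of mutually translated sentences in more than two languages, and any bilingual parallel corpus (bitext) is then a two-language projection of those tuples. The names below (MultiparallelCorpus, bitext) are hypothetical illustrations under that reading, not the data model of the thesis.

```python
from typing import Dict, Iterator, List, Tuple

# One corpus row: mutually translated sentences, keyed by language code.
SentenceTuple = Dict[str, str]


class MultiparallelCorpus:
    """Toy model of a multiparallel corpus (illustrative only)."""

    def __init__(self, rows: List[SentenceTuple]) -> None:
        self.rows = rows

    def languages(self) -> set:
        """All languages occurring in at least one row."""
        return {lang for row in self.rows for lang in row}

    def bitext(self, src: str, tgt: str) -> Iterator[Tuple[str, str]]:
        """Project the multiparallel corpus onto a bilingual parallel corpus (a bitext)."""
        for row in self.rows:
            if src in row and tgt in row:
                yield row[src], row[tgt]


corpus = MultiparallelCorpus([
    {"en": "The session is closed.",
     "de": "Die Sitzung ist geschlossen.",
     "es": "Se levanta la sesión."},
])

for en, es in corpus.bitext("en", "es"):
    print(en, "<->", es)
```

A comparable corpus, by contrast, would not support such a projection: its texts share a topic but are not sentence-by-sentence translations of each other.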