Semantic Operations for Transfer-Based Machine Translation


© Copyright 2018 Michael Wayne Goodman

Michael Wayne Goodman
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
University of Washington, 2018

Reading Committee: Emily M. Bender (Chair), Francis Bond (Chair), Fei Xia
Program Authorized to Offer Degree: Linguistics
Co-Chairs of the Supervisory Committee: Emily M. Bender (Department of Linguistics) and Francis Bond (Division of Linguistics and Multilingual Studies, Nanyang Technological University)

Abstract

This dissertation describes a new approach to the automatic extraction of semantic mappings (transfer rules) for rule-based machine translation. This approach continues previous work in combining HPSG rule-based grammars, whose precise bidirectional implementation facilitates deep semantic analysis of sentences and the enumeration of grammatical realizations of semantic representations, with data-driven techniques of machine translation, whose automatic extraction of knowledge and statistical inference allow models to be quickly built from bitexts and extracted patterns to be ranked by their frequency. I define two new methods for bilingually aligning semantic fragments (or semantic subgraphs) and a heuristic strategy for aligning nodes between source and target subgraphs, which together allow me to design transfer systems that meet, and at times exceed, the translation coverage and quality of the prior state of the art with a significantly reduced dependence on idiosyncratic language-pair definitions (i.e., improved language independence). These improvements are made possible by a number of semantic operations, either designed or implemented by me and defined within this dissertation, that fully model the semantic representations and allow for inspection and transformation as graph operations. I apply my methods to the task of translating Japanese sentences into English, a typologically distant language pair.

Table of Contents

List of Figures
List of Tables
Glossary
Notational Conventions
Chapter 1: Introduction
  1.1 Translating Human Language
  1.2 Machine Translation using Semantic Representations
  1.3 Main Contributions
  1.4 Document Outline
Chapter 2: DELPH-IN Semantics
  2.1 Philosophical Basics
  2.2 Structural Semantic Preliminaries and Terminology
  2.3 Underspecified Representation
  2.4 Visual Presentation of DELPH-IN Semantics
  2.5 SEM-I: The Semantic Interface
  2.6 MRS: Minimal Recursion Semantics
  2.7 RMRS: Robust Minimal Recursion Semantics
  2.8 EDS: Elementary Dependency Structures
  2.9 DM: Bilexical Dependencies
  2.10 DMRS: Dependency Minimal Recursion Semantics
  2.11 Summary of DELPH-IN Representations
  2.12 Comparison to Other Frameworks
  2.13 Chapter Summary
Chapter 3: Machine Translation
  3.1 Rule-based Machine Translation
  3.2 Statistical and Neural Machine Translation
  3.3 Chapter Summary
Chapter 4: System Overview
  4.1 Translation Pipeline
  4.2 Pipeline Information Management
  4.3 Parsing
  4.4 Transfer
  4.5 Generation
  4.6 Translation Selection
  4.7 Chapter Summary
Chapter 5: Semantic Operations
  5.1 MRS to DMRS Conversion
  5.2 DMRS to MRS Conversion
  5.3 Semantic Tests
  5.4 Semantic Graph Traversal
  5.5 PENMAN Serialization
  5.6 Subgraph Extraction
  5.7 DMRS Simplification
  5.8 Chapter Summary
Chapter 6: Bilingual Semantic Subgraph Alignment
  6.1 Bilingually Aligned Predicate Phrases
  6.2 Top-Down Subgraph Enumeration
  6.3 Filtering Subgraphs and Subgraph Pairs
  6.4 Related Work
  6.5 Chapter Summary
Chapter 7: Transfer Grammar Augmentation
  7.1 The LOGON Transfer Machinery
  7.2 Challenges and Strategies in Building Transfer Grammars
  7.3 Converting Subgraphs to Transfer Rules
  7.4 Transfer Grammar Maintenance
  7.5 Chapter Summary
Chapter 8: Data Exploration
  8.1 Data Sources and Divisions
  8.2 Basic Analysis
  8.3 Analysis of Parsing Performance
  8.4 Analysis of Generation Performance
  8.5 The Bisem Corpus
  8.6 Chapter Summary
Chapter 9: Experimental Design
  9.1 Evaluation Metrics
  9.2 Pipeline Parameters for Transfer-based Systems
  9.3 Data Preparation
  9.4 Moses Baseline
  9.5 Haugereid and Bond Baseline
  9.6 LPA: Bilingually-aligned Predicate Phrases
  9.7 SGA: High-frequency Coincident Subgraphs
  9.8 Chapter Summary
Chapter 10: Results
  10.1 Baseline Development Results
  10.2 LPA Development Results
  10.3 SGA Development Results
  10.4 Combined Test Results
  10.5 Translation Examples
  10.6 Chapter Summary
Chapter 11: Analysis
  11.1 Translation Analysis
  11.2 Comparing Systems by Automatic Quality Estimations
  11.3 SGA Subgraph Topologies
  11.4 Semantic Analysis
  11.5 Performance Comparison
  11.6 Translation Examples
  11.7 Conclusion
Chapter 12: Conclusion
  12.1 Methodological and Artifactual Contributions
  12.2 Next Steps and Future Research
  12.3 Closing Words
Bibliography
Appendix A: Setting Up the Translation Environment
  A.1 Moses
  A.2 ERG
Appendix B: Database Schema

List of Figures

1.1 Vauquois triangle
1.2 Vauquois inverted funnel with a very long spout
2.1 MRS graph view for "The dog sleeps."
2.2 MRS graph view for "Every dog chased some cat."
2.3 Variable hierarchy
2.4 MRS fragments of "the large and gentle dog sleeps"
2.5 DMRS fragments of "the large and gentle dog sleeps"
2.6 Ordering of information density among MRS, DMRS, EDS, and DM
4.1 Simplified translation pipeline
4.2 Pipeline fan-out
4.3 Expanded view of the parsing component
4.4 Monolingual MTR in JaEn for normalizing alternate orthographies
4.5 Hand-written MTR for the idiomatic 嘘をつく uso-wo tsuku "tell a lie [lit: breathe a lie]"
4.6 Building the transfer pair store
4.7 Expanded view of the transfer component
4.8 Expanded view of the generation component
4.9 Expanded view of the selection component: First method
4.10 Expanded view of the selection component: Oracle method
4.11 First-selection of translation hypotheses
5.1 MRS for "The dog whose tail is long barked."
5.2 DMRS for The dog whose
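The abstract's core idea, extracting source-target subgraph pairs from a bitext of semantic graphs and ranking them by frequency, can be illustrated with a minimal sketch. The predicate names and the toy subgraph enumeration below are invented for illustration; the dissertation's actual operations enumerate connected DMRS subgraphs.

```python
from collections import Counter

def rank_subgraph_pairs(aligned_graphs, extract_subgraphs):
    """Count how often each (source, target) subgraph pair co-occurs
    across a bitext of aligned semantic graphs, ranked by frequency."""
    counts = Counter()
    for src_graph, tgt_graph in aligned_graphs:
        for src_sub in extract_subgraphs(src_graph):
            for tgt_sub in extract_subgraphs(tgt_graph):
                counts[(src_sub, tgt_sub)] += 1
    return counts.most_common()

# Toy "graphs" as sets of predicate names; a real system enumerates
# connected DMRS subgraphs instead of single predicates.
def toy_subgraphs(graph):
    return [frozenset([p]) for p in sorted(graph)] + [frozenset(graph)]

bitext = [
    ({"_inu_n", "_hoeru_v"}, {"_dog_n_1", "_bark_v_1"}),
    ({"_inu_n", "_neru_v"},  {"_dog_n_1", "_sleep_v_1"}),
]
ranked = rank_subgraph_pairs(bitext, toy_subgraphs)
```

The pair that recurs across both sentence pairs (the "dog" predicates) rises to the top, which is exactly the signal used to prefer reliable transfer rules over coincidental ones.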
Recommended publications
  • Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
    Master's thesis, Master in Cognitive Science and Language, September 2018. Author: Alejandro Ramírez Atrio. Supervisor: Toni Badia.
    Abstract: In this thesis we study the effect of word reordering as preprocessing for cross-lingual sentiment analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles that of our source language (English). Our original expectation was that a Long Short-Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with a different word order. We hypothesized that the more the word order of either target language resembles that of our source language, the better our sentiment classifier would perform when analyzing the target language. We tested five sets of transformation rules for part-of-speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora, CoStEP and Tatoeba. The results suggest that the bilingual word embeddings we train our Long Short-Term Memory model with do not help the model learn English word order when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in score even when applying a random reordering that makes them almost unintelligible, whether classifying between 2 options (positive, negative) or between 4 (strongly positive, positive, negative, strongly negative).
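The kind of POS-based reordering the thesis tests can be sketched as rewrite rules over tagged tokens. The rule, tags, and words below are hypothetical stand-ins for illustration, not the actual transformation rules from Crego and Mariño.

```python
def reorder(tagged, rules):
    """Apply POS-sequence rewrite rules left to right. `rules` maps a
    tuple of POS tags to the permutation applied to that window."""
    tokens = list(tagged)
    i = 0
    while i < len(tokens):
        for pattern, perm in rules.items():
            window = tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                tokens[i:i + len(pattern)] = [window[j] for j in perm]
                i += len(pattern) - 1
                break
        i += 1
    return tokens

# Hypothetical rule: Spanish noun-adjective order becomes
# English-like adjective-noun order.
rules = {("NOUN", "ADJ"): (1, 0)}
sent = [("una", "DET"), ("casa", "NOUN"), ("roja", "ADJ")]
reordered = [w for w, _ in reorder(sent, rules)]  # → ['una', 'roja', 'casa']
```

The toy rule moves the adjective before the noun, mirroring the English "a red house" order that the preprocessing step aims to approximate.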
  • Actes des 2èmes Journées Scientifiques du Groupement de Recherche Linguistique Informatique Formelle et de Terrain (LIFT)
    Proceedings of the 2nd scientific days of the Groupement de Recherche Linguistique Informatique Formelle et de Terrain (LIFT). Edited by Thierry Poibeau, Yannick Parmentier, and Emmanuel Schang. CNRS, 10-11 December 2020, Montrouge, France (virtual). HAL Id: hal-03066031, submitted 3 January 2021.
    Organizing committee: Thierry Poibeau (LATTICE/CNRS), Yannick Parmentier (LORIA/Université de Lorraine), Emmanuel Schang (LLL/Université d'Orléans). Program committee: Angélique Amelot (LPP/CNRS).
  • Projecto em Engenharia Informática
    Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática. Master's project in Informatics Engineering, specialization in Interaction and Knowledge, 2013: "Speech-to-Speech Translation to Support Medical Interviews". Author: João António Santos Gomes Rodrigues. Supervised by Prof. Dr. António Manuel Horta Branco.
    Abstract (translated from Portuguese): This report presents the creation of a speech-to-speech translation system. The system captures voice as an audio signal, which is then interpreted, translated, and synthesized back to voice, taking as input an utterance in a source language and producing as output an utterance in a target language. The author thanks his supervisor, his colleagues in the NLX group, and his family. This work was supported by the European Commission under the METANET4U project (ECICTPSP: 270893), by the Innovation Agency under the S4S project (S4S/PLN), and by the Fundação para a Ciência e a Tecnologia under the DP4LT project (PTDC/EEISII/1940/2012).
  • New Kazakh Parallel Text Corpora with On-Line Access
    Zhandos Zhumanov, Aigerim Madiyeva and Diana Rakhimova. al-Farabi Kazakh National University, Laboratory of Intelligent Information Systems, Almaty, Kazakhstan.
    Abstract: This paper presents a new parallel resource with on-line access: text corpora for the Kazakh language. We describe three different approaches to collecting parallel text and how much data we managed to collect with each, the parallel Kazakh-English text corpora collected from various sources and aligned at the sentence level, and a web-accessible corpus management system set up using open-source tools: the corpus manager Manatee and the web GUI KonText. As a result of our work we present a working web-accessible corpus management system for the collected corpora.
    Keywords: parallel text corpora, Kazakh language, corpus management system
    1 Introduction. Linguistic text corpora are large collections of text used for different language studies. They are used in linguistics and other fields that treat language as an object of study or as a resource. Text corpora are needed for almost any language study, since they are essentially a representation of the language itself. In computational linguistics, text corpora are used for parsing, machine translation, speech recognition, and other tasks. As shown in [1], text corpora can be classified along many dimensions. Size: small (up to 1 million words), medium (1 to 10 million words), large (more than 10 million words). Number of languages: monolingual, bilingual, multilingual. Language of texts: English, German, Ukrainian, etc. Mode: spoken, written, mixed.
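Sentence-level alignment of the kind described can be approximated with a toy length-ratio check, a crude stand-in for Gale-Church-style alignment; the example sentences and threshold below are invented for illustration.

```python
def align_one_to_one(src_sents, tgt_sents, max_ratio=2.0):
    """Naive 1-1 sentence alignment by character-length ratio: pairs
    whose lengths are wildly mismatched are flagged as suspect."""
    pairs = []
    for s, t in zip(src_sents, tgt_sents):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        pairs.append((s, t, ratio <= max_ratio))
    return pairs

# Invented toy sentences standing in for Kazakh-English bitext.
src = ["Bul mysal.", "Qazaq tili."]
tgt = ["This is an example.", "The Kazakh language."]
pairs = align_one_to_one(src, tgt)
```

Real aligners model length statistically and allow 1-2, 2-1, and 0-1 alignments, but the length signal is the same starting point.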
  • Multilingual Semantic Role Labeling with Unified Labels
    Alan Akbik and Yunyao Li. IBM Research, Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA.
    Abstract: Semantic role labeling (SRL) identifies the predicate-argument structure in text with semantic labels. It plays a key role in understanding natural language. In this paper, we present POLYGLOT, a multilingual semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. The core of POLYGLOT are SRL models for individual languages trained with automatically generated Proposition Banks (Akbik et al., 2015). The key feature of the system is that it treats the semantic labels of the English Proposition Bank as "universal semantic labels": given a sentence in any of the supported languages, POLYGLOT applies the corresponding SRL and predicts English PropBank frame and role annotation. The results are then visualized to facilitate the understanding of multilingual SRL.
    Figure 1: Example of POLYGLOT predicting English PropBank labels for a simple German sentence: the verb "kaufen" is correctly identified to evoke the BUY.01 frame, while "ich" (I) is recognized as the buyer, "ein neues Auto" (a new car) as the thing bought, and "dir" (for you) as the benefactive.
    Not surprisingly, enabling SRL for languages other than English has received increasing attention. One relevant key effort is to create Proposition Bank-style resources for different languages, such as Chinese (Xue and Palmer, 2005) and Hindi (Bhatt et al., 2009). However, the conventional approach of manually generating such resources is costly in terms of time and experts required, hindering the expansion of SRL to new target languages.
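The figure's predicate-argument output can be captured in a small data structure. The class and role names below are illustrative only, not POLYGLOT's actual API; the role labels follow the plain-English names used in the figure caption.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A predicate evoking an English PropBank-style frame, with
    roles filled by spans of the (non-English) input sentence."""
    predicate: str
    sense: str
    roles: dict = field(default_factory=dict)

# The German example from the figure: "kaufen" evokes BUY.01.
frame = Frame(
    predicate="kaufen",
    sense="BUY.01",
    roles={
        "buyer": "ich",
        "thing bought": "ein neues Auto",
        "benefactive": "dir",
    },
)
```

Treating English PropBank senses as the shared label set is what lets the same downstream consumer read frames produced for any of the nine supported languages.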
  • Learning and Applications of Paraphrastic Representations for Natural Language, Ph.D
    Learning and Applications of Paraphrastic Representations for Natural Language. John Wieting. Ph.D. Thesis, Language Technologies Institute, Carnegie Mellon University, July 27th 2020. © July 27th 2020.
    Abstract: Representation learning has had a tremendous impact in machine learning and natural language processing (NLP), especially in recent years. Learned representations provide useful features needed for downstream tasks, allowing models to incorporate knowledge from billions of tokens of text. The result is better performance and generalization on many important problems of interest. Often these representations can also be used in an unsupervised manner to determine the degree of semantic similarity of text or for finding semantically similar items, the latter useful for mining paraphrases or parallel text. Lastly, representations can be probed to better understand what aspects of language have been learned, bringing an additional element of interpretability to our models. This thesis focuses on the problem of learning paraphrastic representations for units of language. These units span from sub-words, to words, to phrases, and to full sentences, the latter being a focal point. Our primary goal is to learn models that can encode arbitrary word sequences into a vector with the property that sequences with similar semantics are near each other in the learned vector space, and that this property transfers across domains. We first show several effective and simple models, paragram and charagram, to learn word and sentence representations on noisy paraphrases automatically extracted from bilingual corpora. These models outperform contemporary and more complicated models on a variety of semantic evaluations.
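The goal of encoding word sequences so that similar meanings land near each other in vector space can be sketched with the simplest such model: averaging word vectors and comparing by cosine similarity. The toy 2-d vectors below are invented; real paragram-style vectors are trained on mined paraphrase pairs.

```python
import math

def embed(tokens, vectors):
    """Average the word vectors of a sequence into one sentence vector."""
    dims = len(next(iter(vectors.values())))
    summed = [0.0] * dims
    for tok in tokens:
        for i, v in enumerate(vectors.get(tok, [0.0] * dims)):
            summed[i] += v
    return [s / max(1, len(tokens)) for s in summed]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Invented toy 2-d vectors for three words.
vecs = {"good": [1.0, 0.0], "great": [0.9, 0.1], "bad": [-1.0, 0.2]}
sim_close = cosine(embed(["good"], vecs), embed(["great"], vecs))
sim_far = cosine(embed(["good"], vecs), embed(["bad"], vecs))
```

With well-trained vectors, paraphrases score high and unrelated sentences score low, which is the property the thesis optimizes for directly.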
  • Cross-Lingual Semantic Parsing with Categorial Grammars
    Cross-lingual Semantic Parsing with Categorial Grammars. Kilian Evang. The work in this thesis was carried out under the auspices of the Center for Language and Cognition Groningen of the Faculty of Arts of the University of Groningen. Groningen Dissertations in Linguistics 155. ISSN: 0928-0030. ISBN: 978-90-367-9474-9 (printed version), 978-90-367-9473-2 (electronic version). © 2016, Kilian Evang. Document prepared with LaTeX and typeset by pdfTeX (TeX Gyre Pagella font). Cover design by Kilian Evang. Printed by CPI Thesis (www.cpithesis.nl) on Matt MC 115g paper.
    Thesis (translated from Dutch) to obtain the degree of doctor at the University of Groningen, on the authority of the rector magnificus Prof. Dr. E. Sterken and by decision of the College of Deans, defended publicly on Thursday 26 January 2017 at 12:45 by Kilian Evang, born on 1 May 1986 in Siegburg, Germany. Promotor: Prof. Dr. J. Bos. Assessment committee: Prof. Dr. M. Butt, Prof. Dr. J. Hoeksema, Prof. Dr. M. Steedman.
    Acknowledgments: Whew, that was quite a ride! People had warned me that getting a PhD is not for the faint of heart, and they were right. But it's finally done! Everybody whose guidance, support and friendship I have been able to count on during these five years certainly deserves a big "thank you". Dear Johan, you created the unique research program on deep semantic annotation that lured me to Groningen, you hired me and agreed to be my PhD supervisor.
  • Towards a Corsican Basic Language Resource Kit
    Laurent Kevers, Stella Retali-Medori. UMR CNRS 6240 LISA, Università di Corsica - Pasquale Paoli, Avenue Jean Nicoli, 20250 Corte, France. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2726-2735, Marseille, 11-16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.
    Abstract: The current situation regarding the existence of natural language processing (NLP) resources and tools for Corsican reveals their virtual non-existence. Our inventory contains only a few rare digital resources, lexical or corpus databases, requiring adaptation work. Our objective is to use the Banque de Données Langue Corse (BDLC) project to improve the availability of resources and tools for the Corsican language and, in the long term, provide a complete Basic Language Resource Kit (BLARK). We have defined a roadmap setting out the actions to be undertaken: the collection of corpora and the setting up of a consultation interface (concordancer), a language detection tool, an electronic dictionary and a part-of-speech tagger. The first achievements regarding these topics have already been reached and are presented in this article. Some elements are also available on our project page (http://bdlc.univ-corse.fr/tal/).
    Keywords: less-resourced languages, Corsican, corpora, lexical resources, language identification, POS tagging
    We will first contextualize our research and present some general information about the Corsican language (1.1), its digital presence (1.2) and the BDLC project (1.3). The language is treated under a polynomic approach (Marcellesi, 1984) that encompasses all dialectal variants; its writing is not standardised, which is an obstacle for its automatic processing.
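The language detection tool on the roadmap could, for instance, be a character n-gram identifier. The snippet below is a toy sketch with invented training sentences, not the BDLC project's actual tool.

```python
from collections import Counter

def profile(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Pick the language whose profile shares the most n-gram mass
    with the input text's profile."""
    p = profile(text)
    return max(profiles, key=lambda lang: sum(
        min(c, profiles[lang][g]) for g, c in p.items()))

# Tiny invented training snippets; a real identifier would be
# trained on full corpora.
profiles = {
    "co": profile("u pane hè bonu in Corsica"),
    "fr": profile("le pain est bon en Corse"),
}
```

Even this tiny model separates closely related Romance varieties on short inputs, because function words and frequent character sequences differ systematically.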
  • D1.4 Semantics in Shallow Models
    This document is part of the Research and Innovation Action "Quality Translation 21 (QT21)", which has received funding from the European Union's Horizon 2020 program for ICT under grant agreement no. 645452.
    Deliverable D1.4: Semantics in Shallow Models. Ondřej Bojar (CUNI), Rico Sennrich (UEDIN), Philip Williams (UEDIN), Khalil Sima'an (UvA), Stella Frank (UvA), Inguna Skadiņa (Tilde), Daiga Deksne (Tilde). Dissemination level: Public. 31st January, 2018.
    Project details: grant agreement no. 645452; acronym QT21; full title Quality Translation 21; type of action Research and Innovation Action; coordinator Prof. Josef van Genabith (DFKI); start date 1st February, 2015; duration 36 months; contractual date of delivery 31st July, 2016; actual date of delivery 31st July, 2016; type Report; status and version Final (Version 1.0); number of pages 107; contributing partners CUNI, UEDIN, RWTH, UvA, Tilde; WP leader CUNI; EC project officer Susan Fraser.
    The partners in QT21 are: Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany; Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany; Universiteit van Amsterdam (UvA), Netherlands; Dublin City University (DCU), Ireland; University of Edinburgh (UEDIN), United Kingdom; Karlsruher Institut für Technologie (KIT), Germany; Centre National de la Recherche Scientifique (CNRS), France; Univerzita Karlova v Praze (CUNI), Czech Republic; Fondazione Bruno Kessler (FBK), Italy; University of Sheffield (USFD), United Kingdom; TAUS b.v.
  • Proceedings of the Celtic Language Technology Workshop 2019
    Machine Translation Summit XVII. Proceedings of the Celtic Language Technology Workshop 2019 (http://cl.indiana.edu/cltw19/), 19th August, 2019, Dublin, Ireland. © 2019 The authors. These articles are licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.
    Preface from the co-chairs of the workshop: These proceedings include the program and papers presented at the third Celtic Language Technology Workshop, held in conjunction with the Machine Translation Summit in Dublin in August 2019. Celtic languages are spoken in regions extending from the British Isles to the western seaboard of continental Europe, and in communities in Argentina and Canada. They include Irish, Scottish Gaelic, Breton, Manx, Welsh and Cornish. Irish is an official EU language, both Irish and Welsh have official language status in their respective countries, and Scottish Gaelic recently saw an improvement in its prospects through the Gaelic Language (Scotland) Act 2005. Most of the others have co-official status, except Breton, which has not yet received recognition as an official or regional language. Until recently, the Celtic languages lagged behind in the areas of natural language processing (NLP) and machine translation (MT); consequently, language technology research and resource provision for this language group was poor. In recent years, as resource provision for minority and under-resourced languages has been improving, it has been extended and applied to some Celtic languages.
  • Corpus Linguistics Main Issues
    HG3051 Corpus Linguistics: Introduction to Corpus Linguistics, Main Issues. Francis Bond, Division of Linguistics and Multilingual Studies (http://www3.ntu.edu.sg/home/fcbond/). Lecture 1. https://github.com/bond-lab/Corpus-Linguistics HG3051 (2020)
    Introduction: administrivia; what is corpus linguistics; what this course is (and isn't); getting to know each other (what do you want?).
    Corpora I have been involved with: semantic markup of the LDC Call Home Corpus (sense tagging of Japanese telephone transcripts); the Hinoki Treebank of Japanese (HPSG parses of Japanese definitions, examples and newspaper text, with sense tagging of the same); the Tanaka Corpus of aligned Japanese-English text, now the Tatoeba multilingual project (www.tatoeba.org); the NICT English learner corpus (advisor); the Japanese WordNet gloss corpus and jSEMCOR corpus (aligned Japanese-English text, sense tagging).
    Corpora I am building now: the NTU Multilingual Corpus (NTU-MC), with help from Tan Liling, HG2002, HG8011 and many more, covering Arabic, Chinese, English, Indonesian, Japanese, Korean and Vietnamese (essays, short stories (Sherlock Holmes), news text, Singapore tourist web sites), with wordnet sense tagging, HPSG parses, cross-lingual alignment, and tagging of various phenomena; used in URECA and FYP research, and in this course. Also the NTU Corpus of Learner English (NTUCLE), with help from LCC (Tan and Bond, 2012; Winder et al., 2017).
    Assessment: 100% continuous assessment, including individual lab work (4 × 10%) and an individual project.
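The basic working tool of corpus linguistics courses like this one is the concordance, a Key Word In Context (KWIC) view of every hit of a word. A minimal sketch, with an invented toy sentence:

```python
def kwic(tokens, keyword, width=3):
    """Return each occurrence of `keyword` with up to `width` context
    tokens on either side, the basic view a concordancer provides."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented toy corpus of one tokenized sentence.
text = "the dog chased the cat and the cat ran".split()
hits = kwic(text, "cat")
```

Lining up hits this way makes collocational patterns (what typically occurs just left or right of a word) visible at a glance, which is why concordancers are the first tool introduced in such courses.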
  • Dirt Cheap Web-Scale Parallel Text from the Common Crawl Jason R
    Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, Adam Lopez. Affiliations: Department of Computer Science, Johns Hopkins University; Human Language Technology Center of Excellence, Johns Hopkins University; School of Informatics, University of Edinburgh; Institute of Computational Linguistics, University of Zurich; Computer and Information Science Department, University of Pennsylvania.
    Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon's Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500.
    [...] are readily available, ordering in the hundreds of millions of words for Chinese-English and Arabic-English, and in tens of millions of words for many European languages (Koehn, 2005). In each case, much of this data consists of government and news text. However, for most language pairs and domains there is little to no curated parallel data available. Hence discovery of parallel data is an important first step for translation between most of the world's languages. The Web is an important source of parallel text. Many websites are available in multiple languages, and unlike other potential sources, such as multilingual news feeds (Munteanu and Marcu, 2005) or Wikipedia (Smith et al., 2010), [...]
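The paper's starting point, a set of common two-letter language codes, suggests a simple pairing heuristic: URLs that differ only by such a code are candidate parallel pages. The sketch below is a toy version of that idea with invented URLs; real STRAND additionally compares the HTML structure of the candidate pages before accepting a pair.

```python
import re

def candidate_pairs(urls, src="en", tgt="fr"):
    """Bucket URLs by the string obtained when a standalone two-letter
    language code is masked out; buckets of exactly two URLs are
    candidate parallel page pairs."""
    pattern = re.compile(rf"(?<![a-z])({src}|{tgt})(?![a-z])")
    buckets = {}
    for url in urls:
        key = pattern.sub("??", url)
        buckets.setdefault(key, set()).add(url)
    return [sorted(b) for b in buckets.values() if len(b) == 2]

urls = [
    "http://example.com/en/about.html",
    "http://example.com/fr/about.html",
    "http://example.com/en/contact.html",
]
pairs = candidate_pairs(urls)
```

The lookbehind/lookahead guards keep the code from matching inside longer words, so only path segments like `/en/` are masked.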