Learning and Applications of Paraphrastic Representations for Natural Language, Ph.D. Thesis

Total Pages: 16

File Type: pdf, Size: 1020 KB

Learning and Applications of Paraphrastic Representations for Natural Language

John Wieting
Ph.D. Thesis, Language Technologies Institute, Carnegie Mellon University
July 27th, 2020

John Wieting: Learning and Applications of Paraphrastic Representations for Natural Language, Ph.D. Thesis, © July 27th, 2020

ABSTRACT

Representation learning has had a tremendous impact in machine learning and natural language processing (NLP), especially in recent years. Learned representations provide useful features needed for downstream tasks, allowing models to incorporate knowledge from billions of tokens of text. The result is better performance and generalization on many important problems of interest. Often these representations can also be used in an unsupervised manner to determine the degree of semantic similarity of text or for finding semantically similar items, the latter useful for mining paraphrases or parallel text. Lastly, representations can be probed to better understand what aspects of language have been learned, bringing an additional element of interpretability to our models.

This thesis focuses on the problem of learning paraphrastic representations for units of language. These units span from sub-words, to words, to phrases, and to full sentences, the latter being a focal point. Our primary goal is to learn models that can encode arbitrary word sequences into a vector with the property that sequences with similar semantics are near each other in the learned vector space, and that this property transfers across domains.

We first show several effective and simple models, paragram and charagram, to learn word and sentence representations on noisy paraphrases automatically extracted from bilingual corpora. These models outperform contemporary and more complicated models on a variety of semantic evaluations.

We then propose techniques to enable deep networks to learn effective semantic representations, addressing a limitation of our prior work. We found that in order to learn representations for sentences with deeper, more expressive neural networks, we need large amounts of sentential paraphrase data. Since this did not exist yet, we utilized neural machine translation models to create ParaNMT-50M, a corpus of 50 million English paraphrases which has found numerous uses by NLP researchers, in addition to providing further gains on our learned paraphrastic sentence representations.

We next propose models for bilingual paraphrastic sentence representations. We first propose a simple and effective approach that outperforms more complicated methods on cross-lingual sentence similarity and mining bitext, and we also show that we can achieve strong monolingual performance without paraphrase corpora by using parallel text alone. We then propose a generative model capable of concentrating semantic information into our embeddings and separating out extraneous information, by viewing parallel text as two different views of a semantic concept. We found that this model improves performance on both monolingual and cross-lingual tasks. Lastly, we extend this bilingual model to the multilingual setting and show it can be effective on multiple languages simultaneously, significantly surpassing contemporary multilingual models.

Finally, this thesis concludes by showing applications of our learned representations and ParaNMT-50M.
The first of these is on generating paraphrases with syntactic control for making classifiers more robust to adversarial attacks. We found that we can generate a controlled paraphrase for a sentence by supplying just the top production of the desired constituent parse, and the generated sentence will follow this structure, filling in the rest of the tree as needed to create the paraphrase. The second application is applying our representations for fine-tuning neural machine translation systems using minimum risk training. The conventional approach is to use BLEU (Papineni et al., 2002), since that is what is commonly used for evaluation. However, we found that using an embedding model to evaluate similarity allows the range of possible scores to be continuous and, as a result, introduces fine-grained distinctions between similar translations. The result is better performance on both human evaluations and BLEU score, along with faster convergence during training.

CONTENTS

1 Introduction 8
2 Background 14
2.1 Representation Learning in Natural Language Processing 14
2.1.1 Subword Embeddings 14
2.1.2 Word Embeddings 16
2.1.3 Sentence Embeddings 17
2.1.4 Contextualized Embeddings 19
2.2 Evaluating Representations 20
2.2.1 Word Embedding Evaluation 20
2.2.2 Sentence Embedding Evaluation 20
2.3 Paraphrase Corpora 24
2.3.1 Paraphrase Database 24
2.3.2 Sentential Paraphrase Corpora 25

Part I: Paraphrastic Embeddings 26
3 Recursive Neural Networks and paragram Word Embeddings 29
3.1 Introduction 29
3.2 Related Work 31
3.3 New Paraphrase Datasets 32
3.3.1 Annotated-PPDB 32
3.3.2 ML-Paraphrase 34
3.4 Paraphrase Models 35
3.4.1 Objective Functions 36
3.5 Experiments – Word Paraphrasing 38
3.5.1 Training Procedure 38
3.5.2 Extracting Training Data 39
3.5.3 Tuning and Evaluation 39
3.5.4 Results 39
3.5.5 Sentiment Analysis 40
3.5.6 Scaling up to 300 Dimensions 41
3.6 Experiments – Compositional Paraphrasing 42
3.6.1 Training Procedure 43
3.6.2 Evaluation and Baselines 43
3.6.3 Bigram Paraphrasability 43
3.6.4 Phrase Paraphrasability 45
3.7 Qualitative Analysis 47
3.8 Conclusion 49
4 paragram Sentence Representations Defined and Explored 51
4.1 Introduction 52
4.2 Related Work 54
4.3 Models and Training 56
4.3.1 Training 57
4.4 Experiments 59
4.4.1 Data 59
4.4.2 Transfer Learning 59
4.4.3 paragram-phrase XXL 62
4.4.4 Using Representations in Learned Models 62
4.5 Discussion 67
4.5.1 Under-Trained Embeddings 68
4.5.2 Using More PPDB 69
4.6 Qualitative Analysis 70
4.7 Conclusion 72
5 charagram: Improved Generalization by Embedding Words and Sentences via Character n-grams 73
5.1 Introduction 73
5.2 Related Work 74
5.3 Models 75
5.4 Experiments 76
5.4.1 Word and Sentence Similarity 77
5.4.2 Selecting Negative Examples 77
5.4.3 POS Tagging Experiments 84
5.4.4 Convergence 85
5.4.5 Model Size Experiments 86
5.5 Analysis 87
5.5.1 Quantitative Analysis 87
5.5.2 Qualitative Analysis 88
5.6 Conclusion 90

Part II: Automatically Creating Paraphrase Corpora for Improved Sentence Embedding and Generation Tasks 91
6 Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings 92
6.1 Introduction 92
6.2 Related Work 93
6.3 Models and Training 94
6.3.1 Models 94
6.3.2 Training 96
6.4 Experiments 97
6.4.1 Transfer Learning 97
6.4.2 Experiments with Regularization 99
6.4.3 Supervised Text Similarity 103
6.5 Analysis 105
6.5.1 Error Analysis 105
6.5.2 GRAN Gate Analysis 106
6.6 Conclusion 108
7 Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext 109
7.1 Introduction 109
7.2 Related Work 110
7.3 Neural Machine Translation 112
7.4 Models and Training 113
7.4.1 Models 114
7.4.2 Training 114
7.5 Experiments 115
7.5.1 Evaluation 115
7.5.2 Experimental Setup 116
7.5.3 Dataset Comparison 116
7.5.4 Filtering Methods …
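A note on the core mechanic: the simplest of the sentence models above, paragram-phrase (Chapter 4), embeds a sentence as the average of its word vectors and scores a pair by cosine similarity. The sketch below illustrates only those mechanics, using toy randomly initialized vectors in place of trained paragram embeddings, so the printed number is not meaningful. The same continuous cosine score is what lets the embedding-based reward in the minimum risk training application draw finer distinctions than BLEU.

```python
import numpy as np

# Toy vocabulary with random 300-d vectors; the thesis trains these
# on noisy paraphrases mined from bilingual corpora instead.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=300) for w in
         "the cat sat on mat a feline was seated rug".split()}

def embed(sentence):
    # paragram-phrase composition: average the word vectors.
    vecs = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = embed("the cat sat on the mat")
s2 = embed("a feline was seated on a rug")
# After training, paraphrases like these should score close to 1.
print(f"similarity: {cosine(s1, s2):.3f}")
```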
Recommended publications
  • Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
Master in Cognitive Science and Language, Master's Thesis, September 2018. Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis. Alejandro Ramírez Atrio. Supervisor: Toni Badia.

Abstract: In this thesis we study the effect of word reordering as preprocessing for cross-lingual sentiment analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles that of our source language (English). Our original expectation was that a long short-term memory (LSTM) classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with different word order. We hypothesized that the more the word order of any of our target languages resembles that of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our part-of-speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora, CoStEP and Tatoeba. The results suggest that the bilingual word embeddings with which we train our LSTM model do not support any learning of English word order by the model when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in score even when applying a random reordering that makes them almost unintelligible, whether classifying between 2 options (positive, negative) or between 4 (strongly positive, positive, negative, strongly negative).
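The reorderings tested here are rule-based transformations over POS-tagged text. As a rough, hypothetical illustration of the idea (the rule and tags below are invented for this note; the actual rule sets come from Crego and Mariño (2006a, 2006b) and the authors' corpus analysis), a single rule might swap noun-adjective pairs into English-like adjective-noun order:

```python
# Hypothetical POS-based reordering rule: NOUN ADJ -> ADJ NOUN,
# nudging Spanish/Catalan noun phrases toward English word order.
def reorder(tagged):
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "NOUN" and out[i + 1][1] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out

# "el gato negro" (the black cat) -> "el negro gato", mirroring English order.
print(reorder([("el", "DET"), ("gato", "NOUN"), ("negro", "ADJ")]))
```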
• Proceedings of the 2nd Scientific Days of the Research Group Linguistique Informatique Formelle et de Terrain (LIFT)
Proceedings of the 2nd scientific days of the research group "Linguistique Informatique Formelle et de Terrain" (LIFT). Thierry Poibeau, Yannick Parmentier, Emmanuel Schang (Eds.). CNRS, December 2020, Montrouge, France (virtual). HAL Id: hal-03066031, https://hal.archives-ouvertes.fr/hal-03066031, submitted on 3 Jan 2021.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

LIFT 2020, 10-11 December 2020, Montrouge, France (virtual). Organizing committee: Thierry Poibeau (LATTICE/CNRS), Yannick Parmentier (LORIA/Université de Lorraine), Emmanuel Schang (LLL/Université d'Orléans). Program committee: Angélique Amelot (LPP/CNRS) …
• Project in Informatics Engineering
Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática. SPEECH-TO-SPEECH TRANSLATION TO SUPPORT MEDICAL INTERVIEWS. João António Santos Gomes Rodrigues. Master's project in Informatics Engineering, specialization in Interaction and Knowledge, 2013. Project supervised by Prof. Dr. António Manuel Horta Branco.

Acknowledgments: I thank my supervisor, António Horta Branco, for providing the opportunity to carry out this work and for the sagacity and instruction he offered toward its completion. I thank my colleagues in the NLX group for having opened the way and created countless tools and knowledge for natural language processing of Portuguese, of which I make use in this work, as well as for their indispensable sharing and company. I thank my family and Joana for their ineffable support. This work was supported by the European Commission under the METANET4U project (ECICTPSP: 270893), by the Innovation Agency under the S4S project (S4S/PLN), and by the Foundation for Science and Technology under the DP4LT project (PTDC/EEISII/1940/2012). To all three I express my sincere gratitude.

Abstract: This report presents the creation of a speech-to-speech translation system. The system captures voice as an audio signal, which is then interpreted, translated and synthesized back to voice, taking as input an utterance in a source language and producing as output an utterance in a target language.
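The system described is a three-stage pipeline: speech recognition, machine translation, and speech synthesis. A minimal sketch of that architecture follows; the stage bodies are fake stand-ins (a byte decode, a one-entry toy lexicon, a byte encode), not the components the report actually uses.

```python
# Sketch of a speech-to-speech translation pipeline with placeholder stages.
def recognize(audio: bytes) -> str:
    return audio.decode("utf-8")  # stand-in for ASR: audio -> source text

def translate(text: str, src: str, tgt: str) -> str:
    toy_lexicon = {"bom dia": "good morning"}  # invented toy dictionary
    return toy_lexicon.get(text, text)

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in for TTS: target text -> audio

def speech_to_speech(audio: bytes, src: str = "pt", tgt: str = "en") -> bytes:
    return synthesize(translate(recognize(audio), src, tgt))

print(speech_to_speech(b"bom dia"))  # b'good morning'
```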
  • New Kazakh Parallel Text Corpora with On-Line Access
New Kazakh parallel text corpora with on-line access. Zhandos Zhumanov, Aigerim Madiyeva and Diana Rakhimova. al-Farabi Kazakh National University, Laboratory of Intelligent Information Systems, Almaty, Kazakhstan.

Abstract: This paper presents a new parallel resource, text corpora, for the Kazakh language with on-line access. We describe 3 different approaches to collecting parallel text and how much data we managed to collect using them, the parallel Kazakh-English text corpora collected from various sources and aligned at sentence level, and a web-accessible corpus management system that was set up using open-source tools: the corpus manager Manatee and the web GUI KonText. As a result of our work we present a working web-accessible corpus management system for the collected corpora.

Keywords: parallel text corpora, Kazakh language, corpus management system

1 Introduction. Linguistic text corpora are large collections of text used for different language studies. They are used in linguistics and other fields that deal with language studies, as an object of study or as a resource. Text corpora are needed for almost any language study, since they are basically a representation of the language itself. In computational linguistics, text corpora are used for parsing, machine translation, speech recognition, etc. As shown in [1], text corpora can be classified along many dimensions:
• Size: small (up to 1 million words); medium (from 1 to 10 million words); large (more than 10 million words).
• Number of text languages: monolingual, bilingual, multilingual.
• Language of texts: English, German, Ukrainian, etc.
• Mode: spoken; written; mixed.
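The size taxonomy quoted above is a concrete rule, so it can be written down directly; the function name and inclusive boundaries below are our own reading of the thresholds.

```python
def corpus_size_class(num_words: int) -> str:
    # Thresholds follow the taxonomy above: small (up to 1M words),
    # medium (1M to 10M), large (more than 10M).
    if num_words <= 1_000_000:
        return "small"
    if num_words <= 10_000_000:
        return "medium"
    return "large"

print(corpus_size_class(2_500_000))  # medium
```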
  • Multilingual Semantic Role Labeling with Unified Labels
POLYGLOT: Multilingual Semantic Role Labeling with Unified Labels. Alan Akbik, Yunyao Li. IBM Research, Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA.

Abstract: Semantic role labeling (SRL) identifies the predicate-argument structure in text with semantic labels. It plays a key role in understanding natural language. In this paper, we present POLYGLOT, a multilingual semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. The core of POLYGLOT are SRL models for individual languages trained with automatically generated Proposition Banks (Akbik et al., 2015). The key feature of the system is that it treats the semantic labels of the English Proposition Bank as "universal semantic labels": given a sentence in any of the supported languages, POLYGLOT applies the corresponding SRL and predicts English PropBank frame and role annotation. The results are then visualized to facilitate the understanding of multilingual SRL with …

Figure 1: Example of POLYGLOT predicting English PropBank labels for a simple German sentence: the verb "kaufen" is correctly identified to evoke the BUY.01 frame, while "ich" (I) is recognized as the buyer, "ein neues Auto" (a new car) as the thing bought, and "dir" (for you) as the benefactive.

Not surprisingly, enabling SRL for languages other than English has received increasing attention. One relevant key effort is to create Proposition Bank-style resources for different languages, such as Chinese (Xue and Palmer, 2005) and Hindi (Bhatt et al., 2009). However, the conventional approach of manually generating such resources is costly in terms of time and experts required, hindering the expansion of SRL to new target languages.
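The Figure 1 analysis can be read as a small predicate-argument structure. The sketch below encodes it as a plain data structure; the field names are our own and do not reflect POLYGLOT's actual output format.

```python
# Hypothetical encoding of the Figure 1 analysis: English PropBank labels
# predicted for the German sentence "ich kaufe dir ein neues Auto".
srl_output = {
    "predicate": {"token": "kaufen", "frame": "BUY.01"},
    "arguments": [
        {"span": "ich", "role": "buyer"},
        {"span": "ein neues Auto", "role": "thing bought"},
        {"span": "dir", "role": "benefactive"},
    ],
}

frame = srl_output["predicate"]["frame"]
for arg in srl_output["arguments"]:
    print(f"{arg['span']!r} -> {arg['role']} of {frame}")
```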
• Cross-Lingual Semantic Parsing with Categorial Grammars
Cross-lingual Semantic Parsing with Categorial Grammars. Kilian Evang.

The work in this thesis has been carried out under the auspices of the Center for Language and Cognition Groningen of the Faculty of Arts of the University of Groningen. Groningen Dissertations in Linguistics 155. ISSN: 0928-0030. ISBN: 978-90-367-9474-9 (printed version). ISBN: 978-90-367-9473-2 (electronic version). © 2016, Kilian Evang. Document prepared with LaTeX2e and typeset by pdfTeX (TeX Gyre Pagella font). Cover design by Kilian Evang. Printed by CPI Thesis (www.cpithesis.nl) on Matt MC 115g paper.

Doctoral thesis to obtain the degree of doctor at the University of Groningen, on the authority of the rector magnificus Prof. Dr. E. Sterken and in accordance with the decision of the Doctorate Board. The public defense will take place on Thursday, 26 January 2017 at 12:45, by Kilian Evang, born on 1 May 1986 in Siegburg, Germany. Supervisor: Prof. Dr. J. Bos. Assessment committee: Prof. Dr. M. Butt, Prof. Dr. J. Hoeksema, Prof. Dr. M. Steedman.

Acknowledgments: Whew, that was quite a ride! People had warned me that getting a PhD is not for the faint of heart, and they were right. But it's finally done! Everybody whose guidance, support and friendship I have been able to count on during these five years certainly deserves a big "thank you". Dear Johan, you created the unique research program on deep semantic annotation that lured me to Groningen, you hired me and agreed to be my PhD supervisor.
  • Towards a Corsican Basic Language Resource Kit
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2726-2735, Marseille, 11-16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Towards a Corsican Basic Language Resource Kit. Laurent Kevers, Stella Retali-Medori. UMR CNRS 6240 LISA, Università di Corsica - Pasquale Paoli, Avenue Jean Nicoli, 20250 Corte, France.

Abstract: The current situation regarding the existence of natural language processing (NLP) resources and tools for Corsican reveals their virtual non-existence. Our inventory contains only a few rare digital resources, lexical or corpus databases, requiring adaptation work. Our objective is to use the Banque de Données Langue Corse (BDLC) project to improve the availability of resources and tools for the Corsican language and, in the long term, provide a complete Basic Language Resource Kit (BLARK). We have defined a roadmap setting out the actions to be undertaken: the collection of corpora and the setting up of a consultation interface (concordancer), a language detection tool, an electronic dictionary and a part-of-speech tagger. The first achievements regarding these topics have already been reached and are presented in this article. Some elements are also available on our project page (http://bdlc.univ-corse.fr/tal/).

Keywords: less-resourced languages, Corsican, corpora, lexical resources, language identification, POS tagging

We will first contextualize our research and present some general information about the Corsican language (1.1.), its digital presence (1.2.) and the BDLC project (1.3.), as pre… … a polynomic approach (Marcellesi, 1984) that encompasses all dialectal variants; the writing of the language is not standardised, which is an obstacle for its automatic processing.
  • D1.4 Semantics in Shallow Models
This document is part of the Research and Innovation Action "Quality Translation 21 (QT21)", which has received funding from the European Union's Horizon 2020 program for ICT under grant agreement no. 645452.

Deliverable D1.4: Semantics in Shallow Models. Ondřej Bojar (CUNI), Rico Sennrich (UEDIN), Philip Williams (UEDIN), Khalil Sima'an (UvA), Stella Frank (UvA), Inguna Skadiņa (Tilde), Daiga Deksne (Tilde). Dissemination level: Public. 31st January, 2018.

Grant agreement no.: 645452. Project acronym: QT21. Project full title: Quality Translation 21. Type of action: Research and Innovation Action. Coordinator: Prof. Josef van Genabith (DFKI). Start date, duration: 1st February, 2015, 36 months. Contractual date of delivery: 31st July, 2016. Actual date of delivery: 31st July, 2016. Deliverable number: D1.4. Deliverable title: Semantics in Shallow Models. Type: Report. Status and version: Final (Version 1.0). Number of pages: 107. Contributing partners: CUNI, UEDIN, RWTH, UvA, Tilde. WP leader: CUNI. EC project officer: Susan Fraser.

The partners in QT21 are:
• Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany
• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH), Germany
• Universiteit van Amsterdam (UvA), Netherlands
• Dublin City University (DCU), Ireland
• University of Edinburgh (UEDIN), United Kingdom
• Karlsruher Institut für Technologie (KIT), Germany
• Centre National de la Recherche Scientifique (CNRS), France
• Univerzita Karlova v Praze (CUNI), Czech Republic
• Fondazione Bruno Kessler (FBK), Italy
• University of Sheffield (USFD), United Kingdom
• TAUS b.v.
  • Proceedings of the Celtic Language Technology Workshop 2019
Machine Translation Summit XVII. Proceedings of the Celtic Language Technology Workshop 2019. http://cl.indiana.edu/cltw19/. 19th August, 2019, Dublin, Ireland. © 2019 The authors. These articles are licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.

Preface from the co-chairs of the workshop: These proceedings include the program and papers presented at the third Celtic Language Technology Workshop, held in conjunction with the Machine Translation Summit in Dublin in August 2019. Celtic languages are spoken in regions extending from the British Isles to the western seaboard of continental Europe, and in communities in Argentina and Canada. They include Irish, Scottish Gaelic, Breton, Manx, Welsh and Cornish. Irish is an official EU language, both Irish and Welsh have official language status in their respective countries, and Scottish Gaelic recently saw an improvement in its prospects through the Gaelic Language (Scotland) Act 2005. Most of the others have co-official status, except Breton, which has not yet received recognition as an official or regional language. Until recently, the Celtic languages lagged behind in the areas of natural language processing (NLP) and machine translation (MT); consequently, language technology research and resource provision for this language group was poor. In recent years, as resource provision for minority and under-resourced languages has improved, it has been extended and applied to some Celtic languages.
  • Corpus Linguistics Main Issues
HG3051 Corpus Linguistics: Introduction to Corpus Linguistics, Main Issues. Francis Bond, Division of Linguistics and Multilingual Studies, http://www3.ntu.edu.sg/home/fcbond/. Lecture 1. https://github.com/bond-lab/Corpus-Linguistics HG3051 (2020)

Introduction:
• Administrivia
• What is corpus linguistics
• What this course is (and isn't)
• Getting to know each other (what do you want?)

Corpora I have been involved with:
• Semantic markup of the LDC Call Home Corpus: sense tagging of Japanese telephone transcripts
• Hinoki Treebank of Japanese: HPSG parses of Japanese definitions, examples and newspaper text; sense tagging of same
• Tanaka Corpus of aligned Japanese-English text, now the Tatoeba multilingual project: www.tatoeba.org
• NICT English learner corpus (advisor)
• Japanese WordNet gloss corpus, jSEMCOR corpus: aligned Japanese-English text, sense tagging

Corpora I am building now:
• NTU Multilingual Corpus (NTU-MC), with help from Tan Liling, HG2002, HG8011 and many more: Arabic, Chinese, English, Indonesian, Japanese, Korean, Vietnamese; essays, short stories (Sherlock Holmes), news text, Singapore tourist web sites; WordNet sense tagging, HPSG parses; cross-lingual alignment; tagging of various phenomena; used in URECA and FYP research, and we will use it in this course
• NTU Corpus of Learner English (NTUCLE), with help from LCC (Tan and Bond, 2012; Winder et al., 2017)

100% continuous assessment:
• Individual lab work (4 × 10%)
• Individual project …
• Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Dirt Cheap Web-Scale Parallel Text from the Common Crawl. Jason R. Smith (1,2), Herve Saint-Amand (3), Magdalena Plamada (4), Philipp Koehn (3), Chris Callison-Burch (1,2,5), Adam Lopez (1,2). (1) Department of Computer Science, Johns Hopkins University; (2) Human Language Technology Center of Excellence, Johns Hopkins University; (3) School of Informatics, University of Edinburgh; (4) Institute of Computational Linguistics, University of Zurich; (5) Computer and Information Science Department, University of Pennsylvania.

Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon's Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500.

… are readily available, numbering in the hundreds of millions of words for Chinese-English and Arabic-English, and in tens of millions of words for many European languages (Koehn, 2005). In each case, much of this data consists of government and news text. However, for most language pairs and domains there is little to no curated parallel data available. Hence discovery of parallel data is an important first step for translation between most of the world's languages. The Web is an important source of parallel text. Many websites are available in multiple languages, and unlike other potential sources, such as multilingual news feeds (Munteanu and Marcu, 2005) or Wikipedia (Smith et al., 2010) …
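The mining step starts from nothing but two-letter language codes: URLs that differ only in such a code are candidate translation pairs. The sketch below shows that candidate-pairing idea under our own simplifications (a four-language code set and a single URL pattern); the paper's open-source STRAND extension covers far more patterns and then verifies candidates by structural alignment.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simplified: only four codes, delimited by / . _ or - in the URL.
LANG = re.compile(r"(?<=[/._-])(en|fr|de|es)(?=[/._-]|$)")

def candidate_pairs(urls):
    # Key each URL by the string left after masking its language code;
    # URLs sharing a key are candidate translations of one another.
    buckets = defaultdict(list)
    for url in urls:
        m = LANG.search(url)
        if m:
            buckets[LANG.sub("*", url, count=1)].append((m.group(0), url))
    for variants in buckets.values():
        for (l1, u1), (l2, u2) in combinations(variants, 2):
            if l1 != l2:
                yield u1, u2

urls = ["http://example.com/en/about.html",
        "http://example.com/fr/about.html",
        "http://example.com/en/contact.html"]
print(list(candidate_pairs(urls)))  # the en/fr about pages pair up
```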
• A Study of Analogical Density in Various Corpora at Various Granularity
Information (MDPI), Article. A Study of Analogical Density in Various Corpora at Various Granularity. Rashel Fam and Yves Lepage. Graduate School of IPS, Waseda University, Kitakyushu-shi, Fukuoka-ken 808-0135, Japan.

Abstract: In this paper, we inspect the theoretical problem of counting the number of analogies between sentences contained in a text. Based on this, we measure the analogical density of the text. We focus on analogy at the sentence level, based on the level of form rather than on the level of semantics. Experiments are carried out on two different corpora in six European languages known to have various levels of morphological richness. Corpora are tokenised using several tokenisation schemes: character, sub-word and word. For the sub-word tokenisation scheme, we employ two popular sub-word models: the unigram language model and byte-pair encoding. The results show that a corpus with a higher type-token ratio tends to have higher analogical density. We also observe that masking tokens based on their frequency helps to increase analogical density. As for the tokenisation scheme, the results show that analogical density decreases from the character level to the word level. However, this is not true when tokens are masked based on their frequencies. We find that tokenising the sentences using sub-word models and masking the least frequent tokens increases analogical density.

Keywords: proportional analogy; language productivity; automatic acquisition

Data available through lepage-lab.ips.waseda.ac.jp/en/projects/kakenhi-15k00317/ (accessed on 20 September 2020) or the links provided in the body of the paper.
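The type-token ratio driving the paper's main observation is a one-line statistic: the number of distinct tokens (types) divided by the total number of tokens. A minimal sketch follows, using whitespace tokenisation as a stand-in for the paper's character, sub-word and word schemes.

```python
def type_token_ratio(text: str) -> float:
    # TTR = |types| / |tokens|; whitespace tokenisation is our
    # simplification of the paper's tokenisation schemes.
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(f"{type_token_ratio('the cat sat on the mat'):.2f}")  # 5 types / 6 tokens
```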