Multilingual Semantic Role Labeling with Unified Labels


POLYGLOT: Multilingual Semantic Role Labeling with Unified Labels

Alan Akbik, Yunyao Li
IBM Research, Almaden Research Center
650 Harry Road, San Jose, CA 95120, USA

Abstract

Semantic role labeling (SRL) identifies the predicate-argument structure in text with semantic labels. It plays a key role in understanding natural language. In this paper, we present POLYGLOT, a multilingual semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. The core of POLYGLOT is a set of SRL models for individual languages trained with automatically generated Proposition Banks (Akbik et al., 2015). The key feature of the system is that it treats the semantic labels of the English Proposition Bank as "universal semantic labels": given a sentence in any of the supported languages, POLYGLOT applies the corresponding SRL model and predicts English PropBank frame and role annotations. The results are then visualized to facilitate the understanding of multilingual SRL with this unified semantic representation.

Figure 1: Example of POLYGLOT predicting English PropBank labels for a simple German sentence: the verb "kaufen" is correctly identified to evoke the BUY.01 frame, while "ich" (I) is recognized as the buyer, "ein neues Auto" (a new car) as the thing bought, and "dir" (for you) as the benefactive.

1 Introduction

Semantic role labeling (SRL) is the task of labeling predicate-argument structure in sentences with shallow semantic information. One prominent labeling scheme for the English language is the Proposition Bank (Palmer et al., 2005), which annotates predicates with frame labels and arguments with role labels. Role labels roughly conform to simple questions (who, what, when, where, how much, with whom) with regard to the predicate. SRL is important for understanding natural language; it has been found useful for many applications such as information extraction (Fader et al., 2011) and question answering (Shen and Lapata, 2007; Maqsud et al., 2014).

Not surprisingly, enabling SRL for languages other than English has received increasing attention. One relevant key effort is to create Proposition Bank-style resources for different languages, such as Chinese (Xue and Palmer, 2005) and Hindi (Bhatt et al., 2009). However, the conventional approach of manually generating such resources is costly in terms of the time and experts required, hindering the expansion of SRL to new target languages.

An alternative approach is annotation projection (Padó and Lapata, 2009; Van der Plas et al., 2011), which utilizes parallel corpora to transfer predicted SRL labels from English sentences onto sentences in a target language. It has shown great promise in automatically generating such resources for arbitrary target languages. In previous work, we presented an approach based on filtered projection and bootstrapped learning to auto-generate Proposition Bank-style resources for 7 languages, namely Arabic, Chinese, French, German, Hindi, Russian and Spanish (Akbik et al., 2015).

Unified semantic labels across all languages. One key difference between auto-generated PropBanks and manually created ones is that the former use English Proposition Bank labels for all target languages, while the latter use language-specific labels. As such, the auto-generated PropBanks allow us to train an SRL system that consumes text in various languages and makes predictions in a shared semantic label set, namely English PropBank labels. Refer to Figures 1 and 2 for examples of how our system predicts frame and role labels from the English Proposition Bank for sentences in German, English, Chinese and Spanish.

Figure 2: Side-by-side view of English, Chinese and Spanish sentences parsed in POLYGLOT's Web UI. English PropBank frame and role labels are predicted for all languages: all example sentences evoke the BUY.01 frame and have constituents accordingly labeled with roles such as the buyer, the thing bought, the price paid and the benefactive.

Similar to how Stanford dependencies are one basis of universal dependencies (De Marneffe et al., 2014; Nivre, 2015), we believe that English PropBank labels have the potential to eventually become a basis of "universal" shallow semantic labels. Such a unified representation of shallow semantics, we argue, may facilitate applications such as multilingual information extraction and question answering, much in the same way that universal dependencies facilitate tasks such as crosslingual learning and the development and evaluation of multilingual syntactic parsers (Nivre, 2015). The key questions, however, are (1) to what degree English PropBank frame and role labels are appropriate for different target languages; and (2) how far this approach can handle language-specific phenomena or semantic concepts.

Contributions. To facilitate the discussion of the above questions, we present POLYGLOT, an SRL system trained on auto-generated PropBanks for 8 languages plus English, namely Arabic, Chinese, French, German, Hindi, Japanese, Russian and Spanish. Given a sentence in one of these 9 languages, the system applies the corresponding SRL model and visualizes the shallow semantic parse with predicted English PropBank labels. POLYGLOT allows us to illustrate our envisioned approach of parsing different languages into a shared shallow semantic abstraction based on the English Proposition Bank. It also enables researchers to experiment with the tool to understand the breadth of shallow semantic concepts currently covered, and to discuss the limitations and the potential of such an approach for downstream applications.

2 System Overview

Figure 3 depicts the overall architecture of POLYGLOT. First, we automatically generate labeled training data for each target language with annotation projection (Akbik et al., 2015) (Figure 3, Step 1). We use the labeled data to train an SRL system for each language (Figure 3, Step 2) that predicts English PropBank frame and role labels. Both the creation of the training data and the training of the SRL instances are one-time processes. POLYGLOT provides a Web-based GUI that allows users to interact with the SRL systems (Figure 3, Step 3). Given a natural language sentence, POLYGLOT selects the SRL instance appropriate to the language of the input and displays the semantic parse, as well as syntactic information, on the GUI.

Figure 3: System overview.

In the next sections, we briefly describe the creation of the labeled data and the training of the SRL systems, followed by a tour of the Web UI.

3 Auto-Generation of Labeled Data

We followed an annotation projection approach to automatically generate the labeled data for different languages. This approach takes as input a word-aligned parallel corpus of English sentences and their translations in a target language (TL). A semantic role labeler then predicts labels for the English sentences. In a projection step, these labels are transferred along word alignments onto the target-language sentences. The underlying theory is that translated sentence pairs share a degree of semantic similarity, making such projection possible (Padó and Lapata, 2009).

Figure 4 illustrates an example of annotation projection: using an SRL system trained with the English Proposition Bank, the English sentence is labeled with the appropriate frame (BUY.01) and role labels: "Diego" as the buyer (A0 in PropBank annotation), "some flowers" as the thing bought (A1) and "his girlfriend" as the benefactive (A4). In addition, "yesterday" is labeled AM-TMP, signifying the temporal context of this frame. These labels are then projected onto the aligned Spanish words. For instance, "compró" is word-aligned to "bought" and thus labeled as BUY.01. The projection produces a Spanish sentence labeled with English PropBank labels; such data can in turn be used to train an SRL system for Spanish.

Figure 4: Annotation projection for a word-aligned English-Spanish sentence pair.

    AM-TMP      A0      BUY.01                    A4            A1
    Yesterday,  Diego   bought      his girlfriend      some flowers
    Ayer,       Diego   compró      flores    para su novia
    AM-TMP      A0      BUY.01      A1        A4

In (Akbik et al., 2015), we additionally proposed a process of filtered projection and bootstrapped learning, and successfully created Proposition Banks for 7 target languages. We found the quality of the generated PropBanks to be moderate to high, depending on the target language. Table 1 shows estimated precision, recall and F1-score for each language with two evaluation methods: partial evaluation counts correctly labeled incomplete constituents as true positives, while exact evaluation only counts correctly labeled complete constituents as true positives. For more details we refer the reader to (Akbik et al., 2015).

Table 1: Estimated precision (P), recall (R) and F1-score of the auto-generated PropBanks, with agreement (Agr.) and κ.

                       PREDICATE             ARGUMENT
    LANG.     Match    P     R     F1        P     R     F1        Agr.   κ
    Arabic    part.    0.97  0.89  0.93      0.86  0.69  0.77      0.92   0.87
              exact    0.97  0.89  0.93      0.67  0.63  0.65      0.85   0.77
    Chinese   part.    0.97  0.88  0.92      0.93  0.83  0.88      0.95   0.91
              exact    0.97  0.88  0.92      0.83  0.81  0.82      0.92   0.86

4 Semantic Role Labeling

Using the auto-generated labeled data, we train the semantic role labeler of the MATE toolkit (Björkelund et al., 2009), which achieved state-of-the-art semantic F1-scores in the multilingual semantic role labeling task of the CoNLL-2009 shared task (Hajič et al., 2009). The parser is implemented as a sequence of local logistic regression classifiers for the four steps of predicate identification, predicate classification, argument identification and argument classification. In addition, it implements a global reranker to rerank sets of local predictions. It uses a standard feature set of lexical and syntactic features.

Preprocessing. Before SRL, we execute a pipeline of NLP tools to extract the required lexical, morphological and syntactic features. To facilitate reproducibility of the presented work, we use publicly available open-source tools and pre-trained models.
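The projection step itself is mechanically simple: copy each English label to the target token it is aligned with. Below is a minimal token-level sketch in Python using the Figure 4 sentence pair; the function name, the dictionary representation of labels, and the alignment pairs are our own illustrative assumptions (the paper's pipeline operates on full constituents and applies filtering, which this sketch omits):

```python
def project_labels(src_labels, alignment):
    """Transfer frame/role labels from English token positions to
    target-language token positions along the word alignment.
    Unaligned or unlabeled tokens are simply skipped."""
    tgt_labels = {}
    for src_i, tgt_i in alignment.items():
        if src_i in src_labels:
            tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

# Figure 4 sentence pair (simplified to one token per label)
src = ["Yesterday", "Diego", "bought", "his", "girlfriend", "some", "flowers"]
tgt = ["Ayer", "Diego", "compró", "flores", "para", "su", "novia"]

# English SRL output: token index -> frame/role label
src_labels = {0: "AM-TMP", 1: "A0", 2: "BUY.01", 4: "A4", 6: "A1"}

# Hypothetical word alignment: English index -> Spanish index
alignment = {0: 0, 1: 1, 2: 2, 4: 6, 6: 3}

projected = project_labels(src_labels, alignment)
print({tgt[i]: lab for i, lab in projected.items()})
# {'Ayer': 'AM-TMP', 'Diego': 'A0', 'compró': 'BUY.01',
#  'novia': 'A4', 'flores': 'A1'}
```

The Spanish tokens now carry English PropBank labels, matching the lower half of Figure 4; a real system would additionally filter projections that disagree with the target-language syntax.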
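The four-stage architecture described in Section 4 can be pictured as a chain of local decisions, each stage reading the analysis built by the previous ones. The sketch below substitutes trivial POS-based rules for the learned logistic regression classifiers purely to show the data flow; all function names and toy rules are our own illustration, not the MATE toolkit's API:

```python
def run_srl_pipeline(tokens, stages):
    """Apply the four local decision stages in sequence; each stage
    reads the tokens plus the analysis so far and extends it."""
    analysis = {"predicates": [], "frames": {}, "arguments": [], "roles": {}}
    for stage in stages:
        stage(tokens, analysis)
    return analysis

# Toy stand-ins for the four classifiers (real ones are learned):
def predicate_identification(tokens, a):
    # pretend every verb token is a predicate
    a["predicates"] = [i for i, (w, pos) in enumerate(tokens) if pos == "VERB"]

def predicate_classification(tokens, a):
    # map each predicate to a PropBank frame via a toy lexicon
    lexicon = {"bought": "BUY.01"}
    for i in a["predicates"]:
        a["frames"][i] = lexicon.get(tokens[i][0], tokens[i][0] + ".01")

def argument_identification(tokens, a):
    # pretend every noun token is a candidate argument
    a["arguments"] = [i for i, (w, pos) in enumerate(tokens) if pos == "NOUN"]

def argument_classification(tokens, a):
    # toy rule: first argument is A0, the rest A1
    for n, i in enumerate(a["arguments"]):
        a["roles"][i] = "A0" if n == 0 else "A1"

tokens = [("Diego", "NOUN"), ("bought", "VERB"), ("flowers", "NOUN")]
result = run_srl_pipeline(
    tokens,
    [predicate_identification, predicate_classification,
     argument_identification, argument_classification],
)
print(result["frames"])  # {1: 'BUY.01'}
print(result["roles"])   # {0: 'A0', 2: 'A1'}
```

The point of the staged design is that argument decisions can condition on the identified predicates; the global reranker mentioned in Section 4 would then rescore complete sets of such local predictions.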