And the Causality Detection Benchmark
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
Master in Cognitive Science and Language Master’s Thesis September 2018 Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis Alejandro Ramírez Atrio Supervisor: Toni Badia Abstract In this thesis we study the effect of word reordering as preprocessing for Cross-Lingual Sentiment Analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles the one from our source language (English). Our original expectation was that a Long Short Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with different word order. We hypothesized that the more the word order of any of our target languages resembles the one of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our Part of Speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora: CoStEP and Tatoeba. The results suggest that the bilingual word embeddings that we are training our Long Short Term Memory model with do not improve any English word order learning by part of the model when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in result score even when applying a random reordering to them making them almost unintelligible, neither when classifying between 2 options (positive-negative) nor between 4 (strongly positive, positive, negative, strongly negative). -
Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011
RANLPStud 2011 Proceedings of the Student Research Workshop associated with The 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011) 13 September, 2011 Hissar, Bulgaria STUDENT RESEARCH WORKSHOP ASSOCIATED WITH THE INTERNATIONAL CONFERENCE RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING’2011 PROCEEDINGS Hissar, Bulgaria 13 September 2011 ISBN 978-954-452-016-8 Designed and Printed by INCOMA Ltd. Shoumen, BULGARIA ii Preface The Recent Advances in Natural Language Processing (RANLP) conference, already in its eight year and ranked among the most influential NLP conferences, has always been a meeting venue for scientists coming from all over the world. Since 2009, we decided to give arena to the younger and less experienced members of the NLP community to share their results with an international audience. For this reason, further to the first successful and highly competitive Student Research Workshop associated with the conference RANLP 2009, we are pleased to announce the second edition of the workshop which is held during the main RANLP 2011 conference days on 13 September 2011. The aim of the workshop is to provide an excellent opportunity for students at all levels (Bachelor, Master, and Ph.D.) to present their work in progress or completed projects to an international research audience and receive feedback from senior researchers. We have received 31 high quality submissions, among which 6 papers have been accepted as regular oral papers, and 18 as posters. Each submission has been reviewed by -
From CHILDES to Talkbank
From CHILDES to TalkBank Brian MacWhinney Carnegie Mellon University MacWhinney, B. (2001). New developments in CHILDES. In A. Do, L. Domínguez & A. Johansen (Eds.), BUCLD 25: Proceedings of the 25th annual Boston University Conference on Language Development (pp. 458-468). Somerville, MA: Cascadilla. a similar article appeared as: MacWhinney, B. (2001). From CHILDES to TalkBank. In M. Almgren, A. Barreña, M. Ezeizaberrena, I. Idiazabal & B. MacWhinney (Eds.), Research on Child Language Acquisition (pp. 17-34). Somerville, MA: Cascadilla. Recent years have seen a phenomenal growth in computer power and connectivity. The computer on the desktop of the average academic researcher now has the power of room-size supercomputers of the 1980s. Using the Internet, we can connect in seconds to the other side of the world and transfer huge amounts of text, programs, audio and video. Our computers are equipped with programs that allow us to view, link, and modify this material without even having to think about programming. Nearly all of the major journals are now available in electronic form and the very nature of journals and publication is undergoing radical change. These new trends have led to dramatic advances in the methodology of science and engineering. However, the social and behavioral sciences have not shared fully in these advances. In large part, this is because the data used in the social sciences are not well- structured patterns of DNA sequences or atomic collisions in super colliders. Much of our data is based on the messy, ill-structured behaviors of humans as they participate in social interactions. Categorizing and coding these behaviors is an enormous task in itself. -
The General Regionally Annotated Corpus of Ukrainian (GRAC, Uacorpus.Org): Architecture and Functionality
The General Regionally Annotated Corpus of Ukrainian (GRAC, uacorpus.org): Architecture and Functionality Maria Shvedova[0000-0002-0759-1689] Kyiv National Linguistic University [email protected] Abstract. The paper presents the General Regionally Annotated Corpus of Ukrainian, which is publicly available (GRAC: uacorpus.org), searchable online and counts more than 400 million tokens, representing most genres of written texts. It also features regional annotation, i. e. about 50 percent of the texts are attributed with regard to the different regions of Ukraine or countries of the di- aspora. If the author is known, the text is linked to their home region(s). The journalistic texts are annotated with regard to the place where the edition is pub- lished. This feature differs the GRAC from a majority of general linguistic cor- pora. Keywords: Ukrainian language, corpus, diachronic evolution, regional varia- tion. 1 Introduction Currently many major national languages have large universal corpora, known as “reference” corpora (cf. das Deutsche Referenzkorpus – DeReKo) or “national” cor- pora (a label dating back ultimately to the British National Corpus). These corpora are large, representative for different genres of written language, have a certain depth of (usually morphological and metatextual) annotation and can be used for many differ- ent linguistic purposes. The Ukrainian language lacks a publicly available linguistic corpus. Still there is a need of a corpus in the present-day linguistics. Independently researchers compile dif- ferent corpora of Ukrainian for separate research purposes with different size and functionality. As the community lacks a universal tool, a researcher may build their own corpus according to their needs. -
Conference Abstracts
EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner MAY 23-24-25, 2012 ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE ISTANBUL, TURKEY CONFERENCE ABSTRACTS Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon © ELRA – European Language Resources Association. All rights reserved. LREC 2012, EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Title: LREC 2012 Conference Abstracts Distributed by: ELRA – European Language Resources Association 55-57, rue Brillat Savarin 75013 Paris France Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30 www.elra.info and www.elda.org Email: [email protected] and [email protected] Copyright by the European Language Resources Association ISBN 978-2-9517408-7-7 EAN 9782951740877 All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association ii Introduction of the Conference Chair Nicoletta Calzolari I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues. -
ALW2), Pages 1–10 Brussels, Belgium, October 31, 2018
EMNLP 2018 Second Workshop on Abusive Language Online Proceedings of the Workshop, co-located with EMNLP 2018 October 31, 2018 Brussels, Belgium Sponsors Primary Sponsor Platinum Sponsors Gold Sponsors Silver Sponsors Bronze Sponsors ii c 2018 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-948087-68-1 iii Introduction Interaction amongst users on social networking platforms can enable constructive and insightful conversations and civic participation; however, on many sites that encourage user interaction, verbal abuse has become commonplace. Abusive behavior such as cyberbullying, hate speech, and scapegoating can poison the social climates within online communities. The last few years have seen a surge in such abusive online behavior, leaving governments, social media platforms, and individuals struggling to deal with the consequences. As a field that works directly with computational analysis of language, the NLP community is uniquely positioned to address the difficult problem of abusive language online; encouraging collaborative and innovate work in this area is the goal of this workshop. The first year of the workshop saw 14 papers presented in a day-long program including interdisciplinary panels and active discussion. In this second edition, we have aimed to build on the success of the first year, maintaining a focus on computationally detecting abusive language and encouraging interdisciplinary work. Reflecting the growing research focus on this topic, the number of submissions received more than doubled from 22 in last year’s edition of the workshop to 48 this year. -
Gold Standard Annotations for Preposition and Verb Sense With
Gold Standard Annotations for Preposition and Verb Sense with Semantic Role Labels in Adult-Child Interactions Lori Moon Christos Christodoulopoulos Cynthia Fisher University of Illinois at Amazon Research University of Illinois at Urbana-Champaign [email protected] Urbana-Champaign [email protected] [email protected] Sandra Franco Dan Roth Intelligent Medical Objects University of Pennsylvania Northbrook, IL USA [email protected] [email protected] Abstract This paper describes the augmentation of an existing corpus of child-directed speech. The re- sulting corpus is a gold-standard labeled corpus for supervised learning of semantic role labels in adult-child dialogues. Semantic role labeling (SRL) models assign semantic roles to sentence constituents, thus indicating who has done what to whom (and in what way). The current corpus is derived from the Adam files in the Brown corpus (Brown, 1973) of the CHILDES corpora, and augments the partial annotation described in Connor et al. (2010). It provides labels for both semantic arguments of verbs and semantic arguments of prepositions. The semantic role labels and senses of verbs follow Propbank guidelines (Kingsbury and Palmer, 2002; Gildea and Palmer, 2002; Palmer et al., 2005) and those for prepositions follow Srikumar and Roth (2011). The corpus was annotated by two annotators. Inter-annotator agreement is given sepa- rately for prepositions and verbs, and for adult speech and child speech. Overall, across child and adult samples, including verbs and prepositions, the κ score for sense is 72.6, for the number of semantic-role-bearing arguments, the κ score is 77.4, for identical semantic role labels on a given argument, the κ score is 91.1, for the span of semantic role labels, and the κ for agreement is 93.9. -
Using Morphemes from Agglutinative Languages Like Quechua and Finnish to Aid in Low-Resource Translation
Using Morphemes from Agglutinative Languages like Quechua and Finnish to Aid in Low-Resource Translation John E. Ortega [email protected] Dept. de Llenguatges i Sistemes Informatics, Universitat d’Alacant, E-03071, Alacant, Spain Krishnan Pillaipakkamnatt [email protected] Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA Abstract Quechua is a low-resource language spoken by nearly 9 million persons in South America (Hintz and Hintz, 2017). Yet, in recent times there are few published accounts of successful adaptations of machine translation systems for low-resource languages like Quechua. In some cases, machine translations from Quechua to Spanish are inadequate due to error in alignment. We attempt to improve previous alignment techniques by aligning two languages that are simi- lar due to agglutination: Quechua and Finnish. Our novel technique allows us to add rules that improve alignment for the prediction algorithm used in common machine translation systems. 1 Introduction The NP-complete problem of translating natural languages as they are spoken by humans to machine readable text is a complex problem; yet, is partially solvable due to the accuracy of machine language translations when compared to human translations (Kleinberg and Tardos, 2005). Statistical machine translation (SMT) systems such as Moses 1, require that an algo- rithm be combined with enough parallel corpora, text from distinct languages that can be com- pared sentence by sentence, to build phrase translation tables from language models. For many European languages, the translation task of bringing words together in a sequential sentence- by-sentence format for modeling, known as word alignment, is not hard due to the abundance of parallel corpora in large data sets such as Europarl 2. -
A Massively Parallel Corpus: the Bible in 100 Languages
Lang Resources & Evaluation DOI 10.1007/s10579-014-9287-y ORIGINAL PAPER A massively parallel corpus: the Bible in 100 languages Christos Christodouloupoulos • Mark Steedman Ó The Author(s) 2014. This article is published with open access at Springerlink.com Abstract We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora. Keywords Parallel corpus Á Multilingual corpus Á Comparative corpus linguistics 1 Introduction Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. One of the main uses of the latter kind is as training material for statistical machine translation (SMT), where large amounts of aligned data are standardly used to learn word alignment models between the lexica of two languages (for example, in the Giza?? system of Och and Ney 2003). Another interesting use of parallel corpora in NLP is projected learning of linguistic structure. In this approach, supervised data from a resource-rich language is used to guide the unsupervised learning algorithm in a target language. Although there are some techniques that do not require parallel texts (e.g. Cohen et al. 2011), the most successful models use sentence-aligned corpora (Yarowsky and Ngai 2001; Das and Petrov 2011). C. Christodouloupoulos (&) Department of Computer Science, UIUC, 201 N. -
Metapragmatics of Playful Speech Practices in Persian
To Be with Salt, To Speak with Taste: Metapragmatics of Playful Speech Practices in Persian Author Arab, Reza Published 2021-02-03 Thesis Type Thesis (PhD Doctorate) School School of Hum, Lang & Soc Sc DOI https://doi.org/10.25904/1912/4079 Copyright Statement The author owns the copyright in this thesis, unless stated otherwise. Downloaded from http://hdl.handle.net/10072/402259 Griffith Research Online https://research-repository.griffith.edu.au To Be with Salt, To Speak with Taste: Metapragmatics of Playful Speech Practices in Persian Reza Arab BA, MA School of Humanities, Languages and Social Science Griffith University Thesis submitted in fulfilment of the requirements of the Degree of Doctor of Philosophy September 2020 Abstract This investigation is centred around three metapragmatic labels designating valued speech practices in the domain of ‘playful language’ in Persian. These three metapragmatic labels, used by speakers themselves, describe success and failure in use of playful language and construe a person as pleasant to be with. They are hāzerjavāb (lit. ready.response), bāmaze (lit. with.taste), and bānamak (lit. with.salt). Each is surrounded and supported by a cluster of (related) word meanings, which are instrumental in their cultural conceptualisations. The analytical framework is set within the research area known as ethnopragmatics, which is an offspring of Natural Semantics Metalanguage (NSM). With the use of explications and scripts articulated in cross-translatable semantic primes, the metapragmatic labels and the related clusters are examined in meticulous detail. This study demonstrates how ethnopragmatics, its insights on epistemologies backed by corpus pragmatics, can contribute to the metapragmatic studies by enabling a robust analysis using a systematic metalanguage. -
Peters, H. (2017)
"Proceedings of the International Conference “Fluency & Disfluency Across Languages and Language Varieties”" Degand, Liesbeth Abstract Within the frame of a research project entitled “Fluency and disfluency markers. A multimodal contrastive perspective”, the universities of Louvain and Namur have been involved in a large-scale usage-based study of (dis)fluency markers in spoken French, L1 and L2 English, and French Belgian Sign Language (LSFB), with a focus on variation according to language, speaker and genre. To close this five-year research project, the (DIS)FLUENCY 2017 International Conference is held in Louvain-la-Neuve on the subject of fluency and disfluency across languages and language varieties. This volume contains the papers presented at the conference. Document type : Rapport (Report) Référence bibliographique Degand, Liesbeth ; et. al. Proceedings of the International Conference “Fluency & Disfluency Across Languages and Language Varieties”. (2017) 125 pages Available at: http://hdl.handle.net/2078.1/195807 [Downloaded 2019/08/20 at 15:48:16 ] Proceedings of the International Conference “Fluency & Disfluency Across Languages and Language Varieties” Université catholique de Louvain February 15-17, 2017 Louvain-la-Neuve, Belgium Preface Fluency and disfluency have attracted a great deal of attention in different areas of linguistics such as language acquisition or psycholinguistics. They have been investigated through a wide range of methodological and theoretical frameworks, including corpus linguistics, experimental pragmatics, perception studies and natural language processing, with applications in the domains of language learning, teaching and testing, human/machine communication and business communication. Spoken and signed languages are produced and comprehended online, with typically very little time to plan ahead. -
Better Web Corpora for Corpus Linguistics and NLP
Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Vít Suchomel Advisor: Pavel Rychlý i Acknowledgements I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý for their problem insight, help with software design and con- stant encouragement. I am also grateful to my colleagues from Natural Language Process- ing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus driven NLP and Jan Pomikálek who helped me to start. I thank to my wife Kateřina who supported me a lot during writing this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest. ii Abstract The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods.