Persian Causality Corpus (PerCause) and the Causality Detection Benchmark


Zeinab Rahimi, Mehrnoush ShamsFard*
NLP Research Lab., Shahid Beheshti University, Tehran, Iran
{z_rahimi, m-shams}@sbu.ac.ir
*corresponding author

ABSTRACT. Recognizing causal elements and causal relations in text is a challenging issue in natural language processing, especially in low-resource languages such as Persian. In this research we prepare a human-annotated causality corpus for the Persian language consisting of 4446 sentences and 5128 causal relations, in which three labels (cause, effect and, where present, causal mark) are specified for each relation. We have used this corpus to train a system for detecting the boundaries of causal elements. We also present a causality detection benchmark for three machine learning methods and two deep learning systems based on this corpus. Performance evaluations indicate that the best overall result, an F-measure of 0.76, is obtained by the CRF classifier, while the best accuracy, 91.4%, is obtained by the BiLSTM-CRF deep learning method.

Keywords: PerCause, causality annotated corpus, causality detection, deep learning, CRF

1. Introduction

Causal relation extraction is a challenging open problem in natural language processing (NLP). It mainly involves semantic analysis and is used in many NLP tasks such as textual entailment recognition, question answering, event prediction and narrative extraction. Causation indicates that one event, state or action is the result of the occurrence of another event, state or action. Consider the following examples:

(1) John is poisoned because he ate a poisonous apple.
(2) It's raining. The streets are slippery.
(3) Narrow roads are dangerous.
(4) The poisonous apple poisoned Snow White.
(5) The kid playing with the match burned the house.

They all express a causal relation. A causal relation usually consists of three elements: cause, effect and causal mark.
For example, in (1), "John is poisoned" is the effect of "he ate a poisonous apple" and "because" is the causal mark. As these examples show, causal elements may be words, phrases or predicates, and a relation may occur within one sentence or across two. Causal relations can be categorized from different points of view, e.g. according to the appearance of a causal mark in the sentence: the mark may or may not appear, and when it does, it may be ambiguous. For example, in (3), "dangerous" is the effect of "narrow roads", but there is no explicit causal mark. Or in (4), the causal mark ("after") would be ambiguous, as it is not always an indicator of causality. Moreover, causative constructions may be either explicit or implicit (Girju, 2003); even the cause or effect itself may be expressed implicitly in a causal sentence. E.g. in (5), "the kid playing with the match" is not literally the cause of the house burning; the fire is. So finding the exact boundaries of causal relations is important and challenging, and may require world knowledge.

Recognition of causal elements can be done through rule-based methods or through machine learning methods that need a tagged corpus. In general, corpus-based approaches seem well suited to such problems (Goyal et al., 2017). Corpus-based causality detection systems are meant to determine the borders of each causal element within a single sentence or a sentence pair and to classify the elements into the predefined categories of cause, effect and causal mark. Causal training corpora already exist for languages such as English (some are introduced in Section 2), but to the best of our knowledge no such corpus has been developed for Persian.
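The boundary-detection task just described is naturally framed as sequence labeling: each token receives a BIO-style tag for cause, effect or causal mark, and contiguous tagged runs are decoded into spans. The following is a minimal sketch of that decoding step; the tag names and the hand-labeled example are our own illustration, not the actual PerCause annotation format:

```python
# Decode BIO-style tags into labeled spans. The tag set (CAUSE, EFFECT,
# MARK) and this hand-labeled example are illustrative assumptions only.
def decode_spans(tokens, tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # trailing "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if label is not None:            # close the span in progress
                spans.append((label, " ".join(tokens[start:i])))
                start, label = None, None
        if tag.startswith("B-"):             # open a new span
            start, label = i, tag[2:]
    return spans

tokens = "John is poisoned because he ate a poisonous apple".split()
tags = ["B-EFFECT", "I-EFFECT", "I-EFFECT", "B-MARK",
        "B-CAUSE", "I-CAUSE", "I-CAUSE", "I-CAUSE", "I-CAUSE"]
print(decode_spans(tokens, tags))
# → [('EFFECT', 'John is poisoned'), ('MARK', 'because'),
#    ('CAUSE', 'he ate a poisonous apple')]
```

A sequence classifier such as a CRF or BiLSTM-CRF predicts the tag sequence; a decoder like this then turns predictions into evaluable cause/effect/mark spans.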
In this paper, we present a causality annotated corpus for Persian (PerCause) and a benchmark for training models on the created dataset. PerCause covers simple and general causation; complicated cases such as metaphysical causation, and weak, negated or nested causal relations, are out of our scope.

This paper is organized as follows. Section 2 reviews related work. In Section 3 we introduce the Persian causality annotated corpus (PerCause) and explain the details of its production procedure. In Section 4 we apply different machine learning and deep learning methods to extract causal relations based on PerCause, and in Section 5 the details of the evaluations and the discussion are presented. Section 6 concludes the paper.

2. Related work

Here we review related research in two main categories: 1) causality detection methods, which may lead to semi-automatic development of a corpus, and 2) manual creation of causality annotated corpora.

2.1 Causality detection

Causal relation extraction can be done through two approaches: rule-based methods and machine learning methods. Some researchers use linguistic patterns to identify explicit causation relations. For example, Garcia (1997) employs a semantic model that classifies causative verbal patterns (linguistic indicators such as <NP1 causative verb NP2>) to detect causal relationships in French texts. Khoo and colleagues (2000) use predefined verbal linguistic patterns to extract cause-effect information from business and medical newspaper texts. Luo and colleagues (2016) propose a framework that automatically harvests a network of cause-effect terms from a large web corpus using specified patterns (such as "A leads to B" or "if A then B"). Backed by this network, they propose a metric to model the causality strength between terms. In this method, causal relationships in short texts are identified using the created resource and the presented statistical criteria.
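Pattern-based harvesting of the kind described above can be sketched with ordinary regular expressions. The two patterns and the helper below are a toy illustration of the idea, not the cited system, which uses many more patterns plus statistics computed over a web-scale corpus:

```python
import re

# Toy versions of explicit causal patterns ("A leads to B", "if A then B").
PATTERNS = [
    re.compile(r"(?P<cause>\w[\w ]*?) leads to (?P<effect>\w[\w ]*)", re.I),
    re.compile(r"if (?P<cause>\w[\w ]*?),? then (?P<effect>\w[\w ]*)", re.I),
]

def harvest(sentences):
    """Collect (cause, effect) term pairs matched by any pattern."""
    pairs = []
    for s in sentences:
        for pat in PATTERNS:
            m = pat.search(s)
            if m:
                pairs.append((m.group("cause").strip().lower(),
                              m.group("effect").strip().lower()))
    return pairs

corpus = ["Smoking leads to cancer.",
          "If it rains then the streets get slippery."]
print(harvest(corpus))
# → [('smoking', 'cancer'), ('it rains', 'the streets get slippery')]
```

Aggregating such pairs over a large corpus yields the cause-effect network from which a causality-strength score between terms can then be estimated.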
Causality extraction by machine learning needs a causality corpus prepared manually or semi-automatically based on particular patterns. For example, in (Girju, 2003) a training corpus containing sentences with 60 simple causal verbs is created from the LA Times section of the TREC 9 text collection. Using a syntactic parser, 6523 relations of the form NP1-Verb-NP2 are found, of which 2101 are manually tagged as causal and 4422 as non-causal. These form the positive and negative examples used to train a decision tree classifier that distinguishes causal from non-causal relations in text. Chang and Choi (2005) learn cue phrase and lexical pair probabilities from a raw, domain-independent corpus in an unsupervised manner, and use these probabilities to extract inter-noun and inter-sentence causalities. Blanco and colleagues (Blanco et al., 2008) consider explicit causal relations expressed by the pattern VerbPhrase-Relator-Cause, where the relator ∈ {because, since, after, as}. They use the semantically annotated SemCor 2.1 corpus to train a C4.5 decision tree binary classifier with bagging. Mirza and Tonelli (2014) present CATENA, a system that performs temporal and causal relation extraction and classification from English text by exploiting the interaction between temporal and causal relations. They use a hybrid approach combining rule-based and supervised classifiers to identify causal relations, and adopt the notion of causality proposed in the annotation guidelines of the CausalTimeBank (Mirza et al., 2014; Mirza and Tonelli, 2014). These guidelines account for CAUSE, ENABLE and PREVENT phenomena.
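The relator-based pattern that Blanco and colleagues start from can be illustrated with a simple candidate splitter. The function below is our own sketch of the candidate-extraction step only; in the cited work each candidate is then classified as causal or non-causal by a trained decision tree, which this sketch does not attempt:

```python
import re

RELATORS = ("because", "since", "after", "as")

def split_on_relator(sentence):
    """Return (left_clause, relator, right_clause) for the first relator
    found, or None. This only proposes candidates; whether the relator
    actually signals causation must be decided by a classifier."""
    pattern = r"\b(" + "|".join(RELATORS) + r")\b"
    m = re.search(pattern, sentence, flags=re.I)
    if not m:
        return None
    left = sentence[:m.start()].strip().rstrip(",")
    right = sentence[m.end():].strip().rstrip(".")
    return (left, m.group(1).lower(), right)

print(split_on_relator("The streets are slippery because it rained."))
# → ('The streets are slippery', 'because', 'it rained')
```

The ambiguity discussed earlier is visible here: a sentence like "He left after lunch" matches the pattern on "after" without expressing causation, which is exactly why the pattern match alone is insufficient.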
The authors aim to assign a causal link to pairs of events when the causal relation is expressed by affect, link or causative verbs (CAUSE-, ENABLE- and PREVENT-type verbs) or when it is marked by a causal mark. Mirza (2014) proposed annotation guidelines for causality between events based on the TimeML definition of events, which considers all types of actions (punctual and durative) and states as events. They introduced the <CLINK> tag to signify a causal link and the <C-SIGNAL> tag to capture the notion of causal signals. Ning et al. (2018) claim that, because a cause must occur earlier than its effect, temporal and causal relations are closely related and one relation often dictates the value of the other. They therefore present a joint inference framework using constrained conditional models (CCMs); specifically, they formulate the joint problem as an integer linear programming (ILP) problem, enforcing constraints inherent in the nature of time and causality. In (Dasgupta et al., 2018) a linguistically informed recurrent neural network architecture (a bidirectional LSTM model enhanced with an additional linguistic layer) for automatic extraction of cause-effect relations from text is proposed. The architecture uses word-level embeddings and other linguistic features to detect causal events and their effects mentioned within a sentence. Finally, in (Hashimoto, 2019) a weakly supervised method for extracting causality knowledge (such as "protectionism -> trade war") from Wikipedia articles in multiple languages is presented. The key idea is to exploit the causality-describing sections and the multilingual redundancy of Wikipedia.

2.2 Manually annotated causality corpora

There are several manually annotated resources for causality. Verb resources such as VerbNet (Schuler, 2005) and PropBank (Palmer et al., 2005) include verbs of causation.
Likewise, preposition schemes (e.g., Schneider et al., 2015, 2016) include some purpose- and explanation-related senses. But these resources cover only specific classes of words. To the best of our knowledge there is no causality annotated corpus for the Persian language, but there are a few manually annotated causality datasets for English, such as BECAUSE (the Bank of Effects and Causes Stated Explicitly) (Dunietz et al., 2015), BioCause (Claudiu et al., 2013), CaTeRS (Mostafazadeh et al., 2016) and (Dunietz, Levin and Carbonell, 2015). BECAUSE distinguishes three types of causation, each with slightly different semantics: Consequence, in which the cause naturally leads to the effect (e.g. "We are in serious economic trouble because of inadequate regulation."); Motivation, in which some agent perceives the cause and therefore consciously thinks, feels, or chooses something (e.g.