ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia 16

Morphosyntactic Corpora and Tools for Persian
Mojgan Seraji

Dissertation presented at Uppsala University to be publicly examined in Universitetshuset / IX, Uppsala, Wednesday, 27 May 2015 at 10:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor of Computational Linguistics Jan Hajic (Charles University in Prague).

Abstract

Seraji, M. 2015. Morphosyntactic Corpora and Tools for Persian. Studia Linguistica Upsaliensia 16. 191 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9229-8.

This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible.

Given these requirements, the thesis investigates two main research questions. The first is how we can develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse.
The approach taken is to accept the tokenization variations in the corpora to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variations of writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them. Therefore, they should not be trained on idealized data.

The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).

Keywords: Persian, language technology, corpus, treebank, preprocessing, segmentation, part-of-speech tagging, dependency parsing

Mojgan Seraji, Department of Linguistics and Philology, Box 635, Uppsala University, SE-75126 Uppsala, Sweden.

© Mojgan Seraji 2015
ISSN 1652-1366
ISBN 978-91-554-9229-8
urn:nbn:se:uu:diva-248780 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-248780)

Sammandrag (Summary)

This thesis presents resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, these resources consist of an improved part-of-speech tagged corpus and a dependency treebank, together with tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

In developing these resources and tools, two important requirements have been adopted: compatibility and reuse. The compatibility requirement has two parts. First, the tools in the chain should be compatible with each other, in such a way that the output of one tool is compatible with the input of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis as is found in them. The reuse requirement means that all components in the chain are developed by reusing resources, standard methods, and open source tools, which is necessary to make the project feasible.

Against the background of these requirements, the thesis investigates two main research questions. The first question is how we can develop morphologically and syntactically annotated corpora and tools while at the same time meeting the requirements of compatibility and reuse. The strategy applied is to accept variation in tokenization in order to achieve robustness. The variation in tokenization in Persian texts is related to orthographic variants of multiword expressions as well as various types of affixes and clitics. Since this variation is an inherent property of Persian texts, it is important that the tools in the chain can handle it. They should therefore not be trained on idealized data. The second question is how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and the tokenizer achieve an accuracy close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves at best an accuracy of over 82% with dependency relations (and close to 87% without relations).

Keywords: Persian, language technology, corpus, treebank, normalization, segmentation, part-of-speech tagging, dependency parsing

To:
my sons Babak and Hooman
my parents Asiyeh and Bahram
my sister Shohreh
my husband Mansour

Words cannot express how much I love you all.

Contents

1 Introduction
  1.1 Goals and Research Questions
  1.2 Research Methodology
  1.3 Outline of the Thesis
  1.4 Previous Publications
2 Background
  2.1 Corpora
    2.1.1 Morphological Annotation
    2.1.2 Syntactic Annotation
  2.2 Tools
    2.2.1 Preprocessing
    2.2.2 Sentence Segmentation
    2.2.3 Tokenization
    2.2.4 Part-of-Speech Tagging
    2.2.5 Parsing
  2.3 Persian
    2.3.1 Persian Orthography
    2.3.2 Persian Morphology
    2.3.3 Persian Syntax
  2.4 Existing Corpora and Tools for Persian
    2.4.1 Morphologically Annotated Corpora
    2.4.2 Syntactically Annotated Corpora
    2.4.3 Sentence Segmentation and Tokenization
    2.4.4 Part-of-Speech Taggers
    2.4.5 Parsers
3 Uppsala Persian Corpus
  3.1 The Bijankhan Corpus
  3.2 Uppsala Persian Corpus
    3.2.1 Character Encodings
    3.2.2 Sentence Segmentation and Tokenization
    3.2.3 Morphological Annotation
4 Normalization, Segmentation and Morphological Analysis for Persian
  4.1 Preprocessing, Sentence Segmentation and Tokenization
    4.1.1 The Preprocessor: PrePer
    4.1.2 The Sentence Segmenter and Tokenizer: SeTPer
    4.1.3 The Evaluation of PrePer and SeTPer
  4.2 The Statistical Part-of-Speech Tagger: TagPer
    4.2.1 The Evaluation of TagPer
5 Uppsala Persian Dependency Treebank
  5.1 Corpus Overview
  5.2 Treebank Development
  5.3 Annotation Scheme
  5.4 Basic Relations
    5.4.1 Relations from Stanford Dependencies
    5.4.2 New Relations
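
The abstract reports both a labeled and an unlabeled parsing accuracy. As a minimal sketch of how such scores are conventionally computed (unlabeled and labeled attachment score over tokens), the following assumes each token is represented by a hypothetical (head index, relation label) pair. The function name, the data representation, and the toy sentence are illustrative assumptions and are not taken from the thesis or its tools.

```python
from typing import List, Tuple

# Hypothetical representation: one (head index, dependency label) pair per token.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (unlabeled, labeled) attachment scores as fractions in [0, 1]."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # Unlabeled: only the head has to match.
    head_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, pred) if g_head == p_head)
    # Labeled: both the head and the relation label have to match.
    full_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return head_hits / len(gold), full_hits / len(gold)


# Invented four-token example: token 3 has the right head but the wrong label,
# token 4 has the wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
```

Because a labeled match requires the head to be correct as well, the labeled score can never exceed the unlabeled one, which is consistent with the ordering of the two figures quoted in the abstract.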