Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC

Total Page:16

File Type:pdf, Size:1020Kb

Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC Piotr Bański, Marc Kupietz, Harald Lüngen, Paul Rayson, Hanno Biber, Evelyn Breiteneder, Simon Clematide, John Mariani, Mark Stevenson, Theresa Sick (editors) Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section Birmingham, 24 July 2017 Challenges in the Management of Large Corpora and Big Data and Natural Language Processing 2017 Workshop Programme 24 July 2017 WAC-XI guest session (11.00 - 12.30) Convenors: Adrien Barbaresi (ICLTT Vienna), Felix Bildhauer (IDS Mannheim), Roland Schäfer (FU Berlin) Chair: Stefan Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg) Edyta Jurkiewicz-Rohrbacher, Zrinka Kolaković, Björn Hansen Web Corpora – the best possible solution for tracking rare phenomena in underresourced languages – Clitics in Bosnian, Croatian and Serbian Vladimir Benko – Are Web Corpora Inferior? The Case of Czech and Slovak Vit Suchomel – Removing Spam from Web Corpora Through Supervised Learning Using FastText Lunch (12:30-13:30) CMLC-5+BigNLP main section: (13:30 –17:00) Welcome and Introduction 13:30-13:40 National Corpora Talks 13:40 – 15:40 Dawn Knight, Tess Fitzpatrick, Steve Morris, Jeremy Evas, Paul Rayson, Irena Spasic, Mark Stonelake, Enlli Môn Thomas, Steven Neale, Jennifer Needs, Scott Piao, Mair Rees, Gareth Watkins, Laurence Anthony, Thomas Michael Cobb, Margaret Deuchar, Kevin Donnelly, Michael McCarthy, Kevin Scannell, Creating CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh) David McClure, Mark Algee-Hewitt, Douris Steele, Erik Fredner and Hannah Walse, Organizing corpora at the Stanford Literary Lab Marc Kupietz, Andreas Witt, Piotr Bański, DanTufiş, DanCristea, TamásVáradi, EuReCo - Joining Forces for a European Reference Corpus as a sustainable base for cross- linguistic research John Kirk, Anna Čermáková, From ICE to ICC: The International Comparable Corpus ii Coffee break 15.00-15.20 Harald Lüngen, Marc Kupietz, CMC Corpora in DEREKO Technology Talks 15:40-16:40 John Vidler, Stephen Wattam, Keeping Properties with the Data: CL-MetaHeaders – An Open Specification Andreas Dittrich, Intra-connecting a small exemplary literary corpus with semantic web technologies for exploratory literary studies Radoslav Rábara, Pavel Rychlý ,Ondřej Herman, Accelerating Corpus Search Using Multiple Cores Wrap-up discussion 16:40-17:00 CMLC-5+BigNLP Organisers and Editors Piotr Bański, Marc Kupietz, Harald Lüngen Institut für Deutsche Sprache, Mannheim Hanno Biber, Evelyn Breiteneder Institute for Corpus Linguistics and Text Technology, Vienna Simon Clematide Institute of Computational Linguistics, Zurich John Mariani, Paul Rayson Lancaster University, UK Mark Stevenson Sheffield University, UK iii CMLC-5+BigNLP Programme Committee Laurence Anthony Waseda University, Japan Alistair Baron Lancaster University, UK Felix Bildhauer IDS Mannheim Damir Ćavar Indiana University, Bloomington Matt Coole Lancaster Univeristy, UK Dan Cristea Romanian Academy, Institute for Computer Science-Iaşi, "Alexandru Ioan Cuza" University of Iași Tomaž Erjavec Jožef Stefan Institute Alexander Geyken Berlin-Brandenburgische Akademie der Wissenschaften Johannes Graën University of Zurich Andrew Hardie Lancaster University Serge Heiden ENS de Lyon Miloš Jakubíček Lexical Computing Ltd. Dawn Knight Cardiff University, UK Michal Křen Charles University, Prague Sandra Kübler Indiana University, Bloomington Jochen Leidner Thomson Reuters, UK Rao Muhammad Adeel Nawab COMSATS, Pakistan Piotr Pęzik University of Łódź Laura Irina Rusu IBM Australia Roland Schäfer FU Berlin Roman Schneider IDS Mannheim Gandhi Sivakumar IBM Australia Irena Spasić Cardiff University, UK Marko Tadić University of Zagreb, Faculty of Humanities and Social Sciences Dan Tufiş Institute for Artificial Intelligence Mihai Drăgănescu, Bucharest Tamás Váradi Research Institute for Linguistics, Hungarian Academy of Sciences Andreas Witt University of Cologne, IDS Mannheim, University of Heidelberg Amir Zeldes Georgetown University, USA CMLC-5+BigNLP Homepage: http://corpora.ids-mannheim.de/cmlc-2017.html iv Table of contents Intra-connecting a small exemplary literary corpus with semantic web technologies for exploratory literary studies Andreas Dittrich (Academiae Corpora, Austrian Academy of Sciences) ........................................... 1 From ICE to ICC: The new International Comparable Corpus John Kirk (Dresden University of Technology), Anna Čermáková (Charles University, Prague) ..... 7 Creating CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh) Dawn Knight (Cardiff University), Tess Fitzpatrick (Swansea University), Steve Morris (Swansea University), Jeremy Evas (Cardiff University), Paul Rayson (Lancaster University), Irena Spasic (Cardiff University), Mark Stonelake (Swansea University), Enlli Môn Thomas (Bangor University), Steven Neale (Cardiff University), Jennifer Needs (Swansea University), Scott Piao (Lancaster University), Mair Rees (Swansea University), Gareth Watkins (Cardiff University), Laurence Anthony (Waseda University), Thomas Michael Cobb (University of Quebec at Montreal), Margaret Deuchar (University of Cambridge), Kevin Donnelly (Freelance), Michael McCarthy (University of Nottingham), Kevin Scannell (Saint Louis University) .............................................. 13 EuReCo - Joining Forces for a European Reference Corpus as a sustainable base for cross- linguistic research Marc Kupietz (IDS Mannheim), Andreas Witt (University of Cologne, IDS Mannheim, Heidelberg University), Piotr Bański (IDS Mannheim), Dan Tufiş (Institute for Artificial Intelligence Mihai Drăgănescu, Bucharest), Dan Cristea (Romanian Academy, Institute for Computer Science - Iaşi, Alexandru Ioan Cuza University), Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences) ....................................................................................................................... 15 CMC Corpora in DeReKo Harald Lüngen, Marc Kupietz (IDS Mannheim) .............................................................................. 20 Organizing corpora at the Stanford Literary Lab David McClure, Mark Algee-Hewitt, Douris Steele, Erik Fredner and Hannah Walser (Stanford University), ........................................................................................................................................ 25 Accelerating Corpus Search Using Multiple Cores Radoslav Rábara, Pavel Rychlý ,Ondřej Herman (Lexical Computing) ........................................... 30 Keeping Properties with the Data: CL-MetaHeaders – An Open Specification John Vidler (School of Computing and Communications, Lancaster University), Stephen Wattam (Department of Linguistics and English Language, Lancaster University) ...................................... 35 WAC-XI contributions .............................................................................................................. 42 Are Web Corpora Inferior? The Case of Czech and Slovak Vladimir Benko (Slovak Academy of Sciences, Comenius University in Bratislava) ....................... 43 v Web Corpora – the best possible solution for tracking phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian Edyta Jurkiewicz-Rohrbacher (University of Helsinki, Universität Regensburg),, Zrinka Kolaković (Universität Regensburg), Björn Hansen (Universität Regensburg) ................................................. 49 Removing Spam from Web Corpora Through Supervised Learning Using FastText Vit Suchomel (Masaryk University, Brno) ......................................................................................... 56 vi Author Index Algee-Hewitt, Mark .......................................................................................................................... 25 Anthony, Laurence ............................................................................................................................ 13 Bański, Piotr ...................................................................................................................................... 15 Benko, Vladimir ................................................................................................................................ 43 Čermáková, Anna ................................................................................................................................ 7 Cobb, Thomas Michael ..................................................................................................................... 13 Cristea, Dan ....................................................................................................................................... 15 Deuchar, Margaret ............................................................................................................................ 13 Dittrich, Andreas ................................................................................................................................. 1 Donnelly, Kevin ................................................................................................................................ 13 Evas, Jeremy ..................................................................................................................................... 13 Fitzpatrick, Tess ...............................................................................................................................
Recommended publications
  • CLIB 2016 Proceedings
    The Second International Conference Computational ​ Linguistics in Bulgaria (CLIB 2016) is organised within ​ the Operation for Support for International Scientific Conferences Held in Bulgaria of the National Science ​ Fund Grant № ДПМНФ 01/9 of 11 Aug 2016. National Science Fund ​ ​ CLIB 2016 is organised by: The Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences PUBLICATION AND CATALOGUING INFORMATION Title: Proceedings of the Second International Conference Computational Linguistics in Bulgaria (CLIB 2016) ISSN: 2367­5675 (online) Published and The Institute for Bulgarian Language Prof. Lyubomir ​ distributed by: Andreychin, Bulgarian Academy of Sciences ​ Editorial address: Institute for Bulgarian Language Prof. Lyubomir ​ Andreychin, Bulgarian Academy of Sciences ​ 52 Shipchenski prohod blvd., bldg. 17 Sofia 1113, Bulgaria +359 2/ 872 23 02 Copyright of each paper stays with the respective authors. The works in the Proceedings are licensed under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0). Licence details: http://creativecommons.org/licenses/by/4.0 ​ Proceedings of the Second International Conference Computational Linguistics in Bulgaria 9 September 2016 Sofia, Bulgaria PREFACE We are excited to welcome you to the second edition of the International Conference Computational ​ Linguistics in Bulgaria (CLIB 2016) in Sofia, Bulgaria! ​ CLIB aspires to foster the NLP community in Bulgaria and further the cooperation among researchers working in NLP for Bulgarian around the world. The need for a conference dedicated to NLP research dealing with or applicable to Bulgarian has been felt for quite some time. We believe that building a strong community of researchers and teams who have chosen to work on Bulgarian is a key factor to meeting the challenges and requirements posed to computational linguistics and NLP in Bulgaria.
    [Show full text]
  • The Main Features of the E-Glava Online Valency Dictionary
    The Main Features of the e-Glava Online Valency Dictionary Matea Birtić, Ivana Brač, Siniša Runjaić Institute of Croatian Language and Linguistics, Ulica Republike Austrije 16, HR-10000 Zagreb, Croatia E.mail: [email protected], [email protected], [email protected] Abstract E-Glava is an online valency dictionary of Croatian verbs. The theoretical approach to valency follows the German tradition, particularly that of the VALBU dictionary, with some minor changes and adjustments. The main principle of our valency approach is to link valency patterns to specific verb meanings. The verb list is compiled semi-automatically on the basis of the Croatian Frequency Dictionary and Croatian language textbooks. Currently, e-Glava contains descriptions of 57 psychological verbs with 187 meanings and 375 valency patterns. The lexicographic articles are written in Tschwanelex. A Document Type Definition editing module has been used, and the description of verbs follows a three-level linguistic schema prepared for lexicographers. Verbs are distributed throughout 34 semantic classes, and examples are extracted manually from Croatian corpora. Fully processed data for each semantic class will be publicly available in the form of a browsable HTML dictionary. The paper also presents a comparison between e-Glava and other cognate resources, as well as a summary of its main advantages, disadvantages, and potential applied uses. Keywords: Croatian language; valency dictionary; e-dictionary; syntax 1. Introduction Sentence structure and the syntactic behaviour of verbs were perhaps the most intriguing and interesting topics for early grammatical descriptions and, later, linguistic descriptions of language. Valency properties are relevant to both theoretical and applied linguistic considerations.
    [Show full text]
  • Collection of Usage Information for Language Resources from Academic Articles
    Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy, Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD.
    [Show full text]
  • Conference Abstracts
    EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner MAY 23-24-25, 2012 ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE ISTANBUL, TURKEY CONFERENCE ABSTRACTS Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon © ELRA – European Language Resources Association. All rights reserved. LREC 2012, EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Title: LREC 2012 Conference Abstracts Distributed by: ELRA – European Language Resources Association 55-57, rue Brillat Savarin 75013 Paris France Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30 www.elra.info and www.elda.org Email: [email protected] and [email protected] Copyright by the European Language Resources Association ISBN 978-2-9517408-7-7 EAN 9782951740877 All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association ii Introduction of the Conference Chair Nicoletta Calzolari I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues.
    [Show full text]
  • Morphological Doublets in Croatian: a Multi-Methodological Analysis
    Morphological Doublets in Croatian: A multi-methodological analysis By: Dario Lečić A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy The University of Sheffield Faculty of Arts and Humanities Department of Russian and Slavonic Studies 20 January, 2017 Acknowledgments Many a PhD student past and present will agree that doing a PhD is a time-consuming process with lots of ups and downs, motivational issues and even a number of nervous breakdowns. Having experienced all of these, I can only say that they were right. However, having reached the end of the tunnel, I have to admit that the feeling is great. I would like to use this opportunity to thank all the people who made this possible and who have helped me during these four years spent researching the intricate world of morphological doublets in Croatian. First of all, I would like to express my immense gratitude to my primary supervisor, Professor Neil Bermel from the Department or Russian and Slavonic Studies at the University of Sheffield for offering his guidance from day one. Our regular supervisory meetings as well as numerous e-mail exchanges have been eye-opening and I would not have been able to do this without you. I hope this dissertation will justify all the effort you have put into me as your PhD student. Even though the jurisdiction of the second supervisor as defined by the University of Sheffield officially stretches mostly to matters of the Doctoral Development Programme, my second supervisor, Dr Dagmar Divjak, nevertheless played a major role in this research as well, primarily in matters of statistics.
    [Show full text]
  • FERSIWN GYMRAEG ISOD the National Corpus of Contemporary
    FERSIWN GYMRAEG ISOD The National Corpus of Contemporary Welsh Project Report, October 2020 Authors: Dawn Knight1, Steve Morris2, Tess Fitzpatrick2, Paul Rayson3, Irena Spasić and Enlli Môn Thomas4. 1. Introduction 1.1. Purpose of this report This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that arose during the course of the project, outlining the ways in which they were answered, the impact of these decisions on the resource that has been produced and the longer-term contribution they will make to practices in corpus-building. Finally, we discuss some of the applications and the utility of the work, outlining the impact that CorCenCC is set to have on a range of different individuals and user groups. 1.2. Licence The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool (for links to all tools, refer to section 10 of this report). When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged (see 1.3). § To access the corpus visit: www.corcencc.org/explore § To access the GitHub site: https://github.com/CorCenCC o GitHub is a cloud-based service that enables developers to store, share and manage their code and datasets.
    [Show full text]
  • Languages in the European Information Society Croatian
    META-NET White Paper Series Languages in the European Information Society Croatian Early Release Edition META-FORUM 2011 27-28 June 2011 Budapest, Hungary The development of this white paper has been funded by the Seventh Framework Programme and the ICT Policy Support Programme of the European Commission under contracts T4ME (Grant Agreement 249119), CESAR (Grant Agreement 271022), METANET4U (Grant Agreement 270893) and META-NORD (Grant Agreement 270899). This white paper is for educators, journalists, politicians, language communities and others, who want to establish a truly multilingual Europe. This white paper is part of a series that promotes knowledge about language technology and its potential. The availability and use of language technology in Europe varies between languages. Conse- quently, the actions that are required to further support research and development of language technologies also differs for each language. The required actions depend on many factors, such as the complexity of a given language and the size of its community. META-NET, a European Commission Network of Excellence, has conducted an analysis of current language resources and technolo- gies. This analysis focused on the 23 official European languages as well as other important regional languages in Europe. The results of this analysis suggests that there are many significant research gaps for each language. A more detailed, expert analysis and as- sessment of the current situation will help maximise the impact of additional research and minimize any risks. META-NET consists of 44 research centres from 31 countries who are working with stakeholders from commercial businesses, gov- ernment agencies, industry, research organisations, software com- panies, technology providers and European universities.
    [Show full text]
  • The Translation Equivalents Database (Treq) As a Lexicographer’S Aid
    The Translation Equivalents Database (Treq) as a Lexicographer’s Aid Michal Škrabal, Martin Vavřín Institute of the Czech National Corpus, Charles University, Czech Republic E-mail: [email protected], [email protected] Abstract The aim of this paper is to introduce a tool that has recently been developed at the Institute of the Czech National Corpus, the Treq (Translation Equivalents) database, and to explore its possible uses, especially in the field of lexicography. Equivalent candidates offered by Treq can also be considered as potential equivalents in a bilingual dictionary (we will focus on the Latvian–Czech combination in this paper). Lexicographers instantly receive a list of candidates for target language counterparts and their frequencies (expressed both in absolute numbers and percentages) that suggest the probability that a given candidate is functionally equivalent. A significant advantage is the possibility to click on any one of these candidates and immediately verify their individual occurrences in a given context; and thus more easily distinguish the relevant translation candidates from the misleading ones. This utility, which is based on data stored in the InterCorp parallel corpus, is continually being upgraded and enriched with new functions (the recent integration of multi-word units, adding English as the primary language of the dictionaries, an improved interface, etc.), and the accuracy of the results is growing as the volume of data keeps increasing. Keywords: InterCorp; Treq; translation equivalents; alignment; Latvian–Czech dictionary 1. Introduction The aim of this paper is to introduce one of the tools that has been developed recently at the Institute of the Czech National Corpus (ICNC) and which could be especially helpful to lexicographers: namely, the Treq translation equivalents database1.
    [Show full text]
  • The Workshop Programme
    The Workshop Programme Monday, May 26 14:30 -15:00 Opening Nancy Ide and Adam Meyers 15:00 -15:30 SIGANN Shared Corpus Working Group Report Adam Meyers 15:30 -16:00 Discussion: SIGANN Shared Corpus Task 16:00 -16:30 Coffee break 16:30 -17:00 Towards Best Practices for Linguistic Annotation Nancy Ide, Sameer Pradhan, Keith Suderman 17:00 -18:00 Discussion: Annotation Best Practices 18:00 -19:00 Open Discussion and SIGANN Planning Tuesday, May 27 09:00 -09:10 Opening Nancy Ide and Adam Meyers 09:10 -09:50 From structure to interpretation: A double-layered annotation for event factuality Roser Saurí and James Pustejovsky 09:50 -10:30 An Extensible Compositional Semantics for Temporal Annotation Harry Bunt, Chwhynny Overbeeke 10:30 -11:00 Coffee break 11:00 -11:40 Using Treebank, Dictionaries and GLARF to Improve NomBank Annotation Adam Meyers 11:40 -12:20 A Dictionary-based Model for Morpho-Syntactic Annotation Cvetana Krstev, Svetla Koeva, Du!ko Vitas 12:20 -12:40 Multiple Purpose Annotation using SLAT - Segment and Link-based Annotation Tool (DEMO) Masaki Noguchi, Kenta Miyoshi, Takenobu Tokunaga, Ryu Iida, Mamoru Komachi, Kentaro Inui 12:40 -14:30 Lunch 14:30 -15:10 Using inheritance and coreness sets to improve a verb lexicon harvested from FrameNet Mark McConville and Myroslava O. Dzikovska 15:10 -15:50 An Entailment-based Approach to Semantic Role Annotation Voula Gotsoulia 16:00 -16:30 Coffee break 16:30 -16:50 A French Corpus Annotated for Multiword Expressions with Adverbial Function Eric Laporte, Takuya Nakamura, Stavroula Voyatzi
    [Show full text]
  • The National Corpus of Contemporary Welsh 1. Introduction
    The National Corpus of Contemporary Welsh Project Report, October 2020 1. Introduction 1.1. Purpose of this report This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that arose during the course of the project, outlining the ways in which they were answered, the impact of these decisions on the resource that has been produced and the longer-term contribution they will make to practices in corpus-building. Finally, we discuss some of the applications and the utility of the work, outlining the impact that CorCenCC is set to have on a range of different individuals and user groups. 1.2. Licence The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and thus are freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool (for links to all tools, refer to section 10 of this report). When reporting information derived by using the CorCenCC corpus data and/or tools, CorCenCC should be appropriately acknowledged (see 1.3). § To access the corpus visit: www.corcencc.org/explore § To access the GitHub site: https://github.com/CorCenCC o GitHub is a cloud-based service that enables developers to store, share and manage their code and datasets.
    [Show full text]
  • Overabundance in Croatian Dual-Class Verbs FLUMINENSIA, God
    Tomislava Bošnjak Botica, Gordana Hržica, Overabundance in Croatian dual-class verbs FLUMINENSIA, god. 28 (2016), br. 1 Tomislava Bošnjak Botica, Gordana Hržica OVERABUNDANCE IN CROATIAN DUAL-CLASS VERBS dr. sc. Tomislava Bošnjak Botica, Institut za hrvatski jezik i jezikoslovlje, [email protected], Zagreb dr. sc. Gordana Hržica, Edukacijsko-rehabilitacijski fakultet, [email protected], Zagreb izvorni znanstveni članak UDK 811.163.42’367.625 rukopis primljen: 5. 4. 2016.; prihvaćen za tisak: 21. 6. 2016. Croatian verbal inflection morphology is typically described using verb class distinctions. The number of classes differs among approaches, but the basic criterion for class division is the presence or absence and the type of suppletion in verb stems. Generally, one verb belongs to one inflectional class or paradigm only. However, some verbs belong to two classes, i.e. they have two parallel sets of stems. In such dual-class verbs, one infinitive form is realizable in two present forms in all cells within a class, i.e. there is an overabundance (Thorton 2011). Inevitably, one of the stem forming paradigms is a class with categorial suppletion. The present stem of a categorial suppletion class has a greater phonological distance from the infinitive stem than the present stem of the other class. Using a different terminology one class can be described as more transparent, while the other is less transparent (more opaque) in forming the present stem. This study attempts to present overabundance in dual-class verbs and to determine whether competition in such forms can be explained by their tendency to conform to one default class or by other factors, specifically, by the phonological distance between the two paradigms of dual-class verbs.
    [Show full text]
  • Better Web Corpora for Corpus Linguistics and NLP
    Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Vít Suchomel Advisor: Pavel Rychlý i Acknowledgements I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý for their problem insight, help with software design and con- stant encouragement. I am also grateful to my colleagues from Natural Language Process- ing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus driven NLP and Jan Pomikálek who helped me to start. I thank to my wife Kateřina who supported me a lot during writing this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest. ii Abstract The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods.
    [Show full text]