RASLAN 2016 Recent Advances in Slavonic Natural Language Processing

RASLAN 2016 Recent Advances in Slavonic Natural Language Processing A. Horák, P. Rychlý, A. Rambousek (Eds.) RASLAN 2016 Recent Advances in Slavonic Natural Language Processing Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016 Karlova Studánka, Czech Republic, December 2–4, 2016 Proceedings Tribun EU 2016 Proceedings Editors Aleš Horák Faculty of Informatics, Masaryk University Department of Information Technologies Botanická 68a CZ-602 00 Brno, Czech Republic Email: [email protected] Pavel Rychlý Faculty of Informatics, Masaryk University Department of Information Technologies Botanická 68a CZ-602 00 Brno, Czech Republic Email: [email protected] Adam Rambousek Faculty of Informatics, Masaryk University Department of Information Technologies Botanická 68a CZ-602 00 Brno, Czech Republic Email: [email protected] This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, in its current version, and permission for use must always be obtained from Tribun EU. Violations are liable for prosecution under the Czech Copyright Law. Editors © Aleš Horák, 2016; Pavel Rychlý, 2016; Adam Rambousek, 2016 Typography © Adam Rambousek, 2016 Cover © Petr Sojka, 2010 This edition © Tribun EU, Brno, 2016 ISBN 978-80-263-1095-2 ISSN 2336-4289 Preface This volume contains the Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2016) held on December 2nd–4th 2016 in Karlova Studánka, Sporthotel Kurzovní, Jeseníky, Czech Republic. The RASLAN Workshop is an event dedicated to the exchange of information between research teams working on the projects of computer processing of Slavonic languages and related areas going on in the NLP Centre at the Faculty of Informatics, Masaryk University, Brno. RASLAN is focused on theoretical as well as technical aspects of the project work, on presentations of verified methods together with descriptions of development trends. The workshop also serves as a place for discussion about new ideas. The intention is to have it as a forum for presentation and discussion of the latest developments in the field of language engineering, especially for undergraduates and postgraduates af- filiated to the NLP Centre at FI MU. Topics of the Workshop cover a wide range of subfields from the area of artificial intelligence and natural language processing including (but not limited to): * text corpora and tagging * syntactic parsing * sense disambiguation * machine translation, computer lexicography * semantic networks and ontologies * semantic web * knowledge representation * logical analysis of natural language * applied systems and software for NLP RASLAN 2016 offers a rich program of presentations, short talks, technical papers and mainly discussions. A total of 18 papers were accepted, contributed altogether by 28 authors. Our thanks go to the Program Committee members and we would also like to express our appreciation to all the members of the Organizing Committee for their tireless efforts in organizing the Workshop and ensuring its smooth running. In particular, we would like to mention the work of Aleš Horák, Pavel Rychlý and Marie Stará. The TEXpertise of Adam Rambousek (based on LATEX macros prepared by Petr Sojka) resulted in the extremely speedy and efficient production of the volume which you are now holding in your hands. Last but not least, the cooperation of Tribun EU as a printer of these proceedings is gratefully acknowledged. Brno, December 2016 Karel Pala Table of Contents I Text Corpora Between Comparable and Parallel: English-Czech Corpus from Wikipedia 3 Adéla Štromajerová, Vít Baisa, and Marek Blahuš Comparison of High-Frequency Nouns from the Perspective of Large Corpora . 9 Maria Khokhlova Feeding the “Brno Pipeline”: The Case of Araneum Slovacum . 19 Vladimír Benko Gold-Standard Datasets for Annotation of Slovene Computer- Mediated Communication . 29 Tomaž Erjavec, Jaka Cibej,ˇ Špela Arhar Holdt, Nikola Ljubešić,and Darja Fišer II Semantics and Language Modelling Detecting Semantic Shifts in Slovene Twitterese . 43 Darja Fišer and Nikola Ljubešić The Algorithm of Context Recognition in TIL . 51 Marie Duží and Michal Fait Czech Grammar Agreement Dataset for Evaluation of Language Models . 63 Vít Baisa Bilingual Logical Analysis of Natural Language Sentences . 69 Marek Medved’, Aleš Horák, and VojtˇechKováˇr ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches . 79 Jan Rygl, Petr Sojka, Michal R˚užiˇcka,and Radim Reh˚uˇrekˇ III Electronic Lexicography and Language Resources Automatic Identification of Valency Frames in Free Text . 91 Martin Wörgötter VIII Table of Contents Data Structures in Lexicography: from Trees to Graphs . 97 Michal Mˇechura Pre-processing Large Resources for Family Names Research . 105 Adam Rambousek Options for Automatic Creation of Dictionary Definitions from Corpora . 111 Marie Stará and VojtˇechKováˇr IV Corpora-based Applications Evaluating Natural Language Processing Tasks with Low Inter-Annotator Agreement: The Case of Corpus Applications . 127 VojtˇechKováˇr Terminology Extraction for Academic Slovene Using Sketch Engine. 135 Darja Fišer, Vít Suchomel, and Miloš Jakubíˇcek Large Scale Keyword Extraction using a Finite State Backend . 143 Miloš Jakubíˇcekand Pavel Šmerk Evaluation of the Sketch Engine Thesaurus on Analogy Queries . 147 Pavel Rychlý How to Present NLP Topics to Children? . 153 Zuzana Nevˇeˇrilováand Adam Rambousek Subject Index .................................................... 161 Author Index ..................................................... 163 Part I Text Corpora Between Comparable and Parallel: English-Czech Corpus from Wikipedia Adéla Štromajerová, Vít Baisa, Marek Blahuš Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {xstromaj,xbaisa,xblah}@fi.muni.cz Abstract. We describe the process of creating a parallel corpus from Czech and English Wikipedias using methods which are language inde- pendent. The corpus consists of Czech and English Wikipedia articles, the Czech ones being translations of the English ones, is aligned on sentence level and is accessible in Sketch Engine corpus manager.1 Key words: parallel corpora, comparable corpora, Wikipedia 1 Introduction Wikipedia is now available in almost 300 languages, 13 of them having more than one million articles. The largest is the English Wikipedia, which contains more than 5 million articles. For each article, Wikipedia stores information about all the editing: the editor, the time of the editing and the changes made in the article. It is also possible to view any previous version of the article. New articles are either written from scratch, or an article from another language version of Wikipedia is translated. However, translations do not need to cover the whole original article. Translations in Wikipedia are mostly from English to other languages. The Translated page template should be always added into such article so that it is clear that it is a translation and to identify what article and what language version of Wikipedia have been used for the translation.2 We used this information and extracted the translated articles as a basis for an English-Czech parallel corpus. There are more than 37,000 such articles3 and they cover a wide range of topics. As the whole Czech Wikipedia contains about 350,000 articles, the proportion of the English-Czech translations is quite high, i.e. approximately 10%, when the number of articles (not their length) is considered. 1 https://ske.fi.muni.cz 2 https://cs.wikipedia.org/wiki/Wikipedie:WikiProjekt_P%C5%99eklad/Rady 3 https://cs.wikipedia.org/wiki/Kategorie:Monitoring:%C4%8Cl%C3%A1nky_p% C5%99elo%C5%BEen%C3%A9_z_enwiki Aleš Horák, Pavel Rychlý, Adam Rambousek (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2016, pp. 3–8, 2016. © Tribun EU 2016 4 A. Štromajerová, V. Baisa, and M. Blahuš 2 Related work There have been a few attempts at creating corpora from Wikipedia, e.g. [1]. There is a huge comparable corpus Wikipedia Comparable Corpora4. This consists of monolingual corpora, every one of them containing all articles from a particular language version of Wikipedia. These corpora are then document- aligned, i.e., the corpora consist of document pairs of articles on the same subject. There are also some parallel corpora based on Wikipedia, for example, a Chinese-Japanese parallel corpus created to help to improve the SMT between these two languages [2]. The field of web parallel corpora is of great interest nowadays, and there are many projects dealing specifically with Wikipedia (Besides the already mentioned ones, e.g. a Persian-English parallel corpus [3]. However, there is no parallel corpus made out of English and Czech articles from Wikipedia. 3 Exploiting Wikipedia Article Translations The workflow was the following: a) to identify which Czech articles were created by translating English articles; b) to find out which version of the Czech article was the first, original translation; c) to identify the English articles from which the Czech ones were translated; d) to determine the version of the English

RASLAN 2016 Recent Advances in Slavonic Natural Language Processing

Newcastle University Eprints

RASLAN 2015 Recent Advances in Slavonic Natural Language Processing

Children Online: a Survey of Child Language and CMC Corpora

Email in the Australian National Corpus

C 2014 Gourab Kundu DOMAIN ADAPTATION with MINIMAL TRAINING

CS224N Section 3 Corpora, Etc. Pi-Chuan Chang, Friday, April 25

Machine Learning Approaches To

1. Introduction

The Best Kept Secrets with Corpus Linguistics

The Enron Corpus: Where the Email Bodies Are Buried?

Rating Tobacco Industry Documents for Corporate Deception And

Corpus Approaches to Forensic Linguistics David Wright