RASLAN 2014 Recent Advances in Slavonic Natural Language Processing

A. Horák, P. Rychlý (Eds.) RASLAN 2014

Recent Advances in Slavonic Natural Language Processing

Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2014, Karlova Studánka, Czech Republic, December 5–7, 2014. Proceedings

NLP Consulting 2014

Proceedings Editors

Aleš Horák
Faculty of Informatics, Masaryk University
Department of Information Technologies
Botanická 68a
CZ-602 00 Brno, Czech Republic
Email: [email protected]

Pavel Rychlý
Faculty of Informatics, Masaryk University
Department of Information Technologies
Botanická 68a
CZ-602 00 Brno, Czech Republic
Email: [email protected]

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, in its current version, and permission for use must always be obtained from NLP Consulting. Violations are liable for prosecution under the Czech Copyright Law.

Editors © Aleš Horák, 2014; Pavel Rychlý, 2014
Typography © Adam Rambousek, 2014
Cover © Petr Sojka, 2010
This edition © NLP Consulting, Brno, 2014

ISSN 2336-4289

Preface

This volume contains the Proceedings of the Eighth Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2014) held on December 5–7, 2014 in Karlova Studánka, Czech Republic.

The RASLAN Workshop is an event dedicated to the exchange of information between researchers working on the projects of computer processing of Slavonic languages and related areas going on in the NLP Centre at the Faculty of Informatics, Masaryk University, Brno. RASLAN is focused on theoretical as well as technical aspects of the project work, on presentations of verified methods together with descriptions of development trends. The workshop also serves as a place for discussions about new ideas. The intention is to have it as a forum for presentation and discussion of the latest developments in the field of language engineering, especially for undergraduates and postgraduates affiliated to the NLP Centre at FI MU. Topics of the Workshop cover a wide range of subfields from the area of artificial intelligence and natural language processing including (but not limited to):

* text corpora and tagging,
* syntactic analysis,
* sense disambiguation,
* machine translation,
* computer lexicography,
* semantic networks and ontologies,
* semantic web,
* knowledge representation,
* logical analysis of natural language,
* applied systems and software for NLP.

RASLAN 2014 offers a rich program of presentations, short talks, technical papers and mainly discussions. A total of 18 papers were accepted, contributed altogether by 24 authors.

Our thanks go to the Program Committee members, and we would also like to express our appreciation to all the members of the Organizing Committee for their tireless efforts in organizing the Workshop and ensuring its smooth running. In particular, we would like to mention the work of Aleš Horák, Pavel Rychlý and Lucia Kocincová. The TeXpertise of Adam Rambousek (based on LaTeX macros prepared by Petr Sojka) resulted in the extremely speedy and efficient production of the volume which you are now holding in your hands. Last but not least, the cooperation of Tribun EU as the printer of these proceedings is gratefully acknowledged.

Brno, December 2014 Karel Pala

Table of Contents

I Language Modelling

Character-based Language Model ...... 3 Vít Baisa

A System for Predictive Writing ...... 11 Zuzana Nevěřilová and Barbora Ulipová

One System to Solve Them All ...... 19 Jan Rygl

Improving Coverage of Translation Memories with Language Modelling . 27 Vít Baisa, Josef Bušta, and Aleš Horák

II Text Corpora

Optimization of Regular Expression Evaluation within the Manatee Corpus Management System ...... 37 Miloš Jakubíček and Pavel Rychlý

Modus Questions: Query Models and Frequency in Russian Text Corpora 49 Victoria V. Kazakovskaya and Maria V. Khokhlova

Low Inter-Annotator Agreement = An Ill-Defined Problem? ...... 57 Vojtěch Kovář, Pavel Rychlý, and Miloš Jakubíček

SkELL: Web Interface for English Language Learning ...... 63 Vít Baisa and Vít Suchomel

Text Tokenisation Using unitok ...... 71 Jan Michelfeit, Jan Pomikálek, and Vít Suchomel

Finding the Best Name for a Set of Words Automatically ...... 77 Pavel Rychlý

III Semantics and Information Retrieval

Style Markers Based on Stop-word List ...... 85 Jan Rygl and Marek Medveď

Separating Named Entities ...... 91 Barbora Ulipová and Marek Grác

Intelligent Search and Replace for Czech Phrases ...... 97 Zuzana Nevěřilová and Vít Suchomel

An Architecture for Scientific Document Retrieval: Using Textual and Math Entailment Modules ...... 107 Partha Pakray and Petr Sojka

IV Morphology and Lexicon

SQAD: Simple Question Answering Database ...... 121 Marek Medveď and Aleš Horák

Semiautomatic Building and Extension of Terminological Thesaurus for Land Surveying Domain ...... 129 Adam Rambousek, Aleš Horák, Vít Suchomel, and Lucia Kocincová

Mapping Czech and English Valency Lexicons: Preliminary Report ...... 139 Vít Baisa, Karel Pala, Zdeňka Sitová, and Jakub Vonšovský

Tools for Fast Morphological Analysis Based on Finite State Automata . . . 147 Pavel Šmerk

Subject Index ...... 151

Author Index ...... 153

Part I

Language Modelling

Character-based Language Model

Vít Baisa

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic [email protected]

Abstract. Language modelling and also other natural language processing tasks are usually based on words. I present here a more general yet simpler approach to language modelling using much smaller units of text data: the character-based language model (CBLM).1 In this paper I describe the underlying data structure of the model and evaluate the model using standard measures (entropy, perplexity). As a proof of concept and an extrinsic evaluation, I also present a random sentence generator based on this model.

Keywords: language model, suffix array, LCP, trie, character-based, random text generator, corpus

1 Introduction

Current approaches to language modelling are based almost entirely on words. To work with words, the input data needs to be tokenized, which might be quite tricky for some languages. The tokenization might cause errors which are propagated to subsequent processing steps. But even if the tokenization were 100% reliable, another problem emerges: word-based language models treat similar words as completely unrelated. Consider the two words platypus and platypuses. The former is contained in the latter, yet they will be treated completely independently. This issue can be sorted out partially by using factored language models [1] where lemmas and morphological information (here singular vs. plural number of the same lemma) are treated simultaneously with the word forms. In most systems, word-based language models are based on n-grams (usually 3–4) and on a Markov chain of the corresponding order, where only a finite and fixed number of previous words is taken into account.

I propose a model which tackles the above-mentioned problems. Tokenization is removed from the process of building the model since the model uses sequences of characters (or bytes) from the input data. Words (byte sequences) which share a prefix of characters (bytes) are stored in the same place in the model. The model uses a suffix array and a trie structure and is completely language independent. The aim of CBLM is to make language modelling more robust and at the same time simpler: with no need for interpolation, smoothing and other language modelling techniques.

1 I call this ChaRactEr-BasEd LangUage Model (CBLM) cerebellum: a part of the human brain which plays an important role in motor control and which is also involved in some cognitive processes, including language processing.

2 Related work

Character-based language models are used very rarely despite being described frequently in the theoretical literature. That is because standard n-gram character-based language models suffer from a very limited context: even 10-grams are not expressive enough since they describe only a very limited width of context. Even the famous Shannon's paper [2] mentions simple uni-, bi- and tri-gram models but then swiftly moves on to word-based models. There have been some attempts to use sub-word units (morphemes) for language modelling, especially in speech recognition tasks [3] for morphologically rich languages like Finnish and Hungarian, but they have not gone deeper. Variable-length n-gram modelling is also closely related, but the model described in [4] is based on categories rather than on substrings from the raw input data. A suffix array language model (SALM) based on words has been proposed in [5].

3 Building the model

As input, any plain text can be used (in any encoding, but it is most convenient to use UTF-8). In a previous version of the model, all input characters were encoded to a 7-bit code (the last bit was used for storing structure information). Currently the model requires simpler preprocessing: the data is taken as is, as a sequence of bytes. Sentences are separated by a newline character. The only pre-processing is lower-casing, quite a common practice.

3.1 Suffix array, longest common prefix array

The next step is suffix array construction. A suffix array (SA) is a list of indexes (positions) in the input data which are sorted according to the lexicographical order of the suffixes starting at the corresponding positions in the data. Table 1 shows an example of a SA constructed for the string popocatepetl. The resulting SA is in the third column. The last column contains the longest common prefix array (LCP) which corresponds to the number of common characters between two consecutive suffixes in the SA. I use the libdivsufsort2 library for fast SA construction in O(n log n) time, where n is the input data size in bytes. The size of an input is limited to 2 GB since longer

2 https://code.google.com/p/libdivsufsort/

Table 1: Suffix array example

 i  suffix          SA  sorted suffix   LCP
 0  popocatepetl     5  atepetl          0
 1  opocatepetl      4  catepetl         0
 2  pocatepetl       7  epetl            0
 3  ocatepetl        9  etl              1
 4  catepetl        11  l                0
 5  atepetl          3  ocatepetl        0
 6  tepetl           1  opocatepetl      1
 7  epetl            8  petl             0
 8  petl             2  pocatepetl       1
 9  etl              0  popocatepetl     2
10  tl               6  tepetl           0
11  l               10  tl               1

data could not be encoded using 4-byte integer indexes. The size of a compiled SA is O(n log n). The LCP array is computed separately using an inverse SA in O(n) time and O(n) space. To limit the size of the LCP array, the highest possible number in the LCP array is 256 (1 B per item). Thus the longest substring which can be stored in the model is 256 characters (bytes) long.
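For illustration, the naive Python sketch below reproduces Table 1; it is not the implementation used here (that relies on libdivsufsort) but a quadratic reference that yields the same SA and LCP arrays for a short string.

# Naive reference for Section 3.1 (illustration only; reproduces Table 1).
def build_suffix_array(data):
    # positions of suffixes, sorted lexicographically
    return sorted(range(len(data)), key=lambda i: data[i:])

def build_lcp(data, sa):
    # LCP[i] = common prefix length of the suffixes at SA[i-1] and SA[i]
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = data[sa[i - 1]:], data[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = min(k, 255)  # the model stores each LCP value in a single byte
    return lcp

data = b"popocatepetl"
sa = build_suffix_array(data)
for pos, l in zip(sa, build_lcp(data, sa)):
    print(pos, data[pos:].decode(), l)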

3.2 Trie

Once the SA and LCP are built, all prefixes from the SA which occur more than N× are put into a trie structure. N is the only parameter used in the construction of the trie. Each node in the trie stores the probability (relative frequency) of occurrence of the corresponding prefix in the SA.


Fig. 1: Construction of a trie from an example suffix array

In Figure 1 you can see an example of a suffix array turned into a trie structure. Only the upper part of the trie is shown. The numbers below the nodes correspond to frequencies of the prefixes. The trie is stored in a linear list; the tree structure of the trie is preserved using integer indexes of the list. Each item of the list stores 4 slots: 1) the address of the eldest child, 2) the probability of the current prefix, 3) the character byte itself and 4) a binary flag: true if the current item is the last node in a row of siblings. The siblings are sorted according to probability. Figure 2 shows an example for the substring barb from a Czech model. It is obvious that after the prefix, the characters a, o, i and e are the most frequent. They occur in words like barbar, barbora, barbie, barbecue, rebarbora etc. The dots in the figure mean a space skipped between the trie items (nodes). After the substring barbo, the most probable characters are r (75%) and ř (22%). See the last two items in Figure 2. The character ř is in fact represented by two nodes: a node with byte value 197 and its child node (153), but here it has been simplified.

Fig. 2: Trie structure
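The node layout can be sketched as follows; the field names are illustrative only, the actual implementation packs the four slots into a linear list.

from typing import NamedTuple

class TrieNode(NamedTuple):
    first_child: int    # 1) address (list index) of the eldest child
    prob: float         # 2) probability of the current prefix
    byte: int           # 3) the character byte itself
    last_sibling: bool  # 4) True if last node in a row of siblings

def children(trie, node):
    # siblings occupy consecutive slots, sorted by probability
    i = node.first_child
    while True:
        yield trie[i]
        if trie[i].last_sibling:
            return
        i += 1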

4 Language model

A language model is a prescription for assigning probabilities to input data (in this case a sequence of bytes). Here I present a straightforward algorithm using the trie as the underlying data structure. It is important to emphasize that this approach is only a first attempt at language modelling using the trie described above. Further improvements are to be implemented.

Each node in the trie contains the probability distribution of all following characters (bytes). The longer the path from the root to a node, the more accurate the probability distribution in the node (and also the lower its entropy and perplexity), since a longer path means a longer context.

The following describes the algorithm for computing the probability of any sequence of bytes. The algorithm starts from the first byte in the data and from the root of the trie. If the byte is among the root's children, the initial probability 1.0 is multiplied by the corresponding probability stored in the relevant child node. The algorithm stores the position of the child node and repeats the procedure for the next bytes in the input data. If a current byte is not found among the children at the current position, the current prefix (defined by the path from the root to the current position) is shortened from the left (by one byte) and the shortened path is translated to the root (the same character bytes but a different path). It holds that if a path (sequence of bytes) is found anywhere in the trie, then it must be translatable to the root of the trie; this is a property of the original suffix array. After the translation, the lookup is repeated. If the byte is still not found, the path is shortened from the left again and translated again until the byte is found among the descendants of the last node in the translated path. It may occur that the path is shortened to an empty path. In that case the procedure continues from the root as at the beginning of the algorithm. Every time a byte from the input data is found among the children of the current position, the overall probability is multiplied by the probability of the child.

It is necessary that the probability of any subsequence of the input data is greater than 0, otherwise the resulting probability would be zero too. In n-gram models this is achieved by smoothing the language models using techniques like Katz, Kneser-Ney or Good-Turing smoothing. In CBLM, the problem of zero probability (caused by symbols which appear in the input data but do not occur more than N×) is solved by assigning a probability pr to all unseen symbols. Probability pr is taken from the first level of the trie: the complement of the sum of probabilities of all child nodes of the root. For one English model trained on Orwell's 1984, pr = 0.000018, since some symbols (+, %) occurred less than N×.

The procedure described above can be slightly modified to obtain a random text generator (see Section 6). The bytes are not read from the input but generated randomly from the probability distributions stored in the nodes.
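The following minimal Python sketch mirrors the algorithm just described, with a nested dict of the form {byte: {"p": prob, "children": {...}}} standing in for the packed trie; the helper lookup and the parameter p_unseen (the probability pr above) are illustrative names, not part of the actual implementation.

def lookup(trie, path):
    # follow a byte path down from the root; return the children
    # mapping at its end, or None if the path is not in the trie
    node = trie
    for b in path:
        if b not in node:
            return None
        node = node[b]["children"]
    return node

def sequence_probability(trie, data, p_unseen):
    prob = 1.0
    path = b""  # current context; always translatable to the root
    for byte in data:
        while True:
            node = lookup(trie, path)
            if node is not None and byte in node:
                prob *= node[byte]["p"]
                path += bytes([byte])
                break
            if not path:          # empty context and byte still unseen
                prob *= p_unseen  # the probability pr from the text
                break
            path = path[1:]       # shorten the context from the left
    return prob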

5 Intrinsic evaluation: entropy per byte

The standard way to evaluate language models is to measure entropy per word. In the case of this model I use entropy H and perplexity PP per byte. The formulas for test data b_1 · · · b_N and model m are as follows:

H(m) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p(b_i)

PP = 2^{H(m)}

where N is the length of the test data and p(b_i) is the probability given by the algorithm. Table 2 shows results for English and Czech data. The models for English have been built from the British National Corpus (BNC) and from George Orwell's 1984. The test data were Lewis Carroll's Alice in Wonderland and George Orwell's 1984. The models for Czech have been built from the SYN2000 corpus, the Czech Wikipedia corpus (csWiki) and the Czech Web corpus czTenTen3 [6] (csWeb).

3 http://www.sketchengine.co.uk/documentation/wiki/Corpora/czTenTen2

Table 2: Evaluation of Czech and English models on Lewis Carroll's Alice in Wonderland and George Orwell's 1984.

Model    Test   Size   H      PP
BNC2     Alice  330 M  2.286  4.879
BNC3     Alice  212 M  2.294  4.904
BNC2     1984   330 M  1.664  3.170
BNC3     1984   212 M  1.671  3.184
SYN2000  1984   362 M  1.837  3.574
csWiki   1984   300 M  1.850  3.607
csWeb    1984   312 M  1.571  2.972
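To make the formulas concrete, a few lines of Python compute both measures from per-byte probabilities; the check that 2^1.571 ≈ 2.97 matches the csWeb row of Table 2.

import math

def entropy_per_byte(probs):
    # H(m) = -(1/N) * sum of log2 p(b_i) over the N test bytes
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(h):
    return 2.0 ** h  # PP = 2^H(m)

print(perplexity(1.571))  # ~2.97, cf. the csWeb row of Table 2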

Some observations from the table follow. The BNC2 model achieved only slightly better entropy and perplexity than BNC3 for both test data sets. Notable is also the fact that both models assign considerably higher entropy and perplexity to Alice. This is probably caused by Carroll's peculiar language. For comparison, the entropy of English has been estimated at 1.75 [7]. The best Czech model is csWeb, which obtained an entropy of 1.571 for Orwell's 1984. The performance of the Czech and English models is quite stable. When the Markov model N parameter is fixed, performance (perplexity) differs substantially when languages from different language families are evaluated. See also the comparable performance of the Hungarian and English random generator in Section 6. Further intrinsic evaluation is to be carried out using a standard statistical language modelling benchmark, e.g. the one billion word benchmark [8].

6 Extrinsic evaluation: random sentence generator

It has been reported that the language model perplexity measure does not correlate well with evaluations of the applications in which the model is used (e.g. [9]). That is why some researchers prefer extrinsic evaluation methods over intrinsic measures. To provide an extrinsic evaluation of the presented model, I have developed a simple random text generator based on CBLM.4 It offers models for English, Georgian, Hungarian, Japanese, Russian, Turkish, Slovak, Czech and Latin. Users may save generated sentences (Like link), which are then available as favourite sentences at the bottom of the web page.

The algorithm is a modification of the language model algorithm. It generates a random stream of bytes, and whenever it generates a second newline character, the result is written to the output. By using the sequence of bytes between two newlines, the generator is capable of generating texts roughly on the sentence level.

4 http://corpora.fi.muni.cz/cblm/generate.cgi
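A simplified version of the generator loop could look as follows (reusing the lookup helper from the earlier sketch); for brevity it stops at the first generated newline instead of collecting the bytes between two newlines.

import random

def generate_sentence(trie, max_len=500):
    out, path = bytearray(), b""
    while len(out) < max_len:
        node = lookup(trie, path)
        while not node:           # back off until some context matches
            path = path[1:]
            node = lookup(trie, path)
        symbols = list(node.items())
        byte = random.choices([b for b, _ in symbols],
                              weights=[c["p"] for _, c in symbols])[0]
        if byte == ord("\n"):     # a newline ends the sentence
            break
        out.append(byte)
        path += bytes([byte])
    return bytes(out)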

To increase legibility, initial letters and named entities in the following example sentences have been upper-cased. The source data for the examples were as follows. English: British National Corpus; Czech: czTenTen (first 600 M tokens); Hungarian: Hungarian Wikipedia; Latin: a Latin corpus; Slovak: Slovak Wikipedia. In almost all cases N = 3.

English First there is the fact that he was listening to the sound of the shot and killed in the end a precise answer to the control of the common ancestor of the modern city of Katherine street, and when the final result may be the structure of conservative politics; and they were standing in the corner of the room.

Czech Pornoherečka Sharon Stone se nachází v blízkosti lesa. ¶ Máme malý byt, tak jsem tu zase. ¶ Změna je život a tak by nás nevolili. ¶ Petrovi se to začalo projevovat na veřejnosti. ¶ Vojáci byli po zásluze odměněni pohledem na tvorbu mléka. ¶ Graf znázorňuje utrpení Kristovo, jež mělo splňovat následující kritéria.

Hungarian Az egyesület székhelye: 100 m-es uszonyos gyorsúszásban a következő években is részt vettek a díjat az égre nézve szójaszármazékot. ¶ Az oldal az első lépés a tengeri akvarisztikával foglalkozó szakemberek számára is ideális szállás költsége a vevőt terhelik.

Slovak Jeho dizajn je v zrelom veku, a to najmä v prípade nutnosti starala sa o pomoc na rozdiel od mesta prechádza hranica grófstva Cork. ¶ V roku 2001 sa začala viac zameraný na obdobie 2006 prestúpil do monastiera pri rieke Marica, ktorú objavil W. Herschel 10. septembra 1785.

Latin Quinto autem anno andum civitates et per grin in multitudinem quae uerae e aeque ad omne bonum. ¶ Augustinus: video et oratteantur in caelum clamor eus adiutor meus, et uit, quam hoc crimen enim contentit et a debent.

7 Future work & conclusion

I have described a first approximation of the language model. In the future I want to further exploit some properties of the trie model, e.g. the probabilities of sequences which follow a given sequence (not necessarily immediately). This would allow expressing connections (associations) between any two byte sequences and capturing a broader context.

Acknowledgement This work was partially supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. Bilmes, J.A., Kirchhoff, K.: Factored language models and generalized parallel backoff. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003, short papers, Volume 2, Association for Computational Linguistics (2003) 4–6
2. Shannon, C.: A mathematical theory of communication. Bell Sys. Tech. J. 27 (1948) 379–423
3. Creutz, M., Hirsimäki, T., Kurimo, M., Puurula, A., Pylkkönen, J., Siivola, V., Varjokallio, M., Arisoy, E., Saraçlar, M., Stolcke, A.: Morph-based speech recognition and modeling of out-of-vocabulary words across languages. ACM Transactions on Speech and Language Processing (TSLP) 5(1) (2007) 3
4. Niesler, T.R., Woodland, P.: A variable-length category-based n-gram language model. In: Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on. Volume 1., IEEE (1996) 164–167
5. Zhang, Y., Vogel, S.: Suffix array and its applications in empirical natural language processing. Technical Report CMU-LTI-06-010, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2006)
6. Suchomel, V.: Recent Czech web corpora. In: 6th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno
7. Brown, P.F., Pietra, V.J.D., Mercer, R.L., Pietra, S.A.D., Lai, J.C.: An estimate of an upper bound for the entropy of English. Comput. Linguist. 18(1) (March 1992) 31–40
8. Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P.: One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005 (2013)
9. Iyer, R., Ostendorf, M., Meteer, M.: Analyzing and predicting language model improvements. In: Automatic Speech Recognition and Understanding (1997) 254–261

A System for Predictive Writing

Zuzana Nevěřilová and Barbora Ulipová

Computational Linguistics Centre Faculty of Arts, Masaryk University Arne Nováka 1, 602 00 Brno, Czech Republic [email protected], [email protected]

Abstract. Most predictive writing systems are based on n-gram models of various sizes. Systems designed for English are easier to build than those for inflected languages, since even smaller models allow reasonable coverage; the same corpus size, however, is significantly insufficient for languages with many word forms. The paper presents a new predictive writing system based on n-grams calculated from a large corpus. We designed a high-performance server-side script that returns either the most probable endings of a word or the most probable following words. We also designed a client-side script that is suitable for desktop computers without touchscreens. We calculated the 150 million most frequent n-grams for n = 1, . . . , 12 from a Czech corpus and evaluated the writing system on Czech texts. The system was then extended by a custom-built model that can consist of domain- or user-specific n-grams. We measured the key strokes per character (KSPC) rate in two different modes: one – called letter KSPC – excludes the control keys since they are input-method specific; the other – called real KSPC – includes all key strokes. We have shown that the system performs well in general (letter KSPC was on average 0.64, real KSPC on average 0.77) but performs even better on specific domains with the appropriate custom-built model (letter KSPC and real KSPC were on average 0.63 and 0.73 respectively). The system was tested on Czech; however, it can easily be adapted to an arbitrary language. Due to its performance, the system is suitable for languages with high inflection.

Keywords: predictive writing, n-gram language model, corpus, KSPC

1 Introduction

Predictive writing is a useful and popular feature embedded in modern web browsers and cell phones where either endings of an unfinished word or a following word are suggested to the user. Predictive writing is used to reduce the number of key strokes and prevent spelling errors. Some users may even use it to find the correct spelling of words they are not sure about. For these reasons, predictive writing applications are not only useful for cell phones and tablets but can also support writing on conventional keyboards.


We present a predictive writing web application that is suitable for devices with conventional keyboards. The application is based on the client-server architecture and thus can benefit from server performance: the language model can be much larger (and thus more precise) than language models that have to use the limited memory space of mobile devices. This paper is organized as follows: Section 2 briefly overviews related work; in Section 3 we present our system and its extension with a custom-built model; in Section 4 we describe the settings of the evaluation experiment and the experiment itself; Section 5 discusses the results; and in Section 6 we propose future work.

2 Related Work

Prediction is well established in many tasks, e.g. in web browsers and search engines: a web browser usually keeps the search history, and search engines use data obtained from their users as well; e.g. Google uses a combination of phrases from the search history of a particular user and the overall search history. In addition, it adds information from Google+ profiles.1 The aim of this prediction is spell checking and reducing the number of characters the user has to type.

Historically, predictive writing became popular on cell phones with numeric keyboards. Users became accustomed to cluster keyboards (i.e. keyboards with highly ambiguous keys) such as Tegic T9 or Motorola iTap. For example, [1] used vocabularies of 10k, 40k, 160k and 463k words (i.e. unigrams) together with up to 5 million n-grams (for n = 2, 3). The authors concluded that n-gram models are more robust than unigram models (such as T9).

Predictive keyboards on today's mobile devices serve a slightly different purpose: the typing error rate on software keyboards is higher than on hardware keyboards. [2] reported an "average number of errors on software keyboards 4.55%, and the average number of hardware was 1.36%". It is thus more comfortable when users do not have to type much. [3] proposes a simple measure to characterize text entry techniques: key strokes per character (KSPC) "is the number of key strokes required, on average, to generate a character of text for a given text entry technique in a given language". The exact calculation of KSPC depends on both hardware (i.e. it is different on touchscreen keyboards, hardware keyboards, or stylus tapping) and software (i.e. how many characters the user has to tap before the desired word is available to select). For this reason, we calculated KSPC in two ways: excluding the control keys (the arrows) and including all keys. In this paper, we call these measures letter KSPC and real KSPC respectively. The letter KSPC includes the tab key since it is the key stroke that leads to the selection of a particular word. Since the cited text input techniques were primarily developed for English, less attention is paid to non-English texts. For example, [4, p. 9] only state

1 https://support.google.com/websearch/answer/106230?hl=en

that entry of characters not present in the English alphabet increases the number of key strokes required. [4] have shown that n-gram based models for statistical prediction are less reliable for inflected languages but “still offer quite reasonable predictive power”.

3 The Predictive Writing System

3.1 Server-side design

We implemented a predictive writing server-side script that benefits from n-grams calculated from the czTenTen corpus. This corpus is currently one of the largest corpora for Czech.2 The data were collected from different websites, cleaned and deduplicated by the corpora tools [5]. Since the data source contains mainly texts that were not proof-read, the n-grams do not necessarily contain only correct Czech. This issue can be serious when using predictive writing for spelling purposes.

We calculated bigrams using the lscbgr tool and filtered out all bigrams with minimum sensitivity < 0.0001. We calculated n-grams for n = 3, . . . , 12 using the lscngr tool, then we filtered out all n-grams for which the frequency of the respective (n−1)-gram representing the n-gram without its last token (freq_{n−1}) was less than 10. For each such n-gram, we then calculated the score as n · freq_{n−1} in order to prefer longer n-grams. For unigrams, we used the lsclex tool3 and we took all tokens with frequency greater than one and length smaller than 30 characters. All the mentioned parameters were set by experiments. The aim was to generate a database of n-grams (for n = 1, . . . , 12) with a plausible coverage of texts but at the same time with a reasonable size. We always took case-sensitive words, numbers, and punctuation as tokens.

The predictive writing server-side script works basically in two modes (see the sketch after the footnotes below):

1. If the input ends with a space, it suggests the following words, i.e. it returns the first 10 most frequent n-grams. The input is truncated to the last 12 tokens and compared to the n-gram database. We strongly prefer longer n-grams, so the output is sorted by n-gram size and then by the n-gram score.
2. If the input does not end with a space, it suggests possible endings of the last word. The calculation is based on the previous 11 tokens and the unfinished token.

The server-side script functionality is quite simple, but for performance reasons the n-grams were stored in finite state automata (FSA) using the fsa package.4

2 In Nov 2014, it contained 5,069,447,935 tokens and 4,175,089,440 words, see https://ske.fi.muni.cz/auth/corpora/
3 All tools are documented in the project wiki: http://www.sketchengine.co.uk/documentation/wiki/SkE/NGrams
4 http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa
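The sketch announced above, with a plain Python dict standing in for the FSA-backed n-gram database; the 10-suggestion limit, the 12-token window and the preference for longer contexts follow the description, everything else is illustrative.

def suggest(ngrams, text, limit=10):
    # `ngrams` maps a context tuple to continuations already sorted
    # by n-gram size and score; the real system stores them in FSAs
    tokens = text.split()
    if text.endswith(" "):
        context, prefix = tokens[-12:], ""            # mode 1: next word
    else:
        context, prefix = tokens[-12:-1], tokens[-1]  # mode 2: word ending
    suggestions = []
    for start in range(len(context) + 1):             # longest context first
        for word in ngrams.get(tuple(context[start:]), []):
            if word.startswith(prefix) and word not in suggestions:
                suggestions.append(word)
                if len(suggestions) == limit:
                    return suggestions
    return suggestions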

3.2 Client-side Design

The client side provides a text area for writing and a select box with suggestions. Unlike on touchscreens, the select box is controlled by the up/down arrows and the tab key for selecting the desired word. We reduced the number of suggestions to 6 since users do not usually find it productive to read more of them. A screenshot is presented in Figure 1. The client side also trims spaces before punctuation.

Fig. 1: Screenshot of the basic client

We strongly benefit from asynchronous JavaScript (AJAX) in order to call the server-side script after each key stroke. This procedure is quite demanding on the server-side script performance.

3.3 Extension to a custom-built model

Our next goal was to improve the system by tailoring it to individual users or domains. This was done by allowing the user to upload a file with user-specific (or domain-specific) texts and then adjusting the system so that it suggests primarily the n-grams taken from the user files: we call these n-grams the custom-built model.

3.4 Discussion on the vocabulary size

Czech is a highly inflected language, thus it has many more word forms than e.g. English. The size of the n-gram vocabulary has to be considerably bigger. For example, [6] argue that German corpora have to be 4 times larger than English ones in order to keep the same theoretical minimum error rate in speech recognition. For Hungarian, the corpus size must be 20 times bigger compared to an English corpus.

From the czTenTen corpus, we extracted n-grams for n = 1, ..., 12. Table 1 shows the number of n-grams for each n.

Table 1: Number of n-grams with respect to different n

    n  number of n-grams
    1          8,312,152
    2         68,217,654
    3         38,781,821
    4         22,163,068
    5          8,554,953
    6          2,798,732
    7            849,500
    8            267,789
    9            100,753
   10             50,674
   11             31,152
   12             21,140
total        150,149,388

3.5 Building custom language models

To create a file with user n-grams, we first tokenize the user file, which is expected to be in plain text format. We use the unitok tool5. The corpus data are already tokenized so they do not have to go through this process. To find unigrams and their frequency distribution, we used a bash pipeline:

cat inputfilename | /corpora/programy/unitok.py -n | sort \
    | uniq -c | sort -r -n | awk '{print $2 ":" $1}' > unigrams

This command simply counts the number of occurrences of each word.

3.6 N-gram weighting

To find longer n-grams, we implemented a function which uses the above-mentioned unitok tool for tokenization and functions from the Python nltk toolkit6 for creating the n-grams and counting the numbers of their occurrences. Since we do not expect the user data to be very large, we decided to only count n-grams of the length of 2 to 4 tokens. Longer n-grams are not likely to occur more than once in a short text. Also, we did not want their frequency

5 https://www.sketchengine.co.uk/documentation/wiki/Website/LanguageResourcesAndTools
6 http://www.nltk.org

distribution to be the only criterion for sorting, because longer n-grams are much less frequent than short ones but at the same time they are much more interesting for our purposes. We decided to compute the n-gram score according to the following formula:

Score = FrequencyDistribution × n^4,

where n is the number of tokens in the n-gram. Then we sort the n-grams according to this score. User n-grams are stored in text files; the server-side script uses the sgrep utility7 for binary search in the text files.
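A compact sketch of this weighting, assuming the tokens have already been produced by unitok; nltk.util.ngrams is the toolkit function used for creating the n-grams, while the function name below is ours.

from collections import Counter
from nltk.util import ngrams

def score_user_ngrams(tokens):
    # Score = FrequencyDistribution * n^4 for n = 2, 3, 4
    scored = {}
    for n in range(2, 5):
        for gram, freq in Counter(ngrams(tokens, n)).items():
            scored[gram] = freq * n ** 4
    # return the n-grams sorted by their score, highest first
    return sorted(scored, key=scored.get, reverse=True)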

4 Evaluation

Users write the text in a text area and control the prediction via the up/down arrows and the tab key. If they see the desired word in the selection box, they choose it using the arrow keys and then select it with the tab key. We evaluated the application on several texts, measuring both the letter KSPC and the real KSPC (see Section 2). Letter KSPC reflects only the performance of the language model, while real KSPC also reflects the input method implementation. Corrections, deletions, and clipboard operations were not included in either measure.
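Expressed as code, the two measures are simple ratios; the function and argument names below are ours, introduced only to make the definitions concrete.

def letter_kspc(letter_and_tab_strokes, characters):
    # excludes the control arrows, includes the tab key that selects a word
    return letter_and_tab_strokes / characters

def real_kspc(all_strokes, characters):
    # every key stroke counts, control arrows included
    return all_strokes / characters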

4.1 Experiment

We measured KSPC on four texts that are not present in the corpus: the average letter KSPC was 0.64 and the real KSPC was 0.77. Afterwards, we measured the KSPC according to a particular writing style. We used two completely different custom-built models: contract templates and Václav Havel's speeches. From the former source, we obtained 20k unigrams and 386k n-grams for n = 2, 3, 4. From the latter source, we obtained 41k unigrams and over a million n-grams for n = 2, 3, 4. We then typed in a paragraph from a mortgage contract (480 characters without spaces), a contract of sale (362 characters), a part of Václav Havel's speech (524 characters) and a paragraph from Václav Havel's essay (470 characters), both with the use of the custom-built models and without them (so the tool only uses data from the general corpus). The paragraphs are long enough that the KSPC is not much influenced by a few out-of-vocabulary words. The results are shown in Table 2.

5 Results and Observations

After uploading the user data, both letter KSPC and real KSPC improved slightly. The measured KSPC shows that the tool is usable for writing arbitrary texts in Czech.

7 http://sgrep.sourceforge.net/

Table 2: Resulting KSPC on four texts

measure                              mortgage  contract  Havel's  Havel's
                                     contract  of sale   speech   essay
letter KSPC                            0.63      0.55      0.68     0.71
letter KSPC with custom-built model    0.62      0.54      0.67     0.70
real KSPC                              0.77      0.64      0.81     0.84
real KSPC with custom-built model      0.70      0.62      0.81     0.77

Our system is comparable with other prediction systems, e.g. WordTree [7], which reports KSPC = 0.71. Due to the n-gram resource – the web corpus – the model can contain non-standard n-grams, thus it is less suitable for spell checking.

The tool is more successful with contracts than with Václav Havel's texts. Legal texts have a much more predictable sentence structure, many commonly used phrases (resulting in n-grams with higher scores) and a vocabulary in which a lot of words are used very often and fewer words are used rarely. This is common with NLP tools, which are often more successful with expert domains than with general texts.

With Václav Havel's texts the custom-built model improves the result as a whole, but sometimes it actually makes the result worse. For example, without a custom-built model, after typing "já" (the pronoun I), the system suggests "bych" (would) because the strongest n-gram starting with "já" is "já bych chtěl" (I would like). With the model built from Havel's texts, the prediction system suggests "se", as Havel's strongest n-gram is "já se domnívám" (I assume). However, if we type only "D" without any custom-built model, the system suggests "Dobrý" (Good) in third place, while with the model built from Havel's texts, it does not suggest "Dobrý" at all. Therefore, it will not help when a speech starts with the very common greeting "Dobrý den" (Hello, literally Good day).

The program often suggests a word with the right base but a wrong ending. This is due to Czech being a highly inflected language. At the same time, the sentence parts with obligatory agreement need not be adjacent.

6 Conclusion and Future Work

We have built a prototype system for predictive writing and evaluated it on Czech. While predictive writing seems to be the domain of mobile devices, we found several benefits of predictive writing arising from the reduction of the number of key strokes: typing speed, correct spelling, natural collocations. These benefits can be arguable and have to be measured on real-world texts. We extended the n-gram model calculated from a general corpus with user-specific or domain-specific texts. We have shown that in some domains, predictive writing with the appropriate custom-built model can be even more effective.

Predictive writing has many applications and potential users, such as motor-impaired persons, users with dysgraphia, foreigners learning Czech, and touchscreen users. For this reason, we plan to improve the system and to measure its usefulness (expressed not only by the KSPC but also by typing speed and the number of errors) on real-world texts. In the near future, we plan to clean the n-grams in order to exclude n-grams with undoubtedly incorrect Czech and spelling errors. At the same time, we expect the KSPC to decrease when the system offers more than one next word. Other improvements comprise a shift from n-gram to grammar-based prediction and learning from the user input.

Acknowledgements This work has been partly supported by the Masaryk University within the project Čeština v jednotě synchronie a diachronie – 2014 (MUNI/A/0792/2013).

References

1. Klarlund, N., Riley, M.: Word n-grams for cluster keyboards. In: Proceedings of the 2003 EACL Workshop on Language Modeling for Text Entry Methods. TextEntry '03, Stroudsburg, PA, USA, Association for Computational Linguistics (2003) 51–58
2. Hasegawa, A., Yamazumi, T., Hasegawa, S., Miyao, M.: Evaluating the input of characters using software keyboards in a mobile learning environment: A comparison between software touchpanel devices and hardware keyboards. In: IEEE International Conference on Wireless, Mobile, and Ubiquitous Technology in Education (2012) 214–217
3. MacKenzie, I.: KSPC (keystrokes per character) as a characteristic of text entry techniques. In Paternò, F., ed.: Human Computer Interaction with Mobile Devices. Volume 2411 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2002) 195–210
4. Matiasek, J., Baroni, M., Trost, H.: FASTY – a multi-lingual approach to text prediction. In Miesenberger, K., Klaus, J., Zagler, W., eds.: Computers Helping People with Special Needs. Volume 2398 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2002) 243–250
5. Suchomel, V., Pomikálek, J.: Efficient web crawling for large text corpora. In Kilgarriff, A., Sharoff, S., eds.: Proceedings of the seventh Web as Corpus Workshop (WAC7), Lyon (2012) 39–43
6. Németh, G., Zainkó, C.: Word unit based multilingual comparative analysis of text corpora. In Dalsgaard, P., Lindberg, B., Benner, H., Tan, Z.H., eds.: INTERSPEECH, ISCA (2001) 2035–2038
7. Badr, G., Raynal, M.: WordTree: Results of a word prediction system presented thanks to a tree. In Stephanidis, C., ed.: Universal Access in Human-Computer Interaction. Applications and Services. Volume 5616 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2009) 463–471

One System to Solve Them All

Jan Rygl

NLP Centre, Faculty of Informatics, Masaryk University Czech Republic [email protected]

Abstract. People are daily confronted with hundreds of situations in which they could use the knowledge of stylometry. In this paper, I propose a universal system to solve these situations using stylometric features, machine learning techniques and natural language processing tools. The proposed tool can help translation companies recognize machine translation falsely submitted as the work of a human expert; identify school essays not written by the undersigned student; or cluster product reviews by author and merge user reviews written by one author using multiple accounts. All the examples above use the same techniques and procedures, therefore it is preferable to merge the algorithms and implementations of these tasks into a single framework.

Keywords: stylometry, machine learning

1 Introduction

People are daily confronted with hundreds of situations in which they could use the knowledge of stylometry. I will mention several pressing problems:

Purchase of school essays during the educational process: With the expansion of the Internet into the majority of households, the number of specialized web pages offering essays and diploma theses to order has increased rapidly. If the submitted work was published on the Internet, plagiarism detection methods can detect the fraud. Otherwise, stylometry techniques are needed to expose falsely signed works: the style of the author's previous works is compared to the style of the submitted work. If the style is different enough that it exceeds the limit defined for the diversity of one author, the system will notify the evaluators.

Registering with a false age or gender in dating advertisements, on discussion forums, or in Internet chats: Deception detection is the task of automatically classifying a text as being either truthful or deceptive according to the identity of the author, such as gender or age. In online social network communities it is easy to provide a false name, age, gender and location in order to hide a true identity, providing criminals such as pedophiles with new possibilities to harm people. Checking user profiles on the basis of text analysis can detect false profiles and flag them for monitoring [6].


Machine translation submitted as human expert translation: Translation companies hire human experts (translators) to translate texts. For some of the less common languages it is difficult to verify the quality of the translation, therefore human experts can be tempted to use machine translation tools to complete their tasks automatically. Stylometry techniques can distinguish between automatically translated text and text translated by a human expert.

False product reviews: During the last five years, the volume of Internet advertising in the Czech Republic doubled [11]. Internet shoppers are influenced by product reviews. The share of user reviews increases at the expense of the share of professional reviews. The number of products rises faster than the capacity of magazines aimed at user reviews. This situation leads to the fact that some companies engage in unfair trade practices and create fake product reviews: positive ones to improve the rating of their own goods, and negative ones to harm their competitors [9]. We can fight false reviews by recovering the true authorship of reviews, clustering user accounts by their true author, and detecting automatically generated reviews.

2 Stylometry

An author's style is defined as a set of measurable text features, according to stylostatisticians [8]. These features are called style markers. Word-length frequencies were used as the first style markers to detect the authorship of documents: T. C. Mendenhall discovered that the word-length frequency distribution tends to be consistent for one author and differs between authors (1887, [5]).

Style markers can be divided into categories, which can be defined by the properties of the texts that are used, or by the tools needed to extract the information. Usually, the following tool categories are used to implement stylometry techniques (examples of Czech tools are given):

1. Text cleaning (boiler-plate removal, HTML removal, etc.)
2. Language detection
3. Encoding detection (Chared1)
4. Text tokenization
5. Morphology analysis (Majka [10])
6. Syntactic analysis (SET [4])
7. Semantic analysis (entity detection, abbreviation expansion, etc.)

The number of categories based on extracted information is still growing, therefore only a few predominant examples are listed:

1. Word-class n-grams
2. Morphology tag n-grams
3. Word-length and sentence-length distribution
4. Typography errors

1 http://nlp.fi.muni.cz/projects/chared/

5. Punctuation usage
6. Subtrees from a tree generated by syntactic analysis

The quality and the utility of style markers depend on the type of problem. Different document lengths and tasks require different style markers, therefore it is recommended to experimentally select a subset of style markers rather than to use them all [7].

3 Machine learning

Machine learning techniques work with data instances. Each instance is an n-tuple of features; each feature represents one style marker. The instances are separated into two groups. The training group is labeled and contains information about the author, gender, age, etc., depending on the scenario. Training instances are used to create a machine learning classifier. The classifier is given unlabeled test instances and predicts their labels. The features are usually rational numbers, which are automatically normalized to the range ⟨0, 1⟩ or ⟨−1, 1⟩.

To solve the problems using stylometry techniques, two Support Vector Machine methods are recommended [3]:

– the SVM implementation LIBSVM [1]
– the linear SVM implementation LIBLINEAR [2]

The selected SVM-based techniques have several parameters which should be tuned for each data set. Grid search and other optimization techniques are used to find the best parameters on the training data set. Depending on what is used as a data instance, we can distinguish the two approaches described in Sections 3.1 and 3.2.

3.1 One model per label approach

For each label (labels can be ages, genders, author names, types of translation), one machine learning model is trained. Each model for a label L classifies whether a given document should be given the label L. The n most probable labels are selected for each document (n = 1 for the majority of tasks). The advantage of this approach is that it supports tasks with multi-labeled instances (e.g. a document written by several authors). The disadvantage is that it requires training instances for each label, therefore this approach cannot predict labels for test instances with unseen labels. The data instance is an n-tuple of style markers of one document.
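A sketch of this one-model-per-label scheme, using scikit-learn's LinearSVC (a wrapper around the recommended LIBLINEAR) as a stand-in classifier; X is a matrix of style-marker n-tuples and the helper names are illustrative.

from sklearn.svm import LinearSVC  # LIBLINEAR-backed linear SVM

def train_per_label(X, labels):
    # one binary model per label: "does this document carry label L?"
    models = {}
    for label in set(labels):
        y = [1 if l == label else 0 for l in labels]
        models[label] = LinearSVC().fit(X, y)
    return models

def most_probable_labels(models, x, n=1):
    # rank labels by each model's decision score for document x
    scores = {label: m.decision_function([x])[0]
              for label, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]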

3.2 Similarity approach

The similarity approach is used to compare two documents and predict the similarity between them. Given two documents A and B, style-marker n-tuples s(A) and s(B) are extracted. The inverse absolute difference of the style-marker n-tuples (the similarity) is computed:

(1 − |s(A)[1] − s(B)[1]|, 1 − |s(A)[2] − s(B)[2]|, . . . , 1 − |s(A)[n] − s(B)[n]|)

where s(A)[i] is the i-th item of s(A) and s(B)[j] is the j-th item of s(B). The data instance is a similarity n-tuple of two style-marker n-tuples. This approach can decide whether two documents have the same label even if the label is not present in the training instances.
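In code, the similarity n-tuple is a one-liner (assuming both style-marker n-tuples are already normalized to ⟨0, 1⟩):

def similarity_tuple(s_a, s_b):
    # inverse absolute difference, item by item
    return [1.0 - abs(a - b) for a, b in zip(s_a, s_b)]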

4 One system

Most of the previously mentioned techniques are common to the algorithms solving stylometric problems. Therefore, I propose a system schema which can be used to solve all the tasks with minimal effort. The schema for training a model consists of the following parts:

1. Annotating (each document is given a label)
2. Document processors (documents are cleaned and expanded to a collection of extracted information)
   (a) text cleaning (removing boiler-plate, HTML, . . . )
   (b) language detection
   (c) charset detection
   (d) tokenization
   (e) morphology analysis
   (f) syntactic analysis
   (g) semantic analysis (abbreviations, entities, . . . )
3. Style extraction (expanded documents are converted to feature n-tuples, where n is the number of style markers)
4. Similarity extraction (if we want to solve a task using the similarity approach, feature n-tuples of selected document pairs are compared and similarity n-tuples are computed)
5. Machine learning – training a model (each n-tuple has a label)
   (a) feature selection (selecting the best combination of style markers)
   (b) machine learning parameter selection
   (c) model creation

The schema for classification consists of the following parts (see Figure 1):

1. Document processors (see the previous list)
2. Style extraction (expanded documents are converted to feature n-tuples, where n is the number of style markers)
3. Similarity extraction (if we want to solve a task using the similarity approach, feature n-tuples of selected document pairs are compared and similarity n-tuples are computed)
4. Machine learning classification (each n-tuple is given a label)

Fig. 1: System schema: document classification. Documents pass through cleaning (removing boiler-plate and HTML), language and encoding detection, tokenization, morphological, syntactic and semantic analysis, and style marker n-tuple extraction; the resulting feature n-tuples (pairwise similarities in the similarity approach only) are normalized and passed to the classifier, which outputs labels.

4.1 Authorship of school essays

Input: 2-tuples (author, document).

Style extraction: Select style markers based on genres of documents.

Machine learning: Train a model with two labels ([author’s name], other) for each author. Author’s documents are used as instances with the label [author’s name], documents of other authors have the label other.

Classification: For each author's model, estimate the probability of each label. Check whether a document signed by author A has the label A with a probability higher than the probabilities of all other labels except the other label. If the author's probability is lower than the probability of some other author, notify the evaluators.

4.2 False product reviews

Input: 2-tuples (author, document).

Style extraction: Select style markers suitable for short texts.

Similarity extraction: Compare each two documents and extract similarities between them.

Machine learning: Train one model. Comparison of two documents of one author are given a label same_authors, pairs of documents signed by different authors are used as instances with a label different_authors.

Classification: Check whether pairs consisting of documents from two different authors are labeled as same_authors. If more than one document pair of two authors is classified with the same authorship, consider merging these authors.

4.3 Registering using a false age in dating advertisements

Input: 2-tuples (age, document).

Style extraction: Use all style markers.

Machine learning: Divide ages into several groups, each group is represented by one label (e.g. gradeschooler, teen, young_adult). Train one model using these labels.

Classification: Check whether the document is classified with the same label as it is annotated with. If the labels do not match, notify the system administrators.

4.4 Machine translation submitted as human expert translation

Input: 2-tuples (source, document).

Style extraction: Select style markers based on genres of documents.

Machine learning: Train one model. Machine translation instances are labeled as machine_translation, documents translated by human experts have the human_translation label.

Classification: Check whether the submitted translation is given the human_translation label. If the label does not match, have the translation evaluated by another human expert.

5 Conclusions and future work

I plan to implement a multilingual system according to the proposed schema. The system will use state-of-the-art libraries for machine learning techniques and text processing, and a wide range of stylometric features. Once implemented, all scenarios mentioned in this paper will be tested using this system and the results will be published.

Acknowledgements This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013.

References

1. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. URL: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
2. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June 2008.
3. Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol., 60:9–26, January 2009.
4. Vojtěch Kovář, Aleš Horák, and Miloš Jakubíček. Syntactic analysis using finite patterns: A new parsing system for Czech. In Human Language Technology. Challenges for Computer Science and Linguistics, Lecture Notes in Computer Science, pages 161–171, Berlin, 2011. Springer.
5. T. C. Mendenhall. The characteristic curves of composition. The Popular Science, 11:237–246, 1887.
6. Claudia Peersman, Walter Daelemans, and Leona Van Vaerenbergh. Predicting age and gender in online social networks.
7. Jan Rygl. Automatic adaptation of author's stylometric features to document types. In Text, Speech and Dialogue – 17th International Conference, pages 53–61, Brno, 2014. Springer.

8. Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4):471–495, Dec 2000.
9. Ben Verhoeven and Walter Daelemans. CLiPS Stylometry Investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3081–3085, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). ACL Anthology Identifier: L14-1001.
10. Pavel Šmerk. Fast morphological analysis of Czech. In Proceedings of Third Workshop on Recent Advances in Slavonic Natural Language Processing, pages 13–16, Brno, 2009. Tribun EU.
11. Association for Internet Advertising. ppm factum, Admosphere, Kantar Media, February 2014. URL: http://www.spir.cz/en/internet-advertising-exceeded-czk-13-billion-last-year-doubling-over-last-five-years

Improving Coverage of Translation Memories with Language Modelling

Vít Baisa, Josef Bušta, and Aleš Horák

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {xbaisa,xbusta1,hales}@fi.muni.cz

Abstract. In this paper, we describe and evaluate current improvements to methods for enlarging translation memories. In comparison with the previous results in 2013, we have achieved an improvement in coverage by almost 35 percentage points on the same test data. The basic subsegment splitting of the translation pairs is done using the Moses and (M)GIZA++ tools, which provide the subsegment translation probabilities. The obtained phrases are then combined with subsegment combination techniques and filtered by large target language models.

Keywords: translation memory, CAT, segment, subsegment leveraging, partial translation, Moses, GIZA++, word matrix, METEOR, MemoQ, language model

1 Introduction

Computer-aided translation (CAT) is becoming more and more popular: with state-of-the-art technologies such as subsegment leveraging, machine translation, or automatic terminology extraction, the translation process is faster and easier than ever before. CAT systems depend on translation memories: manually built databases of aligned source and target segments (phrases, sentences, paragraphs). They can be considered parallel corpora of very high quality (since they are prepared by professional translators) but of quite small size and coverage of new documents.

We describe current improvements of the methods for expanding translation memories which were described in the previous paper [1]. The goal of these methods is to increase the new-document coverage of a translation memory while preserving its high translational precision. There is also a commercial aspect to this research: the coverage analyses provided by CAT systems are usually used for estimating the amount of work needed for translating a given document (i.e. the price of the translation work). The higher the number of segments that can be pre-translated automatically, the lower the price of the translation work. That is why translation (and localization) companies aim at the highest coverage of their resources.


Fig. 1: Schema of the basic work flow: the input TM and document are processed by the expansion methods to produce TMexp.

2 Previous and Related Work

In the previous paper [1], we proposed several methods for enlarging translation memories and provided an evaluation for one of them. In this paper, we describe the improvements of the methods and evaluate all of them both on the original data used in the previous paper and on new data, the Directorate-General for Translation (DGT)1 [2] translation memory recently released by the European Commission. For related work, refer to [1].

3 Subsegment Processing Methods

In this section, we present the changes and improvements with respect to the previous paper [1] and a detailed description of the implemented techniques. The input for our methods is a translation memory and a document. We want to enlarge the TM (the expanded TM is denoted TMexp) so that it covers more segments in the document while preserving the quality of the translations, see Figure 1.

3.1 Method A: Subsegment Generation

Subsegments and the corresponding translations are generated directly from the TM using the Moses [3] tool; no additional data are used. The word alignment is based on MGIZA++ [4] (a parallel version of GIZA++ [5]) and the default Moses heuristic grow-diag-final.2 The next steps are phrase extraction and scoring [3]. The corresponding partially expanded TM is denoted TMsub. The output of the subsegment generation has the following format:

Subsegment      Translation   Probabilities               Alignment points
nejlepší uhlí   best coal     0.158, 0.142, 0.158, 0.69   0-0 1-1

The probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are many translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word "nejlepší" from the source language is translated to the first word in the translation "best" and the second word "uhlí" to the second word "coal." These points give us important information about the subsegment translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation. In Figure 2, an empty alignment is represented by an empty line or an empty row, a one-to-many alignment by a sequence of adjacent squares in a row or in a column, and the opposite orientation by a sequence of neighbouring squares on the secondary diagonal. The alignments are used to determine the correct positions in the subsegment translations.

Fig. 2: Word matrix for two aligned sentences / segments (a Czech sentence aligned with "if you were there you would know it now").

1 https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
2 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
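The subsegment output described above can be consumed programmatically. The following sketch assumes the standard "|||"-separated phrase-table format produced by Moses (the actual files may differ in layout) and shows how the four probabilities and the alignment points can be read, including a check for the opposite-orientation case:

# A sketch of reading subsegment pairs, assuming the standard Moses
# phrase-table format: source ||| target ||| p1 p2 p3 p4 ||| alignment
def read_subsegments(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            source, target = fields[0], fields[1]
            probs = [float(p) for p in fields[2].split()]
            # alignment points such as "0-0 1-1": source index - target index
            alignment = [tuple(map(int, point.split("-")))
                         for point in fields[3].split()]
            yield source, target, probs, alignment

def has_opposite_orientation(alignment):
    # Crossing links: target indices decrease when sorted by source index.
    targets = [t for _, t in sorted(alignment)]
    return any(t1 > t2 for t1, t2 in zip(targets, targets[1:]))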

3.2 Method B: Subsegment Combination

The subsegment translation pairs obtained by method A are used as a pool of candidate subsegments for the next method, which generates longer subsegments – in the ideal case, a new translation pair covering a whole, previously uncovered segment in the input document, a so-called 100% match. Currently, the sub-methods join and substitute are proposed for subsegment combination, each of them in an overlapping and a non-overlapping variant:

1. JOIN: new segments are built by concatenating two segments from TMsub, denoted TMJ.
   (a) JOINO: joined subsegments overlap in a segment from the document, denoted TMOJ.

Table 1: SUBSTITUTEO, example for Czech → English.
new subsegment:     Provozovatelé musí dodržovat zvláštní pravidla pro výzkumné
its translation:    Operators shall comply with the special rules on research
from subsegments:   Provozovatelé musí vytvářet zvláštní pravidla pro výzkumné | musí dodržovat zvláštní
their translations: Operators shall create the special rules on research | shall comply with the special

   (b) JOINN: joined subsegments neighbour in a segment from the document, denoted TMNJ.
2. SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from TMsub, denoted TMS.
   (a) SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1, denoted TMOS.
   (b) SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment, denoted TMNS.

During the non-overlapping subsegment combination, any two subsegments are combined regardless of fluency and context. That is why we need to evaluate the quality of the combination. For the quality measurement, we have trained a language model using the KenLM [6] tool on the first 50 million sentences of enTenTen [7], with the model order set to 5. The translation quality of the SUBSTITUTE operation can be improved by substituting a particular part-of-speech (noun, adjective, ...) only for the same part-of-speech, or a noun phrase for a noun phrase.
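As a concrete illustration of the language-model filtering step, here is a minimal sketch using the kenlm Python bindings; the model path ententen.bin is a placeholder for a 5-gram model trained as described above.

# A sketch of the LM filtering step (assumptions: the kenlm Python
# bindings are installed; "ententen.bin" is a placeholder path to the
# 5-gram model trained on the first 50M enTenTen sentences).
import kenlm

model = kenlm.Model("ententen.bin")

def best_combination(candidates):
    """Pick the combined target-side segment the LM finds most fluent.

    Scores are length-normalised so that longer candidates are not
    penalised merely for containing more words.
    """
    return max(candidates,
               key=lambda s: model.score(s, bos=True, eos=True)
                             / max(len(s.split()), 1))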

Algorithm 1: JOIN subsegments
Data: Segment S from the document; list I of indexes (i, j) of subsegments occurring in S, sorted in decreasing order by the difference j − i
Result: R
while I ≠ ∅ do
    (i, j) ← First(I);
    I ← I − (i, j);
    T ← ∅;
    for (k, l) ∈ I do
        if (k < i ∧ l + 1 ≥ i ∧ j > l) ∨ (i < k ∧ j + 1 ≥ k ∧ l > j) then
            T ← T + (Min(k, i), Max(l, j));
            R ← R + (Min(k, i), Max(l, j));
            if (Min(k, i), Max(l, j)) = (0, Length(S)) then
                return R;
    I ← T + I;
return R;

In [1], the operation JOIN was implemented only for non-overlapping subsegments and as a concatenation of any two subsegments. In this paper, we present the improved Algorithm 1. The algorithm works with indexes which represent the subsegment positions in the tokenized segment from the input document. The processing starts with the longest subsegment in the segment and then tries to join it with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, T is prepended to I and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. Algorithm 1 prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment. See Section 4 for the evaluation of this new approach.
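For readers who prefer an executable form, a direct Python rendering of Algorithm 1 might look as follows; it assumes the subsegment occurrences are given as (i, j) token-index pairs within the segment, pre-sorted by decreasing length.

# A sketch of Algorithm 1 in Python (assumption: spans are (i, j) token
# indexes within the segment, with (0, segment_length) covering it all).
def join_subsegments(segment_length, spans):
    """spans: (i, j) pairs sorted in decreasing order of j - i."""
    I = list(spans)
    R = []
    while I:
        i, j = I.pop(0)
        T = []
        for (k, l) in I:
            # Join spans that overlap or are immediate neighbours.
            if (k < i and l + 1 >= i and j > l) or \
               (i < k and j + 1 >= k and l > j):
                joined = (min(k, i), max(l, j))
                T.append(joined)
                R.append(joined)
                if joined == (0, segment_length):
                    return R  # the whole segment is covered
        I = T + I
    return R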

4 Evaluation

For the evaluation of the current implementation of the TM-expanding methods, we have used the same translation memory TMs and the same example document Ds as in [1]. Both data files have been provided by one of the biggest Czech translation companies.

Table 2: MemoQ analysis for TMs.

               TM                          TMsub                        TMNS
Match     Seg  wrds   chars    %     Seg   wrds   chars     %     Seg    wrds   chars     %
100%       23   128     813  0.4     165    178     611  0.51       0       0       0     0
95–99%     45   185   1,130  0.5     193    245   1,578  0.70      20      43     273  0.12
85–94%      4    21     155  0.1      19     50     325  0.14      18      78     451  0.22
75–84%     42   208   1,305  0.6      96    310   1,888  0.88     129     436   2,677  1.24
50–74%    462 1,689  10,293  4.8     789  4,543  27,999 12.93   1,681  12,522  75,108 35.65
≥75%      114   542   3,403  1.6     473    783   4,402  2.23     167     557   3,401  1.58
any       576 2,231  13,696  6.4   1,262  5,326  32,401 15.16   1,848  13,079  78,509 37.23

               TMOJ                        TMNJ                         TMall
Match     Seg   wrds   chars     %    Seg    wrds   chars     %    Seg    wrds    chars     %
100%        6     23     106  0.07      4      19     101  0.05    182     302    1,360  0.86
95–99%     11     60     310  0.17     13      87     466  0.25    232     465    2,858  1.32
85–94%      5     33     217  0.09     17     149     892  0.42     41     221    1,382  0.63
75–84%     68    314   1,809  0.89    110     881   5,022  2.51    265   1,475    8,655  4.20
50–74%  1,153  7,667  45,641 21.83  1,354  11,997  70,730 34.15  1,507  15,324   92,158 43.62
≥75%       90    430   2,442  1.22    144   1,136   6,481  3.23    720   2,463   14,255  7.01
any     1,243  8,097  48,083 23.05  1,498  13,133  77,211 37.38  2,227  17,787  106,413 50.63

Table 3: MemoQ analysis for DGT-TM.

               TM                            TMsub                         TMNS
Match     Seg   wrds    chars     %    Seg    wrds    chars     %    Seg    wrds    chars     %
100%       31     59      639  0.03    276     457    2,666  0.25     58     260      953  0.45
95–99%    198    546    1,941  0.30    225     446    1,998  0.24    206     827    2,992  0.45
85–94%     43    169      986  0.09    208     971    4,205  0.53     94     492    2,187  0.27
75–84%    357  1,745    8,021  0.96    386   1,714    9,115  0.94    287   1,492    7,102  0.82
50–74%  2,580 20,778  126,273 11.37  2,907  22,736  141,526 12.45  3,348  29,549  182,667 16.18
≥75%      629  2,519   11,587  1.38  1,095   3,588   17,984  1.96    645   3,071   13,234  1.99
any     3,209 23,297  137,860 12.75  4,002  26,324  159,510 14.41  3,993  32,620  195,901 17.86

               TMOJ                          TMNJ                          TMall
Match     Seg   wrds    chars     %    Seg    wrds    chars     %    Seg    wrds    chars     %
100%       38    187      764  0.10     29     161      683  0.09    358     838    4,172  0.46
95–99%    195    770    2,752  0.42     69     247      769  0.14    338     990    4,282  0.54
85–94%    124    695    3,198  0.38    203   1,107    4,892  0.61    133     666    3,750  0.36
75–84%    256  1,634    7,764  0.89    287   2,133   10,331  1.17    537   3,231   17,340  1.77
50–74%  3,220 32,325  200,667 17.70  3,673  47,715  298,031 26.12  4,183  53,791  343,699 29.45
≥75%      613  3,286   14,478  1.79    588   3,648   16,675  2.01  1,366   5,725   29,544  3.13
any     3,833 35,611  215,145 19.49  4,261  51,363  314,706 28.13  5,549  59,516  373,243 32.58

The evaluation results have been obtained directly from the pre-translation analysis of the MemoQ3 system. The statistics express how many segments from the document Ds can be translated automatically using the TM-expanding methods. The automatic translation is done on the segment level and also on the lower levels of subsegments. The partial matches are expressed as the match percentages in the table. A 100% match corresponds to the situation when a whole segment from Ds can be translated using a segment from the respective translation memory (either the original one or a memory obtained by each particular sub-method). Translations of shorter parts of the segment are then matches lower than 100%. The columns in Tables 2 and 3 are: Match: the type of match between the TM and Ds; Seg: the number of segments identified in Ds; wrds: the number of source words covered (translatable) by the TM; chars: the number of source characters; and the percent sign: the percentage of coverage for the type of match in the first column.

In the evaluation process, we have first tested the translation on a document with 4,563 segments (35,142 words and 211,407 characters), see Table 2. For an independent comparison, we also present our results for the DGT translation memory [2]. For the evaluation using DGT, we have used 330,626 pairs from the 2014 release and evaluated them on 10,000 randomly chosen segments from the same release. Duplicate pairs were removed before the evaluation. See Table 3 for the results.

3 http://kilgray.com/products/memoq

Table 4: Analysis of the dependence between subsegment length and document coverage.

             TMs                                  DGT-TM
Length  TMsub  TMOJ  TMNJ  TMNS  TMall    TMsub  TMOJ  TMNJ  TMNS  TMall
≥ 1       85%   11%   15%   25%   85%       95%   57%   71%   65%   95%
≥ 2       35%   11%   15%   25%   44%       78%   57%   71%   65%   85%
≥ 3        7%   11%   15%   25%   32%       53%   57%   71%   65%   82%
≥ 4        1%    5%   15%    9%   16%       35%   52%   71%   53%   74%
≥ 5        0%    2%    8%    2%    7%       23%   45%   66%   38%   65%

Table 5: Translation quality (METEOR score) for 100% matches.

               TMs                                   DGT-TM
feature    TMsub  TMOJ  TMNJ  TMNS  TMall    TMsub  TMOJ  TMNJ  TMNS  TMall
precision   0.60  0.63  0.70  0.66  0.61      0.76  0.93  0.91  0.81  0.80
recall      0.67  0.74  0.74  0.71  0.68      0.78  0.86  0.88  0.85  0.81
f1          0.64  0.68  0.72  0.68  0.64      0.77  0.89  0.89  0.83  0.81
METEOR      0.31  0.37  0.38  0.38  0.31      0.40  0.50  0.51  0.45  0.43

We have also counted the coverage of the document with respect to the length of subsegments, see Table 4. Notice that the longer subsegments are created by subsegment combination.

The METEOR [8] metric was used to evaluate the quality (precision) of the proposed translated segments. We provide statistics for all implemented methods on both test data sets, see Table 5. The METEOR evaluation metric was proposed to evaluate MT systems and therefore assumes fully translated segments (pairs). That is why we evaluate only 100% matches, since it is not straightforward to interpret METEOR scores for partially translated candidate sentences.

We have analysed the problematic cases regarding precision. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same sequential text order in the target language, see Table 6. We assume that such errors will be less frequent with a larger input translation memory, which will offer a higher ratio of overlapping (contextual) segments.
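For reference, the 100% matches can be scored along the following lines; this sketch uses NLTK's reimplementation of METEOR (assuming the WordNet data is installed), whereas the numbers in Table 5 come from the original METEOR tool [8], so absolute values may differ slightly.

# A sketch of METEOR scoring for 100% matches (assumptions: NLTK with
# WordNet data installed; the paper's numbers come from the original
# METEOR tool, so scores may not be identical).
from nltk.translate.meteor_score import meteor_score

def average_meteor(pairs):
    """pairs: (reference, candidate) translation strings."""
    scores = [meteor_score([ref.split()], cand.split())
              for ref, cand in pairs]
    return sum(scores) / len(scores)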

Table 6: Non-overlapping JOIN error example, Czech → English.
segment:            Prémie na bramborový škrob
reference:          Potato starch premium
new subsegment:     Prémie na bramborový škrob
its translation:    Premiums potato starch
from subsegments:   Prémie na | bramborový škrob
their translations: Premiums | potato starch

5 Conclusions

We have shown that the originally proposed methods can be further improved, and we have provided an evaluation which shows that the coverage of all matches has been increased by 34.5 percentage points (from 16.15% reported in [1] to 50.63%). As for the 100% matches, which are the most important, the test results show an increase of 0.5 percentage points when comparing the original TM with the combination of both JOIN methods (from 0.4% reported earlier to 0.86%), and the coverage of ≥75% matches increased by 5.4 percentage points (from 1.6% to 7%). The translational quality of the resulting new segments is kept at a high level, as shown by METEOR scores of up to 0.51 for the evaluation with the translation memory of the Directorate-General for Translation (DGT) of the European Commission.

Acknowledgements The work has been partly supported by the OP VaVpI project No CZ.1.05/3.1.00/10.0216.

References

1. Baisa, V., Bušta, J., Horák, A.: Expanding translation memories: Proposal and evaluation of several methods. In: RASLAN 2013 Recent Advances in Slavonic Natural Language Processing (2013) 71
2. Steinberger, R., Eisele, A., Klocek, S., Pilos, S., Schlüter, P.: DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226 (2013)
3. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics (2007) 177–180
4. Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, Association for Computational Linguistics (2008) 49–57
5. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1) (2003) 19–51
6. Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, Association for Computational Linguistics (2011) 187–197
7. Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V., et al.: The TenTen corpus family. In: Proc. Int. Conf. on . (2013) http://www.sketchengine.co.uk/documentation/wiki/Corpora/enTenTen
8. Denkowski, M., Lavie, A.: Meteor Universal: Language specific translation evaluation for any target language. In: Proceedings of the EACL 2014 Workshop on Statistical Machine Translation (2014)

Part II

Text Corpora

Optimization of Regular Expression Evaluation within the Manatee Corpus Management System

Miloš Jakubíček and Pavel Rychlý

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {jak,pary}@fi.muni.cz

Lexical Computing Ltd. Brighton, United Kingdom {milos.jakubicek,pavel.rychly}@sketchengine.co.uk

Abstract. This paper is concerned with searching large text corpora – electronic collections of texts. Often these are subject to queries specified by means of regular expressions. Such queries go beyond a simple keyword search that can be quickly evaluated using an inverted index; usually they are processed by third-party regular expression libraries and take significantly more time to evaluate. In this paper we present an index-based approach to the optimization of regular expression evaluation that we call n-gram prefetching. It is based on the assumption that most regular expression queries on text corpora contain at least some fixed string portions representing clues that can be used for developing heuristics that prune the number of potentially matching strings. The presented work has been designed and implemented within the Manatee corpus management system. We show that the proposed approach can significantly speed up regular expression processing by providing an evaluation on a test set of queries executed on a number of billion-word text corpora.

Keywords: regular expression, Manatee

1 Introduction

Text corpora represent a primary data resource for testing hypotheses, providing evidence and building large-scale statistical models used in various natural language processing applications such as part-of-speech tagging, parsing or machine translation. Special database management systems (in this case corpus management systems) devised for indexing and querying large text corpora have been developed to satisfy user needs in terms of the complexity of queries and the related response time, such as [1,2]. These corpus management systems usually leverage the idea of an inverted index (inverted text) [3] to provide fast access to all occurrences of a given word (or another attribute like lemma or tag, depending on the type of annotation) in the corpus.


In many cases users formulate queries not in the form of fixed string expressions but as regular expressions. Clearly this represents a serious challenge for corpus management systems, as the first step necessary to answer such queries lies in evaluating the regular expression against the relevant lexicon used in the corpus so as to be able to retrieve matching strings from the indices. Usually, a third-party regular expression library is used for the actual matching, often providing expressive power going way beyond what regular expressions offer as a classical computational model.

In this paper we present an optimization of regular expression evaluation that we call n-gram prefetching. The approach has been implemented within the Manatee corpus management system [4] and exploits the PCRE regular expression library [5]; however, we claim that it is suitable for any inverted-index-based corpus management system and is in no way dependent on any particular regular expression library.

The structure of this paper is as follows: we first provide a brief overview of the Manatee corpus management system and its overall indexing machinery, then we present the optimization approach in detail, and finally we provide an evaluation on a number of billion-word corpora showing an up to 100-fold speedup.

2 Manatee

Manatee is a state-of-the-art corpus management system providing facilities for efficient indexing (compiling) and searching of billion-word-sized corpora [6]. Querying corpora indexed by Manatee is done using the Corpus Query Language (CQL, [7]). From a formal perspective, a corpus in Manatee consists of text data (called tokens or positions, each of which may be associated with a number of attributes such as word, tag or lemma, further referred to as positional attributes) and text metadata (called structures, each of which denotes a span in the corpus such as a document, paragraph or sentence, and may be associated with an arbitrary number of structure attributes denoting e.g. the author of a document, the date of creation etc.). Every positional or structural attribute possesses the following basic index structures:

– attribute lexicon providing efficient string↔ID mapping. Each unique attribute string value is assigned a unique numeric ID which is further used in all indices and for processing CQL queries, – attribute inverted (reversed) index providing sequential access to a sorted list of occurrences (corpus positions) of a given attribute ID, – attribute text storing the actual text of this corpus attribute, i.e. a sequence of attribute IDs in the order of occurrence in the corpus.

On a very abstract level, evaluating a CQL query consists of mapping the attribute strings given in the query to their IDs (using the lexicon index), retrieving the relevant positions from the inverted index, combining them according to the given CQL operators and displaying the results (e.g. in the form of a concordance). In the simplest case, considering a query looking for all occurrences of a single word in the corpus like [word="someword"], one yields the ID n of someword from the lexicon and then retrieves the sorted list of positions for n from the inverted index. Further CQL operations always rely on processing sorted streams (of corpus positions or attribute IDs).

In the case attribute constraints are given as regular expressions, the straightforward string-to-number mapping using the lexicon is not possible. First one needs to find out which attribute values match the given regular expression (by scanning the whole lexicon), then map each of the values to the respective ID and merge all the related position streams retrieved from the inverted index. Since all the streams and results must be sorted by design, no results are available to the user before the matching has finished. This represents a serious issue for large (i.e. billion-word-sized) corpora, where the lexicon size often reaches tens of millions of values, and hence the time taken to evaluate the regular expression across the whole lexicon represents a significant slowdown of query evaluation, and especially of retrieving the first n results.

Two basic optimizations have already been in place to tackle this problem:

– any string optimization: a regular expression matching ^(\.\*)+$ was omitted as it obviously must match the whole lexicon,
– prefix optimization: since the lexicon provides an index of attribute IDs sorted alphabetically by the corresponding string values (representing one of the possible implementations of string↔ID mapping based on a simple binary search, see [8] for a comparison), all regular expressions containing a fixed-string prefix like re.* or mis.*ing make it possible to shrink the set of possibly matching strings by selecting the range of IDs with the given prefix (here re and mis, respectively).
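The prefix optimization can be illustrated in a few lines of code; the sketch below assumes the lexicon is available as an alphabetically sorted list of strings whose positions serve as IDs in the sorted index (the real Manatee index structures are more involved).

# A sketch of the prefix optimization (assumption: the lexicon is an
# alphabetically sorted list of strings; positions in this sorted index
# act as the candidate IDs).
from bisect import bisect_left, bisect_right

def prefix_range(sorted_lexicon, prefix):
    """Half-open range of positions whose strings start with prefix."""
    lo = bisect_left(sorted_lexicon, prefix)
    hi = bisect_right(sorted_lexicon, prefix + "\uffff")
    return lo, hi  # only range(lo, hi) needs regex evaluation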

3 n-gram prefetching

In this paper we describe a new optimization approach that we call n-gram prefetching. It is based on the assumption that most user queries exploiting regular expressions still contain some fixed-string portions (because they are linguistically motivated). While queries like ^.{0,3}$ are of course possible, they are very rare. To verify this hypothesis, we have inspected a set of 128,406 queries which the users of Sketch Engine (a web service exploiting Manatee, see [9]) issued to that system over the period from June to September 2014. Only 12 of them did not contain any fixed-string portions; moreover, 6 of these 12 were ^.*$ and got optimized as well.

The idea of n-gram prefetching consists of indexing (at compile time) all character uni-, bi- and trigrams of every string attribute value and (at run time) extracting such n-grams from the regular expressions and using them to constrain the number of possibly matching strings from the whole lexicon to a much smaller set of IDs.

3.1 n-gram indexing

For indexing the character n-grams we leverage the idea of dynamic attributes in Manatee. A positional or structural attribute in Manatee can be a so-called dynamic attribute, in which case such an attribute is automatically derived from another existing (regular or dynamic) attribute. Each dynamic attribute is assigned a dynamic function which takes the source attribute string as input (and optionally other parameters as well) and returns the new (dynamic) attribute value. Manatee contains a predefined set of dynamic functions which mostly focus on simple string manipulations1 (such as getting a prefix or suffix of a string or its lowercase variant) and users can supply their own dynamic functions as well (in the form of Linux plugins – dynamically linked C/C++ libraries). The main benefits of a dynamic attribute are:

– space savings in source data: there is no need to make it part of the input vertical text,
– space savings in indexing: a dynamic attribute has only the lexicon and inverted index, but no text index. The inverted index stores for each dynamic attribute ID a sorted list of source attribute IDs which map to this dynamic attribute ID (instead of storing corpus positions),
– very limited runtime overhead: depending on the type of operations, the overhead (slowdown) of using a dynamic attribute instead of a regular one is very small. Optionally, an index providing mappings of source attribute IDs to dynamic attribute IDs is compiled as well, in which case the dynamic functions do not need to be executed at runtime at all (except where it is necessary to convert the input user query, e.g. in the case of lowercasing).

Each lexicon item (string) is processed by generating all occurring uni-, bi- and trigrams, as shown in Figure 1, and storing the n-gram values as a dynamic attribute of the source attribute.

classical -> c|l|a|s|s|i|c|a|l
             cl|la|as|ss|si|ic|ca|al
             cla|las|ass|ssi|sic|ica|cal

Fig. 1: Uni-, bi- and trigram string generation.

The resulting dynamic attribute contains a lexicon with all the n-grams found in the source attribute and provides fast access to all source attribute IDs containing a given n-gram. We prepend the caret sign (^) and append the dollar sign ($) to each string so that we also generate specific n-grams occurring at the beginnings and ends of words.
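A compile-time n-gram generator along these lines can be written in a few lines; this sketch (ours, not the actual Manatee code) reproduces the behaviour of Figure 1 including the boundary markers:

# A sketch of the compile-time n-gram generation with ^/$ boundary
# markers, mirroring Figure 1 (not the actual Manatee implementation).
def char_ngrams(value):
    s = "^" + value + "$"
    return {s[i:i + n]
            for n in (1, 2, 3)
            for i in range(len(s) - n + 1)}

# char_ngrams("classical") contains e.g. "^c", "cla", "ssi", "al$"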

1 See https://www.sketchengine.co.uk/documentation/wiki/SkE/Config/DynamicAttributes.

Table 1: Comparison of original and n-gram lexicon sizes.

corpus       language  size ×10⁹  attribute  lexicon size  n-gram lexicon size
czTenTen12   Czech         5.126  word        18,978,703     522,745
                                  lemma       14,151,454     506,580
                                  tag             12,061       1,702
enTenTen12   English      12.968  word        27,894,538   1,880,911
                                  lemma       26,426,200   1,880,808
                                  tag                 60         260
enClueWeb09  English      82.581  word       115,820,931   2,350,697
                                  lemma      110,606,268   2,296,072
                                  tag                 60         260
jpTenTen11   Japanese     10.322  word        13,844,200   6,353,186
                                  lemma       13,303,479   3,766,160
                                  tag                 53         297

3.2 n-gram matching

At run time, we parse the given regular expression and extract all occurring fixed-string uni-, bi- and trigrams (preferring longer n-grams where available and combining strings longer than 3 characters into a set of trigrams joined by the logical AND operator in CQL). Since regular expressions as such can be quite complicated, we exploit the ANTLR3 parser generator [10] for processing them. We have chosen ANTLR3 because it has already been used within Manatee (for parsing CQL and corpus configuration files) and it has a very powerful grammar-writing formalism. The ANTLR3 lexer and parser grammar is provided in Annex 1. The output of the ANTLR3 lexing and parsing is an abstract syntax tree (AST) that is further subject to parsing by ANTLR3 using a so-called tree walker which directly executes programming code according to the parsed AST.

The regular expression parsing grammar is based on the following principles:

– it recognizes separately regular characters and metacharacters (except for ^ and $ which are intentionally indexed as part of the n-grams as explained above), because metacharacters must not be included in the n-gram search and hence represent a separator between n-grams,
– it recognizes character classes (enclosed in [ and ] brackets) which must be kept as a single token as they represent a single character position,
– it recognizes escaped sequences so that they can be handled correctly (possibly de-escaped),
– it recognizes repetitions which must be either entirely omitted (in case a zero number of repetitions is allowed and hence the respective string portion

is entirely optional) or if at least one occurrence is obligatory, the operand character is duplicated and forms a suffix of previous n-grams and prefix of next n-gram. E.g. abc+de gets expanded into two n-grams abc and cde since every matching string must contain these.

Fig. 2: Sample abstract syntax tree for the input query .abc.*de+f. Both the . and .* substrings become separators, while the e+ substring is recognized so as to search for de and ef.

Two sample ASTs are provided in Figures 2 and 3. The tree-walking parser looks up the found n-grams (in this sample abc, de and ef) and combines the related source attribute IDs using the logical AND operator into a single stream of IDs that represent possibly matching strings. Only these IDs are then subject to evaluation of the regular expression, instead of the full lexicon. An overview of the whole dataflow is presented in Figure 4.
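To give a flavour of the run-time side, the following deliberately simplified sketch splits a regular expression on metacharacters and turns the fixed fragments into trigram constraints; unlike the real ANTLR3-based implementation, it ignores repetition expansion (abc+de → abc, cde), groups and character classes.

# A deliberately simplified sketch of run-time n-gram extraction: split
# on metacharacters, turn fixed fragments into (tri)gram constraints.
# The real implementation uses the ANTLR3 grammar in Annex 1 and also
# handles repetitions, groups and character classes.
import re

META = re.compile(r"[.*+?()\[\]{}|\\]")

def ngram_constraints(regex_body):
    constraints = []
    for fragment in META.split(regex_body):
        if not fragment:
            continue
        if len(fragment) <= 3:
            constraints.append([fragment])
        else:  # longer fixed strings: a conjunction of trigrams
            constraints.append([fragment[i:i + 3]
                                for i in range(len(fragment) - 2)])
    return constraints  # ".abc.*de" -> [["abc"], ["de"]]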

Fig. 3: Sample abstract syntax tree for the input query (alpha(b|c).*|d*ef+g) Optimization of Regular Expression Evaluation in Manatee 43

The lexicon lookup for a particular n-gram is either a direct mapping of the n-gram to an ID using the dynamic attribute's lexicon, or, in the case of n-grams containing a regular expression character class, again an evaluation of a regular expression – however, on a much smaller lexicon. In Table 1 we provide a comparison of the original and n-gram lexicon sizes of various corpus and attribute combinations, showing that the n-gram lexicon is usually smaller by orders of magnitude. It is obvious that for attributes with very small lexicons (such as tags based on atomic tagsets) the optimization is not worth doing and may even slow down the processing (as in the case of English and Japanese tags); therefore, in the current implementation the n-gram indices are compiled only for attributes with lexicons exceeding 10,000 items.

Fig. 4: Overview of the evaluation workflow using the n-gram prefetching optimization for the input query (alpha(b|c).*|d*ef+g)

4 Evaluation

The optimization has been evaluated on a number of regular expressions executed against various corpora under the following conditions:

– a single evaluation thread running on an Intel(R) Xeon(R) CPU E5506 @2.13GHz,
– hot cache – the best time of three consecutive runs was counted,
– we have measured the time to retrieve the first 20 hits – this includes the regular expression evaluation plus a number of other operations on the resulting positions stream, but it is more representative from the user perspective (the processing within Sketch Engine is asynchronous, and as soon as the first 20 hits are available they are displayed to users),
– we provide the number of regular expression evaluations (calls to the PCRE matching function) with and without the optimization.

It follows from the evaluation that the speedup ranges from a rather negligible 1.12 to an enormous 100-fold, and the speedup ratio depends on a number of circumstances:

– obviously, the larger the lexicon size of the source attribute, the more speedup can be achieved; the evaluation shows that the additional indexing pays off only in cases where the lexicon size exceeds about 10,000 items – for very small lexicons (e.g. in the case of the tag attribute in English corpora following the Penn Treebank tagset with only 60 atomic tags), the n-gram prefetching is not very beneficial as it enlarges the lexicon size,
– the n-gram prefetching is beneficial even in cases where the prefix optimization has already been in operation, which is important since these two optimizations cannot be combined,2
– not surprisingly, the n-gram prefetching is most beneficial for regular expressions containing rare n-grams (e.g. strč in Czech), but even for frequent n-grams (like ten in English) the speedup is usually around 20,
– even where the optimization itself involves regular expression evaluation (character class matching as in [sz]p.*), the speedup is significant, and present even in comparison with the prefix optimization (as in pr[oe].*).

5 Technical Notes

For creating the n-gram optimization index, there is a new tool included in Manatee called mkregexattr with a straightforward usage: mkregexattr

2 The prefix optimization results – by its nature – in a list of IDs sorted alphabetically, not numerically, and hence cannot be as such subject to any AND/OR stream operations.

Table 2: Evaluation of the n-gram prefetching optimization. #RE denotes the number of regular expression matching function executions without and with n-gram prefetching, time is the evaluation time to acquire the first 20 hits in seconds, S denotes the achieved speedup.

corpus      query               #RE w/o     #RE w/     time w/o  time w/  S
czTenTen12  [word=".*ější"]     18,978,703  41,426     10.030    0.341    29.4
            [lemma=".*strč.*"]  14,151,454  888        6.601     0.066    100.0
            [tag="k1.*c4.*"]    1,357       251        0.058     0.049    1.2
            [word="[sz]p.*"]    18,978,703  115,347    10.023    0.698    14.4
enTenTen12  [word=".*ing"]      27,894,538  913,004    21.931    2.768    7.9
            [lemma=".*ten.*"]   26,426,200  195,758    22.163    1.218    18.2
            [word="pre.*ed"]    80,054      7,553      0.294     0.178    1.7
            [word="pr[oe].*"]   251,924     198,297    1.329     0.920    1.4
            [word=".*[dt]"]     27,894,538  3,466,379  41.100    8.538    4.8
            [tag="N.*"]         60          5          0.056     0.048    1.2
jpTenTen11  [word=".*ち.*"]     13,844,200  30,160     8.182     0.364    22.4
            [lemma=".*ア.*ス"]  13,303,479  69,228     8.388     0.450    18.6
            [word="ンテ.*"]     17,078      17,077     0.199     0.178    1.12

The tool is part of Manatee since version 2.111, and since this version it is called automatically by encodevert at the end of corpus compilation for each attribute whose lexicon size exceeds 10,000 items.

6 Conclusions and Future Work

In this paper we have presented n-gram prefetching – an optimization approach for regular expression evaluation suitable for any corpus management system based on inverted indices and independent of any third-party regular expression library. We have shown that this optimization can significantly speed up user queries consisting of regular expressions. The idea has been practically implemented within the Manatee corpus management system used within the Sketch Engine corpus system and is part of the GPL-licensed part of Manatee, available also within the open source NoSketch Engine suite.3

In the future, a number of further optimizations could be explored, starting with extending the character class recognition support to escape sequences like \w, \d etc., or dividing the n-gram lexicon into separate ones for uni-, bi- and trigrams. A different kind of optimization may also lie in trying a different regular expression library than PCRE, or in enabling the PCRE just-in-time (JIT) features that are currently not in use – however, such a contribution is hard to evaluate as it very much depends on the particular types of regular expressions, and with regard to them the speedup might be fairly unstable.

3 http://nlp.fi.muni.cz/trac/noske

Acknowledgements This work has been partly supported by the Ministry of Educa- tion of CR within the LINDAT-Clarin project LM2010013.

References

1. Rychlý, P.: Korpusové manažery a jejich efektivní implementace. PhD thesis, Fakulta informatiky, Masarykova univerzita, Brno (2000)
2. Evert, S., Hardie, A.: Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. (2011)
3. Knuth, D.E.: Retrieval on secondary keys. The Art of Computer Programming: Sorting and Searching 3 (1997) 550–567
4. Rychlý, P.: Manatee/Bonito – A Modular Corpus Manager. In: Proceedings of Recent Advances in Slavonic Natural Language Processing 2007, Brno, Masaryk University (2007)
5. Hazel, P.: PCRE: Perl compatible regular expressions (2005)
6. Pomikálek, J., Rychlý, P., Jakubíček, M.: Building a 70 billion word corpus of English from ClueWeb. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12) (2012) 502–506
7. Jakubíček, M., Rychlý, P., Kilgarriff, A., McCarthy, D.: Fast syntactic searching in very large corpora for many languages. In: PACLIC 24 Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, Tokyo (2010) 741–747
8. Jakubíček, M., Šmerk, P., Rychlý, P.: Fast construction of a word-number index for large data. In: RASLAN 2013 Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2013) 63–67
9. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., Suchomel, V.: The Sketch Engine: ten years on. Lexicography 1 (2014)
10. Parr, T.: The Definitive ANTLR Reference: Building Domain-Specific Languages. Pragmatic Bookshelf (2007)

Annex 1: ANTLR3 lexer and parser grammar for regular expressions

LPAREN: '(';
RPAREN: ')';
LBRACKET: '[';
RBRACKET: ']';
LBRACE: '{';
RBRACE: '}';
BINOR: '|';
STAR: '*';
PLUS: '+';
QUEST: '?';
DOT: '.';
ZEROANDMORE: '{0,}' | '{0,' ('0'..'9')+ '}';
ONEANDMORE: '{1,}' | '{1,' ('0'..'9')+ '}';
ESC: '\\' (STAR | PLUS | QUEST | LBRACE | RBRACE | LBRACKET | RBRACKET
          | LPAREN | RPAREN | DOT | BINOR | '\\' | '^' | '$');
BACKREF: '\\' ('0'..'9')+;
SPECIAL: '\\' .;
NOMETA: ~(STAR | PLUS | QUEST | LBRACE | RBRACE | LBRACKET | RBRACKET
         | LPAREN | RPAREN | DOT | BINOR | '\uFFFF');
ENUM: LBRACKET CHARCLASS+ RBRACKET;
fragment CHARCLASS: ( (NOMETA|DOT) '-' (NOMETA|DOT)
    | ALNUM | ALPHA | BLANK | CNTRL | DIGIT | GRAPH | LOWER
    | PRINT | PUNCT | SPACE | UPPER | XDIGIT | (NOMETA|DOT) );

regex :
    regalt (BINOR^ regalt)* EOF!
    ;
regalt :
    regpart+ -> ^(AND regpart+)
    ;
regpart :
    regterm
    ( repet_one -> ^(ONE regterm)
    | repet_zero -> SEPARATOR
    | -> regterm
    )
    ;
regterm :

    ( LPAREN regalt
        ( (BINOR regalt)+ -> ^(BINOR regalt regalt)
        | -> regalt
        ) RPAREN
    | ENUM -> ENUM
    | re_str -> re_str
    )
    ;
re_str :
    (DOT|BACKREF|SPECIAL) -> SEPARATOR
    | NOMETA -> NOMETA
    | ESC -> ESC
    ;
repet_one :
    PLUS -> PLUS
    | ONEANDMORE -> ONEANDMORE
    ;
repet_zero :
    QUEST -> QUEST
    | STAR -> STAR
    | ZEROANDMORE -> ZEROANDMORE
    ;

Modus Questions: Query Models and Frequency in Russian Text Corpora

Victoria V. Kazakovskaya1 and Maria V. Khokhlova2

1 Russian Academy of Sciences (Institute for Linguistic Studies), Tuchkov per. 9, 199053, Saint-Petersburg, Russia [email protected] 2 Saint-Petersburg State University, Universitetskaya nab. 11, 199034, Saint-Petersburg, Russia [email protected]

Abstract. The paper deals with the analysis of modus questions used in dialogues of native Russian speakers and discusses their quantitative properties and characteristics. The research focuses on the development of models describing these questions, based on the Russian National Corpus and a newspaper corpus. The results obtained can be applied in various fields of natural language processing, e.g. dialogue systems.

Keywords: text corpora, dialogue, modus questions, Russian

1 Introduction

Modus questions (MQs) are interpreted as question constructions appealing to the intentions of the addressee – their opinion, knowledge, evaluation or explanation. In this sense they are paramount for the assimilation of the world of the subjective (the so-called theory of mind) [1]. Propositional attitudes and the proposition of modus have been studied in many works (see for example [2,3,4,5]). The scope of our research is limited to the analysis of models describing modus questions in Russian texts, particularly the quantitative properties of these types of questions in dialogues and the characteristics of the reactions (replies) given. Prototypical MQs are represented by constructions containing an explicit modus – a modus frame3: e.g.

1. Как ты думаешь (вы думаете) / полагаешь (вы полагаете) / считаешь (вы считаете). . . ? ‘What/How do you: SG or PL think / suppose etc. ?’; 2. Ты думаешь (вы думаете) / полагаешь (вы полагаете) / считаешь (вы считаете), что. . . ? ‘Do you: SG or PL think / suppose etc. that. . . ?’;

3 We follow Bally’s differentiation between modus and dictum [6].

Aleš Horák, Pavel Rychlý (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014, pp. 49–55, 2014. ○c NLP Consulting 2014 50 Victoria V. Kazakovskaya, Maria V. Khokhlova

3. Почему (отчего) ты так думаешь (вы думаете) / полагаешь (вы полагаете) / считаешь (вы считаете). . . ? ‘Why do you: SG or PL think / suppose etc. that. . . ?’. Within dialogic units (or discourse sequences) MQs can occupy not only an initiative position (see above 1 – 3), but also a reactive one: e.g.

4. – Да что же ты, не можешь стукнуть кулаком по столу? – Почему не могу? 4a. Ты думаешь, отчего у меня этот синяк под глазом? (RNC) ‘What are you, you cannot bang your fist on the table? Why cannot I? You think, why I have this black eye?’.

The size of the modus, the means of its expression and the types of frame pattern vary extensively in utterances. For example, a completely verbalized modus can be the main predicative part of a compound sentence with an explanatory clause (4a). Incomplete (reduced) modus frames are often expressed by a parenthesis:

5. Что, по-вашему, это может изменить <. . . >? (RN) ‘What, in your opinion, can this change <. . . >?’.

Being incorporated in the dictum, modus complicates a simple sentence with an analytic (namely, compound nominal) predicate:

6. – Неужели вы его, правда, считаете величайшим? — спросил меня тот, кого мы в нашем рассказе условно называем Петровым. <. . . > (RNC) ‘Do you have him, however, considered the greatest? — Asked me the one whom we in our story conventionally call Petrov. <. . . >’

2 Materials and Methods

The research was carried out on corpora of the Russian language (compiled by S. Sharoff): 1) a subset of the Russian National Corpus (RNC, 116 million tokens) and 2) a Newspaper Corpus (RN, 70 million tokens) [7]. For the comparative analysis we used data from the Russian National Corpus [8]. Lexico-syntactic models [9] can be used for describing patterns involving lemmas or word forms, part-of-speech tags, characters, and other attributes in an annotated corpus. While writing the lexico-syntactic models, we used regular expressions and the query language of the IMS Corpus Workbench. The search is based on morphological annotation combined with lemmata and word forms. For example, the pattern [lemma="как"] []{0,5} [word="считаешь" | word="считаете"] []{0,15} [lemma="?"] describes constructions with the interrogative-relative pronoun как (‘how’) and the verb считать (‘consider/think’) in either the 2SG or 2PL form, with the distance between them being up to 5 words: cf. как ты считаешь, . . . ‘how/what do you consider. . . ’, как ты все-таки считаешь, . . . ‘how do you ever consider. . . ’, как же ты все-таки считаешь, . . . ‘how do you ever still consider. . . ’.

The restriction []{0,15} means that there can be up to 15 words between считаешь (считаете) ‘you consider’ and the end of the sentence (a question mark). The above-discussed models underlying modus questions make up about 1.5% of the total number of question sentences (strictly speaking, words before a question mark) in the Russian National Corpus. The whole range of modus questions was restricted to the most characteristic models – constructions that include modus frames of mental semantics with the prototypical intentional predicates (полагать ‘suppose’, считать ‘consider’, думать ‘think’) in the second person singular and plural forms inherent to the Russian replication.

3 Analysis Results

The constructions with the mental predicates считать ‘consider’ (считаешь / считаете?) and думать ‘think’ (думаешь / думаете?) prove to be the most commonly used in the given corpora (see Table 1). In both corpora, MQs with verbs in the plural form are very frequent, prevailing in RN (about 90%). This fact could be due to the specifics of the corpus, i.e. a high number of interviews (in Russian, the polite form is identical to the 2PL one). Although the pronouns ты ‘you’: 2SG and вы ‘you’: 2PL are not obligatory, their usage nevertheless dominates in both corpora. In RNC there are 127 examples of the construction Вы (ты) полагаете (полагаешь). . . ? ‘Do you: 2SG or 2PL suppose. . . ?’ and only 38 cases where the pronoun is omitted4, while in RN there are 65 MQs with pronouns and only 5 sentences without them:

7. <. . . >Полагаете, они вас услышали? — Думаю, что нет, к сожалению (RN). ‘[Do you] suppose, they heard you? — I think not, unfortunately.’

The search within sentence boundaries can produce some difficulties because a punctuation mark is viewed as a token and has its own tag. Imposing restrictions on the search within certain boundaries (e.g. tags for a sentence) can be seen as a solution in this case. Descriptions of the four modus question models extracted from the corpus data are given below: the КАК-model ‘HOW-model’, the НЕУЖЕЛИ-model ‘REALLY-model’, the ПОЧЕМУ-model ‘WHY-model’ and its version, the ОТЧЕГО-model ‘WHY-model’.

3.1 КАК (считаешь / думаешь / полагаешь)-model

The pattern identifying the model in a corpus can be written in the following form: [lemma="как"] []{0,5} [word="считаешь" | word="считаете"] []{0,15} [lemma="\?"].
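For unannotated plain text, the КАК-model can be approximated with an ordinary regular expression; the sketch below is only a rough equivalent of the CQL pattern above, since it counts whitespace-separated tokens instead of lemmas and tags.

# A rough plain-text approximation of the КАК-model (the CQL pattern
# above works on the lemmatised, tagged corpus; this sketch merely
# counts whitespace-separated tokens).
import re

KAK_MODEL = re.compile(
    r"[Кк]ак(?:\s+\S+){0,5}?\s+счита(?:ешь|ете)\S*(?:\s+\S+){0,15}?\s*\?")

def find_kak_questions(text):
    return KAK_MODEL.findall(text)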

4 Russian belongs to the so-called pro-drop languages.

Table 1: Modus questions in the corpora.

 №  Model                                          RN    RNC
 1  как полагаешь / полагаете . . . ?              35     65
 2  как думаешь / думаете . . . ?                 445    939
 3  как считаешь / считаете . . . ?               420    374
 4  думаешь / думаете . . . ?                    >987   >997
 5  полагаешь / полагаете . . . ?                  70    165
 6  считаешь / считаете . . . ?                  >993   >997
 7  почему . . . полагаешь / полагаете . . . ?      3      6
 8  почему . . . думаешь / думаете . . . ?         15    177
 9  почему . . . считаешь / считаете . . . ?       36     63
10  неужели . . . полагаешь / полагаете . . . ?     0      9
11  неужели . . . думаешь / думаете . . . ?         9    132
12  неужели . . . считаешь / считаете . . . ?       2     19
13  отчего . . . полагаешь / полагаете . . . ?      0      1
14  отчего . . . думаешь / думаете . . . ?          0      3
15  отчего . . . считаешь / считаете . . . ?        0      0

An interrogative construction in the form of a composite sentence turns out to be the most typical for the КАК-model. One can distinguish between several kinds of replies to YES/NO-questions, based on a partial repetition of either the dictum part of the utterance or its modus part combined with relatives (да ‘yes’, нет ‘no’ etc.). Requests to agree or disagree with the previous information, to clarify it, or to make an assumption can be found in the semantics of the reactions. See:

(a) repetition of the dictum (predicate):
8. – Как ты думаешь, нужен будет нам еще Пегий пес или нет? – Нет, не нужен, – опять же совершенно уверенно отвечал Кириск (RNC). ‘Do you think we still need a Spotted Dog or not? – No, it is not needed, – replied Kirisk, again, quite confidently.’
(b) repetition of the modus (also with the negative particle не ‘not’):
9. – Как вы думаете, закон об ограничении пивной рекламы Госдума поддержит? — Думаю, да (RNC). ‘Do you think the law limiting beer advertising will be supported by the State Duma? – I think so.’
(c) repetition of parts of the modus and dictum:
10. – Как Вы считаете, поднимут «Курск» в этом году или нет? – Считаю, что поднимут (RN). ‘Do you think [they] will raise the ‘Kursk’ this year or not? – I believe that [they] will raise it.’
(d) avoidance of an answer, clarifying or counter-questions:
11. – А вы, Таня, как считаете? – Мне-то какое дело? – дернула Таня плечиком (RNC). ‘And you, Tanya, what do you think? – What do I care? – Tanya answered with a shrug of her shoulders.’

Replies to WH-questions are quite similar to the previous ones containing the requested information (dictum) that can be presented by a frame repeating a mental predicate:

12. – Срок... Как сам думаешь? (= какой срок?) – Думаю, неделя (RNC). ‘The term ... How long do you, yourself, think? (= How long?) – I think, a week.’

3.2 НЕУЖЕЛИ (считаешь / думаешь / полагаешь)-model

The present model can be seen as a modally complicated variant of the КАК-model and is more frequent in RNC than in RN (160 vs. 11 examples). The pattern is as follows: [lemma="неужели"] []{0,5} [word="считаешь" | word="считаете"] []{0,15} [lemma="\?"]. This question is represented by both complex and simple constructions (the latter being more typical for the predicates полагать and считать). The replies include the following items:

(a) repetition of the dictum or its fragments:
13. – Неужели думаешь, что игре удастся избежать политизации, реваншистского акцента? <. . . >. — Игре, может, и не удастся, а я в это все лезть не хочу (RNC). ‘Do you really think that the game will be able to avoid politicization, revanchist accent? <. . . > – The game maybe will not be able to avoid this, I do not want to interfere in it.’
(b) repetition of the modus:
14. – Мама, неужели ты всерьез считаешь, что он мне пара? – Я ничего не считаю, я только вижу, что он тебя любит. (RNC) ‘Mom, do you seriously believe that we are a couple? — I believe/consider nothing, I only see that he loves you.’
(c) repetition of parts of the modus and dictum:
15. — Я здесь родился, я люблю этот город, неужели вы думаете, что я хочу оставить о себе плохую память? — Я просто думаю, что на вас давят интересы не столько эстетического плана, сколько денежного. (RN) ‘I was born here, I love this city, do you really think that I want to leave a bad memory about myself? — I just think that you are under financial influences rather than aesthetic ones.’
(d) avoidance of an answer, clarifying or counter-questions:
16. — Неужели, как Хрущев, считаете, что они не умеют рисовать? — Хотите байку? Раз в Берлине сидим треплемся с товарищем — галерейщиком <. . . >. (RN) ‘Do you really think, like Khrushchev, that they do not know how to draw? — Would you like to hear a story? Once, in Berlin, we were sitting chatting with a friend who is a gallery owner <. . . >.’

3.3 ПОЧЕМУ (считаешь / думаешь / полагаешь)-model

The pattern is described as follows: [lemma="почему"] []{0,5} [word="считаешь" | word="считаете"] []{0,15} [lemma="\?"]. The replies include the following items:

(a) repetition of the dictum (the requested information – the reasoning of the point of view) introduced by the subordinate conjunction потому что:
17. – Почему вы так думаете? – Потому что очень хорошо его знаю! – с мстительным удовольствием сказала Марьяна (RNC). ‘Why do you think so? — Because I know him very well! — Mariana said with vindictive pleasure.’
(b) repetition of the modus (mental predicates) that does not contain the reasoning of the opinion, as the addressee disagrees:
18. К. Почему вы так думаете? П. Тут и думать нечего. <Никто из наших деревенских Степшу убить не мог> (RNC) ‘K. Why do you think so? P. There is nothing to think about. No one from our village could (not) kill Stepsha.’

3.4 ОТЧЕГО (считаешь / думаешь / полагаешь)-model

The pattern is described as follows: [lemma="отчего"] []{0,5} [word="полагаешь" | word="полагаете"] []{0,15} [lemma="\?"]. This model is a variant of the ПОЧЕМУ-model and is the least frequent in the corpora: there are no examples in RN and only 4 examples in RNC. Like the above-mentioned models, the replies partially repeat the modus or exemplify the motivation of the point of view:

19. – Но отчего вы так думаете? – спросил он. – Да хотя бы оттого, что в конце концов я возвращаюсь в реальный мир, — сказал я (RNC). ‘But why do you think so? – he asked. – If only because, in the end, I go back to the real world – I said.’

4 Conclusion and Further Work

The paper presents preliminary results of a study of interrogative constructions in Russian. The analysis shows that modus questions, although not very frequent in the dialogues of native Russian speakers, nevertheless represent a certain trait of modern discourse. The КАК считаете / думаете-model proves to be the most prototypical one amongst the above-described models. Constructions appealing to the reasoning of the addressee's point of view, as well as stylistically marked ones, are less common.

References

1. Kazakovskaja, V.V.: Modusnyje voprosy (k probleme izbytochnosti). In: Acta Linguistica Petropolitana: Trudy Instituta lingvisticheskikh issledovanij RAN. Izbytochnost' v grammaticheskom stroje jazyka. Vol. 6, pp. 225–242. Nauka, Sankt-Peterburg (2010)
2. Field, H.: Mental representation. In: Block, N. (ed.) Readings in the Philosophy of Psychology 2, pp. 78–114. CUP, Cambridge, Harvard University Press, MA (1978)
3. Gak, V.G.: O logicheskom ischislenii semanticheskikh tipov propozitsional'nykh glagolov. In: Propozitsional'nyje predikaty v logicheskom i lingvisticheskom aspekte, pp. 38–70. Nauka, Moskva (1987)
4. Jaszczolt, K.M. (ed.): The Pragmatics of Propositional Attitude Reports. Elsevier Science, Oxford (2000)
5. Larson, R.: The grammar of intentionality. In: Peter, G., Prayer, G. (eds.) Logical Form and Language, pp. 228–262. Clarendon Press, Oxford (2002)
6. Bally, Ch.: Linguistique générale et linguistique française. Leroux, Paris (1932)
7. Corpora of the Russian language, http://corpus1.leeds.ac.uk/ruscorpora.html
8. Russian National Corpus, http://ruscorpora.ru
9. Khokhlova, M.V.: Razrabotka grammaticheskogo modulja russkogo jazyka dlja spetsial'noj sistemy obrabotki korpusnykh dannykh. In: Vestnik Sankt-Peterburgskogo gosudarstvennogo universiteta. Ser. 9. Filologija, vostokovedenije, zhurnalistika. Vol. 2, pp. 162–169. Izdatel'stvo Sankt-Peterburgskogo gosudarstvennogo universiteta, Sankt-Peterburg (2010)

Low Inter-Annotator Agreement = An Ill-Defined Problem?

Vojtěch Kovář, Pavel Rychlý, and Miloš Jakubíček

Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
{xkovar3,pary,jak}@fi.muni.cz

Abstract. Annotation tasks where the inter-annotator agreement is low are usually considered ill-defined and not worth attention. Such tasks are also considered unsuitable for algorithmic solution and for evaluation of computer programs that aim at solving them. However, there are many problems (not only) in the natural language processing field that are practically defined and do have this nature, and we need computer programs that are able to solve them. The paper illustrates such problems on particular examples and suggests a methodology that will enable training and evaluating tools using data with low inter-annotator agreement.

Keywords: NLP, inter-annotator agreement, low inter-annotator agree- ment, evaluation

1 Introduction

Inter-annotator agreement (IAA) is considered one of the key indicators of whether a particular classification task is well-defined or not. A lot of attention has been paid to the IAA problem [1,2,3], aiming not only at measuring the agreement, but also at excluding the expected amount of agreement by chance, and at interpreting different values of the IAA measurements. A task is generally considered well-defined if the inter-annotator agreement is very high, and ill-defined and not worth attention if the agreement is rather low. To the best of our knowledge, this is a common view for all classification tasks, across all scientific fields. In the field of natural language processing (NLP), however, there is a huge number of tasks that do not have naturally high inter-annotator agreement, as people do not agree on the right annotation (even if they are well-educated specialists). Even for such a seemingly straightforward task as morphological tagging (of English), the reported IAA is around 97 percent [4]; for more complex tasks like syntactic analysis, information extraction or question answering, it is much lower (e.g. [5]).


This discrepancy (the need for high IAA vs. naturally low IAA in most NLP tasks) leads to undesirable side effects. On one side, there are extremely extensive annotation manuals [6,7] containing hundreds of pages. On the other hand, inter-annotator agreement is rarely published – e.g. the reported IAA for English morphological tagging comes from a semi-official note [4]; for the Prague Dependency Treebank (PDT [8]), the primary syntactic resource for Czech, there is only one report, describing the annotation of a specific sub-part of PDT [9], which does not report very high numbers for important parts of the annotation. It is of course just a speculation, but our opinion is that the results are not published because low IAA numbers would put the whole (mostly very costly) resources in a bad light. We think both of these effects are really bad: the aim of all NLP tasks is to teach computers what humans are able to do without any manuals – understand the language – so there should be only minimalistic instructions for any NLP annotation. On the other hand, ambiguity (and a low agreement rate) is natural – people often read the same sentences differently, and often have to make sure that they understand each other correctly. We can say that low IAA is an integral property of natural languages. Therefore, we need to be able to handle tasks with low IAA and use data with low IAA in training and evaluations, rather than ignore them or try to overcome the fact that language is ambiguous. This paper suggests a method for using data with low IAA for meaningful evaluations. We discuss the requirements for such a method, and also some drawbacks and limits of the proposed approach.

2 Problem Examples

In this section we provide real-world examples of the tasks where low IAA causes problems.

2.1 Syntactic Annotation

An example of a project that tried to solve low IAA by extensive annotation manuals is the syntactic annotation in the Prague Dependency Treebank (PDT [8]), the leading syntactic resource for Czech that we have already mentioned. The manual for the analytical (syntactic) layer has about 300 pages [7], and the annotation procedure was as follows: each sentence was annotated by 2 independent annotators, and where they did not agree, a third (more experienced) annotator judged between them [8]. So the one and only syntactic representation available in the treebank is often based on 2/3 biased agreements according to a very complex manual. This procedure is error-prone, and many of the rules in the manual are debatable; some of the resulting problems are discussed in [10]. But mainly, the procedure goes against the ambiguous character of the language: in sentences like “A plane crashed into the field behind the forest”, it does not matter for correctly understanding the sentence whether the phrase “behind the forest” depends on “crashed” or on “field” (although in similar sentences it may be very important) – the resulting information is the same. But the annotators need to decide it, and so do the syntactic parsers that are trained and evaluated using this data. This does not correspond to the analysis procedure as it happens when humans analyze text. And this is just one of many examples of frequent syntactic ambiguity.

2.2 Collocation and Terminology Extraction

Extraction of collocations is an important task for language learners and dictionary makers, to learn or record that in English one says “strong tea” rather than “powerful tea”, or “light a fire” instead of “make a fire”. However, the agreement among lexicographers on what is and what is not a good collocation is very low [11]. On the other hand, automatic applications for collocation extraction (e.g. the Sketch Engine [12]) exist and are commercially interesting. They just have not undergone a proper scientific evaluation yet (the procedure reported in [11] is rather debatable, as it uses the same methodology that is used in classification tasks with high IAA), as there is no methodology for evaluating tasks with such a large grey (disagreement) zone. For extraction of terminology from domain-specific texts, the situation is nearly the same. Terminology (e.g. in the form of a list of terms) is needed for terminology dictionaries, language learners and consistent translations, and the applications are already there (e.g. [13]). But the agreement on what is and what is not a term in a given domain is very low, and a proper evaluation is missing, as there is no methodology available.

The three examples above only illustrate the problem: there are many similar tasks that are neither solved nor evaluated, as they are not “well-defined”. However, they are needed, and we need a methodology to evaluate them.

3 Methodology Proposal

In this section we present our proposal for annotation and evaluation of tasks with low IAA.

3.1 Requirements

We illustrate the requirements on a binary classification task, which is the typical case (most tasks can be straightforwardly reduced to a binary classification task). Let us have the classes Positive and Negative; the task is to assign each data item to one of these classes. We want to evaluate an automatic tool that does this classification in some imperfect way.

Then, let us have some “gold standard” data annotated by multiple human annotators, who agree in some cases and disagree in others. The automatic tool should get positive points for every item assigned to class Positive where all the human annotators agreed that it should be class Positive; similarly for Negative. The tool should get negative points for every item assigned to class Positive where all the human annotators agreed that it should be class Negative, and vice versa. We also need to be able to handle cases where the annotators do not agree with each other. We propose taking these cases out of the evaluation and not counting them at all: if even one of the annotators interprets the data differently, the general “human interpretation” is not clear, and the automatic tool should get neither positive nor negative points for any assignment in these cases.

3.2 Proposed Procedure

Based on the requirements, we introduce a modification of the standard measures precision and recall, defined for unambiguous gold standard data without human disagreements as follows:

precision = #true_positives / (#true_positives + #false_positives)

recall = #true_positives / (#true_positives + #false_negatives)

The modified precision and recall use the same formulas, but with a different meaning of true_positive, false_positive and false_negative:

– #true_positives is defined as the number of data items where all the human annotations were Positive and our tool said Positive.
– #false_positives is defined as the number of data items where all the annotations were Negative but our tool said Positive.
– #false_negatives is defined as the number of data items where all the annotations were Positive and our tool said Negative.

In other words, we first remove all the data items where the human annotators disagree, and then measure standard precision and recall on the rest. This idea can be easily generalized to classification into more classes. In that case, however, we may want to award some additional positive points when the output of our tool agrees with at least one of the annotators, and/or negative points when the output of our tool agrees with none of the annotators, even if they disagree among themselves. In this case, we will be able to make some use of the data even when the annotators disagree. However, this approach brings more complexity into the evaluation, and it may be better to transform the problem into a binary classification task, which is possible in most cases, and transparent.
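A minimal sketch of the proposed evaluation in Python follows; the item representation (a list of annotator labels per item, aligned with the tool's outputs) and all names are our assumptions, not part of any published implementation.

# Minimal sketch of the proposed evaluation. Each gold item carries the
# labels from all annotators; items with any disagreement are skipped.

def modified_precision_recall(gold, predicted):
    """gold: list of lists of annotator labels ("P"/"N"),
    predicted: list of tool labels, aligned with gold."""
    tp = fp = fn = 0
    for labels, out in zip(gold, predicted):
        if len(set(labels)) != 1:
            continue  # annotators disagree -> item is not counted at all
        consensus = labels[0]
        if consensus == "P" and out == "P":
            tp += 1
        elif consensus == "N" and out == "P":
            fp += 1
        elif consensus == "P" and out == "N":
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = [["P", "P"], ["N", "N"], ["P", "N"], ["P", "P"]]
tool = ["P", "P", "P", "N"]
print(modified_precision_recall(gold, tool))  # (0.5, 0.5)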

4 Discussion

There are some difficulties with the procedure introduced above. In the following points we mention them and discuss possible solutions:

– At first sight, the procedure increases the cost of the testing data, as many of the annotations (all cases where the annotators disagreed) are not used. However, the data will record ambiguity and will be of higher quality than if we attempted to decide the disagreements. Therefore, the evaluations will also be more sound, and automated learning from such data will be able to be more informed.
– We need to account for random agreements, especially when the part of the data where people disagree is rather big. The probability of random agreements can be easily computed for most classification tasks, and can be trivially decreased by increasing the number of annotators: for example, if the probability of a Positive judgement is 50%, increasing the number of annotators to 7 will reduce the proportion of random agreements below 1% (see the sketch after this list). Of course, sometimes this will mean increasing costs again. On the other hand, in most tasks the probability of the Positive judgement is much lower.
– In cases where the agreement is really low, the question whether the task is well-defined or not will persist. The maintainers of the data should check carefully whether the level of disagreement corresponds to the real ambiguity of the task, and correct the annotation instructions if not. It is not easy to introduce a quantitative algorithmic test here, as the level of ambiguity varies significantly among tasks.
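The sketch below shows our reading of the chance-agreement estimate in the second point: n independent annotators, each judging Positive with probability p.

# Back-of-the-envelope estimate of chance agreements (our reading of the
# figures above): n independent annotators, each saying Positive with
# probability p.

def chance_all_positive(p, n):
    """Probability that all n annotators say Positive purely by chance."""
    return p ** n

def chance_full_agreement(p, n):
    """Probability that all n annotators agree on either class by chance."""
    return p ** n + (1 - p) ** n

for n in (2, 3, 5, 7):
    print(n, chance_all_positive(0.5, n), chance_full_agreement(0.5, n))
# with 7 annotators at p = 0.5, a spurious all-Positive agreement has
# probability 0.5**7, roughly 0.008, i.e. below 1%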

5 Conclusions

We have introduced a new view of classification problems where people often disagree, mainly from the perspective of natural language processing, but suitable for any other field. We have proposed a methodology for using data with disagreements for testing (and partly training) automatic classification tools. The proposed method is straightforward and easily applicable to any data. We believe that in the future, data with disagreements will not be considered radioactive and will be used for serious research. We also believe that the idea will encourage data maintainers to publish their agreement figures consistently.

Acknowledgements This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1) (1960) 37–46
2. Carletta, J.: Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics 22(2) (1996) 249–254
3. Fleiss, J.L., Levin, B., Paik, M.C.: Statistical methods for rates and proportions. John Wiley & Sons (2013)
4. Manning, C.D.: Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? In: Computational Linguistics and Intelligent Text Processing – 12th International Conference, CICLing 2011. Springer, Berlin (2011) 171–189
5. Sampson, G., Babarczy, A.: Definitional and human constraints on structural annotation of English. Natural Language Engineering 14(4) (2008) 471–494
6. Bies, A., Ferguson, M., Katz, K., MacIntyre, R., Tredinnick, V., Kim, G., Marcinkiewicz, M.A., Schasberger, B.: Bracketing guidelines for Treebank II style Penn Treebank project (1995) languagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf
7. Hajič, J., Panevová, J., Buráňová, E., Urešová, Z., Bémová, A., Štěpánek, J., Pajas, P., Kárník, J.: Annotations at analytical level: Instructions for annotators (2005) ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/pdf/a-man-en.pdf
8. Hajič, J.: Complex corpus annotation: The Prague dependency treebank. Insight into the Slovak and Czech Corpus Linguistics (2006) 54
9. Mikulová, M., Štěpánek, J.: Annotation procedure in building the Prague Czech-English dependency treebank. In: Slovko 2009, NLP, Corpus Linguistics, Corpus Based Grammar Research, Bratislava, Slovakia, Slovenská akadémia vied (2009) 241–248
10. Kovář, V., Jakubíček, M.: Prague dependency treebank annotation errors: A preliminary analysis. In: Proceedings of Third Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Masaryk University (2009) 101–108
11. Kilgarriff, A., Rychlý, P., Jakubíček, M., Kovář, V., Baisa, V., Kocincová, L.: Extrinsic corpus evaluation with a collocation dictionary task. In Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S., eds.: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, European Language Resources Association (ELRA) (2014) 1–8
12. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., Suchomel, V.: The Sketch Engine: Ten years on. Lexicography 1 (2014)
13. Kilgarriff, A., Jakubíček, M., Kovář, V., Rychlý, P., Suchomel, V.: Finding terms in corpora for many languages with the Sketch Engine. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, The Association for Computational Linguistics (2014) 53–56

SkELL: Web Interface for English Language Learning

Vít Baisa and Vít Suchomel

Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
{xbaisa,xsuchom2}@fi.muni.cz

Lexical Computing Ltd. Brighton, United Kingdom vit.{baisa,suchomel}@sketchengine.co.uk

Abstract. We present a new web interface for English language learning: SkELL. The name stands for Sketch Engine for Language Learning; the tool is aimed at students and teachers of the English language. We describe SkELL’s features and the processing of the corpus data on which SkELL is built: spam-free, high-quality texts from various domains, including diverse text types, covering the majority of English language phenomena.

Keywords: Sketch Engine, concordance, thesaurus, word sketch, lan- guage learning, English language, corpus

1 Introduction

There are many websites for language learners: wordreference.com1 and Using English2 are just two of many. Some of them use corpus tools or corpus data, such as Linguee3, Wordnik4 and bab.la5. They usually provide dictionary-like features: definitions and translation equivalents in selected languages. Some of them even provide examples from parallel corpora (Linguee). We introduce here a new web interface aimed at teachers and students of the English language, which offers similar functions to the above-mentioned tools but is at the same time based on specially processed corpus data suitable for the language learning purpose. We call it SkELL: Sketch Engine for Language Learning. The Sketch Engine6 is a state-of-the-art web-based tool for building, managing and exploring large

1 http://www.wordreference.com/
2 http://www.usingenglish.com/
3 http://www.linguee.com/
4 http://www.wordnik.com/
5 http://en.bab.la/
6 http://www.sketchengine.co.uk

text collections in dozens of languages. SkELL is derived from the Sketch Engine, and the data SkELL relies upon (the SkELL corpus) is built using the very same tools as the Sketch Engine uses: SpiderLing [1], the tokeniser unitok.py [2] and TreeTagger [3]. It also uses GDEX [4], a technique for scoring sentences according to their appropriateness for use as example sentences in learners’ dictionaries.

2 Features of SkELL

SkELL offers three ways of exploring the SkELL corpus. The first is the concordance: for a given word or phrase, it returns up to 40 example sentences. The second is the word sketch, through which typical collocates of a given word can be discovered. The third is similar words (thesaurus), which lists words that are similar to, though not necessarily synonymous with, a search word. The similar words are visualized with a word cloud. The web interface is optimized for mobile and touch devices. SkELL’s features are built upon the Bonito corpus manager [5]. Like many other corpus managers, Bonito provides many standard functions – concordancing, word list generation, context statistics – and also some advanced features like a distributional thesaurus [6] and word sketches [7]. We have chosen these three: 1) concordance, 2) word sketch and 3) thesaurus (similar words).

2.1 Concordance (or examples)

Fig. 1: Example of concordance for language learning phrase

The concordance offers a powerful full-text search tool. For a word or a phrase it returns up to 40 example sentences featuring the query words. The concordance feature is useful for discovering how words behave in English.

The search is case insensitive, i.e. it yields the same results for rutherford and Rutherford. Moreover, the results may contain the query (a word or a phrase) in a derived word form: for mouse (a lemma) it will also find sentences with mice. For mice the result will contain a different set of sentences: only occurrences of mice. It is not necessary for users to specify the part of speech (PoS, e.g. noun, verb, adjective, preposition, adverb) of the search term. A search for book will give sentences with book as a verb and as a noun, both in various word forms (booking, booked, books).

2.2 Word sketch (or word profile)

Fig. 2: Example of word sketch for lunch.

The word sketch function is very useful for discovering collocates and for studying the contextual behaviour of words. Collocates of a word are words which occur frequently together with the word – they “co-locate” with the word; see [7] for more information. For the query mouse, SkELL will generate several tables containing collocates of the headword mouse. Table headers describe what kind of collocates (always in basic word form) they contain. By clicking on a collocate, a concordance with highlighted headwords and collocates is shown. This way it can be seen how the two collocates are usually used together in the English language. By default, the most frequent PoS is shown in word sketches. If a word (book, fast, key, . . . ) can have more than one PoS, alternative links are shown next to the headword (see Figure 2).

Fig. 3: Example of similar words in word cloud for lunch.

2.3 Similar words (or thesaurus)

The third function serves for finding words which are similar (not only synonyms) to a search word. For a word (multi-word queries are not supported yet) SkELL returns a list of up to 40 most similar words visualized using a word cloud. The word cloud is generated using the D3.js library7 and a word cloud plugin8. As in the word sketch, if a word can have more than one PoS, links to alternative results are provided.

7 http://d3js.org/
8 http://www.jasondavies.com/wordcloud/about/

3 SkELL corpus

SkELL uses a large text collection – the SkELL corpus – gathered specially for the purpose of English language learning. It consists of texts from news, academic papers, Wikipedia articles, open-source (non-)fiction books, webpages, discussion forums, blogs etc. There are more than 60 million sentences in the SkELL corpus and more than one billion words in total. This amount of textual data provides sufficient coverage of everyday, standard, formal and professional English. In the following subsections we describe the most important data resources which have been used in building the SkELL corpus and the processing of the data.

3.1 English Wikipedia

One of the largest parts of the SkELL corpus is English Wikipedia.9 We have used the Wikipedia XML dump10 from October 2014. The XML file has been converted to plain text, preserving only a few structure tags (documents, headings, paragraphs), using a slightly modified version of the script WikiExtractor.py11. We have filtered out thousands of articles which are not expected to contain fluent English text, e.g. articles named “List of ...”; we then sorted all remaining articles according to their length and took the top 130,000 articles. Among the longest articles were e.g. South African labour law, History of Austria and Blockade of Germany; it is clear that many articles come from the geographical and historical domains.

9 https://en.wikipedia.org
10 https://dumps.wikimedia.org/
11 http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
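The article selection just described can be sketched as follows; the (title, text) input format and the function name are our assumptions, not the actual pipeline code.

# Hypothetical sketch of the article selection: drop "List of ..." pages,
# sort the rest by length, keep the longest 130,000 articles.

def select_articles(articles, top_n=130000):
    kept = [(title, text) for title, text in articles
            if not title.startswith("List of")]
    kept.sort(key=lambda item: len(item[1]), reverse=True)
    return kept[:top_n]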

3.2 Project Gutenberg

The Project Gutenberg12 (PG) focuses on gathering public domain texts in many languages. The majority of the texts are in English. We have downloaded all English texts using wget13 and converted the HTML files to plain text. The largest texts in the English PG collection are The Memoires of Casanova, The Bible (Douay-Rheims version), The King James Bible, Maupassant’s Original Short Stories, Encyclopaedia Britannica, etc.

12 http://www.gutenberg.org/
13 http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages

3.3 English web corpus, enTenTen14

We have prepared two subsets of enTenTen14 [8], which was crawled in 2014. The White (bigger) part contains only documents from web domains listed in dmoz.org or in the whitelist of urlblacklist.com. The Superwhite (smaller) part contains only documents from domains listed in the whitelist of urlblacklist.com – a subset of White (in case there is still some spam in the larger part taken from dmoz.org). The following dmoz.org categories are allowed in the Superwhite part: 1) blog: journal/diary websites, 2) childcare: sites to do with childcare, 3) culinary: sites about cooking, 4) entertainment: sites that promote movies, books, magazines, humor, 5) games: game related sites, 6) gardening: gardening sites, 7) government: military and schools etc., 8) homerepair: sites about home repair, 9) hygiene: sites about hygiene and other personal grooming related stuff, 10) medical: medical websites, 11) news: news sites, 12) pets: pet sites, 13) radio: non-news related radio and television, 14) religion: sites promoting religion, 15) sportnews: sport news sites, 16) sports: all sport sites, 17) vacation: sites about going on holiday, 18) weather: weather news sites and weather related and 19) whitelist: sites specifically 100% suitable for kids. Finally, we have decided to include the whole White part. It contained 1.6 billion tokens.
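A sketch of the White/Superwhite split follows; the document representation and the two domain sets are our assumptions.

# Sketch of the White/Superwhite split; doc = {"domain": ..., "text": ...}

def split_corpus(documents, dmoz_domains, ubl_whitelist):
    white, superwhite = [], []
    for doc in documents:
        domain = doc["domain"]
        if domain in ubl_whitelist:   # Superwhite is a subset of White
            superwhite.append(doc)
            white.append(doc)
        elif domain in dmoz_domains:
            white.append(doc)
    return white, superwhite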


3.4 WebBootCat corpus

One part of the SkELL corpus has been built using WebBootCat [9]. This approach uses seed words to prepare queries for a search engine. The pages from the search results are downloaded, cleaned and converted to plain text, preserving basic structure tags. We assume the search results from the search engine are spam-free. We have run the tool several times with general English words as seed words, yielding approximately 100 million tokens.

3.5 Other resources

The whole British National Corpus [10] has also been included. The rest of the SkELL corpus consists of free news sources. Table 1 lists all the sources used in the SkELL corpus.

Table 1: Sources used in SkELL corpus

Subcorpus       tokens   used
Wiki            1.6 G    500 M
Gutenberg       530 M    200 M
White           1.6 G    500 M
WebBootCated    105 M    all
BNC             112 M    all
other sources   344 M    200 M

3.6 Processing the data

When all the subparts had been gathered and pre-cleaned (we removed all structures except sentences), we ran them through our standard processing pipeline:

1. normalization: quotes and punctuation normalization,
2. tokenization: we have used unitok.py14,
3. TreeTagger15 for English,
4. deduplication on the sentence level: the onion tool16 has been used.

The corpus was then compiled using the manatee indexing library [5]. Then we scored all sentences in the corpus using the GDEX tool [4] for finding good dictionary examples (sentences) for query results. All sentences in the corpus were sorted according to the score and saved in this order. This is a crucial part of the processing, as it speeds up further querying of the corpus. Instead of

sorting good dictionary examples on-the-fly (which is what the Sketch Engine does), all query results for concordance searches are presorted in the source vertical file. The standard GDEX definition file used for English has been only slightly changed to prefer sentences with more frequent words, effectively filtering out all sentences with special terminology, typos and rare words (rare names). By default, short sentences are preferred; sentences containing inappropriate or spam words are scored lower. The word sketch grammar required for computing word sketches has been modified: it contains only a few grammar rules with self-explanatory names.

14 https://corpus.tools
15 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
16 http://nlp.fi.muni.cz/projects/onion/
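The presorting idea can be illustrated by the following sketch; the scorer here is a toy stand-in for GDEX, whose real definition file combines many more criteria.

# Illustration of the presorting step with a toy stand-in for GDEX.

def toy_gdex_score(sentence):
    words = sentence.split()
    return -abs(len(words) - 12)  # prefer sentences of about 12 words

def presort(sentences, score=toy_gdex_score):
    # best examples first, so a concordance search can simply return the
    # first matches it encounters in the vertical file
    return sorted(sentences, key=score, reverse=True)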

3.7 Versioning and referencing

Since the SkELL corpus may be changed in the future (further cleaned, refined, updated), all references to particular results of SkELL should be accompanied by the current version. The web interface may also be changed occasionally. That is why at the bottom of the SkELL page there is a version in the format version1-version2: the first part corresponds to a version of the web interface and the second to a version of the SkELL corpus.

4 Conclusions and future work

The web interface is available at http://skell.sketchengine.co.uk. The version for mobile devices, which is optimized for smaller screens and for touch interfaces, is available at http://skellm.sketchengine.co.uk. If you access the former link from a mobile device it should be detected and redirected to the mobile version automatically. We have described a new tool which we believe will turn out to be very useful for both teachers and students of English. The processing chain is ready to be used for other languages as well. The interface is also directly reusable for other languages; the only prerequisite is the preparation of the corpus. We are gathering feedback from various users and will refine the corpus data according to it. In the future we plan these updates to SkELL:

1. to create a special grammar relation for English phrasal verbs,
2. to combine examples for collocations from word sketch collocates to build a more representative concordance for a given word,
3. to update the corpus with the newest text resources to provide examples for the newest trending words and neologisms,
4. to analyse access logs of SkELL and put favourite searches on the main page,
5. to provide the commonest string [11] for word sketch collocates,
6. to allow multi-word sketches, which have already been introduced in [11],
7. to implement the necessary methods for a multi-word thesaurus and
8. to build SkELL for other languages (Russian, Czech, German and others).

Acknowledgement This work was partially supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. Suchomel, V., Pomikálek, J., et al.: Efficient web crawling for large text corpora. In: Proceedings of the Seventh Web as Corpus Workshop (WAC7) (2012) 39–43
2. Michelfeit, J., Pomikálek, J., Suchomel, V.: Text Tokenisation Using unitok. In Horák, A., Rychlý, P., eds.: 8th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2014) 71–75
3. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing. Volume 12., Manchester, UK (1994) 44–49
4. Kilgarriff, A., Husák, M., McAdam, K., Rundell, M., Rychlý, P.: GDEX: Automatically finding good dictionary examples in a corpus. In: Proceedings of EURALEX. Volume 8. (2008)
5. Rychlý, P.: Manatee/Bonito – a modular corpus manager. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Masaryk University (2007) 65–70
6. Rychlý, P., Kilgarriff, A.: An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments). In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics (2007) 41–44
7. Kilgarriff, A., Rychlý, P., Smrz, P., Tugwell, D.: The Sketch Engine. Information Technology 105 (2004)
8. Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V., et al.: The TenTen corpus family. In: Proc. Int. Conf. on Corpus Linguistics (2013)
9. Baroni, M., Kilgarriff, A., Pomikálek, J., Rychlý, P., et al.: WebBootCaT: instant domain-specific corpora to support human translators. In: Proceedings of EAMT (2006)
10. Leech, G.: 100 million words of English: the British National Corpus (BNC). Language Research 28(1) (1992) 1–13
11. Kilgarriff, A., Rychlý, P., Kovář, V., Baisa, V.: Finding multiwords of more than two words. Proceedings of EURALEX 2012 (2012)

Text Tokenisation Using unitok

Jan Michelfeit, Jan Pomikálek, and Vít Suchomel

Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
{xmichelf,xsuchom2}@fi.muni.cz

Lexical Computing Ltd. Brighton, United Kingdom [email protected]

Abstract. This paper presents unitok, a tool for tokenisation of text in many languages. Although a simple idea – exploiting spaces in the text to separate tokens – works well most of the time, the remaining cases are quite complicated, language dependent and require special treatment. The paper covers the overall design of unitok as well as the way the tool deals with some language or web data specific tokenisation cases. The rule determining what to consider a token is briefly described. The tool is compared to two other tokenisers in terms of output token count and tokenising speed. unitok is publicly available under the GPL licence at http://corpus.tools.

Keywords: tokenisation, corpus tool

1 Introduction

Tokenisation of a text is the process of splitting the text into units suitable for further computational processing. It is an important data preparation step which makes it possible to perform more advanced tasks like morphological tagging or parsing. Applying a good tokenisation method is necessary for building usable text corpora. Using spaces in the text as token delimiters is a simple and quick way to tokenise text, and the presented approach is indeed based on this observation. However, there are many complicated cases which cannot be solved just by splitting tokens at space characters. Some cases are language dependent and require special treatment. Other sequences to recognize as single or multiple tokens come from web pages – a rich source of data for text corpora [4]. The aim of this work was to develop a tokeniser that is

– fast – able to process big data in billion-word sized corpora,
– reliable – robust enough to deal with messy web data,


– universal – allowing at least basic support for all writing systems utilizing a whitespace to separate words,1
– easy to maintain – adding new tokenisation rules or making corrections based on evaluation should be straightforward,
– text stream operating – text in, tokenised text (one token per line) out,2
– reversible – the tokenised output must contain all information needed for reconstructing the original text.3

The resulting tool, called unitok, has been developed since 2009. It has since become a part of the corpus processing pipeline used for many corpora [3,1,6].

1 The dependency of the tool on space characters separating words is a prerequisite for a broad coverage of languages/writing systems but rules out applicability to processing text in languages such as Chinese, Korean, Japanese and Thai. We have to employ other tools made for the particular language/script in these cases to tokenise texts well.
2 The stream operating feature is necessary for making the tool a part of a corpus processing tool pipeline: the source plain text goes in, each processing tool is applied in a defined order to the result of the previous tool, and the fully processed (tokenised, tagged, annotated, etc.) data comes out.
3 Reconstructing the original text can be useful for applying the tokenisation again after improving the tokenisation rules. Other tokenisers we use do not offer this option, therefore a copy of the original plain text has to be kept along with the tokenised vertical.

2 The Problem of Words

There are legitimate questions and related problems concerning the desired behaviour of a good tokeniser: What is a word, what is a sentence? [2] Our approach in unitok is based on the point of view of a corpus user. It is important to know what tokens the users search for in the corpus concordancer (and other corpus inspection tools) and what tokens they expect to figure in corpus-based analysis (such as word frequency lists, collocations (Word Sketches), thesaurus, etc.). The answer is that users search for sequences of letters. Sequences of numbers, marks, punctuation, symbols, separators and other characters should be clustered together in order to be counted as single tokens in corpus statistics. The definition of the character classes and the mapping of each letter to a class has been made by The Unicode Consortium.4

4 http://unicode.org, the character class mapping is published at http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (accessed 2014-11-17).

3 Implementation

3.1 Related Work

A set of rules in flex5 has been used by [5] to implement a tokeniser. We chose to write a Python script heavily utilising the re library6 for manipulation with regular expressions. The reason is that we wanted to deal with some practical issues (described later in this chapter) programmatically rather than by string/regexp matching, and in a way which is more familiar to us. We would like to acknowledge Laurent Pointal’s Tree Tagger wrapper7, which contributed the regular expressions matching the web-related tokens. Unlike the Tree Tagger wrapper, unitok does just the tokenisation. It works independently of a particular tagger and thus can be combined with any tagger operating on text streams. The subsequent tagging, if required, is defined by the whole text processing pipeline.

5 A fast lexical analyzer, http://flex.sourceforge.net/

3.2 The Method

As has been stated, unitok comes as a self-standing Python script utilising the re library and operating on a text stream. The input text is decoded to unicode, normalised, scanned for sequences forming tokens, the tokens are separated by line breaks and the resulting vertical (one token per line) is encoded into the original encoding. The normalisation deals with SGML entities by replacing them with their unicode equivalents (e.g. ‘&amp;’ by ‘&’ or ‘&ndash;’ by ‘–’). Unimportant control characters (the ‘C’ class in the Unicode specification) are stripped off. All whitespace (the ‘Z’ class, e.g. a zero width space or a paragraph separator) is replaced by a single ordinary space. Sequences of letters (i.e. words) are kept together. Sequences of numbers, marks, punctuation, and symbols kept by the normalisation are clustered together. The present SGML markup (usually HTML/XML) is preserved. The script also handles web-related tokens: URLs, e-mail addresses, DNS domains and IP addresses are recognized. General abbreviations (uppercase characters optionally combined with numbers) are recognized. Predefined language specific rules are applied. These recognize clitics (e.g. ‘d’accord’ in French)8, match abbreviations (e.g. ‘Prof.’ generally, ‘např.’ in Czech)9 or define special character rules (e.g. Unicode positions from 0780 to 07bf are considered word characters in the Maldivian script Thaana). The reversibility of the tokenisation is ensured by inserting a ‘glue’ XML element between tokens not separated by a space in the input data. The most common case is punctuation. An example showing the placement of the glue tag is given in Figure 1. The vertical with glue tags can be reverted to the original plain text using a short accompanying script vert2plain.

6 https://docs.python.org/2/library/re.html
7 http://perso.limsi.fr/pointal/dev:treetaggerwrapper
8 Lists of clitics were taken from the TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
9 The lists of language specific abbreviations were taken from the respective Wikipedia page, e.g. http://cs.wiktionary.org/wiki/Kategorie:%C4%8Cesk%C3%A9_zkratky for Czech.

The " end " .

Fig. 1: Example of tokenised and verticalised string ‘The "end".’ showing the placement of the glue tag <g/>.

The language specific data (clitics, abbreviations, special word characters) are currently available for Czech, Danish, Dutch, English, French, Finnish, German, Greek, Hindi, Italian, Maldivian, Spanish, Swedish and Yoruba. Texts in other languages are processed with the default setting (making e.g. the ‘prof.’ abbreviation always recognizable). The output can be tagged (or generally further processed) by any tool operating on text streams with one token per line and XML tags.
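As a deliberately minimal illustration of the glue mechanism (not the real unitok, which handles URLs, abbreviations, clitics and much more), the following sketch reproduces the vertical of Figure 1 and reverts it, vert2plain-style.

# Minimal space-based tokenisation with glue marks, plus its reversal.
import re

def tokenise(text):
    out = []
    for chunk in text.split():
        tokens = re.findall(r"\w+|[^\w\s]", chunk, re.UNICODE)
        for i, token in enumerate(tokens):
            if i > 0:
                out.append("<g/>")  # no space before this token in input
            out.append(token)
    return "\n".join(out)

def vert2plain(vertical):
    plain, glue = "", True  # no space before the very first token
    for line in vertical.splitlines():
        if line == "<g/>":
            glue = True
            continue
        plain += line if glue else " " + line
        glue = False
    return plain

assert vert2plain(tokenise('The "end".')) == 'The "end".'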

4 Evaluation

A comparison of the output token count and the speed (expressed in output tokens per second) of three tokenisers – unitok, TreeTaggerWrapper and Freeling – can be found in Table 1. The measurements were carried out using 1 GB sized plain texts in six European languages on a single core of an Intel Xeon CPU E5-2650 v2 2.60 GHz. Specific recognition of clitics and abbreviations was on for all languages except Russian, where the general setting was in effect. Only languages supported by TreeTaggerWrapper and Freeling were included in the test. We found there is a noticeable difference between the tools in the number of output tokens. The speed tests revealed unitok was the slowest of the three tools but still quite sufficient for fast processing of large text data. Part of the performance drop is caused by providing the glue marks.

5 Conclusion

unitok is a fast tool for tokenisation of texts with spaces between words. The main benefits are good coverage of various sequences of characters, especially web phenomena, normalisation of messy control or whitespace characters, reversibility of the tokenised output and extensibility by language specific rules. The tool has been successfully used for tokenising source texts for building large web corpora. The evaluation suggests we should consider improving the script to increase the speed of tokenisation in the future.

Acknowledgements This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

Table 1: Comparison of tokenising speed and token count of unitok, Freeling and TreeTaggerWrapper. unitok is the base of the relative token count and the relative tokenising speed.

Language  tool       output tokens  rel tok   duration  tok/s   rel tok/s
English   Unitok     207,806,261    100%      6,880 s   30,200  100%
          TTWrapper  200,122,178    −3.70%    2,380 s   84,100  +178%
          Freeling   215,790,562    +3.84%    2,670 s   80,800  +168%
Spanish   Unitok     196,385,184    100%      6,250 s   31,400  100%
          TTWrapper  204,867,056    +4.32%    2,260 s   90,600  +188%
          Freeling   201,413,272    +2.56%    2,040 s   98,700  +214%
German    Unitok     171,354,427    100%      5,530 s   31,000  100%
          TTWrapper  179,120,243    +4.53%    2,360 s   75,900  +145%
French    Unitok     202,542,294    100%      6,400 s   31,600  100%
          TTWrapper  242,965,328    +20.0%    2,870 s   84,700  +168%
          Freeling   211,517,995    +4.43%    2,300 s   92,000  +191%
Russian   Unitok     98,343,308     100%      3,170 s   31,023  100%
          Freeling   102,565,908    +4.29%    1,450 s   70,800  +128%
Czech     Unitok     183,986,726    –         5,960 s   30,900  –

References

1. Vít Baisa, Vít Suchomel, et al.: Large corpora for Turkic languages and unsupervised morphological analysis. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA) (2012)
2. Gregory Grefenstette and Pasi Tapanainen: What is a word, what is a sentence?: Problems of Tokenisation. Rank Xerox Research Centre (1994)
3. Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al.: The TenTen corpus family. In: Proc. Int. Conf. on Corpus Linguistics (2013)
4. Adam Kilgarriff and Gregory Grefenstette: Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3):333–347 (2003)
5. David D. Palmer: Tokenisation and sentence segmentation. Handbook of Natural Language Processing, pages 11–35 (2000)
6. Vít Suchomel: Recent Czech web corpora. In Aleš Horák, Pavel Rychlý, eds.: 6th Workshop on Recent Advances in Slavonic Natural Language Processing, pages 77–83, Brno, Tribun EU (2012)

Finding the Best Name for a Set of Words Automatically

Pavel Rychlý

Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
[email protected]

Abstract. Many natural language processing applications use clustering or other statistical methods to create sets of words. Such sets group together words with similar meaning, and in many cases humans can find an appropriate term for them quickly; computers, on the other hand, represent such sets with a meaningless number or ID. This paper proposes an algorithm for automatically finding names for word sets. It provides result examples as a simple evaluation of the method.

Keywords: names of word sets, naming clusters, distributional thesaurus

1 Introduction

There are many applications in natural language processing which process words or lemmas and create sets of words. Usually this is done via some type of clustering, but it can be done using many different statistical methods. As an example of such an application see Figure 1, which presents a thesaurus for the word milk in the Sketch Engine system [1]. The thesaurus is computed automatically using a distributional similarity method [2]. The individual words which are similar to the given word (milk) are clustered using bottom-up clustering. The front word of each cluster is the word with the highest similarity score in the cluster. The Sketch Engine thesaurus is based on the Word Sketches. These are one-page summaries of the collocational behavior of a word; an example of a Word Sketch for the verb break is displayed in Figure 2. It is used mainly in lexicography and language learning. A Word Sketch provides lists of collocations divided into several grammatical relations. In Figure 2, some collocations are clustered using the same technique as in the Thesaurus. The final example is from the LDA-frames project [3], Figure 3. LDA-frames is an unsupervised approach to identifying semantic frames from semantically unlabelled text corpora. There are many frame formalisms, but most of them suffer from the problem that all frames must be created manually and the set of semantic roles must be predefined. The LDA-Frames approach, based on the


Fig. 1: Thesaurus of milk in the Sketch Engine

Latent Dirichlet Allocation, avoids both of these problems by employing statistics on a syntactically tagged corpus. The only information that must be given is the number of semantic frames and the number of semantic roles to be identified. From all these examples we can see that many clusters clearly define one common meaning. A native speaker could easily choose a single-word name for such a cluster. This paper presents an algorithm to find such names automatically.

2 Proposed Method

The proposed method exploits the distributional thesaurus data, which provides a list of similar words for a given word. The algorithm works as follows (a sketch follows the list):

1. for each word in the given set, find the list of top similar words in the thesaurus,
2. sum the scores of each similar word across all the given words,
3. add 1 to the sum of each input word (the most similar word to any word is the word itself),
4. sort the similar words according to the sums of scores,
5. display the top items from the list.
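A sketch of the algorithm in Python; the thesaurus lookup function returning (similar word, score) pairs is an assumption, standing in for the Sketch Engine distributional thesaurus used in the paper.

# Sketch of the naming algorithm; thesaurus(word) -> [(similar, score)].
from collections import defaultdict

def best_names(word_set, thesaurus, top_n=5):
    sums = defaultdict(float)
    for word in word_set:
        for similar, score in thesaurus(word):  # step 1
            sums[similar] += score              # step 2
        sums[word] += 1.0                       # step 3
    ranked = sorted(sums.items(), key=lambda kv: kv[1], reverse=True)  # step 4
    return ranked[:top_n]                       # step 5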

3 Evaluation

To our knowledge, there are no evaluation data available. We are going to prepare such gold data as future work. As a simple form of evaluation, we list the results of the algorithm on our test data; they are presented in Table 1.

Fig. 2: Word sketch of verb break in the Sketch Engine

Table 1: Result of the algorithm on test data.

input word set                              output top names
oil coal gas                                fuel-n 0.696, energy-n 0.536
Britain Scotland Europe England             country-n 4.189, area-n 3.308
apple pear orange                           fruit-n 2.145, thing-n 1.441
procedure study analysis method programme   system-n 5.367, work-n 4.959
pint bottle litre gallon                    glass-n 2.371, water-n 2.258
meat fruit vegetable potato                 food-n 3.291, fish-n 2.803
village town                                city-n 0.611, area-n 0.478

Fig. 3: Verb eat in LDA-frames

4 Interface

The algorithm is implemented as a command line script. It is written in Python and uses the Sketch Engine API to access the thesaurus data. We assume that after more fine-tuning the algorithm will be included in the Sketch Engine system. An example of usage is shown in Figure 4.

$ clustname.py bnc2 bnc-hyper n Britain Scotland Europe England
country-n 4.18891489506
area-n 3.50870908797
year-n 3.5038651228
London-n 3.2635447681
world-n 3.13785666227

Fig. 4: An example of the clustname.py tool usage.

5 Conclusions

We have proposed an algorithm for finding names for a set of words. The implementation is mostly language and corpus independent and works quite well on many test sets.

Acknowledgements This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. Kilgarriff, A., Rychlý, P., Smrž, P., Tugwell, D.: The Sketch Engine. Proceedings of Euralex (2004) 105–116 http://www.sketchengine.co.uk
2. Rychlý, P., Kilgarriff, A.: An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments). In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics (2007) 41–44
3. Materna, J.: LDA-Frames: An Unsupervised Approach to Generating Semantic Frames. In Gelbukh, A., ed.: Proceedings of the 13th International Conference CICLing 2012, Part I. Volume 7181 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg (2012) 376–387

Part III

Semantics and Information Retrieval

Style Markers Based on Stop-word List

Jan Rygl and Marek Medveď

Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
{rygl, xmedved1}@fi.muni.cz

Abstract. The analysis of an author’s characteristic writing style and vocabulary has been used to uncover the identity of document authors by both manual linguistic approaches and automatic algorithmic methods. Revealing the gender, name, or age can help to expose pedophiles in social networks, false product reviews on Internet servers, or machine translations submitted as manually translated texts. These problems are predominantly solved by a combination of stylometry and machine learning techniques. Since stylometry focuses on the author’s style, word n-grams cannot be used as a style marker. Stop words are not influenced by the topic of a document; therefore they can be used to create style markers. In this paper, we present guidance on how to implement stop-word extraction and how to include stop-word based style markers in a multilingual classification system based on stylometry.

Keywords: style marker, stop-word list, corpus

1 Introduction

Anonymity is seen as the cornerstone of an Internet culture that promotes sharing and free speech. However, anonymity can also lead to crime. Uncovering the true gender, name, or age of document authors can help to expose pedophiles in social networks, false product reviews on Internet servers, or machine translations submitted as manually translated texts. To reveal the true identity of a document author, a variety of style markers have been used, with better or worse results. Style markers are document features describing the style and vocabulary of the author [1]. There are many stylometric features, such as word and sentence length, typography errors, vocabulary richness, punctuation marks, n-grams of syntactic labels, . . . [2,3,4,5] In the first place, a vector of at least 100 most frequent words has been used in a vast majority of recent approaches [6]. In the absence of stop-word sources, this paper provides a technical report on how to generate a list of stop words and how to implement style markers based on a stop-word list, including a recommendation of tools for text preprocessing.


2 Text preprocessing

To extract stop words from texts, we need to preprocess the documents. Therefore, we advise creating a program for document preprocessing that takes a raw text or HTML document as its input and outputs document objects that consist of:

1. the raw source document,
2. the document language,
3. the character set that is used in the document,
4. the plain text without any HTML tags except a paragraph tag and a link tag, with a diacritic check performed on the plain text’s words,
5. the tokenization of the plain text,
6. the morphological annotation of the tokenized text,
7. the lemmatization of the tokenized text.

The language of the input HTML text can be determined by the langid tool (for more information see [7,8]), which takes a text as input and returns a language code in the ISO 639-1 standard. Then the character set is derived from the input HTML text and its language by Chared [9] (developed at Masaryk University in Brno). Information about the language of a document increases the accuracy of encoding detection; therefore we recommend performing language detection before character set recognition. Next we use lxml.html.clean from the Python library (other tools can be used, depending on your programming language) to get rid of all HTML tags except paragraph tags and links, which can be useful for other style markers. In this step we also process all the plain text’s words with the czaccent tool [10] (also developed at Masaryk University in Brno), which completes diacritics if a word is spelled without or with incorrect diacritics. This process is necessary only for languages using characters with diacritics. In the following step, the text is tokenized. We recommend using the Unitok [11] (universal tokenizing) program developed at Masaryk University in Brno, which splits the text into tokens and adds predefined XML-like tags:

– <doc> – beginning of the document
– <s> – beginning of the sentence
– <g/> – omitted space between tokens
– ...

After tokenization we pass the output to the Desamb [12] tool (a morphological disambiguator). Desamb uses the morphological analyzer Majka [13] to annotate each word with morphological categories from the tagset of Majka and with lemmas. Then the best fitting variant of morphological category and lemma is selected for each token. For our purposes we have also trained the Majka analyzer on Czech, Slovak and English data, thus our system can operate with these three languages. The output of disambiguation consists of morphological tags and lemmas for each token in the text. Lemma statistics can be used instead of tokens to create a stop-word list used for style markers. The whole process is illustrated in Figure 1.

[Figure 1 diagram: Text → langid (detect language) → chared (detect encoding) → lxml.html.clean (remove tags) → czaccent (fix diacritics) → unitok (tokenization) → majka (add morphology) → desamb (disambiguation) → Document object]

Fig. 1: Preprocessing text
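Hypothetical glue code for this pipeline might look as follows; every tools.* call is a placeholder for the real component named in the comment, not an actual API.

# Hypothetical glue code for the pipeline of Figure 1.

class Document:
    def __init__(self, raw):
        self.raw = raw          # raw source document
        self.language = None    # document language
        self.charset = None     # character set
        self.plaintext = None   # cleaned plain text
        self.tagged = None      # tokenised, tagged and lemmatized text

def preprocess(raw_html, tools):
    doc = Document(raw_html)
    doc.language = tools.detect_language(raw_html)               # langid
    doc.charset = tools.detect_encoding(raw_html, doc.language)  # chared
    text = tools.remove_tags(raw_html)                   # lxml.html.clean
    doc.plaintext = tools.fix_diacritics(text, doc.language)   # czaccent
    tokens = tools.tokenise(doc.plaintext)                       # unitok
    doc.tagged = tools.disambiguate(tokens)              # desamb + majka
    return doc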

3 Stop-word style markers

The purpose of this characteristic is to find the most frequent words (stop words) of a given language in the text and provide style marker values for them.

3.1 Stop-word extraction from a corpus To obtain list of stop words of a given language, Sketch engine’s [14,15] freq tool can be used on large corpora. We extracted stop words for Czech, Slovak and English documents: – for Czech stop words we use czTenTen corpus that consists of 5 069 447 935 tokens – for Slovak stop words we use skTenTen corpus that consists of 876 003 720 tokens – for English stop words we use enTenTen corpus that consists of 12 968 375 937 tokens To extract stop-word lists using Sketch engine, you can use two console commands: command: freqs $corpora ’[]’ ’word 0 tag 0’ > word_feq freqs $corpora ’[]’ ’lemma 0 tag 0’ > lemma_feq where $corpora is a name of the corpus. We used following corpora:

– sk: sktenten

– cs: cztenten12_8
– en: ententen12

The advantage of the Sketch Engine is that it already contains corpora for almost every world language. If you do not have access to the Sketch Engine and you have your own corpora, you can generate stop words with frequencies using command line tools or any programming language (a minimal sketch follows).
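For example, a minimal Python sketch counting word frequencies in a tokenised corpus in the vertical format (the tab-separated one-token-per-line input is an assumption):

# Minimal stop-word extraction from a vertical file.
from collections import Counter

def stopword_frequencies(vertical_lines, top_n=100):
    counts = Counter()
    for line in vertical_lines:
        line = line.strip()
        if not line or line.startswith("<"):  # skip structure tags
            continue
        counts[line.split("\t")[0].lower()] += 1
    total = sum(counts.values())
    return [(word, count, count / total)
            for word, count in counts.most_common(top_n)]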

3.2 Style markers extraction

This characteristic can use two input types, so the user can choose whether the calculations operate on lemmas or tokens (below, word represents a lemma or a token). We describe the calculations on words, but the same calculations can be performed on lemmas too. We calculate two types of frequencies: the absolute frequency and the relative frequency of the stop words that appear in the input text. This characteristic is used to generate three different outputs:

1. relative frequencies of stop words:

   count_stop_word(document) / count_word(document)

2. absolute values of the differences between the relative frequencies of words in the corpus and the relative frequencies of words in the input text:

   | count_stop_word(corpus) / count_word(corpus) − count_stop_word(document) / count_word(document) |

3. squared differences between the relative frequencies of words in the corpus and the relative frequencies of words in the input text:

   ( count_stop_word(corpus) / count_word(corpus) − count_stop_word(document) / count_word(document) )²

The first method ignores corpus frequencies and is suitable for scenarios where we are given stop-word lists, but the frequencies are counted from an untrustworthy corpus or are unknown. For other situations, we recommend the other two variants. The third variant is sensitive to big deviations from corpus frequencies, which can be an important style marker.
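A sketch of the three variants (all names are ours):

# Sketch of the three stop-word style-marker variants defined above.

def style_markers(stop_words, doc_counts, doc_total,
                  corpus_counts, corpus_total):
    rel, diff, squared = {}, {}, {}
    for word in stop_words:
        f_doc = doc_counts.get(word, 0) / doc_total
        f_corpus = corpus_counts.get(word, 0) / corpus_total
        rel[word] = f_doc                        # variant 1
        diff[word] = abs(f_corpus - f_doc)       # variant 2
        squared[word] = (f_corpus - f_doc) ** 2  # variant 3
    return rel, diff, squared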

4 Conclusions

The style markers based on stop-word lists are easily implemented. To generate your own style markers we recommend using the mentioned methods or their alternatives for other programming languages. Style markers based on stop-word lists are still considered to be very beneficial for most algorithms using stylometry, therefore each software solution addressing these problems should implement them.

Acknowledgements This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013.

References

1. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in terms of genre and author. Computational Linguistics 26(4) (2000) 471–495
2. Mosteller, F., Wallace, D.L.: Inference and Disputed Authorship: The Federalist. Addison-Wesley (1964)
3. Holmes, D.I.: Authorship attribution. Literary and Linguistic Computing 13(3) (1998) 111–117
4. Juola, P.: Authorship Attribution. Foundations and Trends in Information Retrieval 1 (2006) 233–334
5. Grieve, J.: Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing 22(3) (2007) 251–270
6. Eder, M.: Style-markers in authorship attribution: a cross-language study of the authorial fingerprint. Studies in Polish Linguistics 6(1) (2011)
7. Lui, M., Baldwin, T.: Cross-domain feature selection for language identification. In: Proceedings of 5th International Joint Conference on Natural Language Processing. (2011) 553–561
8. Lui, M., Baldwin, T.: langid.py: An off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations. ACL '12, Stroudsburg, PA, USA, Association for Computational Linguistics (2012) 25–30
9. Pomikálek, J., Suchomel, V.: Chared: Character Encoding Detection with a Known Language. In Horák, A., Rychlý, P., eds.: RASLAN 2011, Brno, Tribun EU (2011) 125–129
10. Rychlý, P.: CzAccent – Simple Tool for Restoring Accents in Czech Texts. In Horák, A., Rychlý, P., eds.: 6th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2012) 15–22
11. Michelfeit, J., Pomikálek, J., Suchomel, V.: Text Tokenisation Using unitok. In Horák, A., Rychlý, P., eds.: 8th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2014) 71–75
12. Šmerk, P.: K počítačové morfologické analýze češtiny (in Czech, Towards Computational Morphological Analysis of Czech). PhD thesis, Faculty of Informatics, Masaryk University (2010)
13. Šmerk, P.: Fast morphological analysis of Czech. In: Proceedings of Third Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2009) 13–16
14. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., Suchomel, V.: The Sketch Engine: ten years on. Lexicography 1(1) (2014) 7–36
15. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., Suchomel, V.: Statistics used in the Sketch Engine. (2014)

Separating Named Entities

Barbora Ulipová and Marek Grác

Computational Linguistics Centre Faculty of Arts Masaryk University Czech Republic [email protected], [email protected]

Abstract. In this paper, we analyze long sequences of mostly capitalized words which look like a single named entity but in fact consist of several named entities. An example of such a phenomenon is hokejista (hockey player) New York Rangers Jaromír Jágr. Without splitting the sequence correctly, we would wrongly assume that the whole capitalized sequence is the name of the hockey player. To find out how the sequence should be split into the correct named entities, we tested several methods. These methods are based on the frequencies of the words the sequences consist of and their n-grams. The method DIFF-2 proposed in this article obtained much better results than MI-score or logDice.

Keywords: text corpus, mutual information, named entities

In the process of creating a new question answering system, we have encountered several problems. One of the problems that has to be solved is extracting the roles, titles or occupations of various people from free text. This information often occurs together with (at least part of) the name of the person, and it should be possible to extract them jointly. In this paper we focus on the situation when a longer capitalized sequence of words is found and it is required to split the sequence into multiple parts. In the next section we describe examples found during research in Czech corpora. The third section describes possible approaches based on frequencies and co-occurrences of words. The results are presented in the last section of this paper.

1 Named entities characterization

In this project, we focus on extracting two types of information. The first is to identify a person's features such as their role, title or occupation. In that case we are looking for a list of words that can be prepared in advance. Such words are actress (herečka), model (modelka), hockey player (hokejista), minister (minister). These words are often mentioned in tabloids, which are a great source of such information. The second type of information we are looking for is a relationship between two people where both of them are mentioned. Often, they are not referenced by full names but only by the first name or a nickname, e.g. Brooklyn,

son of David Beckham. This information is sometimes fixed, sometimes it can change in time (e.g. wife, boss). The process of extracting information valid at a given time is not a part of this paper.

In Czech, another issue occurs regularly. In the corpus, we can find clusters of capitalized words where parts of such clusters belong to separate named entities. Examples of such phrases are hráč New York Rangers Jaromír Jágr (New York Rangers' player Jaromír Jágr) or hra Vladimíra Franze Válka s mloky (Vladimír Franz's play Válka s mloky). These phrases have to be identified and separated into the relevant parts. In this paper we focus on methods to split such sequences of mostly capitalized words into multiple parts. During our research we have not met many sequences that need to be split into more than two parts, so in the next sections we focus mainly on the ones which only need to be split into two parts. The other cases can be solved by running our methods multiple times.

2 Extraction of names

The first step in creating a system for extraction of information about people is finding the names of people. People's names are, obviously, capitalized in text, which we can exploit. However, there are other words which are capitalized as well – proper names such as names of institutions, nationalities, products and artistic works. Therefore, we need to find a way to distinguish between people and other capitalized words. We have a list of nouns that can represent a person [1] and also a list of words that describe a relationship between people. Our preliminary research shows that if we look for the relationship words in the vicinity of two or more capitalized words which are not at the beginning of a sentence, the majority of capitalized words found will contain names of persons.

3 Separating the phrases

To separate the phrases, we have decided to use an approach based on frequencies and co-occurrences of words. There are several ways to express a co-occurrence between words; we have tried MI-score [2] and logDice [3]. In this section, we present various methods together with examples on real sequences obtained from the corpus czTenTen12_8 [4]. At first, we tried to write corpus queries in CQL [5], but we found out that one complex query took minutes to finish, which makes it almost unusable in real-world applications. We therefore simplified the queries to work just with the n-grams occurring in the corpus, which can be precomputed from the preprocessed data, and a query then took less than 0.1 seconds on the same hardware.

3.1 Method based on mutual information between two words

The first, most naive approach is based on counting MI-score or logDice between bigrams in the sequence. The sequence is a segment of text that should be divided into separate named entities. Let us have a sequence $S = (w_1, w_2, w_3, \ldots, w_n)$; then we count both logDice and MI-score for each pair $(w_i, w_{i+1})$. According to the theory of mutual information, it can be expected that lower values of the mutual score provide a better identification of the borders, as the words at a border should relate to each other less than the others. If the sequence itself or candidate n-gram divisions are not present in the corpus, we have to modify the equation a bit and replace the occurrences of zero by one. Our test suite¹ showed that using logDice had a precision of 29.5% while MI-score's precision was only 11.8%. Examples of selected sequences are shown in Table 1; a sketch of the two association scores follows the table.

Table 1: Detailed results of the method based on MI between two words, for the sequences PSP Miroslava Němcová and D. Cerekve Zdeněk Jirsa (* marks the pair at the correct division border)

Pair                    MI-score   logDice
PSP – Miroslava *          8.77    -14.26
Miroslava – Němcová       10.21    -13.71
D. – Cerekve              -0.49    -18.18
Cerekve – Zdeněk *         1.08    -18.19
Zdeněk – Jirsa             2.69    -19.44
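For reference, the two association scores can be computed from plain frequencies as below; we use the standard definitions (MI-score after [2], logDice after [3]), with f_xy the bigram frequency, f_x and f_y the word frequencies, and N the corpus size. Zero frequencies are replaced by one as described above.

import math

def mi_score(f_xy, f_x, f_y, N):
    # MI-score: log2 of the ratio of observed to expected co-occurrence
    return math.log(N * max(f_xy, 1) / float(f_x * f_y), 2)

def log_dice(f_xy, f_x, f_y):
    # logDice: 14 + log2 of the Dice coefficient of the two words
    return 14 + math.log(2 * max(f_xy, 1) / float(f_x + f_y), 2)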

3.2 Method based on mutual information between n-grams

In order to improve this method, we have decided to extend the above mentioned equations to work directly on n-grams. There are only two n-grams for each division because we want to split the sequence into just two parts, as explained before. The first n-gram starts at the first word of the sequence and the second one ends with the last word. Such n-grams do not have to be present in the corpus, so we reuse the modification proposed in the previous subsection and replace zeros with ones. This modification will also be a part of every other proposed method. When we use this method, we can expect that higher values of the mutual score will represent the correct division. This should happen because the mutual information between two incorrectly selected n-grams should be lower, as they do not occur together as frequently as the correct ones. The precision when logDice was used is 41.2%, while MI-score's precision is 29.5%.

¹ The test suite was manually created from 17 phrases that were selected according to an estimation of the frequency of different types in the corpus

Table 2: Detailed results of the method based on MI between n-grams (* marks the correct division).

Division                       MI-score   logDice
PSP – Miroslava Němcová *        10.07    -11.23
PSP Miroslava – Němcová          10.89     -9.86
D. – Cerekve Zdeněk Jirsa        -1.49    -13.83
D. Cerekve – Zdeněk Jirsa *      10.64     -4.30
D. Cerekve Zdeněk – Jirsa         4.44     -8.01

3.3 Method extended with negative n-grams

The previous method did not take into account that if a particular word always follows an n-gram, it should be a part of it, too. Our next method is based on exactly this idea. In order to count only the clean value, we subtract the frequency of the longer n-gram from the frequency of the shorter one. Because we are working with the n-gram model, finding negative occurrences is not possible as it is in CQL itself, so we have simplified this process to testing the first adjacent word to our n-gram. For example, if we have the sequence New York Rangers Jaromír Jágr, then we count the modified frequency for the n-gram New York as freq(New York) − freq(New York Rangers). In the ideal case, the modified frequency will be zero; for this sequence that happens only if there is no other Jaromír in this hockey team (who would have a different surname, the last word of the sequence). In such a case the modified frequency equals the frequency of occurrences of the same sequence with just the last word replaced. The number subtracted from the frequency provides additional information that can be used for evaluation too.

Table 3: Detailed results of the method based on MI between n-grams with modified frequencies (* marks the correct division)

Division                       MI-score   logDice
PSP – Miroslava Němcová *        10.17    -10.62
PSP Miroslava – Němcová          11.19       N/A
D. – Cerekve Zdeněk Jirsa        -1.49    -13.62
D. Cerekve – Zdeněk Jirsa *      11.19     -3.26
D. Cerekve Zdeněk – Jirsa         4.46       N/A

When modified frequencies are used in the n-gram equations, there are too many values in intermediate steps that cannot be computed, as the modified frequency drops to zero. As the results were heavily corrupted by this, we have decided not to count the mutual information of n-grams but to use only the subtracted values in the equations. We have created two equations that make sense to us. The first one is based on the idea that when we sum the modified frequencies of the n-grams (DIFF-4 – four elements in the equation), the lowest value will show the ideal borderline between them. The other one is a simplification of the previous one where we sum together just the subtracted values (DIFF-2 – two elements in the equation); see the sketch after this paragraph. Unlike the previous methods, these numbers are not normalized at all, but they work correctly on a selected sequence. In order to compare values across sequences, some normalization will have to be done, which will be a part of the following research. According to our test suite, DIFF-4 has a precision of 35.2%, which ranks it above all uses of MI-score, which could not be expected before the experiment. The DIFF-2 method has a precision of 94.1% with only one error on the test suite. This example is D. Cerekve Zdeněk Jirsa, where the borders were properly identified by the n-gram methods based on both MI-score and logDice. When we expand D. to Dolní, the division is found correctly.
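A sketch of our reading of DIFF-2 follows; freq is assumed to map an n-gram (a tuple of words) to its corpus frequency, with missing n-grams defaulting to zero. For each candidate border we sum the frequencies of the two n-grams that cross it (the values subtracted when computing modified frequencies); the border with the lowest sum wins. This is an illustration of the description above, not the exact implementation.

def best_split_diff2(words, freq):
    best_index, best_score = None, None
    for i in range(1, len(words)):
        left, right = tuple(words[:i]), tuple(words[i:])
        # frequencies of the n-grams crossing the candidate border
        crossing = freq.get(left + (words[i],), 0) + \
                   freq.get((words[i - 1],) + right, 0)
        if best_score is None or crossing < best_score:
            best_index, best_score = i, crossing
    return best_index   # index of the first word of the second named entity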

4 Results and Future Work

In this paper, we have presented several methods that should divide a sequence of words into two semantically correct parts. According to our results, the bigram model does not provide the best precision on the test suite. Extending the equations for MI-score and logDice to n-grams rapidly increases the obtained precision. It was shown that in both of these cases using logDice resulted in improved precision compared to MI-score, so we suggest testing it also for other applications. The best method is DIFF-2 with a precision of 94.1%. It is surprising that it does not take into account the frequency of the final n-grams. One of the reasons why they are not used yet is that the results are not normalized, so they cannot be compared across various sequences as MI-score and logDice allow. The obtained results are not really representative yet, so we will prepare a larger dataset for Czech. Although people tend to split such phrases quite easily, it is often a result of detecting patterns of either n-grams or word classes like first names. We did not use first names or any other source of information, so as to be able to improve our methods in real applications. In the future, we would like to focus on a way to detect the sequences that have to be divided. Also the idea of normalization of DIFF-2 and DIFF-4 is very interesting, as then they could also be tested against MI-score and logDice in different applications and for different languages.

Acknowledgements This work has been partly supported by the Masaryk University within the project Čeština v jednotě synchronie a diachronie – 2014 (MUNI/A/0792/2013).

References

1. Grác, M.: Rapid Development of Language Resources. PhD thesis, Masaryk University in Brno (2013)
2. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1) (1990) 22–29
3. Rychlý, P.: A lexicographer-friendly association score. Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN (2008) 6–9
4. Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V., et al.: The TenTen corpus family. In: Proc. Int. Conf. on Corpus Linguistics. (2013)
5. Arasu, A., Babu, S., Widom, J.: CQL: A language for continuous queries over streams and relations. In: Database Programming Languages, Springer (2004) 1–19

Intelligent Search and Replace for Czech Phrases

Zuzana Nevěřilová and Vít Suchomel

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {xpopelk,xsuchom2}@fi.muni.cz

Abstract. This work proposes a new improvement of the 'Search and Replace' function well known from most text processing software. The standard search and replace function is used to replace an exact form of words or phrases by other words or phrases in text documents. It is quite sufficient for languages with minimal inflection such as English. However, a well working word or phrase replacement function for morphologically rich languages requires much more thought. We explore the issues of implementing a useful search and replace in Czech and propose solutions to the majority of the problems: A syntactic parser is employed to identify the phrases containing the search word or phrase. The correct word forms used as a replacement are generated by a morphological analyser. A web demonstration utilizing the proposed solution is presented. The attached examples of use reveal the cases in which the implemented method works well.

Keywords: search and replace, detecting phrases, generating phrases, subject-predicative complement

1 Motivation

This work introduces an intelligent search and replace editing tool for a text processor in Czech. The basic search and replace is a well known function implemented by most text processing software. A search word (or a phrase) and the new text to replace it by are entered by the user. The text processor finds all occurrences of the search string and performs the replacement. Exact matches are made, therefore the user has to know the exact forms of the search phrase. (Some tools support searching regular expressions but that does not offer any grammatical advantage over the basic method.) The basic function is quite sufficient for languages with minimal inflection such as English. Nevertheless, global search and replace without additional knowledge can cause problems (see e.g. the Scunthorpe problem¹). We call our tool "intelligent" since it uses language knowledge to 1) search all occurrences of a word regardless of its forms, and 2) avoid searching

¹ See https://en.wikipedia.org/wiki/Scunthorpe_problem

coincidentally equal substrings. Our work deals with the issue for Czech – a highly inflective language. The improved function should be able to find all inflected forms of the search phrase and apply the respective morphological forms to the replacement phrase.

The new search and replace tool offers the ability to automate tedious manual edits like the basic function. Great care must be taken to reach a reasonable precision, since users would not tolerate the tool making mistakes. Having such a tool, one could automate many frequent cases of replacement to save editing time:

– correction of often repeated mistakes,
– unification of terms in translation (done by the chief translating editor),
– especially unification of terms in localisation (e.g. GUI descriptions: 'dialog box' – 'dialogové okno' (neuter gender), 'dialogový box' (masculine inanimate)),
– adjusting general parts of manuals to particular products,
– changing ingredients in recipes,
– replacing person or company names in standardised documents,
– especially in legal text, e.g. common parts of contracts,
– and other standardized labels, notices, signs.

2 Related Work

There is no work dealing with the search and replace function for Czech phrases known to the authors. Although there is e.g. a Czech grammar checker available for Microsoft Word [6], arguably the most widespread text processor used for Czech, there is no module for morphological search and replace available in the tool. A general idea of morphological search and replace has been patented by Microsoft [8]. In addition to that, the approach presented in this paper also covers search and replace of multiword phrases and is able to deal with Czech phenomena that are uncommon in English – grammatical gender and subject-predicative complement agreements.

3 Methods

In order to replace whole phrases, we need to identify them in many different forms. For this reason, we use the syntactic parser SET [2]. For lemmatization and tagging, we use the tools majka and desamb respectively. The parser takes a tagged vertical text as input, separates individual phrases and determines their heads. It also determines the phrase lemma and phrase tag: in case of noun phrases, the phrase tag corresponds to the head tag; in case of prepositional phrases, it corresponds to the complement noun phrase. The phrase tag determination is crucial for the replacement method. The phrase lemma corresponds to the respective lemmata in the phrase, but in addition, adjective, pronoun, and numeral modifiers are changed to fulfil the grammatical agreement. For example, the noun phrase "tato dvě červená jablka" (these two red apples) has the respective lemmata: "tento" (this), "dva" (two), "červený" (red), "jablko" (apple). The noun phrase is at the same time the noun phrase lemma since it is nominative.

If the replacement concerns the head of a noun phrase or the head of a complement noun phrase (in prepositional phrases), the head tag gender and number are compared to those of the new phrase. The replacement has to follow the same case and number as the original phrase tag. The gender can be different, and if it is, the obligatory agreements have to be fulfilled. In Czech, three basic types of grammatical agreement are related to noun phrases:

– head modifiers: this grammatical agreement does not occur solely in Slavonic languages, it is also known in Romance languages: adjectives, pronouns and numerals modifying a particular head have to agree in number and gender with the head
– active verb: the grammatical agreement between the subject and the verb phrase can be complicated for analytical verbs (e.g. past tense in Czech, which is composed of the active verb to be in present tense and a past participle); moreover it relies on a correct detection of the syntactic subject
– predicative complement: if the complement is an adjective, pronoun or numeral, it has an obligatory agreement in gender and number with the subject. In this case, the predicative complement is hard to detect (it depends on the copula verb occurrence, and the copula verbs are defined in a very arguable way).

More formally, the replacement of phrase p by r in text T (p → r) proceeds in the following way:

1. detect phrase lemma p(lemma) and phrase tag p(tag) for the search phrase p, and phrase lemma r(lemma) and phrase tag r(tag) for the replacement r
2. parse whole T, separate individual noun phrases and prepositional phrases, detect their phrase tags and phrase lemmata
3. if p(lemma) is found in the i-th phrase in T:
(a) replace pi(lemma) with r(lemma); we label this particular replacement by the same index: ri
(b) modify the gender of all adjective, pronoun, and numeral modifiers of ri(lemma) if the genders of p(tag) and r(tag) differ
(c) if pi(tag) is not nominative, decline ri(lemma) according to the case of pi(tag)
(d) if ri is part of the subject and the clause contains a copula verb, modify the predicative complement according to the gender of ri(tag) (only adjective, pronoun, and numeral predicative complements are subject to gender agreement)
(e) if ri is part of the subject and the verb phrase contains a participle, modify the verb participle according to the gender of ri(tag)

For all mentioned types of inflection, we use the declension and conjugation tool [5] based on the morphological generator majka [7]. Examples of grammatical agreements follow:

Adjective Modifiers vysoký dům → vysoká budova

Subject-Predicate dům stál na nábřeží → budova stála na nábřeží

Subject-Predicative Complement dům byl vysoký → budova byla vysoká
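The following toy, self-contained fragment illustrates the head-modifier agreement from the first example; the two lookup tables merely stand in for majka and the declension tool, which the real system uses instead.

# hand-made lexicon: surface form -> (lemma, gender)
LEMMAS = {'dům': ('dům', 'masc'), 'budova': ('budova', 'fem'),
          'vysoký': ('vysoký', 'masc')}
# adjective generation: (lemma, gender) -> nominative singular form
ADJ_FORMS = {('vysoký', 'masc'): 'vysoký', ('vysoký', 'fem'): 'vysoká'}

def replace_head(phrase, new_noun):
    """Replace the noun of an adjective+noun phrase, fixing gender agreement."""
    adjective, noun = phrase.split()
    new_gender = LEMMAS[new_noun][1]
    new_adjective = ADJ_FORMS[(LEMMAS[adjective][0], new_gender)]
    return '%s %s' % (new_adjective, new_noun)

print(replace_head('vysoký dům', 'budova'))   # -> vysoká budova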

One important pitfall concerning the declension exists: the tagger can incorrectly detect the case or the gender of the noun phrase/prepositional phrase. For example, for the noun phrase "zahraniční podnik" (foreign enterprise) the case can be either nominative or accusative (the word forms are the same). In this case, the parsing passes without problems with both nominative and accusative, but the quality of the tagging has serious consequences. If we replace "podnik" (enterprise, in Czech masculine inanimate) by "firma" (company, in Czech feminine), the resulting forms differ depending on the case ("firma" in nominative and "firmu" in accusative). In addition, other sentence parts (verb phrase or predicative complement) may or may not change (depending on whether the phrase is a syntactic subject).

The replacement is therefore "intelligent" in the sense that it replaces noun phrases and prepositional phrases, not substrings. Nevertheless, it does not take word senses into consideration. This can be an issue when the assumption of one sense per discourse is not fulfilled. Particularly, in case of phrases that are part of phrasemes, the replacement can lead to incorrect results. For example, imagine replacing "hand" by "foot". In this case, the phrase "on the other hand" will result in "on the other foot". Replacing all occurrences regardless of the context can lead to errors; however, [1] proved that in 98% of cases the assumption of one sense per discourse is correct.

4 Results

A web demonstration utilizing the proposed solution for Czech has been made available at http://nlp.fi.muni.cz/projects/phrase_replace. We have used the application to perform replacements in instruction manuals, cooking recipes and a definition page from an encyclopedia. The method works well in these cases since there are not many complicated sentences there and the present tense is usually used. See the appendix for a comparison of input and output example texts representing these kinds of replacements. We also tried general replacements of single words as well as multi-word phrases. Although the tool made the mistakes outlined in the previous section, it performed fairly well in two thirds of cases: 17 of 30 instances of changing "dům" to "stavba" in 25 random sentences from a Czech web corpus were completely right, 4 replacements introduced a small error but the declension of the target word was right (e.g. "v stavbě" instead of "ve stavbě") and 9 mistakes were made in declension or in recognising the right part of speech (e.g. "domů" can be a noun or an adverb). Pronominal anaphora resolution proved to be an issue in longer and complex sentences.

The accuracy of the method is not perfect, and the problematic coverage of the anaphora-related issues or word sense disambiguation makes it even harder. Therefore, when utilising the tool in a word processor, we recommend asking the user to confirm each replacement of the search phrase and allowing a manual edit at key places.

5 Conclusion and Future Work

The paper presents preliminary work concerning inflection-aware search and replace of phrases in Czech. It seems that not many previous projects pursue this topic, not only in Czech but also in other languages. The reason is that a successful replacement tool depends on other NLP tasks. In the current version, we employ morphological analysis, tagging, parsing, and morphological generation for inflection of the replaced phrases as well as phrases that are subject to obligatory agreements. However, the tool is not aware of co-references in the text. Future work will therefore concern three main subtasks of co-reference resolution:

– Zero subject resolution In Slavonic languages, the subject does not have to be expressed in each sentence. Zero subject resolution is needed in sentences containing past participles in verb phrases or predicative complements. It is solved e.g. by [4] but not yet used in our tool. Zero subject resolution is useful for replacements such as Bob → Alice as subject. For example: Bob se narodil v Brně, kde také vystudoval. (Bob was born in Brno where he also studied.) Alice se narodila v Brně, kde také vystudovala. (Alice was born in Brno where she also studied.)
– Pronominal anaphora resolution Resolution of possessive, personal, and demonstrative pronouns seems to be a simpler task in morphologically rich languages with several grammatical agreements than in English. It is partially solved by [4,3], however not yet implemented in our tool. Pronominal anaphora resolution is suitable for replacements such as batoh → taška (backpack → bag). For example: Rozdělal jsem přezky na batohu a začal vybalovat věci. Najednou z něj vypadl plátěný pytlík s něčím omamně voňavým. (I opened the backpack and started to unpack it. Suddenly, a small canvas sack with something perfumed dropped out of it.)
– Abbreviated forms Even in technical text where synonymy is not desired, abbreviated forms exist. To our knowledge, no language tool for recognizing abbreviated forms of noun phrases exists. In the future, we need to build such a tool, probably based on similarity search and/or corpus-based thesauri. Abbreviated forms could correctly replace phrases such as: Elektrické travní sekačky jsou ideální na udržování menších travnatých ploch.

Vyžadujete-li tichý chod a ohleduplnost k životnímu prostředí, vyberte si elektrickou sekačku! (Electric lawn mowers are suitable for maintenance of smaller areas. If you require silent operation and environmental considerations, choose an electric mower!)

Acknowledgements This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, HLT '91, pages 233–237, Stroudsburg, PA, USA, 1992. Association for Computational Linguistics.
2. V. Kovář, A. Horák, and M. Jakubíček. Syntactic analysis using finite patterns: A new parsing system for Czech. In Human Language Technology. Challenges for Computer Science and Linguistics, pages 161–171, Poznań, Poland, 2011.
3. Lucie Kučová and Zdeněk Žabokrtský. Anaphora in Czech: Large data and experiments with automatic anaphora resolution. In Václav Matoušek, Pavel Mautner, and Tomáš Pavelka, editors, Proceedings of 8th International Conference on Text, Speech and Dialogue, TSD 2005, volume 3658 of Lecture Notes in Computer Science, pages 93–98. Springer Berlin Heidelberg, 2005.
4. Václav Němčík. Saara: Anaphora resolution on free text in Czech. In Aleš Horák and Pavel Rychlý, editors, Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012, pages 3–8, Brno, Czech Republic, 2012. Tribun EU.
5. Zuzana Nevěřilová. Declension of Czech noun phrases. In Jan Radimský, editor, Actes du 31e Colloque International sur le Lexique et la Grammaire, pages 134–138, České Budějovice, 2012. Université de Bohême du Sud à České Budějovice (République tchèque).
6. K. Oliva, V. Petkevič, and Microsoft s.r.o. Czech grammar checker, 2005.
7. Pavel Šmerk. Unsupervised learning of rules for morphological disambiguation. In Petr Sojka, Ivan Kopeček, and Karel Pala, editors, TSD, volume 3206 of Lecture Notes in Computer Science, pages 211–216. Springer, 2004.
8. J. E. Walsh and R. A. Fein. Morphological search and replace, 02 1999.

Appendix: Example inputs (left column) and outputs (right column) of the intelligent search and replace tool

Example 1: Instruction manual of a tool. Replace kráječ → bazuka.

Input: Kráječ na potraviny je zařízení, které doma můžete používat na krájení potravin na tenké plátky jako z lahůdkářství. Tyto domácí kráječe fungují stejným způsobem jako komerční, avšak nejsou tak výkonné. Kráječ na potraviny Vám umožní nakrájet maso a sýry na požadovanou tloušťku. Protože každý kráječ je trochu jiný, podívejte se do manuálu na specifickou konstrukci Vašeho kráječe. Většinu kráječů koupíte z velké části složenou. Postavte základnu kráječe na neklouzavý povrch. Vyberte si místo, kde budete mít dost prostoru na práci s kráječem, avšak kde nikomu nebude překážet a kde se nikdo nezraní o jeho velmi ostrý kotouč.

Output: Bazuka na potraviny je zařízení, které doma můžete používat na krájení potravin na tenké plátky jako z lahůdkářství. Tyto domácí bazuky fungují stejným způsobem jako komerční, avšak nejsou tak výkonné. Bazuka na potraviny Vám umožní nakrájet maso a sýry na požadovanou tloušťku. Protože každá bazuka je trochu jiná, podívejte se do manuálu na specifickou konstrukci vaší bazuky. Většinu bazuk koupíte z velké části složenou. Postavte základnu bazuky na neklouzavý povrch. Vyberte si místo, kde budete mít dost prostoru na práci s bazukou, avšak kde nikomu nebude překážet a kde se nikdo nezraní o jeho velmi ostrý kotouč.

Example 2: A cooking recipe. Replace prsíčka → kýta.

Input: Troubu předehřejte na 190 °C. Kůži na prsíčkách dobře propíchejte vidličkou, přelijte vroucí vodou a pak nechte dobře oschnout. V těžké, silnostěnné nepřilnavé pánvi pomalu opékejte prsíčka nejdříve kůží dolů. Pak otočte a nechte opéci z druhé strany. Vyjměte z pánve a osolte. Jablka rozložte do lehce olejem vytřené zapékací formy nebo pekáčku. Posypte tymiánem, přidejte svitek skořice a navrch rozložte opečená prsíčka kůží nahoru.

Output: Troubu předehřejte na 190 °C. Kůži na kýtách dobře propíchejte vidličkou, přelijte vroucí vodou a pak nechte dobře oschnout. V těžké, silnostěnné nepřilnavé pánvi pomalu opékejte kýty nejdříve kůží dolů. Pak otočte a nechte opéci z druhé strany. Vyjměte z pánve a osolte. Jablka rozložte do lehce olejem vytřené zapékací formy nebo pekáčku. Posypte tymiánem, přidejte svitek skořice a navrch rozložte opečené kýty kůží nahoru.

Example 3: A definition page from an encyclopedia. Replace ptakoještěr → slepice.

Input: Pterodactylus („křídelní prst“) byl rod pterodaktyloidního ptakoještěra žijícího na území dnešního Německa a zřejmě i jinde v Evropě a Africe v období svrchní jury (asi před 151–148 miliony let). Tento létající plaz je jedním z prvních objevených a vědecky popsaných ptakoještěrů vůbec. První zkameněliny byly identifikovány již roku 1784 Cosimou A. Collinim. Řádně vědecky popsán pak byl počátkem 19. století. Byl to dravec, který se živil zejména rybami a jinými malými obratlovci, případně i bezobratlými. K lovu mu sloužily také drobné zuby na okrajích čelistí. Rozpětí křídel pterodaktylů dosahovalo jen kolem 1,5 metru u dospělých exemplářů, patřil tedy mezi menší ptakoještěry.²

Output: Pterodactylus („křídelní prst“) byl rod pterodaktyloidního slepice žijícího na území dnešního Německa a zřejmě i jinde v Evropě a Africe v období svrchní jury (asi před 151–148 miliony let). Tento létající plaz je jedním z prvních objevených a vědecky popsaných slepic vůbec. První zkameněliny byly identifikovány již roku 1784 Cosimou A. Collinim. Řádně vědecky popsán pak byl počátkem 19. století. Byl to dravec, který se živil zejména rybami a jinými malými obratlovci, případně i bezobratlými. K lovu mu sloužily také drobné zuby na okrajích čelistí. Rozpětí křídel pterodaktylů dosahovalo jen kolem 1,5 metru u dospělých exemplářů, patřil tedy mezi menší slepice.

Example 4: Random sentences from the Czech web. Replace dům → stavba.

Input: Věděla, že se v hořícím domě nachází člověk neschopný se vlastními silami dostat ven. Každý dům má padacími dveřmi chráněný vchod obrácený na jih a malá okénka tam, kde zeď domu převyšuje střechu domu sousedního. Každý zákazník má již od začátku stavby jasný manuál co vše má dům obsahovat a může si vše lehce kontrolovat.

Output: Věděla, že se v hořící stavbě nachází člověk neschopný se vlastními silami dostat ven. Každá stavba má padacími dveřmi chráněný vchod obrácený na jih a malá okénka tam, kde zeď stavby převyšuje střechu stavby sousedního. Každý zákazník má již od začátku stavby jasný manuál co vše má stavbu obsahovat a může si vše lehce kontrolovat.

² From Czech Wikipedia: http://cs.wikipedia.org/wiki/Pterodactylus.

Example 5: A name phrase. Replace chráněná krajinná oblast → národní park.

Input: Budeme se snažit vyhlásit chráněnou krajinnou oblast Prameny Ploučnice. Je nejrozsáhlejší chráněnou krajinnou oblastí v České republice, nabízející množství přírodních rezervací. Jako nejrozsáhlejší chráněná krajinná oblast v České republice nabízela množství přírodních rezervací.

Output: Budeme se snažit vyhlásit národní park Prameny Ploučnice. Je nejrozsáhlejším národním parkem v České republice, nabízející množství přírodních rezervací. Jako nejrozsáhlejší národní park v České republice nabízel množství přírodních rezervací.

Example 6: A change from a male name to a female name. Replace Václav Havel → Eva Nováková.

Input: Václav Havel působil v 60. letech 20. století v Divadle Na zábradlí, kde jej také proslavily hry Zahradní slavnost (1963) a Vyrozumění (1965). 9. července 1964 se Václav Havel po osmileté známosti oženil s Olgou Šplíchalovou. Po vypuknutí Sametové revoluce v listopadu 1989 se Václav Havel stal jedním ze spoluzakladatelů protikomunistického hnutí Občanské fórum a jako jeho kandidát byl 29. prosince 1989 zvolen prezidentem Československa.³

Output: Eva Nováková působila v 60. letech 20. století v Divadle Na zábradlí, kde jej také proslavily hry Zahradní slavnost (1963) a Vyrozumění (1965). 9. července 1964 se Eva Nováková po osmileté známosti oženila s Olgou Šplíchalovou. Po vypuknutí Sametové revoluce v listopadu 1989 se Eva Nováková stala jedna ze spoluzakladatelů protikomunistického hnutí Občanské fórum a jako jeho kandidát byl 29. prosince 1989 zvolen prezidentem Československa.

³ From Czech Wikipedia: http://cs.wikipedia.org/wiki/V%C3%A1clav_Havel.

An Architecture for Scientific Document Retrieval Using Textual and Math Entailment Modules

Partha Pakray and Petr Sojka

Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {pakray,sojka}@fi.muni.cz

Abstract. We present an architecture for scientific document retrieval. An existing system for textual and math-aware retrieval, Math Indexer and Searcher (MIaS), is designed for extensions by modules for textual and math-aware entailment. The goal is to increase the quality of retrieval (precision and recall) by handling natural language variations of expressing semantically the same content in texts and/or formulae. The entailment modules are designed to use several ordered layers of processing on the lexical, syntactic and semantic levels, using natural language processing tools adapted for handling tree structures like mathematical formulae. If these tools are not able to decide on the entailment, generic knowledge databases are used, deploying distributional semantics methods and tools. It is shown that the sole use of distributional semantics for semantic textual entailment decisions on the sentence level is surprisingly good. Finally, further research plans to deploy the results in digital mathematical libraries are outlined.

Keywords: math-aware information retrieval, semantic textual entail- ment, math entailment, distributional semantics, Gensim

1 Introduction

A semantic-based document filtering and search module is a key component of any Information Retrieval (IR) system. Search is a gateway to the ever-growing database of documents in digital libraries (DL) or on the web. Even though keyword-based IR systems became part of everyday life today, they are not fully suitable for research search in DLs, for example. The more precise results the information seeker might get are those expressed, queried, indexed, and retrieved based on word, sentence, paragraph, or document meaning, i.e. the semantic features of the document content.

The variation in expressivity of natural languages, including the mathematical vernacular, to describe semantically similar ideas and elements is enormous. Keyword-based information systems try to cope with it on the lexical level by morphology (indexing lemmas) or by synonymical expansion like WordNet. There are 'semantic web' and ontology-based approaches based on discrete, dichotomic representations of words and relations between them. But they are often not

enough to handle and uniformly represent document, paragraph, sentence or formula meaning in IR systems, e.g. for semantically fine-grained document filtering and similarity computations.

On the other hand, distributional semantic approaches have received well-grounded attention recently. They allow representing word or phrase meaning in continuous high-dimensional spaces, based on unsupervised, and often deep, learning methods [15]. Such representations can be used for purposes like qualified guesses of the semantic similarity of words, phrases, or even sentences or formulae.

In this paper, we design an extension module for our math-aware information system MIaS [21]. We argue that it will further increase current performance [12,20] by better, semantic clustering of variably expressed content. The motivation for the new architecture design is discussed in Section 2. We describe how distributional semantics may help to compute semantically similar text chunks or formulae. In Section 3 the new entailment modules of the architecture are described. We conclude in Section 4 by describing further directions of research.

2 Motivation for a New Architecture

When checking the precision of MIaS on results from [12,20], we realized that some documents are not found just because of minor rephrasing of formulae or text in the query with respect to the document. We need a robust way of computing similarity for textual phrases and formulae terms. In STEM papers, the text is full of formulae; we cannot simply discard them, as they convey very important semantics in dense form: semantic textual similarity is needed.

2.1 Semantic Textual Similarity

The main goal of the Semantic Textual Similarity (STS) task [1] is measuring the degree of semantic equivalence between a pair of texts, e.g. sentences. This task can be applied in many areas such as Information Extraction, Question Answering, Summarization, and in the Information Retrieval area for indexing semantically identical phrases or sentences. Three STS evaluation tasks were organised in 2012 [3], 2013 [2], and 2014 [1] at SemEval workshops. In those evaluation tasks, system performance was evaluated using the Pearson product-moment correlation coefficient between the participant system scores and the human scores.

The textual similarity problem may be tackled by various techniques at the lexical, syntactic and semantic levels (Figure 1), as usual in NLP processing. Among the lexical techniques there are word overlap metrics or n-gram matching. Another way is to compare the dependency relations of two texts. In the computations one can use synonyms, hypernyms, etc. The higher the processing level, the better the performance usually achieved.

Fig. 1: Natural language processing levels (Text 1 and Text 2 are compared at the lexical, syntactic and semantic processing levels, with distributional semantics above them)
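As an illustration of the lexical end of this spectrum, a word overlap metric can be written in a few lines; it serves only as a reference point for the distributional methods discussed below.

def word_overlap(text1, text2):
    """Jaccard overlap of the word sets of two sentences."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    if not words1 and not words2:
        return 0.0
    return len(words1 & words2) / float(len(words1 | words2))

print(word_overlap('A man is phoning', 'A man is talking on the phone'))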

There always remain some examples which cannot be decided by lexical, syntactic or semantic analysis, as full knowledge and meaning representation is needed for them. There is a semantic gap between the lexical surface of the text and its meaning, because the same concepts are represented in different vocabularies, languages, formalisms and notations. Updating knowledge databases with all dialectical possibilities in a supervised way is doomed to failure.

In distributional semantics approaches [5], similarities between linguistic items can be computed from their collocativity and distributional properties in large samples of language data in an unsupervised way, as clearly seen from visualization experiments [7]. Especially convincing are recent experiments computed with the Gensim framework [18] where words and phrases are modelled by the Word2vec [14] language model. We have tried to use it for the STS task.

2.2 Sentence Level Similarity Baseline Experiment

Our STS system will generate various kinds of features from each processing level as shown in Figure 1. Finally, it will use machine learning to decide on the similarity between two text chunks, as shown later in Figure 3. In a preliminary experiment we have used already pre-trained word and phrase vectors available as part of the Google News dataset [14] (about 100 billion words). The LSA word-vector mappings model contains 300-dimensional vectors for 3 million words and phrases.

Gensim [18] is a Python framework for vector space modelling. We have used Gensim for this experiment, and computed the cosine distance between vectors representing text chunks – sentences from SemEval tasks; a sketch of the computation follows. We have used the English test data of the Semantic Textual Similarity (STS) Task 6 [3] from SemEval-2012, Task 6 [2] from SemEval-2013, and Task 10 [1] from SemEval-2014. Given two snippets of text, STS measures their degree of semantic equivalence. The SemEval organizers provided English sentence pairs of news headlines (corpus named HDL), pairs of glosses (OnWN), image descriptions (Images), DEFT-related discussion forums (Deft-forum) and news (Deft-news), and tweet comments and newswire headline mappings (Tweets).
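A sketch of the baseline computation follows. Loading the Google News vectors uses the Gensim API contemporary with this paper; composing a sentence vector by averaging its word vectors is our choice here, one common option, as the exact composition is not fixed by the experiment description.

import numpy as np
from gensim.models import Word2Vec

# pre-trained Google News vectors [14]; the file path is local to the experimenter
# (newer Gensim versions expose this loader as KeyedVectors.load_word2vec_format)
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                      binary=True)

def sentence_vector(sentence):
    # average the vectors of in-vocabulary words
    vectors = [model[w] for w in sentence.split() if w in model]
    return np.mean(vectors, axis=0)

def sts_score(text1, text2):
    v1, v2 = sentence_vector(text1), sentence_vector(text2)
    # cosine similarity between the two averaged sentence vectors
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))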

Table 1: SemEval-2014 Task 10: Multilingual Semantic Textual Similarity Test Result

Corpus       Winner score and team/run name     Our score
Deft-forum   0.5305  NTNU-run3                    0.42812
Deft-news    0.7850  Meerakat_mafia-Hulk          0.67999
Headlines    0.7837  NTNU-run3                    0.60985
Images       0.8343  NTNU-run3                    0.71402
OnWN         0.8745  MeerkatMafia-paringWords     0.79135
Tweet-news   0.7610  DLS@CU-run1                  0.76571

Table 2: SemEval-2013 Task 6: Semantic Textual Similarity Test Result

Corpus      Winner score and team/run name       Our score
Headlines   0.7838  UMBC_EBIQUITY-saiyan           0.62501
OnWN        0.8431  deft-baseline                  0.71165
FNWN        0.5818  UMBC_EBIQUITY-ParingWords      0.38353
SMT         0.6181  UMBC_EBIQUITY-ParingWords      0.32951

Table 3: SemEval-2012 Task 6: Semantic Textual Similarity Test Result

Corpus        Winner score and team/run name                           Our score
MSRpar        0.6830  baer/task6-UKP-run2_plus_postprocessing_smt_twsi   0.30103
MSRvid        0.8803  jan_snajder/task6-takelab-simple                   0.68318
SMT-europal   0.5581  sranjans/task6-sranjans-1                          0.54057
On-WN         0.7273  weiwei/task6-weiwei-run1                           0.68779
SMT-news      0.6085  desouza/task6-FBK-run3                             0.51915

Tables 1, 2, and 3 show the results of our minimalistic system based on a distributional semantics language model, compared to the highest sentence similarity scores of systems participating in SemEval-2014, 2013 and 2012. It is worth noting that for the Tweet-news subtask at SemEval-2014, our 'baseline' system using only plain Word2vec with the pre-trained Google News data by LSA gave a better result than the best system at SemEval-2014!

Just recently, another way of computing global distributional semantics has been reported: Stanford's GloVe [16]. We will compare its performance with Word2vec. As our results on SemEval data indicate that the training corpora are very important, we have realized that Wikipedia knowledge is crucial to tackle the

STS similarity problem, including the named entities and formulae available there.

2.3 Learning from the Wikipedia Corpus

Wikipedia is an online encyclopedia that contains millions of articles on a wide variety of topics with quality comparable to that of traditional encyclopedias. In [22,17,23], Wikipedia has been used successfully as a measure of semantic relatedness between words or text passages.

We will build word and phrase vectors from Wikipedia articles¹. This Wikipedia dump contains more than 3 billion words. We will use Word2vec for learning high-quality word vectors from Wikipedia data sets with billions of words. An example of the vector representation could be as follows: vector("King") − vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen [15]. We will test on the SemEval STS test data using the vectors generated from Wikipedia articles, and compare the results with our baseline system; a sketch of the training step follows. We will also participate in the STS evaluation track at SemEval 2015, Task 2². Having good similarity measures on scientific text chunks, we may use them for our math-aware information retrieval system.
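The planned training step can be sketched with Gensim's WikiCorpus reader; the dump file name and the model parameters below are illustrative choices, and the parameter names follow the Gensim versions contemporary with this paper.

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec

# stream plain-text articles out of the compressed Wikipedia XML dump;
# passing dictionary={} skips the (unneeded) dictionary-building pass
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
sentences = list(wiki.get_texts())   # in practice, stream instead of materializing

model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)

# the King - Man + Woman example from the text:
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))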

3 New MIaS Architecture with Entailment Modules

Our top-level system architecture is shown in Figure 2. The architecture used so far is enriched by three modules: a Text-Text Entailment (TE), a Math-Math Entailment (ME) and a Text-Math Entailment (TME) module.

Textual entailment is defined in [9] as: a text T is said to entail a hypothesis H if the truth of H can be inferred from T. The task of textual entailment is to decide whether the meaning of H can be inferred from the meaning of T. For example, the text T = "John's assassin is in jail" entails the hypothesis H = "John is dead"; indeed, if there exists one's assassin, then this person is dead. On the other hand, T = "Mary lives in Europe" does not entail H = "Mary lives in US". Much effort is devoted by the Natural Language Processing (NLP) community to developing advanced methodologies for TE, which is considered a core NLP task. Various international conferences and several evaluation track competitions on TE have been held, notably at PASCAL – Pattern Analysis, Statistical Modelling and Computational Learning³, the Text Analysis Conferences (TAC)⁴ organized by the United States National Institute of Standards and Technology (NIST), the Evaluation Exercises on Semantic Evaluation (SemEval)⁵, and the National Institute of Informatics Test Collection for Information Retrieval

¹ http://dumps.wikimedia.org/enwiki/
² http://alt.qcri.org/semeval2015/task2/
³ http://pascallin.ecs.soton.ac.uk/Challenges/
⁴ http://www.nist.gov/tac/tracks/index.html
⁵ http://semeval2.fbk.eu/semeval2.php

Fig. 2: Scheme of the new MIaS system workflow, enriched by entailment modules (indexing path: input document → document handler → tokenization, math canonicalization and unification → indexer → Lucene Core math index; searching path: input query → query handler → searcher → results; the TE, ME and TME entailment modules are attached to both the text and math processing paths)

System (NTCIR)⁶ since 2005. At each new TE competition, the participating teams introduced several new features in their TE systems, ranging from lexical to syntactic to semantic methodologies, and from two-way (i.e. binary-class) to multi-way (i.e. multi-class) textual entailment classifications, in monolingual and cross-lingual scenarios. In this work we will investigate the use of entailment modules for IR. We will show that textual and math entailment plays a significant role in monolingual IR performance.

The general architecture of a Textual Entailment system is shown in Figure 3. Text and hypothesis comparison is represented by comparative analysis, and the entailment decision is made by a classifier that makes use of a feature vector. The Textual Entailment task is unidirectional, but Semantic Textual Similarity is mainly bidirectional. Table 4 shows the results of our Semantic Textual Similarity system compared to the entailment decision.

In the MIaS system [21] search can be done in three ways: text-only search, mathematics-formula-only search, and text combined with mathematics formula

⁶ http://research.nii.ac.jp/ntcir/

Fig. 3: General Textual Entailment architecture (the text and the hypothesis are preprocessed and passed to comparative analysis; a classifier over the resulting feature vector outputs a Yes/No entailment decision)

search. During the searching phase, a query can match several terms in the index. However, one match can be more important to the query than another, and the system must consider this information when scoring matched documents. An example of the TE module is shown in Figure 4.

The ME module will compare a math query with documents that contain math formulae. For example, $x^2 + y^2 = z^2$ entails $a^2 + b^2 = c^2$. We will implement math entailment in the formulae weighting module [21]. We will try to use the ME module in this phase to find appropriate terms. An example of the ME module is shown in Figure 5.

The TME module will compare text and math within documents. The TME module not only increases the fairness of similarity ranking, but also helps to match a query against the indexed form by adding new terms for indexing, e.g. a formula for the named entity used to name it. The TME module is shown in Figures 4 and 5.

The entailment modules will search not only for whole sentences (whole formulae), but also for single words and phrases (subformulae down to single variables, symbols, constants, etc.). For calculating the relevance of the matched expressions to the user's query, the entailment module will use a matching technique over indexed mathematical terms, which accordingly affects the scores of matched documents and thus the order of results.

In our TE system based on lexical similarity, we determine the similarity between the two texts by our STS module. Additionally, we compare the dependency structures of the two texts. The TE problem can be tackled in various ways: lexical, syntactic and semantic. Sometimes lexical semantic similarity is not sufficient to solve the TE problem. In Table 4, for pair Id 5 our lexical semantic similarity system has given a high score of 0.95, but the meanings of Text 1 and Text 2 are very different. In this case, a dependency structure weighting the verb as the main decision factor may solve the problem.

Fig. 4: Data flow in the TE and TME modules (a text query such as 'Pythagorean theorem' is enriched with wiki knowledge – related terms like 'Pythagoras' and the formula $a^2 + b^2 = c^2$ – before being passed to the indexer and searcher)

Fig. 5: Data flow in the ME and TME modules (a math query such as $E = mc^2$ is linked through wiki knowledge to related concepts like 'mass–energy equivalence' and Maxwell's conception of electromagnetic waves before being passed to the indexer and searcher)
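As a toy illustration of the kind of decision the ME module has to make, the following fragment detects that two formulae are equal up to a consistent renaming of variables; the real module works on the canonicalized formula trees of MIaS rather than on strings.

import re

def normalize(formula):
    """Rename variables in order of first occurrence: x, y, z -> v1, v2, v3."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        mapping.setdefault(name, 'v%d' % (len(mapping) + 1))
        return mapping[name]
    return re.sub(r'[a-zA-Z]\w*', rename, formula)

def entails_by_renaming(f1, f2):
    return normalize(f1) == normalize(f2)

print(entails_by_renaming('x^2 + y^2 = z^2', 'a^2 + b^2 = c^2'))   # True
print(entails_by_renaming('x^2 + y^2 = z^2', 'a^2 - b^2 = c^2'))   # False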

Tree structures of input sentences are widely used by many research groups, since they provide more information with quite good robustness and runtime compared to shallow parsing techniques. Basically, a dependency parse tree contains nodes (i.e., tokens/words) and dependency relations between the nodes. Some approaches simply treat it as a graph and calculate the similarity between the text and the hypothesis graphs solely based on their nodes, while some others put more emphasis on the dependency relations themselves. Recent approaches using syntactic or tree edit models are [10,13,19]. The approach in [11] is based on the tree edit distance algorithm, which contains three basic operators: insertion, deletion and substitution. Insertion is defined as the insertion of a node from the dependency tree of H into the dependency tree of T; deletion is the removal of a node from the dependency tree of T, together with all its attached children; and substitution is the change of the label of a node in the source tree (the dependency tree of T) into a label of a node of the target tree (the dependency tree of H). Substitution is allowed only if the two nodes share the same part-of-speech (POS).

Table 4: Example pairs from Task 1 at SemEval 2014

Id  Text 1                                Text 2                                our STS  Entailment
1   One young boy is climbing a wall      A young child is climbing a rock       0.7871  No
    made of rock                          climbing wall which is indoors
2   A man is phoning                      A man is talking on the phone          0.8238  Yes
3   John was born on January 15, 1986     John was born in 1986 in the city      0.7996  No
    in Kolkata.                           of Kolkata.
4   A woman is performing a trick on      A woman is jumping with a bicycle      0.7839  No
    a ramp with a bicycle
5   A brown dog is attacking another      A brown dog is helping another         0.95    No
    animal in front of the man in pants   animal in front of the man in pants

The approach in [4] presents a new data structure, termed compact forest, which allows efficient generation and representation of entailed consequents, each represented as a parse tree. Rule-based inference is complemented with a new approximate matching measure inspired by tree kernels, which is computed efficiently over compact forests. The approach in [24] built a model to solve the entailment problem by using dependency syntax analysis (by the Stanford Parser), a lexical knowledge base (e.g. WordNet), web information (e.g. Wikipedia) and probabilistic methods.

We will generate dependency trees for the two texts. Then the mapping can be done in two ways: directly (when entities from the hypothesis dependency tree exist in the text tree) or indirectly (when entities from the text tree or hypothesis tree cannot be mapped directly and need transformations using external resources). Based on this step we will decide on our resulting entailment implementation.

4 Conclusion and Further Work

We have described an architecture for math-aware information retrieval that employs textual and math entailment. We have described our further research directions: distributional approaches that we will test for the entailment modules. We also want to train distributional semantic representations for mathematical formulae, and to test to what extent their vectors may be used to approximate their meaning. Finally, we plan to use the SEPIA evaluation tool and the data of NTCIR’s Math task [12] to evaluate the improvements, and eventually to use them in digital mathematics libraries such as EuDML [6] or the planned GDML [8].

Acknowledgement This work was supported by an ERCIM Alain Bensoussan Fellowship 2014–15 and by Masaryk University. Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of ERCIM or MU.

References

1. Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W., Mihalcea, R., Rigau, G., Wiebe, J.: SemEval-2014 Task 10: Multilingual semantic textual similarity. In: Proceedings of SemEval 2014. p. 81 (2014)
2. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: Semantic textual similarity, including a pilot on typed-similarity. In: *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics (2013)
3. Agirre, E., Diab, M., Cer, D., Gonzalez-Agirre, A.: SemEval-2012 Task 6: A pilot on semantic textual similarity. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation. pp. 385–393. Association for Computational Linguistics (2012)
4. Bar-Haim, R., Berant, J., Dagan, I.: A compact forest for scalable inference over entailment and paraphrase rules. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. vol. 3, pp. 1056–1065. Association for Computational Linguistics (2009)
5. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet Allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)
6. Borbinha, J., Bouche, T., Nowiński, A., Sojka, P.: Project EuDML—A First Year Demonstration. In: Davenport, J.H., Farmer, W.M., Urban, J., Rabe, F. (eds.) Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Lecture Notes in Artificial Intelligence, LNAI, vol. 6824, pp. 281–284. Springer-Verlag, Berlin, Germany (Jul 2011), http://dx.doi.org/10.1007/978-3-642-22673-1_21
7. Chaney, A.J., Blei, D.M.: Visualizing topic models. In: International AAAI Conference on Social Media and Weblogs. Department of Computer Science, Princeton University, Princeton, NJ, USA (Mar 2012)
8. Cole, T.W., Daubechies, I., Carley, K.M., Klavans, J.L., LeCun, Y., Lesk, M., Lynch, C.A., Olver, P., Pitman, J., Xia, Z.J.: Developing a 21st Century Global Library for Mathematics Research. National Research Council, Washington, D.C.: The National Academies Press (Mar 2014)
9. Dagan, I., Glickman, O.: Probabilistic textual entailment: Generic applied modeling of language variability. In: Proceedings of PASCAL Workshop on Learning Methods for Text Understanding and Mining. p. 6. Grenoble (2004), http://u.cs.biu.ac.il/~dagan/publications/ProbabilisticTE_fv07.pdf
10. Heilman, M., Smith, N.A.: Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 1011–1019. Association for Computational Linguistics (2010)
11. Kouylekov, M., Magnini, B.: Recognizing textual entailment with tree edit distance algorithms. In: Proceedings of the First Challenge Workshop Recognising Textual Entailment. pp. 17–20 (2005)
12. Líška, M., Sojka, P., Růžička, M.: Similarity Search for Mathematics: Masaryk University team at the NTCIR-10 Math Task. In: Kando, N., Kishida, K. (eds.) Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies. pp. 686–691. National Institute of Informatics, Tokyo, Japan (2013), http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings10/pdf/NTCIR/MATH/06-NTCIR10-MATH-LiskaM.pdf
13. Mai, Z., Zhang, Y., Ji, D.: Recognizing text entailment via syntactic tree matching. In: Proceedings of the 9th NII Test Collection for Information Retrieval Workshop (NTCIR ’11). pp. 361–364 (2011)
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013), http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
15. Mikolov, T., Yih, W., Zweig, G.: Linguistic Regularities in Continuous Space Word Representations. In: Proceedings of HLT-NAACL 2013. pp. 746–751 (2013)
16. Pennington, J., Socher, R., Manning, C.: GloVe: Global Vectors for Word Representation. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014). 12 pages (Oct 2014), http://nlp.stanford.edu/projects/glove/glove.pdf
17. Ramprasath, M., Hariharan, S.: Using ontology for measuring semantic similarity for question answering system. In: International Conference on Advanced Communication Control and Computing Technologies (ICACCCT). pp. 218–223. IEEE (Aug 2012), http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6320774
18. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en, software available at http://nlp.fi.muni.cz/projekty/gensim
19. Rios, M., Gelbukh, A.: Recognizing textual entailment with a semantic edit distance metric. In: 11th Mexican International Conference on Artificial Intelligence (MICAI). pp. 15–20. IEEE (2012)
20. Růžička, M., Sojka, P., Líška, M.: Math Indexer and Searcher under the Hood: History and Development of a Winning Strategy. In: Joho, H., Kishida, K. (eds.) Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies. 8 pages. National Institute of Informatics, Tokyo, Japan (Dec 2014), https://is.muni.cz/auth/publication/1201956/en
21. Sojka, P., Líška, M.: The Art of Mathematics Retrieval. In: Proceedings of the ACM Conference on Document Engineering, DocEng 2011. pp. 57–60. Association for Computing Machinery, Mountain View, CA (Sep 2011), http://doi.acm.org/10.1145/2034691.2034703
22. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In: AAAI. vol. 6, pp. 1419–1424 (2006)
23. Witten, I., Milne, D.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy. pp. 25–30. AAAI Press, Chicago, USA (2008)
24. Xu, X., Wang, H.: ICL Participation at RTE-7. In: Proceedings of Text Analysis Conference TAC 2011. 4 pages. NIST, Gaithersburg, Maryland, USA (Nov 2011), http://www.nist.gov/tac/publications/2011/participant.papers/ICL.proceedings.pdf

Part IV

Morphology and Lexicon

SQAD: Simple Question Answering Database

Marek Medveď and Aleš Horák

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {xmedved1,hales}@fi.muni.cz

Abstract. In this paper, we present a new free resource for comparable Czech question answering evaluation. The Simple Question Answering Database, SQAD, contains 3301 questions and answers extracted and processed from the Czech Wikipedia. The SQAD database was prepared with the aim of enabling precise evaluation of automatic question answering systems; such a resource was previously not available for the Czech language. We describe the process of SQAD creation, the processing of the texts by automatic tokenization (Unitok) and morphological disambiguation (Desamb), and the successive semi-automatic cleaning and post-processing. We also show the results of the first version of a Czech question answering system named SBQA (syntax-based question answering).

Keywords: question answering, Simple Question Answering Database, SQAD, syntax-based question answering, SBQA

1 Introduction

The question answering (QA) field has significant potential nowadays. Systems capable of answering any possible question can be widely used by the general public to answer questions about the weather, public transport etc. In specialized applications, question answering can be used in medicine to speed up the process of fitting the patient’s symptoms to all possible illnesses.
Recently, several promising question answering systems have appeared, such as IBM’s Watson [1,2], which had a big success in the popular television quiz show Jeopardy!1. Many QA systems are oriented towards a specific domain and are able to answer only simple questions [3,4,5,6].
Question answering systems employ tools that process the input question, then go through a knowledge base and provide a reasonable answer to the question. The presented SQAD database will help to measure and improve the accuracy of QA tools, as it offers all relevant processing parts, i.e. the source text, the question and the expected answer.

1 Jeopardy! is an American television show that features a quiz competition in which contestants are presented with general knowledge clues in the form of answers, and must phrase their responses in question form.


Original text: Létající jaguár je novela spisovatele Josefa Formánka z roku 2004.
[Létající jaguár is a novel by the writer Josef Formánek from 2004.]
Question: Kdo je autorem novely Létající jaguár?
[Who is the author of the novel Flying Jaguar?]
Answer: Josef Formánek
URL: http://cs.wikipedia.org/wiki/L%C3%A9taj%C3%ADc%C3%AD_jagu%C3%A1r
Author: chalupnikova

Fig. 1: Example of a SQAD record

2 The SQAD Database

The Simple Question Answering Database, SQAD, was created by students in a computational linguistics course. Their task was to choose sections from Czech Wikipedia articles as a source for different questions and the respective answers. SQAD contains 3301 records with the following data fields (see Figure 1 for an example):

– the original sentence(s) from Wikipedia
– a question that is directly answered in the text
– the expected answer to the question as it appears in the original text
– the URL of the Wikipedia web page from which the original text was extracted
– the name of the author of this SQAD record

In the first phase, the SQAD database consisted of plain texts. To support the comparison and development of question answering systems, including SBQA [7] which is presented further in the text, SQAD was supplemented with automatic morphological annotations. The texts were processed with two tools: Unitok [12] for text tokenization and the Desamb [8] morphological tagger, which provides unambiguous morphological annotation of tokenized texts (see Figure 2). Both tools are automatic systems and their accuracy is not 100%, thus they occasionally make mistakes. To obtain high-quality data, the tagged texts were checked and corrected by semi-automatic and manual adjustments that are described in the following sections.
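As a small illustration of the record structure, the following sketch splits one plain-text record into its five fields. The field labels come from Figure 1; the actual layout of the distributed database files may differ.

import re

# A sketch of parsing one plain-text SQAD record (Figure 1) into its five
# fields; the labels are taken from the example record, the real files of
# the distributed database may be laid out differently.
FIELDS = ("Original text", "Question", "Answer", "URL", "Author")

def parse_sqad_record(text):
    parts = re.split("(" + "|".join(FIELDS) + "):", text)
    # parts = ['', 'Original text', ' ... ', 'Question', ' ... ', ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}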

3 Adjustments of the SQAD Database

In this section, we describe amendments to the tokenization and tagging as obtained by the Unitok and Desamb processing. Some of the modifications were systematic and were performed in bulk over all the records; others were part of manual checking.

Kdo       kdo       k3yRnSc1
je        být       k5eAaImIp3nS
autorem   autor     k1gMnSc7
novely    novela    k1gFnSc2
Létající  létající  k2eAgMnSc1d1
jaguár    jaguár    k1gMnSc1
?         ?         kIx.

Fig. 2: Morphological annotation (consisting of form, lemma and tag) of the sentence “Kdo je autorem novely Létající jaguár? (Who is the author of the novel Flying Jaguar?)”

a)  1 200
    300
−→
b)  1 200 300

Fig. 3: Unitok: a) wrong and b) adjusted tokenization of the number “1 200 300”.

3.1 Tokenization Adjustments

During the manual checking of SQAD records, we found that large numbers are tokenized incorrectly if they contain whitespace as a thousands separator (see Figure 3). Therefore, we have adapted the matching patterns that Unitok uses for processing the text to fix this behaviour.
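The following regular expression illustrates the kind of pattern involved; it is only a sketch, as Unitok’s actual matching patterns are more elaborate.

import re

# A sketch only: Unitok's real patterns are more elaborate. Digit groups
# joined by plain or non-breaking spaces are matched as one token instead
# of being split at the separators.
NUMBER = re.compile(r"\d{1,3}(?:[ \u00A0]\d{3})+|\d+")

print(NUMBER.findall("rozloha 1 200 300 ha"))   # ['1 200 300']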

3.2 Out-of-Vocabulary Words

The Desamb system is used for the disambiguation of morphological tags according to the word context. In some cases, the context is too narrow or there is no context at all, so Desamb cannot determine the correct tag; it then returns “k?” (unknown kind) as the resulting tag. In SQAD, this is specifically true for the cases where the answer in the SQAD record contains only a number. We have therefore systematically changed such tags to the “k4” tag for numerals (see Figure 4 for an example).
The Desamb tool works over the attributive Czech tagset of the Majka system [9,10]. Majka is based on a deterministic acyclic finite state automaton

120 120 k? −→ 120 120 k4

Fig. 4: Desamb: unrecognized number

Los Los k?          −→  Los Los k1
Angeles Angeles k?  −→  Angeles Angeles k1
LA LA k?            −→  LA LA kA

Fig. 5: Desamb: unrecognized proper names and abbreviations

(DAFSA) that is created from a large morphological database. Despite the coverage of more than 30 million Czech word forms, Majka does not recognize all existing words, especially proper names and abbreviations. Thus the resulting tag of such words is usually “k?”, which means that the system does not contain the word in its database and the context is not long enough to guess the tag. For such words, we have systematically changed the unknown tag to:

a) “k1” (nouns) for all words that start with an upper case letter, and
b) “kA” (abbreviations) for words that contain only upper case letters, words ending with a dot, or words containing dots between upper case letters (a sketch of these rules follows below).
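A minimal sketch of these rules, together with the “k4” rule for numbers from the beginning of this section, might look as follows; the real corrections were applied in bulk to the annotated files.

# A minimal sketch of the retagging heuristics above, together with the
# "k4" rule for unrecognized numbers; the real corrections were applied
# in bulk to the annotated vertical files.
def retag_unknown(word, tag):
    if tag != "k?":
        return tag                  # only unknown tags are changed
    if word.isdigit():
        return "k4"                 # numerals, see Figure 4
    if word.isupper() or word.endswith(".") or \
            all(c.isupper() or c == "." for c in word):
        return "kA"                 # abbreviations: "LA", "jap.", "U.S.A."
    if word[:1].isupper():
        return "k1"                 # capitalized words treated as nouns
    return tag

assert retag_unknown("120", "k?") == "k4"
assert retag_unknown("LA", "k?") == "kA"
assert retag_unknown("Los", "k?") == "k1"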

See Figure 5 for the resulting annotation of unknown proper names and abbreviations.
As mentioned above, the SQAD database is extracted from Wikipedia texts, which are multilingual, and the texts often contain the original forms of proper names. For example, the original writing of the name “Tokio” is “東京”. These foreign language words are not included in the Majka database, thus Desamb always assigns them the tag for unknown words (“k?”). To fix the remaining unknown word tags in SQAD, we extracted all unknown words into one file, keeping the original file name, the word position, and the unknown word with its lemma and tag from Desamb. This file was then manually annotated and programmatically applied back to the original annotated file (see Figure 6).

3.3 Mistakes in Morphological Analysis

In the case of unknown words, the Desamb tool does not only leave the word tag undecided: if the unknown word has a suitable context, Desamb can “guess” its lemma and tag even if the word is not present in the Majka database.

Original Desamb output stored in file 03text.txt:

Tokio  Tokio  k1gInSc1
(      (      kIx(
jap.   jap.   kA
東京   東京   k?

Record of unknown word extracted from 03text.txt file:

./0000/03text.txt|3|東京 東京 k?

Record from 03text.txt with manual changes:

./0000/03text.txt|3|東京 東京 k1

File 03text.txt with changes:

Tokio  Tokio  k1gInSc1
(      (      kIx(
jap.   jap.   kA
東京   東京   k1

Fig. 6: Desamb: unrecognized foreign words

This works very well when Desamb processes Czech words; however, it may cause mistakes for foreign words. In such a case not only the tag but also the lemma is wrong. For example, for the word “Las” (from the proper name “Las Vegas”), the output of Desamb is “Las laso k1gInSc1” (where the word laso means lasso). To repair all these mistakes, we checked all the SQAD database records, extracted a similar file as in Section 3.2, and corrected the tags in the original SQAD files.
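A sketch of the final apply-back step is shown below. The “file|position|word lemma tag” record format follows Figure 6; the 0-based token position is an assumption of this sketch.

# A sketch of applying manually corrected records back to the annotated
# files. The "file|position|word lemma tag" format follows Figure 6; the
# 0-based token position is an assumption of this sketch.
def apply_corrections(corrections_path):
    fixes = {}                                  # file -> {position: fixed line}
    with open(corrections_path, encoding="utf-8") as f:
        for record in f:
            path, pos, triple = record.rstrip("\n").split("|", 2)
            fixes.setdefault(path, {})[int(pos)] = triple
    for path, by_pos in fixes.items():
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        for pos, triple in by_pos.items():
            lines[pos] = triple                 # overwrite the corrected token line
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")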

4 Evaluation

We have used the SQAD database to evaluate the accuracy of the first version of SBQA, a syntax-based question answering system [7]. SBQA was developed as a master’s thesis by Matej Pavla at the Faculty of Informatics, Masaryk University. The input of the SBQA system is a plain text question, which is preprocessed by the Unitok and Desamb systems and passed to the SET parser [11] to identify dependencies and phrase relations within the question. SBQA then decides the rules for finding the answer in its knowledge base based on a match of the corresponding syntactic structures (hence the “syntax-based” in its title). The SBQA knowledge base is also made from plain text documents, which are automatically processed with the same tool chain as the questions (Unitok, Desamb and SET). As we have shown in this paper, the automatically preprocessed texts contain mistakes. For the purpose of evaluating the QA part of SBQA, we have modified its workflow to build the knowledge base from the annotated SQAD database instead of using the automatic processors.

Table 1: Evaluation of SBQA system

total questions  correct  partially correct  incorrect  not found
3,301            758      60                 2,003      480
100%             23%      1%                 61%        15%

Table 2: Classification of SBQA errors (on 200 examples)

total questions  error in SBQA system  error in tokenization or syntax analysis  uncovered phenomena
200              119                   43                                        38
100%             59.5%                 21.5%                                     19%

For the original results see [7]. The current results of SBQA as measured on the SQAD database are presented in Table 1. After the evaluation, we took 200 questions that were answered incorrectly by SBQA and manually checked the reason for each mistake. We have identified three categories of mistakes, caused by:

1. errors in the implementation of the SBQA system
2. errors in tokenization or syntactic analysis
3. phenomena not covered by the current implementation of the SBQA system

An overview of the percentage of these error categories is presented in Table 2. A detailed description of particular error categories can be found in the following subsections.

4.1 Errors in Implementation of SBQA System

The SBQA system balances between answer correctness and the ability to find an answer at all. Because of that, the SBQA implementation uses the concept of choosing answer candidates based on their probability values. On the other hand, this implementation sometimes leads to a wrong evaluation that produces an incorrect answer. We have identified the following error types that are caused by the SBQA implementation:

– answer in brackets: if the answer to the question is placed in brackets, the SBQA system nearly always provides an incorrect answer
– part of speech requirement: even if the tokenization and syntax analysis are correct, the SBQA system does not check the part of speech that is important for answering the question, and thus answers incorrectly
– comparison of dates or numbers: in the case of a question whose answer requires some date or numeric computations, the system does not provide the answer (only exact matches are found)

– wrong question type: as described in [7], SBQA determines a question type for each question, but sometimes the system answers as if the question had the yes/no question type even if the original question is not a yes/no question.

4.2 Errors in Tokenization or Syntactic Analysis

The SBQA system answers incorrectly when the tokenization or syntactic analysis contains mistakes. There are three types of such errors that appear in the current SQAD database:

– Desamb incorrectly detects sentence boundaries and splits one sentence into two or more sentences
– Desamb incorrectly tags a word, thus the syntactic analysis is incorrect and the SBQA system cannot derive the required answer
– SET incorrectly parses a sentence and creates an incorrect syntactic tree, which usually leads to an incorrect answer

4.3 Uncovered Phenomena

The SBQA system does not yet implement advanced NLP techniques such as anaphora resolution. This means that if the answer to the question is in a sentence that refers to the previous text, the answer cannot be found.

5 Conclusions

In this paper we have presented a new Czech question answering database called SQAD. Each SQAD record consists of an annotated question, the annotated answer, the annotated sentence(s) containing the full answer, the Wikipedia URL of the source of the statement, and the name of the author of this question–answer pair. The morphological annotation was obtained automatically and corrected manually. The corrections make the SQAD data more precise, so that they are more suitable for the evaluation of Czech question answering systems. The SQAD database in its current version is available for download at http://nlp.fi.muni.cz/projects/sqad.

Acknowledgements This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. Ferrucci, D.A.: IBM’s Watson/DeepQA. SIGARCH Comput. Archit. News 39(3) (2011) – http://doi.acm.org/10.1145/2024723.2019525.

2. Ferrucci, D.A.: IBM’s Watson/DeepQA. In: Proceedings of the 38th Annual International Symposium on Computer Architecture. ISCA ’11, New York, NY, USA, ACM (2011) – http://dl.acm.org/citation.cfm?id=2000064.2019525.
3. Krishnamurthi, I., Shanthi, P.: Ontology based question answering on closed domain for e-learning applications. Australian Journal of Basic and Applied Sciences 8(13) (2014) 396–401
4. Chung, H., Song, Y.I., Han, K.S., Yoon, D.S., Lee, J.Y., Rim, H.C., Kim, S.H.: A practical QA system in restricted domains. In: Proceedings of the Workshop Question Answering in Restricted Domains, within ACL (2004)
5. Vargas-Vera, M., Lytras, M.D.: AQUA: A closed-domain question answering system. Information Systems Management 27(3) (2010) 217–225
6. Wang, D.: A domain-specific question answering system based on ontology and question templates. In: Software Engineering Artificial Intelligence Networking and Parallel/Distributed Computing (SNPD), 2010 11th ACIS International Conference on, IEEE (2010) 151–156
7. Pavla, M.: Automatic question answering system. Master thesis, Masaryk University, Faculty of Informatics (2014) http://is.muni.cz/th/418138/fi_m/?lang=en.
8. Šmerk, P.: Unsupervised learning of rules for morphological disambiguation. In: Text, Speech and Dialogue, Springer Berlin Heidelberg (2004) http://dx.doi.org/10.1007/978-3-540-30120-2_27.
9. Šmerk, P.: Fast Morphological Analysis of Czech. Proceedings of Recent Advances in Slavonic Natural Language Processing (2009)
10. Jakubíček, M., Kovář, V., Šmerk, P.: Czech morphological tagset revisited. Proceedings of Recent Advances in Slavonic Natural Language Processing 20 (2011) 29–42
11. Kovář, V., Horák, A., Jakubíček, M.: Syntactic analysis using finite patterns: A new parsing system for Czech. In: Human Language Technology. Challenges for Computer Science and Linguistics, Berlin/Heidelberg (2011) 161–171 http://dx.doi.org/10.1007/978-3-642-20095-3_15.
12. Michelfeit, J., Pomikálek, J., Suchomel, V.: Text Tokenisation Using unitok. In Horák, A., Rychlý, P., eds.: 8th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2014) 71–75

Semiautomatic Building and Extension of Terminological Thesaurus for Land Surveying Domain

Adam Rambousek, Aleš Horák, Vít Suchomel, and Lucia Kocincová

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {xrambous,hales,xsuchom2,lucyia}@fi.muni.cz

Abstract. This paper describes the methodology and development of tools for building and presenting a terminological thesaurus closely connected with a new specialized domain corpus. The thesaurus multiplatform application offers detailed information on each term, visualizes term relations, and displays real-life usage examples of the term in the domain-related documents. Moreover, the specialized corpus is used to detect domain specific terms and propose an extension of the thesaurus with new terms. The presented project is aimed at a terminological thesaurus of the land surveying domain; however, the tools are re-usable for other terminological domains.

Keywords: corpus building, thesaurus, terminological dictionary, term extraction, DEB platform

1 Introduction

Specialists in every field of work use their own domain-specific vocabulary, and it is desirable to share the same terminology amongst the professionals. Detailed domain terminology is not usually included in general language dictionaries, thus specialized terminological dictionaries are needed. With the need to share information unambiguously in different languages, terminological dictionaries link original terms to their translations. The taxonomical ordering of the terminology is described by term relations such as synonymy or hyperonymy and hyponymy. The information is presented and visualized in a way that helps the readers (both specialists and the general public) to understand the term meaning and usage in contexts. If the data are encoded properly, the system enables automatic processing and integration of the data in third-party applications.
Natural language is still evolving, and new words keep appearing or the usage and meaning of words changes. This evolution is even more noticeable in specialized vocabularies [1]. The thesaurus system thus can employ sophisticated methods of detecting emerging words and distinct new terms in the given domain by processing synchronous domain-oriented corpora.


The Natural Language Processing Centre (NLP Centre) at the Faculty of Informatics, Masaryk University, in cooperation with the Czech Office for Surveying, Mapping and Cadastre, is developing a system for building and extending a specialized terminological thesaurus for the domain of land surveying and the land cadastre. The project consists of two interconnected parts – an application to create, edit, browse and visualize the terminological thesaurus, and the tools to build a large corpus of domain oriented documents with the possibility to detect newly emerging terms, or terms missing from the thesaurus. Already available tools for corpus building and term extraction and the platform for dictionary applications are utilized. During the project, we are enhancing the corpus tools (mainly to support parallel multilingual corpora), building the thesaurus web application (not limited to a single domain), and developing methods to inter-connect the domain corpus with the terminological thesaurus.
The project is currently in its first phase. We have built the Czech corpus of land surveying oriented documents and we are able to detect domain specific terms. We have also developed the multiplatform web-based thesaurus editor and browser application based on the dictionary writing platform DEB. Although this project aims to build and manage the terminological thesaurus of the land surveying domain, the tools may be re-used for any other domain dictionary, thus stimulating the sharing of information and general awareness of the selected domain.

2 Specialized Corpus and Term Extraction

To build the specialized corpus for the land surveying and geoinformation domain, we have followed the principles designed for the creation of large corpora extracted and processed from web data. The data for the corpus were gathered from publicly available online resources utilizing two different methods developed by the NLP Centre.
Firstly, a set of main websites related to land surveying, the cadastre of real estates, and related topics was compiled. See Table 1 for details regarding the sources. Secondly, based on the content of these root websites, a broader set of documents from 1,063 websites was obtained utilizing the WebBootCat tool [3].

Table 1: Website resources for the specialized corpus

Website            Documents  Tokens     Unique documents  Unique tokens
www.cuzk.cz        16,405     3,137,795  15,289            340,943
www.vugtk.cz       4,659      6,419,950  3,212             4,386,238
csgk.fce.vutbr.cz  241        77,255     198               58,561
www.kgk.cz         417        44,814     414               29,890
www.sfdp.cz        192        35,287     106               11,279
www.czechmaps.cz   94         108,506    90                98,914
www.zememeric.cz   8,634      6,100,751  6,200             2,638,308

Fig. 1: Most frequent nouns in the land surveying corpora.

This method needs a set of “seed words” to search the web for relevant documents. We used the main domain terms obtained from an existing publicly available terminological dictionary [12] as the seed words. The resulting corpus is used for the extraction of new suggested terms for inclusion in the thesaurus. See Table 2 for detailed information on the downloaded documents and their distribution amongst the different sub-domains (as divided in the available terminological dictionary) covered by the thesaurus.
Non-textual and low quality content was removed from the downloaded documents utilizing the Justext tool [2]. Subsequently, duplicate documents or parts of the documents (e.g. paragraphs) were purged with the Onion tool [2].
Following the corpus creation, a list of “candidate terms” (proposals for inclusion in the thesaurus) was prepared. The candidate terms were extracted from the specialized land surveying and geoinformation corpus by employing the process of corpus comparison and keyword extraction [4,5].

Table 2: WebBootCat resources for the specialized corpus

Domain                    Documents  Tokens     Unique documents  Unique tokens
GPS system                118        250,833    117               221,315
metrology                 144        867,156    144               619,482
photogrammetry            42         244,212    42                227,731
geographical information  55         805,059    55                550,681
mapping                   213        858,575    212               722,080
cartography               368        1,358,973  365               1,124,708
cadastre of real estates  260        970,951    259               776,497
geodesy                   190        575,381    189               483,679
theory of errors          75         258,345    75                218,809
instrumental technology   115        187,106    113               173,984
engineering surveying     114        286,846    113               242,857

Fig. 2: Browsing the thesaurus, with detailed information for one term.

Frequencies of words and named entities in the specialized corpora are compared to the frequencies of the same phrases in a general language corpus (in this case, the biggest Czech corpus developed in the NLP Centre – czTenTen12 [6]). The best candidate terms have the highest frequency quotient [7].
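The quotient can be illustrated with the “simple maths” scoring of [5]: frequencies are normalized per million tokens and smoothed with a constant before the division. The smoothing constant n = 100 below is an illustrative choice, not the setting used in the project.

# Keyword scoring in the spirit of Kilgarriff's "simple maths" [5]:
# per-million frequencies smoothed with a constant n, ranked by quotient.
# n = 100 is an illustrative choice, not the project's actual setting.
def keyness(freq_spec, size_spec, freq_gen, size_gen, n=100):
    fpm_spec = freq_spec * 1_000_000 / size_spec
    fpm_gen = freq_gen * 1_000_000 / size_gen
    return (fpm_spec + n) / (fpm_gen + n)

# A term frequent in a 16M-token specialized corpus but rare in a large
# general corpus scores well above 1 and thus ranks high:
print(keyness(800, 16_000_000, 300, 5_000_000_000))   # ~1.5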

3 DEB Platform

Utilizing the experience from several lexicographic projects, we have designed and implemented a universal dictionary writing system that can be exploited in various lexicographic applications to build large lexical databases. The system is called Dictionary Editor and Browser, or DEB [8], and has been used in many lexicographic projects, e.g. for the development of the Czech Lexical Database [9], the currently running Pattern Dictionary of English Verbs [10], and Family Names in the UK [11].
The DEB platform is based on a client-server architecture, which brings many benefits. All the data are stored on a server and a considerable part of the functionality is also implemented on the server, while the client application can be very lightweight. This approach provides very good tools for editor team cooperation: data modifications are immediately seen by all users. The server also provides authentication and authorization tools.

4 Thesaurus Building

Although the main aim of the thesaurus development is publishing the authorized specialized terminology and its updates both to experts and to the general public, the thesaurus will contain a broad vocabulary of related terms.

Fig. 3: Editing the term entry.

Users may search even for unofficial terms, and thanks to the relations between the terms and the detailed information on the source of a given term, they will find the related terms and links to the recommended official variant.
To build a thesaurus covering broad domain vocabulary, several resources are combined. In the first stage, the current authorized terminological dictionary [12] (containing almost 4,000 term definitions and translations, but offering no taxonomy network) was combined with a hypero/hyponymic tree of over 6,800 entries (containing hyponymic relations, but no detailed information about the terms) and with 450 candidate terms extracted from the domain corpus.
The first two resources were available in HTML form, tagging some parts of the entry structure, but still leaving a lot of text in an unstructured format. It was necessary to tidy up the data and convert the resources to a unified XML format for the database storage.

Table 3: Thesaurus size statistics

total number of terms  8,783
hyponymic relations    10,020
meaning explanations   4,124
English translations   8,873
German translations    3,936
Slovak translations    3,511
Russian translations   2,762
French translations    3,936

Fig. 4: Corpus evidence for usage of the selected term (teodolit).

Some of the terms were shared by both dictionaries, thus combined term entries were created, containing both the detailed information on the terms and the term relations. See Table 3 for more details regarding the current size of the thesaurus.
In the next stage, the thesaurus will be expanded even more by including several resources:

– appropriate parts of GEMET1 (the General Multilingual Environmental Thesaurus),
– the regularly updated Registry of Territorial Identification (RUIAN)2,
– automatically extracted multi-lingual terms,
– suggestions from the public users.

5 Editing Tool

The thesaurus editing tool is implemented as a client-server application, with a DEB server providing the database and management back-end. The client-side application is a multiplatform web application accessible in any modern browser, built utilizing open-source technologies – the jQuery3 and SAPUI54 libraries for the graphical interface. The client and the server communicate using a standardized interface over HTTP; currently the JSON format is supported, and support for the SOAP web-service protocol will be added in the final version.

1 http://www.eionet.europa.eu/gemet
2 http://www.cuzk.cz/ruian/
3 http://jquery.com/
4 https://sapui5.netweaver.ondemand.com/

Fig. 5: Entry relations visualized.

The standardized application interface also allows the integration of third-party applications that would like to re-use the thesaurus data. One of the intended use-cases is the integration into the Geoportal5, where the terms are to be used for document metadata and categorization.
The thesaurus web application itself provides a graphical interface for browsing the hyponymic tree (see Figure 2). Out of the several possible visualizations of the tree, the expanding multi-level tree was selected; although it may not display all the relations in a proper graph form, it is much more intuitive for the users. If a term has several hypernyms, it is displayed multiple times in the tree structure. To graphically visualize the relations of a term, a graph of hypernyms, synonyms, and other related terms is displayed (see Figure 5).
For each term, a detailed description is given, including the meaning explanation, translations, or accepted variants. When more sources are incorporated in the thesaurus, the reliability of each source and the revision history will be presented to the users. Source reliability follows the rating scale of the Office for Surveying – the most reliable are terms authorized by the terminological committee, followed by terms used in scientific journals, with the terms made up by the general public at the bottom of the scale. Users or third-party applications may decide which sources or terms they prefer to work with.
To get a better picture of a term and its usage, extended information from the corpus is presented. Users may consult full examples (see Figure 4) or related words from the corpus (see Figure 6).

6 Conclusions and Future Work

In the next phase of the project, we plan to extend the multi-lingual and multi-source aspects of the thesaurus.

5 http://geoportal.cuzk.cz/

Fig. 6: Related words for the selected term (geodet).

Based on the successful evaluation of the Czech specialized corpus for the land surveying domain, we will build corpora in more languages – English, French, German – with the possibility of other languages, depending on the availability of source documents. We will provide automatically extracted terminology from these corpora as suggestions for terminology translation.
Hand in hand with adding more sources for the thesaurus terms, the editing and browsing application will offer options for filtering the terms based on the source reliability and authorization status, and will support periodic semi-automatic imports from authorized sources.

Acknowledgements This work has been partly supported by TACR in the project TB02CUZK004.

References

1. Fischer, R.: Lexical change in present-day English: A corpus-based study of the motivation, institutionalization, and productivity of creative neologisms. Volume 17. Gunter Narr Verlag (1998)
2. Pomikálek, J.: Removing Boilerplate and Duplicate Content from Web Corpora. PhD thesis, Masaryk University, Faculty of Informatics (2011)
3. Baroni, M., Kilgarriff, A., Pomikálek, J., Rychlý, P.: WebBootCaT: instant domain-specific corpora to support human translators. In: Proceedings of EAMT 2006 – 11th Annual Conference of the European Association for Machine Translation, Oslo,

The Norwegian National LOGON Consortium and the Departments of Computer Science and Linguistics and Nordic Studies at Oslo University (Norway) (2006) 247–252
4. Kilgarriff, A.: Comparing corpora. International Journal of Corpus Linguistics 6(1) (2001) 97–133
5. Kilgarriff, A.: Simple maths for keywords. In: Proc. Corpus Linguistics (2009)
6. Suchomel, V.: Recent Czech web corpora. In Horák, A., Rychlý, P., eds.: 6th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (2012) 77–83
7. Kilgarriff, A., Jakubíček, M., Kovář, V., Rychlý, P., Suchomel, V.: Finding terms in corpora for many languages with the Sketch Engine. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, The Association for Computational Linguistics (2014) 53–56
8. Horák, A., Rambousek, A.: DEB Platform Deployment – Current Applications. In: RASLAN 2007: Recent Advances in Slavonic Natural Language Processing, Brno, Czech Republic, Masaryk University (2007) 3–11
9. Horák, A., Rambousek, A.: PRALED – A New Kind of Lexicographic Workstation. In: Computational Linguistics. Springer (2013) 131–141
10. Hanks, P.: Corpus pattern analysis. In: Proceedings of the Eleventh EURALEX International Congress, Lorient, France, Université de Bretagne-Sud (2004)
11. Hanks, P., Cullen, P., Draper, S., Coates, R.: Family names of the United Kingdom. (2014)
12. Hánek, P.: Terminologický slovník zeměměřictví a katastru nemovitostí. Výzkumný ústav geodetický, topografický a kartografický, v.v.i. (2012)

Mapping Czech and English Valency Lexicons: Preliminary Report

Vít Baisa, Karel Pala, Zdeňka Sitová, and Jakub Vonšovský

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {xbaisa,pala,xsitova,xvonsovs}@fi.muni.cz

Abstract. We describe here a very first attempt to connect two valency lexicons: the Pattern Dictionary of English Verbs (PDEV) and VerbaLex. Both lexicons contain verbs together with their syntactic structure (arguments of the verbal predicate) and semantic restrictions (semantic types typical for a given verb argument). The lexicons are similar overall but differ in details, since their formalisms are tailored to the respective languages. They also differ in the way they have been built: whilst the former resource has been built using the Corpus Pattern Analysis methodology, the latter has been built upon the previous datasets BRIEF, Vallex and Czech WordNet. We present preliminary work on linking English patterns in PDEV with their Czech equivalents: frames in VerbaLex.

Keywords: valency, lexicon, PDEV, CPA, VerbaLex, ontology, WordNet

1 Introduction

Valency lexicons are lexical resources containing valency frames (patterns) of individual verbs. The frames contain information about the verb arguments (such as direct and indirect object, subject), their morphosyntactic properties (such as cases in Czech) and their semantic roles (such as agens, patient, instrument).
For Czech there are two valency lexicons: one is Vallex [1] by Žabokrtský (approximately 6,000 Czech verbs), based on the formalism of the Functional Generative Description (FGD); the other is VerbaLex [2], developed by Hlaváčková et al. (approximately 10,500 Czech verbs). For English we use PDEV by Hanks et al. [3].
The mentioned valency lexicons for Czech basically share the morphosyntactic information (about cases and adverbial phrases), but they differ in their inventories of semantic roles: Vallex uses about 40 roles, VerbaLex uses complex roles consisting of the main roles (48) and selectional restrictions (900). In PDEV, the description of the morphosyntactic properties of the verb arguments is different from the Czech lexicons, as English displays fixed word order (SVOMPT). The semantic roles and types are based on Pustejovsky’s shallow ontology containing 228 items.


Our goal is to exploit the overall similarity (the structure of frames and patterns) and propose possible equivalents of English patterns and Czech frames. In this paper we present a preliminary analysis of correspondences between the two lexical resources. We believe that the resulting translation valency dictionary would be a very useful resource for natural language processing tasks, mainly for machine translation.

2 Pattern Dictionary of English Verbs

The PDEV1 [4] is the result of long-term work by Patrick Hanks and his colleagues. Currently, it is being developed within the project Disambiguation of Verbs by Collocation2 (DVC) at the University of Wolverhampton. The method of building the lexicon is based on finding corpus evidence: the English verb patterns are created only when observed in a sample of corpus examples for a given English verb. This technique of Corpus Pattern Analysis (CPA) was invented by Patrick Hanks [4]. The corpus used in CPA is the written part of the British National Corpus.
The focus of CPA is on the prototypical syntagmatic patterns with which verbs in use are associated. Verb patterns in PDEV consist not only of the basic argument structure or valency structure of each verb (typically with semantic values stated for each of the elements), but also of subvalency features, where relevant, such as the presence or absence of a determiner in noun phrases constituting a direct object. For example, the meaning of take place is quite different from the meaning of take his place. The possessive determiner makes all the difference to the meaning in this case.

3 VerbaLex

The Czech lexicon3 [2] was initially based on the following resources:

1. the starting repertoire of verbs was taken from the syntactic lexicon of verb valencies called BRIEF by Pala and Ševeček [5],
2. the Czech WordNet valency lexicon developed within the BalkaNet project,
3. the tool for handling the structure of the lexicon was partially inspired by the editor developed for the above-mentioned Vallex; a new editor has been developed and is used for editing and browsing VerbaLex.

The verbs in VerbaLex are grouped into synsets in the same way as in Princeton WordNet [6]. Approximately 8,000 of them are linked to the equivalent English WordNet synsets.

1 http://www.pdev.org.uk
2 http://clg.wlv.ac.uk/projects/DVC/
3 http://nlp.fi.muni.cz/verbalex/html2/generated/alphabet/

4 Related work

We have already mentioned Vallex as a similar resource for Czech. FrameNet for English [7] should be mentioned as well, though it is not only a verb lexicon. Valency lexicons are available for a number of languages: German [8], French, Italian, Russian [9], Polish [10] and others.
There have been some attempts at linking valency lexicons, e.g. [11] describes ongoing efforts in aligning the two valency lexicons PDT-VALLEX and EngValLex on the basis of a parallel treebank. The token alignment is done manually by annotators whose task is to go through the verb occurrences in the treebank, collect a typical representative of a frame mapping, and control and decide potential conflicting cases. Once collected, the frame mapping is automatically applied to all its other potential representatives.
Related to our effort is also EngValLex [12] – a transformation of the PropBank [13] lexicon to the structure of Vallex. After a linguistic comparison of PropBank and Vallex, PropBank was automatically converted to an FGD-compliant form which was later manually refined. The method is as follows: first, all slots have been renamed using functors; second, the non-obligatory free modifiers have been deleted and optional elements marked; third, frames corresponding to the same verb sense have been merged; fourth, the lexicon has been refined in the process of treebank annotation by the addition of other frames and whole verb lemmas, and the PropBank adapted frames were corrected manually with respect to the language data available in the English part of the parallel treebank [11].

5 Analysis of differences and similarities

For this study, the verbs have been selected such that there was only one pattern in PDEV, which helps the translation into VerbaLex and avoids ambiguity. There are 313 single-pattern verbs in PDEV. For some of them it is not possible to find Czech translation equivalents,4 thus they have been left out from further analysis.
There are some features (grammatical categories) in Czech that do not have their respective counterparts in English. One of them is the category of aspect: in the regular cases in which the members of an aspect pair preserve the same meaning, the category of aspect can remain in the frames, as in a pair like zrychlit, zrychlovat (to accelerate). A similar category that should be preserved is the category of case (7 grammatical cases in Czech).
Some verbs in PDEV which simply do not have direct translation equivalents in VerbaLex (calcify, demystify, ignore, . . . ) are excluded from further considerations. On the other hand, there are many verbs in VerbaLex for which we cannot find translational equivalents in PDEV, because it is still too small,5 or because many complex Czech verbs cannot be translated on the lexical level, for example: povytáhnout (to pull something out a bit), poposednout si (to move on a bit), etc.

4 This is caused by special terminology from very limited domains in the BNC.
5 There are roughly 1,100 completed verbs in PDEV.

If we look at PDEV and VerbaLex, we can observe that their ontological structures are considerably different, which complicates the mapping. The ontology in VerbaLex is partly based on the Top Ontology used in EuroWordNet [14] and on selected literals from Princeton WordNet. In PDEV, the shallow ontology by Pustejovsky [15] is used. For example, the Group class in PDEV contains the subclasses Human Group, Vehicle Group, Animal Group and Physical Object Group, which have their own respective categories in VerbaLex. Only very few classes inherit their mapping, such as PDEV Machine → ART|INS in VerbaLex. This means we have to uncover relations between every single class by analysing more and more words. Nevertheless, so far it seems we can go up in the classes to find a match, such as for water, which corresponds to SUBS in VerbaLex and has its own class in PDEV. In one of our analyses it maps SUBS to Entity, which has the Water class as one of its descendants.
From 21 analyses, 9 patterns were mapped without any problems from one lexicon to the other. 10 patterns were mapped with some imperfections, such as missing frames in PDEV (for example, out of 5 frames in VerbaLex only 2 had a match in PDEV) or small mismatches in frames (an obligatory requirement in VerbaLex). Those small mismatches in frames, which happened in 2 cases, could be somehow penalized in an automatic tool. Only one record was unmappable (burrow), because the frames were mismatched (VerbaLex did not cover the case of burrowing animals), and one record was not present in VerbaLex at all (disregard). For some examples, see Table 1.

Table 1: Some mapping examples, PDEV on left, VerbaLex on right

Animate physical object
  Human Group | Human           AG
  Animal | Bird (all animals)   AG
  Institution                   GROUP|AG
Precise mappings
  Machine                       ART|INS
  Body Part                     PART
  Artwork                       COM
  Fluid | Beverage              SUBS
Imprecise mappings (one of possible mappings)
  Action                        MAN,how
  Activity                      ACT
  Eventuality                   Event
  Entity                        GROUP
  Physical Object               OBJ
  Anything causes Anything      REAS,díky,kvůli

In the mapping, AG (agens) can also be PAT (patient), ENT (entity) or SOC (associate); thus Human would map to AG as well as to PAT.
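A rule-based linker could start from the mappings of Table 1 expressed as a lookup table. The sketch below is derived from the table only, with alternative VerbaLex roles kept as lists.

# The mappings of Table 1 as a lookup table; alternative VerbaLex roles
# are kept as lists. Derived from the table, for illustration only.
PDEV_TO_VERBALEX = {
    "Human Group": ["AG"],
    "Human": ["AG", "PAT", "ENT", "SOC"],   # AG can also stand for PAT/ENT/SOC
    "Animal": ["AG"],
    "Bird": ["AG"],
    "Institution": ["GROUP", "AG"],
    "Machine": ["ART", "INS"],
    "Body Part": ["PART"],
    "Artwork": ["COM"],
    "Fluid": ["SUBS"],
    "Beverage": ["SUBS"],
}

def propose_roles(pdev_type):
    """Candidate VerbaLex roles for a PDEV semantic type (empty if unknown)."""
    return PDEV_TO_VERBALEX.get(pdev_type, [])

print(propose_roles("Machine"))   # ['ART', 'INS']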

Record analysis example

VerbaLex frames (VN) and PDEV patterns (PN) are in typewriter typeface. PDEV pattern implicatures are in italics.

rozmazlovat (cosset)
[V1] AG V PAT
[V2] AG V PAT (ACT)
[P1] [[Human 1]] cosset [[Human 2]]
[[Human 1]] cares for [[Human 2]] in an excessively protective and fussy way
Comment: exact match.

zakrýt (blanket)
[V1] (překrýt) OBJ V OBJ
[V2] (překrýt) OBJ V OBJ PART
[V3] AG V PAT (ART)
[P1] [[Stuff|{PhysicalObject1=PLURAL}]] blanket [[Location|PhysicalObject2]]
[[Location|Physical Object 2]] becomes covered by a layer of [[Stuff|Physical Object 1 = PLURAL]]
Comment: Physical Object is mapped to OBJ here, but in fact it is still a quite general class.

zbankrotovat (bankrupt)
[V1,2] ENT V (REASdíky,kvůli)
[P1] [[Human1|Instit1|Event]] bankrupt [[Human2|Instit2]]
[[Human1|Instit1|Event]] causes [[Human2|Instit2]] to not have enough money to pay his or her or its debts
Comment: Human 2 | Institution 2 → to |. REAS maps to A causes B.

6 Discussion & conclusions

So far we have uncovered several promising relations between the two lexicons. Unfortunately, as their ontology structures are completely different, we would need to analyse tens or hundreds of possible pairs to get a more complete picture of the possible mappings. From about 20 records we are already able to map animate physical objects and some of the inanimate objects, which together form the biggest group in PDEV. This is a basis for further investigation and for a rule-based approach to proposing and linking possible equivalents from PDEV and VerbaLex. The main problems consist of the following:

1. The ontologies used in PDEV and VerbaLex are structured differently: there is a shallow ontology in PDEV and a sort of Aristotelian ontology based on the Top Ontology from EuroWordNet in VerbaLex.
2. The basic items in VerbaLex are synsets usually containing more than one verb lemma, whereas in PDEV the basic items are the individual verb lemmas. This, however, can be handled by obtaining appropriate lists from VerbaLex (by expanding and filtering the verb list).

The comparison of the two ontologies is a separate task that should be further investigated and deserves a separate paper. We hope that in the near future we will be able to propose and implement an automatic tool that translates PDEV patterns to VerbaLex frames and vice versa with high accuracy.

Acknowledgement This work was partially supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Czech-Norwegian Research Programme within the HaBiT Project 7F14047.

References

1. Žabokrtský, Z., Lopatková, M.: Valency information in VALLEX 2.0. The Prague Bulletin of Mathematical Linguistics (87) (2007) 41–60
2. Hlaváčková, D., Horák, A.: VerbaLex – new comprehensive lexicon of verb valencies for Czech. In: Proceedings of the Slovko Conference (2005)
3. Hanks, P., Pustejovsky, J.: A pattern dictionary for natural language processing. Revue française de linguistique appliquée 10(2) (2005) 63–82
4. Hanks, P.: Lexical Analysis: Norms and Exploitations. MIT Press (2013)
5. Pala, K., Ševeček, P., et al.: Valence českých sloves. (1997)
6. Fellbaum, C.: WordNet. Wiley Online Library (1998)
7. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics – Volume 1, Association for Computational Linguistics (1998) 86–90
8. Hinrichs, E.W., Telljohann, H.: Constructing a valence lexicon for a treebank of German. In: Proceedings of the Seventh International Workshop on Treebanks and Linguistic Theories (2009) 41–52
9. Lyashevskaya, O.: Bank of Russian constructions and valencies. In: LREC (2010)
10. Przepiórkowski, A.: Towards the design of a syntactico-semantic lexicon for Polish. In: Intelligent Information Processing and Web Mining. Springer (2004) 237–246
11. Šindlerová, J., Bojar, O.: Building a bilingual vallex using treebank token alignment: First observations. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10) (2010)
12. Cinková, S.: From PropBank to EngValLex: Adapting the PropBank lexicon to the valency theory of the Functional Generative Description. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06) (2006)
13. Palmer, M., Gildea, D., Kingsbury, P.: The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics 31(1) (2005) 71–106

14. Vossen, P.: EuroWordNet: A multilingual database with lexical semantic networks. Springer (1998)
15. Rumshisky, A., Hanks, P., Havasi, C., Pustejovsky, J.: Constructing a corpus-based ontology using model bias. In: FLAIRS Conference (2006) 327–332

Tools for Fast Morphological Analysis Based on Finite State Automata

Pavel Šmerk

Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic [email protected]

Abstract. The paper presents a new implementation of some of Jan Daciuk’s algorithms and tools for morphological analysis based on finite state automata [1]. In particular, we offer a reimplemented version of the tool which builds the automata from an input set of strings, and of the tool which performs the morphological analysis itself. In addition to the 8-bit versions we also offer “Unicode-aware” versions with the Unicode characters encoded directly in the arcs of the automaton. The new implementation is faster than the original one and its code is much simpler and more straightforward.

Keywords: morphological analysis, minimal deterministic finite state automata

1 Introduction

Computational morphological analysis is one of the first steps in the automatic treatment of natural language texts. Just after splitting the processed text into words, we usually need a tool which for each such word returns its possible corresponding lexical entries (lemmata) and the relevant grammatical information.
For languages with limited compounding and with morphology realized mainly by changes at the end of the word (Slavic languages can be taken as an example), it is possible to describe their morphology by means of a simple list of all words (word forms) and their possible interpretations, and to let the morphological analyzer only search this list for each input word. Of course, it would not be feasible to search such a list directly due to its huge size. However, the list can be viewed as a finite (formal) language, and consequently it can be represented by a minimal deterministic acyclic finite state automaton, which can be pretty small. The morphological analyzer then only follows a path in the automaton according to the letters of the analyzed word and, if a corresponding path exists, returns all possible continuations of this path as a result.
In the following section we describe the input data format and illustrate how the morphological analysis works. In the next section we present the results

Aleš Horák, Pavel Rychlý (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014, pp. 147–150, 2014. ○c NLP Consulting 2014 148 Pavel Šmerk of newly reimplemented tools and finally we discuss some possible future modifications.

2 Data for morphological analysis

The data for morphological analysis are simply a list of all combinations of recognized input strings and the corresponding outputs of the analyzer, where pairs of two words are encoded as pairs formed by the first word and a difference between the words [2]. For example, consider the following part of the data for word form → lemma + tag analysis (with the original data on the right side and the encoded form on the left side):

ježek:A:k1gMnSc1   ← ježek:ježek:k1gMnSc1
ježka:Cek:k1gMnSc2 ← ježka:ježek:k1gMnSc2
ježka:Cek:k1gMnSc4 ← ježka:ježek:k1gMnSc4
krtek:A:k1gMnSc1   ← krtek:krtek:k1gMnSc1
krtka:Cek:k1gMnSc2 ← krtka:krtek:k1gMnSc2
krtka:Cek:k1gMnSc4 ← krtka:krtek:k1gMnSc4

The (first) colon is a delimiter between the possible inputs and the corresponding outputs, and the letters A and C, as the first and the third letters of the alphabet, mean "to get the lemma, delete the n-1 (i.e. 0 or 2, respectively) last characters from the word form and then attach the rest of the string (i.e. the empty string or ek, respectively)". For example, the word form krtek will be analyzed as the lemma krtek with the morphological tag k1gMnSc1 (mole in the nominative form), and the word form krtka as the lemma krtek with the morphological tags k1gMnSc2 or k1gMnSc4 (mole in the genitive and accusative forms). A small sketch of this decoding step follows.
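As an illustration, the following minimal sketch (in Python) performs the decoding step just described; the function decode_lemma and the test strings are purely illustrative and are not part of the presented tools.

# Hypothetical helper: decode a lemma from a word form and the encoded
# difference, assuming the A/B/C... convention described above.
def decode_lemma(word_form: str, code: str) -> str:
    strip = ord(code[0]) - ord("A")        # A -> strip 0, C -> strip 2, ...
    stem = word_form[:-strip] if strip else word_form
    return stem + code[1:]                 # attach the rest of the string

assert decode_lemma("krtka", "Cek") == "krtek"
assert decode_lemma("krtek", "A") == "krtek"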

Such a list is then represented as a minimal deterministic acyclic finite state automaton using Jan Daciuk's algorithms for the incremental building of minimal DAFSAs [1]. The first graph below corresponds to the non-minimized automaton (trie) and the second graph corresponds to the minimized automaton used for the morphological analysis.

[Figure: trie (non-minimized automaton) for the six example entries, with separate branches for ježek and krtek]

[Figure: minimized automaton, in which the two branches share the common suffix states]

This representation dramatically reduces the size of the data (some particular figures can be seen later in Table 1). The lookup is then very simple: if the analysed string concatenated with the delimiter is found in the automaton, then each possible remaining path to a final state of the automaton encodes one of the possible analyses. This means that there is no "real" analysis by some sophisticated algorithm over a grammar model or a system of paradigms; the whole analysis is only a simple, and therefore fast, dictionary search, as the sketch below illustrates.
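The following sketch shows the whole lookup in illustrative Python, with a plain dict-based trie standing in for the compact automaton; build_trie and analyse are hypothetical names, not the actual tools' API.

def build_trie(entries):
    # entries are encoded lines such as "krtka:Cek:k1gMnSc2"
    root = {}
    for line in entries:
        node = root
        for ch in line:
            node = node.setdefault(ch, {})
        node[None] = True                 # None marks a final state
    return root

def analyse(trie, word, delimiter=":"):
    node = trie
    for ch in word + delimiter:           # follow the path for "word:"
        if ch not in node:
            return []                     # unknown word form
        node = node[ch]
    results = []
    def collect(n, acc):                  # every remaining path to a
        if None in n:                     # final state is one analysis
            results.append(acc)
        for c, child in n.items():
            if c is not None:
                collect(child, acc + c)
    collect(node, "")
    return results

trie = build_trie(["krtka:Cek:k1gMnSc2", "krtka:Cek:k1gMnSc4"])
print(analyse(trie, "krtka"))             # ['Cek:k1gMnSc2', 'Cek:k1gMnSc4']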

3 Experiments and results

3.1 Building the data for morphological analysis

We demonstrate our results on four sets of data. The English and Russian morphological data are from the FreeLing project; the data for Czech are ours. We compare our new implementation with Daciuk's original implementation³ and with the Java reimplementation by David Weiss⁴ from the Morfologik project (it also offers a more compact format at the price of a greater build time; for details refer to [3]). For the purpose of comparison, our implementation produces binary identical output (except for a custom header) to the original Daciuk's fsa_build built with the -DFLEXIBLE -DNEXTBIT -DSTOPBIT compile options. The first table describes the data sets; the last column gives the size of the resulting automaton (both input and output sizes are in bytes).

Table 1: Data sets used in the experiments.

data set      input size   words (lines)   output size
EN             1,417,920          88,652       244,764
RU           114,605,988       2,844,516     3,639,960
CZ free      105,001,670       3,393,080       931,594
CZ full      828,973,970      27,764,093     3,795,423

The second table presents build times for the three compared tools.

Table 2: Build times in seconds.

data set   fsa_build   morfologik   new implem.
EN              0.24         0.59          0.08
CZ free        12.63         7.50          4.19
RU             26.04        10.19          9.41
CZ full       121.41        57.21         41.71

³ www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/fsa.html, version 0.51
⁴ http://sourceforge.net/projects/morfologik/, version 1.9.0

3.2 Morphological analyzer and UTF-8 versions

We also offer a new implementation of the morphological analyzer. Unfortunately, we were not able to make the fsa_morph case conversions work even with Daciuk's original language files, so it is difficult to compare analysis times in real-world scenarios. When we changed our analyser to imitate the (somewhat broken) fsa_morph output on 10 million Czech words from a corpus, our tool was ca. 20 % faster (12.51 s vs. 15.25 s). Daciuk's tools can be compiled with UTF-8 support, but this requires the user to describe the case conversions and the adding/removal of diacritics. We have UTF-8 variants of our tools which work with automata with UTF-8 labels, and our case conversions and diacritics restoration follow the Unicode standard, which means one solution for all languages; a sketch of this approach is given below. Besides speed, another advantage of our solution is much shorter and more straightforward code. It is not easy to make a fair comparison, but, for example, the source files needed for fsa_build amount to more than 200 kB in total, whereas the code of our new implementation is less than 20 kB. It is obvious that with such a huge difference it is easier to maintain, adjust and further develop the shorter code. The new tools are accessible from http://nlp.fi.muni.cz/ma.
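As a sketch of the Unicode-based approach (illustrative Python using only the standard library; fold is a hypothetical name, and the real tools operate over automaton labels rather than plain strings), case conversion and diacritics removal can be done uniformly for all languages:

import unicodedata

def fold(word):
    lowered = word.lower()                 # Unicode-aware lowercasing
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))  # drop combining marks

print(fold("Ježek"))                       # prints: jezek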

4 Future work

The new tools are ready to use and the presented results are promising, but this is still work in progress. We plan to further reduce both the construction time and the final size of the automata. We want to employ a variable-length encoding of Unicode codepoints, numbers and addresses (similar to [1], but computationally simpler). We suspect that Daciuk's "tree index", used for discovering already known nodes during the automaton construction, is slow for large data, and we hope that a simple hash will decrease the compilation time significantly; the sketch below shows the idea.
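The following fragment sketches such a hash-based register of states (illustrative Python; State, signature and replace_or_register are hypothetical names following the register technique of incremental DAFSA construction, not our actual code):

class State:
    def __init__(self):
        self.final = False
        self.edges = {}                    # label -> child State

    def signature(self):
        # two states are equivalent iff their finality and outgoing
        # transitions (label, identity of target) coincide
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.edges.items())))

register = {}

def replace_or_register(state):
    # return the already registered equivalent state, or register this one
    return register.setdefault(state.signature(), state)

s1, s2 = State(), State()
assert replace_or_register(s1) is s1      # first occurrence is registered
assert replace_or_register(s2) is s1      # an equivalent state is merged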

Acknowledgements This work has been partly supported by the Ministry of Education of CR within the Lindat Clarin Center LM2010013.

References

1. Daciuk, J.: Incremental Construction of Finite-State Automata and Transducers, and their Use in the Natural Language Processing. PhD thesis, Technical University of Gdańsk, Gdańsk (1998)
2. Kowaltowski, T., Lucchesi, C.L., Stolfi, J.: Finite Automata and Efficient Lexicon Implementation. Technical Report IC-98-02, University of Campinas, São Paulo (1998)
3. Daciuk, J., Weiss, D.: Smaller representation of finite state automata. Theoretical Computer Science 450(0) (2012) 10–21

Subject Index

CAT 27
cluster 77
– name 77
concordance 63
corpus 3, 11, 37, 49, 63, 85, 91
– building 129
– tools 71
CPA 139
DEB platform 129
dialogue 49
dictionary 129, 139
– terminological 129
distributional semantics 107
evaluation 57
finite state automata 147
Gensim 107
GIZA++ 27
information retrieval 107
inter-annotator agreement 57
KSPC 11
language learning 63
language model 3, 11, 27
lexicon 139
longest common prefix 3
machine learning 19
Manatee 37
math entailment 107
MemoQ 27
METEOR 27
modus questions 49
morphological analysis 147
Moses 27
mutual information 91
named entities 91
ontology 139
PDEV 139
phrase detection 97
phrase generation 97
predictive writing 11
question answering 121
– evaluation 121
– syntax-based 121
random text generator 3
regular expression 37
search 97
– and replace 97
segment 27
– subsegment leveraging 27
Sketch Engine 63
stop-word list 85
style marker 85
stylometry 19
subject-predicative complement 97
suffix array 3
term extraction 129
textual entailment 107
thesaurus 63, 77, 129
– distributional 77
tokenisation 71
translation 27
– memory 27
– partial 27
trie 3
valency 139
word cluster 77
word matrix 27
word set 77
word sketch 63
WordNet 139

Author Index

Baisa, Vít 3, 27, 63, 139
Bušta, Josef 27
Grác, Marek 91
Horák, Aleš 27, 121, 129
Jakubíček, Miloš 37, 57
Kazakovskaya, Victoria V. 49
Khokhlova, Maria V. 49
Kocincová, Lucia 129
Kovář, Vojtěch 57
Medveď, Marek 85, 121
Michelfeit, Jan 71
Nevěřilová, Zuzana 11, 97
Pakray, Partha 107
Pala, Karel 139
Pomikálek, Jan 71
Rambousek, Adam 129
Rychlý, Pavel 37, 57, 77
Rygl, Jan 19, 85
Sitová, Zdeňka 139
Sojka, Petr 107
Suchomel, Vít 63, 71, 97, 129
Šmerk, Pavel 147
Ulipová, Barbora 11, 91
Vonšovský, Jakub 139

RASLAN 2014
Eighth Workshop on Recent Advances in Slavonic Natural Language Processing

Editors: Aleš Horák, Pavel Rychlý
Typesetting: Adam Rambousek
Cover design: Petr Sojka

Published by NLP Consulting

Printed by Tribun EU s. r. o., Cejl 32, 602 00 Brno, Czech Republic

First edition at NLP Consulting Brno 2014

ISSN 2336-4289