RASLAN 2011 Recent Advances in Slavonic Natural Language Processing
Total Page:16
File Type:pdf, Size:1020Kb
RASLAN 2011 Recent Advances in Slavonic Natural Language Processing A. Horák, P. Rychlý (Eds.) RASLAN 2011 Recent Advances in Slavonic Natural Language Processing Fifth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2011 Karlova Studánka, Czech Republic, December 2–4, 2011 Proceedings Tribun EU 2011 Proceedings Editors Aleš Horák Faculty of Informatics, Masaryk University Department of Information Technologies Botanická 68a CZ-602 00 Brno, Czech Republic Email: [email protected] Pavel Rychlý Faculty of Informatics, Masaryk University Department of Information Technologies Botanická 68a CZ-602 00 Brno, Czech Republic Email: [email protected] This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, in its current version, and permission for use must always be obtained from Tribun EU. Violations are liable for prosecution under the Czech Copyright Law. Editors ○c Aleš Horák, 2011; Pavel Rychlý, 2011 Typography ○c Adam Rambousek, 2011 Cover ○c Petr Sojka, 2010 This edition ○c Tribun EU, Brno, 2011 ISBN 978-80-263-0077-9 Preface This volume contains the Proceedings of the Fifth Workshop on Recent Ad- vances in Slavonic Natural Language Processing (RASLAN 2011) held on De- cember 2nd–4th 2011 in Karlova Studánka, Sporthotel Kurzovní, Jeseníky, Czech Republic. The RASLAN Workshop is an event dedicated to the exchange of informa- tion between research teams working on the projects of computer processing of Slavonic languages and related areas going on in the NLP Centre at the Faculty of Informatics, Masaryk University, Brno. RASLAN is focused on theoretical as well as technical aspects of the project work, on presentations of verified methods together with descriptions of development trends. The workshop also serves as a place for discussion about new ideas. The intention is to have it as a forum for presentation and discussion of the latest developments in the field of language engineering, especially for undergraduates and postgraduates af- filiated to the NLP Centre at FI MU. We also have to mention the cooperation with the Dept. of Computer Science FEI, VŠB Technical University Ostrava. Topics of the Workshop cover a wide range of subfields from the area of artificial intelligence and natural language processing including (but not limited to): * text corpora and tagging * syntactic parsing * sense disambiguation * machine translation, computer lexicography * semantic networks and ontologies * semantic web * knowledge representation * logical analysis of natural language * applied systems and software for NLP RASLAN 2011 offers a rich program of presentations, short talks, technical papers and mainly discussions. A total of 15 papers were accepted, contributed altogether by 22 authors. Our thanks go to the Program Committee members and we would also like to express our appreciation to all the members of the Organizing Committee for their tireless efforts in organizing the Workshop and ensuring its smooth running. In particular, we would like to mention the work of Aleš Horák, Pavel Rychlý and Dana Hlaváˇcková.The TEXpertise of Adam Rambousek (based on LATEX macros prepared by Petr Sojka) resulted in the extremely speedy and efficient production of the volume which you are now holding in your hands. Last but not least, the cooperation of Tribun EU as both publisher and printer of these proceedings is gratefully acknowledged. Brno, December 2011 Karel Pala Table of Contents I Logic and Language Analyzing Time-Related Clauses in Transparent Intensional Logic . 3 Aleš Horák, Miloš Jakubíˇcek,VojtˇechKováˇr Mind modeling using Transparent Intensional Logic . 11 Andrej Gardoˇn Verbs as Predicates: Towards Inference in a Discourse . 19 Zuzana Nevˇeˇrilová,Marek Grác II Syntax, Morphology and Lexicon Czech Morphological Tagset Revisited . 29 Miloš Jakubíˇcek,VojtˇechKováˇr,Pavel Šmerk Korean Parsing in an Extended Categorial Grammar . 43 Juyeon Kang Extracting Phrases from PDT 2.0 . 51 Vašek Nˇemˇcík Extended VerbaLex Editor Interface Based on the DEB Platform . 59 Dana Hlaváˇcková,Adam Rambousek III Text Corpora Building Corpora of Technical Texts: Approaches and Tools . 69 Petr Sojka, Martin Líška, Michal R˚užiˇcka Corpus-based Disambiguation for Machine Translation . 81 Vít Baisa Building a 50M Corpus of Tajik Language . 89 Gulshan Dovudov, Jan Pomikálek, Vít Suchomel, Pavel Šmerk Practical Web Crawling for Text Corpora . 97 Vít Suchomel, Jan Pomikálek VIII Table of Contents IV Language Modelling A Bayesian Approach to Query Language Identification . 111 JiˇríMaterna and Juraj Hreško A Framework for Authorship Identification in the Internet Environment . 117 Jan Rygl, Aleš Horák chared: Character Encoding Detection with a Known Language . 125 Jan Pomikálek, Vít Suchomel Words’ Burstiness in Language Models . 131 Pavel Rychlý Author Index ..................................................... 139 Part I Logic and Language Analyzing Time-Related Clauses in Transparent Intensional Logic Aleš Horák, Miloš Jakubíˇcek,VojtˇechKováˇr Faculty of Informatics Masaryk University Botanická 68a, 602 00 Brno, Czech Republic fhales, jak, [email protected] Abstract. The Normal Translation Algorithm (NTA) for Transparent In- tensional Logic (TIL) describes the key parts of standard translation from natural language sentences to logical formulae. NTA was specified for the Czech language as a representative of a free-word-order language with rich morphological system. In this paper, we describe the implementation of the sentence building part of NTA within a module of logical analysis in the Czech language syntactic parser synt. We show the respective lexicons and data structures in clause processing with numerous examples concentrating on the tem- poral aspects of clauses. Key words: Transparent Intensional Logic, TIL, Normal Translation Al- gorithm, NTA, logical analysis, temporal analysis, time-related clauses 1 Introduction Logical representation of a natural language sentence forms a firm basis for se- mantic analysis, machine learning techniques and automated reasoning. In cur- rent practical systems all these fields build on propositional and first-order logic due to the advantages and simplicity of their processing [1]. However, it can be shown that first-order formalisms are not able to handle systematically the natural language phenomena like intensionality, belief attitudes, grammatical tenses and modalities (modal verbs and modal particles in natural language). In the following text, we describe a system theoretically based on the formalism of the Transparent Intensional Logic (TIL [2]), an expressive logical system introduced by Pavel Tichý [3], which works with a complex hierarchy of types, system of possible worlds and times and an inductive system for derivation of new facts from a knowledge base in development. The actual implementation of the system works together with the system for syntactic analysis of the Czech language named synt. The parsing mechanism in synt [4,5] is based on the meta-grammar formalism working with a grammar of about 250 meta-rules with contextual actions for various phrase and sentence level tests. The Normal Translation Algorithm (NTA [6]) as a standard way of logical analysis of natural language by means of the compositions of syntactic Aleš Horák, Pavel Rychlý (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011, pp. 3–9, 2011. ○c Tribun EU 2011 4 Aleš Horák, Miloš Jakubíˇcek,VojtˇechKováˇr constituents such as verb phrase, noun phrase, clause and their modifiers was provided by a basic prototype implementation. Current works on NTA have supplemented the process with high-coverage lexicons and new logical tests and rules with growing coverage on common news paper texts. 2 Logical Analysis on the Clause Level The central point of each clause (as a “simple” sentence) is formed by the verb phrase, which represents the predicative skeleton of the clause. Tichý comes with a classification of significant verbs into two groups according to the classification of their meaning: 1. attributive verbs express what qualities the attributed objects have or what the objects are. Typical examples are sentences with adjectival predicates, such as Zmínˇenézaˇrízeníje ˇcíslicovˇeˇrízené. (The mentioned device is numerically controlled.) lwlt(9i) [číslicověwt, řízený], i ^ [zmíněný, zařízení]wt , i ... p The appropriate analysis of the attributive verbs lies in a proposition that ascribes the alluded property to the subject. 2. episodic verbs, on the other hand, express actions, they tell what the subject does: Celková úmrtnost klesá. (The overall mortality rate decreases.) (9 )(9 ) Does [Imp ] ^ = klesat ^ lw1lt2 x3 i4 w1t2 , i4, w1 , x3 x3 w1 celkový úmrtnost [ , ]w1t2 , i4 ... p The main difference between attributive and episodic verbs consists in their time consumption — the attributive verbs do not take the time dimension into account, they just describe the state of the subject in the very one moment by saying that the subject has (or has not) a certain property. The analysis of episodic verbs is defined with the means called events and episodes, which are in detailed described in [7,8,6]. An example of the resulting