RASLAN 2011 Recent Advances in Slavonic Natural Language Processing
A. Horák, P. Rychlý (Eds.) RASLAN 2011
Fifth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2011 Karlova Studánka, Czech Republic, December 2–4, 2011 Proceedings
Tribun EU
2011

Proceedings Editors

Aleš Horák
Faculty of Informatics, Masaryk University
Department of Information Technologies
Botanická 68a
CZ-602 00 Brno, Czech Republic
Email: [email protected]

Pavel Rychlý
Faculty of Informatics, Masaryk University
Department of Information Technologies
Botanická 68a
CZ-602 00 Brno, Czech Republic
Email: [email protected]
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, in its current version, and permission for use must always be obtained from Tribun EU. Violations are liable for prosecution under the Czech Copyright Law.
Editors © Aleš Horák, 2011; Pavel Rychlý, 2011
Typography © Adam Rambousek, 2011
Cover © Petr Sojka, 2010
This edition © Tribun EU, Brno, 2011

ISBN 978-80-263-0077-9

Preface
This volume contains the Proceedings of the Fifth Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2011) held on December 2nd–4th 2011 in Karlova Studánka, Sporthotel Kurzovní, Jeseníky, Czech Republic.

The RASLAN Workshop is an event dedicated to the exchange of information between research teams working on the projects of computer processing of Slavonic languages and related areas going on in the NLP Centre at the Faculty of Informatics, Masaryk University, Brno. RASLAN is focused on theoretical as well as technical aspects of the project work, on presentations of verified methods together with descriptions of development trends. The workshop also serves as a place for discussion about new ideas. The intention is to have it as a forum for presentation and discussion of the latest developments in the field of language engineering, especially for undergraduates and postgraduates affiliated to the NLP Centre at FI MU. We also have to mention the cooperation with the Dept. of Computer Science FEI, VŠB Technical University Ostrava.

Topics of the Workshop cover a wide range of subfields from the area of artificial intelligence and natural language processing including (but not limited to):

* text corpora and tagging
* syntactic parsing
* sense disambiguation
* machine translation, computer lexicography
* semantic networks and ontologies
* semantic web
* knowledge representation
* logical analysis of natural language
* applied systems and software for NLP

RASLAN 2011 offers a rich program of presentations, short talks, technical papers and mainly discussions. A total of 15 papers were accepted, contributed altogether by 22 authors. Our thanks go to the Program Committee members and we would also like to express our appreciation to all the members of the Organizing Committee for their tireless efforts in organizing the Workshop and ensuring its smooth running. In particular, we would like to mention the work of Aleš Horák, Pavel Rychlý and Dana Hlaváčková. The TeXpertise of Adam Rambousek (based on LaTeX macros prepared by Petr Sojka) resulted in the extremely speedy and efficient production of the volume which you are now holding in your hands. Last but not least, the cooperation of Tribun EU as both publisher and printer of these proceedings is gratefully acknowledged.
Brno, December 2011 Karel Pala
Table of Contents
I Logic and Language
Analyzing Time-Related Clauses in Transparent Intensional Logic
  Aleš Horák, Miloš Jakubíček, Vojtěch Kovář

Mind modeling using Transparent Intensional Logic
  Andrej Gardoň

Verbs as Predicates: Towards Inference in a Discourse
  Zuzana Nevěřilová, Marek Grác

II Syntax, Morphology and Lexicon

Czech Morphological Tagset Revisited
  Miloš Jakubíček, Vojtěch Kovář, Pavel Šmerk

Korean Parsing in an Extended Categorial Grammar
  Juyeon Kang

Extracting Phrases from PDT 2.0
  Vašek Němčík

Extended VerbaLex Editor Interface Based on the DEB Platform
  Dana Hlaváčková, Adam Rambousek

III Text Corpora

Building Corpora of Technical Texts: Approaches and Tools
  Petr Sojka, Martin Líška, Michal Růžička

Corpus-based Disambiguation for Machine Translation
  Vít Baisa

Building a 50M Corpus of Tajik Language
  Gulshan Dovudov, Jan Pomikálek, Vít Suchomel, Pavel Šmerk

Practical Web Crawling for Text Corpora
  Vít Suchomel, Jan Pomikálek

IV Language Modelling

A Bayesian Approach to Query Language Identification
  Jiří Materna and Juraj Hreško

A Framework for Authorship Identification in the Internet Environment
  Jan Rygl, Aleš Horák

chared: Character Encoding Detection with a Known Language
  Jan Pomikálek, Vít Suchomel

Words’ Burstiness in Language Models
  Pavel Rychlý

Author Index

Part I
Logic and Language
Analyzing Time-Related Clauses in Transparent Intensional Logic
Aleš Horák, Miloš Jakubíček, Vojtěch Kovář
Faculty of Informatics Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {hales, jak, xkovar3}@fi.muni.cz
Abstract. The Normal Translation Algorithm (NTA) for Transparent Intensional Logic (TIL) describes the key parts of standard translation from natural language sentences to logical formulae. NTA was specified for the Czech language as a representative of a free-word-order language with a rich morphological system. In this paper, we describe the implementation of the sentence-building part of NTA within a module of logical analysis in the Czech language syntactic parser synt. We show the respective lexicons and data structures in clause processing with numerous examples concentrating on the temporal aspects of clauses.

Key words: Transparent Intensional Logic, TIL, Normal Translation Algorithm, NTA, logical analysis, temporal analysis, time-related clauses
1 Introduction
Logical representation of a natural language sentence forms a firm basis for semantic analysis, machine learning techniques and automated reasoning. In current practical systems, all these fields build on propositional and first-order logic because of the advantages and simplicity of their processing [1]. However, it can be shown that first-order formalisms cannot systematically handle natural language phenomena such as intensionality, belief attitudes, grammatical tenses and modalities (modal verbs and modal particles in natural language).

In the following text, we describe a system theoretically based on the formalism of Transparent Intensional Logic (TIL [2]), an expressive logical system introduced by Pavel Tichý [3], which works with a complex hierarchy of types, a system of possible worlds and times, and an inductive system for deriving new facts from a knowledge base under development. The actual implementation works together with synt, a system for syntactic analysis of the Czech language. The parsing mechanism in synt [4,5] is based on a meta-grammar formalism working with a grammar of about 250 meta-rules with contextual actions for various phrase-level and sentence-level tests.

The Normal Translation Algorithm (NTA [6]), as a standard way of logical analysis of natural language by means of the compositions of syntactic constituents such as verb phrase, noun phrase, clause and their modifiers, was provided by a basic prototype implementation. Current work on NTA has supplemented the process with high-coverage lexicons and new logical tests and rules, with growing coverage of common newspaper texts.

Aleš Horák, Pavel Rychlý (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011, pp. 3–9, 2011. © Tribun EU 2011
2 Logical Analysis on the Clause Level
The central point of each clause (as a “simple” sentence) is formed by the verb phrase, which represents the predicative skeleton of the clause. Tichý classifies significant verbs into two groups according to their meaning:
1. attributive verbs express what qualities the attributed objects have or what the objects are. Typical examples are sentences with adjectival predicates, such as

   Zmíněné zařízení je číslicově řízené. (The mentioned device is numerically controlled.)

   $$\lambda w \lambda t\, (\exists i) \Big( \big[ [\text{číslicově}_{wt}, \text{řízený}],\, i \big] \wedge \big[ [\text{zmíněný}, \text{zařízení}]_{wt},\, i \big] \Big) \;\ldots\; \pi$$

   The appropriate analysis of the attributive verbs lies in a proposition that ascribes the alluded property to the subject.

2. episodic verbs, on the other hand, express actions; they tell what the subject does:

   Celková úmrtnost klesá. (The overall mortality rate decreases.)

   $$\lambda w_1 \lambda t_2\, (\exists x_3)(\exists i_4) \Big( \big[ \text{Does}_{w_1 t_2},\, i_4,\, [\text{Imp}_{w_1}, x_3] \big] \wedge x_3 = \text{klesat}_{w_1} \wedge \big[ [\text{celkový}, \text{úmrtnost}]_{w_1 t_2},\, i_4 \big] \Big) \;\ldots\; \pi$$
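As a rough illustration of how the two verb classes lead to differently shaped analyses, the following Python sketch renders both schemas as plain strings. The function names `attributive` and `episodic`, the string-based representation of terms, and the simplified variable names (w, t, x, i without numeric indices) are assumptions made for illustration only; they are not taken from synt or from the NTA implementation.

```python
# Hypothetical sketch: string templates for the two analysis schemas.
# Nothing here is part of synt or NTA; all names are illustrative.

ASPECTS = ("Imp", "Perf")  # imperfective / perfective aspect functions


def attributive(property_term: str, subject_term: str) -> str:
    """Attributive schema: some individual i has the ascribed property
    and is also described by the subject term."""
    return f"λwλt(∃i)([{property_term}, i] ∧ [{subject_term}, i])"


def episodic(verb: str, subject_term: str, aspect: str = "Imp") -> str:
    """Episodic schema: individual i 'Does' the verbal object x,
    viewed through the given aspect function."""
    if aspect not in ASPECTS:
        raise ValueError(f"aspect must be one of {ASPECTS}, got {aspect!r}")
    return (
        f"λwλt(∃x)(∃i)([Does_wt, i, [{aspect}_w, x]] "
        f"∧ x = {verb}_w ∧ [{subject_term}, i])"
    )


if __name__ == "__main__":
    # "Zmíněné zařízení je číslicově řízené."
    print(attributive("[číslicově_wt, řízený]", "[zmíněný, zařízení]_wt"))
    # "Celková úmrtnost klesá."
    print(episodic("klesat", "[celkový, úmrtnost]_wt"))
```

The sketch only makes the structural contrast visible: the attributive schema has no counterpart of Does or of the aspect function, whereas the episodic schema introduces an extra variable for the verbal object.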
The main difference between attributive and episodic verbs lies in their relation to time: attributive verbs do not take the time dimension into account; they just describe the state of the subject at a single moment by saying that the subject has (or does not have) a certain property. The analysis of episodic verbs is defined by means of events and episodes, which are described in detail in [7,8,6]. An example of the resulting analysis uses three functions, Does and Imp/Perf, which describe the relation between the subject and the predicate in the sentence and display the verb’s aspect.

In natural language, time constructions are encoded by two principal means: the verb tense and time adverbial groups that serve as modifiers of the default verb meaning. Contemporary Czech has three verb tenses: the present tense, the past tense and the future tense. For the basic verb tense analysis, we can adapt the definitions specified for English in [7]. Unlike English verbs, Czech verbs have the verb aspect built directly into their grammatical category: every verb is in exactly one of the imperfective and perfective forms. A special property that goes along with the perfective aspect is that these verbs do not form the present tense; they have only the past tense and the future tense forms.¹ The present tense of the perfective verbs is usually expressed by their imperfective counterparts.

The verb tenses that allow us to assert propositions with respect to the past or to the future can be looked at as mirror images of each other. In the analysis, we understand the past tense as a certain operation working over

1. the underlying proposition in the present tense form and
2. the reference time span

with regard to an assertion moment.

The meaning of a past tense proposition is often connected not only with a certain reference time span (in the indefinite case the object Anytime) but also with a frequency adverb specifying how many times the proposition happened to be true. A frequency adverb is analysed as a ((o(oτ))π)ω-object, i.e. as a world-dependent operation that takes a proposition p to the class of time intervals that have the requested qualities regarding the chronology of p. For instance, the adverb ‘dvakrát’ (twice) takes every proposition p to a class of time intervals that have exactly two distinct intersections with the chronology of p. If the frequency adverb is not specified in the sentence, we assume that the frequency of the proposition is ‘at least once’ (object Onc) in its time span, which means that we do not limit the sentence’s time span in that case. Thus, if P denotes the past tense function, a general schema of the logical analysis of a past tense sentence looks like P