RASLAN 2011 Recent Advances in Slavonic Natural Language Processing

Total Page:16

File Type:pdf, Size:1020Kb

RASLAN 2011 Recent Advances in Slavonic Natural Language Processing RASLAN 2011 Recent Advances in Slavonic Natural Language Processing A. Horák, P. Rychlý (Eds.) RASLAN 2011 Recent Advances in Slavonic Natural Language Processing Fifth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2011 Karlova Studánka, Czech Republic, December 2–4, 2011 Proceedings Tribun EU 2011 Proceedings Editors Aleš Horák Faculty of Informatics, Masaryk University Department of Information Technologies Botanická 68a CZ-602 00 Brno, Czech Republic Email: [email protected] Pavel Rychlý Faculty of Informatics, Masaryk University Department of Information Technologies Botanická 68a CZ-602 00 Brno, Czech Republic Email: [email protected] This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the Czech Copyright Law, in its current version, and permission for use must always be obtained from Tribun EU. Violations are liable for prosecution under the Czech Copyright Law. Editors ○c Aleš Horák, 2011; Pavel Rychlý, 2011 Typography ○c Adam Rambousek, 2011 Cover ○c Petr Sojka, 2010 This edition ○c Tribun EU, Brno, 2011 ISBN 978-80-263-0077-9 Preface This volume contains the Proceedings of the Fifth Workshop on Recent Ad- vances in Slavonic Natural Language Processing (RASLAN 2011) held on De- cember 2nd–4th 2011 in Karlova Studánka, Sporthotel Kurzovní, Jeseníky, Czech Republic. The RASLAN Workshop is an event dedicated to the exchange of informa- tion between research teams working on the projects of computer processing of Slavonic languages and related areas going on in the NLP Centre at the Faculty of Informatics, Masaryk University, Brno. RASLAN is focused on theoretical as well as technical aspects of the project work, on presentations of verified methods together with descriptions of development trends. The workshop also serves as a place for discussion about new ideas. The intention is to have it as a forum for presentation and discussion of the latest developments in the field of language engineering, especially for undergraduates and postgraduates af- filiated to the NLP Centre at FI MU. We also have to mention the cooperation with the Dept. of Computer Science FEI, VŠB Technical University Ostrava. Topics of the Workshop cover a wide range of subfields from the area of artificial intelligence and natural language processing including (but not limited to): * text corpora and tagging * syntactic parsing * sense disambiguation * machine translation, computer lexicography * semantic networks and ontologies * semantic web * knowledge representation * logical analysis of natural language * applied systems and software for NLP RASLAN 2011 offers a rich program of presentations, short talks, technical papers and mainly discussions. A total of 15 papers were accepted, contributed altogether by 22 authors. Our thanks go to the Program Committee members and we would also like to express our appreciation to all the members of the Organizing Committee for their tireless efforts in organizing the Workshop and ensuring its smooth running. In particular, we would like to mention the work of Aleš Horák, Pavel Rychlý and Dana Hlaváˇcková.The TEXpertise of Adam Rambousek (based on LATEX macros prepared by Petr Sojka) resulted in the extremely speedy and efficient production of the volume which you are now holding in your hands. Last but not least, the cooperation of Tribun EU as both publisher and printer of these proceedings is gratefully acknowledged. Brno, December 2011 Karel Pala Table of Contents I Logic and Language Analyzing Time-Related Clauses in Transparent Intensional Logic . 3 Aleš Horák, Miloš Jakubíˇcek,VojtˇechKováˇr Mind modeling using Transparent Intensional Logic . 11 Andrej Gardoˇn Verbs as Predicates: Towards Inference in a Discourse . 19 Zuzana Nevˇeˇrilová,Marek Grác II Syntax, Morphology and Lexicon Czech Morphological Tagset Revisited . 29 Miloš Jakubíˇcek,VojtˇechKováˇr,Pavel Šmerk Korean Parsing in an Extended Categorial Grammar . 43 Juyeon Kang Extracting Phrases from PDT 2.0 . 51 Vašek Nˇemˇcík Extended VerbaLex Editor Interface Based on the DEB Platform . 59 Dana Hlaváˇcková,Adam Rambousek III Text Corpora Building Corpora of Technical Texts: Approaches and Tools . 69 Petr Sojka, Martin Líška, Michal R˚užiˇcka Corpus-based Disambiguation for Machine Translation . 81 Vít Baisa Building a 50M Corpus of Tajik Language . 89 Gulshan Dovudov, Jan Pomikálek, Vít Suchomel, Pavel Šmerk Practical Web Crawling for Text Corpora . 97 Vít Suchomel, Jan Pomikálek VIII Table of Contents IV Language Modelling A Bayesian Approach to Query Language Identification . 111 JiˇríMaterna and Juraj Hreško A Framework for Authorship Identification in the Internet Environment . 117 Jan Rygl, Aleš Horák chared: Character Encoding Detection with a Known Language . 125 Jan Pomikálek, Vít Suchomel Words’ Burstiness in Language Models . 131 Pavel Rychlý Author Index ..................................................... 139 Part I Logic and Language Analyzing Time-Related Clauses in Transparent Intensional Logic Aleš Horák, Miloš Jakubíˇcek,VojtˇechKováˇr Faculty of Informatics Masaryk University Botanická 68a, 602 00 Brno, Czech Republic fhales, jak, [email protected] Abstract. The Normal Translation Algorithm (NTA) for Transparent In- tensional Logic (TIL) describes the key parts of standard translation from natural language sentences to logical formulae. NTA was specified for the Czech language as a representative of a free-word-order language with rich morphological system. In this paper, we describe the implementation of the sentence building part of NTA within a module of logical analysis in the Czech language syntactic parser synt. We show the respective lexicons and data structures in clause processing with numerous examples concentrating on the tem- poral aspects of clauses. Key words: Transparent Intensional Logic, TIL, Normal Translation Al- gorithm, NTA, logical analysis, temporal analysis, time-related clauses 1 Introduction Logical representation of a natural language sentence forms a firm basis for se- mantic analysis, machine learning techniques and automated reasoning. In cur- rent practical systems all these fields build on propositional and first-order logic due to the advantages and simplicity of their processing [1]. However, it can be shown that first-order formalisms are not able to handle systematically the natural language phenomena like intensionality, belief attitudes, grammatical tenses and modalities (modal verbs and modal particles in natural language). In the following text, we describe a system theoretically based on the formalism of the Transparent Intensional Logic (TIL [2]), an expressive logical system introduced by Pavel Tichý [3], which works with a complex hierarchy of types, system of possible worlds and times and an inductive system for derivation of new facts from a knowledge base in development. The actual implementation of the system works together with the system for syntactic analysis of the Czech language named synt. The parsing mechanism in synt [4,5] is based on the meta-grammar formalism working with a grammar of about 250 meta-rules with contextual actions for various phrase and sentence level tests. The Normal Translation Algorithm (NTA [6]) as a standard way of logical analysis of natural language by means of the compositions of syntactic Aleš Horák, Pavel Rychlý (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011, pp. 3–9, 2011. ○c Tribun EU 2011 4 Aleš Horák, Miloš Jakubíˇcek,VojtˇechKováˇr constituents such as verb phrase, noun phrase, clause and their modifiers was provided by a basic prototype implementation. Current works on NTA have supplemented the process with high-coverage lexicons and new logical tests and rules with growing coverage on common news paper texts. 2 Logical Analysis on the Clause Level The central point of each clause (as a “simple” sentence) is formed by the verb phrase, which represents the predicative skeleton of the clause. Tichý comes with a classification of significant verbs into two groups according to the classification of their meaning: 1. attributive verbs express what qualities the attributed objects have or what the objects are. Typical examples are sentences with adjectival predicates, such as Zmínˇenézaˇrízeníje ˇcíslicovˇeˇrízené. (The mentioned device is numerically controlled.) lwlt(9i) [číslicověwt, řízený], i ^ [zmíněný, zařízení]wt , i ... p The appropriate analysis of the attributive verbs lies in a proposition that ascribes the alluded property to the subject. 2. episodic verbs, on the other hand, express actions, they tell what the subject does: Celková úmrtnost klesá. (The overall mortality rate decreases.) (9 )(9 ) Does [Imp ] ^ = klesat ^ lw1lt2 x3 i4 w1t2 , i4, w1 , x3 x3 w1 celkový úmrtnost [ , ]w1t2 , i4 ... p The main difference between attributive and episodic verbs consists in their time consumption — the attributive verbs do not take the time dimension into account, they just describe the state of the subject in the very one moment by saying that the subject has (or has not) a certain property. The analysis of episodic verbs is defined with the means called events and episodes, which are in detailed described in [7,8,6]. An example of the resulting
Recommended publications
  • Tehran, Vienna Keen on Exchanging National Archive Documents
    Art & Culture November 23, 2019 3 This Day in History (November 23) Tehran, Vienna Keen on Exchanging Today is Saturday; 2nd of the Iranian month of Azar 1398 solar hijri; corresponding to 25th of the Islamic month of Rabi al-Awwal 1441 lunar hijri; and November 23, 2019, of the Christian Gregorian Calendar. National Archive Documents 1918 solar years ago, on this day in the year 101 AD, present day Romania was occupied by the Roman Empire. This land was ruled by the Romans until its conquest for publishing a 162-year-old Elsewhere in his remarks, by the Ottoman Turks in 1453. Meanwhile, parts of Romania were also occupied by cooperation document between Zarafshan pointed to the documents Austria for a while till the year 1877, in which Romania emerged as independent. Iran and Austria. between Iran and Austria in the Romania covers an area of 237500 sq km. Its capital is Bucharest. 1005 lunar years ago, on this day in 436 AH, the great scholar and jurisprudent, Zarafshan made the remarks at field of cooperation of the two Seyyed Ali Ibn Hussain, popularly known as Sharif Murtaza, passed away at the the venue of National Archive countries in different economic age of 81 in his hometown Baghdad. He was born in a family descended on both of Iran on Wednesday in the sectors, technology transfer, in sides from Prophet Mohammad (blessings of God upon him and his progeny). His inaugural ceremony of Iranian aviation and transport industries, th father Hussain was 5 in line of descent from Imam Musa al-Kazem (AS), while and Austrian documents in the import and export activities and his mother, Fatema – a scion of the family that had carved out an independent state in Tabaristan on the Caspian Sea coast of Iran – was a descendant of Imam presence of Austrian ambassador added, “the old documents of the Zain al-Abedin (AS).
    [Show full text]
  • A Comparative Study of Metaphor in Arabic and Persian
    EAS Journal of Humanities and Cultural Studies Abbreviated Key Title: EAS J Humanit Cult Stud ISSN: 2663-0958 (Print) & ISSN: 2663-6743 (Online) Published By East African Scholars Publisher, Kenya Volume-2 | Issue-4| Jul-Aug 2020 | DOI: 10.36349/EASJHCS.2020.V02I04.008 Research Article A Comparative Study of Metaphor in Arabic and Persian Yahya Kardgar Associate Professor, Department of Persian language and literature, Faculty of literature and Humanities, University of Qom, Qom, Iran Abstract: Metaphor is one of the most important types of trope e that has a special place in Article History every language, but the way it has been viewed in various languages is different. In Arabic Received: 06.08.2020 and Persian, the similarities of this discussion are more than its distinctions, in that most of Accepted: 22.08.2020 the metaphorical topics of Arabic rhetoric books have been repeated in Persian books two or Published: 30.08.2020 three centuries later. The subject of metaphor in Arabic books has experienced five stages of Journal homepage: outset, growth, prosperity, recession, and Modernism, and has undergone three stages of https://www.easpublisher.com/easjhcs genesis, expansion, and edition in Farsi with fewer developments. Genesis corresponds to the period of outset and growth; the period of expansion corresponds to the period of Quick Response Code recession and the period of edition to the period of Arab modernism. Unfortunately, the Persian rhetoric has not benefited much from the prosperity period of the Arabic rhetoric, so its analytical and aesthetic outlook is weak. Today, research in both languages tend to focus on critical topics, the use of metaphorical studies of other languages, and the linguistic outlook, with a greater emphasis on the nature of language and Nativism in Persian metaphor- seekers.
    [Show full text]
  • CIS Newsletter 15.2
    CENTER FOR IRANIAN STUDIES NEWSLETTER Vol. 15, No. 2 SIPA-Columbia University-New York Fall 2003 ENCYCLOPÆDIA IRANICA SHIRIN EBADI WINNER OF Fascicles 1 and 2 of Volume XII Published; Fascicle 3 in Press 2003 NOBEL PEACE PRIZE The first and second fascicles way in which Hedayat’s satire per- of Volume XII of the Encyclopædia meates many of his short stories. Iranica were published in the Sum- Hillmann reviews plots and themes mer and Fall of 2003. They fea- of Hedayat’s fiction, some fifty or ture over 120 articles on various as- more works written from the mid- pects of Iranian culture and history, in- 1920s through the mid-1940s, and cites cluding four series of articles on spe- features of Hedayat’s distinctive ways cific subjects: four entries on Sadeq of narration which advanced the capa- Hedayat, four entries on Hazara groups bilities of the language in Persian lit- in Afghanistan, four entries on Helmand erature and served as an indigenous River, and eight entries on Herat. model for later Iranian short story writ- ers and novelists. Shirin Ebadi, lawyer and human SADEQ HEDAYAT rights activist who contributed the en- AND PERSIAN LITERATURE Persian literature is also treated in try CHILDREN’S RIGHTS IN IRAN to the the following eight articles: HASAN Encyclopædia Iranica and whose book Four articles discuss the life and GHAZNAVI, poet at the court of History and Documentation of Human work of Sadeq Hedayat, the foremost Bahramshah Ghaznavi, by J. S. Rights in Iran was published by the modern Persian fiction writer who had Meisami; HATEF ESFAHANI, influential Center for Iranian Studies in 2000, was a vast influence on subsequent genera- poet of 18th century, by the late Z.
    [Show full text]
  • Building a 50M Corpus of Tajik Language
    Building a 50M Corpus of Tajik Language Gulshan Dovudov1, Jan Pomikálek2, Vít Suchomel1, Pavel Šmerk1 1 Natural Language Processing Centre Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic {xdovudov,xsuchom2,xsmerk}@fi.muni.cz 2 Lexical Computing Ltd. 73 Freshfield Road, Brighton, UK [email protected] Abstract. Paper presents by far the largest available computer corpus of Tajik Language of the size of more than 50 million words. To obtain the texts for the corpus two different approaches were used. The paper brings a description of both of them, discusses their advantages and disadvantages and shows some statistics of the two respective partial corpora. Then the paper characterizes the resulting joined corpus and finally discusses some possible future improvements. 1 Introduction Tajik language is a variant of the Persian language spoken mainly in Tajikistan and also in some few other parts of Central Asia. Unlike Iranian Persian, Tajik is written in the Cyrillic alphabet. Since the Tajik language internet society (and consequently the potential market) is rather small and Tajikistan itself is ranked among developing countries, there are almost no resources for computational processing of Tajik at all. Namely the computer corpora of Tajik language are either small or even still only in the stage of planning or development. The University of Helsinky offers a very small corpus of 87 654 words.1 Megerdoomian and Parvaz [3,2] mention a test corpus of approximately 500 000 words, and the biggest and the only freely available corpus is offered within the Leipzig Corpora Collection project [4] and consists of 100 000 Tajik sentences which equals to almost 1.8 million words.2 Iranian linguist Hamid Hassani is said to be preparing a 1 1 http://www.ling.helsinki.fi/uhlcs/readme-all/README-indo-european-lgs.html 2 Unfortunately, the encoding and/or transliteration vary greatly: more than 5 % of sentences are in Latin script, almost 10 % of sentences seem to use Russian characters instead of Tajik specific characters (e.g.
    [Show full text]
  • Moral Responsibility and Types of Moral Responsibility
    Science Arena Publications International Journal of Philosophy and Social-Psychological Sciences ISSN: 2414-5343 Available online at www.sciarena.com 2019, Vol, 5 (3): 41-58 Moral Responsibility and Types of Moral Responsibility Reza Arian Moghaddam M.A. in Islamic Ethics, Department of Philosophy of Ethics, School of Theology and Islamic Studies, University of Qom, Qom, Iran. Abstract: The issue of "moral responsibility" is one of the most significant moral issues that has caught the attention of moral philosophers. In the course of history of moral discussions, many issues have been debated on the meaning of responsibility, its conditions and properties of the responsible man. The importance of this issue becomes understood when we pay attention to its relationship with some philosophic-theological discussions as well as the data of natural sciences and humanities. This issue from one perspective is intertwined with such issues as causality, determinism and volition. On other hand, it is interrelated with such vital theological issues as omniscience and omnipotence of God and divine providence. The despisers of moral responsibility struggle to prove through different proofs that human actions are contingent upon the conditions and modes that do not need any other reason. Moreover, they seek to demonstrate that human will has no role in their existence or absence. Determinists provide certain reasons in order to prove their claims in this regard. Some of these scholars deny human will and moral responsibility based on the law of causality. Some others have become bogged down in determinism due to their defected understanding of religious doctrines like Absolute Power (omnipotence), omniscience and divine providence.
    [Show full text]
  • ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia 16
    ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia 16 Morphosyntactic Corpora and Tools for Persian Mojgan Seraji Dissertation presented at Uppsala University to be publicly examined in Universitetshuset / IX, Uppsala, Wednesday, 27 May 2015 at 10:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor of Computational Linguistics Jan Hajic (Charles University in Prague). Abstract Seraji, M. 2015. Morphosyntactic Corpora and Tools for Persian. Studia Linguistica Upsaliensia 16. 191 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9229-8. This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian. In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible. Given these requirements, the thesis investigates two main research questions.
    [Show full text]
  • I the EFFECTS of SOCIAL AND
    THE EFFECTS OF SOCIAL AND POLITICAL DISLOCATION ON PERSIANATE CHILDREN'S LITERATURE: CHANGE AND CONTINUITY by Nafisa Abdelsadek submitted in accordance with the requirements for the degree of DOCTOR OF LITERATURE AND PHILOSOPHY at the UNIVERSITY OF SOUTH AFRICA SUPERVISOR: Professor Thomas Van der Walt February 2011 i Student number: 4278-613-4 I declare that THE EFFECTS OF SOCIAL AND POLITICAL DISLOCATION ON PERSIANATE CHILDREN'S LITERATURE: CHANGE AND CONTINUITY is my own work and that all the sources that I have used or quoted have been indicated and acknowledged by means of complete references. 22/02/2011 ________________________ __________________ SIGNATURE DATE (N.Abdelsadek) ii ABSTRACT OF THESIS THE EFFECTS OF SOCIAL AND POLITICAL DISLOCATION ON PERSIANATE CHILDREN'S LITERATURE: CHANGE AND CONTINUITY by Nafisa Abdelsadek This thesis seeks to investigate the various forces that have shaped modern Persianate children‘s literature - history, revolution, political climate, government, institutions, writers, education, and so on. The historical origins of tales popular in modern times, and of themes recurrent in stories from past times to present are analyzed, along with other factors which have shaped Persianate children‘s literature. The thesis begins with a historical and theoretical overview relating to change and continuity in Persianate children‘s literature. It examines the influence of ancient texts on modern Persianate children‘s stories. The cultural development reflected in the organizational infrastructure of institutions is also examined, as well as other contemporary influences, both social and political, in order to assess how these have affected modern Persianate children‘s literature. The contents of children‘s books are analyzed from different aspects, including their representation of social values.
    [Show full text]