ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia 16

Morphosyntactic Corpora and Tools for Persian

Mojgan Seraji

Dissertation presented at Uppsala University to be publicly examined in Universitetshuset / IX, Uppsala, Wednesday, 27 May 2015 at 10:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor of Computational Linguistics Jan Hajic (Charles University in Prague).

Abstract

Seraji, M. 2015. Morphosyntactic Corpora and Tools for Persian. Studia Linguistica Upsaliensia 16. 191 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9229-8.

This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement has two parts. First, the tools in the pipeline should be compatible with each other, so that the output of one tool meets the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in them. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible.

Given these requirements, the thesis investigates two main research questions. The first is how we can develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora in order to achieve robustness. The tokenization variations in Persian texts are related to orthographic variation in the writing of fixed expressions, as well as of various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them. Therefore, the tools should not be trained on idealized data.

The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).

Keywords: Persian, language technology, corpus, treebank, preprocessing, segmentation, part-of-speech tagging, dependency parsing

Mojgan Seraji, Department of Linguistics and Philology, Box 635, Uppsala University, SE-75126 Uppsala, Sweden.

© Mojgan Seraji 2015
ISSN 1652-1366
ISBN 978-91-554-9229-8
urn:nbn:se:uu:diva-248780 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-248780)

Sammandrag (summary)

This thesis presents resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, these resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

In developing these resources and tools, two important requirements have been adopted: compatibility and reuse. The compatibility requirement has two parts. First, the tools in the pipeline should be compatible with each other, in such a way that the output of one tool is compatible with the input of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in them. The reuse requirement means that all components in the pipeline are developed by reusing resources, standard methods, and open source tools, which is necessary to make the project feasible.

Against the background of these requirements, the thesis investigates two main research questions. The first question is how we can develop morphologically and syntactically annotated corpora and tools while at the same time satisfying the requirements of compatibility and reuse. The strategy applied is to accept variation in tokenization in order to achieve robustness. The variation in tokenization in Persian texts is related to orthographic variants of multiword expressions as well as various types of affixes and clitic particles. Since this variation is an inherent property of Persian texts, it is important that the tools in the pipeline can handle it. Therefore, they should not be trained on idealized data.

The second question is with what accuracy we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and the tokenizer achieve an accuracy close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves at best an accuracy of over 82% with dependency relations (and close to 87% without relations).

Keywords: Persian, language technology, corpus, treebank, normalization, segmentation, part-of-speech tagging, dependency parsing

To:
my sons Babak and Hooman
my parents Asiyeh and Bahram
my sister Shohreh
my husband Mansour

Words cannot express how much I love you all.

Contents

1 Introduction
    1.1 Goals and Research Questions
    1.2 Research Methodology
    1.3 Outline of the Thesis
    1.4 Previous Publications
2 Background
    2.1 Corpora
        2.1.1 Morphological Annotation
        2.1.2 Syntactic Annotation
    2.2 Tools
        2.2.1 Preprocessing
        2.2.2 Sentence Segmentation
        2.2.3 Tokenization
        2.2.4 Part-of-Speech Tagging
        2.2.5 Parsing
    2.3 Persian
        2.3.1 Persian Orthography
        2.3.2 Persian Morphology
        2.3.3 Persian Syntax
    2.4 Existing Corpora and Tools for Persian
        2.4.1 Morphologically Annotated Corpora
        2.4.2 Syntactically Annotated Corpora
        2.4.3 Sentence Segmentation and Tokenization
        2.4.4 Part-of-Speech Taggers
        2.4.5 Parsers
3 Uppsala Persian Corpus
    3.1 The Bijankhan Corpus
    3.2 Uppsala Persian Corpus
        3.2.1 Character Encodings
        3.2.2 Sentence Segmentation and Tokenization
        3.2.3 Morphological Annotation
4 Normalization, Segmentation and Morphological Analysis for Persian
    4.1 Preprocessing, Sentence Segmentation and Tokenization
        4.1.1 The Preprocessor: PrePer
        4.1.2 The Sentence Segmenter and Tokenizer: SeTPer
        4.1.3 The Evaluation of PrePer and SeTPer
    4.2 The Statistical Part-of-Speech Tagger: TagPer
        4.2.1 The Evaluation of TagPer
5 Uppsala Persian Dependency Treebank
    5.1 Corpus Overview
    5.2 Treebank Development
    5.3 Annotation Scheme
    5.4 Basic Relations
        5.4.1 Relations from Stanford Dependencies
        5.4.2 New Relations
