ACTA UNIVERSITATIS UPSALIENSIS Studia Linguistica Upsaliensia 16

Morphosyntactic Corpora and Tools for Persian

Mojgan Seraji

Dissertation presented at Uppsala University to be publicly examined in Universitetshuset, Sal IX, Uppsala, Wednesday, 27 May 2015 at 10:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor of Computational Linguistics Jan Hajič (Charles University in Prague).

Abstract

Seraji, M. 2015. Morphosyntactic Corpora and Tools for Persian. Studia Linguistica Upsaliensia 16. 191 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9229-8.

This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian. In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible. Given these requirements, the thesis investigates two main research questions. The first is how we can develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variations of writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them. Therefore, they should not be trained on idealized data. The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).

Keywords: Persian, language technology, corpus, treebank, preprocessing, segmentation, part-of-speech tagging, dependency parsing

Mojgan Seraji, Department of Linguistics and Philology, Box 635, Uppsala University, SE-75126 Uppsala, Sweden.

© Mojgan Seraji 2015

ISSN 1652-1366
ISBN 978-91-554-9229-8
urn:nbn:se:uu:diva-248780 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-248780)

Sammandrag (Summary in Swedish)

This thesis presents resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, these resources consist of an improved part-of-speech tagged corpus and a dependency treebank, together with tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

In developing these resources and tools, two key requirements have been adopted: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other, in such a way that the output of one tool is compatible with the input of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all components in the pipeline are developed by reusing resources, standard methods, and open source tools, which is necessary to make the project feasible.

Against the background of these requirements, the thesis investigates two main research questions. The first question is how we can develop morphologically and syntactically annotated corpora and tools while at the same time satisfying the requirements of compatibility and reuse. The strategy applied is to accept variation in tokenization in order to achieve robustness. The variation in tokenization in Persian texts is related to orthographic variants of multi-word expressions as well as various types of affixes and clitics. Since this variation is an inherent property of Persian texts, it is important that the tools in the pipeline can handle it; they should therefore not be trained on idealized data. The second question is with what accuracy we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an accuracy close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves at best an accuracy of over 82% with dependency relations (and close to 87% without relations).

Keywords: Persian, language technology, corpus, treebank, normalization, segmentation, part-of-speech tagging, dependency parsing

To
my sons Babak and Hooman
my parents Asiyeh and Bahram
my sister Shohreh
my husband Mansour

Words cannot express how much I love you all.

Contents

1 Introduction ...... 23 1.1 Goals and Research Questions ...... 24 1.2 Research Methodology ...... 25 1.3 Outline of the Thesis ...... 26 1.4 Previous Publications ...... 27

2 Background ...... 29 2.1 Corpora ...... 29 2.1.1 Morphological Annotation ...... 31 2.1.2 Syntactic Annotation ...... 33 2.2 Tools ...... 38 2.2.1 Preprocessing ...... 38 2.2.2 Sentence Segmentation ...... 39 2.2.3 Tokenization ...... 39 2.2.4 Part-of-Speech Tagging ...... 40 2.2.5 Parsing ...... 42 2.3 Persian ...... 45 2.3.1 Persian Orthography ...... 46 2.3.2 Persian Morphology ...... 52 2.3.3 Persian Syntax ...... 54 2.4 Existing Corpora and Tools for Persian ...... 61 2.4.1 Morphologically Annotated Corpora ...... 61 2.4.2 Syntactically Annotated Corpora ...... 64 2.4.3 Sentence Segmentation and Tokenization ...... 65 2.4.4 Part-of-Speech Taggers ...... 65 2.4.5 Parsers ...... 65

3 Uppsala Persian Corpus ...... 68 3.1 The Bijankhan Corpus ...... 68 3.2 Uppsala Persian Corpus ...... 70 3.2.1 Character Encodings ...... 70 3.2.2 Sentence Segmentation and Tokenization ...... 71 3.2.3 Morphological Annotation ...... 73

4 Normalization, Segmentation and Morphological Analysis for Persian ...... 82 4.1 Preprocessing, Sentence Segmentation and Tokenization ...... 82 4.1.1 The Preprocessor: PrePer ...... 83 4.1.2 The Sentence Segmenter and Tokenizer: SeTPer ...... 88 4.1.3 The Evaluation of PrePer and SeTPer ...... 89 4.2 The Statistical Part-of-Speech Tagger: TagPer ...... 91 4.2.1 The Evaluation of TagPer ...... 92

5 Uppsala Persian Dependency Treebank ...... 99 5.1 Corpus Overview ...... 99 5.2 Treebank Development ...... 100 5.3 Annotation Scheme ...... 101 5.4 Basic Relations ...... 102 5.4.1 Relations from Stanford Dependencies ...... 102 5.4.2 New Relations ...... 118 5.4.3 An Example Sentence Annotated with STD ...... 128 5.5 Complex Relations ...... 129 5.6 Unused Relations ...... 130 5.7 Comparison with Other Treebanks for Persian ...... 131 5.7.1 Data and Format ...... 131 5.7.2 Tokenization ...... 132 5.7.3 Annotation Schemes ...... 134 5.7.4 Sample Analyses ...... 140

6 Dependency Parsing for Persian ...... 147 6.1 Preliminaries ...... 148 6.1.1 Data ...... 148 6.1.2 Evaluation Metrics ...... 148 6.1.3 Parsers ...... 149 6.2 Experiments with Different Parsing Representations ...... 151 6.2.1 Baseline: Full Treebank Annotation ...... 151 6.2.2 Coarse-Grained Part-of-Speech Tags ...... 154 6.2.3 Coarse-Grained LVC Relations ...... 158 6.2.4 No Complex Relations ...... 162 6.2.5 Best Parsing Representation ...... 166 6.3 Experiments with Different Parsers ...... 167 6.4 Dependency Parser for Persian: ParsPer ...... 168 6.4.1 The Evaluation of ParsPer ...... 169

7 Conclusion ...... 173

References ...... 177

Appendix A: UPDT Dependency Labels ...... 189

List of Tables

Table 2.1: An example of the English sentence Economic news had little effect on financial markets., taken from the Penn Treebank (Marcus et al., 1993), annotated with the Google universal part-of-speech tags (Petrov et al., 2012) and the STD presented in CoNLL format...... 37

Table 2.2: Dual-joining Persian characters...... 45

Table 2.3: Right-joining Persian characters...... 47 Table 2.4: Examples of Persian homographs disambiguated by diacritics, N_SING = Noun Singular, V_PA = Past...... 47

Table 2.5: Persian homophonic letters...... 48

Table 2.6: Diverse spellings of certain homophonic Persian words...... 48 Table 2.7: 12 different ways of writing the plural and definite form of the compound word کتاب‌خانه‌های (the libraries of)...... 49 Table 2.8: Different forms of hamze...... 50

Table 2.9: Different forms of the Persian characters ye (ی) and kāf (ک)...... 51 Table 2.10: Digital characters for Persian (Extended Arabic-Indic Digits), Arabic (Arabic-Indic Digits) and Western...... 51 Table 2.11: Examples of words derived from the present stem دان /dān/ (to know) combined with various types of other stems and nouns as well as derivational affixes...... 52 Table 2.12: Present indicative of the verb رفتن /raftan/ (to go)...... 53

Table 2.13: Syntactic patterns in Persian...... 54 Table 2.14: Personal endings in past tense (personal endings in present tense are illustrated in Section 2.3.2.) ...... 55

Table 2.15: Pronominal clitics...... 57 Table 2.16: Pronominal clitics accompanied by the word کار /kār/ (work)...... 57

Table 2.17: Syntactic relations in the Persian Dependency Treebank...... 66

Table 3.1: Part-of-speech tags in the Bijankhan Corpus...... 69 Table 3.2: Part-of-speech tags in the UPC and the corresponding tags in the Bijankhan Corpus (BC in the table)...... 74 Table 3.3: A sample sentence taken from the Bijankhan Corpus and the corresponding sentence modified in the UPC...... 79

Table 4.1: Personal endings in past tense...... 85 Table 4.2: Copula clitics. * The third singular ه /-h/ in formal usage is consistently used along with the verb است /ast/ (is)...... 85 Table 4.3: Verbal stems in the formation of compound words...... 86

Table 4.4: Adjectival and nominal suffixes...... 87

Table 4.5: List of token separators...... 89

Table 4.6: Words not treated by segmentation tools...... 90 Table 4.7: Comparison of different models for tag transitions and word emissions...... 92

Table 4.8: Comparison of different models for unseen words...... 92 Table 4.9: Recall, precision, and F-score for different part-of-speech tags when TagPer was evaluated on a subset of UPC...... 93 Table 4.10: Recall, precision, and F-score for different part-of-speech tags when TagPer was evaluated on 100 automatically tokenized sentences (2778 tokens) taken from the web-based journal Hamshahri...... 94 Table 4.11: Recall, precision, and F-score for different part-of-speech tags when TagPer was evaluated on 100 manually tokenized sentences (2788 tokens) taken from the web-based journal Hamshahri...... 95

Table 5.1: A statistical overview of the UPDT...... 99

Table 5.2: Syntactic relations in UPDT with new relations in italics. ... 126 Table 6.1: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the full treebank annotation (automatically generated part-of-speech tags)...... 152 Table 6.2: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the full treebank annotation (gold standard part-of-speech tags)...... 153 Table 6.3: Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on UPDT with a fine-grained annotated treebank...... 154 Table 6.4: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the UPDT with coarse-grained auto part-of-speech tags...... 155 Table 6.5: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the UPDT with coarse-grained gold part-of-speech tags...... 156 Table 6.6: Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on the UPDT with coarse-grained part-of-speech tags...... 158 Table 6.7: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained auto part-of-speech tags and only one light verb construction...... 159 Table 6.8: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained gold part-of-speech tags and only one light verb construction...... 160 Table 6.9: Recall and precision for LVC relations with fine-grained auto and gold part-of-speech tags in experiments 1 and 3...... 161 Table 6.10: Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on UPDT with fine-grained part-of-speech tags and only one dependency relation for light verb construction...... 162 Table 6.11: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained auto part-of-speech tags and only basic dependency relations...... 163 Table 6.12: Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained gold part-of-speech tags and only basic dependency relations...... 164 Table 6.13: Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on UPDT with fine-grained part-of-speech tags and merely basic dependency relations. . 165 Table 6.14: Labeled and unlabeled attachment scores, and label accuracy on the development set resulting from 8 empirical studies where MaltParser was trained on UPDT with different simplifications of annotation schemes in part-of-speech tagset and dependency relations. 
Baseline = Experiment with a fine-grained annotated treebank, CPOS = Experiment with coarser-grained part-of-speech tags and fine-grained dependency relations, 1LVC = Experiment with fine-grained part-of-speech tags and dependency relations free from distinctive features in light verb construction, and Basic DepRel = Experiment with fine-grained part-of-speech tags and merely basic dependency relations. . 166 Table 6.15: Best results given by different parsers when trained on UPDT with auto part-of-speech tags, 1LVC, CompRel in the model assessment...... 169 Table 6.16: The evaluation of ParsPer when tested on 100 randomly selected sentences from the web-based journal Hamshahri. LR = Labeled Recall, LP = Labeled Precision, UR = Unlabeled Recall, UP = Unlabeled Precision, AS = Automatically Segmented, AT = Automatically Tagged, AP = Automatically Parsed, MS = Manually Segmented, and MT = Manually Tagged...... 171 Table 6.17: Precision and recall of binned head direction obtained when ParsPer was evaluated on 100 manually tokenized and automatically tagged sentences taken from the web-based journal Hamshahri...... 172 Table 6.18: Precision and recall of binned head distance obtained when ParsPer was evaluated on 100 manually tokenized and automatically tagged sentences taken from the web-based journal Hamshahri...... 172

List of Figures

Figure 2.1: Constituent structure for an English sentence taken from the Penn Treebank (Marcus et al., 1993)...... 34 Figure 2.2: Dependency structure for an English sentence taken from the Penn Treebank, converted to the Stanford Typed Dependencies representation...... 34 Figure 2.3: Constituency annotation in the IBM Paris Treebank...... 35 Figure 4.1: Persian natural language processing pipeline...... 83 Figure 5.1: Data selection of the UPDT...... 100 Figure 5.2: Syntactic annotation of a Persian sentence. Gloss: she/he book rā delivery-ez book.house give.past.3sg. Translation: She/he delivered the book to the library...... 121 Figure 5.3: Syntactic annotation for a Persian sentence with English gloss. To make the figure more readable, glosses have been simplified as follows: humans = human-pl, animals-e = animal-pl-ez, facts = fact-pl, take = cont-take.pres-3pl, features-e = feature-pl-ez, specific-e = specific-ez, own = self, have = have.pres-3pl, look-a = look-indef, kind-are = kind-be.pres-3pl. Gloss: human-pl and animal-pl-ez Born although from fact-pl effect cont-take.pres-3pl, feature-pl-ez specific-ez self rā have.pres-3pl and in look-indef general all of one kind-be.pres.3pl. Translation: Although (Adolf) Born’s humans and animals are affected by realities, they have their own special characteristics and in (a) general (look) all are of the same kind...... 127 Figure 5.4: Syntactic annotation of a Persian sentence taken from the PerDT. To make the figure more readable, glosses have been simplified as follows: they = this-pl, became = become.past-3pl, are = be.pres.3pl. The sentence is illustrated based on two different annotation schemes: PerDT annotation and UPDT annotation. Gloss: from time-indef that they with each.other familiar become.past-3pl happy be.pres-3pl. Translation: Since the time they became familiar with each other they are happy...... 141 Figure 5.5: Syntactic annotation of a Persian sentence taken from the PerDT. To make the figure more readable, glosses have been simplified as follows: want = cont-want.pres-3pl, do = sub-do.pres-3pl. The sentence is illustrated based on two different annotation schemes: PerDT annotation and UPDT annotation. Gloss: if even cont-want.pres-3pl me rā execution sub-do.pres-3pl, sub-do.pres-3pl. Translation: Even if they want to execute me, let them do it...... 144

Acknowledgements

During my doctoral research, I have received lots of guidance, encouragement, and support from a number of people. First and foremost, I would like to express my special gratitude to my main supervisor Joakim Nivre and my co-supervisor Carina Jahani. I am extremely thankful to Joakim Nivre for his continuous guidance and advice throughout the entire project as well as the writing phase of this thesis. Joakim’s rich knowledge of Computational Linguistics, clear guidance, and valuable suggestions have been a great asset in my research studies. He has always been an endless source of inspiration in my work and someone whom I could count on to answer my questions. I am deeply indebted to Carina Jahani for her expertise in Persian linguistics. Carina’s wealth of knowledge and reflection about Persian grammar and the guidance she has provided have been a significant advantage throughout my research process. As a PhD student with one foot in Computational Linguistics and the other in the Persian language (as a native speaker), I feel immensely lucky to have had this opportunity to receive full support and deep scholarly guidance from two experts in these two fields. Their expertise was a perfect match to my research area. I would also like to thank Anna Sågvall Hein, my first main supervisor, for accepting me as her PhD student. Although I did not have the chance to work closely with her, as she retired soon after I started my work, I am grateful for the opportunity I was given to enter into this research field.

This research would not have been possible to complete without the help I received from staff and colleagues at the Department of Linguistics and Philology, as well as other researchers from elsewhere. At our department, I am immensely thankful to Jörg Tiedemann for his guidance and help with the Uplug tokenizer when I was adapting the software to Persian. Even though I came by his office unannounced, knocking at the door as an unexpected visitor, he kindly answered my questions and helped me. Thank you! Looking back, I realize that I should have booked a time. I will do that next time! I am thoroughly thankful to Per Starbäck for his technical support. He has always been very helpful and promptly resolved technical problems that arose, as well as answering my questions related to Uppsala University’s thesis template and coming up with new solutions that saved me much time. I would also like to thank Bengt Dahlqvist for acquainting me with the Ruby programming language and answering related questions when I began developing the Persian normalizer in Ruby, and also for technical support when Per Starbäck was not available. I am deeply thankful to Forogh Hashabeiky, Mahmoud Hassanabadi, and Esmat Esmaeili at the Persian language department for their fruitful discussions, valuable advice, and suggestions regarding Persian orthography, morphology, and syntax. Even though we did not always agree on different issues in Persian grammar, the discussions always opened my mind to think differently and see things from other points of view.

I would like to express my special thanks to other (former or present) colleagues and friends at the department for their support, scholarly interaction, kind messages, company during coffee breaks or Fridays after work, or simple chats in the pantry: Aynur Abish (my former officemate), Jakob Andersson, Ali Basirat, Miguel Ballesteros, Mats Dahllöf, Marie Dubremetz, Meghdad Farahmand, Christian Hardmeier, Eva Martinez Garcia, Daisy Gurdeep Kaur, Birsel Karakoç, Mattias Nilsson (my former college classmate and officemate), Alexander Nilsson, Padideh Pakpour, Eva Pettersson, Yan Shao, Guiti Shokri, Aron Smith, Sara Stymne, Heinz Werner Wessler, and Vera Wilhelmsen. I am also thankful to Beáta Megyesi for her feedback and comments on some of my published articles.

I am enormously indebted to Yvonne Adesam, from Gothenburg University, for her valuable comments and feedback when she acted as opponent at my mock defense. I am especially thankful to Mahmood Bijankhan, from the University of Tehran, for kindly answering my e-mails related to the Bijankhan Corpus, as well as to Hamid Hassani from the Academy of Persian Language and Literature in Tehran, for his replies regarding Persian linguistics. I would like to extend my very special thanks to Jon Dehdari, from Ohio State University, for patiently answering my questions related to the Persian Link Grammar Parser and cordially sharing his annotation scheme and a number of selected annotated sentences that I used as a starting point for my work with the treebank creation. I am extremely thankful to Jan Štěpánek and Martin Kruliš from Charles University in Prague, and Petr Pajas from Google in Zürich (former member of the TrEd development team), for kindly following up bugs I encountered when working with TrEd and also for their efforts to reconfigure the Tk library so that TrEd could work smoothly for Persian on a Mac! I am deeply grateful to Bernd Bohnet, from Google in London, for his kindness and generous collaboration, running multiple experiments with different Mate Parsers during the time I was tuning parameters within the treebank. I am particularly thankful to Recorded Future Inc. for their financial support in developing the treebank.

I am very grateful to Sara and her husband Pezhman for their many years of friendship. In particular, I am thankful to Sara for all those girls’ nights out and the laughs we shared together. Thanks for all the fun stories we suspiciously made up about people who crossed our path on our way back to the car after watching horror movies. You always made me forget about the hard work I was doing and I enjoyed the moments to the full, like in my teenage years. Thank you for all the wonderful memories!

Finally, I would like to express my deepest appreciation and respect to my dear parents for their lifelong support and unconditional love and encouragement. Special thanks go to my sister for her love, care and friendship throughout my life, and also for designing and creating that beautiful image of the tree with Persian sentences for foliage, resembling the shape of Iran on the map, that I used as a symbol in my work on the treebank. Last but

certainly not least, I am endlessly grateful to my dear family who endured the great amount of time I was away from them working on this thesis. I am wholeheartedly thankful to my sons, Babak and Hooman, the joys of my life, for collaborating with their mom in being more independent and doing their everyday chores excellently. One final special thank you goes to my husband for his love and support and for taking on more responsibility at home so that I could concentrate on my research.

Uppsala, April 2015
Mojgan Seraji


Glossing Abbreviations

cl classifier

cont continuous

ez ezāfe construction

fut future

gl glide

indef indefinite

inf infinitive

neg negation

past past

pc pronominal clitic

pl plural

pp past participle

pres present

sg singular

sub subjunctive mood

voc vocative


1. Introduction

Can computers overtake human beings when it comes to the ability to produce and understand language? We live in an era characterized by real-time communication in which searching for, exchanging, and sharing information happen instantaneously, and we therefore need machines that can understand and process human language. Machines should also be able to provide support when we face language barriers. Over the past decades, various techniques have been applied to develop tools for automatic processing of human language at different levels. Although computers are still far from being able to match human ability, modern breakthroughs in computational linguistics have resulted in innovative applications in such areas as information retrieval, information extraction, machine translation, speech technology, and human-machine communication.

Techniques in computational linguistics are to some extent language independent but are always dependent on the availability of language-specific resources. In particular, most approaches today rely on statistical machine learning techniques. Systems based on supervised machine learning have the advantage of being readily adapted to any specific domain or language, given data sets annotated with linguistic information of that language. However, machine learning techniques require large data sets, preferably annotated ones, for the induction of linguistic structure. In addition, they require tools for processing language-specific data.

Thus, every language needs standardized and publicly available resources such as lexicons, general and specialized corpora, as well as tools for processing data. The notion of a Basic Language Resource Kit (BLARK) has been coined for the resources and tools needed to develop language technology applications for a given language. In order to be maximally useful, a BLARK should be reusable and freely available. Reusability and open access of language-specific resources and tools enable researchers and developers to easily enlarge and modify the source materials. This may improve the quality of data analysis results and at the same time reduce the cost and time for development. Otherwise, there is a risk that each developer will have to recreate resources and tools for more advanced language processing tasks.

Languages vary greatly in terms of the number of resources and tools that are available. Most languages still lack basic resources and tools for language processing. Persian is one of the languages with a sizable number of native speakers in the world. Yet, it still belongs to the group of languages with relatively few annotated data sets and tools. Importantly, most of those resources

and tools that do exist are not freely available. Although a certain number of resources and tools have been developed recently in Persian computational linguistics, there is still a great need to develop new ones. The aim of my research is to contribute to this effort.

Developing language resources and tools for Persian can additionally benefit computational linguistics in general. A language like Persian offers partly different challenges compared to languages that have received more attention, in particular English. The lack of standardization in Persian orthography poses challenges for tokenization that further impact the quality of morphological and syntactic analysis. Persian syntactic structure exhibits special characteristics, in particular the prevalence of complex predicates in the form of so-called light verb constructions. There are thus a variety of challenges in Persian on various levels, from orthography to syntactic structure. Hopefully, the methods and solutions put forward in this thesis can ease the way for other languages with similar linguistic and orthographic characteristics to develop language resources and tools.

1.1 Goals and Research Questions

The major research motivation behind this doctoral thesis is to develop open source language resources and tools for Persian. The goal is to make the language technology infrastructure richer and hopefully move it a step closer to a full-fledged BLARK for this language. More specifically, I want to improve a part-of-speech tagged corpus and build a dependency-based treebank for Persian. In addition, I want to develop a normalizer, a sentence segmenter and tokenizer, a part-of-speech tagger, and a parser for Persian text processing.

In pursuing this goal I observe two important requirements. The first is the compatibility requirement, which has two parts. On the one hand, tools are meant to be run in a pipeline, where the output of one tool must be compatible with the input requirements of the next. For example, the output of a part-of-speech tagger must match the input requirements of a syntactic parser. Accordingly, the pipeline will take raw text as input and provide syntactically analyzed text as output (a schematic sketch is given below). On the other hand, I want the tools to deliver the same analysis that is found in the annotated corpora. Otherwise, it is impossible to use the annotated corpora to train new tools that can be applied to the output of the pipeline.

The second requirement is one of necessity. To be able to develop these resources and tools within the scope of a thesis project, the development must be based on reuse of existing resources and tools. Thus, the corpus resources developed will be based on the only freely available tagged corpus of Persian, the Bijankhan Corpus (Bijankhan, 2004), and tools for morphological and syntactic analysis will be created by adapting existing tools to Persian.
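As a rough illustration of the compatibility requirement, the following Python sketch shows the intended shape of the pipeline. The stage functions (normalize, segment, tokenize, tag, parse) are hypothetical placeholders standing in for the actual tools developed in this thesis; the sketch shows only how each stage's output must match the next stage's input, not any released implementation.

    def run_pipeline(raw_text, normalize, segment, tokenize, tag, parse):
        """Each stage consumes exactly what the previous stage produces."""
        text = normalize(raw_text)                        # orthographically normalized text
        sentences = segment(text)                         # list of sentence strings
        token_lists = [tokenize(s) for s in sentences]    # tokens per sentence
        tagged = [tag(tokens) for tokens in token_lists]  # (token, tag) pairs per sentence
        return [parse(sentence) for sentence in tagged]   # dependency trees

Because every stage is made consistent with the same annotated corpora, the analysis produced at each step matches the annotation used to train the next.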

The goals and requirements together raise the following research questions:

Q-1 How can we develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse?

Q-2 How accurately can we perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora?

The first question addresses the interaction between different linguistic levels with respect to segmentation and annotation when modifying the existing annotation scheme in the Bijankhan Corpus for higher linguistic analysis. Adding a syntactic annotation layer imposes new requirements on lower layers, and the question is how I can best satisfy these requirements without re-segmenting and reannotating all the data from scratch. The situation is further complicated by inconsistencies in Persian orthography with respect to syntactically significant elements such as clitics. The modifications are basically improvements made to tokenization and part-of-speech tagging to make the corpus more appropriate for syntactic analysis. In other words, the corpus is to be used as the basis for a dependency-based treebank for Persian.

The second question will be addressed by adapting and evaluating standard tools built on the resources from question 1. For this, I make use of standard methods and state-of-the-art tools. Among the tools I have selected are the sentence segmentation and tokenization tools in Uplug (Tiedemann, 2003), the part-of-speech tagger HunPoS (Halácsy et al., 2007), and the data-driven parser generator MaltParser (Nivre et al., 2006). Adapting these tools and evaluating them on the morphologically and syntactically annotated corpora will provide benchmarks for morphosyntactic analysis of Persian.

1.2 Research Methodology

Computational linguistics is a multidisciplinary field, which uses methods from several different sciences. Developing resources and tools can be seen as part of design science, where the notion of utility is of prime importance. Annotating corpora is a form of linguistic analysis which draws upon a long tradition of descriptive and theoretical linguistics. Evaluating tools is a kind of experimental science, based on principles for experimental design and statistical inference for hypothesis testing.

Resources and tools for a specific language must be designed to match certain characteristics of that language. In developing the annotated resources, I therefore take advantage of the Persian grammatical tradition. However, the resources and tools must also serve the needs of practical language technology, which means that I will need to adapt the traditional descriptions to fit

the needs of automatic processing and make sure that the requirements for compatibility can be met.

In developing the pipeline I take advantage of both rule-based and statistical techniques. More specifically, the development of the normalizer, the sentence segmenter, and the tokenizer follows a rule-based approach, while the creation of the part-of-speech tagger and the dependency parser is oriented towards statistical language modeling. For the treebank development I further employ statistical bootstrapping.

To address the first research question I will systematically study the linguistic properties of Persian and try to come up with suitable methods given the requirements of compatibility. For automatic modeling of the Persian language I employ statistical methods, which are to some extent language independent, while the methods used for data representation are to a great extent dependent on the linguistic properties of words, phrases, and sentences (morphological and syntactic structure) in Persian.

To address the second research question I will rely on the established experimental methodology for evaluation in computational linguistics. By measuring the accuracy of a tool on a sample of data that has not been used in developing the tool, we can use statistical inference to estimate the general accuracy of the tool or to test hypotheses about the relative merits of different tools (a schematic example is given below).

In the rest of the thesis, methods for research and development will not be discussed in separate subsections. Instead this discussion will be integrated into the discussion of tools and resources, so that different methodological choices can be justified in the proper context.
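The following sketch makes the evaluation methodology concrete: token-level accuracy on held-out data, plus a paired bootstrap resampling test for comparing two tools. It is a minimal Python illustration of the general procedure, not the evaluation code used in this thesis.

    import random

    def accuracy(predicted, gold):
        """Fraction of held-out tokens whose predicted tag equals the gold tag."""
        return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

    def paired_bootstrap(pred_a, pred_b, gold, samples=10000):
        """Estimate how often tool A outscores tool B on resampled test sets."""
        n = len(gold)
        wins = 0
        for _ in range(samples):
            idx = [random.randrange(n) for _ in range(n)]   # resample with replacement
            acc_a = sum(pred_a[i] == gold[i] for i in idx) / n
            acc_b = sum(pred_b[i] == gold[i] for i in idx) / n
            wins += acc_a > acc_b
        return wins / samples   # a value close to 1.0 suggests A is reliably better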

1.3 Outline of the Thesis

After introducing the goals and research questions in this introductory chapter, I organize the remainder of the thesis into the following chapters:

Chapter 2 provides background on morphosyntactically annotated corpora and tools. In addition, it gives a brief description of Persian and its main characteristics, as well as a discussion of challenges that arise in processing Persian text. The chapter ends with an overview of existing morphosyntactic corpora and tools for Persian.

Chapter 3 introduces the Uppsala Persian Corpus, a part-of-speech tagged corpus developed by improving the tokenization and part-of-speech tagging of the Bijankhan Corpus.

Chapter 4 presents tools for automatic analysis of Persian developed by reusing and modifying existing tools such as the sentence segmentation and tokenization tools in Uplug and the part-of-speech tagger HunPoS, all compatible with the Uppsala Persian Corpus. The chapter

ends with empirical evaluations of the sentence segmentation and tokenization tools, as well as the part-of-speech tagger, including a detailed error analysis.

Chapter 5 presents the Uppsala Persian Dependency Treebank, a dependency-based treebank with an annotation scheme based on Stanford Typed Dependencies. This chapter additionally provides a comparison with an existing dependency-based treebank for Persian.

Chapter 6 presents extensive parsing experiments using MaltParser, exploring the impact on parsing accuracy of different label sets for both part-of-speech tags and dependency relations. Moreover, it presents evaluations of different dependency parsers such as MSTParser, TurboParser, and MateParsers on the best selected treebank representation. The chapter ends by introducing and evaluating a parsing tool for Persian, developed by training the graph-based MateParser on the Uppsala Persian Dependency Treebank.

Chapter 7 summarizes the main contributions of the thesis and ends with suggestions for future research.

1.4 Previous Publications

This thesis is to a large extent based on the following publications:

Mojgan Seraji (2011). A Statistical Part-of-Speech Tagger for Persian. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA), pages 340–343, Riga, Latvia.

Mojgan Seraji, Beáta Megyesi, and Joakim Nivre (2012b). Bootstrapping a Persian Dependency Treebank. Linguistic Issues in Language Technology 7 (18), pages 1–10.

Mojgan Seraji, Beáta Megyesi, and Joakim Nivre (2012a). A Basic Language Resource Kit for Persian. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pages 2245–2252, Istanbul, Turkey.

Mojgan Seraji, Beáta Megyesi, and Joakim Nivre (2012c). Dependency Parsers for Persian. In Proceedings of the 10th Workshop on Asian Language Resources, 24th International Conference on Computational Linguistics (COLING), pages 35–44, Mumbai, India.

Mojgan Seraji, Carina Jahani, Beáta Megyesi, and Joakim Nivre (2013). Uppsala Persian Dependency Treebank: Annotation Guidelines. Department of Linguistics and Philology, Uppsala University.

Mojgan Seraji, Carina Jahani, Beáta Megyesi, and Joakim Nivre (2014). A Persian Treebank with Stanford Typed Dependencies. In Proceedings of

the 9th International Conference on Language Resources and Evaluation (LREC), pages 796–801, Reykjavik, Iceland.

Mojgan Seraji (2013). PrePer: A Pre-processor for Persian. Presented at the 5th International Conference on Iranian Linguistics (ICIL5), Bamberg, Germany. [Not published.]

Mojgan Seraji, Bernd Bohnet, and Joakim Nivre (2015). ParsPer: A Dependency Parser for Persian. In Proceedings of the International Conference on Dependency Linguistics (DepLing 2015), Uppsala, Sweden. [Submitted.]

2. Background

This chapter provides background on morphosyntactically annotated corpora and tools for morphosyntactic analysis. More specifically, it discusses annotation schemes used in part-of-speech tagging and syntactic analysis of monolingual corpora, as well as standard methods for preprocessing, sentence segmentation and tokenization, data-driven part-of-speech tagging, and parsing. It further gives a brief description of Persian and its main orthographic, morphological, and syntactic features, while discussing interdependent text processing issues. The chapter ends by presenting the existing morphosyntactic corpora and tools for morphosyntactic analysis of Persian.

2.1 Corpora

Corpora are compiled collections of linguistic data, either in the form of written or spoken material, or transcriptions of recorded speech. The usefulness of corpora for different purposes has grown over the past 50 years, as various types of corpora have been developed and often enriched with linguistic information. Nowadays, corpora with different types of linguistic information have become essential training resources for developing computational tools by means of machine learning. Even systems that are based on hand-crafted rules need to be evaluated with annotated corpora. Corpora are further used as resources in linguistic research and for teaching and learning.

Most created corpora are monolingual. The classic Brown Corpus (Kučera and Nelson, 1967) and the British National Corpus (BNC) (Aston and Burnard, 1998) are typical monolingual corpora for English. However, there also exist multilingual parallel corpora containing texts in one language with translations in another. The Hansard Corpus (Roukos et al., 1995), based on records of proceedings in the Canadian Parliament in both English and French, and the European Parliament (EUROPARL) parallel corpus (Koehn, 2002), based on European languages, are typical multilingual parallel corpora.

General corpora exist that are designed to represent a wide variety of genres and domains. These corpora are used as standard references for a given language and contain samples from regional and national newspapers, technical journals, academic books, fiction, political statements, etc. General corpora vary in size. Some are large, consisting of more than 100 million words, such as the BNC for modern British English, or the English Gigaword version 5 (Parker et al., 2011) with a total of more than 4 billion words (currently

the largest corpus of English). The latter corpus consists of 10 million documents taken from different news outlets. Others are much smaller and contain 1 million words, such as the Stockholm Umeå Corpus (SUC) (Capková and Hartmann, 2006) for Swedish. There are also specialized corpora that are developed for a specific domain. The Guangzhou Petroleum English Corpus (Q.-b. Zhu, 1989), for instance, consists of 411,612 words of written English from the petrochemical domain. The Computer Science corpus of the Hong Kong University of Science and Technology (HKUST) (James et al., 1994) is a further example of a domain-specific corpus, and contains one million words of written English taken from textbooks in computer science. Monitor (or open-ended) corpora are another variety. These are constantly being updated with language changes in order to track the advent and life cycle of neologisms. The Corpus of Contemporary American English (COCA) (Davies, 2010), an example of this kind, was started in 1990 as the first electronic archive monitor corpus. With its 450 million words (1990–2012), it is the largest freely-available corpus of American English.

Corpora may contain metadata, namely information associated with a text such as title, author, date of publication, etc. Metadata related to different corpora is represented differently. For instance, metadata in early corpora such as the Brown Corpus was provided in a separate reference manual (a large A4 volume of typescript). However, nowadays, metadata is usually represented in an integrated form together with the corpus by a particular text encoding. There are various types of text encoding standards for corpora such as the Text Encoding Initiative1 (TEI), the Corpus Encoding Standard (CES) (Ide et al., 1996), and the XML version of CES (XCES) (Ide et al., 2000). Different corpora may additionally use different character encodings such as ASCII, ISO-8859-1, etc., whereas Unicode has a unique representation for every possible character, including alphabets, syllabaries, and logographic systems (see the illustration at the end of this section).

At present, many corpora are annotated at different linguistic levels. These annotation layers are generally accomplished sequentially from lower to upper layers of linguistic information, i.e., first morphology, then syntax, and finally semantics (Taulé et al., 2008). Each annotation process is performed manually, semi-automatically, or fully automatically. The two most common layers of linguistic description are morphological and syntactic annotations. In the following sections, we will review the structure and design of morphological and syntactic annotation schemes. Other types of annotation such as semantic annotation and discourse annotation will not be covered in this thesis. It is worth noting that the terms morphological annotation and morphosyntactic annotation are sometimes used as synonyms and sometimes not. For clarity, in this thesis I have decided to use the term morphological annotation for annotation at the word level, syntactic annotation for annotation at the sentence level, and morphosyntactic annotation as a term covering both.

1http://www.tei-c.org/index.xml
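To illustrate the difference between legacy character encodings and Unicode code points mentioned above, the following snippet (a minimal Python illustration, not tied to any corpus discussed here) shows the Persian word کتاب 'book' as code points and as UTF-8 bytes:

    word = "کتاب"  # the Persian word ketab 'book'
    print([hex(ord(ch)) for ch in word])    # code points: ['0x6a9', '0x62a', '0x627', '0x628']
    print(word.encode("utf-8"))             # the same characters as UTF-8 bytes
    print(word.encode("utf-8").decode("utf-8") == word)   # True: lossless round-trip

A legacy 8-bit encoding such as ISO-8859-1 has no slots for these characters at all, which is one reason consistent use of Unicode matters for Persian corpora.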

2.1.1 Morphological Annotation

Corpora annotated with morphological information are one of the fundamental language resources and are a prerequisite for creating and evaluating language analysis modules such as morphological analyzers, taggers, chunkers, and parsers. Morphological annotation encodes different aspects of lexical information such as part of speech (PoS), morphological features, and lemma. For example, the morphological analysis of the word cats could be: POS = NOUN, NUMBER = PLURAL, and LEMMA = cat. Lemmatization involves assigning each word its base form, while morphological annotation involves assigning part-of-speech tags to different tokens using a fixed list of tags called a tagset. Since I have not treated lemmas and have limited my work to parts of speech and morphological features, I will not discuss lemmatization further.

There are various types of morphological information in different languages that require different kinds of markup. For example, some languages contain information about gender and some have case systems. Tagsets therefore vary from language to language depending on the linguistic characteristics and structure of a particular language. Tagsets can also differ within a language. Depending on what a corpus is developed for, a tagset may contain more or less fine-grained distinctions. For instance, the noun category can be assigned different fine-grained classifications, such as common noun for a word like book and proper noun for a word like John. A fine-grained tagset can be represented either with atomic tags that store and combine a part-of-speech tag with its morphological features or with complex tags that are composed of atomic tags and additional features. An example of a fine-grained tagset using atomic tags is the Penn Treebank tagset. Each tag represents a base category together with specific atomic values: for example, NN is a singular or mass common noun, NNP is a singular proper noun, and NNS is a plural common noun. An example of a fine-grained tagset using complex tags is SUC (Capková and Hartmann, 2006). In this tagset, each part-of-speech tag is followed by one or more feature values, such as NN UTR PLU IND NOM, where NN denotes the base part-of-speech tag noun, followed by the features UTR (specifies gender as common), PLU (defines number as plural), IND (marks indefiniteness), and NOM (represents nominative case). The number of tags in a tagset depends on how many morphological features exist in a language. There are, for example, differences between the basic tagset for a morphologically ambiguous inflective language like Czech, with 1171 part-of-speech features, and a poorly inflected language like English, with 48 tags in the Penn Treebank (Hladká and Ribarov, 1998).

Various tagset systems in different annotated corpora often share a number of major part-of-speech categories such as adjective, adverb, article, conjunction, determiner, interjection, noun, numeral, pre/postposition, pronoun, verb, and in many cases punctuation (van Halteren, 1999). These main categories can easily be further analyzed according to the morphological features of the word, giving a more fine-grained annotation.
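As a small illustration of the relation between a complex tag and its constituent features, the following sketch decomposes a SUC-style tag string into a base part of speech and a feature list. The feature values are taken from the example above; the function itself is a hypothetical illustration, not part of the SUC tools.

    def parse_complex_tag(tag):
        """Split a complex tag into its base part of speech and feature values."""
        pos, *features = tag.split()
        return {"pos": pos, "features": features}

    print(parse_complex_tag("NN UTR PLU IND NOM"))
    # {'pos': 'NN', 'features': ['UTR', 'PLU', 'IND', 'NOM']}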

Miscellaneous categories that may not fit into other tagsets, such as symbols, abbreviations, foreign expressions, and so forth, can be defined as special tags. Special tags can further be combined with the major part-of-speech categories in a special tag system for specific texts and specific languages.

As languages differ greatly with regard to morphological complexity, it seems to be difficult to include all varieties of languages within one standardized annotation scheme. However, because the sharing, merging, and comparison of language resources is increasingly common in language technology, the use of common standards and interoperability between resources is to a large extent taken into consideration. So far, certain fundamental principles for morphological annotation have been adopted and many attempts have been made to create different standards for different languages.

In natural language processing, different approaches have been suggested to facilitate future research and to standardize best practices. An elementary morphological annotation set based on language-independent recommendations, the EAGLES tagset, proposed in Leech and Wilson (1994), was an early attempt in this area. Morphological labels were initially provided for English and Italian. Leech and Wilson (1994) proposed that any morphological tagset should be at a level that can easily be mapped onto an intermediate tagset. The aim was to demonstrate what is common between different languages and what options are available for extension or omission. The basic idea underlying this proposal is to represent a set of common morphological categories that exist across languages and are often realized as universals. Multext-East (Erjavec and Ide, 1998), for instance, was a project that used the same formal EAGLES-based morphological tagset for multiple languages, namely Bulgarian, Czech, English, Estonian, Hungarian, Romanian, and Slovene. The project resulted in an annotated multilingual corpus (Erjavec and Ide, 1998) containing a speech corpus, a comparable corpus, and a parallel corpus, lexical resources (Tufis et al., 1998), and tool resources for the seven languages (Erjavec et al., 2003). The specifications were later extended to cover nine languages, five of which are Slavic: Bulgarian, Croatian, Czech, Serbian, and Slovene. Interset (Zeman, 2008) is a further example of an interlingual morphological tagset. It contains a universal set of parts of speech as well as morphological features such as gender, number, tense, etc. Through Interset, any morphological tagset of any language can be converted into any other tagset, using the Interset representation as an interlingua. In other words, Interset is used to encode language-specific tagsets in a joint and uniform representation. Some features of the source tags may be lost during conversion, however; this depends to a great extent on the features that the target tagset can take in. More recently, Petrov et al. (2012) proposed a tagset containing twelve universal part-of-speech categories that cover the most frequent word classes in different languages. A mapping from fine-grained part-of-speech tags for 25 different treebanks is additionally included

in this universal set. When the original treebank data is included, the universal tagset and mapping produce a data set containing common part-of-speech tags for 22 different languages.

As we have seen, different approaches to multilingual morphological specification have been presented as leading towards one standard analysis to make it easier to add new languages. Yet, it is still far from simple to adopt a group of inflectional tags or select what kinds of attributes and values to use as one single universal tagset. In other words, there is still no unified standardized morphological annotation scheme.

Developing morphologically annotated corpora is highly time-consuming. Therefore, different techniques are usually applied in their creation, with an interplay between automatic analysis and manual linguistic revision in order to reduce costs while preserving quality. A bootstrapping procedure is usually employed in order to increase the size of an annotated corpus (see the sketch at the end of this subsection). The process starts by training a part-of-speech tagger on a seed data set of manually annotated and validated data and then using the induced model to tag a subset of raw texts. The tagged corpus is corrected and then added to the training set. The tagger is retrained with the new extended training data to tag additional raw texts. This process is iterated as the size of the corpus grows, and the quality of the tagger improves because more training data ensures better performance.

As morphological specifications are based on the notion of words, they are not sufficient for linguistic analysis at the sentence level. Therefore, an additional layer of syntactic analysis is required, as will be described in the next subsection.
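The bootstrapping procedure just described can be rendered schematically as follows. This is a sketch only: train, tag, and correct are hypothetical stand-ins for the actual tagger training, automatic tagging, and (human) validation steps.

    def bootstrap_corpus(seed_corpus, raw_batches, train, tag, correct):
        """Grow an annotated corpus by iterated tag-correct-retrain cycles."""
        annotated = list(seed_corpus)            # manually annotated seed data
        for batch in raw_batches:
            model = train(annotated)             # retrain on all data so far
            tagged = [tag(model, text) for text in batch]
            annotated.extend(correct(tagged))    # manual correction by annotators
        return annotated                         # corpus grows as tagger quality improves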

2.1.2 Syntactic Annotation

Over the past decades there has been increasing interest in developing different syntactically annotated corpora, or treebanks, for many languages, focusing on grammatical representations beyond the morphological analysis level. In syntactically annotated corpora, each sentence is annotated with its syntactic structure. Treebanks are often built on an already annotated corpus that has part-of-speech tags and sometimes is enhanced with semantic information. Treebanks are typically much smaller in size than the part-of-speech tagged corpora they are built on, usually containing between 50,000 and 1,000,000 tokens.

Selecting a subset of corpus material to include as treebank data is a crucial consideration, as it is for any annotated corpus. Since treebanks are usually based on previously established corpora, they inherit the genres of the original corpus (Nivre, 2008b). For instance, the SUSANNE Corpus (Sampson, 1995) is based on a subset of the Brown Corpus (Kučera and Nelson, 1967). The genre on which most available treebanks are based is contemporary newspaper text.

Figure 2.1. Constituent structure for an English sentence taken from the Penn Treebank (Marcus et al., 1993).

Figure 2.2. Dependency structure for an English sentence taken from the Penn Treebank, converted to the Stanford Typed Dependencies representation.
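For concreteness, the analysis in Figure 2.2 can be written out in the tabular CoNLL-X format referred to in Table 2.1, with one token per row and columns for ID, form, coarse (universal) and fine part-of-speech tags, head, and dependency label (unused columns left as underscores). The cell values below are an illustrative reconstruction, with the preposition attached to effect following the constituency structure in Figure 2.1:

    1  Economic   _  ADJ   JJ   _  2  amod   _  _
    2  news       _  NOUN  NN   _  3  nsubj  _  _
    3  had        _  VERB  VBD  _  0  root   _  _
    4  little     _  ADJ   JJ   _  5  amod   _  _
    5  effect     _  NOUN  NN   _  3  dobj   _  _
    6  on         _  ADP   IN   _  5  prep   _  _
    7  financial  _  ADJ   JJ   _  8  amod   _  _
    8  markets    _  NOUN  NNS  _  6  pobj   _  _
    9  .          _  .     PU   _  3  punct  _  _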

There are, however, treebanks developed from other types of texts, such as the Penn-Helsinki Parsed Corpus of Middle English (Kroch and Taylor, 2000). Parallel treebanks are another variety of syntactically annotated corpora, with texts in one language and their translations in another (Adesam, 2012). Parallel treebanks are usually intended for machine translation tasks, for example, the Prague Czech-English Dependency Treebank (Čmejrek et al., 2004).

There are various types of treebanks with respect to the choice of annotation schemes. The selection of annotation scheme for a treebank is determined by various factors such as its relation to a linguistic theory. In recent years, a number of different schemes have been proposed for syntactic annotation, some based on phrase structure and others on dependency structure, some based on specific linguistic theories and others attempting to be theory-neutral (Nivre, 2008b). The majority of available treebanks are based on constituency annotation, as in the Penn Treebank for English (Marcus et al., 1993), dependency annotation, as in the Prague Dependency Treebank for Czech (Hajič et al., 2001), or a hybrid representation based on both constituency and dependency structure, as in the TIGER Treebank (S. Brants et al., 2002). The most

widely used representations are constituency and dependency. Figure 2.1 and Figure 2.2 show an English sentence annotated with constituency and dependency structure.

Constituency structure (also referred to as phrase structure) is defined with phrases that are built of smaller phrases. In other words, each sentence is decomposed into its constituent parts. As pointed out by Hladká and Ribarov (1998), the Penn Treebank for English has been very influential for the development of similar treebanks such as the Penn Arabic Treebank (ATB) (Maamouri et al., 2004), the Penn Chinese Treebank (Xue et al., 2005), and so forth.

Syntactic bracketing is a constituency-based representation format that was used in early large-scale projects such as the Lancaster Parsed Corpus (Garside et al., 1992) and the original Penn Treebank (Marcus et al., 1993). The annotation contains part-of-speech tagging for tokens and syntactic relations for phrase categories. An example from the IBM Paris Treebank, using a variant of the Lancaster annotation, taken from Nivre (2008b), is shown in Figure 2.3.

[N Vous_PPSA5MS N] [V accédez_VINIP5 [P à_PREPA [N cette_DDEMFS session_NCOFS N] P] [Pv à_PREP31 partir_PREP32 de_PREP33 [N la_DARDFS fenêtre_NCOFS [A Gestionnaire_AJQFS [P de_PREPD [N tâches_NCOFP N] P] A] N] Pv] V]

Figure 2.3. Constituency annotation in the IBM Paris Treebank.

In dependency-based representations, on the other hand, syntactic structure is viewed as a set of linked asymmetric and binary head-dependent relations rather than as a set of nested constituents. Every word in a dependency representation normally has at most one head governing it, and each head-dependent relation is marked and annotated with a functional category indicating the grammatical function (such as subject or object) of the dependent with respect to the head. Dependency structure has become increasingly common in recent years, particularly for languages with flexible word order. The Prague Dependency Treebank for Czech (Hajič et al., 2001) has been very influential in this development.

Dependency-based treebanks now exist for Arabic (Hajič et al., 2004), Basque (Aduriz et al., 2003), Danish (Kromann, 2003), Greek (Prokopidis et al., 2005), Russian (Boguslavsky et al., 2000), Slovene (Džeroski et al., 2006), Turkish (Oflazer et al., 2003), Chinese (Chang et al., 2009), and Finnish (Haverinen et al., 2013), among other languages.
The Stanford Typed Dependencies (STD) representation (de Marneffe and Manning, 2008) is a dependency-based annotation scheme that was originally developed as an automatic procedure for converting a constituency-based representation into a dependency-based one. STD has been designed to be cross-linguistically valid, and the scheme has become a de facto standard for English. So far, it has been successfully adapted to different types of languages, such as Chinese (Chang et al., 2009), Finnish (Haverinen et al., 2010), and Modern Hebrew (Tsarfaty, 2013). In the basic version of STD, the dependency annotation of a sentence always forms a tree that contains all tokens of the sentence (including punctuation) and is rooted at an artificial root node prefixed to the sentence. There is also a collapsed version of STD, as opposed to the basic version, where some tokens may not correspond to nodes in the dependency structure and a single node may have more than one incoming arc. A more detailed description of STD and its grammatical relations (dependency labels) will be given in Chapter 5, where I present the construction of the Uppsala Persian Dependency Treebank, which is based on Stanford dependencies.
Moreover, de Marneffe et al. (2014) propose the Universal Stanford Dependencies, an improved taxonomy of STD intended to better cover grammatical relations across many languages. The proposed universal taxonomy can easily be mapped onto the existing dependency schemes described in Chang et al. (2009), Bosco et al. (2013), Haverinen et al. (2013), Seraji et al. (2013), Tsarfaty (2013), and McDonald et al. (2013), which are drawn from STD (de Marneffe et al., 2014). Since the scheme was introduced after I released the Uppsala Persian Dependency Treebank, I have not yet applied it to the treebank. However, some relations in the Universal Stanford Dependencies are influenced by the relations introduced in Seraji et al. (2013), as will be discussed in Section 5.3.
In addition to purely constituency- and dependency-based schemes, there are schemes that combine elements of both. SUSANNE (Sampson, 1995), for instance, was developed by extending the original constituency-based scheme to include a scheme of grammatical functions. Additional cases that make use of two independent annotation layers, one for constituency and one for dependency structure, are the TIGER annotation scheme for German (S. Brants et al., 2002) and the VISL (Visual Interactive Syntax Learning) scheme, developed on a small scale for 22 languages and subsequently used in developing treebanks for Portuguese (Afonso et al., 2002) and Danish (Bick, 2003). The Prague Dependency Treebank (Hajič et al., 2001), the Turin University Treebank (Bosco and Lombardo, 2004), and the Sinica Treebank (Huang et al., 2000) are further examples of treebanks combined with semantic annotation schemes.

Table 2.1. An example of the English sentence Economic news had little effect on financial markets., taken from the Penn Treebank (Marcus et al., 1993), annotated with the Google universal part-of-speech tags (Petrov et al., 2012) and STD, presented in CoNLL format.

ID  FORM       LEMMA  CPOSTAG  POSTAG  FEATS  HEAD  DEPREL  PHEAD  PDEPREL
1   Economic   _      ADJ      JJ       _      2     amod    _      _
2   news       _      NOUN     NN       _      3     nsubj   _      _
3   had        _      VERB     VBD      _      0     root    _      _
4   little     _      ADJ      JJ       _      5     amod    _      _
5   effect     _      NOUN     NN       _      3     dobj    _      _
6   on         _      ADP      IN       _      3     prep    _      _
7   financial  _      ADJ      JJ       _      8     amod    _      _
8   markets    _      NOUN     NNS      _      6     pobj    _      _
9   .          _      .        .        _      3     punct   _      _

In the Prague Dependency Treebank, a layer of tectogrammatical annotation has been added to the surface dependency structure to provide a deeper, semantics-oriented analysis of the syntactic structure. The Turin University Treebank follows the same trend by adding annotation of semantic roles to the grammatical functions (Bosco and Lombardo, 2004). The Chinese Sinica Treebank uses a dependency-based annotation with semantic roles in addition to constituent structure (Keh-Jiann et al., 2003).
Treebanks exist in different standard encoding formats. Certain formats have become de facto standards through the influence of major treebank projects or shared tasks. These include, among many others, the Lisp-like bracketing style in the Penn Treebank 1.0 and the PML format in the Prague Dependency Treebank. The TIGER-XML format from the German TIGER project and the CoNLL format emanating from the CoNLL shared tasks on multilingual dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007) are two more major standard formats. In the CoNLL format, for instance, sentences are separated by a blank line. Each token starts on a new line and consists of the following ten fields separated by a single tab character: token counter (ID), word form (FORM), lemma or stem of the word (LEMMA), coarse-grained part-of-speech tag (CPOSTAG), fine-grained part-of-speech tag (POSTAG), syntactic or morphological features (FEATS), head of the token (HEAD), dependency relation to the HEAD (DEPREL), projective head of the token (PHEAD), and dependency relation to the PHEAD (PDEPREL). The format is illustrated in Table 2.1 for an English sentence with a syntactic annotation based on STD.
Developing a treebank is a labor-intensive task. Combining human annotation and parsing is therefore a common annotation strategy, and bootstrapping a statistical parser is the most promising technique for increasing the size of a treebank. As the development process is usually performed semi-automatically, the task is inherently error-prone and requires consistent and careful post-processing and validation. Constructing treebanks and creating tools for automatic syntactic analysis (parsing) usually go hand in hand, as there is a symbiotic relation between designing resources and evolving data-driven tools.

The advantage of this method is that when errors are corrected in the treebank, the parser, once retrained on the corrected data, provides a more accurate analysis of new input sentences.
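To make the CoNLL format described above concrete, the following is a minimal sketch of a reader for the ten-column, tab-separated representation. It is an illustration only, not part of the toolset developed in this thesis; the function name read_conll and the file name treebank.conll are hypothetical.

    # Minimal sketch of a CoNLL(-X) reader, assuming the ten-column,
    # tab-separated format with blank lines between sentences.
    FIELDS = ["id", "form", "lemma", "cpostag", "postag",
              "feats", "head", "deprel", "phead", "pdeprel"]

    def read_conll(path):
        """Yield one sentence at a time as a list of token dictionaries."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                 # a blank line ends a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                else:
                    values = line.split("\t")
                    sentence.append(dict(zip(FIELDS, values)))
            if sentence:                     # file may not end with a blank line
                yield sentence

    # Illustrative usage: print FORM/DEPREL pairs of the first sentence.
    # for tok in next(read_conll("treebank.conll")):
    #     print(tok["form"], tok["deprel"])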

2.2 Tools
Language technology tools are programs for the generation and analysis of language. Among the most basic and important automatic tools are tools for preprocessing, sentence segmentation and tokenization, part-of-speech tagging, and parsing. Various types of language technology tools perform analysis at different levels, as there are hierarchical inter-dependencies between the tools. Given these hierarchical relations, some syntactic parsers, for instance, rely strongly on words that have already been morphologically analyzed and tagged with parts of speech. Similarly, part-of-speech tagging requires texts to be segmented into sentences and, further, words to be tokenized and distinguished from each other in order to perform analysis at the word level. There is a close connection between tools and annotation, as discussed in the previous section, because annotated data is used for training and evaluation. In general, we want the two to be compatible.

2.2.1 Preprocessing
Text preprocessing (normalization) is the process of converting a non-standard textual representation into a canonical form. This process is typically considered the first task for any NLP system, and it is language-dependent. Preprocessing data is an essential step in counteracting the effect of the principle garbage in, garbage out: when noisy data is sent in, bad results are returned in the output, which diminishes the accuracy of the language analysis.
Huge quantities of textual data are constantly being uploaded to the Internet. The data often includes non-standard token types such as words written in specific digit sequences, mixed-case words (WinNT, SunOS), misspelled words, acronyms, abbreviations, mixed writing styles of multi-word expressions, uniform resource locators (URLs) and e-mail addresses, Roman numerals, and so forth. Preprocessing may additionally involve the elimination or conversion of typical noise such as extra line breaks, extra punctuation marks before or after words, and missing or extra spaces between words. A preprocessor basically rewrites texts in a standard form. Text normalization is traditionally performed by an in-house tool and is treated in a more or less ad hoc fashion, often by using rules or machine learning methods at different levels (C. Zhu et al., 2007).
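As an illustration of the kind of rule-based normalization described above (and not of the actual normalizer developed in this thesis), the following minimal sketch removes some typical noise; the specific rules shown are examples only.

    import re

    def normalize(text):
        """Minimal rule-based text normalization sketch."""
        text = re.sub(r"\n{2,}", "\n", text)              # drop extra line breaks
        text = re.sub(r"[ \t]{2,}", " ", text)            # collapse runs of spaces
        text = re.sub(r"\s+([.,;:!?])", r"\1", text)      # no space before punctuation
        text = re.sub(r"([.,;:!?])(?=\w)", r"\1 ", text)  # one space after punctuation
        return text.strip()

    print(normalize("bad  input ,with noisy   spacing .OK"))
    # -> "bad input, with noisy spacing. OK"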

2.2.2 Sentence Segmentation
Sentence segmentation, commonly considered the second process in a natural language processing pipeline, deals with splitting a text into single sentences by recognizing the sentence boundaries. Automatic recognition of sentence boundaries by a computer program is a straightforward process when a language explicitly terminates sentences with punctuation. However, the task can be complex for languages lacking clearly defined punctuation marks as sentence boundaries. Thai, for instance, does not use any punctuation marks to define sentence boundaries, and Aroonmanakun (2007) shows experimentally that Thai sentence segmentation performed by different persons can differ. The same may also hold for languages with clearly defined sentence boundaries, because adherence to a set of rules for sentence and clause boundaries may vary dramatically depending on the author and the type of text. For instance, it may be unclear whether a title is a sentence or not.
Punctuation marks are clue symbols indicating the structure of a written text, in terms of where intonation, pauses, and emphasis are to be observed. The concept of what constitutes a sentence is to some extent arbitrary and depends largely on an author's adherence to conventions. Delimiters such as the full stop, question mark, and exclamation mark are usually used as sentence boundaries in most NLP applications (Palmer, 2000). However, the punctuation mark used as a full stop (.) in English and other European languages is also used in abbreviations, as a decimal point, and to mark suspense or ellipsis (...). This can obstruct correct sentence segmentation because the full stop may not terminate the sentence, so identifying such cases is another essential sub-task of sentence segmentation. Errors and inconsistencies of punctuation in a text can further expand the scope of the sentence segmentation problem, which in turn makes recognizing sentence boundaries very difficult. State-of-the-art sentence segmentation tools make use of various morphological and syntactic features, and the best feature set can vary across languages and genres (Fung et al., 2007). Unfortunately, sentence segmentation is an underestimated task that rarely undergoes evaluation.
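The following minimal sketch illustrates the abbreviation problem discussed above. It is not a state-of-the-art segmenter; the abbreviation list is a toy example and would have to be language-specific in practice.

    import re

    # Illustrative abbreviation list; a real system needs a language-specific one.
    ABBREVIATIONS = {"dr.", "mr.", "prof.", "e.g.", "i.e."}

    def split_sentences(text):
        """Split on ., ?, ! unless the period ends a known abbreviation."""
        sentences, start = [], 0
        for match in re.finditer(r"[.!?]+\s+", text):
            end = match.end()
            last_word = text[start:match.start() + 1].split()[-1].lower()
            if last_word in ABBREVIATIONS:
                continue                      # not a sentence boundary
            sentences.append(text[start:end].strip())
            start = end
        if start < len(text):
            sentences.append(text[start:].strip())
        return sentences

    print(split_sentences("Prof. Smith arrived. He gave a talk! Questions followed."))
    # -> ['Prof. Smith arrived.', 'He gave a talk!', 'Questions followed.']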

2.2.3 Tokenization
Tokenization is the process of segmenting a sentence into separate tokens, i.e., sequences of characters forming single units. It is usually combined with sentence segmentation into a single tool. In computational linguistics, a token is an element similar to a word, but the notion also covers other linguistic elements such as numbers, abbreviations, and punctuation marks.
Automatic identification of token boundaries can be a complex task, due to the fact that different languages have different word boundary markers. Therefore, there exist different approaches to identifying where one word ends and another starts. For languages that mark words in a text with regular white space, defined as space-delimited languages in Palmer (2000), automatic tokenization is largely performed by identifying white space and punctuation.

On the other hand, languages like Chinese, Japanese, and Thai, which do not systematically mark word boundaries in a text (Jurafsky and Martin, 2008) and are defined as unsegmented languages in Palmer (2000), have a more challenging tokenization process.
In space-delimited languages, tokenization is still complicated when it comes to multi-word expressions, such as compound words or multi-word technical terms that signify a single concept, e.g., ice cream, Artificial Intelligence, etc. The compound ice cream, for instance, consists of two words, while mentally it corresponds to a single concept. In such cases, it is not always feasible to determine word boundaries with white space. On the other hand, it is not always possible to define word boundaries with a concept-based criterion either (Aroonmanakun, 2007). Therefore, compound terms are often treated as multiple tokens in the tokenization process and are then analyzed as multi-word expressions in further steps of language analysis such as parsing.
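The following sketch illustrates tokenization for a space-delimited language by splitting on white space and punctuation. Note that a multi-word expression such as ice cream deliberately comes out as two tokens, in line with the strategy just described; the code is illustrative only.

    import re

    def tokenize(sentence):
        """Split a sentence into word and punctuation tokens.

        A multi-word expression such as "ice cream" deliberately comes out
        as two tokens, to be recombined (if at all) at later analysis stages.
        """
        # A word is a run of letters/digits; any other non-space character
        # is treated as a one-character punctuation token.
        return re.findall(r"\w+|[^\w\s]", sentence)

    print(tokenize("They bought ice cream, didn't they?"))
    # -> ['They', 'bought', 'ice', 'cream', ',', 'didn', "'", 't', 'they', '?']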

2.2.4 Part-of-Speech Tagging
Part-of-speech (PoS) taggers are tools for assigning part-of-speech categories to the words of a sentence. More specifically, a part-of-speech tagger assigns morphological categories to tokens in a text in order to disambiguate them. A part-of-speech tagger usually receives a tokenized text as input and delivers an output text tagged with part-of-speech tags, with or without separate morphological features (for terminology, see 2.1.1). Assigning a part-of-speech tag to each token is non-trivial for machines due to the existence of ambiguity.
Disambiguating ambiguous words is a challenging task for taggers. Given that many words have more than one morphological category, a tagger needs to select the appropriate part-of-speech category. For example, the word fly requires different part-of-speech tags in different contexts since it covers different notions: as a noun it refers to a small insect, and as a verb it refers to travelling through the air. The disambiguation process is often performed by looking at information that the tagger receives from the tag sequences (syntagmatic information). Handling unknown words is another challenging task for taggers, and the accuracy of different taggers is largely determined by the proportion of unknown words. This task is frequently handled by guessing algorithms (also called smarts by Manning and Schütze (1999)), which allow taggers to guess the part-of-speech of unknown words. Smarts make use of the morphological properties of the word, such as the fact that English words ending in -ed are likely to be past tense or past participle forms. Other cues are also employed. For example, information about surrounding words or the part-of-speech of surrounding words is used to make inferences about a word's part-of-speech tag. In general, the preceding word is sometimes the most useful clue for determining part-of-speech.

In some languages where word order is flexible, the surrounding words contribute much less information about part-of-speech; on the other hand, the rich inflection of such a word may provide more information about its part-of-speech (cf. Manning and Schütze 1999).
Most tagging algorithms fall into one of the following categories: rule-based taggers or stochastic taggers. Rule-based algorithms use a large database of hand-written rules for disambiguating parts of speech. An example is EngCG, which is based on the Constraint Grammar architecture of Karlsson et al. (1995). Stochastic algorithms, on the other hand, apply machine learning techniques to a training corpus to estimate the probability of a given word having a given tag in a given context. Hidden Markov Model (HMM) taggers are examples of this. There is also transformation-based tagging (Brill, 1995), which shares features of both rule-based and stochastic architectures. Such a tagger is rule-based in disambiguating a word in context and stochastic in that the rules are automatically induced from a previously tagged training corpus (Jurafsky and Martin, 2008).
Modern part-of-speech taggers are generally data-driven applications that can be retrained for a new language given a tagged corpus of that language. Data-driven techniques for part-of-speech tagging have attracted great attention from many researchers in the computational linguistics community, especially in the early 1990s and 2000s. This line of work has resulted in several successful data-driven part-of-speech taggers, such as MXPOST (Ratnaparkhi, 1996), based on the maximum entropy framework; MBT (Daelemans et al., 1997), based on memory-based learning; and Trigrams'n'Tags (TnT) (T. Brants, 2000) and HunPoS (Halácsy et al., 2007), based on Hidden Markov Models (HMM). HunPoS is the tagger that will be used in my research, and it is introduced in Chapter 4. More recent work on data-driven taggers includes conditional random fields and support vector machines (Toutanova et al., 2003; Giménez and Màrquez, 2004; Kumar and Singh, 2010).
Part-of-speech taggers are frequently evaluated in terms of tagging accuracy. The state of the art for part-of-speech tagging of English is 97–98% accuracy per word (Toutanova et al., 2003; Shen et al., 2007; Giesbrecht and Evert, 2009), which is close to the level of human annotators. However, this optimal accuracy only holds when taggers are trained and evaluated on newspaper texts, fiction, art, and user-generated texts, and does not apply to all types of texts or genres, such as spoken language, informal writing, or Web pages (Giesbrecht and Evert, 2009). The performance of data-driven approaches can be affected by factors such as the size of the training data and the size of the tagset. In general, larger training data sets improve tagging performance, and the error rate decreases when the size of the tagset is reduced. Therefore, if a larger tagset (for the specification of more linguistic features) and high performance are both desired, a larger training data set is needed.
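As a toy illustration of these ideas (and not of HunPoS or any other tagger mentioned above), the following sketch combines a most-frequent-tag lexicon learned from a tagged corpus with suffix-based "smarts" for unknown words; the suffix rules shown are examples only, and syntagmatic information is deliberately omitted.

    from collections import Counter, defaultdict

    def train(tagged_sentences):
        """Collect per-word tag counts and keep the most frequent tag."""
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def guess(word):
        """Toy 'smarts' for unknown English words, keyed on suffixes."""
        if word.endswith("ed"):
            return "VBD"      # likely past tense or past participle
        if word.endswith("ly"):
            return "RB"
        return "NN"           # default guess: noun

    def tag(words, lexicon):
        return [(w, lexicon.get(w.lower(), guess(w))) for w in words]

    corpus = [[("news", "NN"), ("had", "VBD"), ("effect", "NN")]]
    lexicon = train(corpus)
    print(tag(["news", "vanished"], lexicon))
    # -> [('news', 'NN'), ('vanished', 'VBD')]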

Since most taggers accept segmented text as their input, it is crucial to make sure that the segmentation applied when preprocessing texts exactly matches the segmentation of the training data used in the part-of-speech tagging process. Due to the interaction between different linguistic levels when it comes to segmentation and annotation, differences in segmentation between training and tagging degrade the results of a part-of-speech tagger. Therefore, as mentioned in Chapter 1, I have added the important constraint that all tools should be compatible, and that the output of one tool must match the input requirements of the next.
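Schematically, this compatibility constraint can be pictured as a pipeline in which each tool consumes exactly what the previous one produces. The sketch below reuses the illustrative normalize(), split_sentences(), tokenize(), and tag() functions from the preceding subsections; it is a schematic illustration, not the actual pipeline developed in this thesis.

    # Schematic composition of the tools discussed in this section. Each
    # stage's output type matches the next stage's input type, which is
    # exactly the compatibility requirement discussed above.
    def process(raw_text, lexicon):
        text = normalize(raw_text)                  # str -> str
        analyses = []
        for sentence in split_sentences(text):      # str -> list of str
            tokens = tokenize(sentence)             # str -> list of str
            analyses.append(tag(tokens, lexicon))   # list of str -> list of (word, tag)
        return analyses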

2.2.5 Parsing
Parsers are tools for performing syntactic analysis of natural language. More specifically, the automatic analysis indicates how words combine to form a sentence by denoting their syntactic relations to each other. The analysis may further include other aspects of linguistic description, such as semantic information. A syntactic parser usually receives a morphologically annotated sentence as input and returns a syntactically analyzed sentence as its output. In syntactic analysis, a sentence may map to a number of grammatical parse trees due to the pervasive ambiguity of natural language. Selecting the most plausible analysis among many alternatives is one of the major bottlenecks of syntactic parsing. Therefore, from tagging to full parsing, all algorithms need to be carefully selected to handle such ambiguity (Sarkar, 2011).
In parsing, as for treebanks (see Section 2.1.2), there are two main approaches to syntactic analysis: phrase-based and dependency-based. Phrase-based parsers usually rely on parsing algorithms for context-free grammar (CFG), which is built on a set of recursive rules describing a context-free language. The three most widely used of these parsing algorithms are the Cocke-Kasami-Younger (CKY) algorithm (Kasami, 1965; Younger, 1967), the Earley algorithm (Earley, 1970), and the chart parsing algorithm (Kay, 1982; Kaplan, 1973). The three approaches combine insights from the two search strategies underlying most parsers, namely bottom-up and top-down search, with dynamic programming for efficient handling of complex cases (Jurafsky and Martin, 2008). However, writing a set of CFG rules for syntactic analysis is not an easy task, considering the complexity of natural language. Phrase structure analysis sometimes provides additional information about long-distance dependencies and suits cases where the word order is less flexible (Sarkar, 2011); dependency analysis, on the other hand, typically benefits languages with flexible word order. Since I have not applied the phrase-based syntactic analysis approach in my thesis, I will not further discuss models, algorithms, and frameworks pertaining to this technique.
In dependency-based parsing, syntactic structure is represented by a dependency graph (Nivre, 2008a). The CoNLL 2007 shared task on dependency parsing (Nivre et al., 2007) defines the notion of a dependency graph as follows:

In dependency-based syntactic parsing, the task is to derive a syntactic structure for an input sentence by identifying the syntactic head of each word in the sentence. This defines a dependency graph, where the nodes are the words of the input sentence and the arcs are the binary relations from head to dependent. Often, but not always, it is assumed that all words except one have a syntactic head, which means that the graph will be a tree with the single independent node as the root. In labeled dependency parsing, we additionally require the parser to assign a specific type (or label) to each dependency relation holding between head word and dependent word.
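This definition translates directly into a well-formedness check. The following sketch assumes a sentence encoded simply as a list of head indices, with 0 denoting the artificial root; it is an illustration of the definition above, not part of any parser discussed here.

    def is_dependency_tree(heads):
        """Check the well-formedness of a dependency structure.

        heads[i] is the head of token i+1; 0 denotes the artificial root.
        A well-formed tree has exactly one root-attached token, every head
        index in range, and no cycles.
        """
        n = len(heads)
        if sum(1 for h in heads if h == 0) != 1:
            return False                       # exactly one independent node
        if any(not 0 <= h <= n for h in heads):
            return False                       # head indices must be valid
        for i in range(1, n + 1):
            seen, node = set(), i
            while node != 0:                   # follow heads up to the root
                if node in seen:
                    return False               # cycle detected
                seen.add(node)
                node = heads[node - 1]
        return True

    # Heads for "Economic news had little effect on financial markets ."
    print(is_dependency_tree([2, 3, 0, 5, 3, 3, 8, 6, 3]))  # -> True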

The main philosophy behind the concept of dependency graphs is to connect a word, the head of a phrase, with its dependents through labeled directed arcs. This head-dependent relation can be either a head-modifier or a head-complement relation. In the head-modifier relation, the modifier functions as a fully independent component in relation to its head. Although the modifier always carries certain traits of its head, it can be omitted without affecting the core meaning of the phrase or the syntactic structure. By contrast, in the head-complement relation, the complement functions as a fully dependent component of its head. Here, the head contains certain traits of its complement, and hence the complement cannot normally be left out. Furthermore, Nivre (2008a) characterizes the distinction between these types of head-dependent relations as endocentric (head-modifier) versus exocentric (head-complement). An endocentric construction is dominated by a head that is the sole obligatory element and carries a large amount of semantic content. An example is the relation holding between the noun planets and the adjective new in the following sentence, where the head noun can replace the whole without disrupting the syntactic structure.

• Astronomers have discovered [new] planets orbiting around Gliese–667C.

By contrast, an exocentric construction is dominated by a group of syntactically related words where the head cannot provide the semantic content of the whole, such as the relation holding between the preposition around and the noun Gliese–667C in the following example, where the head around cannot replace the whole.

• Astronomers have discovered new planets orbiting around [Gliese–667C].

The distinctive element in determining complements and modifiers is often specified in terms of valency, which is the core concept in the theoretical tradition of dependency grammar (Nivre, 2008a).
Aside from the fact that dependency-based representations seem better suited than phrase structure representations for languages with free word order, the method has been shown to be useful in language technology applications, such as machine translation and information extraction, for detecting the underlying syntactic pattern of a sentence, because of its transparent encoding of predicate-argument structure (Kübler et al., 2009).
While the performance of data-driven dependency parsers is continuously being improved, two approaches to dependency parsing have remained at the center of attention (Bohnet and Kuhn, 2012), namely the transition-based approach (Yamada and Matsumoto, 2003; Nivre, 2003) and the graph-based approach (Eisner, 1996; McDonald et al., 2005a). Two popular data-driven and open source dependency parsers, MaltParser (Nivre et al., 2006) and MSTParser (McDonald et al., 2005b), are based on these two approaches. The two parsers were the top scoring systems in the CoNLL 2006 shared task on multilingual dependency parsing (Buchholz and Marsi, 2006) and have since then been applied to a wide range of languages. The parsers will be presented in more detail in Chapter 6.
Transition-based dependency parsing was pioneered by Yamada and Matsumoto (2003). A transition system consists of a set of states and transitions between the states; dependency trees are derived using treebank-induced classifiers to predict the next transition (Nivre, 2008a). Recent research has further shown that the accuracy of transition-based systems can be improved using a beam-search framework in combination with optimizing different feature models during parsing (Yue Zhang and Clark, 2008; Bohnet and Nivre, 2012).
The graph-based approach, on the other hand, is based on global learning algorithms. In this approach, global optimization algorithms find the highest scoring tree with locally weighted models (McDonald et al., 2005a; McDonald and Pereira, 2006; Koo and Collins, 2010).
Furthermore, Bohnet and Kuhn (2012) present a parsing model that combines the advantages of the transition-based and graph-based approaches. Their parser shows a substantial improvement in parsing accuracy when applied to English, Chinese, Czech, and German. Additional experiments have been done on combining different kinds of statistical parsers trained on treebanks, such as those described in Sagae and Lavie (2006), Martins et al. (2008), and McDonald and Nivre (2011). Combining rule-based and statistical parsers has also been investigated in Aranzabe et al. (2012).
Having discussed language technology tools in this section, I will now briefly describe Persian and its main linguistic features related to orthography, morphology, and syntax, and at the same time discuss language technology challenges in processing Persian text.

Table 2.2. Dual-joining Persian characters.

Letter  Name
ب       be
پ       pe
ت       te
ث       se
ج       jim
چ       če
ح       he-ye jimi (literally jim-like he)
خ       khe
س       sin
ش       shin
ص       sād
ض       zād
ط       tā
ظ       zā
ع       ‘eyn
غ       qeyn
ف       fe
ق       qāf
ک       kāf
گ       gāf
ل       lām
م       mim
ن       nun
ه       he-ye do-češm (literally two-eyed he)
ی       ye

[The original table displays the isolated, final, medial, and initial contextual forms of each letter.]

2.3 Persian
Persian belongs to the Indo-Iranian branch of the Indo-European language family. There are three variants of the language: Western Persian, referred to as Parsi or Farsi and spoken in Iran; Eastern Persian, referred to as Dari and spoken in Afghanistan; and Tajiki, spoken in Tajikistan and Uzbekistan. Persian has also had a strong influence on neighboring languages such as Turkish, Armenian, Azerbaijani, Urdu, Pashto, and Punjabi.

2.3.1 Persian Orthography
The Persian2 writing system is based on the Arabic alphabet, with its 28 letters3 and four additional letters, پ، چ، ژ، گ, which represent the sounds /p/, /č/, /ž/, and /g/. The additional letters still follow the Arabic writing system. These letters were created by adding extra dots to the existing letters ب، ج، ز، ک, representing the sounds /b/, /j/, /z/, and /k/. In the case of گ /g/, the dots turned into a line above the existing letter ک, representing /k/.
In Persian, characters have different forms depending on their position in the word. Thus, characters can be divided into two groups on the basis of how they connect to other characters: dual-joining and right-joining. Dual-joining characters accept connections on both their right- and left-hand sides. In this group, characters have two distinct shapes depending on their position in the word: initial or medial, and final or isolated, respectively. However, three characters in this group, namely ع /‘eyn/, غ /qeyn/, and ه /he/ (he-ye do-češm), appear in four distinct shapes. There are also two characters in this group, ط /tā/ and ظ /zā/, which have only one shape irrespective of their position in the word. Table 2.2 displays the initial, medial, final, and isolated forms of the characters in the dual-joining group. The right-joining characters do not accept any connection on their left-hand side and have only one shape, without distinct initial, medial, final, or isolated forms. These characters are illustrated in Table 2.3.

Vowels and Diacritics
Modern Persian has six vowels: ā, a, i, e, u, o, of which three are long and three are short. Long vowels (ā, i, u) are usually conveyed by alphabet letters, whereas short vowels (a, e, o) are represented by so-called diacritics. Lexical ambiguity can occur in Persian when short vowels are left out of tokens, such that a string of consonants, sometimes together with long vowels, is all that is represented in the script. Table 2.4 displays examples of homographs sharing the same letters but distinguished by pronunciation, stressed syllables, meaning, and part-of-speech (PoS) category.

2 Henceforth, by Persian I mean contemporary Persian as spoken in Iran.
3 According to the Academy of Persian Language and Literature, Persian has 33 letters. Although the Academy of Persian Language and Literature includes hamze as one of the Persian letters, it is normally not used in words of Persian origin. Hamze is mainly used in words borrowed from Arabic and as a hiatus-filler. Even in Arabic, hamze has its own specific use, since it exclusively appears on other host letters as a phonetic sign (see Section 2.3.1). Hamze only occurs as a separate letter when it is isolated at the end of a word. Since hamze is a root consonant in the Arabic consonantal root system, which characterizes the Semitic languages, it is defined as a letter in Arabic. I believe that the inclusion of hamze as a letter in the Persian alphabet needs to be further discussed and considered by the Academy, because, as an Indo-European language, Persian does not follow the Arabic consonantal root system.

Table 2.3. Right-joining Persian characters.

Character  Name
ا          alef
د          dāl
ذ          zāl
ر          re
ز          ze
ژ          že
و          vāv

Table 2.4. Examples of Persian homographs disambiguated by diacritics. N_SING = Noun Singular, V_PA = Verb Past.

Word    Transcription  PoS      Translation   Specification
کشتی    /kšti/         ?        ?             without diacritics
کِشتی    /keš'ti/       N_SING   ship          with diacritic /e/
کِشتی    /'kešt-i/      N_SING   cultivation   with diacritic /e/
کِشتی    /'kešt-i/      V_PA     you planted   with diacritic /e/
کُشتی    /koš'ti/       N_SING   wrestling     with diacritic /o/
کُشتی    /'košt-i/      V_PA     you killed    with diacritic /o/

The transcription in the table follows phonemic transcription.4 Diacritics are positioned above or below alphabet letters as guides to correct decoding, to identify homographs. Diacritic signs are normally left unwritten in texts and are mostly used for beginning learners, since adult native speakers are expected to have already developed cognitive strategies for efficient linguistic performance (Baluch, 1992).5

4 Phonemic transcription is applied for all transcriptions throughout this book.
5 However, the short vowels /a, e, o/ are indicated by the letter ا (alef) only when the vowels appear at the beginning of a word, such as /a/ in اسب /asb/ (horse), /e/ in اسم /esm/ (name), and /o/ in امید /omid/ (hope). Alef is further used to convey the sound of the long vowel ā in word-initial position. In this case, the vowel is indicated by a hat above alef (آ), as in آب /āb/. The long vowel ā does not have a hat when it appears as a medial, final, or isolated character in a word, such as the medial character in the personal name دارا /dārā/ (Dara), the final and joined character in ما /mā/ (we), or the final and isolated character in هوا /havā/ (air, weather).

Table 2.5. Persian homophonic letters.

Phoneme  Letters
/t/      ت، ط
/h/      ح، ه
/s/      ث، س، ص
/z/      ذ، ز، ض، ظ

Table 2.6. Diverse spellings of certain homophonic Persian words.

Common      Less Common   Transcription   Affected phoneme  Translation
اتاق        اطاق          /otāq/          /t/               room
اتو         اطو           /otu/           /t/               iron
امپراتور    امپراطور      /emperātur/     /t/               emperor
تپانچه      طپانچه        /tapānče/       /t/               pistol
تالار       طالار         /tālār/         /t/               forum
تهران       طهران         /Tehrān/        /t/               Tehran
بلیت        بلیط          /belit/         /t/               ticket
تپیدن       طپیدن         /tapidan/       /t/               beat, pulse
غلتیدن      غلطیدن        /qaltidan/      /t/               roll
صندلی       سندلی         /sandali/       /s/               chair
سپاسگزار    سپاسگذار      /sepāsgozār/    /z/               grateful

Phoneme Diversity
In Persian, a phoneme may be represented by several different letters, which can cause disparities in letter substitution, especially when transliterating foreign words and deciding on a proper grapheme for a desired phoneme. The phoneme /t/ is represented by the two letters ت and ط, the phoneme /h/ by ح and ه, /s/ by the three letters ث، س, and ص, and finally the phoneme /z/ by the four letters ذ، ز، ض, and ظ. Table 2.5 shows the diverse letters for one and the same phoneme. In Arabic, by contrast, all these letters represent distinct phonemes. With respect to writing variation in Persian, words containing homophonous letters can be spelled with different letters representing the same phoneme. Note that this spelling variation usually involves the phoneme /t/. Table 2.6 shows some examples of this category. Spellings that are categorized as less common in Table 2.6 are actually uncommon; nowadays, for instance, Tehran is hardly ever spelled with the less common form. However, the only available evidence is the texts themselves. Furthermore, in Persian encyclopedias such as Dehkhoda and Mo'in, entries for less common spellings simply refer to the common ones.

Table 2.7. Twelve different ways of writing the plural and definite form of the compound word کتاب‌خانه‌های (the libraries of). [The table lists the twelve variants that arise from writing each internal word boundary with attachment, white space, or ZWNJ.]

Word Boundaries and Different Space Characters
When representing text digitally, there are different sizes and styles of spaces with different Unicode characters, such as no-break space (U+00A0), zero-width non-joiner (U+200C), word joiner (U+2060), ideographic space (U+3000), zero-width no-break space (U+FEFF), and so forth. The use of space characters of specific widths depends on the characteristics of the language. In Persian, white space designates word boundaries, as it does in many languages. However, there is also another space in Persian, the so-called zero-width non-joiner (ZWNJ, also known as pseudo-space, zero-space, or virtual space), which marks a boundary inside a word. The ZWNJ is a non-printing character in computerized typesetting of some scripts, placed between two characters that are to be printed with one final and one initial form, unjoined, next to each other. The ZWNJ thus keeps the word forms intact and close together without their being attached to each other. Considering the wide variety of typing styles and the optionality of shifting between white space and ZWNJ, one word may be written in various ways in a text. Compound words and inflectional affixes are strongly affected by this and can be typed either attached to their adjacent word (when ignoring both spaces, thereby losing the internal word boundaries, as in the attached form کتابخانههای /ketāb.xāne-hā-ye/ (gloss: book.house-pl-ez6, translation: the libraries)) or detached from it (when using white space instead of ZWNJ, as in the detached form کتاب خانه ها ی /ketāb xāne hā ye/), which in both cases raises issues in text processing. Inflectional suffixes may follow compound words; for instance, the word کتاب‌خانه /ketāb.xāne/ (gloss: book.house, translation: library) followed by the plural suffix ها /-hā/ together with the ezāfe (-ez) particle ی /-ye/ might appear in 12 forms, as shown in Table 2.7.

6 The abbreviation -ez is used to mark the ezāfe construction in glosses throughout this book. An ezāfe (-ez) is an unstressed clitic particle that links the elements within a noun phrase, adjective phrase, or prepositional phrase, indicating the semantic relation between the joined elements. It is represented by the short vowel /e/ after consonants or /ye/ after vowels. For more description of the ezāfe construction, see Section 2.3.3.

Table 2.8. Different forms of hamze.

Form                             Position                              Transcription
ء (always isolated)              at the end of a word                  glottal /ʔ/
أ / إ (hamze above/below alef)   initial, medial, or final             glottal /ʔ/
ؤ (hamze above vāv)              medial or final                       glottal /ʔ/
ئ (hamze above a tooth)          initial or medial                     glottal /ʔ/

Multi-word expressions usually consist of words representing separate lexical categories. When white space is used instead of ZWNJ, the words are treated as separate tokens, which causes problems in tokenizing texts. For example, به‌منظورِ /be.manzur-e/ (gloss: to.intention-ez, translation: for the sake of) may appear either with the preposition به /be/ (to) joined to the noun منظور /manzur/ (intention), building a single token (the attached form بمنظورِ /be.manzur-e/), or as two single tokens delimited either with a space (به منظورِ /be manzur-e/) or with ZWNJ (به‌منظورِ /be.manzur-e/). The optionality of writing multi-word expressions as an attached single word, as detached single words (delimited with space), or as one distinct word (delimited with ZWNJ) in the Persian writing system raises issues in text processing, since a single word that should be treated in the same way will be treated in different ways.
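The following sketch illustrates one simple normalization strategy for the problem just described: rewriting a white space before a small set of detached suffixes as a ZWNJ. The suffix list is a toy example only; a realistic normalizer needs lexical knowledge to avoid rewriting genuine word boundaries.

    import re

    ZWNJ = "\u200c"
    # Toy list of suffixes that are often typed detached with a space.
    DETACHED_SUFFIXES = ["ها", "های", "تر", "ترین"]

    def normalize_zwnj(text):
        for suffix in DETACHED_SUFFIXES:
            # Rewrite a space immediately before a known suffix as ZWNJ.
            text = re.sub(r" (?=%s\b)" % suffix, ZWNJ, text)
        return text

    print(normalize_zwnj("کتاب‌خانه های من"))  # -> کتاب‌خانه‌های من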

Hamze in Persian
One of the most ambiguous letters imported from Arabic into Persian is hamze(h) (Arabic hamza). Hamze is not normally used in words of Persian origin. It is mainly employed in words borrowed from Arabic, normally as one of the root consonants. It represents the glottal stop and has its own behavior. The unique characteristic of hamze is that it is mostly written with a carrier; i.e., hamze is normally written above or below the letter ا (alef), above the letter و (vāv), or above a so-called tooth (ئ), which makes it distinct from the rest of the Arabic letters. The different forms of hamze are shown in Table 2.8. Although it is not the preferred spelling today, hamze may be used in words of Persian origin. For instance, the word پاییز /pāiz/ (autumn) is a Persian word, but it might be written with hamze, as in پائیز /pā'iz/.

Mapping Unicode Characters
Apart from the four extra letters in Persian, Persian and Arabic share almost the same character encodings; there are, however, a few stylistic disparities in the two letters ی (ye) and ک (kāf), which have different Unicode characters for Persian and Arabic.

Table 2.9. Different forms of Persian and Arabic characters.

Persian  Arabic  Name of letter
ک        ك       kāf
ی        ي       ye
ت        ة       te

Table 2.10. Digit characters for Persian (Extended Arabic-Indic Digits), Arabic (Arabic-Indic Digits), and Western languages.

Persian  Arabic  Western
۰        ٠       0
۱        ١       1
۲        ٢       2
۳        ٣       3
۴        ٤       4
۵        ٥       5
۶        ٦       6
۷        ٧       7
۸        ٨       8
۹        ٩       9

These letters are represented by U+06CC for ی and U+06A9 for ک in the Persian Unicode encoding, and by U+064A for ي (the Arabic ye has two dots beneath) and U+0643 for ك in the Arabic Unicode encoding. Table 2.9 shows the different shapes of these two letters in Persian and Arabic.
The Unicode Standard has characters for Persian letters, including a separate set of ten characters for digits called Extended Arabic-Indic Digits (Esfahbod, 2004), since three of the ten digits in Persian (4, 5, and 6) differ from their Arabic counterparts. Digit characters for Persian, Arabic, and Western languages are presented in Table 2.10. Despite the existence of Unicode characters for Persian, there are still software applications that use Arabic Unicode characters for Persian letters and digits, or Arabic Unicode characters for Persian letters in combination with Western digits. As a consequence, a mixture of various encodings is found in a huge number of Persian texts on the Web, which needs to be considered in Persian text analysis. As an example, Ettelaat.com, one of the oldest newspapers in Iran, with 90 years of continuous publication, still has mixed character encoding, with Arabic letters and English digits combined with the extra Persian characters.
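A minimal sketch of the character-level mapping this requires is shown below: it maps the Arabic code points for ye and kāf, as well as Arabic-Indic and Western digits, to their Persian counterparts. Whether Western digits should be rewritten depends on the application, so the sketch is illustrative only.

    # Sketch of character-level normalization mapping Arabic code points
    # to their Persian counterparts, including digit characters.
    ARABIC_TO_PERSIAN = {
        "\u064A": "\u06CC",   # Arabic ye  (ي) -> Persian ye  (ی)
        "\u0643": "\u06A9",   # Arabic kaf (ك) -> Persian kaf (ک)
    }
    for i in range(10):
        # Arabic-Indic and Western digits -> Extended Arabic-Indic digits.
        ARABIC_TO_PERSIAN[chr(0x0660 + i)] = chr(0x06F0 + i)
        ARABIC_TO_PERSIAN[chr(ord("0") + i)] = chr(0x06F0 + i)

    TABLE = str.maketrans(ARABIC_TO_PERSIAN)

    def to_persian(text):
        return text.translate(TABLE)

    print(to_persian("يك 123"))  # -> یک ۱۲۳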

Table 2.11. Examples of words derived from the present stem دان /dān/ (to know) combined with various types of other stems and nouns as well as derivational affixes.

Components              Transcriptions       PoS          Translations
دان                     /dān/                Verbal stem  to know
دان + ش                 /dān-eš/             Noun         knowledge
دان + ش + جو            /dān-eš-ju/          Noun         student
دان + ش + مند           /dān-eš-mand/        Noun         scientist
دان + ش + گاه           /dān-eš-gāh/         Noun         university
دان + ش + گاه + ی       /dān-eš-gāh-i/       Adjective    academic
دان + ش + گاه + ی + ان  /dān-eš-gāh-i-ān/    Noun         academics
هم + دان + ش + گاه + ی  /ham-dān-eš-gāh-i/   Noun         university-mate
دان + ش + کده           /dān-eš-kadeh/       Noun         faculty
دان + ا                 /dān-ā/              Adjective    wise
دان + ا + ی + ی         /dān-ā-y-i/          Noun         wisdom
دان + نده               /dān-andeh/          Noun         knower
نا + دان                /nā-dān/             Adjective    ignorant
نا + دان + ی            /nā-dān-i/           Noun         ignorance

2.3.2 Persian Morphology
Persian is dominated by an affixal system and is sometimes described as a predominantly agglutinative language (Jeremiás, 2003; Hashabeiky, 2005). Words are formed by attaching affixes with various grammatical or semantic features to head words. The language is highly productive in word formation, and it frequently uses derivational agglutination, combining affixes, verb stems, nouns, and adjectives to derive new words. However, the agglutination process is not as extreme as in, for example, Turkish. New words are also formed by combining two existing words into a new one (compounding).

Affixation
Affixes are bound elements that cannot stand on their own and are used for inflection and derivation. Affixes differ across parts of speech: noun affixes are always attached to nouns, and verb affixes are always attached to verbs. In Persian, affixes appear in the form of both prefixes and suffixes. Table 2.11 shows different words derived from the present stem دان /dān/ (to know) combined with various types of derivational affixes.

Nominal Morphology
Persian nouns have no grammatical gender, and there is no definite article in the formal language. A single noun can signify a definite entity, for example:

تلفن قطع است.
telefon qat‘ ast .
phone disconnection be.pres.3sg .
The phone is disconnected.

Table 2.12. Present indicative of the verb رفتن /raftan/ (to go).

Personal Endings  Transcriptions  Verb Conjugations  Translations
ـم                /-am/           می‌روم             I go
ـی                /-i/            می‌روی             you go (singular)
ـد                /-ad/           می‌رود             she/he goes
ـیم               /-im/           می‌رویم            we go
ـید               /-id/           می‌روید            you go (plural)
ـند               /-and/          می‌روند            they go

Indefiniteness, on the other hand, is indicated either by the clitic particle ی /-i/ joined to a noun, as in کتابی /ketāb-i/ (gloss: book-indef, translation: a book), or by the numeral یک /yek/ (a, one) preceding the noun, as in یک کتاب /yek ketāb/ (one book), and sometimes by a combination of both, as in یک کتابی /yek ketāb-i/ (gloss: one book-indef, translation: a book). The indefinite particle can further be attached to plural nouns, as in کتاب‌هایی /ketāb-hā-y-i/7 (gloss: book-pl-gl-indef, translation: some books). There are several plural markers: -hā and -ān (with the variants -gān and -yān), and the Arabic plural markers -āt, -in, and -un. The plural markers -in and -un are attached only to Arabic loanwords. Arabic broken plurals also exist in Persian; they follow Arabic template morphology and are directly inherited together with the nouns borrowed from Arabic.

Verbal Morphology
Verbs in Persian follow a conjugation pattern that is very regular and straightforward. Verbs always carry information on tense, aspect, and mood (TAM). They use personal endings and normally agree in person and number with the subject. Table 2.12 shows the inflection of the verb رفتن /raft-an/ (gloss: go.past-inf, translation: to go) in the present indicative TAM form.

Adjectives and Adverbs
Adjectives and adverbs take the suffixes ـتر /-tar/ and ـترین /-tarin/ for the comparative and superlative forms, respectively.

7 The glide ye /-y-/ is inserted between the two long vowels /-ā/ and /-i/.

Table 2.13. Syntactic patterns in Persian.

Head-final                     Head-initial
Object–Verb                    Preposition–Object
Demonstrative/Adjective–Noun   Noun–Genitive
Numeral–Noun                   Noun–Adjective
Adverb–Adjective               Noun–Relative Clause

2.3.3 Persian Syntax
Persian is a verb-final language, but it does not rigidly adhere to a fixed word order. The sentential constituents can easily change place without affecting the core meaning. Apart from a marker indicating the accusative (the object marker), there are no overt markers in the language to signal, for example, the boundary of a noun phrase, an adjective phrase, or a prepositional phrase.

Word Order
Basic word order in Persian is SOV. However, the language's branching direction8 indicates a position between head-final (left-branching) and head-initial (right-branching) structure, so the syntactic pattern has a mixed typology. The language represents a hybridization of two opposite syntactic patterns, belonging both to a group of typically VO languages (as in Arabic) and to a group of typically OV languages (as in Turkish) (Stilo, 2004). Table 2.13 shows a set of interrelated syntactic features in Persian.
Sentences typically consist of an optional subject and an optional object, followed by a compulsory verb, i.e., (S) (O) V. Subjects can, however, be placed anywhere in a sentence. The use and order of the optional constituents are relatively arbitrary, and this scrambling characteristic (Karimi, 2003) makes Persian word order highly flexible. Person and number are inflected on the verb using personal endings. Table 2.14 shows the different personal endings in the past tense. As already mentioned, verbs normally agree in person and number with their subject. Nevertheless, in the following cases the verb does not agree in person with its subject:

• To make a humble reference to oneself, the speaker may replace the first person singular pronoun (the subject) with the noun بنده /bande/ (servant), while the personal ending of the verb is still first person singular.

• In polite usage, a singular subject is frequently followed by a verb in the third person plural.

8Branching direction means the position of the head word and its complement in a phrase, clause, or sentence.

Table 2.14. Personal endings in the past tense (personal endings in the present tense are illustrated in Section 2.3.2).

Personal Endings  Transcriptions  Translations
ـم                /-am/           I
ـی                /-i/            you (singular)
-∅                /-∅/            she/he
ـیم               /-im/           we
ـید               /-id/           you (plural)
ـند               /-and/          they

• If a plural subject is inanimate, the verb is usually inflected in the singular. However, this agreement restriction for inanimate entities does not always hold, since there are no clear rules concerning when one may break the constraint (Hashabeiky, 2007).

• In the case of non-canonical subjects,9 the verb always stays in the third person singular, regardless of the number and person implied by the subject.

Main Grammatical Functions
The function of subject is fulfilled in Persian by a noun, an infinitive, a numeral or other expression of quantity, or a pronoun. The subject may also appear only as a personal ending on the verb. Direct objects are characterized by the postposition را /rā/, which is the single case marker in the language. The accusative marker, however, is not obligatory unless topicalization is involved (Windfuhr, 2009). In Persian, a complement other than the direct object is introduced by a prepositional phrase (Lazard, 1992). Hence, the sentential structure can shift between (S) (O ± rā) V and (S) (PP) V.10
Adjectives and adverbs are rarely distinct, as a large number of adjectives may be used as adverbs. Prepositions function as heads of prepositional phrases and are sometimes incorporated into multi-word expressions. There are inherited prepositions consisting of simple prepositions such as از /az/ (from/of), به /be/ (to), etc. There are also compound prepositions formed of an adjective or an adverb accompanied by a simple preposition, such as خارج از /xārej az/ (gloss: outside of, translation: outside).

9 In Persian, non-canonical subjects appear in dative-subject constructions (Jahani et al., 2012). An example is given in Section 2.3.3, Light Verb Constructions.
10 In colloquial Persian, adverbs of location can omit the preposition and follow the verb. For example:
می‌روم تهران.
mi-rav-am Tehrān .
cont-go.pres-1sg Tehran .
I go Tehran. (Instead of I go to Tehran.)

Furthermore, there are ezāfe prepositions consisting of nouns that have partially been deprived of their semantic content, such as پشتِ دیوار /pošt-e divār/ (gloss: back-ez wall, translation: behind the wall) and رویِ میز /ru-ye miz/ (gloss: face-ez table, translation: on the table) (Lazard, 1992). Ezāfe prepositions may be preceded by simple prepositions as well, as in در پشتِ دیوار /dar pošt-e divār/ (gloss: in back-ez wall, translation: behind the wall) and بر رویِ میز /bar ru-ye miz/ (gloss: on face-ez table, translation: on the table). Prepositions can further be represented together with their object, as in ازش /az-aš/ (gloss: of-her/him, translation: of her/him) and بهش /beh-aš/ (gloss: to-her/him, translation: to her/him).
Modifiers may precede or follow the word they modify. There are a number of modifiers, such as adverbial clause modifiers, adverbial modifiers, adjectival modifiers, appositional modifiers, comparative modifiers, negation modifiers, noun compound modifiers, nominal phrases as adverbial modifiers, possession modifiers, prepositional modifiers, prepositional modifiers in light verb constructions, quantifier phrase modifiers, relative clause modifiers, and temporal modifiers. A more detailed description of the various modifiers is given in Chapter 5.

Pronominal Clitics
Clitics are attached to different parts of speech without being related to the words they attach to. In Persian, clitics appear only in the form of enclitics; proclitics do not exist. Pronouns often appear as pronominal clitics (ـم /-am/ 1sg, ـت /-at/ 2sg, ـش /-aš/ 3sg, ـمان /-emān/ 1pl, ـتان /-etān/ 2pl, ـشان /-ešān/ 3pl), which are bound pronoun forms. The clitics are shown in Table 2.15. Pronominal clitics can be attached to nouns, adjectives, adverbs, prepositions, and verbs. Table 2.16 presents the different pronominal clitics accompanied by the word کار /kār/ (work). The following example shows the third person singular pronominal clitic ـش /-aš/ (pc.3sg), denoting the direct object, attached to the verb دیدند /did-and/ (see.past-3pl). The subject is absent, but the information is given by the verb through the attached personal ending ـند /-and/ (3pl).

دیدندش.
did-and-aš .
see.past-3pl-pc.3sg .
They saw her/him.

In addition, pronominal clitics function as possessive genitives, as in کتابش /ketāb-aš/ (gloss: book-pc.3sg, translation: her/his book), as objects of a preposition, as in ازش /az-aš/ (gloss: of-pc.3sg, translation: of her/him), as partitive genitives, as in پنج‌تاش /panj-tā-aš/ (gloss: five-cl-pc.3sg, translation: five of it), and as non-canonical subjects, as in بدش می‌آید /bad-aš mi-āy-ad/ (gloss: bad-pc.3sg cont-come.pres-3sg, translation: she/he dislikes). Possession is expressed by the clitic ezāfe and a noun or a pronoun, or by the pronominal clitics.

Table 2.15. Pronominal clitics.

Pronominal Clitics  Transcriptions  Translations
ـم                  /-am/           my
ـت                  /-at/           your
ـش                  /-aš/           her/his
ـمان                /-emān/         our
ـتان                /-etān/         yours
ـشان                /-ešān/         their

Table 2.16. Pronominal clitics accompanied by the word کار /kār/ (work).

Components  Transcriptions  Translations
کارم        /kār-am/        my work
کارت        /kār-at/        your (sg) work
کارش        /kār-aš/        her/his work
کارمان      /kār-emān/      our work
کارتان      /kār-etān/      your (pl) work
کارشان      /kār-ešān/      their work

Complex Sentences
A complex sentence may contain multiple clauses, independent or dependent. An independent or coordinate clause is a clause that is not syntactically dependent on another clause and can stand on its own as a sentence, while a dependent or subordinate clause is a clause that makes no sense by itself and is therefore dependent on another clause. Coordinate clauses are generally sentences that are coordinated by different types of conjunctions, for example, the house is big and the area is calm. Subordinate clauses, on the other hand, may consist of an adverbial clause, a clausal complement, a relative clause, or a combination thereof. In Persian, an adverbial clause is introduced by markers (single words or multi-word expressions) such as وقتی که /vaqti ke/ (gloss: when that, translation: when), در حالی که /dar hāl-i ke/ (gloss: in state-indef that, translation: while), اگر /agar/ (if), etc., that modify the main clause. A clausal complement, often introduced by the complementizer که /ke/ (that), is a dependent clause with an internal subject, functioning as an object of the main clause. A relative clause modifier is always introduced by the relative marker که /ke/11 (that, which, who, whom), modifying a nominal constituent.

The relativizer /ke/ does not vary depending on the animacy or function of the head noun, as it does in English, for instance.

Light Verb Constructions
Persian makes extensive use of so-called complex predicates or light verb constructions (LVC), which consist of a preverbal part of speech such as a noun, adjective, adverb, or preposition forming a complex lexical predicate together with a light verb. Light verbs are verbs with weak semantic content of their own. Some of the frequent light verbs in Persian are کردن /kard-an/ (gloss: do.past-inf, translation: to do) in فکر کردن /fekr kard-an/ (gloss: think do.past-inf, translation: to think), دادن /dād-an/ (gloss: give.past-inf, translation: to give) in گوش دادن /guš dād-an/ (gloss: ear give.past-inf, translation: to listen), زدن /zad-an/ (gloss: hit.past-inf, translation: to hit) in جیغ زدن /jiq zad-an/ (gloss: scream hit.past-inf, translation: to scream), گرفتن /gereft-an/ (gloss: take.past-inf, translation: to take) in دوش گرفتن /duš gereft-an/ (gloss: shower take.past-inf, translation: to take a shower), داشتن /dāšt-an/ (gloss: have.past-inf, translation: to have) in دوست داشتن /dust dāšt-an/ (gloss: friend have.past-inf, translation: to love), and افتادن /oftād-an/ (gloss: fall.past-inf, translation: to fall) in اتفاق افتادن /ettefāq oftād-an/ (gloss: event fall.past-inf, translation: to happen). Main verbs may further function as light verbs with a very abstract semantic interpretation, for instance the verb خوردن /xord-an/ (gloss: eat.past-inf, translation: to eat) in the complex predicate زمین خوردن /zamin xord-an/ (gloss: ground eat.past-inf, translation: to fall down). The light verb element in the complex predicate carries the TAM inflection (tense, aspect, and mood), while the other component (noun, adjective, adverb, or preposition) carries most of the semantic content of the complex predicate.
In non-canonical subject constructions, as previously mentioned regarding subject-verb agreement, the verb does not inflect for number and person with the dative subject, always staying in the third person singular instead. In this case, the preverbal element carries the dative subject. The following example shows the adjective خوش /xoš/ (good) functioning as the adjectival complement of the complex predicate خوش آمدن /xoš āmad-an/ (gloss: good come.past-inf, translation: to like) (an adjectival complement in a light verb construction). In addition, خوش /xoš/ (good) carries the dative subject of the sentence in the form of a pronominal clitic, namely the first person singular ـم /-am/.

11 In subordinate constructions, /ke/ marks complement clauses, relative clauses, and various types of adverbial clauses.

az-aš xoš-am mi-āy-ad .
of-pc.3sg good-pc.1sg cont-come.pres-3sg .
I like her/him.

Persian has various types of light verb constructions, depending on the preverbal element involved. The different varieties of light verb constructions are presented below with brief descriptions. More detailed descriptions of the different light verb constructions, with examples, are given in Chapter 5.

• Adjectival complement in a light verb construction: An adjectival complement in a light verb construction is a preverbal adjective that combines with a light verb to form a lexical unit. In the following example, the adjective clean and the light verb do together build an adjectival complement in a light verb construction.

u taxte siyāh rā pāk mi-kon-ad .
she/he board black rā clean cont-do.pres-3sg .
She/he cleans/wipes the blackboard.

• Direct object in a light verb construction: A direct object in a light verb construction is a preverbal direct object that combines with a light verb to form a lexical unit.12 In the example below, the word touch and the light verb do create a direct object in a light verb construction. Note that there is also another direct object, sweets, followed by the accusative marker rā in this sentence.

Dārā širini-hā rā lams mi-kon-ad .
Dara sweet-pl rā touch cont-do.pres-3sg .
Dara touches the sweets.

• Nominal subject in a light verb construction: A nominal subject in a light verb construction is a preverbal nominal subject that combines with a light verb to form a lexical unit (normally with a passive meaning). In the example below, the noun implementation and the light verb become together make a nominal subject in a light verb construction. There is also a sentence-level subject, ceremony, in this sentence.

12 Note that this unit may in turn take a direct object; hence the need to distinguish the light verb object from an ordinary direct object.

marāsem ejrā šod .
ceremony implementation become.past.3sg .
The ceremony was implemented.

• Prepositional modifier in a light verb construction: A prepositional modifier in a light verb construction is a preverbal prepositional phrase that combines with a light verb to form a lexical unit. In the following example, the preposition to and its object cry, together with the verb fall, build a prepositional modifier in a light verb construction.

u be gerye oftād .
she/he to cry fall.past.3sg .
She/he fell into cries (she/he burst into tears).

Preverbal Particles
Preverbal particles, or preverbs, are elements that immediately precede main verbs. Preverbs are normally prepositions or adverbs whose meaning is modified by the main verb (Lazard, 1992). Examples of preverbs are /dar/ (in), /bar/ (on), /pas/ (back, again), /bāz/ (again), /farā/ (above), /foru/ (down), and so forth. These particles load main verbs with new meanings; in other words, they interact closely with the main verbs. Examples are /dar/ (in) in /dar āmad-an/ (gloss: in come.past-inf, translation: to eventuate, to result) and /pas/ (back) in /pas gereft-an/ (gloss: back take.past-inf, translation: to take back, regain). The construction of verbs with preverbs is very similar to that of LVCs. However, preverbal elements in LVCs often co-occur with semantically bleached or light verbs to build predicates or new verbs in Persian, whereas preverbs co-occur with full lexical verbs to modify the content of those verbs. The particles directly precede main verbs; however, if the future auxiliary or the negative morpheme /ne-/ (not) is present, it intervenes between the particle and the main verb.

The Ezāfe Construction
The ezāfe construction is a distinct characteristic of Persian syntactic structure that plays a significant role in phrasal categories. The literal meaning of ezāfe is addition, and it describes the dependency between a head and its modifiers. This dependency is usually characterized by the interplay of phonology, morphology, and syntax (Bögel et al., 2008), and it is easily noticeable in speech, since it is pronounced as /e/ or /ye/. In this regard, Karimi (1989) claims that "ezāfe is not a case assigner, but it does transfer the case of the head noun to its complement". Moreover, this unstressed clitic particle is orthographically unwritten when it appears after consonants, which may raise difficulties in syntactic text analysis. The orthographic realization of ezāfe occurs only in special cases where the word ends in the long vowels /ā/ or /u:/, or where the word ends in the silent "h".13 In these cases, ezāfe is represented by the suffix /-ye/.
As ezāfe indicates the semantic relations of phrasal constituents, it ties the modifier to the head word. In other words, ezāfe links a modifier (adjectival or adverbial) to a noun, as well as a genitive attribute (complement) to its noun. The last element in the phrase never carries any ezāfe clitic, a fact that can indicate the end of a phrasal category. In the following example, /ketāb-e joqrāfi-e man/ (gloss: book-ez geography-ez I, translation: my geography book), the first word, book, is the head word of the noun phrase and carries an ezāfe to link the next word, geography, to itself. The second word, geography, is in turn the head word of its dependent, my, which is linked by another ezāfe clitic. Finally, the noun phrase ends with the third word, my, which does not carry any ezāfe clitic. Ezāfe additionally functions to mark nominal determination (Lazard, 1992), for example /āqā-ye Dāniāl-e Esfandiari/14 (gloss: Mr.-ez Danial-ez Esfandiari, translation: Mr. Danial Esfandiari).

13 The silent h is employed to represent a terminal -e. Accordingly, the ezāfe appears as /-ye/, as in /xāne/ (house) and /xāne-ye/; otherwise it is unwritten.
14 Since /āqā/ (Mr.) ends in the long vowel /ā/, the ezāfe is visible and appears as /-ye/.

2.4 Existing Corpora and Tools for Persian
Having examined the Persian language and the challenges it poses for automatic text analysis, we will now look at existing resources and tools for Persian. Since the scope of this thesis is confined to the morphosyntactic domain, I will only present resources and tools in this area.

2.4.1 Morphologically Annotated Corpora Below are morphosyntactically annotated resources in the form of corpora for Persian.

• Bijankhan Corpus (Bijankhan, 2004):
The first linguistically annotated Persian corpus was the Bijankhan Corpus (Bijankhan, 2004), released in 2004. The corpus was created from on-line material consisting of texts of different genres and topics, such as newspaper articles, fiction, technical descriptions, and texts about culture and art. The corpus contains 2,597,939 tokens and is annotated with morphosyntactic and partly semantic features. The Bijankhan Corpus is freely available as a plain text document and will be presented in more detail in Chapter 3.

• The Persian Linguistic Data Base (PLDB) (Assi, 2005):
The Persian Linguistic Data Base contains information about pronunciation and grammatical annotation with a morphosyntactic tagset of 44 tags. The database consists of more than 56 million words from contemporary texts. The corpus is not freely available.

• The Persian 1984 corpus (QasemiZadeh and Rahimi, 2006):
The Persian 1984 corpus comprises a translation of the novel 1984 by George Orwell, annotated in the MULTEXT-East framework. The corpus consists of 6,604 sentences and about 100,000 words annotated with parts of speech. The corpus is part of the MULTEXT-East parallel corpus (Erjavec et al., 2003).

• Tanzil Quran Corpus (ZarrabiZadeh, 2007):
Tanzil is an international Quranic project that was launched in early 2007 with the aim of providing a highly verified and accurate Quran text. In addition, the project has provided a detailed digital version of the text and over 100 translations of the Quran into over 40 languages, including 11 translations into Farsi. The data and the translations are offered in various formats on the project Web site and can be used for research purposes.

• Hamshahri Collection (AleAhmad et al., 2009):
The Hamshahri Collection is a corpus containing 318,000 documents from the years 1996 to 2007. All documents in the corpus carry the label Cat, indicating the category of each document (economic, political, etc.). The data was created by a research group at Tehran University with support from the Iran Telecommunication Research Center. The corpus is intended to be used for studying different features of information-retrieval algorithms, such as indexers and retrieval models, as well as for Persian clustering and classification, stemming, and so forth.

• Comparative Persian-English Corpus (Hashemi et al., 2010):
The University of Tehran's Persian-English Comparative Corpus (UTPECC) has been created from two distinct news sources: Persian news from the Hamshahri news agency and English news from the BBC. To align documents in the two different languages, not only the dates of the news items but also the similarity of their contents have been considered. The corpus has been produced at the Intelligent Systems Research Laboratory of Tehran University.

• Peykare (Bijankhan et al., 2011):
Peykare is a large corpus containing circa 110 million words of written and spoken contemporary Persian. Bijankhan et al. (2011) define contemporary Persian as the modern Persian spoken as the formal language of Iran since the beginning of the latest era (1847). Thus, the corpus has been prepared from various types of texts dating from 1847 until the present. In total, 35,058 text files have been extracted from books, magazines, newspapers, web pages, unpublished texts, and manuscripts, and have been organized chronologically and into different linguistic varieties. These varieties are based on different political milestones,15 which have been used as distinctive boundaries, since the lexical items used by native speakers, particularly in the media, have been strongly affected by political events in different eras. A small portion of Peykare, to be used as a training data set, was collected randomly from different topics in order to cover a variety of lexical and grammatical structures. The training set, initially consisting of 10,612,187 tokens, was reduced to 9,781,809 tokens after treating some multi-word expressions as single words in the tokenization process. The training set is annotated with an EAGLES-based part-of-speech tagset. Neither Peykare nor the training set is freely available.

• Furqan Quran Corpus (Astiri et al., 2013):
The Furqan Corpus, whose underlying text is the Quran, has been designed and implemented at the Ferdowsi Web Technology Lab, Mashhad University. The corpus comprises more than 587 megabytes of data, including all information about the text and the translated verses of the Quran in Persian and English, morphological and syntactic analyses of the verses in Arabic, Persian, and English, stemming, and many other items, in RDF format.

• Mizan English-Persian Parallel Corpus (Mizan, 2013):
The corpus contains more than one million English sentences (often from classic literature) and their translations into Farsi, provided by the Supreme Council of Information and Communication Technology in Iran. The corpus can be used in various applications, especially machine translation and natural language processing.

15 The milestones are taken from Bijankhan et al. (2011, p. 146) and are presented below:
• 1847–1906: before the period of 'Mashroutiyat' (Constitutionality),
• 1906–1925: from Constitutionality until the first king of the Pahlavi dynasty,
• 1925–1941: from the first king of the Pahlavi dynasty to the second king,
• 1941–1978: from the second king of the Pahlavi dynasty to the Islamic revolution,
• 1978–1988: from the Islamic revolution to the end of the war with Iraq,
• 1988–2006: from the end of the war until 2006, when the designing of Peykare ended,
• 2006–present: from 2006 until now, when text collecting for Peykare resumed.

2.4.2 Syntactically Annotated Corpora
Recently, we have witnessed the emergence of three treebanks, namely the HPSG-based PerTreeBank (Ghayoomi, 2012), the Uppsala Persian Dependency Treebank (Seraji et al., 2012b), and the Persian Dependency Treebank (Rasooli et al., 2013). The development of the Uppsala Persian Dependency Treebank and the HPSG-based treebank began almost simultaneously, in different places and with different annotation schemes. Shortly after, the Persian Dependency Treebank was developed in Iran, with an annotation scheme based on traditional Persian grammar. Below follows a brief description of the HPSG-based treebank (PerTreeBank) (Ghayoomi, 2012) and the Persian Dependency Treebank (Rasooli et al., 2013). The Uppsala Persian Dependency Treebank (Seraji et al., 2012b) will be presented in detail in Chapter 5.

• The Persian Treebank (PerTreeBank) (Ghayoomi, 2012):
PerTreeBank consists of 1,012 sentences and 27,659 tokens taken from the Bijankhan Corpus (Bijankhan, 2004). The treebank structure is defined based on an HPSG scheme, where the constituent structure is encoded for phrase categories with the concepts of head-subject, head-complement, head-adjunct, and head-filler representing subcategorization requirements (Ghayoomi, 2012). The original part-of-speech tags in the Bijankhan Corpus, with up to three levels (e.g., N,COM,SING), were converted into the MULTEXT-East framework to encode the morphosyntactic and semantic information as atomic tags (e.g., Ncs). Unfortunately, the treebank lacks annotation guidelines describing each syntactic relation used in the treebank separately. The treebank development was discontinued at a size of 1,012 sentences. PerTreeBank has recently been automatically converted into a dependency structure named DepPerTreeBank.

• The Persian Dependency Treebank (PerDT) (Rasooli et al., 2013):
The Persian Dependency Treebank (PerDT) consists of circa 30,000 independent sentences and 498,081 tokens derived from several sources, including Web URLs, stories, and lectures. Instead of aiming at a corpus balanced across different genres, the data source consists of isolated sentences taken from the Web, specifically selected on the basis of different verb types. To include all types of verbs, including rare verbs, a syntactic valency lexicon of Persian verbs (Rasooli et al., 2013) was used. The morphosyntactic features are annotated using the part-of-speech tagger Maximum Likelihood Estimation (MLE) and the morphological analyzer Virastyar.16 The treebank has been developed semi-automatically through a bootstrapping procedure using MSTParser (McDonald and Pereira, 2006). The treebank annotation scheme follows traditional Persian grammar and consists of 43 syntactic relations, listed in Table 2.17.

2.4.3 Sentence Segmentation and Tokenization
There are no open source tools for performing sentence segmentation and tokenization for Persian. However, a tool called Standard Text Preparation for Persian Language (STeP-1) (Shamsfard et al., 2010) has been designed to perform multiple tasks, including preprocessing, tokenization, morphological analysis, and spell checking. The software is not open source.

2.4.4 Part-of-Speech Taggers
There is no open source part-of-speech tagger for Persian. However, there have been experiments reporting quite good results on the performance of several part-of-speech tagging methods, such as TnT, the memory-based tagger (MBT), and Maximum Likelihood Estimation (MLE) (Raja et al., 2007). The corpus used in these experiments was the Bijankhan Corpus. Training and test sets were created by randomly dividing the corpus into two parts with an 85% to 15% ratio, and each experiment was repeated five times to avoid accidental results. The overall accuracies reported for the three taggers are 96.6%, 96.6%, and 95.9%, respectively.
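The evaluation protocol just described is straightforward to reproduce. The following Python fragment is a minimal sketch of it under stated assumptions: train_and_evaluate is a hypothetical callback standing in for training and testing any particular tagger, and is not part of the cited work.

    import random

    def average_accuracy(sentences, train_and_evaluate, runs=5, train_ratio=0.85):
        # Repeat a random 85/15 split several times and average the
        # resulting accuracies, mirroring the protocol of Raja et al. (2007).
        scores = []
        for seed in range(runs):
            shuffled = sentences[:]               # copy; keep the input intact
            random.Random(seed).shuffle(shuffled)
            cut = int(len(shuffled) * train_ratio)
            train, test = shuffled[:cut], shuffled[cut:]
            scores.append(train_and_evaluate(train, test))
        return sum(scores) / len(scores)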

2.4.5 Parsers
To my knowledge, there is only one freely available parser for Persian, namely the link grammar parser (Dehdari and Lonsdale, 2008). The modules of the parser have been constructed based on open source technologies such as two-level morphology (Koskenniemi, 1983) and the dependency-like link grammar (Sleator and Temperley, 1993). The parser takes a sentence as input and decomposes all inflectional morphemes through a lexicon-based morphological analyzer. Subsequently, when the morphological analysis is concluded, the results are transferred to the parser for syntactic analysis. The parser links associated word pairs such as subject + verb, object + verb, preposition + object, adverbial modifier + adjective, and auxiliary + main verb. Each link possesses a label that represents a syntactic relation. However, the links are represented in such a way that a node involved in a link cannot be uniquely tied to a token position in the sentence, and the parser provides no explicit way to extract the head of the sentence (Seraji et al., 2012a).

16 http://www.virastyar.ir/

Table 2.17. Syntactic relations in the Persian Dependency Treebank.
Category    Description
ACL         complement clause of adjective
ADV         adverb
ADVC        adverbial complement of verb
AJCONJ      conjunction of adjective
AJPP        prepositional complement of adjective
AJUCL       adjunct clause
APOSTMOD    adjective post-modifier
APP         apposition
APREMOD     adjective pre-modifier
AVCONJ      conjunction of adverb
COMPPP      comparative preposition
ENC         enclitic non-verbal element
LVP         light verb particle
MESU        measure
MOS         mosnad
MOZ         ezafe dependent
NADV        adverb of noun
NCL         clause of noun
NCONJ       conjunction of noun
NE          non-verbal element of infinitive
NEZ         ezafe complement of adjective
NPOSTMOD    post-modifier of noun
NPP         preposition of noun
NPREMOD     pre-modifier of noun
NPRT        particle of infinitive
NVE         non-verbal element
OBJ         object
OBJ2        second object
PARCL       participle clause
PART        interrogative particle
PCONJ       conjunction of preposition
POSDEP      post-dependent
PRD         predicate
PREDEP      pre-dependent
PROG        progressive auxiliary
PUNC        punctuation mark
ROOT        root
SBJ         subject
TAM         tamiz
VCL         complement clause of verb
VCONJ       conjunction of verb
VPP         prepositional complement of verb
VPRT        verb particle


3. Uppsala Persian Corpus

This chapter presents the development of the Uppsala Persian Corpus1 (UPC) (Seraji et al., 2012a) which is based on the Bijankhan Corpus (Bijankhan, 2004). As my goal was to develop a treebank for Persian, and the Bijankhan Corpus was the only freely available annotated corpus of Persian, I used the corpus as the starting point of my work. However, the corpus was not created for language technology purposes. Therefore, I developed a modified version of the corpus and called it the Uppsala Persian Corpus. In the UPC some properties are inherited from the Bijankhan Corpus and some I have created through adjustments and improvements to make the corpus more suitable for syntactic analysis. The chapter presents the principal modifications made to the character encoding, segmentation, and annotation scheme in the Bijankhan Corpus.

3.1 The Bijankhan Corpus
The Bijankhan Corpus was released in 2004 as the first manually annotated Persian (Farsi) corpus. It was created from on-line material containing texts of different genres and topics such as newspaper articles, fiction, technical descriptions, and texts about culture and art. The corpus consists of nearly 2.6 million tokens and is annotated for parts of speech.
The corpus's original tagset contains 550 tags. A tag name starts with the name of the most general category and continues with the names of the subcategories until it reaches the name of the leaf tag. An example of a three-level tag is N_PL_LOC, where N represents noun, PL defines number as plural, and LOC specifies the tag as locative. In Oroumchian et al. (2006), the tagset has been described as a hierarchical tree structure, but the tag system is in fact atomic. A tagset of this size is used to achieve a more fine-grained morphological analysis. Later on, the number of tags was reduced to 40 in an attempt to facilitate machine learning (Oroumchian et al., 2006). All tags with three or more levels were accordingly reduced to two-level tags; in other words, the above example was reduced to N_PL. Some two-level tags were also reduced to one-level tags, for example conjunctions, prepositions, and pronouns (Oroumchian et al., 2006). This version of the Bijankhan Corpus is in Unicode text format. The tagset is shown in Table 3.1.
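The reduction of the hierarchical tags can be expressed as a simple truncation. The Python sketch below assumes underscore-separated tag levels, as in the examples above; the further reduction of certain two-level tags to one-level tags was category-specific and is not covered by it.

    def reduce_tag(tag, levels=2):
        # Keep only the first `levels` parts of a hierarchical tag,
        # e.g. N_PL_LOC -> N_PL.
        return "_".join(tag.split("_")[:levels])

    assert reduce_tag("N_PL_LOC") == "N_PL"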

1 http://stp.lingfil.uu.se/∼mojgan/UPC.html

Table 3.1. Part-of-speech tags in the Bijankhan Corpus.
Category    Description                 Frequency
ADJ         Adjective                   22
ADJ_CMPR    Comparative adjective       7443
ADJ_INO     Participle adjective        27195
ADJ_ORD     Ordinal adjective           6592
ADJ_SIM     Simple adjective            231151
ADJ_SUP     Superlative adjective       7343
ADV         Adverbial                   1515
ADV_EXM     Example adverbial           3191
ADV_I       Interrogative adverbial     2094
ADV_NEGG    Negation adverbial          1668
ADV_NI      Non-question adverbial      21900
ADV_TIME    Time adverbial              8427
AR          Arabic word                 3493
CON         Conjunction                 210292
DEFAULT     Default                     192
DELM        Delimiter                   256486
DET         Determiner                  45898
IF          Conditional                 3122
INT         Voice letters               113
MORP        Morpheme                    3027
MQUA        Quantifier of sort          361
MS          Mathematical sign           261
NN          Number range                2
NP          Noun phrase                 52
N_PL        Plural noun                 160419
N_SING      Singular noun               967545
OH          Interjection                283
OHH         Harbinger                   20
P           Preposition                 319858
PP          Preposition phrase          880
PRO         Pronoun                     61859
PS          Short phrase                333
QUA         Quantifier                  15418
SPEC        Species/quality indicator   3809
V_AUX       Auxiliary verb              15870
V_IMP       Imperative verb             1157
V_PA        Past tense verb             80594
V_PRE       Predicate verb              42494
V_PRS       Present tense verb          51738
V_SUB       Subjunctive verb            33820

The corpus further comes with statistical software for corpus processing. This includes the calculation and extraction of language features such as conditional distribution probability and word frequency. It also includes recognition of homonyms and synonyms, and construction of concordances and lexicons.
The Bijankhan Corpus is a pioneering effort for the Persian language and a valuable resource for my work. However, because the corpus was not intended for natural language processing, it has certain characteristics that make it less suitable for automatic processing. In order to use the corpus as the basis for a treebank, I therefore had to make certain adaptations and improvements, which will be described in the rest of the chapter. The resulting corpus was released as the Uppsala Persian Corpus.
The version of the Bijankhan Corpus available as of 2006,2 which has been used in this thesis project, lacks sentence segmentation and is unevenly normalized. The corpus contains different texts with a wide range of inconsistencies in tokenization and orthography, and it is encoded in various types of character sets. The corpus additionally contains annotation errors as well as variation in the annotation, such as inconsistent application of part-of-speech tags across the corpus. These types of inconsistencies can lead to low quality in the morphological and syntactic analysis provided by taggers and parsers.

3.2 Uppsala Persian Corpus The Uppsala Persian Corpus is the modified version of the Bijankhan Corpus with additional sentence segmentation and consistent tokenization. The corpus is currently the largest freely available, linguistically annotated and manually validated corpus for Persian. Due to the modifications in segmentation and part-of-speech tagging, the UPC contains more tokens and fewer tags. The corpus consists of 2,704,893 tokens and is annotated with a tagset of 31 tags with morphological information. Next, I will describe the different steps in creating the UPC. The corpus is freely available in plain text and open source under a GNU General Public License.

3.2.1 Character Encodings
As mentioned in Chapter 2, Persian shares 28 of its 32 letters with Arabic, the rest having been specifically invented for Persian. Stylistically, however, the 28 shared letters are not exactly the same (see Table 2.9). For example, two of the characters, ی (ye) and ک (kāf), differ in shape and therefore have different Unicode encodings. Persian and Arabic also share the same digits, although they have different styles for the numbers 4, 5, and 6, each with different Unicode characters.

2 This version of the Bijankhan Corpus is available at http://ece.ut.ac.ir/dbrg/bijankhan

In the Bijankhan Corpus, texts are encoded with various types of character sets. Letters appear in a mixture of Persian and Arabic styles. For digits, the variation is even larger, and the digit characters appear in Persian, Arabic, and Western styles (see Section 2.3.1). Thus, in normalizing the Bijankhan Corpus, all letters in Arabic style with Arabic Unicode encodings were converted to Persian style with Persian Unicode encodings, and Arabic and Western digits were all converted to Persian digits. The normalization of the character encodings was performed with the tool PrePer, which will be introduced in Section 4.1.1.
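As an illustration of this kind of normalization (a minimal sketch, not PrePer's actual implementation), the following Python fragment maps the Arabic code points mentioned above to their Persian counterparts; the letter and digit code points are standard Unicode assignments.

    # Arabic ye (U+064A) -> Persian ye (U+06CC);
    # Arabic kaf (U+0643) -> Persian kaf (U+06A9)
    LETTERS = {"\u064A": "\u06CC", "\u0643": "\u06A9"}
    # Arabic-Indic digits (U+0660-0669) and Western digits (0-9)
    # -> Persian digits (U+06F0-06F9)
    DIGITS = {chr(0x0660 + i): chr(0x06F0 + i) for i in range(10)}
    DIGITS.update({chr(ord("0") + i): chr(0x06F0 + i) for i in range(10)})

    TABLE = str.maketrans({**LETTERS, **DIGITS})

    def normalize_characters(text):
        # Rewrite Arabic-style letters and digits as Persian-style ones.
        return text.translate(TABLE)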

3.2.2 Sentence Segmentation and Tokenization
As noted earlier, the Bijankhan Corpus was not designed for NLP applications. For instance, it did not have sentence segmentation or consistent tokenization, which are important features for a part-of-speech tagger and a syntactic parser. In the UPC, I have added sentence segmentation, with sentences separated by one of the punctuation marks '.', '!', '?', or combinations thereof. In addition, the punctuation mark ':' has been treated as a sentence separator when used to introduce a list of alternatives. Tokenization has been made more consistent than in the original corpus and better adapted for automatic processing, so that it can be reproduced on new text. The basic rule for tokenization is that white space and punctuation marks define token boundaries. However, in order to improve the quality of tokenization, two preprocessing steps were carried out before applying the main rule; a schematic sketch of the resulting procedure is given after the two steps below.

• White space is converted to ZWNJ to ensure no token boundary:
Word-internal white space was converted to ZWNJ in order to conform to standard conventions, that is, the stylistic and orthographic rules introduced by the Academy of Persian Language and Literature (APLL).3 Accordingly, white space inside compound words, as well as inside words incorporating clitics and affixes, was converted to ZWNJ to make sure that tokens in the treebank never contain internal white space. This was done only when such cases could be identified deterministically.

• ZWNJ/no space is converted to white space in cases where preverbal particles are joined to main verbs or other elements:
White space was added between preverbal particles and their hosts to enable them to be separated in tokenization. This step was necessary for my further analysis at the syntactic level. As mentioned in Section 2.3.3, preverbs are elements that immediately precede main verbs, except in negative forms and future tense contexts.

3 http://www.persianacademy.ir/fa/das.aspx

Thus, preverbs are not bound elements like prefixes to the main verbs. They are in fact separate elements, like prepositions and adverbs, that can stand on their own and change position in a sentence when they are not behaving as verb particles. In the Bijankhan Corpus, preverbs were accompanied by the future auxiliary verb either with ZWNJ or with no space (without being attached to the auxiliary verb, due to the right-joining characters at the end of the preverbs), for example /dar/ (in) in /dar xāh-ad āmad/ (gloss: in will.fut.3sg come.past, translation: will become) and /bar/ (on) in /bar xāh-ad xāst/ (gloss: on will.fut.3sg get.past, translation: will get up). Moreover, the particles, in association with main verbs, were treated inconsistently in the corpus: they were sometimes joined to the main verbs in a single token and sometimes separated from the main verbs and treated as distinct tokens. Thus, in the UPC, the preverbal particles were split from the main verbs and the future auxiliary verb and received the part-of-speech tag PREV (preverb). Note again that the modifications were performed only when the cases could be identified deterministically. My reasons for not joining the preverbs to the main verbs with ZWNJ (or no space), as I did for clitics and affixes, are that (1) preverbs are distinct items that can stand independently in a string, whereas clitics and affixes are bound units that cannot stand on their own; and (2) another element, such as the negative morpheme or the future auxiliary verb, can easily separate them from the main verb, which does not happen with clitics or affixes.
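To make the main rule concrete, here is a minimal Python sketch of the tokenization applied after the two steps above; it treats white space and punctuation marks as token boundaries and is merely illustrative, not the actual UPC tokenizer. The exact punctuation inventory is an assumption here.

    import re

    # Punctuation marks become tokens of their own; ZWNJ (U+200C) is
    # word-internal after preprocessing and is deliberately kept inside tokens.
    PUNCT = r".,!?:;()\[\]«»"
    TOKEN = re.compile(rf"[^\s{PUNCT}]+|[{PUNCT}]")

    def tokenize(line):
        return TOKEN.findall(line)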

After performing the two preprocessing steps, white space was treated as a token boundary. Fixed expressions separated by white space in the Bijankhan Corpus, such as /az ānjā-i ke/ (gloss: from that.place-indef that, translation: since, where), /be raqm-e in ke/ (gloss: to despite-ez this that, translation: despite), /az in ru/ (gloss: from this face, translation: hence), and /az ān jahat/ (gloss: from that direction, translation: thence), were treated inconsistently, sometimes as one single token with one tag (despite the fact that the tokens were separated by white space) and sometimes as multiple tokens with multiple tags. For a more consistent analysis, and as a consequence of the main tokenization rule, these fixed expressions were split into their distinct tokens.
Unfortunately, this does not mean that each and every case in the UPC has been treated accordingly. There are further cases, such as fixed expressions attached to one another, which have not been separated from each other. There are additionally instances where clitics and affixes are attached to their head words without white space. Such instances have not been separated from their heads either.

Plenty of similar cases exist in the corpus that have not been treated in the UPC, because they could not be identified deterministically and unambiguously. For instance, the word /māst/ could be a compound word containing the personal pronoun /mā/ (we) and the clitic form of the copula verb /st/ (is), meaning we are, or it could simply be a singular noun meaning yoghurt. With respect to the corpus size and the available research time, it was impossible to go through the corpus and fix all cases manually one by one. On the other hand, I could not handle such cases automatically, since this could lead to many incorrect conversions by affecting other orthographically similar words with totally different morphological categories and senses. Therefore, I had one option, namely to fix only those cases that could be fixed automatically and unambiguously and accept the remaining cases as the consequence of that option. Hence, I left them as they are, but gave them a special analysis at the syntactic level instead. Moreover, since there may be cases in new data that cannot be handled by the automatic tokenizer, it is beneficial that such cases exist as samples in the training data.

3.2.3 Morphological Annotation
In normalizing the morphological annotation of the Bijankhan Corpus, I have made two types of changes. First, I have modified the tagset by adding, removing, and merging categories. Second, I have corrected errors and inconsistencies in the annotation. The morphological annotation in the UPC consists of atomic part-of-speech tags that encode a subset of the features found in the original Bijankhan Corpus. The tagset is listed with explanations in Table 3.2. New tags are marked with an asterisk in the table. More detailed changes are presented below.

1. Added and replaced tags
For the improvement of the corpus annotation, I introduce 11 new part-of-speech tags. Apart from one added tag (PREV), the new tags are introduced as replacements for former ones. The new labels in the UPC tagset represent a more thorough treatment of the morphological structure than the Bijankhan tagset, with respect to traditional Persian grammar.

a) ADJ_VOC
The tag OH, which had been used for the word /xoš-ā/ (gloss: good-voc, translation: good for, blessed), was replaced by the tag ADJ_VOC. The interjection /-ā/ (long vowel) is attached at the end of the word /xoš/ (good) to indicate the vocative case.

Table 3.2. Part-of-speech tags in the UPC and the corresponding tags in the Bijankhan Corpus (BC). Tags new in the UPC are marked with an asterisk.
BC Tags                           UPC Tags    Description              Frequency
ADJ, ADJ_ORD, ADJ_SIM, QUA, SPEC  ADJ         Adjective                241113
ADJ_CMPR                          ADJ_CMPR    Comparative adjective    6766
ADJ_INO                           ADJ_INO     Participle adjective     3434
ADJ_SUP                           ADJ_SUP     Superlative adjective    7776
OH                                ADJ_VOC*    Vocative adjective       2
ADV, ADV_NI, MQUA, QUA, SPEC      ADV         Adverb                   27081
ADV_EXM                           ADV_COMP*   Adverb of comparison     3081
ADV_I                             ADV_I       Adverb of interrogation  3657
ADJ_SIM, N_SING                   ADV_LOC*    Adverb of location       2492
ADV_NEGG, QUA                     ADV_NEG*    Adverb of negation       3993
ADV_TIME                          ADV_TIME    Adverb of time           14707
P                                 CLITIC*     Accusative marker        35820
CON, IF, NP                       CON         Conjunction              210954
DEFAULT, NN, DELM                 DELM        Delimiter                257377
DET, QUA                          DET         Determiner               52345
AR                                FW*         Foreign Word             4421
INT, OH, PS                       INT         Interjection             664
N_PL                              N_PL        Plural noun              161383
MORP, N_SING, QUA, SPEC           N_SING      Singular noun            875699
NN                                NUM*        Numeral                  73371
OHH                               N_VOC*      Vocative noun            53
P, PP, QUA                        P           Preposition              308122
-                                 PREV*       Preverbal particle       750
PRO, QUA                          PRO         Pronoun                  68236
MS                                SYM*        Symbol                   260
V_AUX                             V_AUX       Auxiliary verb           13053
V_IMP                             V_IMP       Imperative verb          1239
V_PA, V_PRE                       V_PA        Past tense verb          71716
V_PA, MORP                        V_PP*       Past participle verb     36055
V_PRE, V_PRS                      V_PRS       Present tense verb       95625
V_SUB                             V_SUB       Subjunctive verb         34270

b) ADV_COMP
The tag ADV_EXM, which had been used for descriptive adverbs with comparative senses such as /mānand/ (as, like) and /mesl/ (as, such as, for example), was replaced by ADV_COMP (adverb of comparison).

c) ADV_LOC
In the Bijankhan Corpus, a number of adverbs such as /bālā/ (up), /pāyin/ (down), /jolo/ (forward), and /birun/ (out) had been tagged irregularly with different labels such as ADJ_SIM (simple adjective) and N_SING (singular noun), even when the words were modifying different verbs. Such cases were all replaced by the tag ADV_LOC.

d) CLITIC
In the Bijankhan Corpus, the accusative marker /rā/ had been annotated as a preposition (P). In the UPC, I modified the annotation label to CLITIC. Since the accusative marker rā always follows the object in Persian, it can be considered a postposition or a clitic case marker rather than a preposition.

e) FW
All Arabic words in the Bijankhan Corpus had received the tag AR, while other foreign words had received the tag N_SING. Hence, all types of foreign words, when discovered, were homogeneously tagged with the label FW.

f) NUM
Numerals in the Bijankhan Corpus had been labeled N_SING. Numerals were systematically searched for throughout the corpus and given the tag NUM.

g) N_VOC
Nominal forms such as /parvardegār-ā/ (gloss: Lord-voc, translation: Lord) and /xodā-yā/ (gloss: God-voc, translation: God), which are used in calling out to attract attention, had been tagged with the label OHH, which was replaced by N_VOC.

h) PREV
Preverbal particles accompanied (either with no space or with an intervening ZWNJ) by verbs and the future auxiliary verb /xāst-an/4 (gloss: will.past-inf, translation: to will) were split from their head and given the part-of-speech tag PREV (preverbal particle), as described in Section 3.2.2.

4 /xāst-an/ (to will, to want) is the base form of the future auxiliary verb, which changes form in the future tense and is inflected for person.


i) SYM
In the Bijankhan Corpus, various mathematical signs as well as various types of units of measure had been marked with the tag MS; these received the label SYM instead, as the concept symbol is more representative of such cases.

j) V_PP All past participle verbs with the tag V_PA (past tense verb) were modified to V_PP (past participle verb).

2. Removed tags
Some part-of-speech tags, such as QUA, MQUA, SPEC, MORP, and DEFAULT, had not been applied in accordance with traditional Persian grammatical description and were unevenly distributed in the Bijankhan Corpus. These tags were used for words belonging to different parts of speech. They were reconsidered and removed, and the affected words received their relevant morphological labels instead, for greater consistency. The tags are presented as follows.

a) QUA
The tag QUA was used for words belonging to different parts of speech: it was applied to singular nouns, different types of determiners, pronouns, various forms of adverbs, adjectives, and prepositions. The tag was modified to N_SING (singular noun) for words such as /aksariyyat/ (majority) and /aqaliyyat/ (minority), to DET (determiner) for words such as /tamām/ (all) and /har/ (each), to PRO (pronoun) for words such as /ba'zi/ (some) and /barxi/ (some), to ADV_NEG (adverb of negation) for words such as /hič/ (no, never), to ADV (adverb) for words such as /kam-i/ (gloss: little-indef, translation: slightly) and /xeyli/ (very), to ADJ (adjective) for words such as /bištar/ (more, major) and /aksar/ (most), and to P (preposition) for words such as /miyān/ (between).
b) MQUA
The tag MQUA, which had been used for the compound words /haddeaqal/ (at least) and /haddeaksar/ (at most), was converted to ADV. The two compound words were tagged irregularly in the corpus, alternating between QUA, MQUA, CON, N_SING, and ADV.

c) SPEC
The tag SPEC was used for different words associated with different parts of speech. The tag was converted to N_SING for words such as /tā/ (fold, piece; a noun of measure), /mošt/ (handful), /gune/ (species), /no'/ (type, sort), and /nemune/ (sample), to ADV for words such as /čenin/ (such), /čenān/ (so), and /hamčenin/ (also), and to ADJ for words such as /ma'dud/ (few), /čandin/ (several, multiple), and /omde/ (major).
d) MORP
The tag MORP was used for some passive forms such as /šod-e/ (gloss: become.past-pp, translation: become) and /raft-e/ (gloss: go.past-pp, translation: gone), for the adjective /sāle/ (years, for instance in the expression years-old), and for some nouns such as /nafar/ (person). The tag was replaced by the words' associated part-of-speech tags, namely V_PP (past participle verb), ADJ, and N_SING, respectively.
e) NN
The tag NN was applied to only two cases in the Bijankhan Corpus, indicating dates such as 79/3/1. The tag was removed during tokenization, since these sequences of numbers and slashes were split into separate tokens that received their associated tags: the numbers and the slashes were marked NUM and DELM, respectively.

f) V_PRE
In the Bijankhan Corpus, the copula verb was handled differently in different tenses. Copula verbs in the present tense, as well as words accompanied by copula clitics, were annotated as V_PRE (verb predicate); copula verbs in the present continuous tense were annotated as V_PRS (verb present); and those in the past tense as V_PA. In the UPC, the tag V_PRE was replaced by V_PRS, to represent the copula verb in the present tense and to be consistent with other copula verbs in being marked for tense.
g) DEFAULT
Delimiters in the Bijankhan Corpus were marked with the tag DELM. However, the tag DEFAULT had also been used irregularly for delimiters. The tag was completely replaced by DELM, for consistency with the rest of the delimiters.

3. Merged tags
Some tags in the UPC represent major part-of-speech categories only, while others also represent morphological features. In the UPC, I have sometimes merged tags specified with morphological features into their main or related categories. These were tags that could easily be merged with their main or other related categories and still remain representative; the idea was to reduce the size of the tagset. This approach was applied to a number of part-of-speech tags for which the variation in granularity added no particular informativeness. Thus, the tags ADJ_ORD and ADJ_SIM were merged with ADJ, and the tag ADV_NI was merged with ADV; likewise, IF was merged with CON, OH with INT, NP with CON, PP with P, and PS with INT.

4. Errors and inconsistencies
Persian has a huge number of homographs with multiple senses. The word /zarf/, for instance, is used both as a singular noun meaning dish and as a preposition meaning within. Depending on its role in a sentence, the word receives different part-of-speech tags. In the Bijankhan Corpus, the word was annotated as a preposition even when it referred to the notion of dish. The corpus contained many such cases of erroneous part-of-speech tags, and these were corrected when detected. Once an annotation error was discovered, it was often the case that the same type of error could be found elsewhere in the corpus. By browsing the corpus systematically, errors of the same type were traced and corrected.

Table 3.3 shows a sample sentence taken from the Bijankhan Corpus and the modified version of the sentence in the UPC. To facilitate a comparative description of the changes in the UPC, the words in the table are marked with ID numbers, and treated words are marked with an asterisk. However, words in the Bijankhan Corpus and the UPC do not actually appear with ID numbers.
The expression /az ān jahat ke/ (gloss: of that direction that, translation: since), with ID 6, was treated in the Bijankhan Corpus as a single token and annotated with the tag CON. However, this fixed expression, which contains four elements, was never actually merged into a single token; its elements remained separated by white space. Given my tokenization scheme, which assumes that tokens do not contain spaces, I basically had two options: either to merge fixed expressions into a single element or to split them into independent elements. As such expressions were treated inconsistently in the Bijankhan Corpus, sometimes as a single token annotated autonomously, despite the white spaces, and sometimes as multiple units with different tags, I split such fixed expressions into their components and treated them as multiple distinct tokens with their associated part-of-speech information. My justification for not merging multi-word expressions into a single token is that it is impossible to merge these expressions in new data because of ambiguities. Hence, their status as multi-word expressions will be recognized in the syntactic analysis instead.

Table 3.3. A sample sentence taken from the Bijankhan Corpus (BC) and the corresponding sentence as modified in the UPC. Rows whose words were treated in the UPC are marked with an asterisk.

ID    BC tag     UPC tag    Translation
1     DET        DET        this
2     N_SING     N_SING     total
3     P          P          of
4     N_PL       N_PL       compatriots-ez
5     PRO        PRO        our
6*    CON        P          from
7*    -          PRO        that
8*    -          N_SING     point
9*    -          CON        that
10    ADJ_SUP    ADJ_SUP    most superior
11    N_SING     N_SING     minority-ez
12*   ADJ_SIM    ADJ        immigrant
13    P          P          from
14    N_SING     N_SING     point-ez
15*   N_PL       N_PL       status-ez
16*   ADJ_SIM    ADJ        social
17    DELM       DELM       ,
18*   ADJ_SIM    ADJ        specialized
19    DELM       DELM       ,
20*   ADJ_SIM    ADJ        scientific
21    DELM       DELM       ,
22*   ADJ_SIM    ADJ        cultural
23    CON        CON        and
24*   ADJ_SIM    ADJ        capital
25    P          P          in
26*   N_PL       N_PL       countries-ez
27    N_SING     N_SING     host
28    P          P          to
29    N_SING     N_SING     account
30*   V_PRS      V_PRS      cont-go.pres-3pl
31    DELM       DELM       ,
32*   N_PL       N_PL       assets-ez
33*   ADJ_SIM    ADJ        national
34*   ADJ_SIM    ADJ        considered
35*   V_PRS      V_PRS      cont-become.pres-3pl
36    CON        CON        and
37*   ADV_NI     ADV        typically
38    N_SING     N_SING     cause-ez
39    N_SING     N_SING     honor-ez
40*   V_PRE      V_PRS      country-be.pres-3pl
41    DELM       DELM       .

In the Bijankhan Corpus, affixes such as plural suffixes and verb prefixes, as well as clitics, were sometimes separated from their head words by white space; in the UPC this white space was replaced by ZWNJ, for example in /moqeiyat-hā-ye/ (gloss: status-pl-ez, translation: status), /sarmāye-i/ (capital), /mi-rav-and/ (gloss: cont-go.pres-3pl, translation: they go), /sarmāye-hā-ye/ (gloss: asset-pl-ez, translation: assets), and /mi-šav-and/ (gloss: cont-become.pres-3pl, translation: they become), with IDs 15, 24, 30, 32, and 35, respectively.
The tag ADJ_SIM in the Bijankhan Corpus was merged with ADJ in the UPC; this tag is used for the words with ID numbers 12, 16, 18, 20, 22, 24, 33, and 34. The tag ADV_NI was also merged with ADV in the UPC, for example for the word with ID 37. In the Bijankhan Corpus, words accompanied by copula clitics were annotated as V_PRE, for example /kešvar-and/ (gloss: country-be.pres.3pl, translation: are country), with ID number 40; the tag V_PRE was modified to V_PRS in the UPC.
The character ye in Arabic style (with two dots beneath) in the Bijankhan Corpus was converted to the Persian ye in the UPC, without any dots and with Persian Unicode encoding. The affected words in the table are /moqeiyat-hā-ye/ (status), /ejtemā'i/ (social), /taxassosi/ (specialized), /elmi/ (scientific), /farhangi/ (cultural), /sarmāye-i/ (capital), /kešvar-hā-ye/ (gloss: country-pl-ez, translation: countries), /mi-rav-and/ (they go), /sarmāye-hā-ye/ (assets), /melli/ (national), and /mi-šav-and/ (they become), with IDs 15, 16, 18, 20, 22, 24, 26, 30, 32, 33, and 35. The modification was further applied to other characters in Arabic style, as well as to digits in Arabic and Western styles (see Section 3.2.1).
As most NLP applications rely on normalized, white-space-tokenized, and consistently annotated data, normalizing the Bijankhan Corpus at these different levels was the fundamental procedure in developing the Uppsala Persian Corpus as a prerequisite for building a treebank.

By making sure that tokens in the UPC do not need to be composed or decomposed for subsequent processing, I ensured that such words will be identified as distinct units in the syntactic analysis. I aimed to make changes consistent with linguistic units as long as these units are reproducible with an automatic tokenizer on new text. When this was not possible, I fell back on white space tokenization and added the linguistic information to the syntactic annotation. In this way, I guaranteed that the linguistic units in the annotated corpus would be comparable with newly tokenized text.
The main improvements were accomplished systematically by automatic or semi-automatic processing. Automatic processing was used for specific cases where there was no risk of affecting words with multiple parts of speech. For example, the accusative marker rā with the part-of-speech tag P was automatically modified to CLITIC. Automatic processing was further applied to correct misspellings. Semi-automatic processing, on the other hand, was used for cases involving multiple part-of-speech categories. Although detecting and correcting the part-of-speech annotation of such words is usually carried out by bigram or trigram searching, the method is not applicable in certain cases. Semi-automatic processing was therefore applied manually, back and forth, by browsing the corpus systematically and tracing the errors. For instance, /ba'd/ (next, later, dimension) is a word with multiple senses and multiple part-of-speech assignments. The word can be used simply as an adjective, as an adverb, or, with a totally different pronunciation (which is not visible in written text), as a noun. In the Bijankhan Corpus, the word was sometimes annotated as a noun, a conjunction, or a determiner when it was serving as an adjective or adverb. Such errors could sometimes be traced through bigram or trigram searching and corrected. However, cases where the word was used as a noun in the corpus and was given the wrong tag were undetectable by this method, as the tag N_SING was also applied to the word in its adjectival and adverbial uses. Thus, the correction was carried out manually when the errors were discovered. In this way, the UPC, as a re-tokenized version of the Bijankhan Corpus with additional sentence segmentation and more consistent morphological annotation, can serve as a normalized and balanced corpus of contemporary Persian for language technology purposes.
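The bigram searching mentioned above can be illustrated with a small sketch (a hypothetical illustration, not the actual procedure used for the UPC): collecting the tag contexts of a suspicious word makes unexpected annotations stand out.

    from collections import Counter

    def tag_contexts(tagged_tokens, target_word):
        # Collect (preceding tag, assigned tag) pairs for every occurrence
        # of target_word in a corpus given as a list of (word, tag) pairs.
        contexts = Counter()
        for (_, prev_tag), (word, tag) in zip(tagged_tokens, tagged_tokens[1:]):
            if word == target_word:
                contexts[(prev_tag, tag)] += 1
        return contexts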

4. Normalization, Segmentation and Morphological Analysis for Persian

As mentioned in Chapter 1, one of the goals of this thesis is to build a dependency-based treebank for Persian by first improving a part-of-speech analyzed corpus to serve as the treebank data. Furthermore, I aim to develop tools for automatic text processing and analysis of Persian, such as tools for sentence segmentation and tokenization, part-of-speech tagging, and parsing. In addition to reusing existing resources and tools, which is a practical necessity, I impose a compatibility requirement on my resources and tools. To satisfy this requirement, I first of all want to be able to run the tools in a pipeline, where the output of one tool is compatible with the input requirements of the next. Moreover, I want the tools to render the same analysis that is found in the annotated corpora, so that they can be used with additional tools derived from these corpora. Thus, for each and every step of processing, from normalization to syntactic parsing, I have developed a tool that is compatible with my annotated corpora. In building these tools, I have made use of standard methods and state-of-the-art tools, in particular the sentence segmentation and tokenization tools in Uplug (Tiedemann, 2003), the part-of-speech tagger HunPoS (Halácsy et al., 2007), and the data-driven parser generator MaltParser (Nivre et al., 2006). Figure 4.1 shows the pipeline of tools for automatic processing and analysis of Persian. In this chapter, I describe the tools that go with the UPC, that is, tools for preprocessing, sentence segmentation and tokenization, and part-of-speech tagging. The tools for syntactic parsing will be described after the presentation of the Uppsala Persian Dependency Treebank in Chapter 5.
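Seen abstractly, the compatibility requirement simply means that the four tools compose. A minimal sketch of the pipeline, with placeholder functions standing in for PrePer, SeTPer, TagPer, and ParsPer, is:

    def run_pipeline(raw_text, preper, setper, tagper, parsper):
        # Each stage consumes exactly what the previous stage produces.
        normalized = preper(raw_text)    # text normalization (PrePer)
        sentences = setper(normalized)   # segmentation and tokenization (SeTPer)
        tagged = tagper(sentences)       # part-of-speech tagging (TagPer)
        return parsper(tagged)           # dependency parsing (ParsPer)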

4.1 Preprocessing, Sentence Segmentation and Tokenization In this section I present the first two tools in the pipeline, namely PrePer and SeTPer. The section ends with a joint evaluation of the tools. PrePer1 and SeTPer2 are both freely available tools for the normalization and segmentation of Persian texts. They are open source under a GNU General Public License.

1 http://stp.lingfil.uu.se/∼mojgan/preper.html
2 http://stp.lingfil.uu.se/∼mojgan/setper.html

[Figure 4.1 shows the BLARK pipeline for Persian as a flowchart: Preprocessor (PrePer) → Sentence Segmenter & Tokenizer (SeTPer) → PoS Tagger (TagPer), associated with the PoS-tagged corpus UPC → Parser (ParsPer), associated with the treebank UPDT.]

Figure 4.1. Persian natural language processing pipeline.

4.1.1 The Preprocessor: PrePer
As mentioned earlier, one of the major bottlenecks in automatic processing of Persian is the lack of standardization in Persian orthography in terms of different writing styles, spacing, and font encoding. Persian orthography is not consistent: a word may be spelled in various forms and with different Unicode characters within a text. Compound words and inflectional affixes are highly problematic in this regard and can be spelled either attached to or detached from their adjacent word (see Section 2.3). These inconsistencies can easily impact the tokenization process, which in turn affects the quality of morphological and syntactic analysis. Therefore, prior to any morphosyntactic analysis, the input text needs to pass through a preprocessor module. For that reason, I have inserted a preprocessor for Persian, called PrePer, in my pipeline to take care of various encodings and typing styles in different genres.
PrePer (Seraji et al., 2012a) is an open source tool, developed in the programming language Ruby, for editing and cleaning up Persian texts to solve these inconsistency issues. The program uses the Virastar module (Bargi, 2011) for some formatting tasks. It handles miscellaneous cases and normalizes texts into a computational standard script. Via Virastar, PrePer takes care of occurrences of mixed character encodings: when normalizing texts, all letters in Arabic style with Arabic character encodings are converted to Persian style and mapped to Persian character encodings, and Arabic and Western digits are all converted to Persian digits. PrePer furthermore treats cases that Virastar does not, namely the following cases where white space can unambiguously be identified as token-internal. In these cases, white space is replaced by ZWNJ to create a single token (a schematic sketch of such rules is given after the list below).

1. Nouns and the plural suffixes /-hā/, /-ān/, /-yān/, /-gān/, /-āt/, and /-in/, e.g. (with "-" marking the inserted ZWNJ):
/ketāb/ + /-hā/ → /ketāb-hā/ (books)

/doxtar/ + /-ān/ → /doxtar-ān/ (girls)

/dānešju/ + /-yān/ → /dānešju-yān/ (students)

/setāre/ + /-gān/ → /setāre-gān/ (stars)

/tazāhor/ + /-āt/ → /tazāhor-āt/ (demonstrations)

/mosāfer/ + /-in/ → /mosāfer-in/ (passengers)

2. Any noun that ends in silent h and the indefinite clitic /-i/, e.g.:
/xāne/ + /-i/ → /xāne-i/ (a house)

3. Any noun indicating a trade name and the abstract suffix /-ye/ or /-i/, e.g.:

/zargar/ + /-i/ → /zargar-i/ (goldsmith's trade)

/nānvā/ + /-i/ → /nānvā-i/ (bakery)

4. Any noun and the abstract suffix /-ye/ forming adjectives, e.g.:
/xākestar/ + /-i/ → /xākestar-i/ (gray)

Table 4.1. Personal endings in the past tense.
Personal ending    Translation
/-am/              I
/-i/               you
∅                  she/he
/-im/              we
/-id/              you
/-and/             they

Table 4.2. Copula clitics. * The third singular /-h/ in formal usage is consistently used along with the verb /ast/ (is).
Copula clitic      Translation
/-am/              I
/-i/               you
/-h/*              she/he
/-im/              we
/-id/              you
/-and/             they

5. Any adjective and the abstract suffix /-ye/ forming nouns, e.g.:
/qermez/ + /-i/ → /qermez-i/ (redness)

6. Nouns and different pronominal clitics, e.g.:

/daftar/ + /-etān/ → /daftar-etān/ (your office)

7. Any preceding word and the personal endings shown in Table 4.1, as well as the copula clitics shown in Table 4.2, e.g.:
/āmad-e/ + /-and/ → /āmade-and/ (they have come)

8. Nouns and verbal stems in compound forms. Verbal stems shown in Table 4.3 are usually used as the second element of a compound word and serve as derivational suffixes.

Table 4.3. Verbal stems in the formation of compound words.
Verbal stem    Translation of example compound
/-āfarini/     dispute making
/-ālud/        sleepy
/-āmiz/        successful
/-andāz/       perspective
/-andud/       pitchy
/-angiz/       wonderful
/-āvar/        funny
/-pāš/         sprinkler
/-pazir/       vulnerable
/-parākan/     rumor-spreading
/-pardāz/      dreamer
/-parvar/      stockman
/-pariš/       agnosia
/-pažuh/       scholar
/-puš/         armored
/-peymā/       airplanes
/-xori/        dining
/-xiz/         early riser
/-dān/         physicist
/-resān/       injurious/ill-wisher
/-rizān/       fall
/-zā/          allergen
/-zodā/        stress desensitization
/-zi/          aquatic
/-sāzi/        building
/-suzi/        fire
/-sanj/        punctilious
/-šekan/       law-breaker
/-šenās/       geologist
/-fešān/       zealot
/-konān/       laughing
/-nevis/       historian
/-yāb/         assessor

9. Suffixes shown in Table 4.4 and their adjacent words, forming adjective-adverbs and adjective-nouns.3

Table 4.4. Adjectival and nominal suffixes.

Suffixes   Transcriptions   Translations of example words
سار        /-sār/           ashamed
ک          /-ak/            little boy
گانه       /-gāne/          childish
گر         /-gar/           tyrant
گی         /-gi/            depression
گین        /-gin/           angry
مند        /-mand/          rich
ناک        /-nāk/           terrible
وار        /-vār/           hopeful
ور         /-var/           eloquent
وند        /-vand/          citizen
یت         /-yat/           majority

10. Nouns and the indefinite suffix ی /-i/ forming indefinite nouns, e.g.:

پسر ی → پسری (/pesar/ + /-i/ → a boy)

11. Verbal stems and the suffix اک /-āk/ forming nouns, e.g.:

خور اک → خوراک (/xor/ + /-āk/ → food)

12. Verbal past stems and the suffix ار /-ār/ forming nouns, e.g.:

خرید ار → خریدار (/xarid/ + /-ār/ → buyer)

13. Verbal present stems and the suffix گار /-gār/ forming nouns, e.g.:

3 In Persian, adjectives rather frequently play different grammatical roles in a sentence and can easily be exchanged for nouns and adverbs (Lazard, 1992). For instance, جوان (young) is an adjective but can simply fill the role of a noun in the following sentence:

این جوان تازه‌وارد است.
in javān tāze-vāred ast .
this young new-entered be.pres.3sg .
This young woman/man is a newcomer.

آموز گار → آموزگار (/āmuz/ + /-gār/ → instructor)

14. Nouns and the suffix انه /-āne/ forming adverbs, e.g.:

مرد انه → مردانه (/mard/ + /-āne/ → manly)

15. The negative prefix نا /nā-/ (im-, in-, un-, -less) and adjectives or verbal stems, as well as the negative prefix بی /bi-/ (im-, in-, un-, -less) and adjectives, e.g.:

a) the negative prefix نا /nā-/ and adjectives, e.g.:

نا درست → نادرست (/nā-/ + /dorost/ → incorrect)

b) the negative prefix نا /nā-/ and verbal stems, e.g.:

نا شناس → ناشناس (/nā-/ + /šenās/ → unknown)

c) the negative prefix بی /bi-/ and adjectives, e.g.:

بی دقت → بی‌دقت (/bi-/ + /deqqat/ → careless)
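The space-to-ZWNJ replacement behind these rules can be illustrated in a few lines of code. The snippet below is a minimal sketch that assumes a small, hand-picked subset of the suffixes listed above; it is not PrePer's actual rule set, and a real implementation would need the full inventory and part-of-speech-sensitive exceptions:

```python
import re

ZWNJ = '\u200c'  # zero-width non-joiner

# Illustrative subset of the suffixes discussed above (plural -hā,
# indefinite -i, copula clitic -and, pronominal clitic -etān).
SUFFIXES = ['ها', 'های', 'ای', 'اند', 'تان']
PATTERN = re.compile('(\\S+) (' + '|'.join(SUFFIXES) + ')(?=\\s|$)')

def normalize(text: str) -> str:
    # Replace the white space between a stem and a known suffix with
    # ZWNJ, so that stem and suffix are merged into a single token.
    return PATTERN.sub(lambda m: m.group(1) + ZWNJ + m.group(2), text)

print(normalize('کتاب ها'))    # کتاب‌ها  "books"
print(normalize('آمده اند'))   # آمده‌اند "they have come"
```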

4.1.2 The Sentence Segmenter and Tokenizer: SeTPer
The sentence segmenter and tokenizer SeTPer (Seraji et al., 2012a) was developed for segmenting texts based on Persian sentence boundaries, which comprise the full stop, the exclamation mark, and the question mark, and for tokenizing a ZWNJ-normalized text. SeTPer was created by reusing and modifying the sentence segmenter and tokenizer tools in the modular software platform Uplug, a system designed for the integration of text processing tools (Tiedemann, 2003). The Uplug sentence segmenter and tokenizer is a rule-based program that can be adapted to various languages by using regular expressions for matching common word and sentence boundaries. SeTPer treats the full stop, the question mark, and the exclamation mark as sentence boundaries. Table 4.5 shows the symbols that are handled as token separators by SeTPer. The tokenizer also handles numerical expressions, web URLs, abbreviations, and titles. Acronyms are rarely used in Persian but might exist in text messaging and on social media platforms, and they are therefore also handled.

Table 4.5. List of token separators.

apostrophe “ and ”
parentheses ( and )
brackets [ and ]
colon :
semicolon ;
dash -
exclamation mark !
question mark ?
at sign @
slash /
backslash \
percent %
asterisk *
tilde ~

To fulfill the compatibility requirement mentioned earlier, the output of the sentence segmentation and tokenization tool must match the input requirements of the next tool in the pipeline, namely, the part-of-speech tagger.
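As an illustration of the rule-based approach, the following is a minimal sketch of regular-expression-driven sentence segmentation and token separation in the spirit of the Uplug tools. The patterns are simplified assumptions; among other things they ignore the special handling of numerical expressions, URLs, abbreviations, and titles mentioned above:

```python
import re

# Sentence boundaries: full stop, exclamation mark, and question mark
# (both the Latin '?' and the Arabic-script '؟' are assumed here).
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?؟])\s+')

# The token separators of Table 4.5, padded with spaces so that a
# plain white-space split yields one token per symbol.
SEPARATOR = re.compile(r'([“”()\[\]:;!?؟@/\\%*~-])')

def segment(text: str) -> list[str]:
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

def tokenize(sentence: str) -> list[str]:
    # ZWNJ is not a separator, so normalized tokens stay intact.
    return SEPARATOR.sub(r' \1 ', sentence).split()

for s in segment('او در ترکیه زندگی می‌کند. به چه صورت در خواهد آمد؟'):
    print(tokenize(s))
```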

4.1.3 The Evaluation of PrePer and SeTPer
To evaluate the normalization and segmentation tools, I carried out an experiment on the performance of the normalizer, PrePer, and the sentence segmenter and tokenizer, SeTPer. For the experiment, I used texts from the web-based journal www.hamshahri.com. I downloaded multiple texts from different genres and then randomly picked 100 sentences containing 2778 tokens to serve as a test set. As my experiment involved some manual work, I opted for a small sample to make the evaluation task more feasible. I then created a gold set by manually normalizing the internal word boundaries and character sets and then segmenting the text at sentence and token level. I normalized the test set with PrePer and then segmented it with SeTPer. The evaluation showed that all 100 sentences were correctly segmented at sentence level, an accuracy of 100%. The evaluation of the normalization and tokenization of tokens furthermore resulted in 99.25% recall, 99.59% precision, and a 99.42% F-score.
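As a quick consistency check on these numbers, the F-score is the harmonic mean of precision (P) and recall (R):

\[
F = \frac{2PR}{P + R} = \frac{2 \cdot 99.59 \cdot 99.25}{99.59 + 99.25} \approx 99.42
\]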

The experiment showed that some cases were not handled by the normalizer and the tokenizer. Examples were bigram words that were mistakenly typed without any space, as well as words that were typed together with digits without white space. As a result, the automatically segmented file contains 10 fewer words than the gold file. Table 4.6 shows all the words that the normalizer and tokenizer were not able to handle.

Table 4.6. Words not treated by segmentation tools.

Words/Symbols    Expected spellings   Glosses             Translations
مزبوراز          مزبور از             aforementionedof    aforementioned of
1382با           1382 با              1382with            1382 with
10TS3به          10TS3 به             10653to             10653 to
سیرو             سیر و                browsingand         browsing and
یکجانبهخود       یکجانبه خود          unilateralself      unilateral self
10نفر            10 نفر               10people            10 people
تصویرگرمشهور     تصویرگر مشهور        illustratorfamous   famous illustrator
همشهریآنلاین     همشهری آنلاین        Hamshahrionline     Hamshahri online
“ ”              « »                  —                   —
تعداد292         تعداد 292            number292           number 292
گرفته‌است         گرفته است            takenis             is taken

Normally, PrePer correctly converts Western quotation marks to Persian style, that is, angle quotes « ». However, when several quotation marks are included in a sentence, PrePer cannot fully succeed in the conversion. An example is given below:

آن‌ها گفتند: "ما هر ساله "شب یلدا" را جشن می‌گیریم."
ān-hā goft-and : “ mā har sāle “ šab-e Yaldā ” rā jašn mi-gir-im . ”
this-pl say.past-3pl : “ we every year “ night-ez Mithra ” rā celebration cont-take.pres-1pl . ”
They said: “We celebrate “the night of Mithra” every year.”

The second opening angle quote was instead converted to a closing angle quote, and the first closing angle quote became an opening angle quote. For ease of comparison, the example below is shown in gloss only, along with the Persian quotation marks.

The expected conversion:
They said: «we every year «night-ez Mithra» rā celebration cont-take.pres-1pl.»

The rendered conversion:
They said: «we every year »night-ez Mithra« rā celebration cont-take.pres-1pl.»
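This behavior can be reproduced with a naive stateful conversion that simply alternates between opening and closing angle quotes on each straight quote it meets. The snippet below is a minimal sketch of the failure mode, not PrePer's actual implementation:

```python
def convert_quotes(text: str) -> str:
    # Toggle: the 1st, 3rd, 5th, ... straight quote becomes «,
    # the 2nd, 4th, 6th, ... becomes ». With nested quotation the
    # toggle misfires: the inner opening quote surfaces as » and
    # the outer closing quote as «, as in the rendered conversion.
    out, opening = [], True
    for ch in text:
        if ch == '"':
            out.append('«' if opening else '»')
            opening = not opening
        else:
            out.append(ch)
    return ''.join(out)

print(convert_quotes('They said: "we every year "night-ez Mithra" ra celebration take."'))
```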

4.2 The Statistical Part-of-Speech Tagger: TagPer
My goal in creating a tagger for Persian was to develop a robust, data-driven part-of-speech tagger that disambiguates ambiguous words (words with more than one tag) and annotates unknown words (words not in the training data). The part-of-speech tagger TagPer (Seraji et al., 2012a) was developed for Persian using the statistical part-of-speech tagger HunPoS (Halácsy et al., 2007), an open source reimplementation of TnT (T. Brants, 2000). TagPer4 is released as a freely available tool for part-of-speech tagging of Persian and is open source under a GNU General Public License.
HunPoS is based on Hidden Markov Models with trigram language models, which allows the user to tune the tagger by applying different feature settings. The tagger is similar to TnT, with the difference that it (optionally) estimates emission/lexical probabilities based on the current and previous tags. An additional difference from TnT is that HunPoS is open source whereas TnT is not. The strong side of TnT, namely its suffix-based guessing algorithm for handling unseen words, is also implemented in HunPoS. Moreover, HunPoS can use a morphological analyzer to narrow down the list of alternatives (possible tags) that the algorithm needs to deal with, which not only speeds up searching but also significantly improves precision. In other words, the morphological analyzer generates the possible tags, to which weights are assigned by the suffix-based guessing algorithm (Halácsy et al., 2007).
The tagger has various options for training, and I made use of this flexibility by testing several parameters. To optimize HunPoS for Persian, I ran a number of experiments on the development set5 of the UPC with different feature settings and feature combinations. I experimented with the order of the tag transition probability by setting the option -t to either bigram tagging or the default trigram tagging, in order to estimate the probability of a tag based on the previous tags. I also examined the order of the emission probability, -e, for estimating the probability of a token based on the tag of the token itself as well as the previous tags. The results of training the tagger with a combination of different feature settings showed that, as could be predicted, applying the trigram models yielded higher accuracy than the bigram models. Table 4.7 shows a comparison of different models for tag transitions and word emissions.
For the tag distribution of unseen words, estimated from the tag distributions of rare words (words seen fewer than N times in the training corpus), I used the option -f with the default value 10. I also tested the -s parameter, which sets the length of the longest suffix to be considered by the algorithm when estimating an unseen word's tag distribution, at the default value 10.

4 http://stp.lingfil.uu.se/∼mojgan/tagper.html
5 In all UPC experiments the first 10% of the UPC is the development set, the second 10% is the test set, and the remaining 80% is the training data. For model selection the experiments were run on the development set, and for model assessment the experiments were run on the test set.

Table 4.7. Comparison of different models for tag transitions and word emissions.

Tag Transitions   Word Emissions   Accuracy (%)
bigram            unigram          96.42
bigram            bigram           96.53
trigram           unigram          96.61
trigram           bigram           96.81

Table 4.8. Comparison of different models for unseen words.

Max Suffix Length   Max Frequency   Accuracy (%)
10                  10              96.81
8                   10              96.80
4                   10              96.70

It is worth mentioning that the most suitable value of this parameter (-s) may depend on the morphology and orthography of the language involved (Halácsy et al., 2007). To examine the tagger's performance on unseen words, I varied the length of the suffixes, testing suffixes of length 10 (the default value), 8, and 4. The results in Table 4.8 show a decrease in accuracy when the length of the suffixes is reduced. Thus, for Persian, setting the suffix length to 10 yields the best results.
TagPer was developed by training HunPoS on the UPC, which contains 31 atomic part-of-speech tags with encoded morphological information. There are 15 main part-of-speech categories: adjective, adverb, clitic, conjunction, delimiter, determiner, foreign word, interjection, symbol, noun, numeral, preposition, preverbal particle, pronoun, and verb. In addition, categories such as adjective, adverb, noun, and verb are annotated for morphological and some semantic features. The tagset is listed with explanations in Table 3.2.
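For concreteness, a training and tagging session with the settings discussed above might look as follows. This is a hypothetical sketch: the option letters (-t, -e, -f, -s) follow the descriptions in the text, the file names are placeholders, and the exact command-line syntax of hunpos-train and hunpos-tag may differ:

```python
import subprocess

# Train on the UPC training split: trigram tag transitions (-t 3),
# bigram word emissions (-e 2), rare-word threshold 10 (-f 10),
# and maximum suffix length 10 (-s 10), as in the best model above.
with open('upc-train.tsv', 'rb') as train:
    subprocess.run(['hunpos-train', '-t', '3', '-e', '2',
                    '-f', '10', '-s', '10', 'tagper.model'],
                   stdin=train, check=True)

# Tag raw tokenized text (one token per line, with a blank line
# between sentences) using the induced model.
with open('tokens.txt', 'rb') as fin, open('tagged.tsv', 'wb') as fout:
    subprocess.run(['hunpos-tag', 'tagper.model'],
                   stdin=fin, stdout=fout, check=True)
```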

4.2.1 The Evaluation of TagPer
There are various possibilities for estimating tagging accuracy. Hence, I selected three different ways to evaluate TagPer. I first carried out a tagging estimation (model assessment) where HunPoS was trained on 90% of the UPC and evaluated on the remaining 10%. The tagger achieved an overall accuracy of 97.46% (Seraji et al., 2014). Compared to the performance of other data-driven part-of-speech taggers, such as TnT, the memory-based tagger, and Maximum Likelihood Estimation, HunPoS is a good alternative for part-of-speech tagging of Persian. The result reported here is the best published result for Persian so far, though the scores may not be directly comparable with

those of Raja et al. (2007), as it is unclear whether the two studies used the same training-test split. Table 4.9 shows the recall, precision, and F-score for the different part-of-speech tags when TagPer was evaluated on a subset of the UPC.

Table 4.9. Recall, precision, and F-score for different part-of-speech tags when TagPer was evaluated on a subset of UPC.

Part-of-Speech   Recall (%)   Precision (%)   F-Score (%)   Frequency
ADJ              93.22        92.88           93.05         25905
ADJ_CMPR         97.97        95.89           96.92         643
ADJ_INO          63.82        76.70           69.67         387
ADJ_SUP          99.87        98.77           99.32         807
ADJ_VOC          100          100             100           1
ADV              87.97        89.01           88.49         5388
ADV_COMP         88.57        87.50           88.03         245
ADV_I            75.32        92.42           83.00         389
ADV_LOC          90.70        87.23           88.93         226
ADV_NEG          86.61        84.08           85.32         366
ADV_TIME         95.10        89.89           92.42         1553
CLITIC           100          100             100           3770
CON              99.27        98.12           98.69         20612
DELM             99.98        99.88           99.93         24502
DET              97.00        93.30           95.11         5212
FW               61.65        77.43           68.64         412
INT              77.16        57.64           65.99         127
N_PL             98.41        98.37           98.39         15653
N_SING           97.36        97.78           97.57         90527
NUM              99.28        99.43           99.36         6256
N_VOC            92.30        85.71           88.88         13
P                98.16        98.20           98.18         76419
PREV             86.66        82.53           84.55         60
PRO              93.84        97.06           95.42         7018
SYM              0            0               0             0
V_AUX            95.61        99.32           97.43         1232
V_IMP            67.34        83.01           74.36         196
V_PA             98.19        97.73           97.96         7907
V_PP             98.35        96.73           97.53         3520
V_PRS            98.81        98.70           98.75         9166
V_SUB            96.81        95.77           96.29         3047

As shown in Table 4.9, the part-of-speech tags for vocative adjective and accusative marker gave the highest results, achieving 100% recall, precision, and F-score. In the case of the vocative adjective, however, there is only one occurrence. The tag for foreign word, with 61.65%, gave the lowest result for recall, and interjection, with 57.64% and 65.99%, gave the lowest scores for precision and F-score respectively. There were no symbols in the test set, but the system mistakenly analyzed three tokens as SYM, which resulted in zero for recall, precision, and F-score. The three wrongly analyzed tokens were Latin acronyms, which should have had the tag N_SING.
I additionally made an independent tagging evaluation of the tagger in the pipeline. I applied TagPer to the 100 randomly selected sentences used in the evaluation of the previously introduced tools (PrePer and SeTPer). As part of this task, I performed two different tagging evaluations: I used the tagger first on the automatically tokenized text and then on the manually tokenized text.
In the automatically tokenized text experiment, I manually annotated the manually normalized and segmented gold file with part-of-speech information, using the same tagset that TagPer was built on, to serve as the gold standard. I then tagged the test file (the automatically tokenized text) with TagPer. The tagging evaluation revealed 97.91% recall, 98.27% precision, and a 98.09% F-score for the test set of 100 sentences and 2778 tokens. Table 4.10 shows the results for recall, precision, and F-score for the different part-of-speech tags when TagPer was evaluated on 100 automatically tokenized sentences.

Table 4.10. Recall, precision, and F-score for different part-of-speech tags when TagPer was evaluated on 100 automatically tokenized sentences (2778 tokens) taken from the web-based journal Hamshahri.

Part-of-Speech   Recall (%)   Precision (%)   F-Score (%)   Frequency
ADJ              97.38        94.90           96.13         254
ADJ_CMPR         100          100             100           8
ADJ_INO          100          100             100           5
ADJ_SUP          100          100             100           8
ADV              100          95.91           97.91         18
ADV_COMP         100          100             100           5
ADV_I            100          100             100           4
ADV_LOC          100          100             100           1
ADV_NEG          100          100             100           3
ADV_TIME         100          94.44           97.14         18
CLITIC           100          100             100           50
CON              96.15        99.50           97.79         201
DELM             99.09        98.64           98.86         221
DET              98.57        95.83           97.18         72
N_PL             100          100             100           179
N_SING           97.41        98.58           97.99         916
NUM              90.47        97.43           93.82         39
P                98.58        98.70           98.64         349
PRO              92.64        98.43           95.45         64
V_AUX            100          100             100           4
V_PA             100          98.18           99.08         55
V_PP             97.95        100             98.96         48
V_PRS            99.24        99.24           99.24         132
V_SUB            100          100             100           24

As shown in Table 4.10, the part-of-speech tags for comparative adjective, participle adjective, superlative adjective, adverb of comparison, interrogative adverb, adverb of location, adverb of negation, accusative marker, plural noun, and auxiliary and subjunctive verbs all resulted in 100% recall, precision, and F-score. On the other hand, the tag for numeral, with 90.47% and 93.82%, gave the lowest results for recall and F-score respectively, and the tag for adverb of time, with 94.44%, shows the lowest precision.

I then automatically tagged the manually normalized and segmented text (the gold file in the previous experiment) and compared the tagging with the manually tagged gold file. The evaluation revealed an overall tagging accuracy of 98.78% on the test set of 100 sentences and 2788 tokens (as previously noted, the manually tokenized text includes 10 more tokens than the automatically tokenized text). This experiment showed an improvement of 0.51% in precision compared to the results on the automatically tokenized text. Table 4.11 shows the results for recall, precision, and F-score for the different part-of-speech tags when TagPer was evaluated on 100 manually tokenized sentences.

Table 4.11. Recall, precision, and F-score for different part-of-speech tags when TagPer was evaluated on 100 manually tokenized sentences (2788 tokens) taken from the web-based journal Hamshahri.

Part-of-Speech   Recall (%)   Precision (%)   F-Score (%)   Frequency
ADJ              98.88        94.98           96.89         258
ADJ_CMPR         100          100             100           8
ADJ_INO          100          100             100           5
ADJ_SUP          100          100             100           8
ADV              100          97.91           98.94         17
ADV_COMP         100          100             100           5
ADV_I            100          100             100           4
ADV_LOC          100          100             100           1
ADV_NEG          100          100             100           3
ADV_TIME         100          94.44           97.14         18
CLITIC           100          100             100           50
CON              96.63        99.50           98.04         202
DELM             100          99.54           99.77         221
DET              98.57        95.83           97.18         72
N_PL             100          100             100           179
N_SING           98.16        99.56           98.85         914
NUM              100          100             100           42
P                99.29        98.82           99.05         352
PRO              94.11        98.46           96.24         65
V_AUX            100          100             100           4
V_PA             100          98.18           99.08         55
V_PP             100          100             100           49
V_PRS            100          100             100           132
V_SUB            100          100             100           24

in Table 4.10, the part-of-speech tags for comparative adjective, participle ad- jective, superlative adjective, adverb of comparison, adverb of interrogative, adverb of location, adverb of negation, accusative marker, plural noun, auxil- iary and subjunctive verbs all resulted in 100% recall, precision, and F-score. On the other hand, the tag for numeral, with 90.47% and 93.82% gave the low- est results for recall and F-score respectively, and the tag for adverb of time, with 94.44%, shows the lowest precision. I then automatically tagged the manually normalized and segmented text (the gold file in the previous experiment) and compared the tagging with the manually tagged gold file. The evaluation revealed an overall tagging accuracy of 98.78% on the test set with 100 sentences and 2788 tokens (as previously noted, the manually tokenized text includes 10 more tokens than the auto- matically tokenized text). This experiment showed an improvement of 0.51% in precision compared to the results of the automatically tokenized text. Ta- ble 4.11 shows the results for recall, precision, and F-score for different part-

As shown in Table 4.11, the part-of-speech tags for comparative adjective, participle adjective, superlative adjective, adverb of comparison, interrogative adverb, adverb of location, adverb of negation, accusative marker, plural noun, numeral, auxiliary verb, past participle verb, present tense verb, and subjunctive verb all attain 100% recall, precision, and F-score, whereas the tag for pronoun, with 94.11% and 96.24%, gave the lowest results for recall and F-score respectively. The tag for adverb of time, with 94.44%, delivers the lowest precision.
The evaluation measures of tagging performance in the presented experiments are very good. The results show that TagPer is the best machine-learned part-of-speech tagger for Persian so far, since it outperforms the other part-of-speech taggers developed for Persian. The achieved scores can thus be considered state of the art.
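The per-tag scores in Tables 4.9-4.11 follow directly from counting matches between the gold and predicted tag sequences. The snippet below is a minimal sketch of such an evaluation; the two tag lists are toy stand-ins for real data:

```python
from collections import Counter

def per_tag_scores(gold, pred):
    # Count, per tag, true positives, false positives, and false
    # negatives over aligned gold/predicted tag sequences.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for tag in set(gold) | set(pred):
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        pre = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        # Harmonic mean; zero when a tag is never correctly predicted,
        # which is how SYM ends up with all-zero scores in Table 4.9.
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        scores[tag] = (rec, pre, f1)
    return scores

print(per_tag_scores(['N_SING', 'ADJ', 'N_SING'],
                     ['N_SING', 'N_SING', 'N_SING']))
```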


5. Uppsala Persian Dependency Treebank

This chapter presents the Uppsala Persian Dependency Treebank (UPDT)1 with a syntactic annotation scheme based on Stanford Typed Dependencies. The chapter describes the principles of data selection and the overall approach to syntactic annotation. It further gives a comparative analysis of the UPDT and another dependency treebank developed in parallel with the UPDT, namely, the Persian Dependency Treebank (PerDT) (Rasooli et al., 2013).

5.1 Corpus Overview
The Uppsala Persian Dependency Treebank (Seraji et al., 2012b; Seraji et al., 2013; Seraji et al., 2014) is a syntactically annotated corpus of contemporary Persian based on dependency grammar. The treebank consists of 6,000 annotated and validated sentences, 151,671 word tokens, and 15,692 word types. Table 5.1 presents statistical information about the treebank. The average sentence length in the treebank is 25 words, ranging from a few words to over 150 words. The treebank data is from different genres, including newspaper articles and fiction, as well as technical descriptions and texts about culture and art, taken from the open source, validated UPC (see Section 3.2).

Table 5.1. A statistical overview of the UPDT.

Categories          Number
Sentences           6,000
Tokens              151,671
Types               15,692
PoS tags            31
Dependency labels   96

To select sentences for the treebank, I extracted the first 6,000 sentences of the UPC. I decided not to choose the data randomly from various parts of the UPC, as this could impact text cohesion (discourse). In other words, it could interrupt the continuity of multiple syntactic characteristics such as presuppositions, implications, anaphoric elements, and the natural tense structure, which consequently could impact the frequency distribution. As shown in

1http://stp.lingfil.uu.se/∼mojgan/UPDT.html

Figure 5.1, this means that the data overlaps with the development set of the UPC that was used for the various part-of-speech tagging experiments in Section 4.2, as well as in the parsing experiments described in Seraji et al. (2012c). In Seraji et al. (2012c) I trained two parsers (MaltParser and MSTParser) on the UPDT using different part-of-speech tag sets, once with gold standard part-of-speech tags taken from the UPC and once with tags automatically generated during training and testing. For automatic generation of part-of-speech features I used TagPer. However, I excluded the treebank data from the UPC and retrained TagPer to avoid overlap. Moreover, in future experiments I want to avoid overlap between the treebank data and the training or test set of the UPC. Selecting the first 6,000 sentences was thus the best compromise I could find.

Figure 5.1. Data selection of the UPDT.

The treebank is freely available in the CoNLL format2 and is open source under a Creative Commons Attribution 3.0 Unported License (CC BY 3.0). A comprehensive description of the extended version of Stanford Typed Dependencies for Persian and the morphosyntactic features can be found in Seraji et al. (2013).
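To illustrate the distribution format, a hypothetical CoNLL-X rendering of the short copula example discussed later in Section 5.4.1 (بورن یک هنرمند است, 'Born is an artist') could look as follows. The lemma and feature columns are left unspecified here, and the tag assigned to یک and the punctuation attachment are assumptions based on the annotation scheme described below, not an excerpt from the treebank:

1	بورن	_	N	N_SING	_	3	nsubj	_	_
2	یک	_	NUM	NUM	_	3	num	_	_
3	هنرمند	_	N	N_SING	_	0	root	_	_
4	است	_	V	V_PRS	_	3	cop	_	_
5	.	_	DELM	DELM	_	3	punct	_	_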

5.2 Treebank Development
To annotate the sentences in the treebank, I used MaltParser (Nivre et al., 2006) in a bootstrapping scenario. I started by training MaltParser on a seed data set of 215 manually validated sentences and used the induced model to parse the rest of the treebank corpus. I then selected a subset of these sentences for manual correction, added them to the training set, retrained the parser, and reparsed the remaining corpus. This process was iterated as the size of the treebank grew and the quality of the parser improved. The selection of sentences for human validation could have been done using active learning (Hwa, 2004; Sassano and Kurohashi, 2010). However, since the treebank was

2http://ilk.uvt.nl/conll/#dataformat

relatively small, I did not do that, opting instead for a simple approach and proceeding sequentially.
In order to annotate and correct the syntactic annotation in the tree structures, I used the free tree-editing software TrEd.3 TrEd (Hajič et al., 2001) is a fully programmable and customizable graphical user interface for tree-like structures, and was the main annotation tool used for the Prague Dependency Treebank. From TrEd I exported annotations in the PML format4 to the CoNLL-X format (Buchholz and Marsi, 2006), which is the official distribution format of the UPDT.
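Schematically, the bootstrapping procedure described at the beginning of this section can be summarized as follows. This is a runnable toy sketch: train, parse, and correct are stand-ins for MaltParser training, parsing, and the manual validation step, and the batch size is an arbitrary illustrative choice:

```python
def bootstrap(seed, raw, train, parse, correct, batch_size=200):
    # Start from the manually validated seed set (215 sentences in
    # the actual workflow) and grow the treebank batch by batch.
    annotated = list(seed)
    while raw:
        model = train(annotated)          # retrain on all validated data
        batch, raw = raw[:batch_size], raw[batch_size:]
        annotated += [correct(parse(model, s)) for s in batch]
    return annotated

# Toy usage with stand-in functions:
trees = bootstrap(['seed-tree'], ['s1', 's2', 's3'],
                  train=lambda data: f'model({len(data)})',
                  parse=lambda model, s: s + '-parsed',
                  correct=lambda tree: tree + '-validated')
print(trees)
```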

5.3 Annotation Scheme
I use a syntactic annotation scheme based on dependency structure, where each dependency relation is annotated with a functional category indicating the grammatical function of the dependent with respect to its head. The annotation scheme is based on Stanford Typed Dependencies (STD) (de Marneffe et al., 2006), which has become a de facto standard for English. As mentioned earlier in Section 2.1.2, there is additionally a revised version of the Stanford dependencies, namely, the Universal Stanford Dependencies (de Marneffe et al., 2014). In developing the UPDT I could not use the universal grammatical relations described in de Marneffe et al. (2014), since the treebank was released well before that scheme was introduced. However, some relations that are referred to as universal grammatical relations in the Universal Stanford Dependencies are taken from the relations introduced as additions to the STD for Persian described in Seraji et al. (2013). In the UPDT these dependency relations are designated by the labels fw for foreign words, dep-top for topic dependents, and dep-voc for vocative dependents; they are generalized in the Universal Stanford Dependencies to foreign, dislocated, and vocative respectively.
The extended version of STD for Persian has a total of 96 dependency relations, of which 48 (including 10 new additions) are used for indicating basic relations. The remaining 48 labels are complex, and are used to assign syntactic relations to words containing unsegmented clitics. In the following sections I describe the basic relations from STD, including the new relations in the UPDT and the complex relations. Moreover, there are relations in the original STD that are excluded from the UPDT because they are not relevant for Persian. These are introduced as unused relations and are described further below.

3 TrEd is licensed under a GNU General Public License and is available at http://ufal.mff.cuni.cz/∼pajas/.
4 Prague Markup Language (PML) is a generic data format used for the storing and interchange of linguistic annotations. The PML format is based on XML.

5.4 Basic Relations
In the STD representation, the dependency annotation of a sentence always forms a tree representing all tokens of the sentence (including punctuation marks), rooted at an artificial root node prefixed to the sentence. Thus, I adopt the so-called basic version of STD (with punctuation retained), as opposed to the collapsed version, where some tokens may not correspond to nodes in the dependency structure and a single node may have more than one incoming arc. In general, every token in a sentence is assigned a syntactic head and one dependency label.

5.4.1 Relations from Stanford Dependencies
I have used 38 grammatical relations from the 50 relations in the original STD.5 These are defined below in alphabetical order according to the abbreviated name of the dependency labels that appear in the parser output. For each relation, I give examples taken from the UPDT. I then discuss in Section 5.4.2 the new relations that I have found necessary to introduce for the annotation of Persian.

acomp: Adjectival Complement
An adjectival complement of a verb is an adjectival phrase that functions as the complement of the verb (like an object of the verb).

این عقیده به نظر ناممکن می‌رسد.
in aqide be nazar nā-momken mi-res-ad .
this idea to thought neg-possible cont-reach.pres-3sg .
This idea seems to be impossible.
acomp(می‌رسد, ناممکن)
acomp(cont-reach.pres-3sg, neg-possible)

advcl: Adverbial Clause Modifier
An adverbial clause modifier is a clause modifying the verb (temporal clause, conditional clause, etc.).

چنانچه این پژوهشگران درست حدس زده باشند، یافته‌های آن‌ها می‌تواند برای تشخیص حجم و ضخامت دقیق سیارهٔ مزبور قرار گیرد.

5 I have considered the relations in the Stanford Typed Dependencies manual from 2008. The relations were revised and changed for the Stanford Parser v.3.3 in December 2013, about six months after I released the treebank. Some of the applied relations have been removed in the new version of the manual. These relations are the dependency labels for complementizer (complm), relative clause modifier (rcmod), and relative (rel).

čenānče in pažuhešgar-ān dorost hads zad-e bāš-and , yāfte-hā-ye ān-hā mi-tavān-ad barā-ye tašxis-e hajm va zexāmat-e daqiq-e sayyāre-ye mazbur qarār gir-ad .
if this researcher-pl correct guess hit.past-pp sub.be.pres-3pl , finding-pl-ez this-pl cont-can.pres-3sg for-ez diagnosis-ez volume and thickness-ez exact-ez planet-ez aforementioned place sub.take.pres-3sg .
If these researchers made a correct guess, their findings/results can determine the exact volume and thickness of the aforementioned planet.
advcl(گیرد, زده)
advcl(sub.take.pres-3sg, hit.past-pp)

advmod: Adverbial Modifier
An adverbial modifier of a word is a (non-clausal) adverb or adverbial phrase that serves to modify the meaning of the word.

آثار بورن به‌تدریج در نشریات به چاپ رسید.
āsār-e Born be.tadrij dar našriyāt be čāp resid .
work.pl-ez Born to.gradualness in magazine-pl to publication reach.past.3sg .
Born's artworks were published gradually in magazines.
advmod(رسید, به‌تدریج)
advmod(reach.past.3sg, to.gradualness)

او همیشه آماده بود.
u hamiše āmāde bud .
she/he always ready be.past.3sg .
She/he was always ready.
advmod(آماده, همیشه)
advmod(ready, always)

amod: Adjectival Modifier
An adjectival modifier of a nominal is any adjectival phrase that serves to modify the meaning of the nominal.

او به دنبال دنیایی ماندنی‌تر در زندگی تکراری آدم‌ها است.
u be donbāl-e donyā-i māndani-tar dar zendegi-e tekrāri-e ādam-hā ast .
she/he to after-ez world-indef lasting-more in life-ez repetitive-ez people-pl be.pres.3sg .
She/he is after a more lasting world in people's repetitive life.
amod(دنیایی, ماندنی‌تر)
amod(world-indef, lasting-more)
amod(زندگی, تکراری)
amod(life-ez, repetitive)

appos: Appositional Modifier
An appositional modifier of a nominal is another nominal that serves to modify the first. It includes parenthesized examples.

او در اسلام‌آباد با برنارد کوشنر، وزیر امور خارجهٔ فرانسه، ملاقات کرد.
u dar Eslāmābād bā Bernārd Kušner , vazir-e omur-e xāreje-ye farānse , molāqāt kard .
she/he in Islamabad with Bernard Kouchner , minister-ez affair.pl-ez foreign-ez France , meeting do.past.3sg .
She/he met Bernard Kouchner, the French foreign minister, in Islamabad.
appos(برنارد, وزیر)
appos(Bernard, minister)

aux: Auxiliary
An auxiliary of a clause is a non-main verb of the clause, e.g. a modal auxiliary, or بودن (be) and داشتن (have) in a composed tense.

او می‌خواهد چیزی را بسازد که قبلاً وجود نداشته است.
u mi-xāh-ad čiz-i rā be-sāz-ad ke qablan vojud na-dāšt-e ast .
she/he cont-want.pres-3sg thing-indef rā sub-build.pres-3sg that before exist neg-have.past-pp be.pres.3sg .
She/he wants to create something that did not exist before.
aux(بسازد, می‌خواهد)
aux(sub-build.pres-3sg, cont-want.pres-3sg)
aux(نداشته, است)
aux(neg-have.past-pp, be.pres.3sg)

auxpass: Passive Auxiliary
A passive auxiliary of a clause is a non-main verb of the clause which contains the passive information.

اولین سیاره خارج از منظومهٔ شمسی دیده شد.
avvalin sayyāre xārej az manzume-ye šamsi did-e šod .
first planet outside of system-ez solar see.past-pp become.past.3sg .
The first planet outside the solar system was sighted.
auxpass(دیده, شد)
auxpass(see.past-pp, become.past.3sg)

cc: Coordination
A coordination is the relation between an element of a conjunct and the coordinating conjunction word of the conjunct. One conjunct of a conjunction, normally the first, is taken as the head of the conjunction.

آدولف بورن دنیای واقعیت را با تخیل و رویا پیوند می‌زند.
Ādolf Born donyā-ye vāqeiyat rā bā taxayol va royā peyvand mi-zan-ad .
Adolf Born world-ez reality rā with imagination and dream link cont-hit.pres-3sg .
Adolf Born links the world of reality with imagination and dream.
cc(تخیل, و)
cc(imagination, and)

او افزود اما دولت پاکستان تاکنون چنین درخواستی را مطرح نکرده است.
u afzud ammā dolat-e Pākestān tā.konun čenin darxāst-i rā matrah na-kard-e ast .
she/he add.past.3sg but government-ez Pakistan until.now such request-indef rā raise neg-do.past-pp be.pres.3sg .
She/he added: but the government of Pakistan has not yet proposed such a request.
cc(نکرده, اما)
cc(neg-do.past-pp, but)

ccomp: Clausal Complement
A clausal complement of a verb or adjective is a dependent clause with an internal subject which functions like an object of the verb or adjective.

بررسی‌ها نشان می‌دهد که میانگین هزینه‌ها در شهر توکیو بالاتر از نیویورک است.
barresi-hā nešān mi-dah-ad ke miyāngin-e hazine-hā dar šahr-e Tokio bālā-tar az Niu-York ast .
study-pl indication cont-give.pres-3sg that average-ez cost-pl in city-ez Tokyo high-more than New-York be.pres.3sg .
Studies show that average costs in Tokyo city are higher than in New York.
ccomp(می‌دهد, بالاتر)
ccomp(cont-give.pres-3sg, high-more)

بررسی‌ها نشان می‌دهد که این کاوش تا مدت‌ها ادامه داشت.
barresi-hā nešān mi-dah-ad ke in kāvoš tā moddat-hā edāme dāšt .
study-pl indication cont-give.pres-3sg that this search until time-pl continuation have.past.3sg .
Studies show that this search continued for a long time.
ccomp(می‌دهد, داشت)
ccomp(cont-give.pres-3sg, have.past.3sg)

complm: Complementizer
A complementizer of a clausal complement (ccomp) is the word introducing it. It will be the subordinating conjunction که (that).

بررسی‌ها نشان می‌دهد که میانگین هزینه‌ها در شهر توکیو بالاتر از نیویورک است.
barresi-hā nešān mi-dah-ad ke miyāngin-e hazine-hā dar šahr-e Tokio bālā-tar az Niu-York ast .
study-pl indication cont-give.pres-3sg that average-ez cost-pl in city-ez Tokyo high-more than New-York be.pres.3sg .
Studies show that average costs in Tokyo city are higher than in New York.
complm(بالاتر, که)
complm(high-more, that)

بررسی‌ها نشان می‌دهد که این کاوش تا مدت‌ها ادامه داشت.

P@ QKBAK . ñJ »ñK QîD PX AëéJK Që á ÂKAJ Ó é» YëXú× àA ‚ Aëúæ PQK. .Iƒ@ ¸PñK ñJ K barresi-ha¯ nešan¯ mi-dah-ad ke miyangin-e¯ hazine-ha¯ dar study-pl indication cont-give.pres-3sg that average-ez cost-pl in šahr-e Tokio bal¯ a-tar¯ az Niu-York ast . city-ez Tokyo high-more than New-York be.pres.3sg . Studies show that average costs in Tokyo city are higher than in New York. complm(QKBAK , é») complm(high-more,. that)  . I ƒ@X éÓ@X@ AëH YÓ AK €ðA¿ áK @ é» YëXú× àA ‚ Aëúæ PQK. 106 barresi-ha¯ nešan¯ mi-dah-ad ke in kavoš¯ ta¯ moddat-ha¯ study-pl indication cont-give.pres-3sg that this search until time-pl edame¯ dašt¯ . continuation have.past.3sg . Studies show that this search continued for a long time. complm(I ƒ@X , é») complm(have.past.3sg, that) conj: Conjunct A conjunct is a relation between two elements connected by a coordinating conjunction, such as ð (and), AK (or), etc. Conjunctions are treated asymmet- rically. The head of the relation is the first conjunct, and other conjunctions depend on it via the conj relation.    .YK Pú× YKñJ K AK ðP ð ÉJ m' AK. @P IJ ª¯@ð øAJ KX àPñK . ­ËX@ Adolf¯ Born donya-ye¯ vaqeiyat ra¯ ba¯ taxayol va roya¯ peyvand Adolf Born world-ez reality ra¯ with imagination and dream link mi-zan-ad . cont-hit.pres-3sg . Adolf Born links the world of reality with imagination and dream. conj(ÉJm ' ,AKðP) conj(imagination, dream) cop: Copula A copula is the relation between the complement of a copula verb and the copula itself. The copula verb is taken as a dependent of its complement, except when the complement is a prepositional phrase (second example below).

بورن یک هنرمند است.
Born yek honarmand ast .
Born an artist be.pres.3sg .
Born is an artist.
cop(هنرمند, است)
cop(artist, be.pres.3sg)

در مورد این با مسئولان شبکه هم دائماً برای مشورت در ارتباط بودم.
dar mored-e in bā mas'ul-ān-e šabake ham dāeman barā-ye mašverat dar ertebāt bud-am .
in case-ez this with official-pl-ez network likewise constantly for-ez consultation in contact be.past-1sg .
In this case I was constantly in contact with network officials for consultation too.
root(ROOT, بودم)
root(ROOT, be.past-1sg)
prep(بودم, در)
prep(be.past-1sg, in)

dep: Dependent
The dependent relation is used when it is impossible to determine a more precise dependency relation between two words, or when the dependency relation is deemed too rare or insignificant to merit its own label. In the following example, the past participle گرفته (taken) is placed in circumposition6 to emphasize the preposition از (from) as a point of departure.

... از تصویرسازی‌های ساده و دلنشین برای بچه‌ها گرفته تا تصاویری تلخ، مرموز و پیچیده برای بزرگسال‌ها.
... az tasvir.sāzi-hā-ye sāde va delnešin barā-ye bačče-hā gereft-e tā tasāvir-i talx , marmuz va pičide barā-ye bozorgsāl-hā .
... from image.making-pl-ez simple and pleasant for-ez child-pl take.past-pp to image.pl-indef bitter , mysterious and complex for-ez adult-pl .
... from simple and pleasant illustration for children to bitter, mysterious and complex images for adults.
dep(از, گرفته)
dep(from, take.past-pp)

det: Determiner
A determiner is the relation between a nominal head and its determiner.

چنانچه این پژوهشگران درست حدس زده باشند...
čenānče in pažuhešgar-ān dorost hads zad-e bāš-and ...
if this researcher-pl correct guess hit.past-pp sub.be.pres-3pl ...

6Circumposition implies a position where a prepositional phrase is surrounded by prepositions, more specifically, it has a preposition and a postposition.

If these researchers made a correct guess ...
det(پژوهشگران, این)
det(researcher-pl, this)

dobj: Direct Object
The direct object of a verb is the nominal which is the (accusative) object of the verb.

آدولف بورن دنیای واقعیت را با تخیل و رویا پیوند می‌زند.
Ādolf Born donyā-ye vāqeiyat rā bā taxayol va royā peyvand mi-zan-ad .
Adolf Born world-ez reality rā with imagination and dream link cont-hit.pres-3sg .
Adolf Born links the world of reality with imagination and dream.
dobj(می‌زند, دنیای)
dobj(cont-hit.pres-3sg, world)

mark: Marker
A marker of an adverbial clause modifier (advcl) is the word introducing it. It will be a subordinating conjunction different from که (that), e.g., the multi-word expressions وقتی که /vaqti ke/ (gloss: when that, translation: when), در حالی که /dar hāl-i ke/ (gloss: in state-indef that, translation: while), اگر که /agar ke/ (gloss: if that, translation: if), etc.

احمد در حالی که یقهٔ محمود را چسبیده...
Ahmad dar hāl-i ke yaqe-ye Mahmud rā časbid-e ...
Ahmad in state-indef that collar-ez Mahmoud rā attach.past-pp ...
While Ahmad attached Mahmoud's collar ...
mark(چسبیده, در [حالی که])
mark(attach.past-pp, in [state-indef that])

چنانچه این پژوهشگران درست حدس زده باشند...
čenānče in pažuhešgar-ān dorost hads zad-e bāš-and ...
if this researcher-pl correct guess hit.past-pp sub.be.pres-3pl ...
If these researchers made a correct guess ...
mark(زده, چنانچه)
mark(hit.past-pp, if)

mwe: Multi-Word Expression
The multi-word expression (modifier) relation is used for certain multi-word expressions that behave like a single function word, in particular conjunctions and prepositions. Examples include: با وجود این که /bā vojud-e in ke/ (gloss: with existence-ez this that, translation: despite), با توجه به این که /bā tavajjoh be in ke/ (gloss: with attention to this that, translation: with respect to), علاوه بر این که /alāve bar in ke/ (gloss: addition to this that, translation: in addition to), و همچنین /va hamčenin/ (gloss: and also, translation: and also), به جای این که /be jā-ye in ke/ (gloss: to place-ez this that, translation: instead of), به خاطر این که /be xāter-e in ke/ (gloss: to sake-ez this that, translation: because of), از قبیل /az qabil-e/ (gloss: of type-ez, translation: such as), و اما /va ammā/ (gloss: and but, translation: but), به دلیل /be dalil-e/ (gloss: to reason-ez, translation: because of), به علت /be ellat-e/ (gloss: to cause-ez, translation: because of), به خاطر /be xāter-e/ (gloss: to sake-ez, translation: for the sake of), and تا این که /tā in ke/ (gloss: than this that, translation: rather than). The first token of a multi-word expression is treated as the head of the expression, and subsequent elements are attached in a chain, with each word depending on the immediately preceding one via the mwe relation.

چنین تصور می‌شود که این لرزش به علت جاذبهٔ سیاره‌ای در مدار بوده است.
čenin tasavvor mi-šav-ad ke in larzeš be ellat-e jāzebe-ye sayyāre-i dar madār bud-e ast .
so thought cont-become.pres-3sg that this vibration to reason-ez gravity-ez planet-indef in orbit be.past-pp be.pres.3sg .
It is thought that this vibration has been due to the gravity of a planet in orbit / it is thought that this vibration has been caused by the gravity of a planet in orbit.
mwe(به, علت)
mwe(to, reason)

neg: Negation Modifier
The negation modifier is the relation between a negation word and the word it modifies.

طرح تحقیق نه معضلات اجتماعی بلکه فرهنگی بوده.
tarh-e tahqiq na mo'zal-āt-e ejtemā'i balke farhangi bud-e .
project-ez research no issue-pl-ez social but cultural be.past-pp .

The research project has not been a social issue but a cultural problem.
neg(معضلات, نه)
neg(issue-pl, no)

nn: Noun Compound Modifier
A noun compound modifier of a nominal is any noun that serves to modify the head noun. In the UPDT, this relation is also used for compound names, with the first name as the head.

او در اسلام‌آباد با برنارد کوشنر، وزیر امور خارجهٔ فرانسه، ملاقات کرد.
u dar Eslāmābād bā Bernārd Kušner , vazir-e omur-e xāreje-ye farānse , molāqāt kard .
she/he in Islamabad with Bernard Kouchner , minister-ez affair.pl-ez foreign-ez France , meeting do.past.3sg .
She/he met Bernard Kouchner, the French foreign minister, in Islamabad.
nn(برنارد, کوشنر)
nn(Bernard, Kouchner)

npadvmod: NP as Adverbial Modifier
This relation captures various cases where something that is syntactically a noun phrase is used as an adverbial modifier in a sentence. These usages include: (i) a measure phrase, which is the relation between an adjective, adjective/adverb/prepositional modifier and the head of a measure phrase modifying it; (ii) extent phrases, which modify verbs but are not objects; (iii) financial constructions involving an adverbial noun phrase; (iv) floating reflexives; and (v) certain other absolute noun phrase constructions. A temporal modifier (tmod) is a subclass of npadvmod that is distinguished as a separate relation.

... 20 کالری بیشتر از یک استکان ماست به بدن انرژی می‌دهد...
... 20 kāleri bištar az yek estekān māst be badan enerži mi-dah-ad ...
... 20 calory more than a cup yogurt to body energy cont-give.pres-3sg ...
... gives 20 more calories of energy to the body than a cup of yogurt ...
npadvmod(بیشتر, کالری)
npadvmod(more, calory)

nsubj: Nominal Subject
A nominal subject is a noun phrase that is the syntactic subject of a clause. The governor of this relation might not always be a verb; when the verb is a copula, the root of the clause is the complement of the copula verb, which can be an adjective or noun. (When the complement is a prepositional phrase, the copula is taken as the root of the clause.)

چنانچه این پژوهشگران درست حدس زده باشند...
čenānče in pažuhešgar-ān dorost hads zad-e bāš-and ...
if this researcher-pl correct guess hit.past-pp sub.be.pres-3pl ...
If these researchers made a correct guess ...
nsubj(زده, پژوهشگران)
nsubj(hit.past-pp, researcher-pl)

nsubjpass: Passive Nominal Subject
A passive nominal subject is a noun phrase that is the syntactic subject of a passive clause.

اولین سیاره خارج از منظومهٔ شمسی دیده شد.
avvalin sayyāre xārej az manzume-ye šamsi did-e šod .
first planet outside of system-ez solar see.past-pp become.past.3sg .
The first planet outside the solar system was sighted.
nsubjpass(دیده, سیاره)
nsubjpass(see.past-pp, planet)

num: Numerical Structure
A numeric modifier of a noun is any number phrase that serves to modify the meaning of the noun.

سام 3 گوسفند می‌خورد.
Sām 3 gusfand mi-xor-ad .
Sam 3 sheep cont-eat.pres-3sg .
Sam eats 3 sheep.
num(گوسفند, 3)
num(sheep, 3)

number: Element of Compound Number
An element of a compound number is a part of a number phrase or currency amount.

او باید 466 میلیون دلار غرامت پرداخت کند.
u bāyad 466 miliun dolār qarāmat pardāxt kon-ad .
she/he should 466 million dollar compensation pay sub.do.pres-3sg .
She/he should pay $466 million in compensation.
number(دلار, میلیون)
number(dollar, million)

parataxis: Parataxis
The parataxis relation (from Greek for 'place side by side') is a relation between the main verb of a clause and other sentential elements, such as a sentential parenthetical, or a clause after a colon (:) or semicolon (;).

مادر: مدرسه‌ات دیر شد.
mādar : madrese-at dir šod .
mother : school-pc.2sg late become.past.3sg .
Mother: you are late for school.
parataxis(مادر, دیر)
parataxis(mother, late)

pobj: Object of a Preposition
The object of a preposition is the head of the noun phrase following the preposition. (The preposition may in turn be modifying a noun, verb, etc.)

او در ترکیه زندگی می‌کند...
u dar Torkiye zendegi mi-kon-ad ...
she/he in Turkey life cont-do.pres-3sg ...
She/he lives in Turkey.
pobj(در, ترکیه)
pobj(in, Turkey)

poss: Possession Modifier
The possession modifier relation holds between a noun and its possessive determiner, or a genitive complement. In Persian a noun is usually followed by a modifier or a genitive complement, with ezāfe marking on the head noun. The relation poss is used when the modifier is a noun, pronoun or infinitive,

except in the case of compound names, where the nn relation is used instead. (For adjectival and participial modifiers in ezāfe constructions, the amod relation is used.) In the case of lexicalized units without ezāfe, the relation is defined as mwe.7

دست بچه
dast-e bačče
hand-ez child
child's hand
poss(دست, بچه)
poss(hand-ez, child)

preconj: Preconjunct
A preconjunct is the relation between the head of a coordinated phrase and a word that appears at the beginning, bracketing a conjunction (such as either, both, neither in English).

... او چه در دنیای ریاضی و چه در دنیای نجوم تخصص دارد.
... u če dar donyā-ye riyāzi va če dar donyā-ye nojum taxassos dār-ad .
... she/he also in world-ez mathematics and also in world-ez astronomy expertise have.pres-3sg .
... She/he has expertise in the worlds of both mathematics and astronomy.
preconj(در, چه)
preconj(in, also)

predet: Predeterminer
A predeterminer is the relation between a noun and a word that precedes and modifies the meaning of its determiner.

تمام این سال‌ها...
tamām-e in sāl-hā ...
all-ez this year-pl ...
All of these years ...

7Poss is an unfortunate choice of name, since this relation covers much more than the narrow possessive relation. However, for the sake of conformance with STD for English, the label is retained rather than being renamed to genitive modifier (genmod), or even nominal modifier (nmod), which would be more appropriate.

predet(سال‌ها, تمام)
predet(year-pl, all)

prep: Prepositional Modifier
A prepositional modifier of a verb, adjective, or noun is any prepositional phrase that serves to modify the meaning of the verb, adjective, noun, or even another preposition.

او در ترکیه زندگی می‌کند...
u dar Torkiye zendegi mi-kon-ad ...
she/he in Turkey life cont-do.pres-3sg ...
She/he lives in Turkey.
prep(می‌کند, در)
prep(cont-do.pres-3sg, in)

prt: Phrasal Verb Particle
The phrasal verb particle relation holds between the verb and its particle.

به چه صورت در خواهد آمد.
be če surat dar xāh-ad āmad .
to what shape in will.fut-3sg come.past .
How will it be.
prt(آمد, در)
prt(come.past, in)

punct: Punctuation
This relation is used for any piece of punctuation in a clause.

اولین سیاره خارج از منظومهٔ شمسی دیده شد.
avvalin sayyāre xārej az manzume-ye šamsi did-e šod .
first planet outside of system-ez solar see.past-pp become.past.3sg .
The first planet outside the solar system was sighted.
punct(دیده, .)
punct(see.past-pp, .)

quantmod: Quantifier Phrase Modifier
A quantifier phrase modifier is an element modifying the head of a quantifier phrase. (These are modifiers in complex numeric quantifiers, not other types of 'quantification'.)

ساعت حدود ده دقیقه به زنگ را نشان می‌دهد.
sā'at hodud-e dah daqiqe be zang rā nešān mi-dah-ad .
clock about-ez ten minute to bell rā show cont-do.pres-3sg .
The clock shows about ten minutes before the break.
quantmod(ده, حدود)
quantmod(ten, about-ez)

rcmod: Relative Clause Modifier
A relative clause modifier of a noun is a relative clause modifying the noun. The relation points from the noun to the head of the relative clause, normally a verb.

او چیزهایی را که فقط در حیطهٔ تخیل او جاری است به تصویر می‌کشد.
u čiz-hā-i rā ke faqat dar hite-ye taxayol-e u jāri ast be tasvir mi-keš-ad .
she/he thing-pl-indef rā that only in scope-ez imagination-ez she/he running be.pres.3sg to illustration cont-draw.pres-3sg .
She/he only portrays things that lie within the scope of her/his imagination.
rcmod(چیزهایی, جاری)
rcmod(thing-pl-indef, running)

rel: Relative
A relative of a relative clause is the relative marker که /ke/ that introduces it (and which cannot be analyzed as a relative pronoun).

او چیزهایی را که فقط در حیطهٔ تخیل او جاری است به تصویر می‌کشد.
u čiz-hā-i rā ke faqat dar hite-ye taxayol-e u jāri ast be tasvir mi-keš-ad .
she/he thing-pl-indef rā that only in scope-ez imagination-ez she/he running be.pres.3sg to illustration cont-draw.pres-3sg .
She/he only portrays things that lie within the scope of her/his imagination.
rel(جاری, که)
rel(running, that)

root: Root
The grammatical relation root points to the root of the sentence. A fake node 'ROOT' is used as the governor. The ROOT node is indexed with '0', since the indexing of real words in the sentence starts at 1. The root of the sentence is normally a verb, but in the case of copula constructions it can be a noun, pronoun, adjective or adverb. The copula is taken as the root of the sentence only when its complement is a prepositional phrase (analyzed as prep).

بورن هیچ‌گاه ادعا نداشته است که خالق سبکی خاص است.
Born hič-gāh eddeā na-dāšt-e ast ke xāleq-e sabk-i xās ast .
Born never-time claim neg-have.past-pp be.pres.3sg that creator-ez style-indef particular be.pres.3sg .
Born never claimed to be the creator of a particular style.
root(ROOT, نداشته)
root(ROOT, neg-have.past-pp)

او یک هنرمند است.
u yek honarmand ast .
she/he an artist be.pres.3sg .
She/he is an artist.
root(ROOT, هنرمند)
root(ROOT, artist)

کار او فوق‌العاده است.
kār-e u foqolāde ast .
work-ez she/he outstanding be.pres.3sg .
Her/his work is outstanding.
root(ROOT, فوق‌العاده)
root(ROOT, outstanding)

در قسمت‌های اول، مدام جلوی آینه در حال صحبت کردن با خود است.
dar qesmat-hā-ye avval , modām jolo-ye āine dar hāl-e sohbat kardan bā xod ast .
in part-pl-ez first , constantly front-ez mirror in position-ez talk doing with self be.pres.3sg .

In the first parts, she/he is constantly talking to herself/himself in front of a mirror.
root(ROOT, است)
root(ROOT, be.pres.3sg)

tmod: Temporal Modifier
A temporal modifier of a verb, noun or adjective is a bare noun constituent that serves to modify the meaning of the constituent by specifying a time. (Other temporal modifiers are prepositional phrases, which are introduced as prep.)

خانم بوتو پنجشنبه گذشته در راولپندی کشته شد.
xānom-e Buto panjšanbe gozašte dar Rāvalpendi košt-e šod .
madam-ez Bhutto Thursday last in Rawalpindi kill.past-pp become.past.3sg .
Mrs. Bhutto was killed last Thursday in Rawalpindi.
tmod(کشته, پنجشنبه)
tmod(kill.past-pp, Thursday)

xcomp: Open Clause Complement
An open clause complement (xcomp) of a verb or adjective is a clause complement without its own subject, whose reference is determined by an external subject. These complements are always non-finite.

می‌ایستد آمادهٔ رفتن.
mi-ist-ad āmāde-ye raft-an .
cont-stand.pres-3sg ready-ez go.past-inf .
She/he stands ready to go.
xcomp(آمادهٔ, رفتن)
xcomp(ready-ez, go.past-inf)

5.4.2 New Relations While I have tried to keep the labels and construction set as close as possible to the original STD scheme, I have extended the scheme in order to include all syntactic relations that could not be covered by the primary scheme developed for English. Altogether I have added 10 new relations to describe various relations in light verb constructions (LVC), such as adjectival complement

in LVC acomp-lvc, direct object in LVC dobj-lvc, nominal subject in LVC nsubj-lvc, prepositional modifier in LVC prep-lvc; the accusative marker rā acc; object of comparative cpobj; comparative modifier cprep; topic dependent dep-top; vocative dependent dep-voc; and foreign words fw. Table 5.2 lists all atomic labels used in the syntactic annotation of the UPDT, with new relations in italics. The new relations are explained and discussed below.

1. Light Verb Constructions (LVC)
The light verb construction in Persian is a pervasive phenomenon and, as noted in Section 2.3.3, different preverbal parts of speech, such as nouns, adjectives, adverbs, or prepositions, can form complex predicates together with light verbs. However, the internal complements of light verb constructions do not represent the same syntactic structures as ordinary complements do. When analyzing these constructions, two extreme positions can be adopted: we can treat them as either opaque lexicalized units or as entirely transparent syntactic constructions. Neither of these options is quite adequate. We cannot treat LVCs as completely lexicalized units, for instance by using the mwe relation for multi-word expressions, since they are different from other multi-word expressions such as compound prepositions and conjunctions. In particular, other words such as modifiers may be placed between the LVC elements and the verb. Hence, the structure within LVCs is not completely fixed and solid, unlike the fixed structure of multi-word expressions.
On the other hand, we cannot treat LVCs as transparent syntactic constructions either, due to the fact that as soon as preverbal parts of speech get semantically involved with light verbs (or with certain types of main verbs that operate as light verbs in abstract semantic relations) they lose their internal structures. The complex predicate چشم خوردن /češm xord-an/ (gloss: eye eat.past-inf, translation: losing fortune8 or to be put under a spell), as in سام چشم خورد /Sām češm xord/ (gloss: Sam eye eat.past.3sg, translation: Sam lost fortune), is a typical LVC that can never be treated as syntactically transparent, since the sentence would lose its conceptual content. The word چشم /češm/ (eye) can never be treated as the (direct) object of the sentence, since the word has already lost its lexical sense and its internal structure in combination with the verb خوردن9 /xord-an/ (gloss: eat.past-inf, translation: to eat) when building the complex predicate in the sense of losing fortune.

An additional argument against treating LVCs as syntactically transparent is that LVCs consisting of a verb and an object can themselves take direct objects, and certain elements cannot move elsewhere in the sentence but need to stand right before the verb. For instance, the complex predicate تحویل دادن /tahvil dād-an/ (gloss: delivery give.past-inf, translation: to deliver) in او کتاب را تحویل داد /u ketāb rā tahvil dād/ (gloss: she/he book rā delivery give.past.3sg, translation: She/he delivered the book) is an example of an LVC (dobj-lvc) placed in the vicinity of the direct object book. The object delivery, which is in a light verb construction relation to the verb داد /dād/ (gave), cannot move around or be placed elsewhere in the sentence. Only modifiers may be placed between the dobj-lvc and the verb. For example, the word library may be used in the sentence without involving any preposition, instead being linked to the dobj-lvc by an ezāfe construction, as in:

او کتاب را تحویل کتابخانه داد.
u ketāb rā tahvil-e ketāb.xāne dād .
she/he book rā delivery-ez book.house give.past.3sg .
She/he delivered the book to the library.

8 The expression refers to a traditional belief in many parts of the world including Iran, and is used of a person who has lost her/his fortune or been put under a spell based on negative energy generated by envy or the evil eye. To prevent receiving such energy people usually knock on wood.
9 The occurrence of the main verb خوردن /xord-an/ (gloss: eat.past-inf, translation: to eat) as a light verb in Persian is quite common. More examples of similar cases are: زمین خوردن /zamin xord-an/ (gloss: ground eat.past-inf, translation: to fall down), شکست خوردن /šekast xord-an/ (gloss: defeat eat.past-inf, translation: to lose, to fail), غصه خوردن /qosse xord-an/ (gloss: sorrow eat.past-inf, translation: to sorrow), قسم خوردن /qasam xord-an/ (gloss: swear/oath eat.past-inf, translation: to swear/oath), and گول خوردن /gul xord-an/ (gloss: deception eat.past-inf, translation: to be deceived). Moreover, the verb خوردن /xord-an/ (to eat) in the case of چشم خوردن /češm xord-an/ (gloss: eye eat.past-inf, translation: losing fortune), گول خوردن /gul xord-an/ (gloss: deception eat.past-inf, translation: to be deceived), and شکست خوردن /šekast xord-an/ (gloss: defeat eat.past-inf, translation: to lose, to fail) forms intransitive constructions, and as soon as the concepts turn into transitive constructions, the light verbs زدن /zad-an/ (to hit) and دادن /dād-an/ (to give) are used, as in چشم زدن /češm zad-an/ (gloss: eye hit.past-inf, translation: to give somebody the evil eye), گول زدن /gul zad-an/ (gloss: deception hit.past-inf, translation: to deceive), and شکست دادن /šekast dād-an/ (gloss: defeat give.past-inf, translation: to beat).

As shown in Figure 5.2, the word library is attached as a dependent of the dobj-lvc relation and not of the light verb داد /dād/ (gave). Therefore, I chose a middle ground that indicates both the internal structure of the LVC and its special status as a complex predicate. In other words, I handled LVCs as a separate category in the treebank by specifying four different relations in light verb constructions, which are presented below.

a) acomp-lvc: Adjectival Complement in LVC An adjectival complement in a light verb construction is an adjective sorrow eat.past-inf, translation: to sorrow), àXPñ k Õ愯 /qasam xord-an/ (gloss: swear/oath eat.past-inf, translation: to swear/oath), àXPñ k Èñà /gul xord-an/ (gloss: deception eat.past- inf, translation: to be deceived). Moreover, the verb àXPñ k /xord-an/ (to eat) in the case of àXPñ k Õæ„k /cešmˇ xord-an/ (gloss: eye eat.past-inf, translation: losing fortune), àXPñ k Èñà /gul xord-an/ (gloss: deception eat.past-inf, translation: to be deceived), and àXPñ k I‚º ƒ /šekast xord-an/ (gloss: defeat eat.past-inf, translation: to lose, to fail) are intransitive con- structions and as soon as the concepts turn into transitive constructions, the light verbs àX P /zad-an/ (to hit) and àX@X /dad-an/¯ (to give) will be used, as in àX P Õæ„k /cešmˇ zad-an/ (gloss: eye hit.past-inf, translation: to give somebody the evil eye) and àX P Èñà /gul zad-an/ (gloss: deception hit.past-inf, translation: to deceive), and àX@X I‚º ƒ /šekast dad-an/¯ (gloss: defeat give.past-inf, translation: to beat).

او کتاب را تحویل کتابخانه داد .

Figure 5.2. Syntactic annotation of a Persian sentence. Gloss: she/he book rā delivery-ez book.house give.past.3sg. Translation: She/he delivered the book to the library.

For instance, the adjective مجسم /mojasam/ (incarnate) in the following example functions as the adjectival complement of the complex predicate مجسم کردن /mojasam kard-an/ (gloss: incarnate do.past-inf, translation: to visualize).

حضور او را در ذهن خود مجسم می‌کند .
hozur-e u rā dar zehn-e xod mojasam mi-kon-ad .
presence-ez she/he rā in mind-ez own incarnate cont-do.pres-3sg .
She/he visualizes her/his presence in her/his mind.

acomp-lvc(می‌کند, مجسم)
acomp-lvc(cont-do.pres-3sg, incarnate)

b) dobj-lvc: Direct Object in LVC
A direct object in a light verb construction is a noun that forms a complex lexical predicate together with the verb. Thus, dobj-lvc denotes a direct object functioning as the nominal part of the complex predicate. In the following example, the complex predicate پخش کردند /paxš kard-and/ (gloss: broadcast do.past-3pl, translation: they broadcast) consists of the light verb کردند /kard-and/ (gloss: do.past-3pl, translation: did) and the nominal part پخش /paxš/ (broadcast).

برنامه را پخش کردند .
barnāme rā paxš kard-and .
program rā broadcast do.past-3pl .
They broadcast the program.

dobj-lvc(کردند, پخش)
dobj-lvc(do.past-3pl, broadcast)

c) nsubj-lvc: Nominal Subject in LVC
A nominal subject in a light verb construction is a noun that forms a complex lexical predicate together with the verb. The relations nsubj-lvc and dobj-lvc are similar yet distinct, depending on the verb: intransitive verbs take nsubj-lvc, whereas transitive verbs take dobj-lvc. In the following example, broadcast functions as the nominal subject of the intransitive verb شدن /šod-an/ (gloss: become.past-inf, translation: to become). پخش شدن /paxš šod-an/ (gloss: broadcast become.past-inf, translation: to be broadcast) is the intransitive form of the verb پخش کردن /paxš kard-an/ (gloss: broadcast do.past-inf, translation: to broadcast).

برنامه پخش شد .
barnāme paxš šod .
program broadcast become.past.3sg .
The program was broadcast.

nsubj-lvc(شد, پخش)
nsubj-lvc(become.past.3sg, broadcast)

d) prep-lvc: Prepositional Modifier in LVC
A prepositional modifier in a light verb construction is a preposition/prepositional phrase that forms a complex lexical predicate together with the verb. In the following example, the preposition به /be/ (to) with its object دست /dast/ (hand) functions as the prepositional modifier to the verb آوردن /āvard-an/ (gloss: bring.past-inf, translation: to bring) and forms the complex predicate به دست آوردن /be dast āvard-an/ (gloss: to hand bring.past-inf, translation: to achieve/to gain).

او پیروزی خود را با تخریب به دست آورد .
u piruzi-ye xod rā bā taxrib be dast āvard .
she/he victory-ez self rā with destroying to hand bring.past.3sg .
She/he achieved her/his victory by destroying.

prep-lvc(آورد, به)
prep-lvc(bring.past.3sg, to)

2. Dislocated Elements
Like many other languages, Persian uses a number of dislocated constituents. These can be of either the pre- or post-dislocation type: pre-dislocated elements are preposed topics, and post-dislocated elements are postposed topics.

dep-top: Topic Dependent
The topic dependent relation is used for a fronted (pre-dislocated) element that introduces the topic of a sentence. It is often anaphorically related to the subject or object of the main clause. In the following example, people functions as a topic dependent, anaphorically linked to the nominal subject their mentality.

مردم ذهنشان تغییر کرده است .
mardom zehn-ešān taqyir kard-e ast .
people mentality-pc.3pl change do.past-pp be.pres.3sg .
People, their mentality has changed.

dep-top(کرده, مردم)
dep-top(do.past-pp, people)

3. Vocative Vocative is used to directly address a listener. Vocative utterances are most frequently used in Persian with proper nouns at the beginning or end of a sentence.

dep-voc: Vocative Dependent
The vocative dependent relation is used for a vocative element, usually a proper name or pronoun. Vocative dependents in Persian can be placed either as preposed or postposed topics. In the following example, sir functions as the vocative dependent of the sentence.

آقا ما نمی‌آییم .

āqā mā ne-mi-āy-im .
sir we neg-cont-come.pres-1pl .
Sir we are not coming.

dep-voc(نمی‌آییم, آقا)
dep-voc(neg-cont-come.pres-1pl, sir)

Sir may also be positioned at the end of the sentence, as in we are not coming, sir.

4. Comparative Constructions
Persian has a number of preposition-like elements, such as مثل /mesl/, مانند /mānand/, چون /čun/, همچون /hamčun/, and نظیر /nazir/, all meaning like, as, or similar to, that appear in similes. A simile is employed to make a comparison or to describe a metaphor. Lazard (1992) calls these elements similitudes and remarks that similes are used in adverbial expressions and are introduced by prepositions. However, similitudes cannot function entirely as prepositions or adverbs. Different similitudes may independently represent different categories; for instance, مثل /mesl/ (like) can be treated as a preposition; مانند /mānand/, چون /čun/, or همچون /hamčun/ (similar, like) as an adjective; and نظیر /nazir/ (match, like) as a noun. Hence, these elements are analyzed in the UPC as ADV_COMP (adverb of comparison) and are further distinguished in the UPDT to describe simile constructions. The constructions are defined as follows.

a) cprep: Comparative Modifier
The comparative modifier relation is used for comparative constructions that resemble prepositional phrases but are introduced by conjunctions or adverbs and can be analyzed as elliptical comparative clauses (see English like a child in he cries like a child).

او مثل یک بچه گریه می‌کند .
u mesl-e yek bačče gerye mi-kon-ad .
she/he like-ez one child cry cont-do.pres-3sg .
She/he cries like a child.

cprep(می‌کند, مثل)
cprep(cont-do.pres-3sg, like)

b) cpobj: Object of Comparative
The object of a comparative is the complement of a preposition-like conjunction or adverb introducing a comparative modifier (see English a child in he cries like a child).

او مثل یک بچه گریه می‌کند .
u mesl-e yek bačče gerye mi-kon-ad .
she/he like-ez one child cry cont-do.pres-3sg .
She/he cries like a child.

cpobj(مثل, بچه)
cpobj(like, child)

5. Foreign Words
Complete phrases or sentences quoted in a language other than Persian are not given an internal syntactic analysis. Instead, all the words are connected in a chain with the first word as the head, and all relations are marked as fw. The incoming arc to the head of the chain, however, is assigned a regular syntactic relation reflecting its role in the larger sentence.

u goft : bevejhe yastasqi alqemām ...
she/he say.past.3sg : face praying clouds ...
She/he requested: rain from clouds with the blessings of his face ...

fw(bevejhe, yastasqi)
fw(face, praying)

6. Accusative Marker
An accusative marker is a clitic highlighting the direct object. When the direct object is definite, it is always followed by rā. On the other hand, when the direct object is indefinite but individuated, it may or may not be followed by rā, depending on certain conditions (Lazard, 1992). The accusative marker rā is analyzed with the relation acc and is found in Figure 5.3 marking the direct object خصوصیات /xosusiy-āt-e/ (gloss: feature-pl-ez, translation: characteristics).

... خصوصیات خاص خود را دارند و ...
... xosusiy-āt-e xās-e xod rā dār-and va ...
... feature-pl-ez specific-ez self rā have.pres-3pl and ...
... they have their own special characteristics and ...

acc(خصوصیات, را)
acc(feature-pl-ez, rā)

Table 5.2. Syntactic relations in UPDT with new relations in italics.

Category     Description
acc          Accusative marker
acomp        Adjectival complement
acomp-lvc    Adjectival complement in light verb construction
advcl        Adverbial clause modifier
advmod       Adverbial modifier
amod         Adjectival modifier
appos        Appositional modifier
aux          Auxiliary
auxpass      Passive auxiliary
cc           Coordination
ccomp        Clausal complement
complm       Complementizer
conj         Conjunct
cop          Copula
cpobj        Object of comparative
cprep        Comparative modifier
dep          Dependent
dep-top      Topic dependent
dep-voc      Vocative dependent
det          Determiner
dobj         Direct object
dobj-lvc     Direct object in light verb construction
fw           Foreign word
mark         Marker
mwe          Multi-word expression
neg          Negation modifier
nn           Noun compound modifier
npadvmod     Nominal adverbial modifier
nsubj        Nominal subject
nsubj-lvc    Nominal subject in light verb construction
nsubjpass    Passive nominal subject
num          Numeric modifier
number       Element of compound number
parataxis    Parataxis
pobj         Object of a preposition
poss         Possession modifier
preconj      Preconjunct
predet       Predeterminer
prep         Prepositional modifier
prep-lvc     Prepositional modifier in light verb construction
prt          Phrasal verb particle
punct        Punctuation
quantmod     Quantifier phrase modifier
rcmod        Relative clause modifier
rel          Relative
root         Root
tmod         Temporal modifier
xcomp        Open clausal complement


انسانها و حیوانات بورن گرچه از واقعیات تأثیر میگیرند، خصوصیات خاص خود را دارند و در نگاهی کلی همه از یک جنساند.

Figure 5.3. Syntactic annotation for a Persian sentence with English gloss. To make the figure more readable, glosses have been simplified as follows: humans = human-pl, animals-e = animal-pl-ez, facts = fact-pl, take = cont-take.pres-3pl, features-e = feature-pl-ez, specific-e = specific-ez, own = self, have = have.pres-3pl, look-a = look-indef, kind-are = kind-be.pres-3pl. Gloss: human-pl and animal-pl-ez Born although from fact-pl effect cont-take.pres-3pl, feature-pl-ez specific-ez self rā have.pres-3pl and in look-indef general all of one kind-be.pres.3pl. Translation: Although (Adolf) Born's humans and animals are affected by realities, they have their own special characteristics and in (a) general (look) all are of the same kind.

5.4.3 An Example Sentence Annotated with STD
Figure 5.3 shows the dependency annotation for a sentence from UPDT about the Czech artist Adolf Born, with English glosses. The sentence consists of the following subordinate clause:

انسانها و حیوانات بورن گرچه از واقعیات تأثیر میگیرند
ensān-hā va heyvān-āt-e Born garče az vāqeiy-āt ta‘sir mi-gir-and
human-pl and animal-pl-ez Born although from fact-pl effect cont-take.pres-3pl
Although Born's humans and animals are affected by realities

and the main clause:

خصوصیات خاص خود را دارند و در نگاهی کلی همه از یک جنساند .
xosusiy-āt-e xās-e xod rā dār-and va dar negāh-i kolli hame az yek jens-and .
feature-pl-ez specific-ez self rā have.pres-3pl and in look-indef general all of one kind-be.pres.3pl .
they have their own special characteristics and in (a) general (look) all are of the same kind.

The subordinate clause is an adverbial clause with the head cont-take.pres-3pl, marked by the label advcl and governing the nominal subject human-pl and animal-pl-ez Born, the subordinating conjunction although, the prepositional modifier from followed by the prepositional object fact-pl, and the preverbal noun effect in light verb construction with cont-take.pres-3pl. The nominal subjects human-pl and animal-pl-ez are coordinated and linked with an ezāfe construction to their possessive modifier Born. The main clause is rooted at the verb have.pres-3pl, which governs an implied subject,10 the direct object feature-pl-ez specific-ez self rā, the coordinating conjunction and, and the coordinated verb phrase in look-indef general all of one kind-be.pres.3pl. The direct object is headed by feature-pl-ez, which is linked by an ezāfe construction to its adjectival modifier specific-ez and further to its genitive complement self. The direct object further contains the accusative marker rā. The coordinated verb phrase kind-be.pres.3pl governs the prepositional modifier in look-indef general, the nominal subject all, and the second prepositional modifier of. The first prepositional modifier is rooted at the preposition in, linked to its object look-indef, which is modified by the adjectival modifier general.

The second prepositional modifier, of, has its object, kind, in the form of a complex element with the attached copula clitic /-and/ (be.pres.3pl), modified by the numeric modifier num. Thus the coordinated verb kind-be.pres.3pl has received the complex label conj\pobj. In other words, the conj (conjunct) is itself a clitic on a pobj (prepositional object) element. Since I gave priority to the verb as the most important part of the syntactic structure, and the verb is attached to the prepositional object, the prepositional object, which should actually be under the prep, ends up higher in the structure.

10 The subject is absent (pro-drop) but the information is given by the verb through the attached personal ending /-and/ (3pl).

5.5 Complex Relations
As noted in Section 3.2.2, in developing the UPC, I made a special decision concerning the handling of different types of clitics (pronominal and copula clitics), as they were written in various forms in the corpus: they were sometimes segmented and sometimes unsegmented from the head words. Manually separating clitics from the head words consistently in a corpus as large as the Bijankhan Corpus was impossible within the project time frame. On the other hand, handling such cases automatically was also impossible, since such a process could result in many incorrect conversions by affecting orthographically similar words/endings with different part-of-speech categories. For example, the word ریاست may represent /riāsat/ (presidency or generalship) or the compound /riā-st/ (with a small variation in pronunciation that is unmarked in texts, as short vowels are not transcribed) (gloss: duplicity/hypocrisy-be.pres.3sg, translation: is hypocrisy). Thus, automatically segmenting the copula clitic ست /-st/ (is) from the word ریا /riā/ (duplicity/hypocrisy) in the corpus could undoubtedly affect the homograph noun ریاست /riāsat/ (presidency or generalship).
Furthermore, automatic conversion could affect words that are not exact homographs but share the same endings. For instance, segmenting the copula clitic (for a list of copula clitics see Table 4.2) /-i/ (be.pres.2sg) in words such as خسته‌ای /xaste-i/ (gloss: tired-be.pres.2sg, translation: you are tired or are you tired) may further affect other words with similar endings, since /-i/ can also serve as a suffix to form adjectives, as in هسته‌ای /haste-i/ (gloss: core-i/nucleus-i, translation: nuclear). Hence, to avoid introducing such errors into the corpus, I decided not to separate clitics from the head words, and to analyze them with special labels at the syntactic level instead.
In the treatment of complex unsegmented word forms, I use complex labels where the first label indicates the main syntactic function, and subsequent labels mark other functions carried by elements incorporated into the word form. The additional functions are listed in the order in which they occur and are prefixed with a backslash (\) if they precede the segment carrying the main function and a forward slash (/) if they follow it. Thus, the label poss/pc is assigned to a word that has the main function poss and an additional (clitic) pc element.

By contrast, the label ccomp\poss is used for (the head of) a clausal complement which is itself a clitic on a poss element. Figure 5.3 shows the unsegmented copula clitic /-and/ (be.pres.3pl) together with the word جنس /jens/ (kind) in جنساند /jens-and/ (gloss: kind-be.pres.3pl, translation: are (of one) kind), analyzed as conj\pobj (for the annotation key, see Section 5.4.3). In Table 5.2, I only list atomic labels. A complete list of all simple and complex labels with frequency information can be found in Appendix A.
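To make the label grammar concrete, the following is a minimal sketch in Python of how such a complex label can be decomposed into its main function and the additional functions carried by incorporated segments. The code is my own illustration, not part of the thesis toolchain, and the function name is hypothetical.

```python
import re

def split_complex_label(label):
    """Decompose a complex UPDT label such as 'poss/pc' or 'conj\\pobj'.

    The first part names the main syntactic function; each following part
    is an additional function carried by an incorporated segment, prefixed
    with '\\' if that segment precedes the one carrying the main function
    and with '/' if it follows it (see Section 5.5).
    """
    parts = re.split(r'([\\/])', label)
    main, preceding, following = parts[0], [], []
    for sep, func in zip(parts[1::2], parts[2::2]):
        (preceding if sep == '\\' else following).append(func)
    return main, preceding, following

# 'conj\pobj': a conjunct that is itself a clitic on a pobj element
print(split_complex_label('conj\\pobj'))  # ('conj', ['pobj'], [])
# 'poss/pc': a poss head followed by an attached pronominal clitic
print(split_complex_label('poss/pc'))     # ('poss', [], ['pc'])
```

A decomposition of this kind also makes it easy to recover the atomic label inventory of Table 5.2 from the full set of simple and complex labels.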

5.6 Unused Relations
Some dependency relations in the original STD scheme have been excluded from the Persian STD, since the corresponding relations either do not exist or are not applicable to Persian. For instance, I have not found any instances of the dependency relations indirect object (iobj), agent (agent), and prepositional complement (pcomp): indirect objects, agents, and prepositional complements are always realized as prepositional phrases in Persian, so the relations prepositional modifier (prep) and prepositional object (pobj) are sufficient. Furthermore, I have not found any instances of the dependency relation abbreviation modifier (abbrev), as this function is covered by the appositional modifier (appos) instead; attributive (attr), since the complement of the copula verb is defined as the head; purpose clause modifier (purpcl), as the relation can be defined by the adverbial clause modifier (advcl); or clausal subject (csubj) and clausal passive subject (csubjpass), since Persian has no clausal subject or clausal passive subject and instead uses a construction with a noun or pronoun plus a relative clause, for which the relations rel and rcmod can readily be used. As there is no genitive modifier in Persian and the ezāfe construction is used instead throughout the language, I did not use the relation possessive modifier (possessive) either. Other relations that do not exist in Persian and are excluded from the extended version of Persian STD are expletive (expl), infinitival modifier (infmod), and participial modifier (partmod).11 The two latter relations correspond to the relations prepositional modifier (prep) and prepositional object (pobj) or the ezāfe construction in Persian.

11 As noted earlier, I have considered the relations based on the Stanford Typed Dependencies manual from 2008. The relations were revised and changed in 2013. Therefore some of the excluded relations are not found in the new version of the manual. These are abbreviation modifier (abbrev), attributive (attr), infinitival modifier (infmod), participial modifier (partmod), and purpose clause modifier (purpcl).

5.7 Comparison with Other Treebanks for Persian
There are currently three treebanks available for Persian: the HPSG-based PerTreeBank12 (Ghayoomi, 2012), the UPDT (Seraji et al., 2012b), and the PerDT (Rasooli et al., 2013). The first two treebanks contain texts from the same domain; in other words, they both share the same corpus data, namely the Bijankhan Corpus. However, the treebanks differ in size and annotation style. The development of PerTreeBank ended with 1012 sentences (see 2.4.2); the treebank is thus considerably smaller than the other treebanks, and it lacks annotation guidelines. In the following subsections I therefore compare the UPDT only with the PerDT, because the PerDT is the largest treebank and provides better-documented guidelines for each syntactic relation.

5.7.1 Data and Format
UPDT and PerDT are two syntactically annotated corpora of contemporary Persian based on dependency structure. UPDT consists of 6,000 sentences and 151,671 tokens, while PerDT is larger, containing nearly 30,000 sentences and 498,081 tokens. UPDT uses 31 tags for encoding the parts of speech and provides no lemma information, while PerDT includes 32 part-of-speech tags and information about lemmas. The number of dependency relations also varies between the two treebanks: UPDT comprises in total 96 dependency labels (of which 48 are basic and the remaining 48 are complex), while PerDT has a total of 43 relations. The data in UPDT is taken from the Bijankhan Corpus with a large variety of genres (see 3.1), while the data in PerDT is specifically selected based on different verb types (see 2.4.2). In other words, the data in PerDT contains only isolated sentences from the Web. The selection of sentences is based on different verb types, and in order to cover different types of Persian verbs, a valency lexicon is employed. Although PerDT's special data selection method gives good coverage of different verb types, the sentences do not appear in a coherent discursive order, as sentences normally do in a text. This can affect the discourse-level features that normally hold across a running text, such as anaphoric elements, implications, presuppositions, and the natural tense structure. Moreover, since rare verbs occur less often than frequent ones, there is an uneven distribution of different verb types in the treebank. Simply put, the PerDT has prioritized including almost all Persian verbs in the data over including different variations of genre, whereas the UPDT has aimed to cover a wide range of domains and genres to achieve robustness. Statistical systems that are trained on annotated data with limited genre (domain-specific) often suffer performance drops when applied

12The treebank has recently been automatically converted to dependency structure (Ghayoomi and Kuhn, 2014).

to texts containing domain variations (Hogan et al., 2008; Yi Zhang and Wang, 2009).
The treebanks do not use the same character encodings. Characters and digits in the UPDT consistently use Persian style and Persian Unicode encodings, due to the conversion of the Bijankhan Corpus into the UPC (cf. 3.2.1). In the PerDT, however, characters shift between Persian and Arabic style, and digits vary between Persian and Western Unicode characters.
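As an illustration of what such encoding harmonization involves, here is a minimal sketch of character-level normalization of the kind performed in the UPC conversion. This is my own illustration, not the PrePer code, and the real normalizer covers many more cases than the two letters and the digits shown here.

```python
# Map the most common Arabic-style letters to their Persian counterparts.
ARABIC_TO_PERSIAN = {
    '\u064A': '\u06CC',  # ARABIC LETTER YEH  -> FARSI YEH
    '\u0643': '\u06A9',  # ARABIC LETTER KAF  -> KEHEH
}
# Map Western digits 0-9 to Extended Arabic-Indic (Persian) digits.
DIGITS = {chr(ord('0') + i): chr(0x06F0 + i) for i in range(10)}

def normalize(text):
    """Translate Arabic-style characters and Western digits to Persian."""
    table = str.maketrans({**ARABIC_TO_PERSIAN, **DIGITS})
    return text.translate(table)

print(normalize('كتاب 12'))  # -> 'کتاب ۱۲'
```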

5.7.2 Tokenization
The treebank data further differs in some aspects of tokenization. For instance, in the UPDT, various types of clitics (such as pronominal clitics, copula clitics, and personal endings) are all treated consistently and are not separated from the head words, whilst clitics in the PerDT are treated differently depending on the type of clitic: personal endings are not separated from head words but are written with a space, whereas pronominal clitics and copula clitics are separated from head words and processed as separate tokens. This means that in order to apply the PerDT to new text, the text needs to undergo the same segmentation as the PerDT in order to be compatible and match the treebank data; otherwise there is no guarantee that the same tokens, with different segmentation, will receive the correct arcs. On the other hand, reproducing similar segmentation in new text requires a powerful tokenization tool that can identify homograph tokens with various senses (and pronunciations), such as مردم /mard-am/ with the pronominal clitic /-am/, meaning my husband; مردم /mard-am/ with the copula clitic /-am/, meaning I am a/the man; مردم /mord-am/13 with the personal ending /-am/, meaning I died; and finally مردم /mardom/ without any clitics or personal endings, meaning people. To the best of my knowledge, no such tokenization tool is available. As mentioned in Section 2.3.1, clitics are written differently in different texts due to the lack of standardization in Persian orthography. Hence, we would need a normalizer that can take care of these inconsistencies and automatically identify orthographically similar instances (homograph tokens) with various morphological categories without producing incorrect conversions. This is a difficult task considering the morphological ambiguities in the language, and I knew that no such normalizer was (or is) available that can perform all these tasks successfully. Therefore, in order to avoid having to face these issues when analyzing new text, I decided not to separate clitics from head words. Instead I made sure that the normalizer PrePer merges clitics with their hosts in cases where they have been separated by whitespace, so that all clitics are treated consistently. That was the best compromise I could make, as such elements (clitics unseparated from their head words) are easily reproducible with a simple automatic tokenizer.

13 Note that short vowels are not written in Persian. Hence, the short vowel /a/ in /mard/ or /o/ in /mord/ is not marked in the text, and the words are typed identically, as /mrd/.

Moreover, in the UPDT, all types of auxiliaries, such as باید /bāyad/ (must), خواستن /xāstan/ (will), and توانستن /tavānestan/ (can), as well as copula verbs (both when used as auxiliaries and when connecting a predicate to its subject), are treated consistently as distinct tokens, whereas in the PerDT these verbs are handled differently. The auxiliary verbs باید /bāyad/ (must) and توانستن /tavānestan/ (can), as well as copula verbs in the form of predicates, are treated as separate items in the PerDT, just as in the UPDT, whereas the auxiliary verb خواستن /xāstan/ (will) and copula verbs, when used as auxiliaries together with main verbs, are treated as single verbs separated by a space character. It is noteworthy that the copula is often used as an auxiliary together with past participle verbs. However, when past participles function as adjectives (as many adjectives are derived from verbs), the copula does not function as an auxiliary but as a link connecting a predicate to its subject. In Persian these cases are generally treated as distinct units and are normally typed, if not misspelled, with an intervening space character. This also means that in order to apply the PerDT to new text, the text needs to be adapted to the same segmentation as the analysis in the PerDT. As mentioned in the previous paragraph, a tokenization tool is needed to distinguish, for instance, whether the copula functions as an auxiliary to a past participle verb or as a link connecting a past participle verb (functioning as an adjective) to its head. Once again, to my knowledge no such tool is available. Therefore, in order to avoid this problem in the UPDT, I have treated such tokens separately, both to make them consistent with the ordinary style of writing these tokens and because the elements are easily reproducible by an automatic tokenizer on new text.
The present/past continuous prefix می /mi-/ is also handled differently in the PerDT than in the UPDT. The prefix accompanies main verbs sometimes with a space and sometimes with no space (without being attached to the main verb, due to the existence of right-joining characters at the beginning of main verbs). This means that the PerDT is not in valid CoNLL format, as space characters are not allowed inside tokens in CoNLL. This orthographic issue was solved in the UPDT during the conversion of the Bijankhan Corpus to the UPC, when the normalizer PrePer was used (see 4.1.1).
Acronyms are further treated differently in the treebanks. In Persian texts, acronyms may appear as transcriptions with Persian letters, either with inserted dots such as بی.ام.دبلیو /bi.em.dābelyu/ (B.M.W), with inserted spaces such as بی ام دبلیو /bi em dābelyu/ (B M W), or with a combination of inserted dots and spaces such as بی. ام. دبلیو /bi. em. dābelyu/ (B. M. W). They may additionally appear with Western letters, often without any inserted dots or spaces. In the UPDT, acronyms are either treated as single tokens without any internal

dots or spaces, in terms of the above example as BMW, or they are divided into separate units as B, M, and W but linked together with the syntactic label mwe. The reason why acronyms are handled in two different ways in the UPDT is that they were typed dissimilarly in the Bijankhan Corpus: sometimes with internal spaces, sometimes with a combination of internal dots and spaces, and sometimes as single tokens with neither dots nor spaces. When I converted the Bijankhan Corpus into the UPC, I left untouched all those acronyms that were typed as single tokens (without internal dots or spaces). Since the space character is considered a token separator by the tokenizer, acronyms with a combination of inserted dots and spaces were split and treated as separate tokens in the UPC, but marked as mwe at the syntactic level. Thus, in the UPDT acronyms are represented in various forms as samples in the training data. It is worth noting that I could have retained acronyms with inserted dots had they existed in the corpus; the tokenizer SeTPer can easily take care of acronyms with inserted dots, because internal dots are not treated as sentence boundaries as long as there is no space character after them. In the PerDT, on the other hand, acronyms appear with dots and are treated as single tokens. Accordingly, the acronym بی.ام.دبلیو (B.M.W) is defined as a single token in the PerDT. It is unclear whether they standardized the acronyms in this way or whether the acronyms appeared in this style in their corpus.
Dates and measurements are additionally presented differently in the two treebanks. In Persian, dates are indicated by specifying year-month-day separated by slashes, e.g., 1917/2/11. Any sequence of digits and slashes in the UPDT is thus decomposed, and each element is separately defined as a token. Since the slash is considered a separator character by the tokenizer SeTPer, I split such sequences into numbers and slashes during the conversion of the Bijankhan Corpus into the UPC. In the PerDT, however, dates are presented as a sequence of year-month-day together with slashes, defined as single tokens such as 1917/2/11. For measurements, numbers are also accompanied by a slash, as in 0/01 millimeter.
As a general overview of the tokenization differences between the two treebanks, I can conclude that the PerDT has cleaned up its data by following a specific template, but may have problems with new text. The UPDT, on the other hand, has preserved the variations inherent in Persian texts, as they normally appear in general standard text, to support robust processing.
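To make two of the conventions above concrete, here is a minimal sketch of reattaching a detached continuous prefix and decomposing a slashed date. This is my own illustration, not the actual PrePer/SeTPer code; the function names are hypothetical, and the use of a zero-width non-joiner for the prefix is an assumption about the target orthography.

```python
import re

def attach_mi(text):
    """Reattach a detached continuous prefix 'mi-' (می) to the verb.

    Sketch of the kind of normalization PrePer performs so that a verb
    form becomes a single token with no internal space; the space is
    replaced here by a zero-width non-joiner (U+200C). A real normalizer
    must also avoid merging homographs of the standalone word می.
    """
    return re.sub(r'می (?=\S)', 'می\u200C', text)

def split_slashed(token):
    """Split a date such as '1917/2/11' into numbers and slashes,
    mirroring the UPDT convention that each element is its own token."""
    return [t for t in re.split(r'(/)', token) if t]

print(attach_mi('می خورد'))       # -> 'می‌خورد' (one token)
print(split_slashed('1917/2/11'))  # -> ['1917', '/', '2', '/', '11']
```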

5.7.3 Annotation Schemes
Some relations in the two treebanks function the same under different names; for instance, the subject relation is treated as a dependent under the name SBJ in the PerDT and nsubj in the UPDT. In addition, noun dependents in the PerDT are annotated either as NPREMOD (pre-modifier of noun), marking superlative adjectives (in Persian, superlative adjectives always precede nouns) and

demonstrative pronouns, or as NPOSTMOD (post-modifier of noun), labeling adjectives. The same relation structures appear in the UPDT, treating the noun as the head node and the noun dependents as dependents of the head. The dependent relation marking different types of adjectives in the UPDT is called amod (adjectival modifier), and for demonstrative pronouns it is called det (determiner). The two treebanks also share the same structure for the apposition relation when it serves to define an NP. The relation is called APP (apposition) in the PerDT and appos (appositional modifier) in the UPDT.
Furthermore, the ezāfe construction is treated more or less the same in the two treebanks. As noted in Section 2.3.3, ezāfe is a particle in Persian that indicates the semantic relation between joint elements within a noun phrase, adjective phrase, or prepositional phrase. In the PerDT these relations are classified as MOZ (ezafe dependent), NPOSTMOD (post-modifier of noun), and NEZ (ezafe complement of adjective). The relation MOZ defines the dependency between two nouns, NPOSTMOD the dependency between a noun and an adjective, and NEZ the dependency between an adjective and its nominal complement. In the UPDT, the ezāfe construction is likewise classified by specific dependency labels such as poss (possession modifier), amod (adjectival modifier), and nn (noun compound modifier). The relation MOZ in the PerDT functions like the relations poss and nn in the UPDT, NPOSTMOD like amod, and NEZ like poss.
However, despite the fact that both treebanks are based on dependency structure, they vary considerably in terms of annotation scheme (for the syntactic relations in the PerDT, see Table 2.17). One systematic structural divergence between the UPDT and the PerDT concerns headedness: following semantic principles, the UPDT mainly chooses words with heavier semantic content, i.e. content words, as heads in dependency relations. There are two exceptions where the UPDT chooses function words as heads: prepositions in prepositional phrases, and copula verbs that have a prepositional phrase as their complement. In the example کلید روی میز است /kelid ru-ye miz ast/ (gloss: key on-ez table be.pres.3sg, translation: the key is on the table), the copula verb است /ast/ (gloss: be.pres.3sg, translation: is) is the root of the sentence, since the complement is a prepositional phrase and a preposition cannot occupy the root position in a sentence. The reason that the UPDT generally preserves relations between content words is that direct links between content words and predicates are simple and transparent. Moreover, STD is oriented more towards deep syntax (or semantics) than towards surface syntax, where function words are to a larger extent treated as heads. The PerDT, on the other hand, does not follow the same principle in terms of head relations: in the PerDT the head relations shift freely between content and function words. Thus, in the PerDT auxiliaries can appear in the head position, whereas in the

UPDT, elements such as auxiliaries and complementizers always serve as dependents.
Subordinate constructions in Persian are often introduced by که /ke/ (that), which marks both complement clauses and relative clauses (see Section 2.3.3). To distinguish these constructions in the UPDT, /ke/ is marked as complm (complementizer) when signifying a complementizer and as rel (relative) when denoting a relativizer. Since complementizers and relativizers are function words, /ke/ stands as a dependent of the head node in both cases. The head nodes of such subordinate constructions are always content words marked as ccomp (clausal complement) and rcmod (relative clause modifier). In other words, the relations ccomp and rcmod are always assigned to a verb or a predicative complement. In the PerDT, on the other hand, the grammatical functions of /ke/ are categorized differently. When /ke/ functions as a complementizer it is marked by the relations VCL (complement clause of verb), AJUCL (adjunct clause), or ACL (complement clause of adjective), and when /ke/ functions as a relativizer, it is marked as NCL (clause of noun). In all cases /ke/ heads the subordinate clause, while the verb of the clause follows as a dependent of /ke/. The verb of the clause is in all cases marked by the relation PRD (predicate).
Direct objects are usually followed by the accusative marker rā (direct objects can also appear without the rā marker). In the UPDT the direct object is always marked as the head node (content word) and is introduced by dobj, the accusative object of the verb. Whether or not the accusative marker is present, the rā marker is always positioned as a dependent of the direct object, since the accusative marker, like the complementizer and the relativizer /ke/, is considered a function word. In the PerDT, however, the relation OBJ is inconsistently marked and shifts between a function word and a content word; that is, the relation OBJ shifts between the accusative marker rā and the direct object itself. When rā is absent from the sentence, the label OBJ denotes the content word, and the direct object is marked as the head node. Otherwise the relation OBJ refers to the function word, and the direct object marker rā is treated as the head.
An adverbial clause complement (advcl) is normally introduced in the UPDT by a subordinating conjunction labeled with the relation mark (marker), which is a dependent of the subordinate clause, since the label is used for function words, and function words in the UPDT never (apart from the two exceptions described at the beginning of this section) stand as parent nodes in dependency relations. In the PerDT, however, the relation is introduced as AJCONJ (conjunction of adjective) and is placed higher in the hierarchy.
Apart from the structural differences between the UPDT and the PerDT, the treebanks additionally differ in specificity. Different labels are chosen at various levels of specificity; in other words, sometimes the UPDT is more specific and sometimes the PerDT is. For instance, in the UPDT all the dependency

relations for prepositions are annotated as prep (prepositional modifier), but in the PerDT the relation is specified by a number of labels such as AJPP (prepositional complement of adjective), NPP (preposition of noun), PCONJ (conjunction of preposition), and VPP (prepositional complement of verb). The distinctions are described in more detail below:

• AJPP is used for a preposition that is the complement of an adjective, such as the preposition با /bā/ (with) in آشنا با عکاسی /āšenā bā akkāsi/ (familiar with photography). Thus, the preposition with is annotated with the relation AJPP as a dependent of the adjective familiar. In the UPDT the preposition with is also treated as a dependent of the adjective familiar, but is analyzed with the label prep.

• NPP is used for a preposition that is the complement of a noun, such as the preposition در /dar/ (in) in جدال در تاسوکی /jedāl dar Tāsuki/ (battle in Tasooki). The preposition in is annotated with the relation NPP as a dependent of the noun battle. In the UPDT, the preposition in, just as in the PerDT, is treated as a dependent of the noun battle, but is analyzed with the label prep.

• PCONJ is used for a coordinating conjunction that is the complement of a preposition, such as و /va/ (and) in در تهران و با ما /dar Tehrān va bā mā/ (in Tehran and with us). The coordination and is annotated as the dependent of the preposition in, but governs the preposition with. The prepositions in and with are analyzed as ADV (adverb) and POSDEP (post-dependent) respectively. In the UPDT, the coordinating conjunction, as in the PerDT, is annotated as a dependent of the preposition in (the first conjunct), but does not govern the preposition with. The coordination and as well as the preposition with (the second conjunct) are both governed by the first conjunct, namely the preposition in. Moreover, in the UPDT, the first conjunct in, the coordination and, and the second conjunct with are annotated with the labels prep (preposition), cc (coordination), and conj (conjunct) respectively.

• VPP is used for a preposition that is the complement of a verb, such as the preposition به /be/ (to) in من به مدرسه رفتم /man be madrese raft-am/ (gloss: I to school go.past-1sg, translation: I went to school). The preposition to is annotated with the relation VPP as a dependent of the verb went. In the UPDT, the preposition to, as in the PerDT, is treated as a dependent of the verb, but analyzed with the label prep.

A light verb construction is defined in the UPDT by the relations acomp-lvc, dobj-lvc, nsubj-lvc, and prep-lvc, and is placed in a subordinate relation to the verb. In the PerDT this relation is divided according to different classification grounds, namely clitic non-verbal element (ENC), light verb particle (LVP), non-verbal element (NVE), non-verbal element of infinitive (NE), and second object (OBJ2). This categorization is described in more detail below:

• ENC is used for the non-verbal element of a light verb construction containing a pronominal clitic, such as the preverbal خوشم /xoš-am/14 (gloss: good-pc.1sg, translation: I like) in از غذا خوشم آمد /az qazā xoš-am āmad/ (gloss: of food good-pc.1sg come.past.3sg, translation: I liked the food). The preverbal good-am is annotated with the relation ENC as a dependent of the verb came. This relation is also treated in the UPDT as a dependent of the verb, but is analyzed as an adjectival complement to the light verb came with the label acomp-lvc/pc.

• LVP is used for the non-verbal element پیدا /peydā/ (visible) of the compound verb پیدا کردن /peydā kard-an/ (gloss: find do.past-inf, translation: to find) when this compound verb is used to form new three-word compound verbs. The compound verb can thus appear with other elements (normally nouns) and form a three-word complex predicate in passive form, as in برنامه تغییر پیدا کرد /barnāme taqyir peydā kard/ (gloss: program change find do.past.3sg, translation: the program was changed/the program found a change). In this analysis, تغییر /taqyir/ (change) is taken to be part of the compound verb پیدا کردن /peydā kard-an/. In other words, the elements change and find are dependents of the light verb do and are annotated as two different LVC elements in this compound: change as NVE (as described below) and find as LVP. In the UPDT, on the other hand, only find is included in the LVC, and it is analyzed as acomp-lvc. The noun change is analyzed in a more syntactically transparent way with the label dobj. As in the PerDT, both elements are headed by the light verb do. I believe that counting a compound verb as a three-word or a two-word expression depends entirely on our interpretation of the sentence. For instance, in the above example, we can interpret the sentence simply as the program was changed or as the program found a change without affecting the sense, as the sentence works with both interpretations. The same applies to similar combinations of other words with this compound verb, such as قیمت‌ها 2% افزایش پیدا کرد /qeymat-hā

14 -am is used as the first person singular form of the pronominal clitics, defined as pc.1sg in the gloss (see Section 2.3.3).

2% afzāyeš peydā kard/ (gloss: price-pl 2% increase find do.past.3sg). Here we can interpret the sentence with a three-word compound verb as prices were increased by 2%, or in a more syntactically transparent way as prices found an increase of 2%.

• NVE is used for the non-verbal element of a compound verb, which is a noun, adjective, or similar, such as the noun صحبت /sohbat/ (talking) in صحبت کردن /sohbat kard-an/ (gloss: talking do.past-inf, translation: to talk). In the UPDT, the corresponding relation is acomp-lvc, dobj-lvc, nsubj-lvc, or prep-lvc, depending on the function of the preverbal element.

• NE is used for the non-verbal element of a compound verb when the verbal element is in the infinitive form, for example the non-verbal element اخراج /exrāj/ in اخراج کردن /exrāj kard-an/ (gloss: exclusion do.past-inf, translation: to fire). In the UPDT this relation depends entirely on the function of the non-verbal element. This means that the corresponding relation can be acomp-lvc, dobj-lvc, nsubj-lvc, or prep-lvc.

• OBJ2 is used for the second object of a sentence, which can never take the accusative marker rā, such as the object هدیه /hadiye/ (gift) in او کتابی را به من هدیه داد /u ketāb-i rā be man hadiye dād/ (gloss: she/he book-indef rā to me gift give.past.3sg, translation: she/he gave me a book as a gift). In the UPDT the relation is analyzed as dobj-lvc.

A coordination relation is defined in the UPDT by the relation conj (conjunct) between two elements linked by a coordinating conjunction, such as and, or, etc., and is treated asymmetrically. In other words, the head node is always the first conjunct, and other conjuncts are in a subordinate relation to their head. In the PerDT, on the other hand, there are a number of conjunction relations specifying constraints on different elements depending on their lexical categories. These relations include AJCONJ (conjunction of adjective), AVCONJ (conjunction of adverb), NCONJ (conjunction of noun), PCONJ (conjunction of preposition), and VCONJ (conjunction of verb). The first element in the conjunct is the head node, and the second conjunct is the dependent of the coordinating conjunction. However, the analysis for conjunction of verb (VCONJ) differs from that of the other conjunction relations: the verb that appears last is the head, and the first verb is the dependent of the coordinating conjunction.
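The headedness contrast can be stated compactly. The following is a minimal sketch of the two analyses for a phrase A and B, using my own hypothetical encoding of heads as token indices; it is an illustration, not a fragment of either treebank.

```python
# Tokens: 1 = A (first conjunct), 2 = 'and', 3 = B (second conjunct);
# 0 stands for the governor of the whole coordination elsewhere in the tree.

# UPDT: the first conjunct heads both the conjunction (cc)
# and the second conjunct (conj).
updt_heads = {1: 0, 2: 1, 3: 1}

# PerDT (non-verbal conjunctions, e.g. NCONJ): the conjunction depends on
# the first conjunct and itself governs the second conjunct.
perdt_heads = {1: 0, 2: 1, 3: 2}
```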

This comparative analysis describes some structural differences regarding headedness and dependency relations between the PerDT and the UPDT. It also shows that some relations in the two treebanks are very general, and that some relations are more specific in one treebank and less specific in the other.

5.7.4 Sample Analyses
In order to illustrate the comparison, we review two annotated sentences taken from the PerDT. I have analyzed the sentences based on the UPDT scheme to provide a better understanding of how the two schemes differ. Figures 5.4 and 5.5 illustrate the structural varieties in the PerDT and the UPDT. A discussion section follows at the end of the sample analysis of each figure.

Figure 5.4: PerDT Annotation
Figure 5.4 shows an analysis based on the PerDT. The sentence is rooted in the copula verb هستند /hast-and/ (gloss: be.pres-3pl, translation: are) with the relation ROOT and consists of two parts: the following subordinate clause, which is analyzed as an adverbial ADV:

از زمانی که آنها با هم آشنا شدند
az zamān-i ke ān-hā bā ham āšenā šod-and
from time-indef that this-pl with each.other familiar become.past-3pl
Since the time they became familiar with each other

and the adjective happy, labeled as MOS (mosnad), which is a property of a name whose main verb is a copula verb:

خوشبخت هستند .
xošbaxt hast-and .
happy be.pres-3pl .
They are happy.

The preposition from, labeled as ADV, has the child node time-indef, analyzed as POSDEP (post-dependent) and defining the object of the preposition from. Further, that is annotated as NCL (clause of noun), functioning as a dependent of its nominal head time-indef. The complementizer that with the relation NCL governs the relative clause they with each.other familiar become.past-3pl, which is headed by the predicate become.past-3pl analyzed as PRD. The relative clause starts with the subject they, analyzed as SBJ, followed by the adjective familiar with the relation MOS. The mosnad (MOS) familiar is modified by the sub-tree with each.other. This sub-tree consists of the preposition with, annotated as NPP, which takes the dependent each.other, labeled as POSDEP (post-dependent) and functioning as the object of the preposition with. Finally, the punctuation PUNC has the root of the sentence as its head.


از زمانی که آنها با هم آشنا شدند خوشبخت هستند.

Figure 5.4. Syntactic annotation of a Persian sentence taken from the PerDT. To make the figure more readable, glosses have been simplified as follows: they = this-pl, became = become.past-3pl, are = be.pres.3pl. The sentence is illustrated based on two different annotation schemes: PerDT annotation and UPDT annotation. Gloss: from time-indef that they with each.other familiar become.past-3pl happy be.pres-3pl. Translation: Since the time they became familiar with each other they are happy.


Figure 5.4: UPDT Annotation
The UPDT annotated sentence also starts with the subordinate clause from time-indef that they with each.other familiar become.past-3pl and ends with the main clause happy be.pres-3pl. However, as shown in Figure 5.4, the UPDT annotation style offers a different analysis of each clause.
The subordinate clause is an adverbial clause modifier with the head node familiar, marked by the relation advcl and governing the subordinating conjunction from time-indef that, the nominal subject they, the prepositional modifier with each.other, and the copula verb become.past-3pl. Since the adverbial clause includes a copula construction, the predicative complement familiar (in this case an adjective) takes the position as head. Taking a closer look at the complex sub-trees in this adverbial clause, we see that the subordinating conjunction from time-indef that (since the time) is composed of three words. The first word, from, is labeled as mark and placed as the head node. The remaining words are linked together in a chain and annotated as a multi-word expression with the label mwe. In the UPDT I distinguish different subordinating conjunctions that appear in the form of multi-word expressions, such as وقتی که /vaqti ke/ (gloss: when that, translation: when), در حالی که /dar hāl-i ke/ (gloss: in state-indef that, translation: while), and اگر که /agar ke/ (gloss: if that, translation: if), from the complementizer که /ke/ (that). The subordinating conjunctions in the UPDT are always labeled as mark (marker), introducing adverbial clauses. The nominal subject sub-tree has they as its nsubj. The prepositional modifier sub-tree starts with the preposition with, followed by its prepositional object each.other as its dependent. The adverbial clause is terminated by the copula verb become.past-3pl as its last sub-tree.
The main clause, happy be.pres-3pl, is also a copula construction, consisting of the predicative happy, an adjective functioning as the root of the tree and governing the adverbial clause and the copula verb be.pres-3pl. Finally, the punctuation punct has the root of the sentence as its head.

Figure 5.4: Analysis
Since content words occupy the head position in the UPDT, a copula verb cannot be selected as the root or head word of a sentence unless the sentence contains no predicate. However, in the PerDT the copula verb can be placed as the root of a sentence, as illustrated in Figure 5.4.
Fixed expressions such as از زمانی که /az zamān-i ke/ (gloss: from time-indef that, translation: from the time, since), با وجود این /bā vojud-e in/ (gloss: with existence-ez this, translation: however), and اگرچه /agarče/ (gloss: if what, translation: although) have been grammaticalized in Persian as clause linkers used to link an adverbial clause to a main clause. Fixed expressions

are treated in the UPDT as mwe, while in the PerDT each element in the expression is treated separately. In the PerDT, that in from time-indef that is treated as a complementizer introducing the clause they with each.other familiar become.past-3pl to define time-indef. In the PerDT, the complementizer links the relative clause to the main clause by attaching to the noun being modified, while in the UPDT the complementizer attaches to the head of the subordinate clause; in this case it attaches to become.past-3pl instead of time-indef. As mentioned earlier, the copula become.past-3pl could not, according to the UPDT, be the head of the relative clause as long as a predicate such as familiar is present.
In the PerDT the prepositional phrase with each.other is placed as a child node of the predicate familiar and not of the head word become.past-3pl. In the UPDT the prepositional phrase also attaches to the predicate familiar, but the predicate stands as the head of the clause rather than as a dependent. The subject they is analyzed in the same way in both treebanks.

Figure 5.5: PerDT Annotation
Figure 5.5 shows a sentence containing the following subordinate clause:

اگر هم می‌خواهند من را اعدام بکنند ،
agar ham mi-xāh-and man rā e‘dām be-kon-and ,
if even cont-want.pres-3pl me rā execution sub-do.pres-3pl ,
Even if they want to execute me,

The clause starts with if, which is analyzed as AJUCL (adjunct clause) and is the head node of the clause, modified by the sub-tree even cont-want.pres-3pl me rā execution sub-do.pres-3pl. The sentence ends with the following main clause:

بکنند .
be-kon-and .
sub-do.pres-3pl .
let them do that.

More specifically, the sub-tree if even cont-want.pres-3pl is rooted in if, governing the child node cont-want.pres-3pl labeled as PRD (predicate), and further the adverb even analyzed as ADV. This sub-tree heads another sub-tree, namely me rā execution sub-do.pres-3pl, rooted in sub-do.pres-3pl with the label VCL (complement clause of verb). That sub-tree is a complement clause to the verb cont-want.pres-3pl. In this sub-tree, the subject is absent but is indicated in the form of a personal ending on the verb. The object me is defined as the child node of the accusative marker rā; however, the accusative marker is labeled as the object OBJ, and the object me is analyzed as PREDEP (pre-dependent). The word execution, which builds a complex predicate together with the light verb sub-do.pres-3pl, is analyzed as NVE (non-verbal element). The sub-tree ends with the "," labeled as PUNC (punctuation mark).


اگر هم می‌خواهند من را اعدام بکنند، بکنند.

Figure 5.5. Syntactic annotation of a Persian sentence taken from the PerDT. To make the figure more readable, glosses have been simplified as follows: want = cont-want.pres-3pl, do = sub-do.pres-3pl. The sentence is illustrated based on two different annotation schemes: PerDT annotation and UPDT annotation. Gloss: if even cont-want.pres-3pl me rā execution sub-do.pres-3pl, sub-do.pres-3pl. Translation: Even if they want to execute me, let them do it.

Figure 5.5: UPDT Annotation
The UPDT annotated sentence likewise starts with the subordinate clause sub-tree if even cont-want.pres-3pl me rā execution sub-do.pres-3pl, followed by the main clause , sub-do.pres-3pl. However, as shown in Figure 5.5, the UPDT annotated tree analyzes each clause differently.

The subordinate clause is an adverbial clause modifier with the head node sub-do.pres-3pl, marked as advcl and dominating the subordinating conjunction if even, the auxiliary verb cont-want.pres-3pl, the direct object me, and the direct object in light verb construction execution. The subordinating conjunction if even (even if, even though) is a multi-word expression (fixed expression); hence, the leading word in the expression is analyzed with the relation mark and the latter word as a multi-word expression. The subject is absent, but the information about it is given by the personal ending on the verb sub-do.pres-3pl. The direct object me is marked by the relation dobj, followed by the accusative marker rā labeled acc. The adverbial clause sub-tree ends with the word execution, which creates a direct object in light verb construction in combination with the light verb sub-do.pres-3pl. The sentence terminates with the main clause containing sub-do.pres-3pl, rooted in sub-do.pres-3pl governing the entire sentence.

Figure 5.5: Analysis
As noted earlier, because content words in the UPDT are usually selected as head nodes in dependency relations, auxiliaries and complementizers are consistently treated as dependents. The reasons behind this principle are that direct links between content words are more transparent, and that STD is designed for deep-syntactic (semantic) analysis more than for surface analysis. Therefore, as illustrated in the UPDT sample tree (Figure 5.5), and unlike in the PerDT, the auxiliary cont-want.pres-3pl is treated as a dependent, whereas function words in the PerDT are treated as head nodes. Looking back at the sample trees in Figure 5.4, the complementizer that is likewise treated differently: in the PerDT it heads its clause with the relation NCL, while in the UPDT it is a dependent marked mwe.
Another striking difference revealed in Figure 5.5 is how subordinating conjunctions, which are usually expressed with fixed expressions in Persian, are treated in the treebanks. For instance, the relation between the words if and even in the expression if even is marked in the UPDT as a multi-word expression, and the entire term is considered a subordinating conjunction introducing the adverbial clause cont-want.pres-3pl me rā execution sub-do.pres-3pl. In the PerDT, however, this relation is completely broken up and the words are handled discretely. Moreover, the accusative marker rā is treated as a function word in the UPDT and is therefore placed as a dependent of the direct object with the label acc, whereas in the PerDT this relation is treated as superior in the hierarchy: rā is placed as a head node of the direct object and is marked by the label OBJ.
Despite the fact that both the UPDT and the PerDT are based on a dependency structure, they differ greatly concerning both head-dependent relations and functional labels. Moreover, the data in the UPDT and the PerDT are taken from different sources and use different character encodings. The

145 treebanks further provide different tokenization and annotation schemes, as was described in this chapter.

6. Dependency Parsing for Persian

The previous chapter provided a detailed description of the UPDT. In this chapter I describe how I use the treebank in developing a state-of-the-art syntactic parser for Persian. The construction of the parser and the treebank in fact went hand in hand, and the two processes were accomplished simultaneously in a bootstrapping procedure. As noted earlier, in Section 5.2, I employed MaltParser (Nivre et al., 2006) for the treebank development, and the quality of the parser was enhanced as the size of the training data grew. Fine-grained annotated data in treebanks normally provides a more complete grammatical analysis, which in turn enhances the quality of parsing results. However, complex annotation may not always be beneficial and can impair automatic analysis. In this chapter, I present different empirical studies where I systematically simplify the annotation schemes for part-of-speech tags and dependency relations. More precisely, I perform experiments under four different conditions. I first experiment with all the features and labels that already exist in the treebank; the results of this experiment will be used as the baseline. I then experiment with different relation sets by removing or merging various feature distinctions in the part-of-speech tagset and the syntactic annotation scheme. More specifically, I will perform the following experiments:

1. Experiment with full treebank annotation (baseline).

2. Experiment with coarse-grained part-of-speech tags.

3. Experiment with merged dependency relations for light verb constructions.

4. Experiment with no complex dependency relations.

In Experiment 2, I remove all morphological features and keep only the main part-of-speech tags. For instance, I merge all distinctions for adjectives, such as ADJ_CMPR, ADJ_INO, ADJ_SUP, and ADJ_VOC, into ADJ. In Experiment 3, I do a similar study by converting different occurrences of lvc, such as acomp-lvc, dobj-lvc, nsubj-lvc, and prep-lvc, into only lvc. In Experiment 4, I investigate the usefulness of complex syntactic labels by removing all information about clitics. The experiments are designed to indicate whether these conversions help the parser; a sketch of the conversions is given below. In order to get a realistic picture of parser performance, all these experiments will be performed using automatically generated part-of-speech tags. However, for comparison, I will also run the experiments with gold standard part-of-speech tags.

All the above experiments will be carried out using MaltParser. After discovering the best label set for both part-of-speech tags and dependency relations, I will experiment with other parsers, namely MSTParser (McDonald et al., 2005b), MateParser (Bohnet and Kuhn, 2012), and TurboParser (Martins et al., 2010), to find a state-of-the-art parser for Persian. I evaluate the parsers by experimenting with various feature settings when optional parameter settings for optimization are available. However, only results for the final settings are presented.

The selected state-of-the-art parser for Persian will then be used as a module in the pipeline of tools for automatic processing and analysis of Persian. The parsing module will be called ParsPer. For the evaluation of ParsPer I first perform a parsing experiment on the treebank data. I then make an independent parsing evaluation as I did for the other tools in the pipeline: I apply the parser to the 100 randomly selected sentences used in the evaluation of the tools previously introduced in Chapter 4 and present the final results.
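The conversions in Experiments 2–4 are plain transformations of the treebank file. Below is a minimal sketch, assuming the UPDT is stored in CoNLL-X format (0-indexed columns: 3 = CPOSTAG, 4 = POSTAG, 7 = DEPREL); the file names are illustrative.

```python
# Sketch of the annotation simplifications used in Experiments 2-4.
# Assumes CoNLL-X input: one token per line, tab-separated columns,
# blank lines between sentences. File names are illustrative.
import re

def simplify(cols, coarse_pos=True, merge_lvc=True, strip_complex=True):
    if coarse_pos:
        # Exp. 2: ADJ_CMPR, N_SING, V_SUB, ... -> ADJ, N, V, ...
        cols[3] = cols[3].split("_")[0]
        cols[4] = cols[4].split("_")[0]
    if merge_lvc and cols[7].endswith("-lvc"):
        # Exp. 3: acomp-lvc, dobj-lvc, nsubj-lvc, prep-lvc -> lvc
        cols[7] = "lvc"
    if strip_complex:
        # Exp. 4: drop clitic information appearing after "/" or "\"
        cols[7] = re.split(r"[/\\]", cols[7])[0]
    return cols

with open("updt.conll", encoding="utf-8") as src, \
        open("updt-simplified.conll", "w", encoding="utf-8") as out:
    for line in src:
        stripped = line.rstrip("\n")
        if not stripped:
            out.write("\n")          # sentence boundary
        else:
            out.write("\t".join(simplify(stripped.split("\t"))) + "\n")
```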

6.1 Preliminaries

6.1.1 Data
The treebank is sequentially split into 10 parts, of which segments 1–8 are used for training (80%), segment 9 for development (10%), and segment 10 for testing (10%). In my basic experiments with MaltParser, the first phase, I train the parser on the training set and test it on the development set. In the experiments with other parsers, the second phase, I train the parser on the joint training and development sets (90%) and test it on the test set.
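A minimal sketch of this sequential split, assuming the sentences have already been read into a list (function and variable names are illustrative):

```python
# Sequential 80/10/10 split into training, development, and test sets.
def sequential_split(sentences):
    n = len(sentences)
    cut1, cut2 = int(n * 0.8), int(n * 0.9)
    return (sentences[:cut1],       # segments 1-8: training
            sentences[cut1:cut2],   # segment 9: development
            sentences[cut2:])       # segment 10: test

train, dev, test = sequential_split(list(range(100)))  # toy example
print(len(train), len(dev), len(test))  # -> 80 10 10
```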

6.1.2 Evaluation Metrics
For the evaluation of the experiments, I use the standard and most commonly applied evaluation metrics for dependency parsers: the labeled attachment score (LAS), the unlabeled attachment score (UAS), and the label accuracy (LA). The labeled attachment score measures the percentage of tokens with correct head and correct label. The unlabeled attachment score measures the percentage of tokens with correct head. The label accuracy measures the percentage of tokens with correct dependency label. For my basic experiments with MaltParser I will further report labeled recall and precision for the 20 most frequent dependency relations in the treebank, in order to get a more fine-grained picture of the impact of the representation.
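The three metrics reduce to simple token-level counts. A minimal sketch, assuming gold and predicted analyses are given as parallel lists of (head, label) pairs (names are illustrative and not tied to any particular evaluation tool):

```python
# LAS, UAS, and LA over parallel lists of (head, label) pairs.
def attachment_scores(gold, pred):
    assert len(gold) == len(pred)
    n = len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / n        # head + label
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n  # head only
    la = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n   # label only
    return las, uas, la

# Toy example: second token has a wrong head, third a wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
pred = [(2, "nsubj"), (3, "root"), (2, "pobj")]
print(attachment_scores(gold, pred))  # -> (0.333..., 0.666..., 0.666...)
```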

6.1.3 Parsers
In this chapter I make use of different freely available dependency parsers that have so far been successfully used for different languages, namely MaltParser (Nivre et al., 2006), MSTParser (McDonald et al., 2005b), MateParsers (Bohnet, 2010; Bohnet and Nivre, 2012; Bohnet and Kuhn, 2012), and TurboParser (Martins et al., 2010). The parsers are briefly described below.

MaltParser
MaltParser (Nivre et al., 2006) is an open source data-driven parser generator for dependency parsing that has been applied to a large number of languages. The parser is characterized as transition-based and can be used to develop a parser for a new language, given a dependency treebank representing the syntactic relations of that language. The system allows the user to choose different parsing algorithms and to define optional feature models indicating lexical features, part-of-speech features, and dependency type features. The main parsing algorithms available in MaltParser are Nivre's algorithms, including the arc-eager and arc-standard versions described in Nivre (2003) and Nivre (2004); Covington's algorithms, containing the projective and non-projective versions described by Covington (2001); and the Stack algorithms, including the projective and non-projective versions of the algorithm described in Nivre (2009) and Nivre et al. (2009). Covington's algorithms and the Stack algorithms can handle non-projective trees, whereas Nivre's algorithms cannot (Ballesteros and Nivre, 2012). An optimization tool for MaltParser, MaltOptimizer (Ballesteros and Nivre, 2012), has been developed specifically to optimize MaltParser for new data sets with respect to parsing algorithm and feature selection.
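Both tools are distributed as Java jars driven from the command line; the sketch below wraps typical calls in Python. The jar versions, the configuration name, and the data file names are illustrative assumptions; -c, -i, -o, and -m are standard MaltParser options, and -p selects a MaltOptimizer phase.

```python
# Sketch of typical MaltOptimizer and MaltParser invocations.
import subprocess

# MaltOptimizer phase 1: data analysis over the training corpus.
subprocess.run(["java", "-jar", "MaltOptimizer.jar",
                "-p", "1", "-m", "maltparser-1.8.jar",
                "-c", "updt-train.conll"], check=True)

# Train a parsing model (configuration "updt") on the training data.
subprocess.run(["java", "-jar", "maltparser-1.8.jar",
                "-c", "updt", "-i", "updt-train.conll",
                "-m", "learn"], check=True)

# Parse the development data with the trained model.
subprocess.run(["java", "-jar", "maltparser-1.8.jar",
                "-c", "updt", "-i", "updt-dev.conll",
                "-o", "updt-dev-parsed.conll", "-m", "parse"], check=True)
```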

MSTParser
MSTParser (McDonald et al., 2005b; McDonald et al., 2005a) is an open source system that, like MaltParser, has been applied to a wide range of languages. The parser is based on the graph-based approach to dependency parsing, using global learning and exact (or nearly exact) inference algorithms. A graph-based parser extracts the highest scoring spanning tree from a complete graph containing all possible dependency arcs, using a scoring model that decomposes into scores for smaller subgraphs of a tree (McDonald et al., 2005b; Koo and Collins, 2010). MSTParser implements first- and second-order models, where subgraphs are single arcs and pairs of arcs, respectively, and provides different algorithms for projective and non-projective trees.
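To make the graph-based idea concrete, the toy sketch below scores every candidate arc and extracts the highest-scoring spanning tree with Edmonds' algorithm (here via networkx). The scores are invented for illustration; a real parser derives them from a trained model.

```python
# Graph-based decoding over a toy sentence: node 0 is an artificial
# root, nodes 1-3 are words. Arc scores are invented for illustration.
import networkx as nx

words = ["root", "she", "reads", "books"]
score = {(0, 2): 10.0, (2, 1): 9.0, (2, 3): 8.0,   # good arcs
         (0, 1): 3.0, (0, 3): 2.5, (1, 2): 2.0, (3, 2): 1.0}

G = nx.DiGraph()
for (head, dep), s in score.items():
    G.add_edge(head, dep, weight=s)

# Highest-scoring spanning arborescence = the dependency tree.
tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges()))  # -> [(0, 2), (2, 1), (2, 3)]
```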

MateParsers
MateParsers are open source statistical dependency parsers in the NLP toolkit Mate Tools (Bohnet, 2010). The pipeline contains a lemmatizer, a part-of-speech tagger, a morphological tagger, dependency parsers (Bohnet, 2010; Bohnet and Nivre, 2012; Bohnet and Kuhn, 2012), and a semantic role labeler (Björkelund et al., 2010). The dependency parsers in the pipeline include a graph-based parser, a transition-based parser, and a joint tagger-parser derived from the transition-based parser.

The basis of the graph-based parser is the second-order maximum spanning tree dependency parsing algorithm of Carreras (2007), combined with the passive-aggressive perceptron described in Crammer et al. (2006) and McDonald et al. (2005a) and a hash kernel. This method improves the mapping of feature values, which in turn has led to higher attachment scores for languages such as Czech, English, German, and Spanish (Bohnet, 2010). In addition to high accuracy, Bohnet (2010) reports a substantial increase in parsing speed: 3.5 times faster on a single CPU core than the baseline parser, which has the architecture of a maximum spanning tree parser.1

The transition-based parser is a system that combines part-of-speech tagging and labeled dependency parsing with non-projective trees. The tagger is optional, however, and can easily be switched off when only parsing is desired. The parser employs beam search in combination with structured perceptron learning. The system has exhibited steady improvements in accuracy for tagging and parsing when evaluated on Chinese, Czech, English, and German, compared to the results achieved by the graph-based system in the Mate pipeline (Bohnet and Nivre, 2012).
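The hash kernel mentioned above replaces the usual feature dictionary with direct hashing of feature strings into a fixed-size weight vector, tolerating collisions by design. A minimal sketch of the idea (all names and the vector size are illustrative):

```python
# Hash-kernel feature mapping: no feature dictionary is stored;
# feature strings hash straight to indices in a fixed weight vector.
import zlib

WEIGHTS_SIZE = 2 ** 20
weights = [0.0] * WEIGHTS_SIZE

def feature_index(feature: str) -> int:
    # Any stable string hash works; collisions are tolerated by design.
    return zlib.crc32(feature.encode("utf-8")) % WEIGHTS_SIZE

def arc_score(head_pos: str, dep_pos: str, direction: str) -> float:
    feats = [f"hp={head_pos}|dp={dep_pos}|dir={direction}",
             f"hp={head_pos}|dir={direction}"]
    return sum(weights[feature_index(f)] for f in feats)
```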

TurboParser
TurboParser (Martins et al., 2010; Martins et al., 2013) is another open source multilingual dependency parser. The system is based on a second-order non-projective parser with features for arcs, consecutive siblings, and grandparents, using the AD3 algorithm as a decoder. In order to reduce the number of candidate arcs and increase parsing speed, the system trains a probabilistic model for unlabeled arc-factored pruning. The parser presented in Martins et al. (2013) uses AD3, an accelerated dual decomposition algorithm proposed by Martins et al. (2011), and is considerably faster than the version of Martins et al. (2011). The parser further handles large components, such as specialized head automata for the third-order features, and a sequence model for head bigrams. The scores presented in Martins et al. (2013) show state-of-the-art results for large data sets of languages with many non-projective dependencies, such as English, Czech, German, and Dutch.

1 According to the latest shared tasks, transition-based parsers have run times similar to maximum spanning tree parsers (Bohnet, 2010).

6.2 Experiments with Different Parsing Representations

In this section I describe a number of basic experiments performed for different purposes with MaltParser. To evaluate the overall performance of the parser, I tune parameters to achieve the best possible results. Thus, I experiment with different algorithms and feature settings to optimize MaltParser. To accomplish the optimization process, I apply MaltOptimizer (Ballesteros and Nivre, 2012). Parser accuracy is evaluated on automatically generated part-of-speech tags as well as gold standard tags. In order to generate automatic part-of-speech tags, I used the Persian part-of-speech tagger TagPer. However, for the treebank experiments I retrained the tagger on data excluding the treebank sentences, to avoid overlap between training and evaluation data. The evaluation of the new TagPer model revealed an overall accuracy of 97.17% when HunPoS was trained on 90% of the UPC and evaluated on the remaining 10%. The four different experiments include (1) an overall parsing evaluation on the full treebank annotation, (2) an experiment without morphological features in the part-of-speech tagset, (3) an experiment without fine-grained LVC labels, and (4) an experiment without complex labels, as described in the introduction.

6.2.1 Baseline: Full Treebank Annotation
In this parsing evaluation I trained MaltParser on the UPDT with automatically generated part-of-speech tags. I used the treebank with full part-of-speech tags and all existing dependency relations. The experiment resulted in a labeled attachment score of 78.84% and an unlabeled attachment score of 83.07%. These results will be used as a baseline for the subsequent experiments. Labeled recall and precision for the 20 most frequent dependency relations (with a minimum frequency of 2022) are presented in Table 6.1. As can be seen, the results vary greatly across the relation types, with recall ranging from 53.75% for direct object (dobj) to 97.12% for object of a preposition (pobj), and precision varying between 55.37% for clausal complement (ccomp) and 95.57% for object of a preposition (pobj). Recall and precision for a number of relations, such as object of a preposition (pobj), adjectival modifier (amod), direct object in light verb construction (dobj-lvc), determiner (det), numeric modifier (num) with identical recall and precision, and auxiliary (aux), are all over 90%. In addition to the direct object (dobj), the most erroneous2 relations are nominal subject (nsubj), conjunct (conj), copula (cop), adverbial modifier (advmod), clausal complement (ccomp), noun compound modifier (nn), and accusative marker (acc), although their recall and precision figures vary somewhat.

2 I define the most erroneous relations as relations with scores lower than 70%.

Table 6.1. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the full treebank annotation (automatically generated part-of-speech tags).

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16237        97.12           95.57
poss                       16067        89.96           79.28
prep                       15643        76.00           74.49
punct                      13442        75.04           76.10
amod                        9211        90.64           90.72
nsubj                       8653        67.60           66.26
conj                        8629        67.78           67.78
cc                          7657        78.34           77.81
root                        5918        81.21           79.87
cop                         4427        66.22           73.51
dobj-lvc                    4185        91.63           92.06
advmod                      4157        70.27           65.82
ccomp                       4021        63.54           55.37
det                         3929        93.79           91.71
dobj                        3723        53.75           57.01
nn                          3339        57.28           79.73
num                         2872        92.00           92.00
acc                         2535        69.76           69.48
aux                         2287        92.14           90.95
complm                      2022        77.71           78.61

As indicated in Table 6.1, the results for core arguments such as nominal subject (nsubj) and direct object (dobj) are rather low. This can be explained by the fact that, despite Persian's SOV structure, subjects and objects may shift order in a sentence. As Persian is a pro-drop language, an object may be placed at the beginning of a sentence (with or without the accusative marker rā) and the subject may either come next or be omitted from the sentence entirely and instead be realized as a personal ending on the verb. There are further cases where subject and object are both omitted but appear as personal endings on the verb, since Persian contains a vast number of dropped subjects and objects. In all these cases, it is hard for the system to identify the correct subject and object in the sentence, which may lead to the dependency relations nsubj and dobj frequently being interchanged or not being correctly identified. The dependency relation noun compound modifier (nn) is another relation with low recall. Checking the parsed file, I discovered that the parser had often selected the label possession modifier (poss) instead of nn. This can be explained by the fact that both labels are governed by and applied to nouns: the possession modifier (poss) is applied to genitive complements and the compound modifier (nn) to noun compounds (and proper names). However, this difference is not marked in my part-of-speech annotation. Moreover, the number of occurrences of the label poss in the training data is higher than that of the label nn, meaning that it is easier for the parser to identify a structure as the dependency relation poss than as nn.

For comparison, I also trained MaltParser on the UPDT with gold standard part-of-speech tags. As in the previous experiment, I used the treebank with its full part-of-speech tags and all existing dependency relations. The evaluation resulted in a labeled attachment score of 81.98% and an unlabeled attachment score of 85.24%. Table 6.2 displays labeled recall and precision for the 20 most frequent dependency relations in the treebank. As we can see, the highest and lowest recall and precision scores belong to the same dependency relations as in the experiment with auto tags: object of a preposition (pobj) shows the highest scores for both recall and precision, while direct object (dobj) and clausal complement (ccomp) have the lowest scores for recall and precision, respectively. The relation accusative marker (acc) further receives an identical recall and precision score of 74.60%. In addition to the object of a preposition (pobj), the relations adjectival modifier (amod), direct object in light verb construction (dobj-lvc), determiner (det), numeric modifier (num), and auxiliary (aux) received scores above 90% for both recall and precision. The dependency relation possession modifier (poss) also shows a recall of over 90%. Furthermore, copula (cop), direct object (dobj), and noun compound modifier (nn) are the most erroneous relations, although there are differences in the recall and precision figures achieved by the parser.

Table 6.2. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the full treebank annotation (gold standard part-of-speech tags).

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16237        98.53           97.58
poss                       16067        92.91           82.47
prep                       15643        77.18           75.98
punct                      13442        78.57           78.98
amod                        9211        94.81           94.72
nsubj                       8653        71.54           70.04
conj                        8629        70.87           71.46
cc                          7657        81.34           81.00
root                        5918        85.57           84.58
cop                         4427        65.95           73.43
dobj-lvc                    4185        92.56           94.09
advmod                      4157        76.69           73.46
ccomp                       4021        70.24           59.95
det                         3929        96.05           96.32
dobj                        3723        57.96           60.31
nn                          3339        60.19           87.94
num                         2872        92.62           93.48
acc                         2535        74.60           74.60
aux                         2287        92.14           90.17
complm                      2022        83.43           82.02

Table 6.3 summarizes the baseline results. The scores for auto tags, as could be predicted, are lower than the scores achieved with gold tags. The influence of gold versus auto part-of-speech tags on parsing performance has also been shown for other treebanks. For instance, Petrov and Klein (2008) note that using auto part-of-speech tags in the Tübingen treebank leads to a substantial number of parsing errors due to incorrect tagging, resulting in a 1.92% difference in F-score compared to gold part-of-speech tags.

Table 6.3. Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on the UPDT with full fine-grained annotation.

Basic Experiments with MaltParser

Baseline Experiments   LAS (%)   UAS (%)   LA (%)
Auto Tags                78.84     83.07     88.48
Gold Tags                81.98     85.24     90.78

6.2.2 Coarse-Grained Part-of-Speech Tags
The second empirical study was performed in order to select the best part-of-speech encoding for the UPDT. In this experiment, I merged all morphological features with their main categories. As a result, the feature distinctions that existed for adjectives, adverbs, nouns, and verbs were all discarded. In other words, ADJ_CMPR, ADJ_INO, ADJ_SUP, and ADJ_VOC were merged with ADJ; ADV_COMP, ADV_I, ADV_LOC, ADV_NEG, and ADV_TIME were merged with ADV; N_PL, N_SING, and N_VOC were merged with N; and V_AUX, V_IMP, V_PA, V_PP, V_PRS, and V_SUB were merged with V. After merging all automatically generated morphological features with their main categories, I ran MaltParser on the UPDT with 15 auto part-of-speech tags instead of 31. The parsing evaluation revealed scores of 79.24% for labeled attachment and 83.45% for unlabeled attachment. Comparing these results to those obtained in the baseline experiment with auto part-of-speech tags shows that MaltParser performs better on coarse-grained part-of-speech tags. Table 6.4 shows the results for labeled recall and precision for the 20 most frequent dependency labels in the UPDT. Again, object of a preposition (pobj) shows the best results among the dependency relations, with 97.07% for recall and 95.72% for precision, and direct object (dobj) shows the lowest recall and precision, with 52.55% and 55.56%, respectively. In addition to the object of a preposition (pobj), the dependency relations direct object in light verb construction (dobj-lvc), determiner (det), and numeric modifier (num) receive scores of over 90% for both recall and precision. The possession modifier (poss) and auxiliary (aux), as well as the adjectival modifier (amod), obtain scores above 90% for recall and precision, respectively. The most erroneous dependency relations in this experiment are nominal subject (nsubj), conjunct (conj), copula (cop), clausal complement (ccomp), direct object (dobj), noun compound modifier (nn), and accusative marker (acc). However, noun compound modifier (nn), with 82.46%, shows quite a high score for precision compared to its score of 57.04% for recall.

Table 6.4. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the UPDT with coarse-grained auto part-of-speech tags.

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16237        97.07           95.72
poss                       16067        90.18           79.43
prep                       15643        76.85           75.57
punct                      13442        76.07           76.80
amod                        9211        88.69           90.37
nsubj                       8653        68.62           64.55
conj                        8629        68.85           68.28
cc                          7657        78.88           78.14
root                        5918        81.38           80.17
cop                         4427        67.83           78.33
dobj-lvc                    4185        90.23           91.94
advmod                      4157        73.31           66.16
ccomp                       4021        67.29           61.67
det                         3929        94.35           92.78
dobj                        3723        52.55           55.56
nn                          3339        57.04           82.46
num                         2872        92.92           91.79
acc                         2535        69.35           70.20
aux                         2287        92.14           89.41
complm                      2022        80.00           82.35

Comparing the recall and precision results of the dependency labels presented in Table 6.4 to the baseline, we see an improvement for many dependency relations. The greatest improvement is exhibited by the relation clausal complement (ccomp), with a 3.75% gain for recall and 6.3% for precision. The dependency relation clausal complement (ccomp) is assigned, in the treebank, to complements that are realized by verbs, nouns, or adjectives; using coarse-grained part-of-speech tags for verbs, nouns, and adjectives thus leads to higher results. This further assists the relation complementizer (complm), which always introduces a clausal complement (ccomp) and achieves 2.29% higher recall and 3.74% higher precision. Copula (cop) is also one of the dependency relations that shows substantial improvements, especially for precision, with 1.61% higher recall and 4.82% higher precision. Continuing the comparison, most of the dependency labels show an improvement in results. However, coarse-grained part-of-speech tags have a negative impact on some dependency labels. This negative effect is more or less visible in the dependency relations object of a preposition (pobj), adjectival modifier (amod), nominal subject (nsubj), direct object in light verb construction (dobj-lvc), direct object (dobj), noun compound modifier (nn), and auxiliary (aux), which may be due to the lack of various distinctions among nouns, adjectives, and verbs. For instance, plural nouns never appear in complex predicates and, as seen in the tables, direct object in light verb construction (dobj-lvc) drops by 1.40% and 0.12% for recall and precision, respectively.

I also ran MaltParser on the treebank with coarse-grained gold part-of-speech tags. The results showed that the parser achieved a labeled attachment score of 82.00% and an unlabeled attachment score of 85.37%. The overall results are somewhat better than in the corresponding experiment where the morphological distinctions are retained. Thus, merging the morphological features in the treebank is useful, as it improves parsing performance. Note that parsing performance follows the same trend as in the experiment where coarse-grained auto part-of-speech tags were used. Table 6.5 presents labeled recall and precision scores for the 20 most frequent dependency relations in the UPDT. Once again, object of a preposition (pobj), with 98.43% and 97.53%, receives the highest recall and precision scores. On the other hand, the lowest recall and precision are shown by direct object (dobj), with 57.06% and 59.56%, respectively. Prepositional modifier (prep), coordination (cc), copula (cop), adverbial modifier (advmod), clausal complement (ccomp), determiner (det), numeric modifier (num), and complementizer (complm) get higher recall and precision scores than in the baseline experiment. Additionally, possession modifier (poss), conjunct (conj), and auxiliary (aux) score higher for recall, and noun compound modifier (nn) has higher precision. The relation nominal subject (nsubj) shows no difference in recall, but its precision has decreased. Apart from the aforementioned dependency relations, the rest of the relations receive lower recall and precision scores. The dependency relations object of a preposition (pobj), adjectival modifier (amod), direct object in light verb construction (dobj-lvc), determiner (det), and numeric modifier (num) obtain scores of over 90% for both recall and precision. The relations possession modifier (poss) and auxiliary (aux) show scores of over 90% for recall. Moreover, nominal subject (nsubj), copula (cop), clausal complement (ccomp), direct object (dobj), and noun compound modifier (nn) all belong to the most erroneous dependency relations, even if there are differences in the recall and precision figures obtained by the parser.

Table 6.5. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the UPDT with coarse-grained gold part-of-speech tags.

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16237        98.43           97.53
poss                       16067        93.51           82.40
prep                       15643        78.37           77.23
punct                      13442        78.13           78.88
amod                        9211        94.62           93.41
nsubj                       8653        71.54           68.83
conj                        8629        71.58           71.41
cc                          7657        81.47           81.25
root                        5918        83.22           82.67
cop                         4427        68.90           80.82
dobj-lvc                    4185        92.33           93.41
advmod                      4157        79.39           74.13
ccomp                       4021        70.51           61.59
det                         3929        96.33           96.60
dobj                        3723        57.06           59.56
nn                          3339        58.50           89.59
num                         2872        94.15           95.03
acc                         2535        72.58           74.07
aux                         2287        93.45           88.80
complm                      2022        86.86           85.88

As when the same experiment was run with auto part-of-speech tags, the study shows an improvement in overall parsing results, although the improvement is slightly smaller than for auto tags. The greatest improvement is observed for the relation copula (cop), with a 2.95% gain for recall and 7.39% for precision. Compared to the baseline, although using coarse-grained gold part-of-speech tags improves the results for a number of dependency relations, it affects some other dependency relations negatively. These relations are object of a preposition (pobj), possession modifier (poss), punctuation (punct), adjectival modifier (amod), nominal subject (nsubj), conjunct (conj), root, direct object in light verb construction (dobj-lvc), direct object (dobj), noun compound modifier (nn), accusative marker (acc), and auxiliary (aux), though with differences in recall and precision scores. The reason for this reduction may be that various distinctions between nouns, adjectives, and verbs are not clear to the parser, and may also depend on the surrounding words and their specific characteristics in context. For instance, proper nouns are always given the tag N_SING, and the lack of feature distinctions for nouns in the tagset makes it harder for the parser to achieve a high recall score for the relation noun compound modifier (nn); this relation shows a decrease of 1.69% for recall, although its precision increases by 1.65%. Table 6.6 compares the results obtained in the experiments with auto and gold part-of-speech tags.

Table 6.6. Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on the UPDT with coarse-grained part-of-speech tags.

Basic Experiments with MaltParser

Baseline Experiments   LAS (%)   UAS (%)   LA (%)
Auto Tags                78.84     83.07     88.48
Gold Tags                81.98     85.24     90.78

PoS Tags Experiments
Auto Tags                79.24     83.45     88.43
Gold Tags                82.00     85.37     90.54

6.2.3 Coarse-Grained LVC Relations
For this experiment I converted all variations of light verb constructions, such as acomp-lvc, dobj-lvc, nsubj-lvc, and prep-lvc, to simply lvc and performed the study with automatically generated fine-grained part-of-speech tags. The evaluation showed that the parser achieved a labeled attachment score of 79.46% and an unlabeled attachment score of 83.52%. Since the labeled attachment score is based on the number of tokens with correct dependency label and correct head, the LAS results obtained in this experiment cannot be directly compared to the baseline results, as the two experiments use different label sets; output that differs in this regard can only be compared in unlabeled terms. Thus, the unlabeled attachment score, which measures the number of tokens with correct head, can be directly compared with the baseline. This comparison shows that removing the LVC distinctions from the treebank with auto part-of-speech tags helps the parser obtain higher accuracy. As shown in Table 6.7, the highest recall and precision scores are achieved by object of a preposition (pobj), with 97.45% and 95.89%, respectively. The lowest recall and precision scores are shown by direct object (dobj), with 55.26% and 56.79%, respectively. In addition to the object of a preposition (pobj), the dependency relations determiner (det), numeric modifier (num), and auxiliary (aux) obtain scores above 90% for recall and precision. The relations adjectival modifier (amod) and light verb construction (lvc) further show scores of over 90% for precision. Irrespective of the score differences for recall and precision, the dependency relations nominal subject (nsubj), conjunct (conj), copula (cop), adverbial modifier (advmod), clausal complement (ccomp), direct object (dobj), and noun compound modifier (nn) are the most erroneous dependency relations in this experiment.

Table 6.7. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained auto part-of-speech tags and only one light verb construction label.

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16237        97.45           95.89
poss                       16067        89.91           79.65
prep                       15643        75.04           73.88
punct                      13442        76.22           76.72
amod                        9211        89.90           90.32
nsubj                       8653        70.30           66.92
conj                        8629        67.66           67.90
cc                          7657        78.88           78.14
root                        5918        82.05           81.23
cop                         4427        68.10           78.64
lvc                         5427        85.92           90.54
advmod                      4157        72.64           68.04
ccomp                       4021        64.08           57.18
det                         3929        94.07           92.76
dobj                        3723        55.26           56.79
nn                          3339        58.01           83.28
num                         2872        92.92           92.07
acc                         2535        70.97           70.97
aux                         2287        92.58           92.17
complm                      2022        80.57           81.50

Compared to the baseline results presented in Table 6.1, recall and precision have decreased for the dependency relations prepositional modifier (prep) and adjectival modifier (amod). This can probably be explained by the fact that merging the LVC variations makes it harder for the system to decide whether, for instance, a preposition is a prepositional modifier (prep) or an lvc, or whether an adjective is an adjectival modifier (amod) or an lvc. A striking finding from Table 6.7 is the outcome of converting the different light verb constructions to lvc, which resulted in 85.92% for recall and 90.54% for precision. Moreover, Table 6.9 shows recall and precision for the different types of LVC relations from the baseline experiment on the fine-grained annotated treebank, as well as recall and precision of the dependency label lvc from Experiment 3, where I tested the treebank with fine-grained part-of-speech tags and merged LVC relations. The entries in the table further present the frequency of acomp-lvc, dobj-lvc, nsubj-lvc, and prep-lvc in Experiment 1, as well as the frequency of the label lvc in Experiment 3. Note that, given the low frequency of the LVC relations acomp-lvc, nsubj-lvc, and prep-lvc in the treebank, their recall and precision are not presented together with the 20 most frequent dependency types in the UPDT. As presented in Table 6.9, recall and precision for lvc are lower than the baseline results for direct object in light verb construction (dobj-lvc) but higher than the results obtained for the adjectival complement in light verb construction (acomp-lvc) and the prepositional modifier in light verb construction (prep-lvc). However, we should keep in mind that the label lvc covers all types of LVC relations and, as mentioned earlier, it is harder for the system to select a proper label for tokens that sometimes participate in LVC relations and sometimes in relations similar to LVC labels, such as prepositions that appear either as prepositional modifiers (prep) or as prepositional modifiers in light verb construction (prep-lvc). Hence, the overall results show that having various types of LVC distinctions in the treebank does not contribute to higher performance. On the other hand, recall and/or precision for the core arguments nominal subject (nsubj) and direct object (dobj) are improved: recall improves by 2.7% and 1.51% for nominal subject (nsubj) and direct object (dobj), respectively. The dependency relation root further improves by 0.84% for recall and 1.36% for precision. Thus, the merging might be a disadvantage for the relation prepositional modifier (prep) but favors other relations, for instance the nominal subject (nsubj).

I performed another experiment under the same conditions as the previous one but with gold part-of-speech tags. This resulted in a labeled attachment score of 82.38% and an unlabeled attachment score of 85.58%. The unlabeled attachment score is higher than the baseline, showing a similar pattern to the experiment with auto part-of-speech tags. Merging the LVC distinctions improves parsing accuracy, with recall ranging from 58.56% for direct object (dobj) to 98.53% for object of a preposition (pobj), and precision varying between 59.91% for clausal complement (ccomp) and 97.69% for object of a preposition (pobj). As in the experiment with auto tags, the striking result here is that the light verb construction (lvc) is parsed much less accurately than the relation direct object in light verb construction (dobj-lvc) in the baseline. As depicted in Table 6.8, the highest recall and precision scores are once again achieved by object of a preposition (pobj), with 98.53% and 97.69%, respectively. Similar to the baseline results, the lowest recall and precision scores are exhibited by direct object (dobj) and clausal complement (ccomp), respectively. Compared to the baseline results (Table 6.2), only accusative marker (acc) shows lower recall and precision; noun compound modifier (nn) and adjectival modifier (amod) show lower recall and precision, respectively. This means that, apart from the relation root, which shows no difference in recall, the remaining relations achieve higher recall and precision. Some relations, such as object of a preposition (pobj), adjectival modifier (amod), determiner (det), numeric modifier (num), and auxiliary (aux), achieve scores above 90% for both recall and precision. Moreover, the possession modifier (poss) and light verb construction (lvc) show scores of over 90% for recall and precision, respectively. The relations copula (cop), direct object (dobj), and noun compound modifier (nn) were the most erroneous relations, although the recall and precision scores attained for them by the parser differ.

Table 6.8. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained gold part-of-speech tags and only one light verb construction label.

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16237        98.53           97.69
poss                       16067        93.18           82.67
prep                       15643        77.86           76.44
punct                      13442        78.94           79.47
amod                        9211        94.90           94.38
nsubj                       8653        72.33           70.89
conj                        8629        72.29           72.99
cc                          7657        81.88           81.55
root                        5918        85.57           85.00
cop                         4427        67.83           74.63
lvc                         5427        84.81           92.49
advmod                      4157        78.72           74.20
ccomp                       4021        71.31           59.91
det                         3929        96.33           96.88
dobj                        3723        58.56           62.30
nn                          3339        59.71           88.49
num                         2872        92.92           93.50
acc                         2535        73.79           74.39
aux                         2287        93.45           91.85
complm                      2022        85.71           83.80

As shown in Table 6.9, and as in the previous experiment with auto part-of-speech tags, the results for recall and precision for lvc are lower than the baseline results for direct object in light verb construction (dobj-lvc) but higher than the results for the other LVC relations, for the same reason. Moreover, recall and precision for the nominal subject in light verb construction (nsubj-lvc) are nil; after checking the training and test sets, I noticed that all seven tokens annotated with the label nsubj-lvc had ended up in the training data. Although providing recall and precision for each LVC distinction on a label-by-label basis is most informative, the label lvc covers all types of LVC variations, so I cannot directly compare the results of each with the results obtained by the dependency relation lvc in Experiment 3, unless I calculate an overall recall and precision score for all the LVC types in Experiment 1. Such a calculation revealed an overall recall and precision of 85.55% and 89.16% with auto tags, and 85.90% and 91.66% with gold tags. Comparing the overall recall and precision from Experiment 1 with the scores achieved in Experiment 3 shows that merging the LVC distinctions into lvc helps the parser achieve superior performance with auto part-of-speech tags, resulting in a 0.37% improvement for recall and 1.38% for precision. The recall for lvc in Experiment 3 is reduced by 1.09% when tested on gold part-of-speech tags, but the precision is improved by 0.83%.

Table 6.9. Recall and precision for LVC relations with fine-grained auto and gold part-of-speech tags in Experiments 1 and 3.

LVC Performance in Experiments 1 and 3

Auto Tags   Frequency   Recall (%)   Precision (%)
acomp-lvc         681        80.56           78.38
dobj-lvc         4185        91.63           92.06
nsubj-lvc           7            ∅               ∅
prep-lvc          554        46.88           78.95
lvc              5427        85.92           90.54

Gold Tags
acomp-lvc         681        76.39           85.94
dobj-lvc         4185        92.56           94.09
nsubj-lvc           7            ∅               ∅
prep-lvc          554        48.44           81.58
lvc              5427        84.81           92.49

Table 6.10 summarizes the results of Experiment 3 for auto and gold part-of-speech tags.

Table 6.10. Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on the UPDT with fine-grained part-of-speech tags and only one dependency relation for light verb construction.

Basic Experiments with MaltParser

Baseline Experiments   LAS (%)   UAS (%)   LA (%)
Auto Tags                78.84     83.07     88.48
Gold Tags                81.98     85.24     90.78

LVC Experiments
Auto Tags                79.46     83.52     88.86
Gold Tags                82.38     85.58     90.93
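The overall (micro-averaged) recall figure above can be checked directly from the per-label values in Table 6.9: weighting each relation's recall by its gold frequency reproduces the reported 85.55% for auto tags. A minimal sketch of that calculation follows; the corresponding precision average would additionally require the parser's predicted label counts, which the table does not report.

```python
# Frequency-weighted (micro-averaged) recall over the LVC relations
# in Experiment 1 (auto tags), using the Table 6.9 figures.
lvc_relations = {          # label: (gold frequency, recall %)
    "acomp-lvc": (681, 80.56),
    "dobj-lvc": (4185, 91.63),
    "nsubj-lvc": (7, 0.0),
    "prep-lvc": (554, 46.88),
}

total = sum(freq for freq, _ in lvc_relations.values())
overall = sum(freq * rec for freq, rec in lvc_relations.values()) / total
print(f"overall LVC recall: {overall:.2f}%")  # -> 85.55%
```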

6.2.4 No Complex Relations
I further experimented with modifying all complex syntactic relations that were used for complex unsegmented word forms (words containing unsegmented clitics). All complex dependency relations, comprising 48 labels, were merged with the 48 basic Persian STD relations. Accordingly, I removed all features appearing after a forward slash (/) or backslash (\). I then ran MaltParser on the treebank with fine-grained auto part-of-speech tags and a total of 48 dependency relations, instead of the 96 dependency relations including the complex ones. The evaluation revealed a labeled attachment score of 79.63% and an unlabeled attachment score of 83.42%. As noted earlier, the labeled attachment score does not allow a direct comparison with the baseline, because the two experiments use different label sets. Hence, the comparative evaluation only considers the unlabeled attachment score, which shows an improvement in parsing performance when the complex dependency relations are simplified. This improvement is understandable, as some complex relations, such as ccomp\cpobj, ccomp\nsubj, and so forth, occur only once in the treebank, and it is almost impossible for a data-driven system to learn such rare cases from the given data (a list of all the dependency relations, including basic and complex labels, is presented in Appendix A).

Table 6.11. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained auto part-of-speech tags and only basic dependency relations.

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16412        97.47           96.90
poss                       16268        90.27           79.59
prep                       15734        76.52           75.62
punct                      13442        75.04           75.76
amod                        9277        89.75           90.59
nsubj                       8847        68.40           66.56
conj                        8753        68.63           69.28
cc                          7657        79.16           78.41
root                        6010        81.17           80.90
cop                         4427        66.76           74.55
dobj-lvc                    4204        90.76           92.25
advmod                      4168        71.62           67.52
ccomp                       4105        64.10           56.31
det                         3929        94.07           93.28
dobj                        3862        54.14           57.19
nn                          3340        56.31           81.98
num                         2872        93.23           93.23
acc                         2535        71.37           71.08
aux                         2287        92.14           90.56
complm                      2022        77.14           78.03

As presented in Table 6.11, there are variations in recall, ranging from 54.14% for direct object (dobj) to 97.47% for object of a preposition (pobj), and in precision, varying between 56.31% for clausal complement (ccomp) and 96.90% for object of a preposition (pobj). The dependency relations object of a preposition (pobj), direct object in light verb construction (dobj-lvc), determiner (det), numeric modifier (num), and auxiliary (aux) receive scores above 90% for both recall and precision. Possession modifier (poss) shows a score of over 90% only for recall, and adjectival modifier (amod) scores above 90% for precision. The dependency relations nominal subject (nsubj), conjunct (conj), copula (cop), clausal complement (ccomp), direct object (dobj), and noun compound modifier (nn) are the most erroneous relations, although there are differences in the recall and precision figures obtained by the parser. Compared to the baseline, recall and precision for the dependency relations adjectival modifier (amod) and complementizer (complm) have dropped. The relations root and noun compound modifier (nn), as well as punctuation (punct) and auxiliary (aux), further show a decline in recall and precision, respectively. This can probably be explained by the way the complex labels were annotated: removing the information provided by these relations makes it harder for the parser to achieve high results when assigning these labels. However, the parser shows higher scores for the remaining dependency relations.

Finally, I evaluated MaltParser by rerunning the previous experiment with gold standard part-of-speech tags. The evaluation revealed scores of 82.38% for labeled attachment and 85.40% for unlabeled attachment. As in the same experiment run with auto part-of-speech tags, the study shows that merging complex relations with basic ones improves the overall parsing accuracy. As shown in Table 6.12, there are variations in recall, ranging from 57.99% for direct object (dobj) to 98.60% for object of a preposition (pobj), and in precision, from 59.59% for clausal complement (ccomp) to 98.39% for object of a preposition (pobj). Compared to the baseline results, the parser gives slightly lower recall and precision for clausal complement (ccomp). Nominal subject (nsubj), conjunct (conj), root, direct object in light verb construction (dobj-lvc), and noun compound modifier (nn), as well as possession modifier (poss), adjectival modifier (amod), and adverbial modifier (advmod), further show slightly lower recall and precision. However, the remaining relations achieve higher recall and precision. For the relations numeric modifier (num) and accusative marker (acc), the parser shows the same recall and precision, and for determiner (det) it shows the same recall as the baseline. The relations object of a preposition (pobj), adjectival modifier (amod), direct object in light verb construction (dobj-lvc), determiner (det), numeric modifier (num), and auxiliary (aux) obtain scores above 90% for both recall and precision. The possession modifier (poss) and noun compound modifier (nn) also score above 90% for recall and precision, respectively. Irrespective of the differences in recall and precision scores achieved by the parser, copula (cop), clausal complement (ccomp), direct object (dobj), and noun compound modifier (nn) are the most erroneous dependency relations in this experiment.

Table 6.12. Labeled recall and precision on the development set for the 20 most frequent dependency types in the UPDT, when MaltParser is trained on the treebank with fine-grained gold part-of-speech tags and only basic dependency relations.

Dependency Relations   Frequency   Recall (%)   Precision (%)
pobj                       16412        98.60           98.39
poss                       16268        93.32           82.32
prep                       15734        77.58           76.17
punct                      13442        78.94           79.41
amod                        9277        94.92           94.49
nsubj                       8847        71.51           70.72
conj                        8753        70.74           72.97
cc                          7657        81.88           81.22
root                        6010        85.50           85.36
cop                         4427        67.29           76.06
dobj-lvc                    4204        92.38           94.12
advmod                      4168        77.36           73.40
ccomp                       4105        69.41           59.59
det                         3929        96.05           96.87
dobj                        3862        57.99           61.83
nn                          3340        58.98           91.01
num                         2872        93.54           93.54
acc                         2535        75.40           75.40
aux                         2287        92.58           91.38
complm                      2022        84.57           82.68

Table 6.13 compares the results of Experiment 4 for auto and gold part-of-speech tags.

Table 6.13. Labeled and unlabeled attachment scores and label accuracy on the development set when MaltParser was trained on the UPDT with fine-grained part-of-speech tags and merely basic dependency relations.

Basic Experiments with MaltParser

Baseline Experiments   LAS (%)   UAS (%)   LA (%)
Auto Tags                78.84     83.07     88.48
Gold Tags                81.98     85.24     90.78

DepRel Experiments
Auto Tags                79.63     83.42     89.09
Gold Tags                82.38     85.40     91.06

6.2.5 Best Parsing Representation
Aggregated morphological properties and detailed syntactic annotation in a treebank can be complex and difficult for a parser to process. Jelínek (2014) claims that morphological tags may not always benefit the analysis of the syntactic structure of words, and that complex annotation schemes may be inadequate and impair automatic parsing. He presents empirical studies on simplifying the data and the annotation scheme of the Prague Dependency Treebank, and his findings show considerable improvements in accuracy, achieving an 8.3% reduction in error rate with MaltParser.

For the empirical studies presented in Section 6.2, I systematically simplified the annotation schemes for part-of-speech tags and dependency labels, carrying out a total of four different types of empirical studies. As gold tags are never available for out-of-domain data, it would be unrealistic to rely only on results obtained from experiments with gold part-of-speech tags; the experiments were therefore done on both automatically generated and gold part-of-speech tags. Table 6.14 presents a summary of the 8 basic experiments I performed. The results are presented as labeled and unlabeled attachment scores as well as label accuracy. As noted earlier, however, the labeled attachment scores obtained in Experiments 3 and 4 are not comparable with those presented in the baseline results, because each study uses different dependency relation sets.

Table 6.14. Labeled and unlabeled attachment scores, and label accuracy on the development set resulting from 8 empirical studies where MaltParser was trained on the UPDT with different simplifications of the annotation schemes in the part-of-speech tagset and dependency relations. Baseline = experiment with the fine-grained annotated treebank, CPOS = experiment with coarse-grained part-of-speech tags and fine-grained dependency relations, 1 LVC = experiment with fine-grained part-of-speech tags and dependency relations free from distinctive features in light verb constructions, and Basic DepRel = experiment with fine-grained part-of-speech tags and merely basic dependency relations.

Basic Experiments with MaltParser

Auto Experiments   LAS (%)   UAS (%)   LA (%)
Baseline             78.84     83.07     88.48
CPOS                 79.24     83.45     88.43
1 LVC                79.46     83.52     88.86
Basic DepRel         79.63     83.42     89.09

Gold Experiments
Baseline             81.98     85.24     90.78
CPOS                 82.00     85.37     90.54
1 LVC                82.38     85.58     90.93
Basic DepRel         82.38     85.40     91.06

To sum up the four experiments, I can conclude that:

1. Using coarse-grained part-of-speech tags in the dependency representation improves parsing performance without losing any information. By using the part-of-speech tagger TagPer I can recreate and restore this information at the end once the parsing is done. Thus, fined-grained part-of-speech tags can still be in the output. Considering the part-of-speech tags in the UPC, it is worth noting that I had already simplified these properties to some extent, as described in Section 3.2.3, when improving the tagset of the Bijankhan Corpus. Although I have not done any parsing study using the entire original Bijankhan tagset and with detailed morphological information, I believe that tags with complex morphological information in terms of number and specification impact parsing accuracy negatively, and the above experiments can support this idea, because further simplifications were beneficial for automatic parsing.

2. The studies further show that simplifying the representation of light verb constructions helps the parser to perform better without loss of important information. In other words, by using coarse LVC, the results become less specific and less informative only with respect to the LVC construction, and yield better parsing performance overall. Furthermore, the lvc specification at the end can mostly be recovered from the part-of-speech tags in the output.

3. Using only basic relations might provide a marginal improvement, but this is not a sufficient justification to remove them, because by eliminating the complex labels I lose essential information that cannot be recovered by the tagger, which affects the quality of parsing analysis. Using the treebank with complex relations provides a richer grammatical analysis that boosts the quality of parsing results.

These results provided us with a valuable insight about how different mor- phosyntactic parameters in data influence the parsing analysis. The studies also have brought us to the point where I shall select the best configuration for further experiments. Specifically, I will use a representation with coarse- grained part-of-speech tags, single LVC representation, and fine-grained de- pendency relations containing both basic and complex labels (96 labels).

6.3 Experiments with Different Parsers

The experiments described in this section are designed to estimate the performance of different parsers on the best performing data representation selected with MaltParser in the basic experiments. Hence, I set up the data with the best parameters found: automatically generated coarse-grained part-of-speech tags, a single LVC label, and the fine-grained dependency relations consisting of 96 basic and complex labels. The treebank is further organized with a different split than in the basic experiments: I train the parser on the joint training and development sets (90%) and test on the test set (10%). I experiment with MaltParser (Nivre et al., 2006), MSTParser (McDonald et al., 2005b), MateParsers (Bohnet, 2010; Bohnet and Nivre, 2012), and TurboParser (Martins et al., 2010).

For evaluating MaltParser, I used Nivre's algorithms, as these were found to be the best parsing algorithms by MaltOptimizer during my previous experiments. The parser achieved scores of 79.40% and 83.47% for labeled and unlabeled attachment, respectively.

In evaluating MSTParser, I used the second-order model with projective parsing, as this setting had yielded the highest results in my earlier parameter tuning experiments. The parser achieved 77.79% for labeled and 83.45% for unlabeled attachment.

For the experiments with MateParsers, I trained the graph-based and transition-based parsers on the UPDT with the best parameters selected. The results showed that the graph-based parser outperformed the transition-based parser, achieving 82.58% for labeled and 86.69% for unlabeled attachment.

For the experiments with TurboParser, I trained the second-order non-projective parser with features for arcs, consecutive siblings, and grandparents, using the AD3 algorithm as a decoder. I adopted the full setting, as it had performed best in my earlier parameter-tuning experiments. The full setting enables arc-factored, consecutive sibling, grandparent, arbitrary sibling, head bigram, grand-sibling (third-order), and tri-sibling (third-order) parts. The parser achieved 80.57% for labeled and 85.32% for unlabeled attachment.

As shown in Table 6.15, the graph-based parser in the Mate Tools achieves the highest results for Persian. The parser thus developed will be treated as the state-of-the-art parser for the language and will be called ParsPer. The parser will undergo further evaluation, presented in more detail in the next section.

Table 6.15. Best results given by the different parsers when trained on the UPDT with auto part-of-speech tags, a single LVC label (1LVC), and complex dependency relations (CompRel) in the model assessment.

Final Results

Evaluations             LAS (%)   UAS (%)   LA (%)
MaltParser                79.40     83.47    88.72
MSTParser                 77.79     83.45    87.11
Mate graph-based          82.58     86.69    90.55
Mate transition-based     81.72     85.94    89.87
TurboParser               80.57     85.32    88.93

6.4 Dependency Parser for Persian: ParsPer

The goal of developing a state-of-the-art syntactic parser for Persian is to apply the parser to new text that has already been automatically segmented and tagged by my tools in the pipeline, namely SeTPer and TagPer (see Chapter 4). As the results of the previous experiments showed, the graph-based MateParser outperformed MaltParser, MSTParser, and TurboParser, obtaining scores of 82.58% and 86.69% for labeled and unlabeled attachment. This means that I now need to train the graph-based MateParser on the entire UPDT with the selected configuration. The parser thus developed is included in the pipeline of tools for automatic processing and analysis of Persian, and is called ParsPer.3 ParsPer has been released as a freely available tool for parsing Persian and is open source under the GNU General Public License. The parser is evaluated further in the next subsection.

6.4.1 The Evaluation of ParsPer In order to assess the performance of ParsPer I conducted an independent pars- ing evaluation as I had done for my earlier tools. I applied ParsPer to the 100 randomly selected sentences with an average sentence length of 28 tokens used in the evaluation of tools previously introduced in the pipeline (PrePer, SeT- Per, and TagPer). For this task I performed three different parsing evaluations. First I ran the parser on the automatically normalized, tokenized and tagged text. In other words, I parsed the text (containing 100 randomly selected sen- tences) that had already passed through the pipeline and been processed by all the developed tools. This is the main experiment in the ParsPer evaluation, and also indicates the performance of other tools in the pipeline and how each process is affected by previous processes, when the output of one tool is the input of the next tool. Next, I performed two more experiments with the 100 randomly selected sentences in order to analyze the results in a more nuanced way, by experimenting on the sentences when they are manually normalized and tokenized but automatically tagged and then, when they are manually nor- malized, tokenized, and tagged. In the automatically tokenized and tagged text experiment, I manually an- notated the manually normalized, tokenized, and tagged gold file that was used in the evaluation of TagPer (see Section 4.1.3) with dependency information using the same dependency scheme on which ParsPer was built, to served as a gold standard. I then parsed the automatically tokenized and tagged text with

3http://stp.lingfil.uu.se/∼mojgan/parsper-mate.html

169 ParsPer. As the automatically tokenized text contained 10 fewer4 tokens than the gold file (the number of tokens in the gold file was 2788 and in the automat- ically processed file was 2778) I cannot directly present labeled and unlabeled attachment scores. Instead, however, I present scores for labeled recall and precision, as well as unlabeled recall and precision. The parsing evaluation revealed labeled recall and precision scores of 73.52% and 73.79%, and un- labeled recall and precision scores of 81.99% and 82.28%, respectively. As could be expected, the results for labeled recall and precision are low. This is due to the fact that in addition to there being incorrect tokens in the automat- ically tokenized file, incorrect part-of-speech tags have had a negative impact on the results. I then automatically parsed the manually normalized, tokenized, but au- tomatically tagged text and compared the parsing results with the manually parsed gold text. By this experiment, I wanted to isolate the impact of tagging errors. The evaluation resulted in labeled and unlabeled attachment scores of 78.50% and 86.27% on the test set with 100 sentences and 2788 tokens. As the results indicate, the unlabeled attachment score is close to the unlabeled attachment score obtained by the parser when evaluated on in-domain text. Furthermore, the unlabeled attachment score is 7.77% higher than the labeled attachment score. This may partly be due to fact that the structural variation for the head nodes is lower than the variation for labels. Moreover, I have a firm structure for the head nodes in the syntactic annotation when invariably choosing content words as head position. The solidity of this structure in turn makes it easier for the parser to learn the structure after repeatedly seeing it. Hence, the parser assigns the head nodes more accurately than the combina- tions of head and label. This does not mean that I do not follow a consistent structure for the dependency relations. What I mean is that the number of oc- currences of certain cases for dependency relations may not be the same as the number of repeated cases for head structures. This might be perceived as a sparseness on the part of the parser, which can directly affect the labeled at- tachment score. Moreover, the syntactic (non)complexity of the data can have a direct impact on parser performance. Finally, I automatically parsed the manually normalized, tokenized, and tagged text (the gold file in the tagging evaluation) and compared the parsing with the manually parsed gold file. The evaluation resulted in straightforward labeled and unlabeled attachment scores of 78.76% and 86.12% for the test set with 100 sentences and 2788 tokens. The same kind of pattern as in the previous experiment was also found here. In other words, we see a nearly identical gap of 7.36% between the labeled and unlabeled attachment scores. Table 6.16 shows results from different evaluations of the ParsPer.

⁴ In addition to the 10 fewer tokens, two more tokens had not been successfully normalized by PrePer in the normalization process and therefore looked different (see Section 4.1.3). Hence, the difference in token count was 12.

Table 6.16. The evaluation of ParsPer when tested on 100 randomly selected sentences from the web-based journal Hamshahri. LAS = Labeled Attachment Score, UAS = Unlabeled Attachment Score, LA = Label Accuracy, LR = Labeled Recall, LP = Labeled Precision, UR = Unlabeled Recall, UP = Unlabeled Precision, AS = Automatically Segmented, AT = Automatically Tagged, AP = Automatically Parsed, MS = Manually Segmented, and MT = Manually Tagged.

Results of Out-of-domain Data Evaluations

       AS+AT+AP (%)   MS+AT+AP (%)   MS+MT+AP (%)
LAS    –              78.50          78.76
UAS    –              86.27          86.12
LA     –              86.94          87.39
LR     73.52          –              –
LP     73.79          –              –
UR     81.99          –              –
UP     82.28          –              –
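To make the metrics concrete, the sketch below shows one plausible way of computing arc precision and recall when gold and system tokenizations differ, by anchoring tokens to character offsets over the underlying text. This is a minimal illustration of the metric, not the evaluation script actually used for the figures above, and the transliterated toy tokens are hypothetical.

```python
# Minimal sketch (not the thesis's evaluation script) of arc-based precision
# and recall under tokenization mismatch. Tokens are anchored to character
# offsets over the concatenated text; an arc counts as correct if the
# dependent span, head span, and (for labeled scores) the relation all match.

def spans(tokens):
    """Map each token index to its (start, end) character offsets."""
    out, pos = {}, 0
    for i, tok in enumerate(tokens):
        out[i] = (pos, pos + len(tok))
        pos += len(tok)
    return out

def arcs(tokens, heads, labels):
    """Represent each dependency as (dependent span, head span, label)."""
    sp = spans(tokens)
    root = (-1, -1)  # artificial root
    return {(sp[i], sp[h] if h >= 0 else root, lab)
            for i, (h, lab) in enumerate(zip(heads, labels))}

def precision_recall(gold, system, labeled=True):
    keep = (lambda a: a) if labeled else (lambda a: a[:2])
    g, s = {keep(a) for a in gold}, {keep(a) for a in system}
    correct = len(g & s)
    return correct / len(s), correct / len(g)

# Toy example: the system has merged the object marker "ra" with its host,
# so the two files have different token counts but the same character string.
gold = arcs(["ketab", "ra", "xarid"], [2, 0, -1], ["dobj", "acc", "root"])
syst = arcs(["ketabra", "xarid"], [1, -1], ["dobj/acc", "root"])
p, r = precision_recall(gold, syst)
print(f"labeled precision {p:.2%}, recall {r:.2%}")
```

In the toy example only the root arc survives the merge, so labeled precision is 1/2 and recall 1/3, which is exactly why precision and recall, rather than attachment scores, are reported in the first column of Table 6.16.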

A comparison of Experiments 1, 2, and 3 shows that tokenization is a greater problem than tagging for syntactic parsing. Whereas a perfectly tokenized text with tagging errors degrades parsing results by less than 1 percentage point, errors in tokenization may reduce parsing accuracy by as much as 5 percentage points. To some extent, this is probably due to additional tagging errors caused by tokenization errors. It is nevertheless clear that tokenization errors disrupt the syntactic structure more than tagging errors do. Variation in writing style (as mentioned earlier) additionally triggers variation in the tokenization process, which in turn leads to the parser failing to recognize similar sentences with different tokenizations. This normally happens when the parser is unfamiliar with the tokens (or their order) in the sentence, because the structure is not prevalent enough in the training data.

Moreover, by inspecting the evaluation results from the two latter experiments (2 and 3), I discovered that, for instance, the head attachment of the dependency relation light verb construction (lvc) was among the most frequent errors. As mentioned in Section 6.2.3, this might be due to the structural variation of LVCs. Representing the distinct variations of LVCs by only a single lvc label makes it harder for the system to choose between, for instance, a direct object (dobj) and an lvc (when rā is not present to mark the direct object), a prepositional modifier (prep) and an lvc, or an adjectival modifier (amod) and an lvc. Other types of ambiguities are also observed in the results, such as the head attachment of the label possession modifier (poss). On several occasions this label was mistakenly selected instead of the label noun compound modifier (nn). As noted earlier, compound nouns and proper names are not distinguished from ordinary nouns in the part-of-speech layer of the treebank. This may be difficult for the parser to disambiguate and can therefore have a negative effect on the syntactic analysis results.
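Error analyses of this kind can be obtained from a simple confusion count over gold and predicted relations for aligned tokens. The following sketch, with invented label sequences, only illustrates the bookkeeping; it is not the analysis script used in the thesis.

```python
# Sketch of a label confusion count: for one-to-one aligned tokens, tally
# how often each gold relation is predicted as some other relation.
# The label sequences below are invented for illustration.
from collections import Counter

def label_confusions(gold_labels, pred_labels):
    """Count (gold, predicted) label pairs that disagree."""
    return Counter((g, p)
                   for g, p in zip(gold_labels, pred_labels) if g != p)

gold = ["dobj", "dobj-lvc", "prep", "poss", "nn", "amod"]
pred = ["dobj-lvc", "dobj-lvc", "prep-lvc", "nn", "nn", "amod"]

for (g, p), n in label_confusions(gold, pred).most_common():
    print(f"gold {g!r} predicted as {p!r}: {n}")
# Large counts for pairs such as ('dobj', 'dobj-lvc') or ('poss', 'nn')
# would point to exactly the ambiguities discussed above.
```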

Table 6.17. Precision and recall of binned head direction obtained when ParsPer was evaluated on 100 manually tokenized and automatically tagged sentences taken from the web-based journal Hamshahri.

MS+AT+AP
Direction   Recall (%)   Precision (%)
to_root     83.00        83.00
left        95.81        94.03
right       88.42        92.00

MS+MT+AP
Direction   Recall (%)   Precision (%)
to_root     81.00        81.00
left        95.87        94.19
right       88.89        92.27

Table 6.18. Precision and recall of binned head distance obtained when ParsPer was evaluated on 100 manually tokenized and automatically tagged sentences taken from the web-based journal Hamshahri.

MS+AT+AP
Distance   Recall (%)   Precision (%)
to_root    83.00        83.00
1          95.94        92.59
2          85.28        83.88
3–6        78.04        85.76
7–...      83.43        87.13

MS+MT+AP
Distance   Recall (%)   Precision (%)
to_root    81.00        81.00
1          95.87        92.46
2          85.00        84.53
3–6        79.10        84.94
7–...      83.64        88.27

Furthermore, Tables 6.17 and 6.18 show recall and precision for binned head direction and binned head distance in the last two experiments. As seen in Table 6.17, ParsPer predicts left arcs with higher recall and precision than right arcs. Looking at Table 6.18, we see that the highest recall and precision are achieved for tokens whose head is at distance one. It might be possible to improve parsing performance by extending or modifying the part-of-speech tag set, as well as by eliminating or modifying some structures in the syntactic annotation scheme that do not favor the parser. However, I will not go into the topic of further improvements here; this matter will have to be left for future research.
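For reference, binned scores of the kind shown in Tables 6.17 and 6.18 follow the format of the CoNLL shared-task evaluation script. The sketch below computes recall and precision for binned head direction on invented toy data (head distance can be binned analogously); it is an illustration of the metric, not the tool actually used.

```python
# Sketch of binned head-direction scoring as in Table 6.17: each token is
# binned by where its gold or predicted head lies (to_root / left / right),
# and recall and precision are computed per bin. Toy data, 1-based positions,
# head index 0 denoting the artificial root.
from collections import Counter

def direction(i, head):
    if head == 0:
        return "to_root"
    return "left" if head < i else "right"

def binned_direction_scores(gold_heads, pred_heads):
    gold_bins, pred_bins, correct = Counter(), Counter(), Counter()
    for i, (g, p) in enumerate(zip(gold_heads, pred_heads), start=1):
        gold_bins[direction(i, g)] += 1
        pred_bins[direction(i, p)] += 1
        if g == p:                       # head attachment is correct
            correct[direction(i, g)] += 1
    return {b: (correct[b] / gold_bins[b],                       # recall
                correct[b] / pred_bins[b] if pred_bins[b] else 0.0)  # precision
            for b in gold_bins}

gold = [2, 0, 2, 5, 3]   # invented gold head indices for five tokens
pred = [2, 0, 5, 5, 3]   # invented predicted head indices
for b, (r, p) in binned_direction_scores(gold, pred).items():
    print(f"{b}: recall {r:.2%}, precision {p:.2%}")
```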

7. Conclusion

The goal of this thesis project was to develop open source morphosyntactic corpora and tools for natural language processing of Persian. To achieve this goal, I adopted two key requirements: compatibility and reuse. The compatibility requirement stipulates that (1) the tools should run in a pipeline where the output of one tool is compatible with the input requirements of the next, and (2) the tools have to deliver the same analysis as is found in the annotated corpora. The reuse requirement was primarily chosen as a practical necessity. Two research questions were formulated and have been discussed throughout this thesis. In this chapter they are revisited and briefly discussed in order to highlight the contributions made by this work. The questions were:

Q-1 How can we develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse?

Q-2 How accurately can we perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora?

In response to question Q-1, Chapter 3 provides a detailed description of how I have handled challenges related to tokenization with respect to the lack of standardization in Persian orthography. Modifying the Bijankhan Corpus for higher-level linguistic analysis was the basic procedure of my thesis, as a subset of the corpus was employed in developing a dependency treebank for Persian. With respect to the interaction between different linguistic levels, which introduces challenges for segmentation and annotation, I had to make decisions concerning issues ranging from tokenization to syntactic analysis. The most challenging cases concerned the handling of fixed expressions and different types of clitics, such as pronominal and copula clitics, as they are written in various forms in Persian texts: they are sometimes segmented and sometimes unsegmented from their head words. Manually merging (or separating the attached forms of) fixed expressions and separating clitics from the head words in a consistent way in a corpus as large as the Bijankhan Corpus was impossible in the time available. On the other hand, handling such cases automatically was also impossible, because this could result in many incorrect conversions by affecting orthographically similar words/endings with different part-of-speech categories.

Moreover, automatic conversion could impact words that are not exact homographs but share the same endings. Therefore, to avoid introducing such errors into the corpus, I decided to handle fixed expressions as distinct tokens and not to separate clitics from their head words, but rather to analyze them with special labels at the syntactic level. In other words, as described in Chapter 5, in the syntactic annotation I analyzed fixed expressions as multi-word expressions and treated clitics as complex unsegmented word forms by annotating them with complex dependency labels. Hence, in the treebank, apart from the 48 dependency labels for basic relations, I have 48 complex dependency labels to cover the syntactic relations of words containing unsegmented clitics. The complex labels consist of two or more labels separated by either a backslash or a forward slash, depending on the function of the clitics (see Section 5.5). Thus, by improving the segmentation and annotation of the Bijankhan Corpus, and adding a syntactic annotation layer, I made sure that I best satisfied my requirements of compatibility and reuse without resegmenting and reannotating the entire corpus from scratch. The approach used was to accept the tokenization (or orthographic) variations in the input data in order to achieve robustness. Many evaluations to date have been performed on cleaned-up data, hiding tokenization variations from the system, which gives unrealistic performance estimates. Typical variations that normally exist in out-of-domain data, in particular the orthographic variations in Persian texts, can directly impact tokenization and require different adjustments for morphosyntactic analysis.
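As an illustration of the complex labels just described, the sketch below decomposes a label into its component relations and separators. The decomposition function is my own illustration; the actual interpretation of "/" versus "\" is given in Section 5.5, not here.

```python
# Minimal sketch of how the complex UPDT labels could be decomposed for
# downstream processing: a label is split into its component relations and
# the separators ('/' or '\') that encode the clitic's function. The example
# labels are taken from the UPDT label inventory (Appendix A).
import re

def decompose(label):
    """Split a (possibly complex) label into relations and separators."""
    parts = re.split(r"([/\\])", label)
    relations = parts[0::2]   # e.g. ['dobj', 'pc']
    separators = parts[1::2]  # e.g. ['/']
    return relations, separators

for label in ["dobj", "dobj/pc", "root\\pobj", "ccomp/pc/cop"]:
    rels, seps = decompose(label)
    print(label, "->", rels, seps)
# 'dobj/pc', for instance, combines the basic relation dobj with a label for
# an unsegmented clitic (presumably a pronominal clitic, per Chapter 5).
```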

Question Q-2 is addressed thoroughly in Chapters 4 and 6, where I present a pipeline containing tools for morphosyntactic processing and analysis of Persian. In these chapters I describe how various standard tools are developed on the resources presented in Chapters 3 and 5. In other words, for all tools developed in the pipeline I have made use of standard methods and state-of-the-art tools, in particular the sentence segmentation and tokenization tools in Uplug (Tiedemann, 2003), the part-of-speech tagger HunPoS (Halácsy et al., 2007), and the graph-based parser in Mate Tools (Bohnet, 2010). In addition, MaltParser was used as the main tool for bootstrapping the corpus data during the treebank development. In reusing existing tools, which was a practical necessity for my project, I also made sure that the development satisfied my requirement of compatibility. To achieve this, all tools developed are compatible and run in a pipeline where the output of one tool matches the input requirements of the next. Furthermore, since there is a direct connection between tools and annotation, the annotated data used for training and evaluation are also compatible with the tools. More precisely, the tools render the same analysis that is found in the annotated corpora. Therefore, having domain variation in the annotated corpora, in terms of different genres and tokenization variations related to orthographic variation, was one of my highest priorities for achieving efficiency and robustness when applying the tools to out-of-domain texts. For each and every process, from normalization to syntactic parsing, I have developed a tool that is compatible with my annotated corpora.

In my pipeline of tools for automatic processing and analysis of Persian, I introduced the tools PrePer for normalization, SeTPer for sentence segmentation and tokenization, TagPer for part-of-speech tagging, and ParsPer for syntactic parsing. Detailed descriptions of PrePer, SeTPer, and TagPer are given in Chapter 4, and the creation of ParsPer is presented in Chapter 6. Each tool developed for the pipeline was evaluated on an out-of-domain text containing 100 randomly selected sentences taken from the web-based journal Hamshahri, in addition to the evaluations I made on in-domain text. In this chapter I only report the results achieved by the tools when tested on the out-of-domain text. The evaluation of SeTPer showed an accuracy of 100% for the sentence segmenter, as well as 99.25% recall and 99.59% precision for the tokenizer, when tested on a text already normalized by PrePer. In the tagging evaluation, TagPer reached an F-score of 98.09%. Finally, the parsing evaluation revealed labeled recall and precision of 73.52% and 73.79%, and unlabeled recall and precision of 81.99% and 82.28%, respectively. The resources and tools developed in this project are open source and freely available (see the descriptions in the relevant chapters).

To sum up, the main contributions of this thesis in terms of resources and tools are:

1. Resources:
   1. The Uppsala Persian Corpus (UPC)
   2. The Uppsala Persian Dependency Treebank (UPDT)

2. Tools:
   1. Preprocessor for Persian (PrePer)
   2. Sentence Segmenter and Tokenizer for Persian (SeTPer)
   3. Part-of-Speech Tagger for Persian (TagPer)
   4. Dependency Parser for Persian (ParsPer)
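The compatibility requirement can be pictured as simple function composition over the four tools just listed. The sketch below chains four stub functions in the same order as the pipeline; the bodies are placeholders for illustration only and do not reproduce the actual behavior or interfaces of PrePer, SeTPer, TagPer, and ParsPer.

```python
# Stub pipeline illustrating the compatibility requirement: each stage
# consumes exactly what the previous stage produces. All bodies are
# placeholders, not the real tools.

def preper(raw_text):
    """Normalize the raw text (character variants, spacing, etc.)."""
    return raw_text.strip()

def setper(normalized):
    """Split into sentences and tokenize: returns a list of token lists."""
    return [s.split() for s in normalized.split(".") if s.strip()]

def tagper(sentences):
    """Tag every token: returns (token, tag) pairs per sentence."""
    return [[(tok, "N") for tok in sent] for sent in sentences]

def parsper(tagged):
    """Attach each token to a head with a label (CoNLL-like 4-tuples)."""
    return [[(tok, tag, 0, "root") for tok, tag in sent] for sent in tagged]

def pipeline(raw_text):
    # No format conversion is needed between stages; that is the point
    # of the compatibility requirement.
    return parsper(tagper(setper(preper(raw_text))))

print(pipeline("in yek azmayesh ast."))  # hypothetical transliterated input
```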

My research attempts to contribute to the field of natural language processing by discussing various important issues and challenges in the automatic morphosyntactic processing and analysis of Persian. I further explore different methods for handling noisy data to address challenges relating to Persian orthography, morphology, and syntax. The methodologies described in this thesis, from the decisions about handling tokenization issues to the innovative analysis used in developing the Persian dependency treebank, all of which are empirically evaluated, bring new insights and ideas to the field. These methods, with their emphasis on handling variation in tokenization, may deviate from the abstract linguistic conventions used in the literature, but they are able to cope with common difficulties in user-generated texts caused by the lack of a common standard for Persian orthography. Based on these ideas, I developed a pipeline of resources and tools for Persian that can easily be employed on out-of-domain texts.

In my future work, I intend to bring the Uppsala Persian Corpus and the Uppsala Persian Dependency Treebank to a higher level by converting them to the framework of Universal Dependencies. Given that Persian is a pro-drop language with a large number of dropped/null subjects and objects, verb endings are of great importance in carrying information about subjects as well as objects. Therefore, in future annotation schemes I plan to handle verb endings differently than in the current schemes of the UPC and UPDT. Since the encoding labels in the corpora will be changed, the results of the data-driven tools will change accordingly and hopefully improve. This means that a new tagger and parser will be developed for Persian, and I hope that with the new TagPer and ParsPer I will be able to cover a wider set of structural variations for heads and dependency labels in out-of-domain data. Furthermore, I hope that the new ParsPer, in particular, will more easily distinguish sentence subjects from objects.

It is important to continuously improve the resources and tools that have been created. This is an advantage of the reusability and compatibility requirements that I imposed on the tools in my pipeline. Being able to easily reuse and modify data is a crucial feature for achieving high-quality tools based on these resources. It further facilitates adaptation to different needs. Hopefully, the approaches I chose and the different solutions I found for Persian in this thesis can benefit work on other languages with similar linguistic and orthographic characteristics.

References

Adesam, Yvonne (2012). “The Multilingual Forest, Investigating High-quality Parallel Corpus Development”. PhD thesis. Stockholm University.
Aduriz, I., M. J. Aranzabe, J. M. Arriola, A. Atutxa, A. Díaz de Ilarraza, A. Garmendia, and M. Oronoz (2003). “Construction of a Basque Dependency Treebank”. In: Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (TLT), pp. 201–204.
Afonso, Susana, Eckhard Bick, Renato Haber, and Diana Santos (2002). “Floresta Sintáctica: A Treebank for Portuguese”. In: Proceedings of the Third International Conference on Language Resources and Evaluation, pp. 1698–1703.
AleAhmad, Abolfazl, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, and Farhad Oroumchian (2009). “Hamshahri: A Standard Persian Text Collection”. Journal of Knowledge-Based Systems 22.5, pp. 382–387.
Aranzabe, María Jesús, Arantza Díaz de Ilarraza, Nerea Ezeiza, Kepa Bengoetxea, Iakes Goenaga, and Koldo Gojenola (2012). “Combining Rule-Based and Statistical Syntactic Analyzers”. In: Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages, pp. 48–54.
Aroonmanakun, Wirote (2007). “Thoughts on Word and Sentence Segmentation in Thai”. In: Proceedings of the SNLP2007 Symposium on Natural Language Processing, pp. 85–90.
Assi, Mostafa S. (2005). PLDB Persian Linguistics Database Pažuhešgaran (Researchers). Technical Report. Institute for Humanities and Cultural Studies.
Astiri, Ahmad, Mohsen Kahani, and Hadi Qaemi (2013). Furqan Quran Corpus. Technical Report. Web Technology Laboratory, University of Mashhad.
Aston, Guy and Lou Burnard (1998). Exploring the British National Corpus with SARA. Cambridge University Press.
Ballesteros, Miguel and Joakim Nivre (2012). “MaltOptimizer: A System for MaltParser Optimization”. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pp. 833–841.
Baluch, Bahman (1992). “Reading with and without Vowels: What Are the Psychological Consequences?” Journal of Social and Evolutionary Systems 15, pp. 95–104.
Bargi, Alan Aziz (2011). Virastar. URL: https://github.com/aziz/virastar.
Bick, Eckhard (2003). “Arboretum, a Hybrid Treebank for Danish”. In: Proceedings of the Second Workshop on Treebanks and Linguistic Theories, pp. 9–20.
Bijankhan, Mahmood (2004). “The Role of the Corpus in Writing a Grammar: An Introduction to a Software”. Iranian Journal of Linguistics 19, pp. 38–67.
Bijankhan, Mahmood, Javad Sheykhzadegan, Mohammad Bahrani, and Masood Ghayoomi (2011). “Lessons from building a Persian written corpus: Peykare”. Language Resources and Evaluation 45.2, pp. 143–164.

Björkelund, Anders, Bernd Bohnet, Love Hafdell, and Pierre Nugues (2010). “A High-Performance Syntactic and Semantic Dependency Parser”. In: Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations (COLING ’10), pp. 33–36.
Bögel, Tina, Miriam Butt, and Sebastian Sulger (2008). “Urdu Ezafe and the Morphology-Syntax Interface”. In: Proceedings of the LFG08 Conference. Ed. by Miriam Butt and Tracy Holloway King, pp. 129–149.
Boguslavsky, Igor, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Kreidlin, and Nadezhda Frid (2000). “Dependency Treebank for Russian: Concept, Tools, Types of Information”. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING), pp. 987–991.
Bohnet, Bernd (2010). “Top Accuracy and Fast Dependency Parsing is not a Contradiction”. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pp. 89–97.
Bohnet, Bernd and Jonas Kuhn (2012). “The Best of Both Worlds: A Graph-based Completion Model for Transition-based Parsers”. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL ’12), pp. 77–87.
Bohnet, Bernd and Joakim Nivre (2012). “A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL ’12), pp. 1455–1465.
Bosco, Cristina and Vincenzo Lombardo (2004). “Dependency and Relational Structure in Treebank Annotation”. In: Proceedings of the Workshop Recent Advances in Dependency Grammar, pp. 9–16.
Bosco, Cristina, Simonetta Montemagni, and Maria Simi (2013). “Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank”. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp. 61–69.
Brants, Sabine, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith (2002). “The TIGER Treebank”. In: Proceedings of the First Workshop on Treebanks and Linguistic Theories (TLT), pp. 24–42.
Brants, Thorsten (2000). “TnT – A Statistical Part-of-Speech Tagger”. In: Proceedings of the 6th Applied Natural Language Processing Conference (ANLP), pp. 224–231.
Brill, Eric (1995). “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging”. Journal of Computational Linguistics 21.4, pp. 543–565.
Buchholz, Sabine and Erwin Marsi (2006). “CoNLL-X Shared Task on Multilingual Dependency Parsing”. In: Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), pp. 149–164.
Capková, Sofia Gustafson and Britt Hartmann (2006). Manual of the Stockholm Umeå Corpus Version 2.0. URL: http://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf.
Carreras, Xavier (2007). “Experiments with a higher-order projective dependency parser”. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 957–961.

Chang, Pi-Chuan, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning (2009). “Discriminative reordering with Chinese grammatical relations features”. In: Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009, pp. 51–59.
Čmejrek, Martin, Jan Cuřín, Jiří Havelka, Jan Hajič, and Vladislav Kuboň (2004). “Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation”. In: Proceedings of the IV International Conference on Language Resources and Evaluation, pp. 1597–1600.
Covington, Michael A. (2001). “A Fundamental Algorithm for Dependency Parsing”. In: Proceedings of the 39th Annual ACM Southeast Conference, pp. 95–102.
Crammer, Koby, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer (2006). “Online Passive-Aggressive Algorithms”. Journal of Machine Learning Research 7, pp. 551–585.
Daelemans, Walter, Jakob Zavrel, Peter Berck, and Steven Gillis (1997). “A Memory-Based Part-of-Speech Tagger Generator”. In: Proceedings of the Fourth Workshop on Very Large Corpora, pp. 14–27.
Davies, Mark (2010). “The Corpus of Contemporary American English as the first reliable monitor corpus of English”. Literary and Linguistic Computing 25.4, pp. 447–464.
de Marneffe, Marie-Catherine, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning (2014). “Universal Stanford Dependencies: A cross-linguistic typology”. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 4585–4592.
de Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D. Manning (2006). “Generating Typed Dependency Parses from Phrase Structure Parses”. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pp. 449–454.
de Marneffe, Marie-Catherine and Christopher D. Manning (2008). “The Stanford Typed Dependencies Representation”. In: Proceedings of the COLING ’08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pp. 1–8.
Dehdari, Jon and Deryle Lonsdale (2008). “A Link Grammar Parser for Persian”. In: Aspects of Iranian Linguistics. Cambridge Scholars Press, pp. 19–34.
Džeroski, Sašo, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, and Andreja Žele (2006). “Towards a Slovene Dependency Treebank”. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pp. 1388–1391.
Earley, Jay (1970). “An Efficient Context-Free Parsing Algorithm”. Journal of Communications of the ACM 13.2, pp. 94–102.
Eisner, Jason (1996). “Three New Probabilistic Models for Dependency Parsing: An Exploration”. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING), pp. 340–345.
Erjavec, Tomaž and Nancy Ide (1998). “The MULTEXT-East Corpus”. In: Proceedings of The First International Conference on Language Resources and Evaluation (LREC), pp. 971–974.

Erjavec, Tomaž, Cvetana Krstev, Vladimír Petkevič, Kiril Simov, Marko Tadić, and Duško Vitas (2003). “The MULTEXT-East Morphosyntactic Specifications for Slavic Languages”. In: Proceedings of The EACL 2003 Workshop on the Morphological Processing of Slavic Languages, pp. 25–32.
Esfahbod, Behdad (2004). Persian Computing with Unicode. URL: http://www.farsiweb.info.
Fung, James G., Dilek Hakkani-Tür, Mathew Magimai Doss, Liz Shriberg, Sebastien Cuendet, and Nikki Mirghafori (2007). “Cross-Linguistic Analysis of Prosodic Features for Sentence Segmentation”. In: INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium (ISCA), pp. 2585–2588.
Garside, Roger, Geoffrey Leech, and Tamás Váradi (1992). The Lancaster Parsed Corpus. A Machine-readable Syntactically Analyzed Corpus of 144,000 Words, Available for Distribution Through ICAME. Technical Report. Bergen: The Norwegian Computing Centre for the Humanities.
Ghayoomi, Masood (2012). “Bootstrapping the Development of an HPSG-based Treebank for Persian”. Journal of Linguistic Issues in Language Technology 7, pp. 105–114.
Ghayoomi, Masood and Jonas Kuhn (2014). “Converting an HPSG Treebank into its Parallel Dependency Treebank”. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pp. 2245–2252.
Giesbrecht, Eugenie and Stefan Evert (2009). “Is Part-of-Speech Tagging a Solved Task? An Evaluation of POS Taggers for the German Web as Corpus”. In: Proceedings of the 5th Web as Corpus Workshop (WAC5), pp. 27–35.
Giménez, Jesus and Lluís Màrquez (2004). “SVMTool: A general POS tagger generator based on Support Vector Machines”. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pp. 43–46.
Hajič, Jan, Barbora Vidová Hladká, and Petr Pajas (2001). “Prague Dependency Treebank: Annotation Structure and Support”. In: Proceedings of the IRCS Workshop on Linguistic Databases, Philadelphia, pp. 105–114.
Hajič, Jan, Otakar Smrž, Petr Zemánek, Petr Pajas, Jan Šnaidauf, Emanuel Beška, Jakub Kráčmar, and Kamila Hassanová (2004). “Prague Arabic Dependency Treebank: Development in data and tools”. In: Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, pp. 110–117.
Halácsy, Péter, András Kornai, and Csaba Oravecz (2007). “HunPos: An Open Source Trigram Tagger”. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Interactive Poster and Demonstration Sessions (ACL), pp. 209–212.
Hashabeiky, Forogh (2005). “Persian Orthography, Modification or Changeover (1850–2000)”. PhD Thesis. Studia Iranica Upsaliensia 7.
Hashabeiky, Forogh (2007). “The Usage of Singular Verbs for Inanimate Plural Subjects in Persian”. Orientalia Suecana, Journal of Indological, Iranian, Semitic and Turkic Studies LVI, pp. 77–101.
Hashemi, Homa B., Azadeh Shakery, and Heshaam Faili (2010). “Creating a Persian-English Comparable Corpus”. In: Proceedings of the International Conference on Multilingual and Multimodal Information Access Evaluation (CLEF), pp. 27–39.

Haverinen, Katri, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter (2013). “Building the Essential Resources for Finnish: The Turku Dependency Treebank”. Journal of Language Resources and Evaluation, pp. 493–531.
Haverinen, Katri, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, and Tapio Salakoski (2010). “Treebanking Finnish”. In: Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories (TLT), pp. 79–90.
Hladká, Barbora and Kiril Ribarov (1998). “Part of Speech Tags for Automatic Tagging and Syntactic Structures”. In: Issues of Valency and Meaning – Studies in Honour of Jarmila Panevová, pp. 226–237.
Hogan, Deirdre, Jennifer Foster, Joachim Wagner, and Josef Van Genabith (2008). “Parser-Based Retraining for Domain Adaptation of Probabilistic Generators”. In: Proceedings of the Fifth International Natural Language Generation Conference, pp. 165–168.
Huang, Chu-Ren, Feng-Yi Chen, Keh-Jiann Chen, Zhao-ming Gao, and Kuang-Yu Chen (2000). “Sinica Treebank: design criteria, annotation guidelines, and on-line interface”. In: Proceedings of the Second Workshop on Chinese Language Processing, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, pp. 29–37.
Hwa, Rebecca (2004). “Sample Selection for Statistical Parsing”. Computational Linguistics 30, pp. 253–276.
Ide, Nancy, Patrice Bonhomme, and Laurent Romary (2000). “XCES: An XML-based Encoding Standard for Linguistic Corpora”. In: Proceedings of the Second International Language Resources and Evaluation Conference, pp. 825–830.
Ide, Nancy, Greg Priest-Dorman, and Jean Véronis (1996). Corpus Encoding Standard (CES). Technical Report. Department of Computer Science, Vassar College, Poughkeepsie, New York.
Jahani, Carina, Behrooz Barjasteh Delforooz, and Maryam Nourzaei (2012). “Non-canonical Subjects in Balochi”. In: Iranian Languages and Culture. Mazda Publishers, pp. 196–218.
James, Gregory, Robert Davison, Amos C. Heung-yeung, and Scott Deerwester (1994). English in Computer Science: A Corpus-based Lexical Analysis. The Hong Kong University of Science and Technology.
Jelínek, Tomáš (2014). “Improvements to Dependency Parsing Using Automatic Simplification of Data”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC ’14), pp. 73–77.
Jeremiás, Éva M. (2003). “New Persian”. In: The Encyclopaedia of Islam, Supplement. Brill Publishers, pp. 426–448.
Jurafsky, Daniel and James H. Martin (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd Edition. Prentice Hall.
Kaplan, Ronald M. (1973). “A General Syntactic Processor”. In: Rustin, R. (ed.), Natural Language Processing, pp. 193–241.
Karimi, Simin (1989). “Aspects of Persian Syntax, Specificity and the Theory of Grammar”. PhD Thesis. University of Washington.
Karimi, Simin (2003). Word Order and Scrambling. Wiley-Blackwell.

Karlsson, Fred, Atro Voutilainen, Juha Heikkilä, and Arto Anttila (1995). Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Walter de Gruyter.
Kasami, Tadao (1965). An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. Technical Report. University of Illinois Coordinated Science Lab.
Kay, Martin (1982). “Algorithm Schemata and Data Structures in Syntactic Processing”. In: Readings in Natural Language Processing. Ed. by Barbara J. Grosz, Karen Sparck Jones, and Bonnie Lynn Webber. Morgan Kaufmann, pp. 35–70.
Keh-Jiann, Chen, Chu-Ren Huang, Feng-Yi Chen, Chi-Ching Luo, Ming-Chung Chang, Chao-Jan Chen, and Zhao-Ming Gao (2003). “Sinica Treebank: Design Criteria, Representational Issues and Implementation”. In: Treebanks: Building and Using Parsed Corpora. Ed. by Anne Abeillé. Kluwer, Dordrecht, pp. 231–248.
Koehn, Philipp (2002). Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Technical Report. Information Sciences Institute.
Koo, Terry and Michael Collins (2010). “Efficient Third-order Dependency Parsers”. In: Proceedings of the 48th Meeting of the Association for Computational Linguistics (ACL ’10), pp. 1–11.
Koskenniemi, Kimmo (1983). “Two-Level Model for Morphological Analysis”. In: Proceedings of the 8th International Joint Conference on Artificial Intelligence, pp. 683–685.
Kroch, Anthony and Ann Taylor (2000). The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). URL: http://www.ling.upenn.edu/hist-corpora.
Kromann, Matthias T. (2003). “The Danish Dependency Treebank and the DTAG Treebank Tool”. In: Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003), pp. 217–220.
Kübler, Sandra, Ryan McDonald, and Joakim Nivre (2009). Dependency Parsing. Morgan and Claypool.
Kučera, Henry and W. Nelson Francis (1967). Computational Analysis of Present-day American English. Brown University Press, 1st Edition.
Kumar, Dinesh and Gurpreet Singh Josan (2010). “Part of Speech Tagger for Morphologically Rich Indian Languages: A Survey”. International Journal of Computer Applications (IJCA) 6.5, pp. 1–9.
Lazard, Gilbert (1992). A Grammar of Contemporary Persian. Translated into English by Shirley A. Lyon. Mazda Publishers.
Leech, Geoffrey and Andrew Wilson (1994). EAGLES Morphosyntactic Annotation. Technical Report. Pisa: Istituto di Linguistica Computazionale.
Maamouri, Mohamed, Ann Bies, Tim Buckwalter, and Wigdan Mekki (2004). “The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus”. In: Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, pp. 102–109.
Manning, Christopher D. and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. The MIT Press.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz (1993). “Building a Large Annotated Corpus of English: The Penn Treebank”. Computational Linguistics 19, pp. 313–330.

Martins, André F. T., Miguel B. Almeida, and Noah A. Smith (2013). “Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Volume 2, pp. 617–622.
Martins, André F. T., Dipanjan Das, Noah A. Smith, and Eric P. Xing (2008). “Stacking Dependency Parsing”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pp. 157–166.
Martins, André F. T., Noah A. Smith, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo (2011). “Dual Decomposition with Many Overlapping Components”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11), pp. 238–249.
Martins, André F. T., Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo (2010). “Turbo Parsers: Dependency Parsing by Approximate Variational Inference”. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10), pp. 34–44.
McDonald, Ryan, Koby Crammer, and Fernando Pereira (2005a). “Online Large-Margin Training of Dependency Parsers”. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 91–98.
McDonald, Ryan and Joakim Nivre (2011). “Analyzing and Integrating Dependency Parsers”. Computational Linguistics 37.1, pp. 197–230.
McDonald, Ryan and Fernando Pereira (2006). “Online Learning of Approximate Dependency Parsing Algorithms”. In: Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 81–88.
McDonald, Ryan, Fernando Pereira, Kiril Ribarov, and Jan Hajič (2005b). “Non-Projective Dependency Parsing Using Spanning Tree Algorithms”. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pp. 523–530.
McDonald, Ryan et al. (2013). “Universal Dependency Annotation for Multilingual Parsing”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 92–97.
Mizan Corpus (2013). Mizan English-Persian Parallel Corpus. Technical Report. Supreme Council of Information and Communication Technology. URL: http://dadegan.ir/catalog/mizan.
Nivre, Joakim (2003). “An Efficient Algorithm for Projective Dependency Parsing”. In: Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pp. 149–160.
Nivre, Joakim (2004). “Incrementality in Deterministic Dependency Parsing”. In: Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together (ACL), pp. 50–57.
Nivre, Joakim (2008a). “Algorithms for Deterministic Incremental Dependency Parsing”. Journal of Computational Linguistics 34.4, pp. 513–553.
Nivre, Joakim (2008b). “Treebanks”. In: Corpus Linguistics: An International Handbook. Vol. 1. Walter de Gruyter, pp. 225–241.
Nivre, Joakim (2009). “Non-Projective Dependency Parsing in Expected Linear Time”. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP), pp. 351–359.

Nivre, Joakim, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret (2007). “The CoNLL 2007 Shared Task on Dependency Parsing”. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pp. 915–932.
Nivre, Joakim, Johan Hall, and Jens Nilsson (2006). “MaltParser: A Data-Driven Parser-Generator for Dependency Parsing”. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pp. 2216–2219.
Nivre, Joakim, Marco Kuhlmann, and Johan Hall (2009). “An Improved Oracle for Dependency Parsing with Online Reordering”. In: Proceedings of the 11th International Conference on Parsing Technologies (IWPT ’09), pp. 73–76.
Oflazer, Kemal, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür (2003). “Building a Turkish Treebank”. In: Treebanks: Building and Using Parsed Corpora. Ed. by Anne Abeillé. Kluwer, Dordrecht, pp. 261–277.
Oroumchian, Farhad, Samira Tasharofi, Hadi Amiri, Hossein Hojjat, and Fahimeh Raja (2006). Creating a Feasible Corpus for Persian POS Tagging. Technical Report. UAE Institution.
Palmer, David D. (2000). “Tokenization and Sentence Segmentation”. In: Handbook of Natural Language Processing. Marcel Dekker, pp. 11–35.
Parker, Robert, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda (2011). English Gigaword Fifth Edition. Technical Report. Linguistic Data Consortium, Philadelphia.
Petrov, Slav, Dipanjan Das, and Ryan McDonald (2012). “A Universal Part-of-Speech Tagset”. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC ’12), pp. 2089–2096.
Petrov, Slav and Dan Klein (2008). “Parsing German with Latent Variable Grammars”. In: Proceedings of the ACL-08: HLT Workshop on Parsing German (PaGe-08), pp. 33–39.
Prokopidis, Prokopis, Elina Desipri, Maria Koutsombogera, Harris Papageorgiou, and Stelios Piperidis (2005). “Theoretical and Practical Issues in the Construction of a Greek Dependency Treebank”. In: Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT), pp. 149–160.
QasemiZadeh, Behrang and Saeed Rahimi (2006). “Persian in MULTEXT-East Framework”. In: Proceedings of Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL, pp. 541–551.
Raja, Fahimeh, Hadi Amiri, Samira Tasharofi, Hossein Hojjat, and Farhad Oroumchian (2007). “Evaluation of Part-of-Speech Tagging on Persian Text”. In: 2nd Workshop on Computational Approaches to Arabic Script-based Languages (CAASL2), pp. 120–127.
Rasooli, Mohammad Sadegh, Manouchehr Kouhestani, and Amirsaeid Moloodi (2013). “Development of a Persian Syntactic Dependency Treebank”. In: Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 306–314.

Ratnaparkhi, Adwait (1996). “A Maximum Entropy Model for Part-of-Speech Tagging”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 133–142.
Roukos, Salim, David Graff, and Dan Melamed (1995). Hansard French/English Corpus. Technical Report. Linguistic Data Consortium, Philadelphia.
Sagae, Kenji and Alon Lavie (2006). “Parser Combination by Reparsing”. In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 129–132.
Sampson, Geoffrey (1995). English for the Computer. The SUSANNE Corpus and Analytic Scheme. Oxford University Press.
Sarkar, Anoop (2011). “Syntax and Parsing”. In: Multilingual Natural Language Processing Applications: From Theory to Practice. Ed. by Daniel M. Bikel and Imed Zitouni. Prentice Hall, pp. 1–51.
Sassano, Manabu and Sadao Kurohashi (2010). “Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing”. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 356–365.
Seraji, Mojgan, Carina Jahani, Beáta Megyesi, and Joakim Nivre (2013). The Uppsala Persian Dependency Treebank Annotation Guidelines. Technical Report. Department of Linguistics and Philology, Uppsala University.
Seraji, Mojgan, Carina Jahani, Beáta Megyesi, and Joakim Nivre (2014). “A Persian Treebank with Stanford Typed Dependencies”. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pp. 2245–2252.
Seraji, Mojgan, Beáta Megyesi, and Joakim Nivre (2012a). “A Basic Language Resource Kit for Persian”. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pp. 2245–2252.
Seraji, Mojgan, Beáta Megyesi, and Joakim Nivre (2012b). “Bootstrapping a Persian Dependency Treebank”. Linguistic Issues in Language Technology 7.18, pp. 1–10.
Seraji, Mojgan, Beáta Megyesi, and Joakim Nivre (2012c). “Dependency Parsers for Persian”. In: Proceedings of the 10th Workshop on Asian Language Resources, COLING 2012, 24th International Conference on Computational Linguistics, pp. 35–44.
Shamsfard, Mehrnoush, Hoda Sadat Jafari, and Mahdi Ilbeygi (2010). “STeP-1: A Set of Fundamental Tools for Persian Text Processing”. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC ’10), pp. 859–865.
Shen, Libin, Giorgio Satta, and Aravind Joshi (2007). “Guided Learning for Bidirectional Sequence Classification”. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 760–767.
Sleator, Daniel and Davy Temperley (1993). “Parsing English with a Link Grammar”. In: Proceedings of the Third International Workshop on Parsing Technologies, pp. 277–292.
Stilo, Donald (2004). “Iranian as Buffer Zone Between the Universal Typologies of Turkic and Semitic”. In: Linguistic Convergence and Areal Diffusion: Case Studies From Iranian, Semitic, and Turkic. Routledge, pp. 35–63.

Taulé, Mariona, Maria Antònia Martí, and Marta Recasens (2008). “AnCora: Multilevel Annotated Corpora for Catalan and Spanish”. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC), pp. 96–101.
Tiedemann, Jörg (2003). “Recycling Translations – Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing”. PhD Thesis. Studia Linguistica Upsaliensia 1.
Toutanova, Kristina, Dan Klein, Christopher D. Manning, and Yoram Singer (2003). “Feature-rich Part-of-Speech Tagging with a Cyclic Dependency Network”. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pp. 173–180.
Tsarfaty, Reut (2013). “A Unified Morpho-Syntactic Scheme of Stanford Dependencies”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 578–584.
Tufis, Dan, Nancy Ide, and Tomaž Erjavec (1998). “Standardized Specifications, Development and Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European Languages”. In: Proceedings of The First International Conference on Language Resources and Evaluation (LREC), pp. 233–240.
van Halteren, Hans (1999). Syntactic Word Class Tagging. Kluwer Academic Publishers.
Windfuhr, Gernot L. (2009). “Persian”. In: The World’s Major Languages. Routledge, pp. 532–546.
Xue, Nianwen, Fei Xia, Fu-Dong Chiou, and Martha Palmer (2005). “The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus”. Natural Language Engineering 11.2, pp. 207–238.
Yamada, Hiroyasu and Yuji Matsumoto (2003). “Statistical Dependency Analysis with Support Vector Machines”. In: Proceedings of the Eighth International Workshop on Parsing Technologies (IWPT), pp. 195–206.
Younger, Daniel H. (1967). “Recognition and Parsing of Context-Free Languages in Time n³”. Journal of Information and Control 10, pp. 189–208.
ZarrabiZadeh, Hamid (2007). Tanzil Project. URL: http://tanzil.net/wiki/Tanzil_Project.
Zeman, Daniel (2008). “Reusable Tagset Conversion Using Tagset Drivers”. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC ’08), pp. 213–218.
Zhang, Yi and Rui Wang (2009). “Cross-Domain Dependency Parsing Using a Deep Linguistic Grammar”. In: Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pp. 378–386.
Zhang, Yue and Stephen Clark (2008). “A Tale of Two Parsers: Investigating and Combining Graph-based and Transition-based Dependency Parsing Using Beam-search”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 562–571.
Zhu, Qi-bo (1989). “A Quantitative Look at the Guangzhou Petroleum English Corpus”. International Computer Archive of Modern English (ICAME Journal) 13, pp. 28–38.

Zhu, Conghui, Jie Tang, Hang Li, Hwee Tou Ng, and Tiejun Zhao (2007). “A Unified Tagging Approach to Text Normalization”. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL), pp. 688–695.


Appendix A. UPDT Dependency Labels

Frequency  Label
 2535  acc
  360  acomp
  681  acomp-lvc
    3  acomp-lvc/pc
    2  acomp/pc
  655  advcl
    8  advcl/cop
    2  advcl/pc
 4157  advmod
   11  advmod/pc
 9211  amod
    4  amod/cop
   62  amod/pc
  583  appos
    3  appos/pc
 2287  aux
  217  auxpass
 7657  cc
 4021  ccomp
   55  ccomp/cop
    1  ccomp\cpobj
    1  ccomp\nsubj
   12  ccomp/pc
    1  ccomp/pc/cop
    6  ccomp\pobj
    8  ccomp\poss
 2022  complm
 8629  conj
   34  conj/cop
   85  conj/pc
    2  conj\pobj
    3  conj\poss
 4427  cop
  185  cpobj

    2  cpobj/pc
  187  cprep
  376  dep
    3  dep/pc
   68  dep-top
   63  dep-voc
 3929  det
 3723  dobj
   16  dobj/acc
 4185  dobj-lvc
   19  dobj-lvc/pc
  123  dobj/pc
  168  fw
  733  mark
 1773  mwe
    1  mwe/pc
  105  neg
 3339  nn
    1  nn/cop
  490  npadvmod
 8653  nsubj
    7  nsubj-lvc
  146  nsubjpass
    1  nsubjpass/pc
  194  nsubj/pc
 2872  num
  313  number
  194  parataxis
    6  parataxis/cop
    4  parataxis/pc
16237  pobj
   13  pobj/cop
  162  pobj/pc
16067  poss
    6  poss/acc
   44  poss/cop
  151  poss/pc
   49  preconj
   51  predet
15643  prep
   41  prep/det
  554  prep-lvc
   49  prep/pc
    1  prep/pobj

  102  prt
13442  punct
   75  quantmod
 1408  rcmod
    2  rcmod\amod
    9  rcmod/cop
    2  rcmod/pc
    2  rcmod\pobj
    2  rcmod\poss
 1410  rel
 5918  root
    1  root\conj
   65  root/cop
    6  root/pc
   13  root\pobj
    7  root\poss
  382  tmod
  133  xcomp


ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia
Editors: Joakim Nivre and Åke Viberg

1. Jörg Tiedemann, Recycling translations. Extraction of lexical data from parallel corpora and their application in natural language processing. 2003.
2. Agnes Edling, Abstraction and authority in textbooks. The textual paths towards specialized language. 2006.
3. Åsa af Geijerstam, Att skriva i naturorienterande ämnen i skolan. 2006.
4. Gustav Öquist, Evaluating Readability on Mobile Devices. 2006.
5. Jenny Wiksten Folkeryd, Writing with an Attitude. Appraisal and student texts in the school subject of Swedish. 2006.
6. Ingrid Björk, Relativizing linguistic relativity. Investigating underlying assumptions about language in the neo-Whorfian literature. 2008.
7. Joakim Nivre, Mats Dahllöf and Beáta Megyesi, Resourceful Language Technology. Festschrift in Honor of Anna Sågvall Hein. 2008.
8. Anju Saxena & Åke Viberg, Multilingualism. Proceedings of the 23rd Scandinavian Conference of Linguistics. 2009.
9. Markus Saers, Translation as Linear Transduction. Models and Algorithms for Efficient Learning in Statistical Machine Translation. 2011.
10. Ulrika Serrander, Bilingual lexical processing in single word production. Swedish learners of Spanish and the effects of L2 immersion. 2011.
11. Mattias Nilsson, Computational Models of Eye Movements in Reading: A Data-Driven Approach to the Eye-Mind Link. 2012.
12. Luying Wang, Second Language Acquisition of Mandarin Aspect Markers by Native Swedish Adults. 2012.
13. Farideh Okati, The Vowel Systems of Five Iranian Balochi Dialects. 2012.
14. Oscar Täckström, Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision. 2013.
15. Christian Hardmeier, Discourse in Statistical Machine Translation. 2014.
16. Mojgan Seraji, Morphosyntactic Corpora and Tools for Persian. 2015.