Development of a Persian Syntactic Dependency Treebank
Total Page:16
File Type:pdf, Size:1020Kb
Development of a Persian Syntactic Dependency Treebank Mohammad Sadegh Rasooli Manouchehr Kouhestani Amirsaeid Moloodi Department of Computer Science Department of Linguistics Department of Linguistics Columbia University Tarbiat Modares University University of Tehran New York, NY Tehran, Iran Tehran, Iran [email protected] [email protected] [email protected] Abstract tions in tasks such as machine translation. Depen- dency treebanks are collections of sentences with This paper describes the annotation process their corresponding dependency trees. In the last and linguistic properties of the Persian syn- decade, many dependency treebanks have been de- tactic dependency treebank. The treebank veloped for a large number of languages. There are consists of approximately 30,000 sentences at least 29 languages for which at least one depen- annotated with syntactic roles in addition to morpho-syntactic features. One of the unique dency treebank is available (Zeman et al., 2012). features of this treebank is that there are al- Dependency trees are much more similar to the hu- most 4800 distinct verb lemmas in its sen- man understanding of language and can easily rep- tences making it a valuable resource for ed- resent the free word-order nature of syntactic roles ucational goals. The treebank is constructed in sentences (Kubler¨ et al., 2009). with a bootstrapping approach by means of available tagging and parsing tools and man- Persian is a language with about 110 million ually correcting the annotations. The data is speakers all over the world (Windfuhr, 2009), yet in splitted into standard train, development and terms of the availability of teaching materials and test set in the CoNLL dependency format and annotated data for text processing, it is undoubt- is freely available to researchers. edly a low-resourced language. The need for more language teaching materials together with an ever- 1 Introduction1 increasing need for Persian-language data process- ing has been the incentive for the inception of our The process of manually annotating linguistic data project which has defined the development of the from a huge amount of naturally occuring texts is a syntactic treebank of Persian as its ultimate aim. In very expensive and time consuming task. Due to the this paper, we review the process of creating the Per- recent success of machine learning methods and the sian syntactic treebank based on dependency gram- rapid growth of available electronic texts, language mar. In this treebank, approximately 30,000 sen- processing tasks have been facilitated greatly. Con- tences from contemporary Persian-language texts sidering the value of annotated data, a great deal of are manually tokenized and annotated at morpholog- budget has been allotted to creating such data. ical and syntactic levels. One valuable aspect of the Among all linguistic datasets, treebanks play an treebank is its containment of near 5000 distinct verb important role in the natural language processing lemmas in its sentences making it a good resource tasks especially in parsing because of its applica- for educational goals. The dataset is developed af- 1 ter the creation of the syntactic valency lexicon of This research is done while working in Dadegan Research Persian verbs (Rasooli et al., 2011c). This treebank Group, Supreme Council of Information and Communications Technology (SCICT), Tehran, Iran. The project is fully funded is developed with a bootstrapping approach by au- by SCICT. tomatically building dependency trees based on the root Raw Sentence SBJ MOS Insert POSDEP AJPP Encoding and I@ à @ QK. úæ JJ.Ó áK @ root Spell Correction Pæst PAn bær mobtæni Pin is that on based it Tokenization and V PR PP ADJ PR POS Tagging (a) A simple projective dependency tree for a Persian sentence: “It is based Verb Analysis on that”’. root Dependency SBJ Parsing Model AJPP Parsing MOS POSDEP Update Model root Manual Error à@ QK. I@ úæJJ.Ó áK @ Correction Retrain the (Treebank Parser PAn bær Pæst mobtæni Pin Annotation) that on is based it Yes PR PP V ADJ PR Add to the Treebank Need to (b) A simple non-projective depen- Dependency Update the dency tree for a Persian sentence: “It Treebank Parsing is based on that”. Model? Figure 1: Examples of Persian sentences with the dependency-based syntactic trees. 1(a) and 1(b) are ex- Figure 2: Diagram of bootstrapping approach in the de- amples of a projective and a non-projective dependency velopment of the dependency treebank. tree, respectively. The first lines show the original words in Persian. The pronunciation and their meanings are shown in the second line and the third line respectively. In der of words, the existence of colloquial texts, the the fourth line, the part of speech (POS) tags of the words pro-drop nature of the Persian language and its com- are presented. Note that the words are written from right to left (the direction of Perso-Arabic script). The depen- plex inflections (Shamsfard, 2011) in addition to the dency relations are described in Table 2. The relation is lack of efficient annotated linguistic data have made shown with an arc pointing from the head to the depen- the processing of Persian texts very difficult; e.g. dent. there are more than 100 conjugates and 2800 de- clensions for some word forms in Persian (Rasooli previous annotated trees. In the next step, automatic et al., 2011b), some words in the Persian language annotation is corrected manually. do not have a clear word category (i.e. the lexical category “mismatch”) (Karimi-Doostan, 2011a) and The remainder of this paper is as follows. In Sec- many compound verbs (complex predicates) can be tion 2, we briefly review the challenges in Persian separable (i.e. the non-verbal element may be sepa- language processing. In Sections 3 and 4, the de- rated from the verbal element by one or more other tails about the annotation process, linguistic and sta- words) (Karimi-Doostan, 2011b). tistical information about the data and the annotator agreement are reported. In Section 5, the conclusion After the development of the Bijankhan corpus and suggestions for future research are presented. (Bijankhan, 2004) with the annotation of word cat- egories, other kinds of datasets have been created 2 Persian Language Processing Challenges to address the need for Persian language process- ing. Among them, a Persian parser based on link Persian is an Indo-European language that is writ- grammar (Dehdari and Lonsdale, 2008), a compu- ten in Arabic script. There are lots of problems tational grammar based on GPSG (Bahrani et al., in its orthography such as encoding problems, hid- 2011), syntactic treebank based on HPSG (Ghay- den diacritics and writing standards (Kashefi et al., oomi, 2012) and Uppsala dependency treebank (Ser- 2010). A number of challenges such as the free or- aji et al., 2012) are the efforts to satisfy the need for syntactic processing in the Persian language. 3 Persian Dependency Treebank 3.1 Motivation With the creation of the Virastyar spell checker soft- root NVE ware (Kashefi et al., 2010), many open-source li- NPP braries were released for Persian word processing NPOSTMOD POSDEP such as POS tagging, encoding refinement, tok- ÐXQ» ñK AK. øXAK P øAëIJ .m root enization, etc. Regarding the need for syntactic anal- kærdæm to bA zijAdi sohbæthAje ysis of Persian texts, we decided to prepare a valu- did (1st, sing) you with a lot speaking(s) able linguistic data infrastructure for Persian syn- V PR PP ADJ N tax. In the first step, there was a need for choosing (a) A simple dependency tree with compound verb from the existing theories of grammar that best suits for a Persian sentence: “I spoke with you a lot”. Persian. Among grammatical theories, we decided The NVE is a relation between a light verb and its nonverbal element. As shown in the tree, not only to choose the dependency grammar. In dependency the nonverbal element is not near the light verb, but grammar, syntactic relations are shown with depen- also it is inflected for plurality (i.e. speakings). dencies between the words. In computational de- pendency grammar, each word has one head and the root head of the sentence is the dependent of an artificial PROG root word (Kubler¨ et al., 2009). A sample depen- VPP dency tree is shown in Figure 1(a) for a Persian sen- POSDEP NPREMOD tence. Note that Persian sentences are written from right to left. root ÐðPú× éKAg áK @ P@ ÐP@X There are several reasons for the preference of mirævæm xAne Pin Pæz dAræm dependency grammar to grammars such as phrase- go (pres.cont., 1st sing.) house this from have (pres., 1st sing.) V N PREM PP V based structure grammars. Although in both of the (b) A simple dependency tree for a Persian sentence with a pro- representations, one can show the syntactic analy- gressive auxiliary: “I am going from this house”. The PROG is a sis of a sentence, dependency representation has the relation between a verb and its progressive auxiliary. power to account for the free word order of many languages such as Turkish (Oflazer et al., 2003) and root VPP Czech (Hajic, 1998) and also Persian. As an exam- POSDEP ple, a sample non-projective dependency tree for the NPREMOD Persian language is shown in Figure 1(b). The re- root I Ã Ñë@ñm'QK. éKA g áK @ éK. cent advances in very fast dependency parsing mod- barnæxAhæm gæSt xAne Pin be els (e.g. (Nivre, 2009; Bohnet and Nivre, 2012)), return (future, neg., 1st sing.) house this to has made the syntactic processing task very popular V N PREM PP in the recent decade. (c) A simple dependency tree for a Persian sen- In the Persian language, in addition to the abun- tence with a an inflected form of a prefixed verb dance of crossings of the arcs, another problem oc- “I will not return to this house.”.