Developing a Tagset for Automated Part-Of-Speech Tagging in Urdu

Developing a tagset for automated part-of-speech tagging in Urdu Andrew Hardie Department of Linguistics and Modern English Language, University of Lancaster [email protected] 1. Abstract While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has hitherto been done in the area of tagset creation for Urdu. The tagset discussed here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Although these guidelines were written to cover the languages of the European Union, they can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo- European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies presented by Urdu grammar. This paper will look at the process of creating one of the necessary resources for the development of a POS tagging system for Urdu, that of a suitable tagset, considering some of the problems encountered along the way. 2. Introduction As part of the EMILLE project1, it was decided to develop a POS tagger for one of the languages of South Asia covered by the project. Urdu was chosen as the language in question for a number of reasons. Firstly, it is widely spoken in the UK, both as a first and second language, and native speakers were available to be consulted at Lancaster where this part of the EMILLE project is taking place. Secondly, as the lingua franca of a multilingual community (that of South Asian Muslims) and the official language of Pakistan, Urdu has considerable political and cultural importance. Thirdly, there are a number of factors that we anticipated would make tagging Urdu more complicated than tagging any other EMILLE language. For example, the right-to-left directionality of the Indo-Perso-Arabic script in which Urdu is written and the presence of grammatical forms borrowed from Arabic and Persian, which are structurally quite distinct from Indo-Aryan forms, mean that Urdu represents a unique challenge within the EMILLE corpora. It seemed the best course of action to confront these problems by choosing Urdu as the language for which to develop POS tagging. The first stage of the work was to develop a tagset for use in Urdu texts and corpora, an area which has not been research extensively heretofore2. The next stage, now underway, is to test the tagset’s usability in manual tagging, and build up a set of tagged texts to serve as training data for the final phase of this part of the project. This will be to automate the tagging and subsequently tag the whole of the EMILLE Urdu corpus. In this paper, the first, completed stage of this process is discussed: the devising of a tagset for Urdu based on the Urdu grammar of Schmidt (1999). 3. Some background on the Urdu language Urdu is an Indo-European languages of the Indo-Aryan branch of the family. It is spoken in India and Pakistan (where it is the main official language) and also throughout the world Urdu is more closely related to Hindi than either is to any other language. Indeed, their high level of similarity has led some to consider them dialects of the same language (as reported by Bhatia and Koul 2000: ix-x). Masica (1991: 27) goes to so far as to suggest that by one definition of a dialect, Urdu and Hindi “are different literary styles based on the same linguistically defined subdialect”. Both originate from the dialect of the Delhi region and share their phonology, morphology and syntax in all but the smallest details. However, Urdu has borrowed a great deal of vocabulary (and its writing system) from Persian and Arabic, whilst Hindi has borrowed much vocabulary from Sanskrit. 1 A project to develop language engineering resources for the languages of South Asia undertaken at the Universities of Lancaster and Sheffield: see Baker et al. (2003). 2 We are only currently aware of one other study into this area, undertaken by the Department of Electronics of the Indian government (personal communication, Dr. I. Hasnain). We did not become aware of this study until a late stage in the research and so it is not discussed further in this paper. 1 The most notable features of Urdu grammar are as follows (see Schmidt 1999 for further detail). Its word order is principally SXOV, with some flexibility in the order of these elements; subject pronouns are frequently dropped. It possesses postpositions rather than prepositions. Inflection on verbs, nouns and adjectives takes the form of fusional affixes, many of which are homophonous with one another. Nouns are inflected for case and number (singular/plural); the suffixes also indicate their gender (masculine/feminine). Gender agreement is marked by suffixes on verbs and adjectives; verbs show agreement either with the subject or with the direct object, although not both at once. Urdu verbs have one simple finite verb form (the subjunctive), two simple forms that may be finite or non-finite (the perfective and imperfective participles), and two further non-finite simple forms (the root and the infinitive). Tense and aspect, however, are mostly expressed through the use of irregular auxiliary elements within the verb phrase; there are also a number of frequently-used semi-auxiliary elements which confer semantic shading. The history of linguistic investigation into Urdu (and Hindi3) is described by Bhatia (1987). The current standard grammar is that of Schmidt (1999), although a great many pedagogical books have also been published (e.g. Bailey et al. 1956). There has also been a certain amount written on the language within the field of theoretical linguistics (e.g. Butt 1995). However, there remains some contention about certain points of its grammar. There is for instance some disagreement as to whether Urdu possesses three cases, or a much larger number. Urdu nouns generally display two clear cases, oblique (most commonly before postpositions) and nominative (elsewhere). A third case, the vocative, is identical with the oblique except in the plural. However, because of the structure of the Urdu noun phrase, postpositions always occur directly after the head noun of the noun phrase which they govern – in stark contrast to English, for instance, where a variety of elements may come between a preposition and the noun it governs. This phenomenon has led some (e.g. Kellogg 1875) to conclude that Urdu postpositions are actually “case suffixes” and that Urdu thus has a much larger variety of cases, possibly including accusative, dative, genitive, ergative, and so on. In this context it may be noted that the case system for other parts of speech (e.g. adjectives) displays the two-way nominative/oblique split that we would predict if the postpositions were not nominal affixes. Another contentious issue relates to whether the language should be considered to display split-ergativity or not. In past and perfective clauses in Urdu, the subject is marked with the postposition nē, which has no other use than to indicate such a subject, whereas the object remains in the nominative case. This can be treated as evidence of split ergativity in the language. However, it has been argued (e.g. by Butt 1995) that nē, rather than marking an ergative case, is a semantic marker of agentivity or volitionality. These differences of opinion on matters of Urdu grammar are not insignificant for the task of designing a set of morphosyntactic categories. Ideally one would wish to compose a markup scheme that does not commit the user to a particular theoretical analysis (as suggested, for example, by Leech 1997), thus making the categories equally acceptable and useful to researchers on either side of the two debates mentioned above. 4. A model for categorisation of the Urdu language To create the linguistic categories of a tagset, it is necessary to have a model of the language to categorise. An ideal approach would be to derive this model from empirical data – however, this cannot be done prior to the creation of a tagset. A native speaker of a language could use their own intuitions about the language as a model, but as none of the researchers on the EMILLE project are native speakers of Urdu, this is not an option. The only remaining option is to make use of a published description of Urdu grammar as a model of the language. The decision was taken to rely on the current standard grammar of Urdu by Schmidt (1999) to furnish a model of the language. It would probably have been preferable to rely on a synthesis of a range of published descriptions; however, in practice this was impossible. Most other recent works fall into two categories: pedagogical manuals, and works in theoretical linguistics that look at Urdu (or “Hindi-Urdu”). It might be assumed that the latter group of studies could be used to compile, in conjunction with Schmidt (1999), a synthesised model of the language on which to base the tagset categories. However, this is not so. Works in theoretical linguistics which concentrate on Hindi-Urdu tend to focus on one aspect of the language to the exclusion of the rest4. Thus they are of little use in 3 Until the early 20th Century, the distinction between Hindi and Urdu was not as clearly defined as it later became. 4 For example, Butt (1995) looks in detail at complex verb phrases, touching in cursory fashion or not at all on other aspects of Urdu grammar.

Developing a Tagset for Automated Part-Of-Speech Tagging in Urdu

Why Is Language Typology Possible?

Modeling Language Variation and Universals: a Survey on Typological Linguistics for Natural Language Processing

Urdu Treebank

Linguistic Profiles: a Quantitative Approach to Theoretical Questions

Morphological Processing in the Brain: the Good (Inflection), the Bad (Derivation) and the Ugly (Compounding)

Inflection), the Bad (Derivation) and the Ugly (Compounding)

Parsing the SYNTAGRUS Treebank of Russian

A Crosslinguistic Verb-Sensitive Approach to Dative Verbs

Chaining and the Growth of Linguistic Categories

Maurice Gross' Grammar Lexicon and Natural Language Processing

Word Classes in Radical Construction Grammar

Corpus Approaches to the Language of Literature