An Open Source Punjabi Resource Grammar
Shafqat Mumtaz Virk, University of Gothenburg, Sweden
Muhammad Humayoun, University of Savoie, France
Aarne Ranta, University of Gothenburg, Sweden

Proceedings of Recent Advances in Natural Language Processing, pages 70–76, Hissar, Bulgaria, 12-14 September 2011.

Abstract

We describe an open source computational grammar for Punjabi, a resource-poor language. The grammar is developed in GF (Grammatical Framework), which is a multilingual grammar formalism. First, we explore different syntactic features of Punjabi, and then we implement them in accordance with GF grammar requirements, to make Punjabi the 17th language in the GF resource grammar library.

1. Introduction

Grammatical Framework (Ranta, 2004) is a special-purpose programming language for multilingual grammar applications. It can be used to write multilingual resource or application grammars (the two types of grammars in GF).

Multilingualism of the GF grammars is based on the principle that the same grammatical categories (e.g. noun phrases and verb phrases) and syntax rules (e.g. predication) can appear in different languages (Ranta, 2009a). A collection of all such categories and rules, which are independent of any language, makes the abstract syntax of GF grammars (every GF grammar has two levels: abstract and concrete). More precisely, the abstract syntax defines semantic conditions to form abstract syntax trees. For example, the rule that a common noun can be modified by an adjective is independent of any language and hence is defined in the abstract syntax, e.g.:

    Very big blue house
    fun AdjCN : AP → CN → CN ;

[1] In GF code, cat and fun belong to abstract syntax. On the contrary, lincat and lin belong to concrete syntax.

However, the way this rule is implemented may vary from one language to another, as each language may have different word order and/or agreement rules. For this purpose, we have the concrete syntax, which is a set of linguistic objects (strings, inflection tables, records) providing rendering and parsing. We may have multiple parallel concrete syntaxes for one abstract syntax, which makes the GF grammars multilingual. Also, as each concrete syntax is independent from the others, it becomes possible to model the rules accordingly (i.e. word order, word forms and agreement features are chosen according to language requirements).

Current state-of-the-art machine translation systems such as Systran, Google Translate, etc. provide huge coverage but sacrifice precision and accuracy of translations. On the contrary, domain-specific or controlled multilingual grammar based translation systems can provide a higher translation quality, at the expense of limited coverage. In GF, such controlled grammars are called application grammars.

Writing application grammars from scratch can be very expensive in terms of time, effort, expertise and money. GF provides a library called the GF resource library that can ease this task. It is a collection of linguistically oriented but general-purpose resource grammars, which try to cover the general aspects of different languages (Ranta, 2009a).

Instead of writing application grammars from scratch for different domains, one may use resource grammars as libraries (Ranta, 2009b) [2]. This method makes it possible to create the application grammar much faster and with very limited linguistic knowledge.

[2] This idea is influenced by the programming language API tradition, in which a standard general-purpose library is supported by the language. It is then used by programmers to write specific applications.

The number of languages covered by the GF resource library is growing (17 including Punjabi). Previously, GF and/or its libraries have been used to develop a number of multilingual as well as monolingual domain-specific application grammars (see the GF homepage [3] for details on these application grammars).

[3] http://www.grammaticalframework.org/

In this paper, we describe the resource grammar development for Punjabi. Punjabi is an Indo-Aryan language widely spoken in the Punjab regions of Pakistan and India. Punjabi is one of the morphologically rich languages (others include Urdu, Hindi, Finnish, etc.) with SOV word order, partial ergative behavior, and verb compounding. In Pakistan it is written in Shahmukhi, and in India it is written in Gurmukhi script (Humayoun, 2010). Language resources for Punjabi are very limited (especially for the variety spoken in Pakistan). To the best of our knowledge, this work is the first attempt at implementing a computational Punjabi grammar as open-source software, covering a fair part of Punjabi morphology and syntax.

2. Morphology

Every grammar in the GF resource grammar library has a test lexicon, which is built through lexical functions called the lexical paradigms; see (Bringert et al., 2011) for a synopsis. These paradigms take the lemma of a word and make finite inflection tables, containing the different forms of the word, according to the lexical rules of that particular language. A suite of Punjabi resources including morphology and a big lexicon is reported by (Humayoun and Ranta, 2010). With minor required adjustments, we have reused the morphology and a subset of that lexicon, as a test lexicon of about 450 words for our grammar implementation. However, the morphological details are beyond the scope of this paper and we refer to (Humayoun and Ranta, 2010) for more details on Punjabi morphology.

3. Syntax

While morphology is about types and formation of individual words (lexical categories), it is the syntax that decides how these words are grouped together to make well-formed sentences. For this purpose, individual words, which belong to different lexical categories, are converted into richer syntactic categories, i.e. noun phrases (NP), verb phrases (VP), adjectival phrases (AP), etc. With this up-cast, the linguistic features such as word forms, number and gender information, and agreements travel from individual words to the richer categories.

In this section, we explain this conversion from lexical to syntactic categories, and afterwards we demonstrate how to glue the individual pieces together to make clauses. These can then be used to make well-formed sentences in Punjabi. The following subsections explain various types of phrases.

3.1. Noun Phrases

A noun phrase (NP) is a single word or a group of words that does not have a subject and a predicate of its own, and does the work of a noun (Verma, 1974). We now show the structure of the noun phrase in our implementation, followed by a description of its different parts.

Structure: In GF, we represent the NP as a record with three fields, labeled 's', 'a' and 'isPron':

    NP : Type = {s : NPCase => Str ;
                 a : Agr ;
                 isPron : Bool} ;

The label 's' is an inflection table from NPCase to string (NPCase => Str). NPCase has two constructs (NPC Case and NPErg), as shown below:

    NPCase = NPC Case | NPErg ;
    Case = Dir | Obl | Voc | Abl ;

The construct (NPC Case) stores the lexical cases (i.e. Direct, Oblique, Vocative and Ablative) of a noun [4]. As an example, consider the following table for the noun "boy":

    s . NPC Dir => mʊnɖɑ:
    s . NPC Obl => mʊnɖɛ
    s . NPC Voc => mʊnɖi:a
    s . NPC Abl => mʊnɖɛo:ɳ

[4] Punjabi nouns have four lexical cases.

Other than storing the lexical cases of a noun as shown in the above table, we also construct the ergative case (i.e. NPErg in the code above). We do it at the noun phrase level for the following reason: in Urdu, the case markers that follow the nouns in the form of post-positions cannot be handled at the lexical level through morphological suffixes and thus need to be handled at the syntax level (Butt and King, 2002) [5]. This also applies to Punjabi. So we construct the ergative case of a noun by attaching the ergative case marker 'nɛ' to the oblique case of the noun at the NP level. For instance, the ergative form of our running example "boy" is:

    s . NPErg => mʊnɖɛ nɛ

It is used for the subjects of perfective transitive verbs (see Section 3.5 for more details).

The label 'a' represents the agreement feature (Agr) and stores information about gender, number and person that will be used for agreement with other constituents. It is defined as follows:

    Agr = Ag Gender Number Person ;

In Punjabi, the gender can be masculine or feminine; number can be singular or plural; and person can be first, second casual, second with respect, and third person near and far. These are defined as shown below:

    Gender = Masc | Fem ;
    Number = Sg | Pl ;
    Person = Pers1 | Pers2_Casual | Pers2_Respect |
             Pers3_Near | Pers3_Far

Finally, the label 'isPron' is a Boolean parameter, which shows whether the NP is constructed from a pronoun.

    mʊnɖɛ:_boy nɛ:_ErgMarker ru:ʈi:_bread kʰadi:_ate
    The boy ate bread.

From the above examples, we can see that when we have a first or second person pronoun as subject, the ergative case marker is not used (first two examples). On the contrary, it is used in all other cases. So for our running example, i.e. the noun (boy, mʊnɖɑ:), the label 'isPron' is false.

Construction: First, the lexical category noun (N) is converted to an intermediate category, common noun (CN), through the UseN function.

    fun UseN : N → CN ; -- mʊnɖɑ:, boy

CN is a syntactic category, which is used to deal with the modification of nouns by adjectives, determiners, etc. Then, the common noun is converted to the syntactic category noun phrase (NP). The three main types of noun phrases are: (1) common nouns with determiners, (2) proper names, and (3) pronouns. We build these noun phrases through different noun phrase construction functions, depending on the constituents of the NP. As an example, consider (1). We define it with the function DetCN given below:

    hər_every mʊnɖɑ:_boy
    Every boy
    fun DetCN : Det → CN → NP ;

Here (Det) is a lexical category representing determiners. The above function takes the determiner (Det) and the common noun (CN) as parameters and builds the NP, by combining
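As a rough illustration of the NP record from Section 3.1 (the fields 's', 'a' and 'isPron', with the ergative form built at NP level from the oblique form plus the marker nɛ), the structure can be mirrored outside GF. The sketch below is written in Python rather than GF and is not part of the grammar: the names npc, NP_ERG and np_from_noun_forms are hypothetical, and the choice of Pers3_Far as the person value for "boy" is an assumption made only so that the example is complete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Case(Enum):            # Case = Dir | Obl | Voc | Abl
    DIR = "Dir"
    OBL = "Obl"
    VOC = "Voc"
    ABL = "Abl"


class Gender(Enum):          # Gender = Masc | Fem
    MASC = "Masc"
    FEM = "Fem"


class Number(Enum):          # Number = Sg | Pl
    SG = "Sg"
    PL = "Pl"


class Person(Enum):          # the five person values listed in the paper
    PERS1 = "Pers1"
    PERS2_CASUAL = "Pers2_Casual"
    PERS2_RESPECT = "Pers2_Respect"
    PERS3_NEAR = "Pers3_Near"
    PERS3_FAR = "Pers3_Far"


def npc(case: Case) -> tuple:
    """Hypothetical encoding of the constructor 'NPC Case' as a table key."""
    return ("NPC", case)


NP_ERG = ("NPErg",)          # hypothetical encoding of the constructor 'NPErg'


@dataclass(frozen=True)
class Agr:                   # Agr = Ag Gender Number Person
    gender: Gender
    number: Number
    person: Person


@dataclass
class NP:                    # {s : NPCase => Str ; a : Agr ; isPron : Bool}
    s: Dict[tuple, str]
    a: Agr
    is_pron: bool


def np_from_noun_forms(forms: Dict[Case, str], agr: Agr,
                       erg_marker: str = "nɛ") -> NP:
    """Copy the four lexical cases and add the ergative form,
    which is the oblique form followed by the ergative marker."""
    table = {npc(case): forms[case] for case in Case}
    table[NP_ERG] = forms[Case.OBL] + " " + erg_marker
    return NP(s=table, a=agr, is_pron=False)


# Forms of the running example "boy", taken from the table in Section 3.1.
boy = np_from_noun_forms(
    {Case.DIR: "mʊnɖɑ:", Case.OBL: "mʊnɖɛ",
     Case.VOC: "mʊnɖi:a", Case.ABL: "mʊnɖɛo:ɳ"},
    Agr(Gender.MASC, Number.SG, Person.PERS3_FAR),  # person value is assumed
)
print(boy.s[NP_ERG])  # mʊnɖɛ nɛ
```

Keeping the ergative entry in the same table as the lexical cases follows the paper's description of NPCase as a single parameter type with two constructors, so a clause-level rule could select the subject form with one lookup.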