Design of a Rule-Based Stemmer for Natural Language Text in Bengali
Sandipan Sarkar
IBM India

Sivaji Bandyopadhyay
Computer Science and Engineering Department, Jadavpur University, Kolkata

Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 65–72, Hyderabad, India, January 2008. © 2008 Asian Federation of Natural Language Processing

Abstract

This paper presents a rule-based approach for finding the stems of tokens in Bengali text. Bengali is a resource-poor language. The paper starts by introducing the concept of the orthographic syllable, the basic orthographic unit of Bengali. Then it discusses the morphological structure of the tokens for different parts of speech, formalizes the inflection rule constructs and formulates a quantitative ranking measure for potential candidate stems of a token. These concepts are applied in the design and implementation of an extensible architecture of a stemmer system for Bengali text. The accuracy of the system is calculated to be ~89% and above.

1 Introduction

While stemming systems and algorithms have been studied for European, Middle Eastern and Far Eastern languages for some time, such studies in Indic scripts are quite limited. Ramanathan and Rao (2003) reported a lightweight rule-based stemmer in Hindi. Garain et al. (2005) proposed a clustering-based approach to identify stems from Bengali image documents. Majumdar et al. (2006) accepted the absence of a rule-based stemmer in Bengali and proposed a statistical clustering-based approach to discover equivalence classes of root words from electronic texts in different languages including Bengali. We could not find any publication on a Bengali stemmer following a rule-based approach.

Our approach in this work is to identify and formalize rules in Bengali to build a stemming system with acceptable accuracy. This paper deals with the design of such a system to stem Bengali word tokens tagged with their respective parts of speech (POS).

2 Orthographic Syllable

Unlike English or other Western-European languages, where the basic orthographic unit is a character, Bengali uses the syllable. A syllable is typically a vowel core, which is preceded by zero or more consonants and followed by an optional diacritic mark.

However, the syllable we discuss here is orthographic and not phonological, and the two can differ. For example, the phonological syllables of the word কর্তা [kartaa] are কর্ [kar_] and তা [taa], whereas the orthographic syllables are ক [ka] and র্তা [rtaa] respectively. Since the term 'syllable' is mostly used in a phonological context, we use 'o-syllable' to refer to orthographic syllables, which will be a useful tool in this discussion.

Formally, using regular expression syntax, an o-syllable can be represented as C*V?D? where C is a consonant, V is a vowel and D is a diacritic mark or halant. If one or more consonants are present, the vowel becomes a dependent vowel sign [maatraa].

We represent an o-syllable as a triple (C, V, D) where C is a string of consonant characters, V is a vowel character and D is a diacritic mark. All of these elements are optional and their absence is denoted by Ø. V is always represented in independent form.

We define the o-syllabic length |τ| of a token (τ) as the number of o-syllables in τ.

A few examples are provided below:

Token (τ)              O-syllable Form                    |τ|
মা [maa]               (ম,আ,Ø)                            1
চাঁদ [chaa`nd]          (চ,আ,◌ঁ)(দ,অ,Ø)                     2
অগস্ত্য [agastya]        (Ø,অ,Ø)(গ,অ,Ø)(স্ত্য,অ,Ø)             3
আট্কা [aaT_kaa]        (Ø,আ,Ø)(ট,Ø,◌্)(ক,আ,Ø)              3

Table 1: O-syllable Form Examples

3 Morphological Impact of Inflections

As in English, the inflections in Bengali work as suffixes to the stem. A token typically takes the following form:

<token> ::= <stem><inflections>
<inflections> ::= <inflection> | <inflection><inflections>

Typically, Bengali word tokens are formed with zero or one inflection. Example: মায়ের [maayer] < মা [maa] (stem) + য়ের [yer] (inflection)

However, examples are not rare where the token is formed by appending multiple inflections to the stem: করলেও [karaleo] < কর্ [kar_] (stem) + লে [le] (inflection) + ও [o] (inflection), ভাইদেরকেই [bhaaiderakei] < ভাই [bhaai] (stem) + দের [der] (inflection) + কে [ke] (inflection) + ই [i] (inflection).

3.1 Verb

The verb is the most complex POS in terms of inflected word formation. It involves the largest number of inflections and the most complex formation rules.

Like most other languages, Bengali has finite and non-finite verbs. Inflections for non-finite verbs do not depend on tense or person, whereas finite verbs are inflected based on person (first, second and third), tense (past, present and future), aspect (simple, perfect, habitual and progressive), honour (intimate, familiar and formal), style (traditional [saadhu], standard colloquial [chalit] etc.), mood (imperative etc.) and emphasis. Bengali verb stems can yield more than 100 different inflected tokens.

Some examples are: করাতিস [karaatis] < করা [karaa] (stem) + তিস [tis] (inflection representing second person, past tense, habitual aspect, intimate honour and colloquial style), খাইব [khaaiba] < খা [khaa] (stem) + ইব [iba] (inflection representing first person, future tense, simple aspect and traditional style) etc.

A verb token does not contain more than two inflections at a time. The second inflection represents either emphasis or negation.

Example: আসবই [aasabai] < আস্ [aas_] (stem) + ব [ba] (inflection representing first person, future tense, simple aspect and colloquial style) + ই [i] (inflection representing emphasis).

When appended, the inflections may affect the verb stem in four different ways:

1. Inflections can act as a simple suffix and make no change to the verb stem. Examples: করা (stem) + চ্ছি [chchhi] (inflection) > করাচ্ছি [karaachchhi], খা (stem) + ব (inflection) > খাব [khaaba] etc.

2. Inflections can change the vowel of the first o-syllable of the stem. Example: শুধ্রা [shudh_raa] (stem) + স [sa] (inflection) > (শ,উ,Ø)(ধ,Ø,◌্)(র,আ,Ø) + স > (শ,ও,Ø)(ধ,Ø,◌্)(র,আ,Ø) + স > শোধ্রা [shodh_raa] + স > শোধ্রাস [shodh_raasa].

3. Inflections can change the vowel of the last o-syllable of the stem. Example: আট্কা [aaT_kaa] (stem) + ছি [chhi] (inflection) > (Ø,আ,Ø)(ট,Ø,◌্)(ক,আ,Ø) + ছি > (Ø,আ,Ø)(ট,Ø,◌্)(ক,এ,Ø) + ছি > আট্কে [aaT_ke] + ছি > আট্কেছি [aaT_kechhi].

4. Inflections can change the vowels of both the first and the last o-syllable of the stem. Example: ঠোক্রা [Thok_raa] (stem) + ও [o] (inflection) > (ঠ,ও,Ø)(ক,Ø,◌্)(র,আ,Ø) + ও > (ঠ,উ,Ø)(ক,Ø,◌্)(র,ই,Ø) + ও > ঠুক্রি [Thuk_ri] + ও > ঠুক্রিও [Thuk_rio].

3.2 Noun

The noun is simpler in terms of inflected token formation. Zero or more inflections are applied to the noun stem to form the token. Nouns are inflected based on number (singular, plural), article and case [kāraka] (nominative, accusative, instrumental, dative, ablative, genitive, locative and vocative). Unlike verbs, noun stems are not affected when inflections are applied. The inflections applicable to nouns form a different set from those of verbs, and there are fewer of them.

Example: বাড়িটারই [baarhiTaarai] < বাড়ি [baarhi] (stem) + টা [Taa] (inflection representing article) + র [ra] (inflection representing genitive case) + ই [i] (inflection representing emphasis), মানুষগুলোকে [maanushhaguloke] < মানুষ [maanushha] (stem) + গুলো [gulo] (inflection representing plural number) + কে [ke] (inflection representing accusative case) etc.

3.3 Pronoun

The pronoun is almost similar to the noun. However, there are some pronoun-specific inflections, which are not applicable to nouns. These inflections represent location, time, amount, similarity etc.

Example: সেথা [sethaa] < সে [se] (stem) + থা [thaa] (inflection representing location). This inflection is not applicable to nouns.

Moreover, unlike a noun, a pronoun stem may have one or more post-inflection forms.

Example: the stem আমি [aami] becomes আমা [aamaa] (আমাকে < আমা + কে) or মো [mo] (মোদের < মো + দের) once inflected.

3.4 Other Parts of Speech

Other POSs in Bengali behave like nouns in their inflected forms, although the number of applicable inflections is much smaller than that of nouns.

Example: শ্রেষ্ঠতম [shreshhThatama] < শ্রেষ্ঠ [shreshhTha] (adjective stem) + তম [tama] (inflection representing superlative degree), মধ্যে [madhye] < মধ্য [madhya] (post-position stem) + ে [e] (inflection) etc.

4 Design

4.1 Context

As we identified in the previous section, the impact of inflections on stem are different for different [...]

[...] studied these inflections and inflected tokens and framed the rules inspired by the work of Porter (1981). We made the following observations:

1. To find the stem, we need to replace the inflection with the empty string in the word token. Hence all rules take the following form:
<inflection> → ""

2. For rules related to verbs, conditionals are present, but they depend on the o-syllables instead of the 'm' measure defined and described in Porter (1981).

3. For pronouns, the inflection may change the form of the stem. The change does not follow any rule. However, the number of such changes is small enough to handle on an individual basis instead of formalizing them through rules.

4. A set of verb stems, called incomplete verbs, yield tokens that take a completely different form from the stem. Such verbs are very limited in number. Examples: যা [Jaa] (গেলাম [gelaam] etc. are valid tokens for this verb), আস্ (এলাম [elaam] etc. are valid tokens), আছ্ [aachh_] (থাকলাম [thaakalaam], ছিল [chhila] etc. are valid tokens)

5. For non-verb POSs, there is no conditional.
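
Observations 1 and 5 can be illustrated together with a small sketch: for a non-verb token, every rule unconditionally replaces a trailing inflection with the empty string, and because a token may carry several inflections, the rules are applied repeatedly. The code below is only an illustration under stated assumptions, not the system described in this paper: the function name `candidate_stems` and the short list `INFLECTIONS` are hypothetical, and the list holds just a few inflections taken from the examples in Section 3.

```python
# Hypothetical, deliberately tiny inflection list (taken from the examples in
# Section 3); the paper's actual rule set is much larger and POS-specific.
INFLECTIONS = ["দের", "কে", "ই", "ও", "লে"]


def candidate_stems(token, inflections=INFLECTIONS):
    """Collect every string reachable by repeatedly applying <inflection> -> "".

    The rules are unconditional, as stated for non-verb POSs, so the result is
    a set of candidates rather than a single stem.
    """
    seen = {token}
    stack = [token]
    while stack:
        current = stack.pop()
        for suffix in inflections:
            # Strip the suffix only if something is left over: a stem cannot be empty.
            if current.endswith(suffix) and len(current) > len(suffix):
                reduced = current[: -len(suffix)]
                if reduced not in seen:
                    seen.add(reduced)
                    stack.append(reduced)
    return seen


if __name__ == "__main__":
    for candidate in sorted(candidate_stems("ভাইদেরকেই"), key=len):
        print(candidate)
```

For the token ভাইদেরকেই the candidate set contains the expected stem ভাই, but also the over-stripped form ভা, because the final ই of ভাই coincides with the emphasis inflection. Unconditional stripping therefore yields several candidates, which is why some way of ranking candidate stems, as mentioned in the abstract, is needed on top of the rules.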