A Hybrid Approach to Develop the First Stemmer in Maithili


Article · December 2019 DOI: 10.1145/3368567.3368581


2 authors:

Sujan Kumar Saha, Birla Institute of Technology, Mesra
Ankur Priyadarshi, Birla Institute of Technology, Mesra


Ankur Priyadarshi, Sujan Kumar Saha
Computer Science and Engineering
Birla Institute of Technology, Mesra
Ranchi, Jharkhand
{priyadarshiankur81, sujan.kr.saha}@gmail.com

ABSTRACT
This paper presents a hybrid approach to extract the stems of Maithili words. The development of stemmers for various languages has received considerable attention from researchers for decades, and stemmers have been developed for most of the major Indian languages. However, we found no prior work on a Maithili stemmer. In this first attempt, we use a hybrid model in which a rule base acts as the core module. We study the inflections of the major parts-of-speech categories and define rules accordingly. The rule-based module is supported by neural word embedding and suffix stripping. Suffix stripping is used to increase coverage, and word embedding is used to restrict erroneous stem generation. The system is evaluated on test data collected from Maithili literature and news text. The final system achieves an accuracy of 84.6%.

CCS CONCEPTS
• Computing Methodologies → Artificial Intelligence; • Artificial Intelligence → Natural Language Processing

KEYWORDS
Stemmer, Maithili Language, Suffix Stripper, Rule-based Stemmer

ACM Reference format:
Priyadarshi and Saha. 2019. A Hybrid Approach to Develop the First Stemmer in Maithili. In Proceedings of FIRE’19

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). FIRE’19, 2019, Kolkata, India. © 2018 Copyright held by the owner/author(s). 978-1-4503-0000-0/18/06...$15.00 https://doi.org/10.1145/1234567890

1 Introduction
This paper presents a hybrid approach to extract the stem from a Maithili word. Information regarding the stem and inflections of a word is required in various natural language processing, information retrieval and other text processing applications. Therefore, the development of stemmers is a key research area, and stemmers have been developed for various languages (English [1], French [2], German [3]). Efforts have also been devoted to developing stemmers in several Indian languages, including Hindi [4], [5], Bengali [6], [7], Telugu [8], Tamil [9], Gujarati [10] and Malayalam [11]. However, we did not find any stemmer for the Maithili language in the literature. Therefore, we worked on the development of the first stemmer in Maithili.

Maithili is an Indo-Aryan language spoken mainly in the Bihar and Jharkhand states of India. It is also used outside India and is the second most predominant language of Nepal. There are around 50 million native speakers of Maithili. In 2003, Maithili was included in the Eighth Schedule of the Indian Constitution as a recognized regional language of India, which allows it to be used in education, government, and other official contexts. In March 2018, Maithili acquired second official language status in the Indian state of Jharkhand. Earlier, Tirhuta was used as the script for written Maithili, but Devanagari has been primarily used since the 20th century.

We used a hybrid approach for development, with a rule-based core module. We study the inflections of Maithili words and develop a rule base for removing them. We found that noun, verb, pronoun and adjective are the major classes that contain inflected words, and we identify various types of inflections in these classes. Multiple inflections are also observed in Maithili, where more than one inflection is appended to the stem to form the word. Such multiple inflections are primarily observed in the noun class. Therefore, parts-of-speech (POS) information plays a role in handling the inflections. As no open POS tagger is available for Maithili, we use an in-house POS tagger during development. The rule-based module studies the orthographic constitution of an input word and removes the inflections to get the stem. However, the rule-based system may also treat a syllable of the actual stem as an inflection and remove it, producing a meaningless stem. To restrict this, we use a word-embedding based module. We use the Maithili Wikipedia dump to train the Word2Vec embedding. These word vectors are used to validate the system-generated stems.

The performance of the rule-based module depends on the coverage of the rules. It is quite costly and time-consuming to develop a rule set that covers all possible inflections. The system might treat an inflected word as the root if the required rule is not defined. Therefore, we use a suffix-stripping module as the final component of the system. This module is also supported by the word-embedding based validation.

As this is the first attempt at Maithili stemmer development, there is no open test data to assess the developed system. For the system evaluation, we create a test data set containing 2000 words.

The sentences in the test data were collected from Maithili literature and e-newspapers. For a comparative analysis, we take the suffix-stripping module alone, without the rule base, as the baseline system. The baseline system achieves an accuracy of 70.9%. With the POS-dependent rule-based module in place, the accuracy of the final system climbs to 84.6%.

2 Related Work
For English, Julie Beth Lovins developed the first stemmer in 1968 [12] and described the theoretical and practical aspects of stemming algorithms. Later, Martin Porter developed a widely known rule-based stemmer [1]. The Porter stemmer strips suffixes using rules over vowel and consonant patterns and is fast.

Ample effort has been devoted to the development of stemmers for Indian languages. For Hindi, Larkey et al. developed a lightweight morphology-based stemmer that uses 27 common suffixes [13]. Similarly, Ramanathan and Rao developed a Hindi stemmer with 65 inflectional suffixes [4]. In 2008, Pandey and Siddiqui introduced an unsupervised Hindi stemmer [5], which achieved an accuracy of 89.9% over training data of 106403 words obtained from the EMILLE corpus. Their approach first segments at the word level and then post-processes the result using a heuristic method.

In 2007, a lightweight stemmer for Bengali was developed by Islam et al. using a predefined suffix list and suffix stripping. That paper describes the inflections of the noun, verb and adjective parts of speech, along with the use of the stemmer in a spell checker. Further, Sarkar and Bandyopadhyay (2008) developed a rule-based stemmer for Bengali which achieved ~89% accuracy [7]. The paper discusses the morphological structure of tokens for different parts of speech and token-based inflections, tested on short stories by Rabindranath Tagore.

Among the Dravidian languages, Sunitha and Kalyani developed a rule-based unsupervised stemmer for Telugu that achieved an accuracy of 84.2% on the raw CIIL Telugu corpus, collected from IIIT Hyderabad [8]. They first segment the words with a word segmenter, then prepare clusters, and finally create a word list with proper decomposition. Ramchandran and Krishnamurthi developed an iterative stemmer for Tamil in 2012 with an effectiveness of 84.32% over 36720 words [9]. The iterative stemmer removes the longest matching suffix at each step and works its way towards the root. Various issues were encountered during computation, such as homographs, irregular verbs, and proper noun derivations.

In 2010, Patel et al. proposed a hybrid stemmer for Gujarati which achieved an accuracy of 67.86% over the EMILLE corpus of 8,525,649 words [10]. The approach was based on Goldsmith’s (2001) take-all-splits method. This method is primarily unsupervised, but the authors included a list of hand-crafted Gujarati suffixes for better learning during the training phase.

3 Methodology
In this section, we discuss the details of the proposed system.

3.1 System Architecture
Figure 1 presents a high-level workflow of the proposed system and shows how the hybrid approach is used to build the stemmer. The core module of the system is rule-based. The rules use parts-of-speech information of the words. To get the POS information, we incorporated an in-house Maithili POS tagger. The POS tagger was developed using manually annotated training data containing ~50K words and achieves ~85% accuracy. First, the input text is pre-processed and tokenized. Tokenization is the process of splitting a sequence into tokens and discarding punctuation.

The next step is the rule-based module, which is discussed in detail in Section 3.2. The rule-based module identifies the inflection and removes it to obtain the stem. A word-embedding based module, discussed in Section 3.3, is used to validate the generated stems. If the rule base is unable to find any match, the word passes through the suffix stripper module, which is discussed in Section 3.4. The stem generated by the stripper is also validated through the word-embedding module. If the validation succeeds, the stem is emitted as the final output; otherwise the input word is returned unchanged.

Figure 1: High Level Workflow of Proposed Stemmer
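The workflow of Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `hybrid_stem` and `make_validator`, the romanized toy tokens, the two-dimensional vectors and the 0.5 threshold are all hypothetical, and the paper does not report the threshold it uses. The sketch also assumes that a word with no vector of its own cannot be validated.

```python
import math
from typing import Callable, Dict, List, Optional

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def make_validator(vectors: Dict[str, Vector], threshold: float) -> Callable[[str, str], bool]:
    """Build a check in the spirit of Section 3.3: the stem must be in the
    embedding vocabulary and be similar enough to the original word."""
    def is_valid(word: str, stem: str) -> bool:
        # Assumption: if either form has no vector, the stem is rejected.
        if word not in vectors or stem not in vectors:
            return False
        return cosine(vectors[word], vectors[stem]) > threshold
    return is_valid


def hybrid_stem(
    word: str,
    pos: str,
    rule_stem: Callable[[str, str], Optional[str]],
    strip_stem: Callable[[str], Optional[str]],
    is_valid: Callable[[str, str], bool],
) -> str:
    """Rules first; suffix stripping only when no rule matches; validate either result."""
    candidate = rule_stem(word, pos)
    if candidate is None:
        candidate = strip_stem(word)
    if candidate is not None and is_valid(word, candidate):
        return candidate
    return word  # validation failed or nothing matched: keep the input word


# Toy stand-ins for the real modules (hypothetical, romanized tokens only).
def toy_rule_stem(word: str, pos: str) -> Optional[str]:
    if pos == "NOUN" and word.endswith("ka") and len(word) > 2:
        return word[:-2]
    return None


def toy_strip_stem(word: str) -> Optional[str]:
    for suffix in ("sabh", "ka"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return None


TOY_VECTORS: Dict[str, Vector] = {
    "deshka": [0.9, 0.1], "desh": [1.0, 0.0],
    "kitAbasabh": [0.1, 0.9], "kitAba": [0.0, 1.0],
    "kalanka": [0.7, 0.7],  # the would-be stem "kalan" has no vector
}
toy_is_valid = make_validator(TOY_VECTORS, threshold=0.5)

for token in ("deshka", "kitAbasabh", "kalanka"):
    print(token, "->", hybrid_stem(token, "NOUN", toy_rule_stem, toy_strip_stem, toy_is_valid))
```

In this toy run the first token is handled by the rule, the second falls through to the suffix stripper, and the third is returned unchanged because its candidate stem is missing from the vocabulary. Passing the three modules in as functions keeps the control flow separate from any particular rule set or embedding model.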


3.2 Rule-based Module
Unlike English and other Western European languages, where a character is considered the basic orthographic unit, Maithili uses the syllable. A syllable is commonly a vowel nucleus, preceded by zero or more consonants and followed by an optional diacritic mark.

In this section, we explain the morphological impact of inflections in various Maithili word classes. As in other north Indian languages such as Hindi and Bengali, the inflections in Maithili are placed as a suffix to the stem. Word formation after inflection can be represented as follows.

    word := stem + inflections
    inflections := null | inflection
    inflections := inflection + inflections

Example: लिखनाय [likhnAy] < लिख [likh] (Stem) + नाय [nAy] (Inflection).

In the examples here, we use the Itrans transliterator (footnote 1) to denote Hindi or Maithili words. The same translation and transliteration approach is used throughout the paper.

Maithili also supports multiple inflections of a token. Example: लोकनिक [lokanik] < लोक [lok] (Stem) + नि [ni] (Inflection) + क [ka] (Inflection)

3.2.1 Noun
Noun inflections in Maithili are comparatively regular with respect to other POS classes. We discuss below the major inflections found in Maithili nouns.

1. On singular form of noun: Singular nouns may end with the inflection ‘क’ [ka]. Example: देशक [deshka] < देश [desh] (Stem) + क [ka] (Inflection), नेताक [netAka] < नेता [netA] (Stem) + क [ka] (Inflection), किताबक [kitAbka] < किताब [kitAb] (Stem) + क [ka] (Inflection).
2. On plural form of noun: In Hindi, plurality is denoted by words like ‘सभी’, which are used as a separate word along with the noun. So, the translation of ‘all kids’ is ‘सभी बच्चे’ in Hindi. However, in Maithili the plural-denoting words like ‘सभ/सब’ are attached to the noun and represented as a single word. Example: किताबसभ [kitAbasabh] < किताब [kitAba] (Stem) + सभ [sabh] (Inflection), लड़कासब [la.Dakasab] < लड़का [la.Daka] (Stem) + सब [sab] (Inflection)
3. On emphasis: In Maithili, tokens that are morphologically affected by emphasis are not common, but there are a few instances such as: लड़कासबस [la.DakAsabsa] < लड़का [la.Daka] (Stem) + सब [sab] (Inflection) + स [sa] (Inflection representing emphasis)

3.2.2 Pronoun
Pronoun inflections are largely similar to noun inflections. However, there are a few major inflections in relative pronouns, such as हिनकर [hinkara] < हिनक [hinak] (Stem) + र [ra] (Inflection), हुनकर [hunkara] < हुनक [hunak] (Stem) + र [ra] (Inflection), ओकरा [okarA] < ओकर [okar] (Stem) + आ [aa] (Inflection), ककरा [kakarA] < ककर [kakar] (Stem) + आ [aa] (Inflection), जकरा [jakarA] < जकर [jakar] (Stem) + आ [aa] (Inflection).

3.2.3 Verb
A large number of inflections are encountered in Maithili verbs. We have grammatically categorised them on the basis of Tense, Person, Aspect and Honour.

1. On no change in verb forms: Example: पढ़लक [pa.Dhalaka] < पढ़ [pa.Dha] (Stem) + लक [laka] (Inflection), देखलक [dekhalaka] < देख [dekha] (Stem) + लक [laka] (Inflection).
2. On present continuous verb forms: If the suffix of the word is ‘इत’ and the following token is ‘अछि’, the verb is in present continuous form. Example: पढइत [pa.Dhaita] अछि [achi] = पढ [pa.Dha] (Stem) + इत [ita] (Inflection) अछि
3. On past tense verb forms: A few inflections occur in the past tense of verbs. Example: देखलहुँ [dekhalahu] = देख [dekha] (Stem) + ल [la] (Inflection) + हुँ [hun] (past tense inflection).
4. On future tense verb forms: In the future tense, the first and second person use ‘ब’ followed by an inflection, whereas the third person uses ‘त’. Examples:
   First person: हम [hum] देखबहुँ [dekhabhu]. देखबहुँ [dekhabuh] < देख [dekha] (Stem) + ब [ba] (first person future inflection) + हुँ [hu] (Inflection).
   Second person: तों [to] देखबह [dekhabah]. देखबह [dekhabah] < देख [dekha] (Stem) + ब [ba] (second person future inflection) + ह [ha] (Inflection).
   Third person: बाबुजी [babuji] देखताह [dekhatAh]. देखताह [dekhatAh] < देख [dekha] (Stem) + त [ta] (third person future inflection) + आह [Ah] (Inflection).
5. On honorific verb forms: Honorific inflections on verbs are common in Maithili. Example: देखलनि [dekhalani] < देख [dekha] (Stem) + ल [la] (Inflection) + नि [ni] (Honorific suffix), पढ़लनि [padhlani] < पढ़ [padh] (Stem) + ल [la] (Inflection) + नि [ni] (Honorific suffix).
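The POS-dependent rules in this section can be illustrated with a small lookup table. This is a sketch under stated assumptions rather than the paper's rule base: the table `RULES`, the function `apply_rules` and the POS labels are hypothetical, only a handful of the suffixes listed in Sections 3.2.1 and 3.2.3 are included, and multiple inflections are encoded as explicit sequences checked longest first so that a short suffix such as ‘क’ is not stripped repeatedly.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule table: for each POS tag, inflection sequences in the order
# they are tried (longer sequences first). Each tuple mirrors
# word := stem + inflection (+ inflection ...).
RULES: Dict[str, List[Tuple[str, ...]]] = {
    "NOUN": [("नि", "क"), ("सब", "स"), ("सभ",), ("सब",), ("क",)],
    "VERB": [("ल", "नि"), ("ल", "हुँ"), ("लक",)],
}


def apply_rules(word: str, pos: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Return (stem, inflections) for the first matching rule of this POS,
    or None when no rule applies (the caller may then try suffix stripping)."""
    for inflections in RULES.get(pos, []):
        suffix = "".join(inflections)
        # Require a non-empty stem so a word is never reduced to nothing.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], inflections
    return None


for token, tag in [("देशक", "NOUN"), ("लोकनिक", "NOUN"), ("देखलनि", "VERB")]:
    print(token, tag, apply_rules(token, tag))
```

Because the table is keyed by POS tag, the same surface ending is treated differently depending on the tag, which is why tagging errors propagate into the stemmer. A result from this step would still need the word-embedding validation before being accepted.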

1 https://www.aczoom.com/itrans/online/

3.2.4 Adjective


In Maithili, adjectives take the following types of inflections:

1. On demonstrative adjectives: Example: एहन [ehan] कलम [kalam]: एहन [ehan] < एह [eh] (Stem) + न [na] (Inflection); ओहन [ohan] कलम [kalam]: ओहन [ohan] < ओह [oh] (Stem) + न [na] (Inflection).
2. On interrogative adjectives: Example: केहन [kehan] किताब [kitab]: केहन [kehan] < केह [keha] (Stem) + न [na] (Inflection)
3. On quantifying adjectives: Example: कतेक [ktaik] < कते [ktai] (Stem) + क [ka] (Inflection), जतेक [jatek] < जते [jate] (Stem) + क [ka] (Inflection).

3.3 Word Embedding Based Validation Module
Word embedding is a technique where tokens or phrases from the lexicon or vocabulary are mapped to vectors of real numbers. Word2vec, proposed by Mikolov et al. [14], is a computationally efficient predictive model for producing word embeddings. It uses a two-layer neural network trained on a large corpus to reconstruct the linguistic contexts of words. The meaning of a word is learned from its neighbours in the sentence and encoded as a vector of real values. Word2vec training takes a large corpus of text as input and produces a vector of several hundred dimensions for each unique word in the corpus. In this work, we train word2vec models using the continuous bag-of-words model (CBOW). We built the Maithili word embedding resource from the Wikimedia dump to give the vocabulary broad coverage of Maithili words.

In the proposed system, we use word embedding as a validation module. This module aims to verify whether a system-generated stem is a valid word or not. The rule-based approach looks for a match in the suffix part of the word. If a match is found, the suffix is replaced by an empty string. It is possible that the matched suffix is not an inflection but a part of the root word, in which case removing it generates a meaningless stem. We aim to restrict this using word embedding. The stem generated by the rule-based approach is first looked up in the word embeddings. If it is not found, the stem is considered invalid, and the system restores the original word. If it is found, we additionally compute the similarity between the vectors of the original word and the generated stem. Only if the similarity is higher than a threshold is the stem considered valid.

Let us take an example of two words, ‘किताबक’ [kitabak] and ‘कलंक’ [kalank]. When stemming is performed on both words, we obtain किताबक [kitabak] < किताब [kitab] (Stem) + क [ka] (Inflection) and कलंक [kalank] < कलं [kalan] (Stem) + क [ka] (Inflection). The word ‘किताब’ is legitimate, but ‘कलं’ has no meaning in Maithili. The word ‘कलं’ is labelled as invalid by the word-embedding module and ignored.

3.4 Suffix Stripper
A large amount of work has been carried out using different approaches to suffix stripping for Indian languages [4], [9], [15]. After studying these approaches, we adopted the following one, which has been used in other Indian languages. The suffix stripper for Maithili works as follows:

1. The input to the suffix stripper is a Maithili string and the list of all possible suffixes.
2. To remove multiple suffixes, stripping is executed iteratively. (Multiple inflections do occur in Maithili, but they are relatively infrequent.)
3. Stripping proceeds from right to left, i.e., it starts from the end of the string.
4. Each stripping operation removes the longest matching suffix and progresses towards the root, using the suffix list initially provided, to generate the final stem.

The issues encountered when using suffix stripping are under-stemming and over-stemming of a string. Example: when ‘किताबक’ and ‘डाक’ are stemmed using the suffix list, both have the suffix ‘क’, and the stemmed words are ‘किताब’ and ‘डा’. The word ‘डा’ is not a legitimate word in Maithili literature. The rule-based technique overcomes this issue by using the POS information of a string before stemming: the suffix rule ‘क’ above is applied only to singular nouns, as discussed. Such over-stemming is the major reason why accuracy drops and root extraction based on a suffix list alone breaks down.

4 Results and Discussions
This section presents the evaluation results and discusses the observations we made during the experiments.

4.1 Dataset
As this is the first attempt at Maithili stemmer development, we did not find any test data with which to evaluate the proposed system. So, we created our own test data for the evaluation. In the test data we tried to accommodate sentences from newswire and literature. Maithili is a resource-poor language with little content on the web. We found only a few websites where machine-readable Maithili text is available. We collected the test data sentences from the Maithili e-newspapers मैथिली जिन्दाबाद (footnote 2) and मिथिला दैनिक (footnote 3), and literature from the e-journals of साहित्य अकादेमी (footnote 4) (Sahitya Akademi: Central Institution for Literary Dialogue). The test data consists of 110 sentences (2000 words), of which 822 words (41.1%) are unique. A human expert manually processed the test data sentences and transcribed the root of each word. Detailed statistics of the test dataset are shown in Table 1.

2 http://www.maithilijindabaad.com/
3 http://www.mithiladainik.in
4 http://sahitya-akademi.gov.in/sahitya-akademi/index.jsp

Information              Number
Inflected Words          1478
Non-inflected Words      522
Unique Words             822
Noun                     680
Pronoun                  335
Adjective                284
Verb                     436
Other POS Tags           265
Total No. of Words       2000

Table 1: Detailed Description of Test Dataset

4.2 System Accuracy
As there is no existing system so far, we use a baseline system to compare the performance of the proposed system. We found in the literature that several suffix-stripping based stemmers have been developed for various Indian languages. So, the suffix stripper module on its own is considered the baseline system in this experiment. The rule-based and word-embedding based modules are not used in the baseline system. The baseline system achieves an accuracy of 70.9%.

Similarly, we also compute the accuracy of the proposed rule-based module individually. The rule-based module achieves an accuracy of 78.5%, so it outperforms the baseline system by 7.6 percentage points. This improvement indicates the advantage of the POS-based rules over generic suffix stripping in inflection handling. When we studied the system-generated stems, we observed that suffix stripping generates more invalid stems than the rule-based module. The final system, with word-embedding based validation, achieves an accuracy of 84.6%. The performance is shown in Table 2.

System                   Accuracy
Baseline System          70.9%
Rule-based Module        78.5%
Final System             84.6%

Table 2: Performance of Stemmer

4.3 Discussions
The rule-based module is dependent on POS information. The in-house Maithili POS tagger has an accuracy of 85.88%. However, the rules demand a more accurate POS tagger: an error in the POS label prevents the appropriate rule from being applied. We expect that an increase in the accuracy of the POS tagger would result in better performance of the rule-based stemmer. Researchers have sometimes used error-free POS labels during system evaluation. For instance, Sarkar and Bandyopadhyay [7] used manual POS labelling with 100% accuracy in their rule-based stemmer. However, in a real application scenario, it is not feasible to provide manual POS labels. So, in our experiments, we did not use manual POS labelling.

We have also computed class-specific statistics. The test data contains 680 nouns, i.e., 34% of the total words. Among these, 575 are inflected, which is a high ratio. The data also contains 436 verbs, 335 pronouns, 284 adjectives and 265 words with other POS tags. We studied other north Indian languages and found that noun inflection is comparatively higher in Maithili. This is because of the inflection on singular nouns in Maithili.

Additionally, we calculated the accuracy on the unique words of the test data. There are 822 unique words in total, on which the system achieved an accuracy of 72.73%. The performance can be further enhanced by increasing the training data and incorporating additional linguistic features.

5 Conclusion
Here we presented the first stemmer for Maithili. Initially, we developed a suffix-stripping based stemmer, where we found issues like under-stemming and over-stemming that generated too many invalid strings. We then developed a POS-dependent rule-based approach, which reduced the count of invalid strings. To validate the system-generated stems, we incorporated a word-embedding module. The final system achieved moderate performance.

The final system accuracy indicates that there is considerable room for improvement. The rule-based system alone provides 78.5% accuracy, which implies that the rules are not sufficient. So, there is scope to study Maithili words further and define more sophisticated rules. We have primarily studied the noun, pronoun, adjective and verb classes. Other POS classes can also be analyzed to identify more rules. Other directions include morphological study of Maithili words, improvement of the suffix stripping approach, and incorporation of other techniques like clustering-based stemming.

ACKNOWLEDGMENTS
This work is supported by Science and Engineering Research Board, India [Grant No.: EEQ/2016/000241].

REFERENCES
[1] Porter, M. 1980. An algorithm for suffix stripping. Program. Vol. 14, 130-137.
[2] Savoy, J. 1993. Stemming of French words based on grammatical categories. Journal for the American Society for Information Science. 44(1), 1-9.
[3] Braschler and Ripplinger. 2003. Stemming and decompounding for German text retrieval. In European Conference on Information Retrieval, 177-192.
[4] Ramanathan and Rao. 2003. A lightweight stemmer for Hindi. In Proc. Workshop of Computational Linguistics for South Asian Languages - Expanding Synergies with Europe, EACL-2003, 42-48.
[5] Pandey and Siddiqui. 2008. An unsupervised Hindi stemmer with heuristic improvements. In Proceedings of the second workshop on Analytics for noisy unstructured text data, ACM, 99-105.
[6] P. Majumder, M. Mitra, S. Parui, G. Kole, P. Mitra, and K. Datta. 2006. YASS: Yet Another Suffix Stripper. ACM Transactions on Information Systems. 25(4), 18.
[7] Sarkar and Bandyopadhyay. 2008. Design of a rule-based stemmer for natural language text in Bengali. In Proceedings of the IJCNLP-08 workshop on NLP for less privileged languages. 65-72.


[8] Sunitha and Kalyani. 2009. A novel approach to improve rule based Telugu morphological analyzer. In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC). IEEE. 1649-1652.
[9] Ramchandran and Krishnamurthi. 2012. An iterative stemmer for Tamil language. In Asian Conference on Intelligent Information and Database Systems. Springer, Berlin, Heidelberg. 197-205.
[10] Patel, Popat and Bhattacharya. 2010. Hybrid Stemmer for Gujarati. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP). 51-55.
[11] Prajitha, Sreejith and Raj. 2013. LALITHA: A Light Weight Malayalam Stemmer Using Suffix Stripping Method. In 2013 International Conference on Control Communication and Computing (ICCC). IEEE. 244-248.
[12] Lovins. 1968. Development of a stemming algorithm. Mech. Translat. and Comp. Linguistics. 11(1-2). 22-31.
[13] Larkey, Connell and Abduljaleel. 2003. Hindi CLIR in Thirty Days. ACM Transaction on Asian Language Information Processing. Vol-2, No.-2, 130-142.
[14] Mikolov et al. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR. 1370-3781.
[15] Chaupattnaik, Nanda and Mohanty. 2012. A Suffix Stripping Algorithm for Odia Stemmer. International Journal of Computational Linguistics and Natural Language Processing. 1(1), 1-5.
