
A Sandhi Splitter for Malayalam


Devadath V V, Litton Kurisinkel, Dipti Misra Sharma, Vasudeva Varma
Technology Research Centre, International Institute of Information Technology
{devadathv.v, litton.jkurisinkel}@research.iiit.ac.in, dipti@iiit.ac.in

D S Sharma, R Sangal and J D Pawar. Proc. of the 11th Intl. Conference on Natural Language Processing, pages 156–161, India. December 2014. © 2014 NLP Association of India (NLPAI)

Abstract

Sandhi splitting is the primary task for computational processing of text in Sanskrit and Dravidian languages. In these languages, words can join together with morpho-phonemic changes at the point of joining. This phenomenon is known as Sandhi. A Sandhi splitter splits the string of conjoined words into individual words. Accurate execution of sandhi splitting is crucial for text processing tasks such as POS tagging, topic modelling and document indexing. We have tried different approaches to address the challenges of sandhi splitting in Malayalam, and finally exploited the phonological changes that take place in the words while joining. This resulted in a hybrid method which statistically identifies the split points and splits using predefined character level linguistic rules. Currently, our system gives an accuracy of 91.1%.

1 Introduction

Malayalam is one among the four main Dravidian languages and the 22 official languages of India. It is spoken in the State of Kerala, which is situated on the south west coast of India. This language is believed to have originated from Tamil, and has a strong influence of Sanskrit in its vocabulary. Malayalam is an inflectionally rich and agglutinative language like any other Dravidian language. The property of agglutination eventually leads to the process of sandhi.

Sandhi is the process of joining two words or characters, where morphophonemic changes occur at the point of joining. Sandhi is abundant in Sanskrit and all Dravidian languages. When compared to other Dravidian languages, the presence of Sandhi is relatively high in Malayalam. Even a full sentence may exist as a single string due to the process of Sandhi. For example, അവനാരാണ് (avanaaraaN) is a sentence in Malayalam which means "Who is he?". It is composed of 3 independent words, namely അവൻ (avan, "he"), ആര് (aar, "who") and ആണ് (aaN, "is"). However, ambiguous splits for a word are very rare in Malayalam.

Sandhi are of two types, Internal and External. Internal Sandhi exists between a root or a stem and a suffix or a morpheme, as in the example given below.

പറ (para) + ഉന്നു (unnu) = പറയുന്നു (parayunnu)

Here പറ (para) is a root with the meaning "to say" and ഉന്നു (unnu) is an inflectional suffix for marking present tense. They join together to form പറയുന്നു (parayunnu), meaning "say" (PRES). External sandhi is between words. Two or more words join to form a single string of conjoined words.

ചെയ്യും (ceyyuM) + എങ്കിൽ (enkil) = ചെയ്യുമെങ്കിൽ (ceyyumenkil)

ചെയ്യും (ceyyuM) is a finite verb with the meaning "will do" and എങ്കിൽ (enkil) is a connective with the meaning "if". They join together to form a single string ചെയ്യുമെങ്കിൽ (ceyyumenkil).

For most text processing tasks such as POS tagging, topic modelling and document indexing, External Sandhi is a matter of concern. All these tasks require individual words in the text to be identified. The identification of words becomes complex when words are joined to form a single string with morpho-phonemic changes at the point of joining. Moreover, sandhi can happen between any linguistic classes, such as a noun and a verb, or a verb and a connective. This leads to misidentification of classes of words by the POS tagger, which eventually affects parsing. Sandhi acts as a bottleneck for all term distribution based approaches for any NLP and IR task.

Sandhi splitting is the process of splitting a string of conjoined words into a sequence of individual words, where each word in the sequence has the capacity to stand alone as a single word. To be precise, Sandhi splitting facilitates the task of individual word identification within such a string of conjoined words.

Sandhi splitting is a different type of word segmentation problem. Languages like Chinese (Badino, 2004), Vietnamese (Dinh et al., 2008) and Japanese do not mark the word boundaries explicitly. Highly agglutinative languages like Turkish (Oflazer et al., 1994) also need word segmentation for text processing. In these languages, words are just concatenated without any kind of morpho-phonemic change at the point of joining, whereas morpho-phonemic changes occur in Sanskrit and Dravidian languages at the point of joining (Mittal, 2010).

2 Related works

Mittal (2010) adopted a method for Sandhi splitting in Sanskrit using the concept of optimality theory (Prince and Smolensky, 2008), in such a way that it generates all possible splits and validates each split using a morph analyser. In another work, statistical methods like Gibbs Sampling and Dirichlet Process are adopted for Sanskrit Sandhi splitting (Natarajan and Charniak, 2011).

To the best of our knowledge, only two related works are reported for the task of Sandhi splitting in Malayalam. The rule based word splitter for Malayalam (Nair and Peter, 2011) goes in the direction of identifying morphemes using rules and a trie data structure. They adopted an approach to split both external and internal sandhi, which includes compound words. But splitting a compound word which is conceptually united will lead to the loss of linguistic meaning. Another work (Das et al., 2012), a hybrid approach for sandhi splitting in Malayalam, employs a TnT tagger to tag whether the input string is to be split or not and splits according to a predefined set of rules. But this particular work does not report any empirical results. In our approach, we identify precisely the point to be split in the string using statistical methods and then apply a predefined set of character level rules to split the string.

3 Our Approach

We have tried various approaches for sandhi splitting and finally arrived at a hybrid approach which decides the split points statistically and then splits the string into words by applying pre-defined character level sandhi rules.

Sandhi rules for external sandhi are identified from a corpus of 400 Malayalam sentences, which includes text from old literature (texts before 1990) as well as modern literature. In comparison with text from old literature, modern literature makes much less use of sandhi. By analysing the text, we were able to identify 5 unambiguous character level rules which can be used to split the word, once the split point is statistically identified. The Sandhi rules identified are listed in Table 1.

R   Rule                                   Example
1   (CSC)Vs = (CSC)S + V                   വാക്കില്ല = വാക്ക് + ഇല്ല      word(Noun) + no(verb)
2   (യ/വ)Vs = V                            പേടിയാണ് = പേടി + ആണ്         fear(Noun) + is(verb)
3   മVs = ം + V                            പണമില്ല = പണം + ഇല്ല          money(Noun) + no(verb)
4   (ര/ല/ള/ന/ണ)Vs = (ർ/ൽ/ൾ/ൻ/ൺ) + V        മുകളിലാണ് = മുകളിൽ + ആണ്       above(Loc) + is(verb)
5   CVs = (CS) + V                         ആണെന്ന് = ആണ് + എന്ന്          is(verb) + that(quotative)
6   just split                             അവൻവന്നു = അവൻ + വന്നു         He(Noun) + Came(verb)

C = Consonant, V = Vowel, Vs = Vowel symbol, S = Schwa

Table 1: Sandhi rules

Rules in Table 1 are given in a particular order corresponding to their inherent priority, which avoids any clash between them. Every character level Sandhi rule is based on a consonant and a vowel. Rule 5 is the most general rule for Sandhi that we could identify in Malayalam; it states that a group of a consonant and a vowel can be split into a consonant and a vowel. Rule 1 is a special case of Rule 5, framed to treat the special case of a compound of consonants with a schwa in between. This rule is introduced because the characters specified in rules 2, 3 and 4 can come as the last consonant in a compound of consonants, which would lead to an improper split of the string by any of the rules 2, 3 or 4. So every word with an identified split point should first be checked against rule 1, to avoid a wrong split by the other rules in the case of a sandhi containing a compound of consonants. Rule 2 handles the phonemic insertion that accompanies the process of Sandhi. In Malayalam, the extra characters യ/വ can be inserted between words after sandhi in certain contexts. Rule 3 and Rule 4 handle the phonemic variation caused by the process of Sandhi. Rule 3 enforces that the letter 'മ' with a vowel symbol becomes 'ം' and a vowel, while Rule 4 enforces that the characters 'ര/ല/ള/ന/ണ' with a vowel symbol become the chillu 'ർ/ൽ/ൾ/ൻ/ൺ' and a vowel. (A chillu is a pure consonant which can stand alone independently.)

The hybrid approach utilises the phonological changes (Klein et al., 2003) due to the presence of sandhi. The remaining part of the paper explains this approach in detail, with theoretical formulation, experimental results and error analysis.

When words are conjoined, they undergo phonological changes at the point of joining. These phonological changes can be evidential in identifying the split point in the given string of conjoined words. For example, words $Wx$ and $yZ$ are conjoined to form a new string $Wx'y'Z$. As a part of this process, substring $x$ in the original string $Wx$ has undergone a phonological change to become $x'$, and the substring $y$ in the original string $yZ$ has undergone a phonological change to become $y'$. We try to identify the split point between $x'$ and $y'$ in $Wx'y'Z$ using $x'$ and $y'$ as the evidence. Going forward, we use "Sp" to denote Split Point and "Nsp" to denote Non Split Point.

$P(S_p \mid x', y')$ gives the probability of a character point between $x'$ and $y'$ within a conjoined string being a split point. A character point is classified as a split point if

$$P(S_p \mid x', y') > P(N_{sp} \mid x', y') \qquad (1)$$

As per our observation, the phonological changes of $x$ to $x'$ and of $y$ to $y'$ are independent of each other. So,

$$P(S_p \mid x', y') = P(S_p \mid x') \cdot P(S_p \mid y') \qquad (2)$$

To produce the values of $x'$ and $y'$ for each character point, we take $k$ character points backwards and $k$ character points forward from that particular point. For example, the probability of the marked point PointX in the string given in Figure 1 being a split point is given by equation (3).

[Figure 1: Split point]

$$P(S_p \mid a_1 a_2 \ldots a_k) \cdot P(S_p \mid b_1 b_2 \ldots b_k) \qquad (3)$$

where $a_1 a_2 \ldots a_k$ is $x'$ and $b_1 b_2 \ldots b_k$ is $y'$. The value of $k$ is experimentally optimized.

The split points of different agglutinated strings in the training data are annotated in the following format:

$$Word_N = i_1, i_2, i_3, \ldots, i_z$$

This indicates that the string $Word_N$ needs to be split at the $z$ character indexes $i_1, i_2, i_3, \ldots, i_z$ within the string. Split points within the word are annotated with the within-word index of the corresponding character point. The test data contains 1000 random words, out of which 260 are words with sandhi.

We employ two Orthographic Tries (Whitelaw and Patrick, 2003) to statistically capture the phonological differences in $x'$ and $y'$ for split points and non-split points. A mould of the orthographic tries used is given in Figure 2. The trie in Figure 2 is trained with the character sequences $c_1c_2c_3$, $c_1c_2c_4$, $c_1c_3c_4$, $c_1c_3c_5$. In the first trie, the path from the root to $k$ nodes represents the string $a_1 a_2 \ldots a_k$. Each node $i$ ($1 \le i \le k$) in the path stores the number of occurrences of split points and non-split points in the entire training data which are preceded by $a_i a_{i-1} \ldots a_1$. In the second trie, a path from the root to $k$ nodes represents the string $b_1 b_2 \ldots b_k$.

[Figure 2: Word trie]

In the second trie, where the path represents the string $b_1 b_2 \ldots b_k$, each node $i$ ($1 \le i \le k$) in the path stores the number of occurrences of split points and non-split points which are succeeded by $b_1 b_2 \ldots b_i$. The frequencies stored in these trie nodes are used to calculate $P(C \mid a_1 a_2 \ldots a_k)$ and $P(C \mid b_1 b_2 \ldots b_k)$, where $C$ is Sp or Nsp.

As split points are rare within a string, $P(C \mid a_1 a_2 \ldots a_k)$ and $P(C \mid b_1 b_2 \ldots b_k)$ need to be smoothed. For this purpose, we use the information stored along the depth of the tries for the strings $a_1 a_2 \ldots a_k$ and $b_1 b_2 \ldots b_k$ as follows:

$$P(C \mid a_1 a_2 \ldots a_k) = \sum_{i=\text{initial skip}}^{k} P(C \mid a_i a_{i+1} \ldots a_k) \qquad (4)$$

$$P(C \mid b_1 b_2 \ldots b_k) = \sum_{i=\text{initial skip}}^{k} P(C \mid b_i b_{i+1} \ldots b_k) \qquad (5)$$

Here the initial-skip decides the optimum 'smoothing range' within the string. The value of initial-skip decides the threshold of phonological similarity that needs to be considered while smoothing. The initial-skip is optimised experimentally. The identified split points are split using the applicable sandhi rules.

4 Data and Results

Data Set

We created a dataset which contained 2000 Sandhi bound words for training. Each of the split points in these words is annotated.

Results

In our experiments, we evaluated the accuracy of split point identification, the split accuracy of different sandhi rules and the overall accuracy of the system. By overall accuracy, we mean the percentage of the words with sandhi in which the split is exactly as expected. We conducted experiments on split point identification with different values of k and initial skip.

k-S    P       R       F       Accuracy
3-1    85.37   75      79.85   89.6
3-2    84.73   73.26   78.58   88.7
4-1    86.56   76.04   80.96   90.1
4-2    85.48   73.61   79.10   89
5-1    88.37   79.16   83.51   90.9
5-2    85.94   74.30   79.30   88.7
6-1    88.50   80.20   84.15   91.1
6-2    85.59   76.38   80.88   89.2

Table 2: Results

Here P implies precision, R implies recall, F implies F-measure and k-S implies k and initial skip respectively. k=6 and initial skip=1 have shown the best result. As per our observation, phonological changes as a part of sandhi would not happen beyond a range of six characters in each of the participating words. So the upper bound for the k value is taken as 6.

5 Error Analysis

In split point identification, most of the incorrectly identified split points are character points between a word and the inflectional suffix attached to it. As the system evolves, this error can be rectified by the use of a post-processor which maintains the finite list of inflectional suffixes in the language. Wrong splits in the middle of an actual word are very few in number and will reduce as the size of the training data increases.

Most of the unidentified split points are due to the presence of certain rare patterns. These errors will be reduced with the incorporation of words from diverse texts in the training data.

When it comes to rules, we have used only character level rules for splitting the identified split points. Rules 1, 4 and 5 are ambiguous in certain rare contexts. To resolve this, certain word level information like POS tags is required. But accurate POS tagging, particularly for this disambiguation purpose, in turn requires a sandhi splitter with a good level of accuracy. Our Sandhi splitter can contribute to a better POS tagger. Vice versa, a POS tagger can complement the Sandhi splitter.

Rule   Accuracy
1      92.10
2      96.29
3      100
4      80.64
5      87.71

Table 3: Rule accuracy when k=6 and S=1

6 Conclusion & Future work

This experiment has given us a new insight that there exists a character level pattern in the text where the sandhi can be split. When compared to a completely rule based approach, which is effort intensive, a hybrid approach gives us faster and more accurate performance with a relatively smaller amount of training data. We expect that this method can be successfully implemented in all other Dravidian languages for Sandhi splitting. The only language dependent part will be to split the identified split points using language specific character level rules. The system component to split the identified split points can be decided at runtime based on the language on which it is operating.

There is a possibility of improving the current system into a language independent and fully statistical system. Instead of totally depending on language specific Sandhi rules for splitting, phonological changes after split can be inferred from training data. For example,

$$W x' y' Z \Rightarrow W x + y Z$$

$P(x \mid x')$ and $P(y \mid y')$ can be tabulated to predict the phonological changes after split. But an accurate tabulation demands a large amount of annotated training data.

Software

The source code of our Hybrid Sandhi splitter for Malayalam is available at https://github.com/Devadath/Malayalam_Sandhi_Splitter

References

[Badino2004] Leonardo Badino. 2004. Chinese text word-segmentation considering semantic links among sentences. In INTERSPEECH.

[Das et al.2012] Divya Das, Radhika K T, Rajeev R R, and Raghu Raj. 2012. Hybrid sandhi-splitter for malayalam using [...]. In Proceedings of National Seminar on Relevance of Malayalam in Information Technology.

[Dinh et al.2008] Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen Nguyen, Cam Tu Nguyen, Mathias Rossignol, Xuan Luong Vu, et al. 2008. Word segmentation of vietnamese texts: a comparison of approaches. In 6th international conference on Language Resources and Evaluation-LREC 2008.

[Klein et al.2003] Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D Manning. 2003. Named entity recognition with character-level models. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 180–183. Association for Computational Linguistics.

[Mittal2010] Vipul Mittal. 2010. Automatic sanskrit segmentizer using finite state transducers. In Proceedings of the ACL 2010 Student Research Workshop, pages 85–90. Association for Computational Linguistics.

[Nair and Peter2011] Latha R Nair and S David Peter. 2011. Development of a rule based learning system for splitting compound words in malayalam language. In Recent Advances in Intelligent Computational Systems (RAICS), 2011 IEEE, pages 751–755. IEEE.

[Natarajan and Charniak2011] Abhiram Natarajan and Eugene Charniak. 2011. S3-statistical saṃdhi splitting.

[Oflazer et al.1994] Kemal Oflazer, Elvan Göçmen, and Cem Bozsahin. 1994. An outline of turkish morphology.

[Prince and Smolensky2008] Alan Prince and Paul Smolensky. 2008. Optimality Theory: Constraint interaction in generative grammar. John Wiley & Sons.

[Whitelaw and Patrick2003] Casey Whitelaw and Jon Patrick. 2003. Named entity recognition using a character-based probabilistic approach. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 196–199. Association for Computational Linguistics.
