Automatic Conversion of Dialectal Tamil Text to Standard Written Tamil Text using FSTs

Marimuthu K Sobha Lalitha Devi AU-KBC Research Centre, AU-KBC Research Centre, MIT Campus of Anna University, MIT Campus of Anna University, Chrompet, , . Chrompet, Chennai, India. [email protected] [email protected]

oriented activities such as blogging, social media Abstract chats, and discussions in online forums. Given the unmediated nature of these services, users We present an efficient method to auto- conveniently share the contents in their native matically transform spoken language text languages in a more natural and informal way. to standard written language text for var- This has resulted in bringing together the con- ious dialects of Tamil. Our work is novel tents of various languages. More often these con- in that it explicitly addresses the problem tents are informal, colloquial, and dialectal in and need for processing dialectal and nature. The dialect is defined as a variety of a spoken language Tamil. Written language language that is distinguished from other varie- equivalents for dialectal and spoken lan- ties of the same language by features of phonol- guage forms are obtained using Finite ogy, grammar, and vocabulary and by its use by State Transducers (FSTs) where spoken a group of speakers who are set off from others language suffixes are replaced with ap- geographically or socially. The dialectal varia- propriate written language suffixes. Ag- tion refers to changes in a language due to vari- glutination and compounding in the re- ous influences such as geographic, social, educa- sultant text is handled using Conditional tional, individual and group factors. The dialects Random Fields (CRFs) based word vary primarily based on geographical locations. boundary identifier. The essential Sandhi They also vary based on social class, caste, corrections are carried out using a heuris- community, gender, etc. which differ phonologi- tic Sandhi Corrector which normalizes cally, morphologically, and syntactically (Ha- the segmented words to simpler sensible bash and Rambow, 2006). Here we study spoken words. During experimental evaluations and dialectal and aim to auto- dialectal spoken to written transformer matically transform them to standard written lan- (DSWT) achieved an encouraging accu- guage. racy of over 85% in transformation task Tamil language has more than 70 million and also improved the translation quality speakers worldwide and is spoken mainly in of Tamil-English machine translation southern India, Sri Lanka, Singapore, and Ma- system by 40%. It must be noted that laysia. It has 15 known dialects 1 which vary there is no published computational work mainly based on geographic location and reli- on processing Tamil dialects. Ours is the gious community of the people. The dialects first attempt to study various dialects of used in southern are different from Tamil in a computational point of view. the dialects prevalent in western and other parts Thus, the nature of the work reported of Tamil Nadu. Sri Lankan Tamil is relatively here is pioneering. conservative and still retains the older features of Tamil2. So its dialect differs considerably from 1 Introduction the dialects spoken elsewhere. Tamil dialect is also dependent on religious community. The var- With the advent of Web 2.0 applications, the fo- cus of communication through the Internet has shifted from publisher oriented activities to user 1http://en.wikipedia.org/wiki/Category:Tamil_dialects 2 www.lmp.ucla.edu

37 Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, pages 37–45, Baltimore, Maryland USA, June 27 2014. c 2014 Association for Computational Linguistics iation of dialects based on caste is studied and representation for the transformation of Arabic described by A.K. Ramanujan (1968) where he dialects to Modern Standard Arabic (MSA) and observed that Tamil speak a very dis- performed morphological analysis. In the case of tinct form of Tamil known as Tamil Tamil language, Umamaheswari et al. (2011) (BT) which varies greatly from the dialects used proposed a technique based on pattern mapping in other religious communities. While perform- and spelling variation rules for transforming col- ing a preliminary corpus study on Tamil dialects, loquial words to written-language words. The we found that textual contents in personal blogs, reported work considered only a handful of rules social media sites, chat forums, and comments, for the most common spoken forms. So this ap- comprise mostly dialectal and spoken language proach will fail when dialectal variants of words words similar to what one can hear and use in are encountered because it is more likely that the day-to-day communication. This practice is spelling variation rules of the spoken language common because the authors intend to establish a vary from the rules of dialectal usages. This limi- comfortable communication and enhance intima- tation hinders the possibility of the system to cy with their audiences. This activity produces generalize. Alternatively, performing a simple informal, colloquial and dialectal textual data. list based mapping between spoken and written These dialectal and spoken language usages will form words is also inefficient and unattainable. not conform to the standard spellings of Literary Spoken language words exhibit fairly regular Tamil (LT). This causes problems in many text pattern of suffixations and inflections within a based Natural Language Processing (NLP) sys- given paradigm (Schiffman, 1999). So we pro- tems as they generally work on the assumption pose a novel method based on Finite State that the input is in standard written language. To Transducers for effectively transforming dialec- overcome this problem, these dialectal and spo- tal and spoken Tamil to standard written Tamil. ken language forms need to be converted to We make use of the regularity of suffixations and Standard Written language Text (SWT) before model them as FSTs. These FSTs are used to doing any computational work with them. perform transformation which produces words in Computational processing of dialectal and standard literary Tamil. spoken language Tamil is challenging since the Our experimental results show that DSWT language has motley of dialects and the usage in achieves high precision and recall values. In ad- one dialect varies from other dialects from very dition, it improves the translation quality of ma- minimal to greater extents. It is also very likely chine translation systems when unknown words that multiple spoken-forms of a given word with- occur mainly due to colloquialism. This im- in a dialect which we call as „variants‟ may cor- provement gradually increases as the unknown respond to single canonical written-form word word rate increases due to colloquial and dialec- and a spoken-form word may map to more than tal nature of words. one canonical written-form. These situations ex- Broadly, DSWT can be used in a variety of ist in all Tamil dialects. In addition, it is very NLP applications such as Morphological Analy- likely to encounter conflicts with the spoken and sis, Rule-based and Statistical Machine Transla- written-forms of one dialect with other dialects tion (SMT), Information Retrieval (IR), Named- and vice versa. Most importantly, the dialects are Entity Recognition (NER), and Text-To-Speech used mainly in spoken communication and when (TTS). In general, it can be used in any NLP sys- they are written by users, they do not conform to tem where there is a need to retrieve written lan- standard spoken-form spellings and sometimes guage words from dialectal and spoken language inconsistent spellings are used even for a single Tamil words. written-form of a word. In other words Schiff- The paper is further organized as follows: In man (1988) noted that every usage of a given section 2, the challenges in processing Tamil di- spoken-form can be considered as Standard Spo- alects are explained. Section 3 explains the cor- ken Tamil (SST) unless it has wrong spellings to pus collection and study. Section 4 explains the become nonsensical. peculiarities seen in spoken and dialectal Tamil. Few researchers have attempted to transform Section 5 introduces the system architecture of the dialects and spoken-forms of languages to DSWT. Section 6 describes conducted Experi- standard written languages. Habash and Rambow mental evaluations and the results. Section 7 dis- (2006) developed MAGEAD, a morphological cusses about the results and the paper concludes analyzer and generator for Arabic dialects where with a conclusion section. the authors made use of root+pattern+features

38 2 Challenges in Processing Tamil Di- In the case of one-to-many mapping, multiple alects written language words will be obtained. Choos- ing a correct written language word over other Tamil, a member of Dravidian language family, words is dependent on the context where the di- is highly inflectional and agglutinative in nature. alectal spoken language word occurs. In some The phenomenon of agglutination becomes much cases, the sentence may be terminated by punc- pronounced in dialects and spoken-form com- tuations such as question marks which can be munication where much of the phonemes of suf- made use of to select an appropriate written lan- fixes get truncated and form agglutinated words guage word. To achieve correct selection of a which usually have two or more simpler words in word, an extensive study has to be conducted and them. A comprehensive study on the Grammar is not the focus of this paper. In the current work of Spoken Tamil for various syntactic categories we are interested in obtaining as many possible is presented in Schiffman (1979) and Schiffman mappings as possible. Many-to-one mapping (1999). Various dialects are generally used in occurs mainly due to dialectal and spelling varia- spoken discourse and while writing them people tions of spoken-forms whereas one-to-many use inconsistent spellings for a given spoken lan- mapping happens because a single spoken-form guage word. The spelling usages primarily de- may convey different meanings in different con- pend on educational qualification of the authors. texts. Dialectal spoken forms of many-to-one and Sometimes, the authors intentionally use certain one-to-many mappings are more prevalent than types of spelling to express satire and humor. one-to-one mapping where a dialectal spoken Due to this spelling and dialectal variation form maps to exactly one written form word. many-to-one mapping happens where all the va- riants correspond to single canonical written 3 Data Collection and Corpus Study form. This is illustrated with the dialectal and spelling variants of the verb “paarkkiReen” (see) The dialectal spoken form of a language is pri- in Fig 1. marily used for colloquial and informal commu- nication among native speakers. They are also commonly seen in personal blogs, social media chats and comments, discussion forums etc. Giv- en this informal nature of the language usage, such a variety is not used in formal print and broadcasting media as they mainly use standard literary Tamil. In our preliminary study, we found that textual contents in personal blogs, tweets, and chats have significantly large number of dialectal and Figure 1. many-to-one mapping spoken language words than those are found in other standard online resources such as news For the words that belong to the above case, publishers, entertainment media websites etc. there is no hard rule that a particular pattern of Since we focus on processing various Tamil spelling will be used and referred to while the dialects and their spoken language variants, we text is written by people. In addition to this map- have collected publicly available data from the ping, one-to-many mapping is also possible above mentioned online resources for this work. where a single spoken form maps to multiple The collected data belongs to authors from canonical written forms. various geographic locations where different Tamil dialects exist. The textual contents in the engee selected resources mainly contain movie reviews, (where) narratives, travel experiences, fables, poems, and enga sometimes an informal discourse, all in a casual engaL and colloquial manner. Further, we were able to (ours) collect variants of spoken forms which vary with respect to person, social status, location, com- munity, gender, age, qualification etc. Figure 2. one-to-many mapping

39 Though Tamil language has 15 dialects, in this work, we focused only on 5 dialects namely, Spoken form Written form , Tamil, Tirunelve- Variants Equivalent li Tamil, , Kongu Tamil and [Noun/Pronoun/Verb] [Noun/Pronoun/Verb] + common spoken language forms. In Table 1, we + “nu” “enRu” present the corpus distribution with respect to the [Noun/Pronoun/Verb] [Noun/Pronoun/Verb] + dialects and the number of dialectal and spoken + “nnu” “enRu” [Noun/Pronoun/Verb] [Noun/Pronoun/Verb] + language words. + “unu” “enRu” [Noun/Pronoun/Verb] [Noun/Pronoun/Verb] + Name of the Tamil No. of Dialectal + “unnu” “enRu” Dialect words Central Tamil dialect 584 Table 2. Spoken variants and written language 864 Tamil 2074 The dialectal variants of the verb “vanthaarkaL” Brahmin Tamil 2286 (they came) is illustrated in table 3. Kongu Tamil 910 Common Spoken Forms 5810 Dialectal variants Written form Equivalent Table 1. Corpus distribution among dialects [Verb] + “aaka” [Verb] + “aarkaL” [Verb] + “aangka” [Verb] + “aarkaL” We performed an in-depth study on the collected Table 3. Dialectal variants & written language data and found some peculiarities which exist in some dialects. Some of the observed peculiarities It can be observed from Table 3 that the di- are described in Section 4. alectal suffixes vary from each other but they all map to same written form suffix. Despite the di- 4 Tamil Dialects and their Peculiarities alectal variation, they all convey the same mean- Some dialectal words have totally different ing. But they vary syntactically. The “aaka” suf- meaning in SST and in other dialects or in stan- fix functions as adverbial marker in standard lite- dard literary Tamil. For instance, consider the rary Tamil whereas it acts as person, number, following dialectal sentence () gender (PNG) marker in Madurai Tamil dialect. 5 System Architecture ela, inga vaala. Hey here come In this section we describe our system architec- „Hey come here!‟ ture which is depicted in Figure 3. Our dialectal spoken to written transformer (DSWT) has three The words “ela” and “vaala” convey different main components namely, Transformation En- meanings in different contexts and dialects. In gine, CRF word boundary identifier and heuristic SST they denote “leaf” and “tail” respectively Sandhi corrector. while in Tirunelveli Tamil dialect they convey Transformation Engine contains FSTs the meaning “hey” and “come” respectively. for the dialectal and spoken language Though these ambiguities are resolved when to standard written language transfor- the context is considered, they make the trans- mation. The resultant words may be formation task challenging since this is a word- agglutinated and is decomposed with level task and no context information is taken the help of CRF boundary identifier. into account during transformation. CRF Word Boundary Identifier mod- The example in table 2, illustrates spelling ule identifies the word boundaries in based variants where the variants map to single agglutinated words and splits them in- canonical written form. We observed that the to a set of constituent simpler words. most common form of spoken-language usage is Heuristic Sandhi Corrector module the use and representation of “enRu” (ADV) as makes necessary spelling changes to four variants which are shown in Table 2. the segmented constituent words and standardizes them to canonical and meaningful simpler words.

40

Figure 3. System Architecture

5.1 Transformation Engine the number of unique suffixation is few when compared to the number of root words. This will The function of Transformation engine is to make the suffix matching faster and hence transform dialectal and spoken language words achieves quick transformation. This makes FSTs into standardized literary Tamil words, similar to as an efficient tool for dialectal or variation mod- the official form of Tamil that is used in gov- eling. ernment publications such as official memoran- dums, news and print media, and formal political Algorithm for Transformation speeches. The algorithm that is used to transform dialectal Modeling FSTs for Transformation and spoken language text is given below.

Given the regular pattern of inflections within a 1: for each dialectal/spoken-language word paradigm, we use paradigm based approach for 2: check possible suffixations in FST the variation modeling. Specifically, the dialectal 3: for each suffixation usages, spoken language forms and their variants 4: if FST accepts & generates written are modeled as “root+spoken-language-suffix” language equivalents for all suffixes where it will get transformed into “root+written- 5: return (root + written-language-suffix) language-suffix” after transformation. We had 3 6: else used AT&T's FSM library for generating FSTs. 7: return dialectal/spoken-language-word The FST shown in Fig. 4 shows the state transi- 8: for each agglutinated & compound word tions for some spoken language words. 9: do CRF word boundary identification 10: for each constituent word (CW) 11: do Sandhi Correction 12: return simple constituent words

5.2 Decomposition of Agglutinated and Compound Words using CRF Since Tamil is a morphologically rich language, the phenomenon of agglutination and compound- ing in standard written language Tamil is high

and very common. It is also present in dialectal Figure 4. Sample FST and spoken language Tamil. This poses a number

of challenges to the development of NLP sys- It can be observed from Figure 4 that spoken tems. To solve these challenges, we segment the and dialectal words are processed in right to left agglutinated and compound words into simpler fashion. This way of processing is adopted since constituent words. This decomposition is achieved using two components namely 3 http://www2.research.att.com/~fsmtools/fsm/

41

Agglutinated word or Boundary Identification Sandhi Correction Functions Compound Word and Word Segmentation No Change Insertion Deletion Substitution nampuvathillaiyenRu nampuvath nampuvathu (will not be believing) illai illai yenRu enRu muththokuppukaLutaya muth muu (comprising of three thokuppukaL thokuppukaL volumes) utaya utaya Table 4. Boundary identification and Sandhi Correction Table 4 clearly manifests the boundary of a constituent word within a compound or an agglutinated word which may contain one or more word-boundaries. It is observed that for “n” constituent words in a compound or an agglutinated word, there exists exactly (n-1) shared word-boundaries where (n>0).

CRF word boundary identifier and Heuristic The absence of word-boundary ambiguity in Sandhi Corrector. We have developed the word Tamil language favors the boundary identifica- boundary identifier for boundary identification tion task and predominantly eliminates the need and segmentation as described in Marimuthu et for providing further knowledge to CRFs such as al. (2013) and heuristic rule based Sandhi correc- multiple lexicons as in the case of Chinese word tor for making spelling changes to the segmented segmentation. Hence we have used word level words. features alone for training the CRFs.

CRF Word-Boundary Identifier Sandhi Correction using Word-level Contex- CRF based word-boundary identifier marks the tual Rules boundaries of simpler constituent words in ag- Word-level contextual rules are the spelling glutinated and compound words and segments rules in which each constituent word of an agglu- them. CRFs are a discriminative probabilistic tinated or compound word is dependent either on framework for labeling and segmenting sequen- the previous or the next or both constituent tial data. They are undirected graphical models words to give a correct meaning. trained to maximize a conditional probability After boundary identification, suppose an ag- (Lafferty et al., 2001). glutinated or a compound word is split into three Generally word-boundary identification is stu- constituent words, Sandhi correction for the first died extensively for languages such as Chinese constituent word is dependent only on the second and Japanese but the necessity for Indian lan- constituent word while the second word's Sandhi guages was not considered until recently. Al- correction depends on both first and third consti- though there is no standard definition of word- tuent word whereas the third constituent word's boundary in Chinese, Peng et al. (2004) describe Sandhi correction depends on second constituent a robust approach for Chinese word segmenta- word alone. tion using linear-chain CRFs where the flexibili- Sandhi correction is performed using these ty of CRFs to support arbitrary overlapping fea- rules to make necessary spelling changes to the tures with long-range dependencies and multiple boundary-segmented words in order to normalize levels of granularity are utilized by integrating them to sensible simpler words. It is accom- the rich domain knowledge in the form of mul- plished using three tasks namely insertion, dele- tiple lexicons of characters and words into the tion, and substitution as described in Marimuthu framework for accurate word segmentation. et al. (2013). In case of Japanese, though the word bounda- For instance, after boundary identification the ries are not clear, Kudo et al. (2004) used CRFs word “nampuvathillaiyenRu” (will not be believ- for Japanese morphological analysis where they ing) will be boundary marked and Sandhi cor- show how CRFs can be applied to situations rected as shown in the Table 4 above. where word-boundary ambiguity exists. Marimuthu et al. (2013) worked on word Advantages of Word boundary Identification boundary identification and segmentation in Ta- Morphological Analysis of simpler words is mil where they model the boundary identification much easier than analyzing agglutinated and as a sequence labeling task [i.e. a tagging task]. compound words.

42

Tamil Dialects No. of dialectal words Precision (%) Recall (%) F-Measure (%) Central Tamil dialect 584 88.0 89.3 88.6 Madurai Tamil 864 85.2 87.5 85.3 Tirunelveli Tamil 2074 83.4 88.6 85.9 Brahmin Tamil 2286 87.3 89.5 88.4 Kongu Tamil 910 89.1 90.4 89.7 Common Spoken Forms 5810 86.0 88.3 87.1 Table 5. Direct Evaluation Results

So the word-boundary identifier eases the task of Precision is then calculated as: A/(D-C) morphological analyzer in identifying the indi- Recall is calculated as: (A+B)/D vidual morphemes. In addition, it nullifies the F-Measure is the harmonic mean of Precision unknown words category if it occurs due to ag- and Recall. glutination and compounding. As a result, it im- The obtained results for the considered 5 Tamil proves the recall of the morphological analyzer dialects and common spoken language forms are and any advanced NLP system. For example, summarized in Table 5 above. with Tamil, SMT models usually perform better when the compound words are broken into their 6.2 Indirect Evaluation components. This 'segmentation' gives the word For indirect evaluation, we had used DSWT with alignment greater resolution when matching the Google Translate (GT) to measure the influence groupings between the two languages. of DSWT in Tamil-English machine translation, and evaluated the improvement. 6 Experimental Evaluation Our test data had 100 Tamil sentences which Here we perform evaluation of the performance are of dialectal and colloquial in nature. At first, of DSWT with test corpus of 12528 words. We we used GT to translate these sentences to Eng- perform two types of evaluations: direct and indi- lish. This is Output1. Then we used our DSWT rect evaluation. to transform the dialectal sentences into standard In direct evaluation, we evaluate the system written Tamil. After this, the standard sentences using gold standard. In indirect evaluation the were translated to English using GT. This cor- system is evaluated using machine translation responds to Output2. application. The aim in indirect evaluation is to We then performed subjective evaluations of understand the effect of dialectal and spoken lan- Output1 and Output2 with the help of three na- guage transformation in machine translation. tive Tamil speakers whose second language is English. The three evaluation scores for each 6.1 Direct Evaluation sentence in Output1 and Output2 are averaged. We evaluate DSWT performance using the The obtained scores are shown in Table 6. standard evaluation metrics: Precision, Recall, and F-measure. Precision and Recall values are Subjective Evaluation Subjective Evaluation calculated separately for each dialect using a Scores before dialectal Scores after dialectal gold standard. They are calculated using the cas- Transformation Transformation No. of Achieved No. of Achieved es described below: sentences Scores Sentences Scores 20 0 4 0 A: The dialectal or spoken language transforma- 70 1 14 1 tion yields one or many correct standard written 8 2 28 2 language words. 2 3 30 3 B: The dialectal or spoken language transforma- 0 4 24 4 tion yields at least one correct standard written language word. Table 6. Subjective evaluation results C: The dialectal or spoken language transforma- tion yields no output. We used a scoring scale of 0-4 where D: Number of dialectal or spoken language 0  no translation happened. words given as input.

43 Before performing Dialectal Transformation Task After performing Dialectal Transformation Task Dialectal Spoken Tamil Google Translate results Standardized Written Tamil Google Translate results . vanturu otane. (✘) . Come immediately. (✔) (otanee vanthuru) (utanee vanthuvitu) . vanturula otane. (✘) . Come immediately. (✔) (otanee vanthurula) (utanee vanthuvitu) . she had come. (?) . They came. (✔) (avanga vanthaanga) (avarkaL vanthaarkaL) . avuka to come. (✘) . They came. (✔) (avuka vanthaaka) (avarkaL vanthaarkaL) Table 7. Tamil-English Google Translate results before and after dialectal text transformation Sentences marked as (✘) are incorrectly translated into English and those that are marked as (?) may be partially correct. The sentences that are marked as (✔) are the correct English translations.

1  lexical translation of few words happen and Tirunelveli Tamil dialect and this resulted in and no meaning can be inferred from the low accuracy of transformation. translation output. While performing transformation, the possible 2  complete lexical translations happen and causes for ending up with unknown words may some meaning can be inferred from the be due to the absence of suffix patterns in FSTs, translation output. errors in input words, uncommonly transliterated 3  meaning can be inferred from translation words, and English acronyms. The standard writ- output but contains some grammatical ten language words convey a particular meaning errors. in standard literary Tamil and completely differ- 4  complete meaning is understandable with ent meaning in dialectal usages. For instance, very minor errors. consider the verb “vanthaaka”. In standard lite- rary Tamil, this is used in the imperative sense It can be observed from the results in Table 6 “should come” while in Tirunelveli Tamil dialect that GT failed to translate dialectal and spoken it is used in the sense “somebody came”. language sentences. But the failure got mitigated after transformation causing dramatic improve- 8 Conclusion and Future Work ment in translation quality. The following Table We have presented a dialectal and spoken lan- illustrates few examples where the translation guage to standard written language transformer quality has improved after transforming dialectal for Tamil language and evaluated its perfor- spoken language. mance directly using standard evaluation metrics It must be noted from Table 7 that after the and indirectly using Google Translate for Tamil transformation of dialectal spoken language, all to English machine translation. The achieved the sentences were able to achieve their English results are encouraging. equivalents during machine translation. This There is no readily available corpus for suggests that almost all word categories in Tamil processing dialectal and spoken Tamil texts and can achieve improved translations if the words we have collected the dialectal and spoken lan- are given as standard simple written language guage corpus for developmental and evaluation words. This experiment emphasizes the impor- tasks. This corpus can be made use of for devel- tance of feeding the machine translation systems oping other NLP applications. with standard written language text to achieve In case of one-to-many mapping, multiple quality translations and better results. written language forms will be emitted as out- 7 Results and Discussion puts. Hence, determining which written-form of word to be adopted over other resultant written- We observe that the achieved accuracy is higher forms has to be done based on the meaning of the for Kongu Tamil dialect when compared to other whole sentence in which the spoken-language dialects. This is because words in this dialect are word occurs. This will be the focus of our future rarely polysemous in nature. But the number of direction of the work. polysemous words is high in the case of Madurai

44 References ference on Natural Language Processing, Macmil- lan Publishers, India. A. K. Ramanujan. 1968. Spoken and Written Tamil, the verb. University of Chicago. Pages 74.

Fuchun Peng, Fangfang Feng and Andrew McCallum. 2004.Chinese Segmentation and New Word Detec- tion using Conditional Random Fields, Computer Science Department Faculty Publication Series. Paper 92. University of Massachusetts – Amherst. Harold F. Schiffman. 1979. A Grammar of Spoken Tamil, Christian Literature Society, Madras, India. Pp. i-viii, 1-104. Harold F. Schiffman. 1988. Standardization or res- tandardization: The case for “Standard” Spoken Tamil, Language in Society, Cambridge University Press, United States of America. Pages 359-385. Harold F. Schiffman. 1999. A Reference Grammar of Spoken Tamil, Cambridge University Press, Pp. i- xxii,1-232. J. Lafferty, A. McCallum, and F. Pereira. 2001. Con- ditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Pro- ceedings of the 18th International Conference on Machine Learning, pages 282–289. Marimuthu K., Amudha K., Bakiyavathi T. and Sobha Lalitha Devi. 2013. Word Boundary Identifier as a Catalyzer and Performance Booster for Tamil Morphological Analyzer, in proceedings of 6th Language and Technology Conference, Human Language Technologies as a challenge for Com- puter Science and Linguistics, Poznan, Poland. Milton Singer and Bernard S. Cohn. 2007. The Struc- ture of Variation: A Study in Caste Dialects, Struc- ture and Change in Indian Society, University of Chicago, Chapter 19, pages 461-470. Nizar Habash and Owen Rambow. 2006. MAGEAD: A Morphological Analyzer and Generator for the Arabic Dialects, In proceedings of the 21st Interna- tional Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, Aus- tralia. Pages 681-688 Sajib Dasgupta and Vincent Ng. 2007. Unsupervised Word Segmentation for Bangla, In proceedings of the Fifth International Conference on Natural Lan- guage Processing(ICON), Hyderabad, India. Taku Kudo, Kaoru Yamamoto and Yuji Matsumoto. 2004. Applying Conditional Random Fields to Jap- anese Morphological Analysis, In proceedings of Empirical Methods on Natural Language Processing, Barcelona, Spain. Umamaheswari E, Karthika Ranganathan, Geetha TV, Ranjani Parthasarathi, and Madhan Karky. 2011. Enhancement of Morphological Analyzer with compound, numeral and colloquial word handler, Proceedings of ICON-2011: 9th International Con-

45