Automatic Conversion of Dialectal Tamil Text to Standard Written Tamil Text Using FSTs

Marimuthu K and Sobha Lalitha Devi
AU-KBC Research Centre, MIT Campus of Anna University, Chrompet, Chennai, India.
[email protected] [email protected]

Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, pages 37–45, Baltimore, Maryland USA, June 27 2014. © 2014 Association for Computational Linguistics

Abstract

We present an efficient method to automatically transform spoken language text to standard written language text for various dialects of Tamil. Our work is novel in that it explicitly addresses the problem of, and the need for, processing dialectal and spoken language Tamil. Written language equivalents for dialectal and spoken language forms are obtained using Finite State Transducers (FSTs), in which spoken language suffixes are replaced with appropriate written language suffixes. Agglutination and compounding in the resultant text are handled using a Conditional Random Fields (CRFs) based word boundary identifier. The essential Sandhi corrections are carried out using a heuristic Sandhi Corrector, which normalizes the segmented words to simpler sensible words. In experimental evaluations, the dialectal spoken to written transformer (DSWT) achieved an encouraging accuracy of over 85% in the transformation task and also improved the translation quality of a Tamil-English machine translation system by 40%. It must be noted that there is no published computational work on processing Tamil dialects. Ours is the first attempt to study various dialects of Tamil from a computational point of view. Thus, the nature of the work reported here is pioneering.

1 Introduction

With the advent of Web 2.0 applications, the focus of communication through the Internet has shifted from publisher oriented activities to user oriented activities such as blogging, social media chats, and discussions in online forums. Given the unmediated nature of these services, users conveniently share content in their native languages in a more natural and informal way. This has brought together content in various languages. More often than not, this content is informal, colloquial, and dialectal in nature. A dialect is defined as a variety of a language that is distinguished from other varieties of the same language by features of phonology, grammar, and vocabulary and by its use by a group of speakers who are set off from others geographically or socially. Dialectal variation refers to changes in a language due to various influences such as geographic, social, educational, individual and group factors. Dialects vary primarily based on geographical location. They also vary based on social class, caste, community, gender, etc. [...] which differ phonologically, morphologically, and syntactically (Habash and Rambow, 2006). Here we study spoken and dialectal Tamil and aim to automatically transform it to the standard written language.

The Tamil language has more than 70 million speakers worldwide and is spoken mainly in southern India, Sri Lanka, Singapore, and Malaysia. It has 15 known dialects [1], which vary mainly based on the geographic location and religious community of the people. The dialects used in southern Tamil Nadu are different from the dialects prevalent in western and other parts of Tamil Nadu. Sri Lankan Tamil is relatively conservative and still retains the older features of Tamil [2], so its dialect differs considerably from the dialects spoken elsewhere. Tamil dialect is also dependent on religious community. The variation of dialects based on caste is studied and described by A.K. Ramanujan (1968), who observed that Tamil Brahmins speak a very distinct form of Tamil known as Brahmin Tamil (BT), which varies greatly from the dialects used in other religious communities.

[1] http://en.wikipedia.org/wiki/Category:Tamil_dialects
[2] www.lmp.ucla.edu

While performing a preliminary corpus study on Tamil dialects, we found that textual content in personal blogs, social media sites, chat forums, and comments comprises mostly dialectal and spoken language words similar to what one hears and uses in day-to-day communication. This practice is common because the authors intend to establish comfortable communication and enhance intimacy with their audiences. This activity produces informal, colloquial and dialectal textual data. These dialectal and spoken language usages do not conform to the standard spellings of Literary Tamil (LT). This causes problems in many text based Natural Language Processing (NLP) systems, as they generally work on the assumption that the input is in the standard written language. To overcome this problem, these dialectal and spoken language forms need to be converted to Standard Written language Text (SWT) before doing any computational work with them.

Computational processing of dialectal and spoken language Tamil is challenging since the language has a motley of dialects, and the usage in one dialect varies from that in other dialects to an extent ranging from very minimal to large. It is also very likely that multiple spoken-forms of a given word within a dialect, which we call 'variants', correspond to a single canonical written-form word, and that a spoken-form word maps to more than one canonical written-form. These situations exist in all Tamil dialects. In addition, the spoken and written-forms of one dialect are very likely to conflict with those of other dialects. Most importantly, the dialects are used mainly in spoken communication, and when they are written by users, they do not conform to standard spoken-form spellings; sometimes inconsistent spellings are used even for a single written-form of a word. In other words, Schiffman (1988) noted that every usage of a given spoken-form can be considered Standard Spoken Tamil (SST) unless its spelling is so wrong that it becomes nonsensical.

Few researchers have attempted to transform the dialects and spoken-forms of languages to standard written languages. Habash and Rambow (2006) developed MAGEAD, a morphological analyzer and generator for Arabic dialects, in which the authors made use of a root+pattern+features representation for the transformation of Arabic dialects to Modern Standard Arabic (MSA) and performed morphological analysis. In the case of Tamil, Umamaheswari et al. (2011) proposed a technique based on pattern mapping and spelling variation rules for transforming colloquial words to written-language words. The reported work considered only a handful of rules for the most common spoken forms. This approach will therefore fail when dialectal variants of words are encountered, because the spelling variation rules of the spoken language are likely to differ from the rules of dialectal usages. This limitation hinders the ability of the system to generalize. Alternatively, performing a simple list based mapping between spoken and written form words is also inefficient and unattainable. Spoken language words exhibit a fairly regular pattern of suffixations and inflections within a given paradigm (Schiffman, 1999). We therefore propose a novel method based on Finite State Transducers for effectively transforming dialectal and spoken Tamil to standard written Tamil. We make use of the regularity of suffixations and model them as FSTs. These FSTs are used to perform the transformation, which produces words in standard literary Tamil.

Our experimental results show that DSWT achieves high precision and recall values. In addition, it improves the translation quality of machine translation systems when unknown words occur mainly due to colloquialism. This improvement gradually increases as the unknown word rate increases due to the colloquial and dialectal nature of words.

Broadly, DSWT can be used in a variety of NLP applications such as Morphological Analysis, Rule-based and Statistical Machine Translation (SMT), Information Retrieval (IR), Named-Entity Recognition (NER), and Text-To-Speech (TTS). In general, it can be used in any NLP system where there is a need to retrieve written language words from dialectal and spoken language Tamil words.

The rest of the paper is organized as follows. Section 2 explains the challenges in processing Tamil dialects. Section 3 explains the corpus collection and study. Section 4 explains the peculiarities seen in spoken and dialectal Tamil. Section 5 introduces the system architecture of DSWT. Section 6 describes the experimental evaluations and the results. Section 7 discusses the results, and the paper ends with a conclusion section.

2 Challenges in Processing Tamil Dialects

Tamil, a member of the Dravidian language family, is highly inflectional and agglutinative in nature. Agglutination becomes much more pronounced in dialects and spoken-form communication, where many of the phonemes of suffixes get truncated and form agglutinated words which usually have two or more simpler words in them. A comprehensive study of the grammar of Spoken Tamil for various syntactic categories is presented in Schiffman (1979) and Schiffman (1999). Various dialects are generally used in spoken discourse, and while writing them people use inconsistent spellings for a given spoken language word.

In the case of one-to-many mapping, multiple written language words will be obtained. Choosing the correct written language word over the others depends on the context in which the dialectal spoken language word occurs. In some cases, the sentence may be terminated by punctuation such as a question mark, which can be used to select an appropriate written language word. Achieving correct selection of a word requires an extensive study and is not the focus of this paper. In the current work we are interested in obtaining as many mappings as possible. Many-to-one mapping occurs mainly due to dialectal and spelling variations of spoken-forms, whereas one-to-many mapping happens because a single spoken-form [...]
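The suffix replacement idea and the two mapping cases discussed above can be illustrated with a minimal sketch. It is not the authors' implementation: a plain dictionary lookup stands in for compiled FSTs, and the romanized suffix pairs, the example words, and the names `SUFFIX_RULES` and `written_candidates` are hypothetical placeholders rather than rules taken from the paper. The function returns every candidate written form instead of choosing one, in line with the stated aim of obtaining as many mappings as possible.

```python
from typing import Dict, List

# Hypothetical romanized toy rules: spoken suffix -> candidate written suffixes.
# These are placeholders for illustration, not rules from the paper.
SUFFIX_RULES: Dict[str, List[str]] = {
    "nga": ["rkal"],        # one spoken suffix, one written suffix
    "nka": ["rkal"],        # spelling variant of the same suffix (many-to-one)
    "aa": ["aa", "aaka"],   # ambiguous spoken suffix (one-to-many)
}


def written_candidates(word: str, rules: Dict[str, List[str]] = SUFFIX_RULES) -> List[str]:
    """Return all written-form candidates produced by rewriting a matching suffix.

    If no rule matches, the word is returned unchanged as the only candidate.
    """
    candidates: List[str] = []
    # Try longer suffixes first so the most specific rule is listed first.
    for suffix in sorted(rules, key=len, reverse=True):
        # Require a non-empty stem so a bare suffix is never rewritten.
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            for written in rules[suffix]:
                form = stem + written
                if form not in candidates:
                    candidates.append(form)
    return candidates or [word]


if __name__ == "__main__":
    for spoken in ["vantaanga", "vantaanka", "varuvaa", "maram"]:
        print(spoken, "->", written_candidates(spoken))
```

In this sketch the two spelling variants of the first toy word collapse to a single written form, while the ambiguous toy suffix yields two candidates that a later, context-aware step would have to choose between.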
