TransChat: Cross-Lingual Instant Messaging for Indian Languages

Diptesh Kanojia1, Shehzaad Dhuliawala1, Abhijit Mishra1, Naman Gupta2, Pushpak Bhattacharyya1
1Center for Indian Language Technology, CSE Department, IIT Bombay, India
2Yahoo Labs, Japan
1{diptesh,shehzaadzd,abhijitmishra,pb}@cse.iitb.ac.in, 2naman.bbps@.com

D S Sharma, R Sangal and E Sherly (eds.), Proc. of the 12th Intl. Conference on Natural Language Processing, pages 106-111, Trivandrum, India, December 2015. © 2015 NLP Association of India (NLPAI)

Abstract

We present TransChat (http://www.cfilt.iitb.ac.in/transchat/), an open-source, cross-platform, Indian language Instant Messaging (IM) application that facilitates cross-lingual textual communication between English and multiple Indian languages. The application is a client-server IM architecture based chat system with multiple Statistical Machine Translation (SMT) engines working towards efficient translation and transmission of messages. TransChat allows users to select their preferred language and internally selects the appropriate translation engine based on the input configuration. For translation quality enhancement, necessary pre- and post-processing steps are applied to the input and output chat texts. We demonstrate the efficacy of TransChat through a series of qualitative evaluations that test (a) the usability of the system and (b) the quality of the translation output. In a multilingual country like India, such applications can help overcome the language barrier in domains like tourism, agriculture and health.

1 Introduction

In a multilingual country like India, which has around 22 official languages spoken across 29 states by more than one billion people, it often becomes quite difficult for a non-speaker of a particular language to communicate with peers who follow that language. Be it a non-Indian tourist visiting India for the first time or a farmer from Punjab trying to get tips from a Tamil-speaking professor on modern agricultural tools and techniques, communication is often hindered by the language barrier. This problem has been recognized and well studied by computational linguists, as a result of which a large number of automatic translation systems have been proposed and modified in the last 30 years. Some of the notable Indian language translation systems include Anglabharati (Sinha et al., 1995), Anusaraka (Padmanathrao, 2009), Sampark (Anthes, 2010) and Sata-Anuvaadak (Kunchukuttan et al., 2014; http://www.cfilt.iitb.ac.in/indic-translator). Popular organizations like Google and Microsoft also provide translation solutions for Indian languages through the Google and Bing translation systems. Stymne (2011) demonstrates techniques for replacement of unknown words and data cleaning for Haitian Creole SMS translation, which can be utilized in a chat scenario.

But even after so many years of MT research, one can still claim that these systems have not been able to attract many users. This can be attributed to factors like (a) poor user experience in terms of UI design, (b) systems being highly resource-intensive and slow in terms of response time, and (c) poor-quality translation output. Moreover, current MT interfaces do not provide a natural environment that attracts more users to use the system effectively. As Flournoy and Callison-Burch (2000) point out, a successful translation application is one that can reconcile overly optimistic user expectations with the limited capabilities of current MT technologies. We believe a chat application like TransChat bridges current web programming paradigms with MT to provide users a powerful yet natural mode of communication.

Our work is motivated by the following factors:

• The ever-increasing usage of hand-held devices and instant messaging provides us a rich platform to float our application. We can expect a large number of users to download and use it.

• The unavailability of cross-lingual instant messaging applications for Indian languages motivates us to build and open-source our application, so that it can be modified by developers as per their needs.

There are several challenges in building such an application. Some of them are:

1. IM users often abbreviate text (especially when the input language is English) to save time, e.g. "Hw r u?" instead of "How are you?". A translation system is often built on top of human-crafted rules (RBMT), by automatically learning patterns from examples (EBMT), or by memorizing patterns (SMT). In all of these cases, the system requires the input text to be grammatically correct, thereby making it necessary to deal with ungrammatical input text.

2. To provide a rich user experience, the basic requirement of an IM system is that it should transmit messages with ease and speed. Hence, language processing modules should be light-weight in order not to incur delay.

We try to handle these challenges by making use of appropriate language processing modules. The novelty of our work is three-fold:

• Our system is scalable, i.e., it allows a large number of concurrent user accesses.

• It handles ungrammatical input through fast, efficient text normalization and spell checking.

• We have tried to build a flexible system, i.e., it can work with multiple translation engines built using different Machine Translation toolkits.

In the following sections we describe the system architecture, UI design and evaluation methodologies.

2 System Architecture

Figure 1: Chat Architecture

Figure 1 shows the system architecture of TransChat. Our system consists of a highly scalable chat server, ejabberd (https://www.process-one.net/en/ejabberd/), based on the Extensible Messaging and Presence Protocol (XMPP; http://xmpp.org/xmpp-protocols/). It is extensible via modules, which can provide support for additional capabilities such as saving offline messages, connecting with IRC channels, or a user database that makes use of users' vCards. It handles the chat requests, users and message transactions from one user to another.

We build our chat client using the publicly available Smack library APIs (https://www.igniterealtime.org/projects/smack/), which are easily portable to Android devices, thus providing cross-device integration for our chat system. A user logs in to the server with their respective chat client and gets an option to select a language for their chats. The user then sends a message, which is transferred via XMPP to the server, where it is processed and then shown to the user at the other end.

Figure 2 shows in detail the processes through which a message passes. The message is processed using the techniques described in sections 2.1, 2.2, 2.3 and 2.5.

Figure 2: Server Side Processing of TransChat

2.1 Compression

While chatting, users often express their emotions or mood by stressing a few characters in a word. For example, usage of words like thankssss, hellpppp, sryy, byeeee, wowww, goood corresponds to the person being obliged, needy, apologetic, emotional, amazed, etc. As far as we know, it is unlikely for an English word to contain the same character consecutively three or more times.
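This observation suggests a simple first step: collapse any run of three or more identical characters down to two. A minimal sketch of this step in Python (illustrative only; the actual TransChat modules are Java-based):

```python
import re

def compress(word):
    """Collapse runs of 3+ identical characters down to 2.

    E.g. "pleeeeeeeaaaaaassseeee" -> "pleeaassee".
    """
    return re.sub(r'(.)\1{2,}', r'\1\1', word)
```

Runs of exactly two characters (as in "feel" or "good") are left untouched, since they can occur in valid English words.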

We hence compress all repeated windows of character length greater than two down to two characters. For example, "pleeeeeeeaaaaaassseeee" is converted to "pleeaassee". Each window now contains two characters of the same letter in cases of repetition. Let n be the number of windows obtained from the previous step. Since the average length of an English word is approximately 4.9 characters (Mayzner, 1965), we apply a brute-force search over the 2^n possibilities to select a valid dictionary word. If none of the combinations forms a valid English word, the compressed form is used for normalization. Table 1 contains sanitized sample output from our compression module for further processing.

Input Sentence              Output Sentence
I feeellll soooooo gooood   I feel so good
I m veryyyy happyyyyy       I m very happy

Table 1: Sample output of Compression module

2.2 Normalization

Text message normalization is the process of translating ad-hoc abbreviations, typographical errors, phonetic substitutions and ungrammatical structures used in text messaging (SMS and chatting) into plain English. Use of such language (often referred to as "chatting language") induces noise which poses additional processing challenges. While dictionary-lookup-based methods (e.g. http://www.lingo2word.com) are popular for normalization, they cannot make use of context and domain knowledge. For example, yr can have multiple translations, like year and your.

We tackle this by implementing a normalization system (Raghunathan and Krawczyk, 2009; model available at http://www.cfilt.iitb.ac.in/Normalization.rar) as a phrase-based machine translation system that learns normalization patterns from a large number of training examples. We use Moses (Koehn et al., 2007), a statistical machine translation system that allows training of translation models. The training process requires a language model of the target language and a parallel corpus containing aligned un-normalized and normalized word pairs. Our language model consists of 15000 English words taken from the web. The parallel corpus was collected from the following sources:

1. The Stanford Normalization Corpus, which consists of 9122 pairs of un-normalized and normalized words/phrases.

2. The above corpus, however, lacks acronyms and shorthand texts like 2mrw, l8r and b4, which are frequently used in chatting. We collected additional data through crowd-sourcing involving peers from the CFILT lab (http://www.cfilt.iitb.ac.in). They were asked to enter commonly used chatting sentences/acronyms and their normalized versions. We collected 215 un-normalized to normalized word/phrase mappings.

3. A dictionary of Internet slang words extracted from http://www.noslang.com.

Table 2 contains input and normalized output from our module.

Input Sentence        Output Sentence
shd v go 2 ur house   should we go to your house
u r awesm             you are awesome
hw do u knw           how do you know

Table 2: Sample output of Normalization module
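The brute-force window search described in Section 2.1 can be sketched as follows. The word list and function names here are illustrative stand-ins, not the actual TransChat (Java) implementation:

```python
import itertools
import re

# Toy stand-in for a real English word list.
DICTIONARY = {"please", "good", "so", "feel"}

def compress(word):
    # Collapse runs of 3+ identical characters down to 2 (Section 2.1).
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

def search_valid_word(word, dictionary=DICTIONARY):
    """Try all 2^n ways of shrinking the doubled windows to find a dictionary word."""
    compressed = compress(word)
    # Split into windows: doubled characters ("ee") or single characters ("p").
    windows = [m.group(0) for m in re.finditer(r'(.)\1?', compressed)]
    # A doubled window may keep both of its characters or just one.
    options = [(w, w[0]) if len(w) == 2 else (w,) for w in windows]
    for combo in itertools.product(*options):
        candidate = ''.join(combo)
        if candidate in dictionary:
            return candidate
    return compressed  # no combination is a valid word: keep the compressed form

# search_valid_word("pleeeeeeeaaaaaassseeee") -> "please" with the toy dictionary
```

The fallback to the compressed form mirrors the paper's design: words the dictionary search cannot resolve are handed on to the normalization module.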

2.3 Spell Checking

Users often make spelling mistakes while chatting. A spell checker makes sure that a valid English word is sent to the translation system, so that a valid output is produced. We take this problem into account and introduce a spell checker as a pre-processing module before the sentence is sent for translation. We have used the Java API of the Jazzy spell checker (http://sourceforge.net/projects/jazzy/) for handling spelling mistakes. An example of a correction provided by the spell checker module:

Input: Today is mnday
Output: Today is monday

Please note that our current system performs compression, normalization and spell checking only if the input language is English. If the user selects one of the available Indian languages as the input language, the input text bypasses these stages and is fed to the translation system directly. We plan, however, to include such modules for Indian languages in the future.

If the input language is not English, the user can type the text in Romanized script, which is transliterated to the Indic script with the help of Google's transliteration API. The user can also input the text directly using an in-script keyboard (available on Windows and Android platforms), in which case the transliteration has to be switched off.

2.4 Translation

The translation system is a pluggable module that provides translation between language pairs. Currently, TransChat supports translation between English and five Indian languages, namely Hindi, Gujarati, Punjabi, Marathi and Malayalam. Translations between these languages are carried out using the statistical phrase-based machine translation paradigm, powered by the Sata-Anuvadak system (http://www.cfilt.iitb.ac.in/static/download.html).

2.5 Transliteration

There is a high probability that a chat will contain many named entities and out-of-vocabulary (OOV) words. Such words are not part of the training corpora of a translation system, and thus do not get translated. For such words, we use the Google Transliteration API (https://developers.google.com/transliterate/) as a post-processing step, so that the output comes out in the desired script. Table 3 contains sample output from our transliteration module.

Example 1
  Input:  Putin अमेरिका आए थे । (Putin amerika aaye the; "Putin visited America")
  Output: पुतिन अमेरिका आए थे । (putin amerika aaye the; "Putin visited America")
Example 2
  Input:  Telangana भारत का नया राज्य है (telangaana bhaarat ka nayaa rajya hai; "Telangana is a new state of India")
  Output: तेलंगाना भारत का नया राज्य है । (telangaana bhaarat ka nayaa rajya hai; "Telangana is a new state of India")

Table 3: Sample output of Transliteration module

3 User Interface

We designed the interface using HTML, CSS, JavaScript and JSP on the front end, as a wrapper over our system running on a Tomcat web server. We have used the Bootstrap (http://getbootstrap.com/) and Semantic UI (http://semantic-ui.com/) CSS frameworks for the beautification of our interface. The design is simple: the interface sends the chat message over a socket to the chat server, receives the response from the server, and shows it on the interface. We use a similar interface for the mobile versions of our application: the WebView class for Android and Windows applications, and UIWebView for iOS. Figure 3 shows a screenshot of the interface.

Figure 3: Chat Screenshot
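The English-only gating of the pre-processing stages (Sections 2.1-2.3) can be sketched as a small pipeline. The stage implementations below are toy stand-ins for the actual modules (the Moses-based normalizer and the Jazzy spell checker), and the lookup tables are invented for illustration:

```python
import re

SLANG = {"hw": "how", "r": "are", "u": "you"}   # toy normalization table
SPELL = {"mnday": "monday"}                      # toy spell-correction table

def compress(word):
    # Collapse runs of 3+ repeated characters to 2 (Section 2.1).
    return re.sub(r'(.)\1{2,}', r'\1\1', word)

def preprocess(text, lang):
    """English input goes through compression, normalization and spell
    checking; Indic-language input bypasses these stages (Section 2.3)."""
    if lang != "en":
        return text  # fed to the translation engine directly
    words = [compress(w) for w in text.split()]
    words = [SLANG.get(w, w) for w in words]   # normalization stand-in
    words = [SPELL.get(w, w) for w in words]   # spell-check stand-in
    return " ".join(words)

# preprocess("hw r u", "en")   -> "how are you"
# preprocess("kaise ho", "hi") -> "kaise ho"  (bypassed)
```

In the real system the output of this pipeline is what reaches the SMT engine, which is why keeping each stage light-weight matters for chat latency.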

              BLEU                        METEOR
          Google   SATA     Bing     Google   SATA     Bing
Hindi     0.1991   0.1757   0.0497   0.2323   0.1529   0.2217
Marathi   0.0880   0.0185   NA       0.1478   0.0843   NA
Punjabi   0.2277   0.0481   NA       0.2084   0.0902   NA
Gujarati  0.0446   0.0511   NA       0.1354   0.1255   NA
Malayalam 0.0612   0.0084   NA       0.1259   0.0666   NA

Table 4: Evaluation Statistics on Normalized input
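For reference, the BLEU scores reported here combine modified n-gram precisions with a brevity penalty. A minimal single-reference sketch (not the official scoring script) is:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """Single-reference BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (Papineni et al., 2002)."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # penalize short output
    return brevity * geo_mean

# A hypothesis identical to the reference scores 1.0.
```

Production scoring additionally smooths the precisions and aggregates at corpus level, which this sketch omits.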

4 Evaluation Details

We perform a two-fold evaluation of our system that ensures the following:

• A good user experience, in terms of (a) using the system and (b) acceptable translation quality.

• That the pre- and post-processing steps employed help enhance the quality of translated chat.

We study the first point qualitatively by employing three human users who are asked to use the system by giving 20 messages each. They eventually assign three scores to the system: a usability score, a fluency score of the chat output and an adequacy score of the chat output. Fluency indicates grammatical correctness, whereas adequacy indicates how appropriate the machine translation is in terms of semantic transfer. We then compute the average Inter-Annotator Agreement (IAA) between their scores. Table 6 presents the results. We obtained a substantial IAA via Fleiss' kappa evaluation.

     P1    P2    P3
P1   1
P2   .63   1
P3   .67   .61   1

Table 6: Inter-Annotator Agreement for user experience evaluation

To justify the second point, we input the un-normalized chat messages to three different translators, viz. Google, Bing and Sata-Anuvaadak, and obtain the BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014) translation scores. We then compute the same evaluation scores after applying the pre- and post-processing steps explained in section 2. Tables 4 and 5 present the scores for processed and unprocessed inputs. As we can see, applying the pre- and post-processing steps helps us achieve better evaluation scores.

              BLEU                        METEOR
          Google   SATA     Bing     Google   SATA     Bing
Hindi     0.0306   0.0091   0.0413   0.0688   0.0530   0.0778
Marathi   0.0046   0.0045   NA       0.0396   0.0487   NA
Punjabi   0.0196   0.0038   NA       0.0464   0.0235   NA
Gujarati  0.0446   0.0109   NA       0.1354   0.0600   NA
Malayalam 0.0089   0.0168   NA       0.0358   0.1045   NA

Table 5: Evaluation Statistics on Unnormalized input

We observe a slight inconsistency between BLEU and METEOR scores in some cases. BLEU has been shown to be less effective for systems involving Indian languages due to language divergence, free word order and morphological richness (Ananthakrishnan et al., 2007). On the other hand, our METEOR module has been modified to support stemming, paraphrasing and lemmatization for Indian languages (Dungarwal et al., 2014), tackling such nuances to some extent. This may account for the score differences.

5 Conclusion and Future Work

In today's era of short messaging and chatting, there is a great need for a multilingual chat system where users can interact despite the language barrier. We present such an Indian language IM application that facilitates cross-lingual text-based communication. Our success in developing such an application for English and Indian languages is a small step towards providing people with easy access to such a chat system. We plan to improve this application to enable chatting in all Indian languages via a common interface. We also plan to incorporate word suggestions based on corpus statistics as messages are being typed, so that the user is presented with better lexical choices as a sentence is being formed. We can also perform emotion analysis of the chat messages and introduce linguistic improvements, such as euphemism for sensitive dialogues. Such an application can have a huge impact on communication, especially in rural and semi-urban areas, along with enabling people to understand each other.

6 Acknowledgment

We would like to thank CFILT, IIT Bombay for granting us the computational resources and for the annotation of our experiments. We would also like to thank Avneesh Kumar, who helped us set up the experiment in its initial phase.

References

R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. 2007. Some issues in automatic evaluation of English-Hindi MT: More blues for BLEU. ICON.

Gary Anthes. 2010. Automated translation of Indian languages. Communications of the ACM, 53(1):24-26.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Piyush Dungarwal, Rajen Chatterjee, Abhijit Mishra, Anoop Kunchukuttan, Ritesh Shah, and Pushpak Bhattacharyya. 2014. The IIT Bombay Hindi-English translation system at WMT 2014. ACL 2014, page 90.

Raymond S. Flournoy and Christopher Callison-Burch. 2000. Reconciling user expectations and translation technology to create a useful real-world application. In Proceedings of the 22nd International Conference on Translating and the Computer, pages 16-17.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177-180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, and Pushpak Bhattacharyya. 2014. Sata-Anuvadak: Tackling multiway translation of Indian languages.

S. Mayzner. 1965. Tables of Single-letter and Digram Frequency Counts for Various Word-length and Letter-position Combinations. Psychonomic Monograph Supplements. Psychonomic Press.

Anantpur Amba Padmanathrao. 2009. Anusaaraka: An approach for MT taking insights from the Indian grammatical tradition.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Karthik Raghunathan and Stefan Krawczyk. 2009. CS224N: Investigating SMS text normalization using statistical machine translation. Department of Computer Science, Stanford University.

R. M. K. Sinha, K. Sivaraman, A. Agrawal, R. Jain, R. Srivastava, and A. Jain. 1995. ANGLABHARTI: A multilingual machine aided translation project on translation from English to Indian languages. In Systems, Man and Cybernetics, 1995. Intelligent Systems for the 21st Century, IEEE International Conference on, volume 2, pages 1609-1614. IEEE.

Sara Stymne. 2011. Spell checking techniques for replacement of unknown words and data cleaning for Haitian Creole SMS translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 470-477, Stroudsburg, PA, USA. Association for Computational Linguistics.