Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Soumil Mandal, Karthick Nanmaran

Department of Computer Science & Engineering SRM Institute of Science & Technology, Chennai, India {soumil.mandal, karthicknanmaran}@gmail.com

Abstract ating systems for code-mixed data, post language tagging, normalization of transliterated text is ex- Building tools for code-mixed data is rapidly gaining popularity in the NLP research com- tremely important in order to identify the word munity as such data is exponentially rising on and understand it’s semantics. This would help social media. Working with code-mixed data a lot in systems like opinion mining, and is actu- contains several challenges, especially due to ally necessary for tasks like summarization, trans- grammatical inconsistencies and spelling vari- lation, etc. A normalizing module will also be of ations in addition to all the previous known immense help while making word embeddings for challenges for social media scenarios. In this code-mixed data. article, we present a novel architecture focus- In this paper, we present an architecture for ing on normalizing phonetic typing variations, which is commonly seen in code-mixed data. automatic normalization of phonetically translit- One of the main features of our architecture is erated words to their standard forms. The lan- that in addition to normalizing, it can also be guage pair we have worked on is Bengali-English utilized for back- and word iden- (Bn-En), where both are typed in Roman script, tification in some cases. Our model achieved thus the Bengali words are in their transliterated an accuracy of 90.27% on the test data. form. The canonical or normalized form we have 1 Introduction considered is the Indian Languages Translitera- tion (ITRANS) form of the respective Bengali With rising popularity of social media, the amount word. Bengali is an Indo-Aryan language of In- of data is rising exponentially. If mined, this data dia where 8.10% 3 of the total population are 1st can proof to be useful for various purposes. In language speakers and is also the official language countries where the number of bilinguals are high, of Bangladesh. The mother script of Bengali is we see that users tend to switch back and forth be- the Eastern Nagari Script 4. Our architecture uti- tween multiple languages, a phenomenon known lizes fully char based sequence to sequence learn- as code-mixing or code-switching. An interest- ing in addition to Levenshtein distance to give the ing case is switching between languages which final normalized form or as close to it as possible. share different mother scripts. On such occasions, Some additional advantages of our system is that one of the two languages is typed in it’s phoneti- at an intermediate stage, the back-transliterated arXiv:1805.08701v1 [cs.CL] 22 May 2018 cally transliterated form in order to use a common form of the word can be fetched (.. word identifi- script. Though there are some standard transliter- cation), which will be very useful in several cases ation rules, for example ITRANS 1, ISO 2, but it as original tools (i.e. tools using mother script) can is extremely difficult and un-realistic for people to be utilized, for example emotion lexicons. Some follow them while typing. This indeed is the case other important contributions of our research are as we see that identical words are being transliter- the new lexicons that have been prepared (dis- ated differently by different people based on their cussed in Sec3) which can be used for building own phonetic judgment influenced by dialects, lo- various other tools for studying Bengali-English cation, or sometimes even based on the informal- code-mixed data. ity or casualness of the situation. Thus, for cre-

1https://en.wikipedia.org/wiki/ITRANS 3https://en.wikipedia.org/wiki/Languages of India 2https://en.wikipedia.org/wiki/ISO 15919 4https://www.omniglot.com/writing/bengali.htm 2 Related Work then converted it into standardized ITRANS for- mat. The final size of the PL was 6000. The Normalization of text has been studied quite a second lexicon we created was a transliteration lot (Sproat et al., 1999), especially as it acts as dictionary (BN TRANS) where the first column a pre-processing step for several text process- had Bengali words in Eastern Nagari script taken ing systems. Using Conditional Random Fields from samsad 5, while the second column had the (CRF), Zhu et al.(2007) performed text normal- standard (ITRANS). The number ization on informal emails. Dutta et al.(2015) of entries in the dictionary was 21850. For test- created a system based on noisy channel model ing, we took the data used in Mandal and Das for text normalization which handles wordplay, (2018), language tagged it using the system de- contracted words and phonetic variations in code- scribed in Mandal et al.(2018a), and then col- mixed background. An unsupervised framework lected Bn tagged tokens. Post manual checking was presented by Sridhar(2015) for normaliz- and discarding of misclassified tokens, the size ing domain-specific and informal noisy texts using of the list was 905. Finally, each of the words distributed representation of words. The soundex were tagged with their ITRANS using the same algorithm was used in (Sitaram et al., 2015) and approach used for making PL. For PL and test (Sitaram and Black, 2016) for spelling correc- col 1 data, some initial rule based normalization tech- tion of transliterated words and normalization in niques were used. If the input string contains a a speech to text scenario of code-mixed data re- digit, it was replaced by the respective phone (e.g. spectively. Sharma et al.(2016) build a normal- ek for 1, dui for 2, etc), and if there are n consecu- ization system using noisy channel framework and tive identical characters where n > 2 (elongation), SILPA spell checker in order to build a shallow it was trimmed down to 2 consecutive characters parser. Sproat and Jaitly(2016) build a system (e.g. baaaad will become baad), as no word in combining two models, where one essentially is a it’s standard form has more than two consecutive seq2seq model which checks the possible normal- identical characters. izations and the other is a language model which considers context information. Jaitly and Sproat 4 Proposed Method (2017) used a seq2seq model with attention trained at sentence level followed by error pruning us- Our method is a two step modular approach com- ◦ ing finite-state filters to build a normalization sys- prising of two degrees of normalization. The 1 tem, mainly targeted for text to speech purposes. normalization module does an initial normaliza- A similar flow was adopted by Zare and Rohatgi tion and tries to convert the input string closest ◦ (2017) as well where seq2seq was used for nor- to the standard transliteration. The 2 normaliza- malization and a window of size 20 was consid- tion module takes the output from the first mod- ered for context. Singh et al.(2018) exploited the ule and tries to match with the standard transliter- fact that words and their variations share similar ations present in the dictionary (BN TRANS). The context in large noisy text corpora to build their candidate with the closest match is returned as the normalizing model, using skip gram and cluster- final normalized string. ing techniques. To the best of our knowledge, 5 1◦ Normalization Module the system architecture proposed by us hasn’t been tried before, especially for code-mixed data. The purpose of this module is to phonetically nor- malize the word as close to the standard transliter- 3 Data Sets ation as possible, to make the work of the match- On a whole, three data sets or lexicons were cre- ing module easier. To achieve this, our idea was ated. The first data set was a parallel lexicon (PL) to train a sequence to sequence model where the where the 1st column had phonetically transliter- input sequences are user transliterated words and ated Bn words in Roman taken from code-mixed the target sequences are the respective ITRANS data prepared in Mandal et al.(2018b). The 2 nd transliterations. We had specifically chosen this column consisted of the standard Roman translit- architecture as it has performed amazingly well in erations (ITRANS) of the respective words. To get complex sequence mapping tasks like neural - chine translation and summarization. this, we first manually back-transliterated PLcol 1 to the original word in Eastern Nagari script, and 5http://dsal.uchicago.edu/dictionaries/biswas-bengali/ 5.1 Seq2Seq Model 5.2 Training The sequence to sequence model (Sutskever et al., The model is trained by minimizing the negative 2014) is a relatively new idea for sequence learn- log-likelihood. For training, we used the fully ing using neural networks. It has been especially character based seq2seq model (Lee et al., 2016) popular since it achieved state of the art results in with stacked LSTM cells. The input units were machine translation task. Essentially, the model user typed phonetic transliterations (PLcol 1) while takes as input a sequence X = {x1, x2, ..., xn} the target units were respective standard transliter- and tries to generate another sequence Y = {y1, ations (PLcol 2). Thus, the model learns to map y2, ..., ym}, where xi and yi are the input and user transliterations to standard transliterations, target symbols respectively. The architecture of effectively learning to normalize phonetic varia- seq2seq model comprises of two parts, the en- tions. The lookup table Ex we used for character coder and decoder. As the input and target vec- encoding was a dictionary where the keys were the tors were quite small (words), attention (Vaswani 26 English alphabets and the values were the re- et al., 2017) mechanism was not incorporated. spective index. Encodings at character level were then padded to the length of the maximum existing 5.1.1 Encoder word in the dataset, which was 14, and was con- Encoder essentially takes a variable length se- verted to one-hot encodings prior to feeding the to quence as input and encodes it into a fixed length the seq2seq model. We created our seq2seq model vector, which is suppose to summarize it’s mean- using the Keras (Chollet et al., 2015) library. The ing taking into context as well. A recurrent neural batch size was set to 64, and number of epochs network (RNN) cell is used to achieve this. The was set to 100. The size of the latent dimension directional encoder reads the sequence from one was kept at 128. Optimizer we chose was rm- end to the other (left to right in our case). sprop, learning rate set at 0.001, loss was categor- ical crossentropy and transfer function used was ~ ~ ~ ht = f enc(Ex(xt), ht-1) softmax. Accuracy and loss graphs during train- ing with respect to epochs are shown in Fig1. Here, Ex is the input embedding lookup table (dic- ~ tionary), f enc are the transfer function for the re- current unit e.g. Vanilla, LSTM or GRU. A con- tiguous sequence of encodings C = {h1, h2, ..., hn} is constructed which is then passed on to the de- coder.

5.1.2 Decoder Decoder takes input context vector C from the en- coder, and computes the hidden state at time t as,

st = f dec(Ey(yt-1), st-1, ct) Subsequently, a parametric function out returns k Figure 1: Training accuracy and loss. the conditional probability using the next target symbol being k. Here, the concept of teacher forc- As we can see from Fig1, the accuracy reached ing is utilized, the strategy of feeding output of the at the end of training was not too high (around model from a prior time-step as input. 41.2%) and the slope became asymptotic. This is quite understandable as the amount of training p(yt = k | y < t, X) = 1 data was relatively quite low for the task, and the exp(out (E (y − 1), s , c )) Z k y t t t phonetic variations were quite high. On running this module on our testing data, an accuracy of Z is the normalizing constant 51.04% was achieved. It should be noted that even a single character predicted wrongly by the soft- X jexp(outj(Ey(yt − 1), st, ct)) max function reduces the accuracy. 6 2◦ Normalization Module (by 30.94%). This proves that instead of sim- ple distance comparison with lexicon entries, a This module basically comprises of the string prior seq2seq normalization can have great impact matching algorithm. For this, we have used on the performance. Additionally, we can also Levenshtein distance (LD) (Levenshtein, 1966), see that when modified input is given to the Lev- which is a string metric for measuring difference enshtein distance (LD), the accuracies achieved between two sequences. It does so by calculating are slightly better. On analyzing the errors, we the minimum number of insertions, deletions and found out that majority (92%) of them is due to substitutions required for converting one sequence the fact that the standard from was not present in to the other. Here, the output from the previous BN TRANS, i.e. was out of vocabulary. These module is compared with all the standard ITRANS words were mostly slangs, expressions, or two entries present in BN TRANS and the string with words joined into a single one. The other 8% was the least Levenshtein distance is given as out- due to the 1◦ module casuing substantial devia- put, which is the final normalized form. If there tion from normal form. For deeper analysis, we are ties, the instance which has higher matches collected the ITRANS of errors due out of vocab, traversing from left to right is given more prior- and on comparison with the 1◦ normalizations, the ity. Also, observing the errors from 1◦ normal- mean LD was calculated to be 1.89, which is sug- izer, we noticed that in a lot of cases, the charac- gesting that if they were present in BN TRANS, ter pairs {a,} and {b,v} are used interchangeably the normalizer would have given the correct out- quite often, both in phonetic transliterations alone, put. as well as when compared with ITRANS. Thus, along with the standard approach, we tried a mod- 7.2 Task Level ified version as well where the cost of the above For task level evaluation, we decided to go with mentioned character pairs are same, i.e. they are sentiment analysis using the exact setup and data treated as identical characters. This was simply described in Mandal et al.(2018b), on Bengali- done by assigning special symbols to those pairs, English code-mixed data. All the training and test- and replacing them in the input parameters. For ing data were normalized using our system along example, post replacement, distance(chalo, chala) with the lexicons that are mentioned. Finally, the will become distance(ch$l$, ch$l$). same steps were followed and the different metrics 7 Evaluation were calculated. The comparison of the systems prior (noisy) and post normalization (normalized) Our system was evaluated in two ways, one at is shown in Table2. word level and another at task level. Model Acc Prec Rec F1 7.1 Word Level noisy 80.97 81.03 80.97 81.20 Here, the basic idea was compare the normalized normalized 82.47 82.52 82.47 82.61 words with the respective standard transliterations. Table 2: Prior and post normalization results. For this, the testing data discussed in Sec3 was used. For comparison purposes, three other se- tups other than our proposed model (setup 4) were We can see an improvement in the accuracy (by tested, all of which are described in Table1. 1.5%). On further investigation, we saw that the unigram, bigram and trigram matches with the bag Model 1◦ Norm LD Acc of n-grams and testing data increased by 1.6%, setup 1 no standard 58.78 0.4% and 0.1% respectively. The accuracy can setup 2 no modified 61.10 be improved further more if back-transliteration is setup 3 yes standard 89.72 done and Bengali sentiment lexicons are used but setup 4 yes modified 90.27 that is beyond the scope of this paper.

Table 1: Comparison of different setups. 8 Discussion Though our proposed model achieved high accu- From Table1, we can see that the jump in accu- racy, some drawbacks are there. Firstly is the re- racy from setup 1 to setup 3 is quite significant quirement for the parallel corpus (PL) for training a seq2seq model, as manual checking and back- tors for sentiment identification. arXiv preprint transliteration is quite tedious. Speed of process- arXiv:1801.02581. ing in terms of words/second is not very high due Soumil Mandal, Sourya Dipta Das, and Dipankar Das. to the fact that both seq2seq and Levenshthein dis- 2018a. Language identification of bengali-english tance calculation is computationally heavy, plus code-mixed data using character & phonetic based the O(n) search time. For string matching, sim- lstm models. arXiv preprint arXiv:1803.03859. pler and faster methods can be tested and search Soumil Mandal, Sainik Kumar Mahata, and Dipankar area reduction algorithms (e.g. specifying search Das. 2018b. Preparing bengali-english code-mixed domains depending on starting character) can be corpus for sentiment analysis of indian languages. arXiv preprint arXiv:1803.04000. tried to improve the processing speed. A simple lexical checker can be added as well before using Arnav Sharma, Sakshi Gupta, Raveesh Motlani, Piyush seq2seq and/or matching module to see if the word Bansal, Manish Srivastava, Radhika Mamidi, and Dipti M Sharma. 2016. Shallow parsing pipeline for is already in it’s transliterated form. hindi-english code-mixed social media text. arXiv preprint arXiv:1604.03136. 9 Conclusion & Future Work Rajat Singh, Nurendra Choudhary, and Manish Shri- In this article, we have presented a novel archi- vastava. 2018. Automatic normalization of word tecture for normalization of transliterated words variations in code-mixed social media text. arXiv preprint arXiv:1804.00804. in code-mixed scenario. We have employed the seq2seq model with LSTM cells for initial normal- Sunayana Sitaram and Alan W Black. 2016. Speech ization followed by evaluating Levenshthein dis- synthesis of code-mixed text. In LREC. tance to retrieve the standard transliteration from a Sunayana Sitaram, Sai Krishna Rallabandi, and SRAW lexicon. Our approach got an accuracy of 90.27% Black. 2015. Experiments with cross-lingual sys- on testing data, and improved the accuracy of a tems for synthesis of code-mixed text. In 9th ISCA Speech Synthesis Workshop, pages 76–81. pre-existing sentiment analysis system by 1.5%. In future, we would like to collect more translit- Richard Sproat, Alan W Black, Stanley Chen, Shankar erated words and increase the data size in order to Kumar, Mari Ostendorf, and Christopher Richards. 1999. Normalization of non-standard words. In improve both PL and BN TRANS. Also, combin- WS’99 Final Report. Citeseer. ing this module with a context capturing system and expanding to other Indic languages like Hindi, Richard Sproat and Navdeep Jaitly. 2016. Rnn ap- proaches to text normalization: A challenge. arXiv Tamil will be one of the goals as well. preprint arXiv:1611.00068. Vivek Kumar Rangarajan Sridhar. 2015. Unsupervised References text normalization using distributed representations of words and phrases. In Proceedings of the 1st Franc¸ois Chollet et al. 2015. Keras. https:// Workshop on Vector Space Modeling for Natural keras.io. Language Processing, pages 8–16.

Sukanya Dutta, Tista Saha, Somnath Banerjee, and Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sudip Kumar Naskar. 2015. Text normalization in Sequence to sequence learning with neural net- code-mixed social media text. In Recent Trends in works. In Advances in neural information process- Information Systems (ReTIS), 2015 IEEE 2nd Inter- ing systems, pages 3104–3112. national Conference on, pages 378–382. IEEE. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Navdeep Jaitly and Richard Sproat. 2017. An rnn Kaiser, and Illia Polosukhin. 2017. Attention is all model of text normalization. you need. In Advances in Neural Information Pro- Jason Lee, Kyunghyun Cho, and Thomas Hofmann. cessing Systems, pages 6000–6010. 2016. Fully character-level neural machine trans- Maryam Zare and Shaurya Rohatgi. 2017. Deepnorm- lation without explicit segmentation. arXiv preprint a deep learning approach to text normalization. arXiv:1610.03017. arXiv preprint arXiv:1712.06994. Vladimir I Levenshtein. 1966. Binary codes capable Conghui Zhu, Jie Tang, Hang Li, Hwee Tou Ng, and of correcting deletions, insertions, and reversals. In Tiejun Zhao. 2007. A unified tagging approach to Soviet physics doklady, volume 10, pages 707–710. text normalization. In Proceedings of the 45th An- nual Meeting of the Association of Computational Soumil Mandal and Dipankar Das. 2018. Ana- Linguistics, pages 688–695. lyzing roles of classifiers and code-mixed fac-