Arxiv:1805.08701V1 [Cs.CL] 22 May 2018 Cally Transliterated Form in Order to Use a Common Form of the Word Can Be Fetched (I.E
Total Page:16
File Type:pdf, Size:1020Kb
Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance Soumil Mandal, Karthick Nanmaran Department of Computer Science & Engineering SRM Institute of Science & Technology, Chennai, India fsoumil.mandal, [email protected] Abstract ating systems for code-mixed data, post language tagging, normalization of transliterated text is ex- Building tools for code-mixed data is rapidly gaining popularity in the NLP research com- tremely important in order to identify the word munity as such data is exponentially rising on and understand it’s semantics. This would help social media. Working with code-mixed data a lot in systems like opinion mining, and is actu- contains several challenges, especially due to ally necessary for tasks like summarization, trans- grammatical inconsistencies and spelling vari- lation, etc. A normalizing module will also be of ations in addition to all the previous known immense help while making word embeddings for challenges for social media scenarios. In this code-mixed data. article, we present a novel architecture focus- In this paper, we present an architecture for ing on normalizing phonetic typing variations, which is commonly seen in code-mixed data. automatic normalization of phonetically translit- One of the main features of our architecture is erated words to their standard forms. The lan- that in addition to normalizing, it can also be guage pair we have worked on is Bengali-English utilized for back-transliteration and word iden- (Bn-En), where both are typed in Roman script, tification in some cases. Our model achieved thus the Bengali words are in their transliterated an accuracy of 90.27% on the test data. form. The canonical or normalized form we have 1 Introduction considered is the Indian Languages Translitera- tion (ITRANS) form of the respective Bengali With rising popularity of social media, the amount word. Bengali is an Indo-Aryan language of In- of data is rising exponentially. If mined, this data dia where 8.10% 3 of the total population are 1st can proof to be useful for various purposes. In language speakers and is also the official language countries where the number of bilinguals are high, of Bangladesh. The mother script of Bengali is we see that users tend to switch back and forth be- the Eastern Nagari Script 4. Our architecture uti- tween multiple languages, a phenomenon known lizes fully char based sequence to sequence learn- as code-mixing or code-switching. An interest- ing in addition to Levenshtein distance to give the ing case is switching between languages which final normalized form or as close to it as possible. share different mother scripts. On such occasions, Some additional advantages of our system is that one of the two languages is typed in it’s phoneti- at an intermediate stage, the back-transliterated arXiv:1805.08701v1 [cs.CL] 22 May 2018 cally transliterated form in order to use a common form of the word can be fetched (i.e. word identifi- script. Though there are some standard transliter- cation), which will be very useful in several cases ation rules, for example ITRANS 1, ISO 2, but it as original tools (i.e. tools using mother script) can is extremely difficult and un-realistic for people to be utilized, for example emotion lexicons. Some follow them while typing. This indeed is the case other important contributions of our research are as we see that identical words are being transliter- the new lexicons that have been prepared (dis- ated differently by different people based on their cussed in Sec3) which can be used for building own phonetic judgment influenced by dialects, lo- various other tools for studying Bengali-English cation, or sometimes even based on the informal- code-mixed data. ity or casualness of the situation. Thus, for cre- 1https://en.wikipedia.org/wiki/ITRANS 3https://en.wikipedia.org/wiki/Languages of India 2https://en.wikipedia.org/wiki/ISO 15919 4https://www.omniglot.com/writing/bengali.htm 2 Related Work then converted it into standardized ITRANS for- mat. The final size of the PL was 6000. The Normalization of text has been studied quite a second lexicon we created was a transliteration lot (Sproat et al., 1999), especially as it acts as dictionary (BN TRANS) where the first column a pre-processing step for several text process- had Bengali words in Eastern Nagari script taken ing systems. Using Conditional Random Fields from samsad 5, while the second column had the (CRF), Zhu et al.(2007) performed text normal- standard transliterations (ITRANS). The number ization on informal emails. Dutta et al.(2015) of entries in the dictionary was 21850. For test- created a system based on noisy channel model ing, we took the data used in Mandal and Das for text normalization which handles wordplay, (2018), language tagged it using the system de- contracted words and phonetic variations in code- scribed in Mandal et al.(2018a), and then col- mixed background. An unsupervised framework lected Bn tagged tokens. Post manual checking was presented by Sridhar(2015) for normaliz- and discarding of misclassified tokens, the size ing domain-specific and informal noisy texts using of the list was 905. Finally, each of the words distributed representation of words. The soundex were tagged with their ITRANS using the same algorithm was used in (Sitaram et al., 2015) and approach used for making PL. For PL and test (Sitaram and Black, 2016) for spelling correc- col 1 data, some initial rule based normalization tech- tion of transliterated words and normalization in niques were used. If the input string contains a a speech to text scenario of code-mixed data re- digit, it was replaced by the respective phone (e.g. spectively. Sharma et al.(2016) build a normal- ek for 1, dui for 2, etc), and if there are n consecu- ization system using noisy channel framework and tive identical characters where n > 2 (elongation), SILPA spell checker in order to build a shallow it was trimmed down to 2 consecutive characters parser. Sproat and Jaitly(2016) build a system (e.g. baaaad will become baad), as no word in combining two models, where one essentially is a it’s standard form has more than two consecutive seq2seq model which checks the possible normal- identical characters. izations and the other is a language model which considers context information. Jaitly and Sproat 4 Proposed Method (2017) used a seq2seq model with attention trained at sentence level followed by error pruning us- Our method is a two step modular approach com- ◦ ing finite-state filters to build a normalization sys- prising of two degrees of normalization. The 1 tem, mainly targeted for text to speech purposes. normalization module does an initial normaliza- A similar flow was adopted by Zare and Rohatgi tion and tries to convert the input string closest ◦ (2017) as well where seq2seq was used for nor- to the standard transliteration. The 2 normaliza- malization and a window of size 20 was consid- tion module takes the output from the first mod- ered for context. Singh et al.(2018) exploited the ule and tries to match with the standard transliter- fact that words and their variations share similar ations present in the dictionary (BN TRANS). The context in large noisy text corpora to build their candidate with the closest match is returned as the normalizing model, using skip gram and cluster- final normalized string. ing techniques. To the best of our knowledge, 5 1◦ Normalization Module the system architecture proposed by us hasn’t been tried before, especially for code-mixed data. The purpose of this module is to phonetically nor- malize the word as close to the standard transliter- 3 Data Sets ation as possible, to make the work of the match- On a whole, three data sets or lexicons were cre- ing module easier. To achieve this, our idea was ated. The first data set was a parallel lexicon (PL) to train a sequence to sequence model where the where the 1st column had phonetically transliter- input sequences are user transliterated words and ated Bn words in Roman taken from code-mixed the target sequences are the respective ITRANS data prepared in Mandal et al.(2018b). The 2 nd transliterations. We had specifically chosen this column consisted of the standard Roman translit- architecture as it has performed amazingly well in erations (ITRANS) of the respective words. To get complex sequence mapping tasks like neural ma- chine translation and summarization. this, we first manually back-transliterated PLcol 1 to the original word in Eastern Nagari script, and 5http://dsal.uchicago.edu/dictionaries/biswas-bengali/ 5.1 Seq2Seq Model 5.2 Training The sequence to sequence model (Sutskever et al., The model is trained by minimizing the negative 2014) is a relatively new idea for sequence learn- log-likelihood. For training, we used the fully ing using neural networks. It has been especially character based seq2seq model (Lee et al., 2016) popular since it achieved state of the art results in with stacked LSTM cells. The input units were machine translation task. Essentially, the model user typed phonetic transliterations (PLcol 1) while takes as input a sequence X = fx1, x2, ..., xng the target units were respective standard transliter- and tries to generate another sequence Y = fy1, ations (PLcol 2). Thus, the model learns to map y2, ..., ymg, where xi and yi are the input and user transliterations to standard transliterations, target symbols respectively. The architecture of effectively learning to normalize phonetic varia- seq2seq model comprises of two parts, the en- tions. The lookup table Ex we used for character coder and decoder. As the input and target vec- encoding was a dictionary where the keys were the tors were quite small (words), attention (Vaswani 26 English alphabets and the values were the re- et al., 2017) mechanism was not incorporated.