An Unsupervised Method for Discovering Lexical Variations in Roman Informal Text

Abdul Rafae, Abdul Qayyum, Muhammad Moeenuddin, Asim Karim, Hassan Sajjad and Faisal Kamiran
Qatar Computing Research Institute, Hamad Bin Khalifa University
Lahore University of Management Sciences
Information Technology University

Abstract

We present an unsupervised method to find lexical variations in Roman Urdu informal text. Our method includes a phonetic algorithm, UrduPhone, a feature-based similarity function, and a clustering algorithm, Lex-C. UrduPhone encodes Roman Urdu strings to their phonetically equivalent representations. This produces an initial grouping of the different spelling variations of a word. The similarity function incorporates word features and their context. Lex-C is a variant of the k-medoids clustering algorithm that groups lexical variations. It incorporates a similarity threshold to balance the number of clusters and their maximum similarity. We test our system on two datasets of SMS and blog texts and show an f-measure gain of up to 12% over baseline systems.

1 Introduction

Urdu is the national language of Pakistan and one of the official languages of India. It is written in Perso-Arabic script. However, in social media and short text messages (SMS), a large proportion of Urdu speakers use Roman script (i.e., the English alphabet) for writing, called Roman Urdu.

Roman Urdu lacks a standard lexicon, and usually many spelling variations exist for a given word; e.g., the word zindagi [life] is also written as zindagee, zindagy, zaindagee and zndagi. Specifically, the following normalization issues arise: (1) differently spelled words (see the example above), (2) identically spelled words that are lexically different (e.g., bahar can be used for both [outside] and [spring]), and (3) spellings that match words in English (e.g., had [limit] for the English word 'had'). These inconsistencies cause a problem of data sparsity in basic natural language processing tasks such as Urdu word segmentation (Durrani and Hussain, 2010), part-of-speech tagging (Sajjad and Schmid, 2009), spell checking (Naseem and Hussain, 2007), machine translation (Durrani et al., 2010), etc.

In this paper, we propose an unsupervised feature-based method that tackles the above-mentioned challenges in discovering lexical variations in Roman Urdu. We exploit phonetic and string-similarity-based features and incorporate contextual features via top-k previous and next words' features. For phonetic information, we develop an encoding scheme for Roman Urdu, UrduPhone, motivated by Soundex. Compared to other available phonetic-based schemes, which are mostly limited to English sounds only, UrduPhone maps Roman Urdu homophones effectively. Unlike previous work on short text normalization (see Section 2), we do not have information about standard word forms in the dataset. The problem becomes more challenging as every word in the corpus is a candidate variation of every other word. We present a variant of the k-medoids clustering algorithm that forms clusters in which every word has at least a specified minimum similarity with the cluster's centroidal word. We conduct experiments on two Roman Urdu datasets, an SMS dataset and a blog dataset, and evaluate performance using a gold standard. Our method shows an f-measure gain of up to 12% compared to baseline methods. The dataset and code are made available to the research community.

2 Previous Work

Normalization of short text messages and tweets has been in focus (Sproat et al., 2001; Wei et al., 2011; Clark and Araki, 2011; Roy et al., 2013; Chrupala, 2014; Kaufmann and Kalita, 2010; Sidarenka et al., 2013; Ling et al., 2013; Desai and Narvekar, 2015; Pinto et al., 2012).

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 823–828, Lisbon, Portugal, 17-21 September 2015. © 2015 Association for Computational Linguistics.

However, most of this work is limited to English or to other resource-rich languages. In this paper, we focus on Roman Urdu, an under-resourced language that does not have any gold-standard corpus with standard word forms. Therefore, we are restricted to the task of finding lexical variations in informal text. This is a rather more challenging problem, since in this case every word is a possible variation of every other word in the corpus.

Researchers have used phonetic, string, and contextual knowledge to find lexical variations in informal text.¹ Pinto et al. (2012), Han et al. (2012) and Zhang et al. (2015) used phonetic-based methods to find lexical variations. Han et al. (2012) also used word similarity and word context to enhance performance. Wang and Ng (2013) used normalization operations, e.g., missing word recovery and punctuation correction, to improve the normalization process. Irvine et al. (2012) used manually prepared training data to build an automatic normalization system. Contractor et al. (2010) used string edit distance to find candidate lexical variations. Yang and Eisenstein (2013) used an unsupervised approach with a log-linear model and sequential Monte Carlo approximation.

We propose an unsupervised method to find lexical variations. It uses string edit distance like Contractor et al. (2010), sound-based encoding like Pinto et al. (2012), and context like Han et al. (2012), combined in a discriminative framework. However, in contrast, it does not use any corpus of standard word forms to find lexical variations.

3 Our Method

The lexical variations of a lexical entry usually have high phonetic, string-based, and contextual similarity. We integrate a phonetic-based encoding scheme, UrduPhone, a feature-based similarity function, and a clustering algorithm, Lex-C.

3.1 UrduPhone

Several sound-based encoding schemes for words have been proposed in the literature, such as Soundex (Knuth, 1973; Hall and Dowling, 1980), NYSIIS (Taft, 1970), Metaphone (Philips, 1990), Caverphone (Wang, 2009) and Double Metaphone.² These schemes encode words based on their sounds, which in turn serves to group words of similar sounds (lexical variations) under one code. However, most of the schemes are designed for English and European languages and are limited when applied to other families of languages like Urdu.

In this work, we propose a phonetic encoding scheme, UrduPhone, tailored for Roman Urdu. The scheme is derived from the Soundex algorithm. It groups consonants on the basis of common homophones in Urdu and English. It differs from Soundex in two particular ways:³ Firstly, UrduPhone generates encodings of length six, compared to length four in Soundex. This enables UrduPhone to avoid mapping different forms of a root word to the same code. For example, muskurana [smiling] and mshuraht [smile] encode to the single form MSKR in Soundex, but in UrduPhone they have different encodings, MSKRN and MSKRHT respectively. Secondly, we introduce consonant groups that are mapped differently in Soundex. We do this by analyzing Urdu alphabets that map to a single Roman form; e.g., the words samar [reward], sabar [patience] and saib [apple] all start with different Urdu alphabets that have an identical Roman representation: s. In UrduPhone, we map all such cases to a single form.⁴

3.2 Similarity Function

The similarity between two words w_i and w_j is computed by the following similarity function:

    S(w_i, w_j) = [ Σ_{f=1..F} α^(f) σ_ij^(f) ] / [ Σ_{f=1..F} α^(f) ]

Here, α^(f) > 0 is the weight given to feature f, σ_ij^(f) ∈ [0, 1] is the similarity contribution made by feature f, and F is the total number of features. In the absence of additional information, and as used in the experiments in this work, all weights can be taken equal to one. The similarity function returns a value in the interval [0, 1], with larger values signifying higher similarity.

We use two types of features in our method: word features and contextual features. Word features can be based on phonetics and/or string similarity. The phonetic similarity between words w_i and w_j is 1 (i.e., σ_ij = 1) if both words have the same UrduPhone ID or encoding; otherwise, their similarity is zero.

¹ Spell correction is also considered a variant of text normalization (Damerau, 1964; Tahira, 2004; Fossati and Di Eugenio, 2007). Here, we limit ourselves to previous work on short text normalization.
² http://en.wikipedia.org/wiki/Metaphone
³ Due to limited space, we limit the description of UrduPhone to its comparison with Soundex.
⁴ A complete table of UrduPhone mappings is provided in the supplementary material.
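To make the encoding idea of Section 3.1 concrete, the following is a minimal sketch of a Soundex-style encoder with UrduPhone's two modifications: six-character codes instead of four, and merged consonant groups. The digit classes in `CODES` below are illustrative only, not the real UrduPhone table (which is given in the paper's supplementary material).

```python
# Sketch of a Soundex-style encoder with UrduPhone's two changes:
# six-character codes and merged consonant groups. The digit mapping
# below is hypothetical -- the actual UrduPhone table is published
# in the paper's supplementary material.

CODES = {
    "b": "1", "p": "1", "f": "1", "v": "1",
    "c": "2", "g": "2", "j": "2", "k": "2", "q": "2",
    "s": "2", "x": "2", "z": "2",
    "d": "3", "t": "3",
    "l": "4",
    "m": "5", "n": "5",
    "r": "6",
    "h": "7",
}

def urduphone_like(word, code_len=6):
    """Encode a word: keep the first letter, map later consonants to
    digit classes, skip vowels and repeated classes, pad/truncate to
    code_len characters."""
    word = word.lower()
    if not word:
        return ""
    code = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")   # vowels map to "" and are skipped
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "0" * code_len)[:code_len]
```

With such a mapping, spelling variants like zindagi, zindagee and zndagi collapse to one code, while the two extra code positions help keep longer derivational forms of a root apart, mirroring the MSKRN/MSKRHT example above.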

The string similarity between words w_i and w_j is defined as follows:

    σ_ij = lcs(w_i, w_j) / ( min[len(w_i), len(w_j)] × edist(w_i, w_j) )

Here, lcs(w_i, w_j) is the length of the longest common subsequence of words w_i and w_j, and len(w_i) is the length of word w_i. edist(w_i, w_j) returns the edit distance between the words, except when the edit distance is 0, in which case it returns 1.

Contextual features include the top-k frequently occurring previous and next words' features. Let a_1^i, a_2^i, ..., a_5^i and a_1^j, a_2^j, ..., a_5^j be the word IDs of the top-5 frequently occurring words preceding words w_i and w_j, respectively. Then, the similarity between the words is given by (Hassan et al., 2009)

    σ_ij = [ Σ_{k=1..5} ρ_k ] / [ Σ_{k=1..5} k ]

Here, ρ_k is zero if a_k^i does not have a match in a_*^j (i.e., in the context of word w_j); otherwise, ρ_k = 5 − max[k, l] + 1, where a_k^i = a_l^j and l is the highest rank (smallest integer) at which a previous match had not occurred. Instead of word IDs in the a^i's, UrduPhone IDs or string-similarity-based cluster IDs can be used to reduce sparsity and improve matches among similar words.

3.3 Lex-C: Clustering Algorithm

We develop a new clustering algorithm, called Lex-C, for discovering lexical variations in informal text. This algorithm is a modified version of the k-medoids algorithm (Han, 2005). It incorporates an assignment similarity threshold, t > 0, for controlling the number of clusters and their similarity. In particular, it ensures that all words in a cluster have a similarity greater than or equal to this threshold. It is important to note that the popular k-means algorithm is known to be effective for numeric datasets only, which is not the case here, and it cannot utilize our specialized similarity function for lexical variation discovery.

Specifically, Lex-C starts from an initial clustering based on UrduPhone or string similarity. It finds the centroidal word, w_c^k, of cluster k as the word for which the sum of similarities with all other words in the cluster is maximum. Then, each non-centroidal word w_i is assigned to the cluster k for which S(w_i, w_c^k) is maximum among all clusters and S(w_i, w_c^k) ≥ t. If the latter condition is not satisfied (i.e., S(w_i, w_c^k) < t), then instead of assigning word w_i to cluster k, it starts a new cluster. These two steps are repeated until convergence.

4 Experimental Evaluation

We empirically evaluate UrduPhone and our complete method involving Lex-C separately on two real-world datasets. Performance is reported with B-Cubed precision, recall, and f-measure (Bagga and Baldwin, 1998; Hassan et al., 2015) on a gold standard. These performance measures are based on element-wise comparisons between predicted and actual clusters, which are then aggregated over all elements in the clustering. This avoids the issue of 100% precision with low recall (all words in separate clusters) and 100% recall with low precision (all words in one cluster).

4.1 Dataset and Gold Standard

The first dataset, the Web dataset, is scraped from Roman Urdu websites on news⁵, poetry⁶, SMS⁷ and blogs⁸. The second dataset, the SMS dataset, is obtained from Chopaal, an internet-based group SMS service⁹. For evaluation, we use a manually annotated database of Roman Urdu variations (Khan and Karim, 2012). Table 1 shows statistics of the datasets in comparison with the gold standard.

Dataset                    | Web    | SMS
Unique words               | 22,044 | 28,908
Overlap with Gold Standard | 12,600 | 13,087
UrduPhone IDs              | 3,952  | 3,599

Table 1: Datasets and gold standard statistics. Overlap with gold standard = number of words appearing in the gold standard; UrduPhone IDs = number of distinct UrduPhone encodings.

4.2 UrduPhone Evaluation

We compare UrduPhone with Soundex and its variants.¹⁰ These algorithms are used to group words based on their encoding and are then evaluated against the gold standard.

⁵ http://www.shashca.com, http://stepforwardpak.com/
⁶ https://hadi763.wordpress.com/
⁷ http://www.replysms.com/
⁸ http://roman.urdu.co/
⁹ http://chopaal.org
¹⁰ We use NLTK-Trainer's phonetic library http://bit.ly/1OJGL9Q
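The word and contextual similarities of Section 3.2 can be sketched as follows. This is a simplified reading: the context score matches each of the top-5 context IDs independently rather than tracking the rank l of previous non-matches as the paper does, and all function names are ours.

```python
# Sketch of the feature-based similarity of Section 3.2, assuming
# equal feature weights (alpha = 1). lcs/edit distance use standard
# dynamic programming; context_sim is a simplified reading of rho_k.

def lcs_len(a, b):
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def edit_dist(a, b):
    """Levenshtein distance; the paper maps a distance of 0 to 1."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j-1] + 1, prev + (ca != cb))
    return max(dp[len(b)], 1)

def string_sim(a, b):
    # sigma_ij = lcs / (min(len) * edist)
    return lcs_len(a, b) / (min(len(a), len(b)) * edit_dist(a, b))

def context_sim(ctx_i, ctx_j):
    """ctx_* are top-5 context IDs, most frequent first. Simplified:
    rho_k = 5 - max(k, l) + 1 whenever ctx_i[k] occurs at rank l in
    ctx_j (1-based), without the paper's previous-match bookkeeping."""
    num = 0
    for k, w in enumerate(ctx_i, 1):
        if w in ctx_j:
            l = ctx_j.index(w) + 1
            num += 5 - max(k, l) + 1
    return num / sum(range(1, 6))  # denominator: 1+2+...+5 = 15

def similarity(feature_sims, weights=None):
    """Weighted average of per-feature similarities, all in [0, 1]."""
    weights = weights or [1.0] * len(feature_sims)
    return sum(a * s for a, s in zip(weights, feature_sims)) / sum(weights)
```

Note that the denominator Σk = 15 makes the context score reach 1 exactly when both words share identical top-5 contexts in the same order.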

Table 2 shows the results on the Web dataset. UrduPhone outperforms Soundex, Caverphone, and Metaphone, while NYSIIS's f-measure is comparable to that of UrduPhone. We observe that NYSIIS produces a large number of single-word clusters (out of 6,943 clusters produced, 5,159 have only one word). This gives high precision, but recall is low. UrduPhone produces fewer clusters (and fewer one-word clusters) with high recall.

Algorithm  | Pre  | Rec  | Fme
Soundex    | 0.30 | 0.97 | 0.46
Metaphone  | 0.49 | 0.80 | 0.64
Caverphone | 0.31 | 0.92 | 0.46
Nysiis     | 0.63 | 0.69 | 0.66
UrduPhone  | 0.51 | 0.94 | 0.67

Table 2: Comparison of UrduPhone with other algorithms on the Web dataset.

Figure 2: Performance results for the SMS dataset.

4.3 Lex-C Evaluation

We compared Lex-C with the k-means and EM clustering algorithms. With both of these algorithms we used the same feature set (i.e., word features, phonetic features, and contextual features). However, their performance lagged behind that of our approach. The primary reason for this is that our feature space is not continuous, while the k-means and EM algorithms work best for continuous feature spaces.

4.4 Performance of our Method

We conduct extensive experiments to evaluate the performance of our method. We vary the initial clusterings (UrduPhone encoding or string-similarity-based clustering); evaluate various combinations of phonetic, string, and contextual features; and consider different previous/next top-5 words' features (word ID, cluster ID, and/or UrduPhone ID). Table 3 gives the details of each experiment setting.

ID | Initial | Word   | Context-1  | Context-2
1  | –       | UPhone | –          | –
2  | –       | String | –          | –
3  | String  | String | Word ID    | –
4  | String  | String | Cluster ID | –
5  | String  | UPhone | Cluster ID | –
6  | UPhone  | UPhone | Cluster ID | –
7  | UPhone  | UPhone | Word ID    | –
8  | UPhone  | UPhone | UPhone ID  | Word ID
9  | UPhone  | UPhone | UPhone ID  | Cluster ID

Table 3: Details of the experiments' settings, where Initial is the initial clustering based on String or UrduPhone (UPhone).
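A minimal sketch of one Lex-C iteration as described in Section 3.3, assuming a pairwise similarity function `sim` in [0, 1] (for instance the feature-based similarity of Section 3.2); the function names here are ours, not the authors' released code.

```python
# Sketch of one Lex-C iteration: medoid update plus thresholded
# reassignment. `sim(a, b)` is any pairwise similarity in [0, 1].

def medoid(cluster, sim):
    """Word maximizing the summed similarity to the rest of the cluster."""
    return max(cluster, key=lambda w: sum(sim(w, v) for v in cluster if v != w))

def lexc_step(clusters, sim, t):
    """Reassign every word to its most similar medoid; a word whose best
    similarity falls below threshold t starts a new singleton cluster."""
    medoids = [medoid(c, sim) for c in clusters]
    new_clusters = [[m] for m in medoids]
    for cluster in clusters:
        for w in cluster:
            if w in medoids:
                continue
            scores = [sim(w, m) for m in medoids]
            best = max(range(len(medoids)), key=scores.__getitem__)
            if scores[best] >= t:
                new_clusters[best].append(w)
            else:
                new_clusters.append([w])   # below threshold: new cluster
    return new_clusters
```

Iterating `lexc_step` until the clustering no longer changes mirrors the convergence loop of Section 3.3; raising t yields more, purer clusters, which is the trade-off explored in the experiments.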

Figure 1: Performance results for the Web dataset.

Figures 1 and 2 show the results of selected experiments for the Web and SMS datasets, respectively. The x-axis shows the experiment (Exp.) IDs, the left y-axis gives the precision, recall, and f-measure, and the right y-axis shows the difference between the number of predicted and actual clusters. Exp. 1 and 2 are baselines corresponding to UrduPhone encoding (UPhone ID) and string-similarity-based word clustering (Cluster ID), respectively. The remaining experiments have different initial clusterings, word features, and up to two contextual features. In these results, the similarity threshold t is selected such that the number of discovered clusters is as close as possible to the number of actual clusters in the gold standard for each dataset. This is done to make the results comparable across different settings.

Compared to the baselines, our method shows a gain of up to 12% and 8% on the Web and SMS datasets, respectively. The best performances are obtained when UrduPhone is used as a feature and UrduPhone IDs are used to define the context (Exp. 8 and 9). In particular, when both UrduPhone IDs and word IDs/cluster IDs are used for contextual information (i.e., with two sets of top-5 previous and next words' features), the f-measure is consistently high.

We analyzed the performance of Exp. 8 (the best settings for the Web dataset) with varying t; the results are shown in Figure 3. It is observed that the value of t controls the number of clusters smoothly; precision increases with the number of clusters, and the f-measure reaches a peak when the number of clusters is close to that in the gold standard.

Figure 3: Effect of varying the threshold t on the Web dataset (experiment 8).

5 Conclusion

We proposed an unsupervised method for finding lexical variations in Roman Urdu. We presented a phonetic encoding scheme, UrduPhone, for Roman Urdu, and developed a feature-based clustering algorithm, Lex-C. Our experiments are evaluated on a manually developed gold standard. The results confirm that our method outperforms baseline methods. We have made the datasets and algorithm code available to the research community.

References

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, pages 563–566.

Grzegorz Chrupala. 2014. Normalizing tweets with edit scripts and recurrent neural embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 680–686.

Eleanor Clark and Kenji Araki. 2011. Text normalization in social media: Progress, problems and applications for a pre-processing system of casual English. Procedia-Social and Behavioral Sciences, 27:2–11.

Danish Contractor, Tanveer A. Faruquie, and L. Venkata Subramaniam. 2010. Unsupervised cleansing of noisy text. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 189–196. Association for Computational Linguistics.

Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), March.

Neelmay Desai and Meera Narvekar. 2015. Normalization of noisy text data. Procedia Computer Science, 45:127–132.

Nadir Durrani and Sarmad Hussain. 2010. Urdu word segmentation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 528–536, Los Angeles, California, June. Association for Computational Linguistics.

Nadir Durrani, Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2010. Hindi-to-Urdu machine translation through transliteration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 465–474, Uppsala, Sweden, July. Association for Computational Linguistics.

Davide Fossati and Barbara Di Eugenio. 2007. A mixed trigrams approach for context sensitive spell checking. In Computational Linguistics and Intelligent Text Processing, pages 623–633. Springer.

Patrick A. V. Hall and Geoff R. Dowling. 1980. Approximate string matching. ACM Computing Surveys, 12(4):381–402.

Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432. Association for Computational Linguistics.

Jiawei Han. 2005. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Malik Tahir Hassan, Khurum Nazir Junejo, and Asim Karim. 2009. Learning and predicting key web navigation patterns using Bayesian models. In Computational Science and Its Applications–ICCSA, pages 877–887. Springer.

Malik Tahir Hassan, Asim Karim, Jeong-Bae Kim, and Moongu Jeon. 2015. CDIM: Document clustering by discrimination information maximization. Information Sciences.

Ann Irvine, Jonathan Weese, and Chris Callison-Burch. 2012. Processing informal, romanized Pakistani text messages. In Proceedings of the Second Workshop on Language in Social Media, LSM '12, pages 75–78. Association for Computational Linguistics.

Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of Twitter messages. In International Conference on Natural Language Processing, Kharagpur, India.

Osama Khan and Asim Karim. 2012. A rule-based model for normalization of SMS text. In IEEE 24th International Conference on Tools with Artificial Intelligence, ICTAI 2012, Athens, Greece, November 7-9, 2012, pages 634–641.

Donald E. Knuth. 1973. The Art of Computer Programming: Volume 3, Sorting and Searching. Addison-Wesley.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2013. Paraphrasing 4 microblog normalization. In EMNLP, pages 73–84.

Tahira Naseem and Sarmad Hussain. 2007. A novel approach for ranking spelling error corrections for Urdu. Language Resources and Evaluation, 41(2):117–128.

Lawrence Philips. 1990. Hanging on the metaphone. Computer Language Magazine, 7(12):39–44, December.

David Pinto, Darnes Vilariño Ayala, Yuridiana Alemán, Helena Gómez-Adorno, Nahun Loya, and Héctor Jiménez-Salazar. 2012. The Soundex phonetic algorithm revisited for SMS text representation. In Text, Speech and Dialogue - 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings, pages 47–55.

Sudipta Roy, Sourish Dhar, Saprativa Bhattacharjee, and Anirban Das. 2013. A lexicon based algorithm for noisy text normalization as pre processing for sentiment analysis. International Journal of Research in Engineering and Technology, 02.

Hassan Sajjad and Helmut Schmid. 2009. Tagging Urdu text with parts of speech: A tagger comparison. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), pages 692–700, Athens, Greece.

Uladzimir Sidarenka, Tatjana Scheffler, and Manfred Stede. 2013. Rule-based normalization of German Twitter messages. In Proc. of the GSCL Workshop Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation.

Richard Sproat, Alan W. Black, Stanley F. Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech & Language, 15(3):287–333.

R. Taft. 1970. Name search techniques. Special report. Bureau of Systems Development, New York State Identification and Intelligence System.

Naseem Tahira. 2004. A hybrid approach for Urdu spell checking.

Pidong Wang and Hwee Tou Ng. 2013. A beam-search decoder for normalization of social media text with application to machine translation. In HLT-NAACL, pages 471–481.

John Wang, editor. 2009. Encyclopedia of Data Warehousing and Mining, Second Edition (4 Volumes). IGI Global.

Zhongyu Wei, Lanjun Zhou, Binyang Li, Kam-Fai Wong, Wei Gao, and Kam-Fai Wong. 2011. Exploring tweets normalization and query time sensitivity for Twitter search. In TREC.

Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, pages 61–72.

Xin Zhang, Jiaying Song, Yu He, and Guohong Fu. 2015. Normalization of homophonic words in Chinese microblogs. In Intelligent Computation in Big Data Era, pages 177–187. Springer.
