Document Similarity Amid Automatically Detected Terms∗

Hardik Joshi
Gujarat University, Ahmedabad, India
[email protected]

Jerome White
New York University, Abu Dhabi, UAE
[email protected]

ABSTRACT

This is the second edition of the task formerly known as Question Answering for the Spoken Web (QASW). It is an information retrieval evaluation in which the goal was to match spoken Gujarati "questions" to spoken Gujarati responses. This paper gives an overview of the task—the design of the task and the development of the test collection—along with differences from previous years.

∗ An initial version of this task appeared in 2013 under the title "Question Answering for the Spoken Web."

1. INTRODUCTION

Document Similarity Amid Automatically Detected Terms is an information retrieval evaluation in which the goal was to match "questions" spoken in Gujarati to responses spoken in Gujarati. The design of the task was motivated by a speech retrieval interaction paradigm first proposed by Oard [4]. In this paradigm, a searcher, using speech for both queries and responses, speaks extensively about what they seek to find until interrupted by the system with a single potential answer. This task follows a stream of similar efforts, most notably the Question Answering for the Spoken Web (QASW) task from FIRE 2013, and an attempted task in MediaEval from the same year.

2. QUESTIONS AND RESPONSES

The source of the questions and the collection of possible answers (which we call "responses") was the IBM Spoken Web Gujarati collection [6]. This collection was based on a spoken bulletin board system for Gujarati farmers. A farmer could call the system and record their question by going through a set of prompts. Other farmers would call the system to record answers to those questions. There was also a small group of system administrators who would periodically call in to leave announcements that they expected would be of interest to the broader farming community. The system was completely automated—no human intervention or call center was involved. This collection of recorded speech, consisting of questions and responses (answers and announcements), provided the basis for the test collection. There were initially a total of 3,557 spoken documents in the corpus. From system logs, these documents were divided into a set of queries and a set of responses. Although there was a mapping between queries and answers, because of the freedom given to farmers, this mapping was not always "correct." That is, it was not necessarily the case that a caller specifying that they would answer a question, and thus creating a question-to-answer mapping in the call log, would actually answer that question. This also meant, however, that in some cases responses applied to more than one query, as the same topics might be asked about more than once.

The 151 longest queries were initially divided into a training set of 50 questions and an evaluation set of 101 questions. Training questions were those for which the largest number of answers were known beforehand (mappings between questions and known answers were available to the organizers from data collected by the operational system). Once the transcripts became available, two evaluation questions were removed for which the resulting transcripts were far shorter than would be expected based on the file length. This resulted in a total of 50 training questions and 99 evaluation questions. Of the 50 training questions, results from the 2013 QASW task revealed that only 17 were actual queries. Further, of these 17, 10 had more than one relevant document. These 17 queries, along with their relevance judgements, were made available to participants as the training set.

The set of response files did not change from the QASW task: in that case, very short response files were removed, along with files that did not seem to be in line with their corresponding transcript.1 After removal, the final test collection contained 2,999 responses.

1 The reader is referred to the initial QASW summary document as to why transcripts, in general, were not available for this task.

3. SPEECH PROCESSING

Recent term discovery systems [5, 2] automatically identify repeating words and phrases in large collections of audio, providing an alternative means of extracting lexical features for retrieval tasks. Such discovery is performed without the assistance of supervised speech tools by instead resorting to a search for repeated trajectories in a suitable acoustic feature space (for example, MFCCs or PLP), followed by a graph clustering procedure. Due to their sometimes ambiguous content, the discovered units are referred to as pseudo-terms, and we can represent each question and response as a set of pseudo-term offsets and durations. A complete specification of the term discovery system used for this work can be found in the literature [1, 3].

Briefly, the system functions by constructing a sparse (thresholded) distance matrix across the frames of the entire corpus. It then searches for approximately diagonal line structures in that matrix, as such structures are indicative that a word or phrase has been repeated. Once the sparse distance matrix has been constructed, it remains to search for runs of nearby frames, which make up extracted terms. A threshold δ dictates a frame length that is considered acceptable, and thus the number of extracted regions. These regions are then clustered based on whether they overlap in a given dimension. Regions that happen to overlap are clustered; these clusters are known as pseudo-terms.

The choice of δ has a strong influence on the number of pseudo-terms that are produced. Lower thresholds imply higher fidelity matches that yield purer pseudo-term clusters with, on average, lower collection frequencies. The data made available for this task had, specifically, δ = 0.06, yielding 406,366 unique pseudo-terms.
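The discovery pipeline can be illustrated with a small sketch. This is not the system of [1, 3]; it is a simplified illustration, assuming acoustic feature vectors (e.g. MFCCs) are already available as a NumPy array, of the thresholded-distance, diagonal-run, and overlap-clustering steps.

```python
# Simplified illustration of zero-resource term discovery (not the actual
# system of [1,3]): threshold a frame-distance matrix, follow diagonal runs,
# and cluster overlapping regions into pseudo-terms.
import numpy as np

def discover_terms(frames, delta=0.06, min_run=20):
    """frames: (n, d) array of acoustic feature vectors (e.g. MFCCs)."""
    # Cosine distance between every pair of frames (dense here; the real
    # system keeps this matrix sparse for scalability).
    unit = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    close = dist < delta                      # thresholded similarity matrix

    regions = []                              # (start_i, end_i, start_j, end_j)
    n = len(frames)
    for offset in range(min_run, n):          # walk each off-diagonal
        diag = np.diagonal(close, offset=offset)
        run_start = None
        for t, hit in enumerate(list(diag) + [False]):
            if hit and run_start is None:
                run_start = t
            elif not hit and run_start is not None:
                if t - run_start >= min_run:  # long run => repeated word/phrase
                    regions.append((run_start, t, run_start + offset, t + offset))
                run_start = None
    return cluster_overlapping(regions)

def cluster_overlapping(regions):
    """Greedily merge regions whose first interval overlaps; each resulting
    cluster plays the role of a pseudo-term."""
    clusters = []
    for r in sorted(regions):
        for c in clusters:
            if any(r[0] < other[1] and other[0] < r[1] for other in c):
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters
```

The choice of `delta` plays the role of the threshold δ discussed above: a smaller value keeps only very close frame pairs, producing fewer but purer pseudo-term clusters.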

4. EVALUATION DESIGN

Participating research teams were provided with the full set of pseudo-terms extracted from the Gujarati collection. The principal task of a participating research team was similar to the task previously: rank all responses to each full query such that, to the extent possible, all correct answers were ranked ahead of all incorrect answers. Each participating system was asked to rank all responses for all training questions. Systems were evaluated on their ability to satisfy that goal using mean average precision (MAP).

5. RESEARCH TEAMS

In FIRE-2014, the task was proposed as "Document Similarity Amid Automatically Detected Terms". Three teams registered for participation; however, only one team (HGH), from DA-IICT, Gandhinagar, submitted runs within the time frame. The participating team submitted two runs, each of depth 1000.

6. EVALUATION AND RESULTS

Evaluation was done by pooling the top 10 documents. Relevance judgements were carried out manually by listening to each audio file. A summary of the results is shown in Table 1.

Particulars       Run-1     Run-2
Num of Queries    99        99
MAP Score         0.1600    0.1600

Table 1: MAP Scores of the HGH Team

Both runs submitted by the HGH team were identical and generated the same results. The documents used in the task contained only pseudo-terms; participating teams had no access to the audio. The task may therefore have appeared as a black box to the participating teams: they were asked to retrieve the matching audio by looking only at the pseudo-terms, whereas the relevance judgements were made by listening to the audio files.

Although we received only a single submission, the experiments may give better MAP values by pooling the results at a depth of 100. We will make the result analysis more comprehensive in upcoming editions. We wish to scale the task to more languages in the future and hope more teams participate enthusiastically in FIRE 2015.

7. ACKNOWLEDGMENTS

We are grateful to Nitendra Rajput for providing the spoken questions and responses and for early discussions about evaluation design. We also thank Doug Oard for his guidance throughout the design and execution of this task. Finally, we wish to thank Aren Jansen for donating his time and expertise in creating the pseudo-term cluster set.

8. REFERENCES

[1] M. Dredze, A. Jansen, G. Coppersmith, and K. Church. NLP on spoken documents without ASR. In Proc. EMNLP, pages 460–470. Association for Computational Linguistics, 2010.
[2] A. Jansen, K. Church, and H. Hermansky. Towards spoken term discovery at scale with zero resources. In INTERSPEECH, pages 1676–1679, 2010.
[3] A. Jansen and B. Van Durme. Efficient spoken term discovery using randomized algorithms. In Proc. ASRU, 2011.
[4] D. W. Oard. Query by babbling: A research agenda. In Proceedings of the First Workshop on Information and Knowledge Management for Developing Regions, IKM4DR '12, pages 17–22, 2012.
[5] A. Park and J. R. Glass. Unsupervised pattern discovery in speech. IEEE T-ASLP, 16(1):186–197, 2008.
[6] N. Patel, D. Chittamuru, A. Jain, P. Dave, and T. S. Parikh. Avaaj Otalo: A field study of an interactive voice forum for small farmers in rural India. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 733–742. ACM, 2010.

Playing with distances: Document Similarity Amid Automatically Detected Terms @FIRE 2014

Harsh Thakkar1, Ganesh Iyer1, Honey Patel2, Kesha Shah1

{DA-IICT1- Gandhinagar, Gujarat University2- Ahmedabad}, Gujarat, India {[email protected], [email protected], [email protected], [email protected]}

Abstract. Spoken information retrieval is a promising domain of research. In this paper we describe our participation in the pilot Document Similarity Amid Automatically Detected Terms task of FIRE 2014. We present findings from our experiments with variants of distance- and timestamp-based approaches. The de-normalized distance-based variant outperformed the other two, delivering the best results among the submitted runs. However, there is scope for further improvement in the results.

1 Introduction

In recent years, the term "Speech Retrieval" has gained the attention of researchers from the information retrieval and speech processing communities. Previous work was mostly based on retrieving useful information from semi-structured data using text-based queries. This shift of interest can prove to be extremely useful, since a majority of Internet users around the world prefer voice-based communication, and the volume of speech-based data generated every day is therefore huge. Applying proper information retrieval and speech processing techniques can open new doors for innovative inter-disciplinary research. Processing such data also poses challenges [1]: both the query and the response should be in spoken form, and there is a lack of high-precision speech recognition and conversion systems to tackle this task. Since the data is in audio format, a variety of systems based on non-traditional modalities are being developed [2]. Information retrieval research communities such as the Forum for Information Retrieval Evaluation (FIRE)1 have started tasks on spoken information retrieval. Spoken information retrieval—using spoken queries to search a set of audio files—is a field that attempts to address this challenge. Advances in automatic speech recognition and speech-based information retrieval systems have driven progress within the field; however, that progress has been biased toward a relatively small number of languages. There are a large number of languages, particularly localized languages within developing regions, that have been left out of these advances.

Addressing this problem requires either significant improvements to ASR, or viable alternatives. A promising step toward the latter is zero-resource term detection, a method that identifies matching regions amongst a collection of audio without prior knowledge of the underlying language model. Thus, given a set of audio files, a set of matching segments within and across those audio files can be created. The problem statement as described by the organizers is: "Given a set of queries and a set of responses, both represented as sets of such segments, the purpose of this task is to identify response documents that are related to each query."2

In this paper we propose variants of Euclidean-based distance to address the challenge of the spoken information retrieval domain. The 2014 Forum for Information Retrieval Evaluation (FIRE) focuses on Indian language audio retrieval; this year the effort has evolved into a pilot task at FIRE 2014, with a focus only on speech information retrieval.

1 FIRE, http://www.isical.ac.in/ fire/
2 The Document Similarity Amid Automatically Detected Terms page, online at http://14.139.122.23/8000 HJJoshi/4000 Document Similarity.html

2 Corpus and Task description

The test collection is created from the original audio recordings of a phone-based bulletin board system for farmers in Gujarat. Farmers would call into the system to ask questions; other farmers would call in to answer those questions. Periodically, system administrators would leave announcements for all system participants to hear. The entire system was automated: callers were not presented with a live operator, instead interacting with the system and making their recordings by following computer-generated prompts.

These audio recordings were "transcribed" using a zero-resource term detection system. The result is a series of documents, one for each recording, consisting of identifier-segment pairs. That is, for each non-silent segment of the audio file, there is a demarcation of that segment, along with an identifier for the segment. Identifiers across documents are not necessarily unique; matching identifiers denote matching regions of audio. These identifiers are known as pseudo-terms. In the dataset provided by the task organizers, there are a total of 3,148 documents, consisting of 149 queries and 2,999 responses. The entire collection of responses was made available to participants. Queries are divided into test and training sets. The training set consists of 16 queries, along with relevance judgments for those queries. The testing phase consists of some subset of the remaining 133 queries.

The documents are CSV files with three columns: pseudo-term, start time, and end time. Pseudo-terms are regions of speech that appear throughout the corpus. Other documents may have similar pseudo-terms if their regions were deemed to be similar in the audio space. The start and end times mark where a given term appeared within an audio file. While the pseudo-term itself is not necessarily unique, the (pseudo-term, start, end) triple generally is.
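For concreteness, the per-document representation can be loaded roughly as follows. This is only a sketch under the assumption that each document is a CSV file with the three columns described above; the file naming and directory layout are hypothetical.

```python
# Sketch of loading the pseudo-term documents described above.  Assumes each
# document is a CSV file with three columns: pseudo-term id, start time, end
# time.  Paths and file naming are hypothetical.
import csv
from pathlib import Path
from collections import defaultdict

def load_document(path):
    """Return a list of (pseudo_term, start, end) triples for one recording."""
    triples = []
    with open(path, newline='', encoding='utf-8') as handle:
        for term, start, end in csv.reader(handle):
            triples.append((term, float(start), float(end)))
    return triples

def load_collection(directory):
    """Map document id -> its pseudo-term triples for every CSV in a folder."""
    collection = {}
    for path in Path(directory).glob('*.csv'):
        collection[path.stem] = load_document(path)
    return collection

def term_index(collection):
    """Invert the collection: pseudo-term -> documents it occurs in."""
    index = defaultdict(set)
    for doc_id, triples in collection.items():
        for term, _, _ in triples:
            index[term].add(doc_id)
    return index
```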

3 Proposed methodology

We employ a distance-based similarity approach as the document similarity measure. We prepared three variants, as described below, and from these three distance-based variants we submitted the two best performing runs for the final evaluation. We used the standard Euclidean distance calculation method, with a combination of the start and end times as required. The three variants are summarized as follows (a rough sketch of the variants is given after the list):

– Normalized distance (time + distance): In this variant we applied the Euclidean distance method to the pseudo-terms in conjunction with the time intervals, considering a combination of the difference of pseudo-terms and of timestamps. The cumulative score obtained was then normalized for each document to provide precise statistics. For every pseudo-term present in the query, we considered the (document, query) pair which produced the least cumulative difference and marked that document as relevant for the query. Thus, this method favours the best global (minimum) cumulative difference.

– De-normalized distance (distance only): In this variant we employed the same methodology as in the above method, except that we did not aggregate/normalize the results for each query. It was observed that this local minimum difference proved to be a better criterion than the previous global minimum difference. The findings are discussed in Section 4.

– Normalized distance (distance only): This variant is based on variant 1 with a modification: we consider only the pseudo-term difference for every document per query, and do not consider the difference of the timestamps.
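The exact formulation of the variants is not given in the paper; the sketch below is one plausible reading, assuming pseudo-term identifiers are (or have been mapped to) integers and comparing each query entry against its closest response entry, with and without the timestamp component and the per-document normalization. All function and variable names are our own.

```python
# One plausible reading of the three variants (the paper does not give exact
# formulas): pseudo-term identifiers are assumed to be integers, and every
# query entry is compared against its closest response entry.  Names are
# illustrative, not the authors' code.
import math

def entry_distance(q, r, use_time=True):
    """q, r: (pseudo_term_id, start, end) triples."""
    d = (q[0] - r[0]) ** 2
    if use_time:
        d += (q[1] - r[1]) ** 2 + (q[2] - r[2]) ** 2
    return math.sqrt(d)

def document_score(query, response, use_time=True, normalize=True):
    """Cumulative best-match distance of a query against one response;
    lower scores are treated as more relevant."""
    if not response:
        return float('inf')
    total = sum(min(entry_distance(q, r, use_time) for r in response)
                for q in query)
    if normalize:                        # per-document averaging
        total /= max(len(query), 1)
    return total

def rank_responses(query, responses, use_time=True, normalize=True):
    """responses: dict of doc_id -> list of triples.  Best documents first."""
    scored = {doc_id: document_score(query, resp, use_time, normalize)
              for doc_id, resp in responses.items()}
    return sorted(scored, key=scored.get)
```

Under this reading, variant 1 corresponds to keeping both the timestamp term and the normalization, variant 3 to dropping the timestamp term, and variant 2 to dropping the normalization; how the second variant treats timestamps is ambiguous in the description above.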

4 Experimentation and result analysis

It was discovered during the analysis of the training corpus that not all the queries have relevant documents in the QREL file; this remains a mystery to the participants. The number of queries with relevant documents is shown in Figure 1 below. There are in total 12 queries with 34 relevant documents in the training corpus. The other 4 queries have no relevant documents in the QREL file, or those documents are simply not listed there.

Fig. 1. Query id vs. relevant documents per query in the training dataset, as per the QREL file provided by the organizers.

Figure 2 presents the comparison of the normalized and de-normalized distance variants. The de-normalized variant retrieved relevant documents for 8 distinct queries out of the total 12, while the normalized variant managed to retrieve relevant documents for only 5 queries.

Fig. 2. Comparison of relevant retrieved documents per query for normalized distance vs. de-normalized distance.

Figure 3 presents the comparison of all three distance-based variants against the total number of relevant documents in the training corpus. It can be observed from these results that the de-normalized variant outperforms the other two on an overall basis, although it fails to retrieve relevant documents for query number 43; the reason for this is currently unknown. The normalized distance approach outperforms the de-normalized variant for queries 44 and 6. It is observed that for query 44, all three variants retrieve more documents than the number of relevant documents mentioned in the QREL file of the training data. The possible explanations are: since the variants work fine for all the other queries, and no other information is known regarding the relevance parameter/threshold of the documents, either all three models fail drastically to differentiate between relevant and non-relevant documents for this query, or this is an error in the QREL file provided by the organizers. The normalized (distance only) variant (the purple column bars in Figure 3) manages to perform slightly better than the normalized (distance + time) variant by retrieving relevant documents for 7 queries as compared to 5 for the latter. For query 43, the normalized (distance only) variant outperforms all the other variants. For queries 11 and 38, all three variants display a disappointing zero retrieval of relevant documents.

Fig. 3. Comparison of relevant retrieved documents per query for all three distance variants against the total number of relevant documents in the training corpus.

5 Conclusion

We conclude that the time-based variants are not sufficient for the purpose of this task, for a variety of reasons. The parameters are not sufficient for the approaches to differentiate between relevant and non-relevant files. Not all queries have relevant documents: only 12 out of the 16 training queries have relevant documents in the QREL file, and the reason is not known. The variants do not retrieve documents for all 12 of these queries; the best retrieving model is the de-normalized variant, which retrieves relevant documents for a maximum of 8 of the 12 queries.

References

1. White, Jerome, Douglas W. Oard, Nitendra Rajput, and Marion Zalk. "Simulating Early-Termination Search for Verbose Spoken Queries." In EMNLP, pp. 1270-1280. 2013.
2. Oard, Douglas W. "Query by babbling: a research agenda." In Proceedings of the First Workshop on Information and Knowledge Management for Developing Regions, pp. 17-22. ACM, 2012.

IndiLem@FIRE-MET-2014 : An Unsupervised Lemmatizer for Indian Languages

Abhisek Chakrabarty
CVPR Unit, Indian Statistical Institute
203 B.T. Road, Kolkata, India
[email protected]

Sharod Roy Choudhury
Narula Institute of Technology
Kolkata, India
[email protected]

Utpal Garain
CVPR Unit, Indian Statistical Institute
203 B.T. Road, Kolkata, India
[email protected]

Abstract

An unsupervised and language-independent lemmatization procedure has been developed for major Indian languages (Bengali, Hindi, etc.), which are morphologically very rich and agglutinative in nature. The task of a lemmatizer is to map an inflected surface word to its appropriate dictionary root word, and it is a pre-requisite for implementing several NLP tools such as Word Sense Disambiguation systems, Machine Translation systems, etc. The proposed method builds a trie structure using the root words from a dictionary and tries to find a potential lemma of a surface word by efficiently searching the trie. In the present work, our lemmatization system is tested on Bengali.

1 Introduction

Unlike English, the most popular and widely used language in the world, the premier Indian languages exhibit various inflectional and derivational morphological variants of a root word, which causes many problems for computational processing. In lexical knowledge bases such as dictionaries, WordNet, etc., the entries are usually root words with their morphological and semantic descriptions; therefore, when a surface word is encountered in raw text, its meaning cannot be obtained unless and until its appropriate root word is determined through lemmatization. Thus, lemmatization is a basic need for any kind of semantic processing of Indian languages. So far, there has been considerable research on stemming ((Majumder et al., 2007), (Dolamic and Savoy, 2012), (Paik and Parui, 2011), (Paik et al., 2011) and (Ganguly et al., 2012)) in the context of information retrieval, but no good lemmatizer is yet available for Indian languages. For Bengali, a few works ((Loponen et al., 2013) and (Sarkar and Bandyopadhyay, 2012)) on lemmatization and morphological analysis are found in the literature. (Loponen et al., 2013) proposed a lemmatizer (GRALE), but no algorithmic details are given in their paper. The work of (Sarkar and Bandyopadhyay, 2012) is based on a set of Bengali-specific rules and is therefore not workable for other languages. Our present work is, in a small way, similar to the recent work of (Bhattacharyya et al., 2014): they also built a trie structure consisting of the words in the WordNet of a language and used two searching strategies in the trie—first direct searching and then backtracking—for finding the lemma of a surface word. We, however, have brought more sophistication to the searching process in the trie.

This article is organized as follows. Section 2 gives the details of our proposed method, and in Section 3 we report the results of our system for Bengali in MET-FIRE-2014. Section 4 discusses the limitations of our method and concludes the paper.

2 The Proposed Lemmatization Algorithm

Our algorithm requires a dictionary of the language concerned. For Bengali, we have used the dictionary available from the University of Chicago1 and processed it. There are 47,189 distinct root words in the dictionary. At first, we created a trie structure using the dictionary root words. Each node in the trie corresponds to a Unicode Bengali character, and the nodes that end with the final character of any root word are marked as final nodes. The remaining nodes are marked as non-final nodes. To find the lemma of a surface word, the trie is navigated starting from the initial node, and navigation ends either when the word is completely found in the trie or when, after some portion of the word, there is no path left in the trie to navigate. While navigating, several situations may occur, depending on which we decide how to determine the lemma. Those situations are described below (a short code sketch summarizing the procedure follows the examples).

1. The surface word is a root word. In that case, the surface word itself is the lemma.

2. The surface word is not a root word. In that case, the trie is navigated up to the node where the surface word completely ends or where there is no path left to navigate in the trie. We call this node the end node. Again, two different cases may occur here.

(a) If, in the path from the initial node to the end node, one or more root words are found, i.e. if one or more final nodes are present in the path, then pick the final node which is closest to the end node. We measure the distance between two nodes by the number of edges between them. The word represented by the path from the initial node to the picked final node is considered as the lemma. For example, consider two dictionary root words ‘অংশ’/‘angsha’ and ‘অংশীদার’/‘angshidar’, and take two inflected words ‘অংেশর’/‘angsher’ and ‘অংশীদােরর’/‘angshidarer’.

‘অংশ’ = ‘অ’ + ‘◌ং’ + ‘শ’. ‘অংশীদার’ = ‘অংশ’ + ‘◌ী’ + ‘দ’ + ‘◌া’ + ‘র’. ‘অংেশর’ = ‘অংশ’ + ‘ে◌’ + ‘র’. ‘অংশীদােরর’ = ‘অংশীদার’ + ‘ে◌’ + ‘র’.

The trie is built using the two root words ‘অংশ’/‘angsha’ and ‘অংশীদার’/‘angshidar’. For ‘অংেশর’/‘angsher’, the end node is the ‘শ’/‘sh’ node, and for the word ‘অংশীদােরর’/‘angshidarer’, the first ‘র’/‘r’ node from the left is the end node. For ‘অংেশর’/‘angsher’, the ‘শ’/‘sh’ node is the only final node present between the initial node and the end node, and hence ‘অংশ’/‘angsha’ is taken as the lemma. In the case of ‘অংশীদােরর’/‘angshidarer’, ‘শ’/‘sh’ and ‘র’/‘r’ are the two final nodes present between the initial node and the end node. As the ‘র’/‘r’ node is closer to the end node than the ‘শ’/‘sh’ node, we take ‘অংশীদার’/‘angshidar’ as the lemma for ‘অংশীদােরর’/‘angshidarer’.

(b) If no root word is found in the path from the initial node to the end node, then find the final node in the trie which is closest to the end node. If more than one final node is found at the closest distance, then pick all of them. Now, generate the root word(s) represented by the path from the initial node to those picked final node(s).

1 http://dsal.uchicago.edu/dictionaries/list.html

Precision    Recall    F1-measure
56.19%       65.08%    60.31%

Table 1: Results of our lemmatization system on Bengali in MET FIRE 2014

Finally, among the generated root word(s), pick the root word(s) which has/have the maximum overlapping prefix length with the surface word. By the phrase ‘overlapping prefix length’ between two words, we mean the length of the longest common prefix between them. Even at this stage, if more than one root is selected, then select any one of them arbitrarily as the lemma: it is very rare to have more than one root word at this stage, and if more than one root does exist, all are viable candidates. For example, consider the dictionary root words ‘শ‌ুনা’/‘shuna’, ‘শ‌ুনািন’/‘shunani’ and ‘শ‌ুনােনা’/‘shunano’, and take the inflected word ‘শ‌ুেন’/‘shune’.

‘শ‌ুনা’ = ‘শ’ + ‘◌ু ’ + ‘ন’ + ‘◌া’. ‘শ‌ুনািন’ = ‘শ‌ুনা’ + ‘ন’ + ‘ি◌’. ‘শ‌ুনােনা’ = ‘শ‌ুনা’ + ‘ন’ + ‘ে◌া’. ‘শ‌ুেন’ = ‘শ’ + ‘◌ু ’ + ‘ন’ + ‘ে◌’.

The trie is built using the three root words ‘শ‌ুনা’/‘shuna’, ‘শ‌ুনািন’/‘shunani’ and ‘শ‌ুনােনা’/‘shunano’. For the inflected word ‘শ‌ুেন’/‘shune’, the first ‘ন’/‘n’ node from the initial node in the trie is the end node. From the end node, ‘◌া’ is the closest final node in the trie and hence the corresponding root word, ‘শ‌ুনা’/‘shuna’, is considered as the lemma of ‘শ‌ুেন’/‘shune’.
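The procedure above can be summarized with a short sketch. This is a simplified re-implementation of the described rules, not the authors' code; words are treated as plain character sequences, ignoring any script-specific normalization the real system may perform.

```python
# Simplified sketch of the trie-based lemmatization rules described above
# (not the authors' implementation).
from collections import deque

class TrieNode:
    def __init__(self, parent=None, char=''):
        self.children = {}
        self.parent = parent
        self.char = char
        self.is_final = False            # True if a dictionary root ends here

def build_trie(roots):
    root = TrieNode()
    for word in roots:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode(parent=node, char=ch)
            node = node.children[ch]
        node.is_final = True
    return root

def spell(node):
    """Read back the dictionary root represented by a final node."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return ''.join(reversed(chars))

def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def lemmatize(word, trie):
    # Navigate as far as the surface word allows; the last node reached is the
    # "end node".  Remember every final node met on the path.
    node, depth, finals = trie, 0, []
    for ch in word:
        if ch not in node.children:
            break
        node = node.children[ch]
        depth += 1
        if node.is_final:
            finals.append(node)
    if depth == len(word) and node.is_final:
        return word                      # case 1: the word itself is a root
    if finals:
        return spell(finals[-1])         # case 2(a): root closest to end node
    # Case 2(b): breadth-first search outward from the end node (over parent
    # and child edges) for the nearest final node(s); ties are broken by the
    # longest overlapping prefix with the surface word.
    seen, queue, candidates = {node}, deque([node]), []
    while queue and not candidates:
        for _ in range(len(queue)):      # one edge-distance level at a time
            current = queue.popleft()
            if current.is_final:
                candidates.append(spell(current))
            for neighbour in [current.parent, *current.children.values()]:
                if neighbour is not None and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    if not candidates:
        return word
    return max(candidates, key=lambda r: common_prefix_len(r, word))

# Transliterated version of the example above:
# lemmatize('shune', build_trie(['shuna', 'shunani', 'shunano'])) -> 'shuna'
```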

3 Experimental Results

Based on the evaluation of the Morpheme Extraction Task at FIRE 2014, the results obtained on Bengali data using our lemmatization system are given in Table 1. The process of linguistic evaluation is as follows. Word pairs with the same roots are compared, and scores are calculated on the number of matches thus obtained. For obtaining precision, a set of 1000 words is sampled from the result files generated by the lemmatization system. For each of the sampled words, another word having the same root is chosen from the result file, and these pairs are compared to the gold standard data. A point is given for every word pair that has a common root in the gold standard as well. The number of points for each word is normalized to 1. Precision is then calculated as the ratio of the total number of points obtained to the total number of sampled words. Similarly, for calculating recall, a set of 1000 words is sampled, but this time from the gold standard data. For each of these words, another word having the same morpheme is chosen randomly from the gold standard data. The word pairs are then compared to the analyses in the results generated by our lemmatization system. A point is given to each sampled word pair having a common morpheme. This process is carried out several times, and the average values of precision and recall are taken. The final score for the system in this evaluation is the F1-measure, i.e. the harmonic mean of precision and recall.

4 Conclusion and Future Work

We have designed our lemmatization algorithm in such a way that it does not depend much on language-specific features. In spite of that, it has some limitations, which are as follows. Compound words and out-of-vocabulary words are not handled by our algorithm. Root words are taken from a dictionary, but if the coverage of the dictionary used is not good, that will cause errors. However, as there is no good language-independent lemmatizer for Indian languages, we hope our effort is a positive contribution, and we will try to overcome these drawbacks in the future.

References

Pushpak Bhattacharyya, Ankit Bahuguna, Lavita Talukdar and Bornali Phukan. 2014. Facilitating Multi-Lingual Sense Annotation: Human Mediated Lemmatizer. Global WordNet Conference, 2014.
Ljiljana Dolamic and Jacques Savoy. 2010. Comparative Study of Indexing and Search Strategies for the Hindi, Marathi and Bengali Languages. ACM Transactions on Asian Language Information Processing (TALIP), 9.3:11.
Debasis Ganguly, Johannes Leveling and Gareth J. F. Jones. 2012. DCU@FIRE 2012: Rule-Based Stemmers for Bengali and Hindi. Working Notes for the FIRE 2012 Workshop.
Kimmo Koskenniemi. 1984. A General Computational Model for Word-Form Recognition and Production. Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics.
Aki Loponen, Jiaul H. Paik and Kalervo Järvelin. 2013. UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task. Multilingual Information Access in South Asian Languages. Springer Berlin Heidelberg, 258-268.
Prasenjit Majumder, Mandar Mitra, Swapan K. Parui, Gobinda Kole, Pabitra Mitra and Kalyankumar Dutta. 2007. YASS: Yet Another Suffix Stripper. ACM Transactions on Information Systems (TOIS) 25.4: 18.
Jiaul H. Paik, Mandar Mitra, Swapan K. Parui and Kalervo Järvelin. 2011a. GRAS: An Effective and Efficient Stemming Algorithm for Information Retrieval. ACM Transactions on Information Systems (TOIS) 29.4 (2011): 19.
Jiaul H. Paik and Swapan K. Parui. 2011b. A Fast Corpus-Based Stemmer. ACM Transactions on Asian Language Information Processing (TALIP) 10.2 (2011): 8.
Sandipan Sarkar and Sivaji Bandyopadhyay. 2012. Morpheme Extraction Task Using Mulaadhaar – A Rule-Based Stemmer for Bengali: JU@FIRE MET 2012. Working Notes for the FIRE 2012 Workshop.

AMRITA@FIRE-2014: Morpheme Extraction for Tamil using Machine Learning

Anand Kumar M, S Rajendran and K.P Soman

Centre for Excellence in Computational Engineering and Networking Amrita School of Engineering Amrita Vishwa Vidyapeetham Coimbatore-641 112

[email protected], [email protected]

Abstract:

This article presents the working methodology of the supervised Morpheme Extraction Task (MET) for the Tamil language at FIRE 2014. In this attempt, Tamil morphemes are extracted using a supervised machine learning algorithm, Support Vector Machines (SVM).

1 Introduction

The role of a morphological analyzer is to identify the root and morphemes of a word. Generally, rule-based approaches are used for developing morphological analyzer systems, based on a set of rules and a dictionary that contains root words and morphemes. Recently, machine learning approaches have been found to dominate the Natural Language Processing field. Machine learning is a branch of Artificial Intelligence (AI) concerned with the design of algorithms that learn from examples. Here, a supervised-learning-based morphological analyzer is implemented for Tamil verbs and nouns, and a suffix-based method is followed for proper nouns and pronouns.

2 Methodologies

Tamil is a morphologically rich and agglutinative language. Such a morphologically rich language needs deep analysis at the word level to capture the meaning of a word from its morphemes and their categories. Each root is affixed with several morphemes to generate a word; in general, Tamil is postpositionally inflected on the root word, and each root word can take more than ten thousand inflected word forms. Tamil exhibits both lexical and inflectional morphology. Lexical morphology changes the meaning of the word and its class by adding derivational and compounding morphemes to the root. Inflectional morphology changes the form of the word and adds information to it by adding inflectional morphemes to the root.

The input to the morphological system is a POS-tagged sentence. Initially, the POS-tagged sentence is refined and divided according to a minimized POS tagset. The Noun (N) and Verb (V) forms are then morphologically analyzed using a supervised machine learning algorithm, while Pronouns (P) and Proper Nouns (PN) are morphologically analyzed using a suffix-based method.

Figure 1: Framework of Tamil Morpheme Extraction (Tamil sentence/words → Tamil POS tagger → Noun/Verb, Pronoun, Proper Noun and other word-class analyzers → Root + Morphemes)

3 Machine learning based Morphological Analyzer for Nouns and Verbs

Sequence labeling is a significant generalization of the supervised classification problem: a single label is assigned to each input element in a sequence. The elements to be labeled are typically items such as parts of speech or syntactic chunk labels. Many tasks are formalized as sequence labeling problems in various fields such as natural language processing and bioinformatics. There are two types of sequence labeling approaches:

– Raw labeling.

– Joint segmentation and labeling.

In raw labeling, each element gets a single tag, whereas in joint segmentation and labeling, whole segments get a single label. In the morphological analyzer, the sequence is usually a word and a character is an element. The input word is denoted by ‘W’, and the root word and inflections are denoted by ‘R’ and ‘I’ respectively.

[W]Noun/Verb = [R] Noun/Verb + [I] Noun/Verb

In turn, the notation ‘I’ can be expanded as i1 + i2 + … + in, where ‘n’ refers to the number of inflections or morphemes. Further, ‘W’ is converted into a sequence of characters. The morphological analyzer accepts a sequence of characters as input and generates a sequence of characters as output. Let X be the finite set of input characters and Y be the finite set of output characters.

If the input string is ‘x’, it is segmented as x1x2...xn, where each xi ∈ X. Similarly, if y is an output string, it is segmented as y1y2...yn with yi ∈ Y, where ‘n’ is the number of segments.

Inputs: x = (x1, x2, x3…, xn)

Labels: y = (y1, y2, y3…, yn)

The main objective of the sequence labeling approach is to predict y from the given x. In the training data, the input sequence ‘x’ is mapped to the output sequence ‘y’. The morphological analyzer problem is thereby transformed into a sequence labeling problem. The information about the training data is explained in the following sub-sections. Finally, morphological analysis is redefined as a classification task, which is solved using the sequence labeling methodology.
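As an illustration of this character-level formulation, the sketch below trains an SVM classifier over character-window features. This is not the authors' system (which couples SVMs with structured sequence prediction over a large tagged corpus); it is a minimal point-wise approximation using scikit-learn, and the toy training pair—a romanized word with begin/inside morpheme tags—is invented.

```python
# Minimal point-wise approximation of the character-level labeling idea using
# scikit-learn; this is not the authors' system, and the toy training pair
# (a romanized word with begin/inside morpheme tags) is invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def char_features(word, i, window=2):
    """Features for the i-th character: the character, its position and a
    small window of neighbouring characters."""
    feats = {'char': word[i], 'pos': i}
    for off in range(-window, window + 1):
        j = i + off
        feats[f'ctx{off}'] = word[j] if 0 <= j < len(word) else '#'
    return feats

def make_xy(pairs):
    """pairs: list of (input word, per-character label sequence)."""
    X, y = [], []
    for word, labels in pairs:
        for i, label in enumerate(labels):
            X.append(char_features(word, i))
            y.append(label)
    return X, y

# Toy training data: 'padi' (root) + 'ththaan' (inflection); B marks the
# first character of a morpheme, I the characters inside it.
train_pairs = [('padiththaan',
                ['B', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'I', 'I', 'I'])]

vectorizer = DictVectorizer()
X, y = make_xy(train_pairs)
classifier = LinearSVC().fit(vectorizer.fit_transform(X), y)

def label_word(word):
    feats = [char_features(word, i) for i in range(len(word))]
    return list(classifier.predict(vectorizer.transform(feats)))
```

With a realistic tagged corpus in place of the toy pair, `label_word` would emit a label per character, which the alignment step described below groups into morpheme segments.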

3.1 Novel Data Modeling for Noun/Verb Morphological Analyzer

The goal of the machine learning approach is to use the given examples and automatically find generalization and classification rules. Data creation for the first phase of the Noun/Verb morphological analyzer system is done by preprocessing and mapping of characters. Building the corpora for the morphological analyzer involves classifying paradigms for verbs and nouns. The classification of Tamil verbs and nouns is based on tense markers and case markers respectively. Each paradigm inflects with the same set of inflections; a paradigm provides information about all possible word forms of a root word in a particular word class. The number of paradigms for each word class (noun/verb) is defined, and the next step is to collect the list of root words for all paradigms.

Figure 2: Pipeline of the morphological analyzer — an input Tamil word is preprocessed, segmented by trained Model-I, aligned into morphemes, and tagged by trained Model-II before postprocessing produces the root and its morphemes. (The Tamil example in the original figure is not recoverable from the extracted text.)

Preprocessing is an important step in data creation. It is involved in the training stage as well as the decoding stage. Figure 3 shows the preprocessing steps involved in the development of the corpora. The morphological corpus used for machine learning is developed through the following steps: Romanization, Segmentation and Alignment.

Figure 3: Preprocessing Steps

Table 1: Sample Data Format (character-level input/output pairs; the original layout is not recoverable from the extracted text)

Segmentation of morpheme

In the morpheme segmentation process, words are segmented into morphemes according to their morpheme boundaries. The input sequence is given to the trained Model-I, which predicts a label for each input segment. The output sequence is then aligned into morpheme segments using an alignment program.

Morpho-syntactic tagging

The segmented morpheme sequence is given to the trained Model-II, which predicts a grammatical category for each segment (morpheme) in the sequence.

4 Suffix-based Morphological Analyzer for Pronouns and Proper Nouns

The morphological analyzer for proper nouns is developed using suffixes. The proper noun word form, identified by its POS tag in the minimized POS-tagged sentence, is taken as input for proper noun morphological analysis. Initially, the proper noun word form is converted into Roman form for easy computation. This Roman conversion is done using a simple key-pair mapping of Roman and Tamil characters: the mapping program recognizes each Tamil character unit and replaces it with the corresponding Roman characters. The Roman form is given to the proper noun analyzer system, which compares the word with the predefined suffixes. First, the suffix is identified and then replaced with the corresponding information from the proper noun suffix data set. The suffix data set is created using various proper noun inflections and their end characters. For example, the word "sithamparam" (சசதமமபரமம) ends with 'm' (மம), and the word "pANdisEri" (பபணமடசமசசரச) ends with 'ri' (ரச); the possible inflections of both words are given in the table. Morphological changes for a proper noun differ based on its end characters, so end characters are used in creating the rules. From the various inflections of the word form, the suffix is identified and the remaining part is the stem. The suffix is mapped to the original morphological information; the algorithm replaces the encountered suffix with the morphological information in a suffix table.
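The suffix-replacement step can be sketched as follows. The romanized suffix entries and their glosses are invented placeholders; the real suffix data set is larger, linguistically curated, and keyed on the word's end characters as described above.

```python
# Sketch of the suffix-table lookup described above.  The romanized suffixes
# and their glosses are invented placeholders; the real suffix data set is
# larger and keyed on the word's end characters.
SUFFIX_TABLE = {
    'ukku': '<DAT>',    # dative case marker (illustrative entry)
    'aal':  '<INS>',    # instrumental case marker (illustrative entry)
    'ai':   '<ACC>',    # accusative case marker (illustrative entry)
    'il':   '<LOC>',    # locative case marker (illustrative entry)
}

def analyze_proper_noun(romanized_word):
    """Return (stem, morphological info) by longest-suffix match, or the word
    unchanged with a bare tag when no known suffix is found."""
    for suffix in sorted(SUFFIX_TABLE, key=len, reverse=True):
        if romanized_word.endswith(suffix) and len(romanized_word) > len(suffix):
            return romanized_word[:-len(suffix)], SUFFIX_TABLE[suffix]
    return romanized_word, '<N>'

# Hypothetical romanized example: 'rAmanukku' -> ('rAman', '<DAT>')
print(analyze_proper_noun('rAmanukku'))
```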

5 Conclusion

This note explains the development of a morphological analyzer for the Tamil language using a machine learning approach. Capturing the agglutinative structure of Tamil words with an automatic system is a challenging job. Generally, rule-based approaches are used for building morphological analyzers; here, the Tamil morphological analyzer for nouns and verbs is developed using a state-of-the-art machine learning approach, and the morphological analyzer problem is redefined as a classification problem. The approach is based on sequence labelling and training with kernel methods, which capture the non-linear relationships among the morphological features in the training data in a better and simpler way. An SVM-based tool is used for training the system, with around 6 lakh (600,000) morphologically tagged verbs and nouns.

References

[1] John Goldsmith (2001), “Unsupervised Learning of the Morphology of a Natural Language”, Computational Linguistics, 27(2):153–198.

[2] Anandan. P, Ranjani Parthasarathy, Geetha T.V.2002. Morphological Analyzer for Tamil, ICON 2002, RCILTS-Tamil, Anna University, India.

[3] Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.)2004 , Morphology. A Handbook on Inflection and Word Formation, Berlin and New York: Walter De Gruyter, 1893-1900

[4] Dhanalakshmi V., Anand Kumar M., Rekha R.., Arun Kumar C., Soman K.P., Rajendran S., "Morphological Analyzer for Agglutinative Languages Using Machine Learning Approaches," artcom, pp.433-435, 2009 International Conference on Advances in Recent Technologies in Communication and Computing, 2009

[5] Hal Daume, 2006. http://nlpers.blogspot.com/2006/11/getting-started-in-sequence-labeling.html

[6] S.Rajendran, Arulmozi, S., Ramesh Kumar, Viswanathan, S. 2001. “Computational morphology of verbal complex “.Language in india Volume 3 : 4 April 2003.

[7] Anand Kumar, M., et al. "A sequence labeling approach to morphological analyzer for Tamil language." IJCSE) International Journal on Computer Science and Engineering 2.06 (2010): 1944-195.

Morpheme Extraction Task at FIRE 2014

Nilotpala Gandhi1, Kanika Mehta2, Prashasti Kapadia2, Adda Roshni2, Vaibhavi Sonavane2 and Prasenjit Majumder2

1 Gujarat University, Ahmedabad, Gujarat 2 DA-IICT, Gandhinagar, Gujarat

Abstract. The Morpheme Extraction Task (MET) was organized for the third time this year, after being introduced at FIRE 2012. Participating systems were required to provide the morphemes of given term lists. The track was offered in three languages, viz. Bengali, Gujarati and Tamil. The evaluation exercise comprised a linguistic evaluation of the submitted systems. This overview paper describes the goals, data, tasks, participants, evaluation process, and obtained results.

1 Introduction

Morphology is the study of words and their grammatical structure. Simply put, it is the study of 'morphemes', the basic meaningful units of a language which combine to form more complex words. When morphemes exist freely on their own, they are known as root words. Those morphemes which do not have an independent existence are known as bound morphemes or affixes [1]. For example, the word 'sleeping' is made of two morphemes: sleep and -ing. Here sleep is a free morpheme while -ing is a bound morpheme. The goal of morphological analysis could be either to understand the structure of a language and use this understanding in various interesting tasks in machine translation, natural language processing, etc., or simply to improve retrieval performance in that particular language. Extracting morphemes from a language often involves an important process called 'stemming', or affix removal. Three different kinds of stemmers are found in the literature. The first kind, known as supervised stemmers, depends on the known grammatical rules of a language; the second kind, unsupervised stemmers, are statistical and algorithmic and do not need any language-specific information. The third kind lies between the first two and is known as semi-supervised stemmers [2].

Many Indian languages are morphologically rich because of the presence of a huge number of different word forms. The vast variety of inflected forms in which words appear poses an important challenge for information retrieval experiments in Indian languages. Hence, morpheme analysis becomes an important step for information retrieval in Indian languages. In MET 2014, both language-dependent and language-independent systems are evaluated for Indian languages. Linguistic evaluation was carried out for the submitted systems; this evaluation, along with the results obtained, is discussed in detail in the following pages. The task was closely modeled on Morpho Challenge 2010, conducted by the Department of Computer Science, Aalto University, Finland [3] (http://research.ics.aalto.fi/events/morphochallenge2010/).

2 Data

The FIRE ad hoc corpora were used for all languages. The word list for each language was constructed by collecting all the terms that occur in the corpus after filtering out English terms, numerical digits, punctuation marks, etc. The term lists were made available on the FIRE website.

Some of the statistics about the data used are shown in Table 1.

Table 1. DATA STATISTICS

Language   No. of words
Bengali    1,315,518
Gujarati   1,932,607
Tamil      666,683

Table 2 shows the number of words of gold standard data used in Bengali and Tamil for linguistic evaluation.

Table 2. GOLD STANDARD STATISTICS

Language   Training Data   Test Data   Total No. of Words
Bengali    600             1,074       1,674
Tamil      1,000           1,938       2,938

3 Task

Linguistic Evaluation

A sample of the proposed morpheme analyses from each system was compared against a sample of the gold standard data (which contains manual morpheme analyses). This experiment was repeated over several samples, and the average was treated as the final score. This task was performed for the Bengali and Tamil languages, as these are the only languages for which gold standard data is available.

While comparing the two analyses, it cannot be expected that the algorithm of a system comes up with morphemes that exactly correspond to those in the gold standard data. So word pairs with the same morphemes are compared, and scores are calculated on the number of matches thus obtained.

For obtaining precision, a set of 1000 words is sampled from the result files generated by the morphological analysis systems. For each of the sampled words, another word having the same morpheme is chosen from the result file, and these pairs are compared to the gold standard data. A point is given for every word pair that has a common morpheme in the gold standard as well. The number of points for each word is normalized to 1. Precision is then calculated as the ratio of the total number of points obtained to the total number of sampled words.

Similarly, for calculating recall, a set of 1000 words is sampled, but this time from the gold standard data. For each of these words, another word having the same morpheme is chosen randomly from the gold standard data. The word pairs are then compared to the analyses in the results generated by the morphological analysis systems. A point is given for each sampled word pair having a common morpheme. Recall is calculated analogously, as the ratio of the total number of points obtained to the total number of sampled words.

This process is carried out several times, and the average values of precision and recall are taken. The final score for the system in this evaluation is the F-measure, i.e. the harmonic mean of precision and recall: F = 2 · P · R / (P + R).
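The sampling-based scoring can be sketched as follows. This is a simplified re-implementation of the idea, not the Morpho Challenge evaluation code referred to below; it assumes each analysis is available as a mapping from a word to its set of morphemes.

```python
# Simplified re-implementation of the sampling-based scoring described above;
# the official Morpho Challenge scripts are used for the actual evaluation.
# Both analyses are assumed to map a word to its set of morphemes.
import random

def share_morpheme(analysis, w1, w2):
    return bool(analysis.get(w1, set()) & analysis.get(w2, set()))

def pairwise_score(source, reference, sample_size=1000, seed=0):
    """Sample words from `source`, pair each with another word sharing a
    morpheme in `source`, and award a point when the pair also shares a
    morpheme in `reference`."""
    rng = random.Random(seed)
    words = [w for w in source if w in reference]
    points, sampled = 0.0, 0
    for w in rng.sample(words, min(sample_size, len(words))):
        partners = [v for v in source if v != w and share_morpheme(source, w, v)]
        if not partners:
            continue
        sampled += 1
        if share_morpheme(reference, w, rng.choice(partners)):
            points += 1.0
    return points / sampled if sampled else 0.0

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Under this sketch, precision corresponds roughly to `pairwise_score(system, gold)`, recall to `pairwise_score(gold, system)`, and the final score to `f_measure(precision, recall)`, averaged over repeated samples.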

The code used for Morpho Challenge 2010, which implements all the above steps, is used for this evaluation [3].

4 Participants

Although 4 participants registered for the task, only two of them submitted runs. Table 3 shows the participant names and the languages supported by their systems.

Table 3. LIST OF PARTICIPANTS

Institute                     Participants                                                    Language
ISI Kolkata                   Abhishek Chakrabarty, Sharod Roy Choudhary, Dr. Utpal Garain    Bengali
Amrita Vishwa Vidhyapeetham   Dr. M. Anand Kumar, Dr. K. P. Soman                             Tamil

5 Results

Table 4 shows the Precision, Recall and F-measure scores for ISI's stemmer and Amrita University's Tamil morphological analyser. The scores have been computed as described in the Task section. (The evaluation of Amrita University's Tamil morphological analyser has so far been done on 500 words only, so the following results are for 500 of the 2,938 test words.)

Table 4. RESULTS

System                        Language   Precision   Recall   F-measure
ISI Kolkata                   Bengali    56.19%      65.08%   60.31%
Amrita Vishwa Vidhyapeetham   Tamil      10.86%      40.90%   17.16%

ISI's system performs well on both precision and recall. The results from Amrita University are not very satisfactory, but the analysis on the complete test data is yet to be done.

6 Conclusion

The Morpheme Extraction Task successfully evaluates some of the latest systems for stemming and morphological analysis in Indian languages. Its main goal has been to encourage participants to experiment with different methods to improve their systems and obtain better scores. ISI's system has shown some improvement as compared to the previous year's results. The results from Amrita University are not very promising so far, but better results are expected when the system is tested on the larger test data. The improved systems will be useful in information retrieval, text understanding, machine learning and language modeling.

We are grateful to the team of FIRE organizers and the IR lab team at DA-IICT for their support and assistance. We especially wish to thank Parth Mehta and Ayan Bandyopadhay for their invaluable help in maintaining the MET webpage. The gold standard analyses data provided by IIT-Kharagpur and AUKBC is also gratefully acknowledged. Lastly, we also thank all the participants of this task for their enthusiasm in submitting their system.

References

[1] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval, Online edition (c) 2009 Cambridge UP, Draft of April 1, 2009.
[2] Harald Hammarström, Radboud Universiteit and Max Planck Institute for Evolutionary Anthropology; Lars Borin, University of Gothenburg: Survey Article: Unsupervised Learning of Morphology, (c) 2011 Association for Computational Linguistics, 4 October 2010.
[3] Mikko Kurimo, Sami Virpioja and Ville T. Turunen: Proceedings of the Morpho Challenge 2010, TKK-ICS-R37, TKK Reports in Information and Computer Science, Espoo 2010, Aalto School of Science and Technology. http://research.ics.aalto.fi/events/morphochallenge2010/

Overview of FIRE 2014 Track on Transliterated Search

Monojit Choudhury, Gokul Chittaranjan
Microsoft Research Lab India
{monojitc,t-gochit}@microsoft.com

Parth Gupta
Technical University of Valencia, Spain
[email protected]

Amitava Das
Univ. of North Texas, USA
[email protected]

ABSTRACT

The Transliterated Search track has been organized for the second year in FIRE. The track has two subtasks. Subtask 1, on language labeling of words in code-mixed text fragments, was conducted for 6 Indian languages: Bangla, Gujarati, Hindi, Malayalam, Tamil and Telugu, mixed with English. In Subtask 2, on retrieval of Hindi film lyrics, along with transliterated queries in Roman script, this year we also had Devanagari queries. A total of 54 runs were submitted from 18 teams, of which 35 runs for Subtask 1 and 7 runs for Subtask 2 have been evaluated. Performance of the runs is comparable to that of last year's.

1. INTRODUCTION

The shared task on transliterated search was introduced last year in FIRE 2013. There were two subtasks, on labeling of the query words and on ad hoc retrieval for transliterated lyrics queries [1]. This year, we hosted the same two subtasks, but with additional features and language pairs. A large number of teams participated in the shared task.

We provide here an overview of the transliterated search track at the sixth Forum for Information Retrieval Conference 2014 (FIRE '14). First, we present a description of the shared tasks in Sec. 2. Next, we describe the datasets associated with these tasks in Sec. 3. Sec. 4 records task participation information. We discuss results in Sec. 5 and conclude with a summary in Sec. 6.

2. TASKS

Our track on transliterated search contains two major sub-tasks. Therefore, in the rest of the paper, we divide our descriptions, results, and analyses into two parts, one for each of these sub-tasks. Details of these tasks can also be found on the website http://bit.ly/1q6rb6h.

2.1 Subtask 1: Language Identification and Transliteration

Suppose that s : <w1 w2 w3 . . . wn> is a sentence written in Roman script. The words w1, w2, w3, ..., wn could be English words or transliterated from another language L. The task is to label the words as E, L or NE depending on whether each is an English word, a transliterated L-language word [2], or a named entity. Named entities are further typed as person, location, organization, and acronym. Further, for some language pairs, words of E can be inflected with a suffix of L or vice versa. In such cases, they have to be tagged as MIX. We also introduced a tag O (others) for words which cannot be classified as L, E, NE or MIX; these could be punctuation, numbers, emoticons or foreign-language words. For each transliterated word (i.e. words with tag L), the correct transliteration has to be provided in the native script (i.e., the script which is used for writing L).

We added three language pairs to this subtask this year, amounting to a total of six language pairs – English-Bangla, English-Gujarati, English-Hindi, English-Kannada, English-Malayalam and English-Tamil. Furthermore, last year the labeling task was restricted to queries or very short text fragments. This year, most of our sentences were acquired from social media posts (public) and blogs. With the large number of spelling variations and contractions occurring over social media, we believe the task this year was more challenging than last year's.

2.2 Subtask 2: Mixed-Script Ad hoc Retrieval for Hindi Song Lyrics

Given a query in Roman or Devanagari script, the system has to retrieve the top-k documents that are either in Devanagari, in Roman transliterated form, or in both scripts (mixed-script documents). Like last year, we used the Bollywood song lyrics corpus and song queries as our dataset, but two new concepts were introduced this year. First, the queries could also be in Devanagari. Second, Roman queries could have splitting or joining of words. For instance, "main pal do palka shayar hun" (where the words pal and ka have been joined incorrectly), or "madhu ban ki sugandh" (where the word madhuban has been incorrectly split). We observed that in the Bing logs, a large number of lyrics queries featured this kind of noise, which is probably due to a lack of formal education in Hindi. Indeed, Bollywood's popularity stretches far beyond Hindi speakers. Hence, we chose to introduce these types of challenging queries this year.

3. DATASETS

In this section, we describe the datasets that have been released for the tasks described in the previous section and those that could be generally useful for solving transliteration tasks. While the former have been carefully constructed by us using manual and automated techniques and have been made available to participants through email requests, the latter are external resources freely available online. Information about these is available at the website http://bit.ly/194bOTT.

3.1 General

These datasets can be generally useful for a variety of transliteration tasks. They include word frequency lists, word transliteration pairs, miscellaneous tools and corpora for various languages.

1. English

(a) English word frequency list: This dataset is available in a plain tab-separated text format. It contains the standard dictionary of English words followed by their frequencies computed from a large corpus. It contains some noise (very low frequency entries) as it is constructed from a news corpus.

2. Hindi

(a) Hindi word frequency list: This dataset is available in a plain tab-separated text format. It contains Hindi words (in Devanagari script) followed by their frequency computed from a large Leipzig corpus (see below).

(b) Hindi word transliteration pairs 1: This is available in a plain tab-separated text format. It contains 30,823 transliterated Hindi words (Roman script) followed by the same word in Devanagari. It contains Roman spelling variations for the same Hindi word (transliteration pairs found using alignment of Bollywood song lyrics). It does not contain the frequency of occurrence of a particular word transliteration pair [3].

(c) Hindi word transliteration pairs 2: This dataset contains annotations (Hindi word transliteration pairs) collected from different users in multiple setups – chat, dictation and other scenarios. These may be collated into a single resource file if desired; it also provides the frequency of occurrence of a particular word transliteration pair [4].

3. Bangla

(a) Bangla word frequency list: This is available in a plain tab-separated text format. It contains Bangla words (Roman script, ITRANS format) followed by their frequency computed from a large Anandabazar Patrika corpus.1 The ITRANS to UTF-8 converter below can be used for obtaining the words in Bangla script.

(b) Bangla word transliteration pairs: This dataset contains annotations (Bangla word transliteration pairs) collected from different users in multiple setups – chat, dictation and other scenarios. These may be collated into a single resource file if desired; it will also provide the frequency of occurrence of a particular word transliteration pair [4].

4. Gujarati

(a) Gujarati word frequency list: This is available in a plain tab-separated text format. It contains Gujarati words (in Gujarati script) followed by their frequency computed from a large Leipzig corpus (see below).

(b) Gujarati word transliteration pairs: This is available in a plain tab-separated text format. It contains transliterated Gujarati words (Roman script) followed by the same word in Gujarati script. Due to the poor availability of Gujarati resources, this is a small list of 546 entries created from our training data.

5. General

(a) Leipzig corpora collection: This dataset has several large corpora for multiple languages. The word frequency lists for English, Hindi and Gujarati have been constructed from Leipzig corpora. Please cite the paper mentioned on the site [5] in your working notes.

(b) ITRANS to UTF-8 converter for Bangla: This tool has been developed by IIT Kharagpur. Look for "Windows −> Stand-alone Application Available Modules". One can register on the site for free and download the application.

1 http://www.anandabazar.com/

3.2 Subtask 1

We split the data we collected for all 6 language pairs from various publicly available sources. For the Hindi-English language pair, data was procured from last year's shared task and from newly annotated data from our more recent work [6, 7]. For the Bangla-English language pair, similarly, data from last year's shared task was combined with data from [7]. Gujarati-English pair data remained the same as the previous year, and lastly, the 3 other language pairs have been introduced this year, based on publicly available data from the Internet. Details of the sources from which the data was collected are given in Table 3. We also made it mandatory for the participants to sign a data usage agreement, to ensure that all participants used the data only for the purposes of this shared task, did not share it, and would delete all copies of it after they made their submissions.

The labeled data from all language pairs, except Kannada-English and Tamil-English, were split into development and test sets. Since we had few labeled data samples from the Kannada-English and Tamil-English pairs, we used them entirely as our test set.

The number of sentences, tokens of each kind and transliterations available for each language pair in the development set is given in Table 1. Since the amount of data was moderate, we did not recommend its use for training algorithms, but rather as a development set for tuning model parameters and for understanding and analyzing word transliteration pairs. This data was provided as UTF-8 encoded text files. For the Tamil and Kannada language pairs, as mentioned before, because we had fewer than 150 labeled sentences, we released 2-3 labeled examples for participants to familiarize themselves with the annotation format.

For the test set, we combined the labeled data we had procured with unlabeled data from publicly available content on the Internet. This procedure has enabled us to obtain labels for a large quantity of unlabeled data from the submissions made to the shared task this year. We intend to use this to easily obtain ground truth for a larger development/test set for shared tasks to be conducted in the future. Details about the number of sentences in the test set, and the size of the subset of sentences that were labeled for each language pair, are given in Table 2.

3.3 Subtask 2

We first released development (tuning) data for the IR system – 25 queries, associated relevance judgments (qrels) and the corpus. The queries were Bollywood song lyrics. The corpus consisted of 62,894 documents which contained song titles and lyrics in Roman (ITRANS or plain format), Devanagari and mixed scripts. The test set consisted of thirty-five queries in either Roman or Devanagari script. On average, there were 65.48 qrels per query, with the average number of relevant documents per query being 7.37 and cross-script2

2 Those documents which contain songs in both the scripts are ignored.

Lang2       Sentences   Tokens   E-tags   L-tags   MIX   O      NEs   Translits
Bangla      800         20,648   8,786    7,617    0     3,783  462   364
Gujarati    150         937      47       890      0     0      0     890
Hindi       1,230       27,614   11,486   11,989   0     3,371  768   2,420
Malayalam   150         1,914    326      1,139    65    292    92    0

Table 1: Number of sentences and tags provided for each language pair in the development set. English was one of the languages in all language pairs; Lang2 refers to the other language in the pair.

Lang2       Test-set size  Subset with labels  Tokens  E-tags  L-tags  MIX  O      NEs    Translits
Bangla      1,000          739                 17,305  7,215   6,392   0    3,236  462    397
Gujarati    1,000          150                 1,078   12      1,050   0    0      16     1,064
Hindi       1,273          1,270               32,111  12,434  13,676  0    4,815  1,186  2,542
Kannada     1,000          119                 1,271   280     812     3    138    38     815
Malayalam   1,000          120                 1,473   243     885     37   233    75     885
Tamil       1,000          49                  974     460     399     0    115    0      0

Table 2: Number of sentences and tags on which each language pair was evaluated in the test set. English was one of the languages in all language pairs; Lang2 refers to the other language in the pair.

Number of teams who made a submission: 18
Number of accepted teams (based on their output conforming to our output format and submitting a working note): 16
Number of runs received: 54
Number of runs accepted: 39

Table 5: Overall participation details for the shared task.

Lang2      Data Sources
Bangla     http://www.facebook.com/JuConfessions, http://www.gutenberg.org/ebookx/18581
Gujarati   http://www.gutenberg.org/ebookx/18581, http://songslyricserver.blogspot.com/p/blog-page_19289.html
Hindi      https://www.facebook.com/Confessions.IITB, https://www.facebook.com/DUConfess1, some manually curated data from Facebook public pages
Kannada    http://kannadalyric.blogspot.in, http://www.gutenberg.org/ebookx/18581
Malayalam  https://www.facebook.com/keralatourismofficial, https://www.facebook.com/mathrubhumicom, https://www.facebook.com/asianetnews, http://www.facebook.com/AsianetNews, http://filmsonglyrics.wordpress.com, http://www.gutenberg.org/ebookx/18581
Tamil      Various public blogs with Tamil content, http://www.gutenberg.org/ebookx/18581

Table 3: Sources from which data was obtained for the test and development sets, across different language pairs.

4. SUBMISSIONS OVERVIEW

A total of 18 teams made 54 submissions across both subtasks. Of these, 39 runs were declared valid as they conformed to the output format. These runs came from 16 unique teams. These details are given in table 5. Of the accepted runs in subtask-1, Hindi-English was the most common language pair (constituting 17 runs), followed by Bangla-English (8 runs). Gujarati-English and Kannada-English had 3 runs each, and Malayalam-English and Tamil-English had 2 runs each. Half of the teams submitted a single run for a language pair. IITP-TS, JU-NLP-LAB, PESIT-CS-FIRE-IR, I1, and BITS-Lipyantran submitted multiple runs. A total of 7 runs were submitted for subtask-2, by 4 teams. Details are given in table 6.

Most of the submissions made for subtask-1 utilized character-based n-gram features, with additional token-level features, along with a supervised classifier for language identification, and dictionaries together with rules to obtain the transliteration. Two submissions (Asterisk and BITS-Lipyantran) used a readily available API (the Google transliteration API) for the transliteration task. Four teams – BITS-Lipyantran, IIITH, IITP-TS, and JU-NLP – went beyond token and character level features by using contextual information or a sequence tagger. A brief summary of all the systems is given in table 4.

For subtask 2, the teams followed two different strategies: some teams use a single operating script, either Devanagari or Roman, and then transliterate the documents and queries which are in the other script to the operating script. Other teams generated cross-script equivalents for indexing the documents as well as matching the queries.

Team             Char n-grams  Token features  Rules  Dictionary  Context  Classifier                           Transliteration
Asterisk         -             -               Yes    Yes         -        -                                    Google transliteration API
BITS-Lipyantran  Yes           -               Yes    Yes         Yes      SVM + Naive Bayes                    Google Transliteration API
BMS-Brainz       Yes           -               Yes    Yes         -        -                                    Rule-based
DA-IR            Yes           -               -      -           -        Path-matching                        Rule-based with Hindi as the base language
I1               -             Yes             -      Yes         -        Naive Bayes                          -
IIITH            Yes           Yes             -      -           Yes      SVM + Linear Kernel                  ID3 classifier + Indic-converter
IITP-TS          Yes           Yes             -      -           Yes      SVM, Random forests, decision tree   Rules along with probability distributions
ISI              -             -               Yes    Yes         -        -                                    Uses a dictionary and rule-based
ISM-D            Yes           -               -      -           -        MaxEnt                               Character mapping rules and dictionary
JU-NLP           Yes           Yes             -      -           -        CRF                                  Phrase-based statistical transliteration tool
PESIT-CS-FIRE    Yes           -               -      -           -        SVM and Naive Bayes                  Uses a dictionary and rule-based
Salazar          -             -               Yes    Yes         -        -                                    Uses a dictionary and rule-based
Sparkplug        -             -               Yes    Yes         -        -                                    Uses a dictionary and rule-based

Table 4: Description of systems for subtask-1.
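As context for Table 4, a minimal sketch of the "dictionary plus rules" back-transliteration strategy that many teams combined with their classifiers might look as follows. The pair file, fallback rule table and function names below are illustrative assumptions, not any team's actual system; the rule table in particular is a toy example, not a complete Hindi mapping.

# Hedged illustration of dictionary + rules back-transliteration:
# look the Roman word up in a transliteration-pair dictionary first,
# and only fall back to coarse character-mapping rules when it is missing.
FALLBACK_RULES = [                    # longest patterns first (toy entries)
    ("kh", "ख"), ("aa", "ा"), ("ee", "ी"),
    ("k", "क"), ("n", "न"), ("r", "र"), ("a", ""),
]

def back_transliterate(word, pair_dict):
    """Return the Devanagari form from the dictionary, else apply fallback rules."""
    if word in pair_dict:
        return pair_dict[word]
    out, i = "", 0
    while i < len(word):
        for src, tgt in FALLBACK_RULES:
            if word.startswith(src, i):
                out += tgt
                i += len(src)
                break
        else:                         # unknown character: keep it unchanged
            out += word[i]
            i += 1
    return out

# e.g. back_transliterate("khaana", {"khaana": "खाना"}) -> "खाना" (dictionary hit)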

Team Name         bn  hi  gu  ml  kn  ta  Subtask-2
asterisk          -   1   -   -   -   -   -
BIT               -   -   -   -   -   -   2
BITS-Lipyantran   -   2   -   -   -   -   2
BMS-Brainz        1   -   1   1   1   1   -
DA-IR             -   1   1   -   -   -   -
DCU               -   -   -   -   -   -   2
I1                -   2   -   -   1   -   -
IIITH             1   1   1   1   1   1   1
IITP-TS           3   3   -   -   -   -   -
ISI               1   -   -   -   -   -   -
ISMD              -   3   -   -   -   -   -
JU-NLP-LAB        2   -   -   -   -   -   -
PESIT-CS-FIRE-IR  -   2   -   -   -   -   -
Salazar           -   1   -   -   -   -   -
Sparkplug         -   1   -   -   -   -   -
Total             8   17  3   2   3   2   7

Table 6: Participation details for all the subtasks; numbers in the table indicate the number of runs submitted (the columns bn to ta are the Subtask-1 language pairs).

Here is a brief overview of the approaches used by the teams for subtask 2.

BITS-LIPYANTARAN back-transliterated the queries and documents to Devanagari using the Google Transliteration engine. They removed vowels as part of the normalising step and indexed character n-grams as tokens, with n ∈ {3,4,5,6}. They also supplied some hand-tailored rules for consonant mappings.

DCU generated a dictionary of cross-script equivalents from the documents in the corpus which contained the song in both scripts. For the out-of-vocabulary (OOV) terms, some hand-tailored rules and a transliteration engine were used. They divided documents into fields like title and body and indexed them according to the script separately. For equivalents, they used edit-distance based term matching with some threshold. The tokens of the index were word uni- and bi-grams.

BIT prepared an initial word bi-gram query containing terms from both scripts using Google Transliterate. They retrieved a first document and enriched the query with the word n-grams from this first retrieved document. This expanded query is then used for retrieval.

IIIT-H considered Roman as the operating script, i.e., all the documents and queries were converted into Roman script using a transliteration strategy. They applied some normalisation rules, e.g., repetition of the same character was replaced by a single occurrence. The documents were divided based on fields like title, first line, first stanza, artist name, body etc. during indexing. Term variation in Roman script was handled using edit distance with some pruning.
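The normalisation, character n-gram indexing and edit-distance matching mentioned above can be sketched roughly as follows. The n-gram sizes follow the BITS-Lipyantran description, while the repeated-character collapsing mirrors the IIIT-H description only loosely; the distance threshold is an assumption, not a reported value.

# Illustrative sketch only (not any team's released code).
import re

def collapse_repeats(word):
    """'dilllll' -> 'dil': replace runs of the same character by one occurrence."""
    return re.sub(r"(.)\1+", r"\1", word.lower())

def char_ngrams(word, sizes=(3, 4, 5, 6)):
    """Character n-gram tokens used for indexing a (normalised) word."""
    grams = []
    for n in sizes:
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

def edit_distance(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def same_term(a, b, max_dist=1):
    """Treat two Roman spellings as variants of the same term."""
    return edit_distance(collapse_repeats(a), collapse_repeats(b)) <= max_dist

# e.g. same_term("hamare", "hamaare") -> True
# char_ngrams("hamare", (3,)) -> ['ham', 'ama', 'mar', 'are']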
5. RESULTS

The ideal way to measure the effectiveness of an algorithm's output on subtask 1 is not an obvious choice. We try to be as thorough as possible, rewarding or penalizing all the different aspects of the labeling task, and try to adapt traditional metrics wherever applicable. Subtask 2, on the other hand, can be easily evaluated using standard IR metrics. In this section, we first precisely define the metrics used for evaluating the runs submitted to subtasks one and two. We then tabulate the performance of all the participating teams.

5.1 Evaluation metrics

5.1.1 Subtask 1

We used the following metrics for evaluating Subtask 1. Our metrics reflect various degrees of strictness, from the strictest (Exact Query Match Fraction) to the most lenient (Labeling Accuracy).

Exact query match fraction:
EQMF = #(queries for which language labels and transliterations match exactly) / #(all queries)    (1)

Exact query match fraction, language identification only:
EQMF2 = #(queries for which language labels match exactly) / #(all queries)    (2)

Exact transliteration pair match:
ETPM = #(pairs for which transliterations match exactly) / #(pairs for which both output and reference labels are L)    (3)

The value of this ratio can be treated as a measure of transliteration precision, but the absolute values of the numerator and denominator are also important. For example, when there are 2000 true L words in the reference annotations, it is possible that a method can detect 5 of these and produce the correct transliterations for each, and thus have a ratio value of 1.0. Another method can detect 200 of these, produce correct transliterations for 150, and obtain a value of 0.75. We treat the second method as the better one. We note that, as Knight and Graehl [8] point out, back-transliteration is less forgiving than forward transliteration, for there may be many ways to transliterate a word in another script (forward transliteration) but there is only one way in which a transliterated word can be rendered back in its native form (back-transliteration). Our task thus requires the algorithm to perform only back-transliteration, and thus there is only one correct transliteration answer for a word in a given context. Along these lines, we also compute the transliteration precision, recall and F-score as below.

Transliteration precision:
TP = #(correct transliterations) / #(generated transliterations)    (4)

Transliteration recall:
TR = #(correct transliterations) / #(reference transliterations)    (5)

Transliteration F-score:
TF = (2 × TP × TR) / (TP + TR)    (6)

Labelling accuracy:
LA = #(correct label pairs) / #(all pairs)    (7)

English precision:
EP = #(E-E pairs) / #(E-X pairs)    (8)

English recall:
ER = #(E-E pairs) / #(X-E pairs)    (9)

English F-score:
EF = (2 × EP × ER) / (EP + ER)    (10)

Here, an A-B pair refers to a word that is labeled by the system as A, whereas the actual label (i.e., the ground truth) is B. X is a wildcard that stands for any category label. Thus, an E-E pair is a word that is English and is also labeled by the system as E, whereas the E-X pairs are all those words which are labeled as English by the system irrespective of the ground truth. Like the E-precision etc., we have L-precision, L-recall and L-F-score for a language L. We also measured the precision, recall and F-score of the other classes, i.e., NE, MIX and O, but they have not been reported here.

In our transliteration evaluation strategy we relaxed certain constraints for string matching. We handle certain cases of unicode normalization, and do not penalize mistakes made on the homorganic nasal – chandrabindu replaced by bindu – and the non-obligatory use of the nukta.

5.1.2 Subtask 2

For evaluating subtask 2, we used the well-established IR metrics of normalized Discounted Cumulative Gain (nDCG) [9], Mean Average Precision (MAP) [10] and Mean Reciprocal Rank (MRR) [11].

We used the following process for computing nDCG. The formula used for DCG@p was

DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i)    (11)

where p is the rank at which we are computing DCG and rel_i is the graded relevance of the document at rank i. For IDCG@p, we sort the relevance judgments for a particular query in the pool in descending order, take the top p from the pool, and compute DCG@p for that list (since that is the best possible, i.e. ideal, ranking for that query). Then, as usual, we have

nDCG@p = DCG@p / IDCG@p    (12)

nDCG was computed after looking at the first five and the first ten retrieved documents (nDCG@5 and nDCG@10).

For computing MAP, we first compute the average precision AveP for every query, where AveP is given by

AveP = Σ_{k=1..n} (P(k) × rel(k)) / #(relevant documents)    (13)

where k is the rank in the sequence of retrieved documents, n is the number of retrieved documents, P(k) is the precision at cut-off k in the list, and rel(k) is an indicator function equaling 1 if the item at rank k is a relevant document and zero otherwise. Then,

MAP = Σ_{q=1..Q} AveP(q) / Q    (14)

where Q is the number of queries. In our case, we consider relevance judgments 1 and 2 as non-relevant, and 3, 4 and 5 as relevant. MAP was computed after looking at the first ten retrieved documents.

The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer (rank_i). MRR is the average of the reciprocal ranks of the results over a sample of queries Q:

MRR = (1/|Q|) Σ_{i=1..|Q|} 1/rank_i    (15)

In our case, we consider relevance judgments 1 and 2 as incorrect answers, and 3, 4 and 5 as correct answers. MRR was computed after looking at the first ten retrieved documents. We observed that minor changes in these conventions do not alter the general trends of the results.

We also introduced two new metrics this year. The first is Recall@10 or R@10, defined as follows:

R@10(q_i) = #(relevant documents retrieved for q_i in the top 10 ranks) / max(#(relevant documents for q_i), 10)    (16)

R@10 = (1/|Q|) Σ_{i=1..|Q|} R@10(q_i)    (17)

The second metric, Cross-Script Recall or csR@10, measures the R@10 while considering only those documents which are in a different script than that of the query. Mixed-script documents are ignored while computing csR@10. Thus, if the query is in Roman script, we compute the recall on the subset of relevant documents for the query which are only in Devanagari, and vice versa. csR@10 indicates the cross-script retrieval power of a system, which is not explicitly captured by any of the other aforementioned metrics.
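For concreteness, a small sketch (not the official evaluation script) of how Eqs. (11)–(17) can be computed is given below. The relevance threshold of 3 follows the convention stated above, and the R@10 denominator follows Eq. (16) as printed; some definitions cap the denominator at 10 instead.

import math

REL_THRESHOLD = 3  # judgments 1-2 -> non-relevant, 3-5 -> relevant

def dcg(grades, p):
    """Eq. (11): DCG over the top-p graded relevances of a ranked list."""
    grades = grades[:p]
    return sum(g if i == 0 else g / math.log2(i + 1) for i, g in enumerate(grades))

def ndcg(ranked_grades, pool_grades, p):
    """Eq. (12): DCG of the run divided by DCG of the ideal (sorted pool) ranking."""
    ideal = dcg(sorted(pool_grades, reverse=True), p)
    return dcg(ranked_grades, p) / ideal if ideal > 0 else 0.0

def average_precision(ranked_grades, n_relevant):
    """Eq. (13): AveP over a ranked list of graded judgments."""
    hits, score = 0, 0.0
    for k, g in enumerate(ranked_grades, 1):
        if g >= REL_THRESHOLD:
            hits += 1
            score += hits / k
    return score / n_relevant if n_relevant else 0.0

def mean_average_precision(runs):
    """Eq. (14): mean of AveP over all queries; runs = [(ranked_grades, n_relevant), ...]."""
    return sum(average_precision(g, n) for g, n in runs) / len(runs)

def mrr(list_of_ranked_grades):
    """Eq. (15): mean reciprocal rank of the first relevant document."""
    total = 0.0
    for grades in list_of_ranked_grades:
        for k, g in enumerate(grades, 1):
            if g >= REL_THRESHOLD:
                total += 1.0 / k
                break
    return total / len(list_of_ranked_grades)

def recall_at_10(ranked_grades, n_relevant):
    """Eq. (16): relevant documents retrieved in the top 10 ranks."""
    retrieved = sum(1 for g in ranked_grades[:10] if g >= REL_THRESHOLD)
    return retrieved / max(n_relevant, 10)  # denominator as printed in Eq. (16)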
5.2 Subtask 1

Tables 7 and 8 list the results of all the submissions for subtask-1 that were well formatted. We had several other runs which could not be evaluated because of formatting issues.³ One team, though it had well formatted runs and corresponding evaluation scores, decided to withdraw their submission and hence is not included in this report.

³The participants were intimated about the formatting errors and allowed to resubmit several times till the automatic evaluation script could handle the submission. Nevertheless, some teams did not or could not submit the run in the required format, and these runs had to be discarded.

We observe that there is no single run which has the best score across all the metrics. Therefore, we decided to use the average of 5 different metrics – LF, EF, TF, LA and ETPM (normalized by the maximum ETPM over all submissions for that language pair) – to rank the runs. The best performing run for each language pair according to this score is marked with an asterisk in Table 7. Note that some teams did not generate the transliterations, and their ETPM score was assumed to be 0.

Interestingly, five different teams topped the different language pairs. Some teams, such as JU-NLP-LAB, DA-IR and IITP-TS, participated in one or two language pairs. They seem to have fine-tuned their systems for those languages and performed very well in the respective language tracks. These three teams respectively topped the Bangla-English, Gujarati-English and Hindi-English tracks. Only two teams, IIITH and BMS-Brainz, participated in all the language pairs. Some salient observations about the performance of the different systems are described below.

• Two teams (Asterisk and BITS-Lipyantaran) used the Google transliteration API for Hindi, and they have the highest TF scores. IITP-TS comes close with their indigenous transliteration system.

• The teams which used machine learning on token based and n-gram features have higher labeling accuracy than the teams which relied only on dictionaries and rules. Team Salazar is a notable exception, though: their LA is comparable to, if not better than, most of the other runs that use machine learning.

• It is difficult to estimate the inherent hardness associated with the languages, if any, for this task, because the datasets are not uniform and the amount of data released also differs. Moreover, most of the teams submitted in only one or two languages, so a fair comparison across languages cannot be made. Nevertheless, we see very high LA scores for the three Dravidian languages in spite of the fact that no training data was released for these languages. On the other hand, Hindi seemed to be the hardest language to tackle for this task. This might be because Dravidian languages have a rich morphology and a large number of inflections, which can help one to detect the words easily; Bangla and Gujarati have fewer inflections, and Hindi still fewer.

• Transliteration seems to be a hard problem in general. The best TF score, which is for Hindi, is only 0.304. There is large room for improvement in back-transliteration for Indic languages.

5.3 Subtask 2

The test collection for Subtask 2 contained 35 queries in Roman and Devanagari scripts. The Devanagari queries were intentionally kept simple and relatively unambiguous, whereas the Roman queries were of varying difficulty. They featured word-level joining and splitting as well as highly ambiguous short queries. Table 9 presents the results of the 7 runs received. We observe that the two runs from BITS-Lipyantran perform best across all the metrics. Using Devanagari as the working script, and mapping both the queries and documents to Devanagari, must have helped them, because in the native script there is usually one single correct spelling. Moreover, the use of the Google Transliteration API and word level n-grams for indexing and matching must have helped in improving the precision. For all the systems, csR@10 has the lowest absolute value, which implies that there is scope for improving the cross-script retrieval. Systems perform reasonably well when the scripts of the query and the document are the same. Probably errors introduced during transliteration or cross-script word mapping accumulate and eventually bring down the cross-script recall.

6. SUMMARY

The transliterated search shared task was introduced last year in FIRE 2013; we had received 17 runs from 5 teams in subtask 1 and 8 runs from 3 teams for subtask 2. This year, we had nearly a 3-fold increase in the teams and runs for subtask 1. In subtask 2, we had a similar number of runs and teams. This clearly shows that the track is gaining popularity and has been successful in building a research community.

Subtask 1 is a very fundamental task and has applications much wider than transliterated search. While language labeling seems like an easy and solved task, the performance of the teams in the shared task shows that for some languages like Hindi and Bangla, the best systems are only 90% accurate, which leaves a lot of room for research and improvement. Even for the other languages, where the accuracy of some systems has gone up to 98%, we believe this is due to the nature of the datasets rather than inherent simplicity of the problem. In the coming years, we would definitely like to create a bigger repository of annotated data and include more Indic languages and, if possible, a few other languages. Transliteration is far from a solved problem and we need more awareness and data around it.

In subtask 2, this year we introduced Devanagari queries as well as more challenging Roman transliterated queries with realistic errors commonly seen in Web search queries. Here also we see that cross-script retrieval performance is still lagging behind same-script retrieval performance. Furthermore, native script retrieval seems to be easier (e.g., the average NDCG@5 over all the runs for the Devanagari queries is 0.722, whereas for the transliterated Roman queries it is 0.511). Among the Roman transliterated queries, the average NDCG@5 for queries with splitting or joining of words is 0.286, whereas for other transliterated queries the average is 0.617. Thus, developing a practical search engine for lyrics, or transliterated search in general, is very challenging, and there is a lot of scope for research and innovation.

As a final remark, we would like to mention that it is not possible to freely distribute the data collected for subtask 1 because it has been taken from various social media websites and blogs with various privacy policies. Right now the data is only available to those who participate in this shared task, and can be used only for this shared task. We are trying our best to obtain the required permissions to make this data freely available to the community.

Team              Run  NDCG@1  NDCG@5  NDCG@10  MAP     MRR     R@10    csR@10
BIT               1    0.5024  0.3967  0.3612   0.2698  0.5243  0.4343  0.2193
BIT               2    0.6452  0.4918  0.4572   0.3415  0.6271  0.4822  0.1898
BITS-Lipyantran   1    0.7500  0.7817  0.6822   0.6263  0.7929  0.6818  0.4144
BITS-Lipyantran*  2    0.7708  0.7954  0.6977   0.6421  0.8171  0.6918  0.4430
DCU               1    0.5786  0.5924  0.5626   0.4112  0.6269  0.4943  0.3483
DCU               2    0.4143  0.3933  0.3710   0.2063  0.3979  0.2807  0.3035
IIITH             1    0.6429  0.5262  0.5105   0.4120  0.6730  0.5806  0.3407

Table 9: Results for subtask II. The highest scoring team has been marked with *.

Team              Run  LF     EF     LA     EQMF2
Bangla-English
BMS-Brainz        1    0.701  0.781  0.776  0.29
IIITH             1    0.833  0.861  0.85   0.383
IITP-TS           1    0.88   0.907  0.886  0.411
IITP-TS           2    0.881  0.907  0.886  0.41
IITP-TS           3    0.861  0.888  0.87   0.379
ISI               1    0.835  0.882  0.862  0.378
JU-NLP-LAB*       1    0.899  0.92   0.905  0.444
JU-NLP-LAB        2    0.899  0.92   0.905  0.444
Gujarati-English
BMS-Brainz        1    0.856  0.071  0.746  0.173
DA-IR*            1    0.981  0.2    0.963  0.847
IIITH             1    0.923  0.145  0.856  0.387
Hindi-English
asterisk          1    0.782  0.803  0.654  0.126
BITS-Lipyantaran  1    0.835  0.827  0.838  0.205
BITS-Lipyantaran  2    0.82   0.813  0.826  0.177
DA-IR             1    0.778  0.75   0.771  0.153
I1                1    0.806  0.797  0.807  0.195
I1                2    0.756  0.664  0.738  0.165
IIITH             1    0.787  0.794  0.792  0.143
IITP-TS*          1    0.908  0.899  0.879  0.269
IITP-TS           2    0.907  0.899  0.878  0.265
IITP-TS           3    0.885  0.873  0.857  0.209
ISMD              1    0.895  0.878  0.872  0.269
ISMD              2    0.911  0.901  0.886  0.276
ISMD              3    0.911  0.901  0.886  0.276
PESIT-CS-FIRE-IR  1    0.81   0.782  0.654  0.157
PESIT-CS-FIRE-IR  2    0.812  0.782  0.656  0.158
Salazar           1    0.883  0.857  0.855  0.231
Sparkplug         1    0.693  0.641  0.599  0.053
Kannada-English
BMS-Brainz*       1    0.894  0.681  0.836  0.218
I1                1    0.892  0.757  0.848  0.269
IIITH             1    0.932  0.854  0.9    0.429
Malayalam-English
BMS-Brainz        1    0.851  0.588  0.785  0.217
IIITH*            1    0.928  0.86   0.891  0.383
Tamil-English
BMS-Brainz        1    0.705  0.816  0.799  0.122
IIITH*            1    0.985  0.986  0.986  0.714
Table 7: Subtask 1, language identification: performance of submissions. * indicates the best performing team for each language pair.

Team              Run  TF     EQMF   ETPM
Bangla-English
IIITH             1    0.021  0.004  72/288
IITP-TS           1    0.073  0.005  228/337
IITP-TS           2    0.073  0.005  228/337
IITP-TS           3    0.071  0.005  231/344
ISI               1    0.053  0.004  174/309
JU-NLP-LAB        1    0.062  0.005  227/364
JU-NLP-LAB        2    0.037  0.004  134/364
Gujarati-English
DA-IR             1    0.463  0.02   492/1035
IIITH             1    0.261  0.007  259/911
Hindi-English
Asterisk          1    0.304  0.002  1605/1936
BITS-Lipyantaran  1    0.258  0.005  1923/2156
BITS-Lipyantaran  2    0.252  0.004  1876/2109
DA-IR             1    0.163  0.001  1330/2153
IIITH             1    0.122  0.001  907/2004
IITP-TS           1    0.244  0.005  1933/2306
IITP-TS           2    0.244  0.004  1931/2301
IITP-TS           3    0.24   0.004  1871/2226
ISMD              1    0.217  0.001  1616/2203
ISMD              2    0.118  0.001  924/2251
ISMD              3    0.204  0      1596/2251
PESIT-CS-FIRE-IR  1    0.112  0      895/2157
PESIT-CS-FIRE-IR  2    0.152  0.001  1238/2158
Salazar           1    0.15   0      1086/2235
Sparkplug         1    0.208  0.001  1214/1333
Kannada-English
BMS-Brainz        1    0.525  0      433/732
IIITH             1    0      0      0/751
Malayalam-English
IIITH             1    0.098  0.008  90/852

Table 8: Subtask 1, transliteration: performance of submissions.

Acknowledgments
We would like to thank Rishiraj Saha Roy, Adobe Research Lab Bangalore, for valuable suggestions and help in conducting the workshop on the shared task. We are also grateful to all the people who have voluntarily contributed to the datasets: Yesha Shah, Swati Jhawar, Ria Gupta, Prof Dinesh Babu J, Kumaresh Krishnan, P. S. Srinivasan, Rekha Vaidyanathan, Dr. Shambhavi B R, Dr. Sagar B M, Sandesh, Shwetha Kulkarni, and Abhishek J.

7. REFERENCES
[1] Roy, R.S., Choudhury, M., Majumder, P., Agarwal, K.: Overview and datasets of FIRE 2013 shared task on transliterated search. In: Working Notes of FIRE. (2013)
[2] King, B., Abney, S.: Labeling the languages of words in mixed-language documents using weakly supervised methods. In: Proceedings of NAACL-HLT. (2013) 1110–1119
[3] Gupta, K., Choudhury, M., Bali, K.: Mining Hindi-English transliteration pairs from online Hindi lyrics. In: LREC. (2012) 2459–2465
[4] Sowmya, V., Choudhury, M., Bali, K., Dasgupta, T., Basu, A.: Resource creation for training and testing of transliteration systems for Indian languages. In: LREC. (2010)
[5] Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation. (2006) 1799–1802
[6] Vyas, Y., Gella, S., Sharma, J., Bali, K., Choudhury, M.: POS tagging of English-Hindi code-mixed social media content. In: EMNLP '14. (2014) 974–979
[7] Barman, U., Das, A., Wagner, J., Foster, J.: Code mixing: A challenge for language identification in the language of social media. (2014) 13–23
[8] Knight, K., Graehl, J.: Machine transliteration. Computational Linguistics 24(4) (1998) 599–612
[9] Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20 (October 2002) 422–446
[10] Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc. (1986)
[11] Voorhees, E.M., Tice, D.M.: The TREC-8 Question Answering Track evaluation. In: TREC-8. (1999) 83–105

A List-searching based approach for Language Identification in Bilingual Text: Shared Task report by Asterisk

Abhinav Raj, M. S. Ramaiah Institute of Technology, Bangalore, India, e-mail: abhinav [email protected]
Sankha Karfa, M. S. Ramaiah Institute of Technology, Bangalore, India, e-mail: [email protected]

Abstract

In this paper, we describe a list-searching based system for word-level language identification of mixed text. Our method uses list searching and minimum edit distance and can therefore easily be implemented for most languages. Its performance is evaluated on the test sets provided by the shared task on language identification for the English-Hindi (En-Hi) pair. The experimental results show a consistent performance with high precision.

Keywords: Back-Transliteration, Language Labelling, Mixed-Text

1 Introduction

Most languages are written using indigenous scripts; for example, Hindi is written in Devanagari. However, the websites and the user generated content (such as tweets and blogs) in these languages are often written using Roman script after transliteration. Transliteration, the process of converting words into Roman script, is used abundantly on the Web, not only for documents but also for the user queries that are used to search for these documents. In this paper we propose a solution for language identification in bilingual text (Hindi and English), and for back-transliteration of Hindi words. For example, "Desh ki population" is a Hindi-English mixed message, where "Desh ki" is in Hindi and "population" is in English. We also target named entities in Indian languages by marking proper nouns. A challenge faced while processing transliterated queries is extensive spelling variation. For instance, the word Dhanyavad ("thank you" in Hindi and many other Indian languages) can be written in Roman script as dhanyavaad, dhanyvad, danyavad, danyavaad, dhanyavada, dhanyabad and so on. We tackle this situation, prevalent in Web search for users of many languages around the world, because this important problem has received very little attention to date. We check each individual word against lists in order to label the words.

2 Approach

Ours is a dictionary based approach. We started with Hindi-English mixed language. English and Hindi words were labelled separately, along with some names and known places. Different lists are made for an English corpus, a Hindi corpus, persons' names, locations and map items, and common organisations. Each word is checked against the lists and, if found, the corresponding label is attached to it. If an ambiguous word is encountered (one found in both the Hindi and the English list), we check the neighbouring words in the sentence to make a decision. If a new (unknown) word is found, it is marked as a Hindi word. Our model is able to identify Hindi and English words written in Roman script if they are written properly, without missing letters, and with different variations. Ambiguous words are labelled on the basis of neighbouring words: if both neighbouring words belong to language X, the ambiguous word is labelled X; otherwise we go to the second neighbour and continue. A sketch of this procedure is given below.
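The sketch below is an illustrative rendering of this procedure; the list contents, label strings and tie-breaking details are placeholders rather than the exact implementation.

def label_word(word, english, hindi, names, places, orgs):
    w = word.lower()
    if w in names:
        return "P"          # person name
    if w in places:
        return "L"          # location
    if w in orgs:
        return "O"          # organisation
    in_en, in_hi = w in english, w in hindi
    if in_en and in_hi:
        return "AMBIGUOUS"  # resolved later from neighbouring words
    if in_en:
        return "E"
    return "H"              # unknown words default to Hindi

def resolve_ambiguous(labels):
    """Assign ambiguous words the language of their neighbours, widening the window if needed."""
    resolved = list(labels)
    for i, lab in enumerate(labels):
        if lab != "AMBIGUOUS":
            continue
        for dist in (1, 2):  # first neighbours, then second neighbours
            context = {resolved[j] for j in (i - dist, i + dist)
                       if 0 <= j < len(resolved) and resolved[j] in ("E", "H")}
            if len(context) == 1:
                resolved[i] = context.pop()
                break
        else:
            resolved[i] = "H"
    return resolved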

3 Experiments

Implementation

We check each word in the named-entity list, followed by the English and Hindi lists. If the word is found, we map it to the respective language. If it is a named entity, we denote it by [WORD]\P for a person and [WORD]\L for a location. Hindi and English words are marked as WORD\H and WORD\E. Organisations are labelled as (\O).

The model is implemented in Python. Each line is read, tokens are separated on blank spaces (" "), and using regular expressions the words are separated from the sentence and newlines are also marked:

    str = re.split(' ', line)

A regular expression is built for the regex search to find the words in the list and to ignore in-text special characters while processing. Care has been taken to avoid sub-string matches being returned, by checking the position of the word (beginning and ending) or the neighbouring white spaces; for example, a search for Cat could otherwise also return Concatenation. The words are first searched in the English list, followed by the Hindi list and the named entity list. Words which are not found in any list are labelled as Hindi. The named entities are expressed in square brackets. The identified Hindi words written in Roman script are back-transliterated to their own script using the Google Transliteration API. The result is encoded in UTF-8 format and returned in the output file.

Training Experiments

A new word is trained for the first time and is then automatically added to the corresponding list. We have also constructed an ambiguous-word dataset.

Datasets

Datasets for English words are taken from various websites, including the Oxford Dictionary database, with each source having 3,000 to 100,000 words. The compiled English dataset contains 452 abbreviations and 504,543 unique words, including medical and scientific terms. Data sources include SIL International (www-01.sil.org/linguistics/wordlists/english/). Some words are taken from Leslie Foster, Dept. of Mathematics, San Jose State University (http://www.math.sjsu.edu/~foster/dictionary.txt), and from http://www.nicklea.com/articles/wordlist.txt. Data are also taken from http://zyzzyva.net/, http://wordlist.aspell.net, http://dreamsteep.com/projects/the-english-open-word-list.html, and http://www.mieliestronk.com/wordlist.html.

Hindi datasets are obtained from Dicts.info, the "FIRE 2013 Track on Transliterated Search", and training material provided by the FIRE organisers, giving us a unique list of 36,253 words. The names of persons were obtained through babyname.org, deron.meranda.us/data/, State CET results, and Microsoft Student Partner / Microsoft Student Associates selection lists (a total of 20,398 names from both languages). The names of places were obtained through an encyclopedia and DBpedia (1,407 locations). The list of organisations was obtained through the Wikipedia database, with 71 entries.

4 Results

Test data of 1,270 lines for the Hi-En pair was run through the model, with a total of 27,296 tokens (English 12,324, Hindi 13,676, NE 1,186), and it was evaluated on precision, recall and F-measure for Hindi and English and on label accuracy. Final scores were on the basis of exact transliterated pair match.

LP     LR     LF     EP     ER     EF     TP     TR     TF     LA
0.861  0.717  0.782  0.728  0.897  0.803  0.2    0.631  0.304  0.654

EQMF All (no transliteration)                 0.126
EQMF without NE (no transliteration)          0.223
EQMF without Mix (no transliteration)         0.126
EQMF without NE and Mix (no transliteration)  0.223
EQMF All                                      0.002
EQMF without NE                               0.003
EQMF without Mix                              0.002
EQMF without Mix and NE                       0.003
ETPM                                          1605/1936

This is the evaluation result for Run 1.

Metrics Definition
LP, LR, LF – Token level precision, recall and F-measure for the Indian language in the language pair
EP, ER, EF – Token level precision, recall and F-measure for English tokens
TP, TR, TF – Token level transliteration precision, recall and F-measure
LA – Token level language labelling accuracy = correct label pairs / (correct label pairs + incorrect label pairs)
EQMF – Exact query match fraction, as defined in [1]
EQMF (without transliteration) – EQMF as defined in [1], but only considering language identification
ETPM – Exact transliterated pair match, as defined in [1]

5 Error analysis

Errors occur because of new words and creative words which could not be found in the database, and because of ambiguous words. We are working on a minimum edit distance algorithm to compare tokens and predict the language; this will decrease the error rate. Further errors can be attributed to spelling mistakes and creative spellings (gr8 for great), word play (goooood for good), and abbreviations (OMG for Oh my God!). We are also working to make the labelling decision more accurate.

6 Conclusion and Future Work

In this paper, we described a method of labelling and mapping words from a mixed bilingual text, and of back-transliterating Hindi words into their native script, using list based searching. This model can further be used in shared tasks for building new designs and products, in the field of artificial reality as well. The machine can be trained to be more like a human and can give intelligent responses. The model addresses a very important part of language processing which is rarely tackled. We will be producing a smarter model soon, with improved efficiency and increased accuracy.

We are now targeting a predictive model which can work on informal words (nt = not), as they are not found in a dictionary. We are also working on words which are combinations of two or more individual words, for both languages (e.g., Parmeshware = param + eshwara, lejayenge = le + jayenge). We plan to do this in parallel to increase our processing speed. We will be moving to a trie model for faster search, an organised data structure, and training for new words: a new word will be asked for its correct marking once, and it will then be added to the list of that language. We are also looking at a context based model to tackle ambiguous words for higher accuracy in a smarter model.

7 Acknowledgement

We thank Mr. Krishna Raj PM, Assistant Professor, Information Science and Engineering, MSRIT, for guidance and support. We also thank Archit Vyas (BMSCE), Prakruthi Deepak (PESIT), Tanu Rampal (PESIT) and Harshitha Bidappa (BMSIT) for help in corpus creation.

8 References

• FIRE 2013 Shared Task detailed description: FAQ retrieval using noisy queries, http://www.isical.ac.in/~fire/faq-retrieval/2013/faq-retrieval.html (URL verified 19 Nov. 2013)
• Umair Z Ahmed, Kalika Bali, Monojit Choudhury, and Sowmya V. B., Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context. In Proceedings of the IJCNLP Workshop on Advances in Text Input Methods, Association for Computational Linguistics, November 2011
• P. J. Antony and K. P. Soman, Machine Transliteration for Indian Languages: A Literature Survey. International Journal of Scientific Engineering Research, Volume 2, Issue 12, December 2011
• B King, S Abney, Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods. In Proceedings of NAACL-HLT, 2013
• Spandana Gella, Jatin Sharma, Kalika Bali, Query Word Labeling and Back Transliteration for Indian Languages: Shared Task System Description
• Sujoy Das, Aarti Kumar, Performance Evaluation of Dictionary Based CLIR Strategies for Cross Language News Story Search

A Relevance feedback based approach for mixed script transliterated text search: Shared Task report by BIT Mesra, India

Amit Prakash, Department of Computer Science & Engineering, BIT Mesra, India, [email protected]
Sujan Kumar Saha, Department of Computer Science & Engineering, BIT Mesra, India, [email protected]

ABSTRACT

This paper describes the experiments carried out as part of our participation in the FIRE 2014 Transliterated Search shared task. We participated in subtask-2 and submitted two results generated by systems based on a relevance feedback approach. Given a collection of documents in mixed script, the task is to retrieve relevant documents using queries in either script. The spelling variation between different versions of transliterated text results in a query mismatch problem which makes this task quite challenging. The basic idea behind our approach is the observation that small words possess very little spelling variation when transliterated into a non-native script. We propose two n-gram approaches for query expansion, based on the number of characters in an n-gram and on frequency of occurrence. Using relevance feedback we obtained a 14% gain in NDCG@1 and an improvement of 10% in MRR. The results suggest that our approach is quite helpful in retrieving relevant documents from a mixed script document collection.

Keywords
MSIR, n-gram, Transliteration, Song lyrics, Information retrieval

1. INTRODUCTION

India is a multi-language, multi-script country with 22 official languages and 11 written script forms. About a billion people use these languages as their first language. Most of the Indian languages have evolved from the ancient Brahmi script and are nowadays referred to as "Indic scripts". Out of these, Devanagari is the most popular script; it is used to write Hindi, a major language with about 487 million speakers, ranked 4th among the most spoken languages in the world.

Unicode has enabled Hindi users to read and write Hindi in Devanagari script, but due to the lack of standard Devanagari keyboards and advanced Devanagari text processing systems, as well as familiarity with English and QWERTY keyboards, it has become popular to write Hindi text in Roman script [6]. At present, Hindi in Roman script is used on a large scale for catch lines of advertisements, Facebook posts, blog writings, Hindi slogans and short messages (SMS). There is a Roman-script Hindi newspaper published from the Fiji Islands. The concept of Roman script for Hindi and Urdu is not new; the British rulers successfully used it in the 19th century.

The process of phonetically writing the text of a language in a non-native script is referred to as transliteration. Due to the exponential growth of internet users in recent years, the content of the World Wide Web has also increased drastically. These contents include text written in native script as well as a large amount of transliterated text. This multilingual space of the web creates an additional challenge for language technology. Targeting this multilingual population with a multilingual search facility is a major challenge, because queries written in either the native or the transliterated form need to be matched to documents written in both scripts. Information retrieval in such a multi-script space is referred to as Multi-Script IR (MSIR). User generated texts do not always follow any standard mapping between native and non-native scripts. This results in extensive spelling variations in transliterated documents as well as queries, and makes the problem more serious. A Hindi word can be transliterated into Roman script in many ways; for example, the Hindi word हमारे can be written as hamaare, hamare, humare, humaare, hamarey, and so on. Pronunciation variation, often encountered in daily conversation, has a serious impact on transliteration. Other issues that result in various transliteration forms include loan words, language of origin, spelling variations and patterns of suffixes. So, in order to perform search in a multilingual space we need to map our queries so as to match all the versions of the transliterated text.

So far, there has been noticeable development in the area of machine transliteration involving Roman script and Devanagari, but searching documents from a multilingual space containing Roman and Devanagari texts has not received much attention. Since 2013, Microsoft Research India, in conjunction with FIRE (Forum for Information Retrieval Evaluation), has been organizing a shared task on transliterated search in a multilingual script space. This shared task provides a real-world scenario and a standardized evaluation framework for researchers to develop and evaluate their systems.

We participated in subtask-2 of the Shared Task on Transliterated Search, where the task is to retrieve a ranked list of documents from a corpus of Hindi film lyrics containing documents in Devanagari and Roman transliterated form, using a query written in Devanagari script or its Roman transliteration. The systems are evaluated using standard IR metrics, which include NDCG, MAP, MRR and recall.

We performed various experiments on the training data provided and developed two systems to handle this task. The main purpose of our experiments is to examine the use of a query expansion technique based on a relevance feedback approach for solving this type of task. The relevance feedback approach involves the user in the retrieval process so as to improve the final result set. First, we issue a simple set of queries that returns an initial set of results. We extract terms from the most relevant document, based on n-grams, for query expansion. We then submit the new set of queries to retrieve the final results. The results after applying query expansion on the test data show a noticeable gain in each performance measure with respect to the initial set of results. We achieved a gain of 14% in NDCG@1 with respect to our base result. The following sections give the details of the methods used in the development of our system.

2. RELATED WORK

Due to the rapid growth of the Internet in recent years, web pages are not limited to English only; other-language web content is increasing rapidly. Nowadays web pages can be found in every popular non-English language, including various European, Asian, and Middle East languages. Users increasingly desire to explore documents that were written either in their native language or in some other language. Therefore IR tasks are not restricted to documents in only one language but extend to multiple languages. The need for handling multiple languages introduced a new area of IR, called cross-language information retrieval (CLIR). The basic idea behind cross-lingual IR is to retrieve documents in a target language different from the query or source language. To handle this problem we need to map the language of query and documents. Translations are performed on the query, on the documents, or on both. But for out-of-vocabulary words, such as named entities and pronouns, we depend on transliteration. Transliterated terms show a lot of spelling variation. Dhore et al. found sixteen issues in the transliteration of Hindi to English named entities [5]. However, the spelling variation of transliterated terms is little explored in CLIR.

The research challenges of Multi-Script IR (MSIR) were first introduced by Parth et al. at the 2013 SIGIR conference [1]. According to their analysis of the Bing search log, 6% of queries were found to contain one or more transliterated terms, out of a total of 13 billion queries fired from India. Most of the transliterated queries came from the entertainment domain, and hence searching Hindi song lyrics emerged as a perfect research area for MSIR. Later, a shared task on transliterated search started at the fifth Forum for Information Retrieval Evaluation conference in 2013 (FIRE '13). The track was coordinated jointly by Microsoft Research India, IIT Kharagpur and DAIICT Gandhinagar.

The spelling variation in transliterated terms, along with mixed script text, is the major challenge of MSIR. Ahmed et al. show that the spelling variation in transliterated terms is due to the lack of a standard for mapping native language words into Roman script [2].

3. METHODOLOGY

The proposed method consists of three modules: document indexing, initial query formulation, and term extraction and query expansion.

3.1 Document Indexing
For the experiment we have used the Hindi song lyrics corpus of 62,888 mixed script documents provided by FIRE 2014. The documents are in plain text format and contain some noise due to crawling from the web. First of all we preprocess the documents to eliminate noise and format the text for the next level of processing. We index the preprocessed text using Lucene¹, a search library written in Java. Lucene has a highly expressive search API that takes a search query and returns a set of documents ranked by relevance, with the documents most similar to the query having the highest score.

¹http://www.lucene.apache.org

3.2 Initial Query Formation
For system development we have used a training set of 25 queries along with relevance judgments (qrels) provided by the FIRE 2014 organizers. Queries are written in Devanagari script or its Roman transliterated form, and are a (possibly partial or incorrect) Hindi song title or some part of the lyrics. First of all we detect the script of the query. We transliterate Roman script queries into Devanagari using the Google transliteration service. Sometimes Google returns more than one transliteration of a query; in such a case we keep the first transliteration returned. For transliterating Devanagari queries we rely upon transliteration pairs mined from a corpus of Bollywood song lyrics [3]. We transliterate a Devanagari query on a word by word basis. If a word has more than one corresponding Roman transliteration, we keep the first transliteration only. Then we formulate a phrasal query as word 2-grams over the native and transliterated query terms. The reason behind formulating a phrasal query of word 2-grams is to capture the word sequence in the song lyrics. The query formation for Roman and Devanagari queries is explained by example in Table 1 and Table 2 respectively.

Table 1. Initial Query Formation for Roman Query
Original Roman Query: doori sahi jaye naa
Transliterated Query: दूरी सही जाए ना
Formulated Query Keywords: "doori sahi", "sahi jaye", "jaye na", "दूरी सही", "सही जाए", "जाए ना"

Table 2. Initial Query Formation for Devanagari Query
Original Devanagari Query: तेरे मेरे सपने अब एक रंग
Selected transliteration of each query keyword (other possible transliterations): tere (terey, terre, teere, tera); mere (mera, merey, meere, meree); sapane (sapne, sapanen); ab (abb); ek (eka); rang (ranga, rango, rangi)
Formulated Query Keywords: "tere mere", "mere sapane", "sapane ab", "ab ek", "ek rang", "तेरे मेरे", "मेरे सपने", "सपने अब", "अब एक", "एक रंग"

Before submitting the queries to the Lucene search engine we create a Boolean query containing all 2-gram phrases joined by the OR operator. For further processing, we retrieve only the first result, which is the most relevant with respect to the submitted query.

3.3 Query Expansion
We extract n-gram phrases from the most relevant document retrieved by the initial query. These phrases are then attached to the initial query for retrieving the final set of documents. We use two approaches for phrase extraction, based on observations of spelling variations in Hindi song lyrics. Close analysis of lyrics in Roman script shows that small words possess very little spelling variation, but larger words show significant spelling variations. For example, consider two different lyrics written in Roman script for a particular song titled tadap tadap ke: the longer words like furqat and khuda have multiple variations like furaqat and kudaa. From this observation we derive our first query expansion strategy: we extract the 3-gram having the minimum number of characters as a phrase to add to the initial query. An example of such a 3-gram is "is dil se".

In the second approach we use term frequency, a well known strategy for term extraction. We extract the most frequent 2-gram from the first document retrieved by the initial query. In both term extraction strategies we use a manually created list of the most frequent bi-grams and tri-grams found in Hindi song lyrics, for dropping overly common phrases found during term extraction. For example, phrases like la la la, oo oo oo, na na etc. were dropped. A sketch of both extraction strategies is given below.
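This is only an outline under assumptions: the filler-phrase list shown contains a few example entries (the real list was manually created), and the real system attaches the extracted phrases to a Lucene Boolean query rather than to a plain string.

from collections import Counter

FILLER_PHRASES = {"la la la", "oo oo oo", "na na"}   # example entries only

def word_ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shortest_trigram(doc_text):
    """Strategy 1: the 3-gram with the fewest characters (short words vary least in spelling)."""
    candidates = [g for g in word_ngrams(doc_text.split(), 3) if g not in FILLER_PHRASES]
    return min(candidates, key=len) if candidates else None

def most_frequent_bigram(doc_text):
    """Strategy 2: the most frequent 2-gram in the top-ranked document."""
    counts = Counter(g for g in word_ngrams(doc_text.split(), 2) if g not in FILLER_PHRASES)
    return counts.most_common(1)[0][0] if counts else None

def expand_query(initial_query, top_doc_text):
    """Attach the extracted phrases (Run 2 uses both; Run 1 only the first) to the initial query."""
    extra = [p for p in (shortest_trigram(top_doc_text), most_frequent_bigram(top_doc_text)) if p]
    return initial_query + " " + " ".join('"%s"' % p for p in extra)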
4. EXPERIMENTS AND RESULT

To evaluate the effectiveness of the relevance feedback approach in MSIR, we consider the result obtained by the initial query as the baseline result. We developed two systems based on the relevance feedback approach to compare with the baseline system. The first system uses only the 3-gram based query expansion. The second system considers the most frequent 2-grams in addition to the 3-grams. The retrieval performance of these three systems is measured in terms of standard information retrieval metrics like NDCG, MAP, MRR and recall [4]. In Table 3 we summarize the results.

Table 3. Results
Measures  NDCG@1  NDCG@5  NDCG@10  MAP     MRR     Recall
Base Run  0.5024  0.3967  0.3612   0.2890  0.5243  0.4343
Run-1     0.6250  0.5013  0.4564   0.3472  0.5956  0.4909
Run-2     0.6452  0.4918  0.4572   0.3578  0.6271  0.4822

After successfully testing our systems on the training data, we evaluated them on the test data released by the organizers. The systems were evaluated on results obtained by submitting 35 queries from a pool of 215 queries.

The base run shows the result obtained by submitting the initial query. Run-1 is the result after applying query expansion using the first approach discussed earlier, and in Run-2 we used both expansion approaches at the same time. We obtained satisfactory improvement in all the evaluation metrics after applying our query expansion approaches. In Run-1, NDCG@1 increased by 12% with respect to the base run. This clearly shows that our approach is well suited for retrieving the first result more accurately. MAP and MRR also got a hike of 6% and 7% respectively.

Run-2 achieved the best performance in terms of NDCG, MAP and MRR. When we compare the results of Run-2 with Run-1 we notice a small improvement in performance, while with respect to the base run, NDCG@1 increased by 14% and MRR by 10% in Run-2.

5. CONCLUSION

In this paper we have examined the effectiveness of a relevance feedback approach in retrieving relevant documents from a mixed script document collection. Mapping spelling variations across scripts is the major challenge in retrieving documents from such a collection. We proposed two approaches for query expansion, and the results show that the proposed method effectively handles the spelling variation across the scripts.

There are certain possibilities for improving our system's performance. The effectiveness of the relevance feedback approach highly depends on the results retrieved by the initial query, so better performance can be achieved by forming a more effective initial query. Our proposed approach for n-gram extraction can also be applied to plagiarism detection among Hindi song lyrics.

6. REFERENCES
[1] P Gupta, K Bali, R E Banchs, M Choudhury, P Rosso. Query Expansion for Mixed-script Information Retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, 677-686.
[2] U Z Ahmed, K Bali, M Choudhury, V B Sowmya. Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context. In Proceedings of the IJCNLP Workshop on Advances in Text Input Methods, Association for Computational Linguistics, November 2011, 1-9.
[3] K Gupta, M Choudhury, K Bali. Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012, 2459-2465.
[4] R S Roy, M Choudhury, P Majumder, K Agarwal. Overview and Datasets of FIRE 2013 Track on Transliterated Search. In Proceedings of the Fifth Forum for Information Retrieval Evaluation, FIRE '13, India, Information Retrieval Society of India, 2013.
[5] M L Dhore, S K Dixit and Ruchi M Dhore. Issues in Hindi to English and Marathi to English Machine Transliteration of Named Entities. International Journal of Computer Applications, 2012, 37-44.
[6] A Tripathi. Problems and prospects of Indian language search and text processing. Annals of Library and Information Studies, 2012, 219-222.
[7] D Pal, P Majumder, M Mitra, S Mitra, A Sen. Issues in searching for Indian language web content. In Proceedings of iNEWS, 2008, 93-96.
[8] A Kumaran, M M Khapra, H Li. Report of NEWS 2010 transliteration mining shared task. In Proceedings of NEWS, 2010, 21-28.

Mixed-script query labelling using supervised learning and Ad hoc retrieval using sub word indexing: Shared task report by BITS Pilani, Hyderabad

Abhinav Mukherjee, Student, BITS Pilani Hyderabad, [email protected]
Kaustav Datta, Student, BITS Pilani Hyderabad, [email protected]
Anirudh Ravi, Student, BITS Pilani Hyderabad, [email protected]

ABSTRACT

Much of the user generated content on the internet is written in transliterated form instead of in the indigenous script. Due to this, search engines receive a large number of transliterated search queries. This paper presents our approach to handle labelling of queries and ad hoc retrieval of documents based on these queries, as part of the FIRE 2014 shared task on Transliterated Search. Query labelling of the mixed script content was implemented using a supervised learning approach. For the mixed-script information retrieval, back transliteration and subword indexing were carried out.

Keywords
Natural Language Processing, Language Identification, Machine Learning, Support Vector Machines, Naïve Bayes, Named Entity Recognition, Transliteration, Information Retrieval, Subword Indexing

1. INTRODUCTION

There are a large number of indigenous scripts in the world that are widely used. By indigenous scripts, we are referring to any language written in a script that is not Roman. Due to technological reasons such as a lack of standard keyboards for non-Roman scripts, the popularity of the QWERTY keyboard and familiarity with the English language, much of the user generated content on the internet is written in transliterated form. Transliteration is the process of phonetically representing the words of a language in a non-native script. For example, to represent a colloquialism such as ठीक है (okay) in Hindi, users will often write its transliterated form. Search engines get a large number of transliterated search queries daily; the challenge in processing these queries is the spelling variation of their transliterated form. For example, the Hindi word खाना can be written as 'khana', 'khaana', 'khaanna' and so on. This particular problem involves the following: (1) taking care of spelling variations due to transliteration, and (2) forward/backward transliteration.

2. TASK DESCRIPTION

The main aim of this task is to retrieve relevant documents in the indigenous script as well as in Roman transliterated script, given queries in Devanagari script or its Roman transliterated form. The task is divided into two subtasks.

Subtask 1: Query Word Labeling. In this subtask participants are given queries in Roman transliterated form and have to identify whether the words are English [E] words or the transliterated form of an Indian language [L] (in our case we only took part in Hindi language identification). Further, Named Entity [NE] recognition and identification of Other tokens (words labelled neither NE nor E/L) also had to be carried out. If a word was identified as an Indian language word, then backward transliteration to the indigenous script is to be carried out.

Subtask 2: Mixed-script Ad hoc retrieval for Hindi Song Lyrics. In this subtask, the participants are given a collection of documents that contain Bollywood song lyrics in Roman and Devanagari script, and queries also in either Roman or Devanagari script. The task is to retrieve a ranked list of songs from the corpus of Hindi film lyrics.

For more details on the task description, please read the entire description of the FIRE shared task on transliterated search.¹

¹FIRE 2014 Shared Task on Transliterated Search – Microsoft Research. (2014, January 1). Retrieved November 11, 2014.

3. SUBTASK 1

3.1 Approach
Our approach for the language identification task of subtask 1 uses machine learning. For the purpose of language identification, we used Support Vector Machines to predict the language of a particular word. This basically maps the data points onto a high dimensional feature space where the two types of data (in this case, Hindi and English) can be linearly separated, using a kernel function.

case, Hindi and English) can be linearly separable, using a kernel function. We used a linear kernel, and adjusted the parameters accordingly to achieve a good amount of accuracy. We considered context as well, so that words which can belong to both Hindi and English when looked at individually were classified correctly. For this we used a Naive Bayes classifier and considered the language of the surrounding words to be able to predict the language of the ambiguous word.

There are many Hindi words whose transliterated forms can be classified as English too. Examples of these words are: to (तो), bad (बाद), me (में), do (दो), b (भी), use (उसे).

For these words, we considered the language of the surrounding words to determine which language class they belong to. This of course works under the assumption that the surrounding words were classified correctly. For this purpose we found such words in the training set which were ambiguously classified and trained a Naive Bayes classifier on them considering the language class of the surrounding words, and then tested the model on those words if they were present in the test set.

For transliteration, we used Google's online transliteration API, which internally uses a dictionary based approach, in order to perform backward transliteration.

Named entity recognition was done with a lookup based method that would classify words as named entities in the test set if they were found in the training set. This was done because the training set for named entities was too small to create a machine-learned NER classifier.

For words which contained multiple words separated by a '\' with mixed language classes, we were required to label the word as MIX.

3.2 Experiments
We used the data given to us, which included labelled queries from Facebook conversations, annotated data from last year's shared task and Hindi-English transliteration pairs, along with an external dataset of 5000 frequent English words, to build our training data set. We submitted two runs: in Run 1 we used char 1,2,3,4-grams as features, and in Run 2 we used char 1,2,3-grams. Our training data consisted of 8923 words, comprising roughly an equal number of English and Hindi words.

We added the character '^' before and after every word to signify the beginning and end of the word. The training data set was built as a sparse ARFF (Attribute Relation File Format) file, so the size occupied by the data set in memory would be very small. The following is an example of an instance in this training set:

{52 1, 71 1, 78 1, 87 1, 101 1, 105 1, 115 1, 487 1, 1652 1, 2018 1, 2234 1, 2492 1, 2509 1, 2804 1, 2818 1, 2868 1, 3002 1, 4212 1, 5788 1, 5830 1, 7328 1, 10132 1, 13198 1, 13431 1, 13697 1, 17488 1, 17844 1, 18971 1, 19525 english}

The above line represents a single word in the training set, belonging to the English language. This example is from Run 1, where there were a total of 19525 character 1,2,3,4-grams of the words, and hence 19525 features in the training set. The line means that the first 52 features (character n-grams) are not present in this word and the 53rd one is; similarly, the 72nd, 79th, 88th, 102nd and so on from the list of 19525 char n-grams are present in the word. This way of representing the data is equivalent to putting a 0 for those features that are not present in the word and a 1 for those that are, except that this representation is more memory efficient.

Due to the small size, we could also load the data set easily into Weka, which is the toolkit we used for machine learning.

For language identification, we used the LibSVM implementation of the Support Vector Machines algorithm. On performing cross validation on our training set, we decided to use a linear kernel, as it gave the highest accuracy. We tried out different parameters and chose the configuration most optimal for our training data. The test set was constructed in a similar ARFF form and we tested our model on it.
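As an illustration only, the following minimal Python sketch shows how word-level char 1-4-gram features and a linear SVM can be combined; it uses scikit-learn rather than the Weka/LibSVM toolchain described above, and the padded toy words and labels are placeholders, not the actual FIRE training data.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # toy word/label pairs; '^' marks word boundaries as in the paper
    train_words = ["^khana^", "^paneer^", "^water^", "^recipe^"]
    train_labels = ["H", "H", "E", "E"]

    # binary character 1-4-gram features feeding a linear-kernel SVM
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 4), binary=True),
        LinearSVC(),
    )
    model.fit(train_words, train_labels)
    print(model.predict(["^khaana^"]))  # spelling variant of a Hindi training word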

4. SUBTASK 2

4.1 Approach
The second subtask was to retrieve Hindi Bollywood song lyrics documents when the input query was written in either Devanagari script or its Roman transliterated form. The Hindi song lyrics corpus was a mixture of documents written in either or both of the scripts (Devanagari and Roman). There were two major problems to be tackled to get better results:

1. Spelling variations in transliterated Hindi queries and documents, and in some cases Hindi words written in different formats.
2. Breaking and joining of words, which causes a major variation in spellings in transliterated search.

The approach used to tackle these problems is discussed in detail in the following subsections.

4.2 Backward transliteration
The first problem in mixed-script information retrieval is normalization of spelling variations. This problem can be approached in two ways. The first way would be to work in the mixed-script space and work with the projection of each word along each script (in this case two scripts) to find a match. The second approach is to bring all the words into a mono-script space. The first approach in this scenario would require expansion of the query to incorporate the projection of each word in both scripts along with the spelling variations that can occur. We decided to go with the second approach due to one advantage it provides: fewer variations in spellings, which leads to better normalization.

The dictionary based back transliteration creates less variation, as the word from the foreign script is mapped to the nearest word in the dictionary of the native script. We used Google's online transliteration, which uses dictionary based transliteration internally2. Instead of performing both forward and backward transliteration on the query to find relevant documents, we opt for backward transliteration on both queries and documents.

4.3 Sub-word indexing
The second problem in MSIR (Mixed-script Information Retrieval) is breaking and joining of transliterated words. For example:

aja ai bahar (instead of a ai bahar)
lejayenge lejayenge dilwale dulhaniya lejayenge (instead of le jayenge)

We use sub-word indexing to get around this issue. After transliterating back to Devanagari, we expand the document with sub-words. In Hindi, each word is a combination of vowels (स्वर) and consonants (व्यंजन). We derive the base of a word by removing its vowels, and after deriving the base of each word, we concatenate all the base words of the document.

So, the lines

lejayenge lejayenge dilwale dulhaniya lejayenge
le jayenge le jayenge dilwale dulhaniya le jayenge

both give the same base (after transliteration), which is:

ल-ज-ग-ल-ज-ग-द-ल-व-ल-द-ल-ह-न-य-ल-ज-ग

This creation of a base skeleton solves two problems: it normalizes broken and joined words in queries and documents, and it increases the resistance of the system to errors made in vowels. The choice to build the base skeleton from consonants was made because almost all breaking and joining of words in Hindi is done such that the consonants stay intact.

This consonant pattern of a document was indexed with character n-grams (n=3, 4, 5, 6).

4.4 Minor spelling variation adjustments Apart from the above methods to tackle spelling variations some minor changes were implemented to make the system more resistant to spelling variations specifically in Hindi.

 The letter “ह” is trimmed from the suffix of the words ending with that letter, as this could result in spelling variations. For example,

“मुह” and “मु”,

 “ना यह चाँद होगा” and “ना ये चाँद होगा”

 The words ending with “ये”, “यै”, “यो” and “यौ” are often written with “ए”, “ऐ”, “ओ”, “औ” instead. For example, “आइये” can be written as “आइए”, etc.

 On many occasions, when the vowels “इ” or “ई” or a combination of both occur on consecutive consonants, the later vowel is sometimes ignored. For example, “बिररयानी” and “बियाानी”.

 Apart from these variations, which are mostly related to vowels, to add resistance to variations in the consonants of a word, each word in the query is expanded into words with consonants altered. The following mapping table is used to replace consonants:

("क" -> "ख")  ("ख" -> "क")
("ग" -> "घ")  ("घ" -> "ग")
("च" -> "छ")  ("छ" -> "च")
("ज" -> "झ")  ("झ" -> "ज")
("त" -> "ट")  ("ट" -> "त")
("ठ" -> "थ")  ("थ" -> "ठ")
("द" -> "ध")  ("ध" -> "द")
("न" -> "ण")  ("ण" -> "न")
("ब" -> "भ")  ("भ" -> "ब")

Although the dictionary based backward transliteration and expansion took care of most of the spelling variations, these adjustments added to the accuracy of the system by making it more resistant.

2 http://en.wikipedia.org/wiki/Google_transliteration
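The sub-word indexing of Section 4.3 and the consonant expansion of Section 4.4 can be illustrated with the short Python sketch below; the Devanagari consonant range and the function names are our own assumptions for the example, not part of the submitted system.

    import re

    # Devanagari consonants roughly occupy U+0915-U+0939; vowels, matras
    # and the virama are dropped to obtain the consonant "base skeleton".
    CONSONANT = re.compile(u"[\u0915-\u0939]")

    def base_skeleton(devanagari_text):
        # concatenate the consonants of the whole (back-transliterated) line
        return "-".join(CONSONANT.findall(devanagari_text))

    # consonant pairs that are frequently confused (Section 4.4)
    SWAPS = {u"क": u"ख", u"ख": u"क", u"ग": u"घ", u"घ": u"ग",
             u"च": u"छ", u"छ": u"च", u"ज": u"झ", u"झ": u"ज",
             u"त": u"ट", u"ट": u"त", u"ठ": u"थ", u"थ": u"ठ",
             u"द": u"ध", u"ध": u"द", u"न": u"ण", u"ण": u"न",
             u"ब": u"भ", u"भ": u"ब"}

    def expand_consonants(word):
        # generate query variants with one confusable consonant swapped
        variants = {word}
        for i, ch in enumerate(word):
            if ch in SWAPS:
                variants.add(word[:i] + SWAPS[ch] + word[i + 1:])
        return variants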

5. RESULTS

5.1 Subtask 1
The results of our system in subtask 1, for each metric, are listed below, along with the maximum score and the median score for that metric.

Metric                                        Run 1 Score   Run 2 Score   Max     Med
EQMF All                                      0.005         0.004         0.005   0.001
EQMF without NE                               0.010         0.009         0.010   0.003
EQMF without Mix                              0.005         0.004         0.005   0.001
EQMF without Mix and NE                       0.010         0.009         0.010   0.003
EQMF All (No transliteration)                 0.205         0.177         0.276   0.194
EQMF without NE (No transliteration)          0.285         0.257         0.427   0.285
EQMF without MIX (No transliteration)         0.205         0.177         0.276   0.194
EQMF without Mix and NE (No transliteration)  0.285         0.257         0.427   0.285
ETPM                                          1923/2156     1876/2109     NA      NA
H-Precision                                   0.879         0.863         0.942   0.853
H-Recall                                      0.794         0.781         0.917   0.861
H-F Score                                     0.835         0.820         0.911   0.810
E-Precision                                   0.780         0.767         0.895   0.767
E-Recall                                      0.881         0.865         0.987   0.881
E-F Score                                     0.827         0.813         0.901   0.797
Transliteration Precision                     0.156         0.152         0.200   0.109
Transliteration Recall                        0.756         0.738         0.760   0.6335
Transliteration F Score                       0.258         0.252         0.304   0.1835
Labelling Accuracy                            0.838         0.826         0.886   0.792

Legend: EQMF = Exact Query Match Fraction; ETPM = Exact Transliterated Pair Match

5.2 Subtask 2
The results of our subtask 2 runs are shown below.

Run    NDCG@1   NDCG@5   MAP      MRR      RECALL
Run1   0.7500   0.7817   0.6263   0.7929   0.6818
Run2   0.7708   0.7954   0.6421   0.8171   0.6918

Legend: NDCG = Normalized Discounted Cumulative Gain; MAP = Mean Average Precision; MRR = Mean Reciprocal Rank

6. CONCLUSIONS

For subtask 1, it is clear that our system works better when char 1- to 4-grams are considered (Run 1) as compared to char 1- to 3-grams (Run 2). With this supervised learning approach, including the consideration of context, we were able to account for words which are spelt differently in slang (like 'tym' and 'wer'), and native script words that can be forward transliterated into multiple words varied in their spelling.

For subtask 2, both runs implemented the approach described in the sections above, except for the query expansion that tackles variation in consonants using a mapping table (Section 4.4), which was implemented only in Run 2. It is clear that it improved the results with respect to Run 1.

7. REFERENCES
[1] King, Ben, and Steven P. Abney. "Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods." HLT-NAACL, 2013.
[2] Gupta, Parth, et al. "Query Expansion for Multi-script Information Retrieval." SIGIR, 2014.
[3] Leveling, Johannes, and Gareth J. F. Jones. "Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR." ACM Transactions on Asian Language Information Processing (TALIP) 9.3 (2010): 12.

Word-Level Language Identification and Back Transliteration of Romanized Text: A Shared Task Report by BMSCE

Royal Denzil Sequiera, Dept. of Information Science, B. M. S. College of Engineering, Bangalore-560019 ([email protected])
Shashank S Rao, Dept. of Information Science, B. M. S. College of Engineering, Bangalore-560019 ([email protected])
Shambavi B R, Dept. of Information Science, B. M. S. College of Engineering, Bangalore-560019 ([email protected])

ABSTRACT
This paper presents the BMSCE team's participation in the 'FIRE Shared Task on Transliterated Search, subtask-1'. Our language identification system is based on the n-grams approach and uses a tri-gram language identifier, trained over a shared and collected training set, to classify the language of a document at the word level. We use a rule based approach blended with a simple dictionary search to back transliterate romanized Kannada words. We participated in the Bengali-English, Gujarati-English, Kannada-English, Malayalam-English and Tamil-English language tracks and obtained 70-80% accuracy for the language pairs.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Language parsing and understanding

Keywords
Word level language classification, transliteration, n-grams

1. INTRODUCTION
A large part of the data needed for NLP comes from web documents. But many web documents are romanized due to various socio-technical reasons. Indic languages, for instance, are often written in the Roman script. Thus, identifying the language of the underlying text turns out to be a necessary task before beginning any language processing. However, the existence of Code Switches [1, 2], Code Mixes [3] and variations in the spelling of words in user content poses serious problems.

This paper discusses our approach to the 'FIRE Shared Task on Transliterated Search, subtask-1'. The task is to label the words of a query with their underlying language in a multilingual document. The document is said to have only two languages, an Indic language and English, with the text being romanized. Once the language of a word has been identified, if the word is in the Indic language, it should be transliterated to its native script.

We submitted language classification runs for the Bengali-English, Gujarati-English, Kannada-English, Malayalam-English and Tamil-English language pairs, and reverse transliteration for Kannada. We use n-grams to identify the language label, and a rule based approach is applied to obtain the back transliterations of romanized Kannada text. In the n-grams approach, we use the tri-gram approach proposed by Cavnar [4] to identify the language of a word. We make use of a dictionary based approach to back transliterate a word into its native script. Approximate string matching algorithms have also been employed to select the best matching transliteration.

2. LANGUAGE IDENTIFICATION

2.1 Data Set
The training data, with the words annotated with their language, was given as part of the subtask. However, since languages like Kannada and Tamil did not have a sufficient amount of training data, we collected, for each language, monolingual corpora for English and the Indic languages from several Wikipedia pages, Facebook public pages and different websites and blogs. Only the specific language's Unicode was extracted from these websites by indicating the range of the valid Unicode block for that language. For example, while collecting the monolingual corpus for Gujarati, the Unicode block was specified to be \u0A81-\u0AE3 and \u0AF0-\u0AF1. By this method, the embedding of words of other languages, punctuation and symbols was avoided in the training data. Table 1 shows the amount of training data used.

Table 1. Training data statistics
Language Pair       English Word Count   Indic Language Word Count
Bengali-English     9645                 22661
Gujarati-English    9645                 9352
Kannada-English     25452                31899
Malayalam-English   9645                 22607
Tamil-English       9645                 10967
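The Unicode-block filtering described above can be sketched as follows; the Gujarati ranges are the ones quoted in the text, while the Kannada range is our own assumption taken from the Unicode charts.

    import re

    # keep only tokens written entirely in the target script
    BLOCKS = {
        "gujarati": re.compile(u"^[\u0A81-\u0AE3\u0AF0-\u0AF1]+$"),
        "kannada": re.compile(u"^[\u0C80-\u0CFF]+$"),
    }

    def clean_corpus(tokens, language):
        # drops embedded foreign-script words, punctuation and symbols
        pattern = BLOCKS[language]
        return [t for t in tokens if pattern.match(t)]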
2.2 Normalization
As Indic languages can be romanized in many ways, the user content is often subject to variations in spelling. Therefore, the collected documents in the native language were romanized to a standard form using a Python module called unidecode.

Since the module maps each native language Unicode character to an alphabet in English, the vowel following immediately after the consonant is deleted. For instance, க in Tamil becomes k instead of ka. Such problems were handled with a specific Python script.

2.3 Implementation
The standardized data is then used as the training data, and a model is created using a tri-gram language identifier. All unigrams are labelled as English words. A set of the most frequent bigrams for every language was maintained, and words were labelled accordingly.
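A minimal sketch of a Cavnar-and-Trenkle-style tri-gram identifier is given below for illustration; the padding characters, profile size and scoring are assumptions of ours, not the exact implementation used for the submitted runs.

    from collections import Counter

    def trigrams(word):
        w = "^" + word.lower() + "$"          # pad to capture word boundaries
        return [w[i:i + 3] for i in range(len(w) - 2)]

    def profile(words, top=300):
        # ranked list of the most frequent tri-grams of one language
        counts = Counter(g for w in words for g in trigrams(w))
        return [g for g, _ in counts.most_common(top)]

    def out_of_place(word, prof):
        # rank-distance of the word's tri-grams to a language profile
        return sum(prof.index(g) if g in prof else len(prof) for g in trigrams(word))

    def identify(word, profiles):
        # profiles: {language_name: ranked trigram list}
        return min(profiles, key=lambda lang: out_of_place(word, profiles[lang]))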

3. Back Transliteration

3.1 Data Set
We back transliterated the romanized native language words only for Kannada-English word pairs. A list of all the unique words in the corresponding language file was maintained, and this dictionary was referred to during the back transliteration procedure. The native language of the words was preserved. A dictionary of 3709 words was employed.

3.2 Implementation
We use handcrafted rules to map the Roman alphabet to Kannada Unicode characters, which gives a coarsely transliterated word. The raw transliteration is then fine-tuned using the dictionary approach. Table 2 shows the algorithm for back transliterating a romanized Kannada word; rawTransliterate() is a method that implements the crude transliteration of the word based on the pre-determined mapping.

As variations in spelling while transliterating a Kannada text into English are common, we try to recreate a possible list of words that the user could have intended. For example, the user may have intended the word kADu but transliterated it as kadu. To tackle this problem, we capitalize the 't's and 'd's, as spelling variations are frequent for these consonants. The resulting words are stored in a list. The vowels within the words of the list are capitalized, one at a time, and the resulting words are added to the list. The list obtained by the end of all the iterations represents a possible set of words the user may have meant.

An approximate string matching algorithm is then utilized to obtain a set of close matches in the dictionary with the specified or higher probability. We use the get_close_matches() method from the difflib module in Python. If no close match is found, the word itself is assumed to be the close match. The best match is obtained by calculating the fuzzy ratio of the close matches with the coarsely transliterated word, and the word with the highest fuzzy ratio is chosen as the best match. The longest common sequence between the best match and the raw transliterated word is found, as is the unmatched part.

Table 2. Algorithm for back transliterating a romanized Kannada word
Input: Word, KannadaDictionary
Output: Back transliteration of the word into its native script

begin
  rawTransliteration <- rawTransliterate(Word)
  Append rawTransliteration to TransliterationsList
  Replace all 't's with 'T's and all 'd's with 'D's in the Word and append the results to TransliterationsList
  for every X in TransliterationsList
    Derive all combinations of X by capitalizing the vowels, one at a time
    Append the resulting words to TransliterationsList
  endfor
  for every X in TransliterationsList
    if X is present in KannadaDictionary
      Output X
    else
      closeMatches <- close matches in KannadaDictionary with a probability of 0.8 and above
      if closeMatches is empty
        append X to closeMatches
      end
    end
  endfor
  bestMatch <- among all the words in closeMatches, the word with the highest fuzzy ratio
  match <- the longest common sequence of bestMatch and rawTransliteration
  residue <- rawTransliteration with the match part deleted
  if rawTransliteration begins with match
    TransliteratedWord <- match + residue
  else
    TransliteratedWord <- residue + match
  end
  Output TransliteratedWord
end
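The dictionary-matching stage of this algorithm can be approximated with Python's difflib module, which provides both get_close_matches() and a similarity ratio; the sketch below is a simplified reconstruction, and the variable names are illustrative rather than the authors' code.

    import difflib

    def best_transliteration(raw, candidates, dictionary):
        # raw: crude rule-based transliteration; candidates: spelling variants
        close = []
        for cand in candidates:
            if cand in dictionary:
                return cand
            close += difflib.get_close_matches(cand, dictionary, n=3, cutoff=0.8) or [cand]
        # pick the close match with the highest fuzzy ratio against the raw form
        best = max(close, key=lambda w: difflib.SequenceMatcher(None, w, raw).ratio())
        # longest common block between the best match and the raw transliteration
        m = difflib.SequenceMatcher(None, best, raw)
        blk = m.find_longest_match(0, len(best), 0, len(raw))
        match = best[blk.a:blk.a + blk.size]
        residue = raw.replace(match, "", 1)
        return match + residue if raw.startswith(match) else residue + match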
The matched and unmatched parts together are output as the back transliteration of the word.

4. EXPERIMENT RESULTS
A single run for six different language pairs was submitted, and the results of language classification for all the language pairs are recapitulated in Table 3. The back transliteration was implemented only for the Kannada-English language pair; Table 4 shows the transliteration accuracy for romanized Kannada words in the Kannada-English language pair.

5. ERROR ANALYSIS
The tri-gram approach helps identify the language of a word to some extent, but the algorithm fails when the word belongs to both languages. For example, the most frequently used bigrams in English such as me, in, is, us, etc. are also commonly used bigrams in the Indic languages. The n-grams approach does not consider context cues such as the previous and the next word, and thus fails to recognize the code mixes in the document. Also, as the tri-gram approach converts the words into lowercase, no useful information can be drawn about the named entities. Due to these reasons, the accuracy for some of the language pairs has decreased. A sequence classifier that considers the context cues would improve the accuracy of the system.

Table 3. Language label analysis for all five language pairs

Language Pair       LP      LR      LF      EP      ER      EF      LA      EQMF All
Bengali-English     0.721   0.652   0.685   0.697   0.867   0.773   0.754   0.249
Gujarati-English    0.985   0.756   0.856   0.037   0.833   0.071   0.746   0.173
Kannada-English     0.881   0.906   0.894   0.691   0.671   0.681   0.806   0.218
Malayalam-English   0.865   0.836   0.851   0.48    0.757   0.588   0.696   0.217
Tamil-English       0.912   0.574   0.705   0.719   0.943   0.816   0.799   0.122

LP: token level Precision, LR: Recall and LF: F-Measure for the Indic language in the language pair
EP: token level Precision, ER: Recall and EF: F-Measure for English in the language pair
EQMF: Exact Query Match Fraction

Table 4. Transliteration accuracy for the Kannada-English language pair

Language Pair     TP      TR      TF      ETPM
Kannada-English   0.512   0.531   0.521   433/732

TP: token level Precision, TR: Recall, TF: F-Measure for transliteration
ETPM: Exact Transliteration Pair Match, as defined in [5]

The back transliteration works well if the word or the closest words exist in the dictionary. But languages like Kannada are agglutinative in nature and can have several inflections [6], which are difficult to handle. Due to the lack of resources, the transliteration accuracy is not remarkable. As not all combinations of consonant replacements have been employed, transliteration errors occur frequently. For example, since the replacement of 'n' with 'N' has not been employed, the user-intended word kaNNu cannot be obtained.

6. CONCLUSION AND FUTURE WORK
In this paper, we discussed the n-gram approach to identify the language of a word and a rule based approach to back transliterate a romanized Kannada word to its native script.

Although a tri-gram based language identifier gives reasonable accuracy, it fails to use context cues to identify the language of a word. We plan to implement a sequence based classifier that would classify a word based on the previous and the next word. Instead of using only tri-grams, we intend to use {1,2,3,4,5,6}-grams trained with different machine learning algorithms such as MaxEnt, Naïve Bayes, SVM, etc. We have also set our sights on developing a model that uses the linguistic theory of Code Switches [7].

Our back transliteration system is prone to errors due to insufficient data. Instead of employing handcrafted rules to transliterate a word, we plan to implement a learned model to determine the mapping of the Roman alphabet to the Unicode of the native language.

7. ACKNOWLEDGMENTS
We thank the Technical Education Quality Improvement Programme (TEQIP) phase II, BMSCE and the State Project Facilitation Unit (SPFU), Karnataka, for supporting our research work.

8. REFERENCES
[1] Donald Winford. Code Switching: Linguistic Aspects, chapter 5, 126-167. Blackwell Publishing, Malden, MA, 2003.
[2] John J. Gumperz. Hindi-Punjabi code-switching in Delhi. In Proceedings of the Ninth International Congress of Linguistics. Mouton: The Hague, 1964.
[3] Carol Myers-Scotton. Dueling Languages: Grammatical Structure in Code-Switching. Clarendon, Oxford, 1993.
[4] William B. Cavnar and John M. Trenkle. N-gram based text categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval (SDAIR-94), 161-175, 1994.
[5] R. S. Roy, M. Choudhury, P. Majumder, K. Agarwal. Overview and Datasets of FIRE 2013 Track on Transliterated Search. FIRE @ ISM, 2013.
[6] Royal Denzil Sequiera and Shambhavi B R. Morphological Analysis and Generation of Kannada Text. In Proceedings of the International Conference on Innovations in Engineering and Technology, 259-265, 2014.
[7] Gokul Chittaranjan, Yogarshi Vyas, Kalika Bali, and Monojit Choudhury. A framework to label code-mixed sentences in social media. In Proceedings of the First Workshop on Computational Approaches to Code-Switching, Doha, Qatar, October 2014. ACL.

LIGA and Syllabification Approach for Language Identification and Back Transliteration : Shared Task Report by DAIICT

Shraddha Patel and Vaibhavi Desai Dhirubhai Ambani Institute of Information and Communication Technology Gandhinagar, Gujarat, India [email protected], [email protected]

ABSTRACT
This paper aims to address the solution for Subtask 1 of the Shared Task on Transliterated Search, a task in FIRE '14. The task addresses the problem of data containing English words and transliterated words of Indian languages in English. The task calls for language identification and subsequent back transliteration into the native Indian scripts. The system proposed herewith implements the Language Identification Graph Approach to label the words with their language markers, and rule based transliteration using syllabification to obtain the back-transliterated word in its native script. The results obtained for Gujarati are: labelling accuracy 0.963 and f-measure for back transliteration 0.463. The results obtained for Hindi are: labelling accuracy 0.771 and f-measure for back transliteration 0.163.

Keywords
Transliteration, LIGA, Syllabification, n-grams, Rule based

1. INTRODUCTION
Suppose that q: w1 w2 w3 ... wn is a query written in Roman script. The words w1, w2, etc. could be standard English words or transliterated from another language: Hindi (H) or Gujarati (G). The task is to label the words as E or H/G depending on whether each is an English word or a transliterated language word, and then, for each transliterated word, to provide the correct transliteration in the native script.

This paper is organized as follows: research and experiment in Section 2, system description in Section 3, experiments in Section 4, results in Section 5, error analysis in Section 6, conclusions and future scope in Section 7, and references in Section 8.

Table 1: Examples
INPUT-QUERY           OUTPUT
paneer recipe         paneer\H=pnFr recipe\E
iguazu water fall     iguazu\E water\E fall\E

2. RESEARCH AND EXPERIMENT
The system uses the LIGA approach for language identification and syllabification for back transliteration.

The original Language Identification Graph Approach (LIGA) proposed the construction of character tri-grams3 on the text. The approach has been adapted to suit our problem statement. Initially, the system constructed tri-grams for every word of the data separately. After experimenting with three models (bi-gram only, tri-gram only, and bi-gram and tri-gram combined), the results obtained for the bi-gram and tri-gram combined model were the best. Hence, a modified LIGA approach has been adopted wherein the combined model is incorporated.

The second improvement to enhance the labeling performance was the removal of proper nouns from the training dataset. These terms carry a great amount of ambiguity related to their classification into a particular language and hence contributed to erroneous results.

The third modification is introduced in the proposed labeling algorithm to deal with words containing numbers. An appropriate method has been designed to classify such alpha-numerical strings.

3. SYSTEM DESCRIPTION

3.1 System overview
The system first constructs bi-grams and tri-grams on the words of the training data provided, as explained in Section 3.2. These grams are then used to construct graphs for both languages according to the LIGA approach explained in Section 3.3. The test data is then fed to the system, which labels it and then back transliterates it by dynamic conversion using syllabification, as explained in Section 3.4.

3 Any reference to n-grams henceforth refers to character n-grams.

3.2 Character n-grams
A character n-gram of a word is a contiguous n-lettered subsequence of the word. All the n-grams of a word are extracted by iterating over all the indices within the word. This model uses only bi-grams and tri-grams. As an example, consider the bi-grams and tri-grams for the word "SAMPLE":

bi-grams: SA, AM, MP, PL, LE
tri-grams: SAM, AMP, MPL, PLE
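A one-line helper makes the extraction concrete; this is a generic illustration, not code from the system.

    def char_ngrams(word, n):
        # all contiguous n-lettered subsequences of the word
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    print(char_ngrams("SAMPLE", 2))  # ['SA', 'AM', 'MP', 'PL', 'LE']
    print(char_ngrams("SAMPLE", 3))  # ['SAM', 'AMP', 'MPL', 'PLE']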

3.3 Language Identification Graph Approach (LIGA)
LIGA is a graph based approach that uses n-grams to capture the word structure. This approach takes into account not only the occurrence of an n-gram but also the ordering of, and transitions between, n-grams. A graph is constructed from all the n-grams of all the words of every language separately. Since the model uses bi-grams and tri-grams, two graphs (one for bi-grams and the other for tri-grams) are constructed for each language.

Figure 1: A LIGA graph for the words "apple", "apply", "applied", where the numbers over the vertices and edges represent their weights respectively and the text within a vertex represents the label of the vertex.

3.3.1 Graph Representation
We define a graph G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or lines connecting these nodes. Here, we consider the construction of the graph on tri-grams; a similar approach can be followed to obtain the graph on bi-grams. Every unique tri-gram is designated a unique label, l, and forms a node, v ∈ V, in the graph. Hence, we can define a labeling function F : V → L, where L is a universal set of labels.

Construct an ordered list, O, for each word of a language such that every n-gram in the list is placed positionally between its previous n-gram and its next n-gram. Iterating over this list sequentially generates unique edges. Each edge, e ∈ E, signifies a transition from one vertex, v1 ∈ O, to the next vertex, v2 ∈ O.

Each edge e ∈ E and vertex v ∈ V carries a weight, We and Wv respectively. Every time a new vertex v is added to the graph, its weight Wv is set to one; if this vertex is re-encountered in any of the ordered lists, its weight is incremented by one. Similarly, when a transition from one vertex to another is encountered for the first time, the corresponding edge's weight We is set to one; if the edge repeats, the weight is incremented by one:

    Wv = Wv + 1, if v is repeated;   Wv = 1, if v is new
    We = We + 1, if e is repeated;   We = 1, if e is new

An example is shown in Figure 1.

3.3.2 Classification of Word
The classification of an input test word, w, is done using the training graphs of both languages to compute a path matching score for each. A LIGA graph is developed for w, which will be a 'simple' path. This path is superimposed on the training graphs, GE and GL, and the normalised weights of the edges and vertices encountered while following the path are added to obtain the corresponding path matching scores, PME and PML. If an edge or vertex does not exist in the training graph, its weight is considered zero. The word is classified as belonging to the language with the highest path matching score:

    PM_L = Σ_{v ∈ V_w} (w_v / |V|) + Σ_{e ∈ E_w} (w_e / |E|)

where V_w and E_w are the vertex and edge sets of the test word w respectively, and V and E are the vertex and edge sets of the language L for which PM is being calculated. Also,

    w_v = W_v, if v ∈ V of L;   w_v = 0, if v ∉ V of L
    w_e = W_e, if e ∈ E of L;   w_e = 0, if e ∉ E of L

As an example, if the test word is "applies", a LIGA graph can be constructed which will produce the following simple path: app → ppl → pli → lie → ies. The calculation of PM for a language whose training LIGA graph is shown in Figure 1 can be done as follows. The total number of vertices is 11 and the total number of edges is 8, so

    PM = 3/11 + 3/11 + 1/11 + 1/11 + 0 + 3/8 + 1/8 + 1/8 + 0 = 1.352
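The path-matching computation can be sketched in a few lines of Python; the sketch uses tri-grams only and dictionary-backed weights, and is a reconstruction under those assumptions rather than the system itself.

    from collections import defaultdict

    def trigrams(word):
        return [word[i:i + 3] for i in range(len(word) - 2)]

    def build_graph(words):
        # vertex and edge weights W_v, W_e accumulated over the training words
        V, E = defaultdict(int), defaultdict(int)
        for w in words:
            grams = trigrams(w)
            for g in grams:
                V[g] += 1
            for a, b in zip(grams, grams[1:]):
                E[(a, b)] += 1
        return V, E

    def path_match(word, V, E):
        # PM = sum of normalised vertex weights + sum of normalised edge weights
        grams = trigrams(word)
        score = sum(V.get(g, 0) for g in grams) / float(len(V))
        score += sum(E.get((a, b), 0) for a, b in zip(grams, grams[1:])) / float(len(E))
        return score

    # classification: the language whose graph yields the highest score
    graphs = {"H": build_graph(["khana", "chand", "pyaar"]),        # toy words
              "E": build_graph(["apple", "apply", "applied"])}
    word = "applies"
    scores = {lang: path_match(word, *graphs[lang]) for lang in graphs}
    print(max(scores, key=scores.get))   # expected: 'E'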

3.4 Rule based transliteration using Syllabification
Back-transliteration has been achieved for the Hindi and Gujarati languages. This process makes use of Hindi and Gujarati dictionaries in their corresponding scripts, referred to as D henceforth.

Every transliterated language word supplied by the user is first syllabified. The syllabified word is then back-transliterated into a naive Hindi word based on the rules and mapping defined for the transliterator. For Gujarati, the naive Hindi word is converted to Gujarati through a one-to-one mapping of each letter. This naive word is then dynamically compared to the dictionary words to obtain the nearest neighbours, which form the output of the system.

3.4.1 Syllabification
The transliterated word, w, is syllabified using the following rules. The word is broken down into syllables, considering vowels followed by consonants as delimiters. Each time a vowel v ∈ V, where V is the vowel set {a, e, i, o, u}, occurs before a consonant c ∈ C, where C is the consonant set of English, a new syllable is formed. An exception is made for the last syllable, which can end in one or more consonants. Examples are given in Table 2; a short illustrative sketch appears further below.

Table 2: Syllabification examples
Word          Syllables
karma         ka + rma
palak         pa + la + k
khubsoorat    khu + bsoo + ra + t
sanskaar      sa + nskaa + r

3.5 Rule based transliteration
After the syllables have been obtained, a mapping is used to back-transliterate them into naive Hindi strings.

…ishes, indicating that the referenced word is not a possible match. However, if the letter belongs to the ambiguous letters, all the letters in the ambiguous list for this letter are checked in the referenced word. If a match is found, the check moves to the next word; else the check finishes. If all the letters are checked and matched according to the above procedure, the reference word is added to the list of possible output strings L(o).

From the list of output strings L(o), the closest match is found. If the naive word belongs to the dictionary directly, it is considered the nearest match. Else, for every letter of the word under consideration, if the letter matches exactly, some amount of weight is assigned. All such weights are added to obtain a final count, and the word with the maximum count is considered the answer. This procedure is applicable to both Gujarati and Hindi.

4. EXPERIMENTS

4.1 Training Datasets
The training datasets have been compiled from various sources, and some were constructed from scratch.

4.1.1 LIGA
For Hindi and Gujarati, corresponding datasets were constructed containing Hindi/Gujarati words written in English. This was done by taking a dictionary of Hindi/Gujarati words and transliterating it into English through a suitable script. This training data was used to construct the graph of tri-grams used in the LIGA approach. For English, we used the dictionary list as the training dataset for LIGA. Also, for all three languages individually, we constructed a default wordlist containing some of the most frequent words used in the language, restricted to words with length less than or equal to 4. The parts of speech covered in these datasets were pronouns and some adverbs, for instance words like tera in Hindi. Such words, if encountered, are directly labelled according to their corresponding language.
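As a short illustration of the syllabification rule of Section 3.4.1, the following sketch splits a word whenever a consonant follows a vowel; it reproduces the Table 2 examples but is our own reconstruction, not the authors' code.

    VOWELS = set("aeiou")

    def syllabify(word):
        syllables, current = [], ""
        for ch in word.lower():
            # start a new syllable when a consonant follows a vowel
            if current and ch not in VOWELS and current[-1] in VOWELS:
                syllables.append(current)
                current = ch
            else:
                current += ch
        if current:
            syllables.append(current)
        return syllables

    print(syllabify("karma"))       # ['ka', 'rma']
    print(syllabify("palak"))       # ['pa', 'la', 'k']
    print(syllabify("khubsoorat"))  # ['khu', 'bsoo', 'ra', 't']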
Let the 4.1.2 Syllabification mapping be defined as M : sE → oH where sE denotes the For syllabification, we have used Hindi and Gujarati word syllable in English and oH denotes the corresponding output string in Hindi. M is a one to one mapping. All the out- lists in their native scripts.The word list is compiled by run- put strings corresponding to all the syllables of the word are ning scripts on various sources including word lists,websites concatenated sequentially to produce a single string, which hosting song lyrics and manual observation. Some words not we will denote as h. This is the naive Devanagari represen- found in the dictionary(pronouns and the inflected forms of tation of the transliterated Hindi word. For Gujarati, this other POS specifically) were added manually to create an naive Hindi word is converted into a naive Gujarati word exhaustive list for both the languages. by converting each character individually from Hindi to Gu- jarati using a character map. This word is then used to find 4.2 Training Experiments the closest resembling match/es in the Hindi dictionary, D. The system was first tested on the subtask test data of the same task in ’13 only for labelling .Named entities were re- A separate mapping list, M’ is also maintained which is a list moved in order to get an accurate view of the working of the of non-singular sets. For example, syllables like ”sha” match system. more than one Hindi letters : ”q / f”. Such letters will be regarded as ambiguous letters as their mapping is indefinite. Table 3 shows the results of the training experiments.

For every letter in the naive string, except for the letter - Following the table, are the results for back-transliteration lant, a comparison with the reference words r ∈ D is done.A carried out on all the Hindi words in training data of the letter by letter comparison is made between reference word subtask of FIRE ’14.This data contained words with their and the search string.For every letter which matches, the back transliterations mentioned. check moves forward to the next letter else the check fin- Eg. me which should be mein and hau which should be hai Table 3: Labeling results for required output English Hindi English Gujarati Some words might be spelled identically but may be differ- Precision 84.57 95 89.90 93.62 ent. However the system returns a list of all such possible outputs. Recall 84.57 95 88.12 94.62 Eg. kila [ kFl,EklA ] F - score 84.57 95 89.01 94.12 Certain words which are transliterated into English in a dif- ferent manner may not be back transliterated in the exact Labeling Accuracy 92.46 92.33 correct word. Eg. lai [lAyF,lAI ] Words having different phonetic representations but same # Correct transliterations = 1825 Roman representation may not be back transliterated effi- # Incorrect transliterations = 507 ciently. Exact Transliteration Pairs Match = 78.26%

7. CONCLUSIONS AND FUTURE SCOPE 5. RESULTS In this paper, the methodology followed for classification Tables and 4 give the labelling results,back transliteration of query words into their respective languages and their results on test data. back-transliteration into native scripts, has been presented in detail. It is quite evident that this approach involving the construction of character n-grams, counting of their fre- Table 4: Labeling results quency of occurrence in all the concerned languages (En- English Hindi English Gujarati glish, Hindi, Bangla, Gujarati), and considering the fre- Precision 77.3 71.0 97.6 16.7 quency of transition of a character n-gram to another n- gram, really helped in efficient classification of words into Recall 78.2 79.5 98.6 25.0 their languages. Proper error-analysis has been done for the F - score 77.8 75.0 98.1 20.0 proposed approach that clearly indicates the strengths and limitations of the system. Labeling Accuracy 77.1 96.3 As future work, more exploration can be done to enhance the performance of the stipulated system. The system can Table 5: Backtransliteration results be made to work for the back-transliteration of inflected Hindi Gujarati words for the more than one Indian languages,as it does for non-inflected words of Hindi and Gujarati languages. Precision 9.6 46.4 Recall 52.3 46.2 8. REFERENCES F - score 16.3 46.3 [1] W. B. Cavnar and J. M. Trenkle N-gram-based text categorization Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information 6. ERROR ANALYSIS Retrieval, 1994. 6.1 Error Analysis for Labeling [2] E. Tromp and M. Pechenizkiy Graph-Based N-gram The system can not classify words for which no bi/tri grams Language Identification on Short Texts Proceedings of exist and hence, no vertices or edges in the training data the 20th Machine Learning conference of Belgium and graph. The Netherlands, 2011. Eg.ud [3] B. Ahmed et. al Language Identification from Text It is difficult to classify some words having same Roman rep- Using N-gram Based Cumulative Frequency Addition resentation and are valid in both the languages. Proceedings of Student/Faculty Research Day, CSIS, Eg. the, deep exist both in English and Hindi. Pace University,2004 The provided dataset is erroneous in the sense that certain [4] R. Saha Roy , M. Choudhury, P. Majumder et. al English words are classified wrongly as belonging to the In- Overview and Datasets of FIRE 2013 Track on dian language. Transliterated Search http://www.isical.ac.in/, 2013 . Eg. life[\H] out[\H] of[\H] control[\H] [5] Unicode Inc. Devanagari For short words, the system results aren’t effective. http://www.unicode.org/charts/PDF/U0900.pdf Eg. I, a [6] A. Agarwal Syllabification approach for transliteration For proper nouns, since sufficient bi/tri-grams are absent in http://www.cfilt.iitb.ac.in/tools/Ankit DDP Transliteration 2010.zip, the training data graph, their classification is rather erro- 2010 neous. Eg. Sachin Tendulkar

6.2 Error Analysis for Transliteration Erroneous transliterations - The system does not give proper outputs for erroneous transliterations. DCU@FIRE-2014: Fuzzy Queries with Rule-based Normalization for Mixed Script Information Retrieval

Debasis Ganguly Santanu Pal Gareth J.F. Jones Centre for Global Intelligent Institute for Applied Linguistics Centre for Global Intelligent Computing (CNGL) University of Saarland Computing (CNGL) School of Computing Saarbruecken, Germany School of Computing Dublin City University santanu.pal@uni- Dublin City University Dublin, Ireland saarland.de Dublin, Ireland [email protected] [email protected]

ABSTRACT Keywords We describe the participation of Dublin City University (DCU) Fuzzy Query, Rule-based Normalization, Statistical Machine in the FIRE-2014 transliteration search task (TST). The Transliteration TST involves an ad-hoc search over a collection of Hindi film song lyrics. The Hindi language content of each document 1. INTRODUCTION in the collection is either written in the native Devanagari Generally speaking, mixed script information retrieval (MSIR) script or transliterated in Roman script or a combination of refers to the problem of retrieving relevant documents for ad- both. The queries can be in mixed script as well. The task hoc search queries, where the textual content of the docu- is challenging primarily because of the vocabulary mismatch ments in the collection and of the queries can be represented which may arise due to the multiple transliteration alterna- with more than one script, one native to the language of tives. We attempt to address the vocabulary mismatch prob- the document while the others being non-native [1]. Cross lem both during the indexing and retrieval stages. During script information retrieval (CSIR) represents a special case indexing, we apply a rule-based normalization on some char- for MSIR where the queries and the documents are in a sin- acter sequences of the transliterated words in order to have a gle script but different from each other. The transliterated single representation in the index for the multiple transliter- search task (TST) at FIRE (Forum of Information Retrieval ation alternatives. During the retrieval phase, we make use Evaluation) 2014 is a shared task to establish benchmark re- of prefix matched fuzzy query terms to account for the mor- trieval methodologies for the MSIR research problem. The phological variations of the transliterated words. The results document collection for TST comprises Hindi song lyrics show significant improvement over a standard baseline query written both in the Devanagari script and the Roman script. likelihood language modelling (LM) approach. Additionally, Queries are keyword based and can also be either in Devana- we also apply statistical machine transliteration to train a gari or Roman script. transliteration model in order to predict the transliteration The problem of MSIR is particularly hard because of the of out-of-vocabulary words. Surprisingly, even with satis- following reasons. First, the presence of script mixing in factory transliteration accuracy, we found that automatic the documents and the queries may necessitate different in- transliteration of query terms degraded retrieval effective- dexing and retrieval strategies for the terms in two different ness. types of scripts. For example, the process of stemming to address the morphological variations of the terms will be Categories and Subject Descriptors different for the native and the foreign script. H.3.1 [INFORMATION STORAGE AND RETRIE- Second, the transliteration process of a term in the foreign VAL]: Content Analysis and Indexing—Abstracting meth- script usually involves multiple alternatives due to linguis- ods; H.3.3 [INFORMATION STORAGE AND RETRIE- tic and cultural differences. For example, the Hindi word VAL]: Information Search and Retrieval—Retrieval models, “पहला” (EN: “first”)1 written in Devanagari script, the na- Relevance Feedback, Query formulation tive script of Hindi, may be transliterated into the Roman script as “pehla”, “pehlaa”, “pahla”, “pahlaa” etc. 
These General Terms multiple alternatives can give rise to a vocabulary mismatch problem between queries and documents, e.g. if one of the Theory, Experimentation query terms is “पहला”, transliterating this into “pehla” will not be able to retrieve documents that contain any other variants, e.g. “pahla”. Permission to make digital or hard copies of all or part of this work for An overview of our approach to mitigate this vocabulary personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear mismatch problem is as follows. First, we use a rule-based this notice and the full citation on the first page. Copyrights for components character sequence normalization to normalize ambiguous of this work owned by others than ACM must be honored. Abstracting with character sequences into one unique representation across credit is permitted. To copy otherwise, or republish, to post on servers or to 1 redistribute to lists, requires prior specific permission and/or a fee. Request Throughout this paper, we write the English meaning cor- permissions from [email protected]. responding to a non-English word within a pair of parenthe- Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. sis following that word prefixed with “EN:”. the collection. As an example, the alternative options for aa.Ndhii chalii to naqshekafepaa nahii.n milaa transliterating the diacritic Hindi vowel “◌ा are “a”, “A” or dil jisase mil gayaa wo dubaara nahii.n milaa “aa”. Case folding leaves open two options eliminating one. ham a.njuman me.n sabakii taraf dekhate rahe This ambiguity can be handled by a rule which replaces all apanii tarah se ko_ii akelaa nahii.n milaa ocurrences of “aa” in the corpus to “a” as a pre-processing aawaaz ko to kaun samjhataa ki duur duur step before indexing. This character sequence normalization Kaamoshiyo.n kaa dardshanaasaa nahii.n milaa is also performed over the query terms during the retrieval kachche .De ne jiit lii nadii .Dhii huii phase. majabuut kashtiyo.n ko kinaaraa nahii.n milaa Second, during retrieval time, we make use of fuzzy query आँधी चली तो नक़् शएकफ़एपा नहीं िमला terms to help alleviate the vocabulary mismatch. Specifi- िदल िजससे िमल गया वो दबारु नहीं िमला cally, a fuzzy query term allows at most 2 suffix characters हम अंजुमन म सबकी तरफ़ देखते रहे to be different in order to consider a match as a valid one, अपनी तरह से कोई अके ला नहीं िमला e.g. “tera” (EN: yours) matches with “tere” (EN: your), आवाज़ को तो कौन सझता िक दरू दरू whereas “mera” (EN: my) does not match with “tera” since ख़ामोिशयॲ का ददशनासा नहीं िमला the mismatch occurs in the prefix instead of the suffix. We कचे घड़े ने जीत ली नदी चढ़ी हईु hypothesize that in the absence of a stemmer for the translit- मजबूत कितयॲ को िकनारा नहीं िमला erated words, such a scheme of prefix-biased approximate matching may potentially work well to bridge the vocabu- Figure 1: An example mixed script document from lary gap. the TST collection which shows the same content The rest of the paper is organized as follows. In Section 2, written in both native and foreign script. we describe the different indexing approaches to bridge the vocabulary gap including the dictionary construction and the rule based normalization. 
Section 3 describes our retrieval phase processing, including query expansion, approximate matching and query transliteration. This is followed by Section 4, where we investigate the individual contributions from each of the proposed approaches and present various results obtained on the set of training queries. We also provide an overview of the official results of TST 2014 obtained with the test set queries. Finally, Section 5 concludes the paper with directions for future work.

2. INDEXING APPROACH
In this section, we describe the two approaches by which we hoped to alleviate the vocabulary mismatch problem in MSIR. We first describe a dictionary-based document expansion approach, following which we describe a rule-based character sequence normalization method.

2.1 Dictionary based Term Expansion
The document collection in the TST'14 task comprises about 60K documents of song lyrics. The text in a large number of these documents is simply a concatenation of the song lyrics in two scripts, as a result of which it is a straightforward process to construct a bi-directional dictionary mapping the native script representation of a word to its foreign script representation and vice versa. An example of such a document with mixed script content is shown in Figure 1. A one-one mapping can hence be constructed in a straightforward way using the information from these mixed script documents, e.g. "aa.Ndhii" is mapped to "आँधी", "chalii" is mapped to "चली" and so on. The dictionary that we constructed in this way has 21,018 word mappings; Table 1 shows 10 words from it. Note that in the absence of such a well-aligned document structure, a general approach would be to use a pre-built dictionary. For this particular task, however, constructing the dictionary from the collection itself makes sense.

The dictionary is then used to expand a document as follows. For every word of a document in native or foreign script, we check if its counterpart exists in the dictionary; if it does, we add this dictionary word in the complementary script to the document index. Note that, due to the inherent transliteration ambiguity, some of the native script words in our dictionary may point to a set of corresponding foreign script words instead of a single one. For example, the dictionary entry "पहला" (EN: first) would point to the set {"pehla", "pehlaa", "pahla", "pahlaa"}. We ensure that in such a case we always choose the lexicographically least one, which for this particular example is "pahla". We also convert each occurrence of the other transliteration variants into this representative class member, e.g. "pehla" is converted to "pahla" and so on. This dictionary-based cross-script term expansion is conducted on the query side as well.

Table 1: An extract from the automatically constructed dictionary using a simple one-one mapping.
  Devanagari    Roman
  रखकर          rakhakar
  चारासाज़        chaaraasaaz
  पके           patke
  सजाना         sajaanaa
  खैरात          khairaat
  पखावज         pakhaavaj
  पका           patkaa
  कमसीन         kamasiin
  कज़े           kabze
  सजाने          sajaane

2.2 Rule-based Normalization
The dictionary is only able to handle those words which can be aligned due to the presence of transliterated content within the same document. To reduce the term mismatch problem for out-of-dictionary words, we employed a simple rule-based character sequence normalization method to create a single equivalent representation for the ambiguous character sequences in the index. This process is also performed on the query side.

Table 2 lists the rules that we applied while constructing the index, along with illustrative examples. All the rules are applied iteratively, as a result of which a word can go through a sequence of multiple intermediate transformations before eventually reaching the final normal form, e.g. according to the rules of Table 2, "dhoom" → doom → dum → dam.

Table 2: Rules applied for normalizing alternative transliterations.
  Found   Substituted   Example
  aa      a             laagan → lagan
  ay      ai            sapnay → sapnai
  ae      ai            sapnae → sapnai
  ii      i             mahii → mahi
  ee      i             mahee → mahi
  oo      u             pooja → puja
  uu      u             huzuur → huzur
  q       k             qayamat → kayamat
  ia      ya            dooria → doorya
  hh      h             chhaya → chaya
  v       w             havas → hawas
  bh      b             bharat → barat
  cch     c             iccha → ica
  ch      c             chaya → caya
  gh      g             ghungru → gungru
  jh      j             jharoka → jarokha
  sh      s             shaan → saan
  th      t             hathi → hati
  dh      d             dhoom → doom
  um      am            hum → ham
  ain     ai            main → mai

2.3 Index Generation
Previous research in probabilistic and language models of IR has shown that using field-specific models usually out-performs flat ones, mainly because the document term frequency maximum likelihood estimates are more reliable on a per-field basis than on a global one [6, 2]. In our case, since a large number of documents in the collection are comprised of mixed script content, separating the content of a document into two individual fields, one for the native script and the other for the foreign script, enables us to compute document term frequency and collection statistics relative to each individual field. Moreover, each document in the collection has a title and a body. It is fairly intuitive that a match of a query term in the title of a song lyric may bear more importance than a match in the body of the song. To achieve this, we further partitioned the content of each script field (native or foreign) of each document into two additional fields, namely the title and the body.

A document in our index thus manifests itself as a set of four individual fields, namely the Devanagari title (HN_Title), the Devanagari body (HN_Body), the Roman title (EN_Title) and the Roman body (EN_Body). The index is generated with the help of Lucene (http://lucene.apache.org/), a widely used, freely available Java toolkit for indexing and retrieval. In the case of song lyric search, the order of the query terms may in fact be quite important, e.g. "diwana dil" and "dil diwana" refer to two different songs. In order to take the term order into account, along with the unigrams we also index the word bi-grams with the help of the ShingleFilter utility of Lucene (http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html).

Table 3 shows the index statistics for each index constructed without and with the pre-processing steps described in Sections 2.1 and 2.2. It can be observed that the vocabulary is much more diverse for the transliterated content (EN_Title and EN_Body) than for the native script content (HN_Title and HN_Body) due to the inherent ambiguities in the transliteration. Character sequence normalization helps to reduce this diversity by grouping words into equivalence classes and using only a single class member from each class as the indexing unit. Transliteration helps to expand the document lengths by adding equivalent cross-script word representations. The indexing strategy in the last row of Table 3 is expected to provide the best results due to the combination of the two strategies for mitigating vocabulary mismatch.

Table 3: Collection statistics (collection frequencies) from indexes obtained with different pre-processing.
  Pre-processing before indexing     EN_Title   HN_Title   EN_Body     HN_Body
  No pre-processing                  106,697    51,074     1,211,136   473,144
  Normalization                      78,331     51,074     804,834     473,144
  Transliteration + Normalization    89,855     51,115     832,460     474,058

3. RETRIEVAL APPROACH
The retrieval model that we use for all the IR experiments reported in this paper is the language model (LM) query likelihood [2]. The parameter λ (the importance of term presence) was set to 0.3 after tuning on the development set queries; the subsequent retrieval experiments use this value of λ. In the following sections, we describe two retrieval-time techniques for mitigating the vocabulary mismatch between documents and queries.

3.1 Approximate Query Term Matching
Stemming plays an important role in normalizing the morphological variations of a word, e.g. "friendly" and "friends" are represented by the root word "friend" [5]. However, in the context of MSIR, stemming words written in the foreign script is difficult because of the multiple alternatives inherent in these suffixes. We therefore undertake a simple approach of approximately matching the query terms with the index terms in the inverted list instead of requiring exact matches. The Lucene utility class that we make use of for this is FuzzyQuery (https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/FuzzyQuery.html). For efficiency reasons, edit distances of up to 2 are allowed in the approximate matching. The prefix length that must match for two terms to be matched approximately is set to τ · max(len(t_d), len(t_q)), where t_d is an index term that we seek to match with the query term t_q, len(·) is the length function and τ ∈ [0, 1] is a parameter. We set τ = 0.7 for all the experiments reported in this paper that involve fuzzy term matching.

3.2 Automatic Transliteration
For some query terms, an equivalent cross-script representation exists in the dictionary; this cross-script representation for the current query term is added to the query. For out-of-dictionary terms, we employ a statistical machine translation (SMT) approach to automatically predict the transliteration of unseen words. The similarity of transliteration to translation can be observed from the fact that while SMT involves translating a source language sentence into a target language one, automatic transliteration involves transforming a source script character sequence into a target script character sequence.
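To make the normalization of Section 2.2 concrete, the following is a minimal Python sketch of iterative rule application. The rule list mirrors Table 2; the left-to-right, repeat-until-fixed-point loop is an assumption about how "applied iteratively" is realised, since the paper does not spell out rule ordering or the stopping condition.

```python
# Minimal sketch of the rule-based character-sequence normalization of Section 2.2.
# Longer patterns are listed first so they win over their sub-patterns (e.g. "cch" before "ch").
RULES = [
    ("cch", "c"), ("ain", "ai"),
    ("aa", "a"), ("ay", "ai"), ("ae", "ai"), ("ii", "i"), ("ee", "i"),
    ("oo", "u"), ("uu", "u"), ("ia", "ya"), ("hh", "h"), ("bh", "b"),
    ("ch", "c"), ("gh", "g"), ("jh", "j"), ("sh", "s"), ("th", "t"),
    ("dh", "d"), ("um", "am"), ("q", "k"), ("v", "w"),
]

def normalize(term: str) -> str:
    """Apply the substitution rules repeatedly until the term stops changing."""
    previous = None
    while term != previous:
        previous = term
        for pattern, replacement in RULES:
            term = term.replace(pattern, replacement)
    return term

if __name__ == "__main__":
    # "dhoom" normalizes to "dam", the final form of the dhoom -> doom -> dum -> dam chain above.
    print(normalize("dhoom"))   # dam
    print(normalize("laagan"))  # lagan
```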

Equivalent cross-script words from our dictionary (see Section 2.1) were used to constitute a parallel corpus, aligned at the character level, which in turn was used to train a transliteration model. For example, the word pair ("khushbuu", "खुशबू") from the dictionary is converted to the following character sequence pair:

k h u s h b u u → ख ◌ु श ब ◌ू

The goal of the alignment function is to learn the correct mapping between the character sequences, which in this case is "k h" → ख, "u" → ◌ु, "s h" → श and "b u u" → बू.

Our experimental settings for training the transliteration model were:

1. Log-linear phrase-based statistical machine translation (PB-SMT) model [3], implemented in Moses (http://www.statmt.org/moses/).
2. Maximum phrase length of 5 in the translation model and a 3-gram (character sequence) language model.
3. GIZA++ implementation of the IBM alignment models with grow-diagonal-final-and heuristics for performing source-target character alignment and phrase extraction [4].
4. No reordering performed, and the distortion and word penalty parameters set to zero.

The accuracy of our transliteration model was measured with 10-fold cross validation. The average precision, recall and F-measure values are shown in Table 5. The column named "Character" in Table 5 indicates that the output character sequences are evaluated with respect to the character sequences of the reference (test set) data. The "Word" column, on the other hand, indicates that the evaluation is conducted at the word level, that is, after removing the intermediate spaces between the character sequences of the output and the reference words.

Table 5: 10-fold cross validation results on transliterated word prediction.
               EN-HI                    HI-EN
               Character   Word         Character   Word
  Precision    0.9177      0.8806       0.9312      0.8109
  Recall       0.9293      0.8780       0.9073      0.8109
  F-score      0.9235      0.8793       0.9191      0.8109
  BLEU         0.7494      0.6185       0.6470      0.4295

Table 5 shows that we achieve satisfactory transliteration accuracy (measured by the BLEU score), the results being better from the foreign to the native script than the other way round because of the inherent transliteration ambiguity. Since we obtain more accurate transliterations from the foreign to the native script, we use only this direction for query expansion. More precisely, given an out-of-dictionary Roman-script transliterated word, we transliterate this word back to the native script and add this term to the query.
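The character-level corpus preparation for the PB-SMT transliteration model amounts to splitting each word of a transliteration pair into space-separated symbols, as in the "khushbuu" example above. Below is a minimal sketch of writing out such a parallel corpus, assuming the word pairs have already been extracted from the dictionary; the file names and helper are illustrative, and the actual Moses/GIZA++ training invocation is not shown.

```python
def to_char_sequence(word: str) -> str:
    """Space-separate the symbols of a word, e.g. 'khushbuu' -> 'k h u s h b u u'.
    Python iterates by Unicode code point, so Devanagari dependent vowel signs
    become separate symbols, matching the example in Section 3.2."""
    return " ".join(word)

def write_parallel_corpus(word_pairs, src_path, tgt_path):
    """Write character-level source/target files that a PB-SMT toolkit can be trained on."""
    with open(src_path, "w", encoding="utf-8") as src, open(tgt_path, "w", encoding="utf-8") as tgt:
        for roman, devanagari in word_pairs:
            src.write(to_char_sequence(roman) + "\n")
            tgt.write(to_char_sequence(devanagari) + "\n")

# Example with a single dictionary entry; the real model is trained on all 21,018 mappings.
write_parallel_corpus([("khushbuu", "खुशबू")], "corpus.en", "corpus.hi")
```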

4. RESULTS
In this section, we report the results of our experiments. In order to investigate the effectiveness of the various indexing and retrieval approaches proposed in Sections 2 and 3, we first report the results obtained on the training (development) set of queries with an incremental application of each of these approaches. We then select the best experimental settings and apply them to the test set; this is the approach we in fact used for the official submissions. The software used for the indexing and retrieval experiments reported in this paper is publicly available at https://github.com/gdebasis/msir/.

4.1 Development Set Results
In order to compare our results with those of [1], we use binary relevance judgments for evaluation. Similar to [1], we consider the documents with manually assessed scores of over 2 (i.e. 3 and 4) on a 5-point scale from 0 to 4 as relevant. We retrieve 1000 documents for each query, and all our reported results are calculated based on this ranked list of 1000 documents. The total number of queries in the development set is 35, out of which one query has no relevant documents (i.e. no documents with assessed scores above 2). We therefore consider 34 queries for our evaluation.

Table 4: Retrieval results obtained on the development set of 34 queries.
  Term Exp. (Doc-side)   Term Exp. (Qry-side)    Normalization   Fuzzy Query   MAP      MRR      Recall   BPREF
  no                     no                      no              no            0.3623   0.7688   0.5576   0.5688
  no                     no                      yes             no            0.2718   0.5907   0.6057   0.6628
  no                     no                      no              yes           0.3829   0.7760   0.6201   0.6029
  no                     no                      yes             yes           0.3294   0.6523   0.6923   0.7186
  yes                    no                      no              no            0.4371   0.7593   0.6394   0.6153
  yes                    no                      yes             no            0.4495   0.7898   0.6826   0.7160
  yes                    no                      no              yes           0.4390   0.7455   0.6538   0.6365
  yes                    no                      yes             yes           0.4546   0.7689   0.6971   0.7381
  yes                    yes                     yes             no            0.5641   0.8400   0.7211   0.7430
  yes                    yes                     yes             yes           0.5671   0.8467   0.7548   0.7757
  yes                    yes (dict. + Moses)     yes             no            0.3988   0.6417   0.6737   0.7134
  yes                    yes (dict. + Moses)     yes             yes           0.4131   0.6769   0.6875   0.7346

Table 4 shows the results. It can be observed that, in accordance with our hypothesis, the best results are obtained when we use a combination of dictionary-based term expansion on both the document and the query side, coupled with rule-based normalization and approximate query term matching. It is somewhat surprising to see that, despite the good performance of the transliteration model trained with Moses, expanding the out-of-dictionary query terms with automatically transliterated words degrades retrieval quality.

In order to directly compare our results with [1], we compare our best performing approach (the best settings of Table 4) with the best results reported in [1]. For this, we evaluate our approach only on the first 25 queries, which consist of terms in the foreign script only (the same test set used for the TST'13 task at FIRE 2013 [7] and in [1]). The results are shown in Table 6. We see that our approach produces the second best results with only a simple term-based approach, without modeling the equivalence of sub-word features, e.g. character n-grams with the help of auto-encoders [1].

Table 6: Comparative evaluation on the top 10 retrieved documents with previously reported results.
  Run Description                  MRR      MAP
  TUVal-2 [7]                      0.8440   0.4240
  Deep [1]                         0.8740   0.5039
  Our best approach (Table 4)      0.8500   0.4793

4.2 Test Set (Official) Results
We submitted two runs in TST '14. The first run applied dictionary-based alignment to expand index and query terms, along with rule-based normalization for indexing and fuzzy query matching for retrieval. The second run additionally applied PB-SMT based query term transliteration (see Section 3.2). In Table 7, we report the results obtained on the 35 test queries.

Table 7: Evaluation on the test set with 10 and 1000 documents retrieved.
                                                                       #ret=10           #ret=1000
  Run Description                                                      MRR      MAP      MRR      MAP
  Normalization + Doc. & Qry. exp. (dict. based) + fuzzy qry. match    0.6408   0.4040   0.6455   0.4701
  Normalization + Doc. & Qry. exp. (dict. based) + fuzzy qry. match
    + Qry. exp. (Moses)                                                0.4265   0.2633   0.4315   0.3257

An interesting observation is that the MAP and MRR values are considerably lower for the test set queries in comparison to the training set ones (cf. Table 4 and Table 7). In order to investigate what might have caused this, we examined the per-query retrieval results for the test set queries. We observed that 8 out of 35 queries produced very low average precision (AP) values of less than 0.1; in fact, two of the queries produced an AP of zero. On manual inspection of one of these queries (shown in Table 8), we found that the reason for the bad performance is a compounding effect. It can be seen that "ek taara" occurs as two distinct words in the query, whereas in both the relevant documents these terms have been juxtaposed to form a compound, as a result of which our system could not perform well on these types of queries in general.

Table 8: Compounding effect causes poor results for some queries.
  Query: koi ek taara ek taara
  Relevant documents:
    1. इकतारा - ikataaraa
    2. ओ रे मनवा... इकतारा (महिला स्वर) - o re manawaa... ikataaraa (mahilaa swar)

5. CONCLUSIONS AND FUTURE WORK
We described our participation in the transliterated search task (TST) at FIRE 2014. We undertook a purely IR approach to the search task. The main motivation for our work was to mitigate, from an IR perspective, the vocabulary mismatch between documents and queries when both may be represented in mixed scripts. Firstly, we make use of dictionary-based transliteration for document and query expansion. Secondly, we apply rule-based normalization to transform each equivalent character sequence into a single representative of its equivalence class. Thirdly, we employ fuzzy term matching to account for the absence of a stemming algorithm for words in the foreign script. Further, we also applied a PB-SMT system to learn the corresponding character sequence alignments across the source and target scripts.

The results showed that a pure IR approach yields competitive results in comparison to more sophisticated methods for detecting character sequence level alignments, namely deep learning with auto-encoders [1]. Surprisingly, the results degraded with the use of PB-SMT based query expansion. In future, we plan to investigate the possible reasons for this degradation.

Acknowledgments
This research is supported by Science Foundation Ireland (SFI) as a part of the CNGL Centre for Global Intelligent Content at DCU (Grant No: 12/CE/I2267).

6. REFERENCES
[1] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '14, pages 677–686. ACM, 2014.
[2] D. Hiemstra. Using Language Models for Information Retrieval. PhD thesis, CTIT, AE Enschede, 2000.
[3] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180. Association for Computational Linguistics, 2007.
[4] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 48–54. Association for Computational Linguistics, 2003.
[5] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[6] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, pages 42–49. ACM, 2004.
[7] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Overview and datasets of FIRE 2013 track on transliterated search. In Working Notes of FIRE 2013, FIRE '13, 2013.

Language identification for transliterated forms of Indian language queries

Supriya Anand Bangalore, India supriya [dot] anandrao [at] gmail [dot] com

Abstract Language identification has a number of applications in natural language processing. N-gram analysis has been the traditional method for language identification. In this paper, we discuss the methods and results from our participation in the Shared Task on Transliterated Search track at the Forum for Information Retrieval Evaluation, 2014. We describe a method that leverages the phonetic properties of query tokens and their effect on the grammatical construction of tokens. The system developed is found to have a token level language labelling accuracy of around eighty percent for each of the language pairs, Hindi-English and Kannada-English.

Keywords: language identification, vowel clusters, phonetics

1 Introduction

Transliteration plays an important role in cross lingual information retrieval. Transliteration into the Roman script is used frequently on the web not only for documents, but also for user queries that search for these documents. Grapheme based, phoneme based and rule based hybrid systems are three frequently used approaches to machine transliteration as discussed in [1]. Particularly for language identification, N gram approaches have yielded good results[2]. In this paper, we discuss a method that leverages alphabet similarity across major Indian languages and their influence on Romanized transliteration to identify the language and native script for individual query tokens. The structure of the paper is as follows. The task is defined in section 2 and the approach is described in section 3. Section 4 details the experiments and the main results. We analyse the results in Section 5 and conclude.

2 Definitions

In the following section, the task and the objects involved in the task are defined.

Query: white space separated list of words/tokens that identify keywords which will aid in information retrieval

Token: an individual word in the query that can either be an English word or a transliterated word written in Roman script

Task Description: Suppose that q: w1 w2 w3 ... wn is a query written in the Roman script. The words w1, w2, etc. could be standard English words or transliterated from another language L. The task is to label each word as E or L depending on whether it is an English word or a transliterated L-language word, and then, for each transliterated word, to provide the correct transliteration in the native script (i.e., the script which is used for writing L). Names of people and places in L should be considered as transliterated entries whenever the name is a native one. Thus, Arundhati Roy is a transliterated name, but Ruskin Bond is not.

Task Evaluation: The system-annotated test data set is compared against a human-annotated gold standard set of queries. With this gold standard, precision and recall for both languages E and L, in addition to the exact query match fraction, are determined.
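As an illustration of this evaluation, the following sketch computes token-level precision, recall and F-measure for the L label plus the exact-query-match fraction from gold and predicted label sequences. The list-of-lists layout and the variable names are assumptions made for the example; the official scorer may differ in details such as how Mixed and NE tags are handled.

```python
def token_prf(gold, pred, label="L"):
    """Token-level precision/recall/F1 for one label over parallel label sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def exact_query_match_fraction(gold_queries, pred_queries):
    """Fraction of queries whose predicted label sequence matches the gold sequence exactly."""
    matches = sum(1 for g, p in zip(gold_queries, pred_queries) if g == p)
    return matches / len(gold_queries)

# Toy example: two queries, token labels E (English) or L (transliterated Indian-language word).
gold = [["L", "L", "E"], ["E", "E"]]
pred = [["L", "E", "E"], ["E", "E"]]
print(token_prf([t for q in gold for t in q], [t for q in pred for t in q], "L"))
print(exact_query_match_fraction(gold, pred))  # 0.5
```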

3 Method Description

Indian languages that borrow from the Devanagari script explicitly define long vowels as free-standing independent vowels termed deergha swar. The system relies on this similarity in alphabets among Indian languages, and on their difference from the Roman alphabet, which lacks defined long vowels, to identify the native language of query tokens. The lack of defined long vowels in English creates vowel clusters when Indian language query tokens are transliterated to English, as opposed to the consonant clusters common to English query tokens. In particular, the system leverages certain phonetic features [3] that influence transliteration, in addition to a few morphological rules of Indian languages [4]. Thus, towards building this system, we have mainly explored various token level features and used them in conjunction with a Naive Bayes classifier to identify the native language, as sketched below. The features can be grouped into the following categories.

Capitalization features: This feature set is used to identify whether a particular token is a named entity or an acronym. In the context of identifying Indian languages, this is not of much value and is meaningful only where languages make a case distinction.

Special character features: These features capture punctuation marks, special characters and numbers in the tokens. In the context of queries targeted towards retrieving information from social media posts, they also capture smileys, hash tags and so on.

Token level features: This set of features has been mainly used to identify Indian languages. It consists of checks on token endings for the presence of certain characters. The motivation behind this idea is that non-native English speakers conceptualize thoughts in their native language, and the transliterations depend on how those tokens are pronounced in that language. In addition, vowel clustering to indicate lexical stress is an interesting property of Indian languages, as opposed to Roman languages, where consonant clustering is given prominence. Also, the presence of an anuswar, anunasik or chandrabindu transliterates to the bigrams 'in', 'on' or the word endings 'i', 'n', particularly for Hindi. This feature set further includes the frequency and word length of tokens; word length in particular did not seem to add value to the system, as determined during cross validation on the development set. The feature set also captures the presence of the token in a lexicon; this has been used to identify English words in the queries, since only an English lexicon was available. In addition, stop words were curated from NLTK's stopwords corpus [5] to enhance English word identification. The features are listed in Table 1.
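To make the feature set concrete, here is a minimal sketch of a few of the token-level checks described above (vowel clusters, selected word endings, lexicon membership) fed into a Bernoulli Naive Bayes classifier. The specific feature subset, the tiny word list standing in for the English lexicon, the toy training data, and the use of scikit-learn are all illustrative assumptions, not the system's actual resources.

```python
import re
from sklearn.naive_bayes import BernoulliNB

ENGLISH_LEXICON = {"water", "fall", "first", "song"}  # stand-in for the real lexicon + stop words

def features(token: str):
    """Binary token-level features in the spirit of Table 1."""
    t = token.lower()
    return [
        int(token[:1].isupper()),                 # capitalization
        int(bool(re.search(r"[aeiou]{2,}", t))),  # vowel cluster (e.g. 'aa', 'ee', 'oo')
        int(t.endswith("i")), int(t.endswith("n")), int("in" in t),
        int(t.endswith("a")), int(t.endswith("u")),
        int(t in ENGLISH_LEXICON),                # English lexicon membership
    ]

# Toy training data: 1 = transliterated Indian-language token, 0 = English token.
train_tokens = ["paani", "gaana", "sapna", "water", "fall", "song"]
train_labels = [1, 1, 1, 0, 0, 0]
clf = BernoulliNB().fit([features(t) for t in train_tokens], train_labels)

query = "paani fall song".split()
preds = clf.predict([features(t) for t in query]).tolist()
print(list(zip(query, preds)))  # e.g. [('paani', 1), ('fall', 0), ('song', 0)]
```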

4 Experiments and Results

The development data set provided by the organizers included 1230 queries for the Hindi-English language pair and three queries for the Kannada-English language pair. The queries for the Hindi-English language pair included social media posts in addition to regular queries. Since the development set for the Kannada-English language pair was very small, additional queries were scraped from the web and a few queries were added by hand, for a total of 250 queries. An English word frequency list constructed from the Leipzig corpora [6] was also used.

Table 1: Description of features used

  Feature description                     Used: English-Hindi   Used: English-Kannada
  Is first letter capitalized?            X                     X
  Are all letters capitalized?            X                     X
  Frequency                               X                     X
  Token length                            X                     X
  Token is a number?                      X                     X
  Token is a punctuation?                 X                     X
  Token is a smiley/hashtag?              X                     X
  Token is a special character?           X                     X
  Token contains a special character?     X                     X
  Token ends with a?                                            X
  Token ends with u?                                            X
  Token ends with e?                                            X
  Token ends with in?                     X                     X
  Token ends with on?                     X                     X
  Token ends with i?                      X                     X
  Token ends with o?                      X                     X
  Token contains in?                      X                     X
  Sequence of vowels in token?            X                     X
  Token present in English lexicon?       X                     X

Table 2: Result metrics for the Hindi-English language pair

  Run ID   LP      LR      LF      EP      ER      EF      EQMF-All   EQMF-w/o Mix and NE   LA
  1        0.741   0.883   0.806   0.849   0.751   0.797   0.195      0.308                 0.807
  2        0.644   0.917   0.756   0.863   0.539   0.664   0.165      0.235                 0.738

  LP, LR, LF: token level precision, recall and F-measure for the Indian language in the language pair
  EP, ER, EF: token level precision, recall and F-measure for English tokens
  EQMF: exact query match fraction
  LA: token level language labelling accuracy

Since the word list included words with very low frequency, which constitutes noise, frequency was used as a feature only when its value was in double digits for an individual token. Cross validation on the data set revealed that word length as a feature caused a drop in the labelling accuracy.

Pre-processing: The queries were tokenized on white space. Tokens consisting of Latin characters were lower-cased.

Two runs were submitted for the Hindi-English language pair. In the first run, the complete feature set described in Table 1 was used. For the second run, the features checking for words ending with 'i', 'o' or 'n' were dropped; correspondingly, both the labelling accuracy and EQMF went down, as reported in Table 2. A single run was submitted for the Kannada-English language pair. In addition to vowel clustering, additional features were used for tokens ending with 'a', 'u' or 'e', which are found to qualify nouns/adjectives and/or alter their tense or number, and in some cases the tense or number of pronouns. High precision has been achieved, but recall is somewhat lower. The results are detailed in Table 3.

Table 3: Result metrics for the Kannada-English language pair

  Run ID   LP      LR      LF      EP      ER      EF      EQMF-All   EQMF-w/o Mix and NE   LA
  1        0.934   0.853   0.892   0.661   0.886   0.757   0.269      0.353                 0.848

5 Error analysis

Vowel clustering and certain morphological rules of the Indian languages help identification to a certain extent. However, it is observed that the recall for English tokens is relatively low. This decrease in performance is attributed to not using an N-gram approach for Roman-script language identification.

Normalization: It is suspected that social media posts with higher syntactic complexity have resulted in lower performance. For example, words such as 'back' or 'through' might have been written in shortened forms in social media posts, as in 'bak' for 'back' and 'thru' for 'through'. In such cases, more robust features to identify the language are needed.

Named entity recognition: Another observation is that NE recognition in Indian languages is significantly harder owing to the lack of case distinction and of standardized spelling. Since EQMF without NE and Mix tags is higher than the EQMF over all query tokens, the system will require additional features to identify named entities in Indian languages.

6 Conclusion

In this paper, we describe a method to identify the language of query tokens leveraging certain features such as vowel clustering and bigrams or unigrams specifically positioned in the tokens. We also analyse the results and possible reasons for failures and identify normalization and named entity recognition as crucial areas for improvement.

References

[1] P. J. Antony and K. P. Soman, Machine Transliteration for Indian Languages: A Literature Survey. In International Journal of Scientific and Engineering Research, Volume 2, Issue 12, December 2011

[2] Cavnar, W., Trenkle, J., N-gram-based text categorization. Proc. 3rd Symp. on Document Analysis and Information Retrieval (SDAIR-94)

[3] Dr. M Hanumanthappa, Rashmi S, Jyothi N M, Impact of Phonetics in Natural Language Processing: A Literature Survey, International Journal of Innovative Science, Engineering and Technology, Vol. 1 Issue 3, May 2014

[4] Shweta Vikram, Morphology: Indian Languages and European Languages, International Journal of Scientific and Research Publications, Volume 3, Issue 6, June 2013

[5] Bird, Steven, Ewan Klein, and Edward Loper (2009), Natural Language Processing with Python, O’Reilly Media

[6] Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the fifth international conference on language resources and evaluation. (2006) 1799–1802

IIIT-H System Submission for FIRE 2014 Shared Task on Transliterated Search

Irshad Ahmad Bhat, Govt. College of Engineering and Technology, Jammu, Jammu and Kashmir
Vandan Mujadia, LTRC, IIIT-H, Hyderabad, Telangana
Aniruddha Tammewar, LTRC, IIIT-H, Hyderabad, Telangana
Riyaz Ahmad Bhat, LTRC, IIIT-H, Hyderabad, Telangana
Manish Shrivastava, LTRC, IIIT-H, Hyderabad, Telangana

ABSTRACT guage Modeling, Perplexity This paper describes our submission for FIRE 2014 Shared Task on Transliterated Search. The shared task features two sub-tasks: 1. INTRODUCTION Query word labeling and Mixed-script Ad hoc retrieval for Hindi This paper describes our system1 for the FIRE 2014 Shared Task Song Lyrics. on Transliterated Search. The shared task features two sub-tasks: Query Word Labeling is on token level language identification of Query word labeling and Mixed-script Ad hoc retrieval for Hindi query words in code-mixed queries and the transliteration of identi- Song Lyrics. Subtask-I addresses the problem of language identi- fied Indian language words into their native scripts. We have devel- fication of query words in code-mixed queries and transliteration oped an SVM classifier for the token level language identification of non-English words into their native script. The task focuses on of query words and a decision tree classifier for transliteration. token level language identification in code-mixed queries in En- The second subtask for Mixed-script Ad hoc retrieval for Hindi glish and any of the given 6 Indian languages viz Bengali, Gujarati, Song Lyrics is to retrieve a ranked list of songs from a corpus of Hindi, Malayalam, Tamil and Kannada. The overall task is to 1) Hindi song lyrics given an input query in Devanagari or transliter- label words with any of the following categories: lang1, lang2, ated Roman script. We have used edit distance based query expan- mixed, Named Entities (NE), and other, and 2) transliterate iden- sion and language modeling based pruning followed by relevance tified Indic words into their native scripts. We submit predictions based re-ranking for the retrieval of relevant Hindi Song lyrics for and transliterations for the queries from the following language a given query. pairs: Hindi-English, Gujarati-English, Bengali-English, Kannada- We see that even though our approaches are not very sophis- English, Malayalam-English and Tamil-English. ticated, they perform reasonably well. Our results show that these Our language Identification system uses linear-kernel SVM clas- approaches may perform much better if more sophisticated features sifier trained on word-level features like character-based conditional or ranking is used. Both of our systems are available for download posterior probabilities (CPP) and dictionary presence for each query and can be used for research purposes. word. While for the transliteration of Indic words, we first translit- erate them to WX-notation2 (transliteration scheme for represent- Categories and Subject Descriptors ing Indian languages in ASCII) and then use Indic Converter (de- I.2.7 [Artificial Intelligence]: Natural Language Processing—Lan- veloped in-house) to convert these words to their native script. guage parsing and understanding The second subtask for Mixed-script Ad hoc retrieval for Hindi Song Lyrics is to retrieve a ranked list of songs from a corpus of Hindi song lyrics given an input query in Devanagari or translit- General Terms erated roman script. The documents used for retrieval may be in Experimentation, Languages Devanagari or in roman script. The task focuses on the retrieval of relevant documents irrespective of the script encoding of query or Keywords documents. This requires us to normalize the available documents and the input queries to a single script; in our case Roman. 
We also Language Identification, Transliteration, Information Retrieval, Lan- need to process the query to account for all the possible spelling variations for each query term. We used edit distance based query expansion followed by language modeling based pruning to tackle these issues. Finally, we retrieve Hindi Song lyrics for a query us- Permission to make digital or hard copies of all or part of this work for ing relevance based re-ranking. personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies 1 bear this notice and the full citation on the first page. To copy otherwise, to https://github.com/irshadbhat/Subtask-1-Language- republish, to post on servers or to redistribute to lists, requires prior specific Identification-and-Transliteration-in-Code-Mixed-Data permission and/or a fee. https://github.com/vmujadia/Subtask-2-Mixed-script-Ad-hoc- retrieval-for-Hindi-Song-Lyrics WOODSTOCK ’97 El Paso, Texas USA 2 Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. http://en.wikipedia.org/wiki/WX_notation As stated in the shared task description, extensive spell varia- to their native scripts. tion is the major challenge that search engines face while process- ing transliterated queries and documents. The word dhanyavad (‘thank you’ in Hindi and many other Indian languages), for in- 2.2.1 Language Identification stance, can be written in Roman script as dhanyavaad, dhanyvad, We experiment with linear kernel SVM classifiers using Liblin- danyavad, danyavaad, dhanyavada, dhanyabad and so on. Fur- ear [1]. Parameter optimization is performed using parameter tun- ther, the complexity of search may increase if faced with multiple ing based on cross-validation. We use Scikit-Learn3 package [7] to information sources having different script encoding. The mod- conduct all our experiments. ern Indian Internet user’s query might easily be interspersed with words from their native languages transliterated in Roman or in na- Our basic features for SVM classifier are: tive UTF-8 encoded scripts along with English words. Both the Dictionary Based Labels (D): Given the fact that each language problems are tackled separately in this shared task. Subtask-I deals pair consists of English words and any of the mentioned Indian lan- with the identification of language of origin of a query term for a guage words, the presence/absence of a word in large scale freely code-mixed query, while Subtask-II deals with the retrieval from available English dictionaries will be a potent factor in deciphering information sources encoded in different scripts. the source language of word. Thus, we use the presence of a word The rest of the paper is organized as follows: In Section 2, we in English dictionaries as a binary feature (0=Absence, 1=Pres- discuss the data, our methodology and the results for the Subtask-I. ence) in all our SVM classifiers. If a word is present in dictionary, Section 3 is dedicated to the Subtask-II. In this section, we discuss its dictionary flag is set to 1 denoting an English-word otherwise, our approach towards the task and present our results. We conclude flag is set to 0 representing an other language word in the given in Section 4 with some future directions. language pair. We use python’s PyEnchant-package4 with the fol- lowing dictionaries: 2. 
SUBTASK-I: QUERY WORD LABELING • en_GB: British English The subtask deals with code-mixed or code-switched Indian lan- guage data and primarily aims to address the problem of language • en_US: American English identification of words in transliterated code-mixed queries. The task features transliterated queries from 6 Indian languages namely • de_DE: German Bengali, Gujarati, Hindi, Kannada, Malayalam and Tamil code- • mixed with English. In the following, we discuss our approach and fr_FR: French experiments, and put into perspective our results on the language pairs that are featured in the shared task. The only flaw with this feature is its impotence to capture infor- mation in the following two scenarios: 2.1 Data • English words written in short-forms (quite prevalent in so- The Language Identification and Transliteration in the Code-Switched cial media), and (CS) data shared task is meant for language identification and translit- eration in 6 language pairs (henceforth LP) namely, Hindi-English • Indian language words with English like orthography e.g. (H-E), Gujarati-English (G-E), Bengali-English (B-E), Tamil-English Hindi pronoun in ‘these‘ resembles English preposition in. (T-E), Kannada-English (K-E) and Malayalam-English (M-E). For each language-L the following data is provided: Word-level CPP (P): We add two features by training separate smoothed letter-based n-gram language models for each language • Monolingual corpora of English, Hindi and Gujarati. in a language pair [4]. We compute the conditional probability us- ing these language models, to obtain lang1-CPP and lang2-CPP for • Word lists with corpus frequencies for English, Hindi, Ben- each word. gali and Gujarati. Given a word w, we compute the conditional probability corre- 5 • Word transliteration pairs for Hindi-English, Bengali-English sponding to k classes c1, c2, ... , ck as: and Gujarati-English which could be useful for training or p(ci|w) = p(w|ci) ∗ p(ci) (1) testing transliteration systems. The prior distribution p(c) of a class is estimated from the re- Apart from the above mentioned data, a development set of 1000 spective training sets show in Table 2. Each training set is used to transliterated code-mixed queries is provided for each language train a separate letter-based language model to estimate the prob- pair for tunning the parameters of the language identification and ability of word w. The language model p(w) is implemented as transliteration systems. Results are separately reported on test sets an n-gram model using the IRSTLM-Toolkit [2] with Kneser-Ney ∼ containing 1000 code-mixed queries for each language pair. Ta- smoothing. The language model is defined as: ble 1 shows some input queries and the desired outputs. ∏n 2.2 Methodology | i−1 p(w) = p(li li−k) (2) We worked out the task into two stages: Language Identifica- i=1 tion and Transliteration. In Language Identification stage, we used linear-kernel SVM classifier with context-independent (word, word- where l is a letter and k is a parameter indicating the amount of level conditional posterior probability (CPP) and dictionary pres- context used (e.g., k=4 means 5-gram model). ence/absence) features and context-sensitive (adding previous and 3http://scikit-learn.org/stable/ next word) bi-gram counts for post-processing. 
For Transliteration 4https://github.com/rfk/pyenchant/ of non-English words into their native scripts, we first transliterate 5 In our case value of k is 2 as there are 2 languages in a language pair them to WX-notation and then use Indic-converter to convert WX Input query Outputs sachin tendulkar number of centuries sachin\H tendulkar\H number\E of\E centuries\E palak paneer recipe palak\H=pAlk paneer\H=pnFr recipe\E mungeri lal ke haseen sapney mungeri\H=m\grF lal\H=lAl ke\H=k haseen\H=hsFn sapney\H=spn iguazu water fall argentina iguazu\E water\E fall\E argentina\E

Table 1: Input query with desired outputs, where L is Hindi and has to be labeled as H

Language Data Size Average Token Length features based on the F1-measure. Table (2.3) shows the differ- ent feature combinations for SVM with their F1-score. Using the Hindi 32,9091 9.19 English 94,514 4.78 optimal features, classifiers have been trained and labels predicted Gujarati 40,889 8.84 using weight matrices learned by these classifiers. Among the six Tamil 55,370 11.78 combinations of the feature sets, Table 2.3 shows that SVM clas- Malayalam 12,8118 13.18 sifier performs best with the features set P+D, achieving 92.29% Bengali 29,3240 11.08 accuracy. Kannada 579736 12.74 Finally, we transliterate non-English words to WX using deci- sion tree classifier and then convert WX to Indian native scripts for Table 2: Statistics of Training Data for Language Modeling each language pair, using the Indic-converter tool kit.

2.2.2 Post-processing with bigram-counts Features Accuracy (%) Features Accuracy (%) P 62.27 L+D 90.18 We post-process SVM results with bigram-counts, combining D 85.33 P+D 92.29 the current token with previous and next token. The bigram counts L+P 62.73 L+P+D 91.63 are learned from large English corpus for English language only. If the bigram counts of current token with previous and next tokens Table 4: Average cross-validation accuracies on the Hindi-English are high, then current token is most probably an English word and data set; P = word-level CPP and D = dictionary, L = wordlength non-English otherwise. The post-processing mainly helps in mak- ing the language identification consistent over a large sequence of 2.4 Results and Discussion text. Both the language identification and transliteration systems are 2.2.3 Transliteration evaluated against Testing data as mentioned in Section 2.1. All the results are provided on token level identification and transliter- Transliteration of Indian language words is carried out in two ation. Tables 3 show the results of our language identification and steps. First we transliterate these words to WX and then convert transliteration systems. WX to the native scripts. WX is a transliteration scheme for rep- Systems are evaluated separately for each tag in a language pair resenting Indian languages in ASCII. In WX every consonant and using Recall, Precision and F1-Score as the measures for the In- every vowel has a single mapping into Roman, which makes it easy dian language in the language pair, English tokens and translitera- to convert WX to Indian native scripts. tion. Systems are also evaluated on token level language labeling Transliterating non-English words to WX requires a training set accuracy using the relation ‘correct label pairs/(correct label pairs of Roman to WX transliteration pairs of Indian languages to built a + incorrect label pairs)’. Evaluation is also performed on EQMF transliteration model. In our training data we have word-transliteration (Exact query match fraction), EQMF but only considering language pairs for Hindi, Bengali and Gujarati but the transliterations are in identification and ETPM (Exact transliterated pair match). native scripts. We convert these native script transliterations to WX using Indic-converter. We then train an ID3 Decision tree classifier [8] for mapping non-English words to WX using the modified (Ro- 3. SUBTASK-II: HINDI SONG LYRICS RE- man to WX) word-transliterations pairs. The classifier is character TRIEVAL based, mapping each character in Roman script to WX based on its context (previous 3 and next 3 characters). After transliteration This section describes our system for the Subtask-II which is from Roman to WX, we again use Indic-converter to convert WX about the Hindi song lyrics retrieval. The proposed system is sim- to a given Indian native script. ilar to that of Gupta et al.[3]. They have used Apache Lucene IR platform for building their system, while we build our system from 2.3 Experiments the scratch following a typical architecture of a search engine. The overall system working and design is depicted in Figure 1. As with The stages mentioned in Section 2.2 are used for the language any search engine, our system also hinges on the posting lists nor- identification and transliteration task for all the language pairs. We malized using TF-IDF metric. 
The posting lists are created based carried out a series of experiments to built decision tree classifier on the structure of the provided songs lyrics documents. for transliterating Roman scripts to WX. As is stated in the Subtask-II, input query to the system can be In order to predict correct labels for Language Identification, we given in Devanagari script or it may be in transliterated Roman trained 6 SVM classifiers (one for each language pair). We ex- script with a possible spelling variation, e.g., repeated letters, re- perimented with different features in SVM to select the optimal curring syllables etc. Also, the documents (∼60000) that are pro- 6 LP, LR, LF: Token level precision, recall and F-measure for the Indian language vided for indexing contain lyrics both in Devanagari and Roman in the language pair. EP, ER, EF: Token level precision, recall and F-measure for scripts with some similar noise. Prior to document indexing, we English tokens. TP, TR, TF: Token level transliteration precision, recall, and F- measure. LA: Token level language labeling accuracy. EQMF: Exact query match had to run through a preprocessing step in order to tackle these is- fraction. −: without transliteration. ETPM: Exact transliterated pair match sues. In the following, we discuss the process of data cleaning and Language Pair BengaliEnglish GujaratiEnglish HindiEnglish KannadaEnglish MalayalamEnglish TamilEnglish LP 0.835 0.986 0.83 0.939 0.895 0.983 LR 0.83 0.868 0.749 0.926 0.963 0.987 LF 0.833 0.923 0.787 0.932 0.928 0.985 EP 0.819 0.078 0.718 0.804 0.796 0.991 ER 0.907 1 0.887 0.911 0.934 0.98 EF 0.861 0.145 0.794 0.854 0.86 0.986 TP 0.011 0.28 0.074 0 0.095 0 TR 0.181 0.243 0.357 0 0.102 0 TF 0.021 0.261 0.122 0 0.098 0 LA 0.85 0.856 0.792 0.9 0.891 0.986 EQMF All(NT) 0.383 0.387 0.143 0.429 0.383 0.714 EQMF−NE(NT) 0.479 0.413 0.255 0.555 0.525 0.714 EQMF−Mix(NT) 0.383 0.387 0.143 0.437 0.492 0.714 EQMF−Mix and NE(NT) 0.479 0.413 0.255 0.563 0.675 0.714 EQMF All 0.004 0.007 0.001 0 0.008 0 EQMF−NE 0.004 0.007 0.001 0 0.008 0 EQMF−Mix 0.004 0.007 0.001 0 0.008 0 EQMF−Mix and NE 0.004 0.007 0.001 0 0.008 0 ETPM 72/288 259/911 907/2004 0/751 90/852 0/0

Table 3: Token Level Results6

Figure 1: Mixed-script Ad hoc retrieval for Hindi Song Lyrics System Work flow data normalization. nahii.n, where symbol “.” signifies some phonetic value of it sur- rounding letters. In such documents even the case of a letter has a 3.1 Data Normalization phonetic essence. However, there are only few files with such ad- ditional information. The other documents may represent the word Data normalization is an important part of our retrieval system. jahaa.N simply as jaha or jahaan. We also observed that “D” It takes cares of noise that could affect our results. The goal of mostly corresponds to “dh” in normal script, but the mapping from normalization was to make the song lyrics uniform across docu- its uppercase to lowercase does not always need an insertion of an ments and to clean up the unnecessary and unwanted content like additional “h” following it (due to spell variations). So, we decided HTML/XML tags, punctuation marks, unwanted symbols etc. to ignore this information and removed all the special symbols and In some of the documents, song lyrics contain more information converted the whole data into lowercase. As we already mentioned, about the pronunciation like jahaa.N bharam nahii.n aur chaah the training documents are in both Devanagari and Roman scripts, As explained in the task description, there might be several ways for simplicity we converted all the Devanagari lyrics into Roman of writing the query with same intention. For instance, in Hindi using the decision tree based transliteration tool discussed in Sec- and many other Indian languages, the word dhanyavad “thank tion 2.2.3. you” can be written in Roman script as dhanyavaad, dhanyvad, The next step of the conversion handles the problem of charac- danyavad, danyavaad, dhanyavada, dhanyabad and so on. Such ters in a word repeating multiple times depending on the writing variation will have an adverse effect on the retrieval of relevant doc- styles. Consider the following two queries had kar di ap ne vs uments. So its in important, that such variation should be normal- hadd kar dii aap ne, both these queries are same but their writing ized, which is what is discussed in the next section. pattern is different. In order to handle such variation, we replace the letters which appear continuously with a single character represent- 3.3.1 Edit distance for Query expansion ing them e.g “mann” → “man”; “jahaa” → “jaha”. This also In order to address the above problem, we used Levenshtein dis- helps when some word in a song is elongated at the end eg. “saara tance, a fairly simple and commonly used technique for edit dis- jahaaaaaan”, but in the lyrics the elongation may not be present tance. First we created a dictionary of all the unique words present saara jahaan. So, to match these two, we converge both of them in the converted data. While processing a query, for each word we to sara jahan. find all possible words (biased by some rules), within an edit dis- tance of 2, present in the dictionary. To limit the number of possible 3.2 Posting list and Relevancy words, we apply some rules on edit operations which were created As we mentioned in the previous section, we transliterated all by a careful observation of the data. The general form of rules is the lyrics documents uniformly to Roman script. Once transliter- shown below: ated, these documents are then used to create posting list. 
In order to create the posting list, we first identify the structure of these If ‘x’ is present in a word, it can be replaced with ‘y’ , it documents and decide appropriate weights for each position in the can be deleted, or the word can be split at its index. structure. The terms in the title, for example, will receive the high- est weight, while the terms in the body will receive relatively lower We also consider word splitting in this module e.g. lejayenge weights. In posting list, we have indexed the following patterns can be split into le jayenge. After limiting the in-numerous possi- separately: bilities by using manually created rules, we still get 30 possibilities for each word on average. If we consider all the possibilities, even • title of the song, if the length of the query is small, say 5 words, the possible vari- ations of such query would grow up to 305, which is a very large • first line of song, number to process. To handle this problem, we apply another strat- • first line of each stanza, egy based on language modeling to limit and rank query expansion. The next section throws details on this strategy. • line with specific singer name, 3.4 Language Modeling • each line of chorus, In order to limit the number of possible queries generated using edit distance, we rank them in the order of importance by using • song line which has number at last(indication for no. of times scores given by a language model. repetition), and

• all other remaining lines. 3.4.1 Language model We use SRILM toolkit[9] to train the language Model on cleaned The appropriate weights to these patterns are assigned in the fol- lyrics corpus. We keep the language model loaded, and as we give lowing order: Title of the song > First line of song > First line a sentence to this loaded model, it returns probability and perplex- of stanzas > Each line of chorus > Song line which has number ity scores. We use the perplexity score on the generated queries at last (indication for number of times repetition) > line with spe- (variations of the original query) to rank them. cific singer name > All other remaining lines of the song. These To limit the number of new queries and to speed up the process weightes help us to compute relevance of a document to an input of ranking, we first break the input query into multiple parts. We query effectively. set the length of splits to be equal to 4 words. As mentioned ear- The identification and extraction of title field was quite trivial, lier, we get around 25 to 30 word suggestions on average for each as there are very few variations in the title field of each song in query term by edit distance method. Therefore, the number of pos- the corpus. There were, however, huge variations in other patterns sible variations for each split could reach up to 254 i.e., around or fields across documents like starting of new stanzas, prominent 400K variations for each 4-word part of the query. These generated singer name and number of times each line of a song is repeated. queries are then finally ranked using perplexity scores using the Finally, the term-document frequency counts in the posting list are language model. We observed that the overall scoring and ranking normalized using TF-IDF metric, which is a standard metric in IR process of around 400K variations takes less than a second. Then, to assign more mass to less frequent words [5]. for each part we consider top 20/25/30 variations, depending on the number of parts in the query. Using these top variations from 3.3 Query Expansion each part, we generated all the possible combinations, which are of Query expansion (QE) is the process of reformulating a seed length of the actual query. We further rank them using the language query to improve retrieval performance in information retrieval op- model to get the top 20 queries for search and retrieval. erations [10]. In the context of Indian language based transliterated song lyrics retrieval, query expansion includes identifying script of 3.5 Search seed query and expanding it in terms of spelling variation to im- After applying the query expansion techniques on the test query, prove the recall of the retrieval system. we take its top 15-20 spell variations. For each variation, song TEAM NDCG@1 NDCG@5 NDCG@5 Map MRR RECALL bits-run-2 0.7708 0.7954 0.6977 0.6421 0.8171 0.6918 iiith-run-1 0.6429 0.5262 0.5105 0.4346 0.673 0.5806 bit-run-2 0.6452 0.4918 0.4572 0.3578 0.6271 0.4822 dcu-run-2 0.4143 0.3933 0.371 ¯0.2063 ¯0.3979 0.2807

Table 5: Subtask-II Results retrieval system generates most relevant 20 song document ids ac- classification. The Journal of Machine Learning Research, cording to their relevance to the query. 9:1871–1874, 2008. For a given query, our searching module first searches for all [2] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. the possible n-grams, where n ≤ length of the query, in posting Irstlm: an open source toolkit for handling large scale list. It then retrieves the results based on cosine distance of the language models. In Interspeech, pages 1618–1621, 2008. boolean and vector space retrieval model[5]. As per the sub task-2 [3] Parth Gupta, Kalika Bali, Rafael E Banchs, Monojit definition this searching module gives score from 0-4, 4 for title or Choudhury, and Paolo Rosso. Query expansion for first line phrase match with given query, 3 for exact phrase match mixed-script information retrieval. In Proceedings of the match with given query other than title, 2 for phrase match match 37th international ACM SIGIR conference on Research & with given query, 1 for any keyword match with given query and 0 development in information retrieval, pages 677–686. ACM, for irrelevant song. 2014. [4] Naman Jain, IIIT-H LTRC, and Riyaz Ahmad Bhat. 3.6 Results and Analysis Language identification in code-switching scenario. EMNLP Table 5 shows the comparative results of 4 teams using various 2014, page 87, 2014. metrics. Although, we tried basic techniques, we could still manage [5] Christopher D Manning, Prabhakar Raghavan, and Hinrich to achieve reasonable results. While doing error analysis, we found Schütze. Introduction to information retrieval, volume 1. that in the data cleaning we could not clean the whole data prop- Cambridge university press Cambridge, 2008. erly. There were some type of HTML/XML tags that could not be [6] Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar removed as they were attached with their neighboring words. We Burget, and J Cernocky. Rnnlm-recurrent neural network also observed that some words in query were very different in terms language modeling toolkit. In Proc. of the 2011 ASRU of orthography from the original words in the lyrics, but they sound Workshop, pages 196–201, 2011. similar in terms of their pronunciation. This could not be handled [7] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, by our restrictive (2 edits) edit distance approach. We tried to make Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu as many rules as possible to be used in the edit distance module, Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, but still they were not exhaustive enough to cover all the possibili- et al. Scikit-learn: Machine learning in python. The Journal ties. Another improvement in our system would be to use a better of Machine Learning Research, 12:2825–2830, 2011. Language Modeling toolkit such as RNNLM[6]. [8] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986. 4. CONCLUSIONS [9] Andreas Stolcke et al. Srilm-an extensible language In this paper, we described our systems for the two subtasks in modeling toolkit. In INTERSPEECH, 2002. the FIRE Shared task on Transliterated Search. For the Subtask-I [10] Olga Vechtomova and Ying Wang. A study of the effect of on language identification and transliteration, we have implemented term proximity on query expansion. Journal of Information an SVM based token level language identification system and a de- Science, 32(4):324–333, 2006. 
cision tree based transliteration system. SVM uses a set of naive easily computable features guaranteeing reasonable accuracies over multiple language pairs, while decision tree classifier uses the let- ter context to transliterate a Romanized Indic word to its native script. To summarize, we achieved reasonable accuracy with an SVM classifier by employing basic features only. We found that using dictionaries is the most important feature. Other than dictio- naries, word structure, which in our system is captured by n-gram posterior probabilities, is also very helpful. In Mixed-script Ad hoc retrieval Task we used very common techniques like edit distance, query expansion, Language modeling and the standard document retrieval algorithms. The most simple yet helpful method was the shortening of words which tackles the problem of recurring characters. Instead of these very basic tricks and methodologies, our system is competitive in terms of the re- sults.

5. REFERENCES [1] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear Machine Learning Approach for Language Identification & Transliteration: Shared Task Report of IITP-TS

Deepak Kumar Gupta, Comp. Sc. & Engg. Deptt., IIT Patna, India
Shubham Kumar, Comp. Sc. & Engg. Deptt., IIT Patna, India
Asif Ekbal, Comp. Sc. & Engg. Deptt., IIT Patna, India

ABSTRACT Transliteration, especially into Roman script, is used abun- In this paper, we describe the system that we developed as dantly on the Web not only for documents, but also for user part of our participation to the FIRE-2014 Shared Task on queries that intend to search for these documents. These Transliterated Search. We participated only for Subtask 1 problems were addressed in the FIRE-2013 Shared Task on that focused on labeling the query words. The entire pro- Transliteration [5]. More recent studies show that building cess consists of the following subtasks: language identifica- computational models for the social media content is more tion of each word in the text, named entity recognition and challenging because of the nature of the mixing as well as the classification (NERC) and transliteration of the Indian lan- presence of non-standard variations in spellings and gram- guage words written in non-native scripts to the correspond- mar, and transliteration [1]. The work that we present here ing native Indic scripts. The proposed methods of language is in connection to the shared task that is being conducted identification and NERC are based on the supervised ap- as an continuation to the previous year. proaches, where we use several machine learning algorithms. We develop a transliteration framework which is based on 1.1 Task Description the modified joint source channel model. Experiments on This year, two subtasks on Transliterated Search were con- the benchmark setup show that we achieve quite encourag- ducted: first one is the query labeling, and the second task ing performance for both pairs of languages. It is also to is related to the ad-hoc retrieval task for Hindi film lyrics. be noted that we did not make use of any deep domain- We participated for the first task which is described very specific resources and/or tools, and therefore this can be briefly as follows: easily adapted to the other domains and/or languages. Subtask 1: Query Word Labeling Suppose that a query q: w1 w2 w3 . . . wn is written in the Keywords Roman script. The words, w1 w2 etc., could be standard En- Language Identification, NERC, Transliteration, Ensemble, glish words or transliterated from another language L. The Modified Joint-Source Channel Model task is to label the words as E or L depending on whether it is an English word, or a transliterated L-language word. And then, for each transliterated word, provide the correct 1. INTRODUCTION transliteration in the native script (i.e., the script which is Recent decade has seen an upsurge in the social network- used for writing L). The task also required to identify and ing and e-commerce sector witnessing an enormous growth classify the named entities of types person, location, organi- in the volume of data flowing out of media networks, which zation and abbreviation. can be used by public and private organizations alike to gain valuable insights. New forms of communication, such as micro-blogging, Tweets, status, reviews and text messag- 2. METHODOLOGY ing have emerged and become ubiquitous. These messages The overall task for query labeling consists of three ma- often are written using Roman script due to various socio- jor components, viz. Language Identification, NERC and cultural and technological reasons [4]. Many languages such Transliteration. It is to be noted that we did not make use as South and South East Asian Languages, Arabic, Russian of any domain-specific resources and/or tools for the sake of etc. 
make use of indigenous scripts while writing in text their domain-independence. Below we describe the method- forms. The process of phonetically representing the words ologies that we followed for each of these individual modules. of a language in a non-native script is called transliteration. 2.1 Language Identification The problem of language identification concerns with de- termining the language of a given word. The task can be modeled as a classification problem, where each word has to be labeled either with one of the three classes, namely Hindi (or Bengali), English and Mixed (denotes the mixed characters of English and non-Latin language scripts). Our proposed method for language identification is supervised in nature. In particular we develop the systems based on four different classifiers, namely random forest, random tree, Here the task was to identify named entities (NEs) and clas- support vector machine and decision tree. We use the sify them into the following categories: Person, Organi- Weka implementations1 for these classifiers. In order to fur- zation, Location and Abbreviation. We use machine ther improve the performance we construct an ensemble by learning model to recognize the first three NE types, and combining the decisions of all the classifiers using majority for the last one we used heuristics. It is to be noted that voting. We followed the similar approach for both Hindi- there were many inconsistencies in annotation, and hence we English and Bangla-English pairs. The features that we im- pre-processed the datasets to maintain uniformity. In order plemented for language identification are described below in to denote the boundary of NEs we use the BIO encoding brief: scheme 2. We implement the following features for NERC.

1. Character n-gram: Character n-gram is a contigu- 1. Local context: Local contexts that span the preced- ous sequence of n character extracted from a given ing and following few tokens of the current word are word. We extract character n-grams of length one used as the features. Here we use the previous two and (unigram), two (bigram) and three (trigram), and use next two tokens as the features. these as features of the classifiers. 2. Character n-gram: Similar to the language identifi- 2. Context word: Local contexts help to identify the cation we use n-grams of length upto 5 as the features. type of the current word. We use the contexts of pre- vious two and next two words as features. 3. Prefix and Suffix: Prefix and suffix of fixed length character sequences (here, 3) are stripped from each 3. Word normalization : Words are normalized in or- token and used as the features of classifier. der to capture the similarity between two different words that share some common properties. Each cap- 4. Word normalization: This feature is defined exactly italized letter is replaced by ‘A’, small by ’a’ and num- in the same way as we did for language identification. ber by ’0’. 5. WordClassFeature: This feature was defined to en- 4. Gazetteer based feature : We compile a list of sure that the words having similar structures belong Hindi, Bengali and English words from the training to the same class. In the first step we normalize all datasets. A feature vector of length two (representing the words following the process as mentioned above. the respective gazetteer for the language pair: Hindi- Thereafter, consecutive same characters are squeezed English or Bangla-English). Now for each token we set into a single character. For example, the normalized a feature value equal to ‘1’ if it is present in the respec- word AAAaaa is converted to Aa. We found this fea- tive gazetteer, otherwise ‘0’. Hence for the words that ture to be effective for the biomedical domain, and we appear in both the gazetteers, the feature vector will directly adapted this without any modification. De- take the values of 1 in both the bit positions. Recent tailed sensitivity analysis might be useful to study its studies also suggest that gazetteer based features can effectiveness for the current domain. be effectively used for the language identification [3]. 6. Typographic features: We define a set of features 5. InitCap: This feature checks whether the current to- depending upon the Typographic constructions of the ken starts with a capital letter. words. We implement the following four features: All- Caps (whether the current word is made up of all cap- 6. InitPunDigit: We define a binary-valued feature that italized letters), AllSmall (word is constructed with checks whether the current token starts with a punc- only uncapitalized characters), InitCap (word starts tuation or digit. with a capital letter) and DigitAlpha (word contains 7. DigitAlpha: We define this feature in such a way that digits and alphabets). checks whether the current token is alphanumeric. 2.3 Transliteration 8. Contains# symbol: We define the feature that checks A transliteration system takes as input a character string in whether the word in the surrounding context contains the source language and generates a character string in the the symbol #. target language as output. 
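Before moving to transliteration, the majority-voting ensemble used for language identification (Section 2.1) can be sketched in a few lines. This is only an illustrative scikit-learn stand-in for the Weka classifiers the authors describe (random forest, random tree, SVM and decision tree), it uses only the character n-gram feature for brevity, and the training-data variables are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# Placeholder training data: words labeled H (Hindi), E (English) or M (mixed).
train_words = ["ghar", "paani", "school", "water", "2day"]
train_labels = ["H", "H", "E", "E", "M"]

ensemble = make_pipeline(
    # character uni-, bi- and tri-grams, as described in Section 2.1
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("rt", ExtraTreeClassifier()),   # rough analogue of Weka's random tree
            ("svm", LinearSVC()),
            ("dt", DecisionTreeClassifier()),
        ],
        voting="hard",                       # simple majority voting over the four classifiers
    ),
)
ensemble.fit(train_words, train_labels)
print(ensemble.predict(["khana", "table"]))
```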
The last three features help to recognize tokens which are mixed in nature (i.e., do not belong to Hindi, English or Bangla). Some examples are: 2mar, #lol, (rozana, etc.

The transliteration algorithm [2] that we used here is conceptualized as two levels of decoding: segmenting the source and target language strings into transliteration units (TUs), and defining an appropriate mapping between the source and target TUs by resolving different combinations of alignments and unit mappings. The TU is defined based on a regular expression.
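The TU patterns used in this work (C*V* on the English side and C+M on the Indic side, described further below) can be realized with simple regular expressions. A minimal sketch follows; the exact character classes and the handling of conjuncts are not specified in the paper, so the sets below are assumptions.

```python
import re

EN_VOWELS = "aeiou"   # assumed vowel set

def english_tus(word):
    """Split a Roman-script word into TUs of the pattern C*V* (consonants then vowels)."""
    tus = re.findall(r"[^{v}]*[{v}]*".format(v=EN_VOWELS), word.lower())
    return [t for t in tus if t]

# Rough Devanagari TU of the pattern C+M: an independent vowel or consonant
# followed by optional matras/signs; conjunct clusters are not merged here.
DEV_TU = re.compile(r"[\u0905-\u0939][\u0900-\u0903\u093A-\u094D]*")

def hindi_tus(word):
    return DEV_TU.findall(word)

print(english_tus("bharat"))   # ['bha', 'ra', 't']
```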

2.2 Named Entity Recognition and Classifica- For K alligned TUs (X: source TU and T: target TU), we tion have Named Entity Recognition and Classification (NERC) in an unstructured texts such as facebook, blogs etc. are more P (X,T ) = P (x1, x2 . . . xk, t1, t2 . . . tk) challenging compared to the traditional news-wire domains. 2B, I and O denote the beginning, intermediate and outside 1http://www.cs.waikato.ac.nz/ml/weka/ the NEs. = P (1, 2,... k) 2.1 T<- Model-II (t) k k−1 = Πk=1 P(k | 1 ) 2.2 If T contains null value We implement a number of transliteration models that can 2.2.1 T<- Model-II withAlignment(t) generate the original Indian language word (i.e., Indic script) from the given English transliteration written in Roman • Step 3: If T contains null value script. The Indic word is divided into TUs that have the 3.1 T<- Model-I (t) pattern C+M, where C represents a vowel or a consonant 3.2 If T contains null value or conjunct and M represents the vowel modifier or matra. 3.2.1 T<- Model-I withAlignment(t) An English word is divided into TUs that have the pattern C*V*, where C represents a consonant and V represents a • Step 4: return (T) vowel [2]. The process considers contextual information in both the source and target sides in the form of collocated TUs, and compute the probability of transliteration from 3. EXPERIMENTS AND DISCUSSIONS each source TU to various target candidate TUs and chooses 3.1 Datasets the one with maximum probability. The most appropriate We submitted runs for the two pair of languages, namely mappings between the source and target TUs are learned au- Hindi-English and Bangla-English. For language identifica- tomatically from the bilingual training corpus. The training tion the FIRE 2014 organizers provided three documents for process yields a kind of decision-tree list that outputs for Hindi-English pair and two documents for Bangla-English each source TU a set of possible target TUs along with the pair. For each of the language pairs, the individual docu- probability of each decision obtained from a training cor- ments are merged together into a single file for training. The pus. The transliteration of the input word is obtained using training sets consist of 1,004 (20,658 tokens) and 800 sen- direct orthographic mapping by identifying the equivalent tences (27,969 tokens) for Hindi-English and Bangla-English target TU for each source TU in the input and then placing language pairs, respectively. The test sets consist of 32,270 the target TUs in order. We implemented all the six mod- and 25,698 tokens for Hindi-English and Bangla-English, re- els as proposed in [2]. Based on some experiments on the spectively. For training of transliteration algorithm we make held out datasets we selected the following three models to use of 54,791 Hindi-English and 19,582 Bangla-English par- submit our runs: allel examples. Details of these datasets can be found in [6]. Model-I: This is a kind of monogram model where no con- text is considered, i.e. P(X,T) = Πk P( ) k=1 k 3.2 Results and Analysis Model-II: This model is built by considering next source In this section we report the results that we obtained for TU as context. query word labeling. We submitted three runs which are defined as below: k P(X,T) = Πk=1 P(k | xk+1 )
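The probability expressions in this subsection (including the Model-III formula just below) are garbled by PDF extraction. Based on the surrounding prose and on the modified joint source-channel model of [2], a cleaned-up rendering is the following, where the ⟨x,t⟩_k notation for the k-th aligned source/target TU pair is assumed from that paper:

```latex
% General joint source-channel decomposition over K aligned TU pairs
P(X,T) = P(x_1,\dots,x_K,\, t_1,\dots,t_K)
       = \prod_{k=1}^{K} P\!\left(\langle x,t\rangle_k \mid \langle x,t\rangle_1^{k-1}\right)

% Model-I: monogram model, no context
P(X,T) = \prod_{k=1}^{K} P\!\left(\langle x,t\rangle_k\right)

% Model-II: next source TU as context
P(X,T) = \prod_{k=1}^{K} P\!\left(\langle x,t\rangle_k \mid x_{k+1}\right)

% Model-III: previous TU pair and next source TU as context
P(X,T) = \prod_{k=1}^{K} P\!\left(\langle x,t\rangle_k \mid \langle x,t\rangle_{k-1},\, x_{k+1}\right)
```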

Model-III: This model incorporates the previous and the Run-1: For language identification and NERC, we con- next TUs in the source and the previous target TU as the struct the ensembles using majority voting. If the token context. is labeled as any native language (Hindi or Bangla) then we perform the transliteration for that token. k P(X,T) = Πk=1 P(k | k−1, xk+1) Run-2: In this run we perform language identification by The overall transliteration process attempts to produce the majority ensemble, and NERC by SMO. The word labeled best output for the given input word using Model-III. If the as native language is transliterated accordingly. transliteration is not obtained then we consult Model-II and then Model-I in sequence. If none of these models produces Run-3: In this run both the language identification and the output then we consider a literal transliteration model NERC are carried out using SMO. Transliteration is done developed using a dictionary. This process is shown below: following the same method. Input: Token (t) which is labeled as L3 in Language identification Output: Transliteration (T) of given token Run ID LP LR LF EP ER EF LA Run-1 0.920 0.843 0.880 0.883 0.932 0.907 0.886 Run-2 0.922 0.843 0.881 0.884 0.931 0.907 0.886 Run-3 0.882 0.841 0.861 0.88 0.896 0.888 0.870 • Step 1:T<- Model-III (t)4 1.1 If T contains null value 1.1.1 T<- Model-III withAlignment(t)5 Table 1: Result for language identification of Bangla-English. Here, LP-Language precision, LR- • Step 2: If T contains null value Language recall, LF-Language FScore, EP-English 3Denotes Hindi(H) or Bengali(B) precision, ER-English recall, EF-English FScore, 4each model takes input token and divide it into several non LA-Labelling accuracy native TU and give the native TU for each of them 5each model withAlignment takes input token and align the source TU with target TU Run ID LP LR LF EP ER EF LA Run ID EQMF ALL TP TR TF ETPM Run-1 0.921 0.895 0.908 0.89 0.908 0.899 0.879 Run-1 0.005 0.146 0.76 0.244 1933/2306 Run-2 0.921 0.893 0.907 0.89 0.908 0.899 0.878 Run-2 0.004 0.146 0.76 0.244 1931/2301 Run-3 0.905 0.865 0.885 0.86 0.886 0.873 0.857 Run-3 0.004 0.143 0.736 0.24 1871/2226

Table 2: Results for language identification of Hindi-English.
Table 4: Results of transliteration for Hindi-English.

Table 3 (data; caption below):
Run ID   EQMF ALL   TP      TR      TF      ETPM
Run-1    0.005      0.039   0.574   0.073   228/337
Run-2    0.005      0.039   0.574   0.073   228/337
Run-3    0.005      0.038   0.582   0.071   231/344

Table 5 (data; caption below):
Type                  Words      Predicted   Reference
Short words           thrgh      H           E
Ambiguous words       the;ate    E;E         H;H
Erroneous words       implemnt   H           E
Mixed numeral words   2mar       O           B

Table 3: Results of transliteration for Bangla-English. Here, EQMF ALL = exact query match fraction, TP = transliteration precision, TR = transliteration recall, TF = transliteration F-score, ETPM = exact transliteration pair match.

Table 5: Language labeling errors. Here, H = Hindi, E = English, O = others.

The systems were evaluated using the metrics defined in [5]. Overall results for language identification are reported in Table 1 and Table 2 for the Bangla-English and Hindi-English pairs, respectively. The evaluation shows that the first two runs achieve nearly the same F-scores. Experimental results for transliteration are reported in Table 3 and Table 4 for Bangla-English and Hindi-English, respectively. Comparisons with the other submitted systems show that we achieve performance levels at the upper end. For Hindi we obtain precision, recall and F-score of 92.10%, 89.5% and 90.80%, which are very close to the best performing system (inferior by less than one F-score point). For English, our system attains precision, recall and F-score values of 89.00%, 90.80% and 89.90%, respectively. This is lower than the best system by only 0.2 F-score points. For the Bangla-English pair our system also performs with impressive F-score values. In a separate experiment we evaluated the NERC model. We obtain precision, recall and F-score values of 61.00%, 44.25% and 51.25%, respectively, for the Hindi-English pair using the ensemble framework. For the Bangla-English pair it yields precision, recall and F-score values of 54.25%, 43.25% and 48.25%, respectively.

Our close investigation of the results shows that the model developed for language identification suffers due to very short word forms, ambiguities, erroneous words and mixed wordforms. Short words refer to tokens written in their shortened forms. Ambiguities arise from words that have meanings in both native and non-native scripts. Errors are also encountered because of wrong spellings. The alphanumeric wordforms (i.e., mixed) also often contribute to the overall errors. The errors are shown in Table 5.

The transliteration model makes most of its errors because of spelling variation (e.g., Pahalaa: phAlA vs. phlA). Inconsistent annotation is another potential source of errors.

4. CONCLUSION
In this paper we presented a brief overview of the system that we developed as part of our participation in the query labeling subtask of transliterated search. The proposed method classifies the input query words by their native language and back-transliterates non-English words into their original script. We have used several classification techniques for solving the problems of language identification and NERC. For transliteration we have used a modified joint source-channel model. We submitted three runs. Comparisons with the other systems show that our system achieves quite encouraging performance for both pairs, Hindi-English and Bangla-English.

Our detailed analysis suggests that the language identification module suffers most from the presence of very short wordforms, ambiguities and alphanumeric words. Errors encountered in the transliteration model can be reduced considerably by developing a method for handling spelling variations.

5. REFERENCES
[1] K. Bali, J. Sharma, M. Choudhury, and Y. Vyas. "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In First Workshop on Computational Approaches to Code Switching, EMNLP 2014, page 116, 2014.
[2] A. Ekbal, S. K. Naskar, and S. Bandyopadhyay. A modified joint source-channel model for transliteration. In Proceedings of COLING/ACL, pages 191–198, 2006.
[3] C.-C. Lin, W. Ammar, L. Levin, and C. Dyer. The CMU submission for the shared task on language identification in code-switched data. page 80, 2014.
[4] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Challenges in designing input method editors for Indian languages: The role of word-origin and context. In Advances in Text Input Methods (WTIM 2011), pages 1–9, 2011.
[5] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Overview and datasets of FIRE 2013 track on transliterated search. In FIRE-13 Workshop, 2013.
[6] S. V.B., M. Choudhury, K. Bali, T. Dasgupta, and A. Basu. Resource creation for training and testing of transliteration systems for Indian languages. pages 2902–2905.

Indian Statistical Institute Kolkata at “Transliterated Search FIRE 2014”

Ayan Bandyopadhyay, Dwaipayan Roy
ISI Kolkata
[email protected], [email protected]

1. INTRODUCTION
A large number of languages, including Arabic, Russian, and most of the South and South East Asian languages, are written using indigenous scripts, yet user generated content in these languages on the Web (such as tweets and blogs) is often written using the Roman script for various socio-cultural and technological reasons, such as familiarity with English and QWERTY keyboards [1]. This process of phonetically representing the words of a language in the Roman script is called transliteration [3] [2]. In social media we very often use the Roman script to express our thoughts in our native, non-English language. The result is a text that is written with English characters but is not an English text. In this paper, we present our work on transliterated search on the dataset provided by the organizers of the FIRE transliterated search track. We participated in Subtask-1, which consists of two jobs: given a text,

1. tag the non-English and English words; then
2. write the non-English words in the native language script.

2. OUR APPROACH
We created a simple two-column dictionary map table for non-English tokens from the given training data set. The columns contain, respectively, a non-English Roman-script token and the corresponding Unicode Bengali character sequence (R2B). We then followed a two-stage approach. For each token in the test set:

• Dictionary and R2B look-up:
  – If the token is in R2B, tag it as \B and use its transliteration.
  – If the token is present in the English dictionary, tag it as \E.
• If the token is present in neither the English dictionary nor R2B, tag it as \B and use a rule based approach to transliterate it.

There are many tokens containing punctuation marks and numbers, and there are some tokens which are popular abbreviations in microblogs. These tokens are not detected as English or Bengali tokens and are tagged as other. Thus some tokens, though candidate Bengali tokens, are not transliterated by our rule based approach.

2.1 Rule Based Approach
We have used a very simple rule based approach.

1. There are some blind rules; we apply them wherever they occur, e.g. k ⇒ ক, kh ⇒ খ, etc.

2. Then some conditional rules are applied: if we get an English vowel, or more than the defined number of consecutive vowels, we check whether the previous character or character sequence follows a blind rule or not.

  • If it follows a blind rule, the English vowel(s) become a Bengali vowel-modifier, e.g. kak ⇒ কাক. Here "k" has a defined blind rule, so the English vowel "a" becomes a Bengali vowel-modifier.
  • If it does not follow a blind rule, the English vowel(s) become a Bengali vowel, e.g. aam ⇒ আম. Here there is no blind-rule character before "aa", so "aa" becomes a Bengali vowel.

3. If more than one blind-rule character sits beside another, each follows its assigned blind rule individually, but a particular compound-forming character, "◌্", is inserted between them. That compound-forming character turns them into a compound character, e.g. pranam ⇒ প্রণাম.
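As a rough illustration of the two-stage scheme above, a greedy sketch follows. The R2B map, the English word list and the rule tables are tiny placeholders, not the resources actually built from the training data, and compound (hasanta) handling and multi-vowel sequences such as "aa" are omitted for brevity.

```python
# Placeholder resources; the real tables were built from the FIRE training data.
r2b = {"ami": "আমি", "kak": "কাক"}            # Roman token -> Bengali transliteration
english_dict = {"the", "is", "water"}

blind_rules = {"kh": "খ", "k": "ক", "m": "ম", "n": "ন", "r": "র", "p": "প"}
matra = {"a": "া", "i": "ি", "u": "ু", "e": "ে", "o": "ো"}       # vowel modifiers
full_vowel = {"a": "অ", "i": "ই", "u": "উ", "e": "এ", "o": "ও"}  # independent vowels

def rule_transliterate(word):
    """Greedy sketch of the blind/conditional rules described in Section 2.1."""
    out, i, after_consonant = [], 0, False
    while i < len(word):
        two, one = word[i:i + 2], word[i]
        if two in blind_rules:                 # try the longer blind rule first
            out.append(blind_rules[two])
            i, after_consonant = i + 2, True
        elif one in blind_rules:
            out.append(blind_rules[one])
            i, after_consonant = i + 1, True
        elif one in matra:
            # conditional rule: vowel modifier after a blind-rule consonant,
            # full (independent) vowel otherwise
            out.append(matra[one] if after_consonant else full_vowel[one])
            i, after_consonant = i + 1, False
        else:
            out.append(one)
            i, after_consonant = i + 1, False
    return "".join(out)

def label_token(token):
    """Stage 1: dictionary/R2B look-up. Stage 2: rule-based fallback (assumed Bengali)."""
    t = token.lower()
    if t in r2b:
        return "B", r2b[t]
    if t in english_dict:
        return "E", token
    return "B", rule_transliterate(t)

print(label_token("kak"))   # ('B', 'কাক'), via the R2B look-up
print(label_token("nam"))   # ('B', 'নাম'), via the rules
```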

3. RESULTS
Below some used abbreviations are described:

• LP, LR, LF: Token level precision, recall and F-measure for the Indian language in the language pair.
• EP, ER, EF: Token level precision, recall and F-measure for English tokens.
• TP, TR, TF: Token level transliteration precision, recall, and F-measure.
• LA: Token level language labeling accuracy = correct label pairs / (correct label pairs + incorrect label pairs).
• EQMF: Exact query match fraction, as defined in [4].
• EQMF (without transliteration): EQMF as defined in [4], but only considering language identification.
• ETPM: Exact transliterated pair match, as defined in [4].
• NE: Named entities.
• MIX: MIX tags.

Table 1: English-Bengali Language Collection Information
Teams   Runs   Language Tokens   English Tokens   NE Tokens   MIX Tokens
6       11     6392              7215             429         0

Table 2: English-Bengali Language Collection Information
Number of rows                                                397
Total number of tokens on which evaluation was carried out    14036
Tokens with transliterations                                  739

Table 3: Our Results
LP                                              0.848
LR                                              0.823
LF                                              0.835
EP                                              0.863
ER                                              0.901
EF                                              0.882
TP                                              0.028
TR                                              0.438
TF                                              0.053
LA                                              0.862
EQMF All (no transliteration)                   0.378
EQMF without NE (no transliteration)            0.484
EQMF without Mix (no transliteration)           0.378
EQMF without Mix and NE (no transliteration)    0.484
EQMF All                                        0.004
EQMF without NE                                 0.004
EQMF without Mix                                0.004
EQMF without Mix and NE                         0.004
ETPM                                            174/309
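The token- and query-level metrics listed above are straightforward to compute; a minimal sketch follows, assuming the gold and predicted labels are given as parallel lists (the input format is an assumption).

```python
def labeling_accuracy(gold, pred):
    """LA = correct label pairs / (correct + incorrect label pairs)."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

def exact_query_match_fraction(gold_queries, pred_queries):
    """EQMF: fraction of queries whose output matches the reference exactly.
    For the 'without transliteration' variant, compare the labels only."""
    exact = sum(g == p for g, p in zip(gold_queries, pred_queries))
    return exact / len(gold_queries)

print(labeling_accuracy(["B", "E", "B", "E"], ["B", "E", "E", "E"]))   # 0.75
```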

4. ERROR ANALYSIS
There are several areas that can be improved. If a token consists of more than one word joined by punctuation, we should process those words individually; if every word falls into one language group, we could experiment with putting the whole token into that same language group. We can also try other traditional translation strategies. Recognition of abbreviations and detection of frequently used SMS/microblogging terms should be improved. Spelling correction and/or prediction for the non-English language can also improve overall performance. Character sequences that are invalid according to Bengali orthography should be handled as well.

5. REFERENCES
[1] Umair Z Ahmed, Kalika Bali, Monojit Choudhury, and Sowmya V. B. Challenges in designing input method editors for Indian languages: The role of word-origin and context. In Proceedings of the IJCNLP Workshop on Advances in Text Input Methods. Association for Computational Linguistics, November 2011.
[2] Antony P J and Soman K P. Machine transliteration for Indian languages: A literature survey.
[3] Kevin Knight and Jonathan Graehl. Machine transliteration. Computational Linguistics, 24(4):599–612, 1998.
[4] R. S. Roy, M. Choudhury, P. Majumder, and K. Agarwal. Overview and datasets of FIRE 2013 track on transliterated search. FIRE @ ISM, Number 1, December 2013.

ISM@FIRE-2014: Shared task on Transliterated Search

Dinesh Kumar Prabhakar, Sukomal Pal
Indian School of Mines, Dhanbad, Jharkhand, India 826004
[email protected], [email protected]

ABSTRACT ture work. This paper describes approaches that we have used for offi- cial submission of FIRE-2014, for the Shared Task on Translit- 2. TASK DESCRIPTION erated Search. Approaches solve the problem of word level Query Word Labeling labeling. The labeling classifies class of term with it be- Input:- Let Q be the query set containing n query word longs. MaxEnt a supervised classifier is used for classifi- w (1 ≤ i ≤ n) written in Roman script. The word w ∈ cation and labeling of a word. This subtask completion is i i Q (w1, w2, . . . , wn), could be a standard English word or followed by back-transliteration of H (Hindi) labeled word. transliterated form of a Hindi word, a Named Entity, an For the back-transliteration we have used generative and acronym or others(such as gr8, lol etc). Our approach the GEN-EXT(combination of generative and extraction) ap- system should give labeled words followed by transliteration proaches. From the evaluation it is been seen that Runs of Hindi words. has performed best in some metrics like LA, LP, LF, EF Output:- Result wlt is produce in such a way that Hindi etc. In some metrics runs performed 2nd and 3rd best, in word w was be tagged with label H and aligned with its some other performance was poor as well. transliteration. Other than Hindi word all are only tagged with their corresponding label. The Named entity was tagged 1. INTRODUCTION with P, L and O which refers the name of a person, location The user’s count on Social sites are increasingly becom- and organization respectively and an acronym was tagged ing higher. They write messages (specially blog and post) with label A. There are some word they do not come under on site (such as Twitter, Facebook et.), in their own lan- these categories are considered as others and tagged with guage preferably using Roman scripts. These post might label O. comprise terms of Non-English (or terms of user’s native ) language, a simple English word, a Named Entity (NE) or 3. APPROACHES an acronym. This transformation of a word of a language into a non-native or foreign script is called transliteration [5]. Our attempt to solve the problem Query word Labeling is However more than one transformations are possible for a phased in the order as Word Labeling followed by Translit- non-English word in Roman representation. These multiple eration of H (Hindi) labeled words. transformations differ in spelling. Spelling variation is one of the serious issue in back-transliteration. 3.1 Word Labeling Transliterations, using Roman script, are used more fre- To accomplish this subtask, word level classification is quently on the Web not only to create documents as well as needed. The word can be classified manually or using any for generating user’s queries to search required document. classifier. But manual classification is not feasible for the Last year transliterated search in Indian languages was in- large set of data. For the classification of terms at word clude as a pilot task. The task, Shared Task on Translit- level we have used Stanford classifier (MaxEnt - a super- erated Search had two subtasks:first, Query Word Label- vised classifier) [4]. ing and second Multi-script Ad hoc retrieval for Hindi Song The classification is completed in two phase: first, train Lyrics. Like previous year this this year also task is divided the classifier and then classify word based on extracted fea- into two same subtasks. Moreover, some more issues were tures. 
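The word-level labeling step uses the Stanford MaxEnt classifier, a Java tool trained on two-column word/tag files. Purely as an illustration, an equivalent maximum-entropy (logistic regression) pipeline can be written with scikit-learn; the file name, feature settings and example outputs below are assumptions, not the authors' exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def read_column_file(path):
    """Read 'word  tag' lines, i.e. the column format the classifier is trained on."""
    words, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                w, t = line.split()[:2]
                words.append(w)
                tags.append(t)
    return words, tags

words, tags = read_column_file("ism1_train.tsv")   # hypothetical file name

labeler = make_pipeline(
    # char n-grams of length 1-4; 'char_wb' also yields word-boundary (prefix/suffix) n-grams
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 4), lowercase=False),
    LogisticRegression(max_iter=1000),             # logistic regression = maximum entropy
)
labeler.fit(words, tags)
print(labeler.predict(["ghar", "house", "delhi"]))
```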
considered to take into account such as Named entity tag- ging (P, L and O based on types). 3.1.1 Training As the task is divided into two part subtask-1 and subtask- The FIRE-2013 & 2014 training dataset we have used to 2, where we participated in former one only. Again this create two training files where second file is subset of first file. subtask is divided into two phase: first phase classification of These file containing list of words w from training dataset query words and second transliteration of Hindi language’s aligned with their class tags ( such as P, L, O, etc.). First words. file contains 23956 labeled terms from FIRE-2014 training Next in the Section 2 we discussed task description. Sec- data and second file contains 262605 words comprises addi- tion 3 describes our approaches for labeling and transliter- tional 30823 [2] Hindi words and 207824 English words from ation. In Section 4 we have discussed results and analyze FIRE-2013 dataset. These files stored data in column for- errors. Section 5 concludes the paper with directions of fu- mat ( Stanford classifier’s required format). Using both the training files classifier was train and the trained classifiers Algorithm 1 GEN-EXT Transliteration were named as ISM-1 and ISM-2 respectively. 1: procedure get lab translit(wl) . (∀wl ∈ Ql) 2: {w, lbl} ← Split(wl) Features. 3: if (lbl != H) then Different features were set in property file for classifiers 4: wlt ← {w, /, lbl} are listed in the box. 5: else if w ∈ EHP then # Features 6: T ← tansliteration useClassFeature=true 7: wlt ← {w, /, lbl, =,T } 1.useNGrams=true return (wlt) 1.usePrefixSuffixNGrams=true 8: else 1.maxNGramLeng=4 9: T ← TG(w) . Call Generative Function TG(w) 1.minNGramLeng=1 10: wlt ← {w, /, lbl, =,T }

1.binnedLengths=10,20,30 11: return (wlt) # Optimization intern=true sigma=3 Extracting the transliteration of a word is not feasible useQN=true for resource scare languages. So, the transliteration genera- QNsize=15 tion remains the only fair option[1]. The transliteration can tolerance=1e-4 be generated based on phonemes, grapheme, syllables, com- Same property file is used for both the training file. Total bined or hybrid techniques. We have used a rule (grapheme) six tags were identified during the training. Those tags are based system for automatic transliteration generation. The H, E, P,L,O and A. system work based on Indic character mappings [7]. 3.1.2 Classification In this approach, procedure Split(wl) split word wl and The given test data file was parsed on trained classifiers label lbl is checked using a simple string matching. If w is for classification. Words of test data classified in six classes non-Hindi term, then insert the character ‘/’ between w and such as Hindi word, English word, proper name (name of the lbl, produced result as wlt. Otherwise, w is a Hindi term person, location or organization), acronym and others and and the T is generated using procedure TG(w) [7]. After the words were labeled with different class tags H, E, P,L,O and generating T , the character ‘/’ was inserted between w and A for different classes respectively. lbl, followed by = T . Finally, wlt is produced as the result. The steps are given in Algorithm 2. Since this approach is 3.2 Transliteration rule based, so it may give inexact transliteration. Transliteration can be obtained by extraction, by gener- ation or combining both [3]. We have used later two tech- Algorithm 2 Transliteration Generation niques in our approaches for transliteration. The generative 1: procedure get lab translit(wl) . (∀wl ∈ Ql) approach represented in algorithm 2 and combine (Extrac- 2: {w, lbl} ← Split(wl) tion + Generation (GEN-EXT)) approach represented in 3: if (lbl != H ) then algorithm 1. 4: wlt ← {w, /, lbl} return (wlt) General Terms. 5: else The labeled query list Ql contain processed query words 6: T ← TG(w) . Call Generative Function TG(w) wl labeled with lbl . The EHP is a dictionary contains Hindi 7: wlt ← {w, /, lbl, =,T } words written in Roman script along with its transliteration 8: return (wlt) in Devanagari script. Procedure Split(wl) splits wl. We have used TG(w) - a Hindi-English Indic character mapping system (RomaDeva) for automatic transliteration genera- 3.3 Post-processing tion [7]. Since ‘\’ is not allowed in programming so we used ‘/’. To Algorithm 1. bring the result in proper format we replaced later character by former character in our outputs. The procedure select a word wl form Ql and the label is checked using a simple string matching. If the word is not a Hindi term (i.e. w is not labeled with H ), then insert ‘/’ be- 4. RESULT AND DISCUSSION tween wl and lbl which produce the result as wlt. Otherwise We have submitted total three Runs namely Run1, Run2 W is a Hindi term, and it will be searched in the translit- and Run3. In this section first we discussed results and eration pair dictionary EHP . During the search process, if then analysed relate issues. The Runs were evaluated using w ∈ EHP then the corresponding transliteration T will be following performance metrics [6]: extracted and wlt produced as result. Otherwise, w∈ / EHP LP, LR and LF are the Token level precision, recall and i.e. 
out-of-EHP dictionary, then the transliteration ‘T ’ will F-measure for the Indian language in the language pair. be generated by using the procedure TG(w). After the gen- EP, ER and EF are the Token level precision, recall and F- erating T , the character ‘/’ was inserted between w and lbl, measure for English tokens. TP, TR and TF are the Token followed by = T . Finally, wlt is produced as the result. level transliteration precision, recall, and F-measure. EQMF (Exact query match fraction) as defined in [6], EQMF (with- Algorithm 2. out transliteration) as defined in [6], but only considering language identification. ETPM (Exact transliterated pair [2] Gupta, K., Choudhury, M., and Bali, K. Mining match) as defined in [6]. hindi-english transliteration pairs from online hindi Correct label pair LA = (Correct label pairs+Incorrect label pairs) lyrics. In LREC (2012), pp. 2459–2465. Relative scores of various metrics for our runs are included [3] Karimi, S., Scholer, F., and Turpin, A. Machine in Table 1. transliteration survey. ACM Computing Surveys (CSUR) 43, 3 (2011), 17:1–46. 4.1 Results [4] Klein, D. The stanford classifier. http://http: In total we submitted 3 runs using two different approaches //nlp.stanford.edu/software/classifier.shtml, discussed in previous section. We compare the evaluation re- 2003. Online; accessed 16-06-2014. sults w.r.t. MAX and MEDIAN of all runs (including run [5] Knight, K., and Graehl, J. Machine transliteration. of other teams) submmited at FIRE-2014. Computational Linguistics 24, 4 (1998), 599–612. 4.1.1 Run1: [6] Roy, R. S., Choudhury, M., Majumder, P., and Agarwal, K. Overview and datasets of fire 2013 track For the first run Label identification was done using ISM- on transliterated search. 1 followed by back-transliteration using combined approach [7] Singh, P. RomaDeva: English(roman) to of algorithm 1. These are the position in different metrics: hindi(devanagri) transliteration tool. https: LP-1st, LF-5th, EP-9th, ER-2nd, EF-6th, LA-4th //code.google.com/p/romadeva/downloads/list, 4.1.2 Run2: 2012. Online; accessed 19-02-2014. For our second run we used ISM-2 for label identifica- tion and algorithm 2 for automatic transliteration. Run2 performed best in LF, EF, EQMF-All-NT (No transliter- ation), EQMF-wt-NE (No transliteration), EQMF-wt-Mix (No transliteration), EQMF-wt-Mix and NE-NT (No translit- eration). Other metrics position are as LP-2nd, ER-2nd, EP-4. 4.1.3 Run3: In this run ISM-2 is used to identify the label of Tokens and automatic transliteration approach in algorithm 1 is used. Run2 performed best in LF, EF, EQMF-All-NT (No transliteration), EQMF-wt-NE (No transliteration), EQMF- wt-Mix (No transliteration), EQMF-wt-Mix and NE-NT (No transliteration). Other metrics position are as LP-2nd, ER- 2nd, EP-4. On a different note, there were some errors in the transliteration pairs of training data. 4.2 Comparison We have compare our approaches score with Max. of all. We could not perform well in NEs related metrics, since our system fail to recognize NEs correctly. Possibly use of some NER tool can solve this problem. Since our system is heavily biased by the training data, we got inexact translit- eration for some terms. 5. CONCLUSIONS Our work comprises two subtasks labeling and transliter- ation. We used classifier for word labeling. Label accuracy relatively better than last year of our submission. We identi- fied some terms were incorrectly labeled. 
Perhaps this hap- pened due to two reasons, first less corpus size and second we did not used use any NER system. In the transliter- ation approaches we observed that results are inclined to extraction approaches which requires corpus of large size. By applying some learning based approaches for transliter- ation generation we are planning to improve result of our systems. 6. REFERENCES [1] Chinnakotla, M. K., Damani, O. P., and Satoskar, A. Transliteration for resource-scarce languages. ACM Transactions on Asian Language Information Processing (TALIP) 9, 4 (2010), 14. Table 1: Performance of different Runs Run ID LP LR LF EP ER EF TP TR TF LA Run1 0.942 0.852 0.895 0.823 0.942 0.878 0.131 0.636 0.217 0.872 Run2 0.93 0.892 0.911 0.871 0.932 0.901 0.07 0.363 0.118 0.886 Run3 0.93 0.892 0.911 0.871 0.932 0.901 0.122 0.628 0.204 0.886 Median 0.853 0.861 0.81 0.767 0.881 0.797 0.109 0.6335 0.1835 0.792 MAX. 0.942 0.917 0.911 0.895 0.987 0.901 0.2 0.76 0.304 0.886 Run ID EQMF- EQMF- EQMF- EQMF- EQMF- EQMF- EQMF- EQMF- ETPM All-NT wt-NE- wt-Mix- wt- All wt-NE wt-Mix wt- NT NT Mix-n- Mix- NE-NT n-NE Run1 0.269 0.409 0.269 0.409 0.001 0.001 0.001 0.001 1616/2203 Run2 0.276 0.427 0.276 0.427 0.001 0.002 0.001 0.002 924/2251 Run3 0.276 0.427 0.276 0.427 0 0.002 0 0.002 1596/2251 Median 0.194 0.285 0.194 0.285 0.001 0.003 0.001 0.003 NA MAX 0.276 0.427 0.276 0.427 0.005 0.01 0.005 0.01 NA

Figure 1: Comparison with Median and Max. Score

A Hybrid Approach for Transliterated Word-Level Language Identification: CRF with Post Processing Heuristics

Somnath Banerjee, Aniruddha Roy, Alapan Kuila, Sudip Kumar Naskar, Sivaji Bandyopadhyay
CSE Department, JU, India
[email protected], [email protected], [email protected], [email protected], [email protected]
Paolo Rosso
NLE Lab, UPV, Spain
[email protected]

ABSTRACT language in a nonnative script is called (forward) translitera- In this paper, we describe a hybrid approach for word-level tion.Especially the use of Roman script in transliteration for language (WLL) identification of Bangla words written in those languages presents serious challenges to understand- Roman script and mixed with English words as part of our ing, search and (backward) transliteration.These challenges participation in the shared task on transliterated search at include handling spelling variations, diphthongs, doubled Forum for Information Retrieval Evaluation (FIRE) in 2014. letters, reoccurring constructions, etc. A CRF based machine learning model and post-processing Language identification for documents is a well-studied heuristics are employed for the WLL identification task. In natural language problem [3].King and Abney[9] presented addition to language identification, two transliteration sys- the different aspects of this problem and focussed on the tems were built to transliterate detected Bangla words writ- problem of labeling the language of individual words within ten in Roman script into native Bangla script. The system a multilingual document.They proposed language identifica- demonstrated an overall token level language identification tion at the word level in mixed language documents instead accuracy of 0.905. The token level Bangla and English lan- of sentence level identification. guage identification F-scores are 0.899, 0.920 respectively. The last decade has seen the development of transliter- The two transliteration systems achieved accuracies of 0.062 ation systems for Asian languages. Some notable translit- and 0.037. The system presented in this paper resulted in eration systems were built for Chinese [14], Japanese [7], the best scores across almost all metrics among all the par- Korean [8], Arabic [1], etc. Transliteration systems were ticipating systems for the Bangla-English language pair. also developed for Indian languages [6, 16]. Categories and Subject Descriptors 2. TASK DEFINITION 1.2.7 [Artificial Intelligence]: Natural Language Process- A query q : < w w w ...w > is written in Roman script. ing,Language parsing and understanding 1 2 3 n The words, w1, w2, w3, ..., wn, could be standard English words or transliterated from Indian languages (IL), e.g., General Terms Bangla, Hindi, etc. The objective of the task is to iden- Experimentation, Languages tify the words as English or IL depending on whether it is a standard English word or a transliterated IL word. After Keywords labeling the words, for each transliterated word, the correct transliteration has to be provided in the native script (i.e., Word level language identification,Transliteration,code switch the script which is used for writing the IL). Names of peo- ple and places in IL should be considered as transliterated 1. INTRODUCTION entries, whenever it is a native name. Thus, the system In spite of having indigenous scripts, often Indian lan- has to transliterate the identified native names (e.g. Arund- guages (e.g., Bangla, Hindi, Tamil etc.) are written in Ro- hati Roy). Non-native names (e.g. Ruskin Bond) should be man script for user generated contents (such as blogs and skipped during labeling and are not evaluated. tweets) due to various socio-cultural and technological rea- sons.This process of phonetically representing the words of a 3. 
DATASETS AND RESOURCES This section describes the dataset that have been used in Permission to make digital or hard copies of all or part of this work for this work. The training and the test data have been con- personal or classroom use is granted without fee provided that copies are structed by using manual and automated techniques and not made or distributed for profit or commercial advantage and that copies made available to the task participants by the organizers. bear this notice and the full citation on the first page. To copy otherwise, to The training dataset consists of 800 lines. The testset con- republish, to post on servers or to redistribute to lists, requires prior specific tains 1000 sentences. permission and/or a fee. WOODSTOCK ’97 El Paso, Texas USA The following resources provided by the organizers were Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. also employed: • English word frequency list 1: contains standard dictio- prepared a small suffix-list (10 entries) under human super- nary words along with their frequencies computed from a vision from the archive (10 documents) of an online Bangla large corpus constructed from news corpora. newspaper. This feature is also used as a binary feature. • Bangla word frequency list 1 : contains Bangla words in  1 if word contains any suffix Roman script along with their frequencies computed from has suffix(word) = the Anandabazar Patrika news corpus. 0 otherwise • Bangla word transliteration pairs dataset[15]: contains Bangla-English transliteration pairs collected from different 4.1.6 Contextual Probability users in multiple setups - chat, dictation and other scenarios. This feature is very much crucial to resolve the ambiguity in the WLL identification problem. Let us consider examples given below. 4. SYSTEM DESCRIPTION • Mama take this badge off of me. We divided the overall task into two sub-problems: (a) • Ami take boli je ami bansdronir kichu agei thaki. word-level language (WLL) classification, and (b) translit- The word ‘take’ exists in the English vocabulary. How- eration of identified IL words into native script. ever, the backward transliteration of ‘take’ is a valid Bangla word. Words like ‘take’, ‘are’, ‘pore’, ‘bad’ are truly ambigu- 4.1 WLL classification Features ous words with respect to the WLL identification problem as they are valid English words as well as backward translit- 4.1.1 Character n-grams erations of valid Bangla words. In this regard, context of the Few studies [9, 5] successfully used the character n-gram word can be used to correctly identify the language for such feature and they obtained reasonable results. Therefore, fol- an ambiguous word. Therefore, we have considered this very lowing them, we also used this feature from character uni- useful feature. grams up to five-grams. After empirical study on the devel- As in the Bangla-English language identification task the opment set, we decided on the maximum length of a word label should be one from the tag-list: {English, Hindi, Bangla, to be 10 for generating the character n-grams. Therefore, if Others}, we calculate the probability of the previous word the length of the word is more than 10, then due to the fixed being English, Hindi, Bangla and Others. Thus, four prob- length vector constraint the system generates 10 unigrams abilities have been calculated for the previous word. In a and the last two characters are skipped. 
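A small sketch of two of the features follows: the fixed-length character n-gram scheme of Section 4.1.1 and the add-one smoothed contextual probability of Section 4.1.6 (the smoothed formula is given a little further below). The padding convention for words shorter than ten characters and the count structures are assumptions.

```python
from collections import Counter

MAX_LEN = 10

def char_ngram_features(word):
    """Fixed-length character n-gram features: the word is truncated to 10
    characters, giving 10 unigrams, 9 bigrams, ..., 6 five-grams = 40 n-grams."""
    w = word[:MAX_LEN].ljust(MAX_LEN, "_")     # padding character is an assumption
    feats = {}
    for n in range(1, 6):
        for i in range(MAX_LEN - n + 1):
            feats["ng{}_{}".format(n, i)] = w[i:i + n]
    feats["word"] = word                       # the entire word is also a feature
    return feats

def tag_probability(word, tag, tag_counts, word_counts, total_words):
    """Add-one (Laplace) smoothed P_tag(w) = (F_tag(w) + 1) / (F(w) + N),
    where N is the total number of words in the training corpus (Section 4.1.6)."""
    return (tag_counts[(word, tag)] + 1) / (word_counts[word] + total_words)

# Example with toy counts (Counters return 0 for unseen keys).
tag_counts = Counter({("take", "E"): 3, ("take", "B"): 2})
word_counts = Counter({"take": 5})
print(tag_probability("take", "B", tag_counts, word_counts, total_words=100))
```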
Thus the system similar way, the labeling probabilities for the next word have always generates a total of 40 n-grams, i.e., 10 unigrams, also been calculated. 9 bigrams, 8 trigrams, 7 four-grams and 6 five-grams.The The system calculates the respective probabilities as entire word is also considered as a feature. Ftag(W ) P (W ) = , where, tag is any one from the list: {E, tag F (W ) 4.1.2 Symbol character O, H, B}; Ftag(W ) = frequency of the word W belonging to A word might start with some symbol, e.g. #, @, etc.It tag; F (W ) = Frequency of word W. These frequencies are has also been observed from the training corpus that symbols counted from the training corpus. However, for few words in appear within the word itself, e.g. a***a, kankra-r, etc. the testset the respective probabilities are 0. Since we do not Sometimes the entire word is built up of a symbol, e.g. “, ?. want assign zero probability to those words, we need to as- sign some probability mass to those words using smoothing.  1 if word contains any symbol has symbol(word) = We use the simplest smoothing technique, Laplace smooth- 0 otherwise ing, which adapts the empirical counts by adding a fixed number (say, 1) to every count and thus eliminates counts of 4.1.3 Links zero. For simplicity, we use add-one smoothing. Therefore, This feature is used as a binary feature. If a word is a Ftag(W ) + 1 the adjusted formula is: P (W ) = , where, N link, then it is set to 1, otherwise it is set to 0. tag F (W ) + N = total number of words in the training corpus.  1 if word is a link is link(word) = 0 otherwise 4.2 WLL Classifier In this work, Conditional Random Field (CRF) is used to 4.1.4 Presence of Digit build the model for WLL identification classifier. We used The use of digit(s) in a word sometimes means different CRF++ toolkit 2 which is a simple, customizable, and open in the chat dialogue. For example, ‘gr8’ means ‘great’, ‘2’ source implementation of CRF. could mean ‘to’ or ‘too’. This feature is also used as binary feature. Therefore, 4.3 Post Processing  1 if word contains any digit After CRF classifier labels each word, post-processing heuris- has digit(word) = 0 otherwise tics are applied to make a rule-based decision over the out- come of the classifier. The following heuristics are employed: 4.1.5 Word suffix Rule-1: Many English words end with ‘ed’ (e.g. decided, Any language dependent feature increases the accuracy of reached, arrested, looked, etc.), but we have not found any the system for a particular language. Fixed length suffix occurrences of any Bangla word ending with that suffix in feature was used successfully by ([2]) in the Bangla named the given corpus. Therefore, an word ending with ‘ed’ and entity recognition task. To include this feature, we have having no symbol inside it is tagged as an English word. In 1http://cse.iitkgp.ac.in/resgrp/qa/fire13translit/index.html 2http://crfpp.googlecode.com/svn/trunk/doc/index.html the test corpus we found 306 such occurrences. of a word has the higher chance of the word being an En- R1: H-Tag(w)=E ; if C-Tag(w)= B or O, has suffix(w, ‘ed’)= glish/Hindi word than Bangla. E.g. torengeee, plzzzzzz, etc. true and w 6∈ S (2) Repetition of a character more than twice in the mid- Where, C-Tag(w)=Classifier’s output, H-Tag(w)=Heuristic dle of a word has the higher chance of the word being a based output, has suffix(w, s)= word ends with suffix s, and Bangla word than English. E.g. kisssob, oneeek, etc. 
S = set of special character , E = English tag, B = Bangla (3) If a word satisfies both condition (1) and (2), then the tag, O = Others tag. word is more likely to be an English word. E.g. muuuuaaah- hhhhhhh. Rule-2: An English word may end with ‘ly’ suffix also, The following rules are employed: e.g. thoughtfully, anxiously, unfriendly, etc. It has been ob- Case-1: R10a: H-Tag(w) = E ; if C-Tag(w) = B or O or served in the test dataset that few English words were not H, end repeat(ch) >= 3 and w 6∈ S written in correct spelling and they were mis-classified as Case-2: R10b: H-Tag(w) = B ; if C-Tag(w) = E or O or Bangla words, e.g. lvly, xactly, physicaly, etc. These words H, middle repeat(ch) >= 3 and w 6∈ S are corrected by applying this rule. Case-3: R10c: H-Tag(w) = E ; if C-Tag(w) = B or H or R2: H-Tag(w)=E ; if C-Tag(w)= B or O, has suffix(w, ‘ly’)= O, end repeat(ch) >= 3 and middle repeat(ch) >= 3 and w true and w 6∈ S 6∈ S Rule-3: It was also observed that unlike English words (e.g. Rule-11: This rule is also very much straightforward. If evening, kissing, playing, etc.) no Bangla words end with a word contains any substring from the list: {www., http:, ‘ing’ suffix in the training corpus. We found 316 such oc- https::}, then the word is tagged as Others. currences in testset, but some occurrences are not tagged as R11: H-Tag(w) = O ; if C-Tag(w) = B or E or H, and English because those words start with ‘#’ (e.g. #engineer- contains(w) = www.|http:|https:: ing). This rule was able to correct some spelling errors such as luking, nthing, njoying, etc. R3: H-Tag(w)=E ; if C-Tag(w)= B or O, has suffix(w, ‘ing’)= 5. TRANSLITERATION SYSTEM true and w 6∈ S For transliterating the detected Romanized Bangla words, Rule-4: The use of apostrophe s (i.e.,’s) is very common we built our transliteration system based on the state-of-the- in English words, e.g. women’s, uncle’s etc. In the test art phrase-based statistical machine translation (PB-SMT) dataset, we found 73 use of it. model [13] using the Moses toolkit [12]. PB-SMT is a ma- R4: H-Tag(w)=E ; if C-Tag(w)= B or O, has suffix(w, ‘’s’)= chine translation model; therefore, we adapted the PB-SMT true and w 6∈ S model to the transliteration task by translating characters rather than words as in character-level translation. For char- Rule-5: Another very common use of apostrophe is apos- acter alignment, we used GIZA++ implementation of the trophe t (i.e., ’t), e.g., don’t, isn’t, wouldn’t, etc. Even it is IBM word alignment model [4]. To suit the PB-SMT model used in different way such as rn’t, cudn’t, etc. to the transliteration task, we do not use the phrase reorder- R5: H-Tag(w)=E ; if C-Tag(w)= B or O, has suffix(w, ‘’t’)= ing model. The target language model is built on the target true and w 6∈ S side of the parallel data with Kneser-Ney smoothing [10] us- Rule-6: A few users prefer to use words ending with ’ll, e.g., ing the SRILM tool [11]. The PB-SMT model was trained I’ll, It’ll, he’ll, you’ll, etc. We found 20 such occurrences in on the English-Bangla word transliteration pairs dataset [15] the test set. provided by the task organizers. In a bid to simulate syllable R6: H-Tag(w)=E ; if C-Tag(w)= B or O, has suffix(w, ‘’ll’)= level transliteration we also built a transliteration model by true and w 6∈ S breaking the English and Bangla words to chunks of consec- utive characters and trained the transliteration system on Rule-7: The use of words like o’clock, O’Keefe etc. are very this chunked data. 
The chunk-level transliteration system is uncommon in Bangla social media users. But we found 16 supposed to perform better than the character-level translit- such occurrences in the test dataset. eration system since a chunk contains more context than a R7: H-Tag(w)=E ;if C-Tag(w)= B or O, starts with(w, character. While decoding, we first apply the chunk-level ‘o”)= true and w 6∈ S transliteration system on the detected Bangla words. If the chunk-level transliteration system is able to transliterate a Rule-8: This rule is very much straightforward. If a word word only partially (i.e., it still contains roman characters), contains a special symbol, then the word is tagged as O. the untranslated parts are decoded using the character-level R8: H-Tag(w)=O ; if C-Tag(w)=B or O or E or H and w transliteration system. For breaking the English and Ben- ∈ S gali words into chunks, we take two approaches. In the first Rule-9: Although a few ambiguities are discussed in 4.1.6, approach (Run-1) we simply break words into chunks of con- there is a high chance of a word being English if it is in the secutive 2/3 characters. In the other approach (Run-2), we English dictionary. Considering the ambiguity, we also con- break words into transliteration units (TU) following the sider the probability of the word to be in Bangla language. heuristic used in [6]. The TU-level transliteration system R9: H-Tag(w)=E ; if C-Tag(w)=B and probability Bangla(w) was trained on named entities. < 0.08 (this threshold was set empirically.) Rule-10: The use of character repetition in the word is ob- 6. RESULTS served not only in English and Hindi, but in Bangla as well. Table-1 presents the obtained results. Our system achieved The following observations have been noticed: an overall accuracy of 0.905 for the language labeling task (1) Repetition of a character more than twice at the end which is the best among the participating teams. Table 1: Results We acknowledge the support of the Department of Elec- Token level language accuracy tronics and Information Technology (DeitY), Government of Language Precision Recall F-Measure India, through the project “CLIA System Phase II”. Bangla 0.866 0.935 0.899 English 0.944 0.899 0.920 9. REFERENCES Token level Transliteration [1] Y. Al-Onaizan and K. Knight. Named entity Run Precision Recall F-Measure translation: Extended abstract. In HLT, pages Run-1 0.033 0.572 0.062 122–124. Singapore, 2002. Run-2 0.019 0.338 0.037 [2] S. Banerjee, S. Naskar, and S. Bandyopadhyay. Other Performance Metrics Bengali named entity recognition using margin infused EQMF All(No Translit.) 0.444 relaxed algorithm. In Text, Speech and Dialogue, pages EQMF without NE(No Translit.) 0.548 125–132. Springer International Publishing, 2014. EQMF without MIX(No Translit.) 0.444 [3] K. R. Beesley. Language identifier: A computer EQMF without NE&MIX(No Translit.) 0.548 program for automatic natural-language identification EQMF All Run-1 0.005 of on-line text. In ATA, pages 47–54, 1988. EQMF All Run-2 0.004 [4] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and EQMF without NE: Run-1 0.007 R. L. Mercer. Mercer: The mathematics of statistical EQMF without NE: Run-2 0.004 machine translation: parameter estimation. In EQMF without MIX: Run-1 0.005 Computational Linguistics, pages 263–311, 1993. EQMF without MIX: Run-2 0.004 [5] G. Chittaranjan, Y. Vyas, K. Bali, and EQMF without NE&MIX: Run-1 0.007 M. Choudhury. 
Word-level language identification EQMF without NE&MIX: Run-2 0.004 using crf: Code-switching shared task report of msr ETPM: Run-1 227/364 india system. In EMNLP, pages 73–79, 2014. ETPM: Run-2 134/364 [6] A. Ekbal, S. Naskar, and S. Bandyopadhyay. A Language Identification Accuracy 0.905 modified joint source channel model for transliteration. In COLING-ACL, pages 191–198. Australia, 2006. [7] I. Goto, N. Kato, N. Uratani, and T. Ehara. 6.1 Error Analysis Transliteration considering context information based It was observed that the WLL classifier based on CRF on the maximum entropy method. In MT-Summit IX, wrongly predicted due to the small training data. Moreover, pages 125–132. New Orleans, USA, 2003. some words were predicted correctly by the classifier, how- [8] S. Y. Jung, S. L. Hong, and E. Paek. An english to ever, due to the heuristics the final prediction went wrong; korean transliteration model of extended markov e.g., the word Wannna is re-classified by (R10b) wrongly window. In COLING, pages 383–389, 2000. as Bangla. R10a also mis-classified Hindi words having [9] B. King and S. Abney. Labeling the languages of words character repetition at the end, such as torengeee, Arehhh, in mixed-language documents using weakly supervised etc. R10a also mis-classified Bangla words such as jahhh, methods. In NAACL-HLT, pages 1110–1119, 2013. jetooooo, etc. Rule-8 re-classified some words due to tok- [10] R. Kneser and H. Ney. Improved backing-off for enization errors in the provided test dataset am!”, back!”, m-gram language modeling. In ICASSP, pages goin’, ekjon-eri, etc. Some words in the testset were of the 181–184. Detroit, MI, 1995. form word1/word2, such as isharay/nirupay, samanyo/8B [11] R. Kneser and H. Ney. Srilm-an extensible language etc., which were simply classified as O (i.e., Others) using modeling toolkit. In Intl. Conf. on Spoken Language Rule-8 in our system. Processing, pages 901–904, 2002. The TU-level transliteration system was trained over named [12] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, entities; hence it performed well for NEs, but the overall M. Federico, N. Bertoldi, B. Cowan, W. Shen, performance was affected because majority of the detected C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, Bangla words were non-NE words. and E. Herbst. Moses: open source toolkit for statistical machine translation. In ACL, pages 7. CONCLUSIONS 177–180, 2007. In this paper, we presented a brief overview of our hybrid [13] P. Koehn, F. J. Och, and D. Marcu. Statistical approach to address the automatic WLL identification prob- phrase-based translation. In HLT-NAACL, pages lem. We found that the use of simple post-processing heuris- 48–54, 2003. tics enhances the overall performance of the WLL system. [14] H. Li, Z. Min, and J. Su. A joint source-channel model Two variants of the transliteration systems were developed for machine transliteration. In ACL, page 159, 2004. based on the segmentation of the transliteration data, i.e., at [15] V. Sowmya, M. Choudhury, K. Bali, T. Dasgupta, and chunk-level and syllable-level. As future work we would like A. Basu. Resource creation for training and testing of to explore more features for the machine learning model and transliteration systems for indian languages. In LREC, better post-processing heuristics for the WLL identification pages 2902–2907, 2010. task and try to increase the efficiency of our transliteration [16] H. Surana and A. K. Singh. A more discerning and system. 
Query Word Labeling using Supervised Machine Learning: Shared task report by PESIT team

Channa Bankapur (Asst. Prof. of Computer Science, PES University, Bangalore, India), Adithya Abraham Philip (PES Institute of Technology, Bangalore, India), Saimadhav A Heblikar (PES Institute of Technology, Bangalore, India)
[email protected], [email protected], [email protected]

ABSTRACT
The aim of this task is to identify words as belonging to an Indian language (L) or English (E) from sentences written in Roman script and, if a word belongs to the Indian language (L), to transliterate it to its Devanagari script equivalent. We have participated in the shared task where the Indian language is Hindi. We have used a supervised machine learning approach to classify words as belonging to either L or E, and a state machine based approach to transliterate words.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Search and Retrieval

Keywords
Bilingual classification, transliteration

1. INTRODUCTION
Most Indian languages are written in the Devanagari script. The difficulty in representing Devanagari characters lies in the fact that they don't have one-to-one correspondence with their Roman script equivalents. We are forced to use the Roman script to attempt to phonetically replicate the corresponding word in Devanagari script. The ambiguity in this representation lies primarily in the fact that multiple combinations of letters in Roman script can be used to refer to the same word in Devanagari script. For example, "bharat" and "bharath" both refer to the same word in Devanagari script (meaning "India"). Besides, due to the different ways a word is pronounced in different regions, there will be further ambiguity when the same word is to be represented in Roman script. Also, due to socio-cultural reasons, there is bound to be a mix of languages even though the script used is the Roman script. The process of labeling is a two-class classification problem. The process of representing words from a source language in the target language's phonetic equivalent is called transliteration. We have used training data from FIRE 2013 to train the classifier. Character n-grams have been used for this purpose. The transliterator has been manually built using language-specific data available on the internet.

2. HINDI-ENGLISH QUERY WORD TRANSLITERATION
Our system has been divided into two parts: the classification part and the transliteration part. The classification uses a supervised machine learning approach. First the query string q1 q2 ... qn is fed to the classifier, where q1, q2, ..., qn are the query words to be labeled. If the query word is English, no transliteration is required. If the query word is of the Indian language L, then the word is fed to the transliterator, which uses states to transliterate it from Roman script to Devanagari script.

2.1 Classification
The training data for the classifier was obtained from FIRE 2013 shared resources. The training data was processed to make a list of 29,934 Hindi words represented in Roman script. A similar list was made for English words using the Debian wordlist virtual package [4]. This wordlist has 71,971 words. It is to be noted that there is no relationship or correspondence between the two lists. The two lists were separately processed as follows:

2.1.1 Extracting character n-gram frequency
The feature chosen to train the classifier was character n-grams. Frequency data was obtained for character n-grams for both lists for n = 1, 2, 3, 4, 5, and 6, including word boundaries. Including word boundaries means including the space character at the start of and at the end of the word; the space character was used as a delimiter in this case. This data was sanitized to remove overlapping cases at word boundaries between n and n-1 grams. The data was sorted and stored separately. No feature was discarded even though the feature list was huge.

2.1.2 Training the classifier
The processed data from the above step was used to train the classifier. A popular Python based machine learning library called scikit-learn 0.15.0 [5] was used. The classification algorithm used was Multinomial Naive Bayes. The other classifier used was LinearSVC, which performed slightly worse as compared to the Multinomial Naive Bayes classifier in terms of accuracy. Performance-wise, there was no noticeable difference between the two, with both classifiers taking approximately the same time to classify the same data set.

2.1.3 Classifying the words
In this step, the input query string was tokenized on common word boundaries like space, full stop, etc. Words containing characters other than alphabets were ignored. Each word was fed to the classifier built earlier to predict whether it belongs to the Indian language L or English E.
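The classification pipeline of 2.1.1 to 2.1.3 can be sketched as follows. This is only a minimal illustration using scikit-learn's CountVectorizer and MultinomialNB; the file names are hypothetical and the exact sanitization of overlapping boundary n-grams described above is omitted.

```python
# Minimal sketch of the character n-gram classifier described above.
# Assumes two plain-text word lists (one word per line); file names are
# illustrative, not the ones used by the authors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def load_words(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

hindi = load_words("hindi_roman_words.txt")     # ~29,934 words in the paper
english = load_words("english_words.txt")       # Debian wordlist, ~71,971 words

X = hindi + english
y = ["L"] * len(hindi) + ["E"] * len(english)

# char_wb pads each word with spaces, approximating the word-boundary
# n-grams described in 2.1.1.
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 6)),
    MultinomialNB(),
)
clf.fit(X, y)

query = "mera naam john hai"
tokens = [t for t in query.split() if t.isalpha()]   # drop non-alphabetic tokens
print(list(zip(tokens, clf.predict(tokens))))
```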
2.2 TRANSLITERATION

2.2.1 Introduction
The transliteration of a word consists of three steps: the Splitter, the Convertor, and the Generator. The Splitter splits the word into separate vowel and consonant blocks, with the idea that each block represents a syllable. The Convertor maps each of the blocks generated by the Splitter to a set of possible Hindi equivalents, based on a set of rules. The Generator generates all possible Hindi transliterations of the word, based on the output from the Convertor, and selects one transliteration based on a dictionary search, picking the first word which finds a match in case of multiple matches.

2.2.2 The Splitter
In most words, a set of continuous consonants or vowels represents a single syllable, e.g. in the words "dekha", "hamare", "zindagi". By splitting a word into such blocks, we are required to define conversion rules only for single syllables, which helps simplify the problem to a great degree. This results in the word "dekha" being split into the blocks "d"-"e"-"kh"-"a", each of which is processed independently by the Convertor. There are certain exceptions to the one-syllable-per-block rule, which are discussed in, and handled by, the Convertor.

2.2.3 The Convertor
The Convertor takes input from the Splitter in the form of individual consonant or vowel blocks, and generates a list of possible Hindi equivalents of the input as output. The output of the Convertor is represented in an intermediary form consisting of groups of Roman script letters, representing specific symbols in the Devanagari script of Hindi, delimited by "#" characters, such as "K" to represent the first Hindi consonant (pronounced "ka") and "DU" to represent a dot to be placed under the succeeding consonant. The control character "!" is used to indicate that the preceding and succeeding consonants are not to be joined as half letters, which is the default behavior (in case "!" is absent). This intermediary form allows for easier programming and debugging than directly using Unicode characters. The mapping is given in the appendix.

There are two distinct stages in the Convertor, followed by exception handling. In the first stage, the Convertor accepts a single consonant or vowel block, and the position of the block in the word ("beginning", "middle" or "end"), as input. Using a set of pre-defined letter groupings as rules, and based on the position of the block in the word, it attempts to map each block to a set of similar-sounding matches in Hindi. For example, the block "a" present at the beginning of a word would be mapped to the first vowel of Hindi (which sounds like "aa"), while "a" present at the end of a word would be mapped to the sign ("matra") of the first vowel in Hindi. The block "a" in the middle of a word, say "hamare", would be mapped to two possibilities: the sign ("matra") of the first vowel in Hindi, or "nothing", where "nothing" (represented by "!") means the preceding and succeeding syllables are not to be joined as half letters. "H#!#M" would sound like "hum" while "H#M" would be the half of the "H" consonant joined with "M". When more than one possible syllable equivalent exists for a block, the output from the Convertor is ordered with the most probable syllable equivalent first. This order is implemented in the part of the Convertor where we map the block to the syllable equivalent. For example, for the "a" block in the middle of the word "hamare", we add the "!" to the output list before the sign ("matra") of the first vowel of the Hindi language, as we believe "!" is more likely than the sign ("matra") of the first Hindi vowel in this case. These rules, which include the order of syllables in the output list, have been decided based on human intuition and individual letter frequency charts [6]. These rules are elaborated upon in the appendix.

If the block cannot be mapped in the first stage, the algorithm proceeds to the second stage. In this stage the word is compressed in a manner similar to Compressed Word Format (CWF) [1], by removing continuous sequences of the same vowel or consonant and replacing each sequence with one character. This means that "aao" or "aaaoo", after failing the first stage, would be reduced to "ao" in the second stage. In the case of consonant blocks, the second stage is also the stage where, based on position, "special" letters like "h" or "n" are used to determine whether certain letters require extra stress (in the case of "h"), or require a dot on top (in the case of "n"). These parts of the block, once processed, are removed from the block before further processing. After the second stage, the unmapped parts of the block are sent to the first stage again. If a block still cannot be mapped, it is classified as an exception and passed on to the exception handler defined below.

The exception handler is meant to handle exceptions to the one-syllable-per-block rule, which can occur in two broadly classified groups: clearly enunciated words and ambiguously enunciated words. Clearly enunciated words refer to Hindi words written in Roman script which clearly represent the enunciation of the corresponding Hindi word. The word "karvega" is an example of a word in this group, while "krvega" is an example of the same Hindi word in a less clearly enunciated form and therefore not in this group. Exceptions to the rule in the clearly enunciated group primarily occur in vowel blocks. Words such as "aao" will be considered as a single vowel block but should be represented as two syllables, "aa" and "o". In the case of "aao", the block would already have been reduced to "ao" (CWF) by the second stage, and if no existing rules map the block "ao", the Convertor assumes that each letter in the block represents a different syllable. It maps them by recursively calling itself, passing "a" and "o" as two separate blocks, both at the "beginning" position.

Hindi words improperly or ambiguously enunciated in Roman script include words like "krvega", which is an ambiguously enunciated equivalent of "karvega" and is harder to handle, as the consonant block "krv" represents as a single block what should ideally have been interpreted as three blocks: "k", "a" and "rv". However, such constructs are popular among users of social networking sites and must be taken care of. In its current state, the Convertor guesses that all letters in an unmapped consonant block represent individual half letters and maps each individual letter to the corresponding half-consonant (except the last letter, which is assumed to be a "full" consonant).

The output from the Convertor is an ordered list of syllables in Hindi which sound like the given block at the given position in the word, with the most probable syllable equivalent first. This output is used by the Generator.

2.2.4 The Generator
A set of words is generated by listing all possible combinations from the set of possible syllable mappings produced for each block by the Convertor. These words are then passed through a dictionary filter to remove non-existent combinations. In the case of exactly one word from the list being present in the dictionary, it is given as output. In the case of more than one word finding dictionary matches, the first match among them, based on how the Convertor orders its output (most probable first), is chosen and given as output. In case of no matches, the dictionary is not considered and the first word from the entire unfiltered list is chosen, once again based on the ordering from the Convertor. After the word is chosen, it is mapped from the intermediary form used in the Convertor to its Hindi Devanagari script equivalent.

3. RESULTS
The token level precision, recall and F-measure for Hindi were found to be 0.81, 0.813 and 0.812 respectively, and for English 0.767, 0.798 and 0.782. The token level language labeling accuracy was 0.656, while EQMF [7] and ETPM [7] were found to be 0.001 and 0.5737 respectively. EQMF without transliteration, with both named entities and MIX tags, without named entities, without MIX tags, and without both named entities and MIX tags, was 0.158, 0.258, 0.158, and 0.258 respectively.

4. DRAWBACKS

4.1 Classification

4.1.1 Undefined behavior for certain queries
The classifier was found to exhibit undefined behavior for certain words which are "valid" in both languages (Hindi and English in our case). Examples of such words are "do", "hai", "to", etc. The classifier's prediction varied on each run for such queries. One possible reason is that such a word is equally probable of belonging to both classes, in which case the prediction may be randomly decided.

4.1.2 Classifying named entities
The system was unable to detect named entities as it was not trained for this task. This prevented it from classifying them. Named entities in queries were not treated as such; they were treated as normal words belonging to either the Indian language L or English E.

4.2 Transliteration

4.2.1 Lack of a Machine Learning System
No machine learning system was used, and hence all rules were written based on manual evaluation of data such as frequency charts [6]. This prevents the use of training data to improve the system and makes it prone to human error in laying down the rules. One example is when the rules for an "a" vowel block in the middle of a word were set: we initially gave the sign ("matra") of the first vowel of Hindi higher preference than the "!" ("nothing") character (the former was ordered first in the output list of the Convertor, and the latter second). We later found, by trial and error, that "!" should have been given higher priority in the output order of the Convertor, and we had to manually change the rule to make "!" come before the sign ("matra") of the first vowel in Hindi in the output list of the Convertor.

5. SCOPE FOR IMPROVEMENT

5.1 Word Sense Disambiguation
The drawback described in 4.1.1 can be approached by having in place a system which can identify the sense of the word. This could possibly be done by looking at the words around it in the sentence or by having the system learn common patterns of language use in a bilingual space. An example of such a pattern is L-L-E-L-E. Then, based on such pattern data, we may be able to identify such words.

5.2 Introduction of a Machine Learning System in Transliteration
Allow an unsupervised learning system to be used in the Convertor to map the blocks generated by the Splitter to their syllable equivalents. This would eliminate one of the biggest weaknesses of our model: having the rules which map blocks to syllables set by humans based on nothing but human intuition assisted by Hindi letter frequency charts [6]. For example, we would leave it to the machine learning system to decide what an "a" block in the middle of a word or an "ee" or "ch" block at the end of a word should be mapped to, and not specify any rules ourselves.

6. ACKNOWLEDGEMENTS
We would like to thank the Indian Statistical Institute for organizing FIRE 2014 and giving us the opportunity to work on this transliteration system. We would also like to thank the Microsoft Research India team for their participation in the same, whose data and evaluation helped us better understand our strengths and weaknesses.

7. REFERENCES
[1] Srinivasan C. Janarthanam, Sethuramalingam S, and Udhyakumar Nallasamy. Named entity transliteration for cross-language information retrieval using compressed word format mapping algorithm. In iNEWS '08: Proceedings of the 2nd ACM workshop on improving non-English web searching, pages 33-38, New York, NY, USA, 2008. ACM.
[2] Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso. Query expansion for mixed-script information retrieval. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (SIGIR '14). ACM, 2014.
[3] P. J. Antony and K. P. Soman. Machine Transliteration for Indian Languages: A Literature Survey. International Journal of Scientific & Engineering Research, Volume 2, Issue 12, December 2011.
[4] Debian wordlist virtual package: https://packages.debian.org/sid/wordlist
[5] Scikit-learn version 0.15.0: http://scikit-learn.org/stable/
[6] http://www.sttmedia.com/characterfrequency-hindi
[7] R.S. Roy, M. Choudhury, P. Majumder, K. Agarwal. Overview and Datasets of FIRE 2013 Track on Transliterated Search. FIRE @ ISM, 2013.

APPENDIX

A. INTERMEDIARY FORM MAPPING
A अ, AA आ, I इ, II ई, III ई, U उ, UU ऊ, E ए, EE ऐ, EEE ऑ, O ओ, OO औ, K क, KK ख, G ग, GG घ, C च, CH छ, J ज, JJ झ, T ट, TT ठ, D ड, DD ढ, TH त, THH थ, D2 द, DD2 ध, N न, P प, PP फ, B ब, BB भ, M म, Y य, R र, L ल, V व, W व, SH श, SHH ष, S स, H ह

B. SAMPLE SET OF RULES
For single character consonant blocks occurring anywhere (not all rules included, due to limited space):
If 'b': "B", "BB" /* adds "B" and "BB" in that order */
Else if 'c' or 'k': "K"
Else if 'd': "D2", "D", "DD2"
Else if 'f': "PP"
Else if 'g': "G", "GG"
Else if 'h': "H"
Else if 'j': "J", "JJ"
Else if 'l': "L"
Else if 'm': "M"
Else if 'p': "P"
Else if 'r': "R"
Else if 's': "S"
Else if 't': "TH", "T", "TT"
Else if 'v' or 'w': "W"
Else if 'y': "Y"
Else if 'z': "J#DU"

For double character vowel blocks occurring at the beginning of a word (not all rules included, due to limited space):
If "aa": "AA"
Else if "ai": "EE"
Else if "au": "OO"
Else if "ee": "II"
Else if "ea": "I", "II"
Else if "eu": "y#uu", "y#u"

Hindi-English Language Identification, Named Entity Recognition and Back Transliteration: Shared Task System Description

Navneet Sinha 1, Gowri Srinivasa 2
Dept. of Information Science and Engineering, PES Center for Pattern Recognition, PESIT Bangalore South Campus, Bengaluru.
[email protected], [email protected]

Abstract This paper presents an algorithm for word level language identification, named entity recognition and classification, and transliteration of Indian language words written in the Roman script to their native Devanagari script from bilingual textual data. We propose the construction of an extensive, hierarchical structured dictionary and hierarchical rule-based classifier to expedite word search and language identification. The proposed method uses lexical, contextual and special character features particular to Hindi and English. With a few modifications to the system, the present solution can be replicated for other languages. The system we have submitted shows the best performance in English token level precision (0.895) and the second best in Indian language token recall (0.915). The transliteration level f-measure is relatively low (0.15); this can be significantly improved with a more representative and exhaustive training data.

1. Introduction Hindi is one of the official languages of the Federal Government of India and the fourth most commonly used language in the world; the first three being Chinese, English and Spanish. It has been widely observed that a large number of Indians tend to communicate using an admixture of words from English and Hindi and/ or other native languages. Most of these native languages have their own script, including the Devanagiri script for Hindi. Yet, due to various socio-cultural and technological reasons, most writers of dual or multiple tongues use the Roman script transliteration of words in the native language in blogs, tweets, posts, etc. Transliteration between the Devanagari script and Roman script is difficult due to difference in the encoding of the scripts, number of alphabets, capitalization of leading characters, phonetic characteristics, character length and various modifications. In this paper, we propose to parse natural language sentences written in Hindi and English using the Roman script from sources such as blogs, tweets, etc. and, use multiple hierarchical pre-processed dictionaries and a match based classifier to classify the language and subsequently transliterate words in Hindi to Devanagiri. The paper is organized as follows – in section 2, we describe the dataset used for training and designing the system. In section 3, we present our approach and experimental setup, the experimental results in section 4 and an error analysis in section 5. Finally, we conclude with a few possible directions for further work in Section 6. 2. Dataset To build the system, we used the English word list and the annotated output together with the file of Hindi word and transliteration pairs from the FIRE 2013 dataset. We also used several lists off the internet for the names of location and other named entities to build various dictionaries to expedite the classification of each word [2-4].

3. Approach Problem: The task involves reading an input file of bilingual text and tagging each word with a label: Hindi, English, Named Entity (proper nouns such as names of people, organizations, etc.), Location (places such as cities, countries, etc.), Abbreviation or Other (numerals, emoticons or smileys, etc.). And, for those words labelled “Hindi”, it is required to transliterate the word written in Roman script to Devanagiri script. The proposed solution: The foundation of our approach is based on the formation of dictionaries for words in each category, except the “Other” category, which is determined based on evaluating regular expressions. The dictionaries formed are used for language tagging, back transliteration as well as for naming entities. For every input token, rather than search exhaustively in the various dictionaries, we define a flag to keep track of the last accessed language dictionary: Hindi or English. This determines the priority with which the Hindi dictionary will be searched for the next input token. The other labels, such as “Abbreviation”, Location”, etc. do not alter the language label. The assumption is that the probability of finding a Hindi word succeeding another word in Hindi is higher than an English word succeeding a word in Hindi. Likewise, the probability of finding an English word succeeding another word in English would be higher than a Hindi word succeeding a word in English. Proper nouns such as names of people or places, numbers, smileys and abbreviations are generally language independent and hence do not contribute to altering the language flag. The steps of the approach are as follows: a file with bilingual text is input and parsed by the system. For an input token (word) from the bilingual text file, the system first evaluates various regular expressions to determine if it belongs to the “Other” category. If the token does not belong to the “Other” category, the system proceeds to search for the word based on the language flag. Suppose the language flag is set to “Hindi”, the token that is read is first matched with the Hindi dictionary. If it is found, the system checks the sub-dictionary for the Devanagiri transliteration of the word. If the transliteration is found, the category is output as “H” for Hindi and the transliteration is also noted. If the transliteration is not found, only the label is output. In case the word is not found in the Hindi dictionary, the system proceeds to check for the word in the “Named Entity”, “Location” and “Abbreviation” dictionaries in turns. If they the token is found in any of these categories, the annotation is output. If not, a word is tagged as “E” for English by default. This accounts for anomalous entries and typographical errors in the input text. If the token is tagged as “E”, the flag is reset to “English” and the search order of the dictionaries in the next iteration would the “Other” category followed by the English dictionary and various named entities before searching through the Hindi dictionary. 3.1 Dictionary formation The dictionaries are based on lists of words in each of the categories. To create the Hindi dictionary, first a python script is run on the various “every transliteration pairs” (populated from the FIRE 2013 data archive) in accordance with their structure of representation (Hindi word in Roman script followed by the Devanagiri transliteration). 
The script scrapes the required text from all these files as a prerequisite for the Hindi annotation and transliteration dictionaries. The English dictionary is built from word lists available online, although these have presently not been used to search for a word, as any input not found in the other categories is labelled "E" for English by default. The dictionary follows a tree-like hierarchical structure. It was constructed so as to reduce the number of comparisons relative to a linear search through all the possible dictionaries. The lists of words are sorted in alphabetical order (Roman script) and a sub-dictionary is created for every letter. In the case of Hindi words, a sub-dictionary stores the Hindi words written in Roman script with their Devanagari equivalents; this is also alphabetically sorted (based on the Roman script). For example, a Hindi sub-dictionary for the letter "b" would contain the entry: ["b", ("bharat": "भारत", ...)]

The ellipses in the above example indicate other words with their transliteration pair starting with the alphabet ‘b’. The lists for English words, Hindi words and named entities form separate dictionaries of the ‘defaultdict’ structure which is already provided by the collections package in python. An entry in the English dictionary for the alphabet a might be as follows: [“a”,(abase, advance, attest,…)] These styles of dictionaries help in the quicker processing of the output file as they drastically reduce the number of comparisons.
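As a concrete illustration of this layout, the sketch below builds the first-letter-indexed dictionaries with collections.defaultdict and performs a lookup. The file names and the tab-separated pair format are assumptions made only for illustration, not the exact files used by the authors.

```python
# Minimal sketch of the hierarchical, first-letter-indexed dictionaries
# described above. File names and the "roman<TAB>devanagari" pair format
# are illustrative assumptions.
from collections import defaultdict

def build_hindi_dictionary(pairs_file):
    hindi = defaultdict(dict)          # first letter -> {roman word: devanagari}
    with open(pairs_file, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) != 2:
                continue
            roman, devanagari = parts
            hindi[roman[0].lower()][roman.lower()] = devanagari
    return hindi

def build_word_set_dictionary(words_file):
    words = defaultdict(set)           # first letter -> set of words
    with open(words_file, encoding="utf-8") as f:
        for line in f:
            w = line.strip().lower()
            if w:
                words[w[0]].add(w)
    return words

hindi_dict = build_hindi_dictionary("hindi_transliteration_pairs.txt")
english_dict = build_word_set_dictionary("english_words.txt")

token = "bharat"
sub = hindi_dict[token[0]]             # only one small sub-dictionary is searched
print(sub.get(token))                  # the Devanagari equivalent, if present
```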

3.2 Regular Expressions, Tagging, Back Transliteration, Named Entity Representations
A Python script loads the different dictionaries and, if the straightforward alphabet search does not yield a class label, proceeds with the evaluation of regular expressions. First, the system uses manually written regular expressions to check whether the input word belongs to the "Other" category, which includes numbers, symbols, smileys, web page links and web page references. The expressions also screen for shortened words (such as "they're"), hyphenated words, etc.; these are annotated as "English". A list of the regular expressions is presented in Table 1. Next, the system checks for the presence of an input word in the Hindi dictionary through a nested if-elif ladder. If the input word is found in the Hindi dictionary, the system proceeds to check whether the token's Devanagari equivalent is available. If the pair is found, the label and the transliteration are printed in the output; if not, the word is labelled but not transliterated. In case the word is not found in the Hindi dictionary, the system searches the "Named Entities", "Location" and "Abbreviation" dictionaries, and checks whether the word is an abbreviation. If an input word is not found in any other category, it is assumed to be an English word and suitably labelled.
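A minimal sketch of this tagging ladder is shown below. The helper is_other() stands in for the regular-expression screen of Table 1 further down; the dictionary variables follow the first-letter-indexed structures sketched earlier, and all names are illustrative assumptions rather than the authors' actual code.

```python
# Sketch of the flag-driven tagging ladder described in Section 3.
import re

OTHER_RE = re.compile(r"^[\d\W_]+$")      # placeholder for the Table 1 patterns

def is_other(token):
    return bool(OTHER_RE.match(token))

def tag_token(token, flag, hindi_dict, english_dict, ne_dict, loc_dict, abbr_dict):
    """Return (label, transliteration_or_None, new_flag)."""
    if is_other(token):
        return "O", None, flag                     # "Other" does not alter the flag
    t = token.lower()
    if flag == "H" and t in hindi_dict[t[0]]:      # Hindi searched first
        return "H", hindi_dict[t[0]].get(t), "H"
    if flag == "E" and t in english_dict[t[0]]:
        return "E", None, "E"
    for label, d in (("NE", ne_dict), ("LOC", loc_dict), ("ABBR", abbr_dict)):
        if t in d[t[0]]:
            return label, None, flag               # language-independent categories
    if flag == "E" and t in hindi_dict[t[0]]:      # Hindi searched last when flag is English
        return "H", hindi_dict[t[0]].get(t), "H"
    return "E", None, "E"                          # default: English
```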

Table 1: Regular expressions to screen for the "Other" category and compound words

Pattern                                                               Category
r'[\.\=\:\;\,\#\@\(\)\`\~\$\*\!\?\'\"\+\-\\\/\|\{\}\[\]\_\<\>]+'      Others
r'[0-9]+'                                                             Others
r'[a-zA-Z]+[\@]+[a-zA-Z\.]*'                                          Others
r'[a-zA-Z]+[\']+[a-z]*'                                               English
r'[A-Za-z]+[\-]+[a-z]*'                                               English
r'[A-Za-z]+[\+\-]+'                                                   English
r'http+'                                                              Others
r'www.[A-Za-z0-9]+.com'                                               Others
r'[A-Za-z0-9]+.com'                                                   Others
r'[a-zA-Z]+-[a-zA-Z]*'                                                Others

Disambiguation of words common to both languages: An advantage of using a language flag is the ability to delineate bilingual words as belonging to English or Hindi based on the context of the previous word. The underlying assumption is that if the previous word was found in the Hindi dictionary, the current word is more likely to be a word in Hindi. The language flag is used at the start of the conditional ladder and thus effectively handles the evaluation of words which could be used in both languages, such as "to", "in", "me", "so", "us", "par" and others. The system is not sensitive to the case of the input with regard to searching through the language dictionaries. Thus, bilingual words in either case are handled in the same way. However, some of the regular expressions evaluated for the "Abbreviation", etc. categories are sensitive to case and annotate the input appropriately.
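To show how the Table 1 screen might be applied in code, the fragment below compiles a few of the patterns and returns the first matching category. This is an illustrative reading of the table (patterns anchored to whole tokens), not the authors' script.

```python
# Illustrative application of a few Table 1 patterns (not the full set).
import re

SCREEN = [
    (re.compile(r"^[0-9]+$"), "Others"),                          # numbers
    (re.compile(r"^[a-zA-Z]+[']+[a-z]*$"), "English"),            # shortened words, e.g. they're
    (re.compile(r"^[A-Za-z]+[-]+[a-z]*$"), "English"),            # hyphenated words
    (re.compile(r"^(http.*|www\.[A-Za-z0-9]+\.com)$"), "Others"),  # web links
]

def screen_token(token):
    for pattern, category in SCREEN:
        if pattern.match(token):
            return category
    return None   # fall through to the dictionary ladder

print(screen_token("they're"), screen_token("2014"), screen_token("ghar"))
# -> English Others None
```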

4. Results A hierarchical rule-based search driven by the language flag yields promising results for the tagging precision, recall and f-measure for both languages and for named entity tagging. Results for transliteration are relatively low due to a paucity of training data used to populate the dictionary as well as errors in the test data. The results are summarized in Table 2.

Table 2: Results: various metrics computed by the MSRI team

LP      LR      LF      EP      ER      EF      TP      TR      TF      LA
0.853   0.915   0.883   0.895   0.822   0.857   0.091   0.427   0.15    0.855

EQMF All (NT)   EQMF w/o NE (NT)   EQMF w/o Mix (NT)   EQMF w/o Mix and NE (NT)   ETPM
0.231           0.354              0.231               0.354                      1086/2235

LP, LR, LF: token level precision, recall and F-measure for the Indian language in the language pair; EP, ER, EF: token level precision, recall and F-measure for English tokens; TP, TR, TF: token level transliteration precision, recall and F-measure; LA: token level language labelling accuracy; EQMF: exact query match fraction; ETPM: exact transliterated pair match; NE: named entities; NT: no transliteration.

5. Error Analysis
The system yields promising results for the tasks of classifying words based on language and recognizing named entities. However, the results of the transliteration are relatively low. We conjecture this is due to the creation of the dictionary: the transliteration pairs were directly scraped from the files in the dataset and were not checked for errors. Thus, the erroneous data output by the system for the test data did not match the ground truth that was used to evaluate the performance of the system. A few examples of such errors would be:
baat/H=भारत ab/H=आब mujhse/H=मुझे kiya/H=या .....

Further, for a majority of the words in the test data, Devanagiri equivalents were not available in the dictionary (due to the training data not being exhaustive). Since there was no other way for the proposed system to transliterate Hindi words in the Roman script to Devanagari script, the transliteration accuracy has been low. 6. Conclusion and Future Work In this paper we presented a brief overview of a hierarchical rule-based classifier for Hindi- English language identification of words, back transliteration of Hindi words to the Devanagari script and named entity tagging for people, location, organization and abbreviations. The experimental results demonstrate that having sound dictionaries with an efficient architecture and suitable regular expressions and search strategies yield promising results for the tagging. In the future, we would like to improve the performance of the annotation on unknown words with the help of the addition of n-grams, dictionaries that are dynamically updated and other search strategies (such as sub-dictionaries for a pair of letters). Furthermore, the transliteration task could be improved based on Unicode equivalents for the Roman script rather than rely on a pre-existing dictionary of transliterated words.

References

[1] S. Gella, J. Sharma, K. Bali, “Query word labelling and Back Transliteration for Indian Languages: Shared task system description”, Shared task system description in MSRI FIRE Working Notes , 2013. [2] John Lawler, “English Word List”, url; http://www.personal.umich.edu/~jlawler/word-list.html . Last accessed : November 10, 2014. [3] Mieliestronk, “Mieliestronk’s list of more than 58,000 English words”, url: http://www.- mieliestronk.com/wordlist.html . Last accessed : November 10, 2014. [4] Simha Naidu, “Indian Stuff: Text list of Indian cities (alphabetical)”, url: http://simhanaidu.- blogspot.in/2013/01/text-list-of-indian-cities-alphabetical.html . Last accessed: November 10, 2014. Language Identification, Transliteration and Resolving Common Words Ambiguity in a Pair of Languages: Shared Task on Transliterated Search

Malepati Bala Siva Sai Akhil Abhishek J M S Ramaiah Institute of Technology M S Ramaiah Institute of Technology MSR Nagar, MSRIT Post, MSR Nagar, MSRIT Post, Bangalore, India-560054 Bangalore, India-560054 [email protected] [email protected]

ABSTRACT
This paper describes the work that we have done for the FIRE 2014 shared task on transliterated search. We targeted the problem of word level language identification for Indian languages mixed with the English language. Further, we have also done transliteration of Indian language words in Roman script into their corresponding native scripts. In addition to this, our technique resolves the problem of ambiguity with common words present in a pair of languages.

Keywords
Transliteration, Language Identification, Resolving common word ambiguity

1. INTRODUCTION
The majority of the population in the world likes to use their native languages as the medium of communication. The websites and the user generated content such as tweets and blogs in these languages are written using Roman script due to various socio-cultural and technological reasons. This process of phonetically representing the words of a language in a non-native script is called transliteration. Transliteration, especially into Roman script, is used abundantly on the Web not only for documents, but also for user queries that intend to search for these documents. Due to this, there is a problem for search engines in handling transliterated queries and documents. In this paper we have done language identification, transliteration, and resolution of common word ambiguity in a pair of languages.

The input to our system is a mixed language query which contains a sequence of words, for example q: w1 w2 w3 ... wn, where q is the query and w1, w2, w3, ..., wn are the words in Roman script comprising English words and transliterated Hindi words. The aim of the system is to label the words as E or H depending on whether each is an English word or a transliterated Hindi word. Then, for each transliterated Hindi word, the system has to provide the correct transliteration in the native script (i.e., the script which is used to write the Hindi language).

2. OUR APPROACH
Our system has been designed around a look-up based approach. It is not only just look-up based, but it also has the ability to identify the language even when words are common to a pair of languages. This paper is focused on the English-Hindi language pair, but the same logic and algorithms can be extended to other pairs of languages as well.

In this language pair, for example, "is" can occur in both English and Hindi. In English it is "is", and in Hindi when "is" is transliterated into Roman script, it again becomes "is". This creates ambiguity. There are many words like this in this pair of languages. Some of them are listed, with their frequency in the test data, in Table 2.1.

Table 2.1 Words common to the language pair English-Hindi and the corresponding Hindi text with frequency
English Word   Hindi Word   Frequency in test data
to             to           422
is             is           172
in             in           169
so             so           69

The words mentioned in the table are just examples; there are many words like these in every pair of languages. From the table it can be observed that the number of such words is large, so they have a considerable effect on the final result. The solution to this problem, along with transliteration and language identification, is discussed here.

Following is the overall algorithm of our system:

Algorithm 2.1 Overall Algorithm
while words remain in the input file:
    take a word as input from the input file
    if the word is present in the common words database, then use Algorithm 2.2
    else if the word is present in the English database, then write the word as English in the output file
    else if the word is present in the transliterated pair database, then write the word as Hindi with its corresponding transliterated word in the output file
    else write the word as Hindi in the output file

The input file, which contained the mixed language text, was read character by character; when a space is encountered, the system recognizes the preceding characters as a single word. In this way our system recognizes words. Then each word is checked for a match in the created databases. First we check whether the word is present in the common words database. The common words database has all the possible common words, like the words in Table 2.1. The common words are words which are common to both languages in terms of spelling. If the word is found in this database we use Algorithm 2.2.

Example
Let us consider an example to illustrate the Overall Algorithm, and discuss the flow of the algorithm step by step. At first, "jab" will be searched in the common words database and, as the word will not be found there, the system searches for it in the English database. As it will not be found here as well, it goes to the transliterated pair database and, if it is available in the database, the system prints it as Hindi along with the corresponding transliterated Hindi word. The flow for "tak" will be the same as for "jab". "happy" will be found in the English words database and will be marked as English; "ending" will also be like "happy". The flow for "na" and "ho" will be like "jab".

Algorithm 2.2 is meant to solve the problem of common words in a pair of languages. The following is the explanation of Algorithm 2.2.

2.2.1 First, find the language of the word preceding and the word following the common word, provided the common word is not at the start or end of the sentence. If the word is at the start or end of the sentence, go to 2.2.2. Then, if both neighbours are of one language, the common word can also be identified as belonging to that language.

Now here comes the common word "to". After detecting the word as ambiguous, the system goes to Algorithm 2.2, where it first checks the word preceding and the word following. In this case the example has one neighbour in Hindi, i.e. "ho", and the other in English, "picture". As the first rule of the algorithm fails, it goes to the second rule. The system then finds the total number of English and Hindi words in the given sentence, excluding the common ambiguous word, i.e. "to" in this case.

2.2.2 If the first condition fails, then we have to find the language of all the words in that particular sentence except the common word. The common word can be identified as the language which has the maximum number of words in that sentence.

Total words in sentence = 12
Total English words = 3
Total Hindi words = 9

So, as the ratio Total Hindi Words / Total English Words is greater than 1, the common word is identified as a Hindi word. This has been done by finding the context of the sentence. Similarly, the other words in the sentence are identified and transliterated accordingly and written in the output file. The following would be the output for the above example:

jab/H=जब tak/H=तक happy/E ending/E na/H=ना ho/H=हो to/H=तो picture/E abhi/H=अभी baki/H=बाकी ....

When we tested our algorithm on the test data set, it worked 9 out of 10 times. The problem arises when the sentence has fewer than 4 words, and finding the context of the sentence in this way becomes difficult. One more drawback which we observed was that when the sentence contains mostly common words, the ambiguity increases. In this case, a feedback method can be implemented: for the first common word, the context is checked using only the remaining non-common words in the sentence; then, treating this word as a non-common word after labeling it, the language of the other common words can be found.
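A compact sketch of the look-up labeling of Algorithm 2.1 together with the common-word resolution of Algorithm 2.2 is given below. The database objects, helper names and the exact tie-breaking are illustrative assumptions, not the exact implementation.

```python
# Illustrative sketch of the look-up labeling (Algorithm 2.1) and the
# common-word resolution (Algorithm 2.2). Databases are simple in-memory
# structures; names are hypothetical.
def label_word(word, common_words, english_words, translit_pairs, sentence):
    if word in common_words:
        return resolve_common(word, sentence, english_words, translit_pairs)
    if word in english_words:
        return ("E", None)
    if word in translit_pairs:
        return ("H", translit_pairs[word])
    return ("H", None)                     # unseen words default to Hindi

def resolve_common(word, sentence, english_words, translit_pairs):
    idx = sentence.index(word)
    neighbours = [sentence[i] for i in (idx - 1, idx + 1)
                  if 0 <= i < len(sentence)]
    langs = set()
    for n in neighbours:
        if n in english_words:
            langs.add("E")
        elif n in translit_pairs:
            langs.add("H")
    if len(neighbours) == 2 and len(langs) == 1:      # rule 2.2.1: both neighbours agree
        lang = langs.pop()
    else:                                             # rule 2.2.2: majority language
        hindi = sum(1 for w in sentence if w != word and w in translit_pairs)
        english = sum(1 for w in sentence if w != word and w in english_words)
        lang = "H" if hindi > english else "E"
    return (lang, translit_pairs.get(word) if lang == "H" else None)
```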

If the word is not a common word in the language pair, then we check for the word in the English database. If the word is found in this database, we can conclude that the word is an English word. If not, the word is searched in the database which has the transliterated pairs of words. If the word is found here, we can conclude that the word is Hindi, and it can be printed in the output file with the transliterated pair provided in the database. Now, if the word was not found in any of the databases, then we conclude that the word is Hindi, as we have more confidence in the English word list than in the transliterated pairs, since the transliterated pair list can be endless.

3. ANALYSIS
The following is a brief description of the metrics terminology used to analyze our system: ETPM is the exact transliterated pair match. LP, LR and LF are token level precision, recall and F-measure for the Indian language in the language pair. EP, ER and EF are token level precision, recall and F-measure for English tokens. TP, TR and TF are token level transliteration precision, recall and F-measure.

The overall result of our system is shown in Table 2.2.

Table 2.2 Overall performance of our system
Performance Metric   Performance
ETPM                 1214/1333
LP                   0.864
LR                   0.579
LF                   0.693
EP                   0.494
ER                   0.912
EF                   0.641
TP                   0.133
TR                   0.478
TF                   0.208

As our system is look-up based, the ETPM can be obtained very accurately compared to the other metrics. The other metrics depend on how good one's database of words is. There is no problem of wrong spellings of transliterated words in look-up based systems like ours, which could be a problem in systems using machine learning.

The following is an example where our system falls into ambiguity with regard to the common word issue:
is college student ko kya problem hai

We have made this example ourselves to show that there is a minute possibility of failure of our algorithm in resolving the common word ambiguity. As here the first word itself is the common word, Algorithm 2.2.1 is not applicable, and the control goes to Algorithm 2.2.2 directly. Now after finding the language of all the other words, here ko, kya and hai are Hindi and college, student and problem are English. So, our system falls in ambiguity as number of English words and Hindi words are equal.

According to our observation the possibility of such a sen- tence in real circumstances is very less. 4. CONCLUSIONS In this paper, we have presented our technique applied for transliteration, language identification and resolving com- mon word ambiguity in a pair of languages, English-Hindi. Based on our analysis the system is working well. We are able to get high accuracy in EPTM. The same system can be extended to any pair of languages for resolving com- mon words ambiguity. As our system is look up based the databases play an important role.

Machine learning and look-up based approaches are two different ways of doing transliteration, language identification and resolving common word ambiguity in a pair of languages. Individually, both systems have their own pros and cons. On the basis of our observations, the best possible way would be to use both machine learning and look-up based approaches together for an overall improvement in the system.

AMRITA@FIRE-2014: Named Entity Recognition for Indian Languages

Abinaya.N, Neethu John, Dr.M.Anand Kumar and Dr.K.P Soman

Center for Excellence in Computational Engineering and Networking Amrita School of Engineering Amrita Vishwa Vidyapeetham Coimbatore-641 112

[email protected], [email protected], [email protected], [email protected] Abstract:

This paper describes Named Entity Recognition systems for English, Hindi, Tamil and Malayalam languages. This paper presents our working methodology and results on Named Entity Recognition (NER) Task of FIRE-2014. In this work, English NER system is implemented based on Conditional Random Fields (CRF) which makes use of the model learned from the training corpus and feature representation, for tagging. The other systems for Hindi, Tamil and Malayalam are based on Support Vector Machine (SVM). In addition to training corpus, Tamil and Malayalam system uses gazetteer information.

1 Introduction

Named Entity Recognition is an important NLP application which identifies and classifies proper nouns in a text document into pre-defined categories such as person names, place names, organization names, quantities, dates and times. NER plays a key role in NLP applications like text summarization, information retrieval, question answering and machine translation. Unlike English, Indian languages pose many challenges for NER due to their morphologically rich nature.

This task focuses on Embedded Named Entity Recognition. An embedded named entity is a named entity that has other named entities inside it. The given corpus contains three levels of named entities. The named entities are recognised using the machine learning algorithms CRF and SVM. The CRF++ tool is a customizable, open-source implementation of Conditional Random Fields that uses very little memory for both training and testing. It builds a probabilistic model during training and makes use of this model for generating tags while testing.
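For illustration, a typical CRF++ run driven from Python is sketched below. The feature template and file names are assumptions made for the example; the team's exact template is not given here.

```python
# Illustrative sketch of driving CRF++ from Python (file names and the
# feature template are assumptions, not the team's exact configuration).
import subprocess

TEMPLATE = """\
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[0,1]
B
"""

with open("template", "w") as f:
    f.write(TEMPLATE)

# crf_learn builds the probabilistic model; crf_test applies it to new data.
subprocess.run(["crf_learn", "template", "train.data", "ner.model"], check=True)
with open("tagged.data", "w") as out:
    subprocess.run(["crf_test", "-m", "ner.model", "test.data"],
                   check=True, stdout=out)
```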

2 Methodology

2.1 Training

Since the given task required to predict three different levels of named entities, training should be performed in three different levels. First level includes training of all the features and level 1 named entity tag as the label. Second level includes training of level 2 named entity tag in addition to the first level features. Final level considers level 3 named entity tag also while training the system. Separate models are learned for each level of training. Figure 1 shows the stages in corpus training.

Figure 1: Flowchart of training the corpus
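The three training levels can be realized by constructing one training file per level, where each level adds the earlier levels' tags as extra feature columns and uses its own tag as the label. The sketch below illustrates this; the column layout follows the task format, while the file names and example values are hypothetical.

```python
# Sketch: build level-wise training files from six-column rows
# (token, POS, chunk, NE-level-1, NE-level-2, NE-level-3).
def write_level_file(rows, level, path):
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            if not row:                       # empty tuple marks a sentence break
                out.write("\n")
                continue
            token, pos, chunk, ne1, ne2, ne3 = row
            tags = [ne1, ne2, ne3]
            features = [token, pos, chunk] + tags[:level - 1]   # earlier levels as features
            label = tags[level - 1]                              # current level as label
            out.write("\t".join(features + [label]) + "\n")

rows = [("word1", "NNP", "B-NP", "B-PERSON", "o", "o"),   # illustrative values
        ("word2", "NN", "I-NP", "o", "o", "o"),
        ()]
for level in (1, 2, 3):
    write_level_file(rows, level, f"train.level{level}")
```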

2.2 Testing

The untagged test data are given for testing with three levels. First level makes use of level 1 model obtained by training to get level 1 named entity tag. Output of the first level testing is taken as input for level 2 testing which use level 2 model. The output obtained from this level becomes the input of final level testing which results with level 3 named entity tag. Figure 2 shows the stages of testing process.

Figure 2: Flowchart of testing the corpus 2.3 Feature Description

The training corpus that is given contains features like POS tags, Chunk tag and three levels of named entity tags. We included more features to the training data in order to improve the accuracy of the system. The description of the features used for each language in the training corpus is listed in Table 1.

Table 1: Feature Description

Features English Hindi Tamil Malayalam

Context Words: The previous two and next two √ √ √ √ of the current word.

POS Tag: The parts of speech tag of the previous two words and next two words are √ √ √ √ considered along with current word POS tag.

Chunk Tag: The window of (-2,+2) chunk √ √ √ √ around the current chunk.

Prefix and Suffix information: The prefixes × √ √ √ and suffixes of length 3, 4 and 5 are considered.

Root Word: The root word of window size (- √ × × × 2,+2) are considered.

Lexical information: The trigram (-1,+1) is √ × × × taken as the feature along with current token.

Contains number: This is a binary valued feature defined based on the presence or absence √ √ √ √ of number.

Capitalization: Binary valued feature which is √ × × × defined for trigram.

Gazetteer: Gazetteer information of current × × √ √ token is considered.

Length: The length of the current token is √ √ √ √ considered as a feature.

Position: The position of the current token in the √ √ √ √ sentence is considered. Dot √ √ √ √

Hyphen √ √ √ √

Comma √ √ √ √

Apostrophe √ √ √ √

Single quote / Double √ √ √ √ quote

Colon / Semi Colon √ √ √ √

Back slash / Escape √ √ √ √ Character Binary Features Parentheses √ √ √ √

2 Digit Number √ √ √ √

3 Digit Number × × √ √

4 Digit Number √ √ √ √

Any Digit Number × × √ √

Word end with Dot × × √ √

4 Digit number × × √ √ followed by suffix

Full Upper Case √ × × ×
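A few of the lexical and binary features listed in the table above can be extracted as in the following sketch; the function and feature names are our own illustration, not the exact feature extractor used.

```python
# Illustrative extraction of a few features from Table 1 above
# (names and exact definitions are our own, for illustration).
import re

def token_features(tokens, i):
    w = tokens[i]
    feats = {
        "word": w,
        "prefix3": w[:3], "suffix3": w[-3:],               # prefix/suffix information
        "contains_number": any(c.isdigit() for c in w),
        "is_2_digit": bool(re.fullmatch(r"\d{2}", w)),
        "is_4_digit": bool(re.fullmatch(r"\d{4}", w)),
        "has_dot": "." in w, "has_hyphen": "-" in w,
        "all_upper": w.isupper(),
        "length": len(w),
        "position": i,                                     # position in the sentence
    }
    # context words in a (-2, +2) window
    for off in (-2, -1, 1, 2):
        j = i + off
        feats[f"w[{off}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats

print(token_features(["Amrita", "Vishwa", "Vidyapeetham", ",", "Coimbatore"], 2))
```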

2.4 Dataset

In the process of system development, we trained the model using CRF and SVM. The obtained model is tuned with the help of development data. This validated model is used for tagging the test data. The size of training, development and test dataset for all four languages used in the system is as follows:

Table 2: Size of Dataset

Languages Training Data Development Data Test Data

English 90005 19998 29473

Hindi 80992 13277 31453

Tamil 80015 17001 27407

Malayalam 45009 9782 14661

3 Conclusion

Since the English NER system learned using CRF takes more time for training the model, we used SVM algorithm for training other three systems such as Hindi, Tamil and Malayalam. This system uses three levels of training and testing, so the second level tagging depends on the result of first level tags which increase the error rate. Our future work will overcome this by using structured output learning. We plan to utilize collocations and associative measures as a feature in order to improve the performance of recognizing nested named entity tags in our future experiments. The parts-of-speech tag is the important feature for NER to identify the named entity chunk. Incorrect parts-of-speech tag for the token may result in reducing the accuracy of NER system.

References

[1] Vijayakrishna R and Sobha L, “Domain Focused Named Entity Recognizer for Tamil Using Conditional Random Fields”, in Proceedings of IJCNLP-08, pages 59-66 (2008)

[2] Georgi Georgiev, Preslav Nakov, Kuzman Ganchev, Petya Osenova and Kiril Simov, “Feature-Rich Named Entity Recognition for Bulgarian Using Conditional Random Fields”, International Conference RANLP 2009- Borovets, pages 113-117 (2009)

[3] Maksim Tkachenko and Andrey Simanovsky, “Named Entity Recognition: Exploring Features”, in Proceedings of KONVENS 2012 (2012)

[4] Sujan Kumar Saha, Sanjay Chatterji, Sandipan Dandapat, Sudeshna Sarkar and Pabitra Mitra, “A Hybrid Approach for Named Entity Recognition in Indian Languages”, in Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 17-24 (2008)

[5] Malarkodi, C S., Pattabhi, RK Rao and Sobha, Lalitha Devi, "Tamil NER - Coping with Real Time Challenges", in Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012), pages 23-38 (2012)

ISM@FIRE-2014: Named Entity Recognition for Indian Languages

Shantanu Dubey, Bharti Goel, Dinesh Kumar Prabhakar and Sukomal Pal Indian School of Mines, Dhanbad Jharkhand, India-826004 {shantanudubey750, bhartigoel0812, dinesh.nitr, sukomalpal}@gmail.com

ABSTRACT
This paper describes the approach we have applied to identify and annotate the proper tag for a given test word. Named Entity Recognition is a subtask of information extraction that seeks to locate and classify the proper names in a text. NER systems are extremely useful in many Natural Language Processing (NLP) applications such as question answering, machine translation, information extraction and so on. A Conditional Random Field (CRF) based Stanford classifier has been used for classification. Tokens are classified hierarchically. We have observed that performance on the development data gives moderate results.

1. INTRODUCTION
Named Entity Recognition (NER) is the process of identifying proper names (such as Shantanu, Dhanbad, U.P., etc.) and classifying them. NER is very useful in many Natural Language Processing (NLP) applications such as Machine Translation, Question Answering and Information Extraction [3]. In Named-Entity Recognition, the input is a set of text documents. The NER system processes these text files and identifies the named entities. Initially, texts are divided into a number of tokens. These tokens can be noun phrases, adjectives, prepositions, verbs, articles, etc. From these tokens, the system identifies proper nouns. A proper noun can be the name of a person, a place, an organization, a disease or a natural disaster, a continent, a year, a count, etc. We have used the Conditional Random Field (CRF) classifier from the Stanford Named-Entity Recognizer tool (version 3.4) [1, 2]. The task is accomplished in three stages: pre-processing, classifier training and classification.

1.1 Task Description
The Named Entity Recognition (NER) task is all about tagging words with the appropriate class symbol for a given sentence.
Input: The test data is in a three-column format. The first column contains the sentence, i.e., the list of tokens, the second contains the part-of-speech tag for the corresponding token, and the third column contains the chunk tag, as shown in Table 1.

Table 1: Input data format
Tokens      POS tag   Chunk tag
Department  NNP       I-NP
portal      NN        I-NP
and         CC        I-NP
know        VB        B-VP
about       IN        B-PP
the         DT        B-NP
services    NNS       I-NP
provided    VBN       B-VP
by          IN        B-PP
us.         CD        B-NP

Output: Based on the training, the classifiers should classify and annotate the test data with 1st, 2nd and 3rd level NE-tags. The final output is in the six-column format shown in Table 2.

Table 2: Expected output
Tokens      POS tag   Chunk tag   NE-tag   NE-tag   NE-tag
Department  NNP       I-NP        I-GOV    o        o
portal      NN        I-NP        o        o        o
and         CC        I-NP        o        o        o
know        VB        B-VP        o        o        o
about       IN        B-PP        o        o        o
the         DT        B-NP        o        o        o
services    NNS       I-NP        o        o        o
provided    VBN       B-VP        o        o        o
by          IN        B-PP        o        o        o
us.         CD        B-NP        o        o        o

Next, in Section 2 we discuss the approaches applied for the NER task. We describe results in Section 3. In Section 4 we conclude the paper and discuss future directions.

2. APPROACH

2.1 Pre-processing
In the pre-processing stage we removed all duplicates to get a file of unique words only. Then we changed all digits to 'D' to improve accuracy and also removed blank rows. Next we divided the training file into three different files (File1.txt, File2.txt and File3.txt), each containing four columns. The first training file contains Tokens, POS, Chunks and the 1st NE-tag (i.e., the fourth column). The second training file contains Tokens, POS, Chunks and the 2nd NE-tag (i.e., the fifth column), and the third training file contains Tokens, POS, Chunks and the 3rd NE-tag (i.e., the sixth column).
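The pre-processing just described can be sketched as follows; column positions follow Table 2, while the input file name and exact cleaning order are illustrative assumptions.

```python
# Sketch of the pre-processing step: de-duplicate rows, map digits to 'D',
# drop blank rows, and split the six-column training file into three
# level-specific four-column files. File names are illustrative.
import re

def preprocess(train_path):
    seen, rows = set(), []
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            line = re.sub(r"\d", "D", line.strip())    # replace every digit with 'D'
            if not line or line in seen:               # drop blanks and duplicates
                continue
            seen.add(line)
            rows.append(line.split("\t"))
    return rows

def split_by_level(rows):
    for level, out_path in enumerate(["File1.txt", "File2.txt", "File3.txt"], start=1):
        with open(out_path, "w", encoding="utf-8") as out:
            for cols in rows:
                token, pos, chunk = cols[0], cols[1], cols[2]
                ne_tag = cols[2 + level]               # 4th, 5th or 6th column
                out.write("\t".join([token, pos, chunk, ne_tag]) + "\n")

split_by_level(preprocess("train_six_column.txt"))
```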
This tool requires high-end system configuration to handle huge data which, unfortunately we did not have. 2. APPROACH Therefore we had manage the task with some Engineering solution. We form more than one classifier for first and sec- Correct Classification, Incorrect Classification and No Clas- ond training file on the basis of the frequency of each class. sification. In first training file, there were around 72 classes, so we form 12 classifier for first file named as classifier1 file1, clas- 3.1 Correct Classification sifier2 file1, classifier3 file1 and so on. In each classifier we This part consists list of some classes whose tokens are group 6 classes on the basis of the frequency. classifier1 file1 correctly classified. Table 3.1 shows the result of classifiers consist of those classes whose frequency is very high. Simi- which classified tokens based on first training file. larly classifier2 file1 consist of those classes whose frequency is less than classes which are present in classifier1 file1 but more than next upcoming classes. Table 4: List of some classes which are correctly classified Classes Precision Recall F-Measure

2.3 Classification
In this stage we first tested our classifiers on the development data. We observed that our system classified the data correctly to a large extent, although there was a substantial portion which was either classified incorrectly or not classified at all. The results are shown in Section 3.
Then we ran the classifiers on the test data. First we ran classifier1 file1 of the first training file on the test data present in test.txt; it annotated tokens according to the classes present in it and the rest were annotated as other (o). This output was stored in a file named temp1.txt. From temp1.txt we extracted the tokens annotated as other and stored them in a file named test1.txt. Next we ran classifier2 file1 of the first training file on test1.txt; it annotated tokens according to the classes present in it and the rest were annotated as other (o). This output was stored in temp2.txt. From temp2.txt we again extracted the tokens annotated as other and stored them in test2.txt. This procedure was repeated until all classifiers of the first training file had been run. Similarly, we ran the classifiers of the second and third training files on the test data, and finally we merged the outputs of all files. Figure 1 shows an overview of the procedure.

Figure 1: Classification Layout
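The staged classification just described can be sketched as follows. This is a minimal illustration under our own assumptions (each trained classifier is wrapped as a callable returning one tag per input token, with "o" for tokens outside its class group); it is not the actual scripts used for the submission.

def cascade_classify(tokens, classifiers):
    # classifiers: list of callables, ordered from the most frequent class group
    # to the least frequent; each returns one predicted tag per input token,
    # using "o" for tokens outside its own class group.
    tags = ["o"] * len(tokens)
    pending = list(range(len(tokens)))
    for classify in classifiers:
        if not pending:
            break
        predictions = classify([tokens[i] for i in pending])
        still_pending = []
        for index, tag in zip(pending, predictions):
            if tag == "o":
                still_pending.append(index)   # passed on to the next classifier
            else:
                tags[index] = tag             # claimed at this level; never revisited
        pending = still_pending
    return list(zip(tokens, tags))

As the conclusions below point out, a token misclassified at an early level never reaches the later classifiers, which is the main drawback of this cascade.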

3. RESULTS AND DISCUSSION
We observed that our classifiers assigned the tokens of the development data to the different classes present in the training data. These classifications are categorized into three parts: Correct Classification, Incorrect Classification and No Classification.

3.1 Correct Classification
This part lists some classes whose tokens are correctly classified. Table 4 shows the results of the classifiers that classified tokens based on the first training file.

Table 4: List of some classes which are correctly classified
Classes   Precision  Recall   F-Measure
B-YEAR    73.41%     94.69%   82.70%
I-PERIOD  21.30%     12.12%   36.20%
B-DATE    13.22%     29.09%   18.18%

3.2 Incorrect Classification
This part lists some classes whose tokens are incorrectly classified. Table 5 contains the classes whose tokens were incorrectly classified based on the first training file.

Table 5: List of classes incorrectly classified
Classes
B-ASSOCIATION
I-ASSOCIATION
I-DISEASE

3.3 No Classification
This part lists some classes whose tokens are not classified. Table 6 shows the classes whose tokens were not classified based on the first training file.

Table 6: List of classes not classified
Classes
I-COUNT
B-DATE
I-DATE

4. CONCLUSIONS AND FUTURE WORK
In this task we used a machine learning approach to named entity recognition. The CRF classifier, a supervised machine learning technique, was used for classification [4]. This classifier basically worked on small data. We observed that our system classified the test data correctly to a large extent, although there was a substantial portion which was either classified incorrectly or not classified at all. This happened possibly for two reasons: one, lack of test data belonging to that particular class, and two, hardware constraints (small RAM) in our system.
There can be another reason for the large number of misclassifications: the engineering hack used to handle a huge amount of data within a constrained system environment. When we pass test data through the first level of classifiers (classifier1 file1), we can only classify it into those classes whose information is available at that level. The rest are put into a common other class. If a token that should actually go to the other class is misclassified here, it never gets a chance to be classified correctly at the subsequent levels; only tokens in the other class are passed on to the next level. Had we been able to consider all possible classes at a single level, each token would have had an equal and uniform chance of classification. As we cannot classify all data into their appropriate classes at a single level, this causes a non-uniform chance of misclassification. However, it remains to be seen how the classification error is affected by this staged classification versus a single-level classification scheme, an important task which we could not perform because of hardware constraints, but which can be attempted as future work.

5. REFERENCES
[1] Finkel, J. R., Grenager, T., and Manning, C. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (2005), Association for Computational Linguistics, pp. 363-370.
[2] Finkel, J. R., Grenager, T., and Manning, C. Named Entity Recognizer (version 3.4). http://nlp.stanford.edu/software/CRF-NER.shtml#Download, 2014. [Online; accessed 20-September-2014].
[3] Nadeau, D., and Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investigationes 30, 1 (2007), 3-26.
[4] Sutton, C., and McCallum, A. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning (2006), 93-128.

UAM@SOCO 2014: Detection of Source Code Re-use by means of Combining Different Types of Representations

Notebook for SOCO at FIRE 2014
A. Ramírez-de-la-Cruz, G. Ramírez-de-la-Rosa, C. Sánchez-Sánchez
Universidad Autónoma Metropolitana, Unidad Cuajimalpa, México D.F.

W. A. Luna-Ramírez, H. Jiménez-Salazar, C. Rodríguez-Lucatero
Universidad Autónoma Metropolitana, Unidad Cuajimalpa, México D.F.

ABSTRACT
This paper describes the participation of the Language and Reasoning group from UAM-C in the SOurce COde re-use competition (SOCO 2014). We propose different representations of a source code which attempt to highlight different aspects of the code, particularly: i) lexical, ii) structural, and iii) stylistic. For the lexical view, we used a character 3-gram model that discards the reserved words of the programming language under review. For the structural view, we proposed two similarity metrics that take into account the function's signatures within a source code, namely the data types and the identifier names of the function's signature. The third view consists in accounting for several stylistic features, such as the number of white spaces, lines of code, upper-case letters, etc. Finally, we combine these different representations in three ways, each of which was a run submission for this year's SOCO competition. The obtained results indicate that the proposed representations provide information that allows us to detect particular cases of source code re-use.

Categories and Subject Descriptors
H.3.1 [Information storage and retrieval]: Content Analysis and Indexing - Linguistic processing; H.3.4 [Information storage and retrieval]: Systems and Software

Keywords
Lexical, structural and stylistic attributes, Document representation, Plagiarism detection, Source code re-use

1. INTRODUCTION
Identification of source code re-use is an interesting topic from two points of view. Firstly, the industry that produces software is always looking to protect its developments, and thus usually searches for any sign of unauthorized use of its own blocks of source code. Secondly, in the academic field, it is well known that the habit of copying programs is a common practice among computing students. This phenomenon is also motivated by all the facilities that web forums, blogs, repositories, etc. offer to share source codes which, most of the time, have already been debugged and tested.
Consequently, source code re-use detection has become an important research topic, motivating different groups to define the problem more formally in order to build automatic systems that identify it. As an example, in 1987 Faidhi and Robinson [5] proposed a seven-level hierarchy that aimed at representing most of the program modifications used by students when they plagiarize source code. As a consequence, many approaches that try to identify plagiarized code are based on these levels of complexity.
However, it is important to notice that programmers who re-use a source code usually apply not one but several obfuscation techniques when re-using sections of a program. Therefore, even though several techniques have been proposed to detect different types of source code re-use, it is very difficult for a single automatic system to detect all these different types of obfuscation practices.
In this work we propose different representations of a source code, namely: character n-grams, data types, identifiers' names, and stylistic features. Our intuitive idea is that by considering different aspects of a source code, it will be possible to capture some of the most common practices performed by programmers when they re-use a source code.

∗ Corresponding author. E-mail address: [email protected]
† Principal corresponding author. E-mail address: [email protected]
2. RELATED WORK
Lately, automated systems developed to identify source code re-use have been applying natural language processing (NLP) techniques adapted to this specific context. One example is a system that takes into account a remanence trace left after a copy of source code, such as white-space patterns [2]. The intuitive idea behind this approach is that a plagiarist camouflages almost everything when copying a source code except the white spaces. Accordingly, it computes similarities between source codes taking into account the use of letters (all represented as X) and white spaces (represented as S).
Other examples of automatic systems that employ NLP techniques are those based on word n-grams [1, 8]. These works consider several features of source code, such as identifiers, number of lines, number of hapax, etc. Their reported results were very promising. Some other works employ transformation techniques based on LSA, for example the work presented in [4]. There, the authors focused on three components: preprocessing (keeping or removing comments, keywords or the program skeleton), weighting (combining diverse local or global weights) and the dimensionality of LSA.
As can be observed, a common characteristic of previous works is that they attempt to capture several aspects of source code in one single, mixed representation (i.e., a single view) in order to detect source code re-use. Contrary to these previous works, our hypothesis states that each aspect (i.e., either structural or superficial elements) provides its own important information and should not be mixed with other aspects when representing source codes.

3. SHARED TASK DESCRIPTION
SOCO, Detection of SOurce COde Re-use, is a shared task that focuses on monolingual source code re-use detection. Participant systems were provided with a set of source codes in the C and Java programming languages. The task consists in retrieving the source code pairs that have been re-used at document level. The details of the task are described in [7].
The data set provided for the shared task is divided into two sets: training and test. The training set has two collections, for C and Java. The Java collection contains 259 source codes while the C collection contains 79 source codes. Note that the relevance judgements represent cases of re-use in both directions, i.e., the direction of the re-use is not detected.

4. PROPOSED SOURCE CODE REPRESENTATIONS
In this section we describe our proposed representations for source code, designed to capture the different aspects that help to detect source code re-use. We divided these representations into three views: lexical, structural and stylistic.

4.1 Lexical view: character 3-grams representation
The approach used in this representation was proposed by Flores Sáez [6]. The main idea is to represent a source code by means of a bag of character n-grams, Bj, where all white spaces and line breaks are deleted and the letters are changed into lowercase. In addition to the original method, we improve it by eliminating all the reserved words of the programming language from the document. Thus, given two codes, Cα and Cβ, their bags of character 3-grams are computed as mentioned before; then, each code is represented as a vector Cα and Cβ according to the vector space model proposed by [3]. Finally, the similarity between a pair of source codes is computed using the cosine similarity, which is defined as follows:

    sim3grams(Cα, Cβ) = cos(θ) = (Cα · Cβ) / (‖Cα‖ ‖Cβ‖)    (1)
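As an illustration of this lexical view, the following sketch (ours, not the authors' implementation) builds the bag of character 3-grams after stripping a given list of reserved words, removing whitespace and lowercasing, and then computes the cosine similarity of Equation 1. The reserved-word list and the weighting details of the original method are assumptions here.

import math
import re
from collections import Counter

def bag_of_char_3grams(code, reserved_words=()):
    # Remove reserved words of the language, then all whitespace and line breaks,
    # and lowercase everything before extracting overlapping character 3-grams.
    for word in reserved_words:
        code = re.sub(r"\b%s\b" % re.escape(word), "", code)
    code = re.sub(r"\s+", "", code).lower()
    return Counter(code[i:i + 3] for i in range(len(code) - 2))

def sim_3grams(bag_a, bag_b):
    # Equation 1: cosine similarity between the two 3-gram frequency vectors.
    dot = sum(count * bag_b[gram] for gram, count in bag_a.items() if gram in bag_b)
    norm = math.sqrt(sum(c * c for c in bag_a.values())) * math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / norm if norm else 0.0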
4.2 Structural view: data types from the function's signature representation
The proposed structural view consists of two forms of representation. The first one considers only the data types of the function's signatures¹. This representation attempts to compare elements that belong, to some extent, to the structure of the program by means of the data types of the function's signatures. Accordingly, we first translate each function's signature into a list of data types. For example, the function's signature int sum(int numX, int numY) is translated into int (int, int). Our proposed representation also accounts for the frequency of each data type.
To calculate the similarity between two functions, we need to compare two elements of the function's signature: the return data type and the arguments' data types. We measure the importance of each element independently and then merge them. Given two functions, mα and mβ, from source codes Cα and Cβ respectively, the similarity of their return data types (simr) is 1 if they are the same, and 0 otherwise. The similarity of their argument data types is a little more elaborate to compute. We use a bag of data types for each function, counting each repetition of each data type, and represent each function as a vector. Finally, we compute the similarity between the two functions' vectors mα and mβ as defined in Equation 2:

    sima(mα, mβ) = Σ_{i=0..n} min(mα_i, mβ_i) / Σ_{i=0..n} max(mα_i, mβ_i)    (2)

where n indicates the number of different data types in both source codes, i.e., the vocabulary of data types.
Once we have all the information from the function's signatures, i.e., the similarities of the return data types and of the arguments' data types, we compute a single similarity measure. For doing so, we merge the two measures by means of a linear combination, which represents the similarity between mα and mβ (see Equation 3):

    sim1(mα, mβ) = σ · simr(mα, mβ) + (1 − σ) · sima(mα, mβ)    (3)

where σ is a scalar that weights the importance of each term and satisfies 0 ≤ σ ≤ 1. For our experiments we set σ = 0.5, so that both parts are considered equally important.
Finally, in order to calculate this structural similarity value for two codes Cα and Cβ, we compute a function-similarity matrix M^type_{α,β}, where all functions in Cα are compared against all functions in Cβ. The final similarity between the two codes is then defined as in Equation 4:

    simDataTypes(Cα, Cβ) = f(M^type_{α,β})    (4)

where f(x) represents either the maximum value contained in the matrix or the average value among all values of the matrix.

¹ We will refer simply as function to every programming function within a source code.
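A minimal sketch of Equations 2 to 4 is given below, assuming each function has already been reduced to a (return_type, argument_type_bag) pair; the function names are ours. For the average variant of Equation 4, the aggregate argument can be replaced by, e.g., lambda values: sum(values) / len(values).

def sim_arguments(bag_a, bag_b):
    # Equation 2: ratio of the summed minima to the summed maxima of the
    # data-type counts (bag_a, bag_b map a data type to its frequency).
    types = set(bag_a) | set(bag_b)
    numerator = sum(min(bag_a.get(t, 0), bag_b.get(t, 0)) for t in types)
    denominator = sum(max(bag_a.get(t, 0), bag_b.get(t, 0)) for t in types)
    return numerator / denominator if denominator else 0.0

def sim_function_pair(func_a, func_b, sigma=0.5):
    # Equation 3: each function is a (return_type, argument_type_bag) pair.
    sim_return = 1.0 if func_a[0] == func_b[0] else 0.0
    return sigma * sim_return + (1.0 - sigma) * sim_arguments(func_a[1], func_b[1])

def sim_data_types(functions_a, functions_b, aggregate=max):
    # Equation 4: aggregate the function-similarity matrix with max (or an average).
    matrix = [sim_function_pair(fa, fb) for fa in functions_a for fb in functions_b]
    return aggregate(matrix) if matrix else 0.0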
4.3 Structural view: names from the function's signatures representation
As a complement to the previous representation, this one considers the structure by using the names of the functions as well as the names of their arguments. The representation concatenates the function's name with the names of the arguments, converts every character into lowercase and removes white spaces (if present). Thus, the function int sum(int numX, int numY) is represented as the string sumnumxnumy. Then we extract all the character 3-grams and form a bag of 3-grams.

Once we have computed the bag of 3-grams, we can compute how similar two functions are. Given two functions, mα and mβ, from Cα and Cβ respectively, and their corresponding vector representations using the bag of 3-grams, mα and mβ, we compute the similarity using the Jaccard coefficient as follows:

    sim2(mα, mβ) = |mα ∩ mβ| / |mα ∪ mβ|    (5)
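A short illustration of this names view, under our own assumptions about how a signature is supplied, is the following; the helper names are ours.

def signature_string(function_name, argument_names):
    # e.g. ("sum", ["numX", "numY"]) -> "sumnumxnumy"
    return "".join([function_name] + list(argument_names)).replace(" ", "").lower()

def sim2_jaccard(grams_a, grams_b):
    # Equation 5: Jaccard coefficient between the two sets of character 3-grams.
    set_a, set_b = set(grams_a), set(grams_b)
    return len(set_a & set_b) / len(set_a | set_b) if (set_a or set_b) else 0.0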

Similarly to the previous approach, every function in source code Cα is compared to every function in code Cβ. From this comparison of names we obtain a name-similarity matrix M^names_{α,β}. Hence, the final similarity value of Cα and Cβ is defined as established in Equation 6.

    simNames(Cα, Cβ) = f(M^names_{α,β})    (6)

where f(x) can be set to the maximum value in the matrix, or to the average value of the matrix.

4.4 Stylistic view
This representation aims at finding properties unique to the original author, such as his or her programming style. In this sense, we compute 11 stylistic features to represent each source code. Then, we use a vector representation and, by using the cosine similarity (see Equation 1), we find the similarity between two source codes. The eleven features are: number of lines with code, number of white spaces, number of tabulations, number of empty lines, number of functions, average word length, number of upper-case letters, number of lower-case letters, number of underscores, total number of words in the source code, and lexical richness.

5. EXPERIMENTAL EVALUATION
This evaluation was performed with the training data provided by the shared task. We carried out a series of experiments using single views in order to find the amount of relevant information given by each representation. For each experiment we compute the similarity values for the source code files given in the training data. Then, we measure the performance of each proposed representation by establishing a manual threshold for considering when two codes are plagiarized (re-used). That threshold was varied from 10 to 90 percent of similarity. For each threshold we evaluated the precision, recall and F-measure.
The results of this evaluation are given in Figures 1 to 6. Figure 1 presents the performance of the lexical view, i.e., using a character 3-grams model without considering reserved words. We found that a good compromise between precision and recall is reached at 80% of similarity, when the F-measure is 0.56. Figure 2 shows the performance of the stylistic representation. In general, the results of this representation were not good.

Figure 1: Lexical view. The best result is obtained at 80% of similarity between two methods.
Figure 2: Stylistic view. A high recall is obtained for every similarity threshold, but also very low precision.

For the structural representation, as mentioned before, we used two representations: the data types from the function's signatures and the names from the function's signatures. For each one, we define two ways to compute the similarity: (a) using the maximum value of similarity and (b) using the average of the similarities over all the functions of the two source codes. Figures 3 and 4 show the data-type function's signature representation. The best results are obtained when the similarity threshold is 90% (0.14 of F-measure) when considering the maximum, and 50% (0.16 of F-measure) when considering the average.

Figure 3: Structural view: data types of function's signatures using the maximum value of similarities between functions. The best result is obtained with more than 90% of similarity between two methods.
Figure 4: Structural view: data types of function's signatures using the average value of similarities between functions. The best result is obtained with 50% of similarity between two methods.

The performance of the second structural representation, i.e. the names from the function's signatures, is shown in Figures 5 and 6. The results show that the best F-measure, i.e., 0.26 and 0.22, was obtained when the similarity threshold between codes was set to 40% (maximum) and 20% (average), respectively.

Figure 5: Structural view: identifiers of function's signatures using the maximum value of similarities between functions. The best result is obtained with 40% of similarity between two methods.
Figure 6: Structural view: identifiers of function's signatures using the average value of similarities between functions. The best result is obtained with more than 20% of similarity between two methods.

All the results shown here are for the C language; however, similar results were obtained on the Java data set.
6. SUBMITTED RUNS
We submitted three runs for the task, based on three combinations of the proposed representations and considering the performance over the training set. Details about the runs and the results are shown below.

1. Lexical view only (run 1). For this experiment, we used the representation described in Section 4.1 with a similarity threshold of 50%. The results for C and Java are shown in Table 1.

2. Combination of lexical and structural views (run 2). For this experiment, we used a combination of two views: the lexical view as in the previous run (LexSim), and the structural view with both representations (DTSim and NameSim). Since the experimental evaluation over the training set showed much better results with the lexical view, we decided to give more weight to this factor. The evaluation experiments also showed that the two representations of the structural view are complementary, so in order to use this information we weight both equally. Equation 7 shows the employed linear combination:

    sim = (0.5 · LexSim) + (0.25 · DTSim) + (0.25 · NameSim)    (7)

The results of this experiment are shown in Table 2.
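Equation 7 amounts to the small helper sketched below; the function name and the assumption that the three per-pair similarities are already computed are ours.

def run2_similarity(lex_sim, dt_sim, name_sim, weights=(0.5, 0.25, 0.25)):
    # Equation 7: weighted linear combination of the lexical similarity and the
    # two structural similarities already computed for a pair of source codes.
    w_lex, w_dt, w_name = weights
    return w_lex * lex_sim + w_dt * dt_sim + w_name * name_sim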

Table 1: Results over the test set for run 1
       Precision  Recall  F-measure
C      0.006      1.00    0.013
Java   0.349      1.00    0.517

Table 2: Results over the test set for run 2
       Precision  Recall  F-measure
C      0.005      0.950   0.010
Java   0.019      0.928   0.037

3. Supervised approach (run 3). Here we decided to use all the similarities computed from all the views and a learning algorithm to classify all source code pairs into two classes: re-use and no re-use. For this we used a J48 decision tree as implemented in Weka. The results over the test data are shown in Table 3.

Table 3: Results over the test set for run 3
       Precision  Recall  F-measure
C      0.006      0.997   0.013
Java   0.691      0.968   0.807

From the previous tables we can see that our best result was achieved by the supervised approach (run 3). To get the big picture of the performance in this task, Table 4 shows our best system (run 3) against the best system for each language and the baseline (the baseline consists of a character 3-gram model weighted using term frequency and the cosine measure to compute the similarity; it considers as re-used cases all source code pairs that surpass a similarity threshold of 0.95).

Table 4: Comparison against the best method
                     Precision  Recall  F-measure
C Language
Our run 3            0.006      0.997   0.013
Best system          0.282      1.00    0.440
Baseline             0.258      0.345   0.295
Java Language
Our run 3            0.691      0.968   0.807
The 2nd best system  0.530      0.995   0.692
Baseline             0.457      0.712   0.556

From Table 4 we can draw interesting conclusions. First, our recall for detecting source code re-use in C is competitive with the recall of the best system (1.00 and 0.997), while ours is higher than the baseline. The problem was that our system also detected many more pairs of source codes that were not cases of re-use. The opposite happened with the performance for Java. Here our system performs very well in recall as well as in precision, which puts our system in first place in the performance ranking of all the participant systems in the task.

7. CONCLUSIONS AND FUTURE WORK
In this paper, we have described the experiments performed by the Language and Reasoning group from UAM-C in the context of the SOCO 2014 evaluation exercise. Our proposed system was designed to address the problem of source code re-use detection by employing different types of representations. Our intuitive idea states that different aspects (i.e., either structural or superficial) provide different (important) information and must not be mixed with other aspects when representing source codes.
Accordingly, we presented a method that represents a source code in several forms, each of them attempting to highlight a different aspect of the code. Particularly, we proposed three representations: i) lexical, ii) structural and iii) stylistic. For the lexical view, we used a modified implementation of the method proposed by Flores [6]. For the structural view, we proposed two similarity metrics that consider the function's signatures within the source code. Finally, for the third view we defined eleven features that intend to extract some stylistic attributes of the original author that are more difficult to obfuscate.
The results obtained during the training phase demonstrate that each type of representation does in fact provide some information that can be used to detect particular cases of source code re-use. A deeper analysis needs to be performed in order to determine the characteristics of those cases that are accurately detected by each proposed representation and, hence, to come up with a more adequate way of combining these representations.
Finally, the results obtained during the test phase motivate us to keep working in the same direction. It is important to remark that although the obtained F-measures were low, this was not the case for the precision and recall values of the experiments in Java and C respectively. Particularly, for the test experiments performed on the C subset, we believe that the low precision values are due to the fact that several source codes are not in pure C but are C/C++-like programs.

8. ACKNOWLEDGMENTS
This work was supported by CONACyT Mexico Project Grant CB-2010/153315.

9. REFERENCES
[1] A. Aiken. Moss, a system for detecting software plagiarism, 1994.
[2] N. Baer and R. Zeidman. Measuring whitespace pattern sequence as an indication of plagiarism. Journal of Software Engineering and Applications, 5(4):249-254, 2012.
[3] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[4] G. Cosma and M. Joy. Evaluating the performance of LSA for source-code plagiarism detection. Informatica, 36(4):409-424, 2013.
[5] J. A. W. Faidhi and S. K. Robinson. An empirical approach for detecting program similarity and plagiarism within a university programming environment. Comput. Educ., 11(1):11-19, Jan. 1987.
[6] E. Flores. Reutilización de código fuente entre lenguajes de programación. Master's thesis, Universidad Politécnica de Valencia, Valencia, España, February 2012.
[7] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. PAN@FIRE: Overview of SOCO track on the detection of source code re-use. In Proceedings of the Sixth Forum for Information Retrieval Evaluation (FIRE 2014), December 2014.
[8] S. Narayanan and S. Simi. Source code plagiarism detection and performance analysis using fingerprint based distance measure method. In Computer Science Education (ICCSE), 2012 7th International Conference on, pages 1065-1068, July 2012.

PAN@FIRE: Overview of SOCO Track on the Detection of SOurce COde Re-use

Enrique Flores¹, Paolo Rosso¹, Lidia Moreno¹, and Esaú Villatoro-Tello²

¹ Universitat Politècnica de València, Spain, {eflores,prosso,lmoreno}@dsic.upv.es
² Universidad Autónoma Metropolitana, Unidad Cuajimalpa, México, [email protected]

Abstract. This paper summarizes the goals, organization and results of the first SOCO competitive evaluation campaign for systems that automatically detect the source code re-use phenomenon. The detection of source code re-use is an important research field for both industry and academia. Accordingly, the PAN@FIRE task, named SOurce COde Re-use (SOCO), focused on the detection of re-used source codes in the C/C++ and Java programming languages. Participant systems were asked to annotate several source codes as to whether or not they represent cases of source code re-use. In total, three teams participated and submitted 13 runs. The training set consisted of annotations made by several experts, a feature which turns the SOCO 2014 collection into a useful data set for future evaluations and, at the same time, establishes a standard evaluation framework for future research.

Keywords: SOCO, Source code re-use, Plagiarism detection, Evaluation framework, Test collections

1 Introduction

Nowadays, information has become easily accessible with the advent of the Web. Blogs, forums, repositories, etc. have made source code widely available to be read, copied and modified. Programmers are tempted to re-use debugged and tested source codes that can be found easily on the Web. The vast amount of resources on the Web makes the manual analysis of suspicious source code re-use unfeasible. Therefore, there is an urgent need for developing automatic tools capable of accurately detecting the source code re-use phenomenon. Currently, software companies have a special interest in preserving their own intellectual property. In a survey of 3,970 developers, more than 75 percent of them admitted that they have re-used blocks of source code from elsewhere when developing their software³. Moreover, on the one hand, the academic environment has also become a potential scenario for research in source code re-use because it is a frequent practice among students. A recent survey [3] reveals that

source code plagiarism represents 30% of the cases of plagiarism in the academic field. In this context, students are tempted to re-use source code because, very often, they have to solve similar problems within their courses. Hence, the task of detecting source code re-use becomes even more difficult, since all the source codes will contain (to some extent) a considerable thematic overlap. On the other hand, detection of source code re-use in programming environments, such as programming contests, has an additional challenge, namely the large number of source codes that must be processed for detecting such practices [6]; as a result, source code re-use detection becomes somewhat unfeasible. Consequently, research on source code re-use detection has mostly been applied to closed groups [16, 14, 9].
Traditionally, the source code re-use detection problem has been approached from two main perspectives: i) feature comparison, and ii) structural comparison. In the first approach, i.e. feature comparison, the similarity between two programs considers features such as the average number of characters per line, the number of commented lines, etc. [18]. On the contrary, the structural comparison usually takes into account a more complex representation, e.g. the representation of a source code made by a compiler, which can be seen as a fingerprint representing the structure of a program; different techniques are then applied in order to determine whether or not a case of re-use exists. Examples of structural approaches are [14, 2, 7]. In [14], the authors search for the longest non-overlapped common substrings between a pair of fingerprints, whilst in [2] the source code is represented as a dependency graph in order to search for common sub-graphs. Finally, DeSoCoRe [7] proposes a comparison of two source codes at function level and looks for highly similar functions or methods in a graphical representation.
Whereas at PAN@CLEF the shared task addresses plagiarism detection in texts [13], PAN@FIRE focuses on the detection of source codes that have been re-used in a monolingual context, i.e., using the same programming language. It is worth mentioning that this situation represents a common scenario in the academic environment. Particularly, SOCO involves identifying and distinguishing the most similar source code pairs among a large source code collection. In the next sections we first define the task and then summarise the approaches of all participant systems as well as their results during the SOCO 2014 shared task.

³ http://www.out-law.com/page-4613

2 Task description

The SOCO shared task focuses on monolingual source code re-use detection, which means that participant systems must deal with the case where the suspicious and original source codes are written in the same programming language. Accordingly, participants are provided with a set of source codes written both in C/C++ and Java, where source codes have been tagged by language to ease the detection. Thus the task consists in retrieving source code pairs that have been re-used. It is important to mention that this task must be performed at document level, hence no specific fragments inside the source codes are expected to be identified; only pairs of source codes. Therefore, participant systems were asked to annotate several source codes as to whether or not they represent cases of source code re-use. This year's task was divided into two main phases: training and testing. For the training phase we provided an annotated corpus for each programming language, i.e., C/C++ and Java. This annotation includes information about whether a source code has been re-used and, if that is the case, what its original code is. It is worth mentioning that the order of each pair was not important⁴, e.g., if X has been re-used from Y, it was considered valid to retrieve the pair X-Y or the pair Y-X. Finally, for the testing phase the only annotation provided is the one corresponding to the programming language.

3 Corpus

In this section we describe the two corpora used in the SOCO 2014 competition. For the training phase, a corpus composed of source codes written in the C and Java programming languages was released. For the testing phase, participants were provided with source codes written in C-like languages (i.e., C and C++) and also in Java.

3.1 Training Corpus
The training collection consists of source codes written in the C and Java programming languages. For the construction of this collection we employed the corpus used in [1]. Source code re-use is committed in both programming languages but only at the monolingual level. The Java collection contains 259 source codes, which are labelled from 000.java to 258.java. The C collection contains 79 source codes, labelled from 000.c to 078.c. Relevance judgements represent re-used cases in both directions (X → Y and Y → X). Table 1 shows the characteristics of the training corpus and the κ value of the inter-annotator agreement [5].

Table 1. Total number of source codes and re-used source code pairs annotated by three experts. The last column shows their κ value for inter-annotator agreement.

Programming language   Number of source codes   Re-used source code pairs   Inter-annotator agreement
C                      79                       26                          0.480
Java                   259                      84                          0.668

As can be seen in Table 1, the inter-annotator agreement for the C collection represents a moderate agreement, whilst for the Java collection the kappa value indicates a substantial agreement between annotators [5]. These results indicate (to some extent) that the provided training corpus represents a reliable data set.

⁴ An additional challenge in plagiarism detection is to determine the direction of the plagiarism, i.e., which document is the original and which is the copy.

3.2 Test Corpus

The provided test corpus is divided by programming language (C/C++ and Java) and by scenario (i.e., different topics). This corpus has been extracted from the 2012 edition of the Google Code Jam Contest⁵. Each programming language contains 6 monolingual re-use scenarios (A1, A2, B1, B2, C1 and C2). Hence, the name of each file consists of the name of the scenario it belongs to plus an identifier; for example, file "B10021" belongs to scenario B1 and its identifier number is 0021. Table 2 shows the number of source codes for each programming language in every scenario.

Table 2. Number of source codes that the test corpus contains by programming language and scenario.

        A1      A2      B1      B2      C1    C2    Total
C/C++   5,408   5,195   4,939   3,873   335   145   19,895
Java    3,241   3,093   3,268   2,266   124   88    12,080

It is important to mention that there are no re-use cases between scenarios, therefore participant systems just need to look for re-used cases among the source code files inside each scenario. For example, participants do not have to submit a re-used case between files "B10021" and "B20013": notice that the first one belongs to scenario B1 but the second one belongs to B2. As can be noticed in Table 2, the amount of source code in the test set is significantly higher than in the training corpus. Therefore, participant systems are somewhat obliged to develop efficient applications for solving the SOCO task. Due to the huge size of the corpus it is practically impossible to label it manually by human reviewers. Hence, in order to evaluate the performance of participant systems, we prepared a pool formed by the detections provided in the different submitted runs [17]. Following this technique, a source code pair needs to appear in at least 66% of the competition runs to be considered a relevant judgement. Thus, for the construction of the relevance judgements, we considered all the submitted runs from participant systems with the addition of the two baselines described in the next section. Table 3 indicates the number of identified relevant judgements for each programming language and scenario.
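A minimal sketch of this pooling procedure is given below, assuming each run is represented simply as a set of detected code pairs stored in a canonical order; the 66% agreement rule is the one described above, while the function and variable names are ours.

from collections import Counter

def pooled_relevance_judgements(runs, agreement=0.66):
    # runs: list of sets, each containing the (idA, idB) pairs reported by one run
    # (participant runs plus the two baselines); pairs are assumed to be stored
    # in a canonical order so the same pair always looks identical.
    votes = Counter(pair for run in runs for pair in run)
    needed = agreement * len(runs)
    return {pair for pair, count in votes.items() if count >= needed}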

⁵ https://code.google.com/codejam/contest/1460488/dashboard

Table 3. Number of relevant judgements per programming language and scenario.

        A1    A2    B1    B2   C1   C2   Total
C/C++   105   92    95    50   8    0    350
Java    115   106   138   76   4    25   464

4 Evaluation Metrics

All the participants were asked to submit a detection file with all the source code pairs considered re-used. Participants were allowed to submit up to three runs. All the results were required to be formatted in XML as shown below. As can be noticed, for each suspicious source code pair there must be one entry in the XML file.

...

To evaluate the detection of re-used source code pairs we calculate Precision, Recall and the F1 measure. For ranking all the submitted runs we used the F1 measure in order to favour those systems that were able to obtain (high) balanced values of precision and recall. Two baselines have been considered for the SOCO 2014 task, which are described below:
- Baseline 1. Consists of the well-known JPlag tool [14] using its default parameters. In this model, the source code is parsed and converted into token strings. The greedy string tiling algorithm is then used to compare token strings and identify the longest non-overlapped common substrings.
- Baseline 2. Consists of the character 3-gram based model proposed in [8]. In this model, the source code is considered as a text and represented as character 3-grams, where these n-grams are weighted using a term frequency scheme. As pre-processing, whitespaces, line breaks and tabs are removed. All the characters are case-folded and characters repeated more than three times are truncated. Then, the similarity between two source codes is computed using the cosine similarity measure. For this baseline, a code pair is considered a re-use case if the similarity value is higher than 0.95.
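The following sketch illustrates the kind of processing Baseline 2 performs, under the assumptions stated in the text (whitespace removal, case folding, truncation of characters repeated more than three times, term-frequency weighted character 3-grams, cosine similarity, 0.95 threshold); it is our reading of the description, not the organizers' actual implementation.

import math
import re
from collections import Counter

def baseline2_profile(code):
    # Pre-processing described above: drop whitespace, line breaks and tabs,
    # case-fold, truncate characters repeated more than three times, then
    # build term-frequency weighted character 3-grams.
    code = re.sub(r"\s+", "", code).lower()
    code = re.sub(r"(.)\1{3,}", r"\1\1\1", code)
    return Counter(code[i:i + 3] for i in range(len(code) - 2))

def baseline2_is_reuse(code_a, code_b, threshold=0.95):
    a, b = baseline2_profile(code_a), baseline2_profile(code_b)
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    cosine = dot / norm if norm else 0.0
    return cosine > threshold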

5 Participation Overview

In total three teams participated and submitted 13 runs. In particular, the Autonomous University of the State of Mexico (UAEM) and the Universidad Autónoma

Metropolitana - Unidad Cuajimalpa (UAM-C) submitted three runs in both programming languages, while Dublin City University (DCU) submitted runs only in Java.
UAEM [11] used a model for the detection of source code re-use that is divided into four phases. In the first phase, only the lexical items of each source language are kept and runs of more than one whitespace are removed. In the second phase, a similarity measure is obtained for each source code with respect to the other source codes. The second phase uses as similarity measure the sum of the lengths of the different longest common substrings between the two source codes, normalised by the length of the longest code. Using the comparisons made for each source code, in the third phase a set of parameters is obtained that later allows the identification of re-used cases. The parameters obtained are: the value of the distance (1 - similarity), the rank of the distance (rank order of the most similar), the gap with the next closest code (only calculated for the first 10 closest codes) and, using the maximum gap among the 10 closest codes, the source codes that fall Before or After the maximum relative-difference gap are labelled accordingly. The result of the third phase is a matrix where each row represents the comparison of a source code with the other codes (columns). For the decision, a source code pair X ↔ Y is a re-used case if there is evidence of re-use in both directions, that is, X → Y and Y → X. A re-used case exists when the distance is less than 0.45 or the gap is greater than 0.14, provided that one of the additional conditions is also met: the first condition is that the rank must be at least in the second position, and the second is that the relative-difference label must be Before. The first runs for the C and Java languages were processed with the above conditions. However, in some cases the evidence in one direction was very high and in the other direction only almost reliable, yet according to the training corpus in Java such a pair was in most cases a re-used case. In the second run, if there was not high evidence of re-use in one direction, the pair could still be considered a re-used case if at least one of the two codes has a rank of 1, a relative-difference label of Before, and a gap greater than 0.1.
UAM-C [15] represents the source code in three views attempting to highlight different aspects of a source code: lexical, structural and stylistic. For the lexical view, they represent the source code using a bag of character 3-grams without the reserved words of the programming language. For the structural view, they proposed two similarities that take into account the functions' signatures within the source code, e.g. the data types and the identifier names of the functions' signature. The third view consists in accounting for stylistic attributes, such as the number of spaces, number of lines, upper-case letters, etc. For each view they computed a similarity value for each pair of source codes and then established a threshold calculated on the training corpus. In the first run, they only consider the first view with a manually defined similarity threshold of 0.5. In the second run, they use the first and the second views. From these two views they have three different similarities: lexical similarity (L), data-type similarity (DT), and identifier-name similarity (IN).
Then, they combined them as 0.5 L + 0.25 DT + 0.23 IN, according to a manually established confidence level. In the third run, they use 8 similarities derived from the three proposed views: one similarity for the lexical view, 6 similarities from the second view and 1 more for the stylistic view. Finally, they trained a model using a supervised approach to be used over the test corpus.
DCU [10] undertakes an information retrieval (IR) based approach for addressing the source code plagiarism task. First, they employ a Java parser to parse the Java source code files and build an annotated syntax tree (AST). Then, they extract content from selective fields of the AST to store as separate fields in a Lucene index. More specifically, the AST nodes from which they extract information are the statements, class names, function names, function bodies, string literals, arrays and comments. For every source code in the test corpus, they formulate a pseudo-query by extracting representative terms (those having the highest language modelling query likelihood estimate scores). A ranked list of 1000 documents along with their similarities with the query is retrieved after executing the query. The retrieval model they use is a language model (LM). Their model walks down this ranked list of documents (sorted in decreasing order of similarity) and stops where the relative decrease in similarity with respect to the previous document exceeds a predefined threshold value acting as a parameter. The documents collected this way are then reported as the re-used set of documents. In the first run, separate fields are created for each AST node type, e.g. the terms present in the class names and the method bodies are stored in separate fields, and relative term frequency statistics are computed for each field separately. In the second run, an AST is constructed from the program code using a Java parser and then a bag-of-words from the selected nodes of the AST is used; however, separate fields are not used to store the bag-of-words, so the index is essentially flat. In the third run, a simple bag-of-words document representation is used for a program, i.e., no program structure is taken into account.
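A small sketch of the ranked-list cut-off strategy described for DCU is given below; it assumes the ranked list is already sorted by decreasing similarity, and the default threshold value and function name are illustrative only. It follows the intuition that the first large relative drop in similarity marks the start of the non-plagiarized documents.

def cut_ranked_list(ranked, threshold=0.1):
    # ranked: list of (doc_id, similarity) sorted by decreasing similarity.
    # Walk down the list and stop at the first relative drop in similarity
    # larger than the threshold; documents seen before that point are the
    # ones reported as re-use candidates for this pseudo-query.
    if not ranked:
        return []
    kept = [ranked[0][0]]
    for (prev_id, prev_sim), (cur_id, cur_sim) in zip(ranked, ranked[1:]):
        relative_drop = (prev_sim - cur_sim) / prev_sim if prev_sim else 1.0
        if relative_drop > threshold:
            break
        kept.append(cur_id)
    return kept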

6 Results and Analysis

The results obtained by the participants are shown in Table 4 for the C programming language and in Table 5 for Java. As mentioned before, we ranked the results by means of the F1 measure, given that we prefer systems that are able to obtain (high) balanced values of Precision and Recall. The best results according to F1 were obtained by UAEM in C and UAM-C in Java.
In the C programming language, the two runs of UAEM were able to retrieve all the re-used source code pairs. The rule introduced for retrieving less obvious re-used cases in run 2 had a negative impact on the performance in terms of precision and, therefore, of F1. The results of UAM-C have been adversely affected in terms of precision by the huge number of retrieved source code pairs (+50K). This may have happened because they removed reserved words and took into account the number of functions according to the C language, whereas the C++ language includes new characteristics such as classes and methods and also new reserved words (e.g. cin or cout). Contrary to UAM-C, the other two teams as well as the baseline runs did not retrieve such an amount of re-used pairs. This fact has a direct impact on the formed pool, since it only takes into account the retrieved pairs with a certain degree of agreement (at least 4 out of 7 runs).

Table 4. Overall evaluation results for the C/C++ programming language. The ranking is based on the F1 values. Baseline 1 corresponds to the JPlag model and baseline 2 corresponds to a character 3-gram based model.

Position   Run            F1      Precision   Recall
1          UAEM-run 1-2   0.440   0.282       1.000
2          UAEM-run 3     0.387   0.240       1.000
#          baseline 2     0.295   0.258       0.345
#          baseline 1     0.190   0.350       0.130
3          UAM-C-run 1    0.013   0.006       1.000
4          UAM-C-run 3    0.013   0.006       0.997
5          UAM-C-run 2    0.010   0.005       0.950

In the Java scenario, UAM-C achieved the best performance, with a balance between precision and recall. The combination of all the similarity measures (lexical, structural and stylistic) using a supervised decision tree was decisive. The second run was affected by the same phenomenon as in the C scenario: it retrieved over 10K re-used source code pairs. The three runs of DCU achieved a similar performance. In the second run, the addition of the bag-of-words from the selected nodes of the AST slightly improved the performance of the first run. DCU did not select nodes to create the bag-of-words in the third run, which produced slightly lower results.

Table 5. Overall evaluation results for the Java programming language. The ranking is based on the F1 values. Baseline 1 corresponds to the JPlag model and baseline 2 corresponds to a character 3-gram based model.

Position   Run            F1      Precision   Recall
1          UAM-C-run 3    0.807   0.691       0.968
2          DCU-run 2      0.692   0.530       0.995
3          DCU-run 3      0.680   0.515       1.000
4          DCU-run 1      0.602   0.432       0.995
#          baseline 2     0.556   0.457       0.712
5          UAEM-run 1     0.556   0.385       1.000
6          UAM-C-run 1    0.517   0.349       1.000
#          baseline 1     0.380   0.542       0.293
7          UAEM-run 2-3   0.273   0.158       1.000
8          UAM-C-run 2    0.037   0.019       0.928

In general, different approaches were applied to solve the problem of source code re-use detection. The proposed approaches vary from string matching to abstract-syntax-tree based models. Additionally, given that all these approaches were evaluated under the same conditions employing the same collections, it was possible to make a fairer comparison among participant systems. Accordingly, the best performing model was the string-matching based one in C [11], while a combination of lexical, structural and stylistic representations performed best in Java [15].

7 Remarks and Future Work

In this paper we have presented an overview of the Detection of SOurce COde Re-use (SOCO) PAN track at FIRE. In particular, SOCO 2014 provided a task specification which is particularly challenging for participating systems. The task focused on retrieving cases of re-used source code pairs from a large collection of programs. At the same time, SOCO has provided an evaluation framework where all participants were able to compare their results by applying different approaches under the same conditions and using the same corpora. With these specifications, the task turned out to be particularly challenging and well beyond the current state of the art of participant systems.
In total three teams participated and submitted 13 runs. We summarised the approaches followed by each of the participant systems and presented the evaluation of the submitted runs along with the respective analysis. In general, different approaches were proposed, varying from string matching to abstract-syntax-tree based models. It is important to notice that the participation for the Java language was much higher than for the C programming language (8 vs. 5 runs). Nevertheless, the team that achieved the best results in both scenarios (i.e., C/C++ and Java) was UAEM by means of their string-matching approach.
Finally, a note has to be made with respect to the re-usability of the test collections, whose relevance judgements were calculated using a pool formed by the submitted and baseline runs: more experiments need to be performed in order to construct more fine-grained relevance judgements. Nonetheless, both the training and test collections represent a valuable resource for future research work in the field of source code re-use identification.

Acknowledgement

We want to thank C. Arwin and S. Tahaghoghi for providing the training collection. PAN@FIRE (SOCO) has been organised in the framework of the WIQ-EI (EC IRSES grant n. 269180) and DIANA-APPLICATIONS (TIN2012-38603-C02-01) research projects. The work of the last author was supported by CONACyT Mexico Project Grant CB-2010/153315, and SEP-PROMEP UAM-PTC-380/48510349.

References

1. Arwin, C., Tahaghoghi, S.: Plagiarism detection across programming languages. Proc. 29th Australasian Computer Science Conference, 48, 277-286 (2006)
2. Chae, D., Ha, J., Kim, S., Kang, B., Im, E.: Software plagiarism detection: a graph-based approach. Proc. 22nd ACM Int. Conf. Information & Knowledge Management, CIKM-2013, 1577-1580 (2013)
3. Chuda, D., Navrat, P., Kovacova, B., Humay, P.: The issue of (software) plagiarism: A student view. IEEE Trans. Educ., 55, 22-28 (2012)
4. FIRE (ed.): FIRE 2014 Working Notes. Sixth International Workshop of the Forum for Information Retrieval Evaluation, Bangalore, India, 5-7 December (2014)
5. Fleiss, J.: Measuring nominal scale agreement among many raters. Psychological Bulletin, American Psychological Association, 76(5), 378-382 (1971)
6. Flores, E., Barrón-Cedeño, A., Moreno, L., Rosso, P.: Uncovering source code reuse in large-scale academic environments. Comput. Appl. Eng. Educ., doi: 10.1002/cae.21608 (2014) (online)
7. Flores, E., Barrón-Cedeño, A., Rosso, P., Moreno, L.: DeSoCoRe: Detecting Source Code Re-Use across Programming Languages. Proc. 12th Int. Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-2012, 1-4 (2012)
8. Flores, E., Barrón-Cedeño, A., Rosso, P., Moreno, L.: Towards the Detection of Cross-Language Source Code Reuse. Proc. 16th Int. Conf. on Applications of Natural Language to Information Systems, NLDB-2011, Springer-Verlag, LNCS(6716), 250-253, doi: 10.1007/978-3-642-22327-3_31 (2011)
9. Flores, E., Ibarra-Romero, M., Moreno, L., Sidorov, G., Rosso, P.: Modelos de recuperación de información basados en n-gramas aplicados a la reutilización de código fuente. Proc. 3rd Spanish Conf. on Information Retrieval, CERI-2014, 185-188 (2014) (In Spanish)
10. Ganguly, D., Jones, G.: DCU@FIRE-2014: An Information Retrieval Approach for Source Code Plagiarism Detection. In FIRE [4]
11. García-Hernández, R., Lendeneva, Y.: Identification of similar source codes based on longest common substrings. In FIRE [4]
12. Marinescu, D., Baicoianu, A., Dimitriu, S.: Software for Plagiarism Detection in Computer Source Code. Proc. 7th Int. Conf. Virtual Learning, ICVL-2012, 373-379 (2012)
13. Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th International Competition on Plagiarism Detection. In: Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (Eds.) CLEF 2014 Labs and Workshops, Notebook Papers. CEUR-WS.org, 1180, 845-876 (2014)
14. Prechelt, L., Philippsen, M., Malpohl, G.: JPlag: Finding plagiarisms among a set of programs. Tech. Report, Universität Karlsruhe (2000)
15. Ramírez-de-la-Cruz, A., Ramírez-de-la-Rosa, G., Sánchez-Sánchez, C., Luna-Ramírez, W. A., Jiménez-Salazar, H., Rodríguez-Lucatero, C.: UAM@SOCO 2014: Detection of Source Code Reuse by means of combining different types of representations. In FIRE [4]
16. Rosales, F., García, A., Rodríguez, S., Pedraza, J., Méndez, R., Nieto, M.: Detection of plagiarism in programming assignments. IEEE Trans. Educ., 51, 174-183 (2008)
17. Spärck Jones, K., van Rijsbergen, C.: Report on the need for and provision of an "ideal" information retrieval test collection. British Library Research and Development Report, 5266, University of Cambridge (1975)

18. Whale, G.: Software metrics and plagiarism detection. Proc. Ninth Australian Computer Science Conf., 231-241 (1990)

DCU@FIRE-2014: An Information Retrieval Approach for Source Code Plagiarism Detection

Debasis Ganguly, Gareth J.F. Jones
Centre for Global Intelligent Computing (CNGL), School of Computing, Dublin City University, Dublin, Ireland
[email protected], [email protected]

ABSTRACT
This paper investigates an information retrieval (IR) based approach for source code plagiarism detection. The method of extensively checking pairwise similarities between documents is not scalable for large collections of source code documents. To make the task of source code plagiarism detection fast and scalable in practice, we propose an IR based approach in which each document is treated as a pseudo-query in order to retrieve a list of potential candidate documents in decreasing order of their similarity values. A threshold is then applied to the relative similarity decrement ratios to report a set of documents as potential cases of source code reuse. Instead of treating a source code as an unstructured text document, we explore term extraction from the annotated parse tree of a source code and also make use of a field based language model for indexing and retrieval of source code documents. Results confirm that source code parsing plays a vital role in improving the plagiarism prediction accuracy.

Categories and Subject Descriptors
H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval - Retrieval models, Relevance Feedback, Query formulation

General Terms
Theory, Experimentation

Keywords
Source Code Plagiarism Detection, Field Search

1. INTRODUCTION
Community question answering (CQA) forums and programming related blogs have made source code widely available to be read, copied and modified. Programmers often tend to re-use source code snippets that are available on the web. The massive amount of programming resources on the web makes it infeasible in practice to perform manual analysis of suspicious source code re-usage. Consequently, there is a need for developing automated methods for detecting source code plagiarism. This is particularly the case for software development companies who want to preserve their intellectual property.
The problem of software plagiarism is challenging because the bag-of-words encoding of source codes in a particular programming language is likely to result in a massive number of hits (non-zero similarity values) due to the use of similar programming-language-specific constructs and keywords. It is therefore highly inefficient to compute pairwise similarity values between source codes in a reasonably large collection.
This pairwise computation can be avoided by an information retrieval (IR) based approach. In this approach, each document is first added to an inverted-list based index organization and then each document in turn can be treated as a pseudo-query to retrieve a ranked list of similar documents from the collection. The plagiarized documents can then be selected from the retrieved list of documents. The research questions in the IR based approach are the following.
Q1: Does a bag-of-words model (as used in standard IR) suffice or should the source code structure be used in some way to extract more meaningful pieces of information?
Q2: How to index the source code documents so that a retrieval model can best make use of the indexed terms to retrieve relevant (plagiarized) documents at top ranks?
Q3: How to represent a source code document as a pseudo-query, i.e. what are the most likely representative terms in a source code document?
The rest of the paper describes our investigation of these research questions as part of our participation in the source code plagiarism detection task SOurce COde reuse (SOCO) in FIRE 2014 [4].
Section 2 discusses the limitations of the existing approaches for software plagiarism detection and motivates the need for an IR approach. In Section 3, we describe our IR based approach to index a large collection of source code documents and retrieve candidate plagiarized documents from this indexed collection. In Section 4, we present the results on the development set of documents and our official results on the test collection. Finally, Section 5 concludes the paper with directions for future work.

2. LIMITATIONS OF EXISTING METHODS
In this section, we first describe why some standard techniques of document similarity estimation are likely to fail for source code. We then discuss our proposed method, in which we attempt to alleviate each of these problems.

2.1 Near Duplicate Document Detection
It is usually the case in software plagiarism that only a part of the source code is copied for reuse in another program. Occurrences of exact duplicates at the level of whole documents are rare. Consequently, the standard techniques of near duplicate document detection for large collections, such as shingling [1], may not be useful for this problem, because the Jaccard coefficient of the sets of shingles for two source codes (a part of one being copy-pasted into the other) is likely to be low.

2.2 Bag-of-words Model
Most existing approaches for source code plagiarism detection take into account programming language specific features. This is because a bag-of-words encoding of the source codes is likely to result in falsely high similarity values between non-plagiarized document pairs due to the use of similar programming language specific constructs and keywords. Programs tend to use a frequent set of variable names, especially for looping constructs, e.g. i, j, k etc., which may also cause falsely high similarities with a flat bag-of-words representation of documents. Furthermore, programs extensively make use of common library classes, such as "ArrayList", "HashMap" etc. in Java, which may also contribute to false matches; for instance, two Java programs making use of the standard library "HashMap" may be falsely identified as a plagiarized pair. It is therefore of utmost importance to take into account the structure of a program while computing the similarity. In fact, it has been shown that a control-flow graph based analysis of program pairs produces significantly better results than a simple term frequency based approach [2].
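To illustrate the point made in Section 2.1, the following minimal Java sketch (an illustration only, not part of the system described here; the shingle length and whitespace tokenization are our own assumptions) computes the Jaccard coefficient over word 3-gram shingles of two token sequences. When only a fragment of one program is copied into another, the shared shingles are a small fraction of the union, so the coefficient stays low.

import java.util.*;

public class ShingleJaccard {
    // Build the set of k-token shingles of a token sequence.
    static Set<String> shingles(List<String> tokens, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + k)));
        }
        return result;
    }

    // Jaccard coefficient of two shingle sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("int i = 0 ; while ( i < n ) { sum += a [ i ] ; i ++ ; }".split(" "));
        // A larger program that re-uses only the loop above.
        List<String> suspect = new ArrayList<>(Arrays.asList("public class Copy { void run ( int n , int [ ] a ) {".split(" ")));
        suspect.addAll(original);
        suspect.addAll(Arrays.asList("System . out . println ( sum ) ; } }".split(" ")));
        System.out.println(jaccard(shingles(original, 3), shingles(suspect, 3)));  // well below 1.0
    }
}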
2.3 Exhaustive Pair-wise Similarity
An exhaustive pairwise similarity computation between all document pairs in a collection is clearly intractable for large collections. However, all previously reported approaches to source code plagiarism detection that we are aware of use an exhaustive pairwise similarity computation [2, 3]. Due to the difficulty of carrying out experimental investigations on large software collections, such approaches are mostly evaluated on very small collections of source codes, e.g. the evaluation in [2] uses a collection of 56 programs. A standard way to avoid per-pair similarity computation is to use an inverted list organization of documents to retrieve a candidate list of the most similar documents with respect to a given query. Next, we describe how such an approach can be applied to the software plagiarism detection task.

3. IR APPROACH
In this section, we first describe how an IR based approach is used to obtain a set of candidate plagiarized documents from a ranked list of documents retrieved in response to a pseudo-query constituted from the current document under consideration, and then describe in detail how the documents are indexed and how the pseudo-queries are formulated.

3.1 Retrieval with Pseudo-query
To detect all plagiarized document sets in a collection, we propose to treat every document in the collection as a pseudo-query and retrieve a list of the top ranked most similar documents in response to that query. However, in contrast to the standard method of result presentation with the help of a ranked list in IR, the objective in the case of plagiarism detection is to obtain a set of documents that are to be predicted as plagiarized.
To get this candidate set of plagiarized documents from the ranked list of (say 1000) retrieved documents, we need to cut off the ranked list at some point, because the documents further down the list are progressively less likely to be relevant (plagiarized). The use of thresholding to obtain a smaller set of candidate documents has been reported in previous work [2, 3].
The cut-off strategy that we use in particular is a thresholding on the relative drops in similarity values of the ranked list. More precisely, we go on accumulating documents from the ranked list of retrieved documents until the relative decrease in similarity of the i-th ranked document with respect to the (i-1)-th one is higher than a pre-defined threshold, say ε, as shown in Equation 1. The reasoning behind this is that the first relative drop higher than the threshold most likely indicates the start of the non-plagiarized documents. Intuitively speaking, the relative similarity decrement values, in contrast to the absolute similarity values, are normalized and hence are devoid of any document and query length effects.

    Plag(Q) = { D_i : (sim(Q, D_i) - sim(Q, D_{i-1})) / sim(Q, D_{i-1}) ≤ ε }    (1)
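The following minimal Java sketch illustrates the cut-off strategy of Equation 1 (our own illustration, not the actual system code): given the similarity scores of a ranked list, it accumulates documents until the first relative drop exceeds the threshold ε.

import java.util.*;

public class RelativeDropCutoff {
    // Return the indices of the ranked list kept as candidate plagiarized documents.
    // scores must be sorted in decreasing order of similarity to the pseudo-query.
    static List<Integer> plagCandidates(double[] scores, double epsilon) {
        List<Integer> kept = new ArrayList<>();
        if (scores.length == 0) return kept;
        kept.add(0);                       // the top ranked document is always kept
        for (int i = 1; i < scores.length; i++) {
            double relativeDrop = (scores[i - 1] - scores[i]) / scores[i - 1];
            if (relativeDrop > epsilon) {  // first large drop: start of non-plagiarized documents
                break;
            }
            kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] scores = {0.92, 0.90, 0.87, 0.41, 0.40};   // hypothetical retrieval scores
        System.out.println(plagCandidates(scores, 0.2));    // prints [0, 1, 2]
    }
}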
3.2 Document Representation
As discussed in Section 2, the bag-of-words representation of a source code document cannot effectively capture the cases where a part of one source code file is copy-pasted into another. To alleviate this problem, a solution is to take into account the code structure of a source file while representing the document as a vector. Specifically, as a part of document processing for indexing, we used a Java parser (http://code.google.com/p/javaparser/) to construct an annotated syntax tree (AST) from each source code document in our collection. We then extract terms from specific nodes of the AST. The words extracted from the list of AST nodes shown in Table 3 are then stored in separate fields of a Lucene index. A field representation of a document is supposed to better utilize the document structure than a flat view, e.g. a match in the string literal field is treated separately from a match in the class name field, as a result of which a program using the string constant "HelloWorld" is not considered as plagiarized from a source file which defines a class named "HelloWorld".

3.3 Query Representation
Since typically only a part of one source code file is copy-pasted into another, it is not reasonable to use whole source code documents as pseudo-queries. Instead, we extract a pre-set number of terms from each field of a document (see Table 3) for constructing a pseudo-query. The term selection function that we use in particular is the language modeling (LM) term score [5], shown in Equation 2.

    LM(t, f, d) = λ · tf(t, f, d) / len(f, d) + (1 − λ) · cf(t) / cs    (2)

Specifically speaking, in order to formulate a query from a document d, we score each term of each field of d by the function shown in Equation 2 and then select the top k ones, where k is a parameter. The parameter λ controls the relative importance of the term frequency as against the collection frequency.
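As an illustration of the field based indexing described in Section 3.2, the sketch below builds a Lucene document with one field per AST node type of Table 3. It is only a sketch under our own assumptions (field names, analyzer, and an in-memory index of a recent Lucene version); the paper does not specify these details.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FieldIndexSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // in-memory index, for the sketch only
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // Terms extracted from the AST of one source file (hypothetical values).
        Document doc = new Document();
        doc.add(new TextField("classes", "HelloWorld", Field.Store.YES));
        doc.add(new TextField("methodCalls", "println parseInt", Field.Store.YES));
        doc.add(new TextField("stringLiterals", "HelloWorld", Field.Store.YES));
        doc.add(new TextField("methodDefinitions", "main args", Field.Store.YES));
        doc.add(new TextField("imports", "java.util.List", Field.Store.YES));
        // ... one field per row of Table 3 (arrays, assignment statements, comments).

        writer.addDocument(doc);   // a class-name match and a string-literal match now live in different fields
        writer.close();
        dir.close();
    }
}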

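A small self-contained sketch of the pseudo-query formulation of Section 3.3 (again our own illustration): each term of a field is scored with the smoothed LM score of Equation 2 and the top k terms are kept. The corpus statistics cf(t) and cs are assumed to be available from the index.

import java.util.*;

public class PseudoQueryTerms {
    // Equation 2: LM(t,f,d) = lambda * tf/len + (1 - lambda) * cf/cs
    static double lmScore(int tf, int fieldLen, long cf, long collectionSize, double lambda) {
        return lambda * ((double) tf / fieldLen) + (1.0 - lambda) * ((double) cf / collectionSize);
    }

    // Select the k highest scoring terms of one document field.
    static List<String> topKTerms(Map<String, Integer> fieldTermFreqs,
                                  Map<String, Long> collectionFreqs,
                                  long collectionSize, double lambda, int k) {
        int fieldLen = fieldTermFreqs.values().stream().mapToInt(Integer::intValue).sum();
        return fieldTermFreqs.keySet().stream()
                .sorted(Comparator.comparingDouble((String t) ->
                        lmScore(fieldTermFreqs.get(t), fieldLen,
                                collectionFreqs.getOrDefault(t, 1L), collectionSize, lambda)).reversed())
                .limit(k)
                .collect(java.util.stream.Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical term statistics for one field of one document.
        Map<String, Integer> tf = Map.of("hashmap", 4, "i", 9, "parseorder", 2);
        Map<String, Long> cf = Map.of("hashmap", 90_000L, "i", 500_000L, "parseorder", 40L);
        System.out.println(topKTerms(tf, cf, 1_000_000L, 0.6, 2));
    }
}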
Table 1: Results on the training data.
AST  Fields  #Qry terms  n-gram  Precision  Recall  F-score
no   no      all         1       0.8000     0.0952  0.1702
no   no      50          1       0.6363     0.4166  0.5035
yes  no      all         1       0.7778     0.0833  0.1505
yes  no      50          1       0.7631     0.3452  0.4754
yes  yes     all         1       0.8235     0.1666  0.2772
yes  yes     50          1       0.7894     0.3571  0.4918
no   no      all         2       0.7667     0.2738  0.4035
no   no      50          2       0.7200     0.4285  0.5373
yes  no      all         2       0.7391     0.2023  0.3177
yes  no      50          2       0.5714     0.2857  0.3809
yes  yes     all         2       0.7826     0.2142  0.3364
yes  yes     50          2       0.6842     0.6190  0.6500

Table 2: SOCO official results.
Run Name  Parse  Fields  #Qry terms  n-gram  Precision  Recall  F-score
dcu-run1  no     no      50          2       0.432      0.995   0.602
dcu-run2  yes    no      50          2       0.530      0.995   0.692
dcu-run3  yes    yes     50          2       0.515      1.000   0.680

Table 3: Annotated syntax tree nodes of a Java program from which terms are extracted during indexing.
Field Name             Field Description
Classes                The names of the classes used in a Java source
Method calls           The method names and actual parameter names and types
String literals        Values of the string constants
Arrays                 Names of arrays and dimensions
Method definitions     Names of methods and formal parameter names and types
Assignment statements  Variable names and types
Package imports        Names of imported packages
Comments

4. EXPERIMENTS AND RESULTS
In this section, we report the results of the experiments conducted on the training data and the official results of the SOCO task [4].

4.1 Baselines
As baseline approaches, we submitted two runs to the SOCO task. The first simply uses standard LM retrieval with a flat bag-of-words representation. Source code documents are treated like non-structured text documents and the index does not consist of separate fields.
As a second baseline approach, we submitted a run where we extract only the terms from the selected nodes of the AST (see Table 3). However, we do not store these terms in separate fields; instead, we use a single field to store them.
In addition to the official submissions, we also investigated other approaches with different parameter settings, e.g. using different numbers of terms while constructing the pseudo-queries, and unigram and bi-gram (word level) indexing.

4.2 Results on Training Data
The results obtained with different parameter settings are shown in Table 1. The first observation that can be made from Table 1 is that the use of all terms while constructing the pseudo-query from a document results in a very low recall. The second important observation is that the method of extracting terms from selected nodes of the AST is not of much use without the document field structure. The third observation is that making use of word bi-grams for indexing and retrieval tends to improve results in all cases.

4.3 Results on Test Set (Official Results)
The official results of our submitted runs are shown in Table 2. It can be seen that the results on the test set are somewhat different from those on the training set. For instance, a flat index constituted from the AST terms produces very good results, which was not the case for the training set (cf. Table 1). Flat indexing with no parsing yields considerably worse precision (and hence F-score) in comparison to the parsing based approaches. Surprisingly, field based LM does not turn out to be more effective than the standard bag-of-words LM.

5. CONCLUSIONS AND FUTURE WORK
This paper described our approach to the problem of source code plagiarism detection. The key idea of our approach centers around the hypothesis that program structure is important for determining source code plagiarism. Both the development set and the official results empirically confirm this hypothesis. Thus, the answer to research question Q1 (see Section 1) is that parsing the source code helps extract more meaningful pieces of information, which can in turn be used to improve the accuracy of plagiarism detection.
Whether a field based retrieval model improves results further is inconclusive from the results on the development and test sets. Research question Q2, where we wanted to explore effective ways of document representation, remains unresolved because of the apparently anomalous results obtained on the development and test sets.
In the third research question, namely Q3 (see Section 1), we wanted to explore effective ways of representing a document as a pseudo-query. The results show that an LM based method of selecting representative terms works significantly better than using all terms for pseudo-query formulation.
In the future, we would like to explore the reasons for the apparent anomaly between the development set and the test set results. Using different weights for different fields during the retrieval process is another direction for future research.

Acknowledgments This research is supported by Science Foundation Ireland (SFI) as a part of the CNGL Centre for Global Intelligent Content at DCU (Grant No: 12/CE/I2267).

6. REFERENCES
[1] A. Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21-23, 2000, pages 1-10, 2000.
[2] D. Chae, J. Ha, S. Kim, B. Kang, and E. G. Im. Software plagiarism detection: a graph-based approach. In 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 - November 1, 2013, pages 1577-1580.
[3] D.-K. Chae, S.-W. Kim, J. Ha, S.-C. Lee, and G. Woo. Software plagiarism detection via the static API call frequency birthmark. In Proceedings of the 28th Annual ACM Symposium on Applied Computing, SAC '13, pages 1639-1643, New York, NY, USA, 2013. ACM.
[4] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. PAN@FIRE: Overview of SOCO Track on the Detection of SOurce COde Re-use. In Sixth Forum for Information Retrieval Evaluation (FIRE 2014), Bangalore, India, 2014.
[5] D. Hiemstra. Using Language Models for Information Retrieval. PhD thesis, CTIT, AE Enschede, 2000.

Identification of Similar Source Codes based on Longest Common Substrings

René Arnulfo García Hernández and Yulia Ledeneva
Autonomous University of the State of Mexico
Santiago Tianguistenco, San Pedro Tlaltizapan, State of Mexico, Mexico
[email protected], [email protected]

ABSTRACT
In this paper, we describe the system developed by the Autonomous University of the State of Mexico (in Spanish, UAEM) for the detection of source code re-use (SOCO) task of FIRE 2014. The aim of the SOCO task is to detect the most similar code pairs within a large source code collection in the Java and C languages. Our method is divided into four phases: preprocessing, similarity measure, ranking and taking the decision. One way to measure the similarity between a pair of codes is to use the length of the Longest Common Substring (LCS). However, besides the LCS there is another important set of longest common substrings (shorter than the LCS) that is not taken into account. Our hypothesis is that if we use all the longest common substrings (LCSs) it is possible to improve the detection of similarity between two codes. The second hypothesis is that a re-use case not only depends on the value of a measure but also on the similarity with other codes. For this, we derive other parameters using the LCSs measure with respect to the other codes. For taking a re-use case decision, we obtain some rules using the training corpus of SOCO.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms
Algorithms, Measurement, Performance, Experimentation, Languages

Keywords
Source Code Reuse, Longest Common Substrings, Similar Codes, Java Code Reuse, C Code Reuse.
1. INTRODUCTION
Even though it is common to find many web pages showing source code in different languages, source code is the result of an intellectual effort and, for that reason, it is protected by copyright laws. Normally, source code on the web is presented in short fragments for tutorial purposes. However, re-using source code from other works brings economic problems for the author and legal problems for whoever commits the act. Some automatic tools have been developed to assist with the detection of source code re-use. These tools can be classified as intrinsic or extrinsic. Intrinsic tools search for re-use cases within a given collection of source codes; in this case, every code in the collection is considered a suspect. In extrinsic tools, the problem consists of, given a suspect, finding the re-use case in other collections, such as the Web.
JPLAG and MOSS are examples of free tools. JPLAG [1] was developed by Guido Malpohl in 1996 and supports Java, C#, C++, Scheme and natural language text. JPLAG uses a variation of the Karp-Rabin comparison algorithm developed by Wise [2]. First, JPLAG converts the source code into strings of tokens employing a parser. The parser brings more semantic information and depends on the source language. MOSS [3] (Measure Of Software Similarity) was developed by Alex Aiken in 1994. MOSS works with different languages: C, C++, Java, Pascal, Ada, Lisp and Scheme. MOSS is based on getting fingerprints which identify a source code in a particular way. According to the web page of MOSS, the ideas behind its algorithm can be found in [7]. The idea is that the more common fingerprints exist between a pair of source codes, the more similar they are. Fingerprints are a small subset of all the n-grams (substrings of n characters) that exist in a source code. The fingerprint is formed with a unique value that represents an n-gram (normally, a hash function is applied).
Sherlock and PMD are open-source free tools. Sherlock was developed by the Department of Computer Science of the University of Warwick. Sherlock works with source codes and natural language texts. PMD also uses the string-matching comparison algorithm of Karp-Rabin. PMD supports Java, JSP, C, C++, Fortran and PHP.
CodeMatch is a commercial tool that supports the languages BASIC, C, C++, C#, Delphi, Flash, Java, etc. CodeMatch is based on the combination of five algorithms: Source Line Matching, Comment Line Matching, Word Matching, Partial Word Matching, and Semantic Sequence Matching. For processing a source code, CodeMatch first separates comments, identifiers (names of variables, names of constants, names of functions, etc.) and functional code. The Word Matching algorithm obtains for each code a sequence of words (eliminating reserved words) that allows counting the number of common words in this sequence. Unlike Word Matching, Partial Word Matching does not require that complete words match; the match can be partial. Source Line Matching compares the source lines (excluding comments) of the source code pair. On the contrary, Comment Line Matching compares the comment lines, excluding the lines with functional code. Semantic Sequence Matching compares the lines of code using the first word (excluding comments) of the pair of source codes. Finally, a single score is given for the similarity of the source code pair.
The state of the art in this research deals with the detection of source code re-use across programming languages [6].

2. Proposed system
Our system (UAEM) used for the detection of source code reuse is divided into four phases.

2.1 Preprocessing phase
In the first phase, only the lexical items (like {, }, (, ), +, *, ; etc.) of each source code are separated with a whitespace, and more than one whitespace is removed. The result of this phase is a string of tokens of the source code. This phase depends on the input language, but for C and Java it is almost the same. Even though we tested some options like removing comments or identifiers, the evaluation on the training corpus decreased.

2.2 Similarity measure phase
In the second phase, for each source code given as a string, the similarity measure with respect to the other source codes is obtained. The sum of the lengths of the different longest common substrings between the two source codes (normalized by the length of the longest code) is used as the similarity measure. For this phase, we used the algorithm in [4].
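The following minimal Java sketch shows the idea behind the similarity measure of Section 2.2. It is only an illustration, not the actual implementation, which relies on the much faster algorithm of [4]: the greedy loop repeatedly removes the current longest common substring, sums the lengths found, and normalizes by the length of the longer code. The minimum substring length is our own assumption.

import java.util.*;

public class LcsSumSimilarity {
    // Length and start positions of the longest common substring of a and b (simple O(n*m) DP).
    static int[] longestCommon(String a, String b) {
        int best = 0, endA = 0, endB = 0;
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    if (dp[i][j] > best) { best = dp[i][j]; endA = i; endB = j; }
                }
            }
        }
        return new int[]{best, endA - best, endB - best};
    }

    // Greedy approximation: repeatedly remove the current longest common substring,
    // accumulate its length, then normalize by the length of the longer code.
    static double similarity(String a, String b, int minLen) {
        int longer = Math.max(a.length(), b.length());
        int sum = 0;
        while (true) {
            int[] lcs = longestCommon(a, b);
            if (lcs[0] < minLen) break;
            sum += lcs[0];
            a = a.substring(0, lcs[1]) + a.substring(lcs[1] + lcs[0]);
            b = b.substring(0, lcs[2]) + b.substring(lcs[2] + lcs[0]);
        }
        return longer == 0 ? 0.0 : (double) sum / longer;
    }

    public static void main(String[] args) {
        String x = "int i = 0 ; while ( i < n ) i ++ ;";
        String y = "public void f ( ) { int i = 0 ; while ( i < n ) i ++ ; }";
        System.out.println(similarity(x, y, 3));
    }
}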
2.3 Ranking phase
In the third phase, a set of parameters that later allow the identification of re-use cases is obtained using the comparisons done in the previous phase. The parameters obtained are: the value of the DISTANCE (1 - similarity), the RANKING of the distance (rank order of the most similar), the GAP that exists with the next closest code (it is only calculated for the first 10 closest codes) and, using the maximum gap between the 10 closest codes, the codes that are (B)efore or (A)fter the maximum gap (RELATIVE DIFFERENCE) are labeled. The result of the third phase is a matrix where each row represents a comparison of a source code with the other codes (columns) and each cell represents a pair of source codes in both directions.

2.4 Reuse decision phase
For taking the decision, a source code pair X↔Y will be a reuse case if there is evidence of reuse in both directions, that is, X→Y and Y→X. A reuse case exists when the DISTANCE is less than 0.45 or the GAP is greater than 0.14, but it is also important that one of the additional conditions is achieved. The first condition is that the RANKING must be, at least, in the second position and, the second, that the label of the RELATIVE DIFFERENCE must be B. The first runs for the C and Java languages were processed with the above conditions. However, in some cases the evidence in one direction was very high while in the other direction it was only almost reliable, and according to the training corpus in Java and C, in most of these cases the pair was a code reuse case. Therefore, in the second run, if there was no strong evidence of reuse in one direction, the pair could still be considered a reuse case if at least one of the two codes had a RANKING of 1, a RELATIVE DIFFERENCE of B and a GAP greater than 0.1.

3. Training experiments
The training corpus consists of 259 source codes in Java and 79 source codes in C. The relevance judgments in Java have 84 pairs and in C 26 pairs. Table 1 shows the results of our system with the training corpus for C and Java. In the first run with Java the system gets a better recall than precision, and in the second run the precision is better. However, with the C language the system obtains the same precision; the difference was in the recall. Most of the rules were tuned with the Java corpus since more re-use cases exist in the relevance judgments. In this sense, we think that the results in C are worse since there are only 26 pairs for training.

Table 1. Results with the training corpus according to our evaluation.
Corpus-Run  Precision  Recall  F-measure
Java-Run1   0.78       0.83    0.80
Java-Run2   0.85       0.80    0.83
C-Run1      0.80       0.58    0.67
C-Run2      0.80       0.63    0.71

4. Testing experiments
We were surprised when the test corpus was delivered: it was bigger than we expected. The Java corpus has 12,080 files divided into 6 scenarios. The C corpus has 19,895 files, also divided into 6 scenarios. Table 2 shows the distribution of the corpus according to the SOCO scenarios.

Table 2. Distribution of the test corpus according to the scenario.
Scenario  Java   C
A1        3,241  5,408
A2        3,093  5,195
B1        3,268  4,939
B2        2,266  3,873
C1        124    335
C2        88     145

At the beginning, the system was not optimized for running on bigger collections. The time estimated for processing the whole corpus was 3 months, unacceptable for the SOCO competition deadline. After reprogramming the system, it was possible to process the collections in one day using a computer with a Xeon CPU with 6 cores and 32 GB of RAM.
Since we did not know how the evaluation would be done (it could be by scenario, by language or by runs), we decided to combine 2 runs to use the 3 submission options available to us. This is the explanation of why run 2 and run 3 are the same in Java, and why run 1 and run 2 are the same in C. The results of our system (UAEM) with the test corpus for C are shown in Table 3. The F-measure results for UAEM-run1 and UAEM-run2 (actually, they correspond to the system tuned as C-Run1 in the training phase) were better than UAEM-run3 (which corresponds to C-Run2 in the training phase). However, in the training phase the system tuned as C-Run1 was worse in recall, and its precision was better than its recall. We think this variation is possible since the training corpus is very small compared with the test corpus.

Table 3. Results of the systems for C according to the first SOCO evaluation.
Rank  Team-Run    Precision  Recall  F-measure
1     UAEM-run1   0.306      0.500   0.380
2     UAEM-run2   0.306      0.500   0.380
3     UAEM-run3   0.260      0.500   0.342
#     Baseline-1  0.400      0.069   0.117
#     Baseline-2  0.040      0.280   0.060
4     UAM-C-run1  0.007      0.494   0.013
5     UAM-C-run3  0.007      0.493   0.013
6     UAM-C-run2  0.005      0.444   0.010
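The decision rules of Section 2.4 can be summarized in code. The sketch below is our own reading of those rules (the Evidence class and its field names are hypothetical): it checks the strong condition in both directions for run 1 and adds the relaxed condition for run 2.

public class ReuseDecision {
    // Parameters computed in the ranking phase for one direction X -> Y.
    static class Evidence {
        double distance;       // 1 - similarity
        int ranking;           // rank of Y among the codes closest to X
        double gap;            // gap to the next closest code
        boolean beforeMaxGap;  // RELATIVE DIFFERENCE label B

        Evidence(double distance, int ranking, double gap, boolean beforeMaxGap) {
            this.distance = distance; this.ranking = ranking;
            this.gap = gap; this.beforeMaxGap = beforeMaxGap;
        }
    }

    // Strong evidence of reuse in one direction (run 1 conditions).
    static boolean strongEvidence(Evidence e) {
        boolean mainCondition = e.distance < 0.45 || e.gap > 0.14;
        boolean additional = e.ranking <= 2 || e.beforeMaxGap;
        return mainCondition && additional;
    }

    // Relaxed condition used in run 2 when only one direction is strong.
    static boolean relaxedEvidence(Evidence e) {
        return e.ranking == 1 && e.beforeMaxGap && e.gap > 0.1;
    }

    // Run 1: both directions must show strong evidence.
    static boolean isReuseRun1(Evidence xToY, Evidence yToX) {
        return strongEvidence(xToY) && strongEvidence(yToX);
    }

    // Run 2: one strong direction plus the relaxed condition in either direction.
    static boolean isReuseRun2(Evidence xToY, Evidence yToX) {
        return isReuseRun1(xToY, yToX)
                || ((strongEvidence(xToY) || strongEvidence(yToX))
                    && (relaxedEvidence(xToY) || relaxedEvidence(yToX)));
    }

    public static void main(String[] args) {
        Evidence xy = new Evidence(0.30, 1, 0.20, true);   // hypothetical values
        Evidence yx = new Evidence(0.50, 1, 0.12, true);
        System.out.println(isReuseRun1(xy, yx) + " " + isReuseRun2(xy, yx));  // false true
    }
}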
The results of our system (UAEM) with the test corpus for Java are shown in Table 4. The evaluations for UAEM-run2 and UAEM-run3 (actually, they correspond to the system tuned as Java-Run2 in the training phase) were better than UAEM-run1 (which corresponds to Java-Run1). As in the previous evaluation, the UAM-C system has a different behavior with respect to the training phase.

Table 4. Results of the systems for Java according to the first SOCO evaluation.
Rank  Team-Run    Precision  Recall  F-measure
1     UAEM-run2   0.641      0.969   0.771
2     UAEM-run3   0.641      0.969   0.771
3     UAEM-run1   0.759      0.472   0.582
4     UAM-C-run1  0.633      0.435   0.515
5     DCU-run3    0.775      0.360   0.492
6     DCU-run2    0.777      0.350   0.482
7     DCU-run1    0.658      0.364   0.468
8     UAM-C-run3  0.926      0.311   0.465
#     Baseline-2  0.464      0.288   0.356
#     Baseline-1  0.617      0.080   0.141
9     UAM-C-run2  0.029      0.343   0.054

The configuration of Baseline-1 corresponds to the JPLAG program with the default parameters, and the configuration of Baseline-2 consists of a character 3-gram model weighted using term frequency and the cosine measure to compute the similarity. This baseline considers as re-used cases all source code pairs that surpass a similarity threshold of 0.9. An overview of the SOCO Track can be found in [5].
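For reference, Baseline-2 described above can be sketched as follows (our own minimal illustration; the organizers' exact tokenization and weighting may differ): character 3-gram term-frequency vectors are compared with the cosine measure, and a pair is flagged when the similarity exceeds 0.9.

import java.util.*;

public class NgramCosineBaseline {
    // Character 3-gram term-frequency vector of a source code string.
    static Map<String, Integer> trigramVector(String code) {
        Map<String, Integer> tf = new HashMap<>();
        for (int i = 0; i + 3 <= code.length(); i++) {
            tf.merge(code.substring(i, i + 3), 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity of two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Baseline-2 decision: flag the pair as re-used if cosine similarity > 0.9.
    static boolean isReused(String codeX, String codeY) {
        return cosine(trigramVector(codeX), trigramVector(codeY)) > 0.9;
    }

    public static void main(String[] args) {
        String x = "for (int i = 0; i < n; i++) sum += a[i];";
        String y = "for (int j = 0; j < n; j++) total += a[j];";
        System.out.println(isReused(x, y));
    }
}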

4.1 Second SOCO evaluation
The second evaluation by SOCO eliminates our third run since it was a combination of runs 1 and 2. Results for C are shown in Table 5 and results for Java in Table 6. According to the second evaluation by SOCO, our system retains the top position and the F-measure increased, but the UAM-C team obtained the same F-measure. Nevertheless, the results in Java were better than in C.

Table 5. Results of the systems for C according to the second SOCO evaluation.
Rank  Team-Run    Precision  Recall  F-measure
1     UAEM-run1   0.282      1.000   0.440
2     UAEM-run2   0.240      1.000   0.387
#     Baseline-2  0.258      0.345   0.295
#     Baseline-1  0.350      0.130   0.190
4     UAM-C-run1  0.006      1.000   0.013
5     UAM-C-run3  0.006      0.997   0.013
6     UAM-C-run2  0.005      0.950   0.010

Table 6. Results of the systems for Java according to the second SOCO evaluation.
Rank  Team-Run    Precision  Recall  F-measure
1     UAM-C-run3  0.691      0.968   0.807
2     DCU-run2    0.530      0.995   0.692
3     DCU-run3    0.515      1.000   0.680
4     DCU-run1    0.432      0.995   0.602
#     Baseline-2  0.457      0.712   0.556
5     UAEM-run1   0.385      1.000   0.556
6     UAM-C-run1  0.349      1.000   0.517
#     Baseline-1  0.542      0.293   0.380
7     UAEM-run2   0.158      1.000   0.273
8     UAM-C-run2  0.019      0.928   0.037

According to the second evaluation in Java, our system obtains the fifth position with run 1 and the seventh position with run 2.

5. Conclusions and future work
In this paper, a new system for detecting re-use of source code is described. The proposed system works in four phases. The preprocessing phase is interesting since it does not require sophisticated processes or dictionaries, making the execution of this phase very fast. It is worth noting that all phases of our system work with words, making the process faster than when working with characters. The second phase introduces a new measure based on the different lengths of the longest common substrings between the pairs of source codes, which outperforms the LCS alone. The third phase presents a new way of deriving other parameters from the LCSs measure. These parameters allow proposing some rules for catching some groups of re-use cases.
According to the first SOCO evaluation, our system outperforms the other systems in both cases. Even though the evaluation in Java reached the best F-measure score, the results in C are also relevant, since ours was the only system that surpassed both baselines in the first evaluation. In both second evaluations it is interesting to observe that all of the systems obtain excellent recalls, between 0.928 and 1.000.
Therefore, as future work we must concentrate our efforts on precision. Also, as future work, we think the rules for C can be improved by considering some more re-use cases.

6. REFERENCES
[1] L. Prechelt, G. Malpohl and M. Philippsen, 2000. JPlag: Finding plagiarism among a set of programs. Technical Report, Universität Karlsruhe, Germany.
[2] A. Aiken, 1998. MOSS (Measure Of Software Similarity) plagiarism detection system. http://www.cs.berkeley.edu/~moss/ (as of April 2000) and personal communication, University of Berkeley, CA.
[3] R.M. Karp and M.O. Rabin, 1987. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2), 249-260.
[4] R.A. García-Hernández, J. Martínez-Trinidad and J. Carrasco-Ochoa, 2006. A new algorithm for fast discovery of maximal sequential patterns in a document collection. Computational Linguistics and Intelligent Text Processing, LNCS 3878, Springer, 514-523, Mexico.
[5] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello, 2014. PAN@FIRE 2014: Overview of SOCO Track on the Detection of SOurce COde Re-use. In Proceedings of the Sixth Forum for Information Retrieval Evaluation (FIRE 2014), Bangalore, India.
[6] E. Flores, A. Barrón-Cedeño, P. Rosso and L. Moreno, 2012. Detecting source code re-use across programming languages. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstration Session, NAACL, 1-4.
[7] S. Schleimer, D. Wilkerson and A. Aiken, 2003. Winnowing: Local Algorithms for Document Fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 76-85, CA.