Mining Multi-word Named Entity Equivalents from Comparable Corpora

Abhijit Bhole, Microsoft Research India, Bangalore, India ([email protected])
Goutham Tholpadi, Indian Institute of Science, Bangalore, India ([email protected])
Raghavendra Udupa, Microsoft Research India, Bangalore, India ([email protected])

Abstract

Named entity (NE) equivalents are useful in many multilingual tasks including MT, transliteration, cross-language IR, etc. Recently, several works have addressed the problem of mining NE equivalents from comparable corpora. These methods usually focus only on single-word NE equivalents whereas, in practice, most NEs are multi-word. In this work, we present a generative model for extracting equivalents of multi-word NEs (MWNEs) from a comparable corpus, given a NE tagger in only one of the languages. We show that our method is highly effective on three language pairs, and provide a detailed error analysis for one of them.

1 Introduction

NEs are important for many applications in natural language processing and information retrieval. In particular, NE equivalents, i.e. the same NE expressed in multiple languages, are used in several cross-language tasks such as machine translation, machine transliteration, cross-language information retrieval, cross-language news aggregation, etc. Recently, the problem of automatically constructing a table of NE equivalents in multiple languages has received considerable attention from the research community. One approach to solving this problem is to leverage the abundantly available comparable corpora in many different languages of the world (Udupa et al., 2008; Udupa et al., 2009a; Udupa et al., 2009b). While considerable progress has been made in improving both recall and precision of mining NE equivalents from comparable corpora, most approaches in the literature are applicable only to single-word NEs, and particularly to transliterations (e.g. Tendulkar and its Hindi rendering, romanized here as 'tendulkar'). In this work, we consider the more general problem of mining MWNE equivalents from comparable corpora.

In the MWNE equivalents mining problem, a NE in the source language could be related to a NE in the target language by not just transliteration, but a combination of transliteration, translation, acronyms, deletion/addition of terms, etc. To give an example, Figure 1 shows a pair of comparable articles in English and Hindi. 'Sachin Tendulkar' and its Hindi rendering 'sachin tendulkar' are MWNE equivalents, and both words have been transliterated. Another example is the pair 'Siddhivinayak Temple Trust' and 'siddhivinayak mandir'. Here, the first word has been transliterated, the second one translated, and the third omitted in Hindi. The task is to (a) identify these MWNEs as equivalents, (b) infer the word correspondence between the MWNE equivalents, and (c) identify the type of correspondence (transliteration, translation, etc.). Such NE equivalents would not be mined correctly by the previously mentioned approaches, as they would mine only the pair (Siddhivinayak, siddhivinayak). In practice, most NEs are multi-word, and hence it makes sense to address the problem of mining MWNE equivalents. To the best of our knowledge, this is the first work on mining MWNEs in a language-neutral manner.

In this work, we make the following contributions:

- We perform an empirical study of MWNE occurrences, and the issues involved in mining them (Section 2).
- We define a two-tier generative model for MWNE equivalents in a comparable corpus (Section 4).
- We propose a modified Viterbi algorithm for identifying MWNE equivalents and for inferring correspondence information.

The English article:

…, July 29: Sachin Tendulkar will make his debut with a cameo role in a film about the miracles of Lord Ganesh. Tendulkar, widely regarded as one of the world's best batsmen, will play himself in "Vighnaharta Shri Siddhivinayak," a film about the god, who is sometimes referred to as Siddhivinayak. "He will play a small role, as himself, either in a song sequence or in an actual scene," said Rajiv Sanghvi, whose company is handling the film's production. Tendulkar's office confirmed the cricketer would be shooting for the film after he returns from Sri Lanka, where India is touring at the moment. Tendulkar, a devotee of Ganesh, had offered to be a part of the project and will not be charging for the role. The film is being produced by the Siddhivinayak Temple Trust, which looks after a famous temple dedicated to Ganesh in Mumbai.

The annotated Hindi article (each bracketed segment is tagged /O if it is not part of a NE equivalent, or with the English NE it corresponds to):

[ अपनी ब쥍ऱेबाजी से दनु नया भर के क्रिकेटप्रेममयⴂ को अपना दीवाना बनाने वाऱे ]/O [ सचिन तᴂडुऱकर ]/[ Sachin Tendulkar ] [ अब ]/O [ बॉऱीवुड ]/[ Bollywood ] [ मᴂ पदापपण करने जा रहे हℂ और गणपनि पर बनने वाऱी एक क्रि쥍म मᴂ वह नजर आएॉगे ]/O [ गणपनि के परमभ啍ि ]/O [ सचिन ]/[ Sachin ] [ ' ]/O [ वव鵍नहतता ससविववनतयक ]/[ Vighnaharta Shri Siddhivinayak ] [ ' क्रि쥍म मᴂ एक सॊक्षऺप्ि भूममका ननभाएॉगे ]/O [ क्रि쥍म का ननमापण ]/O [ ससविववनतयक मंदिर ]/[ Siddhivinayak Temple Trust ] [ न्यास कर रहा है , जो मुॊबई के प्रभादेवी इऱाके मᴂ स्थिि इस मशहूर मॊददर की देखरेख करिा है ]/O [ न्यास के प्रमुख ]/O [ सुभतष मतयेकर ]/[ Subhash Mayekar ] [ ने कहा ]/O [ सचिन ]/[ Sachin ] [ कई साऱ से ननयममि 셂प से इस मॊददर मᴂ आ मीडिया की खबरⴂ के अनुसार क्रि쥍म के ननमापण से जुड़ी कॊपनी के प्रमुख ]/O [ रतजीव संघवी ]/[ Rajiv Sanghvi ] [ ने कहा ]/O [ सचिन ]/[ Sachin ] [ की इसमᴂ सॊक्षऺप्ि भूममका होगी ]/O [ वह ]/O [ सचिन तᴂडुऱकर ]/[ Sachin Tendulkar ] [ के 셂प मᴂ ही नजर आएॉगे ]/O

Figure 1: An example of MWNE mining.

2 Empirical Study of MWNE Occurrences

… E.g. (Mahatma Gandhi, mahatma gandhi). … (…, gurais yuddha sammaan), where the correspondence is (Battle, yuddha), (Honour, sammaan) and (Gurais, gurais).

9. Some words in the English MWNE do not have an equivalent in the Hindi MWNE. E.g. (Department of Telecommunication, doorsanchaar vibhaag), where 'of' does not have a counterpart in the Hindi MWNE.

10. Acronym transliteration by transliterating each character separately. E.g. (IRRC, ai aar aar si) and (RBC, aar bi si).

11. Acronym transliteration by transliterating the acronym as a whole. E.g. (SAARC, saark) and (TRAI, traai).

Our study revealed that each of the above characteristics is statistically important. Nearly 37% of location names and 77% of organization names involved both transliteration and translation. 12% of person names, 30% of location names and 45% of organization names had either one-to-many or many-to-one correspondence between words. 36% of organization names had non-sequential correspondence between words. These statistics clearly indicate that MWNEs need special treatment, and any non-trivial MWNE equivalent mining technique must take the above characteristics into account.

3 Problem Description

Given a pair of comparable documents in different languages, we wish to extract a set of pairs of MWNEs, one in each language, that are equivalent to each other. We are given a NE tagger in one of the languages, dubbed the source language, while the other language is called the target language (denoted with subscripts s and t). We are given a document pair (d_s, d_t) and the NEs in d_s, i.e. \(\{N_i\}_{i=1}^m\), and we want to find all possible NEs in d_t which are equivalent to some N_i. The problem now reduces to finding sequences of words in d_t that are equivalent to some N_i's.

In the example in Figure 1, \(\{N_i\}_{i=1}^m\) = {(Sachin, Tendulkar), (Lord, Ganesh), (Siddhivinayak, Temple, Trust), ...}. We want to extract the set {(Sachin Tendulkar, sachin tendulkar), (Siddhivinayak Temple Trust, siddhivinayak mandir), ...}.

4 Mining algorithm

4.1 Key idea

We model the problem of finding NE equivalents in the target sentence T using the source NEs as a generative model. Each word t in the target sentence is hypothesized to be either part of a NE, or generated from a target language model (LM). Thus, in the generative model, the source NEs plus the target language model constitute the set of hidden states, and the t's are the observations. We want to align states and observations, i.e. determine which state generated which observation, and choose the alignment that maximizes the probability of the observations. The probability of generating a target word t from a source NE state N depends on:

- whether N is itself multi-word; if so, each word in N acts as a substate and can generate t;
- the context (the words preceding t in T); note that the length of the context window for t depends on the length of the source NE generating t, and is not a fixed parameter;
- the relationship (transliteration or translation) between the state/substate and the target word [2].

Dynamic programming (DP) approaches are usually used to compute the best alignment, but standard DP fails here as the context size varies for each NE. Hence, we posit the generative model at two levels:

1. A sentence-level generative model (SGeM), where each word in the target sentence is generated either by the target LM or by one of the source NEs.

2. A generative model for the NE (NEGeM), where each word in the target NE is generated by one of the substates of the source NE.

This is illustrated by the example in Figure 2. The portions 'mangalvaar ko' and 'ke chhaatron ne … apne' of the Hindi sentence are generated by the language model. 'saauthemptan yunivarsitee' is generated by the English NE 'University of Southampton'. Note that without using the language model, 'ke' would have been incorrectly aligned with 'of'. Another example is 'em ke'.

[2] We also use another relationship for letters in acronyms that are transliterated.

Figure 2: Generation of a Hindi sentence from an English NE.
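To make the two alignment levels concrete, the toy sketch below (our own illustration, with romanized tokens standing in for the Devanagari of Figure 2) encodes the example: the sentence-level alignment A assigns each Hindi token either to the LM state or to the NE state, and the internal alignment B maps the tokens of the NE span to substates of 'University of Southampton'.

```python
# Hidden states: index 0 is the target language model (LM);
# index 1 is the source NE, whose words act as substates.
ne = ("University", "of", "Southampton")

# Toy Hindi sentence (romanized).
sentence = ["mangalvaar", "ko", "southampton", "university",
            "ke", "chhaatron", "ne", "apne"]

# SGeM alignment A: a_i = 0 (LM) or 1 (the NE) for each token.
A = [0, 0, 1, 1, 0, 0, 0, 0]

# NEGeM alignment B over the NE span: token position -> substate index.
# Note the non-sequential correspondence ('southampton' <- substate 2,
# 'university' <- substate 0) and that 'of' generates nothing; without
# the LM state, 'ke' would wrongly be aligned to 'of'.
B = {2: 2, 3: 0}

for pos, sub in B.items():
    print(f"{sentence[pos]:<12} <- {ne[sub]}")
```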

4.2 Generative Model

SGeM. Let T = t_1 … t_n be the target sentence and N = \(\{N_i\}_{i=0}^m\) be the hidden states (as before), where N_0 is the target LM state. In the SGeM, we want to predict the hidden state used to produce the next target term t_i. Let a_i = j if t_i is generated by N_j. We find an alignment A = a_1 … a_n which maximizes

\[
P(T, A \mid N) = \prod_{i=1}^{n} P(a_i \mid a_1^{i-1}, N_{a_i}) \, P(t_i \mid t_1^{i-1}, N_{a_i}) \tag{1}
\]

By choosing which source NE generates each target term, this model also controls the length of the target NE equivalent to a source NE.

Let \(t_{k_i} \ldots t_{i-1}\) be the context for t_i (all these terms are aligned to \(N_{a_i}\)). Then

\[
P(t_i \mid t_1^{i-1}, N_{a_i}) = P(t_i \mid t_{k_i}^{i-1}, N_{a_i})
\]

NEGeM. To model the generation of the target term t_i given the context \(t_{k_i}^{i-1}\) and the substates of the source NE N_j, we let \(N_j = n_{j1}, \ldots, n_{jL_j}\), where each \(n_{jp}\) is a substate. The internal alignment \(B = b_{k_i}, \ldots, b_i\) is defined such that \(b_p = s\) if \(t_p\) is generated by \(n_{js}\). We get

\[
P(t_i \mid t_{k_i}^{i-1}, N_{a_i}) = \sum_{B} \prod_{p=k_i}^{i} P(b_p \mid b_{p+1}) \, P(t_p \mid n_{a_i b_p}) \tag{2}
\]

Each target word is related to the substate generating it by a relationship \(r_j \in \{\text{translation}, \text{transliteration}, \text{acronym}, \text{none}\}\), and the emission probability depends on this relationship:

\[
P(t_j \mid n_{a_i b_j}, r_j) =
\begin{cases}
m_{tlat} \, P_{tlat}(t_j \mid n_{a_i b_j})^{r_{tlat}} & \text{if } r_j = \text{translation} \\
m_{tlit} \, P_{tlit}(t_j \mid n_{a_i b_j})^{r_{tlit}} & \text{if } r_j = \text{transliteration} \\
\delta[\, t_j \equiv n_{a_i b_j} \,] & \text{if } r_j = \text{acronym} \\
P_{lm}(t_j) & \text{if } r_j = \text{none}
\end{cases}
\]

where \(\delta[x] = 1\) if the condition x is true. The four probability terms on the right are obtained, respectively, from a translation model, a transliteration model [4], an acronym model [5], and a language model.

Controlling target NE length. In the SGeM, \(P(a_i \mid a_1^{i-1}, N_{a_i})\) is the probability that \(N_{a_i}\) will generate t_i. To compute this, first note that for a given term t_i, either \(a_i = a_{i+1}\), i.e. \(N_{a_i}\) continues to generate beyond t_i, or \(a_i \neq a_{i+1}\), i.e. \(N_{a_i}\) terminates at t_i. The probability of continuation depends on the length L of \(N_{a_i}\) and the length l of the target NE generated so far by \(N_{a_i}\). Based on empirical observations, we defined a function f(l, L) as

\[
f(l, L) =
\begin{cases}
0 & \text{for } l \notin \{L-2, \ldots, L+2\} \\
1 - \epsilon & \text{for } l \in \{L-1, L\} \\
\epsilon & \text{for } l \in \{L+1, L+2\}
\end{cases}
\]

where f(l, L) is the probability of continuation, 1 − f(l, L) is the probability of termination, and ε is a very small number. We now define

\[
P(a_i \mid a_1^{i-1}, N) =
\begin{cases}
p_{NE} & \text{if } a_{i-1} = 0 \\
f(i - k_i, L_{a_i}) & \text{if } a_{i-1} \neq 0, \; k_i < i \\
1 - f(i - k_{i-1} - 1, L_{a_{i-1}}) & \text{if } a_{i-1} \neq 0, \; k_i = i
\end{cases}
\]

where the three cases are the probabilities of beginning an NE, continuing an NE, and terminating a previous NE, respectively.

[4] A character-level extended HMM described in (Udupa et al., 2009a).
[5] A mapping from source-language alphabets to their transliterations in the target language.
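As a concrete reading of the equations above, here is a minimal Python sketch (ours, not the authors' code; `p_tlat`, `p_tlit`, and `p_lm` are assumed stand-ins for the translation, transliteration, and language models) of the four-way emission term and the length-control function f(l, L):

```python
EPS = 1e-6   # the small constant epsilon in the continuation model

def emit(t, substate, rel, p_tlat, p_tlit, p_lm,
         m_tlat=1.0, r_tlat=1.0, m_tlit=1.0, r_tlit=1.0):
    """P(t | substate, rel): the four-way case analysis of Section 4.2.
    m_* and r_* are the calibration parameters of Section 5 (defaults here
    are placeholders, not the tuned values)."""
    if rel == "translation":
        return m_tlat * p_tlat(t, substate) ** r_tlat
    if rel == "transliteration":
        return m_tlit * p_tlit(t, substate) ** r_tlit
    if rel == "acronym":
        return 1.0 if t == substate else 0.0   # delta[t == substate]
    return p_lm(t)                             # rel == "none": LM absorbs t

def f(l, L):
    """Probability that an NE of source length L continues generating
    after having produced l target words (1 - f is termination)."""
    if not (L - 2 <= l <= L + 2):
        return 0.0
    if l in (L - 1, L):
        return 1.0 - EPS
    # The paper's text pins down only the L+1 and L+2 cases as EPS;
    # we default the L-2 case to EPS as well.
    return EPS
```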

4.3 Modified Viterbi algorithm

We use the dynamic programming framework to do the maximization in (1). For each target term t_i and each source NE N_j, the subproblem is to find the best alignment a_1 … a_i such that \(a_{i+1} \neq a_i\), i.e. t_i is the last term in the equivalent of N_j:

\[
\text{subproblem}[i, j] = \max_{a_1^{i-1}} P(a_i = j \neq a_{i+1} \mid a_1^{i-1}, N_j) \, P(t_i \mid t_1^{i-1}, N_j)
\]

Let l be the length of the target NE ending at t_i, based on the alignment so far. The first probability term becomes

\[
P(a_{i-l-1} \neq a_{i-l} = \cdots = a_i = j \neq a_{i+1} \mid N_j) = \alpha \times f(l, L_j) \, (1 - f(l+1, L_j))
\]

This is non-zero only for certain values of l, for which we can construct the solution to subproblem[i, j] from solutions at earlier positions. Denoting k = i − l,

\[
\text{subproblem}[i, j] = \max_{k, \; j' \neq j} \text{subproblem}[k-1, j'] \times \text{negem}(t_k^i, N_j)
\]

where the procedure negem computes the probability that a given sequence of target words is an equivalent of the given source NE. This procedure solves a second (independent) DP problem (for the NEGeM), constructed in a similar fashion. It also models conditions such as "if a target term is a transliteration, it cannot map to more than one source substate."

The output of the system is a set of MWNE pairs. For each pair, we also give the internal alignment between the words of the two NEs.

5 Parameter Tuning

The MWNE model has five user-set parameters. These need to be tuned appropriately in order to be able to compare probabilities from different models. In the following, we describe the parameters and a systematic way to tune them.

- \(p_{NE} \in (0, +\infty)\) specifies how likely we are to find an NE in a target sentence.
- Given a probability p returned by the transliteration model, the value used for comparisons, \(p'_{tlit}\), is calculated as \(p'_{tlit} = m_{tlit} \, p^{r_{tlit}}\), where \(r_{tlit} \in \mathbb{R}\) and \(m_{tlit} \in (0, +\infty)\). \(r_{tlit}\) is tuned to boost/suppress p; \(m_{tlit}\) is used similarly, but to get more fine-grained control.
- Similarly, for a probability p given by the translation model, we calculate \(p'_{tlat} = m_{tlat} \, p^{r_{tlat}}\), where \(r_{tlat} \in \mathbb{R}\) and \(m_{tlat} \in (0, +\infty)\).

In our experiments, we found that transliteration probabilities were quite low compared to the others, followed by the translation probabilities. So we used the following procedure to tune these parameters on a small hand-annotated set of document pairs:

1. Initially set \(p_{NE} = +\infty\), and all other parameters to zero.
2. Tune \(r_{tlit}\) to find as many of the transliterations as possible. Then use \(m_{tlit}\) to fine-tune it, improving precision without losing too much recall.
3. Next, tune \(r_{tlat}\) to find as many of the translations as possible. Then use \(m_{tlat}\) to fine-tune it, improving precision without losing too much recall.
4. The system is now finding as many NEs as possible, but it is also finding noise. Keep lowering \(p_{NE}\) to allow the language model to absorb more and more noise. Do this until NEs also begin to get absorbed by the LM.
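To make the recursion of Section 4.3 concrete, here is a minimal Python sketch of the outer, sentence-level DP (our own illustrative rendering, not the authors' implementation; `negem` is a stand-in for the inner NEGeM procedure, and between-NE words are handled with a simple per-word LM score):

```python
def mine_sentence(sentence, nes, negem, f, p_ne, p_lm):
    """Viterbi-style DP over a target sentence.

    sentence : list of target terms t_1..t_n
    nes      : list of source NEs (each a list of source words)
    negem    : negem(span, ne) -> probability that the target span is an
               equivalent of ne (the inner, independent NEGeM DP; stubbed)
    f        : the continuation function f(l, L) of Section 4.2
    p_ne     : probability of beginning an NE (Section 5)
    p_lm     : p_lm(word) -> language-model probability of a target word
    """
    n = len(sentence)
    score = [0.0] * (n + 1)   # score[i] = best prob. of generating t_1..t_i
    score[0] = 1.0
    spans = {}                # (start, end) -> NE index, for improving spans
    for i in range(1, n + 1):
        # Case 1: t_i is absorbed by the target language model.
        score[i] = score[i - 1] * p_lm(sentence[i - 1])
        # Case 2: the equivalent of some NE j ends exactly at position i.
        for j, ne in enumerate(nes):
            L = len(ne)
            # f(l, L) vanishes outside l in {L-2, ..., L+2}, so only a few
            # candidate span lengths need to be tried: this is the
            # variable-length context that breaks standard fixed-order DP.
            for l in range(max(1, L - 2), min(i, L + 2) + 1):
                k = i - l
                s = (score[k] * p_ne * f(l, L) * (1.0 - f(l + 1, L))
                     * negem(sentence[k:i], ne))
                if s > score[i]:
                    score[i] = s
                    spans[(k, i)] = j
    # Full backtracking omitted for brevity; spans records improving spans.
    return score[n], spans
```

A complete implementation would keep backpointers to recover the best segmentation, and would apply the calibration of Section 5 (p' = m·p^r) inside `negem` before probabilities from the different models are compared.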

6 Empirical Evaluation

In this section, we study the overall precision and recall of our algorithm for three language pairs. English (En) is the source language, and Hindi (Hi), Tamil (Ta) and Arabic (Ar) are the target languages. Hindi belongs to the Indo-Aryan family, Tamil to the Dravidian family, and Arabic to the Semitic family of languages. The results show that the method is applicable to a wide spectrum of languages.

6.1 Linguistic Resources

Models. We need four models (translation, transliteration, language, and acronym) in order to run the proposed algorithm. For a language pair, we learnt these models using the following kinds of data, which were available to us:

- a set of pairs of NEs that are transliterations, to train the transliteration model;
- a set of parallel sentences, to learn a translation model;
- a monolingual corpus in the target language, to train a language model;
- a dictionary mapping English alphabets to their transliterations in the target language.

One can get an idea of the scale of the linguistic resources used from Table 1.

Table 1: Training data for the models (1K = 1 thousand, 1M = 1 million).

Lang. pair | Translit. pairs | Word pairs | Monolin. corpus
En-Hi      | 15K             | 634K       | 23M words
En-Ta      | 17K             | 509K       | 27M words
En-Ar      | 30K             | 8.2M       | 47M words

Source language NER. The Stanford NER tool (Finkel et al., 2005) was used for obtaining the list of English NEs from the source document.

6.2 Corpus for MWNE mining

For each language pair, a set of comparable article pairs is required. The article pairs for En-Hi and En-Ta were obtained from news websites [6], where the article correspondence was obtained using a method described in (Udupa et al., 2009b). En-Ar article pairs were extracted from Wikipedia using inter-language links.

Preprocessing. The Stanford NER tags each word in the source document as a person, location, organization or other. A continuous sequence of identical tags was treated as a single MWNE. Completely capitalized NEs were treated as acronyms. For each acronym (e.g. "FIFA"), both the acronym version ("FIFA") and the abbreviation version ("F I F A") were included in the list of source NEs. Each target document was sentence-separated and tokenized using simple rules based on the presence of newlines, punctuation, and blank spaces. If a target word can be constructed by concatenating strings from the acronym model, it is treated as an acronym, and the acronym strings are separated out (e.g. 'emke' is changed to 'em ke').

6.3 Experimental Setup

Annotation. Given an article pair, a human annotator looks through the list of source NEs and identifies transliterations in the target document. For MWNEs, the annotator also marks which word in the source corresponds to each word in the target MWNE. This constitutes gold-standard data that can be used to measure performance. 120 article pairs were annotated for En-Hi, 120 for En-Ta, and 36 for En-Ar.

Evaluation. The NEs mined from one article pair are compared with the gold standard for that pair, and one of three possible judgements is made:

- fully matched, if it fully matches some annotated NE (both source and target);
- partially matched, if the source NEs match and the mined target NE is a subset of the gold target NE;
- incorrect match, in all other cases.

The algorithm is agnostic of the type of the NE (person, organization, etc.), so reporting the precision and recall for each NE type does not provide much insight into the performance of the method. Instead, we report at two levels of match (full or partial), and for three categories of MWNEs: single-word transliteration equivalents (SW), multi-word transliteration equivalents, including acronyms (MW-Translit), and multi-word NEs having at least one translation equivalent (MW-Mixed). We compute the numbers for each article pair and then average over all pairs.

Parameter Tuning. Parameter tuning was done following the procedure described in Section 5. For En-Hi and En-Ta, the following values were used: p_NE = 1, m_tlit = 100, r_tlit = 7, m_tlat = 1, r_tlat = 1. For En-Ar, m_tlit = 1 and r_tlit = 14 were used, the other parameters remaining the same. For the tuning exercise, 40 annotated article pairs were used for En-Hi, 40 for En-Ta, and 26 for En-Ar.

[6] En-Hi from Webdunia, En-Ta from The New Indian Express.
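The three-way judgement above can be stated compactly. The sketch below is our own paraphrase of the matching rules, with NEs represented as (source_words, target_words) tuples:

```python
def judge(mined, gold):
    """Compare one mined NE pair against the gold annotations of the same
    article pair: 'full', 'partial', or 'incorrect' (Section 6.3)."""
    src, tgt = mined
    if any(src == gs and tgt == gt for gs, gt in gold):
        return "full"        # both source and target match exactly
    if any(src == gs and set(tgt) <= set(gt) for gs, gt in gold):
        return "partial"     # source matches; mined target is a subset
    return "incorrect"

gold = [(("Siddhivinayak", "Temple", "Trust"),
         ("siddhivinayak", "mandir"))]
print(judge((("Siddhivinayak", "Temple", "Trust"),
             ("siddhivinayak",)), gold))          # -> partial
```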

6.4 Results and Analysis

We evaluated the algorithm on 80 article pairs for En-Hi, 80 pairs for En-Ta, and 11 pairs for En-Ar. The results are given in Table 2.

Table 2: Precision and recall of the system (* = including the data used for tuning).

Lang. pair | Prec. (full) | Prec. (part.) | Recall (full) | Recall (part.)
En-Hi      | 0.84         | 0.86          | 0.89          | 0.89
En-Ta      | 0.78         | 0.80          | 0.61          | 0.63
En-Ar      | 0.42         | 0.44          | 0.63          | 0.66
En-Ar*     | 0.43         | 0.44          | 0.60          | 0.62

We observe that the results for the two types of precision (and recall) are nearly identical. This is because, in most cases, the system is able to mine the entire NE, which validates our intuition of using language models to disambiguate NE boundaries. (The false negatives are mostly due to limitations of the transliteration model and the dictionary.) The precision is relatively low for Arabic, even when we include the tuning data; this suggests that the problem is not one of incorrect parameter values. The error analysis for Arabic is discussed in Section 6.5.

We also report the recall of the system for various categories of NEs in Table 3 [7]. Note that the MW cases and the SW case are mutually exclusive.

Table 3: Category-wise recall of the system.

Category    | En-Hi | En-Ta | En-Ar
SW          | 0.90  | 0.82  | 0.69
MW-Translit | 0.91  | 0.64  | 0.63
MW-Mixed    | 0.77  | 0.40  | 0.66

[7] Since we cannot determine the category of false positives, we do not report precision here.

6.5 Error Analysis for Arabic

The system performed relatively poorly for Arabic compared to the other languages. Detailed error analysis revealed the following sources of error.

Source NER. The text of the English articles automatically extracted from Wikipedia was not very clean compared to the newswire text used for En-Hi and En-Ta. As a result, the source NER wrongly identified many words as NEs, which were then mapped to words on the target side, affecting precision. E.g. words such as "best" and "foxe" were marked as NEs, and words with similar meaning or sound were found in the target. Since the annotator had ignored these words, the evaluation marked them as false positives.

Translation model. Many words were ignored by the translation model because of the presence of diacritics or affixes (e.g. the prefix 'ال' al is frequently attached to Arabic words; also, different sources of Arabic text may use different levels of diacritization for the same words). E.g. the target document contained الجمهوريه al-jamhooriyah "republic"; the dictionary contained الجمهوريات al-jamhooriyat, which has a different suffix, and hence the word was not found.

Transliteration model. The non-uniform usage of diacritics and affixes across training and test data, as mentioned above, affected the performance of transliteration too. E.g. the model is trained on data where the 'ال' prefix usually occurs in the Arabic NE but not in the English NE. As a result, it maps the 'new' in 'new york' to النيو al-nyoo. The annotator had mapped 'new' to نيو nyoo (i.e. without the prefix), causing the evaluation program to mark the system's output as a false positive.

Generative model. Some errors occurred due to deficiencies in the generative model. The model requires every word in the source NE to be mapped to a unique word in the target NE. This causes problems when there are function words in the source NE, or when two source words are mapped to the same target word. E.g. 'yale school of management' corresponds to the 3-word NE 'مدرسه ييل الاداره', where 'of' has no Arabic counterpart; 'al azhar' corresponds to the single word الازهر al-azhar (which could be split as ال ازهر al azhar, but this is never done in practice).

7 Related work

Automatic learning of translation lexicons has been studied in many works. Pirkola et al. (2003) suggest learning transformation rules from dictionaries and applying the rules to find cross-lingual spelling variants. Several works (Fung, 1995; Al-Onaizan and Knight, 2001; Koehn and Knight, 2002; Rapp, 1999) suggest approaches to learn translation lexicons from monolingual corpora. Apart from single-word approaches, some works (Munteanu and Marcu, 2006; Quirk et al., 2007) focus on mining parallel sentences and fragments from 'near parallel' corpora.

On the other hand, out-of-vocabulary words are transliterated to the target language, and approaches have been suggested for automatically learning transliteration equivalents. Klementiev and Roth (2006) proposed the use of similarity of temporal distributions for identifying NEs from comparable corpora. Tao et al. (2006) used phonetic mappings for mining NEs from comparable corpora, but their approach requires language-specific knowledge, which limits it to specific languages. Udupa et al. (2008; 2009b) proposed a language-independent technique for mining single-word NE transliteration equivalents from comparable corpora. In this work, we extend this approach to mining MWNE equivalents from comparable corpora.

8 Conclusion

Through an empirical study, we motivated the importance and non-triviality of mining multi-word NE equivalents in comparable corpora. We proposed a two-tier generative model for mining such equivalents, which is independent of the length of the NE. We developed a variant of the Viterbi algorithm for finding the best alignment in our generative model. We evaluated our approach on three language pairs, and discussed the error analysis for English-Arabic.

Currently, unigram approaches are popular for most tasks in NLP, CLIR, MT, topic modeling, etc. Phrase-based approaches are limited by their efficiency and complexity, and also show limited improvement. We hope that this work will motivate researchers to explore principled methods that make use of NE phrases to significantly improve the state of the art in these areas. The two-tier generative model is applicable to any problem where the context of an observed variable does not depend on a fixed number of past observed variables.

References

Yaser Al-Onaizan and Kevin Knight. 2001. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 400–408.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 363–370.

Pascale Fung. 1995. A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 236–243.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of COLING-ACL 2006, pages 817–824.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 9–16.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of COLING-ACL 2006, pages 81–88.

Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari Visala, and Kalervo Järvelin. 2003. Fuzzy translation of cross-lingual spelling variants. In SIGIR '03, pages 345–352. ACM.

Chris Quirk, Raghavendra Udupa U., and Arul Menezes. 2007. Generative models of noisy translations with applications to parallel fragments extraction. In MT Summit XI, pages 377–384. European Association for Machine Translation.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of ACL, pages 519–526.

Tao Tao, Su-Youn Yoon, Andrew Fister, Richard Sproat, and ChengXiang Zhai. 2006. Unsupervised named entity transliteration using temporal and phonetic correlation.

Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2008. Mining named entity transliteration equivalents from comparable corpora. In CIKM '08, pages 1423–1424. ACM.

Raghavendra Udupa, K. Saravanan, Anton Bakalov, and Abhijit Bhole. 2009a. "They are out there, if you know where to look": Mining transliterations of OOV query terms for cross-language information retrieval. In ECIR 2009, LNCS volume 5478, pages 437–448. Springer.

Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2009b. MINT: A method for effective and scalable mining of named entity transliterations from large comparable corpora. In EACL 2009, pages 799–807.
