P(t_j | n_{a_i b_j}, r_j) = m_tlat · P_tlat(t_j | n_{a_i b_j})^{r_tlat}   if r_j = translation
                          = m_tlit · P_tlit(t_j | n_{a_i b_j})^{r_tlit}   if r_j = transliteration
                          = δ[t_j ≡ n_{a_i b_j}]                          if r_j = acronym
                          = P_LM(t_j)                                     if r_j = none
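This four-way case analysis can be sketched as a small dispatch function. This is an illustrative sketch, not the authors' implementation: the model objects and their `prob`/`expands` methods are hypothetical stand-ins, and the default parameter values are placeholders.

```python
def term_probability(t, substate, relation, models,
                     m_tlat=1.0, r_tlat=1.0, m_tlit=1.0, r_tlit=1.0):
    """P(t | n, r): score target term t against source substate n under
    generation relation r, dispatching to the corresponding model."""
    if relation == "translation":
        return m_tlat * models["translation"].prob(t, substate) ** r_tlat
    if relation == "transliteration":
        return m_tlit * models["transliteration"].prob(t, substate) ** r_tlit
    if relation == "acronym":
        # delta[...]: 1 if the acronym model relates the substate and t, else 0
        return 1.0 if models["acronym"].expands(substate, t) else 0.0
    return models["lm"].prob(t)  # relation == "none": a plain language-model word
```

The m and r parameters are the boost/suppress knobs tuned in Section 5; with all of them at 1 the raw model probabilities are compared directly.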
Figure 2: Generation of a Hindi sentence from an English NE.

The four probability terms on the right are obtained, respectively, from a translation model, a transliteration model[4], an acronym model[5], and a language model.

[3]: δ[x] = 1 if condition x is true.
[4]: A character-level extended HMM described in (Udupa et al., 2009a).
[5]: A mapping from source language alphabets to target language transliterations of the alphabets.

4.2 Generative Model

Let T = t_1 ... t_n be the target sentence and N = {N_i}_{i=0}^{m} be the hidden states (as before), where N_0 is the target LM state. In the SGeM, we want to predict the hidden state used to produce the next target term t_i. Let a_i = j if t_i is generated by N_j. We find an alignment A = a_1 ... a_n which maximizes

P(T, A | N) = ∏_{i=1}^{n} P(a_i | a_1^{i−1}, N_{a_i}) · P(t_i | t_1^{i−1}, N_{a_i})    (1)

By choosing which source NE generates each target term, this model also controls the length of the target NE equivalent to a source NE.

Let t_{k_i} ... t_{i−1} be the context for t_i (all these terms are aligned to N_{a_i}). Then

P(t_i | t_1^{i−1}, N_{a_i}) = P(t_i | t_{k_i}^{i−1}, N_{a_i})

Controlling target NE length  In the SGeM, P(a_i | a_1^{i−1}, N_{a_i}) is the probability that N_{a_i} will generate t_i. To compute this, we first note that, for a given term t_i, either a_i = a_{i+1}, i.e. N_{a_i} continues to generate beyond t_i, or a_i ≠ a_{i+1}, i.e. N_{a_i} terminates at t_i. The probability of continuation depends on the length L of N_{a_i} and the length l of the target NE generated so far by N_{a_i}. Based on empirical observations, we defined a function f(l, L) as

f(l, L) = 0   for l ∉ {L − 2, ..., L + 2}
        = 1   for l ∈ {L − 1, L}
        = ε   for l ∈ {L + 1, L + 2}

where f(l, L) is the probability of continuation, 1 − f(l, L) is the probability of termination, and ε is a very small number. We now define

P(a_i | a_1^{i−1}, N) = p_NE                              if a_{i−1} = 0
                      = f(i − k_i, L_{a_i})               if a_{i−1} ≠ 0, k_i < i
                      = 1 − f(i − k_{i−1}, L_{a_{i−1}})   if a_{i−1} ≠ 0, k_i = i

where the probabilities on the right are for beginning an NE, continuing an NE, and terminating a previous NE, respectively.

NEGeM  To model the generation of the target term t_i given the context t_1^{i−1} and the substates of the source NE N_j, we let N_j = n_{j1}, ..., n_{jL_j}, where n_{jp} is a substate. The internal alignment B = b_{k_i}, ..., b_i is defined such that b_p = s if t_p is generated by n_{js}. We get

P(t_i | t_{k_i}^{i−1}, N_{a_i}) = Σ_B ∏_{p=k_i}^{i} P(b_p | b_{p+1}^{i}) · P(t_p | n_{a_i b_p})    (2)
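The length-control function f, and the block probability that the Viterbi recurrence of Section 4.3 builds from it, can be sketched directly from the definitions above. Two hedges: the value of ε and the behaviour at l = L − 2 are not pinned down by the text, so the choices below (ε = 1e-6, full continuation at L − 2) are assumptions.

```python
EPSILON = 1e-6  # "a very small number"; the exact value is an assumption

def f(l, L, eps=EPSILON):
    """Continuation probability f(l, L): chance that a source NE of length L
    keeps generating after the target NE has reached length l."""
    if l < L - 2 or l > L + 2:
        return 0.0   # target length too far from the source length
    if l <= L:
        return 1.0   # l in {L-2, L-1, L}; the L-2 case is an assumed reading
    return eps       # l in {L+1, L+2}: continuation is barely allowed

def block_prob(l, L):
    """Probability (up to the normalizing constant of Section 4.3) that the
    source NE yields a target block of exactly l terms: reach l, then stop."""
    return f(l, L) * (1.0 - f(l + 1, L))
```

As the recurrence in Section 4.3 exploits, block_prob is non-zero only for a narrow band of target lengths around the source length L, which is what keeps the dynamic program cheap.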
4.3 Modified Viterbi algorithm

We use the dynamic programming framework to do the maximization in (1). For each target term t_i and each source NE N_j, the subproblem is to find the best alignment a_1 ... a_i such that a_{i+1} ≠ a_i, i.e. t_i is the last term in the equivalent of N_j:

subproblem[i, j] = max_{a_1^{i−1}} P(a_i = j ≠ a_{i+1} | a_1^{i−1}, N_j) · P(t_i | t_1^{i−1}, N_j)

Let l be the length of the target NE ending at t_i, based on the alignment so far. The first probability term becomes

P(a_{i−l} ≠ a_{i−l+1} = ⋯ = a_i = j ≠ a_{i+1} | N_j) = α · f(l, L_j) · (1 − f(l + 1, L_j))

This is non-zero only for certain values of l, for which we can construct the solution to subproblem[i, j] from solutions at earlier positions. Denoting k = i − l, we get

subproblem[i, j] = max_{j′ ≠ j} subproblem[k − 1, j′] × negem(t_k^i, N_j)

where the procedure negem computes the probability that a given sequence of target words is an equivalent of the given source NE. This procedure solves a second (independent) DP problem (for the NEGeM), constructed in a similar fashion. It also models conditions such as "If a target term is a transliteration, it cannot map to more than one source substate."

The output of the system is a set of MWNE pairs. For each pair, we also give the internal alignment between the words of the two NEs.

5 Parameter Tuning

The MWNE model has five user-set parameters. These need to be tuned appropriately in order to be able to compare probabilities from different models. In the following, we describe the parameters and a systematic way to tune them.

• p_NE ∈ (0, +∞) specifies how likely we are to find an NE in a target sentence.

• Given a probability p returned by the transliteration model, the probability value used for comparisons, p′_tlit, is calculated as p′_tlit = m_tlit · p^{r_tlit}, where r_tlit ∈ R and m_tlit ∈ (0, +∞). r_tlit is tuned to boost or suppress p; m_tlit is used similarly, but to get more fine-grained control.

• Similarly, for a probability p given by the translation model, we calculate p′_tlat = m_tlat · p^{r_tlat}, where r_tlat ∈ R and m_tlat ∈ (0, +∞).

In our experiments, we found that transliteration probabilities were quite low compared to the others, followed by the translation probabilities. So, we used the following procedure to tune these parameters, using a small hand-annotated set of document pairs:

1. Initially set p_NE = +∞, and all other parameters to zero.

2. Tune r_tlit to find as many of the transliterations as possible. Then, use m_tlit to fine-tune it to improve precision without losing too much on recall.

3. Next, tune r_tlat to find as many of the translations as possible. Then, use m_tlat to fine-tune it to improve precision without losing too much on recall.

4. The system is now finding as many NEs as possible, but it is also finding noise. Keep lowering p_NE to allow the language model to absorb more and more of the noise. Do this until NEs also begin to get absorbed by the LM.

6 Empirical Evaluation

In this section, we study the overall precision and recall of our algorithm for three different language pairs. English (En) is the source language, and Hindi (Hi), Tamil (Ta), and Arabic (Ar) are the target languages. Hindi belongs to the Indo-Aryan family, Tamil to the Dravidian family, and Arabic to the Semitic family of languages. The results show that the method is applicable to a wide spectrum of languages.

6.1 Linguistic Resources

Models  We need four models (translation, transliteration, language, and acronym) in order to run the proposed algorithm. For a language pair, we learnt these models using the following kinds of data, which were available to us:

• A set of pairs of NEs that are transliterations, to train the transliteration model
• A set of parallel sentences, to learn a translation model
• A monolingual corpus in the target language, to train a language model
• A dictionary mapping English alphabets to their transliterations in the target language

One can get an idea of the scale of the linguistic resources used by looking at Table 1.

Lang. pair   Translit. pairs   Word pairs   Monolin. corpus
En-Hi        15K               634K         23M words
En-Ta        17K               509K         27M words
En-Ar        30K               8.2M         47M words
(1K = 1 thousand, 1M = 1 million)

Table 1: Training data for the models.

Source language NER  The Stanford NER tool (Finkel et al., 2005) was used for obtaining a list of English NEs from the source document.

6.2 Corpus for MWNE mining

For each language pair, a set of comparable article pairs is required. The article pairs for En-Hi and En-Ta were obtained from news websites[6], where the article correspondence was obtained using a method described in (Udupa et al., 2009b). En-Ar article pairs were extracted from Wikipedia using inter-language links.

Preprocessing  The Stanford NER tags each word in the source document as a person, location, organization, or other. A continuous sequence of identical tags was treated as a single MWNE. Completely capitalized NEs were treated as acronyms. For each acronym (e.g. "FIFA"), both the acronym version ("FIFA") and the abbreviation version ("F I F A") were included in the list of source NEs. Each target document was sentence-separated and tokenized using simple rules based on the presence of newlines, punctuation, and blank spaces. If a word can be constructed by concatenating strings from the acronym model, it is treated as an acronym, and the acronym strings are separated out (e.g. 'emke' is changed to 'em ke').

6.3 Experimental Setup

Annotation  Given an article pair, a human annotator looks through the list of source NEs and identifies transliterations in the target document. For MWNEs, the annotator also marks which word in the source corresponds to each word in the target MWNE. This constitutes gold standard data that can be used to measure performance. 120 article pairs were annotated for En-Hi, 120 for En-Ta, and 36 for En-Ar.

Evaluation  The NEs mined from one article pair are compared with the gold standard for that pair, and one of three possible judgements is made:

• Fully matched (if it fully matches some annotated NE, both source and target).
• Partially matched (if the source NEs match, and the mined target NE is a subset of the gold target NE).
• Incorrect match (in all other cases).

The algorithm is agnostic of the type of the NE (Person, Organization, etc.), so reporting the precision and recall for each NE type does not provide much insight into the performance of the method. Instead, we report at different levels of match (full or partial), and for different categories of MWNEs: single word transliteration equivalents (SW), multi word transliteration equivalents, including acronyms (MW-Translit), and multi word NEs having at least one translation equivalent (MW-Mixed). We compute the numbers for each article pair and then average over all pairs.

Parameter Tuning  Parameter tuning was done following the procedure described in Section 5. For En-Hi and En-Ta, the following values were used: p_NE = 1, m_tlit = 100, r_tlit = 7, m_tlat = 1, r_tlat = 1. For En-Ar, m_tlit = 1 and r_tlit = 14 were used, the other parameters remaining the same. For the tuning exercise, 40 annotated article pairs were used for En-Hi, 40 pairs for En-Ta, and 26 pairs for En-Ar.

6.4 Results and Analysis

We evaluated the algorithm on 80 article pairs for En-Hi, 80 pairs for En-Ta, and 11 pairs for En-Ar. The results are given in Table 2. We observe that the results for both types of precision (and recall) are nearly identical.
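The three-way judgement used in this evaluation can be sketched as a small comparison routine. The data layout (the gold standard as a mapping from source NE to its annotated target NE) and the word-set subset test are illustrative assumptions, not the actual evaluation script:

```python
def judge(mined_source, mined_target, gold):
    """Return 'full', 'partial', or 'incorrect' for one mined NE pair.
    `gold` maps an annotated source NE (tuple of words) to its target NE."""
    annotated = gold.get(tuple(mined_source))
    if annotated is None:
        return "incorrect"   # the source NE was never annotated
    if tuple(mined_target) == tuple(annotated):
        return "full"        # source and target both match exactly
    if mined_target and set(mined_target) <= set(annotated):
        return "partial"     # mined target is a subset of the gold target
    return "incorrect"
```

Per-pair precision and recall then follow from counting these judgements over the NEs mined from one article pair, before averaging over all pairs.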
The two types are nearly identical because, in most cases, the system is able to mine the entire NE. This validates our intuition of using language models to disambiguate NE boundaries. (The false negatives are mostly due to limitations of the transliteration model and the dictionary.) The precision is relatively low in Arabic, even when we include the tuning data; this suggests that the problem is not caused by incorrect parameter values. The error analysis for Arabic is discussed in Section 6.5.

Lang. pair   Prec. (full)   Prec. (part.)   Recall (full)   Recall (part.)
En-Hi        0.84           0.86            0.89            0.89
En-Ta        0.78           0.80            0.61            0.63
En-Ar        0.42           0.44            0.63            0.66
En-Ar*       0.43           0.44            0.60            0.62
* including the data used for tuning

Table 2: Precision and recall of the system.

We also report recall of the system for various categories of NEs in Table 3.[7] Note that the MW cases and the SW case are mutually exclusive.

Category      En-Hi   En-Ta   En-Ar
SW            0.90    0.82    0.69
MW-Translit   0.91    0.64    0.63
MW-Mixed      0.77    0.40    0.66

Table 3: Category-wise recall of the system.

6.5 Error Analysis for Arabic

The system performed relatively poorly on Arabic compared to the other languages. Detailed error analysis revealed the following sources of error.

Source NER  The text of the English articles automatically extracted from Wikipedia was not very clean compared to the newswire text used for En-Hi and En-Ta. As a result, the source NER wrongly identified many words as NEs, which were then mapped to words on the target side, affecting precision. E.g., words such as "best" and "foxe" were marked as NEs, and words with similar meaning or sound were found in the target. But since the annotator had ignored these words, the evaluation marked them as false positives.

Translation model  Many words were ignored by the translation model because of the presence of diacritics or affixes (e.g. 'ال' al is frequently prefixed to words in Arabic; also, different sources of text may have different levels of diacritization for the same words). E.g., the target document contained الجمهوريه al-jamhooriyah "republic", while the dictionary contained الجمهوريات al-jamhooriyat, which has a different suffix, and hence the word was not found.

Transliteration model  The non-uniform usage of diacritics and affixes (across training and test data) mentioned above affected the performance of transliteration too. E.g., the model is trained on data where the 'ال' al prefix usually occurs in the Arabic NE, but not in the English NE. As a result, it maps the 'new' in 'new york' to النيو al-nyoo. The annotator had mapped 'new' to نيو nyoo (i.e. without the prefix), causing the evaluation program to mark the system's output as a false positive.

Generative Model  Some errors occurred due to deficiencies in the generative model. The model requires every word in the source NE to be mapped to a unique word in the target NE. This causes problems when there are function words in the source NE, or when two source words are mapped to the same target word. E.g., 'yale school of management' corresponds to the 3-word NE 'مدرسه ييل الاداره', where 'of' has no Arabic counterpart, and 'al azhar' corresponds to the single word الازهر al-azhar (which could be split as ال ازهر al azhar, but this is never done in practice).

[6]: En-Hi from Webdunia, En-Ta from The New Indian Express.
[7]: Since we cannot determine the category of false positives, we do not report the precision here.

7 Related work

Automatic learning of translation lexicons has been studied in many works. Pirkola et al. (2003) suggest learning transformation rules from dictionaries and applying the rules to find cross-lingual spelling variants. Several works (Fung, 1995; Al-Onaizan and Knight, 2001; Koehn and Knight, 2002; Rapp, 1999) suggest approaches to learn translation lexicons from monolingual corpora. Apart from single word approaches, some works (Munteanu and Marcu, 2006; Quirk et al., 2007) focus on mining parallel sentences and fragments from 'near parallel' corpora.

On the other hand, out-of-vocabulary words are often transliterated into the target language, and approaches have been suggested for automatically learning transliteration equivalents. Klementiev and Roth (2006) proposed the use of similarity of temporal distributions for identifying NEs
from comparable corpora. Tao et al. (2006) used phonetic mappings for mining NEs from comparable corpora, but their approach requires language-specific knowledge, which limits it to specific languages. Udupa et al. (2008; 2009b) proposed a language-independent technique for mining single-word NE transliteration equivalents from comparable corpora. In this work, we extend this approach for mining multi-word NE equivalents from comparable corpora.

8 Conclusion

Through an empirical study, we motivated the importance and non-triviality of mining multi-word NE equivalents in comparable corpora. We proposed a two-tier generative model for mining such equivalents, which is independent of the length of the NE. We developed a variant of the Viterbi algorithm for finding the best alignment in our generative model. We evaluated our approach for three language pairs, and discussed the error analysis for English-Arabic.

Currently, unigram approaches are popular for most tasks in NLP, CLIR, MT, topic modeling, etc. Phrase-based approaches are limited by their efficiency and complexity, and also show limited improvement. We hope that this work will motivate researchers to explore principled methods that make use of NE phrases to significantly improve the state of the art in these areas. The two-tier generative model is applicable to any problem where the context of an observed variable does not depend on a fixed number of past observed variables.

References

Yaser Al-Onaizan and Kevin Knight. 2001. Translating named entities using monolingual and bilingual resources. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 400-408, Morristown, NJ, USA. Association for Computational Linguistics.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363-370, Morristown, NJ, USA. Association for Computational Linguistics.

Pascale Fung. 1995. A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. In Proceedings of the 33rd Annual Conference of the Association for Computational Linguistics, pages 236-243.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 817-824, Morristown, NJ, USA. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 9-16, Morristown, NJ, USA. Association for Computational Linguistics.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 81-88, Morristown, NJ, USA. Association for Computational Linguistics.

Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari Visala, and Kalervo Järvelin. 2003. Fuzzy translation of cross-lingual spelling variants. In SIGIR '03, pages 345-352, New York, NY, USA. ACM.

Chris Quirk, Raghavendra Udupa, and Arul Menezes. 2007. Generative models of noisy translations with applications to parallel fragments extraction. In MT Summit XI, pages 377-384. European Association for Machine Translation.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In ACL, pages 519-526.

Tao Tao, Su-Youn Yoon, Andrew Fister, Richard Sproat, and Chengxiang Zhai. 2006. Unsupervised named entity transliteration using temporal and phonetic correlation.

Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2008. Mining named entity transliteration equivalents from comparable corpora. In CIKM '08, pages 1423-1424. ACM.

Raghavendra Udupa, K. Saravanan, Anton Bakalov, and Abhijit Bhole. 2009a. "They are out there, if you know where to look": Mining transliterations of OOV query terms for cross-language information retrieval. In ECIR, volume 5478, pages 437-448. Springer.

Raghavendra Udupa, K. Saravanan, A. Kumaran, and Jagadeesh Jagarlamudi. 2009b. MINT: A method for effective and scalable mining of named entity transliterations from large comparable corpora. In EACL, pages 799-807.