Named Entity Transliteration with Comparable Corpora
Richard Sproat, Tao Tao, ChengXiang Zhai
University of Illinois at Urbana-Champaign, Urbana, IL, 61801
[email protected], {taotao,czhai}@cs.uiuc.edu
Abstract

In this paper we investigate Chinese-English name transliteration using comparable corpora, that is, corpora where texts in the two languages deal in some of the same topics (and therefore share references to named entities) but are not translations of each other. We present two distinct methods for transliteration, one approach using phonetic transliteration, and the second using the temporal distribution of candidate pairs. Each of these approaches works quite well, but by combining the approaches one can achieve even better results. We then propose a novel score propagation method that utilizes the co-occurrence of transliteration pairs within document pairs. This propagation method achieves further improvement over the best results from the previous step.

1 Introduction

As part of a more general project on multilingual named entity identification, we are interested in the problem of name transliteration across languages that use different scripts. One particular issue is the discovery of named entities in "comparable" texts in multiple languages, where by comparable we mean texts that are about the same topic, but are not in general translations of each other. For example, if one were to go through an English, Chinese and Arabic newspaper on the same day, it is likely that the more important international events in various topics such as politics, business, science and sports would each be covered in each of the newspapers. Names of the same persons, locations and so forth, which are often transliterated rather than translated, would be found in comparable stories across the three papers.¹ We wish to use this expectation to leverage transliteration, and thus the identification of named entities across languages. Our idea is that the occurrence of a cluster of names in, say, an English text, should be useful if we find a cluster of what looks like the same names in a Chinese or Arabic text.

An example of what we are referring to can be found in Figure 1. These are fragments of two stories from the June 8, 2001 Xinhua English and Chinese newswires, each covering an international women's badminton championship. Though these two stories are from the same newswire source, and cover the same event, they are not translations of each other. Still, not surprisingly, a lot of the names that occur in one also occur in the other. Thus (Camilla) Martin shows up in the Chinese version as ma-er-ting; Judith Meulendijks is yu mo-lun-di-ke-si; and Mette Sorensen is mai su-lun-sen. Several other correspondences also occur. While some of the transliterations are "standard" (thus Martin is conventionally transliterated as ma-er-ting), many of them were clearly more novel, though all of them follow the standard Chinese conventions for transliterating foreign names.

These sample documents illustrate an important point: if a document in language L1 has a set of names, and one finds a document in L2 containing a set of names that look as if they could be transliterations of the names in the L1 document, then this should boost one's confidence that the two sets of names are indeed transliterations of each other. We will demonstrate that this intuition is correct.

¹ Many names, particularly of organizations, may be translated rather than transliterated; the transliteration method we discuss here obviously will not account for such cases, though the time correlation and propagation methods we discuss will still be useful.
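The intuition that co-occurring candidates should reinforce one another can be sketched in a few lines of Python. This is an illustration only, not the score propagation method proposed in this paper: the function name `propagate`, the interpolation weight `alpha`, the update rule (a pair's own score interpolated with the mean score of the pairs it co-occurs with), and all scores are assumptions made for the example.

```python
from collections import defaultdict


def propagate(scores, doc_pairs, alpha=0.5):
    """Boost candidate transliteration pairs using their co-occurring pairs.

    scores:    dict mapping (english_name, chinese_name) -> base score
               (hypothetical values, e.g. from a phonetic model).
    doc_pairs: list of sets; each set holds the candidate pairs found in
               one (English document, Chinese document) pair.
    alpha:     hypothetical weight given to the neighbours' evidence.
    """
    # Collect, for every candidate pair, the other pairs it co-occurs with.
    neighbours = defaultdict(set)
    for pairs in doc_pairs:
        for pair in pairs:
            neighbours[pair] |= pairs - {pair}

    updated = {}
    for pair, own in scores.items():
        nbrs = neighbours.get(pair, set())
        if nbrs:
            # Mean base score of co-occurring pairs acts as supporting evidence.
            support = sum(scores.get(n, 0.0) for n in nbrs) / len(nbrs)
            updated[pair] = (1 - alpha) * own + alpha * support
        else:
            # No co-occurring candidates: the score is left unchanged.
            updated[pair] = own
    return updated


# Illustrative (made-up) scores for names from the badminton example.
base = {
    ("Martin", "ma-er-ting"): 0.9,
    ("Sorensen", "su-lun-sen"): 0.8,
    ("Meulendijks", "mo-lun-di-ke-si"): 0.4,
    ("Meulendijks", "mai"): 0.4,  # equally scored but unsupported candidate
}
docs = [{
    ("Martin", "ma-er-ting"),
    ("Sorensen", "su-lun-sen"),
    ("Meulendijks", "mo-lun-di-ke-si"),
}]
boosted = propagate(base, docs)
```

In this toy setting the two Meulendijks candidates start with the same score, but only the one that appears alongside other plausible pairs in the same document pair is raised, which is the behaviour the paragraph above argues for.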
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 73–80, Sydney, July 2006. © 2006 Association for Computational Linguistics

[Figure 1, English fragment]
Dai Yun Nips World No. 1 Martin to Shake off Olympic Shadow . . . In the day's other matches, second seed Zhou Mi overwhelmed Ling Wan Ting of Hong Kong, China 11-4, 11-4, Zhang Ning defeat Judith Meulendijks of Netherlands 11-2, 11-9 and third seed Gong Ruina took 21 minutes to eliminate Tine Rasmussen of Denmark 11-1, 11-1, enabling China to claim five quarterfinal places in the women's singles.

[Figure 1, Chinese fragment: characters not recoverable from the source encoding]

Our work differs from this work in that we use comparable corpora (in particular, news data) and leverage the time correlation information naturally available in comparable corpora.

3 Chinese Transliteration with Comparable Corpora

We assume that we have comparable corpora, consisting of newspaper articles in English and Chinese from the same day, or almost the same day. In our experiments we use data from the English and