
An Empirical Study of Chinese Name Matching and Applications

Nanyun Peng[1], Mo Yu[2], and Mark Dredze[1]

[1] Human Language Technology Center of Excellence, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218
[2] Machine Intelligence and Translation Lab, Harbin Institute of Technology, Harbin, China

[email protected], [email protected], [email protected]

Abstract

Methods for name matching, an important component to support downstream tasks such as entity linking and entity clustering, have focused on alphabetic languages, primarily English. In contrast, logogram languages such as Chinese remain untested. We evaluate methods for name matching in Chinese, including both string matching and learning approaches. Our approach, based on new representations for Chinese, improves both name matching and a downstream entity clustering task.

1 Introduction

A key technique in entity disambiguation is name matching: determining if two mention strings could refer to the same entity. The challenge of name matching lies in name variation, which can be attributed to many factors: nicknames, aliases, acronyms, and differences in transliteration, among others. In light of these issues, exact string match can lead to poor results. Numerous downstream tasks benefit from improved name matching: entity coreference (Strube et al., 2002), name transliteration (Knight and Graehl, 1998), mining paraphrases (Barzilay and Lee, 2003), entity linking (Rao et al., 2013) and entity clustering (Green et al., 2012). As a result, there have been numerous proposed name matching methods (Cohen et al., 2003), with a focus on person names. Despite extensive exploration of this task, most work has focused on Indo-European languages in general and English in particular. These languages use alphabets as representations of written language. In contrast, other languages use logograms, which represent a word or morpheme, the most popular being Chinese, which uses hanzi (汉字). This presents challenges for name matching: a small number of hanzi represent an entire name, and there are tens of thousands of hanzi in use. Current methods remain largely untested in this setting, despite downstream tasks in Chinese that rely on name matching (Chen et al., 2010; Cassidy et al., 2011). Martschat et al. (2012) point out errors in coreference resolution due to name matching errors, which suggests that downstream tasks can benefit from improvements in Chinese name matching techniques.

This paper presents an analysis of new and existing approaches to name matching in Chinese. The goal is to determine whether two Chinese strings can refer to the same entity (person, organization, location) based on the strings alone. The more general task of entity coreference (Soon et al., 2001), or entity clustering, includes the context of the mentions in determining coreference. In contrast, standalone name matching modules are context independent (Andrews et al., 2012; Green et al., 2012). In addition to showing name matching improvements on newly developed datasets of matched Chinese name pairs, we show improvements in a downstream Chinese entity clustering task by using our improved name matching system. We call our name matching tool Mingpipe, a Python package that can be used as a standalone tool or integrated within a larger system. We release Mingpipe as well as several datasets to support further work on this task.[1]

[1] The code and data for this paper are available at: https://github.com/hltcoe/mingpipe

2 Name Matching Methods

Name matching originated as part of research into record linkage in databases. Initial work focused on string matching techniques. This work can be organized into three major categories: 1) phonetic matching methods, e.g. Soundex (Holmes and McCabe, 2002) and double Metaphone (Philips, 2000); 2) edit-distance based measures, e.g. Levenshtein distance (Levenshtein, 1966) and Jaro-Winkler (Porter et al., 1997; Winkler, 1999); and 3) token-based similarity, e.g. soft TF-IDF (Bilenko et al., 2003). Analyses comparing these approaches have not found consistent improvements of one method over another (Cohen et al., 2003; Christen, 2006). More recent work has focused on learning a string matching model on name pairs, such as probabilistic noisy channel models (Sukharev et al., 2014; Bilenko et al., 2003). The advantage of trained models is that, with sufficient training data, they can be tuned for specific tasks.
While many NLP tasks rely on name matching, research on name matching techniques themselves has not been a major focus within the NLP community. Most downstream NLP systems have simply employed a static edit distance module to decide whether two names can be matched (Chen et al., 2010; Cassidy et al., 2011; Martschat et al., 2012). An exception is work on training finite state transducers for edit distance metrics (Ristad and Yianilos, 1998; Bouchard-Côté et al., 2008; Dreyer et al., 2008; Cotterell et al., 2014). More recently, Andrews et al. (2012) presented a phylogenetic model of string variation using transducers that applies to both pairs of name strings (supervised) and unpaired collections (unsupervised).

Beyond name matching in a single language, several papers have considered cross-lingual name matching, where name strings are drawn from two different languages, such as matching Arabic names (El-Shishtawy, 2013) with English (Freeman et al., 2006; Green et al., 2012). Additionally, name matching has been used as a component in cross-language entity linking (McNamee et al., 2011a; McNamee et al., 2011b) and cross-lingual entity clustering (Green et al., 2012). However, little work has focused on logograms, with the exception of Cheng et al. (2011). As we will demonstrate in §3, there are special challenges caused by the logogram nature of Chinese. We believe this is the first evaluation of Chinese name matching.

3 Challenges

Numerous factors cause name variations, including abbreviations, morphological derivations, historical sound or spelling change, loanword formation, translation, transliteration, or transcription error (Andrews et al., 2012). In addition to all of the above factors, Chinese name matching presents unique challenges (Table 1):

• There are more than 50k Chinese characters. This can create a large number of parameters in character edit models, which can complicate parameter estimation.

• Chinese characters represent morphemes, not sounds. Many characters can share a single pronunciation,[2] and many characters have similar sounds.[3] This causes typos (mistaking characters with the same pronunciation) and introduces variability in transliteration (different characters chosen to represent the same sound).

• Chinese has two writing systems (simplified, traditional) and two major dialects (Mandarin, Cantonese), with different pairings in different regions (see Table 2 for the three dominant regional combinations). This has a significant impact on loanwords and transliterations.

[2] 486 characters are pronounced / tCi / (regardless of tone).
[3] e.g. 庄 and 张 (different orthography) have similar pronunciations (/ tùuAN / and / tùAN /).

| Examples | Notes |
|---|---|
| 许历农 vs. 許歷農 | Simplified vs. traditional. |
| 東盟 vs. 东南亚国家联盟 | Abbreviation, and traditional vs. simplified. |
| 亚的斯亚贝巴 vs. 阿迪斯阿貝巴 (/ i2·ti·s1·i2·bei·b2 / vs. / 2·ti·s1·2·bei·b2 /) | Transliteration of Addis Ababa in Mainland and Taiwan. Different hanzi, similar pronunciations. |
| 佛罗伦萨 vs. 翡冷翠 (/ fo·luo·lu@n·s2 / vs. / fei·lEN·tsh8Y /) | Transliteration of Florence in Mainland and Hong Kong. Different writing systems and dialects. |
| 鲁弗斯·汉弗莱 vs. 韓魯弗 (/ lu·fu·sW·xan·fu·laI / vs. / xan·lu·fu /) | Transliteration of Humphrey Rufus in Mainland and Hong Kong. The first uses a literal transliteration, while the second does not; both reverse the name order (consistent with Chinese names) and change it to sound Chinese. |

Table 1: Challenges in Chinese name matching.

| Region | Writing System | Dialect |
|---|---|---|
| Hong Kong | Traditional | Cantonese |
| Mainland | Simplified | Mandarin |
| Taiwan | Traditional | Mandarin |

Table 2: Regional variations for Chinese writing and dialect.
4 Methods

We evaluate several name matching methods, representative of the major approaches to name matching described above.

String Matching. We consider two common string matching algorithms: Levenshtein and Jaro-Winkler. However, because of the issues mentioned above, we expect these to perform poorly when applied directly to Chinese strings. We consider several transformations to improve these methods. First, we map all strings to a single writing system: simplified. This is straightforward since traditional Chinese characters have a many-to-one mapping to simplified characters. Second, we consider a pronunciation-based representation. We convert characters to pinyin,[4] the official phonetic system (and ISO standard) for transcribing Mandarin pronunciations into the Latin alphabet. While pinyin is a common representation used in Chinese entity disambiguation work (Feng et al., 2004; Jiang et al., 2007), the pinyin for an entire entity is typically concatenated and treated as a single string ("string-pinyin"). However, the pinyin string itself has internal structure that may be useful for name matching. We consider two new pinyin representations. Since each Chinese character corresponds to one pinyin syllable, we can take each syllable as a token corresponding to its character. We call this "character-pinyin". Additionally, every Mandarin syllable (represented by a pinyin) can be spelled with a combination of an initial and a final segment. Therefore, we split each pinyin token further into its initial and final segments. We call this "segmented-pinyin".[5]

[4] Hong Kong has a romanization scheme more suitable for Cantonese, but we found no improvements over using pinyin. Therefore, for simplicity we use pinyin throughout.
[5] For example, the pinyin for 张 is segmented into / zh / and / ang /.
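To make the three representations concrete, here is a minimal sketch assuming the third-party pypinyin package (for pinyin, initials, and finals) and hanziconv (for traditional-to-simplified mapping); these libraries are illustrative stand-ins, not necessarily what Mingpipe itself uses:

```python
# A minimal sketch of the three pinyin representations described above.
# Assumes the third-party `pypinyin` and `hanziconv` packages; Mingpipe's
# actual conversion tables may differ.
from hanziconv import HanziConv                # traditional -> simplified
from pypinyin import lazy_pinyin, Style        # hanzi -> pinyin

def to_simplified(name):
    """Map traditional characters to simplified (many-to-one)."""
    return HanziConv.toSimplified(name)

def string_pinyin(name):
    """One concatenated pinyin string for the whole name."""
    return ''.join(lazy_pinyin(to_simplified(name)))

def character_pinyin(name):
    """One pinyin token per character."""
    return lazy_pinyin(to_simplified(name))

def segmented_pinyin(name):
    """Each syllable split into an initial and a final segment."""
    simplified = to_simplified(name)
    initials = lazy_pinyin(simplified, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(simplified, style=Style.FINALS, strict=False)
    # e.g. 张 -> ['zh', 'ang']; syllables without an initial yield ''.
    return [seg for pair in zip(initials, finals) for seg in pair if seg]

print(string_pinyin('許歷農'))     # e.g. 'xulinong'
print(character_pinyin('许历农'))  # e.g. ['xu', 'li', 'nong']
print(segmented_pinyin('张'))      # e.g. ['zh', 'ang']
```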
Transducers. We next consider methods that can be trained on available Chinese name pairs. Transducers are common choices for learning edit distance metrics for strings, and they perform better than static string similarity (Ristad and Yianilos, 1998; Andrews et al., 2012; Cotterell et al., 2014). We use the probabilistic transducer of Cotterell et al. (2014) to learn a stochastic edit distance. The model represents the conditional probability p(y|x; θ), where y is a generated string based on editing x according to parameters θ. At each position x_i, one of four actions (copy, substitute, insert, delete) is taken to generate character y_j. The probability of each action depends on the string to the left of x_i (x_(i−N1):i), the string to the right of x_i (x_i:(i+N2)), and the generated string to the left of y_j (y_(j−N3):j). The variables N1, N2, N3 are the context sizes. Note that characters to the right of y_j are excluded, as they have not yet been generated. Training maximizes the observed data log-likelihood, and EM is used to marginalize over the latent edit actions. Since the large number of Chinese characters makes parameter estimation prohibitive, we only train transducers on the three pinyin representations: string-pinyin (28 characters), character-pinyin (384 characters), and segmented-pinyin (59 characters).
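For intuition, the dynamic program below scores a simplified, context-free variant of this model in the style of Ristad and Yianilos (1998): it sums over all latent edit-action sequences to compute p(y|x) under fixed action probabilities. The full model of Cotterell et al. (2014) additionally conditions each action on the N1, N2, N3 context windows, and the uniform probabilities here are placeholders that EM would re-estimate:

```python
# A simplified, memoryless stochastic edit distance: p(y|x) summed over all
# latent edit-action sequences via dynamic programming. The placeholder
# action probabilities below stand in for parameters learned with EM.
def edit_probability(x, y, p_copy=0.6, p_sub=0.001, p_ins=0.001, p_del=0.05):
    m, n = len(x), len(y)
    # alpha[i][j] = probability of having consumed x[:i] and emitted y[:j]
    alpha = [[0.0] * (n + 1) for _ in range(m + 1)]
    alpha[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            a = alpha[i][j]
            if a == 0.0:
                continue
            if i < m:                       # delete x[i]
                alpha[i + 1][j] += a * p_del
            if j < n:                       # insert y[j]
                alpha[i][j + 1] += a * p_ins
            if i < m and j < n:             # copy or substitute
                p = p_copy if x[i] == y[j] else p_sub
                alpha[i + 1][j + 1] += a * p
    return alpha[m][n]

# A plausible variant scores higher than an unrelated string:
print(edit_probability('zhang', 'zang') > edit_probability('zhang', 'li'))
```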
Name Matching as Classification. An alternate learning formulation considers name matching as a classification task (Mayfield et al., 2009; Zhang et al., 2010; Green et al., 2012). Each string pair is an instance: a positive classification means that the two strings can refer to the same name. This allows for arbitrary and global features of the two strings. We use an SVM with a linear kernel.

To learn possible edit rules for Chinese names, we add features for pairs of n-grams. For each string, we extract all n-grams (n=1,2,3) and align n-grams between strings using the Hungarian algorithm.[6] Features correspond to the aligned n-gram pairs, as well as the unaligned n-grams. To reduce the number of parameters, we only include features which appear in positive training examples. These features are generated for two string representations: the simplified Chinese string (simplified n-grams) and a pinyin representation (pinyin n-grams), so that we can incorporate both orthographic and phonetic features. We separately select the best performing pinyin representation (string-pinyin, character-pinyin, segmented-pinyin) on development data for each dataset.

[6] We found this performed much better than directly aligning characters or tokens. We also tried n-gram TF-IDF cosine similarity, but it degraded results (Cohen et al., 2003).
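A sketch of this alignment step follows, assuming scipy's linear_sum_assignment as the Hungarian solver; the character-overlap (Dice) alignment cost is our own illustrative assumption, since the paper does not specify the cost it uses:

```python
# Illustrative sketch of the n-gram alignment features: extract 1-3 grams,
# align them with the Hungarian algorithm, and emit aligned/unaligned n-gram
# features. The Dice-overlap cost is an assumption, not the paper's choice.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ngrams(s, nmax=3):
    return [s[i:i + n] for n in range(1, nmax + 1) for i in range(len(s) - n + 1)]

def dice(a, b):
    """Character-overlap score between two n-grams, in [0, 1]."""
    return 2 * len(set(a) & set(b)) / (len(set(a)) + len(set(b)))

def alignment_features(s1, s2):
    g1, g2 = ngrams(s1), ngrams(s2)
    # The Hungarian algorithm minimizes cost, so negate the similarity.
    cost = np.array([[-dice(a, b) for b in g2] for a in g1])
    rows, cols = linear_sum_assignment(cost)
    feats, aligned1, aligned2 = [], set(), set()
    for i, j in zip(rows, cols):
        if cost[i, j] < 0:                  # keep only overlapping pairs
            feats.append(('pair', g1[i], g2[j]))
            aligned1.add(i); aligned2.add(j)
    feats += [('unaligned', g1[i]) for i in range(len(g1)) if i not in aligned1]
    feats += [('unaligned', g2[j]) for j in range(len(g2)) if j not in aligned2]
    return feats

# In practice both strings are first mapped to simplified or pinyin form:
print(alignment_features('汉弗莱', '汉弗'))
```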
We measure Jaccard similarity between the two strings separately for 1-, 2-, and 3-grams in each string representation. An additional feature indicates no n-gram overlap. The best performing Levenshtein distance metric is included as a feature. Finally, we include other features for several name properties: the difference in character length, and two indicators for whether the first characters of the two strings match and whether the first character is a common Chinese last name. Real-valued features are binarized. Table 3 lists the feature templates we used in our SVM model and the corresponding number of features.

| Feature Type | Number of Features |
|---|---|
| Simplified n-grams | ~10k |
| Pinyin n-grams | ~9k |
| Jaccard similarity | 6 × 10 |
| TF-IDF similarity | 2 × 10 |
| Levenshtein distance | 2 × 10 |
| Other | 7 |

Table 3: Features for SVM learning.

5 Experiments

5.1 Dataset

We constructed two datasets from Wikipedia.

REDIRECT: We extracted webpage redirects from Chinese Wikipedia pages that correspond to entities (person, organization, location); the page type is indicated in the page's metadata. Redirect links indicate queries that all lead to the same page, such as "Barack Hussein Obama" and "Barack Obama". To remove redirects that are not entities (e.g. "44th president"), we removed entries that contain numerals and Latin characters, as well as names that contain certain keywords.[7] The final dataset contains 13,730 pairs of person names, 10,686 organizations and 5,152 locations, divided into 3/5 train, 1/5 development and 1/5 test.

NAME GROUPS: Chinese Wikipedia contains a handcrafted mapping between entity names and various transliterations,[8] including for Mainland, Hong Kong and Taiwan. We created two datasets: Mainland-Hong Kong (1,288 person pairs, 357 locations, 177 organizations) and Mainland-Taiwan (1,500 people, 439 locations, 112 organizations). Data proportions are split as in REDIRECT.

[7] Entries that contain 列表 (list), 代表 (representative), 运动 (movement), 问题 (issue) and 维基 (Wikipedia).
[8] http://zh.wikipedia.org/wiki/Template:CGroup

5.2 Evaluation

We evaluated performance on a ranking task (the setting of Andrews et al. (2012)). In each instance, the algorithm was given a query and a set of 11 names from which to select the best match. The 11 names included a matching name as well as 10 other names with some character overlap with the query, randomly chosen from the same data split. We evaluate using precision@1, precision@3 and mean reciprocal rank (MRR). Classifiers were trained on the true pairs (positive) and negative examples constructed by pairing a name with 10 other names that have some character overlap with it. The two SVM parameters (the regularization coefficient C and the instance weight w for positive examples), as well as the best pinyin representation, were selected using grid search on the development data.
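Scoring a system on this task reduces to computing precision@k and MRR over ranked candidate lists; a minimal sketch, where `score` is a hypothetical stand-in for any matcher above (a string similarity or an SVM decision value):

```python
# Minimal sketch of the ranking evaluation: each instance pairs a query with
# 11 candidates (the true match plus 10 distractors).
def evaluate_ranking(instances, score, ks=(1, 3)):
    hits = {k: 0 for k in ks}
    mrr = 0.0
    for query, candidates, gold in instances:
        ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
        rank = ranked.index(gold) + 1       # 1-based rank of the true match
        mrr += 1.0 / rank
        for k in ks:
            hits[k] += int(rank <= k)
    n = float(len(instances))
    results = {'prec@%d' % k: hits[k] / n for k in ks}
    results['MRR'] = mrr / n
    return results

# instances = [(query, [list of 11 candidate names], true_match), ...]
```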
Results. For string matching methods, simplified characters improve over the original characters for both Levenshtein and Jaro-Winkler (Table 4). Surprisingly, string-pinyin does not help over the simplified characters. Segmented-pinyin improved over string-pinyin but did not do as well as the simplified characters. Our character-pinyin representation performed the best overall, because it utilizes the phonetic information that pinyin encodes: all the different characters that share a pronunciation are reduced to the same pinyin representation. Over all the representations, Levenshtein outperformed Jaro-Winkler, consistent with previous work (Cohen et al., 2003).

| Method | Representation | prec@1 | prec@3 | MRR |
|---|---|---|---|---|
| Levenshtein | original | 0.773 | 0.838 | 0.821 |
| Levenshtein | simplified | 0.816 | 0.872 | 0.856 |
| Levenshtein | string-pinyin | 0.743 | 0.844 | 0.811 |
| Levenshtein | character-pinyin | 0.824 | 0.885 | 0.866 |
| Levenshtein | segment-pinyin | 0.797 | 0.877 | 0.849 |
| Jaro-Winkler | original | 0.690 | 0.792 | 0.767 |
| Jaro-Winkler | simplified | 0.741 | 0.821 | 0.803 |
| Jaro-Winkler | string-pinyin | 0.741 | 0.818 | 0.800 |
| Jaro-Winkler | character-pinyin | 0.751 | 0.831 | 0.813 |
| Jaro-Winkler | segment-pinyin | 0.753 | 0.821 | 0.808 |

Table 4: String matching on development data.

Compared to the best string matching method (Levenshtein with character-pinyin), the transducer improves on the two NAME GROUPS datasets but does worse on REDIRECT (Table 5). The heterogeneous nature of REDIRECT, including variation from aliases, nicknames, and long-distance re-ordering, may confuse the transducer. The SVM does best overall, improving over string matching on all datasets and tying or beating the transducer.

| Method | Dataset | prec@1 | prec@3 | MRR |
|---|---|---|---|---|
| Levenshtein | REDIRECT | 0.820 | 0.868 | 0.859 |
| Levenshtein | Mainland-Taiwan | 0.867 | 0.903 | 0.897 |
| Levenshtein | Mainland-Hong Kong | 0.873 | 0.937 | 0.911 |
| Transducer | REDIRECT | 0.767 | 0.873 | 0.833 |
| Transducer | Mainland-Taiwan | 0.889 | 0.938 | 0.921 |
| Transducer | Mainland-Hong Kong | 0.925* | 0.989* | 0.954* |
| SVM | REDIRECT | 0.888** | 0.948** | 0.924** |
| SVM | Mainland-Taiwan | 0.926 | 0.966** | 0.951* |
| SVM | Mainland-Hong Kong | 0.882 | 0.972 | 0.928 |

Table 5: Results on test data. * = better than Levenshtein; ** = better than all other methods (p = 0.05).

Different pinyin representations (combined with the simplified representation) worked best on different datasets: character-pinyin for REDIRECT, segmented-pinyin for Mainland-Hong Kong and string-pinyin for Mainland-Taiwan. To understand how the SVM features affect the final results, we conducted ablation tests for different groups of features when trained on person names (only) for each dataset (Table 6). Overall, the Jaccard features are the most effective.

| Features | REDIRECT | Name Groups |
|---|---|---|
| ALL | 0.921 | 0.966 |
| - Jaccard similarity | 0.908 | 0.929 |
| - Levenshtein | 0.919 | 0.956 |
| - Simplified pairs | 0.918 | 0.965 |
| - Pinyin pairs | 0.920 | 0.960 |
| - Others | 0.921 | 0.962 |

Table 6: Ablation experiments on SVM features.

Error Analysis. We annotated 100 randomly sampled REDIRECT development pairs incorrectly classified by the SVM. We found three major types of errors. 1) Matches requiring external knowledge (43% of errors), such as nicknames or aliases. In these cases, the strings are insufficient for determining the correct answer; these errors are typically handled using alias lists. 2) Transliteration confusions (13%), resulting from different dialects, transliteration versus translation, or only part of a name being transliterated. 3) Noisy data (19%): Wikipedia redirects include names in other languages (e.g. Japanese, Korean) or orthographically identical strings for different entities. Finally, 25% of the time the system simply got the wrong answer; many of these cases are acronyms.

5.3 Entity Clustering

We evaluate the impact of our improved name matching on a downstream task: entity clustering (cross-document coreference resolution), where the goal is to identify co-referent named mentions across documents. Only a few studies have considered Chinese entity clustering (Chen and Martin, 2007), including the TAC KBP shared task, which has included clustering Chinese NIL mentions (Ji et al., 2011). We construct an entity clustering dataset from the TAC KBP entity linking data: all of the 2012 Chinese data is used as development, and the 2013 data as test. We use the system of Green et al. (2012), which allows for the inclusion of arbitrary name matching metrics. We follow their setup for training and evaluation (B³) and use TF-IDF context features. We tune the clustering cutoff for their hierarchical model, as well as the name matching threshold, on the development data. For the trainable name matching methods (transducer, SVM), we train on the development data using cross-validation, as well as tuning the representations and model parameters. We include an exact match baseline.

| Method | Dev Precision | Dev Recall | Dev F1 | Test Precision | Test Recall | Test F1 |
|---|---|---|---|---|---|---|
| Exact match | 84.55 | 57.46 | 68.42 | 63.95 | 65.44 | 64.69 |
| Jaro-Winkler | 84.87 | 58.35 | 69.15 | 70.79 | 66.21 | 68.42 |
| Levenshtein | 83.16 | 61.13 | 70.46 | 69.56 | 67.27 | 68.40 |
| Transducer | 90.33 | 74.92 | 81.90 | 73.59 | 63.70 | 68.29 |
| SVM | 90.05 | 63.90 | 74.75 | 74.33 | 67.60 | 70.81 |

Table 7: Results on Chinese entity clustering (B³).

Table 7 shows that on test data, our best method (SVM) improves over all previous methods by over 2 points. The transducer makes strong gains on dev but not test, suggesting that parameter tuning overfit. These results demonstrate the downstream benefits of improved name matching.
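The B³ metric used above scores each mention by the overlap between its predicted and gold clusters, averaged over mentions; a minimal sketch of the standard definition:

```python
# Minimal sketch of the B-cubed (B3) clustering metric: per-mention
# precision/recall against gold clusters, averaged over all mentions.
def b_cubed(predicted, gold):
    """predicted, gold: dicts mapping mention -> cluster id (same keys)."""
    def clusters(assign):
        out = {}
        for mention, cid in assign.items():
            out.setdefault(cid, set()).add(mention)
        return out
    pred_c, gold_c = clusters(predicted), clusters(gold)
    precision = recall = 0.0
    for m in predicted:
        overlap = len(pred_c[predicted[m]] & gold_c[gold[m]])
        precision += overlap / len(pred_c[predicted[m]])
        recall += overlap / len(gold_c[gold[m]])
    n = len(predicted)
    p, r = precision / n, recall / n
    return p, r, 2 * p * r / (p + r)

# A perfect clustering scores 1.0 on all three measures:
print(b_cubed({'a': 1, 'b': 1, 'c': 2}, {'a': 9, 'b': 9, 'c': 7}))
```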
6 Conclusion

Our results suggest several research directions. The remaining errors could be addressed with additional resources: alias lists could be learned from data or derived from existing resources. Since the best pinyin representation varies by dataset, future work could automatically determine the most effective representation, which may include determining the type of variation present in the proposed pair, as well as the associated dialect. Our name matching tool, Mingpipe, is implemented as a Python library. We make Mingpipe and our datasets available to aid future research on this topic.[9]

[9] https://github.com/hltcoe/mingpipe

References

Nicholas Andrews, Jason Eisner, and Mark Dredze. 2012. Name phylogeny: A generative model of string variation. In Empirical Methods in Natural Language Processing (EMNLP), pages 344–355.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 16–23.

Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23.

Alexandre Bouchard-Côté, Percy Liang, Dan Klein, and Thomas L. Griffiths. 2008. A probabilistic approach to language change. In Advances in Neural Information Processing Systems (NIPS), pages 169–176.

Taylor Cassidy, Zheng Chen, Javier Artiles, Heng Ji, Hongbo Deng, Lev-Arie Ratinov, Jing Zheng, Jiawei Han, and Dan Roth. 2011. CUNY-UIUC-SRI TAC-KBP2011 entity linking system description. In Text Analysis Conference (TAC).

Ying Chen and James Martin. 2007. Towards robust unsupervised personal name disambiguation. In Empirical Methods in Natural Language Processing (EMNLP), pages 190–198.

Ying Chen, Peng Jin, Wenjie Li, and Chu-Ren Huang. 2010. The Chinese persons name disambiguation evaluation: Exploration of personal name disambiguation in Chinese news. In CIPS-SIGHAN Joint Conference on Chinese Language Processing.

Gang Cheng, Fei Lu, Haiyang Lv, and Yinling Zhang. 2011. A new matching algorithm for Chinese place names. In International Conference on Geoinformatics, pages 1–4. IEEE.

Peter Christen. 2006. A comparison of personal name matching: Techniques and practical issues. In IEEE International Conference on Data Mining Workshops, pages 290–294.

William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. A comparison of string metrics for matching names and records. In KDD Workshop on Data Cleaning and Object Consolidation, pages 73–78.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Association for Computational Linguistics (ACL), pages 625–630.

Markus Dreyer, Jason R. Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Empirical Methods in Natural Language Processing (EMNLP), pages 1080–1089.

Tarek El-Shishtawy. 2013. A hybrid algorithm for matching Arabic names. arXiv preprint arXiv:1309.5657.

Donghui Feng, Yajuan Lü, and Ming Zhou. 2004. A new approach for English-Chinese named entity alignment. In Empirical Methods in Natural Language Processing (EMNLP), pages 372–379.

Andrew T. Freeman, Sherri L. Condon, and Christopher M. Ackerman. 2006. Cross linguistic name matching in English and Arabic: A one to many mapping extension of the Levenshtein edit distance algorithm. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 471–478.

Spence Green, Nicholas Andrews, Matthew R. Gormley, Mark Dredze, and Christopher D. Manning. 2012. Entity clustering across languages. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 60–69.

David Holmes and M. Catherine McCabe. 2002. Improving precision and recall for Soundex retrieval. In International Conference on Information Technology: Coding and Computing, pages 22–26.

Heng Ji, Ralph Grishman, and Hoa Trang Dang. 2011. Overview of the TAC 2011 knowledge base population track. In Text Analysis Conference (TAC).

Long Jiang, Ming Zhou, Lee-Feng Chien, and Cheng Niu. 2007. Named entity translation with web mining and transliteration. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1629–1634.

Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.

Vladimir Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710.

Sebastian Martschat, Jie Cai, Samuel Broscheit, Eva Mújdricza-Maydt, and Michael Strube. 2012. A multigraph model for coreference resolution. In Empirical Methods in Natural Language Processing and Conference on Natural Language Learning (EMNLP-CoNLL), pages 100–106.

James Mayfield, David Alexander, Bonnie J. Dorr, Jason Eisner, Tamer Elsayed, Tim Finin, Clayton Fink, Marjorie Freedman, Nikesh Garera, Paul McNamee, et al. 2009. Cross-document coreference resolution: A key technology for learning by reading. In AAAI Spring Symposium: Learning by Reading and Learning to Read, pages 65–70.

Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W. Oard, and David S. Doermann. 2011a. Cross-language entity linking. In International Joint Conference on Natural Language Processing (IJCNLP), pages 255–263.

Paul McNamee, James Mayfield, Douglas W. Oard, Tan Xu, Ke Wu, Veselin Stoyanov, and David Doermann. 2011b. Cross-language entity linking in Maryland during a hurricane. In Empirical Methods in Natural Language Processing (EMNLP).

L. Philips. 2000. The double Metaphone search algorithm. C/C++ Users Journal, 18(6).

Edward H. Porter, William E. Winkler, et al. 1997. Approximate string comparison and its effect on an advanced record linkage system. Research report, Statistical Research Division, US Bureau of the Census.

Delip Rao, Paul McNamee, and Mark Dredze. 2013. Entity linking: Finding extracted entities in a knowledge base. In Multi-source, Multilingual Information Extraction and Summarization, pages 93–115. Springer.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532.

Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

Michael Strube, Stefan Rapp, and Christoph Müller. 2002. The influence of minimum edit distance on reference resolution. In Empirical Methods in Natural Language Processing (EMNLP), pages 312–319.

Jeffrey Sukharev, Leonid Zhukov, and Alexandrin Popescul. 2014. Learning alternative name spellings. arXiv preprint arXiv:1405.2048.

William E. Winkler. 1999. The state of record linkage and current research problems. Statistical Research Division, US Census Bureau.

Wei Zhang, Jian Su, Chew Lim Tan, and Wen Ting Wang. 2010. Entity linking leveraging: Automatically generated annotation. In International Conference on Computational Linguistics (COLING), pages 1290–1298.