An Empirical Study of Chinese Name Matching and Applications

Nanyun Peng¹ and Mo Yu² and Mark Dredze¹
¹Human Language Technology Center of Excellence, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, 21218
²Machine Intelligence and Translation Lab, Harbin Institute of Technology, Harbin, China

Abstract

Methods for name matching, an important component to support downstream tasks such as entity linking and entity clustering, have focused on alphabetic languages, primarily English. In contrast, logogram languages such as Chinese remain untested. We evaluate methods for name matching in Chinese, including both string matching and learning approaches. Our approach, based on new representations for Chinese, improves both name matching and a downstream entity clustering task.

1 Introduction

A key technique in entity disambiguation is name [...] or morpheme, the most popular being Chinese, which uses hanzi (汉字). This presents challenges for name matching: a small number of hanzi represent an entire name and there are tens of thousands of hanzi in use. Current methods remain largely untested in this setting, despite downstream tasks in Chinese that rely on name matching (Chen et al., 2010; Cassidy et al., 2011). Martschat et al. (2012) point out errors in coreference resolution due to Chinese name matching errors, which suggests that downstream tasks can benefit from improvements in Chinese name matching techniques.

This paper presents an analysis of new and existing approaches to name matching in Chinese. The goal is to determine whether two Chinese strings can refer to the same entity (person, organization, location) based on the strings alone.
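
To make the task concrete, the following is a minimal sketch of a matcher that uses only the two strings. It assumes a character-level Levenshtein distance as an illustrative string-matching baseline; the function names (`edit_distance`, `name_similarity`, `may_match`) and the 0.5 threshold are hypothetical choices for this example and are not taken from the text above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters (each hanzi counts as one symbol)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def may_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Assumed decision rule: similarity at or above an illustrative threshold."""
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    # Two common transliterations of "Obama": one hanzi differs out of three.
    print(name_similarity("奥巴马", "欧巴马"))
    # A full name and its abbreviation (Peking University): two deletions out of four.
    print(name_similarity("北京大学", "北大"))
```

Because a name is only a few hanzi long, a single differing character already removes a third of the similarity in the first pair, which illustrates why character-level string matching is a coarse signal in this setting.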