Exploration of Personal Name Disambiguation in Chinese News
Total Page:16
File Type:pdf, Size:1020Kb
The Chinese Persons Name Disambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News Ying Chen *, Peng Jin †, Wenjie Li ‡,Chu-Ren Huang ‡ * China Agricultural University †Leshan Teachers’ College ‡The Hong Kong Polytechnic University [email protected] [email protected] [email protected] [email protected] 2007, 2009). Compared to the one in English, per- Abstract sonal name disambiguation in Chinese has special issues, such as Chinese text processing and Chi- Personal name disambiguation becomes hot as it nese personal naming system. Therefore, we hold provides a way to incorporate semantic under- Chinese personal name disambiguation (CPND) to standing into information retrieval. In this cam- explore those problems. In this campaign, we paign, we explore Chinese personal name mainly examine the relationships between Chinese disambiguation in news. In order to examine how word segmentation and Chinese personal name well disambiguation technologies work, we con- centrate on news articles, which is well-formatted disambiguation. and whose genre is well-studied. We then design a Moreover, from our experiences in WePS diagnosis test to explore the impact of Chinese (Chen et al., 2007, 2009), we notice that webpages word segmentation to personal name disambigua- are so noisy that text pre-processing that extracts tion. useful text for disambiguation needs much effort. In fact, text pre-processing for webpages is rather complicated, such as deleting of HTML tags, the 1 Introduction detection of JavaScript codes and so on. Therefore, the final system performance in the WePS cam- Incorporating semantic understanding technolo- paign sometimes does not reflect the disambigua- gies from the field of NLP becomes one of further tion power of the system, and instead it shows the directions for information retrieval. Among them, comprehensive result of text pre-processing as named entity disambiguation, which intends to well as disambiguation. In order to focus on per- use state-of-the-art named entity processing to sonal name disambiguation, we choose news enhance a search engine, is a hot research issue. documents in CPND. Because of the popularity of personal names in The paper is organized as follows. Section 2 de- queries, more efforts are put on personal name scribes our formal test including datasets and disambiguation. The personal name disambigua- 1 evaluation. Section 3 introduces the diagnosis test, tion used both in Web Personal Search (WePS ) which explores the impact of Chinese word seg- and our campaign is defined as follow. Given mentation to personal name disambiguation. Sec- documents containing a personal name in interest, tion 4 describes our campaign, and Section 5 the task is to cluster them according to which en- presents the results of the participating systems. tity the name in a document refers to. Finally, Section 6 concludes our main findings in WePS, which explores English personal name this campaign. disambiguation, has been held twice (Artiles et al., 1 http://nlp.uned.es/weps/ 2 The Formal Test one-in-one and all-in-one clustering. Although the hybrid cheat system can achieve fairly good per- formance, it is not useful for real applications. In 2.1 Datasets the formal test, these three systems serve as the baseline. To avoid the difficulty to clean a webpage, we choose news articles in this campaign. Given a Formula full name in Chinese, we search the character- S| i ∩ Rj | based personal name string in all documents of Precision ∑SSi ∈S ∑ d ∈ i max Rj ∈R;d∈Rj S| i | Chinese Gigaword Corpus, a large Chinese news collection. If a document contains the name, it is ∑ S| i | Si ∈S belonged to the dataset of this name. To ensure the popularity of a personal name, we keep only a | Ri ∩ Sj | personal name whose corresponding dataset com- Recall ∑Ri ∈R ∑ d ∈R i max Sj ∈S;d∈Sj prises more than 100 documents. In addition, if | Ri | there are more than 300 documents in that dataset, i ∑ ∈ | R | we randomly select 300 articles to annotate. Fi- Ri R nally, there are totally 58 personal names and Table 1: the formula of the modified B-cubed 12,534 news articles used in our data, where 32 metrics names are in the development data and 26 names are in the test data, as shown Appendix Table 4 and 5 separately. 3 The Diagnosis Test From Table 4 and 5, we can find that the ambi- Because of no word delimiter, Chinese text proc- guity (the document number per cluster) distribu- essing often needs to do Chinese word segmenta- tion is much different between the development tion first. In order to explore the relationship data and the test data. In fact, the ambiguity var- between personal name disambiguation and word ies with a personal name in interest, such as the segmentation, we provide a diagnosis data which popularity of the name in the given corpus, the attempts to examine the impact of word segmenta- celebrity degree of the name, and so on. tion to disambiguation. 2.2 Evaluation Firstly, for each personal name, its correspond- ing dataset will be manually divided into three In WePS, Artiles et al. (2009) made an intensive groups as follows. The disambiguation system study of clustering evaluation metrics, and found then runs for each group of documents. The three that B-Cubed metric is an appropriate evaluation clustering outputs are merged into the final clus- approach. Moreover, in order to handle overlap- tering for that personal name. ping clusters (i.e. a personal name in a document (1) Exactly matching: news articles contain- refers to more than one person entity in reality), ing personal names that exactly match we extend B-Cubed metric as Table 1, where S = the query personal name. {S 1, S 2, …} is a system clustering and R = {R 1, (2) Partially matching: news articles contain- R2, …} is a gold-standard clustering. The final ing personal names that are super-strings performance of a system clustering for a personal of the query personal name. For instance, name is the F score ( α= 0.5), and the final per- an article that has a person named with formance of a system is the Mac F score, the aver- “高军田” ( Gao Jun Tian) is retrieved age of the F scores of all personal names. for the query personal name “ 高军” (Gao Moreover, Artiles et al. (2009) also discuss Jun). three cheat systems: one-in-one, all-in-one, and (3) Discarded: news articles containing the hybrid cheat system. One-in-one assigns each character sequences that match the query document into a cluster, and in contrast, all-in-one personal name string and however in fact put all documents into one cluster. The hybrid are not a personal name. For instance, an cheat system just incorporates all clusters both in article that has the string “最高军事法 院” (Zui Gao Jun Shi Fa Yuan: supreme dataset for a personal name into two groups ac- military court) is also retrieved for the cording to whether the person in question is a re- personal name “ 高军” (Gao Jun). porter of the news. They then choose a different strategy to make further clustering for each group. This diagnosis test is designed to simulate the In terms of feature extraction, we find that all realistic scenario where Chinese word segmenta- systems except SoochowHY use word segmenta- tion works before personal name disambiguation. tion as pre-processing. Moreover, most systems If a Chinese word segmenter works perfectly, a choose named entity detection to enhance their word-based matching can be used to retrieve the feature extraction. In addition, character-based documents containing a personal name, and arti- bigrams are also used in some systems. In Ap- cles in Groups (2) and (3) should not be returned. pendix Table 6, we give the summary of word The personal name disambiguation task that is segmentation and named entity detection used in limited to the documents in Group (1) should be the participating systems. simpler. Regarding clustering algorithms, agglomerative Moreover, in this diagnosis test, we propose a hierarchical clustering is popular in the submis- baseline based on the gold-standard word segmen- sions. Moreover, we find that weight learning is tation as follows, namely the word-segment sys- very crucial for similarity matrix, which has a big tem. impact to the final clustering performance. Be- 1) All articles in the “exactly matching” sides the popular Boolean and TFIDF weighting group are merged into a cluster, and all schemes, SoochowHY and NEU-2 use different articles in the “discarded” group are weighting learning. NEU-2 manually assigns merged into a cluster. weights to different kinds of features. So- 2) In the “partially matching” group, enti- ochowHY develops an algorithm that iteratively ties exactly sharing the same personal learns a weight for a character-based n-gram. name are merged into a cluster. For ex- ample, all articles containing “高军田” 5 Results (Gao Jun Tian) are merged into a cluster, 高军华 We first provide the performances of the formal and all articles containing ” (Gao test, and make some analysis. We then present and Jun Hua) are merged into another cluster. discuss the performances of the diagnosis test. 4 Campaign Design 5.1 Results of the Formal test For the formal test, we show the performances of 4.1 The Participants 11 submissions from 10 teams in Table 2. For each team, we keep only the better result except The task of Chinese personal name disambigua- the NEU team because they use different tech- tion in news has attracted the participation of 10 nologies in their two submissions (NEU_1 and teams. As a team can submit at most 2 results, NEU_2).