
A Multi-stage Clustering Framework for Chinese Personal Name Disambiguation

Huizhen , Haibo , Yingchao Shi, Ji , , Jingbo Zhu
Natural Language Processing Laboratory, Northeastern University
Shenyang, Liaoning,
{wanghuizhen|[email protected] {dinghb|shiyc|maji}@mail.neu.edu.cn

Abstract

This paper presents our systems for the Chinese Personal Name Disambiguation task in CIPS-SIGHAN 2010. We submitted two different systems for this task, and both achieve the best performance. This paper introduces the multi-stage clustering framework and some key techniques used in our systems, and presents experimental results on the evaluation data. Finally, we discuss some interesting issues found during the development of the systems.

1 Introduction

Personal name disambiguation (PND) is very important for web search and potentially other natural language applications such as question answering. The CIPS-SIGHAN bakeoffs provide a platform to evaluate the effectiveness of various methods on the Chinese PND task.

Unlike English PND, Chinese PND requires word segmentation. In practice, person names are highly ambiguous because different people may have the same name, and the same name can be written in different ways. It is an n-to-n mapping of person names to specific people. There are two main challenges in Chinese PND: the first is how to correctly recognize personal names in the text, and the second is how to distinguish different persons who have the same name. To address these challenges, we designed a rule-based combination technique to improve NER performance and propose a multi-stage clustering framework for Chinese PND. We participated in the bakeoff of the Chinese PND task; on the test set and the diagnosis test set, our two systems are ranked 1st and 2nd.

The rest of this paper is organized as follows. In Section 2, we give the key features and techniques used in our two systems. In Section 3, experimental results on the evaluation test data demonstrate that our methods are effective for disambiguating personal names, and we discuss some issues found during the development of the systems. In Section 4, we conclude our work.

2 System Description

In this section, we describe the framework of our systems in more detail, covering data preprocessing, discard-class document identification, feature definition, clustering algorithms, and sub-system combination.

2.1 Data Preprocessing

There are around 100-300 news articles per personal name in the evaluation corpus. Each article is stored as XML and encoded in UTF-8. First, each news article is preprocessed as follows:

• Use a publicly available Chinese encoding converter tool to convert each news article from UTF-8 into GB [1];
• Remove all XML tags;
• Perform Chinese word segmentation, part-of-speech (POS) tagging and named entity recognition (NER).

[1] http://www.mandarintools.com/

The performance of the word segmentation and NER tools generally affects the effectiveness of our Chinese PND systems. During system development, we found that the publicly available NER systems obtain unsatisfactory performance on the evaluation data. To address this challenge, we propose a new rule-based combination technique to improve NER performance. In our combination framework, two different NER systems are used: a CRF-based NER system and our laboratory's NER system (Yao et al., 2002). The latter is implemented based on the maximum matching principle and some linguistic post-processing rules. Since the two NER systems adopt different technical frameworks, it is possible to achieve better performance by means of system combination.

The basic idea of our combination method is to first simply combine the results produced by both NER systems, and then apply heuristic post-processing rules to refine the NE identification results. To this end, we first investigate the error types of both NER systems, and design post-processing rules to correct errors or to select the appropriate NER result when the two systems disagree. Note that these rules are learned from sample data (i.e., the training set), not from the test set. Experimental results demonstrate satisfactory NER performance after introducing the following heuristic refinement rules:

• Conjunction Rules. Two NEs separated by a conjunction (such as “和”, “或”, “与”, “、”) belong to the same type, e.g., “高明/adj. 和/吴倩莲/person”. Such a conjunction rule can help NER systems make a consistent prediction on both NEs, e.g., “高明/person” and “高峰/person”.
• Professional Title Rules. Professional title words such as “主任” are strong indicators of person names, e.g., “主任/李刚”. Such a rule can be written in the form “professional_title+person_name”.
• Suffix Rules. If an identified person name is followed by a suffix of another type of named entity, such as a location, it is not a true person name, for example, “阿萨姆邦德马杰/person 镇/的/居民”. Since “镇” is a suffix of a location name, “阿萨姆邦德马杰/person 镇/location-suffix” should be revised to a new location name, namely “阿萨姆邦德马杰镇/location”.
• Foreign Person Name Rules. Two identified person names connected by a dot are merged into a single foreign person name, e.g., “菲/./罗杰斯” => “菲.罗杰斯”.
• Surname Rules. Surnames are very important for Chinese person name identification. However, some common surnames can also be ordinary single-character words depending on the context; for example, the Chinese word “张” can be either a surname or a quantifier. To tackle this problem, some post-processing rules for “张, 段, 高, 刘, 赵” are designed in our system.
• Query-Dependent Rules. Given a query person name A, if the string AB occurring in the current document has been identified as a single person name many times in other documents, our system tends to segment AB as a single person name rather than as A/B. For example, if “郭伟明” was identified as a true person name more than once in other documents, then “议员/郭伟/明/表示/” => “议员/郭伟明/person 表示/”.

With these post-processing rules, our NER system achieves 98.89% NER precision on the training set.

2.2 Discard-Class Document Identification

In the evaluation data, many documents belong to a specific class, referred to as the discard-class. In a discard-class document, the query person name occurring in the document is not a true person name. For example, the query word “黄海” is the name of a sea, not a person name, in the sentence “三江飘流分别可达日本海、黄海和鄂霍茨克海”. In such a case, the corresponding document is assigned to the discard-class. Discard-class document identification is therefore a very simple task: if a document does not contain a true person name that is the same as the query or contains the query, it is a discard-class document.

2.3 Feature Definition

To identify different types of person names and to support PND, binary features are defined to represent each document as a feature vector:

• Personal attributes: professional title, affiliation, location, co-occurring person names and organizations related to the given query.
• NE-type features: all NEs occurring in the context of the given query. There are two kinds of NE-type features used in our systems, local features and global features. The global features are defined with respect to the whole document, while the local features are extracted only from the two or three adjacent sentences for the given query.
• BOW-type features: the context feature vector based on the bag-of-words model. Similarly, there are local and global BOW-type features with respect to the context considered.

2.4 A Multi-stage Clustering Framework

In the training set, 36% of person names refer to journalists, 10% to sportsmen, and the remainder are common person names. Based on these observations, it is necessary to use different methods for PND on different types of person names; for example, the most effective information for distinguishing different journalists is the report's location and colleagues, rather than the whole document content. To achieve satisfactory PND performance, our system has three different modules for analyzing journalist, sportsman and common person names, respectively.

2.4.1 PND on the Journalist Class

In our system, some regular expressions are designed to determine whether a person name is a journalist or not. For example:

• 新华社/ni */ns */t */t 消息|电/n (/w .* [《/w */ni 》/w ]* query name/nh .*)/w
• (/w .*query name/nh .*)/w
• [*/nh]* query name/nh [*/nh]
• 记者|编辑/n [*/nh]* query name/nh [*/nh]*

To disambiguate the journalist class, our system uses a rule-based clustering technique to distinguish different journalists. For each document in which the query person name is a journalist, we first extract the organization and the location occurring in the local context of the query. Two such documents are put into the same cluster if they contain the same organization or location names, and otherwise not. In our system, a location dictionary containing province-city information extracted from Wikipedia is used to identify location names. For example: 辽宁省 (沈阳 大连 铁岭 鞍山 …), 河北 (石家庄 唐山 秦皇岛 邯郸 邢台 …). Based on this dictionary, it is very easy to map a city to its corresponding province.

2.4.2 PND on the Sportsman Class

As with the journalist class, we use rule-based clustering techniques for disambiguating the sportsman class. The major difference is that topic features are used for PND on the sportsman class. If the topic of the given document is sports, the document is assigned to the sportsman class. The key is how to automatically identify the topic of the document containing the query. To address this challenge, we adopt a domain knowledge based technique for document topic identification. The basic idea is to use a domain knowledge dictionary, NEUKD, developed by our lab, which contains more than 600,000 domain associated terms and the corresponding domain features. Some domain associated terms defined in NEUKD are shown in Table 1.

Domain associated term | Domain feature concept
足球队 (football team) | Football, Sports
自行车队 (cycling team) | Traffic, Sports, cycling
中国象棋 (Chinese chess) | Sports, Chinese chess
执白 (white side) | Sports, the game of go
芝加哥公牛 (Chicago bulls) | Sports, basketball

Table 1: Six examples defined in the NEUKD

In the domain knowledge based topic identification algorithm, all domain associated terms occurring in the given document are first mapped into domain features such as football, basketball or cycling. The most frequent domain feature is considered the most likely topic. See Zhu and Chen (2005) for details. Two documents with the same topic can be grouped into the same cluster.
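The topic identification step of Section 2.4.2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `TOY_NEUKD` is a hypothetical five-entry stand-in for NEUKD built from the rows of Table 1, and the function names, the alphabetical tie-breaking and the optional `ignore` parameter (used here to skip the coarse "sports" feature) are assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical miniature stand-in for NEUKD: term -> domain features.
# The entries mirror Table 1; the real dictionary is far larger.
TOY_NEUKD = {
    "足球队": ["football", "sports"],
    "自行车队": ["traffic", "sports", "cycling"],
    "中国象棋": ["sports", "chinese chess"],
    "执白": ["sports", "go"],
    "芝加哥公牛": ["sports", "basketball"],
}


def identify_topic(tokens, dictionary=TOY_NEUKD, ignore=()):
    """Return the most frequent domain feature among the tokens, or None."""
    counts = Counter()
    for token in tokens:
        for feature in dictionary.get(token, []):
            if feature not in ignore:
                counts[feature] += 1
    if not counts:
        return None
    best = max(counts.values())
    # Assumption: ties are broken alphabetically to keep the result stable.
    return min(f for f, c in counts.items() if c == best)


def group_by_topic(docs, dictionary=TOY_NEUKD, ignore=()):
    """Group document ids by topic; docs maps a document id to its tokens."""
    clusters = {}
    for doc_id, tokens in docs.items():
        topic = identify_topic(tokens, dictionary, ignore)
        clusters.setdefault(topic, []).append(doc_id)
    return clusters
```

Passing `ignore={"sports"}` makes the sketch separate documents by sub-topic, which is the kind of distinction needed when several sportsmen share one name.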

Person name | Document no. | Sports
杨波 | 081 | 篮球 (basketball)
杨波 | 094 | 射箭 (archery)
杨波 | 098 | 射箭 (archery)
杨波 | 100 | 射箭 (archery)

Table 2: Examples of PND on Sportsman Class

2.4.3 Multi-Stage Clustering Framework

We propose a multi-stage clustering framework for PND on the common person name class, as shown in Figure 1.

In the multi-stage clustering framework, the first stage adopts a strict rule-based hard clustering algorithm using the feature set of personal attributes. The second stage applies constrained hierarchical agglomerative clustering using NE-type local features. The third stage applies hierarchical agglomerative clustering using BOW-type global features. By combining these techniques, we submitted the first system, named NEU_1.

Algorithm 1: Multi-stage Clustering Framework

Input: a person name pn, and its related document set D = {d1, d2, …, dm}, in which each document di contains the person name pn;
Output: clustering results C = {C1, C2, …, Cn}, where ∪i Ci = C and Ci ∩ Cj = Φ

For each di ∈ D do
    Si = {s | pn ∈ s, s ∈ di};
    ORGi = {t | t ∈ s, s ∈ Si, POS(t) = ni};
    PERi = {t | t ∈ s, s ∈ Si, POS(t) = nh};
    Ldi = {t | t ∈ s, s ∈ Si};    // local feature set
    Gdi = {t | t ∈ di};           // global feature set
    Ci = {di};
End for

Stage 1: Strict rules-based clustering
Begin
    For each Ci ∈ C do
        If ORGi ∩ ORGj ≠ Φ or |PERi ∩ PERj| ≥ 2
        Then Ci = Ci ∪ Cj;
             ORGi = ORGi ∪ ORGj;
             PERi = PERi ∪ PERj;
             Remove Cj from C;
    End for
End

Stage 2: Constrained hierarchical agglomerative clustering algorithm using local features
Begin
    Set each c ∈ C as an initial cluster;
    do
        [Ci, Cj] = argmax over Ci, Cj ∈ C of sim(Ci, Cj), where
        sim(Ci, Cj) = max over dx ∈ Ci, dy ∈ Cj of sim(dx, dy)
                    = max over dx ∈ Ci, dy ∈ Cj of cos(Ldx, Ldy)
        Ci = Ci ∪ Cj;
        Remove Cj from C;
    until sim(Ci, Cj) < θ.
End

Stage 3: Constrained hierarchical agglomerative clustering algorithm using global features, i.e., use the same algorithm as in stage 2, with the global feature set G instead of the local feature set L for the cosine-based similarity calculation.

Figure 1: Multi-stage Clustering Framework

2.4.4 The Second System

In addition, we submitted another PND system, named NEU_2, which uses the single-link hierarchical agglomerative clustering algorithm, in which the similarity of two clusters is the cosine similarity of their most similar members (Masaki et al., 2009, Duda et al., 2004). The difference between our two submitted systems NEU_1 and NEU_2 is the feature weighting method. The feature weighting used in NEU_2 assumes that words surrounding the query person name in the given document are more important features than those far away from it, and that person names and location names occurring in the context are more discriminative features than common words for PND. Accordingly, in the feature weighting scheme used in NEU_2, for each feature extracted from the sentence containing the query person name, the weight of a word-type feature with the POS of “ns”, “ni” or “nh” is 3, and otherwise 1.5. For the features extracted from other sentences, the weight of a word with the POS of “ns” or “nh” is 2, the weight of a word with the POS of “ni” is 1.5, and otherwise the weight is 1.0.

2.5 Final Result Generation

As discussed above, there are many modules for PND on Chinese person names. In NEU_1, the final results are produced by combining the outputs of the discard-class document clustering, journalist-class clustering, sportsman-class clustering and multi-stage clustering modules. In NEU_2, the outputs of the discard-class document clustering, journalist-class clustering, sportsman-class clustering and single-link clustering modules are combined to generate the final results.
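The three stages of Figure 1 can be sketched in a few lines of Python. This is a simplified reading rather than the submitted system: features are plain Python sets, the unspecified "constraints" of the constrained agglomerative stages are not modeled, and the document layout (`org`, `per`, `local`, `global` keys) and the parameter names `theta_local` and `theta_global` are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two binary feature sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def strict_rule_stage(docs):
    """Stage 1: merge clusters sharing an organization or >= 2 person names."""
    clusters = [{"ids": [d], "org": set(f["org"]), "per": set(f["per"])}
                for d, f in docs.items()]
    merged = True
    while merged:  # repeat until no pair of clusters satisfies the rule
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if a["org"] & b["org"] or len(a["per"] & b["per"]) >= 2:
                    a["ids"] += b["ids"]
                    a["org"] |= b["org"]
                    a["per"] |= b["per"]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return [c["ids"] for c in clusters]


def single_link_stage(clusters, features, theta):
    """Stages 2 and 3: single-link agglomerative clustering, stop below theta."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the most similar pair of members.
                sim = max(cosine(features[x], features[y])
                          for x in clusters[i] for y in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < theta:
            break
        i, j = pair
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


def multi_stage(docs, theta_local, theta_global):
    """Run stage 1, then single-link clustering on local and global features."""
    clusters = strict_rule_stage(docs)
    local = {d: f["local"] for d, f in docs.items()}
    glob = {d: f["global"] for d, f in docs.items()}
    clusters = single_link_stage(clusters, local, theta_local)
    return single_link_stage(clusters, glob, theta_global)
```

Each stage only ever merges clusters produced by the previous one, so high-precision evidence (shared organizations and co-occurring persons) is applied before the noisier bag-of-words similarity.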

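The NEU_2 feature weighting scheme of Section 2.4.4 can be illustrated with a short sketch. The weights are the ones stated in that section; everything else is an assumption made for illustration: sentences are lists of (word, POS) pairs, the query token itself is skipped, a word that occurs several times keeps its largest weight, and the function names are hypothetical.

```python
import math

NE_TAGS = {"ns", "ni", "nh"}  # POS tags of the named-entity features


def token_weight(pos, in_query_sentence):
    """Weight of one token under the NEU_2 scheme of Section 2.4.4."""
    if in_query_sentence:
        return 3.0 if pos in NE_TAGS else 1.5
    if pos in ("ns", "nh"):
        return 2.0
    if pos == "ni":
        return 1.5
    return 1.0


def weighted_vector(sentences, query):
    """Build a sparse weighted vector from sentences of (word, pos) pairs."""
    vec = {}
    for sentence in sentences:
        has_query = any(word == query for word, _ in sentence)
        for word, pos in sentence:
            if word == query:
                continue  # assumption: the query name itself is not a feature
            weight = token_weight(pos, has_query)
            # Assumption: a repeated word keeps its largest weight.
            vec[word] = max(vec.get(word, 0.0), weight)
    return vec


def weighted_cosine(u, v):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors built this way can be compared with `weighted_cosine` inside a single-link clustering loop, so that entities near the query name dominate the similarity.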
3 Evaluation

3.1 Experimental Settings

• Training data: about 30 Chinese person names, with a set of about 100-300 news articles provided for each person name.
• Test data: similar to the training data, containing 26 unseen Chinese personal names, provided by the SIGHAN organizer.
• Performance evaluation metrics (Artiles et al., 2009): B_Cubed and P_IP metrics.

3.2 Results

Table 3 shows the performance of our two submitted systems NEU_1 and NEU_2 on the test set of the Sighan2010 Chinese personal name disambiguation task.

System no. | B_Cubed P | B_Cubed R | B_Cubed F | P_IP P | P_IP IP | P_IP F
NEU_1 | 95.76 | 88.37 | 91.47 | 96.99 | 92.58 | 94.56
NEU_2 | 95.08 | 88.62 | 91.15 | 96.73 | 92.73 | 94.46

Table 3: Results on the test data

The NEU_1 system is implemented with the multi-stage clustering framework, which uses the single-link clustering method. In this framework, there are two threshold parameters, θ and µ. Both threshold parameters are tuned on the training data.

After the formal evaluation, the organizer provided a diagnosis test designed to explore the relationship between Chinese word segmentation and personal name disambiguation. In the diagnosis test, the personal name disambiguation task was simplified and limited to the documents in which the personal name is tagged correctly. The performance of our two systems on the diagnosis test set of the Sighan2010 Chinese personal name disambiguation task is shown in Table 4.

System no. | B_Cubed P | B_Cubed R | B_Cubed F | P_IP P | P_IP IP | P_IP F
NEU_1 | 95.6 | 89.74 | 92.14 | 96.83 | 93.62 | 95.03
NEU_2 | 94.53 | 89.99 | 91.66 | 96.41 | 93.8 | 94.9

Table 4: Results of the diagnosis test on test data

As shown in Table 3 and Table 4, the NEU_1 system achieves the highest precision and F values on the test data and the diagnosis test data.

3.3 Discussion

We propose a multi-stage clustering framework for Chinese personal name disambiguation. The evaluation results demonstrate that the features and key techniques our systems adopt are effective. Our systems achieve the best performance in this competition. However, our recall values are not satisfactory, so there is still much room for improvement. The experimental results raise some interesting issues to be addressed in our future work:

(1) For PND on some personal names, the document topic information does not seem effective. For example, the personal name “郭华 (Guo Hua)” in the training set represents one shooter and one billiards player. A PND system based on traditional clustering methods cannot work effectively in such a case, because both persons share the same sports topic. One solution is to sufficiently combine the personal attributes and the document topic information for PND on this person name.

(2) For the journalist-class personal names, global BOW-type features are not effective, as different persons can report on the same or similar events. For example, there are four different journalists named “Zhu Jianjun” in the training set, working in different locations such as Zhengzhou, Xining or Guangzhou. We can distinguish them in terms of the location they are working in.

(3) We found that some documents in the training set only contain lists of news titles and news reporters. In this case, we cannot discriminate the persons with respect to the location of the entire news article. It is worth studying effective solutions to this challenge in our future work.

(4) In the experimental results, some personal names such as “李刚 (Li gang)” are wrongly identified because this person is associated with multiple professional titles and affiliations. In this case, exact matching methods cannot yield satisfactory results. For example, the query name “李刚 (Li gang)” in documents 274 and 275 is the president of “中国对外文化交流协会 (China International Culture Association)”, while in documents 202, 225 and 228 he is the director of “文化部对外文化联络局 (Bureau of External Cultural Relations of Chinese Ministry of Culture)”. To group both cases into the same cluster, it is worth mining the underlying semantic relations between entities.

4 Conclusion

This paper presents our two Chinese personal name disambiguation systems, in which various constrained hierarchical agglomerative clustering algorithms using local or global features are adopted. The bakeoff results show that our systems achieve the best performance. In the future, we will pay more attention to personal attribute extraction and unsupervised learning approaches for Chinese personal name disambiguation.

5 Acknowledgements

This work was supported in part by the National Science Foundation of China (60873091) and the Fundamental Research Funds for the Central Universities.

References

Artiles, Javier, Julio Gonzalo and Satoshi Sekine. 2009. “WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task,” In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference.

Duda, Richard O., Peter E. Hart, and David G. Stork. 2004. Pattern Classification. China Machine Press.

Masaki, Ikeda, Shingo Ono, Issei Sato, Minoru Yoshida, and Hiroshi Nakagawa. 2009. Person Name Disambiguation on the Web by Two-Stage Clustering. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference.

Yao, Tianshun, Zhu Jingbo , Li, Ying. Nov. 2002. Natural Language Processing, Second Edition, Tsinghua press.

Zhu, Jingbo and Wenliang Chen. 2005. Some Studies on Chinese Domain Knowledge Dictionary and Its Application to Text Classification. In Proc. of SIGHAN4.