Chinese Personal Name Disambiguation Based on Vector Space Model

Qing-hu Fan
College of Information Engineering, Zhengzhou University, Zhengzhou, Henan, China
[email protected]

Hong-ying Zan, Yu-mei Chai, Yu-xiang Jia
College of Information Engineering, Zhengzhou University, Zhengzhou, Henan, China
{iehyzan, ieymchai, ieyxjia}@zzu.edu.cn

Gui-ling Niu
Foreign Languages School, Zhengzhou University, Zhengzhou, Henan, China
[email protected]

Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 152–158, Tianjin, China, 20-21 DEC. 2012

Abstract

This paper describes the system with which the Natural Language Processing Laboratory of Zhengzhou University took part in the Chinese personal name disambiguation task of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP) 2012. We mainly use the Vector Space Model to disambiguate Chinese personal names. We extract different named entity features from the information available about each name and weight each feature according to its importance. We first classify all the name documents, and then cluster the documents that cannot be mapped to any of the defined names. Finally, the classification and clustering results are combined. In the test corpus experiments, the accuracy rate is 0.6778, the recall rate is 0.7205 and the F value is 0.6985 for all names.

1 Introduction

Named entities are the fundamental information elements in text and the basis for understanding a text correctly. Named entities include person names, organization names, place names, times, dates, and numeric expressions. Named Entity Recognition identifies the entities in a text and determines the category of each. A name can be ambiguous. For example, 方正 fang-zheng ‘Fang Zheng’ may be an associate professor at the Department of Mechanical and Electrical Engineering, Physics and Electrical and Mechanical Engineering College of Xiamen University, or it may be Peking University Founder Group Corp, which was established by Peking University. Disambiguating the entity Fang Zheng requires context. For example, the Fang Zheng who is an associate professor at Xiamen University can be identified by features such as Xiamen University, Mechanical and Electrical Engineering College, or associate professor, which eliminate the ambiguity.

2 Related Research

In the early stages of Named Entity Disambiguation, Bagga and Baldwin (1998) used the Vector Space Model to resolve ambiguities between people having the same personal name. Han and Zhao (2010) proposed a knowledge-based method that captures structural semantic knowledge in multiple knowledge sources to disambiguate personal entities. Han and Sun (2011) proposed a generative Entity-Mention model that leverages heterogeneous entity knowledge for the entity linking task. For Chinese personal name disambiguation, Li et al. (2010) organized the first conference on Chinese Language Processing (CLP-2010), which included a Chinese personal name disambiguation task. In that task, Shi et al. (2011) proposed a post-processing method based on the integration of multiple entity recognition systems and heuristic rules, and Zhang et al. (2010) proposed a method that extracts various person features to identify different person names and, on top of Chinese word segmentation, uses hand-crafted rules to identify the names correctly.

We propose a method that recognizes various entities, assigns initial weights to the features that are common characteristics of different names, and then compares documents with the Vector Space Model. Finally, the documents that cannot be mapped to any name defined in the knowledge base are clustered into different groups.

The CLP2012 named entity disambiguation task is a classification problem followed by a clustering problem. The task provides a knowledge base of Chinese names, which contains multiple definitions for each personal name, and a set of documents that mention these names. The purpose of the task is to link each name that appears in a document to the corresponding definition in the knowledge base, and to cluster the documents that cannot be linked to any definition so that documents sharing the same named entity features fall into the same cluster.

Task input: the knowledge base of names, and the text set corresponding to each name.

Task output: if the name in a text links to a definition in the knowledge base, output the corresponding id; if the name in a text is an ordinary word, output “other”; if the name belongs to neither of these two kinds, output the number Out_XX of the cluster to which the text has been assigned.

This paper is organized as follows. Section 3 introduces the method for extracting the named entities related to figures. Section 4 introduces the calculation model for the named entities. Section 5 describes the experiments and results. The last section presents conclusions and future work.

3 Extract the Named Entities Related to Figures

3.1 Character works

Works are original intellectual creations in the fields of literature and science that can be copied in a certain physical form.[1] Works include literature, music, drama, folk art forms, dance, photographs, films, television, video works, etc. Character works are a significant characteristic for identifying figures. In the evaluation corpus, a character work generally corresponds to one specific character. Character works therefore play an important role in name disambiguation.

Extraction method: we extract the character works from each figure's corpus; in other words, we extract all the contents of quotation marks.

Format the character works:

1) If 之 Zhi appears in the work, we split the work at 之 Zhi.

For example:
白云(孙皓晖先生的长篇小说《大秦帝国之黑色裂变》中所虚构的女主角)
Bai-Yun(Sun-hao-hui-xian-sheng-de-chang-pian-xiao-shuo-da-qin-di-guo-zhi-hei-se-lie-bian-zhong-suo-xu-gou-de-nv-zhu-jiao)
‘Bai Yun (the fictional heroine of Mr. Sun Haohui's novel Daqin Empire with The Black Fission)’

We extract the work 大秦帝国之黑色裂变 da-qin-di-guo-zhi-hei-se-lie-bian ‘Daqin Empire with The Black Fission’, but the work cannot be identified as it stands, because 大秦帝国之黑色裂变 is only the first novel of 大秦帝国 da-qin-di-guo ‘Daqin Empire’.[2] We split it at 之 Zhi into 大秦帝国 da-qin-di-guo ‘Daqin Empire’ and 黑色裂变 hei-se-lie-bian ‘The Black Fission’, and then both can be identified correctly.

2) If the length of the work's name is less than 2, the work must be extracted together with its quotation marks.

Eg: 马啸担任河南卫视《旅游》栏目主持人.
Ma-xiao-dan-ren-he-nan-wei-shi-lv-you-lan-mu-zhu-chi-ren
‘Ma Xiao is appointed host of the Traveling program on Henan TV’

In this sentence 旅游 lv-you ‘Traveling’ is the work name. The word Traveling can be a verb or a noun; here it names a TV program and is a noun. Taking the bare word Traveling as a feature would reduce the accuracy rate.

[1] http://baike.baidu.com/view/94574.htm
[2] http://baike.baidu.com/view/525001.htm

3.2 Character Aliases

Aliases are names other than the formal or standard name. They are used in both writing and speech.[3] Character aliases are an essential feature for disambiguation. We define each filename in the KB folder as the figure's original name; all other names are character aliases. We use pattern-matching methods to extract character aliases (Lu and Hou, 2006), as follows:

1) Synonymy keywords + Synonyms + End identifier

Synonymy keywords: 本名|别号|，字|^(字)|，号|^(号)|又号|^(名)|笔名|自号|又名|乳名|别名|原名|艺名|本名|曾用名|俗称|亦称|又称. The symbol “|” means alternation, and “^” matches the beginning of the string.

End identifier: it marks the end of the synonym to extract. The end signs are always the comma (， or ,) and the full stop (。 or .). If the extracted character alias is equal to the original name, we use the reversed pattern instead, in which the synonym precedes the synonymy keyword.

Eg: 白云(原名杨维汉，广东省潮安县人)
Bai-Yun(Yuan-ming-yang-wei-han-guang-dong-sheng-chao-an-xian-ren)
‘Bai Yun (her original name is Yang Weihan and she was born in Chao'an County, Guangdong Province)’

Eg1: the content for 白雪 Bai Xue is 白百何，中国内地女演员，别名白雪 Bai-bai-he-zhong-guo-nei-di-nv-yan-yuan-bie-ming-bai-xue ‘Bai Baihe is a Chinese mainland actress and her alias is Bai Xue’. In this sentence the family name of Bai Xue is Bai and the end identifier is “，”, so the first method extracts 白百何 as the character alias.

Eg2: the content for 白雪 Bai Xue is 陈大威，号白雪，碧松斋主人 chen-da-wei-hao-bai-xue-bi-song-zhai-zhu-ren ‘Chen Dawei, whose art-name is Bai Xue, is the master of Bi-Song-Zhai’. In this sentence the family name of Bai Xue is Bai, and we know that 陈大威 Chen Dawei is the character alias, but the family name Bai is different from 陈 Chen. Therefore, according to the second method, we use the family name Chen. The end identifier is “，”, so we extract Chen Dawei as the character alias.

[3] http://baike.baidu.com/view/343250.htm

3.3 Named Entity

Named entities are features for discriminating between figures. The features related to a figure, such as learning unit, organizations, living space, and other entities, can distinguish different figures.
