Research on Building Family Networks Based on Bootstrapping and Coreference Resolution
Jinghang Gu, Ya’nan Hu, Longhua Qian, and Qiaoming Zhu
Natural Language Processing Lab, Soochow University, Suzhou, Jiangsu, 215006
School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu, 215006
[email protected], {20114227025,qianlonghua,qmzhu}@suda.edu.cn
Abstract. Personal family networks are an important component of social networks, so extracting personal family relationships is of great importance. We propose a novel method to construct personal family networks based on bootstrapping and coreference resolution on top of a search engine. The method begins with seeds of personal relations to discover relational patterns in a bootstrapping fashion; personal relations are then extracted via these learned patterns; finally, family networks are fused using cross-document coreference resolution. Experimental results on the large-scale Gigaword corpus show that our method can build accurate family networks, thereby laying the foundation for social network analysis.
Keywords: Family Network, Social Network, Bootstrapping, Cross-Document Coreference Resolution.
1 Introduction
In recent years, social networks have become more and more important in people’s daily lives with the rapid development of social digitalization, and their analysis and application could help improve quality of life and efficiency. A family is the basic unit of human society, so the family network should be the core of a social network. Traditional Social Network Analysis (SNA) focuses on independent individuals and their functions, ignoring the influence of the whole family, which is indispensable in a social network. This paper starts by extracting family relationships and then builds rich personal family networks, laying the foundation for research on constructing large-scale social networks.
Social Network Analysis is an active topic in computer science and social science, and the construction of social networks is the basis of SNA. Early research on building social networks mainly exploited the co-occurrence of personal names, as in Referral Web and Flink [1, 2]. More recently, machine learning methods have come into fashion for mining social networks in specific domains, such as ArnetMiner [3], an academic social network; literary works [4, 5]; and character biographies [6]. Traditional social network construction takes independent individuals of interest as the central subject and mines mutual relationships between them, without acknowledging families as the core of the social network. In addition, it does not perform well because it handles the name ambiguity issue too simplistically.
In this paper, we extract personal relationships via bootstrapping, with the notion of families as the core of social networks in mind. We then aggregate these family relationships into family networks by addressing the problems of name variation and name ambiguity, i.e. Cross-Document Coreference Resolution (CDCR), using simple and effective methods. The performance of our work shows that it can construct family networks successfully from a large-scale Chinese text corpus.
This paper is structured as follows. After a brief survey of related work in Section 2, we describe our method of constructing personal family networks in Section 3. In Section 4 we present our evaluation measures. The results of the experiments are reported in Section 5. We discuss our findings, draw conclusions, and identify future work in Section 6.
G. Zhou et al. (Eds.): NLPCC 2013, CCIS 400, pp. 200–211, 2013. © Springer-Verlag Berlin Heidelberg 2013
2 Related Work
The primary task of constructing social networks is to extract social relationships between persons, which is a specific branch of semantic relation extraction. Semantic relation extraction is an important part of Natural Language Processing (NLP) and Information Extraction (IE), and its objective is to extract the semantic relations between named entities. When named entities are limited to persons, semantic relation extraction reduces to personal relation extraction.
Most research on relation extraction adopts machine learning approaches. In terms of the amount of manual annotation required, relation extraction methods can be divided into supervised learning [7], weakly supervised learning [8], and unsupervised learning [9]. Usually both the quantity and the quality of the annotated corpus determine the performance of relation extraction. As a weakly supervised learning method, bootstrapping has come into fashion because it demands little manual annotation and can mine a large number of instances from only a small set of initial seeds. Hearst [10] took the lead in using bootstrapping to extract the hyponym relation (is-a). The approach starts with several seeds and iteratively discovers more patterns and, in turn, more instances. Pantel et al. [11] proposed a bootstrapping system, Espresso, which is based on the framework of Hearst [10]. Espresso effectively resolves the problem of calculating the confidence of patterns and instances when extracting relations from a large-scale corpus such as the Web. Yao et al. [12] and Gan et al. [13] took advantage of the redundant information in Web pages to discover personal relations by employing a simulated annealing algorithm on top of bootstrapping. Peng et al. [14] explored a tree kernel-based method for personal relation extraction, expanding personal relations to static ones (family relations and business relations) and dynamic ones (personal interactive relations).
In the field of social network construction, early research leveraged the statistics of name co-occurrence in Web pages to extract personal relationships. Kautz et al. [1] proposed a system named Referral Web, which automatically mines social networks based on personal name co-occurrence. Mika et al. [2] adopted the same strategy as Referral Web, except that personal emails were used as an additional data source besides Web pages. Recent research has turned to machine learning methods in order to extract more types of social relations. Tang et al. [3] proposed the ArnetMiner system, which builds social networks among academic scholars using SVM and CRF classifiers to classify the relations. Elson et al. [4] and Agarwal et al. [5] discussed social networks in literary works. They proposed the notion of implicit social relations among different characters participating in the same event, such as Interaction and Observation. Camp and Bosch et al. [6] extracted personal relations that entail emotional polarity and built social networks with SVM classifiers. It is worth noting that the StatSnowball system (Zhu et al. [15]) extracts social relations by bootstrapping in conjunction with probability models and Markov Logic Networks. StatSnowball achieves promising performance on constructing social networks from a large-scale Web corpus.
Although our paper also uses bootstrapping to mine social networks, we take a different perspective from previous work. We extract personal social relations with the family as the unit from a large-scale text corpus. Furthermore, we investigate the issue of person coreference resolution in constructing family networks, laying the foundation for building large-scale social networks.
3 Personal Family Network Construction
There are three phases in constructing family networks from a large-scale corpus: Personal Family Relation Extraction (PFRE), Cross-Document Coreference Resolution (CDCR), and Family Network Aggregation (FNA). Family relations are the ones between persons in the same family, including husband-wife, father-son, mother-son, father-daughter, mother-daughter, brotherhood, sisterhood, etc. Cross-Document Coreference Resolution attempts to group personal names from different documents into equivalence classes referred to as “coreference chains”, with each chain representing the same person. The task of aggregating personal family networks is to cluster persons having family relationships into the same family network. After the first two phases of relation extraction and coreference resolution, family network construction becomes straightforward.
3.1 Personal Family Relation Extraction
In this phase, we adopt a minimally supervised learning method, i.e. bootstrapping, to reduce the amount of training data required. Taking several seed instances of a particular relation type as input, our algorithm iteratively learns surface patterns to extract more instances, and vice versa. When extracting patterns and instances, we calculate their reliability scores and retain only those with high reliability.
In terms of the importance of family relations, we define two major types of family relationships, i.e. “Parent-Child” and “Husband-Wife”. We further divide “Parent-Child” into four subtypes, namely “Father-Son”, “Father-Daughter”, “Mother-Son” and “Mother-Daughter”, to facilitate the bootstrapping procedure. In total, there are five types of family relations available for bootstrapping. However, the four subtypes are combined into one just before the phase of Family Network Aggregation.
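To make the subtype merging and the aggregation phase concrete, the following minimal Python sketch collapses the four “Parent-Child” subtypes into one type and groups persons into family networks. It rests on assumptions that the text does not state: a family network is treated as a connected component of the relation graph, relations are given as (person, relation type, person) tuples, and person identifiers are taken to be already resolved by CDCR. The function names are hypothetical and the sketch should not be read as the procedure used in this paper.

```python
from collections import defaultdict

# The four subtypes named in Section 3.1; they are merged before aggregation.
PARENT_CHILD_SUBTYPES = {
    "Father-Son", "Father-Daughter", "Mother-Son", "Mother-Daughter",
}


def collapse_relation(rel_type):
    """Map each Parent-Child subtype to the single major type."""
    return "Parent-Child" if rel_type in PARENT_CHILD_SUBTYPES else rel_type


def aggregate_families(relations):
    """Group relation triples into family networks.

    Assumption (not from the paper): a family network is a connected
    component of the graph whose edges are the extracted relations.
    `relations` holds (person_a, relation_type, person_b) triples whose
    person ids were already resolved by coreference resolution.
    """
    relations = list(relations)
    parent = {}  # union-find forest over person ids

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # First pass: link every pair of related persons.
    for a, _, b in relations:
        union(a, b)

    # Second pass: collect edges per component, with subtypes collapsed.
    families = defaultdict(set)
    for a, rel, b in relations:
        families[find(a)].add((a, collapse_relation(rel), b))
    return [sorted(edges) for edges in families.values()]


if __name__ == "__main__":
    triples = [
        ("A", "Father-Son", "B"),
        ("A", "Husband-Wife", "C"),
        ("D", "Mother-Daughter", "E"),
    ]
    for family in aggregate_families(triples):
        print(family)
```

Under the connected-component assumption, persons A, B and C end up in one family network and D and E in another. Any stricter definition of a family would need a different grouping rule.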
We take Espresso as our prototype system, which includes the following four steps: