Research on Building Family Networks Based on Bootstrapping and Coreference Resolution

Jinghang Gu, Ya’nan Hu, Longhua Qian, and Qiaoming Zhu

Natural Language Processing Lab, Soochow University, Suzhou, Jiangsu 215006
School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006
[email protected], {20114227025,qianlonghua,qmzhu}@suda.edu.cn

Abstract. The personal family network is an important component of social networks; therefore, how to extract personal family relationships is of great importance. We propose a novel method to construct personal family networks based on bootstrapping and coreference resolution on top of a search engine. It begins with seed instances of personal relations to discover relational patterns in a bootstrapping fashion, then further personal relations are extracted via the learned patterns, and finally family networks are fused using cross-document coreference resolution. Experimental results on the large-scale Gigaword corpus show that our method can build accurate family networks, thereby laying the foundation for social network analysis.

Keywords: Family Network, Social Network, Bootstrapping, Cross-Document Coreference Resolution.

1 Introduction

In recent years, social networks have become increasingly important in daily life with the rapid development of social digitalization, and their analysis and application can help improve quality of life and efficiency. The family is the basic unit of human society, so family networks should be at the core of social networks. Traditional Social Network Analysis (SNA) focuses on independent individuals and their roles, ignoring the influence of the family as a whole, which is indispensable in a social network. This paper starts by extracting family relationships and then builds rich personal family networks, laying the foundation for constructing large-scale social networks. Social Network Analysis is an active topic in computer science and social science, and the construction of social networks is the basis of SNA. Early research on building social networks mainly exploited the co-occurrence of personal names, such as Referral Web and Flink [1, 2]. More recently, machine learning methods have come into fashion for mining social networks in specific domains, such as academic networks (ArnetMiner [3]), literary works [4, 5], and character biographies [6].

Traditional social network construction takes independent individuals of interest as the central subjects and mines the mutual relationships between them, without acknowledging families as the core of the social network. In addition, it does not perform well due to its over-simplified handling of the name ambiguity issue. In this paper, we extract personal relationships via bootstrapping, with the notion of families as the core of social networks in mind. We then aggregate these family relationships into family networks by addressing the problems of name variation and name ambiguity, i.e. Cross-Document Coreference Resolution (CDCR), using simple yet effective methods. Our results show that the method can successfully construct family networks from a large-scale Chinese text corpus. This paper is structured as follows. After a brief survey of related work in Section 2, we describe our method for constructing personal family networks in Section 3. In Section 4 we present our evaluation measures. The experimental results are reported in Section 5. We discuss our findings, draw conclusions, and identify future work in Section 6.

2 Related Work

The primary task in constructing social networks is to extract social relationships between persons, which is a specific branch of semantic relation extraction. Semantic relation extraction is an important part of Natural Language Processing (NLP) and Information Extraction (IE), whose objective is to extract semantic relations between named entities. When the named entities are limited to persons, semantic relation extraction reduces to personal relation extraction. Most research on relation extraction adopts machine learning approaches. According to the amount of annotation required, relation extraction methods can be divided into supervised learning [7], weakly supervised learning [8], and unsupervised learning [9]. Usually both the quantity and quality of the annotated corpus determine the performance of relation extraction. As a kind of weakly supervised learning, bootstrapping has come into fashion because it demands little manual annotation and can mine a large number of instances starting from only a small set of seeds. Hearst [10] took the lead in using bootstrapping to extract the hyponym (is-a) relation: starting with several seeds, it discovers more patterns and then more instances in an iterative manner. Pantel et al. [11] proposed a bootstrapping system, Espresso, based on the framework of Hearst [10]; Espresso effectively addresses the problem of calculating the confidence of patterns and instances when extracting relations from a large-scale corpus such as the Web. Yao et al. [12] and Gan et al. [13] took advantage of the redundant information in Web pages to discover personal relations by employing a simulated annealing algorithm on top of bootstrapping. Peng et al. [14] explored a tree kernel-based method for personal relation extraction, dividing personal relations into static ones (family relations and business relations) and dynamic ones (personal interactive relations).


In the field of social network construction, early research leveraged the statistics of name co-occurrence in web pages to extract personal relationships. Kautz et al. [1] proposed a system named Referral Web, which automatically mines social networks based on personal name co-occurrence. Mika [2] adopted the same strategy as Referral Web, except for using personal emails as an additional data source besides Web pages. Recent research has turned to machine learning methods in order to extract more types of social relations. Tang et al. [3] proposed the ArnetMiner system to build social networks among academic scholars, using SVM and CRF classifiers to classify the relations. Elson et al. [4] and Agarwal et al. [5] studied social networks in literary works; they proposed the notion of implicit social relations among characters participating in the same event, such as Interaction and Observation. van de Camp and van den Bosch [6] extracted personal relations that entail emotional polarity and built social networks with SVM classifiers. It is also worth noting the StatSnowball system (Zhu et al. [15]), which extracts social relations by bootstrapping in conjunction with probabilistic models and Markov Logic Networks, and achieves promising performance in constructing social networks from a large-scale Web corpus. Although this paper also chooses bootstrapping to mine social networks, we take a different perspective from previous work: we extract personal social relations with the family as the unit from a large-scale text corpus, and we further investigate the issue of person coreference resolution in constructing family networks, laying the foundation for building large-scale social networks.

3 Personal Family Network Construction

There are three phases in the procedure of constructing family networks from a large-scale corpus: Personal Family Relation Extraction (PFRE), Cross-Document Coreference Resolution (CDCR) and Family Network Aggregation (FNA). Family relations are relations between persons in the same family, including husband-wife, father-son, mother-son, father-daughter, mother-daughter, brotherhood, sisterhood, etc. Cross-Document Coreference Resolution attempts to group personal names from different documents into equivalence classes referred to as “coreference chains”, with each chain representing one person. The task of aggregating personal family networks is to cluster persons having family relationships into the same family network. After the first two phases of relation extraction and coreference resolution, family network construction becomes straightforward.

3.1 Personal Family Relation Extraction

In this phase, we adopt a minimally supervised learning method, i.e. bootstrapping, in order to reduce the amount of training data required. Taking several seed instances of a particular relation type as input, our algorithm iteratively learns surface patterns to extract more instances and vice versa. When extracting patterns and instances, we calculate their reliability scores and retain the ones with high reliability.


In terms of the importance of family relations, we define two major types of family relationships, i.e. “Parent-Child” and “Husband-Wife”. To facilitate the bootstrapping procedure, we further divide “Parent-Child” into four subtypes: “Father-Son”, “Father-Daughter”, “Mother-Son” and “Mother-Daughter”. In total, five types of family relations are available for bootstrapping; the four subtypes are combined back into one just before the Family Network Aggregation phase. We take Espresso as our prototype system, which includes the following four steps:

Pattern Induction

In order to quickly search the whole corpus for patterns and instances, we preprocess the corpus and load it into the Solr1 search engine. For a person pair {x, y} of a certain relation type Ri, we submit both x and y as keywords to Solr and obtain the text of the corresponding pages; then we extract the patterns co-occurring with {x, y}. For instance, given the seed “江泽民, 王冶坪” (Jiang Zemin, Wang Yeping) of the “Husband-Wife” relation type, we can acquire the text “中国国家主席江泽民的夫人王冶坪今天下午……” (This afternoon Chinese President Jiang Zemin's wife Wang Yeping…). We select the string between the two person names as a candidate pattern; here we obtain the pattern “的夫人” ('s wife) for the “Husband-Wife” relation.
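To make the step concrete, the following is a minimal sketch of the candidate-pattern extraction described above, assuming the sentences matching a seed pair have already been retrieved from the search engine; the function name, the length cut-off and the toy example are illustrative only and not part of the original system.

```python
def extract_candidate_patterns(sentences, x, y, max_len=10):
    """Collect the strings that appear between a seed person pair (x, y)."""
    patterns = []
    for sent in sentences:
        i, j = sent.find(x), sent.find(y)
        if i == -1 or j == -1:
            continue
        # take the text between the two names, whichever order they occur in
        start, end = (i + len(x), j) if i < j else (j + len(y), i)
        middle = sent[start:end]
        # keep only short connective strings such as "的夫人" ('s wife)
        if 0 < len(middle) <= max_len:
            patterns.append(middle)
    return patterns

sents = ["中国国家主席江泽民的夫人王冶坪今天下午出席了活动。"]
print(extract_candidate_patterns(sents, "江泽民", "王冶坪"))  # ['的夫人']
```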

Pattern Ranking/Selection

We use Pointwise Mutual Information (PMI) [16] to measure the strength of association between a pattern p and an instance i: the higher the PMI value, the stronger the association. Here we adopt the PMI formulation suggested by Pantel and Pennacchiotti [11]. A well-known issue is that PMI is biased towards infrequent events, so we multiply the PMI value by the discounting factor suggested in (Pantel and Ravichandran [17]). We define the reliability score of a pattern p, r_π(p), as the average strength of association across all instances, with each instance weighted by the product of its reliability score and the PMI value between the instance and the pattern:

\[
r_\pi(p) = \frac{\sum_{i \in I} \frac{\mathrm{pmi}(i,p)}{\max_{\mathrm{pmi}}} \times r_\iota(i)}{|I|} \tag{1}
\]

where r_ι(i) is the reliability score of an instance. The reliability score of each seed instance is 1.0, and for instances discovered in subsequent iterations it is calculated as described in Formula (2). After calculating all patterns' reliability scores, we carry the top 10% of patterns over to the next iteration.
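As an illustration, the pattern score of Formula (1) could be computed along the following lines; the PMI values are assumed to be precomputed (and, if desired, already multiplied by the discounting factor), and the function and variable names are illustrative rather than taken from the original implementation.

```python
import math

def pmi(freq_ip, freq_i, freq_p, total):
    """A standard pointwise mutual information estimate from co-occurrence
    counts; the discounting factor mentioned above is omitted here."""
    return math.log((freq_ip * total) / (freq_i * freq_p))

def pattern_reliability(p, instances, inst_rel, pmi_table, max_pmi):
    """Formula (1): average association between pattern p and the instances,
    each instance weighted by its own reliability score inst_rel[i]."""
    score = sum((pmi_table[(i, p)] / max_pmi) * inst_rel[i] for i in instances)
    return score / len(instances)
```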

1 http://lucene.apache.org/solr/


Instance Induction

In this phase, we retrieve from the corpus the set of instances that match any of the patterns acquired in the previous iteration. We submit the patterns as keywords to Solr and extract the text of the corresponding pages. For each sentence containing a pattern, we use ICTCLAS2 to perform Chinese word segmentation and named entity recognition. If two person names surround the pattern, we take them as an instance. For example, given the pattern “的夫人” ('s wife), we submit it as a keyword to Solr and obtain a sentence like “中国国务院总理李鹏的夫人朱琳上午来到……” (In the morning, Chinese Premier Li Peng's wife Zhu Lin came to…). After word segmentation and named entity recognition, a new instance “李鹏, 朱琳” (Li Peng, Zhu Lin) is obtained.
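A sketch of this instance-induction step might look as follows; it assumes that word segmentation and named entity recognition have already been performed (e.g. with ICTCLAS) and that the person-name spans of each sentence are available. The data layout and function name are illustrative assumptions.

```python
def induce_instances(sentences, pattern, person_spans):
    """Return (person1, person2) pairs whose names directly surround the pattern.

    person_spans[s] is a list of (start, end, name) tuples for the person
    names recognized in sentences[s].
    """
    instances = []
    for s, sent in enumerate(sentences):
        pos = sent.find(pattern)
        if pos == -1:
            continue
        # person name ending exactly where the pattern starts ...
        left = [n for (b, e, n) in person_spans[s] if e == pos]
        # ... and person name starting exactly where the pattern ends
        right = [n for (b, e, n) in person_spans[s] if b == pos + len(pattern)]
        if left and right:
            instances.append((left[0], right[0]))
    return instances

sent = "中国国务院总理李鹏的夫人朱琳上午来到厂区"
spans = [[(7, 9, "李鹏"), (12, 14, "朱琳")]]
print(induce_instances([sent], "的夫人", spans))  # [('李鹏', '朱琳')]
```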

Instance Ranking/Selection

We define the reliability score of an instance i, r_ι(i), as its average strength of association across all patterns, with each pattern weighted by the product of its reliability score and the PMI value between the pattern and the instance:

\[
r_\iota(i) = \frac{\sum_{p \in P} \frac{\mathrm{pmi}(i,p)}{\max_{\mathrm{pmi}}} \times r_\pi(p)}{|P|} \tag{2}
\]

After calculating all instances' reliability scores, we feed the top 15% of instances to the next iteration. After obtaining all the patterns of the above five types of family relations by bootstrapping, we merge the “Father-Son”, “Father-Daughter”, “Mother-Son” and “Mother-Daughter” relations into the “Parent-Child” relation and integrate their subtype patterns where possible. For example, the pattern “的父亲” ('s father) can be discovered from the “Father-Son” relation and also from the “Father-Daughter” relation; these two identical patterns are combined into one pattern “的父亲” ('s father). Finally, we search the corpus again to acquire all the instances that match the patterns of the two relation types “Parent-Child” and “Husband-Wife”, and take these instances as the input to Family Network Aggregation. The reason for doing so is that, on the one hand, the patterns are highly reliable, so the instances induced by them are highly reliable as well; on the other hand, this helps improve the recall of the system.
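By symmetry with the pattern score, the instance score of Formula (2) and the per-iteration selection could be sketched as follows; the 10%/15% cut-offs come from the text, while the function names and data structures are illustrative assumptions.

```python
def instance_reliability(i, patterns, pat_rel, pmi_table, max_pmi):
    """Formula (2): average association between instance i and the patterns,
    each pattern weighted by its own reliability score pat_rel[p]."""
    score = sum((pmi_table[(i, p)] / max_pmi) * pat_rel[p] for p in patterns)
    return score / len(patterns)

def select_top(scores, fraction):
    """Keep the highest-scoring fraction of items (top 10% of patterns and
    top 15% of instances in each iteration, as stated above)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [item for item, _ in ranked[:keep]]
```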

3.2 Personal Family Network Aggregation Based on Coreference Resolution

After finishing Personal Family Relation Extraction, we can start aggregating family networks, which means assembling persons into families while merging their redundant relations.

2 http://www.ictclas.org/

A family under consideration in this paper should contain no fewer than three valid persons and two valid relationships. When assembling family networks, an inevitable issue is the complexity of cross-document coreference. Consider the following example sentences:

(a) “卡恩的妻子西蒙娜却站出来维护自己的丈夫,说……” (But Kahn's wife Simone stood up for her husband and said…)
(b) “巴博的妻子西蒙娜当日早些时候在执政党明确表示,法国并没有在……” (Earlier that day, as a representative of the ruling party, Gbagbo's wife Simone made it clear that France did not…)
(c) “尼日利亚国家元首阿巴查的夫人玛丽亚姆・阿巴查26日在接受本社记者采访……” (Mariam Abacha, the wife of the Nigerian head of state Abacha, gave an interview to our agency on the 26th…)
(d) “尼日利亚国家元首阿巴查和夫人玛利亚姆・阿巴查、外交部长……” (The Nigerian head of state Abacha, his wife Mariam Abacha, the Foreign Minister…)
(e) “这些人员中主要包括阿巴查的儿子穆罕默德・阿巴查和商人……” (These individuals mainly included Abacha's son Mohammed Abacha and a merchant…)

These five Chinese example sentences come from five different documents. The person named “西蒙娜” (Simone) in (a) is not the person named “西蒙娜” (Simone) in (b), so the “Husband-Wife” relationships in (a) and (b) are different, whereas “玛丽亚姆・阿巴查” (Mariam Abacha) in (c) is obviously the same person as “玛利亚姆・阿巴查” (Mariam Abacha) in (d), so the relations in (c) and (d) are the same. Thus the persons in (a) and (b) cannot constitute a family, while those in (c), (d) and (e) can. From these examples we can see that it is critical to address the problems of name ambiguity and name variation in Family Network Aggregation. The former means that the same name can refer to multiple persons in different documents; the latter means that the same person can be referred to by different names in different documents. Therefore, during Family Network Aggregation it is necessary to separate occurrences of the same name that refer to different persons into their respective coreference chains, i.e. name disambiguation, as well as to group the various mentions of one person into a single coreference chain, i.e. name variation clustering.

Cross-Document Coreference Resolution

General techniques for CDCR include unsupervised clustering [18] and supervised classification [19]. The features they adopt include local context word features, global word features, global named entity features, entity semantic features and so on. In particular, the relationship between persons often proves to be a more powerful feature than the others, so it should be taken into full consideration for CDCR. We follow the principle that if two instances (each containing a pair of persons) discovered in different documents have the same names and the same relationship, they are regarded as the same instance and thus merged.


In this paper, we follow a simple yet effective method of name string matching to aggregate multiple persons into the same family. The processing steps are as follows:

1. Exact Name Matching (ENM): compare the person names involved in all instances and regard those with exactly the same name as one person (see the sketch after this list). These persons serve as the linking points between families, through which the initial families are formed. It is worth noting that different persons may share the same name, in which case this step merges them into the same family.
2. Name Disambiguation (ND): remove namesakes, i.e. different persons with the same name, from the initial families. We adopt global document entity features to compute the cosine similarity between occurrences of the same name in different documents, and discard those whose similarity score is below an empirical threshold.
3. Name Variation Aggregation (NVA): within a family, we use the minimum Levenshtein distance to compute the similarity between two names. If the similarity score between two names is above an empirically set threshold, the two names are merged into the same coreference chain. As such, family networks are generated.
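The following is a minimal sketch of step 1 (Exact Name Matching): instances sharing an identical person name string are linked into the same initial family with a union-find structure. Steps 2 and 3 would then refine these groups; the class and function names are illustrative and not taken from the original system.

```python
class UnionFind:
    """A small union-find (disjoint-set) helper for linking names."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def initial_families(instances):
    """instances: list of (name1, name2, relation) triples; identical name
    strings act as the linking points between instances."""
    uf = UnionFind()
    for n1, n2, _ in instances:
        uf.union(n1, n2)
    families = {}
    for n1, n2, _ in instances:
        families.setdefault(uf.find(n1), set()).update([n1, n2])
    return list(families.values())

pairs = [("阿巴查", "玛丽亚姆・阿巴查", "Husband-Wife"),
         ("阿巴查", "穆罕默德・阿巴查", "Parent-Child")]
print(initial_families(pairs))  # one initial family containing all three names
```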

Name Disambiguation and Name Variation Aggregation

For Name Disambiguation, the similarity score between two names is computed as the cosine between their corresponding feature vectors. We take the names of entities that co-occur with the target name as its features. Only if the similarity score is above the empirical threshold do we regard the two names as referring to the same person. The score is calculated as follows:

\[
\cos(name_1, name_2) = \frac{\sum_i w_{1i} \cdot w_{2i}}{\sqrt{\sum_i w_{1i}^2} \cdot \sqrt{\sum_i w_{2i}^2}} \tag{3}
\]

where w_{1i} and w_{2i} are the weights of the feature t_i shared by name_1 and name_2. The frequency of an entity in the document is taken as its weight. Named entities are recognized by ICTCLAS, including person names tagged as “-nr”, geographic names tagged as “-ns”, organization names tagged as “-nt”, etc.

For Name Variation Aggregation inside a family, we use the minimum Levenshtein distance to merge name variations. The minimum Levenshtein distance (a.k.a. edit distance) measures the similarity between two names by the number of edit operations needed to transform one string into the other. There are usually three operations, i.e. insertion, deletion and replacement, each of which is assigned an operation cost; we use costs of 0.5 for insertion, 1.5 for deletion and 2.0 for replacement.
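For illustration, the cosine score of Formula (3) and a weighted edit distance with the costs given above might be implemented as follows; the feature vectors are assumed to map co-occurring entity names to their document frequencies, and the assignment of each cost to an edit direction is one possible interpretation rather than the original specification.

```python
import math

def cosine(vec1, vec2):
    """Formula (3): cosine similarity between two entity-frequency vectors
    (dictionaries mapping co-occurring entity names to their frequencies)."""
    shared = set(vec1) & set(vec2)
    num = sum(vec1[t] * vec2[t] for t in shared)
    den = math.sqrt(sum(w * w for w in vec1.values())) * \
          math.sqrt(sum(w * w for w in vec2.values()))
    return num / den if den else 0.0

def weighted_edit_distance(a, b, ins=0.5, dele=1.5, sub=2.0):
    """Minimum Levenshtein distance with per-operation costs
    (0.5 insertion, 1.5 deletion, 2.0 replacement)."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + dele          # delete all characters of a
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins           # insert all characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,
                          d[i][j - 1] + ins,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

print(weighted_edit_distance("玛丽亚姆・阿巴查", "玛利亚姆・阿巴查"))  # 2.0
```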


It should be pointed out that a few instances belong to the same family yet cannot be fused into one by our method because of diverse name expressions. For example, in the “Parent-Child” relation type, consider the instances “苏哈多, 哈迪扬蒂” (Suharto, Hardiyanti) and “苏哈托, 鲁克马纳” (Suharto, Rukmana): the names “苏哈多” (Suharto) and “苏哈托” (Suharto) actually refer to the same person, i.e. the former Indonesian president “苏哈托” (Suharto), and the names “哈迪扬蒂” (Hardiyanti) and “鲁克马纳” (Rukmana) also refer to the same person, i.e. the daughter of “苏哈托” (Suharto), whose full name is “西蒂·哈迪扬蒂·鲁克马纳” (Siti Hardiyanti Rukmana). Nevertheless, due to the diversity of their name expressions, the two instances cannot be linked by Exact Name Matching, let alone used to construct the family thereafter.

4 System Performance Evaluation

There are three phases in constructing family networks; thus, to evaluate our work comprehensively, we devise a metric for each of them.

4.1 Evaluation for Personal Family Relation Extraction

Obviously, the performance of the relation extraction phase determines the performance of the following phases. Because there is no effective approach to computing recall over a large-scale corpus, we instead focus on precision in this phase. The precision score is defined as follows:

\[
\mathrm{Precision} = \frac{C}{T} \tag{4}
\]

where C is the number of correct instances extracted by our system, and T is the total number of instances extracted. Since it is difficult to check the validity of all extracted instances owing to their relatively large number, we randomly select n (n=40) instances and judge their validity manually. This process is repeated four times, and the precision is calculated as the average over the four samples.
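A sketch of this sampling-based precision estimate is given below; is_correct stands in for the manual judgment, and the function and parameter names are illustrative.

```python
import random

def estimate_precision(instances, is_correct, n=40, repeats=4, seed=0):
    """Estimate precision by judging n randomly sampled instances,
    repeating the sampling several times and averaging the results."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        sample = rng.sample(instances, n)
        scores.append(sum(1 for i in sample if is_correct(i)) / n)
    return sum(scores) / repeats
```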

4.2 Evaluation for Cross-Document Coreference Resolution

We use the standard B-CUBED [21] scoring algorithm to evaluate the coreference chains inside a family. The algorithm computes a score for each coreference chain created by our system; the Precision, Recall and F1 scores are then calculated by comparing the chains formed by the system with the ground-truth chains. It should be pointed out that when scoring isolated mentions, which cannot be fused with any family, we follow this principle: if the mention is linked to a coreference chain by mistake, we count the loss of performance it causes; otherwise it is ignored.
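For reference, a common formulation of B-CUBED precision and recall is sketched below; chains are represented as collections of mention identifiers, and this simplified version scores only the mentions that appear in both the system and the gold chains, which roughly corresponds to ignoring unlinked isolated mentions as described above. All names are illustrative.

```python
def b_cubed(system_chains, gold_chains):
    """B-CUBED: for each mention, precision (recall) is the fraction of the
    mentions in its system (gold) chain that also share its gold (system)
    chain; the per-mention scores are averaged."""
    sys_of = {m: set(c) for c in system_chains for m in c}
    gold_of = {m: set(c) for c in gold_chains for m in c}
    mentions = [m for m in gold_of if m in sys_of]
    p = sum(len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    r = sum(len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```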


4.3 Evaluation for Family Network Aggregation

For family networks, no existing measure is available. Since the number of constructed families is not very large, we manually annotate the families generated by our system. A family is regarded as correct only if it exactly matches an annotated one; the corresponding precision, recall and F1 scores are then calculated as usual.

5 Experimental Results and Analysis

5.1 Corpus and Preprocessing

In this paper, we choose the Chinese portion of the Gigaword corpus, which is gathered from various newspapers, as the experimental dataset, owing to its uniformity of language expression. The dataset contains 1,033,679 news articles from sources including Xinhua News Agency and Lianhe Zaobao. After being transformed into XML format, the dataset was loaded into Solr for local retrieval. When constructing family networks, the baseline is built with exact matching of person names, and manual annotation is then performed on top of the baseline: we go through each family to rectify errors in coreference chains and families, i.e. we separate ambiguous names that refer to distinct persons and merge name variants of the same person, and then finalize the gold standard. During annotation we follow the principle that names which cannot be linked to any family are ignored. This helps us avoid a prohibitive amount of manual annotation, though at the cost of losing a very few families (cf. Section 3.2). The statistics of the gold standard are shown in Table 1.

Table 1. Statistics on manual tagging results

# of coreference chains    # of families    # of persons in families
867                        149              510

5.2 The Performance of Relation Extraction

Table 2 shows the patterns learned in the relation extraction phase. Note that each pattern reliably describes the corresponding relationship between persons. From the table, we can see that a sufficient number of patterns is obtained by the bootstrapping procedure.

Table 2. Patterns learned from bootstrapping

Relation types    # of patterns    Examples
Parent-Child      26               <Parent>的儿子<Child>、<Parent>的次子<Child>、<Parent>同志的女儿<Child>…
Husband-Wife      33               <Husband>的夫人<Wife>、<Husband>的遗孀<Wife>、<Husband>的妻子<Wife>


Table 3. Person pairs learned from bootstrapping

# of pairs discovered    Precision (%)
2167                     94.0

Table 3 gives the total number of instances discovered and the corresponding precision, which reaches 94.0%. Error analysis shows that the major source of error is word segmentation: a person name may be segmented in various ways because of its high frequency of occurrence in documents. For instance, in the sentence “下午三点半, 当李鹏总理和夫人朱琳来到厂区时……” (When Chinese Premier Li Peng and his wife Zhu Lin came to the plant at 3:30pm…), the name “朱琳” (Zhu Lin) is mistakenly segmented as “朱琳来” (Zhu Linlai).

5.3 The Performance of Coreference Resolution

Table 4 compares the performance of the coreference chains discovered within families for the different CDCR methods (Exact Name Matching, Exact Name Matching plus Name Variation Aggregation, Name Disambiguation, and Name Disambiguation plus Name Variation Aggregation). From the table we can see:

(i) Using the Exact Name Matching method, the precision reaches as high as 97.3%, but the recall is only 77.3%. This indicates that the phenomenon of name variation is far more serious than that of name ambiguity.
(ii) After Name Variation Aggregation, the performance improves significantly, particularly when it is performed on top of Exact Name Matching.
(iii) Compared with Exact Name Matching, Name Disambiguation decreases recall significantly, though it increases precision moderately. This means that name disambiguation splits correct person name groups as well as wrong ones.

Table 4. Performance of CDCR with different methods

Methods     # of chains    # of correct chains    P (%)    R (%)    F1 (%)
ENM         1041           593                    97.3     77.3     86.2
ENM+NVA     753            564                    96.1     90.9     93.4
ND          896            479                    99.1     73.7     84.5
ND+NVA      609            447                    98.2     88.3     93.0

5.4 The Performance of Family Networks

Table 5 reports the total number of families discovered, together with precision, recall and F1, for the various CDCR methods on the corpus. The table shows that Name Variation Aggregation has a significant effect on family network fusion, increasing both precision and recall.


Table 5. Performance of family networks with different CDCR methods

Methods     # of families    # of correct families    P (%)    R (%)    F1 (%)
ENM         264              93                       35.2     62.4     45.0
ENM+NVA     211              107                      50.7     71.8     59.4
ND          232              85                       36.6     57.0     44.6
ND+NVA      176              98                       55.7     65.8     60.3

There is an interesting phenomenon: although using name disambiguation alone decreases the performance of both CDCR and family networks, it does improve the performance of family networks when used in conjunction with Name Variation Aggregation. This suggests that Name Disambiguation promotes precision and therefore has a positive effect on family network fusion. Taking sentences (a) and (b) in Section 3.2 as examples, exact name matching would merge “卡恩” (Kahn) and “巴博” (Gbagbo) into one family through the linking point “西蒙娜” (Simone), thereby generating a spurious family containing two “Husband-Wife” relations. Name disambiguation, however, distinguishes the mentions of the same name “西蒙娜” (Simone) in (a) and (b), and thus removes this noise and improves the family networks.

6 Conclusion and Future Work

In this paper, we present a method that uses bootstrapping and coreference resolution techniques to construct personal family networks. Adopting a bootstrapping architecture, we first learn patterns of different relations and then iteratively discover pairs of persons as instances of these patterns. After bootstrapping, we fuse the pairs of persons into family networks using Cross-Document Coreference Resolution techniques. Experiments on the large-scale Chinese Gigaword corpus show that our method can construct personal family networks accurately and efficiently. However, the family networks we generate still have several shortcomings: the set of family relation types is limited, the associations between different families are not taken into consideration, and the recall is not high enough. For future work, we intend to introduce more types of family relations as well as relations between different families, so as to expand the scale of the families and enrich our family networks.

Acknowledgement. This work is funded by Jiangsu NSF Grants BK2010219 and 11KJA520003.

References

1. Kautz, H., Selman, B., Shah, M.: Referral Web: combining social networks and collaborative filtering. Communications of the ACM 40(3), 63–65 (1997)
2. Mika, P.: Flink: Semantic Web technology for the extraction and analysis of social networks. Web Semantics: Science, Services and Agents on the World Wide Web 3(2), 211–223 (2005)


3. Tang, J., Zhang, J., Yao, L., et al.: ArnetMiner: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998. ACM (2008)
4. Elson, D.K., Dames, N., McKeown, K.R.: Extracting social networks from literary fiction. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 138–147. Association for Computational Linguistics (2010)
5. Agarwal, A., Corvalan, A., Jensen, J., et al.: Social Network Analysis of Alice in Wonderland. In: NAACL-HLT 2012, p. 88 (2012)
6. van de Camp, M., van den Bosch, A.: A link to the past: constructing historical social networks. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pp. 61–69. Association for Computational Linguistics (2011)
7. Zhou, G.D., Zhang, M.: Extracting relation information from text documents by exploring various types of knowledge. Information Processing & Management 43(4), 969–982 (2007)
8. Oh, J.H., Uchimoto, K., Torisawa, K.: Bilingual co-training for monolingual hyponymy-relation acquisition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 1, pp. 432–440. Association for Computational Linguistics (2009)
9. Zhang, M., Su, J., Wang, D., Zhou, G., Tan, C.-L.: Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS (LNAI), vol. 3651, pp. 378–389. Springer, Heidelberg (2005)
10. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics, vol. 2, pp. 539–545. Association for Computational Linguistics (1992)
11. Pantel, P., Pennacchiotti, M.: Espresso: leveraging generic patterns for automatically harvesting semantic relations. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pp. 113–120. Association for Computational Linguistics (2006)
12. Conglei, Y., Nan, D.: An Extraction Method on Web. Pattern Recognition and Artificial Intelligence 2(6) (2007) (in Chinese)
13. Tian, G., Qian, M., Huaping, Z.: A Research on Social Network Extraction Based on Web Search Engine. Chinese Product Reviews Filtering (2009) (in Chinese)
14. Peng, C., Gu, J., Qian, L.: Research on Tree Kernel-Based Personal Relation Extraction. In: Zhou, M., Zhou, G., Zhao, D., Liu, Q., Zou, L. (eds.) NLPCC 2012. CCIS, vol. 333, pp. 225–236. Springer, Heidelberg (2012)
15. Zhu, J., Nie, Z., Liu, X., et al.: StatSnowball: a statistical approach to extracting entity relationships. In: Proceedings of the 18th International Conference on World Wide Web, pp. 101–110. ACM (2009)
16. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley Series in Telecommunications. Wiley (1991)
17. Pantel, P., Ravichandran, D.: Automatically labeling semantic classes. In: Proceedings of HLT/NAACL, vol. 4, pp. 321–328 (2004)
18. Gooi, C.H., Allan, J.: Cross-document coreference on a large scale corpus. In: HLT-NAACL, pp. 9–16 (2004)
19. Mayfield, J., Alexander, D., Dorr, B., et al.: Cross-document coreference resolution: a key technology for learning by reading. In: AAAI Spring Symposium on Learning by Reading and Learning to Read (2009)
20.
Malin, B.: Unsupervised name disambiguation via social network similarity. In: Workshop on Link Analysis, Counterterrorism, and Security, vol. 1401, pp. 93–102 (2005) 21. Bagga, A.: Evaluation of coreferences and coreference resolution systems. In: Proceedings of the First Language Resource and Evaluation Conference, pp. 563–566 (1998)