
Learning CNF Blocking for Large-scale Author Name Disambiguation

Kunho Kim†∗, Athar Sefid‡, C. Lee Giles‡
† Microsoft Corporation, Redmond, WA, USA
‡ The Pennsylvania State University, University Park, PA, USA
[email protected], [email protected], [email protected]

Abstract

Author name disambiguation (AND) algorithms identify a unique author entity record from all similar or same publication records in scholarly or similar databases. Typically, a clustering method is used that requires calculation of similarities between each possible record pair. However, the total number of pairs grows quadratically with the size of the author database, making such clustering difficult for millions of records. One remedy is a blocking function that reduces the number of pairwise similarity calculations. Here, we introduce a new way of learning blocking schemes by using a conjunctive normal form (CNF), in contrast to the disjunctive normal form (DNF). We demonstrate on PubMed author records that CNF blocking removes more pairs while preserving high pairs completeness compared to previous methods that use a DNF, and that the computation time is significantly reduced. In addition, we show how to ensure that the method produces disjoint blocks so that much of the AND algorithm can be efficiently parallelized. Our CNF blocking method is tested on the entire PubMed database of 80 million author mentions and efficiently removes 82.17% of all author record pairs in 10 minutes.

1 Introduction

Author name disambiguation (AND) refers to the problem of identifying each unique author entity record from all publication records in scholarly databases (Ferreira et al., 2012). It is also an important preprocessing step for a variety of problems. One example is processing author-related queries properly (e.g., identifying all of a particular author's publications) in a digital library search engine. Another is calculating author-related statistics, such as the h-index, and collaboration relationships between authors.

Typically, a clustering method is used to compute AND. Such clustering calculates pairwise similarities between each possible pair of records and then determines whether each pair should be in the same cluster. Since the number of possible pairs in a database with n records is n(n − 1)/2, it grows as O(n²). Since n can be millions of authors in some databases such as PubMed, AND algorithms need methods that scale, such as a blocking function (Christen, 2012). The blocking function produces a reduced list of candidate pairs, and only the pairs on the list are considered for clustering.

Blocking usually consists of blocking predicates. Each predicate is a logical binary function combining an attribute with a similarity criterion. One example is exact match of the last name. A simple but effective way of blocking involves manually selecting the predicates with respect to the data characteristics. Much recent work on large-scale AND uses the heuristic of matching the first-name initial and the exact last name (Torvik and Smalheiser, 2009; Liu et al., 2014; Levin et al., 2012; Kim et al., 2016). Although this gives reasonable completeness, it can be problematic when the database is extremely large, such as the author mentions in CiteSeerX (10M publications, 32M authors), PubMed (24M publications, 88M authors), and Web of Science (45M publications, 163M authors)¹.

The blocking results on PubMed using this heuristic are shown in Table 1. Note that most blocks contain fewer than 100 names, but a few blocks are extremely large. Since the number of pairs grows quadratically, those few blocks can dominate the computation time. This imbalance of block sizes is due to the popularity of certain surnames, especially Asian names (Kim et al., 2016). To make matters worse, this problem grows over time, since the number of publication records is increasing rapidly.

Table 1: Block size distribution of PubMed author mentions using the simple blocking heuristic.

    Block Size             Frequency    Percentage
    2 ≤ n < 10             1,586,677    59.91%
    10 ≤ n < 100             910,272    34.37%
    100 ≤ n < 1000           144,361     5.45%
    1000 ≤ n < 10000           6,998     0.26%
    10000 ≤ n < 50000            184     0.01%
    n ≥ 50000                      9   < 0.01%
    Total                  2,648,501   100.0%

To improve blocking, there has been work on learning the blocking function (Bilenko et al., 2006; Michelson and Knoblock, 2006; Cao et al., 2011; Kejriwal and Miranker, 2013; Das Sarma et al., 2012; Fisher et al., 2015). These methods fall into two categories. One is disjoint blocking, where blocks are separated so that each record belongs to a single block. The other is non-disjoint blocking, where some blocks share records. Each has advantages: disjoint blocking makes the clustering step easy to parallelize, while non-disjoint blocking often produces smaller blocks and has more degrees of freedom in selecting the similarity criterion.

Here, we propose to learn a non-disjoint blocking function in conjunctive normal form (CNF). Our main contributions are:

• Propose CNF blocking, which removes more pairs than DNF blocking while achieving high pairs completeness. This also reduces processing time, which benefits applications such as online disambiguation and author search.

• Extend the method to produce disjoint blocks, so that the AND clustering step can be easily parallelized.

• Compare different gain functions, which are used to find the best blocking predicate at each step of learning.

Previous work is discussed in the next section, followed by the problem definition. Next, we describe the learning of CNF blocking and how to use it to ensure the production of disjoint blocks. We then evaluate our methods on the PubMed dataset. The final section gives a summary with possible future directions.

∗ Work done while the author was at Pennsylvania State University. A shorter preprint version of this paper was published at arXiv (Kim et al., 2017).
¹ Numbers are as of 2016.

Proceedings of the First Workshop on Scholarly Document Processing, pages 72–80, Online, November 19, 2020. © 2020 Association for Computational Linguistics. https://doi.org/10.18653/v1/P17

2 Related Work

Blocking has been widely studied for record linkage and entity disambiguation. Standard blocking is the simplest but most widely used method (Fellegi and Sunter, 1969). It considers only pairs that meet all blocking predicates. Another is the sorted neighborhood approach (Hernández and Stolfo, 1995), which sorts the data by a certain blocking predicate and forms blocks from pairs of records within a certain window. Yan et al. (2007) further improved this method to adaptively select the window size. Aizawa and Oyama (2005) introduced a suffix array-based indexing method, which uses an inverted index of suffixes to generate candidate pairs. Canopy clustering (McCallum et al., 2000) generates blocks by clustering with a simple similarity measure, using loose and tight thresholds to generate overlapping clusters. Recent surveys (Christen, 2012; Papadakis et al., 2016, 2020) suggest that there is no clear winner and that proper parameter tuning is required for each specific task.

Much work has optimized the blocking function for standard blocking. The blocking function is typically represented as a logical formula over blocking predicates. Two studies on learning disjunctive normal form (DNF) blocking (Bilenko et al., 2006; Michelson and Knoblock, 2006) were published in the same year. Using manually labeled record pairs, they applied a sequential covering algorithm to find the optimal blocking predicates greedily. Additional unlabeled data was used to estimate the reduction ratio in the cost function (Cao et al., 2011), while an unsupervised algorithm was used to automatically generate labeled pairs with rule-based heuristics for learning DNF blocking (Kejriwal and Miranker, 2013).

All the work above learns non-disjoint blocking because of the logical OR terms in the DNF. However, other work learns the blocking function as a pure conjunction to ensure the generation of disjoint blocks. Das Sarma et al. (2012) learn a conjunctive blocking tree, which has different blocking predicates for each branch of the tree. Fisher et al. (2015) produce blocks with respect to a size restriction, by generating candidate blocks with a list of predefined blocking predicates and then performing merges and splits to reach the desired block size.

Our work proposes a method for learning a non-disjoint blocking function in conjunctive normal form (CNF). Our method is based on a previous CNF learner (Mooney, 1995), which uses the fact that a CNF can be learned as the logical dual of a DNF.

3 Problem Definition

Our work tackles the same problem as baseline DNF blocking (Bilenko et al., 2006; Michelson and Knoblock, 2006), but obtains the optimized blocking function in a different way. Let R = {r1, r2, ..., rn} be the set of records in the database, where n is the number of records.

Algorithm 1 DNF Blocking
 1: function LearnConjTerms(L, P, p, k)
 2:   Let Pos be the set of positive samples in L
 3:   Let Neg be the set of negative samples in L
 4:   Terms ← {p}
 5:   CurTerm ← p
 6:   i ← 1
 7:   while i < k do
 8:     Find pi ∈ P that maximizes the gain function CalcGain(Pos, Neg, CurTerm ∧ pi)
 9:     CurTerm ← CurTerm ∧ pi
10:     Add CurTerm to Terms
11:     i ← i + 1
12:   end while
13:   return Terms