Journal of the American Medical Informatics Association, 28(9), 2021, 1919–1927 doi: 10.1093/jamia/ocab095 Advance Access Publication Date: 28 June 2021 Research and Applications
Aggregating large-scale databases for PubMed author name disambiguation
Downloaded from https://academic.oup.com/jamia/article/28/9/1919/6310425 by guest on 01 October 2021
Li Zhang, Yong Huang, Jinqing Yang, and Wei Lu
School of Information Management, Wuhan University, Wuhan, China
Corresponding Author: Wei Lu, PhD, School of Information Management, Wuhan University, 299, Bayi Street, Wuchang District, Wuhan, China; [email protected]
Received 10 March 2021; Revised 13 April 2021; Editorial Decision 30 April 2021; Accepted 7 May 2021
ABSTRACT
Objective: PubMed has suffered from the author ambiguity problem for many years. Existing studies on author name disambiguation (AND) for PubMed used only internal metadata for development. However, some of these metadata are incomplete (eg, a large number of names are only abbreviated and their full names are not available) or less discriminative. To this end, we present a new disambiguation method, namely AggAND, that aggregates information from external databases.
Materials and Methods: We address this issue by exploring Microsoft Academic Graph, Semantic Scholar, and PubMed Knowledge Graph to enhance the built-in name metadata, and we extend the internal metadata with external, more discriminative metadata.
Results: Experimental results on the enhanced name metadata demonstrate performance comparable to 3 author identifier systems and superiority over the original name metadata. More importantly, our method, AggAND, which incorporates both enhanced name and extended metadata, yields F1 scores of 95.80% and 93.71% on 2 datasets and outperforms the state-of-the-art method by a large margin (3.61% and 6.55%, respectively).
Conclusions: The feasibility and good performance of our methods not only help better understand the importance of external databases for disambiguation, but also point to a promising direction for future AND studies in which information aggregated from multiple bibliographic databases can be effective in improving disambiguation performance. The methodology shown here can be generalized to broader bibliographic databases beyond PubMed. Our code and data are available online (https://github.com/carmanzhang/PubMed-AND-method).
Key words: PubMed, author name disambiguation, bibliographic database, digital library
© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: [email protected]

INTRODUCTION

Background and significance

As a special case of the more general problem of entity resolution,1–4 the author ambiguity problem is ubiquitous in large-scale bibliographic databases. It is usually not easy to determine whether 2 citations with similar names are authored by the same individual, and this uncertainty is even higher for abbreviated names. Nowadays, scholarly databases/libraries, eg, PubMed (https://pubmed.ncbi.nlm.nih.gov), provide easy-to-use interfaces for academic search, allowing researchers to quickly identify the latest literature despite the fast proliferation of scientific research. However, most of them do not disambiguate authors.5 This problem not only hinders the communication of valuable discoveries produced by others, but also restricts many downstream studies and applications, such as funding fairness or gender inequality studies for scientists. Zhang et al6 claimed that the author identifier (ID) constructed by a disambiguation algorithm is a core component of AMiner (https://www.aminer.cn), a large-scale bibliographic database covering many journals and conference papers. A query log analysis of PubMed suggested that nearly a quarter of queries used author names exclusively.7

This article focuses on the name ambiguity problem in PubMed. To alleviate this problem, the research team at the National Library of Medicine has provided a useful functionality: ranking citations that contain the query name in the PubMed interface.8 In addition, many academic services have been developed for researchers, such as ORCID (https://orcid.org), ResearcherID (https://www.researcherid.com), and ResearchGate (https://www.researchgate.net),9,10 but none of them are widely used in today's author community. For example, only 15.4% of 214 million publications have an ORCID iD. Recent computational author name disambiguation (AND) methods for PubMed fall into 3 groups: rule based, graph based, and machine learning (ML) based. The principle of the rule-based approaches is to define a set of rules in advance and to use them for heuristically determining authors.11–13 The major drawback is the lack of supportive evidence; a preset threshold is manually chosen without justification. In recent years, some graph-based disambiguation approaches14,15 have been proposed. Although not specifically designed for PubMed, most of them can be extended to AND in PubMed. However, the high computational cost of graph representation learning techniques makes such methods inefficient for disambiguation on a PubMed-scale database. Others approach AND using ML-based methods.16–18 Models are trained through a learning process based on prior observations (ie, features extracted from metadata) and can therefore be used to infer the class of unseen data.19 However, the metadata sparsity problem has not been adequately considered. Existing ML-based approaches used only metadata within PubMed, and some of the metadata frequently used by prior works16 are not always available for all PubMed authors. For example, only 58.03% of all PubMed authors have full names, and this low percentage has been reported as a limitation of PubMed data by many studies.5,20,21 Such a problem has raised even more ambiguities, as authors who were originally distinguishable by name variance become indistinguishable.

Objective

To resolve name ambiguities in PubMed, this study tries to handle the metadata-related issue by leveraging multiple databases. Specifically, 3 well-known databases, Microsoft Academic Graph (MAG),22 Semantic Scholar (S2),23 and PubMed Knowledge Graph (PKG),24 are exploited to aggregate additional information to enhance the internal metadata and extend it with new, discriminative metadata. This work extends a previous study by Vishnyakova et al.17 However, unlike that study, our work tries to boost AND methods using external databases beyond PubMed and explores their impact on PubMed author disambiguation.

MATERIALS AND METHODS

Problem definition and method overview

In this section, we describe the materials and methods used in our study. We first give a formal definition of the disambiguation problem and then describe our method in detail. Our work follows the ML-based approach because it is more effective than rule-based approaches and more efficient than graph-based approaches. For scholarly name disambiguation, the task is to characterize a function $f$ that can map 2 citations $c_i$ and $c_j$ with a similar name $a_n$ to a binary value $r$, designating whether $c_i$ and $c_j$ are authored by the same person in the real world.

$$f: \{c_i, c_j \mid a_n\} \rightarrow r_{ij}, \quad r_{ij} \in \{0, 1\} \tag{1}$$

In doing so, a classification model $f_\theta(x)$ with parameters $\theta$ and extracted features $x$ is used to contextualize the function $f$, where $x$ is derived from author profiles. Here, author profiles are used to characterize authors at the author level (ie, author name, email, organization, and location) and at the paper level (ie, paper title and abstract). Author profiles in citation $c_i$ can be represented by a sequence of metadata. For instance, $c_i = \langle m_i^{(1)}, m_i^{(2)}, \ldots, m_i^{(p)} \rangle$, where $p$ is the number of PubMed internal metadata in use. Then, the disambiguation result can be formalized as:

$$\begin{cases} x_{ij} = \langle v^{(1)}(m_i^{(1)}, m_j^{(1)}), v^{(2)}(m_i^{(2)}, m_j^{(2)}), \ldots, v^{(p)}(m_i^{(p)}, m_j^{(p)}) \rangle \\ r_{ij} = f_\theta(x_{ij}) \end{cases} \tag{2}$$

where $v^{(k)}(\cdot)$ represents the similarity measurement of the $k$th metadata.

In this study, we address the metadata-related problem by considering external databases; the associated metadata are utilized to enhance or extend the internal author profiles. As shown in equation 3, $y$ and $z$ are joined with the internal feature vector $x$ to form a new representation of ambiguous authors, where $y$ and $z$ denote the feature vectors of enhanced metadata and extended metadata, respectively, and $\oplus$ denotes joining the vectors. The range of enhanced metadata is from $m^{(p+1)}$ to $m^{(q)}$, and the range of extended metadata is from $m^{(q+1)}$ to $m^{(r)}$.

$$\begin{cases} x_{ij} = \langle v^{(1)}(m_i^{(1)}, m_j^{(1)}), \ldots, v^{(p)}(m_i^{(p)}, m_j^{(p)}) \rangle \\ y_{ij} = \langle v^{(p+1)}(m_i^{(p+1)}, m_j^{(p+1)}), \ldots, v^{(q)}(m_i^{(q)}, m_j^{(q)}) \rangle \\ z_{ij} = \langle v^{(q+1)}(m_i^{(q+1)}, m_j^{(q+1)}), \ldots, v^{(r)}(m_i^{(r)}, m_j^{(r)}) \rangle \\ r'_{ij} = f_\theta(x_{ij} \oplus y_{ij} \oplus z_{ij}) \end{cases} \tag{3}$$

In this way, more discriminative information can be obtained, which helps the AND task. Our research framework is illustrated in Figure 1. To obtain such metadata, multiple comprehensive databases are jointly used by linking them; the associated metadata are incorporated to build more accurate author profiles, $c_i$ and $c_j$. Then, by modeling the profiles with a supervised learning model on the training set, model predictions on the test set for each pair of ambiguous authors can be determined, and performance can be measured with respect to ground truth labels.

Database linking

In order to enrich author profiles, we made considerable efforts to link PubMed to MAG, S2, and PKG. These databases have made their corpora publicly available, either in full or in part. We downloaded, parsed, and stored them in a database management system (note that the versions of the 4 databases are all 2019 versions). All of the external databases contain comprehensive publication records and have their own author ID system, of which MAG and S2 include over 170 million articles and have high coverage of PubMed papers (MAG-PubMed: 80.76%; S2-PubMed: approximately 100%) and authors (MAG-PubMed: 77.28%; S2-PubMed: 96.66%). As demonstrated in Figure 1, the linking approach includes 2 steps: publication linking and author linking. The publication linkages are established on the official PubMed publication identifier (PMID), which can be extracted from the "outbound links" of the databases. For example, "Paper Urls" (https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema#paper-urls) in the MAG schema refers to the original links to the publications that MAG has obtained, and these links usually contain the identifying metadata, PMID. The author linkages are based on the publication linkages. Because the author sequence or position is explicitly noted within each citation, we can connect the authorships between PubMed and these databases when the PMID and the author position are both the same. Details of the linking approach are provided in our prior study.25 Note that the linking process involves the entire PubMed, rather than only the evaluation datasets, because the densities of the metadata obtained in this way are computed globally, so it is more appropriate for demonstrating the feasibility of our method when generalizing to the entire PubMed.

Figure 1. Overview of our method for PubMed author name disambiguation; external databases are jointly used to boost author profiles for better disambiguation. PMID: PubMed publication identifier.

Metadata extraction

Table 1 summarizes the metadata harvested from the respective databases. For PubMed, we extract a variety of metadata. For the external metadata, we extract the corresponding author name and identifiers. The author names collected here are used to enhance the built-in name metadata, and the identifiers are used as an extension to PubMed metadata. We introduce such metadata extensions for several reasons. First, they are discriminating by nature. By borrowing the idea of ensemble meta-algorithms in machine learning,26 the homogeneous IDs allow us to develop more powerful disambiguation methods. This idea has been successfully applied in many machine learning models, like random forest (RF), which consists of multiple, less impressive trees. Second, they can be easily retrieved from the aforementioned databases; also, it is very convenient to transform them into features by determining whether the author ID is exactly the same or by computing the intersection of coauthor IDs.

Author profile building

Most author profiles can be represented directly from metadata, whereas some profiles, such as enhanced name, affiliation-related profiles (Organization, Location, Country, City), and content-related profiles (Journal Descriptors, Semantic Types), require further refinement or the assistance of pretrained models.

In terms of the enhanced name profile, we consider the longest first name among the collected names as the well-formed first name to improve its density. For the affiliation-related profiles, we follow the settings described in the competing methods (Song et al16 and Vishnyakova et al17) to build these profiles. Specifically, named entity recognition provided in Stanford CoreNLP27 and Natural Language Toolkit (http://www.nltk.org/) is utilized to detect geographic information because various variants of valuable information are mixed together. The disambiguation features Journal Descriptors (JDs) and Semantic Types (STs), proposed by previous studies,17,28 are used to capture the content information of the works. We utilize the Journal Descriptor Indexing tool29 to generate a ranked list of JDs or STs as an output for a given textual content (ie, title, abstract, and other textual matter of a citation where available).

Features and supervised learning

Based on the author profiles, we calculate the similarity of pairwise author profiles (hereafter referred to as features) and divide them into 4 groups according to the type of metadata or data source: Inner Group (metadata from PubMed only), InnerName Group (name metadata from PubMed only), EnhancedName Group (name metadata enhanced by external databases), and ExtendedIDs Group (ID-related metadata from external databases only). Table 2 describes the feature groups and the corresponding author profiles, metadata, and data sources.
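As a rough illustration of how the ID-related features and the joined representation in equation 3 might be computed, the following Python sketch derives two features from external author IDs (an exact-match flag and a count of shared coauthor IDs) and concatenates them with the other feature vectors. The function names, the choice to score a missing ID as 0.0, and the example ID strings are all assumptions made for illustration; they are not taken from the authors' released code.

```python
from typing import List, Optional, Sequence, Set


def same_author_id(id_i: Optional[str], id_j: Optional[str]) -> float:
    """Return 1.0 when both external author IDs exist and are identical.

    Assumption: a missing ID (None) is scored as 0.0, ie, "no evidence".
    """
    if id_i is None or id_j is None:
        return 0.0
    return 1.0 if id_i == id_j else 0.0


def coauthor_overlap(ids_i: Set[str], ids_j: Set[str]) -> float:
    """Return the number of coauthor IDs shared by the two citations."""
    return float(len(ids_i & ids_j))


def join_features(
    x: Sequence[float], y: Sequence[float], z: Sequence[float]
) -> List[float]:
    """Concatenate internal (x), enhanced (y), and extended (z) features."""
    return list(x) + list(y) + list(z)


if __name__ == "__main__":
    # Hypothetical IDs from one external source; values are made up.
    z_ij = [
        same_author_id("ext:101", "ext:101"),
        coauthor_overlap({"ext:7", "ext:9"}, {"ext:9", "ext:12"}),
    ]
    # x_ij and y_ij stand in for already computed similarity scores.
    print(join_features([0.8], [1.0], z_ij))
```

The joined list can then be passed to any supervised classifier; the sketch deliberately leaves the model choice open.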
Note that we no longer itemize the Inner Group
Table 1. Internal and external metadata
Source | Metadata
PubMed (Internal) | Author Name, Affiliation, Coauthor, Paper Title, Abstract, Keyword, MeSH Heading, Publication Date, Journal Title, Language, Reference
S2/MAG/PKG (External) | Author Name, Author ID, Coauthor IDs
MAG: Microsoft Academic Graph; PKG: PubMed Knowledge Graph; S2: Semantic Scholar.
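To make the use of the external Author Name and Author ID metadata more concrete, the sketch below shows one way the author linking step (same PMID and same author position) and the enhanced first name (the longest first name among the collected names) could be implemented. It is a minimal sketch under stated assumptions: the dictionary layout, the function names, the 1-based author position, and the tie-breaking rule (first seen wins) are hypothetical, and the PMIDs and names in the example are invented.

```python
from typing import Dict, Iterable, Tuple

# Key of an authorship: (PMID, author position within the citation).
AuthorKey = Tuple[str, int]


def link_authors(
    pubmed: Dict[AuthorKey, str],
    external: Dict[AuthorKey, Tuple[str, str]],
) -> Dict[AuthorKey, Dict[str, str]]:
    """Link authorships that share the same PMID and author position.

    `pubmed` maps a key to the PubMed author name; `external` maps a key
    to (external author name, external author ID).
    """
    linked: Dict[AuthorKey, Dict[str, str]] = {}
    for key, pubmed_name in pubmed.items():
        if key in external:
            ext_name, ext_id = external[key]
            linked[key] = {
                "pubmed_name": pubmed_name,
                "external_name": ext_name,
                "external_id": ext_id,
            }
    return linked


def enhanced_first_name(candidates: Iterable[str]) -> str:
    """Pick the longest first name among the collected names.

    Assumption: ties are broken in favor of the name seen first.
    """
    best = ""
    for name in candidates:
        name = name.strip()
        if len(name) > len(best):
            best = name
    return best


if __name__ == "__main__":
    # Invented records: only the first author is present in both sources.
    pubmed_authors = {("111", 1): "J Doe", ("111", 2): "A Roe"}
    external_authors = {("111", 1): ("Jane Doe", "ext-1")}
    print(link_authors(pubmed_authors, external_authors))
    print(enhanced_first_name(["J", "Jane", "J."]))
```

Keying both sources on the (PMID, position) pair keeps the join a simple dictionary lookup, which matters when the linking is run over the whole of PubMed rather than a small evaluation set.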
Table 2. Feature group settings
Feature Group | Feature Name | Similarity Measurement | Author Profile | Metadata | Source
InnerName/EnhancedName | Full Name Similarity | $\lvert \mathrm{char}(fn_i) \cap \mathrm{char}(fn_j) \rvert / \lvert \mathrm{char}(fn_i)$ | Full Name (fn) | Author Name | PubMed/(PubMed, MAG, S2)