Aggregating Large-Scale Databases for PubMed Author Name Disambiguation

Journal of the American Medical Informatics Association, 28(9), 2021, 1919–1927. doi: 10.1093/jamia/ocab095. Advance Access Publication Date: 28 June 2021. Research and Applications

Downloaded from https://academic.oup.com/jamia/article/28/9/1919/6310425 by guest on 01 October 2021

Li Zhang, Yong Huang, Jinqing Yang, and Wei Lu

School of Information Management, Wuhan University, Wuhan, China

Corresponding Author: Wei Lu, PhD, School of Information Management, Wuhan University, 299, Bayi Street, Wuchang District, Wuhan, China; [email protected]

Received 10 March 2021; Revised 13 April 2021; Editorial Decision 30 April 2021; Accepted 7 May 2021

ABSTRACT

Objective: PubMed has suffered from the author ambiguity problem for many years. Existing studies on author name disambiguation (AND) for PubMed used only internal metadata for development. However, some of this metadata is incomplete (eg, a large number of names are only abbreviated and their full names are not available) or less discriminative. To this end, we present a new disambiguation method, namely AggAND, that aggregates information from external databases.

Materials and Methods: We address this issue by exploring Microsoft Academic Graph, Semantic Scholar, and PubMed Knowledge Graph to enhance the built-in name metadata, and we extend the internal metadata with external, more discriminative metadata.

Results: Experimental results on enhanced name metadata demonstrate performance comparable to 3 author identifier systems, as well as superiority over the original name metadata. More importantly, our method, AggAND, incorporating both enhanced name and extended metadata, yields F1 scores of 95.80% and 93.71% on 2 datasets and outperforms the state-of-the-art method by a large margin (3.61% and 6.55%, respectively).

Conclusions: The feasibility and good performance of our methods not only help to better understand the importance of external databases for disambiguation, but also point to a promising direction for future AND studies in which information aggregated from multiple bibliographic databases can be effective in improving disambiguation performance. The methodology shown here can be generalized to broader bibliographic databases beyond PubMed. Our code and data are available online (https://github.com/carmanzhang/PubMed-AND-method).

Key words: PubMed, author name disambiguation, bibliographic database, digital library

© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: [email protected]

INTRODUCTION

Background and significance

As a special case of the more general problem of entity resolution,1–4 the author ambiguity problem is ubiquitous in large-scale bibliographic databases. It is usually not easy to determine whether 2 citations with similar names are authored by the same individual, and this uncertainty is even higher for abbreviated names. Nowadays, scholarly databases/libraries, eg, PubMed (https://pubmed.ncbi.nlm.nih.gov), provide easy-to-use interfaces for academic search, allowing researchers to quickly identify the latest literature despite the fast proliferation of scientific research. However, most of them do not disambiguate authors.5 This problem not only hinders the communication of valuable discoveries produced by others, but also restricts many downstream studies and applications, such as funding fairness or gender inequality studies for scientists. Zhang et al6 claimed that the author identifier (ID) constructed by a disambiguation algorithm is a core component of AMiner (https://www.aminer.cn), a large-scale bibliographic database covering many journals and conference papers. A query log analysis of PubMed suggested that nearly a quarter of queries used author names exclusively.7

This article focuses on the name ambiguity problem in PubMed. To alleviate this problem, the research team at the National Library of Medicine has provided a useful functionality: ranking citations that contain the query name in the PubMed interface.8 In addition, many academic services have been developed for researchers, such as ORCID (https://orcid.org), ResearcherID (https://www.researcherid.com), and ResearchGate (https://www.researchgate.net),9,10 but none of them are widely used in today's author community. For example, only 15.4% of 214 million publications have an ORCID iD. Recent computational author name disambiguation (AND) methods for PubMed fall into 3 groups: rule based, graph based, and machine learning (ML) based. The principle of the rule-based approaches is to define a set of rules in advance and to use them for heuristically determining authors.11–13 The major drawback is the lack of supportive evidence; a preset threshold is manually chosen without clarification. In recent years, some graph-based disambiguation approaches14,15 have been proposed. Although not specifically designed for PubMed, most of them can be extended to AND in PubMed. However, the high computational cost of graph representation learning techniques makes such methods inefficient for disambiguation on a PubMed-scale database. Others approach AND using ML-based methods.16–18 Models are trained through a learning process based on prior observations (ie, features extracted from metadata) and can therefore be used to infer the class of unseen data.19 However, the metadata sparsity problem has not been adequately considered. Existing ML-based approaches used only metadata within PubMed, and some of the metadata frequently used by prior works16 may not always be available for all PubMed authors. For example, only 58.03% of all PubMed authors have full names, and the low percentage has been reported as a limitation of PubMed data by many studies.5,20,21 Such a problem has raised even more ambiguities, as authors who were originally distinguishable by name variance become indistinguishable.

Objective

To resolve name ambiguities in PubMed, this study tries to handle the metadata-related issue by leveraging multiple databases. Specifically, 3 well-known databases, Microsoft Academic Graph (MAG),22 Semantic Scholar (S2),23 and PubMed Knowledge Graph (PKG),24 are exploited to aggregate additional information that enhances the internal metadata and extends it with new, discriminative metadata. This work extends a previous study by Vishnyakova et al.17 Unlike that study, however, our work tries to boost AND methods using external databases beyond PubMed and explores their impact on PubMed author disambiguation.

MATERIALS AND METHODS

Problem definition and method overview

In this section, we describe the materials and methods used in our study. We first give a formal definition of the disambiguation problem and then describe our method in detail. Our work follows the ML-based approach because it is more effective than rule-based approaches and more efficient than graph-based approaches. For scholarly name disambiguation, the task is to characterize a function $f$ that maps 2 citations $c_i$ and $c_j$ with a similar name $a_n$ to a binary value $r$, designating whether $c_i$ and $c_j$ are authored by the same person in the real world.

$$f : \{c_i, c_j \mid a_n\} \rightarrow r_{ij}, \quad r_{ij} \in \{0, 1\} \tag{1}$$

In doing so, a classification model $f_\theta(x)$ with parameters $\theta$ and extracted features $x$ is used to contextualize the function $f$, where $x$ is derived from author profiles. Here, author profiles are used to characterize authors at the author level (ie, author name, email, organization, and location) and the paper level (ie, paper title and abstract). Author profiles in citation $c_i$ can be represented by a sequence of metadata. For instance, $c_i = \langle m_i^{(1)}, m_i^{(2)}, \ldots, m_i^{(p)} \rangle$, where $p$ is the number of PubMed internal metadata in use. Then, the disambiguation result can be formalized as:

$$\begin{cases} x_{ij} = \langle v^{(1)}(m_i^{(1)}, m_j^{(1)}),\ v^{(2)}(m_i^{(2)}, m_j^{(2)}),\ \ldots,\ v^{(p)}(m_i^{(p)}, m_j^{(p)}) \rangle \\ r_{ij} = f_\theta(x_{ij}) \end{cases} \tag{2}$$

where $v^{(k)}(\cdot)$ represents the similarity measurement of the $k$th metadata.

In this study, we address the metadata-related problem by considering external databases; the associated metadata are utilized to enhance or extend the internal author profiles. As shown in equation 3, $y$ and $z$ are joined with the internal feature vector $x$ to form a new representation of ambiguous authors, where $y$ and $z$ denote the feature vectors of enhanced metadata and extended metadata, respectively. The range of enhanced metadata is from $m^{(p+1)}$ to $m^{(q)}$, and the range of extended metadata is from $m^{(q+1)}$ to $m^{(r)}$.

$$\begin{cases} x_{ij} = \langle v^{(1)}(m_i^{(1)}, m_j^{(1)}),\ \ldots,\ v^{(p)}(m_i^{(p)}, m_j^{(p)}) \rangle \\ y_{ij} = \langle v^{(p+1)}(m_i^{(p+1)}, m_j^{(p+1)}),\ \ldots,\ v^{(q)}(m_i^{(q)}, m_j^{(q)}) \rangle \\ z_{ij} = \langle v^{(q+1)}(m_i^{(q+1)}, m_j^{(q+1)}),\ \ldots,\ v^{(r)}(m_i^{(r)}, m_j^{(r)}) \rangle \\ r_{ij} = f'_\theta(x_{ij} \oplus y_{ij} \oplus z_{ij}) \end{cases} \tag{3}$$

In this way, more discriminative information can be obtained, which helps the AND task. Our research framework is illustrated in Figure 1. To obtain such metadata, multiple comprehensive databases are jointly used through database linking; the associated metadata are incorporated to build more accurate author profiles, $c_i$ and $c_j$. Then, by modeling the profiles with a supervised learning model on the training set, model predictions on the test set for each pair of ambiguous authors can be determined, and performance can be measured with respect to ground truth labels.

Figure 1. Overview of our method for PubMed author name disambiguation; external databases are jointly used to boost author profiles for better disambiguation. PMID: PubMed publication identifier.

Database linking

In order to enrich author profiles, we made considerable efforts to link PubMed to MAG, S2, and PKG. These databases have opened their corpus either in full or in part. We downloaded, parsed, and stored them in a database management system (note that the versions of the 4 databases are all 2019 versions). All of the external databases contain comprehensive publication records and have their own author ID system, of which MAG and S2 include over 170 million articles and have high coverage of PubMed papers (MAG-PubMed: 80.76%; S2-PubMed: approximately 100%) and authors (MAG-PubMed: 77.28%; S2-PubMed: 96.66%). As demonstrated in Figure 1, the linking approach includes 2 steps: publication linking and author linking. The publication linkages are established on the official PubMed publication identifier (PMID), which can be extracted from the "outbound links" of the databases. For example, "Paper Urls" (https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema#paper-urls) in the MAG schema refers to the original links to the publications that MAG has obtained, and these links usually contain the identifiable metadata, PMID. The author linkages are based on the publication linkages. Because the author sequence or position is explicitly noted within each citation, we can connect the authorships between PubMed and these databases when they share the same PMID and the same author position. Details of the linking approach are provided in our prior study.25 Note that the linking process covers the entire PubMed, rather than only the evaluation datasets. Because the densities of the metadata obtained in this way are computed globally, they are more appropriate for demonstrating the feasibility of our method when generalizing to the entire PubMed.

Metadata extraction

Table 1 summarizes the metadata harvested from the respective databases. For PubMed, we extract a variety of metadata. For the external metadata, we extract the corresponding author name and identifiers. The author names collected here are used to enhance the built-in name metadata, and the identifiers are used as an extension to PubMed metadata. We introduce such metadata extensions for several reasons. First, they are discriminating by nature. By borrowing the ideology of ensemble meta-algorithms in machine learning,26 the homogeneous IDs allow us to develop more powerful disambiguation methods. This ideology has been successfully applied in many machine learning models, like random forest (RF), which consists of multiple, less impressive trees. Second, they can be easily retrieved from the aforementioned databases; also, it is very convenient to transform them into features by determining whether the author IDs are exactly the same or by computing the intersection of coauthor IDs.

Author profile building

Most author profiles can be represented directly from metadata, whereas some profiles, such as enhanced name, affiliation-related profiles (Organization, Location, Country, City), and content-related profiles (Journal Descriptors, Semantic Types), require further refinement or the assistance of pretrained models.

In terms of the enhanced name profile, we consider the longest first name among the collected names as the well-formed first name to improve its density. For the affiliation-related profiles, we follow the same settings as described in the competing methods (Song et al16 and Vishnyakova et al17) to build these profiles. Specifically, named entity recognition provided in Stanford CoreNLP27 and Natural Language Toolkit (http://www.nltk.org/) is utilized to detect geographic information because various variants of valuable information are mixed together. The disambiguation features Journal Descriptors (JDs) and Semantic Types (STs), proposed by previous studies,17,28 are used to capture the content information of the works. We utilize the Journal Descriptor Indexing tool29 to generate a ranked list of JDs or STs as an output for a given textual content (ie, title, abstract, and other textual matter of a citation where available).

Features and supervised learning

Based on the author profiles, we calculate the similarity of pairwise author profiles (hereafter referred to as features) and divide them into 4 groups according to the type of metadata or data source: Inner Group (metadata from PubMed only), InnerName Group (name metadata from PubMed only), EnhancedName Group (name metadata enhanced by external databases), and ExtendedIDs Group (ID-related metadata from external databases only). Table 2 describes the feature groups and the corresponding author profiles, metadata, and data sources. Note that we no longer itemize the Inner Group

Table 1. Internal and external metadata

| Source | Metadata |
|---|---|
| PubMed (Internal) | Author Name, Affiliation, Coauthor, Paper Title, Abstract, Keyword, MeSH Heading, Publication Date, Journal Title, Language, Reference |
| S2/MAG/PKG (External) | Author Name, Author ID, Coauthor IDs |

MAG: Microsoft Academic Graph; PKG: PubMed Knowledge Graph; S2: Semantic Scholar.
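The linking and name-enhancement steps described under Database linking and Author profile building can be illustrated with a minimal sketch. It assumes each authorship is keyed by (PMID, author position) and that the enhanced first name is the longest collected variant; the record layout, field names, and the PMID used in the example are hypothetical and are not taken from the authors' released code.

```python
from typing import Dict, Tuple

# Key for one authorship: (PMID, 1-based author position). Hypothetical layout.
AuthorKey = Tuple[str, int]
Record = Dict[str, str]


def link_authorships(
    pubmed: Dict[AuthorKey, Record],
    external: Dict[str, Dict[AuthorKey, Record]],
) -> Dict[AuthorKey, Record]:
    """Copy external author IDs and first names onto matching PubMed authorships."""
    linked: Dict[AuthorKey, Record] = {}
    for key, pm_record in pubmed.items():
        merged = dict(pm_record)
        for source, table in external.items():
            ext = table.get(key)  # match on same PMID and same author position
            if ext is None:
                continue  # this database does not cover the authorship
            merged[f"{source}_author_id"] = ext["author_id"]
            merged[f"{source}_first_name"] = ext["first_name"]
        linked[key] = merged
    return linked


def enhanced_first_name(record: Record) -> str:
    """Pick the longest first name among the PubMed and external variants."""
    candidates = [record.get("first_name", "")]
    candidates += [v for k, v in record.items() if k.endswith("_first_name")]
    return max(candidates, key=len)


# Made-up example data: one PubMed authorship and two external matches.
pubmed = {("31000001", 1): {"last_name": "Zhang", "first_name": "L"}}
external = {
    "mag": {("31000001", 1): {"author_id": "mag-1", "first_name": "Li"}},
    "s2": {("31000001", 1): {"author_id": "s2-7", "first_name": "L"}},
}
record = link_authorships(pubmed, external)[("31000001", 1)]
print(enhanced_first_name(record))  # Li
```

Keying on position rather than on the name string means the join still works when one database stores only an initial, which is exactly the case the enhancement is meant to repair.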

Table 2. Feature group settings

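| Feature Group | Feature Name | Similarity Measurement | Author Profile | Metadata | Source |
|---|---|---|---|---|---|
| InnerName/EnhancedName | Full Name Similarity | $\lvert \mathrm{char}(fn_i) \cap \mathrm{char}(fn_j) \rvert / \lvert \mathrm{char}(fn_i) \cup \mathrm{char}(fn_j) \rvert$ | Full Name (fn) | Author Name | PubMed/(PubMed, MAG, S2) |
| InnerName/EnhancedName | Last Name Length | $(\lvert \mathrm{char}(ln_i) \rvert + \lvert \mathrm{char}(ln_j) \rvert) / 2$ | Last Name (ln) | Author Name | PubMed/(PubMed, MAG, S2) |
| InnerName/EnhancedName | Full Name Difference | $\lvert \mathrm{char}(fn_i) \cup \mathrm{char}(fn_j) \rvert - \lvert \mathrm{char}(fn_i) \cap \mathrm{char}(fn_j) \rvert$ | Full Name | Author Name | PubMed/(PubMed, MAG, S2) |
| InnerName/EnhancedName | Last Name Ambiguity Score | $\dfrac{\mathrm{TermFreq}(ln_i) + \mathrm{TermFreq}(ln_j)}{2 \sum_{k=0}^{u} \mathrm{TermFreq}(ln_k)}$ | Last Name | Author Name | PubMed/(PubMed, MAG, S2) |
| ExtendedIDs | MAG Author ID Consistency | $mid_i = mid_j$ | MAG Author ID (mid) | Author ID | MAG |
| ExtendedIDs | S2 Author ID Consistency | $sid_i = sid_j$ | S2 Author ID (sid) | Author ID | S2 |
| ExtendedIDs | PKG Author ID Consistency | $pid_i = pid_j$ | PKG Author ID (pid) | Author ID | PKG |
| ExtendedIDs | MAG Coauthor IDs Overlap | $\lvert mcid_i \cap mcid_j \rvert$ | MAG Coauthor IDs (mcid) | Coauthor IDs | MAG |
| ExtendedIDs | S2 Coauthor IDs Overlap | $\lvert scid_i \cap scid_j \rvert$ | S2 Coauthor IDs (scid) | Coauthor IDs | S2 |
| ExtendedIDs | PKG Coauthor IDs Overlap | $\lvert pcid_i \cap pcid_j \rvert$ | PKG Coauthor IDs (pcid) | Coauthor IDs | PKG |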

To demonstrate feature group contributions in the Results, the features are organized into several groups. MAG: Microsoft Academic Graph; PKG: PubMed Knowledge Graph; S2: Semantic Scholar.
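The similarity measurements of Table 2 are simple enough to sketch directly. The version below is illustrative only: it assumes that char(·) yields the set of lowercase characters of a name with spaces removed and that TermFreq is a plain frequency table of last names. Neither choice is confirmed by the paper, and all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def chars(name: str) -> set:
    """Assumed reading of char(.): the set of lowercase characters, spaces dropped."""
    return set(name.lower().replace(" ", ""))


def full_name_similarity(fn_i: str, fn_j: str) -> float:
    """Jaccard overlap of the two character sets."""
    a, b = chars(fn_i), chars(fn_j)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def full_name_difference(fn_i: str, fn_j: str) -> int:
    """Number of characters found in only one of the two names."""
    a, b = chars(fn_i), chars(fn_j)
    return len(a | b) - len(a & b)


def last_name_length(ln_i: str, ln_j: str) -> float:
    """Average size of the two last-name character sets."""
    return (len(chars(ln_i)) + len(chars(ln_j))) / 2


def last_name_ambiguity(ln_i: str, ln_j: str, term_freq: Counter) -> float:
    """Share of all last-name occurrences taken by the two last names, halved."""
    total = sum(term_freq.values())
    return (term_freq[ln_i] + term_freq[ln_j]) / (2 * total) if total else 0.0


def id_consistency(id_i: Optional[str], id_j: Optional[str]) -> int:
    """1 when both external author IDs exist and are identical, else 0."""
    return int(id_i is not None and id_i == id_j)


def coauthor_overlap(ids_i: Iterable[str], ids_j: Iterable[str]) -> int:
    """Number of coauthor IDs shared by the two citations."""
    return len(set(ids_i) & set(ids_j))


# Toy last-name frequency table (made-up counts).
freq = Counter({"zhang": 6, "huang": 3, "lu": 1})
print(full_name_similarity("li zhang", "l zhang"))   # 6 shared of 7 characters
print(last_name_ambiguity("zhang", "huang", freq))   # 0.45
```

A missing external ID is treated here as "not consistent" rather than as a separate value; a real feature pipeline might prefer an explicit missing indicator so the classifier can tell absence from disagreement.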


used by the competing methods,16,17 as it has been sufficiently described in their studies.

We derive these features from author profiles via the similarity measurements in Table 2. By definition, $\mathrm{char}(\cdot)$ and $\mathrm{word}(\cdot)$ mean extracting characters or word sequences from the input string. $\mathrm{TermFreq}(ln_i)$ denotes the number of occurrences of a particular last name $ln_i$ among all PubMed authors, and $u$ in this formulation is the number of unique last names.

Among all classification algorithms, RF is the most popular algorithm for AND tasks.5,16 Extensive evaluations indicated that the intrinsic complexity of AND is relatively high, and some simple models (eg, logistic regression, Bayesian classifier) have a limited ability to handle this. Hence, in this article, we employ an RF classifier as a way of learning to disambiguate.

EXPERIMENT SETUP

Evaluation datasets

In recent years, 2 manually annotated disambiguation datasets of PubMed have become available for performance evaluation. They were created by Song et al16 and Vishnyakova et al,17 respectively, and are hereafter referred to as SONG and GS. In this article, we use them to examine the effectiveness of AND models. Both datasets were carefully curated. In SONG, multiple annotation iterations were conducted to ensure quality. It is worth mentioning that SONG was specifically designed for the highly productive first author (7.5 average citations per author), as the authors argued that first author-based disambiguation is more important than that of other positions, and it is also easier to disambiguate as more information about the first author can be obtained. GS is the first unbiased gold standard based on 1900 publication pairs. Though small in size, the gold standard shows close similarities to PubMed in terms of chronological distribution and information completeness (eg, coverage of East Asian last names). Note that, in SONG, citations are grouped into "blocks" by names. For evaluation, we transformed the blocks into pairwise form as Vishnyakova et al17 did. After transformation, 74.6% of samples in SONG are negatives, identical to the figure reported by them.17

Metrics

In this article, we use accuracy, precision, recall, and F1 metrics for evaluation, as they have been widely used in classification-based AND research.16,17,30 Each instance (paired citations) in the 2 datasets is predicted to be a binary value (0/1), designating whether the pair is the same person or not. The ground truth is also a binary value (0/1), indicating whether 2 authors are the same person in the real world. Then, a confusion matrix can be constructed from the predicted labels and ground truth labels, and the performance can thus be calculated.

Comparable methods

We introduce 7 baselines for comparison and further categorize them into 2 groups: ID Group and Internal Group. The ID Group consists of 3 types of author IDs disambiguated by the respective databases, and the Internal Group comprises 4 disambiguation methods based on PubMed metadata only.

ID Group:
1. MAG Author ID:22 Author identifier of Microsoft Academic Graph.
2. S2 Author ID:23 Author identifier of Semantic Scholar.
3. PKG Author ID:24 Author identifier of PubMed Knowledge Graph.

Internal Group:
1. MinSong16: A disambiguation method designed for the first author.
2. Vishnyakova17: Current state-of-the-art (sota) method for all author positions.
3. Vishnyakova&MinSong: A combination of the MinSong and Vishnyakova methods. Features of MinSong and Vishnyakova are jointly used to develop this method.
4. InnerName: Corresponding to the InnerName group in Table 2. Features are restricted to those derived from internal name metadata.

For comparison, our methods are the following.
1. EnhancedName: Corresponding to the feature group of EnhancedName in Table 2. Features are extracted from the enhanced name metadata for the purpose of examining the effect of the enhanced name metadata exclusively.

Table 3. Evaluation results on 2 PubMed-related datasets

| Method Group | Method | GS Accuracy (%) | GS Precision (%) | GS Recall (%) | GS F1 (%) | SONG Accuracy (%) | SONG Precision (%) | SONG Recall (%) | SONG F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| ID group | MAG Author ID | 71.77 | 99.07^a | 56.32 | 71.81 | 94.22 | 96.8^a | 78.45 | 86.66 |
| ID group | S2 Author ID | 85.16 | 94.01 | 81.77 | 87.46 | 90.55 | 96.06 | 65.9 | 78.17 |
| ID group | PKG Author ID | 89.76 | 92.39 | 91.28 | 91.83 | 93.31 | 95.6 | 77.2 | 85.42 |
| Internal group | MinSong | 80.28 | 82.38 | 86.9 | 84.52 | 93.5 | 80.3 | 86.99 | 83.18 |
| Internal group | Vishnyakova (sota) | 90.02 | 91.8 | 92.6 | 92.19 | 95.04 | 86.72 | 88.05 | 87.16 |
| Internal group | Vishnyakova&MinSong | 89.95 | 91.17 | 93.02 | 92.06 | 94.94 | 85.37 | 87.95 | 86.17 |
| Internal group | InnerName | 79.11 | 83.35 | 84.65 | 83.98 | 86.95 | 65.9 | 70.32 | 67.36 |
| Our methods | EnhancedName | 85.22 | 88.36 | 87.99 | 88.14 | 89.53 | 68.56 | 88.16 | 76.18 |
| Our methods | ExtendedIDs | 92.07 | 92.56 | 95.67 | 94.07 | 96.93 | 95.93 | 90.28 | 92.83 |
| Our methods | AggAND | 94.66^a | 94.61 | 97.04^a | 95.8^a | 97.1^a | 94.21 | 93.43^a | 93.71^a |

GS covers all author positions; SONG covers the first author only. Note that all results generated by a learning algorithm in Table 3 were calculated by 10-fold cross-validation. In other words, those metrics were averaged over 10 consecutive runs. MAG: Microsoft Academic Graph; PKG: PubMed Knowledge Graph; S2: Semantic Scholar.
^a The best value of each metric (column).
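The four metrics reported in Table 3 follow directly from the confusion matrix described under Metrics. The sketch below is a plain-Python illustration with hypothetical function names; it assumes labels are encoded as 1 (same person) and 0 (different persons), and it is not the authors' evaluation script.

```python
from typing import Dict, Sequence


def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for paired-citation predictions."""
    c = confusion(y_true, y_pred)
    n = sum(c.values())
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
    return {
        "accuracy": (c["tp"] + c["tn"]) / n if n else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Toy labels: 8 citation pairs, 1 = same author, 0 = different authors.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 0, 1, 1, 0]
print(scores(truth, preds))
```

As a consistency check on the table, `f1_from(99.07, 56.32)` evaluates to about 71.81, which matches the F1 reported for MAG Author ID on GS. Rows produced by cross-validation are averages over folds, so their F1 need not equal the harmonic mean of the averaged precision and recall.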

Figure 2. Leave-one-out feature experiments of AggAND; the evaluation results on (A) GS and (B) SONG gold standards. “X-Y” represents removing Y from X. The degree of performance reduction caused by removing an individual feature group indicates the importance of that feature group.
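The leave-one-out protocol behind Figure 2 can be written as a short generic loop. In this sketch, `evaluate` is a hypothetical callable that trains and scores a model on a given feature list and returns a single number such as F1; the toy evaluator and feature names below exist only to make the example runnable and say nothing about the real feature importances.

```python
from typing import Callable, Dict, List, Tuple


def leave_one_group_out(
    groups: Dict[str, List[str]],
    evaluate: Callable[[List[str]], float],
) -> Tuple[float, Dict[str, float]]:
    """Return the full-model score and the score drop when each group is removed."""
    all_features = [f for feats in groups.values() for f in feats]
    full_score = evaluate(all_features)
    drops: Dict[str, float] = {}
    for name, feats in groups.items():
        removed = set(feats)
        kept = [f for f in all_features if f not in removed]
        drops[name] = full_score - evaluate(kept)  # larger drop = more important
    return full_score, drops


# Toy evaluator: the "score" is the sum of made-up per-feature weights.
weights = {"mag_id": 5, "s2_id": 3, "full_name_sim": 1}
toy_groups = {
    "ExtendedIDs": ["mag_id", "s2_id"],
    "EnhancedName": ["full_name_sim"],
}
full, drops = leave_one_group_out(toy_groups, lambda fs: sum(weights[f] for f in fs))
print(full, drops)  # 9 {'ExtendedIDs': 8, 'EnhancedName': 1}
```

Removing whole groups rather than single features keeps correlated features (for example the three author-ID consistency flags) together, so one member cannot mask the loss of another.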

Table 4. Metadata density at author level or paper level

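| Metadata Group | Metadata | Density on the Whole PubMed (%) | Density on GS (%) | Density on SONG (%) | Level |
|---|---|---|---|---|---|
| Enhanced metadata | Enhanced full name | 90.85 | 89.01 | 98.78 | Author level |
| Extended metadata | MAG Author ID | 78.78 | 92.86 | 93.67 | Author level |
| Extended metadata | S2 Author ID | 98.53 | 99.13 | 98.96 | Author level |
| Extended metadata | PKG Author ID | 99.13 | 98.95 | 99.86 | Author level |
| Internal metadata | Paper title | 99.94 | 99.97 | 100 | Paper level |
| Internal metadata | Vernacular title | 12 | 8.73 | 0.14 | Paper level |
| Internal metadata | Original full name | 58.03 | 59.05 | 77.6 | Author level |
| Internal metadata | Email | 7.28 | 7.14 | 57.15 | Author level |
| Internal metadata | Affiliation | 38.5 | 35.85 | 94.92 | Author level |
| Internal metadata | Coauthor | 77.42 | 95.73 | 86.53 | Paper level |
| Internal metadata | Keyword | 20.56 | 22.56 | 4.25 | Paper level |
| Internal metadata | MeSH Heading | 86.67 | 88.03 | 93.56 | Paper level |
| Internal metadata | Abstract | 65.54 | 82.82 | 89.38 | Paper level |
| Internal metadata | Reference | 18.12 | 26.91 | 26.21 | Paper level |
| Internal metadata | Journal title | 100 | 100 | 100 | Paper level |
| Internal metadata | Publication date | 100 | 100 | 100 | Paper level |
| Internal metadata | Databank | 1.18 | 2.11 | 1.64 | Paper level |
| Internal metadata | Grant support | 10.56 | 12.73 | 21.58 | Paper level |
| Internal metadata | Publication type | 100 | 100 | 100 | Paper level |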

MAG: Microsoft Academic Graph; PKG: PubMed Knowledge Graph; S2: Semantic Scholar.
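Density in Table 4 can be read as the percentage of records for which a metadata field is present. The sketch below makes that reading explicit; it assumes a field counts as present when it is neither missing nor empty, which is an interpretation rather than a definition taken from the paper, and the record layout is hypothetical.

```python
from typing import Dict, List, Optional


def density(records: List[Dict[str, Optional[str]]], field: str) -> float:
    """Percentage of records whose `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field))
    return 100.0 * filled / len(records)


# Made-up author-level records: 3 of 4 carry a full name.
authors = [
    {"last_name": "Zhang", "full_name": "Li Zhang"},
    {"last_name": "Huang", "full_name": "Yong Huang"},
    {"last_name": "Yang", "full_name": None},
    {"last_name": "Lu", "full_name": "Wei Lu"},
]
print(density(authors, "full_name"))  # 75.0

# Gain in full-name density on the whole PubMed reported in Table 4.
gain = round(90.85 - 58.03, 2)
print(gain)  # 32.82
```

The 32.82-point gain computed from the table corresponds to the roughly 32% of abbreviated names that the Discussion reports as restored to full names.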

2. ExtendedIDs: Corresponding to the ExtendedIDs group in Table 2. Features are extracted from the extended metadata for the purpose of examining the effect of the external ID metadata exclusively.
3. AggAND: Assumed to be the most effective disambiguation method, corresponding to the combination of the Vishnyakova, EnhancedName, and ExtendedIDs methods. Features are largely derived from external metadata.

RESULTS

PubMed author name disambiguation benchmark

We conducted experiments on 2 gold standard datasets. Results are shown in Table 3. Based on it, several findings can be outlined.

First, among the 3 ID systems, PKG author ID presents the best performance overall. Though MAG is slightly better than PKG on SONG, it performs less well on the dataset that covers all author positions (GS). Note that MAG shows the highest precision overall and a relatively lower recall, suggesting that MAG is able to precisely attribute a proportion of citations to the correct authors.

Second, regarding the combination method, Vishnyakova&MinSong, we did not observe performance gains when MinSong was incorporated with Vishnyakova, suggesting that this combination strategy does not contribute more to this task. Thus, in the following evaluation phase, the Vishnyakova method is used as a part of AggAND instead of MinSong and Vishnyakova&MinSong, because not only is it more effective, but it also contains necessary internal information that is not present in the external databases.

Third, the 2 name-based approaches, InnerName and EnhancedName, show a clear performance difference. The EnhancedName approach gains additional F1 scores of 4.16% and 8.82% on the 2 datasets, respectively. Also, we found that the enhanced name-based approach could achieve performance comparable to state-of-the-art author ID systems. Interestingly, the ExtendedIDs method alone outperforms all baseline methods and was found to be the second-best performer. We also observed that, on top of the Vishnyakova method, our method achieved an even better performance when the feature groups of EnhancedName and ExtendedIDs were added. Our method AggAND, based on metadata enhancement and extension, produced an F1 score of 95.8%, an absolute 3.61% improvement over the current state-of-the-art method (Vishnyakova) on GS. A similar trend can be observed on SONG: a 6.55% improvement in F1 over the state-of-the-art method.

Feature group importance

To investigate the importance of the 3 parts of AggAND, we conducted a leave-one-out feature experiment (see Figure 2); this step helps to better understand the importance of external databases to disambiguation works, as 2 parts of AggAND are established on external databases. From Figure 2, we can identify that ExtendedIDs and Vishnyakova are the 2 most important feature groups on both GS and SONG. This is especially the case for ExtendedIDs, which has the highest impact on performance.

DISCUSSION

Enhanced Name vs Original Name

The Enhanced Name disambiguation method (EnhancedName) led the Original Name method (InnerName) by a large margin, which can be explained by the strong evidence in Table 4, in which an additional 32% of abbreviated names were restored to their full names. This makes the work of disambiguation algorithms easier because more authors with restored full names can be distinguished directly and may not need further judgment by their affiliation and research topics. Therefore, it can be understood that increasing the density of full names could substantially improve the ML-based disambiguation method.

Advantages of our metadata extension method

ExtendedIDs alone outperforms all baseline methods, suggesting that integrating less impressive author IDs can produce an even better ID, which confirms our previous hypothesis about the bagging ideology on external author identifiers. Apart from its effectiveness, we note that our ExtendedIDs method has other advantages. It is easy to obtain the dependent metadata (Extended Metadata in Table 4) from external databases and transform them into features. In addition, such metadata covers almost all PubMed authors, with densities ranging from 78% to 99%, and most of them over 90%, as Table 4 demonstrates. High density is critically important because it not only supports good performance on the evaluation dataset but, more importantly, provides a feasible and practical way to scale to the entire PubMed. Interested scholars can try to build a more accurate author ID system for PubMed using ExtendedIDs.

Disambiguation method based on metadata enhancement and extension

Experimental results demonstrate the superiority of AggAND over existing methods or systems. We attribute this to the rich metadata of PubMed. The National Library of Medicine has made considerable efforts to strengthen PubMed for decades; many rich, discriminative metadata can be extracted for building accurate author profiles so that many excellent methods like Vishnyakova can be developed. In this work, a widely used metadata identifier, PMID, which is generally available for over 30 million PubMed citations, enables us to enhance and extend the internal metadata by linking to other large-scale databases.

Importance of external databases for disambiguation

The leave-one-out experiment shows that extended metadata (ExtendedIDs) has the most impact on performance. For the enhanced metadata (EnhancedName), although the performance reduction when excluding

Figure 3. Two randomly selected false negatives and 2 randomly selected false positives. Note that they were all selected from GS, because this dataset is not restricted to the first author and, more importantly, is very close to PubMed. In each case, a prediction probability denotes the confidence that the pair is a positive sample (0.5 is the decision threshold). For simplicity, we only report the 6 most important features (all feature importance is shown in Supplementary Table 1). The vertical bar above each feature represents the corresponding contribution calculated by random forest using the mean squared error criterion, which is a widely used tool for explainable analysis. The cells below each feature contain 3 lines; the first 2 lines are author profiles and the third line represents a feature value. MAG: Microsoft Academic Graph; PKG: PubMed Knowledge Graph; PMID: PubMed publication identifier.

the EnhancedName feature group is not as large as when excluding other feature groups, we can still observe 0.79% and 2.08% reductions of F1 on the respective datasets. We believe it is a nontrivial performance margin, as it is generally more difficult to improve the performance on top of an existing F1 score of more than 90%. Based on the analysis, it is clear that external databases are critically important for disambiguation works, and the methods incorporating external metadata can be even more effective than the "internal" methods constructed from a variety of PubMed metadata.

Error analysis

Figure 3 shows the cases that AggAND fails to predict. As shown, the importance distribution clearly demonstrates that "PKG Author ID Consistency" contributes the most, indicating that the dominant feature may be the most important factor leading to the errors. For example, in cases 1 and 3, although other features have positive effects on correct prediction (eg, a very close publication year gap and similar JDs in case 1, and different names in case 3), the dominant feature values are opposite to the ground truth.

For cases 2 and 4, values of the dominant feature are in line with the ground truth. The errors are caused by different reasons. In case 2, the paired authors share the same name; however, the ambiguity score is much higher than in other cases, suggesting that "Huang" is a very popular last name, which further implies that they are more likely to be different authors. In addition, the larger gap between publication years and the diverse JDs also exacerbate the matter. Similarly, in case 4, identical author names and close publication years tend to imply that the 2 authorships are the same author.

Suggestions for further improvement

Although our models achieved such success, over 5% of authors in the GS remain ambiguous. We note that the performance can be further improved by considering 1 of the following 2 aspects. First, future studies can explore other metadata as we did, for example, the affiliation metadata, which was proved to be one of the most effective metadata for AND when it is available for most authors.16 However, our evaluation shows that its density is only 38% (see Table 4). Second, future studies can consider author similarity from a semantic perspective. To measure the similarity, we simply used word-level measurements; this approach is very straightforward, and the semantic information was not exploited adequately. Therefore, some cutting-edge semantic-matching algorithms such as DSSM31 are also worth exploring.

Limitations

Our first limitation is that efficiency is not fully considered. Parsing geographically relevant author profiles (used by Vishnyakova et al and our method) like organization or location from affiliation metadata using the named entity recognition model is inefficient according to our empirical observation. In the widely accepted namespace-based disambiguation framework, in which ambiguous authors are aggregated into namespaces according to the same last name and first initial, there are nearly 10 000 namespaces with a size of over 1000 in PubMed. Thus, more computational resources will be required to build this kind of author profile. Another limitation we are aware of is that none of the available datasets addresses the name synonymy problem (ie, name variants of the same person) adequately. This is especially the case for SONG, in which all last names of the same author are the same. However, it is often the case that an author's names in bibliographic databases differ from their actual names for many reasons, such as typographical errors or errors caused by the publishing process. This problem prevents us from examining the performance of our methods in a more realistic scenario.

CONCLUSION

Despite such limitations, our work still makes valuable contributions. We explored, for the first time, large-scale external databases to help disambiguate authors in PubMed. The new disambiguation method, AggAND, based on metadata enhancement and extension, outperformed the current state-of-the-art method by a large margin on the 2 gold standards. Moreover, the feature analysis revealed that external databases can be crucially important for PubMed AND. Though our research focuses only on PubMed, we believe that future disambiguation works on other large-scale bibliographic databases, such as Web of Science or DBLP, will benefit from our findings.

FUNDING

This work was supported by the Major Project of the National Social Science Foundation of China grant number 17ZDA292.

AUTHOR CONTRIBUTIONS

WL designed the study. LZ conducted this work and analyzed the data under WL's supervision. LZ, YH, and JY wrote the manuscript. All authors revised and approved the manuscript for publishing.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

ACKNOWLEDGMENTS

The authors would like to acknowledge the National Center for Biotechnology Information team, the Microsoft Academic team, and the team of Allen Institute for AI for their data collections.

CONFLICT OF INTEREST STATEMENT

None declared.

DATA AVAILABILITY

The data collections underlying this article were provided by the National Library of Medicine, Microsoft Research, the Allen Institute for AI, and the Texas Advanced Computing Center at the University of Texas at Austin under licenses, and they are publicly available at ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline, https://zenodo.org/record/2628216, http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/, and http://er.tacc.utexas.edu/datasets/ped. The code for this article is available in GitHub at https://github.com/carmanzhang/PubMed-AND-method.

REFERENCES

1. Getoor L, Machanavajjhala A. Entity resolution: theory, practice & open challenges. Proc VLDB Endow 2012; 5 (12): 2018–9.

2. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 2007; 19 (1): 1–16.
3. Christen P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 2012; 24 (9): 1537–55.
4. Shen W, Wang J, Han J. Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Trans Knowl Data Eng 2015; 27 (2): 443–60.
5. Sanyal DK, Bhowmick PK, Das PP. A review of author name disambiguation techniques for the PubMed bibliographic database. J Inf Sci 2021; 47 (2): 227–54.
6. Zhang Y, Zhang F, Yao P, et al. Name disambiguation in AMiner: clustering, maintenance, and human in the loop. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018: 1002–11.
7. Herskovic JR, Tanaka LY, Hersh W, et al. A day in the life of PubMed: analysis of a typical day’s query log. J Am Med Inform Assoc 2007; 14 (2): 212–20.
8. Liu W, Islamaj Dogan R, Kim S, et al. Author name disambiguation for PubMed. J Assoc Inf Sci Technol 2014; 65 (4): 765–81.
9. Lerchenmueller MJ, Sorenson O. Author disambiguation in PubMed: evidence on the precision and recall of Author-ity among NIH-funded scientists. PLoS One 2016; 11 (7): e0158731.
10. Harrison AM, Harrison AM. Necessary but not sufficient: unique author identifiers. BMJ Innov 2016; 2 (4): 141–3.
11. Varadharajalu A, Liu W, Wong W. Author name disambiguation for ranking and clustering PubMed data using NetClus. In: Australasian Joint Conference on Artificial Intelligence; 2011: 152–61.
12. Strotmann A, Zhao D, Bubela T. Author name disambiguation for collaboration network analysis and visualization. Proc Am Soc Info Sci Tech 2009; 46 (1): 1–20.
13. Johnson SB, Bales ME, Dine D, et al. Automatic generation of investigator bibliographies for institutional research networking systems. J Biomed Inform 2014; 51: 8–14.
14. Wang H, Wan R, Wen C, et al. Author name disambiguation on heterogeneous information network with adversarial representation learning. AAAI Proc 2020; 34 (1): 238–45.
15. Qiao Z, Du Y, Fu Y, et al. Unsupervised author disambiguation using heterogeneous graph convolutional network embedding. In: 2019 IEEE International Conference on Big Data (Big Data); 2019.
16. Song M, Kim EH-J, Kim HJ. Exploring author name disambiguation on PubMed-scale. J Informetr 2015; 9 (4): 924–41.
17. Vishnyakova D, Rodriguez-Esteban R, Rinaldi F. A new approach and gold standard toward author disambiguation in MEDLINE. J Am Med Inform Assoc 2019; 26 (10): 1037–45.
18. Kim K, Sefid A, Weinberg BA, et al. A web service for author name disambiguation in scholarly databases. In: 2018 IEEE International Conference on Web Services (ICWS); 2018.
19. Hussain I, Asghar S. A survey of author name disambiguation techniques: 2010-2016. Knowl Eng Rev 2017; 32: e22.
20. Torvik VI, Smalheiser NR. Author name disambiguation in MEDLINE. ACM Trans Knowl Discov Data TKDD 2009; 3 (3): 11.
21. Kim J, Kim J. Effect of forename string on author name disambiguation. J Assoc Inf Sci Technol 2020; 71 (7): 839–55.
22. Sinha A, Shen Z, Song Y, et al. An overview of Microsoft Academic Service (MAS) and applications. In: WWW ’15 Companion: Proceedings of the 24th International Conference on World Wide Web; 2015: 243–6.
23. Ammar W, Groeneveld D, Bhagavatula C, et al. Construction of the Literature Graph in Semantic Scholar, AWS Simple Cloud Storage [dataset]. http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus. Accessed December 1, 2019.
24. Xu J, Kim S, Song M, et al. Building a PubMed knowledge graph, Texas Advanced Computing Center Storage [dataset]. http://er.tacc.utexas.edu/datasets/ped. Accessed December 20, 2019.
25. Zhang L, Huang Y, Cheng Q, et al. Mining author identifiers for PubMed by linking to open bibliographic databases. In: 2020 IEEE 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C); 2020.
26. Breiman L. Bagging predictors. Mach Learn 1996; 24 (2): 123–40.
27. Manning CD, Surdeanu M, Bauer J, et al. The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics; 2014: 55–60.
28. Vishnyakova D, Rodriguez-Esteban R, Ozol K, et al. Author name disambiguation in MEDLINE based on journal descriptors and semantic types. In: Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining; 2016: 134–42.
29. Humphrey SM, Lu CJ, Rogers WJ, et al. Journal descriptor indexing tool for categorizing text according to discipline or semantic type. AMIA Annu Symp Proc 2006; 2006: 960.
30. Treeratpituk P, Giles CL. Disambiguating authors in academic publications using random forests. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries; 2009: 39–48.
31. Huang P-S, He X, Gao J, et al. Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management; 2013.
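
As a supplementary illustration of the namespace-based framework mentioned under Limitations, the following minimal Python sketch groups authorships into namespaces by last name and first initial. It is not taken from the AggAND code base: the function names, the key format, and the toy records are hypothetical, and the pair count assumes that every pair of authorships within a namespace is compared.

```python
from collections import defaultdict


def namespace_key(last_name: str, first_name: str) -> str:
    """Build a blocking key from the last name and the first initial."""
    initial = first_name.strip()[:1].lower()
    return f"{last_name.strip().lower()}_{initial}"


def build_namespaces(authorships):
    """Group (last_name, first_name, pmid) tuples by namespace key."""
    namespaces = defaultdict(list)
    for last_name, first_name, pmid in authorships:
        namespaces[namespace_key(last_name, first_name)].append(pmid)
    return dict(namespaces)


def pairwise_comparisons(size: int) -> int:
    """Number of authorship pairs in a namespace of the given size
    (assumption: all pairs inside a namespace are compared)."""
    return size * (size - 1) // 2


# Hypothetical toy records: (last name, first name or initial, PMID).
records = [
    ("Huang", "Yong", 101),
    ("Huang", "Y", 102),
    ("Huang", "Ying", 103),
    ("Lu", "Wei", 104),
]
namespaces = build_namespaces(records)
```

Under the all-pairs assumption, a single namespace of size 1000 already yields 499 500 candidate pairs, which gives a sense of why the large namespaces noted under Limitations are computationally demanding.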