©Copyright 2021 Jason Portenoy

Harnessing Scholarly Literature as Data to Curate, Explore, and Evaluate Scientific Research

Jason Portenoy

A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

University of Washington
2021

Reading Committee:
Jevin D. West, Chair
Emma Stuart Spiro
William Gregory Howe

Program Authorized to Offer Degree: Information School

University of Washington

Abstract

Harnessing Scholarly Literature as Data to Curate, Explore, and Evaluate Scientific Research

Jason Portenoy

Chair of the Supervisory Committee:
Associate Professor Jevin D. West
Information School

There currently exist hundreds of millions of scientific publications, with more being created at an ever-increasing rate. This is leading to information overload: the scale and complexity of this body of knowledge is increasing well beyond the capacity of any individual to make sense of it all, overwhelming traditional, manual methods of curation and synthesis. At the same time, the availability of this literature and surrounding metadata in structured, digital form, along with the proliferation of computing power and techniques to take advantage of large-scale and complex data, represents an opportunity to develop new tools and techniques to help people make connections, synthesize, and pose new hypotheses.

This dissertation consists of several contributions of data, methods, and tools aimed at addressing information overload in science. My central contribution to this space is Autoreview, a framework for building and evaluating systems that automatically select relevant publications for literature reviews, starting from small sets of seed papers. These automated methods have the potential to help researchers save time and effort when keeping up with relevant literature, as well as to surface papers that more manual methods may miss. I show that this approach can successfully recommend relevant literature, and that it can also be used to systematically compare different features used in the recommendations. I also present the design, implementation, and evaluation of several visualization tools. One of these is an animated network visualization showing the influence of a scholar over time. Another is SciSight, an interactive system for recommending new authors and research by finding similarities along different dimensions. Additionally, I discuss the current state of available scholarly data sets; my work curating, linking, and building upon these data sets; and methods I developed to scale graph clustering techniques to very large networks.

TABLE OF CONTENTS

List of Figures
List of Tables
Acknowledgements

Chapter 1: Introduction
  1.1 Science of Science
  1.2 Research and Projects

Chapter 2: Scholarly Publication Data and Citation Networks
  2.1 Data Sets
    2.1.1 Web of Science
    2.1.2 Microsoft Academic Graph
    2.1.3 Other comprehensive scholarly data sets
    2.1.4 Other data sources
    2.1.5 Building on top of existing data sources
    2.1.6 Data limitations
    2.1.7 What comes next? The future of scholarly data
  2.2 Network Analysis
    2.2.1 Infomap, the Map Equation, and RelaxMap
    2.2.2 Parallel hierarchical clustering
    2.2.3 Applications, computation, and runtime

Chapter 3: Autoreview
  3.1 Author Preface
  3.2 Abstract
  3.3 Introduction
  3.4 Background
  3.5 Data and Methods
    3.5.1 Data
    3.5.2 Identifying candidate papers and setting up the supervised learning problem
    3.5.3 Features
  3.6 Results
    3.6.1 Application to a single review article
    3.6.2 Large-scale study on multiple review papers
    3.6.3 Extended analysis
    3.6.4 Exploring scientific fields using automated literature review
  3.7 Discussion
  3.8 Acknowledgements

Chapter 4: Visual exploration and evaluation of the scholarly literature
  4.1 Nautilus Diagram: Visualizing Academic Influence Over Time and Across Fields
    4.1.1 Author Preface
    4.1.2 Abstract
    4.1.3 Introduction
    4.1.4 Background
    4.1.5 Methods
    4.1.6 Design
    4.1.7 Results
    4.1.8 Discussion and future work
    4.1.9 Conclusion
    4.1.10 Acknowledgements
  4.2 Case Studies in Science Visualization
  4.3 SciSight / Bridger
    4.3.1 Author Preface
    4.3.2 Abstract
    4.3.3 Introduction
    4.3.4 Related Work
    4.3.5 Bridger: System Overview
    4.3.6 Experiment I: Author Depiction
    4.3.7 Experiment II: Author Discovery
    4.3.8 User Interviews: Analysis & Discussion of Author Discovery
    4.3.9 Conclusion

Chapter 5: Conclusion

LIST OF FIGURES

2.1 Growth of the scientific literature over time. For MAG, journal and conference articles and book chapters are included; patents, repositories, and data sets are excluded.

3.1 Schematic of the framework used to collect data for development and testing of a supervised literature review classifier. (a) Start with an initial set of articles (i.e., the bibliography of an existing review article). (b) Split this set into seed papers (S) and target papers (T). (c) Collect a large set of candidate papers (C) from the seed papers by collecting in- and out-citations, two degrees out. Label these papers as positive or negative based on whether they are among the target papers (T). (d) Split the candidate papers into a training set and a test set to build a supervised classifier, with features based on similarity to the seed papers (S). (A minimal code sketch of this setup follows the list of figures.)

3.2 Violin plot showing the distribution of R-Precision scores (number of correctly predicted target papers divided by total number of target papers) for 2,500 classifiers, each trained on one of 500 different review articles. The violin plot shows a box plot in the center, surrounded by a mirrored probability distribution for the scores. The distribution is annotated with the titles of three review articles. The review article in the lower tail was one of those for which the classifiers did most poorly at predicting references (mean score: 0.14). The one in the upper tail is an example of a review paper whose classifiers performed best (0.65). The one in the middle, at the fattest part of the distribution, is more or less typical of the review articles in our set (0.39).

3.3 Box plots of the R-Precision scores for the 500 review articles, by subject, using 50 seed papers and network and TF-IDF title features. See text for discussion.

3.4 R-Precision scores for autoreview, varying the number of seed/target papers and the sets of features used. Each point represents the mean of the R-Precision scores for 500 models (5 each for different seed/target splits of the references of 100 review papers). The error bars represent 95% confidence intervals.

3.5 Average R-Precision scores for different-size review articles. The middle (red) bar for each feature set represents the average score for the same 100 review articles using the same procedure as in Fig. 3.4 (seed size 50).
The other two bars in each group represent a different set of review articles: the left, a set of 100 smaller reviews (50 references on average); the right, a set of 100 larger reviews (945 references on average). Error bars represent 95% confidence intervals.

4.1 Example nautilus diagram, showing the influence of an author over time.

4.2 Top Left: (A) The center node represents all publications of a particular scholar. (B) Nodes that appear around the center represent publications that cited work by this scholar. (C) The size of each node shows a citation-based indicator (Eigenfactor) of how influential that paper has been. (D) Colors show the different fields to which the papers apply. Bottom Left: Integrated timeline charts below the network visualization. (E) Number of publications by the central scholar by year. (F) Number of citations received by the central scholar by year. (G) Sum of the Eigenfactor for all of the publications published by the central author in each year. Colors show the periods before, during, and after funding from the Pew program. Right side: Comparing the densities of two different graphs. (H) is a sparse graph that shows diffuse influence across fields (i.e., interdisciplinary influence). (I) is a dense graph that shows a close-knit citation community within one domain. (An illustrative sketch of this encoding follows the list of figures.)

4.3 Four stories that emerged from demonstrations with the scholars. (A) shows a scholar who had influence in a field she hadn’t expected. (B) shows a career shift reflected in changing color bands in the graph. (C) shows an early-career peak in influence that prompted a scholar to reflect on the freedoms afforded by different research positions. (D) shows a scholar with influence in very diverse areas.

4.4 Visualizations for collections of papers. Top left: The cluster network diagram shows citation relationships between clusters of papers related to Information Security and Ethics. Clusters are colored according to the ratio of InfoSec or Ethics papers within, and links show citations between the clusters. Top right: Coauthorship network for researchers publishing in the fields of science communication and misinformation. Nodes represent authors; links represent joint authorship on the same paper. The colored clusters often correspond to research labs or groups. Bottom left: Interactive visualizations showing collections of articles: a timeline of papers by year (above) and a citation network (below). Bottom right: Screenshot of the SciSight visualization for COVID-19 research. The nodes of the network are “cards” representing groups of researchers, and links represent different types of relationships between them.
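The data-collection and evaluation pipeline summarized in Figures 3.1 and 3.2 can be sketched in a few lines of Python. The sketch below is illustrative only, not the dissertation's implementation: the citation-lookup callables get_references and get_citations stand in for queries against a bibliographic database such as MAG, and the ranking step (a supervised classifier built on network- and text-similarity features) is omitted.

# Minimal sketch of the Autoreview setup described in Figures 3.1 and 3.2.
# get_references and get_citations are hypothetical callables mapping a paper id
# to the ids of its references / citing papers; they are not part of the
# dissertation's code.
import random

def split_seed_target(review_references, num_seed=50, rng_seed=0):
    """(b) Split a review's bibliography into seed papers S and target papers T."""
    papers = list(review_references)
    random.Random(rng_seed).shuffle(papers)
    return set(papers[:num_seed]), set(papers[num_seed:])

def collect_candidates(seed_papers, get_references, get_citations):
    """(c) Gather candidate papers C: in- and out-citations of the seeds, two degrees out."""
    frontier = set(seed_papers)
    candidates = set()
    for _ in range(2):
        next_frontier = set()
        for paper_id in frontier:
            next_frontier |= set(get_references(paper_id)) | set(get_citations(paper_id))
        candidates |= next_frontier
        frontier = next_frontier
    return candidates - set(seed_papers)

def label_candidates(candidates, target_papers):
    """Label each candidate positive if it appears among the target papers T."""
    return {paper_id: (paper_id in target_papers) for paper_id in candidates}

def r_precision(ranked_candidates, target_papers):
    """R-Precision: fraction of the top R ranked candidates (R = |T|) that are targets."""
    top_r = ranked_candidates[:len(target_papers)]
    return sum(1 for paper_id in top_r if paper_id in target_papers) / len(target_papers)

In the full framework, the labeled candidates are split into training and test sets, a classifier scores every candidate by its similarity to the seed papers, and r_precision measures how many of the held-out target papers appear at the top of the resulting ranked list.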
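Figure 4.2 also specifies a compact visual encoding: node size maps to Eigenfactor and node color maps to field. The sketch below is a rough illustration of that encoding using networkx and matplotlib, with a simplified, hypothetical record format ('id', 'eigenfactor', 'field_color'); the nautilus diagram itself is an animated, interactive visualization rather than a static plot.

# Illustrative rendering (not the dissertation's implementation) of the Figure 4.2 encoding.
import networkx as nx
import matplotlib.pyplot as plt

def draw_influence_graph(citing_papers):
    G = nx.Graph()
    G.add_node("scholar")                       # (A) central node: the focal scholar
    for paper in citing_papers:
        G.add_node(paper["id"])                 # (B) papers citing the scholar's work
        G.add_edge("scholar", paper["id"])
    pos = nx.spring_layout(G, seed=42)
    # Node lists follow insertion order: the scholar first, then the citing papers.
    sizes = [1200] + [3000 * p["eigenfactor"] for p in citing_papers]   # (C) size ~ Eigenfactor
    colors = ["lightgray"] + [p["field_color"] for p in citing_papers]  # (D) color ~ field
    nx.draw(G, pos, node_size=sizes, node_color=colors)
    plt.show()

draw_influence_graph([
    {"id": "p1", "eigenfactor": 0.40, "field_color": "steelblue"},
    {"id": "p2", "eigenfactor": 0.10, "field_color": "seagreen"},
    {"id": "p3", "eigenfactor": 0.25, "field_color": "indianred"},
])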