Indexing Graph Structured Data


MASARYK UNIVERSITY
FACULTY OF INFORMATICS

PH.D. THESIS

Stanislav Bartoň

Brno, January 2007

Acknowledgement

Though the following dissertation is an individual work, I could never have reached the heights or explored the depths without the help, support, guidance and efforts of a lot of people. Firstly, I would like to thank my supervisor Prof. Pavel Zezula for instilling in me the qualities of being a good researcher and scientist. A very special thank you to my friends and colleagues Dr. Vlastislav Dohnal, Dr. Michal Batko and David Novák for the support they have lent me over all these years and also for providing me with contributions and comments on this work that have made a difference. Finally, I would like to acknowledge and sincerely thank my beloved parents for their inexhaustible faith and confidence in me and my abilities.

Abstract

This Ph.D. thesis addresses indexing techniques for the efficient discovery of complex relationships among entities in graph structured data. We propose a novel structure for evaluating ρ-path operator queries, a special type of operator that asks for all paths up to a certain limiting length lying between a pair of inspected vertices in the indexed graph. We have compared it with various approaches suitable for this task and conducted numerous experiments to verify its properties.

The proposed approach is based on a graph simplification method that we call graph segmentation. Applying graph segmentation recursively yields a multilevel tree-like indexing structure called the ρ-index. In this tree, each level represents a simplified graph and each node holds a path type matrix describing a particular graph segment. An algorithm that evaluates ρ-path queries using the ρ-index is proposed and evaluated.

The experiments are conducted on synthetic, randomly generated graphs.
These were acquired using our own incremental algorithm. These graphs represent the most general case for studying the ρ-index's properties, since they exhibit no particular structure. The thesis also contains an evaluation on real-life data: a citation graph of 30,000 scientific publications taken from the CiteSeer database. That part of the thesis also proposes novel approaches to discovering important publications in the network, influenced by the user's predefined context.

Supervisor: prof. Ing. Pavel Zezula, CSc.

Keywords

index structures, graph structured data, path search algorithms, semantic web, citation networks, citation analysis, context citation publication search

Contents

1 Introduction
  1.1 Statement of the Problem
  1.2 Research Objectives
2 Background
  2.1 Graph Queries
    2.1.1 Graph Containment Query
    2.1.2 ρ-operators
    2.1.3 Vertex Reachability Query
  2.2 Graph Theory
    2.2.1 Basic Definitions
    2.2.2 Graph Segmentation
  2.3 Segmentation Hypotheses
    2.3.1 Correctness of a Proper Sequence of Segments Representation
    2.3.2 Connecting Path of a Sequence of Segments
    2.3.3 Imposing a Weight Limit l
    2.3.4 Iteration Step
3 Indices for Graph Structured Data and Related Work
  3.1 Algorithmic Approaches
    3.1.1 Graph Algorithms
      Single Source Path Expression Problem
    3.1.2 Transitive Closure Computation Algorithms
      Matrix-based Direct Algorithms
      Graph-based Direct Algorithms
      Hybrid Algorithms
    3.1.3 Summary
  3.2 Graph Structured Data Indices
    3.2.1 Graph containment queries
    3.2.2 Reachability queries
      Interval Based Approach
      2-hop Approach
      Hierarchical Labeling of Sub-Structures
    3.2.3 Pathway Oriented Indexing Schemes
      Class and Path Index
      An Indexing Scheme for RDF and RDF Schema Based on Suffix Arrays
    3.2.4 Summary
4 ρ-index
  4.1 Structure of the Index
    4.1.1 Path Type Matrix
    4.1.2 Tables of Transitions Among Segments
    4.1.3 ρ-index's Structure Outline
  4.2 Transcription Graph
    4.2.1 Formal Transcription Methods
      Transforming the existPathTo Type of Edge
      Transforming the transitionTo Type of Edge
      Transforming the Dependency Types of Edges
      Soft and Hard Minimal Path Weights
    4.2.2 Strategy of the Transcription Process
      Maintaining and Utilizing the Soft and Hard Minimal Weights
  4.3 ρ-index Creation Algorithm
    4.3.1 Graph Segmentation Methods and Strategies
      General Vertex Clustering Method
      Segmentation Using Topological Order
      Summary
    4.3.2 Transitive Closure Computation of the Path Type Matrix
      Sequence of Segments Weight Computation
      Suffix Tree for Disconnected Sequences of Segments
  4.4 Search Algorithms
    4.4.1 ρ-path Algorithm
      The Initial State of the Transcription Graph
      The Result
    4.4.2 ρ-connection Algorithm Outline
5 ρ-index Evaluation
  5.1 Data Collection
    5.1.1 Random Graph Models
    5.1.2 Incremental Random Graph Generation Algorithm
  5.2 ρ-index Creation Time
  5.3 Search Complexity
    5.3.1 Search Complexity of Positive Search
    5.3.2 Search Complexity of Negative Search
    5.3.3 Search Complexity of Queries with Limited Maximal Path Length
    5.3.4 Search Complexity Affected by the Parameter Settings
    5.3.5 Summary
6 Applying ρ-index in Citation Analysis
  6.1 Bibliometrics
    6.1.1 Citation Analysis
    6.1.2 Material Search Strategies
    6.1.3 Indirect Citation Relationships
    6.1.4 Ranking of Publications
      HITS
      PageRank
      SCEAS
    6.1.5 Topological Studies of a Citation Network
    6.1.6 Mapping of Science
  6.2 Implementing Indirect Complex Relationships Using ρ-operators
    6.2.1 Data Set
    6.2.2 ρ-path Results
    6.2.3 Semantic Analysis
      Implementing Forward Chaining by ρ-index
    6.2.4 Summary
7 Conclusion
  7.1 Summary
  7.2 Contribution
  7.3 Further Research Directions
Bibliography

List of Figures

2.1 Graph G′ is isomorphic to the subgraph of a graph G that is induced by a set of vertices {b, c, f} where the function f is defined as follows: f(b) = x, f(c) = y, f(d) = z.
2.2 ρ-path applied to vertices a and b. The result comprises all possible paths lying between the inspected vertices: ρ-path(a, b) = {e1e2, e3e4, e5e6}.
2.3 A result of ρ-connectionTo(a, b). One connection is denoted by a pair of paths (e1e2, e5e4) terminated in vertex g and the other by a pair (e3e6, e7e8) terminated in vertex h.
2.4 A result of ρ-connectionFrom(g, h). One connection is denoted by a pair of paths (e1e2, e3e6) initiated in vertex a and the other by a pair (e5e4, e7e8) initiated in vertex b.
2.5 Segmentation of a graph G and its segment graph SG(G).
2.6 An example of a segmentation where the sequence of segments (S5S6S4S3) does not represent any path in G.
2.7 An example of partial connecting paths PCPi and a set of minimal connecting paths CPs for a sequence of segments.
2.8 Demonstration of a segment weight assignment.
3.1 A short example of a directed acyclic graph (DAG) and its transitive closure matrix.
3.2 Two sets of vertices connected through a single vertex and the corresponding submatrix in the transitive closure matrix.
3.3 An example illustrating Definition 3.2.3. The solid edges belong to the spanning forest T. Grey vertices indicate exposed vertices. Vertex 6 is not exposed, but it is the out-portal of 3 since it is the least common ancestor of all of 3's exposed descendants, vertices 8 and 9.
3.4 An example of an RDF graph.
3.5 A directed acyclic graph together with two extracted path expressions. The character . denotes a delimiter of characters in the path expression.
3.6 All possible suffixes generated from the pair of extracted path expressions from Figure 3.5. The suffixes are then lexicographically ordered and duplicates are removed. The resulting suffix array is [1,1], [2,1], [1,3], [2,3], [1,5], [2,5], [1,7], [1,9], [1,2], [2,2], [1,4], [2,4], [1,6], [2,6], [1,8].
4.1 A path type matrix M for a directed graph.
4.2 One step in the computation of the transitive closure of the path type matrix.
4.3 A fragment of a graph segmentation accompanied by the segment graph and a segment transition table for each of the participating segments.
4.4 Reversed transition tables for segments S1, S2 and S3 from Figure 4.3.
4.5 Visual outline of the ρ-index's structure.
4.6 Initial state of a transcription graph for a search for all paths between vertices 1 and 10.
4.7 Transcription of a transition to a lower level.
4.8 Transformation of an existsPathTo type of edge where the entry pXY of the path type matrix P is pXY = {(XA1A2A4A5Y), (XA1A3A4A5Y), (XB1A2A4A5Y), (XB1B2Y)}.
4.9 Transformation of a transition where the border edges between the segments X and Y are EDGES OUT(X) ∩ EDGES IN(Y) = {(A1, B1), (A2, B2), ..., (An, Bn′)}.
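
The ρ-path operator described in the abstract returns all paths up to a limiting length between two vertices. As a point of reference only, the following minimal Python sketch evaluates such a query by naive depth-first enumeration, without any index; it is not the ρ-index algorithm of the thesis. The function name rho_path, the edge-dictionary layout, the restriction to simple directed paths, and the example graph (modelled loosely on the caption of Figure 2.2, with the intermediate vertices c, d and f invented here) are all assumptions made for illustration.

```python
from collections import defaultdict


def rho_path(edges, source, target, max_len):
    """Return all simple directed paths from source to target that use
    at most max_len edges, each as a tuple of edge labels.

    Assumption: `edges` maps an edge label to a (tail, head) pair.
    """
    adjacency = defaultdict(list)
    for label, (tail, head) in edges.items():
        adjacency[tail].append((label, head))

    found = []

    def extend(vertex, visited, labels):
        if vertex == target and labels:
            found.append(tuple(labels))
            return  # a simple path cannot pass through the target again
        if len(labels) == max_len:
            return  # length limit reached without hitting the target
        for label, successor in adjacency[vertex]:
            if successor not in visited:
                visited.add(successor)
                labels.append(label)
                extend(successor, visited, labels)
                labels.pop()
                visited.remove(successor)

    extend(source, {source}, [])
    return sorted(found)


# Hypothetical graph: three two-edge routes from a to b.
EDGES = {
    "e1": ("a", "c"), "e2": ("c", "b"),
    "e3": ("a", "d"), "e4": ("d", "b"),
    "e5": ("a", "f"), "e6": ("f", "b"),
}

print(rho_path(EDGES, "a", "b", max_len=2))
# prints [('e1', 'e2'), ('e3', 'e4'), ('e5', 'e6')]
```

A baseline like this explores every path of bounded length leaving the source, including those that never reach the target; avoiding that work is the kind of saving an index aims for.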