Analysing Ranking Algorithms and Publication Trends on Scholarly Citation Networks
by Marcel Dunaiski

Thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science in the Faculty of Science at Stellenbosch University

Computer Science Division, Department of Mathematical Sciences, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa.

Supervisors: W. Visser, J. Geldenhuys

Stellenbosch University http://scholar.sun.ac.za

Declaration

By submitting this thesis electronically, I declare that the entirety of the work contained therein is my own, original work, that I am the sole author thereof (save to the extent explicitly otherwise stated), that reproduction and publication thereof by Stellenbosch University will not infringe any third party rights and that I have not previously in its entirety or in part submitted it for obtaining any qualification.

Date: 2014/08/31

Copyright © 2014 Stellenbosch University. All rights reserved.

Abstract

Analysing Ranking Algorithms and Publication Trends on Scholarly Citation Networks

M.P. Dunaiski
Computer Science Division, Department of Mathematical Sciences, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa.

Thesis: MSc, August 2014

Citation analysis is an important tool in the academic community. It can help universities, funding bodies, and individual researchers to evaluate scientific work and direct resources appropriately. With the rapid growth of the scientific enterprise and the increase of online libraries that include citation analysis tools, the need for a systematic evaluation of these tools becomes more important. The research presented in this study deals with scientific research output, i.e., articles and citations, and how they can be used in bibliometrics to measure academic success.
More specifically, this research analyses algorithms that rank academic entities such as articles, authors and journals to address the question of how well these algorithms can identify important and high-impact entities.

A consistent mathematical formulation is developed on the basis of a categorisation of bibliometric measures such as the h-index, the Impact Factor for journals, and ranking algorithms based on Google’s PageRank. Furthermore, the theoretical properties of each algorithm are laid out.

The ranking algorithms and bibliometric methods are computed on the Microsoft Academic Search citation database, which contains 40 million papers and over 260 million citations that span multiple academic disciplines. We evaluate the ranking algorithms by using a large test data set of papers and authors that won renowned prizes at numerous Computer Science conferences.

The results show that using citation counts is, in general, the best ranking metric. However, for certain tasks, such as ranking important papers or identifying high-impact authors, algorithms based on PageRank perform better.

As a secondary outcome of this research, publication trends across academic disciplines are analysed to show changes in publication behaviour over time and differences in publication patterns between disciplines.

Opsomming

Analise van rangalgoritmes en publikasie tendense op wetenskaplike sitasienetwerke
(Analysing Ranking Algorithms and Publication Trends on Scholarly Citation Networks)

M.P. Dunaiski
Computer Science Division, Department of Mathematical Sciences, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa.

Thesis: MSc, August 2014

Citation analysis is an important instrument in the academic environment. It can help universities, funding bodies and individual researchers to evaluate scientific work and allocate resources appropriately. With the rapid growth of scientific output and the increase in online libraries that include citation analysis, the need for a systematic evaluation of these tools is becoming ever more important.

The research in this study deals with the output of scientific research, that is, articles and citations, and how they can be used in bibliometric studies to measure academic success. More specifically, this research analyses algorithms that rank academic entities such as articles, authors and journals. It shows how effectively these algorithms can identify important and high-impact entities.

A comprehensive mathematical formulation is developed from a collection of bibliometric methods such as the h-index, the Impact Factor for journals and the ranking algorithms based on Google’s PageRank. Furthermore, the theoretical properties of each algorithm are set out.

The ranking algorithms and bibliometric methods use the citation database of Microsoft Academic Search for their computations. It contains 40 million articles and more than 260 million citations, spanning various academic disciplines. We use a large set of test data of documents and authors that won well-known prizes at numerous computer science conferences to evaluate the ranking algorithms.

The results show that the use of citation counts is, in general, the best ranking method. For certain tasks, however, such as ranking important articles or identifying high-impact authors, algorithms based on PageRank perform better.

A secondary result of this research is the analysis of publication trends in various academic disciplines, showing changes in publication behaviour over time and pointing out the differences in publication patterns between disciplines.
Contents

Declaration
Abstract
Opsomming
Contents
List of Figures
List of Tables
Nomenclature

1 Introduction
  1.1 Motivation
  1.2 Research Questions and Objectives
  1.3 Thesis Overview

2 Background Information
  2.1 From Bibliometrics to Cybermetrics
  2.2 Notation and Terminology
    2.2.1 Graph Notation
  2.3 Using Citations for Ranking
  2.4 Markov Chains
    2.4.1 Modelling Citation Networks using Markov Chains
  2.5 The PageRank Algorithm
    2.5.1 The Damping Factor α
    2.5.2 The Power Method
  2.6 Chapter Summary

3 Literature Review
  3.1 The History of Scientometrics and Bibliometrics
  3.2 What Citation Counts Can and Cannot Measure
    3.2.1 Do High Citation Counts Indicate Quality Work?
    3.2.2 The Impact of Self-Citations
    3.2.3 Varying Citation Potentials
    3.2.4 Where Citation Counts Fall Short
    3.2.5 The Impact of Article Visibility on Citation Counts
    3.2.6 Citation Analysis, Data Quality and Coverage
  3.3 Ranking Publications
  3.4 Ranking Authors and Venues
  3.5 Chapter Summary

4 Ranking Methods
  4.1 Counting Citations
    4.1.1 The Journal Impact Factor
    4.1.2 The i10-index
    4.1.3 The h-index
    4.1.4 The g-index
  4.2 Paper Ranking Algorithms
    4.2.1 PageRank
    4.2.2 SceasRank
    4.2.3 CiteRank
    4.2.4 NewRank
    4.2.5 Yet Another Paper Ranking Algorithm
    4.2.6 Graph Examples
  4.3 Venue Ranking Algorithms
    4.3.1 The Eigenfactor Metric
    4.3.2 The Author-Level Eigenfactor Metric
    4.3.3 Graph Example
  4.4 Chapter Summary

5 Data Sets
  5.1 DBLP Data Set
  5.2 Microsoft Academic Search Data Set
  5.3 Evaluation Data Sets
    5.3.1 High-Impact Paper Awards
    5.3.2 Best Paper Awards
    5.3.3 Author Contribution Awards
    5.3.4 Important Papers
  5.4 MAS Data Set Properties
  5.5 Chapter Summary

6 Comparing Ranking Algorithms
  6.1 Comparing Paper Ranking Algorithms
    6.1.1 Convergence Rates of the Algorithms
    6.1.2 Correlation between Paper Ranking Algorithms
    6.1.3 Comparison using Scatter Plots
    6.1.4 Score Distribution over Publication Dates
    6.1.5 Overall Top Papers
    6.1.6 Identifying Current Research Activity
  6.2 Comparing Venue Ranking Algorithms
    6.2.1 Correlations between Venue Ranking Algorithms
    6.2.2 Comparison using Scatter Plots
  6.3 Comparing Author Ranking Algorithms
    6.3.1 Correlation between Author Ranking Algorithms
    6.3.2 Comparison using Scatter Plots
  6.4 Chapter Summary

7 Evaluating Ranking Algorithms
  7.1 Evaluating Paper Ranking Algorithms
  7.2 How Well do Venues Predict High-Impact Papers?
  7.3 Evaluating Author Ranking Algorithms
  7.4 How Well can Important Papers be Identified by Ranking Algorithms?
  7.5 Chapter Summary

8 Conclusion
  8.1 Summary of Findings
  8.2 Threats to Validity
  8.3 Contributions
  8.4 Suggestions for Future Work

Bibliography

Appendices

A Additional Information and Results
  A.1 Additional MAS Data Set Information
  A.2 Evaluation Data Information
  A.3 Additional Results

List of Figures

4.1 Illustrative Graph G1.
4.2 Illustrative Graph G2.
4.3 Illustrative Graph G3.
4.4 Illustrative Graph G4.
4.5 Illustrative Graph G5.
4.6 Illustrative Graph G6.
5.1 The total number of papers produced in the different domains over time.
5.2 The number of new authors that publish their first publications over time.
5.3 The change in the average number of authors per paper over time.
5.4 The % of single-authored papers over time.
5.5 The average number of articles published in journals over time.
5.6 The average citation counts of papers over time.
5.7 The change of the average size of reference lists over time.
5.8 The average number of citations per paper since publication.
5.9 Number of authors publishing journal articles since their first publication.
5.10 The average number of journal articles published by authors since their first publication.