Bibliographic Database Analysis:

Citation Graphs and Indirect Indicators

Eleni Fragkiadaki

Ph.D. Dissertation

Supervised by Georgios Evangelidis

Submitted to the Department of Applied Informatics, School of Information Sciences, University of Macedonia

Thessaloniki, Greece, June 2016

© Copyright by Eleni Fragkiadaki, 2016.

Advisory committee

Georgios Evangelidis (supervisor), Professor

Department of Applied Informatics, University of Macedonia, Greece

Nikolaos Samaras, Associate Professor

Department of Applied Informatics, University of Macedonia, Greece

Dimitris A. Dervos, Professor

Department of Information Technology, Alexander T.E.I. of Thessaloniki, Greece

Examination committee

Dimitris A. Dervos, Professor

Department of Information Technology, Alexander T.E.I. of Thessaloniki, Greece

Georgios Evangelidis (supervisor), Professor

Department of Applied Informatics, University of Macedonia, Greece

Dimitrios Katsaros, Assistant Professor

Department of Electrical & Computer Engineering, University of Thessaly, Greece

Georgia Koloniari, Lecturer

Department of Applied Informatics, University of Macedonia, Greece

Yannis Manolopoulos, Professor

Department of Informatics, Aristotle University, Greece

Antonios Sidiropoulos, Lecturer

Department of Information Technology, Alexander T.E.I. of Thessaloniki, Greece

Nikolaos Samaras, Associate Professor

Department of Applied Informatics, University of Macedonia, Greece

Abstract

Scientific publications with new advances in a vast number of scientific fields are being published and made available to researchers around the world daily. In such an active scientific environment it has become very important for researchers to not only be able to publish their work but to also understand and explore the research performed by other influential scientists. This process of discovery and dissemination of knowledge is one of the areas where Citation Analysis can be of great use. The different techniques, metrics and approaches defined by Citation Analysis allow scientists to identify publications of particular interest, follow the published research of influential scientists and even identify publications that have set the grounds for new research fields.

In our digital era this means that institutions and publishing bodies need to store large sets of information in bibliographic databases. The databases hold information about the publications, their respective authors and publishing bodies. Some bibliographic databases also hold the actual published manuscripts and index the publications based on a number of different factors, including the list of provided keywords. Publications are also connected, as they always rely to some extent on previously published research performed by the same or other researchers. Therefore the data stored in these bibliographic databases can be expressed in the form of Graphs and, as we will see later in this dissertation, these Citation graphs can express the relationships that are formed between the different research entities (i.e. publications, authors and publishing bodies).

This dissertation examines the different research entities that participate in the publication process of scientific research in an attempt to classify the existing indicators used to identify influential publications, researchers and publishing bodies. It proceeds by examining the use of Citation Graphs in Citation Analysis and describes in detail the concept of Derived Graphs and the algorithms that can be used to produce them, which constitute part of the contribution of this study. We continue by studying the different definitions of generations of citation and critically evaluate them in order to select the definition that is later used in the set of proposed paper and author indicators. Finally a list of well known indicators is examined and compared against the proposed indicators using the data provided by two well known bibliographic databases, namely CiteSeerx and DBLP.

Keywords: Citation analysis, Bibliographic databases, Indirect indicators, Citation generations, Citation graphs, Derived graphs, Paper-Citation graph, Author-Citation graph, Journal-Citation graph, Hirsch algorithms, Paper indicators, Author indicators, Journal indicators, Self-citations, Indirect citations, Scholarly assessment

Abstract (translated from the Greek)

A large number of scientific publications becomes available daily to academics and researchers around the world. Researchers take part in this highly active scientific environment not only by publishing their own research, but also by searching for information and research sources that present the research of other distinguished scientists. It is precisely in this process of searching for and disseminating scientific information that the field of Citation Analysis can significantly assist the work of researchers. The various techniques, indicators and approaches it defines allow scientists and researchers to identify publications of particular importance, to follow the work of other distinguished scientists, and even to identify publications that laid the foundations for the development of new research areas.

In the digital era we live in, this means that universities, as well as other bodies that take part in the process of publishing scientific research, need to store a large volume of data in bibliographic databases. The data stored in these databases concern the publications themselves, their authors, and the journals, conferences or other bodies in which the publications appeared. Some bibliographic databases store the published texts and index various fields, including the keywords defined by the publications, thereby making the publications easier to search for. An additional characteristic of publications is that they are connected, since current scientific research builds on earlier research carried out either by the same authors or by other researchers in the field in question. Therefore the data stored in bibliographic databases can be expressed in the form of Graphs which, as we will see later in this dissertation, express the connections among the different scientific entities (publications, authors, journals).

This doctoral dissertation examines the different scientific entities that take part in the process of publishing scientific research, in an attempt to categorize the various indicators already used to identify distinguished publications, authors and journals. It then examines Citation Graphs and their use in the field of Citation Analysis. It presents and examines in depth the concept of Derived Graphs, as well as the algorithms that define the steps by which these graphs can be produced, starting from the basic information contained in the Paper-Citation graph. Next, the different ways of defining generations of citations are presented and analyzed in detail in order to arrive at the proposed definition. Finally, based on the selected definition, we define the proposed bibliometric indicators for the assessment of publications and authors, which are then compared with other existing indicators using the data of two well-known bibliographic databases, namely CiteSeerx and DBLP.

Keywords: Citation analysis, Bibliographic databases, Indirect indicators, Citation generations, Citation graphs, Derived graphs, Paper-Citation graphs, Author-Citation graphs, Journal-Citation graphs, Hirsch algorithms, Paper indicators, Author indicators, Journal indicators, Self-citations, Indirect citations, Research assessment

Contents

1 Introduction
  1.1 Overview
  1.2 Contribution
  1.3 Dissertation organization
2 Citation analysis fundamentals
  2.1 Scholarly assessment
  2.2 Mathematical notation
  2.3 Paper-Citation graph
  2.4 Derived graphs
    2.4.1 Author-Citation graph
      Mathematical notation
      Transformations
      Example
      Known applications
    2.4.2 Journal-Citation graph
      Example
      Known applications
  2.5 Indirect Citations
    2.5.1 Definitions
    2.5.2 Example
  2.6 Generations of self-citations
    2.6.1 Definition
    2.6.2 Example
3 Classifying assessment indicators
  3.1 Paper indicators
    3.1.1 Direct indicators
    3.1.2 Indirect indicators
  3.2 Author indicators
    3.2.1 Direct indicators
      Standard bibliometric indicators
      h-index family of indicators
      Standalone indicators
    3.2.2 Indirect indicators
  3.3 Journal indicators
    3.3.1 Direct indicators
    3.3.2 Indirect indicators
4 Proposed paper indicators
  4.1 f-value
    4.1.1 Definition
    4.1.2 The Reducing Factor (RF)
    4.1.3 Algorithm
    4.1.4 Example
  4.2 fpk-index
    4.2.1 Critical evaluation of generations
    4.2.2 Definition
    4.2.3 Example
  4.3 Comparison
    4.3.1 f-value and fpk-index
    4.3.2 Other well known indicators
      First example
      Second example
5 Proposed author indicators
  5.1 fa-value
    5.1.1 Definition
    5.1.2 Example
  5.2 fak-index
    5.2.1 Definition
    5.2.2 Example
  5.3 fask-index
    5.3.1 Definition
    5.3.2 Example
  5.4 Comparison
6 Bibliographic databases
  6.1 DBLP
    6.1.1 Data
    6.1.2 Parser
    6.1.3 Publication data analysis
    6.1.4 Author data analysis
  6.2 CiteSeerx
    6.2.1 Data
    6.2.2 cc-IF algorithm
    6.2.3 Parser
7 Experimental results
  7.1 Comparison indicators
    7.1.1 Paper indicators
    7.1.2 Author indicators
  7.2 Ranking methodology
  7.3 DBLP experimental results
    7.3.1 Paper indicators
    7.3.2 Author indicators
  7.4 CiteSeerx experimental results
    7.4.1 Paper indicators
    7.4.2 Author indicators
8 Summary
  8.1 Summary
  8.2 Conclusions
  8.3 Future work
9 Publication List
  9.1 Journals
  9.2 Conferences
List of Figures
List of Tables
Bibliography

Chapter 1

Introduction

Scientific research has always been a driving force for progress and for pushing the limits of our technical abilities and skills. New discoveries, experimental results and theoretical studies all form the basis for future applications in our everyday life. Its importance is undeniable, and the scientific community has always found a strong supporter in the academic sector, with universities from all over the world providing funds and facilities to accommodate all aspects of research. Apart from the academic sector, research is also supported by research centers and organizations, as well as private companies that form and fund internal, focused research groups.

It has thus become ever more important to be able to assess and quickly identify areas of growing scientific interest, influential scientists and high impact scientific publications. A new area of research, called Citation analysis, has been formed that assists with the assessment of these research entities in order to provide better insight into the constantly expanding scientific world.

The aim of the present study has been to further explore the concept of Indirect Citations (citations that do not directly target a paper A but are linked with paper A via a citation path of length greater than one) and the impact that they can have on the way we perceive the contributions of the different research entities in their respective scientific fields.

1.1 Overview

Based on the concept of Indirect Citations and on the Cascading Citations Indexing Framework (cc-IF) [Dervos et al., 2006a], a new algorithm was defined [Fragkiadaki et al., 2009] that accepts a Paper-Citation graph as input and recursively calculates the Medal Standings Output (MSO) table for the provided graph. The cc-IF specifies that citations can be examined at any depth, starting from one (which defines a publication's direct citations). The MSO table lists, for each publication included in the graph, the number of citations gathered at each depth (generation) up to a certain depth. According to the framework, these values can be calculated at the publication level alone or at the (author, publication) level, at which point author self-citations can be excluded from the calculations.
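To make the idea of per-generation citation counts concrete, the following minimal Python sketch computes one row of an MSO-like table for a single publication. It is not the cc-IF algorithm of [Fragkiadaki et al., 2009]: the function name, the dictionary-based graph representation and, in particular, the choice to count each citing paper once at its shortest distance are assumptions made here for illustration only.

```python
from collections import deque


def generation_counts(citations, paper, max_depth):
    """Count the papers that cite `paper` at each generation 1..max_depth.

    `citations` maps a paper id to the list of paper ids it cites.
    Assumption (not taken from the dissertation): a citing paper is counted
    once, at the length of its shortest citation path to `paper`.
    """
    # Invert the graph: cited paper -> set of papers citing it.
    cited_by = {}
    for citing, refs in citations.items():
        for cited in refs:
            cited_by.setdefault(cited, set()).add(citing)

    counts = [0] * max_depth          # counts[i] holds generation i + 1
    seen = {paper}                    # also protects against citation cycles
    queue = deque([(paper, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                  # do not expand beyond the requested depth
        for citing in cited_by.get(node, ()):
            if citing not in seen:
                seen.add(citing)
                counts[depth] += 1
                queue.append((citing, depth + 1))
    return counts


# Hypothetical toy graph: B and C cite A, C also cites B, D cites C, E cites D.
toy_graph = {"B": ["A"], "C": ["A", "B"], "D": ["C"], "E": ["D"]}
mso_row_for_a = generation_counts(toy_graph, "A", 3)
```

In the toy graph, paper A gathers two citations in the first generation (B and C) and one in each of the second and third generations (D and E respectively). Other definitions of generations would count the path C, B, A as an additional second-generation citation.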

The definition of the algorithm was followed by an implementation and experimental phase during which the algorithm was executed against the data provided by the CiteSeerx database [1997].

CiteSeerx is a bibliographic database that mainly indexes the fields of computer and information sciences and provides its data under a Creative Commons license using the OAI XML format. The implementation of the algorithm relies on a relational database for reading and storing data, which means that the CiteSeerx data needed to be parsed and stored in a database using a compatible format. The output of the algorithm is the MSO table for the data provided by the graph that is stored in the relational database [Fragkiadaki et al., 2009].

The study continues to define a new bibliometric indicator, called f-value [Fragkiadaki et al., 2011], that attempts to assist with the assessment of a particular publication. The f-value is a recursive indicator, calculated for all the publications included in a Paper-Citation graph, that considers both the direct and indirect impact of a scientific publication. The Reducing Factor (RF), which participates in the mathematical formula that calculates the f-value of a publication, is used in order to reduce the contribution of citing publications that belong to distant generations. An implementation of the f-value is executed against the CiteSeerx data in order to acquire the raw values for the publications included in the Paper-Citation graph. In order to compare the results produced by f-value, a number of additional implementations of already known bibliometric indicators are also provided. The indicators used at this stage are the Number of Citations [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010] and PageRank [Page et al., 1999, Ma et al., 2008].

Based on the f-value, a new indicator is also proposed for the assessment of authors, called fa-value. The main elements considered by fa-value are the f-values of the individual papers that a particular author has co-authored. Apart from these values, the fa-value also takes into account the number of co-authors of each individual publication as well as the number of years since the author's first publication. The indicator considers the number of years since the first publication as the scientific age of a particular author.

The study continues with a review of the literature, during which a number of indicators are examined in detail. Part of the contribution of this study is the definition of a common mathematical notation used to present the formulas of the examined indicators. This common notation highlights the elements the indicators share and the factors considered by each indicator. In addition, two new algorithms are proposed that cover multiple definitions of indicators that are adaptations of an existing, popular indicator, called h-index [Hirsch, 2005]. The algorithms specify the framework under which these adaptations can be categorized and the way to define new indicators. Furthermore, a number of existing indicators are presented under this common framework.
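As a reminder of the indicator that these adaptations start from, the sketch below computes the standard h-index [Hirsch, 2005]: the largest h such that h of an author's papers have received at least h citations each. It is not one of the two Hirsch algorithms proposed in this study; the function name and the input format are assumptions for illustration.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here on
    return h


# Hypothetical publication record: citations received by each paper.
example_record = [10, 8, 5, 4, 3]
example_h = h_index(example_record)
```

For the hypothetical record above the result is 4, since four papers have at least four citations each while the fifth has only three.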

An additional contribution of the review of the literature is the definition of a framework that defines and produces Derived graphs. The term Derived graphs is used to describe the Author-Citation and Journal-Citation graphs that are generated from the meta-data information available in the Paper-Citation graph. The steps required in order to produce such Derived graphs are defined in detail by the proposed framework [Fragkiadaki and Evangelidis, 2014]. In addition, a list of Derived graphs already found in the literature is presented along with some details about their type and how they have been used.
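The following Python sketch illustrates one simple way an Author-Citation graph could be derived from a Paper-Citation graph and the lists of co-authors. It does not reproduce the transformations of the proposed framework, which are defined in Chapter 2. The function name, the data layout, the edge weighting (one unit per paper-level citation between an author pair) and the decision to keep author self-loops are all assumptions made here.

```python
from collections import Counter


def author_citation_graph(citations, authors):
    """Derive weighted author -> author citation edges.

    `citations` maps a paper id to the list of paper ids it cites.
    `authors` maps a paper id to its list of co-authors.
    Assumption: every (citing author, cited author) pair receives one unit
    of weight per paper-level citation, and self-loops are kept.
    """
    edges = Counter()
    for citing, refs in citations.items():
        for cited in refs:
            for citing_author in authors.get(citing, ()):
                for cited_author in authors.get(cited, ()):
                    edges[(citing_author, cited_author)] += 1
    return edges


# Hypothetical toy data.
toy_citations = {"P2": ["P1"], "P3": ["P1", "P2"]}
toy_authors = {"P1": ["alice"], "P2": ["bob", "alice"], "P3": ["carol"]}
toy_edges = author_citation_graph(toy_citations, toy_authors)
```

In the toy data, carol cites alice twice (through P1 and through P2), and the citation from P2 to P1 produces a self-loop on alice, which a different variant of the derived graph might exclude.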

Following the review of the literature, three new Indirect Indicators are defined, one for the assessment of scientific publications and two for the assessment of authors [Fragkiadaki and Evangelidis, 2016]. The indirect paper indicator is called fpk-index and is defined as an indirect indicator since it considers the first k generations of citations in its calculations. The number k of generations to use can vary based on the characteristics of each individual Paper-Citation graph, particularly with regard to the different citation patterns encountered in different scientific fields. For the purposes of the experimental study of the three indicators, the first three generations of citations are utilized.

It should be noted that an important aspect of the fpk-index definition is which citations are included in the generational citation counts for any particular publication, since the calculated values can differ significantly depending on the chosen definition. Hu et al. [2011] define four different types of generations that can be applied to forward or backward generations of citations. A critical evaluation of the four definitions is performed and the definition that performed best under the examined scenarios is selected. In brief, the scenarios covered the following: (a) the existence of citation cycles, (b) the existence of multiple citation paths of different lengths between a pair of publications, and (c) the existence of multiple citation paths of the same length between a pair of publications.

The next two indicators proposed, namely fak-index and fask-index, are both based on the fpk-index values defined for the individual publications of a particular author. Both indicators are defined as the average fpk-index value of the publications considered as part of the publication record of an author; their main difference is that fask-index excludes any self-citations from the calculations of the fpk-index values. In general, a citation is considered a self-citation for a particular author if the same author has participated in both the citing and the cited publication. As we can deduce from this definition, self-citations are considered at the (author, publication) level, and therefore a citation can be a self-citation for only a subset of the co-authors of a particular publication.
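The (author, publication) level definition of self-citations can be sketched in a few lines of Python. The function names and the use of plain lists of author names are assumptions for illustration; the sketch only restates the definition given above and is not the implementation used in the study.

```python
def self_citing_authors(citing_authors, cited_authors):
    """Return the authors for whom this citation counts as a self-citation."""
    return set(citing_authors) & set(cited_authors)


def is_self_citation(author, citing_authors, cited_authors):
    """True if `author` participated in both the citing and the cited paper."""
    return author in citing_authors and author in cited_authors


# Hypothetical example: the citing paper is by alice and bob,
# the cited paper is by alice and carol.
citing = ["alice", "bob"]
cited = ["alice", "carol"]
affected = self_citing_authors(citing, cited)
```

In the example, the citation is a self-citation for alice only. For carol, who co-authored the cited paper but not the citing one, it remains a regular citation.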

The study continued by implementing and executing all proposed indicators against the data provided by two bibliographic databases, namely CiteSeerx, mentioned earlier, and DBLP. The DBLP database is a bibliographic database that mainly indexes publications from the Computer Sciences field. DBLP also provides its data [DBLP, dbl, 2009] under a Creative Commons license in XML format. During this phase of the study a common relational schema is defined, along with the criteria under which the XML records are considered valid and "complete". In general, records are considered "complete" if they provide all information required for calculating the values of the different indicators examined. The final step of this process is the actual storing of the data in two relational MySQL databases for further processing.

In order to compare the results produced by the proposed indicators, a set of existing bibliometric indicators is defined, implemented and executed against the same databases. The chosen indicators are a mixture of Direct and Indirect indicators and they cover the assessment of both papers and authors. More specifically, the paper indicators examined are the Number of Citations, the PageRank (d = 0.50, d = 0.85), the Contemporary h-index [Sidiropoulos et al., 2007] and the SCEAS rank [Sidiropoulos and Manolopoulos, 2005]. The set of author indicators examined are the Number of Citations, the Mean Number of Citations [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010], the h-index [Hirsch, 2005, 2007], the g-index [Egghe, 2006], the Contemporary h-index, the PageRank (d = 0.50, d = 0.85) and finally the SCEAS rank.

As already mentioned, the raw values of all indicators are calculated and stored in the relational databases. From the examination of the values it became apparent that, because of their different formulas, the indicators produce values that can vary in scale, thus making a direct comparison of the produced values impossible. In order to make the comparison of the indicators possible, the indicator values were used to generate ordinal rankings of the scientific entities, which can be compared since they are all expressed on the same scale.
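The step from raw indicator values to comparable ordinal rankings can be illustrated with the short Python sketch below. The function name, the sample values and the tie-breaking rule (ties broken by entity identifier) are assumptions; the ranking methodology actually used is described in Chapter 7.

```python
def ordinal_ranks(values):
    """Map each entity to its ordinal rank (1 = highest raw value).

    `values` maps an entity id to a raw indicator value.
    Assumption: ties are broken by entity id so that ranks are unique.
    """
    ordered = sorted(values, key=lambda entity: (-values[entity], entity))
    return {entity: position for position, entity in enumerate(ordered, start=1)}


# Hypothetical raw values on very different scales.
raw_citations = {"A": 120, "B": 15, "C": 47}
raw_pagerank = {"A": 0.31, "B": 0.42, "C": 0.27}

rank_by_citations = ordinal_ranks(raw_citations)
rank_by_pagerank = ordinal_ranks(raw_pagerank)
```

Although the raw values of the two hypothetical indicators cannot be compared directly, the resulting ranks can: paper A is ranked first by citations and second by PageRank.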

1.2 Contribution

The main contributions of the present study are:

• the definition of the cc-IF algorithm that generates the Medal Standings Output (MSO) table for a given Paper-Citation graph
• the definition of the two Hirsch algorithms that can be used to describe existing and define new variations of the h-index author indicator
• the presentation of a number of indicators under a common mathematical notation in order to categorize them and highlight the common factors considered by the different direct and indirect indicators
• the definition of a framework that algorithmically specifies the steps required to generate Derived graphs from any given Paper-Citation graph
• the definition of two new indirect paper indicators, called f-value and fpk-index, that utilize the information present in a Paper-Citation graph to produce a single value that represents the scientific impact of a particular publication
• the definition of three new indirect author indicators, called fa-value, fak-index and fask-index, the first of which is based on f-value and the remaining two on fpk-index

1.3 Dissertation organization

The dissertation consists of eight chapters, a detailed description of which can be found below.

Chapter 2 presents the fundamental concepts of Citation analysis and identifies the different research entities that participate in the different stages of scientific research. A common terminology and mathematical notation is defined that will be used throughout this study to identify and describe the different entities. In addition, a listing of some of the most common metrics that have been used to perform scholarly assessment is presented, in order to assist us in identifying which characteristics of the different research entities can be used for scholarly assessment. The Paper-Citation graph is described in detail and a new framework for defining and constructing Derived graphs is presented. As part of the examination of the Paper-Citation graph, the Indirect citations paradigm is described in detail, along with the adopted framework that describes the different ways of defining these indirect citations. Finally, the concept of self-citations is described.

Chapter 3 presents indicators found in the literature that have been used in the assessment of the different research entities. The indicators are categorized according to different aspects of their definition and the type of entity whose impact they attempt to assess. The main categorization is based on the type of entity, thus the chapter is separated into three sections, one for Paper, one for Author and one for Journal indicators. As a secondary categorization, the indicators for each type are separated based on whether they consider only the Direct impact or both the Direct and Indirect impact an entity has had in its respective scientific area and field. Finally, specifically for the Author based indicators, a novel categorization of the h-index [Hirsch, 2005] based indicators is presented, along with the two algorithms that can be used in order to define new variations.

Chapter 4 is a detailed presentation of the proposed paper indicators. Both of the proposed paper indicators are based on the Indirect citations concept. The first indicator proposed, named f-value, depends on determining the citation counts for the first two generations of citations across the entire Paper-Citation graph and is calculated recursively for all the publications included in the graph. The second indicator proposed, named fpk-index, depends on the chosen definition of the different generations of citations and the citation counts of the first k generations for each publication. This chapter also contains a comparison of the two indicators with each other and a comparison of the indicators with other well known indicators found in the literature.

Chapter 5 presents the three proposed author indicators. The first indicator proposed, named fa-value, is based on the knowledge of the f-values of the publications included in the Publication Record of a particular author. Apart from the f-values, the indicator also considers the number of co-authors of each publication and the first publication year of any particular author. The remaining two indicators, named fak-index and fask-index, are both based on the fpk-index values of the publications included in the calculations and the total number of publications included in the calculations. The difference between the two is that for the calculations of fask-index we need to exclude an author's self-citations. The chapter also includes example applications of the three indicators on constructed Paper-Citation graphs, as well as a comparison of the three indicators performed over the same graph.

Chapter 6 is a presentation of the bibliographic databases examined during this study, namely DBLP and CiteSeerx. The data provided by each database were parsed and stored in a unified format. For the DBLP database, the chapter also includes a summary analysis of the paper and author generations of citations. For CiteSeerx, the cc-IF algorithm that was used to produce the MSO table is also presented.

Chapter 7 presents the list of chosen bibliometric indicators that were implemented alongside the indicators proposed by this study. The results of executing all implemented indicators against both bibliographic databases are presented and, since the data were stored in a unified format, the same implementations were used and similar sets of results are presented for both datasets. For each indicator we present the raw calculated values and a generated ranking based on these values.

Finally, in Chapter 8 we present the conclusions of the present study.

Chapter 2

Citation analysis fundamentals

Different research entities participate in the full cycle of a piece of scientific research becoming known to the wider scientific community. Obviously the most fundamental part of the process is the research itself, conducted by individual researchers, followed by its publication through one of the publishing bodies, such as a scientific journal or conference, at which point it becomes available to the rest of the scientific community.

Since published research work is the means by which knowledge is disseminated in the scientific community, we consider this to be the primary research entity. All peer-reviewed scientific documents that appear in the literature, and are available to other researchers to examine, are considered by the present study. Examples of such documents are technical reports, published clinical research results, articles in scientific journals, papers that appear in conference proceedings, master's and PhD theses etc. We are going to use the terms publication or paper to describe any of the above scientific documents.

Publications carry knowledge about a specific scientific topic and they also provide additional information that can be used to identify relations between different research entities. So, if we consider the text of a publication to be the data that it carries, then all accompanying pieces of information form the meta-data for that publication. Part of the meta-data for a particular publication would be:

• the title of the publication
• the list of co-authors
• the list of references
• the year of publication
• and the publishing body

From the meta-data list we can derive two more research entities whose assessment forms part of the citation analysis fundamentals, namely the authors and the publishing bodies. We are going to use the term author to describe all parties involved in a particular research whose names form the list of co-authors of a publication. Therefore individual researchers, research groups, students, professors, research organizations and analysts would all be referred to as authors for the remainder of this document. Finally, we are going to refer to all peer-reviewed published collections of scientific documents as publishing bodies or journals, as they are more commonly referred to in the literature. In this category we would place all printed or online scientific journals, conference proceedings volumes, institutional and open-access repositories etc.

Accessing the meta-data for a set of publications means that we can identify relationships among the different research entities, for example:

• the fact that one publication refers to another publication
• the fact that an author has co-authored more than one publication
• the fact that two publications have been co-authored by the same person
• the fact that two publications have been published by the same publishing body
• the fact that two authors have co-authored one or more scientific publications together

and so on. Different types of indicators have been proposed in the literature that allow us to quantify these relationships in order to produce a meaningful ranking of the relative importance of the different research entities. Some of these indicators are uniquely defined, whereas others try to improve some aspect of an already existing indicator.

Part of the contribution of the present study has been to review the literature in an attempt to provide a framework that would allow us to categorize and classify the existing bibliometric indicators based on their individual characteristics. In addition, a novel framework has been created that allows us to further utilize the information present in the meta-data of the publications in order to produce different types of citation graphs.

The rest of this chapter is structured as follows: In Section 2.1 we present some of the most common metrics that appear in the literature and have been used for scholarly assessment. In Section 2.2 we present the fundamental mathematical notations that we are utilizing throughout the present study to refer to the different characteristics of the research entities examined. In Section 2.3 we present the Paper-Citation graph and examine its basic properties. In Section 2.4 we describe in detail the concept of the Derived Graphs, both for the Author and Journal research entities, we define the framework to generate the different variations of these graphs and we refer to the literature to identify which variations have already appeared in other studies. In Section 2.5 we discuss the different ways in which we can define the Indirect Citations in a Paper-Citation graph. Finally, in Section 2.6 we describe the Generations of Self-Citations.

2.1 Scholarly assessment

Bibliometric indicators are numbers (unique or not) that capture the past achievements of a researcher. They are used in evaluations, on the idea that, if a researcher has been successful in the past, he is expected to be successful in the future. The data used for obtaining measures of scholarly impact for a researcher are mainly his publication record along with the citations that these publications have received. In some cases external factors are also considered, like the Impact Factor [Garfield, 1999, 2005] of the publishing bodies and/or information about the scientific field in which he is currently active.

Below, we list the factors that can be obtained from the available meta-data and that are used in the literature to define various bibliometric indicators.

Number of papers: The number of papers a researcher has (co-)authored can be a measure of how productive he has been during his scientific career. It is a measure that has been used/mentioned in many articles [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010].

Number of citations: The cumulative number of citations a researcher has received for all the papers that he has (co-)authored can be a measure of scholarly impact and has been used/mentioned in many articles [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010]. When used as an author indicator, the total Number of Citations (NC) has also been referred to as the s-index [van Eck and Waltman, 2008] and the c-method [Wu, 2010].

Scientific age: The number of years passed since the researcher published his first paper. It has been argued that a bibliometric indicator should account for the scientific age of a researcher, otherwise, younger promising researchers will not get the proper recognition until their achievements become comparable to the ones of the scientifically older researchers of their respective fields, both in the productive and impact scales [Hirsch, 2005, Glänzel, 2006, Jin et al., 2007, Antonakis and Lalive, 2008].

Age of individual papers: The number of years passed since each individual paper was published.

In general, the value of a paper, as perceived by its citation count, can only increase over time.

It has been argued that cases where a researcher relies solely on his already published papers without keeping on producing equally important papers, should be detected and accounted for by bibliometric indicators. This way, it is possible to distinguish between currently active and inactive researchers [Glänzel, 2006, Sidiropoulos et al., 2007, Katsaros et al., 2007, Jin et al., 2007].

Age of individual citations: The age of the individual citations that each of the papers has received.

In general, the age of the individual citations received by a paper can demonstrate the current impact of the paper. A paper that keeps on receiving citations can be considered to have a higher current impact than a paper that has stopped receiving citations. It has been argued that bibliometric indicators should consider the age of individual citations to distinguish between currently active and inactive papers [Sidiropoulos et al., 2007, Katsaros et al., 2007, Jin et al., 2007].

Self-citations: We would say that paper B cites paper A, if paper A is present in the list of references provided by paper B. Self-citations can occur for the authors of a paper A, when there is an overlap between the (co-)authors of paper A and the authors of a citing paper, B. Let us now consider paper A, co-authored by researchers U, W and Z, and a set of citing papers, and, let us assume that we are interested in the self-citations of author U. The following cases have been identified:

• Own self-citations: The number of citations of paper A that include U in their author list. These have also been referred to as (author, article) level self-citations [Dervos and Kalkanis, 2005], self-citations [Hirsch, 2005, Kosmulski, 2006], and self-citations of the first kind [Schreiber, 2007].

• Co-author self-citations: Calculate the number of own self-citations of each of the co-authors of U for paper A, namely W and Z. The co-author self-citations are defined as the sum of all own self-citations of the co-authors of U for paper A. These have also been referred to as self-citations of the second kind [Schreiber, 2007].

• All self-citations: Identify the number of citations that include at least one of the (co-)authors of A in their author list. Citations that include more than one of the co-authors of A in their author list are accounted for once. These have also been referred to as article level self-citations [Dervos and Kalkanis, 2005] and as self-citations of the third kind [Schreiber, 2007].
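The three cases above can be illustrated with a minimal Python sketch. The function name `self_citation_counts` and the list of citing papers are hypothetical and serve only as an illustration; each citing paper is represented by its author list.

```python
def self_citation_counts(paper_authors, citing_author_lists, author):
    """Count the three kinds of self-citations of `author` for one paper."""
    cited = set(paper_authors)
    co_authors = cited - {author}
    # Own self-citations: citing papers that list `author`.
    own = sum(1 for authors in citing_author_lists if author in authors)
    # Co-author self-citations: sum of the own self-citations of each co-author.
    co_author = sum(
        sum(1 for authors in citing_author_lists if co in authors)
        for co in co_authors
    )
    # All self-citations: citing papers that share at least one author
    # with the cited paper, each counted once.
    all_self = sum(1 for authors in citing_author_lists if cited & set(authors))
    return own, co_author, all_self


# Hypothetical citing papers of paper A, which is co-authored by U, W and Z.
citing = [["U", "X"], ["W", "Z"], ["X", "Y"], ["U", "W"]]
own, co_author, all_self = self_citation_counts(["U", "W", "Z"], citing, "U")
print(own, co_author, all_self)  # 2 3 3
```

In this made-up example the citing paper by W and Z contributes twice to the co-author self-citations (once for W and once for Z) but only once to the all self-citations count.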

Co-authors: Both the number and order of authors included in the author list of a paper have been associated with publication patterns. It has been argued that a bibliometric indicator should try to assign different credit to each of the contributors based on the number and ordering of co-authors.

Different types of orderings can be identified and the following cases are only some of the possible scenarios: (a) contributors are listed alphabetically, (b) the most important contributor is listed first (or last) with the rest listed either by contribution or alphabetically, or, (c) contributors usually publishing together can use a rotating scheme in the ordering of the author list. Thus, when a bibliometric indicator needs to take paper co-authorship into account it can do so by considering either the number of authors alone [Batista et al., 2006, Hirsch, 2005, Schreiber, 2008a,b], or the number and order of authors [Wan et al., 2007], or the number of common co-authors [Fiala, Rousselot, and Ježek, 2008], or perhaps, by defining a completely novel way of accounting for co-authorship [Hirsch, 2010].

Scientific field: The scientific field in which a researcher is active can also affect our judgment of how successful he is. It has been pointed out that different scientific fields present different scientific patterns in the number of papers published, the number of citations received by each paper, and even, the number of authors included in each paper. It has been argued that a bibliometric indicator should consider differences in the scientific fields as well [Hirsch, 2005, Batista et al., 2006, Glänzel, 2006, Antonakis and Lalive, 2008, van Eck and Waltman, 2008].

2.2 Mathematical notation

In order to better describe the meta-data that define the different relationships we encounter in Citation Analysis, we are going to use the mathematical notation described in the following list, which was originally defined in Fragkiadaki and Evangelidis [2014]:

$P = \{P_1, P_2, \ldots, P_{N_P}\}$ denotes the closed set of papers participating in a Paper-Citation graph and $N_P$ is the total number of papers included in the collection.

$P(A_l) = \{P_i \mid P_i \in P\}$ denotes the set of papers that belong to set $P$ that author $A_l$ has co-authored.

$A = \{A_1, A_2, \ldots, A_{N_A}\}$ denotes the set of authors that have participated in any of the papers included in the Paper-Citation graph. $N_A$ denotes the total number of authors participating in the Paper-Citation graph.

$J = \{J_1, J_2, \ldots, J_{N_J}\}$ denotes the set of journals in which the papers of the Paper-Citation graph were published. $N_J$ denotes the total number of journals participating in the Paper-Citation graph.

$C = \{C_{P_i P_j} \mid P_i, P_j \in P\}$ denotes the set of citations between the papers included in the Paper-Citation graph. $C_{P_i P_j}$ denotes that paper $P_j$ is cited by paper $P_i$ and $N_C$ denotes the total number of citations (edges) present in the Paper-Citation graph.

$a(P_i)$ denotes the total number of authors that have co-authored paper $P_i$.

$c(P_i)$ denotes the total number of (weighted) citations received by paper $P_i$.

$r(P_i)$ denotes the total number of papers referenced by paper $P_i$.

$w(C_{P_i P_j})$ denotes the weight of citation $C_{P_i P_j}$.

2.3 Paper-Citation graph

We can consider a graph to be a representation of a closed set of elements that provide links to other elements within the same set. The elements form the nodes of the graph and the links between them represent the edges of the graph. The graph edges can be directional, i.e. originating from one node and terminating at another node, or not. A non-directional edge signifies that the two nodes are linked, whereas a directional edge represents a relationship that originates from one node and targets another node in the graph.

Two nodes can be connected directly, i.e. an edge exists between the two nodes, or indirectly. We identify an indirect connection between two nodes if there is a path that leads from one node to the other and potentially involves visiting more than one node in between.

We define a cycle as a path that originates from and terminates at the same node. Zero or more nodes may appear on the same path in which case we define the different levels of cycles, starting with level 0 where a node has an edge that points back to itself.

The most commonly examined graph in Citation Analysis is the Paper-Citation graph, where the different papers correspond to the nodes of the graph and the references provided by each paper act as the graph's edges. The Paper-Citation graph is a directed, usually acyclic graph. When a source paper S references a target paper T, this signifies a one-way relationship that originates from paper S and links it with paper T. We are going to refer to this relationship as "S references T" or "T is cited by S" depending on the currently examined paper, and the notation used to identify this relationship is S → T.

With regards to a particular paper, citations can be defined as Forward or Backward. Forward citations are all the citations that reference the current paper whereas Backward citations are all the citations that the current paper provides via its Reference list. An example of how Forward and Backward citations are defined for a particular paper (P3) is presented in Figure 2.1. For the rest of this dissertation we are only going to refer to Forward citations and examine how they are being used in Citation Analysis.

Figure 2.1: Example of Forward and Backward citations examined for paper P3. [Diagram not reproduced: papers P1 to P5 on a timeline (2010, 2012, 2016), with References marked as Backward and Citations marked as Forward.]

As previously mentioned, we consider the Paper-Citation graph to be usually acyclic, if not in its entirety, at least for its biggest part. After a paper has been published its contents never really change, which means that a paper will only ever be able to reference other papers that already existed at the time of publication. It is not uncommon though for a paper to reference another paper that has not yet been officially published, either because it appears on an author's personal web page or because it is being cited in a draft, pre-publication or online-first form. In these cases it is possible for a cycle to be created [Sidiropoulos and Manolopoulos, 2005].

In general, a level n cycle is going to include n + 1 papers [Fragkiadaki and Evangelidis, 2016]. So, for example, three papers would participate in a Level 2 cycle as shown in Figure 2.2 (b). Using the notation defined earlier this cycle could also be presented as P4 → P1 → P5 → P4. It is true though, that in the absence of any additional information it is difficult to identify the order in which the papers were added to the graph. If we had knowledge of the publication dates of these papers we would be able to identify which paper provided a reference to a paper with a publication date set to the future.

Figure 2.2: Cycles encountered in a Paper-Citation graph: (a) Level 1, (b) Level 2 and (c) Level 3 cycle.
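To make the notion of cycle levels concrete, the following minimal Python sketch detects a cycle in a small reference graph with a depth-first search and reports its level. The function names `find_cycle` and `cycle_level` are hypothetical, and the graph contains only the Level 2 cycle P4 → P1 → P5 → P4 mentioned in the text.

```python
def find_cycle(references):
    """Return one cycle as a list of papers (first == last), or None."""
    nodes = set(references) | {t for ts in references.values() for t in ts}
    state = {node: "new" for node in nodes}
    path = []

    def visit(paper):
        state[paper] = "open"
        path.append(paper)
        for target in references.get(paper, ()):
            if state[target] == "open":
                # The target is already on the current path: a cycle is closed.
                return path[path.index(target):] + [target]
            if state[target] == "new":
                found = visit(target)
                if found:
                    return found
        state[paper] = "done"
        path.pop()
        return None

    for paper in sorted(nodes):
        if state[paper] == "new":
            found = visit(paper)
            if found:
                return found
    return None


def cycle_level(cycle):
    """A level n cycle includes n + 1 papers; the cycle list repeats its first paper."""
    return len(cycle) - 2


# "X references Y" is stored as references[X] containing Y.
references = {"P4": ["P1"], "P1": ["P5"], "P5": ["P4"]}
cycle = find_cycle(references)
print(" -> ".join(cycle), "| level", cycle_level(cycle))
```

A paper that references itself yields the list `[P, P]`, which corresponds to level 0 under the same convention.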

As already discussed, in a Paper-Citation graph papers constitute the nodes of the graph and by using the reference list of each paper we can populate the graph with a set of directed edges. Additional information about each paper can be depicted as properties of each node as shown in Figure 2.3. The list of properties consists of the list of co-authors, the publishing body and the year of publication.

Based on the mathematical notations defined in Section 2.2, we would describe this Paper-Citation graph as

• $P = \{P_1, P_2, P_3, P_4, P_5, P_6\}$ with $N_P = 6$ since we have six papers

• $A = \{A_1, A_2, A_3, A_4, A_5\}$ with $N_A = 5$ since we have five distinct authors

• $J = \{J_1, J_2, J_3\}$ with $N_J = 3$ since we have three distinct journals

• and $C = \{C_{P_2 P_1}, C_{P_3 P_2}, C_{P_4 P_2}, C_{P_4 P_3}, C_{P_5 P_4}, C_{P_6 P_2}\}$ with $N_C = 6$ since we have a total of six citations (edges) present in the graph

Figure 2.3: Example of a Paper-Citation graph. The year of publication, journal and the list of co-authors are depicted as properties of the paper nodes.

If we were to examine P2 in more detail we would also say that:

• a(P2) = 2, since P2 has two authors, A2 and A3

• c(P2) = 3, since P2 is referenced by P3, P4 and P6

• r(P2) = 1, since P2 provides one reference, to P1
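The quantities above can be recomputed with a short Python sketch. The dictionaries transcribe the co-author and reference lists of the example graph (as listed in Table 2.2); the helper names `a`, `c`, `r` and `cited_by` simply mirror the notation and are not part of any library.

```python
# Co-authors and references of each paper in the example graph.
authors = {
    "P1": ["A1"], "P2": ["A2", "A3"], "P3": ["A4"],
    "P4": ["A5"], "P5": ["A3"], "P6": ["A4"],
}
references = {
    "P1": [], "P2": ["P1"], "P3": ["P2"],
    "P4": ["P3", "P2"], "P5": ["P4"], "P6": ["P2"],
}


def a(paper):
    """Number of authors of a paper."""
    return len(authors[paper])


def r(paper):
    """Number of papers referenced by a paper."""
    return len(references[paper])


def cited_by(paper):
    """Papers in the graph that reference the given paper."""
    return sorted(src for src, targets in references.items() if paper in targets)


def c(paper):
    """Number of citations received, with every citation weighted as one."""
    return len(cited_by(paper))


N_P = len(authors)
N_A = len({name for names in authors.values() for name in names})
N_C = sum(len(targets) for targets in references.values())

print(N_P, N_A, N_C)                # 6 5 6
print(a("P2"), c("P2"), r("P2"))    # 2 3 1
print(cited_by("P2"))               # ['P3', 'P4', 'P6']
```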

It is also worth noting that in the Paper-Citation graph we can only ever have one edge between a pair of papers, and this is due to the fact that even though a paper A may provide multiple in-text references to the same paper B, paper B can only appear in the Reference List of paper A once.

A common metric used in citation analysis is the number of citations a paper has received and it has been referred to in the literature as the Number of Citations (NC), s-index [van Eck and Waltman, 2008], or c-method [Wu, 2010]. With regards to the Paper-Citation graph this can be translated as the number of incoming edges to the paper node, or as the sum of the weights of these edges if the weight has been set to one. We are going to refer to this approach as Full counting.

Utilizing the weight property of the edges one can provide additional information about the relationship of two papers, like the fact that a citing paper usually references multiple papers in its Reference list. If we wish to account for the number of references each paper provides then the weight of each edge can be set to one divided by the total number of papers referenced by the source paper. We are going to refer to this approach as Fractional counting.

In Figure 2.3 the weights of the edges have been represented with their mathematical notation and in Table 2.1 we are presenting the calculated values for the weights when using Full or Fractional counting. Fractional counting will always produce values that are either the same or lower than the ones produced by Full counting.

Citation      Full counting (FUC)    Fractional counting (FRC)

w(CP2P1)      1                      1

w(CP3P2)      1                      1

w(CP4P2)      1                      1/2

w(CP4P3)      1                      1/2

w(CP5P4)      1                      1

w(CP6P2)      1                      1

Total         6                      5

Table 2.1: Citation weights for the Paper-Citation graph of Figure 2.3.

As we can see from comparing the generated weights in Table 2.1, the edges that are affected by the different counting methods are the ones that originate from paper P4. P4 is the only paper in our Paper-Citation graph that references more than one paper, and therefore the weight of each of its outgoing edges is set to 1/2. It is also worth noting that the sum of all the weights of the graph edges is equal to the number of edges when using Full counting and to the number of papers that provide at least one reference when using Fractional counting (five in this example, since P1 does not reference any paper in the graph).
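The two counting methods can be sketched in a few lines of Python. The `references` dictionary transcribes the edges of the example graph, while the function name `citation_weights` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction

# "X references Y" is stored as references[X] containing Y.
references = {
    "P1": [], "P2": ["P1"], "P3": ["P2"],
    "P4": ["P3", "P2"], "P5": ["P4"], "P6": ["P2"],
}


def citation_weights(references, fractional):
    """Weight of every citation under Full or Fractional counting."""
    weights = {}
    for source, targets in references.items():
        for target in targets:
            if fractional:
                # One divided by the number of papers referenced by the source.
                weights[(source, target)] = Fraction(1, len(targets))
            else:
                weights[(source, target)] = Fraction(1)
    return weights


full = citation_weights(references, fractional=False)
frac = citation_weights(references, fractional=True)

for edge in full:
    print(edge, full[edge], frac[edge])
print("Total", sum(full.values()), sum(frac.values()))  # Total 6 5
```

Under Fractional counting the outgoing weights of each citing paper add up to one, which is why the fractional total equals the number of papers that provide at least one reference.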

Now that we have defined the weights of the edges in the Paper-Citation graph we can go back to one of the notations presented in Section 2.2 and provide more details about its actual definition.

More specifically, we have referred to c(Pi) as the total number of weighted citations received by paper Pi and in the example presented earlier we said that c(P2) = 3, since P2 is referenced by P3, P4 and P6. This is still true if we apply Full counting in the original Paper-Citation graph, and it also represents the Number of Citations (NC) value for this paper. But, if Fractional counting is applied then c(P2) = 2.5, since now w(CP4P2) = 1/2 instead of 1. Therefore we are going to refer to c(Pi) as the Weighted Citation count and it is going to be expressed as:

$$c(P_i) = \sum_{P_j \in P} w(C_{P_j P_i}) \tag{2.1}$$

And, as previously mentioned, when the weight of all the edges in the Paper-Citation graph is set to 1 then the Weighted Citation count is identical to the Citation count, whereas when Fractional counting is applied to the Paper-Citation graph the Weighted Citation count is always going to be lower than or equal to the Number of Citations (NC).
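As a worked instance of Equation 2.1, the Weighted Citation count of P2 in the example graph can be written out for both counting methods, using the citation weights of Table 2.1:

```latex
\begin{align*}
\text{Full counting:} \quad
  c(P_2) &= w(C_{P_3 P_2}) + w(C_{P_4 P_2}) + w(C_{P_6 P_2}) = 1 + 1 + 1 = 3 \\
\text{Fractional counting:} \quad
  c(P_2) &= w(C_{P_3 P_2}) + w(C_{P_4 P_2}) + w(C_{P_6 P_2}) = 1 + \tfrac{1}{2} + 1 = \tfrac{5}{2}
\end{align*}
```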

In general the same details can also be presented in the Paper-Citation table which for the graph in Figure 2.3 is shown in Table 2.2. As we can see the only paper that does not reference any other paper in the graph is P1. The rest of the papers in the set do provide at least one reference to one of the papers in the graph. With regards to citations, the papers that do not receive any citations are papers P5 and P6, which only provide references.

Paper    Publication year    Journal    Is cited by    Co-authors    References

P1       2010                J1         P2             A1            -

P2       2012                J2         P3,P4,P6       A2,A3         P1

P3       2013                J1         P4             A4            P2

P4       2014                J3         P5             A5            P3,P2

P5       2015                J3         -              A3            P4

P6       2015                J2         -              A4            P2

Table 2.2: Paper-Citation table for the Paper-Citation graph of Figure 2.3.

It is also worth noting at this point that the fact that a paper does not receive any citations in the present citation graph does not mean that the paper has not received any citations at all. As has been previously highlighted, no single citation database actually holds information about all the citations and references provided by a paper. It is not uncommon for papers to reference other papers not included in a particular citation database, and it is also not uncommon for a paper to receive citations from papers not included in the same database.

Unfortunately, it is not easy to link data that are stored in different bibliographic databases since there is no universal source of truth for citation data. One approach that could assist in combining the paper citation data from different bibliographic databases would be to use a paper’s DOI as part of any provided reference. This would mean that each paper could be uniquely identified across all bibliographic databases, thus making it easier for researchers to gather and combine citation data from different sources.
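As a rough illustration of how a shared identifier would simplify such a combination, the following Python sketch merges citation records from two sources keyed by DOI. The function name `merge_citations_by_doi`, the record layout and all DOI strings are hypothetical placeholders, and lower-casing the identifiers is a simplifying assumption made for matching.

```python
def merge_citations_by_doi(*databases):
    """Union the reference lists of every paper across several sources.

    Each database maps the DOI of a citing paper to the DOIs it references.
    """
    merged = {}
    for database in databases:
        for citing_doi, cited_dois in database.items():
            key = citing_doi.lower()
            merged.setdefault(key, set()).update(d.lower() for d in cited_dois)
    return merged


# Hypothetical records: the same citing paper appears in both sources,
# each of which knows about a different reference.
source_one = {"10.0000/paper-b": ["10.0000/paper-a"]}
source_two = {
    "10.0000/PAPER-B": ["10.0000/paper-c"],
    "10.0000/paper-d": ["10.0000/paper-a"],
}

merged = merge_citations_by_doi(source_one, source_two)
for doi in sorted(merged):
    print(doi, sorted(merged[doi]))
```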

An equivalent approach for uniquely identifying authors has also been proposed by Dervos et al. [2006b]. It is named the Universal Author Identifier (UAI) and would allow a researcher to be uniquely identified across all bibliographic sources.

2.4 Derived graphs

The Paper-Citation graph presented in the previous section can be constructed from the information derived from the papers included in a closed set of papers. From the same information we can construct other types of graphs that utilize different aspects of the provided meta-data. More specifically, the two types of Derived graphs that we are going to examine more closely are the Author-Citation graph and the Journal-Citation graph. We are going to refer to these graphs as Derived as they are constructed by applying a finite set of transformations to the originating Paper-Citation graph.

The framework that defines how these graphs can be constructed in a structured way from the meta-data provided by the Paper-Citation graph constitutes part of the contribution of the present PhD thesis.

2.4.1 Author-Citation graph

The Author-Citation graph is a directed graph whose nodes are authors of papers and its edges represent the citations provided from one author to another. As discussed earlier, from any closed set of papers for which we have all required meta-data information available, i.e. title, author list and publication year we can construct the corresponding Author-Citation graph.

The nodes of the graph will be the set of co-authors of the papers in the collection. We consider an author A to reference another author B in the set if A has been the co-author of a paper that references at least one of the papers co-authored by B. Similar to the Paper-Citation graph and the definition of a paper citation, we say that "A references B" or "B is referenced by A" and the notation used to represent this relationship would be A → B.

The steps and transformations we need to apply to the originating Paper-Citation graph in order to produce the corresponding Author-Citation graph are:

• Step 1: Define the weight of the edges in the originating Paper-Citation graph

• Step 2: Produce the intermediate graph by transforming the paper citations to author citations and define the weight of these edges

• Step 3: Collapse the multiple edges between two authors to a single edge with a suitable weight

Mathematical notation

Expanding on the mathematical notation defined in Section 2.2 the following can be defined in order to better describe the Author-Citation graph [Fragkiadaki and Evangelidis, 2014].

• $E = \{E_{A_k A_l, P_i P_j} \mid \exists\, C_{P_i P_j} \in C \wedge P_i \in P(A_k) \wedge P_j \in P(A_l)\}$ denotes the set of edges between authors in the intermediate Author-Citation graph. $E_{A_k A_l, P_i P_j}$ denotes an edge from author $A_k$ to author $A_l$ that exists because there exists a citation from paper $P_i$ co-authored by $A_k$ to paper $P_j$ co-authored by $A_l$.

• $e(A_k)$ denotes the total number of outgoing edges originating from author $A_k$.

• $w(E_{A_k A_l, P_i P_j})$ denotes the weight of edge $E_{A_k A_l, P_i P_j}$.

• $C^d = \{C^d_{A_k A_l} \mid \exists\, E_{A_k A_l, P_i P_j} \in E\}$ denotes the set of edges between authors, or derived author citations, in the final Author-Citation graph.

• $r(A_k)$ denotes the total number of authors referenced by author $A_k$ in the final Author-Citation graph.

• $w(C^d_{A_k A_l})$ denotes the weight of author citation $C^d_{A_k A_l}$.

Transformations

So, using the defined mathematical notation we can now define the three steps needed to generate the Author-Citation graph from the original Paper-Citation graph, as originally described in [Fragkiadaki and Evangelidis, 2014].

The weight of each paper citation $C_{P_i P_j}$ between papers $P_i, P_j$ of the original Paper-Citation graph is

$$
w(C_{P_i P_j}) =
\begin{cases}
0, & \text{citation does not exist} \\
\dfrac{1}{r(P_i)}, & \text{fractional counting} \\
1, & \text{full counting}
\end{cases}
\tag{2.2}
$$

The weight of each individual edge $E_{A_k A_l, P_i P_j}$ between authors $A_k, A_l$ of papers $P_i, P_j$ respectively of the intermediate Author-Citation graph is

$$
w(E_{A_k A_l, P_i P_j}) =
\begin{cases}
0, & \text{edge does not exist} \\
w(C_{P_i P_j}), & \text{No normalization} \\
w(C_{P_i P_j}) \cdot \dfrac{1}{a(P_i) \ast a(P_j)}, & \text{Normalize per citation}
\end{cases}
\tag{2.3}
$$

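The weight of each derived author citation $C^d_{A_k A_l}$ between authors $A_k, A_l$ in the resulting Author-Citation graph is

$$
w(C^d_{A_k A_l}) =
\begin{cases}
0, & \text{citation does not exist} \\
\sum\limits_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}), & \text{full counting} \\
\dfrac{1}{r(A_k)} \cdot \left( \sum\limits_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}) \right), & \text{fractional citation counting} \\
\dfrac{1}{e(A_k)} \cdot \left( \sum\limits_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}) \right), & \text{fractional edge counting} \\
\dfrac{1}{\sum\limits_{P_i, P_j \in P \wedge A_M \in A} w(E_{A_k A_M, P_i P_j})} \cdot \left( \sum\limits_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}) \right), & \text{fractional weight counting}
\end{cases}
\tag{2.4}
$$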

So, as discussed in the previous section and based on Formula 2.2, there are two ways that we can define weights in the Paper-Citation graph, either by assuming that all edges have an equal weight of one (Full counting - FUC) or by assigning a weight to each outgoing edge of a paper, equal to one divided by the number of outgoing edges (Fractional counting - FRC).

In order to better demonstrate how we map the citations between the papers to links between the authors, let us consider a single pair of papers (P1, P2) from Figure 2.3. Authors A1, A2 and A3 are the co-authors of these papers and therefore the resulting graph would have three nodes, one for each author, with one outgoing edge from A2, one outgoing edge from A3 and two incoming edges for author A1.

In other words if there is a citation between two papers in the Paper-Citation graph, then, there is a citation between all the co-authors of the source paper to all the co-authors of the target paper in the resulting intermediate graph. This means that if the source paper has been co-authored by two authors and the target paper has been co-authored by a different set of three authors then the citation between the two papers will result in 2 ∗ 3 = 6 edges between the authors of the papers in the intermediate graph. These edges will be directed since the original edges are directed and therefore the nature of the graph remains the same.
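The mapping from paper citations to author edges can be sketched in Python as a Cartesian product of the two author lists. The data transcribes the example graph of Figure 2.3, and the function name `intermediate_edges` is a hypothetical helper used only for this illustration.

```python
from itertools import product

authors = {
    "P1": ["A1"], "P2": ["A2", "A3"], "P3": ["A4"],
    "P4": ["A5"], "P5": ["A3"], "P6": ["A4"],
}
references = {
    "P1": [], "P2": ["P1"], "P3": ["P2"],
    "P4": ["P3", "P2"], "P5": ["P4"], "P6": ["P2"],
}


def intermediate_edges(authors, references):
    """One directed author edge per (source co-author, target co-author) pair."""
    edges = []
    for source, targets in references.items():
        for target in targets:
            for a_k, a_l in product(authors[source], authors[target]):
                edges.append((a_k, a_l, source, target))
    return edges


edges = intermediate_edges(authors, references)
print(len(edges))  # 10
print([e for e in edges if e[:2] == ("A4", "A2")])
# [('A4', 'A2', 'P3', 'P2'), ('A4', 'A2', 'P6', 'P2')]

# Two source co-authors and three target co-authors give 2 * 3 = 6 edges.
toy_authors = {"S": ["X1", "X2"], "T": ["Y1", "Y2", "Y3"]}
print(len(intermediate_edges(toy_authors, {"S": ["T"]})))  # 6
```

An author self-citation would show up in this list as an edge whose first two entries are the same author; the example graph does not contain one.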

While constructing the intermediate citation graph we might encounter author self-citations. This means that we might encounter a case where a particular author is part of the co-author list of both the source and target paper. In this case the generated edge will appear as an edge originating and terminating on the same author node.

Now that we have defined the nodes of the intermediate graph and how the edges are generated from the original Paper-Citation graph, the next step would be to identify the weights that the edges should have. Based on Formula 2.3 there are two approaches to this. One is to assign the weight of the original paper citation as is to all derived author citations (No Normalization - NN). The other approach would be to distribute the weight of the original paper citation equally among the derived edges (Normalize per citation - NC). The number of derived edges from each individual paper citation is the product of the number of co-authors of the source and target papers.

The final step in generating the Author-Citation graph is to define the edges of the final graph and assign the appropriate weights to them. The input of this step is the generated intermediate graph, which has already defined the nodes of the graph but may have more than one edge between the same source/target pair of authors. What we would like to have in the resulting graph is a single edge between each pair of authors that is weighted accordingly. As presented in Formula 2.4 we have identified four approaches in defining the weight of an author citation in the resulting Author-Citation graph:

• Full counting (FUC): The resulting weight is generated by adding the weights of all individual author connections from the intermediate graph.

• Fractional citation counting (FRCC): The resulting weight is generated by dividing the summed weight of the individual author connections by the number of outgoing edges from the source author in the final Author-Citation graph.

• Fractional edge counting (FREC): The resulting weight is generated by dividing the summed weight of the individual author connections by the number of outgoing edges from the source author in the intermediate Author-Citation graph.

• Fractional weight counting (FRWC): The resulting weight is generated by dividing the summed weight of the individual author connections by the sum of the weights of all outgoing edges from the source author.
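The four approaches can be compared with a minimal Python sketch of Formula 2.4. It assumes Full counting in the Paper-Citation graph of Figure 2.3 and No Normalization in the intermediate graph, so that every intermediate edge has weight one; the function name `final_weights` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

authors = {
    "P1": ["A1"], "P2": ["A2", "A3"], "P3": ["A4"],
    "P4": ["A5"], "P5": ["A3"], "P6": ["A4"],
}
references = {
    "P1": [], "P2": ["P1"], "P3": ["P2"],
    "P4": ["P3", "P2"], "P5": ["P4"], "P6": ["P2"],
}

# Intermediate edges (source author, target author, weight), all with weight 1.
edges = [
    (a_k, a_l, Fraction(1))
    for source, targets in references.items()
    for target in targets
    for a_k in authors[source]
    for a_l in authors[target]
]


def final_weights(edges, method):
    """Collapse parallel author edges into one weighted edge per author pair."""
    summed = defaultdict(Fraction)      # summed weight per (source, target) pair
    out_edges = defaultdict(int)        # e(A_k): outgoing intermediate edges
    out_weight = defaultdict(Fraction)  # total outgoing intermediate weight
    for src, dst, weight in edges:
        summed[(src, dst)] += weight
        out_edges[src] += 1
        out_weight[src] += weight
    out_targets = defaultdict(int)      # r(A_k): outgoing edges in the final graph
    for src, _dst in summed:
        out_targets[src] += 1

    def divisor(src):
        if method == "FUC":
            return 1
        if method == "FRCC":
            return out_targets[src]
        if method == "FREC":
            return out_edges[src]
        if method == "FRWC":
            return out_weight[src]
        raise ValueError(f"unknown method: {method}")

    return {pair: total / divisor(pair[0]) for pair, total in summed.items()}


for method in ("FUC", "FRCC", "FREC", "FRWC"):
    weights = final_weights(edges, method)
    print(method, weights[("A4", "A2")], weights[("A5", "A2")])
```

For the pair A4 → A2, which is backed by two intermediate edges, the sketch yields 2, 1, 1/2 and 1/2 for FUC, FRCC, FREC and FRWC respectively.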

Example

Considering the Paper-Citation graph of Figure 2.3 we are going to construct the derived Author-Citation graph and assign appropriate weights to all the edges in the Intermediate and final Author-Citation graph. In order to generate the Intermediate Author-Citation graph we first need to translate all the citations in the originating Paper-Citation graph to their corresponding author citations.

For example, let us consider the citation between papers P3 and P2 in the originating Paper-Citation graph. Paper P3 has been authored by A4 and paper P2 has been co-authored by authors A2 and A3. This means that the P3 → P2 citation will be represented by two edges in the Intermediate Author-Citation graph, A4 → A2 and A4 → A3. Similarly, paper P6 references P2 and since P6 has been authored by A4 as well, we would have two additional edges in the Author-Citation graph originating from A4 and again targeting authors A2 and A3 respectively. If we have a look at the Intermediate Author-Citation graph we see that there are two edges originating from A4 towards A2, one from the P3 → P2 citation and one from the P6 → P2 citation. After repeating the above process for all citations present in the originating Paper-Citation graph we will have constructed the Intermediate Author-Citation graph presented in Figure 2.4 (a).

Figure 2.4: Constructed Author-Citation graph: (a) Intermediate graph, (b) Final graph.

As discussed in the previous section, the weights of the edges in the Intermediate Author-Citation graph depend on the weights of the citations in the originating Paper-Citation graph and on the normalization method we have chosen for the Intermediate graph. In Figure 2.4 (a) the weights of the edges have been depicted with their mathematical notation and their actual values are presented in Tables 2.3 and 2.4.

In Table 2.3 we can see the weights assigned to each of the citations in the originating Paper-Citation graph when Full Counting (FUC) has been applied to the originating Paper-Citation graph and in Table 2.4 we can see the weights when Fractional Counting (FRC) has been applied to the Paper-Citation graph. In addition we can see the authors of each paper and the generated edges in the Intermediate Author-Citation graph along with their corresponding weights depending on whether we have normalized the weights of the citations or not.

Edge weight columns: None = w(CPiPj); Per citation = w(CPiPj) · 1/(a(Pi)∗a(Pj))

Citation    Weight    From    To    None    Per citation

CP2P1       1         A2      A1    1       1/(2∗1) = 1/2

CP2P1       1         A3      A1    1       1/(2∗1) = 1/2

CP3P2       1         A4      A2    1       1/(1∗2) = 1/2

CP3P2       1         A4      A3    1       1/(1∗2) = 1/2

CP4P2       1         A5      A2    1       1/(1∗2) = 1/2

CP4P2       1         A5      A3    1       1/(1∗2) = 1/2

CP4P3       1         A5      A4    1       1/1 = 1

CP5P4       1         A3      A5    1       1/1 = 1

CP6P2       1         A4      A2    1       1/(1∗2) = 1/2

CP6P2       1         A4      A3    1       1/(1∗2) = 1/2

Table 2.3: Paper (FUC) and Intermediate Author-Citation graph edge weights

For example, if we have applied Full counting in the originating Paper-Citation graph and we have chosen to normalize the weights of each edge in the Intermediate Author-Citation graph per Citation, then for the citation P3 → P2 we would say that it has weight equal to 1 in the originating Paper-Citation graph and the weight of the edge EA4A2,P3P2 would be 1 ∗ 1/(1∗2) = 1/2, since paper P3 has one author and P2 has two.
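The edge weight of Formula 2.3 can be checked numerically with a small Python sketch. The function name `intermediate_edge_weight` and its parameters are hypothetical; the inputs are the citation weight and the number of authors of the source and target papers.

```python
from fractions import Fraction


def intermediate_edge_weight(citation_weight, n_source_authors, n_target_authors, normalize):
    """Weight of one author edge derived from a single paper citation."""
    weight = Fraction(citation_weight)
    if normalize:
        # Normalize per citation: share the weight among all derived edges.
        return weight / (n_source_authors * n_target_authors)
    # No normalization: every derived edge keeps the citation weight.
    return weight


# Edge A4 -> A2 derived from P3 -> P2 (one source author, two target authors),
# with Full counting in the Paper-Citation graph (citation weight 1).
fuc_nc = intermediate_edge_weight(1, 1, 2, normalize=True)

# Edge A5 -> A2 derived from P4 -> P2, with Fractional counting
# (citation weight 1/2 because P4 references two papers).
frc_nn = intermediate_edge_weight(Fraction(1, 2), 1, 2, normalize=False)
frc_nc = intermediate_edge_weight(Fraction(1, 2), 1, 2, normalize=True)

print(fuc_nc, frc_nn, frc_nc)  # 1/2 1/2 1/4
```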

Edge weight columns: None = w(CPiPj); Per citation = w(CPiPj) · 1/(a(Pi)∗a(Pj))

Citation    Weight    From    To    None    Per citation

CP2P1       1         A2      A1    1       1/(2∗1) = 1/2

CP2P1       1         A3      A1    1       1/(2∗1) = 1/2

CP3P2       1         A4      A2    1       1/(1∗2) = 1/2

CP3P2       1         A4      A3    1       1/(1∗2) = 1/2

CP4P2       1/2       A5      A2    1/2     1/2 ∗ 1/(1∗2) = 1/4

CP4P2       1/2       A5      A3    1/2     1/2 ∗ 1/(1∗2) = 1/4

CP4P3       1/2       A5      A4    1/2     1/2 ∗ 1/1 = 1/2

CP5P4       1         A3      A5    1       1/1 = 1

CP6P2       1         A4      A2    1       1/(1∗2) = 1/2

CP6P2       1         A4      A3    1       1/(1∗2) = 1/2

Table 2.4: Paper (FRC) and Intermediate Author-Citation graph edge weights

The next step would be to construct the final Author-Citation graph from the Intermediate graph we just generated and assign the appropriate weights to the graph edges. The final Author-Citation graph will have exactly the same nodes as the Intermediate graph and it will have a single edge from one node to the other, i.e. if there is more than one edge between a particular author pair these will be replaced by a single edge with an appropriate weight.

Figure 2.4 (b) presents the final Author-Citation graph and, as we can see, the resulting Author-Citation graph has fewer edges since multiple edges between authors have been collapsed to single edges with different weights. The weights of the edges depend on the choices made in the previous steps and, as we have already mentioned, if we only examine the transition from the Intermediate graph to the Final one there are four different approaches that one could follow. The calculated weights of the edges when we apply Full counting in the originating Paper-Citation graph are presented in Table 2.5. The equivalent weights when we apply Fractional Counting in the Paper-Citation graph are presented in Table 2.6.

As we can see from both tables, the final weight of each edge in the resulting Author-Citation graph is affected by the transformations we choose to apply to the weights while building the final graph. In general, we can refer to the steps followed by naming the weight functions used to generate the graphs. For example, FUC-NN-FUC means that we applied Full Counting (FUC) in the originating Paper-Citation graph, we did not apply any normalization (NN) to the edge weights in the Intermediate Author-Citation graph and, finally, we applied Full Counting (FUC) in the resulting Author-Citation graph.

From Equation 2.4 we see that in all cases the weight of an edge in the resulting Author-Citation graph is equal to the sum of the weights of all individual connections in the Intermediate Author-Citation graph, divided by nothing (FUC), by the number of outgoing edges of the source author in the Intermediate graph (FREC), by the number of outgoing edges of the source author in the final graph (FRCC), or, finally, by the sum of the weights of the outgoing edges of the source author in the Intermediate graph (FRWC).

No normalization:

| Intermediate edge(s) | Citation | FUC | FRCC | FREC | FRWC |
|---|---|---|---|---|---|
| $w(E_{A_2A_1,P_2P_1}) = 1$ | $w(C^d_{A_2A_1})$ | $1$ | $1$ | $1$ | $1$ |
| $w(E_{A_3A_1,P_2P_1}) = 1$ | $w(C^d_{A_3A_1})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{A_3A_5,P_5P_4}) = 1$ | $w(C^d_{A_3A_5})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{A_4A_2,P_3P_2}) = 1$, $w(E_{A_4A_2,P_6P_2}) = 1$ | $w(C^d_{A_4A_2})$ | $2$ | $(1+1)/2 = 1$ | $(1+1)/4 = 1/2$ | $(1+1)/(1+1+1+1) = 1/2$ |
| $w(E_{A_4A_3,P_3P_2}) = 1$, $w(E_{A_4A_3,P_6P_2}) = 1$ | $w(C^d_{A_4A_3})$ | $2$ | $(1+1)/2 = 1$ | $(1+1)/4 = 1/2$ | $(1+1)/(1+1+1+1) = 1/2$ |
| $w(E_{A_5A_2,P_4P_2}) = 1$ | $w(C^d_{A_5A_2})$ | $1$ | $1/3$ | $1/3$ | $1/(1+1+1) = 1/3$ |
| $w(E_{A_5A_3,P_4P_2}) = 1$ | $w(C^d_{A_5A_3})$ | $1$ | $1/3$ | $1/3$ | $1/(1+1+1) = 1/3$ |
| $w(E_{A_5A_4,P_4P_3}) = 1$ | $w(C^d_{A_5A_4})$ | $1$ | $1/3$ | $1/3$ | $1/(1+1+1) = 1/3$ |

Normalize per citation:

| Intermediate edge(s) | Citation | FUC | FRCC | FREC | FRWC |
|---|---|---|---|---|---|
| $w(E_{A_2A_1,P_2P_1}) = 1/2$ | $w(C^d_{A_2A_1})$ | $1/2$ | $1/2$ | $1/2$ | $(1/2)/(1/2) = 1$ |
| $w(E_{A_3A_1,P_2P_1}) = 1/2$ | $w(C^d_{A_3A_1})$ | $1/2$ | $(1/2)/2 = 1/4$ | $(1/2)/2 = 1/4$ | $(1/2)/(1 + 1/2) = 1/3$ |
| $w(E_{A_3A_5,P_5P_4}) = 1$ | $w(C^d_{A_3A_5})$ | $1$ | $1/2$ | $1/2$ | $1/(1 + 1/2) = 2/3$ |
| $w(E_{A_4A_2,P_3P_2}) = 1/2$, $w(E_{A_4A_2,P_6P_2}) = 1/2$ | $w(C^d_{A_4A_2})$ | $1/2 + 1/2 = 1$ | $(1/2 + 1/2)/2 = 1/2$ | $(1/2 + 1/2)/4 = 1/4$ | $(1/2 + 1/2)/(1/2 + 1/2 + 1/2 + 1/2) = 1/2$ |
| $w(E_{A_4A_3,P_3P_2}) = 1/2$, $w(E_{A_4A_3,P_6P_2}) = 1/2$ | $w(C^d_{A_4A_3})$ | $1/2 + 1/2 = 1$ | $(1/2 + 1/2)/2 = 1/2$ | $(1/2 + 1/2)/4 = 1/4$ | $(1/2 + 1/2)/(1/2 + 1/2 + 1/2 + 1/2) = 1/2$ |
| $w(E_{A_5A_2,P_4P_2}) = 1/2$ | $w(C^d_{A_5A_2})$ | $1/2$ | $(1/2)/3 = 1/6$ | $(1/2)/3 = 1/6$ | $(1/2)/(1/2 + 1/2 + 1) = 1/4$ |
| $w(E_{A_5A_3,P_4P_2}) = 1/2$ | $w(C^d_{A_5A_3})$ | $1/2$ | $(1/2)/3 = 1/6$ | $(1/2)/3 = 1/6$ | $(1/2)/(1/2 + 1/2 + 1) = 1/4$ |
| $w(E_{A_5A_4,P_4P_3}) = 1$ | $w(C^d_{A_5A_4})$ | $1$ | $1/3$ | $1/3$ | $1/(1/2 + 1/2 + 1) = 1/2$ |

Table 2.5: Author-Citation graph edge weights when using FUC in the Paper-Citation graph

From the definitions and from the data presented in Tables 2.5 and 2.6, we can see that when the source author has the same number of outgoing edges in both the Intermediate and the resulting Author-Citation graph, there is no difference between the FRCC and FREC methods: the sum of the weights of the edges is divided by the same number. This is the case, for example, for the citation $C^d_{A_3A_1}$, where author A3 has two outgoing edges in the Intermediate graph and two outgoing edges in the final Author-Citation graph. On the other hand, when the two counts differ, the FRCC and FREC methods produce different results. For example, let us look at citation $C^d_{A_4A_2}$. Author A4 has four outgoing edges in the Intermediate graph, which have been collapsed to two in the resulting Author-Citation graph. In this case, the FRCC and FREC methods produce different weights, since FREC divides by four whereas FRCC divides by two.

No normalization:

| Intermediate edge(s) | Citation | FUC | FRCC | FREC | FRWC |
|---|---|---|---|---|---|
| $w(E_{A_2A_1,P_2P_1}) = 1$ | $w(C^d_{A_2A_1})$ | $1$ | $1$ | $1$ | $1$ |
| $w(E_{A_3A_1,P_2P_1}) = 1$ | $w(C^d_{A_3A_1})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{A_3A_5,P_5P_4}) = 1$ | $w(C^d_{A_3A_5})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{A_4A_2,P_3P_2}) = 1$, $w(E_{A_4A_2,P_6P_2}) = 1$ | $w(C^d_{A_4A_2})$ | $1 + 1 = 2$ | $(1+1)/2 = 1$ | $(1+1)/4 = 1/2$ | $(1+1)/(1+1+1+1) = 1/2$ |
| $w(E_{A_4A_3,P_3P_2}) = 1$, $w(E_{A_4A_3,P_6P_2}) = 1$ | $w(C^d_{A_4A_3})$ | $1 + 1 = 2$ | $(1+1)/2 = 1$ | $(1+1)/4 = 1/2$ | $(1+1)/(1+1+1+1) = 1/2$ |
| $w(E_{A_5A_2,P_4P_2}) = 1/2$ | $w(C^d_{A_5A_2})$ | $1/2$ | $(1/2)/3 = 1/6$ | $(1/2)/3 = 1/6$ | $(1/2)/(1/2 + 1/2 + 1/2) = 1/3$ |
| $w(E_{A_5A_3,P_4P_2}) = 1/2$ | $w(C^d_{A_5A_3})$ | $1/2$ | $(1/2)/3 = 1/6$ | $(1/2)/3 = 1/6$ | $(1/2)/(1/2 + 1/2 + 1/2) = 1/3$ |
| $w(E_{A_5A_4,P_4P_3}) = 1/2$ | $w(C^d_{A_5A_4})$ | $1/2$ | $(1/2)/3 = 1/6$ | $(1/2)/3 = 1/6$ | $(1/2)/(1/2 + 1/2 + 1/2) = 1/3$ |

Normalize per citation:

| Intermediate edge(s) | Citation | FUC | FRCC | FREC | FRWC |
|---|---|---|---|---|---|
| $w(E_{A_2A_1,P_2P_1}) = 1/2$ | $w(C^d_{A_2A_1})$ | $1/2$ | $1/2$ | $1/2$ | $(1/2)/(1/2) = 1$ |
| $w(E_{A_3A_1,P_2P_1}) = 1/2$ | $w(C^d_{A_3A_1})$ | $1/2$ | $(1/2)/2 = 1/4$ | $(1/2)/2 = 1/4$ | $(1/2)/(1 + 1/2) = 1/3$ |
| $w(E_{A_3A_5,P_5P_4}) = 1$ | $w(C^d_{A_3A_5})$ | $1$ | $1/2$ | $1/2$ | $1/(1 + 1/2) = 2/3$ |
| $w(E_{A_4A_2,P_3P_2}) = 1/2$, $w(E_{A_4A_2,P_6P_2}) = 1/2$ | $w(C^d_{A_4A_2})$ | $1/2 + 1/2 = 1$ | $(1/2 + 1/2)/2 = 1/2$ | $(1/2 + 1/2)/4 = 1/4$ | $(1/2 + 1/2)/(1/2 + 1/2 + 1/2 + 1/2) = 1/2$ |
| $w(E_{A_4A_3,P_3P_2}) = 1/2$, $w(E_{A_4A_3,P_6P_2}) = 1/2$ | $w(C^d_{A_4A_3})$ | $1/2 + 1/2 = 1$ | $(1/2 + 1/2)/2 = 1/2$ | $(1/2 + 1/2)/4 = 1/4$ | $(1/2 + 1/2)/(1/2 + 1/2 + 1/2 + 1/2) = 1/2$ |
| $w(E_{A_5A_2,P_4P_2}) = 1/4$ | $w(C^d_{A_5A_2})$ | $1/4$ | $(1/4)/3 = 1/12$ | $(1/4)/3 = 1/12$ | $(1/4)/(1/4 + 1/4 + 1/2) = 1/4$ |
| $w(E_{A_5A_3,P_4P_2}) = 1/4$ | $w(C^d_{A_5A_3})$ | $1/4$ | $(1/4)/3 = 1/12$ | $(1/4)/3 = 1/12$ | $(1/4)/(1/4 + 1/4 + 1/2) = 1/4$ |
| $w(E_{A_5A_4,P_4P_3}) = 1/2$ | $w(C^d_{A_5A_4})$ | $1/2$ | $(1/2)/3 = 1/6$ | $(1/2)/3 = 1/6$ | $(1/2)/(1/4 + 1/4 + 1/2) = 1/2$ |

Table 2.6: Author-Citation graph edge weights when using FRC in the Paper-Citation graph
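The four ways of collapsing the Intermediate graph can be sketched as follows. This is a minimal illustration under stated assumptions: the edge list reproduces the "No Normalization" Intermediate edges of Table 2.5, and the function name `collapse` and its `method` argument are hypothetical names chosen for this sketch.

```python
from collections import defaultdict
from fractions import Fraction

# Intermediate Author-Citation edges (source author, target author, weight),
# one entry per edge, using Full Counting and no normalization.
INTERMEDIATE = [
    ("A2", "A1", Fraction(1)),  # from P2 -> P1
    ("A3", "A1", Fraction(1)),  # from P2 -> P1
    ("A4", "A2", Fraction(1)),  # from P3 -> P2
    ("A4", "A3", Fraction(1)),  # from P3 -> P2
    ("A5", "A2", Fraction(1)),  # from P4 -> P2
    ("A5", "A3", Fraction(1)),  # from P4 -> P2
    ("A5", "A4", Fraction(1)),  # from P4 -> P3
    ("A3", "A5", Fraction(1)),  # from P5 -> P4
    ("A4", "A2", Fraction(1)),  # from P6 -> P2
    ("A4", "A3", Fraction(1)),  # from P6 -> P2
]


def collapse(edges, method="FUC"):
    """Merge parallel edges and divide the summed weight according to `method`."""
    total = defaultdict(Fraction)       # summed weight per (source, target) pair
    out_edges = defaultdict(int)        # outgoing edges per source, Intermediate graph
    out_weight = defaultdict(Fraction)  # outgoing weight per source, Intermediate graph
    for source, target, weight in edges:
        total[(source, target)] += weight
        out_edges[source] += 1
        out_weight[source] += weight
    out_targets = defaultdict(int)      # outgoing edges per source, final graph
    for source, _ in total:
        out_targets[source] += 1
    divisor = {
        "FUC": lambda s: 1,
        "FRCC": lambda s: out_targets[s],
        "FREC": lambda s: out_edges[s],
        "FRWC": lambda s: out_weight[s],
    }[method]
    return {pair: weight / divisor(pair[0]) for pair, weight in total.items()}


for method in ("FUC", "FRCC", "FREC", "FRWC"):
    print(method, collapse(INTERMEDIATE, method)[("A4", "A2")])
```

For the edge from A4 to A2 the sketch yields 2, 1, 1/2 and 1/2 respectively, which illustrates why FRCC and FREC differ whenever parallel edges have been collapsed.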

Known applications

The third type of derived graph (Normalize per citation) can be found in [Radicchi et al., 2009, 2012], where the Author-Citation graph is constructed from the Paper-Citation graph and is called the Weighted Author Citation Network (WACN).

2.4.2 Journal-Citation graph

In a manner similar to the Author-Citation graph, the Journal-Citation graph is a directed graph whose nodes are scientific journals and whose edges represent the citations provided from one journal to another. The nodes of the graph are the journals in which the papers of our collection have been published, and we define a journal citation as a citation provided from a paper P1 published in journal J1 to a paper P2 published in journal J2. The notation used to describe a journal citation is J1 → J2, and the transformations that we need to apply to the originating Paper-Citation graph in order to construct the Journal-Citation graph are the same as the ones presented earlier for constructing the Author-Citation graph:

• Step 1: Define the weight of the edges in the originating Paper-Citation graph

• Step 2: Produce the intermediate graph by transforming the paper citations to journal citations and define the weight of these edges

• Step 3: Collapse the multiple edges between two journals to a single edge with a suitable weight

In addition, the same mathematical notation and transformations defined in Section 2.4.1 apply to the derived Journal-Citation graphs. The only difference between the two processes lies in the construction of the Intermediate Journal-Citation graph: a paper citation is always translated to a single journal citation in the Intermediate graph, since a paper can only ever be published in one journal. This also means that the weights of the edges in the Intermediate Journal-Citation graph are the same as the weights of the edges in the originating Paper-Citation graph.

Example

In order to illustrate the process of creating a Journal-Citation graph, we are going to use the Paper-Citation graph of Figure 2.3 to produce the Intermediate and final Journal-Citation graphs for our collection of papers.

Following the same procedure as before, in order to generate the Intermediate Journal-Citation graph we create a new graph whose nodes are the journals the papers have been published in, and we transform each paper citation to an edge in the Intermediate graph. Let us consider for example the P2 → P1 reference. P2 is published in journal J2 and P1 in journal J1. This means that in the Journal-Citation graph we have an edge originating from node J2 and terminating at node J1. After examining all citations present in the Paper-Citation graph, the Intermediate Journal-Citation graph of Figure 2.5 (a) is generated. The weights of the edges are depicted using their mathematical notation since, as we have seen in the Author-Citation graph as well, there are different ways of calculating their values.

Figure 2.5: Constructed Journal-Citation graph: (a) Intermediate graph, (b) Final graph

In order to construct the final Journal-Citation graph we need to produce a new graph where multiple edges with the same direction between the same source and destination journals are combined into single edges. The resulting Journal-Citation graph can be seen in Figure 2.5 (b), and again the weights of the edges are depicted by their mathematical notation.

The final step is to assign the appropriate weights to the edges of the originating Paper-Citation graph and of the Intermediate and final Journal-Citation graphs. In Table 2.7 (a) and (b) we present the citations in the Paper-Citation graph, along with the weights of the edges when Full and Fractional Counting, respectively, have been applied in the originating Paper-Citation graph and the corresponding weights in the Intermediate Journal-Citation graph.

Table 2.8 presents the weights in the resulting Journal-Citation graph for each of the methods used to define them.
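The three construction steps can be sketched for the Journal-Citation graph as follows. The sketch assumes Full Counting in the Paper-Citation graph and a plain sum when collapsing edges (the FUC-NN-FUC combination); the paper-to-journal mapping is the one implied by the edges of Table 2.7, and `journal_graph` is a hypothetical helper name.

```python
from collections import defaultdict
from fractions import Fraction

# Journal of each paper, as implied by the edges of Table 2.7.
JOURNAL = {"P1": "J1", "P2": "J2", "P3": "J1", "P4": "J3", "P5": "J2", "P6": "J3"}

# Step 1: paper citations (citing, cited) with Full Counting weights.
CITATIONS = {
    ("P2", "P1"): Fraction(1),
    ("P3", "P2"): Fraction(1),
    ("P4", "P2"): Fraction(1),
    ("P4", "P3"): Fraction(1),
    ("P5", "P4"): Fraction(1),
    ("P6", "P2"): Fraction(1),
}


def journal_graph(citations, journal_of):
    """Return the Intermediate edge list and the collapsed (summed) final edges."""
    # Step 2: every paper citation becomes exactly one journal edge with the same weight.
    intermediate = [
        (journal_of[citing], journal_of[cited], weight)
        for (citing, cited), weight in citations.items()
    ]
    # Step 3: parallel edges between the same two journals are summed (FUC).
    final = defaultdict(Fraction)
    for source, target, weight in intermediate:
        final[(source, target)] += weight
    return intermediate, dict(final)


intermediate, final = journal_graph(CITATIONS, JOURNAL)
print(final[("J3", "J2")])  # P4 -> P2 and P6 -> P2 collapse into one edge of weight 2
```

Because each paper belongs to a single journal, the Intermediate graph has exactly as many edges as the Paper-Citation graph, unlike the author case where one citation may fan out into several edges.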

Known applications

(a) Full Counting in the Paper-Citation graph

| Citation | Weight | Journal edge: From | Journal edge: To | $w(C_{P_iP_j})$ |
|---|---|---|---|---|
| $C_{P_2P_1}$ | 1 | J2 | J1 | 1 |
| $C_{P_3P_2}$ | 1 | J1 | J2 | 1 |
| $C_{P_4P_2}$ | 1 | J3 | J2 | 1 |
| $C_{P_4P_3}$ | 1 | J3 | J1 | 1 |
| $C_{P_5P_4}$ | 1 | J2 | J3 | 1 |
| $C_{P_6P_2}$ | 1 | J3 | J2 | 1 |

(b) Fractional Counting in the Paper-Citation graph

| Citation | Weight | Journal edge: From | Journal edge: To | $w(C_{P_iP_j})$ |
|---|---|---|---|---|
| $C_{P_2P_1}$ | 1 | J2 | J1 | 1 |
| $C_{P_3P_2}$ | 1 | J1 | J2 | 1 |
| $C_{P_4P_2}$ | $1/2$ | J3 | J2 | $1/2$ |
| $C_{P_4P_3}$ | $1/2$ | J3 | J1 | $1/2$ |
| $C_{P_5P_4}$ | 1 | J2 | J3 | 1 |
| $C_{P_6P_2}$ | 1 | J3 | J2 | 1 |

Table 2.7: Edge weights in the Paper and Intermediate Journal-Citation graphs

A FUC-NN-FUC derived graph can be found in Bollen et al. [2006], while a FUC-NN-FREC derived graph is found in Bergstrom [2007], González-Pereira et al. [2010] and Guerrero-Bote and Moya-Anegón [2012], with slight differences. The differences lie in the way journal self-citations are treated and in the time constraints imposed during the construction of the graph.

More specifically, Bollen et al. [2006] do not consider journal self-citations while constructing the graph. On the other hand, in the methods used to construct the Journal-Citation graph for the computation of the EigenFactor scores [Wes, 2008], journal self-citations are completely removed, whereas González-Pereira et al. [2010] restrict the number of journal self-citations to 33% of the journal’s overall citation count. The same 33% limit can be inferred for the self-citations included in the Journal-Citation graph used in Guerrero-Bote and Moya-Anegón [2012], since the authors follow a procedure similar to the one presented in González-Pereira et al. [2010] in order to propose two indicators that extend the ones presented in the latter paper.

Time-wise, Bollen et al. [2006] use the generated graph as is, whereas for the calculation of the EigenFactor scores [Wes, 2008] only the subset of citations falling into a specific five-year window is included in the graph. By imposing this limitation, the produced graph contains a subset of the available information local to a specific time period. We refer to this property of the particular graphs as time-awareness. A similar time constraint is also imposed by González-Pereira et al. [2010] and Guerrero-Bote and Moya-Anegón [2012], with the time window set to three years instead of five.

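Full Counting in the Paper-Citation graph:

| Intermediate edge(s) | Citation | FUC | FRCC | FREC | FRWC |
|---|---|---|---|---|---|
| $w(E_{J_1J_2,P_3P_2}) = 1$ | $w(C^d_{J_1J_2})$ | $1$ | $1$ | $1$ | $1$ |
| $w(E_{J_2J_1,P_2P_1}) = 1$ | $w(C^d_{J_2J_1})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{J_2J_3,P_5P_4}) = 1$ | $w(C^d_{J_2J_3})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{J_3J_1,P_4P_3}) = 1$ | $w(C^d_{J_3J_1})$ | $1$ | $1/2$ | $1/3$ | $1/(1+1+1) = 1/3$ |
| $w(E_{J_3J_2,P_4P_2}) = 1$, $w(E_{J_3J_2,P_6P_2}) = 1$ | $w(C^d_{J_3J_2})$ | $2$ | $(1+1)/2 = 1$ | $(1+1)/3 = 2/3$ | $(1+1)/(1+1+1) = 2/3$ |

Fractional Counting in the Paper-Citation graph:

| Intermediate edge(s) | Citation | FUC | FRCC | FREC | FRWC |
|---|---|---|---|---|---|
| $w(E_{J_1J_2,P_3P_2}) = 1$ | $w(C^d_{J_1J_2})$ | $1$ | $1$ | $1$ | $1$ |
| $w(E_{J_2J_1,P_2P_1}) = 1$ | $w(C^d_{J_2J_1})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{J_2J_3,P_5P_4}) = 1$ | $w(C^d_{J_2J_3})$ | $1$ | $1/2$ | $1/2$ | $1/(1+1) = 1/2$ |
| $w(E_{J_3J_1,P_4P_3}) = 1/2$ | $w(C^d_{J_3J_1})$ | $1/2$ | $(1/2)/2 = 1/4$ | $(1/2)/3 = 1/6$ | $(1/2)/(1/2 + 1/2 + 1) = 1/4$ |
| $w(E_{J_3J_2,P_4P_2}) = 1/2$, $w(E_{J_3J_2,P_6P_2}) = 1$ | $w(C^d_{J_3J_2})$ | $1/2 + 1 = 3/2$ | $(1/2 + 1)/2 = 3/4$ | $(1/2 + 1)/3 = 1/2$ | $(1/2 + 1)/(1/2 + 1/2 + 1) = 3/4$ |

Table 2.8: Journal-Citation graph edge weights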

2.5 Indirect Citations

So far we have only presented Direct Citations, i.e. citations originating from a source paper and terminating at a target paper that appears in the List of References of the source paper. Apart from Direct Citations, though, we can also examine Indirect Citations, i.e. citations from a source paper to a target paper that are derived from paths of length greater than one linking the two papers in the Citation graph.

The notation we are going to use to specify an indirect citation between a source paper S and a target paper T via the intermediate paper I1 is S → I1 → T. The length of the path that generates this indirect citation is equal to two. Similarly, S → I1 → I2 → T signifies a length-three indirect citation from paper S to paper T via the intermediate papers I1 and I2. In order to better demonstrate the indirect citations present in a citation graph, let us examine the Paper-Citation graph of Figure 2.3 and particularly paper P1. P1 receives one direct citation from P2, and three indirect citations of length two: from P3 via P3 → P2 → P1, from P4 via P4 → P2 → P1 and from P6 via P6 → P2 → P1.
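Indirect citations of a given length can be enumerated mechanically. The following sketch lists all citation paths of a fixed length that end at a target paper, using the citations of the running example (P2 → P1, P3 → P2, P4 → P2, P4 → P3, P5 → P4, P6 → P2); the names `CITES` and `paths_to` are hypothetical and introduced only for illustration.

```python
# Direct citations of the running example: citing paper -> cited papers.
CITES = {
    "P2": ["P1"],
    "P3": ["P2"],
    "P4": ["P2", "P3"],
    "P5": ["P4"],
    "P6": ["P2"],
}


def paths_to(target, length, cites):
    """Return every citation path with exactly `length` hops that ends at `target`."""
    if length == 0:
        return [[target]]
    shorter = paths_to(target, length - 1, cites)
    # Prepend every paper that cites the first paper of a shorter path.
    return [
        [source] + path
        for path in shorter
        for source, cited in cites.items()
        if path[0] in cited
    ]


for path in paths_to("P1", 2, CITES):
    print(" -> ".join(path))
```

With `length=2` the sketch lists the three paths named in the text; with `length=3` it finds the longer paths P4 → P3 → P2 → P1 and P5 → P4 → P2 → P1.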

2.5.1 Definitions

The definitions and notations used in this section are presented in detail in Hu et al. [2011]. Based on that paper, there are four different definitions of generations for each of backward and forward citations, depending on whether a paper may appear more than once in a generation and on whether a paper that has already participated in a previous generation may be included again. The eight different types of generations are described below and listed in Table 2.9:

• Forward and Backward generations are denoted with a subscript n, with n being either a positive natural number (Forward generations) or a negative whole number (Backward generations).

• H denotes that the generations are to be defined independently and G denotes that the generations can only include papers not already included in a previous generation.

• The superscript s denotes that a paper can only be included once in a particular generation (the generation is a set) and m denotes that a paper can be included more than once in a generation (the generation is a multi-set).

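| Type | Relation | Papers | Notation |
|---|---|---|---|
| Forward | Independent | Unique papers per generation | $H^s_n$ |
| Forward | Independent | Non-unique papers per generation | $H^m_n$ |
| Forward | Restricted | Unique papers per generation | $G^s_n$ |
| Forward | Restricted | Non-unique papers per generation | $G^m_n$ |
| Backward | Independent | Unique papers per generation | $H^s_{-n}$ |
| Backward | Independent | Non-unique papers per generation | $H^m_{-n}$ |
| Backward | Restricted | Unique papers per generation | $G^s_{-n}$ |
| Backward | Restricted | Non-unique papers per generation | $G^m_{-n}$ |

Table 2.9: Generations definitions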

Rousseau [1987] defines generations that correspond to the $G^m_{-n}$ generations presented here. In that paper, backward citation generations are used to determine the influence that references have on the paper under scrutiny.

In [Dervos and Kalkanis, 2005, Dervos et al., 2006a], the authors define the Cascading Citations Indexing Framework (cc-IF), in which citation generations are defined as $H^m_n$ generations. In these papers, self-citation generations are also defined at the (paper, author) level. The definition follows the same pattern but, instead of considering the citations solely at the paper level, the author of each citation is examined as well. If the current author is a co-author of the source paper of the currently examined citation, then the citation is an n-generation self-citation for the current author.

Another aspect of the citations examined in these papers is the existence of chords that we will also examine in the following section. A chord is created when a source paper S cites the same paper both directly and indirectly via a citation path of length greater than one. Depending on the length of the indirect citation, different levels of chords of different ranks can be defined (i.e. second, third, fourth, etc. generation citation).
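The definition of a chord translates directly into a reachability check. The sketch below is an illustration only: it uses the citations of the running example with one additional citation P3 → P1 (the graph used in the example of the next section), and the helper names `reachable_indirectly` and `chords` are hypothetical.

```python
# Citing paper -> set of directly cited papers (running example plus P3 -> P1).
CITES = {
    "P2": {"P1"},
    "P3": {"P1", "P2"},
    "P4": {"P2", "P3"},
    "P5": {"P4"},
    "P6": {"P2"},
}


def reachable_indirectly(source, cites):
    """Papers reachable from `source` through a citation path of length two or more."""
    seen = set()
    # Start from the references of the papers that `source` cites directly.
    stack = [t for mid in cites.get(source, ()) for t in cites.get(mid, ())]
    while stack:
        paper = stack.pop()
        if paper not in seen:
            seen.add(paper)
            stack.extend(cites.get(paper, ()))
    return seen


def chords(cites):
    """(source, target) pairs where source cites target both directly and indirectly."""
    return sorted(
        (source, target)
        for source, direct in cites.items()
        for target in direct & reachable_indirectly(source, cites)
    )


print(chords(CITES))
```

In this graph the sketch reports two chords: P3 cites P1 directly and via P2, and P4 cites P2 directly and via P3.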

Finally, Kosmulski [2010] studies forward generations, and particularly second-generation citations of the $G^s_n$ definition.

2.5.2 Example

Now, let us consider the Paper-Citation graph of Figure 2.6, which originates from the Paper-Citation graph of Figure 2.3. The extra meta-data information has been removed, so only the papers and their citations are included in the graph. In addition, in order to better demonstrate some of the concepts of the different definitions of generations of citations, one extra citation has been added from paper P3 to paper P1.

Figure 2.6: Paper-Citation graph with only Papers and Citations

Table 2.10 shows the different sets of forward citation generations, according to the definitions listed above, for paper P1, which we consider to be the only paper included in generation 0.

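| Relation | Non-unique | Unique |
|---|---|---|
| Independent | $H^m_0$ = {P1} | $H^s_0$ = {P1} |
| Independent | $H^m_1$ = {P2, P3} | $H^s_1$ = {P2, P3} |
| Independent | $H^m_2$ = {P3, P4, P4, P6} | $H^s_2$ = {P3, P4, P6} |
| Independent | $H^m_3$ = {P4, P5, P5} | $H^s_3$ = {P4, P5} |
| Independent | $H^m_4$ = {P5} | $H^s_4$ = {P5} |
| Restricted | $G^m_0$ = {P1} | $G^s_0$ = {P1} |
| Restricted | $G^m_1$ = {P2, P3} | $G^s_1$ = {P2, P3} |
| Restricted | $G^m_2$ = {P4, P4, P6} | $G^s_2$ = {P4, P6} |
| Restricted | $G^m_3$ = {P5, P5} | $G^s_3$ = {P5} |
| Restricted | $G^m_4$ = {} | $G^s_4$ = {} |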

Table 2.10: Different types of forward citation generations for paper P1.

There are several citations worth noting in Table 2.10, depending on which aspect of the definitions we would like to focus on. For example, in order to demonstrate the differences between the Set and Multi-Set definitions we can look at the level-two citations originating from paper P4. P4 provides two level-two citations to paper P1, via P4 → P2 → P1 and P4 → P3 → P1. According to the $H^m$ (Independent / Non-unique) and $G^m$ (Restricted / Non-unique) definitions, P4 is included twice in the citations of the second generation, whereas if we follow the $H^s$ (Independent / Unique) and $G^s$ (Restricted / Unique) definitions, P4 is included only once. The same is true for paper P5 and the third-generation citations.

If we wish to highlight the differences between the Independent and Restricted definitions, we can again examine the citations originating from paper P4, but this time we are going to include all citations in our analysis independent of length. As we can see from Figure 2.6, there are three citation paths that originate from P4 and terminate at paper P1. Two are of length two (P4 → P2 → P1 and P4 → P3 → P1) and one is of length three (P4 → P3 → P2 → P1). According to the Independent definitions ($H^s$ and $H^m$), each generation is defined independently from any of the previous generations and therefore P4 is included in both the second and third generations. On the other hand, under the Restricted definitions ($G^s$ and $G^m$) a paper can only participate in a generation of citations if it has not already been included in a generation of lower rank. Therefore, P4 is only present in the second generation of citations.
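The generations of Table 2.10 can be reproduced with a short breadth-first expansion. The sketch below is one possible reading of the definitions, not the procedure of Hu et al. [2011]: multi-sets are represented as `Counter` objects, and the Restricted variant is assumed to drop any paper that appeared in an earlier generation. The names `CITED_BY` and `forward_generations` are hypothetical.

```python
from collections import Counter

# Cited paper -> papers citing it, for the graph of Figure 2.6.
CITED_BY = {
    "P1": ["P2", "P3"],
    "P2": ["P3", "P4", "P6"],
    "P3": ["P4"],
    "P4": ["P5"],
}


def forward_generations(target, cited_by, depth, restricted=False):
    """Return generations 0..depth as multi-sets (Counter: paper -> multiplicity)."""
    generations = [Counter({target: 1})]
    seen = {target}
    for _ in range(depth):
        nxt = Counter()
        for paper, count in generations[-1].items():
            for citer in cited_by.get(paper, []):
                if restricted and citer in seen:
                    continue  # G definitions: skip papers of earlier generations
                nxt[citer] += count
        generations.append(nxt)
        seen.update(nxt)
    return generations


h = forward_generations("P1", CITED_BY, 4)                   # H^m generations
g = forward_generations("P1", CITED_BY, 4, restricted=True)  # G^m generations
print(sorted(h[2].elements()), sorted(g[2].elements()))
```

The Unique (set) variants $H^s$ and $G^s$ follow by taking `set(generation)` of the corresponding multi-set.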

2.6 Generations of self-citations

When a Paper-Citation graph is examined from the paper point of view the authors of the papers do not really participate in the process. But if we choose to examine the papers with regards to their contribution to the Publication Record of a particular author, one might wish to include extra information that relates to the author in question. In this section we are going to examine the concept of self-citations and how we believe this fits with the definitions of Indirect Citations presented in the previous section.

2.6.1 Definition

We say that there exists a direct self-citation between papers P1 and P2 for author A1 if paper P2 cites paper P1 and A1 has co-authored both papers [Fragkiadaki and Evangelidis, 2016]. This definition complies with the Own self-citations definition given earlier in this chapter. When one wishes to account for the existence of self-citations, it is common practice to examine a paper at the author level, either by simply counting the number of self-citations and supplying this number alongside the full citation count, or by completely removing the self-citations from the list of citations for the paper and author in question. So, in the same sense that self-citations are defined for a particular (paper, author) pair in the case of direct citations, we define the generations of self-citations for a (paper, author) pair for all indirect citations. This concept was originally discussed in the Cascading-Citations Indexing Framework (cc-IF) defined in Dervos, Samaras, Evangelidis, and Folias [2006a], where the generations of self-citations were defined as forward $G^m$.

In general, an n-gen self-citation for a (paper, author) pair (P, A) is defined by a citation path of length n originating from a source paper and ending at paper P, with author A being present in the author list of both papers. Therefore, in the definition of a self-citation the only points of interest are the source and target papers and the corresponding authors. For example, the citation path P6 → P3 → P1 is considered a 2-gen self-citation for author A1, but the citation path P7 → P6 → P3 → P1 is simply considered a 3-gen citation even though it passes through a paper co-authored by A1.
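Since only the two end points of a path matter, the test for an n-gen self-citation is very small. The following sketch assumes the co-author lists of the example graph used in the next section (P1 and P6 by A1 and A2, P3 by A3 and A4, P7 by A2 and A3); `is_self_citation` is a hypothetical helper name.

```python
# Co-authors of the papers on the two example paths (assumed from the example graph).
AUTHORS = {
    "P1": {"A1", "A2"},
    "P3": {"A3", "A4"},
    "P6": {"A1", "A2"},
    "P7": {"A2", "A3"},
}


def is_self_citation(path, author, authors=AUTHORS):
    """True when `author` co-wrote both the source and the target paper of the path."""
    source, target = path[0], path[-1]
    return author in authors[source] and author in authors[target]


print(is_self_citation(["P6", "P3", "P1"], "A1"))        # 2-gen self-citation for A1
print(is_self_citation(["P7", "P6", "P3", "P1"], "A1"))  # plain 3-gen citation for A1
```

Note that the intermediate papers are never inspected, which is why the second path is not a self-citation for A1 even though it passes through P6.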

2.6.2 Example

Let us consider the Paper-Citation graph of Figure 2.7, originally presented in Fragkiadaki and Evangelidis [2016]. As we can see, for each of the papers included in the graph we have also included information about the co-authors of the paper, the Year of Publication and the Journal in which the paper was published. The set of papers P that participate in the graph is P = {P1, P2, P3, P4, P5, P6, P7}, the set of authors A that have co-authored these papers is A = {A1, A2, A3, A4, A5} and, finally, the set of publication journals J is J = {J1, J2, J3}. The papers have been published between the years 2000 and 2004.

Figure 2.7: Paper-Citation graph that includes the paper meta-data information

The Paper-Citation table for our graph is presented in Table 2.11 and, as we can see, it includes one record for each paper along with all accompanying meta-data information. Paper P1 does not provide any reference to any of the other papers included in the graph, whereas papers P5 and P7 do not receive any direct or indirect citation from any of the papers.

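| Paper | Publication Year | Journal | Co-authors | References | Is cited by |
|---|---|---|---|---|---|
| P1 | 2000 | J1 | A1, A2 | - | P2, P3, P4 |
| P2 | 2001 | J2 | A3 | P1 | P5 |
| P3 | 2001 | J1 | A3, A4 | P1 | P6 |
| P4 | 2001 | J3 | A2, A4 | P1 | P6 |
| P5 | 2002 | J3 | A1, A5 | P2, P6 | - |
| P6 | 2003 | J1 | A1, A2 | P3, P4 | P5, P7 |
| P7 | 2004 | J3 | A2, A3 | P6 | - |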

Table 2.11: Paper-Citation table for the Paper-Citation graph of Figure 2.7.

In Table 2.12 we present the citation paths up to Length 3 for paper P1, with additional information about the co-authors of the source papers and the list of co-authors of paper P1. This means that we examine the citations at the (paper, author) level and for each pair we can deduce from the list of co-authors whether each citation path will result in a self-citation for the particular author or not.

This information is presented in the last column of the table, where for each (paper, author) pair we mark the self-citations. There are a total of 3 Length 1 citation paths, which result in 6 (paper, author) citation paths, since paper P1 has been co-authored by two authors. Of these 6 (paper, author) citation paths, 1 is a self-citation path. In particular, the citation that originates from paper P4 is considered a self-citation for author A2 of paper P1, since A2 has co-authored both papers.

In a similar manner, we can see that there are a total of 3 Length 2 paper citation paths that terminate at paper P1, which generate a total of 6 (paper, author) pairs, 5 of which are considered to be self-citations. Finally, there are a total of 4 Length 3 paper citation paths that terminate at paper P1, which generate a total of 8 (paper, author) pairs, 4 of which are considered to be self-citations.

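| Citation path | Source paper | Co-authors | Via | Target paper | Target author | Self-citation |
|---|---|---|---|---|---|---|
| Length 1 | P2 | A3 | - | P1 | A1 | |
| Length 1 | P2 | A3 | - | P1 | A2 | |
| Length 1 | P3 | A3, A4 | - | P1 | A1 | |
| Length 1 | P3 | A3, A4 | - | P1 | A2 | |
| Length 1 | P4 | A2, A4 | - | P1 | A1 | |
| Length 1 | P4 | A2, A4 | - | P1 | A2 | x |
| Length 2 | P5 | A1, A5 | P2 | P1 | A1 | x |
| Length 2 | P5 | A1, A5 | P2 | P1 | A2 | |
| Length 2 | P6 | A1, A2 | P3 | P1 | A1 | x |
| Length 2 | P6 | A1, A2 | P3 | P1 | A2 | x |
| Length 2 | P6 | A1, A2 | P4 | P1 | A1 | x |
| Length 2 | P6 | A1, A2 | P4 | P1 | A2 | x |
| Length 3 | P5 | A1, A5 | P6, P3 | P1 | A1 | x |
| Length 3 | P5 | A1, A5 | P6, P3 | P1 | A2 | |
| Length 3 | P5 | A1, A5 | P6, P4 | P1 | A1 | x |
| Length 3 | P5 | A1, A5 | P6, P4 | P1 | A2 | |
| Length 3 | P7 | A2, A3 | P6, P3 | P1 | A1 | |
| Length 3 | P7 | A2, A3 | P6, P3 | P1 | A2 | x |
| Length 3 | P7 | A2, A3 | P6, P4 | P1 | A1 | |
| Length 3 | P7 | A2, A3 | P6, P4 | P1 | A2 | x |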

Table 2.12: Direct and indirect citation paths for paper P1 of Figure 2.7. Self-citations are considered at the (paper, author) level.
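The path and self-citation counts quoted for Table 2.12 can be checked with a small script. This is an illustrative sketch only: the reference lists and co-author sets are copied from Table 2.11, and the helper names `paths_to` and `summarize` are hypothetical.

```python
# Citing paper -> referenced papers, and co-authors per paper (from Table 2.11).
REFERENCES = {
    "P2": ["P1"], "P3": ["P1"], "P4": ["P1"],
    "P5": ["P2", "P6"], "P6": ["P3", "P4"], "P7": ["P6"],
}
AUTHORS = {
    "P1": {"A1", "A2"}, "P2": {"A3"}, "P3": {"A3", "A4"}, "P4": {"A2", "A4"},
    "P5": {"A1", "A5"}, "P6": {"A1", "A2"}, "P7": {"A2", "A3"},
}


def paths_to(target, length):
    """All citation paths with exactly `length` hops that end at `target`."""
    if length == 0:
        return [[target]]
    return [
        [citing] + path
        for path in paths_to(target, length - 1)
        for citing, refs in REFERENCES.items()
        if path[0] in refs
    ]


def summarize(target, length):
    """Return (paper paths, (paper, author) paths, self-citation paths)."""
    paths = paths_to(target, length)
    pairs = [(path, author) for path in paths for author in sorted(AUTHORS[target])]
    # A pair is a self-citation when the target author also co-wrote the source paper.
    selfs = [(path, author) for path, author in pairs if author in AUTHORS[path[0]]]
    return len(paths), len(pairs), len(selfs)


for length in (1, 2, 3):
    print(length, summarize("P1", length))
```

For lengths 1, 2 and 3 the sketch returns (3, 6, 1), (3, 6, 5) and (4, 8, 4), matching the counts given in the text.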

We propose that when a paper is examined as part of the Publication Record of a scientist, it should be determined whether self-citations should be included in the generations of citations or not. If self-citations are included, then the results for the four definitions of generations are the same as the ones for the paper itself, since the author information does not participate in the definition. But if an author’s self-citations are to be excluded from the generations of citations for the (paper, author) pair, then the results will potentially differ.

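(a) Paper P1 and author A1

| Relation | Non-unique | Unique |
|---|---|---|
| Independent (H) | $H^m_0$ = {P1} | $H^s_0$ = {P1} |
| Independent (H) | $H^m_1$ = {P2, P3, P4} | $H^s_1$ = {P2, P3, P4} |
| Independent (H) | $H^m_2$ = {} | $H^s_2$ = {} |
| Independent (H) | $H^m_3$ = {P7, P7} | $H^s_3$ = {P7} |
| Restricted (G) | $G^m_0$ = {P1} | $G^s_0$ = {P1} |
| Restricted (G) | $G^m_1$ = {P2, P3, P4} | $G^s_1$ = {P2, P3, P4} |
| Restricted (G) | $G^m_2$ = {} | $G^s_2$ = {} |
| Restricted (G) | $G^m_3$ = {P7, P7} | $G^s_3$ = {P7} |

(b) Paper P1 and author A2

| Relation | Non-unique | Unique |
|---|---|---|
| Independent (H) | $H^m_0$ = {P1} | $H^s_0$ = {P1} |
| Independent (H) | $H^m_1$ = {P2, P3} | $H^s_1$ = {P2, P3} |
| Independent (H) | $H^m_2$ = {P5} | $H^s_2$ = {P5} |
| Independent (H) | $H^m_3$ = {P5, P5} | $H^s_3$ = {P5} |
| Restricted (G) | $G^m_0$ = {P1} | $G^s_0$ = {P1} |
| Restricted (G) | $G^m_1$ = {P2, P3} | $G^s_1$ = {P2, P3} |
| Restricted (G) | $G^m_2$ = {P5} | $G^s_2$ = {P5} |
| Restricted (G) | $G^m_3$ = {} | $G^s_3$ = {} |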

Table 2.13: Forward citation generations for (a) paper P1 and author A1 and, (b) for paper P1 and author A2 of Figure 2.7.
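The generations of Table 2.13 can be reproduced by expanding the citation paths level by level and dropping the sources co-authored by the author in question. The sketch below is one possible reading of the procedure, under the assumption that the Restricted variant excludes papers already kept in an earlier generation; the data come from Table 2.11 and the function name `generations` is hypothetical.

```python
from collections import Counter

# Citing paper -> referenced papers, and co-authors per paper (from Table 2.11).
REFERENCES = {
    "P2": ["P1"], "P3": ["P1"], "P4": ["P1"],
    "P5": ["P2", "P6"], "P6": ["P3", "P4"], "P7": ["P6"],
}
AUTHORS = {
    "P1": {"A1", "A2"}, "P2": {"A3"}, "P3": {"A3", "A4"}, "P4": {"A2", "A4"},
    "P5": {"A1", "A5"}, "P6": {"A1", "A2"}, "P7": {"A2", "A3"},
}


def generations(target, author, max_length, restricted=False):
    """Forward generations of (target, author) with the author's self-citations removed."""
    frontier = [target]  # source papers of all paths of the current length
    seen = {target}
    result = [Counter(frontier)]
    for _ in range(max_length):
        # Extend every path by one hop; intermediate papers are never filtered.
        frontier = [
            citing
            for paper in frontier
            for citing, refs in REFERENCES.items()
            if paper in refs
        ]
        kept = Counter(
            source
            for source in frontier
            if author not in AUTHORS[source] and not (restricted and source in seen)
        )
        result.append(kept)
        seen.update(kept)
    return result


print([sorted(g.elements()) for g in generations("P1", "A1", 3)])
print([sorted(g.elements()) for g in generations("P1", "A2", 3, restricted=True)])
```

Because the filter is applied only to the source of each path, author A1 ends up with an empty second generation but a non-empty third one.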

In order to demonstrate this, we are going to use the information presented in Table 2.12 to specify the generations of citations for the authors of paper P1, namely A1 and A2. The results are presented in Tables 2.13 (a) and (b) and, as we can see, they do differ for the two co-authors. It is interesting to examine the 2-gen and 3-gen citations for author A1 in Table 2.13 (a). After removing all self-citation paths for author A1, there is no citation path of length two left, which means that all 2-gen citations originate from papers co-authored by A1. As a consequence, generation 2 of citations for A1 is empty. This does not necessarily imply that A1 will not have any 3-gen citations, since self-citations are defined using only the starting and ending points of the citation paths, without examining the intermediate papers. Thus, even though A1 has no 2-gen citations (by any definition), A1 still has some 3-gen citations.

Chapter 3

Classifying assessment indicators

Because of the importance of citation analysis, a multitude of scientific indicators have been proposed in order to assist in the evaluation of scholarly impact [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010, van Eck and Waltman, 2008, Waltman and van Eck, 2012]. Usually, each of these indicators examines only one scientific entity and considers a variety of information from the citation meta-data. There are indicators, though, that have been defined as a general methodology that applies to all types of research units, as long as the appropriate unit-citation graph is used.

Some of the indicators utilize the notion of citation generations as such, while others do so indirectly by utilizing the information present in the entire citation graph. For example, the Gozinto theorem, proposed by Rousseau [1987], specifically determines the citation generations, while the popular PageRank algorithm, proposed by Page et al. [1999] and applied to bibliometrics by a number of researchers, is based on the information present in the whole citation graph without specifically naming citation generations as such.

PageRank, originally inspired by citation analysis and used for ranking pages on the web, has again found its way back to bibliometrics with many researchers attempting to explore the interlinking of the research units via their citation patterns. PageRank is defined recursively by equally dividing the influence value of a web page to its connected pages via the outbound links found on the page.

The model imitates a "random surfer" who chooses to blindly follow one of the outbound links of a page and thus navigates through the web in a number of random hops. From time to time the surfer chooses to end the current path and start a new one from a completely different point in the web, and does so with a probability determined by a pre-selected damping factor. The damping factor chosen in the original implementation of the algorithm was 0.85 [Page, Brin, Motwani, and Winograd, 1999]. The PageRank scores of the pages are calculated by the following formula:

$$PR_a = (1 - d) + d \cdot \sum_{i} \frac{PR_i}{N_i} \tag{3.1}$$

where $PR_a$ is the score of the current page (page a), d is the damping factor, $PR_i$ is the score of each page i directly citing page a and $N_i$ is the total number of pages cited by page i.
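Equation 3.1 can be evaluated by simple fixed-point iteration. The sketch below is a minimal illustration of the formula exactly as written above (without any normalization of the scores), applied to the citations P2 → P1, P3 → P2, P4 → P2, P4 → P3, P5 → P4 and P6 → P2; the function name `pagerank`, the starting value of 1.0 and the fixed number of iterations are assumptions of this sketch.

```python
# Citing paper -> cited papers (each cited paper listed once).
CITES = {
    "P2": ["P1"],
    "P3": ["P2"],
    "P4": ["P2", "P3"],
    "P5": ["P4"],
    "P6": ["P2"],
}


def pagerank(cites, d=0.85, iterations=100):
    """Iterate PR_a = (1 - d) + d * sum(PR_i / N_i) over the pages i citing a."""
    pages = set(cites) | {t for targets in cites.values() for t in targets}
    scores = {page: 1.0 for page in pages}
    for _ in range(iterations):
        updated = {}
        for page in pages:
            incoming = sum(
                scores[source] / len(targets)  # PR_i / N_i
                for source, targets in cites.items()
                if page in targets
            )
            updated[page] = (1 - d) + d * incoming
        scores = updated
    return scores


scores = pagerank(CITES)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

Papers without incoming citations (P5 and P6) settle at the base value 1 − d, while P1 obtains the highest score because it inherits the full score of the heavily cited P2.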

In this chapter we attempt to present a subset of the indicators found in the literature under a unified mathematical notation model which forms part of the contribution of the present study. We believe that the unified notation allows us to highlight the similarities of the indicators and to better classify them depending on the meta-data information they utilize.

In addition, we propose a novel categorization of a subset of author indicators that are based on the widely accepted h-index [Hirsch, 2005]. Extensive comparative reviews of the h-index indicators may be found elsewhere in the literature [Alonso et al., 2009, Bornmann et al., 2011]. For indicators based on the h-index, we introduce four distinct approaches taken by researchers and we provide two types of Hirsch algorithms for describing h-type indicators.

The rest of this chapter is structured as follows: Sections 3.1, 3.2 and 3.3 present an overview of the Paper, Author and Journal Indicators covered by the present study. The primary classification used is based on whether the indicators utilize both the direct and indirect information present in the meta-data of the publications included in the Citation graphs. Based on the research unit there is also a secondary classification scheme applied that further highlights the differences of the indicators.

3.1 Paper indicators

We define the paper indicators as the group of indices that assess the scientific impact of a single publication. These indices can draw information either from the list of direct citations or from citations of length greater than one found in the Paper-Citation graph.

3.1.1 Direct indicators

Number of Citations: The total number of citations a paper has received is the most common direct indicator used in citation analysis.

3.1.2 Indirect indicators

The Gozinto theorem [Rousseau, 1987]: In this paper, Rousseau determines the papers that had the greatest influence in the creation of the paper under scrutiny. Papers included in the reference list of the current paper (first generation) had direct influence whereas papers included in the reference lists of those papers (second generation) are considered to have an indirect influence.

The direct influence of a paper can be given a weight or considered to have a weight equal to 1. The weights of all direct influences that papers have on each other form an $n \times n$ matrix A, where n is the total number of papers. The author states that there are many ways in which the weights can be assigned, and in an example included in the paper two different methods are presented. The first method assigns an integer value to the references mentioned in the different sections of the paper under scrutiny. The direct influence is then calculated as the sum of all distinct values of all occurrences of the particular reference within the paper. A derived method uses this weight to calculate a weight that transforms all weights to numbers between 0 and 1.

Rousseau utilizes matrix A in his calculations of the total influence along with the Gozinto theorem.

Based on the theorem, the total influence of paper Ai on Aj is the sum, taken over all z papers, of the direct influence of Ai on Ak (aik) times the total influence of Ak on Aj (ckj), and is given by:

$$c_{ij} = \sum_{k=1}^{z} a_{ik} \cdot c_{kj} + \delta_{ij} \tag{3.2}$$

where δij denotes the Kronecker delta and is defined as

$$\delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \tag{3.3}$$

For more details on the calculations of the total influence we refer the reader to the original paper [Rousseau, 1987].
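As a short illustrative restatement (not taken from the original paper), relation (3.2) can be written in matrix form with A = (aij), C = (cij) and I the identity matrix. The closed form below assumes that I − A is invertible; this holds, for instance, when the citation graph is acyclic, because A is then nilpotent and the series has finitely many non-zero terms.

```latex
C = A\,C + I
\;\Longrightarrow\; (I - A)\,C = I
\;\Longrightarrow\; C = (I - A)^{-1} = I + A + A^{2} + A^{3} + \cdots
```

Under this reading, the term A^m collects the influence transmitted along chains of m successive references, so the first generation corresponds to A and the second generation to A².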

SCEAS Rank [Sidiropoulos and Manolopoulos, 2005]: A recursive scoring algorithm that aims to minimize some of the side effects of the original PageRank algorithm. According to the authors, the proposed score meets two conditions that are not met by the original PageRank algorithm: (a) the factor that should have the greatest influence over the score of a particular paper should be the number of direct citations, and (b) the addition of new citations in the Paper-Citation graph should have a greater effect on the scores of nearby rather than distant papers. In that respect, they proposed the following scoring formula:

$$S_a = \sum_{i} \frac{S_i + b}{N_i} \, a^{-1} \qquad (a \geq 1,\ b > 0) \tag{3.4}$$

where Sa is the score of the current paper (paper a), Si is the score of each paper i directly citing paper a, Ni is the total number of papers cited by paper i, b denotes the direct citation enforcement factor (which controls the effect that direct citations have on the calculated score) and the factor a (not to be confused with the paper subscript in Sa) denotes the speed with which an indirect citation enforcement converges to zero.

The authors also propose a generalization of the above formula and the original PageRank algorithm that introduces a damping factor (d) in the SCEAS rank:

$$S_a = (1 - d) + d \cdot \sum_{i} \frac{S_i + b}{N_i} \, a^{-1} \qquad (a \geq 1) \tag{3.5}$$
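A minimal sketch of how the basic SCEAS score (3.4) might be evaluated by repeated substitution. The function name `sceas_rank`, the dictionary representation of the Paper-Citation graph, the zero initial scores and the fixed iteration count are all assumptions made for illustration; the values a = 2 and b = 1 are chosen only to keep the arithmetic easy to follow and are not the authors' settings.

```python
def sceas_rank(citations, a=2.0, b=1.0, iterations=100):
    """Iterate S_x = sum over citing papers i of (S_i + b) / N_i / a.

    citations maps each citing paper to the list of papers it cites
    (hypothetical input format); N_i is the length of that list.
    """
    papers = set(citations) | {c for refs in citations.values() for c in refs}
    scores = {p: 0.0 for p in papers}  # assumed starting point
    for _ in range(iterations):
        updated = {p: 0.0 for p in papers}
        for citing, refs in citations.items():
            if not refs:
                continue  # a paper without references passes on nothing
            share = (scores[citing] + b) / len(refs) / a
            for cited in refs:
                updated[cited] += share
        scores = updated
    return scores


graph = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": []}
print(sceas_rank(graph))
```

In this toy graph p3 is cited directly by both p1 and p2, while p2 is cited only by p1, so p3 ends up with the highest score and the uncited p1 with a score of zero.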

PageRank [Ma et al., 2008]: The original PageRank algorithm applied to the Paper-Citation graph by changing the damping factor from its original value to 0.5. The selection of this value was based on an earlier study that, according to the authors of the paper, indicates that in the random surfer model for scientific papers the path followed is much shorter (in particular two papers).

Cumulative patent citations [Atallah and Rodríguez, 2006]: The proposed indicator was used as a means to measure the importance and quality of patents. It uses the Patent-Citation graph that is identical to the Paper-Citation graph with the only difference being that the nodes of the graph are patents instead of papers. Thus, this indicator can be used in the context of paper assessment and, therefore, is included in this study.

The Cumulative patent citations measure represents the sum of all direct and indirect citations received by a given patent. So, for a Patent-Citation graph with N patents, the score of generation j of citations received by patent x is given in (3.6), where ai(x) = 1 if a path exists between patents i and x and ai(x) = 0 otherwise. In more detail, the generational score of a patent is the sum of all direct citations received by the patents included in the previous generation.

$$S_j(x) = \sum_{i=1}^{N} a_i(x) \cdot S_{j-1}(i) \tag{3.6}$$

The score of patent x is then calculated by adding the individual generation scores from 0 to M, where, M is the maximum generation of citations for the patent x, i.e., the length of the longest path present in the Patent-Citation graph that terminates at x, and is given in (3.7).

$$S_T(x) = \sum_{i=0}^{M} S_i(x) \tag{3.7}$$

Weighted cumulative patent citations [Atallah and Rodríguez, 2006]: This is a weighted version of the Cumulative patent citations indicator and its purpose was to account for the closeness of citations to the cited patent. The weight of a generation i is calculated based on its distance from the patent, which means that generations closer to the patent have a greater influence on the score of the patent in question. The calculation of the Weighted cumulative patent citations is given in (3.8).

$$S_w(x) = \sum_{i=0}^{M} \left(1 - \frac{i}{M + 1}\right) \cdot S_i(x) \tag{3.8}$$
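The following sketch illustrates one possible reading of (3.6) to (3.8) on a toy graph. Several points are assumptions rather than statements from the original paper: ai(x) is taken to be 1 only when patent i cites patent x directly, the generation-0 score of every patent is taken to be 1, the graph is assumed to be acyclic, and the function names and dictionary input format are hypothetical.

```python
def generation_scores(citations, x):
    """Return [S_0(x), ..., S_M(x)] for an acyclic citation graph.

    citations maps each citing patent to the list of patents it cites.
    Assumptions: S_0 = 1 for every patent; a_i(x) = 1 iff i cites x directly.
    """
    cited_by = {}
    for citing, refs in citations.items():
        for cited in refs:
            cited_by.setdefault(cited, []).append(citing)

    nodes = set(citations) | set(cited_by) | {x}
    current = {p: 1 for p in nodes}  # generation 0
    scores = [current[x]]
    while True:
        # S_j(p) = sum of S_{j-1}(i) over the patents i that cite p
        nxt = {p: sum(current[i] for i in cited_by.get(p, [])) for p in nodes}
        if nxt[x] == 0:  # no citation chain of this length ends at x
            break
        scores.append(nxt[x])
        current = nxt
    return scores


def cumulative_score(scores):
    return sum(scores)  # (3.7)


def weighted_cumulative_score(scores):
    m = len(scores) - 1  # M, the maximum generation
    return sum((1 - i / (m + 1)) * s for i, s in enumerate(scores))  # (3.8)


graph = {"p1": ["p2", "p3"], "p2": ["p3"]}
s = generation_scores(graph, "p3")
print(s, cumulative_score(s), weighted_cumulative_score(s))
```

Under these assumptions the generation scores count citation chains of each length ending at the patent, and the weighted variant scales them down linearly with the generation number.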

CiteRank [Walker et al., 2007, Maslov and Redner, 2008]: This is an adaptation of the original PageRank algorithm that takes into consideration the fact that researchers usually traverse papers starting from a relatively new paper and following its references. So, apart from including the damping factor as the probability that the researcher will drop their current search path and start a new one, they also include a decay time (Tdir) that controls the probability that a paper will be selected as the start of a new research path. This probability is defined in (3.9)

$$p_i = e^{-\mathrm{age}_i / T_{dir}} \tag{3.9}$$

where agei denotes the age of the paper. Therefore, more recent papers have a higher probability of being selected as the starting point of a random walk.
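To make the effect of the decay time concrete, the sketch below evaluates (3.9) for a few papers. The function name `start_probabilities`, the input format and the sample ages are hypothetical; returning a second dictionary rescaled to sum to one is an added convenience assumed here, not something stated in the text.

```python
import math


def start_probabilities(ages, t_dir):
    """Return (raw, normalized) start weights for papers given their ages.

    raw[p] is exp(-age_p / t_dir) as in (3.9); normalized rescales the raw
    values so that they sum to 1 (an assumption made for illustration).
    """
    raw = {paper: math.exp(-age / t_dir) for paper, age in ages.items()}
    total = sum(raw.values())
    normalized = {paper: w / total for paper, w in raw.items()}
    return raw, normalized


raw, normalized = start_probabilities({"new": 0, "old": 5}, t_dir=2.5)
print(raw)
print(normalized)
```

With a decay time of 2.5 years, a paper that is five years old receives a raw weight of e^(-2), far below the weight of 1 assigned to a paper published in the current year.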

P-Rank [Yan et al., 2011]: This is a proposition to evaluate articles taking into account the heterogeneity of the citation networks. It uses a citation network that includes papers, citations, authors and journals and the final value is calculated as a combination of the importance of all three factors. For the papers aspect, the PageRank algorithm is used, whereas for the author and journal aspects the adjacency matrices are utilized.

PrestigeRank [Su et al., 2011]: The authors discuss the problem of the incompleteness of bibliometric databases, in the sense that usually not all citations to (or from) papers included in the database are present in the Paper-Citation graph used to calculate the paper scores based on PageRank. This poses a problem in the computations, since a paper providing one internal citation (to a paper included in the database) and five external citations (to papers not included in the current system) will, in the standard PageRank algorithm, transfer all its influence to the single paper included in the system instead of dividing its score among all the referenced papers.

In order to solve the problem, they introduce a ‘‘virtual node’’ that accumulates all citations originating from papers within the database to external papers and that is also responsible for providing all citations from external papers. The ‘‘virtual node’’ instantly solves the problem of dividing the influence of a paper among the referenced papers, but in order to also account for all external citations some more computations are needed. Based on the assumption that the more internal citations a paper has received the more external citations it should have received, they divide the score of the ‘‘virtual node’’ among the papers in the database.

All the discussed indirect paper-based indicators consider the information present in the entire Paper-Citation graph and are either independently defined or defined as modifications of the PageRank algorithm. One may also categorize some of them based on whether they consider: (a) the distance of citations from the current paper, (b) the age of the current paper, and, (c) the incompleteness of the Paper-Citation graph. Table 3.1 presents the classification scheme described and lists the indirect indicators that belong to each class.

| Relation to PageRank | Additional factors considered | Indicator |
|---|---|---|
| Independently defined | - | Gozinto theorem |
| Independently defined | Distance of citations from current paper | SCEAS Rank |
| Independently defined | - | Cumulative patent citations |
| Independently defined | Distance of citations from current paper | Weighted cumulative patent citations |
| Modifications of PageRank | Age of current paper | CiteRank |
| Modifications of PageRank | - | PageRank |
| Modifications of PageRank | - | P-Rank |
| Modifications of PageRank | Incompleteness of the Paper-Citation graph | Prestige-Rank |

Table 3.1: Classification of the paper-based indirect indicators

3.2 Author indicators

3.2.1 Direct indicators

Many of the direct indices used to assess an individual researcher have been proposed as variations of the original h − index, like the h² − index, the ch − index and the ch² − index [Kosmulski, 2006], while others as h − index supplementary indices, like the A − index, the R − index and the AR − index [Jin, Liang, Rousseau, and Egghe, 2007]. In addition, a number of completely new indices have appeared, like the hT − index [Anderson et al., 2008] and the Pa − index [Soler, 2007]. The list of indices has increased significantly and the ones just mentioned are but a fraction of the indices currently available.

These indices can be categorized based on the way they are calculated and/or defined. We distinguish the following categories:

Standard bibliometric indicators: As already mentioned, assessing a researcher’s scholarly impact has long been common practice. A number of scientific indicators have been assisting hiring and/or awarding bodies in their task.

h − index family of indicators: In 2005, in an attempt to create a simple single-valued indicator that could capture the scientific achievements of an individual in a robust way, Hirsch [2005] proposed the h − index.

‘‘A scientist has index h, if h of his/her Np papers have at least h citations each and the other (Np − h) papers have no more than h citations each.’’

The h − index found wide acceptance in the scientific community and many researchers worked towards eliminating its disadvantages by proposing a large number of indicators that were either variations of the original h − index, or supplementary indices to be used in conjunction with the original h − index, or new indices that somehow utilized the information provided by the h − index. Another important aspect of the h − index that also found wide acceptance among researchers was the Hirsch core [Rousseau, 2006]. The Hirsch core is identified as the set of papers in the publication record of a researcher that actually contribute to his h − index. Since more than one paper may qualify for the last position in the Hirsch core, it has been stated that even though the h − index of a researcher is unique his Hirsch core is not [Hirsch, 2010]. In case of such ties Rousseau [2006] suggests an anti-chronological listing of papers, so that the newest papers get included in the core. The same approach is followed by Jin et al. [2007] and Wan et al. [2007], even though in the former paper this approach favors the proposed indicator whereas in the latter case it does not. Particularly, for the h − index family of indicators, we attempt an even finer categorization, as follows:

Hirsch core approach: The indices included here treat some disadvantages of the h − index by defining a complementary index that utilizes the information in the Hirsch core and is meant to be used either in conjunction with the original h − index or as an independent indicator.

First Hirsch algorithm: We define as the First Hirsch algorithm any definition of an indicator that is computed using the following steps: (a) retrieve the list of papers in the publication record of the researcher, (b) apply a predefined scoring function to the individual papers and list them in descending order, (c) define a cutoff value for the indicator, and, (d) set the value of the indicator to be the maximum value obtained by applying the predefined cutoff value on the list of papers and scoring values. Article scoring functions were introduced in Sidiropoulos et al. [2007], and here we generalize, extend and combine this concept with the new concept of the cutoff values in order to introduce the First Hirsch algorithm class of indices.

Second Hirsch algorithm: We define as the Second Hirsch algorithm any definition of an indicator that is computed using the following steps: (a) retrieve the list of papers in the publication record of the researcher, (b) apply a predefined scoring function to the individual papers and list them in descending order, (c) define a cumulative function that combines the scoring values of the individual papers, and, (d) set the value of the indicator to be the maximum (or minimum) value that satisfies a predefined cutoff operation on the cumulative scoring value of all included papers.

Derived indices approach: The indices belonging here are either hybrid (they combine the original h − index with one of its variations) or derivative (they use the calculated h − index value to produce a new index).

Standalone bibliometric indicators: A large number of new bibliometric indicators have been proposed that consider a number of the factors of scholarly impact listed in the previous section and are independent from the h − index family of indicators. These indicators sometimes utilize information provided by the Standard bibliometric indicators.

The indicators are presented using a common mathematical framework, originally suggested by Woeginger [2008a,b], which we have extended and adapted as required for the indicators examined in this section.

Definition 1: Let N ≥ 0 denote the total number of papers a researcher has (co-)authored, i.e., the size of his publication record.

Definition 2: Let x = (x1, ..., xN) be the vector of the citations of the papers published by the researcher, sorted in descending order. The following statements are true: (a) the kth component of the vector x represents the total number of citations received by the kth most important paper of the researcher, (b) all vector components are non-negative integers, (c) since the papers are sorted in descending order based on their citation count, the inequality x1 ≥ x2 ≥ ... ≥ xN holds and, (d) if a researcher has not published any papers then vector x is empty.

Definition 3: Let n denote the scientific age of a researcher, i.e., the years passed since the year he published his first paper.

Definition 4: Let $n_i^p$ denote the age of a paper i, i.e., the years passed since the year paper i was published.

Definition 5: Let $n_i^a$ denote the number of co-authors of paper i.

Definition 6: Let $n_i^c(j)$ denote the age of citation j of paper i.

Standard bibliometric indicators

The standard bibliometric indicators include indices that were introduced before the proposal of the h − index family of indicators and are still utilized when producing the publication profile of an individual researcher.

Total number of papers (N): The total number of papers a researcher has (co-)authored during his whole scientific career. The total number of papers is a commonly used indicator [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010]. It has also been referred to as the p−method [Wu, 2010].

Total number of citations (NC): The total number of citations received by all the papers a researcher has (co-)authored during his whole scientific career. The total number of citations has also been referred to as the s − index [van Eck and Waltman, 2008] and the c − method [Wu, 2010].

Mean number of citations (MNC): The mean number of citations received by the papers the author has (co-)authored during his whole scientific career [Hirsch, 2005, 2007, Costas and Bordons, 2008]. The mean number of citations is expressed as

$$MNC = \frac{\sum_{i=1}^{N} x_i}{N}, \qquad N \geq 1 \tag{3.10}$$

and it is defined only when the researcher has (co-)authored at least one paper. It has also been referred to as the e − method [Wu, 2010]. Here, the cumulative count of citations received by the publications included in the Publication Record is divided by the number of publications to produce the mean number of citations for the papers an author has co-authored.

Percentage of highly cited papers (PHCP): The highly cited papers of a researcher are defined as the papers that have exceeded a specified threshold of citations. This threshold is arbitrarily defined and is greatly affected by the scientific field, mainly due to the distinct citation patterns that govern each scientific field. Once the threshold is defined, the percentage of highly cited papers is the percentage of the researcher’s papers that meet or exceed the required number of citations [Costas and Bordons, 2008, Waltman and van Eck, 2012].
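As a small illustration of (3.10) and of the PHCP definition, the sketch below computes both values from a list of citation counts. The function names `mean_citations` and `pct_highly_cited` and the sample threshold are hypothetical; the threshold is field-dependent, and the "meet or exceed" reading of it is assumed here.

```python
def mean_citations(x):
    """MNC as in (3.10); defined only for a non-empty publication record."""
    if not x:
        raise ValueError("MNC is undefined for an empty publication record")
    return sum(x) / len(x)


def pct_highly_cited(x, threshold):
    """Percentage of papers whose citation count meets or exceeds threshold."""
    if not x:
        raise ValueError("PHCP is undefined for an empty publication record")
    return 100.0 * sum(1 for c in x if c >= threshold) / len(x)


citations = [10, 8, 5, 4, 3]  # illustrative citation vector x
print(mean_citations(citations))       # total of 30 citations over 5 papers
print(pct_highly_cited(citations, 5))  # 3 of the 5 papers reach the threshold
```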

In Table 3.2, the Standard Bibliometric indicators are listed together with the factors of scholarly impact that they consider. We notice that the indicators included in this class focus on the Number of papers and the Number of citations. An exception is the PHCP, which also takes into account the scientific field, since the threshold of citations that a paper needs to reach in order to be part of the highly cited papers is arbitrarily defined and depends on the citation patterns of the specific field.

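| Indicator | Number of papers | Number of citations | Scientific age | Scientific field | Age of papers | Age of citations | Self citations | Co-authors |
|---|---|---|---|---|---|---|---|---|
| N | X | | | | | | | |
| NC | | X | | | | | | |
| MNC | X | X | | | | | | |
| PHCP | X | X | | X | | | | |

Table 3.2: Factors considered by the Standard Bibliometric indicators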

h − index family of indicators

In this subsection we present the bibliometric indicators belonging to the h − index family. Indicators of this category relate to the original h − index indicator and they either deal with some of its disadvantages, or are meant to be used alongside the original indicator to produce a more complete scientific profile of the researcher in question. The indicators are assigned to the four subcategories described earlier.

Hirsch core approach: Indices of this subcategory utilize the information of the Hirsch core and either introduce a supplementary index to be used in conjunction with the original h − index or define a new indicator.

hI − index: The hI − index was proposed as an alternative to the original h − index. It takes into account the number of co-authors of the papers included in the Hirsch core to compensate for the different publication patterns encountered mainly across different scientific fields [Batista et al., 2006]. The hI − index is calculated by dividing the value of h by the mean number of authors included in the papers in the Hirsch core and is expressed as

$$h_I = \frac{h}{\frac{1}{h}\sum_{i=1}^{h} n_i^a} = \frac{h^2}{\sum_{i=1}^{h} n_i^a} \tag{3.11}$$

A − index: The A − index was proposed as a supplementary index to compensate for the fact that the h − index does not consider all of the citations of the papers included in the Hirsch core [Jin et al., 2007]. It is defined as the average number of citations received by the publications in the Hirsch core [Jin et al., 2007] and is expressed as

$$A = \frac{1}{h} \sum_{i=1}^{h} x_i \tag{3.12}$$

R − index: The R − index was proposed as an improvement of the A − index that does not disfavor researchers with high h − index values (the A − index includes a division by h). The R − index is also a supplementary index to be used in conjunction with the original h − index. It is defined as the square root of the sum of all citations of the papers that are included in the Hirsch core [Jin et al., 2007] and is expressed as

$$R = \sqrt{\sum_{i=1}^{h} x_i} \tag{3.13}$$

AR − index: The AR−index [Jin et al., 2007] was proposed as an improvement of the R−index that compensates for the fact that the original h − index never decreases as time passes. If a researcher stops publishing new papers or if his new papers receive no citations, his h − index either remains stable (if past papers stop receiving new citations) or even increases over time (if past papers continue to receive new citations). Therefore, by including the scientific age of the papers in the Hirsch core, the AR − index produces a new supplementary index that can also be used in conjunction with the original h − index to provide a sense of citation density and age. It is defined as the square root of the sum of the citations received by each paper that belongs to the Hirsch core divided by the age of the paper and is expressed as

$$AR = \sqrt{\sum_{i=1}^{h} \frac{x_i}{n_i^p}} \tag{3.14}$$

We should mention that if more than one paper has the same number of citations, the definition implies that the newest paper is used, and therefore the denominator always favors the researcher.
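The Hirsch core indices defined so far, (3.11) to (3.14), can be sketched together as follows. The function name `hirsch_core_indices`, the tuple layout of the input and the sample data are hypothetical. Two simplifications are assumed: paper ages are strictly positive (so that the division in the AR − index is defined), and ties at the edge of the core are resolved by input order rather than by publication date.

```python
import math


def hirsch_core_indices(papers):
    """papers: list of (citations, age_in_years, n_authors) tuples."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    # h is the number of ranks r whose paper has at least r citations
    h = sum(1 for rank, p in enumerate(ranked, start=1) if p[0] >= rank)
    if h == 0:
        return {"h": 0, "hI": 0.0, "A": 0.0, "R": 0.0, "AR": 0.0}
    core = ranked[:h]  # the Hirsch core
    total = sum(c for c, _, _ in core)
    return {
        "h": h,
        "hI": h * h / sum(n for _, _, n in core),                   # (3.11)
        "A": total / h,                                             # (3.12)
        "R": math.sqrt(total),                                      # (3.13)
        "AR": math.sqrt(sum(c / age for c, age, _ in core)),        # (3.14)
    }


sample = [(10, 5, 2), (8, 4, 3), (5, 2, 1), (4, 8, 2), (3, 1, 4)]
print(hirsch_core_indices(sample))
```

For the sample record the Hirsch core contains the four most cited papers, which together hold 27 citations and 8 author slots.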

hp − index: The pure h − index attributes a partial value of the h − index to a particular author based on the number of (co-)authors included in the papers in his Hirsch core [Wan et al., 2007] and it is our understanding that it was proposed as a replacement of the original h − index. In order to attribute a partial value of the h − index to a particular author it divides the h − index by the square root of the equivalent Hirsch core average number of authors, which for researcher A we will denote as E. E is in turn calculated by summing the equivalent number of co-authors of A in each paper i of the Hirsch core, which we will denote as $n_i^E$, and then by dividing by the value of the h − index. The equivalent number of co-authors of A in paper i ($n_i^E$) is in turn equal to one divided by S(i), where S(i) denotes the normalized score of researcher A in paper i and can be calculated using one of four different functions, depending on whether we just wish to account for the number of participating authors or for the number and order of the authors in the author list. For the four different methods of scoring (Fractional, Arithmetic, Geometric and Noblesse oblige) we refer the reader to the original paper [Wan et al., 2007].

Thus, $n_i^E$ is expressed as

$$n_i^E = \frac{1}{S(i)} \tag{3.15}$$

The equivalent Hirsch core average number of authors for researcher A is

$$E = \frac{\sum_{i=1}^{h} n_i^E}{h} \tag{3.16}$$

and the hp − index is

$$h_p = \frac{h}{\sqrt{E}} = h \cdot \sqrt{\frac{h}{\sum_{i=1}^{h} n_i^E}} \tag{3.17}$$

Rp − index: The pure R − index is defined as a variation of the R − index [Jin et al., 2007] that also accounts for the number of authors in the papers in the Hirsch core of researcher A [Wan et al., 2007]. It is our understanding that, since the Rp − index was proposed as an alternative to the R − index, it was meant to be used in conjunction with the original h − index, while at the same time it is based on the same principles as the hp − index discussed earlier. The Rp − index for author A is

$$R_p = \sqrt{\frac{\sum_{i=1}^{h} x_i}{E}} \tag{3.18}$$

m − index: The m − index was proposed as an alternative to the A − index that, instead of the arithmetic average of the citations of the papers in the Hirsch core, considers the median of the number of citations, due to the very skewed nature of the distribution of citations [Bornmann et al., 2008].

hw − index: The hw − index depends on the obtained citations of the papers belonging to the Hirsch core and is defined both in a continuous and a discrete setting. We present here the discrete case of the citation-weighted h − index (hw − index), which, even though it loses some of the properties of the continuous case, is much easier to calculate [Egghe and Rousseau, 2008]. The calculations of the hw − index require that the papers in the publication record of a researcher are listed in descending order based on their citations and that the value of the h − index is already known. Then, for each paper its weighted rank (rw(i)) is

$$r_w(i) = \sum_{j=1}^{i} \frac{x_j}{h} \tag{3.19}$$

The r0 value is defined as the largest number i such that the weighted rank of paper i is smaller than or equal to xi, and is expressed as

$$r_0 = \max\{\, i \mid r_w(i) \leq x_i \,\} \tag{3.20}$$

Finally, the hw − index is defined as the square root of the sum of all citations received by the first r0 papers of the researcher and can be expressed as

$$h_w = \sqrt{\sum_{i=1}^{r_0} x_i} \tag{3.21}$$
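The three steps (3.19) to (3.21) can be traced with the short sketch below. The function name `hw_index` and the sample citation vector are hypothetical, and returning 0 for a researcher whose h − index is 0 is an assumption, since the weighted rank is undefined in that case.

```python
import math


def hw_index(x):
    """Discrete citation-weighted h-index of a list of citation counts."""
    x = sorted(x, reverse=True)
    h = sum(1 for rank, c in enumerate(x, start=1) if c >= rank)
    if h == 0:
        return 0.0  # assumed convention: weighted ranks are undefined
    r0 = 0
    cumulative = 0
    for i, c in enumerate(x, start=1):
        cumulative += c
        weighted_rank = cumulative / h  # r_w(i) as in (3.19)
        if weighted_rank <= c:          # condition of (3.20)
            r0 = i
    return math.sqrt(sum(x[:r0]))       # (3.21)


print(hw_index([10, 8, 5, 4, 3]))
```

For the sample vector h = 4, the weighted ranks are 2.5, 4.5, 5.75, 6.75 and 7.5, so only the first two papers satisfy the condition and the index is the square root of 18.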

Rm − index: The Rm − index is a modification of the R − index and is defined as the square root of the sum of the square roots of all citations received by the papers that are included in the Hirsch core [Panaretos and Malesios, 2009]. It is expressed mathematically as

$$R_m = \sqrt{\sum_{i=1}^{h} x_i^{1/2}} \tag{3.22}$$

e − index: The e − index is defined as the square root of the excess citations of all papers included in the Hirsch core [Zhang, 2009]. Excess citations are the citations received by the papers in the Hirsch core that are not accounted for by the h − index, i.e., those beyond the h² citations that the h − index implies. The e − index is expressed mathematically as

$$e = \sqrt{d^2 - h^2}, \qquad d^2 = \sum_{i=1}^{h} x_i \tag{3.23}$$
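A brief sketch of (3.23), assuming the citation counts are given as a plain list; the function name `e_index` and the sample data are hypothetical. Since d² in (3.23) is the same sum that appears under the root in (3.13), the result also satisfies R² = h² + e².

```python
import math


def e_index(x):
    """Square root of the citations in the Hirsch core beyond h * h."""
    x = sorted(x, reverse=True)
    h = sum(1 for rank, c in enumerate(x, start=1) if c >= rank)
    d_squared = sum(x[:h])  # all citations of the Hirsch core
    # Each core paper has at least h citations, so the difference is >= 0.
    return math.sqrt(d_squared - h * h)


print(e_index([10, 8, 5, 4, 3]))
```

For the sample vector the core holds 27 citations and h = 4, leaving 11 excess citations.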

j − index: The j − index was originally defined to account for the excess citations present in the Hirsch core [Todeschini, 2011]. It is based on 12 weighted increments and is calculated as

$$j = h + \frac{\sum_{k=1}^{12} w_k \cdot N_k}{\sum_{k=1}^{12} w_k} \tag{3.24}$$

where k denotes the number of the term, wk denotes the weight assigned to each term and is equal to wk = 1/k, and Nk is the number of papers that should be considered for the kth increment. Nk is calculated as the number of papers in the Hirsch core that have received more than h · ∆hk citations each. The values for k, wk and ∆hk are shown in the following table.

| k | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ∆hk | 500 | 250 | 100 | 50 | 25 | 10 | 5 | 4 | 3 | 2 | 1.5 | 1.25 |
| wk | 1.000 | 0.500 | 0.333 | 0.250 | 0.200 | 0.167 | 0.143 | 0.125 | 0.111 | 0.100 | 0.091 | 0.083 |

Table 3.3: Weighted increments used in the calculations of the j − index

For the wk values listed in the table we have $\sum_{k=1}^{12} w_k = 3.103$, so (3.24) becomes

$$j = h + \frac{\sum_{k=1}^{12} w_k \cdot N_k}{3.103} \tag{3.25}$$
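The calculation in (3.24) can be sketched as follows, using the ∆hk values of Table 3.3. The function name `j_index` and the sample data are hypothetical, and the code uses the exact weights 1/k rather than the rounded values of the table, so results may differ from (3.25) in the last decimals.

```python
DELTA_H = [500, 250, 100, 50, 25, 10, 5, 4, 3, 2, 1.5, 1.25]  # Table 3.3


def j_index(x):
    """j-index of a list of citation counts, following (3.24)."""
    x = sorted(x, reverse=True)
    h = sum(1 for rank, c in enumerate(x, start=1) if c >= rank)
    core = x[:h]
    weights = [1 / k for k in range(1, 13)]  # w_k = 1/k
    weighted_sum = 0.0
    for w, delta in zip(weights, DELTA_H):
        # N_k: core papers with more than h * delta_h_k citations
        n_k = sum(1 for c in core if c > h * delta)
        weighted_sum += w * n_k
    return h + weighted_sum / sum(weights)


print(j_index([10, 8, 5, 4, 3]))
```

For the sample vector h = 4, and only the last three increments contribute (one paper above 8 citations, two above 6 and two above 5), which raises the index slightly above 4.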

In Table 3.4, the Hirsch core indicators are listed along with the factors of scholarly impact they consider. The listed indicators utilize information provided by the papers included in the Hirsch core and are therefore calculated after we have calculated the h − index. For the calculation of the h − index both the number of papers and number of citations are needed, hence all of the Hirsch core indicators consider the Number of papers and Number of citations factors. Other than that, AR − index also considers the Age of the citations, pure h − index (hp − index) and pure R − index (Rp − index) consider the effects of co-authorship, and, finally, hI − index considers the effect of co-authorship as an indication of the scientific field.

| Indicator | Number of papers | Number of citations | Scientific age | Age of papers | Age of citations | Self citations | Co-authorship | Scientific field |
|---|---|---|---|---|---|---|---|---|
| hI − index | X | X | | | | | X | X |
| A − index | X | X | | | | | | |
| R − index | X | X | | | | | | |
| AR − index | X | X | | | X | | | |
| hp − index | X | X | | | | | X | |
| Rp − index | X | X | | | | | X | |
| m − index | X | X | | | | | | |
| hw − index | X | X | | | | | | |
| Rm − index | X | X | | | | | | |
| e − index | X | X | | | | | | |
| j − index | X | X | | | | | | |

Table 3.4: Factors considered by the indicators in the Hirsch core approach subcategory

First Hirsch algorithm: The indices included in this subcategory follow the First Hirsch algorithm. Let us redefine its steps for defining an indicator z using the notations presented earlier:

1. Retrieve the list of all papers in the publication record of a researcher

2. Define the paper scoring function (S) for the indicator

3. Order the list of papers in descending order based on their value, as calculated by Si for each paper i

4. Define the cutoff value (V) for the indicator

5. Then, the indicator z is defined as z = max{u | ∀ i ∈ [1, u]: Si ≥ V}
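The five steps above can be sketched as one generic routine in which the scoring values and the cutoff are supplied by the caller. The function name `first_hirsch` and the representation of the cutoff value V as a function of the candidate value u are assumptions made for illustration.

```python
def first_hirsch(scores, cutoff):
    """Largest u such that the u highest scores are all >= cutoff(u).

    scores: the values S_i of the individual papers (any order).
    cutoff: function mapping a candidate value u to the cutoff value V.
    """
    ordered = sorted(scores, reverse=True)
    z = 0
    for u in range(1, len(ordered) + 1):
        # The list is sorted, so the first u scores all reach the cutoff
        # exactly when the u-th (smallest of them) does.
        if ordered[u - 1] >= cutoff(u):
            z = u
    return z


citations = [10, 8, 5, 4, 3]
print(first_hirsch(citations, lambda u: u))       # cutoff u
print(first_hirsch(citations, lambda u: u * u))   # cutoff u squared
print(first_hirsch(citations, lambda u: 10 * u))  # cutoff 10 * u
```

Changing only the cutoff (or only the scores passed in) yields different members of this class, which is the pattern the subcategory is built on.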

h − index: The h − index is defined as the largest number h, such that h of the papers have received at least h citations each, while the rest N − h papers have not received more than h citations each [Hirsch, 2005].

h² − index: The h² − index is defined as the largest number h of papers that have received at least h² citations each, while the rest N − h papers have not received more than h² citations each [Kosmulski, 2006].

ch − index: The papers in the publication record of the researcher are listed in descending order based on their citation count corrected for self-citations (own self-citations). Then, the ch − index is defined as the largest number ch such that ch of the papers have received at least ch citations each, while the rest N − ch papers have received no more than ch citations each [Kosmulski, 2006].

ch² − index: The papers in the publication record of the researcher are listed in descending order based on their citation count corrected for self-citations (own self-citations). Then the ch² − index is defined as the largest number ch such that ch of the papers have received at least ch² citations each, while the rest N − ch papers have received no more than ch² citations each [Kosmulski, 2006].

ha − index: The ha − index is a generalization of the h − index, which for a ∈ (0, ∞) is defined as the largest number ha such that ha of a researcher’s papers have at least a · ha citations each, while the rest N − ha papers have no more than a · ha citations each [van Eck and Waltman, 2008].

w − index: The w − index is defined as the largest number w of papers that have received at least 10 · w citations each, while the rest N − w papers have not received more than 10 · w citations each [Wu, 2010].

w(q) − index: The w(q) − index is defined as the largest number w of papers that have received at least 10 · w citations each, while the rest N − w papers have not received more than 10 · w citations each. In addition, the factor q is defined as the least number of citations a researcher needs to increase his w − index by 1 [Wu, 2010].

hf − index: For the calculation of the hf − index the papers included in the publication record of the researcher are listed in descending order based on the number of citations received divided by the number of co-authors of each paper. Then the hf − index is defined as the largest number of papers hf such that this fraction is greater than or equal to hf [Schreiber, 2008a].

ho − index: For the calculation of the ho − index the papers included in the publication record of the researcher are listed in descending order based on their citation count after removing every self-citation of the first kind as defined in Schreiber [2007] (own self-citations). Then, the ho − index is the largest number of papers ho such that the number of citations is greater than or equal to ho and the rest N − ho papers have no more than ho citations each [Schreiber, 2007]. This index is actually the same as the ch − index defined above.

hc − index: For the calculation of the hc − index the papers included in the publication record of the researcher are listed in descending order based on their citation count after removing every self-citation of the second kind as defined in Schreiber [2007] (co-author self-citations). Then, the hc − index is the largest number of papers hc such that the number of citations is greater than or equal to hc and the rest N − hc papers have no more than hc citations each [Schreiber, 2007].

hs − index: For the calculation of the hs − index the papers included in the publication record of the researcher are listed in descending order based on their citation count after removing every self-citation of the third kind as defined in Schreiber [2007] (all self-citations). Then, the hs − index is the largest number of papers hs such that the number of citations is greater than or equal to hs and the rest N − hs papers have no more than hs citations each [Schreiber, 2007].

hc − index: The contemporary h − index (hc − index) is a variant of the original h − index indicator that considers the age of the papers. All papers in the publication record of the researcher are listed in descending order based on the scoring function

$$S_i = \gamma \cdot (n_i^p + 1)^{-\delta} \cdot x_i \tag{3.26}$$

In the scoring function, γ is an arbitrarily chosen coefficient so that the resulting hc − index is not too small. In Sidiropoulos et al. [2007], γ was selected to be 4. In addition, δ defines the strength of the time penalty. The greater the value of δ, the more the age of a paper reduces its score. The hc − index is then defined as the largest number hc such that the value of the scoring function for that paper is greater than or equal to hc and the remaining N − hc papers have a score of no more than hc each [Sidiropoulos et al., 2007].
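A minimal sketch of the contemporary h − index, combining the scoring function (3.26) with the cutoff of the original h − index. The function name `contemporary_h_index`, the input format and the sample data are hypothetical; γ = 4 follows the value quoted above, whereas δ = 1 is only an illustrative choice.

```python
def contemporary_h_index(papers, gamma=4.0, delta=1.0):
    """papers: iterable of (citations, age_in_years) pairs."""
    scores = sorted(
        (gamma * (age + 1) ** (-delta) * citations for citations, age in papers),
        reverse=True,
    )
    h_c = 0
    for u, score in enumerate(scores, start=1):
        if score >= u:  # the u best-scoring papers all score at least u
            h_c = u
    return h_c


old_record = [(10, 9), (8, 9), (6, 9), (4, 9)]    # every paper is 9 years old
fresh_record = [(10, 0), (8, 0), (6, 0), (4, 0)]  # same citations, new papers
print(contemporary_h_index(old_record), contemporary_h_index(fresh_record))
```

Both records have the same citation counts and therefore the same plain h − index, but the time penalty gives the older record a lower contemporary value.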

hT − index: The trend h − index (hT − index) is a variant of the original h − index indicator that considers the age of the citations. All papers in the publication record of the researcher are listed in descending order based on the scoring function

$$S_i = \gamma \cdot \sum_{j=1}^{x_i} \left(n_i^c(j) + 1\right)^{-\delta} \tag{3.27}$$

In the scoring function, γ is an arbitrarily chosen coefficient so that the resulting hT − index is not too small. In Sidiropoulos et al. [2007], γ was selected to be 4. In addition, δ defines the strength of the time penalty. The greater the value of δ, the more the age of the citation reduces its contribution to the score of the paper. The hT − index is then defined as the largest number hT such that the value of the scoring function is greater than or equal to hT and the remaining N − hT papers have a score of no more than hT each [Sidiropoulos et al., 2007].
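The trend h − index can be sketched in the same way, this time scoring each paper by the ages of its individual citations as in (3.27). The function name `trend_h_index`, the list-of-lists input format and the sample data are hypothetical; γ = 4 follows the value quoted above and δ = 1 is an illustrative choice.

```python
def trend_h_index(citation_ages, gamma=4.0, delta=1.0):
    """citation_ages: one list per paper holding the age of each citation."""
    scores = sorted(
        (gamma * sum((age + 1) ** (-delta) for age in ages)
         for ages in citation_ages),
        reverse=True,
    )
    h_t = 0
    for u, score in enumerate(scores, start=1):
        if score >= u:  # the u best-scoring papers all score at least u
            h_t = u
    return h_t


record = [
    [0, 0, 1],     # three recent citations
    [9, 9, 9, 9],  # four citations, all nine years old
    [0],           # a single citation from the current year
]
print(trend_h_index(record))
```

In this example the paper with four old citations scores lower than the paper with a single recent one, which illustrates how the indicator rewards work that is still being cited.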

had − index: The age decaying h − index (had − index)[Katsaros et al., 2007] is a combination of the contemporary h − index and the trend h − index [Sidiropoulos et al., 2007] that considers both the age of the papers and the age of the citations of the papers in the publication record of the researcher. All papers in the publication record of a researcher are listed in descending order based on the scoring function

\[ S_i = \gamma^2 \cdot (n_i^p + 1)^{-\delta_1} \cdot \sum_{j=1}^{x_i} \left(n_i^c(j) + 1\right)^{-\delta_2} \tag{3.28} \]

In the scoring function, the $\gamma$ coefficient is arbitrarily chosen so that the resulting $h_{ad}$-index is not too small. The $\delta_1$ and $\delta_2$ values are the time penalties applied to the age of the papers and the age of each citation respectively. The larger the values, the higher the penalty. The relative difference between $\delta_1$ and $\delta_2$ indicates whether we wish the contribution of the age of the papers or the contribution of the age of the citations to be greater. The $h_{ad}$-index is then defined as the greatest number $h_{ad}$ such that $h_{ad}$ papers each have a score greater than or equal to $h_{ad}$ and the remaining $N - h_{ad}$ papers have a score of no more than $h_{ad}$ each.

Table 3.5 provides a list of all indicators of this subcategory along with their corresponding scoring function, cutoff value and definition as described by the First Hirsch algorithm. We can observe the similarities of the indicators of this subcategory as well as the pattern that we defined as the First Hirsch algorithm.

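| Indicator | Scoring function | Cutoff value | Definition |
|---|---|---|---|
| $h$-index | $S_i = x_i$ | $u$ | $\max\{u \mid \forall i \in [1,u]: x_i \ge u\}$ |
| $h^2$-index | $S_i = x_i$ | $u^2$ | $\max\{u \mid \forall i \in [1,u]: x_i \ge u^2\}$ |
| $ch$-index | $S_i = x_i - x_i^o$ | $u$ | $\max\{u \mid \forall i \in [1,u]: x_i - x_i^o \ge u\}$ |
| $ch^2$-index | $S_i = x_i - x_i^o$ | $u^2$ | $\max\{u \mid \forall i \in [1,u]: x_i - x_i^o \ge u^2\}$ |
| $h_a$-index | $S_i = x_i$ | $a \cdot u$ | $\max\{u \mid \forall i \in [1,u]: x_i \ge a \cdot u\}$ |
| $w$-index | $S_i = x_i$ | $10 \cdot u$ | $\max\{u \mid \forall i \in [1,u]: x_i \ge 10 \cdot u\}$ |
| $w(q)$-index | $S_i = x_i$ | $10 \cdot u$ | $\max\{u \mid \forall i \in [1,u]: x_i \ge 10 \cdot u\}$ |
| $h_f$-index | $S_i = x_i / n_i^a$ | $u$ | $\max\{u \mid \forall i \in [1,u]: x_i / n_i^a \ge u\}$ |
| $h_o$-index | $S_i = x_i - x_i^o$ | $u$ | $\max\{u \mid \forall i \in [1,u]: x_i - x_i^o \ge u\}$ |
| $h_c$-index | $S_i = x_i - x_i^c$ | $u$ | $\max\{u \mid \forall i \in [1,u]: x_i - x_i^c \ge u\}$ |
| $h_s$-index | $S_i = x_i - x_i^s$ | $u$ | $\max\{u \mid \forall i \in [1,u]: x_i - x_i^s \ge u\}$ |
| $h^c$-index | $S_i = \gamma \cdot (n_i^p + 1)^{-\delta} \cdot x_i$ | $u$ | $\max\{u \mid \forall i \in [1,u]: \gamma \cdot (n_i^p + 1)^{-\delta} \cdot x_i \ge u\}$ |
| $h^T$-index | $S_i = \gamma \cdot \sum_{j=1}^{x_i} (n_i^c(j) + 1)^{-\delta}$ | $u$ | $\max\{u \mid \forall i \in [1,u]: \gamma \cdot \sum_{j=1}^{x_i} (n_i^c(j) + 1)^{-\delta} \ge u\}$ |
| $h_{ad}$-index | $S_i = \gamma^2 \cdot (n_i^p + 1)^{-\delta_1} \cdot \sum_{j=1}^{x_i} (n_i^c(j) + 1)^{-\delta_2}$ | $u$ | $\max\{u \mid \forall i \in [1,u]: \gamma^2 \cdot (n_i^p + 1)^{-\delta_1} \cdot \sum_{j=1}^{x_i} (n_i^c(j) + 1)^{-\delta_2} \ge u\}$ |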

Table 3.5: First Hirsch algorithm indicators and definition details.
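As an illustration, the pattern shared by the indicators of Table 3.5 can be sketched in Python. This is a minimal sketch and not code from any of the cited papers: the names `first_hirsch`, `contemporary_score` and `trend_score` are ours, the default `delta = 1` is an assumption chosen only for the example, and `gamma = 4` follows the value reported for Sidiropoulos et al. [2007].

```python
from typing import Callable, Sequence


def first_hirsch(scores: Sequence[float],
                 cutoff: Callable[[int], float] = lambda u: u) -> int:
    """Return max{u | for all i in [1, u]: S_i >= cutoff(u)}."""
    ranked = sorted(scores, reverse=True)
    best = 0
    for u in range(1, len(ranked) + 1):
        # The list is descending, so the u-th score is the smallest of the top u.
        if ranked[u - 1] >= cutoff(u):
            best = u
    return best


def contemporary_score(citations: int, paper_age: float,
                       gamma: float = 4.0, delta: float = 1.0) -> float:
    """Scoring function (3.26): gamma * (paper_age + 1)^(-delta) * citations."""
    return gamma * (paper_age + 1) ** (-delta) * citations


def trend_score(citation_ages: Sequence[float],
                gamma: float = 4.0, delta: float = 1.0) -> float:
    """Scoring function (3.27): gamma * sum over citations of (age + 1)^(-delta)."""
    return gamma * sum((age + 1) ** (-delta) for age in citation_ages)


citations = [10, 8, 5, 4, 3]
h = first_hirsch(citations)                           # cutoff u
h2 = first_hirsch(citations, cutoff=lambda u: u * u)  # cutoff u^2
```

Only the scoring function and the cutoff change from one indicator to the next; for example, passing a list of `contemporary_score` values to `first_hirsch` with the default cutoff yields a contemporary h-index under the stated assumptions.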

Table 3.6 presents a list of all the indicators of the subcategory along with the factors of scholarly impact they consider. Again, all these indicators are based on the h-index and, hence, they all consider the Number of papers and Number of citations factors. The $ch$-index, $ch^2$-index, $h_o$-index, $h_c$-index and $h_s$-index consider Self-citations, the $h_f$-index considers the effects of Co-authorship, the $h_a$-index considers the Scientific field, the $h^c$-index considers the Age of papers, and the $h^T$-index considers the Age of citations, whereas the $h_{ad}$-index considers both the Age of papers and the Age of citations. It is interesting to notice that every indicator of this subcategory that goes beyond the two factors already considered in the calculation of the h-index considers exactly one more factor, with the exception of the $h_{ad}$-index, which considers two.

| Indicator | Number of papers | Number of citations | Scientific age | Age of papers | Age of citations | Self-citations | Co-authorship | Scientific field |
|---|---|---|---|---|---|---|---|---|
| $h$-index | X | X | | | | | | |
| $h^2$-index | X | X | | | | | | |
| $ch$-index | X | X | | | | X | | |
| $ch^2$-index | X | X | | | | X | | |
| $h_a$-index | X | X | | | | | | X |
| $w$-index | X | X | | | | | | |
| $w(q)$-index | X | X | | | | | | |
| $h_f$-index | X | X | | | | | X | |
| $h_o$-index | X | X | | | | X | | |
| $h_c$-index | X | X | | | | X | | |
| $h_s$-index | X | X | | | | X | | |
| $h^c$-index | X | X | | X | | | | |
| $h^T$-index | X | X | | | X | | | |
| $h_{ad}$-index | X | X | | X | X | | | |

Table 3.6: Factors considered by the indicators in the First Hirsch algorithm subcategory.

Second Hirsch algorithm

The indices included here follow the Second Hirsch algorithm. Let us redefine its steps for defining an indicator z using the notations presented earlier:

1. Retrieve the list of all papers in the publication record of a researcher

2. Define the paper scoring function (S) for the indicator

3. Order the list of papers in descending order based on their value, as calculated by ($S_i$) for each paper i

4. Define the cumulative function (C) for the indicator

5. Define the cutoff value (V) for the indicator (along with the operator o)

6. Then the indicator z is defined as $z = \max\{u \mid C_u \; o \; V\}$

g-index: For the calculation of the g-index, the papers in the publication record are listed in descending order based on their citation count. Then, the g-index is defined as the largest number g of papers that have received together at least $g^2$ citations [Egghe, 2006]. The g-index uses the cumulative sum of the citations received by the papers of the researcher.
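The steps of the Second Hirsch algorithm and the g-index definition above can be sketched as follows. This is only an illustrative sketch: the function names and the way the cumulative function and cutoff are passed as callables are our own assumptions, and the g-index is restricted to the papers actually present in the record.

```python
import operator
from typing import Callable, Sequence


def second_hirsch(scores: Sequence[float],
                  cumulative: Callable[[Sequence[float], int], float],
                  cutoff: Callable[[Sequence[float], int], float],
                  op: Callable[[float, float], bool] = operator.ge) -> int:
    """Return max{u | C_u op V} over papers ranked by descending score."""
    ranked = sorted(scores, reverse=True)
    best = 0
    for u in range(1, len(ranked) + 1):
        if op(cumulative(ranked, u), cutoff(ranked, u)):
            best = u
    return best


def g_index(citations: Sequence[int]) -> int:
    """g-index: cumulative sum of the top u citation counts against u^2."""
    return second_hirsch(citations,
                         cumulative=lambda ranked, u: sum(ranked[:u]),
                         cutoff=lambda ranked, u: u * u)


example_g = g_index([10, 8, 5, 4, 3])
```

Other indicators of this subcategory differ only in the `cumulative`, `cutoff` and `op` arguments.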

f-index: For the calculation of the f-index, the papers in the publication record are listed in descending order based on their citation count. Then the f-index is defined as the largest number f of papers for which the harmonic mean of their citation counts is greater than or equal to f [Tol, 2009] and is expressed as

\[ f = \max\left\{ f \;\middle|\; \frac{1}{\frac{1}{f} \cdot \sum_{i=1}^{f} \frac{1}{x_i}} \ge f \right\} \tag{3.29} \]

t-index: For the calculation of the t-index, the papers in the publication record are listed in descending order based on their citation count. Then, the t-index is defined as the largest number t of papers for which the geometric mean of their citation counts is greater than or equal to t [Tol, 2009] and is expressed as

\[ t = \max\left\{ t \;\middle|\; \left( \prod_{i=1}^{t} x_i \right)^{1/t} \ge t \right\} \tag{3.30} \]
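The f-index and t-index definitions can be illustrated with a short Python sketch. It assumes integer citation counts and uses exact arithmetic to avoid rounding issues at the boundary; treating the harmonic mean as zero once an uncited paper enters the list is our own assumption.

```python
from fractions import Fraction
from typing import Sequence


def f_index(citations: Sequence[int]) -> int:
    """Largest f such that the harmonic mean of the top f counts is >= f."""
    ranked = sorted(citations, reverse=True)
    best, reciprocal_sum = 0, Fraction(0)
    for u, x in enumerate(ranked, start=1):
        if x <= 0:
            break  # assumption: an uncited paper drives the harmonic mean to 0
        reciprocal_sum += Fraction(1, x)
        harmonic_mean = u / reciprocal_sum
        if harmonic_mean >= u:
            best = u
    return best


def t_index(citations: Sequence[int]) -> int:
    """Largest t such that the geometric mean of the top t counts is >= t."""
    ranked = sorted(citations, reverse=True)
    best, product = 0, 1
    for u, x in enumerate(ranked, start=1):
        product *= x
        # product^(1/u) >= u is equivalent to product >= u^u for non-negative integers
        if product >= u ** u:
            best = u
    return best


example_f = f_index([10, 8, 5, 4, 3])
example_t = t_index([10, 8, 5, 4, 3])
```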

$h_m$-index: For the calculation of the $h_m$-index, the papers in the publication record are listed in descending order based on their citation count. For each paper, its effective rank ($r_{eff}(i)$) is calculated as

\[ r_{eff}(i) = \sum_{j=1}^{i} \frac{1}{n_j^a} \tag{3.31} \]

Then, the hm − index is defined as the largest number hm such that the reff (i) is lower than or equal to the citations of paper i [Schreiber, 2008b] and is expressed as

hm = max{i | reff (i) ≤ xi} (3.32)
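Equations (3.31) and (3.32) can be sketched as follows. The function name and the `(citations, number_of_authors)` input format are hypothetical. The sketch returns the rank i selected by (3.32) together with the effective rank reached at that position, the latter only as a convenience for the reader.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def hm_index(papers: Iterable[Tuple[int, int]]) -> Tuple[int, Fraction]:
    """papers: (citations, number_of_authors) pairs of one researcher."""
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    best_rank, best_r_eff = 0, Fraction(0)
    r_eff = Fraction(0)
    for rank, (citations, n_authors) in enumerate(ranked, start=1):
        r_eff += Fraction(1, n_authors)   # effective rank, equation (3.31)
        if r_eff <= citations:            # condition of equation (3.32)
            best_rank, best_r_eff = rank, r_eff
    return best_rank, best_r_eff


example_hm = hm_index([(10, 2), (8, 1), (5, 3), (4, 2), (3, 1)])
```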

In Schreiber [2009], a variation of the $h_m$-index is proposed, named $\tilde{h}_m$-index, that uses a piecewise linear interpolation for the rank-frequency function in terms of the effective rank. It is mentioned that even though this variation can increase the value of the $h_m$-index, the increase is less than 1. For more details on the interpolation used we refer the reader to the original paper [Schreiber, 2009].

$h_{ms}$-index: For the calculation of the $h_{ms}$-index, the papers in the publication record are listed in descending order based on their citation count after removing all self-citations of the third kind [Schreiber, 2007]. Then, for each paper, its effective rank ($r_{eff}(i)$) is calculated in the same way as in the case of the $h_m$-index

\[ r_{eff}(i) = \sum_{j=1}^{i} \frac{1}{n_j^a} \tag{3.33} \]

Then, the $h_{ms}$-index is defined as the largest number $h_{ms}$ such that $r_{eff}(i)$ is lower than or equal to the citation count of paper i after correcting for self-citations of the third kind [Schreiber, 2009] and is expressed as

hms = max{i | reff (i) ≤ xi} (3.34)

In a way similar to the definition of the $\tilde{h}_m$-index, Schreiber [2009] proposes a variation of the $h_{ms}$-index, named $\tilde{h}_{ms}$-index.

Table 3.7 provides a list of all indicators of this subcategory along with their corresponding scoring function, cumulative function, cutoff value and operator, as well as their definition as described by the Second Hirsch algorithm. We can observe the similarities of the indicators of this subcategory as well as the pattern that we defined as the Second Hirsch algorithm.

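| Indicator | Scoring function | Cumulative function | Operator | Cutoff value | Definition |
|---|---|---|---|---|---|
| $g$-index | $S_i = x_i$ | $C = \sum_{i=1}^{u} x_i$ | $\ge$ | $u^2$ | $\max\{u \mid \sum_{i=1}^{u} x_i \ge u^2\}$ |
| $f$-index | $S_i = x_i$ | $C = \frac{1}{\frac{1}{u} \cdot \sum_{i=1}^{u} \frac{1}{x_i}}$ | $\ge$ | $u$ | $\max\{u \mid \frac{1}{\frac{1}{u} \cdot \sum_{i=1}^{u} \frac{1}{x_i}} \ge u\}$ |
| $t$-index | $S_i = x_i$ | $C = \left(\prod_{i=1}^{u} x_i\right)^{1/u}$ | $\ge$ | $u$ | $\max\{u \mid \left(\prod_{i=1}^{u} x_i\right)^{1/u} \ge u\}$ |
| $h_m$-index | $S_i = x_i$ | $C = \sum_{i=1}^{u} \frac{1}{n_i^a}$ | $\le$ | $x_u$ | $\max\{u \mid \sum_{i=1}^{u} \frac{1}{n_i^a} \le x_u\}$ |
| $h_{ms}$-index | $S_i = x_i - x_i^s$ | $C = \sum_{i=1}^{u} \frac{1}{n_i^a}$ | $\le$ | $x_u - x_u^s$ | $\max\{u \mid \sum_{i=1}^{u} \frac{1}{n_i^a} \le x_u - x_u^s\}$ |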

Table 3.7: Second Hirsch algorithm indicators with their definition details.

Table 3.8 presents a list of all indicators included in the current subcategory along with the additional factors considered by each indicator. All these indicators are based on the h-index and hence they all consider the Number of papers and Number of citations factors. The $h_m$-index also considers Co-authorship whereas the $h_{ms}$-index considers both Self-citations and Co-authorship.

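| Indicator | Number of papers | Number of citations | Scientific age | Age of papers | Age of citations | Self-citations | Co-authorship | Scientific field |
|---|---|---|---|---|---|---|---|---|
| $g$-index | X | X | | | | | | |
| $f$-index | X | X | | | | | | |
| $t$-index | X | X | | | | | | |
| $h_m$-index | X | X | | | | | X | |
| $h_{ms}$-index | X | X | | | | X | X | |

Table 3.8: Factors used by the indicators of the Second Hirsch algorithm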

Derived indices approach

hg-index: The hg-index is defined as the geometric mean of the h- and g-indices [Alonso et al., 2010] and is expressed as

\[ hg = \sqrt{h \cdot g} \tag{3.35} \]

$q^2$-index: The $q^2$-index is defined as the geometric mean of the h- and m-indices [Cabrerizo et al., 2010] and is expressed as

\[ q^2 = \sqrt{h \cdot m} \tag{3.36} \]

m-quotient: The m-quotient was introduced as a “useful yardstick to compare researchers of different seniority” [Hirsch, 2005], and is defined as

\[ m\text{-quotient} = \frac{h}{n} \tag{3.37} \]

$h_r$-index: For the calculation of the $h_r$-index, the papers of the publication record of a researcher are listed in descending order based on their citation count. Let L(i) be the piecewise linear interpolation of the ranks of the papers versus their scoring function, which for the h-index is $S_i = x_i$. Then, the real valued h-index ($h_r$-index) is defined as a generalization of the original h-index and is equal to the abscissa of the intersection of the $y = x$ and L(i) lines [Rousseau, 2006]. As discussed in Guns and Rousseau [2009], the linear interpolation that intersects the line $y = x$ based on the definition of the h-index is the one defined by the pairs $(h, S_h)$ and $(h + 1, S_{h+1})$. Solving for y gives

\[ y = S_h + (x - h) \cdot (S_{h+1} - S_h) \tag{3.38} \]

At the intersection of (3.38) with $y = x$, $Y = X$, and solving for X (the abscissa) gives

\[ X = \frac{S_h - h \cdot (S_{h+1} - S_h)}{1 + S_h - S_{h+1}} \tag{3.39} \]

and hence

\[ h_r = \frac{S_h - h \cdot (S_{h+1} - S_h)}{1 + S_h - S_{h+1}} = \frac{x_h - h \cdot (x_{h+1} - x_h)}{1 + x_h - x_{h+1}} \tag{3.40} \]
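Equation (3.40) can be checked numerically with the short sketch below. Two choices in it are our own assumptions and not part of the cited definitions: a researcher with h = 0 is given a value of 0, and when no paper of rank h + 1 exists it is treated as a paper with zero citations.

```python
from typing import Sequence


def h_index(ranked: Sequence[int]) -> int:
    """h-index of a list already sorted in descending order."""
    return sum(1 for i, x in enumerate(ranked, start=1) if x >= i)


def hr_index(citations: Sequence[int]) -> float:
    """Real valued h-index following equation (3.40)."""
    x = sorted(citations, reverse=True)
    h = h_index(x)
    if h == 0:
        return 0.0  # assumption: no paper in the Hirsch core
    x_h = x[h - 1]
    x_next = x[h] if h < len(x) else 0  # assumption: missing paper h+1 is uncited
    return (x_h - h * (x_next - x_h)) / (1 + x_h - x_next)


example_hr = hr_index([10, 8, 6, 5, 3])
```

For the record `[10, 8, 6, 5, 3]` we have h = 4, and the segment through (4, 5) and (5, 3) meets the line y = x at 13/3, which is the value the function returns.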

$g_r$-index: For the calculation of the $g_r$-index, the papers in the publication record of a researcher are listed in descending order based on their citation count. Let C(i) be the piecewise linear interpolation of the ranks of the papers versus their scoring function, which is defined as $S_i = \sum_{j=1}^{i} x_j$. Then, the real valued g-index ($g_r$-index) is defined as a generalization of the original g-index and is equal to the abscissa of the intersection of the $y = x^2$ and C(i) lines [Rousseau, 2006]. As discussed in Guns and Rousseau [2009], the linear interpolation that intersects the line $y = x^2$ is the one defined by the pairs $(g, S_g)$ and $(g + 1, S_{g+1})$. Solving for Y gives

\[ Y = S_g + (X - g) \cdot (S_{g+1} - S_g) \tag{3.41} \]

Based on the definition of the scoring function $S_i$ it is true that

\[ S_{g+1} - S_g = \sum_{i=1}^{g+1} x_i - \sum_{i=1}^{g} x_i = x_{g+1} \]

and, therefore, (3.41) becomes

\[ Y = x_{g+1} \cdot X + (S_g - g \cdot x_{g+1}) \tag{3.42} \]

At the intersection of (3.41) with $y = x^2$, $Y = X^2$, and solving for X (the abscissa) while keeping the non-negative root gives

\[ X^2 = x_{g+1} \cdot X + (S_g - g \cdot x_{g+1}) \Rightarrow X = \frac{x_{g+1} + \sqrt{(x_{g+1})^2 + 4 \cdot (S_g - g \cdot x_{g+1})}}{2} \tag{3.43} \]

and, hence

\[ g_r = \frac{x_{g+1} + \sqrt{(x_{g+1})^2 + 4 \cdot (S_g - g \cdot x_{g+1})}}{2} \tag{3.44} \]
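A small numerical sketch of equation (3.44) follows. As an assumption of ours, a paper of rank g + 1 that does not exist is treated as having zero citations, and the g-index is restricted to the papers present in the record.

```python
import math
from typing import Sequence


def gr_index(citations: Sequence[int]) -> float:
    """Real valued g-index following equation (3.44)."""
    x = sorted(citations, reverse=True)
    prefix = [0]                      # prefix[u] = S_u, the sum of the top u counts
    for value in x:
        prefix.append(prefix[-1] + value)
    g = 0
    for u in range(1, len(x) + 1):
        if prefix[u] >= u * u:
            g = u
    s_g = prefix[g]
    x_next = x[g] if g < len(x) else 0  # assumption: missing paper g+1 is uncited
    return (x_next + math.sqrt(x_next ** 2 + 4 * (s_g - g * x_next))) / 2


example_gr = gr_index([3, 2, 1])
```

For `[3, 2, 1]` we have g = 2, $S_2 = 5$ and $x_3 = 1$, so the segment is Y = X + 3 and the intersection with $Y = X^2$ lies at $(1 + \sqrt{13})/2$.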

$h_{rat}$-index: Ruane and Tol [2008] first introduced the $h_l^{\Delta}$-index as the rational, successive h-index for departments and used it to indicate not only the current value of the $h_l$-index but also its distance from its next value ($h_l + 1$). In the same paper, they mention that the $h_l^{\Delta}$-index can also be defined for individuals and they use the notation $h^{\Delta}$. Later, in Guns and Rousseau [2009], the $h^{\Delta}$-index for individuals is named rational h-index ($h_{rat}$-index).

Hence, for the definition of the $h_{rat}$-index, we need to know the number of citations the researcher’s papers still need to obtain in order to reach an h-index of h + 1 (e) [Tol, 2008] and the maximum distance between h and h + 1 expressed in citations (d). Values e and d can be expressed as:

\[ e = \sum_{i=1}^{h+1} \max(0, h + 1 - x_i) \tag{3.45} \]

\[ d = 2 \cdot h + 1 \tag{3.46} \]

Then, the $h_{rat}$-index is expressed as

\[ h_{rat} = h + 1 - \frac{e}{d} \Rightarrow h_{rat} = h + 1 - \frac{\sum_{i=1}^{h+1} \max(0, h + 1 - x_i)}{2 \cdot h + 1} \tag{3.47} \]
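Equations (3.45) to (3.47) can be illustrated with the following sketch, which uses exact fractions. Padding the record with an uncited paper when fewer than h + 1 papers exist is an assumption made only for this example.

```python
from fractions import Fraction
from typing import Sequence


def hrat_index(citations: Sequence[int]) -> Fraction:
    """Rational h-index following equations (3.45)-(3.47)."""
    x = sorted(citations, reverse=True)
    h = sum(1 for i, c in enumerate(x, start=1) if c >= i)
    # Top h+1 papers; a missing paper is assumed to have zero citations.
    top = x[:h + 1] + [0] * max(0, h + 1 - len(x))
    e = sum(max(0, h + 1 - c) for c in top)  # citations still missing, (3.45)
    d = 2 * h + 1                            # maximum distance, (3.46)
    return h + 1 - Fraction(e, d)            # equation (3.47)


example_hrat = hrat_index([10, 8, 5, 4, 3])
```

For `[10, 8, 5, 4, 3]` we have h = 4; the fourth and fifth papers are 1 and 2 citations short of 5, so e = 3, d = 9 and the result is 5 - 3/9 = 14/3.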

$g_{rat}$-index: Tol [2008] first introduced the $g_l^{\Delta}$-index as the rational, successive g-index for departments and used it to indicate not only the current value of the $g_l$-index but also its distance from its next value ($g_l + 1$). In the same paper, he mentions that the $g_l^{\Delta}$-index can also be defined for the g-index. Later, Guns and Rousseau [2009] use the name $g_{rat}$-index for the rational g-index. Thus, for the definition of the $g_{rat}$-index, we need to know the number of citations the researcher’s papers still need to obtain in order to reach a g-index of g + 1 (e) and the maximum distance between g and g + 1 expressed in citations (d). Values e and d can be expressed as:

\[ e = (g + 1)^2 - S_{g+1} \tag{3.48} \]

\[ d = (g + 1)^2 - g^2 = 2 \cdot g + 1 \tag{3.49} \]

Then, the $g_{rat}$-index is expressed as

\[ g_{rat} = g + 1 - \frac{e}{d} \Rightarrow g_{rat} = g + 1 - \frac{(g + 1)^2 - S_{g+1}}{2 \cdot g + 1} = g + \frac{S_{g+1} - g^2}{2 \cdot g + 1} \tag{3.50} \]
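The rational g-index of equation (3.50) can be sketched in the same spirit. As before, the treatment of a missing paper of rank g + 1 as uncited (so that $S_{g+1} = S_g$) is our own assumption.

```python
from fractions import Fraction
from typing import Sequence


def grat_index(citations: Sequence[int]) -> Fraction:
    """Rational g-index following equation (3.50)."""
    x = sorted(citations, reverse=True)
    prefix = [0]                      # prefix[u] = S_u
    for c in x:
        prefix.append(prefix[-1] + c)
    g = max((u for u in range(1, len(x) + 1) if prefix[u] >= u * u), default=0)
    # S_{g+1}; a missing paper g+1 is assumed to add no citations.
    s_next = prefix[g + 1] if g < len(x) else prefix[g]
    return g + Fraction(s_next - g * g, 2 * g + 1)


example_grat = grat_index([3, 2, 1])
```

For `[3, 2, 1]` we have g = 2 and $S_3 = 6$, so e = 9 - 6 = 3, d = 5, and both forms of (3.50) give 12/5.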

$h_n$-index: The normalized h-index ($h_n$-index) is calculated by dividing h by the number of papers a researcher has (co-)authored [Sidiropoulos et al., 2007] and is expressed as

\[ h_n = \frac{h}{N} \tag{3.51} \]

$\hbar$-index: The $\hbar$-index was introduced as an index that takes into consideration the co-authors of each paper in order to identify if the current paper should contribute to the $\hbar$-index of the researcher or not [Hirsch, 2010]. In general, a paper belongs to the $\hbar$-core for a researcher only if it has more than $\hbar$ citations and it belongs to the Hirsch core of all participating authors. Due to the way the $\hbar$-index is defined, it may include papers in the $\hbar$-core of a researcher that are not actually included in his Hirsch core.

Table 3.9 presents a list of all indicators included in the current subcategory along with the additional factors considered by each indicator. All these indicators are based on the h-index and hence they all consider the Number of papers and the Number of citations. The m-quotient also considers the Scientific age and the $\hbar$-index the Co-authorship.

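| Indicator | Number of papers | Number of citations | Scientific age | Age of papers | Age of citations | Self-citations | Co-authorship | Scientific field |
|---|---|---|---|---|---|---|---|---|
| $hg$-index | X | X | | | | | | |
| $q^2$-index | X | X | | | | | | |
| $m$-quotient | X | X | X | | | | | |
| $h_r$-index | X | X | | | | | | |
| $g_r$-index | X | X | | | | | | |
| $h_{rat}$-index | X | X | | | | | | |
| $g_{rat}$-index | X | X | | | | | | |
| $h_n$-index | X | X | | | | | | |
| $\hbar$-index | X | X | | | | | X | |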

Table 3.9: Factors considered by the indicators of the Derived h-index subcategory

Standalone indicators

The standalone indicators are indicators that are not based on the h − index but instead propose an alternative way of evaluating a researcher.

p − index: The p − index is defined as the number of papers included in the publication record of a researcher that have received at least one citation [van Eck and Waltman, 2008] and is expressed as

p − index = max{i | xi > 0} (3.52)

c − index: The c − index is defined as the number of citations of the most cited paper in the publication record of a researcher [van Eck and Waltman, 2008], and, using the notations used in this paper is expressed as

c − index = x1 (3.53)

The c − index is also referred to as the maximum − index (fmax − index) by Woeginger[2008a] and as max by Kosmulski[2006].

hT − index: Anderson et al.[2008] proposed a new bibliometric indicator called the tapered h − index as an alternative to the h − index. They used a Ferrers graph to demonstrate that if we produce the Ferrers graph for the publication record of a researcher then his h − index is the size of his Durfee square in the graph. In their paper, they propose a new index that, instead of identifying a subset of a researcher’s papers and from those papers counting a subset of their citations, it includes all papers and all citations in its calculation. To calculate hT − index, the papers are listed in descending order based on their citation count and for each one of them a score is calculated based on their position in the vector and the number of citations received. The scoring functions for the individual papers are:

\[ h_{T(i)} = \frac{x_i}{2 \cdot i - 1}, \quad x_i \le i \tag{3.54} \]

\[ h_{T(i)} = \frac{i}{2 \cdot i - 1} + \sum_{j=i+1}^{x_i} \frac{1}{2 \cdot j - 1}, \quad x_i > i \tag{3.55} \]

and the $h_T$-index is then defined as the sum of all scores for all papers that the researcher has (co-)authored:

\[ h_T = \sum_{i=1}^{N} h_{T(i)} \tag{3.56} \]
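The scoring functions (3.54) and (3.55) and the sum (3.56) translate directly into code. The sketch below uses exact fractions and a function name of our own choosing.

```python
from fractions import Fraction
from typing import Sequence


def tapered_h_index(citations: Sequence[int]) -> Fraction:
    """Tapered h-index: sum of the per-paper scores (3.54)-(3.55)."""
    ranked = sorted(citations, reverse=True)
    total = Fraction(0)
    for i, x in enumerate(ranked, start=1):
        if x <= i:
            total += Fraction(x, 2 * i - 1)                  # equation (3.54)
        else:
            total += Fraction(i, 2 * i - 1)                  # first term of (3.55)
            total += sum(Fraction(1, 2 * j - 1)
                         for j in range(i + 1, x + 1))       # tail sum of (3.55)
    return total


example_tapered = tapered_h_index([2, 2])
```

A record of two papers with two citations each fills a 2 by 2 Durfee square exactly, and the function returns 2, in agreement with the h-index of that record.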

$P_a$-index: The productivity index ($P_a$-index) is defined as the cumulative value of a researcher’s share of the citations of the papers that he has (co-)authored [Soler, 2007] and is expressed as

\[ P_a = \sum_{i=1}^{N} \frac{x_i}{n_i^a} \tag{3.57} \]

$h_{mock}$-index: The mock h-index was referred to as $h_m$-index in Prathap [2010], but in order not to confuse it with the $h_m$-index defined above [Schreiber, 2008b], we use the notation $h_{mock}$-index in this paper. The $h_{mock}$-index is expressed as

\[ h_{mock} = \left( \frac{NC^2}{N} \right)^{1/3} \tag{3.58} \]

IQp − index: (accounts for scientific age and scientific field) The index for quality and productivity IQp−index [Antonakis and Lalive, 2008] was proposed as an indicator that takes into consideration a researcher’s total number of papers, total number of citations, scientific age and scientific fields. It is expressed as NC IQp = (3.59) NCest N + N where NCest is the estimated number of total citations that the researcher would have received if he were an average researcher in his scientific field. The estimated number of citations can be computed as c · n · (N + 1) NC = (3.60) est 2 where c is the correction factor and is calculated as the weighted aggregate journal Impact Factor of the top three categories that the researcher has received citations from [Antonakis and Lalive,

2008].

Square root of total number of citations: The Square root of total number of citations [De Visscher, 2011] was proposed as a bibliometric indicator that measures the overall impact of a scientist when considering his whole publication list, and is expressed as

\[ \text{Square root of total citations} = \sqrt{NC} \tag{3.61} \]

Table 3.10 presents a list of all indicators included in the current category along with the additional factors considered by each indicator. We observe that most of the indicators of this category consider both the Number of papers and the Number of citations. In addition, the $P_a$-index considers the Co-authorship, whereas the $IQ_p$-index considers the Scientific age and Scientific field.

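| Indicator | Number of papers | Number of citations | Scientific age | Age of papers | Age of citations | Self-citations | Co-authorship | Scientific field |
|---|---|---|---|---|---|---|---|---|
| $p$-index | X | | | | | | | |
| $c$-index | | X | | | | | | |
| $h_T$-index | X | X | | | | | | |
| $P_a$-index | X | X | | | | | X | |
| $h_{mock}$-index | X | X | | | | | | |
| $IQ_p$-index | X | X | X | | | | | X |
| $\sqrt{NC}$ | | X | | | | | | |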

Table 3.10: Factors considered by the Standalone indicators category

3.2.2 Indirect indicators

SARA [Radicchi et al., 2009]: SARA (Science Author Rank Algorithm) utilizes a FUC-NC-FUC Author-Citation graph in order to calculate author scores based on a PageRank approach. The citations used to construct the Author-Citation graph are defined dynamically using a set of $M_I$ overlapping homogeneous intervals of the full date ordered list of citations. The intervals are defined as homogeneous because they contain the exact same number of citations ($M_R$), and overlapping because each $q$th interval shares its first $M_R/2$ citations with the $(q-1)$th interval and its last $M_R/2$ citations with the $(q+1)$th interval. The algorithm differs from the original PageRank since part of the credit attributed to each author includes a contribution from the whole network. The authors define the “scientific debt” of a scientist as the knowledge gained by the whole field and the value attributed to each author is proportional to his productivity and the number of co-authors of each one of his papers. The SARA score of author i is given by

\[ P_i = (1 - q) \cdot \sum_{j} \frac{P_j}{s_j^{out}} \cdot w_{ji} + q \cdot z_i + (1 - q) \cdot z_i \sum_{j} P_j \cdot \delta(s_j^{out}) \tag{3.62} \]

The first term is the distribution of the value of all authors citing i, which is weighted based on the total number of authors cited by each author, while the second term is the sum of the scientific debt received by the current author from all other authors in the community whether they do cite other authors or not (dangling nodes). In the equation, q represents the damping factor, $P_i$ is the score of node i, $w_{ji}$ is the weight of the directed connection from j to i, $s_j^{out}$ is the sum of the weights of all links outgoing from the jth vertex, and $\delta(x) = 1$ if $x = 0$, otherwise $\delta(x) = 0$. $z_i$ is a factor that considers the normalized scientific credit given to the author i based on his productivity. For a more detailed description of equation (3.62) and of the variables used, we refer the reader to Radicchi et al. [2009].

Finally, the authors propose two rankings for the authors, an absolute one and a relative one (the relative one being used to account for different historical periods).

hfg-index [Kosmulski, 2010]: The hfg-index is a successive Hirsch-type index that is based on first calculating the h-index values for the papers a scientist has co-authored [Schubert, 2009]. The h-index of a paper is defined as the number h of papers citing the current paper that have received at least h citations each. Having calculated the h-index values of the papers co-authored by the scientist, his hfg-index is defined in the same way as the h-index, that is, as the largest number hfg of papers that each have an h-index of at least hfg, whereas the remaining papers of the scientist do not have an h-index greater than hfg. This indicator is not recursive in its calculation, but it does utilize more information from the citation graph than the direct citations alone (one extra generation of citations is used).
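The two-step calculation described for the hfg-index can be sketched on a toy Paper-Citation graph. The `cited_by` mapping (paper to the list of papers citing it), the paper identifiers and the function names are hypothetical and serve only to illustrate the description above.

```python
from typing import Dict, Iterable, List


def hirsch(values: Iterable[int]) -> int:
    """Largest h such that h of the values are each at least h."""
    ranked = sorted(values, reverse=True)
    return sum(1 for rank, v in enumerate(ranked, start=1) if v >= rank)


def hfg_index(author_papers: Iterable[str],
              cited_by: Dict[str, List[str]]) -> int:
    """h-index computed over the h-indices of the author's papers."""
    paper_h = []
    for paper in author_papers:
        # Citation counts of the papers that cite this paper (second generation).
        citer_counts = [len(cited_by.get(c, ())) for c in cited_by.get(paper, ())]
        paper_h.append(hirsch(citer_counts))
    return hirsch(paper_h)


# Toy graph: A is cited by c1, c2, c3; B is cited by c4.
cited_by = {
    "A": ["c1", "c2", "c3"],
    "B": ["c4"],
    "c1": ["d1", "d2", "d3"],
    "c2": ["d1", "d2"],
    "c4": ["d1"],
}
example_hfg = hfg_index(["A", "B"], cited_by)
```

Here paper A has an h-index of 2 and paper B an h-index of 1, so the resulting value for the author is 1.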

Indirect H-index [Egghe, 2011a,b]: This is an indicator very similar to the hfg-index [Kosmulski, 2010], as it is based on the same concept of creating a listing of the h-indices of the papers of an author, and then using these values instead of the number of citations to calculate the indirect H-index of the author. The main difference is that this indicator was proposed as a complementary indicator and it is calculated for the papers participating in the Hirsch-core of the author rather than on his whole set of papers. Thus, the papers included in the hfg-core may very well be completely different from the ones included in this indirect H-index core, since papers with very few first generation citations will probably not be included in the indirect H-index core, whereas if they have received few citations from high-impact papers they may be included in the hfg-core.

Generational indices [Hu et al., 2011]: A general term used to describe indicators that are calculated for a particular generation of citations. An example given for a Generational index is the generational h-index. Having defined each generation in accordance with one of the four ways described in Section 2.5.1, the authors define the h-index of generation n, n ≥ 0, as the number h of papers included in generation n each of which has received at least h citations.

Cross-generational indices [Hu et al., 2011]: A general term used to describe indicators that are calculated based on values calculated per generation of citations. The authors define the general generational indicator $C_n$ for a sequence $A = (a_k)_{k=0,\dots,n-1}$ of forward generational indicators as $\max(a_0, a_1, \dots, a_{n-1})$. For example, if the generational indicator used is the h-index then the $C_n$ cross-generational indicator is equal to $\max(h_0, h_1, \dots, h_{n-1})$. The definition also applies when backward generational indicators are considered.

The authors also describe a case where the cross-generational index considers the generations in a cumulative manner instead of using them as standalone values. An example provided is the Total influence indicator, defined as the sum of the generational indices divided by the generation factorial. Again, the generational h-index can be used as a generational indicator. In both cases the number of generations considered is selected arbitrarily.

Eigenfactor score for authors [West et al., 2013]: The EigenFactor score for authors is an adaptation of the original Eigenfactor score algorithm [Bergstrom, 2007, Bergstrom et al., 2008, Wes, 2008] used for Journal assessment. It uses a FRC-NC-FUC Author-Citation graph from which all self-citations are excluded. As discussed in Section 3.3.2, EigenFactor imitates the original PageRank algorithm.

The indirect author-based indicators either use the information present in the Author-Citation graph (SARA and EigenFactor score for authors) or are based on information present in the Paper-Citation graph (Cross-generational indices, hfg-index and the Indirect H-index). From the indicators examined, the EigenFactor score for authors uses the entire graph, whereas SARA makes use of the entire graph within the specified citations interval. The cross-generational indices may (or may not) utilize the entire graph depending on the number of generations examined, and the hfg- and indirect H-indices only examine two generations of citations. Table 3.11 presents the classification scheme described and lists the indirect indicators that belong to each class.

| Type of graph | Use of graph | Indicator |
|---|---|---|
| Author-Citation | Complete within interval | SARA |
| Author-Citation | Complete | EigenFactor score for authors |
| Paper-Citation | Complete or Partial | Cross-generational indices |
| Paper-Citation | Two generations | hfg-index, Indirect H-index |

Table 3.11: Classification of the author-based indirect indicators

3.3 Journal indicators

3.3.1 Direct indicators

Influence weight of the journal [Pinski and Narin, 1976]: This indicator was described by means of journals but, as the authors of the paper state, the same concept can be applied to any unit participating in a citation network as long as the corresponding derived graph is used. The Influence weight is meant to be a size-independent measure of the number of citations a journal receives from all other journals participating in the citation network, normalized by the number of references the journal gives to other journals. The Influence weight is a measure of the influence that each reference provided by this journal has.

Influence per publication for the journal [Pinski and Narin, 1976]: To calculate the Influence per publication for a journal, one needs to know the Influence weight of the journal, the number of references that the journal made and the papers published within a year. Since the Influence weight is a measure of influence per reference, by multiplying by the number of references and dividing by the number of publications, we get the Influence per publication. This indicator is included in this category as a derived indicator based on the calculations for the Influence weight of a journal.

Total influence of the journal [Pinski and Narin, 1976]: The Total influence of a journal is its Influence per publication multiplied by the number of publications within a year. This indicator is included in this category as a derived indicator based on the calculations for the Influence weight of a journal.

| Journal-size dependent | Indicator |
|---|---|
| - | Influence weight of the journal |
| - | Influence per publication of the journal |
| X | Total influence of the journal |

Table 3.12: Classification of the journal-based direct indicators

3.3.2 Indirect indicators

Weighted PageRank [Bollen et al., 2006]: A modification of the original PageRank algorithm that, instead of dividing the credit of a journal equally among the journals it references, uses a weighting that is proportional to the number of references given to each journal. So, if a journal cites a particular journal more often, the corresponding link weighs more. The formula used to calculate the Weighted PageRank score for journal a is

\[ PR_a^w = \frac{(1 - d)}{N} + d \cdot \sum_{i} PR_i^w \cdot w(i, a) \tag{3.63} \]

where $w(i, a)$ is the weighting function, which according to the authors is given by

\[ w(v_j, v_i) = \frac{W(v_i, v_j)}{\sum_{k} W(v_j, v_k)} \tag{3.64} \]

where $W(v_i, v_j)$ maps each edge between the journals $v_i$ and $v_j$ to a positive citation frequency. This means that the Journal-Citation graph used in this version of PageRank is constructed based on the method that does not normalize the weights of the edges, which are later normalized via the weighting function (3.64).
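To make the iteration concrete, a minimal power-iteration sketch of equations (3.63) and (3.64) is given below. Several points are assumptions of ours and are not taken from Bollen et al. [2006]: the damping value d = 0.85, the fixed number of iterations, the reading of $w(i, a)$ as the citation frequency from journal i to journal a divided by the total outgoing citation frequency of i, and the choice to let journals without outgoing references pass on no credit.

```python
from typing import Dict


def weighted_pagerank(W: Dict[str, Dict[str, float]],
                      d: float = 0.85,
                      iterations: int = 100) -> Dict[str, float]:
    """W[i][a] is the raw citation frequency from journal i to journal a."""
    nodes = sorted(set(W) | {a for targets in W.values() for a in targets})
    n = len(nodes)
    out_total = {i: sum(W.get(i, {}).values()) for i in nodes}
    pr = {a: 1.0 / n for a in nodes}
    for _ in range(iterations):
        new = {a: (1.0 - d) / n for a in nodes}
        for i in nodes:
            if out_total[i] == 0:
                continue  # assumption: dangling journals pass on no credit
            for a, weight in W[i].items():
                # d * PR_i * w(i, a), with w normalised as in (3.64)
                new[a] += d * pr[i] * weight / out_total[i]
        pr = new
    return pr


W = {"A": {"B": 3, "C": 1}, "B": {"A": 2}, "C": {"A": 1, "B": 1}}
scores = weighted_pagerank(W)
```

In this toy graph journal C receives only a quarter of the credit of A, so it ends up with the lowest score of the three.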

Y-factor [Bollen et al., 2006]: The Y-factor is calculated as the product of the Weighted PageRank value of a journal and its Impact factor [Garfield, 1999, 2005]. It is listed in this category because it is a derived indicator based on the Weighted PageRank. Equation (3.65) shows the formula used to calculate the Y-factor

\[ Y_a = ISIIF_a \cdot PR_a^w \tag{3.65} \]

where $ISIIF_a$ denotes the Impact Factor value for journal a, and $PR_a^w$ denotes the Weighted PageRank value for the same journal.

PSJR - Prestige SJR [González-Pereira et al., 2010]: PSJR is a size-dependent metric used to calculate the overall journal prestige and influence. Its calculation is based on a FUC-NN-FREC Journal-Citation graph, with a three year citation window and the number of self-citations per journal restricted to 33% of its overall citations.

It is recursively calculated for each journal and the final value depends upon three terms. The first term is constant and represents a minimum value assigned to each journal in the citation graph. The second term, also constant, represents the prestige of the papers, that is, the number of papers included in the journal normalized by the total number of papers published by all journals included in the graph. Finally, the third term represents the prestige of the citations and is given by the weighted PSJR values of the citing journals and a constant value that represents the portion of the PSJR value of the dangling nodes of the graph (journals that do not cite any other journal) assigned to the current journal. The overall outcome of the calculations can be tuned by two constants (namely, e and d) that control the effect of the prestige of the papers and citations respectively. The PSJR is given by

\[ PSJR_i = \frac{1 - d - e}{N} + e \cdot \frac{Art_i}{\sum_{j=1}^{N} Art_j} + d \cdot \left( \sum_{j=1}^{N} C_{ji} \cdot \frac{PSJR_j}{C_j} \cdot CF + \frac{Art_i}{\sum_{j=1}^{N} Art_j} \cdot \sum_{k \in DN} PSJR_k \right) \tag{3.66} \]

In the equation, Arti is the number of primary items of journal i, Cji represents the references from journal j to journal i, Cj is the number of references of journal j, CF is a correction factor used to spread the undistributed prestige and DN represents the dangling nodes. For a more detailed description of each element we refer the reader to González-Pereira et al.[2010].

SJR indicator [González-Pereira et al., 2010]: The SJR indicator is a size-independent metric that calculates the average prestige per paper published in a specific journal and, as such, it can be used to compare journals that publish different number of items. It is calculated by dividing the PSJR value of a journal by the number of papers published and multiplying the result by a constant value c that makes the outcome more easily readable. The SJR indicator for journal i is given by

\[ SJR_i = c \cdot \frac{PSJR_i}{Art_i} \tag{3.67} \]

PRsum - Total authority [Su et al., 2011]: This is a derived indicator based on the scores of the papers included in the Paper-Citation graph and calculated with the PrestigeRank indicator. It is defined as 78 CHAPTER 3. CLASSIFYING ASSESSMENT INDICATORS the sum of all PrestigeRank values of all papers published in a journal and for journal i is given by

\[
PR_{sum} = \sum_{i=1}^{N} PR_i \tag{3.68}
\]

The authors consider this indicator to be the equivalent of the citation counts.

PRave - Authority factor [Su et al., 2011]: It is a derived indicator based on the scores of the papers included in the Paper-Citation graph and calculated with the PrestigeRank indicator. It is defined as the average PrestigeRank values for the papers published in a journal and for journal i is given by

\[
PR_{ave} = \frac{1}{N} \sum_{i=1}^{N} PR_i \tag{3.69}
\]

The authors consider this indicator to be the equivalent of the Impact Factor for journals.
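The two derived indicators can be illustrated with a minimal sketch. The PrestigeRank scores below are hypothetical values chosen for the illustration, and the function names `pr_sum` and `pr_ave` are not taken from Su et al. [2011].

```python
from statistics import fmean


def pr_sum(prestige_ranks):
    """Total authority: sum of the PrestigeRank values of a journal's papers."""
    return sum(prestige_ranks)


def pr_ave(prestige_ranks):
    """Authority factor: mean PrestigeRank value of a journal's papers."""
    return fmean(prestige_ranks)


# Hypothetical PrestigeRank scores for the four papers of one journal.
journal_scores = [0.5, 0.25, 0.125, 0.125]

total_authority = pr_sum(journal_scores)
authority_factor = pr_ave(journal_scores)
print(total_authority, authority_factor)
```

PRsum grows with the number of papers a journal publishes, whereas PRave does not, which mirrors the relation between citation counts and the Impact Factor mentioned above.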

Recursive Mean Normalized Citation Score (Recursive MNCS) [Waltman et al., 2011b]: The Recursive MNCS is based on the non-recursive MNCS indicator originally defined to account for differences among scientific fields [Waltman et al., 2011a]. The MNCS indicator is defined over a set of papers and is equal to the average Normalized citation score of the papers in the set.

The Normalized citation score for each paper is defined as the Total number of citations received, divided by the Expected number of citations for papers published in the field, which is equal to the average number of citations received per paper in the field. In Waltman et al. [2011b] the authors define the described MNCS indicator as the first-order MNCS indicator, which is used to recursively calculate higher-order MNCS indicators for journals and institutions. For the calculation of these higher-order indicators the authors assign varying weights to each citation based on the previous-order MNCS value of the citing journal. In an empirical study presented in the same paper, the authors, apart from the scientific field, also consider the publication year of the papers included in the calculations. It should be mentioned though that the authors conclude that the combination of normalized citation counts, to account for differences among scientific fields, along with recursive citation weighting does not produce satisfactory results. For a detailed presentation of the empirical results and conclusions we refer the reader to the original paper, Waltman et al. [2011b].
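The first-order MNCS described above can be sketched as follows. This is a minimal illustration only: the paper records are hypothetical, the expected number of citations is taken as the plain field average (the publication year is ignored), and the recursive weighting of higher-order indicators is not shown.

```python
from collections import defaultdict


def mncs(papers, paper_set):
    """First-order MNCS of the papers whose ids are in paper_set.

    papers: list of (paper_id, field, citations) for the whole database.
    """
    totals = defaultdict(lambda: [0, 0])  # field -> [citations, paper count]
    for _, field, cites in papers:
        totals[field][0] += cites
        totals[field][1] += 1
    # Expected citations per field = average citations per paper in that field.
    expected = {field: c / n for field, (c, n) in totals.items()}
    scores = [cites / expected[field]
              for pid, field, cites in papers if pid in paper_set]
    return sum(scores) / len(scores)


# Hypothetical database with two fields of very different citation density.
database = [
    ("m1", "math", 2), ("m2", "math", 6),
    ("b1", "biology", 10), ("b2", "biology", 30),
]
score = mncs(database, {"m2", "b1"})
print(score)
```

In this example paper m2 scores 1.5 (6 citations against an expected 4) and paper b1 scores 0.5 (10 against an expected 20), so the set averages to 1, even though b1 has more raw citations than m2.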

PSJR2 - Prestige SJR2 [Guerrero-Bote and Moya-Anegón, 2012]: PSJR2 was proposed as an improvement of the PSJR indicator proposed by González-Pereira et al. [2010] and is a size-dependent metric. The indicator considers both the prestige and closeness of the citing journal.

PSJR2 is also recursively calculated and its value depends again upon three terms. The first two terms are the same as in the original PSJR indicator. The difference between the two indicators lies in the calculation of the third term, which represents the citation prestige. This term depends on a set of coefficients, named Coef_ji, and a factor, named PSJR2D, both of which are calculated using the cosine of the co-citation profiles of the journals. The co-citations received by two journals are used as a measure of their closeness, and the cosine of the co-citation vectors is used as a measure of the thematic relationship between journals. The PSJR2 indicator is given by

\[
PSJR2_i = \frac{1 - d - e}{N} + e \cdot \frac{Art_i}{\sum_{j=1}^{N} Art_j} + \frac{d}{PSJR2D} \cdot \left[ \sum_{j=1}^{N} Coef_{ji} \cdot PSJR2_j \right] \tag{3.70}
\]

For a detailed description of each element and for the equations used to calculate the PSJR2D factor and the Coef_ji coefficients we refer the reader to Guerrero-Bote and Moya-Anegón [2012].

SJR2 indicator [Guerrero-Bote and Moya-Anegón, 2012]: The SJR2 indicator is a size-independent metric whose calculations are based on the PSJR2 indicator. It is calculated by dividing the PSJR2 value of a journal by the ratio of citable documents that the journal has relative to the total. The SJR2 indicator is given by

\[
SJR2_i = \frac{PSJR2_i}{Art_i / \sum_{j=1}^{N} Art_j} = \frac{PSJR2_i}{Art_i} \cdot \sum_{j=1}^{N} Art_j \tag{3.71}
\]
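The size normalization that turns PSJR2 into SJR2 can be illustrated with a short sketch. The PSJR2 values and article counts below are hypothetical, and the recursive calculation of PSJR2 itself is not shown.

```python
def sjr2(psjr2, articles):
    """Divide each journal's PSJR2 by its share of all citable documents."""
    total_articles = sum(articles.values())
    return {journal: psjr2[journal] / (articles[journal] / total_articles)
            for journal in psjr2}


# Hypothetical two-journal graph: A is three times the size of B.
psjr2_values = {"A": 0.6, "B": 0.4}
article_counts = {"A": 300, "B": 100}

sjr2_values = sjr2(psjr2_values, article_counts)
print(sjr2_values)
```

Journal A holds more total prestige, but once its 75% share of the documents is taken into account the smaller journal B obtains the higher size-independent score (1.6 against 0.8).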

Eigenfactor™ score [Bergstrom, 2007, Bergstrom et al., 2008], [Wes, 2008]: The EigenFactor score is an indicator of the total influence of a journal. It uses a FUC-NN-FREC Journal-Citation graph, with a five-year citation window and with all journal self-citations excluded. EigenFactor imitates the original PageRank algorithm by calculating the journal influence vector, which in turn is used to calculate the EigenFactor score as the percentage of citations received by the journal in question from all other journals included in the graph. The authors mention that the EigenFactor metrics can be used at the article and author levels as well.

Article influence™ score [Bergstrom, 2007, Bergstrom et al., 2008]: The Article influence score is a derived indicator, whose calculation is based on the EigenFactor score of a journal divided by the number of papers published in the journal for the five-year period, normalized by the total number of papers published in the same period. This yields the per-article influence of the journal, which can be compared to the Impact Factor.

One can assign the indirect journal-based indicators, based on their relation (if any) to PageRank, to the following categories: (a) the ones that are adaptations of PageRank, (b) the ones that are derived from indicators that use or adapt PageRank, and (c) the ones that are defined independently of PageRank. Another possible categorization can be based on whether the indicators are dependent on the size of the journal in question or not. Table 3.13 presents the classification scheme described and lists the indirect indicators that belong to each class.

Relation to PageRank                               | Journal-size dependent | Indicator
---------------------------------------------------+------------------------+------------------------------------------
PageRank adaptations                               | X                      | Weighted PageRank
                                                   | X                      | EigenFactor score
                                                   | X                      | Prestige SJR
                                                   | X                      | Prestige SJR2
                                                   | X                      | Y-factor
Derived from indicators that use or adapt PageRank | -                      | Article influence score
                                                   | -                      | SJR indicator
                                                   | -                      | SJR2 indicator
                                                   | X                      | PRsum-Total authority
                                                   | -                      | PRave-Authority factor
Independently defined                              | -                      | Influence weight of the journal
                                                   | -                      | Influence per publication of the journal
                                                   | X                      | Total influence of the journal
                                                   | -                      | Recursive Mean Normalized Citation Score

Table 3.13: Classification of the journal-based indirect indicators

Chapter 4

Proposed paper indicators

During our research we have identified some areas where a new indicator could provide some additional information or insight not already covered by the currently used indicators. In order to address these we have proposed two new indicators that can be used to assess scholarly impact.

The first indicator proposed is called f − value and we would classify it as an indirect indicator used to assess scientific papers drawing from the information present in the Paper-Citation graph. f − value considers the information from the Paper-Citation graph in a localized manner, in order to produce a single value that depicts the scientific value of a publication.

The second indicator is fpk − index, which is also defined as an indirect indicator and addresses some of the shortcomings of f − value. fpk − index utilizes more information from the Paper-Citation graph in order to produce a more representative value for the distinct publications. We believe that the definition of fpk − index is more robust and will behave better in the different use cases one might encounter in a real-world Paper-Citation graph.

For each of the defined indicators we are going to present the theoretical background and factors considered, along with applications of the indicators on a number of constructed Paper-Citation graphs. During this process we are going to compare them with some well known indicators and present the results. The scenarios used will highlight how the indicators behave under these defined conditions and demonstrate the strengths and weaknesses of each indicator. In addition we are going to present a comparison of the f − value and the fpk − index in an attempt to highlight the differences in their definition and application.

The rest of this chapter is organized as follows: In Section 4.1 we define the first indicator called f − value and apart from its definition and formula, we present the algorithm used to calculate the f − values of a Paper-Citation graph and provide an example application. In Section 4.2 we critically evaluate the different definitions of generations of citations, define the fpk − index based on the chosen definition and present an example application. Finally, in Section 4.3 we present a comparison between the two proposed paper indicators, along with a comparison of the proposed indicators with other well known indicators found in the literature.

4.1 f-value

The f − value indicator, proposed in Fragkiadaki et al. [2011], is an indirect Paper indicator that considers the direct and indirect impact of a scientific publication taking into account the information present in the whole of the Paper-Citation graph. It is based on the concept of the Cascading Citations Indexing Framework [Dervos and Kalkanis, 2005, Dervos et al., 2006a, Fragkiadaki et al., 2009] and particularly on the generations of citations and how they can affect our understanding of the impact of a scientific publication.

4.1.1 Definition

The formula to calculate the f − value of paper Pi is presented below, and as we can see it is based on recursive calculations across the entire citation graph.

\[
f(P_i) = 1 + RF \cdot \sum_{j=1}^{c(P_i)} f(P_j) \tag{4.1}
\]

A value of 1 is assigned to all published papers. We consider that any scientific publication that has gone through the process of being examined and peer-reviewed carries a scientific value, even if it has received no citations later on.

Based on the formula, the f − value of a publication is recursively calculated from the f − values of the papers that directly cite it. According to the original definition, c(Pi) is the number of direct citations of paper Pi, whose f − value we wish to calculate, and f(Pj) represents the f − value of paper Pj directly citing paper Pi.

In addition, in order to account for the different publication patterns of the scientific fields, the formula introduces a reducing factor (RF), used to mitigate the impact transferred from the different generations of citations. In the original paper the fraction RF = 1/2.2 is used as the reducing factor, a value calculated from the data set used. The general concept of this formula is that a paper transfers some of its value to the papers that it cites via the direct citations that exist among the papers in the Paper-Citation graph. How much of this value is transferred by each direct citation depends on the calculated value of the Reducing Factor used for the current graph.
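To see how the reducing factor damps each further generation of citations, consider a hypothetical chain of three papers P3 → P2 → P1 (an illustration that is not taken from the original paper), in which P3 receives no citations. Applying the formula from the uncited paper backwards gives:

```latex
\begin{align*}
f(P_3) &= 1 \\
f(P_2) &= 1 + RF \cdot f(P_3) = 1 + RF \\
f(P_1) &= 1 + RF \cdot f(P_2) = 1 + RF + RF^2
\end{align*}
```

With RF = 1/2.2 ≈ 0.4545 this yields f(P1) ≈ 1 + 0.4545 + 0.2066 ≈ 1.66: the direct citation contributes RF, while the 2-gen citation contributes only RF squared.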

4.1.2 The Reducing Factor (RF)

As mentioned earlier, one way to define the Reducing Factor is to examine the properties of a specific Citation graph. The number of papers included, the total number of citations and the average number of citations received by each publication could play an important role in actually defining the Reducing Factor. In this section we are going to describe the methodology followed in the original paper [Fragkiadaki et al., 2011] in order to define the Reducing Factor.

Given a Paper-Citation graph we calculate the Medal Standings Output (MSO) table up to depth 3 for the papers included in the graph. The calculated MSO table will contain a line for each publication included in the Paper-Citation graph along with the number of 1-gen, 2-gen and 3-gen citation counts, and the number of 2-gen chords and 3-gen chords for each publication. Based on the MSO table we can produce the following statistical metrics for the full set of papers and citations present in the graph:

• Mean number of citations per generation of citations
• Standard deviation per generation of citations
• Minimum / Maximum values per generation of citations
• Quartiles per generation of citations
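The metrics listed above could be computed along the following lines. The MSO rows are hypothetical, the chord counts are left out, and the use of the population standard deviation and of the default quartile method of Python's `statistics` module are assumptions, since the text does not specify either.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical MSO rows: paper id -> (1-gen, 2-gen, 3-gen) citation counts.
mso = {
    "A": (0, 0, 0),
    "B": (1, 0, 0),
    "C": (2, 3, 1),
    "D": (5, 9, 4),
}


def generation_stats(mso_table, generation):
    """Summary statistics for one generation (0 = 1-gen, 1 = 2-gen, ...)."""
    values = [row[generation] for row in mso_table.values()]
    return {
        "mean": mean(values),
        "std": pstdev(values),                # population std; an assumption
        "min": min(values),
        "max": max(values),
        "quartiles": quantiles(values, n=4),  # the three quartile cut points
    }


stats_1gen = generation_stats(mso, 0)
print(stats_1gen["mean"], stats_1gen["min"], stats_1gen["max"])
```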

From our research, we concluded that bibliographic databases are very rarely complete, in the sense that it is currently impossible for a single bibliographic database to include all papers and all citations that these papers have received. Taking for granted the incomplete nature of bibliographic databases, we calculated the same metrics for the subset of papers included in the Paper-Citation graph that received at least one 1-gen citation and we based our calculations on these numbers instead.

The value chosen as the Reducing Factor for the application of the f − value on the Paper-Citation graph was the ratio

\[
RF = \frac{1}{\dfrac{\text{2-gen citations}}{\text{1-gen citations}}} \tag{4.2}
\]

The denominator of this ratio indicates the average number of 2-gen citations received for every 1-gen citation a paper has received in the specific Paper-Citation graph, when considering only papers that have received at least one 1-gen citation. As mentioned in Section 4.1.1, the calculated value for the Paper-Citation graph in the original paper was 1/2.2, which means that for every 1-gen citation received, a paper would on average receive 2.2 2-gen citations.
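A minimal sketch of this calculation is given below. It assumes that the ratio is taken over the totals of 1-gen and 2-gen citations of the cited papers; the MSO rows are hypothetical and were chosen so that the result matches the 1/2.2 value quoted above.

```python
def reducing_factor(mso_table):
    """RF = 1 / (2-gen citations / 1-gen citations) over the cited papers.

    mso_table: paper id -> (1-gen count, 2-gen count).
    """
    # Keep only papers with at least one 1-gen citation, mirroring the text
    # (uncited papers would add zero to both totals anyway).
    cited = [row for row in mso_table.values() if row[0] > 0]
    one_gen = sum(row[0] for row in cited)
    two_gen = sum(row[1] for row in cited)
    if two_gen == 0:
        raise ValueError("RF is undefined when there are no 2-gen citations")
    return one_gen / two_gen


# Hypothetical MSO rows: 10 1-gen and 22 2-gen citations in total.
mso_rows = {"A": (4, 10), "B": (6, 12), "C": (0, 0)}
rf = reducing_factor(mso_rows)
print(rf)
```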

One thing to note about the reducing factor is that it is a constant value only for a particular period in time. Since bibliographic databases are always evolving, with new papers being added that provide new citations to the papers already present in the Paper-Citation graph, the Reducing Factor can and will change over time. For this reason we propose that the Reducing Factor be recalculated regularly, at intervals determined by the frequency and extent of the changes to the Paper-Citation graph.

4.1.3 Algorithm

In this section we present the algorithm that calculates the f −values of the closed set of papers in a Paper-Citation graph. The algorithm requires a finite number of iterations to calculate the f −values but the exact number of iterations is dependent on the size and density of the Paper-Citation graph on which we apply it.

Following the mathematical notation defined in Section 2.2, we would say that the algorithm receives as input the list (P) of papers to be processed, the Paper Direct Citations (PDC) data structure which for each paper includes the list of papers that directly cite it, and, the Paper F-Values (PFV) data structure which includes the papers that need to be processed plus their current f − value and a flag that denotes whether this value has changed during the last iteration.

In other words, for a Paper-Citation graph with a total of NP papers, the set of papers that would need to be processed is defined by P = {P1, P2, ..., PNP}. Let C(Pi) denote the list of papers that directly reference paper Pi. Thus, C(Pi) is a subset of P and the Paper Direct Citations (PDC) data structure is PDC = [C(P1), C(P2), ..., C(PNP)].

Additionally, for each paper Pi, let V (Pi) denote the information required for this paper during the execution of the algorithm. This information consists of the f − value calculated so far for this paper and of a flag indicating whether the f − value has changed during the last iteration of the algorithm. Thus, for every paper Pi, in the beginning of the execution of the algorithm we would define V (Pi) as V (Pi) = [fval = 1, changed = 0]. Finally, the Paper F-Values (PFV) data structure is PFV = [V (P1) ,V (P2) ,...,V (PNP )].

The output of the algorithm is the PFV structure that contains the calculated f − values for all the papers of the current Paper-Citation graph.

During the first iteration of the algorithm, all articles have an f − value equal to 1. At each iteration, the algorithm calculates the f − values of all articles in the database based on the f − values calculated during the previous iteration and records whether any f − value has changed between the two iterations. If there is at least one changed value, the algorithm requires one more iteration because that change could propagate to more articles in the following iteration. If there is no f-value change then all f-values have been calculated and the algorithm terminates.

Algorithm 4.1 f − value algorithm

    Input:
        P    list of papers to be processed
        PDC  data structure with direct citations of each paper
        PFV  data structure with initial f-values and flags
    Output:
        PFV  data structure with calculated f-values and flags

    PDC = remove_cycles(PDC)
    NChanged = 0
    first = true
    while (first || NChanged > 0) do
        first = false
        NChanged = 0
        PREV_PFV = PFV
        foreach R in P do
            prev_fval = PFV[R][fval]
            PFV[R][fval] = 1
            RCIT = PDC[R]
            for T in RCIT do
                PFV[R][fval] = PFV[R][fval] + RF*PREV_PFV[T][fval]
            if PFV[R][fval] != prev_fval then
                PFV[R][changed] = 1
                NChanged = NChanged + 1
            else
                PFV[R][changed] = 0
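The listing can be turned into a short runnable sketch. The version below is one possible Python rendering and not the original implementation: the function name `f_values`, the use of plain dictionaries in place of the PDC and PFV structures, and the returned iteration counter are choices made for this illustration. It assumes that cycles have already been removed, and it uses the citations of the example graph of Section 4.1.4 with RF = 0.8.

```python
def f_values(papers, cited_by, rf):
    """Iterate the f-value formula until no value changes.

    papers:   iterable of paper ids
    cited_by: dict mapping a paper id to the ids of the papers that cite it
    rf:       reducing factor
    Assumes the citation graph is acyclic (cycles removed beforehand).
    Returns (dict of f-values, number of iterations executed).
    """
    fval = {paper: 1.0 for paper in papers}
    iterations = 0
    changed = True
    while changed:
        iterations += 1
        prev = dict(fval)  # snapshot of the values of the previous iteration
        changed = False
        for paper in fval:
            new_value = 1.0
            for citer in cited_by.get(paper, ()):
                new_value += rf * prev[citer]
            if new_value != prev[paper]:
                changed = True
            fval[paper] = new_value
    return fval, iterations


# Example graph of Section 4.1.4: cited paper -> papers that cite it.
papers = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
cited_by = {
    "P1": ["P2", "P7"],
    "P2": ["P3", "P4"],
    "P3": ["P4"],
    "P4": ["P5", "P6"],
    "P7": ["P2"],
}
values, n_iter = f_values(papers, cited_by, 0.8)
print(n_iter, {p: round(v, 2) for p, v in values.items()})
```

The snapshot taken at the start of every pass corresponds to PREV_PFV in the listing: each paper is updated from the values of the previous iteration only, so the result does not depend on the order in which the papers are visited.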

In order to avoid possible errors in the execution of the algorithm we must ensure that no cycles exist in the closed set of papers included in the calculations. As already mentioned the algorithm is recursive and it bases its calculations on whether a paper’s f − value has changed during the last iteration. If a cycle does exist in the Paper-Citation graph then the algorithm will enter an infinite loop and never terminate, since the calculated f − values will constantly change from one iteration to the other.

The method used to remove the cycles from within the Paper-Citation graph was based on the citations themselves and on the publication years of the papers that participated in any given cycle. In order to remove a cycle from the graph we would need to either remove all citations that comprise this cycle, or simply remove one of them so that the cycle cannot complete. Ideally we would choose to remove the citation that is responsible for the formation of the cycle, and we would identify it as the one that originates from an older paper and terminates at a newer one, as such a citation would indicate that an already published paper is referencing a paper not yet published.
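One way to realize this idea is sketched below. It is an illustrative reading and not the original implementation: the helper names, the depth-first cycle search and the tie-breaking rule (drop the citation whose cited paper is newest relative to the citing paper, which also covers cycles where no citation points strictly forward in time) are assumptions. Note that the dictionary here maps a citing paper to the papers it references, and the publication years are hypothetical.

```python
def find_cycle(cites):
    """Return the (citing, cited) edges of one cycle, or None if acyclic."""
    white, grey, black = 0, 1, 2
    color = {}
    stack = []

    def visit(node):
        color[node] = grey
        stack.append(node)
        for ref in cites.get(node, ()):
            state = color.get(ref, white)
            if state == grey:  # back edge: ref is on the current path
                path = stack[stack.index(ref):] + [ref]
                return list(zip(path, path[1:]))
            if state == white:
                found = visit(ref)
                if found:
                    return found
        stack.pop()
        color[node] = black
        return None

    for start in list(cites):
        if color.get(start, white) == white:
            found = visit(start)
            if found:
                return found
    return None


def remove_cycles(cites, year):
    """Break every cycle by dropping its most 'forward in time' citation."""
    cites = {paper: set(refs) for paper, refs in cites.items()}
    while (cycle := find_cycle(cites)) is not None:
        citing, cited = max(cycle, key=lambda e: year[e[1]] - year[e[0]])
        cites[citing].discard(cited)
    return cites


# Hypothetical data: P1 (2001) and P4 (2002) cite each other.
references = {"P1": {"P4"}, "P2": {"P1"}, "P3": {"P1"}, "P4": {"P1"}}
pub_year = {"P1": 2001, "P2": 2003, "P3": 2004, "P4": 2002}
acyclic = remove_cycles(references, pub_year)
print(acyclic)
```

In this example the citation from P1 to P4 is removed, because it points from the older paper to the newer one. The recursive search is adequate for small graphs only; a large database would call for an iterative traversal.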

4.1.4 Example

As an example, let us consider the Paper-Citation graph of Figure 4.1. The graph consists of seven paper nodes (NP = 7) and there are eight direct citations among them (NC = 8). If we consider this Paper-Citation graph as the complete graph for which we wish to calculate the f − values of its papers, then we must first calculate the Reducing Factor for this graph.

Figure 4.1: Paper-Citation graph

As we recall from Section 4.1.2, the Reducing Factor has been defined as the inverse of the ratio of 2-gen citations to 1-gen citations. In order to calculate the total number of 2-gen citations we are going to produce the Paper-Citation table for our graph. The Paper-Citation table for citation paths up to Length 2 is shown in Table 4.1 and as we can see the total number of Length 2 citation paths is 10, hence 10 is the number of 2-gen citations included in the graph. Therefore the Reducing Factor for our graph will be equal to

\[
RF = \frac{1}{\dfrac{\text{2-gen}}{\text{1-gen}}} = \frac{1}{\dfrac{10}{8}} = 0.8
\]

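Path     | Source | Via | Target
---------+--------+-----+-------
Length 1 | P2     |     | P1
         | P2     |     | P7
         | P3     |     | P2
         | P4     |     | P2
         | P4     |     | P3
         | P5     |     | P4
         | P6     |     | P4
         | P7     |     | P1
Length 2 | P2     | P7  | P1
         | P3     | P2  | P1
         | P3     | P2  | P7
         | P4     | P2  | P1
         | P4     | P2  | P7
         | P4     | P3  | P2
         | P5     | P4  | P2
         | P5     | P4  | P3
         | P6     | P4  | P2
         | P6     | P4  | P3

Table 4.1: Paper-Citation table up to Length 2 for the graph of Figure 4.1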
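The rows of Table 4.1 and the resulting Reducing Factor can be reproduced with a few lines of code. In this sketch the dictionary `cites` (citing paper to referenced papers) is transcribed from the Length 1 rows of the table; the variable names are chosen for the illustration.

```python
# Citing paper -> papers it references (the Length 1 rows of Table 4.1).
cites = {
    "P2": ["P1", "P7"],
    "P3": ["P2"],
    "P4": ["P2", "P3"],
    "P5": ["P4"],
    "P6": ["P4"],
    "P7": ["P1"],
}

# Length 1 paths are the direct citations (source, target).
length1 = [(source, target) for source, refs in cites.items() for target in refs]

# A Length 2 path extends a direct citation by one more reference.
length2 = [(source, via, target)
           for source, via in length1
           for target in cites.get(via, ())]

# RF = 1 / (2-gen / 1-gen) = 1-gen / 2-gen
rf = len(length1) / len(length2)
print(len(length1), len(length2), rf)
```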

So, now that we have the RF value defined we are able to start the calculations for the f − values of the papers included in the graph. The results after each iteration of the algorithm are shown in Table 4.2. Iteration 0 is considered as the input to the algorithm where the f − value of all papers is set to be equal to 1. The columns of the table represent the f − values of the papers included in the graph and the rows of the table display the calculated values of each iteration. The values that are set in bold are the ones that have changed if we compare them with the value that they had in the previous iteration.

Iteration P1 P2 P3 P4 P5 P6 P7

0 1.00 1.00 1.00 1.00 1.00 1.00 1.00

1 2.60 2.60 1.80 2.60 1.00 1.00 1.80

2 4.52 4.52 3.08 2.60 1.00 1.00 3.08

3 7.08 5.54 3.08 2.60 1.00 1.00 4.62

4 9.13 5.54 3.08 2.60 1.00 1.00 5.44

5 9.78 5.54 3.08 2.60 1.00 1.00 5.44

6 9.78 5.54 3.08 2.60 1.00 1.00 5.44

Table 4.2: Iterations of the f − value algorithm for the Paper-Citation graph of Figure 4.1

As we can see the algorithm required 6 iterations to converge, since none of the calculated f − values changed between iterations five and six. Papers P5 and P6 have a constant value of 1 since they do not receive any citations from any of the other papers included in the graph. Thus their value is set to 1 and it never changes. The paper that ranks highest is paper P1 since it accumulates part of the f − value of other papers included in the graph.
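As a worked check of individual table entries, recall from Table 4.1 that paper P1 is cited directly by papers P2 and P7. Writing f^(k) for the value after iteration k (a notation introduced only for this illustration), the second iteration and the final value of P1 are obtained as:

```latex
\begin{align*}
f^{(2)}(P_1) &= 1 + 0.8 \cdot \left( f^{(1)}(P_2) + f^{(1)}(P_7) \right)
              = 1 + 0.8 \cdot (2.60 + 1.80) = 4.52 \\
f^{(5)}(P_1) &= 1 + 0.8 \cdot \left( f^{(4)}(P_2) + f^{(4)}(P_7) \right)
              = 1 + 0.8 \cdot (5.544 + 5.4352) = 9.78336 \approx 9.78
\end{align*}
```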

Following paper P1 in the ranking are papers P2 and P7. Paper P7 receives 1 1-gen citation, 2 2-gen citations and 3 3-gen citations, whereas paper P2 receives 2 1-gen citations and 3 2-gen citations. The difference between these two papers is very subtle if we only examine their f − values. If we also examine the indirect impact of these two papers we will see that P2 has received more lower-ranked citations than P7, since it has more 1-gen and more 2-gen citations than P7. P2 on the other hand, has zero 3-gen citations but P7 has received three 3-gen citations, and that is the main reason it has an f − value so similar to the one P2 has. The point to consider in this scenario is that given a sufficient number of 3-gen citations paper P7 could have ranked higher than P2.

Finally we have papers P3 followed by P4. As we can see from the results paper P3 is ranked higher because it has received 1 1-gen and 2 2-gen citations whereas paper P4 has received just 2 1-gen citations.

4.2 fpk − index

As briefly mentioned in the introduction of this chapter, the definition of the fpk − index is tied to the definition of generations of citations and the meaning we assign to them. So, in order to describe the indicator we would first need to define the data from within the Paper-Citation graph that we are going to utilize in order to calculate the fpk − index value of each individual paper that participates in the graph [Fragkiadaki and Evangelidis, 2016].

4.2.1 Critical evaluation of generations

So far, we have defined what direct and indirect citations are but we have not defined what they represent for the referenced paper. We believe that direct citations clearly represent a relation between the source and target paper. The target paper has somehow participated in the work presented and therefore it is being referenced. We should mention though at this point that it does not necessarily follow that the paper has been mentioned in a positive way within the source paper.

Indirect citations also describe a relation between the source and target papers, since without the pre-existing work of other researchers some of the papers examined might not have come to life.

We believe that the connection defined by an indirect citation between a source and target paper should be considered stronger the closer the two papers are in the Citation graph, since this proximity represents the number of intermediate papers published.

Building on top of the indirect citations we have defined the generations of citations and we have described the different ways they can be constructed, but we have not yet examined which approach we consider to be the preferred one. In this section we are going to attempt to explain the choice made on how the generations of citations should be defined and what are the specific use case scenarios that can benefit from the chosen definition. More specifically we are going to present and discuss the following scenarios:

• the existence of cycles of different levels in the graph
• the existence of multiple citation paths between the source and target papers
  – the citation paths might be of the same length
  – the citation paths might be of different lengths
  – or, the citation paths could comply with the definition of chords where a direct citation co-exists with an indirect one of greater length

Let us again consider the Paper-Citation graphs presented in Figure 2.2, where we have three Paper-Citation graphs with different levels of cycles. Building on the concept of the Medal Standings Output table (MSO table) presented in Dervos and Kalkanis [2005], it is possible to create a table of the papers included in this Paper-Citation graph along with counts of the first n-gen citations of the papers.

Tables 4.3 (a) and (b) present the Paper-Citation tables that include the citation paths up to Length 3 for the Paper-Citation graphs of Figure 2.2 (a) and (b) respectively when calculated for paper P1. The Paper-Citation graph of Figure 2.2 (a) depicts the following paper set P = {P1, P2, P3, P4}, with papers P1 and P4 participating in a Level 1 cycle, since P1 → P4 → P1. The Paper-Citation graph of Figure 2.2 (b) depicts the following paper set P = {P1, P2, P3, P4, P5}, with papers P1, P5 and P4 participating in a Level 2 cycle, since P1 → P5 → P4 → P1.

In Table 4.4 we present the MSO tables for the first three generations for paper P1 for both Paper-Citation graphs based on the four definitions of generations of citations.

(a)
Path     | Source | Via    | Target
---------+--------+--------+-------
Length 1 | P2     |        | P1
         | P3     |        | P1
         | P4     |        | P1
Length 2 | P1     | P4     | P1
Length 3 | P2     | P1, P4 | P1
         | P3     | P1, P4 | P1
         | P4     | P1, P4 | P1

(b)
Path     | Source | Via    | Target
---------+--------+--------+-------
Length 1 | P2     |        | P1
         | P3     |        | P1
         | P4     |        | P1
Length 2 | P5     | P4     | P1
Length 3 | P1     | P5, P4 | P1

Table 4.3: Paper-Citation tables for paper P1 presented in Figures 2.2 (a) and (b).

(a)
P1 | Gm 1-gen | Gm 2-gen | Gm 3-gen | Gs 1-gen | Gs 2-gen | Gs 3-gen
---+----------+----------+----------+----------+----------+---------
a  | 3        | 0        | 0        | 3        | 0        | 0
b  | 3        | 1        | 0        | 3        | 1        | 0

(b)
P1 | Hm 1-gen | Hm 2-gen | Hm 3-gen | Hs 1-gen | Hs 2-gen | Hs 3-gen
---+----------+----------+----------+----------+----------+---------
a  | 3        | 1        | 3        | 3        | 1        | 3
b  | 3        | 1        | 1        | 3        | 1        | 1

Table 4.4: MSO table for the G and H definitions for paper P1 of Figure 2.2 (a) and (b).

As we can see from the tables, the m and s definitions produce exactly the same counts for P1, for both graphs examined. This is to be expected since, based on their definitions, m and s would produce different results only if we had more than one citation path of the same length originating from the same source paper.

Now, if we compare the G and H definitions we will see that they indeed produce different results for the second and third generations of the examined graphs. For the graph in Figure 2.2 (a), we can see that using the G definition paper P1 has received 0 2-gen citations. The only 2-gen citation received originates from paper P1 itself, which cannot be included since it already belongs to generation 0. Similarly, the paper appears to have 0 3-gen citations since all the 3-gen citations originate from papers already included in the first generation and follow a path similar to Px → P1 → P4 → P1. Since this restriction does not apply to the H definition we can see that the counts are different, with P1 receiving 3 1-gen citations, 1 2-gen citation and 3 3-gen citations.

Similarly, for the graph in Figure 2.2 (b), we can see that using the G definition again produces different results from the H definition but only for the third generation of citations for paper P1. In this case the paper has 0 3-gen citations if we are using the G definition since the only 3-gen citation originates from P1 itself via the P1 → P5 → P4 → P1 citation path, and again P1 cannot be included in this generation since it participates in generation 0.

In order to demonstrate the differences in the calculated generations of citations when there are multiple paths between a source and target paper we are going to use the Paper-Citation graph presented in Figure 4.2. There are three such scenarios that we are going to examine in the graph, and moving from left to right they are: (a) the existence of multiple citation paths of the same length, (b) the existence of a chord where a citation path of length greater than 1 co-exists with a direct citation, and (c) the existence of multiple citation paths of different lengths.

Figure 4.2: Paper-Citation graph with citation paths of different lengths

An application of the first scenario is visible on the graph if we examine the citation paths between papers P5 and P1. As we can see there are two Length 3 citation paths between the two papers, via P5 → P3 → P2 → P1 and P5 → P4 → P2 → P1. The second scenario can be examined by looking at the citation paths between papers P11 and P1, where a Length 2 citation path (P11 → P10 → P1) co-exists with a direct citation from paper P11. This would make the Length 2 citation fall under the definition of a chord. Finally the third scenario can be examined by looking at the citation paths between papers P9 and P1 where we can see that a Length 2 citation path (P9 → P6 → P1) co-exists with a Length 3 citation path (P9 → P8 → P7 → P1).

As before, we have generated Table 4.5 which presents the citation paths included in the Paper-Citation graph of Figure 4.2, up to Length 3 for paper P1. From the Paper-Citation graph and the citation paths presented in the table we can see that P1 is the target paper for five Length 1, five Length 2 and three Length 3 citation paths.

The MSO tables for the first three generations of citations for paper P1 of Figure 4.2 are presented in Table 4.6. The generation counts are presented in the columns of the table under the definition used to generate them, and the rows of the table represent the three different scenarios we are examining, namely (a) multiple paths of the same length, (b) multiple paths of different lengths where one of them is a direct citation (a chord), and (c) multiple paths of different lengths.

Citation path | Source paper | Via    | Target paper
--------------+--------------+--------+-------------
Length 1      | P2           |        | P1
              | P6           |        | P1
              | P7           |        | P1
              | P10          |        | P1
              | P11          |        | P1
Length 2      | P3           | P2     | P1
              | P4           | P2     | P1
              | P11          | P10    | P1
              | P9           | P6     | P1
              | P8           | P7     | P1
Length 3      | P5           | P3, P2 | P1
              | P5           | P4, P2 | P1
              | P9           | P8, P7 | P1

Table 4.5: Paper-Citation table for paper P1 of the Paper-Citation graph of Figure 4.2

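(a)
P1    | Hm 1-gen | Hm 2-gen | Hm 3-gen | Hs 1-gen | Hs 2-gen | Hs 3-gen
------+----------+----------+----------+----------+----------+---------
a     | 1        | 2        | 2        | 1        | 2        | 1
b     | 2        | 1        | 0        | 2        | 1        | 0
c     | 2        | 2        | 1        | 2        | 2        | 1
Total | 5        | 5        | 3        | 5        | 5        | 2

(b)
P1    | Gm 1-gen | Gm 2-gen | Gm 3-gen | Gs 1-gen | Gs 2-gen | Gs 3-gen
------+----------+----------+----------+----------+----------+---------
a     | 1        | 2        | 2        | 1        | 2        | 1
b     | 2        | 0        | 0        | 2        | 0        | 0
c     | 2        | 2        | 0        | 2        | 2        | 0
Total | 5        | 4        | 2        | 5        | 4        | 1

Table 4.6: MSO table for the G and H definitions for paper P1 of Figure 4.2.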

For example, for the Hm definition, scenario (b) provides two 1-gen citations from papers P10 and P11, one 2-gen citation from paper P11, which is defined as a 2-chord, and zero 3-gen citations. Finally, the last row of the table contains the total number of citations for each generation for each type of definition. So, if we examine the Hm definition we will see that it produces five 1-gen citations, five 2-gen citations and three 3-gen citations for the entire graph.

All four definitions produce the same count for the 1-gen citations included in the graph. This is due to the fact that there is no Level-0 cycle present in the graph and based on the definition of direct citations a paper can only ever directly cite another paper once. It is also worth noting that the number of 1-gen direct citations is equal to the number of incoming edges on the P1 node of the Paper-Citation graph.

Based on the G definition a paper can be included in a generation of citations only if it does not participate in any other generation closer to the paper under scrutiny. Therefore the counts produced by the G definition for the 2-gen citations present in the graph are lower than the counts produced by the H definition, four vs five. The citation that is excluded from the second generation is part of scenario (b): paper P11 provides a 2-gen citation to P1 via the P11 → P10 → P1 path, but since it also cites P1 directly it already belongs to the first generation. A similar exclusion affects the third generation in scenario (c), where paper P9 participates in two citation paths of different lengths. The 2-gen citation provided is via the P9 → P6 → P1 path and the 3-gen citation is via the P9 → P8 → P7 → P1 path. But since P9 has already been included in the second generation for P1 it cannot participate in the third generation as well.

The definitions differ most in the third generation of citations. As we can see from Table 4.6, the totals are 3, 2, 2 and 1, with the Hs and Gm definitions agreeing on the count but not on the citations included. The Hm definition generates three 3-gen citations, two from paper P5 and one from paper P9. The Hs definition generates two 3-gen citations, since now paper P5 can be included only once per generation. If we examine the Gm definition, it also generates two 3-gen citations, but this time they are both from paper P5. Since P5 has not previously been included in any other generation it is included twice in the third generation. But this time the 3-gen citation from paper P9 is not counted, since P9 participates in the second generation. Finally, the Gs definition counts a single 3-gen citation since P5 cannot appear in the third generation twice.

To summarize, we observe that in all scenarios the Hm definition produces the largest citation counts, since it applies no restriction to the papers included in each generation. As long as a paper provides a citation to a particular target paper, the paper is included in that generation and if there are multiple paths of the same or different lengths, then the paper will appear in all the generations defined by these paths. The difference between the Hm and Hs definitions is that, with the latter, a paper can only appear once per generation. This means that in all cases where there are multiple paths of the same length between a source and target paper, the source paper will only appear once in that particular generation.

The G definition produces smaller citation counts if there are multiple citation paths of different lengths between a source and destination paper. A paper from which multiple such paths originate will only be present in the generation closest to the target paper. As before, papers with multiple citation paths of the same length towards the same target paper are only counted once.

We should note that in a Paper-Citation graph it is not uncommon to have papers indirectly citing other papers via multiple citation paths of the same or different length, since if a paper cites other papers from the same scientific field it is highly likely that these papers will have some common ancestor.

From the above examples we have seen that the Gs definition manages to distinguish among all of the scenarios, including the existence of cycles as well as the existence of these multiple citation paths. This is the definition that we propose for counting indirect citations. With this definition an indirect citation indicates that there is a connection between the source and target papers rather than simply a citation path. Also, this definition fulfills the requirement that a connection should be considered stronger the closer we are to the target paper.
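The behaviour described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this study: it assumes that the Gs definition amounts to placing each citing paper in the generation given by its shortest citation path to the target, and the function name `gs_generations` and the reduced edge list (only the two paths of scenario (b)) are hypothetical.

```python
from collections import deque

def gs_generations(citations, target, k):
    """Group papers by shortest citation-path length (1..k) to `target`.

    `citations` is an iterable of (source, cited) pairs.
    """
    # Reverse adjacency: for each paper, the papers that cite it directly.
    cited_by = {}
    for source, cited in citations:
        cited_by.setdefault(cited, set()).add(source)

    generations = {i: set() for i in range(1, k + 1)}
    seen = {target}  # the target is never counted, even if a cycle leads back to it
    queue = deque([(target, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth == k:
            continue
        for citing in cited_by.get(paper, ()):
            if citing not in seen:  # keep only the generation closest to the target
                seen.add(citing)
                generations[depth + 1].add(citing)
                queue.append((citing, depth + 1))
    return generations

# Reduced version of scenario (b): P9 reaches P1 via P6 (length 2)
# and via P8 and P7 (length 3).
edges = [("P6", "P1"), ("P7", "P1"), ("P9", "P6"), ("P8", "P7"), ("P9", "P8")]
gens = gs_generations(edges, "P1", k=3)
print({i: sorted(papers) for i, papers in gens.items()})
```

In this reduced graph P9 is reported only in the second generation, and the third generation stays empty, which matches the exclusion discussed for scenario (b).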

4.2.2 Definition

In the previous section, we evaluated the available definitions for the generations of citations and we concluded that the Gs definition covers the scenarios examined, in the sense that it is the one least sensitive to the existence of cycles, chords and other paths of the same or different lengths. In addition, this definition agrees with our original notion that a citation represents something more than the existence of a citation path between a source and target paper. A citation should represent the connection between the two papers, which should be considered stronger the smaller the length of the citation path [Sidiropoulos and Manolopoulos, 2005], and it should be irrelevant whether this connection is the outcome of a single or multiple citation paths.

Based on these conclusions, we propose a new Paper-Indicator that assesses a paper based not only on its direct but also the indirect citations received up to a specific generation, when the generations are defined according to the Gs definition. This indicator can be described as a cross-generational index [Hu, Rousseau, and Chen, 2011], since it builds on the concept of multiple generations of citations for each of which we calculate a specific value. We name this indicator fpk − index and the formula for calculating it is

$$fp^{k} = \frac{1 + \sum_{i=1}^{k} \frac{1}{i} \cdot gen_i}{n_p} \qquad (4.3)$$

From the definition of fpk − index, we can see that the indicator considers the first k generations of citations defined by the Gs definition. For each generation of citations we calculate a single value which represents the total number of papers included in that generation for the paper under scrutiny (geni). These citations are weighted depending on the generation they belong to, with citations of lower rank being more important and indicating that the target paper had a higher impact on the source paper. The indicator assigns a value of 1 to each published paper and it uses the scientific age of the paper (np) to produce scores that can be used to compare papers of different scientific age. Once published, a paper is considered to have a scientific age of 1.
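The calculation described above can be written as a short function. This is a sketch under the reading that the indicator equals the value 1 plus the generation counts weighted by 1/i, divided by the scientific age; the function name `fpk_index` and the sample counts are hypothetical.

```python
def fpk_index(gen_counts, age):
    """fpk-index from generation counts [gen_1, ..., gen_k] and scientific age n_p."""
    if age < 1:
        raise ValueError("a published paper has a scientific age of at least 1")
    # Generation i contributes gen_i / i, so closer generations weigh more.
    weighted = sum(count / i for i, count in enumerate(gen_counts, start=1))
    return (1 + weighted) / age

# Hypothetical paper: 2 first-, 4 second- and 3 third-generation citations, age 5.
print(round(fpk_index([2, 4, 3], age=5), 2))  # (1 + 2 + 2 + 1) / 5 = 1.2
```

The length of `gen_counts` plays the role of k, so the same function covers any number of generations.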

The exact number of generations used is a subject that on its own requires further investigation, since the optimal number of generations to use can depend on a number of different factors depending on the starting point of our research. The results can vary significantly depending on whether we assume that we are using a unified Paper-Citation graph with papers from many different scientific areas or if we are considering smaller, more specialized Paper-Citation graphs that depict papers from a specific scientific field.

Let us consider the latter case where a Paper-Citation graph contains more uniform papers that relate to a particular scientific area. In this case a number of factors exist that could alter the number of generations that one chooses to include in their calculations. The following list provides just an overview of some of them, and we should point out that the list cannot be considered either complete or exhaustive.

• Average number of co-authors:
  The average number of co-authors could affect the number of citations received by a particular publication. In some cases the more co-authors a paper has the greater the possibility that one of the co-authors will continue to produce research based on the findings presented in the publication.

• Average number of citations received or references provided:
  A large average number of citations could indicate a citation pattern where authors reference not only new papers but also papers published several years ago, thus possibly producing a large number of chords in the citation graph.

• Average elapsed time from the date of publication until a paper receives its first citation:
  If the observed times are high, several years may pass before published papers receive citations, in which case time is the limiting factor in our calculations. We should note though, that the average time until a paper receives its first citation does not necessarily mean that all citations would require that much time to form.

• Average age of citations:
  The average age of the citations received could also affect the number of generations considered, since a large average citation age could mean that it could be several years before long citation paths are generated within the graph. In this case, time would again be the limiting factor in the calculations, since picking a relatively large number of generations to examine does not necessarily mean that the generations would be able to contribute to the calculations, because they might not have been populated yet.

• Number of publications per year:
  A small number of papers published in a particular scientific field could mean that the density of the citation graph examined is high, with a relatively small number of participating papers and many citations among them. On the other hand, a large number of papers published each year could mean that the length of the citation paths is small, therefore not providing many generations to base our calculations on.

In most of the applications of the fpk − index in the present study we have chosen k = 3, thus considering the first three generations of citations of the Gs definition. This number has been chosen based on the author's sentiment that three generations (similar to friends of friends of friends in social networks) are enough to illustrate the usability and validity of the indicator under different scenarios.

4.2.3 Example

Let us consider the Paper-Citation graph of Figure 4.3 where NP = 6 and P = {P1,P2,P3,P4,P5,P6}. From the meta-data information about each paper we are only displaying the year of publication, since this is the only information required in order to calculate the fpk − index value for each publication.

There are a total of NC = 6 direct citations on the graph and in Table 4.7 we can see the Paper-Citation table with the citation paths up to Length 3 for all the papers included in the graph.

There are a total of 6 Length 1, 6 Length 2 and 3 Length 3 citation paths in the graph. In Table 4.8 we present the MSO table for the papers included in the graph for the first three generations (k = 3), for all participating papers, when the generations are defined using the Gs definition. For each paper we include the Year of Publication, as it is required for the calculation of the fpk − index, along with the calculated fp3 − index value.

Figure 4.3: Paper-Citation graph that includes the year of publication for each paper

Citation path   Source paper   Via      Target paper
Length 1        P2                      P1
                P3                      P2
                P4                      P2
                P6                      P2
                P4                      P3
                P5                      P4
Length 2        P3             P2       P1
                P4             P3       P2
                P4             P2       P1
                P5             P4       P3
                P5             P4       P2
                P6             P2       P1
Length 3        P4             P3, P2   P1
                P5             P4, P3   P2
                P5             P4, P2   P1

Table 4.7: Paper-Citation table for the graph of Figure 4.3

One of the first things to note is the total number of citations included per generation. Table 4.7 presented the citation paths of different lengths that exist in our Paper-Citation graph and we saw that there were six Length 1 paths, six Length 2 paths and three Length 3 paths. In the MSO table (Table 4.8) we can see that there are six 1-gen citations, five 2-gen citations and only one 3-gen citation. The Length 1 citation paths are all counted as 1-gen citations, but the P4 → P3 → P2 citation path is not counted as a 2-gen citation since paper P4 is also included in the first generation of citations for P2. Finally, the only Length 3 citation path that is counted as a 3-gen citation is P5 → P4 → P2 → P1: the P4 → P3 → P2 → P1 path is excluded because paper P4 already provides a 2-gen citation to paper P1, and the P5 → P4 → P3 → P2 path is excluded because paper P5 already provides a 2-gen citation to paper P2.

Papers          Gs
Name    Year    1-gen   2-gen   3-gen   fp3 − index
P1      2010    1       3       1       (1 + (1/1 + 3/2 + 1/3)) / 7 = 0.55
P2      2012    3       1       0       (1 + (3/1 + 1/2)) / 5 = 0.90
P3      2013    1       1       0       (1 + (1/1 + 1/2)) / 4 = 0.63
P4      2014    1       0       0       (1 + (1/1)) / 3 = 0.67
P5      2015    0       0       0       1 / 2 = 0.50
P6      2015    0       0       0       1 / 2 = 0.50
Total           6       5       1

Table 4.8: MSO table and fp3 − index values for the graph of Figure 4.3

If we now look at the calculated fp3 − index values of the papers included in the graph, we will see that the paper that ranks highest is P2, followed by P4, P3 and P1, whereas the papers with the lowest values and rank are P5 and P6. These papers were published within the same year and have received zero citations each. As expected the indicator produces the exact same value for these papers. It is interesting to compare the results for papers P3 and P4, since P3 is given a lower score than P4 even though it has received the same number of 1-gen citations (one) and also has one 2-gen citation. The difference in the score of the two papers comes from their publication years, with P3 being published a year earlier than P4.
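The values of Table 4.8 can be re-derived from the generation counts and the publication years. The sketch below is illustrative only: it assumes that the scientific age is the current year minus the publication year plus one, and that the current year is 2016 (inferred from the denominators shown in the table, not stated in the text). The names `TABLE_4_8` and `fp3_index` are hypothetical.

```python
CURRENT_YEAR = 2016  # assumption: inferred from the denominators in Table 4.8

# (year of publication, [1-gen, 2-gen, 3-gen]) as listed in Table 4.8.
TABLE_4_8 = {
    "P1": (2010, [1, 3, 1]),
    "P2": (2012, [3, 1, 0]),
    "P3": (2013, [1, 1, 0]),
    "P4": (2014, [1, 0, 0]),
    "P5": (2015, [0, 0, 0]),
    "P6": (2015, [0, 0, 0]),
}

def fp3_index(year, gen_counts):
    age = CURRENT_YEAR - year + 1  # a paper published in the current year has age 1
    weighted = sum(count / i for i, count in enumerate(gen_counts, start=1))
    return (1 + weighted) / age

scores = {paper: fp3_index(year, gens) for paper, (year, gens) in TABLE_4_8.items()}

# List the papers from the highest to the lowest score (three decimals shown;
# the table reports the same values rounded to two).
for paper, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{paper}: {score:.3f}")
```

Printing three decimals avoids ambiguity for P3, whose exact value of 0.625 sits on a rounding boundary.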

As we already mentioned the top paper in the current graph is P2, since it has received the highest number of 1-gen citations and was published just two years after paper P1 which has just one 1-gen citation. As we said citations closest to the target paper do weigh more in the calculation and we expect younger papers with the same number of citations as older papers to be ranked higher since they have higher probability of receiving a larger number of citations when they reach the same scientific age.

4.3 Comparison

In this section we are going to provide a comparison and critical evaluation of the differences between the f − value and the fpk − index indicators. In addition we are going to compare the rankings produced by these indicators with some other well known paper indicators that are commonly used in citation analysis.

4.3.1 f − value and fpk − index

The proposed paper indicators share some common characteristics. Both indicators try to take into consideration more information from within the Paper-Citation graph and accommodate both direct and indirect citations. The main concept is that a paper’s scientific importance should be assessed by considering the longer term impact it has had in its respective scientific field. This is visible in the formulas of both indicators, since the f − value is calculated across the entire graph whereas the fpk − index is calculated by taking into account the first k generations of citations for each paper.

In addition both indicators agree that although the impact of a paper should not be limited to its immediate neighbors, it should be considered to have a stronger influence on them than on papers further away in the Paper-Citation graph. In the case of f − value this is expressed via the Reducing Factor (RF) that is used to reduce the contribution of distant papers. In the case of fpk − index this is expressed through the limit one can apply to the number of generations considered, and by the fact that the indicator considers the weighted count of the citations per generation. We should also note that f − value uses the same Reducing Factor for all papers and generations, whereas the fpk − index uses a reducing ratio, so that higher ranked generation counts are reduced more than lower ranked ones.

Apart from their common characteristics, the two indicators also differ in some aspects of their definitions. f − value requires an initial pre-processing step for a newly assessed graph, in order to calculate the value of the Reducing Factor. On one hand this means that the indicator can easily adapt to the different citation patterns present in each graph, but it also makes this a prerequisite to the actual calculations.

Another difference is that in order to calculate the f − values of a subset of papers included in the graph, one must calculate the values for all papers present in the graph, since the indicator is calculated across a graph and its calculations cannot easily be localized. The fpk − index also requires more extensive knowledge of the graph, in contrast to other indicators, but the calculations can be localized to the papers included in the first k generations of any particular paper.

One more aspect to consider is the sensitivity of the indicators to irregular citation patterns that one might encounter in any citation graph, i.e. the existence of different levels of cycles and of multiple citation paths of the same or different length between a particular source and target paper. The f − value indicator appears to be more sensitive to that, since if cycles are not detected and removed from the graph the indicator will fall into an infinite loop and will be unable to successfully conclude its calculations. fpk − index on the other hand will not be affected by such cases, but it requires the citation paths to be examined in detail in order to define the generations based on the chosen methodology.
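The text does not describe how cycles are detected before the f − value is calculated. As an illustration only, the sketch below uses a generic technique (Kahn's topological-sort algorithm) to test whether a citation graph contains a cycle; the function name `has_cycle` and the sample edges are hypothetical.

```python
from collections import deque

def has_cycle(citations):
    """Return True if the directed citation graph contains at least one cycle."""
    out_edges, in_degree = {}, {}
    for source, cited in citations:
        out_edges.setdefault(source, []).append(cited)
        in_degree[cited] = in_degree.get(cited, 0) + 1
        in_degree.setdefault(source, 0)

    # Kahn's algorithm: repeatedly remove papers that no remaining paper cites.
    queue = deque(paper for paper, degree in in_degree.items() if degree == 0)
    removed = 0
    while queue:
        paper = queue.popleft()
        removed += 1
        for cited in out_edges.get(paper, []):
            in_degree[cited] -= 1
            if in_degree[cited] == 0:
                queue.append(cited)

    # Papers that could not be removed lie on a cycle or are reachable from one.
    return removed < len(in_degree)

print(has_cycle([("P2", "P1"), ("P3", "P2"), ("P3", "P1")]))  # chord, no cycle: False
print(has_cycle([("P1", "P2"), ("P2", "P1")]))                # mutual citation: True
```

A chord (a second, shorter path to the same target) is not a cycle, so only the second graph would need to be repaired before an iterative calculation is run on it.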

Finally, fpk − index considers one factor that the f − value does not, namely the year a paper was published in. This allows the indicator to better distinguish between papers with similar citation history but different publication year. The indicator will favor more recent papers since they have managed to have a similar impact to their older counterparts. This could also be an indication that their impact will outgrow that of their older counterparts in the future, since the expectation would be that they would accumulate even more citations.

4.3.2 Other well known indicators

In this section we have chosen to compare the proposed indicators with two other well known paper indicators, namely the Number of citations (NC) and the PageRank [Page, Brin, Motwani, and Winograd, 1999, Ma, Guan, and Zhao, 2008]. The Number of Citations (NC) has been selected as the most commonly used direct indicator which measures the impact of a paper by counting the number of direct citations received. This indicator produces values that are identical to the first generation citation counts (1-gen) we have discussed so far.

On the other hand, PageRank has been selected as a representative example of an indirect indicator that considers the information available in the complete Paper-Citation graph. As previously mentioned, PageRank was inspired by citation analysis and was originally used to rank pages on the web. PageRank has currently found its way back to citation analysis with multiple applications, modifications and adaptations that aim at providing a more accurate representation of scientific impact whether it is for a paper, author or journal.

In the original implementation of PageRank the damping factor d was set to 0.85, but for the calculations included in this section we use d = 0.5 as defined in Ma, Guan, and Zhao [2008]. We refer to this version of PageRank as Base. A normalized version of PageRank also exists, where the first component is divided by the total number of web pages, or papers if applied on the Paper-Citation graph.

$$PR(A) = \frac{1 - d}{N} + d \cdot \sum_{i} \frac{PR(i)}{N(i)} \qquad (4.4)$$

By implementing PageRank as shown in Equation 4.4, the sum of the PageRank values of all nodes included in a particular graph should be 1.0. As discussed in the literature though, this is not the case in graphs that include nodes that do not provide any reference to any of the nodes included in the graph. These nodes are named dangling nodes [Yan and Ding, 2011] and their behavior would cause the sum of the PageRank values to decline after a number of iterations. In the second version of PageRank, we accommodate these dangling nodes by equally re-distributing their value to all the nodes in the graph and we refer to this version of PageRank as Normalized.
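The effect of re-distributing the value of dangling nodes can be illustrated with a small iterative sketch. It is not the implementation used for the results in this chapter: it assumes a uniform starting value of 1/N, a maximum-difference convergence test, and that a dangling paper spreads its whole value equally over all N papers. The function name `pagerank` and the parameters `tol` and `max_iter` are hypothetical; the edges are the Length 1 paths of Table 4.7.

```python
def pagerank(citations, d=0.5, tol=1e-6, max_iter=1000):
    """Normalized PageRank; dangling papers spread their score over all papers."""
    papers = sorted({paper for edge in citations for paper in edge})
    n = len(papers)
    references = {paper: [] for paper in papers}
    for source, cited in citations:
        references[source].append(cited)

    rank = {paper: 1.0 / n for paper in papers}
    for _ in range(max_iter):
        # Total score held by papers that cite nothing inside the graph.
        dangling = sum(rank[p] for p in papers if not references[p])
        new_rank = {p: (1 - d) / n + d * dangling / n for p in papers}
        for source, cited_list in references.items():
            for cited in cited_list:
                # Each citing paper splits its score evenly over its references.
                new_rank[cited] += d * rank[source] / len(cited_list)
        converged = max(abs(new_rank[p] - rank[p]) for p in papers) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Direct citations (source, cited) from the Length 1 rows of Table 4.7.
edges = [("P2", "P1"), ("P3", "P2"), ("P4", "P2"),
         ("P6", "P2"), ("P4", "P3"), ("P5", "P4")]
scores = pagerank(edges)
print(round(sum(scores.values()), 6))
```

With the re-distribution term in place the scores keep summing to 1 at every iteration; dropping the `dangling` term lets the total shrink, which is the decline described above.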

First example

In this section, we examine two applications of the fpk − index. The first one is to the Paper-Citation graph of Figure 4.2. In this graph, we consider all papers to be of equal scientific age (age 1). The purpose of this example is to demonstrate the usage of f − value and of fpk − index, k = 3 and the way these indicators behave when applied in a graph that includes multiple citation paths of the same or different lengths between the same source and target papers.

Table 4.9 (a) presents the generation counts up to depth 3 for all the papers included in the graph when calculated using the Gs definition for the generations. In addition it displays the raw calculated values for the four indicators rounded to 2 decimal digits. The indicators listed in the table are the Number of Citations (NC), PageRank in both its Base and Normalized versions, the f − value and finally the fp3 − index.

As already mentioned, a damping factor d = 0.50 has been used for the PageRank calculations. Base PageRank required 6 iterations to converge and the Normalized PageRank required 18 (with a convergence criterion set to 0.00001). Finally f − value required 5 iterations to complete.

We should note at this point that the indicators produce different types of values, with the Number of Citations (NC) producing strictly integer values and all other indicators producing floats. Also, the scale and range of the values produced by each indicator differs, since for example the Normalized version of PageRank produces values lower than 1, whereas for the current graph the largest f − value produced is 35.20. Interestingly enough the Reducing Factor is 1.625 for this graph. This is due to the fact that the number of 2-gen citations is lower than the number of 1-gen citations present in the graph. Table 4.9 (b) presents the different categories generated by the calculated values of each indicator and the papers that belong to each category.

        Gs                          PageRank
Paper   gen1   gen2   gen3   NC    Base   Normalized   f − value   fp3 − index
P1      5      4      1      5     1.30   0.35         35.20       8.33
P2      2      1      0      2     0.51   0.14         9.53        3.50
P3      1      0      0      1     0.21   0.06         2.63        2.00
P4      1      0      0      1     0.21   0.06         2.63        2.00
P5      0      0      0      0     0.15   0.04         1.00        1.00
P6      1      0      0      1     0.21   0.06         2.63        2.00
P7      1      1      0      1     0.33   0.09         5.27        2.50
P8      1      0      0      1     0.21   0.06         2.63        2.00
P9      0      0      0      0     0.15   0.04         1.00        1.00
P10     1      0      0      1     0.21   0.06         2.63        2.00
P11     0      0      0      0     0.15   0.04         1.00        1.00
(a)

NC
Score   Papers
5       P1
2       P2
1       P3, P4, P6, P7, P8, P10
0       P5, P9, P11

PageRank (B)   PageRank (N)   f − value   fp3 − index   Papers
1.30           0.35           35.20       8.33          P1
0.51           0.14           9.53        3.50          P2
0.33           0.09           5.27        2.50          P7
0.21           0.06           2.63        2.00          P3, P4, P6, P8, P10
0.15           0.04           1.00        1.00          P5, P9, P11
(b)

Table 4.9: (a) On the left, we list the citation generation counts of the papers included in the Paper-Citation graph of Figure 4.2, and on the right we list the values of the four indicators (Number of citations (NC), PageRank (Base and Normalized), f − value and fp3 − index), (b) the categories defined by each indicator based on the calculated raw values are presented along with the papers that fit each category.

As we can see, all four indicators agree that paper P1 is the most important paper included in the graph and it is ranked first in all the defined categories. P1 has received the highest numbers of 1-gen and 2-gen citations and it is also the only paper included in the graph that has received a 3-gen citation. Similarly, there is agreement among the indicators about the paper with the second biggest influence in the current Paper-Citation graph, which is paper P2. Paper P2 has received two 1-gen citations and a single 2-gen citation.

At this point the direct indicator examined (NC) places the next six papers in the same category, whereas all the indirect indicators (PageRank, f − value, fp3 − index) distinguish paper P7 and place it third. The reason for this difference is that papers P3, P4, P6, P7, P8 and P10 have all received a single 1-gen citation, which means that if we only examine their direct impact they will all appear the same. The reason the indirect indicators place P7 higher is that it has also received a 2-gen citation, thus it is placed in a category of its own with the rest of the papers mentioned earlier placed one position further down.

Finally the papers that have received no direct and therefore no indirect citation are all placed together at the last category generated by all indicators examined. It is also interesting to note that for the selected graph all indirect indicators generate the same categories of papers although their calculated raw values differ significantly.
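The way categories arise from raw scores can be sketched by grouping papers that share the same rounded value. The example below is illustrative: it uses the generation counts of Table 4.9 (a), assumes every paper has scientific age 1 as stated for this example, and reads the fp3 − index as 1 plus the generation counts weighted by 1/i. The helper name `categories` is hypothetical.

```python
from collections import defaultdict

# [1-gen, 2-gen, 3-gen] counts from Table 4.9 (a); every paper has scientific age 1.
GEN_COUNTS = {
    "P1": [5, 4, 1], "P2": [2, 1, 0], "P3": [1, 0, 0], "P4": [1, 0, 0],
    "P5": [0, 0, 0], "P6": [1, 0, 0], "P7": [1, 1, 0], "P8": [1, 0, 0],
    "P9": [0, 0, 0], "P10": [1, 0, 0], "P11": [0, 0, 0],
}

def categories(scores, digits=2):
    """Group papers that share the same rounded score, best score first."""
    groups = defaultdict(list)
    for paper, score in scores.items():
        groups[round(score, digits)].append(paper)
    return sorted(groups.items(), reverse=True)

# Number of Citations: only the first generation counts.
nc = {paper: float(gens[0]) for paper, gens in GEN_COUNTS.items()}
# fp3-index with age 1: the divisor is 1, so only the weighted sum remains.
fp3 = {paper: 1 + sum(count / i for i, count in enumerate(gens, start=1))
       for paper, gens in GEN_COUNTS.items()}

for score, papers in categories(fp3):
    print(score, papers)
print(len(categories(nc)), "NC categories,", len(categories(fp3)), "fp3-index categories")
```

The direct indicator yields four categories and the fp3 − index five, with P7 separated from the other papers that have a single 1-gen citation.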

Second example

The second application is to the Paper-Citation graph of Figure 4.4, which contains 22 papers. The graph is constructed using paper P1 as the target paper. All citation paths of length lower than or equal to four have been included. For simplicity, we consider all papers within the same citation path length area to have the same scientific age. The oldest papers are P1, P2, P3 and P4 with scientific age 4.

Figure 4.4: Example of a Paper-Citation graph. All citation paths of length lower than or equal to four are included in the graph. For simplicity we consider all papers within the same citation path length to have the same scientific age.

Table 4.10 presents the gen1, gen2 and gen3 citation counts for the 22 papers of the graph along with the scientific age of each paper and the calculated values for the four indicators under examination. For PageRank, we are displaying the scores for both the Base and Normalized version. The Base version required 8 iterations to converge whereas the Normalized one required 29 (the convergence criterion has again been set to 0.000001).

The f − value required 7 iterations and the calculated Reducing Factor for the graph of Figure 4.4 is 0.79. The papers are listed in increasing order of their name and no other sorting has been applied. The PageRank, f − value and fp3 − index values have been rounded to two decimal places whereas the Number of Citations values are always integers.

                  Gs                              PageRank
Paper       age   1-gen   2-gen   3-gen   NC    Base   Normalized   f − value   fp3 − index
P1          4     3       3       6       3     2.06   0.20         14.63       1.88
P2          4     1       3       9       1     1.36   0.13         11.92       1.63
P3          4     1       0       0       1     0.28   0.03         1.79        0.50
P4          4     1       3       0       1     0.61   0.06         3.64        0.88
P5          3     3       9       0       3     1.43   0.14         13.90       2.83
P6          3     0       0       0       0     0.15   0.02         1.00        0.33
P7          3     3       0       0       3     0.53   0.05         3.36        1.33
P8          2     1       9       0       1     0.70   0.07         7.34        3.25
P9          2     9       0       0       9     1.30   0.13         8.07        5.00
P10 − P13   2     0       0       0       0     0.15   0.02         1.00        0.50
P14 − P22   1     0       0       0       0     0.15   0.02         1.00        1.00

Table 4.10: On the left the 22 papers of the Paper-Citation graph of Figure 4.4 are listed along with their scientific age and citation generation counts. On the right the calculated values based on the Number of Citations (NC), PageRank (Base and Normalized), f − value and fp3 − index indicators are presented.

There are nine papers (P14 to P22) that have an fp3 − index of 1.000 since they have not received any direct or indirect citations and their scientific age is 1. We can compare the fp3 − index values of these papers to the fp3 − index values of papers P10, P11, P12 and P13, which also have not received any direct or indirect citations but whose scientific age is 2, and thus their fp3 − index value is 0.500. We consider this to be a valid result since if a paper has not received any direct or indirect citations its value should decline as it gets older, since (with the exception of sleeping beauties) it becomes more and more unlikely that it will receive many citations in the future. The same logic applies to paper P6 as well, whose value is 0.333, since it has not received any direct citations and its scientific age is 3.
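For a paper without citations in any generation every generation count is zero, so the value depends only on the scientific age. A short worked check, reading the indicator as 1 plus the weighted generation counts divided by the age np:

```latex
\[
fp^{3} = \frac{1 + 0}{n_p}: \qquad
\frac{1}{1} = 1.000 \;\; (P14 \text{ to } P22), \qquad
\frac{1}{2} = 0.500 \;\; (P10 \text{ to } P13), \qquad
\frac{1}{3} \approx 0.333 \;\; (P6)
\]
```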

The remaining three indicators have calculated the same score for these 14 papers (P6, P10 − P13 and P14 − P22). More specifically the Number of Citations (NC) scores them with 0, since they have not yet received any direct or indirect citations, the Base PageRank scores them with 0.15 and the Normalized PageRank scores them with 0.02, and, finally, the f − value scores them with 1.00 since they have all received some credit by the fact that they have actually been published. As we can see, the age of a publication is not taken into consideration by any of these indicators.

In order to make the comparison of the remaining papers easier, the scores and the corresponding papers per indicator are presented in Table 4.11. For each indicator we present the number of different categories that are created based on the calculated values for the papers included in the graph. Then the papers that fit each category have been placed in the right hand side.

As we can see, the Number of Citations (NC) generates the smallest number of distinct categories (4) for the papers included in the graph, whereas with the exception of the Normalized PageRank all other indicators generate 9 distinct categories of papers. The Normalized PageRank generates eight, and this is affected by the number of decimal places included in the result tables, since if we had used 3 decimal places instead of 2 then this indicator would have also generated 9 categories of papers.

Going back to the papers in the graph, another interesting comparison is between papers P2, P3 and P4 of scientific age 4. P3 has only received a single 1-gen citation, P4 has received a single 1-gen citation along with 3 2-gen citations, and P2 has received a single 1-gen citation along with 3 2-gen citations and 9 3-gen citations. As we can see here, the fact that there is a citation from P9 to paper P8 does not change the number of 2-gen citations received by paper P2 because, based on the Gs definition used for the citation counts, paper P5 has already been included in the second generation for this paper, thus it is excluded from also providing a 3-gen citation for the paper.

So, with regards to the calculated values for the papers, we need to note that all three papers have the same scientific age, so the factor that will determine the acquired score is the number of 1-gen, 2-gen and 3-gen citations. In addition, the 1-gen citation count is the same for all three papers. Therefore, the one that should gather the lowest score is the one that has no 2-gen and 3-gen citations, which is paper P3. From the remaining papers the one that should follow is the one that has 2-gen citations but no 3-gen citations, which is paper P4. And, finally, the paper that should gather the greatest score is P2 since it has more 3-gen citations than P4.
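The expected ordering can be checked with a short worked calculation using the counts of Table 4.10 (scientific age 4 for all three papers), again reading the indicator as 1 plus the generation counts weighted by 1/i, divided by the age:

```latex
\begin{align*}
fp^{3}(P3) &= \frac{1 + \frac{1}{1}}{4} = 0.50 \\
fp^{3}(P4) &= \frac{1 + \frac{1}{1} + \frac{3}{2}}{4} = 0.875 \approx 0.88 \\
fp^{3}(P2) &= \frac{1 + \frac{1}{1} + \frac{3}{2} + \frac{9}{3}}{4} = 1.625 \approx 1.63
\end{align*}
```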

As we can see from the table, this relative positioning of the papers is honored by all indirect indicators, although the ranking generated by each of them places the papers in different positions. We can see this clearly in the position received by paper P3, which is placed in the second to last category for Base PageRank, Normalized PageRank and the f − value, but in the bottom category for fp3 − index, along with papers P10 − P13. This is due to the fact that even though paper P3 has indeed received a 1-gen citation, its scientific age is 4, which means that its score is 0.50, the same score that a published paper with no citations and a scientific age of 2 receives.

So, some important aspects to consider when examining and comparing scientific indicators are the number of distinct paper categories they generate, the relative positioning of papers with somewhat similar properties and also the absolute ranking position of the papers, as such ranking might reveal aspects not already considered during the analysis of a particular Paper-Citation graph.

(a) Number of Citations
Score   Papers
9       P9
3       P1 P5 P7
1       P2 P3 P4 P8
0       P6 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20 P21 P22

(b) Base PageRank
Score   Papers
2.06    P1
1.43    P5
1.36    P2
1.30    P9
0.70    P8
0.61    P4
0.53    P7
0.28    P3
0.15    P6 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20 P21 P22

(c) Normalized PageRank
Score   Papers
0.20    P1
0.14    P5
0.13    P2 P9
0.07    P8
0.06    P4
0.05    P7
0.03    P3
0.02    P6 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20 P21 P22

(d) f − value
Score   Papers
14.63   P1
13.90   P5
11.92   P2
8.07    P9
7.34    P8
3.64    P4
3.36    P7
1.79    P3
1.00    P6 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20 P21 P22

(e) fp3 − index
Score   Papers
5.00    P9
3.25    P8
2.83    P5
1.88    P1
1.63    P2
1.33    P7
1.00    P14 P15 P16 P17 P18 P19 P20 P21 P22
0.88    P4
0.50    P3 P10 P11 P12 P13

Table 4.11: Scores and papers distribution for the four indicators included (Number of citations (NC), Base and Normalized PageRank, f − value and fp3 − index).

Chapter 5

Proposed author indicators

In this chapter we present three author based indicators that are based on the paper indicators described in the previous chapter. We are going to present how each indicator is defined, along with the criteria that it considers and an example of the way each indicator can be calculated. Finally we are going to compare these indicators among themselves and with some other well known indirect author indicators found in the literature.

The first indicator is called fa − value and is based on the f − value indicator. It is an indirect indicator based on the concept of cascading citations and considers the whole Publication Record of a researcher along with several other criteria in order to define a single representative value of the scientific impact of the researcher.

The remaining two indicators are based on the fpk −index values of the papers participating in the Publication Record of a researcher and are named fa − index and fas − index. These indicators are again based on the concept of the direct and indirect impact a researcher’s work has had and their main difference is that the fas − index excludes any self-citations that an author’s papers might have received.

The rest of the chapter is organized as follows: In Sections 5.1, 5.2 and 5.3 we define the three proposed indicators, namely fa−value, fak −index and fask −index, and present an example application for each indicator based on constructed Paper-Citation graphs. Finally, in Section 5.4, we present a comparison of the three proposed indicators.


5.1 fa − value

The fa − value indicator is based on the f − value indicator presented in Chapter 4, Section 4.1. It is an indirect indicator since, as noted above, its definition builds on the f − value, which requires full knowledge of the Paper-Citation graph in order to calculate the values for the papers included in the researcher’s Publication Record.

5.1.1 Definition

The formula to calculate the fa − value of an author A is presented below. As we can see, it is based on the full Publication Record of the author under scrutiny.

\[
fa\text{-}value = \frac{\displaystyle\sum_{i=1}^{n} \frac{f\text{-}value(P_i)}{a(P_i)}}{(CurrentYear - MinPublicationYear) + 1} \tag{5.1}
\]

Here n is the number of papers in the Publication Record of A and a(P_i) is the number of co-authors of paper P_i.

fa − value considers a number of factors. Firstly, it considers the sum of the f − values of all the papers in which the author has participated, each normalized by the number of co-authors of the paper. The reasoning behind this division is to account for the fact that not all papers are single authored, meaning that not all papers are the scientific product of just one person. When many authors have contributed to a publication we could either distribute the f − value of the publication equally among all co-authors or assign the whole value to each individual researcher. We have chosen the first approach, which rewards papers with fewer co-authors.

fa − value also considers the number of years that have passed since the first publication included in the Publication Record of the author. This factor is important in the sense that it assists with the evaluation of authors of different scientific ages. An author who contributed to a number of publications several years ago could still be receiving citations to the present day, so his papers might accumulate more value as they grow older.

5.1.2 Example

Let us consider the graph presented in Figure 5.1. This is the same Paper-Citation graph as the one presented earlier (see Figure 4.1) but for each paper we have included its Year of Publication as well as the list of co-authors.

In general we could say that the papers included in the graph were all published between the years 2010 and 2016 and the set of authors A is defined as A = {A1,A2,A3,A4,A5} with NA = 5. For simplicity we are going to assume that the authors of the papers have only participated in the publications included in the present Paper-Citation graph.

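[Figure 5.1 (graph image): nodes P1 to P7, each labeled with its Year of Publication (2010 to 2016) and its co-authors (A1 to A5); the co-author lists per paper are repeated in Table 5.1.]

Figure 5.1: Paper-Citation graph of Figure 4.1 with Publication Years and Author information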

In order to calculate the fa − value score for each of the authors included in the Paper-Citation graph we first need to identify the Publication Record of the author, identify the number of co-authors of each publication, calculate the f − values for the papers in the Publication Record and finally calculate the scientific age of the author. In Table 5.1 we present this information for each author included in the Paper-Citation graph of Figure 5.1.

As we recall from Section 4.1.4, the Reducing Factor (RF) for the Paper-Citation graph has been calculated and set to 0.8. In addition, we had calculated the f − values for the papers included in the graph, which are also included in Table 5.1. The final column of the table holds the fa − value scores of the authors.

If we were to rank the authors in descending order based on their fa − value we would see that author A5 is the top ranked author, followed by authors A2, A3, A1 and A4.

The two main reasons why author A5 is ranked higher than the rest of the authors in the graph are that he is the sole author of both of his papers and that his scientific age is smaller than that of most of the other authors in the graph. Being the sole author means that he accumulates the full f − value of his papers, and his relatively young scientific age means that the accumulated value is not greatly reduced.

Author  Age                    Paper  Co-authors  f − value  fa − value
A1      1 + (2016 − 2010) = 7  P1     A1,A2       9.78       (9.78/2) / 7 = 0.70
A2      1 + (2016 − 2010) = 7  P1     A1,A2       9.78       (9.78/2 + 5.54/2) / 7 = 1.09
                               P2     A2,A3       5.54
A3      1 + (2016 − 2010) = 7  P2     A2,A3       5.54       (5.54/2 + 3.08/2 + 1.00) / 7 = 0.76
                               P3     A3,A4       3.08
                               P5     A3          1.00
A4      1 + (2016 − 2013) = 4  P3     A3,A4       3.08       (3.08/2 + 1.00) / 4 = 0.64
                               P6     A4          1.00
A5      1 + (2016 − 2013) = 4  P4     A5          2.60       (2.60 + 5.44) / 4 = 2.01
                               P7     A5          5.44

Table 5.1: Author metadata for the Paper-Citation graph of Figure 5.1 along with the fa − value scores for each author
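The calculation behind Table 5.1 can be sketched in a few lines of Python. This is only an illustration of Equation 5.1 under stated assumptions: the function name fa_value and the tuple layout are hypothetical, the f − values are copied from the table rather than recomputed, and the evaluation year is taken to be 2016.

```python
def fa_value(record, current_year):
    """fa-value of one author (Equation 5.1).

    record: list of (f_value, n_coauthors, publication_year) tuples,
            one tuple per paper in the author's Publication Record.
    """
    if not record:
        raise ValueError("empty Publication Record")
    first_year = min(year for _, _, year in record)
    scientific_age = (current_year - first_year) + 1
    # Each paper's f-value is split equally among its co-authors.
    shared = sum(f / n for f, n, _ in record)
    return shared / scientific_age


# Values copied from Table 5.1. For A5 the two papers date from 2013 and
# 2016; which paper carries which year does not change the result, since
# only the earliest year enters the formula.
RECORDS = {
    "A1": [(9.78, 2, 2010)],
    "A2": [(9.78, 2, 2010), (5.54, 2, 2010)],
    "A3": [(5.54, 2, 2010), (3.08, 2, 2013), (1.00, 1, 2015)],
    "A4": [(3.08, 2, 2013), (1.00, 1, 2016)],
    "A5": [(2.60, 1, 2013), (5.44, 1, 2016)],
}

scores = {author: fa_value(rec, 2016) for author, rec in RECORDS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for author in ranking:
    print(author, f"{scores[author]:.3f}")
```

Sorting the resulting scores in descending order gives the same ordering as the one discussed above.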

5.2 fak − index

The fak − index is an indicator based on the fpk − index presented in Section 4.2, that can be used to calculate the current cumulative value of a paper based on the first k generations of citations as defined by the Gs definition.

5.2.1 Definition

fak − index is defined as the sum of the fpk − index values of all papers co-authored by an author divided by the total number of papers (N) in the Publication Record of the author:

\[
fa^{k} = \frac{\sum_{i=1}^{N} fp^{k}\text{-}index(i)}{N} \tag{5.2}
\]

where fpk − index(i) is the fpk − index of the i-th paper of the author. Since the fpk − index of a paper represents the current value of the paper, the fak − index represents the average fpk − index value of the author’s papers at the time when the evaluation occurs.

We might say that this indicator is independent of the scientific age of the author since the value of each paper is normalized based on its age. We believe that only the paper’s age should be used to distinguish between younger and older papers that share the same properties and that younger papers that have attracted a considerable number of citations quickly should be rewarded. In addition the proposed indicator is size-independent since the cumulative value of the fpk − index scores of the papers is divided by the number of papers included in the Publication Record of an author. By doing so, authors with different productivity levels could more easily be compared based on the scientific impact of their papers.

Summarizing, the fak − index is an indirect indicator that takes into account the first k generations of citations, the scientific age of each individual paper as well as the productivity of the author in order to produce the author’s score and it is independent of the scientific age of the author.

5.2.2 Example

In order to better demonstrate the application of fak − index let us re-examine the Paper-Citation graph of Figure 4.3, presented again in Figure 5.2, but this time with the additional information about the list of co-authors of each paper.

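[Figure 5.2 (graph image): nodes labeled with Year of Publication and co-authors: P1 (2010; A1, A2), P2 (2012; A3), P3 (2013; A2), P4 (2014; A4), P5 (2015; A1), P6 (2015; A4).]

Figure 5.2: Paper-Citation graph with Year of Publication and Author information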

Same as before the set P of papers included in the graph is equal to P = {P1,P2,P3,P4,P5,P6} with NP = 6. In addition, we can also define the set A of authors that have co-authored the papers in the graph, as A = {A1,A2,A3,A4} with NA = 4, since there are four distinct co-authors included in the graph.

In order to calculate the fak − index for these authors we are going to set k = 3, which means that we are going to consider the first three generations of citations calculated based on the Gs definition. The next step is to calculate the fp3 − index values for all the papers in the graph, which we have already done as part of the example presented in Section 4.2.3. To reiterate, the fp3 − index results for all the papers included in the graph are shown in Table 5.2.

Papers        Gs
Name  Year    1-gen  2-gen  3-gen    fp3 − index
P1    2010    1      3      1        (1 + (1/1 + 3/2 + 1/3)) / 7 = 0.55
P2    2012    3      1      0        (1 + (3/1 + 1/2)) / 5 = 0.90
P3    2013    1      1      0        (1 + (1/1 + 1/2)) / 4 = 0.63
P4    2014    1      0      0        (1 + (1/1)) / 3 = 0.67
P5    2015    0      0      0        1 / 2 = 0.50
P6    2015    0      0      0        1 / 2 = 0.50
Total         6      5      1

Table 5.2: MSO table and fp3 − index values for the graph of Figure 5.2
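The expressions in the last column of Table 5.2 all follow one pattern, which the short Python sketch below reproduces. The formula is read off the worked values in the table (the formal definition is the one given in Section 4.2), and the function name fp_index, its parameters and the default evaluation year 2016 are assumptions made for the illustration.

```python
def fp_index(gen_counts, year, current_year=2016):
    """fp^k-index of one paper, following the pattern of Table 5.2.

    gen_counts: citation counts per generation, [1-gen, 2-gen, ..., k-gen].
    year:       Year of Publication of the paper.
    """
    age = (current_year - year) + 1
    # Citations of generation g are weighted by 1/g.
    weighted = sum(count / g for g, count in enumerate(gen_counts, start=1))
    return (1 + weighted) / age


# (Year of Publication, Gs generation counts) copied from Table 5.2.
PAPERS = {
    "P1": (2010, [1, 3, 1]),
    "P2": (2012, [3, 1, 0]),
    "P3": (2013, [1, 1, 0]),
    "P4": (2014, [1, 0, 0]),
    "P5": (2015, [0, 0, 0]),
    "P6": (2015, [0, 0, 0]),
}

fp3 = {name: fp_index(gens, year) for name, (year, gens) in PAPERS.items()}
for name, value in fp3.items():
    print(name, f"{value:.3f}")
```

A paper without any citations still receives 1 divided by its age, which is why P5 and P6 score 0.50.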

From the information present in this table and the information that we can extract from the Paper-Citation graph about the papers that each researcher has co-authored we can calculate the values for the fa3 − index. For the authors included in our graph the fa3 − index values are presented in Table 5.3.

Author  Paper  fp3 − index  fa3 − index
A1      P1     0.55         (fp3(1) + fp3(5)) / 2 = (0.55 + 0.50) / 2 = 0.53
        P5     0.50
A2      P1     0.55         (fp3(1) + fp3(3)) / 2 = (0.55 + 0.63) / 2 = 0.59
        P3     0.63
A3      P2     0.90         0.90
A4      P4     0.67         (fp3(4) + fp3(6)) / 2 = (0.67 + 0.50) / 2 = 0.56
        P6     0.50

Table 5.3: fa3 − index values for the co-authors of the papers in the Paper-Citation graph of Figure 5.2.
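As a small illustration of Equation 5.2, the averages of Table 5.3 can be recomputed from the rounded fp3 − index values of Table 5.2. The dictionary and function names below are hypothetical. With the rounded table entries as input, A4 evaluates to about 0.585 rather than the printed 0.56; the ranking is the same either way.

```python
from statistics import mean

# fp3-index values as printed in Table 5.2 (rounded to two decimals).
FP3 = {"P1": 0.55, "P2": 0.90, "P3": 0.63, "P4": 0.67, "P5": 0.50, "P6": 0.50}

# Publication Records read from Figure 5.2 / Table 5.3.
PUBLICATION_RECORD = {
    "A1": ["P1", "P5"],
    "A2": ["P1", "P3"],
    "A3": ["P2"],
    "A4": ["P4", "P6"],
}


def fa_index(papers, fp_values):
    """Average fp^k-index over the papers of one author (Equation 5.2)."""
    return mean(fp_values[p] for p in papers)


fa3 = {a: fa_index(papers, FP3) for a, papers in PUBLICATION_RECORD.items()}
ranking = sorted(fa3, key=fa3.get, reverse=True)
for author in ranking:
    print(author, f"{fa3[author]:.3f}")
```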

Ranking the authors in descending order based on their calculated fa3 − index values we have author A3 at the top of the list, followed by authors A2, A4 and finally A1. A3 has authored a single paper, the one with the highest fp3 − index value.

It is interesting to compare the next two authors included in the ranking, which are A2 and A4. If we examine the papers that they have co-authored we will notice that A2 has authored two papers with fp3 − index values of 0.55 and 0.63 respectively, whereas author A4 has also co-authored two papers, one with an fp3 − index equal to 0.67 and one with an fp3 − index of 0.50.

So, on average the two papers co-authored by A2 have had a higher impact than the papers co-authored by A4.

So, as already mentioned the fak − index is expressed as the average fpk − index value of the papers included in the Publication Record of any particular author. Therefore, for an author with many papers in their Publication Record that differ considerably in their impact, the values will be averaged out, thus making it more difficult to identify authors with exceptional papers in their Publication Record. At the same time authors with fewer publications could be promoted since each paper would have a higher contribution to the produced average.

In order to accommodate these scenarios, in Fragkiadaki and Evangelidis [2016] we have chosen to follow other similar approaches [Sidiropoulos and Manolopoulos, 2005] and only include the top 25 papers from the Publication Records of the authors. This still leaves the indicator sensitive to authors with fewer than 25 publications but it does mean that for authors with more publications than that, we can acknowledge their papers with the highest impact.

Another approach that one might explore here, instead of taking a fixed number of top papers from an author’s Publication Record, would be to consider the top 5% of papers for each individual author. This approach would both address the inequalities among authors with different numbers of papers in their Publication Record and make the exact number of publications considered specific to each author.
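The two restrictions discussed above can be sketched as follows. Both functions are hypothetical helpers written for illustration: the first keeps a fixed number of top papers (25 by default, as in the text), the second keeps a fraction of the papers. How the fraction is rounded for small Publication Records is not specified in the text, so rounding up and keeping at least one paper is an assumption of this sketch.

```python
import math


def fa_index_top_n(fp_values, top_n=25):
    """Average fp^k-index over the top_n highest-valued papers."""
    if not fp_values:
        raise ValueError("empty Publication Record")
    best = sorted(fp_values, reverse=True)[:top_n]
    return sum(best) / len(best)


def fa_index_top_fraction(fp_values, fraction=0.05):
    """Average fp^k-index over the top fraction of papers.

    Assumption: the number of papers kept is rounded up and is never
    smaller than one.
    """
    if not fp_values:
        raise ValueError("empty Publication Record")
    keep = max(1, math.ceil(fraction * len(fp_values)))
    best = sorted(fp_values, reverse=True)[:keep]
    return sum(best) / len(best)


# Hypothetical author with 30 papers whose fp^k-index values are 0.01 .. 0.30.
values = [i / 100 for i in range(1, 31)]
print(fa_index_top_n(values, top_n=25))
print(fa_index_top_fraction(values, fraction=0.05))  # keeps 2 of 30 papers
```

An author with fewer papers than top_n is simply averaged over the whole Publication Record, which is the sensitivity noted in the text.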

5.3 fask − index

The fask − index is an indicator based on the fpk − index presented in Section 4.2 and the fak − index described in the previous section.

5.3.1 Definition

Expanding on the definition of fak − index, an additional aspect that we could consider for an indicator used to assess authors is the number of self-citations. As discussed in Section 2.6 we could consider the citations in the Paper-Citation graph at the (author, paper) level, which means that we will be able to identify author self-citations.

Thus, a new indicator was proposed named fask − index, which is calculated using the same formula as the fak − index with the only difference being the way the citation generations are produced for the calculations of the fpk − index values for the papers in the Publication Record of the author.

For the fak − index all citations based on the Gs definition are counted, but for the fask − index the citation generations should be constructed in the way described in Section 2.6. This means that, with regard to author Ai, any citations originating from papers co-authored by Ai are not going to be counted, even if they do appear in one of the Gs generations of citations for one of the papers in the Publication Record of the author.

The fask − index is always smaller than or equal to the fak − index of an author. The two indices are equal when the author has received no self-citations in the k generations of citations considered.

5.3.2 Example

Let us consider the graph of Figure 5.2 examined in the previous section. In order to calculate the fas3 − index for the authors included in the graph we will need to generate the MSO table for the (paper, author) pairs.

In Table 5.4 we can see the papers included in each author’s Publication Record, along with their Year of Publication. In addition we can see the first three generations of citations for each (author, paper) pair, excluding any self-citations, and based on those we have calculated the fp3 − index values for each of the pairs. Finally, in the last column we display the fas3 − index for each author.

                      Gs
Author  Paper  Year   1-gen  2-gen  3-gen   fp3 − index                          fas3 − index
A1      P1     2010   1      3      1       (1 + (1/1 + 3/2 + 1/3)) / 7 = 0.55   (0.55 + 0.50) / 2 = 0.53
A1      P5     2015   0      0      0       1 / 2 = 0.50
A2      P1     2010   1      2      1       (1 + (1/1 + 2/2 + 1/3)) / 7 = 0.48   (0.48 + 0.63) / 2 = 0.56
A2      P3     2013   1      1      0       (1 + (1/1 + 1/2)) / 4 = 0.63
A3      P2     2012   3      1      0       (1 + (3/1 + 1/2)) / 5 = 0.90         0.90
A4      P4     2014   1      0      0       (1 + (1/1)) / 3 = 0.67               (0.67 + 0.50) / 2 = 0.59
A4      P6     2015   0      0      0       1 / 2 = 0.50
Total                 6      5      1

Table 5.4: fas3 − index values for the co-authors of the papers in the Paper-Citation graph of Figure 5.2.

If we now rank the authors in descending order based on their fas3 − index value we will see that the author placed on the top of the list is still A3 but now the author with the second highest impact is A4, followed by A2 and then A1. This switch is caused by the fact that one of the 2-gen citations for author A2 was a self-citation, and since these are excluded when calculating the fp3 − index values for the (paper, author) pairs the calculated value is now smaller.

5.4 Comparison

In this section we are going to present a comparison between the three indicators presented in this chapter, namely fa − value, fak − index and fask − index. In order to do that we are going to use the Paper-Citation graph of Figure 5.3.

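[Figure 5.3 (graph image): nodes P1 to P7, each labeled with its publication year, journal (J1 to J3) and co-authors. Years and co-authors as used in the tables of this section: P1 (2010; A1, A2), P2 (2011; A3), P3 (2011; A3, A4), P4 (2011; A2, A4), P5 (2012; A1, A5), P6 (2013; A1, A2), P7 (2014; A2, A3).]

Figure 5.3: Paper-Citation graph with Author and Publication Year information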

The set of papers P is equal to P = {P1,P2,P3,P4,P5,P6,P7} with NP = 7 and the set of authors A is equal to A = {A1,A2,A3,A4,A5} with NA = 5. The graph also includes the publication year of each paper from which we calculate its scientific age with regards to 2016.

In Table 5.5 (a) we present the papers included in our Paper-Citation graph along with the counts of the papers included in each of the first three generations of citations for the papers and the calculated f − values for each of the papers. The Reducing Factor (RF ) for the graph has been set to RF = 1.143. As a reminder, we note that the counts for the citation generations included in the table do not follow the Gs definition but the Hm where all citation paths are included without any restriction applied to the papers that qualify for inclusion in any of the generations. Table 5.5 (b) presents similar information for the fp3 − index values for the papers but this time the MSO tables are constructed using the Gs definition. The scientific age of the papers has been calculated relative to 2016.

Table 5.6 presents similar information but this time we are considering the generations of citations at the (author, paper) level and the calculated fp3 − index values are now pair specific. As already discussed, these calculations are required for the fas3 − index formula where the citations are examined at the (author, paper) level, rather than at paper level alone.

(a) Hm

Paper  age  1-gen  2-gen  3-gen  f − value
P1     7    3      2      1      14.318
P2     6    1      0      0      2.143
P3     6    1      2      0      4.755
P4     6    1      2      0      4.755
P5     5    0      0      0      1.000
P6     4    2      0      0      3.286
P7     3    0      0      0      1.000

(b) Gs

Paper  age  1-gen  2-gen  3-gen  fp3 − index
P1     7    3      2      1      0.762
P2     6    1      0      0      0.333
P3     6    1      2      0      0.500
P4     6    1      2      0      0.500
P5     5    0      0      0      0.200
P6     4    2      0      0      0.750
P7     3    0      0      0      0.333

Table 5.5: (a) MSO and f − values and, (b) MSO and fp3 − index values for the papers of Figure 5.3.

Author  Paper  age  1-gen  2-gen  3-gen  fp3 − index
A1      P1     7    3      0      1      0.619
A1      P5     5    0      0      0      0.200
A1      P6     4    1      0      0      0.500
A2      P1     7    2      1      0      0.500
A2      P4     6    0      1      0      0.250
A2      P6     4    1      0      0      0.500
A2      P7     3    0      0      0      0.333
A3      P2     6    1      0      0      0.333
A3      P3     6    1      1      0      0.417
A3      P7     3    0      0      0      0.333
A4      P3     6    1      2      0      0.500
A4      P4     6    1      2      0      0.500
A5      P5     5    0      0      0      0.200

Table 5.6: MSO and fp3 − index values for the (author, paper) pairs of Figure 5.3.
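The step from the paper-level counts of Table 5.5 (b) to the pair-level counts of Table 5.6 can be illustrated with a small breadth-first sketch. Two things in it are assumptions rather than statements from the text: the citation edge list, which is inferred so that it reproduces both tables (only the citations of P3, P4 and P6 are spelled out in the prose), and the reading of the Gs definition as placing each citing paper in the earliest generation in which it is reached. All names are hypothetical.

```python
# Citation edges (citing paper -> cited papers). ASSUMPTION: inferred from
# Tables 5.5 (b) and 5.6; the graph image itself is not reproduced here.
CITES = {
    "P2": ["P1"], "P3": ["P1"], "P4": ["P1"],
    "P5": ["P2", "P6"], "P6": ["P3", "P4"], "P7": ["P6"],
}
AUTHORS = {
    "P1": {"A1", "A2"}, "P2": {"A3"}, "P3": {"A3", "A4"}, "P4": {"A2", "A4"},
    "P5": {"A1", "A5"}, "P6": {"A1", "A2"}, "P7": {"A2", "A3"},
}

# Invert the edge list once: cited paper -> papers citing it.
CITED_BY = {}
for citing, cited_list in CITES.items():
    for cited in cited_list:
        CITED_BY.setdefault(cited, []).append(citing)


def generations(target, k=3):
    """Return k sets; set g-1 holds the papers first reached after g steps."""
    seen = {target}
    frontier = [target]
    layers = []
    for _ in range(k):
        reached = []
        for paper in frontier:
            for citing in CITED_BY.get(paper, []):
                if citing not in seen:  # count a paper only once
                    seen.add(citing)
                    reached.append(citing)
        layers.append(set(reached))
        frontier = reached
    return layers


def generation_counts(target, exclude_author=None, k=3):
    """Per-generation counts, optionally dropping papers by exclude_author."""
    return [
        sum(1 for p in layer
            if exclude_author is None or exclude_author not in AUTHORS[p])
        for layer in generations(target, k)
    ]


print(generation_counts("P1"))        # paper level, as in Table 5.5 (b)
print(generation_counts("P1", "A1"))  # (A1, P1) pair, as in Table 5.6
```

Note that a self-citing paper is only dropped from the count; papers reached through it still appear in later generations, which is how the (A1, P1) pair keeps its single 3-gen citation.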

Finally, Table 5.7 presents the authors of the papers included in the graph, the list of papers co-authored by each researcher and the age range of their published papers. In addition this table contains the calculated values for the three indicators compared in this section, namely fa − value, fak − index and fask − index with k = 3.

Comparing the calculated values for the fa3 and fas3 indices of the authors, we observe that the author scores become lower when removing self-citations. The calculated value for author A5 remains the same, since the single paper that he co-authored 5 years ago has attracted no citations and there are therefore no self-citations to remove. In addition, the value of author A4 also remains constant since none of the citations received belongs to papers co-authored by A4. The fas3 − index values for authors A1, A2 and A3 are lower than their corresponding fa3 − index values since each of these authors has received at least one self-citation.

Author  Papers       Age range  fa − value  fa3 − index  fas3 − index
A1      P1 P5 P6     4 to 7     1.329       0.571        0.440
A2      P1 P4 P6 P7  3 to 7     1.668       0.586        0.396
A3      P2 P3 P7     3 to 6     0.837       0.389        0.361
A4      P3 P4        6 to 6     0.793       0.500        0.500
A5      P5           5 to 5     0.100       0.200        0.200

Table 5.7: Author-based indicator values for the authors of Figure 5.3

We can make the results more easily comparable by replacing the actual values with the rank each author receives when the authors are ranked in descending order based on the calculated value of each indicator. In Table 5.8 we present the author rankings for the three indicators.

Rank  fa − value  fa3 − index  fas3 − index
1     A2          A2           A4
2     A1          A1           A1
3     A3          A4           A2
4     A4          A3           A3
5     A5          A5           A5

Table 5.8: Author rankings using the fa − value, fa3 − index and fas3 − index values
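The three rankings can be re-derived from the paper-level and pair-level values of Tables 5.5 and 5.6. The sketch below is an illustration only: the dictionary and function names are hypothetical, the f − values and fp3 − index values are copied from the tables rather than recomputed, and the scientific age of an author is taken as the age of their oldest paper.

```python
F_VALUE = {"P1": 14.318, "P2": 2.143, "P3": 4.755, "P4": 4.755,
           "P5": 1.000, "P6": 3.286, "P7": 1.000}            # Table 5.5 (a)
FP3 = {"P1": 0.762, "P2": 0.333, "P3": 0.500, "P4": 0.500,
       "P5": 0.200, "P6": 0.750, "P7": 0.333}                # Table 5.5 (b)
AGE = {"P1": 7, "P2": 6, "P3": 6, "P4": 6, "P5": 5, "P6": 4, "P7": 3}
FP3_PAIR = {                                                 # Table 5.6
    ("A1", "P1"): 0.619, ("A1", "P5"): 0.200, ("A1", "P6"): 0.500,
    ("A2", "P1"): 0.500, ("A2", "P4"): 0.250, ("A2", "P6"): 0.500,
    ("A2", "P7"): 0.333,
    ("A3", "P2"): 0.333, ("A3", "P3"): 0.417, ("A3", "P7"): 0.333,
    ("A4", "P3"): 0.500, ("A4", "P4"): 0.500,
    ("A5", "P5"): 0.200,
}

# Publication Records follow from the (author, paper) pairs.
RECORD = {}
for author, paper in FP3_PAIR:
    RECORD.setdefault(author, []).append(paper)
N_COAUTHORS = {p: sum(p in papers for papers in RECORD.values()) for p in AGE}


def rank(scores):
    """Authors in descending order of score."""
    return sorted(scores, key=scores.get, reverse=True)


fa_value = {a: sum(F_VALUE[p] / N_COAUTHORS[p] for p in ps) / max(AGE[p] for p in ps)
            for a, ps in RECORD.items()}
fa3 = {a: sum(FP3[p] for p in ps) / len(ps) for a, ps in RECORD.items()}
fas3 = {a: sum(FP3_PAIR[(a, p)] for p in ps) / len(ps) for a, ps in RECORD.items()}

for name, scores in (("fa-value", fa_value), ("fa3", fa3), ("fas3", fas3)):
    print(name, rank(scores))
```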

All three indicators place author A5 at the bottom of the list since he has co-authored a single paper, P5, that has not received any direct or indirect citations. In addition all indicators place author A1 in the second position in the ranking. We should note, though, that this does not mean that A1 has not had any self-citations. It just means that after excluding all self-citations from the generation citation counts of all the authors, A1 still has the second highest fas3 − index value.

One author whose position varies between the three indicators is A2, who is ranked first by fa − value and fa3 − index and third by the fas3 − index. Indeed, if we look at the generation counts for the papers co-authored by A2 we will see that of the 6 1-gen, 4 2-gen and 1 3-gen citations of his papers, 3 1-gen, 2 2-gen and 1 3-gen citations were self-citations (more than half) and were excluded from the calculation of his fas3 − index.

Finally, A4 occupies three different positions in the rankings produced by the three indicators. A4 is ranked fourth by fa − value, third by fa3 − index and first by fas3 − index. A4 has received zero self-citations and therefore maintains the whole fp3 − index values of the papers co-authored. On the other hand the papers co-authored by A4 have had a single direct citation each from the same paper P6, which in turn has had 2 direct citations from papers P5 and P7 that have received no direct or indirect citations themselves, therefore their f − values are relatively low.

Chapter 6

Bibliographic databases

The information provided by two bibliographic databases has been utilized in the present study. The first database examined was CiteSeerx [cit, 1997, Giles et al., 1998] and the second one the DBLP database [b]. In this chapter we are going to introduce the two bibliographic databases and provide an analysis of their data.

Part of the analysis covers the type of information provided by each database and the format of the data. We are going to present the process by which we parsed and extracted the data, how we stored them for further analysis and the subset used in each case for generating the Paper-Citation graphs that we ran our experiments on.

The rest of the chapter is organized as follows: In Section 6.1 we present an overview of the DBLP bibliographic database, along with a description of the data included in the database and the parser that we used to extract, store and transform the provided data. We also present a small analysis on the data found in the database with regards to publications and authors, their citations and the different generations of citations found in the database. In Section 6.2 we present an overview of the CiteSeerx database along with a description of the provided data and the parser we used to parse and store the provided data. In addition we present the cc-IF algorithm used to generate the Medal Standings Output table (MSO) for the Paper-Citation graph generated from the CiteSeerx database.


6.1 DBLP

DBLP is a Computer Science database that provides an online index of scientific publications.

6.1.1 Data

The different types of publications included in the DBLP dataset are presented in [DBLP] and mainly include articles (published in a journal or magazine), papers from conferences or workshops and proceedings volumes. Other publication types, like authored books, parts or chapters in a book, PhD and master theses, are also included but in smaller numbers.

The data is formatted in XML and is released under the ODC-BY 1.0 license. The XML formatted file can be downloaded from the DBLP website.

An example of an XML record found in the DBLP database is presented below.

Krzysztof Cetnarowicz
Maciej Paszynski
David Pardo
Tibor Bosse
Han La Poutré
Agent-based computing, adaptive algorithms and bio computing.
1951-1952
2010
conf/iccS/2010
ICCS
1
http://dx.doi.org/10.1016/j.procs.2010.04.218
db/journals/procedia/procedia1.html#CetnarowiczPPBP10

The particular record is of a paper published in the Proceedings of a Conference, therefore the root element of the record is named ‘‘inproceedings’’. Part of the main definition is the ‘‘mdate’’ and the ‘‘key’’ attributes. The ‘‘mdate’’ represents the last modification date of the particular record and the ‘‘key’’ is a string that uniquely identifies this record in the DBLP database.

The common metadata included in this record that is also available for all the different types of publications in the database is:

• <author> field(s) with the co-authors of the publication

• <title> field that contains the publication title

• <year> field that contains the year the publication became available

There are other elements included in the record that we have not considered as required for the purposes of this study. An example of such a field is <pages>, which represents the pages of the publication within the journal, proceedings or chapter.

One more element included in the record that is quite important for the experiments run as part of this study is <crossref>. One record can contain zero or more such elements, each of which contains the ‘‘key’’ attribute of another publication included in the DBLP database.

The ‘‘crossref’’ fields represent citations from one publication to another publication included in the DBLP database and are used to link the publications together.
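To make the record structure concrete, the sketch below parses a small DBLP-style document with Python's xml.etree.ElementTree. It is not the parser used in this study (that one was written in PHP); the sample record, its key values and author names are invented for the illustration, only the element and attribute names mentioned in the text are used, and each crossref element is treated as an outgoing reference, as the text does. The real DBLP file is far larger than this sample, so an in-memory parse like this one would need adapting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape described above; not a real DBLP record.
SAMPLE = """<dblp>
  <inproceedings mdate="2016-01-01" key="conf/example/Doe10">
    <author>Jane Doe</author>
    <author>John Roe</author>
    <title>An example title.</title>
    <pages>1-2</pages>
    <year>2010</year>
    <crossref>conf/example/Poe09</crossref>
  </inproceedings>
  <www mdate="2016-01-01" key="homepages/example">
    <author>Jane Doe</author>
    <title>Home Page</title>
  </www>
</dblp>"""


def parse_records(xml_text, wanted=("article", "inproceedings")):
    """Yield one dict per publication record of a wanted type."""
    root = ET.fromstring(xml_text)
    for record in root:
        if record.tag not in wanted:
            continue  # e.g. www records describing authors
        yield {
            "key": record.get("key"),
            "type": record.tag,
            "title": record.findtext("title"),
            "year": record.findtext("year"),
            "authors": [a.text for a in record.findall("author")],
            "references": [c.text for c in record.findall("crossref")],
        }


def is_complete(rec):
    """A record needs a key, a title, a year and at least one author."""
    return bool(rec["key"] and rec["title"] and rec["year"] and rec["authors"])


records = list(parse_records(SAMPLE))
for rec in records:
    print(rec["key"], rec["year"], rec["authors"], rec["references"])
```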

It is worth noting that DBLP uses the WWW record type to provide details about a particular author, such as the list of synonyms of an author’s name. DBLP’s methodology of identifying and mapping authors to their respective publications is described in dbl 2009. For the purposes of our study we have not made any attempt to identify any author type synonyms or distinguish between authors with the same name. This means that metrics presented for some authors may be misleading since publications of two authors with the same name are attributed to a single author.

6.1.2 Parser

PHP (DOM extension) was used to parse the XML file and store the data in a relational DBMS (MySQL) for easier retrieval and access. The schema of the MySQL database for the seven tables used to store the parsed data is described below.

author (name)
publication (id, type, title, year)
publication_author (publication, author)
publication_references (publication, reference)
mso_gs (publication, gen1, gen2, ..., gen30)
mso_hm (publication, gen1, gen2, ..., gen30)
mso_gs_author (publication, author, gen1, gen2, ..., gen30)

As mentioned before we have not taken any steps to reproduce the unique author mapping procedure followed by DBLP, so the author table contains a list of all unique author names identified when parsing the data. Authors with the same name are considered to be the same person.

The publication table contains the list of all publications with the id field being their Primary Key that maps the key attribute from the XML schema. The rest of the fields stored in the publication table are the type of publication (i.e. inproceedings), the title and the year of publication.

We have also used two more tables, the publication_author and the publication_references which associate a particular publication with its co-authors and a publication with the publications it references. The publication_author represents the N:N relation between authors and publications and the publication_references represents the N:N relation between publications.

Finally, the mso_gs, mso_hm and mso_gs_author tables hold the Medal Standings Output (MSO) for the first 30 generations of citations. The Gs definition was used to identify the papers included in each generation for the mso_gs and the mso_gs_author tables, whereas the Hm definition was used to produce the mso_hm table. The mso_gs and mso_hm tables contain the data for the MSO tables when the citations are examined at the publication level and the mso_gs_author table contains the corresponding information for when we examine the citations at the (publication, author) level.
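For readers who want to experiment with the schema, the seven tables can be recreated in a few lines. The sketch below uses Python's built-in sqlite3 module with an in-memory database instead of MySQL, and the column types, keys and default values are assumptions, since the text only lists table and column names.

```python
import sqlite3

# gen1 .. gen30 columns shared by the three MSO tables.
GENS = ", ".join(f"gen{i} INTEGER DEFAULT 0" for i in range(1, 31))

SCHEMA = f"""
CREATE TABLE author (name TEXT PRIMARY KEY);
CREATE TABLE publication (id TEXT PRIMARY KEY, type TEXT, title TEXT, year INTEGER);
CREATE TABLE publication_author (
    publication TEXT REFERENCES publication(id),
    author TEXT REFERENCES author(name),
    PRIMARY KEY (publication, author));
CREATE TABLE publication_references (
    publication TEXT REFERENCES publication(id),
    reference TEXT REFERENCES publication(id),
    PRIMARY KEY (publication, reference));
CREATE TABLE mso_gs (publication TEXT PRIMARY KEY, {GENS});
CREATE TABLE mso_hm (publication TEXT PRIMARY KEY, {GENS});
CREATE TABLE mso_gs_author (
    publication TEXT, author TEXT, {GENS},
    PRIMARY KEY (publication, author));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)

# Hypothetical row: one paper with three 1-gen citations.
conn.execute("INSERT INTO publication VALUES (?, ?, ?, ?)",
             ("conf/example/Doe10", "inproceedings", "An example title.", 2010))
conn.execute("INSERT INTO mso_gs (publication, gen1) VALUES (?, ?)",
             ("conf/example/Doe10", 3))
row = conn.execute("SELECT gen1, gen2 FROM mso_gs").fetchone()
print(row)
```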

Like in previous studies [Sidiropoulos and Manolopoulos, 2005, ?], we chose to only consider articles and papers in our study. We should also mention that during parsing, we considered records to be complete if apart from the DBLP Key (uniquely identifies a publication within the DBLP dataset), they also provided a Title, Year of Publication and a list of Authors.

                              No authors            No references
Publication type  # records   # records  % total    # records  % total
Article           1308552     6565       0.50%      1306765    99.86%
In Proceedings    1641467     2419       0.15%      1640414    99.94%

Table 6.1: Imported DBLP records per publication type along with the percentage compared with the original set of publication records.

Table 6.1 presents the data imported from the XML file along with some statistics about the corresponding numbers of authors and references. With regards to the number of references, we observe that most publications do not provide references to other publications. This means that if we were to represent the dataset as a citation graph, most of the publications would appear as isolated nodes with no incoming or outgoing edges. Thus, we decided that the citation graph should include all journal articles and conference papers that provide at least one reference to any other publication or receive at least one citation from any of the publications in the original dataset. This data was then extracted to a different database and Table 6.2 displays the summary statistics.
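The selection rule for the citation graph can be expressed as a small filter. The function below is a hypothetical sketch: it takes the imported publication keys and the (publication, reference) pairs, and it assumes that a reference only counts when both endpoints are imported publications and the two keys differ.

```python
def citation_graph_nodes(publications, references):
    """Return the publications that give or receive at least one citation.

    publications: iterable of publication keys imported from the dataset.
    references:   iterable of (citing_key, cited_key) pairs.
    """
    publications = set(publications)
    keep = set()
    for citing, cited in references:
        # Assumption: ignore dangling references and self-loops.
        if citing in publications and cited in publications and citing != cited:
            keep.add(citing)
            keep.add(cited)
    return keep


# Hypothetical keys: "d" is isolated, "x" is not an imported publication.
pubs = ["a", "b", "c", "d"]
refs = [("a", "b"), ("c", "b"), ("c", "x"), ("d", "d")]
nodes = citation_graph_nodes(pubs, refs)
print(sorted(nodes))
```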

                          # Publications with      Authors
Publication type  Count   References  Citations    Count   Distinct
Article           8087    1767        7406         17646   8304
In Proceedings    12786   6177        10912        30473   11487

Table 6.2: Records included in the Citation Graph along with the number of references provided and citations received. The table also presents the total number of co-authors and the distinct count of authors per publication type.

We observe that the number of publications that provide references to other publications included in the dataset is smaller than the number of publications that receive citations. This means that the publications that include references cite, on average, more than one publication each (not necessarily of the same type).

For the remainder of this study, we will not distinguish between the two publication types, i.e., Article and InProceedings, and we will refer to all publications included in the Paper-Citation graph as papers.

6.1.3 Publication data analysis

In order to better understand the dataset used we have generated the MSO table for the extracted Paper-Citation graph up to the 30th generation. We have calculated the number of papers included in each generation and the total number of citations per generation. As already discussed, these values were stored in a separate Medal Standings Output (MSO) table in the relational DBMS and are presented in Figure 6.1.

The generations present in the citation graph are displayed on the x-axis of Figure 6.1. On the primary y-axis we plot the number of papers that have received at least one citation of the specified generation, and, on the secondary y-axis, we plot the total number of citations per generation.
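The two series plotted in Figure 6.1 can be derived from MSO rows with a short aggregation. The function name and input layout below are assumptions made for the illustration; the input rows are the Gs counts of the small example of Table 5.2, not the DBLP data.

```python
def generation_summary(mso_rows):
    """Aggregate MSO rows into the two per-generation series.

    mso_rows: dict mapping a paper to its list of per-generation counts.
    Returns (papers_with_citation, total_citations), one entry per generation.
    """
    n_gen = max((len(counts) for counts in mso_rows.values()), default=0)
    papers_with_citation = [0] * n_gen
    total_citations = [0] * n_gen
    for counts in mso_rows.values():
        for g, count in enumerate(counts):
            if count > 0:
                papers_with_citation[g] += 1  # primary y-axis series
            total_citations[g] += count       # secondary y-axis series
    return papers_with_citation, total_citations


# Gs generation counts of Table 5.2 (generations 1 to 3).
MSO = {
    "P1": [1, 3, 1], "P2": [3, 1, 0], "P3": [1, 1, 0],
    "P4": [1, 0, 0], "P5": [0, 0, 0], "P6": [0, 0, 0],
}
papers_series, citations_series = generation_summary(MSO)
print(papers_series, citations_series)
```

For this small input the citation totals coincide with the Total row of Table 5.2.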

We notice that the Publications series starts high with many papers receiving a gen-1 citation. The values gradually reduce to eventually reach 0 for generations 29 and 30, since no paper in our citation graph is part of a citation path of that length. With regards to the total number of citations for each generation, we notice that the number increases substantially from generation 1 to generation 5 and then it decreases down to 0 for generations 29 and 30.

Figure 6.1: Summary statistics of the publications included in the Paper-Citation graph and the citations received for each generation of citations identified.

We should note at this point that there are a couple of factors that would affect the plots presented in this graph. The scientific age of the Paper-Citation graph examined combined with the citation patterns found in the graph can affect both the number of generations found in the graph and the number of citations per generation.

For example, let us consider a Paper-Citation graph that contains a set of relatively young scientific papers whose age ranges from 1 to 5 years. Also, let us assume that the average time until a paper receives its first citation is 1 year. This combination makes it highly likely that the number of generations of citations that we will find if we generate the MSO table for the graph will not exceed 5, at least not by much.

6.1.4 Author data analysis

In the Citation graph database, we also hold information about the list of co-authors for each paper from which we can identify the Publication Record for each author.

We should mention, though, that the Publication Record for each author is far from complete since the DBLP database does not contain the complete list of papers for the examined authors. In addition, we do not distinguish between authors with the same name, so, it is possible that papers from two or more authors have been attributed to the same person.

Figure 6.2: Summary statistics of the authors and the citations received for each generation of citations identified.

Figure 6.2 presents some summary statistics about the authors that have (co-) authored the papers of the citation graph. The generations are displayed on the x-axis. On the primary y-axis we plot the number of authors with at least one publication that has received at least one citation of the specified generation and on the secondary y-axis we plot the total number of citations per generation received by all the papers the authors have co-authored.

An interesting thing to note in this plot is that the numbers of citations appear to be higher than the ones presented in Figure 6.1. This is to be expected, since a publication with several co-authors will have its citations counted more than once.

6.2 CiteSeerx

CiteSeerx [cit, 1997, Giles et al., 1998] is an online digital library that mainly indexes the fields of computer science and information science. CiteSeerx also provides a number of related services, such as software, statistics, full-text searches etc.

6.2.1 Data

CiteSeerx provides the bibliographic data using the Open Archives Initiative (OAI) format, which is XML based. We should also mention that the data used for this part of the study are from 2005.

A sample record is shown below. For simplicity, only the identifiers that were actually extracted and used in our study are listed.

oai:CiteSeerPSU:number#
The Title oai:CiteSeerPSU:number# oai:CiteSeerPSU:number#

As we can see, each publication is considered to be a record included in the CiteSeerx database, and each record has two main sections: the header and the metadata.

The header contains the identifier tag, which is a CiteSeerx-generated identifier that uniquely identifies each record within the database.

The metadata section contains all the metadata information for any particular record. As we can see, the title of the publication is specified in its own tag, whereas each individual co-author is listed in a separate tag. Finally, part of the metadata for each record are the tags that define a relation between the currently examined record and another record included in the CiteSeerx database. The connection is established using the identifier of each record.

We should note at this point that during the parsing of the CiteSeerx database we have not made any attempt to normalize the author data provided. For example, we do not distinguish among authors with the same name and we do not identify name variations of any particular author.

This means that we might have assigned papers that belong to two authors to one single author after parsing the data, or papers that belong to the same author might have been assigned to two distinct authors if the names appear differently in the metadata of the two publications. Based on information from the CiteSeerx website, the library does currently provide author disambiguation, but at the point when we first retrieved the data we were not aware of this functionality.
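The effect of skipping normalization can be illustrated with a minimal sketch. The function name and the sample names below are hypothetical and are not taken from the parser used in the study; the sketch only assumes that an author id is assigned per distinct name string.

```python
def assign_author_ids(names):
    """Give each distinct name string its own id, with no normalization."""
    ids = {}
    for name in names:
        if name not in ids:
            ids[name] = len(ids) + 1  # ids start at 1, like an auto-increment key
    return ids


# Hypothetical author strings as they might appear in three records.
names = ["J. Smith", "John Smith", "J. Smith"]
author_ids = assign_author_ids(names)
print(author_ids)
```

Both occurrences of "J. Smith" share one id even if they refer to different people, while "John Smith" receives a separate id even if it is the same person.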

6.2.2 cc-IF algorithm

In order to generate the MSO and the additional information of the Cascading Citations Indexing Framework (cc-IF) for the Paper-Citation graph included in our dataset, the following algorithm was implemented [Fragkiadaki et al., 2009]. The presented algorithm considers the citations at the (paper, author) level and calculates the 1-gen, 2-gen, 3-gen, 1-gen-self, 2-gen-self, 3-gen-self, 2-gen-chord, 3-gen-chord, 2-gen-self-chord, and 3-gen-self-chord citations. In addition, it calculates and stores in the database the corresponding citation path in the tables presented in the previous section, namely gen1, gen2 and gen3.

Three data structures are necessary for the execution of the algorithm: the Paper Direct Citations (PDC), the Paper Authors (PA) and the list of the papers P that need to be processed. These structures are created utilizing the information present in the articles, art_has_authors and citations tables presented earlier.

Based on the mathematical notation we have been using throughout this study, the set of papers to be processed is defined as $P = \{P_1, P_2, \ldots, P_{N_P}\}$, with $C(P_i)$ denoting the set of papers that directly reference paper $P_i$. Therefore the Paper Direct Citations (PDC) data structure will be defined as $PDC = \{C(P_1), C(P_2), \ldots, C(P_{N_P})\}$. In addition, let $A = \{A_1, A_2, \ldots, A_{N_A}\}$ denote the set of authors that participate in the Paper-Citation graph, with $A(P_i)$ denoting the set of co-authors for paper $P_i$. Therefore the Paper Authors (PA) structure is defined as $PA = \{A(P_1), A(P_2), \ldots, A(P_{N_P})\}$.
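As a rough illustration of these definitions, the PDC and PA structures can be held as dictionaries of sets. This is a minimal sketch: the function name, the tuple layouts and the sample identifiers are assumptions made for the example and do not come from the original implementation.

```python
from collections import defaultdict


def build_structures(citations, paper_authors):
    """Build P, PDC and PA from two lists of pairs.

    citations:     list of (citing_paper, cited_paper) pairs (assumed layout)
    paper_authors: list of (paper, author) pairs (assumed layout)
    """
    pdc = defaultdict(set)  # pdc[p] plays the role of C(p): papers citing p
    pa = defaultdict(set)   # pa[p] plays the role of A(p): co-authors of p
    for citing, cited in citations:
        pdc[cited].add(citing)
    for paper, author in paper_authors:
        pa[paper].add(author)
    papers = sorted(set(pa) | set(pdc) | {citing for citing, _ in citations})
    return papers, dict(pdc), dict(pa)


# Hypothetical toy graph: P2 and P3 cite P1, and P3 also cites P2.
citations = [("P2", "P1"), ("P3", "P1"), ("P3", "P2")]
paper_authors = [("P1", "A1"), ("P1", "A2"), ("P2", "A1"), ("P3", "A3")]
papers, pdc, pa = build_structures(citations, paper_authors)
```

Sets are used so that membership tests, such as checking whether an author also appears on a citing paper, stay cheap.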

Algorithm 6.1 cc-IF algorithm

 1  Input:
 2    depth: the depth for which to run the algorithm
 3    PDC: data structure with direct citations for each paper
 4    PA: data structure with the authors for each paper
 5    P: set of papers that need to be processed
 6  Output:
 7    (MSO table stored in the relational database)
 8
 9  Variables:
10    CurP: is the currently examined paper
11    CitP: is a paper directly citing the current paper (CurP)
12    A: one of the co-authors of the current paper (CurP)
13    throughInfo: collection of columns with the path info
14
15  cc-IF(depth, P, PDC, PA)
16    if depth = 1 then
17      foreach CurP in P do
18        foreach CitP in PDC[CurP] do
19          foreach A in PA[CurP] do
20            S = check_self(PA[CurP], PA[CitP])
21            insert_gen{1}(autoid, CurP, A, CitP, [], S)
22    else
23      cc-IF(depth-1, P, PDC, PA)
24      prev_gen = data from table gen{depth-1}
25      foreach row in prev_gen
26        CurP = row[identifier]
27        A = row[authorid]
28        CitP = row[fromid]
29        foreach L in PDC[CitP] do
30          S = check_self(PA[L], A)
31          C = check_chord(CurP, L)
32          TI = make_row[row[throughInfo], S]
33          insert_gen{depth}(autoid, CurP, A, TI, S)
34
35    calculate_mso()

As we can see from the definition presented in Algorithm 6.1, the algorithm is recursive and the number of recursive calls is equal to the depth at which we want to examine the citations. The output of the algorithm is the full list of all citation paths up to the desired depth, and the characterization of each path at the (paper, author) level based on the Cascading Citations Indexing Framework (cc-IF).

The algorithm recursively executes until the value of depth is equal to 1. At this point (line 16) the if condition is met and the algorithm begins the calculations for 1-gen citations. For each paper CurP in the list of papers P, the algorithm iterates through all the citations that this paper has received (line 18). This information is found in the PDC structure. Then, for each such citing paper, identified by CitP, and for each author A of the paper, the algorithm checks whether the specific author also exists in the list of authors of CitP (line 20), and if she does then the current citation path is marked as a self-citation for this author. Finally, a new record is inserted in table gen1 (line 21).

As soon as all the 1-gen citations and paths have been identified and stored in the database, the algorithm returns to the incomplete recursive calls, starting with the one where the value of depth equals two. For each recursive call the algorithm re-uses the information it calculated in the immediately preceding recursive call in order to calculate the new citations of higher rank. In other words, in order to calculate the 2-gen citations the algorithm retrieves all (paper, author) 1-gen citation pairs and for each pair calculates the 2-gen citations. For each record present in the previous level citations table we retrieve information for the paper (CurP), the author (A) and the source article of the citation (CitP). All direct references that the source paper has received are considered n-gen citations for the target paper CurP. We then check whether the citation is a chord and whether it is a self-citation or not and, finally, we store the record with this information in the database.

After the calculation of all gen citations up to the defined depth we summarize the results for each (article, author) pair, thus producing the MSO table (line 35). A function named calculate_mso() is used for this purpose; it counts the total number of citations for each distinct value that needs to be stored in the MSO table.
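The recursion described above can be sketched in memory as follows. This is not the authors' implementation: the database tables are replaced by Python lists, the row layout (paper, author, fromid, through, chord, self) is an assumption loosely modelled on the gen tables, a path is assumed to be a self-citation when the author also appears on the paper at the start of the path, and a chord is assumed to be a path whose source also cites the target directly.

```python
def cc_if(depth, papers, pdc, pa):
    """Return {generation: rows} for generations 1..depth.

    pdc maps a paper to the set of papers citing it; pa maps a paper to its
    set of authors. Each row is (paper, author, fromid, through, chord, self).
    """
    if depth == 1:
        rows = []
        for cur in papers:
            for cit in pdc.get(cur, ()):
                for author in pa.get(cur, ()):
                    is_self = author in pa.get(cit, ())
                    rows.append((cur, author, cit, (), False, is_self))
        return {1: rows}

    # Solve the previous generation first, then extend each path by one hop.
    gens = cc_if(depth - 1, papers, pdc, pa)
    rows = []
    for cur, author, cit, through, _chord, _self in gens[depth - 1]:
        for nxt in pdc.get(cit, ()):
            is_self = author in pa.get(nxt, ())   # assumed self-citation rule
            is_chord = nxt in pdc.get(cur, ())    # assumed chord rule
            rows.append((cur, author, nxt, through + (cit,), is_chord, is_self))
    gens[depth] = rows
    return gens


# Hypothetical toy graph: P2 and P3 cite P1, and P3 also cites P2.
papers = ["P1", "P2", "P3"]
pdc = {"P1": {"P2", "P3"}, "P2": {"P3"}}
pa = {"P1": {"A1"}, "P2": {"A2"}, "P3": {"A1"}}
gens = cc_if(2, papers, pdc, pa)
```

In this toy graph the only 2-gen path is P3 → P2 → P1. It is flagged as a chord because P3 also cites P1 directly, and as a self-citation for A1 because A1 co-authored P3.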

6.2.3 Parser

PHP (DOM extension) was used in order to parse the XML file and store the data in a relational DBMS (MySQL) for easier retrieval and access. The schema of the MySQL database for the eight tables used to store the parsed data is described below [Fragkiadaki et al., 2009].

author (authorid, name)
publication (id, title, year)
publication_authors (publication, authorid)
publication_references (publication, reference)
gen1 (id, publication, authorid, fromid, self)
gen2 (id, publication, authorid, fromid, throughid01, chord, self)
gen3 (id, publication, authorid, fromid, throughid01, throughid02, chord, self)
mso (publication, authorid, gen1, gen2, gen3, sgen1, sgen2, sgen3, cgen2, cgen3, scgen2, scgen3)

We created eight tables to hold the bibliographic information provided by CiteSeerx. The publications were stored in the publication table, which contains the identifier of the paper as defined by CiteSeerx, the title of the paper and the year of publication. The author table contains an id, which is an automatically incremented field defined by the import process, and the name of the author we have mapped this id against. The publication_authors table holds the N:N relationship between publications and authors, whereas the publication_references table maps the N:N relationship between the papers and the citations they have received from other papers included in the dataset.

The remaining four tables contain calculated information for the CiteSeerx database. The gen1 table lists all 1-gen citations for all (publication, author) pairs. The citation path can be reconstructed as fromid → identifier where fromid is the source paper and identifier is the target. The self field is used to maintain the information about whether this citation is a self citation for the specified author.

Similarly, the gen2 table holds a list of (publication, author) citation paths found in the Paper-Citation graph, with the self field defining whether any one citation path is actually a self-citation for the particular author. In this case the citation path is defined as fromid → throughid01 → identifier, with throughid01 being the intermediate paper in this 2-gen citation. We should note that in this case one extra field named chord has been defined, which holds the information about whether this 2-gen citation is actually a 2-gen chord, meaning that it co-exists with a 1-gen citation fromid → identifier. The gen3 table is defined in a similar manner.

Finally, the mso table stores the MSO table (for the Hm definition of citations) with summarized information about all types of citations received by each (paper, author) pair.
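The summarization step can be sketched as a simple grouping of citation paths. The row layout below, (paper, author, fromid, through, chord, self), is an assumption loosely based on the gen tables, and the sketch deliberately groups only by the self and chord flags: how those groups map onto columns such as sgen2 or scgen3 depends on the cc-IF definitions given earlier in the study.

```python
from collections import Counter


def summarize(rows_by_generation):
    """Count paths per (paper, author, generation, self, chord)."""
    counts = Counter()
    for generation, rows in rows_by_generation.items():
        for paper, author, _fromid, _through, chord, is_self in rows:
            counts[(paper, author, generation, is_self, chord)] += 1
    return counts


# Hypothetical rows for a single (paper, author) pair.
rows_by_generation = {
    1: [
        ("P1", "A1", "P2", (), False, False),
        ("P1", "A1", "P3", (), False, True),
    ],
    2: [("P1", "A1", "P3", ("P2",), True, True)],
}
counts = summarize(rows_by_generation)
```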

The CiteSeerx database consisted of 72 files, each holding 10,000 papers with their corresponding bibliographic details. Papers appearing in the list of references of a particular paper are also part of the CiteSeerx database. The algorithm parses the files and stores all necessary information in the first four tables mentioned earlier.

Table 6.3 presents the total number of records generated after parsing the OAI XML files and populating the appropriate tables in the relational database. As we can see we extracted approximately 700 thousand publications, co-authored by approximately 400 thousand authors, that provide 1.75 million direct citations among the rest of the publications included in our dataset.

Some of these records hold what we considered to be incomplete bibliographic information. For the purposes of this study a record was considered ‘‘complete’’ if we could retrieve its identifier, title, author list and year of publication. This information is required so that we can run the calculations for the indicators examined in later sections.

Table                     # Records
publication                  716772
author                       411022
publication_authors         1663045
publication_references      1751492

Table 6.3: Counts of the CiteSeerx extracted data.

Year = 0    Year = null    Title = null    Title = ''    Without Authors
  211956          68154              70             2              58723

Table 6.4: CiteSeerx extracted data that include the number of publications, authors, citations and references.

So, from the full set of extracted data we excluded records that did not fit the description provided above, and Table 6.4 presents more information on the actual publication records that were considered incomplete. As we can see from the data presented in the table, we identified a relatively high percentage of publications with incomplete publication year data (29.5% of the total), and a smaller but still significant number of records that did not provide (or for which we were unable to parse) author information (8.2% of the total). In order to ensure that we will be able to calculate all of the examined indicators for the full dataset we have excluded these records from the database.
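The completeness rule described above can be sketched as a small predicate. The dictionary keys and sample records are hypothetical; the checks mirror the cases listed in Table 6.4 (a year equal to 0 or missing, a title that is missing or empty, and no authors).

```python
def is_complete(record):
    """True if identifier, title, author list and year are all usable."""
    title = record.get("title")
    year = record.get("year")
    return (
        bool(record.get("identifier"))
        and title is not None and title != ""
        and bool(record.get("authors"))
        and year is not None and year != 0
    )


# Hypothetical parsed records.
records = [
    {"identifier": "r1", "title": "A", "authors": ["X"], "year": 1999},
    {"identifier": "r2", "title": "B", "authors": ["Y"], "year": 0},
    {"identifier": "r3", "title": "", "authors": ["Z"], "year": 2001},
    {"identifier": "r4", "title": "D", "authors": [], "year": 2002},
]
complete = [r["identifier"] for r in records if is_complete(r)]
```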

We have also made an attempt to detect and remove any cycles present in the Paper-Citation graph created by the records considered complete. Our calculations show that there were a total of 6745 Level 1 cycles and just 14 Level 2 cycles detected in the dataset. Finally, we removed any records that we could identify as duplicates and any publications (along with their authors) that neither receive nor provide any citations to any of the papers included in the dataset.

Table 6.5 presents some summary statistics for the entities that were considered as part of this study which formulate the Paper-Citation graph we used to calculate the indicators we are going to examine.

It is interesting to note at this point that, based on the extracted data, each paper has an average of 2.5 co-authors and has received an average of 4.8 direct citations from the rest of the papers included in the current dataset. If we examine the number of co-authors across all the papers in the Paper-Citation graph we will see that the number of co-authors takes values in the range of 1 to 72, but only a handful of papers (10) actually have more than 30 co-authors.

Table                     # Records
publication                  180744
author                       169403
publication_authors          449954
publication_references       511559

Table 6.5: Counts of the CiteSeerx records included in the Paper-Citation graph.

Chapter 7

Experimental results

In this chapter we are going to present some experimental results for the indicators proposed by this study. The results will be based on the two datasets we presented in Chapter 6 and are going to include the proposed paper and author indicators. Apart from presenting the raw calculated values for the papers and authors we are also going to generate a relative ranking of the entities and examine the methodology used during this process.

In addition we have chosen a subset of the indicators that can be found in the literature that we are going to use in order to compare the rankings generated by each indicator. This will give us a better insight on the different aspects of the Paper-Citation graph that each indicator focuses on and on the strengths and weaknesses of the proposed indicators.

The rest of the chapter is structured as follows: In Section 7.1 we examine in detail the paper and author indicators that were used as part of the experimental study, namely the proposed indicators and other well-known indicators found in the literature. In Section 7.2, we present the details of how we generate the rankings from the raw calculated values of each indicator. In Sections 7.3 and 7.4 we present the results for the implemented versions of the paper and author indicators included in the study when using the DBLP and the CiteSeerx generated datasets.

7.1 Comparison indicators

Following the analysis of the citation graph, we selected a list of indicators to be implemented and compiled against the citation databases. In this section we are going to present a description of each of the indicators considered as part of the comparison of the proposed indicators with other well known indicators.

7.1.1 Paper indicators

A number of existing paper indicators were selected in order to compare the results generated from the proposed paper indicators. The selected indicators were the Number of Citations (NC), defined in Section 2.1, the Contemporary h-index score (hc − index), defined in Subsection 3.2.1, the SCEAS rank, defined in Subsection 3.1.2, and, finally, PageRank, previously defined in Chapter 3 and Subsection 4.3.2.

As previously mentioned, PageRank in its Base form uses a damping factor of 0.85 as defined by the original authors. In bibliographic networks a damping factor of 0.50 has also been used. In the calculations presented in the rest of the paper, we will be showing four different rankings for the PageRank indicator, two for the Base version and two for the Normalized one (with damping factors of d = 0.50 and d = 0.85).
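To make the role of the damping factor concrete, the following is a minimal sketch of the textbook PageRank iteration on a citation graph. It is not claimed to match either the Base or the Normalized variant defined in Subsection 4.3.2; the function name, the tolerance, the iteration cap and the uniform redistribution of rank from papers without references are all assumptions made for the example.

```python
def pagerank(cites, d=0.85, tol=1e-6, max_iter=100):
    """cites maps each paper to the list of papers it cites."""
    nodes = sorted(set(cites) | {q for targets in cites.values() for q in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(max_iter):
        # Rank held by papers that cite nothing is spread over all papers.
        dangling = sum(rank[p] for p in nodes if not cites.get(p))
        new = {p: (1.0 - d) / n + d * dangling / n for p in nodes}
        for p in nodes:
            targets = cites.get(p, [])
            for q in targets:
                new[q] += d * rank[p] / len(targets)
        delta = sum(abs(new[p] - rank[p]) for p in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: B cites A, and C cites both A and B.
cites = {"B": ["A"], "C": ["A", "B"]}
rank_050 = pagerank(cites, d=0.50)
rank_085 = pagerank(cites, d=0.85)
```

A lower damping factor shrinks the share of a score that is passed along citation links, so scores depend less on long citation chains.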

7.1.2 Author indicators

In order to compare the proposed author indicators we also selected a number of existing direct and indirect author based indicators. The selected indicators were the Number of Citations (NC), defined in Section 2.1, the Mean number of citations (MNC), defined in Subsection 3.2.1, the h-index, defined in Subsection 3.2.1, the g-index, defined in Subsection 3.2.1, the Contemporary h-index (hc − index), defined in Subsection 3.2.1, the SCEAS Rank, defined in Subsection 3.1.2, and, finally, PageRank, previously defined in Chapter 3 and Subsection 4.3.2.

The author ranking generated from the SCEAS score was defined in the original paper [Sidiropoulos and Manolopoulos, 2005] as the average SCEAS score of an author’s papers. It is worth noting, though, that the average is not calculated across the full Publication Record for an author but using the top 25 publications from the author’s publication record. When an author has fewer than 25 papers in the Paper-Citation graph, we consider all of them in the calculations of the SCEAS rank.

As with SCEAS rank, we calculated the PageRank of an author based on the average PageRank of a set of publications from the author’s publication record. The rankings produced for PageRank use either the Base or Normalized version of PageRank, with a damping factor of either 0.50 or 0.85, and the final ranking is based either on the full publication record of an author or his/her top 25 papers.
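The two averaging options for an author score can be sketched as follows. The function name and parameters are illustrative, and the sketch assumes that a higher paper score means a better paper.

```python
def author_score(paper_scores, top_k=25):
    """Average the top_k highest paper scores; top_k=None uses all papers."""
    if not paper_scores:
        return 0.0
    ranked = sorted(paper_scores, reverse=True)
    selected = ranked if top_k is None else ranked[:top_k]
    return sum(selected) / len(selected)


# Hypothetical author with 30 papers scored 1.0, 2.0, ..., 30.0.
scores = [float(i) for i in range(1, 31)]
top_average = author_score(scores)               # uses the 25 best papers
all_average = author_score(scores, top_k=None)   # uses the full record
```

An author with fewer than 25 papers is averaged over all of them, since the slice simply returns the whole list.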

7.2 Ranking methodology

As we are going to see in the following sections the indicators examined produce different types of results with regards to the actual calculated values. This is to be expected among different bibliographic indicators, since each indicator is defined independently and considers different aspects of the Paper-Citation graph. As a result trying to compare the raw values produced for a particular paper does not necessarily allow us to understand the relative significance attributed to that paper by each of the indicators.

Ranking the papers based on the raw values produced by each indicator would be a straightforward process if the indicators produced a distinct value for each paper included in the Paper-Citation graph, since then a simple ordering of the values would be equal to the ranking of the papers. In most cases examined, though, we have found that an indicator produces the same value for papers with identical characteristics. For example, we would expect two papers published in the same year, written by a single author and having received 0 direct citations to receive the same score from any one indicator. So, the question we had to answer was: what is the ranking position of papers that have the same calculated value for a particular indicator?

We have chosen to produce ordinal rankings for the indicators included in the experiments performed in this study. So, for all papers with the same value, we sum the ranks they would have been assigned if their values were distinct and divide by the number of papers with the identical score. All of these tied papers are then assigned the resulting average rank.
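The tie-handling rule described above can be sketched as follows. The function name is illustrative, and the sketch assumes that a higher raw value means a better rank.

```python
def average_ranks(scores):
    """Rank items by descending score; tied items share their mean position."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the last item that has the same score as item i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2.0  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


# Hypothetical raw values: b and c are tied for positions 2 and 3.
ranks = average_ranks({"a": 10, "b": 7, "c": 7, "d": 3})
```

Here b and c would have taken positions 2 and 3, so both receive (2 + 3) / 2 = 2.5. This is why fractional positions such as 17.5 can appear in a ranking.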

We used the term ‘‘relative significance’’ earlier, since each indicator can only evaluate the relative significance of the papers enclosed in the Paper-Citation graph that we are currently examining.

The way to actually compare the significance of two papers that exist in different Paper-Citation graphs, like the DBLP and the CiteSeerx would actually be to use the raw values produced by the indicators and calculate the ranking that our paper would have had, had it been included in the corresponding graph.

7.3 DBLP experimental results

7.3.1 Paper indicators

For each indicator discussed we have calculated the raw value for the indicator as well as the ordinal ranking of all papers included in the citation graph. Table 7.1 shows the number of distinct values produced by each indicator for the 20873 papers included in the Paper-Citation graph generated from the DBLP dataset.

We observe that the indicators that only consider the direct impact of a publication in their calculations have lower granularity, producing fewer distinct values than the indicators that consider both the direct and indirect impact of a publication. More specifically, the Number of Citations (NC) produces just 144 distinct values and the Contemporary h-index score (hc score) 929.

The PageRank variations provide more granularity, with distinct values ranging from 9150 (for the Normalized version with d = 0.50 - PageRank N50) to 11365 (for the Base version with d = 0.50 - PageRank B50). The convergence criterion was set to 0.000001 for all four versions of PageRank. For the Base version the algorithm required 15 iterations for d = 0.50 and 19 iterations for d = 0.85. For the Normalized versions, 9 and 10 iterations were performed for the damping factors d = 0.50 and d = 0.85, respectively.

For the remaining indirect indicators, we can see that SCEAS1 and SCEAS2 produce 11687 and 10293 distinct values respectively, whereas the f − value indicator produces 9535 distinct values. Finally, fp3 − index produces 6776 distinct values when k = 3.

Category              Group      Indicator                               # Distinct values
Direct impact                    Number of Citations (NC)                              144
                                 Contemporary h-index score (hc score)                 929
Indirect impact       PageRank   Base, d = 0.50 (B50)                                11365
                                 Base, d = 0.85 (B85)                                11251
                                 Normalized, d = 0.50 (N50)                           9150
                                 Normalized, d = 0.85 (N85)                          11344
                      SCEAS      SCEAS 1                                             11687
                                 SCEAS 2                                             10293
Proposed indicators              f − value                                            9535
                                 fp3 − index                                          6776

Table 7.1: Number of distinct values generated by the paper indicators for the DBLP dataset

In Table 7.2, we present the top 10 papers based on the ranking produced by the f − value indicator, along with the rankings these papers hold in the ranks of all the paper indicators described in the previous section. Each paper is usually referred to by the last part of its DBLP key (e.g. Chen76) or, if that does not provide sufficient information to uniquely identify the paper within the citation graph, we have also included the second part of the key (e.g. tods/SmithS77). In the same table, we also present the citation counts for the first three generations, calculated using the Hm definition, which is used for the f − value calculations, and the Gs definition, which is used for the fp3 − index.

Paper                   Year      f    fp3     NC     hc SCEAS1 SCEAS2   PR50   PR85     g1     g2     g3
Codd70                  1970      1      1      2      2      1      1      1      1    580   3150   2580
AstrahanBCEGGKLMMPTWW76 1976      2      2      9     13      5      7      4      4    239   2653   2991
StonebrakerWKH76        1976      3      3     11     15      8      8      7      6    228   2490   2924
persons/Codd71a         1971      4     23     30   77.5     10     10      9      3    130   1291   2989
Codd72                  1972      5     10   17.5     40     11     11     11      9    170   1620   3662
Chen76                  1976      6      4      1      1      2      2      2      2    604   1583   2471
tods/SmithS77           1977      7      7      5      6      6      5      6      8    313   1672   2690
SelingerACLP79          1979      8      5      3      3      4      4      5      7    370   1671   2541
ChamberlinB74           1974      9     13  134.5  274.5     16     18     15     13     66   1376   3549
ifip/Codd74             1974     10     27    177    357     35     42     25     16     57   1295   2700
Best rank                         1      1      1      1      1      1      1      1
Worst rank                       10     27    177    357     35     42     25     16
Median                         5.50   6.00  10.00  14.00   7.00   7.50   6.50   6.50
Standard Deviation             2.87   8.54  59.71 121.90   9.42  11.41   6.79   4.57

(PR50 and PR85: PageRank with d = 0.50 and d = 0.85, Base and Normalized; g1, g2, g3: citation counts for the Hm and Gs definitions.)

Table 7.2: Top 10 papers based on the f − value indicator. The table includes the positions these papers have received in the rankings of all other indicators described in section 7.3.1 along with the citation counts for their first three generations according to the Hm and Gs definitions. In this table the PageRank Base and Normalized produce the same rankings.

By examining the results, we can see that the top 10 papers according to the f − value indicator hold high rank positions among the indirect indicators but not among the direct ones. In particular the ifip/Codd74 paper that holds the 10th position according to f − value holds positions 16 to 42 amongst the indirect indicators, whereas it sits at the 177th position according to the Number of Citations and at the 357th position according to the Contemporary h-index score.

It is also interesting to note that the two versions of PageRank (Base and Normalized) produce the same ranking when the value of the damping factor does not change. This means that both Base and Normalized versions of PageRank for a damping factor equal to 0.5 produce the same ranking for the papers, and the same applies when we apply a damping factor equal to 0.85. Also, from the data presented in the table we would say that for the top 10 papers the f − value indicator is closest, in the positions assigned to the papers, to PageRank (d=0.85), since the top 10 papers occupy positions between 1 and 16 in the PageRank (d=0.85) ranking.

Finally, one more interesting thing to note about the citation data for these papers is that the numbers of 1-gen, 2-gen and 3-gen citations identified for each of them are the same regardless of which definition of generation of citation counts we apply to the dataset.

In Table 7.3, we present the top 10 papers based on the ranking produced by the fp3 − index indicator. The table includes the same columns as the Table with the Top 10 papers according to the f − value indicator (Table 7.2).

Paper                   Year      f    fp3     NC     hc SCEAS1 SCEAS2   PR50   PR85     g1     g2     g3
Codd70                  1970      1      1      2      2      1      1      1      1    580   3150   2580
AstrahanBCEGGKLMMPTWW76 1976      2      2      9     13      5      7      4      4    239   2653   2991
StonebrakerWKH76        1976      3      3     11     15      8      8      7      6    228   2490   2924
Chen76                  1976      6      4      1      1      2      2      2      2    604   1583   2471
SelingerACLP79          1979      8      5      3      3      4      4      5      7    370   1671   2541
Stonebraker75           1975     11      6   25.5     49     17     17     17     15    140   1815   3394
tods/SmithS77           1977      7      7      5      6      6      5      6      8    313   1672   2690
tods/Codd79             1979     15      8      7      8      9      9     10     12    280   1623   2491
EswarranGLT76           1976     12      9      4      5      3      3      3      5    326   1180   3304
Codd72                  1972      5     10   17.5     40     11     11     11      9    170   1620   3662
Best rank                         1      1      1      1      1      1      1      1
Worst rank                       15     10   25.5     49     17     17     17     15
Median                         6.50   5.50   6.00   7.00   5.50   6.00   5.50   6.50
Standard Deviation             4.34   2.87   7.35  15.87   4.59   4.58   4.63   4.11

(PR50 and PR85: PageRank with d = 0.50 and d = 0.85, Base and Normalized; g1, g2, g3: citation counts for the Hm and Gs definitions.)

Table 7.3: Top 10 papers based on the fp3 − index indicator. The table includes the positions these papers have received in the rankings of all other indicators described in section 7.3.1 along with the citation counts for their first three generations according to the Hm and Gs definitions. In this table the PageRank Base and Normalized produce the same rankings.

The top 10 papers according to fp3 − index populate high positions on all indicator rankings. In particular, there seems to be an agreement across all indicators that Codd70 is the most influential publication and it populates either the 1st or 2nd position on all rankings. All the indirect indicators seem to agree that it should be the top paper, whereas the direct impact indicators (NC and hc − index score) seem to place the publication at the second position, since it has received fewer direct citations than the Chen76 publication (580 versus 604).

In general, the paper from the top 10 listing that populates the lowest positions in the other ranks is Stonebraker75, which holds the 6th position in fp3 − index but populates positions 15 to 49 on the other ranks (still very high positions in the overall ranking but not part of the top 10 publications). The lowest positions are assigned by the Number of Citations (NC) and the Contemporary h-index score (25.5 and 49 respectively), which is to be expected since there are papers with more direct citations included in the graph. This again highlights the effect that indirect citation counting can have on the rankings produced by the indicators.

With regard to the four versions of PageRank and the two different damping factors, it seems that the damping factor has had a stronger influence for these top 10 publications than whether we considered the total number of publications or the dangling nodes in the graph, since if we look at the ranking positions they follow the same pattern for the same values of the damping factor. In some cases the four rankings are in agreement (Codd70, Chen76 and AstrahanBCEGGKLMMPTWW76), whereas in others the base version ranks the papers higher (SelingerACLP79) or lower (tods/SmithS77).

                 f     fp3      NC      hc  SCEAS1  SCEAS2   PRB50   PRB85   PRN50   PRN85
f           1.0000  0.9715  0.9045  0.8158  0.8085  0.8031  0.8208  0.8502  0.8125  0.8481
fp3         0.9715  1.0000  0.8468  0.7924  0.7494  0.7433  0.7633  0.7971  0.7564  0.7948
NC          0.9045  0.8468  1.0000  0.9605  0.8925  0.8922  0.8927  0.8909  0.8924  0.8918
hc          0.8158  0.7924  0.9605  1.0000  0.8474  0.8484  0.8445  0.8909  0.8469  0.8363
SCEAS1      0.8085  0.7494  0.8925  0.8474  1.0000  0.9999  0.9994  0.9930  0.9907  0.9934
SCEAS2      0.8031  0.7433  0.8922  0.8484  0.9999  1.0000  0.9987  0.9911  0.9905  0.9916
PRB50       0.8208  0.7633  0.8927  0.8445  0.9994  0.9987  1.0000  0.9964  0.9903  0.9965
PRB85       0.8502  0.7971  0.8909  0.8343  0.9930  0.9911  0.9964  1.0000  0.9844  0.9994
PRN50       0.8125  0.7564  0.8924  0.8469  0.9907  0.9905  0.9903  0.9844  1.0000  0.9852
PRN85       0.8481  0.7948  0.8918  0.8363  0.9934  0.9916  0.9965  0.9994  0.9852  1.0000
Top Cor.       fp3       f      hc      NC     SC2     SC1     SC1     B85     SC1     B85
Low Cor.       SC2     SC2     fp3     fp3     fp3     fp3     fp3     fp3     fp3     fp3

(PRB and PRN denote the Base (B) and Normalized (N) versions of PageRank; 50 and 85 denote the damping factor.)

Table 7.4: Spearman rank correlation matrix for the paper indicators applied to the DBLP dataset

In Table 7.4, we present the Spearman rank correlation matrix for all the combinations of paper indicator ranks. For each indicator, the bottom two rows of the table report the indicators that have the highest and lowest correlation with the indicator under scrutiny. The diagonal always has a value of 1.0, as it compares an indicator with itself.

f − value and fp3 − index have the strongest correlation between them (0.9715), and the lowest correlation with SCEAS2, 0.8031 and 0.7433 respectively. All other indicators appear to be less correlated with fp3 − index, whereas the strongest correlation appears to be shared between the SCEAS1 and SCEAS2 scores, with both of them reporting values of 0.9999. It is also worth noting that both the Base and the Normalized version of PageRank with a damping factor of 0.50 appear to have the strongest correlation with SCEAS1, in contrast to the Base and Normalized versions of PageRank with a damping factor of 0.85 that report a high correlation amongst themselves.
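For reference, a Spearman rank correlation between two rankings can be computed as the Pearson correlation of the rank values, which also copes with fractional ranks produced by ties. The sketch below is a generic illustration with made-up ranks. It is not the code used to produce Table 7.4, and it assumes that neither ranking is constant.

```python
from math import sqrt


def spearman(rank_x, rank_y):
    """Pearson correlation of two equal-length lists of ranks."""
    n = len(rank_x)
    mean_x = sum(rank_x) / n
    mean_y = sum(rank_y) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(rank_x, rank_y))
    var_x = sum((x - mean_x) ** 2 for x in rank_x)
    var_y = sum((y - mean_y) ** 2 for y in rank_y)
    return cov / sqrt(var_x * var_y)  # undefined if either list is constant


# Hypothetical ranks of four papers under two indicators.
ranks_a = [1, 2, 3, 4]
ranks_b = [2, 1, 3, 4]
rho = spearman(ranks_a, ranks_b)
```

In this example only the top two papers swap places, which gives a correlation of 0.8.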

7.3.2 Author indicators

Table 7.5 shows the number of distinct values produced by each indicator for the 15862 (co-) authors of the papers. For some of the indirect indicators we are displaying two counts, one that is marked as All and it includes all of the publications of a particular author in the calculations and one that is marked as Top which only includes the top 25 publications of any particular author. This method has been previously used in [Sidiropoulos and Manolopoulos, 2005] and we have also applied it to all author indicators that divide by the number of publications considered.

We observe that, in general, the direct indicators have low granularity, with h − index generating only 24 distinct values and the Mean number of Citations (MNC), the most granular in this category, 939 distinct values. The indirect indicators, in general, produce many more distinct values ranging from 7239 for SCEAS2 to 9085 for fa − value. All other indirect indicators produce distinct values that fall in between the previous two counts.
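A minimal sketch of how granularity can be measured: count the distinct values an indicator produces and the size of its largest group of tied items. The function name and the six sample values are hypothetical.

```python
def granularity(values):
    """Return (number of distinct values, size of the largest group of tied items)."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return len(counts), max(counts.values())

# Hypothetical indicator values for six authors.
h_index = [3, 1, 1, 1, 2, 1]
indirect = [0.91, 0.12, 0.15, 0.11, 0.40, 0.13]
print(granularity(h_index))   # (3, 4)
print(granularity(indirect))  # (6, 1)
```

The fewer distinct values an indicator produces, the more authors end up sharing a position in the resulting ranking.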

In Table 7.6 (a), we present the top 10 authors based on the fa − value and in Table 7.6 (b) we present the top 10 authors based on the fa3 − index. For each author we present some summary information about the year of their first and last known publication, along with the total count of their publications that are included in the dataset.

As we can see from the results, the top 10 authors according to the fa − value indicator have had long publishing careers, ranging from a decade for John Miles Smith (with a total of 13 publications) to 30 years for Jeffrey D. Ullman (with a total of 110 papers). With regards to fa3 − index, the top 10 authors do not have a large number of publications included in the dataset, ranging from 1 to 4, but their papers seem to have been considered highly influential in their respective areas.

                                                      # Distinct values
Direct impact
    Number of Citations (NC)                                 368
    Mean number of Citations (MNC)                           939
    h − index                                                 24
    g − index                                                 39
    hc − index                                               466
Indirect impact
    PageRank
        Base, d = 0.50, All (B50A)                          8125
        Base, d = 0.50, Top (B50T)                          8127
        Base, d = 0.85, All (B85A)                          8003
        Base, d = 0.85, Top (B85T)                          8007
        Normalized, d = 0.50, All (N50A)                    7271
        Normalized, d = 0.50, Top (N50T)                    7268
        Normalized, d = 0.85, All (N85A)                    8218
        Normalized, d = 0.85, Top (N85T)                    8220
    SCEAS
        SCEAS1                                              8413
        SCEAS2                                              7239
    fa − value                                              9085
    fa3
        fa3 − index All                                     7532
        fa3 − index Top                                     7531
    fas3
        fas3 − index All                                    7515
        fas3 − index Top                                    7515

Table 7.5: Number of distinct values generated by the author indicators for the DBLP dataset

In Table 7.7 we present the rankings of the top 10 authors according to the fa − value, along with their corresponding ranks for the full list of direct and indirect author indicators included in the experimental study. The top 10 authors occupy relatively high positions in all other indicators, with positions ranging from 1 to 307 among the direct impact indicators and 6 to 700 among the indirect impact indicators.

(a) fa − value ranking

Author                   First pub. year   Last pub. year   Pub. count
E. F. Codd                    1969              1989             25
Michael Stonebraker           1972              1999            131
Philip A. Bernstein           1975              1999             69
Ronald Fagin                  1976              1998             41
Donald D. Chamberlin          1974              2000             25
Jim Gray                      1975              2000             46
Jeffrey D. Ullman             1971              2001            110
David J. DeWitt               1978              2000             95
Eugene Wong                   1971              1991             27
John Miles Smith              1975              1985             13

(b) fa3 − index ranking

Author                   First pub. year   Last pub. year   Pub. count
Vera Watson                   1976              1976              1
Daniel Frank                  1986              1986              1
C. G. Hoch                    1987              1987              1
E. C. Chow                    1987              1987              1
H. P. Cate                    1987              1987              1
J. W. Davis                   1987              1987              1
T. A. Ryan                    1987              1987              1
Christopher L. Reeve          1980              1981              2
Paul R. McJones               1976              1981              3
Patricia P. Griffiths         1976              1976              4

Table 7.6: Top 10 authors according to the fa − value (a) and fa3 − index (b), along with the year of first and last publication included in the dataset and the total number of publications

It is interesting to note that in this table the rankings produced by fa3 − index differ depending on whether we consider the full publication record of the authors or just their top 25 publications. In some cases, for example for Jeffrey D. Ullman with a publication record of 110 papers, the ranking improves significantly, moving him from the 700th position to the 86th. In most cases considering only the top subset of papers improves the ranking of an author, but as we can see from the results included in the list this is not universally true. For example, the position of the top ranked author according to the fa − value, E. F. Codd, actually drops when we only consider his top 25 papers. This could be the result of the publications that were not included being scored high due to their impact in their respective areas.

In addition, the rankings of the authors change slightly when self-citations are taken into account through the fas3 − index indicator. The rankings when all publications are considered change for some of the authors, which implies that some of the citations they have received are self-citations, and a similar change can be observed when we consider only their top 25 publications. We should mention at this point that an unchanged ranking between fa3 − index and fas3 − index does not necessarily mean that an author has received no self-citations; it only means that any self-citations they have received did not change their position relative to the authors that were ranked higher. If we wish to be certain about whether an author has received any self-citations, we should compare not the rankings but the actual values calculated by the two indicators.
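The point about comparing values rather than rankings can be shown with a small sketch. The two dictionaries below are hypothetical stand-ins for the values of an indicator computed with and without self-citations; they are not taken from the dataset, and the helper name is an assumption.

```python
def ordinal_positions(scores):
    """Map each author to a 1-based position, highest score first (no ties assumed)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {author: i + 1 for i, author in enumerate(ordered)}

# Hypothetical indicator values with and without self-citations.
with_self = {"A": 10.0, "B": 6.0, "C": 3.0}
without_self = {"A": 9.0, "B": 6.0, "C": 2.5}

same_rank = ordinal_positions(with_self) == ordinal_positions(without_self)
changed = sorted(a for a in with_self if with_self[a] != without_self[a])
print(same_rank)  # True
print(changed)    # ['A', 'C']
```

Here the positions are identical under both value sets, yet the values of authors A and C differ, so only the value comparison reveals that they received self-citations.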

Table 7.7: Ranking of the Top 10 authors according to the fa − value along with the direct and indirect impact indicator rankings for the DBLP dataset.
Columns: Author; fav; fa3 (All, Top); fas3 (All, Top); NC; MNC; h; g; hc; PageRank Base (50A, 50T, 85A, 85T); PageRank Normalized (50A, 50T, 85A, 85T); SCEAS (1, 2).
Rows: E. F. Codd, Michael Stonebraker, Philip A. Bernstein, Ronald Fagin, Donald D. Chamberlin, Jim Gray, Jeffrey D. Ullman, David J. DeWitt, Eugene Wong, John Miles Smith, Best rank, Worst rank, Median, SD.

In Table 7.8 we present a similar ranking, this time for the top 10 authors according to the fa3 − index. In this table we display a single column for the fa3 − index All and Top, since both types of ranking place the authors in the same positions: the publication record of each of these authors includes fewer than 25 publications in the current dataset, so we should not see any difference between the rankings produced in either case.

The fa − value places these authors further down the rankings, in positions from 149 to 2114, whereas the Mean number of Citations (MNC) places them in high positions that range from 1 to 24.5. The h − index, g − index and hc − index indicators place the authors further down the ranking list, with the worst ranks being close to the bottom of the list (14733 out of 15862 authors for hc − index). These differences are to be expected, since most of these authors have just one publication and, based on the definitions of these indicators, their corresponding values and, therefore, rankings cannot be high.
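The effect of a single-publication record on the direct indicators can be sketched with the standard definitions of the h-index and g-index. The g-index variant below is capped at the number of papers, which is one common convention and an assumption here; the citation counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)

def g_index(citations):
    """Largest g (capped at the number of papers) such that the top g papers
    have at least g * g citations in total."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(ranked, start=1):
        total += c
        if total >= i * i:
            g = i
    return g

def mean_citations(citations):
    """Mean number of citations per paper."""
    return sum(citations) / len(citations)

# Hypothetical citation records.
single_paper = [500]            # one highly cited paper
steady = [10, 8, 5, 4, 3]       # five moderately cited papers

print(h_index(single_paper), g_index(single_paper), mean_citations(single_paper))  # 1 1 500.0
print(h_index(steady), g_index(steady), mean_citations(steady))                    # 4 5 6.0
```

An author with one paper cannot exceed 1 on either index no matter how often that paper is cited, while the mean number of citations rewards exactly that kind of record.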

We observe that when looking at the rankings produced by the indirect impact indicators, the rankings of the authors have improved considerably, now ranging from positions 1 to 84. In particular, there are two authors that the indicators place in lower ranks, Daniel Frank (rankings range from 2 for fa3 − index to 84 in SCEAS2) and Christopher L. Reeve (rankings range from 8 in fa3 − index to 36 for the Base and Normalized versions of PageRank with a damping factor of 0.85 and whilst using the top 25 publications per author in order to produce the ranking).

The indicators appear to be in agreement for the remaining 8 authors, who are placed in positions 1 to 19, and all indirect impact indicators agree that the most influential author in the citation graph is Vera Watson, even though she has co-authored only one paper, titled "System R: Relational Approach to Database Management" and published in 1976. This paper has been co-authored by Vera Watson and 13 other authors, all of whom have more than one paper included in the Paper-Citation graph (publication record counts range from 3 to 46). It is very interesting to note that all indicators place these co-authors further down the ranking list, with at most three of them appearing in the top 10 ranking of any examined indicator. This leads us to assume that all the indicators examined are indeed sensitive to the number of publications included in the publication record of an author. It is also worth noting that the fa3 and fas3 rankings of the authors are identical. This is to be expected for all the authors with only one publication, since they cannot receive a self-citation.

Table 7.8: Top 10 authors according to the fa3 − index along with the direct and indirect impact indicator rankings.
Columns: Author; fa3; fav; NC; MNC; h; g; hc; PageRank Base (50A, 50T, 85A, 85T); PageRank Normalized (50A, 50T, 85A, 85T); SCEAS (1, 2).
Rows: Vera Watson, Daniel Frank, C. G. Hoch, E. C. Chow, H. P. Cate, J. W. Davis, T. A. Ryan, Christopher L. Reeve, Paul R. McJones, Patricia P. Griffiths, Best rank, Worst rank, Median, Standard Deviation.

In order to present some results comparable with those found in the literature for the DBLP dataset, we present the rankings of the SIGMOD Edgar F. Codd Innovations Award winners (1992 - 2004) in Table 7.9. Almost all of these authors have more than 25 papers included in their publication record; thus, we can see from the table that the rankings occupied by these authors differ when we examine their full publication record versus their top 25 papers.

As a whole, the authors included in Table 7.9 rank high among the Direct Impact indicators, with positions ranging from 1 to 165.5; only the Mean number of Citations (MNC) places the authors lower, in positions ranging from 91 to 1062. This is to be expected, since MNC divides by the total number of publications and, as we have already discussed, this approach is sensitive to the number of publications in the scientist's publication record. In addition, for the MNC we report only the ranking that is the equivalent of the All ranking presented for other indicators.

Regarding the indirect impact indicators, it is interesting to note that the authors occupy higher ranks in the fa − value indicator, which places these authors in positions ranging from 2 for Michael Stonebraker to 115 for Patricia G. Selinger. In fact, these positions are the second highest amongst all indicators examined.

For the remaining Indirect Impact Indicators we could say that the authors rank higher when we only consider the top 25 publications of all authors. In particular the authors hold higher positions in the SCEAS 1 & 2 ranks, followed by the rankings produced by PageRank (d = 0.85, base and normalized for the top 25 publications), followed by the fa3 − index (again when using the top 25 publications). The indirect indicators that use the full publication record for these authors place them in lower positions in their ranks.

Finally, the Spearman rank correlation matrix of the author indicators is shown in Table 7.10, where we can see that there is a positive correlation among all indicators. The direct impact indicators present their highest correlation with other direct impact indicators, and their lowest correlations are split between the fa − value and fas3 − index indicators.

From the Indirect Impact Indicators, fa − value reports its lowest correlation with SCEAS 2 and its highest with the Number of Citations. All other indirect impact indicators are highly correlated with their variation (A vs T), and report that their lowest correlation is with the fa − value indicator.

Table 7.9: SIGMOD Edgar F. Codd Innovations Award winners (1992 - 2004) rankings for the DBLP dataset.
Columns: Author; Publications (#); Year (First, Last); fav; fa3 (All, Top); fas3 (All, Top); NC; MNC; h; g; hc; PageRank Base (85T, 85A); PageRank Norm (85T, 85A); SCEAS (1, 2).
Author rows: C. Mohan, David J. DeWitt, David Maier, Donald D. Chamberlin, Hector Garcia-Molina, Jim Gray, Michael Stonebraker, Patricia G. Selinger, Philip A. Bernstein, Rakesh Agrawal, Ronald Fagin, Rudolf Bayer, Serge Abiteboul.
Summary rows, listed per indicator:

Indicator               Best rank   Worst rank   Median   Standard Deviation
fa − value                   2          115        15           31
fa3 − index All            106         2349       450          587.7
fa3 − index Top             19          525       170          171
fas3 − index All           113         2424       458          608.3
fas3 − index Top            20          561       180          177.9
NC                           1           90        12           24.3
MNC                         91         1062       282          259.1
h − index                    1          165.5      14           56.5
g − index                    1          122        12           33.8
hc − index                   1          117.5      17.5         37.9
PageRank Base 85T           24          327       116           87.9
PageRank Base 85A           31         1291       261          361.8
PageRank Norm 85T           23          325       118           87.7
PageRank Norm 85A           31         1287       268          361.3
SCEAS1                      19          293       108           83.1
SCEAS2                      18          298       107           84

Table 7.10: Spearman rank correlation matrix for the author based indicators. An A appended to the name of an indicator denotes All and a T denotes Top 25.
Rows and columns: fav; fa3 A; fa3 T; fas3 A; fas3 T; NC; MNC; h; g; hc; PR B50A; PR B50T; PR B85A; PR B85T; PR N50A; PR N50T; PR N85A; PR N85T; SCEAS1; SCEAS2; followed by the Top Cor. and Low Cor. rows.

7.4 CiteSeerx experimental results

7.4.1 Paper indicators

For each of the indicators presented at the beginning of this chapter we have calculated the raw values for all the papers in the CiteSeerx dataset. In addition, we have generated the ordinal rankings for the papers. In Table 7.11 we present the number of distinct values generated by each indicator for the 180744 papers included in the dataset.

As we can see from the values included in the table, the indicators that only consider the direct impact of a publication produce considerably fewer distinct values for the papers. This means that they are less effective in distinguishing between the papers and identifying their different characteristics. The least sensitive indicator in this category, and overall, is the Number of Citations (NC), which produces 203 distinct values, followed by the Contemporary h-index score (hc − score), which generates 1224 distinct values.

The Indirect impact indicators provide greater granularity with distinct value counts that range from 7802 for fp3 − index to 61215 for SCEAS1. The rest of the indicators produce values somewhere in between but still in much greater numbers than the Direct indicators.

                                                      # Distinct values
Direct impact
    Number of Citations (NC)                                 203
    Contemporary h-index score (hcscore)                    1224
Indirect impact
    PageRank
        Base, d = 0.50 (B50)                               55760
        Base, d = 0.85 (B85)                               60566
        Normalized, d = 0.50 (N50)                         29911
        Normalized, d = 0.85 (N85)                         46257
    SCEAS
        SCEAS 1                                            61215
        SCEAS 2                                            61084
Proposed indicators
    f − value                                              35069
    fp3 − index                                             7802

Table 7.11: Number of distinct values generated by the paper indicators for the CiteSeerx dataset

In Table 7.12, we present the top 10 papers based on the ranking produced by the f − value indicator, along with the rankings these papers hold in the ranks of all paper indicators described in the previous section. Each paper is referred to by its CiteSeerx identifier and we also list part of the title of the papers along with their publication year. In the same table, we also present the citation counts for the first three generations, calculated using the Hm definition, which is used for the f − value calculations and the Gs definition which is used for the fp3 − index.

Table 7.12: Top 10 papers based on the f − value indicator. The table includes the positions these papers have received in the rankings of all other indicators described in section 7.3.1 along with the citation counts for their first three generations according to the Hm and Gs definitions.
Columns: ID; Title; Year; f; fp3; NC; hc; SCEAS (1, 2); PR Base (50, 85); PR Norm (50, 85); Citation counts Hm (g1, g2, g3) and Gs (g1, g2, g3).
Paper rows, in table order: 578766 (Automatic Translation ...), 340126 (New Directions in Cryp...), 589519 (Interprocedural Slicing ...), 19422 (Symbolic Model Checki...), 142710 (Learning to Predict by ...), 17094 (A Really Temporal Logic), 70967 (Efficient Flow-Sensitive ...), 517518 (Interprocedural Data ...), 25394 (Implementing Mathe...), 527057 (Optimization by Simul...).
Summary rows, listed per indicator:

Indicator        Best rank   Worst rank   Median   Standard Deviation
f                    1           10         5.5          2.9
fp3                  1          707        15.5        212.7
NC                   1         1446        28.5        549.8
hc                   1         3641.5      60.5       1131.3
SCEAS 1              1         1565         8.5        564.2
SCEAS 2              1         1686        10          609.9
PR Base 50           1         1352         6.5        484.9
PR Base 85           1         1068         9          369.7
PR Norm 50           1         1346         7          485
PR Norm 85           1         1030         9          358

Under both generation definitions, Hm and Gs, the counts produced for the first generation are identical by definition since, as we have mentioned before, a paper can only reference another paper once and it cannot reference itself. If we examine the second and third generation counts, though, we will see that the Gs counts are in some cases considerably lower than the ones produced by Hm. For example, the 340126 (New Directions in Cryptography) paper has a total of 8457 3-gen citations according to Hm, but only 1728 according to Gs. This could imply that the part of the Paper-Citation graph where the paper is located includes a number of chords, or that each citing paper provides multiple citation paths towards other papers in the same subgraph.
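The gap between the two kinds of counts can be illustrated with a small sketch. It rests on two loudly stated assumptions that may not match the exact Hm and Gs definitions given earlier in the dissertation: the multiset-style count is taken to be the number of citation chains of a given length that end at the target paper, and the set-style count is taken to be the number of papers first reached at that generation. The graph, the function name and the paper labels are hypothetical.

```python
def generation_counts(cited_by, target, depth):
    """Return [(chain_count, new_paper_count)] for generations 1..depth of target.

    chain_count: number of citation chains of that length ending at target
                 (assumed multiset-style count).
    new_paper_count: number of papers first reached at that generation
                     (assumed set-style count).
    """
    chains = {target: 1}  # paper -> number of chains of the current length reaching target
    seen = {target}
    result = []
    for _ in range(depth):
        next_chains = {}
        for paper, n in chains.items():
            for citer in cited_by.get(paper, []):
                next_chains[citer] = next_chains.get(citer, 0) + n
        new_papers = set(next_chains) - seen
        seen |= new_papers
        result.append((sum(next_chains.values()), len(new_papers)))
        chains = next_chains
    return result

# Hypothetical graph: a and b cite t, b also cites a (a chord), c cites both a and b.
cited_by = {"t": ["a", "b"], "a": ["b", "c"], "b": ["c"]}
print(generation_counts(cited_by, "t", 3))  # [(2, 2), (3, 1), (1, 0)]
```

The first generation agrees under both counts, while the chord and the two paths from c inflate the chain count in the second and third generations.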

With regards to the rest of the indicators and the top 10 papers, we can see that the positions of the papers vary, with the papers occupying positions 1 to 3641.5. It is interesting to note, though, that all indicators include paper 527057 (Optimization by Simulated Annealing), which is ranked 10th according to f − value but is ranked first across all other direct and indirect indicators.

In addition, looking at the Worst Ranks generated from the rankings, we can see that even though they range from 707 for fp3 − index to 3641.5 for hc, their corresponding Medians are not that high: 15.5 for fp3 − index and 60.5 for hc. This implies that some of these top 10 papers are ranked high by these indicators while others are ranked in lower positions. Indeed, if we examine the rankings of the individual papers across the table, we will notice that the papers that are ranked considerably lower by the rest of the indicators are papers 70967 (Efficient Flow-Sensitive Interprocedural Computation of Pointer-Induced Aliases and Side Effects) and 517518 (Interprocedural Data Flow Analysis In The Presence Of Pointers, Procedure Variables, And Label Variables).

In Table 7.13, we present the top 10 papers based on the ranking produced by the fp3 − index indicator. The table includes the same columns as Table 7.12, which lists the top 10 papers according to the f − value indicator. Again, the difference in the citation counts across the two definitions, Hm and Gs, is visible across all papers included in the list.

Overall, the top 10 papers according to fp3 − index occupy high positions in the ranks of all other indicators with their worst ranks ranging from 20 for SCEAS1 to 164 for hc.

In addition, apart from f − value, all other direct and indirect indicators agree that the most important paper included in the Paper-Citation graph for our dataset is 527057 (Optimization by Simulated Annealing), which is ranked 10th according to f − value.

In Table 7.14, we present the Spearman rank correlation matrix for all the combinations of paper indicator ranks. For each indicator, the bottom two rows of the table report the indicators that have the highest and lowest correlation with the indicator under scrutiny. The diagonal has a value that is always set to 1.0 as it compares an indicator with itself.

Table 7.13: Top 10 papers based on the fp3 − index indicator. The table includes the positions these papers have received in the rankings of all other indicators described in section 7.3.1 along with the citation counts for their first three generations according to the Hm and Gs definitions.
Columns: ID; Title; Year; f; fp3; NC; hc; SCEAS (1, 2); PR Base (50, 85); PR Norm (50, 85); Citation counts Hm (g1, g2, g3) and Gs (g1, g2, g3).
Paper rows, in table order: 527057 (Optimization by Simulated Annealing), 219179 (The Java Language Specification), 19422 (Symbolic Model Checking: 10 20 Sta....), 581086 (Memory Coherence in Shared Virtu...), 25286 (Bagging Predictors), 25394 (Implementing Mathematics with ...), 578766 (Automatic Translation of FORTRAN ....), 573617 (Symbolic Boolean Manipulation ...), 500489 (Support-Vector Networks), 225734 (Systematic Software Development ...).
Summary rows, listed per indicator:

Indicator        Best rank   Worst rank   Median   Standard Deviation
f                    1          148        27           55.1
fp3                  1           10         5.5          2.9
NC                   1           86.5       9           24.2
hc                   1          164        11           46.8
SCEAS 1              1           20        10            6.5
SCEAS 2              1           21        10.5          6.4
PR Base 50           1           29         9.5          9.9
PR Base 85           1           67        24           24.9
PR Norm 50           1           30         9.5         10.2
PR Norm 85           1           71        24.5         26.1

                                              SCEAS             PageRank Base (B) PageRank Normalized (N)
          f        fp3      NC       hc       1        2        50       85       50       85
f         1.0000   0.9590   0.9760   0.9608   0.9409   0.9394   0.9441   0.9507   0.9409   0.9536
fp3       0.9590   1.0000   0.9371   0.9345   0.8962   0.8948   0.8992   0.9055   0.8965   0.9086
NC        0.9760   0.9371   1.0000   0.9921   0.9521   0.9523   0.9517   0.9501   0.9544   0.9520
hc        0.9608   0.9345   0.9921   1.0000   0.9342   0.9346   0.9332   0.9304   0.9372   0.9329
SCEAS1    0.9409   0.8962   0.9521   0.9342   1.0000   0.9999   0.9997   0.9977   0.9972   0.9922
SCEAS2    0.9394   0.8948   0.9523   0.9346   0.9999   1.0000   0.9995   0.9971   0.9972   0.9914
PRB50     0.9441   0.8992   0.9517   0.9332   0.9997   0.9995   1.0000   0.9989   0.9968   0.9936
PRB85     0.9507   0.9055   0.9501   0.9304   0.9977   0.9971   0.9989   1.0000   0.9944   0.9953
PRN50     0.9409   0.8965   0.9544   0.9372   0.9972   0.9972   0.9968   0.9944   1.0000   0.9909
PRN85     0.9536   0.9086   0.9520   0.9329   0.9922   0.9914   0.9936   0.9953   0.9909   1.0000
Top Cor.  SCEAS2   f        hc       NC       SCEAS2   SCEAS1   SCEAS1   PR B50   SCEAS2   PR B85
Low Cor.  NC       SCEAS2   fp3      PR B85   fp3      fp3      fp3      fp3      fp3      fp3

Table 7.14: Spearman rank correlation matrix for the paper indicators applied to the CiteSeerx dataset

As we can see from the table, f − value reports its lowest correlation with the Number of Citations (NC), fp3 − index with SCEAS2 and Contemporary h-index score (hc) with Base PageRank (d = 0.85). The rest of the paper indicators report their lowest correlation with fp3 − index. The top correlations are much more spread out across the examined paper indicators. f − value, SCEAS1 and the Normalized PageRank (d = 0.50) are most strongly correlated with SCEAS2, whereas SCEAS2 and the Base PageRank (d = 0.50) are most strongly correlated with SCEAS1. No further grouping can be applied to the rest of the indicators, since each of them reports a different indicator as its top correlation.
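The Top Cor. and Low Cor. entries can be derived from a matrix row by taking the largest and smallest off-diagonal value. The sketch below does this for three rows copied from Table 7.14; the function name is hypothetical, and ties (none occur in these rows) would be broken by label order.

```python
labels = ["f", "fp3", "NC", "hc", "SCEAS1", "SCEAS2", "PRB50", "PRB85", "PRN50", "PRN85"]

# Three rows copied from Table 7.14 (CiteSeerx paper indicators).
rows = {
    "NC":    [0.9760, 0.9371, 1.0000, 0.9921, 0.9521, 0.9523, 0.9517, 0.9501, 0.9544, 0.9520],
    "hc":    [0.9608, 0.9345, 0.9921, 1.0000, 0.9342, 0.9346, 0.9332, 0.9304, 0.9372, 0.9329],
    "PRB85": [0.9507, 0.9055, 0.9501, 0.9304, 0.9977, 0.9971, 0.9989, 1.0000, 0.9944, 0.9953],
}

def top_and_low(name, row):
    """Return the labels of the most and least correlated other indicators."""
    others = [(value, label) for label, value in zip(labels, row) if label != name]
    return max(others)[1], min(others)[1]

for name, row in rows.items():
    print(name, top_and_low(name, row))
# NC ('hc', 'fp3')
# hc ('NC', 'PRB85')
# PRB85 ('PRB50', 'fp3')
```

The diagonal entry is skipped because every indicator correlates perfectly with itself.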

7.4.2 Author indicators

Table 7.15 shows the number of distinct values produced by each indicator for the 169403 (co-) authors of the papers. As already discussed for the DBLP dataset the two counts displayed for some of the indicators account for the complete Publication Record or the Top 25 papers of each author.

Again, the indicators that consider only the Direct Impact of the papers produce significantly fewer distinct values, ranging from 21 for h − index to 2668 for the Mean Number of Citations (MNC). This small number of distinct values means that using these indicators makes it harder to distinguish between different authors. The indicators that consider both the direct and indirect impact of the papers included in the Publication Record of the authors produce more distinct values, which range from 45523, for the Normalized version of PageRank (d = 0.50) when only considering the top 25 publications per author, to 64432 for fa − value. The rest of the Indirect indicators produce distinct value counts in between.

                                                      # Distinct values
Direct impact
    Number of Citations (NC)                                 450
    Mean number of Citations (MNC)                          2668
    h − index                                                 21
    g − index                                                 31
    hc − index                                               420
Indirect impact
    PageRank
        Base, d = 0.50, All (B50A)                         58414
        Base, d = 0.50, Top (B50T)                         58414
        Base, d = 0.85, All (B85A)                         61313
        Base, d = 0.85, Top (B85T)                         61310
        Normalized, d = 0.50, All (N50A)                   45526
        Normalized, d = 0.50, Top (N50T)                   45523
        Normalized, d = 0.85, All (N85A)                   55769
        Normalized, d = 0.85, Top (N85T)                   55770
    SCEAS
        SCEAS1                                             63280
        SCEAS2                                             59398
    fa − value                                             64432
    fa3
        fa3 − index All                                    47794
        fa3 − index Top                                    47793
    fas3
        fas3 − index All                                   46911
        fas3 − index Top                                   46910

Table 7.15: Number of distinct values generated by the author indicators for the CiteSeerx dataset

In Table 7.16 (a), we present the top 10 authors based on the fa − value and in Table 7.16 (b) we present the top 10 authors based on the fa3 − index. For each author we present some summary information about the year of their first and last known publication, along with the total count of their publications that are included in the dataset.

With regards to the top 10 authors according to the fa − value indicator, we can note that most authors have had a long scientific career, which in the dataset usually spans more than a decade, and a rich Publication Record, with publication counts that range from 1 for Randy Allen to 71 for Ken Kennedy and Thomas A. Henzinger. It is very interesting to note that Randy Allen is included in the top 10 authors even though he has co-authored only one publication included in the dataset.

We should also note that there were no ties identified in the produced ordinal ranking which places each author in a distinct position from 1 to 10.

With regards to fa3 − index the top 10 author list is quite different and does not overlap with the top 10 fa − value authors at all. The first thing to note with regards to this listing is that all of the authors included in the top 10 have only a single publication included in the dataset, and that these publications are relatively old, since most of them were published before 1990. Another interesting observation is that there are ties between the calculated raw values, which led to the ordinal ranking having 3 authors at the second position, 2 authors at position 5.5 and finally 8 authors at position 11.5, of which we have included the first three in the listing in ascending alphabetical order.
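The fractional positions mentioned above (2, 5.5, 11.5) are what one obtains when tied entities share the mean of the positions they jointly occupy. The following is a minimal sketch of that ranking scheme, not the code used in the study; the function name `average_ranks` and the raw scores are hypothetical and chosen only to reproduce the tie pattern at the top of the listing.

```python
from itertools import groupby


def average_ranks(scores):
    """Rank by descending score; tied items share the mean of their positions."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    position = 1
    for _, group in groupby(ordered, key=lambda item: item[1]):
        tied = [name for name, _ in group]
        # Mean of the positions position .. position + len(tied) - 1.
        shared = position + (len(tied) - 1) / 2
        for name in tied:
            ranks[name] = shared
        position += len(tied)
    return ranks


# Hypothetical raw indicator values: three tied leaders, one single, two tied.
scores = {"A": 9.0, "B": 9.0, "C": 9.0, "D": 7.5, "E": 6.0, "F": 6.0}
ranks = average_ranks(scores)
print(ranks)  # A, B, C share position 2.0; D is at 4.0; E, F share 5.5
```

With three entities tied for positions 1 to 3 the shared position is 2, and with two entities tied for positions 5 and 6 it is 5.5, which matches the pattern reported for the fa3 − index listing.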

(a) fa − value ranking

Author                  First pub. year  Last pub. year  Pub. count  Rank
Ken Kennedy             1987             2002            71          1
Thomas A. Henzinger     1989             2004            71          2
Richard S. Sutton       1988             2003            24          3
Rajeev Alur             1989             2004            50          4
David W. Wall           1986             1994            10          5
Sally Floyd             1991             2004            31          6
Teodor C. Przymusinski  1987             2000            24          7
Randy Allen             1987             1987            1           8
San-qi Li               1990             2001            32          9
Bernhard Nebel          1987             2003            56          10

(b) fa3 − index ranking

Author                  First pub. year  Last pub. year  Pub. count  Rank
C. D. Gelatt            1983             1983            1           2
M. P. Vecchi            1983             1983            1           2
S. Kirkpatrick          1983             1983            1           2
Harlow England          1996             1996            1           4
J. R. Burch             1990             1990            1           5.5
L. J. Hwang             1990             1990            1           5.5
Kaj Lj                  1989             1989            1           7
D. J. Howe              1986             1986            1           11.5
H. M. Bromley           1986             1986            1           11.5
J. F. Cremer            1986             1986            1           11.5

Table 7.16: Top 10 authors according to the fa − value (a) and fa3 − index (b), along with the year of first and last publication included in the dataset and the total number of publications for the CiteSeerx dataset

In Table 7.17 we present the rankings of the top 10 authors according to the fa − value along with their corresponding ranks for the full list of direct and indirect author indicators included in the experimental study. Examining the rankings we would probably say that some of the fa − value top 10 authors appear in the top 10 rankings of other indicators like the Number of Citations (NC), h − index, g − index, Contemporary h-index (hc) and the different variations of PageRank, but the majority of the authors appear in lower positions for the rest of the indicators.

The Best Ranks for the authors vary from position 1 for h − index to 88.5 for the Mean Number of Citations (MNC). If we examine the Worst Ranks the positions vary from 1190.5 for the Number of Citations (NC) to 136230 for the Contemporary h-index (hc).

In Table 7.18 we present a similar ranking but this time for the top 10 authors according to the fa3 − index. In this table we do not present separate columns for the All (full Publication Record) and Top (Top 25 publications) since for these top-10 authors they would present the same values. As discussed earlier, this is due to the fact that these authors only have a single publication included in the dataset.

Another interesting aspect of the fact that these authors only have a single paper in the dataset is that they all occupy the same position in the h − index, g − index and Contemporary h-index (hc) rankings, and they are indeed placed quite low in these rankings as well with positions ranging from 71711.5 for h − index to 136230 for Contemporary h-index (hc).

If we exclude the three indicators just mentioned and fa − value which places these authors from positions 138 to 1662, we would say that these authors occupy high rankings on all other indicators.

Most indicators appear to agree that the Best Rank is position 2 that is shared among the top 3 authors. From this list of indicators the one that disagrees with that is the Number of Citations (NC) that places these authors in position 15. Again, from the same list of indicators the Worst Ranks vary from position 13.5 for the Mean Number of Citations (MNC) to position 282 for the Number of Citations (NC). It is worth noting though that the Worst Ranks according to the PageRank variations and SCEAS1/2 range from 17 to 29 which are rather high.
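The Best rank, Worst rank, Median and SD rows of Tables 7.17 and 7.18 can be recomputed directly from the listed positions. The sketch below does this for the fa3 − index column of Table 7.18 (a). The text does not state which standard deviation is reported; the population form (`statistics.pstdev`) is an assumption made here because it reproduces the tabulated value of 3.79 for this column.

```python
from statistics import median, pstdev

# fa3 − index positions of the ten authors, copied from Table 7.18 (a).
fa3_ranks = [2, 2, 2, 4, 5.5, 5.5, 7, 11.5, 11.5, 11.5]

summary = {
    "best": min(fa3_ranks),    # smallest position = best rank
    "worst": max(fa3_ranks),   # largest position = worst rank
    "median": median(fa3_ranks),
    # Population SD is an assumption; the sample form would give about 3.99.
    "sd": round(pstdev(fa3_ranks), 2),
}
print(summary)  # {'best': 2, 'worst': 11.5, 'median': 5.5, 'sd': 3.79}
```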

Finally, in Table 7.19 we present the Spearman rank correlation matrix produced for the indicators examined based on their calculated values for the CiteSeerx dataset. We should note that there is a positive correlation among all examined indicators. The indicators appear to be least correlated with two of the indicators, with the lowest correlations split between the Contemporary h-index (hc) and fa − value. The top correlations are mostly found within the variations of the same indicator, where for example SCEAS1 is highly correlated with SCEAS2 and vice versa. The same is true for the different variations of PageRank and between fa3 − index and fas3 − index, whereas fa − value is mostly correlated with the Number of Citations (NC), and h − index and hc − index are mostly correlated with g − index.

(a)

Author                  fav       fa3 All   fa3 Top   fas3 All  fas3 Top  NC        MNC       h         g         hc
Ken Kennedy             1         2234      412       2355      438       4         3996      1         2         9
Thomas A. Henzinger     2         3297      661       3514      735       7         4401      7.5       2         4
Richard S. Sutton       3         1988      2011      2021      2042      87        2863      352.5     56        1344.5
Rajeev Alur             4         3774      1555      3950      1616      19        4341      18.5      12        24
David W. Wall           5         471       472       472       473       164       891       604       1016.5    250
Sally Floyd             6         2176      1623      2227      1652      42        3524      35.5      27        50.5
Teodor C. Przymusinski  7         4300      4377      4486      4556      542.5     8568.5    604       334       262
Randy Allen             8         16        16        16        16        1190.5    88.5      71711.5   74855.5   136230
San-qi Li               9         3674      2734      4206      3148      243       7528      110.5     184       206
Bernhard Nebel          10        4639.5    1658      4809      1730      29        7267      18.5      27        25
Best rank               1         16        16        16        16        4         88.5      1         2         4
Worst rank              10        4639.5    4377      4809      4556      1190.5    8568.5    71711.5   74855.5   136230
Median                  5.5       2765.5    1589      2934.5    1634      64.5      4168.5    73        41.5      128.25
SD                      2.9       1481      1233.9    1580.9    1311.9    355.8     2637.3    21456.3   22403.3   40798.3

(b) PageRank (B = Base, N = Normalized; 50 and 85 denote d = 0.50 and d = 0.85; A = All, T = Top) and SCEAS

Author                  B50A      B50T      B85A      B85T      N50A      N50T      N85A      N85T      SCEAS1    SCEAS2
Ken Kennedy             2653      567       1604      353       2673      581       1562      346       695       759
Thomas A. Henzinger     4454      1022      3271      866       4512      1028      3277      854       1073      1123
Richard S. Sutton       2032      2052      1664      1678      2046      2066      1655      1670      2202      2301
Rajeev Alur             4325.5    1720      3137      1336      4270      1683      3135      1337      1873      1922
David W. Wall           513       513       371       372       511       511       369       370       594       629
Sally Floyd             3341      2523      2825      2223      3399.5    2549      2908      2265      2680      2764
Teodor C. Przymusinski  5997      6097      3681      3744      6151      6254      3632      3692      7518      8240
Randy Allen             8         8         6         6         8         8         6         6         17        18
San-qi Li               5645      4082      3376      2543      5825      4143      3334      2492      4947      5362
Bernhard Nebel          6502      2194      4821      1804      6639      2195      4833      1765      2404      2490
Best rank               8         8         6         6         8         8         6         6         17        18
Worst rank              6502      6097      4821      3744      6639      6254      4833      3692      7518      8240
Median                  3833.25   1886      2981      1507      3834.75   1874.5    3021.5    1503.5    2037.5    2111.5
SD                      2131.5    1750.5    1445.8    1096.3    2184.8    1793.7    1447.1    1084.1    2160.7    2369.3

Table 7.17: Ranking of the Top 10 authors according to the fa − value along with the direct and indirect impact indicator rankings for the CiteSeerx dataset.

(a)

Author                  fav       fa3       fas3      NC        MNC       h         g         hc
C. D. Gelatt            278       2         2         15        2         71711.5   74855.5   136230
M. P. Vecchi            278       2         2         15        2         71711.5   74855.5   136230
S. Kirkpatrick          278       2         2         15        2         71711.5   74855.5   136230
Harlow England          1603      4         4         116       4         71711.5   74855.5   136230
J. R. Burch             237.5     5.5       5.5       126.5     5.5       71711.5   74855.5   136230
L. J. Hwang             237.5     5.5       5.5       126.5     5.5       71711.5   74855.5   136230
Kaj Lj                  138       7         7         195.5     7         71711.5   74855.5   136230
D. J. Howe              1662      11.5      11.5      282       13.5      71711.5   74855.5   136230
H. M. Bromley           1662      11.5      11.5      282       13.5      71711.5   74855.5   136230
J. F. Cremer            1662      11.5      11.5      282       13.5      71711.5   74855.5   136230
Best rank               138       2         2         15        2         71711.5   74855.5   136230
Worst rank              1662      11.5      11.5      282       13.5      71711.5   74855.5   136230
Median                  278       5.5       5.5       126.5     5.5       71711.5   74855.5   136230
SD                      690.10    3.79      3.79      105.39    4.64      0.00      0.00      0.00

(b) PageRank (B = Base, N = Normalized; 50 and 85 denote d = 0.50 and d = 0.85) and SCEAS

Author                  B50       B85       N50       N85       SCEAS1    SCEAS2
C. D. Gelatt            2         2         2         2         2         2
M. P. Vecchi            2         2         2         2         2         2
S. Kirkpatrick          2         2         2         2         2         2
Harlow England          17        29        17        29        16        8
J. R. Burch             6.5       17.5      6.5       17.5      6.5       6.5
L. J. Hwang             6.5       17.5      6.5       17.5      6.5       6.5
Kaj Lj                  18        27        18        27        18        17
D. J. Howe              12.5      12.5      12.5      12.5      11.5      12.5
H. M. Bromley           12.5      12.5      12.5      12.5      11.5      12.5
J. F. Cremer            12.5      12.5      12.5      12.5      11.5      12.5
Best rank               2         2         2         2         2         2
Worst rank              18        29        18        29        18        17
Median                  9.5       12.5      9.5       12.5      9         7.25
SD                      5.84      9.24      5.84      9.24      5.55      5.04

Table 7.18: Top 10 authors according to the fa3 − index along with the direct and indirect impact indicator rankings for the CiteSeerx dataset.

[Table 7.19 matrix: pairwise correlations over fav, fa3 (A, T), fas3 (A, T), NC, MNC, h, g, hc, PR B50A, PR B50T, PR N50A, PR N50T, PR B85A, PR B85T, PR N85A, PR N85T, SCEAS 1 and SCEAS 2. Legend: Top Cor., Low Cor.]

Table 7.19: Spearman rank correlation matrix for the author based indicators of the CiteSeerx dataset. An A appended to the name of an indicator denotes All and a T denotes Top 25.

Chapter 8

Summary

The final chapter of this study includes a summary along with the conclusions reached based on both the review of the literature and the detailed examination of a subset of existing and proposed indicators.

The summary contains a review of some of the decisions made during the different stages of the study that have affected the definitions of the proposed indicators, along with a number of observations that were made during the experimental phase, based on the application of the indicators against real bibliographic databases. Finally, the chapter concludes with a detailed listing of the conclusions reached for each of the areas examined by this study and with a list of proposed topics for future work.

8.1 Summary

Defining a common mathematical framework and categorizing a subset of the indicators found in the literature has assisted in making some important observations. Firstly, the common mathematical framework revealed similarities among the indicators. This has been highlighted with the two Hirsch algorithms defined in Section 3.2.1 of the present study. The attempt that was made to categorize the indicators of the Hirsch family has proved that there are indeed two clearly defined algorithms that can be used to successfully describe a subset of the h − index variations. These algorithms could potentially be used to identify or define new variations of h − index that can perhaps consider different characteristics of the participating entities.

Exploring the Paper-Citation graph in detail and the different meta-data information that one can extract from it has led to the algorithmic definition of the steps required to define a Derived graph.

Derived graphs have appeared in the literature and have been used as part of the process of calculating various indicators, but to the best of our knowledge this is the first time that a framework that covers most of them has appeared in the literature. The framework defines additional Derived graphs not yet explored by any other study. The information available from them is significant and it does mean that some of the indicators that have been applied to the Paper-Citation graph, could be ported and calculated in a similar manner to the Author and Journal citation graphs, using one or more of the Derived graphs.

The study also examined indicators found in the literature, and one categorization that was made is that there are indicators that consider only the direct impact and indicators that consider both the direct and indirect impact that a research entity has had. A subset of these indicators was also examined as part of the experimental section of this study, where we examined the values generated by each indicator for two distinct datasets.

The first result that can be reported is that the granularity of information that the Direct Impact indicators provide is significantly lower than what the Indirect indicators provide. This became apparent by the small number of distinct calculated values produced by these indicators. The smaller the number of distinct values the larger the number of papers for which the same value is generated, which means the larger the groups of papers that the indicators consider equivalent with regards to their scientific impact.
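The link between the number of distinct values and the size of the groups that an indicator treats as equivalent can be illustrated with a few lines of code. This is only a sketch: the function name `granularity` and both value lists are hypothetical and are not taken from the DBLP or CiteSeerx results.

```python
from collections import Counter


def granularity(values):
    """Return (number of distinct values, size of the largest tied group)."""
    counts = Counter(values)
    return len(counts), max(counts.values())


# Hypothetical scores for the same eight papers under two kinds of indicator.
direct_like = [3, 3, 3, 2, 2, 1, 1, 1]                            # integer counts
indirect_like = [3.41, 3.40, 3.12, 2.75, 2.10, 1.30, 1.29, 1.05]  # real-valued

print(granularity(direct_like))    # (3, 3): three groups, largest holds 3 papers
print(granularity(indirect_like))  # (8, 1): every paper is distinguishable
```

The fewer distinct values an indicator produces for a fixed set of papers, the larger its tied groups must be on average, which is the effect described above.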

Based on the examination of the experimental results, the rankings produced by the proposed indicators and the chosen list of other indicators all appear to have a positive correlation. On the other hand, the actual positions that different entities are assigned vary significantly among the indicators. With regards to the top ranked entities based on the different indicators, it was observed that the two groups, direct and indirect indicators, do have some entities in common in their top ranks but the majority is different. This implies that the fact that these indicators differ on the basis of impact considered, does actually affect our understanding of which entities are the most important from within each dataset.

Now, as we mentioned earlier, indirect indicators provide more granularity in their generated values and consequently in the rankings of entities that one can produce from these raw values. The common characteristic of all indirect indicators is that they consider both direct and indirect citations.

An aspect that could be highlighted here is exactly which indirect citations one should consider as part of the calculations of an indirect indicator.

As discussed in Section 2.5 there are different ways to actually define the entities included in each generation of citations, and the chosen definition can have a big impact on the generated citation counts for each generation depending on the characteristics of the citation graph used. In Section 4.2.1 we discuss in detail the different definitions and how each definition copes with some of the different characteristics found in a Citation graph, like the existence of cycles and the fact that multiple citation paths can exist between a source and destination node of the citation graph. One approach that we took in the initial definition of f − value was to examine the graph in a pre-processing stage and remove any identified cycles. This was due to the recursive nature of the indicator which could be significantly affected by the existence of different levels of cycles.

The selected definition eliminates the need to pre-process the citation graph since it copes well with the existence of cycles. Based on the selected definition a publication is only included in the generation that is closest to the target publication, so even if it does participate in a cycle it will not affect the citation counts per generation. The same behavior applies when multiple citation paths of different lengths exist between a source and target publication. Since each publication is only considered once, in the closest generation to the target paper, the fact that multiple citation paths of different lengths might exist in the graph will not change the number of indirect citations considered per generation. Finally, the selected definition also copes with the existence of multiple citation paths originating from the same source paper that target the same publication. By including each publication only once per generation the number of citation paths considered from within the graph is also reduced to the unique paths between the source and target pairs.
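The behavior described above corresponds to a breadth-first traversal in which every publication is assigned to the first generation in which it is reached. The following is a minimal sketch under stated assumptions, not the implementation used in the study: the graph is given as a hypothetical `cited_by` mapping from a paper to the papers citing it, the function name `citation_generations` is illustrative, and the target paper itself is assumed to be excluded from all generations even when it takes part in a cycle.

```python
def citation_generations(cited_by, target, k):
    """Return up to k generations of papers citing `target`.

    Each paper is placed only in the generation closest to the target, so
    cycles and multiple citation paths do not inflate the per-generation counts.
    """
    seen = {target}        # assumption: the target never counts as its own citer
    frontier = {target}
    generations = []
    for _ in range(k):
        next_frontier = set()
        for paper in frontier:
            for citing in cited_by.get(paper, ()):
                if citing not in seen:
                    next_frontier.add(citing)   # set membership removes duplicates
        if not next_frontier:
            break
        seen |= next_frontier
        generations.append(next_frontier)
        frontier = next_frontier
    return generations


# Hypothetical graph: P3 reaches P1 directly and through P2 (paths of different
# lengths), P4 reaches P1 through both P2 and P3 (paths of the same length),
# and P1 cites P4 (a cycle).
cited_by = {
    "P1": ["P2", "P3"],
    "P2": ["P3", "P4"],
    "P3": ["P4"],
    "P4": ["P1"],
}
generations = citation_generations(cited_by, "P1", 5)
print([len(g) for g in generations])  # [2, 1]
```

In this example P3 is counted only in the first generation, P4 only once in the second, and the cycle through P1 adds nothing, so no cycle removal step is needed before the traversal.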

With regards to the proposed paper indicators, both f − value and fpk − index are indirect indicators since they both consider more than one generation of citations. f − value is defined recursively and for any given publication it requires the entirety of the Paper-Citation graph to be traversed, whereas the fpk − index indicator can be calculated from the first k generations selected, thus keeping the calculations more localized. One of the strengths of fpk − index with regards to other indirect indicators is also the fact that it uses a specific definition for the generations of citations and the available counts which eliminates some of the side effects of the existence of cycles and multiple citation paths between the source and target publication pairs.

With regards to the proposed author indicators, fa − value is based on f − value, whereas fak − index and fask − index are both based on fpk − index, their main difference being that fask − index excludes author self-citations from the generation citation counts. fa − value appears to be more robust with regards to how much it is affected by the number of publications included in an author's Publication Record, since the actual number of publications is not included in the calculations. On the other hand, fa − value considers two other factors, the number of co-authors of each publication and the year of first publication for the author under scrutiny. These act as indicators of the scientific age of the author and of the number of co-authors that participated in his or her studies. Both fak − index and fask − index are affected by the number of publications of the author since they represent the average fpk − index of the considered publications.

As with other indicators found in the literature, this type of approach is sensitive to authors that have a single or a small number of publications included in the examined dataset that have had particularly high scientific impact. For this type of indicator, limiting the actual number of publications considered to a smaller number that includes the top rated publications of an author can have a significant effect on an author’s raw calculated value and on his or her relative ranking among the rest of the authors of a particular dataset.
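The sensitivity of an averaging indicator to small Publication Records, and the effect of restricting the average to the top rated publications, can be seen in a short sketch. The function name `author_average`, the paper scores and the cut-off of 2 are hypothetical (the experiments in this study use the top 25 publications), and the scores stand in for paper indicator values such as fpk − index rather than being computed by it.

```python
def author_average(paper_scores, top_n=None):
    """Average paper score over all papers, or over the top_n highest scores."""
    if not paper_scores:
        raise ValueError("an author needs at least one publication")
    ranked = sorted(paper_scores, reverse=True)
    considered = ranked if top_n is None else ranked[:top_n]
    return sum(considered) / len(considered)


# Hypothetical paper scores for two authors.
prolific = [8.0, 6.0, 4.0, 2.0]   # four papers of varying impact
single = [7.0]                    # one high-impact paper

print(author_average(prolific), author_average(single))                    # 5.0 7.0
print(author_average(prolific, top_n=2), author_average(single, top_n=2))  # 7.0 7.0
```

When all publications are averaged the single-paper author is ranked above the prolific one; limiting the average to the top publications removes that gap in this example, which is why the All and Top variants can order the same authors differently.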

8.2 Conclusions

Based on the results generated by both the review of the literature and the contributions of the present study we can categorize our conclusions as follows:

Common mathematical notation

The common mathematical notation that was used in order to express the existing indicators found in the literature has assisted us in better understanding the indicators and in identifying their similarities.

Therefore, we believe that the use of a common mathematical notation can assist researchers with the review of the literature and the categorization of the indicators based on the different factors that they consider.

In addition, it has led us to define the Hirsch algorithms proposed by this study that organize the several variations and adaptations of the popular h − index indicator. Apart from categorizing existing indicators, the algorithms can also be used in order to propose new variations that might examine one or more supplementary factors.

Derived graphs

As already discussed the present study proposes the use of the Derived Graphs framework that allows us to construct multiple Author and Journal graphs that are based on the bibliographic information available in the Paper-Citation graph. To the best of our knowledge this is the first time such a framework has been defined in detail.

Derived Graphs can be used with existing indicators that can potentially be applied to different research entities or with existing indicators that can be applied to different graphs. In addition the graphs can also be used to define completely novel indicators or provide variations to already existing ones.

Granularity of generated scores

From the experimental results that were gathered from running the algorithms to calculate the raw indicator values against the DBLP and CiteSeerx bibliographic databases we can draw some conclusions with regards to the number of unique generated values of the different types of indicators.

From our results it is clear that Direct impact indicators generate significantly fewer distinct values for the research entities when compared with the Indirect impact indicators. This means that the Direct impact indicators place a large number of entities in a small number of categories, which implies that all the entities that fall in the same category are considered equal.

Comparison of indicators

In general, all indicators appear to be positively correlated according to the Spearman rank correlation matrix. The correlation appears to be stronger between variations of the same indicator.

In addition the correlations are stronger among the indicators that belong to the same group depending on whether they examine only the Direct or the Indirect impact of a research entity.

Finally, an interesting point to note is that even though all indicators appear to have a positive correlation, the actual individual rankings produced for a research entity according to each of them can vary significantly, with specific entities shifting several positions between the different rankings.
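A strong positive rank correlation and noticeable shifts in individual positions can coexist, as the following sketch shows. It computes the Spearman coefficient as the Pearson correlation of tie-averaged ranks, which is one standard way of handling ties; whether the study used exactly this variant is an assumption, and the function names and both score lists are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Positions by descending value; tied values share the mean position."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical values of a direct and an indirect indicator for five entities.
direct = [50, 30, 30, 10, 5]
indirect = [4.2, 3.9, 1.7, 2.5, 0.8]

rho = spearman(direct, indirect)
print(round(rho, 3))  # 0.821
```

Here the coefficient is clearly positive, yet the third entity moves from a shared position 2.5 under the first indicator to position 4 under the second.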

Generations of citations

There are different ways that one can define the different generations of citations and the definition that one picks can have a big impact on the citation counts produced for a closed set of research entities.

In order to select the definition that we believe better describes the relation between a Source and Target entity we examined three scenarios: (a) the existence of cycles in the citation graph, (b) the existence of multiple citation paths of the same length, and finally, (c) the existence of multiple citation paths of different lengths. The definition that best copes with all three scenarios is the Gs. According to the Gs definition, an entity is included in a generation only if it has not been included in any previous generation, and it appears in that generation only once.

This definition better describes the relation between a Source and Target entity that we believe should express the fact that there is indeed a relation between them and that this relation is considered stronger the closer the Source and Target entities are in the Citation graph.

Proposed paper indicators

As already discussed two paper indicators were proposed by this study. The first one is called f − value and is an indirect recursively calculated indicator that requires full knowledge of the Paper-Citation graph in order to be calculated.

The second one is called fpk − index and is also an indirect indicator. It requires knowledge of the first k generations of citations, thus, it can be considered more localized. It utilizes the Gs definition for generating the citation counts per generation and it also considers the scientific age of a publication.

The number of generations examined can vary depending on the scientific field, the density of the Paper-Citation graph as well as the different citation patterns observed among different scientific fields.

Proposed author indicators

Two author indicators were also proposed, the first one being the fa − value that is based on the calculated f − values of the papers included in the Publication Record of a scientist and the second one being the fak − index that is based on the calculated fpk − index values for the papers.

The fa − value considers the number of co-authors per publication as well as the scientist’s scientific age. In general, the fa − value appears to be less affected by the number of papers included in the Publication Record of the scientist.

The fak − index on the other hand considers the total number of papers included in the Publication Record of a scientist in order to account for the fact that different scientists have different productivity levels (something that could be affected by the scientific age of the individual scientists).

This indicator appears to be more sensitive to the number of an author’s publications. A variation of this indicator was also proposed, called the fask − index, which excludes any author self-citations from the calculated generation citation counts.

8.3 Future work

There are several areas that could be examined in more detail following the results presented in this study. We could briefly define the following:

• Multidisciplinary Paper-Citation graphs: Within the context of the fpk − index indicator we have mentioned that the number k of generations examined could be affected by the different scientific fields examined, due to the different publication and citation patterns they follow. A more detailed study of multidisciplinary Paper-Citation graphs could reveal how the indicators are affected and whether they cope well with papers belonging to different areas.

• Author indicators based on the average paper indicator values of an author’s publications: Following the terminology used for some variations of the Hirsch algorithm let us define the Core of a researcher’s publications as the number of top publications from within the Publication Record that participate in calculating such Author indicators. The aim of this research would be to either identify an Author indicator that considers all publications as part of the Core or identify an indicator that could define the publications that one should consider as part of the Core in a less arbitrary way.

• Application of the proposed paper indicators to the Author-Citation graphs: One might wish to explore the application of the proposed Paper indicators to the Author citation graphs and produce an experimental study of the results. More specifically, the f − value and fpk − index paper indicators could be applied to the derived Author-Citation graphs as is and then produce the equivalent fa − value and fak − index rankings for the Journals that the authors have published their scientific work in.

• Different approaches to how an author’s self-citations should be accounted for: In the present study the self-citations have been defined as Own self citations, which means that a citation is considered to be a self-citation only if the currently examined author is a co-author of both the Source and Target papers. As we have already seen there are other definitions for what constitutes a self-citation that could be examined in order to identify the effect that they have on the calculated results.

• Strengthening the criteria for calculating an entity’s indirect impact: As part of the definition of the fpk − index and our understanding of the meaning of indirect citations we have chosen to follow the Gs definition for the generations of citations. Under this definition, indirect citations express a relation between the Source and Target paper that is considered stronger the closer the two papers are in the Paper-Citation graph. This approach could be extended to add an additional parameter that would allow us to consider an indirect citation between a Source and Target paper only if the citation co-exists with a Chord that clearly defines that the Source paper has been somehow affected by the Target paper. This approach should be thoroughly examined based on generated and real graph data in order to be able to demonstrate the potential benefits and/or weaknesses.
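The Own self citations definition mentioned in the list above depends on the author under examination, not only on the two papers involved. A minimal sketch of that test follows; the function name `is_own_self_citation`, the `authors_of` mapping and the paper and author identifiers are hypothetical and serve only as an illustration.

```python
def is_own_self_citation(author, source, target, authors_of):
    """True if `author` co-authored both the citing (source) and cited (target) paper."""
    return author in authors_of[source] and author in authors_of[target]


# Hypothetical authorship data: P2 cites P1.
authors_of = {
    "P1": {"A1", "A2"},
    "P2": {"A1"},
}

print(is_own_self_citation("A1", "P2", "P1", authors_of))  # True
print(is_own_self_citation("A2", "P2", "P1", authors_of))  # False
```

The same citation is therefore excluded when counting for A1 but kept when counting for A2, which is the (paper, author) level of self-citation handling. Alternative definitions could be explored by swapping in a different predicate.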

Additional broader research areas might also include:

• Investigation of the applicability of Hypergraphs while constructing the Author-Citation graphs

• Investigation of the applicability of the HITS algorithm to the Paper-Citation graph, first examined by Sidiropoulos et al. [2007].

Chapter 9

Publication List

9.1 Journals

1. Eleni Fragkiadaki, Georgios Evangelidis, Nikolaos Samaras, and Dimitris A. Dervos. f-Value: measuring an article’s scientific impact. Scientometrics, 86(3):671–686, March 2011. ISSN 0138-9130. doi: 10.1007/s11192-010-0302-9

2. Eleni Fragkiadaki and Georgios Evangelidis. Review of the indirect citations paradigm: theory and practice of the assessment of papers, authors and journals. Scientometrics, 99(2):261–288, May 2014. ISSN 0138-9130. doi: 10.1007/s11192-013-1175-5

3. Eleni Fragkiadaki and Georgios Evangelidis. Three novel indirect indicators for the assessment of papers and authors based on generations of citations. Scientometrics, 106(2):1–38, February 2016. ISSN 0138-9130. doi: 10.1007/s11192-015-1802-4

9.2 Conferences

1. Eleni Fragkiadaki, Georgios Evangelidis, Nikolaos Samaras, and Dimitris A. Dervos. Cascading Citations Indexing Framework Algorithm Implementation and Testing. Informatics, Panhellenic Conference on, 0:70–74, 2009. doi: 10.1109/PCI.2009.30

List of Figures

2.1 Example of Forward and Backward citations examined for paper P3...... 15

2.2 Cycles encountered in a Paper-Citation graph...... 16

2.3 Example of a Paper-Citation graph...... 17

2.4 Constructed Author-Citation graph...... 24

2.5 Constructed Journal-Citation graph...... 30

2.6 Paper-Citation graph with only Papers and Citations...... 34

2.7 Paper-Citation graph that includes the paper meta-data information...... 37

4.1 Paper-Citation graph...... 86

4.2 Paper-Citation graph with citation paths of different lengths...... 91

4.3 Paper-Citation graph that includes the year of publication for each paper...... 97

4.4 Example Paper-Citation graph with papers of different scientific ages...... 103

5.1 Paper-Citation graph of Figure 4.1 with Publication Years and Author information... 109

5.2 Paper-Citation graph with Year of Publication and Author information...... 111

5.3 Paper-Citation graph with Author and Publication Year information...... 115

6.1 Summary statistics of the publications included in the Paper-Citation graph and the

citations received for each generation of citations identified...... 124

6.2 Summary statistics of the authors and the citations received for each generation of

citations identified...... 125

List of Tables

2.1 Citation weights for the Paper-Citation graph of Figure 2.3

2.2 Paper-Citation table for the Paper-Citation graph of Figure 2.3

2.3 Paper (FUC) and Intermediate Author-Citation graph edge weights

2.4 Paper (FRC) and Intermediate Author-Citation graph edge weights

2.5 Author-Citation graph edge weights when using FUC in the Paper-Citation graph

2.6 Author-Citation graph edge weights when using FRC in the Paper-Citation graph

2.7 Edge weights in the Paper and Intermediate Journal-Citation graphs

2.8 Journal-Citation graph edge weights

2.9 Generations definitions

2.10 Different types of forward citation generations for paper P1

2.11 Paper-Citation table for the Paper-Citation graph of Figure 2.7

2.12 Direct and indirect citation paths for paper P1 of Figure 2.7. Self-citations are considered at the (paper, author) level

2.13 Forward citation generations for (a) paper P1 and author A1 and, (b) for paper P1 and author A2 of Figure 2.7

3.1 Classification of the paper-based indirect indicators

3.2 Factors considered by the Standard Bibliometric indicators

3.3 Weighted increments used in the calculations of the j-index

3.4 Factors considered by the indicators in the Hirsch core approach subcategory

3.5 First Hirsch algorithm indicators and definition details

3.6 Factors considered by the indicators in the First Hirsch algorithm subcategory

3.7 Second Hirsch algorithm indicators with their definition details

3.8 Factors used by the indicators of the First Hirsch algorithm

3.9 Factors considered by the indicators of the Derived h-index subcategory

3.10 Factors considered by the Standalone indicators category

3.11 Classification of the author-based indirect indicators

3.12 Classification of the journal-based direct indicators

3.13 Classification of the journal-based indirect indicators

4.1 Paper-Citation table up to Length 2 for the graph of Figure 4.1

4.2 Iterations of the f-value algorithm for the Paper-Citation graph of Figure 4.1

4.3 Paper-Citation tables for paper P1 presented in Figures 2.2 (a) and (b)

4.4 MSO table for the G and H definitions for paper P1 of Figure 2.2 (a) and (b)

4.5 Paper-Citation table for paper P1 of the Paper-Citation graph of Figure 4.2

4.6 MSO table for the G and H definitions for paper P1 of Figure 4.2

4.7 Paper-Citation table for the graph of Figure 4.3

4.8 MSO table and fp3-index values for the graph of Figure 4.3

4.9 MSO and Categories tables for the papers included in Figure 4.2

4.10 MSO table and indicator values for the Paper-Citation graph of Figure 4.4

4.11 Scores and papers distribution for the four indicators included (Number of citations (NC), Base and Normalized PageRank, f-value and fp3-index)

5.1 Author metadata for the Paper-Citation graph of Figure 5.1 along with the fa-value scores for each author

5.2 MSO table and fp3-index values for the graph of Figure 5.2

5.3 fa3-index values for the co-authors of the papers in the Paper-Citation graph of Figure 5.2

5.4 fas3-index values for the co-authors of the papers in the Paper-Citation graph of Figure 5.2

5.5 (a) MSO and f-values and, (b) MSO and fp3-index values for the papers of Figure 5.3

5.6 MSO and fp3-index values for the (author, paper) pairs of Figure 5.3

5.7 Author-based indicator values for the authors of Figure 5.3

5.8 Author rankings using the fa-value, fa3-index and fas3-index values

6.1 Imported DBLP records per publication type

6.2 Records in the extracted DBLP Paper-Citation graph

6.3 Counts of the CiteSeerx extracted data

6.4 CiteSeerx extracted data that include the number of publications, authors, citations and references

6.5 Counts of the CiteSeerx records included in the Paper-Citation graph

7.1 Number of distinct values generated by the paper indicators for the DBLP dataset

7.2 Top 10 papers based on the f-value indicator for the DBLP dataset

7.3 Top 10 papers based on the fp3-index indicator for the DBLP dataset

7.4 Spearman rank correlation matrix for the paper indicators applied to the DBLP dataset

7.5 Number of distinct values generated by the author indicators for the DBLP dataset

7.6 Top 10 authors according to the fa-value (a) and fa3-index (b), along with the year of first and last publication included in the dataset and the total number of publications

7.7 Ranking of the Top 10 authors according to the fa-value along with the direct and indirect impact indicator rankings for the DBLP dataset

7.8 Top 10 authors according to the fa3-index along with the direct and indirect impact indicator rankings

7.9 SIGMOD Edgar F. Codd Innovations Award winners (1992–2004) rankings for the DBLP dataset

7.10 Spearman rank correlation matrix for the author based indicators. An A appended to the name of an indicator denotes All and a T denotes Top 25

7.11 Number of distinct values generated by the paper indicators for the CiteSeerx dataset

7.12 Top 10 papers based on the f-value indicator for the CiteSeerx dataset

7.13 Top 10 papers based on the fp3-index indicator for the CiteSeerx dataset

7.14 Spearman rank correlation matrix for the paper indicators applied to the CiteSeerx dataset

7.15 Number of distinct values generated by the author indicators for the CiteSeerx dataset

7.16 Top 10 authors according to the fa-value (a) and fa3-index (b), along with the year of first and last publication included in the dataset and the total number of publications for the CiteSeerx dataset

7.17 Ranking of the Top 10 authors according to the fa-value along with the direct and indirect impact indicator rankings for the CiteSeerx dataset

7.18 Top 10 authors according to the fa3-index along with the direct and indirect impact indicator rankings for the CiteSeerx dataset

7.19 Spearman rank correlation matrix for the author based indicators of the CiteSeerx dataset. An A appended to the name of an indicator denotes All and a T denotes Top 25

Bibliography

Open Archives Initiative. URL https://www.openarchives.org.

Attribution-NonCommercial-ShareAlike 3.0 Unported License. URL http://creativecommons.org/licenses/by-nc-sa/3.0.

ODC-BY 1.0 license, a. URL http://opendatacommons.org/licenses/by/summary/.

DBLP Website, b. URL http://dblp.uni-trier.de/db/.

MySQL Website. URL https://www.mysql.com.

PHP Website. URL https://secure.php.net.

CiteSeer, 1997. URL http://citeseer.ist.psu.edu/.

Eigenfactor™ Score and Article Influence™ Score: Detailed methods [version 2.01]. (Accessed via

www.eigenfactor.org/methods.pdf), November 2008.

DBLP - Some lessons learned, volume 2, 2009.

S. Alonso, F.J. Cabrerizo, E. Herrera-Viedma, and F. Herrera. h-Index: A review focused in its variants,

computation and standardization for different scientific fields. Journal of Informetrics, 3(4):273–289,

2009. ISSN 1751-1577. doi: 10.1016/j.joi.2009.04.001.

S. Alonso, F. Cabrerizo, E. Herrera-Viedma, and F. Herrera. hg-index: a new index to characterize the

scientific output of researchers based on the h- and g-indices. Scientometrics, 82:391–400, 2010.

ISSN 0138-9130. doi: 10.1007/s11192-009-0047-5.

Thomas R. Anderson, Robin K. S. Hankin, and Peter D. Killworth. Beyond the Durfee square: enhancing

the h-index to score total publication output. Scientometrics, 76(3):577–588, September 2008. doi:

10.1007/s11192-007-2071-2.

John Antonakis and Rafael Lalive. Quantifying Scholarly Impact: IQp Versus the Hirsch h. Journal

of the American Society for Information Science and Technology, 59(6):956–969, 2008. ISSN

1532-2890. doi: 10.1002/asi.20802.

Gamal Atallah and Gabriel Rodríguez. Indirect patent citations. Scientometrics, 67:437–465, 2006.

ISSN 0138-9130. doi: 10.1007/s11192-006-0063-7.

Pablo D. Batista, Mônica G. Campiteli, and Osame Kinouchi. Is it possible to compare researchers

with different scientific interests? Scientometrics, 68(1):179–189, 2006. ISSN 0138-9130. doi:

10.1007/s11192-006-0090-4.

C. T. Bergstrom. Eigenfactor: Measuring the value and prestige of scholarly journals. C&RL News, 68

(5), 2007.

Carl T. Bergstrom, Jevin D. West, and Marc A. Wiseman. The Eigenfactor™ Metrics. The Journal of

Neuroscience, 28(45):11433–11434, November 2008.

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69

(3):669–687, December 2006. doi: 10.1007/s11192-006-0176-z.

Lutz Bornmann, Rüdiger Mutz, and Hans-Dieter Daniel. Are there better indices for evaluation

purposes than the h index? A comparison of nine different variants of the h index using data

from biomedicine. J. Am. Soc. Inf. Sci. Technol., 59(5):830–837, March 2008. ISSN 1532-2882. doi:

10.1002/asi.v59:5.

Lutz Bornmann, Rüdiger Mutz, Sven E. Hug, and Hans-Dieter Daniel. A multilevel meta-analysis of

studies reporting correlations between the h index and 37 different h index variants. Journal of

Informetrics, 5(3):346–359, 2011. ISSN 1751-1577. doi: 10.1016/j.joi.2011.01.006.

F.J. Cabrerizo, S. Alonso, E. Herrera-Viedma, and F. Herrera. q2-Index: Quantitative and qualitative

evaluation based on the number and impact of papers in the Hirsch core. Journal of Informetrics,

4(1):23–28, January 2010. ISSN 1751-1577. doi: 10.1016/j.joi.2009.06.005.

Rodrigo Costas and María Bordons. Is g-index better than h-index? An exploratory study at the

individual level. Scientometrics, 77(2):267–288, November 2008. ISSN 0138-9130. doi:

10.1007/s11192-007-1997-0.

DBLP. DBLP - Data description. URL http://dblp.uni-trier.de/faq/How+to+parse+dblp+xml.html.

Alex De Visscher. What does the g-index really measure? Journal of the American Society for

Information Science and Technology, 62(11):2290–2293, 2011. ISSN 1532-2890. doi:

10.1002/asi.21621.

Dimitris A. Dervos and T. Kalkanis. cc-IFF: A Cascading Citations Impact Factor Framework for

the Automatic Ranking of Research Publications. In Intelligent Data Acquisition and Advanced

Computing Systems: Technology and Applications, 2005. IDAACS 2005. IEEE, pages 668–673, September

2005. doi: 10.1109/IDAACS.2005.283070.

Dimitris A. Dervos, Nikolaos Samaras, Georgios Evangelidis, and Theodore Folias. A New Framework

for the Citation Indexing Paradigm. Proceedings of the American Society for Information Science

and Technology, 43(1):1–16, 2006a. ISSN 1550-8390. doi: 10.1002/meet.14504301152.

Dimitris A. Dervos, Nikolaos Samaras, Georgios Evangelidis, Jaakko Hyvärinen, and Ypatios Asmanidis.

The Universal Author Identifier System (UAI_Sys). 1st International Scientific Conference, eRA: The

Contribution of Information Technology to Science, Economy, Society and Education, pages

330–337, 16-17 September 2006b. URL http://hdl.handle.net/10150/105755.

L. Egghe. The single publication H-index and the indirect H-index of a researcher. Scientometrics,

88:1003–1004, September 2011a. ISSN 0138-9130. doi: 10.1007/s11192-011-0417-7.

L. Egghe. The single publication H-index of papers in the Hirsch-core of a researcher and the

indirect H-index. Scientometrics, 89:727–739, December 2011b. ISSN 0138-9130. doi:

10.1007/s11192-011-0483-x.

Leo Egghe. Theory and practise of the g-index. Scientometrics, 69(1):131–152, 2006. ISSN 0138-9130.

doi: 10.1007/s11192-006-0144-7.

Leo Egghe and Ronald Rousseau. An h-index weighted by citation impact. Information Processing

& Management, 44(2):770–780, March 2008. ISSN 0306-4573. doi: 10.1016/j.ipm.2007.05.003.

Dalibor Fiala, François Rousselot, and Karel Ježek. PageRank for bibliographic networks.

Scientometrics, 76(1):135–158, 2008. ISSN 0138-9130. doi: 10.1007/s11192-007-1908-4.

Eleni Fragkiadaki and Georgios Evangelidis. Review of the indirect citations paradigm: theory and

practice of the assessment of papers, authors and journals. Scientometrics, 99(2):261–288, May

2014. ISSN 0138-9130. doi: 10.1007/s11192-013-1175-5.

Eleni Fragkiadaki and Georgios Evangelidis. Three novel indirect indicators for the assessment of

papers and authors based on generations of citations. Scientometrics, 106(2):1–38, February 2016.

ISSN 0138-9130. doi: 10.1007/s11192-015-1802-4.

Eleni Fragkiadaki, Georgios Evangelidis, Nikolaos Samaras, and Dimitris A. Dervos. Cascading

Citations Indexing Framework Algorithm Implementation and Testing. Panhellenic Conference on

Informatics, pages 70–74, 2009. doi: 10.1109/PCI.2009.30.

Eleni Fragkiadaki, Georgios Evangelidis, Nikolaos Samaras, and Dimitris A. Dervos. f-Value: measuring

an article’s scientific impact. Scientometrics, 86(3):671–686, March 2011. ISSN 0138-9130. doi:

10.1007/s11192-010-0302-9.

Eugene Garfield. Journal impact factor: a brief review. Canadian Medical Association Journal, 161

(8):979–980, 1999.

Eugene Garfield. The Agony and the Ecstasy - The History and Meaning of the Journal Impact Factor.

In International Congress on Peer Review And Biomedical Publication. Chicago, September 2005.

C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. CiteSeer: An Automatic Citation Indexing

System. Pages 89–98. ACM Press, 1998.

Wolfgang Glänzel. On the Opportunities and Limitations of the H-index. Science Focus, 1:10–11,

2006. doi: 10.1016/j.joi.2007.02.001.

Borja González-Pereira, Vicente P. Guerrero-Bote, and Félix Moya-Anegón. A new approach to the

metric of journals’ scientific prestige: The SJR indicator. Journal of Informetrics, 4(3):379–391, 2010.

ISSN 1751-1577. doi: 10.1016/j.joi.2010.03.002.

Vicente P. Guerrero-Bote and Félix Moya-Anegón. A further step forward in measuring journals’

scientific prestige: The SJR2 indicator. Journal of Informetrics, 6(4):674–688, 2012. ISSN 1751-1577.

doi: 10.1016/j.joi.2012.07.001.

Raf Guns and Ronald Rousseau. Real and rational variants of the h-index and the g-index. Journal

of Informetrics, 3(1):64–71, January 2009. ISSN 1751-1577. doi: 10.1016/j.joi.2008.11.004.

J. E. Hirsch. An index to quantify an individual’s scientific research output. Proceedings of the

National Academy of Sciences of the United States of America, 102(46):16569–16572, November

2005. ISSN 1091-6490. doi: 10.1073/pnas.0507655102.

J. E. Hirsch. Does the h-index have predictive power?, August 2007.

J. E. Hirsch. An index to quantify an individual’s scientific research output that takes into account

the effect of multiple coauthorship. Scientometrics, 85:741–754, 2010. ISSN 0138-9130. doi:

10.1007/s11192-010-0193-9.

Xiaojun Hu, Ronald Rousseau, and Jin Chen. On the definition of forward and backward citation

generations. Journal of Informetrics, 5(1):27–36, 2011. ISSN 1751-1577. doi:

10.1016/j.joi.2010.07.004.

B. Jin, L. Liang, Ronald Rousseau, and Leo Egghe. The R- and AR-indices: complementing the

h-index. Chinese Science Bulletin, 52(6):855–863, 2007. doi: 10.1007/s11434-007-0145-9.

Dimitrios Katsaros, Antonis Sidiropoulos, and Yannis Manolopoulos. Age Decaying H-Index for Social

Network of Citations. In SAW’07, 2007.

M. Kosmulski. A new Hirsch-type index saves time and works equally well as the original h-index. ISSI

Newsletter, pages 4–6, 2006. doi: 10.1371/journal.pone.0059912.

Marek Kosmulski. Hirsch-type approach to the 2nd generation citations. Journal of Informetrics, 4(3):

257–264, 2010. ISSN 1751-1577. doi: 10.1016/j.joi.2010.01.003.

Nan Ma, Jiancheng Guan, and Yi Zhao. Bringing PageRank to the citation analysis. Information

Processing & Management, 44(2):800–810, 2008. ISSN 0306-4573. doi:

10.1016/j.ipm.2007.06.006.

Sergei Maslov and Sidney Redner. Promise and Pitfalls of Extending Google’s PageRank Algorithm

to Citation Networks. The Journal of Neuroscience, 28(44):11103–11105, October 2008. doi:

10.1523/JNEUROSCI.0002-08.2008.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking:

Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999.

John Panaretos and Chrisovaladis Malesios. Assessing scientific research performance and

impact with single indices. Scientometrics, 81:635–670, 2009. ISSN 0138-9130. doi:

10.1007/s11192-008-2174-9.

Gabriel Pinski and Francis Narin. Citation influence for journal aggregates of scientific publications:

Theory, with application to the literature of physics. Information Processing & Management, 12(5):

297–312, 1976. ISSN 0306-4573. doi: 10.1016/0306-4573(76)90048-0.

Gangan Prathap. Is there a place for a mock h-index? Scientometrics, 84:153–165, 2010. ISSN

0138-9130. doi: 10.1007/s11192-009-0066-2.

Filippo Radicchi, Santo Fortunato, Benjamin Markines, and Alessandro Vespignani. Diffusion of

scientific credits and the ranking of scientists. Phys. Rev. E, 80:056103, November 2009. doi:

10.1103/PhysRevE.80.056103.

Filippo Radicchi, Santo Fortunato, and Alessandro Vespignani. Citation Networks, pages 233–257.

Understanding Complex Systems. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-23067-7. doi:

10.1007/978-3-642-23068-4_7.

R. Rousseau. The Gozinto theorem: Using citations to determine influences on a scientific publication.

Scientometrics, 11:217–229, 1987. ISSN 0138-9130. doi: 10.1007/BF02016593.

Ronald Rousseau. New developments related to the Hirsch index. Science Focus, 1(4):23–25, 2006.

Frances Ruane and Richard Tol. Rational (successive) h-indices: An application to economics

in the Republic of Ireland. Scientometrics, 75:395–405, 2008. ISSN 0138-9130. doi:

10.1007/s11192-007-1869-7.

M. Schreiber. Self-citation corrections for the Hirsch index. EPL (Europhysics Letters), 78(3):30002,

2007.

Michael Schreiber. A modification of the h-index: The hm-index accounts for multi-authored

manuscripts. Journal of Informetrics, 2(3):211–216, 2008a. ISSN 1751-1577. doi:

10.1016/j.joi.2008.05.001.

Michael Schreiber. To share the fame in a fair way, hm modifies h for multi-authored manuscripts.

New Journal of Physics, 10(4):040201, 2008b.

Michael Schreiber. The influence of self-citation corrections and the fractionalised counting of

multi-authored manuscripts on the Hirsch index. Annalen der Physik, 18(9):607–621, 2009. ISSN

1521-3889. doi: 10.1002/andp.200910360.

András Schubert. Using the h-index for assessing single publications. Scientometrics, 78:559–565,

2009. ISSN 0138-9130. doi: 10.1007/s11192-008-2208-3.

Antonis Sidiropoulos and Yannis Manolopoulos. A citation-based system to assist prize awarding.

SIGMOD Rec., 34(4):54–60, December 2005. ISSN 0163-5808. doi: 10.1145/1107499.1107506.

Antonis Sidiropoulos, Dimitrios Katsaros, and Yannis Manolopoulos. Generalized Hirsch h-index for

disclosing latent facts in citation networks. Scientometrics, 72:253–280, 2007. ISSN 0138-9130. doi:

10.1007/s11192-007-1722-z.

José M. Soler. A rational indicator of scientific creativity. Journal of Informetrics, 1(2):123–130, 2007.

ISSN 1751-1577. doi: 10.1016/j.joi.2006.10.004.

Cheng Su, YunTao Pan, YanNing Zhen, Zheng Ma, JunPeng Yuan, Hong Guo, ZhengLu Yu, CaiFeng

Ma, and YiShan Wu. PrestigeRank: A new evaluation method for papers and journals. Journal of

Informetrics, 5(1):1–13, 2011. ISSN 1751-1577. doi: 10.1016/j.joi.2010.03.011.

Roberto Todeschini. The j-index: a new bibliometric index and multivariate comparisons between

other common indices. Scientometrics, 87:621–639, 2011. ISSN 0138-9130. doi:

10.1007/s11192-011-0346-5.

Richard Tol. The h-index and its alternatives: An application to the 100 most prolific economists.

Scientometrics, 80:317–324, 2009. ISSN 0138-9130. doi: 10.1007/s11192-008-2079-7.

Richard S.J. Tol. A rational, successive g-index applied to economics departments in Ireland. Journal

of Informetrics, 2(2):149–155, 2008. ISSN 1751-1577. doi: 10.1016/j.joi.2008.01.001.

Nees Jan van Eck and Ludo Waltman. Generalizing the h- and g-indices. Journal of Informetrics, 2

(4):263–271, 2008. ISSN 1751-1577. doi: 10.1016/j.joi.2008.09.004.

Dylan Walker, Huafeng Xie, Koon-Kiu Yan, and Sergei Maslov. Ranking scientific publications using

a model of network traffic. Journal of Statistical Mechanics: Theory and Experiment, 2007(06):

P06010, 2007. doi: 10.1088/1742-5468/2007/06/P06010.

Ludo Waltman and Nees Jan van Eck. The inconsistency of the h-index. Journal of the American

Society for Information Science and Technology, 63(2):406–415, 2012. ISSN 1532-2890. doi:

10.1002/asi.21678.

Ludo Waltman, Nees Jan van Eck, Thed N. van Leeuwen, Martijn S. Visser, and Anthony F.J. van

Raan. Towards a new crown indicator: Some theoretical considerations. Journal of Informetrics, 5

(1):37–47, 2011a. ISSN 1751-1577. doi: 10.1016/j.joi.2010.08.001.

Ludo Waltman, Erjia Yan, and Nees van Eck. A recursive field-normalized bibliometric performance

indicator: an application to the field of library and information science. Scientometrics, 89:301–314,

2011b. ISSN 0138-9130. doi: 10.1007/s11192-011-0449-z.

Jin-kun Wan, Ping-huan Hua, and Ronald Rousseau. The pure h-index: calculating an author’s

h-index by taking co-authors into account. Collnet Journal of Scientometrics and Information

Management, 1:1–5, 2007. doi: 10.1080/09737766.2007.10700824.

Jevin D. West, Michael C. Jensen, Ralph J. Dandrea, Gregory J. Gordon, and Carl T. Bergstrom.

Author-level Eigenfactor metrics: Evaluating the influence of authors, institutions, and countries

within the social science research network community. Journal of the American Society for

Information Science and Technology, 64(4):787–801, 2013. ISSN 1532-2890. doi:

10.1002/asi.22790.

Gerhard J. Woeginger. An axiomatic characterization of the Hirsch-index. Mathematical Social

Sciences, 56(2):224–232, September 2008a. ISSN 0165-4896. doi: 10.1016/j.mathsocsci.2008.03.001.

Gerhard J. Woeginger. An axiomatic analysis of Egghe’s g-index. Journal of Informetrics, 2(4):

364–368, October 2008b. ISSN 1751-1577. doi: 10.1016/j.joi.2008.05.002.

Qiang Wu. The w-index: A measure to assess scientific impact by focusing on widely cited papers.

Journal of the American Society for Information Science and Technology, 61(3):609–614, March

2010. ISSN 1532-2890. doi: 10.1002/asi.21276.

Erjia Yan and Ying Ding. The effects of dangling nodes on citation networks. In Proceedings of the

13th International Conference on Scientometrics and Informetrics, pages 4–8, 2011.

Erjia Yan, Ying Ding, and Cassidy R. Sugimoto. P-Rank: An indicator measuring prestige in

heterogeneous scholarly networks. Journal of the American Society for Information Science and

Technology, 62(3):467–477, 2011. ISSN 1532-2890. doi: 10.1002/asi.21461.

Chun-Ting Zhang. The e-Index, Complementing the h-Index for Excess Citations. PLoS ONE, 4(5):

e5429, May 2009. doi: 10.1371/journal.pone.0005429.