Bibliographic Database Analysis:
Citation Graphs and Indirect Indicators
Eleni Fragkiadaki
Ph.D. Dissertation
Supervised by Georgios Evangelidis
Submitted to the Department of Applied Informatics, School of Information Sciences, University of Macedonia
Thessaloniki, Greece, June 2016
© Copyright by Eleni Fragkiadaki, 2016.
Advisory committee
Georgios Evangelidis (supervisor), Professor
Department of Applied Informatics, University of Macedonia, Greece
Nikolaos Samaras, Associate Professor
Department of Applied Informatics, University of Macedonia, Greece
Dimitris A. Dervos, Professor
Department of Information Technology, Alexander T.E.I. of Thessaloniki, Greece
Examination committee
Dimitris A. Dervos, Professor
Department of Information Technology, Alexander T.E.I. of Thessaloniki, Greece
Georgios Evangelidis (supervisor), Professor
Department of Applied Informatics, University of Macedonia, Greece
Dimitrios Katsaros, Assistant Professor
Department of Electrical & Computer Engineering, University of Thessaly, Greece
Georgia Koloniari, Lecturer
Department of Applied Informatics, University of Macedonia, Greece
Yannis Manolopoulos, Professor
Department of Informatics, Aristotle University, Greece
Antonios Sidiropoulos, Lecturer
Department of Information Technology, Alexander T.E.I. of Thessaloniki, Greece
Nikolaos Samaras, Associate Professor
Department of Applied Informatics, University of Macedonia, Greece
Abstract
Scientific publications with new advances in a vast number of scientific fields are being published and made available to researchers around the world daily. In such an active scientific environment it has become very important for researchers to not only be able to publish their work but to also understand and explore the research performed by other influential scientists. This process of discovery and dissemination of knowledge is one of the areas where Citation Analysis can be of great use. The different techniques, metrics and approaches defined by Citation Analysis allow scientists to identify publications of particular interest, follow the published research of influential scientists and even identify publications that have set the grounds for new research fields.
In our digital era this means that institutions and publishing bodies have a need to store large sets of information in bibliographic databases. The databases hold information about the publications, their respective authors and publishing bodies. Some bibliographic databases also hold the actual published manuscripts and index the publications based on a number of different factors, including the list of provided keywords. Publications are also connected, as they always rely to some extent on previously published research performed by the same or other researchers. Therefore the data stored in these bibliographic databases can be expressed in the form of Graphs and, as we will see later in this dissertation, these Citation graphs can express the relationships that are formed between the different research entities (i.e. publications, authors and publishing bodies).
This dissertation examines the different research entities that participate in the publication process of scientific research in an attempt to classify the existing indicators used to identify influential publications, researchers and publishing bodies. It proceeds by examining the use of Citation Graphs in Citation Analysis and describes in detail the concept of Derived Graphs and the algorithms that can be used to produce them, which constitute part of the contribution of this study. We continue by studying the different definitions of generations of citations and critically evaluate them in order to select the definition that is later used in the set of proposed paper and author indicators. Finally, a list of well-known indicators is examined and compared against the proposed indicators using the data provided by two well-known bibliographic databases, namely CiteSeerx and DBLP.
Keywords: Citation analysis, Bibliographic databases, Indirect indicators, Citation generations, Citation graphs, Derived graphs, Paper-Citation graph, Author-Citation graph, Journal-Citation graph, Hirsch algorithms, Paper indicators, Author indicators, Journal indicators, Self-citations, Indirect citations, Scholarly assessment
Abstract (translated from the Greek)
A large number of scientific publications becomes available daily to academics and researchers around the world. Researchers participate in this highly active scientific environment not only by publishing their own research, but also by searching for information and research sources that present the research of other distinguished scientists. It is precisely in this process of searching for and disseminating scientific information that the field of Citation Analysis can significantly assist the work of researchers. The various techniques, indicators and approaches it defines allow scientists and researchers to identify publications of particular importance, to follow the work of other distinguished scientists, and even to identify publications that laid the foundations for the development of new research areas.
In the digital era in which we live, this means that universities, as well as other bodies that participate in the process of publishing scientific research, need to store a large volume of data in bibliographic databases. The data stored in these databases concern the publications themselves, their authors, and the journals, conferences or other bodies in which these publications appeared. Some bibliographic databases store the published texts and index various fields, including the keywords defined by the publications, thereby facilitating the search for publications. An additional characteristic of publications is that they are connected, since current scientific research is based on previous research carried out either by the same authors or by other researchers in the scientific field in question. Therefore, the data stored in bibliographic databases can be expressed in the form of Graphs which, as we will see later in this dissertation, express the connections among the various scientific entities (publications, authors, journals).
This doctoral dissertation examines the various scientific entities that participate in the process of publishing a piece of scientific research, in an attempt to classify the various indicators already used to identify distinguished publications, authors and journals. It then examines Citation Graphs and their use in the field of Citation Analysis. It presents and examines in depth the concept of Derived Graphs, as well as the algorithms that define the steps by which these graphs can be produced starting from the basic information contained in the Paper-Citation graph. Next, the various ways of defining generations of citations are presented and analyzed in detail in order to arrive at the proposed definition. Finally, based on the selected definition, we define the proposed bibliometric indicators for the assessment of publications and authors, which are subsequently compared with other, existing indicators using the data from two well-known bibliographic databases, namely CiteSeerx and DBLP.
Keywords: Citation analysis, Bibliographic databases, Indirect indicators, Citation generations, Citation graphs, Derived graphs, Paper-Citation graphs, Author-Citation graphs, Journal-Citation graphs, Hirsch algorithms, Paper indicators, Author indicators, Journal indicators, Self-citations, Indirect citations, Research assessment
Contents
1 Introduction
1.1 Overview
1.2 Contribution
1.3 Dissertation organization
2 Citation analysis fundamentals
2.1 Scholarly assessment
2.2 Mathematical notation
2.3 Paper-Citation graph
2.4 Derived graphs
2.4.1 Author-Citation graph
Mathematical notation
Transformations
Example
Known applications
2.4.2 Journal-Citation graph
Example
Known applications
2.5 Indirect Citations
2.5.1 Definitions
2.5.2 Example
2.6 Generations of self-citations
2.6.1 Definition
2.6.2 Example
3 Classifying assessment indicators
3.1 Paper indicators
3.1.1 Direct indicators
3.1.2 Indirect indicators
3.2 Author indicators
3.2.1 Direct indicators
Standard bibliometric indicators
h-index family of indicators
Standalone indicators
3.2.2 Indirect indicators
3.3 Journal indicators
3.3.1 Direct indicators
3.3.2 Indirect indicators
4 Proposed paper indicators
4.1 f-value
4.1.1 Definition
4.1.2 The Reducing Factor (RF)
4.1.3 Algorithm
4.1.4 Example
4.2 fpk-index
4.2.1 Critical evaluation of generations
4.2.2 Definition
4.2.3 Example
4.3 Comparison
4.3.1 f-value and fpk-index
4.3.2 Other well known indicators
First example
Second example
5 Proposed author indicators
5.1 fa-value
5.1.1 Definition
5.1.2 Example
5.2 fak-index
5.2.1 Definition
5.2.2 Example
5.3 fask-index
5.3.1 Definition
5.3.2 Example
5.4 Comparison
6 Bibliographic databases
6.1 DBLP
6.1.1 Data
6.1.2 Parser
6.1.3 Publication data analysis
6.1.4 Author data analysis
6.2 CiteSeerx
6.2.1 Data
6.2.2 cc-IF algorithm
6.2.3 Parser
7 Experimental results
7.1 Comparison indicators
7.1.1 Paper indicators
7.1.2 Author indicators
7.2 Ranking methodology
7.3 DBLP experimental results
7.3.1 Paper indicators
7.3.2 Author indicators
7.4 CiteSeerx experimental results
7.4.1 Paper indicators
7.4.2 Author indicators
8 Summary
8.1 Summary
8.2 Conclusions
8.3 Future work
9 Publication List
9.1 Journals
9.2 Conferences
List of Figures
List of Tables
Bibliography
Chapter 1
Introduction
Scientific research has always been the pioneer for progress and for pushing the limits of our technical abilities and skills. New discoveries, experimental results and theoretical studies all form the basis for future applications in our everyday life. Its importance is undeniable, and the scientific community has always found a strong supporter in the academic sector, with universities from all over the world providing funds and facilities to accommodate all aspects of research. Apart from the academic sector, research is also supported by research centers and organizations, as well as private companies that form and fund internal, focused research groups.
It has thus become ever more important to be able to assess and quickly identify areas of growing scientific interest, influential scientists and high impact scientific publications. A new area of research, called Citation analysis, has been formed to assist with the assessment of these research entities and to provide better insight into the constantly expanding scientific world.
The aim of the present study has been to further explore the concept of Indirect Citations (citations that do not directly target a paper A but are linked with paper A via a citation path of length greater than one) and the impact that they can have on the way we perceive the contributions of the different research entities in their respective scientific fields.
1.1 Overview
Based on the concept of Indirect Citations and on the Cascading Citations Indexing Framework (cc-IF) [Dervos et al., 2006a], a new algorithm was defined [Fragkiadaki et al., 2009] that accepts as input a Paper-Citation graph and recursively calculates the Medal Standings Output (MSO) table for the provided graph. The Cascading Citations Indexing Framework (cc-IF) defines that citations can be examined at any depth, starting from one (which defines a publication's direct citations). The Medal Standings Output (MSO) table lists the number of citations gathered at each depth (generation) for each publication included in the graph, up to a certain depth. According to the framework, these values can be calculated at the publication level alone or at the (author, publication) level, at which point author self-citations can be excluded from the calculations.
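The idea of counting citations per generation can be illustrated with a minimal sketch. It is not the cc-IF algorithm itself: it assumes that each citing paper is counted once, at its shortest citation distance from the target, and the function name `generation_counts` and the toy graph are hypothetical.

```python
from collections import defaultdict

def generation_counts(citations, max_depth):
    """Count citing papers per generation for every paper in the graph.

    citations: iterable of (citing, cited) pairs.
    Assumption: a citing paper is counted once, at its shortest distance.
    Returns {paper: [count at depth 1, ..., count at depth max_depth]}.
    """
    cited_by = defaultdict(set)
    papers = set()
    for citing, cited in citations:
        cited_by[cited].add(citing)
        papers.update((citing, cited))

    table = {}
    for paper in papers:
        seen = {paper}
        frontier = {paper}
        counts = []
        for _ in range(max_depth):
            # Papers citing the current frontier that were not reached earlier.
            frontier = {c for p in frontier for c in cited_by.get(p, ())} - seen
            seen |= frontier
            counts.append(len(frontier))
        table[paper] = counts
    return table

# Toy graph: B and C cite A, D cites B and C, E cites D.
edges = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C"), ("E", "D")]
mso = generation_counts(edges, max_depth=3)
print(mso["A"])  # direct, second-generation and third-generation counts for A
```

Under a definition that counts every citation path separately, paper D would contribute twice to the second generation of A; the choice between such definitions is examined later in the dissertation.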
The definition of the algorithm was followed by an implementation and experimental phase during which the algorithm was executed against the data provided by the CiteSeerx database [1997]. CiteSeerx is a bibliographic database that mainly indexes the fields of computer and information sciences and provides its data under a Creative Commons license using the OAI XML format. The implementation of the algorithm relies on a relational database to read and store data, which means that the CiteSeerx data needed to be parsed and stored in a database using a compatible format. The output of the algorithm is the MSO table for the data provided by the graph that is stored in the relational database [Fragkiadaki et al., 2009].
The study continues to define a new bibliometric indicator that attempts to assist with the assessment of a particular publication, which is called f-value [Fragkiadaki et al., 2011]. The f-value is a recursive indicator, calculated for all the publications included in a Paper-Citation graph, that considers both the direct and indirect impact of a scientific publication. The Reducing Factor (RF), which participates in the mathematical formula that calculates the f-value of a publication, is used in order to reduce the contribution of citing publications that belong to distant generations. An implementation of the f-value is executed against the CiteSeerx data in order to acquire the raw values for the publications included in the Paper-Citation graph. In order to compare the results produced by f-value, a number of additional implementations of already known bibliometric indicators are also provided. The indicators used at this stage are the Number of Citations [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010] and PageRank [Page et al., 1999, Ma et al., 2008].
Based on the f-value, a new indicator is also proposed for the assessment of authors, called fa-value. The main elements considered by fa-value are the f-values of the individual papers that a particular author has co-authored. Apart from these values, the fa-value also takes into account the number of co-authors of each individual publication, as well as the number of years since the author's first publication. The indicator considers the number of years since the first publication as the scientific age of a particular author.
The study continues with a review of the literature, during which a number of indicators are examined in detail. Part of the contribution of this study is in fact the definition of a common mathematical notation used to present the formulas of the examined indicators. Using this common notation highlights the common elements of the indicators and the factors considered by each indicator. In addition, two new algorithms are proposed that cover multiple definitions of indicators that are adaptations of an existing, popular indicator, called h-index [Hirsch, 2005]. The algorithms specify the framework under which these adaptations can be categorized and the way to define new indicators. Furthermore, a number of existing indicators are presented under this common framework.
An additional contribution of the review of the literature is the definition of a framework that defines and produces Derived graphs. The term Derived graphs is used to describe the Author-Citation and Journal-Citation graphs that are generated from the meta-data information available in the Paper-Citation graph. The steps required in order to produce such Derived graphs are defined in detail by the proposed framework [Fragkiadaki and Evangelidis, 2014]. In addition, a list of Derived graphs already found in the literature is presented along with some details about their type and how they have been used.
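To make the notion of a Derived graph concrete, the following minimal sketch produces one possible Author-Citation graph from a Paper-Citation graph and its author meta-data. The choices made here (one weighted edge per author pair, self-loops kept) are assumptions for illustration only; the variations defined by the framework are presented in Chapter 2. The name `author_citation_graph` and the sample data are hypothetical.

```python
from collections import Counter

def author_citation_graph(paper_authors, citations):
    """Derive weighted author -> author edges from paper -> paper citations.

    paper_authors: {paper_id: [author, ...]}
    citations: iterable of (citing_paper, cited_paper) pairs.
    Assumption: every citing author is linked to every cited author, and
    repeated links are collapsed into an integer edge weight.
    """
    edges = Counter()
    for citing, cited in citations:
        for citing_author in paper_authors.get(citing, ()):
            for cited_author in paper_authors.get(cited, ()):
                edges[(citing_author, cited_author)] += 1
    return edges

paper_authors = {"P1": ["Ann"], "P2": ["Bob", "Ann"], "P3": ["Cy"]}
citations = [("P2", "P1"), ("P3", "P1"), ("P3", "P2")]
graph = author_citation_graph(paper_authors, citations)
for (source, target), weight in sorted(graph.items()):
    print(source, "->", target, "weight", weight)
```

A Journal-Citation graph could be derived in the same way by replacing the author lists with the publishing body of each paper.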
Following the review of the literature, three new Indirect Indicators are defined, one of which is used in the assessment of scientific publications and two for the assessment of authors [Fragkiadaki and Evangelidis, 2016]. The Indirect paper indicator is called fpk-index and is defined as an indirect indicator since it considers the first k generations in its calculations. The number k of generations to use can vary based on the characteristics of each individual Paper-Citation graph, particularly with regard to the different citation patterns encountered in different scientific fields. For the purposes of the experimental study of the three indicators, the first three generations of citations are utilized.
It should be noted that an important aspect of the fpk-index definition is which citations are included in the generational citation counts for any particular publication, since, depending on the chosen definition, the calculated values can differ significantly. Hu et al. [2011] define four different types of generations that can be applied to forward or backward generations of citations. A critical evaluation of the four definitions is performed and the definition that performed best under the examined scenarios is selected. In brief, the scenarios covered the following: (a) the existence of citation cycles, (b) the existence of multiple citation paths of different lengths between a pair of publications, and (c) the existence of multiple citation paths of the same length between a pair of publications.
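The three scenarios can be reproduced on small constructed graphs. The sketch below counts, for a target paper, the number of citation paths of an exact length that start at each citing paper. This path-counting convention is only one illustrative possibility and is not claimed to coincide with any of the four definitions of Hu et al. [2011]; the function names and graphs are hypothetical.

```python
from collections import defaultdict

def cited_by_index(citations):
    """Map each cited paper to the list of papers citing it."""
    index = defaultdict(list)
    for citing, cited in citations:
        index[cited].append(citing)
    return index

def path_counts(citations, target, length):
    """Return {paper: number of citation paths of exactly `length` to target}."""
    index = cited_by_index(citations)
    counts = {target: 1}
    for _ in range(length):
        step = defaultdict(int)
        for paper, n in counts.items():
            for citing in index.get(paper, ()):
                step[citing] += n
        counts = dict(step)
    return counts

# (a) citation cycle: A and B cite each other.
cycle = [("A", "B"), ("B", "A")]
# (b) paths of different lengths: C reaches A directly and through B.
mixed_lengths = [("B", "A"), ("C", "B"), ("C", "A")]
# (c) paths of the same length: D reaches A through B and through C.
same_length = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]

print(path_counts(cycle, "A", 2))          # A appears in its own second generation
print(path_counts(mixed_lengths, "A", 1))  # C is a first-generation citer ...
print(path_counts(mixed_lengths, "A", 2))  # ... and also a second-generation citer
print(path_counts(same_length, "A", 2))    # D is reached by two distinct paths
```

Each scenario shows a case where the generational count of a publication depends on whether papers are counted once or once per path, which is why the choice of definition matters.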
The next two indicators proposed, namely fak-index and fask-index, are both based on the fpk-index values defined for the individual publications of a particular author. Both indicators are defined as the average fpk-index value of the publications considered as part of the publication record of an author, and their main difference is that fask-index excludes any self-citations from the calculations of the fpk-index values. In general, a citation has been considered a self-citation for a particular author if the same author has participated in both the citing and the cited publication. As we can deduce from this definition, self-citations were considered at the (author, publication) level, and therefore a citation can be a self-citation for only a subset of the co-authors of a particular publication.
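The (author, publication) level definition of a self-citation, and the averaging step shared by the two author indicators, can be sketched as follows. The fpk-index values used here are made-up placeholders, since the formula is only defined in Chapter 4; the helper names are hypothetical.

```python
def is_self_citation(author, citing, cited, paper_authors):
    """A citation is a self-citation for `author` if the author
    participated in both the citing and the cited publication."""
    return author in paper_authors[citing] and author in paper_authors[cited]

def mean_indicator(values):
    """Average of per-publication indicator values (0.0 for an empty record)."""
    return sum(values) / len(values) if values else 0.0

paper_authors = {"P1": {"Ann", "Bob"}, "P2": {"Ann", "Cy"}}

# P2 cites P1: a self-citation for Ann, but not for her co-author Bob.
print(is_self_citation("Ann", "P2", "P1", paper_authors))
print(is_self_citation("Bob", "P2", "P1", paper_authors))

# Placeholder per-paper values standing in for fpk-index results.
placeholder_values = [2.0, 4.0]
print(mean_indicator(placeholder_values))
```

Because the test is evaluated per author, the same citation is excluded when computing the value for one co-author and retained for another.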
The study continued by implementing and executing all proposed indicators against the data provided by two bibliographic databases, namely CiteSeerx, mentioned earlier, and DBLP. The DBLP database is a bibliographic database that mainly indexes publications from the Computer Sciences field. DBLP also provides its data [DBLP, dbl, 2009] under a Creative Commons license in XML format. During this phase of the study a common relational schema is defined along with the criteria under which the XML records are considered valid and "complete". In general, records are considered "complete" if they provide all information required for calculating the values of the different indicators examined. The final step of this process is the actual storing of the data in two relational MySQL databases for further processing.
In order to compare the results produced by the proposed indicators, a set of existing bibliometric indicators is defined, implemented and executed against the same databases. The chosen indicators are a mixture of Direct and Indirect indicators and they cover the assessment of both papers and authors. More specifically, the paper indicators examined are the Number of Citations, the PageRank (d = 0.50, d = 0.85), the Contemporary h-index [Sidiropoulos et al., 2007] and the SCEAS rank [Sidiropoulos and Manolopoulos, 2005]. The set of author indicators examined are the Number of Citations, the Mean Number of Citations [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010], the h-index [Hirsch, 2005, 2007], the g-index [Egghe, 2006], the Contemporary h-index, the PageRank (d = 0.50, d = 0.85) and finally the SCEAS rank.
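Two of the author indicators listed above have short, widely used definitions that can be sketched directly: the h-index is the largest h such that h papers have at least h citations each, and the g-index is the largest g such that the g most cited papers have together at least g squared citations. The sketch below assumes that g is bounded by the number of papers; the function names and the sample record are hypothetical.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)

def g_index(citation_counts):
    """Largest g such that the top g papers have at least g*g citations in total.
    Assumption: g cannot exceed the number of papers."""
    ranked = sorted(citation_counts, reverse=True)
    total, g = 0, 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g

record = [10, 8, 5, 4, 3]  # citations per paper of a hypothetical author
print(h_index(record), g_index(record))
```

For the sample record the two indicators differ because the g-index credits the surplus citations of the most cited papers, whereas the h-index ignores them.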
As already mentioned, the raw values of all indicators are calculated and stored in the relational databases. From the examination of the values it became apparent that, because of their different formulas, the indicators produce values that can vary in scale, thus making a direct comparison of the produced values impossible. In order to make the comparison of the indicators possible, the indicator values were used to generate ordinal rankings of the scientific entities; these rankings can be compared directly, since they all share the same scale.
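The conversion from raw indicator values to an ordinal ranking can be sketched as follows. The tie-breaking rule (by entity identifier) is an assumption of this sketch, and the indicator values are invented for illustration; the ranking methodology actually used is described in Chapter 7.

```python
def ordinal_ranking(values):
    """Map each entity to its position (1 = best) by descending indicator value.
    Assumption: ties are broken by the entity identifier."""
    ordered = sorted(values.items(), key=lambda item: (-item[1], item[0]))
    return {entity: position for position, (entity, _) in enumerate(ordered, start=1)}

# Invented raw values on very different scales for the same three papers.
number_of_citations = {"A": 120, "B": 15, "C": 40}
pagerank_like_score = {"A": 0.31, "B": 0.05, "C": 0.42}

rank_nc = ordinal_ranking(number_of_citations)
rank_pr = ordinal_ranking(pagerank_like_score)
print(rank_nc)
print(rank_pr)
```

Once both indicators are expressed as positions, the two rankings can be compared entity by entity even though the raw values are not commensurable.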
1.2 Contribution
The main contributions of the present study are:
the definition of the cc-IF algorithm that generates the Medal Standings Output (MSO) table for a given Paper-Citation graph
the definition of the two Hirsch algorithms that can be used to describe existing and define new variations of the h-index author indicator
the presentation of a number of indicators under a common mathematical notation in order to categorize them and highlight the common factors considered by the different direct and indirect indicators
the definition of a framework that algorithmically specifies the steps required to generate Derived graphs from any given Paper-Citation graph
the definition of two new indirect paper indicators called f-value and fpk-index that utilize the information present in a Paper-Citation graph to produce a single value that represents the scientific impact of a particular publication
the definition of three new indirect author indicators called fa-value, fak-index and fask-index, the first of which is based on f-value and the remaining two on fpk-index
1.3 Dissertation organization
The dissertation consists of eight chapters, a detailed description of which can be found below.
Chapter 2 presents the fundamental concepts of Citation analysis and identifies the different research entities that participate in the different stages of scientific research. A common terminology and mathematical notation is defined that will be used throughout this study to identify and describe the different entities. In addition, a listing of some of the most common metrics that have been used to perform scholarly assessment is presented, in order to assist us in identifying which characteristics of the different research entities can be used for scholarly assessment. The Paper-Citation graph is described in detail and a new framework for defining and constructing Derived graphs is presented. As part of the examination of the Paper-Citation graph, the Indirect citations paradigm is described in detail, along with the adopted framework that describes the different ways of defining these indirect citations. Finally, the concept of self-citations is described.
Chapter 3 presents indicators found in the literature that have been used in the assessment of the different research entities. The indicators are categorized according to different aspects of their definition and the type of entity whose impact they attempt to assess. The main categorization is based on the type of entity, thus the chapter is separated into three sections, one for Paper, one for Author and one for Journal indicators. As a secondary categorization, the indicators for each different type are separated based on whether they consider only the Direct impact or both the Direct and Indirect impact an entity has had in its respective scientific area and field. Finally, specifically for the Author based indicators, a novel categorization of the h-index [Hirsch, 2005] based indicators is presented along with the two algorithms that can be used in order to define new variations.
Chapter 4 is a detailed presentation of the proposed paper indicators. Both of the proposed paper indicators are based on the Indirect citations concept. The first indicator proposed, named f-value, depends on determining the citation counts for the first two generations of citations across the entire Paper-Citation graph and is calculated recursively for all the publications included in the graph. The second indicator proposed, named fpk-index, depends on the chosen definition of the different generations of citations and the citation counts of the first k generations for each publication. This chapter also contains a comparison of the two indicators among themselves and a comparison of the indicators with other well-known indicators found in the literature.
Chapter 5 presents the three proposed author indicators. The first indicator proposed, named fa-value, is based on the knowledge of the f-values of the publications included in the Publication Record of a particular author. Apart from the f-values, the indicator also considers the number of co-authors of each publication and the first publication year of any particular author. The remaining two indicators, named fak-index and fask-index, are both based on the fpk-index values of the publications included in the calculations and the total number of publications included in the calculations. The difference between the two is that for the calculations of fask-index we need to exclude an author's self-citations. The chapter also includes example applications of the three indicators on constructed Paper-Citation graphs, as well as a comparison of the three indicators performed over the same graph.
Chapter 6 is a presentation of the bibliographic databases examined during this study, namely DBLP and CiteSeerx. The data provided by each database were parsed and stored in a unified format. For the DBLP database, the chapter also includes a summary analysis of the paper and author generations of citations. For CiteSeerx, the cc-IF algorithm that was used to produce the MSO table is also presented.
Chapter 7 presents the list of chosen bibliometric indicators that were implemented alongside the indicators proposed by this study. The results of executing all implemented indicators against both bibliographic databases are presented, and since the data were stored in a unified format, the same implementations were used and similar sets of results are presented for both datasets. For each indicator we present the raw calculated values and a generated ranking based on these values.
Finally, in Chapter 8 we present the conclusions of the present study.
Chapter 2
Citation analysis fundamentals
Different research entities participate in the full cycle of a piece of scientific research becoming known to the wider scientific community. Obviously the most fundamental part of the process is the research itself, conducted by individual researchers, followed by its publication by one of the publishing bodies, such as a scientific journal or conference, at which point it becomes available to the rest of the scientific community.
Since published research work is the means by which knowledge is disseminated in the scientific community, we consider this to be the primary research entity. All peer-reviewed scientific documents that appear in the literature, and are available to other researchers to examine, are considered by the present study. Examples of such documents are technical reports, published clinical research results, articles in scientific journals, papers that appear in conference proceedings, Master's and PhD theses, etc. We are going to use the terms publication or paper to describe any of the above scientific documents.
Publications carry knowledge about a specific scientific topic and they also provide additional information that can be used to identify relations between different research entities. So, if we consider the text of a publication to be the data that it carries, then all accompanying pieces of information form the meta-data for that publication. Part of the meta-data for a particular publication would be:
the title of the publication
the list of co-authors
the list of references
the year of publication
and the publishing body
From the meta-data list we can derive two more research entities whose assessment forms part of the citation analysis fundamentals, namely the authors and the publishing bodies. We are going to use the term author to describe all parties involved in a particular research whose names form the list of co-authors of a publication. Therefore individual researchers, research groups, students, professors, research organizations and analysts would all be referred to as authors for the remainder of this document. Finally, we are going to refer to all peer-reviewed published collections of scientific documents as publishing bodies or journals, as they are more commonly referred to in the literature. In this category we would place all printed or online scientific journals, conference proceedings volumes, institutional and open-access repositories, etc.
Accessing the meta-data for a set of publications means that we can identify relationships among the different research entities like for example:
the fact that one publication refers to another publication
the fact that an author has co-authored more than one publication
the fact that two publications have been co-authored by the same person
the fact that two publications have been published by the same publishing body
the fact that two authors have co-authored one or more scientific publications together
and so on.
Different types of indicators have been proposed in the literature that allow us to quantify these relationships in order to produce a meaningful ranking of the relative importance of the different research entities. Some of these indicators are uniquely defined, whereas others try to improve some aspect of an already existing indicator.
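Several of the relationships listed above can be recovered mechanically from the meta-data. The following minimal sketch assumes a hypothetical record layout with `authors` and `venue` fields and groups publications by author, by publishing body and by pair of co-authors; it is an illustration and not the relational schema used in this study.

```python
from collections import defaultdict
from itertools import combinations

def relationships(records):
    """Group publications by author, by publishing body and by co-author pair.

    records: {paper_id: {"authors": [...], "venue": str}} (hypothetical layout).
    """
    papers_by_author = defaultdict(set)
    papers_by_venue = defaultdict(set)
    joint_papers = defaultdict(set)
    for paper_id, record in records.items():
        papers_by_venue[record["venue"]].add(paper_id)
        for author in record["authors"]:
            papers_by_author[author].add(paper_id)
        # Every unordered pair of co-authors shares this publication.
        for pair in combinations(sorted(record["authors"]), 2):
            joint_papers[pair].add(paper_id)
    return papers_by_author, papers_by_venue, joint_papers

records = {
    "P1": {"authors": ["Ann", "Bob"], "venue": "J1"},
    "P2": {"authors": ["Ann", "Bob", "Cy"], "venue": "J1"},
    "P3": {"authors": ["Cy"], "venue": "C1"},
}
by_author, by_venue, joint = relationships(records)
print(sorted(by_author["Ann"]))       # publications co-authored by Ann
print(sorted(by_venue["J1"]))         # publications of the same publishing body
print(sorted(joint[("Ann", "Bob")]))  # publications Ann and Bob wrote together
```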
Part of the contribution of the present study has been to review the literature in an attempt to provide a framework that would allow us to categorize and classify the existing bibliometric indicators based on their individual characteristics. In addition, a novel framework has been created that allows us to further utilize the information present in the meta-data of the publications in order to produce different types of citation graphs.
The rest of this chapter is structured as follows: In Section 2.1 we present some of the most common metrics that appear in the literature that have been used for scholarly assessment. In Section 2.2 we present the fundamental mathematical notations that we are utilizing throughout the present study to refer to the different characteristics of the research entities examined. In Section 2.3 we present the Paper-Citation graph and examine its basic properties. In Section 2.4 we describe in detail the concept of the Derived Graphs, both for the Author and Journal research entities, we define the framework to generate the different variations of these graphs and we refer to the literature to identify which variations have already appeared in other studies. In Section 2.5 we discuss the different ways in which we can define the Indirect Citations in a Paper-Citation graph. Finally, in Section 2.6 we describe the Generations of Self-Citations.
2.1 Scholarly assessment
Bibliometric indicators are numbers (unique or not) that capture the past achievements of a researcher. They are used in evaluations, on the premise that a researcher who has been successful in the past is expected to be successful in the future. The data used for obtaining measures of scholarly impact for a researcher are mainly his publication record along with the citations that these publications have received. In some cases external factors are also considered, like the Impact Factor [Garfield, 1999, 2005] of the publishing bodies and/or information about the scientific field in which he is currently active.
Below, we list the factors that can be obtained from the available meta-data and that are used in the literature to define various bibliometric indicators.
Number of papers: The number of papers a researcher has (co-)authored can be a measure of how productive he has been during his scientific career. It is a measure that has been used/mentioned in many articles [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010].
Number of citations: The cumulative number of citations a researcher has received for all the papers that he has (co-)authored can be a measure of scholarly impact and has been used/mentioned in many articles [Hirsch, 2005, 2007, Costas and Bordons, 2008, Wu, 2010]. When used as an author indicator the total Number of Citations (NC) has also been referred to as the s-index [van Eck and Waltman, 2008] and the c-method [Wu, 2010].
Scientific age: The number of years passed since the researcher published his first paper. It has been argued that a bibliometric indicator should account for the scientific age of a researcher, otherwise, younger promising researchers will not get the proper recognition until their achievements become comparable to the ones of the scientifically older researchers of their respective fields, both in the productive and impact scales [Hirsch, 2005, Glänzel, 2006, Jin et al., 2007, Antonakis and Lalive, 2008].
Age of individual papers: The number of years passed since each individual paper was published.
In general, the value of a paper, as perceived by its citation count, can only increase over time.
It has been argued that cases where a researcher relies solely on his already published papers without keeping on producing equally important papers, should be detected and accounted for by bibliometric indicators. This way, it is possible to distinguish between currently active and inactive researchers [Glänzel, 2006, Sidiropoulos et al., 2007, Katsaros et al., 2007, Jin et al., 2007].
Age of individual citations: The age of the individual citations that each of the papers has received.
In general, the age of the individual citations received by a paper can demonstrate the current impact of the paper. A paper that keeps on receiving citations can be considered to have a higher current impact than a paper that has stopped receiving citations. It has been argued that bibliometric indicators should consider the age of individual citations to distinguish between currently active and inactive papers [Sidiropoulos et al., 2007, Katsaros et al., 2007, Jin et al., 2007].
Self-citations: We would say that paper B cites paper A, if paper A is present in the list of references provided by paper B. Self-citations can occur for the authors of a paper A, when there is an overlap between the (co-)authors of paper A and the authors of a citing paper, B. Let us now consider paper A, co-authored by researchers U, W and Z, and a set of citing papers, and, let us assume that we are interested in the self-citations of author U. The following cases have been identified:
Own self-citations: The number of citations of paper A that include U in their author list. These
have also been referred to as (author, article) level self-citations [Dervos and Kalkanis, 2005],
self-citations [Hirsch, 2005, Kosmulski, 2006], and self-citations of the first kind [Schreiber, 2007].
Co-author self-citations: Calculate the number of own self-citations of each of the co-authors of U for paper A, namely W and Z. The co-author self-citations are defined as the sum of all own self-citations of the co-authors of U for paper A. These have also been referred to as self-citations of the second kind [Schreiber, 2007].
All self-citations: Identify the number of citations that include at least one of the (co-)authors
of A in their author list. Citations that include more than one of the co-authors of A in
their author list are accounted for once. These have also been referred to as article level
self-citations [Dervos and Kalkanis, 2005] and as self-citations of the third kind [Schreiber, 2007].
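The three cases can be made concrete with a small sketch. The citing papers B1 to B4 and their author lists are hypothetical, and the function names are illustrative assumptions rather than anything defined in the cited works.

```python
# Hypothetical data: paper A is co-authored by U, W and Z.
AUTHORS_OF_A = {"U", "W", "Z"}

# Author lists of the papers that cite A (illustrative only).
CITING_PAPERS = {
    "B1": {"U", "X"},   # includes U
    "B2": {"W", "Z"},   # includes two co-authors of U
    "B3": {"U", "W"},   # includes U and one co-author
    "B4": {"X", "Y"},   # no overlap with the authors of A
}


def own_self_citations(author, citing_papers):
    """Citations of A whose author list includes `author`."""
    return sum(1 for authors in citing_papers.values() if author in authors)


def coauthor_self_citations(author, authors_of_a, citing_papers):
    """Sum of the own self-citations of every co-author of `author` on A."""
    return sum(
        own_self_citations(coauthor, citing_papers)
        for coauthor in authors_of_a - {author}
    )


def all_self_citations(authors_of_a, citing_papers):
    """Citations sharing at least one author with A, each counted once."""
    return sum(1 for authors in citing_papers.values() if authors & authors_of_a)


own = own_self_citations("U", CITING_PAPERS)
co = coauthor_self_citations("U", AUTHORS_OF_A, CITING_PAPERS)
total = all_self_citations(AUTHORS_OF_A, CITING_PAPERS)
print(own, co, total)  # 2 3 3
```

Note that B2 contributes twice to the co-author count (once for W, once for Z) but only once to the all self-citations count.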
Co-authors: Both the number and order of authors included in the author list of a paper have been associated with publication patterns. It has been argued that a bibliometric indicator should try to assign different credit to each of the contributors based on the number and ordering of co-authors.
Different types of orderings can be identified and the following cases are only some of the possible scenarios: (a) contributors are listed alphabetically, (b) the most important contributor is listed first (or last) with the rest listed either by contribution or alphabetically, or, (c) contributors usually publishing together can use a rotating scheme in the ordering of the author list. Thus, when a bibliometric indicator needs to take paper co-authorship into account it can do so by considering either the number of authors alone [Batista et al., 2006, Hirsch, 2005, Schreiber, 2008a,b], or the number and order of authors [Wan et al., 2007], or the number of common co-authors [Fiala, Rousselot, and Ježek,
2008], or perhaps, by defining a completely novel way of accounting for co-authorship [Hirsch, 2010].
Scientific field: The scientific field in which a researcher is active can also affect our judgment of how successful he is. It has been pointed out that different scientific fields present different scientific patterns in the number of papers published, the number of citations received by each paper, and even, the number of authors included in each paper. It has been argued that a bibliometric indicator should consider differences in the scientific fields as well [Hirsch, 2005, Batista et al., 2006, Glänzel,
2006, Antonakis and Lalive, 2008, van Eck and Waltman, 2008].
2.2 Mathematical notation
In order to better describe the meta-data that define the different relationships we encounter in Citation Analysis, we are going to use the mathematical notation described in the following list, which was originally defined in Fragkiadaki and Evangelidis [2014]:
$P = \{P_1, P_2, \ldots, P_{N_P}\}$ denotes the closed set of papers participating in a Paper-Citation graph and $N_P$ is the total number of papers included in the collection.
$P(A_l) = \{P_i \mid P_i \in P\}$ denotes the set of papers that belong to set $P$ that author $A_l$ has co-authored.
$A = \{A_1, A_2, \ldots, A_{N_A}\}$ denotes the set of authors that have participated in any of the papers included in the Paper-Citation graph. $N_A$ denotes the total number of authors participating in the Paper-Citation graph.
$J = \{J_1, J_2, \ldots, J_{N_J}\}$ denotes the set of journals in which the papers of the Paper-Citation graph were published. $N_J$ denotes the total number of journals participating in the Paper-Citation graph.
$C = \{C_{P_i P_j} \mid P_i, P_j \in P\}$ denotes the set of citations between the papers included in the Paper-Citation graph. $C_{P_i P_j}$ denotes that paper $P_j$ is cited by paper $P_i$ and $N_C$ denotes the total number of citations (edges) present in the Paper-Citation graph.
$a(P_i)$ denotes the total number of authors that have co-authored paper $P_i$.
$c(P_i)$ denotes the total number of (weighted) citations received by paper $P_i$.
$r(P_i)$ denotes the total number of papers referenced by paper $P_i$.
$w(C_{P_i P_j})$ denotes the weight of citation $C_{P_i P_j}$.
2.3 Paper-Citation graph
We can consider a graph to be a representation of a closed set of elements that provide links to other elements within the same set. The elements form the nodes of the graph and the links between them represent the edges of the graph. The graph edges can be directional, i.e. originating from one node and terminating at another node, or not. A non-directional edge signifies that the two nodes are linked, whereas a directional edge represents a relationship that originates from one node and targets another node in the graph.
Two nodes can be connected directly, i.e. an edge exists between the two nodes, or indirectly. We identify an indirect connection between two nodes if there is a path that leads from one node to the other by visiting one or more intermediate nodes.
We define a cycle as a path that originates from and terminates at the same node. Zero or more nodes may appear on the same path in which case we define the different levels of cycles, starting with level 0 where a node has an edge that points back to itself.
The most commonly examined graph in Citation Analysis is the Paper-Citation graph, where the different papers correspond to the nodes of the graph and the references provided by each paper act as the graph's edges. The Paper-Citation graph is a directed, usually acyclic graph. When a source paper S references a target paper T, this signifies a one-way relationship that originates from paper S and links it with paper T. We are going to refer to this relationship as "S references T" or "T is cited by S" depending on the currently examined paper, and the notation used to identify this relationship is S → T.
With regards to a particular paper, citations can be defined as Forward or Backward. Forward citations are all the citations that reference the current paper whereas Backward citations are all the citations that the current paper provides via its Reference list. An example of how Forward and
Backward citations are defined for a particular paper (P3) is presented in Figure 2.1. For the rest of this dissertation we are only going to refer to Forward citations and examine how they are being used in Citation Analysis.
Figure 2.1: Example of Forward and Backward citations examined for paper P3 (papers P1 to P5 placed on a 2010 to 2016 timeline; References are labelled Backward and Citations are labelled Forward).
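As a rough illustration, both directions can be read off the reference lists alone. The sketch below does not reproduce the papers of Figure 2.1; it uses the reference lists that this chapter gives later in Table 2.2, and the function names are assumptions made for the example.

```python
# Reference lists transcribed from Table 2.2 (paper -> papers it references).
REFERENCES = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P2"],
    "P4": ["P3", "P2"],
    "P5": ["P4"],
    "P6": ["P2"],
}


def backward_citations(paper, references):
    """Papers that `paper` cites through its own Reference list."""
    return set(references[paper])


def forward_citations(paper, references):
    """Papers whose Reference list contains `paper`."""
    return {source for source, targets in references.items() if paper in targets}


print(sorted(backward_citations("P3", REFERENCES)))  # ['P2']
print(sorted(forward_citations("P3", REFERENCES)))   # ['P4']
```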
As previously mentioned, we would consider the Paper-Citation graph to be usually acyclic, if not in its entirety, at least for its biggest part. After a paper has been published its contents never really change, which means that a paper will only ever be able to reference other papers that already existed at the time of publication. It is not uncommon though for a paper to reference another paper that has not yet been officially published, either because it appears on an author's personal web page or institutional repository or because it is being cited in a draft, pre-publication or online-first form. In these cases it is possible for a cycle to be created [Sidiropoulos and Manolopoulos, 2005].
In general, a level n cycle is going to include n + 1 papers [Fragkiadaki and Evangelidis, 2016]. So, for example, three papers would participate in a Level 2 cycle as shown in Figure 2.2 (b). Using the notation defined earlier this cycle could also be presented as P4 → P1 → P5 → P4. It is true though, that in the absence of any additional information it is difficult to identify the order in which the papers were added to the graph. If we had knowledge of the publication dates of these papers we would be able to identify which paper provided a reference to a paper with a publication date set in the future.
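The relation between a cycle and its level can be sketched as follows. This is a minimal depth-first search written for illustration, not the detection procedure of the cited works; the function names are assumptions, and the search returns some cycle through the start paper, not necessarily the shortest one.

```python
def find_cycle(edges, start):
    """Return one cycle through `start` (start repeated at the end), or None."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)

    def dfs(node, path, visited):
        for nxt in graph.get(node, []):
            if nxt == start:
                return path + [start]
            if nxt not in visited:
                visited.add(nxt)
                found = dfs(nxt, path + [nxt], visited)
                if found:
                    return found
        return None

    return dfs(start, [start], {start})


def cycle_level(cycle):
    """A level n cycle includes n + 1 distinct papers."""
    return len(set(cycle)) - 1


# The cycle P4 -> P1 -> P5 -> P4 mentioned in the text.
cycle = find_cycle([("P4", "P1"), ("P1", "P5"), ("P5", "P4")], "P4")
print(cycle, cycle_level(cycle))  # ['P4', 'P1', 'P5', 'P4'] 2

# A paper that references itself forms a level 0 cycle.
loop = find_cycle([("P1", "P1")], "P1")
print(loop, cycle_level(loop))  # ['P1', 'P1'] 0
```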
Figure 2.2: Cycles encountered in a Paper-Citation graph: (a) Level 1, (b) Level 2 and (c) Level 3 cycle.
As already discussed, in a Paper-Citation graph papers constitute the nodes of the graph and by using the reference list of each paper we can populate the graph with a set of directed edges. Additional information about each paper can be depicted as properties of each node as shown in Figure 2.3. The list of properties consists of the list of co-authors, the publishing body and the year of publication.
Based on the mathematical notation defined in Section 2.2, we would describe this Paper-Citation graph as
P = {P1, P2, P3, P4, P5, P6} with NP = 6 since we have six papers
A = {A1, A2, A3, A4, A5} with NA = 5 since we have five distinct authors
J = {J1, J2, J3} with NJ = 3 since we have three distinct journals
and C = {CP2P1, CP3P2, CP4P2, CP4P3, CP5P4, CP6P2} with NC = 6 since we have a total of six citations (edges) present in the graph.
Figure 2.3: Example of a Paper-Citation graph. The year of publication, journal and the list of co-authors are depicted as properties of the paper nodes; each edge is labelled with its weight w(CPiPj).
If we were to examine P2 in more detail we would also say that:
a(P2) = 2, since P2 has two authors, A2 and A3
c(P2) = 3, since P2 is referenced by P3, P4 and P6
r(P2) = 1 since P2 provides one reference, to P1
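These values can be re-derived mechanically. The sketch below transcribes the meta-data that the chapter lists in Table 2.2 and recomputes the counts; the tuple layout and the helper names `a`, `c` and `r` simply mirror the notation of Section 2.2 and are otherwise assumptions of this example. Here `c` counts every citation as one, i.e. it assumes all weights equal 1.

```python
# Meta-data transcribed from Table 2.2: (year, journal, co-authors, references).
PAPERS = {
    "P1": (2010, "J1", ["A1"], []),
    "P2": (2012, "J2", ["A2", "A3"], ["P1"]),
    "P3": (2013, "J1", ["A4"], ["P2"]),
    "P4": (2014, "J3", ["A5"], ["P3", "P2"]),
    "P5": (2015, "J3", ["A3"], ["P4"]),
    "P6": (2015, "J2", ["A4"], ["P2"]),
}

authors = {author for _, _, coauthors, _ in PAPERS.values() for author in coauthors}
journals = {journal for _, journal, _, _ in PAPERS.values()}
citations = [(src, dst) for src, (_, _, _, refs) in PAPERS.items() for dst in refs]

N_P, N_A, N_J, N_C = len(PAPERS), len(authors), len(journals), len(citations)


def a(paper):
    """Number of co-authors of `paper`."""
    return len(PAPERS[paper][2])


def r(paper):
    """Number of papers referenced by `paper`."""
    return len(PAPERS[paper][3])


def c(paper):
    """Number of citations received by `paper` (every weight taken as 1)."""
    return sum(1 for _, dst in citations if dst == paper)


print(N_P, N_A, N_J, N_C)        # 6 5 3 6
print(a("P2"), c("P2"), r("P2"))  # 2 3 1
```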
It is also worth noting that in the Paper-Citation graph we can only ever have one edge between a pair of papers, and this is due to the fact that even though a paper A may provide multiple in-text references to the same paper B, paper B can only appear in the Reference List of paper A once.
A common metric used in citation analysis is the number of citations a paper has received and it has been referred to in the literature as the Number of Citations (NC), s-index [van Eck and Waltman,
2008], or c-method [Wu, 2010]. With regards to the Paper-Citation graph this can be translated as the number of incoming edges to the paper node, or as the sum of the weights of these edges if the weight has been set to one. We are going to refer to this approach as Full Counting.
Utilizing the weight property of the edges one can provide additional information about the relationship of two papers, like the fact that a citing paper usually references multiple papers in its
Reference list. If we wish to account for the number of references each paper provides then the weight of each edge can be set to one divided by the total number of papers referenced by the source paper. We are going to refer to this approach as Fractional counting.
In Figure 2.3 the weights of the edges have been represented with their mathematical notation and in Table 2.1 we are presenting the calculated values for the weights when using Full or Fractional counting. Fractional counting will always produce values that are either the same or lower than the ones produced by Full counting.
Citation      Full (FUC)    Fractional (FRC)
w(CP2P1)      1             1
w(CP3P2)      1             1
w(CP4P2)      1             1/2
w(CP4P3)      1             1/2
w(CP5P4)      1             1
w(CP6P2)      1             1
Total         6             5
Table 2.1: Citation weights for the Paper-Citation graph of Figure 2.3.
As we can see from comparing the generated weights in Table 2.1, the edges that are affected by the different counting methods are the ones that originate from paper P4. P4 is the only paper in our Paper-Citation graph that references more than one paper, and therefore the weight of each of its outgoing edges is set to 1/2. It is also worth noting that the sum of all the weights of the graph edges is equal to the number of edges when using Full Counting and to the number of papers that provide at least one reference when using Fractional Counting (five here, since P1 provides no references).
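The weights of Table 2.1 can be recomputed from the reference lists. The following is a minimal sketch; the reference lists are those of the example graph of Figure 2.3, while the function name and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction

# Reference lists of the example graph (paper -> papers it references).
REFERENCES = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P2"],
    "P4": ["P3", "P2"],
    "P5": ["P4"],
    "P6": ["P2"],
}


def citation_weights(references, fractional):
    """Weight of every citation under Full or Fractional counting."""
    weights = {}
    for source, targets in references.items():
        for target in targets:
            # Fractional counting splits one unit over all references of `source`.
            weights[(source, target)] = (
                Fraction(1, len(targets)) if fractional else Fraction(1)
            )
    return weights


full = citation_weights(REFERENCES, fractional=False)
frac = citation_weights(REFERENCES, fractional=True)
print(frac[("P4", "P2")], frac[("P4", "P3")])  # 1/2 1/2
print(sum(full.values()), sum(frac.values()))  # 6 5
```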
Now that we have defined the weights of the edges in the Paper-Citation graph we can go back to one of the notations presented in Section 2.2 and provide more details about its actual definition. More specifically, we have referred to c(Pi) as the total number of weighted citations received by paper Pi and in the example presented earlier we said that c(P2) = 3, since P2 is referenced by P3, P4 and P6. This is still true if we apply Full counting in the original Paper-Citation graph, and it also represents the Number of Citations (NC) value for this paper. But, if Fractional counting is applied then c(P2) = 2.5, since now w(CP4P2) = 1/2 instead of 1. Therefore we are going to refer to c(Pi) as the Weighted Citation count and it is going to be expressed as:
$$c(P_i) = \sum_{P_j \in P} w(C_{P_j P_i}) \tag{2.1}$$
And, as previously mentioned, when the weight of all the edges in the Paper-Citation graph is set to 1 then the Weighted Citation count is identical to the Citation count, whereas when Fractional counting is applied to the Paper-Citation graph the Weighted Citation count is always going to be lower than or equal to the Number of Citations (NC).
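A small worked check of the Weighted Citation count for P2 follows. It is only a sketch over the example graph of Figure 2.3; the function name is hypothetical.

```python
from fractions import Fraction

# Reference lists of the example graph (paper -> papers it references).
REFERENCES = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P2"],
    "P4": ["P3", "P2"],
    "P5": ["P4"],
    "P6": ["P2"],
}


def weighted_citation_count(paper, references, fractional):
    """c(P_i): sum of the weights of the citations received by `paper`."""
    total = Fraction(0)
    for source, targets in references.items():
        if paper in targets:
            total += Fraction(1, len(targets)) if fractional else 1
    return total


c_full = weighted_citation_count("P2", REFERENCES, fractional=False)
c_frac = weighted_citation_count("P2", REFERENCES, fractional=True)
print(c_full, c_frac)  # 3 5/2
```

The fractional value 5/2 is the 2.5 quoted in the text: the citations from P3 and P6 each contribute 1, while the citation from P4 contributes 1/2.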
In general the same details can also be presented in the Paper-Citation table which for the graph in Figure 2.3 is shown in Table 2.2. As we can see the only paper that does not reference any other paper in the graph is P1. The rest of the papers in the set do provide at least one reference to one of the papers in the graph. With regards to citations, the papers that do not receive any citations are papers P5 and P6, which only provide references.
Paper Publication year Journal Is cited by Co-authors References
P1 2010 J1 P2 A1 -
P2 2012 J2 P3,P4,P6 A2,A3 P1
P3 2013 J1 P4 A4 P2
P4 2014 J3 P5 A5 P3,P2
P5 2015 J3 - A3 P4
P6 2015 J2 - A4 P2
Table 2.2: Paper-Citation table for the Paper-Citation graph of Figure 2.3.
It is also worth noting at this point that the fact that a paper does not receive any citations in the present citation graph does not mean that the paper has not received any citations at all. As has been previously highlighted, no single citation database actually holds information about all the citations and references provided by a paper. It is not uncommon for papers to reference other papers not included in a particular citation database, and it is also not uncommon for a paper to receive citations from papers not included in the same database.
Unfortunately, it is not easy to link data that are stored in different bibliographic databases since there is no universal source of truth for citation data. One approach that could assist in combining the paper citation data from different bibliographic databases would be to use a paper’s DOI as part of any provided reference. This would mean that each paper could be uniquely identified across all bibliographic databases, thus making it easier for researchers to gather and combine citation data from different sources.
An equivalent approach for uniquely identifying authors has also been proposed by Dervos et al. [2006b]. It is named the Universal Author Identifier (UAI) and would allow a researcher to be uniquely identified across all bibliographic sources.
2.4 Derived graphs
The Paper-Citation graph presented in the previous section can be constructed from the information derived from the papers included in a closed set of papers. From the same information we can construct other types of graphs that utilize different aspects of the provided meta-data. More specifically, the two types of Derived graphs that we are going to examine more closely are the
Author-Citation graph and the Journal-Citation graph. We are going to refer to these graphs as Derived as they are constructed by applying a finite set of transformations to the originating
Paper-Citation graph.
The framework that defines how these graphs can be constructed in a structured way from the meta-data provided by the Paper-Citation graph constitutes part of the contribution of the present PhD thesis.
2.4.1 Author-Citation graph
The Author-Citation graph is a directed graph whose nodes are authors of papers and whose edges represent the citations provided from one author to another. As discussed earlier, from any closed set of papers for which we have all required meta-data information available, i.e. title, author list and publication year, we can construct the corresponding Author-Citation graph.
The nodes of the graph will be the set of co-authors of the papers in the collection. We consider an author A to reference another author B in the set if A has been the co-author of a paper that references at least one of the papers co-authored by B. Similar to the Paper-Citation graph and the definition of a paper citation, we say that "A references B" or "B is referenced by A" and the notation used to represent this relationship would be A → B.
The steps and transformations we need to apply to the originating Paper-Citation graph in order to produce the corresponding Author-Citation graph are:
Step 1: Define the weight of the edges in the originating Paper-Citation graph
Step 2: Produce the intermediate graph by transforming the paper citations to author citations and define the weight of these edges
Step 3: Collapse the multiple edges between two authors to a single edge with a suitable weight
Mathematical notation
Expanding on the mathematical notation defined in Section 2.2 the following can be defined in order to better describe the Author-Citation graph [Fragkiadaki and Evangelidis, 2014].
$E = \{E_{A_k A_l, P_i P_j} \mid \exists\, C_{P_i P_j} \in C \wedge P_i \in P(A_k) \wedge P_j \in P(A_l)\}$ denotes the set of edges between authors in the intermediate Author-Citation graph. $E_{A_k A_l, P_i P_j}$ denotes an edge from author $A_k$ to author $A_l$ that exists because there exists a citation from paper $P_i$ co-authored by $A_k$ to paper $P_j$ co-authored by $A_l$.
$e(A_k)$ denotes the total number of outgoing edges, originating from author $A_k$.
$w(E_{A_k A_l, P_i P_j})$ denotes the weight of edge $E_{A_k A_l, P_i P_j}$.
$C^d = \{C^d_{A_k A_l} \mid \exists\, E_{A_k A_l, P_i P_j} \in E\}$ denotes the set of edges between authors, or derived author citations, in the final Author-Citation graph.
$r(A_k)$ denotes the total number of authors referenced by author $A_k$ in the final Author-Citation graph.
$w(C^d_{A_k A_l})$ denotes the weight of author citation $C^d_{A_k A_l}$.
Transformations
So, using the defined mathematical notation we can now define the three steps needed to generate the Author-Citation graph from the original Paper-Citation graph, as originally described in
[Fragkiadaki and Evangelidis, 2014].
The weight of each paper citation $C_{P_i P_j}$ between papers $P_i, P_j$ of the original Paper-Citation graph is
$$w(C_{P_i P_j}) = \begin{cases} 0, & \text{citation does not exist} \\ \dfrac{1}{r(P_i)}, & \text{fractional counting} \\ 1, & \text{full counting} \end{cases} \tag{2.2}$$
The weight of each individual edge $E_{A_k A_l, P_i P_j}$ between authors $A_k, A_l$ of papers $P_i, P_j$ respectively of the intermediate Author-Citation graph is
$$w(E_{A_k A_l, P_i P_j}) = \begin{cases} 0, & \text{edge does not exist} \\ w(C_{P_i P_j}), & \text{No normalization} \\ w(C_{P_i P_j}) \cdot \dfrac{1}{a(P_i) * a(P_j)}, & \text{Normalize per citation} \end{cases} \tag{2.3}$$
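The weight of each derived author citation $C^d_{A_k A_l}$ between authors $A_k, A_l$ in the resulting Author-Citation graph is
$$w(C^d_{A_k A_l}) = \begin{cases} 0, & \text{citation does not exist} \\ \displaystyle\sum_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}), & \text{full counting} \\ \dfrac{1}{r(A_k)} \cdot \left( \displaystyle\sum_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}) \right), & \text{fractional citation counting} \\ \dfrac{1}{e(A_k)} \cdot \left( \displaystyle\sum_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}) \right), & \text{fractional edge counting} \\ \dfrac{1}{\displaystyle\sum_{P_i, P_j \in P \wedge A_M \in A} w(E_{A_k A_M, P_i P_j})} \cdot \left( \displaystyle\sum_{P_i, P_j \in P} w(E_{A_k A_l, P_i P_j}) \right), & \text{fractional weight counting} \end{cases} \tag{2.4}$$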
So, as discussed in the previous section and based on Formula 2.2, there are two ways that we can define weights in the Paper-Citation graph, either by assuming that all edges have an equal weight of one (Full counting - FUC) or by assigning a weight to each outgoing edge of a paper, equal to one divided by the number of outgoing edges (Fractional counting - FRC).
In order to better demonstrate how we map the citations between the papers to links between the authors, let us consider a single pair of papers (P1, P2) from Figure 2.3. Authors A1, A2 and A3 are the co-authors of these papers and therefore the resulting graph would have three nodes, one for each author, with one outgoing edge from A2, one outgoing edge from A3 and two incoming edges for author A1.
In other words if there is a citation between two papers in the Paper-Citation graph, then, there is a citation between all the co-authors of the source paper to all the co-authors of the target paper in the resulting intermediate graph. This means that if the source paper has been co-authored by two authors and the target paper has been co-authored by a different set of three authors then the citation between the two papers will result in 2 ∗ 3 = 6 edges between the authors of the papers in the intermediate graph. These edges will be directed since the original edges are directed and therefore the nature of the graph remains the same.
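The mapping from one paper citation to its author edges is a Cartesian product of the two author lists. A minimal sketch, in which the author names X1, X2, Y1, Y2, Y3 and the function name are hypothetical:

```python
from itertools import product


def author_edges(source_authors, target_authors):
    """One directed edge from every source co-author to every target co-author."""
    return list(product(source_authors, target_authors))


# Hypothetical papers: two source authors and three different target authors.
edges = author_edges(["X1", "X2"], ["Y1", "Y2", "Y3"])
print(len(edges))  # 6

# The P2 -> P1 citation of Figure 2.3: P2 is by A2 and A3, P1 is by A1.
p2_p1 = author_edges(["A2", "A3"], ["A1"])
print(p2_p1)  # [('A2', 'A1'), ('A3', 'A1')]
```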
While constructing the intermediate citation graph we might encounter author self-citations. This means that we might encounter a case where a particular author is part of the co-author list of both the source and target paper. In this case the generated edge will appear as an edge originating and terminating on the same author node.
Now that we have defined the nodes of the intermediate graph and how the edges are generated from the original Paper-Citation graph, the next step would be to identify the weights that the edges should have. Based on Formula 2.3 there are two approaches to this. One is to assign the weight of the original paper citation as is to all derived author citations (No Normalization - NN). The other approach would be to distribute the weight of the original paper citation equally among the derived edges (Normalize per citation - NC). The number of derived edges from each individual paper citation is the product of the number of co-authors of the source and target papers.
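The two options of Formula 2.3 can be sketched as a single function. The function and parameter names are assumptions made for this illustration; the numbers correspond to the citation P4 → P2 of Figure 2.3 under Fractional counting, where P4 has one author and P2 has two.

```python
from fractions import Fraction


def intermediate_edge_weight(citation_weight, n_source_authors, n_target_authors,
                             per_citation):
    """Weight of one derived author edge (No Normalization or Normalize per citation)."""
    if per_citation:
        # Spread the paper citation weight equally over all derived edges.
        return citation_weight / (n_source_authors * n_target_authors)
    return citation_weight


w = Fraction(1, 2)  # w(C_P4P2) under Fractional counting
nn = intermediate_edge_weight(w, 1, 2, per_citation=False)
nc = intermediate_edge_weight(w, 1, 2, per_citation=True)
print(nn, nc)  # 1/2 1/4
```

With per-citation normalization the weights of the derived edges add back up to the weight of the paper citation: two edges of 1/4 each give 1/2.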
The final step in generating the Author-Citation graph is to define the edges of the final graph and assign the appropriate weights to them. The input of this step is the generated intermediate graph, which has already defined the nodes of the graph but may have more than one edge between the same source/target pair of authors. What we would like to have in the resulting graph is a single edge between each pair of authors that is weighted accordingly. As presented in Formula 2.4, we have identified four approaches to defining the weight of an author citation in the resulting Author-Citation graph:
Full counting (FUC): The resulting weight is generated by adding the weights of all individual author connections from the intermediate graph.
Fractional citation counting (FRCC): The resulting weight is generated by dividing the summed weight of the individual author connections by the number of outgoing edges from the source author in the final Author-Citation graph.
Fractional edge counting (FREC): The resulting weight is generated by dividing the summed weight of the individual author connections by the number of outgoing edges from the source author in the intermediate Author-Citation graph.
Fractional weight counting (FRWC): The resulting weight is generated by dividing the summed weight of the individual author connections by the sum of the weights of all outgoing edges from the source author in the intermediate Author-Citation graph.
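The four approaches differ only in the divisor applied to the summed weight. The following sketch of Formula 2.4 is illustrative: the function name and the edge representation (source author, target author, weight) are assumptions, and the sample input is limited to the four intermediate edges that originate from author A4 in the example of Figure 2.3 when Full counting and No Normalization are used.

```python
from collections import defaultdict
from fractions import Fraction


def collapse(edges, method):
    """Collapse parallel intermediate edges into single weighted author citations."""
    summed = defaultdict(Fraction)      # (source, target) -> summed weight
    out_edges = defaultdict(int)        # e(A_k): outgoing intermediate edges
    out_weight = defaultdict(Fraction)  # total outgoing intermediate weight
    for source, target, weight in edges:
        summed[(source, target)] += weight
        out_edges[source] += 1
        out_weight[source] += weight

    out_targets = defaultdict(set)      # r(A_k): authors referenced in the final graph
    for source, target in summed:
        out_targets[source].add(target)

    result = {}
    for (source, target), total in summed.items():
        if method == "FUC":
            divisor = 1
        elif method == "FRCC":
            divisor = len(out_targets[source])
        elif method == "FREC":
            divisor = out_edges[source]
        elif method == "FRWC":
            divisor = out_weight[source]
        else:
            raise ValueError(f"unknown method: {method}")
        result[(source, target)] = total / divisor
    return result


# Edges from A4: two via P3 -> P2 and two via P6 -> P2, each with weight 1.
A4_EDGES = [("A4", "A2", 1), ("A4", "A3", 1), ("A4", "A2", 1), ("A4", "A3", 1)]

for method in ("FUC", "FRCC", "FREC", "FRWC"):
    print(method, collapse(A4_EDGES, method)[("A4", "A2")])
# FUC 2
# FRCC 1
# FREC 1/2
# FRWC 1/2
```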
Example
Considering the Paper-Citation graph of Figure 2.3 we are going to construct the derived Author-Citation graph and assign appropriate weights to all the edges in the Intermediate and final Author-Citation graph. In order to generate the Intermediate Author-Citation graph we first need to translate all the citations in the originating Paper-Citation graph to their corresponding author citations.
For example, let us consider the citation between papers P3 and P2 in the originating Paper-Citation graph. Paper P3 has been authored by A4 and paper P2 has been co-authored by authors A2 and A3. This means that the P3 → P2 citation will be represented by two edges in the Intermediate Author-Citation graph, A4 → A2 and A4 → A3. Similarly, paper P6 references P2 and, since P6 has been authored by A4 as well, we would have two additional edges in the Intermediate Author-Citation graph originating from A4 and again targeting authors A2 and A3 respectively. If we look at the Intermediate Author-Citation graph we see that there are two edges originating from A4 towards A2, one from the P3 → P2 citation and one from the P6 → P2 citation. After repeating the above process for all citations present in the originating Paper-Citation graph we will have constructed the Intermediate Author-Citation graph presented in Figure 2.4 (a).
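Repeating this translation for every citation of the example can be sketched as below. The author and reference lists are those of Table 2.2; the variable names are assumptions, and the edge counts printed are computed by the code, not read off Figure 2.4.

```python
from collections import Counter
from itertools import product

# Author and reference lists transcribed from Table 2.2.
AUTHORS = {
    "P1": ["A1"], "P2": ["A2", "A3"], "P3": ["A4"],
    "P4": ["A5"], "P5": ["A3"], "P6": ["A4"],
}
REFERENCES = {
    "P1": [], "P2": ["P1"], "P3": ["P2"],
    "P4": ["P3", "P2"], "P5": ["P4"], "P6": ["P2"],
}

# One intermediate edge per (source co-author, target co-author, paper citation).
intermediate = [
    (src_author, dst_author, (src, dst))
    for src, targets in REFERENCES.items()
    for dst in targets
    for src_author, dst_author in product(AUTHORS[src], AUTHORS[dst])
]

pair_counts = Counter((s, t) for s, t, _ in intermediate)
print(len(intermediate))           # 10 intermediate edges
print(pair_counts[("A4", "A2")])   # 2 (via P3 -> P2 and P6 -> P2)
print(len(pair_counts))            # 8 distinct author pairs
```

The eight distinct author pairs are the edges that remain once parallel edges are collapsed.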
Figure 2.4: Constructed Author-Citation graph: (a) Intermediate graph, (b) Final graph.
As discussed in the previous section, the weights of the edges in the Intermediate Author-Citation graph depend on the weights of the citations in the originating Paper-Citation graph and on the normalization method we have chosen for the Intermediate graph. In Figure 2.4 (a) the weights of the edges have been depicted with their mathematical notation and their actual values are presented in Tables 2.3 and 2.4.
In Table 2.3 we can see the weights assigned to each of the citations in the originating Paper-Citation graph when Full Counting (FUC) has been applied to the originating Paper-Citation graph and in Table 2.4 we can see the weights when Fractional Counting (FRC) has been applied to the Paper-Citation graph. In addition we can see the authors of each paper and the generated edges in the Intermediate Author-Citation graph along with their corresponding weights depending on whether we have normalized the weights of the citations or not.
Citation   Weight   From   To   Edge weight, no normalization   Edge weight, normalized per citation
                                w(CPiPj)                        w(CPiPj) · 1/(a(Pi)*a(Pj))
CP2P1      1        A2     A1   1                               1/(2*1) = 1/2
CP2P1      1        A3     A1   1                               1/(2*1) = 1/2
CP3P2      1        A4     A2   1                               1/(1*2) = 1/2
CP3P2      1        A4     A3   1                               1/(1*2) = 1/2
CP4P2      1        A5     A2   1                               1/(1*2) = 1/2
CP4P2      1        A5     A3   1                               1/(1*2) = 1/2
CP4P3      1        A5     A4   1                               1/(1*1) = 1
CP5P4      1        A3     A5   1                               1/(1*1) = 1
CP6P2      1        A4     A2   1                               1/(1*2) = 1/2
CP6P2      1        A4     A3   1                               1/(1*2) = 1/2
Table 2.3: Paper (FUC) and Intermediate Author-Citation graph edge weights
For example, if we have applied Full counting in the originating Paper-Citation graph and have chosen to normalize the weight of each edge in the Intermediate Author-Citation graph per citation, then the citation P3 → P2 has a weight equal to 1 in the originating Paper-Citation graph and the weight of the edge EA4A2,P3P2 would be 1 · 1/(1*2) = 1/2, since paper P3 has one author and P2 has two.
The next step would be to construct the final Author-Citation graph from the Intermediate graph we just generated and assign the appropriate weights to the graph edges. The final Author-Citation graph will have exactly the same nodes as the Intermediate graph and it will have a single edge from one node to the other, i.e. if there is more than one edge between a particular author pair these will be replaced by a single edge with an appropriate weight.
Citation   Weight   From   To   Edge weight, no normalization   Edge weight, normalized per citation
                                w(CPiPj)                        w(CPiPj) · 1/(a(Pi)*a(Pj))
CP2P1      1        A2     A1   1                               1/(2*1) = 1/2
CP2P1      1        A3     A1   1                               1/(2*1) = 1/2
CP3P2      1        A4     A2   1                               1/(1*2) = 1/2
CP3P2      1        A4     A3   1                               1/(1*2) = 1/2
CP4P2      1/2      A5     A2   1/2                             1/2 * 1/(1*2) = 1/4
CP4P2      1/2      A5     A3   1/2                             1/2 * 1/(1*2) = 1/4
CP4P3      1/2      A5     A4   1/2                             1/2 * 1/(1*1) = 1/2
CP5P4      1        A3     A5   1                               1/(1*1) = 1
CP6P2      1        A4     A2   1                               1/(1*2) = 1/2
CP6P2      1        A4     A3   1                               1/(1*2) = 1/2
Table 2.4: Paper (FRC) and Intermediate Author-Citation graph edge weights
Figure 2.4 (b) presents the final Author-Citation graph and, as we can see, it has fewer edges since multiple edges between authors have been collapsed to single edges with different weights. The weights of the edges depend on the choices made in the previous steps and, as we have already mentioned, if we only examine the transition from the Intermediate graph to the Final one there are four different approaches that one could follow. The calculated weights of the edges when we apply Full counting in the originating Paper-Citation graph are presented in Table 2.5. The equivalent weights when we apply Fractional Counting in the Paper-Citation graph are presented in Table 2.6.
As we can see from both tables the final weight for each of the edges in the resulting Author-Citation graph is affected by the transformations we choose to apply to the weights while we are building the final graph. In general, we can refer to the steps followed by naming the weight functions we used to generate the graphs, so for example FUC-NN-FUC would mean that we applied Full Counting
(FUC) in the originating Paper-Citation graph, we did not apply any normalization (NN) of the edge weights in the Intermediate Author-Citation graph and finally we applied Full Counting (FUC) in the resulting Author-Citation graph.
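Since the label is just the three choices joined by hyphens, all combinations can be enumerated. A short sketch, with the abbreviations taken from this section and the list names chosen for the example:

```python
from itertools import product

PAPER_COUNTING = ["FUC", "FRC"]                     # Step 1, Formula 2.2
NORMALIZATION = ["NN", "NC"]                        # Step 2, Formula 2.3
AUTHOR_COUNTING = ["FUC", "FRCC", "FREC", "FRWC"]   # Step 3, Formula 2.4

labels = [
    "-".join(choice)
    for choice in product(PAPER_COUNTING, NORMALIZATION, AUTHOR_COUNTING)
]
print(len(labels))  # 16
print(labels[0])    # FUC-NN-FUC
```

Two options in the first step, two in the second and four in the third give 2 * 2 * 4 = 16 possible variations of the Author-Citation graph.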
From Equation 2.4 we see that in all cases the weight of the edges in the resulting Author-Citation graph will be equal to the sum of the weights of all individual connections in the Intermediate Author-Citation graph divided by nothing (FUC), or by the number of outgoing edges of the source author in the Intermediate graph (FREC), or by the number of outgoing edges of the source author in the final graph (FRCC), or, finally, by the sum of the weights of the edges in the Intermediate
Edge Citation FUC FRCC FREC FRWC
No Normalization