UNIVERSITY OF SOUTHAMPTON
Understanding institutional collaboration networks: Effects of collaboration on research impact and productivity
by Jiadi Yao
A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy
in the Web and Internet Science Group Electronics and Computer Science Faculty of Physical and Applied Sciences
June 2014
UNIVERSITY OF SOUTHAMPTON
Abstract
Web and Internet Science Group Electronics and Computer Science Faculty of Physical and Applied Sciences
Doctor of Philosophy
by Jiadi Yao
There is substantial competition among academic institutions. They compete for students, researchers, reputation, and funding. To succeed, they need not only to excel in teaching; their research profile is also considered an important factor. Institutions accordingly take actions to improve their research profiles. They encourage researchers to publish frequently and regularly (publish or perish) on the assumption that this generates both more and better research. Collaboration has also been encouraged by institutions and even required by some funding calls.
This thesis examines the empirical evidence on the interrelations among institutional research productivity, impact and collaborativity.
It studies article publication data across ACM and Web of Science covering five disciplines – Computer Science, Pharmacology, Materials Science, Psychology and Law. Institutions that publish less seek to publish collaboratively with other institutions. Collaboration boosts productivity for all the disciplines investigated except Law; however, the productivity increase resulting from institutions' attempts to collaborate more is small. The world's most productive institutions publish at least 50% of their papers on their own. Nor is the amount of collaborative work an institution does found to correlate strongly with its impact. The correlation between collaborativity and individual paper impact or institutional impact is small once productivity has been partialled out. In Computer Science, Pharmacology and Materials Science, no correlation is found. The decisive factor appears to be productivity. Partialling out productivity results in the largest reductions in the remaining correlations. It may be that only better equipped and well-funded institutions can publish without having to rely on external collaborators. These institutions have been publishing most of their output non-collaboratively, and are also of high quality and highly reputable, which may have equipped and funded them in the first place.
Contents
Abstract iii
List of Figures xi
List of Tables xiii
Declaration of Authorship xv
Acknowledgements xix
Abbreviations xx
1 Introduction 1
1.1 Research Collaboration ...... 2
1.2 Funding Research ...... 2
1.3 The Role of Universities in Research ...... 3
1.4 Competitions between Universities and Scientists in Research ...... 4
1.5 Research Contributions ...... 5
1.6 Thesis Structure ...... 5
2 Measuring Research Activities and their Relationships 7
2.1 Research Collaboration ...... 7
2.1.1 Co-authorship as Collaboration Measurement ...... 8
2.1.2 Other Approaches to Measure Collaboration ...... 10
2.1.3 Collaboration Model of Institutions ...... 10
2.1.4 Collaboration at the Micro Level ...... 12
2.1.5 Summary of measuring research collaboration ...... 13
2.2 Research Productivity ...... 13
2.2.1 Publication Attribution ...... 15
2.2.2 Summary of measuring research productivity ...... 16
2.3 Research impact ...... 17
2.3.1 Research Impact using Bibliometric Methods ...... 19
2.3.1.1 Citation Count ...... 20
Author Self-citation ...... 20
2.3.1.2 Windowed Citation Count ...... 21
2.3.1.3 PageRank Weighted Citation ...... 21
2.3.1.4 Network-based Centrality Measures ...... 22
2.3.1.5 Journal Impact as an Indicator of Article Impact ...... 23
2.3.1.6 Reading Statistics and Download Logs ...... 23
2.3.2 Debate between peer-review based metrics and bibliometrics based metrics ...... 24
2.3.3 National Research Evaluation Exercises ...... 25
2.3.3.1 UK ...... 26
2.3.3.2 Australia ...... 26
2.3.4 Journal Impact ...... 27
2.3.4.1 Impact Factor ...... 27
2.3.4.2 Acceptance Rate ...... 28
2.3.4.3 ABS's Journal Quality Guide ...... 28
2.3.4.4 ABDC Journal List ...... 29
2.3.5 Conference Quality ...... 29
2.3.6 University Rankings ...... 30
2.3.6.1 Webometrics Ranking of World Universities ...... 32
Methodology ...... 32
Coverage ...... 33
Shortcoming ...... 33
2.3.6.2 Guardian Ranking ...... 34
Methodology ...... 34
Coverage ...... 35
Shortcoming ...... 35
2.3.6.3 Academic Ranking of World Universities (ARWU) ...... 35
Methodology ...... 35
Coverage ...... 36
Shortcoming ...... 36
2.3.6.4 World University Ranking By 4 International Colleges & Universities ...... 37
Methodology ...... 37
Coverage ...... 37
Shortcoming ...... 37
2.3.6.5 Compare and Contrast ...... 37
2.3.7 Summary of measuring research impact ...... 38
2.4 The Relationship Between Research Collaboration and Productivity ...... 39
2.5 The Relationship Between Research Collaboration and Impact ...... 41
2.5.1 Types of collaboration ...... 42
2.5.2 Time factors of the collaboration ...... 42
2.5.3 Regional and subject variations ...... 43
2.5.4 Alternative collaboration measurements ...... 43
2.6 The Relationship Between Research Productivity and Impact ...... 44
2.7 Network Models ...... 45
2.7.1 Graph ...... 45
2.7.2 Random Network Models ...... 45
2.7.3 Small-World Phenomenon ...... 46
2.7.4 The Scale-Free Network Model ...... 48
2.7.5 The Citation Network ...... 49
2.7.6 The Co-citation Network ...... 51
2.7.7 The Co-authorship Network ...... 52
2.7.8 The Erdős Number ...... 52
2.7.8.1 Domain analyses ...... 53
2.7.8.2 Network dynamics and evolution ...... 54
2.8 Chapter Summary ...... 55
3 Research Questions 57
3.1 Research collaboration and impact ...... 58
3.2 Research collaboration and productivity ...... 59
3.3 Research productivity and impact ...... 60
4 Datasets and Methodology 61
4.1 Computing Resources ...... 61
4.2 Datasets and Pre-processing ...... 62
University Master List ...... 63
4.2.1 ACM Digital Library Dataset Description ...... 64
4.2.1.1 University Identification in ACM dataset ...... 66
4.2.1.2 ACM Dataset University Recognition Rate ...... 68
4.2.2 Web of Science Dataset Description ...... 70
4.2.2.1 WoS Data Format and Data Relations ...... 70
4.2.2.2 WoS University Name Processing ...... 72
4.2.2.3 WoS Dataset's University Recognition Rate ...... 73
WoS Institution Types ...... 75
4.2.3 Author Disambiguation ...... 75
4.2.3.1 ORCID and ResearcherID ...... 76
4.2.3.2 Author Identification and Institutional Repository ...... 77
4.2.3.3 Fuzzy approach to Author Identification ...... 77
4.2.3.4 ACM and WoS Author Identification ...... 78
4.3 Descriptive Statistics ...... 79
4.3.1 Paper Productivity ...... 80
4.3.2 Citation ...... 81
4.3.3 Collaborative Papers ...... 82
4.3.4 Collaboration Size ...... 83
4.4 Metrics Selection and Counting Methods ...... 83
4.4.1 Institutional Productivity Measurements ...... 84
4.4.2 Institutional Impact Indicators ...... 84
4.4.2.1 Citations per institution ...... 85
4.4.2.2 Citation PageRank ...... 86
4.4.2.3 Citations per paper ...... 86
4.4.2.4 University League Table ...... 87
4.4.3 Institutional Collaboration Measurements ...... 88
4.4.3.1 Number of times collaborated ...... 88
4.4.3.2 Size-weighted collaboration ...... 88
4.4.3.3 Percent collaboration ...... 90
4.5 Correlation Methods ...... 91
4.5.1 Linear Correlation ...... 92
4.5.1.1 Box-Cox technique for selecting transformation function ...... 92
4.5.2 Non-parametric Correlation ...... 93
4.5.3 Null Hypothesis and Significance Testing ...... 94
4.5.4 Normalisation ...... 94
4.5.4.1 Linear regression ...... 95
4.6 Network Methods and Visualisation ...... 95
4.6.1 Collaboration Network Analysis on ACM Computer Science ...... 96
4.6.2 Improving the Visualisation ...... 98
4.6.2.1 Threshold on Edge Weight ...... 98
4.6.2.2 Minimum Spanning Tree ...... 99
4.6.2.3 Path Constraints ...... 99
4.6.3 ACM Institutional Collaboration Graph ...... 100
4.6.3.1 Visual Analysis of Institution's Country ...... 101
4.6.3.2 Visual Analysis of Institutional Impact and Collaboration ...... 101
4.7 Chapter Summary ...... 103
5 Relationships Among Research Productivity, Research Impact and Research Collaboration 107
5.1 Collaborativity versus Productivity ...... 109
5.1.1 Computer Science ...... 110
5.1.1.1 Top Institutions ...... 110
5.1.1.2 Comparison of ACM and WoS ...... 111
5.1.1.3 Institutional Collaborative Papers ...... 112
5.1.1.4 Correlation of Collaborativity and Productivity in Computer Science ...... 114
5.1.2 Pharmacology ...... 116
5.1.2.1 Top institutions ...... 116
5.1.2.2 Correlation of Collaborativity and Productivity in Pharmacology ...... 117
5.1.3 Materials Science ...... 117
5.1.3.1 Top institutions ...... 117
5.1.3.2 Correlation of Collaborativity and Productivity in Materials Science ...... 118
5.1.4 Psychology ...... 118
5.1.4.1 Top institutions ...... 118
5.1.4.2 Correlation of Collaborativity and Productivity in Psychology ...... 119
5.1.5 Law ...... 120
5.1.5.1 Top institutions ...... 120
5.1.5.2 Correlation of Collaborativity and Productivity in Law ...... 121
5.1.6 Institution's Collaborator and Collaborative Papers ...... 121
5.1.7 Discussion and Summary of Collaboration vs Productivity ...... 122
5.2 Productivity versus Impact ...... 125
5.2.1 Institution Citation ...... 126
5.2.2 Paper Impact ...... 127
5.2.2.1 Researcher Productivity and Paper Impact ...... 129
5.2.2.2 Discussion ...... 132
5.2.3 Institution Rank ...... 134
5.3 Collaboration versus Impact ...... 135
5.3.1 Ranking and number of collaborated papers ...... 136
5.3.2 Ranking and percent collaboration ...... 137
5.3.3 Citations and collaborated papers ...... 138
5.3.4 Citations and percent collaboration ...... 138
5.3.5 Citations per paper and collaborated papers ...... 139
5.3.6 Citations per paper and percent collaboration ...... 139
5.4 Chapter Summary ...... 140
6 Partial Correlations Among Research Productivity, Research Impact and Research Collaboration 145
6.1 Interpretation of the Factor Partialling ...... 146
6.1.1 Factor Partialling Between Productivity and Impact ...... 147
6.1.2 Factor Partialling Between Collaboration and Impact ...... 148
6.2 Partial Correlation Results ...... 148
6.2.1 Collaboration versus Productivity Controlled for Impact ...... 149
6.2.1.1 Computer Science ...... 149
6.2.1.2 Psychology ...... 151
6.2.1.3 Pharmacology ...... 152
6.2.1.4 Materials Science ...... 152
6.2.1.5 Law ...... 153
6.2.1.6 Productivity and proportion of collaborated papers ...... 153
6.2.1.7 Discussion of Collaboration vs Productivity with Impact Controlled ...... 154
6.2.2 Productivity versus Impact Controlled for Collaboration ...... 155
6.2.2.1 Computer Science ...... 155
6.2.2.2 Psychology ...... 156
6.2.2.3 Pharmacology ...... 156
6.2.2.4 Materials Science ...... 157
6.2.2.5 Law ...... 157
6.2.2.6 Discussion of Productivity vs Impact with Collaboration Controlled ...... 157
6.2.3 Collaboration versus Impact Controlled for Productivity ...... 158
6.2.3.1 Computer Science ...... 158
6.2.3.2 Psychology ...... 158
6.2.3.3 Pharmacology ...... 161
6.2.3.4 Materials Science ...... 162
6.2.3.5 Law ...... 163
6.2.3.6 Discussion of Collaborativity vs Impact with Productivity Controlled ...... 163
6.3 Chapter Summary ...... 165
7 Conclusions 167
7.1 Overview of Research Findings ...... 168
7.1.1 Measuring institutional collaborativity, impact and productivity ...... 168
7.1.1.1 Research collaborativity ...... 168
7.1.1.2 Institutional impact ...... 169
7.1.1.3 Research productivity ...... 169
7.1.2 Institutional collaboration percentage ...... 170
7.1.3 Collaboration structure ...... 170
7.1.4 Institutional impact and productivity ...... 171
7.1.5 Individual impact and productivity ...... 171
7.1.6 Paper count ...... 172
7.1.7 Collaboration visualisation ...... 173
7.2 Answers to Research Questions ...... 173
7.2.1 Relationships between collaborativity and impact ...... 173
7.2.2 Relationships between collaborativity and productivity ...... 175
7.2.3 Relationships between productivity and impact ...... 175
7.3 Implications of the Findings for Universities and Scientists ...... 176
7.3.1 Collaborativity does not enhance productivity ...... 176
7.3.2 Discipline differences ...... 177
7.3.3 More is not always better ...... 177
7.4 Limitations and recommendations for future studies ...... 178

List of Figures
2.1 Institutional Collaboration Model ...... 11
2.2 Watts and Strogatz Model ...... 48
4.1 A sample article in the ACM XML dataset ...... 65
4.2 Intermediate data structure used for the ACM dataset ...... 66
4.3 An example entry in the intermediate data structure for the ACM dataset ...... 66
4.4 Alternative method to estimate the number of authors in the ACM dataset ...... 79
4.5 Paper productivity overview ...... 80
4.6 Citations received by each discipline in the period studied ...... 81
4.7 Collaborative Papers ...... 82
4.8 Collaboration Size ...... 83
4.9 Size-weighted collaboration calculation ...... 90
4.10 The correlation measurement triple ...... 91
4.11 ACM institutional collaboration network ...... 96
4.12 ACM institution collaboration distribution ...... 97
4.13 Institutional collaboration graph with PathFinder applied ...... 100
4.14 Institution collaboration graph coloured by country ...... 102
4.15 Institution collaboration graph with colour representing ranking and size representing collaboration ...... 103
5.1 Institution's collaborative paper count in descending order in Computer Science ...... 113
5.2 Correlation results of institutional productivity and collaboration ...... 115
5.3 Institution's collaborative papers and collaborators ...... 123
5.4 Correlation results of institutional productivity and impact ...... 128
5.5 Paper impact and productivity correlation coefficient ...... 131
5.6 High- and low-productivity authors' correlation between paper impact and productivity ...... 132
5.7 Correlation results of institutional collaboration and impact ...... 136
6.1 Correlation results of institutional productivity and collaboration, with impact factors controlled ...... 150
6.2 Correlation results of institutional productivity and impact, with collaboration factors controlled ...... 156
6.3 Correlation results of institutional collaboration and impact, with productivity factors controlled ...... 160
List of Tables
2.1 Coverage of Webometrics Ranking ...... 34
2.2 Pearson correlations between the four rankings ...... 38
4.1 Address variations ...... 69
4.2 ACM institution type distribution ...... 69
4.3 The most used WoS abbreviation rules ...... 73
4.4 Institution types of the WoS data ...... 75
4.5 Dataset overview by discipline ...... 79
4.6 Collaborative paper overview by discipline ...... 80
5.1 Variable pairs for pairwise correlation ...... 108
5.2 Top Institutions in Computer Science (ACM) by paper count and collaboration variables ...... 110
5.3 Top Institutions in WoS Computer Science by paper count and collaboration variables ...... 111
5.4 Spearman's correlation between productivity and collaboration ...... 114
5.5 Top Institutions in Pharmacology by paper count and collaboration metrics ...... 116
5.6 Top institutions in Materials Science based on total papers and collaboration metrics ...... 117
5.7 Top Institutions in Psychology by paper count and collaboration metrics ...... 119
5.8 Top Institutions in Law by paper count and collaboration metrics ...... 120
5.9 Correlation results of productivity vs impact ...... 127
5.10 Sample data for productivity vs impact at the individual level ...... 130
5.11 Correlation coefficient between collaboration and impact. PUBCOLL showed significant positive correlations with all three citation metrics in all disciplines investigated. On the other hand, PUBCOLL% showed significant negative correlation with all of the citation metrics in Computer Science, Pharmacology and Materials Science. Interestingly, the negative correlation diminishes in Psychology and Law, which happen to be SSH disciplines. This could suggest a division in collaboration attitude between NSE and SSH disciplines ...... 143
6.1 Correlation coefficient between productivity and collaboration, with impact factors controlled ...... 149
6.2 Correlation coefficient between productivity and impact, with collaboration factors controlled ...... 155
6.3 Correlation coefficient between impact and collaboration, with productivity factors controlled ...... 159
Declaration of Authorship
I, Jiadi Yao, declare that this thesis titled, ‘Understanding institutional collaboration networks: Effects of collaboration on research impact and productivity’ and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
To my wife Yiwen and my parents For all your love, support and encouragement
Acknowledgements
Thank you to Prof Les Carr for being a fantastic supervisor over the past four years.
He has given me many opportunities to present this work at conferences and provided me with research exchange opportunities during my study.
Thank you to Prof Stevan Harnad for being an incredible supervisor. He has given me crucial insights and guidance towards the completion of this work. Thank you for looking after me while I was on the research exchange in Montréal. Without either of them, this work would never have gone this far.
Thank you to Yves Gingras and Vincent Larivière for the discussions we had in Montréal and for providing me with the research data from the Web of Science.
Thank you to Andrew Bell and Sheila Bowen for spending hours proofreading my work, which probably is not interesting to them.
Thank you to everyone who has worked around me on level 3 of building 32 for the good humour and for distracting me from writing this thesis. I've enjoyed having you guys around. Thank you to all members of the WAIS group, past and present. It's been an honour to know and work with you all.
Abbreviations
There are three main institutional variables – productivity, impact and collaboration – each represented by one of the abbreviations below, with the following meanings:
P Institutional productivity; includes the paper count.
Q Institutional impact; includes all three impact variables (total citations per institution, PageRank-weighted citations and average citations per paper).
C Institutional collaboration; includes all three collaboration variables (collaborative paper count, size-weighted collaboration and percent collaboration).
'Quality' has been used in the literature to indicate different aspects of research activity. The specific meanings of quality discussed in this thesis are as follows:
Institutional Quality The perceived quality of an institution (e.g. historical and current reputation, research output, research output impact, funding and infrastructure). Institutions generally cover entities such as universities, research centres and company research laboratories that produce research output. We focus mainly on universities in this study.
Research Quality The quality of the entire research cycle (e.g. methodology quality, report quality).
Paper Quality The quality of a published paper. Citation impact and its variants are often used as proxies in the literature.
The abbreviations for the sub-variables are as follows.
The original and raw variables are represented in bold and italic:
PUBTOT Total institutional paper output.
CITTOT Total citations per institution.
CITTOTw PageRank-weighted citations per institution: incoming citations weighted by the citation weight of the citing institution.
CITAV Average citations per paper.
WRANK Institutional Webometrics Rank, July 2010 version.
PUBCOLL Number of collaborative papers.
PUBCOLLw Number of collaborative papers, size-weighted by collaboration.
PUBCOLL% Percentage of collaborative papers over total papers per institution.
This study uses partial correlation to remove the effect of the third variable in order to find the true relationship. The partialled states of the variables are represented in italics-only:
PUBTOT Total institutional paper output with impact or collaboration controlled.
CITTOT Citations per institution with productivity or collaboration controlled.
CITTOTw PageRank-weighted citations per institution with productivity or collaboration controlled.
CITAV Citations per paper with productivity or collaboration controlled.
WRANK Institutional Webometrics Rank with productivity or collaboration controlled.
PUBCOLL Number of collaborative papers with productivity or impact controlled.
PUBCOLLw Size-weighted collaboration with productivity or impact controlled.
PUBCOLL% Percent collaboration with productivity or impact controlled.

Chapter 1
Introduction
Scientific and scholarly research is a systematic investigation of data in order to establish facts and reach new conclusions. It is mostly conducted by scientists working in universities, research institutes and companies' research laboratories.
A typical piece of research carried out by these establishments generally has four stages:
1. Identification of the knowledge gap
2. Creation of knowledge
3. Quality assurance
4. Dissemination
At the gap identification stage, researchers make their observations of the world, review previous publications and then propose research questions. The knowledge creation stage is where the practical investigation happens. They design experiments, collect data and then analyse the data. They then document this process, so that anyone can replicate the procedure to obtain the same outcome. Their interpretation of the result is also presented. Peer reviews are conducted afterwards for quality assurance, and these are carried out by experts in the field to make sure the research is sound. Finally, the research is disseminated, so that other researchers can use it as a basis for further research.
1.1 Research Collaboration
Traditionally, research was carried out by a single scientist, or by colleagues in the same laboratory. They would conduct the required experiments themselves, even though they may not initially have the necessary skills or equipment. If they could not conduct an experiment themselves, they would reach out to a potential collaborator to get help, in return for including them as co-authors. This kind of ad-hoc collaboration is heavily dependent on the researcher's personal connections. The first paper with multiple authors listed was published in 1665 [122].
The problems and questions scientists try to resolve are getting more and more complex and increasingly multi-disciplinary: for example, global warming and climate change, or research on the Web and the need to understand its reciprocal effects on human society. These topics require experts from multiple disciplines, and the ad-hoc mode of collaboration can no longer sustain the needs of such research. As a result, collaborative activities, both bottom-up (researchers collaborating with each other) and top-down (funders funding collaborative research; institutional collaboration), have increased dramatically in recent years.
Collaborative research has also been strongly encouraged by institutions on the assumption that researchers working together – especially across institutions and across national borders – generate research that is higher in both quantity and impact. Institutions are not alone in encouraging collaboration. Funding bodies, such as JISC and the European Framework Programme (FP), often have specific requirements for individual, inter-institutional and international collaboration in their funding calls [8, 55].
1.2 Funding Research
The funding structure differs from country to country. In the UK, most research is publicly funded. A percentage of taxpayers' money is allocated to funding councils and research councils. England, Scotland, Wales and Northern Ireland each has a corresponding higher education funding council to allocate funds to institutions. In 2014, they jointly conducted the Research Excellence Framework (REF 2014) to assess the research output of UK institutions. The REF outcome will be used as a reference for allocating funds between institutions. There are also different research councils focusing on specific fields, e.g. the Engineering and Physical Sciences Research Council (EPSRC) and the Economic and Social Research Council (ESRC). These research councils steer the direction of research by releasing calls for bids. Scientists then put forward research proposals to compete for grants. Charities in the UK are also major funders, especially in medical research (e.g. Cancer Research UK focuses on funding cancer-related research; the British Heart Foundation focuses on funding heart-related research). Some research is also funded by industry and private companies.
1.3 The Role of Universities in Research
The role of universities in producing research output varies with a country's development and economic status. Research involves high investment and does not generate a direct return on that investment. Universities in less developed countries, due to their economic status and government strategies, may not allocate much funding to conducting research. In these countries, universities tend to focus on education.
In the UK, US, Canada and Australia, universities are among the biggest producers of research publications [105]. In the UK, a group of 24 research-focused universities forms the Russell Group. This group accounts for two-thirds of all UK research grant spending. In the US, universities are also evaluated and ranked on their research activities, resulting in high pressure for them to do well in research. The US alone accounts for 28% of all world research as recorded by WoS, most of which is university output [200].
1.4 Competitions between Universities and Scientists in Research
There is substantial competition among academic institutions. They compete for students, scientists, reputation, and funding. For success, they need to excel in both teaching and research. Universities accordingly have policies that encourage researchers to publish frequently on the assumption that this generates both more and better research.
Public perception of a university's quality often includes, but is not limited to, students' experience, students' prospects, and research output performance. In these respects, university rankings play an important role in communicating universities' quality to the public. There are more than a dozen well-established university quality rankings. They are used by the general public, potential students, researchers and donors to assess a university's overall quality. Almost all of these rankings base their evaluations on some aspect of research activity, e.g. some count the university's number of publications and some use citations to its papers. Many universities take these rankings seriously and work hard to improve their positions, even though experts may have their own views regarding the validity of these rankings [22].
Scientists have their own incentives to make a stronger impact on the community, and producing more publications is one effective approach. In academia there is a well-known phrase – "publish or perish" [13] – that describes the pressure to rapidly and regularly publish academic work. Publishing more is important for a scientist's career advancement. In order to secure a research position, to sustain or to further one's career, and to succeed in applying for grants, scientists need to publish frequently. They are often judged by the number of papers they have published, the quality of the journals they have published in and the number of citations they have received for their papers. Producing more publications apparently makes some of these measurements look better.
1.5 Research Contributions
The literature studying the relationships among citation impact, collaborativity and productivity is extensive and comprehensive, but most studies focus on the aggregation level of the individual researcher, journal, discipline or country. This study looks at these relationships specifically at the institution level, using large-scale data covering multiple disciplines.
Many studies consider the variables separately. As we demonstrate in chapter 4, citation impact, collaborativity and productivity are circularly correlated. This means that there might be a common factor present in each variable which leads to the high pairwise correlations. The current study uses partial correlation to remove the effect of the third variable in the circular correlation before correlating the remaining pair of variables, giving evidence that increased citation is more associated with productivity than with collaborativity at the institution level in the disciplines studied.
This also gives additional evidence of disciplinary differences in research collaboration and citation. Institutions are recommended to develop discipline-specific strategies for encouraging research collaboration and publication. Top-cited institutions do not publish their engineering and natural science papers collaboratively, while they do for their social sciences and humanities papers.
Finally, this study processed tens of thousands of lines of free text representing universities, and created a lookup table of universities' alternative names that may be useful in other studies.
1.6 Thesis Structure
In chapter 2, we start by studying the meaning of institutional quality, institutional collaboration and productivity, and the methods used to measure these factors. Attempts to establish the relationships between these three activities are also reviewed.
Chapter 3 sets out the research questions. Chapter 4 describes the resources, datasets and methods used in the subsequent chapters. In particular, it describes the necessary pre-processing done to the datasets in order to apply the correlation methods. The assignment of papers, citations and collaborations to universities is presented. Based on the datasets and previous studies, I make decisions on which metrics to use to approximate the three factors; these decisions are justified and the metrics explained. The statistical methods – correlation, normalisation and partial correlation – that help in understanding the relationships between the metrics are described. A network visualisation of the institutional collaboration network is presented, showing evidence of the expected correlations.
Chapter 5 investigates and presents the results of the pairwise correlations between the three factors. These pairwise correlations use the unfiltered variables that come directly from the statistics of the universities. Unfiltered variables were commonly used in previous studies; comparisons are made where the results are comparable with those studies.
Chapter 6 takes chapter 5 one step further by partialling out the unwanted variables in the pairwise correlations. The results are presented and compared.
Finally, the conclusions of the correlation analysis, the answers to the research questions and recommendations on research practice for universities and scientists are presented in chapter 7.

Chapter 2
Measuring Research Activities and their Relationships
2.1 Research Collaboration
The concept of collaboration has been taken for granted in most of the literature, and there have been few attempts to address its meaning. Collaboration means that individuals work together to achieve a common goal. Research collaboration, therefore, means that individuals work together to advance scientific knowledge. But this immediately raises a question: how closely should they be working to qualify as collaboration? In one extreme case, as Subramanyam [178] suggested in his paper, the entire scientific community is like one big collaboration because its members work collaboratively to advance science: scholars learn from each other's publications, comment on each other's work, suggest hypotheses to test, exchange ideas and share techniques.
In the other extreme case, if research collaborators are only those who have contributed directly and frequently to all tasks in a piece of research, then almost all collaborating researchers will be excluded, even those working closely together who have published scientific papers together. This is because one of the main reasons for researchers to collaborate is that not everyone knows and does everything, so during a scientific collaboration, it is not common for every collaborator to be involved equally intensively in every aspect of the research.
Collaboration can have different meanings in different research areas and contexts. A technician operating a specific piece of machinery may be counted as a collaborator in one discipline (e.g. in physics, operators are listed as co-authors), but may not be in others (e.g. in medical research, the operator of radioactive medical equipment is not listed as a co-author).
2.1.1 Co-authorship as Collaboration Measurement
We have seen that, due to its complexity, it is not possible to strictly measure the strength of research collaboration from the interactions themselves, so scholars have also tried quantitative measures.
Attempts to quantify research collaboration date back to the 1950s. The psychologist Smith [172], observing the increase in multi-authored publications, was the first to suggest using multi-authorship as a proxy to quantify collaborative activities between researchers. This is often referred to as co-authorship. Co-authorship is a research practice in which the primary author includes other researchers as co-authors in the publication to recognise their contributions to the work.
Price and Beaver [159] were among the first to use co-authorship as the measure of collaboration strength between authors. Since then, co-authorship has been used widely as a metric of research collaboration.
Collaboration strength between a pair of authors is often approximated by counting the number of papers that list both of their names. A pair of authors who have co-authored more papers is considered more collaborative than a pair who have co-authored fewer papers. By connecting authors who have co-authored papers, a network of authors can be constructed; the network methods are discussed in chapter 4. More advanced algorithms than simply counting the number of co-authored papers were presented and compared by Rousseau [163].
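As an illustration of this pair-counting approach, the following Python sketch builds the weighted edge list of a co-authorship network. The author names and paper records are invented for the example, not taken from the thesis datasets:

```python
from itertools import combinations
from collections import Counter

# Each paper is represented by its list of author names
# (a simplified stand-in for a parsed bibliographic record).
papers = [
    ["Price", "Beaver"],
    ["Price", "Beaver", "Smith"],
    ["Smith", "Rousseau"],
]

# Collaboration strength between a pair of authors, approximated
# as the number of papers on which both names appear.
pair_counts = Counter()
for authors in papers:
    # Canonical ordering so (A, B) and (B, A) share one key.
    for a, b in combinations(sorted(set(authors)), 2):
        pair_counts[(a, b)] += 1

# The weighted edge list of the co-authorship network.
for (a, b), weight in sorted(pair_counts.items()):
    print(f"{a} -- {b}: {weight} co-authored paper(s)")
```

Here "Price -- Beaver" receives weight 2, making that pair the most collaborative in the toy dataset.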
The suitability of co-authorship as a measure of collaboration has been investigated by Subramanyam [178]. He noted that the nature and magnitude of a collaboration change during its course, so the precise nature and magnitude of collaboration cannot be determined using interviews or questionnaires, let alone co-authorship. A popular example is that a casual conversation during a coffee break could contribute more to the success of a project than a week of laboratory work.
Using co-authorship relies on the assumption that the goal of collaboration between the authors is publishing articles – that is, that a co-authored paper is indeed a result of collaboration, not something else. However, Katz [104, 105] and He et al. [93] demonstrated several scenarios where either collaborative work does not yield a co-authored paper, or a co-authored paper does not involve any collaboration between the listed co-authors. In addition, social pressures come into play in determining whose name should be included on a published article [97]. Indeed, senior scientists (e.g. the leader of a research group) are often listed as co-authors despite not having contributed. Furthermore, Bozeman and Lee [33] identified two specific types of co-authorship that often appear but are less relevant to collaboration: 1. data providers listed as co-authors; 2. equipment providers listed as co-authors.
As a result, we should be aware of the limitations of using co-authorship. While the social aspects of collaboration may not be quantifiable using conventional methods, co-authorship offers a unique quantitative approach to measuring the strength of collaboration between authors. Using co-authorship to measure collaboration has the following advantages:

• Invariant and verifiable: anyone with the same dataset, using the same method, can reproduce the result.

• Inexpensive and practical: counting co-authorships in a dataset is computationally cheap, the programming is often simple and straightforward, and the data are readily available.

• Large sample size: statistically more informative than case studies or qualitative methods.

• Non-reactive: it will not lead authors to co-author papers in the short term to increase their collaboration, although wide adoption of this measurement could potentially affect the collaboration process in the longer term [178].
2.1.2 Other Approaches to Measure Collaboration
Beyond co-authorship, other measuring methods have also been used. Rigby and Edler [162] used the proportion of co-subprojects (people who worked on the same sub-project) to measure the intensity of collaboration within a project. They constructed a network based on the sub-projects having bilateral and multilateral cooperation in their project reports. The proportion between the number of nodes and the number of edges in the network (i.e. the network density) represents the intensity of the collaborations within the project. They used this measurement to compare the intensity of collaboration across multiple projects.
An interesting collaboration measurement model based on citations was proposed by Katz and Hicks [103]. In their research, they found that where a paper represents a home or domestic institutional collaboration, its citation count increases by 0.75 on average compared to a non-collaborative paper; if the paper involves international institutions, the citation count increases by 1.6 on average. As a result, the different collaboration types can be quantified by the increase in citation count, thus perhaps also reflecting the quality of different categories of collaboration.
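Read as an additive model over a non-collaborative baseline – one possible interpretation of the averages quoted above, not Katz and Hicks's own formulation – this could be sketched as follows (the function and baseline value are hypothetical):

```python
# Average citation increments reported in the text above.
DOMESTIC_BONUS = 0.75       # home/domestic institutional collaboration
INTERNATIONAL_BONUS = 1.6   # collaboration involving international institutions

def expected_citations(baseline: float, collaboration: str) -> float:
    """Expected citation count under an additive reading of the model."""
    if collaboration == "none":
        return baseline
    if collaboration == "domestic":
        return baseline + DOMESTIC_BONUS
    if collaboration == "international":
        return baseline + INTERNATIONAL_BONUS
    raise ValueError(f"unknown collaboration type: {collaboration}")

# A non-collaborative paper averaging 3.0 citations would be expected
# to average 4.6 citations with international collaborators.
print(expected_citations(3.0, "international"))
```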
2.1.3 Collaboration Model of Institutions
In academia, a model of institutional collaboration is based on projects that involve multiple institutions. Starting from the application for a grant, researchers from multiple institutions jointly submit applications for the project. If the application is successful, the project then becomes a point of collaboration for these researchers. The papers that come out of the project are commonly affiliated to the participating institutions (Figure 2.1).
Figure 2.1: Institutional Collaboration Model. Researchers from two institutions form a project, which then produces papers signed by both institutions.
Based on this model, we propose a definition of institutional collaboration used in this thesis:
Definition: Two institutions are said to have collaborated if there is a published paper signed by both institutions' researchers.
Using measures like co-authorship as indicators of degree of collaborativity is not new.
Davidson Frame and Carpenter [52] used multi-country affiliations (two or more authors affiliated with institutions in different countries) as a metric for international collaborativity; Katz [105] used multi-institutional affiliations (two or more authors affiliated with different institutions) as a metric for inter-institutional collaborativity. In his data analysis, Katz realised that international collaboration and inter-institutional collaboration need not be based on individuals. Nearly 12% of articles in his Web of Science (WoS) dataset list more institutions than authors, which indicates that researchers hold positions in multiple institutions. He took this to be evidence for top-down inter-institutional collaborations between the institutions that share the same researcher. About 2% of the papers in our WoS dataset contain more institutions than authors.
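A minimal sketch of how such papers can be flagged, assuming a hypothetical record format in which each paper carries its author list and its list of distinct affiliated institutions:

```python
# Hypothetical minimal records: authors plus affiliated institutions.
papers = [
    {"authors": ["A. Smith"], "institutions": ["Univ X", "Univ Y"]},
    {"authors": ["B. Jones", "C. Wu"], "institutions": ["Univ Z"]},
]

# Katz's indicator of top-down inter-institutional collaboration:
# papers that list more institutions than authors.
flagged = [p for p in papers
           if len(set(p["institutions"])) > len(p["authors"])]
share = len(flagged) / len(papers)
print(f"{share:.0%} of papers list more institutions than authors")
```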
The measurements discussed so far are all at a macro level, measuring how many times collaboration happens between authors and institutions. The more frequent the collaboration, the stronger the collaboration between the entities. These macro-level methods assume that every collaboration is equal. This is not always the case. In recent years, there has been a growing realisation that collaboration is not well understood at the individual level, nor is it well understood how individual-level collaboration affects macro-level collaboration.
2.1.4 Collaboration at the Micro Level
One line of work considers the collaborator's role and its impact on collaboration. Heffner [94] suggested the concept of 'sub-authorship', where sub-authors perform two primary kinds of roles: technical aid and theoretical aid. Technical aid provides technical support such as collecting data or operating machinery. Theoretical aid provides assistance such as reading, editing or commenting on the research paper prior to publication. Each collaborator takes a different role in the collaboration, such that the research could not be completed without some of those roles. The strength of these kinds of collaboration cannot be easily quantified. Melin [129] had a different view of the way researchers collaborate. He conducted surveys, interviewed researchers and concluded that there are two modes of collaboration. In one mode, the work is coordinated in all aspects and there are clear divisions of labour, which supports Heffner's finding; in the other mode, the collaborators discuss and exchange ideas over many research questions, followed by an iterative process of research and re-writing, such that individual contributions are no longer identifiable in the final product.
Jeffrey [98] studied an inter-disciplinary research collaboration and found there were more problems than benefits. Communication is one of the biggest obstacles, due to the different backgrounds, specialised vocabularies and so on. He also found that the group had to communicate using metaphors and story-lines, and had to choose their words very carefully in order to be understood. This work suggests that at a micro level, not all collaborations are positive and bring benefit to the end result. There can be collaborations that bring more problems than they solve. Pravdić and Oluić-Vuković [155] suggested something similar in analysing collaboration at the individual level as well as the group level in Chemistry. They found that productivity is affected by whom you collaborate with: collaboration with high-productivity authors increases an author's personal productivity, whereas collaborating with low-productivity authors decreases it. This result confirms that not all collaborations are positive.
2.1.5 Summary of measuring research collaboration
Measuring the degree of research collaboration is not a solved problem. Researchers have attempted novel ways to quantify collaboration, but due to the nature of human interaction, true collaboration strength can never be accurately captured. At the micro level, collaboration is a complex social phenomenon: researchers have different reasons to collaborate and take different roles in collaboration, so types of collaboration are best treated differently. At the macro level, when hundreds and thousands of collaborations take place between individuals and institutions, collaboration frequency gives us a measure of collaboration. Co-authorship has been widely used in the literature as a measure of collaboration strength; its advantages were listed and discussed along with its limitations. Building on co-authorship, the strength of collaboration at higher aggregation levels, e.g. departmental, inter-disciplinary, inter-institutional and international collaboration, can be modelled.
2.2 Research Productivity
A well-recognised definition of productivity (within the domain of management), according to Swiss [179], is the ratio between the output and input of a system. The system can be an individual or an institution whose productivity is to be assessed. This implies that to accurately measure productivity in conducting scientific research, one has to identify and measure all the inputs which went into the research and all the outcomes resulting from it.
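Restating Swiss's ratio definition as a formula (our notation, not Swiss's):

\[
  \text{Productivity} \;=\; \frac{\text{Output of the system}}{\text{Input to the system}}
\]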
Potential inputs for conducting research have been explored in the bibliometric field. These include, but are not limited to: expenditure, number of researchers, person-hours, length of time, and efficiency of researchers [70, 78]. However, many of these data are difficult to obtain, either because they are difficult to measure (person-hours, efficiency of researchers) or because the data are generally not available (expenditure, number of researchers, person-hours). Without the data to measure the input to research, previous studies tried to estimate it using various indicators, on the assumption that they are correlated with the input. These indicators include the number of journal articles, books and citations [70, 71, 78]. The range of outputs of a research activity includes journal articles, conference papers, proceedings, patents, books, book chapters, book reviews, comments, prizes, licences, lectures and technical reports. Since the variables used to estimate the input overlap heavily with the output variables, the ratio between the two does not yield meaningful productivity data.
With current data availability, precise calculation of research productivity as the ratio between output and input is not possible in the present study, as was also the case in many prior bibliometric studies. Instead of using this ratio, research productivity is often measured as publication productivity [70]. A strong correlation between the two has been demonstrated [101, 121, 159].
However, publication productivity is not the same as research productivity: it only includes what is written, which is only a subset of research productivity (research conducted). The literature also points out its problems as a metric. Katz [101] found that different research fields put different emphasis on different publication types. For instance, the social sciences pay more attention to publishing books while the natural sciences mostly publish in peer-reviewed journals. So counting only one type and not the other makes disciplines incomparable and comparisons inequivalent. More recently, Kyvik and Teigen [107] took an interesting perspective and argued that a good-quality article should be treated as more productive (weighted more) than a lower-quality one: where two researchers each publish ten papers in a year, the more productive researcher should be considered the one who has published in better journals (higher impact factor or another measure of journal quality) or has been cited more often. These limitations must be taken into account when using publication productivity as a metric for research productivity.
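As an illustration of Kyvik and Teigen's argument, the following sketch weights each paper by a quality proxy – here a journal impact factor, with invented numbers – so that two researchers with equal raw paper counts come out very differently once weighted:

```python
# Impact factors of each researcher's papers (hypothetical values).
papers_a = [2.1, 2.3, 1.8]   # researcher A publishes in stronger journals
papers_b = [0.4, 0.5, 0.6]   # researcher B publishes in weaker journals

# Raw counting treats every paper as 1 credit: A and B look identical.
raw_counts = (len(papers_a), len(papers_b))      # (3, 3)

# Quality-weighted counting credits each paper by its quality proxy.
weighted = (sum(papers_a), sum(papers_b))        # (6.2, 1.5)

print(raw_counts, weighted)
```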
2.2.1 Publication Attribution
The research process is complex, especially with multiple collaborators involved. The contribution of each collaborator is difficult to measure, and the actual contributions can vary significantly. On the one hand, a collaboration can be the result of one individual doing most of the work and contributing the most, so that he should be weighted more heavily for the credit; on the other hand, researchers can take different roles within a collaboration, any one of which may be important to the success of the research, which sometimes justifies an equal weighting among the collaborators. How to credit a co-authored paper to its authors is also an area of interest to the bibliometric research community. There are three established counting methods [33, 114, 183], depending on the authors' affiliations:
Straight Count
Only the first author gets the credit; all the rest are ignored. Cronin and Kara [51] have shown that the straight count gives disproportionate credit to senior researchers, whose names are often listed first, while junior researchers are completely ignored. This counting was used widely in the early years of bibliometric research, when electronic citation databases were not available. Using this counting method means that only the first author's institution gets the credit.
Full Count
All authors listed on the publication get 1 credit each. Their institutions also get at least one credit.¹ The drawback of this counting method is that it exaggerates the contribution made by heavily co-authored publications and weights them more. For example, a publication with 10 co-authors gives 1 credit to each author, a total credit of 10, while a paper with 2 co-authors carries only 2 credits.
Partial count

¹The exact counting process in the literature is far from obvious. Some literature may have credited the same institution with multiple credits when multiple authors come from the same institution.
Sometimes referred to as an adjusted count, normalised count or fractional count, where the 1 credit is equally divided amongst all authors, so each institution gets a fraction of the one credit. The full count and partial count are frequently used in citation-based performance research. The full count is mathematically simpler than the partial count, but it has the problem of double-counting credit. The partial count method does not have this problem: the sum of all the credits equals the total number of publications in the dataset.
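The three schemes can be summarised in a short sketch. The paper records and author names are hypothetical, and credit is attributed here to authors rather than institutions for brevity:

```python
from collections import defaultdict

def credit(papers, method="partial"):
    """Distribute publication credit under the three counting schemes.

    `papers` is a list of author-name lists, first author first.
    """
    scores = defaultdict(float)
    for authors in papers:
        if method == "straight":      # first author takes all the credit
            scores[authors[0]] += 1.0
        elif method == "full":        # every author gets one full credit
            for a in authors:
                scores[a] += 1.0
        elif method == "partial":     # one credit split equally
            for a in authors:
                scores[a] += 1.0 / len(authors)
        else:
            raise ValueError(f"unknown method: {method}")
    return dict(scores)

papers = [["A", "B", "C"], ["A"], ["B", "C"]]
for m in ("straight", "full", "partial"):
    print(m, credit(papers, m))
# Only under partial counting do the credits sum to the number of papers (3).
```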
Different counting methods can give quite different results in subsequent analyses. Gauffriau et al. [79] compared the country research scores obtained using different counting methods, and found score reductions as high as 72% when using partial counts instead of full counts. The full count particularly favours those countries that frequently participate in collaboration.
Publication counts as a productivity metric have also been applied at the institutional and country level. Katz [105] used the number of published papers as the measure of university research output. Egghe et al. [61] used publication counts to evaluate individual, departmental and institutional performance.
2.2.2 Summary of measuring research productivity
Due to the current limitations on accessing and measuring the inputs that go into research, research productivity is difficult to measure precisely under conventional productivity definitions. Alternatively, publication productivity is widely used as an approximation for the productivity of individuals and institutions. It should also be noted that at the institution level, using publication productivity alone, without factoring out the input (such as the number of researchers), gives an unfair advantage to larger institutions. Assigning co-authored publications to the contributing authors is not straightforward, and bias may occur if an inappropriate counting method is used. Three publication attribution methods were discussed and their advantages and implications presented.
Institutions conducting research do not only make quantitative contributions by publishing more; the quality of their research should also be accounted for. I will discuss how a piece of research is assessed for quality in the following section.
2.3 Research impact
The quality of a piece of research is multi-dimensional. It can be viewed from different angles: e.g. a quality piece of research may be one that advances scientific understanding; or one that makes an impact on the scientific community, generating a lot of debate and discussion; or one that makes an impact in industry; or one that experts in the field recognise as such. There are broadly four perspectives in the literature on measuring the quality of a piece of research. They are: 1. methodological quality (how good is the idea and process), 2. reporting quality (how good is the write-up), 3. quality based on peer review (how well it is recognised by the experts), and 4. bibliometrics as a proxy (how well it is recognised by the community); the last is often referred to as research impact. These perspectives are not mutually exclusive, e.g. an expert doing peer review may consider factors involving reporting quality. I give a brief introduction to 1–3, then discuss research impact using bibliometric methods in detail.
Methodological quality
The methodological quality of a piece of research measures how well it follows the research process [82, 88]: the significance of the research question, coverage of the literature, design of the experiment, and whether the design of the experiment can in fact address the research question. But since these processes vary across disciplines, methodological quality measurements are used more frequently in certain disciplines, e.g. health [120] and education [82], than in others. There is no consensus on a specific set of standards that ensures research of high quality, and different disciplines may have their own definitions of high-quality research that can be very different from others'.
Reporting quality
In published research, the report write-up is often evaluated for quality. Poorly produced reports that lack the essential details needed to replicate the experiment often indicate low-quality research. In medical research, the degrees of freedom and P-values are sometimes absent [75]. When such core details are missing, the credibility of the result comes into question and the entire piece of research, whatever interesting findings it may contain, becomes of little value and is hence considered lower-quality research. To facilitate better reporting, several 'checklists' have been developed by various consortia for general and specific research designs, e.g. the Consolidated Standards for Reporting Trials (CONSORT)² and the Quality of Reporting of Meta-Analyses (QUOROM)³. These checklists help authors decide what needs to be included in the report, but they are not an evaluation instrument.
Peer review
Peer review is the process of asking for experts' opinions on the quality of a piece of research. The report of the research is sent to the experts to read, and they give a rating and comment on the research. Peer review is an established method and is widely used in evaluating research journal and conference articles. Peer review is able to provide immediate quality indicators for a piece of research, unlike quantitative metrics (e.g. citations) that may take months or years to accumulate. Peer review is often referred to in the literature as the preferred process for evaluating research when possible [34, 89]. However, conducting peer review is expensive. Firstly, groups of experts covering the entire subject area must be employed; secondly, reading and reviewing articles is a very time-consuming process; finally, an article needs to be reviewed by several experts in order to arrive at a fair rating, adding a greater workload. Peer review is a common practice for journals and conferences to rate and select quality papers; this is because they operate in a very specific subject area and there is no problem finding experts (peer reviewers tend to be the participants themselves).
Peer review can be subjective and coloured by personal opinions on quality. Experts, especially rivals, may disagree strongly on certain subjects, which can potentially lead to unfairness and abuse in the peer review process. As a result, article quality ratings based on peer review can be non-objective and non-reproducible.
2http://www.consort-statement.org/statement/revisedstatement.htm
3http://www.consort-statement.org/QUOROM.pdf
2.3.1 Research Impact using Bibliometric Methods
I have discussed three of the four perspectives for evaluating the quality of a piece of research: methodological quality (evaluating whether the process of the research is up to standard); reporting quality (evaluating whether the content of the research report is up to standard); and peer review (experts in the area evaluating the quality of a piece of research for publication or for a research performance review). However, these are qualitative methods and are very expensive to execute. In this section, I introduce bibliometric methods, which use quantitative bibliometric data, such as citations, to indicate the impact of research. These data are recorded as part of the scientific research process, and access to them is becoming easier.
Authors cite the research they use and discuss, and these citations can be counted.
In the 1960s, Garfield [76] introduced an electronic citation database, in which the citations were harvested from each paper and indexed. Using his database, citations have been studied widely. Although Garfield warned against it, citations have also been used as quality metrics. The rationale is that, at large scale, researchers' citations are most likely to be positive responses to previous work. Compared to the other three methods, bibliometric methods rely on data recorded and collected publicly, so anyone with access to the bibliometric data is able to reproduce the same measurement, making the metric more objective than subjective evaluations.
Citation counts are often taken as a proxy for the impact of earlier research on later research [30, 126]; citation counts are hence referred to as citation impact. Citation impact and citation frequency have been used as an indicator of quality in many previous studies [76, 112, 113].
However, Goldfinch [87] argues that an increased number of citations can also be the result of an author having a larger social network, which in turn increases his visibility and his citations, so it does not necessarily reflect impact. Citations have also been compared with peer review, generating intensive debate, which I discuss in Section 2.3.2. With the potential limitations of citations in mind, I present next the various uses of citations in measuring impact.
2.3.1.1 Citation Count
Citation count means the total number of citations an article has received to date. It is one of the most widely used article impact measures due to its availability and simplicity.
Various databases publish data with citation counts included. However, there are problems. Firstly, because publication years differ, the longer an article has been published, the more time it has had to accumulate citations, so an unfair advantage is given to articles published earlier. Secondly, this count treats all citations equally: a citation from a highly cited paper should arguably be worth more than a citation from a never-cited article, yet the impact of each individual citation is ignored. Finally, owing to differences in citation practices, some disciplines have short publication cycles, resulting in a faster rate of citation and more accumulated citations than in other disciplines.
Citation measurements cannot be directly compared without normalisation.
Author Self-citation Frequently, authors cite their own work. According to an analysis by Aksnes [6], within a three-year citation window, author self-citation in certain disciplines can reach as high as 36% of total citations. Baldi and Hargens [14] showed that author self-citation is relatively infrequent in general, but quite frequent in certain disciplines.
Authors cite their own work for different reasons, for example, due to the cumulative nature of research; the need for personal gratification; or as a struggle for visibility and scientific authority in the community [27, 68].
Self-citation can inflate citation-based metrics. Fassoulaki et al. [66] report that self-citation can significantly increase a journal's average citation count per article. Schreiber [166] showed that self-citation increases the h-index4. As a result, it is common practice to remove self-citations from the source data before analysis begins.
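As an illustration of this preprocessing step, here is a minimal sketch of removing author self-citations before computing any metric; the record structure and identifiers are invented for the example, not taken from this thesis's dataset.

```python
# Hypothetical records: each paper carries a set of author identifiers,
# and each citation is a (citing, cited) pair of paper ids.
papers = {
    "p1": {"a1", "a2"},
    "p2": {"a2", "a3"},
    "p3": {"a4"},
}
citations = [("p2", "p1"), ("p3", "p1")]

def is_self_citation(citing, cited):
    """Treat a citation as a self-citation if the papers share any author."""
    return bool(papers[citing] & papers[cited])

# Keep only citations between papers with disjoint author sets.
cleaned = [c for c in citations if not is_self_citation(*c)]
print(cleaned)  # [('p3', 'p1')] -- the citation sharing author a2 is dropped
```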
2.3.1.2 Windowed Citation Count
Citation windows are used to address the first problem. Instead of counting all the citations an article has received to date, a window period after the article's publication is used, and only citations received within this window are counted. The WoS impact factor calculation adopts a two-year window. Wider window sizes have also been investigated. He et al. [93] compared a two-year window with a three-year window for estimating article impact and found no difference; Campanario [38] compared journal impact factors based on a two-year citation window with a longer five-year window and found the two behave very similarly. Wang [192] conducted a larger analysis using 30 years' worth of WoS citation data in multiple disciplines and found that the more frequently an article is cited, the weaker the correlation between its windowed citation count and its eventual citation count. Windowed citation counts are therefore not a good estimate for highly cited articles.
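A windowed count is easy to express once each citation event carries a date; the sketch below (with invented data) counts only citations arriving within a fixed number of years of the cited article's publication.

```python
from collections import Counter

# Hypothetical data: publication years, and (citing, cited, year) citation events.
pub_year = {"p1": 2005}
citations = [("p2", "p1", 2006), ("p3", "p1", 2007), ("p4", "p1", 2011)]

def windowed_counts(window_years):
    """Count citations received within `window_years` of publication."""
    counts = Counter()
    for _citing, cited, year in citations:
        if 0 <= year - pub_year[cited] <= window_years:
            counts[cited] += 1
    return counts

print(windowed_counts(2))  # Counter({'p1': 2}); the 2011 citation falls outside
```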
2.3.1.3 PageRank Weighted Citation
As mentioned among the problems of the plain citation count, a refinement is to consider the impact of the citers: in addition to the number of citations a paper has received, the citing papers' own citations should also be taken into account. This kind of iterative measurement is commonly exercised on the Web. The ranking algorithm used by the Google search engine, PageRank, is specifically designed to calculate such an iterative score over linking documents. The Web can be viewed as a set of documents with links pointing to other documents. Page et al. [151] observed that a popular web page is one not only linked to by many other web pages, but also linked to by heavily linked web pages. They proposed an iterative algorithm that works on a directed graph like the Web, where each link between web pages has a source and a destination; the algorithm calculates a score for each web page, which indicates the quality of that page. Like web pages linking to each other, citations link citing articles to cited articles. By connecting articles using citation links, a directed graph like the Web can be constructed, and the PageRank algorithm can therefore be applied [24, 127]. With PageRank-weighted citation, an article's measured impact no longer depends simply on having more citations: a citation from a highly cited article counts for more towards the impact of the cited article.
4A weighted joint indicator of citations and productivity for measuring researcher quality [95].
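To make the idea concrete, here is a minimal sketch (not the implementation used in this thesis) that applies PageRank to a toy citation graph using the networkx library; an edge A -> B means "A cites B", so score flows towards cited papers.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("a", "c"), ("b", "c"),  # c is cited twice
    ("c", "d"),              # d is cited once, but by the well-cited c
    ("e", "f"),              # f is cited once, by an uncited paper
])

scores = nx.pagerank(g, alpha=0.85)  # alpha is the usual damping factor
for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
# d ends up scoring higher than f although both have exactly one citation,
# because d's citer is itself well cited.
```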
2.3.1.4 Network-based Centrality Measures
Another variation on using citation as an impact measurement adopts social network techniques. These methods treat citations as links in a network, similar to the PageRank method; but instead of considering each citation as a weighted 'vote' for an article, citations are considered as information flow in a network [23]. If article A cites article B, then information flowed from B to A. The centrality of each article is calculated based on its position in the network. This includes degree centrality, which counts how many articles have cited it (coinciding with the original citation count); and closeness centrality, which calculates the mean distance, in citation links, to all other articles. The article with the greatest closeness centrality has the shortest mean distance to all other articles; such an article can be a good starting point for a literature review in the domain. Finally, betweenness centrality measures an article's importance in terms of information flow: the article with the greatest betweenness centrality lies on the shortest paths between the most pairs of articles, which may indicate that it channels the most information in the citation network.
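The corresponding computations are straightforward with networkx; the following sketch (toy data again) computes the three centralities named above on a small citation graph.

```python
import networkx as nx

g = nx.DiGraph([("a", "c"), ("b", "c"), ("c", "d")])  # X -> Y means X cites Y

in_degree = dict(g.in_degree())             # degree centrality: times cited
closeness = nx.closeness_centrality(g)      # based on mean distance to a node
betweenness = nx.betweenness_centrality(g)  # share of shortest paths through a node

print("citations:", in_degree)
print("closeness:", {n: round(v, 2) for n, v in closeness.items()})
print("betweenness:", {n: round(v, 2) for n, v in betweenness.items()})
```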
Applying social network analysis techniques to citation metrics is interesting and opens a new perspective on citation, but it lacks theoretical support and empirical evidence for the robustness of the methodology, so I will not use these measures to approximate an article's impact in this study.
2.3.1.5 Journal Impact as an Indicator of Article Impact
A journal's impact is often used as a proxy for the impact of its articles. In particular, the WoS impact factor is sometimes directly used to express the impact of articles [93].
This may at first sound odd, and it has obvious flaws, but the argument is that if an article is accepted by a journal, the article is at least up to the standard of the journal. This is the same logic applied when graduates are judged by the university they graduated from rather than by the level of achievement they graduated with. Still, this practice is frequently criticised and discouraged. The obvious problem is that it treats all the articles in a journal as having the same impact, which is certainly false. Several studies [3, 39] show that the citation distribution of articles published in the same journal follows a power law: the top few articles receive the majority of the citations, while the rest receive the remaining few. Using the arithmetic mean citation count of a journal, as the WoS impact factor does, to describe all its articles is therefore inaccurate. Journals also evolve over time and their impact factor varies year by year, so taking a snapshot of this continuously changing impact factor to describe the articles is likewise inappropriate. The WoS impact factor is also well known for its inability to prevent manipulation; many previous studies [4, 11, 133] have investigated and listed strategies some journal publishers use to increase their impact factor dramatically (e.g. requiring authors to cite prior research published in the same journal).
2.3.1.6 Reading Statistics and Download Logs
With the wider adoption of institutional repositories and the availability of article download logs from these repositories, researchers have studied whether the access data recorded by repositories offer any new information for estimating paper impact. The primary difference with these log data is that they are generated by potential readers (people who have accessed the paper's abstract page and/or downloaded the full text) rather than actual readers. These data sit one stage ahead of citation data: people citing an article must already have downloaded it, while people who have downloaded an article will not necessarily cite it. As a result, these data are gathered much earlier in the scientific publication cycle and are hence faster than citations. Generally, citations only become available after at least a year has elapsed since the article's publication, due to the long publication cycle; by contrast, download and reading data are generated and collected as soon as the article becomes available. Bollen et al. [23, 25] studied reader-generated statistics and showed how they relate to citation counts. Brody and Harnad [35] investigated how much of the variation in citation counts is indicated by early download statistics; they found a significant correlation (r = 0.4) between download counts and eventual citation counts within just six months of download data.
2.3.2 Debate between peer-review-based metrics and bibliometrics-based metrics
Two categories of measurement are frequently compared in the literature. Maier [125] tested the correlation between experts' opinions (based on a survey) and journal impact factors in the regional sciences; she found that expert opinion is closer to the true reputation of a journal than the impact factor, and that impact factors did not correlate with experts' opinions and sometimes even correlated negatively. Brinn et al. [34] conducted a survey and found that senior researchers consider peer review more important than metrics-based indicators in measuring scientific performance. van Raan [187] studied 147 university chemistry research groups in the Netherlands and found a significant correlation between peer review and citation-based metrics (using the h-index and the crown indicator5 as the citation-based metrics). Opthof and Leydesdorff [148] referred to van Raan's work and raised concerns about the crown indicator, which became the starting point for a series of debates in the literature. The contributors to this debate, including Waltman et al. [190, 191], Opthof and Leydesdorff [149], Bornmann [31], Gingras and Larivière [83], Moed [130] and Taylor [181], argued over whether the data really showed significant correlations between peer-review and citation-based metrics, whether the normalisation method affected the result, and whether an alternative metric (such as the crown indicator) is appropriate in scientometrics.
5They define the crown indicator as the ratio between the unit's citations per paper and the field-based average citations per paper.
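Written out as a formula (my paraphrase of the footnote's wording, not necessarily the notation used in [187]), the crown indicator of a unit with p papers and c total citations is

```latex
\text{crown indicator} = \frac{c/p}{\bar{c}_{\text{field}}}
```

where \bar{c}_{field} is the field-based average number of citations per paper. How this normalisation is performed, for instance averaging per-paper ratios versus taking a ratio of sums, was one of the points contested in the debate above.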
Harnad [91], on the other hand, found significant correlations between peer review (based on the UK RAE 2008 research assessment exercise) and citation metrics, as well as newer indicators such as downloads. He pointed out that these measurements of research performance quality are not "face-valid", because none of them is a direct measure of research quality; correlations only reveal that the pair have underlying common factors.
The main difference between the two types of measurement is the kind of data the impact indicator is based on. Citation metrics are based on quantitative statistical data: how many times the work has been cited, how frequently it has been cited, and so on. These citation data are recorded in journals, repositories and databases, so the calculated impact indicators are reproducible given the same dataset. However, it takes years for citations to accumulate before the data are useful.
Expert-review-based metrics are qualitative. Compared to citations, expert review can be done as soon as the work has been published; there is no need to wait.
However, it is based on a few experts' opinions and their personal experience of the domain, so the result is more likely to be biased and non-reproducible in a future study with a different group of reviewing experts.
So far, we have looked at how a piece of research can be judged for its quality. Using articles as the building block, we would like to measure an aggregated research impact.
In the following sections, we explore how the WoS Journal Impact Factor is calculated and how conference quality can be estimated from a set of papers. We also review the national research evaluation exercises carried out in two different countries and examine what factors these exercises take into account to quantify research quality for institutions.
2.3.3 National Research Evaluation Exercises
The UK and Australian governments have conducted nation-wide research assessments in the past. Both of their evaluation methodologies are based on expert review of the published work that the institutions submit for assessment.
2.3.3.1 UK
The UK research assessment exercise (RAE) was conducted to evaluate the overall quality of UK research output at roughly six-year intervals. Its aim was to measure the country's research performance by institution as well as by discipline, hence justifying the investments in research and serving as the basis for future research funding. It took place four times in the past 20 years. The latest published assessment result is RAE2008. The next iteration will be called the Research Excellence Framework (REF), with results to be published in 2014.
Method: Institutions are invited to submit a profile for each Unit of Assessment (UOA) they would like to be assessed in. There are in total 67 UOAs, attempting to cover all government-funded research. The submission includes the research profile of the institution, as well as up to four items of research output per researcher submitted to the UOA. These submissions are evaluated by a panel of experts, and the research outputs are rated according to the rules. There are five rating categories, ranging from "the work makes a significant or substantial contribution to knowledge" to "the work falls below the standard of nationally recognised work" [152]. The scores of each output are combined to produce an institution's overall score for the discipline. The measurements are combined in the following percentages: 70% based on research outputs, 20% based on research environment and infrastructure, and 10% based on esteem and impact (measured by the recognition of the researchers, membership of funding bodies, etc.). The RAE2008 submissions and results can be downloaded from its website (http://www.rae.ac.uk/).
2.3.3.2 Australia
The Excellence in Research for Australia (ERA) initiative was established by the Australian Research Council to evaluate higher education research performance and to allocate funding. ERA started in 2007 [10], replacing the Research Quality Framework (RQF) used previously between 2003 and 2006. At the time of writing, the latest published result is for 2012.
Method: The evaluation process involves data collection and evaluation. Data collection is done by the Higher Education Research Data Collection (HERDC) scheme, which collects data on Australian universities' research output and income annually.
The research data is split into eight disciplines for each university, and each piece of research output is assessed by panels of experts on the following four aspects: 1. research quality, based on citations, peer review and research income; 2. research volume, based on research output and research income; 3. research applications (e.g. commercialisation income); and 4. research recognition. The scores for each of the eight disciplines are listed separately for each university, and the ERA specifically advises that it is not possible to add or average scores from the ERA report to derive overall rankings for universities.
2.3.4 Journal Impact
Publication is an important channel through which research results are disseminated. In many disciplines, the main choice is to publish in peer-reviewed research journals. In the following sections, we review a few widely used methods for evaluating journals' impact.
2.3.4.1 Impact Factor
The journal impact factor is one of the established ways to measure the importance of a journal. For journals indexed by the Journal Citation Report (JCR), the impact factor is published annually. Due to differing citation practices and field sizes, impact factors are only comparable within the same field; normalisation methods exist to enable cross-discipline impact factor comparison [192].
The use of the impact factor is subject to various criticisms. Firstly, because the distribution of citations to articles in a journal is highly skewed, there is no typical number of citations per article, so reporting an arithmetic mean as the average citation count is not accurate. Secondly, the citation accumulation patterns of original research papers and review papers are very different: review papers tend to attract more citations than research papers [187], so mixing them without normalisation skews the measurement of impact. Finally, the journals included in the citation index used for the calculation are primarily in English, which introduces a language bias.
The impact factor is an application of citation-based metrics: it is calculated by averaging the citations received in the measuring year by the articles published in the previous two years. In the next section, I review two journal quality lists published in the UK and Australia that use a hybrid of expert review and citation-based metrics.
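As a minimal numerical sketch of that definition (all figures invented):

```python
# pubs[year]: citable items the journal published in `year`;
# cites[(citing_year, published_year)]: citations made in citing_year
# to items published in published_year.
pubs = {2012: 120, 2013: 130}
cites = {(2014, 2012): 300, (2014, 2013): 260}

def impact_factor(year):
    received = cites[(year, year - 1)] + cites[(year, year - 2)]
    citable = pubs[year - 1] + pubs[year - 2]
    return received / citable

print(round(impact_factor(2014), 2))  # (260 + 300) / (130 + 120) = 2.24
```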
2.3.4.2 Acceptance Rate
When deciding in which journal to publish results, researchers do not just choose any journal; to give their research maximum coverage, they tend towards high-impact journals. Every researcher would like to publish in a high-impact journal, but the space in an issue of a printed journal is fixed. So the more submissions a journal receives, the more it has to reject to make room for those that can attract the most readership and citations, resulting in a low acceptance rate (or a high rejection rate).
A potential metric of journal impact is therefore the journal's submission and rejection ratio. Some journals publish their rejection ratios, but many do not. The bigger problem with this metric is that it can easily be manipulated by the publisher.
2.3.4.3 ABS’s Journal Quality Guide
The Association of Business Schools (ABS) in the UK publishes a quality guide to business and management journals, intended to give researchers authoritative information. The guide categorises journals into four quality levels; the latest version, version 3, was published in March 2009. It includes more than 800 journals and evaluates them on factors including expert opinion, impact factors and citation-based metrics. The coverage of this guide is wider than the citation indices (as many business journals are not present in any of the citation databases). Unfortunately, the method used and the detailed score for each journal are not published.
2.3.4.4 ABDC Journal List
The Australian Business Deans Council (ABDC) publishes a journal ratings list that serves a purpose similar to that of the ABS guide. It reviews over 400 journals covering 17 disciplines. The criteria the journals are judged on are:
• Relative standing of the journal in other recognised lists (such as the Association of Business Schools' guide)
• Citation metrics
• International standing of the editorial board
• Quality of peer-review processes
• Track record of publishing influential papers
• Sustained reputation
• Influence of publications in the journal in relation to hiring, tenure and promotion decisions.
This list also categorises journals into four quality levels. However, as with the ABS guide, the process by which each of the criteria is judged is not available.
2.3.5 Conference Quality
Another important channel for disseminating research results is conferences. This is especially the case in fast-developing disciplines that require rapid communication of research outcomes, such as Computer Science. In these disciplines, top conference publications often carry as much influence as journal publications.
Depending on whether quality is assessed before or after the conference has taken place, the research splits into pre-conference prediction and post-conference measurement. Less data is available for pre-conference quality prediction than for post-conference quality measurement: for instance, the citation counts of papers only become available after a conference has taken place. Pre-conference quality prediction is becoming more useful as more conferences are organised year on year, and researchers would like to know the quality of a conference before attending it. Zhuang et al. [203] took the approach of mining and analysing the programme committee members of a conference. They proposed a number of heuristics to identify reputable conferences, for example, the size of the programme committee, the committee members' average publication count, their average number of co-authors, and centrality measures of the committee. By combining these factors, they were able to identify top conferences, and extremely low-quality ones, before the conference took place. Taking this one step further, Souto et al. [175] developed a classification model for Computer Science conferences; using a conference ranking, the system was able to support semi-automatic evaluation of the quality of Computer Science conferences.
As for post-conference quality measurement, Martins et al. [127] used article citations as the basis for their conference assessment. Starting from a citation analysis of articles published in conferences, their metric also takes into consideration conference-specific characteristics such as longevity, popularity, prestige and periodicity. Yan and Lee [199] proposed an approach to evaluating conferences based on the assumption that the top papers in a discipline are widely recognised. From these top papers, their co-authors are identified; to determine the quality of a conference, they then look for the conferences these authors regularly attend and publish at. However, the top papers in a discipline are few and narrowly confined, so they do not give good coverage of conferences.
2.3.6 University Rankings
Universities are the largest producers of research output across the globe; more than half of the research output recorded in WoS and ACM comes from universities. Although research quality is the prime concern of the present study, the overall quality of a university often covers teaching, reputation and many other factors. Below is a list of potential factors that affect a university's perceived quality.
• Historic reputation
• Publications
• Citations
• Visibility
• Teaching
• Student feedback
• Student prospects
• Resources, funding, infrastructure
Various ranking and evaluation exercises can only consider a subset of the above list due to data availability. Data availability is also a limiting factor for the current study.
The economic development of a country has a strong impact on the role of the higher education system in the country. As a result, the word “university” may not mean the same type of institution in all countries. Some universities may be set up solely to do teaching, while others put more weight on research. Research-oriented institutions and teaching-oriented institutions should be treated differently, if they can be distinguished.
Based on the Webometrics ranking of world universities, there are currently more than 10,000 university-level higher education establishments in the world. The country with the most is the US, with 2,830, followed by China (702), only a fraction of the US figure. The strong presence of US institutions in higher education puts the country far ahead of others in terms of research output.
van Raan's work [185] forms the basis for many recent ranking methodologies. He raised and discussed the problems of making measurements (including statistical issues), issues in indicator choice, language bias, timeliness, and variations between different research systems. Moed [131] applied bibliometric analysis to ranking universities: using the articles published in the Web of Science database, he calculated indicators of university quality, including the article output rate, the percentage of internationally co-authored articles and the percentage of industrially co-authored papers.
The advantage of using publication data is that the resultant ranking is free of personal opinion. In this respect, van Raan [186] firmly supports metric-based methods. He has criticised expert-review-based ranking, showing that no significant correlation can be found between expert opinion and bibliometric outcomes. In more recent papers [188, 189], he showed that papers published in languages other than English can decrease the ranking of a university, revealing that the world research arena is predominantly English-speaking.
In recent years, ranking universities has become increasingly popular, and several university rankings are published every year by various organisations across the world. I review four of the lists most commonly used by students and scholars: the Webometrics ranking of world universities, the Guardian university ranking, the Academic Ranking of World Universities (ARWU) published by Shanghai Jiao Tong University, and the world university ranking of the non-profit organisation 4icu.
2.3.6.1 Webometrics Ranking of World Universities
The Webometrics Ranking of World Universities (Webometrics) is an initiative of the Cybermetrics Lab, a research group belonging to the largest public research body in Spain (CSIC). The aim of the ranking is to promote Web publication, support Open Access initiatives, and support electronic access to scientific publications and other academic material. These aims are strongly reflected in its methodology. We introduce the methodology of the 2010 version, for which we have the data; the method has changed in the current version (as of 2015).
Methodology Webometrics ranking (2010 version) [5] uses data indexed by popular search engines as the basis of the ranking. Four indicators are used, with the following weightings:
• Size. Number of pages recovered from four engines: Google, Yahoo, Live Search and Exalead. (20%)
• Visibility. The total number of unique external links received (inlinks) by a site (Yahoo Search only). (50%)
• Rich Files. The number of academic-related files retrieved from the domain. Files include: Adobe Acrobat (.pdf), Adobe PostScript (.ps), Microsoft Word (.doc) and Microsoft Powerpoint (.ppt). The data was extracted using Google, Yahoo Search, Live Search and Exalead. (15%)
• Scholar. Number of papers and citations reported by Google Scholar for each academic domain. (15%)
Coverage The design and weighting of the indicators used by this ranking are entirely Web-based; therefore, any university or college with a web presence is analysed and ranked. This immediately overcomes the university identification problem [131, 186], which causes difficulties in ranking universities across countries, enabling impartial analysis and ranking of universities purely from the web perspective.
The Webometrics ranking covers almost all universities in the world, and the data is a very useful resource for learning about universities in less well-known countries.
Table 2.1 shows the coverage of the ranking by continent and country. The web domain name of each university is included in the data; the domain name was used to obtain the online resources (number of webpages, academic files, etc.) and was also used as the university identifier in this study.
Shortcoming Because universities are identified by web domain, for those with more than one domain, or whose domain name has changed (e.g. Imperial College London), the ranking is based on each individual domain only; no aggregation was performed to combine multiple domains. A lower-than-expected ranking is therefore expected for these universities.
Table 2.1: Coverage of Webometrics Ranking
2.3.6.2 Guardian Ranking
The Guardian university ranking is published annually by The Guardian, a respected newspaper in the UK. The primary aim of the ranking is to inform potential applicants about UK universities.
Methodology The Guardian's ranking uses eight indicators, all of which focus on the student. The indicators and their weightings are as follows:
• Overall quality - National Student Survey (NSS) overall results (5%)
• Teaching quality - NSS teaching section rated by graduates of the course (10%)
• Feedback - NSS feedback section rated by graduates of the course (10%)
• Spending per student (15%)
• Staff/student ratio (15%)
• Job prospects - proportion of graduates who find graduate-level employment or study full-time within six months of graduation (15%)
• Value added score - comparison of students' degree results and their entry qualifications (15%)
• Entry score (15%)
Unlike many other rankings, this ranking does not include a measure of research output.
Coverage This ranking considers all listed UK universities and ranks the 110 (70%) for which it has data. This is a domestic ranking; no universities outside the UK are ranked. 46 major subject areas are ranked separately, using different weightings of the indicators. In addition, the size of the department for the subject is considered before a university is included in the subject ranking.
Shortcoming This ranking is based on taught-student data; no research performance is taken into account. It is a UK-only ranking, with no international universities included.
2.3.6.3 Academic Ranking of World Universities (ARWU)
The Academic Ranking of World Universities (ARWU) has been published annually since 2003 by Shanghai Jiao Tong University, China. The original goal of the ranking was for the Chinese education authorities to learn the academic research position of Chinese universities. It has since become one of the most widely used international university rankings. It is also a very controversial ranking due to the metrics considered; many papers have been published specifically to criticise it [22, 67].
Methodology The ARWU ranking considers four aspects of academic and research performance, including:
• Quality of Education. Alumni of an institution winning Nobel Prizes and Fields Medals. (10%)
• Quality of Faculty. Staff of an institution winning Nobel Prizes and Fields Medals. (20%)
• Quality of Faculty. Highly cited researchers in 21 broad subject categories. (20%)
• Research Output. Papers published in Nature and Science. (20%)
• Research Output. Papers indexed in Science Citation Index-Expanded and Social Science Citation Index. (20%)
• Per Capita Performance. The weighted scores of the above five indicators divided by the number of full-time equivalent academic staff. (10%)
Coverage The ARWU ranking includes all universities that have any Nobel laureates, Fields medallists, highly cited researchers, or papers published in Nature or Science. In addition, universities with a significant number of papers indexed by the Science Citation Index-Expanded (SCIE) and the Social Science Citation Index (SSCI) are included. Coverage is limited, with only 500 universities ranked across the globe. The top 100 universities are ordered; the remaining universities are placed into five ranking groups (101-150, 151-200, 201-300, 301-400 and 401-500), within which universities are not ranked.
Shortcoming There are many critiques in the literature; the main ones are the following:
• The use of Nobel Prize and Fields Medal winners: the researchers awarded the prize did not necessarily conduct the work at their current university.
• The choice of the 21 subject domains is biased towards medicine and biology.
• The weighting of the authorships on the Nature and Science papers is debatable: 100% goes to the corresponding author affiliation, 50% to the first author affiliation, 25% to the next author affiliation, and 10% to other author affiliations.
• It is limited to only two citation databases (SCIE and SSCI), and there is no evidence that the impact of this limitation has been understood.
Additionally, the coverage of universities is very small, which makes the ranking unsuitable for use in our study.
2.3.6.4 World University Ranking by 4 International Colleges & Universities
4 International Colleges & Universities (4ICU) is a not-for-profit organisation reviewing accredited Universities and Colleges in the world. The aim of the website is to provide an approximate popularity ranking of world Universities based upon the popularity of their websites. They intend to help international students and academic staff to understand how popular a specific University is in a foreign country. The ranking is published on the website every 6 months in January and July.
Methodology The ranking is based entirely on data gathered from the web. Three independent search engine metrics are used: Google PageRank, Yahoo Inbound Links and Alexa Traffic Rank. The unique inbound links and traffic to the university domains are counted and reviewed separately for the three sources, and the results are then combined to give a final ranking value. However, the exact process and formula are kept secret, for copyright reasons and to minimise manipulation attempts.
Coverage 4icu.org claims to include around 10,000 colleges and universities in 200 countries, but only publishes the top 200 universities in the world. Continent and country ranking lists are also available, each listing the top 100 universities.
Shortcoming This ranking is not based on actual academic or research performance; it merely ranks a university website's popularity. Only the top universities are listed, which is too few to be used in this study.
2.3.6.5 Compare and Contrast
These four rankings vary widely in the data used to rank the institutions. The Guardian's ranking is based purely on data collected about undergraduate student experiences, with no research performance considered. Webometrics analyses the documents and traffic within an institution's web domain, assuming that all evidence of teaching and research will be demonstrated on the web. The ARWU ranking, although it receives a lot of criticism, is the one that considers the largest number of factors, ranging from teaching and size to research, prestige and impact of the university. The 4ICU ranking is a ranking of an institution's website activity.

              Guardian          ARWU              4ICU
Webometrics   r=0.58 (p<0.01)   r=0.32 (p<0.01)   r=0.02 (p<0.39)
Guardian                        sample too small  r=0.61 (p<0.01)
ARWU                                              r=0.52 (p<0.01)

Table 2.2: Pearson correlations between the four rankings. r is the Pearson correlation coefficient; p is the significance probability of the correlation.
Table 2.2 shows the correlations between each pair of the reviewed rankings in 2010. The Guardian ranking shows a significant, middling correlation with both the web-based Webometrics and 4ICU rankings, but the two web-based rankings, Webometrics and 4ICU, show no correlation with one another. The ARWU ranking shows a small but significant correlation with both Webometrics and 4ICU. The correlation between ARWU and the Guardian ranking was not calculated due to the small number of overlapping institutions: the Guardian only ranks UK universities and, according to the ARWU ranking, there are only 4 UK universities in the top 100, resulting in a small overlapping sample between the two.
The correlations between the rankings are significant, but not very high. A middling correlation coefficient (0.3-0.6) between two rankings means that there are factors one ranking considers that the other does not.
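For reference, correlations of the kind reported in Table 2.2 can be computed as follows; the score vectors here are invented, and each entry must refer to the same institution in both rankings.

```python
from scipy.stats import pearsonr

# Hypothetical rank positions of the same ten universities in two rankings.
ranking_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ranking_b = [2, 1, 4, 3, 7, 5, 6, 10, 8, 9]

r, p = pearsonr(ranking_a, ranking_b)
print(f"r = {r:.2f}, p = {p:.4f}")
```

On pure rank data, Spearman's correlation would arguably be the more natural choice; Pearson is shown here because Table 2.2 reports Pearson coefficients.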
2.3.7 Summary of measuring research impact
This section has explored various established methodologies for measuring research impact using research publications. The two common methods, expert-review-based and citation-based, were discussed and compared. Higher aggregation levels, e.g. the journal, institution and country levels, adopt different approaches to measuring impact; these approaches were presented and discussed. In addition, four popular university rankings were presented, and their methodologies as well as their rankings were compared, contrasted and correlated.
With the measurements of research collaboration, research productivity and research impact established, I now move on to investigate the relationships among these three research activities.
2.4 The Relationship Between Research Collaboration and Productivity
Understanding research collaboration and its effect on productivity has attracted much interest in the past. Although collaboration metrics and productivity metrics vary from one study to another, findings indicate that research collaboration positively correlates with research productivity.
In 1966, Price and Beaver [159], in one of the earliest studies on this topic, found that the number of collaborators was positively correlated with the number of articles an author published. Through qualitative analysis, they found that the most prolific researcher also tended to be the most collaborative, and that three out of four of the next most prolific researchers were amongst the next most frequently collaborating. No causal analysis was performed.
A year later, Zuckerman [205] interviewed 41 Nobel Laureates in science disciplines, and identified a strong relationship between collaboration and productivity. She found that laureates published more papers and were more willing to collaborate than a matched sample of scientists.
Pravdić and Oluić-Vuković [155] used research data collected from Chemistry and found that the number of papers published depends on the frequency of collaboration among the authors. Having interviewed a sample of the authors, they also learnt that collaboration with highly productive authors increases personal productivity, while collaborating with less productive ones decreases it.
Glänzel and Schubert [85] considered collaboration at three aggregation levels: individual-level co-authorship, cross-country co-authorship and multi-country co-authorship. At all three levels, co-authorship was positively correlated with productivity. The same positive correlation was found by Adams et al. [2] between the size of collaboration groups and scientific productivity.
Lee and Bozeman appear to have found something more subtle. In their 2003 research report [33], they used a regression model to determine whether the explanatory power of collaboration was diminished by factors such as job satisfaction, rank, age and gender.
They surveyed and interviewed 443 academics to obtain their data and then ran the regression, concluding that, despite the extra variables included, the number of collaborators remained the strongest predictor of the number of publications. However, in a later paper [114], the same authors extended the journal paper and book counting method to include a partial count in addition to the full count used before. The full count method counts a collaborative item once for each co-author listed, while the partial count splits a collaborative item by the number of co-authors. While they still found the number of journal papers strongly and significantly correlated with the number of collaborators under the full count, they could not find the same correlation using the partial count. Different counting methods can lead to different results, so it is important that the appropriate counting method is used.
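The difference between the two counting methods is easy to pin down in code; a small sketch with invented papers:

```python
from collections import Counter

papers = [
    ("paper1", ["a", "b"]),
    ("paper2", ["a"]),
    ("paper3", ["a", "b", "c"]),
]

full, partial = Counter(), Counter()
for _title, authors in papers:
    for author in authors:
        full[author] += 1                    # full count: 1 credit per co-author
        partial[author] += 1 / len(authors)  # partial count: credit is split

print(full["a"], round(partial["a"], 2))  # 3 vs 1.83: collaboration inflates
                                          # the full count but not the partial one
```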
More recently, Defazio et al. [55], using the EU framework programme to study similar variables in Chemistry, found that researchers tend to collaborate just to secure funding: the impact of funding on productivity is positive, but the impact of collaboration on productivity is weak. Splitting the period into pre-funding, during-funding and post-funding phases, they found that collaboration during the funding period does not correlate with productivity; in the post-funding period, although the collaboration count decreases compared to the other two periods, it has a strong positive correlation with productivity. It appears that the connections researchers established pre-funding and during funding went on to have a positive effect on subsequent research output.
At a higher aggregation level, the institution level, Abramo et al. [1] studied collaboration intensity and productivity (normalised by the number of staff). While the correlation varied substantially among research areas, a strong correlation was found in information engineering.
The research discussed so far was all based on cross-sectional data, making cause-effect inferences untestable. He et al. [93] constructed a longitudinal dataset covering 65 New Zealand researchers over 14 years. Among other findings, they claimed that international collaborations are positively related to future research output, although they could not find any significant correlation with future output for within-university or domestic collaboration.
The positive relationship between collaboration and productivity has thus previously been confirmed at the individual researcher level. I will extend these studies and report the results of testing the same correlation at the institution level.
2.5 The Relationship Between Research Collaboration and Impact
The relationship between collaboration and research impact is more studied than the other two pairs. The effects of collaboration on impact have been examined from various angles: different types of collaboration (e.g. domestic or international) have different effects on impact; time factors of collaboration (how the impact of collaboration changes over time); and regional variations (how the relationship varies by country and region).
Collaboration is often measured using co-authorship. Presser [156] split papers into co-authored papers and singly authored (non-collaborative) papers and compared editorial decisions for the two categories. Statistical analysis of more than 200 papers submitted to a journal showed that collaborative papers were considered "less bad" than non-collaborative papers. Beaver [19] reached a very similar conclusion by analysing multiple-authored papers by citation counts: singly authored papers are slightly more likely never to be cited than collaborative ones.
2.5.1 Types of collaboration
Different types of collaboration have different relationships with research impact.
The most studied types include international collaboration, domestic collaboration, inter-institutional collaboration and intra-institutional collaboration. Narin et al. [137] found that internationally collaborative papers were cited twice as heavily as papers authored by a single scientist in a single country. He et al. [93] closely investigated 65 biomedical scientists at a New Zealand university; they found intra-institutional collaboration and international collaboration significantly and positively correlated with an article's citations, but could not find the same correlation for domestic collaboration. In a different study, Didegah and Thelwall [56, 57] attempted to find the determinants of high-impact research. Amongst a range of other factors, including the impact factor of the publishing journal, document properties (abstract readability, abstract length, document length, keyword count, etc.) and the cited references' impact factors, they found that individual and international collaboration give a citation advantage in Biology and Biochemistry and in Chemistry, but that inter-institutional collaboration is not important in any of the subject areas they studied, reaching the same conclusion as He et al.
Positive associations between international collaboration and impact have also been found for New Zealand science (Goldfinch et al. [87]), Italian science [72], collaborations of South American institutions (Sooryamoorthy [173]), and in a case study of Harvard University research (Gazni and Didegah [80]).
2.5.2 Time factors of the collaboration
The impact of collaboration has changed over time. Levitt and Thelwall [116] used nearly 30 years of WoS data in the Information Science & Library Science (IS&LS) subject category to learn how collaboration and citation change over time. Breaking article citations into five strata, they found that collaboration in the four highest citation strata increased over time, whereas collaboration among un-cited articles remained low. In other words, collaborative research is becoming increasingly significant and influential, whereas it is becoming more and more difficult for non-collaborative work to be influential in IS&LS.
2.5.3 Regional and subject variations
Levitt and Thelwall [117] studied regional variations in the association between collaboration and higher citation. Using a dataset from the Social Science Citation Index (SSCI), they compared 18 countries and 17 US states. While they confirmed that in all regions studied the mean citation level of collaborative articles was at least as high as that of non-collaborative research, they noted that five of the US states had at least one citation indicator showing higher citation for non-collaborative articles. Leimu and Koricheva [115] studied articles from an ecology journal between 1998 and 2000, comparing US ecologists with their European counterparts. They claimed that collaboration in ecology had a minor effect on the impact of the resulting publications, as measured by citation rates; comparing the citation rates of articles by European and US authors, US ecologists benefited from collaboration more than their European colleagues.
2.5.4 Alternative collaboration measurements
A positive correlation has also been confirmed in a different collaboration setup. Rigby and Edler [162] studied the intensity of collaboration within projects and the quality of the research groups conducting them. Examining 22 sets of research data from Austria, they found that increased levels of collaboration within projects are associated with lower variability of quality within each dataset. In other words, when collaboration is conducted frequently within a project, the quality of the output is more stable.
The main message from the literature is that collaboration is, to a certain extent, related to the impact of the research papers produced. Between individuals, low-impact research is improved by collaboration; between countries, collaborative research is cited more frequently than singly published research, with exceptions in certain disciplines (e.g. ecology). Attention must also be paid to the type of collaboration and to regional differences.
2.6 The Relationship Between Research Productivity and Impact
Compared to research collaboration, the relationship between research productivity and research impact is rather sparsely studied; very few studies investigate it directly.
Lanjouw and Schankerman [108] investigated the relationship between companies' productivity and the impact of their patents. Patents, like article publications, present novel ideas (inventions) and are sometimes cited by other patents. Productivity was measured as the ratio between the number of patents filed by a company and the resources spent on them; patent impact was measured by the amount of revenue generated. They found a negative correlation between productivity and patent impact: the higher the productivity (more patents filed per unit of resource), the less revenue generated by those patents. This result shows that the impact of a patent, as measured by the revenue generated, depends on how much resource has been spent on it. For the same amount of investment in producing patents, the company filing fewer patents actually generates more revenue.
Indeed, higher-impact research takes more input, because of the extra effort in conducting a thorough background review, carefully designing experiments and polishing the final presentation. It follows that the relationship between research productivity and research impact could be inversely proportional, although no previous research has confirmed this yet.
2.7 Network Models
Correlation analysis is a good mathematical tool for revealing relationships between selected variables: it tells us precisely how strongly the variables correlate, and the result can be used for prediction. However, with the growing amount of data available in recent years, discovering the existence of relationships among hundreds of variables has become equally important, and challenging.
Network models abstract the data into graphs, which can be visualised and analysed effectively. With recent advances in personal computing power, it is possible to visualise and analyse graphs using quite sophisticated algorithms, and hence to discover visually the relationships between variables at large scale.
In this section, we first present the mathematical background of network models and network methods; we then review previous work on modelling and analysing networks constructed from research publications.
2.7.1 Graph
The mathematical graph provides a solid foundation for network analysis, and the type of network this study is concerned with can be modelled by a graph. A graph is defined as a pair of sets G = {V, E}, where V is a set of vertices (or nodes) v1, v2, ..., vn and E is a set of edges (or links), each connecting two vertices. The edges can also have values attached, in which case the graph becomes a valued graph.
Modelling real networks using graphs was recorded as early as the 18th century, when Leonhard Euler tackled the famous Königsberg Seven Bridges problem by modelling the islands and their bridge connections as a graph. Since then, the study of network models has taken off.
2.7.2 Random Network Models
This is a class of network models that includes the original Erdős and Rényi random graph model [63, 64] and its variations. The original model defines a very simple graph: G(n, p) is defined over n nodes by connecting each pair of nodes with probability p. In fact, G(n, p) is not a single graph but a collection of graphs: those with n nodes and all the possible ways of connecting the nodes, each edge occurring with probability p.
There are two crucial aspects of social networks that the Erdős and Rényi graph model cannot capture (both are visible in the sketch below):
1. Degree distribution6. Due to the random nature of this model, the degree distribution follows a Poisson distribution. This is very different from many real networks, such as social networks and citation networks, which follow a power-law degree distribution [143, 158].
2. Clustering. Social networks have high clustering [84], indicating a locally well-connected structure. The random network model cannot produce this local structure, again due to its random nature.
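Both shortcomings can be seen by sampling a graph from the model, for example with networkx (parameters chosen arbitrarily for illustration):

```python
import networkx as nx

n, p = 1000, 0.01
g = nx.gnp_random_graph(n, p, seed=42)  # one draw from G(n, p)

degrees = [d for _, d in g.degree()]
print(sum(degrees) / n)          # close to p*(n-1) ~ 10; the degree distribution
                                 # is ~Poisson, with no heavy power-law tail
print(nx.average_clustering(g))  # close to p = 0.01, i.e. almost no clustering
```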
As a result, the random network model is not a very suitable model for studying social networks. There are variations of the Erdős and Rényi network model that address these problems. The configuration model [132] and Chung and Lu's model [47, 48] specifically target the degree distribution of random networks; Holland and Leinhardt [96] and Strauss [73, 176] proposed models to address clustering. A common problem, however, is that these models become too complex to be useful in many studies.
Researchers began to ask: are we starting from the correct foundation for modelling social networks?
2.7.3 Small-World Phenomenon
First we need to introduce a metric that measures the connectedness of a network – the Average Path Length (APL). The APL of a network is the average of the shortest path lengths over all pairs of nodes. For instance, an APL of 3 tells us that, on average, the shortest path between a pair of nodes contains three edges.
6 The degree of a node is the number of edges connected to that node.
The Small-World Phenomenon is the observation that large networks – with millions or even billions of nodes – have only a small APL. This phenomenon is found in many real networks. The human acquaintanceship network [184] consists of billions of people, yet its APL is only 6; a co-authorship network of 250,000 researchers [59, 144] has an APL of only 7. More recently, the small-world phenomenon has been given a precise meaning [9, 106]: a network is said to show the small-world phenomenon if its APL scales logarithmically (or slower) with the network size for a fixed average degree.
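As a sketch of the definition (illustrative Python, not code from this study), the APL can be computed with a breadth-first search from every node, assuming a connected, undirected graph:

    from collections import deque

    def average_path_length(adj):
        # adj: dict mapping each node to the set of its neighbours.
        # Averages the shortest-path length over all ordered pairs
        # of distinct nodes (assumes the graph is connected).
        total, pairs = 0, 0
        for source in adj:
            dist = {source: 0}
            queue = deque([source])
            while queue:                      # breadth-first search
                u = queue.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            total += sum(dist.values())
            pairs += len(dist) - 1
        return float(total) / pairs

    ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
    print(average_path_length(ring))   # 1.8 for a 6-node ring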
One of the important features of this class of networks is that information transmission is much faster than in, for example, a network whose APL is in the order of thousands or millions. The small-world effect is therefore desirable in networks like the scientific collaboration network and the World Wide Web (WWW), where information and knowledge can spread quickly, but it is not so desirable in situations like disease transmission or the spread of rumours in social networks.
Another property of small-world networks is that the nodes are locally clustered [59, 193]. This means that if node A is connected to nodes B and C, then B and C are very likely to be connected too. One demonstration in social networks is that two close friends of the same person are very likely to be friends themselves.
The Erdős and Rényi random network model and its variations reproduce the small-world effect well [17, 26, 58, 74]. But as already discussed in section 2.7.2, the random network model cannot produce high clustering.
Watts and Strogatz [194] proposed a simple model that caters for both the small-world effect and high clustering (Figure 2.2). The model starts with a ring of nodes, each connected to its nearest neighbours on both sides. A randomness parameter p is then introduced, such that a fraction of the edges is randomly rewired according to p. When p = 0, no edge is rewired and the graph remains a lattice; when p = 1, the graph is completely rewired and becomes a random graph. By varying the randomness p, there is a sizeable region, as shown by Watts and Strogatz using numerical simulation, where the model exhibits the small-world phenomenon and is highly clustered.
Figure 2.2: The Watts and Strogatz model. Left: the regular ring lattice with no randomness; middle: with some randomness introduced when connecting neighbours, the network becomes small-world; right: a completely random graph. Figure reproduced from [194].
This model reflects observations of many social systems. Most people are friends with people who are geographically close to them – colleagues, neighbours, classmates – and the lattice represents these connections. Many people also have a few friends who live a long way away – in other cities or other countries – and the randomness adds these long-distance connections to the network.
Analysis of this model shows a surprising result: in order to convert a lattice network into a small-world network, only a tiny fraction of rewiring is required. What this means is that the small-world phenomenon found in social networks is stable and is not on the edge of collapse.
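For illustration, this effect can be reproduced with the networkx library (an assumption for this sketch; it is not part of this thesis's toolchain, where Network Workbench was used): generate Watts–Strogatz graphs at increasing randomness p and measure the APL and clustering coefficient:

    import networkx as nx   # illustrative; not used in this thesis

    for p in (0.0, 0.01, 1.0):
        # 1,000 nodes on a ring, each wired to its 10 nearest
        # neighbours, then each edge rewired with probability p.
        g = nx.connected_watts_strogatz_graph(1000, 10, p, seed=42)
        print(p,
              round(nx.average_shortest_path_length(g), 2),
              round(nx.average_clustering(g), 2))
    # Even p = 0.01 collapses the APL towards the random-graph value
    # while the clustering coefficient stays close to the lattice value.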
There are many variations of the Watts and Strogatz model. A much-studied variant was proposed by Newman and Watts [145], which randomly adds edges to the graph but does not remove edges from the regular lattice. This prevents isolated clusters from forming, making the network easier to analyse. Models in higher dimensions have also been proposed and studied [54, 135, 146, 150], and the results are qualitatively similar to the one-dimensional case.
2.7.4 The Scale-Free Network Model
The Erdős and Rényi random network model is one of the simplest yet most studied network models. However, it has major weaknesses in modelling social networks. In 1999, Barabási and Albert [16] presented a new way of modelling networks. They emphasised the growth of networks found in real life. The social network, the citation network and the World Wide Web are evolving networks: they all started with a few nodes, then new nodes were created and attached to existing nodes in the graph, finally resulting in the current network. Barabási and Albert showed that in order to reproduce a real network's degree distribution, a newly added node must have a higher chance of connecting to nodes that already have many connections. For instance, in citation network growth, a new publication is more likely to cite a paper that thousands of other publications cite than one cited by only a few dozen publications.7 They called this preferential attachment. The resulting topology has a degree distribution that follows a power law: most nodes have very few links, while the remaining few nodes hold most of them.
The importance of their contribution lies not only in a new network model, but also in a whole new way of viewing a network: as a dynamic, evolving structure. Some network features are rooted in the evolution of the network rather than in its topological characteristics.
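A minimal sketch of the preferential attachment mechanism (an illustration, not the exact procedure of [16]): each new node attaches to m existing nodes chosen with probability proportional to their current degree, here implemented by sampling uniformly from a list that contains each node once per incident edge:

    import random

    def preferential_attachment(n, m=2, seed=None):
        rng = random.Random(seed)
        edges = [(0, 1)]        # seed network: two connected nodes
        # 'ends' holds each node once per incident edge, so a uniform
        # draw from it is a degree-proportional draw over nodes.
        ends = [0, 1]
        for new in range(2, n):
            targets = set()
            while len(targets) < min(m, new):
                targets.add(rng.choice(ends))
            for t in targets:
                edges.append((new, t))
                ends += [new, t]
        return edges

    edges = preferential_attachment(10000, m=2, seed=7)
    # Degrees now follow a heavy-tailed (approximately power-law)
    # distribution: a few hubs, many low-degree nodes.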
2.7.5 The Citation Network
Citation networks are classic knowledge networks. Research papers – the carriers of original ideas and knowledge – cite one another, indicating the path of knowledge evolution. In a typical citation network, the nodes represent papers and the directed links represent citations. Because papers are generally cited after their publication, citations can only point back in time, i.e. only later papers can cite previously published papers. The citation network, unlike the co-authorship network, is therefore acyclic, with the arrows on the links pointing back in time.
Researchers started studying this network in the 1950s. Price [157] was among the first to investigate the patterns of citations. He found that a small number of papers are cited far more frequently than average, while the majority of papers are cited less frequently than average.
7 This can be explained simply by probability: if thousands of publications cite a paper, that paper is much easier to find than one cited by only a few papers.
Citation is often used as an indicator of scientific performance. Redner [160] studied the relationship between citation count and scientific impact; Cronin [50] investigated the h-index8 and impact rankings of authors; Cole [49] and van Raan [185, 187] evaluated the influence of awards, honours and Nobel laureateships on citations. These studies generally support using citations to measure scientific activity.
However, some studies have suggested that citations are not a suitable measure of scientific activity [77, 198]. They claim that citations depend on many factors besides scientific impact. These include, for example:
• Time-dependent factors: the more frequently a publication has been cited, the more frequently it will be cited in the future [40, 157];
• Availability of the publication: physical accessibility [174], open access [36, 90] and the publishing medium influence the probability of citation [171];
• Author–reader dependent factors: results from Mählck and Persson [124] and White [195] showed that citations are affected by social networks, as authors cite personally acquainted authors more often.
Bornmann and Daniel [30] reviewed the citing behaviour of scientists and concluded that at the micro-level, citing is a social and psychological process mixed with personal bias and social pressures, but at the macro-level, scientists give credit to colleagues by citing their work. Thus, citations represent an intellectual or cognitive influence on scientific work.
Studies of the citation network enable us to understand the structure of knowledge and anticipate developments in various domains. The network constructed from the citations between published papers is a knowledge network, in which papers point towards a source of knowledge. Since papers are produced by researchers – and, increasingly, co-authored by researchers – networks of researchers can be constructed, based both on an extension of the citation network (co-citation) and on co-authorship. In the coming sections, we show how these two types of networks are constructed and what they add to the analysis of scientific activity.
8 The h-index, or Hirsch index, is a value used to estimate a researcher's impact based on citations. Original article: [95]
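As an illustration of the h-index mentioned in footnote 8, a minimal sketch of its computation – the largest h such that at least h of a researcher's papers have at least h citations each:

    def h_index(citations):
        # citations: list of per-paper citation counts.
        cites = sorted(citations, reverse=True)
        h = 0
        for rank, c in enumerate(cites, start=1):
            if c >= rank:
                h = rank
            else:
                break
        return h

    print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with >= 4 citations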
2.7.6 The Co-citation Network
Authors cite papers together (e.g. in the same sentence or in the same paragraph) when the content of those papers is somehow related. Collectively, co-citations represent a group of authors' judgement of the content similarity of previous publications. The more frequently two publications have been co-cited, the stronger the similarity between them, and hence the stronger the similarity in the researchers' work.
There are two types of co-citation networks commonly discussed in the literature: the paper co-citation network and the author co-citation network. These are in fact two different networks. The paper co-citation network links papers that are frequently cited together, so it is useful for studying knowledge structure and knowledge propagation. The author co-citation network connects the researchers who made the publications; it links researchers together and is useful for studying potential relationships between researchers and the likelihood of their collaboration. We focus on the author co-citation network in this study.
The construction of the author co-citation network can vary depending on the availability of data and the processing power of the study. A pair of authors can be co-cited in the same sentence, the same paragraph, or the same article; the strength weakens the further apart the co-cited authors are. The position of an author in the author list also matters: some studies use only the first author and ignore all other authors for co-citation, while others use all authors. These choices in constructing the co-citation network can give very different results.
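As a sketch of one possible construction (article-level co-citation over all listed authors; the data layout here is hypothetical), co-citation strengths can be accumulated by counting, for every citing paper, the pairs of cited authors appearing together in its reference list:

    from collections import Counter
    from itertools import combinations

    def author_cocitation(reference_lists):
        # reference_lists: for each citing paper, the set of authors
        # appearing in its reference list.
        weights = Counter()
        for cited_authors in reference_lists:
            for a, b in combinations(sorted(cited_authors), 2):
                weights[(a, b)] += 1   # this pair co-cited once more
        return weights

    refs = [{"White", "Persson", "Zhao"}, {"White", "Zhao"}]
    print(author_cocitation(refs)[("White", "Zhao")])  # 2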
The original methodology by White [196] considers only the first author of any given paper and disregards the contributions of the other co-authors. This was perhaps due to the limitations of computing technology at the time. Follow-up studies by Persson [153], Zhao [202] and Callahan [37] consider all authors listed on a paper and the context in which the co-citation occurred, thus helping to identify the domains of authors who are seldom listed as first authors. Su [177] proposed an algorithm to discover authors based on their expertise; the input to her algorithm was a co-citation network.
A few domain co-citation analyses have also been performed to understand sub-domains and the most active authors. White et al. [197] analysed the information science field from 1972 to 1995 using author co-citation. They generated maps of the top 100 authors in the field and used factor analysis to identify major specialities. They found that information science consists of two major specialities with little overlap.
Chen and Carr [42] used ACM publication data to study the structure of the hypertext literature. Authors cited fewer than 5 times during the period 1989–1998 were filtered out, leaving 367 authors. An author co-citation matrix was constructed and Principal Component Analysis (PCA) was applied. The temporal information of the papers was included in the visualisation methods, allowing them to identify emerging research directions in the field.
2.7.7 The Co-authorship Network
Authors form a co-authorship relationship if they have co-authored papers, i.e. both their names appear on the same paper. In this network, nodes represent authors and links represent co-authorship. Co-authors generally know each other and many of them collaborate with each other. So, to a certain degree, the co-authorship network represents researchers' social networks and collaboration relationships; hence, it has attracted much research attention in recent years.
2.7.8 The Erdős Number
Calculating the Erdős Number is one of the earliest activities based on co-authorship. The Erdős Number measures the number of collaboration steps between a researcher and the famous mathematician Paul Erdős. Researchers who co-authored a paper with Paul Erdős have Erdős Number 1; researchers who co-authored a paper with a co-author of Paul Erdős have Erdős Number 2, and so on. Authors who are not connected to Paul Erdős through any chain of co-authorship have no Erdős Number, or are said to have an infinite Erdős Number. De Castro and Grossman [53] found that many famous researchers, whatever their research areas, have a finite Erdős Number. Because famous researchers are also tightly connected within their own research domains, this leads to entire research communities being connected through co-authorship. The implication is that scientific research is collaborative work rather than individuals making their own discoveries. They also reported that, in order to have a smaller Erdős Number, quality is more significant than quantity: with whom one has collaborated matters more than the number of collaborators.
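Erdős Numbers are simply shortest-path distances from Paul Erdős in the co-authorship graph, so a breadth-first search computes them all at once (a Python sketch; the coauthors mapping is hypothetical):

    from collections import deque

    def erdos_numbers(coauthors, root="Paul Erdos"):
        # coauthors: dict mapping each author to the set of people
        # they have co-authored at least one paper with.
        number = {root: 0}
        queue = deque([root])
        while queue:
            a = queue.popleft()
            for b in coauthors.get(a, ()):
                if b not in number:           # first (shortest) visit
                    number[b] = number[a] + 1
                    queue.append(b)
        return number   # authors absent here have an infinite number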
2.7.8.1 Domain analyses
Co-authorship analysis is widely used to understand publication and collaborative patterns among researchers in a specific domain.
Newman [139–141, 144] carried out a series of co-authorship analyses in 2001. He answered a wide variety of questions about collaborative patterns by analysing co-authorship networks, such as how many papers authors write, how many people they write them with, and the typical distance between researchers through the network.
He compared these attributes across several domains – Biology, Physics, Computer Science and Maths – and made these claims:
• The number of papers written per author is similar across the domains in the study;
• The number of authors per paper and the average number of collaborators vary substantially across domains;
• All of the subject domains have a largest component connecting at least 80% of the researchers;
• The average collaboration distance is small, typically 4 to 6 steps for a network containing millions of nodes;
• The clustering coefficient is much larger than the expected value for a random network.
Other domain analyses using co-authorship include: Glänzel [85], who studied scientific networks in general; Moody [134], who investigated social science collaboration networks; and Liu [118] and Sharma [170], who studied the digital library community, among several others [62, 111, 119, 128, 164, 180]. These studies came to similar conclusions: researchers are mostly connected; the distance between researchers is short and the network is highly clustered; co-authorship networks are small-world networks.
2.7.8.2 Network dynamics and evolution
The study of co-authorship networks has not been limited to understanding a snapshot of scientific collaboration at one point in time. Since published work carries a timestamp, which tells us when the co-authorship it represents was added to the network, it is possible to study an evolving network of people.
Barabási and Albert [7, 15] proposed a model based on the co-authorship network that captured the network's time evolution. They realised that the features commonly used to characterise a network, such as average degree, diameter and clustering coefficient, are in fact time dependent and so no longer suitable for characterising an evolving network. On the other hand, they found that the degree distribution is a stationary measurement for an evolving network, which can potentially be used to characterise it instead. In addition, they discovered that measurements on incomplete data can suggest the opposite tendency: for example, node separation exhibits a decreasing tendency on datasets that cover only certain periods, while their numerical simulations suggest otherwise.
Newman [142] used the evolving social network extracted from co-authorship to predict further collaboration. He analysed what affects researchers' choices about whom to collaborate with, given their previous publications. He found that the probability of a pair of researchers collaborating increases with the number of other collaborators they have in common, and that the probability of a particular researcher acquiring new collaborators increases with the number of his or her past collaborators. This result demonstrated evidence of preferential attachment in co-authorship networks.
Co-authorship has frequently been used as source data in network studies. It has facilitated many interesting discoveries about researcher social networks and helped us understand disciplinary differences in collaboration. We will be using this data and network visualisation to show the institutional collaboration network.
2.8 Chapter Summary
This chapter reviewed the literature in the area of measuring research activities – doing collaborative research, increasing scientific knowledge production and achieving higher impact. We then investigated how these activities relate to each other. Towards the end, we reviewed network models and network methods that assist in discovering relationships between the activities.
The definition of research collaboration is simple, but accurately measuring the complex collaborative interactions between researchers is difficult. Collaboration can vary in closeness, frequency and the respective roles of the researchers in the collaboration. The roles of collaborators can also vary across disciplines: while some disciplines count certain roles as collaboration, others do not, as reflected in the inclusion or exclusion of those roles among a paper's co-authors.
Despite the complexities of collaboration, using co-authorship as its metric has demonstrated its advantages; its limitations were also discussed. Co-authorship will be used in this thesis as the measure of research collaboration.
The measurements of research productivity were investigated. The widely used definition of productivity takes the ratio between output and input, i.e. the output generated per unit of input. Although the inputs and outputs of research activity have been identified in the literature, the data is generally unavailable, so this approach has not been widely used. Publication productivity – an aspect of research productivity that counts the number of publications – is commonly used in the literature. Although it is a non-normalised variable and is therefore biased towards larger institutions, it is the only variable available for use in a large-scale study. Several publication counting methods, which may offset this bias, were reviewed.
Four perspectives on quality are commonly discussed in the literature: methodological quality, report quality, expert-review-based quality and bibliometric impact. Among bibliometric methods, citation is one of the most widely used impact indicators in the literature. Newer metrics, such as download logs and network-based methods, have shown correlations with the impact of research. Higher aggregations of research entities (e.g. the journal, country and university levels) have adopted their own methodologies for measuring research-related qualities. In this respect, country-wide research assessment, the journal impact factor and university rankings were reviewed and discussed.
Collaboration was the focus of previous correlation analyses. Despite the different measurements and types of collaboration considered, the general findings suggest that collaboration correlates with the impact of publications (especially the lowest-impact ones), and that international collaborations are cited more than national ones. These results were obtained in a variety of disciplines including Psychology, Chemistry and Computer Science.
I would like to stress, at this point, that correlation does not imply causality. If one variable correlates with another, it simply means there is a relationship between them and they change together. Causality from one variable to another is a much stronger relationship. Confirming causality requires very high quality data, especially the timestamps of the data points. As a result, there is little research trying to establish causal relationships among these three variables. He et al. [93] touched on the causality between collaboration and productivity using longitudinal data about New Zealand researchers.
Little research effort has been devoted to institution-level collaboration, and the discipline coverage has been limited, which prevents us from understanding whether the effects of collaboration on impact and productivity are uniform across disciplines.
Based on the material presented in this chapter, we have the building blocks to measure the collaborativity, productivity and impact of institutions. In the next chapter, I present how the raw publication data was processed before the correlational analysis was applied.

Chapter 3
Research Questions
Universities and scientists tend to assume that (1) if they publish more, they are producing higher-impact research; and (2) if they increase collaborative research, they will produce both more and higher-impact research. These causal assumptions about the relations among research productivity, impact and collaboration have been widely accepted.
To provide evidence for or against these assumptions, the following central research question is explored in this study:
What are the relationships, at the institution aggregation level, among collaboration, productivity and impact?
This central question involves three core variables: collaborativity, productivity and impact at the institution aggregation level. The correlations between these variables can be split into three pairs: collaborativity vs impact, collaborativity vs productivity, and productivity vs impact. Past studies address only one pair of the variables and rarely consider all of them together in a single study [137]. This study not only correlates them in pairs, but also applies partial correlation to remove the effect of the third variable, so that the true correlation between each pair can be revealed.
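For two of the variables with the third partialled out, the first-order partial correlation can be computed from the pairwise Pearson correlations. A minimal numpy sketch (illustrative only; the analyses in this thesis were run in a statistical package, as described in chapter 4):

    import numpy as np

    def partial_corr(x, y, z):
        # First-order partial correlation r_xy.z:
        # (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
        r_xy = np.corrcoef(x, y)[0, 1]
        r_xz = np.corrcoef(x, z)[0, 1]
        r_yz = np.corrcoef(y, z)[0, 1]
        return (r_xy - r_xz * r_yz) / np.sqrt(
            (1 - r_xz ** 2) * (1 - r_yz ** 2))

    # e.g. collaborativity vs impact with productivity partialled out.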
In the following sections, the central question is divided into three sub-parts – the correlation between collaboration and impact, collaboration and productivity, and productivity and impact – one for each pair of variables. A series of additional questions is asked, based on the gaps identified in the literature.
3.1 Research collaboration and impact
We have seen a wealth of articles in the literature exploring the relationship between research collaboration and impact. A positive correlation was found in a variety of fields including Nanoscience and Nanotechnology [57], Ecology [115], Economics [117] and Library and Information Science [116], but in Finance [12], the literature of Academic Librarianship [92] and Library and Information Science [116], a correlation could not be found. In these disciplinary studies, the aggregation is at the article, author, journal or country level. Collaboration was estimated by the number of co-authorships: stronger collaboration between a pair of authors (or countries) is indicated by more co-authored papers. Gazni [80], Bordons et al. [28] and Larivière et al. [110] looked at co-authorship sizes, i.e. the number of authors, addresses and countries listed on a paper, and also found a positive correlation with the impact of the paper. In addition, it was further shown that not all types of collaboration have the same relationship with impact [56, 57, 72, 80, 137, 173]; the impact of collaboration changes over time [116]; and the relationship between impact and collaboration has regional variations [115, 117].
Gazni [80] took Harvard University papers as a case study and found that at least 60% of the papers published are multi-author publications, suggesting that Harvard is a highly collaborative institution. Since Harvard, by reputation, is also a high-impact institution, we can generalise Gazni's finding by asking the following questions:
1. Are higher impact institutions more collaborative?
2. Do higher impact institutions emphasise collaborative research?
Collaboration and impact have rarely been explored at the institutional level. The current study fills this gap by including thousands of institutions across five disciplines and conducting the analysis at the institution aggregation level.
3.2 Research collaboration and productivity
Previous research suggests a close relationship between research collaboration and productivity. Many studies found that collaboration correlates positively with productivity at the author level [33, 114, 205], and with an author's future productivity [93]. The number of a researcher's co-authors was also found to correlate positively with the researcher's productivity [60]. Similar results were confirmed in qualitative studies: by interviewing Nobel Laureates [205], and by interviewing researchers in a New Zealand university [93].
At the aggregation level of the entire institution, Katz [105] applied bibliometric methods to assess the collaboration status between institutions in three countries – the UK, Canada and Australia. Among the questions he explored, he found a non-linear relationship between institutional collaboration and productivity, in which larger institutions have proportionately fewer collaborations than smaller institutions. However, he could not generalise this finding because his institutions came from only three countries. Abramo et al. [1] studied the relationship at the institution level among Italian universities; while they found a strong correlation in information engineering, the correlations varied substantially among the remaining seven research areas they analysed.
While a positive correlation has been found repeatedly at the author aggregation level, studies at the institution level are scarce, conflicting and non-generalisable. To fill these gaps, this study uses data covering thousands of institutions to attempt to answer the following question:
3. Do institutions that publish more papers also collaborate more?
We also try to generalise Katz’s finding by asking:
4. Do institutions that publish more papers also publish proportionately more collaborative papers? (i.e. do highly productive institutions emphasise collaborative research?)
3.3 Research productivity and impact
Few studies consider the relationship between research productivity and impact, and those that do have examined it in limited ways.
Lanjouw and Schankerman [108] investigated the relationship between companies' productivity and the impact of their patents. They found a negative correlation between productivity and patent impact. Similarly, Bergh et al. [21] found that authors with fewer articles tended to have the articles that received the most citations in a journal, suggesting a negative correlation between authors' productivity and their impact. No study has yet addressed this relationship at the institutional level.
In the current study, the correlation is investigated at the institutional level and the following questions are addressed:
5. Do institutions that publish a large number of papers have higher impact? Are there disciplinary differences?
6. Are papers published by high paper-output institutions cited more often than papers published by low paper-output institutions?
This study uses recent advances in digital archiving and indexing services to conduct large-scale quantitative analyses of the relations among research productivity, impact and collaborativity, comparing effects across disciplines as well as countries. In the following chapter, the dataset and the methodology are described.

Chapter 4
Datasets and Methodology
This chapter describes the tools and the source datasets used in this research. It is split into six sections, detailing (1) the computing resources used; (2) the datasets and the preprocessing work; (3) the descriptive statistics of the source data; (4) the metric selection and counting methods used for each of the three institutional research activities; (5) the correlation methods used; and (6) the network methods and visualisations.
4.1 Computing Resources
The entire data processing, data analysis and graph visualisation was performed on a standard-issue research student PC workstation with a quad-core processor and 12 GB of RAM at the University of Southampton. The majority of the data processing was completed in an Ubuntu Linux environment using the Python v2.7 programming language and the MySQL v5 database. The visualisation was performed in a Windows 7 environment using the Network Workbench software package [182] with the GUESS1 visualisation module.
1 GUESS is graph visualisation and layout software. It is written in Java and implements several widely used graph layout algorithms. http://graphexploration.cond.org
The analysis of the data, including data normalisation, correlation and partial correlation, was performed with the IBM SPSS2 v19 statistical package. Chart drawing was completed with the Microsoft Excel software package.
4.2 Datasets and Pre-processing
The Thomson-Reuters Web of Science (WoS) database collects and indexes the world's leading scientific journals in the sciences, social sciences, arts and humanities. It was started by Eugene Garfield in the 1960s to index the citations between papers electronically. The citation index enables the use of bibliometric analysis to learn about scientific publication and practices in general. The data used in this study was obtained directly from the Web of Science under a licence stipulating that the data be used solely for research. It covers papers published in Computer Science, Psychology, Pharmacology, Law and Materials Science between 1973 and 2010. The metadata of the papers was provided, including each paper's publication year, discipline, author list, institution list and list of citing papers (papers that have cited the paper). The data was supplied as a database text dump, which was easily imported into a local MySQL database. It includes a total of 2,127,015 publications. WoS collects several types of publications, including articles, book reviews, letters, meeting abstracts and review articles, but the type of each individual document was not available. The entire dataset was therefore used in the analysis regardless of publication type.
The Association for Computing Machinery (ACM) is an international scientific and educational computing society; it is the world's largest computing society and publishes leading computing journals and organises recognised conferences. The ACM Digital Library is an online service that collects the ACM's journals and conference proceedings starting from the 1950s. A dataset containing 214,592 publications was provided by ACM directly for this research. This dataset is assumed to contain publications only on Computer Science subjects. The data is organised into journal issues and conference proceedings. The publication year, author list, institution addresses, citing articles and reference lists were included in the provided data.
2 SPSS is a software package for statistical analysis. It has a graphical user interface and implements a wide range of algorithms.
Webometrics Ranking of World Universities (Webometrics ranking)3 is a world university ranking based on data found on the web. It is calculated and published annually by the Cybermetrics Lab in Spain. The July 2010 version of the ranking was downloaded for this study.
University Master List University names are not standardised; it is possible, and often the case, that the same university is spelled differently by different people publishing their papers. For example, some use abbreviations, some use a different language, and some even misspell the name. Different databases also tend to use different university name conventions. For instance, the WoS dataset uses abbreviated university names while the ACM dataset uses the universities' addresses. This poses a problem for matching universities across databases accurately.
A master university list was used to simplify the process of mapping university names across multiple databases. Each database maps its own university names to this master list, which is then used to link together all of the university name variations across databases. The Webometrics ranking university list was used as the master list because:
1. it uses the common English spellings of university names, which take less effort to convert other name conventions into;
2. it contains almost all universities in the world, whether teaching oriented or research oriented, giving wider coverage;
3. a university's country is provided when the university name is not unique (for example, 'University of Technology' exists in many countries);
4. it provides a domain and URL for each university on the list, which can be used as an identifier for the university.
Using an intermediate university list also makes this work easily expandable to include new databases in future analyses.
3 http://www.webometrics.info
4.2.1 ACM Digital Library Dataset Description
The metadata of articles is contained in a set of XML files. Figure 4.1 shows a sample part of an article. The ACM dataset provides the following details for all of the papers:
• article ID
• title
• publication year
• authors’ names and ID
• authors’ affiliation
• papers that cite this paper
Due to the large size of the data (1.8 GB in total), accessing and processing the entire set was extremely slow. It took several hours just to go through each article in the dataset and read its metadata. To greatly improve the processing speed, a data extraction step was performed to save only the useful data in a new format. The extracted data was stored in a single file for simplicity. After this extraction step, the time taken to go through the dataset was reduced to the order of seconds. Figure 4.2 lists the fields extracted and the data structure, and figure 4.3 shows an example of a paper's metadata in this structure.
During the data extraction, I noted two major data quality issues regarding the paper metadata within the ACM dataset:
1. There was no unique ID assigned to each institution, nor were the institution names standardised. Only the institution addresses were available to identify institutions. These addresses appear to be the original text typed by the authors; they are typically not standardised and contain (but are not limited to) details like department name, post code and country.
Figure 4.1: A sample article in the ACM XML dataset
2. The authors were inconsistently named and not properly identified. Most of the authors had been assigned unique identifiers, but some have a different identifier for each of their name formats. For example, Prof Les Carr at the University of Southampton was found to have several identifiers, each linked to a different name format, such as Leslie Carr, Les A Carr etc. Each of these identifiers has publications linked to it.
Problem 1 is specific to the ACM dataset and is a major barrier to performing statistical analysis on the data. In the following section, I discuss my approach to identifying universities from an address and present the method used to evaluate it. Problem 2, regarding author names, is more generic: the WoS dataset has a similar problem. Our approach to it is presented in section 4.2.3.

Article: Year
  Author 1: firstname, lastname, ID, Address
  Author 2: firstname, lastname, ID, Address

Figure 4.2: Intermediate data structure used for the ACM dataset

2005, Ben, Shneiderman, PP40024227, Univ. of Maryland, College Park
      Maryann, Alavi, P193437, Univ. of Maryland, College Park

Figure 4.3: An example entry in the intermediate data structure for the ACM dataset
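For illustration, a sketch of such an extraction pass (the XML element names here are assumptions for illustration; the real ACM schema differs): read each article's metadata once and write a compact line-oriented record so that later passes avoid re-parsing the XML:

    import xml.etree.ElementTree as ET

    def extract(xml_path, out):
        # Keep only the year and, per author: first name, last name,
        # ID and affiliation address (element names hypothetical).
        root = ET.parse(xml_path).getroot()
        for art in root.iter("article"):
            row = [art.findtext("year", "")]
            for au in art.iter("author"):
                row += [au.findtext("first", ""),
                        au.findtext("last", ""),
                        au.findtext("id", ""),
                        au.findtext("address", "")]
            out.write("|".join(row) + "\n")   # one article per line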
4.2.1.1 University Identification in ACM dataset
The ACM dataset stores the institution address as text items attached to the author (figure 4.1). These items sometimes include department, institution name, institution address, city, country and post code. They appear to be the original text the authors submitted, which ACM had not processed or identified. There were 473,634 non-empty addresses in the dataset; visual analysis of a few hundred randomly selected addresses revealed the following potential problems in matching the universities to the Webometrics list.
1. Name variance in the spelling of the university name. For example, University of California Berkeley was sometimes written as UC Berkeley, UCB and so on.
2. Language variance in the university name. For example, Universität Hannover was sometimes written as University of Hannover.
3. European accents left as HTML entities. For example, &auml; found in the source data should be displayed as ä.
4. Multiple universities described in the same text item. (These can come from authors affiliated with multiple institutions.)
5. Mis-spelling of university names.
Targeting issues 1–3, the following algorithmic rules were implemented, executed in the order below:
1. The strings were converted to lower case, making the recognition case insensitive and thereby reducing the possible variations for addresses.
2. The HTML entities in the source data were converted into the actual characters.
3. The European-language variants of the word 'university' (e.g. 'universität') were shortened to 'univ', reducing the range of variations in which the word is spelt across European countries.
4. The institution names were looked up in the Webometrics university list. An additional lookup table was constructed to map between university name variations and the university names used by the Webometrics list.
The lookup table was constructed with the help of a computer program that analyses address frequency. The most frequent variations in the source data were manually identified and added to the list of name variations for the corresponding university. This process was repeated until the increase in the address recognition rate became negligibly small for every new address variation added to the lookup table. The address recognition rate is presented in the next section.
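A Python 3 sketch of rules 1–4 (illustrative only; the lookup table shown is a toy example, whereas the real one was built iteratively by frequency analysis):

    from html import unescape   # Python 3; Python 2 used HTMLParser

    # Toy lookup table: normalised variant -> master Webometrics name.
    LOOKUP = {
        "uc berkeley": "university of california berkeley",
        "univ of california berkeley": "university of california berkeley",
    }

    def normalise(address):
        s = address.lower()                        # rule 1: lower case
        s = unescape(s)                            # rule 2: HTML entities
        s = s.replace("universität", "univ")       # rule 3 (one variant)
        return s

    def match(address):
        s = normalise(address)
        for variant, master in LOOKUP.items():     # rule 4: lookup table
            if variant in s:
                return master
        return None

    print(match("Dept. of EECS, UC Berkeley, CA"))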
To evaluate the scale of the problem of multiple institutions appearing in the same text item (issue 4), two variations of the algorithm were implemented. Algorithm 1 assumed that only one university was expected from any given address, while algorithm 2 did not make such an assumption and finds as many universities in the address as it can. Algorithm 1 found 273,844 university instances and algorithm 2 found 275,978 university instances using the same lookup table on the same dataset. This increase of 0.4% from using algorithm 2 is not significant for the overall university recognition rate. Considering the overhead and complications of algorithm 2, I decided to use only algorithm 1 for university recognition. This means that an author's second affiliation was ignored, so fewer institutions appear to be collaborating in this dataset.
The problem of mis-spelling (issue 5) was not dealt with directly. Mis-spelling of a university name can be counted as a variation in the university name. If the mis-spelling is very common, it would be picked up by the frequency analysis process and added to the lookup table. But there was no mis-spelling common enough to enter the lookup table.
4.2.1.2 ACM Dataset University Recognition Rate
The ACM dataset includes many types of institutions, ranging from hospitals and research centres to companies. In order to learn the university recognition rate without treating all the unrecognised institutions as universities, we need to estimate the proportion of addresses that are universities. This avoids underestimating the recognition rate by basing it on all of the addresses in the dataset.
The addresses in the ACM dataset are not standardised enough to be usefully aggregated. Table 4.1 shows five addresses all representing the same institution; the actual number of possible address variations for the same institution is far greater. So, without knowledge of the number of unique institutions, it is only possible to estimate the address recognition rate – the proportion of recognised addresses out of the total recognisable addresses, with no addresses aggregated. This is to be distinguished from the institution recognition rate used for WoS, where the addresses have already been pre-processed and the addresses representing the same institution are aggregated. The non-aggregated addresses include duplicated institutions: an institution's address is repeated as many times as the number of papers published by the institution. As a result, the address recognition rate is not directly comparable with the institution recognition rate.
Address                                  Institution
Massachusetts Institute of Technology    MIT
MIT                                      MIT
M.I.T                                    MIT
M.I.T, Cambridge                         MIT
MIT,Cambridge                            MIT
MIT, Cambridge                           MIT

Table 4.1: Address variations in the dataset. The actual number of address variations for the same institution is far greater than those listed.
The address recognition rate is calculated by formula 4.1:

\text{Univ. Address Recog. Rate} = \frac{\text{Num. of Recognised Addresses}}{\text{Total Addresses} \times \text{Univ.}\%} \quad (4.1)
To estimate the percentage of addresses that represent universities (Univ.%), two hundred randomly selected non-empty addresses were inspected and their institution types identified manually. The distribution of institution types is shown in table 4.2.

Institution Type    Percentage
Company                 21%
University              70%
Research Centre          9%

Table 4.2: ACM institution type distribution
The sampling reveals that universities are the major contributors to Computer Science research as seen by ACM, with 70% of contributors being universities. Using formula 4.1, the address recognition rate can be calculated. After several iterations of lookup table improvement, the method was able to recognise and match 273,844 address lines, which represents an address recognition rate of 83% (273,844 / (473,634 × 70%) ≈ 83%). I decided to stop there because the rate improvement with each additional address in the lookup table was negligibly small.
4.2.2 Web of Science Dataset Description
Unlike the ACM dataset, which includes only computing publications, WoS covers a broader range of disciplines. It maintains three citation indices: the Science Citation Index Expanded (SCIE), which covers 150 disciplines in natural science and engineering; the Social Sciences Citation Index (SSCI), which covers the social science disciplines; and the Arts & Humanities Citation Index (AHCI), which covers the arts and humanities disciplines. Its journal coverage is also vast: it collects 17,240 leading journals out of about 28,000 journals in the world [161], that is, more than 60% of them. Almost all currently active research disciplines are covered by one of its citation indices, making WoS the largest citation index in the world.
Due to licensing and limited computing power, only a limited number of disciplines could be chosen. The choice was based on the differing publication practices of two groups – Natural Science and Engineering (NSE) and Social Sciences and Humanities (SSH). In NSE, peer-reviewed journal publication is the primary output of research, while in SSH, research is more often disseminated in the form of monographs, which are not indexed in journal-based databases such as Web of Science [86, 109, 138]. Psychology and Law were selected to represent SSH, while Pharmacology and Materials Science were selected to represent NSE. Computer Science was also included to allow comparison with the ACM dataset.
4.2.2.1 WoS Data Format and Data Relations
The WoS dataset provides the following details about each paper:
• paper ID
• publication year
• subject
• list of authors
• author’s order of appearance Chapter 4 Datasets and Methodology 71
• list of institutions
• institution’s order of appearance
• institution’s country
• papers that cite this paper
The primary entities of this study – the institutions – were not uniquely identified in the dataset. A decision was made to use the combination of an institution's name and its country as the primary key to identify it in the university master list. Once a match was found, the URI of the university provided by the master list was used as the institution's identifier.
The authors at the institutions were also not identified. The dataset did not provide any author information except names, which include only the first initial and surname. With just an initial and surname, there is not enough information to separate two authors with the same name easily. Previous work has used co-authorship and author self-citation to assist author identification; I discuss this work in section 4.2.3.
The WoS dataset does not provide authors' affiliated institutions. The authors and the institutions were separately linked to the papers, but no link exists between them. An extra field – order – was provided in both the author table and the institution table, which might offer clues to the author–institution relationship. Unfortunately, I was told by the data provider that this ordering information cannot be used to reconstruct the relationship. (It appears that for many papers the number of authors does not match the number of institutions, which makes reconstruction problematic.) In the dataset, 64% of papers have more authors than institutions, while 3% have fewer authors than institutions. Papers with more authors than institutions could be due to the source journal's convention for storing authors and institutions: when authors come from the same institution, some journals may have chosen to omit the duplicated institutions on the paper. The high percentage (64%) of papers with more authors than institutions means that it is quite common for multiple authors from the same institution to collaborate on a paper. On the other hand, the 3% of papers with more institutions than authors are mainly due to authors with multiple affiliations. This phenomenon of authors specifying multiple institutions in their published papers was referred to by Katz as 'top-down collaboration' [105]. He claimed that the appointment of a single researcher at multiple institutions symbolises collaboration between those institutions. However, in the UK it is also common practice for researchers and lecturers to visit other institutions as part of their career progression, and these researchers may have reasons to put down both institutions in their published papers. Multiple affiliations resulting from visiting researchers may not necessarily indicate institutional collaboration. The collaborations resulting from authors' multiple affiliations were removed where possible in this study.
4.2.2.2 WoS University Name Processing
The institution name processing for the WoS dataset is very different from that for the ACM dataset. The ACM dataset used free text describing the institution's address, while the WoS dataset has already standardised institution names. None of the free-text processing problems identified for the ACM dataset in section 4.2.1.1 apply to the WoS dataset. The 12,894,008 WoS addresses were first aggregated into 539,356 institutions. Processing the WoS data therefore moves straight to matching these standardised institution names to the Webometrics institution names.
There are two potential ways to match them. One is the forward method: apply the WoS abbreviation rules to the Webometrics institution names, then use the obtained abbreviations to match against the WoS list. The other is the backward method: reconstruct the WoS abbreviations into the full institution names used by the Webometrics list. The two approaches would be equivalent if the rules were non-destructive, but that is not the case here.
Table 4.3 lists the four most frequently applied rules used by WoS to abbreviate institution names.
Rule 1: 'University' and its European variations are shortened to UNIV (e.g. Universitatea → UNIV).
Rule 2: Connecting words ('of', 'de') are removed.
Rule 3: Spaces are converted to '-'.
Rule 4: European accents are converted to the English alphabet (e.g. ä, à, ȧ → a).
Example: University of Edinburgh → UNIV-EDINBURGH
Example: Université de Montréal → UNIV-MONTREAL

Table 4.3: The most used WoS abbreviation rules. These rules are destructive: the original cannot be obtained by applying their inverses.

Since the rules used by WoS are destructive – for example, after rule 2 in table 4.3 removes 'of' from a phrase, it is not possible to reconstruct the original institution name because the location of the word is lost – only the forward method can be used for institution matching.
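To illustrate the forward method, a Python sketch that applies simplified versions of the rules in table 4.3 to a full institution name (illustrative only; the real WoS rule set is more extensive):

    import re
    import unicodedata

    def wos_style(name):
        # Rule 4: strip accents down to the plain English alphabet.
        s = unicodedata.normalize("NFKD", name)
        s = "".join(c for c in s if not unicodedata.combining(c))
        s = s.upper()
        # Rule 1: 'University' and European variants -> UNIV.
        s = re.sub(r"\bUNIVERSIT\w*", "UNIV", s)
        # Rule 2: drop connecting words.
        s = re.sub(r"\b(OF|DE)\b", " ", s)
        # Rule 3: spaces -> '-'.
        return "-".join(s.split())

    print(wos_style("University of Edinburgh"))   # UNIV-EDINBURGH
    print(wos_style("Université de Montréal"))    # UNIV-MONTREAL

Forward matching then indexes every Webometrics name by its abbreviated form and looks each WoS name up in that index.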
In addition to the abbreviation rules, WoS uses a separate list for long-established institution abbreviations. For example, California Institute of Technology is shortened to Caltech, and Massachusetts Institute of Technology to MIT. These are also included in this study. Institutions whose abbreviations are unrecognised or incomplete are excluded from this study; for example, about 1,400 institutions labelled 'INCONNU' (French for 'unknown') are excluded.
4.2.2.3 WoS Dataset’s University Recognition Rate
The WoS dataset includes a range of institutions such as research centres and companies. To learn the university recognition rate of our matching method alone, I need to find the proportion of the addresses that are actually universities.
This involves four steps:
• find the total number of institutions in the dataset;
• find the proportion of the institutions that are universities;
• work out the expected number of universities;
• work out the university recognition rate.
The calculation of the university recognition rate is shown in formula 4.2:

\text{University Recog. Rate} = \frac{\text{Num. of Recognised Universities}}{\text{Total Inst.} \times \text{Univ.}\%} \quad (4.2)
There were in total 12,894,008 addresses; aggregating them gave a total of 539,356 institution addresses. This is a large number of institutions to have made scientific contributions compared to the total number of universities in the world (about 12,000 establishments). A close examination revealed two possibilities that may have inflated the number of institutions: 1. the institution addresses include former institutions – for example, there were institutions with addresses in the former Soviet Union; 2. the WoS institution abbreviation is just a shortening rule and still relies on the source data for correct addresses – for example, Peking University was found to have two representations, PEKING-UNIV-BEIJING and PEKING-UNIV, and may well have more, inflating the institution count.
The majority of the total 539,356 institutions contributed very few papers: 496,947 institutions (92% of the total) authored or collaborated on fewer than 10 papers during the past 37 years. This potentially indicates that the majority of contributors are one-off, or that these addresses are merely variations of other institutions' names.
To determine the proportion of universities in the dataset, a random sample of 200 institutions was selected, and 11% of them were found to be universities. Scaling this to the entire dataset, the expected number of universities is about 59,000. This is almost five times the roughly 12,000 universities in the Webometrics worldwide listing, revealing that WoS has not standardised institution names very well (possibility 2 above): alternative spellings of a university's name are categorised as different universities.
Executing the matching algorithm on all of the institutions gave 7,183 matched Webometrics universities. Further analysis revealed that of these 7,183 matched universities, only 1,972 have been actively publishing (at least 3 authorships per year). The decision was made to use only active universities because (1) very low-contribution universities will not change the big picture, and (2) the methodology adopted in this study (correlation analysis) is unsuitable for analysing a large quantity of low-quality data. Applying formula 4.2, I obtained a university recognition rate of 12% (7,183 / (539,356 × 11%) ≈ 12%). This is equivalent to nearly 60% of the world's current universities being found to have contributed to the domains analysed.
WoS Institution Types The institution type distribution – whether company, research centre or university – was also examined for each discipline. The WoS data was categorised into the five disciplines and institutions with fewer than 100 papers were removed. A random sample of 200 institutions was then selected in each discipline. Table 4.4 shows the institution types for Pharmacology (Phar.), Materials Science (M.S.), Law, Psychology (Psych.) and Computer Science (C.S.). From the data, universities take the biggest proportion of the institution types in all five disciplines. Since there are only about 12,000 university establishments in the world, the majority of the institutions filtered out were not universities. Universities were also the highest-contributing institutions in terms of the number of papers published. In the SSH disciplines, companies make no contribution; this is no surprise, since research in these areas is less linked to revenue. On the other hand, Pharmacology receives the highest industry support: pharmaceutical companies are actively involved in research as well as in publishing their research.
                  Phar.    M.S.    Law   Psych.   C.S.
Company            12%     2.5%     0%      0%     8%
Research Centre    28%      10%    15%     10%     5%
University         60%    87.5%    85%     90%    88%
Total             100%     100%   100%    100%   100%

Table 4.4: Institution types in the WoS data. University is the biggest contributor in all of the disciplines; Psychology and Law have no company contributions, while Pharmacology has the biggest company contribution.
4.2.3 Author Disambiguation
Although the overall aim of this study concerns research activities at the institutional aggregation level, it is necessary to obtain author estimates in order to calculate certain descriptive statistics. These author estimates require disambiguation of the author names provided by the datasets.
Author name disambiguation is the problem of the non-uniqueness of people's names. Within a small group (e.g. within a research group), full names can probably identify people without ambiguity, because the likelihood of two people having the same name is very low. But when the full name is used to identify researchers globally, it is far from adequate. For instance, authors with the same name cannot be distinguished by their name alone; people may change names over the course of their career (e.g. changing surname after getting married); and many journals use only the first initial in their publications, making author names even more ambiguous. Inaccurate author identification undermines publication-based author evaluation metrics (e.g. the h-index), making them less authoritative and credible.
4.2.3.1 ORCID and ResearcherID
In recent years, this problem has become more urgent with accelerating research output and output-based performance evaluations (e.g. the UK's and Australia's research assessment exercises). Efforts are being made to resolve the problem at the root. A community-based project, Open Researcher and Contributor ID (ORCID)4, maintains a database of identifiers for authors. This identifier, unlike an author's name, is globally unique and does not change over time. Linked with the Digital Object Identifier (DOI)5 of the author's publications, it creates an unambiguous relationship between authors and publications. A similar ID system, ResearcherID6, is currently used by Thomson Reuters to allow authors to link to their publications in the WoS database. These systems are still at an early stage, and the data I obtained from WoS do not contain any ResearcherIDs.
Until the ORCID and ResearcherID systems resolve this problem from the ground up, only heuristic approaches are available.
4 http://www.orcid.org
5 The DOI is a widely adopted document identification system; each DOI uniquely identifies one document.
6 http://www.researcherid.com/
4.2.3.2 Author Identification and Institutional Repository
With the open access initiative to make research publications openly and freely available online, hundreds of institutions around the globe7 have mandated that their scholars deposit publications in an institutional repository. (Some go further and use these publications as evidence for promotion.) Scholars within each individual institution are well identified, typically via an institution-wide ID. Journal and conference proceedings publishers also make efforts to identify authors within their own databases. However, when these publication data are aggregated by a secondary publisher such as WoS, authors cannot be matched across different journals even if they hold IDs within each journal, so the identification is often lost.
4.2.3.3 Fuzzy approach to Author Identification
Given a set of publications with authors' names attached, there are two practical problems in identifying these names. The first is the multi-name format problem: one individual's name may be recorded in publications under multiple spellings, including spelling variations, misspellings and OCR errors. In bibliometric analysis, this problem splits a single author's publications across several distinct identities, fragmenting that author's output. In co-authorship network analysis, the split can also disconnect a large, connected network into several small, isolated ones, strongly affecting the network structure and the observations built on it. At the individual level, researcher metrics and evaluation methods such as the h-index can give false results due to incomplete data. Conducting the same analysis at the institution level, however, is not affected, because the institution level counts only papers and discards the number of authors: there is no difference between 10 authors each publishing 1 paper and 5 authors each publishing 2 papers. Both situations count as one institution publishing 10 papers.
The second is the duplicated-name problem. It is frequently seen in the English representation of Chinese names, where a common surname such as Zhang, combined with a short first name (or just a first-name initial), clashes across different researchers with the same spelling. This problem has the opposite effect on bibliometric and network analysis at the author level: common names stand out because they collect all of these researchers' publications into a single 'individual'. It does not, however, affect analysis at the institution level, where the institution's name rather than the author's name is used as the identifier to merge publications. For example, Zhang's paper from the University of Southampton is counted towards Southampton and Zhang's paper from the University of Oxford is counted towards Oxford, instead of both counting towards the common identifier Zhang.
7 For a complete and more up-to-date list, please refer to http://roarmap.eprints.org/
Co-authorship and social network analysis techniques have been applied experimentally to name disambiguation [99, 147]. They exploit the similarity between the co-author sets of candidate names: if several 'names' share the same or a similar set of co-authors, these names likely refer to the same individual and can safely be merged. The drawback of this technique is that it is computationally expensive and unsuitable for large-scale analysis.
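A minimal sketch of the underlying idea, assuming each candidate name variant carries a set of co-author names; the threshold and data layout are illustrative assumptions, not the method of [99, 147]:

    def jaccard(a, b):
        """Jaccard similarity of two co-author sets."""
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def likely_same_person(coauthors_a, coauthors_b, threshold=0.5):
        # Name variants sharing most of their co-authors are likely
        # to denote the same individual.
        return jaccard(set(coauthors_a), set(coauthors_b)) >= threshold

The cost arises from having to compare candidate name pairs across the whole dataset, which grows quadratically unless blocking heuristics are applied.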
4.2.3.4 ACM and WoS Author Identification
The ACM dataset contains an ID field for each author. However, these 'IDs' do not uniquely identify individual authors, as the name suggests: instances were found where the same individual had been assigned multiple IDs, splitting his or her work across multiple author records. To evaluate the extent of this ID re-assignment, an alternative method was used to estimate the number of authors (figure 4.4).
The name conversion and collapse is an attempt to address the multi-name format problem without introducing an unacceptable level of the duplicated-name problem. The institution-wide collapse assumes that the surname and first initial are a good enough identifier for authors within a discipline at a single institution.
The number of authors identified using this method is 323,419, which approximates the number of ACM IDs. For simplicity, the ACM ID was used as the identifier for authors in the ACM dataset.
1. An author’s institution is identified and matched to the corresponding Webo- metrics entry.
2. The authors’ full names are converted to surname with first initial, which attempts to remove the different spellings of names on various publications.
3. The surname and first initials are merged institution-wide to limit the chances of mis-merging due to the shortened names.
Figure 4.4: Alternative method to estimate the number of authors in ACM dataset.
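The steps in figure 4.4 can be sketched as follows. This is a simplified illustration that assumes the institution strings have already been matched to Webometrics entries (step 1) and that the last token of a name is the surname; the record format is an assumption, not the actual pipeline.

    def estimate_authors(records):
        """records: iterable of (full_name, institution) pairs, one per
        authorship. Collapses each name to surname + first initial
        within its institution (steps 2 and 3) and returns the number
        of distinct author identities."""
        authors = set()
        for full_name, institution in records:
            parts = full_name.split()
            if not parts:
                continue
            surname, initial = parts[-1].lower(), parts[0][0].lower()
            authors.add((surname, initial, institution))
        return len(authors)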
The WoS data represents authors as surname with first initial, without any identifiers attached. As a result, the duplicated-name problem is expected for common names, while the multi-name problem is less likely. However, the author information provided by WoS does not include affiliation, so the same technique of merging names within each institution was not possible. Since author-level analysis is not the focus of this study, some author ambiguity will not affect its outcome, and it was decided to use author names as provided.
4.3 Descriptive Statistics
The overview of the two datasets is presented in table 4.5, organised by discipline. There are two datasets for Computer Science – ACM and Web of Science – each collecting a different set of journals. For Pharmacology, Materials Science, Law and Psychology, the Web of Science datasets were used.
                                           ACM C.S.      WoS C.S.    Phar.       M.S.        Law        Psych.
    Period/Years                           1957-2009/52  1973-2010/37 (all WoS datasets)
    Total papers                           214,592       479,913     728,721     583,640     126,675    208,066
    Mean papers per year                   4,049         12,970      19,695      15,774      3,423      5,623
    Institutional collaborative papers     20,043        164,553     277,939     214,491     11,543     68,141
    Papers per author                      1.56          3.85        4.16        4.38        1.99       2.77
    Total citation counts                  1,225,561     2,711,196   12,488,473  4,895,448   592,857    3,514,787
    Papers received one or more citations  113,836       267,666     602,611     403,364     66,887     156,992
    Mean citations per paper               5.71          5.65        17.14       8.39        4.68       16.89
    Publishing universities                2,523         3,742       3,081       3,413       1,425      2,896
    Total authors                          334,150       310,683     729,427     452,191     78,373     175,731
    Authorships                            522,369       1,195,081   3,033,987   1,979,209   155,937    486,750
    Authors per paper                      2.43          2.49        4.16        3.39        1.23       2.34
Table 4.5: Dataset overview by discipline
Table 4.6 lists statistics based only on the collaborative papers in each discipline.
                                          ACM C.S.   WoS C.S.   Phar.      M.S.       Law       Psych.
    Collaborative papers                  20,043     164,553    277,939    214,491    11,543    68,141
    Mean papers per year                  542        4,447      7,512      5,797      312       1,842
    Papers per author                     1.41       2.80       2.77       3.18       1.40      2.07
    Citations                             138,138    1,066,609  4,802,108  1,938,636  80,790    1,348,530
    Papers received one or more citation  12,534     103,356    236,208    161,273    7,361     55,880
    Mean citations per paper              6.89       6.48       17.28      9.04       7.00      19.79
    Mean institutions per paper           2.27       2.37       2.59       2.41       2.33      2.47
    Authors                               53,393     194,246    541,695    308,046    17,326    108,613
    Authorships                           75,548     544,655    1,498,862  978,275    24,219    225,142
    Authors per paper                     3.77       3.31       5.39       4.56       2.10      3.30
Table 4.6: Collaborative paper overview by discipline
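As an indication of how aggregates like those in tables 4.5 and 4.6 can be derived, the sketch below assumes a pandas DataFrame with one row per paper; the column names are illustrative, not those of the actual datasets.

    import pandas as pd

    def overview(papers: pd.DataFrame) -> dict:
        """Assumed columns: 'year' (int), 'citations' (int),
        'n_institutions' (int), 'authors' (list of author IDs)."""
        n_years = papers["year"].max() - papers["year"].min() + 1
        authorships = papers["authors"].apply(len).sum()
        distinct_authors = len({a for row in papers["authors"] for a in row})
        return {
            "total_papers": len(papers),
            "mean_papers_per_year": len(papers) / n_years,
            "collaborative_papers": int((papers["n_institutions"] > 1).sum()),
            "total_citations": int(papers["citations"].sum()),
            "cited_papers": int((papers["citations"] > 0).sum()),
            "mean_citations_per_paper": float(papers["citations"].mean()),
            "authors_per_paper": authorships / len(papers),
            "papers_per_author": len(papers) / distinct_authors,
        }

Restricting the input to rows with n_institutions > 1 yields the table 4.6 variant of the same aggregates.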
4.3.1 Paper Productivity
[Figure 4.5: bar chart of PUBTOT, PUBCOLL and mean papers per year for ACM C.S., WoS C.S., Phar., M.S., Law and Psych.]
Figure 4.5: Paper productivity overview. Paper productivity varies across disciplines, demonstrating the disciplinary differences in publishing practice. Productivity must be taken into account when comparing publication metrics across disciplines.
Figure 4.5 presents the number of published papers, collaborative papers and mean papers per year for each discipline. In the WoS dataset, Pharmacology has the largest number of published papers in total, per year and among collaborative papers. The other two NSE disciplines, Computer Science and Materials Science, each also exceed 400,000 papers. Law and Psychology papers, by contrast, are much fewer. This could be because the social science disciplines publish in other formats that are not included in our dataset [109].
Fewer Computer Science papers were recorded in the ACM database than in the WoS database, despite the ACM dataset covering a longer period of time (52 years vs 37 years).
While there are large differences in paper productivity across disciplines, the number of unique universities involved in publication stays roughly the same (Table 4.5, publishing universities). Most universities are involved in publishing and play a major role in research output.
4.3.2 Citation
[Figure 4.6: bar chart of CITTOT (total citations) and CITAV (mean citations per paper) for ACM C.S., WoS C.S., Phar., M.S., Law and Psych.]
Figure 4.6: Citations received by each discipline in the period studied. Total citations follow a similar shape to per-paper citations across disciplines: the more citations a discipline receives in total, the more citations each of its papers receives, despite the growth in paper numbers. Psychology papers receive a much higher number of citations per paper than two datasets with similar total citations, Materials Science and WoS Computer Science.
As figure 4.6 shows, Pharmacology received the highest number of citations, both in total citations received by all of its papers and in mean citations per paper. Psychology received only about a third of Pharmacology's total citations, but its mean citations per paper is nearly as high as Pharmacology's. The remaining disciplines receive a fraction of Pharmacology's total citations. Pharmacology publishes and cites at a much higher rate than the other disciplines, while Psychology's papers are much more highly cited relative to the discipline's volume.
Mean citations per paper is higher in every discipline once the non-collaborative papers are removed from the statistics (compare Table 4.5 with Table 4.6). Collaborative papers in Law show an increase as high as 50% (from 4.68 to 7.00 citations per paper).
4.3.3 Collaborative Papers
[Figure 4.7: bar chart of PUBCOLL and the PUBCOLL/PUBTOT ratio for ACM C.S., WoS C.S., Phar., M.S., Law and Psych.]
Figure 4.7: Collaborative papers. The number of collaborative papers published in each discipline varies strongly. The proportion of collaborative papers over total papers is also discipline dependent; Law has a very low proportion of collaborative papers.
In figure 4.7, the ACM Computer Science dataset contains fewer collaborative papers than the WoS Computer Science dataset, and its ratio of collaborative papers to total papers is also much lower. Pharmacology has both the highest number of collaborative papers and the highest proportion of collaborative papers to total papers. Psychology, while it has a smaller number of collaborative papers, has a much higher collaborative proportion than ACM CS and Law. Law has the lowest proportion of collaborative papers of all disciplines.
4.3.4 Collaboration Size
In figure 4.8, the mean number of institutions involved in each paper is relatively stable across disciplines at about 2; at the institution level, collaborations do not tend to be very large in these five disciplines. At the author level, the collaboration size varies somewhat more, with Pharmacology the largest at more than 5 authors per paper and Law the smallest at about 2. The size of a collaboration may be closely related to the mode of research, in that experiment-based disciplines (e.g. Pharmacology and Materials Science) have larger numbers of co-authors than those that are not [140].
[Figure 4.8: "Size of the Collaborative Activities" – bar chart of institutions per paper and authors per paper for ACM C.S., WoS C.S., Phar., M.S., Law and Psych.]
Figure 4.8: Collaboration size. The mean number of institutions involved per paper is similar across disciplines, while the number of authors varies across disciplines.
4.4 Metrics Selection and Counting Methods
The aim of the metrics selection is to include as broad a range of measurements in our analysis as possible, while not including too many redundant metrics. The main limitation on metrics selection in this case lay in data availability.
In this section, I discuss how the metrics were selected to estimate the institution's productivity, impact and collaboration, and how these metrics were counted and aggregated.
4.4.1 Institutional Productivity Measurements
The literature review concluded that research productivity can be approximated by publication productivity: the number of publications an entity (a researcher, an institution or a country) publishes within a period of time. Highly productive entities are those which publish more within the same period. Publication productivity has frequently been used to measure research output in previous studies [33, 69, 101].
It is important to recognise that, since the inputs behind publication are not factored into this measurement, productivity is not normalised; larger institutions or countries with more researchers may show higher productivity simply due to their size.
To measure a university's publication productivity, papers with the university's name contained in the address field are counted. The counting process used in this study is as follows: the university's productivity (PUBTOT) is incremented by one if a paper lists the university's name at least once. Multiple appearances of the same university on a single paper are counted only once.
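A minimal sketch of this counting rule, assuming each paper is represented by the list of institution names in its address field (an illustrative format, not the actual data schema):

    from collections import Counter

    def count_pubtot(papers):
        """papers: iterable of per-paper institution-name lists (an
        institution may appear several times when several authors
        share the affiliation). Each institution is credited at most
        once per paper."""
        pubtot = Counter()
        for institutions in papers:
            for inst in set(institutions):  # de-duplicate within a paper
                pubtot[inst] += 1
        return pubtot

    # e.g. a paper listing Southampton twice and Oxford once adds
    # exactly 1 to each: count_pubtot([["Soton", "Soton", "Oxford"]])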
4.4.2 Institutional Impact Indicators
In section 2.3, we discussed how the overall research quality of an institution can be approximated from the quality of individual research. The four ways to measure the quality of a piece of research are: methodological quality, report quality, peer-review based quality and bibliometrics. The first two are very specific to disciplines and are difficult to use in a quantitative analysis. Peer-review based measurement offers authoritative expert opinion about the research, but its cost is often much higher than the alternatives. The UK and Australian governments have conducted and published nation-wide expert-review based measurements of institutional research quality, but such national data, limited to only two countries, are not useful for a study concerned with the global context. The bibliometric method offers a practical and reproducible impact indicator; in particular, citations quantify impact. Based on the availability of the data, three citation-based metrics were chosen to measure the impact of an institution's research activities: citations per institution, PageRank-weighted citations per institution, and citations per paper. In addition to bibliometrics, I have also included a university ranking – the Webometrics Ranking of World Universities – in our study. It offers a completely different perspective from the citation-based metrics and potentially gives us new insights.
4.4.2.1 Citations per institution
Citation count has been widely used in the literature as a research impact indicator
[18, 93, 137]. It is easy to count and the data is readily available in databases. I include it to offer a baseline for the analysis.
In this study, the citation count is aggregated at the institution level and is referred to as citations per institution (CITTOT). All of the citations received by a paper are counted towards each listed institution's citations: for collaborative papers with multiple institutions, each institution receives a copy of all the citations received by the paper. This is referred to as "full counting" in the literature.
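A sketch of full counting for CITTOT, under the same assumed per-paper format as above plus a citation count per paper:

    from collections import Counter

    def count_cittot(papers):
        """papers: iterable of (institutions, citations) pairs, one per
        paper. Full counting: every distinct listed institution
        receives all of the paper's citations."""
        cittot = Counter()
        for institutions, citations in papers:
            for inst in set(institutions):
                cittot[inst] += citations
        return cittot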
However, the raw citation count has a few disadvantages:

1. Citation counts vary across fields of study due to differing scholarly practices and the size and nature of the audience and community. As a result, citation numbers are not comparable across disciplines.

2. The older a publication, the more time it has had to receive citations, so older publications generally have larger counts; this gives older publications an unfair advantage when measuring impact by citation count.

3. Citations from low-impact articles are weighted the same as citations from high-impact articles. In other words, the citation count does not weight the value of citations according to source impact. This is addressed by using a weighted citation count based on PageRank.
4.4.2.2 Citation PageRank
Citation PageRank (CITTOTw) is a qualitative improvement on raw citation counting in that it gives greater weight to citations that come from highly cited institutions. Compared to the plain citation count, not only the number of citations is considered, but also how well cited the citing institution is. This metric requires stricter source data, since both the citing and the cited institution must be available to calculate the weighting.
The citation PageRank was calculated using the Network Workbench software package. A citation network of institutions generated from the dataset was the input to the program, and the PageRank value of each institution was the output.
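The computation itself is standard PageRank over the institution-level citation network. As an illustration only – the thesis used Network Workbench, not the networkx library assumed here – the same calculation can be sketched as:

    import networkx as nx

    # Directed institution-level citation network: an edge points from
    # the citing institution to the cited one; weights carry the number
    # of citations between the pair (illustrative values, not real data).
    G = nx.DiGraph()
    G.add_weighted_edges_from([
        ("Inst A", "Inst B", 12),
        ("Inst B", "Inst C", 5),
        ("Inst A", "Inst C", 3),
    ])
    cittot_w = nx.pagerank(G, weight="weight")  # CITTOTw per institution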
4.4.2.3 Citations per paper
Citations per paper (CITAV) normalises the citation measurement by dividing an institution's total citation count by its total number of published papers. I include it in this study because it provides a productivity-normalised impact measure for the institution. This normalised view can highlight institutions with a low number of published papers but a high average citation count (i.e. high-impact but low-volume research).
Formula 4.3 calculates the CITAV for institution i.
\[ CITAV_i = \frac{C_i}{P_i} \tag{4.3} \]
where C_i is the total citations received by all publications associated with institution i, and P_i is the total number of papers published by the institution. Citations per paper measures the average number of citations received by an institution's papers and estimates the average impact of its individual papers.
4.4.2.4 University League Table
Four university league tables were studied in the literature review, but only the Webometrics ranking (WRANK) offers a complete world university ranking, so it is the only one included in this study. The July 2010 version of the ranking was used.
This version of the ranking is based on four aspects:
• University website’s popularity. How many external websites are linking to it.
• University website’s size. The number of webpages the website has.
• University’s output and accessibility. The number of open format documents which
can be found on the university’s website.
• University’s research output. The number of publications which can be found
online.
The quality represented by the WRANK is not a single aspect of the university's quality, but a weighted mixture of four sub-metrics. (In fact these sub-metrics have evolved over the years, with the latest ranking measuring a slightly different set8.) This "non-pure" quality indicator may correlate highly with our other measurements without necessarily reflecting the university's quality. For instance, Webometrics assigns 20% of the score to the number of papers published by the university, which is expected to correlate with PUBTOT. It is important to recognise these features of the indicator before making interpretations.
League tables generally put high-ranking universities first, with smaller rank values (e.g., rank 1 is better than rank 20). To make this metric consistent with the other metrics and the results easier to interpret, the WRANK is inverted, so that a higher value indicates a better university.
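The thesis does not spell out the exact transform, but one simple inversion consistent with this description, assuming N ranked universities, is

\[ WRANK'_i = N + 1 - WRANK_i \]

so that the top-ranked university (rank 1) receives the largest value, N.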
8 Please see http://www.webometrics.info/ for the latest sub-metrics used in ranking universities
4.4.3 Institutional Collaboration Measurements
Following the institutional collaboration model discussed in section 2.1.3, two important factors affect the strength of institutional collaboration: (1) the number of times institutions have co-authored papers; (2) the size of each collaboration.
The collaboration measurements used in the literature mostly consider the frequency of institutions' collaborations, without paying attention to individual collaboration sizes. Both factors are measured in this study. In addition, I include a measurement of an institution's proportion of collaborative papers, which helps to address whether shifting to a collaborative mode of research alone would improve productivity and impact.
4.4.3.1 Number of times collaborated
This frequently used indicator counts the number of collaborative papers an institution has published, directly reflecting the institution's involvement in collaboration. I denote it PUBCOLL. As with publication and citation counting, "full counting" was used, meaning one collaborative paper is counted once for each distinct institution appearing on the paper. For example, a paper by Southampton and Oxford University is counted once for each university, even if there are multiple authors from either university. This measurement provides a baseline for the collaboration measurement.
4.4.3.2 Size-weighted collaboration
Building on the collaborative paper count, size-weighted collaboration (PUBCOLLw) aims to factor collaboration size into the metric. The more authors a paper has, the more self-citations, as well as citations from the authors' colleagues, are potentially guaranteed, driving up the impact.
In a collaborative paper, there are two author-number related parameters:
• The total number of authors participating in the collaboration. For instance, a paper with 10 co-authors forms a larger collaboration than a paper with 2 co-authors.
• The number of authors the institution has brought to the collaboration. For instance, an institution with 5 authors in a collaboration is contributing more than an institution in the same collaboration with 1 author.
Both of these variables are proportional to the institution's measured size-weighted collaboration, and they are summed over all of the papers an institution produced. When calculating an institution's contribution to a paper, the institution's own authors are subtracted to avoid counting its participating researchers twice. Formula 4.4 calculates the size-weighted collaboration for institution i.
\[ CS_i = \sum_{p} A_{pi} \times (TA_p - A_{pi}) \tag{4.4} \]

where A_pi is the number of authors from institution i on paper p, TA_p is the total number of authors of paper p, and the sum runs over the papers on which institution i appears.
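A direct implementation of formula 4.4 can be sketched as below, assuming each paper is represented as a mapping from institution to its number of authors (an illustrative format, not the actual data schema):

    def pubcoll_w(papers, institution):
        """papers: iterable of dicts mapping institution -> author count
        for one paper. Implements CS_i = sum over papers of
        A_pi * (TA_p - A_pi)."""
        total = 0
        for authors_by_inst in papers:
            a_pi = authors_by_inst.get(institution, 0)
            if a_pi == 0:
                continue  # institution is not on this paper
            ta_p = sum(authors_by_inst.values())
            total += a_pi * (ta_p - a_pi)
        return total

    # An institution contributing 2 of a paper's 4 authors scores
    # 2 * (4 - 2) = 4 for that paper.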
For example, figure 4.9 shows two collaborative papers that involve the same number of institutions but have different total numbers of authors and different numbers of authors from each institution. All three institutions have the same PUBCOLL count: each paper gives each of them 1 collaborative paper, so the PUBCOLL count is 2 for Southampton, Bath and Oxford alike. The PUBCOLL count ignores the fact that paper B was a bigger collaboration and that Bath contributed the most authors to both papers. PUBCOLLw is able to differentiate the institutions based on collaboration size.
Using formula 4.4, we calculate the size-weighted collaboration for the three institutions (figure 4.9). In Paper A, PUBCOLLw for Southampton is the same as for Oxford, but less than for Bath, because Bath contributed 2 authors to Paper A while Southampton and Oxford each contributed 1. Oxford contributed the same number of authors to both papers (1 author for Paper A and 1 author for Paper B), yet it gets a higher PUBCOLLw value from Paper B because Paper B is a