
Machine Learning in Scientometrics


DEPARTAMENTO DE INTELIGENCIA ARTIFICIAL

Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid

PhD Thesis

Machine Learning in Scientometrics

Author: Alfonso Ibáñez, MS Computer Science, MS Artificial Intelligence

PhD supervisors

Concha Bielza, PhD Computer Science

Pedro Larrañaga, PhD Computer Science

2015

Thesis Committee

President: César Hervás

Member: José Ramón Dorronsoro

Member: Enrique Herrera

Member: Irene Rodríguez

Secretary: Florian Leitner

There are no secrets to success. It is the result of preparation, hard work, and learning from failure.

Acknowledgements

A Ph.D. often appears to be a solitary undertaking. However, it is impossible to maintain the degree of focus and dedication required for its completion without the help and support of many people. It has been a long and difficult journey to finish my Ph.D. research, and it is only fair to mention all of them here.

First and foremost, I would like to thank Concha Bielza and Pedro Larrañaga for being my supervisors and mentors. Without your unfailing support, recommendations and patience, this thesis would not have been the same. You have been role models who not only guided my research but also demonstrated your enthusiastic research attitudes. I owe you so much. Whatever research path I take, I will be prepared because of you.

I would also like to express my thanks to all my friends and colleagues at the Computational Intelligence Group, who provided me with not only an excellent working atmosphere and stimulating discussions but also friendship, care and assistance when I needed them. My special thank-you goes to Rubén Armañanzas, Roberto Santana, Diego Vidaurre, Hanen Borchani, Pedro L. López-Cruz, Hossein Karshenas, Luis Guerra, Bojan Mihaljevic and Laura Antón-Sánchez. I have learned so much from all of you. Your constructive recommendations and collaboration have been tremendous assets throughout my Ph.D. research.

This dissertation would not have been possible without the financial support offered by the Spanish Ministry of Economy and Competitiveness, projects TIN2008-04528-E and Consolider Ingenio 2010-CSD-2007-00018. I would also like to thank the TIN2007-62626, TIN2010-20900-C04-04 and TIN2011-14083-E projects that supported my research during these years.

Finally, my deepest gratitude goes to my wife for her understanding and faithful support, as well as her patience and unconditional love. Thank you also for believing in me, being a never-ending fount of moral support and standing by me when the going was difficult. For all of this, I thank you. Last but of course not least, I wish to thank my parents and sister for their unending encouragement and love from childhood to now. Without them, I would not be where I am. Thank you.

Abstract

Machine learning and scientometrics are the two scientific disciplines covered in this dissertation. Machine learning deals with the construction and study of algorithms that can learn from data, whereas scientometrics is mainly concerned with the analysis of science from a quantitative perspective. Nowadays, advances in machine learning provide the mathematical and statistical tools for properly working with the vast amount of scientometric data stored in bibliographic databases. In this context, the use of novel machine learning methods in scientometrics applications is the focus of this dissertation.

This dissertation proposes new machine learning contributions which would shed light on the scientometrics area. These contributions are divided into three parts:

Several supervised cost-(in)sensitive models are learned to predict the scientific success of articles and researchers. Cost-sensitive models do not aim to maximize classification accuracy, but to minimize the expected total cost of errors derived from mistakes in the classification process. In this context, publishers of scientific journals could have a tool capable of predicting the future citation count of an article before it is published, whereas promotion committees could predict the annual increase of the h-index of researchers within the first few years. These predictive models would pave the way for new assessment systems.

Several probabilistic graphical models are learned to exploit and discover new relationships among the vast number of existing bibliometric indices. In this context, the scientific community could measure how some indices influence others in probabilistic terms, and perform evidence propagation and abduction to answer bibliometric questions. The scientific community could also uncover which bibliometric indices have the highest predictive power. This is a multi-output regression problem where the role of each variable, predictive or response, is unknown beforehand. The resulting indices could be very useful for prediction purposes; that is, once their values are known, knowledge of any other index value provides no further information for predicting the remaining bibliometric indices.

A scientometric study of Spanish computer science research is performed under the publish-or-perish culture. This study is based on a cluster analysis methodology which characterizes the research activity in terms of productivity, visibility, quality, prestige and international collaboration. This study also analyzes the effects of collaboration on productivity and visibility under different circumstances.

Resumen

Machine learning and scientometrics are the scientific disciplines addressed in this thesis. Machine learning deals with the construction and study of algorithms that can learn from data, whereas scientometrics is mainly concerned with the analysis of science from a quantitative perspective. Nowadays, advances in machine learning provide the mathematical and statistical tools for properly working with the large amount of scientometric data stored in bibliographic databases. In this context, the use of new machine learning methods in scientometrics applications is the focus of this doctoral thesis.

This thesis proposes new machine learning contributions that could shed light on the area of scientometrics. These contributions are divided into three parts:

Several supervised cost-(in)sensitive models are learned to predict the scientific success of articles and researchers. Cost-sensitive models do not aim to maximize classification accuracy, but to minimize the expected total cost derived from classification errors. In this context, editors of scientific journals could have a tool capable of predicting the future citation count of an article before it is published, while promotion committees could predict the annual increase of the h-index of researchers within the first few years. These predictive models could pave the way towards new evaluation systems.

Several probabilistic graphical models are learned to exploit and discover new relationships among the large number of existing bibliometric indices. In this context, the scientific community could measure how some indices influence others in probabilistic terms, and perform evidence propagation and abductive inference to answer bibliometric questions. Furthermore, the scientific community could discover which bibliometric indices have the greatest predictive power. This is a multi-output regression problem in which the role of each variable, predictive or response, is unknown beforehand. The resulting indices could be very useful for prediction; that is, once their values are known, knowledge of any other index value provides no information for predicting other bibliometric indices.

A bibliometric study of Spanish computer science research has been carried out under the publish-or-perish culture. This study is based on a cluster analysis methodology that characterizes research activity in terms of productivity, visibility, quality, prestige and international collaboration. This study also analyzes the effects of collaboration on productivity and visibility under different circumstances.

Contents


I Preliminaries

1 Introduction
  1.1 Contributions of the dissertation
  1.2 Overview of the dissertation

II Background

2 Machine Learning
  2.1 Introduction
  2.2 Supervised learning
    2.2.1 Supervised learning approaches
    2.2.2 Classification validation
  2.3 Unsupervised learning
    2.3.1 Unsupervised learning approaches
    2.3.2 Clustering validation

3 Probabilistic Graphical Models
  3.1 Introduction
  3.2 Bayesian networks
    3.2.1 Discrete Bayesian networks
    3.2.2 Gaussian Bayesian networks
  3.3 Learning Bayesian networks
    3.3.1 Structural learning
    3.3.2 Parametric learning
  3.4 Inference in Bayesian networks
  3.5 Bayesian network classifiers

4 Scientometrics
  4.1 Introduction
  4.2 Citation analysis in research evaluation
  4.3 The h-index
  4.4 Improvements of the h-index
    4.4.1 Bibliometric measures that complement the h-index
    4.4.2 Bibliometric measures that take time into account
    4.4.3 Bibliometric measures that allow for co-authorship
    4.4.4 Bibliometric measures that consider other variables
  4.5 Journal-based measures
    4.5.1 The impact factor
    4.5.2 Bibliometric measures that assess the quality of citations
    4.5.3 Bibliometric measures that correct for differences among fields
  4.6 Bibliographic databases
    4.6.1 Web of Science
    4.6.2 Scopus
    4.6.3 Google Scholar

III Data Mining in Research Evaluation

5 Predicting citation counts using supervised algorithms
  5.1 Introduction
  5.2 Predicting citation count of Bioinformatics papers
    5.2.1 Dataset compilation
    5.2.2 Data distribution
    5.2.3 Predictive models
    5.2.4 Exploiting the best models
  5.3 Discussion and conclusions

6 Predicting the h-index using cost-sensitive algorithms
  6.1 Introduction
  6.2 Cost-sensitive Bayesian classifiers
    6.2.1 Cost-sensitive selective naive Bayes - Accuracy
    6.2.2 Cost-sensitive selective naive Bayes - Cost
  6.3 Predicting the h-index of Neuroscience journals
    6.3.1 Dataset compilation
    6.3.2 Data distribution
    6.3.3 Predictive models
    6.3.4 Exploiting the proposed models
  6.4 Discussion and conclusions

7 Discovering relationships among indices using Bayesian networks
  7.1 Introduction
  7.2 Analyzing conditional (in)dependencies among indices
    7.2.1 Dataset compilation
    7.2.2 Data distribution
    7.2.3 Bayesian network models
    7.2.4 Exploiting the global Bayesian network model
  7.3 Discussion and conclusions

IV Exploring Spanish Computer Science Research

8 Overview of Spanish Computer Science Research
  8.1 Introduction
  8.2 Analyzing Spanish computer science
    8.2.1 Nationwide results
    8.2.2 Autonomous region results
    8.2.3 University results
    8.2.4 Academic staff results
  8.3 Discussion and conclusions

9 Cluster Analysis in Scientometrics
  9.1 Introduction
  9.2 Cluster Analysis Methodology
    9.2.1 Definition of bibliometric variables
    9.2.2 Data collection
    9.2.3 Statistical description of bibliometric indices
    9.2.4 Cluster analysis at different levels
    9.2.5 Visualization of clustering results
    9.2.6 Identification of final clusters
    9.2.7 Implications on research policy
  9.3 Exploring Spanish computer science research
    9.3.1 Spanish public universities
    9.3.2 Spanish public university academic staff
  9.4 Discussion and conclusions

10 Effects of Research Collaboration
  10.1 Introduction
  10.2 Questions, hypotheses and results
    10.2.1 How do productivity and visibility vary according to the number of authors?
    10.2.2 How do productivity and visibility in different types of collaboration vary according to author cardinality?
    10.2.3 How do productivity and visibility in different document types vary according to author cardinality?
    10.2.4 How do productivity and visibility in different computer science subdisciplines vary according to author cardinality?
    10.2.5 How do productivity and visibility in different journal impact factor quartiles vary according to author cardinality?
  10.3 Discussion and conclusions

11 Predicting Spanish h-index
  11.1 Introduction
  11.2 Predicting the h-index's annual increase
    11.2.1 Cost-sensitive naive Bayes approach
    11.2.2 Selecting predictive features
    11.2.3 Junior and senior predictive models
  11.3 Discussion and conclusions

12 Uncovering predictive indicators
  12.1 Introduction
  12.2 Multi-output regression and Gaussian Bayesian networks
  12.3 Learning Gaussian Bayesian networks using genetic algorithms
    12.3.1 Initial population
    12.3.2 Fitness function
    12.3.3 Reproduction cycle
    12.3.4 Stopping criteria
  12.4 Resulting Gaussian Bayesian networks
    12.4.1 Experimental setup
    12.4.2 Optimal Gaussian Bayesian networks
    12.4.3 Best induced Gaussian Bayesian network
  12.5 Discussion and conclusions

V Conclusions

13 Conclusions and Future work
  13.1 Summary of contributions
  13.2 List of publications
  13.3 Future work

Bibliography

Part I

Preliminaries


Chapter 1 Introduction

Machine learning is a scientific discipline that addresses how systems can be programmed to learn automatically and to improve with experience. Learning in this context is associated with recognizing complex patterns and making intelligent decisions based on data. The difficulty lies in the fact that the set of all possible decisions given all possible inputs is too complex to describe. To tackle this problem, the field of machine learning develops algorithms that discover knowledge from specific data, based on sound statistical and computational principles. In this context, supervised and unsupervised learning methods address issues related to classification, regression, clustering and association problems, among others. Over the past decades, machine learning has become one of the mainstays of information technology and, with that, an important part of our lives.

Data mining can be seen as the application of machine learning to concrete data. It is a step within a larger process, called knowledge discovery in databases. The whole process can be divided into nine steps: understanding the domain of the problem; generating a dataset; cleaning and preprocessing the data; reducing, projecting and selecting data; identifying the aim of the process; selecting the appropriate methods and algorithms; data mining; interpreting the discovered patterns; and, finally, exploiting the new knowledge. Data mining is the main step of the process, and it is therefore frequently used to refer to the whole process.

In this dissertation, machine learning is used in scientometrics, a field which has grown in popularity in recent years. Scientometrics is concerned with the analysis of science from a quantitative perspective. Its major research issues include the measurement of impact, the understanding of scientific citations and the production of indicators for use in policy and management contexts. Citation analysis is one of the most widely used scientometric methods. It uses citations in scientific works to establish links to other works or other researchers, with the intention of analyzing the frequency, patterns and graphs of citations in articles. Bibliometric measures have emerged from citation analysis to assess and compare the research activity of individual researchers according to their output. These measures essentially involve counting the number of times scientific papers are cited. They are based on the assumption that influential researchers and important papers will be cited more frequently than others.

They constitute an objective method that can summarize the scientific production of a researcher as a set of quantitative figures. Nowadays, many funding agencies and promotion committees regularly use bibliometric measures as a decision-support tool in almost every research assessment decision. The field of scientometrics is hence an increasingly important topic within the scientific community. Finally, scientometrics is not only focused on measuring the literature output but also on analyzing the practices of researchers, socio-organizational structures, research and development management, the role of science and technology in the national economy, governmental policies towards science and technology, and so on.

As science advances, scientists around the world continue to produce large numbers of research articles. The amount of data that can be not only stored but also processed is getting larger and larger. Because of this sheer volume, human beings cannot analyze the information by hand using classical statistical tools. In this context, machine learning provides the tools for properly managing and working with these large amounts of data. It also provides the mathematical and statistical models that make it possible to derive predictions from experience. This is an important issue because the prediction task could be considered the essence of science.

Chapter outline

This chapter is organized as follows. The main contributions of the dissertation are presented in Section 1.1. Then, the organization of this manuscript is explained in Section 1.2.

1.1 Contributions of the dissertation

Based on the motivation that machine learning could provide the tools for properly working with the vast amount of scientometric data, this dissertation presents different contributions which would shed light on the prominent area of scientometrics and pave the way for new applications. These contributions are presented in three parts.

Predicting bibliometric indices: One of the most commonly employed measures of professional recognition is the number of times an article is cited by fellow researchers. Although using citations to judge the quality of journal papers has been criticized, it should be noted that citations frequently correlate with other forms of professional recognition, like winning a Nobel Prize. Consequently, citations can serve as a proxy for the professional recognition received by a journal article. Nowadays, publishers of scientific journals face the tough task of selecting, from a pool of articles, the high-quality articles that will attract as many readers and citations as possible. In this context, the first part of the dissertation proposes several predictive models to forecast the citation count of an article within the first four years after publication. A tool capable of predicting the citation count of an article before it is published would pave the way for new assessment systems.

Beyond traditional measures like the number of citations, one of the most successful bibliometric measures is the h-index, which combines both the quantity and visibility of a researcher's publications into a single-number criterion. This indicator has received a lot of attention from researchers over the last few years, since it is used by funding agencies and promotion committees to evaluate the importance of research. Considering the popularity of the h-index, several cost-sensitive models are also proposed to predict its annual increase over a four-year time horizon. These new models do not aim to maximize classification accuracy, but to minimize the expected total cost of errors derived from mistakes in the classification process. Models capable of predicting the h-index that a researcher will have in coming years could be a useful tool for the scientific community.

Discovering new associations among indices: Many bibliometric indices have been developed in the literature to take into account aspects not previously covered. As a result, the diversity of bibliometric indices nowadays poses the challenge of exploiting the relationships among them. In this context, the second part of this dissertation deals with analyzing relationships among bibliometric indices by means of Bayesian network models. The proposed models analyze the joint distribution over all analyzed indices and discover new conditional (in)dependence relationships among triplets of indices. They also perform all kinds of probabilistic reasoning and measure how some indices influence others in probabilistic terms. Besides discovering new relationships among indices, researchers have also turned their attention to the predictive power of bibliometric indices in many situations. Therefore, the scientific community now faces the challenge of selecting which indices from this pool have the highest predictive power. In this context, a method for identifying a core set of bibliometric indices for prediction purposes, i.e., relevant indices with a high predictive power, is also proposed. This method solves a multi-output regression problem where the role of each variable is unknown beforehand. Gaussian Bayesian networks and genetic algorithms are used to select which subset of bibliometric indices best corresponds to predictive variables and which group can be considered as response variables. The resulting predictive indices are very useful for prediction purposes; that is, once the relevant index values are known, knowledge of any other index value provides no further information for predicting the remaining bibliometric indices.

Exploring Spanish computer science research: National exercises for the evaluation of scientific research are becoming regular events in ever more countries. In general, these exercises are aimed at informing selective funding allocations, stimulating better research performance, and demonstrating that investment in research is effective and delivers public benefits, among others. Until recently, the conduct of these evaluation exercises has been founded on the so-called peer-review methodology, where research products submitted by scientists are evaluated by appointed panels of experts. In general, these assessments give the greatest weight to output quality. But recent developments in scientometrics, particularly for the measurement of publication quality, have led many policy-makers to introduce the more or less extensive use of bibliometric indicators in their research assessments. In this context, the third part of this dissertation carries out a scientometric analysis of the computer science field in Spain using bibliometric indices. The proposed scientometric analysis is performed at the macro (nationwide), meso (universities) and micro (researchers) levels. It provides a comprehensive overview of the current situation of scientific production under the publish-or-perish culture. It is commonplace that the pressure to publish has affected researchers' behavior, in the sense that it is not only important what they write, but also how often, where and with whom they write. Therefore, an overview is required to characterize research activity and analyze how the publish-or-perish culture affects Spanish computer science research. The third part of the dissertation also presents a robust cluster analysis methodology to analyze universities and their academic staff and identify both their strengths and weaknesses in terms of productivity, visibility, quality, prestige and international collaboration. Finally, the effects of collaboration on productivity and visibility are also studied under different circumstances.

1.2 Overview of the dissertation

The manuscript includes 13 chapters grouped into five parts:

Part I. Introduction

This is the current part.

- Chapter 1 introduces this dissertation, stating its main contributions and summarizing the document organization.

Part II. Background

This part includes three chapters introducing the basic concepts and definitions used throughout this dissertation. The chapters explain the basic theory behind the models and tools used in the following chapters. The state of the art is discussed in each of these chapters.

- Chapter 2 presents an overview of machine learning. The main machine learning approaches, supervised learning and unsupervised learning, are discussed in depth. The different methods used in this dissertation are briefly reviewed, and some notes are given on how to evaluate the performance of the machine learning methods.

- Chapter 3 introduces probabilistic graphical models, with a special focus on Bayesian networks. The chapter includes the theoretical foundations of Bayesian networks in discrete and continuous domains, and discusses some of the issues that will be addressed during this dissertation, e.g., parameterization, learning from data and inference. Also, specific Bayesian network models for solving supervised learning problems are reviewed.

- Chapter 4 includes an introduction to scientometrics. Citation analysis and bibliometric indices are presented as scientometric methods to quantify science, technology and innovation. The well-known h-index and other measures for assessing scientific research are reviewed. Also, a comparison of the main features of the most important bibliographic databases (Web of Science, Scopus and Google Scholar) is presented.

Part III. Data Mining in Research Evaluation

This part consists of three chapters presenting data mining proposals in research evaluation. The objective of this part is two-fold. First, several supervised predictive models are learned from data to forecast bibliometric index values like the number of citations and the h-index. Second, the relationships among bibliometric indices are analyzed by Bayesian models which discover probabilistic conditional (in)dependencies among triplets of indices. Finally, the chapters explain all steps from data acquisition to the evaluation of the obtained results.

- Chapter 5 presents a tool capable of predicting the citation count of a journal article before it is published. In this context, several predictive models (naive Bayes, logistic regression, and decision trees, among others) are learned for the Bioinformatics journal. To build these models, tokens found in the abstracts of Bioinformatics papers have been used as predictive features, along with other features.

- Chapter 6 incorporates cost-sensitive learning and feature subset selection into new predictive models. These models are used to forecast the annual increase of the h-index of Neuroscience journals over a four-year time horizon using a set of bibliometric indices. The proposed models do not aim to maximize classification accuracy, but to minimize the expected total cost of errors derived from the classification process.

- Chapter 7 analyzes how bibliometric indices relate to each other (as irrelevant, dependent and so on) by means of Bayesian network models. The induced Bayesian networks are then used to discover probabilistic conditional (in)dependencies among the indices and also for probabilistic reasoning. A case study of 14 well-known bibliometric indices on computer science and artificial intelligence journals is performed to test the reliability of the proposed models. Using these models, editorial boards could answer many questions related to their journal citation indices.

Part IV. Exploring Spanish Computer Science Research

This part includes five chapters which analyze Spanish computer science research. The first three chapters provide a comprehensive overview of the current situation of Spanish computer science research under the publish-or-perish culture. In contrast, the last two chapters focus on building different models to predict the scientific success of Spanish computer science academics and to uncover the best core set of relevant indices with the highest predictive power.

- Chapter 8 presents an overview of Spanish computer science research, including different analyses at the macro, meso and micro levels. Parameters such as the number of documents, number of citations, number of citations per document, number of authors per document, document types, types of collaboration and computer science disciplines are analyzed in this chapter. The result is a comprehensive overview of the current situation in the area of computer science.

- Chapter 9 develops a cluster analysis methodology for measuring the performance of research activities in terms of productivity, visibility, quality, prestige and internationalization. It permits a robust cluster analysis whose results can be used to characterize the Spanish computer science research activity of universities and academic staff, identifying both their strengths and weaknesses. This methodology could also support policy-makers in strategic planning and in verifying the effectiveness of policies and initiatives for continuous improvement.

- Chapter 10 analyzes the relationships among research collaboration, number of documents and number of citations, that is, how publications and citations vary by number of authors. These measures are also analyzed under different circumstances, i.e., when documents are written in different types of collaboration, when documents are published as different document types, when documents are published in different computer science subdisciplines and, finally, when documents are published by journals in different impact factor quartiles.

- Chapter 11 deals with the prediction of the scientific success of Spanish computer science academics. An approach based on cost-sensitive Bayesian classifiers forecasts the annual increase of the h-index over a four-year time horizon using some author-based variables (area, position, university, seniority) and 12 bibliometric indices. The proposed model takes into account the expected cost of instance predictions at classification time.

- Chapter 12 deals with the challenge of exploiting the relationships among bibliometric indices. This chapter uncovers the best core set of relevant indices with the highest predictive power for forecasting other bibliometric indices. This results in a novel multi-output regression problem where the role of each variable (predictor or response) is unknown beforehand. Gaussian Bayesian networks and genetic algorithms are used to solve the above problem and discover new multivariate relationships among indices.

Part V. Conclusions

This part concludes this dissertation.

- Chapter 13 summarizes the contributions of this dissertation and the scientific results derived from it. The chapter also discusses the research lines opened in this work and summarizes future research topics.

Part II

Background


Chapter 2 Machine Learning

2.1 Introduction

Advances in computer technology have made it possible to store vast amounts of data. Despite this, it is not feasible to analyze this information using classical statistical tools. Machine learning therefore provides the tools for managing and working with these data. Most data acquisition devices are now digital and record gigabytes of data every day. For example, a supermarket chain has many stores selling thousands of products to millions of customers. The point-of-sale terminals are able to record the details of each transaction. These stored data become useful only when they are analyzed and turned into information that the supermarket can make use of, for example, to discover relationships among products and to make predictions about sales and stocks. In this retail context, consumer behavior is not completely random. People do not go to supermarkets and buy things at random. When they buy beer, they buy chips; they buy ice cream in summer and hot drinks in winter, and so forth. Such patterns can be found using machine learning techniques and may help supermarkets understand consumer behavior.

Machine learning is inherently a multidisciplinary field which draws on results from research fields as diverse as artificial intelligence, Bayesian methods, computational complexity theory, control theory, information theory, philosophy, psychology and neurobiology [327]. It is concerned with building systems which automatically learn programs from data and make accurate predictions without human intervention. It also rests upon the theoretical foundation of statistical learning theory, which provides conditions and guarantees for the good generalization of learning algorithms [458].

Some authoritative definitions of machine learning follow. Arthur Samuel, one of the pioneers in the field, defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Tom Mitchell [327] also stated that "the field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience". Finally, Ethem Alpaydin [19] provided a definition of machine learning as "programming computers to optimize a performance criterion using example data or past experience".


Machine learning can be used to solve different problems:

- Classification: This focuses on identifying to which of a set of labels a new observation belongs, on the basis of a set of data containing observations whose label membership is known. Although the set of labels could have several discrete values, binary classification is probably the most frequently studied problem in machine learning. It can be thought of as a discrimination problem, modelling the differences or similarities between labels. An example would be spam detection. Given an email inbox, the goal is to identify those email messages that are spam and those that are not. Having a model of this problem would allow a program to leave non-spam emails in the inbox and move spam emails to a spam folder.

- Regression: This is concerned with modelling the relationship between variables labelled with real values rather than discrete labels. More specifically, it helps one understand how the value of the output variable changes when any one of the predictor variables is varied while the other predictor variables are held fixed. An example would be predicting the price of a used car based on car attributes like brand, year, engine capacity and mileage, among others.

- Clustering: This is based on grouping a set of observations in such a way that observations in the same group are more similar to each other than to those in other groups. There is usually more than one way of partitioning the data into meaningful groups, so there is no right or wrong answer. In this case, observations are not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data. An example would be face grouping, that is, organizing pictures by faces without names. Given a digital photo album of many hundreds of digital pictures, the aim is to identify those photos that include a given person. A model of this decision process would allow a program to organize photos by person.

- Associations: This is intended to discover interesting relations between variables in large databases. An example would be product recommendation. Given a customer's purchases and a large inventory of products, the objective is to identify those products in which that customer will be interested and is likely to purchase. A model of this decision process would allow a program to make recommendations to a customer and motivate product purchases.

The above problems, and many others, are solved using different machine learning algorithms. These algorithms are usually categorized into supervised and unsupervised methods. Supervised methods infer a mapping function from a set of labeled data. They rely on a set of observations for which the target property is known. These methods are trained on this set of observations, and the resulting mapping is applied to further observations for which the target property is not available. In contrast to supervised methods, unsupervised methods try to find hidden structures in unlabeled data. This is an advantage in the sense that the data can speak for themselves, without preconceptions such as expected classes being imposed. Since the observations given to these methods are unlabeled, there is no error signal to evaluate a potential solution. Finally, the categorization of machine learning algorithms into these two groups is not an exhaustive division. Other categories, like semi-supervised learning methods [84, 490], reinforcement learning methods [434] and deep learning methods [40], also cover machine learning algorithms, but they are outside the scope of this thesis.

Chapter outline

On the one hand, Section 2.2 introduces an overview of supervised learning approaches. It also shows the measures and methods used to estimate how well classifiers predict the class value for new instances. On the other hand, Section 2.3 provides an overview of unsupervised learning approaches and their main dissimilarity measures. Finally, it also addresses the clustering validation problem, including internal and external validity indices.

2.2 Supervised learning

Supervised learning [45, 130] is the most widely studied approach in machine learning. It addresses the problem of predicting the class of a new observation based on a set of features describing its main properties. For instance, an application of supervised learning is credit concession. Financial institutions usually calculate the risk that a customer will not pay a loan back given some predictive features, e.g., the amount of credit and information about the customer (income, savings, profession, age, past financial history, and so forth). In this context, a supervised learning approach could calculate the risk for a new application and then decide whether to accept or refuse it accordingly.

The objective of supervised learning is to build models based on training data and then predict the class of test data using the learnt model. The training data must be characterized using pairs of descriptive features and a class label variable, whereas the test data are characterized using only the descriptive features. Let the training set $D = \{(\mathbf{x}^{(1)}, c^{(1)}), \ldots, (\mathbf{x}^{(N)}, c^{(N)})\}$ be a set of instances, each described by a tuple formed by a vector of features, $\mathbf{x}^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_n^{(i)}\}$, and a label from a class variable, $c^{(i)} \in \Omega(C) = \{c_1, c_2, \ldots, c_k\}$. A supervised classification algorithm then builds a model, learnt from $D$, which is used to assign class labels to new instances $\{\mathbf{x}^{(N+1)}, \mathbf{x}^{(N+2)}, \ldots, \mathbf{x}^{(N+M)}\}$.
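To make this train-then-predict setting concrete, the following minimal Python sketch builds a classifier from a labeled training set and assigns labels to new instances. The library (scikit-learn) and the toy data are illustrative assumptions, not something used in this dissertation:

    # Minimal supervised classification sketch: learn a model from the
    # labeled training set D, then predict labels for unseen instances.
    # scikit-learn and the toy data are illustrative assumptions.
    from sklearn.tree import DecisionTreeClassifier

    X_train = [[0.2, 1.1], [0.4, 0.9], [3.1, 2.8], [2.9, 3.2]]  # x(1)..x(N)
    c_train = ["low", "low", "high", "high"]                    # c(1)..c(N)
    model = DecisionTreeClassifier().fit(X_train, c_train)      # learn from D
    print(model.predict([[0.3, 1.0], [3.0, 3.0]]))              # x(N+1), x(N+2)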

2.2.1 Supervised learning approaches

Many supervised learning algorithms have been developed in the literature [264]. They can be organized into different classification approaches such as Bayesian classifiers, decision trees, instance-based learning, regression, kernel methods, neural networks and ensemble learning, among others. Although the scope of this thesis does not cover all the mentioned approaches, a brief introduction to the most important ones follows.

Bayesian network classifiers

Bayesian network classifiers [43, 163] are a class of Bayesian networks specially designed to solve supervised classification problems. These classifiers model the joint probability distribution over the predictive variables and the class variable. The Bayes rule [37] is used to classify a new instance x according to its predictive variables: the class with the maximum posterior probability is selected as the class label for the given instance. Bayesian network classifiers will be studied in depth in Section 3.5.
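As a toy illustration of this maximum a posteriori decision rule, the sketch below scores each class by its prior multiplied by the likelihood of the observed features (a naive conditional independence assumption) and picks the largest; all probability values are invented for illustration:

    # Hypothetical naive Bayes decision rule: choose the class maximizing
    # p(c|x), which is proportional to p(c) * prod_i p(x_i|c).
    # All probability values below are made up for illustration.
    priors = {"spam": 0.3, "ham": 0.7}
    likelihoods = {"spam": [0.8, 0.6], "ham": [0.1, 0.4]}  # p(x_i = 1 | c)
    x = [1, 1]                              # both binary features observed

    def posterior_score(c):
        score = priors[c]
        for i, value in enumerate(x):
            p = likelihoods[c][i]
            score *= p if value == 1 else 1.0 - p
        return score

    print(max(priors, key=posterior_score))  # 'spam': 0.144 > 0.028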

Decision trees

Decision tree learning [61, 339, 341] is one of the most used and practical approaches for inductive inference. Decision trees are hierarchical models that represent the knowledge of the problem with a tree structure built by a recursive division of the predictive variables' space. Their goal is to build tree structures whose nodes are as pure as possible, that is, nodes that contain observations of a single class value. Each node in the decision tree specifies a test of some variable of the problem, and each branch descending from that node corresponds to one of the possible values of this variable. The leaf nodes of a decision tree contain the class label assigned to each subregion of the problem. Once the decision tree structure has been created, a new instance is classified by starting at the root node of the tree structure. The new instance then moves down the tree branch corresponding to the value of the variable specified by this node. This process is repeated for the sub-tree rooted at the new node until the appropriate leaf node is reached, returning the class label associated with this leaf.

Several algorithms can be used to construct a tree from a data set. For example, the ID3 and C4.5 algorithms [372] are greedy search algorithms that construct a tree recursively, choosing at each step the variable to be tested using the information gain ratio, so that the separation of the data examples is optimal. The C4.5 algorithm is an extension of ID3 that makes a number of improvements: C4.5 deals with both continuous and discrete variables, handles variables with missing values and different costs, and can go back through the tree structure and attempt to remove unnecessary branches by replacing them with leaf nodes. Although algorithms belonging to this approach are interpretable, efficient and reasonably accurate, they have problems like overfitting, among others [372].
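A sketch of the split-selection idea behind these algorithms is shown below: entropy measures node impurity, and the variable yielding the largest entropy reduction (information gain) is chosen for the test. The toy data are assumptions, and note that C4.5 actually uses the gain ratio, a normalized variant of this quantity:

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels (node impurity).
        n = len(labels)
        return -sum((k / n) * math.log2(k / n)
                    for k in Counter(labels).values())

    def information_gain(labels, feature_values):
        # Entropy reduction obtained by splitting on one discrete feature.
        n = len(labels)
        split = {}
        for value, label in zip(feature_values, labels):
            split.setdefault(value, []).append(label)
        remainder = sum(len(part) / n * entropy(part)
                        for part in split.values())
        return entropy(labels) - remainder

    # This feature separates the classes perfectly, so the gain is 1 bit.
    print(information_gain(["yes", "yes", "no", "no"], ["a", "a", "b", "b"]))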

Instance-based learning

Instance-based learning approaches [14, 112, 299] do not provide an explicit model, as the other paradigms do, when training examples are provided. Instead, they simply store the training examples until a new instance to be classified appears. Given a new instance, its relations to the already stored examples are examined in order to assign a class label to the new instance. This approach classifies a new instance by looking for the most similar instances in the training dataset and returning their labels. If the most similar instances have different class labels, combination rules have been proposed [13]. Examples of instance-based learning include k-nearest neighbor classifiers [103, 196] and locally weighted regression methods [25]. On the one hand, the simplest method is arguably the k-nearest neighbor classifier. Here, the k points of the training data closest to the test point are found according to some distance metric, and a class label is then assigned depending on the class labels of the k closest training instances, usually by a majority vote among the k points. On the other hand, locally weighted regression performs a regression around a point of interest using only training data that are local to that point. Methods belonging to this approach are highly intuitive and attain, given their simplicity, remarkably low classification errors. In contrast, these methods are computationally expensive and require a large memory to store the training data.
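The following is a minimal k-nearest neighbor classifier in this spirit, with Euclidean distance and a majority vote; the toy data are an illustrative assumption:

    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        # Majority vote among the k training points closest to x.
        dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
        nearest = sorted(zip(X_train, y_train),
                         key=lambda p: dist(p[0], x))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]  # two toy clusters
    y = ["a", "a", "a", "b", "b", "b"]
    print(knn_predict(X, y, [0.5, 0.5]))  # 'a'
    print(knn_predict(X, y, [5.5, 5.5]))  # 'b'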

Linear models

Linear models [346, 375] are excellent and simple methods for classification and numeric prediction, and they have been widely used in statistical applications for decades. These models are easy to understand: the final output is usually a weighted sum of the input variables. The magnitude of each weight shows the importance of the corresponding variable, and its sign indicates whether the effect is positive or negative. Of course, these methods suffer from the disadvantage of linearity: if the data exhibit a nonlinear dependency, the resulting solution may not fit the data very well.

Linear regression [406, 474] is the most appropriate method when the class is numeric and all the variables are also numeric. This technique expresses the class as a linear combination of the attributes with predetermined weights as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \,, \qquad (2.1)$$

where $y$ is the class, $x_1, x_2, \ldots, x_n$ are the attribute values, and $\beta_0, \beta_1, \ldots, \beta_n$ are weights calculated from the training data. The coefficients $\beta$ are computed by minimizing the sum of the squares of the differences between the predicted and the actual values over all the training instances. This technique can easily be used for classification in domains with numeric attributes. The trick is to perform a regression for each class value, setting the output equal to one for training instances that belong to the class value and zero for those that do not. The result is a linear expression approximating a numeric membership function for each class value. Then, given a test example of unknown class, the value of each linear expression is calculated and the largest one is chosen.

Although linear regression often yields good results in classification, it has some drawbacks [478]. First, the membership values it produces are not proper probabilities, because they can fall outside the range 0 to 1. Second, least squares regression assumes that the errors are not only statistically independent but also normally distributed with the same standard deviation, an assumption that is blatantly violated when the method is applied to classification problems, because the observations only ever take on the values 0 and 1.
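As a brief numerical illustration of Eq. (2.1), the sketch below estimates the beta coefficients by least squares with NumPy; the data are fabricated so that an exact linear fit exists:

    import numpy as np

    # Fit y = beta_0 + beta_1*x_1 + beta_2*x_2 by minimizing the sum of
    # squared residuals (Eq. 2.1). The data are illustrative assumptions.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([5.0, 4.0, 11.0, 10.0])       # here y = x_1 + 2*x_2 exactly
    X1 = np.hstack([np.ones((len(X), 1)), X])  # extra column for beta_0
    betas, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print(betas)  # approximately [0.0, 1.0, 2.0]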

Logistic regression [214, 322] does not suffer from the above problems. It is used to predict the class of new instances in a binary classification problem by using a linear function of the predictive features as

$$p(C = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \sum_{i=1}^{n} \beta_i x_i)}} \qquad (2.2)$$

$$p(C = 0 \mid \mathbf{x}) = 1 - p(C = 1 \mid \mathbf{x}) = \frac{e^{-(\beta_0 + \sum_{i=1}^{n} \beta_i x_i)}}{1 + e^{-(\beta_0 + \sum_{i=1}^{n} \beta_i x_i)}} \qquad (2.3)$$

where $\beta_0, \beta_1, \ldots, \beta_n$ are the parameters of the model. The estimation of these parameters is based on the maximum likelihood estimation method. These parameters describe the size of the contribution of each variable to the model.
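Evaluating Eqs. (2.2) and (2.3) for fixed parameters is straightforward, as the sketch below shows; the beta values here are hypothetical rather than maximum likelihood estimates:

    import math

    def logistic_posteriors(x, betas):
        # Eq. (2.2): p(C=1|x) with betas = [beta_0, beta_1, ..., beta_n];
        # Eq. (2.3) then gives p(C=0|x) as its complement.
        z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
        p1 = 1.0 / (1.0 + math.exp(-z))
        return p1, 1.0 - p1

    p1, p0 = logistic_posteriors([1.5, -0.5], betas=[-1.0, 2.0, 0.5])
    print(p1, p0)  # ~0.852 and ~0.148: classify as C=1 since p1 > 0.5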

Kernel methods

Kernel-based algorithms [337, 394, 413] provide a simple bridge from nonlinear to linear problems. These methods use a linear classifier to solve a nonlinear problem by mapping the original nonlinear observations into a higher-dimensional space where the problem is easier to model, the hope being that the data become more easily separable in this new higher-dimensional space. This approach was first used to solve binary classification problems by means of support vector machines [58, 106, 459]. Support vector machines work by mapping the training data into a feature space with the aid of a kernel function that computes a similarity between two given observations. Using this transformation, the problem becomes linearly separable and can be solved using decision hyperplanes [4, 69]. Although their training time is very high and the learnt model cannot be easily interpreted, they have yielded very accurate results and are less prone to overfitting than other methods. Also, the attractiveness of such classifiers stems from their elegant treatment of nonlinear problems and their efficiency in high-dimensional problems.
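One commonly used kernel function, the Gaussian (RBF) kernel, is sketched below; the text does not single out any particular kernel, so this choice and the gamma value are assumptions:

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        # Gaussian (RBF) kernel: a similarity between two observations that
        # corresponds to an implicit mapping into a higher-dimensional space.
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.exp(-gamma * np.sum((a - b) ** 2)))

    print(rbf_kernel([0, 0], [0, 0]))  # 1.0: identical points
    print(rbf_kernel([0, 0], [2, 2]))  # ~0.018: distant, near-zero similarity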

Neural networks

Artificial neural networks [44, 201, 315, 378] are computational models that try to simulate the structure and/or functional aspects of biological neural networks. These networks consist of an interconnected group of processing units, also called neurons, and process information using a connectionist approach to computation. They are composed of one or more layers of processing units connected with each other. Each processing unit aggregates its inputs, which may come from the environment or be the outputs of other processing units, and sends the result, which in the simplest case is a weighted sum of the inputs, to other processing units. The connections between processing units are modeled with weights.

The simplest networks are called perceptrons [382, 383], which have a single layer of processing units. Although these classifiers are able to distinguish labels in a binary classification problem by means of a linear discrimination function, they present some limitations [326]: if a set of instances is not linearly separable, perceptrons will never reach a model where all instances are classified properly. Multilayer perceptrons have been created to try to solve this problem [387]. These networks are usually used to model complex relationships between inputs and outputs by means of nonlinear discrimination functions. There are several algorithms with which a network can be trained [345]. However, the most well-known and widely used learning algorithm for estimating the values of the weights is the backpropagation algorithm [387], which uses gradient descent to tune the network parameters to best fit a training set of input-output pairs. Finally, artificial neural networks usually provide higher accuracies than other methods. However, they operate as a black box and are difficult to interpret.
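The classic single-unit perceptron update rule can be sketched in a few lines; the learning rate, epoch count and the AND problem below are illustrative choices:

    def train_perceptron(X, y, epochs=20, lr=0.1):
        # One processing unit: thresholded weighted sum, trained with the
        # perceptron rule (targets in {0, 1}; data assumed linearly separable).
        w, b = [0.0] * len(X[0]), 0.0
        for _ in range(epochs):
            for x, target in zip(X, y):
                out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
                err = target - out  # zero when the prediction is correct
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
        return w, b

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # logical AND: linearly separable
    y = [0, 0, 0, 1]
    w, b = train_perceptron(X, y)
    print([1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
           for x in X])  # [0, 0, 0, 1]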

Cost-sensitive algorithms

Cost-sensitive algorithms [140, 486] do not aim to maximize classification accuracy, but to minimize the expected total cost of errors derived from mistakes in the classification process. They take into account matrices of misclassification costs to express relative distances between classes. This approach incorporates decision-making costs to define fixed and unequal misclassification costs between classes. The cost model takes the form of a cost matrix, where the cost of classifying a sample from a true class $j$ in class $i$ corresponds to the matrix entry $m_{ij}$. The diagonal elements of this matrix are set to zero, meaning correct classification has no cost.

Cost-sensitive algorithms can be divided into two main categories. Algorithms belonging to the first category (direct methods) [128, 294, 442] design classifiers that are naturally cost-sensitive, using the misclassification costs directly in the learning algorithms. Most of the works belonging to this category are devoted to making decision trees cost-sensitive. A detailed survey of cost-sensitive decision tree induction algorithms can be found in [297]. In contrast, the second category (indirect methods) converts existing cost-insensitive classifiers into cost-sensitive classifiers. These classifiers can be further categorized into relabeling methods, weighting methods and sampling methods. Specifically, relabeling methods [126, 478] relabel the classes of training or testing instances by applying the minimum expected cost criterion [266]. This criterion is defined by fixed misclassification costs and posterior probabilities as follows: $R(c \mid \mathbf{x}) = \sum_{c' \in \Omega(C)} p(c' \mid \mathbf{x}) \, cost(c \mid c')$. Weighting methods [436] assign a weight to each instance in terms of its class according to the misclassification costs; that is, instances which carry a higher misclassification cost are assigned proportionally higher weights. Finally, sampling methods [414, 487] modify the class distribution of the training data according to the costs and then directly apply cost-insensitive classifiers to the sampled data.
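The minimum expected cost criterion can be applied directly once a classifier outputs posterior probabilities, as in the sketch below; the posteriors and cost matrix are invented for illustration:

    # Minimum expected cost criterion: choose the class c that minimizes
    # R(c|x) = sum over c' of p(c'|x) * cost(c|c'). Values are illustrative.
    posteriors = {"positive": 0.3, "negative": 0.7}  # p(c'|x) from a model
    cost = {  # cost[predicted][true]; zero diagonal = correct classification
        "positive": {"positive": 0.0, "negative": 1.0},
        "negative": {"positive": 10.0, "negative": 0.0},  # costly false negative
    }

    def expected_cost(predicted):
        return sum(posteriors[true] * cost[predicted][true]
                   for true in posteriors)

    print(min(cost, key=expected_cost))  # 'positive': R = 0.7 versus R = 3.0

Note that an accuracy-maximizing classifier would label this instance negative; the asymmetric costs reverse that decision.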

2.2.2 Classification validation

The evaluation of the performance of classifiers is a matter of ongoing debate among researchers [422]. It is a key step in any supervised learning problem, since its aim is to estimate how well a classifier predicts the class value for new instances. The validation of supervised classifiers is a relatively simple procedure due to the presence of real class values. Based on these real class values, a classifier correctly classifies a new instance if the predicted class is the same as the real class, which is counted as a success; if not, it is an error.

The confusion matrix [432] is an important tool for validating the performance of classifiers. It is a specific table layout that allows visualization of the performance of a classifier. In a dichotomous classification problem, each column of the matrix represents how many instances have been classified as either positive or negative, while each row represents how many of those classifications agreed with the real class value and how many did not. The main diagonal values in the confusion matrix correspond to the correctly classified instances, which are the number of true positives (TP) and the number of true negatives (TN). The misclassification values are divided into false negatives (FN) and false positives (FP).

Real / Predicted    as Positive    as Negative
Positive            TP             FN
Negative            FP             TN

Performance measures

Supervised learning has several ways of evaluating the performance of the classifiers it produces, and several measures of the quality of classification can be obtained directly from the values in the confusion matrix. For classification problems, it is natural to measure a classifier's performance in terms of the error rate. The classifier predicts the class of each instance: if the prediction is correct, it is counted as a success; if not, it is an error. The error rate is the proportion of errors made over a whole set of instances, and it measures the overall performance of the classifier. Performance is also usually expressed in terms of the classification accuracy, that is, the proportion of successes over a whole set of instances. The accuracy of a classifier is thus the probability of correctly classifying a new instance, and is estimated by

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2.4)$$

The total number of correctly classified instances is usually the score to be maximized in the most common ways of comparing algorithms. When a dataset is unbalanced, the above score is not representative of the true performance of a classifier, because it does not distinguish between the numbers of correct labels of different classes. In this context, two measures that separately estimate a classifier's performance on different classes are:

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (2.5)$$

$$\text{specificity} = \frac{TN}{FP + TN} \qquad (2.6)$$

The sensitivity estimates the probability of the positive label being true, that is, the ratio of positive instances that are correctly classified as positive. The counterpart of sensitivity for the negative instances is the specificity. It estimates the probability of the negative label being true, that is, the ratio of negative instances that are correctly classified as negative. Both measures distinguish the correct classification of labels within different classes. Finally, other measures have been developed to address situations in which the cost of a false positive and the cost of a false negative are very different. An example of these measures is precision, which can be defined as the probability that an instance classified as positive is actually positive. From the confusion matrix, it is computed as

$$\text{precision} = \frac{TP}{TP + FP} \qquad (2.7)$$
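For reference, Eqs. (2.4)-(2.7) are computed directly from the four confusion matrix counts; the counts below are invented to show how accuracy can look good while sensitivity is poor on an unbalanced dataset:

    def classification_measures(tp, fn, fp, tn):
        # Accuracy, sensitivity, specificity and precision, Eqs. (2.4)-(2.7).
        return {
            "accuracy": (tp + tn) / (tp + fp + tn + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (fp + tn),
            "precision": tp / (tp + fp),
        }

    # Unbalanced example: accuracy 0.955 despite a sensitivity of only 0.2.
    print(classification_measures(tp=10, fn=40, fp=5, tn=945))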

Estimation methods

Besides defining performance measures, it is necessary to define methods to honestly estimate these measures. There are several approaches in the literature. Resubstitution [417] is a simple estimation method which consists of training a classifier with the full dataset and testing its performance again on the same whole dataset. It is not likely to be a good method, since it presents a high bias due to the specific data and thus provides accuracy estimations which are too optimistic. Although it cannot be considered an honest method for estimating any performance measure, resubstitution can be useful as an upper bound on accuracy performance.

With the intention of solving the previous optimism problem, hold-out [278] splits the dataset into two disjoint sets, one to induce the model and the other to estimate its performance. In this way, an honest estimation is achieved by training the classifier with a set of instances that is independent from the one used for testing. The disadvantage of this method is that a subset of instances is not used to build the model, which may be undesirable if the available data sample size is not very large.

The most frequently used evaluation method is called k-fold cross validation [433]. This method solves the possible disadvantage of hold-out by dividing all instances of the dataset into k random disjoint subsets of approximately equal size. Each subset is used to test a model that is learned from the other k-1 subsets. The k accuracy values are averaged to output the estimated value for the model learned from all instances. This procedure can be repeated several times to reduce the variance of the estimate, giving rise to repeated k-fold cross validation [258]. Then, the final estimate of the accuracy is the mean of the estimates computed in each repetition. Another improvement of the original k-fold cross validation is stratified k-fold cross validation [61], which tries to preserve the proportion of instances of each class in every fold. This method, which obtains more realistic estimations, is recommended when the class labels are imbalanced. Another popular evaluation technique is the leave-one-out method [333]. It is a special case of k-fold cross validation where the number of folds is equal to the number of instances in the training set; that is, the learning process is repeated k times, using k-1 instances to learn and a single instance to test each time.

The performance measures and methods presented above are accepted and frequently used by the machine learning community. There are more performance measures and methods in the state of the art of classification validation. Of remarkable relevance are the area under the ROC curve [427] and the F-measure [455] as performance measures, and the jackknife [376] and bootstrap [132] as methods to estimate the performance measures.
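A bare-bones version of k-fold cross validation is sketched below; the fit/predict pair (a trivial majority-class baseline) and the data are assumptions used only to make the sketch self-contained:

    import random
    from collections import Counter

    def k_fold_indices(n, k, seed=0):
        # Split instance indices into k disjoint folds of roughly equal size.
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    def cross_validate(X, y, fit, predict, k=5):
        # Each fold is tested on a model learnt from the other k-1 folds;
        # the k accuracy values are averaged into a single estimate.
        scores = []
        for test in k_fold_indices(len(X), k):
            train = [i for i in range(len(X)) if i not in test]
            model = fit([X[i] for i in train], [y[i] for i in train])
            hits = sum(predict(model, X[i]) == y[i] for i in test)
            scores.append(hits / len(test))
        return sum(scores) / k

    fit = lambda X, y: Counter(y).most_common(1)[0][0]  # majority-class model
    predict = lambda model, x: model
    X, y = [[i] for i in range(20)], ["a"] * 12 + ["b"] * 8
    print(cross_validate(X, y, fit, predict, k=5))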

2.3 Unsupervised learning

Unsupervised learning is the most frequently analyzed machine learning problem after supervised learning. It studies the problem of finding groups of similar observations in a dataset. In the case of a company with customer data (demographic information and past transactions with the company), the company may want to see the distribution of the profiles of its customers. In this context, a clustering model allocates customers that are similar in their attributes to the same group, providing the company with natural groupings of its customers. Once such groups are found, the company may devise specific strategies for the different groups. Outliers can also be detected by the clustering model, so they may reveal a niche in the market that can be further exploited by the company.

Clustering is concerned with finding a structure in a collection of unlabeled elements that are characterized by several variables. The goal is to group elements in this collection so that elements that belong to a cluster are very similar to each other, whereas different clusters are highly heterogeneous. A formal definition of clustering is as follows. Let the dataset D = {x^{(1)}, ..., x^{(N)}} be a set of instances described by a vector of descriptive features in a space of dimension F, that is, x^{(i)} ∈ ℜ^F, ∀i ∈ {1, ..., N}. In this way, the goal is to assign a cluster label c^{(i)} to each instance, with c^{(i)} ∈ {1, ..., K}, based on some similarity measure with the other instances. The final number of clusters, K, is often unknown and must be estimated.

Dissimilarity measures Clustering approaches usually rely on the definition of a distance or dissimilarity measure between the observations. These measures play an important role in clustering approaches, like partitional or hierarchical ones, since the results can be completely different depending on the selected dissimilarity measure. Many dissimilarity measures can be found in [125]. One of the most widely used dissimilarity measures is the Euclidean distance. The distance between two instances is calculated with the Euclidean distance as

Euclidean(x^{(i)}, x^{(i+1)}) = \sqrt{ \sum_{j=1}^{F} ( x_j^{(i)} - x_j^{(i+1)} )^2 }.    (2.8)

The Euclidean distance is a particular case of the general Minkowski distance:

Minkowski(x^{(i)}, x^{(i+1)}) = \sqrt[r]{ \sum_{j=1}^{F} w_j | x_j^{(i)} - x_j^{(i+1)} |^r },    (2.9)

where w_j is a possible weight for feature j, and r is the distance norm. The Manhattan distance is another measure following this structure, with r = 1. There are other measures, like the Mahalanobis distance [306] or Pearson's correlation [360], based on correlations between features. The Mahalanobis distance is defined as

Mahalanobis(x^{(i)}, x^{(i+1)}) = \sqrt{ (x^{(i)} - x^{(i+1)})^T \Sigma^{-1} (x^{(i)} - x^{(i+1)}) },    (2.10)

where \Sigma is the covariance matrix of the features in a space of dimension F.
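As a quick illustration, these measures can be implemented in a few lines of NumPy; the helper names below are only illustrative, and x and y stand for two instances x^{(i)} and x^{(i+1)}:

    import numpy as np

    def minkowski(x, y, r=2.0, w=None):
        # Eq. (2.9); uniform weights by default, and r=2 recovers Eq. (2.8)
        w = np.ones_like(x) if w is None else w
        return float((w * np.abs(x - y) ** r).sum() ** (1.0 / r))

    def mahalanobis(x, y, cov):
        # Eq. (2.10); cov is the F x F covariance matrix of the features
        d = x - y
        return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 0.0, 3.5])
    print(minkowski(x, y))          # Euclidean distance (r = 2)
    print(minkowski(x, y, r=1.0))   # Manhattan distance (r = 1)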

2.3.1 Unsupervised learning approaches

Different starting points and criteria usually lead to different taxonomies of clustering algorithms [143, 234, 481]. A commonly agreed framework classifies clustering techniques into partitional clustering, hierarchical clustering and probabilistic clustering, based on the properties of the clusters generated. Partitional clustering groups elements exclusively, so that any element belonging to one specific cluster cannot be a member of another cluster. On the other hand, hierarchical clustering produces a hierarchical structure of clusters. Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones (agglomerative clustering) or by splitting larger clusters (divisive clustering). Finally, probabilistic clustering provides a cluster membership probability for each element, where elements have a specific probability of being members of several clusters. The above clustering techniques are all used throughout this thesis and are introduced in the following.

Partitional clustering Partitional clustering algorithms assign a set of instances to a pre-fixed number K of clusters with no hierarchical structure. In principle, the optimal partition, based on some dissimilarity measure, can be found by enumerating all possibilities. But this brute-force method is infeasible in practice due to its expensive computation [295]. Therefore, heuristic algorithms have been developed in order to seek approximate solutions.

The k-means algorithm [303] is a well-known partitional clustering algorithm. This algorithm proceeds as follows. First, the K clusters are initialized [355], obtaining K centroids which represent the K cluster centers, µ_k being the centroid of cluster C_k. Each instance x^{(i)} is assigned to a cluster by minimizing the distance between the instance and the cluster centroids. Once each instance is assigned to a cluster, the cluster centroids are recalculated based on those assignments. After the new centroids are calculated, the instances are again reallocated to the clusters. This is an iterative process that converges when the cluster centroids do not change from one iteration to the next. This algorithm works conveniently only with numerical attributes and can be negatively affected by a single outlier. Outliers that are quite far away from the cluster centroid are still forced into a cluster, thus distorting the cluster shapes. For this reason, new algorithms, like Partitioning Around Medoids (PAM) [249], have appeared in order to overcome these obstacles.
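Before turning to PAM, a compact NumPy sketch of the k-means iteration just described is given below; the random choice of initial instances is a simplification of the initialization strategies surveyed in [355], and empty clusters are not handled:

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization: pick K distinct instances as the initial centroids
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each instance joins its nearest centroid's cluster
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: recompute each centroid as the mean of its cluster
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):
                break  # centroids unchanged between iterations: convergence
            centroids = new_centroids
        return labels, centroids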

PAM begins by selecting an instance as a medoid for each cluster C_k. After selecting a set of K medoids, K clusters are constructed by assigning each instance to its nearest medoid. If the objective function can be reduced by switching a selected medoid for an unselected (non-medoid) element, then they are switched. This continues until the objective function can be decreased no further. This algorithm has several advantages with regard to k-means. First, it presents no limitations on attribute types because it utilizes real data points (medoids) as the cluster prototypes (medoids do not need any computation and always exist). Second, the choice of medoids is dictated by the location of a predominant fraction of points inside a cluster and, therefore, it is less sensitive to the presence of outliers. Unlike k-means, the resulting clustering is independent of the initial choice of medoids. The objective of this algorithm is to determine a representative element (medoid) among the elements of the dataset for each cluster. For K clusters, the goal is to find K representative instances which minimize the following objective function

E = \sum_{k=1}^{K} \sum_{x^{(i)} \in C_k} d(x^{(i)}, m_k),    (2.11)

where K is the number of clusters, x^{(i)} is an instance belonging to cluster C_k, m_k is the medoid of cluster C_k, and d(x^{(i)}, m_k) is the dissimilarity measure between x^{(i)} and m_k.
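A minimal sketch of this objective and of the medoid-swapping loop follows; the random initial medoids are an illustrative simplification of PAM's build phase, and D is assumed to be a precomputed N x N dissimilarity matrix:

    import numpy as np

    def pam_objective(D, medoids):
        # Eq. (2.11): each instance contributes its dissimilarity to the nearest medoid
        return D[:, medoids].min(axis=1).sum()

    def pam(D, K, seed=0):
        rng = np.random.default_rng(seed)
        medoids = list(rng.choice(len(D), size=K, replace=False))
        improved = True
        while improved:
            improved = False
            for pos in range(K):
                for cand in range(len(D)):
                    if cand in medoids:
                        continue
                    trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                    # Swap a medoid for a non-medoid if it reduces the objective
                    if pam_objective(D, trial) < pam_objective(D, medoids):
                        medoids, improved = trial, True
        return medoids, D[:, medoids].argmin(axis=1)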

Hierarchical clustering Hierarchical clustering also relies on the definition of a dissimilarity measure between the instances. Hierarchical clustering algorithms build a tree of clusters called a dendrogram. This dendrogram allows exploring the data at different levels of granularity. Hierarchical clustering algorithms are categorized into agglomerative and divisive [233]. Agglomerative clustering starts with N clusters, each of which includes exactly one instance. A series of merge operations based on the linkage function is then carried out that finally leads all instances to the same group. Divisive clustering proceeds in the opposite way. It starts with one cluster containing all instances and recursively splits the most appropriate cluster. For a cluster with N instances, there are 2^{N-1} - 1 possible two-subset divisions, which is very expensive in computation [143]. Therefore, divisive clustering is not commonly used in practice. Despite this, some divisive clustering algorithms, like MONA and DIANA [249], have also been developed in the literature.

Based on the different definitions of the distance between two clusters, there are many agglomerative clustering algorithms. The simplest methods include single linkage [420], which calculates the distance between the two closest instances in each cluster, and the complete linkage technique [426], which calculates the distance between the two most remote instances in each cluster. Other linkage metrics, such as median linkage, centroid linkage and average linkage [421], have also been developed. Unlike methods based on linkage metrics, a more complicated clustering algorithm called Ward's method [471] uses an analysis of variance approach to evaluate the distances between clusters. It is also known as Ward's minimum variance method. Given K clusters, Ward's algorithm reduces them to K - 1 mutually exclusive clusters by considering the union of all possible K(K - 1)/2 pairs. It selects the union of clusters which minimizes the heterogeneity among cluster elements. Thus, homogeneous clusters are linked to each other. The complete hierarchical structure can be obtained by repeating this process until only one cluster remains.

Finally, hierarchical clustering techniques do not generate a single partition but a hierarchy of clusters. Different dissimilarity measures and linkage functions yield different hierarchies of clusters. Therefore, these decisions should be carefully made taking into account the nature of the data. Also, a limitation of hierarchical clustering is that divisions in the divisive paradigm, and mergers in the agglomerative paradigm, cannot be undone once made.
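Agglomerative clustering with the linkage functions above is readily available in SciPy; a brief sketch on synthetic data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    # Build the merge tree (dendrogram) with Ward's minimum variance criterion;
    # method='single', 'complete' or 'average' select the other linkages above
    Z = linkage(X, method='ward')

    # Cut the dendrogram to obtain a flat partition with two clusters
    labels = fcluster(Z, t=2, criterion='maxclust')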

Probabilistic clustering Probabilistic clustering deals with the problem of fitting a finite mixture of distributions [317], where each component is the probability distribution that models the observations belonging to the cluster. Although other distributions can be used, the most popular mixture model is formed by Gaussian components [318]. In this way, each cluster k is represented by one component f_k(x) of the mixture. Each distribution (k) is characterized by two parameters for each variable (j): the mean (µ_kj) and the standard deviation (σ_kj). Using this approach, the clustering problem becomes a mixture parameter estimation problem. Once the parameters are estimated, they can be used to calculate the posterior probability of each instance under each distribution. The parameter estimation is performed using methods such as the AutoClass [85] or SNOB [467] algorithms, but the most widely used is the Expectation-Maximization (EM) algorithm [124, 316]. The EM algorithm is an iterative method that is used to find the maximum likelihood estimates of the mixing coefficients (π_k) and the parameters of the conditional Gaussian distributions (µ_kj and σ_kj). Thus,

f(x) = \sum_{k=1}^{K} \pi_k f_k(x) = \sum_{k=1}^{K} \pi_k N(x; \mu_k, \Sigma_k) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{F} \frac{1}{\sqrt{2\pi} \sigma_{kj}} e^{-\frac{1}{2} \left( \frac{x_j - \mu_{kj}}{\sigma_{kj}} \right)^2}.    (2.12)

The algorithm converges to a locally optimal solution by iteratively updating the values of π_k, µ_kj and σ_kj. This whole process can be embedded in a cross validation procedure that is capable of estimating the number of clusters K without it having to be set a priori.
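A sketch of this mixture-based clustering using scikit-learn's EM implementation follows; covariance_type='diag' mirrors the factorized Gaussian components of Equation (2.12), and the data are synthetic:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

    # Fit a two-component Gaussian mixture by EM
    gmm = GaussianMixture(n_components=2, covariance_type='diag',
                          random_state=0).fit(X)

    resp = gmm.predict_proba(X)   # posterior membership probability per instance
    hard = gmm.predict(X)         # hard assignment to the most probable component
    print(gmm.weights_)           # estimated mixing coefficients pi_k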

2.3.2 Clustering validation

One of the most important issues in cluster analysis is the evaluation of clustering results [194]. Clustering validation is concerned with checking the quality of clustering results and determining the optimal number of clusters (the best for the input dataset) by means of quality measures [324]. Recent works [22, 190] have focused on the comparison of cluster validity indices. It is usual to classify these indices into two groups (internal and external validity indices) but the classification criteria are not always clear [233, 484].

Internal validation Internal validity indices do not require a priori information about the dataset; they are based on the information intrinsic to the dataset alone. The optimal number of clusters is usually determined based on internal validity indices. These indices are used to measure the goodness of a clustering structure (compactness and separation of the clusters) without external information. The internal validity indices used throughout this thesis are well known in the literature. Some of the most widely used indices are detailed as follows.

The Silhouette index [385] measures cluster cohesion using the distance between all the points in the same cluster, and cluster separation using the nearest-neighbour distance. A larger average silhouette coefficient indicates a better overall quality of the clustering result. For an instance x^{(i)}, it is defined as

Silhouette(x^{(i)}) = \frac{b(x^{(i)}) - a(x^{(i)})}{\max(a(x^{(i)}), b(x^{(i)}))},    (2.13)

where a(x^{(i)}) is the average dissimilarity between instance x^{(i)} and all other points in the cluster to which x^{(i)} belongs, and b(x^{(i)}) is the minimum, over the other clusters, of the average dissimilarity between x^{(i)} and the instances of each cluster.

The Davies-Bouldin index [113] estimates cluster cohesion based on the distance from the points in a cluster to its centroid, and cluster separation based on the distance between centroids. A lower Davies-Bouldin index indicates better clustering. It is calculated by averaging over each pair of clusters as

Davies-Bouldin = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \left( \frac{d_k + d_{k'}}{d(\mu_k, \mu_{k'})} \right),    (2.14)

where K is the total number of clusters, d_k and d_{k'} are the average distances of all instances in each cluster to their respective centroids µ_k and µ_{k'}, and d(µ_k, µ_{k'}) is the distance between the cluster centroids.

The Calinski-Harabasz index [75] estimates cluster cohesion based on the distances from the points in a cluster to its centroid, whereas cluster separation is based on the distance from the centroids to the global centroid. A high index value indicates isolated and unified clusters. This index can be defined as

Calinski-Harabasz = \frac{BSS_K / (K - 1)}{WSS_K / (N - K)},    (2.15)

where BSS_K is the between-cluster sum of squares, WSS_K is the within-cluster sum of squares, K is the total number of clusters and N is the total number of instances.
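All three internal indices are implemented in scikit-learn, so the quality of a candidate partition can be checked in a few lines (synthetic data and k-means labels are used for illustration):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                                 calinski_harabasz_score)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_score(X, labels))          # larger is better, Eq. (2.13)
    print(davies_bouldin_score(X, labels))      # lower is better, Eq. (2.14)
    print(calinski_harabasz_score(X, labels))   # larger is better, Eq. (2.15)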

External validation External validity indices require previous knowledge about the dataset to check the quality of clustering results. When the correct partition of a dataset is available, the usual approach is to compare it with the partition proposed by the clustering algorithm. The comparison is based on one of the many indices that measure the agreement between two different data partitions. Given a set of N instances and two different partitions, S and T, to be compared, a is defined as the number of pairs of instances that are located in the same group in S and in T, b is the number of pairs of instances in the same group in S but not in T, c is the number of pairs of instances in the same group in T but not in S, and d is the number of pairs of instances in different groups in both partitions S and T. Given this context, some of the most widely used external validity indices are presented as follows.
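The pair counts a, b, c and d can be obtained directly from two label vectors; the sketch below (a quadratic-time illustration) computes them and evaluates the Rand index defined next:

    from itertools import combinations

    def pair_counts(S, T):
        # a: same group in both; b: same in S only; c: same in T only; d: apart in both
        a = b = c = d = 0
        for i, j in combinations(range(len(S)), 2):
            same_S, same_T = S[i] == S[j], T[i] == T[j]
            if same_S and same_T:
                a += 1
            elif same_S:
                b += 1
            elif same_T:
                c += 1
            else:
                d += 1
        return a, b, c, d

    S = [0, 0, 0, 1, 1, 1]      # reference partition
    T = [0, 0, 1, 1, 1, 1]      # partition proposed by a clustering algorithm
    a, b, c, d = pair_counts(S, T)
    print((a + d) / (a + b + c + d))   # Rand index, Eq. (2.16)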

On the one hand, the Rand index [374] is defined as

Rand = \frac{a + d}{a + b + c + d}.    (2.16)

This index lies between 0 and 1. It takes the value of 1 when the two clusterings are identical. The problem of this index is its value when two random partitions are compared, since it does not take a zero value. On the other hand, the adjusted Rand index [216] is defined as

Adjusted Rand = \frac{\binom{N}{2}(a + d) - [(a + b)(a + c) + (c + d)(b + d)]}{\binom{N}{2}^2 - [(a + b)(a + c) + (c + d)(b + d)]}.    (2.17)

This index overcomes the Rand index limitation concerning random partitions. It introduces a penalization to avoid rewarding random classifications. This index also lies between 0 and 1, the latter being the value output when the two partitions are equal. Finally, the Russel index [388] is defined as

Russel = \frac{a}{a + b + c + d}.    (2.18)

This index only considers pairs of instances in the same group in both partitions as good decisions. This index also lies between 0 and 1. A high index value indicates that the two partitions are highly similar.

Chapter 3

Probabilistic Graphical Models

3.1 Introduction

Probabilistic graphical models [81, 259, 280, 477] have received much attention from the machine learning community over the last years, due to their capability for knowledge discovery and reasoning under uncertainty. They combine probability theory and graph theory into a single framework that is able to manage many real-world problems. These models are composed of two main components: a graphical component and a probabilistic component. The graphical component is a graph where the nodes represent the variables in the problem domain and the edges represent the conditional (in)dependence relationships among the variables, whereas the probabilistic component models these dependence relationships using (conditional) probability distributions. The main characteristic of these models is that they consist of a graphical structure and a set of parameters that together encode a joint probability distribution for the variables in the problem domain. Market basket analysis in a supermarket chain is one application of probabilistic graphical models. This application tries to learn associations between products bought by customers. The main idea is to learn a conditional probability of the form p(Y|X), where Y is the product conditioned on X, the set of products which the customer has already purchased. Although different probabilistic graphical approaches have been introduced in the literature, such as Bayesian networks [237, 356, 357], Markov networks [256] and chain graphs [282], among others, this thesis is focused on Bayesian networks, as they are the most frequently used models for reasoning with uncertainty in many problems [366].

Chapter outline

Section 3.2 introduces Bayesian network models and their parameterizations according to the nature of the variables. Section 3.3 provides an overview of the state-of-the-art approaches dealing with Bayesian network structure and parameter learning. Section 3.4 describes probabilistic inference in Bayesian networks, which consists of estimating the posterior probability of some variables of interest given evidence on the value of some other variables in the Bayesian network. Finally, Section 3.5 deals with Bayesian network classifiers, a class of Bayesian networks for solving supervised learning problems.

3.2 Bayesian networks

A Bayesian network [203, 238, 280] is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph. Formally, a Bayesian network is defined as a pair (S, θ). The first element, S, is a directed acyclic graph, S = (V(S), A(S)), with a set of nodes given by the random variables of the problem, i.e., V(S) = {X_1, ..., X_n}, and a set of arcs A(S) ⊆ V(S) × V(S) representing the probabilistic conditional (in)dependencies among the nodes. The second element, θ, is a vector of conditional probabilities that in combination with S allows the factorization of the joint probability distribution over (X_1, ..., X_n) as:

p(X_1, ..., X_n) = \prod_{i=1}^{n} p(X_i | \Pi(X_i)),    (3.1)

where Π(X_i) represents the set of parents of X_i. A node X_i is a parent of another node X_j if there is an arc from X_i to X_j. The probabilistic component, θ, determines the kind of probability distributions used in a Bayesian network. Different parameterizations of Bayesian networks have been proposed depending on the nature of the random variables.

3.2.1 Discrete Bayesian networks

In the discrete domain, the statistical relationship between a variable X_i and its parents Π(X_i) is encoded using discrete probability distributions which are defined by conditional probability tables. These tables store the parameters of the discrete probability distributions of each variable for all the combinations of the values of its parents, that is, θ_ijk ≡ p(X_i = x_i^{(j)} | Π(X_i) = π(x_i)^{(k)}), where x_i^{(j)} is the jth value of variable X_i and π(x_i)^{(k)} is the kth combination of values of the parents of X_i. Figure 3.1 illustrates an example of a Bayesian network where all the variables are binary. Its conditional probability tables are:

p(X1):         X1=0: 0.20   X1=1: 0.80

p(X2|X1):      X1=0   X1=1        p(X3|X1):    X1=0   X1=1
    X2=0       0.80   0.80            X3=0     0.20   0.05
    X2=1       0.20   0.20            X3=1     0.80   0.95

p(X5|X3):      X3=0   X3=1
    X5=0       0.80   0.40
    X5=1       0.20   0.60

p(X4|X2,X3):   X2=0,X3=0   X2=0,X3=1   X2=1,X3=0   X2=1,X3=1
    X4=0       0.80        0.80        0.80        0.05
    X4=1       0.20        0.20        0.20        0.95

Figure 3.1: Bayesian network example: graphical and probabilistic components

It is observed that variable X1 has no parents, whereas Π(X2)={X1}, Π(X3)={X1}, Π(X4)={X2,X3} and Π(X5)={X3}. Thus, the Bayesian network in Figure 3.1 encodes the following factorization of the joint probability distribution:

p(X) = p(X1) p(X2|X1) p(X3|X1) p(X4|X2,X3) p(X5|X3) (3.2)
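Using the conditional probability tables of Figure 3.1, the factorization (3.2) can be evaluated directly. A small sketch computing the probability of one full configuration (the dictionary encoding of the tables is an illustrative choice):

    # CPTs of Figure 3.1: p(X=0 | parents), keyed by the parent values
    p_x1_0 = 0.20
    p_x2_0 = {0: 0.80, 1: 0.80}                     # key: x1
    p_x3_0 = {0: 0.20, 1: 0.05}                     # key: x1
    p_x4_0 = {(0, 0): 0.80, (0, 1): 0.80,
              (1, 0): 0.80, (1, 1): 0.05}           # key: (x2, x3)
    p_x5_0 = {0: 0.80, 1: 0.40}                     # key: x3

    def pick(p0, value):
        # Binary variables: p(X=1) is the complement of p(X=0)
        return p0 if value == 0 else 1.0 - p0

    def joint(x1, x2, x3, x4, x5):
        # Eq. (3.2): product of one CPT entry per variable given its parents
        return (pick(p_x1_0, x1) * pick(p_x2_0[x1], x2) * pick(p_x3_0[x1], x3)
                * pick(p_x4_0[(x2, x3)], x4) * pick(p_x5_0[x3], x5))

    print(joint(1, 0, 1, 1, 1))   # p(X1=1,X2=0,X3=1,X4=1,X5=1) = 0.8*0.8*0.95*0.2*0.6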

3.2.2 Gaussian Bayesian networks

A Bayesian network is said to be a Gaussian Bayesian network [178, 411] if and only if its associated joint probability distribution is a multivariate normal distribution, N(µ, Σ), with joint probability density function

f(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),    (3.3)

where x is a realization of the random variables, µ is the n-dimensional mean vector, Σ is the n × n covariance matrix, |Σ| is the determinant of Σ, and (x − µ)^T denotes the transpose of (x − µ). The joint probability distribution of the variables in a Gaussian Bayesian network can be specified as in Equation (3.1) by the product of a set of conditional probability distributions

f(X_i | \Pi(X_i)) \sim N\left( \mu_i + \sum_{x_j \in \Pi(X_i)} \beta_{ij} (x_j - \mu_j), v_i \right),    (3.4)

where µ_i is the unconditional mean of X_i, β_{ij} is the regression coefficient of X_j in the regression of X_i on its parents Π(X_i), and v_i is the conditional variance of X_i given its parents. It can be calculated as

v_i = \Sigma_{X_i} - \Sigma_{X_i \Pi(X_i)} \Sigma_{\Pi(X_i)}^{-1} \Sigma_{X_i \Pi(X_i)}^T,    (3.5)

where Σ_{X_i} is the unconditional variance of X_i, Σ_{X_i Π(X_i)} is the row matrix with the covariances between X_i and Π(X_i), and Σ_{Π(X_i)} is the covariance matrix of Π(X_i). Finally, Figure 3.2 shows an example of a Gaussian Bayesian network structure and its joint probability distribution.
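Equation (3.5) is a direct block computation on the joint covariance matrix; a quick NumPy check with an arbitrary illustrative covariance (two parents first, X_i last):

    import numpy as np

    # Joint covariance over (P1, P2, Xi): parents first, Xi last
    Sigma = np.array([[2.0, 0.3, 0.8],
                      [0.3, 1.0, 0.5],
                      [0.8, 0.5, 1.5]])

    S_xx = Sigma[2, 2]      # unconditional variance of Xi
    S_xp = Sigma[2, :2]     # covariances between Xi and its parents
    S_pp = Sigma[:2, :2]    # covariance matrix of the parents

    # Eq. (3.5): conditional variance of Xi given its parents
    v_i = S_xx - S_xp @ np.linalg.inv(S_pp) @ S_xp
    # The regression coefficients beta_ij of Eq. (3.4) come from the same blocks
    beta = np.linalg.inv(S_pp) @ S_xp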

3.3 Learning Bayesian networks

The Bayesian network learning problem [203, 344] can be divided into two tasks: structural learning, that is, to identify the topology of the Bayesian network, and parametric learning, that is, to estimate the conditional probabilities for a given Bayesian network topology. Both the structure, S, and the parameters of the probability distributions, θ, can be obtained in two ways. The first way uses expert knowledge for the learning task, whereas the second way uses algorithms which learn Bayesian networks from a dataset D.

Figure 3.2: Gaussian Bayesian network structure and its joint probability distribution. The figure shows the factorization f(X) = f(X1) f(X2|X1) f(X3|X1) f(X4|X2,X3) f(X5|X3), where f(X1) ∼ N(µ_{X1}, v_{X1}), f(X2|X1) ∼ N(µ_{X2} + β_{X1X2}(x_1 - µ_{X1}), v_{X2}), f(X3|X1) ∼ N(µ_{X3} + β_{X1X3}(x_1 - µ_{X1}), v_{X3}), f(X4|X2,X3) ∼ N(µ_{X4} + β_{X2X4}(x_2 - µ_{X2}) + β_{X3X4}(x_3 - µ_{X3}), v_{X4}) and f(X5|X3) ∼ N(µ_{X5} + β_{X3X5}(x_3 - µ_{X3}), v_{X5}).

Learning Bayesian networks from expert knowledge [150, 176] is out of the scope of this dissertation. Therefore, the remainder of this section presents a brief review of principled approaches for learning Bayesian networks from data. This is a very active field of research, and there have been several new proposals in the last years [68, 215, 460, 483].

3.3.1 Structural learning

The most difficult task in Bayesian networks is to determine their structure S, that is, which node should be connected to which node. The task of automatically defining the structure from a dataset is called Bayesian network structure learning. There are two basic approaches for learning the structure of a Bayesian network: constraint-based methods [115, 358, 431] and score+search methods [98, 204]. Constraint-based methods use conditional independence tests to identify the dependent and independent relationships among variables and then build a directed acyclic graph. In contrast, score+search methods approach the structure learning problem as an optimization problem. They use a search procedure to explore the space of network structures, and a score function to evaluate the candidate network structures and guide the search procedure.

Constraint-based methods Constraint-based methods [87, 418] perform statistical tests to determine a large percentage of the conditional (in)dependence relationships among the variables in the given dataset, and then a directed acyclic graph is built. The PC algorithm [430] is a well-known example of constraint-based methods. It starts from a complete undirected graph, and then performs recursive conditional independence tests for deleting edges. The result is a skeleton in which all edges are still undirected and should be transformed into arcs using edge orientation rules. Some improved versions of the PC algorithm [67, 246] have also been developed in the literature. A major weakness of methods belonging to this constraint-based approach is that too many tests may have to be performed, with each test being built upon the results of another. This may lead to escalating errors in structure identification. Additionally, increasing the cardinality of the conditioning part dramatically reduces test reliability.

Score+search methods Most of the structure learning algorithms developed fall into the score+search category. This approach states the learning task as an optimization problem, and two main components (a scoring metric and a search strategy) have to be defined. Once a score metric is specified, a search method is needed to move in an intelligent way through the space of possible directed acyclic graphs and find the structure with the optimal score. The fitness score measures the quality of every candidate structure with respect to a dataset. The number of candidate structures that can be built from data grows more than exponentially as the number of variables increases, so an exhaustive search is not a sensible approach to the problem [89]. Therefore, search strategies have been used to iterate comparisons on reduced sets of structures.

Scoring metrics [16, 92, 204, 379, 405] are developed to evaluate how well a particular Bayesian network structure fits a dataset and to guide the learning process. Classical goodness-of-fit criteria rank complex models higher than sparse ones. Nonetheless, a model should only have enough parameters to give an adequate representation of the association structure underlying the data. A criterion accounting for this tradeoff between model complexity and goodness-of-fit is the Bayesian information criterion (BIC) [405]. BIC provides a quantitative measure for model selection, penalizing the complexity of a model by an additional term which depends on the number of parameters and the sample size:

BIC = \log p(D | S, \theta) - pen(N) \dim(S),    (3.6)

where log p(D|S, θ) is the log-likelihood function, pen(N) is equal to log(N)/2, N being the number of instances in the dataset, and dim(S) is the network's dimension, that is, the number of independent parameters that have to be estimated.

The K2 scoring metric [98] computes the marginal likelihood of the dataset given the structure, subject to a uniform prior assumption on each variable data distribution. This scoring metric is decomposable, which facilitates the search process. Given the decomposability of the score, the marginal likelihood is maximized by maximizing, for each variable

X_i, the expression:

g(X_i, \Pi(X_i)) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!,    (3.7)

where r_i is the number of possible values of X_i; q_i is the number of possible value combinations of Π(X_i); N_{ijk} is the number of cases in the dataset in which variable X_i takes its kth value and Π(X_i) its jth value combination; and N_{ij} is defined as N_{ij} = \sum_{k=1}^{r_i} N_{ijk}.

Search methods explore the space of Bayesian network structures and try to find a high scoring network structure. The most commonly used method is the greedy search algorithm [99], which uses a candidate network structure, which may be empty or provided by some expert, as a starting point. Then, at each iteration, this algorithm considers three possible operations: arc insertion, deletion or reversal. Next, the score is computed for every resulting candidate, and the candidate presenting the best score is selected and becomes the current candidate. This search process is iterated until there is no more score improvement.
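In practice, Equation (3.7) is evaluated in log space to avoid overflow; a small sketch, assuming the dataset is encoded as an integer matrix (the function and variable names are illustrative):

    import numpy as np
    from math import lgamma

    def log_k2_score(child, parents, data, r_child):
        # data: (N, n) integer matrix; child and parents are column indices.
        # Groups cases by parent configuration and accumulates Eq. (3.7) in logs,
        # using lgamma(m + 1) = log(m!)
        counts_by_config = {}
        for row in data:
            j = tuple(row[parents])
            counts_by_config.setdefault(j, np.zeros(r_child, dtype=int))[row[child]] += 1
        score = 0.0
        for counts in counts_by_config.values():
            N_ij = int(counts.sum())
            score += lgamma(r_child) - lgamma(N_ij + r_child)  # log (r_i-1)!/(N_ij+r_i-1)!
            score += sum(lgamma(c + 1) for c in counts)        # log prod_k N_ijk!
        return score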

The K2 algorithm [98] is one of the most famous Bayesian network learning algorithms. This algorithm greedily learns a Bayesian network from a dataset by using the marginal likelihood score. Starting from the empty graph and a fixed order of the variables, this algorithm adds a variable as a parent of a given variable (only from the subset of variables that precede this variable in the ordering) whenever its inclusion represents an improvement in the marginal likelihood score. The algorithm stops the addition process when the marginal likelihood score decreases or the algorithm reaches the maximum admissible number of parents for each variable, which is fixed beforehand.

The search for the optimal Bayesian network structure has also been tackled by means of other approaches. Some of these approaches are: simulated annealing [90], best-first search [261], estimation of distribution algorithms [46], ant colony optimization [116] and particle swarm optimization [104], among others. Despite this wide range of approaches, most researchers use genetic algorithms for the purpose of structure learning [142, 240, 276, 277, 311, 340, 441].

3.3.2 Parametric learning

Once the Bayesian network structure, S, has been learnt, parameters, θ, have to be estimated from the dataset D. Parameter learning aims to estimate the values of the conditional probability distribution of each variable Xi given any value of its parent set Π(Xi). Two well-known approaches for parametric learning are described in the following. The maximum likelihood estimation (MLE) assesses the probabilities of variables from data without assuming any prior knowledge. It is based on the frequency of occurrences of variables in the data set, and selects the parameter configuration for a Bayesian network model, θˆ, that maximizes the probability of the data set given the model (S, θ):

\hat{\theta} = \arg\max_{\theta} p(D | S, \theta),    (3.8)

where p(D|S, θ) represents the likelihood function.

The maximum a posteriori (MAP) estimation is able to include prior knowledge into the parameter estimation problem. It selects the parameter configuration for a Bayesian network model, θ̂, that maximizes the posterior probability of the parameters given the dataset D:

\hat{\theta} = \arg\max_{\theta} p(\theta | D).    (3.9)
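For discrete Bayesian networks, both estimators reduce to (possibly smoothed) counting. A minimal sketch for one variable with a single parent, where the smoothing parameter alpha plays the role of a Dirichlet prior (alpha = 0 gives the MLE, alpha > 0 a MAP-style estimate; the helper is illustrative):

    import numpy as np

    def estimate_cpt(child_vals, parent_vals, r_child, r_parent, alpha=0.0):
        # counts[j, k] accumulates N_ijk for parent value j and child value k
        counts = np.zeros((r_parent, r_child))
        for j, k in zip(parent_vals, child_vals):
            counts[j, k] += 1
        # Normalize each row into a conditional distribution p(child | parent=j)
        return (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)

    x1 = np.array([0, 1, 1, 1, 0, 1])            # parent sample
    x2 = np.array([0, 0, 1, 0, 0, 1])            # child sample
    print(estimate_cpt(x2, x1, 2, 2))            # MLE of p(X2 | X1)
    print(estimate_cpt(x2, x1, 2, 2, alpha=1))   # Laplace-smoothed estimate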

3.4 Inference in Bayesian networks

Given the learned Bayesian network model, one of the most fundamental tasks for reasoning under uncertainty is evidence propagation [111, 281], which usually refers to computing the posterior probability of each single variable given the available evidence. In evidence propagation, a subset of variables X_e ⊂ X (evidence variables) has been observed, and the goal is to reason about another subset of variables X_q ⊂ X \ X_e (query variables) given the observed values of the evidence variables. This conditional probability can be computed as

p(X_q | X_e) = \frac{p(X_q, X_e)}{p(X_e)} = \frac{\sum_{x_n} p(X_n, X_q, X_e)}{p(X_e)},    (3.10)

where X_n ≡ (X \ X_e) \ X_q is the subset of non-observed variables. Another interesting inference task is the total abduction problem. It consists of finding the most probable configuration of a set of variables of interest, X_q, given the evidence X_e. Similarly, partial abduction deals with the problem of finding the values of only a subset of the query variables X_q ⊂ X which yield the maximum posterior probability given the observed values of the evidence variables X_e ⊂ X \ X_q:

\hat{x}_q = \arg\max_{x_q} p(X_q | X_e).    (3.11)

Although the process of probabilistic inference is proved to be NP-hard in some scenarios [97], the inference task is tractable for many real-world problems. The methods proposed for probabilistic inference can be divided into exact and approximate methods [192]. Many exact inference methods have been developed in the literature, such as [110, 255, 410, 429]. In contrast, approximate inference methods have been proposed for cases where the exact inference methods are not suitable [78, 205, 236, 446].
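For small discrete networks, Equation (3.10) can be evaluated by brute-force enumeration; a sketch for the network of Figure 3.1, reusing the joint() helper from Section 3.2.1 (this naive summation is exponential in the number of variables, which is precisely why the exact and approximate methods cited above exist):

    from itertools import product

    def posterior_x3_given_x5(x5_obs):
        # p(X3 | X5 = x5_obs): sum the joint over all non-observed variables
        num = [0.0, 0.0]
        for x1, x2, x3, x4 in product([0, 1], repeat=4):
            num[x3] += joint(x1, x2, x3, x4, x5_obs)
        z = sum(num)                    # the evidence probability p(X5 = x5_obs)
        return [p / z for p in num]

    print(posterior_x3_given_x5(1))    # posterior distribution of X3 given X5 = 1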

3.5 Bayesian networks classifiers

The use of Bayesian network structures in supervised learning problems gives rise to Bayesian classifiers [43, 163]. These classifiers model the joint probability distribution over the predictive variables X = {X_1, X_2, ..., X_n} and the class variable C. To classify an instance x, the Bayes rule [37] is used to compute the posterior probability of each class label given the values of the predictive variables. The class with the maximum posterior probability is selected as the class label for the given instance. That is,

\hat{c} = \arg\max_{c} p(c | x) = \arg\max_{c} p(c) p(x | c).    (3.12)

There exists a wide spectrum of Bayesian classifiers. A brief introduction to the most important methods belonging to this classification approach is given below.

Naive Bayes Naive Bayes [325] is one of the simplest models for supervised classification. It is one of the most efficient and effective inductive learning algorithms for machine learning. Figure 3.3(a) represents the naive Bayes structure. The class variable, C, is discrete and takes values in the set Ω(C). The predictive features can be divided into two sets: the set of discrete features {X1,...,Xm} and the set of continuous features {Xm+1,...,Xn}. This classifier is based on Bayes theorem under the assumption of conditional independence of predictive features given the class variable. The naive Bayes classifier selects the most likely class value cˆ of the posterior distribution: 34 CHAPTER 3. PROBABILISTIC GRAPHICAL MODELS

\hat{c} = \arg\max_{c \in \Omega(C)} p(c) \prod_{i=1}^{m} p(x_i | c) \prod_{j=m+1}^{n} N(x_j; \mu_j^c, (\sigma_j^c)^2).    (3.13)

The above conditional independence assumption can be restrictive in several real-world problems, so the following Bayesian classifiers have been developed to alleviate this assumption.
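Before moving on, a compact scoring function following Equation (3.13) is sketched below for one discrete and one continuous feature; the parameter values are illustrative and would normally be estimated from data as in Section 3.3.2:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

    def predict(x_disc, x_cont, priors, cpts, means, stds):
        # Eq. (3.13): prior times discrete CPT entries times Gaussian densities
        scores = {}
        for c in priors:
            s = priors[c]
            for i, v in enumerate(x_disc):
                s *= cpts[c][i][v]
            for j, v in enumerate(x_cont):
                s *= gaussian_pdf(v, means[c][j], stds[c][j])
            scores[c] = s
        return max(scores, key=scores.get)

    priors = {0: 0.6, 1: 0.4}
    cpts = {0: [{0: 0.7, 1: 0.3}], 1: [{0: 0.2, 1: 0.8}]}     # one discrete feature
    means, stds = {0: [1.0], 1: [3.0]}, {0: [1.0], 1: [1.0]}  # one continuous feature
    print(predict([1], [2.5], priors, cpts, means, stds))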

Selective naive Bayes Selective naive Bayes [275] is a variant of naive Bayes that deals with correlated variables by selecting only a subset of the given variables for the final classifier. Figure 3.3(b) represents the selective naive Bayes structure, where X3 is excluded from the final model. This classifier improves accuracy in domains with redundant and irrelevant variables. The learning component adds to the original naive Bayes classifier the capability to exclude attributes that introduce dependencies. This greedy process consists of searching the space of attribute subsets. The direction of the search can be forward or backward. A forward selection method would start with the empty set and successively add variables, whereas a backward elimination process would begin with the full set and remove unwanted variables. The search process stops adding or eliminating attributes when none of the alternatives improves classification accuracy.

Seminaive Bayes Seminaive Bayes [260, 354] considers statistical relationships between variables in order to join them into new multidimensional ones that smooth the independence assumption of naive Bayes. Figure 3.3(c) represents the seminaive Bayes structure, where the new multidimensional variable is composed of the Cartesian product of X2 and X3.

Figure 3.3: Examples of Bayesian classifier structures: (a) naive Bayes, (b) selective naive Bayes, (c) seminaive Bayes, (d) tree augmented naive Bayes

Kononenko [260] relaxed the independence assumption by a restricted structure learning. His algorithm partitions the variables into disjoint groups and assumes independence only between variables of different groups. Pazzani [354] performed feature selection and feature joining to improve naive Bayes. Specifically, he used forward sequential selection and joining (FSSJ) and backward sequential elimination and joining (BSEJ) to search for dependencies and join the variables.

Tree augmented naive Bayes The tree augmented naive Bayes classifier [163] allows relationships between pairs of predictive variables in the network. It builds a dependence tree structure among the variables and then connects all the predictive variables with the class variable. Figure 3.3(d) represents the tree augmented naive Bayes structure, where the root node of the tree is X1. Unlike in the naive Bayes classifier, each predictive variable (except for the root node of the variable tree) has one additional predictive variable as a parent. Similarly, the k-dependence Bayesian classifier [391] also builds a dependence structure among the variables. In this case, it allows each predictive variable to have a maximum number of k parent variables. Both classifiers use the mutual information conditioned on the class variable to decide which edges are included and in which order. Its value is computed through the following expression:

I(X, Y | C) = \sum_{i=1}^{|\Omega(X)|} \sum_{j=1}^{|\Omega(Y)|} \sum_{k=1}^{|\Omega(C)|} p(x_i, y_j, c_k) \log \frac{p(x_i, y_j | c_k)}{p(x_i | c_k) p(y_j | c_k)},    (3.14)

where X and Y are discrete predictive variables and C is the class variable.

Chapter 4

Scientometrics

4.1 Introduction

Scientometrics is the study of science, technology, and innovation from a quantitative perspective. This field has grown in popularity during the last years and is used to describe the growth, structure, trends, interrelationships and productivity of science. Scientometrics represents the multiple facets of scientific activity in models of use to policy makers, using quantitative tools with sound properties [491]. It uses the published works of scientists to answer the questions of policy makers, stakeholders and scientists themselves, among others, taking research and science as a research object. Major research issues include the measurement of impact, reference sets of articles to investigate the impact of journals and institutes, the understanding of scientific citations, the mapping of scientific fields and the production of indicators for use in policy and management contexts [289].

Scientometrics has typically been defined as the “quantitative study of science and technology” [452]. Previously, Brusilovsky defined this field as the “study of the measurement of scientific and technological progress” [66], whereas Hess defined it as the “quantitative study of science, communication in science, and science policy” [206]. Many other definitions not covered here have also been proposed in the literature [65, 435, 465].

Modern scientometrics is mostly based on the work of Derek de Solla Price and Eugene Garfield. The 1960s and ’70s saw the development of scientometrics as an operational activity for providing a response to the pressing demand for the measuring of science. The historian Derek de Solla Price published a number of books and articles which laid the foundations for the newly emerging field of quantitative science studies [118, 119, 120], culminating in a full-fledged research program [121]. In contrast, Eugene Garfield created the Science Citation Index [170] and founded the Institute for Scientific Information, which is heavily used for scientometric analysis [95]. Other founding fathers of the discipline were Narin [343] in the United States of America, Nalimov and Mulchenko [342] in Russia, and Braun and Bujdoso [59] in Hungary. The origin of the term scientometrics goes back to the year 1969, when Nalimov and Mulchenko [342] coined the Russian equivalent of the term scientometrics (“naukometriya”).


The term had gained wide recognition by the foundation in 1978 of the journal Scientometrics by Tibor Braun in Hungary. It was the launching of the journal Scientometrics that persuaded all those concerned that a self-contained research field under this name really exists. The journal soon became the leading information channel of the field.

Scientometrics is related to, and has similar interests with, bibliometrics, which is the quantitative analysis of media in any written form. Although some works, like Campbell [77], Cole and Eales [94] and Hulme [217], are supposed to be the first bibliometric studies, the coining of the term bibliometrics is frequently credited to Pritchard [371] in 1969, who broadly defined the new bibliometrics to be “the application of mathematical and statistical methods to books and other media of communication”. There are many other definitions of the term bibliometrics in the literature [71, 145, 200, 253, 254] which are not discussed here. There has been considerable confusion in the terminology of the two closely related metric terms [211]. The focus of bibliometrics has always been preponderantly on the literature per se of science, whereas scientometrics is not only focused on measuring the literature output (papers, books, patents, etc.) but also on analyzing the practices of researchers, the socio-organizational structures, research and development management, the role of science and technology in the national economy, governmental policies towards science and technology, and so on.

Chapter outline

Section 4.2 introduces how citation analysis is used in scientometrics. The well-known h-index, its advantages and its drawbacks are presented in Section 4.3. According to the improvements of the h-index, Section 4.4 lists measures that complement the h-index, take time into account, allow for co-authorship and consider other variables. Regarding journal measures, Section 4.5 presents the well-known impact factor and other measures that assess the quality of citations and correct for differences among fields. Finally, Section 4.6 introduces the main features of bibliographic databases like Web of Science, Scopus and Google Scholar.

4.2 Citation analysis in research evaluation

Citation analysis is one of the most widely used methods of scientometrics. This method uses citations in scientific works to establish links to other works or other researchers, with the intention of analyzing the frequency, patterns, and graphs of citations in articles and books [174, 328]. Bibliometric measures have emerged from citation analysis to assess and compare the research activity of individual researchers according to their output. They constitute an objective method whose results are reproducible. The main advantage of these bibliometric measures is that they can summarize the scientific production of a researcher as a set of quantitative figures that permit rapid comparison. This can at the same time be a limitation, because it removes many details from the citation records. Nowadays, many funding agencies and promotion committees use bibliometric measures regularly as a decision-support tool to evaluate the impact of research projects and researchers alike. Also, bibliometric measures are commonly adopted for the purpose of allocating public funds. These measures essentially involve counting the number of times scientific papers are cited. They are based on the assumption that influential researchers and important papers will be cited more frequently than others. Thus, bibliometric measures are an increasingly important topic within the scientific community.

The process of evaluation of scientific research has become a central, difficult and lengthy process in the management and governance policies of national research systems [283]. Almost every research assessment decision (accepting research projects, contracting researchers, awarding scientific prizes, conceding grants and so on) depends to a great extent upon the scientific merits of the involved researchers. The most well-known method used for researcher assessment is peer review. This process involves some reviewers reading and discussing scientists' papers to determine the validity of the ideas and results, and their potential impact on the world of science. Peer assessment is undoubtedly the principal procedure for judging quality. However, it has been mooted that peer assessment and similar expert-based evaluations have serious shortcomings and disadvantages [96, 450]. The opinions of experts are linked to subjectivity and may have conflict-of-interest elements or be the result of unrelated factors and negative or positive biases. Although, if used properly, peer review is assumed to be the most reliable system, it is slow, expensive and unwieldy [93, 338, 393]. Other authors contest this appraisal [27, 195, 212]. This difference of opinion among authors has led to the development of bibliometric measures as a method for researcher assessment. These measures may contribute to the fairness of research evaluations by presenting objective and impartial information to a peer review that would otherwise depend more on the personal views and experiences of the scientists appointed as referees. Both types of methods have pros and cons, extensively discussed in the literature [213, 304, 334, 453], in terms of costs, execution times, limitations and objectiveness. It appears that these methods can coexist, but not always in an easy and synchronized fashion. Sometimes it appears that the two methods of evaluating research quality tend to contradict or oppose each other [245]. Several bibliometric measures have been developed in the literature (see the reviews [17, 138]).
An obvious measure is citation count, which quantifies the impact of the cited work [170, 173]. In spite of its simplicity, the main recognized disadvantages of this standard bibliometric measure are that it does not reflect the full impact of the scientific research and that it is extremely affected by a single highly cited paper [353]. Thus, the average citation rate [403] was proposed as a measure of the overall quality of the research output. Beyond these traditional measures, one of the most successful bibliometric measures was proposed by Jorge Hirsch and is called the h-index [207]. It quantifies the scientific output of a single researcher as a single-number criterion. It is a simple measure incorporating both the quantity and visibility of publications. Since its introduction, the h-index has received a lot of attention from other researchers. In the Web of Science, Hirsch's article has been cited more than 1,800 times (November, 2014).

4.3 The h-index

The h-index [207] is a single measure that combines papers (an aspect of quantity) and citations (an aspect of quality) to characterize researchers' scientific output. Considering a researcher's list of publications, ranked according to the number of citations received, the h-index is defined as the highest rank such that the first h publications each received at least h citations. Formally, Hirsch defined the h-index as follows: “A scientist has index h, if h of his or her Np papers have at least h citations each and the other (Np - h) papers have no more than h citations each”. According to the example of Table 4.1, the h-index value is 6. The papers on ranks 1, ..., h constitute the so-called h-core [384]. Although Hirsch defined the h-index in 2005, it appears that the h-index (of course not with this name) was defined some 35 years earlier by Arthur Stanley Eddington, in a communication from Eddington to Harold Jeffreys [131]. In order to record his cycling prowess, Eddington recorded the number n, defined as the highest number of days on which he had cycled n or more miles. This is nothing else than Hirsch's index in cycling terminology. A short paper published in Nature made the h-index known to many scientists [28].

Undoubtedly, the h-index synthetically aggregates two important aspects of the scientist's production: impact, represented by the number of citations per paper, and productivity, represented by the number of different papers. Hirsch originally suggested the h-index for application at the micro level, that is, as a measure to quantify the scientific output of a single researcher. It has been used at an even further micro level to assess single highly cited publications [401]. However, the h-index can also be used at the meso and macro levels by means of successive h-indices [367, 403]. Thus, the h-index can also be applied to journals [60, 403], research groups [454], institutions [332, 367], publishers [400] and countries [108]. Finally, the h-index has not only been used for scientific comparisons but also for detecting interesting hot topics and compounds in diverse research areas [30], and for predicting future achievements [208], among others.
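The definition translates directly into a few lines of code; the following sketch computes the h-index of the citation counts shown in Table 4.1 below:

    def h_index(citations):
        # Highest rank h such that the h most cited papers have >= h citations each
        cits = sorted(citations, reverse=True)
        h = 0
        for rank, c in enumerate(cits, start=1):
            if c >= rank:
                h = rank
            else:
                break
        return h

    cits = [28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0]   # citation counts of Table 4.1
    print(h_index(cits))                            # 6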

Table 4.1: Calculating bibliometric measures of a fictitious researcher

Features        Papers (by rank)
rank              1     2     3     4     5     6     7     8     9    10    11
cits             28    17    12    10     9     6     2     2     2     1     0
rank^2            1     4     9    16    25    36    49    64    81   100   121
Σ cits           28    45    57    67    76    82    84    86    88    89    89
year           2009  2008  2010  2010  2012  2011  2012  2013  2014  2013  2014
age               6     7     5     5     3     4     3     2     1     2     1
cits/age        4.6   2.4   2.4   2.0   3.0   1.5   0.6   1.0   2.0   0.5   0.0
authors           4     3     2     2     3     3     3     4     1     2     3
cits/authors    7.0   3.4   6.0   5.0   3.0   2.0   0.6   0.5   2.0   0.5   0.0
reff           0.25  0.58  1.08  1.58  1.91  2.25  2.58  2.83  3.83  4.33  4.66
position        1st   2nd   1st   1st   2nd   1st   2nd   3rd   1st   1st   2nd
S(a, d)        0.40  0.33  0.66  0.66  0.33  0.50  0.33  0.20  1.00  0.66  0.33

In his original proposal, Hirsch pointed out good properties of the h-index. For example, he stated that the proposed h-index measures the broad impact of an individual's work, avoids all of the disadvantages of traditional criteria (total number of papers, total number of citations, citations per paper, number of significant papers, etc.) and is easy to compute by ordering papers by the “times cited” field in the Web of Science, among others. Other works, like Costas and Bordons [100], Glanzel [180] and Vanclay [456], also pointed out the advantages of the h-index. Some of the most interesting properties of the h-index are summarized in the following:

- It combines citation impact with publication activity measures.

- It performs better than other single criteria commonly used in research evaluation.

- It can be applied to any level of aggregation, but it is mostly used at the individual scientist level.

- It measures durable performance, not only single peaks.

- It is not immediately affected by increasing publications alone.

- It is insensitive to a set of lowly cited papers.

- It is an objective indicator which may play an important role when making decisions about promotions, fund allocation and awarding prizes.

However, the h-index presents some disadvantages that have been pointed out in the literature [51, 136, 227, 251, 252, 310, 395, 448, 464, 476]. Different authors have tried to overcome those drawbacks by defining new indicators; see Section 4.4. Hirsch himself noted that there are inter-field differences in typical h-index values due to differences among fields in productivity and citation practices, so the h-index should not be used to compare scientists from different disciplines [207]. Moreover, Hirsch noted that there exist some technical limitations, such as the difficulty of obtaining the complete output of scientists with very common names. It is important to remark that these drawbacks are shared with almost any indicator that is based on citation counts. Also, the h-index depends on the duration of each scientist's career because the pool of publications and citations increases over time. In order to overcome this limitation, Hirsch presented the “m parameter”, which is the result of dividing the h-index by the scientific age of a scientist (number of years since the author's first publication). Other shortcomings of the h-index are presented in the following:

- It is insensitive to highly cited papers, the excess citations being completely ignored once papers are included in the h-core.

- It does not show decline in a scientist's career, since the number of citations might increase even if no new papers are published.

- It does not take into account in any way the number of coauthors of each paper.

- It does not take into account the distribution of the citations in the h-core.

- Due to its simple computation, there is a risk of indiscriminate use, such as relying only on it for the assessment of scientists.

- Its use could provoke changes in the publishing behavior of scientists, such as an artificial increase in the number of self-citations.

- It is suited for the micro level but at higher levels of aggregation there are more versatile indicators.

- The application of appropriate indicator sets instead of one single measure can provide a more adequate and multifaceted picture of reality.

4.4 Improvements of the h-index

Despite the good properties of the h-index, many authors have pointed out several drawbacks of the indicator (see Section 4.3). To overcome these drawbacks, many new bibliometric measures have been proposed in the literature. For clarity reasons, these measures are categorized into four groups: (i) measures that try to complement the h-index, (ii) measures that extend the h-index to take time into account, (iii) measures which analyze how to count multi-authored publications, and (iv) measures which take into account other variables. Their main properties are summarized in the following.

4.4.1 Bibliometric measures that complement the h-index

The h-index has been extended by many authors that have proposed new variations of the h-index that try to overcome its main drawbacks. For example, the g-index [134], the e-index [489] and the tapered h-index [20] have been proposed to take into account citations that are completely ignored by the h-index calculation.

g-index Since the h-index tends to underestimate the achievement of researchers that have a selective publication strategy, that is, researchers that do not publish a lot of documents but have a major international impact, the g-index [134, 135, 136] was proposed as being sensitive to the level of the highly cited papers. It is defined as the highest rank such that the cumulative sum of the number of citations received is greater than or equal to the square of this rank. According to Table 4.1, the g-index value is 9. It is the highest rank such that the top 9 papers have at least 9^2 = 81 citations (here 88 > 81); on rank 10 we have 89 < 10^2 = 100 citations. Unlike the h-index, the g-index takes into account the exact number of citations received by highly cited papers, favoring researchers with a selective publication strategy. A criticism of the g-index was raised observing that the highest rank could be larger than the total number of the author's publications. However, the greatest drawback of the g-index is that it may be greatly influenced by a unique very successful paper. Finally, although the g-index is better than the h-index in this sense, it is not a fully satisfactory solution.

e-index Although effective and simple, the h-index suffers from some drawbacks that limit its use in accurately and fairly comparing the scientific output of different researchers. Aimed at the same goal as the g-index, the e-index [489] was proposed to represent the excess citations that are completely ignored by the h-index calculation. As a mathematical formula, the e-index is defined as

e-index = \sum_{i=1}^{h} (cit_i - h),    (4.1)

where h is the value of the h-index, and cit_i is the number of citations of paper i. According to Table 4.1, the e-index value is (28-6)+(17-6)+(12-6)+(10-6)+(9-6)+(6-6) = 46.
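Both indices follow directly from the sorted citation counts; a short sketch on the Table 4.1 values, reusing the h_index helper from Section 4.3:

    def g_index(citations):
        # Highest rank g whose top-g papers jointly collect at least g^2 citations
        cits = sorted(citations, reverse=True)
        total, g = 0, 0
        for rank, c in enumerate(cits, start=1):
            total += c
            if total >= rank * rank:
                g = rank
        return g

    def e_index(citations):
        # Eq. (4.1): excess citations of the h-core beyond h citations each
        cits = sorted(citations, reverse=True)
        h = h_index(cits)
        return sum(c - h for c in cits[:h])

    cits = [28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0]
    print(g_index(cits), e_index(cits))   # 9 and 46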

tapered h-index The main idea of the tapered h-index (ht-index) [20] is to take into account all the citations, giving each of them a value equal to the inverse of the increment that it would suppose to increase the h-index by one unit. It is defined as

ht-index = \sum_{i=1}^{N_p} h_t(i), where h_t(i) = \begin{cases} \frac{cit_i}{2i - 1} & \text{if } cit_i \leq i \\ \frac{i}{2i - 1} + \sum_{j=i+1}^{cit_i} \frac{1}{2j - 1} & \text{if } cit_i > i \end{cases}    (4.2)

where N_p is the total number of publications and cit_i is the number of citations of paper i. According to the values of Table 4.1, the ht-index can be calculated as

i   cit_i   h_t(i)                                  i    cit_i   h_t(i)
1   28      1 + Σ_{j=2}^{28} 1/(2j-1) = 2.64        6    6       6/11 = 0.54
2   17      2/3 + Σ_{j=3}^{17} 1/(2j-1) = 1.73      7    2       2/13 = 0.15
3   12      3/5 + Σ_{j=4}^{12} 1/(2j-1) = 1.29      8    2       2/15 = 0.13
4   10      4/7 + Σ_{j=5}^{10} 1/(2j-1) = 1.02      9    2       2/17 = 0.11
5   9       5/9 + Σ_{j=6}^{9} 1/(2j-1) = 0.84       10   1       1/19 = 0.05
                                                    11   0       0/21 = 0.00

where the ht-index is 2.64+1.73+1.29+1.02+0.84+0.54+0.15+0.13+0.11+0.05+0.00=8.55.

Other indices, like the a-index [242], the m-index [55] and the r-index [243], have been developed to measure the citation intensity of the h-core papers.

a-index The average number of citations received by the articles included in the h-core is represented by the a-index [242]. This index measures the citation intensity of the h-core papers; however, it can be very sensitive to just a few papers receiving high citation counts. It achieves the same goal as the g-index, namely correcting for the fact that the original h-index does not take into account the exact number of citations of articles included in the h-core.

Mathematically, it is defined as:

a-index = \frac{1}{h} \sum_{i=1}^{h} cit_i    (4.3)

where h is the h-index value and cit_i is the number of citations of paper i. According to Table 4.1, it is observed that the a-index value is (28+17+12+10+9+6)/6 = 13.66.

m-index
As the distribution of citation counts is usually skewed, the median and not the arithmetic mean should be used as the measure of central tendency. Therefore, a new index, called the m-index [55], is proposed as a variation on the a-index. The m-index, which was designed to illustrate the impact of the papers in the h-core, is the median number of citations received by the h most visible papers. According to Table 4.1, it is observed that the m-index value is 11.

r-index
In order to overcome some problems related to the a-index, a new measure, called the r-index [243], is proposed. Unlike the a-index, which involves a division by the h-index, the r-index does not punish researchers for having a higher h-index value. Instead of dividing by the h-index, the r-index takes the square root of the sum of citations in the h-core to calculate the citation intensity of the h most visible papers. Like the a-index, the r-index can also be very sensitive to just a very few papers receiving extremely high citation counts. As a mathematical formula, the r-index is defined as:

r-index = \sqrt{\sum_{i=1}^{h} cit_i}    (4.4)

where h is the h-index value and cit_i is the number of citations of paper i. According to Table 4.1, it is observed that the r-index value is \sqrt{28+17+12+10+9+6} = 9.06. Finally, the r-index can also be computed as \sqrt{a \cdot h}, where a and h are the a- and h-index values.
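All three h-core intensity measures can be obtained in a few lines. A small sketch over the Table 4.1 data (h = 6, so the h-core is the six most cited papers):

```python
import statistics

cits = sorted([28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0], reverse=True)
# h-index: count ranks i whose paper still has >= i citations
h = sum(1 for i, c in enumerate(cits, 1) if c >= i)   # 6
core = cits[:h]                                       # citations of the h-core

a_index = sum(core) / h               # mean of the h-core citations: 13.66...
m_index = statistics.median(core)     # median of the h-core citations: 11
r_index = sum(core) ** 0.5            # sqrt of the h-core citation sum: 9.06
print(a_index, m_index, r_index)      # note: r_index == (a_index * h) ** 0.5
```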

This section also introduces some measures which are useful to distinguish individuals with the same h-index value. The rational h-index [386], the multidimensional h-index [168] and the individual annual h-index [197] are described in the following.

rational h-index
As an extension of the original h-index, the rational h-index [386] is proposed to take into account the number of citations needed to increase the h-index by one unit. It measures the distance to the next value of the h-index. Mathematically, this is

rational h-index = h + 1 - \frac{n}{2h + 1}    (4.5)

where h is the value of the h-index and n is the number of citations needed to increase the h-index by one unit. In this context, the rational h-index value can be calculated as

rational h-index = 6 + 1 - \frac{6}{13} = 6.53.

multidimensional h-index
For the purpose of distinguishing among individuals with the same h-index value, the multidimensional h-index [168] is proposed. It uses the same data as the original h-index and provides additional information under the same principles. The multidimensional h-index is composed of a set of components in which the conventional h-index value is only the first component. Additional components are obtained by computing the h-index for the subset of papers not considered in the immediately preceding component. This process iterates over subsequent h-index values until all cited papers are analyzed. According to Table 4.1, the multidimensional h-index is formed by the set (6, 2, 1, 1), as the following table shows. This extension is useful in fields where h-index values are generally low.

 i   cit_i  |  i   cit_i  |  i   cit_i  |  i   cit_i
 1   28     |             |             |
 2   17     |             |             |
 3   12     |             |             |
 4   10     |             |             |
 5   9      |             |             |
 6   6      |             |             |
 7   2      |  1   2      |             |
 8   2      |  2   2      |             |
 9   2      |  3   2      |  1   2      |
 10  1      |  4   1      |  2   1      |  1   1
 11  0      |  5   0      |  3   0      |  2   0

 1st h-index: 6  |  2nd h-index: 2  |  3rd h-index: 1  |  4th h-index: 1
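The iteration described above, repeatedly computing the h-index and removing the h-core, can be sketched as follows (illustrative code, Table 4.1 data):

```python
def h_index(citations):
    cits = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(cits, 1) if c >= i)

def multidimensional_h(citations):
    """Compute successive h-index components until no cited papers remain."""
    cits = sorted(citations, reverse=True)
    components = []
    while any(c > 0 for c in cits):
        h = h_index(cits)   # h >= 1 whenever at least one cited paper remains
        components.append(h)
        cits = cits[h:]     # drop the current h-core and iterate
    return components

print(multidimensional_h([28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0]))  # [6, 2, 1, 1]
```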

individual annual h-index
This index [197] is proposed to represent the average annual increase in the academic's individual h-index. It provides a more reliable comparison between academics in different disciplines and at different career stages than the original h-index.

In order to provide a more balanced view of scientific production, some measures like the hg-index [18] and the q2-index [74] are proposed.

hg-index
With the intention of keeping the advantages of both the h-index and the g-index as well as minimizing their disadvantages, the hg-index [18] was developed. Both measures capture interesting properties of a researcher's publications, and both should be taken into account to measure the scientific output of scientists. Therefore, the hg-index is a combined index which characterizes the scientific output of researchers. It is computed as the geometric mean of the h-index and g-index:

hg-index = \sqrt{h \cdot g}    (4.6)

where h is the h-index value and g is the g-index value. According to Table 4.1, it is observed that the hg-index value is \sqrt{6 \cdot 9} = 7.35.

q2-index
In order to provide a more global view of scientific production, the q2-index [74] is developed. It is based on the geometric mean of the h-index, describing the number of papers (quantitative dimension), and the m-index, depicting the impact of the papers (qualitative dimension). As a mathematical formula, the q2-index is defined as:

q2-index = \sqrt{h \cdot m}    (4.7)

where h is the h-index value and m is the m-index value. According to the example of Table 4.1, the q2-index value is \sqrt{6 \cdot 11} = 8.12.
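Both combined indices are one-line computations once their components are known. A minimal sketch, using the index values derived from Table 4.1 in the text (h = 6, g = 9, m = 11):

```python
h, g, m = 6, 9, 11                 # h-, g- and m-index of the running example
hg_index = (h * g) ** 0.5          # geometric mean of h and g
q2_index = (h * m) ** 0.5          # geometric mean of h and m
print(round(hg_index, 2), round(q2_index, 2))  # 7.35 8.12
```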

Finally, other indices, which are defined in a similar way to the h-index, are also proposed.

The h(2)-index [262], the wu-index [480] and the hα-index [445] are examples belonging to this category.

h(2)-index
A scientist's h(2)-index [262] is defined as the highest natural number such that his or her h(2) most-cited papers each received at least h(2)^2 citations. Its advantage over the original h-index as an index to characterize the scientific output of an individual is that less work is required to verify the authorship of the relevant papers. In this context, it is observed that the h(2)-index value is 3. It is the highest rank such that the top 3 papers each have at least 3^2 = 9 citations (here 12 > 9).

wu-index
The wu-index [480] assesses the substantial impact of a researcher's work. It is defined in a similar way to the h-index but focuses only on excellent papers (or highly cited papers). It is expressed as follows: if wu of a researcher's papers have at least 10·wu citations each and the other papers have fewer than 10·(wu + 1) citations, that researcher's wu-index is wu. Using the above definition, the wu-index value is 1 according to Table 4.1. It is the highest rank such that the top paper has at least 10·1 citations.

hα-index
A new measure of scientific performance is proposed to generalize the h-index. This new measure depends on a parameter α and is therefore referred to as the hα-index [445]. It is expressed as follows: a researcher has hα-index hα if hα of his or her articles received at least hα · α citations each, and the rest of the articles received no more than hα · α citations. Clearly, for α = 1 the hα-index reduces to the h-index. Furthermore, for α = 10 the hα-index reduces to the wu-index.
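Since the hα-index subsumes both the h-index and the wu-index, one parameterized function covers all three; the h(2)-index only differs in using a quadratic threshold. A sketch over the Table 4.1 data:

```python
def h_alpha(citations, alpha):
    """Highest rank k such that the top k papers each have >= alpha * k citations."""
    cits = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(cits, 1) if c >= alpha * i)

def h2_index(citations):
    """Highest rank k such that the top k papers each have >= k^2 citations."""
    cits = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(cits, 1) if c >= i * i)

cits = [28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0]
print(h_alpha(cits, 1))   # 6  (ordinary h-index)
print(h_alpha(cits, 10))  # 1  (wu-index)
print(h2_index(cits))     # 3  (h(2)-index)
```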

4.4.2 Bibliometric measures that take time into account

The original h-index does not take into account the age of an article. It may be the case that a scientist contributed a number of significant articles that produced a large h-index but is now rather inactive or retired. Therefore, senior scientists who keep contributing nowadays, or brilliant young scientists who are expected to contribute a large number of significant works in the near future but currently have only a small number of important articles due to the time constraint, are not distinguished by the original h-index. Thus, the need arises to define some measures that account for these facts, among others.

ar-index
As an adaptation of the r-index, the ar-index [244] is proposed to take into account not only the citation intensity in the Hirsch core but also the age of the publications in the h-core. Therefore, the ar-index can not only increase but also decrease over time. It is defined as the square root of the sum of the average number of citations per year of the articles included in the h-core. Mathematically, it is defined as:

ar-index = \sqrt{\sum_{i=1}^{h} \frac{cit_i}{age_i}}    (4.8)

where h is the h-index value, cit_i is the number of citations of paper i, and age_i is the age of paper i. Table 4.1 shows that the ar-index value is \sqrt{4.6+2.4+2.4+2.0+3.0+1.5} = 3.99.

contemporary h-index
The original h-index cannot distinguish between inactive scientists, young scientists and senior scientists who are still contributing nowadays. For this reason, the contemporary h-index [415] is defined to take into account the age of papers. In this way, the value of old papers gradually declines, even if they still receive citations. The contemporary h-index is expressed as follows: "A researcher has contemporary h-index h^c if h^c of its N_p articles get a score of S_i^c \ge h^c each, and the rest (N_p - h^c) articles get a score of S_i^c \le h^c", where S_i^c is defined as

S_i^c = \gamma \cdot age_i^{-\delta} \cdot cit_i    (4.9)

where age_i is the age of paper i, cit_i is the number of citations received by paper i, and γ and δ are arbitrary parameters. Fixing γ and δ equal to 1, the above score S_i^c is calculated as cit_i/age_i. Therefore, ranking S_i^c in descending order, the contemporary h-index value is 2 (see Table 4.1) because it is the highest rank such that the top 2 papers have at least 2 citations per year.

trend h-index
The h-index does not take into account the year when an article acquired a particular citation, that is, the age of each citation. It cannot identify scientists whose work is considered pioneering and sets out a new line of research that is currently hot. Thus, the trend h-index [415] is defined for the above purpose. Unlike the contemporary h-index, the trend h-index assigns to each citation of an article an exponentially decaying weight, expressed as a function of the age of the citation. It is expressed as follows: "A researcher has trend h-index h^t if h^t of its N_p articles get a score of S_i^t \ge h^t each, and the rest (N_p - h^t) articles get a score of S_i^t \le h^t", where S_i^t is defined as

S_i^t = \gamma \sum_{\forall x \in cit_i} age_x^{-\delta}    (4.10)

where cit_i is the number of citations received by paper i, age_x is the age of citation x, and γ and δ are arbitrary parameters.

age decaying h-index
In order to take into account both the age of the scientist's articles and the age of each citation to an article, the age decaying h-index [247] is proposed. It is a generalization of both the contemporary h-index and the trend h-index and is expressed as follows: "A researcher has age decaying h-index h^{ad} if h^{ad} of its N_p articles get a score of S_i^{ad} \ge h^{ad} each, and the rest (N_p - h^{ad}) articles get a score of S_i^{ad} \le h^{ad}", where S_i^{ad} is defined as

S_i^{ad} = \gamma^2 \cdot age_i^{-\delta_1} \cdot \sum_{\forall x \in cit_i} age_x^{-\delta_2}    (4.11)

where age_i is the age of paper i, cit_i is the number of citations received by paper i, age_x is the age of citation x, and γ, δ1 and δ2 are arbitrary parameters.

f-index
As remarked before, the h-index does not take into account the time-width of the scientific research. Some variants, like the contemporary h-index and the age decaying h-index, among others, give less importance to older articles by introducing an age-related weighting system. However, these attempts complicate the calculation of a final synthetic indicator and introduce weights and parameters which can be questionable. In contrast, the f-index [160] takes into account the age of the publications in a simpler way. It computes the time-range of the papers with at least one citation (adding 1 to consider the time spent to prepare the first paper included in the set):

f-index = range_{i \in \Omega}(y_1, y_2, \ldots, y_i, \ldots) + 1    (4.12)

where y_i is the publication year of paper i and Ω is the set of publications which have been cited at least once. According to Table 4.1, the f-index value is range(2009, 2008, 2010, 2010, 2012, 2011, 2012, 2013, 2014) + 1 = 7. One of its main characteristics is that it does not compromise the original simplicity and immediacy of understanding of the h-index.

hpd-index
With the intention of comparing the scientific output of researchers of different ages, the hpd-index [263] was developed. To do so, the average number of citations per decade is defined as a measure of success of a scientific paper. The hpd-index is defined in the following way: "A scientist has index hpd if hpd of his/her papers have at least hpd citations per decade each, and his/her other papers have less than hpd + 1 citations per decade each".

h-index sequences
Let the career period of a researcher be described by time t = t_1, t_2, ..., t_n, where t_1 denotes the year of the first publication and so on, until t_n, the final year of the career or the present year. Then, the h-index sequence [290], which is proposed to quantify the progress of scientists' careers, is constructed as follows. If only the papers of publication year t_n and the citations obtained in that same year are considered, the first h-index of the sequence, denoted as h_1, can be derived. Next, the years t_n and t_{n-1} are taken together with the citations obtained in that period, yielding the next h-index, denoted as h_2. The process continues until the year t_1 is reached. The last h-index of the sequence, denoted as h_n, is obtained by considering all years t_1, ..., t_n and taking into account all publications and the citations to these publications in this period. Finally, the h-index sequence h_1, h_2, ..., h_n gives a dynamic description of the visibility of the researcher's career. While the original h-index sequence is calculated using time in the reverse way (in the direction of the past), it seems more logical to use time in the forward direction (in the direction of the present). This approach [70, 137] leads to real career h-index sequences.
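The two age-weighted measures discussed first are easy to sketch. In the code below the paper ages are illustrative assumptions chosen to be consistent with the worked example in the text (they reproduce the h-core citations-per-year ratios 4.6, 2.4, 2.4, 2.0, 3.0 and 1.5), and γ = δ = 1 as in the text:

```python
cits = [28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0]
ages = [6, 7, 5, 5, 3, 4, 3, 2, 1, 2, 1]   # assumed ages in years, h-core first

h = 6  # h-index of the running example

# ar-index, Eq. (4.8): sqrt of the citations-per-year sum over the h-core.
ar_index = sum(c / a for c, a in zip(cits[:h], ages[:h])) ** 0.5

# Contemporary h-index with gamma = delta = 1: score each paper by
# citations per year, rank the scores, and take an h-index over them.
scores = sorted((c / a for c, a in zip(cits, ages)), reverse=True)
contemporary_h = sum(1 for i, s in enumerate(scores, 1) if s >= i)

# ar_index is about 4.0 (the text reports 3.99 from rounded ratios);
# contemporary_h is 2, matching the text.
print(round(ar_index, 2), contemporary_h)
```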

4.4.3 Bibliometric measures that allow for co-authorship

The problem of how to count multi-authored publications has been discussed for a long time [293]. Several methods for accrediting publications to multiple authors have been developed in the literature [95, 122, 123, 139, 447, 493]. A short overview of some scoring methods follows; a small code sketch after the list illustrates them.

- Total counting: Each of the n authors receives one credit. This counting method is also called normal, or standard counting.

- First-author counting: Only the first of the n authors of a paper receives a credit equal to one. The other authors do not receive any credit.

- Fractional counting: Each of the n authors receives a score equal to 1/n. This counting method is sometimes called adjusted counting.

- Proportional counting: If an author has rank r in the author list of an article with n collaborators, then the author receives a score of n + 1 - r. This score can be normalized in such a way that the total score of all authors is equal to 1. In this normalized version the score is \frac{2}{n}\left(1 - \frac{r}{n+1}\right).

- Geometric counting: If an author has rank r in an article with n collaborators, then the author receives a credit of 2^{n-r}. In its normalized version this score becomes \frac{2^{n-r}}{2^n - 1}.

- Noblesse oblige: In this approach it is assumed that the most important author closes the list. This author receives a credit of 0.5, while the other n - 1 authors receive a credit of \frac{1}{2(n-1)} each.
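The scoring rules above can be written as functions of the byline length n and the author's (1-indexed) rank r. An illustrative sketch; for the normalized rules, the scores of all authors of one paper sum to one:

```python
def total_counting(n, r):    return 1.0
def first_author(n, r):      return 1.0 if r == 1 else 0.0
def fractional(n, r):        return 1.0 / n
def proportional(n, r):      return (2.0 / n) * (1 - r / (n + 1))  # normalized
def geometric(n, r):         return 2 ** (n - r) / (2 ** n - 1)    # normalized
def noblesse_oblige(n, r):   return 0.5 if r == n else 0.5 / (n - 1)

# Check the normalization for a paper with n = 3 authors:
for rule in (fractional, proportional, geometric, noblesse_oblige):
    scores = [round(rule(3, r), 3) for r in (1, 2, 3)]
    print(rule.__name__, scores)   # each list sums to 1.0
```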

It has been shown that the number of citations a paper receives can be influenced by the number of authors, since the greater the number of authors, the greater the number of self-citations [181]. Perhaps the most important shortcoming of the h-index is that it does not take into account in any way the number of coauthors of each paper. In the following, some well-known indices which account for the co-authorship effect are detailed. Other indices not covered in this section are the golden productivity index [24], the adapted pure h-index [82], the h̄-index [209], the fractional p-index [369] and the gm-index [399], among others.

hi-index
Since the h-index can be inflated if a scientist has written many co-authored articles, the hi-index [36] is proposed to take collaboration into account. It estimates the number of papers that a researcher would have written throughout his or her career, with at least hi citations, if he or she had worked alone. The rationale behind it is to measure the effective individual average productivity. The hi-index simply divides the h-index value by the average number of researchers in the publications of the h-core. Mathematically, it is defined as

hi-index = \frac{h}{N_a} = \frac{h^2}{N_a^T}    (4.13)

where h is the value of the h-index, N_a is the mean number of authors in the h-core, and N_a^T is the total number of authors in the h-core. According to Table 4.1, the hi-index value is 6^2/(4+3+2+2+3+3) = 2.11. The authors used the mean number of authors of the papers in the h-core as the factor with which to rationalize the h-index, obtaining a fractional value that accounts for multiple authorship. The average is sensitive to extreme values, so the normalization with the mean number of authors disfavors people with some papers with a large number of co-authors. As an alternative, they propose dividing by the median number of researchers.

hf-index
This index takes the co-authorship effect into account by dividing the number of citations by the number of authors for each paper. The hf-index [397] is expressed as follows: a researcher has hf-index hf if hf of his or her articles get a citations-per-author ratio at least equal to hf. However, this has the disadvantage that, to determine the hf-index, the publications have to be rearranged into a new order according to this ratio. According to Table 4.1, the hf-index value is 3 because it is the highest rank such that the top 3 papers each have at least 3 citations per author. This index leads to the strange effect that highly cited papers may not contribute to the index because they have a large number of authors, so that they drop out of the core after the rearrangement.

hm-index
A new index, called the hm-index [398], aims to soften the influence of the number of co-authors on a researcher's publications. To do so, it counts the papers fractionally according to the number of authors. This yields an effective rank r_eff which is used to define the hm-index as the effective number of papers that have been cited hm or more times. Mathematically, it is defined as

hm-index = \max_r \left( r_{eff}(r) \le cit(r) \right)    (4.14)

where r_{eff}(r) = \sum_{i=1}^{r} 1/a_i, a_i is the number of authors of paper i, and cit(r) is the number of citations of paper r. According to Table 4.1, it is observed that the hm-index value is 6. It is the highest rank such that the 6th paper has at least r_eff citations (here 6 > 2.25), while at rank 7 we have 2 < 2.58.

pure h-index
This index takes into account the actual number of co-authors and the relative position of an author in the byline. A normalized score, like first-author counting, proportional counting or geometric counting, among others, needs to be defined to compute the pure h-index [470]:

pure h-index = h \cdot \sqrt{\frac{h}{\sum_{d=1}^{h} S(a,d)^{-1}}}    (4.15)

where h is the h-index value of author a, and S(a,d) denotes the normalized score of author a in document d. The term normalized refers to the fact that the sum of all scores of one document must be one. Fixing proportional counting as an example of normalized score, the pure h-index value is

pure h-index = 6 \cdot \sqrt{\frac{6}{0.40^{-1} + 0.33^{-1} + 0.66^{-1} + 0.66^{-1} + 0.33^{-1} + 0.50^{-1}}} = 3.98
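Both co-authorship corrections can be sketched briefly. The author counts of the six h-core papers (4, 3, 2, 2, 3, 3) come from the hi-index example; the counts for the remaining papers and the per-paper normalized scores S(a,d) are assumed for illustration (the latter are the values used in the worked example above):

```python
cits    = [28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0]
authors = [4, 3, 2, 2, 3, 3, 3, 2, 2, 1, 1]   # counts beyond rank 6 are assumed

# hm-index, Eq. (4.14): largest rank r whose fractional rank r_eff(r)
# does not exceed the citations of the paper at rank r.
r_eff, hm = 0.0, 0
for r, (c, a) in enumerate(zip(cits, authors), 1):
    r_eff += 1.0 / a
    if r_eff <= c:
        hm = r
print(hm)  # 6, matching the text (r_eff(6) = 2.25 <= 6, r_eff(7) = 2.58 > 2)

# pure h-index, Eq. (4.15), with the normalized scores from the text.
h = 6
scores = [0.40, 0.33, 0.66, 0.66, 0.33, 0.50]
pure_h = h * (h / sum(1 / s for s in scores)) ** 0.5
print(round(pure_h, 2))  # about 3.99 (reported as 3.98 in the text)
```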

4.4.4 Bibliometric measures that consider other variables

This section reviews some measures which consider other aspects, like the number of references, the number of different citers, the citation speed, field normalization, and the citation distribution, among others.

creativity index
This index is based on the creation of new scientific knowledge. The creativity index [423] tries to highlight papers that receive many citations and have few references. After splitting the creativity of each paper among its authors, the cumulative creativity of an author is proposed as an indicator of his or her research merit. Mathematically, it is defined as

c-index = \sum_{i=1}^{N_p} \frac{c(n_i, m_i)}{a_i},    (4.16)

where c(n_i, m_i) \simeq m_i - n_i + \frac{n_i}{A e^{az} + B e^{bz}}, N_p is the total number of papers, n_i is the number of references of paper i, m_i is the number of citations of paper i, a_i is the number of authors of paper i, z = (m_i - 1)/(n_i + 5), and A, B, a, b are arbitrary parameters.

ch-index
The proposed ch-index [15] is defined as the number such that, for a general group of scientific publications, ch publications are cited by at least ch different citers, while the other publications are cited by no more than ch different citers. According to the definition of the ch-index, if the same citing author cites a publication more than once, the author is counted only once. The most important benefit of the ch-index is that it is insensitive to self-citations and citations made by recurrent citers.

citation speed index
The scientific impact of a publication can be determined not only by the number of times it is cited but also by the speed with which its content is noted by the scientific community. In this context, the citation speed index [53] is proposed. Its calculation is based on the number of months that have elapsed since the first citation. The citation speed index is defined as follows: a group of papers has citation speed index s if for s of its N_p papers the first citation was at least s months ago, and for the other (N_p - s) papers the first citation was ≤ s months ago.

success-index
The success-index [159] is defined as the number of papers with a number of citations greater than or equal to cit. This number cit is an estimate of the number of citations that a publication should potentially achieve in a certain scientific context and period of time. The most complicated operation when constructing the success-index is determining the cit value of each paper. To do this, there are different possible approaches, which are borrowed from the existing literature on field-normalized indicators. Because the success status of a specific paper is determined independently of the other papers of interest, the success-index can be applied to groups of papers from different disciplines.

percentile-based indicators
The b-index [56] is defined as the number of papers in the publication set of a scientist that, in their individual publication years, belong to the top 10% of most cited papers in a field. In contrast, the x-index [380] is formulated to estimate the level of research excellence and analyzes the numbers of papers in the top 1% and 0.1% of highly cited papers.

w-index
Unlike the h-index, which tends to cluster many researchers into the same index value, the w-index [479] leads to a somewhat finer ranking because its range can be up to twice the range of the h-index. The w-index is expressed as follows: "A w-index of at least k means that there are k distinct publications that have at least 1, 2, ..., k citations, respectively".
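The w-index is the only measure in this group whose inputs appear in the running example, so it can be sketched directly (illustrative code; for the Table 4.1 data it yields 8, compared with an h-index of 6):

```python
def w_index(citations):
    """Largest k such that k distinct papers have at least
    1, 2, ..., k citations, respectively."""
    cits = sorted(citations, reverse=True)
    w = 0
    for k in range(1, len(cits) + 1):
        # the (i+1)-th most cited paper must have at least k - i citations
        if all(c >= k - i for i, c in enumerate(cits[:k])):
            w = k
    return w

print(w_index([28, 17, 12, 10, 9, 6, 2, 2, 2, 1, 0]))  # 8
```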

4.5 Journal-based measures

Citation analysis is one of the most widely used bibliometric tools for ranking journals. Garfield proposed the impact factor, a citation-based measure, as a fundamental tool in journal evaluation [171]. It was the first journal citation index to be calculated for a large set of journals. It is calculated as the number of citations a journal receives in a given year to items published in the previous two years, divided by the number of articles published in the previous two years. Although the impact factor has been widely used to assess journal performance, it discards much of the useful information that is present in the full citation network. For example, the impact factor does not account for where citations come from, that is, citations from prestigious journals are worth no more than citations from lower-tier publications. The advantages and drawbacks of the impact factor are presented in the following. Also, other journal citation measures, which are proposed to overcome some limitations of the impact factor, are reviewed in this section.

4.5.1 Impact factor

Without any doubt, the impact factor [170] is the most prominent citation measure to evaluate the relative influence, importance or prestige of scholarly journals. It has assumed so much power that it is starting to control the scientific enterprise, playing a crucial role in hiring, tenure decisions and the awarding of grants. According to the Journal Citation Reports, the impact factor "is basically a ratio between citations and citable items published. Thus, the 2014 impact factor of journal X would be calculated by dividing the number of all the source journals' 2014 citations of articles journal X published in 2012 and 2013 by the total number of citable source items it published in 2012 and 2013". Thus, the impact factor is "a measure of the frequency with which the average cited article in a journal has been cited in a particular year" [171]. Mathematically, it is defined as

impact factor(v_i, t) = \frac{\sum_j c(v_j, v_i, t)}{n(v_i)}    (4.17)

where c(v_j, v_i, t) corresponds to the number of citations from journal v_j to journal v_i in year t. The number of publications published in journal v_i during the two years previous to t, denoted n(v_i), normalizes the resulting citation count, leading to a mean 2-year citation rate per article. The numerator of the impact factor considers the journal as a whole and includes any citations to the journal title [41]; thus it depends on accurate and complete aggregation of citations to a journal title. The denominator considers the journal as a collection of items that are likely to influence the scholarly literature by way of citation; thus only the citable items in the journal are included. The composition of the denominator [319] is based on analysis of the content of the journal and the bibliographic parameters of its published source items.

The Journal Citation Reports also regularly publishes other specific journal citation indicators, like the immediacy index and the cited half-life of journals [305]. The immediacy index provides the number of citations an item obtains in the year of publication itself. It is a measure of how quickly the average cited article in a particular journal is cited. In contrast, the cited half-life is the median age of the articles cited in the current year.

Among the traditional journal citation measures, the well-known impact factor has been regarded as the best instrument for the evaluation of the quality of scientific journals. It is usually used by researchers deciding where to publish and what to read, by editors and publishers as a means to evaluate and promote their journals, and by tenure and promotion committees laboring under the assumption that publication in a higher impact factor journal represents better work [149, 235, 473]. This widespread usage is obviously due to its simplicity. Despite its advantages, the impact factor has not been spared from criticism [10, 47, 127, 144, 146, 184, 331, 407, 449, 457, 485]. The main methodological points of criticism regarding the calculation of this index include the lack of assessment of the quality of citations, the inclusion of self-citations, and the poor comparability between different scientific fields.
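The calculation itself is a simple ratio once the aggregate counts are available. A minimal sketch, with entirely hypothetical counts for a fictitious journal X:

```python
# Hypothetical aggregates for journal X, following Eq. (4.17):
citations_2014_to_2012_2013 = 480  # 2014 citations to items published in 2012-13
citable_items_2012_2013 = 200      # citable items journal X published in 2012-13

impact_factor_2014 = citations_2014_to_2012_2013 / citable_items_2012_2013
print(impact_factor_2014)  # 2.4
```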

4.5.2 Bibliometric measures that assess the quality of citations

The metric used to quantify the importance of scientific publications has been largely based on integer counting of their citations. Although the number of citations gives a direct approximation of a journal's importance, some situations show that citations alone do not provide a full picture of the prestige of a journal. In this context, new measures are proposed which are concerned with who actually cited the journal and the prestige they have transferred to the cited journal. The common point in most of these new measures is the assessment of the quality of the citations received by a journal [363]. The quality of citations can be estimated by analyzing the networks of scientific papers with sophisticated mathematical algorithms. The PageRank algorithm [63] has been proposed as an appropriate model for the evaluation of the quality of citations in scientific journals. Some PageRank-inspired indicators have been introduced in the literature [42, 47, 88, 186, 191, 368]. In the following, the Eigenfactor metrics [42] and the Scimago Journal Rank indicators [186, 191] are detailed.
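Before turning to the concrete indicators, a minimal sketch of the PageRank-style idea they share: a journal's prestige is the stationary score of a random walk on the citation network, so citations from prestigious journals count for more. The 3x3 citation matrix below is made up for illustration and the damping factor 0.85 is the usual PageRank default; none of this reproduces the exact Eigenfactor or SJR algorithms.

```python
import numpy as np

# C[j, i] = number of citations from journal j to journal i (invented data).
C = np.array([[0, 5, 1],
              [3, 0, 2],
              [8, 4, 0]], dtype=float)

P = C / C.sum(axis=1, keepdims=True)   # row-normalize outgoing citations
d, n = 0.85, C.shape[0]                # damping factor, number of journals
scores = np.full(n, 1.0 / n)
for _ in range(100):                   # power iteration to the fixed point
    scores = (1 - d) / n + d * P.T @ scores
print(scores / scores.sum())           # relative prestige per journal
```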

Eigenfactor score This measure [42] represents the journal’s total importance to the scientific community. The Eigenfactor score weights journal citations by the influence of the citing journals. As a result, a journal is influential if it is cited by other influential journals. It ranks a journal according to the sum of normalized citations received from other journals weighted by the influence of the citing journals. It uses a target window of five years which allows, in general, a broader evaluation of journal citations, in particular for disciplines with longer cited lives. The Eigenfactor score is a size-dependent measure: with all else equal, bigger journals will have larger Eigenfactor scores, since they have more articles and hence we expect them to be cited more often. Finally, the Eigenfactor scores are scaled so that the sum of the scores of all journals is 100. This approach is thought to be more robust than the impact factor, which purely counts the number of citations without considering the significance of those citations.

Article Influence score It measures the average influence, per article, of the papers in a journal. It is calculated as the Eigenfactor score divided by the number of articles published by the journal over the five-year target period. The Article Influence scores are normalized and are comparable to the impact factor.

SJR indicator The Scimago Journal Rank [186] is a measure of scientific influence of scholarly journals that accounts for both the number of citations received by a journal and the importance or prestige of the journals where such citations come from. It is a size-independent indicator; its values order journals by their average prestige per article and can be used for journal comparisons in science evaluation processes. The SJR indicator assigns different values to citations depending on the importance of the journals where they come from, that is, citations coming from highly important journals will be more valuable and hence will provide more prestige to the journals receiving them. The calculation of the SJR indicator is very similar to the Eigenfactor score, with the former being based on the Scopus database and the latter on the Web of Science database.

SJR2 indicator This indicator [191] takes into account not only the prestige of the citing scientific journal but also its closeness to the cited journal using the cosine of the angle between the vectors of the two journals’ cocitation profiles. The SJR2 indicator was designed to weight the citations according to the prestige of the citing journal, also taking into account the thematic closeness of the citing and the cited journals. The procedure does not depend on any arbitrary classification of scientific journals, but uses an objective informetric method based on cocitation. It also avoids the dependency on the size of the set of journals, and endows the score with a meaning that other indicators of prestige do not have.

4.5.3 Bibliometric measures that correct for differences among fields

It is well known that in some scientific fields the average number of citations per publication is much higher than in other scientific fields. This is due to differences among fields in the average number of cited references per publication, the average age of cited references, and the degree to which references from other fields are cited, among others. The first suggestions on how to control for these factors and calculate normalized citation rates were made in the 1980s [404, 463]. In general, all proposed normalization methods are computed by dividing the actual number of citations received by a group of publications by the number of citations that could be expected for similar publications. Two well-known field-normalized citation scores, the crown indicator [330] and the mean normalized citation score [469], were developed to correct for the citation differences among fields. Similarly, the normalized mean citation rate [182, 402] was also proposed for the above purpose. Other works tried to construct field-independent indicators by dividing the h-index by the average number of citations per paper [227], dividing the actual number of citations of each paper of a researcher by the average number of publications in the field [373], dividing the journal's citation count per paper by the citation potential in its subject field [329], and assigning weights to journal citations according to the journal's average number of references per article [492].

crown indicator
The crown indicator [330] relies on a normalization mechanism that aims to correct for the citation differences among fields. The normalization mechanism basically works as follows. Given a set of publications, the number of citations that each publication has received is counted. Also, the expected number of citations of each publication is determined. The expected number of citations of a publication equals the average number of citations of all publications of the same document type (i.e., article, letter, or review) published in the same field and in the same year. In this context, the crown indicator is calculated by dividing the average number of citations received by a group of publications by the average number that could be expected for publications of the same type, from the same year, published in journals within the same field. Mathematically, it is defined as

crown indicator = \frac{CPP}{FCSm}    (4.18)

where CPP is the average number of citations per publication, without self-citations, and

FCSm is the mean field-normalized citation score, which is calculated using the same publication and citation counting procedure as in the case of CPP. The normalization mechanism of the crown indicator has been criticized in [300, 350]. These authors have argued in favor of an alternative mechanism which calculates, for each publication, the ratio of its actual number of citations to its expected number of citations and then averages these ratios.

mean normalized citation score
The crown indicator was recently modified into the mean normalized citation score [469] in order to overcome some drawbacks of the crown indicator. The most important drawback is that citation rates are not normalized at the level of individual publications. This way of calculating gives more weight to older publications (particularly reviews) published in fields with dense citation traffic. In order to give each publication equal weight, the normalization should take place at the level of the individual publication. Instead of first calculating the actual average citation rate and then dividing it by the average expected citation rate, each publication is normalized individually. It is mathematically defined as

mean normalized citation score = \frac{1}{n} \sum_{i=1}^{n} \frac{cit_i}{exp_i}    (4.19)

where n is the number of papers, cit_i is the number of citations of paper i, and exp_i is the expected number of citations of paper i. The mean normalized citation score has the advantage of being mathematically consistent, while the previous crown indicator was not. Comparing the crown indicator and the mean normalized citation score, it can be seen that the crown indicator normalizes by calculating a ratio of averages, while the mean normalized citation score normalizes by calculating an average of ratios. Hence, while the crown indicator performs a normalization at the level of an oeuvre as a whole, the mean normalized citation score performs a normalization at the level of the individual publications in an oeuvre.
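The ratio-of-averages versus average-of-ratios distinction is easy to see numerically. A small sketch with invented citation and expected-citation values for three publications:

```python
cits     = [30, 4, 2]   # actual citations of three publications (invented)
expected = [10, 4, 1]   # expected citations by field, year and document type

# Crown indicator: ratio of averages (CPP / FCSm), Eq. (4.18).
crown = (sum(cits) / len(cits)) / (sum(expected) / len(expected))

# Mean normalized citation score: average of ratios, Eq. (4.19).
mncs = sum(c / e for c, e in zip(cits, expected)) / len(cits)

print(round(crown, 2), round(mncs, 2))  # 2.4 vs 2.0: the two mechanisms differ
```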

4.6 Bibliographic databases

Bibliographic databases, also called citation databases, are used to combine information related to bibliographic productivity and to facilitate the identification of authors of publications and sources of publication citations. Historically, the most common applications have been the calculation of the scientific relevance of scholarly journals and the evaluation of researcher productivity. A large number of thematic citation databases are available, but their coverage is limited to specific scientific areas. Other databases are more general and have been constructed to cover overall academic productivity. Web of Science [1], Scopus [2] and Google Scholar [3] are the most well-known databases.

Garfield founded the Institute for Scientific Information (ISI) and constructed the first citation databases for combining information on publications and associated citations for a defined set of scientific journals [170] in 1955. ISI databases have been the most generally accepted data sources for bibliometric analysis. They have built a reputation as the oldest citation resources, containing the most prestigious academic journals [347]. Nowadays, Thomson Reuters, one of the world's largest information companies, continues the work initiated by the Institute for Scientific Information, developing the new Web of Science platform. Two alternatives to Web of Science that are growing in popularity are Scopus, provided by Elsevier, and Google Scholar, provided by Google Inc. After their remarkable introduction in 2004, the challenge for Scopus and Google Scholar is to position themselves as citation resources in a market where Web of Science held the monopoly [198, 307]. Since Scopus and Google Scholar began competing with Web of Science, a large number of comparisons of coverage and of bibliometric measures calculated on the basis of the three citation databases have been published [26, 31, 32, 169, 198, 228, 230, 231, 232, 321]. Although a comprehensive review is beyond the scope of this chapter, a few comparison examples are detailed below.

Recent works have compared citation resources across different parameters. On the one hand, Adriaanse and Rensleigh [11] compared citation counts, multiple copies and inconsistencies encountered across the three citation resources Web of Science, Scopus and Google Scholar. Data from the South African scholarly environmental sciences journals for the year range 2004-2008 were extracted from the three citation resources and compared. The total citation counts indicated that Web of Science retrieved the most citation results, followed by Google Scholar and then Scopus. Web of Science performed the best, with total coverage of the journal sample population, and also retrieved the most unique items. The investigation into multiple copies indicated that Web of Science and Scopus retrieved no duplicates, while Google Scholar retrieved multiple copies. Scopus delivered the fewest inconsistencies regarding content verification and content quality compared to the other two citation resources, while Google Scholar retrieved the most inconsistencies, with Web of Science retrieving more inconsistencies than Scopus. Examples of these inconsistencies include author spelling and sequence, and volume and issue numbers. On the other hand, Bartol et al. [35] analyzed all documents and citations received by authors who were actively engaged in research in Slovenia between 1996 and 2011. They showed that Scopus leads over Web of

Science in indexed documents as well as citations in all research fields. This is especially evident in the social sciences, humanities, and science and technology. The fewest citations per document were received in the humanities, and the most in the medical and natural sciences, which exhibit similar counts. Finally, Chirici [91] compared the scientific productivity of the Italian forestry community for 1996-2010 using some bibliometric measures. Results showed that the mean number of publications, mean number of citations and h-index values calculated by Web of Science and Scopus were not statistically different. He also found that Web of Science has a more complete and wider coverage for the analyzed authors than Scopus.

It needs to be underlined that the coverage, scope, search functionalities and analysis tools of all databases are constantly evolving; therefore, the information in the literature can only refer to the coverage that existed at the time of the analysis. A clear winner among citation resources cannot be identified, since the relative advantages of one over the others depend on what is specifically to be analyzed. Some authors [147, 193, 279, 284, 288, 438] noted that the coverage of Scopus outperformed Web of Science in specific disciplines. Other works [31, 153, 270, 320, 392] showed that Google Scholar computes significantly higher indicator scores than Web of Science and Scopus. In contrast, Google Scholar's lack of quality control limits its use as a bibliometric tool because of non-scholarly sources, erroneous citation data and errors of omission and commission when using search features [323]. The search results from Google Scholar are very noisy and therefore require considerable, difficult and time-consuming filtering to obtain usable information, especially for evaluation purposes. Also, Google Scholar is the only citation resource retrieving multiple copies (duplicates and triplicates). The poor capability of consolidating matching records causes Google Scholar to inflate citation hits, which does not give a true reflection of the citation count [229]. Although the three sources have different goals and contents, they all track citations that are potentially useful for bibliometric studies [323]. Despite the differences, all databases are highly correlated and comparable in terms of rankings at the macro level [31, 33]. A considerable overlap among Web of Science, Scopus and Google Scholar in terms of content has been found by different works [11, 408].

4.6.1 Web of Science

Web of Science is the most comprehensive and versatile research platform available. It provides researchers, administrators, faculty and students with quick, powerful access to the world's leading citation databases and gives them powerful tools to search, track and measure research publications. The careful selection process ensures that users get the most reliable, integrated, multidisciplinary information from the global research community. Its content and tools are trusted by more than 7,000 of the world's leading scholarly institutions responsible for scientific policy making. Under the new Web of Science umbrella, a user-friendly discovery environment provides access to many resources, like the Web of Science core collection, including the well-known Science Citation Index Expanded, Social Sciences Citation Index and Arts and Humanities Citation Index, and other resources like the Conference Proceedings Citation Index,

Book Citation Index, Index Chemicus and Current Chemical Reactions. The Web of Science core collection contains more than 55 million records from the top journals, conference proceedings and books in the sciences, social sciences, and arts and humanities. It provides authoritative and multidisciplinary coverage, dating back to 1900, from more than 13,000 high-impact research journals worldwide, 160,000 conference proceedings and 60,000 books associated with more than 250 disciplines.

A short description of the main Web of Science core collection resources are as follows:

- Science Citation Index Expanded focuses on bibliographic and citation information from over 8,500 of the world's leading scientific and technical journals across 150 disciplines. It has more than 43 million records and delivers comprehensive backfile and cited reference data from 1900 to the present.

- Social Sciences Citation Index contains essential data from 3,000 of the best social sciences journals across 50 disciplines. It stores more than 7 million records and its range of coverage is from the year 1956 to the present day.

- Arts and Humanities Citation Index indexes over 1,700 arts and humanities journals. It contains more than 4 million records and provides backfiles to 1975.

- Conference Proceedings Citation Index helps researchers access the published literature from the most significant conferences worldwide. It covers more than 160,000 conference proceedings starting from 1990 to the present day. It holds more than 8 million records.

- Book Citation Index indexes over 60,000 editorially selected books, with 10,000 new books added each year. It introduces more than 15 million new cited references from 2005 to the present.

- Index Chemicus provides access to the chemical compound information, covering more than 100 of the world’s leading organic chemistry journals. It houses more than two million compounds from 1993 to the present.

- Current Chemical Reactions contains over one million reactions. Each reaction provides complete bibliographic data which is available back to 1840.

Beyond the Web of Science core collection, the following additional databases support the Web of Science. The Chinese Science Citation Database covers over 1,200 top scholarly publications from China. It provides multidisciplinary coverage back to 1989 of many disciplines, including nearly 2 million article records and more than 13 million citations. The SciELO Citation Index reveals new insights from research in Latin America, Spain, Portugal, the Caribbean and South Africa. It provides access to nearly 700 titles and stores over 4 million cited references. The KCI Korean Journal Database lets researchers discover new insights from research emanating from South Korea. It incorporates nearly 1,500 scholarly journals into the Web of Science. Also, the Derwent Innovations Index facilitates rapid, precise patent and citation searches of inventions in chemical, electrical, electronic and mechanical engineering. It covers over 16 million records from 41 worldwide patent-issuing authorities and provides backfiles to 1963. Finally, Current Contents Connect provides easy Web access to complete tables of contents, abstracts, bibliographic information and abstracts from about 8,000 journals and 2,000 books. It contains over 20 million records and includes pre-published electronic journal articles and links to the full text.

Other discipline-based databases are also incorporated into the Web of Science. BIOSIS Previews uncovers relevant coverage in life sciences research. It stores data from over 5,200 journals, as well as academic books, abstracts, published theses, conference proceedings, bulletins and technical reports, and contains over 21 million records back to 1926. Zoological Record is the world's oldest continuing database of animal biology. It has over 4 million records back to 1864. CAB Abstracts is the most comprehensive source of international research information in agriculture. It indexes over 7,500 journals, as well as non-journal literature, contains more than 7 million records and provides backfiles to 1910. The Food Science and Technology Abstracts database provides thorough coverage of pure and applied research in food science, food technology, and food-related human nutrition. It has over one million records from more than 4,600 serial publications and patents from around the globe. Inspec is a comprehensive index to literature in physics, electrical/electronic technology, computing, control engineering, information technology, and manufacturing, production and mechanical engineering. It covers over 13 million bibliographic records from publications worldwide, available back to 1898. Finally, Medline is the premier bibliographic database of the U.S. National Library of Medicine, covering biomedicine and the life sciences, bioengineering, public health, clinical care, and plant and animal science. It houses over 17 million records from publications worldwide, including over 5,300 journals in 30 languages, plus a select number of relevant items from newspapers, magazines and newsletters.

Some advantages of using the Web of Science platform are detailed as follows:

- Many qualitative and quantitative factors are taken into account when evaluating journals, conferences and books for coverage in Web of Science. They are constantly under review, ensuring that they maintain high quality standards. Also, the journals indexed in Web of Science are the same journals found in the Journal Citation Reports (JCR), which is probably considered by many the premier source to consult when looking for journal impact.

- It tracks over a century of vital data and provides high-quality data to perform comprehensive analyses. More backfiles give users the power to conduct deeper searches and track research through time.

- It continues to expand its coverage by adding new regional databases, including spe- cific research in China, South Korea, Latin America, Spain, Portugal, the Caribbean and South Africa, and subject-specific databases, like BIOSIS Previews, Inspec, and Medline. 4.6. BIBLIOGRAPHIC DATABASES 61

- It indexes bibliographic data from cover to cover, so it is possible to access every significant item from journals, conferences and books, including original research articles, reviews, editorials, chronologies, abstracts, proceedings papers, book chapters and more. Also, it provides access to publishers' full-text documents.

- It has complex and focused search options. Users can navigate forward and backward through the literature, and search all disciplines and time periods. It enhances the power of cited reference searching by searching across disciplines for all the related articles that have cited references in common.

- Author identification tools are available. It locates documents written by the same authors in a simple, single search, eliminating the problems of similar author names or several authors with the same name. It also supports Unicode expanding search capabilities to Chinese characters, among others.

- It has many insightful analysis options to discover hidden trends and patterns, gain insight into emerging fields of research, and identify leading researchers, institutions, and journals. Visual and graphical representations of citation activity are also available.

- It includes Essential Science Indicators, which can determine the influential individuals, institutions, papers, publications and countries in a field of study. This unique and comprehensive compilation of science performance statistics is an ideal analytical resource for policymakers.

- It allows the integration with EndNote, providing free access to all academic users in order to store, organize and share their references online. ResearcherID and RSS feeds are also integrated in the Web of Science platform.

In summary, Web of Science collects bibliographic and citation information which has met the high standards of an objective evaluation process, eliminating clutter and delivering accurate, meaningful and timely data. It contains more than 90 million records and more than one billion cited references, and indexes around 65 million items per year. It also provides comprehensive backfile and cited reference data from 1900 to the present. The multidisciplinary coverage of the Web of Science encompasses over 20,000 journals from 3,300 publishers, 160,000 conference proceedings including over 8 million conference papers, and 60,000 scholarly books. The coverage includes the sciences, social sciences, arts and humanities, and spans more than 250 disciplines.

4.6.2 Scopus

Scopus is a bibliographic database containing abstracts and citations for multidisciplinary literature. It is designed to serve the research information needs of researchers, educators, administrators, students and librarians across the entire academic community. Like Web of Science, Scopus features smart tools to track, analyze and visualize research. Whether searching for specific information or browsing topics, authors or journals, Scopus provides precise entry points to peer-reviewed literature in the fields of life sciences, physical sciences, health sciences, and social sciences and humanities. These fields can be further divided into 27 major subject areas and more than 300 minor subject areas. Updated daily, Scopus includes bibliographic information on over 22,000 peer-reviewed journals from more than 5,000 international publishers. In addition, it includes over 6.5 million conference papers from over 18,000 worldwide events, more than 50,000 books, 545 million scientific web results and 25.2 million patent records from the five most important patent offices.

The depth of Scopus coverage is not as impressive as its width, because many journals and conferences are only covered for the last few years. Although records in the database go as far back as 1823, references do not appear in those records until 1996. In this context, Scopus collects 54 million records, including 33 million records with references back to 1996 and 21 million pre-1996 records with no references. Bibliometric calculations are thus only available for publications since 1996, resulting in very skewed bibliometric measures for researchers with careers longer than this.

Some advantages of using the Scopus platform are as follows:

- It provides comprehensive journal and country ranking data through its Scimago Journal and Country Rank portal. It includes scientific indicators, developed from the information contained in the Scopus database, which can be used to assess and analyze scientific domains.

- An independent and international Content Selection and Advisory Board is established to prevent a potential conflict of interest in the choice of journals to be included in the database and to maintain an open and transparent content coverage policy.

- It identifies and matches organizations and collaborators with their research output using identifier tools. It clarifies their identity through integration with ORCID.

- Search and filter procedures are very fast, allowing searches of scientific web pages, author homepages and university sites through Scirus.

- The retrieved results can be downloaded in different formats using the Quosa Document Download Manager. Also, it is possible to export data to reference managers such as Mendeley, RefWorks and EndNote.

4.6.3 Google Scholar

Google Scholar is a web search engine providing free access to the world's scholarly literature. It primarily searches academic papers from most major academic publishers and repositories worldwide, as well as institutional and individual bibliographic databases. Although it does not specifically list the sources of its data, it is designed to be as comprehensive as possible. The size of Google Scholar's database is not published; despite this, researchers have estimated it to contain roughly 160 million documents. It includes journal and conference papers, theses, dissertations, academic books, pre-prints, abstracts, technical reports and other scholarly literature. It also includes court opinions and patents. Shorter articles, such as book reviews, news sections, editorials, announcements and letters, may or may not be included. Untitled documents and documents without authors are usually not included. Website URLs that are not available to Google search robots are, obviously, not included either.

Although Google Scholar citations are usually considered of comparable quality and utility to those of commercial databases, they have been found to be sometimes inadequate since they are vulnerable to spam. Results often contain duplicates of the same article due to the wide range of sources. Also, it has been found to include duplicate citations, in part because of the inclusion of multiple copies. In this context, citation counts from Google Scholar should only be used with care, especially when used to calculate performance metrics. The most important advantage of using Google Scholar is that it stands out in its coverage of conference proceedings as well as international, non-English language journals, and its greater coverage includes some items that are not found in the other databases. It also covers not only journals but also academic and electronic-only publications.

Finally, Table 4.2 shows a short comparison among these three bibliographic databases. Although Google Scholar does not provide information on many aspects, it has the highest number of records and it is the only database which provides free access to researchers. Web of Science is the oldest resource and excels in citation coverage. In contrast, Scopus leads in the number of covered journals and patents.

Table 4.2: Short comparison among three bibliographic databases.

Features        Web of Science           Scopus                 Google Scholar

Institution     Thomson-Reuters (USA)    Elsevier (Holland)     Google (USA)
Date            1960s                    2004                   2004
Records         90 million               54 million             160 million
Journals        20,000                   22,000                 Unknown
Publishers      3,300                    5,000                  Unknown
Conferences     8 million                6.5 million            Unknown
Books           60,000                   50,000                 Unknown
Patents         14.3 million             25.2 million           Unknown
References      from 1900                from 1996              Unknown
Disciplines     > 250                    > 300                  Unknown
Licence         Unfree                   Unfree                 Free
Update          Weekly                   Daily                  Daily

Part III

Data Mining in Research Evaluation


Chapter 5 Predicting citation counts using supervised algorithms

5.1 Introduction

Publishers nowadays face the problem of deciding which of the many papers they receive are of high enough quality for publication in their journals. The current method used for article assessment is peer review. Although peer review, properly used, is assumed to be the most reliable system, it is slow, expensive and unwieldy [93, 338, 393]; other authors contest this appraisal [195, 212]. This difference of opinion has led to the development of several quantitative metrics associated with scientific production. One such metric is citation count, the number of citations received by a paper in a period of time. Although citations are a measure of visibility, they can be considered an indirect measure of article quality: the aim of this measure is to mirror the impact and quality of papers [52]. This chapter focuses on the construction of predictive models to forecast the citation count of papers published in the Bioinformatics journal within four years after publication. Several researchers have investigated the prediction of citation count. Their work differs primarily as regards the prediction time horizon and the predictive features used. Some papers predict the number of citations using information gathered after publication. On the one hand, Brody et al. [64] used download data as predictive features and found a correlation of ρ=0.44 between the number of citations and the total downloads over the two years after publication. Similarly, Perneger [361] found a correlation of ρ=0.54 between the total number of citations after five years and the number of views of the full-length HTML version of the article during the first week after publication, and Watson [472] showed that the number of downloads per day over the first 1,000 days after publication correlated well with the number of citations per year. On the other hand, Castillo et al. [80] used the number of citations, the authors' reputation and the source of the paper citations as predictive features, and Lokker et al. [296] used features related to the article and journal,

like number of authors, pages, references and so on. The above three works used measures taken after the paper was published to predict its citation count in the future. The main disadvantage of using these features is that the required values are not available until several months after publication. In contrast, other papers attempt to forecast citation count with the information available at the time of publication. Fu and Aliferis [164] predicted citation count within ten years after publication using bibliometric information (number of articles by the first author, number of citations of the first author, number of authors, number of institutions and so on), the journal impact factor and the content of the article (title, abstract and MeSH terms). All these features are available at the time of publication. Support vector machine classification models were used as the learning algorithm. Predictions were made for a binary response variable defined by a set of citation thresholds that determine whether an article is labeled positively or negatively. For a given threshold t, a positive label means that an article received at least t citations within ten years after publication. These thresholds were 20 (mildly influential), 50 (relatively influential), 100 (influential) and 500 (extremely influential). Depending on the threshold used, the models output area under the ROC curve (AUC) values ranging from 0.857 to 0.918. As in Fu and Aliferis [164], this chapter also deals with the response variable as a discrete variable. Unlike theirs, the variable that counts the number of citations is discrete rather than binary, taking three possible values (few, some and many citations). This leads to the use of classification methods rather than regression models to predict citation counts. Also unlike Fu and Aliferis [164], who use only support vector machines, several classification methods are taken into account and it is analyzed which one provides better predictions for the problem. Moreover, the proposed models are constructed to predict annual time horizons (each of the first four years after publication) and for each Bioinformatics journal section. The information required from each article is its abstract content and the number of two-week periods after publication. Hence, as opposed to the previous models described above, which require information that is not available until after publication, these predictions are available at publication time. Also, the information output by the model will be exploited, like the identification of key features (e.g. words in the abstract) that increase the chances of citation. This method can actually inform publishers about which articles will have a bigger impact in the future, before they are published. This work appears in the published paper [223].

Chapter outline

Section 5.2 presents the prediction of the future citation count of Bioinformatics papers within four years of publication. It includes the dataset compilation, the data distribution and the predictive models learned from the data. It also shows a prediction example using the models with the best performance. Finally, Section 5.3 discusses the main results and conclusions of this chapter.

5.2 Predicting citation count of Bioinformatics papers

Two different types of predictive models are learned in this chapter. Global models attempt to predict the number of citations received by an article within each of the four years after publication, using information on all papers published in Bioinformatics over three years, from January 1, 2005 to December 31, 2007. Specific models have the same objective but, in this case, they use the information related to articles published within a specific Bioinformatics journal section. Several phases are required for the dataset construction. The collection of abstracts published in Bioinformatics is the starting point for the construction of the predictive models.

5.2.1 Dataset compilation

Bioinformatics is selected as the journal for this study. The basic elements of this work are the abstracts published in the Bioinformatics journal sections (Data and Text Mining, Databases and Ontologies, Gene Expression, Genetics and Population Analysis, Genome Analysis, Phylogenetics, Sequence Analysis, Structural Bioinformatics and Systems Biology) from 2005 to 2007. Before that date, no such sections existed. These abstracts are collected from the journal website (http://bioinformatics.oxfordjournals.org/). Once this information is downloaded, the abstracts, the journal section and the number of two-week periods from the beginning of the year to the publication date are stored in a database designed for this purpose.

The next step is to extract tokens from the abstracts. To do this, a list of tokens ordered by frequency of occurrence in the abstract set is built. This list is composed of one-, two- and three-word tokens. Then, the list is filtered to reduce the large number of different tokens. The proposed filter removes tokens that appear only occasionally in the abstract set: tokens with a frequency of occurrence of less than three are removed. After that, some tokens that are repeated frequently but are irrelevant to the case study are eliminated; prepositions and articles are classic examples of such stopwords. Generally, these tokens appear in all abstracts and play no role in building the predictive model. Finally, each token is reduced to its morphological root using Lucene software. The last step of the dataset compilation is to download, from the Web of Science, the number of citations received by each article within each year after publication until December 31, 2008.

The dataset structure differs depending on the model to be built. The dataset of the global models is made up of the Section, Date, Token-1, ..., Token-n features and the Citation variable, whereas the specific models have the same structure except for the Section feature, which is constant.
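As a concrete illustration of the pipeline just described, the following is a minimal Python sketch of the token-extraction step. The stopword list and the MIN_FREQUENCY constant are illustrative, and NLTK's PorterStemmer stands in for the Lucene stemming used in the thesis.

```python
from collections import Counter
from itertools import chain
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "on", "with"}  # illustrative
MIN_FREQUENCY = 3  # tokens occurring fewer than three times are removed

def ngrams(words, n):
    """All contiguous n-word tokens of a word list."""
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def extract_tokens(abstracts):
    """Build the filtered, stemmed token list from a corpus of abstracts."""
    stemmer = PorterStemmer()
    counts = Counter()
    for abstract in abstracts:
        # Lowercase, drop stopwords, reduce each word to its morphological root
        words = [stemmer.stem(w) for w in abstract.lower().split()
                 if w not in STOPWORDS]
        # One-, two- and three-word tokens, as in the thesis
        counts.update(chain(words, ngrams(words, 2), ngrams(words, 3)))
    # Keep only tokens occurring at least MIN_FREQUENCY times in the corpus
    return sorted(t for t, c in counts.items() if c >= MIN_FREQUENCY)
```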

- Section can take the values 1-Data and Text Mining, 2-Databases and Ontologies, 3-Gene Expression, 4-Genetics and Population Analysis, 5-Genome Analysis, 6-Phylogenetics, 7-Sequence Analysis, 8-Structural Bioinformatics and 9-Systems Biology. These values correspond to the different Bioinformatics journal sections.

- The feature Date refers to the number of two-week periods from the beginning of the year to the publication date. It can take the values {1,2,...,24}.

- Token-i are the features that belong to the final list of the tokens extracted from abstracts. These features are binary, and take the value 1 or 0 depending on whether or not the token is present in the selected abstract.

- The Citation variable corresponds to the class label. It can take the values {few, some, many}. The first value, few, describes papers that receive at most one citation in a specific year according to the Web of Science. The value some applies to papers that receive 2, 3 or 4 citations in a year. Finally, the value many refers to papers that receive five or more citations. A sketch of this encoding follows the list.
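The sketch below shows how one article could be turned into a dataset row under the feature definitions above; the function names (citation_class, build_row) are hypothetical, and the citation thresholds follow the few/some/many definition just given.

```python
def citation_class(citations):
    """Map a yearly citation count to the class labels used in this chapter."""
    if citations <= 1:
        return "few"
    if citations <= 4:
        return "some"
    return "many"

def build_row(section, date, abstract_tokens, final_tokens, citations):
    """Return (features, label): Section, Date and one binary flag per token."""
    present = set(abstract_tokens)
    row = {"Section": section, "Date": date}
    row.update({tok: int(tok in present) for tok in final_tokens})
    return row, citation_class(citations)
```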

5.2.2 Data distribution

The number of articles selected to build the predictive models varies depending on the prediction year. To construct the models assigned to the first and second year, articles published in 2005, 2006 and 2007 were used. The models for the third year used papers published in 2005 or 2006, and finally, the predictive models for the fourth year used articles published in 2005 only. Clearly, the longer the prediction horizon, the fewer papers are available to induce the models. Table 5.1 shows the distribution of the articles selected in this research. It gives the number of papers belonging to each journal section in a particular year, together with the distribution over the values of the class to be predicted. The entry (All; Second-year; Total) shows that 1086 articles are available to induce the global models in the second year; according to the number of citations received, these articles are further divided into few (388), some (432) and many (266).

Table 5.1: Distribution of the data (papers), according to nine journal sections and citation count (few, some and many) across the four-year time horizon.

        First-year            Second-year           Third-year            Fourth-year
      Total   f    s   m    Total   f    s   m    Total   f    s    m    Total   f    s    m
S1       88  81    7   0       88  39   31  18       50  10   17   23       24   8    7    9
S2       37  32    3   2       37  13   14  10       25   6    8   11       15   6    3    6
S3      283 253   26   4      283 107  114  62      192  38   66   88      107  23   37   47
S4       46  41    5   0       46  19   16  11       27   7    9   11       21   1   11    9
S5      103  93    9   1      103  26   46  31       82  19   27   36       53  11   22   20
S6       28  23    4   1       28   6   14   8       20   2    9    9       11   2    4    5
S7      190 170   16   4      190  73   77  40      141  49   46   46       82  22   31   29
S8      150 130   19   1      150  49   55  46      103  22   35   46       54   6   13   35
S9      161 140   20   1      161  56   65  40      100  14   32   54       53   3   12   38
All    1086 963  109  14     1086 388  432 266      740 167  249  324      420  82  140  198

S1 (Data and Text Mining), S2 (Databases and Ontologies), S3 (Gene Expression), S4 (Genetics and Population Analysis), S5 (Genome Analysis), S6 (Phylogenetics), S7 (Sequence Analysis), S8 (Structural Bioinformatics), S9 (Systems Biology)

Table 5.1 also lists the number of papers used in the specific models. For example, the models associated with 5-Genome Analysis and the third year use 82 papers. Furthermore, section 3-Gene Expression accounts for 26.01% of the articles used in the first-year prediction models, whereas the sections with fewest papers in the first year are 2-Databases and Ontologies (3.41%), 4-Genetics and Population Analysis (4.23%) and 6-Phylogenetics (2.58%). The sections with the most and fewest papers are the same across all years.

5.2.3 Predictive models

Different classifiers (naive Bayes, the K2 algorithm, logistic regression, the C4.5 algorithm and the k-nearest neighbor algorithm) have been used to learn the proposed global and specific models. Before running these methods, feature selection is performed. To determine whether all features are equally important or necessary to discriminate between the values {few, some, many}, correlation-based feature selection is run. The objective of feature selection is to build parsimonious models: features that are irrelevant or redundant will not appear in them. Also, k-fold cross-validation is used as the procedure for estimating how well the models classify new cases according to the values of the predictive features. A sketch of this protocol is given below. Several global models have been constructed for predicting the citation count of all the articles within four years after publication. Each model is associated with one of the four years to be predicted and one of the five supervised classification methods studied. Table 5.2 shows the results for each model. These results leave room for improvement: apart from the first-year models, accuracy is below 80%. Some classification methods provide better results than others. In this case, Bayesian classifiers have a higher average success rate over the four years (naive Bayes, 73.40%, and K2, 70.37%), whereas logistic regression (65.85%), decision trees (60.15%) and K-NN (56.47%) yield the worst results. Although the first-year model has a much higher success rate than the models for the other years, the results are not satisfactory. This is because most cases belong to the few class (Table 5.1), and this is an obstacle to learning the some and many classes, since models avoid classifying cases into them. The C4.5 and K-NN methods especially tend to make this error for the first-year time horizon, whereas Bayesian classifiers and logistic regression are less prone to it.
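The following hedged scikit-learn sketch illustrates this protocol. The WEKA-style K2 Bayesian network classifier has no direct scikit-learn equivalent and is omitted, a chi-squared filter stands in for correlation-based feature selection (which scikit-learn does not ship), CART serves as a C4.5 stand-in, and k=50 selected features is an arbitrary illustrative value.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "NB": BernoulliNB(),                      # suits binary token features
    "LR": LogisticRegression(max_iter=1000),
    "C4.5": DecisionTreeClassifier(),         # CART as a C4.5 stand-in
    "K-NN": KNeighborsClassifier(),
}

def compare(X, y, k=10):
    """Estimate each classifier's accuracy with stratified k-fold CV."""
    results = {}
    for name, clf in classifiers.items():
        # Feature selection is re-fit inside each fold to avoid leakage
        pipe = make_pipeline(SelectKBest(chi2, k=50), clf)
        scores = cross_val_score(pipe, X, y, cv=k)
        results[name] = (scores.mean(), scores.std())
    return results
```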

Table 5.2: Accuracy and standard deviation of global models.

All journal sections

        First-year      Second-year     Third-year      Fourth-year
NB      91.4 ± 1.62     57.4 ± 6.08     68.9 ± 5.07     75.9 ± 6.39
K2      89.7 ± 2.54     57.4 ± 5.38     65.3 ± 4.48     69.1 ± 6.83
LR      84.7 ± 3.95     56.6 ± 2.75     59.3 ± 5.56     62.8 ± 7.20
C4.5    88.2 ± 0.47     48.8 ± 4.02     48.6 ± 4.69     55.0 ± 4.74
K-NN    88.5 ± 0.73     44.6 ± 4.72     38.5 ± 4.55     54.3 ± 4.95

NB (naive Bayes), K2 (K2 algorithm), LR (logistic regression), C4.5 (C4.5 algorithm), K-NN (k-nearest neighbor algorithm)

Table 5.3: Accuracy and standard deviation of specific models. Numbers in boldface represent an average success rate better than 95%.

[Table layout lost in extraction. For each journal section (S1-S9), each time horizon (first- to fourth-year) and each classifier, the table reports the estimated accuracy ± standard deviation together with the number of features selected for the corresponding model; only the caption and footnote are reliably recoverable.]

NB (naive Bayes), K2 (K2 algorithm), LR (logistic regression), C4.5 (C4.5 algorithm), K-NN (k-nearest neighbor algorithm)

In response to the accuracy concerns raised by the global models, new specific models were developed. Each model is associated with one of the nine journal sections, one of the four time horizons and one of the five supervised classification methods studied. Table 5.3 shows the results for the new models: the results depend on the journal section, the time horizon and the supervised classification method used. The highest percentage of correctly classified cases is 100%, achieved on three occasions by the naive Bayes and logistic regression methods. On the other hand, the results were poorest for the C4.5 and K-NN methods, which sometimes scored below 50%. Table 5.3 also shows the number of features used by the different predictive models. Fixing a specific journal section and analyzing the average number of features over the four-year time horizon, the sections with the fewest features are 6-Phylogenetics (18.75) and 2-Databases and Ontologies (26.5), whereas the sections with the most features are 3-Gene Expression (112.75) and 7-Sequence Analysis (97.5). Looking at the behavior of the classifiers for each value to be predicted {few, some, many}, Table 5.4 shows the confusion matrices of the models associated with sections 1-Data and Text Mining and 2-Databases and Ontologies, learned with the logistic regression and decision tree methods, respectively. These models were chosen because they are the most and least accurate within each of the four time horizons, respectively (see Table 5.3). To check the good behavior of the logistic regression method, the confusion matrix of the 1-Data and Text Mining second-year model is analyzed. This matrix shows that the total number of cases to be predicted is 88. Of these, 85 cases are well classified (96.6%) and three cases are wrongly classified (3.4%). Analyzing each value of the class, note that the success rate for the values few and some is 100%, whereas three errors are made for the value many, where one case is classified as few and two as some. On the other hand, the confusion matrix of the 2-Databases and Ontologies fourth-year model is an example of the poor behavior of C4.5. In this case, the model tries to predict 15 cases, of which 4 are well classified (26.7%) and the rest are wrongly classified (73.3%). Analyzing the different values of the class, the success rate for the values few and many is 33%, whereas the success rate for the value some is 0%, with all instances of this value classified as many rather than as some.

Table 5.4: Confusion matrices of two specific models (logistic regression and decision tree models).

Section 1-Data and Text Mining (logistic regression)

First-year (98.9 ± 3.53)          Second-year (96.6 ± 5.43)
  a   b   c  ← Classified as        a   b   c  ← Classified as
 81   0   0  | a = few             39   0   0  | a = few
  1   6   0  | b = some             0  31   0  | b = some
  0   0   0  | c = many             1   2  15  | c = many

Third-year (94.0 ± 9.78)          Fourth-year (91.7 ± 18.0)
  a   b   c  ← Classified as        a   b   c  ← Classified as
 10   0   0  | a = few              8   0   0  | a = few
  0  17   0  | b = some             0   7   0  | b = some
  0   3  20  | c = many             2   0   7  | c = many

Section 2-Databases and Ontologies (C4.5)

First-year (86.5 ± 13.2)          Second-year (37.8 ± 11.3)
  a   b   c  ← Classified as        a   b   c  ← Classified as
 32   0   0  | a = few              7   5   1  | a = few
  3   0   0  | b = some             7   6   1  | b = some
  2   0   0  | c = many             6   3   1  | c = many

Third-year (48.0 ± 12.3)          Fourth-year (26.7 ± 35.4)
  a   b   c  ← Classified as        a   b   c  ← Classified as
  1   0   5  | a = few              2   0   4  | a = few
  2   0   6  | b = some             0   0   3  | b = some
  0   0  11  | c = many             2   2   2  | c = many

Figure 5.1: Average accuracy within the four prediction years by each section and classifier.

The height of the bars in Figure 5.1 indicates the average percentages for the different classifiers over the four time horizons, with a fixed journal section. Taking the first bar as an example, the value displayed is 90.3%. This value is the mean accuracy for naive Bayes applied to 1-Data and Text Mining, averaged across the four time horizons. Figure 5.1 also shows that the journal section predicted with the highest success rate is method dependent. Logistic regression and naive Bayes achieve some notable results. Logistic regression predicts the 1-Data and Text Mining journal section with a 95.30% success rate across the four time horizons, whereas naive Bayes predicts 5-Genome Analysis with an average accuracy of 91.32% across the four time horizons. On the other hand, the 4-Genetics and Population Analysis section has the highest average percentage of cases well classified by all five algorithms (80.82%), whereas 2-Databases and Ontologies is the section with the lowest percentage of well-classified cases, with an average accuracy of 73.05%.

Figure 5.2: Average accuracy within the nine journal sections by each prediction year and classifier.

Figure 5.2 indicates the average percentages scored by the different classifiers over the nine journal sections studied, with a fixed year of publication. Taking the first bar as an example, the value displayed is 95.70%. This value is the mean accuracy of the naive Bayes classification method for the first year after publication, averaged over all journal sections. The best average results are for the first time horizon, at 92.06% across all classifiers. The second, third and fourth time horizons are very similar to each other, with percentages ranging from 73% to 75%. Looking at the scores for each algorithm, note that naive Bayes, K2, C4.5 and K-NN predict the first year more accurately. However, logistic regression predicts the third year more accurately, although this result is not significant compared with the first- and second-year results (Figure 5.1). After analyzing all results, it is observed that logistic regression and naive Bayes are the methods that solve the problem most accurately. Comparing these two methods, logistic regression achieves a higher success rate, scoring 91.55% on average across the nine sections and the four time horizons, whereas naive Bayes attains 89.38%. Additionally, these methods are the only ones that correctly classified 100% of the cases for a specific year and section (Table 5.3). Regarding the journal sections and time horizons, logistic regression specializes in section 1-Data and Text Mining (95.30%) and in the third year (94.78%), whereas naive Bayes specializes in 5-Genome Analysis (91.32%) and the first year (95.70%).

5.2.4 Exploiting the best models

The purpose of this section is to find out whether there are any tokens that influence an article's citation count within the journal sections and time horizons. This analysis shows the results of predicting the number of citations of a new article using the models of the journal section 6-Phylogenetics in the third year, learned by naive Bayes and logistic regression (18 features, see Table 5.3). Analyzing the probability distributions stored in the naive Bayes model, it is observed that whether an article receives few, some or many citations determines the probability of occurrence of tokens in the article. Conversely, if certain tokens appear in an article, they influence the citation count and thus the value of the class to be predicted. The three probability columns P(Xi|f), P(Xi|s) and P(Xi|m) in Table 5.5 show the distributions of each token conditioned on the class values. These distributions show that some tokens, like linear, probability, discussed, automated, time and nucleotide, tend to appear more frequently in papers with few citations. For papers with some citations, these tokens are time, nucleotide, dynamic, entire, independent, compared, interaction, clustering and protein. Finally, parameter, performance, analyze, researchers, likelihood based, linear and probability are the tokens with a higher frequency of occurrence in articles that receive many citations. These probabilities and the marginal probability of each class value P(C=c), with c = f, s, m (Table 5.1), are the elements of the naive Bayes model used to predict the citation count of a specific paper x. This model is:

$$P(C = c \mid \mathbf{x}) \propto P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c) \qquad (5.1)$$

On the other hand, the logistic regression model requires coefficients ($\beta_i$) to calculate the class value with the highest a posteriori probability. These coefficients are shown in the middle columns ($\beta_i^f$ and $\beta_i^s$) of Table 5.5. The models used for these predictions are:

$$P(C = f \mid \mathbf{x}) = \frac{e^{\beta_0^f + \sum_{i=1}^{n} \beta_i^f x_i}}{1 + e^{\beta_0^f + \sum_{i=1}^{n} \beta_i^f x_i} + e^{\beta_0^s + \sum_{i=1}^{n} \beta_i^s x_i}} \qquad (5.2)$$

$$P(C = s \mid \mathbf{x}) = \frac{e^{\beta_0^s + \sum_{i=1}^{n} \beta_i^s x_i}}{1 + e^{\beta_0^f + \sum_{i=1}^{n} \beta_i^f x_i} + e^{\beta_0^s + \sum_{i=1}^{n} \beta_i^s x_i}} \qquad (5.3)$$

$$P(C = m \mid \mathbf{x}) = 1 - P(C = f \mid \mathbf{x}) - P(C = s \mid \mathbf{x}) \qquad (5.4)$$
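A small Python sketch of Equations (5.1)-(5.4) may help make the two predictors concrete. The dictionary-based inputs are illustrative stand-ins for the rows of Table 5.5; priors are passed in explicitly and Laplace smoothing is omitted.

```python
import math

def naive_bayes_posterior(x, priors, cond):
    """Eq. (5.1): P(c|x) proportional to P(c) * prod_i P(X_i = x_i | c)."""
    scores = {}
    for c in priors:
        p = priors[c]
        for token, value in x.items():
            p_token = cond[c][token]                  # P(X_i = 1 | C = c)
            p *= p_token if value else (1.0 - p_token)
        scores[c] = p
    z = sum(scores.values())                          # normalize over classes
    return {c: s / z for c, s in scores.items()}

def logistic_posterior(x, beta_f, beta_s):
    """Eqs. (5.2)-(5.4); beta_* map 'intercept' and each token to a coefficient."""
    exp_f = math.exp(beta_f["intercept"] + sum(beta_f[t] * v for t, v in x.items()))
    exp_s = math.exp(beta_s["intercept"] + sum(beta_s[t] * v for t, v in x.items()))
    z = 1.0 + exp_f + exp_s
    # P(many|x) = 1 - P(few|x) - P(some|x) reduces to 1/z
    return {"few": exp_f / z, "some": exp_s / z, "many": 1.0 / z}
```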

The new case to be predicted is shown in the last column of Table 5.5. This new case is a paper abstract. Analyze, researchers, automated, nucleotide, dynamic, entire, compared and clustering are the tokens that appear in the abstract. After propagating this evidence, the results predicted by naive Bayes are P (f|x) = 0.30, P (s|x) = 0.67 and P (m|x) = 0.03. On the other hand, the results predicted by logistic regression are P (f|x) = 0.18, P (s|x) = 0.81 and P (m|x) = 0.01. The results of both models show that an abstract with the above tokens published in the journal section 6-Phylogenetics will receive some citations (i.e. 2, 3, or 4 citations) in the third-year after publication.

Table 5.5: Exploiting the best models. Naive Bayes and logistic regression predictions of the number of citations in the third year of a new article published in section 6-Phylogenetics.

Feature (Xi)        P(Xi=1|f)  P(Xi=1|s)  P(Xi=1|m)   βi^f       βi^s      New article (x)

parameter             0.25       0.09       0.27      -6.76     -10.93
performance           0.25       0.09       0.27      -6.76     -10.93
analyze               0.25       0.09       0.36     -12.35     -11.57          √
researchers           0.25       0.09       0.27      -7.53     -10.93          √
likelihood based      0.25       0.09       0.27      -8.13     -10.93
linear                0.50       0.09       0.27     -13.17     -11.57
probability           0.50       0.09       0.27      -3.18     -11.57
discussed             0.75       0.09       0.09      28.34     -10.93
automated             0.50       0.09       0.09      26.85     -10.35          √
time                  0.50       0.36       0.09      12.38       7.22
nucleotide            0.50       0.36       0.09      12.38       7.22          √
dynamic               0.25       0.27       0.09       5.22      12.19          √
entire                0.25       0.27       0.09       5.22      12.20          √
independent           0.25       0.36       0.09       5.53      12.91
compared              0.25       0.45       0.09       5.88      13.72          √
interaction           0.25       0.27       0.09       5.22      12.20
clustering            0.25       0.27       0.09       5.22      17.42          √
protein               0.25       0.45       0.09       5.88      13.72

Intercept (β0)                                       -16.1833    -3.6972

5.3 Discussion and conclusions

Nowadays, publishers of scientific journals face the tough task of selecting, from a pool of submissions, the high-quality articles that will attract as many readers as possible. A tool capable of predicting the citation count of an article within the first few years after publication would pave the way for new assessment systems. This chapter has presented a new approach based on building several prediction models for the Bioinformatics journal. These models predict the citation count of an article within four years after publication (global models). To build these models, tokens found in the abstracts of Bioinformatics papers have been used as predictive features, along with other features like the journal section and the two-week post-publication period. To improve the accuracy of the global models, specific models have been built for each Bioinformatics journal section (Data and Text Mining, Databases and Ontologies, Gene Expression, Genetics and Population Analysis, Genome Analysis, Phylogenetics, Sequence Analysis, Structural Bioinformatics and Systems Biology). The specific models achieved a higher success rate across the four years than the global models. The logistic regression and naive Bayes classification methods output high average scores over the nine journal sections and the four time horizons, achieving rates of 91.5% (AUC=0.943) and 89.4% (AUC=0.983), respectively. It is also observed that the appearance of certain words in the paper abstracts can influence the number of citations received. The probabilities assigned and the tokens selected depend on the journal section and the chosen time horizon. The selected tokens could be used as a point of reference to identify hot topics. Unlike previous models [64, 80, 296], the predictions of these models are not based on information available only after publication. The proposed models use the information content of the article abstract, so predictions can be made at publication time; it is not necessary to wait until the end of a data collection period to predict citation count. It is worthwhile comparing the proposed models with the models developed by Fu and Aliferis [164] because, although they use different features, datasets, response variables and prediction horizons, both attempt to predict citations before publication with tokens contained in the article abstract. The estimated accuracies of the naive Bayes (AUC=0.983) and logistic regression (AUC=0.943) supervised classification methods were higher than the accuracy achieved by their models (AUC=0.918). In the future, the target will be to build new models that incorporate other paper-based features (title, keywords, conclusions, etc.), new author-based features (h-index, number of papers, number of citations, etc.) and new journal-based features (impact factor, immediacy index, category, etc.). These models would be induced using different machine learning methods. The way citation count is handled also influences the results: it could be modeled as a continuous variable using other methods like regression, regularized regression or local regression.

Chapter 6 Predicting the h-index using cost-sensitive algorithms

6.1 Introduction

Classification problems commonly assume that the class values are unordered. However, in many practical applications the class values have a natural order. Given ordered classes, one is not only interested in maximizing classification accuracy, but also in minimizing the distances between the actual and the predicted classes. Some fields, like statistics, have faced this problem for many years and have developed several approaches [313, 314], whereas other fields, like machine learning, have only recently started to look at it [79, 105, 162, 166, 265, 292, 365, 412]. Researchers have tried to solve this problem by means of different approaches. For example, Kramer et al. [267] transformed ordinal scales into numeric values and then solved the problem as a standard regression problem. Frank and Hall [161] used binary decomposition techniques, transforming the original problem involving k classes into k-1 binary problems. The cost-sensitive learning approach is also used for this purpose. Direct cost-sensitive algorithms design classifiers that directly use misclassification costs in the learning algorithm [128, 294, 442]. In contrast, indirect cost-sensitive algorithms convert existing cost-insensitive classifiers into cost-sensitive classifiers [126, 414, 436, 478, 487]. This chapter incorporates direct cost-sensitive learning and feature subset selection into the well-known naive Bayes [325], which is the most straightforward and widely tested method for probabilistic induction and has long been used within the field of pattern recognition [129]. New cost-sensitive algorithms based on the selective naive Bayes notions [275] are developed. These direct algorithms add misclassification costs to the learning algorithm and use wrapper approaches to select relevant variables that maximize accuracy (the CS-SNB-Accuracy algorithm) or minimize cost (the CS-SNB-Cost algorithm). The objective of these approaches is to build parsimonious models that include no irrelevant or redundant features. Some benefits of applying variable selection are better

classification performance, faster classification models, smaller databases, and the ability to gain more insight into the process that is being modeled. Only a few approaches have tackled Bayesian classifiers using cost-sensitive approaches. For example, Gama [167] presented a cost-sensitive iterative Bayes. Chai et al. [83] specifically considered test-cost-sensitive learning and proposed a test-cost-sensitive naive Bayes. Fang [148] developed a cost-sensitive naive Bayes method which learns and infers the order relation from the training data and classifies instances based on the inferred order relation. Finally, Jiang et al. [241] incorporated an indirect cost-sensitive method, called instance weighting, into naive Bayes, tree-augmented Bayesian networks, averaged one-dependence estimators and hidden naive Bayes to make these Bayesian network classifiers cost-sensitive. The interest and originality of this chapter are two-fold. First, two new classifiers (CS-SNB-Accuracy and CS-SNB-Cost) are proposed to bring together the advantages of cost-sensitive learning and feature subset selection. Second, both classifiers have been tested in the bibliometric indices prediction area: they are used to predict the annual increase of the h-index of the scientific journals belonging to the Journal Citation Reports Neurosciences category across a four-year time horizon, using bibliometric indices. This chapter is based on the published paper [221].

Chapter outline

The remainder of this chapter is organized as follows. Section 6.2 explains some concepts related to cost-sensitive Bayesian classifiers, focusing on the proposed new cost-sensitive selective naive Bayes approaches. Section 6.3 presents the main results, including the dataset construction, the data distribution, the accuracy and average cost of the predictive models and some examples. Finally, Section 6.4 outlines some conclusions, emphasizing the original contribution of the chapter and future research on the topic.

6.2 Cost-sensitive Bayesian classifiers

The objective of cost-sensitive methods is to take into account misclassification costs other than 0 (hit) and 1 (miss). These methods are concerned with both classification accuracy and classification costs. Two forward cost-sensitive selective naive Bayes approaches are developed. The search process of the first approach (CS-SNB-Accuracy) is based on maximizing classification accuracy, that is, it includes variables that improve classification accuracy, whereas the search process of the second approach (CS-SNB-Cost) is based on minimizing misclassification costs, that is, it includes variables that reduce the distances between the actual and the predicted classes. Given a cost matrix and a set of predicted class probabilities for each instance, both approaches readjust the probability thresholds of each class to select the class with the minimum expected cost. The expected cost of each prediction is obtained by multiplying the associated costs by the predicted class probabilities. Unlike selective naive Bayes, these approaches do not select the most likely class value of the posterior distribution; they select the class c* that minimizes the expected cost of predictions given a new instance x:

$$c^* = \arg\min_{c \in \Omega(C)} \sum_{c' \in \Omega(C)} p(c' \mid \mathbf{x}) \, cost(c \mid c') \qquad (6.1)$$

where

$$p(c' \mid \mathbf{x}) \propto p(c') \prod_{i=1}^{m} p(x_i \mid c') \prod_{j=m+1}^{n} \mathcal{N}(x_j; \mu_{c'j}, \sigma^2_{c'j}) \qquad (6.2)$$

and cost(c | c') is the associated misclassification cost. In short, the first approach (CS-SNB-Accuracy) considers adding each variable to the model and measures the performance of the resulting model on the training data. The variable that most improves the accuracy, that is, the percentage of instances whose readjusted class (c*) matches the actual class, is permanently added to the model. In contrast, the second approach (CS-SNB-Cost) considers adding variables that reduce the misclassification cost between the predicted and actual classes.
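A minimal sketch of Equation (6.1) follows, assuming the posterior is given as a vector and the cost matrix is indexed as cost_matrix[c, c'] = cost(c | c'); the linear cost matrix in the usage example is illustrative.

```python
import numpy as np

def min_expected_cost_class(posterior, cost_matrix):
    """Eq. (6.1): posterior[c'] = p(c'|x); return the class minimizing expected cost."""
    expected = cost_matrix @ posterior       # one expected cost per candidate class c
    return int(np.argmin(expected))

# Usage: four ordered classes, linear costs with n = |actual - predicted|
posterior = np.array([0.10, 0.45, 0.40, 0.05])
classes = np.arange(4)
cost = np.abs(classes[:, None] - classes[None, :]).astype(float)
print(min_expected_cost_class(posterior, cost))  # may differ from the argmax class
```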

6.2.1 Cost-sensitive selective naive Bayes - Accuracy

Algorithm 1 shows the pseudocode of the cost-sensitive selective naive Bayes - Accuracy model. This algorithm uses k-fold cross-validation as the procedure for estimating the accuracy and cost of the models when classifying new cases according to the values of the predictive features. The method is stratified, that is, it divides all cases into k disjoint subsets of approximately equal size and equal proportions of class values. Each subset is used to test a model that is learned from the other k-1 subsets. The trainingSetGeneration and testSetGeneration functions provide the required subsets of cases in each iteration. The algorithm initializes the model to the class variable, that is, there are no predictive variables in the model yet. After that, the algorithm saves the accuracy of the resulting model (thresholdAccuracy) for subsequent comparisons. The accuracy threshold is computed by means of the estimateClassProb and max functions, which, respectively, compute the initial class probabilities given the training set and return the highest probability value, that is, the probability of the most frequent class. In each iteration, the algorithm checks whether a specific variable belongs to the model; the isModelVariable function returns true or false according to the current model. The algorithm considers adding each unused variable to the model on a trial basis and measures the performance of the resulting model on the training data. First, the predictClass function computes the predicted class, that is, the most likely class value of the posterior distribution given a case of the training set. Then, the readjustClass function readjusts the probability thresholds of each class to select the class with the minimum expected cost (see Equation 6.1). Finally, the readjusted class and the actual class are used to calculate the model's accuracy (calculateAccuracy) using the selected unused variable in each iteration. After computing the model accuracies of all unused variables, the best variable (selectBestVariable), that is, the variable related to the model with the highest accuracy, is preselected to be added to the final model. If the new accuracy is higher than the current accuracy threshold, then the variable is permanently added to the final model (addToModel). The algorithm terminates when the addition of any variable results in reduced accuracy. During the test phase, the algorithm computes the accuracy (calculateAccuracy) and cost (calculateCost) of the model when classifying the cases belonging to the test set. Finally, the k percentages of well-classified cases and the k misclassification costs are averaged to output the estimated values of the model learned from all cases to classify new cases.

Input : Dataset (feature variables and class variable) and cost matrix
Output: Accuracy and cost of the cost-sensitive selective naive Bayes - Accuracy model

for k ← 1 to folds do                                  // k-fold cross validation
    trainingSet ← trainingSetGeneration(dataset, k)
    testSet ← testSetGeneration(dataset, k)
    // Training phase
    model ← {class}
    initialProbability ← estimateClassProb(trainingSet, model)
    thresholdAccuracy ← max(initialProbability)
    accuracyVector ← {}
    continue ← true
    while continue do
        for variable ← 1 to size(numVariables) do
            sw ← isModelVariable(variable)
            if sw then
                for case ← 1 to size(trainingSet) do
                    actualClass ← getActualClass(case)
                    predictedClass ← predictClass(case, model)
                    readjustedClass ← readjustClass(predictedClass, costMatrix)
                    if (actualClass == readjustedClass) then hit = 1 else hit = 0
                    modelAccuracy ← calculateAccuracy(hit)
                end
            end
            accuracyVector[variable] ← modelAccuracy
        end
        accuracy ← max(accuracyVector)
        if (accuracy > thresholdAccuracy) then
            bestVariable ← selectBestVariable(accuracyVector)
            model ← addToModel(model, bestVariable)
            thresholdAccuracy ← accuracy
        else
            continue ← false
        end
    end
    // Test phase
    for case ← 1 to size(testSet) do
        actualClass ← getActualClass(testSet, case)
        predictedClass ← predictClass(testSet, model)
        readjustedClass ← readjustClass(predictedClass, costMatrix)
        if (actualClass == readjustedClass) then hit = 1 else hit = 0
        accuracy ← calculateAccuracy(hit)
        cost ← calculateCost(actualClass, readjustedClass)
    end
    finalAccuracyVector ← addToVector(accuracy)
    finalCostVector ← addToVector(cost)
end
finalAccuracy ← mean(finalAccuracyVector)
finalCost ← mean(finalCostVector)

Algorithm 1: Cost-sensitive selective naive Bayes - Accuracy model
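Stripped of the cross-validation scaffolding, the core of Algorithms 1 and 2 is a greedy forward-selection loop. The Python sketch below captures that shared loop under one assumed callback, evaluate(model_vars), which would train a naive Bayes on the given variables and return the criterion to maximize (accuracy for CS-SNB-Accuracy, negative average cost for CS-SNB-Cost); it is an illustration, not the thesis implementation.

```python
def forward_selection(all_vars, evaluate):
    """Greedily add the variable that most improves the chosen criterion."""
    model_vars = []
    best_score = evaluate([])                 # class-only model as the baseline
    while True:
        candidates = [v for v in all_vars if v not in model_vars]
        if not candidates:
            return model_vars
        # Try each unused variable on a trial basis
        scored = [(evaluate(model_vars + [v]), v) for v in candidates]
        score, best_var = max(scored)
        if score <= best_score:               # no variable improves: stop
            return model_vars
        model_vars.append(best_var)           # permanently add the best variable
        best_score = score
```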

6.2.2 Cost-sensitive selective naive Bayes - Cost

Algorithm 2 shows the pseudocode of the cost-sensitive selective naive Bayes - Cost model. This algorithm also uses k-fold cross-validation as the procedure for estimating the accuracy and cost of the models. The algorithm initializes the model to the class variable, that is, there are no predictive variables in the model yet. After that, the algorithm saves the misclassification cost of the resulting model (thresholdCost) for subsequent comparisons. The cost threshold is computed by means of the estimateClassProb, classCostEstimation and min functions, which, respectively, compute the initial class probabilities given the training set and the model, compute the initial cost given the initial probabilities and the cost matrix, and return the lowest cost value. In each iteration, the algorithm checks whether a specific variable belongs to the model; the isModelVariable function returns true or false according to the current model. The algorithm considers adding each unused variable to the model on a trial basis and measures the average cost of the resulting model on the training data. First, the predictClass function computes the predicted class, that is, the most likely class value of the posterior distribution given a case of the training set. Then, the readjustClass function readjusts the probability thresholds of each class to select the class with the minimum expected cost (see Equation 6.1). Finally, the readjusted class and the actual class are used to calculate the model's cost (calculateCost) using the selected unused variable in each iteration. After computing the costs of all models, the best variable (selectBestVariable), that is, the variable associated with the model with the lowest misclassification cost, is preselected to be added to the final model. If the new model's cost is lower than the current cost threshold, then the variable is permanently added to the model (addToModel). The algorithm terminates when the addition of any variable results in a higher cost. During the test phase, the algorithm computes the accuracy (calculateAccuracy) and cost (calculateCost) of the model when classifying the cases belonging to the test set. Finally, the k percentages of well-classified cases and the k misclassification costs are averaged to output the estimated values of the model learned from all cases to classify new cases.

Input : Dataset (feature variables and class variable) and cost matrix
Output: Accuracy and cost of the cost-sensitive selective naive Bayes - Cost model

for k ← 1 to folds do                                  // k-fold cross validation
    trainingSet ← trainingSetGeneration(dataset, k)
    testSet ← testSetGeneration(dataset, k)
    // Training phase
    model ← {class}
    initialProbability ← estimateClassProb(trainingSet, model)
    initialCost ← classCostEstimation(initialProbability, costMatrix)
    thresholdCost ← min(initialCost)
    costVector ← {}
    continue ← true
    while continue do
        for variable ← 1 to size(numVariables) do
            sw ← isModelVariable(variable)
            if sw then
                for case ← 1 to size(trainingSet) do
                    actualClass ← getActualClass(case)
                    predictedClass ← predictClass(case, model)
                    readjustedClass ← readjustClass(predictedClass, costMatrix)
                    modelCost ← calculateCost(actualClass, readjustedClass)
                end
            end
            costVector[variable] ← modelCost
        end
        cost ← min(costVector)
        if (cost < thresholdCost) then
            bestVariable ← selectBestVariable(costVector)
            model ← addToModel(model, bestVariable)
            thresholdCost ← cost
        else
            continue ← false
        end
    end
    // Test phase
    for case ← 1 to size(testSet) do
        actualClass ← getActualClass(testSet, case)
        predictedClass ← predictClass(testSet, model)
        readjustedClass ← readjustClass(predictedClass, costMatrix)
        if (actualClass == readjustedClass) then hit = 1 else hit = 0
        accuracy ← calculateAccuracy(hit)
        cost ← calculateCost(actualClass, readjustedClass)
    end
    finalAccuracyVector ← addToVector(accuracy)
    finalCostVector ← addToVector(cost)
end
finalAccuracy ← mean(finalAccuracyVector)
finalCost ← mean(finalCostVector)

Algorithm 2: Cost-sensitive selective naive Bayes - Cost model

6.3 Predicting the h-index of Neuroscience journals

6.3.1 Dataset compilation

Web of Science and Journal Citation Reports are selected as the sources for publication and citation data. First, all journals belonging to the Journal Citation Reports Neurosciences category from 2000 to 2011 are selected; there were 269 journals in this category during the analyzed period. Then the publication list and citation data for these journals were obtained from the Web of Science. All documents (1,044,811 papers) published by the 269 journals were downloaded, up to 2011. Using this information, several scientific impact indices (documents, citations, the h-index, the g-index, the hg-index, the a-index, the m-index, the q²-index, the hr-index, the hi-index and the hc-index) were calculated for each journal from 2000 to 2011. Furthermore, other journal-specific index values (impact factor, immediacy index, cited half-life, eigenfactor and article influence) were downloaded from Journal Citation Reports. Finally, all information was stored in a database designed for this purpose.

6.3.2 Data distribution

After collecting the publication list and citation data of all journals, the number of cases available to build the predictive models varies depending on the year. Journal data from 2000 to 2010 (2305 cases) were used to construct the models assigned to the first year. The models for the second year used journal data from 2000 to 2009 (2037 cases). Finally, the predictive models for the third and fourth years used journal data from 2000 to 2008 (1785 cases) and from 2000 to 2007 (1449 cases), respectively. Clearly, the longer the prediction horizon, the fewer cases were available to induce the models.

Figure 6.1: Distribution of the increase of the h-index for the different prediction years. Each of the four panels (first to fourth year) plots the frequency of journals against the h-index increment.

Figure 6.1 shows the distribution of the selected journals according to the annual increase of their h-index value within the first four years. Taking the first year as an example, the lowest and highest increments of the h-index were ∆h=0 (128 journals) and ∆h=24 (1 journal), respectively. Note that 457 journals had an increase of ∆h=3, which was the mode value for the first year. Regarding the second year, the minimum value was ∆h=0 (42 journals), the maximum value was ∆h=45 (1 journal) and the mode value was ∆h=6 (235 journals). Finally, the h-index value increased from ∆h=0 to ∆h=65 for the third-year models and from ∆h=0 to ∆h=81 for the fourth-year models; their mode values were ∆h=8 (163 journals) and ∆h=12 (110 journals), respectively. The class variable values were discretized into four intervals with equal frequency, so each increment of the h-index was assigned to one of four possible class values (low, medium-low, medium-high and high). In this way, the first-year models were discretized as low (∆h=[0-1]), medium-low (∆h=[2]), medium-high (∆h=[3-4]) and high (∆h=[≥5]), whereas the fourth-year models were discretized as low (∆h=[0-8]), medium-low (∆h=[9-12]), medium-high (∆h=[13-18]) and high (∆h=[≥19]). The correspondence between ∆h values and class labels for all models is shown in Table 6.1.
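A brief sketch of this equal-frequency discretization using pandas.qcut, on toy data; the real cut points (Table 6.1) come from the actual distribution of ∆h.

```python
import pandas as pd

# Toy yearly h-index increments; the thesis uses the real journal data
dh = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 12])
labels = ["low", "medium-low", "medium-high", "high"]
classes = pd.qcut(dh, q=4, labels=labels)   # four equal-frequency bins
print(classes.value_counts())
```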

6.3.3 Predictive models

This section compares the proposed approaches with the standard formulation of selective naive Bayes in order to determine whether their accuracy and average cost values are reasonable. Table 6.2 shows the estimated accuracy and the average cost for each model. Numbers in boldface represent the highest accuracy value and the lowest cost value for each model. The proposed methods were tested with different cost matrices (C(0, n), C(0, n²), C(0, 2ⁿ) and C(0, nⁿ)). The cost matrix C(0, n) represents costs where correct classification has no cost and incorrect classification has a linear cost; similarly, C(0, n²), C(0, 2ⁿ) and C(0, nⁿ) represent costs where correct classification has no cost and incorrect classification has quadratic or exponential costs. A sketch of these matrices is given below. Using the cost matrix C(0, nⁿ), for example, it is observed that the proposed models almost always outperform the selective naive Bayes models, with higher accuracy and lower cost. Although selective naive Bayes achieved the highest accuracy (0.504) in the first year, the two new algorithms, specifically CS-SNB-Accuracy, achieved the highest accuracy in the second year (0.518), third year (0.542) and fourth year (0.532).
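A minimal NumPy sketch of the four cost-matrix families, assuming n denotes the absolute distance |actual − predicted| between ordinal classes (an interpretation consistent with the chapter's ordinal-cost motivation, not stated explicitly in the text):

```python
import numpy as np

def cost_matrix(kind, num_classes=4):
    """Zero cost on the diagonal; off-diagonal cost grows with class distance n."""
    i, j = np.indices((num_classes, num_classes))
    n = np.abs(i - j).astype(float)
    if kind == "n":
        return n
    if kind == "n2":
        return n ** 2
    if kind == "2^n":
        return np.where(n == 0, 0.0, 2.0 ** n)   # force zero cost for hits
    if kind == "n^n":
        return np.where(n == 0, 0.0, n ** n)
    raise ValueError(kind)
```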

Table 6.1: Correspondence between ∆h values and class labels (low, medium-low, medium-high and high) after discretization with equal frequency

                      First-year   Second-year   Third-year   Fourth-year

Low values            0-1          0-4           0-6          0-8
Medium-Low values     2            5-6           7-9          9-12
Medium-High values    3-4          7-9           10-14        13-18
High values           ≥5           ≥10           ≥15          ≥19

Table 6.2: Accuracy and average cost of models which are learned using different selective naive Bayes approaches and cost matrices

                         First year        Second year       Third year        Fourth year
Methods                  Accur   Cost      Accur   Cost      Accur   Cost      Accur   Cost

Cost matrix: C(0, n)
Selective naive Bayes    0.502   0.608     0.506   0.644     0.530   0.563     0.517   0.584
CS-SNB-Accuracy          0.477   0.610     0.513   0.579     0.530   0.543     0.528   0.534
CS-SNB-Cost              0.458   0.596     0.519   0.577     0.534   0.538     0.532   0.546

Cost matrix: C(0, n²)
Selective naive Bayes    0.503   0.828     0.501   1.005     0.525   0.758     0.532   0.742
CS-SNB-Accuracy          0.460   0.721     0.498   0.873     0.515   0.766     0.525   0.713
CS-SNB-Cost              0.451   0.735     0.514   0.775     0.532   0.708     0.533   0.706

Cost matrix: C(0, 2ⁿ)
Selective naive Bayes    0.507   1.211     0.509   1.299     0.514   1.171     0.518   1.170
CS-SNB-Accuracy          0.480   1.221     0.519   1.156     0.538   1.069     0.523   1.101
CS-SNB-Cost              0.460   1.187     0.507   1.168     0.533   1.080     0.532   1.086

Cost matrix: C(0, nⁿ)
Selective naive Bayes    0.504   1.227     0.506   1.327     0.526   1.133     0.516   1.190
CS-SNB-Accuracy          0.446   0.953     0.518   0.769     0.542   0.714     0.532   0.732
CS-SNB-Cost              0.419   0.753     0.500   0.772     0.513   0.695     0.516   0.705

By average cost, the proposed models, specifically CS-SNB-Cost, always achieve a lower cost than selective naive Bayes. Taking the first year as an example, the cost associated with selective naive Bayes is 1.227, whereas the cost of CS-SNB-Cost is 0.753. Focusing on the cost matrices, accuracy varies across models and prediction years, and no general pattern was found. The selective naive Bayes model achieved the highest accuracy value for the first year (0.507) using the cost matrix C(0, 2ⁿ). In contrast, the CS-SNB-Accuracy model obtained the highest accuracy values in the second year (0.519) and third year (0.542) with the cost matrices C(0, 2ⁿ) and C(0, nⁿ), respectively. Finally, CS-SNB-Cost achieved the highest accuracy value in the fourth year (0.533) using the cost matrix C(0, n²). Regarding costs, the lowest and highest average costs were achieved with C(0, n) and C(0, 2ⁿ); CS-SNB-Cost almost always achieved the lowest average cost with the C(0, n) matrix. Regarding each algorithm, note that selective naive Bayes achieved the highest accuracy values for the first-year models no matter which cost matrix was used. In contrast, CS-SNB-Accuracy and CS-SNB-Cost predicted almost all the values more accurately than selective naive Bayes for the other prediction years. Given the cost matrix C(0, 2ⁿ), for example, selective naive Bayes achieved the highest accuracy (0.507) for the first-year models, CS-SNB-Accuracy achieved the highest accuracy for the second- (0.519) and third-year (0.538) models, and CS-SNB-Cost achieved the highest accuracy (0.532) for the fourth-year models. Analyzing average cost, the selective naive Bayes value was never the lowest; the lowest average cost values were always achieved by the cost-sensitive models, and CS-SNB-Cost usually obtained the lowest value. Given the cost matrix C(0, n²), for example, CS-SNB-Accuracy achieved the lowest average cost (0.721) for the first-year models, whereas CS-SNB-Cost achieved the lowest values for the second- (0.775), third- (0.708) and fourth-year (0.706) models. To summarize, the proposed cost-sensitive approaches, particularly CS-SNB-Cost, almost always achieved a lower average cost than selective naive Bayes, and, especially in the case of CS-SNB-Accuracy, often obtained higher accuracy values than selective naive Bayes. In order to compare the performance of the proposed algorithms with standard classifiers, Table 6.3 shows the accuracy and average cost of a set of classifiers (naive Bayes, selective naive Bayes, CS-SNB-Accuracy, CS-SNB-Cost, C4.5 decision tree, k-nearest neighbour and logistic regression) learned using the cost matrix C(0, nⁿ) for all prediction years.

Table 6.3: Accuracy and average cost of models which are learned using different classification methods. Results are achieved using the cost matrix C(0, nⁿ) for all prediction years

                         First year        Second year       Third year        Fourth year
Methods                  Accur   Cost      Accur   Cost      Accur   Cost      Accur   Cost

NB                       0.262†  3.138†    0.343†  2.815†    0.306†  2.718†    0.303†  2.779†
SNB                      0.504†  1.227†    0.506†  1.327†    0.526†  1.133†    0.516†  1.190†
CS-SNB-Accuracy          0.446†  0.953†    0.518†  0.769     0.542†  0.714     0.532†  0.732
CS-SNB-Cost              0.419†  0.753     0.500†  0.772     0.513†  0.695     0.516†  0.705
C4.5                     0.525†  1.204†    0.598   1.073†    0.640   0.793†    0.654   0.879†
K-NN                     0.553†  1.113†    0.609   1.080†    0.643   0.803†    0.655   0.857†
Logistic                 0.587   0.978†    0.587†  1.058†    0.622   0.866†    0.625   0.878†

Naive Bayes (NB); Selective naive Bayes (SNB); C4.5 decision tree (C4.5); K-nearest neighbour (K-NN); Logistic regression (Logistic)

Results are evaluated using two non-parametric tests, the Kruskal-Wallis test and the Mann-Whitney test, which analyze whether samples could have come from the same distribution; the significance level was 0.05 in all cases (a sketch of this protocol follows). Analyzing the accuracy values, three different groups (low, medium and high values) can be distinguished. The first group is composed of the naive Bayes classifier, which achieved the lowest values. The second group is composed of three classifiers (selective naive Bayes, CS-SNB-Accuracy and CS-SNB-Cost) that achieve medium values, whereas the third group is composed of the non-Bayesian classifiers (C4.5 decision tree, k-nearest neighbour and logistic regression), which have the highest values. This behavior holds no matter which prediction year is used. The results of the Kruskal-Wallis test showed that there were significant differences among the seven classifiers on the basis of accuracy, so Mann-Whitney tests were run in order to find out which classifiers rank better according to this criterion. The benchmark classifier, which had the highest average value, was compared with the other classifiers. Classifiers marked in Table 6.3 with the symbol † had statistically significant differences with respect to the benchmark classifier (highlighted in boldface). Taking the second-year model as an example, the results show significant differences between k-nearest neighbour (the benchmark classifier) and naive Bayes, selective naive Bayes, CS-SNB-Accuracy, CS-SNB-Cost and logistic regression. In contrast, the results do not show statistically significant differences between k-nearest neighbour and the C4.5 decision tree.
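The testing protocol just described might look as follows in SciPy; fold_scores is an assumed mapping from classifier name to its per-fold accuracies, and the dagger-marking logic is only illustrative.

```python
from scipy.stats import kruskal, mannwhitneyu

def compare_to_benchmark(fold_scores, benchmark, alpha=0.05):
    """Kruskal-Wallis across all classifiers, then pairwise Mann-Whitney tests."""
    _, p_global = kruskal(*fold_scores.values())
    if p_global >= alpha:
        return {}                     # no overall difference detected
    flagged = {}
    for name, scores in fold_scores.items():
        if name == benchmark:
            continue
        _, p = mannwhitneyu(fold_scores[benchmark], scores)
        flagged[name] = p < alpha     # True would mean a dagger in the table
    return flagged
```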

Table 6.4: Accuracy and average cost of models learned using different cost-sensitive approaches. Values achieved using the cost matrix C(0, 2n) for all prediction years

                          First year      Second year     Third year      Fourth year
Methods                   Accur   Cost    Accur   Cost    Accur   Cost    Accur   Cost
MetaCost                  0.116†  1.770†  0.315†  1.597†  0.326†  1.347†  0.188†  1.632†
CostSensitiveClassifier   0.253†  1.866†  0.329†  2.382†  0.300†  2.005†  0.238†  1.842†
CSRoulette                0.432†  2.878†  0.517   2.923†  0.521†  2.923†  0.526   2.892†
CS-SNB-Accuracy           0.480   1.221†  0.519   1.156   0.538   1.069   0.523   1.101
CS-SNB-Cost               0.460†  1.187   0.507†  1.168   0.533   1.080   0.532   1.086

Regarding the cost values, three groups can also be differentiated. In this case, naive Bayes and selective naive Bayes achieved the highest costs, whereas the C4.5 decision tree, k-nearest neighbour and logistic regression achieved medium costs. Finally, the proposed classifiers, CS-SNB-Accuracy and CS-SNB-Cost, achieved the lowest costs. A Kruskal-Wallis test was also performed in order to compare the classifiers on the basis of average cost. Taking the second-year model as an example, results show that there were significant differences between CS-SNB-Cost and naive Bayes, selective naive Bayes, the C4.5 decision tree, k-nearest neighbour and logistic regression. In contrast, results do not show statistically significant differences between the two proposed cost-sensitive algorithms.

Analyzing different cost-sensitive approaches, the proposed algorithms are compared with other cost-sensitive algorithms like MetaCost, CostSensitiveClassifier and CSRoulette. These classifiers convert existing cost-insensitive classifiers (e.g., naive Bayes) into cost-sensitive ones. Table 6.4 shows the accuracy and average cost of the above classifiers, learned using the cost matrix C(0, 2n) for all prediction years. Focusing on accuracy and cost values, Table 6.4 shows that the proposed models outperform the other cost-sensitive classifiers no matter which prediction year is used. Taking the first year as an example, CS-SNB-Accuracy achieved the highest accuracy (0.480), whereas CS-SNB-Cost achieved the lowest cost (1.187). The results of the Kruskal-Wallis test showed that there were significant differences among the classifiers on the basis of accuracy and cost. Mann-Whitney tests were therefore run in order to find out which classifiers rank better according to these criteria. The benchmark classifier, which had the best average value, is compared with the other classifiers. Classifiers marked in Table 6.4 with the symbol † had statistically significant differences with respect to the benchmark classifier (highlighted in boldface). Results show that there were significant differences between CS-SNB-Accuracy (the benchmark classifier) and MetaCost, CostSensitiveClassifier, CSRoulette and CS-SNB-Cost in terms of accuracy. Results also show that there were significant differences between CS-SNB-Cost (the benchmark classifier) and MetaCost, CostSensitiveClassifier, CSRoulette and CS-SNB-Accuracy in terms of cost. To summarize, the proposed cost-sensitive approaches, particularly CS-SNB-Cost, achieved a lower average cost than the other cost-sensitive classifiers. Also, our approaches, especially CS-SNB-Accuracy, obtained higher accuracy values than the other cost-sensitive classifiers.

Let us now analyze the models in more detail. Table 6.5 shows the specific variables, accuracy and cost for the CS-SNB-Cost model for each fold of the cross-validation process.

Table 6.5: Variables, accuracy and cost for the CS-SNB-Cost model for each fold of the cross-validation process. Values achieved using the cost matrix C(0, n2) for all prediction years

First year                              Second year
k    Variables   Accuracy   Cost        k    Variables   Accuracy   Cost
1    12,11       0.456      0.721       1    12,11,14    0.497      0.866
2    12,11       0.478      0.756       2    12,16,14    0.566      0.783
3    12,11       0.465      0.713       3    12,16,14    0.517      0.733
4    12,14       0.391      0.834       4    12,16,14    0.492      0.847
5    12,14       0.521      0.647       5    12,16       0.507      0.783
6    12,11       0.456      0.713       6    12,11,14    0.492      0.788
7    12,11       0.439      0.730       7    12,16,14    0.517      0.793
8    12,11       0.447      0.682       8    12,16,14    0.566      0.596
9    12,14       0.426      0.765       9    12,16       0.522      0.778
10   12,11       0.426      0.782       10   12,16,14    0.458      0.778

Mean values      0.451      0.735                        0.514      0.775

Third year                              Fourth year
k    Variables   Accuracy   Cost        k    Variables   Accuracy   Cost
1    12,14,16    0.522      0.707       1    12,14,16    0.551      0.662
2    12,11,14    0.500      0.797       2    12,14,6     0.564      0.759
3    12,11,14    0.544      0.623       3    16,12,14    0.493      0.753
4    12,16,14    0.533      0.752       4    12,16,14    0.525      0.701
5    12,14,7     0.533      0.685       5    12,16,14    0.558      0.636
6    12,14,6     0.561      0.775       6    12,16,14    0.538      0.655
7    12,11,14    0.556      0.646       7    12,16,14    0.545      0.668
8    12,14,7     0.522      0.752       8    12,14,4     0.506      0.766
9    12,14,4     0.511      0.707       9    12,16,14    0.493      0.798
10   12,14,4     0.533      0.634       10   12,14,6     0.538      0.655

Mean values      0.532      0.708                        0.533      0.706

These values were achieved using the cost matrix C(0, n2) for all prediction years. Analyzing Table 6.5, we find that the models always include the impact factor (variable 12). The models also usually include other variables like the hc-index (variable 11), the cited half-life (variable 14) and the article influence (variable 16). Fewer models include the g-index (variable 4), the a-index (variable 6) and the m-index (variable 7). Note that first-year models always had two variables, whereas second-, third- and fourth-year models almost always had three variables. Finally, the feature variables of the model that was most often induced included the impact factor, the cited half-life and the article influence. Not all the models included these variables in all situations; this depends on the cost matrix and prediction year. Other models were therefore formed by different variables, although the impact factor, the cited half-life and the article influence were also usually present.
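The per-fold variability in Table 6.5 comes from re-running the wrapper search on each training fold. The sketch below illustrates the kind of greedy forward selection loop involved; `evaluate` is a hypothetical callback that would return the cross-validated average cost of a cost-sensitive selective naive Bayes restricted to a candidate feature subset.

# A minimal sketch of a greedy forward wrapper search, assuming `evaluate`
# (hypothetical) returns the average misclassification cost of a model
# restricted to the given feature subset (lower is better).
def forward_selection(all_features, evaluate):
    selected, best_cost = [], float("inf")
    improved = True
    while improved:
        improved = False
        for f in all_features:
            if f in selected:
                continue
            cost = evaluate(selected + [f])
            if cost < best_cost:
                best_cost, best_feature, improved = cost, f, True
        if improved:
            selected.append(best_feature)
    return selected, best_cost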

6.3.4 Exploiting the proposed models

As an example, the increase of the h-index value of a Neurosciences journal in the first year was predicted. Table 6.6 shows the parameters that define the model learned using the cost matrix C(0, n). All features are described by means of the mean (µ) and the standard deviation (σ).

Table 6.6: Parameters that define a specific cost-sensitive selective naive Bayes classifier for first-year models. Feature variables belonging to this classifier are: impact factor, cited half-life and article influence.

                  impact factor          cited half-life        article influence

∆h=low            µ=0.977  σ=0.906       µ=5.573  σ=3.328       µ=0.318  σ=0.354
∆h=medium-low     µ=1.781  σ=1.021       µ=6.335  σ=2.394       µ=0.608  σ=0.414
∆h=medium-high    µ=2.728  σ=1.724       µ=5.985  σ=2.205       µ=0.950  σ=0.830
∆h=high           µ=5.774  σ=4.802       µ=5.643  σ=2.060       µ=2.660  σ=3.202

Table 6.7: Results predicted by cost-sensitive selective naive Bayes classifiers (CS-SNB-Accuracy and CS-SNB-Cost) for different years

                      First year   Second year   Third year   Fourth year

CS-SNB-Accuracy
  low                 0.076        0.212         0.126        0.101
  medium-low          0.391        0.387         0.454        0.477
  medium-high         0.473        0.317         0.338        0.324
  high                0.060        0.084         0.082        0.098

CS-SNB-Cost
  low                 0.070        0.159         0.126        0.101
  medium-low          0.291        0.410         0.454        0.477
  medium-high         0.504        0.330         0.338        0.324
  high                0.135        0.101         0.082        0.098

Given a journal (x) with the following values: impact factor=2.582, cited half-life=5.6 and article influence=0.852, the ∆h values can be predicted using the formulation of the cost-sensitive selective naive Bayes (Algorithm 1 and Algorithm 2) and the parameters listed in Table 6.6. After propagating the above evidence, the results predicted by the CS-SNB-Accuracy model were p(∆h=low|x)=0.076, p(∆h=medium-low|x)=0.391, p(∆h=medium-high|x)=0.473 and p(∆h=high|x)=0.060. Similarly, the CS-SNB-Cost model predicted p(∆h=low|x)=0.070, p(∆h=medium-low|x)=0.291, p(∆h=medium-high|x)=0.504 and p(∆h=high|x)=0.135. According to both approaches, the h-index of the above journal is likely to increase by three or four units (medium-high) in the next year. Table 6.7 shows the predictions for the other years under the same conditions. Both models predicted the same class for all years. Note that the predicted probabilities differ between first- and second-year models but are identical for third- and fourth-year models; this is because the induced first- and second-year models were different. Results show that the increase of the h-index for the above journal (x) will be medium-high (∆h=[3-4]) in the first year, and medium-low in the second (∆h=[5-6]), third (∆h=[7-9]) and fourth (∆h=[9-12]) years.
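To make the propagation step concrete, the following sketch reproduces a plain Gaussian naive Bayes posterior computation from the Table 6.6 parameters, followed by a minimum-expected-cost decision. The uniform class priors and the distance-based cost matrix are simplifying assumptions made here for illustration, so the numbers will not exactly match the thesis figures.

# A sketch of evidence propagation in a Gaussian (selective) naive Bayes,
# using the parameters of Table 6.6. Uniform priors and the toy cost matrix
# below are assumptions for illustration only.
import numpy as np
from scipy.stats import norm

params = {  # class -> (mu, sigma) for impact factor, cited half-life, article influence
    "low":         [(0.977, 0.906), (5.573, 3.328), (0.318, 0.354)],
    "medium-low":  [(1.781, 1.021), (6.335, 2.394), (0.608, 0.414)],
    "medium-high": [(2.728, 1.724), (5.985, 2.205), (0.950, 0.830)],
    "high":        [(5.774, 4.802), (5.643, 2.060), (2.660, 3.202)],
}
x = [2.582, 5.6, 0.852]  # the journal of the running example

# p(c|x) is proportional to p(c) * prod_i N(x_i; mu_ci, sigma_ci)
lik = {c: np.prod([norm.pdf(v, m, s) for v, (m, s) in zip(x, p)])
       for c, p in params.items()}
posterior = {c: l / sum(lik.values()) for c, l in lik.items()}

# Cost-sensitive decision: pick the class minimizing the expected cost.
classes = list(params)
cost = {t: {p: abs(i - j) for j, p in enumerate(classes)}  # toy cost(true, pred)
        for i, t in enumerate(classes)}
expected = {p: sum(posterior[t] * cost[t][p] for t in classes) for p in classes}
print(posterior, min(expected, key=expected.get))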

6.4 Discussion and conclusions

The machine learning community is not only interested in maximizing classification accuracy, but also in minimizing the expected total cost associated with misclassifications. Ideas like the cost-sensitive learning approach have been proposed to face this problem. In this chapter, two greedy wrapper forward cost-sensitive selective naive Bayes approaches are proposed. Both approaches include the misclassification costs in the learning process, readjusting the probability thresholds of each class to select the class with the minimum expected cost. In this context, the search process of the first approach (CS-SNB-Accuracy) includes variables that improve the classification accuracy of the model, whereas the search process of the second approach (CS-SNB-Cost) includes variables that reduce the distance between the actual and the predicted classes.

The proposed algorithms have been tested in the bibliometric indices prediction area. Considering the popularity of the well-known h-index, several prediction models, based on the cost-sensitive approach and feature subset selection, are learned to forecast the annual increase of the h-index for Neurosciences journals in a four-year time horizon. Models capable of predicting the h-index that a scientific journal is likely to have in coming years can be a useful tool for the scientific community.

Results show that our approaches, especially CS-SNB-Accuracy, achieved higher accuracy values than the analyzed cost-sensitive classifiers and Bayesian classifiers. Furthermore, we also noted that CS-SNB-Cost always achieved a lower average cost than all the analyzed cost-sensitive and cost-insensitive classifiers. These cost-sensitive selective naive Bayes approaches outperform the original selective naive Bayes in terms of accuracy and average cost, so the cost-sensitive learning approach could be used in different probabilistic classification approaches. In the future, new cost-sensitive Bayesian classifiers, like the selective tree augmented naive Bayes, could include other journal-based features, e.g., the 5-year impact factor, the percentage of documents published by a single author, the percentage of documents published in international collaboration and so on.

Chapter 7

Discovering relationships among indices using Bayesian networks

7.1 Introduction

Although bibliometric indices are usually used to evaluate the impact of a researcher's work, they are also used as a tool for journal evaluation [171]. The first bibliometric index to be calculated for journal assessment was the impact factor [172]. More recently, Braun [60] suggested that the well-known h-index could be usefully applied to evaluate the scientific impact of journals. Other indices like the eigenfactor, article influence and Scimago's journal rank index, among others, have also been developed for the same purpose.

In view of the vast number of bibliometric indices, it is necessary to analyze how they relate to each other. The degree of correlation among journal citation indices has been investigated in the past. Some studies have examined correlations between a list of journal citation indices using the Pearson ρ coefficient. Bollen [48] found statistically significant correlations between 39 measures of scholarly impact. Franceschet [154] made a thorough comparison of the rankings provided by the eigenfactor and the 5-year impact factor. The author found that, although the two bibliometric measures are generally statistically correlated, they also significantly diverge in some cases. Furthermore, Leydesdorff [287] showed high correlations between indices, especially between the 5-year impact factor and article influence (ρ=0.956). Similarly, other studies have examined the degree of correlation between some typical journal citation indices, tested using Spearman's ρ. For example, Bollen et al. [47] compared journal PageRank with the 2-year impact factor. They found a moderate correlation (ρ=0.63) between rankings for computer science journals. Chen et al. [86] and Ma et al. [301] found a Spearman correlation of 0.91 and 0.98, respectively, between the article rankings provided by PageRank and the total number of citations. Davis [114] compared the rankings according to the eigenfactor and the 2-year impact factor. The author found a significant correlation between the two measures (ρ=0.84), and an even higher association (ρ=0.95) between the eigenfactor and the total number of citations. Also, Franceschet [155] showed strong correlations between the


2-year impact factor and the 5-year impact factor (ρ=0.96). Finally, Saad [389] noticed that the 2-year impact factor was more correlated with article influence than with the eigenfactor. To date, there have not been many publications analyzing citations in computer science and artificial intelligence. In this context, Serenko [409] analyzed journals in the field of artificial intelligence and calculated some bibliometric index values (h-index, g-index and hc-index), which correlated almost perfectly with each other (ranging from 0.97 to 0.99). It should be mentioned that most bibliometric indices are obviously correlated, since they are all derived from the number of documents and citations, and these are highly correlated.

The interest and originality of this analysis is that it introduces a new Bayesian network-based approach for analyzing the conditional (in)dependencies between journal citation indices. Several Bayesian networks (yearly and global models) are learned to discover the relationships between 14 journal citation indices. These models are built using journal publication and citation data (all the journals in the JCR Computer Science and Artificial Intelligence category) during the period 2000-2009, inclusive. Yearly and global models are developed to analyze index relationships within a one-year publication and citation window. Finally, it is analyzed how some indices influence others in probabilistic terms. Also, the network is able to perform all kinds of probabilistic reasoning, computing, say, the probability of a journal obtaining certain fixed index values given other known values.

The main advantage of this analysis over earlier studies, which analyze only bivariate correlations between indices, is that it calculates the joint probability distribution over all the analyzed indices, discovering probabilistic conditional (in)dependencies among triplets of indices. Journal citation indices have never been analyzed like this before. Using the proposed models, computer science and artificial intelligence journal editorial boards could answer questions related to their journal citation indices, like, for example: what would happen to our journal's impact factor if the papers published by our journal received more citations? What would happen to our journal's h-index if our journal published a lot of new documents? What would happen to our journal's g-index if our most cited papers were the only ones to receive new citations? Or what would happen to our journal's immediacy-index if our journal accepted more documents, but received no new citations? Obviously, this is a general-purpose methodology that can be used for other research areas, and the editorial board of any journal could find answers to the above questions. This work appears in the published paper [225].

Chapter outline

The remainder of the chapter is organized as follows. Section 7.2 presents the dataset used, the Bayesian networks learned, the discovered probabilistic conditional (in)dependencies among the analyzed indices and examples of probabilistic reasoning. Finally, Section 7.3 contains some conclusions emphasizing the original contribution of the chapter and future research on the topic.

7.2 Analyzing conditional (in)dependencies among indices

Bayesian network models are selected to discover conditional independencies among bibliometric indices. In particular, each node in the network represents a specific index, while the arcs between indices represent the conditional (in)dependencies among these indices. By learning these Bayesian networks from data, the aim is to discover probabilistic conditional (in)dependencies among the set of bibliometric indices. The indices analyzed in this study are: documents, citations, h-index, g-index, hg-index, a-index, m-index, q2-index, r-index, e-index, w-index, hr-index, impact factor and immediacy-index. All these indices were obtained from the information provided by the Web of Science. Although most of the above indices were originally developed to evaluate the quality of a researcher's work, they have been adapted to assess journals.

7.2.1 Dataset compilation

The Web of Science platform is selected as the source to download publication and citation data. Firstly, journal data from the JCR Computer Science and Artificial Intelligence category is collected. There are 94 journals in this category of the 2008 Journal Citation Reports Science Edition. In view of the objective of this analysis, only the 70 journals (Table 7.1) that published papers from January 1, 2000 to December 31, 2009 were taken into account. The next step was to obtain the publication list and citation data for these journals. Finally, the last step was to use all the above information to calculate some scientific impact indices (documents, citations, the h-index, the g-index, the hg-index, the a-index, the m-index, the q2-index, the r-index, the e-index, the w-index, the rational h-index, the impact factor and the immediacy-index) associated with the selected journals. These index values have been calculated yearly for each of the 70 journals in the ten-year period from 2000 to 2009.
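As an illustration of this index-computation step, the following helpers sketch how two of these indices can be derived from a journal's per-paper citation counts within a window; the function names and the example data are ours, not from the thesis software.

# Illustrative helpers (our own naming): compute the h-index and g-index of a
# journal from the citation counts of its papers in a given window.
def h_index(citations):
    ranked = sorted(citations, reverse=True)
    # h = number of papers with at least as many citations as their rank
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def g_index(citations):
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:   # top g papers jointly have >= g^2 citations
            g = rank
    return g

papers_2009 = [12, 9, 7, 4, 4, 2, 1, 0, 0]   # hypothetical citation counts
print(h_index(papers_2009), g_index(papers_2009))   # 4 6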

7.2.2 Data distribution

To illustrate some of the calculated index values, Table 7.2 lists the journals ranked top according to six selected indices: documents, citations, the h-index, the g-index, the impact factor and the immediacy-index. Table 7.2 shows the rankings obtained using data for a one-year publication window, specifically for 2009. These rankings reveal that some journals are always positioned near the top. Taking IEEE Transactions on Pattern Analysis and Machine Intelligence as an example, this journal published 188 papers in 2009, which received 54 citations in the same year. Furthermore, its h-index and g-index values were 3 and 4, respectively. Finally, it had an impact factor value of 5.960 and an immediacy-index value of 0.669. Remember that all these values were obtained using a 2009 publication window.

Table 7.3 illustrates the range of index values in each analyzed year. Taking the documents value for the year 2000 as an example, the minimum number of documents published by a specific journal in that year was 10 and the maximum was 219. Analyzing this table, it is found that index values are higher for recent years than for older ones.

Table 7.1: List of journals in JCR Computer Science and Artificial Intelligence category that have papers every year throughout the 2000-2009 period.

Journals
Adaptive Behavior
AI Communications
AI Edam-Artificial Intelligence for Engineering Design, Analysis and Manufacturing
AI Magazine
Annals of Mathematics and Artificial Intelligence
Applied Artificial Intelligence
Applied Intelligence
Artificial Intelligence
Artificial Intelligence in Medicine
Artificial Intelligence Review
Artificial Life
Autonomous Agents and Multi-Agent Systems
Autonomous Robots
Chemometrics and Intelligent Laboratory Systems
Computational Intelligence
Computer Speech and Language
Computer Vision and Image Understanding
Connection Science
Data and Knowledge Engineering
Data Mining and Knowledge Discovery
Decision Support Systems
Engineering Applications of Artificial Intelligence
Engineering Intelligent Systems for Electrical Engineering and Communications
Expert Systems
Expert Systems with Applications
IEEE Transactions on Evolutionary Computation
IEEE Transactions on Fuzzy Systems
IEEE Transactions on Image Processing
IEEE Transactions on Knowledge and Data Engineering
IEEE Transactions on Neural Networks
IEEE Transactions on Pattern Analysis and Machine Intelligence
IEEE Transactions on Systems, Man and Cybernetics Part B-Cybernetics
IEEE Transactions on Systems, Man and Cybernetics Part C-Applications and Reviews
Image and Vision Computing
International Journal of Approximate Reasoning
International Journal of Computer Vision
International Journal of Intelligent Systems
International Journal of Pattern Recognition and Artificial Intelligence
International Journal of Software Engineering and Knowledge Engineering
International Journal of Uncertainty Fuzziness and Knowledge-Based Systems
Integrated Computer-Aided Engineering
Intelligent Automation and Soft Computing
Journal of Artificial Intelligence Research
Journal of Automated Reasoning
Journal of Chemometrics
Journal of Computer and Systems Sciences International
Journal of Experimental and Theoretical Artificial Intelligence
Journal of Heuristics
Journal of Intelligent and Fuzzy Systems
Journal of Intelligent Information Systems
Journal of Intelligent Manufacturing
Journal of Intelligent and Robotic Systems
Journal of Mathematical Imaging and Vision
Knowledge Engineering Review
Knowledge-Based Systems
Machine Learning
Machine Vision and Applications
Mechatronics
Medical Image Analysis
Minds and Machines
Network-Computation in Neural Systems
Neural Computation
Neural Computing and Applications
Neural Networks
Neural Processing Letters
Neurocomputing
Pattern Analysis and Applications
Pattern Recognition
Pattern Recognition Letters
Robotics and Autonomous Systems

Table 7.2: Top five positions of the journal rankings using a 2009-year publication and citation window according to six bibliometric indices

Position   Journal                                                           documents
1          Expert Systems with Applications                                  1399
2          Neurocomputing                                                    352
3          Pattern Recognition                                               312
4          IEEE Transactions on Image Processing                             217
5          IEEE Transactions on Pattern Analysis and Machine Intelligence    188

Position   Journal                                                           citations
1          Expert Systems with Applications                                  402
2          Neurocomputing                                                    68
3          IEEE Transactions on Pattern Analysis and Machine Intelligence    54
4          Neural Networks                                                   49
5          Pattern Recognition                                               45

Position   Journal                                                           h-index
1          Expert Systems with Applications                                  5
2          Neurocomputing                                                    4
3          Neural Networks                                                   4
4          IEEE Transactions on Pattern Analysis and Machine Intelligence    3
5          Image and Vision Computing                                        2

Position   Journal                                                           g-index
1          Expert Systems with Applications                                  6
2          Neurocomputing                                                    5
3          Neural Networks                                                   4
4          IEEE Transactions on Pattern Analysis and Machine Intelligence    4
5          IEEE Transactions on Knowledge and Data Engineering               3

Position   Journal                                                           impact factor
1          IEEE Transactions on Pattern Analysis and Machine Intelligence    5.960
2          International Journal of Computer Vision                          5.358
3          IEEE Transactions on Evolutionary Computation                     3.736
4          IEEE Transactions on Neural Networks                              3.726
5          IEEE Transactions on Fuzzy Systems                                3.624

Position   Journal                                                           immediacy-index
1          Computational Intelligence                                        1.091
2          IEEE Transactions on Pattern Analysis and Machine Intelligence    0.669
3          Artificial Intelligence                                           0.667
4          International Journal of Computer Vision                          0.659
5          Journal of Automated Reasoning                                    0.600

Specifically, most of the highest values for each index are obtained between 2007 and 2009. Although the values of all indices tend to increase, there is no index whose value increases year by year. Note that the highest values differ greatly depending on the selected index and year. On the one hand, the values of indices like documents or citations have undergone a more significant increase than the other indices over the analyzed years. The number of documents and citations increased sharply in the time period; in fact, they are six and nine times greater, respectively, in the last year than in the first year. On the other hand, the values of other indices (h-index, g-index, m-index, w-index, rational h-index, impact factor) have not increased as significantly as documents and citations. In the most recent years, they are approximately two times greater than in the early years. For example, the g-index has a value of 3 in the year 2000 and a value of 6 in 2009.

Table 7.3: Range of index values in each analyzed year. Numbers in boldface represent the maximum value for each index in the 2000-2009 period.

                   2000        2001        2002        2003        2004
Index              min   max   min   max   min   max   min   max   min   max
documents          10    219   12    257   8     308   15    297   12    295
citations          0     45    0     47    0     39    0     78    0     48
h-index            0     3     0     2     0     3     0     4     0     3
g-index            0     3     0     4     0     3     0     4     0     4
hg-index           0.0   3.0   0.0   2.8   0.0   3.0   0.0   4.0   0.0   3.5
a-index            0.0   5.0   0.0   6.5   0.0   4.0   0.0   5.0   0.0   6.0
m-index            0     3     0     4     0     3     0     4     0     3
q2-index           0.0   3.0   0.0   2.8   0.0   3.0   0.0   4.0   0.0   3.0
r-index            0.0   3.5   0.0   3.6   0.0   3.5   0.0   4.5   0.0   3.9
e-index            0.0   2.0   0.0   3.0   0.0   1.7   0.0   2.4   0.0   2.6
w-index            0     4     0     4     0     5     0     6     0     5
rational h-index   0.0   3.6   0.0   2.8   0.0   3.7   0.0   4.8   0.0   3.6
impact factor      0.0   2.8   0.0   2.7   0.0   2.7   0.0   2.9   0.0   4.4
immediacy-index    0.0   0.7   0.0   1.0   0.0   0.7   0.0   0.9   0.0   0.7

                   2005        2006        2007        2008        2009
Index              min   max   min   max   min   max   min   max   min   max
documents          15    255   18    335   16    342   3     523   3     1399
citations          0     86    0     81    0     73    0     164   0     402
h-index            0     4     0     4     0     3     0     4     0     5
g-index            0     5     0     4     0     5     0     5     0     6
hg-index           0.0   4.5   0.0   4.0   0.0   3.9   0.0   4.5   0.0   5.0
a-index            0.0   6.0   0.0   5.0   0.0   6.7   0.0   6.5   0.0   6.6
m-index            0     5     0     4     0     4     0     5     0     6
q2-index           0.0   4.5   0.0   4.0   0.0   3.5   0.0   4.5   0.0   5.5
r-index            0.0   4.9   0.0   4.2   0.0   4.5   0.0   4.8   0.0   5.7
e-index            0.0   2.8   0.0   2.4   0.0   3.3   0.0   3.0   0.0   2.8
w-index            0     7     0     5     0     6     0     7     0     8
rational h-index   0.0   4.9   0.0   4.4   0.0   3.9   0.0   4.9   0.0   5.8
impact factor      0.1   4.3   0.0   3.8   0.1   6.1   0.0   3.6   0.0   6.0
immediacy-index    0.0   1.0   0.0   1.0   0.0   1.2   0.0   0.8   0.0   1.1

In the same way, the impact factor has a value of 2.8 in the year 2000 and a value of 6.0 in 2009. Finally, the values of the other indices (hg-index, a-index, q2-index, r-index, e-index, immediacy-index) have also increased within the time period, but to a lesser extent than the other indices. For example, the a-index has a value of 5.0 in the year 2000 and a value of 6.6 in 2009. Similarly, the immediacy-index has a value of 0.7 in the year 2000 and a value of 1.1 in 2009.

7.2.3 Bayesian network models

Yearly Bayesian network models

Ten yearly Bayesian network models were learned to analyze the relationships among indices within the same one-year publication window. Each yearly model is therefore associated with one of the ten analyzed years and with one of the ten datasets. Each dataset contains 70 cases (journals), each one with its 14 index values.

Before running the K2 algorithm to learn a Bayesian network from each dataset, some of K2's requirements had to be established. Since K2 needs the variables to be ordered, the first decision was to specify an order. Taking into account the index definitions, indices that could be parents of the other indices were placed first. The established order was: documents, citations, h-index, g-index, hg-index, a-index, m-index, q2-index, r-index, e-index, w-index, rational h-index, impact factor and immediacy-index. High values of marginal likelihood were obtained using this order. The second requirement was to assign a value to the maximum number of parents, which was set at two due to the dataset characteristics. There was a third requirement: index values had to be discretized into intervals. Due to the number of dataset cases, values were discretized into three intervals with equal frequency. In this way, each index value was assigned one of three possible values (low, medium and high).

[Figure 7.1: Bayesian network structures learned for each analyzed year; panels (a)-(j) correspond to the years 2000-2009.]
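The learning setup just described, equal-frequency discretization followed by a K2-style greedy parent search under the fixed ordering, can be sketched as follows. The `k2_log_score` function implements the standard Cooper-Herskovits K2 metric; the data frame and the commented discretization line are placeholders, not the thesis code.

# A sketch of the setup described above: equal-frequency discretization plus
# a K2-style greedy search under a fixed ordering. `data` is a placeholder
# DataFrame whose columns follow the established ordering.
from math import lgamma
import pandas as pd

def k2_log_score(data, node, parents):
    """Log K2 score (Cooper-Herskovits) of `node` given a candidate parent set."""
    r = data[node].nunique()  # number of states of the node
    blocks = ([g for _, g in data.groupby(list(parents))[node]]
              if parents else [data[node]])
    score = 0.0
    for block in blocks:      # one block per parent configuration
        counts = block.value_counts()
        score += (lgamma(r) - lgamma(counts.sum() + r)
                  + sum(lgamma(c + 1) for c in counts))
    return score

def k2_search(data, order, max_parents=2):
    """Greedy K2: for each node, add the best-scoring predecessors."""
    parents = {v: [] for v in order}
    for i, node in enumerate(order):
        current = k2_log_score(data, node, parents[node])
        candidates = list(order[:i])
        while candidates and len(parents[node]) < max_parents:
            best, arg = max((k2_log_score(data, node, parents[node] + [c]), c)
                            for c in candidates)
            if best <= current:
                break
            parents[node].append(arg)
            candidates.remove(arg)
            current = best
    return parents

# Equal-frequency discretization into three labels, as in the yearly models:
# disc = data.apply(lambda s: pd.qcut(s, 3, labels=["low", "medium", "high"]))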
According to the network structures, Figure 7.1 presents the ten yearly Bayesian networks. There are a lot of coincident arcs in the yearly Bayesian networks, as shown in Table 7.4. Taking the value in the first row and second column as an example, the value displayed is 14. This value indicates that the Bayesian network for the year 2000 and the Bayesian network for the year 2001 have 14 identical arcs. In other words, the Bayesian network of the year 2000 has 19 arcs, and 14 of these arcs are also represented in the Bayesian network of the year 2001. Examining Table 7.4, the relationships between indices are very similar in each year of the analyzed period.

Analyzing the networks in Figure 7.1, there are some specific arcs that are represented in most of the networks. For example, the arcs h-index→g-index, documents→citations, citations→h-index and a-index→e-index, among others, always appear in the 10 yearly Bayesian networks. The number of times that an arc appears in the proposed Bayesian networks is reported in Table 7.5. Taking the value of the citations→m-index arc as an example, the value displayed is 9. This value means that the relationship between citations and m-index is present in 9 out of our 10 yearly Bayesian networks.

With the intention of representing the main relationships between indices in a single year-independent Bayesian network, an aggregated Bayesian network was built using only those arcs that appeared at least three times. After applying this filter, the aggregated Bayesian network shown in Figure 7.2 is obtained. The values above the arcs represent the number of times that the arc appeared in the ten yearly Bayesian networks.

Table 7.4: Number of coincident arcs in the 10 different networks

                2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
2000 (19 arcs)   -     14    12    13    12    12    13    11    13    12
2001 (18 arcs)   -      -    10    11    12    11    11     9    11    10
2002 (17 arcs)   -      -     -    14    10    11    11     9     9    10
2003 (18 arcs)   -      -     -     -    13    14    14    11    11    14
2004 (16 arcs)   -      -     -     -     -    14    14    10     9    11
2005 (19 arcs)   -      -     -     -     -     -    13    11    11    13
2006 (18 arcs)   -      -     -     -     -     -     -    12    11    14
2007 (17 arcs)   -      -     -     -     -     -     -     -    12    13
2008 (17 arcs)   -      -     -     -     -     -     -     -     -    14
2009 (18 arcs)   -      -     -     -     -     -     -     -     -     -

Table 7.5: Number of times that arcs appear in our 10 yearly Bayesian networks (rows: arc source; columns: arc destination)

                   doc  cit  h    g    hg   a    m    q2   r    e    w    rat  IF   imm
documents           -   10   0    0    0    0    0    0    0    0    0    0    0    0
citations           -    -   10   2    0    1    9    0    0    0    1    4    3    2
h-index             -    -    -   10   3    0    7    8    4    1    0    1    1    0
g-index             -    -    -    -   7    9    0    0    5    7    1    0    2    2
hg-index            -    -    -    -    -   0    0    0    0    0    0    0    0    0
a-index             -    -    -    -    -    -   1    0    3    10   10   0    1    0
m-index             -    -    -    -    -    -    -   10   0    0    1    3    1    0
q2-index            -    -    -    -    -    -    -    -   0    0    0    2    1    0
r-index             -    -    -    -    -    -    -    -    -   2    0    0    1    0
e-index             -    -    -    -    -    -    -    -    -    -   0    0    0    0
w-index             -    -    -    -    -    -    -    -    -    -    -   5    1    0
rational h-index    -    -    -    -    -    -    -    -    -    -    -    -   2    2
impact factor       -    -    -    -    -    -    -    -    -    -    -    -    -   10
immediacy-index     -    -    -    -    -    -    -    -    -    -    -    -    -    -

(doc=documents, cit=citations, rat=rational h-index, IF=impact factor, imm=immediacy-index)

Figure 7.2: Aggregated Bayesian network structure

Examining the index definitions, it is observed that some of them can be defined according to the values of other indices. For example, on the one hand, the hg-index can be expressed in terms of the h- and g-index values (hg-index = √(h·g)) and, on the other hand, the q2-index can be defined according to the h- and m-index values (q2-index = √(h·m)). Although the deterministic aspect of the index definitions is reduced after discretizing the index values into three intervals (low, medium and high), the proposed aggregated Bayesian network discovers such dependencies between indices. Looking at Figure 7.2, the dependencies that can be defined according to the values of other indices are represented in the aggregated Bayesian network. For example, the hg-index definition is represented in the network, since the h-index and the g-index are parents of the hg-index. Similarly, the q2-index as a function of the h-index and the m-index is also represented in the network. Some dependencies, like documents→citations and h-index→g-index, among others, are not derived from the index definitions, but were expected because many works have shown their correlations [101, 396]. Other dependencies, e.g., citations→h-index and citations→impact factor, are also represented in the aggregated Bayesian network. Note that although the h-index and the impact factor cannot be defined in terms of citation values only, they do exhibit a high correlation coefficient [57]. Other dependencies, e.g., the arc between the a-index and the e-index, are examples of dependencies that were not initially expected. Remember that the a-index represents the average number of citations received by the articles included in the h-core, whereas the e-index represents the excess citations received by the articles in the h-core. Thus, both refer to citations of articles in the h-core. More examples of such dependencies are: m-index→rational h-index, a-index→w-index and w-index→rational h-index, among others.

In order to discover conditional independencies among the analyzed indices, Markov properties were used as the criteria for this purpose. The local Markov property states that any node in a Bayesian network is conditionally independent of its non-descendants given its parents, whereas the global Markov property states that any node is conditionally independent of any other node given its Markov blanket (MB), which includes its parents, its children and its children's parents. Table 7.6 illustrates such relationships. Although this table shows a specific list of relationships between indices, new relationships can be derived using other conditional independence properties [81]:

- Symmetry: I(X,Y | Z) ⇔ I(Y,X | Z)

- Decomposition: I(X,(Y ∪ W) | Z) ⇒ I(X,Y | Z) and I(X,W | Z)

- Strong joint: I(X,Y | Z) ⇒ I(X,Y | (Z ∪ W))

Taking the q2-index as an example, it is found in Table 7.6 that, given the h-index and the m-index together, the q2-index is independent of most of the indices (documents, citations, the hg-index, the a-index, the r-index, the e-index, the w-index, the rational h-index, the impact factor and the immediacy-index). On the other hand, it is observed that the g-index is independent of the q2-index given the h-index. Taking into account the above independence relationships and the conditional independence properties, we can state that, given the h-index and the m-index together, the q2-index is independent of any of the other indices. This means that when we know the h-index and the m-index values, knowledge of the e-index, for example, provides no information on the occurrence of the q2-index.

Table 7.6: Conditional independencies among indices, derived using Markov properties in the aggregated Bayesian network

Index              is conditionally independent of                           given
documents          h-index, g-index, hg-index, a-index, m-index,             citations
                   q2-index, r-index, e-index, w-index, rational h-index,
                   impact factor, immediacy-index
citations          hg-index, a-index, r-index, e-index, w-index,             documents, h-index, m-index,
                   rational h-index, immediacy-index                         impact factor
h-index            immediacy-index, rational h-index                         citations, g-index, hg-index,
                                                                             a-index, m-index, q2-index,
                                                                             r-index
g-index            documents, citations, m-index, q2-index,                  h-index, hg-index, a-index,
                   rational h-index, impact factor, immediacy-index          r-index, e-index
hg-index           documents, citations, m-index, q2-index, r-index,        h-index, g-index
                   e-index, w-index, rational h-index, impact factor,
                   immediacy-index
a-index            documents, citations, hg-index, m-index, q2-index,       h-index, g-index, r-index,
                   rational h-index, impact factor, immediacy-index         e-index, w-index
m-index            hg-index, a-index, r-index, e-index, w-index,            citations, h-index
                   immediacy-index
q2-index           documents, citations, hg-index, a-index, r-index,        h-index, m-index
                   e-index, w-index, rational h-index, impact factor,
                   immediacy-index
r-index            documents, citations, m-index, q2-index,                 h-index, g-index, a-index
                   rational h-index, impact factor, immediacy-index
e-index            documents, citations, h-index, hg-index, m-index,        g-index, a-index
                   q2-index, r-index, rational h-index, impact factor,
                   immediacy-index
w-index            documents, citations, h-index, g-index, hg-index,        a-index
                   m-index, q2-index, r-index, e-index, impact factor,
                   immediacy-index
rational h-index   h-index, g-index, hg-index, a-index, q2-index,           citations, m-index, w-index
                   r-index, e-index, immediacy-index
impact factor      h-index, g-index, hg-index, a-index, m-index,            citations
                   q2-index, r-index, e-index, w-index, rational h-index
immediacy-index    documents, citations, h-index, g-index, hg-index,        impact factor
                   a-index, m-index, q2-index, r-index, e-index,
                   w-index, rational h-index

Similarly, note that given the h-index and the g-index together, the hg-index is independent of any of the other indices.
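These blanket-based independencies can be read off the graph mechanically. The sketch below encodes the aggregated structure (transcribed from the arcs of Table 7.5 that appear at least three times) and computes a Markov blanket; for the q2-index it returns {h-index, m-index}, matching the discussion above.

# Markov blanket computation on the aggregated network (structure transcribed
# from the arcs of Table 7.5 appearing at least three times).
parents = {
    "citations": ["documents"],
    "h-index": ["citations"],
    "g-index": ["h-index"],
    "hg-index": ["h-index", "g-index"],
    "a-index": ["g-index"],
    "m-index": ["citations", "h-index"],
    "q2-index": ["h-index", "m-index"],
    "r-index": ["h-index", "g-index", "a-index"],
    "e-index": ["g-index", "a-index"],
    "w-index": ["a-index"],
    "rational h-index": ["citations", "m-index", "w-index"],
    "impact factor": ["citations"],
    "immediacy-index": ["impact factor"],
}

def markov_blanket(node):
    """Parents, children and the children's other parents of `node`."""
    children = [v for v, ps in parents.items() if node in ps]
    co_parents = {p for c in children for p in parents[c] if p != node}
    return set(parents.get(node, [])) | set(children) | co_parents

print(markov_blanket("q2-index"))   # {'h-index', 'm-index'}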

The aggregated Bayesian network is able to encode the conditional independencies that are derived from the index definitions, but it also discovers other new conditional independencies that are not strictly derived from definitions. On the one hand, the m-index, which is the median number of citations received by papers in the h-core, is expected to be conditionally independent of some variables given citations and h-index. Table 7.6 shows that, given citations and h-index, the m-index is conditionally independent of the hg-index, a-index, r-index, e-index, w-index and immediacy-index. On the other hand, some other conditional independencies were not so obvious. For example, the immediacy-index, which is defined by means of documents and citations values, is independent of documents and citations given the impact factor. The impact factor value has a significant influence on the immediacy-index value: when we know the impact factor, knowledge of documents and citations does not provide any information on the occurrence of the immediacy-index. Remember that the conditional independencies between indices encoded in the proposed Bayesian networks do not represent causality relationships, but refer to probability relationships between indices.

Global Bayesian network model

The global model's objective is to analyze the relationships between indices within a one-year target window. It has the same objective as the yearly models but, unlike them, this global model is built using a different dataset. The previous datasets are merged into a single dataset to build the dataset for the global model. In this way, the global model dataset contains 700 cases. Each case in the dataset refers to index values calculated within a one-year target window and, consequently, all the cases can be easily merged into a single dataset.

The K2 scoring metric was also used to learn the global Bayesian network. Although the same variable ordering is established, some decisions about K2's requirements are modified. As there are more cases, the maximum number of parents for any node can be increased to 3, and the number of intervals for the index discrete domain to 4 (low, medium-low, medium-high and high). Figure 7.3 shows our global Bayesian network structure.

Comparing the arcs in Figure 7.2 (aggregated Bayesian network structure) and Figure 7.3 (global Bayesian network structure), the network structures are similar. Many arcs appear in both networks, although this new model includes some specific arcs not represented before: citations→a-index, h-index→a-index, h-index→e-index and h-index→rational h-index, among others. For this reason, the global model is expected to represent the index definitions more accurately than the aggregated model. These new dependencies are explained in the following.

Some centrality measures are examined in order to analyze some of the index characteristics in the proposed global Bayesian network. Centrality degree is defined as the number of arcs incident upon an index. Degree is often interpreted in terms of the opportunity for influencing any other index. Two separate measures of centrality degree (indegree and outdegree) are defined. A node's indegree is the number of arcs directed to the node, and its outdegree is the number of arcs that the node directs to others. Therefore, indegree is the number of parents, whereas outdegree is the number of children.
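Under a parent-set representation of the network (like the `parents` dictionary shown earlier), both measures reduce to simple counts, as this short sketch illustrates.

# Centrality degree = indegree (number of parents) + outdegree (number of
# children), computed from a parent-set representation of the network.
def centrality_degree(parents):
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    indegree = {n: len(parents.get(n, [])) for n in nodes}
    outdegree = {n: sum(n in ps for ps in parents.values()) for n in nodes}
    return {n: indegree[n] + outdegree[n] for n in nodes}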

Figure 7.3: Global Bayesian network structure

The centrality degree (CD) values are: CD(documents)=1, CD(citations)=5, CD(h-index)=8, CD(g-index)=4, CD(hg-index)=1, CD(a-index)=5, CD(m-index)=3, CD(q2-index)=5, CD(r-index)=4, CD(e-index)=4, CD(w-index)=3, CD(rational h-index)=3, CD(impact factor)=2 and CD(immediacy-index)=2. Examining the above values, the h-index has a lot of influence on other indices; it has the highest centrality degree value (1+7=8). On the other hand, indices like documents or the hg-index, which have a centrality degree of 1, do not influence the other indices so much.

Table 7.7 displays the range of index values and the values assigned to each interval after discretization, to illustrate the global model dataset. Note that the interval width is not the same for the four categories, since the dataset has been discretized into intervals of equal frequency. A journal publishes between 3 and 1399 documents per year, and journals can then be categorized according to the index values. For example, a journal that has published 43 documents per year is placed in the medium-low category.

After examining the dependency relationships between indices shown in Figure 7.3, it is found that the aggregated model identified some, but not all, of the index dependencies.

Table 7.7: Range of index values related to each interval of the global model dataset

Indices            Values' range   low          medium-low   medium-high   high
documents          [3, 1399]       [3, 29)      [29, 47)     [47, 79)      [79, 1399]
citations          [0, 402]        [0, 1)       [1, 3)       [3, 12)       [12, 402]
h-index            [0, 5]          [0, 1)       [1, 2)       [2, 3)        [3, 5]
g-index            [0, 6]          [0, 1)       [1, 2)       [2, 3)        [3, 6]
hg-index           [0.0, 5.5]      [0.0, 1.0)   [1.0, 1.4)   [1.4, 2.4)    [2.4, 5.5]
a-index            [0.0, 6.7]      [0.0, 1.0)   [1.0, 2.0)   [2.0, 3.0)    [3.0, 6.7]
m-index            [0, 6]          [0.0, 1.0)   [1.0, 2.0)   [2.0, 3.0)    [3.0, 6]
q2-index           [0.0, 5.5]      [0.0, 1.0)   [1.0, 1.4)   [1.4, 2.0)    [2.0, 5.5]
r-index            [0.0, 5.7]      [0.0, 1.0)   [1.0, 1.4)   [1.4, 2.4)    [2.4, 5.7]
e-index            [0.0, 3.3]      [0.0, 1.0)   [1.0, 1.4)   [1.4, 1.7)    [1.7, 3.3]
w-index            [0, 8]          [0.0, 1.0)   [1, 2)       [2, 3)        [3, 8]
rational h-index   [0.0, 5.8]      [0.0, 1.0)   [1.0, 1.3)   [1.3, 2.0)    [2.0, 5.8]
impact factor      [0.0, 6.1]      [0.0, 0.4)   [0.4, 0.8)   [0.8, 1.5)    [1.5, 6.1]
immediacy-index    [0.0, 1.2]      [0.0, 0.0]   (0.0, 0.1]   (0.1, 0.2]    (0.2, 1.2]

Note that the citations and the h-index are the parent nodes of the a-index in the Bayesian network; these nodes are represented in this index's definition. Similarly, the m-index definition is also represented in the network, since its parent nodes are the citations and the h-index. The q2-index, which is dependent on the h-index and the m-index, has these nodes as parents in the network. The parents of the r-index are the h-index and the a-index. Initially, these indices do not appear in the r-index definition but, after a transformation of the original definition, the r-index can also be defined as √(a·h). Finally, the e-index is dependent on the h-index and the a-index. Like the r-index, the above indices are not part of the e-index definition but, after a few transformations of this definition, the e-index can be defined as √(a·h − h²).

Besides discovering dependencies between indices, which can be checked against the index definitions, the Bayesian network is also able to discover other kinds of probabilistic dependencies. One example is h-index→rational h-index, not directly derived from, but related to, the index definitions. In this case, information about the h-index influences the probability of the rational h-index. Another example (see Figure 7.3) is the probabilistic dependency between the r-index and the rational h-index. The objective of the r-index is to measure the citation intensity in the h-core, whereas the rational h-index measures the distance to the next value of the h-index. These indices measure different things, but they are probabilistically dependent. More examples of such dependencies are: q2-index→w-index, r-index→w-index and q2-index→immediacy-index.

Both the aggregated and the global models have been learned using the Elvira software [141]. One of the most useful features of Elvira is the automatic coloring of arcs, which offers qualitative insight about the conditional probability tables attached to each node. This coloring is based on the sign of influence [475] and the magnitude of influence [271]. In order to understand the dependencies between the indices represented in Figure 7.3, some concepts about the influence and color of the arcs are explained. For example, an arc from X to Y is said to have a positive influence if higher values of X lead to higher probabilities of Y taking higher values, for any configuration of its other parents. The definitions of negative influence and null influence are analogous. When the influence is neither positive nor negative nor null, it is said to be undefined.
Positive, negative, undefined and null influences are colored in red, blue, purple and black, respectively. Taking into account the above concepts, documents has a positive influence on citations. Likewise, citations influences the h-index and the impact factor positively: high values of citations are associated with high values of the h-index and the impact factor. Furthermore, the h-index has a positive influence on the g-index, which in turn has a positive influence on the hg-index. On the other hand, other arcs in Figure 7.3 represent an undefined influence between the parent and child nodes. Finally, the thickness of an arc is proportional to the magnitude of the influence. Thus, the relationships that have a greater influence are: citations→h-index, h-index→g-index and g-index→hg-index.

Regarding conditional independencies, Table 7.8 lists the conditional independencies between indices derived using the Markov properties in the global Bayesian network. Remember that new conditional independencies can be derived using the conditional independence properties mentioned above, such as symmetry, decomposition or strong joint. Analyzing Table 7.8, some conditional independence relationships are represented in both the aggregated Bayesian network and the global Bayesian network. Other conditional independencies were not shown before because of the slightly different Bayesian network structure. Note that the a-index, the w-index and the rational h-index are indices whose parents have undergone an important change. Taking the a-index independence relationships as an example, citations and h-index are new parents of the a-index in the global model, and this determines new a-index conditional independencies, such as I(a-index, documents | citations, h-index, g-index).

The global Bayesian network finds conditional independencies, some of which are justified by the index definitions. On the other hand, though, it discovers other conditional independencies that are not derived from such definitions. Looking at the e-index, it represents the excess citations received by all papers in the h-core. According to this definition, it is reasonable to expect that the e-index and citations would be dependent, but the global model shows that the above indices are independent given the h-index, g-index and a-index. Similarly, the relationship between the a-index and the m-index can be analyzed. The a-index is the average number of citations received by the articles included in the h-core, whereas the m-index is the median number of citations received by papers in the h-core. Initially, one might expect there to be a dependency relationship between the a-index and the m-index, but the proposed global model suggests that the relationship is one of conditional independence, given citations and h-index. Finally, a w-index of at least k means that there are k distinct publications that have at least 1, 2, 3, 4, ..., k citations, respectively. According to its definition, the w-index should depend on documents and citations, but the global model shows that they are conditionally independent given the q2-index, r-index and e-index.

7.2.4 Exploiting the global Bayesian network model

The proposed global Bayesian network is expected to be the best model because its structure reflects more index definitions than the aggregated model and discovers new interesting conditional independencies between indices. For this reason, evidence propagation and abduction are applied to the proposed global model.

Table 7.8: Conditional independencies among indices, derived using Markov properties in the global Bayesian network

Index              is conditionally independent of                           given
documents          h-index, g-index, hg-index, a-index, m-index,             citations
                   q2-index, r-index, e-index, w-index, rational h-index,
                   impact factor, immediacy-index
citations          hg-index, e-index, w-index, rational h-index,             documents, h-index, a-index,
                   immediacy-index                                           m-index, impact factor
h-index            immediacy-index                                           citations, g-index, a-index,
                                                                             m-index, q2-index, r-index,
                                                                             e-index, rational h-index
g-index            documents, citations, m-index, q2-index, w-index,        h-index, hg-index, a-index,
                   rational h-index, impact factor, immediacy-index         e-index
hg-index           documents, citations, h-index, a-index, m-index,         g-index
                   q2-index, r-index, e-index, w-index, rational h-index,
                   impact factor, immediacy-index
a-index            q2-index, w-index, rational h-index,                     citations, h-index, g-index,
                   immediacy-index                                          r-index, e-index
m-index            g-index, hg-index, a-index, r-index, e-index,            citations, h-index, q2-index
                   w-index, rational h-index, immediacy-index
q2-index           documents, citations, hg-index, a-index, r-index,        h-index, m-index
                   e-index, impact factor
r-index            documents, citations, g-index, hg-index, m-index,        h-index, a-index
                   q2-index, e-index, impact factor, immediacy-index
e-index            documents, citations, m-index, q2-index,                 h-index, g-index, a-index
                   rational h-index, impact factor, immediacy-index
w-index            documents, citations, h-index, g-index, hg-index,        q2-index, r-index, e-index
                   a-index, m-index, rational h-index, impact factor,
                   immediacy-index
rational h-index   documents, citations, g-index, hg-index, a-index,        h-index, q2-index, r-index
                   m-index, e-index, w-index, impact factor,
                   immediacy-index
impact factor      h-index, g-index, hg-index, a-index, m-index,            citations
                   q2-index, r-index, e-index, w-index, rational h-index
immediacy-index    documents, citations, h-index, g-index, hg-index,        impact factor
                   a-index, m-index, q2-index, r-index, e-index,
                   w-index, rational h-index

So far, the graphical component of the global Bayesian network has been used to discover conditional (in)dependencies. In this section, the probabilistic component of the global Bayesian network is also used to precisely quantify the effect of knowing some fixed variables on the occurrence of other variables. In this context, evidence propagation usually refers to computing the posterior probability of each single variable given the available evidence (i.e., some fixed variables), while abduction consists of finding the most probable configuration of a set of variables of interest given the evidence.

As regards evidence propagation, we would like to know the effect on index probabilities of introducing some specific values for other indices as evidence. The first inference is to fix the citations value to medium-low. After setting this evidence, the posterior probabilities of each index are shown in Figure 7.4, top. Note that the mode (green bars) of all indices is low or medium-low. Taking the impact factor as an example, the following probabilities are observed:

- P(impact factor=low | citations=medium-low)=0.37

- P(impact factor=medium-low | citations=medium-low)=0.30

- P(impact factor=medium-high | citations=medium-low)=0.22

- P(impact factor=high | citations=medium-low)=0.11

These conditional probabilities are reasonable: fixing citations=medium-low as evidence, the mode of the impact factor, which depends positively on citations, should be low or medium-low. Analyzing the above conditional probabilities, low and medium-low are indeed the most probable values, at 0.37 and 0.30, respectively.

On the other hand, the second inference is to assign a high value to citations; see Figure 7.4, bottom. In this case, the mode of most of the indices is high. These conditional probabilities are also reasonable, and the probability values of the impact factor are:

- P(impact factor=low | citations=high)=0.02

- P(impact factor=medium-low | citations=high)=0.10

- P(impact factor=medium-high | citations=high)=0.31

- P(impact factor=high | citations=high)=0.56

The above probabilities answer a question raised in the introduction, namely: what would happen to a specific journal's impact factor if the papers that it published received more citations? The answer lies in the full distribution of the different impact factor values. Similarly, setting citations=low answers the question: what would happen to a specific journal's impact factor if the papers that it published received fewer citations?

To get the most likely plausible explanation P(configuration | evidence), we must search for the configuration of values of the non-observed indices (called the explanation set) that maximizes the above probability. This is possible using abductive inference [357]. Table 7.9 shows three examples of abductive inference. Three different evidence levels are set: h-index=medium-low (like, e.g., International Journal of Pattern Recognition and Artificial Intelligence), impact factor=medium-high (like, e.g., International Journal of Intelligent Systems) and immediacy-index=high (like, e.g., Machine Learning), in Table 7.9, columns 2, 3 and 4, respectively.

Figure 7.4: Index probabilities after setting the citations value at medium-low (top) and high (bottom)

Table 7.9: Most likely configurations of indices for a given evidence level

Explanation set                 Evidence 1:        Evidence 2:          Evidence 3:
Index (Xi)                      h-index=           impact factor=       immediacy-index=
                                medium-low         medium-high          high
documents                       low                medium-high          high
citations                       medium-low         medium-high          high
h-index                         -                  medium-low           medium-high
g-index                         medium-low         medium-low           medium-high
hg-index                        medium-low         medium-low           medium-high
a-index                         medium-low         medium-low           medium-high
m-index                         low                medium-low           medium-high
q2-index                        low                medium-low           high
r-index                         medium-low         medium-low           medium-high
e-index                         low                low                  low
w-index                         medium-low         medium-low           medium-high
rational h-index                medium-low         medium-high          high
impact factor                   low                -                    high
immediacy-index                 low                medium-high          -

P(Explanation set | evidence)   0.000245           0.000147             0.000839

Taking the third inference as an example, the most probable configuration of index values when immediacy-index=high is documents=high, citations=high, h-index=medium-high, g-index=medium-high, hg-index=medium-high, a-index=medium-high, m-index=medium-high, q2-index=high, r-index=medium-high, e-index=low, w-index=medium-high, rational h-index=high and impact factor=high. The above configuration of index values answers the question: what kind of actions should journal editorial boards take to get a higher immediacy-index value in a specific year? The answer is to publish a lot of documents (≥ 79), receive a lot of citations (≥ 12), and get high q2-index (≥ 2.0), rational h-index (≥ 2.0) and impact factor (≥ 1.5) values. Moreover, journal editorial boards should aspire to the following index values: h-index=[2,3), g-index=[2,3), hg-index=[1.4,2.4), a-index=[2.0,3.0), m-index=[2.0,3.0), r-index=[1.4,2.4), e-index=[0.0,3.3) and w-index=[2,3). Finally, although the joint probability (P=0.000839) of these index values seems very low, the number of different configurations of index values is 4^13. Therefore, the joint probability obtained via abduction is considerably greater than would be expected purely by chance (1/4^13 ≈ 1.5·10^-8).
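Conceptually, the abduction step amounts to a search over the 4^13 configurations of the unobserved indices, as the brute-force sketch below illustrates; real implementations use smarter propagation algorithms, and the `joint` callback (which would multiply the network's conditional probability tables) is a placeholder.

# A brute-force sketch of abductive inference (most probable explanation).
# `joint(config)` is a placeholder for the factorized network probability;
# with 13 free indices and 4 states each there are 4**13 candidates.
from itertools import product

STATES = ["low", "medium-low", "medium-high", "high"]

def most_probable_explanation(joint, variables, evidence):
    free = [v for v in variables if v not in evidence]
    best_config, best_p = None, -1.0
    for values in product(STATES, repeat=len(free)):
        config = dict(evidence, **dict(zip(free, values)))
        p = joint(config)
        if p > best_p:
            best_config, best_p = config, p
    return best_config, best_p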

7.3 Discussion and conclusions

Bibliometric indices have received a lot of attention from the scientific community over the last few years, since funding agencies and promotion committees use them to evaluate the importance of research at different levels. In view of the vast number of bibliometric indices, it is necessary to analyze how they relate to each other (irrelevant, dependent and so on). A case study of 14 well-known bibliometric indices on computer science and artificial intelligence journals was performed. Several Bayesian network models were learned from data to analyze the relationships among the bibliometric indices. The induced Bayesian networks were then used to discover probabilistic conditional (in)dependencies among the indices and also for probabilistic reasoning. The aim of these models is to represent relationships between index values using a one-year publication and citation window. Analyzing the best proposed Bayesian network, it is observed that its structure matches many index definitions. In addition, this model learns new knowledge derived from the index definitions and discovers new, interesting conditional (in)dependencies between the analyzed indices. These conditional (in)dependency relationships have been analyzed using Markov properties. Using the proposed models, the editorial boards of journals could find the answer to questions related to their journal citation indices. Evidence propagation and abductive inference in Bayesian networks are very useful for answering bibliometric questions. In the future, the target will be to build new models that incorporate other journal citation indices like the eigenfactor, article influence and SCImago journal rank, among others. These models could also be induced using different Bayesian network learning algorithms. The way index values are handled influences the results; they could be modeled as continuous variables instead of discretizing the values.

Part IV

Exploring Spanish Computer Science Research


Dataset compilation

The purpose of Part IV is to analyze computer science research in Spain. To do this, a bibliometric dataset is built containing the research activity of the Spanish academic staff in the computer science area. The first phase of the dataset compilation was to apply to the Spanish Ministry of Education for a list of academics associated with the computer science area who were active as of January 1, 2010. This list includes the full names of 2,004 academics and their associated university, position and research area. All the academic staff are associated with one of the following three specific areas: computer architecture and technology (CAT), computer science and artificial intelligence (CSAI), and computer languages and systems (CLS). Members of the academic staff specialize in one of the above specific areas, in which they lecture, regularly publish and are assessed by national organizations. There are four types of permanent (civil servant) positions in the Spanish higher education system. All four positions are associated with tenure obligations; these academic staff work full-time and engage in teaching, mentoring and research at the university. The four academic positions are translated (from the highest to the lowest level) as: full professor (FP), associate professor-type1 (AP1), associate professor-type2 (AP2) and associate professor-type3 (AP3). The next step was to retrieve a list of publications and citation data from the date of the first publication by a member of the academic staff (January 1, 1973) to January 1, 2010. This information was carefully downloaded from the Web of Science research platform, bearing in mind Spanish personal name variations in international databases. Web of Science was selected as the bibliographic database because it is the most comprehensive and versatile research platform available and has an important reputation as one of the oldest citation resources, containing the most prestigious academic journals and recording a very large part of the scientific literature. Moreover, it is one of the most important tools used by CNEAI and ANECA to assess Spanish scientific activity. For the computer science area, Web of Science contains databases specialized in journals and conferences, indexing more than 470 computer science journals and more than 15,000 of the major computer science conferences. Regarding data extraction, only documents considered as journal articles and conference papers were taken into consideration. Also, the publication subject classification was used as a filter; in this way, only documents published in journals and conferences belonging to the seven major fields of computer science were taken into account. According to the Journal Citation Reports, these major fields are: artificial intelligence, cybernetics, hardware and architecture, information systems, interdisciplinary applications, software engineering, and theory and methods. Finally, in order to ensure the reliability of the results, the final list of publications was checked against other databases like the DBLP Computer Science Bibliography, personal webpages and institutional websites, among others. The last phase was to develop software that used all this information to calculate bibliometric indices. Different measures are used according to the specific analysis carried out in the following chapters.
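As a rough sketch of the extraction filters just described (document types plus the seven JCR fields), assuming the downloaded Web of Science records are held in a pandas DataFrame with hypothetical doc_type and jcr_category columns:

import pandas as pd

CS_FIELDS = {
    "Artificial Intelligence", "Cybernetics", "Hardware and Architecture",
    "Information Systems", "Interdisciplinary Applications",
    "Software Engineering", "Theory and Methods",
}

def filter_records(records: pd.DataFrame) -> pd.DataFrame:
    # Keep only journal articles and conference papers...
    is_valid_type = records["doc_type"].isin(["Journal Article", "Proceedings Paper"])
    # ...published in one of the seven major computer science fields.
    is_cs_field = records["jcr_category"].isin(CS_FIELDS)
    return records[is_valid_type & is_cs_field]

Chapter 8

Overview of Spanish Computer Science Research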

8.1 Introduction

Scientific production is a crucial element in evaluating the research activity of a country. National production is largely generated by higher education institutions, which play an important role in national research. Spanish higher education has expanded remarkably over the last half century. There are now 50 public universities and 28 private universities, compared with only 15 in 1968; most of these institutions sprang up between the 1970s and the 1990s. Because of this rapid spread, research has grown exponentially in Spain over recent years. It now accounts for 3.3% of the global output stored in the Web of Science research platform, compared with a share of only 0.2% in 1963. But, as in other areas, quality is more important than quantity in science, and this is where Spain falls down. According to Essential Science Indicators, Spain ranks 9th among the top-performing countries for papers, 11th for citations, and 39th for citations per paper across all fields. Spain fares no better in the field of computer science. On the one hand, Spain ranks 9th for papers (14,904 publications); the top five countries in this ranking are the USA (81,475 publications), China (43,899 publications), Germany (20,094 publications), England (19,330 publications) and South Korea (18,757 publications). On the other hand, Spain ranks 7th for citations (63,293 citations), behind countries such as the USA (596,994 citations), China (150,885 citations), England (122,176 citations), Germany (120,066 citations) and France (89,037 citations). These rankings show that countries that rank in the top positions for papers usually rank in the top positions for citations. Finally, Spain ranks 30th for citations per paper (4.25); in this case, Switzerland (9.76), Ireland (7.94) and the USA (7.33) are the top-performing countries. Spanish scientific production has been analyzed in many areas, such as medicine [183], communications [298], information technology [381], philology [443] and mathematics [444]. Other studies [49, 185, 335] have analyzed the number of documents, citations and patents

published by Spanish universities. Finally, several university research profiles [76, 336, 437] and regional research profiles [348, 377] have also been analyzed in Spain. The computer science area has become one of the main drivers of economic growth in recent years, achieving great progress and bridging the gap between the university and the business world. Some works [187, 189] have analyzed this area worldwide, whereas other works [202, 466] have focused on specific countries. In contrast, an analysis of computer science in Spain has never been carried out. An overview of Spanish computer science research makes it possible to assess the performance of the scientific activity and its impact on society. In this context, a bibliometric analysis can provide information about the Spanish trends in the sector and the research profiles of Spanish universities and departments. The aim of this chapter is to illustrate the productivity and visibility of Spanish computer science research. To do this, the research produced in higher education institutions and their academic staff is taken into account. This analysis is confined to public universities, and it is also circumscribed to 48 out of the 50 public universities, because two of them (Universidad Internacional de Andalucía and Universidad Internacional Menéndez Pelayo) have no academic staff specialized in the computer science field. Regarding the academic staff, the 2,004 tenured academics are analyzed according to their area (computer architecture and technology (CAT), computer science and artificial intelligence (CSAI), and computer languages and systems (CLS)) and position (full professor (FP), associate professor-type1 (AP1), associate professor-type2 (AP2) and associate professor-type3 (AP3)). Finally, Spanish computer science research is analyzed through different parameters, such as the number of documents, number of citations, number of citations per document, number of authors per document, number of institutions per document, document types, types of collaboration, computer science disciplines, impact factor and some bibliometric indices. The results are presented at different levels of detail (nationwide, autonomous regions, public universities, subject areas and professional standing). This chapter thus provides a comprehensive overview of the current situation in the computer science area. This chapter is based on the publications [219, 222].

Chapter outline

Section 8.2 presents an overview of the Spanish computer science research, including different analysis at the macro level (Section 8.2.1), meso level (Sections 8.2.2 and 8.2.3) and micro level (Section 8.2.4). Finally, conclusions are discussed in Section 8.3.

8.2 Analyzing Spanish computer science

Different parameters are computed to get an overview of Spanish computer science research. The number of documents and the number of citations are the basic measures of productivity and visibility, respectively, whereas the number of citations per document represents the quality of the publications. Regarding collaboration, the number of authors per document and the number of institutions per document can be used as measures of collaboration intensity. The trends in the number of documents, citations and citations per document are also analyzed according to the document type, that is, documents published as journal articles and proceedings papers. This analysis also studies productivity and visibility according to the different computer science disciplines. Furthermore, it explores how the number of documents and citations vary across national and international collaboration. Focusing on the impact factor, the number of documents published in the different journal impact factor quartiles is also analyzed. Finally, some well-known bibliometric indices are shown. In favor of the repeatability principle, some general notes on the data computation process follow: a) All publications are stored in several database tables with the intention of exploiting the information at different levels (nationwide, autonomous regions, public universities, subject areas and professional standing). Taking a document published by researchers affiliated with universities A and B (both belonging to the same autonomous region) as an example, two records of this document are stored in the public universities table, whereas only one record is stored in the nationwide table and the autonomous regions table. In this way, the overlap produced by collaboration among researchers is removed (see the sketch below); b) Regarding the impact factor associated with each publication, only values from the corresponding Journal Citation Reports edition during the period 2000-2009 were extracted. Also, a journal can belong to different quartiles (Q1, Q2, Q3, Q4) depending on the selected discipline (there is some overlap among disciplines, that is, some journals belong to several disciplines); in such cases, the best quartile value for each journal is always selected.
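The de-duplication rule in note (a) can be illustrated with a toy example; the record fields below are hypothetical:

from collections import defaultdict

# (document id, university, autonomous region) affiliation records.
records = [
    ("doc1", "University A", "Region X"),
    ("doc1", "University B", "Region X"),  # same document, same region
]

per_university = defaultdict(set)
per_region = defaultdict(set)
nationwide = set()

for doc, university, region in records:
    per_university[university].add(doc)  # one record for A and one for B
    per_region[region].add(doc)          # the set collapses the duplicate
    nationwide.add(doc)

print(len(per_university["University A"]) + len(per_university["University B"]))  # 2
print(len(per_region["Region X"]), len(nationwide))  # 1 1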

8.2.1 Nationwide results

Table 8.1 shows an overview of Spanish computer science research. Results are shown using different parameters, such as productivity, visibility, authorship, document types, types of collaboration, computer science disciplines, impact factor and some bibliometric indices. The research productivity of the Spanish academics amounts to 11,510 publications, which can be divided into 4,233 journal articles (36.8%) and 7,277 proceedings papers (63.2%). Most documents are published in the Theory and Methods and Artificial Intelligence disciplines. Regarding the types of collaboration, only 1,601 publications (13.9%) include an international institution as a collaborator. Focusing on the journal impact factor, first-quartile journals account for the largest share of journal articles during the 2000-2009 period: Q1 (30.5%), Q2 (27.5%), Q3 (27.3%) and Q4 (14.7%). The Spanish research visibility amounts to 37,333 citations received by all published documents, that is, 3.24 citations per document. The computer science discipline that receives the most citations is Artificial Intelligence (15,907 citations), whereas Cybernetics is the discipline that receives the most citations per document (5.80). Regarding the type of publication, journal articles receive 7.18 citations per document, whereas proceedings papers only receive 0.96 citations per document. Finally, documents published with international collaborators receive more citations (5.54 citations per document) than documents published with national collaborators (2.87 citations per document).

Table 8.1: Scientific production of Spanish academics belonging to the computer science area.

General data

Number of researchers:   2,004     3.24 citations per publication
Number of publications: 11,510     3.54 authors per publication
Number of citations:    37,333     2.39 institutions per publication

Document type        Publications   Citations   Rate
Journal article      4,233          30,378      7.18
Proceedings paper    7,277          6,955       0.96

Types of collaboration        Publications   Citations   Rate
National collaboration        9,909          28,471      2.87
International collaboration   1,601          8,862       5.54

JCR categories                        Publications   Citations   Rate
C.S. Artificial Intelligence          3,667          15,907      4.34
C.S. Cybernetics                      238            1,381       5.80
C.S. Hardware and Architecture        1,075          3,112       2.89
C.S. Information Systems              1,406          3,843       2.73
C.S. Interdisciplinary Applications   818            2,013       2.46
C.S. Software Engineering             1,632          3,943       2.42
C.S. Theory and Methods               5,921          15,481      2.61

Bibliometric indices

h-index: 63       g-index: 92       hg-index: 76.1
hpub-index: 45    hcit-index: 119   hh-index: 13

Impact factor

976 articles published by Q1 journals
879 articles published by Q2 journals
875 articles published by Q3 journals
471 articles published by Q4 journals

Table 8.1 also shows information about the bibliometric indices. The h-index, g-index and hg-index values are 63, 92 and 76.1, respectively. The other bibliometric index values are hpub-index=45, hcit-index=119 and hh-index=13, meaning that 45 Spanish academics have published at least 45 publications each, 119 academics have received at least 119 citations each, and 13 academics have an h-index of at least 13. Table 8.2 represents the evolution of the above parameters during the period 2000-2009. Spanish research productivity and visibility have increased in recent years, achieving increments of 347% and 1,053%, respectively. Regarding collaboration intensity, the average number of authors per publication was 3.12 in 2000 and 3.72 in 2009, whereas the average number of institutions per document fluctuated during the analyzed period. Regarding the type of publication, Table 8.3 lists the journals and conferences that published the most documents by the Spanish academic staff. Among journals, Fuzzy Sets and Systems published the highest number of Spanish publications (126 documents) during 2000-2009.
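The h-type rule behind these indices can be written in a few lines. The sketch below is illustrative (the input lists are invented): the function returns the largest h such that at least h of the counts are greater than or equal to h. Applied to citations per paper it yields the h-index; applied to publications per academic, the hpub-index; to citations per academic, the hcit-index; and to the academics' own h-indices, the hh-index.

def h_type_index(counts):
    # Largest h such that at least h of the counts are >= h.
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h

citations_per_paper = [10, 8, 5, 4, 3, 0]        # h-index = 4
publications_per_academic = [12, 9, 7, 3, 1]     # hpub-index = 3
h_indices_of_academics = [20, 15, 13, 13, 9, 2]  # hh-index = 5
print(h_type_index(citations_per_paper),
      h_type_index(publications_per_academic),
      h_type_index(h_indices_of_academics))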

Table 8.2: Evolution of the scientific production and visibility of Spanish academics belonging to the computer science area.

General data                     2000   2001   2002   2003   2004   2005   2006   2007   2008   2009

Total publications                424    480    589    996  1,001  1,100  1,224  1,300  1,337  1,471
Total citations                   662    997  1,251  1,876  2,389  2,990  3,595  4,418  6,136  6,971

Average authors                  3.12   3.20   3.37   3.53   3.58   3.62   3.64   3.69   3.74   3.72
Average institutions             2.35   2.45   2.52   2.57   2.60   2.53   2.55   2.30   2.30   2.37

Document type

Journal articles                  174    215    226    265    276    286    335    419    456    549
Proceeding papers                 250    265    363    731    725    814    889    881    881    922
Cits in journal articles          614    902  1,119  1,581  1,928  2,403  2,809  3,423  4,888  5,527
Cits in proceeding papers          48     95    132    295    461    587    786    995  1,248  1,444

Types of collaboration

National publications             377    423    506    827    856    896  1,011  1,174  1,201  1,328
International publications         47     57     83    169    145    204    213    126    136    143
Cits in national public.          529    777    974  1,432  1,821  2,276  2,748  3,341  4,659  5,306
Cits in international public.     133    220    277    444    586    714    847  1,077  1,477  1,665

JCR categories

Artificial Intelligence           187    189    232    299    297    309    384    393    424    415
Cybernetics                         4     10     15     12     19     10     40     25     17     29
Hardware and Architecture          37     58     71     58     77     73    105    100    148    111
Information Systems                43     81     79     92    101    105    122    168    184    211
Interdisciplinary Applications     15     27     24     44     54     68     53    117    104    148
Software Engineering               48     85     80    105    132    103    130    219    238    223
Theory and Methods                162    190    249    580    563    675    663    699    654    799

Bibliometric indices

h-index                            20     22     24     28     33     37     43     49     55     60
g-index                            27     30     34     38     43     49     57     65     77     86
hg-index                         23.2   25.7   28.6   32.6   37.7   42.6   49.5   56.4   65.1   71.8

Impact factor

Publications in Q1 journals        34     57     51     58     76     82    108    128    130    252
Publications in Q2 journals        30     33     36     63     84     81    108    134    131    179
Publications in Q3 journals        73     88     66    109     88     80     81    105    110     75
Publications in Q4 journals        37     37     73     35     28     43     38     52     85     43

Fuzzy Sets and Systems published 15 documents in 2009, whereas Expert Systems with Applications published 35 documents in 2009. Regarding conferences, the International Conference on Artificial Neural Networks, although it is a biennial conference, published the highest number of proceedings papers (280 documents). Table 8.2 also illustrates that the number of proceedings papers is higher than the number of journal articles in every analyzed year. Despite this, the number of citations received by journal articles is much higher than the number received by proceedings papers. Taking the year 2009 as an example, journal articles received 5,527 citations, whereas proceedings papers received 1,444 citations. Figure 8.1 shows the evolution of the average number of citations received by journal articles and proceedings papers. Taking the year 2000 as an example,

Table 8.3: Number of documents published by journals and conferences that have the top number of Spanish publications during the period 2000-2009.

Journals                              2000  2001  2002  2003  2004  2005  2006  2007  2008  2009

Fuzzy Sets and Systems                  14    15     7    14    16    11     5    16    13    15
Int. Journal of Intelligent Systems     10     5     6    21     3    12     7     8     9     9
Expert Systems with Applications         5     2     4     4     7     6     3     5    14    35
Pattern Recognition Letters              3     7     4    10     2     5     8     9    18     7
Pattern Recognition                      6     3     9    10     6     5    13     4     8     7

Conferences                           2000  2001  2002  2003  2004  2005  2006  2007  2008  2009

Artificial Neural Networks               0     4     0    72     0    56     0    43     0   105
Cross-Language Evaluation Forum          0     0     5    21     0    15    25    23    22    18
Natural and Artificial Computation       0     0     0     0     0    31     0    32     0    34
Computational Science                    0     0    11    11     0    11    23     4    15    11
Computer Aided Systems Theory            0     0     0     0     0    32     0    34     0    17

journal articles published in 2000 had accumulated an average of 15.24 citations per document by 2010, whereas proceedings papers published in 2000 had 1.61 citations per document. Regarding the type of collaboration, the number of national collaborations has increased in each analyzed year. In contrast, the number of international collaborations increased during the first years and then decreased over the last three years. By JCR categories, Artificial Intelligence and Theory and Methods are the disciplines with the highest number of publications each year. Table 8.2 also shows the evolution of the bibliometric index values, which have increased during the analyzed period. Taking the h-index as an example, its value was 20 in 2000 and 60 in 2009. Since these bibliometric index values never decrease, it is informative to analyze their year-by-year increments: the increments achieved during the first years were lower than four units, whereas those achieved during the last years were higher than four units. Table 8.2 also presents the evolution of documents published in each impact factor quartile. The number of documents published in Q1 and Q2 journals has increased during the analyzed period: 34 and 30 documents were published in 2000 in Q1 and Q2 journals, whereas 252 and 179 documents were published in 2009. Figure 8.2 shows the percentage of documents published in each impact factor quartile. Taking the Q1 and Q4 quartiles as examples, the year with the highest percentage of Q1 documents was 2009, whereas the year with the highest percentage of Q4 documents was 2002. The percentages of Q1 and Q2 documents have increased during the analyzed period, whereas the percentages of Q3 and Q4 documents have decreased. Figure 8.3 shows the number of times that documents are published by journals with a specific impact factor during 2000-2009. The distribution of the impact factor values is: minimum (0.000), percentile 25 (0.470), percentile 50 (0.799), percentile 75 (1.282), and maximum (7.400). Furthermore, the most frequent impact factor is 2.596.

Figure 8.1: Evolution of the average number of citations received by scientific publications (journal articles and proceeding papers). [Line chart omitted; x-axis: publication year (2000-2009); y-axis: average citations received by publications until 2010; series: journal articles, proceedings papers, and both types combined.]

Figure 8.2: Evolution of the percentage of documents published by each impact factor quartile. [Chart omitted; x-axis: publication year (2000-2009); y-axis: percentage of publications; series: Q1, Q2, Q3, Q4.]

Figure 8.3: Number of times that documents are published by journals with a specific impact factor. [Histogram omitted; x-axis: impact factor (0-8); y-axis: number of journals (0-40).]

Finally, some relationships among several parameters are also analyzed. On the one hand, the relationships among publications, citations, JCR disciplines and document types are explored. Table 8.4 (top) shows that Artificial Intelligence is the discipline that publishes the highest number of journal articles, whereas Theory and Methods is the discipline that publishes the highest number of proceedings papers. Also, 66.4% of the documents published in Cybernetics are journal articles, whereas 79.9% of the documents published in Theory and Methods are proceedings papers. Regarding citation values, the Artificial Intelligence and Theory and Methods categories receive the highest numbers of citations. Focusing on the percentage of citations received by journal articles and proceedings papers, only 1.0% of the citations received by the Cybernetics category correspond to proceedings papers; in contrast, 35.1% of the citations received by Theory and Methods correspond to proceedings papers. On the other hand, how publications, citations, JCR disciplines and collaboration types are related is also explored. Table 8.4 (bottom) shows that the percentages of documents published in national collaboration range between 83.3% and 90.8%. The JCR categories with the highest percentages of international collaboration are Software Engineering (16.7%), Hardware and Architecture (14.7%) and Information Systems (14.4%). By visibility, 12.8% of the citations received by Cybernetics documents correspond to international collaborations, whereas 35.3% of the citations received by Software Engineering documents correspond to international collaborations.

Table 8.4: Relationships among publications, citations, document types (journal articles and proceed- ing papers) and collaboration types (national and international) by JCR categories.

                                 Publications                      Citations
JCR categories                   Articles        Proceedings       Articles         Proceedings

Artificial Intelligence          1,760 (48.0%)   1,909 (52.0%)     14,764 (92.8%)   1,143 (7.2%)
Cybernetics                        158 (66.4%)      80 (33.6%)      1,367 (99.0%)      14 (1.0%)
Hardware and Architecture          471 (43.8%)     604 (56.2%)      2,765 (88.8%)     347 (11.2%)
Information Systems                662 (47.1%)     774 (52.9%)      3,571 (92.9%)     272 (7.1%)
Interdisciplinary Applications     433 (52.9%)     385 (47.1%)      1,919 (95.3%)      94 (4.7%)
Software Engineering               829 (50.8%)     803 (49.2%)      3,705 (94.0%)     238 (6.0%)
Theory and Methods               1,189 (20.1%)   4,732 (79.9%)     10,047 (64.9%)   5,434 (35.1%)

JCR categories                   National        International     National         International

Artificial Intelligence          3,274 (89.3%)     393 (10.7%)     12,948 (81.4%)   2,959 (18.6%)
Cybernetics                        216 (90.8%)      22 (9.2%)       1,204 (87.2%)     177 (12.8%)
Hardware and Architecture          917 (85.3%)     158 (14.7%)      2,024 (65.0%)   1,088 (35.0%)
Information Systems              1,203 (85.6%)     203 (14.4%)      2,823 (73.5%)   1,020 (26.5%)
Interdisciplinary Applications     727 (88.9%)      91 (11.1%)      1,494 (74.2%)     519 (25.8%)
Software Engineering             1,359 (83.3%)     273 (16.7%)      2,552 (64.7%)   1,391 (35.3%)
Theory and Methods               5,116 (86.4%)     805 (13.6%)     11,697 (75.6%)   3,794 (24.4%)

8.2.2 Autonomous region results

Table 8.5 shows the scientific production by autonomous regions. The number of academic staff (A), the number of publications (P), the number of citations (C), the number of citations per publication (C/P), the number of publications per academic (P/A), the number of citations per academic (C/A), the percentage of documents published in journals (JRN), the percentage of documents published in Q1 journals (Q1), the percentage of documents with international collaboration (COL), and the h-index value (H) are calculated for each autonomous region. Analyzing the above parameter values, some conclusions are presented (a sketch of how these parameters can be computed follows the list):

- The autonomous regions with the highest number of academic staff are: Madrid (383), Andalucía (353), Valencia (352), Cataluña (261) and Galicia (94). These values have an important influence on the fact that most of these autonomous regions also have the highest number of publications: Andalucía (2,410), Madrid (2,258), Cataluña (2,009), Valencia (1,929) and C. Mancha (594).

- A higher number of academics also affects the number of citations. Thus, the autonomous regions with the highest number of citations are: Andalucía (13,421), Cataluña (6,592), Madrid (5,747), Valencia (5,098) and País Vasco (1,311). This ranking changes when the ratio between citations and publications is calculated: Navarra (6.1), Andalucía (5.6), País Vasco (4.0), Islas Baleares (3.8) and Cataluña (3.3).

- The autonomous regions with the highest number of publications per academic are: Cantabria (9.0), C. Mancha (8.9), Cataluña (7.7), Navarra (7.6) and Andalucía (6.8).

Table 8.5: Scientific production of Spanish autonomous regions. Numbers in bold represent the top five positions in each parameter.

Aut. Region      A     P      C       C/P   P/A   C/A    JRN%   Q1%    COL%   H

Andalucía        353   2,410  13,421  5.6   6.8   38.0   47.2   28.3   53.6   52
Aragón           38    165    508     3.1   4.3   13.4   36.4   37.0   42.4   11
Asturias         51    181    527     2.9   3.5   10.3   43.1   33.3   39.0   11
Canarias         75    213    229     1.1   2.8   3.1    22.1   21.6   28.3   8
Cantabria        13    117    350     3.0   9.0   26.9   43.6   24.2   64.3   11
C. León          73    274    554     2.0   3.8   7.6    31.4   29.6   42.3   12
C. Mancha        67    594    1,085   1.8   8.9   16.2   35.0   23.4   42.9   16
Cataluña         261   2,009  6,592   3.3   7.7   25.3   38.4   36.6   64.8   33
Extremadura      46    112    194     1.7   2.4   4.2    33.0   18.2   36.1   8
Galicia          94    539    1,275   2.4   5.7   13.6   37.7   36.5   32.9   15
Islas Baleares   47    204    766     3.8   4.3   16.3   50.5   29.9   40.0   14
La Rioja         6     21     56      2.7   3.5   9.3    38.1   0.0    54.5   5
Madrid           383   2,258  5,747   2.5   5.9   15.0   34.7   29.9   49.7   27
Murcia           61    303    706     2.3   5.0   11.6   39.6   36.2   28.1   12
Navarra          15    114    700     6.1   7.6   46.7   64.0   26.7   23.4   15
País Vasco       69    324    1,311   4.0   4.7   19.0   39.2   34.9   33.8   17
Valencia         352   1,929  5,098   2.6   5.5   14.5   29.5   31.8   70.9   28

In the same context, the ranking of autonomous regions with the highest number of citations per academic is: Navarra (46.7), Andalucía (38.0), Cantabria (26.9), Cataluña (25.3) and País Vasco (19.0).

- Regarding the type of publication, the autonomous regions behave differently. Canarias and Valencia usually publish in conferences: more than 70% of their publications are proceedings papers. In contrast, autonomous regions like Navarra and Islas Baleares publish more documents in journals than in conferences: 64.0% and 50.5% of their publications are journal articles, respectively.

- The autonomous regions with the highest percentage of documents published in Q1 journals are: Aragón (37.0%), Cataluña (36.6%), Galicia (36.5%), Murcia (36.2%) and País Vasco (34.9%).

- Regarding international collaboration, the top positions in this ranking are: Valencia (70.9%), Cataluña (64.8%), Cantabria (64.3%), La Rioja (54.5%) and Andalucía (53.6%).

- Analyzing the h-index value, the autonomous regions with the highest values are Andalucía (52), Cataluña (33), Valencia (28), Madrid (27) and País Vasco (17).
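The parameters tabulated in this section (and in Sections 8.2.3 and 8.2.4) can be derived from the raw publication records. The sketch below assumes hypothetical column names: one row per (document, group) pair with a group key (region, university, area or position), a citation count, a document type, a journal quartile and an international-collaboration flag.

import pandas as pd

def group_parameters(pubs: pd.DataFrame, staff: pd.Series) -> pd.DataFrame:
    def h_index(citations):
        ranked = sorted(citations, reverse=True)
        return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

    grouped = pubs.groupby("group")
    out = pd.DataFrame({
        "A": staff,  # number of academics per group
        "P": grouped.size(),
        "C": grouped["citations"].sum(),
        "JRN%": 100 * grouped["doc_type"].apply(lambda s: (s == "Journal Article").mean()),
        "Q1%": 100 * grouped["quartile"].apply(lambda s: (s == "Q1").mean()),
        "COL%": 100 * grouped["international"].mean(),
        "H": grouped["citations"].apply(h_index),
    })
    out["C/P"] = out["C"] / out["P"]
    out["P/A"] = out["P"] / out["A"]
    out["C/A"] = out["C"] / out["A"]
    return out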

8.2.3 University results

Table 8.6 shows the scientific production by Spanish public universities. The number of academic staff (A), the number of publications (P), the number of citations (C), the number of citations per publication (C/P), the number of publications per academic (P/A), the number of citations per academic (C/A), the percentage of documents published in journals (JRN), the percentage of documents published in Q1 journals (Q1), the percentage of documents with international collaboration (COL), and the h-index value (H) are calculated for each university. Analyzing the above parameter values, some conclusions are presented:

Table 8.6: Scientific production of Spanish public universities. Numbers in bold represent the top five positions in each parameter.

University         A     P      C      C/P   P/A    C/A    JRN%   Q1%    COL%   H

A Coruña           45    325    484    1.5   7.2    10.8   29.8   28.2   25.2   11
Alcalá             41    112    193    1.7   2.7    4.7    53.6   26.5   57.1   7
Alicante           77    392    841    2.1   5.1    10.9   25.8   38.6   60.0   13
Almería            37    119    186    1.6   3.2    5.0    31.9   20.0   25.0   7
Aut. Barcelona     46    368    716    1.9   8.0    15.6   25.8   36.9   51.6   10
Aut. Madrid        27    253    773    3.1   9.4    28.6   37.9   44.0   58.8   12
Barcelona          7     52     74     1.4   7.4    10.6   48.1   52.6   19.3   4
Burgos             9     33     125    3.8   3.7    13.9   30.3   57.1   54.6   6
Cádiz              32    49     62     1.3   1.5    1.9    30.6   7.1    14.3   4
Cantabria          13    117    350    3.0   9.0    26.9   43.6   24.2   64.3   11
Carlos III         37    395    494    1.3   10.7   13.4   30.9   23.8   29.6   10
C. Mancha          67    594    1,085  1.8   8.9    16.2   35.0   23.4   42.9   16
Complutense        65    612    1,607  2.6   9.4    24.7   30.9   22.6   51.0   15
Córdoba            20    92     436    4.7   4.6    21.8   66.3   67.3   31.0   12
Extremadura        46    112    194    1.7   2.4    4.2    33.0   18.2   36.1   8
Girona             27    170    863    5.1   6.3    32.0   44.7   32.8   79.3   14
Granada            93    1,107  9,882  8.9   11.9   106.3  59.5   28.8   31.5   50
Huelva             8     17     38     2.2   2.1    4.8    35.3   20.0   0.0    4
Illes Balears      47    204    766    3.8   4.3    16.3   50.5   29.9   40.0   14
Jaén               27    156    1,483  9.5   5.8    54.9   50.6   24.6   21.8   17
Jaume I            58    375    971    2.6   6.5    16.7   32.5   26.4   61.1   15
La Laguna          17    116    156    1.3   6.8    9.2    24.1   4.0    32.1   6
La Rioja           6     21     56     2.7   3.5    9.3    38.1   0.0    54.6   5
Las Palmas GC      58    97     73     0.8   1.7    1.3    19.6   58.3   18.2   5
León               7     10     20     2.0   1.4    2.9    70.0   0.0    0.0    2
Lleida             12    70     106    1.5   5.8    8.8    24.3   43.7   25.0   6
Málaga             92    675    2,241  3.3   7.3    24.4   39.3   24.6   53.8   20
Miguel Hdez        7     39     57     1.5   5.6    8.1    23.1   33.3   3.0    4
Murcia             53    274    677    2.5   5.2    12.8   40.1   38.1   28.4   12
UNED               26    171    434    2.5   6.6    16.7   35.7   28.6   17.8   10
Oviedo             51    181    527    2.9   3.5    10.3   43.1   33.3   39.0   11
Pablo Olavide      3     52     135    2.6   17.3   45.0   32.7   17.6   18.4   6
País Vasco         69    324    1,311  4.0   4.7    19.0   39.2   34.9   32.9   17
Polit. Cartagena   8     30     29     1.0   3.8    3.6    36.7   27.3   18.2   3
Polit. Catalunya   143   1,175  4,169  3.5   8.2    29.2   41.5   34.6   60.5   29
Polit. Madrid      177   772    2,664  3.5   4.4    15.1   38.2   34.4   39.3   23
Polit. Valencia    175   1,081  3,226  3.0   6.2    18.4   28.7   30.0   56.5   25
Pompeu Fabra       3     18     103    5.7   6.0    34.3   94.4   30.0   83.3   5
Pública Navarra    15    114    700    6.1   7.6    46.7   64.0   26.7   23.9   15
Rey Juan Carlos    20    186    322    1.7   9.3    16.1   37.1   27.4   23.9   10
Rovira i Virgili   23    192    693    3.6   8.3    30.1   37.5   44.3   30.8   13
Salamanca          25    138    289    2.1   5.5    11.6   31.2   19.4   25.9   8
S. Compostela      29    210    802    3.8   7.2    27.7   50.9   40.6   19.6   13
Sevilla            41    283    651    2.3   6.9    15.9   30.7   18.2   58.5   13
València           35    115    418    3.6   3.3    11.9   53.0   45.4   37.5   11
Valladolid         32    108    172    1.6   3.4    5.4    28.7   40.7   40.4   7
Vigo               20    50     60     1.2   2.5    3.0    34.0   42.9   3.4    4
Zaragoza           38    165    508    3.1   4.3    13.4   36.4   37.0   42.4   11

- The universities with the highest number of academic staff are: Politécnica de Madrid (177), Politécnica de Valencia (175), Politécnica de Catalunya (143), Granada (93) and Málaga (92). These values influence the fact that these universities also have the highest number of publications: Politécnica de Catalunya (1,175), Granada (1,107), Politécnica de Valencia (1,081), Politécnica de Madrid (772) and Málaga (675). Although the top universities are the same in both rankings, the order is different.

- Regarding citation counts, Granada (9,882), Politécnica de Catalunya (4,169), Politécnica de Valencia (3,226), Politécnica de Madrid (2,664) and Málaga (2,241) occupy the top five positions. This ranking changes when the ratio between citations and publications is calculated: Jaén (9.5), Granada (8.9), Pública de Navarra (6.1), Pompeu Fabra (5.7) and Girona (5.1).

- The universities with the highest number of publications per academic are: Pablo de Olavide (17.3), Granada (11.9), Carlos III (10.7), Complutense (9.4) and Autónoma de Madrid (9.4). In the same context, the ranking of universities with the highest number of citations per academic is: Granada (106.3), Jaén (54.9), Pública de Navarra (46.7), Pablo de Olavide (45.0) and Pompeu Fabra (34.3).

- Analyzing the type of publication, the universities behave differently. Universities that usually publish in journals are: Pompeu Fabra (94.4% journal articles), León (70.0%), Córdoba (66.3%), Pública de Navarra (64.0%) and Granada (59.5%). In contrast, universities that usually publish in conferences are: Las Palmas de Gran Canaria (only 19.6% journal articles), Miguel Hernández (23.1%), La Laguna (24.1%) and Lleida (24.3%).

- The universities with the highest percentage of documents published in Q1 journals are: Córdoba (67.3%), Las Palmas de Gran Canaria (58.3%), Burgos (57.1%), Barcelona (52.6%) and València (45.4%).

- Regarding international collaboration, the top positions in this ranking are: Pompeu Fabra (83.3%), Girona (79.3%), Cantabria (64.3%), Jaume I (61.1%) and Politécnica de Catalunya (60.5%).

- Regarding the h-index value, the universities with the highest values are: Granada (50), Politécnica de Catalunya (29), Politécnica de Valencia (25), Politécnica de Madrid (23) and Málaga (20).

8.2.4 Academic staff results

Table 8.7 analyzes the scientific research of the Spanish academic staff by subject area (CAT, CSAI, CLS) and professional standing (FP, AP1, AP2, AP3). Analyzing the number of academics belonging to each area, the CLS area has the highest value. This area also has the highest number of publications, whereas the CSAI area receives the highest number of citations. Furthermore, CSAI academics also receive 4.5 citations per publication, whereas CLS

Table 8.7: Scientific production of Spanish academic staff by subject areas and professional standing.

Area       A      P      C       C/P   P/A   C/A   JRN%   Q1%    COL%   H

CAT        570    3,151  7,165   2.3   5.5   12.6  31.2   30.0   12.8   30
CSAI       586    4,222  19,181  4.5   7.2   32.7  47.0   33.8   13.2   56
CLS        848    5,049  14,744  2.9   6.0   17.4  33.3   26.5   15.3   43

Position   A      P      C       C/P   P/A   C/A   JRN%   Q1%    COL%   H

FP         280    5,722  26,213  4.6   20.4  93.6  42.6   30.6   15.3   61
AP1        1,182  8,541  23,651  2.8   7.2   20.0  35.1   30.4   12.0   50
AP2        56     326    914     2.8   5.8   16.3  44.8   30.6   7.1    14
AP3        486    638    850     1.3   1.3   1.7   24.8   29.4   5.8    12

and CAT academics receive 2.9 and 2.3 citations per publication, respectively. Regarding the number of publications and citations per academic, researchers belonging to the CSAI area have the highest values: 7.2 documents per academic and 32.7 citations per academic. By document type, CAT academics have a higher percentage of documents published in conferences than other academics. In contrast, CSAI academics have the highest percentage of documents published in journals; they also have the highest percentage of documents published in Q1 journals. Regarding international collaboration, CLS academics publish 15.3% of their documents with international institutions, whereas CSAI and CAT academics achieve 13.2% and 12.8%, respectively. Finally, the h-index values of the CSAI, CLS and CAT academics are 56, 43 and 30, respectively. On the other hand, Table 8.7 also shows that most academics hold the AP1 position. The AP1 group of academics has the highest number of publications, whereas the group of FP academics receives the highest number of citations. These FP academics also have the highest values for citations per publication, publications per academic, citations per academic, percentage of international collaborations and h-index. Analyzing the type of publications, AP3 academics have the highest percentage of documents published in conferences, whereas AP2 academics have the highest percentage of documents published in Q1 journals. Finally, Figure 8.4 plots all academic staff with the number of publications on the x-axis and the number of citations on the y-axis. Analyzing this figure, the top-performing researchers are observed to be FP academics. These academics come from all three subject areas, but CSAI academics usually predominate in the top positions.

8.3 Discussion and conclusions

Despite its limitations, bibliometric analysis is an increasingly important topic for the scientific community. This bibliometric analysis provides a comprehensive overview of the current situation of the Spanish computer science area. It can be considered a tool that could help decision-makers in the processes of strategic planning, in verifying the effectiveness of policies and initiatives for continuous improvement, in the optimization of limited economic resources, and in the promotion of academic staff, among others.

Figure 8.4: Spanish academic staff reflected over productivity and visibility axes. [Scatter plot omitted; x-axis: number of publications (0-200); y-axis: number of citations (0-5,000); points labeled by area and position, e.g., CSAI-FP, CAT-FP, CLS-AP1.]

This overview shows that Spanish research productivity and visibility have increased in recent years, achieving increments of 347% and 1,053%, respectively. Results show that Spanish academics usually publish more proceedings papers than journal articles, despite the low number of citations received by proceedings papers. Their documents are usually published in the Theory and Methods and Artificial Intelligence categories. Nowadays, academics publish more documents in high-quality journals than in previous years. Also, Spanish academics now collaborate more with international institutions. Regarding universities, they behave differently in terms of disciplines, document types, collaboration types, etc. Despite this, some universities, such as Universidad de Granada, Universidad Politécnica de Catalunya, Universidad Politécnica de Valencia, Universidad Politécnica de Madrid and Universidad de Málaga, usually rank in the top positions for different parameters. By subject area, CLS academics publish the highest number of Spanish computer science documents, whereas CSAI academics excel in terms of citations per document, documents per academic, citations per academic and percentage of documents published in high-quality journals. Finally, FP and CSAI academics usually outperform other academics in most of the analyzed parameters. According to our nationwide results, policy-makers should advise academics to increase their number of citations per document. As stated in the introduction, Spain drops considerably in the ranking when citations per document are analyzed. The objective is to improve this ratio rather than the total number of publications. Regarding document types, the number of documents published in journals should also be increased. This is suggested because Spanish academics are assessed according to ANECA criteria, which rate journal articles better than proceedings papers. Regarding collaboration, only 13.9% of documents have been published with international collaboration. This percentage has to be increased, since it is well known that publications with international collaborations usually receive more citations than other publications. Finally, another important aspect to improve is the productivity and visibility of specific areas like Cybernetics, Interdisciplinary Applications and Hardware and Architecture, which have a medium-low number of publications and citations. Analyzing Spanish public universities, some universities, like Universidad de Cádiz and Universidad de Las Palmas de Gran Canaria, among others, should increase their productivity because they have few publications per academic. Similarly, universities like Universidad Politécnica de Cartagena and Universidad de Vigo should improve their ratio of citations per document, whereas Universidad de La Rioja and Universidad de La Laguna should publish in journals with higher impact factors. By collaboration types, academics belonging to Universidad de Huelva and Universidad de León should increase their percentage of documents published with international collaboration. Finally, results could vary depending on the database consulted, which is a point to be taken into account.

Chapter 9

Cluster Analysis in Scientometrics

9.1 Introduction

The process of evaluation of scientific research has become a central element in the management and governance policies of national research systems. The most widespread evaluation methodologies can be classified into two general types: peer review and bibliometric techniques. Although peer review is assumed to be the most reliable methodology, it is slow, expensive and unwieldy; other authors contest this appraisal. This difference of opinion among authors has led to the development of methodologies based on bibliometric techniques. Both types of methodologies have pros and cons, extensively discussed in terms of costs, execution times, limitations and objectiveness of measurement. Some methodologies have been published in the literature for assessing research performance at different levels. First, Abramo and D'Angelo [5] suggested a bibliometric methodology for large-scale comparative evaluation of the research performance of individual researchers, research groups and departments within research institutions. Second, Abramo et al. [6] also developed a bibliometric non-parametric methodology for measuring the performance of research activities in the university system. Third, Costas et al. [102] proposed a general bibliometric methodology for informing the assessment of the research performance of individual scientists. Finally, Torres-Salinas et al. [439] proposed a methodology for comparing academic institutions. These methodologies just presented absolute values of bibliometric indicators achieved by individual scientists, research groups, departments and institutions, among others. Other studies just performed simple descriptive exercises. In this way, Rojo and Gómez [381] provided an overview of scientific (publications) and technological (patents) production, whereas Torres-Salinas et al. [440] analyzed Spanish universities according to quantitative and qualitative measures related to production, impact and journal quality. Unlike previous works, Palomares-Montero and García-Aracil [352] applied a clustering algorithm to analyze Spanish universities. They grouped universities according to three aspects (teaching mission, research mission and knowledge transfer mission), which included indicators such

as the student-teacher ratio, theses supervised per professor, patent-teacher ratio, contracts-teacher ratio and grant income per full-time teacher, among others. The objective of this chapter is to develop a cluster analysis methodology for measuring the performance of research activities in terms of productivity, visibility, quality, prestige and internationalization, while overcoming some of the limitations of the works proposed in the literature. The proposed cluster analysis methodology is based on bibliometric techniques and, therefore, has many advantages (objectivity, rapidity and low costs, among others) over a peer-review methodology. This methodology does not depend on the quality judgment of experts, so it does not suffer from the severe limitations related to subjectivity. It also overcomes the traditional limits of bibliometric analyses based on simple rankings and permits a robust multi-dimensional cluster analysis at the level of universities and academic staff. The cluster analysis methodology has been applied to the Spanish public universities and their academic staff in the computer science area. The results can be used to characterize the research activity of universities and academic staff, identifying both their strengths and weaknesses. These analyses afford a comprehensive overview of the current situation in the area of computer science in Spain. Using the proposed methodology, policy-makers could discover knowledge related to universities and their staff. The goal of the cluster analysis methodology is to form different clusters, maximizing within-cluster homogeneity and between-cluster heterogeneity. In this way, universities/academics that belong to the same cluster are very similar to each other, whereas universities/academics belonging to different clusters are very different in terms of bibliometric data. Each cluster is interpreted as a characterization of the research activity of the universities and academics it contains. These value-added clusters could have potential implications for research policy. Finally, this methodology supports institutions in the processes of strategic planning and in verifying the effectiveness of policies and initiatives for continuous improvement. This work appears in the published paper [226].

Chapter outline

Section 9.2 describes the procedures on which the cluster analysis methodology is based. Section 9.3 presents how both Spanish universities and their academic staff are grouped into different clusters. Finally, Section 9.4 contains some discussions and conclusions about the results and future research on the topic.

9.2 Cluster Analysis Methodology

The proposed cluster analysis methodology is divided into several procedures. These procedures are in charge of defining and describing the bibliometric variables, collecting bibliometric records from different databases, ensuring the reliability of the data, calculating bibliometric indices, presenting a statistical description of the bibliometric indices, performing partitional, hierarchical and probabilistic cluster analysis at different levels, visualizing clustering results, identifying the achieved clusters and, finally, supporting institutions in research policy decisions. These procedures are detailed in the following sections.

9.2.1 Definition of bibliometric variables

The bibliometric indices used in this methodology are widely accepted among the scientific community, measure different aspects of scientific activities and are easily interpretable. The selected bibliometric indices are detailed as follows:

- Normalized documents: This measure indicates the ability of each university to produce scientific knowledge. Normalized documents is defined as the ratio between the number of documents published by each university and the number of academics affiliated with that university. It is calculated allowing for the influence of university size in order to obtain a fair measure of production.

- Normalized citations: This indicator shows the scientific impact that each university has on the scientific community. It again allows for the influence of university size. Normalized citations is the ratio between the number of citations received by each university and the number of academics affiliated with that university.

- Journal publication: This measure analyzes the penchant towards either of the two most important types of research output (journals or conferences). Journal publication represents the ratio between the number of documents published in journals and the total number of documents published both in journals and in conferences. This indicator establishes each university’s main dissemination channel.

- First-quartile documents: It shows the percentage of publications that a university publishes in the world's most influential scholarly journals. Journals considered for this indicator are ranked in the first quartile of their categories as ordered by Journal Citation Reports. First-quartile documents is the percentage of documents published in first-quartile journals with respect to the total published across all quartiles.

- Fourth-quartile documents: This is a similar indicator to First-quartile documents but for the least influential scholarly journals according to JCR (fourth-quartile).

- Citations per journal article: This measure is associated with the impact of journal articles. Citations per journal article represents the mean number of citations received by documents published in journals. This indicator reflects the quality of journal articles published by each university.

- Citations per proceeding paper: This indicator is associated with the impact of proceeding papers. Citations per proceeding paper represents the mean number of citations received by documents published in conference proceedings. This indicator reflects the quality of the proceeding papers published by each university.

- International collaboration: This indicator shows the ability of each university to create international research links through publications. International collaboration represents the percentage of publications that a university publishes in collaboration with overseas institutions.

Another two bibliometric indices (Total documents and Total citations) are used in this methodology. These indices replace Normalized documents and Normalized citations, respectively, for clusterings of academic staff (a sketch of how all these variables could be computed follows the list below):

- Total documents: This measure indicates the ability of each academic to produce sci- entific knowledge. Total documents is defined as the number of documents published by each academic. It represents the academic’s productivity.

- Total citations: This measure shows the scientific impact that each academic has on the scientific community. Total citations is defined as the number of citations received by each academic. It represents the academic’s visibility.
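As a minimal sketch of how the eight university-level variables (and the two academic-level replacements) could be computed from publication records; all column names are hypothetical:

import pandas as pd

def bibliometric_variables(pubs: pd.DataFrame, n_academics: int) -> dict:
    # `pubs` holds one row per document published by one university.
    journals = pubs[pubs["doc_type"] == "Journal Article"]
    proceedings = pubs[pubs["doc_type"] == "Proceedings Paper"]
    return {
        "normalized_documents": len(pubs) / n_academics,
        "normalized_citations": pubs["citations"].sum() / n_academics,
        "journal_publication": len(journals) / len(pubs),
        "first_quartile_documents": 100 * (journals["quartile"] == "Q1").mean(),
        "fourth_quartile_documents": 100 * (journals["quartile"] == "Q4").mean(),
        "citations_per_journal_article": journals["citations"].mean(),
        "citations_per_proceeding_paper": proceedings["citations"].mean(),
        "international_collaboration": 100 * pubs["international"].mean(),
    }

def academic_totals(pubs: pd.DataFrame) -> dict:
    # For clusterings of academic staff, totals replace the normalized indices.
    return {"total_documents": len(pubs),
            "total_citations": pubs["citations"].sum()}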

9.2.2 Data collection

Two datasets are built to analyze the research activity of Spanish public universities and their academic staff in the computer science area. The first dataset includes the values of the above bibliometric indices of Spanish universities from the date of the first publication by an active member of their academic staff (January 1, 1973) to December 31, 2009. This dataset is used to group universities into different clusters. The second dataset includes the values of the bibliometric indices for each academic from his/her first publication until January 1, 2010. This dataset is used to group academics into different clusters.

9.2.3 Statistical description of bibliometric indices

Before performing any clustering approach, a statistical summary of the bibliometric index values is presented. The objective is to provide an overview of the performance of Spanish computer science research in terms of productivity, visibility, quality, prestige and internationalization. After computing all eight bibliometric indices for all 48 universities and 2,004 academics, box plots are shown, representing the smallest observation (extreme of the lower whisker), lower quartile (Q1), median (Q2), upper quartile (Q3) and largest observation (extreme of the upper whisker) for each bibliometric index. Box plots may also indicate which observations, if any, might be considered outliers. This statistical description also shows the top five universities ranked according to our bibliometric variables. These rankings are useful for comparing different universities on a one-dimensional basis.
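A minimal sketch of this descriptive step, assuming `indices` is a pandas DataFrame with one row per university (or academic) and one column per bibliometric variable:

import matplotlib.pyplot as plt
import pandas as pd

def plot_index_boxplots(indices: pd.DataFrame) -> None:
    # One box per bibliometric index: whiskers, Q1, median, Q3 and outliers.
    ax = indices.boxplot(rot=45, grid=False)
    ax.set_ylabel("Index value")
    plt.tight_layout()
    plt.show()

def top_five(indices: pd.DataFrame, variable: str) -> pd.Series:
    # One-dimensional ranking: the top five elements for a given variable.
    return indices[variable].sort_values(ascending=False).head(5)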

9.2.4 Cluster analysis at different levels

This procedure is concerned with finding a structure in a collection of unlabeled elements that are characterized by several variables. The goal is to group the elements in this collection so that elements that belong to a cluster are very similar to each other, whereas different clusters are highly heterogeneous. Different starting points and criteria usually lead to different taxonomies of clustering algorithms [143, 234, 481]. A simple and widely agreed scheme is to classify clustering techniques as partitional clustering, hierarchical clustering and probabilistic clustering, based on the properties of the clusters generated. Partitional clustering groups elements exclusively, so that any element belonging to one specific cluster cannot be a member of another cluster. On the other hand, hierarchical clustering produces a hierarchical structure of clusters. Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones (agglomerative clustering) or by splitting larger clusters (divisive clustering). Finally, probabilistic clustering provides a cluster membership probability for each element, where elements have a specific probability of being members of several clusters. One of the most important issues in cluster analysis is the evaluation of clustering results [194]. Clustering validation is concerned with determining the optimal number of clusters (the best for the input dataset) and checking the quality of the clustering results. Both internal and external validity indices have been used in order to evaluate the clustering results. Internal validity indices do not require a priori information about the dataset, as they are based on information intrinsic to the dataset alone, whereas external validity indices require previous knowledge about the dataset.

Partitional clustering Partitioning Around Medoids (PAM) [249] was used as the representative partitional clustering algorithm in the proposed cluster analysis methodology. It assigns a set of universities/academics to k clusters with no hierarchical structure. PAM has several advantages over other partitional algorithms. First, this algorithm places no limitations on attribute types because it uses real data points (medoids) as the cluster prototypes (medoids do not need any computation and always exist). Second, the choice of medoids is dictated by the location of a predominant fraction of points inside a cluster and, therefore, it is less sensitive to the presence of outliers. Finally, the resulting clustering is independent of the initial choice of medoids. The final objective of this algorithm is to determine a representative university/academic (medoid) for each cluster.

Hierarchical clustering Ward's algorithm [471] was used as an advanced hierarchical clustering procedure in the proposed cluster analysis methodology. It builds a tree of clusters, called a dendrogram, which allows exploring the data at different levels of granularity. Given k clusters, the objective of this algorithm is to reduce the k clusters to k-1 mutually exclusive clusters by considering the union of all possible pairs. Unlike other hierarchical clustering algorithms, which use simple distance measures, it uses an analysis of variance approach to evaluate the distances between clusters. Finally, Ward's algorithm selects the union of clusters which minimizes the heterogeneity among cluster elements. The complete hierarchical structure can be obtained by repeating this process until only one cluster remains.
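A minimal sketch of this procedure with SciPy follows; the random matrix is a stand-in for the 48 x 8 university-by-index data assumed here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(48, 8))  # stand-in for the real index matrix
Z = linkage(X, method="ward")                      # Ward's minimum-variance agglomeration
labels = fcluster(Z, t=4, criterion="maxclust")    # cut the dendrogram into 4 clusters
tree = dendrogram(Z, no_plot=True)                 # tree layout (plotting needs matplotlib)
```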

Probabilistic clustering The EM algorithm [124] was used as a representative algorithm of probabilistic clustering in the proposed cluster analysis methodology. The objective of this algorithm is to find the most likely set of clusters given the data. Given a number of clusters k, this algorithm models the data as a finite mixture of k probability density functions, so that each cluster is represented by one component of the mixture. The EM algorithm was used to find the maximum likelihood estimates of the mixing coefficients and the parameters of each distribution. All variables are modeled as conditionally independent Gaussian distributions given the cluster value. Thus, each distribution is characterized by two parameters for each variable: the mean and the standard deviation.
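An equivalent model can be sketched with scikit-learn's GaussianMixture, which fits a finite Gaussian mixture by EM; covariance_type="diag" corresponds to the conditional independence assumption above, and the random matrix is a stand-in for the real data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(48, 8))  # stand-in bibliometric matrix
gm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(X)
probs = gm.predict_proba(X)                # soft membership probability per element
means = gm.means_                          # per-cluster mean of each variable
stds = np.sqrt(gm.covariances_)            # per-cluster standard deviation (diag model)
```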

9.2.5 Visualization of clustering results

Several figures are presented to represent different aspects of the clustering results. After performing the partitional clustering, several tables show all universities/academics grouped into disjoint clusters and the medoid bibliometric values within each cluster. Even if universities/academics belong to the same cluster, they may behave differently depending on the bibliometric indices. For this reason, several figures are presented showing the cluster projection for some specific bibliometric indices. Then, the hierarchical structure of clusters (dendrogram) obtained by merging smaller clusters into larger ones is represented. This dendrogram shows how the clusters are related. By cutting the dendrogram at a target level, all universities/academics are grouped into disjoint clusters. Regarding probabilistic clustering, the mean and standard deviation values for each bibliometric variable within the resulting clusters are presented. After that, each university/academic's probability of being a member of each cluster is listed.

The resulting clusters are also visually inspected using a representation in a lower dimensional space. The goal is to obtain a three-dimensional representation that approximates our eight-dimensional bibliometric variables and to check whether or not the clusters are visually distinguishable. Principal component analysis was used for this purpose. Finally, the distribution of academics grouped in each cluster is also plotted in order to analyze each cluster by the areas and positions associated with each academic.

9.2.6 Identification of final clusters

Each cluster can be defined according to different research activity aspects, e.g., productivity (documents per academic), visibility (citations per academic), quality (citations per journal article and proceeding paper), prestige (first-quartile journals), and internationalization (international collaboration). Global labels (high, medium-high, medium-low and low) are set for the values of each bibliometric index in the different clustering algorithms. In this way, each cluster can be represented as a set of global labels associated with research activity aspects. Using the resulting clusters and the above labels, it can be concluded that some universities/academics produce more scientific knowledge and have a bigger scientific impact than others, that other universities/academics usually publish in the most influential journals and thus follow a selective strategy, and, finally, that specific universities/academics have an excellent ability to create international research collaborations.

9.2.7 Implications on research policy

Methodologies for the evaluation of research activities have been attracting an increasing amount of interest in the last few years. The conclusions of these methodologies have great relevance for the design of policies to promote research and development. The proposed cluster analysis methodology provides a comprehensive overview of the current situation in a specific discipline and region. This overview could help policy-makers to make decisions. The proposed methodology could be considered a tool that could also help university presidents and heads of departments and research groups in the processes of strategic planning, in verifying the effectiveness of policies and initiatives for continuous improvement, in the optimization of limited economic resources, and in the promotion of academic staff, among others. The resulting clusters are interpreted as characterizations of the research activity of universities and academic staff, identifying both their strengths and weaknesses. Using this methodology, policy-makers could propose collaborations and alliances among universities. These universities could perhaps merge strategically in order to exploit their resources, enhance their reputation and visibility, and compete with the most active international universities.

9.3 Exploring Spanish computer science research

9.3.1 Spanish public universities

All the bibliometric indices for all 48 universities are calculated. Figure 9.1 shows the box plots of the distribution of each bibliometric index. Taking Normalized documents as an example, 1.4 was the value of the lower whisker, whereas 11.9 was the value of the upper whisker. The 25th percentile (Q1), 50th percentile (Q2) and 75th percentile (Q3) were 3.6, 5.8, and 7.5 documents per academic, respectively. An outlier (17.3) was also found, corresponding to Universidad Pablo de Olavide de Sevilla. Taking another example (Journal publication), three outliers are observed, corresponding to Universidad Pompeu Fabra (94.4), Universidad de León (70.0) and Universidad de Córdoba (66.3). In this case, the five-number summary was: lower whisker (19.6), 25th percentile (30.8), 50th percentile (36.5), 75th percentile (44.1), and upper whisker (64.0). Finally, the minimum and maximum values of the analyzed indices are also reported: Normalized documents [1.4, 17.3], Normalized citations [1.3, 106.3], Journal publication [19.6, 94.4], First-quartile documents [0.0, 67.3], Fourth-quartile documents [0.0, 57.1], Citations per journal article [0.7, 18.3], Citations per proceeding paper [0.0, 2.0], and International collaboration [0.0, 83.3].

Table 9.1 shows the top five universities ranked according to the selected eight variables. Analyzing the values for all universities, Universidad de Granada (UGR) had the highest value of Normalized citations. This means that the mean number of citations received by each academic affiliated with Universidad de Granada was 106.3. It was also found that the best university according to First-quartile documents was Universidad de Córdoba (UCO).


Figure 9.1: Box plot of each bibliometric index

Table 9.1: Top five universities ranked according to selected bibliometric variables

Variables                        Univ 1        Univ 2        Univ 3        Univ 4        Univ 5
Normalized documents             UPO (17.3)    UGR (11.9)    UC3M (10.7)   UCM (9.4)     UAM (9.4)
Normalized citations             UGR (106.3)   UJA (54.9)    UPNA (46.7)   UPO (45.0)    UPF (34.3)
Journal publication              UPF (94.4)    ULE (70.0)    UCO (66.3)    UPNA (64.0)   UGR (59.5)
First-quartile documents         UCO (67.3)    ULPGC (58.3)  UBU (57.1)    UB (52.6)     UV (45.5)
Fourth-quartile documents        UCA (57.1)    UNEX (33.3)   ULE (33.3)    UPO (29.4)    UAB (26.2)
Citations per journal article    UJA (18.3)    UGR (14.3)    UDG (10.7)    UBU (9.6)     UPNA (9.5)
Citations per proceeding paper   UAM (2.0)     UCM (1.6)     US (1.4)      URV (1.4)     UPC (1.3)
International collaboration      UPF (83.3)    UDG (79.3)    UC (64.3)     UJI (61.1)    UPC (60.5)

Table 9.2: Partitional clustering: four clusters of universities

Clusters Universities

Cluster A A Coruña, Almería, Cádiz, Carlos III de Madrid, Extremadura, Huelva, La Laguna, Las Palmas de Gran Canaria, León, Lleida, Miguel Hernández de Elche, Salamanca, Nacional de Educación a Distancia, Politécnica de Cartagena, Rey Juan Carlos, Vigo

Cluster B Alcalá de Henares, Alicante, Autónoma de Barcelona, Autónoma de Madrid, Cantabria, Castilla-La Mancha, Complutense de Madrid, Girona, Jaume I de Castelló, La Rioja, Málaga, Politècnica de Catalunya, Politècnica de València, Pompeu Fabra, Sevilla

Cluster C Barcelona, Burgos, Córdoba, Illes Balears, Murcia, Oviedo, País Vasco, Valencia, Politécnica de Madrid, Rovira i Virgili, Santiago de Compostela, Valladolid, Zaragoza

Cluster D Granada, Jaén, Pablo de Olavide, Pública de Navarra

That is, 67.3% of its journal articles were published in first-quartile journals. Similarly, Universidad Pompeu Fabra (UPF) was the best university regarding International collaboration because 83.3% of its collaborative documents were co-authored by researchers with overseas affiliations.

Before running any algorithm, the number of clusters should be fixed using clustering validation. The optimal number of clusters is usually determined based on internal validity indices like the silhouette coefficient [385]. This index is used to measure the goodness of a clustering structure without external information. Its value ranges from -1 to 1. A larger average silhouette coefficient indicates a better overall quality of the clustering result, so the optimal number of clusters is the one that gives the largest average silhouette value. After running clustering validation, the partitions with two clusters and four clusters had the highest silhouette coefficients. Although the four-cluster partition had a slightly lower silhouette coefficient (0.65) than the two-cluster partition (0.67), the four-cluster partition (k=4) was selected because it explained the dataset more realistically.

After choosing the number of clusters, partitioning around medoids (partitional clustering) was performed. Table 9.2 shows all universities grouped into four disjoint clusters. The numbers of universities belonging to each cluster were: cluster A (16 universities), cluster B (15 universities), cluster C (13 universities) and cluster D (4 universities).

Table 9.3 shows the medoid values within the four clusters (A, B, C and D). Analyzing the variable values, there were some differences among clusters. For example, universities belonging to cluster D had the highest value for Normalized citations (54.9 citations per academic). They also excelled in terms of Journal publication and Citations per journal article. Universities associated with the other clusters (A, B and C) excelled with respect to the other variables: cluster A (Fourth-quartile documents), cluster B (Normalized documents, Citations per proceeding paper and International collaboration), and cluster C (First-quartile documents). Finally, Table 9.3 also shows the medoid university within each cluster. In this way, Universidad de A Coruña (UDC) was representative of cluster A, Universidad de Málaga (UMA) of cluster B, Universidad Politécnica de Madrid (UPM) of cluster C, and Universidad de Jaén (UJA) of cluster D.
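The k selection described above can be sketched as follows with scikit-learn; since scikit-learn ships no PAM implementation, KMeans is used here as a stand-in partitional algorithm, and the random matrix stands in for the real data:

```python
import numpy as np
from sklearn.cluster import KMeans          # stand-in; the chapter uses PAM
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(48, 8))  # stand-in bibliometric matrix
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # The chosen k is the one with the largest average silhouette coefficient.
    print(k, silhouette_score(X, labels))
```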

Table 9.3: Partitional clustering: Medoid values within the four clusters (A, B, C and D) and the number of universities (in parentheses) associated with each cluster

Variables                        A (16 univ)   B (15 univ)   C (13 univ)   D (4 univ)
Normalized documents             7.2           7.3           4.4           5.8
Normalized citations             10.8          24.4          15.1          54.9
Journal publication              29.8          39.3          38.2          50.6
First-quartile documents         28.2          24.6          34.3          24.6
Fourth-quartile documents        17.9          17.7          10.8          12.3
Citations per journal article    3.1           6.5           7.5           18.3
Citations per proceeding paper   0.8           1.3           1.0           0.5
International collaboration      25.2          53.8          39.3          21.8
Medoid university                UDC           UMA           UPM           UJA

Even if universities belong to the same cluster, they may behave differently depending on the bibliometric indices. Figure 9.2 shows the cluster analysis projection for some specific bibliometric indices. Universities belonging to clusters A, B, C and D are represented by point-down triangles, squares, circles and point-up triangles, respectively.

Figure 9.2 (top) shows the projection on the Normalized documents and Normalized citations axes. Taking cluster D (point-up triangles) as an example, there were important differences among the four universities. Universidad Pablo de Olavide (UPO) belonged to cluster D and ranked 1st for Normalized documents, whereas Universidad de Jaén (UJA), which also belonged to cluster D, ranked 24th for Normalized documents. Big differences are also observed between Universidad de Granada (UGR) and the other three universities regarding Normalized citations. Despite these differences, the four universities were the top scorers for Normalized citations.

Figure 9.2 (middle) shows the projection on the First-quartile documents and Journal publication axes. Note that these rankings are very different from the previous ones (Figure 9.2, top). According to Table 9.3, cluster A had the lowest value for Journal publication. Despite this, Universidad de León (ULE), which belonged to cluster A, ranked 2nd for Journal publication, outperforming universities belonging to better clusters. Universidad Pompeu Fabra (UPF) ranked 1st for Journal publication and was a member of cluster B, which was not the highest scoring group for the analyzed variable. On the other hand, universities belonging to cluster C usually ranked top for First-quartile documents. Universidad de Córdoba (UCO), which ranked 1st for First-quartile documents, also ranked 3rd for Journal publication.

Figure 9.2 (bottom) shows the projection on the Citations per journal article and Citations per proceeding paper axes. In this case, two universities belonging to cluster D (Universidad de Jaén (UJA) and Universidad de Granada (UGR)) were among the highest scorers for Citations per journal article. Universities belonging to cluster B (squares) did not score high for Citations per journal article. Even so, Universidad de Girona (UDG), which belonged to cluster B, ranked 3rd for the above measure. On the other hand, universities belonging to cluster A (point-down triangles) did not score high on Citations per proceeding paper. Even so, Universidad de Lleida (UDL), which belonged to cluster A, ranked 6th for the above measure. Universities belonging to cluster B, like Universidad Autónoma de Madrid (UAM) and Universidad Complutense de Madrid (UCM), ranked top for Citations per proceeding paper.


Figure 9.2: Partitional clustering: Projection on bibliometric indices axes. Universities belonging to clusters A, B, C and D are represented by point-down triangles, squares, circles and point-up triangles, respectively.

Figure 9.3 represents the hierarchical structure of clusters (dendrogram) obtained by merging smaller clusters into larger ones (Ward's algorithm). This dendrogram shows how the clusters are related. By cutting the dendrogram at the horizontal line (target level), all universities are grouped into four disjoint clusters.

Figure 9.3: Hierarchical clustering: Hierarchical structure of clusters (dendrogram) obtained by merging smaller clusters into larger ones using Ward's algorithm

Internal validity indices were used to cut the dendrogram instead of other criteria, such as those proposed by Maarek and Ben Shaul [302], who described slicing techniques that automatically identify the cut-off point within the dendrogram with a comparable degree of intra-cluster similarity. Figure 9.3 shows that the numbers of universities belonging to each cluster are: cluster A (10 universities), cluster B (15 universities), cluster C (19 universities) and cluster D (4 universities). These results are very similar to the outcomes of partitional clustering. Note that the four universities belonging to cluster D and 14 out of 15 universities belonging to cluster B are the same as before. Regarding cluster A, note that it contains fewer universities (down from 16 to 10), six universities having moved to cluster C.

Regarding probabilistic clustering, four different clusters were formed by running the EM algorithm. Table 9.4 shows the mean and standard deviation values for each variable within the four resulting clusters.

Table 9.4: Probabilistic clustering: Mean ± standard deviation values for each variable within the four clusters (A, B, C and D) and the number of universities (in parenthesis) associated with each cluster

Variables                        A (19 univ)    B (16 univ)    C (9 univ)     D (4 univ)
Normalized documents             4.6 ± 2.6      7.1 ± 1.7      4.9 ± 1.2      10.6 ± 4.4
Normalized citations             7.2 ± 4.1      21.9 ± 7.4     16.9 ± 5.1     62.4 ± 25.7
Journal publication              34.8 ± 11.4    38.5 ± 15.5    46.5 ± 9.3     51.4 ± 12.1
First-quartile documents         26.1 ± 16.1    32.7 ± 9.9     39.4 ± 11.2    24.5 ± 4.2
Fourth-quartile documents        19.2 ± 12.4    15.9 ± 4.6     9.6 ± 3.6      15.8 ± 7.9
Citations per journal article    3.4 ± 1.2      6.3 ± 1.8      6.7 ± 1.2      12.2 ± 4.4
Citations per proceeding paper   0.6 ± 0.4      1.1 ± 0.4      0.7 ± 0.2      0.5 ± 0.3
International collaboration      24.0 ± 15.9    56.5 ± 12.9    31.9 ± 7.9     23.9 ± 4.8

Taking Normalized citations as an example, universities belonging to cluster D received an average of 62.4 ± 25.7 citations per academic. In contrast, universities belonging to clusters A, B and C received fewer citations on average: 7.2 ± 4.1, 21.9 ± 7.4 and 16.9 ± 5.1, respectively.

Table 9.5 shows which universities belong to each cluster. It lists each university's probability of being a member of each cluster. The highest membership probability for almost all universities was close to 1.00. This means that there was no doubt about which cluster they belonged to.

Table 9.5: Cluster membership probability of each university

University Cluster A Cluster B Cluster C Cluster D

A Coruña (UDC)                             0.98983  0.00950  0.00026  0.00040
Alcalá (UAH)                               0.99902  0.00098  0.00000  0.00000
Alicante (UA)                              0.02357  0.97587  0.00056  0.00000
Almería (UAL)                              0.99987  0.00003  0.00002  0.00008
Autónoma de Barcelona (UAB)                0.05224  0.94776  0.00000  0.00000
Autónoma de Madrid (UAM)                   0.00000  1.00000  0.00000  0.00000
Barcelona (UB)                             0.99745  0.00020  0.00235  0.00000
Burgos (UBU)                               0.00000  0.99723  0.00277  0.00000
Cádiz (UCA)                                1.00000  0.00000  0.00000  0.00000
Cantabria (UC)                             0.00000  0.99999  0.00001  0.00000
Carlos III de Madrid (UC3M)                0.97364  0.01380  0.00000  0.01257
Castilla-La Mancha (UCLM)                  0.19839  0.80157  0.00004  0.00001
Complutense de Madrid (UCM)                0.00000  1.00000  0.00000  0.00000
Córdoba (UCO)                              0.00000  0.00003  0.99997  0.00000
Extremadura (UNEX)                         1.00000  0.00000  0.00000  0.00000
Girona (UDG)                               0.00000  1.00000  0.00000  0.00000
Granada (UGR)                              0.00000  0.00000  0.00000  1.00000
Huelva (UHU)                               1.00000  0.00000  0.00000  0.00000
Illes Balears (UIB)                        0.00008  0.00246  0.99746  0.00000
Jaén (UJA)                                 0.00000  0.00000  0.00000  1.00000
Jaume I de Castellón (UJI)                 0.00035  0.99862  0.00103  0.00000
La Laguna (ULL)                            0.99931  0.00044  0.00025  0.00000
La Rioja (UR)                              0.94613  0.05385  0.00002  0.00000
Las Palmas de Gran Canaria (ULPGC)         1.00000  0.00000  0.00000  0.00000
León (ULE)                                 1.00000  0.00000  0.00000  0.00000
Lleida (UDL)                               0.99801  0.00199  0.00000  0.00000
Málaga (UMA)                               0.00000  0.99999  0.00001  0.00000
Miguel Hernández de Elche (UMH)            0.99999  0.00001  0.00000  0.00000
Murcia (UM)                                0.07037  0.00822  0.92141  0.00001
Nacional de Educación a Distancia (UNED)   0.09010  0.02112  0.88724  0.00154
Oviedo (UNIOVI)                            0.08176  0.02303  0.89521  0.00000
Pablo Olavide de Sevilla (UPO)             0.00000  0.00000  0.00000  1.00000
País Vasco (EHU)                           0.00000  0.00886  0.99110  0.00003
Politécnica de Cartagena (UPCT)            0.99994  0.00000  0.00000  0.00006
Politècnica de Catalunya (UPC)             0.00000  1.00000  0.00000  0.00000
Politécnica de Madrid (UPM)                0.00008  0.03569  0.96423  0.00000
Politècnica de València (UPV)              0.00000  0.99949  0.00051  0.00000
Pompeu Fabra (UPF)                         0.00000  1.00000  0.00000  0.00000
Pública de Navarra (UPNA)                  0.00000  0.00001  0.00000  0.99999
Rey Juan Carlos (URJC)                     0.82268  0.12601  0.00047  0.05084
Rovira i Virgili (URV)                     0.00000  0.99930  0.00070  0.00000
Salamanca (USAL)                           0.96119  0.01620  0.02137  0.00125
Santiago de Compostela (USC)               0.00000  0.01375  0.98619  0.00006
Sevilla (US)                               0.00394  0.99606  0.00000  0.00000
València (UV)                              0.00117  0.00068  0.99815  0.00000
Valladolid (UVA)                           0.97670  0.00070  0.02261  0.00000
Vigo (UVIGO)                               1.00000  0.00000  0.00000  0.00000
Zaragoza (UZ)                              0.01693  0.95124  0.03183  0.00000

For example, the members of cluster D were: Universidad de Granada, Universidad de Jaén, Universidad Pablo de Olavide de Sevilla (their membership probability of cluster D was 1.00000) and Universidad Pública de Navarra (its membership probability of cluster D was 0.99999). On the other hand, all universities belonging to cluster A had a membership probability greater than 0.82, whereas universities belonging to clusters B and C had probabilities greater than 0.80 and 0.88, respectively.

Clustering validation is also concerned with checking the quality of clustering results

Table 9.6: Comparisons among three different clustering results

University Partitional Hierarchical Probabilistic

A Coruña (UDC)                             A  A  A
Alcalá (UAH)                               B  B  A
Alicante (UA)                              B  B  B
Almería (UAL)                              A  A  A
Autónoma de Barcelona (UAB)                B  B  B
Autónoma de Madrid (UAM)                   B  B  B
Barcelona (UB)                             C  C  A
Burgos (UBU)                               C  B  B
Cádiz (UCA)                                A  A  A
Cantabria (UC)                             B  B  B
Carlos III de Madrid (UC3M)                A  A  A
Castilla-La Mancha (UCLM)                  B  B  B
Complutense de Madrid (UCM)                B  B  B
Córdoba (UCO)                              C  C  C
Extremadura (UNEX)                         A  A  A
Girona (UDG)                               B  B  B
Granada (UGR)                              D  D  D
Huelva (UHU)                               A  C  A
Illes Balears (UIB)                        C  C  C
Jaén (UJA)                                 D  D  D
Jaume I de Castellón (UJI)                 B  B  B
La Laguna (ULL)                            A  A  A
La Rioja (UR)                              B  A  A
Las Palmas de Gran Canaria (ULPGC)         A  C  A
León (ULE)                                 A  A  A
Lleida (UDL)                               A  C  A
Málaga (UMA)                               B  B  B
Miguel Hernández de Elche (UMH)            A  C  A
Murcia (UM)                                C  C  C
Nacional de Educación a Distancia (UNED)   A  C  C
Oviedo (UNIOVI)                            C  C  C
Pablo Olavide de Sevilla (UPO)             D  D  D
País Vasco (EHU)                           C  C  C
Politécnica de Cartagena (UPCT)            A  C  A
Politècnica de Catalunya (UPC)             B  B  B
Politécnica de Madrid (UPM)                C  C  C
Politècnica de València (UPV)              B  B  B
Pompeu Fabra (UPF)                         B  B  B
Pública de Navarra (UPNA)                  D  D  D
Rey Juan Carlos (URJC)                     A  A  A
Rovira i Virgili (URV)                     C  C  B
Salamanca (USAL)                           A  A  A
Santiago de Compostela (USC)               C  C  C
Sevilla (US)                               B  B  B
València (UV)                              C  C  C
Valladolid (UVA)                           C  C  A
Vigo (UVIGO)                               A  C  A
Zaragoza (UZ)                              C  C  B

Table 9.7: Definition of clusters regarding different research activity aspects

                       Cluster A     Cluster B     Cluster C     Cluster D
Productivity           medium-low    medium-high   medium-low    high
Visibility             low           medium-low    medium-low    high
Quality                medium-low    medium-high   medium-high   high
Prestige               medium-low    medium-high   high          medium-low
Internationalization   medium-low    high          medium-high   medium-low

using external validity indices like the Rand index [374]. It takes the value of 1 when the two clusterings are identical. After running clustering validation, important similarities among the clustering algorithms were found: partitional vs hierarchical (0.8262), partitional vs probabilistic (0.8245), and hierarchical vs probabilistic (0.7819). Note that the hierarchical and probabilistic clustering pair had the lowest Rand index value, whereas the other pairs had similar agreements. Table 9.6 compares the results of the three clustering algorithms to assess the robustness of the results. Universities listed in grey shaded rows were not grouped in the same cluster by all the clustering algorithms.
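A small sketch of this comparison, assuming a recent scikit-learn (rand_score was added in version 0.24); the toy label vectors below are illustrative, not the real assignments:

```python
from sklearn.metrics import rand_score, adjusted_rand_score

# Toy cluster labels for six universities under two clusterings.
partitional = ["A", "A", "B", "C", "D", "B"]
hierarchical = ["A", "A", "B", "B", "D", "B"]

print(rand_score(partitional, hierarchical))           # 1.0 means identical clusterings
print(adjusted_rand_score(partitional, hierarchical))  # chance-corrected variant
```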

Each cluster can be defined according to different research activity aspects (e.g., productivity (documents per academic), visibility (citations per academic), quality (citations per journal article and proceeding paper), prestige (first-quartile journals), and internationalization (international collaboration)). Global labels (high, medium-high, medium-low and low) are set for the values of each bibliometric index in the different clustering algorithms. Table 9.7 represents each cluster according to research activity aspects. Taking cluster B as an example, the productivity of universities belonging to this cluster was medium-high but their visibility was medium-low. Also, their values for quality, prestige and internationalization were medium-high, medium-high and high, respectively.

To summarize these results, it is concluded that universities belonging to cluster D produce more scientific knowledge and have a bigger scientific impact than other universities. Universities belonging to cluster C usually publish in the most influential journals, and thus they have a selective strategy. In contrast, universities belonging to cluster B have an excellent ability to create international research publications, whereas universities belonging to cluster A do not stand out on any research activity aspect.

Finally, another variable, the number of computer science theses published during the 2005-2009 period, which was not used for the clustering, is also used as an external variable to describe the four resulting clusters. Results show that cluster B had the highest value, followed by cluster D, cluster C and cluster A. Analyzing all universities, it is also observed that the three top ranked universities for number of computer science theses were Universidad Politècnica de Catalunya (cluster B), Universidad Politècnica de València (cluster B) and Universidad de Granada (cluster D).

Table 9.8: Mean ± standard deviation values for each variable within the six clusters (A, B, C, D, E and F) and the number of academics (in parentheses) associated with each cluster

Variables   A (321)        B (839)       C (416)        D (166)        E (248)       F (14)
TD          14.0 ± 9.4     2.5 ± 4.8     9.2 ± 8.9      33.1 ± 20.4    11.0 ± 8.1    74.5 ± 39.8
TC          26.4 ± 23.2    2.3 ± 6.5     19.8 ± 21.5    175.5 ± 89.0   21.0 ± 20.4   1249.4 ± 1071.7
JP          37.7 ± 24.9    2.8 ± 7.6     55.9 ± 26.5    54.7 ± 22.1    42.1 ± 24.5   69.6 ± 14.7
Q1          23.6 ± 28.6    0.0 ± 0.9     5.6 ± 10.8     31.7 ± 19.9    70.6 ± 25.6   37.4 ± 15.0
Q4          14.2 ± 23.5    0.0 ± 0.0     28.4 ± 36.1    11.8 ± 13.0    3.9 ± 11.1    8.9 ± 7.7
CJ          4.3 ± 6.4      0.4 ± 1.7     4.1 ± 5.0      11.0 ± 8.6     4.5 ± 6.3     23.5 ± 8.7
CP          0.8 ± 0.9      0.3 ± 0.9     0.6 ± 0.8      1.4 ± 1.6      0.6 ± 0.7     0.9 ± 0.7
IC          83.5 ± 20.0    0.6 ± 4.3     5.5 ± 12.3     48.0 ± 31.7    8.7 ± 16.5    39.8 ± 19.5

TD = Total documents, TC = Total citations, JP = Journal publication, Q1 = First-quartile documents, Q4 = Fourth-quartile documents, CJ = Citations per journal article, CP = Citations per proceeding paper, and IC = International collaboration

9.3.2 Spanish public university academic staff

All bibliometric indices for all 2004 academics were calculated. The minimum and maximum values of the distribution of each selected bibliometric index are reported: Total documents [0, 178], Total citations [0, 4570], Journal publication [0, 100], First-quartile documents [0, 100], Fourth-quartile documents [0, 100], Citations per journal article [0, 82.5], Citations per proceeding paper [0, 16.0], and International collaboration [0, 100]. Taking Total citations as an example, 0 was the lowest number of citations received by a specific academic, whereas 4570 was the highest value.

An internal clustering validation was also performed to find the optimal number of clusters for the academic staff dataset. After running clustering validation, the partition with six clusters (k=6) had the highest silhouette coefficient. A partitional clustering algorithm (partitioning around medoids) was therefore run setting the number of clusters to six. Hierarchical and probabilistic clusterings were not performed for space reasons, as the figures associated with these analyses would be too large to represent 2004 academics.

Table 9.8 shows the number of academics (in parentheses) associated with each cluster and the mean and standard deviation values for each variable within the six resulting clusters. The numbers of academics belonging to each cluster were: cluster A (321 academics), cluster B (839 academics), cluster C (416 academics), cluster D (166 academics), cluster E (248 academics), and cluster F (14 academics). Analyzing the variable values, there were some differences among clusters. Taking Total documents as an example, academics belonging to cluster F had the highest mean value (74.5 ± 39.8). They also stood out on Total citations, Journal publication and Citations per journal article. Academics associated with cluster E excelled in terms of First-quartile documents, whereas academics associated with cluster C excelled with respect to Fourth-quartile documents. Finally, academics in cluster D had the highest value of Citations per proceeding paper and academics belonging to cluster A stood out on International collaboration.

The clusters obtained with partitioning around medoids were visually inspected using


Figure 9.4: Visualization of the academic clusters in three and two-dimensional spaces obtained with principal component analysis

a representation in a lower dimensional space (see Figure 9.4). The goal was to obtain a three-dimensional representation that approximates our eight-dimensional variables and to check whether or not the clusters were visually distinguishable. A principal component analysis [359] was performed, and the three principal components accounting for the highest proportion of variance (95.0%) were studied.
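A minimal sketch of this dimensionality reduction with scikit-learn, assuming the 2004 x 8 academic-by-index matrix (random data below is a stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(2004, 8))  # stand-in for the real matrix
pca = PCA(n_components=3).fit(X)
coords = pca.transform(X)                 # 3-D coordinates for plotting each academic
print(pca.explained_variance_ratio_)      # proportion of variance per component
```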

Figure 9.4 plots the values of the bibliometric indices for each academic in the transformed three-dimensional space. Different symbols and colors were used to show the cluster assigned by the clustering algorithm to each academic. Two-dimensional projections were also included for ease of interpretation. The first principal component (1st PC), which accounted for 85.9% of the variance, distinguished academics in cluster D and cluster F from the other clusters. The second principal component (2nd PC) distinguished academics belonging to cluster A from cluster B, and accounted for 5.9% of the variance. Finally, the third principal component (3rd PC), which accounted for 3.2% of the variance, distinguished between academics belonging to cluster E and cluster A. It also distinguished between academics belonging to cluster E and cluster B.

Table 9.9 shows the number of academics at each university belonging to each of the six clusters. Taking cluster F as an example, its 14 members were: 1 academic from Universidad de Girona, 8 academics from Universidad de Granada, 1 academic from

Table 9.9: Number of academics within the six clusters by universities

Academics from university           A (321)  B (839)  C (416)  D (166)  E (248)  F (14)
A Coruña                            6        16       15       0        8        0
Alcalá                              6        23       9        1        2        0
Alicante                            13       30       14       4        16       0
Almería                             3        21       8        0        5        0
Autónoma de Barcelona               12       12       11       7        4        0
Autónoma de Madrid                  10       6        5        3        3        0
Barcelona                           0        3        1        0        3        0
Burgos                              1        7        0        1        0        0
Cádiz                               1        22       9        0        0        0
Cantabria                           5        2        4        1        1        0
Carlos III de Madrid                4        2        13       1        7        0
Castilla-La Mancha                  12       18       15       7        15       0
Complutense de Madrid               17       19       11       9        9        0
Córdoba                             0        6        3        5        6        0
Extremadura                         6        29       8        0        3        0
Girona                              13       8        2        2        1        1
Granada                             8        11       27       33       6        8
Huelva                              0        6        2        0        0        0
Illes Balears                       5        23       6        5        8        0
Jaén                                2        12       7        1        4        1
Jaume I de Castellón                18       24       10       3        3        0
La Laguna                           6        4        7        0        0        0
La Rioja                            1        3        2        0        0        0
Las Palmas de Gran Canaria          0        45       6        0        7        0
León                                0        3        4        0        0        0
Lleida                              4        3        3        0        2        0
Málaga                              12       30       22       13       15       0
Miguel Hernández de Elche           0        5        1        0        1        0
Murcia                              7        22       15       3        6        0
Nacional de Educación a Distancia   3        9        9        2        3        0
Oviedo                              7        32       7        1        4        0
Pablo Olavide de Sevilla            0        0        0        1        2        0
País Vasco                          5        37       13       6        8        0
Politécnica de Cartagena            0        5        1        0        2        0
Politècnica de Catalunya            45       27       29       20       22       0
Politécnica de Madrid               19       106      30       10       11       1
Politècnica de València             41       70       32       13       17       2
Pompeu Fabra                        2        0        0        0        1        0
Pública de Navarra                  3        1        3        1        6        1
Rey Juan Carlos                     3        3        10       0        4        0
Rovira i Virgili                    0        15       5        2        1        0
Salamanca                           1        20       1        1        2        0
Santiago de Compostela              1        12       8        4        4        0
Sevilla                             6        18       10       3        4        0
València                            3        21       6        1        4        0
Valladolid                          6        19       3        0        4        0
Vigo                                0        14       4        0        2        0
Zaragoza                            4        15       5        2        12       0

Universidad de Jaén, 1 academic from Universidad Politécnica de Madrid, 2 academics from Universidad Politècnica de València and 1 academic from Universidad Pública de Navarra. Moreover, the biggest cluster, B, was composed mainly of academics from the Universidad Politécnica de Madrid (106), Universidad Politècnica de València (70) and Universidad de Las Palmas de Gran Canaria (45).

Summarizing all results, academics of cluster F usually produced more scientific knowledge and had more impact than other academics. They had the highest impact in terms of journal articles, whereas academics belonging to cluster D excelled with respect to proceeding papers. Academics belonging to cluster E published in the most influential journals, whereas academics belonging to cluster C published in journals with lower impact factors. Academics of cluster A stood out for their ability to author international research publications. Finally, academics belonging to cluster B did not stand out on any research activity aspect.

By areas (CAT, CSAI, CLS) and positions (FP, AP1, AP2, AP3) associated with each academic, Figure 9.5 shows the distribution of academics grouped in each cluster. For example, cluster F had 14 members, 4 of whom (28.6%) worked on CAT, 8 (57.1%) on CSAI, and 2 (14.3%) on CLS. Also, cluster F was composed of 4 FP working on CAT, 6 FP and 2 AP1 working on CSAI, and 1 FP and 1 AP1 working on CLS. Figure 9.5 also shows that clusters A, B and C were mainly composed of CLS academics, whereas members of clusters D and F were mainly CSAI academics. Taking into account academic positions, clusters A, C and E were mainly composed of AP1, cluster B was mainly composed of AP1 and AP3, cluster D was mainly composed of FP and AP1, and, finally, cluster F was mainly composed of FP.

9.4 Discussion and conclusions

This chapter proposes a cluster analysis methodology to evaluate the research activity (in terms of bibliometric indices) of institutions and their academic staff. This analysis focuses on the study of Spanish public universities and academics working in the computer science field, but the methodology can also be applied in other academic settings as well as in other research areas and countries. The proposed methodology offers a series of advantages when compared to classic peer review methodologies. In particular, it does not suffer from limitations related to subjectivity since it does not depend on the quality judgment of experts. It is an objective technique for assessing research performance, overcomes the traditional limits of bibliometric analyses based on simple rankings, and permits a multi-dimensional cluster analysis at different levels.

This cluster analysis methodology groups similar universities or academics in the same cluster, maximizing within-cluster homogeneity and between-cluster heterogeneity. These results are useful for characterizing the research activity of universities and their academic staff. Three well-known clustering approaches (partitional, hierarchical and probabilistic) are used to give a comprehensive overview of the current situation by means of their useful, different outputs (cluster medoids, dendrograms and cluster probabilities, among others). Other clustering approaches, such as combinatorial search-based techniques, kernel-based techniques, graph theory-based techniques, neural network-based techniques and fuzzy techniques, have not been used in this methodology. Further analysis including the above approaches could give a more sophisticated overview. Regarding clustering validation, the silhouette coefficient and Rand index were used to determine the optimal number of clusters and the agreement between two different partitions, respectively. Other internal indices (e.g., the Dunn index) and external indices (e.g., the adjusted Rand index) were also used, but the results (not shown) did not vary much.

Spanish public universities were grouped into four different clusters. Universities that belong to cluster D (Universidad de Granada, Universidad de Jaén, Universidad Pablo de Olavide de Sevilla and Universidad Pública de Navarra) score highest for the following research activity aspects: productivity, visibility and quality.

[Figure 9.5 panels, one bar chart per cluster, each broken down further by position (FP, AP1, AP2, AP3): Cluster A (321 academics): CAT 111 (34.6%), CSAI 73 (22.7%), CLS 137 (42.7%); Cluster B (839 academics): CAT 220 (26.2%), CSAI 229 (27.3%), CLS 390 (46.5%); Cluster C (416 academics): CAT 132 (31.7%), CSAI 129 (31.0%), CLS 155 (37.3%); Cluster D (166 academics): CAT 22 (13.3%), CSAI 76 (45.8%), CLS 68 (41.0%); Cluster E (248 academics): CAT 81 (32.7%), CSAI 71 (28.6%), CLS 96 (38.7%); Cluster F (14 academics): CAT 4 (28.6%), CSAI 8 (57.1%), CLS 2 (14.3%)]

Figure 9.5: Distribution of academics belonging to each cluster by areas and positions

Universities belonging to cluster C (Universidad de Córdoba, Universidad del País Vasco, and Universidad Politécnica de Madrid, among others) excel in terms of prestige, whereas universities belonging to cluster B (Universidad de Girona, Universidad Politècnica de València, and Universidad Pompeu Fabra, among others) stand out on international collaboration. Finally, universities belonging to cluster A have worse scores for research activity aspects than the other universities.

Unlike Bornmann and Leydesdorff [54], who showed that northern cities perform better than southern cities in some countries like Italy, it is found that most of the universities belonging to cluster D, which score highest for productivity, visibility and quality, are southern universities. In contrast, some northern universities like Universidad Autónoma de Barcelona, Universidad de Cantabria, Universidad de Girona and Universidad Politècnica de Catalunya stand out on international collaboration, whereas Universidad de Oviedo, Universidad del País Vasco, and Universidad de Santiago de Compostela excel in terms of prestige.

Spanish computer science output originates mainly in higher education institutions. Analyzing Spanish university results, it is noted that they do not stand out for their quality. Citations per document is used as a research quality indicator in order to compare Spanish universities with other international universities. According to Essential Science Indicators, ten Spanish universities rank in the top 350 positions, but only two (Universidad de Barcelona and Universidad de Vigo) are among the top 100 for citations per document. A possible reason for this situation is the constant cuts in the Spanish science budget [351].

The cluster analysis methodology grouped Spanish academics into six different clusters: cluster A (321 academics), cluster B (839 academics), cluster C (416 academics), cluster D (166 academics), cluster E (248 academics), and cluster F (14 academics). Each cluster can be summarized with respect to different research activity aspects. Academics belonging to cluster F excel in terms of productivity, visibility and quality, whereas academics belonging to cluster E and cluster A stand out for their prestige and internationalization, respectively. The other academics, belonging to clusters B, C and D, score worse in terms of research activity aspects. Focusing on cluster F (the best in terms of productivity, visibility and quality), its members are academics from Universidad de Girona, Universidad de Granada, Universidad de Jaén, Universidad Politécnica de Madrid, Universidad Politècnica de València and Universidad Pública de Navarra. Also, this cluster is composed mainly of full professors of the computer science and artificial intelligence area.

Agrait and Poves [12] stated that not all Spanish academics publish research, even though they have paid time for research. Their results show that 43.7% of Spanish academics regularly publish documents or patents. By positions, their results show that 69.5% of FP, 40.6% of AP1, 21.5% of AP2 and 4.9% of AP3 usually do research. These results corroborate the findings in Spanish computer science. Cluster B includes the highest number of academics (839 out of 2004), and Table 9.8 shows that the academics in cluster B have a low score for publications and citations.
Also, this cluster is mainly composed of AP3 academics (see Figure 9.5).

The proposed cluster analysis methodology can help institutions to compare themselves to each other and motivate them to improve their outcomes, since the methodology characterizes research activity, identifying both strengths and weaknesses. According to the results, academic researchers should improve quality (number of citations per paper) in universities belonging to cluster A, visibility (number of citations per academic) in universities belonging to cluster B, productivity (number of publications per academic) in universities belonging to cluster C, and prestige (the percentage of documents published in first-quartile journals) in universities belonging to cluster D. On the other hand, academics holding AP2 and AP3 positions should increase their productivity and visibility, AP1 academics working in CAT and CLS should improve their quality, AP1 academics working in CSAI should publish in journals with higher impact factors, and, finally, FP academics should collaborate with foreign institutions. Using the cluster analysis methodology, policy-makers could propose collaborations and alliances among universities belonging to the same cluster. Several universities should perhaps merge strategically in order to compete with the most active international universities. In this way, Spanish universities could exploit their resources, enhance their reputation and visibility, and rise in the international rankings.

In the future, the target will be to incorporate private universities and non-tenured academics. Also, other aspects (number of patents, number of projects, number of spin-offs, etc.) will be used as variables in the cluster analysis.

Chapter 10

Effects of Research Collaboration

10.1 Introduction

Collaboration is a fundamental aspect of scientific research activity. It is considered the key issue for solving complex problems in many areas of science [109]. Generally, scientific collaboration can be defined as researchers working together to achieve the common goal of producing new scientific knowledge. Collaboration usually helps researchers to share their workloads, generate fresh ideas, and combine peers' past experience and skills [29, 199, 370]. These are all good reasons for collaboration, but they come at the expense of seeking the proper research partners, negotiating objectives, methodologies and results, managing geographic distance constraints, and communicating across organizations, cultures and disciplines, among others [38, 248, 274, 349].

Nowadays, researchers have begun to pay special attention to research performance and its determinants. Collaboration could be a determinant of better research quality. Many researchers feel that collaborative research generally produces higher quality and more significant results than research performed by single researchers. They are motivated by the assumption that synergy leads to more and better results. A recent study [291] explains this point by arguing that each researcher has his or her own knowledge, and the diversity of collaborating members can be an extra resource for reinforcing research quality.

Several bibliometric studies have explored the relationship between collaboration and research performance. The relation between collaboration and productivity was first studied by Beaver and Rosen [39], who concluded that collaboration is associated with higher productivity. Recently, Franceschet and Costantini [158] analyzed the effect of scholarly collaboration on the impact and quality of academic papers. They noted a general positive association between the cardinality of a paper's author set and the citation impact and peer quality of the contribution. Other studies have also corroborated that research collaboration has a positive influence on the number of documents [364] and the number of citations [425].

The practice of collaboration, and especially international collaboration, is becoming a widespread phenomenon. Some studies have shown a constant increase in terms of the number of papers with international collaborations [23], and an exponential increase in terms

of the number of international addresses [362]. This co-authorship trend is not surprising since it is an important aspect of an ideal work environment and it is also receiving interest and stimulus from policy-makers. Recent studies have analyzed the link between the degree of internationalization of scientific activity and research performance at the level of individual researchers [7, 8]. They concluded that the top-performing national researchers also collaborate more abroad, but the reverse is not always true. Other studies demonstrated that the number of documents and the number of citations are positively correlated with a researcher's degree of international collaboration [179, 451].

It is well known that collaboration varies across disciplines and countries. On the one hand, Gazni et al. [177] performed a large-scale analysis to examine collaboration differences across multiple areas and all countries. They found that the level of scientific collaboration varies dramatically by discipline. The life sciences display high levels of co-authorship, whereas the social sciences show low levels of co-authorship. Their analysis of the collaborations between countries revealed that six countries (United States, United Kingdom, Germany, France, Italy, and Canada) account for 82% of the world's international publications, but they are not the most collaborative countries when measured by their proportion of collaborative output. On the other hand, Lancho Barrantes et al. [273] explored the provenance of the citations received by different countries and the different types of collaborative papers. They found different percentages of papers in collaboration among countries. They also found that there is no significant correlation between the scientific production and the percentage of collaboration of a country. However, there is a significant negative correlation between production and the percentage traffic of citations to/from the collaborating countries. Regarding collaborative papers, they also found a negative correlation between a country's production and its impact on domestic papers per paper. Finally, Franceschet and Costantini [158] analyzed the intensity of research collaboration in different areas. They observed that collaboration is negligible in arts and humanities. They also found that the scale and formality of social science collaborations are smaller than in science disciplines. Focusing on science disciplines, collaborative work is heavily exploited in chemistry, physics, biology and medicine. In contrast, it is moderate in mathematics, engineering and computer science. Despite this, the computer science field has been expanding since 1960 in terms of both the number of published papers and the number of authors. Also, computer science collaborations among different research institutes and across different countries have grown considerably in recent years [157]. According to Fortnow [151], it is time for computer science to grow up: it is now a mature field, and no major university can survive without a strong computer science department. Franceschet [157] studied collaboration in computer science by means of a network science approach.
Using publications from the DBLP Computer Science Bibliography, he examined properties like authors' scientific productivity and level of collaboration on papers, as well as large-scale network properties (average separation distance among scholars, distribution of the number of scholar collaborators, and dependence on star collaborators, among others). Franceschet concluded that the collaboration level in computer science papers is rather moderate (two or three authors) compared with other scientific fields. Also, he observed that the computer science collaboration network is a widely connected small world. Hence, scientific information flows along collaboration links very quickly and potentially reaches almost all scholars in the discipline. Finally, he noted that the distribution of collaboration among computer science scholars is highly skewed and concentrated, where star collaborators are responsible for a relatively high share of collaborations. Despite this, the network connectivity does not crucially depend on them. Like Franceschet, this chapter deals with bibliometric properties such as author productivity and the level of collaboration on papers. Unlike Franceschet, it includes the number of citations and the citations per document and year. This analysis focuses not on network properties, but on other aspects like types of collaboration, computer science subdisciplines and journal impact factor quartiles.

This chapter, which is based on the published paper [220], analyzes the relationship among research collaboration, the number of documents and the number of citations in computer science research. Mainly, the number of documents and citations by number of authors is analyzed. These measures are also analyzed (according to the author set cardinality) under different circumstances, that is, when documents are written in different types of collaboration (international, national and institutional), when documents are published as different document types (journal article and conference paper), when documents are published in different computer science subdisciplines (artificial intelligence, cybernetics, hardware and architecture, information systems, interdisciplinary applications, software engineering, and theory and methods), and, finally, when documents are published by journals in different impact factor quartiles (first-quartile, second-quartile, third-quartile and fourth-quartile journals). Note especially that there are no studies in the literature that investigate the relationships among the above issues. Therefore, the following relationships are investigated:

- Author cardinality vs Documents vs Citations: The percentage evolution over time of documents published by number of authors and the average number of authors per document are analyzed. The number of citations per document and year is also analyzed according to the documents' author set cardinality.

- Author cardinality vs Documents vs Citations vs Types of collaboration: In this case, the trend of documents published as a result of international, national and institutional collaboration is analyzed by number of authors. The average number of authors per document is also analyzed according to the different types of collaboration. Finally, citation measures of documents published as a result of international, national and institutional collaborations are also explored by number of authors.

- Author cardinality vs Documents vs Citations vs Document type: In this case, the trend of documents published as journal articles and proceeding papers is analyzed by number of authors. The average number of authors per document is also analyzed according to the different document types. Finally, citation measures of documents published in journals and conferences are also explored by number of authors.

- Author cardinality vs Documents vs Citations vs Subdisciplines: How documents published in different computer science subdisciplines change over time according to the number of authors is explored. The average number of authors per document according to the different computer science subdisciplines is also analyzed. Finally, the number of citations per document and year in documents published in different computer science subdisciplines is studied by author cardinality.

- Author cardinality vs Documents vs Citations vs Impact factor: The percentage trend of documents published in different journal impact factor quartiles is studied by number of authors. The average number of authors per document according to the different journal impact factor quartiles is also analyzed. Finally, citation measures of the above documents are analyzed against author cardinality.

Chapter outline

Section 10.2 firstly describes the indicators and statistical tests used to analyze the effects of research collaboration. It also reports the questions, hypotheses and results investigated in this chapter. Section 10.3 presents a discussion and conclusions on the results and indicates possible future research directions.

10.2 Questions, hypotheses and results

To investigate the above relationships, the publications produced between 2000 and 2009 by active Spanish university professors working in the computer science field have been analyzed. Using these publications, several indicators regarding quality, collaboration and internationalization are calculated by number of authors. The number of documents and citations are indispensable for analyzing research activity. Citations are measures of information use, reception and, in a way, of influence [107]. They can be considered an indirect measure of publication quality in most cases, although there may be retracted papers that receive many citations. For collaboration, two measures that are generally used in studies of research collaboration [286] are also computed: the collaborative rate, which is the percentage of documents with more than one author, and the collaborative level, which is the average number of authors per document. Regarding the measures of internationalization [8], the international rate is used to analyze the percentage of papers that have been produced in collaboration with foreign institutions, that is, the percentage of publications co-authored with at least one co-author from a foreign institution. This measure is computed by analyzing the publications whose affiliations include addresses from more than one country. Finally, the impact factor is used as the status of a journal for a specific year. It is still recognized as the primary measure of journal quality and has a major influence on scientific behavior [473]. Furthermore, experience has shown that the best journals in each specialty are the publications in which it is hardest to get an article accepted, and these are the journals with a high impact factor [175].
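As an illustration, the collaboration and internationalization indicators defined above can be computed from per-document records; this pandas sketch assumes hypothetical column names (n_authors, foreign) that are not from the original study:

```python
import pandas as pd

# Toy records: one row per document, with its author count and a foreign-coauthor flag.
docs = pd.DataFrame({"n_authors": [1, 3, 4, 2, 6],
                     "foreign": [False, True, False, False, True]})

collaborative_rate = (docs["n_authors"] > 1).mean() * 100  # % of multi-authored documents
collaborative_level = docs["n_authors"].mean()             # average authors per document
international_rate = docs["foreign"].mean() * 100          # % with a foreign co-author
print(collaborative_rate, collaborative_level, international_rate)
```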

Once the above indicators are calculated, statistical tests are used to determine whether there is enough evidence to reject a conjecture about the data. The conjecture is called the null hypothesis. Failing to reject the conjecture may be a satisfactory result if the aim is to continue acting as if the null hypothesis were true, or it may be a disappointing result, indicating that there is not enough information to reject the null hypothesis. Tests that do not make assumptions about the population distribution are referred to as non-parametric tests. All commonly used non-parametric tests rank the outcome variable from low to high and then analyze the ranks. In this chapter, two non-parametric tests were used: the Kruskal-Wallis test [269] and the Mann-Whitney test [308]. The Kruskal-Wallis test analyzes whether three or more samples could have come from the same distribution. The null hypothesis is that the populations from which the samples originate have the same distribution. When the Kruskal-Wallis test leads to significant results, then at least one of the samples is different from the other samples. The test does not identify where the differences occur or how many differences actually occur. In contrast, the Mann-Whitney test analyzes whether two samples could have come from the same distribution. It is helpful for analyzing specific sample pairs for significant differences. The significance level of these tests was 0.05 in all cases. The following sections analyze the relationship among documents, citations and author cardinality on several issues such as types of collaboration, document types, computer science subdisciplines and journal impact factor quartiles. The number of authors has been grouped into six different subsets (1 author, 2 authors, 3 authors, 4 authors, 5 authors and >5 authors).
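The two-stage testing procedure described above can be sketched with SciPy as follows (the citation-ratio samples are invented for illustration):

from scipy.stats import kruskal, mannwhitneyu

# Invented citations-per-document-and-year samples for three author subsets
ratios_2_authors = [0.9, 0.4, 0.7, 1.2, 0.3, 0.8]   # benchmark subset
ratios_3_authors = [0.2, 0.5, 0.1, 0.6, 0.4, 0.3]
ratios_4_authors = [0.3, 0.1, 0.2, 0.5, 0.2, 0.1]

# Kruskal-Wallis: could the three samples come from the same distribution?
statistic, p_value = kruskal(ratios_2_authors, ratios_3_authors, ratios_4_authors)
if p_value < 0.05:
    # At least one subset differs; locate the differences with pairwise
    # Mann-Whitney tests against the benchmark subset
    for label, sample in [("3 authors", ratios_3_authors),
                          ("4 authors", ratios_4_authors)]:
        _, p_pair = mannwhitneyu(ratios_2_authors, sample)
        print(label, "differs from the 2-author benchmark:", p_pair < 0.05)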

10.2.1 How do productivity and visibility vary according to the number of authors?

The first question investigates the number of documents and citations with regard to author cardinality. The first impression is that computer science documents are usually the result of collaboration. Specifically, it is expected that the average document is written by three or four authors. This is based on the idea that different co-authors reinforce research quality.

Figure 10.1: Evolution of percentage of published documents by number of authors


It is also expected that the number of authors per document has gradually increased in the last decade. Regarding visibility, it is supposed that a greater number of authors can lead to a higher number of citations because co-authors are more likely to disseminate the document.

According to the different author subsets, document distribution in the analyzed period was: 1 author (2.651%), 2 authors (18.182%), 3 authors (33.037%), 4 authors (26.456%), 5 authors (11.994%), and >5 authors (7.680%). Most documents were published by three and four authors, whereas single-authored documents accounted for the lowest percentage. Analyzing the number of authors, Figure 10.1 plots the evolution of documents published from 2000 to 2009. It shows that the percentage of documents published by different numbers of authors underwent some changes in the last decade. In earlier years, documents published by two authors accounted for a sizeable percentage of total publications (28.538% in 2000), but in 2009, they represented 14.616% of total publications. In contrast, the percentage of documents with three or more authors increased. The percentage of documents published by one author also decreased over the analyzed years, and, therefore, the collaborative rate increased over time.

As expected, these results bear out previous works stating that the practice of collaboration is becoming a widespread phenomenon. The number of authors used to be lower than it is today. Just a few authors were responsible for the hypothesis, experimental design, results and conclusion [488]. Nowadays, most projects require the participation of many researchers, who are all entitled to be authors when the results are reported. Other reasons that have increased the number of authors per document in recent years are dependency on the department chair and the addition of influential authors to raise a paper's prestige, among others. These authors, who are neither authors nor contributors, are called guests [272, 424].

Figure 10.2 shows the evolution of the average number of authors per document. Collaborative level has increased in the last few years. Values rose from 3.118 authors in 2000 to 3.739 authors in 2008, an increase of 19.917%. Taking 2009 as an example publication year, the published documents had an average of 3.721 authors per document.

Figure 10.2: Evolution of the average number of authors per document


Analyzing these values, the impression is that documents published by three or four authors will be the trend in computer science literature in the coming years.

Table 10.1 shows the average number of citations, the average number of years since the publication year, and the average number of citations per document and year (columns). These measures and their standard deviations are calculated for each different number of authors (rows). Analyzing Table 10.1, it is observed that the highest average number of citations (3.019±7.260) corresponded to documents published by one author, whereas the lowest average value (1.852±4.830) corresponded to documents published by five authors (column 2). These results were influenced by publication age, that is, the number of years since the publication year. The average age (column 3) is calculated to account for the above point. In this way, documents published by one author had the highest average age (5.536±2.821), whereas documents published by more than five authors had the lowest average age (4.172±2.460). The number of citations per document and year was calculated as an accurate measure for comparing documents published by number of authors. This ratio, which is a visibility measure, is a possible indirect measure of the document's quality. In this context, documents published by two authors had the highest value (0.478±1.293), and documents published by five authors had the lowest value (0.363±0.841). Note that documents published by one or two authors had higher values of citations per document and year than documents published by three or more authors. A possible explanation could be that an important percentage of documents published by one or two authors are review papers. A review paper is usually written by a single senior researcher, and it is likely to be cited extensively. This would explain why single-authored documents received more citations than multi-authored documents.

The results of the Kruskal-Wallis test showed that there were significant differences among the six author subsets on the basis of average number of citations per document and year. So, Mann-Whitney tests were run in order to find out which subsets rank better according to this criterion. Documents published by two authors (benchmark subset), which had the highest average value, were compared with the other documents. Subsets marked in Table 10.1 with the symbol † had statistically significant differences with respect to the benchmark subset (highlighted in boldface). Results show that there were significant differences between the 2-author subset and subsets with more authors. Unlike Franceschet and Costantini [158], a positive association between the author set cardinality of a document and citation impact was not found in this analysis.

Table 10.1: Mean ± standard deviation of citation measures for documents published by different number of authors. The symbol † represents those results that are statistically different in citations ratio with respect to the benchmark subset (highlighted in boldface).

Authors   Citation count   Publication age   Citations ratio
1         3.019±7.260      5.536±2.821       0.443±0.889
2         2.943±10.297     5.023±2.733       0.478±1.293
3         2.181±6.562      4.450±2.634       0.407±1.021 †
4         1.882±5.469      4.346±2.548       0.365±0.903 †
5         1.852±4.830      4.200±2.453       0.363±0.841 †
>5        1.913±5.913      4.172±2.460       0.409±1.090 †

10.2.2 How do productivity and visibility in different types of collaboration vary according to author cardinality?

The second question analyzes whether productivity and visibility behave differently across different types of collaboration. A distinction between three types of collaboration is made: international, national and institutional cooperation. International collaboration refers to co-authorship by researchers from both national and foreign institutions. National collaboration refers to co-authorship by researchers belonging to different institutions in the same country. Finally, institutional collaboration refers to co-authorship among researchers belonging to the same institution. Due to problems of geographic distance and communication across organizations, it is expected that most documents are written through institutional collaboration and that there are more authors per document resulting from international collaboration than via national and institutional collaboration. On the other hand, analyzing visibility across different types of collaboration, it is reasonable to expect, precisely because of the differences among authors, that the quality of documents resulting from international collaborations is greater and that they have a higher number of citations. Also, it is supposed that a greater number of authors can lead to a higher number of citations for a particular type of collaboration.

The document distribution in the studied period was: international collaboration (13.334%), national collaboration (13.112%) and institutional collaboration (73.554%). So, the value of the international rate was 13.334%. This percentage was very similar on a year-by-year basis. Therefore, the evolution of the international rate has not undergone major changes in the analyzed period. Results show that most collaborative documents were published via institutional collaboration.

Figure 10.3 represents the evolution of published documents by number of authors and type of collaboration. Regarding international collaboration, Figure 10.3a shows that the percentage of documents published by number of authors has recently undergone changes. Results show that documents published by three and four authors represented the highest percentages of published documents each year, whereas documents published by two, five, and more than five authors represented the lowest percentages. Analyzing Figure 10.3a, a sizeable decrease is observed in the percentages associated with documents published by two and four authors (e.g., the percentage of documents published by four authors was 39.130% in 2000 and 24.476% in 2009). In contrast, the number of documents published by three authors fluctuated considerably, and there were increases in the number of documents published by five or more authors. Regarding these increases, percentages associated with documents with five authors rose from 6.522% in 2000 to 16.776% in 2008, and percentages associated with documents with more than five authors rose from 10.526% in 2001 to 18.182% in 2009. Regarding national collaboration (see Figure 10.3b), results show an important decrease in documents published by two authors. The percentages associated with these documents were 19.022% in 2004 and 7.189% in 2009. Likewise, the percentage of documents with three authors also decreased from 41.818% in 2000 to 30.719% in 2009. In contrast, percentages associated with documents published by four or more authors increased over the time period.

Figure 10.3c analyzes institutional collaboration. A sharp decrease in the percentage of documents published by two authors is observed. In earlier years, collaboration between two authors represented a sizeable percentage (35.353%) of the total publications, but this percentage decreased considerably (16.101%) by 2009. The other percentages, associated with documents with three or more authors, have gradually increased over the last few years.

Finally, note that publication behavior has been similar across different types of collaboration in recent years. There were two different groups: documents published by three or four authors, which had the highest percentages, and documents published by two, five, or more than five authors, which had the lowest percentages. These groups were also highlighted in Figure 10.1, which plots the evolution of the percentage of published documents by number of authors. Figure 10.1 also shows an important decrease of documents published by two authors via international, national and institutional collaborations.

Figure 10.3: Evolution of percentage of published documents by number of authors and type of collaboration (international, national and institutional)

(a) International collaboration; (b) National collaboration; (c) Institutional collaboration

Figure 10.4: Evolution of the average number of authors per document according to different types of collaboration (international, national and institutional)


Regarding collaborative level, Figure 10.4 shows the average number of authors per document for each type of collaboration. Taking 2009 as an example publication year, it is observed that the average international document was published by 4.273 authors, whereas the average national document was published by 4.111 authors, and the average institutional document by 3.603 authors. According to the above values and the evolution illustrated in Figure 10.4, international collaborations usually had the highest number of authors per document, followed by national and institutional collaborations, as expected. A large number of international and national collaborations spring from projects that require the participation of many researchers from different institutions, whereas most institutional collaborations usually involve authors from the same research group. For these reasons, both international and national collaborations involve more authors than institutional collaborations, increasing the number of authors per document.

Table 10.2 shows the average number of citations, the average number of years since the publication year, and the average number of citations per document and year of documents published via different types of collaboration (international, national and institutional) and written by different numbers of authors. It also shows the standard deviations associated with the above measures. Note that international collaborations usually had the highest average values of citations per document and year for different numbers of authors, followed by national collaborations and institutional collaborations. International collaboration often involves more authors than other types of collaboration, as mentioned before. As the authors are likely to disseminate the document, it is reasonable to assume that there will be a greater number of citations. Taking documents published by more than five authors as an example, note that the average values of international, national and institutional documents were 0.837±1.816, 0.506±0.804 and 0.207±0.607, respectively.

Table 10.2: Mean ± standard deviation for citation measures of international, national and institutional collaborations in documents published by number of authors. The symbol † represents those results that are statistically different in citations ratio with respect to the benchmark subset (highlighted in boldface)

Authors      Measure           International   National       Institutional
2-authors    Citation count    4.899±12.786    5.216±18.433   2.381±8.127
             Publication age   4.974±2.487     5.584±2.384    4.955±2.800
             Citations ratio   0.831±1.790     0.749±2.066    0.395±1.046
3-authors    Citation count    3.765±6.813     3.995±8.608    1.620±6.022
             Publication age   4.799±2.413     5.131±2.669    4.281±2.642
             Citations ratio   0.716±1.182     0.722±1.218    0.305±0.932 †
4-authors    Citation count    4.320±8.628     3.002±6.267    1.141±4.104
             Publication age   4.909±2.453     4.931±2.534    4.104±2.532
             Citations ratio   0.787±1.453     0.599±1.093    0.228±0.634 †
5-authors    Citation count    3.552±6.858     2.682±5.201    1.285±4.028
             Publication age   4.551±2.331     4.621±2.383    4.021±2.478
             Citations ratio   0.671±1.146     0.573±1.047    0.245±0.662 †
>5-authors   Citation count    3.750±9.821     2.579±4.078    0.980±3.588
             Publication age   4.489±2.456     4.794±2.483    3.869±2.412
             Citations ratio   0.837±1.816     0.506±0.804    0.207±0.607 †

Like Glanzel [179] and Van Raan [451], these results demonstrate that, on average, international collaboration results in documents with higher citation rates than national and institutional documents.

A Kruskal-Wallis test was performed in order to compare different subsets of authors (according to a particular type of collaboration). The Kruskal-Wallis test did not find statistically significant differences across international and national documents published by different numbers of authors. In contrast, results show that there were significant differences among institutional documents published by different numbers of authors. So, several Mann-Whitney tests were carried out to find out which subsets of authors (highlighted by †) were significantly different from the benchmark subset (highlighted in boldface). It is demonstrated that institutional documents published by two authors were significantly different from all other subsets of authors. Analyzing the statistical test results, it is concluded that it is better to publish with few authors in order to improve document visibility at the institutional level, whereas the number of authors does not affect the average number of citations per document and year at the national and international level.

10.2.3 How do productivity and visibility in different document types vary according to author cardinality?

The third question analyzes whether productivity and visibility behave differently across different document types. Journal articles and conference papers are the document types studied in this section.

It is expected that publication behavior is different across journals and conferences. Due to the undeniable advantages of conferences (they provide fast and regular publication of papers and bring researchers together by offering the opportunity to present and discuss the paper with peers), authors tend to publish more documents in conferences than in journals. It is also expected that most journal articles and conference papers are published by three or four authors. Regarding collaborative level, it is supposed that there are no clear differences between journals and conferences. On the other hand, it is expected that citation counts received by journal articles are higher than those received by conference papers because of their prestige, and that multi-authored documents receive more citations than single-authored documents.

The document distribution in the analyzed period was: journal articles (32.262%) and conference papers (67.738%). These percentages bear out previous works, like Franceschet [156], stating that 1/3 of computer science literature are journal articles and 2/3 are conference papers. These percentages of journal and conference documents usually vary on a year-by-year basis. Results also show that the percentage of conference papers has gradually decreased. In 2005, conference papers accounted for a sizeable percentage of total publications (74.000%), but in 2009, they represented 62.678% of total publications. An interpretation could be that researchers are progressively shifting from conferences to journals, considering budget shortages or the higher prestige of journals over conferences. According to the number of authors, 54.098% of single-authored documents are published in journals, whereas 45.902% are published in conferences. The rest of the percentages were: 2 authors (43.098% in journals and 56.902% in conferences), 3 authors (36.906% in journals and 63.094% in conferences), 4 authors (32.839% in journals and 67.161% in conferences), 5 authors (31.750% in journals and 68.250% in conferences), >5 authors (34.002% in journals and 65.998% in conferences).

Figure 10.5: Evolution of percentage of published documents by number of authors and document type

(a) Journal articles; (b) Conference papers

Figure 10.5 shows the evolution of the percentage of documents published in computer science journals and conferences by number of authors from 2000 to 2009. Results show that the percentage of documents associated with each author subset was similar in journal articles and conference papers, so there are no important differences in publication behavior by number of authors between journals and conferences. In general, there was a decrease in the number of documents published by one and two authors in both cases. Also, documents written by three and four authors accounted for the highest percentages, whereas the lowest percentage of documents were written by one author. Taking the journal articles as an example, Figure 10.5a shows that the percentage of documents with four or more authors has gradually increased over the last few years. Specifically, documents published by four authors have undergone an increase in the last few years: they accounted for 18.116% of all publications in 2004, and 27.687% of total publications in 2009. In contrast, documents published by one author and two authors have decreased over the analyzed years, and single-authored documents account for the lowest percentage in the 2002-2009 period.

Regarding collaborative level, Figure 10.6 shows the average number of authors per document for journal articles and conference papers. According to its evolution, conference papers had the highest number of authors per document in earlier years. Despite this, journal articles and conference papers had a similar number of authors per document in recent years. Taking 2009 as an example publication year, the average journal article was published by 3.738 authors, whereas the average conference paper was published by 3.711 authors.

Table 10.3 shows the average number of citations, the average number of years since the publication year, and the average number of citations per document and year. These measures and their standard deviations are calculated for each different number of authors and document type. It is noted that documents published by more than five authors had the highest average value of citations per document and year (0.971±1.734) when they were published in journals. In contrast, single-authored documents had the highest average value of citations per document and year (0.229±0.456) when they were published in conferences.

Figure 10.6: Evolution of the average number of authors per document according to different document types


Table 10.3: Mean ± standard deviation for citation measures of journal and conference documents published by number of authors. The symbol † represents those results that are statistically different in citations ratio with respect to the benchmark subset (highlighted in boldface)

Authors      Measure           Journal article   Conference paper
1-author     Citation count    4.983±9.991       1.371±2.757
             Publication age   5.783±2.937       5.329±2.713
             Citations ratio   0.698±1.171       0.229±0.456
2-authors    Citation count    6.029±15.774      1.063±3.133
             Publication age   5.092±2.858       4.981±2.654
             Citations ratio   0.940±1.917       0.196±0.495
3-authors    Citation count    5.043±10.547      0.770±1.865
             Publication age   4.534±2.794       4.408±2.551
             Citations ratio   0.923±1.581       0.153±0.358 †
4-authors    Citation count    4.753±9.045       0.746±2.205
             Publication age   4.142±2.776       4.427±2.448
             Citations ratio   0.924±1.457       0.143±0.354 †
5-authors    Citation count    4.702±7.457       0.717±2.449
             Publication age   4.118±2.603       4.233±2.392
             Citations ratio   0.919±1.251       0.142±0.442 †
>5-authors   Citation count    4.481±9.601       0.783±2.388
             Publication age   3.974±2.657       4.259±2.366
             Citations ratio   0.971±1.734       0.161±0.434 †

As expected, journal articles had higher citations per document and year than conference papers. These results corroborate previous work, like Franceschet [156], in which the impact of journal publications was significantly higher than the impact of conference papers. A Kruskal-Wallis test was performed in order to compare subsets of different authors across documents published in journals and conferences. Results show that there were no significant differences across documents published in journals. In contrast, significant differences were found across documents published in conferences: the average number of citations per document and year of documents published by one author (0.229±0.456) was significantly higher than that of documents published by three authors (0.153±0.358), four authors (0.143±0.354), five authors (0.142±0.442) and more than five authors (0.161±0.434).

10.2.4 How do productivity and visibility in different computer science subdisciplines vary according to author cardinality?

The fourth question investigates the productivity and visibility of authors across the seven JCR computer science subdisciplines: artificial intelligence, cybernetics, hardware and architecture, information systems, interdisciplinary applications, software engineering, and theory and methods.

It is expected that publication behavior is different across subdisciplines. Authors are expected to publish more documents in mature subdisciplines like theory and methods. Also, the percentages of documents published by a specific number of authors are expected to be similar across subdisciplines. Furthermore, most documents should be published by three or four authors in all subdisciplines. Despite this, it is expected that the collaborative level is different: interdisciplinary applications documents should usually be written by more authors than publications in other subdisciplines. This idea is based on the assumption that interdisciplinary applications documents could be published by authors belonging to many different areas, resulting in more authors per document. Regarding visibility, it is expected that citation counts are different across subdisciplines and that a greater number of authors leads to a higher number of citations in any particular subdiscipline.

According to the Web of Science, there is an overlap across the seven subdisciplines. Thus, one document could belong to more than one subdiscipline at the same time. The document distribution in the analyzed period was: artificial intelligence (24.849%), cybernetics (1.613%), hardware and architecture (7.285%), information systems (9.528%), interdisciplinary applications (5.543%), software engineering (11.059%) and theory and methods (40.123%). Most documents were related to theory and methods, whereas cybernetics accounted for the lowest percentage of published documents.

Figure 10.7 shows the evolution of the percentage of documents published in computer science subdisciplines by number of authors from 2000 to 2009. After analyzing all computer science subdisciplines, it is observed that the percentage of documents associated with each author subset was quite alike across different subdisciplines. These percentages were: 1 author [2.5274%-3.068%], 2 authors [18.001%-22.085%], 3 authors [32.594%-33.247%], 4 authors [24.773%-26.238%], 5 authors [9.811%-11.967%] and >5 authors [7.191%-8.333%]. According to these percentages, there were no important differences in publication behavior by number of authors across subdisciplines. Looking at all the charts illustrated in Figure 10.7, similarities across subdisciplines are also observed: there was a general decrease in the number of documents published by one and two authors in all subdisciplines, documents written by three and four authors also accounted for the highest percentage in all subdisciplines, and the lowest percentage of documents were written by one author. On the other hand, the percentages associated with each author subset have fluctuated widely in most computer science subdisciplines in the last decade. In contrast, artificial intelligence (see Figure 10.7a) and theory and methods (see Figure 10.7g) did not experience as many fluctuations as other subdisciplines. These two subdisciplines behaved much like computer science generally (see Figure 10.1). This was reasonable because these subdisciplines had the highest percentages of published documents, 24.849% and 40.123%, respectively.
Taking the artificial intelligence subdiscipline as an example, Figure 10.7a shows that documents published by two authors accounted for the highest percentage (34.759%) of all publications in 2000, but represented only 16.145% of total publications in 2009. Documents published by one author have also decreased over the analyzed years and accounted for the lowest percentage in the 2002-2009 period. In contrast, the percentage of documents with three or more authors has gradually increased over the last few years.

Figure 10.7: Evolution of percentage of documents published in computer science subdisciplines by number of authors

(a) AI: Artificial Intelligence; (b) CB: Cybernetics; (c) HA: Hardware and Architecture; (d) IS: Information Systems; (e) IA: Interdisciplinary Applications; (f) SE: Software Engineering; (g) TM: Theory and Methods

Figure 10.8: Evolution of the average number of authors per document by disciplines

AI: Artificial Intelligence; CB: Cybernetics; HA: Hardware and Architecture; IS: Information Systems; IA: Interdisciplinary Applications; SE: Software Engineering; TM: Theory and Methods

Figure 10.8 analyzes collaborative level. It shows the evolution of the average number of authors per document according to different subdisciplines. These measures have tended to increase over the last few years. Note the hardware and architecture subdiscipline, whose values rose from 3.162 in 2000 to 4.405 in 2009. In contrast, the number of authors per cybernetics document underwent a sizeable decrease up until 2004, and later recovered. Results also show that the range of the average number of authors per document was different across subdisciplines with respect to the analyzed year. Despite this, the range was wider in earlier years (2000-2004) than in later years (2005-2009). Finally, the highest values for collaborative level were achieved by documents belonging to the hardware and architecture (4.405 authors per document) and interdisciplinary applications (4.074 authors per document) subdisciplines in 2009. These values were the result of a major increment of documents published by more than three authors in these subdisciplines over the last few years (see Figure 10.7).

Table 10.4 shows the average number of citations, the average number of years since the publication year, and the average number of citations per document and year of documents published in different subdisciplines and written by different numbers of authors. It also shows the standard deviations of the above measures. Analyzing the average number of citations per document and year, it is observed that some subdisciplines were more often cited than others. It is noteworthy that artificial intelligence documents, which had a lower number of authors per document than other subdisciplines, usually had higher average values of citations per document and year than others. In contrast, hardware and architecture documents, which had the highest collaborative level value in recent years, received fewer citations than other subdisciplines like artificial intelligence, cybernetics and information systems. Citation counts are known to vary by subdiscipline within a particular discipline [50]. Some studies found that citation practices differ across subdisciplines. Like Smolinsky and Lercher [419], different citation behaviors by subdiscipline within a specific discipline (computer science in our case) are also found here.

In order to compare citation behaviors by author subsets for a particular subdiscipline, several Kruskal-Wallis tests were performed. The Kruskal-Wallis test did not find meaningful differences across documents belonging to information systems, interdisciplinary applications and software engineering. In contrast, results show that there were significant differences among documents belonging to artificial intelligence, cybernetics, hardware and architecture, and theory and methods (see numbers in boldface and the † symbols). Taking artificial intelligence documents as an example, Table 10.4 shows that the number of citations per document and year of documents published by two authors (0.663±1.736) was significantly different from that of documents published by three (0.515±1.323) and four (0.551±1.283) authors. Similarly, the number of citations per document and year of hardware and architecture documents published by two authors (0.435±1.033) was significantly different from that of documents published by three (0.334±0.996) and four authors (0.229±0.641).

Analyzing the statistical test results in Table 10.4, it is concluded that the number of authors does not always affect the average number of citations per document and year. In this context, specific subdisciplines, like artificial intelligence, cybernetics, hardware and architecture, and theory and methods, are affected by the number of authors. Unlike Franceschet and Costantini [158], it is observed that documents with fewer authors usually have the highest average value of citations per document and year. Specifically, documents written by one author have the highest values for information systems, software engineering and theory and methods, whereas documents written by two authors have the highest values in artificial intelligence, cybernetics and hardware and architecture. In contrast, interdisciplinary applications documents published by more than five authors have the highest number of citations per document and year.

10.2.5 How do productivity and visibility in different journal impact factor quartiles vary according to author cardinality?

Journals ordered by impact factor can be organized into four quartiles. The first quartile denotes the top 25% of the impact factor distribution, the second quartile means a middle-high position (between the top 50% and top 25%), the third quartile is a middle-low position (top 75% to top 50%), and the fourth quartile represents a bottom position (bottom 25% of the impact factor distribution).

The fifth question investigates the number of authors across different quartiles. Also, the productivity and visibility of documents published in different journal impact factor quartiles are analyzed according to author cardinality.

It is expected that first-quartile journals have the lowest publication rate due to their selective strategy, that is, low acceptance rates. Regarding the number of authors, the percentages of documents published by a specific number of authors should be similar across quartiles, so three or four authors per document should be the average collaborative level value. On the other hand, citation counts are obviously different across quartiles. Furthermore, it is expected that multi-authored documents receive more citations than single-authored documents within a specific quartile.
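As a small illustration of this quartile scheme, the sketch below assigns journals to quartiles using the percentile cut-offs just described (the impact factor values are invented for the example):

import numpy as np

impact_factors = {"J1": 3.2, "J2": 0.4, "J3": 1.1, "J4": 2.0,
                  "J5": 0.9, "J6": 1.6, "J7": 2.8, "J8": 0.2}

values = np.array(list(impact_factors.values()))
p75, p50, p25 = np.percentile(values, [75, 50, 25])

def quartile(journal):
    f = impact_factors[journal]
    if f >= p75:
        return "Q1"  # top 25% of the impact factor distribution
    if f >= p50:
        return "Q2"  # middle-high position
    if f >= p25:
        return "Q3"  # middle-low position
    return "Q4"      # bottom 25%

print({journal: quartile(journal) for journal in impact_factors})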

Table 10.4: Mean ± standard deviation for citation measures of documents published by numbers of authors according to seven subdisciplines (AI, CB, HA, IS, IA, SE and TM). The symbol † represents those results that are statistically different in citations ratio with respect to the benchmark subset (highlighted in boldface). [Table body not reproduced here: citation count, publication age and citations ratio for documents with 1, 2, 3, 4, 5 and >5 authors in each subdiscipline.]

Table 10.5 shows the percentages of documents published in each quartile for different numbers of authors. In single-authored documents, the percentages associated with each quartile were: first quartile (28.333%), second quartile (24.167%), third quartile (31.667%) and fourth quartile (15.833%). In this case, the third quartile had the highest percentage of published documents. This quartile also had the highest percentage of documents published by two authors. In contrast, documents published by three or more authors were usually published in journals belonging to the first quartile. On the other hand, journals belonging to the fourth quartile accounted for the lowest percentages of published documents. Nowadays, authors have an interest in publishing in journals with the highest possible impact factor, and, therefore, it is reasonable to suppose that first-quartile journals accept more documents than fourth-quartile journals. This supposition bears out previous work, like Cabanac [72], stating that the range of papers accepted per journal is wider for the first quartile than for the other quartiles.

According to the distribution of documents published in different quartiles during the analyzed period, it is observed that the first quartile had the highest percentage (30.490%), followed by the second quartile (27.460%), third quartile (27.335%) and fourth quartile (14.715%). The evolution of documents published in different quartiles is analyzed in Figure 10.9. Note that the highest percentage of first-quartile documents was achieved in 2009, whereas the highest percentage of fourth-quartile documents was achieved in 2002. It is also observed that

Table 10.5: Percentages of documents published in each quartile by different numbers of authors

Authors   First-quartile   Second-quartile   Third-quartile   Fourth-quartile
1         28.333%          24.167%           31.667%          15.833%
2         27.086%          24.305%           33.821%          14.788%
3         30.037%          28.189%           27.726%          14.048%
4         32.392%          29.032%           23.387%          15.189%
5         31.268%          28.024%           23.304%          17.404%
>5        36.481%          29.185%           22.747%          11.587%

Figure 10.9: Evolution of percentages of documents published in quartile journals

the percentage of documents published by first- and second-quartile journals has gradually increased, whereas the percentage of documents published by third- and fourth-quartile journals has decreased.

The evolution of documents published in different quartiles according to different numbers of authors is analyzed in Figure 10.10. It is found that there were small differences in publication behavior by author set cardinality for documents across different quartiles. In general, documents published by one and two authors have undergone a percentage decrease, leading to a rise in the collaborative rate value throughout the analyzed period. Also, the percentage of documents published by four authors has increased, whereas documents written by three, five and more than five authors have undergone fluctuations and small decreases with respect to different quartiles.

Figure 10.10: Evolution of percentages of documents published in quartile journals by number of authors

(a) First-quartile; (b) Second-quartile; (c) Third-quartile; (d) Fourth-quartile

As expected, documents published by three or four authors had the highest values in each quartile. There has been a noteworthy increment of documents published by four authors in the last few years, rising to 47.252% of documents in third-quartile journals (see Figure 10.10c) and 55.814% of documents in fourth-quartile journals (see Figure 10.10d) in 2009. On the other hand, the main differences were associated with documents written by three authors. In this case, the percentages of these documents fluctuated in the first- and second-quartile journals, whereas they clearly decreased in third- and fourth-quartile journals.

Figure 10.11 analyzes the collaborative level with respect to different quartiles. It shows that fourth-quartile journals had the highest value for number of authors per document (4.519 authors) in 2009, followed by journals belonging to the first (3.869 authors), third (3.587 authors) and second (3.569 authors) quartiles. Also, all collaborative level values have undergone an increase in the last few years. Especially noteworthy was the sizeable increment of fourth-quartile journals since 2006. According to these results, fourth-quartile journals have recently accepted documents with many authors (see Figure 10.10d) in order to improve their number of citations. A possible interpretation is that these journals publish documents with many co-authors in order to improve their quartile through increased dissemination by co-authors, including self-citations.

Table 10.6 presents the average number of citations per document, the average age per document, and the average number of citations per document and year. These values and their standard deviations were calculated for documents belonging to each quartile. As expected, documents published by first-quartile and second-quartile journals usually had a higher average number of citations per document and year than documents published by third-quartile and fourth-quartile journals. Analyzing the number of authors, it is noted that single-authored documents always had the lowest average value of citations per document and year when they were published by first-, second- and third-quartile journals. In contrast, these documents had the highest average value (1.100±1.734 citations) when they were published in fourth-quartile journals.

Figure 10.11: Evolution of the average number of authors per document (CL) by journal impact factor quartiles


Table 10.6: Mean ± standard deviation for citation measures of documents published by numbers of authors according to impact factor. The symbol † represents those results that are statistically different in citations ratio with respect to the benchmark subset (highlighted in boldface)

Authors      Measure           First-quartile   Second-quartile   Third-quartile   Fourth-quartile
1-author     Citation count    6.176±11.862     2.069±3.240       4.237±7.713      8.789±15.183
             Publication age   5.559±3.193      4.862±2.900       5.947±2.837      7.263±2.207
             Citations ratio   0.815±1.255      0.460±0.816 †     0.574±0.938      1.100±1.734
2-authors    Citation count    8.811±21.678     5.000±12.819      5.723±14.828     3.442±5.775
             Publication age   4.227±2.837      4.530±2.625       5.792±2.763      6.253±2.737
             Citations ratio   1.474±2.717      0.836±1.634 †     0.794±1.578      0.454±0.715
3-authors    Citation count    6.708±13.177     3.859±7.928       5.683±11.643     2.736±3.679
             Publication age   3.969±2.707      3.833±2.574       5.467±2.776      5.521±2.660
             Citations ratio   1.359±2.167      0.767±1.219 †     0.860±1.380      0.452±0.589
4-authors    Citation count    6.178±10.491     4.495±8.462       4.845±9.487      2.163±4.451
             Publication age   3.859±2.701      3.662±2.413       4.891±2.945      4.750±3.052
             Citations ratio   1.234±1.654      1.009±1.641       0.744±1.152      0.374±0.618
5-authors    Citation count    4.991±7.321      5.684±8.714       5.000±7.402      2.389±4.866
             Publication age   3.755±2.570      4.021±2.497       4.418±2.520      4.778±2.879
             Citations ratio   1.037±1.265      1.118±1.466       0.936±1.209      0.390±0.632
>5-authors   Citation count    5.176±8.181      3.853±4.643       5.962±16.233     1.042±2.053
             Publication age   3.200±2.429      4.044±2.464       5.113±2.819      4.375±2.732
             Citations ratio   1.153±1.465      1.029±1.959       0.975±2.135      0.238±0.488 †

A Kruskal-Wallis test was performed in order to compare subsets of different authors across documents published in different quartiles. Results show that there were no significant differences across documents published by first- and third-quartile journals. In contrast, significant differences were found across documents published by second- and fourth-quartile journals: the average number of citations per document and year of documents published by five authors in second-quartile journals (1.118±1.466) was significantly higher than that of documents published by one author (0.460±0.816), two authors (0.836±1.634) and three authors (0.767±1.219). Likewise, the average number of citations per document and year of documents published by one author in fourth-quartile journals (1.100±1.734) was significantly higher than that of documents published by more than five authors (0.238±0.488). According to these results, no pattern is found to explain the relationship between impact factors, visibility and authors.

10.3 Discussion and conclusions

This analysis is limited to the Spanish computer science production included in the Web of Science. It is a small percentage of the worldwide output; therefore, the results may not be generally applicable. Further research is required in order to assess the above questions.

This chapter has studied five relationships. The first analyzes how productivity and visibility vary according to the number of authors. Regarding productivity, the initial hypothesis was that the average computer science document was written by three or four authors. The research findings confirm that the hypothesis was correct. Results also show that the collaborative level has increased over time as expected. This was caused by both the percentage decrease of published documents written by one and two authors, and the percentage increase of documents written by three or more authors. On the other hand, it was expected that a higher number of authors would lead to a higher number of citations. In contrast, results show that documents published by one or two authors have higher values of citations per document and year than documents published by three or more authors. In fact, statistical test results show that there are significant differences between the 2-author subset and subsets with more authors. A positive association between author set cardinality and citation impact was not found.

The second relationship analyzes how productivity and visibility vary across different types of collaboration according to different numbers of authors. Due to the problems concerning geographic distance and communication across organizations, it was expected that most documents were written via institutional collaboration. Results show that the initial hypothesis was correct, since 73.554% of total publications were published by authors belonging to the same institution. On the other hand, publication behavior was similar across different types of collaboration: international, national and institutional collaborations are usually written by three or four authors. Regarding collaborative level, the hypothesis was that international documents would have more authors than national and institutional documents. Results show that the initial hypothesis was again correct. A possible explanation of this fact is that a large number of international and national collaborations spring from projects that require the participation of many researchers at different institutions, whereas most institutional collaborations usually involve authors from the same research group, that is, fewer authors. Finally, it was also expected that international collaborations would have more citations than national and institutional documents. Unlike Bartneck and Hu [34], who were unable to find a general beneficial effect of collaboration of any type (international, national or institutional) on the quality of the papers measured by their citation counts, results show that international collaborations always have the highest average number of citations per document and year for different numbers of authors. It was also expected that a greater number of authors would lead to a higher number of citations within a particular type of collaboration. However, statistical test results show that document visibility is better if the document is published by few authors at the institutional level, whereas the number of authors does not affect citation counts at the national and international level.
The third relationship investigates how productivity and visibility vary across different document types according to different numbers of authors. It was expected that the publication rates associated with journal articles and conference papers would be different. Results corroborated this hypothesis, showing that 32.262% of publications are journal articles, whereas 67.738% are conference papers. Analyzing the collaborative level, the initial hypothesis was correct: there are no important differences in publication behavior between journals and conferences, although the collaborative level of conference papers was higher than that of journal articles in earlier years. Results also show that there is a general decrease of documents published by one and two authors in journal and conference publications. According to the number of authors, single-authored documents are usually published in journals, whereas multi-authored documents are usually published in conferences. It was also expected that citation counts would be different between journals and conferences. Results show that journal articles have more citations per document and year than conference papers. Finally, unlike the initial hypothesis, statistical test results do not assure that a greater number of authors leads to more citations.

The fourth relationship investigates how productivity and visibility among the seven computer science subdisciplines vary by number of authors. The initial hypothesis was that the publication rate associated with each subdiscipline would be different. Results corroborated this hypothesis, showing that 40.123% of publications belong to theory and methods, whereas only 1.613% of publications belong to cybernetics. Regarding the percentages of documents published by different numbers of authors, it is observed that there are no important differences in publication behavior across subdisciplines, as expected. Results show that there is a general decrease of documents published by one and two authors in all subdisciplines. Also, three and four authors write the highest percentage of documents in all subdisciplines, whereas documents written by one author account for the lowest percentage. Analyzing the collaborative level, the initial hypothesis was correct. It was also expected that citation counts would be different across subdisciplines. Results show that documents related to artificial intelligence, cybernetics and interdisciplinary applications usually have the highest value of citations per document and year for a set number of authors. Finally, unlike the initial hypothesis, statistical test results do not assure that a greater number of authors leads to a higher number of citations within a specific subdiscipline.

The last relationship analyzes how productivity and visibility in different journal impact factor quartiles vary by number of authors. It was expected that first-quartile journals would have the lowest percentage of publications due to their low acceptance rate. Contrariwise, results show that first- and second-quartile journals publish more documents than third- and fourth-quartile journals. This is reasonable bearing in mind that authors nowadays have an interest in publishing in journals with the highest possible impact factor. Regarding the number of authors, it was supposed that there would be no clear differences across quartiles.
In contrast, results show an important increment of the number of authors per document in fourth-quartile journals since 2006. As expected, results show that citation counts are obviously different across quartiles. Finally, it was also expected that a greater number of authors would lead to a higher number of citations for a set quartile. However, statistical test results found no pattern to explain the relationship between impact factor quartiles, number of citations and number of authors.

In the future, the target will be to analyze other aspects related to collaboration at the author level (number of different co-authors, productivity of co-authors, visibility of co-authors, proximity among co-authors, etc.). Furthermore, it is interesting to analyze whether researchers with the best research performance are also the investigators that collaborate more at the international level, and whether the citation counts of papers written by authors with a low number of citations improve through collaboration.

Chapter 11

Predicting Spanish h-index

11.1 Introduction

One of the most successful bibliometric indices is the h-index [207], which quantifies the scientific output of a single researcher as a single-number criterion taking into account both the quantity and visibility of publications. Formally, the h-index is defined as follows: "A scientist has index h, if h of his or her Np papers have at least h citations each and the other (Np - h) papers have at most h citations each". It has acquired great significance in the evaluation of research over recent years. In fact, many funding agencies and promotion committees use it for accepting research projects and contracting researchers, among others. Nowadays, many researchers have turned their attention to the prediction of bibliometric indices due to increasing interest within the scientific community. For example, Krampen et al. [268] forecasted the number of documents that a researcher would produce within ten years in the field of psychology. Their predictions were based on past psychology publication frequencies, modeled by exponential and exponential smoothing functions. Some other works have predicted h-index values under different circumstances. The power law model [136] was used to analyze the h-index as a function of time [133]. Nonlinear regression was also used to predict the h-index of authors, journals and universities [482]. Most works concerned with predicting the h-index only used h-index sequences to indicate by extrapolation what the value of the h-index would be in the near future. Recently, Acuna et al. [9] also used current h-index values to predict the h-index using a linear regression with elastic net regularization. Their model also included other predictors like the number of articles written, the number of years since publication of the first article, the number of distinct journals in which the articles came out, and the number of articles published in five top journals. Finally, McCarty et al. [312] predicted the author h-index using characteristics of the co-author network. Results of their regression model suggest that the highest h-index will be achieved by working with many co-authors, at least some with high h-indexes themselves. The interest and originality of this chapter is a new approach based on cost-sensitive naive Bayes models for predicting the h-index for a four-year time horizon using some author-based variables (area, position, university, seniority) and 12 bibliometric indices. Specifically,

several prediction models are built to forecast the annual increase of the h-index of Spanish professors. Finally, h-index prediction is considered as an ordinal classification problem; accordingly, the aim is not to maximize classification accuracy, but to minimize the expected total cost derived from mistakes in the classification process. This work appears in the published paper [224].
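To make the definition operational, the computation of the h-index from a list of citation counts can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not part of the cited works):

import numpy as np  # not strictly needed here; shown for consistency with later sketches

def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the first h papers all have >= h citations
        else:
            break
    return h

# A researcher with papers cited (10, 8, 5, 4, 3, 0) times has h-index 4
print(h_index([10, 8, 5, 4, 3, 0]))  # -> 4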

Chapter outline

Section 11.2 explains the proposed classification method (Section 11.2.1), the selection of important predictive features (Section 11.2.2) and the different models learned (Section 11.2.3). Finally, Section 11.3 contains some conclusions emphasizing the original contribution of this work and future research on the topic.

11.2 Predicting the h-index’s annual increase

In view of the attention attracted by the h-index, this chapter is focused on the construction of several prediction models to forecast the h-index annual increase of Spanish professors over a four-year time horizon. Two different types of models (senior models and junior models) are learned to differentiate between professors' seniority. On the one hand, senior models attempt to predict the annual increase of the h-index of professors who had a seniority of at least eight years at the end of the information collection process (December 31, 2005), that is, they published their first paper before 1998. On the other hand, junior models also attempt to predict the annual increase of the h-index, but, in this case, only professors who had a seniority of at most three years at the end of the information collection process were considered. These models are learnt from bibliometric data using a cost-sensitive naive Bayes approach that takes into account the expected cost of instance predictions at classification time.

11.2.1 Cost-sensitive naive Bayes approach

The cost-sensitive naive Bayes is an adaptation of the original naive Bayes [325] which is one of the simplest models for supervised classification. It is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. A cost-sensitive naive Bayes classifier has two types of variables: the class variable C and a set of predictive features X={X1,X2, ..., Xn}. The class variable C is discrete and takes values in the set Ω(C). The predictive features can be divided into two sets: the set of discrete features

{X1, ..., Xm} and the set of continuous features {Xm+1, ..., Xn}. This classifier is based on Bayes' theorem under the assumption of conditional independence of the predictive features given the class variable. The objective of the cost-sensitive naive Bayes is to take into account misclassification costs other than 0 (hit) and 1 (miss). Given a cost matrix and a set of predicted class probabilities for each instance, this method readjusts the probability threshold of each class so as to select the class with the minimum expected cost. The expected cost of each prediction is obtained by multiplying the associated costs by the predicted class probabilities. The cost matrix is ignored when estimating the class probabilities, but it is taken into account when the candidate predictions are evaluated. Unlike the original naive Bayes, this method does not select the most likely class value of the posterior distribution; it selects the class c* that minimizes the expected cost of predictions given a new instance x:

$$c^{*} = \arg\min_{c \in \Omega(C)} \sum_{c'=1}^{|\Omega(C)|} p(c' \mid \mathbf{x})\, \mathrm{cost}(c \mid c')$$

where

$$p(c' \mid \mathbf{x}) \propto p(c') \prod_{i=1}^{m} p(x_i \mid c') \prod_{j=m+1}^{n} \mathcal{N}\big(x_j;\, \mu_j^{c'}, (\sigma_j^{c'})^{2}\big)$$

and cost(c | c') is the associated misclassification cost.
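The decision rule above can be illustrated with a short sketch (toy numbers of ours; the exponential cost matrix mirrors the one described in Section 11.2.3):

import numpy as np

def min_expected_cost_class(posteriors, cost_matrix):
    """Pick the class c minimizing sum_{c'} p(c'|x) * cost(c|c')."""
    expected_costs = cost_matrix @ posteriors  # one expected cost per candidate c
    return int(np.argmin(expected_costs))

# Cost growing exponentially with the distance between predicted and true class
k = 3
costs = np.array([[np.exp(abs(c - c2)) - 1.0 for c2 in range(k)] for c in range(k)])

posteriors = np.array([0.45, 0.40, 0.15])          # p(c'|x) from the naive Bayes
print(int(np.argmax(posteriors)))                  # plain naive Bayes  -> class 0
print(min_expected_cost_class(posteriors, costs))  # cost-sensitive     -> class 1

Note how the two criteria can disagree: the most probable class is not selected when a neighboring class has a lower expected cost.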

11.2.2 Selecting predictive features

The objective of feature selection is to build parsimonious models. Features that are irrelevant or redundant will not appear in these models. The benefits of applying feature selection include better classification performance, faster classification models, smaller databases, and the ability to gain more insight into the process that is being modeled [390].

The predictive features used by cost-sensitive naive Bayes were:

- The feature Area represents the subject area related to each professor. It has three possible values: Computer Architecture and Technology, Computer Science and Artificial Intelligence, and finally, Computer Languages and Systems.

- The feature Position corresponds to the position of each professor, having four possible values: Full Professor, Associate Professor (type I), Associate Professor (type II) and Associate Professor (type III).

- The feature University is associated with the public university employing each professor. It has 48 possible values (e.g., Technical University of Madrid, University of Granada, University of Almeria, University of Castilla-La Mancha,...).

- The feature Seniority represents the seniority associated with each professor. It is a numeric feature which is calculated in terms of the publication year of his or her first paper.

- The bibliometric indices selected as predictive features were: documents, citations, the h-index, the g-index, the hg-index, the a-index, the m-index, the q2-index, the hr-index, the hi-index, the hc-index, and the c-index.

In this context, correlation-based feature selection is used as the feature selection algorithm. The basic idea behind this algorithm is to find a good set of features that are highly correlated with the class to be predicted and uncorrelated with each other.
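As an illustration, the merit score that correlation-based feature selection trades off can be sketched as follows (a simplified sketch of the standard CFS merit; the correlation inputs are assumed to be precomputed):

import numpy as np

def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a subset of k features: favors high average
    feature-class correlation and low feature-feature redundancy."""
    k = len(feature_class_corr)
    r_cf = np.mean(np.abs(feature_class_corr))
    iu = np.triu_indices(k, 1)  # pairs of distinct features
    r_ff = np.mean(np.abs(feature_feature_corr[iu])) if k > 1 else 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Two strongly class-correlated but redundant features...
print(cfs_merit(np.array([0.8, 0.8]), np.array([[1.0, 0.9], [0.9, 1.0]])))
# ...score lower than two slightly weaker but complementary ones
print(cfs_merit(np.array([0.7, 0.7]), np.array([[1.0, 0.1], [0.1, 1.0]])))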

11.2.3 Junior and senior predictive models

Once the dataset construction is finished, Table 11.1 shows the distribution of the professors selected for the junior and senior models according to the annual increase of their h-index value within the first four years. Taking the first year as an example, it is observed that most junior professors have ∆h-index=0, whereas only a few professors have ∆h-index=1. It is also shown that most junior professors have ∆h-index=0 in the second, third and fourth year. In contrast, senior models show different data distributions. For example, the h-index increase for senior professors ranges from 0 to 4 in the first year and from 0 to 14 in the fourth year. Before learning the predictive models, feature selection is performed in order to determine whether all the predictive features are equally important or necessary for discriminating between the different values of the annual increase of the h-index. Table 11.2 shows the predictive features that are selected in senior models after running correlation-based feature selection.

Table 11.1: Data distribution of junior and senior professors

∆h             0    1    2    3    4    5    6    7    8    9   10   11   12   13   14

Junior professors
First-year    239   50    -    -    -    -    -    -    -    -    -    -    -    -    -
Second-year   205   82    2    -    -    -    -    -    -    -    -    -    -    -    -
Third-year    159  119   10    1    -    -    -    -    -    -    -    -    -    -    -
Fourth-year   146  125   17    1    -    -    -    -    -    -    -    -    -    -    -

Senior professors
First-year    267   76    7    1    1    -    -    -    -    -    -    -    -    -    -
Second-year   203  122   29    5    1    1    1    -    -    -    -    -    -    -    -
Third-year    165  118   54    9    4    0    0    1    0    0    0    1    -    -    -
Fourth-year   147  113   59   22    4    5    0    0    0    1    0    0    0    0    1

Table 11.2: Selecting predictive features of senior models

Feature       Selected in (out of the 4 yearly senior models)
area          0
position      0
university    4
seniority     0
documents     4
citations     2
h-index       1
g-index       4
hg-index      1
a-index       0
m-index       1
q2-index      2
hr-index      1
hi-index      1
c-index       4
hc-index      2

This table illustrates that university, documents, g-index and c-index are always chosen. These predictive features are highly correlated with the class to be predicted. On the other hand, area, position, seniority and a-index are never selected to build parsimonious models. Different cost-sensitive naive Bayes models are learned with the intention of checking the benefits of applying feature selection (e.g., better classification performance). The first model is learnt from a dataset (Dataset_nofs) to which the above feature selection is not applied, whereas the second model is learnt from a dataset (Dataset_fs) to which it is. The required cost matrix is associated with an exponential function; in this way, instances whose weighted distance between the actual and the predicted class values is very high will be heavily penalized. Also, k-fold cross-validation is chosen as the procedure for estimating the accuracy of the models when classifying new cases according to the values of the predictive features. Table 11.3 lists the accuracy and the standard deviations of the predictive models. It also shows the number of values of the class variable and the number of predictive features (in parentheses) accounted for by each model. Taking the prediction values of the senior models for the first year as an example, it is noted that the class variable has 5 possible ∆h values (0, 1, 2, 3, >3), which are forecast using sixteen predictive features (Dataset_nofs) and nine

Table 11.3: Accuracy, standard deviations and number of features of models which are learnt from two different datasets

                 Junior Models                   Senior Models
First-year       2 classes                       5 classes
  Dataset_nofs   74.75 ± 12.05 (16 features)     68.85 ± 5.78 (16 features)
  Dataset_fs     81.31 ± 2.37 (1 feature)        69.52 ± 5.58 (9 features)
Second-year      3 classes                       7 classes
  Dataset_nofs   47.74 ± 13.96 (16 features)     57.15 ± 6.68 (16 features)
  Dataset_fs     71.46 ± 5.72 † (2 features)     58.20 ± 6.78 (8 features)
Third-year       4 classes                       12 classes
  Dataset_nofs   54.08 ± 8.23 (16 features)      53.58 ± 7.81 (16 features)
  Dataset_fs     55.02 ± 6.94 (2 features)       51.02 ± 6.49 (5 features)
Fourth-year      4 classes                       15 classes
  Dataset_nofs   48.92 ± 7.23 (16 features)      47.85 ± 8.19 (16 features)
  Dataset_fs     50.21 ± 7.90 (3 features)       48.51 ± 7.35 (5 features)

Table 11.4: Accuracy, standard deviations and average cost of models which are learnt using different naive Bayes approaches

                 Junior Models                   Senior Models
First-year       2 classes                       5 classes
  NB             81.31 ± 2.37 (0.187 cost)       69.50 ± 5.59 (0.412 cost)
  NBcost         81.31 ± 2.37 (0.187 cost)       69.52 ± 5.58 (0.398 cost)
Second-year      3 classes                       7 classes
  NB             71.29 ± 5.68 (0.294 cost)       58.20 ± 6.56 (0.739 cost)
  NBcost         71.46 ± 5.72 (0.287 cost)       58.20 ± 6.78 (0.730 cost)
Third-year       4 classes                       12 classes
  NB             54.26 ± 6.38 (0.488 cost)       50.96 ± 6.63 (1.685 cost)
  NBcost         55.02 ± 6.94 (0.481 cost)       51.02 ± 6.49 (1.645 cost)
Fourth-year      4 classes                       15 classes
  NB             49.65 ± 7.71 (0.540 cost)       50.89 ± 7.38 (4.094 cost)
  NBcost         50.21 ± 7.90 (0.539 cost)       49.51 ± 7.35 (4.091 cost)

predictive features (Dataset_fs). The accuracy and the standard deviations associated with Dataset_nofs and Dataset_fs are 68.85±5.78 and 69.52±5.58, respectively. Note that most of the models obtain better classification performance when feature selection is performed. Furthermore, it is observed that senior models always use more predictive features than junior models to predict the increase of the h-index value within the first four years. Finally, the symbol (†) placed beside a result indicates that the selected model is statistically better than its opposite model (learned from the other dataset) at a specified significance level of 0.05. In order to determine if the accuracy values are reasonable, Table 11.4 compares the cost-sensitive naive Bayes with the standard formulation of naive Bayes. It shows the estimated accuracy, the standard deviations, the average cost (in parentheses) and the number of values of the class variable for each model. Taking the prediction values of the junior models for the second year as an example, it is observed that the class variable has 3 possible ∆h values (0, 1, >1). On the one hand, the accuracy and the standard deviations associated with the naive Bayes and the cost-sensitive naive Bayes are 71.29±5.68 and 71.46±5.72, respectively. These accuracy values are considerably greater than would be expected purely by chance. On the other hand, the average costs associated with the naive Bayes and the cost-sensitive naive Bayes are 0.294 and 0.287, respectively. Focusing on each algorithm, the cost-sensitive naive Bayes predicts almost all the values more accurately than the naive Bayes. Furthermore, most models obtain a lower average cost when the cost-sensitive naive Bayes is used. Finally, an example of the prediction of the h-index's annual increase is reported. Table 11.5 shows the parameters that define the model associated with senior professors in the first year of prediction. The continuous features are described by means of the mean (µ) and the standard deviation (σ). On the other hand, the discrete feature is described by means of the probability of each possible feature value given the class value (p(xi|c)). The Laplace estimator is used to compute the parameters of the conditional distributions of discrete features. Table 11.5 does not show all the parameters for space reasons. Given a senior professor x with the following values: university=Granada, documents=20, citations=65, g-index=8, hg-index=8.4, m-index=10, hr-index=9.2, c-index=25.3 and hc-index=1.8, the ∆h-index values can be predicted using the formulation of the cost-sensitive naive Bayes and the parameters listed in Table 11.5. After propagating the above evidence, the results predicted by the cost-sensitive naive Bayes are p(∆h=0|x)=0.004, p(∆h=1|x)=0.308, p(∆h=2|x)=0.688, p(∆h=3|x)=0.000 and p(∆h=4|x)=0.000; that is, with high probability, the h-index of the above professor will increase by two units in the first year.
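The arithmetic behind such a prediction can be reproduced with a small sketch. Only the Gaussian parameters of two features from Table 11.5 are used here, and a uniform class prior is assumed for illustration, so the resulting posterior differs from that of the full nine-feature model:

import numpy as np
from scipy.stats import norm

# (mu, sigma) per class for two continuous features, taken from Table 11.5
params = {
    0: {"documents": (11.7, 11.3), "citations": (31.1, 51.5)},
    1: {"documents": (17.8, 14.9), "citations": (63.3, 118.6)},
    2: {"documents": (25.7, 9.9),  "citations": (132.6, 108.7)},
}
x = {"documents": 20, "citations": 65}

log_scores = {}
for c, feats in params.items():
    s = 0.0  # log p(c) dropped: uniform prior assumed
    for name, (mu, sigma) in feats.items():
        s += norm.logpdf(x[name], loc=mu, scale=sigma)
    log_scores[c] = s

z = np.logaddexp.reduce(list(log_scores.values()))  # log normalizing constant
print({c: float(np.exp(s - z)) for c, s in log_scores.items()})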

11.3 Discussion and conclusions

The machine learning community is not only interested in maximizing classification accuracy, but also in minimizing the expected total cost derived from mistakes in the classification process. Some approaches, like cost-sensitive classification, have been proposed to face this problem. The proposed cost-sensitive naive Bayes classifier readjusts the probability thresholds of each class to select the class with the minimum expected cost. This method has been tested in the area of bibliometric index prediction. Considering the popularity of the well-known h-index, this chapter is focused on building junior and senior models to predict the h-index of professors at Spanish public universities over a four-year time horizon. These models are learnt from author-based variables (area, position, university, seniority) and 12 bibliometric indices. The use of models capable of predicting the h-index that a researcher will have in coming years can be a useful tool for the scientific community. Results show that the proposed cost-sensitive naive Bayes usually achieves higher accuracy values and lower average costs than the original naive Bayes, so the cost-sensitive approach could also be applied to other probabilistic classification methods to improve accuracy and costs. Results also show that it is easier to predict the h-index over a one-year time horizon than over longer ones, that is, the one-year models have a higher average accuracy and a lower average total cost. Similarly, it is easier to predict the h-index of junior professors than that of senior professors. Finally, it is observed that specific values of some bibliometric indices can influence the h-index value; the probabilities assigned and the selected predictive features depend on the specific model and time horizon. In the future, the target will be to build new models that incorporate other researcher-based features, e.g., the impact factor of their journal papers, their collaboration network, or the percentage of papers published in international journals, among others. Furthermore, wrapper feature subset selection and other classification methods, e.g., tree augmented networks, k-nearest neighbors and C4.5, among others, will be used to predict future h-index values.

Table 11.5: Parameters that define cost-sensitive naive Bayes model

Features     ∆h=0                 ∆h=1                  ∆h=2
university   p(Xi=1|∆h=0)         p(Xi=1|∆h=1)          p(Xi=1|∆h=2)
documents    µ=11.7, σ=11.3       µ=17.8, σ=14.9        µ=25.7, σ=9.9
citations    µ=31.1, σ=51.5       µ=63.3, σ=118.6       µ=132.6, σ=108.7
g-index      µ=4.3, σ=3.5         µ=6.3, σ=4.9          µ=10.2, σ=4.8
hg-index     µ=3.2, σ=2.5         µ=4.5, σ=3.7          µ=7.3, σ=3.2
m-index      µ=6.4, σ=6.1         µ=8.7, σ=6.7          µ=12.2, σ=7.8
hr-index     µ=3.3, σ=2.0         µ=4.2, σ=2.9          µ=6.6, σ=2.1
c-index      µ=5.9, σ=8.9         µ=12.4, σ=29.2        µ=18.0, σ=7.9
hc-index     µ=0.4, σ=0.6         µ=0.8, σ=0.8          µ=1.3, σ=0.9

Chapter 12

Uncovering predictive indicators

12.1 Introduction

Many funding agencies and promotion committees regularly use bibliometric indices as a decision-support tool to evaluate individual researchers according to their production. They constitute an objective method whose results are reproducible and can summarize the scientific production of an author as a set of figures. Many works [73, 165, 208, 239, 257, 285, 461, 462] have turned their attention to the predictive power of bibliometric indices in many situations: prediction of article impact, scientist promotions, and acceptance of grant proposals, among others. As a result, the scientific community now faces the challenge of selecting which indices of this pool have the highest predictive power. Jensen et al. [239] found that the h-index performed best at predicting the promotions of CNRS researchers. Cabezas-Clavijo et al. [73] showed that the main bibliometric indicators explaining the granting of Spanish research proposals are, in most cases, the number of documents and the number of documents published in first-quartile journals of the Journal Citation Reports. Vieira et al. [461, 462] assessed the power of models based on bibliometric indicators for predicting the rankings of applicants to academic positions at Portuguese universities. They found that models composed of indicators related to the quantity and impact of scientific production, the impact of the publication source, the prestige of the affiliation institution and collaboration provided good predictions and may help peers in their selection process. This chapter presents a method for identifying a core set of bibliometric indices for prediction purposes, i.e., relevant indices which have a higher predictive power for forecasting other bibliometric indices. Given a dataset of bibliometric indices X={X1, X2, ..., Xn}, the task of selecting which subset of bibliometric indices best corresponds to predictive variables

XP (variables with a higher predictive power) and which group can be considered as response variables XR is tackled, where dim(XP)=p, dim(XR)=r and p+r=n. The best split of predictive and response variables is unknown beforehand and needs to be investigated. The resulting predictive indices are very useful for prediction purposes, that is, when the

relevant index values (predictive variables) are known, knowledge of any other index value provides no additional information for the prediction of the remaining bibliometric indices (response variables). A wrapper analysis evaluating all possible configurations of predictor and response variables is used to reveal the relevant set of predictive bibliometric indices. After setting a specific configuration of predictive and response variables, the statistical relationships among the set of bibliometric indices are learned by means of Gaussian Bayesian networks. The learnt network is then used to identify how bibliometric indices relate to each other multivariately, that is, which index Xi is independent of another Xj, or which index Xi is conditionally independent of index Xj given the value of a third index Xk, among others. Taking into account the learnt relationships among bibliometric indices, the response variables are predicted by means of multi-output regression [62]. The goal of multi-output regression is to induce a model that simultaneously predicts the response variables using the same set of predictive variables while accounting for the dependencies between them.

The structural learning of the Gaussian Bayesian networks, given fixed values for XP and XR, is based on a score+search approach. This approach guides the learning of the structures by the distance between real and predicted response variable values. The optimization process searches for the Gaussian Bayesian network which minimizes this fitness score. However, the number of possible structures is huge, and therefore a genetic algorithm [210] is used to explore the search space of structures. Finally, the optimal structure provides information on which bibliometric indices have the highest predictive power and how they relate to each other. The interest and originality of this analysis lie in a novel multi-output regression problem where the role of each variable (predictor or response) is unknown beforehand. To solve this problem, a new Gaussian Bayesian network structure learning algorithm is proposed. It searches for the best structure that minimizes the distance between real and predicted response variable values. The resulting structure reports the most predictive bibliometric indices; from their values, highly accurate predictions of the values of the remaining bibliometric indices could be calculated. The scientific community could take advantage of these highly accurate predictions, avoiding the tedious and time-consuming phases of downloading citation records, organizing the non-structured data and computing many bibliometric index values. This chapter is based on the accepted paper [218].

Chapter outline

Section 12.2 briefly introduces the multi-output regression problem and how it can be solved using Gaussian Bayesian networks. Section 12.3 describes the different elements of the genetic algorithm on which the Gaussian Bayesian network learning process is based. Section 12.4 reports the results of applying the proposed approach to a dataset of Spanish full professors of computer science. It covers the experimental setup, the optimal Gaussian Bayesian networks, and a discussion of the best induced Gaussian Bayesian network, including its structure, the conditional (in)dependencies among indices and its predictive power.

Finally, conclusions are discussed in Section 12.5.

12.2 Multi-output regression and Gaussian Bayesian networks

The multi-output regression problem is formally described as follows. Let X and Y be two random vectors, where X consists of p predictive variables and Y consists of r response variables. Given a set of training samples, the goal in multi-output regression is to learn a model which, given an input vector x, is able to predict an output vector y that best approximates (in terms of minimizing the least squared errors) the real output vector. Conventionally, this is achieved by generalizing single-output regression, using a different vector of regression coefficients to predict each output, i.e.,

$$y = Bx + e, \qquad (12.1)$$

where B is an r × p matrix of regression coefficients, x is a realization of the p predictive variables, and e is a vector consisting of the noise for each of the r response variables. The noise is typically assumed to be Gaussian with zero mean and uncorrelated across the r response variables.

The multi-output regression problem can be tackled using a Gaussian Bayesian network framework. This framework introduces an alternative parameterization of the regression model, derived as a conditional probability model (Y | X) from the joint probability distribution. If, in the partition (X, Y), X is the set of evidential (observed) variables and Y is the set of non-evidential variables, a joint multivariate Gaussian distribution is assumed with mean vector and covariance matrix given by

$$\mu = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}, \qquad (12.2)$$

where µX and ΣXX are the mean vector and covariance matrix of X, µY and ΣYY are the mean vector and covariance matrix of Y, and ΣXY = (ΣYX)^T is the covariance matrix of X and Y.

The above covariance matrix, Σ, is of great interest in Gaussian Bayesian networks because its inverse, the precision matrix (W = Σ⁻¹), captures the dependence structure of the variables of the problem. Anderson [21] demonstrated that a variable Xi is conditionally independent of a variable Xj given the rest of the variables iff wij = 0. Previously, Shachter and Kenley [411] found that it is possible to determine the precision matrix W using the following recursive formula:

$$W(i+1) = \begin{pmatrix} W(i) + \dfrac{\beta_{i+1}\beta_{i+1}^{T}}{v_{i+1}} & -\dfrac{\beta_{i+1}}{v_{i+1}} \\[6pt] -\dfrac{\beta_{i+1}^{T}}{v_{i+1}} & \dfrac{1}{v_{i+1}} \end{pmatrix}, \qquad (12.3)$$

where W(i) denotes the i × i upper-left submatrix of W, β_{i+1} is the i-dimensional vector of coefficients {β_{i+1,j} | j ≤ i}, and W(1) = 1/v1.
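A direct transcription of this recursion is given below (a sketch assuming the variables are indexed in topological order, with betas[i] holding the regression coefficients of variable i+1 on all earlier variables, zero where no arc exists):

import numpy as np

def build_precision(betas, variances):
    """Build W = Sigma^{-1} with the recursion of Equation (12.3)."""
    W = np.array([[1.0 / variances[0]]])  # W(1) = 1/v_1
    for i in range(1, len(variances)):
        b = np.asarray(betas[i], dtype=float).reshape(-1, 1)
        v = variances[i]
        W = np.block([[W + (b @ b.T) / v, -b / v],
                      [-b.T / v,          np.array([[1.0 / v]])]])
    return W

# X1 ~ N(0, 1); X2 = 0.5 X1 + e with Var(e) = 2
W = build_precision(betas=[[], [0.5]], variances=[1.0, 2.0])
print(W)                 # [[1.125 -0.25], [-0.25 0.5]]
print(np.linalg.inv(W))  # recovers Sigma = [[1, 0.5], [0.5, 2.25]]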

Evidence propagation refers to the process of computing the probability distribution of the rest of the variables given some observations. A method [81] is used to perform evidence propagation in a Gaussian Bayesian network. Given the above joint distribution, the conditional distribution of Y given X is multivariate Gaussian, with mean vector µ^{Y|X=x} and covariance matrix Σ^{Y|X=x} given by

$$\mu^{Y \mid X=x} = \mu_Y + \Sigma_{YX}\,\Sigma_{XX}^{-1}\,(x - \mu_X), \qquad (12.4)$$

$$\Sigma^{Y \mid X=x} = \Sigma_{YY} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}. \qquad (12.5)$$

Finally, note that Equations (12.1) and (12.4)-(12.5) are different parameterizations of the same regression model, since B = Σ_{YX} Σ_{XX}^{-1}.
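These two equations translate directly into code; the sketch below conditions a joint Gaussian on observed predictive variables (index sets and toy numbers are ours):

import numpy as np

def conditional_gaussian(mu, sigma, x_idx, y_idx, x):
    """Mean and covariance of Y | X = x, Equations (12.4) and (12.5)."""
    mu, sigma, x = np.asarray(mu), np.asarray(sigma), np.asarray(x)
    s_xx = sigma[np.ix_(x_idx, x_idx)]
    s_yy = sigma[np.ix_(y_idx, y_idx)]
    s_yx = sigma[np.ix_(y_idx, x_idx)]
    gain = s_yx @ np.linalg.inv(s_xx)  # equals B in Equation (12.1)
    mu_cond = mu[y_idx] + gain @ (x - mu[x_idx])
    sigma_cond = s_yy - gain @ s_yx.T
    return mu_cond, sigma_cond

# Toy joint over (X1, X2, X3): predict Y = (X2, X3) from X = (X1,)
mu = [0.0, 1.0, 2.0]
sigma = [[1.0, 0.8, 0.5],
         [0.8, 2.0, 0.9],
         [0.5, 0.9, 1.5]]
print(conditional_gaussian(mu, sigma, x_idx=[0], y_idx=[1, 2], x=[1.5]))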

12.3 Learning Gaussian Bayesian networks using genetic al- gorithms

Genetic algorithms are stochastic search methods employed in solving complex optimization problems. They mimic the biological mechanisms of natural selection and evolution by means of a fitness function which determines the ability of an individual to survive and reproduce. Genetic algorithms try to find better individuals (solutions for the given problem) by producing fitter descendants over a series of populations. Genetic algorithms and Gaussian Bayesian networks are used in this chapter to uncover the subset of bibliometric indices with the highest predictive power. A wrapper analysis evaluating all different structures is used to accomplish this goal. Therefore, after setting up a specific splitting of predictive and response variables, a genetic algorithm is used to search for the optimal Gaussian Bayesian network structure which minimizes the distance between real and predicted response variable values. The process is repeated for all possible configurations of predictor and response nodes. Figure 12.1 shows the genetic algorithm methodology. Details of the implementation, such as individual codification, fitness function, selection, crossover, mutation and termination criterion, follow.

12.3.1 Initial population

The search space of candidate solutions is represented as a collection of N individuals, called the population. In this problem, individuals represent Gaussian Bayesian network structures. Each structure is described by an adjacency matrix Adj(G), which is the representation of the graph G = (V(G), A(G)). The adjacency matrix is an n × n matrix with entries a_ij, i, j = 1, ..., n, such that a_ij = 1 if and only if an arc exists from node i to node j, and a_ij = 0 otherwise. Using this codification, an individual can be transformed into a binary

Figure 12.1: Steps of our genetic algorithm methodology.

string (a11, ..., a1n, a21, ..., a2n, ..., an1, ..., ann) which maps its adjacency matrix in a vectorized form. The initial population is randomly generated. Arcs in the adjacency matrices are randomly drawn from a Bernoulli distribution with p=0.5 (probability of success). If need be, the network structure is amended to avoid the presence of cycles.
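One way to generate such an initial population is sketched below; the cycle repair via a random node ordering is an assumption of ours, since the text only requires that cycles be amended:

import numpy as np

rng = np.random.default_rng(0)

def random_dag_adjacency(n, p=0.5):
    """Bernoulli(p) adjacency matrix, made acyclic by keeping only the
    arcs that agree with a random ordering of the nodes."""
    adj = rng.binomial(1, p, size=(n, n))
    np.fill_diagonal(adj, 0)
    order = rng.permutation(n)
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)            # rank[node] = position in the ordering
    mask = rank[:, None] < rank[None, :]  # keep i -> j only if i precedes j
    return adj * mask

population = [random_dag_adjacency(12) for _ in range(20)]
print(population[0].sum(), "arcs in the first individual")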

12.3.2 Fitness function

The calculation of the fitness function accounts for the main computational burden in a genetic algorithm. An ideal fitness function should correlate closely with the goal and should be computed quickly. This running time is crucial, since the genetic algorithm must be iterated several times to produce reliable results in nontrivial problems. In this study, given an individual with p predictive variables and r response variables, the Mahalanobis distances between the real and predicted values of the r response variables (see Equation (12.6)) are calculated. The Mahalanobis distance is then used as the fitness score of that individual. Considering that the aim is to minimize that distance, the lower the fitness score, the fitter the individual. The Mahalanobis distance between two vectors y and y' is defined as

$$MD(y, y') = \sqrt{(y - y')^{T}\,\Sigma_{YY}^{-1}\,(y - y')}, \qquad (12.6)$$

where y and y' are vectors representing the real and predicted values of the response variables and ΣYY is the covariance matrix of the response variables. The Mahalanobis distance is used as a novel fitness score instead of usual metrics such as K2, BIC and AIC, among others. Although this is a time-consuming fitness value to use (because the predicted values have to be calculated beforehand), a distance-based score is selected because the objective is to minimize the distance between real and predicted response variable values. The Mahalanobis distance is used instead of the Euclidean distance because it takes into consideration the correlations between all response variables through the covariance matrix, and it solves the problems of scale inherent in the Euclidean distance.
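In code, the fitness of an individual reduces to a few lines (a sketch; y_real and y_pred would come from cross-validated network predictions):

import numpy as np

def mahalanobis_fitness(y_real, y_pred, cov_yy):
    """Average Mahalanobis distance of Equation (12.6) over all cases."""
    cov_inv = np.linalg.inv(cov_yy)
    diffs = np.asarray(y_real) - np.asarray(y_pred)
    d = np.sqrt(np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs))
    return d.mean()

# Toy check with two response variables and three cases
y_real = [[1.0, 2.0], [2.0, 1.0], [0.0, 0.0]]
y_pred = [[1.1, 1.8], [1.9, 1.2], [0.3, -0.1]]
cov = [[1.0, 0.3], [0.3, 2.0]]
print(mahalanobis_fitness(y_real, y_pred, cov))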

12.3.3 Reproduction cycle

Parent selection criterion, crossover and mutation operators and merging procedure are the constituents of the reproduction cycle in a genetic algorithm. The details of each part are explored below.

Selection criterion The selection process determines which of the individuals from the current population will mate to create new individuals. In general, the fittest individuals have a higher probability of being selected as parents of the next population. Different strategies, like proportional selection methods, ranking selection, tournament selection, etc., are available in the literature [416]. An elitist strategy which chooses the best k individuals from the population as parents for reproduction is used. This strategy guarantees the improvement of the average and minimum value in each genetic algorithm iteration. Thus, the best N/2 individuals from the population are identified and then moved into a mating pool where they are combined by crossover and mutation operations.

Crossover and mutation Within crossover, N/2 parents are randomly mated in pairs to create N/2 new children by combining their genotypic information. The aim of crossover is to produce fitter individuals by exchanging information contained in already good individuals [428]. The single-point crossover operator is selected. It is the most common operator and provides good results [250]. Given the binary codification of the individuals, a crossover point is randomly chosen, with a fixed probability Pc, at which the information is exchanged. Based on this point, the strings of both parents are split into two segments each. The first offspring takes the first section from the first parent and the last part from the second, whereas the second offspring is formed conversely. The mutation operator introduces some extra variability into the population to enhance the diversity degree. It operates on each of the individuals output by crossover by producing random changes with a very small probability. These changes may in turn result in new individuals with higher fitness scores. Single-point mutation is used in this analysis, that is, a bit from the binary string is chosen and flipped with a mutation probability Pm. Finally, whenever an offspring violates the directed acyclic graph constraint, the operator randomly deletes some arcs to amend cycles.
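Both operators admit compact implementations; the sketch below follows the description above (the experimental values of Pc and Pm from Section 12.4.1 are used as defaults, and the cycle repair of offspring is omitted):

import numpy as np

rng = np.random.default_rng(1)

def single_point_crossover(p1, p2, pc=0.9):
    """Exchange the tails of two vectorized adjacency matrices."""
    if rng.random() < pc:
        cut = rng.integers(1, len(p1))  # randomly chosen crossover point
        return (np.concatenate([p1[:cut], p2[cut:]]),
                np.concatenate([p2[:cut], p1[cut:]]))
    return p1.copy(), p2.copy()

def single_point_mutation(ind, pm=0.01):
    """Flip one randomly chosen bit with probability pm."""
    child = ind.copy()
    if rng.random() < pm:
        pos = rng.integers(len(child))
        child[pos] = 1 - child[pos]
    return child

p1 = rng.integers(0, 2, size=16)  # a vectorized 4x4 adjacency matrix
p2 = rng.integers(0, 2, size=16)
c1, c2 = single_point_crossover(p1, p2)
c1 = single_point_mutation(c1)    # offspring may still need cycle repair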

Merging procedure The last stage of the reproductive cycle is the generation of the new population. Again, an elitist strategy is chosen to yield the new population of individuals by combining the best individuals of the previous and new generations. The main advantage of this strategy is that it always preserves the best subset of individuals in every generation.

12.3.4 Stopping criteria

The search is halted when a set of conditions, the stopping criteria, are satisfied. Different criteria for stopping a genetic algorithm have been developed in the literature: after a specific number of generations or a maximum number of evaluations, when there is no improvement in the objective function, or when the objective function reaches a specific value, among others. Here, a maximum number of generations or no improvement over a given number of generations constituted the stopping criteria. The fittest individual in the final population, i.e., the one with the lowest fitness score, is considered to be the solution to the optimization problem.

12.4 Resulting Gaussian Bayesian networks

Different bibliometric indices are used as feature variables in the Gaussian Bayesian network learning process. These measures (documents, citations, h-index, g-index, hg-index, a-index, m-index, q2-index, hr-index, hi-index, hc-index, and c-index) are associated with Spanish full professors of computer science who were active as of January 1st, 2010. This small subset of well-known indices is selected to provide a practical example of the proposed methodology. The selected indices are very popular bibliometric indicators for assessing individual scientists and have an influence on bibliometric and scientometric research. Despite this, they are not the best indices for that purpose, since most of them are size-dependent indicators which sometimes behave in a counterintuitive manner [309, 468]. In this way, there are better bibliometric indicators, like highly cited publications indicators, percentile-based indicators, field-normalized indicators, journal-based indicators or collaboration indicators, among others, to evaluate the research performance of scientists. The objective is not to argue in favor of the selected indicators as good ones to assess scientists; they are selected as variables of the Gaussian Bayesian network models. Given these variables, the goal is to simultaneously predict a set of response variables from a set of predictive variables where the role of each variable (predictor or response) is unknown beforehand.

Table 12.1: Statistical figures of all bibliometric indices computed from the publications dataset of Spanish full professors of computer science (years 1973 to 2010).

Variables         Min    1st-quartile   Mean    Median   3rd-quartile   Max
X1 (documents)    1.0    11.3           34.8    21.5     27.2           178.0
X2 (citations)    1.0    17.0           143.0   50.5     145.1          4,570.0
X3 (h-index)      1.0    2.0            6.0     4.0      4.8            37.0
X4 (g-index)      1.0    4.0            11.0    7.0      8.8            66.0
X5 (hg-index)     1.0    2.8            8.4     5.3      6.5            49.4
X6 (a-index)      1.0    5.5            17.8    10.0     14.1           97.5
X7 (m-index)      1.0    5.0            14.0    8.0      11.5           73.0
X8 (q2-index)     1.0    3.2            9.4     5.5      7.2            52.0
X9 (hr-index)     1.7    3.0            7.0     9.4      5.8            38.0
X10 (hi-index)    0.1    0.6            1.9     1.1      1.5            12.7
X11 (hc-index)    0.0    0.0            2.0     1.0      1.3            10.0
X12 (c-index)     0.3    4.4            19.7    9.2      26.4           908.4

Table 12.2: Correlation coefficients among bibliometric indices computed from the publications dataset of Spanish full professors.

Vars  X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12
X1    1.00  0.65  0.78  0.74  0.77  0.52  0.45  0.68  0.78  0.65  0.69  0.61
X2    -     1.00  0.87  0.88  0.88  0.77  0.70  0.86  0.87  0.83  0.81  0.97
X3    -     -     1.00  0.96  0.99  0.78  0.71  0.94  0.99  0.93  0.89  0.82
X4    -     -     -     1.00  0.99  0.86  0.78  0.96  0.96  0.90  0.91  0.83
X5    -     -     -     -     1.00  0.83  0.75  0.96  0.99  0.92  0.91  0.83
X6    -     -     -     -     -     1.00  0.94  0.90  0.78  0.74  0.84  0.74
X7    -     -     -     -     -     -     1.00  0.88  0.71  0.67  0.78  0.65
X8    -     -     -     -     -     -     -     1.00  0.94  0.88  0.91  0.81
X9    -     -     -     -     -     -     -     -     1.00  0.93  0.89  0.81
X10   -     -     -     -     -     -     -     -     -     1.00  0.80  0.82
X11   -     -     -     -     -     -     -     -     -     -     1.00  0.75
X12   -     -     -     -     -     -     -     -     -     -     -     1.00

X1 (documents), X2 (citations), X3 (h-index), X4 (g-index), X5 (hg-index), X6 (a-index), X7 (m-index), X8 (q2-index), X9 (hr-index), X10 (hi-index), X11 (hc-index), X12 (c-index)

In order to give an overview of the index values, Table 12.1 shows a statistical summary for each index. Note that the average academic publishes 34.8 documents and receives 143.0 citations. Also noticeable is that the average h-index value is 6. Interestingly, citation values range from 1 to 4,570 citations during the period, i.e., there is at least one academic who has been cited only once, whereas other academics have received a much higher number of citations. The mean citation value (143.0) lies to the right of the median value (50.5), which means that the distribution is skewed to the right. This effect is apparent for almost all the indices (except the hr-index). The explanation for this shift is that very few academics excel in terms of productivity, visibility, individuality, innovation and contemporariness. Table 12.2 shows the correlation coefficients among bibliometric indices. Most of the selected bibliometric indices are variants, extensions or generalizations of the well-known h-index. This implies that these indices are usually correlated with each other, which is corroborated by the mean correlation coefficient (ρ=0.82) among the selected indices. Note that documents is the index with the lowest correlations, that is, there are weak correlations between documents and m-index (ρ=0.45), documents and a-index (ρ=0.52), and documents and c-index (ρ=0.61), among others. In contrast, the hg-index has the highest correlations, that is, there are strong correlations between hg-index and h-index (ρ=0.99), hg-index and g-index

(ρ=0.99), hg-index and hr-index (ρ=0.99), among others.

12.4.1 Experimental setup

The application of a genetic algorithm means setting several parameters, like the population size, the probabilities of crossover and mutation or the number of allowed iterations. Its efficiency is thus dependent on the chosen parameters. Although some researchers calculate ad hoc settings for their specific problem, there are general suggestions which work consistently well for function optimization [117, 188]. In this chapter, Grefenstette's recommendations are followed with minor changes. According to Grefenstette [188], the population size should be 30 individuals. Here, the number of individuals is reduced to 20 because the fitness function is time-consuming to compute. Even so, the population size is enough to uncover the subset of bibliometric indices with the highest predictive power. The crossover probability is 0.9, whereas the mutation probability is set at 0.01. A single-point crossover operator and a single-point mutation operator are used in this algorithm. The algorithm halts after reaching 40 generations or when there is no improvement after five consecutive generations. All different structures with p predictive variables and r response variables are explored. Since the dataset includes 12 bibliometric indices, there are a total of 4,096 (= 2^12) different splittings. Once the role of predictive and response nodes has been fixed, the genetic algorithm searches for the optimal network structure which minimizes the distance between the real and predicted values. The average Mahalanobis distance is used as the fitness function of each individual. Finally, in order to have a fair performance estimation, k-fold cross-validation is chosen as the procedure for estimating the predictive accuracy.
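The outer loop of the experiment is then a simple enumeration of subsets; the sketch below counts the 4,096 splittings (the genetic_algorithm call is a hypothetical placeholder for the search described above):

from itertools import combinations

indices = ["documents", "citations", "h", "g", "hg", "a",
           "m", "q2", "hr", "hi", "hc", "c"]

splittings = []
for p in range(len(indices) + 1):
    for predictors in combinations(indices, p):
        responses = [v for v in indices if v not in predictors]
        splittings.append((list(predictors), responses))

print(len(splittings))  # 4096 = 2^12 possible predictor/response splits
# for predictors, responses in splittings:
#     best_structure = genetic_algorithm(data, predictors, responses)  # hypothetical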

12.4.2 Optimal Gaussian Bayesian networks

The result of the genetic algorithm is a set of 11 optimal Gaussian Bayesian networks. Each model is associated with a different cardinality of predictive and response variables, that is, 1 predictive variable and 11 response variables, 2 predictive variables and 10 response variables, and so on. To assess the improvement produced by each optimal network, two baseline values, general and specific, are first defined for comparison. Both of these baselines correspond to the Mahalanobis distances between predicted and real values of the response variables when naive network structures are considered, i.e., networks without arcs between nodes. The general baseline corresponds to the average Mahalanobis distance of all naive structures with the same number of response variables, regardless of which variables they are. Conversely, the specific baseline accounts for the Mahalanobis distance of a naive structure using the same response variables as the network used for comparison. The rationale behind this is to confirm that the optimal Gaussian Bayesian networks are better than both general and specific baselines. Table 12.3 shows the list of predictive variables of the 11 optimal Gaussian Bayesian models, the general and specific baseline values and the fitness score for each of them. Note that the fitness scores improve on the baseline values in all cases. Taking the model with two predictors as an example, it is observed that the predictive variables are X3 (h-index) and X4 (g-index). The set of response variables are X1 (documents), X2 (citations), X5 (hg-index), X6 (a-index), X7 (m-index), X8 (q2-index), X9 (hr-index), X10 (hi-index), X11 (hc-index) and X12 (c-index). Its associated fitness (2.680) is lower than both baselines (3.013 and 2.931), that is, the predictions of the identified network clearly outperform a naive model with the same number of response variables (3.013) and with the same splitting of variables (2.931). It is also of interest to compare the performance of the optimal Gaussian Bayesian networks with each other. To do so, the fitness of the models is computed given the data. Classical goodness-of-fit criteria rank complex models higher than sparse ones. Nonetheless, a model should only have enough parameters to give an adequate representation of the association structure underlying the data. A criterion accounting for this tradeoff between model complexity and goodness-of-fit is the Bayesian information criterion, which provides a quantitative measure for model selection. The model with the highest Bayesian information criterion value is selected as the best induced model. Table 12.4 collects the Bayesian information criterion score of each optimal Gaussian Bayesian network. The highest value of all (-6574.755) is achieved by the network with four predictive variables (X2 (citations), X4 (g-index), X8 (q2-index) and X9 (hr-index)). Section 12.4.3 details its full structure, conditional dependencies and predictive performance.

12.4.3 Best induced Gaussian Bayesian network

The network which performs best within its class (networks with four predictive variables) and also across the board is composed of X2 (citations), X4 (g-index), X8 (q2-index) and X9 (hr-index) as predictive variables.

Network structure Figure 12.2 illustrates the network structure using blue circles for the predictive variables and red circles for the response variables. Blue arcs correspond to arcs between predictive variables, whereas red arcs correspond to arcs between response variables. Finally, arcs from predictive to response variables are in black. A set of centrality measures is examined in order to analyze the graphical characteristics of the network in Figure 12.2. Centrality degree is defined as the number of arcs incident upon a node. Degree is often interpreted in terms of the opportunity for influencing any other node. Two separate measures of centrality degree are defined: indegree and outdegree. A node's indegree is the number of arcs directed to the node, and outdegree is

Table 12.3: Predictive variables for the identified optimal Gaussian Bayesian networks: general and specific baselines and fitness value for the reported model.

No. predictors   Optimal predictive variables within each network    General baseline   Specific baseline   Best fitness
1                X9                                                  3.202              3.145               3.025
2                X3, X4                                              3.013              2.931               2.680
3                X3, X4, X11                                         2.817              2.812               2.287
4                X2, X4, X8, X9                                      2.612              2.605               1.941
5                X2, X5, X6, X9, X12                                 2.398              2.335               1.580
6                X1, X2, X5, X8, X9, X12                             2.173              2.120               1.228
7                X2, X4, X6, X8, X9, X10, X12                        1.934              1.905               0.835
8                X1, X2, X3, X5, X6, X7, X10, X12                    1.675              1.642               0.381
9                X1, X2, X3, X5, X6, X7, X10, X11, X12               1.389              1.377               0.248
10               X1, X2, X3, X5, X6, X7, X8, X10, X11, X12           1.058              1.111               0.111
11               X1, X2, X3, X4, X5, X6, X7, X8, X10, X11, X12       0.649              0.681               0.006

X1 (documents), X2 (citations), X3 (h-index), X4 (g-index), X5 (hg-index), X6 (a-index), X7 (m-index), X8 (q2-index), X9 (hr-index), X10 (hi-index), X11 (hc-index), X12 (c-index)

Table 12.4: Predictive variables for the identified optimal Gaussian Bayesian networks: Bayesian information criterion values for the reported model.

No. predictors   Predictive variables                                BIC value
1                X9                                                  -7783.949
2                X3, X4                                              -7028.680
3                X3, X4, X11                                         -7020.569
4                X2, X4, X8, X9                                      -6574.755
5                X2, X5, X6, X9, X12                                 -7106.992
6                X1, X2, X5, X8, X9, X12                             -6816.705
7                X2, X4, X6, X8, X9, X10, X12                        -6871.926
8                X1, X2, X3, X5, X6, X7, X10, X12                    -6933.521
9                X1, X2, X3, X5, X6, X7, X10, X11, X12               -6655.728
10               X1, X2, X3, X5, X6, X7, X8, X10, X11, X12           -7232.618
11               X1, X2, X3, X4, X5, X6, X7, X8, X10, X11, X12       -8037.581

X1 (documents), X2 (citations), X3 (h-index), X4 (g-index), X5 (hg-index), X6 (a-index), X7 (m-index), X8 (q2-index), X9 (hr-index), X10 (hi-index), X11 (hc-index), X12 (c-index)

the number of arcs that the node directs to others. Therefore, indegree is the number of parents, whereas outdegree is the number of children. The centrality degree (CD) values are: CD(X1)=5, CD(X2)=5, CD(X3)=7, CD(X4)=10, CD(X5)=8, CD(X6)=6, CD(X7)=7, CD(X8)=7, CD(X9)=8, CD(X10)=6, CD(X11)=7 and CD(X12)=6. Note that the g-index (X4) has a great influence on the other indices: it presents the highest centrality degree (2 + 8 = 10). Also worth mentioning is that indices such as the hr-index (X9) and the hi-index (X10) show opposite structures, that is, the hr-index (X9) has no parents but eight children, whereas the hi-index (X10) depends on six parents but has no children. At the other end of the scale, documents (X1) and citations (X2) show the lowest centrality degree, with a value of 5, suggesting that they have very little influence on the other indices. Focusing on the potential relationship between strong correlation coefficients and the network structure, we observe that strong correlations are not a problem for the presented methodology. A high correlation coefficient does not imply an arc between the correlated variables in the Gaussian Bayesian network. In this way, Table 12.2 shows strong correlations between hg-index and hr-index (ρ=0.99) and between h-index and hi-index (ρ=0.93) which do not appear as arcs in Figure 12.2. In contrast, the weak correlation between documents and m-index (ρ=0.45) does appear as an arc in Figure 12.2. The presence of arcs does not depend on potentially strong correlations; it depends on the genetic algorithm, which looks for the optimal structure that minimizes the distance between real and predicted response variable values.

Dependencies among indices Based on the definitions of the indices, it is clear that some of them can be expressed according to the values of other indices. For example, the hg-index could be expressed in terms of the h- and g-index values. Also, the q2-index can be defined according to the h- and m-index values. This is corroborated by the dependencies in the network.

The h-index (X3) and the g-index (X4) are parent nodes of the hg-index (X5) in the network structure of Figure 12.2, and the h-index (X3) and the m-index (X7) are children of the q2-index

Figure 12.2: Best GBN structure. Each node represents: X1 (documents), X2 (citations), X3 (h- 2 index), X4 (g-index), X5 (hg-index), X6 (a-index), X7 (m-index), X8 (q -index), X9 (hr-index), X10 (hi-index), X11 (hc-index), X12 (c-index). Blue nodes correspond to predictive variables (XP ), whereas red nodes correspond to response variables (XR).

(X8). Besides unveiling dependencies already present in the index definitions, the best Gaussian Bayesian network discovers dependencies which are not directly derived from, but are related to, the index definitions. Taking the arc from hr-index to h-index (X9 → X3) as an example, note that the information about the hr-index influences the density function of the h-index, as expected given that the hr-index is an extension of the h-index.

The arc between a-index and m-index (X6 → X7) in Figure 12.2 is an example of a dependency that is expected from the outset. Remember that the a-index represents the average number of citations received by the articles included in the h-core, whereas the m-index represents the median number of citations received by the articles in the same h-core. Therefore, both refer to citations of articles in the h-core.

Other dependencies, such as the arcs between documents and h-index (X1 → X3), citations and h-index (X2 → X3), g-index and citations (X4 → X2), or g-index and h-index (X4 → X3), are not immediately apparent from the definitions. Nevertheless, they have been reported to show high correlation values [57, 101, 396]. There are other network dependencies, e.g., g-index and hc-index (X4 → X11), and hc-index and hi-index (X11 → X10), which cannot be linked to the individual definitions. However, previous works have already pointed out similar correlations [152]. Conversely, the network included some unexpected arcs. In this way, the Gaussian

Bayesian network reported probabilistic dependencies between a-index and hi-index (X6 → X10), m-index and hc-index (X7 → X11), and q2-index and c-index (X8 → X12).

Conditional independencies among indices Gaussian Bayesian networks are a powerful tool not only for capturing dependencies but also for identifying conditional independencies among variables. Here, Markov network properties are used with the aim of discovering such independencies among the nodes of the best induced network. The local Markov property states that any node Xi is conditionally independent of its non-descendants given the values of its parents; it can be expressed as I(Xi, non-descendants(Xi) | Π(Xi)). With respect to the whole network, the global Markov property states that any node Xi is conditionally independent of any other node given the values of its Markov blanket (MB). The Markov blanket of a node includes its parents, its children, and its children's other parents. Thus, I(Xi, non-MB(Xi) | MB(Xi)). Table 12.5 lists conditional independencies between the bibliometric indices of the network in Figure 12.2, derived from the local and global Markov properties. This network identifies conditional independencies in accordance with the index definitions, as well as other conditional independencies that are hidden in such definitions. Taking the hg-index as an example, it is observed that, given citations, h-index, g-index and q2-index, the hg-index is independent of documents and hr-index. This suggests that when the values of citations, h-index, g-index and q2-index are known, the value of documents provides no information on the value of the hg-index. Focusing on the q2-index, note that it is conditionally independent of the hi-index given its MB, which includes the h-index and m-index, among others. Some other reasonable conditional independence relationships are also listed in Table 12.5. However, there are other conditional independencies which are not obvious. According to the definition of the a-index, it would be reasonable to expect it to depend on documents and citations. Nevertheless, the model shows that the a-index is conditionally independent of documents and citations given h-index, g-index, hg-index and hr-index. Similarly, one might expect a dependency between citations and documents, but the model suggests that the relationship is one of conditional independence given the g-index. Remember that the conditional independencies between indices encoded in the Gaussian Bayesian network indicate a probabilistic, not a causal, relationship.
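Reading a Markov blanket off a learned structure is mechanical; a minimal sketch over an adjacency matrix (toy graph of ours, not the network of Figure 12.2):

import numpy as np

def markov_blanket(adj, i):
    """Parents, children and children's other parents of node i in a DAG."""
    parents = set(np.flatnonzero(adj[:, i]))
    children = set(np.flatnonzero(adj[i, :]))
    spouses = set()
    for ch in children:
        spouses |= set(np.flatnonzero(adj[:, ch]))
    return (parents | children | spouses) - {i}

# Toy DAG: 0 -> 2 <- 1, 2 -> 3
adj = np.zeros((4, 4), dtype=int)
adj[0, 2] = adj[1, 2] = adj[2, 3] = 1
print(markov_blanket(adj, 0))  # {1, 2}: its child and the child's other parent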

Table 12.5: Conditional independencies among bibliometric indices derived using local and global Markov properties in the Gaussian Bayesian network of Figure 12.2.

Index        is conditionally independent of                        given
documents    citations, q2-index                                    g-index, hr-index
documents    hc-index                                               citations, h-index, g-index, hg-index, a-index, m-index, q2-index, hr-index, hi-index, c-index
citations    documents, q2-index, hr-index                          g-index
citations    a-index, hi-index                                      documents, h-index, g-index, hg-index, m-index, q2-index, hr-index, hc-index, c-index
h-index      m-index, hi-index, hc-index, c-index                   documents, citations, g-index, hg-index, a-index, q2-index, hr-index
g-index      hi-index                                               documents, citations, h-index, hg-index, a-index, m-index, q2-index, hr-index, hc-index, c-index
hg-index     documents, hr-index                                    citations, h-index, g-index, q2-index
a-index      documents, citations, q2-index, hc-index               h-index, g-index, hg-index, hr-index
a-index      citations, hc-index                                    documents, h-index, g-index, hg-index, m-index, q2-index, hr-index, hi-index, c-index
m-index      citations, h-index, hr-index, hc-index                 documents, g-index, hg-index, a-index, q2-index
m-index      h-index                                                documents, citations, g-index, hg-index, a-index, q2-index, hr-index, hi-index, hc-index, c-index
q2-index     hi-index                                               documents, citations, h-index, g-index, hg-index, a-index, m-index, hr-index, hc-index, c-index
hi-index     citations, h-index, g-index, q2-index, hc-index        documents, hg-index, a-index, m-index, hr-index, c-index
hc-index     documents, h-index, a-index, m-index                   citations, g-index, hg-index, q2-index, hr-index
hc-index     documents, h-index, a-index, hi-index                  citations, g-index, hg-index, m-index, q2-index, hr-index, c-index
c-index      documents, h-index, hg-index, a-index                  citations, g-index, m-index, q2-index, hr-index, hc-index
c-index      h-index                                                documents, citations, g-index, hg-index, a-index, m-index, q2-index, hr-index, hi-index, hc-index

Predicting bibliometric indices Now, the probabilistic component of the network in Figure 12.2 is inspected, examining the effect that knowing the values of some variables has on the others. In doing so, evidence propagation is used to compute the probability distribution of the response variables given the available evidence. Using the values of the predictive variables, that is, citations (X2), g-index (X4), q2-index (X8) and hr-index (X9), the Gaussian Bayesian network is able to predict the (expected) values of the response variables: documents (X1), h-index (X3), hg-index (X5), a-index (X6), m-index (X7), hi-index (X10), hc-index (X11), and c-index (X12). Table 12.6 presents three inference examples. It shows the evidence values for the predictive variables and the predictions made by the network in Figure 12.2 for the response variables. These predictions are the mean vector (µ^{Y|X=x}) of the conditional distribution of

Table 12.6: Evidence propagation results using the Gaussian Bayesian network of Figure 12.2 as the inference tool.

Variables     Example 1          Example 2          Example 3

Predictive    Evidence           Evidence           Evidence
citations     853.0              163.0              10.0
g-index       28.0               12.0               3.0
q2-index      19.4               10.2               2.4
hr-index      15.9               8.9                2.8

Responses     Predicted  Real    Predicted  Real    Predicted  Real
documents     74.8       73.0    44.4       43.0    14.1       13.0
h-index       14.9       15.0    8.0        8.0     1.9        2.0
hg-index      20.4       20.5    9.8        9.8     2.4        2.4
a-index       43.5       42.9    14.4       14.0    4.1        3.0
m-index       24.4       25.0    12.6       13.0    3.6        3.0
hi-index      5.1        3.9     2.7        1.6     0.5        0.4
hc-index      4.0        5.0     1.8        1.0     0.4        0.0
c-index       78.9       80.6    15.4       14.8    2.4        2.8

The real values of the response variables are also shown for comparison against the predictions. Three different examples, ranging from high to medium to low values, are set as evidence. In Example 1, the evidence is citations=853, g-index=28, q2-index=19.4 and hr-index=15.9. After setting the evidence, the predicted values of the response indices were computed. Results in Table 12.6 show that the predicted values are very close to the real values. Regarding documents, h-index and hg-index, the predictions are 74.8, 14.9 and 20.4, whereas the real values were 73.0, 15.0 and 20.5, respectively. In Example 2, the values citations=163, g-index=12, q2-index=10.2 and hr-index=8.9 are set as evidence. Predictions are again very close to the actual values. Remarkably, the predicted and real values are equal for h-index and hg-index. Lastly, Example 3 sets citations=10, g-index=3, q2-index=2.4 and hr-index=2.8. Given these values, the differences between real and predicted values are also slight.
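Equation (12.4) is the standard conditioning formula for a multivariate Gaussian: if (Y, X) is jointly Gaussian, then µY|X=x = µY + ΣYX ΣXX^-1 (x - µX). A minimal NumPy sketch of this computation (with made-up illustrative parameters, not those estimated from the professors dataset) is:

import numpy as np

def gaussian_conditional_mean(mu, sigma, idx_x, x):
    # Mean of Y | X = x for a joint Gaussian N(mu, sigma), where idx_x
    # indexes the evidence variables X and the rest are the responses Y.
    idx_y = [i for i in range(len(mu)) if i not in idx_x]
    s_xx = sigma[np.ix_(idx_x, idx_x)]
    s_yx = sigma[np.ix_(idx_y, idx_x)]
    return mu[idx_y] + s_yx @ np.linalg.solve(s_xx, x - mu[idx_x])

# Toy three-variable example (illustrative values only)
mu = np.array([10.0, 5.0, 2.0])
sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 0.5],
                  [1.0, 0.5, 2.0]])
print(gaussian_conditional_mean(mu, sigma, idx_x=[0], x=np.array([12.0])))  # [6.  2.5]

The predicted values in Table 12.6 are conditional means of this form, with the four predictive indices playing the role of the evidence block X.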

12.5 Discussion and conclusions

Bibliometric indices are an increasingly important topic for the scientific community nowadays. Many bibliometric indices have been developed in order to capture previously uncovered aspects. In this context, some researchers have recently turned their attention to the predictive power of bibliometric indices in many situations. As a result, the scientific community now faces the challenge of selecting which indices from this pool have the highest predictive power.

A review of the literature reveals some recent works [461, 462] which analyzed how good models based on bibliometric indices are at predicting the rankings of applicants to academic positions at universities. Like this work, they learned different models to assess the predictive power of bibliometric indices. Their rank-ordered logistic regression models were composed of indicators related to the quantity and impact of scientific production, the impact of the publication source, the prestige of the affiliation institution, and collaboration. Unlike this multi-output regression approach, they predicted only a single response variable. Also, they did not face the problem of selecting the role (predictor or response) of each variable. Finally, their results suggested that the models could predict the result of peer review with a reasonable degree of accuracy.

Unlike the above studies, a novel method is presented to uncover a relevant core subset of indicators for prediction purposes given a set of bibliometric indices. The selected bibliometric indices are popular indicators for evaluating individual scientists and also have an influence on the scientific community. Despite this, most of them are size-dependent indicators which sometimes behave in a counterintuitive way because of the inconsistencies associated with the mechanism used to aggregate publication and citation statistics into a single number. This work does not argue in favor of the selected indicators as the best bibliometric indices to evaluate research performance; they are selected as an example of Gaussian Bayesian network variables in order to give a practical illustration of the proposed methodology.

Given a dataset of bibliometric indices, the task of selecting which subset of bibliometric indices best corresponds to predictive variables and which group can be considered as response variables is tackled. The goal is to simultaneously predict a set of response variables from a set of predictive variables by means of multi-output regression. This results in a novel multi-output regression problem where the role of each variable is unknown beforehand. Given a specific split into predictive and response variables, a Gaussian Bayesian network structure is learned to identify relationships among bibliometric indices and for prediction purposes. Gaussian Bayesian network structure learning is based on a genetic algorithm which optimizes the distance between real and predicted response variable values. The best network is the one that minimizes the Mahalanobis distance between real and predicted values and has the highest Bayesian information criterion value among the 11 optimal models with different cardinality.
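To make the fitness computation concrete, a minimal sketch of the Mahalanobis distance between real and predicted response vectors follows (the covariance matrix is assumed to be estimated from the response data; the function is illustrative, not the thesis implementation):

import numpy as np

def mahalanobis(y_real, y_pred, cov):
    # Mahalanobis distance between real and predicted response vectors,
    # weighting prediction errors by the inverse response covariance.
    d = y_real - y_pred
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# With an identity covariance the distance reduces to the Euclidean distance
print(mahalanobis(np.array([73.0, 15.0]), np.array([74.8, 14.9]), np.eye(2)))

Networks with a lower distance fit the response block better; the Bayesian information criterion is then used to trade fit against structural complexity when choosing among the candidate models of different cardinality.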
Although an exhaustive analysis is used to evaluate all possible configurations of predictive and response variables in order to identify the relevant predictive core set of bibliometric indices, a genetic algorithm could also be used to explore the search space of different configurations of predictive and response variables.

In this specific problem with full professors, the findings provide information on which subset of bibliometric indices has the highest predictive power, i.e., is most relevant for prediction purposes. Note that the bibliometric core is composed of citations, g-index, q2-index and hr-index. This means that when the values of the above bibliometric indices are known, the values of the other eight indices can be predicted with high accuracy. Analyzing the structure of the network, it is observed that it matches many expected dependencies among indices. In addition, the model is able to discover new knowledge when combined with the index definitions and sheds light on unreported conditional (in)dependencies between the indices.

Finally, the proposed methodology does not require any specific values of predictive and response variables. Also, it is not affected by specifications such as the number of observations or variables of the dataset. In this way, the methodology can be applied to any dataset. Obviously, the methodology results usually depend on the selected dataset. Despite this, similar relationships between bibliometric indices could also be learned using different bibliometric datasets. In the future, alternative models will be learned using different Bayesian network induction algorithms. It would also be worthwhile to extend the domain of the data collection to cover all Spanish researchers. These new models could also incorporate other bibliometric indices in order to cover a larger part of the bibliometric domain.

Part V

Conclusions


Chapter 13 Conclusions and Future work

13.1 Summary of contributions

The main and specific conclusions drawn from this thesis have been presented throughout each chapter. The most relevant ones are summarized in this chapter, emphasizing the achievements reached. These contributions have been divided into three parts:

Predicting bibliometric indices: Based on the motivation that publishers of scientific journals face the tough task of selecting, from a pool of articles, the high-quality articles that will attract as many citations as possible, Chapter 5 presents predictive models that forecast the citation count of journal articles within the first few years after publication. Results show that the logistic regression and naive Bayes classification methods output high average scores in the different journal sections and across the time horizon. These models also found that the appearance of certain words in the paper abstracts can influence the number of citations received. These selected tokens could be used as a point of reference to identify hot topics. Finally, models capable of predicting the citations that an article will receive in the first few years after publication can be a useful tool for publishers' assessment processes, paving the way for new assessment systems.

The scientific community now focuses on the popular h-index, since it is used by funding agencies and promotion committees to evaluate the importance of research. In this context, Chapter 6 and Chapter 11 present cost-sensitive models to predict the annual increase of the h-index. Specifically, Chapter 6 forecasts the annual increase for journals using cost-sensitive selective naive Bayes models that use the misclassification costs directly in the learning algorithms. In contrast, Chapter 11 forecasts the annual increase for researchers using cost-sensitive naive Bayes models that take into account the expected cost of instance predictions at classification time. Results show that the proposed models outperform many cost-(in)sensitive models, so this learning approach could be used in different probabilistic classification approaches.
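At classification time, the rule summarized above amounts to choosing the class with the lowest expected misclassification cost rather than the highest posterior probability. A schematic sketch follows (the posterior vector and cost matrix below are placeholders, not values from Chapter 6 or Chapter 11):

import numpy as np

def min_expected_cost_class(posteriors, cost):
    # posteriors[k] = P(C = k | x); cost[k, j] = cost of predicting j when k is true.
    # Returns the class minimizing the expected misclassification cost.
    expected = posteriors @ cost
    return int(np.argmin(expected))

posteriors = np.array([0.2, 0.5, 0.3])            # e.g., h-index increase: low/medium/high
cost = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])                # larger penalty for larger ordinal errors
print(min_expected_cost_class(posteriors, cost))  # -> 1

With a 0/1 cost matrix this rule reduces to picking the most probable class, so the cost-sensitive models generalize the standard naive Bayes decision rule.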


Discovering new associations among indices: The vast number of existing bibliometric indices poses the challenge of exploiting the relationships among them. Based on this challenge, Chapter 7 and Chapter 12 analyze the current relationships and discover new conditional (in)dependencies between bibliometric indices using Bayesian networks. Specifically, Chapter 7 analyzes bibliometric indices of computer science and artificial intelligence journals, whereas Chapter 12 analyzes bibliometric indices of Spanish full professors associated with the computer science area. Using the proposed models, the scientific community can measure how indices influence others in probabilistic terms and perform evidence propagation and abductive inference for answering bibliometric questions.

Besides the above goals, Chapter 12 presents a new method to uncover which bibliometric indices have a higher predictive power. It tackles the task of selecting which subset of bibliometric indices best corresponds to predictive variables and which group can be considered as response variables. This method solves a multi-output regression problem where the role of each variable is unknown beforehand. Gaussian Bayesian networks and genetic algorithms are used to obtain the predictive core set of bibliometric indices. Results show that the optimal induced Gaussian Bayesian networks corroborate previous relationships between several indices but also suggest new, previously unreported interactions. Also, results show that a set of 12 bibliometric indices can be accurately predicted using only a smaller predictive core subset composed of four indices. These four indices are very useful for prediction purposes; that is, when their values are known, the rest of the indices can be accurately predicted.

Exploring Spanish computer science research: Based on the publish-or-perish culture, researchers' behavior has been affected in the sense that it is not only important what researchers write, but also how often, where and with whom they write. In this context, Chapter 8, Chapter 9 and Chapter 10 analyze how this culture affects Spanish computer science research.

A scientometric analysis of Spanish computer science research is carried out in Chapter 8. Results show that Spanish research productivity and visibility have increased in recent years, achieving increments of 347% and 1,053%, respectively. Spanish academics usually publish more proceedings papers than journal articles, despite the low number of citations received by proceedings papers. They also publish more documents in high-quality journals than in previous years. Regarding universities, some universities such as Universidad de Granada, Universidad Politécnica de Catalunya, Universidad Politécnica de Valencia, Universidad Politécnica de Madrid and Universidad de Málaga usually rank in the top positions for different parameters. Finally, by subject area, computer languages and systems academics publish the highest number of Spanish computer science documents, whereas computer science and artificial intelligence academics excel in terms of citations per document, documents per academic, citations per academic and percentage of documents published in high-quality journals.

The productivity, visibility, quality, prestige and international collaboration of Spanish computer science research are also analyzed using a cluster analysis methodology in Chapter 9. Results of the proposed methodology show that Spanish public universities fall into four different clusters, whereas academic staff fall into six different clusters. For example, universities like Universidad de Granada, Universidad de Jaén, Universidad Pablo de Olavide de Sevilla and Universidad Pública de Navarra score highest for productivity, visibility and quality. Universities like Universidad del País Vasco and Universidad Politécnica de Madrid excel in terms of prestige, whereas universities like Universidad Politécnica de Valencia stand out on international collaboration. The resulting clusters could have potential implications for research policy, proposing collaborations and alliances among universities, supporting institutions in their strategic planning processes, and verifying the effectiveness of research policies, among others.

Finally, Chapter 10 studies the number of documents and citations by number of authors. These measures are also analyzed when documents are written under different types of collaboration (international, national and institutional), when documents are published as different document types (journal article and conference paper), when documents are published in different computer science subdisciplines (artificial intelligence, cybernetics, hardware and architecture, information systems, interdisciplinary applications, software engineering, and theory and methods), and, finally, when documents are published by journals in different impact factor quartiles (first-quartile, second-quartile, third-quartile and fourth-quartile journals). Results did not find a positive association between author set cardinality and citation impact.

13.2 List of publications

The publications derived from this research [218, 219, 220, 221, 222, 223, 224, 225, 226] are listed below:

- Ibáñez, A., Armañanzas, R., Bielza, C. and Larrañaga, P. (2015). Genetic algorithms and Gaussian Bayesian networks to uncover the predictive core set of bibliometric indices. Journal of the Association for Information Science and Technology, accepted. Current impact factor: 2.230. Ranking: Q1 (17/135). Category: Computer Science, Information Systems.

- Ibáñez, A., Bielza, C. and Larrañaga, P. (2014). Cost-sensitive selective naive Bayes classifiers for predicting the increase of the h-index for scientific journals. Neurocomputing, 135:42-52. Current impact factor: 2.005. Ranking: Q1 (28/121). Category: Computer Science, Artificial Intelligence. http://dx.doi.org/10.1016/j.neucom.2013.08.042

- Ibáñez, A., Larrañaga, P. and Bielza, C. (2013). Cluster methods for assessing research performance: exploring Spanish computer science. Scientometrics, 97(3):571-600. Current impact factor: 2.274. Ranking: Q1 (20/102). Category: Computer Science, Interdisciplinary Applications. http://dx.doi.org/10.1007/s11192-013-0985-9

- Ibáñez, A., Bielza, C. and Larrañaga, P. (2013). Relationship among research collaboration, number of documents and number of citations. A case study in Spanish computer science production in 2000-2009. Scientometrics, 95(2):689-716. Current impact factor: 2.274. Ranking: Q1 (20/102). Category: Computer Science, Interdisciplinary Applications. http://dx.doi.org/10.1007/s11192-012-0883-6

- Ibáñez, A., Bielza, C. and Larrañaga, P. (2013). Analysis of scientific activity in Spanish public universities in the area of computer science. Revista Española de Documentación Científica, 36(1):e002. Current impact factor: 0.717. Ranking: Q2 (40/83). Category: Information Science and Library Science. http://dx.doi.org/10.3989/redc.2013.1.912

- Ibáñez, A., Larrañaga, P. and Bielza, C. (2011). Predicting the h-index with cost-sensitive naive Bayes. Proceedings of the 11th International Conference on Intelligent Systems Design and Applications, 599-604. http://dx.doi.org/10.1109/ISDA.2011.6121721

- Ibáñez, A., Bielza, C. and Larrañaga, P. (2011). Productividad y Visibilidad Científica de los Profesores Funcionarios de las Universidades Públicas Españolas en el Área de Tecnologías Informáticas. Fundación General de la UPM, ISBN: 978-84-15302-05-6. http://oa.upm.es/9407/

- Ibáñez, A., Larrañaga, P. and Bielza, C. (2011). Using Bayesian networks to discover relationships between bibliometric indices. A case study of computer science and artificial intelligence journals. Scientometrics, 89(2):523-551. Current impact factor: 2.274. Ranking: Q1 (20/102). Category: Computer Science, Interdisciplinary Applications. http://dx.doi.org/10.1007/s11192-011-0486-7

- Ibáñez, A., Larrañaga, P. and Bielza, C. (2009). Predicting citation count of Bioinformatics papers within four years of publication. Bioinformatics, 25(24):3303-3309. Current impact factor: 4.621. Ranking: Q1 (4/52). Category: Mathematical and Computational Biology. http://dx.doi.org/10.1093/bioinformatics/btp585

13.3 Future work

Several interesting problems related to this research are still open for future investigation. This section summarizes and emphasizes the most relevant future lines and open issues that have already been enumerated in the specific conclusion section of each chapter.

Regarding the prediction of bibliometric indices, this dissertation proposes predictive models that forecast the citation count of journal articles within the first few years after publication using tokens found in the abstracts of papers. The target will be to build new predictive models that incorporate other paper-based features (title, keywords, conclusions, etc.), new author-based features (h-index, number of papers, number of citations, etc.) and new journal-based features (impact factor, immediacy index, category, etc.) as predictive variables. These models would be induced using different machine learning methods like regression, regularized regression, or local regression, which model the number of citations as a continuous variable. This research also presents cost-sensitive models to predict the annual increase of the h-index. The aim will be to build new cost-sensitive Bayesian classifiers, like the selective tree-augmented naive Bayes, which could perform better in terms of accuracy and cost. Besides matrices of misclassification costs, other methods which express relative distances between classes could also be considered. Furthermore, other interesting bibliometric indices like collaboration-based measures (percentage of documents published by single authors, percentage of documents published in international collaboration, and so on) could be used as predictive features to forecast the scientific success of journals and researchers.

Regarding the discovery of new associations among bibliometric indices, this thesis analyzes the relationships among well-known measures using Bayesian networks. Besides the proposed models, alternative Bayesian network models will be induced using different learning algorithms based on both constraint-based methods and score+search methods. Other probabilistic graphical models could also be taken into consideration. These new models will incorporate other indices that take time into account, allow for co-authorship, assess the quality of citations, and correct for differences among fields, among others, in order to cover a larger part of the bibliometric domain. Regarding the task of uncovering which bibliometric indices have a higher predictive power, the proposed method exhaustively evaluates all possible configurations of predictive and response variables in order to identify the relevant predictive core set of bibliometric indices. In this context, an optimization algorithm will be used, instead of an exhaustive analysis, for exploring the search space of different configurations of predictive and response variables.

Further updated scientometric analyses will be carried out to characterize Spanish computer science research activity. The target will be to incorporate private universities and non-tenured academics. Also, other aspects (number of patents, number of projects, number of spin-offs, number of different co-authors, proximity among co-authors, among others) will be taken into account.
Regarding the effect of research collaboration, it will be analyzed whether researchers with the best research performance are also the investigators that collaborate most at the international level, and whether the citation counts of papers written by authors with a low number of citations improve through collaboration. Other scientific disciplines will also be analyzed to get a comprehensive overview of Spanish research. Finally, the values of bibliometric indices vary depending on the source consulted (Web of Science, Scopus and Google Scholar). This is a point to be taken into account in future research.

To conclude, it is important to remark that the proposed machine learning methods could be applied to any dataset. In this context, not only does the scientometrics discipline benefit, but the rest of the scientific disciplines could also use the novel techniques proposed in this dissertation.

Bibliography

[1] http://thomsonreuters.com/thomson-reuters-web-of-science/.

[2] http://www.scopus.com/.

[3] http://scholar.google.com/.

[4] S. Abe. Support Vector Machines for Pattern Classification. Springer, 2010.

[5] G. Abramo and C. A. D’Angelo. National-scale research performance assessment at the individual level. Scientometrics, 86(2):347–364, 2011.

[6] G. Abramo, C. A. D’Angelo, and F. Pugini. The measurement of Italian universities’ research productivity by a non parametric-bibliometric methodology. Scientometrics, 76(2):225–244, 2008.

[7] G. Abramo, C.A. D’Angelo, and M. Solazzi. Are researchers that collaborate more at the international level top performers? An investigation on the Italian university system. Journal of Informetrics, 5(1):204–213, 2011.

[8] G. Abramo, C.A. D’Angelo, and M. Solazzi. The relationship between scientists’ re- search performance and the degree of internationalization of their research. Sciento- metrics, 86(3):629–643, 2011.

[9] D. E. Acuna, S. Allesina, and K. P. Kording. Future impact: Predicting scientific success. Nature, 489:201–202, 2012.

[10] R. Adler, J. Ewing, and P. Taylor. Citation statistics. Statistical Sciences, 24(1):1–14, 2009.

[11] L. S. Adriaanse and C. Rensleigh. Web of Science, Scopus and Google Scholar: A content comprehensiveness comparison. The Electronic Library, 31(6):727–744, 2013.

[12] N. Agrait and A. Poves. Report on CNEAI assessment results. National Evaluation Committee of Research Activity, 2009. In Spanish.

[13] D. W. Aha. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36(2):267–287, 1992.


[14] D. W. Aha. Lazy Learning. Springer, 1997.

[15] I. Ajiferuke and D. Wolfram. Citer analysis as a measure of research impact: Library and information science as a case study. Scientometrics, 83(3):623–638, 2010.

[16] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

[17] S. Alonso, F. J. Cabrerizo, E. Herrera-Viedma, and F. Herrera. h-index: A review focused in its variants, computation and standardization for different scientific fields. Journal of Informetrics, 3(4):273–289, 2009.

[18] S. Alonso, F. J. Cabrerizo, E. Herrera-Viedma, and F. Herrera. hg-index: A new index to characterize the scientific output of researchers based on the h- and g-indices. Scientometrics, 82(2):391–400, 2010.

[19] E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2004.

[20] T. R. Anderson, R. K. S. Hankin, and P. D. Killworth. Beyond the Durfee square: Enhancing the h-index to score total publication output. Scientometrics, 76(3):577– 588, 2008.

[21] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley-Interscience, 2003.

[22] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona. An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256, 2013.

[23] D. Archibugi and A. Coco. International partnerships for knowledge in business and academia: A comparison between Europe and the USA. Technovation, 24(7):517–528, 2004.

[24] N. Assimakis and M. Adam. A new author’s productivity index: p-index. Scientomet- rics, 85(2):415–427, 2010.

[25] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.

[26] N. Bakkalbasi, K. Bauer, J. Glover, and L. Wang. Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 3(7):1–8, 2006.

[27] P. Balaram. Scientometrics: A dismal science. Current Science, 95(4):431–432, 2008.

[28] P. Ball. Index aims for fair ranking of scientists. Nature, 436:900, 2005.

[29] G. Bammer. Enhancing research collaborations: Three key management challenges. Research Policy, 37(5):875–887, 2008.

[30] M. Banks. An extension of the Hirsch index: Indexing scientific topics and compounds. Scientometrics, 69(1):161–168, 2006.

[31] J. Bar-Ilan. Which h-index? A comparison of WoS, Scopus and Google Scholar. Sci- entometrics, 74(2):257–271, 2008.

[32] J. Bar-Ilan. Citations to the “Introduction to informetrics” indexed by WoS, Scopus and Google Scholar. Scientometrics, 82(3):495–506, 2010.

[33] J. Bar-Ilan, M. Levene, and A. Lin. Some measures for comparing citation databases. Journal of Informetrics, 1(1):26–34, 2007.

[34] C. Bartneck and J. Hu. The fruits of collaboration in a multidisciplinary field. Scien- tometrics, 85(1):41–52, 2010.

[35] T. Bartol, G. Budimir, D. Dekleva-Smrekar, M. Pusnik, and P. Juznic. Assessment of research fields in Scopus and Web of Science from the viewpoint of national research evaluation in Slovenia. Scientometrics, 98(2):1491–1504, 2014.

[36] P. D. Batista, M. G. Campiteli, O. Kinouchi, and A. S. Martinez. Is it possible to compare researchers with different scientific interests? Scientometrics, 68(1):179–189, 2006.

[37] T. Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418, 1764.

[38] D. Beaver. Reflections on scientific collaboration (and its study): Past, present and future. Scientometrics, 52(3):365–377, 2001.

[39] D. Beaver and R. Rosen. Studies in scientific collaboration. Part I. The professional origins of scientific co-authorship. Scientometrics, 1(1):65–84, 1979.

[40] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[41] S. J. Bensman and L. Leydesdorff. Definition and identification of journals as biblio- graphic and subject entities: Librarianship versus ISI journal citation reports methods and their effect on citation measures. Journal of the American Society for Information Science and Technology, 60(6):1097–1117, 2009.

[42] C. Bergstrom. Eigenfactor: Measuring the value and prestige of scholarly journals. College and Research Libraries News, 68(5):314–316, 2007.

[43] C. Bielza and P. Larra˜naga. Discrete Bayesian network classifiers: A survey. ACM Computing Surveys, 47(1):Article 5, 2014.

[44] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[45] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[46] R. Blanco, I. Inza, and P. Larra˜naga. Learning Bayesian networks in the space of structures by estimation of distribution algorithms. International Journal of Intelligent Systems, 18(2):205–220, 2003.

[47] J. Bollen, M. A. Rodriguez, and H. van de Sompel. Journal status. Scientometrics, 69(3):669–687, 2006.

[48] J. Bollen, H. van de Sompel, A. Hagberg, and R. Chute. A principal component analysis of 39 scientific impact measures. Plos One, 4(6):e6022, 2009.

[49] M. Bordons, R. Sancho, F. Morillo, and I. Gómez. Perfil de actividad científica de las universidades españolas en cuatro áreas temáticas: Un enfoque multifactorial. Revista Española de Documentación Científica, 33(1):9–33, 2010.

[50] L. Bornmann and H. D. Daniel. What do citation counts measure? A review of studies on citing behavior. Journal of Documentation, 64(1):45–80, 2006.

[51] L. Bornmann and H. D. Daniel. What do we know about the h-index? Journal of the American Society for Information Science and Technology, 58(9):1381–1385, 2007.

[52] L. Bornmann and H. D. Daniel. What do citation counts measure? Journal of Docu- mentation, 64(1):45–80, 2008.

[53] L. Bornmann and H. D. Daniel. The citation speed index: A useful bibliometric indi- cator to add to the h-index. Journal of Informetrics, 4(3):444–446, 2010.

[54] L. Bornmann and L. Leydesdorff. Which are the best performing regions in information science in terms of highly cited papers? Some improvements of our previous mapping approaches. Journal of Informetrics, 6(2):336–345, 2012.

[55] L. Bornmann, R. Mutz, and H. Daniel. Are there better indices for evaluation purposes than the h-index? A comparison of nine different variants of the h-index using data from biomedicine. Journal of the American Society for Information Science and Technology, 59(5):830–837, 2008.

[56] L. Bornmann, R. Mutz, and H. D. Daniel. The b-index as a measure of scientific excel- lence. A promising supplement to the h-index. International Journal of Scientometrics, Informetrics and Bibliometrics, 11(1):6, 2007.

[57] L. Bornmann, G. Wallon, and A. Ledin. Is the h-index related to (standard) measures and to the assessments by peers? An investigation of the h-index by using molecular life sciences data. Research Evaluation, 17(2):149–156, 2008.

[58] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.

[59] T. Braun and E. Bujdoso. Some tendencies of the radioanalytical literature. Statistical games for trend evaluation. Radiochemical and Radioanalytical Letters, 23(4):195–203, 1975.

[60] T. Braun, W. Glanzel, and A. Schubert. A Hirsch-type index for journals. Scientomet- rics, 69(1):169–173, 2006.

[61] L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. Chapman and Hall, 1993.

[62] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society. Series B (Methodological), 59(1):3– 54, 1997.

[63] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.

[64] T. Brody, S. Harnad, and L. Carr. Earlier web usage statistics as predictors of later citation impact. Journal of the American Association for Information Science and Technology, 57(8):1060–1072, 2006.

[65] B. C. Brookes. Biblio-, sciento-, infor-metrics??? What are we talking about? In Selection of Papers Submitted for the Second International Conference on Bibliometrics, Scientometrics and Informetrics, pages 31–43, 1990.

[66] B. Y. Brusilovsky. Partial and system forecasts in scientometrics. Technological Forecasting and Social Change, 12(2-3):193–200, 1978.

[67] P. Buhlmann, M. Kalisch, and M. H. Maathuis. Variable selection in high-dimensional linear models: Partially faithful distributions and the PC-simple algorithm. Biometrika, 97(2):261–278, 2010.

[68] A. T. Bui and C. H. Jun. Learning Bayesian network structure using Markov blanket decomposition. Pattern Recognition Letters, 33(16):2134–2140, 2012.

[69] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[70] Q. L. Burrell. Hirsch index or Hirsch rate? Some thoughts arising from Liang’s data. Scientometrics, 73(1):19–28, 2007.

[71] H. D. Burton. Use of a virtual information system for bibliometric analysis. Information Processing and Management, 24(1):39–44, 1988.

[72] G. Cabanac. Shaping the landscape of research in information systems from the per- spective of editorial boards: A scientometric study of 77 leading journals. Journal of the American Society for Information Science and Technology, 63(5):977–996, 2012. 220 BIBLIOGRAPHY

[73] A. Cabezas-Clavijo, N. Robinson-García, M. Escabias, and E. Jiménez-Contreras. Reviewers’ ratings and bibliometric indicators: Hand in hand when assessing over research proposals? PLoS ONE, 8(6):e68258, 2013.

[74] F. J. Cabrerizo, S. Alonso, E. Herrera-Viedma, and F. Herrera. q2-index: Quantitative and qualitative evaluation based on the number and impact of papers in the Hirsch core. Journal of Informetrics, 4(1):23–28, 2010.

[75] T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27, 1974.

[76] J. M. Campanario, W. Cabos, and M. A. Hidalgo. El impacto de la producción científica de la Universidad de Alcalá de Henares. Revista Española de Documentación Científica, 21(4):402–415, 1998.

[77] F. B. F. Campbell. The Theory of the National and International Bibliography: with Special Reference to the Introduction of System in the Record of Modern Literature. Library Bureau, 1896.

[78] A. Cano, M. G´omez-Olmedo,and S. Moral. Approximate inference in Bayesian net- works using binary probability trees. International Journal of Approximate Reasoning, 52(1):49–62, 2011.

[79] J. S. Cardoso and J. F. Pinto da Costa. Learning to classify ordinal data: The data replication method. The Journal of Machine Learning Research, 8:1393–1429, 2007.

[80] C. Castillo, D. Donato, and A. Gionis. Estimating the number of citations using author reputation. In Proceedings of String Processing and Information Retrieval (SPIRE), volume 4726, pages 107–117, Santiago, Chile, 2007. Springer.

[81] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network Models. Springer-Verlag, New York, 1997.

[82] J. C. Chai, P. H. Hua, R. Rousseau, and J. K. Wan. The adapted pure h-index. In Proceedings of the Fourth International Conference on Webometrics, Informetrics and Scientometrics, pages 1–6, 2008.

[83] X. Chai, L. Deng, Q. Yang, and C. X. Ling. Test-cost sensitive naive Bayes classification. In Fourth IEEE International Conference on Data Mining, pages 51–58, 2004.

[84] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.

[85] P. Cheeseman and J. Stutz. Bayesian Classification (Autoclass): Theory and Results. AAAI Press, 1996.

[86] P. Chen, H. Xie, S. Maslov, and S. Redner. Finding scientific gems with Google’s PageRank algorithm. Journal of Informetrics, 1(1):8–15, 2007.

[87] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. R. Liu. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1-2):43–90, 2002.

[88] S. Cheng, P. YunTao, Z. YanNing, M. Zheng, Y. JunPeng, G. Hong, Y. ZhengLu, M. CaiFeng, and W. YiShan. PrestigeRank: A new evaluation method for papers and journals. Journal of Informetrics, 5(1):1–13, 2011.

[89] D. M. Chickering. Learning Bayesian networks is NP-complete. Technical Report MSR-TR-94-17, Microsoft Research, 1994.

[90] D. M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: Search methods and experimental results. In Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, pages 112–128, 1995.

[91] G. Chirici. Assessing the scientific productivity of Italian forest researchers using the Web of Science, Scopus and Scimago databases. iForest - Biogeosciences and Forestry, 5(3):101–107, 2012.

[92] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[93] E. Cobo, A. Selva-O’Callaghan, J. Ribera, F. Cardellach, R. Dominguez, and M. Vilardell. Statistical reviewers improve reporting in biomedical articles: A randomized trial. PLoS ONE, 2(3):e332, 2007.

[94] F. J. Cole. The history of comparative anatomy. Part I. A statistical analysis of the literature. Science Progress, 11(44):578–596, 1917.

[95] J. R. Cole and S. Cole. Social Stratification in Science. University of Chicago Press, 1973.

[96] S. Cole, J. R. Cole, and G. A. Simon. Chance and consensus in peer review. Science, 214:881–886, 1981.

[97] G. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990.

[98] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.

[99] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms, chapter Greedy algorithms. MIT Press, 1990.

[100] R. Costas and M. Bordons. The h-index: Advantages, limitations and its relation with other bibliometric indicators at the micro level. Journal of Informetrics, 1(3):193–203, 2007.

[101] R. Costas and M. Bordons. Is g-index better than h-index? An exploratory study at the individual level. Scientometrics, 77(2):267–288, 2008.

[102] R. Costas, T. N. van Leeuwen, and M. Bordons. A bibliometric classificatory approach for the study and assessment of research performance at the individual level: The effects of age on productivity and impact. Journal of the American Society for Information Science and Technology, 61(8):1564–1581, 2010.

[103] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[104] J. Cowie, L. Oteniya, and R. Coles. Particle swarm optimization for learning Bayesian networks. In Proceedings of the World Congress on Engineering, pages 2–4, 2007.

[105] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems, volume 14, pages 641–647. MIT Press, 2002.

[106] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.

[107] B. Cronin. The need for a theory of citing. Journal of Documentation, 37(1):16–24, 1981.

[108] E. Csajbok, A. Berhidi, L. Vasas, and A. Schubert. Hirsch-index for countries based on Essential Science Indicators data. Scientometrics, 73(1):91–117, 2007.

[109] P. W. Cullen, R. H. Norris, V. H. Resh, T. B. Reynoldson, D. M. Rosenberg, and M. T. Barbour. Collaboration in scientific research: A critical need for freshwater ecology. Freshwater Biology, 42(1):131–142, 1999.

[110] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.

[111] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.

[112] B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, 1991.

[113] D. L. Davies and D. W. Bouldin. A clustering separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2):224–227, 1979.

[114] P. M. Davis. Eigenfactor: Does the principle of repeated improvement result in better journal impact estimates than raw citation counts? Journal of the American Society for Information Science and Technology, 59(13):2186–2188, 2008.

[115] L. De Campos. Independency relationships and learning algorithms for singly connected networks. Journal of Experimental and Theoretical Artificial Intelligence, 10(4):511–549, 1998.

[116] L. De Campos, J. M. Fernández-Luna, J. A. Gámez, and J. M. Puerta. Ant colony optimization for learning Bayesian networks. International Journal of Approximate Reasoning, 31(3):291–311, 2002.

[117] K. A. De Jong and W. M. Spears. An analysis of the interacting roles of population size and crossover in genetic algorithms. In Proceedings of the 1st Workshop on Parallel Problem Solving from Nature, pages 38–47, 1990.

[118] D. J. De Solla Price. Science Since Babylon. Yale University Press, 1961.

[119] D. J. De Solla Price. Little Science, Big Science. Columbia University Press, 1963.

[120] D. J. De Solla Price. Networks of scientific papers. Science, 149(3683):510–515, 1965.

[121] D. J. De Solla Price. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5):292–306, 1976.

[122] D. J. De Solla Price. Multiple authorship. Science, 212(4498):987, 1981.

[123] D. J. De Solla Price and D. Beaver. Collaboration in an invisible college. American Psychologist, 21(11):1011–1018, 1966.

[124] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

[125] M. M. Deza and E. Deza. Encyclopedia of Distances. Springer, 2009.

[126] P. Domingos. Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, pages 155–164, 1999.

[127] P. Dong, M. Loh, and A. Mondry. The impact factor revisited. Biomedical Digital Libraries, 2(2):7, 2005.

[128] C. Drummond and R. Holte. Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the 17th International Conference on Machine Learning, pages 239–246, 2000.

[129] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley, New York, USA, 1973.

[130] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley Interscience, 2001.

[131] A. W. F. Edwards. System to rank scientists was pedalled by Jeffreys. Nature, 437:951, 2005.

[132] B. Efron. Estimating the error rate of a prediction rule: Improvement on cross- validation. Journal of The American Statistical Association, 78(382):316–331, 1983.

[133] L. Egghe. Dynamic h-index: the Hirsch index in function of time. Journal of the American Society for Information Science and Technology, 58(3):452–454, 2006.

[134] L. Egghe. How to improve the h-index. The Scientist, 20(3):14, 2006.

[135] L. Egghe. An improvement of the h-index: The g-index. ISSI Newsletter, 2(1):8–9, 2006.

[136] L. Egghe. Theory and practice of the g-index. Scientometrics, 69(1):131–152, 2006.

[137] L. Egghe. Mathematical study of h-index sequences. Information Processing and Man- agement, 45(2):288–297, 2009.

[138] L. Egghe. The Hirsch index and related impact measures. Annual Review of Information Science and Technology, 44(1):65–114, 2010.

[139] L. Egghe, R. Rousseau, and G. van Hooydonk. Methods for accrediting publications to authors or countries: Consequences for evaluation studies. Journal of the American Society for Information Science, 51(2):145–157, 2000.

[140] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference of Artificial Intelligence, pages 973–978, 2001.

[141] Elvira-Consortium. Elvira: An environment for probabilistic graphical models. In Pro- ceedings of the First European Workshop on Probabilistic Graphical Models (PGM’02), pages 222–230, 2002.

[142] R. Etxeberria, P. Larrañaga, and J. M. Pikaza. Analysis of the behaviour of genetic algorithms when learning Bayesian network structure from data. Pattern Recognition Letters, 18(11-13):1269–1273, 1997.

[143] B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold, 2001.

[144] J. Ewing. Measuring journals. Notices of the American Mathematical Society, 53(9):1049–1053, 2006.

[145] R. A. Fairthorne. Empirical hyperbolic distributions (Bradford-Zipf-Mandelbrot) for bibliometric description and prediction. Journal of Documentation, 25(4):319–343, 1969.

[146] M. E. Falagas and V. G. Alexiou. The top-ten in journal impact factor manipulation. Archivum Immunologiae et Therapiae Experimentalis, 56(4):223–226, 2008.

[147] M. E. Falagas, E. I. Pitsouni, G. A. Malietzis, and G. Pappas. Comparison of PubMed, Scopus, Web of Science, and Google Scholar: Strengths and weaknesses. FASEB, 22(2):338–342, 2008.

[148] X. Fang. Inference-based naive Bayes: Turning naive Bayes cost-sensitive. IEEE Trans- actions on Knowledge and Data Engineering, 25(10):2302–2313, 2013.

[149] A. Fersht. The most influential journals: Impact factor and Eigenfactor. Proceedings of the National Academy of Sciences, 106(17):6883–6884, 2009.

[150] M. J. Flores, A. E. Nicholson, A. Brunskill, K. B. Korb, and S. Mascaro. Incorporating expert knowledge when learning Bayesian network structure: A medical case study. Artificial Intelligence in Medicine, 53(3):181–204, 2011.

[151] L. Fortnow. Time for computer science to grow up. Communications of the ACM, 52(8):33–35, 2009.

[152] M. Franceschet. A cluster analysis of scholar and journal bibliometric indicators. Jour- nal of the American Society for Information Science and Technology, 60(10):1950–1964, 2009.

[153] M. Franceschet. A comparison of bibliometric indicators for computer science scholars and journals on Web of Science and Google Scholar. Scientometrics, 83(1):243–258, 2010.

[154] M. Franceschet. The difference between popularity and prestige in the sciences and in the social sciences: A bibliometric analysis. Journal of Informetrics, 4(1):55–63, 2010.

[155] M. Franceschet. Journal influence factors. Journal of Informetrics, 4(3):239–248, 2010.

[156] M. Franceschet. The role of conference publications in computer science: A bibliometric view. Communications of the ACM, 53(12):129–132, 2010.

[157] M. Franceschet. Collaboration in computer science: A network science approach. Jour- nal of the American Society for Information Science and Technology, 62(10):1992–2012, 2011.

[158] M. Franceschet and A. Costantini. The effect of scholar collaboration on impact and quality of academic papers. Journal of Informetrics, 4(4):540–553, 2010.

[159] F. Franceschini, M. Galetto, D. Maisano, and L. Mastrogiacomo. The success-index: An alternative approach to the h-index for evaluating an individual’s research output. Scientometrics, 92(3):621–641, 2012.

[160] F. Franceschini and D. Maisano. Analysis of the Hirsch index’s operational properties. European Journal of Operational Research, 203(2):494–504, 2010.

[161] E. Frank and M. Hall. A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning, pages 145–156, 2001.

[162] E. Frank and S. Kramer. Ensembles of nested dichotomies for multi-class problems. In Proceedings of the 21st International Conference on Machine Learning, pages 305–312, 2004.

[163] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian networks classifiers. Machine Learning, 29:131–163, 1997.

[164] L. D. Fu and C. F. Aliferis. Models for predicting and explaining citation count of biomedical articles. AMIA Annual Symposium Proceedings, pages 222–226, 2008.

[165] L. D. Fu and C. F. Aliferis. Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics, 85(1):257–270, 2010.

[166] J. Furnkranz. Pairwise classification as an ensemble technique. In Proceedings of the 13th European Conference on Machine Learning, pages 97–110, 2002.

[167] J. Gama. A cost-sensitive iterative Bayes. In Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, 2000.

[168] M. A. García-Pérez. A multidimensional extension to Hirsch’s h-index. Scientometrics, 81(3):779–785, 2009.

[169] M. A. García-Pérez. Accuracy and completeness of publication and citation records in the Web of Science, PsycINFO, and Google Scholar: A case study for the computation of h-indices in psychology. Journal of the American Society for Information Science and Technology, 61(10):2070–2085, 2010.

[170] E. Garfield. Citation indexes for science. A new dimension in documentation through association of ideas. Science, 122(3159):108–111, 1955.

[171] E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178(4060):471– 479, 1972.

[172] E. Garfield. ’Citations to’ divided by ’items published’ gives journal impact factor. Essays of an information scientist. Current Contents, 1(7):270–273, 1972.

[173] E. Garfield. Citation frequency as a measure of research activity and performance. Essays of an Information Scientist, 1:406–408, 1973.

[174] E. Garfield. Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. John Wiley, 1979.

[175] E. Garfield. Use of Journal Citation Reports and Journal Performance Indicators in measuring short and long term journal impact. Croatian Medical Journal, 41(4):368– 374, 2000.

[176] P. H. Garthwaite, J. B. Kadane, and A. O’Hagan. Statistical methods for eliciting prob- ability distributions. Journal of the American Statistical Association, 100(470):680–701, 2005.

[177] A. Gazni, C.R. Sugimoto, and F. Didegah. Mapping world scientific collaboration: Authors, institutions, and countries. Journal of the American Society for Information Science and Technology, 63(2):323–335, 2012.

[178] D. Geiger and D. Heckerman. Learning Gaussian networks. In Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, pages 235–243, 1994.

[179] W. Glanzel. National characteristics in international scientific co-authorship relations. Scientometrics, 51(1):69–115, 2001.

[180] W. Glanzel. On the opportunities and limitations of the h-index. Science Focus, 1(1):10–11, 2006.

[181] W. Glanzel and B. Thijs. Does co-authorship inflate the share of self-citations? Scien- tometrics, 61(3):395–404, 2004.

[182] W. Glanzel, B. Thijs, A. Schubert, and K. Debackere. Subfield-specific normalized rel- ative indicators and a new generation of relational charts: Methodological foundations illustrated on the assessment of institutional research performance. Scientometrics, 78(1):165–188, 2009.

[183] I. Gómez Caridad, M. T. Fernández Muñoz, M. Bordons, and F. Morillo. La producción científica española en Medicina en los años 1949-1999. Revista Clínica Española, 204(2):75–88, 2004.

[184] R. Golubic, M. Rudes, N. Kovacic, M. Marusic, and A. Marusic. Calculating impact factor: How bibliographical classification of journal items affects the impact factor of large and small journals. Science and Engineering Ethics, 14(1):41–49, 2008.

[185] B. González-Albo and M. A. Zulueta García. Patentes domésticas de universidades españolas: Análisis bibliométrico. Revista Española de Documentación Científica, 30(1):61–90, 2007.

[186] B. González-Pereira, V. P. Guerrero-Bote, and F. Moya-Anegón. A new approach to the metric of journals’ scientific prestige: The SJR indicator. Journal of Informetrics, 4(3):379–391, 2010.

[187] A. A. Goodrum, K. W. McCain, S. Lawrence, and C. L. Giles. Scholarly publishing in the internet age: A citation analysis of computer science literature. Information Processing and Management, 37(5):661–675, 2001.

[188] J. J. Grefenstette. Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 16(1):122–128, 1986.

[189] J. C. Guan and N. Ma. A comparative study of research performance in computer science. Scientometrics, 61(3):339–359, 2004.

[190] L. Guerra, V. Robles, C. Bielza, and P. Larra˜naga.A comparison of clustering quality indices using outliers and noise. Intelligent Data Analysis, 16(4):703–715, 2012.

[191] V. P. Guerrero-Bote and F. Moya-Anegón. A further step forward in measuring journals’ scientific prestige: The SJR2 indicator. Journal of Informetrics, 6(4):674–688, 2012.

[192] H. Guo and W. Hsu. A survey of algorithms for real-time Bayesian network inference. In Proceedings of the AAAI Workshop on Real-Time Decision Support and Diagnosis Systems, pages 1–12, 2002.

[193] G. Haddow and P. Genoni. Australian education journals: Quantitative and qualitative indicators. Australian Academic and Research Libraries, 40(2):88–104, 2009.

[194] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2/3):107–145, 2001.

[195] G. Hanks. Peer review in action: The contribution of referees to advancing reliable knowledge. Palliative Medicine, 19(5):359–370, 2005.

[196] P. E. Hart. The condensed nearest neighbour rule. Transactions on Information Theory, 14:515–516, 1968.

[197] A. W. Harzing, S. Alakangas, and D. Adams. hIa: an individual annual h-index to accommodate disciplinary and career length differences. Scientometrics, 99(3):811–821, 2014.

[198] A. W. Harzing and R. van der Wal. Google Scholar as a new source for citation analysis. Ethics in Science and Environmental Politics, 8(1):1–13, 2008.

[199] R. Hauptman. How to be a successful scholar: Publish efficiently. Journal of Scholarly Publishing, 36(2):115–119, 2005.

[200] D. T. Hawkins. Unconventional uses of on-line information retrieval systems: On- line bibliometric studies. Journal of the American Society for Information Science, 28(1):13–18, 1977.

[201] S. S. Haykin. Neural Networks and Learning Machines. Prentice Hall, 2009.

[202] Y. He and J. C. Guan. Contribution of Chinese publications in computer science: A case study on LNCS. Scientometrics, 75(3):519–534, 2008.

[203] D. Heckerman. A Tutorial on Learning with Bayesian Networks. MIT Press, 1998.

[204] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks. The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.

[205] M. Henrion. Propagating uncertainty in Bayesian networks by sampling. In Proceedings of the Second Annual Conference on Uncertainty in Artificial Intelligence, pages 149–164, 1986.

[206] D. J. Hess. Science Studies: An advanced introduction. New York University Press, 1997.

[207] J. Hirsch. An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences, 102(46):16569–16572, 2005.

[208] J. Hirsch. Does the h-index have predictive power? Proceedings of the National Academy of Sciences, 104(49):19193–19198, 2007.

[209] J. Hirsch. An index to quantify an individual’s scientific research output that takes into account the effect of multiple coauthorship. Scientometrics, 85(3):741–754, 2010.

[210] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.

[211] W. W. Hood and C. S. Wilson. The literature of bibliometrics, scientometrics, and informetrics. Scientometrics, 52(2):291–314, 2001.

[212] D. Horrobin. Something rotten at the core of science. Trends in Pharmacological Sciences, 22(2):51–52, 2001.

[213] D. L. Horrobin. The philosophical basis of peer review and the suppression of innova- tion. Journal of the American Medical Association, 263(10):1438–1441, 1990.

[214] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley, New York, USA, 2nd edition, 2000.

[215] S. Huang, J. Li, J. Ye, A. Fleisher, K. Chen, T. Wu, E. Reiman, and Alzheimer’s Disease Neuroimaging Initiative. A sparse structure learning algorithm for Gaussian Bayesian network identification from high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1328–1342, 2013.

[216] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

[217] E. W. Hulme. Statistical Bibliography in Relation to the Growth of Modern Civilization. Grafton, 1923.

[218] A. Ibáñez, R. Armañanzas, C. Bielza, and P. Larrañaga. Genetic algorithms and Gaussian Bayesian networks to uncover the predictive core set of bibliometric indices. Journal of the Association for Information Science and Technology, Accepted for publication, 2015.

[219] A. Ibáñez, C. Bielza, and P. Larrañaga. Análisis de la actividad científica de las universidades públicas españolas en el área de las tecnologías informáticas. Revista Española de Documentación Científica, 36(1):e002, 2013.

[220] A. Ibáñez, C. Bielza, and P. Larrañaga. Relationship among research collaboration, number of documents and number of citations. A case study in Spanish computer science production in 2000-2009. Scientometrics, 95(2):689–716, 2013.

[221] A. Ibáñez, C. Bielza, and P. Larrañaga. Cost-sensitive selective naive Bayes classifiers for predicting the increase of the h-index for scientific journals. Neurocomputing, 135:42–52, 2014.

[222] A. Ibáñez, C. Bielza, and P. Larrañaga. Productividad y Visibilidad Científica de los Profesores Funcionarios de las Universidades Públicas Españolas en el Área de Tecnologías Informáticas. Fundación General de la U.P.M., 2011.

[223] A. Ibáñez, P. Larrañaga, and C. Bielza. Predicting citation count of Bioinformatics papers within four years of publication. Bioinformatics, 25(24):3303–3309, 2009.

[224] A. Ibáñez, P. Larrañaga, and C. Bielza. Predicting the h-index with cost-sensitive naive Bayes. In Proceedings of the 11th International Conference on Intelligent Systems Design and Applications, pages 599–604, 2011.

[225] A. Ibáñez, P. Larrañaga, and C. Bielza. Using Bayesian networks to discover relationships between bibliometric indices. A case study of computer science and artificial intelligence journals. Scientometrics, 89(2):523–551, 2011.

[226] A. Ibáñez, P. Larrañaga, and C. Bielza. Cluster methods for assessing research performance: exploring Spanish computer science. Scientometrics, 97(3):571–600, 2013.

[227] J. E. Iglesias and C. Pecharroman. Scaling the h-index for different scientific ISI fields. Scientometrics, 73(3):303–320, 2007.

[228] P. Jacso. As we may search - comparison of major features of the Web of Science, Scopus, and Google Scholar citation-based and citation-enhanced databases. Current Science, 89(9):1537–1547, 2005.

[229] P. Jacso. Comparison and analysis of the citedness scores in Web of Science and Google Scholar. In Proceedings of the International Conference on Asian Digital Libraries, pages 360–369, 2005.

[230] P. Jacso. The pros and cons of computing the h-index using Google Scholar. Online Information Review, 32(3):437–452, 2008.

[231] P. Jacso. The pros and cons of computing the h-index using Scopus. Online Information Review, 32(4):524–535, 2008.

[232] P. Jacso. The pros and cons of computing the h-index using Web of Science. Online Information Review, 32(5):673–688, 2008.

[233] A. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall College Division, 1988.

[234] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999.

[235] C. Jennings. Citation data: The wrong impact? Nature Neuroscience, 1(8):641–642, 1998.

[236] F. Jensen and S. Anderson. Approximations in Bayesian belief universe for knowledge based systems. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, pages 162–169, 1990.

[237] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press, 1996.

[238] F. V. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.

[239] P. Jensen, J. B. Rouquier, and Y. Croissant. Testing bibliometric indicators by their prediction of scientists' promotions. Scientometrics, 78(3):467–479, 2009.

[240] H. Y. Jia, D. Y. Liu, and P. Yu. Learning dynamic Bayesian network with immune evolutionary algorithm. In Proceedings of 2005 International Conference on Machine Learning and Cybernetics, pages 2934–2938, 2005.

[241] L. Jiang, C. Li, and S. Wang. Cost-sensitive Bayesian network classifiers. Pattern Recognition Letters, 45:211–216, 2014.

[242] B. Jin. H-index: An evaluation indicator proposed by scientist. Science Focus, 1(1):8–9, 2006.

[243] B. Jin, L. Liang, R. Rousseau, and L. Egghe. The R- and AR-indices: Complementing the h-index. Chinese Science Bulletin, 52(6):855–863, 2007.

[244] B. H. Jin. The AR-index: Complementing the h-index. ISSI Newsletter, 3(1):6, 2007.

[245] P. Juznic, S. Peclin, M. Zaucer, T. Mandelj, M. Pusnik, and F. Demsar. Scientometric indicators: Peer-review, bibliometric methods and conflict of interests. Scientometrics, 85(2):429–441, 2010.

[246] M. Kalisch and P. Buhlmann. Robustification of the PC-algorithm for directed acyclic graphs. Journal of Computational and Graphical Statistics, 17(4):773–789, 2008.

[247] D. Katsaros, A. Sidiropoulos, and Y. Manolopoulos. Age decaying h-index for social network of citations. In Proceedings of the BIS 2007 Workshop on Social Aspects of the Web, page 3, 2007.

[248] J. S. Katz and B. R. Martin. What is research collaboration? Research Policy, 26(1):1–18, 1997.

[249] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.

[250] T. Kellegoz, B. Toklu, and J. Wilson. Comparing efficiencies of genetic crossover operators for one machine total weighted tardiness problem. Applied Mathematics and Computation, 199(2):590–598, 2008.

[251] C. D. Kelly and M. D. Jennions. The h-index and career assessment by numbers. Trends in Ecology and Evolution, 21(4):167–170, 2006.

[252] C. D. Kelly and M. D. Jennions. H-index: Age and sex make it unreliable. Nature, 449(7161):403, 2007.

[253] I. Khawaja. An alternative stipulation of the term bibliometry. Pakistan Library Bulletin, 18(4):1–6, 1987.

[254] A. Khurshid and H. Sahai. Bibliometric distributions and laws: Some comments and a selected bibliography. Journal of Educational Media and Library Sciences, 28(4):433–459, 1991.

[255] J. H. Kim and J. Pearl. A computational model for causal and diagnostic reasoning in inference systems. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence, pages 190–193, 1983.

[256] R. Kindermann and J. L. Snell. Markov Random Fields and Their Applications. American Mathematical Society, 1980.

[257] I. Kissin. Can a bibliometric indicator predict the success of an analgesic? Scientometrics, 86(3):785–795, 2011.

[258] R. Kohavi. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford University, 1995.

[259] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, Massachusetts, 2009.

[260] I. Kononenko. Semi-naive Bayesian classifier. In Proceedings of the Sixth European Working Session on Learning, pages 206–219, 1991.

[261] R. Korf. Linear-space best-first search. Artificial Intelligence, 62(1):41–78, 1993.

[262] M. Kosmulski. A new Hirsch-type index saves time and works equally well as the original h-index. ISSI Newsletter, 2(3):4–6, 2006.

[263] M. Kosmulski. New seniority-independent Hirsch-type index. Journal of Informetrics, 3(4):341–347, 2009.

[264] S. Kotsiantis, I. Zaharakis, and P. Pintelas. Machine learning: A review of classification and combining techniques. Artificial Intelligence Review, 26(3):159–190, 2006.

[265] S. B. Kotsiantis. Local ordinal classification. In Artificial Intelligence Applications and Innovations, International Federation for Information Processing, pages 1–8. Springer, 2004.

[266] S. B. Kotsiantis and P. E. Pintelas. A cost sensitive technique for ordinal classification problems. In Methods and Applications of Artificial Intelligence, Lecture Notes in Computer Science, pages 220–229. Springer, 2004.

[267] S. Kramer, G. Widmer, B. Pfahringer, and M. De Groeve. Prediction of ordinal classes using regression trees. Fundamenta Informaticae - Intelligent Systems, 47(1-2):1–13, 2001.

[268] G. Krampen, A. von Eye, and G. Schui. Forecasting trends of development of psychology from a bibliometric perspective. Scientometrics, 87(2):687–694, 2011.

[269] W. H. Kruskal and W. A. Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621, 1952.

[270] A. V. Kulkarni, B. Aziz, I. Shams, and J. W. Busse. Comparisons of citations in Web of Science, Scopus, and Google Scholar for articles published in general medical journals. The Journal of the American Medical Association, 302(10):1092–1096, 2009.

[271] C. Lacave. Explanation in Causal Bayesian networks. Medical Applications. PhD thesis, Dept. Inteligencia Artificial. UNED, Madrid, Spain (in Spanish), 2003.

[272] C. Laine and C. D. Mulrow. Exorcising ghosts and unwelcome guests. Annals of Internal Medicine, 143(8):611–612, 2005.

[273] B. S. Lancho Barrantes, V. P. Guerrero Bote, Z. Chinchilla Rodríguez, and F. de Moya Anegón. Citation flows in the zones of influence of scientific collaborations. Journal of the American Society for Information Science and Technology, 63(3):481–489, 2012.

[274] R. Landry and N. Amara. The impact of transaction costs on the institutional structuration of collaborative academic research. Research Policy, 27(9):901–913, 1998.

[275] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 399–406, 1994.

[276] P. Larrañaga, C. M. H. Kuijpers, R. H. Murga, and Y. Yurramendi. Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 26(4):487–493, 1996.

[277] P. Larrañaga, M. Poza, Y. Yurramendi, R. H. Murga, and C. M. H. Kuijpers. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):912–926, 1996.

[278] S. C. Larson. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22(1):45–55, 1931.

[279] E. M. Lasda Bergman. Finding citations to social work literature: The relative benefits of using Web of Science, Scopus, or Google Scholar. The Journal of Academic Librarianship, 38(6):370–379, 2012.

[280] S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.

[281] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157–224, 1988.

[282] S. L. Lauritzen and N. Wermuth. Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics, 17(1):31–57, 1989.

[283] S. Lehmann, A. D. Jackson, and B. E. Lautrup. Measures for measures. Nature, 444(7122):1003–1004, 2006.

[284] M. Levine-Clark and E. L. Gil. A comparative citation analysis of Web of Science, Scopus, and Google Scholar. Journal of Business and Finance Librarianship, 14(1):32–46, 2009.

[285] J. M. Levitt and M. Thelwall. A combined bibliometric indicator to predict article impact. Information Processing and Management, 47(2):300–308, 2011.

[286] J. M. Levitt and M. Thelwall. Citation levels and collaboration within library and information science. Journal of the American Society for Information Science and Technology, 60(3):434–442, 2009.

[287] L. Leydesdorff. How are new citation-based journal indicators adding to the bibliometric toolbox? Journal of the American Society for Information Science and Technology, 60(7):1327–1336, 2009.

[288] L. Leydesdorff, F. de Moya-Anegón, and V. P. Guerrero-Bote. Journal maps on the basis of Scopus data: A comparison with the journal citation reports of the ISI. Journal of the American Society for Information Science and Technology, 61(2):352–369, 2010.

[289] L. Leydesdorff and S. Milojevic. Scientometrics. Forthcoming in: International Encyclopedia of Social and Behavioral Sciences, 2015.

[290] L. Liang. H-index sequence and h-index matrix: Constructions and applications. Scientometrics, 69(1):153–159, 2006.

[291] C. H. Liao. How to improve research quality? Examining the impacts of collaboration intensity and member diversity in collaboration networks. Scientometrics, 86(3):747–761, 2011.

[292] H. T. Lin and L. Li. Reduction from cost-sensitive ordinal ranking to weighted binary classification. Neural Computation, 24(5):1329–1367, 2012.

[293] D. Lindsey. Production and citation measures in the sociology of science: The problem of multiple authorship. Social Studies of Science, 10(2):145–162, 1980.

[294] C. X. Ling, Q. Yang, J. Wang, and S. Zhang. Decision trees with minimal costs. In Proceedings of the 21st International Conference on Machine Learning, pages 69–77, 2004.

[295] C. L. Liu. Introduction to Combinatorial Mathematics. McGraw-Hill, 1968.

[296] C. Lokker, K. A. McKibbon, and R. J. McKinlay. Prediction of citation counts for clinical articles at two years using data available within three weeks of publication: Retrospective cohort study. British Medical Journal, 336:655–657, 2008.

[297] S. Lomax and S. Vadera. A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys, 45(2):Article 16, 2013.

[298] S. López-Berna, N. Papí-Gálvez, and M. Martín-Llaguno. Productividad científica en España sobre las profesiones de comunicación entre 1971 y 2009. Revista Española de Documentación Científica, 34(2):212–231, 2011.

[299] R. López de Mántaras and E. Armengol. Machine learning from examples: Inductive and lazy methods. Data and Knowledge Engineering, 25(1-2):99–123, 1998.

[300] J. Lundberg. Lifting the crown-citation z-score. Journal of Informetrics, 1(2):145–154, 2007.

[301] N. Ma, J. Guan, and Y. Zhao. Bringing PageRank to the citation analysis. Information Processing and Management, 44(2):800–810, 2008.

[302] Y. S. Maarek and I. Z. Ben Shaul. Automatically organizing bookmarks per contents. Computer Networks and ISDN Systems, 28(7-11):1321–1333, 1996.

[303] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.

[304] M. H. MacRoberts and B. R. MacRoberts. Problems of citation analysis. Scientometrics, 36(3):435–444, 1996.

[305] M. H. Magri and A. Solari. The SCI Journal Citation Reports: A potential tool for studying journals? I. Description of the JCR journal population based on the number of citations received, number of source items, impact factor, immediacy index and cited half-life. Scientometrics, 35(1):93–117, 1996.

[306] P. C. Mahalanobis. On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India, pages 49–55, 1936.

[307] M. Manafy. Scopus of influence: Content selection committee announced. Econtent, 28(10):12, 2005.

[308] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1):50–60, 1947.

[309] T. Marchant. Score-based bibliometric rankings of authors. Journal of the American Society for Information Science and Technology, 60(6):1132–1137, 2009.

[310] B. Martin. The use of multiple indicators in the assessment of basic research. Scientometrics, 36(3):343–362, 1996.

[311] M. Martínez-Morales, R. Garza-Domínguez, N. Cruz-Ramírez, A. Guerra-Hernández, and J. L. Jiménez-Andrade. A method based on genetic algorithms and fuzzy logic to induce Bayesian networks. In Proceedings of the Fifth Mexican International Conference in Computer Science, pages 176–180, 2004.

[312] C. McCarty, J. W. Jawitz, A. Hopkins, and A. Goldman. Predicting author h-index using characteristics of the co-author network. Scientometrics, 96(2):467–483, 2013.

[313] P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B, 42(2):109–142, 1980.

[314] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1983.

[315] W. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

[316] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, 1997.

[317] G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley, 2000.

[318] G. J. McLachlan, D. Peel, K. Basford, and P. Adams. Fitting of mixtures of normal and t-components. Journal of Statistical Software, 4(2):909–927, 1999.

[319] M. E. McVeigh and S. J. Mann. The journal impact factor denominator: Defining citable (counted) items. Journal of the American Medical Association, 302(10):1107–1109, 2009.

[320] L. I. Meho and C. R. Sugimoto. Citation counting, citation ranking, and h-index of human-computer interaction researchers: A comparison of Scopus and Web of Science. Journal of the American Society for Information Science and Technology, 59(11):1711–1726, 2008.

[321] L. I. Meho and K. Yang. Impact of data sources on citation counts and rankings of LIS faculty: Web of Science vs Scopus and Google Scholar. Journal of the American Society for Information Science and Technology, 58(13):2105–2125, 2007.

[322] S. Menard. Applied Logistic Regression Analysis. Sage, 2002.

[323] S. Mikki. Comparing Google Scholar and ISI Web of Science for earth sciences. Scientometrics, 82(2):321–331, 2010.

[324] G. W. Milligan and M. C. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179, 1985.

[325] M. Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.

[326] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.

[327] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[328] H. F. Moed. Citation Analysis in Research Evaluation. Springer, 2005.

[329] H. F. Moed. Measuring contextual citation impact of scientific journals. Journal of Informetrics, 4(3):265–277, 2010.

[330] H. F. Moed, R. E. De Bruin, and T. N. van Leeuwen. New bibliometric tools for the assessment of national research performance: Database description, overview of indicators and first applications. Scientometrics, 33(3):381–422, 1995.

[331] H. F. Moed, T. N. van Leeuwen, and J. Reedijk. Towards appropriate indicators of journal impact. Scientometrics, 46(3):575–589, 1999.

[332] A. Molinari and J. Molinari. Mathematical aspects of a new criterion for ranking scientific institutions based on the h-index. Scientometrics, 75(2):339–356, 2008.

[333] F. Mosteller and J. W. Tukey. Data Analysis, Including Statistics. Addison-Wesley, 1968.

[334] H. Moxham and J. Anderson. Peer review. A view from the inside. Science and Technology Policy, 5(1):7–15, 1992.

[335] F. Moya-Anegón, Z. Chinchilla-Rodríguez, E. Corera-Álvarez, M. Gómez-Crisóstomo, A. González-Molina, and F. J. Muñoz-Fernández. La productividad ISI de las universidades españolas (2000-2004). El profesional de la información, 16(4):354–358, 2007.

[336] F. Moya-Anegón, Z. Chinchilla-Rodríguez, E. Corera-Álvarez, B. Vargas-Quesada, F. Muñoz-Fernández, and V. Herrero-Solana. Análisis de dominio institucional: La producción científica de la Universidad de Granada (SCI 1991-99). Revista Española de Documentación Científica, 28(2):170–195, 2005.

[337] K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.

[338] A. Mulligan. Is peer review in crisis? Oral Oncology, 41:135–141, 2005.

[339] S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4):345–389, 1998.

[340] J. W. Myers, K. B. Laskey, and K. A. DeJong. Learning Bayesian networks from incomplete data using evolutionary algorithms. In 15th Conference on Uncertainty in Artificial Intelligence, pages 476–485, 1999.

[341] A. J. Myles, R. N. Feudale, Y. Liu, N. A. Woody, and S. D. Brown. An introduction to decision tree modeling. Journal of Chemometrics, 18(6):275–285, 2004.

[342] V. V. Nalimov and Z. M. Mulchenko. Naoukometriia. Izuchenie Razvitiia Nauki kak Informatsionvo Prosessa (Scientometrics. Study of the Development of Science as an Information Process). Nauka, 1969.

[343] F. Narin. Evaluative Bibliometrics: The Use of Publication and Citation Analysis in the Evaluation of Scientific Activity. Computer Horizons, 1976.

[344] R. E. Neapolitan. Learning Bayesian Networks. Prentice Hall, 2003.

[345] C. Neocleous and C. Schizas. Artificial neural network learning: A comparative review. In Lecture Notes in Artificial Intelligence, volume 2308, pages 300–313. Springer-Verlag, 2002.

[346] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied Linear Statistical Models. McGraw-Hill, 1996.

[347] M. Norris and C. Oppenheim. Comparing alternatives to the Web of Science for coverage of the social sciences' literature. Journal of Informetrics, 1(2):161–169, 2007.

[348] C. Olmeda-Gómez, M. A. Ovalle-Parandones, A. Perianes-Rodríguez, and F. Moya-Anegón. Impacto internacional de la investigación y la colaboración científica de las Universidades de Cataluña. 2000-2004. Revista Española de Documentación Científica, 31(4):591–611, 2008.

[349] G. M. Olson and J. S. Olson. Distance matters. Human Computer Interaction, 15(2):139–179, 2000.

[350] T. Opthof and L. Leydesdorff. Caveats for the journal and field normalizations in the CWTS (“Leiden”) evaluations of research performance. Journal of Informetrics, 4(3):423–430, 2010.

[351] E. Pain. Research cuts will cause “exodus” from Spain. Science, 336(6078):139–140, 2012.

[352] D. Palomares-Montero and A. García-Aracil. Fuzzy cluster analysis on Spanish public universities. In Investigaciones de Economía de la Educación, volume 5, chapter 49, pages 976–994. Asociación de Economía de la Educación, 2010.

[353] J. Panaretos and C. Malesios. Assessing scientific research performance and impact with single indices. Scientometrics, 81(3):635–670, 2008.

[354] M. J. Pazzani. Searching for dependencies in Bayesian classifiers. In Learning from data: Artificial Intelligence and Statistics V, volume 112, pages 239–248. Springer, 1996.

[355] J. Peña, J. Lozano, and P. Larrañaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027–1040, 1999.

[356] J. Pearl. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, pages 329–334, 1985.

[357] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, 1988.

[358] J. Pearl. A theory of inferred causation. In Proceedings of the Second International Conference on the Principles of Knowledge Representation and Reasoning, pages 441–452, 1991.

[359] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6):559–572, 1901.

[360] K. Pearson. Notes on the history of correlation. Biometrika, 13(1):25–45, 1920.

[361] T. Perneger. Relation between online “hit counts” and subsequent citations: Prospective study of research papers in the BMJ. British Medical Journal, 329(7465):546–547, 2004.

[362] O. Persson, W. Glanzel, and R. Danell. Inflationary bibliometric values: The role of scientific collaboration and the need for relative indicators in evaluative studies. Scientometrics, 60(3):421–432, 2004.

[363] G. Pinski and F. Narin. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing and Management, 12(5):297–312, 1976.

[364] B. L. Ponomariov and P. C. Boardman. Influencing scientists' collaboration and productivity patterns through new institutions: University research centers and scientific and technical human capital. Research Policy, 39(5):613–624, 2010.

[365] R. Potharst and J. C. Bioch. Decision trees for ordinal classification. Intelligent Data Analysis, 4(2):97–112, 2000.

[366] O. Pourret, P. Naim, and B. Marcot. Bayesian Networks: A Practical Guide to Applications. Wiley, 2008.

[367] G. Prathap. Hirsch-type indices for ranking institutions’ scientific research output. Current Science, 91(11):1439, 2006.

[368] G. Prathap. The iCE approach for journal evaluation. Scientometrics, 85(2):561–565, 2010.

[369] G. Prathap. The fractional and harmonic p-indices for multiple authorship. Scientometrics, 86(2):239–244, 2011.

[370] S. Presser. Collaboration and the quality of research. Social Studies of Science, 10(1):95–101, 1980.

[371] A. Pritchard. Statistical bibliography or bibliometrics? Journal of Documentation, 25(4):348–349, 1969.

[372] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, USA, 1993.

[373] F. Radicchi, S. Fortunato, and C. Castellano. Universality of citation distributions: Towards an objective measure of scientific impact. Proceedings of the National Academy of Sciences, 105(45):17268–17272, 2008.

[374] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[375] C. R. Rao and H. Toutenburg. Linear Models: Least Squares and Alternatives. Springer, 1995.

[376] J. Rao and A. Shao. Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79(4):811–822, 1992.

[377] M. J. Reyes-Barragán, V. P. Guerrero-Bote, and F. Moya-Anegón. Proyección internacional de la investigación de Extremadura (1990-2002). Revista Española de Documentación Científica, 29(4):525–550, 2006.

[378] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 2008.

[379] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

[380] A. Rodríguez-Navarro. A simple index for the high-citation tail of citation distribution to quantify research performance in countries and institutions. PLoS ONE, 6(5):e20510, 2011.

[381] R. Rojo and I. Gómez. Analysis of the Spanish scientific and technological output in the ICT sector. Scientometrics, 66(1):101–121, 2006.

[382] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[383] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1962.

[384] R. Rousseau. New developments related to the Hirsch index. Science Focus, 1(4):23–25, 2006.

[385] P. J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1):53–65, 1987.

[386] F. Ruane and R. S. J. Tol. Rational (successive) h-indices: An application to economics in the Republic of Ireland. Scientometrics, 75(2):395–405, 2008.

[387] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[388] P. Russell and T. Rao. On habitat and association of species of Anopheline Larvae in South-Eastern Madras. Journal of Malaria Institute India, 3(1):153–178, 1940.

[389] G. Saad. Convergent validity between metrics of journal prestige: The eigenfactor, article influence, h-index scores, and impact factors. Journal of the American Society for Information Science and Technology, In press, 2015.

[390] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.

[391] M. Sahami. Learning limited dependence Bayesian classifiers. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 335–338, 1996.

[392] M. Sanderson. Revisiting h measured on UK LIS academics. Journal of the American Society for Information Science and Technology, 59(7):1184–1190, 2008.

[393] T. Scarpa. Peer review at NIH. Science, 311(5757):41, 2006.

[394] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

[395] M. Schreiber. Self-citation corrections for the Hirsch index. Europhysics Letters, 78(3):30002, 2007.

[396] M. Schreiber. An empirical investigation of the g-index for 26 physicists in comparison with the h-index, the a-index, and the r-index. Journal of the American Society for Information Science and Technology, 59(9):1513–1522, 2008.

[397] M. Schreiber. A modification of the h-index: The hm-index accounts for multi-authored manuscripts. Journal of Informetrics, 2(3):211–216, 2008.

[398] M. Schreiber. To share the fame in a fair way, hm for multi-authored manuscripts. New Journal of Physics, 10(040201):1–9, 2008.

[399] M. Schreiber. Fractionalized counting of publications for the g-index. Journal of the American Society for Information Science and Technology, 60(10):2145–2150, 2009.

[400] A. Schubert. Successive h-indices. Scientometrics, 70(1):201–205, 2007.

[401] A. Schubert. Using the h-index for assessing single publications. Scientometrics, 78(3):559–565, 2009.

[402] A. Schubert and T. Braun. Relative indicators and relational charts for comparative assessment of publication output and citation impact. Scientometrics, 9(5-6):281–291, 1986.

[403] A. Schubert and W. Glanzel. A systematic analysis of Hirsch-type indices for journals. Journal of Informetrics, 1(3):179–184, 2007.

[404] A. Schubert, W. Glanzel, and T. Braun. Relative citation rate: A new indicator for measuring the impact of publications. In Proceedings of the First National Conference with International Participation on Scientometrics and Linguistics of Scientific Text, pages 80–81, 1983.

[405] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

[406] G. A. F. Seber and A. J. Lee. Linear Regression Analysis. John Wiley and Sons, 2012.

[407] P. O. Seglen. Why the impact factor of journals should not be used for evaluating research. British Medical Journal, 314(7079):498–502, 1997.

[408] M. Sember, A. Utrobicic, and J. Petrak. Croatian medical journal citation score in Web of Science, Scopus, and Google Scholar. Croatian Medical Journal, 51(2):99–103, 2010.

[409] A. Serenko. The development of an AI journal ranking based on the revealed preference approach. Journal of Informetrics, 4(4):447–459, 2010.

[410] R. D. Shachter. Intelligent probabilistic inference. In Proceedings of the First Annual Conference on Uncertainty in Artificial Intelligence, pages 371–382, 1985.

[411] R. D. Shachter and C. R. Kenley. Gaussian influence diagrams. Management Science, 35(5):527–550, 1989.

[412] A. Shashua and A. Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems, volume 15, pages 961–968. MIT Press, 2003.

[413] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[414] V. S. Sheng and C. X. Ling. Roulette sampling for cost-sensitive learning. In Proceedings of the 18th European Conference on Machine Learning, Lecture Notes in Computer Science, pages 724–731. Springer, 2007.

[415] A. Sidiropoulos, D. Katsaros, and Y. Manolopoulos. Generalized Hirsch h-index for disclosing latent facts in citation networks. Scientometrics, 72(2):253–280, 2007.

[416] R. Sivaraj and T. Ravichandran. A review of selection methods in genetic algorithms. International Journal of Engineering Science and Technology, 3(5):3792–3797, 2011.

[417] C. A. B. Smith. Some examples of discrimination. Annals of Eugenics, 13(1):272–282, 1946.

[418] P. W. F. Smith and J. Whittaker. Edge exclusion tests for graphical Gaussian models. In Proceedings of the NATO Advanced Study Institute on Learning in graphical models, pages 555–574, 1998.

[419] L. Smolinsky and A. Lercher. Citation rates in mathematics: A study of variation by subdiscipline. Scientometrics, 91(3):911–924, 2012.

[420] P. Sneath. The application of computers to taxonomy. Journal of General Microbiology, 17(1):201–206, 1957.

[421] R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic rela- tionships. University of Kansas Scientific Bulletin, 28(22):1409–1438, 1958.

[422] M. Sokolova, N. Japkowicz, and S. Szpakowicz. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence, pages 1015–1021, 2006.

[423] J. M. Soler. A rational indicator of scientific creativity. Journal of Informetrics, 1(2):123–130, 2007.

[424] J. Solomon. Programmers, professors, and parasites: Credit and co-authorship in computer science. Science and Engineering Ethics, 14(5):476–489, 2009.

[425] R. Sooryamoorthy. Do types of collaboration change citation? Collaboration and citation patterns of South African science publications. Scientometrics, 81(1):177–193, 2009.

[426] T. Sorensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter, 5(1):1–34, 1948.

[427] K. A. Spackman. Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning, pages 160–163, 1989.

[428] W. M. Spears and V. Anand. A study of crossover operators in genetic programming. In Proceedings of the 6th International Symposium on Methodologies for Intelligent Systems, pages 409–418, 1991.

[429] D. J. Spiegelhalter. Probabilistic reasoning in predictive expert systems. In Proceedings of the First Annual Conference on Uncertainty in Artificial Intelligence, pages 47–68, 1985.

[430] P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.

[431] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Springer, 1993.

[432] S. V. Stehman. Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62(1):77–89, 1997.

[433] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.

[434] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[435] J. M. Tague-Sutcliffe. An introduction to informetrics. Information Processing and Management, 28(1):1–3, 1992.

[436] K. M. Ting. Inducing cost-sensitive trees via instance weighting. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, pages 23–26, 1998.

[437] D. Torres-Salinas, E. Delgado López-Cózar, and E. Jiménez-Contreras. Análisis de la producción de la Universidad de Navarra en revistas de Ciencias Sociales y Humanidades empleando rankings de revistas españolas y la Web of Science. Revista Española de Documentación Científica, 32(1):22–39, 2009.

[438] D. Torres-Salinas, E. López-Cózar, and E. Jiménez-Contreras. Ranking of departments and researchers within a university using two different databases: Web of Science versus Scopus. Scientometrics, 80(3):761–774, 2009.

[439] D. Torres-Salinas, J. G. Moreno-Torres, E. Delgado-López-Cózar, and F. Herrera. A methodology for institution-field ranking based on a bidimensional analysis: the IFQ2A index. Scientometrics, 88(3):771–786, 2011.

[440] D. Torres-Salinas, J. G. Moreno-Torres, N. Robinson-García, E. Delgado-López-Cózar, and F. Herrera. Rankings ISI of Spanish universities according to fields and scientific disciplines (2nd ed. 2011). El Profesional de la Información, 20(6):701–709, 2011. In Spanish.

[441] A. Tucker, X. Liu, and A. Ogden-Swift. Evolutionary learning of dynamic probabilistic models with large time lags. International Journal of Intelligent Systems, 16(5):621–645, 2001.

[442] P. D. Turney. Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2:369–409, 1995.

[443] C. Urbano, A. Borrego, J. M. Brucart, A. Cosculluela, and M. Somoza. Análisis bibliométrico de la bibliografía citada en estudios de filología española. Revista Española de Documentación Científica, 28(4):439–461, 2005.

[444] M. Vallejo Ruiz, A. Fernández Cano, and M. Torralbo Rodríguez. Patrones de citación en la investigación española en educación matemática. Revista Española de Documentación Científica, 29(3):382–397, 2006.

[445] N. J. van Eck and L. Waltman. Generalizing the h- and g-indices. Journal of Informetrics, 2(4):263–271, 2008.

[446] R. van Engelen. Approximating Bayesian belief networks by arc removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(8):916–920, 1997.

[447] G. van Hooydonk. Fractional counting of multi-authored publications: consequences for the impact of authors. Journal of the American Society for Information Science, 48(10):944–945, 1997.

[448] T. van Leeuwen. Testing the validity of the Hirsch-index for research assessment purposes. Research Evaluation, 17(2):157–160, 2008.

[449] T. N. van Leeuwen, H. F. Moed, and J. Reedijk. Critical comments on Institute for Scientific Information impact factors: A sample of inorganic molecular chemistry journals. Journal of Information Science, 25(6):189–198, 1999.

[450] A. F. J. van Raan. Advanced bibliometric methods as quantitative core of peer review based evaluation and foresight exercises. Scientometrics, 36(3):397–420, 1996.

[451] A. F. J. van Raan. The influence of international collaboration on the impact of research results. Scientometrics, 42(3):423–428, 1998.

[452] A. F. J. van Raan. Special topic issue: Science and technology indicators. Journal of the American Society for Information Science, 49(1):3–81, 1998.

[453] A. F. J. van Raan. Fatal attraction: Conceptual and methodological problems in the ranking of universities by bibliometric methods. Scientometrics, 62(1):133–143, 2005.

[454] A. F. J. van Raan. Comparison of the Hirsch-index with standard bibliometric indicators and with peer judgment for 147 chemistry research groups. Scientometrics, 67(3):491–502, 2006.

[455] C. J. van Rijsbergen. Information Retrieval. Butterworth, 1979.

[456] J. Vanclay. On the robustness of the h-index. Journal of the American Society for Information Science and Technology, 58(10):1547–1550, 2007.

[457] J. K. Vanclay. Bias in the journal impact factor. Scientometrics, 78(1):3–12, 2009.

[458] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.

[459] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

[460] D. Vidaurre, C. Bielza, and P. Larrañaga. Learning an l1-regularized Gaussian Bayesian network in the equivalence class space. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 40(5):1231–1242, 2010.

[461] E. S. Vieira, J. A. S. Cabral, and J. A. N. F. Gomes. Definition of a model based on bibliometric indicators for assessing applicants to academic positions. Journal of the Association for Information Science and Technology, 65(3):560–577, 2014.

[462] E. S. Vieira, J. A. S. Cabral, and J. A. N. F. Gomes. How good is a model based on bibliometric indicators in predicting the final decisions made by peers? Journal of Informetrics, 8(2):390–405, 2014.

[463] P. Vinkler. Evaluation of some methods for the relative assessment of scientific publications. Scientometrics, 10(3-4):157–177, 1986.

[464] P. Vinkler. Eminence of scientists in the light of the h-index and other scientometric indicators. Journal of Information Science, 33(4):481–491, 2007.

[465] P. Vinkler. Indicators are the essence of scientometrics and bibliometrics. Scientometrics, 85(3):861–866, 2010.

[466] J. Wainer, E. C. Xavier, and F. Bezerra. Scientific production in computer science: A comparative study of Brazil and other countries. Scientometrics, 81(2):535–547, 2009.

[467] C. Wallace and D. Dowe. Intrinsic classification by MML - The SNOB program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pages 37–44, 1994.

[468] L. Waltman and N. J. van Eck. The inconsistency of the h-index. Journal of the American Society for Information Science and Technology, 63(2):406–415, 2012.

[469] L. Waltman, N. J. van Eck, T. N. van Leeuwen, M. S. Visser, and A. F. J. van Raan. Towards a new crown indicator: An empirical analysis. Scientometrics, 87(3):467–481, 2011.

[470] J. K. Wan, P. H. Hua, and R. Rousseau. The pure h-index: Calculating an author's h-index by taking co-authors into account. Collnet Journal of Scientometrics and Information Management, 1(2):1–5, 2007.

[471] J. H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

[472] A. Watson. Comparing citations and downloads for individual articles. Journal of Vision, 9(4):1–4, 2009.

[473] P. Weingart. Impact of bibliometrics upon the science system: Inadvertent consequences? Scientometrics, 62(2):117–131, 2005.

[474] S. Weisberg. Applied Linear Regression. John Wiley and Sons, 2014.

[475] M. P. Wellman. Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44(3):257–303, 1990.

[476] M. C. Wendl. H-index: However ranked, citations need context. Nature, 449(7161):403, 2007.

[477] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.

[478] I. H. Witten and E. Frank. Data Mining - Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2005.

[479] G. J. Woeginger. An axiomatic characterization of the Hirsch-index. Mathematical Social Sciences, 56(2):224–232, 2008.

[480] Q. Wu. The w-index: A measure to assess scientific impact by focusing on widely cited papers. Journal of the American Society for Information Science and Technology, 61(3):609–614, 2010.

[481] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

[482] F. Y. Ye and R. Rousseau. The power law model and total career h-index sequences. Journal of Informetrics, 2(4):288–297, 2008.

[483] R. Yehezkel and B. Lerner. Bayesian network structure learning by recursive autonomy identification. Journal of Machine Learning Research, 10:1527–1570, 2009.

[484] K. Yeung, D. Haynor, and W. Ruzzo. Validating clustering for gene expression data. Bioinformatics, 17(4):309–318, 2001.

[485] G. Yu and L. Wang. The self-cited rate of scientific journals and the manipulation of their impact factors. Scientometrics, 73(3):320–330, 2007.

[486] B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pages 204–213, 2001.

[487] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate instance weighting. In Proceedings of the 3rd International Conference on Data Mining, pages 435–442, 2003.

[488] R. Zetterstrom. The number of authors of scientific publications. Acta Paediatrica, 93(5):581–582, 2004.

[489] C. T. Zhang. The e-index, complementing the h-index for excess citations. PLoS ONE, 4(5):e5429, 2009.

[490] X. Zhu and A. Goldberg. Introduction to Semi-Supervised Learning. Morgan and Claypool Publishers, 2009.

[491] M. Zitt and E. Bassecoulard. Challenges for scientometric indicators: data demining, knowledge-flow measurements and diversity issues. Ethics in Science and Environmental Politics, 8(1):49–60, 2008.

[492] M. Zitt and H. Small. Modifying the journal impact factor by fractional citation weighting: The audience factor. Journal of the American Society for Information Science and Technology, 59(11):1856–1860, 2008.

[493] H. Zuckerman. Patterns of name-ordering among authors of scientific papers: a study of social symbolism and its ambiguity. American Journal of Sociology, 74(3):276–291, 1968.