Machine Learning in Scientometrics
Total Page:16
File Type:pdf, Size:1020Kb
DEPARTAMENTO DE INTELIGENCIA ARTIFICIAL Escuela Tecnica´ Superior de Ingenieros Informaticos´ Universidad Politecnica´ de Madrid PhD THESIS Machine Learning in Scientometrics Author Alfonso Iba´nez˜ MS Computer Science MS Artificial Intelligence PhD supervisors Concha Bielza PhD Computer Science Pedro Larranaga˜ PhD Computer Science 2015 Thesis Committee President: C´esarHerv´as Member: Jos´eRam´onDorronsoro Member: Enrique Herrera Member: Irene Rodr´ıguez Secretary: Florian Leitner There are no secrets to success. It is the result of preparation, hard work, and learning from failure. Acknowledgements Ph.D. research often appears a solitary undertaking. However, it is impossible to maintain the degree of focus and dedication required for its completion without the help and support of many people. It has been a difficult long journey to finish my Ph.D. research and it is of justice to cite here all of them. First and foremost, I would like to thank Concha Bielza and Pedro Larra~nagafor being my supervisors and mentors. Without your unfailing support, recommendations and patient, this thesis would not have been the same. You have been role models who not only guided my research but also demonstrated your enthusiastic research attitudes. I owe you so much. Whatever research path I do take, I will be prepared because of you. I would also like to express my thanks to all my friends and colleagues at the Computa- tional Intelligence Group who provided me with not only an excellent working atmosphere and stimulating discussions but also friendships, care and assistance when I needed. My special thank-you goes to Rub´enArma~nanzas,Roberto Santana, Diego Vidaurre, Hanen Borchani, Pedro L. L´opez-Cruz, Hossein Karshenas, Luis Guerra, Bojan Mihaljevic and Laura Ant´on- S´anchez. I have learned so much from all of you. Your constructive recommendations and collaboration have been tremendous assets throughout my Ph.D. research. This dissertation would not have been possible without the financial support offered by the Spanish Ministry of Economy and Competitiveness, projects TIN2008-04528-E and Con- solider Ingenio 2010-CSD-2007-00018. I would also like to thank TIN2007-62626, TIN2010- 20900-C04-04 and TIN2011-14083-E projects that supported my research during these years. Finally, my deepest gratitude goes to my wife for her understanding and faithful support as well as her patience and unconditional love. Thanks also for believing in me, being a never-ending fount of moral support and standing by me, when the going is difficult. For all of this, I thank you. Last but of course not least, I wish to thank my parents and sister for their unending encouragement and love from childhood to now. Without them, I would not be where I am. Thank you. Abstract Machine learning and scientometrics are the scientific disciplines which are covered in this dissertation. Machine learning deals with the construction and study of algorithms that can learn from data, whereas scientometrics is mainly concerned with the analysis of science from a quantitative perspective. Nowadays, advances in machine learning provide the mathemat- ical and statistical tools for properly working with the vast amount of scientometrics data stored in bibliographic databases. In this context, the use of novel machine learning methods in scientometrics applications is the focus of attention of this dissertation. This dissertation proposes new machine learning contributions which would shed light on the scientometrics area. These contributions are divided in three parts: Several supervised cost-(in)sensitive models are learned to predict the scientific success of articles and researchers. Cost-sensitive models are not interested in maximizing classification accuracy, but in minimizing the expected total cost of the error derived from mistakes in the classification process. In this context, publishers of scientific journals could have a tool capa- ble of predicting the citation count of an article in the future before it is published, whereas promotion committees could predict the annual increase of the h-index of researchers within the first few years. These predictive models would pave the way for new assessment systems. Several probabilistic graphical models are learned to exploit and discover new relationships among the vast number of existing bibliometric indices. In this context, scientific community could measure how some indices influence others in probabilistic terms and perform evidence propagation and abduction inference for answering bibliometric questions. Also, scientific community could uncover which bibliometric indices have a higher predictive power. This is a multi-output regression problem where the role of each variable, predictive or response, is unknown beforehand. The resulting indices could be very useful for prediction purposes, that is, when their index values are known, knowledge of any index value provides no information on the prediction of other bibliometric indices. A scientometric study of the Spanish computer science research is performed under the publish-or-perish culture. This study is based on a cluster analysis methodology which char- acterizes the research activity in terms of productivity, visibility, quality, prestige and inter- national collaboration. This study also analyzes the effects of collaboration on productivity and visibility under different circumstances. Resumen El aprendizaje autom´aticoy la cienciometr´ıason las disciplinas cient´ıficasque se tratan en esta tesis. El aprendizaje autom´aticotrata sobre la construcci´ony el estudio de algoritmos que puedan aprender a partir de datos, mientras que la cienciometr´ıase ocupa principalmente del an´alisisde la ciencia desde una perspectiva cuantitativa. Hoy en d´ıa,los avances en el aprendizaje autom´aticoproporcionan las herramientas matem´aticasy estad´ısticaspara tra- bajar correctamente con la gran cantidad de datos cienciom´etricosalmacenados en bases de datos bibliogr´aficas.En este contexto, el uso de nuevos m´etodos de aprendizaje autom´atico en aplicaciones de cienciometr´ıaes el foco de atenci´onde esta tesis doctoral. Esta tesis propone nuevas contribuciones en el aprendizaje autom´aticoque podr´ıanarro- jar luz sobre el ´areade la cienciometr´ıa.Estas contribuciones est´andivididas en tres partes: Varios modelos supervisados (in)sensibles al coste son aprendidos para predecir el ´exito cient´ıficode los art´ıculosy los investigadores. Los modelos sensibles al coste no est´anin- teresados en maximizar la precisi´onde clasificaci´on,sino en la minimizaci´ondel coste total esperado derivado de los errores ocasionados. En este contexto, los editores de revistas cient´ıficaspodr´ıandisponer de una herramienta capaz de predecir el n´umerode citas de un art´ıculoen el fututo antes de ser publicado, mientras que los comit´esde promoci´onpodr´ıan predecir el incremento anual del ´ındiceh de los investigadores en los primeros a~nos. Estos modelos predictivos podr´ıanallanar el camino hacia nuevos sistemas de evaluaci´on. Varios modelos gr´aficosprobabil´ısticosson aprendidos para explotar y descubrir nuevas relaciones entre el gran n´umerode ´ındicesbibliom´etricosexistentes. En este contexto, la comunidad cient´ıficapodr´ıamedir c´omoalgunos ´ındicesinfluyen en otros en t´erminospro- babil´ısticosy realizar propagaci´onde la evidencia e inferencia abductiva para responder a preguntas bibliom´etricas. Adem´as,la comunidad cient´ıficapodr´ıadescubrir qu´e´ındicesbi- bliom´etricostienen mayor poder predictivo. Este es un problema de regresi´onmulti-respuesta en el que el papel de cada variable, predictiva o respuesta, es desconocido de antemano. Los ´ındicesresultantes podr´ıanser muy ´utilespara la predicci´on,es decir, cuando se conocen sus valores, el conocimiento de cualquier valor no proporciona informaci´onsobre la predicci´onde otros ´ındicesbibliom´etricos. Un estudio bibliom´etricosobre la investigaci´onespa~nolaen inform´aticaha sido realizado bajo la cultura de publicar o morir. Este estudio se basa en una metodolog´ıade an´alisis de clusters que caracteriza la actividad en la investigaci´onen t´erminosde productividad, visibilidad, calidad, prestigio y colaboraci´oninternacional. Este estudio tambi´enanaliza los efectos de la colaboraci´onen la productividad y la visibilidad bajo diferentes circunstancias. Contents Contents xvi I Preliminaries 1 1 Introduction 3 1.1 Contributions of the dissertation . 4 1.2 Overview of the dissertation . 6 II Background 9 2 Machine Learning 11 2.1 Introduction . 11 2.2 Supervised learning . 13 2.2.1 Supervised learning approaches . 13 2.2.2 Classification validation . 17 2.3 Unsupervised learning . 19 2.3.1 Unsupervised learning approaches . 20 2.3.2 Clustering validation . 23 3 Probabilistic Graphical Models 27 3.1 Introduction . 27 3.2 Bayesian networks . 28 3.2.1 Discrete Bayesian networks . 28 3.2.2 Gaussian Bayesian networks . 29 3.3 Learning Bayesian networks . 29 3.3.1 Structural learning . 30 3.3.2 Parametric learning . 32 3.4 Inference in Bayesian networks . 32 3.5 Bayesian networks classifiers . 33 xiii xiv CONTENTS 4 Scientometrics 37 4.1 Introduction . 37 4.2 Citation analysis in research evaluation . 38 4.3 The h-index . 40 4.4 Improvements of the h-index . 42 4.4.1 Bibliometric measures that complement the h-index . 42 4.4.2 Bibliometric measures that take time into account . 47 4.4.3 Bibliometric measures that allow for co-authorship . 49 4.4.4 Bibliometric measures that consider other variables . 51 4.5 Journal-based measures . 52 4.5.1 Impact factor . 53 4.5.2 Bibliometric measures that assess the quality of citations . 54 4.5.3 Bibliometric measures that correct for differences among fields . 55 4.6 Bibliographic databases . 57 4.6.1 Web of Science . 58 4.6.2 Scopus . 61 4.6.3 Google Scholar . 62 III Data Mining in Research Evaluation 65 5 Predicting citation counts using supervised algorithms 67 5.1 Introduction . 67 5.2 Predicting citation count of Bioinformatics papers . 69 5.2.1 Dataset compilation . 69 5.2.2 Data distribution . 70 5.2.3 Predictive models . 71 5.2.4 Exploiting the best models . 75 5.3 Discussion and conclusions . 77 6 Predicting the h-index using cost-sensitive algorithms 79 6.1 Introduction .