machine learning & knowledge extraction Article Defining Data Science by a Data-Driven Quantification of the Community Frank Emmert-Streib 1,2,* and Matthias Dehmer 3,4,5 1 Predictive Medicine and Data Analytics Lab, Department of Signal Processing, Tampere University of Technology, FI-33101 Tampere, Finland 2 Institute of Biosciences and Medical Technology, FI-33101 Tampere, Finland 3 Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, A-4400 Steyr, Austria;
[email protected] 4 Department of Mechatronics and Biomedical Computer Science, UMIT, A-6060 Hall in Tyrol, Austria 5 College of Computer and Control Engineering, Nankai University, Tianjin 300071, China * Correspondence:
[email protected]; Tel.: +358-50-301-5353 Received: 4 December 2018; Accepted: 17 December 2018; Published: 19 December 2018 Abstract: Data science is a new academic field that has received much attention in recent years. One reason for this is that our increasingly digitalized society generates more and more data in all areas of our lives and science and we are desperately seeking for solutions to deal with this problem. In this paper, we investigate the academic roots of data science. We are using data of scientists and their citations from Google Scholar, who have an interest in data science, to perform a quantitative analysis of the data science community. Furthermore, for decomposing the data science community into its major defining factors corresponding to the most important research fields, we introduce a statistical regression model that is fully automatic and robust with respect to a subsampling of the data.