A Comparison of Open Source Tools for Data Science

2015 Proceedings of the Conference on Information Systems Applied Research ISSN: 2167-1508 Wilmington, North Carolina USA v8 n3651 __________________________________________________________________________________________________________________________ A Comparison of Open Source Tools for Data Science Hayden Wimmer Department of Information Technology Georgia Southern University Statesboro, GA 30260, USA Loreen M. Powell Department of Information and Technology Management Bloomsburg University Bloomsburg, PA 17815, USA Abstract The next decade of competitive advantage revolves around the ability to make predictions and discover patterns in data. Data science is at the center of this revolution. Data science has been termed the sexiest job of the 21st century. Data science combines data mining, machine learning, and statistical methodologies to extract knowledge and leverage predictions from data. Given the need for data science in organizations, many small or medium organizations are not adequately funded to acquire expensive data science tools. Open source tools may provide the solution to this issue. While studies comparing open source tools for data mining or business intelligence exist, an update on the current state of the art is necessary. This work explores and compares common open source data science tools. Implications include an overview of the state of the art and knowledge for practitioners and academics to select an open source data science tool that suits the requirements of specific data science projects. Keywords: Data Science Tools, Open Source, Business Intelligence, Predictive Analytics, Data Mining. 1. INTRODUCTION Data science is an expensive endeavor. One such example is JMP by SAS. SAS is a primary Data science is an emerging field which provider of data science tools. JMP is one of the intersects data mining, machine learning, more modestly priced tools from SAS with the predictive analytics, statistics, and business price for JMP Pro listed as $14,900. Based on intelligence. The data scientist has been coined the high price point of related software, data the “sexiest job of the 21st century” (Davenport science efforts are out of reach for small and & Patil, 2012). The data science field is so new medium business as well as many local and that the U.S. bureau of labor and statistics does regional healthcare organizations where efforts not yet list it as a profession; yet, CNN’s Money bring competitive advantages, improved lists the data scientist as #32 on their best jobs performance, and cost reductions. The shortfall in America list with a median salary of $124,000 of data scientists detail the need for higher (Money, 2015). Fortune lists the data scientist education to provide training programs; as the hot tech gig of 2022 (Hempel, 2012). nonetheless, the high cost of data science The volume of data has exploded (Brown, Chui, software is a barrier to classroom adoption. & Manyika, 2011); however, a shortfall of skilled Based on the aforementioned shortfalls, open data scientists remain (Lake & Drake, 2014) source based solutions may provide respite. which helps justify the high median salary. This paper seeks to provide an overview of data science and the tools required to meet the needs of organizations, albeit higher education, _________________________________________________ ©2015 ISCAP (Information Systems & Computing Academic Professionals) Page 1 http://iscap.info 2015 Proceedings of the Conference on Information Systems Applied Research ISSN: 2167-1508 Wilmington, North Carolina USA v8 n3651 __________________________________________________________________________________________________________________________ business, or clinical settings as well as provide identified in the first step when developing a insight to the capabilities of common open business understanding of the project. source data science tools. Machine Learning 2. BACKGROUND Machine learning (ML), employed as a method in This section begins with introducing the term data science, is the process of programming Data Science and prominent aspects of data computers to learn from past experiences science, namely data mining, machine learning, (Mitchell, 1997). ML seeks to develop algorithms predictive analytics, and business intelligence. that learn from data directly with little or no Next, open source software is introduced and human intervention. ML algorithms perform a skills of the data scientist are framed based on variety of tasks such as prediction, classification, an industry certification. or decision making. ML stems from artificial intelligence research and has become a critical Data Science aspect of data science. Machine learning begins Data science is a revived term for discovering with input as a training data set. In this phase, knowledge from data (Dhar, 2013); yet, the the ML algorithm employs the training dataset to term has come to encompass more than merely learn from the data and form patterns. The traditional data mining. A universally accepted learning phase outputs a model that is used by definition does not yet exist. It is generally the testing phase. The testing phase employs agreed that a data scientist combines skills from another dataset, applies the model from the multiple disciplines such as computer science, training phase, and results are presented for mathematics, and even art (Loukides, 2010). analysis. The performance on the test dataset The data scientist must combine techniques demonstrates the models ability to perform its from multiple disciplines which include, but are task against data. Machine learning extends not limited to, data mining, machine learning, beyond a statically coded set of statements into predictive analytics, and business intelligence. statements that are dynamically generated Regardless of the tool, extracting knowledge based on the input data. from data, particularly for predictive purposes, is at the heart of the data science field. The data Predictive Analytics scientist collaborates with domain experts to Predictive analytics, a cornerstone of data extract and transform data as well as provide science efforts, is the process of employing guidance in the analysis of the results of data empirical methods to generate data predictions science activities. (Shmueli & Koppius, 2010). Predictive analytics frequently involve statistical methods, such as Data Mining regression analysis, to make predictions based Data mining (DM), commonly referred to as on data. Predictive analytics has a wide range knowledge discovery in databases (KDD) and an of applications from marketing, finance, and integral aspect of data science, is the process of clinical applications. A common marketing extracting patterns and knowledge from data application is customer churn analysis which such as pattern discovery and extraction seeks to determine which customers may switch (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). to a competing provider and make special offers The Cross Industry Standard Process for Data in order to retain these high-risk of churn Mining (CRISP-DM), one of the leading data customers. Finance applications include mining methodologies, divides the data mining predicting customer profitability or risk process into 6 steps (Chapman et al., 2000; management as employed by the insurance Wirth & Hipp, 2000). First, a business industry. Clinical applications include clinical understanding of the project is developed decision support, determining which patients are followed by an analysis and understanding of the at risk for hospital readmission, or medication current data resources. Third, data pre- interaction modeling. processing is performed to format the data suitable to data mining applications and Business Intelligence algorithms. Next, models based on the data are Business intelligence, or BI, combines analytical generated. Model generation may be automatic tools to present complex information to decision via machine learning, semi-automatic, or makers (Negash, 2004). BI is part of data manual. The models are then evaluated for science efforts frequently as output of such performance and accuracy. The final step is efforts. Business intelligence tools integrate deployment of the model(s) to solve the mission data from an organization for presentation. One such example is providing executive _________________________________________________ ©2015 ISCAP (Information Systems & Computing Academic Professionals) Page 2 http://iscap.info 2015 Proceedings of the Conference on Information Systems Applied Research ISSN: 2167-1508 Wilmington, North Carolina USA v8 n3651 __________________________________________________________________________________________________________________________ management with dashboards which provide a A brief description of each, as adapted from by view of the organization’s operations. Decision EMC is: makers employ this information to make K-means clustering – an unsupervised strategic or operational decisions that impact the method learning method which groups data objectives of the organization. One goal of instances. K-means is the most popular business intelligence is presentation of data and algorithm for clustering where the data is information in a format that can be easily grouped into K groups. understood by decision makers. Business Association rule mining - an unsupervised intelligence includes key performance indicators method to find rules in the data. ARM is (KPI) from the organization, competitors, and commonly used as market

A Comparison of Open Source Tools for Data Science

KNIME Big Data Workshop

Text Mining Course for KNIME Analytics Platform

Sage for Lattice-Based Cryptography

User Manual QT

A Study of Open Source Data Mining Tools and Its Applications

Introduction to Data Mining May 2018

Automatic Classification of Oranges Using Image Processing and Data Mining Techniques

C++ GUI Programming with Qt 4, Second Edition by Jasmin Blanchette; Mark Summerfield

Data Mining and Knowledge Discovery Petra Kralj Novak November 18, 2019

Analysis of Heart Disease Using in Data Mining Tools Orange and Weka by Sarangam Kodati & Dr

AN QP™/QM™ and the Qt™ GUI Framework

Legate Numpy: Accelerated and Distributed Array Computing