A Practical Guide to Support Predictive Tasks in Data Science

José Augusto Câmara Filho1, José Maria Monteiro1, César Lincoln Mattos1 and Juvêncio Santos Nobre2
1Department of Computing, Federal University of Ceará, Fortaleza, Ceará, Brazil
2Department of Statistics and Applied Mathematics, Federal University of Ceará, Fortaleza, Ceará, Brazil

Keywords: Practical Guide, Prediction, Data Science.

Abstract: Currently, professionals from the most diverse areas of knowledge need to explore their data repositories in order to extract knowledge and create new products or services. Several tools have been proposed in order to facilitate the tasks involved in the Data Science lifecycle. However, such tools require their users to have specific (and deep) knowledge in different areas of Computing and Statistics, making their use practically unfeasible for non-specialist professionals in data science. In this paper, we propose a guideline to support predictive tasks in data science. In addition to being useful for non-experts in Data Science, the proposed guideline can support data scientists, data engineers or programmers who are starting to deal with predictive tasks. Besides, we present a tool, called DSAdvisor, which follows the stages of the proposed guideline. DSAdvisor aims to encourage non-expert users to build machine learning models to solve predictive tasks, extracting knowledge from their own data repositories. More specifically, DSAdvisor guides these professionals in predictive tasks involving regression and classification.

1 INTRODUCTION

Due to the large amount of data currently available, professionals from different areas need to extract knowledge from their repositories to create new products and services. For example, cardiologists need to explore large repositories of electrocardiographic signals in order to predict the likelihood of sudden death in a certain patient. Likewise, tax auditors may want to explore their databases in order to predict the likelihood of tax evasion. However, in order to build predictive models, these non-specialist professionals need to acquire knowledge in different areas of Computing and Statistics, which makes this task practically unfeasible. Another alternative is to ask experienced data science professionals for help, which creates dependency instead of autonomy. In this context, the popularization of data science becomes an important research problem (Provost and Fawcett, 2013).

Data science is a multidisciplinary area involving the extraction of information and knowledge from large data repositories (Provost and Fawcett, 2013). It deals with data collection, integration, management, exploration and knowledge extraction to make decisions, understand the past and the present, predict the future, and create new services and products (Ozdemir, 2016). Data science makes it possible to identify hidden patterns and obtain new insights from these datasets through complex machine learning algorithms.

The Data Science lifecycle has six stages: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. To extract knowledge from the data, we must be able to (i) understand yet unsolved problems with the use of data mining techniques, (ii) understand the data and their interrelationships, (iii) extract a data subset, (iv) create machine learning models in order to solve the selected problem, (v) evaluate the performance of the new models, and (vi) demonstrate how these models can be used in decision-making (Chertchom, 2018). The complexity of the previous tasks explains why only highly experienced users can master the entire Data Science lifecycle. On the other hand, several tools have been proposed in order to support the tasks involved in the Data Science lifecycle. However, such tools require their users to have specific (and deep) knowledge in different areas of Computing and Statistics, making their use practically unfeasible for non-specialist professionals in data science.

In this paper, we propose a guideline to support


Filho, J., Monteiro, J., Mattos, C. and Nobre, J. A Practical Guide to Support Predictive Tasks in Data Science. DOI: 10.5220/0010460202480255. In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 1, pages 248-255. ISBN: 978-989-758-509-8. Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.


predictive tasks in data science. In addition to being useful for non-experts in Data Science, the proposed guideline can support data scientists, data engineers or programmers who are starting to deal with predictive tasks. In addition, we present a tool, called DSAdvisor, which follows the stages of the proposed guideline. DSAdvisor aims to encourage non-expert users to build machine learning models to solve regression or classification tasks, extracting knowledge from their own data repositories. DSAdvisor acts as an advisor for non-expert users or novice data scientists.

The rest of this paper is organized as follows. Section 2 reviews related works. In Section 3, the proposed guideline is laid out. DSAdvisor is presented in Section 4. Finally, in Section 5 we present our conclusions and suggestions for future research.

2 RELATED WORKS

In this section we discuss the main related works. For a better understanding, we organized the related works into two categories: supporting tools and practical guidelines.

2.1 Data Mining Tools

Traditional data mining tools help companies establish data patterns and trends by using a number of complex algorithms and techniques. As examples of such tools, we can cite KEEL, Knime, Orange, RapidMiner and WEKA (Hasim and Haris, 2015).

KEEL (Knowledge Extraction based on Evolutionary Learning) is a software tool that facilitates the analysis of the behavior of evolutionary learning under different learning approaches, such as Pittsburgh, Michigan, IRL (iterative rule learning) and GCCL (genetic cooperative-competitive learning) (Alcalá-Fdez et al., 2009). Knime is a modular environment that enables easy integration of new algorithms, data manipulation and visualization methods. It allows the selection of different data sources, data preprocessing steps, machine learning algorithms, as well as visualization tools. To create a workflow, the user drags nodes onto the workbench and links them to join the input and output ports. The Orange tool has different features which are visually represented by widgets (e.g. read file, discretize, train SVM classifier, etc.). Each widget has a short description within the interface. Programming is performed by placing widgets on the canvas and connecting their inputs and outputs (Demšar et al., 2013). RapidMiner provides a visual and user-friendly GUI environment. This tool uses the process concept. A process may contain subprocesses. Processes contain operators, which are represented by visual components. An application wizard provides prebuilt workflows for a number of common tasks, including direct marketing, predictive maintenance and sentiment analysis, and a statistics view provides many statistical graphs (Jovic et al., 2014). Weka offers four operating options: command-line interface (CLI), Explorer, Experimenter and Knowledge Flow. The "Explorer" option allows the definition of the data source, data preparation, the execution of machine learning algorithms, and data visualization (Hall et al., 2009). DSAdvisor is an advisor for non-expert users or novice data scientists, which follows the stages of the guideline proposed in this paper. DSAdvisor aims to encourage non-expert users to build machine learning models to solve regression or classification tasks, extracting knowledge from their own data repositories.

Even before the popularization of data science, all these tools were developed to help with data mining tasks. These tools differ regarding usability, type of license, the language in which they were developed, support for data understanding, and missing value handling. The most widely used tools include KEEL, Knime, Orange, RapidMiner, Tanagra, and Weka. Table 1 provides a comparison between these tools and DSAdvisor.

On the other hand, AutoML tools enable the automation of some machine learning tasks. Although it would be important to automate all machine learning tasks, that is not what AutoML does. Rather, it focuses on a few repetitive tasks, such as hyperparameter optimization, feature selection, and model selection. Examples of these tools include AutoKeras, Auto-WEKA, Auto-Sklearn, DataRobot, H2O and MLBox.

2.2 Guidelines

A guideline is a roadmap determining the course of a set of actions that make up a specific process, in addition to a set of good practices for the performance of these activities (Dictionary, 2015). Some guidelines have been proposed to manage general data mining tasks.

In (Melo et al., 2019), the authors presented a practical guideline to support the specific problem of predicting change-prone classes in object-oriented software. In addition, they applied their guideline in a case study using a large imbalanced dataset extracted from a wide commercial software. It is important to highlight that, in this work, we extend the


Table 1: General characteristics of data mining software. Adapted from (Hasim and Haris, 2015).

Software    | Usability | License | Language    | Data Understanding | Missing Values Handling
DSAdvisor   | High      | GPL     | Python      | Performs           | Intermediate
KEEL        | High      | GPL     | Java        | Performs           | Basic
KNIME       | Low       | Other   | Java        | Performs           | Basic
RapidMiner  | High      | GPL     | Java        | Partially performs | Basic
Orange      | Low       | GPL     | C++, Python | Partially performs | Basic
Weka        | Low       | GPL     | Java        | Partially performs | Basic

guideline proposed in (Melo et al., 2019) to the more general data science context.

(Luo et al., 2016) highlight the flexibility of the emerging machine learning techniques; however, there is uncertainty and inconsistency in the use of such techniques. Machine learning, due to its intrinsic mathematical and algorithmic complexity, is often considered "black magic" that requires a delicate balance of a large number of conflicting factors. This, together with inadequate reporting of data sources and the modeling process, makes the research results reported in many biomedical articles difficult to interpret. It is not uncommon to see potentially spurious conclusions drawn from methodologically inadequate studies, which in turn undermines the credibility of other valid studies and discourages many researchers who could benefit from adopting machine learning techniques. In the light of this, guidelines are proposed for the use of predictive models in clinical settings, ensuring that activities are carried out correctly and reported.

3 THE PROPOSED GUIDELINE FOR PREDICTIVE TASKS

This section describes the proposed guide to support regression and classification tasks, which is organized into three phases: exploratory analysis, data preprocessing and building predictive models. Each one of these phases is detailed next.

3.1 Phase 1: Exploratory Analysis

The first phase of the proposed guideline aims to analyze a dataset, provided by the user, and then describe and summarize it. Figure 1 illustrates this phase, which comprises the following activities: uploading the data, checking the type of variables, removing variables, choosing missing value codes, exhibiting descriptive statistics, plotting categorical and discrete variables, analyzing distributions, and displaying correlations.

Thus, the guide indicates the use of different descriptive statistics, such as the number of rows (count), mean, standard deviation (std), coefficient of variation (cv), minimum (min), percentiles (25%, 50%, 75%) and maximum (max) for numerical variables, and count, number of distinct values (unique), the most frequent element (top) and the most common value's frequency (freq) for categorical variables. Besides, the guideline recommends the use of different strategies for identifying and showing missing values.

Moreover, the proposed guideline suggests different methods to assess the distribution of variables, such as Cramér-von Mises (Cramér, 1928), D'Agostino's K-squared (D'Agostino, 1970), Lilliefors (Lilliefors, 1967), Shapiro-Wilk (Shapiro and Wilk, 1965) and Kolmogorov-Smirnov (Smirnov, 1948). All these methods serve to determine the veracity of a hypothesis (Hirakata et al., 2019). In this guide we want to state whether a variable follows (H0) or not (H1) a normal distribution, and each method has its own way of calculating and returning a result to identify the most suitable hypothesis. Each test considers two hypotheses about the variable under study. One is called the null hypothesis (H0), which is assumed to be true until proven otherwise. The second is called the alternative hypothesis (H1), which represents a statement that the parameter of interest differs from that defined in the null hypothesis, so that the two hypotheses are complementary. The hypotheses used by the tests are: i) H0: the variable follows a normal distribution; and ii) H1: the variable does not follow a normal distribution.

Finally, the guide indicates the use of different correlation coefficients based on the data distribution. Spearman's correlation coefficient (Spearman, 1961) will be displayed for all pairs of numerical variables. If a pair of variables (columns or features) follows a normal distribution, Pearson's correlation coefficient (Pearson, 1895) must be computed and shown. If the user dataset contains categorical data, the guide suggests displaying Cramér's V (Cramér, 1928).
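As an illustration of these last steps, the normality check and the corresponding choice of correlation coefficient might be sketched with pandas and SciPy as follows; the dataset, column names and the 5% significance level are ours, not prescribed by the guideline:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(50, 10, 200),      # roughly symmetric variable
    "income": rng.lognormal(3, 1, 200),  # strongly skewed variable
})

# Descriptive statistics, as the guideline suggests for numerical variables.
print(df.describe())

# Shapiro-Wilk test: H0 = the variable follows a normal distribution.
normal_cols = []
for col in df.columns:
    stat, p = stats.shapiro(df[col])
    if p > 0.05:                 # fail to reject H0 at the 5% level
        normal_cols.append(col)

# Pearson if every variable in the pair looks normal, Spearman otherwise.
method = "pearson" if len(normal_cols) == len(df.columns) else "spearman"
corr = df.corr(method=method)
print(method)
print(corr.round(2))
```

In practice one test per variable is chosen from the list above; Shapiro-Wilk is used here only as one concrete option.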


Figure 1: Phase 1 - Exploratory Analysis.
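The missing-value activities of this phase (choosing missing value codes, then identifying and showing missing values) might be sketched as follows; the sentinel code -999 and the column names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# A user-chosen missing-value code (here -999, purely illustrative)
# is first mapped to NaN so pandas can treat it as missing.
df = pd.DataFrame({"x": [1.0, -999.0, 3.0], "y": ["a", None, "c"]})
df = df.replace(-999.0, np.nan)

# Identify and show missing values per variable.
print(df.isna().sum())             # one missing value in each column
print(df[df.isna().any(axis=1)])   # the affected rows
```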

3.2 Phase 2: Data Preprocessing

Data preprocessing is an essential component of solving many predictive tasks. The purpose of the second phase of the proposed guide is to prepare the data in order to use it to build predictive models. This phase includes activities related to outlier detection, data normalization, choice of the dependent variable, attribute selection, data balancing, feature selection, and division into training and testing sets. Figure 2 illustrates the activities that make up the second phase of the proposed guide.
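For instance, the last of these activities, the division into training and testing sets, can be sketched with scikit-learn; the toy data and the 70/30 proportion are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # illustrative feature matrix
y = rng.integers(0, 2, size=100)    # illustrative binary target

# 70/30 split; stratify=y keeps the class proportions equal in both
# sets, which is relevant when the data is imbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)   # (70, 3) (30, 3)
```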

3.2.1 Problem Setup

In this step, the predictive task type (regression or classification) must be defined. Besides, the dependent variable should be identified. Moreover, the proportion of the training and test sets has to be specified. Other options for the third phase are the list of predictive algorithms, the score function for optimizing GridSearchCV, and the list of metrics for results presentation. The list of predictive algorithms, the list of metrics, and the score function depend on the type of problem to be treated: a classification or a regression problem.

Figure 2: Phase 2 - Data preprocessing.

3.2.2 Choose Normalization Techniques

Verifying that the variables are on the same scale is an essential step at this point. For example, two variables may be expressed in different ranges, such as integers and the interval between 0 and 1. Therefore, it is necessary to normalize all variables in the dataset. For instance, for the activation function in a neural network, it is recommended that the data be normalized between 0.1 and 0.9 instead of 0 and 1, to avoid saturation of the sigmoid function (Basheer and Hajmeer, 2000). The normalization techniques used in this guideline are z-score and min-max normalization.
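Both techniques are available, for example, in scikit-learn; a minimal sketch (the toy matrix is ours, and the 0.1–0.9 range follows the recommendation above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max normalization into [0.1, 0.9], as suggested for sigmoid activations.
X_minmax = MinMaxScaler(feature_range=(0.1, 0.9)).fit_transform(X)

# Z-score normalization: zero mean and unit variance per column.
X_z = StandardScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # [0.1 0.1] [0.9 0.9]
print(X_z.mean(axis=0).round(6))                   # [0. 0.]
```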


3.2.3 Outlier Detection

Outliers are extreme values that deviate from the other observations in the data (i.e., observations that diverge from the overall pattern of a sample). Detected outliers are candidates for aberrant data that may otherwise adversely lead to model misspecification, biased parameter estimation and incorrect results. It is therefore important to identify them prior to creating the prediction model (Liu et al., 2004).

A survey distinguishing univariate vs. multivariate techniques and parametric (statistical) vs. nonparametric procedures was done by (Ben-Gal, 2005). Detecting outliers is possible when multivariate analysis is performed and the combinations of variables are compared with the class of data. In other words, an instance can be a multivariate outlier but a usual value in each feature, or it can have values that are outliers in several features, but the whole instance might also be a usual multivariate value (Escalante, 2005).

There are two main techniques to detect outliers: the interquartile range (a univariate parametric approach) and the adjusted boxplot (a univariate nonparametric approach). Next, we describe each of these techniques.

Interquartile Range. The interquartile range (IQR) is a measure of statistical dispersion, often used to detect outliers. The IQR is the length of the box in the boxplot (i.e., Q3 − Q1). Here, outliers are defined as instances below Q1 − 1.5 · IQR or above Q3 + 1.5 · IQR.

Adjusted Boxplot. Note that the boxplot assumes symmetry, because we add the same amount to Q3 as we subtract from Q1. In asymmetric distributions, the usual boxplot typically flags many regular data points as outlying. The skewness-adjusted boxplot corrects for that by using a robust measure of skewness in determining the fences (Hubert and Vandervieren, 2008). In this new approach, if the medcouple (MC) ≥ 0, outliers are defined as instances below Q1 − 1.5e^(−4 MC) · IQR or above Q3 + 1.5e^(3 MC) · IQR; otherwise, they are instances below Q1 − 1.5e^(−3 MC) · IQR or above Q3 + 1.5e^(4 MC) · IQR. To measure the skewness of a univariate sample (x1, ..., xn) from a continuous unimodal distribution F, we use the MC, where Q2 is the sample median (Brys et al., 2004). It is defined as in Equation 1:

MC = med_{xi ≤ Q2 ≤ xj} h(xi, xj)    (1)

Outliers are one of the main problems when building a predictive model. Indeed, they cause data scientists to achieve suboptimal results. To solve that, we need effective methods to deal with spurious points. If it is obvious that an outlier is due to incorrectly entered or measured data, you should drop it.

3.2.4 Feature Selection

Feature selection refers to the process of obtaining a subset of an original feature set according to a certain feature selection criterion, so that the relevant features of the dataset are selected. It plays a role in compressing the data processing scale, where redundant and irrelevant features are removed (Cai et al., 2018). Feature selection techniques can pre-process learning algorithms, and good feature selection results can improve learning accuracy, reduce learning time, simplify learning results, reduce the dimensionality of the feature space, and remove redundant, irrelevant or noisy data (Ladha and Deepa, 2011).

Feature selection methods fall into three categories: filters, wrappers, and embedded/hybrid methods. Filter methods take less computational time to select the best features. As the correlation between the independent variables is not considered while selecting the features, this may lead to the selection of redundant features (Venkatesh and Anuradha, 2019). Wrappers are brute-force feature selection methods that exhaustively evaluate all possible combinations of the input features to find the best subset. Embedded/hybrid methods combine the advantages of both approaches, filters and wrappers. A hybrid approach uses both a performance evaluation function of the feature subset and an independent test (Veerabhadrappa and Rangarajan, 2010).

3.2.5 Choose Resampling Techniques for Imbalanced Data

The imbalanced data problem occurs in many real-world datasets where the class distributions are asymmetric. It is important to note that most machine learning models work best when the number of instances of each class is approximately equal (Longadge and Dongre, 2013). The imbalanced data problem causes the majority class to dominate the minority class; hence, the classifiers become more inclined toward the majority class, and their performance cannot be relied upon (Kotsiantis et al., 2006).

Many strategies have been developed to handle the imbalanced data problem. The sampling-based approach is one of the most effective. It can be classified into three categories, namely: over-sampling (Yap et al., 2014), under-sampling (Liu et al., 2008), and hybrid methods (Gulati, 2020). In this guide we recommend the use of over-sampling and under-sampling.


Over-sampling (OS). Over-sampling raises the weight of the minority class by replicating or creating new minority class samples. There are different over-sampling methods; moreover, it is worth noting that the over-sampling approach is generally applied more frequently than other approaches.

• Random Over-Sampler: this method increases the size of the dataset by repeating the original samples. The point is that the random over-sampler does not create new samples, and the variety of samples does not change (Li et al., 2013).

• SMOTE: this method is a statistical technique that increases the number of minority samples in the dataset by generating new instances. The algorithm takes samples of the feature space for each target class and its nearest neighbors, and then creates new samples that combine features of the target case with features of its neighbors. The new instances are not copies of existing minority samples (Chawla et al., 2002).

Under-sampling (US). Under-sampling is one of the most straightforward strategies to handle the imbalanced data problem. This method under-samples the majority class to balance it with the minority class. The under-sampling method is applied when the amount of collected data is sufficient. There are different under-sampling models, such as Edited Nearest Neighbors (ENN) (Guan et al., 2009), Random Under-Sampler (RUS) (Batista et al., 2004) and Tomek links (Elhassan and Aljurf, 2016), which are the most popular.

3.3 Phase 3: Building Predictive Models

This phase aims to generate predictive models and analyze their results. For this purpose we introduce the pipeline: Pipeline is a scikit-learn class (Pedregosa et al., 2011) used to sequentially apply a list of transformations and a final estimator to a dataset. Pipeline objects chain multiple estimators into a single one. This is useful since a machine learning workflow typically involves a fixed sequence of processing steps (e.g., feature extraction, dimensionality reduction, learning and making predictions), many of which perform some kind of learning. A sequence of N such steps can be combined into a pipeline if the first N − 1 steps are transformers; the last can be either a predictor, a transformer or both (Buitinck et al., 2013).

For evaluating statistical performance in the pipeline, we use a grid search with k-fold cross-validation. Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. K-fold cross-validation involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. For example, using the mean squared error as score function, MSE1 is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE1, MSE2, ..., MSEk. The k-fold cross-validation estimate is computed by averaging these values, CV(k) = (1/k)(MSE1 + ... + MSEk) (James et al., 2013).

In our case, we focus on making predictions over the dataset for classification and regression tasks; our models use the algorithms chosen by the user according to the task to be performed. In order to compare the models' performance, it is necessary to use suitable metrics. After running the pipeline, the metrics previously chosen by the user are calculated and presented to the user, with an explanation of each selected metric. The last step in this phase consists in ensuring the experiment's reproducibility, in order to verify the credibility of the proposed study. (Olorisade et al., 2017) have evaluated studies in order to highlight the difficulty of reproducing most of the works in the state of the art. Some authors have proposed basic rules for reproducible computational research, such as (Sandve et al., 2013); based on these rules, we save all the decisions made by the user, the results obtained (whether tables or graphs), and the seed and settings of the algorithms used, all in a final document. Figure 3 illustrates the activities that make up the third phase of the proposed guide.

Figure 3: Phase 3 - Building Predictive Models.

4 DSAdvisor

DSAdvisor is an advisor for non-expert users or novice data scientists, which follows the stages of the guideline proposed in this paper. DSAdvisor aims to encourage non-expert users to build machine learning models to solve predictive tasks (regression or classification), extracting knowledge from their own data repositories. This tool was developed in CSS3, HTML5, Flask (Grinberg, 2018), JavaScript


and Python (van Rossum, 1995). REFERENCES DSAdvisor is an open source tool, developed us- ing the Python programming language. So, this make Alcala-Fdez,´ J., Sanchez, L., Garcia, S., del Jesus, M. J., possible to reuse the large number of Python APIs Ventura, S., Garrell, J. M., Otero, J., Romero, C., Bac- and Toolkis currently available. Besides, DSAdvi- ardit, J., Rivas, V. M., et al. (2009). Keel: a software tool to assess evolutionary algorithms for data mining sor provides different resources to support data un- problems. Soft Computing, 13(3):307–318. derstanding and missing values detection. In addic- Basheer, I. and Hajmeer, M. (2000). Artificial neural net- tion, DSAdvisor guide the user on the task of out- works: fundamentals, computing, design, and appli- lier detection showing the number of instances, the cation. Journal of Microbiological Methods, 43(1):3 percentage of outliers found and the total of outliers. – 31. Neural Computting in Micrbiology. If the user wishes to know precisely what these val- Batista, G., Prati, ., and Monard, M.-C. (2004). A study of ues are, they can go to the outliers table option to the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6:20– check the position and value of the outliers for each 29. variable. Furthermore, the DSAdvisor tool help the Ben-Gal, I. (2005). Outlier Detection, pages 131–146. user on the feature selection task, running the follow- Springer US, Boston, MA. ing filters methods: Chi Squared, Information Gain, Brys, G., Hubert, M., and Struyf, A. (2004). A robust Mutual Info, F-Value and Gain Ratio. In order to measure of skewness. Journal of Computational and tackle the imbalanced data problem, DSAdvisor sup- Graphical Statistics, 13(4):996–1017. ports three alternatives: Oversampling, Undersam- Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., pling and Without resampling techniques. 
To assess Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., which type of problem is being addressed (classifica- Gramfort, A., Grobler, J., et al. (2013). Api design for machine learning software: experiences from the tion or regression), DSAdvisor asks some questions to scikit-learn project. arXiv preprint arXiv:1309.0238. the user. Next, based on the user responses, DSAdvi- Cai, J., Luo, J., Wang, S., and Yang, S. (2018). Feature se- sor suggests the most apropriate problem type. In case lection in machine learning: A new perspective. Neu- of choosing classification, the pre-selected algorithms rocomputing, 300:70–79. will be logistic regression, naive bayes, support vec- Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, tor machine, decision tree, and multi layer perceptron. W. P. (2002). Smote: synthetic minority over- In case of choosing regression, the pre-selected algo- sampling technique. Journal of artificial intelligence rithms will be linear regression, support vector ma- research, 16:321–357. chine, multi layer perceptron, radial basis function. Chertchom, P. (2018). A comparison study between data mining tools over regression methods: Recommenda- After this, DSAdvisor suggest the most suitable per- tion for smes. In 2018 5th International Conference formance metrics. Finally, the user has the option to on Business and Industrial Research (ICBIR), pages download the execution log of all the options selected 46–50. IEEE. previously, the results obtained whether they are ta- Cramer,´ H. (1928). On the composition of elementary er- bles or graphs, and the settings of all used algorithms. rors: First paper: Mathematical deductions. Scandi- This option ensures that the experiments performed in navian Actuarial Journal, 1928(1):13–74. the DSAdvisor tool are reproducible by other users. D’Agostino, R. B. (1970). Transformation to normality of the null distribution of g1. Biometrika, pages 679– 681. 
Demsar,ˇ J., Curk, T., Erjavec, A., Gorup, C.,ˇ Hocevar,ˇ T., 5 CONCLUSIONS AND FUTURE Milutinovic,ˇ M., Mozina,ˇ M., Polajnar, M., Toplak, M., Staric,ˇ A., et al. (2013). Orange: data mining WORKS toolbox in python. the Journal of machine Learning research, 14(1):2349–2353. In this paper, we propose a guideline to support pre- Dictionary, C. (2015). Cambridge dictionaries online. dictive tasks in data science. In addition, we present Elhassan, T. and Aljurf, M. (2016). Classification of imbal- a tool, called DSAdvisor, which following the stages ance data using tomek link (t-link) combined with ran- of the proposed guideline. DSAdvisor aims to en- dom under-sampling (rus) as a data reduction method. courage non-expert users to build machine learning Escalante, H. J. (2005). A comparison of outlier detection models to solve predictive tasks, extracting knowl- algorithms for machine learning. Programming and edge from their own data repositories. More specif- Computer Software. ically, DSAdvisor guides these professionals in pre- Grinberg, M. (2018). Flask web development: develop- ing web applications with python. ” O’Reilly Media, dictive tasks involving regression and classification. Inc.”. As future works we intent to carry out usability tests Guan, D., Yuan, W., Lee, Y.-K., and Lee, S. (2009). Nearest and interviews with non-expert users, in order to eval- neighbor editing aided by unlabeled data. Information uate DSAdvisor. Sciences, 179(13):2273–2282.

254 A Practical Guide to Support Predictive Tasks in Data Science

Gulati, P. (2020). Hybrid resampling technique to tackle the imbalanced classification problem.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Hasim, N. and Haris, N. A. (2015). A study of open-source data mining tools for forecasting. In Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, pages 1–4.

Hirakata, V. N., Mancuso, A. C. B., and Castro, S. M. d. J. (2019). Teste de hipóteses: perguntas que você sempre quis fazer, mas nunca teve coragem [Hypothesis tests: questions you always wanted to ask but never dared to]. Vol. 39, n. 2, 2019, p. 181–185.

Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12):5186–5201.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, volume 112. Springer.

Jovic, A., Brkic, K., and Bogunovic, N. (2014). An overview of free software tools for general data mining. In 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pages 1112–1117. IEEE.

Kotsiantis, S., Kanellopoulos, D., Pintelas, P., et al. (2006). Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1):25–36.

Ladha, L. and Deepa, T. (2011). Feature selection methods and algorithms. International Journal on Computer Science and Engineering.

Li, H., Li, J., Chang, P.-C., and Sun, J. (2013). Parametric prediction on default risk of Chinese listed tourism companies by using random oversampling, isomap, and locally linear embeddings on imbalanced samples. International Journal of Hospitality Management, 35:141–151.

Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318):399–402.

Liu, H., Shah, S., and Jiang, W. (2004). On-line outlier detection and data cleaning. Computers & Chemical Engineering, 28(9):1635–1647.

Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550.

Longadge, R. and Dongre, S. (2013). Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707.

Luo, W., Phung, D., Tran, T., Gupta, S., Rana, S., Karmakar, C., Shilton, A., Yearwood, J., Dimitrova, N., Ho, T. B., et al. (2016). Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. Journal of Medical Internet Research, 18(12):e323.

Melo, C. S., da Cruz, M. M. L., Martins, A. D. F., Matos, T., da Silva Monteiro Filho, J. M., and de Castro Machado, J. (2019). A practical guide to support change-proneness prediction. In ICEIS (2), pages 269–276.

Olorisade, B. K., Brereton, P., and Andras, P. (2017). Reproducibility in machine learning-based studies: An example of text mining.

Ozdemir, S. (2016). Principles of Data Science. Packt Publishing Ltd.

Pearson, K. (1895). Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58:240–242.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Provost, F. and Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1):51–59.

Sandve, G. K., Nekrutenko, A., Taylor, J., and Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10):e1003285.

Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611.

Smirnov, N. (1948). Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2):279–281.

Spearman, C. (1961). The proof and measurement of association between two things.

van Rossum, G. (1995). Python tutorial. Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam.

Veerabhadrappa and Rangarajan, L. (2010). Bi-level dimensionality reduction methods using feature selection and feature extraction. International Journal of Computer Applications, 4.

Venkatesh, B. and Anuradha, J. (2019). A review of feature selection and its methods. Cybernetics and Information Technologies, 19(1):3–26.

Yap, B. W., Abd Rani, K., Abd Rahman, H. A., Fong, S., Khairudin, Z., and Abdullah, N. N. (2014). An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), pages 13–22. Springer.
