Intuitive Time-Series-Analysis-Toolbox for Inexperienced Data Scientists
2020 International Conference on Computational Science and Computational Intelligence (CSCI)
978-1-7281-7624-6/20/$31.00 ©2020 IEEE, DOI 10.1109/CSCI51800.2020.00075

Felix Pistorius∗, Daniel Baumann∗, Luca Seidel‡ and Eric Sax∗
Institute for Information Processing Technologies (ITIV), Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany
∗Email: {felix.pistorius, d.baumann, eric.sax}@kit.edu
‡[email protected]

Abstract: There are many different procedures for carrying out a data mining project. Depending on the application, various methods have to be chosen, which is normally done by experts. For inexperienced users without machine learning experience, it is difficult to analyse their data without help. To simplify access to data mining, particularly for multivariate time series analysis, we propose an intuitive toolbox that guides the user step by step through the data mining process. Furthermore, it supports inexperienced users with pre-suggested methods in every data mining process step. To this end, specifications for such a toolbox are defined and a prototype of the toolbox is presented. The work steps, which are based on the knowledge discovery process, can be adapted and changed to suit different application scenarios.

Index Terms: data mining application, intuitive software, KDD, multivariate time series

I. INTRODUCTION AND MOTIVATION

The advancing change to Industry 4.0 and the increasing connectivity of technical systems that comes with it provide access to a large amount of data. Many sectors, such as insurance or the automotive industry, hope to gain advantages by saving costs and making business-critical decisions faster and more efficiently. Moreover, new markets can be opened up [1]. Finding patterns, trends and associations in the collected data plays a central role in achieving such goals. Useful information only emerges when knowledge is extracted from the collected data. Data mining, the core of knowledge discovery processes, provides access to this knowledge. To study the data, algorithms are derived and models are developed to uncover previously unknown patterns. The resulting model is used to understand phenomena in the data. Predictions can also be derived from it [2].

Due to the increasing development in the research field of "Big Data", the number of different methods and algorithms is rising. In the area of classification analysis, for example, more than 150 different algorithms are available [3]. Every method or algorithm has its advantages and disadvantages depending on the purpose of use. Furthermore, steps like preprocessing, transformation, evaluation and visualization are needed to perform a proper analysis. Therefore, users have to choose methods and algorithms separately for every application. Since the introduction of the Knowledge Discovery in Databases process of Fayyad [4], many different Knowledge Discovery and Data Mining (KDDM) process models have been developed based on it [5]. However, these models offer only a coarse-grained structure of how a process can be implemented and do not specify an explicit sequence of methods [6]. Therefore, standard process models for data mining and knowledge discovery (e.g. CRISP-DM [7]) require experience and educated guesses.

Realizing such process models is therefore a task for data science experts. These experts are highly sought after, so not every company is able to hire them for data mining projects [8]. Nevertheless, companies push for the use of data mining in their processes. As a result, the users who work on data mining projects for the first time are often inexperienced: they are neither familiar with data mining and KDDM processes, nor have they carried out a whole data mining project. Without in-depth experience, it is difficult to know which methods are the most appropriate for each KDDM process step. A survey by the Fraunhofer Institute for Manufacturing Engineering and Automation (Fraunhofer IPA) has shown that more than 60% of the interviewed participants (employees from technical professions) desire simplified software solutions to perform data mining without specific knowledge [9].

There is a lack of an intuitive and easy-to-understand KDDM tool, particularly for multivariate time series, as they occur in the monitoring of machines using a large number of sensors. This is because common data mining applications do not consider and compare time series as a whole, but interpret each data point independently. Such a tool should guide the user step by step through all tasks of a data mining process, such as preprocessing, transformation, data mining and visualization. In doing so, users should be able to apply their expertise in a targeted way, because they have only a limited understanding of data mining but precise insight into and a deep understanding of the problem. We focus on a software solution that guides the user through a defined data mining workflow. Based on user input, appropriate algorithms and methods for the whole KDDM process are applied.

In the first part of section II, the already established software solutions are introduced. In the second part, the specifications for an intuitive Time-Series-Analysis-Toolbox are defined. Based on these specifications, the developed concept for such a toolbox is presented in sections III and IV. In section V, the toolbox itself and its achieved results are discussed.

II. RELATED TOOLBOX AND ITS LIMITATIONS FOR INEXPERIENCED KDDM USERS

A. The most used related KDDM-Toolbox

Gartner's "Magic Quadrant for Data Science and Machine Learning Platforms" is a visualization of the annual results of market research conducted by the company Gartner Inc. and is divided into four "quadrants" [10]. The further right the vendor of a KDDM framework is positioned, the more complete its vision for supporting the user in all aspects of KDDM. The higher up a framework is positioned, the more suitable it is for the intended analysis during execution. Gartner includes several subcategories for this overarching criterion, such as market understanding, bundling of expertise, or global marketing. The leading companies include SAS, Alteryx, and MathWorks [10].

SAS and Alteryx offer software solutions for the analysis of business data from financial services and retail. Technical time series, such as sensor data, cannot be interpreted as such. With the help of a code-free environment, data mining processes can be created via a flow chart. The programs are designed to serve the demands of analysts and machine learning experts and offer only limited help for inexperienced users [11] [12].

With Matlab, MathWorks offers a widely used platform for analytical calculations [13]. Matlab can be extended through toolboxes. Concerning data mining, the Neural Network Toolbox, the Optimization Toolbox, the Statistics Toolbox, and the Curve Fitting Toolbox are the most popular. With these extensions, Matlab can be used to carry out data mining processes. However, performing data mining tasks requires machine learning experience and programming knowledge.

A software solution is required that makes data mining accessible to domain experts without machine learning experience. Furthermore, such a solution should offer comprehensive analysis methods for technical time series with physical reference.

B. Focus and needs of a tool for KDDM inexperienced users

As shown in section II-A, there are potential software solutions, especially

and accessible knowledge from data [4], especially time series.

2) Workflow: An optimal workflow is essential for the success of a knowledge discovery process. Knowledge discovery process models only provide a framework for which tasks should be performed in which order, but not the methods themselves that are used for the analysis. An optimal knowledge discovery process is realized when the right methods and parameters have been chosen considering the underlying data and the goals of the project. The supporting application should suggest suitable methods and parameters to the user in the context of the existing data and user input. First-time users can be supported even more effectively by an automatic selection of parameters and methods without user input. An optimal workflow is thus composed of a manual or automated run, in which parameters and methods are suggested based on the data.

3) Visual representation: Patterns and associations found in data are only useful if they can be understood [4]. Therefore, a proper visual presentation of results is important. Especially with multivariate data mining, the representation of patterns is often difficult. Accordingly, patterns and results have to be visualized comprehensibly using appropriate methods. When users can understand and track the results, it will be easier for them to become familiar with machine learning. A modern and intuitive design of the user interface is also essential for a successful workflow, as it contributes to an intuitive and therefore more effective operation.

4) Scope: A large number of methods are required to view data from many different perspectives. As already mentioned, there are a large number of different methods and algorithms. An intuitive toolbox should include a sufficient spectrum of analysis tools, but not overload the user with an oversupply, because an excessive number inhibits an intuitive workflow. A compromise between complexity and applicability has to be found in order to support the user in the best possible way.