Integration and Automation of Data Preparation and Data Mining

Integration and Automation of Data Preparation and Data Mining Shrikanth Narayanan Ayush Jaiswal Yao-Yi Chiang Spatial Sciences Institute Department of Computer Science Spatial Sciences Institute Department of Computer Science University of Southern California University of Southern California University of Southern California Email: [email protected] Email: [email protected] Email: [email protected] Yanhui Geng Craig A. Knoblock Pedro Szekely Huawei Technologies Information Sciences Institute Information Sciences Institute Email: [email protected] Department of Computer Science Department of Computer Science University of Southern California University of Southern California Email: [email protected] Email: [email protected] Abstract—Data mining tasks typically require significant effort into a shared domain model, which supports the integrated in data preparation to find, transform, integrate and prepare reasoning and processing of data across many sources. Once the data for the relevant data mining tools. In addition, the a user models the data sources, Karma can automatically work performed in data preparation is often not recorded and is converts the raw data into any of a variety of formats. difficult to reproduce from the raw data. In this paper we present an integrated approach to data preparation and data mining In this paper we present an integrated approach built that combines the two steps into a single integrated process on Karma for information integration and data mining. Our and maintains detailed metadata about the data sources, the approach combines the steps in data preparation and data steps in the process, and the resulting learned classifier produced mining into a single integrated process and maintains detailed from data mining algorithms. We present results on an example scenario, which shows that our approach provides significant metadata about the data sources. We show that a user can reduction in the time in takes to perform a data mining task. use our approach to rapidly clean, normalize, restructure and integrate data collected from multiple mobile sensors, and then apply machine learning algorithms from a library of algorithms to perform a data mining task. We demonstrate I. INTRODUCTION these capabilities using a data mining task of predicting the While the quality and ease of use of data mining libraries mode of transport of a user given the sensor data collected via such as in R [1] and Weka [2] is excellent, users must spend his/her cellphone. The challenge here is how to perform a data significant effort to prepare raw data for use in these libraries. mining task, such that there is an sizable reduction in overall In practice, the data preparation task is separate from the data user time and effort in preparing the data set from its raw form mining process and very often requires data source dependent and invoking the prediction service. We compare our approach and repetitive manual work, such as applying Structured Query with a baseline approach that uses Microsoft Excel for most Language (SQL) statements to extract and aggregate database of the data preparation tasks. We show how Karma supports records, using Microsoft Excel to clean and normalize datasets the required steps for data preparation and data mining, and of structured text files, and writing scripts to apply complex yields a significant reduction in time and effort. transformations (e.g., in Python). This practice does not scale when the data size becomes larger and the number of data The remainder of this paper is organized as follows. sources increases. Section 2 describes a motivating data mining problem on prediction of mode of transport from mobile sensor data. In a previous work, Knoblock et al. [3] developed an Section 3 presents our integrated approach to data preparation interactive approach to extracting, modeling, and publishing and data mining. Section 4 describes the steps of using Excel data in a system called Karma.1 Karma allows an end-user to to prepare raw data for data mining. Section 5 reports on our solve their own integration tasks without having to program experimental results. Section 6 discusses the related work, and those tasks. Karma can semi-automatically build a semantic Section 7 presents discussion and future work. description (a model) of a data source. This semantic description makes it possible to rapidly convert a set of sources (rep- resented in Extensible Markup Language (XML), Javascript II. MOTIVATING PROBLEM Object Notation (JSON), structured text files, or databases) As an example of a data mining task, we consider the This research was supported in part by a gift from Huawei problem of predicting the mode of transport of a mobile user. 1See http://www.isi.edu/integration/karma for a video demonstration and Mode of transport prediction is an interesting problem in data details on Karma. The software is available as open source (Apache 2 License). mining. It helps in providing contextual information about Trees algorithm with features such as the Discrete Fourier Data%Collec3on%for%3%days% Transform (DFT) energy coefficients at 1Hz, 2Hz, and 3Hz of acceleration magnitude, the accelerometer variance, and the Sensor%data%for% Transporta3on% Accelerometer%and%GPS% Labels% speed recorded by the GPS sensor. The features are derived from data collected from the accelerometer and the GPS sensors on a mobile device. Our method of predicting the mode of transport is similar to theirs as we use the Support Karma% Vector Machine (SVM) algorithm with similar features. We use GPS accuracy as a feature instead of accelerometer variance, Modeling%Data%and%Services% keeping the rest of the features the same. AddDFT%Service% To acquire training and testing data, we use an applica- Loca3onProbe% Labeling%Service% tion that collects accelerometer and GPS readings on mobile AccelerometerSensor% SVM%Training%Service% AddDFT%service%output% phones. We asked users to record their mode of transportation SVM%Tes3ng%Service% while using the application in order to acquire the data to build the prediction model. The data mining system derives the aforementioned features from the sensor data and merges Data%Prepara3on% the mode of transport labels noted by the user with the features for mining. Process%Loca3on% Invoke%addDFT% Probe%data% service% The collected data is hardly ever ready to use directly in real life scenarios. We have to prepare it for mining through multiple transformation steps, such as dealing with missing Join%addDFT% values, discretization, and normalization. Moreover, even if Load%Accelerometer% output%and% Sensor%data% Loca3on%Probe% the data is in tabular form, in many cases, we cannot use data% the columns as features for data mining libraries directly. We have to compute useful features by using values in multiple Pytransform%for% columns. In the example case of mode of transport prediction, Accelera3on% Filter%rows%that% we calculate acceleration magnitude from the acceleration Magnitude%% cannot%be%joined% recorded by the accelerometer along the X, Y, and Z dimen- sions. We then calculate DFT energy coefficients at 1Hz, 2Hz, and 3Hz of the acceleration magnitude for every 1 second time Extract%3mestamp% Add%mode%of% window. We then discretize the data, which contains multiple and%magnitude% transporta3on% entries every second, to the nearest second. columns% labels% Many data mining tasks use data from multiple sources that have to be merged in order to derive features from them, such as those performed by large organizations on data collected from different branches and departments within them. The data Data%Mining% used by the algorithm that we employ in our example case is derived from three sources, i.e., the accelerometer sensor, the Invoke%SVM% Invoke%SVM% GPS sensor, and the users mode of transportation notes. We Training%service% Tes3ng%service% merge the data by making use of timestamps contained in all three sources. Train%&%Update% In general, after the data is prepared, machine learning SVM%model% algorithms are applied to it to train the prediction model and to test it, or to perform cross-validation. In our case, we train a prediction model using the SVM algorithm and test it. The final set of features consists of DFT energy coefficients at 1Hz, SVM%Training% SVM%Predic3on% 2Hz, and 3Hz, speed recorded by the GPS sensor, and GPS Summary% output% accuracy. These parts of the entire data mining system motivate the development of a complete end-to-end system that offers Fig. 1: System block diagram complex data transformations (as in-built features, extensions, or services), data merging capabilities, invocation of various machine learning algorithms for application and comparison, and display of results in a meaningful manner. the user that can be used in building intelligent smartphone applications. III. APPROACH Reddy et al. [4] describe such a method for predicting the In general, a data mining task will have the following steps mode of transport of a mobile user by applying the Decision - data collection, data integration and cleaning, transformation and selection, data mining and evaluation [5]. It has been B. Processing with Karma estimated that data preparation - integration, cleaning, selection The steps using Karma are divided into two parts: and transformation, accounts for a significant portion of the time spent on a data mining project. In our approach to 1) Karma setup: These are tasks that are performed only the demonstrate the end-to-end process of data preparation and first time. They include modeling the three web services, data mining, we select the task of predicting the mode of trans- addDFT, getLabel, and svmTraining, which are explained port for a user, given the sensor data collected by the mobile later, as well as modeling the two raw datasets Accelerom- phones. Our approach enables users to integrate data from a eterSensor and LocationProbe. All transformations and variety of sources without exposing them to the underlying processing done here is recorded by Karma and can be complexity of data integration and exchange.

Integration and Automation of Data Preparation and Data Mining

Mapreduce-Based D ELT Framework to Address the Challenges of Geospatial Big Data

A Survey on Data Collection for Machine Learning a Big Data - AI Integration Perspective

Best Practices in Data Collection and Preparation: Recommendations For

The Risks of Using Spreadsheets for Statistical Analysis Why Spreadsheets Have Their Limits, and What You Can Do to Avoid Them

Data Migration (Pdf)

Advanced Data Preparation for Individuals & Teams

Preparation for Data Migration to S4/Hana

Preparing a Data Migration Plan a Practical Introduction to Data Migration Strategy and Planning

Analyzing the Web from Start to Finish Knowledge Extraction from a Web Forum Using KNIME

Improving Data Preparation for Business Analytics Applying Technologies and Methods for Establishing Trusted Data Assets for More Productive Users

Web Data Integration: a New Source of Competitive Advantage

Automated Machine Learning for Healthcare and Clinical Notes Analysis