A Generic Data Analysis Application

Knowing: A Generic Data Analysis Application T. Bernecker, F. Graf, H.-P. Kriegel, C. Türmer, D. Dill N. Seiler Heinz-Nixdorf-Lehrstuhl für Medizinische Institute for Informatics, Elektronik, TU München Ludwig-Maximilians-Universität München Theresienstraße 90 / N3, 80333 München Oettingenstraße 67, 80538 München {tuermer,dill}@tum.de {bernecker,graf,kriegel}@dbs.ifi.lmu.de [email protected] ABSTRACT With the use of a standardized plug-in system like OSGi1, Extracting knowledge from data is, in most cases, not re- Java Plugin Framework (JPF) or Java Simple Plugin Frame- stricted to the analysis itself but accompanied by prepara- work (JSPF), each implementation of an algorithm does tion and post-processing steps. Handling data coming di- not have to be specifically adapted to the according frame- rectly from the source, e.g. a sensor, often requires pre- work. With Knowing (Knowledge Engineering) we provide conditioning like parsing and removing irrelevant informa- a framework that addresses this shortcoming by bridging tion before data mining algorithms can be applied to analyze the gap between the data mining process and rapid proto- the data. Stand-alone data mining frameworks in general do type development. We achieve this by using a standardized not provide such components since they require a specified plug-in system based on OSGi, so that algorithms can be input data format. Furthermore, they are often restricted packed in OSGi resource bundles. This offers the possibil- to the available algorithms or a rapid integration of new ity to either create new algorithms as well as to integrate algorithms for the purpose of quick testing is not possible. and exchange existing algorithms from common data min- To address this shortcoming, we present the data analysis ing frameworks. The advantage of these OSGi compliant framework Knowing, which is easily extendible with addi- bundles is that they are not restricted for use in Knowing tional algorithms by using an OSGi compliant architecture. but can be used in any OSGi compliant architecture. In this demonstration, we apply the Knowing framework This demonstration includes the following contributions: to a medical monitoring system recording physical activity. We use the data of 3D accelerometers to detect activities • a simple, yet powerful graphical user interface (GUI), and perform data mining techniques and motion detection to classify and evaluate the quality and amount of physical • a bundled embedded database as data storage, activities. In the presented use case, patients and physicians can analyze the daily activity processes and perform • an extensible data mining functionality, long term data analysis by using an aggregated view of the results of the data mining process. Developers can integrate • extension support for algorithms addressing different and evaluate newly developed algorithms and methods for use cases and data mining on the recorded database. • a generic visualization of the results of the data mining 1. INTRODUCTION process. Supporting the data mining process by tools was and still is a very important step in the history of data mining. By Section 2 provides a short overview of related work. De- the support of several tools like ELKI [1], MOA [2], Weka tails of the architecture of Knowing will be given in Sec- [3] or RapidMiner [4], scientists are nowadays able to ap- tion 3. In our scenario, described in Section 4, we will ply a diversity of well-known and established algorithms on present the MedMon system which itself extends Knowing. their data for quick comparison and evaluation. In cases In the developer stage, we can easily switch between the where the requirements enforce a rapid development from scientific data mining view and the views which will be pre- data mining to a representative prototype, these unstan- sented to the end users later on. As MedMon is intended to dardized plug-in systems can cause a significant delay which be used by different target groups (physicians and patients), is caused by the time needed to incorporate the algorithms. it is desired to use a single base system for all views and only deploy different user interface bundles for each target group. This way, the data mining process can seamlessly be integrated into the development process by reducing long Permission to make digital or hard copies of all or part of this work for term maintenance to a minimum, as only a single system personal or classroom use is granted without fee provided that copies are with different interface bundles has to be kept up to date not made or distributed for profit or commercial advantage and that copies and synchronized instead of a special data mining tool, a bear this notice and the full citation on the first page. To copy otherwise, to physician tool and a patient tool. Section 5 describes the republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. planned demo tour in more detail. EDBT 2012, March 26–30, 2012, Berlin, Germany. 1 Copyright 2012 ACM 978-1-4503-0790-1/12/03 ...$10.00 OSGi: http://en.wikipedia.org/wiki/OSGI 630 2. RELATED WORK between completely different systems, but also if the GUI In the past years, several data mining frameworks like should be changed from a data mining view to a prototype ELKI [1], MOA [2], WEKA [3], RapidMiner [4] or R [5] view for the productive system. This can be done by ei- have been presented and established (among many others). ther using the resource bundles containing the DPUs, or by Although all frameworks perform data mining in their core, directly extending Knowing itself. they all have different target groups: In the current implementation, the Knowing framework WEKA and MOA provide both algorithms and GUIs. By is based on the established and well-known Eclipse RCP 4 using these GUIs, the user can analyze data sets, config- system and uses the standardized OSGi architecture which ure and test algorithms and visualize the outcome of the allows the composition of different bundles. This brings the according algorithm for evaluation purposes without need- great advantage that data miners and developers can take ing to do some programming. As the GUI cannot satisfy two different ways towards their individual goal: If they start all complex scenarios, the user still has the possibility to a brand new RCP-based application, they can use Knowing use the according APIs to build more complex scenarios in out of the box and create the application directly on top of his own code. RapidMiner integrates WEKA and provides Knowing. The more common case might be that an RCP- powerful analysis functionalities for analysis and reporting or OSGi-based application already exists and should only which are not covered by the WEKA GUI itself. RapidMiner be extended with data mining functionality. In this case, also provides an improved GUI and also defines an API for only the appropriate bundles are taken from Knowing and user extensions. Both RapidMiner and WEKA provide some integrated into the application. support to external databases. The aim of ELKI is to pro- In the following, we describe the architecture of the Know- vide an extensible framework for different algorithms in the ing framework which consists of a classical three-tier archi- fields of clustering, outlier detection and indexing with the tecture comprising data storage tier, data mining tier and main focus on the comparability of algorithm performance. GUI tier, where each tier can be integrated or exchanged Therefore, single algorithms are not extensively tuned to using a modular concept. performance but tuning is done on the application level for all algorithms and index structures. Like the other frame- 3.1 Data Storage works, ELKI also provides a GUI, so that programming is The data storage tier of Knowing provides the function- not needed for the most basic tasks. ELKI also provides ality and abstraction layers to access, import, convert and an API that supports the integration of user-specified algo- persist the source data. The data import is accomplished by rithms and index structures. an import wizard using service providers, so that importing All the above frameworks provide support for the process data is not restricted to a certain format. of quick testing, evaluating and reporting and define APIs in Applying the example of the MedMon application, a ser- different depths. Thus, scientists can incorporate new algo- vice provider is registered that reads binary data from a rithms into the systems. However, none of them makes use 3D accelerometer [7] which is connected via USB. The data of a standardized plug-in system, so that each implementa- storage currently defaults to an embedded Apache Derby 5 tion of an algorithm is specifically adapted to the according database which is accessed by the standardized Java Per- framework without being interchangeable. sistence API (JPA & EclipseLink). This has the advantage R provides a rich toolbox for data analysis. Also there that the amount of data being read is not limited by the are a lot of plugins which extend the functionality of R. main memory of the used workstation and that the user does Nevertheless, all advantages and disadvantages like for the not have to set up a separate database server on his own. above frameworks also hold for R, especially the lack of a However, by using the JPA, there is the possibility to use standardized plug-in system. more than 20 elaborated and well-known database systems which are supported by this API6. An important feature in the data storage tier arises from the possibility to use exist- 3. ARCHITECTURE ing data to support the evaluation of newly recorded data, Applying a standardized plug-in system like OSGi, the e.g.

A Generic Data Analysis Application

Feature Selection for Gender Classification in TUIK Life Satisfaction Survey A

Effect of Distance Measures on Partitional Clustering Algorithms

An Analysis on the Performance of Rapid Miner and R Programming Language As Data Pre-Processing Tools for Unsupervised Form of Insurance Claim Dataset

Açık Kaynak Kodlu Veri Madenciliği Yazılımlarının Karşılaştırılması

Industry Applications of Machine Learning and Data Science

BETULA: Numerically Stable CF-Trees for BIRCH Clustering?

Mining the Web of Linked Data with Rapidminer

Rapidminer Operator Reference Manual ©2014 by Rapidminer

Research Techniques in Network and Information Technologies, February

Machine Learning for Improved Data Analysis of Biological Aerosol Using the WIBS

ML-Flex: a Flexible Toolbox for Performing Classification Analyses

Machine Learning