KNIME - The Konstanz Information Miner: Version 2.0 and Beyond

Michael R. Berthold, Nicolas Cebron, Fabian Dill, Thomas R. Gabriel, Tobias Kötter, Thorsten Meinl, Peter Ohl, Kilian Thiel and Bernd Wiswedel
University of Konstanz
Nycomed Chair for Bioinformatics and Information Mining
Box 712, 78457 Konstanz, Germany
[email protected]

ABSTRACT
The Konstanz Information Miner is a modular environment which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables simple integration of new algorithms and tools as well as data manipulation or visualization methods in the form of new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture, briefly sketch how new nodes can be incorporated, and highlight some of the new features of version 2.0.

1. INTRODUCTION
The need for modular data analysis environments has increased dramatically over the past years. In order to make use of the vast variety of data analysis methods available, it is essential that such an environment is easy and intuitive to use, allows for quick and interactive changes to the analysis process, and enables the user to visually explore the results. To meet these challenges, data pipelining environments have gathered considerable momentum over the past years. Some of today's well-established (but unfortunately also commercial) data pipelining tools are InforSense KDE [6], Insightful Miner [7], and Pipeline Pilot [8], to name just three examples. These environments allow the user to visually assemble and adapt the analysis flow from standardized building blocks, which are then connected through pipes carrying data or models. An additional advantage of these systems is the intuitive, graphical way to document what has been done.

KNIME, the Konstanz Information Miner, provides such a pipelining environment. Figure 1 shows a screenshot of the standard KNIME workbench with a small example data analysis workflow. In the center, a flow reads in data from two sources and processes it in several parallel analysis flows, consisting of preprocessing, modeling, and visualization nodes. On the left, a repository of nodes is shown. From this large variety of nodes, one can select data sources, data preprocessing steps, model building algorithms, as well as visualization tools, and drag them onto the workbench, where they can be connected to other nodes. The ability to have all views interact graphically (visual brushing) creates a powerful environment to visually explore the data sets at hand. KNIME is written in Java and its graphical workflow editor is implemented as an Eclipse [9] plug-in. It is easy to extend through an open API and a data abstraction framework, which allows for new nodes to be quickly added in a well-defined way.

In this paper - which is based on an earlier publication [1] concentrating on KNIME 1.3 - we describe the internals of KNIME in more detail, with emphasis on the new features in KNIME 2.0. More information as well as downloads can be found at http://www.knime.org. Experimental extensions are made available at the KNIME Labs pages (http://labs.knime.org).

2. OVERVIEW
In KNIME, the user models workflows, which consist of nodes that process data transported via connections between those nodes. A flow usually starts with a node that reads in data from some data source, usually text files, although databases can also be queried by special nodes. Imported data is stored in an internal table-based format consisting of columns with a certain (extendable) data type (integer, string, image, molecule, etc.) and an arbitrary number of rows conforming to the column specifications. These data tables are sent along the connections to other nodes that modify, transform, model, or visualize the data. Modifications can include handling of missing values, filtering of columns or rows, oversampling, partitioning of the table into training and test data, and many other operations. Following these preparatory steps, predictive models are built with machine learning or data mining algorithms such as decision trees, Naive Bayes classifiers or support vector machines. For inspecting the results of an analysis workflow, numerous view nodes are available, which display the data or the trained models in diverse ways.
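To make this table-based format concrete, the following minimal sketch shows how a node could inspect such a table through the public org.knime.core.data interfaces. The class and method names follow the KNIME 2.x API, but the details should be read as illustrative rather than definitive, as exact signatures may vary between versions.

```java
import org.knime.core.data.DataCell;
import org.knime.core.data.DataRow;
import org.knime.core.data.DataTable;
import org.knime.core.data.DataTableSpec;

/**
 * Minimal sketch: dump the column metadata and cell contents of a
 * KNIME data table. Names follow the public KNIME 2.x API.
 */
public final class TableDump {

    public static void dump(final DataTable table) {
        // The spec describes each column: its name plus an (extendable) data type.
        DataTableSpec spec = table.getDataTableSpec();
        for (int i = 0; i < spec.getNumColumns(); i++) {
            System.out.println(spec.getColumnSpec(i).getName() + " : "
                    + spec.getColumnSpec(i).getType());
        }
        // Rows conform to the spec; each row carries a unique row key,
        // which is also the basis for hiliting (described below).
        for (DataRow row : table) {
            StringBuilder line = new StringBuilder(row.getKey().toString());
            for (DataCell cell : row) {
                line.append('\t').append(cell.isMissing() ? "?" : cell.toString());
            }
            System.out.println(line);
        }
    }
}
```

Because the type system is extendable, the same loop works unchanged for integer, string, image or molecule columns; only the renderers and comparators attached to each type differ.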
In contrast to many other workflow or pipelining tools, nodes in KNIME first process the entire input table before the results are forwarded to successor nodes. The advantages are that each node stores its results permanently, and thus workflow execution can easily be stopped at any node and resumed later on. Intermediate results can be inspected at any time, and new nodes can be inserted and may use already created data without preceding nodes having to be re-executed. The data tables are stored together with the workflow structure and the nodes' settings.

Figure 1: The KNIME workbench with a small example workflow.

One of KNIME's key features is hiliting. In its simplest form, it allows the user to select and highlight several rows in a data table, and the same rows are also highlighted in all other views that show the same data table (or at least the hilited rows). This type of hiliting is simply accomplished by using the 1:1 correspondence between the tables' unique row keys. However, there are several nodes that change the input table's structure, and yet there is still some relation between input and output rows. A good example of such a 1-to-n relation is a clustering algorithm: the node's input contains the training (or test) patterns, its output contains the cluster prototypes, and each of the clusters covers several input patterns. By hiliting one or more clusters in the output table, all input patterns which are part of those clusters are hilited in the input table. Similar translations are, of course, also possible for other summarizing models: branches/leaves of a decision tree, frequent patterns, discriminative molecular fragments, to name just three examples.
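The 1-to-n hilite translation can be illustrated with a small, self-contained sketch. KNIME itself ships dedicated event-based classes for this purpose; the class and method names below are hypothetical and only mirror the underlying mapping idea, assuming string row keys.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative (non-KNIME) sketch of 1-to-n hilite translation:
 * hiliting a cluster prototype propagates to all input patterns
 * covered by that cluster.
 */
public final class HiliteTranslation {

    // cluster row key -> row keys of the input patterns it covers
    private final Map<String, Set<String>> clusterToPatterns = new HashMap<>();
    // currently hilited rows of the input table
    private final Set<String> hilitedPatterns = new HashSet<>();

    public void addMapping(final String clusterKey, final Set<String> patternKeys) {
        clusterToPatterns.put(clusterKey, new HashSet<>(patternKeys));
    }

    /** Hiliting a cluster hilites every input pattern it covers. */
    public void hiliteCluster(final String clusterKey) {
        hilitedPatterns.addAll(
                clusterToPatterns.getOrDefault(clusterKey, Collections.emptySet()));
    }

    public Set<String> getHilitedPatterns() {
        return Collections.unmodifiableSet(hilitedPatterns);
    }
}
```

For the simple 1:1 case no translation is needed at all: views showing the same table agree on the unique row keys, so a selection made in one view can be applied directly in all others. The same mapping idea carries over to decision tree branches or frequent patterns, where one model element again covers many input rows.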
One of the important design decisions was to ensure easy extensibility, so that other users can add functionality, usually in the form of new nodes (and sometimes also data types). This has already been done by several commercial vendors, but also by other university groups and open source programmers. The usage of Eclipse as the core platform means that contributing nodes in the form of plugins is a very simple procedure. The official KNIME website offers several extension plugins for business intelligence and reporting via BIRT [2], statistical analysis with R [4], or extended machine learning capabilities from Weka [5], among many others.

3. ARCHITECTURE
The architecture of KNIME was designed with three main principles in mind:

• Visual, interactive framework: Data flows should be combined by simple drag&drop from a variety of processing units. Customized applications can be modeled through individual data pipelines.

• Modularity: Processing units and data containers should not depend on each other in order to enable easy distribution of computation and allow for independent development of different algorithms. Data types are encapsulated, that is, no types are predefined; new types can easily be added, bringing along type-specific renderers and comparators. New types can be declared compatible to existing types.

• Easy extensibility: It should be easy to add new processing nodes or views and distribute them through a simple plugin mechanism without the need for complicated install/deinstall procedures.

In order to achieve this, a data analysis process consists of a pipeline of nodes, connected by edges that transport either data or models; a minimal node implementation is sketched at the end of this section. Each node processes the arriving data and/or model(s) and produces results on its outputs when requested. Figure 2 schematically illustrates this process. The type of processing ranges from basic data operations such as filtering or merging, through simple statistical functions such as computations of mean, standard deviation or linear regression coefficients, to computation-intensive data modeling operators (clustering, decision trees, neural networks, to name just a few). In addition, most of the modeling nodes allow for an interactive exploration of their results through accompanying views.

Figure 2: A schematic for the flow of data and models in a KNIME workflow.

Figure 3: A diagram of the data structure and the main
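The execution model described in this section - a node consumes its complete input tables and hands fully materialized result tables back to the framework, which persists them - is easiest to see in a node implementation. The following pass-through node is a minimal sketch against the public NodeModel API of the KNIME 2.x series; the set of overridden methods and their signatures are reproduced from memory and may differ slightly between versions.

```java
import java.io.File;

import org.knime.core.data.DataTableSpec;
import org.knime.core.node.BufferedDataTable;
import org.knime.core.node.ExecutionContext;
import org.knime.core.node.ExecutionMonitor;
import org.knime.core.node.InvalidSettingsException;
import org.knime.core.node.NodeModel;
import org.knime.core.node.NodeSettingsRO;
import org.knime.core.node.NodeSettingsWO;

/**
 * Minimal pass-through node model, sketched against the KNIME 2.x
 * NodeModel API. A real node would also provide a factory and,
 * optionally, a dialog and views.
 */
public class PassThroughNodeModel extends NodeModel {

    public PassThroughNodeModel() {
        super(1, 1); // one data input port, one data output port
    }

    /**
     * Called when the node is executed: the entire input table is
     * available, and the returned tables are persisted by the framework.
     */
    @Override
    protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
            final ExecutionContext exec) throws Exception {
        exec.setProgress(0.0, "Forwarding input table");
        return new BufferedDataTable[]{inData[0]}; // pass data through unchanged
    }

    /** Propagates the table structure so downstream nodes can configure. */
    @Override
    protected DataTableSpec[] configure(final DataTableSpec[] inSpecs)
            throws InvalidSettingsException {
        return inSpecs;
    }

    // This stateless example has no settings or internals to manage.
    @Override protected void reset() { }
    @Override protected void saveSettingsTo(final NodeSettingsWO settings) { }
    @Override protected void validateSettings(final NodeSettingsRO settings)
            throws InvalidSettingsException { }
    @Override protected void loadValidatedSettingsFrom(final NodeSettingsRO settings)
            throws InvalidSettingsException { }
    @Override protected void loadInternals(final File dir, final ExecutionMonitor exec) { }
    @Override protected void saveInternals(final File dir, final ExecutionMonitor exec) { }
}
```

Because execute only returns once the complete output table exists, the framework can store that table alongside the workflow, which is exactly what enables stopping, inspecting and resuming a flow at any node.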