KNIME for Open-Source Bioimage Analysis - a Tutorial
Total Page:16
File Type:pdf, Size:1020Kb
KNIME for Open-Source BioImage Analysis - A Tutorial Christian Dietz and Michael R. Berthold Abstract The open analytics platform KNIME is a modular environment that en- ables easy visual assembly and interactive execution of workflows. KNIME is al- ready widely used in various areas of research, for instance in cheminformatics or classical data analysis. In this tutorial the KNIME Image Processing Extension is introduced, which adds the capabilities to process and analyze huge amounts of images. In combination with other KNIME extensions, KNIME Image Processing opens up new possibilities for inter-domain analysis of image data in an understand- able and reproducible way. 1 Introduction Every day, research involves recording increasing numbers of images as a result of the constantly improving imaging techniques, making them key to life science research. Advanced microscopy allows the acquisition of multidimensional images almost without any user interaction and can therefore generate a plethora of het- erogeneous image data. However, to make sense of the generated image data and finally draw conclusions, an exhaustive analysis of the images has to be conducted. In addition to classical image processing techniques, more sophisticated algorithms are increasingly being applied - from the field of machine learning and data mining (Eliceiri et al, 2012). The extracted information is then further analysed with estab- lished statistical analysis techniques. For instance, detecting objects within images (i.e. segmentation) and the detailed statistical evaluation of the collected results are Christian Dietz University of Konstanz, Chair for Bioinformatics and Information Mining, Universitaetsstrasse 10, 78464 Konstanz, Germany e-mail: [email protected] Michael R. Berthold University of Konstanz, Chair for Bioinformatics and Information Mining, Universitaetsstrasse 10, 78464 Konstanz, Germany e-mail: [email protected] 1 2 Christian Dietz and Michael R. Berthold essential stages of a typical image analysis process (Saha et al, 2013; Ljosa et al, 2012; Aligeti et al, 2014). For a full exploitation of the outcome, an appropriate vi- sualization of the information or a linkage to other information sources from other domains may be necessary to gain new insights. A large number of monolithic and highly task-oriented software solutions has been proposed to tackle the problems that occur in each step of bio-image analysis tasks (Eliceiri et al, 2012). As a result, researchers are required to choose from a set of stand-alone tools, which have to be orchestrated to solve the given task. Typically, two approaches are used that link these kinds of tools: one approach is to transfer the data manually between the tools while the other approach involves writing a customized program or script to automate a particular process. However, these approaches typically lead to a number of critical problems. • Transferring the data manually involves a human being and is therefore time- consuming and does not scale with the amount of the acquired images. • Customized scripts are prone to errors. Furthermore, results calculated with these highly problem-specific scripts are frequently unable to be reproduced or reused by others. A straightforward, but infeasible solution to the described problems is to build a single monolithic platform that covers the complete range of functionalities re- quired by a bio-image analysis workflow. However, future demands are yet unknown and therefore a closed, proprietary software solution does not scale with the new requirements that evolve with technological advance. Therefore, the open-source community has realized the great need for, and benefit of, closer cooperation by fostering interoperability among individual projects and open, extensible platforms. Following this approach, the open-source analytics platform KNIME (Berthold et al, 2008) provides the ability to seamlessly integrate a diverse and powerful collection of existing software tools and libraries. KNIME is a user-friendly and comprehen- sive open-source data integration, processing, analysis, and exploration platform de- signed to handle large amounts of heterogeneous data. It has been developed since 2006 and is used by professionals in industry and academia. As an integration plat- form, KNIME directly combines the advantages of several different tools and do- mains. The integrated tools are encapsulated KNIME nodes, the basic processing units in KNIME, which in turn can be combined to form so-called workflows. KN- IME workflows not only inherently document the entire analysis process, but they can also be exported and easily made available to others, who can subsequently re- produce the results or use the workflows as a starting point for their own analysis. To guarantee reproducibility, KNIME makes sure that whenever any of the mod- ules change in any way, for example the change of a version an integrated tool, the previous version of that module is carefully deprecated but remains part of the plat- form. Hence, workflows published years ago still run with the most recent releases of KNIME. Once a workflow has been created, it can be applied to hundreds of thousands of images and other large data sets - even on small-scale devices thanks to the intelligent caching technology of KNIME. This makes KNIME well-suited for high-throughput screenings, in which the analysis results can also be quite large. KNIME for Open-Source BioImage Analysis - A Tutorial 3 The KNIME Image Processing Extension enhances KNIME by providing algo- rithms and data structures to process and analyse images. To avoid reinventing the wheel, KNIME Image Processing uses and integrates state-of-the-art libraries such as ImageJ1 (Schindelin et al, 2012) and ImageJ21, SCIFIO2, OMERO (Allan et al, 2012), ClearVolume (Royer et al, 2015), ImgLib2 (Pietzsch et al, 2012), CellProfiler (Kamentsky et al, 2011), TrackMate3 and others. These well-known image process- ing tools can not only exchange data and therefore be used in combination, it is also possible to link their output to other extensions from completely different domains. For example, once interesting hits have been identified in the image data, the re- spective molecules can be explored with one of the many KNIME cheminformatics extensions, for instance the KNIME RDKit extension4. An image processing and analysis workflow typically consists of a subset of sev- eral consecutive steps: Loading images, (pre)-processing, segmentation, tracking, feature extraction, model learning and the subsequent visualization and statistical analysis of the information gathered in the previous steps. Different problems can be incurred in each of these steps, depending on the image analysis task itself. How- ever, by combining KNIME Image Processing nodes with nodes from other avail- able KNIME extensions, it is easy to orchestrate these comprehensible workflows, which can span multiple domains, to solve the issues in KNIME without needing to program a single line of code. In Section 2 the main concepts of KNIME and KNIME Image Processing are introduced. Taking this as a basis, Section 3 goes on to explain an image processing workflow example in a step by step process. 2 Basic Concepts This section explains how KNIME and its extensions are downloaded and installed. Next, the KNIME User Interface is described, while the last part of this section covers the most fundamental concepts of KNIME Image Processing, which are im- portant for understanding the image processing workflow explained in Section 3. 2.1 Download and Installation The open analytics platform KNIME can be downloaded and installed from the KNIME website5. KNIME comes packed with an installer for Windows and Mac systems. Linux users simply have to extract KNIME. As KNIME is a plugin- 1 http://imagej.net. 2 http://scif.io. 3 http://fiji.sc/TrackMate. 4 http://tech.knime.org/community/rdkit. 5 http://www.knime.org. 4 Christian Dietz and Michael R. Berthold based system, there are several extensions that are not part of the basic KNIME installation. These extensions are easily installed via so-called update-sites. KN- IME Image Processing6, for example, is installed from the Trusted Community Contributions site. For details on how to install additional plugins, please see http://tech.knime.org/community. 2.2 KNIME User Interface Fig. 1 KNIME User Interface Figure 1 shows the KNIME User Interface. The KNIME Explorer (A) depicts the various locations where workflows can be stored or uploaded. By default, two locations are available: (i) The KNIME Example server on which several exam- ple workflows can be found. (ii) The LOCAL workspace, which was selected on the first start-up of KNIME. A new workflow can be created with File > New > New KNIME Workflow. This new, empty workflow is accessed via the LOCAL workspace. Workflows in KNIME are essentially graphs that connect nodes (atomic processing units in KNIME), and visually model the individual processing steps of a certain task. A Double-Click on the workflow in the KNIME Explorer (A) opens it in the Workflow Editor (C). The user is now able to drag&drop nodes from the Node Repository (B) onto the canvas of the workflow editor, to compose complex yet clear workflows, for example to process and analyse images. The nodes can then be con- nected by drawing a line from the output node to the input node, enabling the data to be passed from node to node. Additionally, each KNIME node provides a Node 6 http://knime.imagej.net. KNIME for Open-Source BioImage Analysis - A Tutorial 5 Description (D) explaining which input data it requires, explanations of the required parameters, what the node does with the incoming data and the output of the node. The Node Repository (B) contains all of the KNIME nodes that are part of the cur- rently installed KNIME extensions. The default KNIME Open Analytics Platform installation provides a basic set of nodes for data manipulation, data mining, a se- lection of data views, node control, time series analytics and basic IO and Database nodes.