
Bioimage Informatics for Big Data

Hanchuan Peng1*, Jie Zhou2, Zhi Zhou1, Alessandro Bria3,4, Yujie Li1,5, Dean Mark Kleissas6, Nathan G. Drenkow6, Brian Long1, Xiaoxiao Liu1, and Hanbo Chen1,5 1Allen Institute for Brain Science, Seattle, WA, USA. 2 Department of Computer Science, Northern Illinois University, DeKalb, IL, USA. 3 Department of Engineering, University Campus Bio-Medico of Rome, Rome, Italy. 4 Department of Electrical and Information Engineering, University of Cassino and L.M., Cassino, Italy. 5 Department of Computer Science, University of Georgia, Athens, GA, USA. 6 Johns Hopkins University Applied Physics Laboratory, Laurel, MD, USA. * Correspondence: [email protected]

Abstract: Bioimage Informatics is a field in which high-throughput informatics methods are used to solve challenging scientific problems related to biology and medicine. As image datasets become larger and more complicated, many conventional approaches are no longer applicable. Here we discuss two critical challenges of large-scale Bioimage Informatics applications, namely data accessibility and adaptive data analysis. We highlight case studies to show that these challenges can be tackled with distributed image computing as well as machine learning of image examples in a multi-dimensional environment.

The Big Data Challenges

There have been substantial advances in Bioimage Informatics in the last fifteen years (Peng, 2008; Swedlow, et al., 2009; Myers, 2012). With the annual Bioimage Informatics conferences (http://bioimageinformatics.org) and related meetings, as well as the formally added paper submission categories in several journals (e.g. BMC Bioinformatics and Bioinformatics (Oxford)), more researchers have been attracted to this growing field.

Occasionally, Bioimage Informatics has been thought of as studies that use image-analysis and computer-vision methods to solve bioinformatics problems in particular biology domains such as cell biology and neuroscience (e.g. Danuser, 2011; Jug, et al., 2014; Mikut, et al., 2013). This view, however, does not reflect all the intended applications of the field. In a 2012 editorial of the Bioinformatics journal (Peng, et al., 2012), Bioimage Informatics is defined as a category including “Informatics methods for the acquisition, analysis, mining and visualization of images produced by modern microscopy, with an emphasis on the application of novel computing techniques to solve challenging and significant biological and medical problems at the molecular, sub-cellular, cellular, and super-cellular (organ, organism, and population) levels. This also encourages large-scale image informatics methods/applications/software, various enabling techniques (e.g. cyber-infrastructures, quantitative validation experiments, etc.) for such large-scale studies, and joint analysis of multiple heterogeneous datasets that include images as a component. Bioimage related ontology and database studies, image-oriented large-scale data mining, machine learning, and other analytics techniques are also encouraged.” In short, we believe that Bioimage Informatics emphasizes the high-throughput aspect of image informatics methods and applications.

It is important to stress that the current pace of image data generation has far exceeded the processing capability of conventional image analysis labs. For the four microscopic modalities most used today, namely brightfield imaging, confocal or multi-photon laser scanning microscopy, light-sheet microscopy, and electron microscopy, it has become very easy to produce big image data with hundreds of giga-voxels or even many tera-voxels, where each voxel may correspond to one or multiple bytes of data. This is not only the case for concerted large-scale projects such as the MindScope project at the Allen Institute for Brain Science (Anastassiou, et al., 2015) or the FlyLight project at the Janelia Research Campus of the Howard Hughes Medical Institute (Jenett, et al., 2012), but also a commonly confronted scenario in much smaller projects in individual research labs (Silvestri, et al., 2013).

In addition to the scale of the image datasets, the complexity of bioimages makes it challenging to rely on conventional computational methods to analyze the data efficiently. There are two specific challenges. First, in many (if not most) bioimages, there are at least three to five intrinsic dimensions, namely the X, Y, and Z spatial dimensions, the “color” channel dimension that reflects colocalizing patterns, and the time dimension. Further dimensions might also be encountered that encode other experimental parameters or perturbations. It is often very hard to navigate through such high-dimensional datasets, let alone detect or mine the biologically or medically relevant patterns from such data.

Second, bioimages often contain a number of different spatial regions corresponding to cells, intracellular objects or cell populations. Most applications of bioimage segmentation, registration and classification (Qu, et al., 2015) involve determining certain relationships among these objects. For instance, a goal in neuroscience is to quantify the distribution of synapses that connect neurons. To achieve such a goal, the very complicated 3D morphology of a neuron must be reconstructed (traced), and synapses that may have very different shapes must be segmented. In addition, the spatial relationship between neuron(s) and synapses must be characterized. Achieving these complex computational analyses is often technically challenging (Micheva, et al., 2010; Kim, et al., 2012; Mancuso, et al., 2013; Collman, et al., 2015).

Here we discuss briefly two critical challenges of very large-scale Bioimage Informatics applications, namely data accessibility and adaptive data analysis, related to the aforementioned concerns of the scale and complexity of bioimage datasets. Some recent advances in bioimage management as well as machine learning for bioimages that address these two challenges are highlighted.

Big Bioimage Storage and Access

Complementary to recent advances in the fields of bioimage acquisition (Khmelinskii, et al., 2012; Tomer, et al., 2012), visualization (Peng, et al., 2010; De Chaumont, et al., 2012; Schneider, et al., 2012; Peng, et al., 2014), and analysis (Kvilekval, et al., 2010; Luisi, et al., 2011; Schneider, et al., 2012; Peng, et al., 2014), storage and management of big images form another exciting line of work to address challenges of large-scale bioimage data. Open Microscopy Environment (OME) (Swedlow, et al., 2003), BISQUE (Kvilekval, et al., 2010), CCDB (Martone, et al., 2002), and the Vaa3D-AtlasManager (Peng, et al., 2011) are among the existing systems that pioneered different aspects of big data management. Many of these systems provide web services for remote data management. Sometimes, visualization has also been built into the image-serving websites (Saalfeld, et al., 2009) to allow browsing through multi-dimensional images, their individual 2D sections, the maximum/minimum intensity projections, and/or movies of image data.

For cutting-edge applications, bioimage volumes often reach the scale of multiple terabytes (Bria, et al., 2015). For instance, whole-brain imaging studies of mammalian brains often produce terabytes of raw image data for a single brain. To access such terabyte-sized datasets, it is necessary to use effective data structures for storage and access. Typically, for 3D datasets, an octree or a similar data structure is used to organize the big data into hierarchical levels. At the coarse levels, the image data are downsampled, and each coarse-level voxel location is associated with the corresponding higher-resolution image voxels. Therefore, when a user browses the data at different resolution scales, the data can be read from the storage device directly instead of being computed in real time. The HDF5 format is often used as a convenient way to store such hierarchical data. Custom octree data structures are also used in many ongoing projects.
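To make this concrete, the short Python sketch below illustrates, with a hypothetical block size and file-naming scheme (not the actual TeraFly or HDF5 layout), how a full-resolution ROI maps onto the stored blocks of an octree-style pyramid, so that browsing at a coarse level touches only a handful of blocks on disk.

import numpy as np

BLOCK = 128  # hypothetical edge length of one stored block, in voxels

def blocks_for_roi(roi_min, roi_max, level):
    """Return (z, y, x) indices of the pyramid blocks that cover a
    full-resolution ROI at a given level (level 0 = full resolution;
    each level downsamples by 2 along every axis, as in an octree)."""
    scale = 2 ** level
    lo = np.asarray(roi_min) // scale // BLOCK
    hi = (np.asarray(roi_max) - 1) // scale // BLOCK
    return [(z, y, x)
            for z in range(lo[0], hi[0] + 1)
            for y in range(lo[1], hi[1] + 1)
            for x in range(lo[2], hi[2] + 1)]

def block_path(level, index):
    """Hypothetical on-disk naming scheme for one block of the pyramid."""
    z, y, x = index
    return f"pyramid/L{level}/{z:05d}/{y:05d}_{x:05d}.raw"

# A 512-voxel-cube ROI needs 64 blocks at full resolution but only 1 block
# at level 2, so coarse browsing reads very little data from disk.
roi_min, roi_max = (10240, 20480, 4096), (10752, 20992, 4608)
print(len(blocks_for_roi(roi_min, roi_max, level=0)))   # -> 64
print(len(blocks_for_roi(roi_min, roi_max, level=2)))   # -> 1
print(block_path(2, blocks_for_roi(roi_min, roi_max, level=2)[0]))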

However, even with a hierarchical organization of large image data, it is still hard to navigate through very large data in the multi-dimensional space. One key limitation noted in recent studies (Peng, et al., 2014) is that it often takes a user too long to manually identify the correct 3D regions of interest (ROIs) to visualize across different resolution scales. One critical new technique, called 3D Virtual Finger, was proposed to generate a 3D ROI with a single computer mouse operation (click, stroke, or zoom) while the user operates on the rendered images on an ordinary 2D display device (e.g. a computer screen). Typically, computing such an ROI takes only a few milliseconds, so the speed of navigating large images across different resolution scales is essentially as fast as it can be, limited only by the file I/O speed of the storage device. The Virtual Finger function has been used to develop one of the fastest known visualizers of large 3D data, Vaa3D-TeraFly (Bria, et al., 2015), for visualizing terabytes of multi-dimensional bioimage data (Figure 1).
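The point-pinpointing step can be illustrated with a deliberately simplified sketch: assuming the viewing direction is aligned with the z axis, a 2D click defines a ray through the volume, and an intensity-weighted depth along that ray gives the 3D point. The published Virtual Finger algorithms handle arbitrary view angles and use more robust statistics; the function name and demo data below are illustrative only.

import numpy as np

def pinpoint_3d(volume, click_xy):
    """Simplified point-pinpointing: shoot a ray through the volume at the
    clicked (x, y) screen position (here assumed to lie along the z axis)
    and return the intensity-weighted depth as the estimated 3D point."""
    x, y = click_xy
    ray = volume[:, y, x].astype(np.float64)        # intensities along the ray
    if ray.sum() == 0:                              # empty ray: fall back to mid-depth
        z = volume.shape[0] // 2
    else:
        z = int(round(np.average(np.arange(ray.size), weights=ray)))
    return (z, y, x)

# Tiny demo volume with a single bright voxel at depth 30.
vol = np.zeros((64, 256, 256), dtype=np.uint8)
vol[30, 100, 120] = 255
print(pinpoint_3d(vol, (120, 100)))   # -> (30, 100, 120)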

In an often-used client-server infrastructure, transferring big bioimage data from one location to another can be tackled by implementing a hierarchical organization of data on the server side and providing the Virtual Finger type of random access to multi-dimensional data on the client side. In this way, it is even possible to consider distributed storage of very big bioimage repositories on multiple servers or in the cloud, and to use powerful client-side visualization and analysis software packages, such as Vaa3D (Peng, et al., 2010), to effectively fetch data whenever needed. A preliminary prototype was recently implemented during a BigNeuron (Peng, et al., 2015; Peng, et al., 2015) hackathon hosted by the Howard Hughes Medical Institute. Raw images of multiple imaging modalities, including electron microscopy data and fluorescent microscopy data stored on remote servers, can be fetched quickly over the Internet to be visualized and further analyzed in 3D using Vaa3D (Figure 2). These functions provide new ways to share large image datasets for collaborative analysis. In the future, these new tools might be combined with other remote visualization and collaborative software packages, such as Cytomine (Maree, et al., 2013) or Arivis WebView (http://webview3d.arivis.com/).
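A minimal sketch of such client-side fetching is given below. The endpoint layout, dataset name and voxel encoding are hypothetical placeholders rather than the actual protocol of Vaa3D, the Open Connectome Project, or any specific server; the point is only that a client requests exactly the ROI and resolution level it needs.

import numpy as np
import requests  # third-party HTTP client; pip install requests

def fetch_subvolume(server, dataset, level, zmin, zmax, ymin, ymax, xmin, xmax):
    """Request one ROI of a remote multi-resolution image volume.
    The URL scheme below is hypothetical; a real deployment would follow
    the API of the image server actually in use (e.g. a cutout service)."""
    url = (f"{server}/cutout/{dataset}/{level}/"
           f"{zmin},{zmax}/{ymin},{ymax}/{xmin},{xmax}/")
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    shape = (zmax - zmin, ymax - ymin, xmax - xmin)
    # Assume the server returns raw uint8 voxels in z-y-x order.
    return np.frombuffer(resp.content, dtype=np.uint8).reshape(shape)

# Usage sketch: fetch a 256^3 ROI at pyramid level 1 for local 3D rendering.
# roi = fetch_subvolume("https://example.org", "em_brain", 1,
#                       0, 256, 1024, 1280, 2048, 2304)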

Machine Learning for Multi-dimensional Bioimages

The substantial complexity of bioimages suggests that many existing unsupervised bioimage analysis methods will have limited ability to generalize to new image data. Even when a specific bioimage analysis task remains the same for new datasets, it typically still requires non-trivial effort to make an algorithm work well on the new data. Efforts have been made to use machine learning to help solve these problems. Application examples include screening of gene expression patterns (Long, et al., 2009; Zhou and Peng, 2011) and many others (e.g. Jones, et al., 2009; Kutsuna, et al., 2012). Interesting machine learning toolboxes for bioimage informatics have also been built, such as Wndcharm (Orlov, et al., 2008) and ilastik (Sommer, et al., 2011). These and other new strategies need to be incorporated into high-throughput analysis pipelines to meet the challenges of increasingly big bioimage datasets.

For example, in neuroscience, high-throughput analyses that categorize, annotate and quantify high-dimensional multi-channel images have been essential for new discoveries. Machine-learning-based analysis is attractive for automatic processing of large volumes of biological images because models can be trained on a small set of examples and then extend that knowledge to bigger datasets. Meanwhile, given the variety of biological images, it is also important that the computational methods suit the specific image classification and analysis tasks. Recent studies show that pattern recognition and machine learning methods vary greatly in effectiveness across diverse biological image classification problems (Zhou, et al., 2013; Li, et al., 2015; Sanders, et al., 2015). The differences stem from variation in algorithm performance at all stages of the process, including the feature extractors, feature selectors and classifiers as well as their combinations. In addition, the results indicate that sophisticated algorithms do not always outperform relatively simple alternatives. This implies that blindly designing a very general analysis process with complex and potentially resource-costly algorithms is not an efficient solution for big bioimages. Instead, adaptive and flexible solutions tailored to particular datasets may be a more pragmatic way to tackle the challenge.

BIOCAT (BIOimage Classification and Annotation Tool) (Zhou, et al., 2013) is one such recent effort toward adaptive biological image analysis. BIOCAT builds customizable machine learning pipelines for classifying 2D and 3D images as well as regions of interest within the images. Being an open, user-friendly, portable and extensible tool, BIOCAT provides a comparison engine that selects a suitable combination of multi-dimensional feature descriptors, feature selectors and classifiers. As a result, such a combination, termed an algorithm chain (Figure 3A, B), is expected to provide valuable classification results for the image analysis problem at hand.
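BIOCAT itself is a Java tool with its own plugin interface; the Python sketch below (using scikit-learn on toy feature vectors) only illustrates the underlying idea of an algorithm chain comparison engine: enumerate candidate combinations of feature selectors and classifiers, score each by cross-validation, and keep the best-performing chain for the problem at hand.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Toy stand-in for per-ROI feature vectors (e.g. texture or moment features
# extracted from 3D image patches) and their class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Candidate "algorithm chains": feature selector + classifier combinations.
chains = {
    "anova20+svm": Pipeline([("scale", StandardScaler()),
                             ("select", SelectKBest(f_classif, k=20)),
                             ("clf", SVC())]),
    "anova20+rf": Pipeline([("select", SelectKBest(f_classif, k=20)),
                            ("clf", RandomForestClassifier(random_state=0))]),
    "all+naivebayes": Pipeline([("clf", GaussianNB())]),
}

# Rank the chains by cross-validated accuracy and keep the best one.
scores = {name: cross_val_score(p, X, y, cv=5).mean() for name, p in chains.items()}
best = max(scores, key=scores.get)
print(scores, "-> best chain:", best)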

To make machine learning as effective as possible, it is important to be able to easily generate exemplars, or training data, in multi-dimensional space. 2D methods have been introduced in ilastik (Sommer, et al., 2011) and other tools. The Virtual Finger technique (Peng, et al., 2014) has been used to produce such exemplars in multi-dimensional space at low cost. Powerful machine learning results, even with simple algorithms, have started to demonstrate the success of such exemplar-based adaptive learning for very large image datasets of multiple imaging modalities (Li, et al., 2015).
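As a simple illustration of exemplar generation, the sketch below turns user-specified 3D points (for example, points produced by Virtual Finger clicks) into small labeled 3D patches that can serve as training exemplars; the patch size and labels are arbitrary illustrative choices.

import numpy as np

def extract_exemplar(volume, point_zyx, label, radius=8):
    """Cut a small 3D patch centered on a user-specified point (e.g. a point
    produced by a 3D Virtual Finger click) and pair it with a class label,
    producing one training exemplar."""
    z, y, x = point_zyx
    zdim, ydim, xdim = volume.shape
    patch = volume[max(z - radius, 0):min(z + radius, zdim),
                   max(y - radius, 0):min(y + radius, ydim),
                   max(x - radius, 0):min(x + radius, xdim)]
    return patch.copy(), label

# Usage sketch: a handful of clicks becomes a small labeled training set.
vol = np.random.default_rng(1).integers(0, 255, size=(64, 128, 128), dtype=np.uint8)
clicks = [((30, 60, 40), 1), ((10, 20, 100), 0)]   # (point, label) pairs
exemplars = [extract_exemplar(vol, p, lab) for p, lab in clicks]
print([(patch.shape, lab) for patch, lab in exemplars])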

Self-evidence is another technique that complements manually specified exemplars. The key idea is to start from an initial analysis result of a bioimage and estimate the most confident “regions” of that analysis. Once the confident regions are produced, one can design algorithms to pick positive and negative training examples from the image directly, without the need to manually specify training examples. While manually chosen examples can always be added into the framework as well, having a machine-learning algorithm choose the exemplars means that such an algorithm can easily generalize to very large-scale bioimage datasets. These ideas have recently been applied to the tracing of neuron morphologies in 3D. For example, the SmartTracing algorithm (Chen, et al., 2015) was recently proposed to start from an almost arbitrarily weak neuron tracing result, generate useful exemplars automatically, and eventually improve the tracing results considerably. While this category of self-evidence-based smart algorithms is still in its early days, it might become a mainstream approach for very large-scale bioimage informatics applications.
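The following sketch is not the SmartTracing algorithm itself, only a schematic of the self-evidence idea under simplifying assumptions: given per-voxel features and a confidence map derived from an initial analysis, the most and least confident voxels become automatic positive and negative exemplars, a classifier is retrained on them, and the refined predictions can then be fed back into the analysis.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_evidence_refine(features, confidence, hi=0.9, lo=0.1):
    """Pick automatic exemplars from an initial analysis result: voxels whose
    confidence exceeds `hi` become positives, voxels below `lo` become
    negatives, and a classifier retrained on them relabels all voxels,
    including the previously ambiguous ones."""
    pos = confidence >= hi
    neg = confidence <= lo
    X = np.concatenate([features[pos], features[neg]])
    y = np.concatenate([np.ones(pos.sum()), np.zeros(neg.sum())])
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    return clf.predict(features)

# Toy demo: per-voxel feature vectors and a noisy initial confidence map.
rng = np.random.default_rng(2)
features = rng.normal(size=(5000, 12))
confidence = 1.0 / (1.0 + np.exp(-features[:, 0] + 0.5 * rng.normal(size=5000)))
refined = self_evidence_refine(features, confidence)
print(refined.mean())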

Conclusions

Continuing innovation in imaging and increasing rates of bioimage data generation have created big data challenges across a wide range of applications. Highly accessible large-scale datasets, aided by easy data visualization, are becoming the stepping stone for bioimage informatics. Development of sophisticated machine learning tools that can adapt to changing samples and imaging conditions is essential for quantitatively understanding the rich information offered by these increasingly large datasets. Easy exploration of bioimage data before and after analysis will generate insights into challenging questions in biology.

Figure 1. (A) Vaa3D-TeraFly 3D exploration through different layers of a ~1 TB image, organized as an octree-type multi-resolution pyramid, with a single computer mouse operation (in the example shown, zoom-in with the mouse scroll wheel). (B) Virtual Finger is applied to the center of the viewport to generate the 3D ROI for zoom-in. (C) The Virtual Finger point-pinpointing algorithm (PPA) is used to find the center P of the 3D ROI. Figure redrawn from (Bria, et al., 2015).

Figure 2. (A) Different levels of a 1 TB multi-resolution electron microscopy image dataset (Kasthuri, et al., 2009), whose dimensions are 21504 x 26624 x 1850 voxels. Five different scales (1~5) are shown using Vaa3D's 3D visualization functions. (B) Different levels of a 46 GB (3409 x 3337 x 70 voxels) multi-resolution, multi-channel (29 channels) fluorescent microscopy image dataset (Weiler, et al., 2014). Three spatial scales (1~3) are shown. Both datasets are explored and visualized by Vaa3D in real time across the Internet, with the raw image data stored on the remote servers of the Open Connectome Project (OCP) (Burns, et al., 2013).

Figure 3. (A) BIOCAT screenshots for learning model selection. (B) Illustrative diagram for selecting an algorithm chain for 3D synapse detection. (C) Learning-guided 3D synapse quantification. The green channel contains the GABAergic synaptic markers. The yellow dots indicate the centers of detected GABAergic synaptic markers located on the axon of a lobula plate tangential cell (LPTC) in the Drosophila brain, using 3D confocal images. (D) Learning-based dendritic territory identification. The output of pixel classification from BIOCAT is used to detect different subcellular components of an LPTC neuron. Figure redrawn from (Zhou, et al., 2013; Sanders, et al., 2015).

References:

Anastassiou, C., et al. (2015) Project MindScope: inferring cortical function in the mouse visual system, PNAS (submitted).
Bria, A., Iannello, G. and Peng, H. (2015) An open-source Vaa3D plugin for real-time 3D visualization of terabyte-sized volumetric images, International Symposium on Biomedical Imaging: From Nano to Macro, 520-523.
Burns, R., et al. (2013) The Open Connectome Project Data Cluster: scalable analysis and vision for high-throughput neuroscience, SSDBM 2013.
Chen, H., et al. (2015) SmartTracing: self-learning based neuron reconstruction, Brain Informatics (submitted).
Collman, F., et al. (2015) Mapping synapses by conjugate light-electron array tomography, The Journal of Neuroscience, 35, 5792-5807.
Danuser, G. (2011) Computer vision in cell biology, Cell, 147, 973-978.
De Chaumont, F., et al. (2012) Icy: an open bioimage informatics platform for extended reproducible research, Nature Methods, 9, 690-696.
Jenett, A., et al. (2012) A GAL4-driver line resource for Drosophila neurobiology, Cell Reports, 2, 991-1001.
Jones, et al. (2009) Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning, PNAS, 106(6), 1826-1831.
Jug, et al. (2014) Bioimage informatics in the context of Drosophila research, Methods, 68(1), 60-73.
Kasthuri, N., et al. (2009) The brain on tape: imaging an Ultra-Thin Section Library (UTSL), Society for Neuroscience Abstracts, 2009.
Khmelinskii, A., et al. (2012) Tandem fluorescent protein timers for in vivo analysis of protein dynamics, Nature Biotechnology, 30, 708-714.
Kim, J., et al. (2012) mGRASP enables mapping mammalian synaptic connectivity with light microscopy, Nature Methods, 9, 96-102.
Kutsuna, et al. (2012) Active learning framework with iterative clustering for bioimage classification, Nature Communications, 3, 1032.
Kvilekval, K., et al. (2010) Bisque: a platform for bioimage analysis and management, Bioinformatics, 26, 544-552.
Li, X., et al. (2015) Interactive exemplar-based segmentation toolkit for biomedical image analysis, International Symposium on Biomedical Imaging: From Nano to Macro, 168-171.
Long, F., et al. (2009) A 3D digital atlas of C. elegans and its application to single-cell analyses, Nature Methods, 6, 667-672.
Luisi, J., et al. (2011) The FARSIGHT trace editor: an open source tool for 3-D inspection and efficient pattern analysis aided editing of automated neuronal reconstructions, Neuroinformatics, 9, 305-315.
Mancuso, J.J., et al. (2013) Methods of dendritic spine detection: from Golgi to high-resolution optical imaging, Neuroscience, 251, 129-140.
Maree, et al. (2013) A rich internet application for remote visualization and collaborative annotation of digital slides in histology and cytology, Diagnostic Pathology, 8(Suppl 1), S26.
Martone, M.E., et al. (2002) A cell-centered database for electron tomographic data, Journal of Structural Biology, 138, 145-155.
Micheva, K.D., et al. (2010) Single-synapse analysis of a diverse synapse population: proteomic imaging methods and markers, Neuron, 68, 639-653.
Mikut, et al. (2013) Automated processing of zebrafish imaging data: a survey, Zebrafish, 10(3), 401-421.
Myers, G. (2012) Why bioimage informatics matters, Nature Methods, 9, 659-660.
Orlov, N., et al. (2008) WND-CHARM: multi-purpose image classification using compound image transforms, Pattern Recognition Letters, 29, 1684-1693.
Peng, H. (2008) Bioimage informatics: a new area of engineering biology, Bioinformatics, 24, 1827-1836.
Peng, H., et al. (2010) V3D enables real-time 3D visualization and quantitative analysis of large-scale biological image data sets, Nature Biotechnology, 28, 348-353.
Peng, H., et al. (2011) BrainAligner: 3D registration atlases of Drosophila brains, Nature Methods, 8, 493-498.
Peng, H., et al. (2012) Bioimage informatics: a new category in Bioinformatics, Bioinformatics, 28, 1057.
Peng, H., et al. (2014) Extensible visualization and analysis for multidimensional images using Vaa3D, Nature Protocols, 9, 193-208.
Peng, H., et al. (2014) Virtual finger boosts three-dimensional imaging and microsurgery as well as terabyte volume image visualization and analysis, Nature Communications, 5, 4342.
Peng, H., et al. (2015) BigNeuron: large-scale 3D neuron reconstruction from optical microscopy images, Neuron, doi: 10.1016/j.neuron.2015.06.036.
Peng, H., Meijering, E. and Ascoli, G.A. (2015) From DIADEM to BigNeuron, Neuroinformatics, 13, 259-260.
Qu, L., Long, F. and Peng, H. (2015) 3-D registration of biological images and models: registration of microscopic images and its uses in segmentation and annotation, IEEE Signal Processing Magazine, 32, 70-77.
Saalfeld, S., et al. (2009) CATMAID: collaborative annotation toolkit for massive amounts of image data, Bioinformatics, 25, 1984-1986.
Sanders, J., et al. (2015) Learning-guided automatic three dimensional synapse quantification for Drosophila neurons, BMC Bioinformatics, 16, 177.
Schneider, C.A., Rasband, W.S. and Eliceiri, K.W. (2012) NIH Image to ImageJ: 25 years of image analysis, Nature Methods, 9, 671-675.
Silvestri, L., et al. (2013) Micron-scale resolution optical tomography of entire mouse brains with confocal light sheet microscopy, Journal of Visualized Experiments: JoVE, 80, e50696, doi: 10.3791/50696.
Sommer, C., et al. (2011) ilastik: interactive learning and segmentation toolkit, IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 230-233.
Swedlow, J.R., et al. (2003) Informatics and quantitative analysis in biological imaging, Science, 300, 100-102.
Swedlow, J.R., et al. (2009) Bioimage informatics for experimental biology, Annual Review of Biophysics, 38, 327-346.
Tomer, R., et al. (2012) Quantitative high-speed imaging of entire developing embryos with simultaneous multiview light-sheet microscopy, Nature Methods, 9, 755-763.
Weiler, N., et al. (2014) Synaptic molecular imaging in spared and deprived columns of mouse barrel cortex with array tomography, Scientific Data, 1, 140046.
Zhou, J. and Peng, H. (2011) Counting cells in 3D confocal images based on discriminative models, Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine, ACM, 399-403.
Zhou, J., et al. (2013) BIOCAT: a pattern recognition platform for customizable biological image classification and annotation, BMC Bioinformatics, 14, 291, doi: 10.1186/1471-2105-14-291.
Zhou, J., et al. (2013) Performance model selection for learning-based biological image analysis on a cluster, Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, ACM, 324-332.