An Overview of Data Science Uses in Bioimage Informatics
Methods 115 (2017) 110–118
journal homepage: www.elsevier.com/locate/ymeth

Anatole Chessel
LOB, Ecole Polytechnique, CNRS, INSERM, Université Paris-Saclay, 91128 Palaiseau cedex, France
E-mail address: [email protected]
http://dx.doi.org/10.1016/j.ymeth.2016.12.014
1046-2023/© 2017 Elsevier Inc. All rights reserved.

Article history: Received 31 August 2016; received in revised form 9 December 2016; accepted 30 December 2016; available online 3 January 2017.

Keywords: Data science; Bioimage informatics; High content screening; Image analysis

Abstract

This review aims at providing a practical overview of the use of statistical features and associated data science methods in bioimage informatics. To achieve a quantitative link between images and biological concepts, one typically replaces an object coming from an image (a segmented cell or intracellular object, a pattern of expression or localisation, even a whole image) by a vector of numbers. These numbers range from carefully crafted, biologically relevant measurements to features learnt through deep neural networks. This replacement allows for the use of practical algorithms for visualisation, comparison and inference, such as the ones from machine learning or multivariate statistics. While originating mainly, for biology, in high content screening, those methods are integral to the use of data science for the quantitative analysis of microscopy images to gain biological insight, and they are sure to gather more interest as the need to make sense of the increasing amount of acquired imaging data grows more pressing.
© 2017 Elsevier Inc. All rights reserved.

Contents

1. Introduction
2. Numerical features computation
2.1. Features computation strategy
2.2. Feature quality control and post-processing
3. Storage, management and sharing of image-derived data
4. Features comparison
5. Features analysis and interpretation
6. Implementations
7. Conclusion
References

1. Introduction

For the last couple of decades, the development of increasingly efficient fluorescence probes, along with technological advances in microscopy, has led to terabytes of increasingly resolved images being acquired across models and conditions. The discipline of bioimage informatics (BII) is rising to develop the means to provide a quantitative analysis of those data, and to integrate them into larger biological questions and studies, in line with a more general trend of the life sciences toward more quantitative and integrative approaches.

Numerous recent reviews, from the more specific to the more generic, have documented that rise and the methods developed and used. [1] is a fairly wide review of applications of classical computer vision techniques to biology, while [2] provides a more recent and in-depth overview of bioimage informatics. [3,4] are older reviews focusing on High Content Screening (HCS); more recently, [5,6] are comprehensive reviews on HCS, while [7] focuses on some examples of phenotype analysis. [8] is an earlier review on the uses of machine learning on biological imaging which focuses more specifically on supervised and unsupervised learning. [9] provides a more methodological review and comparison of segmentation methods. Here we will more specifically focus on the next steps after image segmentation, i.e. the computation and use of statistical features, at the intersection with wider data science techniques.

Answering biological questions involves performing comparison and inference on complex objects, be it cell shape for morphology, vesicle distributions, actin or microtubule structures, etc., whose handling is not trivial: how does one compute the average of a set of cell shapes? Or compare the intracellular organisation of segmented microtubules within cells? A classical path to an answer to those questions is to associate n real numbers to each of them, the so-called features (i.e. equivalently, send those objects to R^n), and use the wealth of algorithms and methods available to handle such numerical data.

Analysing images and image-derived data in biology has a long history [10], and has grown in scope and visibility in the last thirty years, concomitant with the increased importance of digital microscopy in the life sciences and the rise of quantitative biology. High throughput and high content microscopy, given the scale at which they do experiments, spanning 100s to 1000s of conditions or more, have always been key in moving those developments forward. They include full genome screens [11–13], systematic investigation of genetic interactions [14,15], or more focused investigation of the influence of cell context [16] or of cell motility [17], the variability of cell shape [18], of iPSC cell lines [19], or the investigation of small-molecule effects for drug development [20–22].

But similar techniques have been used outside of HCS, to analyse time lapse data [23,24], in situ hybridisation experiments [25,26], cell lineages [27], perform content-based image retrieval [28] or build models of protein localisation [29]. Recent advances in microscopy, leading to the acquisition of very large images [27,30,31], where a single image can weigh 100 GB to several TB, will also benefit from such large scale data analysis.

All those apparently very diverse applications have at their core the computation of numerical features on their objects of interest, to be used as a representation. The main purpose of feature representations is to quantitatively describe complex objects and concepts, which is essential for their further quantitative analysis. Conversely, one will only be able to access the aspects of an object that are described by the representation used; hence the choice of the right features, and of the right way of visualising and comparing them, corresponds to hypotheses about the data at hand and their variability.

The precise nature of the study will of course determine the objects for which features are computed, and how. Computing features for whole images or systematic tiles of images has been done. More

and technical but interested readers will be able to find more in-depth tutorials and courses online; see also Section 6 for pointers.

The plan will follow the chronology of projects using those methods, beginning with how the features themselves are computed (Section 2) and handled for storage and visualisation (Section 3). We will then focus specifically on comparison in Section 4, as generic multivariate two-sample tests are an important but tricky topic, and look into wider statistical and machine learning techniques for inference and interpretation in Section 5. We will finish with a few words about the practical implementations of all those algorithms in Section 6. A lot of the examples will come from high content screening (HCS) studies for historical reasons, but such methods are now increasingly of use across bioimage informatics. From Section 3 onward, we leave the realm of images per se and will be looking at rather generic biological data science questions of wider relevance.

2. Numerical features computation

As said, the aim of the computation of numerical features is to send the objects of interest to R^n, i.e. to associate n real numbers to each of them. Note first that this is only one solution among others, albeit the most commonly used and usually the most practical. For example, one could try to work directly in object space, by computing an innate distance between objects, and avoid numerical features entirely. Examples include [33], where a representation of the Drosophila embryo is built with which computations are done, or [34], which computes kernels on graphs. Those approaches were particularly pursued for shape analysis, using for example shape diffeomorphisms to compute geodesics in shape space [35–37], but will not be presented further here as they tend to be more complex and more closely tied to particular types of data or analysis frameworks.

Some data science frameworks

'Data science' is a recent label that aggregates various fields and practices that aim at analysing actual data. Classical frameworks include:

Statistics. From a statistical point of view, features are considered random variables, i.e. variables whose values follow a specific distribution, with the feature set being a multivariate random variable, and each object under study to which a feature set is associated (an individual cell, a track, a gene) can be seen as a particular realisation of that random variable. Statistical methods can be parametric, if they assume random