Deep Learning based 3D Image Segmentation Methods and Applications
A dissertation presented to the faculty of the Russ College of Engineering and Technology of Ohio University
In partial fulfillment of the requirements for the degree Doctor of Philosophy
Yani Chen May 2019 © 2019 Yani Chen. All Rights Reserved.
This dissertation titled Deep Learning based 3D Image Segmentation Methods and Applications
by YANI CHEN
has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by
Jundong Liu Associate Professor of Electrical Engineering and Computer Science
Dennis Irwin, Dean, Russ College of Engineering and Technology

Abstract
CHEN, YANI, Ph.D., May 2019, Computer Science
Deep Learning based 3D Image Segmentation Methods and Applications (115 pp.)
Director of Dissertation: Jundong Liu
Medical image segmentation is the procedure of delineating anatomical structures and other regions of interest in various image modalities. While crucial and often a prerequisite for other analysis tasks, accurate automatic segmentation is difficult to obtain, especially for three-dimensional (3D) data. Recently, deep learning techniques have revolutionized many domains of artificial intelligence (AI), including image search, speech recognition and 2D/3D natural image/video segmentation. When it comes to 3D image segmentation, however, the majority of deep learning solutions either treat 3D volumes as stacked 2D slices, overlooking the adjacency information between slices, or directly perform 3D convolutional operations with isotropic kernels that are inconsistent with the anisotropic dimensions of 3D medical data. Neural networks based on 3D convolutions tend to be computationally costly and require much more training data to account for the increased number of parameters to be tuned. The scarcity of annotated data in medical imaging adds to the difficulty. To remedy the aforementioned drawbacks of existing solutions, we propose two works for 3D volume segmentation in this dissertation. The first is a multi-view ensemble convolutional neural network (CNN) framework in which multiple decision maps generated along different 2D views are integrated. The second is a novel end-to-end deep learning architecture that combines a CNN and a recurrent neural network (RNN) to better leverage the dimensional anisotropism in 3D medical data. Our models are designed to take advantage of CNNs' remarkable power in capturing multi-scale 2D features, while relying on multi-view ensemble learning or inter-slice sequential learning to ensure a certain level of output consistency through inter-slice contextual constraints. Experiments conducted on hippocampus magnetic resonance imaging (MRI) data for both works demonstrate that the multi-view solution and the joint CNN-RNN model achieve significant improvements over single-view approaches, and outperform many state-of-the-art solutions in hippocampus segmentation. Our work also shows better results when compared with a 3D CNN method. In addition, we further validate our proposed work on another neuroimage segmentation task, i.e., multi-class segmentation of brain tumors (gliomas) using pre-operative multi-modal MRI scans. Experimental results demonstrate that our proposed solutions can effectively improve the accuracy and consistency of tumor segmentation, and show very competitive performance compared to state-of-the-art solutions.

Dedication
To my lovely grandfather and parents

Acknowledgments
I would like to express my special appreciation to my advisor, Dr. Jundong Liu, for his guidance and support throughout my Ph.D. studies. I am very grateful for his advice and many insightful discussions on my research, especially during the times when there seemed to be no obvious solution. My great appreciation also goes to Dr. Charles D. Smith, our long-time collaborator, for his valuable advice on application directions, study interpretations and system designs. I would also like to thank my dissertation committee members, Dr. David Juedes, Dr. Razvan Bunescu, Dr. Chang Liu, Dr. Li Xu and Dr. Sergiu Azicovici, for all your professional guidance on my research and coursework; your suggestions and feedback have been absolutely invaluable to me. I would like to express my sincere appreciation for all the service and time you devoted. My gratitude also goes to my wonderful lab mates, Huihui Xu and Bibo Shi, who gave me a lot of guidance and help when I first came to Ohio University to pursue my Ph.D. degree. Discussions with them and other lab mates, Pin Zhang, Zhewei Wang and Nidel Abuhajar, provided me with a lot of inspiration for my research. I am very grateful to all of you. Last but not least, I want to thank my family, especially my parents, who give me endless belief and love, supporting me in everything without asking for anything in return. I am particularly grateful to my lovely grandfather, who keeps encouraging me to be a better self.

Table of Contents
Page
Abstract...... 3
Dedication...... 5
Acknowledgments...... 6
List of Tables...... 9
List of Figures...... 10
List of Acronyms...... 12
1 Introduction...... 13 1.1 Background – image segmentation...... 13 1.2 Area overview...... 14 1.3 Contributions...... 17 1.4 Dissertation overview...... 18
2 Preliminaries...... 20 2.1 Building blocks of CNN...... 21 2.1.1 Convolution...... 22 2.1.2 Activation function...... 25 2.1.3 Pooling...... 27 2.1.4 Upsampling...... 28 2.2 Training of CNN...... 31 2.2.1 Optimization...... 32 2.2.2 Data augmentation...... 34 2.2.3 Regularization...... 35 2.2.4 Transfer learning...... 37 2.3 RNN - recurrent neural network...... 38
3 Literature Review...... 40 3.1 CNN based Segmentation...... 41 3.1.1 Patch-wise CNN for segmentation...... 42 3.1.2 Fully convolutional network...... 43 3.1.3 SegNet...... 44 3.1.4 U-Net...... 45 3.2 RNN based Segmentation...... 47 3.2.1 LSTM...... 47
3.2.2 Convolutional LSTM and its application on segmentation...... 49 3.3 Segmentation based on combination of CNN and RNN...... 50 3.3.1 U-Net + Bi-directional CLSTM...... 50
4 Multi-view Ensemble ConvNet for Hippocampus Segmentation...... 52 4.1 Motivation: Hippocampus segmentation...... 52 4.2 Method...... 55 4.2.1 U-Seg-Net...... 56 4.2.2 Ensemble-Net...... 58 4.3 Data...... 60 4.4 Experimental settings...... 61 4.5 Evaluation measurements...... 61 4.6 Experiment results...... 62 4.6.1 U-Seg-Net on nine views...... 62 4.6.2 Multi-view Ensemble ConvNets...... 63 4.7 Discussion...... 66
5 Sequential FCN for Hippocampus Segmentation...... 67 5.1 Method: U-Seg-Net + CLSTM...... 67 5.1.1 3D U-Seg-Net...... 70 5.2 Data and experimental setting...... 71 5.3 Experiment results...... 72 5.3.1 CLSTMs...... 72 5.3.2 Joint Model of U-Seg-Net and CLSTMs...... 74 5.3.3 Comparison with other methods...... 77 5.4 Discussion...... 79
6 Application on Multi-class Brain Tumor Segmentation...... 81 6.1 Motivation: Glioma segmentation...... 81 6.1.1 Prior work...... 85 6.2 Proposed models...... 87 6.2.1 U-Seg-Net+CLSTM with Squeeze-and-Excitation Unit...... 88 6.3 Data...... 90 6.4 Experimental settings...... 91 6.5 Quantitative evaluations...... 92 6.6 Experiment results...... 93 6.6.1 SE-U-Seg-Net + CLSTM...... 96 6.7 Discussion...... 97
7 Conclusion and Discussions...... 99
References...... 102

List of Tables
Table Page
4.1 Three different configurations for our Ensemble-Net. The convolutional layer parameters are denoted as "conv[kernel size]-[number of kernels]"...... 59 4.2 Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using nine different single views. (L: left hippocampus, R: right hippocampus)...... 63 4.3 Mean and standard deviation of Dice ratio (%) for the hippocampus segmentation results using different combination methods...... 64 4.4 Comparison of the proposed method with other state-of-the-art existing methods on hippocampus segmentation...... 65
5.1 Early stopping epochs used for different models...... 72 5.2 Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using the 2-layer CLSTM and the combination from different directions...... 74 5.3 Mean and standard deviations of Dice ratio for the hippocampus segmentation results obtained by putting the connection at different positions...... 76 5.4 Mean and standard deviations of Hausdorff distances for the hippocampus segmentation results obtained by putting the connection at different positions...... 77 5.5 Mean and standard deviation of Dice ratio (%) for the hippocampus segmentation results using 3D-U-Seg-Net with different settings for filter size...... 78 5.6 Comparison of the proposed method with other state-of-the-art methods on hippocampus segmentation (in 3D Dice ratios (%))...... 79
6.1 Mean and standard deviation of subject Dice ratios (%) and Hausdorff distance (1mm) on three glioma sub-regions using U-Seg-Net and ensemble...... 94 6.2 Mean and standard deviation of subject Dice ratios (%) and Hausdorff distance (1mm) for the tumor segmentation results using U-Seg-Net+CLSTM and ensemble...... 94 6.3 Comparison with other methods...... 96 6.4 Mean and standard deviation of subject Dice ratios (%) and Hausdorff distance (1mm) for the tumor segmentation results using the network with SE block on the sagittal view...... 97

List of Figures
Figure Page
2.1 A convolutional neural network for character recognition [1]...... 20 2.2 The components of a typical convolutional neural network layer in "complex layer terminology" [2]...... 21 2.3 2D convolution operations...... 23 2.4 Two-channel input feature maps to three-channel output feature maps [3].... 24 2.5 Visualization of Conv1,3,5 neurons learned from the ImageNet dataset [4]...... 25 2.6 Activation functions...... 26 2.7 Derivatives for (a) Sigmoid (b) Tanh (c) ReLU...... 27 2.8 Max pooling example...... 28 2.9 Unpooling...... 29 2.10 2D transposed convolution example. ∗ denotes convolution, ∗−1 denotes transposed convolution and ⊗ denotes matrix multiplication...... 30 2.11 Mappings in convolution and transposed convolution...... 31 2.12 Images generated by cycle GAN [5]. Images in the first and third columns are real images and the transferred outputs are in the second and fourth columns...... 35 2.13 Subnetworks after dropout...... 36 2.14 Transfer learning...... 37 2.15 Structures of RNN: Left (a): rolled RNN, Right (b): unrolled RNN [2]..... 38
3.1 A patch-wise CNN segmentation model for the membrane segmentation of Electron Microscopy images [6]...... 42 3.2 The architecture of FCN [7]...... 43 3.3 The architecture of SegNet [8]...... 44 3.4 The architecture of U-Net [9]...... 46 3.5 Structure of RNN and LSTM: (a) Standard RNN with a single layer, (b) LSTM with multiple layers...... 47 3.6 Inner structure of CLSTM [10]...... 49 3.7 Models proposed in [11]: (a) Structure of kU-Net, (b) Overview of the framework (Deep BDC-LSTM) ...... 51
4.1 Comparison of the hippocampus region from coronal slices of the T1-weighted MRI of NC, MCI and AD subjects, respectively [12]...... 54 4.2 Example showing the hippocampus and six other subcortical regions around the hippocampus (caudate nucleus, putamen, globus pallidus, nucleus accumbens, thalamus and amygdala) in axial, sagittal, coronal and 3D views (hippocampus-cyan; caudate nucleus-yellow; putamen-magenta; globus pallidus-green; nucleus accumbens-blue; thalamus-red; amygdala-white) [13]...... 56 4.3 Model overview...... 57 4.4 Sagittal, coronal and axial views of hippocampi in an MRI scan...... 58
4.5 Surface rendering of segmentation results of U-Seg-Net on the sagittal, coronal and axial views, majority voting of 3 views, 9 views, and the ground truth in one case. (The Dice ratios for the obtained segmentations are shown under each image.) Please refer to the text for details...... 64
5.1 Joint models. The black arrows indicate the recurrent connections...... 69 5.2 The overall architecture of our proposed model: U-Seg-Net + CLSTM..... 70 5.3 The architecture of 3D U-Seg-Net...... 71 5.4 Procedure of CLSTM...... 73 5.5 Surface rendering of segmentation results of U-Seg-Net, U-Seg-Net + CLSTM and the ground truth of one case from the axial view. (The Dice ratios for the obtained segmentations are 71.80% and 81.73%.) Please refer to the text for details. 75 5.6 Surface rendering of segmentation results of 3D U-Seg-Net and the ground truth of one case. (The Dice ratio and HD for the obtained segmentation are 91.35% and 12.57.) Please refer to the text for details...... 78
6.1 Four tumor substructures, edema (yellow), non-enhancing solid core (red), necrotic/cystic core (green), enhancing core (blue), reflected by multi-modal MRI: (A) T2-FLAIR, (B) T2, (C) T1-Gd, (D) overall combination [14]..... 83 6.2 Four different MRI modalities showing a high grade glioma, each enhancing different subregions of the tumor. From left: T1, T1-Gd, T2, and FLAIR [15].. 84 6.3 The architecture of the U-Seg-Net + CLSTM network for brain tumor segmentation. 87 6.4 Squeeze and excitation block [16]. Fsq and Fex are the squeeze and excitation operations. Fscale is the feature map recalibration operation...... 89 6.5 The SE-U-Seg-Net + CLSTM model...... 90 6.6 Slices extracted from different views and different modalities. Slices extracted from the sagittal, coronal and axial views are shown in the first, second and third rows, accordingly...... 92 6.7 Slices extracted from single view segmentation results using the U-Seg-Net and joint U-Seg-Net + CLSTM models. Sagittal, coronal and axial views are shown in the first, second and third rows, accordingly. Images are cropped and resized for better visualization. Please refer to the text for details...... 95

List of Acronyms
CNN Convolutional Neural Network
FCN Fully Convolutional Network
RNN Recurrent Neural Network
LSTM Long Short-Term Memory
CLSTM Convolutional Long Short-Term Memory
FC-LSTM Fully Connected Long Short-Term Memory
BD-LSTM Bi-directional Long Short-Term Memory
HD Hausdorff Distance
2D Two-Dimensional
3D Three-Dimensional
SGD Stochastic Gradient Descent
GPU Graphics Processing Unit
GAN Generative Adversarial Network
MRI Magnetic Resonance Imaging
CT Computed Tomography

1 Introduction
1.1 Background – image segmentation
Image segmentation is one of the most fundamental problems in image understanding, and it has broad applications in various areas, such as object detection, medical imaging and machine vision. Generally speaking, image segmentation is the process of partitioning a digital image (or 3D volume) into multiple meaningful regions. Pixels (or voxels) in each segment should possess certain similar characteristics, e.g., intensity or color consistency, to indicate that they belong to the same category or the same object of interest. When it comes to specific application areas, however, image segmentation often has different interpretations or aims. For instance, in machine vision, segmentation is regarded as a transition step between low-level and high-level vision subsystems [17]; in remote sensing, it is typically adopted as a prerequisite step for subsequent tasks such as landscape change detection or land cover classification [18]. Specifically, in medical image analysis, accurate automated segmentation plays a vital role in computer-aided diagnosis and image-guided therapy. In this particular context, the purpose of segmentation is to delineate anatomical structures or indicate the boundaries of organs or other regions of interest (ROI), which is often informed by prior knowledge of, or expectations about, the final segmentation result [19]. After the ROIs are segmented out, the geometric properties of the segments and other shape-related information can be derived, thus facilitating many clinical analyses and routines. Automatic segmentation of medical images is a very challenging task due to the high variability and complexity of medical images and the fact that the images are often corrupted by noise. Over the last two decades, remarkable progress has been made in the research community, with numerous solutions published for medical image segmentation. These methods can be roughly divided into four groups: thresholding-based, region-based, edge-based, and clustering-based [20]. We refer readers to some recent surveys [21–24] on these conventional methods. Some of the robust solutions have been integrated into commercial software packages. However, they are generally limited to segmenting clear-cut structures such as bones. Challenges remain in designing automatic solutions to partition more complex organs. The human brain is one example: while accurate delineation is difficult to achieve, cortical and subcortical parcellations are very important for detecting tumors, edema, and necrotic tissues. Accurate detection of these tissues is often the basis for reliable diagnoses. Recently, deep learning methods that utilize different types of deep artificial neural networks have been successfully and widely used in many typical computer vision tasks, such as image recognition, image classification and image segmentation. The advances and successes of deep learning models in semantic segmentation of natural images hold great promise for solving many difficult segmentation tasks in medical imaging. In this work, we focus on designing new deep learning models to solve 3D medical image segmentation problems.
1.2 Area overview
Although the earliest medical image segmentation work can be traced back to the thresholding technique, one of the oldest, simplest and most popular techniques, which has existed for several decades, we are not attempting a comprehensive survey of the progress of medical image segmentation methods from the beginning. Instead, we are more interested in the recent advances, especially on some very difficult medical image segmentation tasks, covering atlas-based methods, patch-based methods, and deep learning based methods. Atlas-based methods have attracted much attention in the medical imaging community, especially for brain image segmentation. Early atlas-based segmentation methods register the most similar atlas from a library of atlases to the target subject, and the segmentation or labeling of the target subject is obtained from the corresponding labels of the chosen atlas. However, such single-atlas segmentation methods tend to produce biased segmentation results. To overcome this problem, multi-atlas-based techniques [25, 26] have been proposed, which register multiple atlases (manually segmented by experts) to the target image, and then rely on label fusion or propagation to decide the voxel memberships within the coordinate framework of the target. Segmentation results of atlas-based solutions are highly dependent on the performance of the underlying registration algorithms. In addition, during the label fusion procedure, most methods do not consider the relevance between the sample and the atlases (the target sample is more similar to some atlases than to others) and simply assign the same weight to all atlases. Atlas-based techniques usually assume a one-to-one correspondence between the target image and the atlases, which imposes significant difficulties on the underlying registration and is very difficult to achieve in practice. Due to this issue, patch-based solutions have gained popularity for their simplicity and robust performance. Such solutions [27, 28] identify local similarities between the target image and anatomical atlases at the patch (small subvolume) level, for which a non-rigid or one-to-one registration step is not required. Voxel neighborhoods with similar intensity profiles are considered to belong to the same structure and therefore share the same labels. Patches can be extracted directly from predefined subvolumes across the training atlases. Since image similarities over small image patches may not lead to optimal estimation [29], several recent works [30–33] impose sparsity constraints on the label weights to ensure that only a small number of highly relevant atlas patches are selected to represent the underlying target patch. Discriminative dictionary learning [32, 33] and progressive dictionaries [34] have also been utilized to guide the representation transition from intensities to labels.
More recently, deep learning solutions, in particular deep convolutional neural networks (CNNs), have been widely used to solve image recognition problems in various domains, including computer vision and medical image computing [9, 35, 36], and have achieved state-of-the-art performance. Most of these applications adopt 2D convolutional networks that take image patches as input, which ignores the global contextual information; only a few approaches use a post-processing method such as a conditional random field (CRF) to further enforce spatial consistency. In [37], the authors used a multi-scale schema on top of the 2D/3D intensity patch input to enforce the spatial consistency of whole-brain 3D MRI segmentation. In [6], each pixel is first classified by applying the trained neural network to a patch surrounding the pixel; afterwards, the whole image is segmented in a sliding window fashion. In [7], a fully convolutional network (FCN) was proposed to densely label input images through an end-to-end deep network architecture. The U-Net model [9] improves on the FCN so that it works with very few training images yet still produces more precise segmentation results on medical images. The decoding path in U-Net is equipped with a large number of feature channels via skip/bridge connections to the corresponding encoding path, allowing the modified network to better propagate context information to higher resolution layers. A few 3D deep learning based segmentation methods [38–40] have been published recently to deal with 3D medical images directly. Typically, these methods can be grouped into two categories: (1) apply 2D segmentation models on the 2D slices, and stack/combine the 2D segmentation results in certain ways; (2) design new network architectures that directly take a 3D image volume as the input [39, 40]. Both categories have their own issues. The drawback of using 2D models to solve 3D segmentation tasks is that the spatial information between adjacent slices is not considered from the beginning, and thus the full 3D context may not be recovered during the subsequent post-processing step of slice stacking or combining. Applying a 3D
CNN model directly, on the other hand, involves many more parameters, and accordingly requires significantly more training data to avoid overfitting. However, it is generally not feasible to acquire many annotated 3D medical images. Another issue of applying 3D convolutional neural network models to 3D medical data is caused by the adopted isotropic kernels: because 3D medical images are typically scanned/produced slice by slice, the isotropic assumption might not be applicable for this type of 3D medical data [11].
1.3 Contributions
The aforementioned limitations of existing 3D image segmentation models constitute the main motivation for the work proposed in this dissertation. Specifically, we aim to explore the best strategies to address the loss of 3D contextual information in 2D slice-based segmentation solutions, taking into consideration the limited training samples and anisotropic properties of medical imaging. In light of the 3D contextual information loss in existing 2D slice-based medical image segmentation models, we intend to enrich the current deep learning algorithms from alternative perspectives. On one hand, given the limited training samples available for most 3D medical image segmentation tasks, we still adopt 2D slice-based FCN models as the basis, but explore the feasibility of combining 2D slice-based decisions from multiple perspective views, in order to compensate for the lost 3D contextual information. Furthermore, instead of a direct stacking method, better strategies to fuse the 2D decision maps are explored within ensemble learning. On the other hand, it has to be admitted that not all 3D contextual information can be fully recovered using this post-processing strategy. The ultimate remedy for this issue would be to directly instill 3D contextual information into the learning process from the beginning. Thus, we derive a generalized framework that combines a CNN and a recurrent neural network (RNN) to fully explore the neighborhood information as a guidance for 3D segmentation. Application-wise, the proposed segmentation models are validated on two important clinical tasks: hippocampus segmentation and multi-class brain tumor segmentation. Using the proposed models, we obtain rather high segmentation accuracies for both tasks, which are comparable to, or even surpass, some results reported by state-of-the-art solutions.
1.4 Dissertation overview
This dissertation is organized as follows. We first review the background work relevant to this dissertation: Chapter 2 describes the fundamental mathematics of deep learning, which is necessary to understand our proposed models. Chapter 3 is a survey of the current progress of deep learning based image segmentation algorithms on both medical and natural images; our proposed models are inspired by, or rely on, some of these works. We then present and illustrate our contributions to 3D image segmentation. In Chapter 4, we propose an automated 3D segmentation method relying on a multi-view ensemble convolutional neural network to combine multiple decision maps. We further apply this multi-view ensemble network to the hippocampus segmentation task to demonstrate its effectiveness. In Chapter 5, we continue to explore better ways to integrate inter-slice contextual information into 2D segmentation models, and propose to incorporate sequential learning, which has the ability to leverage inter-slice spatial dependencies, into a 2D segmentation network. Specifically, we adopt the convolutional long short-term memory (CLSTM) model to propagate the inter-slice information between FCNs, and design an end-to-end joint learning model to further improve the spatial consistency of 2D slice-based segmentation. We validate this model on the hippocampus segmentation task, as well as on a multi-class brain tumor segmentation task (Chapter 6).
Finally, we conclude this dissertation in Chapter 7 with a summary of our contributions as well as discussions on potential future work.

2 Preliminaries
Figure 2.1: A convolutional neural network for character recognition [1].
The convolutional neural network (CNN) is one of the most widely used deep learning architectures; its connectivity pattern between neurons is inspired by the organization of the animal visual cortex. A CNN is a special kind of feedforward neural network. Generally speaking, the modern framework of CNNs can be traced back to the LeNet-5 model, originally proposed by LeCun in 1989 [41] for the recognition of handwritten digits and later improved in [1]. As shown in Fig. 2.1, LeNet-5 is a multi-layer artificial neural network and can be trained with the backpropagation algorithm [42], similarly to other neural networks. The success of LeNet-5 demonstrated a very significant real-world application of CNNs, and it was the first work to highlight the practical need for key modifications of conventional neural nets, beyond plain backpropagation, toward modern deep learning. Compared with plain conventional neural networks, LeNet-5 uses convolution operations to replace general matrix multiplications in certain layers, which is the primary distinction of CNNs. Using convolution operations enables CNNs to extract local features and combine them to form higher-order features in a more efficient way. The typical steps of designing and training a neural network can be summarized as: specify the data set (train/test, input/output), determine the objective function based on the task, and choose an optimization procedure to train the model. Taking segmentation problems as an example, the input is usually images containing the objects of interest to be segmented, and the output of the model is the predicted segmentation probability map or binary mask. A commonly used objective function is the cross-entropy (for binary segmentation) defined on the output prediction and the ground-truth segmentation. To train the model, stochastic gradient descent (SGD) is commonly selected as the optimizer, updating the model's weights in alternating forward and backward propagation passes. The following sections of this chapter are grouped into two parts. The first part briefly reviews the main building blocks of CNNs, namely convolution, deconvolution, activation functions and pooling, and then introduces how to build a CNN using these blocks. The second part focuses on optimization-related concepts and strategies for deep learning practitioners, including optimization methods, data augmentation, regularization, as well as transfer learning, which is used to deal with small datasets and accelerate the training process. In the end, we also briefly introduce related definitions and concepts of recurrent neural networks. Note that, with the fast development of deep learning in recent years, new techniques are being developed and published rather rapidly; thus, only the basic and most related concepts are introduced in this dissertation.
2.1 Building blocks of CNN
Figure 2.2: The components of a typical convolutional neural network layer in ”complex layer terminology” [2]. 22
The convolutional layer is the core building block of a CNN. As shown in Fig. 2.2, a convolutional layer typically consists of several sequential operations, namely convolution, nonlinear activation and pooling. By stacking convolutional layers, CNNs are able to see successively larger portions of the image, and to extract higher-order, more complex features with increasing depth.
2.1.1 Convolution
In mathematics, convolution is an operation on two functions of a real-valued argument that calculates the amount of overlap of one function g(t) as it slides over another function f(t). The convolution s(t) of f(t) and g(t) [43] is defined as the integral of the product of the two functions after one is reversed and shifted, as follows:

$$s(t) = (f \ast g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \qquad (2.1)$$
However, in CNNs, the commutative property [44] of convolution is not preserved, and the convolution operation has a different meaning. In CNN terminology, we often refer to the first argument f(t) of the convolution as the input, the second argument g(t) as the kernel, and the output s(t) as the feature map(s). Convolution in CNNs is more of a cross-correlation operation [43], as shown in Eq. (2.2): a similarity measure of two series as a function of the displacement of one relative to the other, also known as a sliding dot product or sliding inner product. Many machine learning libraries, including deep learning libraries, implement cross-correlation but call it convolution. In the following, we also use this definition of convolution, specifically indicating operations without flipping of the kernel, which can be written as:
$$s(t) = (f \ast g)(t) = \sum_{\tau=-\infty}^{\infty} f(t + \tau)\, g(\tau) \qquad (2.2)$$

Fig. 2.3 shows a concrete example of 2D convolution. As shown, I is an input 2D matrix, K is a kernel matrix, and I ∗ K is their convolution result, which is the output feature map.
Figure 2.3: 2D convolution operations.
In the convolution operation, the orange area is called the local receptive field, the blue part is a 3 × 3 kernel, and the yellow block "4" is the sum of the dot product of the numbers inside the local receptive field and the kernel. By performing such a dot product operation over the entire input matrix in a sliding window fashion, we obtain the convolution result, a 5 × 5 feature map. The 2D convolution can be generalized to N-D convolution, defined on a multi-dimensional input array and a multi-dimensional kernel array, whose parameters are usually adapted by the learning algorithm for different applications. N is also called the kernel's spatial dimensionality. Convolved with the same input, kernels with different weights lead to different feature maps, which, from an image processing perspective, means the operation has different effects on the input, such as blurring, sharpening or edge detection. In a CNN, multiple kernels are used (or learned) at each level to extract different local features for that level, such as oriented edges, corners or blobs, as shown in Fig. 2.5. Fig. 2.4 shows a concrete example of how to convolve a two-channel 2D input (2 × 5 × 5) to generate three-channel 2D output feature maps (3 × 3 × 3) using 6 different 3 × 3 2D kernels (2 × 3 × 3 × 3) [3]. The operation assumes the kernel slides over the input with stride 1 (stride: the distance between two consecutive positions of the sliding window). In practice, these settings can be tuned differently for different models or tasks.
The exact shape of the output feature map from a convolutional layer is determined by the shape of its input, the kernel shape, the padding strategy, and the stride. We refer readers to [3] for more detailed explanations.
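To make the sliding-window computation concrete, the following minimal NumPy sketch implements the CNN-style 2D "convolution" of Eq. (2.2), i.e., cross-correlation without kernel flipping. The function name and the "valid" (no padding) setting are illustrative assumptions rather than a reference implementation; the output shape follows the input/kernel/stride relation described above.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """CNN-style 'convolution': a sliding dot product (cross-correlation,
    i.e., the kernel is not flipped), with no padding ('valid' mode)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1   # output height from input/kernel/stride
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local receptive field under the current window position
            field = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            out[i, j] = np.sum(field * kernel)   # sliding dot product
    return out

image = np.random.rand(7, 7)
kernel = np.ones((3, 3))
print(conv2d(image, kernel).shape)   # (5, 5), matching the example in Fig. 2.3
```

Multi-channel convolution, as in Fig. 2.4, simply repeats this computation for each pair of input and output channels and sums the results over the input channels.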
Figure 2.4: Two-channel input feature maps to three-channel output feature maps [3].
Compared with the dense connections used in conventional neural networks, the convolutions in a CNN are a type of sparse connection, which has much fewer parameters. Such a design not only reduces the model size, but also significantly improves the abstraction power of the CNN. In addition, the biggest difference between the convolution filters (kernels) in a CNN and the hand-crafted ones in traditional image processing is that these kernels are learned in a task/data-driven manner, by optimizing the whole network on the training data set under the guidance of the objective function. Fig. 2.5 provides an example that visualizes the learned filters in the AlexNet model [35], which is trained on the ImageNet dataset [45] for a multi-class image classification task. It can be seen that the learned filters of the first layer are able to detect some basic local patterns, such as edges or blobs, which are very similar to hand-crafted Gabor filters. In the middle layers, the network learns to extract more complex features, such as texture patterns or object parts. In the end, for the classification layer, the network learns class-specific features, which are directly related to the final task.
Figure 2.5: Visualization Conv1,3,5 neurons learned from ImageNet dataset [4].
Other than the aforementioned basic convolution operation, several new convolution variants have been developed recently for different applications, such as dilated convolution [46] and deformable convolution [47]. We refer readers to [16, 48, 49] for more details.
2.1.2 Activation function
In deep learning models, the activation function basically decides whether a neuron should be activated or not; it serves as a gate controlling whether or not to pass the signal to the next layer, similar to biological neurons. This gate typically applies a nonlinear transformation to the input signal, then feeds it to the next layer of neurons as input. Due to this nonlinearity, the network is capable of solving more complex problems than simple linear regression. Theoretically, by imposing non-linear transformations, the overall network is able to learn and represent arbitrarily complex functions. In addition, differentiability is another prerequisite for the activation functions used in deep learning models, in order to optimize the models via gradient descent methods (introduced in Section 2.2.1). Next, several widely used nonlinear activation functions are described, with their advantages and disadvantages.
Figure 2.6: Activation functions. (Image source: https://medium.com/@shrutijadon10104776)
Sigmoid is one of the most widely used activation functions; its curve is S-shaped, as shown in Fig. 2.6. Since the Sigmoid function maps the input to an output ranging between 0 and 1, it is often used in the output layer for models that predict a class probability. Sigmoid is rarely used in the middle hidden layers, due to the vanishing gradient problem that impedes training. The vanishing/exploding gradient problem means: if the input (network parameters) of the activation function is too small or too large, the gradient of the activation function has very small values, making the gradients close to zero ("vanishing"), so the parameters of the layers would never be updated; on the contrary, if the input to the activation function is around zero, the gradient of the activation function is a large number, which leads to a very big update of the learned parameters of the model ("exploding"). The derivative of Sigmoid is illustrated in Fig. 2.7(a). Sigmoid is also a differentiable and monotonic function. Tanh (the hyperbolic tangent function) is also an S-shaped activation function, and it performs slightly better than Sigmoid in dealing with vanishing gradients, since the gradients still shrink, but much more slowly. ReLU (rectified linear units) is currently the most widely used activation function, defined as f(x) = max(0, x). This means the output of the ReLU, f(x), is zero when the input signal x is smaller than or equal to zero. Such a setting can reduce the learning ability of the overall network, since the output of the activation is always zero when x is negative, and these neurons become "dead". For this issue, some alternatives have been proposed to solve the irreversible "dead neuron" problem of ReLU, such as leaky ReLU, Maxout and ELU, which are shown in Fig. 2.6.
Figure 2.7: Derivatives for (a) Sigmoid (b) Tanh (c) ReLU.
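As a concrete reference for Figs. 2.6 and 2.7, the sketch below (NumPy; names are illustrative) implements the activations above together with their derivatives. It makes the failure modes visible: the Sigmoid derivative never exceeds 0.25, which drives vanishing gradients, while the ReLU derivative is exactly zero for negative inputs, producing "dead" neurons.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25; gradients shrink layer by layer

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # peaks at 1.0; shrinks more slowly than Sigmoid

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)    # exactly 0 for x <= 0: the "dead neuron" issue

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small negative slope keeps units alive
```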
2.1.3 Pooling
Pooling is an operation that summarizes statistics of subareas of the previous feature maps; it reduces the size of the feature maps and also makes the feature maps' representation more robust to small shifts and distortions. Pooling works by sliding a window across the input and feeding the content of the window to calculate a local statistic, i.e., the mean or the maximum. Fig. 2.8 shows a max pooling example that transforms a 4 × 4 input into a 2 × 2 feature map.
Figure 2.8: Max pooling example.
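A minimal max pooling sketch (NumPy; illustrative interface) is given below. Besides the pooled map, it records the coordinates of each window maximum, which is exactly the bookkeeping required by the unpooling operation discussed in the next subsection.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling: summarize each window by its maximum, and remember
    where each maximum came from (needed later for unpooling)."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    argmax = np.zeros((out_h, out_w, 2), dtype=int)  # max-value coordinates
    for i in range(out_h):
        for j in range(out_w):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
            r, c = np.unravel_index(window.argmax(), window.shape)
            argmax[i, j] = (i*stride + r, j*stride + c)
    return out, argmax

x = np.arange(16, dtype=float).reshape(4, 4)
pooled, argmax = max_pool2d(x)
print(pooled.shape)   # (2, 2), as in Fig. 2.8
```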
2.1.4 Upsampling
For detection/classification CNN models like LeNet-5 and AlexNet, after several convolution+pooling operations, fully-connected layer(s) are added to predict the input's class probability. In semantic segmentation problems, however, the purpose is to generate a pixel-wise segmentation probability map of the same size as the input data, so the down-sampled feature maps need to be scaled back to the original input size to facilitate this task. Accordingly, it is necessary to design a process opposite to convolution (or pooling) to expand the feature maps, namely upsampling. In the following, we introduce three different upsampling operations: unpooling, interpolation, and deconvolution. Unpooling layers have the opposite effect of pooling layers, and exist together with their corresponding pooling layers. In each max-pooling layer, the coordinates of the max values are stored, so in the corresponding unpooling layer, values from the previous layer are entered at the stored coordinates, with zeros at the remaining positions, as shown in Fig. 2.9. By doing this, localization information is restored. Interpolation is the most straightforward way of upsampling low-resolution feature maps to high-resolution feature maps. Commonly used interpolation methods
Figure 2.9: Unpooling.
include, but are not limited to, piecewise constant (nearest neighbor), linear, polynomial and spline interpolation. Interpolation methods can also be grouped by the dimensionality of the input data, such as bilinear and bicubic (2D), and trilinear (3D) interpolation. Other than unpooling and interpolation, deconvolution is currently the most widely used upsampling method for semantic segmentation tasks. Different from interpolation or unpooling, the parameter weights/kernels in deconvolution are learnable through the network's optimization procedure. The concept of deconvolution is widely used in signal and image processing. In mathematics, deconvolution is an operation that reverses the effect of a "convolution" operation. For instance, assume h is a natural image, k is a smoothing kernel, and g is the smoothed image obtained by performing a convolution operation on h using kernel k, i.e., g = h ∗ k. Deconvolution is the operation that recovers the original image h, given g and k. However, the term deconvolution in CNNs specifically indicates a convolutional operation that upsamples the input to an output of higher spatial dimension. Thus, some may argue that deconvolution is an inappropriate name for such an operation in CNNs, and that transposed convolution or backward convolution would be more appropriate, since this specific operation is achieved through a transposed convolution.
Figure 2.10: 2D transposed convolution example. ∗ denotes convolution, ∗−1 denotes transposed convolution, and ⊗ denotes matrix multiplication. (Example source: https://towardsdatascience.com/@naokishibuya)
Fig. 2.10 is an example that explains how transposed convolution works. In (a), it shows a 2D convolution example: a 4 × 4 input is convolved with a 3 × 3 kernel with a stride of 1 and no padding, which generates a 2 × 2 output matrix. This convolution can be reinterpreted as a matrix multiplication, as illustrated in (c). First, the 3 × 3 kernel is rearranged into a 4 × 16 matrix (according to the stride and padding), and the input is flattened into a 16 × 1 vector. Then, the matrix multiplication between them produces a 4 × 1 column vector. If we reshape this vector into a 2 × 2 matrix, it is the same as the output from (a). Upon this, transposed convolution is defined as the "reverse" of such an operation, which means we want to generate a 4 × 4 matrix from a 2 × 2 input and a 3 × 3 kernel, as shown in (b). To achieve this, we transpose the 4 × 16 kernel matrix in (c) into a 16 × 4 matrix, and multiply it with the flattened 4 × 1 input column vector. In this way, we obtain a 16 × 1 vector, which can be reshaped into a 4 × 4 matrix of the same size as in (b). The whole procedure of transposed convolution is shown in (d). One thing to note is that the same kernel size and kernel values are used in both the convolution and the transposed convolution for demonstration in Fig. 2.10, but they need not be the same in practice. For convolution operations, each pixel in the output feature map is the summation of a local receptive field multiplied with a kernel matrix; thus it is a "many to one" mapping. On the contrary, transposed convolution is a "one to many" mapping. Fig. 2.11 shows the opposite effects of the two mappings. As with convolution, the kernel is learnable, and the shape of the output feature maps from a transposed convolution is also affected by the stride and padding size. Since transposed convolution is a "reversed" convolution, this relationship can be derived directly from that of convolution. We refer readers to [3] for more details.
Figure 2.11: Mappings in convolution and transposed convolution.
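The matrix view of Fig. 2.10 can be reproduced in a few lines of NumPy. The sketch below is an illustration under the same assumptions as the figure (stride 1, no padding; the helper name is hypothetical): the kernel is rearranged into the 4 × 16 matrix C, convolution becomes C @ x ("many to one"), and transposed convolution becomes C.T @ y ("one to many"). Note that C.T @ y only restores the 4 × 4 shape, not the original values; transposed convolution is not an inverse of convolution.

```python
import numpy as np

def conv_matrix(kernel, in_size=4):
    """Rearrange a k x k kernel into the matrix C such that
    (C @ x.flatten()).reshape(out, out) equals conv(x, kernel),
    assuming stride 1 and no padding."""
    k = kernel.shape[0]
    out = in_size - k + 1
    C = np.zeros((out * out, in_size * in_size))
    for i in range(out):
        for j in range(out):
            row = np.zeros((in_size, in_size))
            row[i:i+k, j:j+k] = kernel     # kernel placed at one window position
            C[i * out + j] = row.flatten()
    return C

kernel = np.random.rand(3, 3)
x = np.random.rand(4, 4)
C = conv_matrix(kernel)                        # 4 x 16 matrix, as in Fig. 2.10(c)

y = (C @ x.flatten()).reshape(2, 2)            # convolution: many-to-one mapping
x_up = (C.T @ y.flatten()).reshape(4, 4)       # transposed convolution: one-to-many
print(y.shape, x_up.shape)                     # (2, 2) (4, 4)
```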
2.2 Training of CNN
Although a concrete design of the CNN architecture is one of the preconditions for the model to be successful in a target task, training a CNN is arguably the most difficult of all the problems in deep learning, and many strategies and tricks have been proposed to ease it. In the following sections, several commonly used optimization methods for CNNs are first introduced, followed by some training tricks and strategies, including data augmentation, regularization, and transfer learning. Note that these methods and tricks are not specific to CNN training, and can be applied to different types of deep learning models.
2.2.1 Optimization
The goal of a feedforward network is to approximate some function h, so that given a certain kind of input the network generates the expected output. With the building blocks introduced in the previous sections, a CNN is ultimately a nonlinear function determined by the parameters θ. For a given task, the goal of training the CNN is to seek a set of optimal parameters, which is achieved by optimizing a well-defined objective function J(θ). For example, the LeNet-5 model maps an input image to one of the 26 characters. Information flows through the network from the input x, then through the intermediate operations, and finally to the
output hθ(x), a vector representing the predicted probability p(y^(i) = j | x^(i)) for each of the k different possible categories. The objective function for such a classification task can be written as:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=0}^{k} \mathbf{1}\{y^{(i)} = j\} \log p(y^{(i)} = j \mid x^{(i)}; \theta) \qquad (2.3)$$

where 1{·} is the indicator function, such that 1{a true statement} = 1 and 1{a false statement} = 0, y^(i) denotes the ground-truth label of the i-th sample, and j indexes the possible categories. Since a CNN model is typically very complex, consisting of multiple layers, it is not feasible to derive a closed-form solution for this non-convex objective function. Thus, gradient descent is used to optimize nearly all deep learning problems, by propagating the gradient of the objective function to update the parameters of each layer, from the output layer back to the input layer. This is called back-propagation, an iterative optimization procedure. During each iteration, the update can be written as:
$$\theta_{t+1} = \theta_t + \Delta\theta_t \qquad (2.4)$$
$$\Delta\theta_t = -\lambda \cdot \nabla_\theta J(\theta; x; y) \qquad (2.5)$$
where θt denotes the parameters at the t-th iteration, and λ is the learning rate, which determines how much the weights of the network are adjusted with respect to the loss gradient. Large training data sets improve the recognition ability of deep neural networks, but also increase the computational complexity. Different from optimization for traditional machine learning algorithms, which process all the training examples simultaneously in one large batch, deep learning models, including most CNN models, use a mini-batch optimization strategy, which computes each parameter update based on an expected value of the cost function estimated using only a subset of the training data. This is called mini-batch stochastic gradient descent, now commonly called simply stochastic gradient descent (SGD), as expressed by:
$$\Delta\theta_t = -\lambda \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)}) \qquad (2.6)$$

where i stands for a mini-batch sampled from all training data. Gradient descent has often been regarded as slow. For deep learning models, which typically have complicated structures and are trained on large data sets, improving the training speed of the SGD algorithm without hurting convergence is crucial and tricky. In general, vanilla SGD does not guarantee good and fast convergence, due to the following challenges:
• In SGD, choosing an appropriate learning rate is crucial and difficult. A small learning rate leads to slow convergence, while a large learning rate hinders convergence and causes the loss function to fluctuate around the minimum. Thus, the learning rate is currently treated as a tunable hyper-parameter.
• For high-dimensional parameters, SGD sets the same learning rate for each dimension without considering the fact that each dimension contributes to the overall cost in different ways.
• Another challenge is how to avoid getting trapped in local minima or saddle points, where the gradient is close to zero in all dimensions.
Several optimizers have been developed to deal with the aforementioned challenges, such as learning rate annealing methods, or setting a different learning rate for each dimension of the parameters. Another potential direction is to use second-order derivatives of the cost function to guide and speed up gradient descent; however, the computational cost is prohibitive. Therefore, the majority of practical solutions seek ways to approximate the second-order information. Several commonly used SGD variants have been proposed, such as SGD with momentum, AdaGrad, Adadelta, RMSProp and Adam. For the complete theory behind these optimizers, we refer readers to [50].
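As an illustration of the first variant listed above, the sketch below (NumPy; the function name, hyper-parameter values and toy objective are illustrative assumptions, not the settings used in this dissertation) performs the update of Eqs. (2.4)-(2.6) extended with a momentum term that accumulates past gradients to damp oscillations.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One mini-batch update of SGD with momentum: the velocity term
    accumulates past gradients, smoothing the parameter trajectory."""
    velocity = momentum * velocity - lr * grad   # momentum-augmented update
    theta = theta + velocity                     # theta_{t+1} = theta_t + delta
    return theta, velocity

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2*theta.
theta = np.array([5.0, -3.0])
velocity = np.zeros_like(theta)
for t in range(100):
    grad = 2.0 * theta        # in practice: backpropagation on a mini-batch
    theta, velocity = sgd_momentum_step(theta, grad, velocity)
print(theta)                  # approaches the minimum at [0, 0]
```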
2.2.2 Data augmentation
Optimizing a convolutional neural network is a non-convex optimization problem, which is well known to be very difficult. In addition, deep CNN models typically consist of millions of parameters. Therefore, to ensure good performance as well as good generalization of a deep CNN model, a large training data set that is diverse enough to represent the overall data distribution is necessary. However, such conditions are usually difficult to meet, especially in medical image computing. Thus, one of the most widely used methods to overcome this problem of limited quantity and diversity of data is to generate (manufacture) new training data by augmenting the existing data. For image data, there are typically two approaches to data augmentation. One is to alter the original images with geometric transformations, such as rotation, flipping, scaling, cropping, translation and adding noise.
More recently, generative models, especially Generative Adversarial Networks (GANs) [51], have also been utilized to generate synthetic images, due to their extraordinary performance in learning latent representations from the original training data. Some applications have shown that adding these synthesized images to the training set can help the training procedure [52]. Fig. 2.12 shows examples comparing original real images with synthetic images generated using cycle GAN [5].
Figure 2.12: Images generated by cycle GAN [5]. Images in first and third columns are real images and the transferred output are in second and fourth columns.
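A minimal sketch of the first, geometric style of augmentation is shown below (NumPy; the particular transformations and the noise level are illustrative choices, not those used in our experiments).

```python
import numpy as np

def augment(image, rng):
    """Randomly apply geometric transformations and additive noise."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                        # random horizontal flip
    image = np.rot90(image, k=rng.integers(0, 4))       # rotate by 0/90/180/270 degrees
    image = image + rng.normal(0.0, 0.01, image.shape)  # additive Gaussian noise
    return image

rng = np.random.default_rng(0)
batch = [augment(np.random.rand(64, 64), rng) for _ in range(8)]
```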
2.2.3 Regularization
A central problem in machine learning is how to make algorithms perform well on new, unseen inputs, which is known as the generalization ability of the learned model. This is especially important for training CNN models, because they are prone to overfitting. For deep learning models, several regularization strategies that are commonly used for traditional machine learning algorithms still apply. For example, some methods impose restrictions on the parameter values by adding extra terms to the objective function, such as sparsity constraints. Sometimes, these terms can also be designed to encode specific kinds of prior knowledge, or to express a generic preference for a simpler model. Next, we introduce two regularization techniques that were specifically designed for deep learning models. Dropout: Dropout is a specific technique to combat the overfitting problem of neural networks. The term "dropout" refers to randomly dropping units (along with their connections) from the neural network during training, which significantly alleviates the parameter co-adaptation problem. During training, different thinned networks are sampled from the original model by removing (zeroing out) a random fraction of the nodes (and the corresponding activations). During testing, all activations are used to generate the prediction, but they are multiplied by a factor to account for the missing activations during training. Overall, dropout can be viewed as a form of ensemble learning, in which several different classifiers with different numbers of parameters are trained separately, and the final prediction, generated by averaging the responses of all these classifiers, is used at the test stage, as illustrated in Fig. 2.13.
Figure 2.13: Subnetworks after dropout.
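The dropout forward pass described above amounts to a masked multiplication; a sketch follows (NumPy; p denotes the drop probability, and the interface is illustrative).

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Dropout: zero out a random fraction p of units during training;
    at test time keep all units and rescale by the keep probability
    (1 - p) to match the expected activation seen in training."""
    if training:
        mask = (np.random.random(activations.shape) > p).astype(float)
        return activations * mask        # a randomly 'thinned' subnetwork
    return activations * (1.0 - p)       # test-time scaling factor
```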
Batch normalization: The main idea of batch normalization is inspired by the typical normalization of the input training data. It provides a remedy for the internal covariate shift problem by normalizing the activations of each intermediate layer of the CNN. Specifically, during training, a Z-score normalization is applied to the output of the previous layer, subtracting the batch mean and dividing by the batch standard deviation. In this way, each layer of the network can learn a little more independently of the other layers; it also allows much higher learning rates, due to its inhibiting effect on abnormal activations. As it introduces some noise into each layer, batch normalization is also regarded as a kind of regularization. If it is used together with dropout, extra attention should be paid to avoid losing too much useful information during training.
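The training-time computation of batch normalization is a per-feature Z-score over the mini-batch followed by a learnable rescaling; the sketch below (NumPy; illustrative interface, omitting the running statistics used at test time) shows this for a batch of feature vectors.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Z-score normalize each feature over the mini-batch (axis 0),
    then rescale and shift with the learnable parameters gamma, beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

x = np.random.rand(32, 10)   # a mini-batch of 32 samples with 10 features
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
```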
2.2.4 Transfer learning
Figure 2.14: Transfer learning.
In general, transfer learning is a research problem in machine learning that focuses on how to apply prior knowledge gained from solving one task to a different but related task. This is an extremely valuable technique for deep learning, since most tasks at hand suffer from the issue of training data scarcity. In deep learning, the most commonly used transfer learning approach is to finetune a pretrained source model on new datasets for new tasks. The pretrained model could come from training on related tasks, or directly from released models that were trained on very large, general data sets. For CNN models, the validity of this type of transfer learning lies in the fact that the lower layers of a CNN extract very generalized latent information that is not task-specific and can easily transfer to other tasks. Fig. 2.14 illustrates an example of CNN transfer learning: we use the parameters of a pretrained model as the initialization for a different task, remove the last task-specific classifier layer, and then finetune the overall model using data from the new task.
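In PyTorch, this recipe takes only a few lines. The sketch below is illustrative rather than the exact procedure used in this dissertation: it assumes an ImageNet-pretrained ResNet-18 as the source model and a hypothetical two-class target task.

```python
import torch.nn as nn
import torchvision.models as models

# Load a model pretrained on a large, generalized dataset (ImageNet).
model = models.resnet18(pretrained=True)

# Freeze the lower layers, which extract generalized, task-agnostic features.
for param in model.parameters():
    param.requires_grad = False

# Replace the last task-specific classifier layer for the new task
# (num_classes = 2 is a hypothetical choice).
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Finetuning then trains only the new layer (or, optionally, the whole
# network with a small learning rate) on data from the new task.
```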
2.3 RNN - recurrent neural network
Figure 2.15: Structures of RNN: Left (a): rolled RNN, Right (b): unrolled RNN [2]
Recurrent neural networks (RNNs) [2] are a family of neural networks that are specialized for processing sequential data x1, x2, ..., xn, whose elements are related in time or space and whose order cannot be changed. Different from CNNs, which treat data points independently and lose the network states after processing each data point, RNNs model data points with temporal or sequential structure, and are able to maintain "memory information" about previous inputs that affects the following network outputs. RNNs have been successfully applied to solve many natural language processing problems [53–55]. Fig. 2.15(a) shows a typical RNN model. It takes an input sequence x and produces an output sequence o, and has recurrent connections within the hidden units h. Fig. 2.15(b) is the corresponding unfolded graph of the same RNN model, which shows in more detail how the sequential information passes through the model. Generally, the majority of RNNs use the following or a similar equation to define their hidden units:

$$h^{(t)} = \tanh(b + W h^{(t-1)} + U x^{(t)})$$
Here U and W are the weight matrices connecting the input units with the hidden units, and the previous hidden units with the current hidden units, respectively, and tanh is the activation function. The equation is regarded as recurrent because the hidden state h at the current time point t relies on the same definition at the previous time point t − 1, and the same operation is applied at each time point of a long sequence. The final output layer o reads information out of the hidden state h to make predictions:
$$o^{(t)} = c + V h^{(t)}$$
where V is the weight matrix connecting the hidden units with the output units. An apparent difference between traditional deep neural networks and RNNs is that the same parameters (U, V, W) in an RNN are shared across all steps. This means RNNs perform the same task at each time step, so the model can take inputs of different lengths and also requires fewer parameters. The time step index t shown in the equations above does not need to be interpreted as the flow of time in the real world; it refers only to the position in a sequence that follows a certain order. In addition, RNNs may also be applied in higher dimensions, including spatial domains such as images.
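To make the recurrence concrete, the following minimal NumPy sketch (illustrative names and toy dimensions) unrolls the two equations above over a short sequence, reusing the same parameters U, W and V at every time step.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    """Unrolled forward pass of the vanilla RNN defined above; the same
    parameters (U, W, V) are shared across all time steps."""
    h = np.zeros(W.shape[0])              # initial hidden state h^(0)
    outputs = []
    for x_t in x_seq:                     # one step per sequence element
        h = np.tanh(b + W @ h + U @ x_t)  # h^(t) = tanh(b + W h^(t-1) + U x^(t))
        outputs.append(c + V @ h)         # o^(t) = c + V h^(t)
    return outputs

# Toy dimensions: 3-dimensional inputs, 5 hidden units, 2 output units.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
W = rng.normal(size=(5, 5))
V = rng.normal(size=(2, 5))
outs = rnn_forward([rng.normal(size=3) for _ in range(4)],
                   U, W, V, b=np.zeros(5), c=np.zeros(2))
```

3 Literature Review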
3 Literature Review

Medical image segmentation is a key enabling technology for various medical applications such as diagnostics, planning and guidance. Automatic and efficient implementations have long been a research focus in the medical image computing community, for several decades now. Prior to the emergence of deep learning methods in recent years, medical image segmentation witnessed a long and rich history of developments, which can be roughly grouped into three stages, according to [56]. The first stage mainly consists of low-level techniques that still rely on heuristics, such as thresholding, region growing, and edge tracing. In the second stage, the focus shifted to uncertainty models, better optimization methods, and the avoidance of heuristics. Many classical and popular models were invented in this stage, including fuzzy C-means clustering [57], active contour [58], graph cut, and some early learning based solutions [59, 60]. In the more recent third stage, more effort was spent on how to efficiently incorporate high-level knowledge (e.g., a prior shape of the desired object, or expert-defined rules) into segmentation models. As introduced previously, atlas-based and patch-based segmentation models started to become the mainstream for different medical segmentation tasks. In this section, due to the broad spectrum of topics and long history of medical image segmentation, we do not aim to conduct an exhaustive survey of the current literature, but rather devote the remaining section to more recently proposed deep learning based segmentation solutions, which provide a sufficient basis for the later discussion of our own models. Specifically, we review some recent CNN and RNN based segmentation models, followed by a related joint model. For a more comprehensive review of classical medical image segmentation, we refer readers to several survey papers [56, 61, 62].
3.1 CNN based Segmentation
In recent years, the availability of large image datasets and high-performance computing systems, such as GPUs and large-scale distributed clusters, has made training very deep CNNs possible. Since 2006, many methods [63–66] have been developed to overcome the difficulties encountered in training deep CNNs. In 2012, the AlexNet [35] model proposed by Krizhevsky et al. won the ILSVRC-2012 computer vision competition [67] by a significant margin over the other, traditional non-CNN methods. Although the AlexNet model combines very old CNN concepts (pooling and convolutional layers, variations on the input data) with several new key insights (a very efficient GPU implementation, ReLU neurons, dropout), it demonstrated very well the power of a well-trained deep CNN. Following the success of AlexNet, several other CNN based works including VGGNet [68], GoogleNet [48] and ResNet [66] kept breaking the ILSVRC ImageNet challenge record, and many other works [36, 69, 70] have also demonstrated the success of deep CNNs in other recognition or classification applications. Therefore, it is natural for researchers to consider how to progress from the coarse inference in recognition tasks to fine pixel-wise prediction in segmentation tasks. Two approaches are commonly used to apply or extend CNN based recognition methods to segmentation problems [71]. The first is to train a CNN model on small patches extracted from the training images, using the center pixel's class as the label; the final segmentation result is obtained by applying the CNN model across all pixels of the test image. The second is an end-to-end solution, which relies on training a fully convolutional network (FCN) on whole images or image proposals (parts of the image containing a single object), and outputs the segmentation result directly from the network. Next, we introduce several representative CNN based segmentation models from these two categories.
Figure 3.1: A patch-wise CNN segmentation model for the membrane segmentation of Electron Microscopy images [6].
3.1.1 Patch-wise CNN for segmentation
In [6], a deep convolutional neural network is adopted as a pixel classifier for the membrane segmentation of Electron Microscopy (EM) images. The deep CNN is trained on image patches extracted from the original whole image, and the label of each image patch is decided by its center pixel. Specifically, if the center pixel of an image patch belongs to a membrane, the patch is assigned the membrane class. In this way, the image segmentation task becomes a dense patch-wise classification of each extracted image patch. Given a new test image, the final segmentation result is obtained by applying the trained patch-wise classifier to each pixel of the test image. Fig 3.1 displays the whole process of segmenting an EM image and the architecture of the patch-wise CNN segmentation model they used. In their work, each pixel is treated independently, and the relationship between adjacent pixels is not considered. Their experimental results showed that this method achieved state-of-the-art performance, and it won the EM segmentation challenge at ISBI 2012 by a large margin. One apparent drawback of this patch-wise CNN segmentation method is the extremely high time cost during prediction. Another drawback is that the model ignores global information due to the constraint of the patch size: there is always a tradeoff between the patch size and the amount of global information the model can involve. With a larger patch size, more global information can be incorporated, but fine local details may be lost. To address this, multi-scale solutions [72] have been proposed to capture different levels of information and further improve the segmentation accuracy.
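The center-pixel labeling scheme can be sketched in a few lines; the patch size below is a hypothetical choice, and [6] should be consulted for the exact settings.

```python
import numpy as np

def extract_patches(image, label_map, patch_size=65):
    """Patch-wise training data: each patch is labeled with the class of
    its center pixel (patch_size is illustrative). At test time, the
    classifier is slid over every pixel, which is what makes this style
    of prediction so slow."""
    r = patch_size // 2
    padded = np.pad(image, r, mode="reflect")   # so border pixels get patches
    patches, labels = [], []
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patches.append(padded[i:i + patch_size, j:j + patch_size])
            labels.append(label_map[i, j])      # center pixel's class
    return np.stack(patches), np.array(labels)
```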
3.1.2 Fully convolutional network
Figure 3.2: The architecture of FCN [7].
Instead of applying existing successful CNN models as patch-wise classifiers for the image segmentation task, researchers have also tried to design more efficient end-to-end CNN-based segmentation architectures. The problem with using a CNN for segmentation tasks is that the spatial information would be significantly or totally abandoned by the stacked convolutional and pooling operations, as well as the final fully connected layer. Although the spatial information might not be very useful for object recognition, it is extremely important for accurate segmentation. To deal with this problem, Long et al. [7] proposed the fully convolutional network (FCN) model, which replaces the fully connected layers of CNNs with convolutional layers. In other words, they convert classification nets into convolutional nets that produce coarse pixel-wise class probability maps. The coarse probability maps are then upsampled to the same spatial size as the input image using the deconvolution operation. However, the segmentation resulting from upsampling only the final convolutional layer is quite coarse and loses many details. They addressed this with a skip-net architecture that integrates the outputs of different convolutional layers. This combination of lower, fine layers and upper, coarse layers forces the model to respect global structure while making local predictions, which significantly improves the segmentation accuracy. The FCN model is the first work to train a CNN based end-to-end model for pixel-wise image segmentation, and it can also utilize existing supervised pre-trained CNN models. However, the model contains a large encoding network and a relatively small decoding network. As a result, it is good at extracting the overall shape of an object but has difficulty accurately reconstructing the highly nonlinear structures of object boundaries.
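The two core ideas, 1×1 convolutions in place of fully connected layers and deconvolution-based upsampling fused with a skip connection, can be illustrated with a toy network; this is a minimal sketch of the principle, not the architecture of [7], and all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """FCN-style sketch: conv encoder, 1x1-conv 'classifier' replacing the
    fully connected layers, deconvolution for upsampling, one skip."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1),
                                    nn.ReLU(), nn.MaxPool2d(2))   # 1/2 size
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1),
                                    nn.ReLU(), nn.MaxPool2d(2))   # 1/4 size
        self.score2 = nn.Conv2d(32, num_classes, 1)  # 1x1 conv instead of FC
        self.score1 = nn.Conv2d(16, num_classes, 1)  # scores from finer layer
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)
        self.up1 = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)

    def forward(self, x):
        f1 = self.block1(x)                  # fine features, 1/2 resolution
        f2 = self.block2(f1)                 # coarse features, 1/4 resolution
        coarse = self.up2(self.score2(f2))   # upsample the coarse prediction
        fused = coarse + self.score1(f1)     # skip: fuse fine-layer scores
        return self.up1(fused)               # back to the input resolution
```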
3.1.3 SegNet
Figure 3.3: The architecture of SegNet [8].
To ameliorate the aforementioned boundary issues of FCN, some researchers [73, 74] tried to integrate FCN with conditional random fields and showed some progress.
However, this imposes an extra computational burden and difficulty on the overall process. In [8], the more elegant and efficient SegNet model was proposed to improve FCN, adopting an end-to-end segmentation architecture as shown in Fig 3.3. Overall, SegNet is composed of two parts: an encoder network and a decoder network. The encoder network has exactly the same architecture as the VGGNet model, except that the final three fully connected layers are removed. By doing so, the encoder network contains only around 10% of the parameters of the original VGGNet model, which to a certain degree eases training. The decoder network is symmetric to the encoder network: for each layer in the encoder network, there is a corresponding layer in the decoder network. To recover the spatial information lost in the encoder network and reconstruct the original size of the activations, upsampling layers, which perform the reverse operation of pooling, are employed in the decoder network. Similarly, convolutional layers are also utilized in the decoder network to densify the enlarged yet sparse activations produced by the upsampling layers. By stacking upsampling, convolutional and rectification layers together, a "reverse" of the encoding process, the so-called decoder network, is constructed and connected to the encoder network; together they form the overall SegNet. The encoder and decoder networks are more balanced in SegNet than in FCN, and experimental results also showed substantially better performance than FCN.
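The index-preserving unpooling that characterizes SegNet's decoder can be sketched in a few lines (PyTorch assumed for illustration; channel counts are arbitrary):

```python
import torch
import torch.nn as nn

# SegNet-style upsampling: the encoder's max-pooling records the locations
# of the maxima, and the decoder's unpooling reuses those indices to place
# activations back, followed by convolutions that densify the sparse map.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
densify = nn.Conv2d(8, 8, 3, padding=1)

x = torch.randn(1, 8, 32, 32)
encoded, indices = pool(x)            # 16x16 map plus argmax locations
decoded = unpool(encoded, indices)    # sparse 32x32, values only at maxima
decoded = densify(decoded)            # convolution fills in the sparse map
```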
3.1.4 U-Net
Based on the work of FCN, [9] proposed the U-Net model, another end-to-end segmentation model that works very well with very few training images and yields more precise segmentation results, especially on biomedical images. As can be seen in Fig. 3.4, the architecture is very similar to SegNet's. U-Net has a contracting path and a symmetric expanding path.
Figure 3.4: The architecture of U-Net [9].
The contracting path is used to capture latent high-order features, and the expanding path is used to upsample the low-resolution feature maps from the contracting path. Instead of remembering the locations of the maximum activations in the max-pooling layers to recover the lost spatial information, as in SegNet, the high-resolution feature maps generated in the contracting path are directly concatenated to the corresponding feature maps in the expanding path, which forms skip connections between the two paths at different levels. This yields the U-shaped architecture shown in Fig. 3.4. Different from FCN, a large number of feature channels are applied in the expanding path to allow the network to propagate context information to higher-resolution layers. Experimental results show that U-Net can be trained end-to-end from very few images by using random elastic deformations for data augmentation, which makes it very suitable for medical image segmentation tasks. In fact, U-Net demonstrated this by winning the Cell Tracking Challenge at ISBI 2015 on the two most challenging transmitted light microscopy categories.
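A toy two-level network illustrates the concatenation-based skip connections that distinguish U-Net from SegNet's index-based unpooling; the channel sizes below are illustrative, not those of [9].

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level U-Net sketch: high-resolution encoder feature maps are
    concatenated channel-wise onto the decoder path, forming the skip
    connections that give the architecture its U shape."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 64, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64 + 32, 32, 3, padding=1),
                                 nn.ReLU(), nn.Conv2d(32, 1, 1))

    def forward(self, x):
        e = self.enc(x)                          # high-res encoder features
        m = self.mid(self.down(e))               # coarse, abstract features
        u = self.up(m)                           # upsample back to e's size
        return self.dec(torch.cat([u, e], 1))    # skip: concatenate channels
```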
3.2 RNN based Segmentation
3.2.1 LSTM
Figure 3.5: Structure of RNN and LSTM: (a) standard RNN with a single layer, (b) LSTM with multiple layers. 3

3 Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Thanks to the "parameter sharing" property and recurrent operations, RNNs are able to deal with very long sequential data by using very "deep" computational graphs. However, due to the vanishing/exploding gradient problem, it is very difficult to train RNNs with traditional backpropagation to learn dependencies between steps separated by long intervals. Several methods have been proposed to deal with this problem [75, 76], and Long Short-Term Memory networks (LSTMs) [77] are among the most widely used. A gating mechanism was introduced in LSTMs to reduce the effect of vanishing gradients in RNNs. In LSTMs, a new type of memory cell is used to replace the traditional hidden unit. Fig 3.5 shows the difference between a typical RNN hidden unit and an LSTM memory cell. The memory cell is the essential idea behind LSTMs: it has the ability to retain different amounts of information within the cell depending on the situation. Thus, LSTMs are able to learn when to remember information for long periods of time and when to focus on recently accumulated information. Specifically, the memory cell is a composite unit containing several elements. In addition to the input x_t and the cell output h_t, the memory cell also has three regulating gates (f_t, i_t, o_t) that control the process of removing or adding information to its memory state c_t. The memory state c_t is a combination of information from the previous memory and the new input, each filtered through its corresponding gate: the forget gate f_t and the input gate i_t, respectively. The following equations describe this procedure:
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ tanh(c_t)
f_t, i_t, and o_t represent the forget gate, input gate, and output gate, respectively. They are called gates because the sigmoid function within them squashes the values of the input vectors to between 0 and 1. By multiplying a gate's output vector element-wise with another vector, a controlled amount of the information in that vector is kept after passing through the memory cell. With this mechanism, when LSTMs are trained using backpropagation and the total error on a set of training sequences is back-propagated from the output, the error can be carried across many time steps within the memory cell without vanishing. LSTMs are the most commonly used RNN variant, and have been successfully applied to many different tasks [78, 79].
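A single step of the memory cell, implementing the gate equations above directly in NumPy, might look as follows (the weight shapes and contents of the parameter dictionary p are the caller's responsibility):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM memory-cell step following the equations above; p is a
    dict of weight matrices/biases (W*, U*, b*). '*' below is the
    element-wise (Hadamard) product written as '∘' in the text."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    c = f * c_prev + i * np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    h = o * np.tanh(c)                                        # cell output
    return h, c
```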
Figure 3.6: Inner structure of CLSTM [10].
3.2.2 Convolutional LSTM and its application on segmentation
In spite of the success of LSTMs, standard LSTMs are explicitly designed to deal with one-dimensional sequential data. This means that to apply an LSTM to multi-dimensional data, the data have to be preprocessed and presented as one-dimensional. Due to the fully connected input-to-hidden and hidden-to-hidden transitions, conventional LSTMs (FC-LSTM) have limitations in handling spatio-temporal data and ignore the multi-dimensional spatial information. To solve this problem, several works have been proposed to modify LSTMs to include spatial information [10, 79, 80]. Convolutional LSTM (CLSTM) [10] extends FC-LSTM by replacing the fully connected operations (matrix multiplications) with convolutions in both the input-to-hidden and hidden-to-hidden transitions. The basic idea of CLSTM is to turn the multiplications in LSTM into convolutions, resulting in a much more convenient implementation. The inner structure of CLSTM is shown in Fig. 3.6. All the inputs, cell outputs, hidden states, and gates of the CLSTM are 3D tensors, whose last two dimensions are spatial dimensions, instead of 1-D as in FC-LSTM. The CLSTM is a generalization of FC-LSTM: it has all the benefits of FC-LSTM, but can also keep the multi-dimensional spatial information of spatio-temporal input thanks to the convolutional operations. There have been many successful applications of CLSTM. For example, [80] applied CLSTM to the 3D medical image segmentation problem by using six CLSTMs.
Each CLSTM processes the entire volume in one unique direction. The outputs from the six CLSTMs are combined to create the full context of each pixel. Their model achieved the best pixel-wise brain image segmentation results on MRBrainS13 (the ISBI 2015 workshop on Neonatal and Adult MR Brain Image Segmentation).
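The core substitution in CLSTM, convolutions in place of the matrix multiplications in the gate equations, can be sketched as a single cell. This is an assumed, simplified implementation (PyTorch; all four gates are computed with one convolution, a common shortcut), not the exact cell of [10].

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """CLSTM sketch: the matrix multiplications of FC-LSTM become
    convolutions, so inputs, states, and gates are all 3D tensors
    (channels x height x width) and spatial structure is preserved."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution computes all four gates at once on [x_t, h_{t-1}].
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x_t, h_prev, c_prev):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x_t, h_prev], 1)), 4, 1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```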
3.3 Segmentation based on combination of CNN and RNN
Previous sections described several image segmentation models based on either CNNs alone or RNNs alone. CNNs have the advantage of learning salient image features, and there exist many pre-trained CNN models that are ready for transfer learning [35, 48, 68]. However, when using CNNs for 3D segmentation tasks, the widely-adopted approach is to treat the 3D image volume as a stack of 2D slices, apply a 2D CNN segmentation model to them, and then combine the 2D segmentation results. This inevitably loses the continuous spatial information along the third dimension, which is not acceptable for 3D medical images. If 3D CNNs are applied directly, many more parameters are needed to describe the 3D convolutional kernels, and more samples are required for training. However, it is generally not possible to have a very large (1000+) 3D image data set in the medical area. On the other hand, RNNs and LSTMs have the ability to preserve temporal or spatio-temporal information. But RNNs are typically not as powerful as CNNs at extracting salient image features: although they form a very deep network along the temporal direction, each hidden unit is a shallow structure along the spatial dimension. Therefore, some researchers have tried to combine CNNs and RNNs [11, 81, 82], and [11] is one successful application to 3D medical image segmentation.
3.3.1 U-Net + Bi-directional CLSTM
Considering the presence of dimensional anisotropism in 3D medical images, [11] proposed a 3D medical image segmentation framework that combines FCN and RNN. The FCN is used to extract 2D slice features, while the RNN is used to integrate the context information between those 2D slices along the z-axis.
Figure 3.7: Models proposed in [11]: (a) structure of kU-Net; (b) overview of the framework (Deep BDC-LSTM).
Specifically, for the FCN part, they proposed a new architecture, kU-Net, which consists of two U-Nets at different scales. Fig. 3.7(a) shows the structure of kU-Net. A 2D image slice and its downsampled version are fed into the two U-Nets respectively. In this way, both the low-level, detailed information and the high-level, abstract information are propagated in parallel along the kU-Net, and are eventually combined by concatenating their outputs. For the RNN part, they used a stacked bi-directional convolutional LSTM network (BDC-LSTM) to integrate the context information between the 2D outputs generated by kU-Net. BDC-LSTM is an extension of CLSTM which stacks two CLSTMs together but feeds in the same input sequence in opposite directions. The contextual information from the two CLSTMs is concatenated as the final output, which ensures that information from both directions along the z-axis is incorporated at the same time. In the end, multiple BDC-LSTMs are stacked together to form a deep structure. The overall structure of this model is shown in Fig 3.7(b). Evaluated on two different 3D biomedical image segmentation applications, this new model achieved state-of-the-art performance and outperformed known 2D schemes.
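The bi-directional wrapping can be sketched on top of the ConvLSTMCell from the previous section; this is an assumed, simplified reading of BDC-LSTM, not the implementation of [11].

```python
import torch

def bdc_lstm(x_seq, cell_fwd, cell_bwd, h0, c0):
    """Bi-directional CLSTM sketch: the same slice sequence is processed
    by two ConvLSTM cells in opposite directions, and the per-slice
    outputs are concatenated channel-wise, so context from both sides of
    the z-axis is available at every slice."""
    def run(cell, seq):
        h, c, outs = h0, c0, []
        for x_t in seq:
            h, c = cell(x_t, h, c)
            outs.append(h)
        return outs
    fwd = run(cell_fwd, x_seq)                 # z+ direction
    bwd = run(cell_bwd, x_seq[::-1])[::-1]     # z- direction, re-aligned
    return [torch.cat([f, b], dim=1) for f, b in zip(fwd, bwd)]
```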
4 Multi-view Ensemble ConvNet for Hippocampus Segmentation
In this chapter, we introduce the proposed multi-view ensemble convolutional neural network (ConvNet). Specifically, we explore the feasibility of combining 2D slice-based decisions from multiple 2D planar views and construct a CNN based multi-view ensemble net, in order to address both the issue of limited training samples for 3D medical image segmentation and the loss of 3D spatial context information in 2D slice-based CNN models. We start with the practical motivation in the context of a clinical task, i.e., hippocampus segmentation in 3D MR images, and then describe in detail the whole framework of the proposed model, including U-Seg-Net and Ensemble-Net. Experimental settings and results for the hippocampus segmentation task are presented to demonstrate the effectiveness of our model, with comparison to some state-of-the-art solutions in the same area. The work presented in this chapter has been published in [83].
4.1 Motivation: Hippocampus segmentation
The brain is arguably the most important organ for human beings. It acts as a commander for the human nervous system: collecting signals from the body's sensory organs and then sending instructions to guide the movement of muscles. Worldwide, brain disorders are a major cause of morbidity, disability, and premature mortality; they include any impairment affecting brain function, whether caused by illness, genetics or traumatic injury [84]. For instance, over one-quarter of adult Americans have been diagnosed with a mental illness, such as Alzheimer's Disease (AD), Post-Traumatic Stress Disorder (PTSD), or Major Depressive Disorder (MDD) [84]. Therefore, how to prevent and better treat brain disorders has long been a worldwide problem.
Among the different brain cortical and subcortical structures, the hippocampus, located under the cerebral cortex in the medial temporal lobe, is an important component of the limbic system. It plays essential roles in the consolidation of information from short-term memory to long-term memory, and in the spatial memory that enables navigation. Neural degeneration and dysfunction of the hippocampus has been studied and associated with many brain disorders, including temporal lobe epilepsy, AD, mild cognitive impairment (MCI), schizophrenia, MDD, bipolar disorder, and many others [85]. Assessing the morphometric characteristics and the structural integrity of the hippocampus (HC) is important for diagnosing and monitoring these brain disorders. For Alzheimer's Disease, the atrophy of the hippocampus, reflected in structural T1-weighted MRI scans, has been demonstrated to be a biomarker able to predict the progression from MCI to AD [86, 87]. Fig. 4.1 shows the hippocampus region in coronal slices of the T1-weighted MRI of NC, MCI and AD subjects respectively, in which the atrophy of the hippocampus can be clearly observed. In addition, monitoring the volumetric change of the hippocampus in AD patients is also useful for assessing the treatment outcomes of potential drugs. In epilepsy, asymmetry in hippocampal volumes (atrophy of the hippocampus in one hemisphere) has been utilized as a predictor, and hippocampal atrophy is also used to measure the progression of the disease [88, 89]. In many other studies, although not yet validated as a clinical biomarker, the atrophy of the hippocampus has been shown to be highly correlated with several diseases, including schizophrenia [90], PTSD [91], and bipolar disorder [92]. Given the aforementioned evidence and the widespread agreement on the usefulness of HC volumetry, accurate segmentation of the hippocampus with appropriate reproducibility is a necessity for its establishment as a biomarker [93]. Thanks to the invention and continuous improvement of Magnetic Resonance Imaging (MRI) techniques, it is now a commonly accepted, non-invasive imaging solution to quantify the volume and assess the shape of the hippocampus [94].
Figure 4.1: Comparison of the hippocampus region in coronal slices of the T1-weighted MRI of NC, MCI and AD subjects, respectively [12].
Although the development of MRI techniques has led to better-quality brain imaging, with sufficient resolution and contrast to enable quantification of the HC's symmetry and atrophy, manual or semi-automated segmentation is still considered the gold standard for HC assessment and is adopted as the clinical routine. However, manual and semi-automated solutions are very time-consuming and laborious, and suffer from both inter- and intra-rater variability, which hinders effective, large-scale morphological studies of the hippocampus [95]. To overcome these limitations, fully automatic approaches have been developed for hippocampal segmentation, which can be roughly divided into atlas-based, patch-based, and learning-based methods [96]. We have briefly discussed the pros and cons of these methods in section 1.2 and chapter 3. Although these methods have demonstrated promising results, segmentation errors and mislabeled voxels remain unavoidable.
In general, automatic segmentation of the hippocampus in MRI is a challenging task. On one hand, the gray levels of the hippocampus in MRI are very similar to those of neighboring structures, including the amygdala, the caudate nucleus and the thalamus, as shown in Fig. 4.2; there is no well-defined, clear border separating the hippocampus from its neighboring structures. On the other hand, intrinsic MRI effects and artifacts, such as the partial-volume effect and non-homogeneous intensity, bring another level of difficulty to accurate automatic hippocampus segmentation [96]. To overcome these limitations and further improve existing automatic solutions for hippocampus segmentation, it is of great value to utilize or explore the extra information that clinical operators (e.g., radiologists) use when performing manual segmentation, such as prior knowledge of the shape and position of the hippocampus, or even their way of MRI reading/assessment, i.e., reading the three anatomical planes (sagittal, coronal, and axial) in a slice-sliding fashion. Inspired by this, as well as by the limitations of existing 2D slice-based image segmentation models discussed in section 1.3, we propose an automated 3D hippocampus segmentation method for MRI relying on a multi-view ensemble convolutional neural network, and demonstrate its effectiveness through systematic experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) data set.
4.2 Method
As shown in Fig. 4.3, our model consists of two major components. First, a series of U-Seg-Net based segmentation networks are employed to conduct 2D segmentation on the slices extracted from nine different views, which produce nine preliminary 3D probability maps for each hippocampus. Second, the Ensemble-Net is trained to fuse these probability maps to generate the final segmentation with improved accuracy. 56
Figure 4.2: Example showing the hippocampus and six other subcortical regions around the hippocampus (caudate nucleus, putamen, globus pallidus, nucleus accumbens, thalamus and amygdala) in axial, sagittal, coronal and 3D views (hippocampus-cyan; caudate nucleus-yellow; putamen-magenta; globus pallidus-green; nucleus accumbens-blue; thalamus-red; amygdala-white) [13].
4.2.1 U-Seg-Net
Our work is built on a 2D segmentation network which has a similar architecture to U-Net, as shown in Fig. 4.3(a). As a modification and extension of FCN, U-Net utilizes a symmetric encoding-decoding architecture.
Figure 4.3: Model overview: (a) the U-Seg-Net architecture (3×3 convolutions + ReLU, 2×2 max pooling, copy bridges, 2×2 de-convolutions); (b) the multi-view ensemble framework over nine views (sagittal, coronal, axial, and Diag1-Diag6), whose outputs feed the Ensemble-Net.
The encoding path utilizes several convolutional + pooling operations to extract latent representations, while deconvolutional + convolutional layers are adopted to gradually upsample the latent feature maps and learn a semantic segmentation of the input. For the purpose of pixel-wise semantic segmentation, "bridge connections" are utilized to compensate for the location information lost through max-pooling, by concatenating feature maps at different levels of the decoding path with the feature maps of the corresponding encoding path. In the proposed U-Seg-Net, we keep the main "encoding + decoding + bridge" architecture of U-Net, and make some necessary modifications to accommodate our own data and task. In U-Net, each step has two convolution layers before max-pooling or deconvolution, while our U-Seg-Net preserves only one of them and hence reduces the total number of parameters. Additionally, in U-Seg-Net, padding is used in the convolution layers to retain the spatial dimensionality of the feature maps, which avoids the need to resize input images as is done in U-Net. We also use different numbers of kernels in the convolutional layers of the encoding and decoding paths, so U-Seg-Net is not a strictly symmetric architecture. The kernels used in U-Seg-Net all have the same size of 3×3. In the last convolutional layer of U-Seg-Net, the number of output feature maps is reduced to one, since our segmentation task is binary (foreground hippocampus vs. background), and cross-entropy is used as the objective function. Furthermore, dropout and batch normalization are used to avoid overfitting. In our work, the 3D volume of a hippocampus is viewed as n planar 2D slices aligned along the third axis (z-axis), which are fed into a U-Seg-Net as n independent samples. The final segmentation of a 3D hippocampus is generated by stacking the 2D segmentation results inferred for the individual slices.
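Putting these modifications together, a sketch of U-Seg-Net might look as follows; the channel counts and depth are our best reading of Fig. 4.3(a) and may differ from the actual implementation (dropout and batch normalization are omitted for brevity).

```python
import torch
import torch.nn as nn

def conv(cin, cout):
    # One padded 3x3 conv + ReLU per step (U-Net uses two; U-Seg-Net keeps one).
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class USegNet(nn.Module):
    """U-Seg-Net sketch following the textual description above."""
    def __init__(self):
        super().__init__()
        self.e1, self.e2, self.e3 = conv(1, 32), conv(32, 64), conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.u2 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.d2 = conv(64 + 128, 128)      # bridge: concat encoder features
        self.u1 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.d1 = conv(32 + 128, 128)
        self.out = nn.Conv2d(128, 1, 3, padding=1)  # one map: binary task

    def forward(self, x):
        f1 = self.e1(x)                    # bridge features, full resolution
        f2 = self.e2(self.pool(f1))        # 1/2 resolution
        f3 = self.e3(self.pool(f2))        # 1/4 resolution
        d2 = self.d2(torch.cat([self.u2(f3), f2], 1))
        d1 = self.d1(torch.cat([self.u1(d2), f1], 1))
        return torch.sigmoid(self.out(d1))  # probability map for cross-entropy
```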
4.2.2 Ensemble-Net
Figure 4.4: Sagittal, coronal and axial views of hippocampi in an MRI scan.
Several justified concerns about our slice-based 3D segmentation (U-Seg-Net) exist. First, the segmentation results depend only on 2D images without considering their neighbors, and 3D contextual information is neglected in the network. Due to the shape of the hippocampus, extracted 2D slices have very different foreground-background ratios at different positions and thus may easily lead to bumpy and inconsistent results. Second, sampling planar slices from different directions/views (sagittal, coronal and axial views, shown in Fig. 4.4) has different effects on the final segmentation, as different types of shape information are contained along different views. For example, slices containing the anterior tips of the hippocampi sampled along the axial view occupy few foreground pixels and would result in bad segmentation results; from the sagittal view, however, they are part of clearly visible structures in several slices and are therefore easy to segment. To re-establish the contextual information for each pixel, integrating multiple decisions from the segmentation results along different views provides a solution and achieves more 3D-aware segmentation results for the whole structure. Ensemble learning is a justified and straightforward solution for multiple view integration. Although data from different views are related to a certain extent, each view carries its own unique 3D information. When multiple, independent, and diverse decisions are combined, random errors cancel each other out and correct decisions are reinforced.
Table 4.1: Three different configurations for our Ensemble-Net. The convolutional layer parameters are denoted as "conv[kernel size]-[number of kernels]".

          Ensemble-Net 1   Ensemble-Net 2   Ensemble-Net 3
Layer 1   conv5-1          conv3-2          conv1-4
Layer 2   -                conv3-1          conv3-4
Layer 3   -                -                concat(Layer 1, Layer 2)
Layer 4   -                -                conv1-1
In our work, a multi-view ensemble ConvNet is proposed as a weighted combination of multiple U-Seg-Net decisions from different views. To utilize the utmost 3D context for each pixel, nine different views are analyzed, including three orthogonal views (axial, coronal, sagittal) and six diagonal views, as shown in Fig. 4.3(b). In this work, we explore two types of ensemble learning. The first relies on a 3D CNN-based network, dubbed Ensemble-Net, which takes each view's stacked segmentation results (the segmentation probability maps) of a 3D hippocampus image and infers a refined segmentation probability map. To be more specific, as a fully convolutional net, the input of the Ensemble-Net is a 4D tensor, in which the multiple views are concatenated channel-wise. It aims to learn a nonlinear combination of the probability maps from different views, while the spatial dimensions of the input are preserved through padding-enabled convolutions. Table 4.1 shows the three network configurations we used for our Ensemble-Net. Alternatively, we also implement pixel-wise majority voting as a comparison, in which each view is directly assigned the same weight.
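For concreteness, the first configuration in Table 4.1 (a single conv5-1 layer over the channel-wise concatenated view maps) could be sketched as below; the use of a 3D convolution and a sigmoid output are our assumptions, and the box size matches the left-hippocampus crop described later.

```python
import torch
import torch.nn as nn

# Ensemble-Net 1 sketch: one padded 5x5x5 convolution maps the nine
# stacked single-view probability maps to a single refined map, so the
# spatial dimensions of the 3D volume are preserved.
ensemble_net_1 = nn.Sequential(nn.Conv3d(9, 1, 5, padding=2), nn.Sigmoid())

probs = torch.rand(1, 9, 26, 50, 40)    # nine views' 3D probability maps
refined = ensemble_net_1(probs)         # learned nonlinear combination
```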
4.3 Data
In this work, baseline T1-weighted whole brain MRI images and their hippocampus segmentation masks from the ADNI database 4 are used. Considering that the hippocampus has a relatively fixed position and shape in the human brain, we first use FIRST 5 as a coarse segmentation method to roughly detect the location of the hippocampus, and then crop each hippocampus out using a fixed-size bounding box. This preprocessing step is straightforward but necessary for the subsequent refined segmentation using our proposed solutions, since training on the original whole-brain T1 volumes is too computationally expensive. Although the Dice between the hippocampus segmented by FIRST and the ground truth mask is only approximately 70%, it is accurate enough for the purpose of localizing and cropping the hippocampus. After this step, each hippocampus is cropped from the T1 images along with its neighboring context in a 3D box, with size 26×50×40 for the left side and 30×56×44 for the right. These two boxes are big enough to contain the hippocampus, as shown in Fig. 4.3(b).
4 http://www.loni.usc.edu/ADNI
5 https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST
4.4 Experimental settings
In each of the following experiments, the respective models are trained and evaluated independently for segmenting the left and right hippocampus. For each experiment, 10-fold cross-validation is utilized. There are 110 subjects in total, so 99 subjects are used for training and 11 for testing in each fold. All the models are implemented using the MXNet 6 deep learning framework. AdaDelta is used as the optimizer, with its default hyperparameter settings: learning rate 1, rho 0.95, epsilon (fuzzy factor) 1e-08, and learning rate decay over each update 0. We observe the training loss convergence and empirically set the early stopping epoch to 50.

6 https://mxnet.apache.org/
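In Gluon-style MXNet, these optimizer settings would be configured roughly as below; this is a sketch with a placeholder network, and the original training scripts may differ.

```python
import mxnet as mx
from mxnet import gluon

# Placeholder model standing in for the actual segmentation network.
net = gluon.nn.Dense(1)
net.initialize()

# AdaDelta with the stated hyperparameters (rho and fuzzy factor epsilon).
trainer = gluon.Trainer(net.collect_params(), 'adadelta',
                        {'rho': 0.95, 'epsilon': 1e-08})
```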
4.5 Evaluation measurements
Dice coefficient [97] and Hausdorff distance [98] are used to evaluate the quality of segmentation in this work. Dice coefficient is one of the most widely used similarity metrics for evaluating segmentation results. Given two sets, X and Y, Dice coefficient is defined as
Dice = 2|X ∩ Y| / (|X| + |Y|)    (4.1)

where |·| denotes the number of elements in the set, e.g. pixels or voxels. For segmentation, X and Y typically correspond to the ground truth binary mask and the predicted binary mask of the target of interest. In this work, for each hippocampus, we calculate the Dice coefficient directly in 3D space. Hausdorff distance (HD) measures the distance between two subsets A and B in a metric space. It is defined as
HD(A, B) = max(h(A, B), h(B, A))    (4.2)

where the directed Hausdorff distance h(·, ·) is calculated as
h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖    (4.3)

where ‖·‖ in (4.3) denotes a norm, which is the Euclidean distance in our work. Since HD is a maximum value, it is generally sensitive to outliers [98]. When evaluating segmentation results, it is a good indicator of the stability and consistency of subject-wise segmentation.
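Both measurements are straightforward to compute for binary 3D masks, e.g. with NumPy and SciPy; this is a sketch of how we read Eqs. (4.1)-(4.3), not the evaluation code used in this work.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred, gt):
    """Dice coefficient (Eq. 4.1) between two binary 3D masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())

def hausdorff(pred, gt):
    """Symmetric Hausdorff distance (Eqs. 4.2-4.3) between the foreground
    voxel coordinate sets of two binary masks (Euclidean norm)."""
    a = np.argwhere(pred)     # set A: coordinates of foreground voxels
    b = np.argwhere(gt)       # set B
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```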
4.6 Experiment results
In this section, we evaluate the proposed multi-view Ensemble ConvNet on the hippocampus segmentation task, with different configurations. In addition, we compare our model with some recently reported solutions that also use the ADNI database for hippocampus segmentation.
4.6.1 U-Seg-Net on nine views
Since our multi-view Ensemble-Net relies on U-Seg-Net applied to each view, we first present the segmentation performance of the 2D slice-based U-Seg-Net. The 3D Dice coefficients and Hausdorff distances of the segmentations based on the nine single views are shown in Table 4.2. Generally speaking, there are no significant differences among the results from these nine views. Specifically, in the coronal view, U-Seg-Net achieves the best Dice coefficients (87.36% and 87.62%) for left and right hippocampus segmentation respectively; in the Diag2 view, U-Seg-Net achieves the best HD (2.88) for left hippocampus segmentation; in the Diag3 view, it achieves the best HD (2.55) for right hippocampus segmentation.
Table 4.2: Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using nine different single views. (L: left hippocampus, R: right hippocampus).

          Sagittal     Coronal      Axial        Diag1        Diag2        Diag3        Diag4        Diag5        Diag6
Dice L    86.64±2.51   87.36±2.06   86.23±2.99   86.79±2.71   86.73±2.90   86.61±3.46   85.99±2.67   85.85±2.21   86.60±2.57
Dice R    86.98±1.60   87.62±2.05   86.52±2.53   86.96±2.34   86.98±1.88   87.16±2.26   86.46±2.07   86.21±2.94   86.44±2.00
HD   L    3.21±0.68    3.56±0.68    3.25±0.76    2.97±0.67    2.88±0.59    3.01±1.05    3.16±0.87    3.36±0.71    2.86±0.44
HD   R    3.01±0.68    3.24±0.90    3.02±0.66    3.03±0.58    2.78±0.53    2.55±0.35    2.66±0.26    2.86±0.40    3.04±0.96
4.6.2 Multi-view Ensemble ConvNets
We further apply ensemble learning to combine the multiple-view results obtained from the aforementioned individual U-Seg-Nets, to improve the overall segmentation of the hippocampus. In Table 4.3, we present the results of applying majority voting on 3 views and 9 views respectively, as well as the results of applying the three types of Ensemble ConvNets on 9 views. All ensemble methods achieve better performance than any single view. By combining more views, marginal improvements are obtained. Likewise, by switching from the linear combination in majority voting to the learned nonlinear combination in the proposed ConvNet, we obtain further marginal improvements. However, when the complexity of the ConvNets increases, the benefit of the nonlinear combination is lost, due to the limited amount of training data in this work. To further demonstrate the benefit of ensemble learning, Fig. 4.5 shows the surface rendering of one segmented hippocampus using different methods: individual U-Seg-Nets for the 3 orthogonal views, and majority voting (ensemble learning) for 3 views and 9 views separately. It can be observed that the segmentation results from the 3 orthogonal views have apparent deformations and abrupt areas in different places compared with the ground truth. The 3-view combination integrates 3D information from the 3 orthogonal views and improves the segmentation accuracy. The 9-view combination further refines the boundary and leads to a more accurate and smoother segmentation result.
Table 4.3: Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using different combination methods.

          3-View Vote   9-View Vote   Ensemble-Net 1   Ensemble-Net 2   Ensemble-Net 3
Dice L    88.92±1.88    89.20±2.31    89.28±1.98       89.01±1.47       89.25±1.54
Dice R    88.92±1.59    89.35±1.83    89.45±1.73       89.05±1.99       88.23±2.45
HD   L    2.45±0.55     2.32±0.44     2.47±0.42        2.32±0.40        3.71±0.92
HD   R    2.44±0.39     2.31±0.36     2.26±0.25        2.39±0.32        3.44±0.94
Figure 4.5: Surface rendering of segmentation results of U-Seg-Net on the sagittal (Dice 91.26%), coronal (91.92%) and axial (91.65%) views, majority voting of 3 views (93.37%) and 9 views (93.90%), and the ground truth (GT) in one case. Please refer to the text for details.
In the end, we compare our proposed model with some recent works [27, 31, 32, 34, 99] that also focus on hippocampus segmentation. [27] used patch-based methods that rely on the similarity of intensity content between patches and the target to obtain segmentation labels. [99] used a hierarchical SVM with an automated feature selection method. [32] used discriminative dictionary learning and sparse coding techniques, where dictionaries and classifiers are learned simultaneously from a set of brain atlases. [31] proposed multi-atlas patch-based label fusion methods with new techniques: enforcing each image patch to encode both local and semi-local image information, and adopting a hierarchical label fusion method that iteratively improves the labeling result. [34] proposed multi-atlas patch-based label fusion methods, progressively constructing a dynamic dictionary of optimal weighting coefficients for the label fusion framework. Please note that we do not intend to make a head-to-head quantitative comparison among these methods, as the studies were conducted and reported on different datasets, subjects, ground truth setups and evaluation metrics. Nevertheless, the best overall (Average) segmentation performance is demonstrated by our method, as shown in Table 4.4.
Table 4.4: Comparison of the proposed method with other state-of-the-art existing methods on hippocampus segmentation.

Method                                         Left            Right           Average
Morra et al. [99]   Ada-SVM                    81.40           82.20           81.80
Coupe et al. [27]   Nonlocal Patch-based       -               -               88.40
Tong et al. [32]    DDLS                       87.20 (median)  87.20 (median)  87.20
Wu et al. [31]      Hierarchical Multi-Atlas   -               -               88.50
Song et al. [34]    Progressive SPBL           88.20           88.50           88.35
Our method          9ViewEnsem-Net1            89.48           89.46           89.47
4.7 Discussion
In this work, we present an automated hippocampus segmentation model based on ensembling multi-view convolutional networks. The proposed U-Seg-Net + Ensemble ConvNet is easy to train and achieves state-of-the-art performance in hippocampus segmentation. Compared with stacked 2D slice-based segmentation from a single view, the proposed model is able to re-establish 3D contextual information and leads to more 3D-aware segmentation results, which is reflected in the smoother and more accurate segmented hippocampi. However, we also notice that the proposed nonlinear ensemble ConvNet only achieves marginal improvements over linear voting methods; we believe this is because the benefit of the nonlinear combination degrades with the increased complexity of the ConvNets, given the limited training data. In addition, exploring an end-to-end training strategy or a better structure for this model is another direction of our future efforts.

5 Sequential FCN for Hippocampus Segmentation
In this chapter, we present another proposed model, which combines long short-term memory (LSTM) with a fully convolutional network (FCN) for the purpose of improving both the accuracy and the slice-wise consistency of 3D medical image segmentation. Specifically, we adopt U-Seg-Net, introduced in the previous chapter, as the underlying 2D slice-based model to extract the latent features, and use LSTM to propagate these features along the third dimension in order to compensate for the lost contextual information. This setup allows the most essential features of each slice to be shared and spread along the slice sequence. In the end, we also explore a three-view ensemble to supplement the individual segmentations with 3D neighborhood information and further boost the overall performance. Again, we demonstrate the effectiveness of the proposed model on the application of 3D MRI based hippocampus segmentation. The work presented in this chapter has been published in [100].
5.1 Method: U-Seg-Net + CLSTM
The multi-view Ensemble ConvNet integrates separate 3D segmentation results of U-Seg-Net, aiming to re-establish the 3D contextual information and achieve better, refined results. Nevertheless, the two-stage processing in this approach might not fully recover the overall 3D spatial contextual information lost in U-Seg-Net, thus leading to a sub-optimal refinement. To solve this issue, our second proposed solution attempts to directly instill inter-slice contextual information into the training of the slice-based U-Seg-Net. Specifically, we utilize the sequential learning model CLSTM to combine slice-based U-Seg-Nets in order to leverage inter-slice spatial dependency, as illustrated in Fig. 5.2. In this way, slice-wise 2D semantic information in U-Seg-Net can be propagated to adjacent slices along the z-axis simultaneously with the training of U-Seg-Net.
The combination of CLSTM and U-Seg-Net can be rather straightforward. In practice, the recurrent units can be inserted at multiple places in U-Seg-Net, resulting in different architectures for our joint model. In a similar work [11], the authors proposed to combine two kU-Nets with an RNN by directly constructing the recurrent connections between the last feature maps of their kU-Nets. However, while this arrangement is the most straightforward solution, it might not be the optimal one. Considering that there exist different degrees of spatial information loss at different levels of spatial pooling or upsampling, using RNNs at multiple levels can propagate both higher-level semantic and high-resolution location information between slices at the same time. More specifically, feature maps close to the input and output carry the most spatial and location information, while feature maps in the middle part of U-Seg-Net carry the most semantic information. Accordingly, the recurrent connections can be placed in any layer of U-Seg-Net. In our work, we explore four combination options, as shown in Fig. 5.1:
(a) between the sequence of high resolution feature maps in the encoding path of the U-Seg-Net;
(b) between the high resolution feature maps in the decoding path of the U-Seg-Net;
(c) between the middle layer feature maps;
(d) at all three aforementioned places.
Fig. 5.2 shows the full architecture of our joint model (d) with 3 recurrent connections in U-Seg-Net. Since there is no natural beginning direction for the z-axis in the 2D slices of a 3D hippocampus, sequence learning can be rolled out from both opposite directions, denoted as z+ and z−, to avoid information loss. This is fulfilled by a bi-directional CLSTM, in which two CLSTMs process the same input sequence from opposite directions.
Figure 5.1: Joint models (a)-(d). The black arrows indicate the recurrent connections.
The outputs of the two CLSTMs are concatenated at the corresponding neurons into a new tensor as the cell output. Furthermore, multiple repeats (layers) of bi-directional CLSTM can be stacked together to form a deep structure [11], with the output of the first one taken as the input of the second. In this work, we only explore one layer of bi-directional CLSTM, considering the size of our data set. Overall, the architecture of the joint U-Seg-Net and CLSTM model can be interpreted as follows: feature maps learned from the 2D slices of the same volume are first extracted independently with one convolutional operation, then fed into the 1st bi-directional CLSTM to make information propagate through all the slices. After another two convolutional + pooling operations, the extracted latent feature maps go into the 2nd bi-directional CLSTM to continue propagating the condensed information between the different slices. Further, after three convolutional + upsampling operations, 3D consistency is ensured again through the 3rd bi-directional CLSTM before the final segmentation probability maps are generated. This whole model is end-to-end trainable.
Figure 5.2: The overall architecture of our proposed model, U-Seg-Net + CLSTM (slices p−1, p, p+1 are processed in sequence by the CLSTM).
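A sketch of the joint forward pass conveys the overall data flow; for brevity a single bi-directional CLSTM is shown rather than the three of option (d), and every module argument below is a hypothetical stand-in for the U-Seg-Net pieces and the bi-directional CLSTM described above.

```python
import torch

def useg_clstm_forward(volume, first_conv, encoder, bdclstm, decoder):
    """Simplified joint forward pass: per-slice features are computed by
    U-Seg-Net layers, a bi-directional CLSTM propagates them along the
    z-axis, and the decoder emits one probability map per slice."""
    slices = [volume[:, :, z] for z in range(volume.shape[2])]  # split z-axis
    feats = [encoder(first_conv(s)) for s in slices]  # slice-wise 2D features
    feats = bdclstm(feats)                            # inter-slice propagation
    probs = [decoder(f) for f in feats]               # per-slice probabilities
    return torch.stack(probs, dim=2)                  # re-assemble 3D volume
```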
View Ensemble The joint U-Seg-Net + CLSTM solution can also benefit from ensemble learning over different views. In this work, we further combine the resulting probability maps of the joint model trained from 3 different views (axial, sagittal, and coronal). To limit the complexity of the overall model, only pixel-wise majority voting is used to perform the ensemble.
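Such a pixel-wise vote over three views reduces to a couple of NumPy lines (the binarization threshold is an assumed choice):

```python
import numpy as np

def majority_vote(view_probs, threshold=0.5):
    """Pixel-wise majority voting over the three views' probability maps
    (each already resampled into the same 3D space)."""
    votes = np.stack([p > threshold for p in view_probs])  # binarize views
    return votes.sum(axis=0) >= 2                          # 2 of 3 agree
```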
5.1.1 3D U-Seg-Net
Here, we introduce the 3D U-Seg-Net model, which is a direct extension of the 2D U-Seg-Net and serves as a comparison baseline for our proposed solutions in the following experiments. 3D U-Seg-Net is a fully convolutional neural network that operates directly on 3D volumes. Similar to the 2D U-Seg-Net structure, 3D U-Seg-Net also comprises encoding, decoding and skipping bridge layers, with all the 2D convolutional, pooling and deconvolutional operations replaced by their 3D counterparts, as shown in Fig. 5.3. Theoretically, the effective receptive field of 3 layers of convolutional + pooling operations with kernel size 3×3×3 is 72×72×72. Considering that this is already large enough to capture the content of our 3D hippocampus input, and that our data set is relatively limited, we do not explore deeper network structures and only use pooling and convolution operations three times, the same as in the 2D U-Seg-Net.
Figure 5.3: The architecture of 3D U-Seg-Net (3×3×3 convolutions + ReLU, 2×2×2 max pooling, copy bridges, 2×2×2 de-convolutions).
5.2 Data and experimental setting
The data we used are the same as in Chapter 4. We also evaluated our proposed model with 10-fold cross-validation, specifically with 99 subjects used for training and 11 subjects for testing in each fold. For each cropped 3D hippocampus array of the training subjects, three sets of 2D image slices along the sagittal, coronal, and axial views were first extracted to feed into and train our U-Seg-Net + CLSTM model, respectively. Then, the three resulting sets of 3D probability maps were combined using majority voting. Note that we conducted the training and testing procedures independently for the left and right hippocampus. Evaluation measurements Same as in Chapter 4, the Dice coefficient [97] and Hausdorff distance [98] are used to evaluate the quality of segmentation in this work. We refer the reader to Chapter 4 for the detailed definitions of these two measurements. The 3D Dice ratio and Hausdorff distance were calculated subject-wise for each view and their combinations. The mean and standard deviation averaged over the 10 folds are reported.
Optimization All the models are implemented using the MXNet 7 deep learning framework. AdaDelta is used as the optimizer, with its default hyperparameter settings: learning rate 1, rho 0.95, epsilon (fuzzy factor) 1e-08, and learning rate decay over each update 0. Due to the different complexities of the models, convergence is achieved at different stages. We observe the training loss and empirically set the early stopping epoch differently for each type of model, as shown in Table 5.1. When training the joint U-Seg-Net + CLSTM models, we initialize the parameters of the U-Seg-Net part using the corresponding parameters from the trained individual U-Seg-Net. We find this provides the joint model with a good starting point and leads to better performance than training from scratch.

7 https://mxnet.apache.org/
Table 5.1: Early stopping epochs used for different models.

Model    U-Seg-Net   Ensemble-Net   U-Seg-Net+CLSTMs   CLSTMs   3D U-Seg-Net
Epoch    50          30             150                120      120
5.3 Experiment results
5.3.1 CLSTMs
The proposed joint U-Seg-Net + CLSTM solution relies on U-Seg-Net to extract the latent features and CLSTM to propagate the slice-wise information. Similar to the previous section, we first show the hippocampus segmentation performance of using CLSTM alone. Same as [80], we apply CLSTM to 3D hippocampus segmentation using six directions (forward and backward sweeps for each of the three orthogonal views). In each setting, CLSTM processes the entire volume in one unique direction. Furthermore, 2-direction ensembles for each orthogonal view and a 6-direction ensemble over all three orthogonal views are conducted using voxel-wise averaging. The whole procedure is demonstrated in Fig 5.4. For each CLSTM model used in the experiments, only two hidden layers are used, with 100 filters in the first layer and 1 filter in the second layer to generate the segmentation probability map. This setting was decided empirically, as we found that more filters (i.e., 100) generate better results than fewer filters (40). The overall segmentation results are shown in Table 5.2. As shown, the 6-direction ensemble achieves better segmentation results than all single directions in both Dice ratio and HD. The same holds for the 2-direction combinations, each of which achieves better results than its corresponding single directions, except for a slightly worse Dice ratio than the backward direction in the coronal view. Generally speaking, the performance of CLSTM is worse than that of U-Seg-Net, since CLSTM is shallow and not effective at extracting latent information. This serves as another justification for our proposed solution of combining FCN with CLSTM. In addition, the experimental results also validate the effectiveness of the bi-directional CLSTM and view ensemble strategies.