
Deep Learning based 3D Image Segmentation Methods and Applications

A dissertation presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Doctor of Philosophy

Yani Chen
May 2019
© 2019 Yani Chen. All Rights Reserved.

This dissertation titled Deep Learning based 3D Image Segmentation Methods and Applications

by YANI CHEN

has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Jundong Liu Associate Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean, Russ College of Engineering and Technology

Abstract

CHEN, YANI, Ph.D., May 2019, Computer Science
Deep Learning based 3D Image Segmentation Methods and Applications (115 pp.)
Director of Dissertation: Jundong Liu

Medical image segmentation is the procedure to delineate anatomical structures and other regions of interest in various image modalities. While crucial and often a prerequisite step for other analysis tasks, accurate automatic segmentation is difficult to obtain, especially for three-dimensional (3D) data. Recently, deep learning techniques have revolutionized many domains of artificial intelligence (AI), including image search, speech recognition and 2D/3D natural image/video segmentation. When it comes to 3D image segmentation, however, the majority of deep learning solutions either treat 3D volumes as stacked 2D slices, overlooking the adjacency information between slices, or directly perform 3D convolutional operations with isotropic kernels that are inconsistent with the anisotropic dimensions of 3D medical data. Neural networks based on 3D convolutions tend to be computationally costly and require much more training data to account for the increased number of parameters that need to be tuned. The scarcity of annotated data adds further difficulty.

To remedy the aforementioned drawbacks of existing solutions, we propose two works for 3D volume segmentation in this dissertation. The first is a multi-view ensemble convolutional neural network (CNN) framework in which multiple decision maps generated along different 2D views are integrated. The second is a novel end-to-end deep learning architecture that combines a CNN and a Recurrent Neural Network (RNN) to better leverage the dimensional anisotropy in 3D medical data. Our models are designed to take advantage of CNNs' remarkable power in capturing multi-scale 2D features, while relying on multi-view ensemble learning or inter-slice sequential learning to ensure a certain level of output consistency through inter-slice contextual constraints. Experiments conducted on hippocampus magnetic resonance imaging (MRI) data for both works demonstrate that the multi-view solution and the joint CNN-RNN model achieve significant improvements over single-view approaches, and outperform many state-of-the-art solutions in hippocampus segmentation. Our methods also show better results than a 3D CNN approach. In addition, we further validate the proposed work on another neuroimage segmentation task, i.e., multi-class segmentation of brain tumors (gliomas) using pre-operative multi-modal MRI scans. Experimental results demonstrate that our proposed solutions can effectively improve the accuracy and consistency of tumor segmentation, and achieve performance comparable to state-of-the-art solutions.

Dedication

To my lovely grandfather and parents

Acknowledgments

I would like to express my special appreciation to my advisor, Dr. Jundong Liu, for his guidance and support throughout my Ph.D. studies. I am very grateful for his advice and many insightful discussions on my research, especially during the times when there seemed to be no obvious solution. My great appreciation also goes to Dr. Charles D. Smith, our long-time collaborator, for his valuable advice on application directions, study interpretations and system designs. I would also like to thank my dissertation committee members, Dr. David Juedes, Dr. Razvan Bunescu, Dr. Chang Liu, Dr. Li Xu and Dr. Sergiu Azicovici, for all your professional guidance on my research and my coursework; your suggestions and feedback have been absolutely invaluable to me. I would like to express my sincere appreciation for all the service and time you devoted. My gratitude also goes to my wonderful lab mates, Huihui Xu and Bibo Shi, who gave me a great deal of guidance and help when I first came to Ohio University to pursue my Ph.D. degree. Discussions with them and other lab mates, Pin Zhang, Zhewei Wang and Nidel Abuhajar, provided me with a lot of inspiration for my research. I am very grateful to all of you. Last but not least, I want to thank my family, especially my parents, who give me endless belief and love, supporting me in everything without asking for anything in return. I am particularly appreciative of my lovely grandfather, who keeps encouraging me to be a better self.

Table of Contents

Page

Abstract...... 3

Dedication...... 5

Acknowledgments...... 6

List of Tables...... 9

List of Figures...... 10

List of Acronyms...... 12

1 Introduction...... 13
   1.1 Background – image segmentation...... 13
   1.2 Area overview...... 14
   1.3 Contributions...... 17
   1.4 Dissertation overview...... 18

2 Preliminaries...... 20
   2.1 Building blocks of CNN...... 21
      2.1.1 Convolution...... 22
      2.1.2 Activation function...... 25
      2.1.3 Pooling...... 27
      2.1.4 Upsampling...... 28
   2.2 Training of CNN...... 31
      2.2.1 Optimization...... 32
      2.2.2 Data augmentation...... 34
      2.2.3 Regularization...... 35
      2.2.4 Transfer learning...... 37
   2.3 RNN - recurrent neural network...... 38

3 Literature Review...... 40
   3.1 CNN based Segmentation...... 41
      3.1.1 Patch-wise CNN for segmentation...... 42
      3.1.2 Fully convolutional network...... 43
      3.1.3 SegNet...... 44
      3.1.4 U-Net...... 45
   3.2 RNN based Segmentation...... 47
      3.2.1 LSTM...... 47

      3.2.2 Convolutional LSTM and its application on segmentation...... 49
   3.3 Segmentation based on combination of CNN and RNN...... 50
      3.3.1 U-Net + Bi-directional CLSTM...... 50

4 Multi-view Ensemble ConvNet for Hippocampus Segmentation...... 52
   4.1 Motivation: Hippocampus segmentation...... 52
   4.2 Method...... 55
      4.2.1 U-Seg-Net...... 56
      4.2.2 Ensemble-Net...... 58
   4.3 Data...... 60
   4.4 Experimental settings...... 61
   4.5 Evaluation measurements...... 61
   4.6 Experiment results...... 62
      4.6.1 U-Seg-Net on nine views...... 62
      4.6.2 Multi-view Ensemble ConvNets...... 63
   4.7 Discussion...... 66

5 Sequential FCN for Hippocampus Segmentation...... 67
   5.1 Method: U-Seg-Net + CLSTM...... 67
      5.1.1 3D U-Seg-Net...... 70
   5.2 Data and experimental setting...... 71
   5.3 Experiment results...... 72
      5.3.1 CLSTMs...... 72
      5.3.2 Joint Model of U-Seg-Net and CLSTMs...... 74
      5.3.3 Comparison with other methods...... 77
   5.4 Discussion...... 79

6 Application on Multi-class Brain Tumor Segmentation...... 81
   6.1 Motivation: Glioma segmentation...... 81
      6.1.1 Prior work...... 85
   6.2 Proposed models...... 87
      6.2.1 U-Seg-Net+CLSTM with Squeeze-and-Excitation Unit...... 88
   6.3 Data...... 90
   6.4 Experimental settings...... 91
   6.5 Quantitative evaluations...... 92
   6.6 Experiment results...... 93
      6.6.1 SE-U-Seg-Net + CLSTM...... 96
   6.7 Discussion...... 97

7 Conclusion and Discussions...... 99

References...... 102

List of Tables

Table Page

4.1 Three different configurations for our Ensemble-Net. The convolutional layer parameters are denoted as "conv[kernel size]-[number of kernels]"...... 59
4.2 Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using nine different single views. (L: left hippocampus, R: right hippocampus)...... 63
4.3 Mean and standard deviation of Dice ratio (%) for the hippocampus segmentation results using different combination methods...... 64
4.4 Comparison of the proposed method with other state-of-the-art existing methods on hippocampus segmentation...... 65

5.1 Early stopping epochs used for different models...... 72
5.2 Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using a 2-layer CLSTM and the combination from different directions...... 74
5.3 Mean and standard deviations of Dice ratio for the hippocampus segmentation results obtained by putting the connection in different positions...... 76
5.4 Mean and standard deviations of Hausdorff distances for the hippocampus segmentation results obtained by putting the connection in different positions...... 77
5.5 Mean and standard deviation of Dice ratio (%) for the hippocampus segmentation results using 3D-U-Seg-Net with different settings for filter size...... 78
5.6 Comparison of the proposed method with other state-of-the-art methods on hippocampus segmentation (in 3D Dice ratios (%))...... 79

6.1 Mean and standard deviation of subject Dice ratios (%) and Hausdorff distance (1mm) on three glioma sub-regions using U-Seg-Net and ensemble...... 94
6.2 Mean and standard deviation of subject Dice ratios (%) and Hausdorff distance (1mm) for the tumor segmentation results using U-Seg-Net+CLSTM and ensemble...... 94
6.3 Comparison with other methods...... 96
6.4 Mean and standard deviation of subject Dice ratios (%) and Hausdorff distance (1mm) for the tumor segmentation results using network with SE block on sagittal view...... 97

List of Figures

Figure Page

2.1 A convolutional neural network for character recognition [1]...... 20
2.2 The components of a typical convolutional neural network layer in "complex layer terminology" [2]...... 21
2.3 2D convolution operations...... 23
2.4 Two-channel input feature maps to three-channel output feature maps [3]...... 24
2.5 Visualization of Conv1, Conv3 and Conv5 neurons learned from the ImageNet dataset [4]...... 25
2.6 Activation functions...... 26
2.7 Derivatives for (a) Sigmoid (b) Tanh (c) ReLU...... 27
2.8 Max pooling example...... 28
2.9 Unpooling...... 29
2.10 2D transposed convolution example. ∗ denotes convolution, ∗−1 denotes transposed convolution and ⊗ denotes matrix multiplication...... 30
2.11 Mappings in convolution and transposed convolution...... 31
2.12 Images generated by cycle GAN [5]. Images in the first and third columns are real images and the transferred outputs are in the second and fourth columns...... 35
2.13 Subnetworks after dropout...... 36
2.14 Transfer learning...... 37
2.15 Structures of RNN: Left (a): rolled RNN, Right (b): unrolled RNN [2]...... 38

3.1 A patch-wise CNN segmentation model for the membrane segmentation of Electron Microscopy images [6]...... 42
3.2 The architecture of FCN [7]...... 43
3.3 The architecture of SegNet [8]...... 44
3.4 The architecture of U-Net [9]...... 46
3.5 Structure of RNN and LSTM: (a) Standard RNN with a single layer, (b) LSTM with multiple layers...... 47
3.6 Inner structure of CLSTM [10]...... 49
3.7 Models proposed in [11]: (a) Structure of kU-Net, (b) Overview of the framework (Deep BDC-LSTM)...... 51

4.1 Comparison of the hippocampus region from coronal slices of the T1-weighted MRI of respectively NC, MCI and AD subjects [12]...... 54
4.2 Example showing the hippocampus and six other subcortical regions around the hippocampus (caudate nucleus, putamen, globus pallidus, nucleus accumbens, thalamus and amygdala) in axial, sagittal, coronal and 3D views (hippocampus-cyan; caudate nucleus-yellow; putamen-magenta; globus pallidus-green; nucleus accumbens-blue; thalamus-red; amygdala-white) [13]...... 56
4.3 Model overview...... 57
4.4 Sagittal, coronal and axial views of hippocampi in an MRI scan...... 58

4.5 Surface rendering of segmentation results of U-Seg-Net on the sagittal, coronal and axial views, majority voting of 3 views, 9 views and ground truth in one case. (The Dice ratios for the obtained segmentations are shown under each image.) Please refer to text for details...... 64

5.1 Joint models. The black arrows indicate the recurrent connections...... 69
5.2 The overall architecture of our proposed model: U-Seg-Net + CLSTM...... 70
5.3 The architecture of 3D U-Seg-Net...... 71
5.4 Procedure of CLSTM...... 73
5.5 Surface rendering of segmentation results of U-Seg-Net, U-Seg-Net + CLSTM and ground truth of one case from the axial view. (The Dice ratios for the obtained segmentations are 71.80% and 81.73%.) Please refer to text for details...... 75
5.6 Surface rendering of segmentation results of 3D U-Seg-Net and ground truth of one case. (The Dice ratio and HD for the obtained segmentation are 91.35% and 12.57.) Please refer to text for details...... 78

6.1 Four tumor substructures, edema (yellow), non-enhancing solid core (red), necrotic/cystic core (green), enhancing core (blue) reflected by multi-modal MRI: (A) T2-FLAIR, (B) T2, (C) T1-Gd, (D) overall combination [14]...... 83
6.2 Four different MRI modalities showing a high grade glioma, each enhancing different subregions of the tumor. From left: T1, T1-Gd, T2, and FLAIR [15]...... 84
6.3 The architecture of the U-Seg-Net + CLSTM network for brain tumor segmentation...... 87
6.4 Squeeze and excitation block [16]. Fsq and Fex are squeeze and excitation operations. Fscale is the feature map recalibration operation...... 89
6.5 The SE-U-Seg-Net + CLSTM model...... 90
6.6 Slices extracted from different views and different modalities. Slices extracted from sagittal, coronal and axial views are shown in the first, second and third rows accordingly...... 92
6.7 Slices extracted from single view segmentation results using U-Seg-Net and joint U-Seg-Net + CLSTM models. Sagittal, coronal and axial views are shown in the first, second and third rows accordingly. Images are cropped and resized for better visualization. Please refer to text for details...... 95

List of Acronyms

CNN Convolutional Neural Network

FCN Fully Convolutional Neural network

RNN Recurrent Neural Network

LSTM Long Short-Term Memory

CLSTM Convolutional Long Short-Term Memory

FC-LSTM Fully Connected Long Short-Term Memory

BD-LSTM Bi-directional Long Short-Term Memory

HD Hausdorff Distance

2D Two-Dimensional

3D Three-Dimensional

SGD Stochastic Gradient Descent

GPU Graphics Processing Unit

GAN Generative Adversarial Network

MRI Magnetic Resonance Imaging

CT Computed Tomography

1 Introduction

1.1 Background – image segmentation

Image segmentation is one of the most fundamental problems in image understanding, and has broad applications in various areas, such as machine vision, medical imaging and remote sensing. Generally speaking, image segmentation is the process of partitioning a 2D image (or 3D volume) into multiple meaningful regions. Pixels (or voxels) in each segment should possess certain similar characteristics, e.g., intensity consistency, to indicate that they belong to the same category or the same object of interest. When it comes to specific application areas, however, image segmentation often has different interpretations or aims. For instance, in machine vision, segmentation is regarded as a transition step between the low-level and high-level vision subsystems [17]; in remote sensing, it is typically adopted as a prerequisite step for subsequent tasks such as landscape change detection or land cover classification [18]. Specifically in medical image analysis, accurate automated segmentation plays a vital role in computer-aided diagnosis and image-guided therapy. Under this particular context, the purpose of segmentation is to delineate anatomical structures or indicate the boundary of organs or other regions of interest (ROI), which is often informed by prior knowledge or by expectations on the final segmentation result [19]. After the ROIs are segmented out, the geometric properties of the segments and other shape-related information can be derived, facilitating many clinical analyses and routines.

Automatic segmentation of medical images is a very challenging task due to the high variability and complexity of medical images and the fact that the images are often corrupted by noise. Over the last two decades, remarkable progress has been made in the research community, with numerous solutions published for medical image segmentation. These methods can be roughly divided into four groups: thresholding-based, region-based, edge-based, and clustering-based [20]. We refer readers to some recent surveys [21–24] on these conventional methods. Some of the robust solutions have been integrated into commercial software packages. However, they are generally limited to segmenting clear-cut organs such as bones. Challenges remain in designing automatic solutions to partition more complex organs. One example is the human brain: while accurate delineation is difficult to achieve, cortical and subcortical parcellations are very important for detecting tumors, edema, and necrotic tissues. Accurate detection of these tissues is often the basis for reliable diagnoses.

Recently, deep learning methods that utilize different types of deep artificial neural networks have been successfully and widely used in many typical tasks, such as image recognition, image classification and image segmentation. The advance and success of deep learning models in semantic segmentation for natural images holds great promise for solving many difficult segmentation tasks in medical imaging. In this work, we focus on designing new deep learning models to solve 3D medical image segmentation problems.

1.2 Area overview

Although the earliest medical image segmentation work can be traced back to the thresholding technique, one of the oldest, simplest and most popular techniques, which has existed for several decades, we are not trying to provide a comprehensive survey of the progress of medical image segmentation methods from the beginning. Instead, we are more interested in recent advances on some very difficult medical image segmentation tasks, covering atlas-based methods, patch-based methods, and deep learning based methods.

Atlas-based methods have attracted much attention in the medical imaging community, especially for brain image segmentation. Early atlas-based segmentation methods register the most similar atlas from a library of atlases to the target subject, and the segmentation or labeling for the target subject is obtained from the corresponding labels of the chosen atlas. However, such single-atlas segmentation methods tend to produce biased segmentation results. To overcome this problem, multi-atlas-based techniques [25, 26] have been proposed, which apply registration from multiple atlases (manually segmented by experts) to the target image, and then rely on label fusion or propagation to decide the voxel membership under the coordinate framework of the target. Segmentation results of atlas-based solutions are highly dependent on the performance of the underlying registration. In addition, during the label fusion procedure, most methods do not consider the relevance between the sample and the atlases (the target sample is more similar to some atlases than to others) and simply assign the same weight to every atlas.

Atlas-based techniques usually assume a one-to-one correspondence between the target image and the atlases, which imposes significant difficulty on the adopted registration and is very hard to achieve in practice. Due to this issue, patch-based solutions have gained popularity for their simplicity and robust performance. Such solutions [27, 28] focus on identifying local similarities between the target image and anatomical atlases at the patch (small subvolume) level, in which the non-rigid or one-to-one registration step is not required. Voxel neighborhoods with similar intensity profiles are considered to belong to the same structure and therefore share the same labels. Patches can be extracted directly from predefined subvolumes across the training atlases. Since image similarities over small image patches may not lead to optimal estimation [29], several recent works [30–33] impose sparsity constraints upon label weights to ensure that only a small number of highly relevant atlas patches are selected to represent the underlying target patch. Discriminative dictionary learning [32, 33] and progressive dictionaries [34] are also utilized to guide the representation transition from intensities to labels.

More recently, deep learning solutions, in particular deep convolutional neural networks (CNNs), have been widely used to solve image recognition problems in various domains including computer vision and medical image analysis [9, 35, 36], and have achieved state-of-the-art performance. Most of these applications adopt 2D convolutional networks that take image patches as input, which ignores the global contextual information; only a few approaches use a post-processing method such as a conditional random field (CRF) to further enforce spatial consistency. In [37], the authors used a multi-scale scheme on top of the 2D/3D intensity patch input to enforce the spatial consistency of 3D whole-brain MRI segmentation. In [6], each pixel is classified first by applying the trained neural network to a patch surrounding the pixel; afterwards, the whole image is segmented in a sliding-window fashion. In [7], a fully convolutional network (FCN) was proposed to densely label input images through an end-to-end deep network architecture. The U-Net model [9] improves on FCN so that it works with very few training images and still produces more precise segmentation results on medical images. The decoding path in U-Net is equipped with a large number of feature channels via skip/bridge connections to the corresponding encoding path, allowing the modified network to better propagate context information to higher-resolution layers.

A few 3D deep learning based segmentation methods [38–40] have been published recently to deal with 3D medical images directly. Typically, these methods can be grouped into two categories: (1) apply 2D segmentation models on the 2D slices and stack/combine the 2D segmentation results in certain ways; (2) design new network architectures that directly take the 3D image volume as input [39, 40]. Both groups have their own issues. The drawback of using 2D models to solve 3D segmentation tasks is that the spatial information between adjacent slices is not considered from the beginning, and thus the full 3D context may not be recovered during the subsequent post-processing step of slice stacking or combining.

Applying a 3D CNN model directly will nevertheless involve many more parameters, and accordingly require significantly more training data to avoid overfitting. However, it is generally not feasible to acquire many annotated 3D medical images. Another issue with applying 3D convolutional neural network models to 3D medical data is caused by the adopted isotropic kernels: because 3D medical images are typically scanned/produced slice by slice, the isotropic assumption might not be applicable for this type of 3D medical images [11].

1.3 Contributions

The aforementioned limitations in existing 3D image segmentation models constitute the main motivation for the work proposed in this dissertation. Specifically, we aim to explore the best strategies to address the loss of 3D contextual information in 2D slice-based segmentation solutions, taking into consideration the limited training samples and anisotropic properties of medical imaging. In light of the 3D contextual information loss in existing 2D slice-based medical image segmentation models, we intend to enrich the current deep learning algorithms with alternative perspectives. On one hand, given the limited training samples for most 3D medical image segmentation tasks, we still adopt 2D slice-based FCN models as the basis, but explore the feasibility of combining 2D slice-based decisions from multiple perspective views in order to compensate for the lost 3D contextual information. Furthermore, instead of a direct stacking method, better combination strategies to fuse the 2D decision maps are explored within ensemble learning. On the other hand, it has to be admitted that not all 3D contextual information can be fully recovered using this post-processing strategy. The ultimate remedy for this issue is to directly instill 3D contextual information into the learning process from the beginning. Thus, we derive a generalized framework that combines a CNN and a recurrent neural network (RNN) to fully explore the neighborhood information as guidance for 3D segmentation. Application-wise, the proposed segmentation models are validated on two important clinical tasks, hippocampus segmentation and multi-class brain tumor segmentation. With the proposed models, we obtain rather high segmentation accuracies for both tasks, which are comparable to, or even surpass, results reported by state-of-the-art solutions.

1.4 Dissertation overview

This dissertation is organized as follows. We first review the background work relevant to this dissertation: Chapter 2 describes the fundamental mathematics for deep learning, which is necessary to understand our proposed models. Chapter 3 is a survey of current progress in deep learning based image segmentation algorithms on both medical and natural images; our proposed models are inspired by, or rely on, some of these works. We then present and illustrate our contributions on 3D image segmentation. In Chapter 4, we propose an automated 3D segmentation method that relies on a multi-view ensemble convolutional neural network to combine multiple decision maps. We further apply this multi-view ensemble network to the hippocampus segmentation task to demonstrate its effectiveness. In Chapter 5, we continue to explore better ways to integrate inter-slice contextual information into 2D segmentation models, and propose to incorporate sequential learning, which has the ability to leverage inter-slice spatial dependencies, into the 2D segmentation network. Specifically, we adopt the convolutional long short-term memory (CLSTM) model to propagate inter-slice information between FCNs, and design an end-to-end joint learning model to further improve the spatial consistency of 2D slice-based segmentation. We validate this model on the hippocampus segmentation task, as well as a multi-class brain tumor segmentation task (Chapter 6).

Finally, we conclude this dissertation in Chapter 7 with a summary of our contributions as well as discussions on potential future work.

2 Preliminaries

Figure 2.1: A convolutional neural network for character recognition [1].

Convolutional neural networks (CNNs) are one of the most widely used deep learning architectures, in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. A CNN is a special kind of feedforward neural network. Generally speaking, the modern framework of CNN can be traced back to the LeNet-5 model, which was originally proposed by LeCun in 1989 [41] for the recognition of handwritten digits, and later improved in [1]. As shown in Fig 2.1, LeNet-5 is a multi-layer artificial neural network and can be trained with backpropagation [42], similar to other neural networks. The success of LeNet-5 demonstrated a very significant real-world application of CNNs, and was the first work to highlight the practical need for key modifications of conventional neural nets, beyond plain backpropagation, toward modern deep learning. Compared with plain conventional neural networks, LeNet-5 uses convolution operations to replace general matrix multiplications in certain layers, which is the primary distinction of CNNs. Using convolution operations enables CNNs to extract local features and combine them to form higher-order features in a more efficient way.

Typical steps of designing and training a neural network can be summarized as: specify the data set (train/test, input/output), determine the objective function based on the task, and choose an optimization procedure to train the model. Take segmentation problems for example: usually the input is images containing objects of interest to be segmented, and the output of the model is the predicted segmentation probability map or binary mask. A commonly used objective function is the cross-entropy (for binary segmentation) defined between the output prediction and the ground-truth segmentation. To train the model, stochastic gradient descent (SGD) is commonly selected as the optimizer to update the model's weights through alternating forward and backward propagation.

The following sections in this chapter are grouped into two parts. The first part briefly reviews the main building blocks of CNNs, namely convolution, deconvolution, activation functions and pooling, and then introduces how to build a CNN from these blocks. The second part focuses on optimization-related concepts and strategies for deep learning practitioners, including optimization methods, data augmentation, regularization, as well as transfer learning used to deal with small datasets and accelerate the training process. In the end, we also briefly introduce related definitions and concepts of recurrent neural networks. Note that, with the fast development of deep learning in recent years, new techniques are being developed and published rather rapidly; thus, only the basic and most related concepts are introduced in this dissertation.
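As a concrete illustration of the recipe above, the following minimal sketch (assuming PyTorch; the toy two-class network, the random data and the hyper-parameters are illustrative placeholders, not the models or data used later in this dissertation) wires together a mini-batch, a pixel-wise cross-entropy objective and an SGD loop of forward and backward passes.

```python
# Minimal training-loop sketch: data, objective J(theta), and SGD updates.
import torch
import torch.nn as nn

# Toy stand-in for a segmentation network: 1-channel slice -> 2-channel
# (background/foreground) per-pixel score map.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, kernel_size=1),
)
criterion = nn.CrossEntropyLoss()                  # pixel-wise cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Fake mini-batch: 4 slices of 64x64 with binary ground-truth masks.
images = torch.randn(4, 1, 64, 64)
labels = torch.randint(0, 2, (4, 64, 64))

for step in range(10):                             # training iterations
    optimizer.zero_grad()
    predictions = model(images)                    # forward propagation
    loss = criterion(predictions, labels)          # objective J(theta)
    loss.backward()                                # back-propagation
    optimizer.step()                               # SGD parameter update
```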

2.1 Building blocks of CNN

Figure 2.2: The components of a typical convolutional neural network layer in "complex layer terminology" [2].

The convolutional layer is the core building block of a CNN. As shown in Fig. 2.2, a convolutional layer typically consists of several sequential operations, namely convolution, nonlinear activation and pooling. By stacking convolutional layers, CNNs are able to see successively larger portions of the image, and to extract higher-order, more complex features with increasing depth.

2.1.1 Convolution

In mathematics, convolution is an operation on two functions of a real-valued argument that calculates the amount of overlap of one function g(t) as it slides over another function f(t). The convolution s(t) of f(t) and g(t) [43] is defined as the integral of the product of the two functions after one is reversed and shifted, as follows:

s(t) = (f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ   (2.1)

In CNNs, however, the commutative property [44] of convolution is not preserved, and the convolution operation has a different meaning. In CNN terminology, we often refer to the first argument f(t) of the convolution as the input, the second argument g(t) as the kernel, and the output s(t) as the feature map(s). Convolution in a CNN is more of a cross-correlation operation [43], as shown in Eq. (2.2): a similarity measure of two series as a function of the displacement of one relative to the other, also known as a sliding dot product or sliding inner product. Many machine learning libraries, including deep learning libraries, implement cross-correlation but call it convolution. In the following, we also use this definition of convolution, specifically indicating operations without flipping of the kernel, which can be written as:

s(t) = (f ∗ g)(t) = Σ_{τ=−∞}^{+∞} f(t + τ) g(τ)   (2.2)

Fig. 2.3 shows a concrete example of 2D convolution. As shown, I is an input 2D matrix, K is a kernel matrix, and I ∗ K is their convolution result, which is the output feature map.


Figure 2.3: 2D convolution operations.

In the convolution operation, the orange area is called the local receptive field, the blue part is a 3 × 3 kernel, and the yellow block "4" is the sum of the element-wise products of the numbers inside the local receptive field and the kernel. By performing such a dot product operation over the entire input matrix in a sliding-window fashion, we obtain the convolution result, a 5 × 5 feature map. The 2D convolution can be generalized to N-D convolution, which is defined on a multi-dimensional input array and a multi-dimensional kernel array, the parameters of which are usually adapted by the learning algorithm for different applications. N is also called the kernel's spatial dimensionality. Convolved with the same input, kernels with different weights lead to different feature maps, which, from the perspective of image processing, means the operation has a different effect on the input, such as blurring, sharpening or edge detection. In a CNN, multiple kernels are used (or learned) at each level to extract different local features for that level, such as oriented edges, corners or blobs, as shown in Fig. 2.5. Fig. 2.4 shows a concrete example of how to convolve a two-channel 2D input (2 × 5 × 5) to generate new three-channel 2D feature maps (3 × 3 × 3) using 6 different 3 × 3 2D kernels (2 × 3 × 3 × 3) [3]. The operation assumes the kernel slides over the input with stride 1 (stride: the distance between two consecutive positions of the sliding window). In practice, such settings can be tuned differently for different models or tasks.
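The sliding dot product of Fig. 2.3 can be written out directly in a few lines of NumPy; the sketch below (the function name and the binary test input are illustrative choices, not code from this dissertation) reproduces the stride-1, no-padding case.

```python
# Minimal NumPy sketch of the "convolution" (cross-correlation) in Fig. 2.3:
# slide the kernel over the input and take the dot product at each position.
import numpy as np

def cross_correlate2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1        # output height (stride 1, no padding)
    ow = image.shape[1] - kw + 1        # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            receptive_field = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(receptive_field * kernel)
    return out

I = np.random.randint(0, 2, size=(7, 7))   # 7x7 binary input, as in Fig. 2.3
K = np.ones((3, 3))                        # 3x3 all-ones kernel
print(cross_correlate2d(I, K).shape)       # -> (5, 5) feature map
```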

The exact shape of the output feature map from a convolutional layer is decided by the shape of its input, the kernel shape, the padding strategy, and the stride. We refer readers to [3] for more detailed explanations.
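For the common no-dilation case, the relation between these quantities is the standard floor formula below; it is stated here as an assumption for illustration (with a PyTorch check on the Fig. 2.4 shapes) rather than quoted from [3].

```python
# Standard output-size relation for a convolutional layer (no dilation),
# written as a small helper and checked against torch.nn.Conv2d.
import torch
import torch.nn as nn

def conv_output_size(n_in, kernel, stride=1, padding=0):
    # floor((n_in + 2*padding - kernel) / stride) + 1
    return (n_in + 2 * padding - kernel) // stride + 1

h = w = 5
conv = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 2, h, w)                    # the 2-channel 5x5 input of Fig. 2.4
print(conv(x).shape)                           # torch.Size([1, 3, 3, 3])
print(conv_output_size(h, kernel=3))           # 3
```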

Figure 2.4: Two-channel input feature maps to three-channel output feature maps [3].

Compared with the dense connections used in conventional neural networks, convolutions in a CNN are a type of sparse connection with far fewer parameters. Such a design not only reduces the model size, but also significantly improves the abstraction power of the CNN. In addition, the biggest difference between convolution filters (kernels) in a CNN and the hand-crafted ones in traditional image processing is that these kernels are learned in a task/data-driven manner by optimizing the whole network on the training data set under the guidance of the objective function. Fig. 2.5 provides an example visualizing the filters learned by the AlexNet model [35], which is trained on the ImageNet dataset [45] for a multi-class image classification task. It can be seen that the learned filters of the first layer are able to detect some basic local patterns, such as edges or blobs, which are very similar to hand-crafted Gabor filters. In the middle layer, the network learns to extract more complex features, such as texture patterns or object parts. Finally, in the classification layer, the network learns class-specific features, which are directly related to the final task.

Figure 2.5: Visualization of Conv1, Conv3 and Conv5 neurons learned from the ImageNet dataset [4].

Other than the aforementioned basic convolution operation, several new convolution variants have been developed recently for different applications, such as dilated convolution [46], deformable convolution [47], and so on. We refer readers to [16, 48, 49] for more details.

2.1.2 Activation function

In deep learning models, the activation function basically decides whether a neuron should be activated or not, serving as a gate that controls whether the signal is passed to the next layer, similarly to biological neurons. This gate typically applies a nonlinear transformation to the input signal and then feeds it to the next layer of neurons as input. The nonlinearity is what enables the network to solve problems more complex than simple linear regression; theoretically, by imposing the nonlinear transformation, the overall network is able to learn and represent arbitrarily complex functions. In addition, differentiability is another prerequisite for the activation function used in deep learning models, in order to optimize the models via the gradient descent method (introduced in Section 2.2.1). Next, several widely used nonlinear activation functions are described, with their advantages and disadvantages.

Figure 2.6: Activation functions. (Image source: https://medium.com/@shrutijadon10104776)

The Sigmoid function is one of the most widely used activation functions; its curve is S-shaped, as shown in Fig. 2.6. Since the Sigmoid function maps the input to the range (0, 1), it is often used in the output layer for models that predict a class probability. The Sigmoid is rarely used in the intermediate hidden layers due to the vanishing gradient problem that impedes training. The vanishing/exploding gradient means the following: if the input to the activation function (determined by the network parameters) is too small or too large, the gradient of the activation function has very small values, making the gradients close to zero ("vanish") so that the parameters of those layers are barely updated; on the contrary, if the input to the activation function is around zero, the gradient of the activation function is large, which leads to very big updates on the learned parameters of the model ("exploding").

The derivative of the Sigmoid is illustrated in Fig. 2.7(a); note that the Sigmoid is a differentiable and monotonic function. Tanh (the hyperbolic tangent function) is also an S-shaped activation function, and it performs slightly better than the Sigmoid in dealing with the vanishing gradient, since the gradients still shrink but much more slowly. ReLU (rectified linear unit) is currently the most widely used activation function, defined as f(x) = max(0, x). This means the output of the ReLU, f(x), is zero when the input signal x is smaller than or equal to zero. Such a setting can reduce the ability of the overall network to learn from the data, since the output of the activation is always zero when x is negative, and these neurons become "dead". For this issue, several alternatives have been proposed to solve the irreversible "dead neuron" problem of ReLU, such as leaky ReLU, Maxout and ELU, which are shown in Fig. 2.6.

Figure 2.7: Derivatives for (a) Sigmoid (b) Tanh (c) ReLU.
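The functions of Figs. 2.6 and 2.7 are simple enough to write out directly; the NumPy sketch below is illustrative only (the leaky-ReLU slope 0.01 is a common default assumed here, not a value taken from this dissertation) and makes the vanishing-gradient and "dead neuron" behaviors visible numerically.

```python
# NumPy versions of the activation functions and derivatives discussed above.
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x):          s = sigmoid(x); return s * (1.0 - s)   # at most 0.25: gradients shrink
def tanh(x):               return np.tanh(x)
def d_tanh(x):             return 1.0 - np.tanh(x) ** 2           # shrinks more slowly than sigmoid
def relu(x):               return np.maximum(0.0, x)
def d_relu(x):             return (x > 0).astype(float)           # zero for x <= 0: "dead" neurons
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)       # keeps a small gradient for x < 0

x = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(d_sigmoid(x))   # near zero at |x| = 5: the vanishing-gradient regime
print(d_relu(x))      # [0. 0. 0. 1. 1.]
```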

2.1.3 Pooling

Pooling is an operation that summarizes statistics of subareas of the previous feature maps; it reduces the size of the feature maps and also makes their representation more robust to small shifts and distortions. Pooling works by sliding a window across the input and using the content of the window to compute a local statistic, e.g., the mean or the maximum. Fig. 2.8 shows a max pooling example that transforms a 4 × 4 input into a 2 × 2 feature map.
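A minimal NumPy sketch of 2 × 2 max pooling with stride 2 is given below; the input values are made up for illustration and are not the ones drawn in Fig. 2.8.

```python
# 2x2 max pooling with stride 2: keep only the local maximum of each window.
import numpy as np

def max_pool2d(x, size=2, stride=2):
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [3, 4, 1, 8]])
print(max_pool2d(x))      # [[6. 4.], [7. 9.]]: a 4x4 input becomes a 2x2 map
```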

Figure 2.8: Max pooling example.

2.1.4 Upsampling

For detection/classification CNN models like Lenet-5 and AlexNet, after several convolution+pooling operations, fully-connected layer(s) will be added to predict the input’s class probability. But in semantic segmentation problems, the purpose is to generate a pixel-wise segmentation probability map which has the same size as the input data, so the down-sampled feature maps need to be scaled to the original input size to facilitate this task. Accordingly, it is necessary to design an opposite process of convolution (or pooling) to expand the feature maps, namely upsampling the feature maps. In the following, we will introduce three different upsampling operations, including unpooling, interpolation, and deconvolution. Unpooling layers have the opposite effects as the pooling layers, and also exist together with corresponding pooling layers. In each max-pooling layer, the coordinates of max-value is stored, so in a corresponding unpooling layer, values from a previous layer are entered into the stored coordinates, setting zeros for the rest positions, as shown in Fig. 2.9. By doing this, localization information is restored. Interpolation is the most straightforward way of upsampling low scale/resolution feature maps to high scale/resolution feature maps. Commonly used interpolation methods 29

Figure 2.9: Unpooling.

include but not limit to piecewise constant (nearest neighbor), linear, polynomial and spline interpolation. Also, interpolation methods can be grouped based on dimensionality of the input data, such as bilinear & bicubic (2D), and trilinear interpolation (3D). Other than unpooling and interpolation, deconvolution is currently the most widely used upsampling method for semantic segmentation tasks. Different from interpolation or unpooling, parameter weights/kernels in deconvolution are learnable through network’s optimization procedure. The concept of deconvolution is widely used in signal processing or image processing. In mathematics, deconvolution is an operation that refers to reversing the effects of ”convolution” operation. For instance, assume h is a natural image, k is a smoothing kernel, g is the smoothed image by performing convolution operation on h using kernel k, i.e. g = h ∗ k. Deconvolution is the operation to recover the original image h, given g and k. However, the term of deconvolution in CNN specifically indicates a convolutional operation that upsamples the input to an output of higher spatial dimension. Thus, some may argue that deconvolution is an inappropriate name to describe such operation in CNN. And transposed convolution or backward convolution would be more appropriate, since this specific operation is achieved through transposed convolution. 30

4 5 8 7 2 9 6 1 1 4 1 1 4 1 1 8 8 8 122 148 2 1 6 29 30 7 ∗ 1 4 3 = ∗>1 1 4 3 = 3 6 6 4 126 134 4 4 10 29 33 13 3 3 1 3 3 1 6 5 7 8 output'2×2 Input'2×2 12 24 16 4 kernel'3×3 kernel'3×3 (b) Input'4×4 (a) output'4×4


Figure 2.10: 2D transposed convolution example. ∗ denotes convolution, ∗−1 denotes transposed convolution and ⊗ denotes matrix multiplication. (Example source: https://towardsdatascience.com/@naokishibuya)

Fig. 2.10 is an example that explains how transposed convolution works. In (a), it shows a 2D convolution example: a 4 × 4 input is convolved with a 3 × 3 kernel with a stride of 1 and no padding, which generates a 2 × 2 output matrix. This convolution can be reinterpreted as a matrix multiplication, as illustrated in (c). First, the 3 × 3 kernel is rearranged into a 4 × 16 matrix (according to the stride and padding), and the input is flattened into a 16 × 1 vector. Then, the matrix multiplication between them produces a 4 × 1 column vector. If we reshape this vector to a 2 × 2 matrix, it is the same as the output from (a). Upon this, transposed convolution is defined as the "reverse" of such an operation, which means we want to generate a 4 × 4 matrix from a 2 × 2 input and a 3 × 3 kernel, as shown in (b). To achieve this, we transpose the 4 × 16 kernel matrix in (c) into a 16 × 4 matrix and multiply it with a flattened 4 × 1 input column vector. In this way, we get a 16 × 1 vector, which can be reshaped to a 4 × 4 matrix, the same size as in (b).

The whole procedure of transposed convolution is shown in (d). One thing to note is that the same kernel size and kernel values are used in both the convolution and the transposed convolution for demonstration in Fig. 2.10, but they need not be the same in practice. For convolution operations, each pixel in the output feature map is the summation of a local receptive field multiplied with a kernel matrix; thus it is a "many to one" mapping. On the contrary, transposed convolution is a "one to many" mapping. Fig. 2.11 shows the opposite effects of the two mappings. As with convolution, the kernel is learnable, and the shape of the output feature maps from a transposed convolution is also affected by the stride and padding size. Since transposed convolution is a "reversed" convolution, this relationship can be derived directly from that of convolution. We refer readers to [3] for more details.
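To make the size bookkeeping concrete, the small PyTorch sketch below (layers, weights and the random 4 × 4 input are illustrative, not the exact values of Fig. 2.10) contrasts the "many to one" convolution with the "one to many" transposed convolution.

```python
# A 3x3 kernel maps 4x4 -> 2x2 under convolution, and 2x2 -> 4x4 under
# transposed convolution (stride 1, no padding in both directions).
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0, bias=False)
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0, bias=False)

x = torch.randn(1, 1, 4, 4)
y = conv(x)                 # torch.Size([1, 1, 2, 2])
z = deconv(y)               # torch.Size([1, 1, 4, 4]): upsampled back to the input size
print(y.shape, z.shape)
```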

Figure 2.11: Mappings in convolution and transposed convolution.

2.2 Training of CNN

Although a concrete design of the CNN architecture is one of the preconditions for the model to be successful in the target task, training a CNN is arguably the most difficult of all the problems in deep learning, and many strategies and tricks have been proposed to ease it. In the following sections, several commonly used optimization methods for CNNs are first introduced, followed by some training tricks and strategies, including data augmentation, regularization, and transfer learning. Note that these methods and tricks are not specific to CNN training, and can be applied to different types of deep learning models.

2.2.1 Optimization

The goal of a feedforward network is to approximate some function h, so that for a certain kind of input the network generates the expected output. With the building blocks introduced in the previous sections, a CNN is eventually a nonlinear function determined by its parameters θ. For a given task, the goal of training a CNN is to seek a set of optimal parameters for that task, which is achieved by optimizing a well-defined objective function J(θ). For example, the LeNet-5 model maps an input image to one of 26 characters. Information flows through the network from the input x, through the intermediate operations, and finally to the output h_θ(x).

h_θ(x) is a vector that represents the probability p(y^(i) = j | x^(i)) of the predicted class label for each of the k different possible categories. The objective function for such a classification task can be written as:

J(θ) = −(1/m) [ Σ_{i=1}^{m} Σ_{j=0}^{k} 1{y^(i) = j} log p(y^(i) = j | x^(i); θ) ]   (2.3)

where 1{·} is the indicator function, such that 1{a true statement} = 1 and 1{a false statement} = 0, y^(i) is the ground-truth label of the i-th training sample, and j indexes the possible classes. Since a CNN model is typically very complex, consisting of multiple layers, it is not feasible to derive a closed-form solution for this non-convex objective function. Thus, gradient descent is used to optimize nearly all deep learning problems, by propagating the gradient of the objective function to update the parameters of each layer from the output layer to the input layer. This is called back-propagation, an iterative optimization procedure. During each iteration, the update can be written as:

θ_{t+1} = θ_t + ∆θ_t   (2.4)

∆θ_t = −λ · ∇_θ J(θ; x; y)   (2.5)

where θ_t denotes the parameters at the t-th iteration, and λ is the learning rate that determines how much we adjust the weights of the network with respect to the loss gradient. Large training data sets improve the recognition ability of deep neural networks, but also increase the computational complexity. Different from optimization for traditional machine learning algorithms that process all the training examples simultaneously in one large batch, deep learning models, including most CNN models, use a mini-batch optimization strategy, which computes each update to the parameters based on an expected value of the cost function estimated using only a subset of the training data. This is called mini-batch stochastic gradient descent, or, as is now common, simply stochastic gradient descent (SGD), expressed by:

∆θ_t = −λ · ∇_θ J(θ; x^(i); y^(i))   (2.6)

where i indexes a mini-batch sampled from all training data. Gradient descent has often been regarded as slow. For deep learning models, which typically have complicated structures and are trained on large data sets, how to improve the speed of training without hurting convergence of the SGD algorithm is crucial and tricky. In general, vanilla SGD does not guarantee good and fast convergence, due to the following challenges:

• In SGD, choosing an appropriate learning rate is crucial and difficult. A small learning rate leads to slow convergence, while a large learning rate hinders convergence and causes the loss function to fluctuate around the minimum. Thus, the learning rate is currently treated as a tunable hyper-parameter.

• For high-dimensional parameters, SGD sets the same learning rate for each dimension without considering the fact that each dimension contributes to the overall cost in different ways.

• Another challenge is how to avoid getting trapped in local minima or saddle points, where the gradient is close to zero in all dimensions.

Several optimizers have been developed to deal with the aforementioned challenges, such as learning rate annealing methods or setting a different learning rate for each dimension of the parameters. Another potential direction is to use second-order derivatives of the cost function to guide and speed up the gradient descent; however, the computational cost is prohibitive. Therefore, the majority of practical solutions seek ways to approximate the second-order information. Several commonly used SGD variants have been proposed, such as SGD with momentum, AdaGrad, Adadelta, RMSProp and Adam. For the complete theory behind these optimizers, we refer readers to [50].
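As one concrete example of these variants, the sketch below places the plain SGD update of Eq. (2.6) next to the momentum update; the toy quadratic loss, the learning rate and the momentum coefficient are illustrative assumptions, not settings used elsewhere in this dissertation.

```python
# Plain SGD versus SGD with momentum, on a toy quadratic objective.
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    return theta - lr * grad                      # Eq. (2.6): theta <- theta - lr * grad

def sgd_momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    velocity = mu * velocity - lr * grad          # accumulate a running update direction
    return theta + velocity, velocity             # smoother progress along ravines

theta = np.zeros(3)
velocity = np.zeros(3)
for _ in range(5):
    grad = 2 * theta - np.array([1.0, 2.0, 3.0])  # gradient of a toy quadratic loss
    theta, velocity = sgd_momentum_step(theta, grad, velocity)
print(theta)
```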

2.2.2 Data augmentation

Optimizing a convolutional neural network is a non-convex optimization problem, which is well known to be very difficult. In addition, deep CNN models typically consist of millions of parameters. Therefore, to ensure good performance as well as generalization of deep CNN models, a large training data set that is diverse enough to represent the overall data distribution is necessary. However, such conditions are usually difficult to achieve, especially in medical image computing. Thus, one of the most widely used methods to overcome this problem of limited quantity and limited diversity of data is to generate (manufacture) new training data by augmenting existing data. For image data, there are typically two ways to perform data augmentation. One is to alter the original images with geometric transformations, such as rotation, flipping, scaling, cropping, translation and adding noise.

More recently, generative models, especially Generative Adversarial Networks (GANs) [51], have also been utilized to generate synthetic images, due to their extraordinary performance in learning latent representations from the original training data. Several applications have shown that adding these synthesized images to the training set can help the training procedure [52]. Fig. 2.12 shows two examples comparing original real images with synthetic images generated using cycle GAN [5].

Figure 2.12: Images generated by cycle GAN [5]. Images in the first and third columns are real images and the transferred outputs are in the second and fourth columns.
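A minimal sketch of the first, geometric-transformation route described above is given below, assuming torchvision; the particular transform list and parameters are illustrative only and are not the augmentation pipeline used in later chapters.

```python
# Random geometric augmentation applied on the fly when loading a sample.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),              # flip
    transforms.RandomRotation(degrees=10),               # small rotation
    transforms.RandomResizedCrop(64, scale=(0.8, 1.0)),  # scale + crop
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # apply to a PIL image inside the data loader
```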

2.2.3 Regularization

A central problem in machine learning is how to make algorithms perform well on new, unseen inputs, which is known as the generalization ability of the learned model. This is especially important for training CNN models because they are prone to overfitting. For deep learning models, several regularization strategies that are commonly used for traditional machine learning algorithms still apply to training CNNs. For example, some methods impose restrictions on the parameter values by adding extra terms to the objective function, such as sparsity constraints. Sometimes these terms can also be designed to encode specific kinds of prior knowledge or to express a generic preference for a simpler model. Next, we introduce two specific regularization techniques that were recently designed for deep learning models.

Dropout: Dropout is a specific technique to combat the overfitting problem of neural networks. The term "dropout" refers to randomly dropping units (along with their connections) from the neural network during training. This significantly alleviates the parameter co-adaptation problem. During training, different thinned networks are sampled from the original model by removing (zeroing out) a random fraction of nodes (and the corresponding activations). During testing, all activations are used to generate the prediction, but multiplied by a factor to account for the missing activations during training. Overall, dropout can be viewed as a form of ensemble learning, in which several different classifiers with different numbers of parameters are trained separately, and the final prediction, generated by averaging the responses of these classifiers, is used at test time, as illustrated in Fig. 2.13.

Figure 2.13: Subnetworks after dropout.

Batch normalization: The main idea of batch normalization is inspired by the typical normalization of the input training data. It provides a remedy for the internal covariate shift problem by applying a normalization to each intermediate layer's activations in the CNN. Specifically, during training, a Z-score normalization is conducted on the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation. In this way, each layer of the network can learn a little more independently of the other layers, and much higher learning rates can be used, due to the inhibiting effect on abnormal activations. As it introduces some noise in each layer, batch normalization is also regarded as a kind of regularization. If used together with dropout, extra attention should be paid to avoid losing too much useful information during training.
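The sketch below shows one common way to insert these two regularizers into a convolutional block, assuming PyTorch; the layer sizes and the dropout rate are illustrative, and the train/eval switch makes explicit the different behavior at training and test time described above.

```python
# Dropout and batch normalization inside a CNN block, with train/eval modes.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),       # normalize activations with batch statistics
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zero activations during training
)
block.train()                  # dropout active, batch statistics updated
y_train = block(torch.randn(8, 16, 32, 32))
block.eval()                   # dropout off, running statistics used instead
y_test = block(torch.randn(8, 16, 32, 32))
print(y_train.shape, y_test.shape)
```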

2.2.4 Transfer learning

Figure 2.14: Transfer learning.

In general, transfer learning is a research problem in machine learning that focuses on how to apply prior knowledge gained from solving one task to a different but related task. This is an extremely valuable technique for deep learning, since most tasks at hand suffer from training data scarcity. In deep learning, the most commonly used transfer learning approach is to finetune a pretrained source model on new datasets for new tasks. The pretrained model could come from training on related tasks, or directly from released models that were trained on very large and general data sets. For CNN models, the validity of this type of transfer learning lies in the fact that the lower layers of a CNN extract very general latent information that is not task specific and can easily be transferred to other tasks. As shown in Fig. 2.14, an example of CNN transfer learning is illustrated: we use the parameters of a pretrained model as the initialization for a different task, remove the last task-specific classifier layer, and then finetune the overall model using data from the new task.
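A minimal sketch of this finetuning recipe is given below, assuming a torchvision ResNet-18 pretrained on ImageNet as the source model; the 2-class output and the choice to freeze all earlier layers are illustrative assumptions, not the transfer setup used in this dissertation.

```python
# Finetuning a pretrained model on a new task (cf. Fig. 2.14).
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)      # source model trained on a large dataset
for param in model.parameters():
    param.requires_grad = False               # keep the generic lower-layer features
model.fc = nn.Linear(model.fc.in_features, 2) # replace the task-specific classifier layer
# Only the new classifier's parameters are updated when finetuning on the new task.
```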

2.3 RNN - recurrent neural network

Figure 2.15: Structures of RNN: Left (a): rolled RNN, Right (b): unrolled RNN [2]

Recurrent neural networks (RNNs) [2] are a family of neural networks specialized for processing sequential data x_1, x_2, ..., x_n, which are related in time or space and whose order cannot be changed.

Different from CNNs, which treat data points independently and discard the network state after processing each data point, RNNs model data points with temporal or sequential structure, and are able to maintain "memory information" about previous inputs that affects the subsequent network outputs. RNNs have been successfully applied to many Natural Language Processing problems [53–55]. Fig 2.15(a) shows a typical RNN model. It takes an input sequence x and produces an output sequence o, and has recurrent connections within the hidden units h. Fig 2.15(b) is the corresponding unfolded graph of the same RNN model in (a), which shows in more detail how the sequential information passes through the model. Generally, the majority of RNNs use the following or a similar equation to define their hidden units:

h^(t) = tanh(b + W h^(t−1) + U x^(t))

U and W are the weight matrices connecting the input units with the hidden units, and the previous hidden units with the current hidden units, respectively, and tanh is the activation function. The equation is regarded as recurrent because the hidden state h at the current time point t relies on the same definition at the previous time point t − 1, and the same operation is applied at each time point of a long data sequence. The final output layer o reads information out of the hidden state h to make predictions,

o^(t) = c + V h^(t)

where V is the weight matrix connecting the hidden units with the output units. An apparent difference between a traditional deep neural network and an RNN is that the same parameters (U, V, W) are shared across all steps in an RNN. This means RNNs perform the same task at each time step, can take inputs of different lengths, and require fewer parameters to be computed. The time step index t in the equations above does not need to be interpreted as the flow of time in the real world; it refers only to the position within a sequence that follows a certain order. In addition, RNNs may also be applied in higher dimensions, including spatial domains such as images.
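The two recurrences above translate almost directly into code; the NumPy sketch below is a plain transcription for illustration (the dimensions and random initialization are arbitrary choices, not part of any model proposed later).

```python
# A minimal RNN cell: h(t) = tanh(b + W h(t-1) + U x(t)), o(t) = c + V h(t).
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3
U = rng.normal(size=(hidden_dim, input_dim))    # input -> hidden
W = rng.normal(size=(hidden_dim, hidden_dim))   # previous hidden -> hidden
V = rng.normal(size=(output_dim, hidden_dim))   # hidden -> output
b = np.zeros(hidden_dim)
c = np.zeros(output_dim)

x_sequence = rng.normal(size=(5, input_dim))    # a length-5 input sequence
h = np.zeros(hidden_dim)
outputs = []
for x_t in x_sequence:                          # same parameters shared at every step
    h = np.tanh(b + W @ h + U @ x_t)
    outputs.append(c + V @ h)
print(np.stack(outputs).shape)                  # (5, 3): one output per time step
```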

Medical image segmentation is a key enabling technology for various medical applications such as diagnostics, planning and guidance. Automatic and efficient implementations have been a research focus in the medical image computing community for several decades. Prior to the emergence of deep learning methods in recent years, medical image segmentation witnessed a long and rich history of developments, which can be roughly grouped into three stages according to [56]. The first stage mainly consists of low-level techniques that still rely on heuristics, such as thresholding, region growing, and edge tracing. In the second stage, the focus shifted to uncertainty models, better optimization methods, and the avoidance of heuristics. Many classical and popular models were invented in this stage, including fuzzy C-means clustering [57], active contours [58], graph cuts, and some early learning-based solutions [59, 60]. In the more recent third stage, more effort was spent on how to efficiently incorporate high-level knowledge (e.g., a prior shape of the desired object, or expert-defined rules) into segmentation models. As introduced previously, atlas-based and patch-based segmentation models became mainstream for many medical segmentation tasks. In this section, due to the broad spectrum of topics and the long history of medical image segmentation, we do not aim to conduct an exhaustive survey of the current literature, but rather devote the remaining section to more recently proposed deep learning based segmentation solutions, which provide a sufficient basis for the subsequent discussion of our own models. Specifically, we review some recent CNN based and RNN based segmentation models, respectively, followed by a related joint model. For a more comprehensive review of classical medical image segmentation, we refer readers to several survey papers [56, 61, 62].

3.1 CNN based Segmentation

In recent years, the availability of large image datasets and high-performance computing systems, such as GPUs and large-scale distributed clusters, has made training very deep CNNs possible. Since 2006, many methods [63–66] have been developed to overcome the difficulties encountered in training deep CNNs. In 2012, the AlexNet [35] model proposed by Krizhevsky et al. won the ILSVRC-2012 computer vision competition [67] by a significant margin over the traditional non-CNN methods. Although the AlexNet model combines long-standing CNN concepts (pooling and convolutional layers, variations on the input data) with several new key ingredients (a very efficient GPU implementation, ReLU neurons, dropout), it clearly demonstrated the power of a well-trained deep CNN. Following the success of AlexNet, several other CNN based works, including VGGNet [68], GoogleNet [48] and ResNet [66], kept breaking the ILSVRC ImageNet challenge record, and many other works [36, 69, 70] have demonstrated the success of deep CNNs in other recognition or classification applications. It is therefore natural for researchers to consider how to move from the coarse inference of recognition tasks to the fine pixel-wise prediction of segmentation tasks. Two approaches are commonly used to apply or extend CNN based recognition methods to segmentation problems [71]. The first is to train a CNN model on small patches extracted from the training images, using the class of each patch's center pixel as its label; the final segmentation is obtained by applying the CNN model across all pixels of the test image. The second is an end-to-end solution, which relies on training a fully convolutional network (FCN) on whole images or image proposals (parts of the image containing a single object), and outputs the segmentation result directly from the network. Next, we introduce several representative CNN based segmentation models from these two categories.

Figure 3.1: A patch-wise CNN segmentation model for the membrane segmentation of Electron Microscopy images [6].

3.1.1 Patch-wise CNN for segmentation

In [6], a deep convolutional neural network is adopted as a pixel classifier for the membrane segmentation of Electron Microscopy (EM) images. The deep CNN is trained with image patches extracted from the original whole image, and the label of each image patch is determined by its center pixel. Specifically, if the center pixel of an image patch belongs to a membrane, the patch is assigned the membrane class. In this way, the image segmentation task becomes a dense patch-wise classification over all extracted image patches. Given a new test image, the final segmentation result is obtained by applying the trained patch-wise classifier at each pixel of the test image. Fig 3.1 displays the whole process of segmenting an EM image and the architecture of the patch-wise CNN segmentation model they used. In their work, each pixel is treated independently, without considering the relationship between adjacent pixels. Their experimental results showed that this method achieved state-of-the-art performance, and it won the EM segmentation challenge at ISBI 2012 by a large margin. One apparent drawback of this patch-wise CNN segmentation method is the extremely high time cost during prediction. Another drawback is that the model ignores global information due to the constraint of the patch size. There is always a tradeoff between the patch size and the amount of global information that can be involved: a larger patch size incorporates more global context but may lose fine local details. For this reason, multi-scale solutions [72] have been proposed to capture different levels of information and further improve segmentation accuracy.
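The following is a minimal NumPy sketch of the patch-wise prediction loop described above. Here `patch_classifier` is a hypothetical placeholder for the trained CNN of [6], and the patch size is an arbitrary choice; the nested loop makes the high test-time cost of this scheme apparent.

```python
import numpy as np

def patchwise_segment(image, patch_classifier, patch_size=65):
    """Dense patch-wise segmentation: classify the center pixel of every patch.

    `patch_classifier` maps a (patch_size, patch_size) array to a foreground
    probability; it stands in for a trained patch-wise CNN."""
    half = patch_size // 2
    # Mirror-pad so that border pixels also get a full-sized patch.
    padded = np.pad(image, half, mode="reflect")
    prob_map = np.zeros_like(image, dtype=np.float32)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patch = padded[r:r + patch_size, c:c + patch_size]
            prob_map[r, c] = patch_classifier(patch)
    return prob_map  # one forward pass per pixel -- hence the high time cost

# Toy usage with a dummy classifier.
demo = patchwise_segment(np.random.rand(16, 16),
                         lambda p: float(p.mean() > 0.5), patch_size=5)
```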

3.1.2 Fully convolutional network

Figure 3.2: The architecture of FCN [7].

Instead of applying existing successful CNN models as patch-wise classifiers for the image segmentation task, researchers have also tried to design more efficient end-to-end CNN-based segmentation architectures. The problem with using a standard CNN for segmentation is that spatial information is largely or totally discarded by the stacked convolutional and pooling operations, as well as by the final fully connected layer. Although spatial information may not be essential for object recognition, it is critically important for accurate segmentation. To deal with this problem, Long et al. [7] proposed a fully convolutional network (FCN) model which replaces the fully connected layers of CNNs with convolutional layers. In other words, they convert classification nets into convolutional nets that produce coarse pixel-wise class probability maps. The coarse probability maps are then upsampled to the same spatial size as the input image using a deconvolution operation. However, the segmentation obtained by upsampling only the final convolutional layer is quite coarse and loses many details. The authors addressed this with a skip architecture that integrates the outputs of different convolutional layers. This combination of lower, fine layers and upper, coarse layers forces the model to respect global structure while making local predictions, which significantly improves the segmentation accuracy. The FCN model was the first work to train a CNN based end-to-end model for pixel-wise image segmentation, and it can also take advantage of existing supervised pre-trained CNN models. However, the model contains a large encoding network and a relatively small decoding network, so it is good at capturing the overall shape of an object but has difficulty reconstructing highly nonlinear object boundaries accurately.
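To make the idea concrete, the sketch below builds a toy fully convolutional network in PyTorch: a small encoder, a 1×1 convolution in place of the fully connected classifier, and a transposed convolution (deconvolution) to upsample the coarse score map back to the input resolution. It is only an illustration of the principle, not the FCN architecture of [7]; all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal FCN: an encoder, a 1x1 'classifier' convolution replacing the
    fully connected layer, and a transposed convolution that upsamples the
    coarse score map back to the input resolution."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 1/2 resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 1/4 resolution
        )
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=4, stride=4)

    def forward(self, x):
        coarse_scores = self.classifier(self.encoder(x))
        return self.upsample(coarse_scores)        # dense pixel-wise scores

scores = TinyFCN()(torch.randn(1, 1, 64, 64))      # -> (1, 2, 64, 64)
```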

3.1.3 SegNet

Figure 3.3: The architecture of SegNet [8].

To ameliorate the aforementioned boundary issues of FCN, some researchers [73, 74] tried to integrate FCN with conditional random fields, and showed some progress.

However, this imposes an extra computational burden and complicates the overall process. In [8], the more elegant and efficient SegNet model was proposed to improve FCN; it adopts an end-to-end segmentation architecture, as shown in Fig 3.3. Overall, SegNet is composed of two parts: an encoder network and a decoder network. The encoder architecture is exactly the same as the VGGNet model with the final three fully connected layers removed. By doing so, the encoder network contains only around 10% of the parameters of the original VGGNet model, which to a certain degree eases training. The decoder network is symmetric to the encoder network: for each layer in the encoder network, there is a corresponding layer in the decoder network. To recover the spatial information lost in the encoder network and reconstruct the original size of the activations, upsampling layers that perform the reverse operation of pooling are employed in the decoder network. Similarly, convolutional layers corresponding to those in the encoder network are used in the decoder network to densify the enlarged, yet sparse, activations produced by the upsampling layers. By stacking upsampling, convolutional and rectification layers together, a "reverse" of the encoding network, the so-called decoder network, is constructed and connected to the encoder network; together they form the overall SegNet. The encoder and decoder networks are more balanced in SegNet than in FCN, and experimental results also showed substantially better performance than FCN.
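The decoder's "reverse pooling" can be sketched with PyTorch's pooling-indices mechanism, which is assumed here to stand in for SegNet's unpooling; the channel counts and layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# SegNet-style upsampling: the encoder stores the argmax locations of each
# max-pooling, and the decoder uses them to place activations back at their
# original positions (the "reverse operation of pooling").
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
densify = nn.Conv2d(8, 8, kernel_size=3, padding=1)  # fills in the sparse map

x = torch.randn(1, 8, 32, 32)            # encoder feature maps
pooled, indices = pool(x)                 # (1, 8, 16, 16) + argmax indices
upsampled = unpool(pooled, indices)       # sparse (1, 8, 32, 32)
dense = torch.relu(densify(upsampled))    # densified decoder feature maps
```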

3.1.4 U-Net

Based on the work of FCN, [9] proposed the U-Net model, another end-to-end segmentation model that works very well with very few training images and yields more precise segmentation results, especially on biomedical images. As shown in Fig. 3.4, the architecture is very similar to SegNet: U-Net has a contracting path and a symmetric expanding path.

Figure 3.4: The architecture of U-Net [9].

The contracting path is used to capture latent high-order features, while the expanding path upsamples the low-resolution feature maps from the contracting path. Instead of recording the locations of the maximum activations in the max-pooling layers to recover the lost spatial information, as in SegNet, the high-resolution feature maps generated in the contracting path are directly concatenated to the corresponding feature maps in the expanding path, which forms skip connections between the two paths at different levels. This yields the U-shaped architecture shown in Fig. 3.4. Different from FCN, a large number of feature channels are used in the expanding path, which allows the network to propagate context information to higher-resolution layers. Experimental results show that U-Net can be trained end-to-end from very few images by using random elastic deformations for data augmentation, which makes it very suitable for medical image segmentation tasks. In fact, U-Net demonstrated this by winning the Cell

Tracking Challenge at ISBI 2015 on the two most challenging transmitted light microscopy categories.
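A single level of this skip-connection scheme can be sketched as follows in PyTorch; the channel counts are arbitrary and the block is an illustration of the concatenation idea, not the exact U-Net of [9].

```python
import torch
import torch.nn as nn

class UNetLevel(nn.Module):
    """One level of a U-Net-style network: downsample, process, upsample, then
    concatenate the high-resolution encoder features with the decoder features
    (the skip connection), followed by a convolution that fuses them."""

    def __init__(self, ch=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottom = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * ch, ch, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        skip = self.enc(x)                          # high-resolution features
        deep = self.up(self.bottom(self.down(skip)))
        merged = torch.cat([skip, deep], dim=1)     # channel-wise concatenation
        return self.fuse(merged)

out = UNetLevel()(torch.randn(1, 1, 64, 64))        # -> (1, 32, 64, 64)
```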

3.2 RNN based Segmentation

3.2.1 LSTM

Figure 3.5: Structure of RNN and LSTM: (a) Standard RNN with a single layer, (b) LSTM with multiple layers. 3

Thanks to the "parameter sharing" property and recurrent operations, RNNs are able to deal with very long sequential data by using a very "deep" computational graph. However, due to vanishing/exploding gradient problems, it is very difficult to train RNNs with the traditional backpropagation method to learn dependencies between steps separated by long intervals. Several methods have been proposed to deal with this problem [75, 76], and Long Short-Term Memory networks (LSTMs) [77] are among the most widely used. A gating mechanism was introduced in LSTMs to reduce the effect of vanishing gradients in RNNs. In LSTMs, a new type of memory cell replaces the traditional hidden unit. Fig 3.5 shows the difference between a typical RNN hidden unit and an LSTM memory cell. The memory cell is the essential idea behind LSTMs.

3 Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

It has the ability to retain different amounts of information within the cell depending on the situation. Thus, LSTMs are able to learn when to remember information for long periods of time and when to focus only on recently accumulated information. Specifically, the memory cell is a composite unit containing several elements. In addition to the input xt and the output ht, the memory cell has three regulation gates (ft, it, ot) that control the process of removing or adding information to its memory state ct. The memory state ct is a combination of information from the previous memory and the new input, both modulated by their corresponding gates: the forget gate ft and the input gate it. The following equations describe this procedure:

it = σ(Wi xt + Ui ct−1 + bi)

ft = σ(Wf xt + Uf ct−1 + bf)

ot = σ(Wo xt + Uo ct−1 + bo)

ct = ft ◦ ct−1 + it ◦ tanh(Wc xt + Uc ct−1 + bc)

ht = ot ◦ tanh(ct)

ft, it, ot represent the forget gate, input gate, and output gate, respectively. They are called gates because the sigmoid function is used within them to squash the values of the input vectors to between 0 and 1. By multiplying the output vector of a gate with another vector element-wise, different amounts of information in that vector can be kept after passing through the memory cell. With this mechanism, when LSTMs are trained using backpropagation and the total error on a set of training sequences is back-propagated from the output, the error can be preserved within the memory cell, which alleviates the vanishing gradient problem. LSTMs are the most commonly used RNN variant, and have been successfully applied to many different tasks [78, 79].
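For illustration, a single step of the memory cell defined by the equations above can be written directly in NumPy. The gate formulation follows the equations as given in this section (gates conditioned on the previous memory state c_{t−1}); the toy dimensions and random parameters are assumptions made for this sketch only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, c_prev, params):
    """One step of the memory cell, following the gate equations above."""
    W_i, U_i, b_i = params["i"]   # input gate
    W_f, U_f, b_f = params["f"]   # forget gate
    W_o, U_o, b_o = params["o"]   # output gate
    W_c, U_c, b_c = params["c"]   # candidate memory content

    i_t = sigmoid(W_i @ x_t + U_i @ c_prev + b_i)
    f_t = sigmoid(W_f @ x_t + U_f @ c_prev + b_f)
    o_t = sigmoid(W_o @ x_t + U_o @ c_prev + b_o)
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ x_t + U_c @ c_prev + b_c)
    h_t = o_t * np.tanh(c_t)      # element-wise gating of the squashed state
    return h_t, c_t

# Toy usage with random parameters (hidden size 4, input size 3).
rng = np.random.default_rng(0)
params = {k: (rng.standard_normal((4, 3)), rng.standard_normal((4, 4)),
              np.zeros(4)) for k in "ifoc"}
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), params)
```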

Figure 3.6: Inner structure of CLSTM [10].

3.2.2 Convolutional LSTM and its application on segmentation

In spite of the success of LSTMs, standard LSTMs are explicitly designed to deal with one-dimensional sequential data. This means that if we want to apply an LSTM to multi-dimensional data, the data has to be preprocessed and presented as one-dimensional. Due to the fully connected input-to-hidden and hidden-to-hidden transitions, conventional LSTMs (FC-LSTM) have limitations in handling spatio-temporal data and ignore the multi-dimensional spatial information. To solve this problem, several works have been proposed to modify LSTMs so as to include spatial information [10, 79, 80]. Convolutional LSTM (CLSTM) [10] extends FC-LSTM by replacing the fully connected operations (matrix multiplications) with convolutional operations in both the input-to-hidden and hidden-to-hidden transitions. The basic idea of CLSTM is thus to turn the multiplications in LSTM into convolutions, which results in a much more convenient implementation. The inner structure of CLSTM is shown in Fig. 3.6. All the inputs, cell outputs, hidden states, and gates of the CLSTM are 3D tensors whose last two dimensions are spatial dimensions, instead of the 1-D vectors of FC-LSTM. The CLSTM is a generalization of FC-LSTM: it has all the benefits of FC-LSTM, but, thanks to the convolutional operations, it can also preserve the multi-dimensional spatial information of spatio-temporal input. There have been many successful applications of CLSTM. For example, [80] applied CLSTM to the 3D medical image segmentation problem by using six CLSTMs. Each

CLSTM processes the entire volume in one unique direction, and the outputs from the six CLSTMs are combined to create the full context of each pixel. Their model achieved the best pixel-wise brain image segmentation results on MRBrainS13 (the ISBI 2015 workshop on Neonatal and Adult MR Brain Image Segmentation).
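The sketch below illustrates a single ConvLSTM step in PyTorch, with the matrix multiplications replaced by one convolution that produces all four gates at once; it is a simplified illustration rather than the exact formulation of [10], and the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One CLSTM step: the matrix multiplications of FC-LSTM are replaced by
    convolutions, so inputs, hidden states and gates keep their 2D spatial
    layout. All four gates are computed with a single convolution for brevity."""

    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Process a stack of slices (sequence length 5, one channel, 32x32) in order.
cell = ConvLSTMCell(in_ch=1, hid_ch=8)
h = torch.zeros(1, 8, 32, 32)
c = torch.zeros(1, 8, 32, 32)
for x_t in torch.randn(5, 1, 1, 32, 32):
    h, c = cell(x_t, (h, c))
```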

3.3 Segmentation based on combination of CNN and RNN

The previous sections described several image segmentation models based on either CNNs or RNNs alone. CNNs have the advantage of learning salient image features, and there exist many pre-trained CNN models ready for transfer learning [35, 48, 68]. However, when using CNNs for 3D segmentation tasks, the widely adopted approach is to treat the 3D image volume as a stack of 2D slices, apply a 2D CNN segmentation model to them, and then combine the 2D segmentation results. Doing so inevitably loses the continuous spatial information along the third dimension, which is not acceptable for 3D medical images. If 3D CNNs are applied directly, many more parameters are needed to describe the 3D convolutional kernels, and more samples are required for training; yet it is generally not possible to obtain a very large (1000+) 3D image data set in the medical domain. On the other hand, RNNs or LSTMs have the ability to preserve temporal or temporal-spatial information. However, RNNs are typically not as powerful as CNNs at extracting salient image features, since although they form a very deep network along the temporal direction, each hidden unit is a shallow structure along the spatial dimension. Therefore, some researchers have tried to combine CNNs and RNNs [11, 81, 82], and [11] is one successful application to 3D medical image segmentation.

3.3.1 U-Net + Bi-directional CLSTM

Considering the dimensional anisotropism present in 3D medical images, [11] proposed a 3D medical image segmentation framework that combines FCN and RNN. The FCN is used to extract 2D slice features, while the RNN is used to integrate context information

Figure 3.7: Models proposed in [11]: (a) Structure of kU-Net, (b) Overview of the framework (Deep BDC-LSTM).

between those 2D slices along the z-axis. Specifically, for the FCN part, they proposed a new architecture, kU-Net, which consists of two U-Nets at different scales. Fig. 3.7(a) shows the structure of kU-Net. A 2D image slice and its downsampled version are fed into the two U-Nets, respectively. In this way, both the low-level, detailed information and the high-level, abstract information are propagated in parallel through the kU-Net, and are eventually combined by concatenating their outputs. For the RNN part, they used a stacked bi-directional convolutional LSTM network (BDC-LSTM) to integrate the context information between the 2D outputs generated by kU-Net. BDC-LSTM is an extension of CLSTM that stacks two CLSTMs together and feeds them the same input sequence in opposite directions. The contextual information from the two CLSTMs is concatenated as the final output, which assures that information from both directions along the z-axis is incorporated at the same time. In the end, multiple BDC-LSTMs are stacked together to form a deep structure. The overall structure of this model is shown in Fig 3.7(b). Evaluated in two different 3D biomedical image segmentation applications, this new model achieved state-of-the-art performance and outperformed known 2D schemes.

4 Multi-view Ensemble ConvNet for Hippocampus Segmentation

In this chapter, we introduce the proposed multi-view ensemble convolutional neural network (ConvNet). Specifically, we explore the feasibility of combining 2D slice-based decisions from multiple 2D planar views and construct a CNN based multi-view ensemble net, in order to address the issue of limited training samples for 3D medical image segmentation, as well as the loss of 3D spatial context information when using only 2D slice-based CNN models. We start with the practical motivation in the context of a clinical task, i.e., hippocampus segmentation in 3D MR images, and then describe in detail the whole framework of the proposed model, including U-Seg-Net and Ensemble-Net. Experimental settings and results for the hippocampus segmentation task are presented to demonstrate the effectiveness of our model, with comparisons to some state-of-the-art solutions in the same area. The work presented in this chapter has been published in [83].

4.1 Motivation: Hippocampus segmentation

The brain is arguably the most important organ for human beings. It acts as a commander for the human nervous system: collecting signals from the body's sensory organs and sending instructions to guide the movement of muscles. Worldwide, brain disorders are a major cause of morbidity, disability, and premature mortality; they include any impairment of brain function caused by illness, genetics or traumatic injury [84]. For instance, over one-quarter of adult Americans have been diagnosed with a mental illness, such as Alzheimer's Disease (AD), Post-Traumatic Stress Disorder (PTSD), or Major Depressive Disorder (MDD) [84]. Therefore, how to prevent brain disorders and improve their treatment has long been a worldwide concern.

Among the different brain cortical and subcortical structures, the hippocampus, located under the cerebral cortex in the medial temporal lobe, is an important component of the limbic system. It plays essential roles in the consolidation of information from short-term memory to long-term memory, and in the spatial memory that enables navigation. Neural degeneration and dysfunction of the hippocampus have been studied and associated with many brain disorders, including temporal lobe epilepsy, AD, mild cognitive impairment (MCI), schizophrenia, MDD, bipolar disorder, and many others [85]. Assessing the morphometric characteristics and the structural integrity of the hippocampus (HC) is important for diagnosing and monitoring these brain disorders. For Alzheimer's Disease, the atrophy of the hippocampus, reflected in structural T1-weighted MRI scans, has been demonstrated to be a biomarker able to predict the progression from MCI to AD [86, 87]. Fig. 4.1 shows a bounding box around the hippocampus region from coronal slices of the T1-weighted MRI of NC, MCI and AD subjects, respectively, in which the atrophy of the hippocampus can be clearly observed. In addition, monitoring the volumetric change of the hippocampus in AD patients is also useful for assessing the treatment outcomes of potential drugs. In epilepsy, asymmetry in hippocampal volumes (atrophy of the hippocampus in one hemisphere) has been utilized as a predictor, and hippocampal atrophy is also used to measure the progression of the disease [88, 89]. In many other studies, although not yet validated as a clinical biomarker, the atrophy of the hippocampus has been shown to be highly correlated with diseases including schizophrenia [90], PTSD [91], and bipolar disorder [92]. Given the aforementioned evidence and the widespread agreement on the usefulness of HC volumetry, accurate segmentation of the hippocampus with appropriate reproducibility is necessary for its establishment as a biomarker [93]. Thanks to the invention and continuous improvement of Magnetic Resonance Imaging (MRI) techniques, MRI is now a commonly accepted, non-invasive imaging solution for quantifying the volume and assessing the shape of the hippocampus [94].

Figure 4.1: Comparison of the hippocampus region from coronal slices of the T1-weighted MRI of respectively NC, MCI and AD subjects [12].

Although the development of MRI techniques has led to better quality brain imaging, with sufficient resolution and contrast to enable quantification of the HC's symmetry and atrophy, manual or semi-automated segmentation is still considered the gold standard for HC assessment and is adopted as the clinical routine. However, manual or semi-automated solutions are very time-consuming and laborious, and suffer from both inter- and intra-rater variability, which hinders effective, large-scale morphological studies of the hippocampus [95]. To overcome these limitations, fully automatic approaches have been developed for hippocampal segmentation, which can be roughly divided into atlas-based, patch-based, and learning-based methods [96]. We briefly discussed the pros and cons of these methods in section 1.2 and chapter 3. Although these methods have demonstrated promising results, segmentation errors and mislabeled voxels remain unavoidable.

In general, automatic segmentation of the hippocampus in MRI is a challenging task. On one hand, the gray levels of the hippocampus in MRI are very similar to those of neighboring structures, including the amygdala, the caudate nucleus and the thalamus, as shown in Fig. 4.2, and there is no well-defined, clear border separating the hippocampus from its neighbors. On the other hand, MRI intrinsic effects and artifacts, such as the partial-volume effect and non-homogeneous intensity, bring another level of difficulty to accurate automatic hippocampus segmentation [96]. To overcome these limitations and further improve existing automatic solutions for hippocampus segmentation, it is of great value to utilize or explore extra information that is also meaningful to clinical operators (e.g., radiologists) when they perform manual segmentation, such as prior knowledge of the shape and position of the hippocampus, or even the way they read and assess MRI, i.e., reading the three anatomical planes (sagittal, coronal, and axial) in a slice-sliding fashion. Inspired by this, as well as by the limitations of existing 2D slice-based image segmentation models discussed in section 1.3, we propose an automated 3D hippocampus segmentation method for MRI, relying on a multi-view ensemble convolutional neural network, and demonstrate its effectiveness by systematic experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) data set.

4.2 Method

As shown in Fig. 4.3, our model consists of two major components. First, a series of U-Seg-Net based segmentation networks are employed to conduct 2D segmentation on the slices extracted from nine different views, which produce nine preliminary 3D probability maps for each hippocampus. Second, the Ensemble-Net is trained to fuse these probability maps to generate the final segmentation with improved accuracy.

Figure 4.2: Example showing the hippocampus and six other subcortical regions around the hippocampus (caudate nucleus, putamen, globus pallidus, nucleus accumbens, thalamus and amygdala) in axial, sagittal, coronal and 3D views (hippocampus-cyan; caudate nucleus-yellow; putamen-magenta; globus pallidus-green; nucleus accumbens-blue; thalamus-red; amygdala-white) [13].

4.2.1 U-Seg-Net

Our work is built on a 2D segmentation network which has a similar architecture to U-Net, as shown in Fig. 4.3(a). As a modification and extension of FCN, U-Net utilizes a symmetric encoding-decoding architecture.

Figure 4.3: Model overview. (a) The U-Seg-Net architecture, built from 3×3 convolution + ReLU layers, 2×2 max pooling, 2×2 de-convolution, and copy (bridge) connections. (b) Overview of the full framework: slices from nine views (sagittal, coronal, axial, and six diagonal views Diag1–Diag6) are segmented by U-Seg-Nets and their probability maps are fused by the Ensemble-Net.

The encoding path utilizes several convolutional + pooling operations to extract latent representations, while deconvolutional + convolutional layers are adopted to gradually upsample the latent feature maps and learn a semantic segmentation of the input. For the purpose of pixel-wise semantic segmentation, "bridge connections" are utilized to compensate for the location information loss caused by max-pooling, by concatenating feature maps at different levels of the decoding path with the feature maps of the corresponding encoding path. In the proposed U-Seg-Net, we keep the main "encoding + decoding + bridge" architecture of U-Net, and make some necessary modifications to accommodate our own data and task. In U-Net, each step has two convolution layers before max-pooling or deconvolution, while our U-Seg-Net preserves only one of them and hence reduces the total number of parameters. Additionally, in U-Seg-Net padding is used in the convolution layers to retain the spatial dimensions of the feature maps, which avoids the need to resize input images as is done in U-Net. We also use different numbers of kernels in the convolutional layers of the encoding and decoding paths, so U-Seg-Net is not a strictly symmetric architecture. The kernels used in U-Seg-Net all have the same size of 3×3. In the last convolutional layer of U-Seg-Net, the number of output feature maps is reduced to one, since our segmentation task is binary (foreground hippocampus vs. background), and cross-entropy is used as the objective function. Furthermore, dropout and batch normalization are also used to avoid overfitting. In our work, the 3D volume of a hippocampus is viewed as n planar slices of 2D images aligned along the third axis (z-axis), which are fed into a U-Seg-Net as n independent samples. The final segmentation of a 3D hippocampus is generated by stacking the 2D segmentation results inferred for the individual slices.
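The slice-based processing described above can be sketched as follows; `slice_model` is a hypothetical placeholder for a trained 2D U-Seg-Net, and the dummy model in the usage example is only for shape checking.

```python
import numpy as np

def segment_volume_slicewise(volume, slice_model, axis=2):
    """Treat a 3D volume as a stack of 2D slices along `axis`, run a 2D
    slice-based model (here a placeholder callable returning a per-pixel
    foreground probability map) on each slice independently, and restack
    the results into a 3D probability map."""
    slices = np.moveaxis(volume, axis, 0)                 # (n_slices, H, W)
    prob_slices = np.stack([slice_model(s) for s in slices], axis=0)
    return np.moveaxis(prob_slices, 0, axis)              # back to volume shape

# Toy usage: a dummy "model" that thresholds intensities.
volume = np.random.rand(26, 50, 40).astype(np.float32)
dummy_model = lambda s: (s > 0.5).astype(np.float32)
prob_map = segment_volume_slicewise(volume, dummy_model, axis=2)
assert prob_map.shape == volume.shape
```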

4.2.2 Ensemble-Net

Figure 4.4: Sagittal, coronal and axial views of hippocampi in an MRI scan.

Several justified concerns about our slice-based 3D segmentation (U-Seg-Net) exist. First, the segmentation results depend only on 2D images without considering their neighbors, so 3D contextual information is neglected by the network. Due to the shape of the hippocampus, extracted 2D slices have very different foreground-background ratios at different positions and thus may easily lead to bumpy and inconsistent results. Second, sampling planar slices from different directions/views (sagittal, coronal and axial views, shown in Fig. 4.4) has different effects on the final segmentation, since different types of shape information are contained along different views. For example, slices containing the anterior tips of the hippocampi sampled along the axial view contain few foreground pixels and tend to produce poor segmentation results; from the sagittal view, however, the same tips are part of clearly visible structures across several slices and are therefore easier to segment. To re-establish contextual information for each pixel, integrating multiple decisions from segmentation results along different views provides a solution and achieves a more 3D-aware segmentation of the whole structure. Ensemble learning is a justified and straightforward solution for multi-view integration. Although data from different views are related to a certain extent, each view carries its own unique 3D information. When multiple, independent, and diverse decisions are combined, random errors cancel each other out and correct decisions are reinforced.

Table 4.1: Three different configurations for our Ensemble-Net. The convolutional layer parameters are denoted as "conv[kernel size]-[number of kernels]".

Layer     Ensemble-Net 1   Ensemble-Net 2   Ensemble-Net 3
Layer 1   conv5-1          conv3-2          conv1-4
Layer 2   —                conv3-1          conv3-4
Layer 3   —                —                concat(Layer 1, Layer 2)
Layer 4   —                —                conv1-1

In our work, a multi-view ensemble ConvNet is proposed as a weighted combination of multiple U-Seg-Net decisions from different views. To utilize as much 3D context as possible for each pixel, nine different views are analyzed, including three orthogonal views (axial, coronal, sagittal) and six diagonal views, as shown in Fig. 4.3(b). In this work, we explore two types of ensemble learning. The first relies on a 3D CNN-based network, dubbed Ensemble-Net, which takes each view's stacked segmentation results (the segmentation probability maps) of a 3D hippocampus image and infers a refined segmentation probability map. To be more specific, as a fully convolutional net, the input of the Ensemble-Net is a 4D tensor in which the multiple views are concatenated channel-wise. It aims to learn a nonlinear combination of the probability maps from different views, while the spatial dimensions of the input are preserved through padding-enabled convolutions. Table 4.1 shows the three network configurations we used for our Ensemble-Net. Alternatively, we also implement pixel-wise majority voting as a comparison, in which all views are assigned the same weight.
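As an illustration, the Ensemble-Net 1 configuration of Table 4.1 (a single padded convolution with kernel size 5 and one output kernel) could be sketched as below in PyTorch. The use of 3D convolutions and the sigmoid output nonlinearity are assumptions made for this sketch; the actual models in this work are implemented in MXNet.

```python
import torch
import torch.nn as nn

class EnsembleNet1(nn.Module):
    """Sketch of Ensemble-Net 1: one padded convolution (kernel size 5, one
    output kernel) that fuses the nine per-view probability maps stacked
    channel-wise, preserving the spatial dimensions of the input."""

    def __init__(self, num_views=9):
        super().__init__()
        self.fuse = nn.Conv3d(num_views, 1, kernel_size=5, padding=2)

    def forward(self, view_probs):
        # view_probs: (batch, num_views, D, H, W), values in [0, 1]
        return torch.sigmoid(self.fuse(view_probs))   # refined probability map

fused = EnsembleNet1()(torch.rand(1, 9, 26, 50, 40))  # left-hippocampus crop size
```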

4.3 Data

In this work, baseline T1-weighted whole brain MRI images and their hippocampus segmentation masks from the ADNI database 4 are used. Considering that the hippocampus has a relatively fixed position and shape in the human brain, we use FIRST 5 as a coarse segmentation method to roughly detect the location of the hippocampus first, and then crop each hippocampus out using a fixed-size bounding box. This preprocessing step is straightforward but necessary for the subsequent refined segmentation using our proposed solutions, since training on the original whole-brain T1 images would be too computationally expensive. Although the Dice overlap between the hippocampus segmented by FIRST and the ground truth mask is only approximately 70%, it is accurate enough for the purpose of localizing and cropping the hippocampus. After this processing step, each hippocampus is cropped from the T1 image along with its neighboring context in a 3D box, with a size of 26×50×40 for the left side and 30×56×44 for the right. These two boxes are large enough to contain the hippocampus, as shown in Fig. 4.3(b).

4 http://www.loni.usc.edu/ADNI
5 https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST

4.4 Experimental settings

In each of the following experiments, the respective models are trained and evaluated independently for segmenting the left and right hippocampus. For each experiment, 10-fold cross-validation is utilized. There are in total 110 subjects, so 99 subjects are used for training and 11 for testing in each fold. All the models are implemented using the MXNet6 deep learning framework. AdaDelta is used as the optimizer, with its default hyperparameter settings: learning rate of 1, rho of 0.95, epsilon (fuzzy factor) of 1e-08, and learning rate decay over each update of 0. We observe the training loss convergence and empirically set the early stopping epoch to 50.

4.5 Evaluation measurements

Dice coefficient [97] and Hausdorff distance [98] are used to evaluate the quality of segmentation in this work. Dice coefficient is one of the most widely used similarity metrics for evaluating segmentation results. Given two sets, X and Y, Dice coefficient is defined as

Dice = 2|X ∩ Y| / (|X| + |Y|)    (4.1)

where |·| denotes the number of elements in the set, e.g. pixels or voxels. For segmentation, X and Y typically correspond to the ground truth binary mask and the predicted binary mask for the target of interest. In this work, for each hippocampus, we calculate the Dice coefficient directly in 3D space. The Hausdorff distance (HD) measures the distance between two subsets A and B in a metric space. It is defined as

6 https://mxnet.apache.org/

HD(A, B) = max(h(A, B), h(B, A))    (4.2)

where the directed Hausdorff distance h(·, ·) is calculated as

h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖    (4.3)

where ‖·‖ in (4.3) denotes a norm, which is the Euclidean distance in our work. Since HD is a maximum value, it is generally sensitive to outliers [98]. When evaluating segmentation results, it is a good indicator of the stability and consistency of subject-wise segmentation.
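For reference, both metrics can be computed directly from two binary masks as in the following NumPy sketch. The brute-force Hausdorff computation assumes small cropped volumes such as the hippocampus boxes used here, and the voxel spacing argument is an added convenience rather than part of the definitions above.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice = 2|X ∩ Y| / (|X| + |Y|) for two binary masks (Eq. 4.1)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def hausdorff_distance(pred, truth, spacing=1.0):
    """Symmetric Hausdorff distance between the foreground voxel sets of two
    binary masks (Eqs. 4.2-4.3), using the Euclidean norm. Brute force; fine
    for small crops."""
    a = np.argwhere(pred) * spacing
    b = np.argwhere(truth) * spacing
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    h_ab = d.min(axis=1).max()   # directed h(A, B)
    h_ba = d.min(axis=0).max()   # directed h(B, A)
    return max(h_ab, h_ba)

# Toy example on two overlapping 3D boxes.
x = np.zeros((10, 10, 10), bool); x[2:6, 2:6, 2:6] = True
y = np.zeros((10, 10, 10), bool); y[3:7, 3:7, 3:7] = True
print(dice_coefficient(x, y), hausdorff_distance(x, y))
```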

4.6 Experiment results

In this section, we evaluate the proposed solution, the multi-view Ensemble ConvNet, on the hippocampus segmentation task with different configurations. In addition, we also compare our model with some recently reported solutions that also use the ADNI database for hippocampus segmentation.

4.6.1 U-Seg-Net on nine views

Since our multi-view Ensemble-Net relies on U-Seg-Net for each view, we first present the segmentation performance of the 2D slice-based U-Seg-Net. The 3D Dice coefficients and Hausdorff distances of the segmentations based on the nine single views are shown in Table 4.2. Generally speaking, there are no significant differences among the results from these nine views. Specifically, in the coronal view, U-Seg-Net achieves the best Dice coefficients (87.36% and 87.62%) for the left and right hippocampus segmentation, respectively; in the Diag2 view, U-Seg-Net achieves the best HD (2.88) for the left hippocampus; and in the Diag3 view, U-Seg-Net achieves the best HD (2.55) for the right hippocampus.

Table 4.2: Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using nine different single views. (L: left hippocampus, R: right hippocampus).

          Sagittal      Coronal       Axial         Diag1         Diag2         Diag3         Diag4         Diag5         Diag6
Dice  L   86.64±2.51    87.36±2.06    86.23±2.99    86.79±2.71    86.73±2.90    86.61±3.46    85.99±2.67    85.85±2.21    86.60±2.57
      R   86.98±1.60    87.62±2.05    86.52±2.53    86.96±2.34    86.98±1.88    87.16±2.26    86.46±2.07    86.21±2.94    86.44±2.00
HD    L   3.21±0.68     3.56±0.68     3.25±0.76     2.97±0.67     2.88±0.59     3.01±1.05     3.16±0.87     3.36±0.71     2.86±0.44
      R   3.01±0.68     3.24±0.90     3.02±0.66     3.03±0.58     2.78±0.53     2.55±0.35     2.66±0.26     2.86±0.40     3.04±0.96

4.6.2 Multi-view Ensemble ConvNets

We further apply ensemble learning to combine the multi-view results obtained from the aforementioned individual U-Seg-Nets, to improve the overall segmentation of the hippocampus. In Table 4.3, we present results of applying majority voting on 3 views and 9 views respectively, as well as results of applying the three types of Ensemble ConvNets on 9 views. It is clear that all ensemble methods achieve better performance than any single view. By combining more views, marginal improvements are obtained. Likewise, by switching from the linear combination of majority voting to the learned nonlinear combination of the proposed ConvNet, we obtain further marginal improvements. However, when the complexity of the ConvNets increases, the benefit of the nonlinear combination is lost due to the limited amount of training data in this work. To further demonstrate the benefit of ensemble learning, Fig. 4.5 shows the surface rendering of one segmented hippocampus obtained by different methods: the individual U-Seg-Nets for the three orthogonal views, and majority voting (ensemble learning) over 3 views and 9 views, respectively. It can be observed that the segmentation results from the three orthogonal views have apparent deformations and abrupt areas in different places compared with the ground truth. The 3-view combination integrates 3D information from the three orthogonal views and improves the segmentation accuracy. The 9-view combination further refines the boundary and leads to a more accurate and smoother segmentation result.

Table 4.3: Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using different combination methods.

          3-View Vote    9-View Vote    Ensemble-Net 1   Ensemble-Net 2   Ensemble-Net 3
Dice  L   88.92 ± 1.88   89.20 ± 2.31   89.28 ± 1.98     89.01 ± 1.47     89.25 ± 1.54
      R   88.92 ± 1.59   89.35 ± 1.83   89.45 ± 1.73     89.05 ± 1.99     88.23 ± 2.45
HD    L   2.45 ± 0.55    2.32 ± 0.44    2.47 ± 0.42      2.32 ± 0.40      3.71 ± 0.92
      R   2.44 ± 0.39    2.31 ± 0.36    2.26 ± 0.25      2.39 ± 0.32      3.44 ± 0.94

Figure 4.5: Surface rendering of segmentation results of U-Seg-Net on the sagittal (Dice 91.26%), coronal (91.92%) and axial (91.65%) views, majority voting of 3 views (93.37%) and 9 views (93.90%), and the ground truth in one case. Please refer to the text for details.

In the end, we compared our proposed model with some recent work [27, 31, 32, 34, 99] that also focuses on hippocampus segmentation. [27] used patch-based methods that rely on the similarity of intensity content between patches and the target to obtain segmentation labels. [99] used a hierarchical SVM with automated feature selection. [32] used discriminative dictionary learning and sparse coding techniques, where dictionaries and classifiers are learned simultaneously from a set of brain atlases. [31] proposed multi-atlas patch-based label fusion methods with new techniques: enforcing each image patch to encode both local and semi-local image information, and adopting a hierarchical label fusion method that iteratively improves the labeling result. [34] also proposed a multi-atlas patch-based label fusion method, using progressively constructed dynamic dictionary learning to obtain optimal weighting coefficients for the label fusion framework. Please note that we do not intend to make a head-to-head quantitative comparison among these methods, as the studies were conducted and reported on different datasets, subjects, ground truth setups and evaluation metrics. Nevertheless, the best overall (Average) segmentation performance is demonstrated by our method, as shown in Table 4.4.

Table 4.4: Comparison of the proposed method with other state-of-the-art existing methods on hippocampus segmentation.

Method                                       Left             Right            Average
Morra et al. [99] (Ada-SVM)                  81.40            82.20            81.80
Coupe et al. [27] (Nonlocal Patch-based)     —                —                88.40
Tong et al. [32] (DDLS)                      87.20 (median)   87.20 (median)   87.20
Wu et al. [31] (Hierarchical Multi-Atlas)    —                —                88.50
Song et al. [34] (Progressive SPBL)          88.20            88.50            88.35
Our method (9-View Ensemble-Net 1)           89.48            89.46            89.47

4.7 Discussion

In this work, we present an automated hippocampus segmentation model based on ensembling multi-view convolutional networks. The proposed U-Seg-Net + Ensemble ConvNet is easy to train and achieves state-of-the-art performance in hippocampus segmentation. Compared with stacked 2D slice-based segmentation from a single view, the proposed model is able to re-establish 3D contextual information and leads to more 3D-aware segmentation results, as reflected by the smoother and more accurate segmented hippocampi. However, we also notice that the proposed nonlinear ensemble ConvNet achieves only marginal improvements over linear voting methods, and we believe this is because the benefit of the nonlinear combination degrades with the increased complexity of the ConvNets given the limited training data. In addition, exploring an end-to-end training strategy or a better structure for this model is another direction of our future efforts.

5 Sequential FCN for Hippocampus Segmentation

In this chapter, we present another proposed model, which combines long short-term memory (LSTM) with a fully convolutional network (FCN) for the purpose of improving both the accuracy and the slice-wise consistency of 3D medical image segmentation. Specifically, we adopt U-Seg-Net, introduced in the previous chapter, as the underlying 2D slice-based model to extract latent features, and use LSTM to propagate these features along the third dimension in order to compensate for the lost contextual information. This setup allows the most essential features of each slice to be shared and spread along the slice sequence. In the end, we also explore a three-view ensemble to supplement the individual segmentations with 3D neighborhood information and further boost the overall performance. We again demonstrate the effectiveness of the proposed model on the application of 3D MRI based hippocampus segmentation. The work presented in this chapter has been published in [100].

5.1 Method: U-Seg-Net + CLSTM

The multi-view Ensemble ConvNet integrates the separate 3D segmentation results of U-Seg-Net, aiming to re-establish the 3D contextual information and achieve better, refined results. Nevertheless, the two-stage processing in this approach might not fully recover the overall 3D spatial contextual information lost in U-Seg-Net, thus leading to a sub-optimal refinement. To address this issue, our second proposed solution attempts to directly instill inter-slice contextual information into the training of the slice-based U-Seg-Net. Specifically, we utilize the sequential learning model CLSTM to combine slice-based U-Seg-Nets in order to leverage inter-slice spatial dependency, as illustrated in Fig. 5.2. In this way, slice-wise 2D semantic information in U-Seg-Net is propagated to adjacent slices along the z-axis simultaneously with the training of U-Seg-Net.

The combination of CLSTM and U-Seg-Net can be rather straightforward. In practice, the recurrent units can be inserted at multiple places in U-Seg-Net, resulting in different architectures for our joint model. In a similar work [11], the authors combined two kU-Nets with an RNN by directly constructing the recurrent connections between the last feature maps of their kU-Nets. However, while this arrangement is the most straightforward solution, it might not be the optimal one. Considering that there exist different degrees of spatial information loss at different levels of spatial pooling or upsampling, using RNNs at multiple levels can propagate both higher-level semantic information and high-resolution location information between slices at the same time. More specifically, the feature maps close to the input and output have the most spatial and location information, while the feature maps in the middle part of U-Seg-Net carry the most semantic information. Given this, the recurrent connections can actually be placed in any layer of U-Seg-Net. In our work, we explore four combination options, as shown in Fig. 5.1:

(a) between the sequence of high resolution feature maps in the encoding path of the U-Seg-Net;

(b) between the high resolution feature maps in the decoding path of the U-Seg-Net;

(c) between the middle layer feature maps;

(d) at all three aforementioned places.

Fig. 5.2 shows the full architecture of our joint model (d) with 3 recurrent connections in U-Seg-Net. Since there is no natural starting direction along the z-axis for the 2D slices of a 3D hippocampus, sequence learning can be rolled out in both opposite directions, denoted z+ and z−, to avoid information loss. This is fulfilled by a bi-directional CLSTM, in which two CLSTMs process the same input sequence from opposite directions.

Figure 5.1: Joint models (a)–(d). The black arrows indicate the recurrent connections.

Their outputs at the corresponding neurons are concatenated into a new tensor as the cell output. Furthermore, multiple repeats (layers) of the bi-directional CLSTM can be stacked together to form a deep structure [11], with the output of the first one taken as the input of the second. In this work, we only explore one layer of bi-directional CLSTM, considering the size of our data set. Overall, the architecture of the joint U-Seg-Net and CLSTM model can be interpreted as follows: feature maps are first extracted independently from the 2D slices of the same volume with one convolutional operation, and are then fed into the 1st bi-directional CLSTM to propagate information through all the slices. After another two convolutional + pooling operations, the extracted latent feature maps go into the 2nd bi-directional CLSTM to continue propagating the condensed information between slices. After three further convolutional + upsampling operations, 3D consistency is ensured again through the 3rd bi-directional CLSTM before the final segmentation probability maps are generated. The whole model is end-to-end trainable.
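A bi-directional CLSTM pass over a sequence of slice feature maps can be sketched as follows in PyTorch. The compact cell, the channel counts and the channel-wise concatenation of the two sweeps are illustrative assumptions; the sketch mirrors the idea of BDC-LSTM [11] rather than reproducing this dissertation's MXNet implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # Compact CLSTM cell: all four gates from one convolution over [x, h].
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], 1)), 4, 1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        return torch.sigmoid(o) * torch.tanh(c), c

def run_direction(cell, slices):
    # slices: (n_slices, batch, channels, H, W), processed in order.
    n, b, _, hgt, wid = slices.shape
    h = torch.zeros(b, cell.hid_ch, hgt, wid)
    c = torch.zeros_like(h)
    outs = []
    for t in range(n):
        h, c = cell(slices[t], h, c)
        outs.append(h)
    return torch.stack(outs)

# Bi-directional CLSTM: one sweep along z+ and one along z-, concatenated
# channel-wise so every slice sees context from both directions.
feat = torch.randn(8, 1, 16, 24, 24)     # slice feature maps from the FCN part
fwd_cell, bwd_cell = ConvLSTMCell(16, 8), ConvLSTMCell(16, 8)
out_fwd = run_direction(fwd_cell, feat)
out_bwd = torch.flip(run_direction(bwd_cell, torch.flip(feat, dims=[0])), dims=[0])
bdclstm_out = torch.cat([out_fwd, out_bwd], dim=2)   # (8, 1, 16, 24, 24)
```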


Figure 5.2: The overall architecture of our proposed model: U-Seg-Net + CLSTM.

View Ensemble The joint U-Seg-Net + CLSTM solution can also benefit from ensemble learning over different views. In this work, we further combine the resulting probability maps of the joint model trained on 3 different views (axial, sagittal, and coronal). To limit the complexity of the overall model, only pixel-wise majority voting is used to perform the ensemble, as sketched below.
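Pixel-wise majority voting over the per-view probability maps can be sketched as follows; the function assumes the three probability maps have already been resampled into a common 3D space, and the threshold is an illustrative choice.

```python
import numpy as np

def majority_vote(prob_maps, threshold=0.5):
    """Pixel-wise majority voting over per-view probability maps.

    `prob_maps` is a list of 3D foreground-probability volumes (one per view).
    Each view casts a binary vote, and a voxel is labeled foreground when more
    than half of the views agree."""
    votes = np.stack([(p >= threshold).astype(np.uint8) for p in prob_maps])
    return (votes.sum(axis=0) > len(prob_maps) / 2).astype(np.uint8)

# Toy usage with three random "views" of the same 3D crop.
views = [np.random.rand(26, 50, 40) for _ in range(3)]
fused_mask = majority_vote(views)
```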

5.1.1 3D U-Seg-Net

Here, we introduce the 3D U-Seg-Net model, which is a direct extension of the 2D U-Seg-Net and will be used as a comparison for our proposed solutions in the following experiments. 3D U-Seg-Net is a fully convolutional neural network that operates directly on 3D volumes. Similar to the 2D U-Seg-Net structure, 3D U-Seg-Net also comprises encoding, decoding and skipping bridge layers, with all the 2D convolutional, pooling and deconvolutional operations replaced by 3D operations, as shown in Fig. 5.3. Theoretically, the effective receptive field of 3 layers of convolutional + pooling operations with kernel size 3×3×3 is 72×72×72. Considering that this is already large enough to capture the content of our 3D hippocampus input and that our data set is relatively limited, we do not explore deeper network structures and only use pooling and convolution operations three times, the same as in the 2D U-Seg-Net.

Figure 5.3: The architecture of 3D U-Seg-Net, built from 3×3×3 convolution + ReLU layers, 2×2×2 max pooling, 2×2×2 de-convolution, and copy (bridge) connections.

5.2 Data and experimental setting

The data we use are the same as in Chapter 4. We also evaluated our proposed model with 10-fold cross validation, specifically with 99 subjects used for training and 11 subjects for testing in each fold. For each cropped 3D hippocampus array of the training subjects, three sets of 2D image slices along the sagittal, coronal, and axial views were first extracted to feed into and train our U-Seg-Net + CLSTM model, respectively. Then, the three sets of output 3D probability maps were combined using majority voting. Note that we conducted the training and testing procedures independently for the left and right hippocampus. Evaluation measurements As in Chapter 4, the Dice coefficient [97] and Hausdorff distance [98] are used to evaluate the quality of segmentation in this work; we refer the reader to Chapter 4 for their detailed definitions. The 3D Dice ratio and Hausdorff distance were calculated subject-wise for each view and their combinations. Means and standard deviations averaged over the 10 folds are reported.

Optimization All the models are implemented using the MXNet7 deep learning framework. AdaDelta is used as the optimizer, with its default hyperparameter settings: learning rate of 1, rho of 0.95, epsilon (fuzzy factor) of 1e-08, and learning rate decay over each update of 0. Due to the different complexity of each model, convergence is achieved at different stages. We observe the training loss and empirically set the early stopping epoch differently for each type of model, as shown in Table 5.1. When training the joint U-Seg-Net + CLSTM models, we initialize the parameters of the U-Seg-Net part using the corresponding parameters from the trained individual U-Seg-Net. We found that this provides the joint model with a good starting point and leads to better performance than training from scratch.

Table 5.1: Early stopping epochs used for different models.

Model U-Seg-Net Ensemble-Net U-Seg-Net+CLSTMs CLSTMs 3D U-Seg-Net

Epoch 50 30 150 120 120

5.3 Experiment results

5.3.1 CLSTMs

The proposed joint U-Seg-Net + CLSTM solution relies on U-Seg-Net to extract the latent features and on CLSTM to propagate the slice-wise information. Thus, similarly to the previous section, we first show the hippocampus segmentation performance of using CLSTM alone. As in [80], we apply CLSTM to 3D hippocampus segmentation using 6 directions (forward and backward sweeps for each of the three orthogonal views). In each setting, the CLSTM processes the entire volume in one unique direction.

7 https://mxnet.apache.org/

Furthermore, a 2-direction ensemble for each orthogonal view and a 6-direction ensemble over all three orthogonal views are conducted using voxel-wise averaging. The whole procedure is demonstrated in Fig 5.4. For each CLSTM model used in the experiments, only two hidden layers are used, with 100 filters in the first layer and 1 filter in the second layer to generate the segmentation probability map. This setting was decided empirically, since we found that more filters (i.e., 100) generate better results than fewer filters (e.g., 40). The overall segmentation results are shown in Table 5.2. As shown, the 6-direction ensemble achieves better segmentation results than all single directions in both Dice ratio and HD. The same holds for the 2-direction combinations, which achieve better results than their corresponding single directions, except for a slightly worse Dice ratio than the backward direction in the coronal view. Generally speaking, the performance of CLSTM is worse than that of U-Seg-Net, since CLSTM is shallow and is not effective at extracting latent information. This serves as another reason for our proposed solution of combining FCN with CLSTM. In addition, the experimental results also validate the effectiveness of the bi-directional CLSTM and view ensemble strategies.

Figure 5.4: Procedure of CLSTM.

Table 5.2: Mean and standard deviation of Dice ratio (%) and HD for the hippocampus segmentation results using a 2-layer CLSTM and combinations from different directions.

                   forward        backward       2-direction    6-direction
Dice  L  sagittal  80.89 ± 4.08   83.21 ± 2.64   83.46 ± 3.91
         coronal   81.25 ± 2.87   84.13 ± 2.00   83.76 ± 2.53   86.04 ± 2.61
         axial     76.95 ± 4.51   77.83 ± 3.59   79.49 ± 3.78
      R  sagittal  80.76 ± 2.48   83.62 ± 3.31   83.78 ± 2.92
         coronal   80.57 ± 3.71   84.05 ± 1.63   83.45 ± 2.73   85.89 ± 2.75
         axial     74.77 ± 3.46   78.17 ± 2.40   78.79 ± 3.26
HD    L  sagittal  9.19 ± 2.41    4.54 ± 1.03    3.21 ± 0.83
         coronal   5.52 ± 1.50    4.07 ± 1.01    3.48 ± 0.85    2.83 ± 0.57
         axial     5.00 ± 1.16    4.18 ± 0.65    3.79 ± 0.69
      R  sagittal  6.03 ± 2.52    3.95 ± 1.42    3.15 ± 0.53
         coronal   6.34 ± 1.62    5.29 ± 1.10    3.83 ± 0.94    2.91 ± 0.65
         axial     5.71 ± 1.52    4.73 ± 1.11    3.92 ± 0.69

(The 6-direction values combine all six sweeps across the three orthogonal views.)

5.3.2 Joint Model of U-Seg-Net and CLSTMs

In Table 5.3 and Table 5.4, we show the Dice ratio and HD segmentation results for different configurations of the joint U-Seg-Net + CLSTM model, as well as the corresponding baseline performance of using U-Seg-Net alone. Specifically, 4 different choices of integration position are compared, including integration at the beginning (i.e., Fig. 5.1(a)), in the middle (i.e., Fig. 5.1(b)), at the end (i.e., Fig. 5.1(c)), and at all three positions. In general, all configurations of the joint U-Seg-Net + CLSTM model achieve better Dice ratio and HD than the corresponding U-Seg-Net model for both left and right hippocampus segmentation in each of the three orthogonal views, as well as in the view combination (majority voting). This indicates that the proposed joint solution is able to propagate the slice-wise contextual information during 2D slice-based segmentation and leads to better 3D segmentation results. We further visualize the improvement and benefit of adding CLSTM to the 2D slice-based U-Seg-Net in Fig. 5.5, which shows the surface rendering of one left hippocampus segmented by U-Seg-Net and by the joint U-Seg-Net + CLSTM model, as well as the ground truth. It is clear that significant 3D contextual information is lost during 2D slice-based segmentation, even leading to a broken part in the CA1 area when using the U-Seg-Net model alone, while adding CLSTM eliminates the uncertainty between slices and results in a more integrated and accurate 3D segmentation.

Figure 5.5: Surface rendering of segmentation results of U-Seg-Net (Dice 71.80%), U-Seg-Net + CLSTM (81.73%) and the ground truth of one case from the axial view. Please refer to the text for details.

In terms of the best integration position for the joint model, we can see that, for all three orthogonal views, the "Joint full" configuration at all three positions achieves better Dice ratios than the other configurations. One exception is the sagittal view for right hippocampus segmentation, where the "Joint end" configuration slightly outperforms "Joint full". In addition, for HD, "Joint full" is the best in the sagittal and coronal views for left hippocampus segmentation, while "Joint end" achieves better results in the axial view for the left hippocampus and in all three views for the right hippocampus.

Table 5.3: Mean and standard deviation of Dice ratio (%) for the hippocampus segmentation results obtained by placing the recurrent connections at different positions.

Dice               Sagittal       Coronal        Axial          Voting
L  U-Seg-Net       86.64 ± 2.51   87.36 ± 2.06   86.23 ± 2.99   88.92 ± 1.88
   Joint begin     87.48 ± 2.11   87.93 ± 2.08   86.91 ± 3.76   89.23 ± 1.98
   Joint middle    87.07 ± 2.10   87.78 ± 1.95   86.87 ± 4.02   89.12 ± 1.85
   Joint end       87.61 ± 2.33   87.95 ± 2.01   87.32 ± 4.10   89.21 ± 2.06
   Joint full      88.16 ± 2.14   88.10 ± 2.04   87.68 ± 3.67   89.33 ± 1.85
R  U-Seg-Net       86.98 ± 1.60   87.62 ± 2.05   86.52 ± 2.53   88.92 ± 1.59
   Joint begin     87.84 ± 1.66   88.16 ± 1.96   87.37 ± 2.44   89.36 ± 1.65
   Joint middle    87.50 ± 1.61   88.11 ± 1.69   87.12 ± 2.73   89.19 ± 1.55
   Joint end       88.13 ± 1.43   88.16 ± 1.92   87.70 ± 2.40   89.37 ± 1.53
   Joint full      88.07 ± 1.85   88.30 ± 1.75   88.43 ± 2.08   89.47 ± 1.69

Regarding the results of combining views with majority voting, the "Joint full" configuration at all three positions has the best overall performance for both right and left hippocampus segmentation. Overall, these observations to a certain extent support our hypothesis that 3D contextual information is lost at different levels during 2D slice-based segmentation with U-Seg-Net, and that using recurrent connections to propagate 2D slice-wise latent features at different levels of the FCN helps restore the 3D information lying in adjacent slices to our 2D segmentation model U-Seg-Net.

Table 5.4: Mean and standard deviations of Hausdorff distances for the hippocampus segmentation results obtained by putting the recurrent connection at different positions.

HD Sagittal Coronal Axial Voting

L   U-Seg-Net      3.21 ± 0.68   3.56 ± 0.68   3.25 ± 0.76   2.45 ± 0.55
    Joint begin    3.05 ± 0.64   3.14 ± 0.72   2.98 ± 0.73   2.38 ± 0.44
    Joint middle   2.83 ± 0.50   3.04 ± 0.81   2.90 ± 0.76   2.40 ± 0.54
    Joint end      2.64 ± 0.45   2.97 ± 0.79   2.74 ± 0.84   2.34 ± 0.46
    Joint full     2.61 ± 0.58   2.79 ± 0.73   2.74 ± 1.08   2.24 ± 0.39

R   U-Seg-Net      3.01 ± 0.68   3.24 ± 0.90   3.02 ± 0.66   2.44 ± 0.39
    Joint begin    2.72 ± 0.46   2.85 ± 0.52   2.86 ± 0.85   2.31 ± 0.20
    Joint middle   2.89 ± 0.78   2.86 ± 0.61   3.03 ± 1.22   2.28 ± 0.22
    Joint end      2.63 ± 0.72   2.55 ± 0.41   2.64 ± 0.73   2.27 ± 0.28
    Joint full     2.93 ± 0.89   2.72 ± 0.67   2.96 ± 0.90   2.15 ± 0.22

5.3.3 Comparison with other methods

In this section, we first show the segmentation results of 3D U-Seg-Net in Table 5.5. 3D U-Seg-Net directly takes a 3D volume as input and outputs 3D segmentation results, so the 3D contextual information should be preserved throughout. However, the performance of 3D U-Seg-Net is slightly worse than that of our two proposed 2D slice-based solutions, the multi-view ensemble ConvNet and the joint 2D U-Seg-Net + CLSTM, for both left and right hippocampus segmentation. We believe this is partially because the power of 3D U-Seg-Net is not fully realized, given its complexity and the limited training samples. This, on the other hand, may serve to clarify the value of our two proposed solutions, especially for the many 3D medical image segmentation tasks that suffer from limited samples. We can see from Fig. 5.6 that, even though 3D U-Seg-Net captures the overall shape of the hippocampus rather accurately, it still generates some critical outliers.

Table 5.5: Mean and standard deviation of Dice ratio (%) and Hausdorff distance for the hippocampus segmentation results using 3D U-Seg-Net with different filter size settings.

kernel size 3×3 5×5 7×7

Dice   L   88.67 ± 1.95   87.98 ± 2.03   86.97 ± 1.79
       R   88.24 ± 1.91   87.49 ± 2.34   87.33 ± 1.50

HD     L   2.61 ± 0.69    2.44 ± 0.45    2.85 ± 0.65
       R   2.46 ± 0.57    2.54 ± 0.63    2.82 ± 0.78


Figure 5.6: Surface rendering of the segmentation result of 3D U-Seg-Net and the ground truth of one case. (The Dice ratio and HD for the obtained segmentation are 91.35% and 12.57, respectively.) Please refer to the text for details.

Furthermore, in Table 5.6, we compare our two proposed models with several recent segmentation models [27, 31, 32, 34, 99] focused on automatic hippocampus segmentation. Among these works, our two proposed models achieve the highest Dice ratio for both left and right hippocampus segmentation. Note that we are not aiming at a direct, head-to-head quantitative comparison, since different datasets and experimental setups are used. Thus, a higher Dice ratio or better results relative to a competing solution should be interpreted as indirect evidence of model efficacy rather than as proof of superiority in a head-to-head competition [101].

Table 5.6: Comparison of the proposed method with other state-of-the-art methods on hippocampus segmentation (in 3D Dice ratios (%)).

Method   Left   Right   Average

Morra et al. [99]    Ada-SVM                    81.40            82.20            81.80
Coupe et al. [27]    Nonlocal Patch-based       —                —                88.40
Tong et al. [32]     DDLS                       87.20 (median)   87.20 (median)   87.20
Wu et al. [31]       Hierarchical Multi-Atlas   —                —                88.50
Song et al. [34]     Progressive SPBL           88.20            88.50            88.35
Our method           Joint model full           89.33            89.47            89.40
Our method           Ensem-Net1 on 9-view       89.28            89.45            89.36

5.4 Discussion

Our proposed solutions target the issue of 3D contextual information loss in most 2D slice-based methods. Specifically, in this chapter we propose a jointly end-to-end trainable model that relies on a fully convolutional neural network to extract latent semantic information and integrates sequence learning to propagate slice-wise information between neighboring slices. Different from the model in the last chapter, this model directly preserves the 3D contextual information during learning. Through the application of hippocampus segmentation, we demonstrate the efficacy of our solution and also show its advantages under limited training samples, compared with 3D FCN solutions.

6 Application on Multi-class Brain Tumor Segmentation

In this chapter, we further demonstrate the effectiveness of our proposed models by applying them to another important clinical task, brain tumor segmentation. Technically, our models are extended from binary hippocampus segmentation toward multi-class brain tumor (glioma) segmentation using multi-modality 3D MRIs. Compared with hippocampus segmentation, accurate segmentation of gliomas from multi-modality MRIs is more challenging, not only because it involves multiple sources of semantic information, but also because the shape and location variability of gliomas among patients is very high. In the following sections, we first briefly introduce the clinical background of glioma along with previous work in the area. We then illustrate the configurations of the models we use to tackle the task, including both the multi-view ensemble net and the joint U-Seg-Net + CLSTM net. In addition, we further extend our U-Seg-Net + CLSTM model with squeeze-and-excitation blocks in order to fully explore the multi-contrast information from different modalities. Experimental settings are provided, including the dataset, implementation details, and evaluation measures. In the end, we present the experimental results, compare our models with other state-of-the-art solutions, and conclude this chapter with a discussion.

6.1 Motivation: Glioma segmentation

Although brain tumors are very rare and account for only 3% of new cancer diagnoses worldwide, they are considered one of the most lethal cancers and are associated with significant morbidity and mortality [102]. According to their origin, brain tumors can be categorized into two groups: primary tumors, which originate from brain tissue cells, and metastatic tumors, whose cells become cancerous elsewhere and then spread into the brain. Among the different types of brain tumors, gliomas are the most common, accounting for 30% of all brain/CNS tumors and 80% of all malignant brain tumors [103]. These tumors arise from the glial cells that support nerve cells, and they tend to grow aggressively and invade the central nervous system (CNS) [14]. Based on their growth rate, gliomas can be classified as low-grade gliomas (LGG), including astrocytomas and oligodendrogliomas, and high-grade gliomas (HGG), i.e., glioblastoma multiforme (GBM), with the latter being more aggressive and infiltrative than the former. In general, patients diagnosed with LGG typically have a life expectancy of several years, while for those with HGG the median survival is less than two years, and only 5% survive five years after diagnosis [104]. Chemotherapy, radiotherapy and surgery are the most common treatment techniques for gliomas, usually adopted in combination [105]. Even though the overall survival of GBM patients has increased in recent years with the introduction of novel chemotherapeutic treatments, better surgical techniques, and more extensive treatment strategies [106], early diagnosis of gliomas still plays the most important role in improving treatment options and significantly affects the survival rate. Considering that gliomas typically contain various heterogeneous sub-regions and exhibit highly variable clinical prognosis, multi-modal MR scans are regarded as the standard imaging technique among the different options. Generated by altering excitation and repetition times during image acquisition, the different multi-modal MRI scans produce different types of tissue contrast with different intensity profiles, enabling the depiction of sub-structural information of tumors and reflecting differences in tumor biology [107]. It is also evident that:

Quantitative analysis of imaging feature extracted from multi-modal MRI, ..., through advanced computational algorithms, leads to advanced image-based tumor phenotyping. Such phenotyping may enable assessment of reflected biological processes and assist in surgical and treatment planning. Furthermore, its correlation with molecular characteristics established radiogenomic research, leading to improved predictive, prognostic and diagnostic imaging biomarker [108].

Therefore, accurate and reproducible quantification of the various tumor sub-regions is the key requirement for such advanced image-based phenotyping, which may potentially improve the early diagnosis, the monitoring of progression, and the evaluation of treatment for gliomas.

Figure 6.1: Four tumor substructures, edema (yellow), non-enhancing solid core (red), necrotic/cystic core (green), and enhancing core (blue), reflected by multi-modal MRI: (A) T2-FLAIR, (B) T2, (C) T1-Gd, (D) overall combination [14].

A glioma is usually composed of four types of tissue: necrosis, edema, non-enhancing and enhancing tissues [109], as shown in Fig. 6.1. In clinical practice, four standard MR imaging modalities are often used for mapping tumor-induced tissue changes in these tissues: T1-weighted MRI (T1), T2-weighted MRI (T2), T1-weighted MRI with gadolinium contrast enhancement (T1-Gd) and Fluid Attenuated Inversion Recovery (FLAIR), as shown in Fig. 6.2. Generally, T1 images are used to distinguish abnormal from healthy tissues, while T2 images are further used to delineate the edema region, which produces a high-intensity signal (bright core) on the image. Due to the contrast agent (gadolinium ions) accumulated in the active cell region of the tumor tissue, the tumor border can be distinguished in T1-Gd images. In contrast, necrotic cells do not interact with the contrast agent and therefore generate a hypo-intense signal in T1-Gd images, which makes them distinguishable from active cells. In FLAIR images, the signal of water molecules is suppressed, which allows a clearer visualization of the cerebral edema regions and their separation from the cerebrospinal fluid (CSF). Overall, these four modalities complement each other by revealing different types of biological information, and together they constitute a 4-dimensional (4D) contextual volume describing the whole tumor along with its sub-regions.

Figure 6.2: Four different MRI modalities showing a high-grade glioma, each enhancing different subregions of the tumor. From left: T1, T1-Gd, T2, and FLAIR [15].

As in hippocampus segmentation, manual segmentation is still considered the gold standard for brain tumor segmentation. It requires radiologists to process the multi-modal information presented in the 4D MRI volumes together with anatomical and physiological prior knowledge of the brain and gliomas. This is obviously time consuming and prone to inter- and intra-rater variability [110]. Therefore, automatic and accurate tumor segmentation on multi-modal MRI is highly desirable; it can provide robust and quantitative measurements of tumor extent and greatly aid the clinical management of brain tumors.

6.1.1 Prior work

Automatic glioma segmentation is a very challenging task for the following reasons [15]: 1) the size, shape and location of gliomas can vary dramatically from patient to patient, unlike in hippocampus segmentation, where prior shape and location information can be exploited; 2) the boundaries between the tumor and healthy tissue are often irregular and fuzzy with discontinuities, let alone the sub-region boundaries within the tumor; 3) multi-modal MRI scans add another level of complexity to the information processing; and 4) intensity inhomogeneity across different devices and scan protocols poses additional difficulties. Over the past two decades there has been strong research interest in automatic brain tumor segmentation [15, 24], with a recent focus shifting toward sub-region tumor segmentation from multi-modal MRI. Methodologically, many earlier state-of-the-art algorithms for brain tumor segmentation directly adopt methods first proposed for other brain tissue segmentation tasks, e.g., white matter lesion segmentation [111]. Broadly, these early methods can be classified as generative or discriminative [14]. In generative models, explicit probabilistic models are constructed using detailed prior information on the appearance and spatial distribution of different brain tissue types, and tumor segmentation is then modeled as an outlier detection procedure against the constructed atlases of healthy tissue [112–115]. In spite of good generalization, generative models typically require significant effort to transform an arbitrary semantic interpretation of the tumor images into appropriate probabilistic models, which might not even be feasible considering the complexity of the task. On the contrary, discriminative solutions attempt to directly differentiate the appearance of lesions from other tissues by deriving useful image characteristics from the labeled training images [14]. Image artifacts as well as intensity and shape variations significantly affect the efficacy of such methods, so massive amounts of training data are required to reduce their impact, and great effort is needed in engineering image features [116–118] and classification algorithms [116, 119, 120]. For more details on these early methods, we refer the reader to [14, 15, 22, 24, 121]. More recently, with the emergence of deep learning methods and the availability of large public datasets, such as the well-accepted BRATS benchmark, there has been significantly increased research interest in deep learning based automatic brain tumor segmentation. The majority of these methods rely on the feature engineering power of convolutional neural networks (CNN) to automatically learn representative, complex features for both healthy brain tissue and tumor tissue from multi-modal MR images, and they have achieved significantly better performance than earlier methods. Generally adapted from existing state-of-the-art CNN models originally proposed for other computer vision tasks, these methods include both 2D CNN and 3D CNN work [72, 122–124]. As discussed in Section 1.2, similar advantages and disadvantages apply to them. Additionally, owing to the multi-modality nature of the data, most of the above CNN tumor segmentation methods adopt patch-wise or sub-volume-wise training because of the limitation of GPU memory. Thus, some recent efforts [123, 124] focus on bringing back local appearance and spatial consistency.
In particular, they integrated Conditional Random Fields (CRF) into the network to enforce local consistency of the segmentation results, either as a post-processing step as in [125] or formulated as a whole as in [126]. Although these methods share some similarities with ours, their purpose is to remedy the local inconsistency resulting from patch-based solutions, and the CRF technique they use is generally formulated as fully connected layers, which is not as efficient as our adopted LSTM solution.

6.2 Proposed models

Figure 6.3: The architecture of the U-Seg-Net + CLSTM network for brain tumor segmentation.

Since each of the sub-regions in a glioma contains unique tumor biology information and is valuable for diagnosis, our task at hand becomes a multi-class segmentation problem. In addition, the multi-modality MRI input is essentially a 4D semantic volume. Herein, we adopt the aforementioned multi-view ensemble ConvNets and U-Seg-Net + CLSTM models, with the necessary modifications to accommodate these new requirements. Specifically, for the multi-view model, 2D segmentation based on U-Seg-Net is performed on the three views (sagittal, coronal and axial) separately, and the resulting segmentation probability maps are fused by majority voting. Different from the previous work, the input to U-Seg-Net is a multi-channel 2D slice (similar to an RGB image), and the output probability maps of the last layer are also multi-channel, so that an independent binary segmentation mask is produced for each sub-area. For the U-Seg-Net + CLSTM model, we select the configuration with recurrent connections between the middle-layer feature maps of U-Seg-Net, shown in Fig. 5.1(c). This choice takes into consideration the overall model complexity and size under the limitation of GPU RAM. The expanded full architecture of U-Seg-Net + CLSTM is shown in Fig. 6.3.
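As a concrete illustration of the two modifications just described (multi-modal channels in, per-view label maps fused by majority voting), here is a minimal NumPy sketch. The array shapes, the slicing axis, and the function names stack_modalities and majority_vote are assumptions for illustration, not the exact pipeline code; the per-view predictions are assumed to be already resampled into a common 3D orientation.

```python
# Illustrative sketch (NumPy); names, shapes and axes are assumptions.
import numpy as np

def stack_modalities(t1, t1gd, t2, flair, p, axis=2):
    """Extract slice p along one view and stack the four modalities as channels."""
    slices = [np.take(v, p, axis=axis) for v in (t1, t1gd, t2, flair)]
    return np.stack(slices, axis=0)                      # (4, H, W) multi-channel 2D input

def majority_vote(label_maps):
    """Fuse per-view 3D label maps (list of int arrays, same shape) by majority voting."""
    stacked = np.stack(label_maps, axis=0)               # (views, D, H, W)
    n_classes = int(stacked.max()) + 1
    votes = np.stack([(stacked == k).sum(axis=0) for k in range(n_classes)], axis=0)
    return votes.argmax(axis=0)                          # per-voxel winning class

# Example: one multi-channel slice and three hypothetical single-view 4-class predictions.
t1, t1gd, t2, flair = (np.random.rand(240, 240, 155) for _ in range(4))
x = stack_modalities(t1, t1gd, t2, flair, p=60)          # -> (4, 240, 240)
preds = [np.random.randint(0, 4, (155, 240, 240)) for _ in range(3)]
fused = majority_vote(preds)                             # -> (155, 240, 240)
```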

6.2.1 U-Seg-Net+CLSTM with Squeeze-and-Excitation Unit

In our U-Seg-Net + CLSTM model, the multiple modalities of image slices are treated equally and fed into the model as independent channels. This so-called early concatenation method for fusing multi-modal information relies on the model itself to learn the underlying, potentially nonlinear relationships between different feature channels. This process can be modeled as follows. Given the typical convolution operation denoted as

$U = \mathbf{F}_{conv}(X)$, with input $X \in \mathbb{R}^{H' \times W' \times C'}$, output $U \in \mathbb{R}^{H \times W \times C}$, and the learned filters $V = [v_1, v_2, \ldots, v_C]$, we can decompose the output feature tensor of $\mathbf{F}_{conv}$ as $U = [u_1, u_2, \ldots, u_C]$, where

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s.$$

This indicates that, for multi-channel feature input, the channel-wise convolutional operation is identical to conducting a 2D spatial convolution separately on each channel and then summing the results over all channels. In this way, the channel dependencies are implicitly modeled, but only within a local receptive field. The early concatenation of different modalities corresponds to this process, so we are concerned about its efficiency and about how sensitive the U-Seg-Net + CLSTM model remains to the more informative features. To explicitly model the interdependencies between different channels of the feature maps in a more global sense, the squeeze-and-excitation network [16] was proposed, which uses a gating mechanism to re-weight the channels of the feature maps according to global channel statistics. This procedure selectively enhances useful features and suppresses less useful ones.
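The decomposition above can be checked numerically: a convolution over a multi-channel input equals the sum of per-channel 2D convolutions. A small PyTorch check (illustrative only; the tensor sizes are arbitrary):

```python
# Numerical check of u_c = sum_s v_c^s * x^s (PyTorch used for illustration).
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 16, 16)        # C' = 4 input channels (e.g., four modalities)
v = torch.randn(1, 4, 3, 3)          # one output filter v_c with a 3x3 kernel per channel

u_full = F.conv2d(x, v)                                   # channel-wise convolution u_c
u_sum = sum(F.conv2d(x[:, s:s + 1], v[:, s:s + 1])         # per-channel 2D convolutions, summed
            for s in range(x.shape[1]))

print(torch.allclose(u_full, u_sum, atol=1e-5))            # True
```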

Figure 6.4: Squeeze-and-excitation block [16]. $F_{sq}$ and $F_{ex}$ are the squeeze and excitation operations; $F_{scale}$ is the feature map recalibration operation.

The reweighting is achieved by learning a gating layer that controls how much information from each channel is explicitly utilized. This mechanism is fulfilled through two main steps, Squeeze and Excitation (SE). The detailed SE block is shown in Fig. 6.4 and contains:

• Squeeze: a channel-wise statistics aggregation step. The squeeze operation applies global average pooling to the input feature maps to generate channel-wise statistics over the spatial dimensions H × W:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

• Excitation: a feature recalibration step. The excitation operation explicitly applies a different weight to each channel of the feature maps, with the weights derived from the global per-channel statistics and adjusted by the learned parameters $W = (W_1, W_2)$. The final scaling factor is defined as

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \, \delta(W_1 z))$$

In practice, the excitation operation uses two fully connected (FC) layers with non-linear activations to generate the channel-wise activations: $W_1$ parameterizes the first FC dimensionality-reduction layer, which is followed by a ReLU (denoted $\delta$), and $W_2$ parameterizes the second FC dimensionality-increasing layer; $\sigma$ is the sigmoid gating function.

The final weighted feature map $\tilde{X}$ is obtained by rescaling each input feature map $u_c$ with its channel-wise activation $s_c$:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

The SE unit can be used as a routine network block and inserted after any non-linear activation that follows a convolution. In our tumor segmentation, we use the SE block to better leverage information from the multi-modal input data. Although there are different design choices for extending U-Seg-Net + CLSTM with SE blocks, we insert the SE block after the first convolution of the network. Overall, our proposed SE-U-Seg-Net + CLSTM network is shown in Fig. 6.5.
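A minimal sketch of such an SE block, following the squeeze/excitation/scale steps described above, is given below. PyTorch is used for illustration, and the reduction ratio r = 4 and the class name SEBlock are assumptions rather than the exact configuration used in our network.

```python
# Minimal SE block sketch following the description above (PyTorch, illustrative only).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)    # W1: dimensionality reduction
        self.fc2 = nn.Linear(channels // r, channels)    # W2: dimensionality increase

    def forward(self, u):                                # u: (N, C, H, W) feature maps
        z = u.mean(dim=(2, 3))                           # squeeze: global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation: sigma(W2 delta(W1 z))
        return u * s.view(u.size(0), -1, 1, 1)           # scale: x~_c = s_c * u_c

# Example: recalibrate a 64-channel feature map, e.g., after the first convolution.
se = SEBlock(64)
y = se(torch.randn(2, 64, 32, 32))                       # same shape, channel-reweighted
```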

Figure 6.5: The SE-U-Seg-Net + CLSTM model.

6.3 Data

The data used in this section are obtained from the "Multimodal Brain Tumor Segmentation Challenge 2017" [14, 108]. The multi-contrast MRI scans of glioma patients in the data set were acquired from different institutions, and the corresponding ground truth sub-region labels were manually revised by experienced neuroradiologists. Overall, we used 210 HGG subjects in our experiments. For each subject, there are four modalities of 3D images: native T1, T2, FLAIR and post-contrast T1 (T1-Gd). The ground truth is the multi-class tumor mask covering the areas of enhancing tumor (label 4), peritumoral edema (label 2), and necrotic and non-enhancing tumor (label 1). Fig. 6.6 shows different views of a tumor in different modalities, as well as its ground truth segmentation mask for the three sub-regions. The enhancing tumor core (blue) is hyperintense in T1-Gd compared to T1; the necrotic and non-enhancing tumor (red) is hypo-intense in T1-Gd compared to T1; and all the labeled substructures are hyperintense in FLAIR. All data were homogenized/standardized into 3D volumes of size 240×240×155 with 1 mm isotropic resolution in all three dimensions, through a sequence of image processing steps including co-registration, image resampling, and skull-stripping. In order to fully utilize the GPU computation power, we further remove the redundant boundary padding by cropping a 160 × 192 × 152 volume from the center. In addition, a volume-wise min-max normalization is performed on each modality separately.
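The cropping and normalization steps described above can be sketched as follows (NumPy, illustrative only; the helper names center_crop and minmax_normalize are hypothetical):

```python
# Illustrative preprocessing sketch (NumPy); function names are assumptions.
import numpy as np

def center_crop(vol, target=(160, 192, 152)):
    """Crop a sub-volume of the target size from the center of a 3D scan."""
    starts = [(d - t) // 2 for d, t in zip(vol.shape, target)]
    slices = tuple(slice(s, s + t) for s, t in zip(starts, target))
    return vol[slices]

def minmax_normalize(vol, eps=1e-8):
    """Volume-wise min-max normalization, applied to each modality separately."""
    lo, hi = vol.min(), vol.max()
    return (vol - lo) / (hi - lo + eps)

# Example on one modality of a standardized 240 x 240 x 155 scan.
flair = np.random.rand(240, 240, 155).astype(np.float32)
flair = minmax_normalize(center_crop(flair))     # -> shape (160, 192, 152), values in [0, 1]
```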

6.4 Experimental settings

All experiments are conducted with 5-fold cross validation: in each training stage, a network is trained using 168 subjects and then applied to the remaining 42 subjects for performance evaluation. The results of all 5 folds are averaged to report the final overall performance. In the multi-view ensemble experiments, this procedure is repeated three times, with training and testing image slices extracted along the three different views, sagittal, coronal and axial. As before, all models are implemented using the MXNet framework. For U-Seg-Net, the batch size is set to 4 and the number of training epochs to 40. For U-Seg-Net + CLSTM, the batch size is set to 4, the sequence length to 8, and the number of training epochs to 60. As before, the numbers of training epochs are decided empirically by observing the convergence of the training loss. For all model configurations, we use Adadelta as the optimizer with its default settings: learning rate 1, decay rate 0.95, and epsilon 1e-8 (a small value to avoid division by zero). For U-Seg-Net, the kernel weights are initialized with random values uniformly sampled from [-0.7, 0.7], which is a default setting in MXNet. U-Seg-Net + CLSTM is trained using a two-step fine-tuning strategy, with the corresponding parameters initialized from the pre-trained U-Seg-Net weights.
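For concreteness, the 5-fold subject split and the quoted optimizer settings could be set up as in the following sketch. It uses scikit-learn and PyTorch purely for illustration; the actual experiments were run in MXNet, and the random seed and placeholder model are assumptions.

```python
# Sketch of the 5-fold split (168 train / 42 test) and the Adadelta settings quoted above.
import numpy as np
import torch
from sklearn.model_selection import KFold

subjects = np.arange(210)                              # 210 HGG subjects
for fold, (train_idx, test_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(subjects)):
    assert len(train_idx) == 168 and len(test_idx) == 42
    # ... build data loaders for this fold, train, evaluate, and store fold metrics ...

# Adadelta with the default settings reported in the text.
model = torch.nn.Conv2d(4, 4, 3)                       # stand-in for the real network
opt = torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.95, eps=1e-8)
```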

Figure 6.6: Slices extracted from different views and different modalities: (a) FLAIR, (b) T1, (c) T1-Gd, (d) T2, (e) ground truth (GT). Slices extracted from the sagittal, coronal and axial views are shown in the first, second and third rows, respectively.

6.5 Quantitative evaluations

As in the hippocampus segmentation experiments, two performance metrics are adopted to evaluate our models: Dice ratio and Hausdorff distance. The definitions of these measurements can be found in previous chapters. To facilitate comparison with state-of-the-art solutions from the challenge, we evaluate the two measures separately on the three provided glioma sub-regions: the enhancing tumor (ET, blue in Fig. 6.6(e)), the tumor core (TC, which includes the ET and the necrotic and non-enhancing tumor, blue and red in Fig. 6.6(e)), and the whole tumor (WT, indicating the complete extent of the disease, blue + red + green in Fig. 6.6(e)).
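The three evaluated sub-regions are simple unions of the provided labels, so they can be derived from a label map as in the sketch below (NumPy, illustrative; the helper names are hypothetical), together with the Dice ratio used throughout.

```python
# Sketch: derive ET / TC / WT masks from a label map (labels 1, 2, 4) and compute Dice.
import numpy as np

def subregion_masks(labels):
    et = labels == 4                          # enhancing tumor
    tc = (labels == 4) | (labels == 1)        # tumor core = ET + necrotic/non-enhancing
    wt = labels > 0                           # whole tumor = all labeled tumor tissue
    return et, tc, wt

def dice(pred, gt, eps=1e-8):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

# Example on hypothetical predicted and ground-truth label volumes.
pred = np.random.choice([0, 1, 2, 4], size=(160, 192, 152))
gt = np.random.choice([0, 1, 2, 4], size=(160, 192, 152))
for name, (p, g) in zip(("ET", "TC", "WT"),
                        zip(subregion_masks(pred), subregion_masks(gt))):
    print(name, round(dice(p, g), 4))
```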

6.6 Experiment results

The quantitative evaluations of tumor segmentation using our proposed multi-view ensemble ConvNet and U-Seg-Net + CLSTM models, measured in Dice ratio and Hausdorff distance (at the 95th percentile) [127], are shown in Table 6.1 and Table 6.2, respectively. From Table 6.1, it can be observed that, by integrating segmentation results from the three views, we obtain 1%-2% improvements in Dice ratio on all three sub-regions. More significant improvements are obtained for the HD measurements, where the ensemble of the three views decreases HD by 2-5 mm relative to the single-view results. Similar trends can be observed in Table 6.2 when comparing the single-view U-Seg-Net + CLSTM results with the ensembled 3-view U-Seg-Net + CLSTM results. Note that, for the enhancing tumor sub-region, HD is not reported. This is because the enhancing tumor occupies only a very small area in some patients, and our system failed to detect this sub-structure in 1 or 2 subjects, so the averaged HD between the predicted (empty) ET mask and the ground truth ET mask could not be calculated. Instead, the number of failure cases is shown in parentheses in both tables. We also note that, from the direct comparison of Dice ratios between the U-Seg-Net results in Table 6.1 and the corresponding U-Seg-Net + CLSTM results in Table 6.2, our proposed sequential learning strategy does improve the overall segmentation accuracy, which is reflected in higher Dice ratios.

Table 6.1: Mean and standard deviation of subject Dice ratios (%) and Hausdorff distances (mm) on the three glioma sub-regions using U-Seg-Net and its ensemble.

Dice ratio (%) Hausdorff distance (mm)

Enhancing Core Complete Enhancing Core Complete

Sagittal   77.47 ± 2.44   82.72 ± 2.23   83.24 ± 1.87   – (1)   7.04 ± 2.84   9.79 ± 2.41
Coronal    77.15 ± 1.21   82.08 ± 2.61   82.96 ± 1.84   – (2)   5.59 ± 1.93   6.78 ± 1.85
Axial      76.26 ± 1.68   81.52 ± 1.69   82.83 ± 2.45   – (2)   6.22 ± 4.76   7.83 ± 4.21
Ensemble   78.77 ± 1.99   83.53 ± 2.23   83.89 ± 2.57   – (1)   3.40 ± 0.70   4.61 ± 0.95

Table 6.2: Mean and standard deviation of subject Dice ratios (%) and Hausdorff distances (mm) for the tumor segmentation results using U-Seg-Net + CLSTM and its ensemble.

Dice ratio (%) Hausdorff distance (mm)

Enhancing Core Complete Enhancing Core Complete

Sagittal   78.04 ± 2.48   83.54 ± 2.46   84.45 ± 2.19   – (2)   4.94 ± 1.47   6.86 ± 0.62
Coronal    77.05 ± 2.01   82.33 ± 2.78   84.32 ± 1.48   – (2)   5.08 ± 1.32   6.24 ± 1.63
Axial      77.35 ± 2.44   83.19 ± 2.02   84.63 ± 2.15   – (1)   6.50 ± 2.24   7.83 ± 1.47
Ensemble   79.03 ± 2.29   84.12 ± 2.60   85.10 ± 2.31   – (2)   3.13 ± 0.65   4.77 ± 0.92

In addition, even though the two proposed models lead to similar average HD performance for both the single-view and ensemble results, the sequential models achieve much smaller variances, which indicates that the very "bizarre" outliers in the segmentation are significantly reduced when sequential learning is used to propagate contextual information along the slice-wise U-Seg-Net. This is further demonstrated by the axial view, which generally has the worst performance for both Dice and HD: there, the improvement resulting from adding sequential learning is the most significant. Overall, when comparing the ensemble results of the two models, our sequential + ensemble model outperforms the multi-view ConvNet + ensemble on both measurements. Beyond the quantitative evaluations, some segmentation results are shown in Fig. 6.7 to qualitatively compare the results of using the FCN component alone with those of combining U-Seg-Net and CLSTM. In general, both methods are able to capture and segment the main components within single slices. However, adding the CLSTM component further suppresses very small isolated "outliers" by maintaining inter-slice consistency, and it leads to more consistent predictions in ambiguous areas by leveraging the 3D context.

Figure 6.7: Slices extracted from single-view segmentation results using the U-Seg-Net and joint U-Seg-Net + CLSTM models. Columns from left to right: FLAIR, T1, T1-Gd, T2, ground truth (GT), U-Seg-Net, and the joint model. Sagittal, coronal and axial views are shown in the first, second and third rows, respectively. Images are cropped and resized for better visualization. Please refer to the text for details.

Finally, we compare our methods with several state-of-the-art solutions [128–131] that utilize similar strategies and also focus on tumor segmentation using the same data set, as shown in Table 6.3. For the method of [131], it should be pointed out that it uses the ground truth masks released in 2015, whereas all the other competing works and ours use the latest masks released in 2017. Thus, we are not aiming at a head-to-head direct comparison. Nevertheless, we can still see that our proposed models obtain rather competitive results, especially in the sub-regions of the enhancing tumor and the tumor core.

Table 6.3: Comparison with other methods

Dice ratio (%) Hausdorff distance (mm)

Enhancing Core Complete Enhancing Core Complete

Ours (U-Seg-Net + Ensemble)           78.77   83.53   83.89   —       3.40    4.61
Ours (U-Seg-Net + CLSTM + Ensemble)   79.03   84.12   85.10   —       3.13    4.77
Islam [128]                           68.90   76.10   87.60   12.94   12.36   9.82
Jesson [129]                          68.20   78.90   88.60   6.58    8.11    7.11
Kamnitsas [130]                       75.70   82.00   90.20   4.22    6.11    4.56
Tseng et al. [131]                    68.35   68.77   85.22   —       —       —

6.6.1 SE-U-Seg-Net + CLSTM

As an extra exploration of efficient fusion strategies for multi-modality information in tumor sub-region segmentation, we present here comparison results obtained by replacing the early concatenation strategy of the previous setting with the SE unit. In order to directly demonstrate the effect of the SE unit in a univariate setting, we compare the models on the sagittal view, which produced the best overall single-view performance in the previous experiments. Specifically, experiments are conducted to compare four models, U-Seg-Net vs. SE-U-Seg-Net and our joint model vs. the SE-joint model, on the same BraTS 2017 tumor data set and setting. The segmentation results are shown in Table 6.4. We observed performance improvements from equipping U-Seg-Net with SE units for all three substructures. In particular, the whole tumor segmentation gained a 1% improvement in Dice ratio. For the joint model, improvements can be observed in the Dice ratio for enhancing tumor and whole tumor segmentation, but they are not as significant as for U-Seg-Net. For both models, we could not achieve consistent improvements in Hausdorff distance, and we sometimes even observed degraded results, e.g., for SE-U-Seg-Net + CLSTM vs. U-Seg-Net + CLSTM.

Table 6.4: Mean and standard deviation of subject Dice ratios (%) and Hausdorff distances (mm) for the tumor segmentation results using the networks with the SE block on the sagittal view.

Dice ratio (%) Hausdorff distance (mm)

Enhancing Core Complete Enhancing Core Complete

U-Seg-Net      77.47 ± 2.44   82.72 ± 2.23   83.24 ± 1.87   – (2)   7.04 ± 2.84   9.79 ± 2.41
SE-U-Seg-Net   77.97 ± 1.94   83.17 ± 2.17   84.28 ± 2.06   – (2)   7.59 ± 3.45   8.93 ± 2.36

U-Seg-Net + CLSTM      78.04 ± 2.48   83.54 ± 2.46   84.45 ± 2.19   – (1)   4.94 ± 1.47   6.86 ± 0.62
SE-U-Seg-Net + CLSTM   78.26 ± 2.14   83.26 ± 2.64   85.22 ± 1.72   – (2)   6.35 ± 1.39   8.85 ± 1.00

6.7 Discussion

In this chapter, we applied our multi-view ensemble ConvNet and the joint U-Seg-Net + CLSTM model to the brain tumor segmentation task, which is a multi-class segmentation problem. The challenge of this task is the high variability in shape and location of brain tumors among different subjects, which is intrinsically different from the hippocampus segmentation task presented in the previous chapters. In addition, the information about the different sub-regions of the tumor comes from different MRI contrasts, adding another level of difficulty to the segmentation task. To fully utilize the whole contextual information as well as the multi-modality contrasts, a direct 3D FCN based solution is prohibitive for us due to the limitation of GPU RAM size. Nevertheless, this also serves as side evidence for the practicality of our proposed remedies for 2D slice-based segmentation. In a nutshell, both of our proposed solutions achieve competitively accurate segmentation performance on this task. Specifically for the U-Seg-Net + CLSTM model, we further improved the segmentation consistency, which is reflected in both the smaller variances of the HD measures and the reduced number of isolated outliers. Multi-view ensemble learning is again proven to be a powerful and plug-and-play tool to compensate for missing contextual information. Since the purpose of this application is to further demonstrate the effectiveness of our proposed solutions, we simply keep the basic architectures of the models similar to those used for hippocampus segmentation and did not fully explore other possibilities of model structure design from the very beginning. For example, we treat the different imaging modalities directly as image channels, which is the most straightforward approach but might not be the optimal fusion solution. For this purpose, we conducted an extra exploration of efficient feature fusion strategies for tumor sub-region segmentation. Specifically, we extended our models with the squeeze-and-excitation unit [16], which was originally proposed to explicitly model the interdependencies between different channels of feature maps in a more global sense. From our results, this extension does help improve our baseline 2D model U-Seg-Net, but it does not bring consistent gains for our proposed joint U-Seg-Net + CLSTM model. We think this is because the U-Seg-Net + CLSTM model is already complex, and adding the SE unit (extra complexity) further impedes its efficiency in processing multi-modal MRIs, especially considering that the size of our training data set is still limited. Investigating better ways of fusing different modalities is a future task toward further improving the overall performance. We are also interested in seeking a new loss function that better accounts for the imbalanced nature of the areas of the different sub-regions.

7 Conclusion and Discussions

Image segmentation is one of the most important image processing techniques for extracting middle-level features from images. In medical image computing, automatically and accurately partitioning a 3D image/volume into clinically meaningful structures is the key to enabling further clinical analysis, including surgery planning, precision treatment, and therapy evaluation. In recent years, deep learning has become a very powerful tool for solving many computer vision tasks, including segmentation. Even though one cannot say that deep learning outperforms traditional methods on all tasks, it has indeed demonstrated significant improvements on most of them, especially image classification and detection problems. This success rests on fast-growing GPU computing power and the availability of large training data sets. However, this is not the case in the area of medical image computing. Specifically, for the 3D medical image segmentation task it is very rare to have a data set on the same scale as those in other computer vision areas. To overcome the data scarcity in medical image segmentation, specifically in 3D, most recently proposed deep learning works for 3D medical image segmentation fall into two categories: works in the first category mainly rely on a 2D slice-based segmentation model and stack/combine the 2D segmentation results in a meaningful way to produce the 3D result; works in the second category seek a direct 3D solution by utilizing 3D kernels. In both categories, U-Net, a fully convolutional neural network, has gained wide popularity and is often utilized as the foundation model for further modification: it can either be used as the baseline 2D slice-based solution, or easily extended to segment 3D images directly by replacing the 2D convolution operations with 3D convolutions. Although several recent works have demonstrated rather inspiring results for 3D image segmentation, some issues remain. For the first category, although decomposing the 3D segmentation task into a series of 2D problems can take full advantage of the training data, it abandons the 3D contextual information and may lead to suboptimal overall 3D segmentation results. For the second category, solving 3D segmentation directly using 3D kernels inevitably increases the model complexity and makes the models difficult to train given the reality of data scarcity. In addition, using isotropic kernels to process 3D volumes that typically have anisotropic resolutions is another potential problem. In this dissertation, we provide some different perspectives on 3D image segmentation and attempt to solve the aforementioned issues in existing 3D image segmentation models. Also relying on the U-Net 2D segmentation model, we explored two different ways to instill 3D contextual information back into 2D slice-based models. In the first work, we present an automated stage-wise 3D segmentation model relying on ensembling multi-view 2D convolutional networks. The proposed U-Seg-Net + Ensemble-Net is easy to train and achieves state-of-the-art performance in hippocampus segmentation. With a minor modification of U-Net, the proposed U-Seg-Net model was trained and used to segment 2D slices sampled from different views separately.
Then, in the ensemble part (either through majority voting or using a CNN-based ensemble-net), the refined 3D segmentation results were derived from the independent results of the different views, with the purpose of canceling out random errors in each view and reinforcing the commonly correct decisions. The second work is an end-to-end trainable 3D segmentation network that integrates FCN and RNN models into a sequential FCN model. More specifically, we utilized CLSTM to propagate the inter-slice neighboring semantic information along a series of 2D U-Seg-Net models, in order to restore the 3D contextual information. In addition, different integration positions for the joint model and different training strategies were explored and compared. Our experimental results show that, without a significant increase in complexity (trainable parameters), the proposed model indeed achieves our design goals of bringing back the 3D contextual information and generating consistent and smooth segmentation results.

Application-wise, we demonstrate the efficacy of the two solutions through the application of MRI hippocampus segmentation. Using the MRI data from ADNI, our experiments show the advantages of our proposed models under limited training samples, compared with 3D FCN solutions as well as other recently published models. To the best of our knowledge, our proposed models achieve excellent overall segmentation accuracy and inter-slice consistency, comparable to state-of-the-art solutions. In addition, we further test our sequential FCN model on a multi-class segmentation task, brain tumor segmentation, and achieve promising results, which further verifies the robustness of our proposed model. One specific point to note: for the brain tumor segmentation task, the 12 GB GPU we utilized could not load even one volume when we attempted to use 3D U-Net for a comparison study. To a certain extent, this demonstrates the computational advantage and practical applicability of our proposed model. The proposed 3D segmentation models are inspired by specific issues and requirements in the area of medical image computing. However, they are both very generic methods that can potentially be applied to other computer vision tasks, such as road scene segmentation, video surveillance, and human gait analysis. Technically, we are also interested in exploring other, more compact sequential models to replace CLSTM; this would help improve the scalability of the proposed sequential FCN model. Exploring an end-to-end training strategy or a better structure for the ensemble model is another possible direction for future work.

References

[1] Yann LeCun, Yoshua Bengio, et al., "Convolutional networks for images, speech, and time series," The handbook of brain theory and neural networks, vol. 3361, no. 10, pp. 1995, 1995.
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, MIT Press, 2016.
[3] Vincent Dumoulin and Francesco Visin, "A guide to convolution arithmetic for deep learning," arXiv preprint arXiv:1603.07285, 2016.
[4] Donglai Wei, Bolei Zhou, Antonio Torralba, and William T Freeman, "mneuron: A plugin to visualize neurons from deep models," Massachusetts Institute of Technology, 2017.
[5] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," .
[6] Dan Ciresan, Alessandro Giusti, Luca M Gambardella, and Jürgen Schmidhuber, "Deep neural networks segment neuronal membranes in electron microscopy images," in Advances in neural information processing systems, 2012, pp. 2843–2851.
[7] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[8] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 12, pp. 2481–2495, 2017.
[9] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[10] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo, "Convolutional lstm network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
[11] Jianxu Chen, Lin Yang, Yizhe Zhang, Mark Alber, and Danny Z Chen, "Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation," in Advances in Neural Information Processing Systems, 2016, pp. 3036–3044.

[12] Olfa Ben Ahmed, Jenny Benois-Pineau, Michelle Allard, Gwénaëlle Catheline, Chokri Ben Amar, Alzheimer's Disease Neuroimaging Initiative, et al., "Recognition of alzheimer's disease and mild cognitive impairment with multimodal image-derived biomarkers and multiple kernel learning," Neurocomputing, vol. 220, pp. 98–110, 2017.
[13] Syu-Jyun Peng, Tomor Harnod, Jang-Zern Tsai, Ming-Dou Ker, Jun-Chern Chiou, Herming Chiueh, Chung-Yu Wu, and Yue-Loong Hsin, "Evaluation of subcortical grey matter abnormalities in patients with mri-negative cortical epilepsy determined through structural and tensor magnetic resonance imaging," BMC neurology, vol. 14, no. 1, pp. 104, 2014.
[14] Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al., "The multimodal brain tumor image segmentation benchmark (brats)," IEEE transactions on medical imaging, vol. 34, no. 10, pp. 1993, 2015.
[15] Ali Işın, Cem Direkoğlu, and Melike Şah, "Review of mri-based brain tumor image segmentation using deep learning methods," Procedia Computer Science, vol. 102, pp. 317–324, 2016.
[16] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," arXiv preprint arXiv:1709.01507, vol. 7, 2017.

[24] Stefan Bauer, Roland Wiest, Lutz-P Nolte, and Mauricio Reyes, "A survey of mri-based medical image analysis for brain tumor studies," Physics in Medicine & Biology, vol. 58, no. 13, pp. R97, 2013.

[25] Owen T Carmichael, Howard A Aizenstein, Simon W Davis, James T Becker, Paul M Thompson, Carolyn Cidis Meltzer, and Yanxi Liu, “Atlas-based hippocampus segmentation in alzheimer’s disease and mild cognitive impairment,” Neuroimage, vol. 27, no. 4, pp. 979–990, 2005.

[26] Fedde van der Lijn, Tom den Heijer, Monique MB Breteler, and Wiro J Niessen, “Hippocampus segmentation in mr images using atlas registration, voxel classification, and graph cuts,” Neuroimage, vol. 43, no. 4, pp. 708–720, 2008.

[27] Pierrick Coupé, José V Manjón, Vladimir Fonov, Jens Pruessner, Montserrat Robles, and D Louis Collins, "Patch-based segmentation using expert priors: Application to hippocampus and ventricle segmentation," NeuroImage, vol. 54, no. 2, pp. 940–954, 2011.

[28] François Rousseau, Piotr A Habas, and Colin Studholme, "A supervised patch-based approach for human brain labeling," IEEE transactions on medical imaging, vol. 30, no. 10, pp. 1852–1862, 2011.

[29] Hongzhi Wang, Jung W Suh, Sandhitsu R Das, John B Pluta, Caryne Craige, and Paul A Yushkevich, “Multi-atlas segmentation with joint label fusion,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 3, pp. 611– 623, 2013.

[30] Guorong Wu, Qian Wang, Daoqiang Zhang, Feiping Nie, Heng Huang, and Dinggang Shen, “A generative probability model of joint label fusion for multi-atlas based brain segmentation,” Medical image analysis, vol. 18, no. 6, pp. 881–890, 2014.

[31] Guorong Wu, Minjeong Kim, Gerard Sanroma, Qian Wang, Brent C Munsell, Dinggang Shen, Alzheimer’s Disease Neuroimaging Initiative, et al., “Hierarchical multi-atlas label fusion with multi-scale feature representation and label-specific patch partition,” NeuroImage, vol. 106, pp. 34–46, 2015.

[32] Tong Tong, Robin Wolz, Pierrick Coupe,´ Joseph V Hajnal, Daniel Rueckert, Alzheimer’s Disease Neuroimaging Initiative, et al., “Segmentation of mr images via discriminative dictionary learning and sparse coding: Application to hippocampus labeling,” NeuroImage, vol. 76, pp. 11–23, 2013.

[33] Yan Deng, Anand Rangarajan, and Baba C Vemuri, "Supervised learning for brain mr segmentation via fusion of partially labeled multiple atlases," in Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on. IEEE, 2016, pp. 633–637.

[34] Yantao Song, Guorong Wu, Quansen Sun, Khosro Bahrami, Chunming Li, and Dinggang Shen, “Progressive label fusion framework for multi-atlas segmentation by dictionary evolution,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 190–197.

[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.

[37] Alexander de Brebisson and Giovanni Montana, “Deep neural networks for anatomical brain segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 20–28.

[38] Pim Moeskops, Jelmer M Wolterink, Bas HM van der Velden, Kenneth GA Gilhuijs, Tim Leiner, Max A Viergever, and Ivana Išgum, "Deep learning for multi-task medical image segmentation in multiple modalities," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 478–486.

[39] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger, "3d u-net: learning dense volumetric segmentation from sparse annotation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 424–432.

[40] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571.

[41] Yann LeCun et al., “Generalization and network design strategies,” Connectionism in perspective, pp. 143–155, 1989.

[42] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.

[43] Ronald Newbold Bracewell and Ronald N Bracewell, The Fourier transform and its applications, vol. 31999, McGraw-Hill New York, 1986.

[44] Steven B Damelin and Willard Miller Jr, The mathematics of signal processing, vol. 48, Cambridge University Press, 2012.

[45] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 2009, pp. 248–255.

[46] Fisher Yu and Vladlen Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.

[47] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei, “Deformable convolutional networks,” CoRR, abs/1703.06211, vol. 1, no. 2, pp. 3, 2017.

[48] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[49] François Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint, pp. 1610–02357, 2017.

[50] Sebastian Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.

[51] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.

[52] Per Welander, Simon Karlsson, and Anders Eklund, “Generative adversarial networks for image-to-image translation on multi-contrast mr images-a comparison of cyclegan and unit,” arXiv preprint arXiv:1806.07777, 2018.

[53] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

[54] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu, “Exploring the limits of language modeling,” arXiv preprint arXiv:1602.02410, 2016.

[55] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya, “Multilingual language processing from bytes,” arXiv preprint arXiv:1512.00103, 2015.

[56] Daniel J Withey and Zoltan J Koles, "A review of medical image segmentation: methods and available software," International Journal of Bioelectromagnetism, vol. 10, no. 3, pp. 125–148, 2008.

[57] Young Won Lim and Sang Uk Lee, “On the color image segmentation algorithm based on the thresholding and the fuzzy c-means techniques,” Pattern recognition, vol. 23, no. 9, pp. 935–952, 1990.

[58] Chenyang Xu and Jerry L Prince, “Snakes, shapes, and gradient vector flow,” IEEE Transactions on image processing, vol. 7, no. 3, pp. 359–369, 1998.

[59] James C Bezdek, LO Hall, and L P Clarke, “Review of mr image segmentation techniques using pattern recognition,” Medical physics, vol. 20, no. 4, pp. 1033– 1048, 1993.

[60] John L Johnson and Mary Lou Padgett, “Pcnn models and applications,” IEEE transactions on neural networks, vol. 10, no. 3, pp. 480–498, 1999.

[61] Neeraj Sharma and Lalit M Aggarwal, “Automated medical image segmentation techniques,” Journal of medical physics/Association of Medical Physicists of India, vol. 35, no. 1, pp. 3, 2010.

[62] Mohd Ali Balafar, Abdul Rahman Ramli, M Iqbal Saripan, and Syamsiah Mashohor, “Review of brain mri image segmentation methods,” Artificial Intelligence Review, vol. 33, no. 3, pp. 261–274, 2010.

[63] Geoffrey E Hinton, “Learning multiple layers of representation,” Trends in cognitive sciences, vol. 11, no. 10, pp. 428–434, 2007.

[64] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[65] Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural computation, vol. 22, no. 12, pp. 3207–3220, 2010.

[66] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[67] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[68] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[69] Matthew D Zeiler and Rob Fergus, "Visualizing and understanding convolutional networks," in European conference on computer vision. Springer, 2014, pp. 818–833.

[70] Andrej Karpathy and Li Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.

[71] Baris Kayalibay, Grady Jensen, and Patrick van der Smagt, “Cnn-based segmentation of medical imaging data,” arXiv preprint arXiv:1701.03056, 2017.

[72] Mohammad Havaei, Axel Davy, David Warde-Farley, Antoine Biard, Aaron Courville, Yoshua Bengio, Chris Pal, Pierre-Marc Jodoin, and Hugo Larochelle, “Brain tumor segmentation with deep neural networks,” Medical image analysis, vol. 35, pp. 18–31, 2017.

[73] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," arXiv preprint arXiv:1606.00915, 2016.

[74] Fayao Liu, Guosheng Lin, and Chunhua Shen, “Crf learning with cnn features for image segmentation,” Pattern Recognition, vol. 48, no. 10, pp. 2983–2992, 2015.

[75] Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles, “Learning long-term dependencies in narx recurrent neural networks,” IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1329–1338, 1996.

[76] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.

[77] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[78] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.

[79] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber, “Multidimensional recurrent neural networks,” in Proceedings of the international conference on artificial neural networks, 2007.

[80] Marijn F Stollenga, Wonmin Byeon, Marcus Liwicki, and Juergen Schmidhuber, “Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation,” in Advances in Neural Information Processing Systems, 2015, pp. 2998–3006.

[81] Mohsen Fayyaz, Mohammad Hajizadeh Saffar, Mohammad Sabokrou, Mahmood Fathy, Reinhard Klette, and Fay Huang, “Stfcn: Spatio-temporal fcn for semantic video segmentation,” arXiv preprint arXiv:1608.05971, 2016.

[82] Sepehr Valipour, Mennatullah Siam, Martin Jagersand, and Nilanjan Ray, “Recurrent fully convolutional networks for video segmentation,” arXiv preprint arXiv:1606.00487, 2016.

[83] Yani Chen, Bibo Shi, Zhewei Wang, Pin Zhang, Charles D Smith, and Jundong Liu, “Hippocampus segmentation through multi-view ensemble convnets,” in Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on. IEEE, 2017, pp. 192–196.

[84] Nathan A DeCarolis and Amelia J Eisch, “Hippocampal neurogenesis as a target for the treatment of mental illness: a critical evaluation,” Neuropharmacology, vol. 58, no. 6, pp. 884–893, 2010.

[85] Azar Zandifar, Vladimir Fonov, Pierrick Coupé, Jens Pruessner, D Louis Collins, Alzheimer’s Disease Neuroimaging Initiative, et al., “A comparison of accurate automatic hippocampal segmentation methods,” NeuroImage, vol. 155, pp. 383–393, 2017.

[86] A Convit, MJ De Leon, C Tarshish, S De Santi, W Tsui, H Rusinek, and A George, “Specific hippocampal volume reductions in individuals at risk for alzheimer’s disease,” Neurobiology of aging, vol. 18, no. 2, pp. 131–138, 1997.

[87] Clifford R Jack, Ronald C Petersen, Yue Cheng Xu, Peter C O’Brien, Glenn E Smith, Robert J Ivnik, Bradley F Boeve, Stephen C Waring, Eric G Tangalos, and Emre Kokmen, “Prediction of ad with mri-based hippocampal volume in mild cognitive impairment,” Neurology, vol. 52, no. 7, pp. 1397–1397, 1999.

[88] Fernando Cendes, Frederick Andermann, Pierre Gloor, A Evans, M Jones-Gotman, C Watson, D Melanson, A Olivier, T Peters, I Lopes-Cendes, et al., “Mri volumetric measurement of amygdala and hippocampus in temporal lobe epilepsy,” Neurology, vol. 43, no. 4, pp. 719–719, 1993.

[89] Fernando Cendes, “Progressive hippocampal and extrahippocampal atrophy in drug resistant epilepsy,” Current opinion in neurology, vol. 18, no. 2, pp. 173–177, 2005.

[90] Michael D Nelson, Andrew J Saykin, Laura A Flashman, and Henry J Riordan, “Hippocampal volume reduction in schizophrenia as assessed by magnetic resonance imaging: a meta-analytic study,” Schizophrenia Research, vol. 24, no. 1, pp. 153, 1997.

[91] J Douglas Bremner, Penny Randall, Eric Vermetten, Lawrence Staib, Richard A Bronen, Carolyn Mazure, Sandi Capelli, Gregory McCarthy, Robert B Innis, and Dennis S Charney, “Magnetic resonance imaging-based measurement of hippocampal volume in posttraumatic stress disorder related to childhood physical and sexual abuse: a preliminary report,” Biological psychiatry, vol. 41, no. 1, pp. 23–32, 1997.

[92] Hilary P Blumberg, Joan Kaufman, Andrés Martin, Ronald Whiteman, Jane Hongyuan Zhang, John C Gore, Dennis S Charney, John H Krystal, and Bradley S Peterson, “Amygdala and hippocampal volumes in adolescents and adults with bipolar disorder,” Archives of general psychiatry, vol. 60, no. 12, pp. 1201–1208, 2003.

[93] Clifford R Jack Jr, Frederik Barkhof, Matt A Bernstein, Marc Cantillon, Patricia E Cole, Charles DeCarli, Bruno Dubois, Simon Duchesne, Nick C Fox, Giovanni B Frisoni, et al., “Steps to standardization and validation of hippocampal volumetry as a biomarker in clinical trials and diagnostic criterion for alzheimer’s disease,” Alzheimer’s & Dementia, vol. 7, no. 4, pp. 474–485, 2011.

[94] GB Frisoni, “Structural imaging in the clinical diagnosis of alzheimer’s disease: problems and tools,” 2001.

[95] Marie Chupin, A Romain Mukuna-Bantumbakulu, Dominique Hasboun, Eric Bardinet, Sylvain Baillet, Serge Kinkingnéhun, Louis Lemieux, Bruno Dubois, and Line Garnero, “Anatomically constrained region deformation for the automated segmentation of the hippocampus and the amygdala: Method and validation on controls and patients with alzheimer’s disease,” Neuroimage, vol. 34, no. 3, pp. 996–1019, 2007.

[96] Vanderson Dill, Alexandre Rosa Franco, and Márcio Sarroglia Pinho, “Automated methods for hippocampus segmentation: the evolution and a review of the state of the art,” Neuroinformatics, vol. 13, no. 2, pp. 133–150, 2015.

[97] Lee R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.

[98] Abdel Aziz Taha and Allan Hanbury, “Metrics for evaluating 3d medical image segmentation: analysis, selection, and tool,” BMC Medical Imaging, vol. 15, no. 29, 2015.

[99] Jonathan H Morra, Zhuowen Tu, Liana G Apostolova, Amity E Green, Arthur W Toga, and Paul M Thompson, “Comparison of adaboost and support vector machines for detecting alzheimer’s disease through automated hippocampal segmentation,” IEEE transactions on medical imaging, vol. 29, no. 1, pp. 30, 2010.

[100] Yani Chen, Bibo Shi, Zhewei Wang, Tao Sun, Charles D Smith, and Jundong Liu, “Accurate and consistent hippocampus segmentation through convolutional lstm and view ensemble,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2017, pp. 88–96.

[101] Bibo Shi, Yani Chen, Pin Zhang, Charles D Smith, Jundong Liu, Alzheimer’s Disease Neuroimaging Initiative, et al., “Nonlinear feature transformation and deep fusion for alzheimer’s disease staging analysis,” Pattern recognition, vol. 63, pp. 487–498, 2017.

[102] Marion Piñeros, Mónica S Sierra, M Isabel Izarzugaza, and David Forman, “Descriptive epidemiology of brain and central nervous system cancers in central and south america,” Cancer epidemiology, vol. 44, pp. S141–S149, 2016.

[103] McKinsey L Goodenberger and Robert B Jenkins, “Genetics of adult glioma,” Cancer genetics, vol. 205, no. 12, pp. 613–621, 2012.

[104] David N Louis, Arie Perry, Guido Reifenberger, Andreas Von Deimling, Dominique Figarella-Branger, Webster K Cavenee, Hiroko Ohgaki, Otmar D Wiestler, Paul Kleihues, and David W Ellison, “The 2016 world health organization classification of tumors of the central nervous system: a summary,” Acta neuropathologica, vol. 131, no. 6, pp. 803–820, 2016.

[105] S Bielack, D Carrle, PG Casali, and ESMO Guidelines Working Group, “Osteosarcoma: Esmo clinical recommendations for diagnosis, treatment and follow-up,” Annals of Oncology, vol. 20, no. suppl 4, pp. iv137–iv139, 2009.

[106] Nicole Porz, Stefan Bauer, Alessia Pica, Philippe Schucht, Jürgen Beck, Rajeev Kumar Verma, Johannes Slotboom, Mauricio Reyes, and Roland Wiest, “Multi-modal glioblastoma segmentation: man versus machine,” PloS one, vol. 9, no. 5, pp. e96873, 2014.

[107] Antonios Drevelegas and Nickolas Papanikolaou, “Imaging modalities in brain tumors,” in Imaging of Brain Tumors with Histological Correlations, pp. 13–33. Springer, 2011.

[108] Spyridon Bakas, Hamed Akbari, Aristeidis Sotiras, Michel Bilello, Martin Rozycki, Justin S Kirby, John B Freymann, Keyvan Farahani, and Christos Davatzikos, “Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features,” Scientific data, vol. 4, pp. 170117, 2017.

[109] Adriano Pinto, Sérgio Pereira, Deolinda Rasteiro, and Carlos A Silva, “Hierarchical brain tumour segmentation using extremely randomized trees,” Pattern Recognition, 2018.

[110] Duncan RR White, Alexander S Houston, William FD Sampson, and Graham P Wilkins, “Intra- and interoperator variations in region-of-interest drawing and their effect on the measurement of glomerular filtration rates,” Clinical nuclear medicine, vol. 24, no. 3, pp. 177–181, 1999.

[111] Martin Styner, Joohwi Lee, Brian Chin, M Chin, Olivier Commowick, and H Tran, “3d segmentation in the clinic: A grand challenge ii: Ms lesion segmentation,” 2008.

[112] Marcel Prastawa, Elizabeth Bullitt, Sean Ho, and Guido Gerig, “A brain tumor segmentation framework based on outlier detection,” Medical image analysis, vol. 8, no. 3, pp. 275–283, 2004.

[113] Meritxell Bach Cuadra, Claudio Pollo, Anton Bardera, Olivier Cuisenaire, J-G Villemure, and J-P Thiran, “Atlas-based segmentation of pathological mr brain images using a model of lesion growth,” IEEE transactions on medical imaging, vol. 23, no. 10, pp. 1301–1314, 2004.

[114] Bjoern H Menze, Koen Van Leemput, Danial Lashkari, Marc-André Weber, Nicholas Ayache, and Polina Golland, “A generative model for brain tumor segmentation in multi-modal images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2010, pp. 151–159.

[115] Ali Gooya, Kilian M Pohl, Michel Bilello, Luigi Cirillo, George Biros, Elias R Melhem, and Christos Davatzikos, “Glistr: glioma image segmentation and registration,” IEEE transactions on medical imaging, vol. 31, no. 10, pp. 1941–1954, 2012.

[116] Michael Goetz, Christian Weber, Josiah Bloecher, Bram Stieltjes, Hans-Peter Meinzer, and Klaus Maier-Hein, “Extremely randomized trees based brain tumor segmentation,” Proceeding of BRATS challenge-MICCAI, pp. 006–011, 2014.

[117] T Logeswari and M Karnan, “An improved implementation of brain tumor detection using segmentation based on soft computing,” Journal of Cancer Research and Experimental Oncology, vol. 2, no. 1, pp. 006–014, 2009.

[118] Jens Kleesiek, Armin Biller, Gregor Urban, U Kothe, Martin Bendszus, and F Hamprecht, “Ilastik for multi-modal brain tumor segmentation,” Proceedings MICCAI BraTS (Brain Tumor Segmentation Challenge), pp. 12–17, 2014.

[119] Su Ruan, Stéphane Lebonvallet, Abderrahim Merabet, and Jean-Marc Constans, “Tumor segmentation from a multispectral mri images by using support vector machine classification,” in Biomedical Imaging: From Nano to Macro, 2007. ISBI 2007. 4th IEEE International Symposium on. IEEE, 2007, pp. 1236–1239.

[120] Hongming Li and Yong Fan, “Label propagation with robust initialization for brain tumor segmentation,” in Biomedical Imaging (ISBI), 2012 9th IEEE International Symposium on. IEEE, 2012, pp. 1715–1718.

[121] Elsa D Angelini, Olivier Clatz, Emmanuel Mandonnet, Ender Konukoglu, Laurent Capelle, and Hugues Duffau, “Glioma dynamics and computational models: a review of segmentation, registration, and in silico growth algorithms and their clinical applications,” Current Medical Imaging Reviews, vol. 3, no. 4, pp. 262–276, 2007.

[122] Darko Zikic, Yani Ioannou, Matthew Brown, and Antonio Criminisi, “Segmentation of brain tumor tissues with convolutional neural networks,” Proceedings MICCAI-BRATS, pp. 36–39, 2014.

[123] Konstantinos Kamnitsas, Christian Ledig, Virginia FJ Newcombe, Joanna P Simpson, Andrew D Kane, David K Menon, Daniel Rueckert, and Ben Glocker, “Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation,” Medical image analysis, vol. 36, pp. 61–78, 2017.

[124] Xiaomei Zhao, Yihong Wu, Guidong Song, Zhenye Li, Yazhuo Zhang, and Yong Fan, “A deep learning model integrating fcnns and crfs for brain tumor segmentation,” Medical image analysis, vol. 43, pp. 98–111, 2018.

[125] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.

[126] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr, “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1529–1537.

[127] Daniel P. Huttenlocher, Gregory A. Klanderman, and William J Rucklidge, “Comparing images using the hausdorff distance,” IEEE Transactions on pattern analysis and machine intelligence, vol. 15, no. 9, pp. 850–863, 1993.

[128] Mobarakol Islam and Hongliang Ren, “Fully convolutional network with hypercolumn features for brain tumor segmentation,” in Proceedings of MICCAI workshop on Multimodal Brain Tumor Segmentation Challenge (BRATS), 2017.

[129] Andrew Jesson and Tal Arbel, “Brain tumor segmentation using a 3d fcn with multi-scale loss,” in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 392–402.

[130] Konstantinos Kamnitsas, Wenjia Bai, Enzo Ferrante, Steven McDonagh, Matthew Sinclair, Nick Pawlowski, Martin Rajchl, Matthew Lee, Bernhard Kainz, Daniel Rueckert, et al., “Ensembles of multiple models and architectures for robust brain tumour segmentation,” in International MICCAI Brainlesion Workshop. Springer, 2017, pp. 450–462.

[131] Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu, and Chung-Yang Huang, “Joint sequence learning and cross-modality convolution for 3d biomedical segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 3739–3746.
