
Using Convolutional Neural Network to Generate Neuro Image Template

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Songyue Qian,

Graduate Program in Department of Electrical and Computer Engineering

The Ohio State University

2019

Master’s Examination Committee:

Prof. Bradley D. Clymer, PhD, Co-Advisor

Assistant Professor Barbaros Selnur Erdal, PhD, Co-Advisor

© Copyright by

Songyue Qian

2019

Abstract

Machine learning techniques implemented by Convolutional Neural Networks (CNNs) have become the state-of-the-art approach for automatic neuro image classification because of their outstanding computing ability. The template model method, a technique used to produce a stable pattern from the positions of different anatomical shapes in the brain, is considered another automatic method that can determine whether a patient is healthy or not. However, most machine learning techniques focus only on detecting specific diseases and overlook the patient's overall health, while the template model's architecture may overlook the pattern of the entire brain. Therefore, a better automatic classification technique that focuses on the patient's health status is in demand. The purpose of this research is to design an efficient CNN architecture sensitive to normal case diagnosis instead of disease detection.

In this research, we propose the Hybrid CNN-Siamese Network (HCSNet), a CNN architecture for 2D neuro image classification. HCSNet combines Inception ResNet V2 and a Siamese network to achieve better performance than either method alone. In order to illustrate why this architecture performs well on neuro imaging, a comprehensive overview of existing techniques for CNN analysis and experiments comparing different CNNs' performance on the same neuro image dataset are provided [1]. We merge all the abnormal cases into one abnormal class so that there are only two classes (normal/abnormal) in the dataset. Experimental results demonstrate the effectiveness of our proposed CNN architecture. HCSNet obtains an overall accuracy of 92% and 94.4% accuracy for detecting normal cases. We further discuss potential future work based on this CNN architecture.

This is dedicated to my parents.

Acknowledgments

I would like to thank my two thesis advisors, Prof. Bradley D. Clymer of the Department of Electrical and Computer Engineering at The Ohio State University and Dr. Barbaros Selnur Erdal of the Department of Radiology at The Ohio State University. Dr. Clymer, thank you for your guidance and patience. Dr. Erdal, it is my honor to work with you; you are the best advisor. Thank you.

I would also like to thank all of my friends who have helped me to improve the writing. It would have been impossible for me to finish this thesis without your assistance and support. Thank you.

Songyue Qian

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

List of Tables ...... viii

List of Figures ...... x

1. Introduction ...... 1

1.1 Motivation ...... 1
1.2 Problem Statement ...... 2
1.3 Organization ...... 3

2. Background ...... 5

2.1 Neuroimaging template ...... 5
2.2 Neural Network Architecture ...... 7
2.3 Contribution ...... 9

3. Methodology ...... 11

3.1 Overview of Convolutional Neural Network (CNN) ...... 11
3.2 Neural Network Training ...... 14
3.2.1 Backpropagation ...... 14
3.2.2 Training sets setting ...... 15
3.3 Neural Network Layers ...... 16
3.3.1 Convolutional Layers ...... 16
3.3.2 Pooling Layers ...... 21

3.3.3 Activation Layers ...... 23
3.3.4 Dropout ...... 27
3.3.5 Normalization Layers ...... 27
3.3.6 Loss Layers ...... 29
3.4 Neural Network design ...... 30
3.4.1 LeNet ...... 30
3.4.2 AlexNet ...... 31
3.4.3 VGGNet ...... 34
3.4.4 GoogleNet ...... 37
3.4.5 ResNet ...... 41
3.4.6 Siamese Network ...... 45
3.4.7 Hybrid CNN-Siamese network (HCSNet) ...... 48
3.5 Pre-processing ...... 50
3.6 Analysis Techniques ...... 51
3.6.1 Qualitative Analysis by example ...... 51
3.6.2 Confusion Matrices ...... 53
3.6.3 Learning Curves ...... 55
3.6.4 Others ...... 55

4. Experiment and Result ...... 57

4.1 Parameter setting ...... 57
4.1.1 Implementation on scratch network ...... 58
4.1.2 Implementation on pre-trained network ...... 59
4.2 Result and Statistics on Experiment I ...... 60
4.3 Result and Statistics on Experiment II ...... 66

5. Conclusion and Future Discussion ...... 72

5.1 Conclusion ...... 72
5.2 Future work and discussion ...... 73
5.2.1 Datasets size ...... 73
5.2.2 Siamese network parameter setting ...... 73
5.2.3 Segmentation region ...... 73
5.2.4 3D object processing ...... 74

Bibliography ...... 75

List of Tables

Table Page

3.1 Example of Confusion matrix ...... 54

4.1 DataSet distribution on Experiment I ...... 59

4.2 DataSet distribution on experiment II ...... 60

4.3 Result of LeNet after 5000 steps in set 1 ...... 62

4.4 Result of LeNet after 5000 steps in set 2 ...... 62

4.5 Result of AlexNet after 5000 steps in set 1 ...... 62

4.6 Result of AlexNet after 5000 steps in set 2 ...... 62

4.7 Result of VGG-16 after 5000 steps in set 1 ...... 63

4.8 Result of VGG-16 after 5000 steps in set 2 ...... 63

4.9 Result of Inception v3 after 5000 steps in set 1 ...... 63

4.10 Result of Inception v3 after 5000 steps in set 2 ...... 63

4.11 Result of Inception v4 after 5000 steps in set 1 ...... 64

4.12 Result of Inception v4 after 5000 steps in set 2 ...... 64

4.13 Result of Inception ResNet after 5000 steps in set 1 ...... 64

4.14 Result of Inception ResNet after 5000 steps in set 2 ...... 64

4.15 Result of Siamese Network after 5000 steps in set 1 ...... 65

4.16 Result of Siamese Network after 5000 steps in set 2 ...... 65

4.17 Result of HCSNet after 5000 steps in set 1 ...... 65

4.18 Result of HCSNet after 5000 steps in set 2 ...... 65

4.19 Result of Inception V3 in Experiment II ...... 67

4.20 Result of Inception ResNet V2 in Experiment II ...... 68

4.21 Result of Inception V3 and Siamese Network in Experiment II . . . . 69

4.22 Result of HCSNet in Experiment II ...... 70

List of Figures

Figure Page

3.1 Multiple Neurons with single directions node ...... 11

3.2 Multiple Neurons with single directions node ...... 12

3.3 Multiple Neurons with single directions node ...... 13

3.4 Convolutional Process ...... 18

3.5 Convolutional Matrix ...... 18

3.6 Convolutional Matrix with padding=2 ...... 20

3.7 Max and Average pooling ...... 22

3.8 Sigmoid activation function ...... 24

3.9 Derivative of Sigmoid Activation Function ...... 24

3.10 tanh activation function ...... 25

3.11 Derivative of tanh Activation Function ...... 25

3.12 ReLU activation function ...... 25

3.13 Derivative of ReLU Activation Function ...... 25

3.14 Leaky ReLU Activation ...... 26

3.15 Before and after applying dropout ...... 28

3.16 Batch normalization processing equations [2] ...... 29

3.17 Architecture of LeNet-5[3] ...... 31

3.18 Layer structure of AlexNet[4] ...... 32

3.19 Parameter number and setting of AlexNet ...... 33

3.20 Architecture of VGGNet[5] ...... 35

3.21 Inception module of GoogleNet ...... 38

3.22 Inception V2 module ...... 40

3.23 Inception V3 module ...... 41

3.24 Residual unit of ResNet ...... 42

3.25 Compare VGG and ResNet structure ...... 44

3.26 Residual unit of two and three layer ResNet ...... 45

3.27 Siamese Architecture [6] ...... 46

3.28 Workflow of HCSNet ...... 49

4.1 Different network validation accuracy vs. number of steps in set 1 . . 61

4.2 Different network validation accuracy vs. number of steps in set 2 . . 61

4.3 Validation accuracy vs. epoch Inception V3 Experiment II ...... 67

4.4 Loss vs. epoch Inception V3 Experiment II ...... 68

4.5 Validation accuracy vs. epoch Inception ResNet V2 Experiment II . . 69

4.6 Loss vs. epoch Inception ResNet V2 Experiment II ...... 70

Chapter 1: Introduction

1.1 Motivation

While advances in neuroimaging are constantly leading to better image quality and higher accuracy of segmentation results, manual segmentation is a time-consuming and laborious process. Hence, automatic segmentation is an in-demand technique, which is mostly done by quantitative research algorithms. In general, most of the quantitative research about neuroimaging focuses on aligning one or several anatomical templates to the target image (via a linear or nonlinear registration process) and transferring segmentation labels from the templates to the image. However, such methods may not be capable of capturing the full anatomical variability of targets because the generated model may only focus on the major components of the brain structure.

One of the quantitative research algorithms generates a template model for a normal brain. Although the template model is not a common method of medical image classification for all organs of the body, it produces a stable pattern from the positions of different anatomical shapes in the brain. The similarity between this template and the testing image can be calculated to reveal how 'normal' the test case is.

Machine learning, which uses images to train a predictive model that assigns class probabilities to each pixel/voxel, is another widely used technique. Machine learning, with its state-of-the-art ability in segmenting brain structures [7], has the capability to process large datasets and implement classification or segmentation far beyond the capabilities of human perception.

Most neuroimaging implementations follow the two approaches introduced above, the template model and machine learning. The former generates a brain template model to segment each component of the brain. It examines each brain component individually because the top concern is to find out which part matches which component, so the pattern of the entire brain may be overlooked. The machine learning method has typically been used with a training set built for a specific disease. This creates a potential pitfall: any new disease in the test image set may be wrongly classified as a normal case. Therefore, a new method which combines the advantages of both approaches is desired.

1.2 Problem Statement

In order to find the best CNN design for normal/abnormal classification, Experiment I explores each neural network from the layer's perspective and demonstrates a performance evaluation using different networks under the same hardware conditions.

The training model is not for specific disease detection or component matching. It tells whether the test image belongs to the normal set or not. Hence, as long as the outcome is that the test image does not belong to the normal set, the test image is treated as an abnormal case and will be left to the radiologist for further investigation.

Another goal of this research is to change the usage of CNNs for neuroimaging. A Siamese Network is added as a post-processing step in which a CNN applies classification by comparing the similarity between two images. Because the similarity is calculated over the entire image, it is capable of generating a template without splitting the image into different components, which fixes the pitfall of the template model technique.

In the strategy of this research, all abnormal cases are counted as one single set. The classification process is not to diagnose the disease but instead to determine whether the patient's condition is normal or not. This strategy not only enhances the detectability of the normal cases compared with the component matching method, but it also avoids the possible incorrect diagnosis results from the specific disease classification method. Once the result determines that the image does not belong to the normal group, it will be put on the list of abnormal cases for further examination.

In Experiment II, the entire CNN architecture implements image classification using the template generated from the training sets.

1.3 Organization

In Chapter 2, the background of the neuroimaging template and machine learning techniques is discussed in order to illustrate the major components of this research. In Chapter 3, common machine learning network designs are discussed. This chapter also investigates the components and structure of neural networks and discusses the pros and cons of various neural network methods. Chapter 4 compares the performance of each neural network for generating a normal template from the training sets. Chapter 5 offers the conclusion of this research and an overview of planned future research.

Chapter 2: Background

2.1 Neuroimaging template

The purpose of creating a neuroimaging template is to make a region-by-region comparison between the templates and the test images. Once a template has been constructed, it models unseen data from the same population and provides an ideal reference image for the data and for statistical analysis. For instance, the abnormal volumes and shapes of certain anatomical regions of the brain have been found to be associated with a series of brain disorders, including Alzheimer's and Parkinson's disease [8][9].

In general, the prior-art methods for generating a neuro image template can be divided into four main categories: atlas-based methods [10][11], statistical models [12][13], deformable models [14] and machine learning based classifiers [7][10].

The atlas-based method aligns one or several anatomical templates to the target image by using either a linear or non-linear registration process and transfers segmentation labels from the templates to the image. This kind of method relies heavily on a registration step, in which the atlases are non-linearly registered to the query image. If the registration step is complex, it takes too much time for processing and aligning.

Moreover, the atlas-based methods may not be able to capture the full anatomical variability of target subjects and would fail in cases of large misalignments or deformations.

The statistical model method utilizes the training data to learn a parametric model describing the variability of specific brain features such as shape, texture, etc. These approaches might overfit the data if the number of training samples is too small relative to the number of parameters to be learned, which negatively affects the results. The robustness of this kind of statistical approach might also be affected by noise in the training data. The parameters are updated iteratively by searching in the vicinity of the current solution. Therefore, an accurate initialization is required for the parameters to converge to a correct structure.

Unlike the atlas-based and statistical model methods, segmentation using deformable models is highly flexible and does not require any training data or prior knowledge. In addition, the deformation process provides the capability to distinguish the input structures. The disadvantage of this method is that deformable models depend heavily on the initialization of the segmentation contour and on the stopping criteria, which makes the method highly dependent on the characteristics of each case.

The fourth method is the machine learning approach [10][15], which uses the training images to learn a predictive model that assigns class probabilities to each pixel/voxel. These probabilities are sometimes used as unary potentials in standard regularization techniques such as graph cuts [16]. Nevertheless, this type of approach usually involves heavy algorithmic design, with carefully engineered, application-dependent features and meta-parameters, which limits its applicability to different brain structures and modalities.

2.2 Neural Network Architecture

Machine learning has emerged as a powerful tool in numerous pattern and speech recognition applications. Unlike hand-crafted methods, machine learning techniques make it possible to learn hierarchical features which represent different levels of abstraction in a data-driven manner. Among the different types of learning approaches, CNNs [3][4] have shown their potential for solving problems in computer vision. Networks of this type are built from multiple convolution, pooling and fully-connected layers, the parameters of which are learned using back-propagation. The network surpasses traditional architectures through two properties: local connectivity and parameter sharing. The units in the hidden layers of a CNN are only connected to a small number of units, corresponding to a spatially localized region. This reduces the number of parameters in the net, the memory/computational burden, and the risk of overfitting. Moreover, CNNs also reduce the number of learned parameters by sharing the same basis function (i.e., convolution filters) across different image locations.

CNNs have seen many applications in neuroimaging, among many others in medical imaging [17][18][19][20]. Ciresan et al. [17] have shown that a CNN allows accurate segmentation of the neuronal membranes in electron microscopy images. In the same study, a sliding-window strategy is applied to predict the class probabilities of each pixel by using the patches centered at the pixels as input to the network. CNN-based methods make it possible to segment three brain tissues (white matter, gray matter and cerebrospinal fluid) from the multi-sequence Magnetic Resonance Images (MRI) of infants [18]. An important drawback of this strategy is that its label prediction is based on localized information. Also, this strategy is typically slow since the prediction must be carried out for each pixel.

A 2D image corresponding to a single plane can be another type of CNN input. In medical imaging, 2D images are obtained as slices from 3D medical images. Deep CNNs have been investigated for glioblastoma tumor segmentation [19], under an architecture with several pathways which modeled both local and global-context features. It is also feasible to apply a different CNN architecture to the segmentation of brain tumors in MRI data, exploring the use of small convolution kernels [20].

Similarly, several recent studies investigated CNNs for segmenting subcortical brain structures [16][21][22][23][24]. Lee et al. [21] presented a CNN-based approach to learn discriminative features from expert-labelled MR images. Moeskops et al. [22] applied CNNs to segment brain structures from five different datasets, and reported performance for subjects in different age groups (ranging from pre-term infants to older adults). A multiscale patch-based strategy is used to improve these results, where patches of different sizes are extracted around each pixel as the inputs to the network.

Although medical images are often in the form of 3D volumes (e.g., MRI or computed tomography scans), most of the existing CNN approaches use a slice-by-slice analysis of 2D images. An obvious advantage of a 2D approach, compared to one using 3D images, is its lower computational and memory requirements. Furthermore, 2D inputs accommodate pre-trained networks, either directly or via transfer learning. However, an important drawback of this approach is that anatomic context in directions orthogonal to the 2D plane is completely discarded. As discussed recently in Milletari et al. [23], considering 3D MRI data directly, instead of slice-by-slice, can improve the performance of a segmentation method.

To incorporate 3D contextual information, there is also a study using 2D CNNs on images from the three orthogonal planes [24]. The memory requirements of fully 3D networks are avoided by extracting large 2D patches from multiple image scales and combining them with small single-scale 3D patches. All patches are assembled into eight parallel network pathways to achieve a high-quality segmentation of 134 brain regions from whole-brain MRI. Recently, a study proposed a CNN scheme based on 2D convolutions to segment a set of subcortical brain structures [16]. The segmentation of the whole volume is first achieved by processing each 2D slice independently. To impose volumetric homogeneity, it constructs a 3D conditional random field (CRF) using scores from the CNN as unary potentials in a multi-label energy minimization problem.

3D CNNs have been largely avoided due to the computational and memory requirements of running 3D convolutions during inference. Most 3D CNN applications are implemented by converting the volume into multiple 2D images which are then processed slice by slice. The structure for processing a 3D object directly [25] is different from the structure for processing multiple 2D slice images. However, 3D CNN topics are not covered in this background section because this research only uses 2D slice images.

2.3 Contribution

This research aims to implement medical image classification using machine learning techniques and a template model. In the implementation, a Siamese Network is used to train a template of normal brain cases. The CNN architecture discussed in this thesis consists of two components: a pre-trained CNN and a CNN for the template model. The output of this CNN architecture is the similarity between the test image and the training set. Based on this similarity, whether the test image is an abnormal case can be determined. These comprehensive experiments with different neural networks are implemented in the same hardware environment.

The dataset of this research includes 2750 consecutive head CT examinations, which is the same dataset used in Luciano et al. [1]. The CNN designed in this research is a combination of a Siamese Network and Inception ResNet V2. It shows better performance than a simple CNN when the training set is small (fewer than 400 images), improving accuracy by 7% and positive predictive value by 10%. For larger datasets, it provides a good result, reaching 92% overall accuracy and over 94% accuracy for detecting normal cases.

Moreover, the methodology section demonstrates the analysis and functionality of each layer component and each neural network. Another purpose of this thesis is to analyze which CNN architecture is more sensitive for the neuroimaging template case. This will benefit future research on neural network design for CT neuro imaging. The result provides survey information on how different neural networks affect the classification results.

Chapter 3: Methodology

3.1 Overview of Convolutional Neural Network(CNN)

A Convolutional Neural Network (CNN) uses a variation of the Multi-Layer Perceptron (MLP) designed to require minimal pre-processing. To describe the CNN, we will start from the simplest possible neural network, which comprises a single neuron as in Figure 3.1.

The “neuron” is a computational unit that takes inputs x1, x2, x3 and produces a single output, similar to the structure of a human neuron; that is why it is called an “artificial neuron”. The core part of a neuron can be treated

Figure 3.1: Multiple Neurons with single directions node

Figure 3.2: Multiple Neurons with single directions node

as a function which corresponds to the input-output mapping defined by logistic regression. Therefore, we can generate a neural network model by combining multiple “neurons”.

The network architecture in Figure 3.2 can be treated as an example of an MLP. (Note that the connections between nodes can be recurrent in other designs.) In this model, we can assume that each input x has a weight value w and a threshold value b used for computation in each neuron. However, the top concern is how to obtain the values of w and b. In general, to find correct values of w and b, we use the “control variates method” used in Monte Carlo methods [26]: observe the difference in the outputs while changing w and b by δw and δb, and let the machine repeat this process until it finds ideal values for w and b. This process is called the training process of the CNN (Figure 3.3).

The CNN training procedure can be summarized as follows (a minimal sketch is given after the list):

Figure 3.3: Multiple Neurons with single directions node

• Define the input and output

• Find one or more algorithms to compute the output from the input

• Train the model on classified datasets to find the parameters

• Once a new case's input is given, generate the output based on the parameters calculated during the training process
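As a minimal illustration of this loop, the sketch below trains a single logistic neuron with plain gradient descent; the data arrays, learning rate and step count are hypothetical placeholders rather than settings used in this thesis.

import numpy as np

# Hypothetical toy data: each row of X is an input, each entry of y is a 0/1 label.
X = np.array([[0.2, 0.7, 0.1], [0.9, 0.4, 0.3], [0.1, 0.8, 0.6]])
y = np.array([0.0, 1.0, 0.0])

w = np.zeros(3)   # weights, one per input feature
b = 0.0           # threshold (bias)
lr = 0.1          # learning rate (assumed value)

for step in range(1000):
    z = X @ w + b                     # weighted sum for every sample
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid output of the neuron
    grad_z = p - y                    # gradient of the cross-entropy loss w.r.t. z
    w -= lr * (X.T @ grad_z) / len(y) # update weights
    b -= lr * grad_z.mean()           # update threshold

print(w, b)  # learned parameters after training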

If the training object is a computer image, CNN models exploit the fact that the pixels of the image are arranged in a regular grid. The image is processed by learning image filters, which form the convolutional layers. While MLPs vectorize the input, the input of a layer in a CNN is a set of feature maps. A feature map is a matrix m ∈ R^(w×h) (w and h represent width and height), and typically the width equals the height (w = h). For an RGB input image, the number of feature maps is d = 3: each color channel is a feature map. Since AlexNet [4] almost halved the error in the ImageNet challenge, the CNN has been a state-of-the-art technique in various computer vision machine learning tasks.

3.2 Neural Network Training

3.2.1 Backpropagation

In a CNN, backpropagation [27] is an essential method to compute the gradient of the loss function with respect to the network weights. In other words, backpropagation is the process used to calculate the w and b referred to in Chapter 3.1.

Backpropagation, which adjusts the weights of neurons by calculating the gradient of the loss function, complements the gradient descent optimization algorithm. The loss value from the function is considered an evaluation of the training performance. Backpropagation requires the derivative of the loss function with respect to the network output to be known, which typically means that a desired target value is known.

Backpropagation originates from an older and more general technique called automatic differentiation. It is also sometimes called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers. It is commonly used to train deep neural networks, a term referring to neural networks with more than one hidden layer.

Backpropagation is also a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer; a small worked sketch is given below. In Chapter 3.3, we introduce the definition and structure of different CNN layers, which are the objects of backpropagation.
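To make the chain-rule step concrete, the following sketch backpropagates through a tiny two-layer network with a sigmoid hidden layer and a squared-error loss; the layer sizes and learning rate are illustrative assumptions, not the configuration used later in this thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy batch: 4 samples, 3 input features, 1 target value each.
X = np.random.rand(4, 3)
y = np.random.rand(4, 1)

W1, b1 = np.random.randn(3, 5) * 0.1, np.zeros(5)   # hidden layer parameters
W2, b2 = np.random.randn(5, 1) * 0.1, np.zeros(1)   # output layer parameters
lr = 0.1

for step in range(100):
    # Forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    out = h @ W2 + b2                 # linear output
    loss = 0.5 * np.mean((out - y) ** 2)

    # Backward pass (chain rule, layer by layer)
    d_out = (out - y) / len(X)        # dL/d(out)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = d_out @ W2.T * h * (1 - h)  # propagate the error through the sigmoid
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # Gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1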

14 3.2.2 Training sets setting

As Chapter 3.1 states, the input of a CNN is vectorized into multiple features to produce a single-valued output. Because of this architecture, most CNN implementations are used for classification. Even though there are multiple implementations [16][21][22][23][24] which focus on generating a model template, most of them only focus on detecting a specific region based on an image template and do not apply segmentation. Unlike other research that focuses on a specific disease or region, this thesis aims to compare the testing case with the normal template to judge whether the testing case is normal or not. The output for the test image is the similarity between the test image and each class's template. If the test image is classified as an abnormal case, the system recommends that the doctor perform further segmentation for this test case. The design of this CNN model has a different orientation compared with traditional CNN experiments because it focuses on the normal case instead of the disease cases. In other words, the outcome of the research switches from “Does the patient have disease A?” to “Is the patient healthy?” The training set we used in this research is the same dataset used in the previous AI classification project [1]. The abnormal training set consists of multiple types of brain diseases.

The normal training set includes the normal neuro images. To simplify the problem, this project only focuses on 2D images instead of the original 3D CT images. The 2D slices in the training sets are collected from a similar z-axis position in the CT images.

There is another method that can be used to generate a template for the normal cases. An image can be converted to a matrix made of vectors, which can be used as the input of the CNN. However, the output does not have to be a single value. A standard template can also be generated from the datasets, which is similar to the design of FaceNet [28]. They represent the structure of a face by two hundred points, use those point positions to generate a template for one class, and then compare the similarity between multiple face templates to judge whether an image belongs to this class or not. In this architecture, the output is a set of points converted from the original images (matrices). Similar to facial images, the components of the human brain have similar shapes and are located in similar positions. Therefore, it is possible to apply the template model method to a neuro image CNN application.

This method can be implemented by a Siamese network, which uses a CNN to compare the similarity of two images. The comparison uses convolutional feature maps, which capture more characteristics than a traditional pixel-by-pixel comparison. In this method, only the normal training set is used because the key point is to compare the similarity between the test image and all the normal cases. The output is a floating point value between 0 and 1 instead of a Boolean output. In this thesis research, the classification result of HCSNet is calculated by combining the normal CNN's result and the template model's result. Because the classification processes implemented by the Inception ResNet and the Siamese Network are separate, no particular pre-processing is required for the images in the training set.

3.3 Neural Network Layers

3.3.1 Convolutional Layers

In a CNN, the convolutional layer computes the output volume by computing the dot product between each filter and an image patch. Convolutional layers pass the processed outcome to the next layer by applying a convolution operation to the input. The layer's parameters consist of a set of filters, which have a receptive field extending through the full depth of the input volume. Each filter is convolved across the width and height of the input volume to compute the dot product between the entries of the filter and the input (Figure 3.4). The output is a 2-dimensional feature map of that filter which indicates whether a specific feature is detected at some spatial position in the input.

The output volume of the convolution layer is built by combining the feature maps for all filters along the depth. Every entry in the output volume can be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same feature map.

When a convolutional layer processes images, it is impractical to connect neurons to all neurons in the previous volume, and such full connectivity would ignore the spatial structure of the data. Convolutional networks instead use spatially local correlation to implement a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume, as in Figure 3.5. There, a 7 × 7 input matrix is convolved to a 3 × 3 matrix. The red square marked in the output volume is convolved from the top-left region of the input volume. Because the input is divided into patches of the filter size, the connections are local along width and height but extend along the entire depth of the input volume. This architecture ensures that trained filters respond to spatially local input patterns.

There are three hyper-parameters that control the size of the output volume of the convolutional layer: the depth, the stride and the zero-padding.

17 Figure 3.4: Convolutional Process

Figure 3.5: Convolutional Matrix

The depth of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features (oriented edges, color, etc.) in the input.

The stride controls how depth columns are allocated around the spatial dimensions. When the stride is 1, we move the filters one pixel at a time. This leads to heavily overlapping receptive fields between the columns, and also to large output volumes. When the stride is 2, the filters jump 2 pixels at a time as they slide around; the receptive fields overlap less and the resulting output volume has smaller spatial dimensions. Figure 3.6 is an example with padding = 2.

Zero padding pads the input with zeros on the border of the input volume and is a good complement to the stride control (the padded area is filled with 0). The size of this padding is the third hyper-parameter. Padding provides control of the output volume's spatial size and can preserve the spatial size of the input volume.

Figure 3.6 is an example that implements stride control and zero padding. The input volume is 32 ∗ 32 ∗ 3. If we imagine two borders of zeros around the volume, this gives us a 36 ∗ 36 ∗ 3 volume. Then, when we apply our convolutional layer with three 5 ∗ 5 ∗ 3 filters and a stride of 1, we will also get a 32 ∗ 32 ∗ 3 output volume.

Assume W1 and H1 are the width and height before convolution, and F is the size of the filter. W2 and H2 are the width and height after convolution. In general, setting the zero padding to P = (F − 1)/2 when the stride is S = 1 ensures that the input volume and output volume will have the same spatial size. The relations between the parameters are listed in equations 3.1, 3.2 and 3.3.

W2 = (W1 − F + 2P)/S + 1   (3.1)

H2 = (H1 − F + 2P)/S + 1   (3.2)

P = (F − 1)/2   (3.3)

Figure 3.6: Convolutional Matrix with padding=2
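As a quick check of equations 3.1-3.3, the helpers below compute the output size for a given input size, filter size, stride and padding; the function names and the 32/5/1/2 example values simply mirror the text and are illustrative only.

def conv_output_size(w_in, f, stride, pad):
    # Spatial output size of a convolution (equation 3.1 / 3.2).
    return (w_in - f + 2 * pad) // stride + 1

def same_padding(f):
    # Padding that keeps the spatial size unchanged at stride 1 (equation 3.3).
    return (f - 1) // 2

# Example from the text: a 32x32x3 input, 5x5 filters, stride 1, padding 2.
print(conv_output_size(32, 5, 1, 2))  # -> 32, spatial size is preserved
print(same_padding(5))                # -> 2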

The parameter sharing scheme is another key feature implemented in convolutional layers to control the number of free parameters. Denoting a single 2-dimensional slice of the depth as a depth slice, the architecture constrains the neurons in each depth slice to use the same weights and bias.

Parameter sharing contributes to the translation invariance of the CNN architecture. Because all neurons in a single depth slice exploit the same parameters, the forward pass in each depth slice of the convolutional layer can be computed as a convolution of the neurons' weights with the input volume. The sets of weights act as the filter parameters of the convolution. The result of this convolution is a feature map, and the feature maps for different filters are stacked along the depth dimension to produce the output.

However, when we expect completely different features of the input image to be learned at different spatial locations, the parameter sharing assumption may not work. For example, when the inputs are faces that have been centered in the image, we might expect different eye-specific or hair-specific features to be learned for different parts of the image. In that case, it is common to relax the parameter sharing scheme and call the layer a locally connected layer.

3.3.2 Pooling Layers

Pooling is a non-linear down-sampling operation in the CNN. There are several non-linear functions that implement pooling, such as max pooling and average pooling. Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, generates the required output. For example, when max pooling is implemented, the maximum value of each sub-region is posted as the output.

Figure 3.7: Max and Average pooling

Pooling layers are implemented under the assumption that the value computed for each sub-region is able to represent the characteristics of that sub-region. To prevent overfitting and to reduce the number of parameters and the amount of computation in the network, the pooling layer reduces the spatial size of the representation. In general, a pooling layer is inserted between successive convolutional layers in a CNN architecture. The pooling operation provides another form of translation invariance.

Pooling summarizes a p × p area of the input feature map. Similar to convolutional layers, pooling can be used with a stride S > 1. Usually p ∈ {2, 3, 4, 5}, and S = 2 is the most common setting, implemented in AlexNet [4] and VGG-16 [29]. As Thoma (2017) [5] concluded, pooling is applied for three reasons: to get local translational invariance, to get invariance against minor local changes and, most important, for data reduction to 1/S² of the data by using strides of S > 1. Examples of max pooling and average pooling are shown in Figure 3.7.
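A compact way to see this data reduction is the sketch below, which performs 2 × 2 max and average pooling with stride 2 on a small feature map by reshaping it into blocks; the array contents are arbitrary illustrative values.

import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 feature map

# Group the map into non-overlapping 2x2 blocks (p = 2, stride S = 2).
blocks = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)

max_pooled = blocks.max(axis=-1)    # 2x2 output, keeps the largest value per block
avg_pooled = blocks.mean(axis=-1)   # 2x2 output, keeps the block average

print(max_pooled)  # data reduced to 1/S^2 = 1/4 of the input entries
print(avg_pooled)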

3.3.3 Activation Layers

The activation layer is used to increase the non-linearity of a network without affecting the receptive fields of the convolution layers. Five non-linear activation functions are most commonly discussed.

Sigmoid

The Sigmoid function takes a real-valued number and squashes it into a range between 0 and 1. It converts large negative numbers to values near 0 and large positive numbers to values near 1.

Mathematically it is written as:

σ(x) = 1 / (1 + e^(−x))

Figure 3.8 and Figure 3.9 show the Sigmoid function and its derivative graphically.

During backpropagation through a network with Sigmoid activations, the gradients in neurons whose outputs are near 0 or 1 are nearly 0. These neurons are called saturated neurons. Their weights cannot change significantly because of the extremely small gradient, a phenomenon called the vanishing gradient problem.

Figure 3.8: Sigmoid activation function

Figure 3.9: Derivative of Sigmoid Activation Function

If a large network composed of Sigmoid neurons operates in the saturated region, the network will not be able to implement backpropagation effectively. Meanwhile, the outputs of the Sigmoid are not zero-centered and the exponential function is computationally expensive. A better activation method is required.

Tanh

The tanh function takes a real-valued number and converts it into a range between -1 and 1. Unlike the Sigmoid, tanh outputs are zero-centered because the output range is between -1 and 1. Strongly negative inputs map to strongly negative outputs, zero inputs map near zero, and positive inputs map to positive outputs. The main disadvantage is that the tanh function also suffers from the vanishing gradient problem.

Figure 3.10 and Figure 3.11 show the tanh function and its derivative graphically:

Rectified Linear Unit (ReLU)

Unlike the methods introduced above, the ReLU does not squash its output into a fixed range. When the input x < 0, the output is 0. When the input x > 0, the output is x. Mathematically, it can be represented as:

Figure 3.10: tanh activation function

Figure 3.11: Derivative of tanh Activation Function

Figure 3.12: ReLU activation function

Figure 3.13: Derivative of ReLU Activation Function

y = max(0, x)

Figure 3.12 and Figure 3.13 show the ReLU activation function and its derivative.

ReLU yields faster network convergence than the other methods. It is resistant to the vanishing gradient problem when x > 0, which ensures that backpropagation can be carried out in at least half of the input region.

However, when x is not positive, the network does not learn. In detail, if x < 0, the neuron remains inactive and it kills the gradient during the backward pass. If x = 0, the slope is undefined; in practice, either the left or the right gradient is picked.

25 Figure 3.14: Leaky ReLU Activation

Since the value chosen at the undefined point may be slightly incorrect because the nearest gradient is picked, this is counted as another drawback of the ReLU method.

Leaky and Parametric ReLU

Leaky and Parametric ReLU are activation functions used to fix the vanishing gradient issue that occurs when the ReLU activation function processes x < 0. Mathematically, the leaky ReLU function can be represented as:

y = max(0.1x, x)

The concept of leaky ReLU is that when x < 0, the function has a small positive slope of 0.1. This somewhat eliminates the dying ReLU problem, but the improvement is not consistent.

In leaky ReLU, 0.1 is used as the slope parameter to avoid the vanishing gradient issue. The parameter can also be replaced by an arbitrary hyper-parameter α. This α can be learned, since the network can backpropagate into it, which gives the neurons the ability to choose the best slope in the negative region. This method is the Parametric ReLU.

The parametric ReLU function is given by:

y = max(αx, x)

In summary, PReLU is a more stable and popular choice for non-linear activation layers in the current CNN field.
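For reference, the activation functions above can be expressed in a few lines of code; the sketch below assumes a fixed leaky slope of 0.1 and treats the PReLU slope α as a given parameter rather than a learned one.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)              # y = max(0, x)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)        # y = max(0.1x, x)

def prelu(x, alpha):
    return np.maximum(alpha * x, x)        # alpha would be learned during training

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), prelu(x, alpha=0.25))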

3.3.4 Dropout

Dropout is a technique used to reduce overfitting and co-adaptation on training data. It was introduced in Hinton et al. [30] and Srivastava et al. [31]. As Thoma's survey paper concluded [5], a Dropout layer can be implemented as follows: for an input "in" of any shape s, a tensor D ∈ {0, 1}^s of the same shape is sampled, where each element d_i is drawn independently from a Bernoulli distribution, and p represents the dropout probability for every value of the input. Mathematically, the dropout method can be written as:

out = D ⊙ in, with d_i ∼ B(1, p)   (3.4)

In general, Dropout is used with p = 0.5. Layers closer to the input usually have a lower dropout probability than later layers. The output of a dropout layer is multiplied by 1/(1 − p) when dropout is enabled [32]. Dropout is typically applied after fully connected layers to prevent overfitting and after max-pooling layers to implement a form of data augmentation. Figure 3.15 is an example of implementing Dropout on a CNN.
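A minimal sketch of equation 3.4 with the inverted scaling described above is given below; the mask convention (a unit is kept with probability 1 − p so that the 1/(1 − p) rescaling preserves the expected activation) and p = 0.5 are illustrative assumptions.

import numpy as np

def dropout(x, p=0.5, training=True):
    # Zero each value with probability p, then rescale the survivors by 1/(1 - p).
    if not training:
        return x                        # dropout is disabled at test time
    keep_mask = np.random.binomial(1, 1.0 - p, size=x.shape)
    return x * keep_mask / (1.0 - p)    # inverted scaling keeps the expected value

activations = np.random.rand(4, 8)      # hypothetical layer output
print(dropout(activations, p=0.5))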

3.3.5 Normalization Layers

While the parameters of layers close to the output are adapted to the input produced by lower layers, the parameters of those lower layers are also being adapted. This problem is called internal covariate shift, and it leads to poorer parameters in the upper layers and forces a low learning rate.

Figure 3.15: Before and after applying dropout

Normalizing mini-batches [2] is a way to approach this problem. The normalization layer implements mini-batch normalization in the CNN, which improves the gradient flow and allows higher learning rates. Moreover, it can also reduce the dependency on initialization and acts as a form of regularization [5].

In general, the batch normalization layer applies the following normalization:

x̂_k = (x_k − E[x_k]) / √(Var(x_k))

The output of the batch norm layer has γ and β as parameters. These parameters are learned to best represent the activations; they allow a learnable scale and shift factor:

y_k = γ_k · x̂_k + β_k

Overall, the operation can be summarized as in Figure 3.16 [2].

Figure 3.16: Batch normalization processing equations [2]
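A sketch of this normalization as applied over a mini-batch is given below; the epsilon term for numerical stability and the per-feature treatment follow common practice and are assumptions here.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then apply the learnable scale/shift.
    mean = x.mean(axis=0)                     # E[x_k] per feature
    var = x.var(axis=0)                       # Var(x_k) per feature
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta               # y_k = gamma_k * x_hat_k + beta_k

batch = np.random.randn(16, 4)                # hypothetical mini-batch of 16 samples
gamma, beta = np.ones(4), np.zeros(4)         # learnable parameters (initial values)
print(batch_norm(batch, gamma, beta).mean(axis=0))  # roughly 0 per feature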

3.3.6 Loss Layers

The loss is used to determine how the training process penalizes the deviation between the predicted and true labels. The function used in the loss layer is called the loss function; it describes how far off the result produced by the CNN is from the expected result. The loss function indicates the magnitude of the error the CNN made in its prediction.

Before the loss layer, there is usually another layer called the fully-connected (FC) layer, which converts the previous calculation result to a simpler format for the loss layer to process. For example, in VGG16 [29] the first fully connected layer has 4096 nodes and must process the result from the last pooling layer, which has 25088 nodes. The connection requires 4096 × 25088 weights, which requires a large amount of memory. Alternatively, the fully connected layer can be regarded as a special case of the convolution layer: the pooling layer (POOL2) and the fully connected layer (FC1) are fully connected, and the output nodes of POOL2 are arranged as a vector, that is, 25088 dimensions, each of size 1 × 1. The equivalent convolution kernel can be seen as num_filters = 4096, channels = 25088, kernel_size = 1, stride = 1, which runs faster than the previous design.

The result of the loss function can then be 'backpropagated' through the CNN model, adjusting its weights so that the output gets closer to the expected result the next time.

Various loss functions are appropriate for different tasks. For example, Softmax loss is used for predicting a single class of K mutually exclusive classes. Sigmoid cross-entropy loss is used for predicting K independent probability values in [0, 1].

Euclidean loss is used for regressing to real-valued labels.
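As one concrete instance, the sketch below computes the softmax cross-entropy loss for a small batch of class scores; the scores and labels are made-up illustrative values, and the two classes merely echo the normal/abnormal setup used later.

import numpy as np

def softmax_cross_entropy(scores, labels):
    # Softmax loss for K mutually exclusive classes (one integer label per sample).
    shifted = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

scores = np.array([[2.0, 0.5], [0.1, 1.5]])   # raw network outputs, 2 samples x 2 classes
labels = np.array([0, 1])                     # true class indices
print(softmax_cross_entropy(scores, labels))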

3.4 Neural Network design

3.4.1 LeNet

One of the first CNNs used was LeNet-5 [3]. LeNet-5 applies twice the common pattern of a single convolutional layer with tanh as the non-linear activation function followed by a pooling layer, and ends with three fully connected layers. One fully connected layer is used to get the right output dimension; another is necessary to allow the network to learn a non-linear combination of the features in the feature maps. The architecture of LeNet is shown in Figure 3.17.

In LeNet's architecture, the tanh function is applied after layers 1, 3, 5 and 6, and the softmax function is applied after layer 7. A notable characteristic is that the convolutional layers require fewer parameters but have an order of magnitude more floating point operations (FLOPs) per parameter than the fully connected layers.

Figure 3.17: Architecture of LeNet-5[3]

LeNet was designed for MNIST, on which it achieves a 0.8% test error rate. However, compared with the other neural networks introduced below, LeNet is a very simple pipeline network. The limitations of its layer functionality and structure lead to low performance when processing medical images.

3.4.2 AlexNet

The first CNN which achieved major improvements on the ImageNet dataset was AlexNet [4]. Its architecture, which has about 60 × 10^6 parameters, is shown in Figure 3.18; the parameter settings are listed in Figure 3.19. Convolutional layers are followed by pooling layers multiple times and end in a fully connected network, a processing structure inherited from LeNet.

In AlexNet, all the convolution kernels after the max-pooling layer are small, either 5 × 5 or 3 × 3 with a stride of 1, which means they scan all the pixels in the image, while the max-pooling layers use a stride of 2.

Figure 3.18: Layer structure of AlexNet[4]

This reveals that in the first few convolutional layers, although the amount of computation is very large, the number of parameters is very small, around 1M or even fewer, which accounts for only a very small part of AlexNet's total parameter count. This convolutional architecture can extract valid features with a small number of parameters. If the first few layers directly used fully connected layers, the number of parameters and the amount of computation would become astronomical. Although each convolution layer occupies less than 1% of the entire network's parameters, removing any convolutional layer greatly reduces the classification performance of the network.

To implement AlexNet on GPUs, the common solution is to reduce the number of parameters and allow parallel computation on separate GPUs. This provides more space for handling the data and implementing data augmentation. To make the architecture easier to compare, this grouping was ignored for the parameter count. AlexNet also implements the dropout functionality noted above.

32 Figure 3.19: Parameter number and setting of AlexNet

3.4.3 VGGNet

VGGNet [29] is a deep convolutional neural network developed by researchers from the Visual Geometry Group at Oxford University and Google DeepMind. VGGNet explored the relationship between the depth of convolutional neural networks and their performance. Through repeated stacking of small 3 × 3 convolution kernels and 2 × 2 max-pooling layers, VGGNet succeeded in constructing networks 16 to 19 layers deep. Compared with the previous state-of-the-art network structures, VGGNet's error rate dropped significantly, and it achieved second place in the ILSVRC 2014 classification task and first place in the localization task. VGGNet is highly scalable. The entire network uses the same convolution kernel size (3 × 3) and max-pooling size (2 × 2). VGGNet is still often used to extract image features. The trained VGGNet model parameters are open sourced on its official website and can be used to retrain domain-specific image classification tasks (equivalent to providing very good initialization weights), and are therefore used in many places.

All of the VGGNet configurations use 3 × 3 convolution kernels and 2 × 2 pooling kernels, improving performance by continuously deepening the network structure. Figure 3.20 shows VGGNet's network structure and parameters at each level. From the 11-layer network to the 19-layer network, there are detailed performance tests. Although each configuration from A to E gradually becomes deeper, the number of network parameters does not increase much. This is because the parameters are mainly consumed by the last three fully-connected layers. Although the earlier convolutional part is very deep, the number of parameters it consumes is not large; however, the more time-consuming part of training is still the convolution, because of its larger computation. Among these configurations, D and E are what we usually call VGGNet-16 and VGGNet-19. Configuration C is interesting compared with B in that it has several 1 × 1 convolutional layers.

the training is still convolution, because of its larger calculation. Among them, D, E is what we often say VGGNet-16 and VGGNet-19. C is very interesting when compared to B, there are several 1×1 convolutional layers. The significance of 1×1 convolution

The significance of the 1 × 1 convolution is mainly a linear transformation: the number of input channels and output channels is unchanged, and no dimension reduction occurs.

VGGNet has 5 segments of convolution layers, with 2 to 3 convolution layers in each segment, and a max-pooling layer is connected to the end of each segment to reduce the size of the picture. The number of convolution kernels is the same within each segment and increases in the later segments: 64, 128, 256, 512, 512. There are often cases where multiple identical 3 × 3 convolutional layers are stacked together. Two 3 × 3 convolutional layers in series correspond to a 5 × 5 convolutional layer, which means one pixel is associated with the surrounding 5 × 5 pixels; it can be said that the receptive field size is 5 × 5. The effect of three 3 × 3 convolutional layers in series is equivalent to a 7 × 7 convolutional layer. In addition, three concatenated 3 × 3 convolution layers have fewer parameters than a 7 × 7 convolution layer. Moreover, the three 3 × 3 convolutional layers apply more nonlinear transformations than a single 7 × 7 convolutional layer (the former can use three ReLU activations while the latter uses only one), which increases the network's ability to learn features. A short parameter count supporting this claim is given below.
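As a rough check of this parameter comparison (an illustrative calculation, assuming both options map C input channels to C output channels and ignoring biases): three stacked 3 × 3 layers use 3 · (3 · 3 · C · C) = 27C² weights, while a single 7 × 7 layer uses 7 · 7 · C · C = 49C² weights, so the stacked version needs roughly 55% of the parameters for the same 7 × 7 receptive field.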

VGGNet uses a small trick in the training process. First, it trains the simple level-A network, and then reuses the weights of network A to initialize the following, more complex models, so that training converges faster. For prediction, VGG uses a multi-scale method: the image is scaled to a size Q and fed into the convolutional network. A sliding window is then used for classification prediction in the last convolution layer, the classification results of different windows are averaged, and the results for different sizes Q are averaged to obtain the final result, which improves the utilization of the image data and the prediction accuracy. Meanwhile, VGGNet also uses the multi-scale method for data augmentation: the original image is scaled to different sizes S and 224 × 224 crops are taken randomly, which generates much more data and can potentially increase the training set's reliability.

Compared with AlexNet and earlier networks, VGGNet provides more solid performance. In general, VGGNet has a lower error rate than AlexNet for the same processing time. However, the training results raise the doubt that the LRN (Local Response Normalization) layer may not be very useful: the 3 × 3 convolutional configuration that includes LRN may have a higher error rate because of the noise possibly introduced by LRN, which causes unstable performance. This is a disadvantage of using VGGNet.

3.4.4 GoogleNet

Computational cost is a big problem when a CNN with a large number of parameters and operations is applied in practice to thousands of images. To reduce the computational cost while maintaining the classification quality, GoogleNet [32] and the Inception module were developed. The Inception module essentially computes 1 × 1 filters, 3 × 3 filters and 5 × 5 filters in parallel, but applies bottleneck 1 × 1 filters beforehand to reduce the number of parameters.

Let us look at the basic structure of the Inception Module in Figure 3.21, which has four branches. The first branch convolves the input with a 1 × 1 kernel, which is actually an important structure proposed in GoogleNet. The 1 × 1 convolution is a very useful structure: it can organize information across channels, improve the expressiveness of the network, and at the same time increase or reduce the number of output channels. It can be seen that all 4 branches of the Inception Module use a 1 × 1 convolution to perform cross-channel feature conversion at low cost (a much smaller amount of computation than 3 × 3). The second branch first uses a 1 × 1 convolution and then a 3 × 3 convolution, which is equivalent to performing two feature transformations. The third branch is similar: first a 1 × 1 convolution and then a 5 × 5 convolution. The last branch directly uses a 1 × 1 convolution after 3 × 3 max pooling. Some branches only use 1 × 1 convolution, while others use 1 × 1 convolution alongside convolutions of other sizes, because the 1 × 1 convolution is very cost-effective: a small amount of computation adds a layer of feature transformation and non-linearity. The four branches of the Inception Module are finally merged by an aggregation operation (concatenated along the output channels).

37 Figure 3.21: Inception module of GoogleNet

The Inception Module contains three different sizes of convolutions and one max pooling, which increases the adaptability of the network to different scales. This is similar to the Multi-Scale idea. In early computer vision research, inspired by the primate visual system, researchers used Gabor filters of different sizes to process pictures at different scales; Inception V1 borrowed this idea. The Inception V1 paper pointed out that the Inception Module can efficiently expand the depth and width of the network, improve accuracy and avoid overfitting. A simplified sketch of this four-branch structure is shown below.
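To make the four-branch layout concrete, here is a minimal Keras-style sketch of one Inception-V1-like module; the filter counts are arbitrary placeholders, and this is not the exact configuration used elsewhere in this thesis.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3r=96, f3=128, f5r=16, f5=32, fp=32):
    # One Inception-style block: four parallel branches concatenated on channels.
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)    # 1x1 branch
    b2 = layers.Conv2D(f3r, 1, padding='same', activation='relu')(x)   # 1x1 bottleneck
    b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)   # then 3x3
    b3 = layers.Conv2D(f5r, 1, padding='same', activation='relu')(x)   # 1x1 bottleneck
    b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)   # then 5x5
    b4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)          # 3x3 max pool
    b4 = layers.Conv2D(fp, 1, padding='same', activation='relu')(b4)   # then 1x1
    return layers.Concatenate()([b1, b2, b3, b4])   # merge along the channel axis

inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = inception_module(inputs)
model = tf.keras.Model(inputs, outputs)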

Inception V2 [39] was inspired by VGGNet, replacing the large 5 × 5 convolution with two 3 × 3 convolutions (to reduce the number of parameters and mitigate overfitting), and introducing the well-known Batch Normalization (BN) method. BN is a very effective regularization method that can speed up the training of large-scale convolutional networks many times over; at the same time, the classification accuracy after convergence can be greatly improved. When BN is used in a neural network layer, it normalizes the data inside each mini-batch, normalizing the output to an N(0, 1) distribution and reducing Internal Covariate Shift (the change in the distribution of internal neuron activations). BN's paper points out that during the training of traditional deep neural networks, the distribution of the input at each layer keeps changing, which makes training difficult, and the problem can only be handled with a very small learning rate. After using BN for each layer, this problem is effectively solved: the learning rate can be increased many times, the number of iterations required to reach the previous accuracy is only 1/14, and the training time is greatly reduced. After reaching the previous accuracy, training can continue and eventually achieve far better performance than the Inception V1 model; the top-5 error rate is 4.8%, which is better than human-level performance. Because BN also plays a regularization role in some sense, Dropout can be reduced or eliminated, simplifying the network structure.

The gain obtained by simply inserting BN is not obvious; some corresponding adjustments are needed: increasing the learning rate and accelerating the learning-rate decay to suit the BN-normalized data; removing Dropout and reducing the L2 regularization (because BN already plays a regularization role); removing LRN; shuffling the training samples more thoroughly; and reducing the photometric distortion during data augmentation (because BN trains faster, each sample is seen fewer times). After applying these measures, Inception V2 reaches Inception V1's accuracy 14 times faster in training, and the model converges to a higher accuracy.
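As a rough illustration of the per-mini-batch normalization described above, the following NumPy sketch normalizes a batch of activations to zero mean and unit variance per feature and then applies a learned scale and shift; the gamma, beta, and epsilon values here are generic assumptions, not values from any particular network.

```python
# Minimal sketch of the batch-normalization transform for one mini-batch.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: activations of shape (batch, features); normalize each feature over
    # the mini-batch, then scale and shift with the learned gamma and beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 3.0 + 5.0                      # a mini-batch of 32 samples
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))      # ~0 mean, ~1 std per feature
```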

The Inception V3 network introduces two major changes compared with the previous designs. The first is the introduction of factorization into smaller convolutions, which splits a larger two-dimensional convolution into two smaller one-dimensional convolutions: for example, a 7 × 7 convolution is split into a 1 × 7 convolution and a 7 × 1 convolution, or a 3 × 3 convolution is divided into a 1 × 3 convolution and a 3 × 1 convolution, as shown in Figure 3.21.

Figure 3.22: Inception V2 module

On the one hand, this saves a large number of parameters, speeds up computation and reduces over-fitting (splitting a 7 × 7 convolution into a 1 × 7 and a 7 × 1 convolution is even more economical than splitting it into three 3 × 3 convolutions), while adding a layer of non-linearity that expands the model's expressive capability. The paper points out that this asymmetric splitting gives a more obvious benefit than splitting symmetrically into several identical small convolution kernels; it can handle richer spatial features and increase feature diversity.
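The parameter savings can be checked with a back-of-the-envelope calculation such as the sketch below; the channel count of 192 is only an illustrative assumption, and biases are ignored.

```python
# Rough parameter counts for a 7x7 convolution versus its 1x7 + 7x1 factorization
# and three stacked 3x3 convolutions (all with equal input/output channels).
c = 192  # illustrative number of channels

params_7x7 = 7 * 7 * c * c                # single 7x7 kernel: ~49 * c^2 weights
params_1x7_7x1 = (1 * 7 * c * c) + (7 * 1 * c * c)   # ~14 * c^2 weights
params_three_3x3 = 3 * (3 * 3 * c * c)    # ~27 * c^2 weights

print(params_7x7, params_1x7_7x1, params_three_3x3)
# The asymmetric factorization uses the fewest weights of the three options.
```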

On the other hand, Inception V3 optimizes the structure of the Inception Module. The Inception Module now has three different configurations for the 35 × 35, 17 × 17 and 8 × 8 feature-map sizes, as shown in Figure 3.22. These Inception Modules only appear in the later part of the network, while the front consists of ordinary convolutional layers. In addition to branching inside the Inception Module, Inception V3 also uses branches within a branch (in the 8 × 8 structure), which can be described as Network in Network in Network (Figure 3.23).

Figure 3.23: Inception V3 module

Inception-v4, as described in [33], consists of four main building blocks: the stem, Inception A, Inception B and Inception C. To quote the authors, Inception-v4 is a deeper, wider and more uniform simplified architecture than Inception-v3. The stem, Reduction A and Reduction B use max-pooling, whereas Inception A, Inception B and Inception C use average pooling. The stem, module B and module C use separable convolutions.

3.4.5 ResNet

ResNet (Residual Neural Network) [33] was proposed by Kaiming He of Microsoft Research. Through the use of Residual Units (Figure 3.24), it successfully trained a 152-layer deep neural network and won the championship in ILSVRC 2015, achieving a 3.57% top-5 error rate. At the same time, the number of parameters is lower than in VGGNet, and the effect is very prominent. The structure of ResNet greatly speeds up the training of ultra-deep neural networks compared with plain architectures such as AlexNet and VGGNet-19, and the accuracy of the model is also superior.

Figure 3.24: Residual unit of ResNet

Assume that the input of a certain neural network block is x and the expected output is f(x). If we pass the input x directly to the output as the initial result, then the target we need to learn becomes the residual F(x) = f(x) − x. As shown in the figure, this is a residual unit of ResNet. ResNet is therefore equivalent to changing the learning target: it no longer learns a complete output, but only the difference between the output and the input, that is, the residual.

Figure 3.25 shows a comparison of VGGNet-19, a 34-layer plain convolutional network, and a 34-layer ResNet. It can be seen that the biggest difference between an ordinary directly connected convolutional neural network and ResNet is that ResNet has many bypass branches that connect the input directly to later layers, so that those layers can directly learn the residuals. This structure is also called a shortcut or skip connection.

When a traditional convolutional layer or fully connected layer transmits information, there is always some information loss. ResNet solves this problem to some extent: by passing the input information directly to the output, the integrity of the information is protected, and the whole network only needs to learn the difference between input and output, which simplifies the learning objective and its difficulty.

In ResNet's paper, in addition to the proposed two-layer residual learning unit, there is also a three-layer residual learning unit (Figure 3.26). The two-layer unit contains two 3 × 3 convolutions with the same number of output channels (because the residual equals the target output minus the input, the input and output dimensions must be consistent). The three-layer unit borrows the 1 × 1 convolution idea from Network In Network and Inception Net: a 1 × 1 convolution is applied before and after the middle 3 × 3 convolution, first reducing and then restoring the number of channels. In addition, if the input and output dimensions differ, we can apply a linear mapping transformation to x before connecting it to the next layer.
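A minimal Keras sketch of such a residual unit, including the 1 × 1 projection shortcut for mismatched dimensions, is given below; the filter counts, batch-normalization placement and activation choice follow common conventions and are assumptions rather than the exact configuration of the original paper.

```python
# Minimal sketch of a ResNet residual unit with an optional projection shortcut.
import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x, filters, stride=1):
    shortcut = x
    # Two 3x3 convolutions learning the residual F(x)
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Linear projection of x when the input and output dimensions do not match
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    # Output is F(x) + x, so the stacked layers only need to learn the residual
    return layers.ReLU()(layers.Add()([y, shortcut]))

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_unit(inputs, filters=128, stride=2)   # projection shortcut used here
model = tf.keras.Model(inputs, outputs)
```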

Overall, the advantages of ResNet can be summarized in three points: (1) The residual network has no direct advantage in model characterization; ResNets do not characterize any particular aspect better, but they allow much deeper models to be represented. (2) The residual network makes the feed-forward/back-propagation computation very smooth; to a large extent, it makes deeper models easier to optimize. (3) The shortcut connections neither add extra parameters nor increase computational complexity: a shortcut simply performs an identity mapping and adds its output to the output of the stacked layers. With backpropagation and SGD, the entire network can still be trained in an end-to-end fashion.

Figure 3.25: Compare VGG and ResNet structure

Figure 3.26: Residual unit of two and three layer ResNet

3.4.6 Siamese Network

Unlike most other neural networks, the Siamese network [6] is seldom used directly for classification. The output of a Siamese network is the similarity between two different images.

The Siamese network is a similarity-measure method. When the number of categories is large but the number of samples per category is small, it can be used for category identification and classification. Traditional classification methods require knowing exactly which class each sample belongs to and need an exact label for every sample; when the number of categories is too large and the number of samples per category is relatively small, those methods are less applicable.

The Siamese network learns a similarity metric from the data and uses this learned metric to compare and match samples of new, unknown classes. This method can be applied to classification problems where there are many categories, or where the entire training sample set cannot be used for training with the previous methods.

Figure 3.27: Siamese Architecture [6]

The architecture of the Siamese Network can be represented as Figure 3.27. Let X1 and X2 be a pair of images shown to our learning machine, and let Y be a binary label of the pair: Y = 0 if the images X1 and X2 belong to the same object and Y = 1 otherwise. Let W be the shared parameter vector that is subject to learning, and let GW(X1) and GW(X2) be the two points in the low-dimensional space generated by mapping X1 and X2. Then our system can be viewed as a scalar energy function EW(X1, X2) that measures the compatibility between X1 and X2. It is defined as:

E_W(X_1, X_2) = \lVert G_W(X_1) - G_W(X_2) \rVert    (3.5)

We can define the loss function as:

L(W) = \sum_{i=1}^{P} L\big(W, (Y, X_1, X_2)^i\big)    (3.6)

L\big(W, (Y, X_1, X_2)^i\big) = (1 - Y)\, L_G\big(E_W(X_1, X_2)^i\big) + Y\, L_I\big(E_W(X_1, X_2)^i\big)    (3.7)

Here (Y, X1, X2)^i denotes the i-th sample, consisting of a pair of images and a label; LG computes the loss for a same-class (genuine) pair, LI the loss for a different-class (impostor) pair, and P is the number of training samples (pairs). Based on this design, the difference between the two classes becomes more obvious. The objective can be expressed in terms of EW under the condition that there exists m > 0 such that EW(X1, X2) + m < EW(X1, X2'), where (X1, X2) is a genuine pair and (X1, X2') is an impostor pair. Based on this condition, we can write the loss function as:

H\big(E_W^G, E_W^I\big) = L_G\big(E_W^G\big) + L_I\big(E_W^I\big)    (3.8)

For a single pair:

L(W, Y) = (1 - Y)\, L_G(E_W) + Y\, L_I(E_W)    (3.9)

In the Siamese Network architecture, the loss function lowers the importance of the individual label and turns the samples into pairs. Even though more training data would still lead to a better result, a design that focuses only on the difference between two images is useful for small datasets, and compared with other classification networks it is a more efficient design in that regime. The returned label is 0 if the two images belong to the same class and 1 otherwise. The training data is effectively multiplied because all the images in the training set can be paired up with each other, which increases the effective training-set size from n samples to on the order of n^2 pairs. Besides, since the structure and shape of neuro images are relatively regular, the difficulties of face detection and face orientation do not arise in this experimental setting.
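The following NumPy sketch illustrates the pairwise energy and loss of Equations (3.5)-(3.9); the margin value and the specific quadratic forms chosen for LG and LI follow the common contrastive-loss formulation and are assumptions, not necessarily the exact functions used in [6].

```python
# Minimal sketch of the Siamese pairwise energy and a contrastive-style loss.
import numpy as np

def energy(g1, g2):
    # E_W(X1, X2) = || G_W(X1) - G_W(X2) ||
    return np.linalg.norm(g1 - g2)

def pair_loss(g1, g2, y, margin=1.0):
    # y = 0 for a genuine (same-class) pair, y = 1 for an impostor pair
    e = energy(g1, g2)
    loss_same = e ** 2                        # L_G: pull genuine pairs together
    loss_diff = max(margin - e, 0.0) ** 2     # L_I: push impostor pairs at least m apart
    return (1 - y) * loss_same + y * loss_diff

# Example embeddings G_W(X1), G_W(X2) produced by the shared-weight network
g1, g2 = np.array([0.2, 0.9]), np.array([0.25, 0.85])
print(pair_loss(g1, g2, y=0), pair_loss(g1, g2, y=1))
```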

3.4.7 Hybrid CNN-Siamese network(HCSNet)

As Section 3.4.6 mentioned, the input unit of a Siamese network is a pair of images. Therefore, if only one sample image is given, instead of feeding this single image into the Siamese network, the Siamese architecture forms the input as a pair: one test image and one image from the training set. The output of the Siamese network is the similarity of each pair, represented as a floating-point number between 0 and 1. In this project, we propose the Hybrid CNN-Siamese Network (HCSNet), which uses the similarity result calculated by the Siamese network to refine the result calculated by another CNN.

There is no residual unit in the Siamese network, and its convolutional layers are not deep enough to handle complex computation. A traditional CNN such as Inception ResNet V2 provides solid performance when the training set is large enough, because it has multiple nested convolutional layers and an efficient residual-unit design; those are advantages the Siamese network does not have. However, the Siamese network can provide a similarity comparison even when the number of images in the training set is small. HCSNet aims to combine the advantage of the Siamese network (template model technique) with that of a common CNN (machine learning technique).

Figure 3.28: Workflow of HCSNet

There are two CNN components included in HCSNet: a template-model CNN component and a machine-learning CNN component. In this research, the template-model CNN component is implemented by the Siamese network. Inception ResNet V2 is selected as the machine-learning CNN component because its architecture combines Inception V3 and ResNet. Because of Inception ResNet V2's outstanding computing capability, the classification result calculated by Inception ResNet V2 plays the more important role; the Siamese network acts more like a post-processing filter applied after Inception ResNet V2 finishes testing.

Figure 3.28 shows the workflow of HCSNet. The input is sent to both Inception ResNet V2 and the Siamese network. For Inception ResNet V2, if the test image is classified as normal, the value 'init' is set to 0.65; otherwise it is set to 0.35. For the Siamese network, it first eliminates any training case whose similarity to the test image is lower than 60%. This step excludes the training neuro images that are not at the same depth (z-axis position) as the test image. After this step, the test image is paired with the remaining images in the normal and abnormal sets and the similarity of each pair is computed. The average similarity between the test image and the remaining normal set is denoted 'np', and the average similarity between the test image and the remaining abnormal set is denoted 'ap'. After np and ap are calculated, we obtain the final output value f by: f = init + np − ap. If f > 0.5, the test image is treated as a normal case; otherwise it is treated as an abnormal case.

Because the comparison threshold is set to 60%, the values of np and ap lie in the range 0.6 to 1.0. To overturn the result calculated by Inception ResNet V2, the difference between np and ap has to be larger than 0.15 and the Siamese network's decision must differ from Inception ResNet V2's. This ensures that the result calculated by the machine-learning CNN part is only overridden when the difference between the test image's similarity to the two classes is large. This design not only inherits the calculation result from Inception ResNet V2, but also maintains performance when the dataset is small, thanks to the Siamese network's design.
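The decision rule described above can be summarized by the sketch below; the function and variable names are placeholders, while the thresholds (60% similarity, init values of 0.65/0.35, and the 0.5 decision boundary) follow the values stated in this section.

```python
# Minimal sketch of the HCSNet fusion rule: f = init + np - ap.
import numpy as np

def hcsnet_decision(cnn_says_normal, sims_normal, sims_abnormal, depth_thresh=0.6):
    # Inception ResNet V2 contribution: init = 0.65 if classified normal, else 0.35
    init = 0.65 if cnn_says_normal else 0.35
    # Keep only training images whose similarity to the test image exceeds the
    # depth threshold (filters out slices at a different z-axis position)
    keep_n = [s for s in sims_normal if s >= depth_thresh]
    keep_a = [s for s in sims_abnormal if s >= depth_thresh]
    np_val = np.mean(keep_n) if keep_n else 0.0   # average similarity to the normal set
    ap_val = np.mean(keep_a) if keep_a else 0.0   # average similarity to the abnormal set
    f = init + np_val - ap_val
    return "normal" if f > 0.5 else "abnormal"

# Hypothetical similarity scores from the Siamese branch
print(hcsnet_decision(True, sims_normal=[0.82, 0.78, 0.55], sims_abnormal=[0.65, 0.62]))
```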

Since the calculation parts of the two networks are isolated, the machine-learning CNN part of HCSNet can be switched to other CNNs. In Chapter 4, we present a comparison architecture that combines Inception V3 and the Siamese network with the same parameter settings.

3.5 Pre-processing

Unlike other image experiments for CNNs, data augmentation is not an appropriate method for this approach, because the position of each component in the images matters. Common data augmentation methods such as rotating or flipping would generate an incorrect template model because the positions of the brain's structures would be wrong. Because of the high similarity between neuro images, data augmentation may even increase the error rate.

However, the drawback of forgoing data augmentation is partly compensated by the Siamese network: it creates multiple pairs of studies from the given training set, which means the number of training samples (pairs) grows roughly as the square of the original set size. That is also a key reason why the Siamese network is helpful to HCSNet, especially when the dataset is small.
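A minimal sketch of this pairing step is shown below; the image placeholders and labels are illustrative, and in practice the pairs would be built from the actual 2-D training slices.

```python
# Minimal sketch of turning n labeled training images into labeled pairs for the
# Siamese branch (0 = same class, 1 = different class), following the convention above.
from itertools import combinations

def make_pairs(images, labels):
    pairs, pair_labels = [], []
    for i, j in combinations(range(len(images)), 2):
        pairs.append((images[i], images[j]))
        pair_labels.append(0 if labels[i] == labels[j] else 1)
    return pairs, pair_labels

imgs = ["img_a", "img_b", "img_c", "img_d"]            # placeholders for 2-D slices
labs = ["normal", "normal", "abnormal", "abnormal"]
pairs, pair_labels = make_pairs(imgs, labs)
print(len(pairs), pair_labels)   # n*(n-1)/2 pairs, i.e. on the order of n^2
```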

The training set used in this experiment is the same as the one we used in [1]. Considering that the z-axis position matters in neuro imaging, all the images selected for the training set lie at similar positions along the z-axis. We used color maps to preprocess the images so as to show the contrast between the different components inside the neuro images.
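A rough sketch of such a color-map preprocessing step is given below, assuming matplotlib; the choice of the 'jet' color map and the min-max normalization are assumptions, not necessarily the exact mapping used in this work.

```python
# Minimal sketch of applying a color map to a grayscale CT slice.
import numpy as np
import matplotlib.cm as cm

def apply_colormap(slice_2d):
    # Normalize the slice to [0, 1], then map intensities to RGB so that the
    # contrast between brain components becomes more visible to the network.
    lo, hi = slice_2d.min(), slice_2d.max()
    normed = (slice_2d - lo) / (hi - lo + 1e-8)
    rgba = cm.jet(normed)                     # shape (H, W, 4)
    return (rgba[..., :3] * 255).astype(np.uint8)

slice_2d = np.random.rand(256, 256)           # placeholder for a 2-D CT slice
print(apply_colormap(slice_2d).shape)         # (256, 256, 3)
```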

3.6 Analysis Techniques

3.6.1 Qualitative Analysis by example

The most basic analysis technique is directly observing examples which the network correctly predicted with high certainty, as well as examples the classifier got wrong with high certainty. Those examples can be arranged by applying t-SNE [34]. As illustrated by the design in Figure 3.2, the input of a CNN can be a vector with multiple features and the output a single value; because of this structure, most CNN implementations are used for classification. Even though there are multiple implementations [16][21][22][23][24] that focus on generating a model template, most of them only detect a specific region based on an image template rather than segmenting it. Therefore, the model here is unlike other research [8] that focuses on a specific disease; this implementation compares the test case against the normal template to judge whether it is normal or not.

The training set is made of normal cases; therefore, the output for the test image is the similarity between the test image and normal neuro images. Note that the definition of normal here depends on the training set. A test case is not necessarily abnormal just because the network outcome shows that it does not belong to the normal set; rather, such an outcome indicates that the system recommends a doctor perform further segmentation on this test case.

The training set described above is for the traditional CNN method introduced in Section 3.1, which provides an output result for classification. The classification performance of the multiple CNNs introduced in Section 3.4 can be evaluated and compared by the analysis methods in Section 3.6. However, there is also another way to generate a template for the normal case. An image can be converted into a matrix of vectors and used as the CNN input, but the output does not have to be a single value. Instead, a standard template can be generated from the dataset, similar to the design of FaceNet [28]: they represent the structure of a face by hundreds of points, use the point positions to generate a template from one dataset, and then compare the similarity between templates to judge whether a sample belongs to that dataset or not. Because there are common structures in the human brain, it is also possible to generate a template from a given dataset.

This method can be implemented with a Siamese network, which uses a CNN to compare the similarity of two images. Because a CNN is used, the comparison process includes convolutional feature maps, which capture more characteristics than a traditional pixel-by-pixel comparison. For this method, only the normal training set is used, because the key point is to compare the similarity between the test image and normal cases. The output value is a float between 0 and 1 instead of a classification output; based on this output, a classification result can be generated by applying a threshold to the output [34].

Even though this approach can reveal differences in the distribution of validation data that are not covered by the training set, and thus indicate the need to collect more data, it might also reveal errors in the training data. Most of the time, training data is manually labeled by humans, who make mistakes; if a model is fit to those errors, the possibility of returning a wrong result increases.

3.6.2 Confusion Matrices

A confusion matrix is a specific table layout that allows visualization of the performance of a machine learning algorithm [35]. Mathematically, a confusion matrix can be represented as a matrix c \in \mathbb{N}_{\ge 0}^{K \times K}, where K \in \mathbb{N}_{\ge 2} is the number of classes, covering all correct and wrong classification results [5]. The entry c_{AB} is the number of times items of class A were classified as class B. This means the correct classifications lie on the diagonal entries c_{AA}, and all wrong classifications lie off the diagonal. The sum \sum_{A=1}^{K} \sum_{B=1}^{K} c_{AB} is the total number of samples that were evaluated, and

\frac{\sum_{A=1}^{K} c_{AA}}{\sum_{A=1}^{K} \sum_{B=1}^{K} c_{AB}}

is the accuracy.

In brief, for two classes a confusion matrix is a table with two rows and two columns that reports the number of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN). The result provides more information about the predictive ability for each class.

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal          90 (TP)              10 (FP)           100
Actual: Abnormal        50 (FN)              50 (TN)           100
Total                     140                   60              200
Accuracy: 0.70
Positive predicted value: 0.90 (TP/(TP+FP))

Table 3.1: Example of Confusion matrix

Table 3.1 is an example of a confusion matrix. If we only focus on the overall accuracy, we may conclude that the neural network's performance is not good, because the accuracy is only 0.70. However, if we display the information as a confusion matrix, we find that the neural network does a much better job at detecting normal cases. This information may be overlooked if we do not display the statistics as a confusion matrix.

In this research, the positive predicted value and the accuracy are the two outputs of greatest concern. Accuracy shows how effective the network is at classification. The positive predicted value is the dependent variable that shows how well the network model detects normal cases correctly (in this project, positive represents the normal case). As Chapter 1 described, this project aims to find out how 'normal' a patient is. The positive predicted value is calculated as the number of true positives divided by all the cases detected as positive. It shows how sensitive the network is at correctly detecting normal cases.
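As a minimal sketch, the two quantities can be computed directly from the four counts; the example below reproduces the values of Table 3.1 under the labeling used there.

```python
# Minimal sketch of the two metrics emphasized in this research (positive = normal case).
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def positive_predicted_value(tp, fp):
    # True positives divided by all cases detected as positive
    return tp / (tp + fp)

# Counts from Table 3.1
tp, fp, fn, tn = 90, 10, 50, 50
print(accuracy(tp, fp, fn, tn))             # 0.70
print(positive_predicted_value(tp, fp))     # 0.90
```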

3.6.3 Learning Curves

A learning curve shows the validation and training score of an estimator for varying amounts of training. It is used to find out how much we benefit from adding more training data and whether a bias error exists. There are multiple ways to draw a learning curve; in this thesis, we draw it as a plot where the horizontal axis displays the number of times the training process goes through the data (epochs) and the vertical axis displays the accuracy and loss. The learning curve for the validation set indicates whether more training data will improve the network's performance. In this thesis, we compare the performance of each CNN when the training set is small, and the learning curve is also a helpful tool to directly compare learning rates and error rates between CNNs. Therefore, the learning curve is selected as an analysis technique in this thesis.

The error on the validation set should never be expected to be significantly lower than the error on the training set. If the error on the training set is too high, then more data will not help; instead, the model or the training algorithm needs to be adjusted. In other words, learning curves reveal the efficiency of a CNN.
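A minimal sketch of drawing such a learning curve is given below, assuming a Keras History object returned by model.fit; the metric key names ('accuracy'/'val_accuracy') follow TensorFlow 2.x Keras and may differ in older versions.

```python
# Minimal sketch: plot training/validation accuracy and loss against epochs.
import matplotlib.pyplot as plt

def plot_learning_curve(history):
    epochs = range(1, len(history.history["loss"]) + 1)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(epochs, history.history["accuracy"], label="training")
    ax1.plot(epochs, history.history["val_accuracy"], label="validation")
    ax1.set_xlabel("epoch"); ax1.set_ylabel("accuracy"); ax1.legend()
    ax2.plot(epochs, history.history["loss"], label="training")
    ax2.plot(epochs, history.history["val_loss"], label="validation")
    ax2.set_xlabel("epoch"); ax2.set_ylabel("loss"); ax2.legend()
    plt.tight_layout()
    plt.show()
```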

3.6.4 Others

“Input-feature based model explanations” [36][37] is another widely used analysis technique for CNN implementations, but its architecture is designed for segmentation applications. Therefore, it is not an appropriate analysis technique for this research.

The validation curve [38] is also a technique to verify a neural network's performance. Unlike the learning curve, the validation curve focuses on finding the relation between the parameters of a CNN and the CNN itself [39]. In this thesis, our purpose is to compare each CNN's performance on normal-case neuro images, not to tune a single CNN. Therefore, the validation curve is not selected as an analysis technique in this thesis.

There are other analysis techniques that depend heavily on the inner architecture of the CNN, such as the Argmax method [40][41][42] and feature map reconstructions [41][43]. Because HCSNet is a CNN architecture built from Inception ResNet V2 and the Siamese network, there is no need to add a reconstruction section. Therefore, the confusion matrix is considered the best result analysis technique for this research.

Chapter 4: Experiment and Result

4.1 Parameter setting

Two experiments are included in this thesis. The first experiment aims to find out which network has the best performance on neuro image classification when the training dataset is small. The training and testing datasets comprise 320 and 109 images respectively. All the networks in this experiment are initialized with random weights.

The second experiment is implemented with two pre-trained CNN models. Experiment I reveals the difference in each network's efficiency when processing neuro images; however, the low accuracy of models trained from scratch is not sufficient for a widely used application, which is why the second experiment is necessary. Experiment II shows the results of Inception V3 (the method used by Prevedello et al. [1]), Inception ResNet V2 (the best performer in Experiment I), Inception V3 with a Siamese network post-processing filter, and HCSNet.

To evaluate the performance of the CNNs, accuracy is one of the most important criteria. Moreover, as Chapter 1 described, this thesis also aims at a high positive predicted value in the confusion matrix. Hence, all the experimental results are displayed as confusion matrices reporting both the accuracy and the positive predicted value.

4.1.1 Implementation of networks trained from scratch

All the CNNs run on an Ubuntu machine and are implemented with the TensorFlow-Slim package. The parameter settings in the experiments are:

• Weight decay=0.00004

• Learning rate=0.001

• Batch size = 32

The weight decay is a parameter to prevent over-fitting. It will not affect the result when the value is small enough. In this experiment, all the weight decays are set to 0.00004.

The learning rate determines how quickly a network abandons old weight values in favor of new ones. For a CNN with many layers and parameters, a learning rate that is too high makes the weight updates too aggressive, and the network may fail to converge. In this experiment, all the CNNs could be trained properly with a 0.001 learning rate, so to compare the networks fairly, the learning rate is set to 0.001 for all of them.

The batch size is the parameter that relates the number of training steps to epochs1. In the TensorFlow-Slim package, progress is recorded as a number of steps rather than epochs; with a fixed batch size, the number of steps becomes a measurable quantity comparable to an epoch. In general,

number of epochs = (number of steps × batch size) / number of training images.

There are two training sets used in this experiment. One is the same dataset as in Prevedello et al. [1], which includes images from any position in the brain (marked as Set 1).

1 An epoch is defined as one pass of the machine through the entire training set.

            Training (Set1)   Testing (Set1)   Training (Set2)   Testing (Set2)
Normal            160               59               120               59
Abnormal          160               50               120               50

Table 4.1: DataSet distribution on Experiment I

The other set only contains the neuro images with a similar z-axis position, which is a subset of the original dataset (marked as Set 2). The same setting is also used for the validation sets, which are isolated from the training sets. All the images are 2-D slices from CT images.

There are two reasons why there are more abnormal images than normal images in the validation sets: (1) the normal cases are stable, whereas the abnormal cases come in various types; (2) the purpose of the project is not to diagnose which disease a patient may have, but to decide whether the patient is healthy or not. Increasing the number of abnormal cases lets us focus not only on the overall accuracy but also on the accuracy for normal cases. Since the dataset is relatively small, adding more abnormal validation cases makes the difference in each network's ability to detect normal cases more obvious.

4.1.2 Implementation on pre-trained network

Experiment II is implemented with pre-trained networks instead of networks trained from scratch. About 2,750 images are used for training in this experiment and about 300 for testing; the details are listed in Table 4.2. The implementation platform in this experiment is Keras with a TensorFlow backend. Both pre-trained models are trained for 100 epochs.

            Training Sets   Validation Sets   Testing Sets    Sum
Normal          1382             248              166         1796
Abnormal        1383             248              161         1792
Sum             2765             496              327         3588

Table 4.2: DataSet distribution on experiment II
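A minimal sketch of the kind of pre-trained setup used in Experiment II is shown below, using the Inception ResNet V2 model available in tf.keras.applications; the input size, classifier head, optimizer and layer-freezing strategy are assumptions, not the exact configuration used in this thesis.

```python
# Minimal sketch: fine-tune a pre-trained Inception ResNet V2 for normal/abnormal classification.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionResNetV2

base = InceptionResNetV2(weights="imagenet", include_top=False,
                         input_shape=(299, 299, 3))
base.trainable = False                               # keep pre-trained features fixed at first

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # normal vs. abnormal

model = models.Model(base.input, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels),
#           epochs=100, batch_size=32)
```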

4.2 Result and Statistics on Experiment I

From Figure 4.1 and Figure 4.2, it is clear that HCSNet does a better job than the other networks; it is the only network reaching 70% accuracy. LeNet reveals an overfitting issue on both training sets even though the weight decay is small: its validation accuracy actually decreases as the number of steps increases. AlexNet shows poor performance on both datasets. Even though their overall accuracy is relatively low, VGG and Inception V4 show good learning curves as the number of steps increases, so they can be expected to perform better with larger datasets or more training time. Among the common CNNs, Inception ResNet is the best choice, showing a good learning curve and stable accuracy. Since the Siamese network is used here by comparing similarities rather than training over epochs, it is not included in this learning-curve comparison.

Table 4.3 to Table 4.18 show the confusion matrix of each network. Based on the analysis in Chapter 3, the confusion matrix is selected as the measurement method: the dataset size may not be large enough to apply a more complex observation method such as feature map reconstruction, and the confusion matrix directly serves one of the primary purposes of this research, which is to find out the accuracy of detecting normal cases.

Figure 4.1: Different network validation accuracy vs. number of steps in set 1

Figure 4.2: Different network validation accuracy vs. number of steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             26                   24              50
Actual: Abnormal           24                   35              59
Total                      50                   59             109
Accuracy: 0.559
Positive predicted value: 0.52

Table 4.3: Result of LeNet after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             28                   22              50
Actual: Abnormal           29                   30              59
Total                      57                   52             109
Accuracy: 0.532
Positive predicted value: 0.56

Table 4.4: Result of LeNet after 5000 steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             22                   28              50
Actual: Abnormal           29                   30              59
Total                      51                   58             109
Accuracy: 0.477
Positive predicted value: 0.44

Table 4.5: Result of AlexNet after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             22                   28              50
Actual: Abnormal           34                   25              59
Total                      56                   53             109
Accuracy: 0.431
Positive predicted value: 0.44

Table 4.6: Result of AlexNet after 5000 steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             29                   21              50
Actual: Abnormal           32                   27              59
Total                      61                   48             109
Accuracy: 0.513
Positive predicted value: 0.58

Table 4.7: Result of VGG-16 after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             28                   22              50
Actual: Abnormal           26                   33              59
Total                      54                   55             109
Accuracy: 0.559
Positive predicted value: 0.56

Table 4.8: Result of VGG-16 after 5000 steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             24                   26              50
Actual: Abnormal           24                   35              59
Total                      48                   61             109
Accuracy: 0.513
Positive predicted value: 0.48

Table 4.9: Result of Inception v3 after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             24                   26              50
Actual: Abnormal           24                   35              59
Total                      48                   61             109
Accuracy: 0.541
Positive predicted value: 0.48

Table 4.10: Result of Inception v3 after 5000 steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             30                   20              50
Actual: Abnormal           25                   34              59
Total                      55                   54             109
Accuracy: 0.587
Positive predicted value: 0.60

Table 4.11: Result of Inception v4 after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             31                   19              50
Actual: Abnormal           27                   32              59
Total                      58                   51             109
Accuracy: 0.577
Positive predicted value: 0.622

Table 4.12: Result of Inception v4 after 5000 steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             32                   18              50
Actual: Abnormal           22                   37              59
Total                      54                   55             109
Accuracy: 0.642
Positive predicted value: 0.64

Table 4.13: Result of Inception ResNet after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             32                   18              50
Actual: Abnormal           22                   37              59
Total                      54                   55             109
Accuracy: 0.633
Positive predicted value: 0.64

Table 4.14: Result of Inception ResNet after 5000 steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             37                   13              50
Actual: Abnormal           27                   32              59
Total                      64                   45             109
Accuracy: 0.633
Positive predicted value: 0.74

Table 4.15: Result of Siamese Network after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             37                   13              50
Actual: Abnormal           32                   27              59
Total                      69                   40             109
Accuracy: 0.587
Positive predicted value: 0.74

Table 4.16: Result of Siamese Network after 5000 steps in set 2

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             39                   11              50
Actual: Abnormal           23                   36              59
Total                      62                   47             109
Accuracy: 0.688
Positive predicted value: 0.74

Table 4.17: Result of HCSNet after 5000 steps in set 1

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal             40                   10              50
Actual: Abnormal           22                   37              59
Total                      62                   47             109
Accuracy: 0.706
Positive predicted value: 0.80

Table 4.18: Result of HCSNet after 5000 steps in set 2

From the tables above, most of the networks show similar performance between the two training sets. Even though the overall accuracy is restricted by the CNN architecture and layer functionality, their performance on detecting normal cases is not as good as expected. The Siamese network, however, shows outstanding performance on normal case detection; one reason may be that in Set 2 the structures of the normal cases are similar, and the Siamese network retains a higher normal-case accuracy, which makes it a suitable complement to other CNNs. Therefore, the combination of Inception ResNet (the best performer among the general networks) and the Siamese network provides a better outcome, especially on normal case accuracy.

4.3 Result and Statistics on Experiment II

As the results in Figure 4.3 and Figure 4.5 show, the validation accuracy obtained from both Inception V3 and Inception ResNet V2 is higher than 90%; Inception ResNet's accuracy is slightly higher and reaches around 95% after 30 epochs. Compared with the scratch-network models and the smaller dataset, which only reached 60-70% accuracy, the accuracy obtained with pre-trained models and a roughly nine-times-larger dataset improves by about 30 percentage points.

The testing accuracy for these two networks is listed in Table 4.19 and Table 4.20. Inception V3 has a lower accuracy than Inception ResNet V2; however, it is interesting to note that the positive predicted value of Inception V3 is 94.4%, which is higher than that of Inception ResNet V2.

The implementation functions and parameters are the same as introduced in Section 3.4.6. The results are listed below.

Figure 4.3: Validation accuracy vs. epoch, Inception V3, Experiment II

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal            152                    9             161
Actual: Abnormal           23                  133             166
Total                     175                  142             327
Accuracy: 0.871
Positive predicted value: 0.944

Table 4.19: Result of Inception V3 in Experiment II

Figure 4.4: Loss vs. epoch, Inception V3, Experiment II

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal            140                   21             161
Actual: Abnormal           11                  155             166
Total                     151                  176             327
Accuracy: 0.902
Positive predicted value: 0.869

Table 4.20: Result of Inception ResNet V2 in Experiment II

Figure 4.5: Validation accuracy vs. epoch, Inception ResNet V2, Experiment II

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal          154 (+2)              7 (-2)           161
Actual: Abnormal         19 (-4)            137 (+4)           166
Total                     173                  144             327
Accuracy: 0.889
Positive predicted value: 0.956

Table 4.21: Result of Inception V3 and Siamese Network in Experiment II

Figure 4.6: Loss vs. epoch, Inception ResNet V2, Experiment II

                    Predicted: Normal   Predicted: Abnormal   Total
Actual: Normal          148 (+8)             13 (-8)           161
Actual: Abnormal         11 (+0)            155 (+0)           166
Total                     159                  168             327
Accuracy: 0.926
Positive predicted value: 0.919

Table 4.22: Result of HCSNet in Experiment II

We notice that both Inception V3 and Inception ResNet V2 show better results after the Siamese network post-processing, but the high accuracy mostly relies on the performance of the CNN in the first step. Since the first step already reaches over 90% validation accuracy, the images that need to be double-checked by the Siamese network are mostly difficult cases, so it is understandable that the improvement in accuracy after adding the Siamese post-processing step is not large. Besides, even though the improvement in true-positive cases in HCSNet is larger than in the Inception V3 variant, the overall positive predicted value of Inception V3 is still higher than that of HCSNet because of the good performance in the first step.

Comparing the results of Experiment I and Experiment II, the number of training cases strongly affects CNN performance. HCSNet performs well in both experiments, and the Siamese network does not have a negative effect in either experiment. Moreover, if the dataset is not large enough, Inception ResNet V2 with the Siamese network improves the performance compared to the other networks on the neuro image task.

Chapter 5: Conclusion and Future Discussion

5.1 Conclusion

In this thesis, we proposed the Hybrid CNN-Siamese Network (HCSNet), a Convolutional Neural Network (CNN) architecture used for 2D neuro image classification. The purpose of this research is to design a CNN architecture that is sensitive to normal case diagnosis instead of disease detection. HCSNet combines Inception ResNet V2 and a Siamese network to achieve better performance than either of them alone. In HCSNet's architecture, the Siamese network is treated as a post-processing filter that double-checks the cases marked as abnormal by Inception ResNet V2.

HCSNet performs better than the other traditional CNNs when the dataset is small (Experiment I), especially because of the Siamese network part. If the images in the training set are mostly at a similar depth, the combination of Inception ResNet V2 and the Siamese network provides a better result than Inception ResNet V2 alone.

With a larger training set (about 2,750 images), the CNN shows its advantage in handling deep multi-layer processing: classification accuracy is approximately 20% higher when the larger training set is used. HCSNet still performs well, and the overall accuracy reaches 92.6%. However, it is interesting to note that "Inception V3 + Siamese Network" has a higher positive predicted value (accuracy of detecting normal cases) than HCSNet. Moreover, because of the high performance of Inception ResNet and Inception V3, there is no particular enhancement after the Siamese network is added. This suggests that if the training set is large enough to cover most of the possible outcomes, the benefit of the Siamese network is limited. Overall, HCSNet provides better classification accuracy than traditional machine learning methods when processing neuro images.

5.2 Future work and discussion

5.2.1 Datasets size

From the results in Chapter 4, it is clear that the performance of HCSNet improves as the dataset grows. Therefore, if this CNN is applied to a larger dataset, it may achieve a better result than the current 92.6% accuracy.

5.2.2 Siamese network parameter setting

In this experiment, we set a 60% similarity threshold in HCSNet before comparison, to avoid the CNN classifying a test image into a certain class based only on its z-axis position. The value of init is defined as either 0.65 or 0.35, which may not be the ideal setting for this architecture; the result may change if these parameters are changed, which requires more experiments in the future.

5.2.3 Segmentation region

In this research, the purpose of adding the Siamese network to the architecture is to find out how similar a test image is to a normal case. However, there is another possible usage: finding out how dissimilar two specific regions are. It is possible to let the Siamese network decide which region is the most dissimilar part and which region has a higher possibility of being the ROI for further research. This could be a future segmentation development of the current design.

5.2.4 3D object processing

HCSNet's current input is a 2D slice from a 3D image. It is reasonable to expect that a 3D image would provide more information than 2D, in particular about the locations of the brain's components. With a 3D object, the depth of each slice can be counted as another parameter, providing more information than 2D slice classification.

Bibliography

[1] Luciano M Prevedello, Barbaros S Erdal, John L Ryu, Kevin J Little, Mutlu Demirer, Songyue Qian, and Richard D White. Automated critical test findings identification and online notification system using artificial intelligence in imaging. Radiology, 285(3):923–931, 2017.

[2] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[3] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[5] Martin Thoma. Analysis and optimization of convolutional neural network architectures. arXiv preprint arXiv:1707.09725, 2017.

[6] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.

[7] Stephanie Powell, Vincent A Magnotta, Hans Johnson, Vamsi K Jammalamadaka, Ronald Pierson, and Nancy C Andreasen. Registration and machine learning-based automated segmentation of subcortical and cerebellar brain structures. Neuroimage, 39(1):238–247, 2008.

[8] Michael Hutchinson and Ulrich Raff. Structural changes of the substantia nigra in parkinson’s disease as revealed by mr imaging. American journal of neurora- diology, 21(4):697–701, 2000.

[9] Jeffrey R Petrella, R Edward Coleman, and P Murali Doraiswamy. Neuroimaging and early diagnosis of alzheimer disease: a look to the future. Radiology, 226(2):315–336, 2003.

[10] Jyrki MP Lötjönen, Robin Wolz, Juha R Koikkalainen, Lennart Thurfjell, Gunhild Waldemar, Hilkka Soininen, Daniel Rueckert, Alzheimer's Disease Neuroimaging Initiative, et al. Fast and robust multi-atlas segmentation of brain magnetic resonance images. Neuroimage, 49(3):2352–2365, 2010.

[11] Hongzhi Wang, Jung W Suh, Sandhitsu R Das, John B Pluta, Caryne Craige, and Paul A Yushkevich. Multi-atlas segmentation with joint label fusion. IEEE transactions on pattern analysis and machine intelligence, 35(3):611–623, 2013.

[12] Kolawole O Babalola, Tim F Cootes, Carole J Twining, Vlad Petrovic, and Chris Taylor. 3d brain segmentation using active appearance models and local regressors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 401–408. Springer, 2008.

[13] Anil Rao, Paul Aljabar, and Daniel Rueckert. Hierarchical statistical shape analysis and prediction of sub-cortical brain structures. Medical , 12(1):55–68, 2008.

[14] Jing Yang and James S Duncan. 3d image segmentation of deformable objects with joint shape-intensity prior models using level sets. Medical image analysis, 8(3):285–294, 2004.

[15] J Dolz, L Massoptier, and M Vermandel. Segmentation algorithms of subcortical brain structures on mri for radiotherapy and radiosurgery: a survey. IRBM, 36(4):200–212, 2015.

[16] Mahsa Shakeri, Stavros Tsogkas, Enzo Ferrante, Sarah Lippe, Samuel Kadoury, Nikos Paragios, and Iasonas Kokkinos. Sub-cortical brain structure segmentation using f-cnn’s. arXiv preprint arXiv:1602.02130, 2016.

[17] Dan Ciresan, Alessandro Giusti, Luca M Gambardella, and Jürgen Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in neural information processing systems, pages 2843–2851, 2012.

[18] Wenlu Zhang, Rongjian Li, Houtao Deng, Li Wang, Weili Lin, Shuiwang Ji, and Dinggang Shen. Deep convolutional neural networks for multi-modality isointense infant brain image segmentation. NeuroImage, 108:214–224, 2015.

[19] Mohammad Havaei, Axel Davy, David Warde-Farley, Antoine Biard, Aaron Courville, Yoshua Bengio, Chris Pal, Pierre-Marc Jodoin, and Hugo Larochelle. Brain tumor segmentation with deep neural networks. Medical image analysis, 35:18–31, 2017.

[20] Sérgio Pereira, Adriano Pinto, Victor Alves, and Carlos A Silva. Brain tumor segmentation using convolutional neural networks in mri images. IEEE transactions on medical imaging, 35(5):1240–1251, 2016.

[21] Noah Lee, Andrew F Laine, and Arno Klein. Towards a deep learning approach to brain parcellation. In Biomedical Imaging: From Nano to Macro, 2011 IEEE International Symposium on, pages 321–324. IEEE, 2011.

[22] Pim Moeskops, Max A Viergever, Adriënne M Mendrik, Linda S de Vries, Manon JNL Benders, and Ivana Išgum. Automatic segmentation of mr brain images with a convolutional neural network. IEEE transactions on medical imaging, 35(5):1252–1261, 2016.

[23] Fausto Milletari, Seyed-Ahmad Ahmadi, Christine Kroll, Annika Plate, Verena Rozanski, Juliana Maiostre, Johannes Levin, Olaf Dietrich, Birgit Ertl-Wagner, Kai Bötzel, et al. Hough-cnn: deep learning for segmentation of deep brain regions in mri and ultrasound. Computer Vision and Image Understanding, 164:92–102, 2017.

[24] Alexander de Brebisson and Giovanni Montana. Deep neural networks for anatomical brain segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 20–28, 2015.

[25] Jose Dolz, Christian Desrosiers, and Ismail Ben Ayed. 3d fully convolutional networks for subcortical segmentation in mri: A large-scale study. NeuroImage, 2017.

[26] Averill M Law, W David Kelton, and W David Kelton. Simulation modeling and analysis, volume 2. McGraw-Hill New York, 1991.

[27] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

[28] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.

[29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[30] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[31] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[32] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[33] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.

[34] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

[35] Tom Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861–874, 2006.

[36] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.

[37] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144. ACM, 2016.

[38] Jonathan Ortigosa-Hernández, Iñaki Inza, and Jose A Lozano. Towards competitive classifiers for unbalanced classification problems: A study on the performance scores. arXiv preprint arXiv:1608.08984, 2016.

[39] Jason Sanders and Edward Kandrot. CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional, 2010.

[40] Martin Thoma. A survey of semantic segmentation. arXiv preprint arXiv:1602.06541, 2016.

[41] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June, 20(14):5, 2015.

[42] Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233–255, 2016.

[43] Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616, 2016.
