Using Convolutional Neural Network to Generate Neuro Image Template
A Thesis
Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University
By
Songyue Qian,
Graduate Program in Department of Electrical and Computer Engineering
The Ohio State University
2019
Master’s Examination Committee:
Prof. Bradley D. Clymer, PhD, Co-Advisor
Assistant Professor Barbaros Selnur Erdal, PhD, Co-Advisor

© Copyright by
Songyue Qian
2019

Abstract
Machine learning techniques implemented with Convolutional Neural Networks (CNNs) have become the state of the art for automatic neuro image classification because of their outstanding computing ability. The template model method, a technique that produces a stable pattern from the positions of different anatomical shapes in the brain, is another automatic method that can classify whether a patient is healthy. However, most machine learning techniques focus only on detecting specific diseases and overlook the patient's overall health, while the template model's architecture may overlook the pattern of the entire brain. Therefore, a better automatic classification technique that focuses on the patient's health status is in demand. The purpose of this research is to design an efficient CNN architecture that is sensitive to normal-case diagnosis instead of disease detection.
In this research, we propose the Hybrid CNN-Siamese Network (HCSNet), a CNN architecture for 2D neuro image classification. HCSNet combines Inception ResNet V2 and a Siamese network to achieve better performance than either method alone. To illustrate why this architecture performs well on neuro imaging, we provide a comprehensive overview of existing techniques for CNN analysis, together with experiments comparing different CNNs' performance on the same neuro image dataset [1]. We merge all the abnormal cases into one abnormal
class, so that there are only two classes (normal/abnormal) in the dataset. Experimental results demonstrate the effectiveness of our proposed CNN architecture. HCSNet obtains 92% overall accuracy and 94.4% accuracy for detecting normal cases. We further discuss potential future work based on this CNN architecture.
This is dedicated to my parents.
Acknowledgments
I would like to thank my two thesis advisors, Prof. Bradley D. Clymer of the Department of Electrical and Computer Engineering at The Ohio State University and Dr. Barbaros Selnur Erdal of the Department of Radiology at The Ohio State University. Dr. Clymer, thank you for your guidance and patience. Dr. Erdal, it is my honor to work with you; you are the best advisor. Thank you.
I would also like to thank all of my friends who have helped me improve the writing. It would have been impossible for me to finish this thesis without your assistance and support. Thank you.
Songyue Qian
Table of Contents
Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
1. Introduction
 1.1 Motivation
 1.2 Problem Statement
 1.3 Organization
2. Background
 2.1 Neuroimaging template
 2.2 Neural Network Architecture
 2.3 Contribution
3. Methodology
 3.1 Overview of Convolutional Neural Network (CNN)
 3.2 Neural Network Training
  3.2.1 Backpropagation
  3.2.2 Training sets setting
 3.3 Neural Network Layers
  3.3.1 Convolutional Layers
  3.3.2 Pooling Layers
  3.3.3 Activation Layers
  3.3.4 Dropout
  3.3.5 Normalization Layers
  3.3.6 Loss Layers
 3.4 Neural Network design
  3.4.1 LeNet
  3.4.2 AlexNet
  3.4.3 VGGNet
  3.4.4 GoogleNet
  3.4.5 ResNet
  3.4.6 Siamese Network
  3.4.7 Hybrid CNN-Siamese network (HCSNet)
 3.5 Pre-processing
 3.6 Analysis Techniques
  3.6.1 Qualitative Analysis by example
  3.6.2 Confusion Matrices
  3.6.3 Learning Curves
  3.6.4 Others
4. Experiment and Result
 4.1 Parameter setting
  4.1.1 Implementation on scratch network
  4.1.2 Implementation on pre-trained network
 4.2 Result and Statistics on Experiment I
 4.3 Result and Statistics on Experiment II
5. Conclusion and Future Discussion
 5.1 Conclusion
 5.2 Future work and discussion
  5.2.1 Datasets size
  5.2.2 Siamese network parameter setting
  5.2.3 Segmentation region
  5.2.4 3D object processing
Bibliography
List of Tables

3.1 Example of Confusion matrix
4.1 DataSet distribution on Experiment I
4.2 DataSet distribution on Experiment II
4.3 Result of LeNet after 5000 steps in set 1
4.4 Result of LeNet after 5000 steps in set 2
4.5 Result of AlexNet after 5000 steps in set 1
4.6 Result of AlexNet after 5000 steps in set 2
4.7 Result of VGG-16 after 5000 steps in set 1
4.8 Result of VGG-16 after 5000 steps in set 2
4.9 Result of Inception V3 after 5000 steps in set 1
4.10 Result of Inception V3 after 5000 steps in set 2
4.11 Result of Inception V4 after 5000 steps in set 1
4.12 Result of Inception V4 after 5000 steps in set 2
4.13 Result of Inception ResNet after 5000 steps in set 1
4.14 Result of Inception ResNet after 5000 steps in set 2
4.15 Result of Siamese Network after 5000 steps in set 1
4.16 Result of Siamese Network after 5000 steps in set 2
4.17 Result of HCSNet after 5000 steps in set 1
4.18 Result of HCSNet after 5000 steps in set 2
4.19 Result of Inception V3 in Experiment II
4.20 Result of Inception ResNet V2 in Experiment II
4.21 Result of Inception V3 and Siamese Network in Experiment II
4.22 Result of HCSNet in Experiment II
List of Figures

3.1 Multiple Neurons with single directions node
3.2 Multiple Neurons with single directions node
3.3 Multiple Neurons with single directions node
3.4 Convolutional Process
3.5 Convolutional Matrix
3.6 Convolutional Matrix with padding=2
3.7 Max and Average pooling
3.8 Sigmoid activation function
3.9 Derivative of Sigmoid Activation Function
3.10 tanh activation function
3.11 Derivative of tanh Activation Function
3.12 ReLU activation function
3.13 Derivative of ReLU Activation Function
3.14 Leaky ReLU Activation
3.15 Before and after applying dropout
3.16 Dropout processing equation [2]
3.17 Architecture of LeNet-5 [3]
3.18 Layer structure of AlexNet [4]
3.19 Parameter number and setting of AlexNet
3.20 Architecture of VGGNet [5]
3.21 Inception module of GoogleNet
3.22 Inception V2 module
3.23 Inception V3 module
3.24 Residual unit of ResNet
3.25 Compare VGG and ResNet structure
3.26 Residual unit of two and three layer ResNet
3.27 Siamese Architecture [6]
3.28 Workflow of HCSNet
4.1 Different network validation accuracy vs. number of steps in set 1
4.2 Different network validation accuracy vs. number of steps in set 2
4.3 Validation accuracy vs. epoch, Inception V3, Experiment II
4.4 Loss vs. epoch, Inception V3, Experiment II
4.5 Validation accuracy vs. epoch, Inception ResNet V2, Experiment II
4.6 Loss vs. epoch, Inception ResNet V2, Experiment II
Chapter 1: Introduction
1.1 Motivation
While advances in medical imaging constantly lead to better image quality and higher segmentation accuracy, manual segmentation remains a time-consuming and laborious process. Hence, automatic segmentation, which is mostly performed by quantitative research algorithms, is in high demand. In general, most quantitative research on neuroimaging focuses on aligning one or several anatomical templates to the target image (via a linear or nonlinear registration process) and transferring segmentation labels from the templates to the image. However, such methods may not be capable of capturing the full anatomical variability of targets, because the generated model may only focus on the major components of the brain structure.
One class of quantitative research algorithms generates a template model for a normal brain. Although the template model is not a common method of medical image classification for all organs of the body, it produces a stable pattern from the positions of different anatomical shapes in the brain. The similarity between this template and the testing image can be calculated to reveal how 'normal' the test case is.
Machine learning, which uses images to train a predictive model that assigns class probabilities to each pixel/voxel, is another widely used technique. With its state-of-the-art ability to segment brain structures [7], machine learning can process large datasets and implement classification or segmentation far beyond the capabilities of human perception.
Most neuroimaging machine learning implementations follow one of the two methods introduced above, the template model and machine learning. The former generates a brain template model to segment each component of the brain. It observes the situation of each brain component, because the top concern is to find out which part matches which component, while the pattern of the entire brain may be overlooked. The machine learning method has typically been used with a training set for a specific disease. This creates a potential pitfall: any new disease in the test image set may be wrongly classified as a normal case. Therefore, a new method that combines the advantages of both methods is desired.
1.2 Problem Statement
In order to find the best CNN design for normal/abnormal classification, Experiment I explores each neural network from the layer's perspective and demonstrates a performance evaluation of different networks under the same hardware conditions.
The training model is not for specific disease detection or component matching. It tells whether the test image belongs to the normal set or not. Hence, as long as the outcome is that the test image does not belong to the normal set, the test image is treated as an abnormal case and is left to the radiologist for further investigation.
Another goal of this research is to change the usage of CNN for neuroimaging.
A Siamese Network is added as a post-processing step in which a CNN applies classification by comparing the similarity between two images. Because the similarity is calculated over the entire image, this method is capable of generating a template without splitting the image into different components, fixing the pitfall of the template model technique.
In the strategy of this research, all abnormal cases are counted in one single set. The classification process does not diagnose the disease, but instead determines whether the patient's situation is normal or not. This strategy not only enhances the detectability of normal cases compared with the component matching method, but it also avoids the possible incorrect diagnosis results of the specific-disease classification method. Once the result determines that the image does not belong to the normal group, it is put on the list of abnormal cases for further examination.
In Experiment II, the entire CNN architecture implements image classification using the template generated from the training sets.
1.3 Organization
In Chapter 2, the background of the neuroimaging template and machine learning techniques is discussed in order to illustrate the major components of this research. In Chapter 3, common machine learning network designs are discussed. This chapter also investigates the components and structure of neural networks and discusses the pros and cons of various machine learning neural network methods.
Chapter 4 compares the performance of each neural network for generating a normal template from the training sets. Chapter 5 offers the conclusion of this research and an overview of planned future research.
Chapter 2: Background
2.1 Neuroimaging template
The purpose of creating a neuroimaging template is to make a region-by-region comparison between the template and the test images. Once a template has been constructed, it models unseen data from the same population and provides an ideal reference image for the data and for statistical analysis. For instance, abnormal volumes and shapes of certain anatomical regions of the brain have been found to be associated with a series of brain disorders, including Alzheimer's and Parkinson's disease [8][9].
In general, the prior-art methods for generating a neuro image template can be divided into four main categories: atlas-based methods [10][11], statistical models [12][13], deformable models [14] and machine learning based classifiers [7][10].
The atlas-based method aligns one or several anatomical templates to the target image using either a linear or a non-linear registration process, and transfers segmentation labels from the templates to the image. This kind of method relies heavily on a registration step, in which the atlases are non-linearly registered to the query image. If the registration step is complex, it takes too much time for processing and aligning. Moreover, atlas-based methods may not be able to capture the full anatomical variability of target subjects and would fail in cases of large misalignments or deformations.
The statistical model method utilizes the training data to learn a parametric model describing the variability of specific brain features such as shape, texture, etc. These approaches might result in overfitting if the number of parameters to be learned is too large, which negatively affects the results. The robustness of this kind of statistical approach might also be affected by noise in the training data. The parameters are updated iteratively by searching in the vicinity of the current solution; therefore, an accurate initialization is required for the parameters to converge to a correct structure.
Unlike the atlas-based method and the statistical model method, segmentation using deformable models is highly flexible and does not require any training data or prior knowledge. Besides, the deformation in this method provides a capability for distinguishing the input structures. The disadvantage of this method is that deformable models depend heavily on the initialization of the segmentation contour and on the stopping criteria, which makes the method highly sensitive to the characteristics of each case.
The fourth method is the machine learning approach [10][15], which uses the training images to learn a predictive model that assigns class probabilities to each pixel/voxel. These probabilities are sometimes used as unary potentials in standard regularization techniques such as graph cuts [16]. Nevertheless, this type of approach usually involves heavy algorithmic design, with carefully engineered, application-dependent features and meta-parameters, which limits its applicability to different brain structures and modalities.
2.2 Neural Network Architecture
Machine learning has emerged as a powerful tool in numerous applications of pattern and speech recognition. Unlike hand-crafted methods, machine learning techniques make it possible to learn hierarchical features that represent different levels of abstraction in a data-driven manner. Among the different types of learning approaches, CNNs [3][4] have shown their potential for solving problems in computer vision. Networks of this type are built from multiple convolution, pooling and fully-connected layers, whose parameters are learned using backpropagation. The network surpasses traditional architectures through two properties: local connectivity and parameter sharing. The units in the hidden layers of a CNN are only connected to a small number of units, corresponding to a spatially localized region. This reduces the number of parameters in the net, the memory/computational burden, and the risk of overfitting. Moreover, CNNs also reduce the number of learned parameters by sharing the same basis functions (i.e., convolution filters) across different image locations.
CNNs have seen many applications in neuroimaging, among many others in medical imaging [17][18][19][20]. Ciresan et al. [17] have shown that the CNN allows accurate segmentation of neuronal membranes in electron microscopy images. In the same study, a sliding-window strategy is applied to predict the class probabilities of each pixel by using the patches centered at the pixels as input to the network. CNN-based methods make it possible to segment three brain tissues (white matter, gray matter and cerebrospinal fluid) from multi-sequence Magnetic Resonance Images (MRI) of infants [18]. An important drawback of this strategy is that its label prediction is based on localized information. Also, this strategy is typically slow, since the prediction must be carried out for each pixel.
A 2D image corresponding to a single plane can be another type of CNN input. In medical imaging, 2D images are obtained as slices from 3D medical images. Deep CNNs have been investigated for glioblastoma tumor segmentation [19], under an architecture with several pathways, which modeled both local and global-context features. It is also feasible to apply a different CNN architecture to segment brain tumors in MRI data, exploring the use of small convolution kernels [20].
Similarly, several recent studies investigated CNNs for segmenting subcortical brain structures [16][21][22][23][24]. Lee et al. [21] presented a CNN-based approach to learn discriminative features from expert-labelled MR images. Moeskops et al. [22] applied CNNs to segment brain structures from five different datasets, and listed performance for subjects in different age groups (ranging from pre-term infants to older adults). A multiscale patch-based strategy is used to improve these results, where patches of different sizes are extracted around each pixel as the inputs to the network.
Although medical images are often in the form of 3D volumes (e.g., MRI or computed tomography scans), most existing CNN approaches use a slice-by-slice analysis of 2D images. An obvious advantage of a 2D approach, compared to one using 3D images, is its lower computational and memory requirements. Furthermore, 2D inputs accommodate pre-trained networks, either directly or via transfer learning. However, an important drawback of this approach is that anatomic context in directions orthogonal to the 2D plane is completely discarded. As discussed recently in Milletari et al. [23], considering 3D MRI data directly, instead of slice-by-slice, can improve the performance of a segmentation method.
To incorporate 3D contextual information, there is also a study using 2D CNNs on images from the three orthogonal planes [24]. The memory requirements of fully 3D networks are avoided by extracting large 2D patches from multiple image scales and combining them with small single-scale 3D patches. All patches are assembled into eight parallel network pathways to achieve a high-quality segmentation of 134 brain regions from whole brain MRI. Recently, a study proposed a CNN scheme based on 2D convolutions to segment a set of subcortical brain structures [16]. The segmentation of the whole volume is first achieved by processing each 2D slice independently. To impose volumetric homogeneity, it constructs a 3D conditional random field (CRF) using scores from the CNN as unary potentials in a multi-label energy minimization problem.
3D CNNs have been largely avoided due to the computational and memory requirements of running 3D convolutions during inference. Most 3D CNN processing is implemented by converting the volume into multiple 2D images, which are then processed slice by slice. The structure for processing a 3D object directly [25] differs from the structure for processing multiple 2D slice images, but 3D CNN topics are not covered in this background section because this research only uses 2D slice images.
2.3 Contribution
This research aims to implement medical image classification using machine learning techniques and the template model. In the implementation, a Siamese Network is used to train a template of normal cases of the brain. The CNN architecture discussed in this thesis consists of two components, a pre-trained CNN and a CNN for the template model. The output of this CNN architecture is the similarity between the test image and the training set. According to this score, whether the test image is an abnormal case can be determined. These comprehensive experiments with different neural networks are implemented in the same hardware environment.
The dataset of this research includes 2750 consecutive head CT examinations, the same dataset used in Luciano et al. [1]. The CNN designed in this research is a combination of a Siamese Network and Inception ResNet V2. It shows better performance than a simple CNN when the training set is small (fewer than 400 images), improving accuracy by 7% and positive predictive value by 10%. For larger datasets, it provides a good result, reaching 92% accuracy overall and over 94% accuracy for detecting normal cases.
Moreover, the methodology section demonstrates the analysis and functionality of each layer component and each neural network. Another purpose of this thesis is to analyze which CNN architecture is more sensitive in the neuroimaging template case. This will benefit future research on neural network design for CT neuro imaging. The result provides survey information on how different neural networks affect the classification results.
Chapter 3: Methodology
3.1 Overview of Convolutional Neural Network(CNN)
A Convolutional Neural Network (CNN) uses a variation of the Multi-Layer Perceptron (MLP) designed to require minimal pre-processing. To describe the CNN, we will start from the simplest possible neural network, which comprises a single neuron, as in Figure 3.1.
The "neuron" is a computational unit that takes inputs x1, x2, x3 to produce a single output, similar to the structure of a human neuron; this is why it is called an "artificial neuron". The core part of a neuron can be treated as a function corresponding to the input-output mapping defined by logistic regression. Therefore, we can generate a neural network model by combining multiple "neurons".

Figure 3.1: Multiple Neurons with single directions node

Figure 3.2: Multiple Neurons with single directions node
The network architecture in Figure 3.2 can be treated as an example of an MLP. (Note that the orientation between nodes can be recurrent in other designs.) In this model, we can assume that each input x has a weight value w, and each neuron has a threshold value b used in its computation. The top concern is then how to obtain the values of w and b. In general, to find correct values of w and b, we use the "control variates method" from Monte Carlo methods [26]: observe the difference in the outputs while perturbing w and b by δw and δb, and let the machine repeat this process until it finds ideal values for w and b. This process is called the training process of the CNN (Figure 3.3); a minimal numerical sketch is given after the list below.
Figure 3.3: Multiple Neurons with single directions node

The CNN training procedure can be summarized as follows:

• Define the input and output
• Find one or more algorithms to get the output from the input
• Train the model on classified datasets to find the parameters
• Once a new case's input is given, generate the output based on the parameters calculated in the training process
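A minimal NumPy sketch of this perturb-and-update idea follows (illustrative code, not from the thesis; the synthetic data, learning rate and step count are invented for the example):

```python
import numpy as np

# A single logistic neuron trained by estimating gradients with small
# perturbations (the delta-w, delta-b probing described above) and
# stepping the parameters downhill.

def neuron(x, w, b):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))   # logistic activation

def loss(x, y, w, b):
    return np.mean((neuron(x, w, b) - y) ** 2)  # squared error

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                   # 100 samples, inputs x1..x3
y = (x.sum(axis=1) > 0).astype(float)           # synthetic labels

w, b = np.zeros(3), 0.0
eps, lr = 1e-5, 1.0
for step in range(500):
    # finite-difference estimates of dL/dw and dL/db
    grad_w = np.array([(loss(x, y, w + eps * np.eye(3)[i], b)
                        - loss(x, y, w, b)) / eps for i in range(3)])
    grad_b = (loss(x, y, w, b + eps) - loss(x, y, w, b)) / eps
    w -= lr * grad_w                            # move w against the gradient
    b -= lr * grad_b
print("trained w:", w, "b:", b, "final loss:", loss(x, y, w, b))
```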
If the training object is a computer image, CNN models use the fact that the pixels of the image are arranged in order. The computer image is processed by learning image filters, which form the convolutional layers. While MLPs vectorize the input, the inputs to a layer in a CNN are feature maps. A feature map is a matrix m ∈ R^(w×h) (w, h represent width and height), and typically the width equals the height (w = h). For an RGB input image, the number of feature maps is d = 3: each color channel is a feature map. Since AlexNet [4] almost halved the error in the ImageNet challenge, CNNs have been the state-of-the-art technique in various computer vision machine learning tasks.
3.2 Neural Network Training
3.2.1 Backpropagation
In a CNN, backpropagation [27] is the essential method for computing the gradient of the loss function with respect to the network weights. In other words, backpropagation is the process that calculates the w and b referred to in Chapter 3.1.
Backpropagation, which adjusts the weights of neurons by calculating the gradient of the loss function, is a complement to the gradient descent optimization algorithm. The loss value from the function is considered an evaluation of the training performance. Backpropagation requires the derivative of the loss function with respect to the network output to be known, which typically means that a desired target value is known.
Backpropagation originates from an older and more general technique called automatic differentiation. It is also sometimes called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers. It is commonly used to train deep neural networks, a term referring to neural networks with more than one hidden layer.
Backpropagation is also a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Chapter 3.3 introduces the definitions and structures of the different CNN layers, which are the objects of backpropagation.
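For concreteness, the following hedged sketch applies the chain rule by hand to a tiny one-hidden-layer network with sigmoid activations and squared-error loss; the shapes, data and learning rate are illustrative assumptions, not the networks used later in this thesis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))          # batch of 8 inputs
y = rng.uniform(size=(8, 1))         # targets in [0, 1]
W1, b1 = rng.normal(size=(4, 5)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)

for step in range(1000):
    # forward pass
    h = sigmoid(x @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    loss = np.mean((out - y) ** 2)
    # backward pass: the error is computed at the output and distributed
    # back layer by layer via the chain rule (the delta-rule generalization)
    d_out = 2 * (out - y) / y.size * out * (1 - out)   # dL/d(pre-activation 2)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = d_out @ W2.T * h * (1 - h)                   # chain rule into layer 1
    dW1, db1 = x.T @ d_h, d_h.sum(axis=0)
    # gradient-descent update of the weights w and thresholds b
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.5 * g
    if step % 250 == 0:
        print(step, loss)
```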
3.2.2 Training sets setting
As Chapter 3.1 states, the input of a CNN is vectorized with multiple features to obtain a single-value output. Because of this architecture, most implementations of CNNs are used for classification. Even though there are multiple implementations [16][21][22][23][24] that focus on generating a model template, most of them only detect a specific region based on an image template and do not apply segmentation. Unlike research that focuses on a specific disease or region, this thesis aims to compare the testing case with the normal template to judge whether the testing case is normal or not. The output for the test image is the similarity between the test image and the different classes' templates. If the test image is classified as an abnormal case, the system recommends that the doctor perform further segmentation on this test case. The design of this CNN model has a different orientation from the traditional CNN experiment because it focuses on the normal case instead of the disease cases. In other words, the question answered by this research switches from "Does the patient have disease A?" to "Is the patient healthy?" The training set used in this research is the same dataset used in the previous AI classification project [1]. The abnormal training set consists of multiple types of brain diseases. The normal training set includes the normal neuro images. To simplify the problem, this project only focuses on 2D images instead of the original 3D CT images. The 2D slices in the training sets are collected from similar z-axis positions in the CT images.
There is another method that can be used to generate a template for the normal cases. An image can be converted to a matrix of vectors, which can be used as the input to the CNN; however, the output does not have to be a single value. A standard template can also be generated from the datasets, similar to the design of FaceNet [28]. They represent the structure of a face by two hundred points, then use those points' positions to generate a template for one class. After that, they compare the similarity between multiple face templates to judge whether a face belongs to a class or not. In this architecture, the output is a set of points converted from the original images (matrices). Similar to facial images, the components of the human brain have similar shapes and are located in similar positions. Therefore, it is possible to apply the template model method to a neuro image CNN application.
This method can be implemented with a Siamese network, which compares two images' similarity using a CNN. The comparison operates on convolution results, which capture more feature characteristics than traditional pixel-wise comparison. In this method, only the normal training set is used, because the key point is to compare the similarity between the test image and all the normal cases. The output value is a floating point value between 0 and 1 instead of a Boolean output.
In this thesis research, the classification result of HCSNet is calculated by combining the normal CNN's result and the template model's result. Because the classification processes implemented by the Inception ResNet and the Siamese Network are separate, no particular pre-processing is required for the images in the training set.
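As a rough illustration of this idea (a conceptual sketch only; HCSNet's actual branches are CNNs, and the weights below are random stand-ins rather than learned parameters), two images can be embedded with shared weights and their distance mapped to a similarity score in (0, 1]:

```python
import numpy as np

def embed(image, W):
    # shared-weight embedding branch: both images pass through the same W
    return np.tanh(image.ravel() @ W)

def similarity(img_a, img_b, W):
    d = np.linalg.norm(embed(img_a, W) - embed(img_b, W))
    return np.exp(-d)                        # floating-point score in (0, 1]

rng = np.random.default_rng(3)
W = rng.normal(size=(64 * 64, 16)) * 0.01    # toy weights; learned in practice
test_img, template = rng.random((64, 64)), rng.random((64, 64))
print(similarity(test_img, template, W))     # closer to 1 means "more normal"
```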
3.3 Neural Network Layers
3.3.1 Convolutional Layers
In a CNN, the convolutional layer computes the output volume as the dot product between each filter and each image patch. Convolutional layers pass the processed outcome to the next layer by applying a convolution operation to the input. The layer's parameters consist of a set of filters, which have a receptive field extending through the full depth of the input volume. Each filter is convolved across the width and height of the input volume to compute the dot product between the entries of the filter and the input (Figure 3.4). The output is a 2-dimensional feature map for that filter, which indicates whether a specific feature is detected at some spatial position in the input.
The output volume of the convolution layer is built by combining the feature maps of all filters along the depth dimension. Every entry in the output volume can be interpreted as the output of a neuron that looks at a small region of the input and shares parameters with neurons in the same feature map.
When a convolutional layer processes images, it is impractical to connect each neuron to all neurons in the previous volume, since the spatial structure of the data would not be reflected in the network architecture. Convolutional networks exploit spatially local correlation to implement a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume, as in Figure 3.5, where a 7 × 7 input matrix is convolved to a 3 × 3 matrix. The red square marked in the output volume is computed from the top left region of the input volume. Because the input is processed in filter-sized patches, the connections are local along width and height but extend along the entire depth of the input volume. This architecture ensures that trained filters respond to a spatially local input pattern.
Three hyper-parameters control the size of the output volume of the convolutional layer: the depth, the stride and the zero-padding.
Figure 3.4: Convolutional Process
Figure 3.5: Convolutional Matrix
The depth of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features (oriented edges, colors, etc.) in the input.
The stride controls how the depth columns are allocated around the spatial dimensions. When the stride is 1, we move the filters one pixel at a time. This leads to heavily overlapping receptive fields between the columns, and also to large output volumes. When the stride is 2, the filters jump 2 pixels at a time as they slide around; the receptive fields overlap less and the resulting output volume has smaller spatial dimensions.
Zero padding pads the input with zeros on the border of the input volume and is a good complement to stride control (the padded area is filled with 0). The size of this padding is the third hyper-parameter. Padding provides control of the output volume's spatial size, which can preserve the spatial size of the input volume.
Figure 3.6 is an example implementing stride control and zero padding. The input volume is 32 × 32 × 3. If we imagine two borders of zeros around the volume, this gives us a 36 × 36 × 3 volume. Then, when we apply a convolutional layer with three 5 × 5 × 3 filters and a stride of 1, we also get a 32 × 32 × 3 output volume.
Assume W1 and H1 are the width and height before convolution, F is the size of the filter, and W2 and H2 are the width and height after convolution. In general, setting the zero padding to P = (F − 1)/2 when the stride is S = 1 ensures that the input volume and the output volume have the same spatial size. The relations between the parameters are listed in equations 3.1, 3.2 and 3.3:

W2 = (W1 − F + 2P)/S + 1    (3.1)

H2 = (H1 − F + 2P)/S + 1    (3.2)

P = (F − 1)/2    (3.3)

Figure 3.6: Convolutional Matrix with padding=2
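A minimal NumPy sketch of these relations (illustrative, not thesis code) computes the output size from equations 3.1-3.3 and performs a naive single-channel convolution:

```python
import numpy as np

def conv_output_size(w1, f, p, s):
    return (w1 - f + 2 * p) // s + 1            # equations 3.1 / 3.2

def conv2d(image, kernel, stride=1, padding=0):
    f = kernel.shape[0]
    img = np.pad(image, padding)                # zero padding on the border
    out = conv_output_size(image.shape[0], f, padding, stride)
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = img[i * stride:i * stride + f, j * stride:j * stride + f]
            result[i, j] = np.sum(patch * kernel)   # dot product with filter
    return result

# "same" convolution: P = (F - 1)/2 with S = 1 preserves spatial size (3.3)
image = np.random.rand(7, 7)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(image, kernel, stride=1, padding=1).shape)   # (7, 7)
print(conv2d(image, kernel, stride=2, padding=0).shape)   # (3, 3)
```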
The parameter sharing scheme is another key feature of convolutional layers, used to control the number of free parameters. Denoting a single 2-dimensional slice of the depth as a depth slice, the architecture constrains the neurons in each depth slice to use the same weights and bias.
Parameter sharing contributes to the translation invariance of the CNN architecture. Because all neurons in a single depth slice exploit the same parameters, the forward pass in each depth slice of the convolutional layer can be computed as a convolution of the neuron's weights with the input volume. The sets of weights are used as the parameters of the filter in the convolution. The result of this convolution is a feature map, and the feature maps of the different filters are stacked along the depth dimension to produce the output.
However, when we expect completely different features to be learned at different spatial locations of the input image, the parameter sharing assumption may not work. For example, when the inputs are faces that have been centered in the image, we might expect different eye-specific or hair-specific features to be learned at different locations. In that case, it is common to call the layer a locally connected layer instead of using a parameter sharing scheme.
3.3.2 Pooling Layers
Pooling is a non-linear down-sampling architecture in the CNN. There are several non-linear functions that implement pooling, such as max pooling and average pooling. Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, generates the required output. For example, a max pooling layer outputs the maximum value of each sub-region.

Figure 3.7: Max and Average pooling

Pooling layers are implemented under the assumption that the value computed for a sub-region is able to represent the characteristics of that sub-region. To prevent overfitting and to reduce the number of parameters and the amount of computation in the network, the pooling layer reduces the spatial size of the representation. In general, a pooling layer is inserted between successive convolutional layers in a CNN architecture. The pooling operation provides another form of translation invariance.
Pooling summarizes a p × p area of the input feature map. Similar to convolutional layers, pooling can be used with a stride S ∈ N, S > 1. Usually p ∈ {2, 3, 4, 5}, and p = s = 2 is the most common setting, implemented in AlexNet [4] and VGG-16 [29]. As Thoma (2017) [5] concluded, pooling is applied for three reasons: to get local translational invariance, to get invariance against minor local changes and, most important, for data reduction to 1/S² of the data by using strides of S > 1. Examples of max pooling and average pooling are given in Figure 3.7.
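A short NumPy sketch of both pooling operators (illustrative only), using the common non-overlapping p = S = 2 setting:

```python
import numpy as np

def pool2d(feature_map, p=2, mode="max"):
    h, w = feature_map.shape
    out = np.zeros((h // p, w // p))            # 1/S^2 of the input data
    for i in range(h // p):
        for j in range(w // p):
            region = feature_map[i * p:(i + 1) * p, j * p:(j + 1) * p]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 2., 0., 1.],
               [5., 1., 2., 2.],
               [0., 1., 3., 4.]])
print(pool2d(fm, mode="max"))      # [[4. 2.] [5. 4.]]
print(pool2d(fm, mode="average"))  # [[2.5  1.  ] [1.75 2.75]]
```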
3.3.3 Activation Layers
The activation layer is used to increase the non-linearity of the network without affecting the receptive fields of the convolution layers. Five non-linear activation functions are most commonly discussed.
Sigmoid
The Sigmoid function takes a real-valued number and squashes it into the range between 0 and 1: it maps large negative numbers toward 0 and large positive numbers toward 1.
Mathematically it is written as:
σ(x) = 1 / (1 + e^(−x))
Figure 3.8 and Figure 3.9 show the Sigmoid function and its derivative graphically.
Figure 3.8: Sigmoid activation function
Figure 3.9: Derivative of Sigmoid Activation Function

During backpropagation through a network with Sigmoid activations, the gradients of neurons whose outputs are near 0 or 1 are nearly 0. These neurons are called saturated neurons. Their weights cannot change significantly because the gradient values are extremely small; this is called the vanishing gradient problem. If a large network is composed of Sigmoid neurons and many of them are in the saturated region, the network will not be able to backpropagate effectively. Meanwhile, the outputs of the Sigmoid are not zero-centered, and the exponential function is computationally expensive. A better activation method is required.
Tanh
The tanh function takes a real-valued number and converts it into the range between -1 and 1. Unlike the Sigmoid, tanh outputs are zero-centered because the output range is between -1 and 1. Negative inputs are mapped strongly negative, zero inputs are mapped near zero, and positive inputs are mapped positive. The only disadvantage is that the tanh function also suffers from the vanishing gradient problem.
Figure 3.10 and Figure 3.11 show the tanh function and its derivative graphically:
Rectified Linear Unit (ReLU)
Unlike the methods introduced above, the ReLU's output is not squashed into a fixed range. When the input x < 0, the output is 0; when the input x > 0, the output is x. Mathematically, it can be represented as:
Figure 3.10: tanh activation function
Figure 3.11: Derivative of tanh Activation Function
Figure 3.12: ReLU activation function
Figure 3.13: Derivative of ReLU Activation Function
y = max(0, x)
Figure 3.12 and Figure 3.13 show the ReLU activation function and its derivative.
ReLU achieves faster network convergence than the other methods. It is resistant to the vanishing gradient problem when x > 0, ensuring that backpropagation can proceed in at least half of the input region.
However, when x is not positive, the network does not learn: if x < 0, the neuron remains inactive and kills the gradient during the backward pass. At x = 0 the slope is undefined, so in practice either the left or the right gradient is picked. Since the value picked at this undefined point may be slightly incorrect, this is counted as a drawback of the ReLU method.

Figure 3.14: Leaky ReLU Activation
Leaky and Parametric ReLU
Leaky and Parametric ReLU are activation functions that fix the dying-gradient issue that occurs when the ReLU activation function processes x < 0. Mathematically, the leaky ReLU function can be represented as:
y = max(0.1x, x)
The concept of leaky ReLU is that when x < 0, the function has a small positive slope of 0.1. This somewhat eliminates the dying ReLU problem, though not consistently.
In leaky ReLU, 0.1 is used as the slope parameter to avoid the dying-gradient issue. The parameter can also be changed to an arbitrary hyper-parameter α. This α can be learned, since it is possible to backpropagate into it, which gives the neurons the ability to choose the best slope in the negative region. This method is Parametric ReLU.
The Parametric ReLU function is given by:
y = max(αx, x)
In summary, PReLU is a stable and popular choice for non-linear activation layers in the current CNN field.
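The following sketch (illustrative NumPy, not thesis code) collects the activation functions above; evaluating the derivatives at large |x| makes the saturation of Sigmoid and tanh, and the surviving gradient of the ReLU family, directly visible:

```python
import numpy as np

def sigmoid(x):        return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x):      s = sigmoid(x); return s * (1 - s)
def d_tanh(x):         return 1 - np.tanh(x) ** 2
def relu(x):           return np.maximum(0.0, x)
def leaky_relu(x):     return np.maximum(0.1 * x, x)     # fixed slope 0.1
def prelu(x, alpha):   return np.maximum(alpha * x, x)   # learnable alpha

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(d_sigmoid(x))   # ~0 at x = +/-10: saturated neurons, vanishing gradient
print(d_tanh(x))      # also vanishes at large |x|
print(relu(x), leaky_relu(x), prelu(x, alpha=0.25))
```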
3.3.4 Dropout
Dropout is a technique used to reduce overfitting and co-adaptation on training data. It was introduced by Hinton et al. [30] and Srivastava et al. [31]. As Thoma's survey paper concluded [5], a dropout layer can be implemented as follows: for an input of any shape S, a tensor of the same shape D ∈ {0, 1}^S is sampled, where each element d_i is sampled independently from a Bernoulli distribution, and p represents the dropout probability for every value of the input. Mathematically, the dropout method can be written as:

out = D ⊙ in, with d_i ∼ B(1, p)    (3.4)

In general, dropout is used with p = 0.5. Layers closer to the input usually have a lower dropout probability than later layers. The output of a dropout layer is multiplied by 1/(1 − p) when dropout is enabled [32]. Dropout is typically applied after fully connected layers to prevent overfitting, and after max-pooling layers to implement a form of data augmentation. Figure 3.15 is an example of applying dropout in a CNN.

Figure 3.15: Before and after applying dropout
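A hedged NumPy sketch of equation 3.4 ("inverted" dropout; p below is taken as the drop probability, matching the 1/(1 − p) rescaling noted above, which is an interpretation of the source's notation):

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    if not training:
        return x                                  # dropout disabled at test time
    rng = np.random.default_rng()
    d = rng.binomial(1, 1 - p, size=x.shape)      # Bernoulli mask D
    return d * x / (1 - p)                        # elementwise D * in, rescaled

activations = np.ones((2, 4))
print(dropout(activations, p=0.5))   # roughly half the entries zeroed, rest = 2
```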
3.3.5 Normalization Layers
While the parameters of layers close to the output adapt to the input produced by lower layers, those lower layers' parameters are also adapting. This problem is called internal covariate shift; it degrades the parameters of the upper layers and forces a low learning rate.
Normalizing mini-batches [2] is a way to approach this problem. The normalization layer implements mini-batch normalization in the CNN, which improves the gradient flow and allows higher learning rates. Moreover, it can also reduce the dependency on initialization and provide regularization [5].
In general, the batch normalization layer applies the following normalization:

x̂_k = (x_k − E[x_k]) / sqrt(Var(x_k))

The output of the batch norm layer has γ and β as parameters. These parameters are learned to best represent the activations; they act as a learnable scale and shift factor:

y_k = γ_k · x̂_k + β_k
Overall, the operation can be summarized as in Figure 3.16 [2].

Figure 3.16: Dropout processing equation [2]
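A minimal NumPy sketch of this normalization (illustrative; eps is an assumed numerical-stability constant not shown in the equations above):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # E[x_k] over the mini-batch
    var = x.var(axis=0)                      # Var(x_k) over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta              # y_k = gamma_k * x_hat_k + beta_k

batch = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(32, 8))
out = batch_norm(batch, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 and ~1
```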
3.3.6 Loss Layers
The loss is used to determine how the training process penalizes the deviation between the predicted and true labels. The function used in the loss layer is called the loss function; it describes how far the result the CNN produced is from the expected result. The loss function indicates the magnitude of the error the CNN made in its prediction.
Before the loss layer, there is usually another layer called the fully-connected (FC) layer, which converts the previous layer's result into a simpler format for the loss layer to process. For example, if we implemented VGG-16 [29] without a fully connected layer, the loss layer would have 4096 nodes that need to process the output of the last pooling layer, which has 25088 nodes. The transmission would require 4096 × 25088 weights, which requires a large amount of memory. Instead, the fully connected layer can be regarded as a special case of the convolution layer. The pooling layer (POOL2) and the fully connected layer (FC1) are fully connected, and the output nodes of POOL2 are arranged as a vector: there are 25088 dimensions, each of size 1 × 1. The convolution kernel can then be configured as num_filters = 4096, channels = 25088, kernel_size = 1, stride = 1, which is faster than the previous design.
The result of the loss function can then be 'backpropagated' through the CNN model, adjusting its weights so that the model gets closer to the expected result the next time.
Various loss functions are appropriate for different tasks. For example, Softmax loss is used for predicting a single class out of K mutually exclusive classes. Sigmoid cross-entropy loss is used for predicting K independent probability values in [0, 1]. Euclidean loss is used for regressing to real-valued labels.
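For illustration, minimal NumPy versions of two of these losses follow (assumed implementations for the example, not the exact ones used in the experiments):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    z = logits - logits.max()                # stabilize the exponentials
    log_probs = z - np.log(np.exp(z).sum())  # log of softmax probabilities
    return -log_probs[label]                 # penalize low prob. of true class

def euclidean_loss(pred, target):
    return 0.5 * np.sum((pred - target) ** 2)

print(softmax_cross_entropy(np.array([2.0, 1.0, -1.0]), label=0))  # small loss
print(softmax_cross_entropy(np.array([2.0, 1.0, -1.0]), label=2))  # large loss
print(euclidean_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0])))
```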
3.4 Neural Network design
3.4.1 LeNet
One of the first CNNs was LeNet-5 [3]. LeNet-5 uses, twice, the common pattern of a single convolutional layer with tanh as the non-linear activation function followed by a pooling layer, and ends with three fully connected layers. One fully connected layer is used to get the right output dimension; another is necessary to allow the network to learn a non-linear combination of the features of the feature maps.
Architecture of LeNet is shown in Figure 3.17.
In LeNet’s architecture, tanh function is applied after layers 1, 3, 5 and 6. Soft- max function is applied after layer 7. The characteristic is that the convolutional
30 Figure 3.17: Architecture of LeNet-5[3]
layer requires fewer parameters but has an order of magnitude more floating point operations(FLOP) per parameter than fully connected layers.
LeNet was designed for MNIST which achieves a 0.8% test error rate. How- ever, compared with other neural networks introduced below, LeNet is a very simple pipeline processing network. The limitation of layer functionality and structure leads to a low performance while processing medical images.
3.4.2 AlexNet
The first CNN to achieve major improvements on the ImageNet dataset was AlexNet [4]. Its architecture is shown in Figure 3.18 and has about 60 × 10^6 parameters; the parameter settings are given in Figure 3.19. Convolutional layers are followed by pooling layers multiple times, ending with a fully connected network, a processing structure inherited from LeNet.
In AlexNet, all the convolution kernels after the first max-pooling layer are small, either 5 × 5 or 3 × 3 with a stride of 1, which means they scan all the pixels of the image; the max-pooling layers, however, use a stride of 2. This reveals that in the first few convolutional layers, although the amount of computation is very large, the number of parameters is small, around 1M or even less, accounting for only a very small part of AlexNet's total parameter count. This convolutional architecture can extract valid features with a small number of parameters. If the first few layers used fully connected layers directly, the number of parameters and the amount of computation would become astronomical. Yet although each convolution layer occupies less than 1% of the entire network's parameters, removing any convolutional layer greatly reduces the classification performance of the network.

Figure 3.18: Layer structure of AlexNet [4]
To implement AlexNet on GPUs, the common solution is to reduce the number of parameters and allow parallel computation on separate GPUs, which leaves more room to handle and implement data augmentation. To make the architecture easier to compare, this grouping was ignored for the parameter count; the network also implements the dropout functionality noted above.
Figure 3.19: Parameter number and setting of AlexNet
3.4.3 VGGNet
VGGNet [29] is a deep convolutional neural network developed by researchers from the Visual Geometry Group at Oxford University and Google DeepMind. VGGNet explored the relationship between the depth of convolutional neural networks and their performance. By repeatedly stacking 3 × 3 convolution kernels and 2 × 2 max-pooling layers, VGGNet successfully constructed networks 16 to 19 layers deep. Compared with previous state-of-the-art network structures, VGGNet's error rate dropped significantly, and it achieved second place in the ILSVRC 2014 classification task and first place in the localization task. VGGNet is highly scalable: the entire network uses the same convolution kernel size (3 × 3) and max-pooling size (2 × 2). To this day, VGGNet is often used to extract image features. VGGNet's trained model parameters are open sourced on its official website and can be used to retrain domain-specific image classification tasks (equivalent to providing very good initialization weights), and they are therefore used in many places.
The VGGNet papers use 3 × 3 convolution kernels and 2 × 2 pooling kernels throughout, improving performance by continuously deepening the network structure. Figure 3.20 shows VGGNet's network structure and parameters at each level. From the 11-layer network to the 19-layer network, there are detailed performance tests. Although each level of the network from A to E gradually becomes deeper, the number of parameters does not increase much, because the parameters are mainly consumed by the last three fully connected layers. Although the convolutional part in front is very deep, the number of parameters it consumes is not large; however, the more time-consuming part of training is still the convolution, because of its larger computation. Among the configurations, D and E are what we usually call VGGNet-16 and VGGNet-19. Configuration C is very interesting compared to B: it adds several 1 × 1 convolutional layers. The significance of the 1 × 1 convolution is mainly a linear transformation; the numbers of input channels and output channels are unchanged, and no dimensionality reduction occurs.

Figure 3.20: Architecture of VGGNet [5]
VGGNet has 5 segments of convolution layers, with 2 to 3 convolution layers in each segment, and a max-pooling layer connected at the end of each segment to reduce the size of the picture. The number of convolution kernels is the same within each segment, and increases in the later segments: 64, 128, 256, 512, 512. There are often cases where multiple identical 3 × 3 convolutional layers are stacked together. Two 3 × 3 convolutional layers in series correspond to a 5 × 5 convolutional layer, meaning one output pixel is associated with a 5 × 5 neighborhood of input pixels; it can be said that the receptive field size is 5 × 5. The effect of three 3 × 3 convolutional layers in series is equivalent to one 7 × 7 convolutional layer. In addition, three concatenated 3 × 3 convolution layers have fewer parameters than a 7 × 7 convolution layer, as shown in the worked example below. Moreover, the three 3 × 3 convolutional layers apply more nonlinear transformations than a single 7 × 7 convolutional layer (the former can use three ReLU activation functions, the latter only one), which increases the network's capacity to learn features.
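A worked check of this parameter comparison (illustrative Python; C = 512 channels is an assumed example, and biases are ignored):

```python
# Three stacked 3x3 layers vs. one 7x7 layer, both with C input and C
# output channels and both giving a 7x7 receptive field.
C = 512
three_3x3 = 3 * (3 * 3 * C * C)    # 3 layers, each with 3*3*C*C weights
one_7x7 = 7 * 7 * C * C            # a single 7x7 layer
print(three_3x3, one_7x7)          # 7077888 vs. 12845056
print(three_3x3 / one_7x7)         # ~0.55: about 45% fewer parameters
```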
VGGNet has a little trick in the training process. First, it trains a simple network of level A, and then reuses the weights of network A to initialize the following, more complex models, so that training converges faster. In prediction, VGG uses the Multi-Scale method: the image is scaled to a size Q and fed into the convolutional network. Then a sliding window is used for classification prediction on the last convolution layer; the classification results of the different windows are averaged, and then the results for different sizes Q are averaged to obtain the final result. This improves the utilization of the image data and the prediction accuracy. Meanwhile, VGGNet also uses the Multi-Scale method for data augmentation: the original image is scaled to different sizes S, and 224 × 224 crops are taken randomly. This generates a lot of extra data, which can increase the reliability of the training set.
Compared with AlexNet and previous networks, VGGNet provides more solid performance. In general, VGGNet has a lower error rate than AlexNet for the same processing time. However, the training results raise doubts about whether the LRN (Local Response Normalization) layer is actually useful. The 3 × 3 convolutional layer corresponding to LRN may increase the error rate because of the noise possibly introduced by LRN, which causes unstable performance. This is a disadvantage of using VGGNet.
3.4.4 GoogleNet
Computational cost is a big problem when a CNN processes thousands of images, because of the large number of parameters and operations involved when such models are applied in practice. To reduce the computational cost while maintaining classification quality, GoogleNet [32] and the Inception module were developed. The Inception module essentially only computes 1 × 1 filters, 3 × 3 filters and 5 × 5 filters in parallel, but applies bottleneck 1 × 1 filters beforehand to reduce the number of parameters.
Let’s look at the basic structure of the Inception Module in Figure 3.21, which has four branches: The first branch convolves the input with 1 × 1, which is actually an important structure proposed on GoogleNet. The 1 × 1 convolution is a very good structure. It can organize information across channels, improve the expressiveness of the network, and at the same time can upscale and reduce the output channels.
It can be seen that the 4 branches of the Inception Module use a 1 × 1 convolution to perform cross-channel feature conversion at a low cost (a much smaller amount of computation than 3 × 3). The second branch first uses a 1 × 1 convolution and then a
3×3 convolution, which is equivalent to performing two feature transformations. The third branch is similar, first a 1×1 convolution and then a 5×5 convolution. The last branch is to directly use 1 × 1 convolution after 3 × 3 maximum pooling. We can find that some branches only use 1×1 convolution, and some branches use 1×1 convolution when using other size convolutions, because 1 × 1 convolution is very cost-effective.
A small amount of computation can add a layer of feature transformation and non- linearization. The four branches of the Inception Module are finally merged by an aggregate operation (aggregated in the number of output channels). The Inception
37 Figure 3.21: Inception module of GoogleNet
Module contains three different sizes of convolutions and one maximum pooling, which increases the adaptability of the network to different scales. This section is similar to the Multi-Scale idea. In the early research of computer vision, inspired by the primate neural visual system, the author used different size Gabor filters to process different size pictures. Inception V1 borrowed this idea. The Inception V1 paper pointed out that Inception Module can efficiently expand the depth and width of the network, improve accuracy and avoid overfitting.
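A hedged Keras sketch of such a four-branch module follows (the filter counts are illustrative assumptions, not the published GoogleNet configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3r=96, f3=128, f5r=16, f5=32, fp=32):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3r, 1, padding="same", activation="relu")(x)  # 1x1 bottleneck
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(f5r, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(fp, 1, padding="same", activation="relu")(b4)
    return layers.concatenate([b1, b2, b3, b4])   # merge along channel axis

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 28, 28, 256)
```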
Inception V2 [39], inspired by VGGNet, replaces the large 5 × 5 convolution with two 3 × 3 convolutions (to reduce the number of parameters and limit overfitting), and introduces the well-known Batch Normalization (BN) method. BN is a very effective regularization method that can speed up the training of large-scale convolutional networks many times over, while greatly improving the classification accuracy after convergence. When BN is applied to a neural network layer, it normalizes each mini-batch internally, normalizing the output to the distribution N(0, 1), and reduces internal covariate shift. BN's paper points out that during the training of traditional deep neural networks, the distribution of the input to each layer keeps changing, which makes training difficult; previously this problem could only be handled with a very small learning rate. After using BN in each layer, the problem is effectively solved: the learning rate can be increased many times, the number of iterations required to reach the previous accuracy is only 1/14, and the training time is greatly reduced. Continuing training beyond that point eventually achieves far better performance than the Inception V1 model: a top-5 error rate of 4.8%, which is better than human-eye level. Because BN also plays a regularization role in some sense, Dropout can be reduced or eliminated, simplifying the network structure.
The gain obtained by simply using BN is not obvious; some corresponding adjustments are needed: increasing the learning rate and accelerating the learning-rate decay to suit the BN-normalized data; removing Dropout and reducing L2 regularization (because BN already plays the role of regularization); removing LRN; shuffling the training samples more thoroughly; and reducing the optical distortion used in data augmentation (because BN trains faster, each sample is seen fewer times). After applying these measures, Inception V2 reaches the accuracy of Inception V1 14 times faster in training, and the model achieves a higher accuracy rate at convergence.
The Inception V3 network introduces two major changes compared with the previous designs. The first is factorization into small convolutions, which splits a larger two-dimensional convolution into two smaller one-dimensional convolutions: for example, a 7 × 7 convolution is split into a 1 × 7 convolution followed by a 7 × 1 convolution, and a 3 × 3 convolution is divided into a 1 × 3 convolution followed by
Figure 3.22: Inception V2 module
a 3 × 1 convolution, as shown in Figure 3.21. On the one hand, this saves a large number of parameters, speeds up computation and reduces overfitting (splitting the 7 × 7 convolution into a 1 × 7 and a 7 × 1 convolution is more economical than splitting it into three 3 × 3 convolutions); on the other, it adds an extra layer of nonlinearity, expanding the model's expressive capability. The paper points out that this asymmetric convolutional split handles more and richer spatial features and increases feature diversity, more effectively than splitting symmetrically into several identical small convolution kernels.
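A quick parameter count makes the savings concrete. The following sketch compares, for a hypothetical layer with C input and C output channels, a full 7 × 7 kernel against the 1 × 7 plus 7 × 1 factorization and the three-3 × 3 alternative (bias terms ignored):

```python
def conv_params(kh, kw, c_in, c_out):
    """Weight count of a 2-D convolution, ignoring bias terms."""
    return kh * kw * c_in * c_out

C = 64  # illustrative channel count
full_7x7   = conv_params(7, 7, C, C)                             # 7*7*C*C
factorized = conv_params(1, 7, C, C) + conv_params(7, 1, C, C)   # 14*C*C
three_3x3  = 3 * conv_params(3, 3, C, C)                         # 27*C*C

print(full_7x7, factorized, three_3x3)  # 200704 57344 110592
```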
On the other hand, Inception V3 optimizes the structure of the Inception Module itself. The Inception Module now has three different configurations, for 35 × 35, 17 × 17 and 8 × 8 feature maps, as shown in Figure 3.22. These Inception Modules appear only in the later part of the network, while the earlier part consists of ordinary convolutional layers. In addition, Inception V3 uses branches within branches (in the 8 × 8 structure) beyond the branching of the Inception Module itself, so it can be described as a Network in Network in Network (Figure 3.23).
Figure 3.23: Inception V3 module
Inception-v4, as described in [33], consists of four main building blocks: the stem, Inception A, Inception B and Inception C. To quote the authors, Inception-v4 is a deeper, wider and more uniformly simplified architecture than Inception-v3. The stem, Reduction A and Reduction B use max pooling, whereas Inception A, Inception B and Inception C use average pooling. The stem, module B and module C use separable convolutions.
3.4.5 ResNet
ResNet (Residual Neural Network) [33] was proposed by Kaiming He of Microsoft Research, successfully training a 152-layer deep neural network through the use of Residual
Figure 3.24: Residual unit of ResNet
Units (Figure 3.24), and winning the championship in ILSVRC 2015 with a top-5 error rate of 3.57%. At the same time, its parameter count is lower than that of VGGNet, and its results are very prominent. Compared with plain networks such as AlexNet and VGGNet-19, the structure of ResNet greatly speeds up the training of ultra-deep neural networks, and the accuracy of the model is also superior.
Assume that the input of a certain neural network block is x and the expected output is f(x). If we pass the input x directly to the output as the initial result, then the target we actually need to learn is the residual F(x) = f(x) − x. As shown in the figure, this is a residual unit of ResNet: it changes the learning target, so that the network no longer learns a complete output but only the difference between the output and the input, that is, the residual.
Figure 3.25 shows a comparison of VGGNet-19, a plain 34-layer convolutional network, and a 34-layer ResNet. It can be seen that the biggest difference between a plain, directly connected convolutional neural network and ResNet is that ResNet has many bypass branches that connect the input directly to later layers, so that those layers can learn the residuals directly. This structure is also called a shortcut or skip connection.
When a traditional convolutional layer or fully connected layer transmits information, there is inevitably some information loss. ResNet alleviates this problem to some extent: by passing the input directly to the output, it protects the integrity of the information, and the entire network only needs to learn the difference between input and output, which simplifies the learning objective and reduces its difficulty.
In addition to the two-layer residual learning unit proposed in ResNet's paper, there is also a three-layer residual learning unit (Figure 3.26). The two-layer unit contains two 3 × 3 convolutions with the same number of output channels (because the residual equals the target output minus the input, the input and output dimensions must be consistent). The three-layer unit instead borrows the 1 × 1 convolution of Network In Network and Inception Net, placing a 1 × 1 convolution before and after the middle 3 × 3 convolution, so that the dimension is first reduced and then restored. In addition, when the input and output dimensions differ, we can apply a linear projection to x and connect the result to the next layer, as shown in the sketch below.
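The following extends the earlier two-layer unit to the three-layer bottleneck form in PyTorch; the channel sizes and the 1 × 1 projection shortcut are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckUnit(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, with an optional projection shortcut."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)    # reduce dimension
        self.conv = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
        self.restore = nn.Conv2d(mid_ch, out_ch, kernel_size=1)  # restore dimension
        # Linear projection on x when input/output dimensions differ.
        self.project = (nn.Conv2d(in_ch, out_ch, kernel_size=1)
                        if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.conv(out))
        out = self.restore(out)
        return F.relu(out + self.project(x))
```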
Overall, the advantages of ResNet can be summarized in three points. (1) The residual network has no direct advantage in model characterization: ResNet does not characterize any particular aspect better, but it allows far deeper models to be represented. (2) The residual network makes feed-forward and back-propagation very smooth; to a large extent, it makes deeper models easier to optimize. (3) Shortcut connections neither introduce extra parameters nor
Figure 3.25: Comparison of VGG and ResNet structures
Figure 3.26: Residual units of the two-layer and three-layer ResNet
increase computational complexity. A shortcut simply performs an identity mapping and adds its output to the output of the stacked layers. Using backpropagation with SGD, the entire network can still be trained in an end-to-end fashion.
3.4.6 Siamese Network
Unlike most other neural networks, the Siamese network [6] is seldom used for direct classification; its output is the similarity between two different images. The Siamese network is a similarity-measure method: when the number of categories is large but the number of samples per category is small, it can still be used for category identification and classification. Traditional classification methods require knowing exactly which class each sample belongs to and need an exact label for every sample; when the number of categories is too large and the number of samples per category is relatively small, these methods become less applicable. The Siamese network instead learns a similarity measure from the data and uses this learned metric to compare and match samples of new, unknown classes.
Figure 3.27: Siamese Architecture [6]
This method can therefore be applied to classification problems where there are many categories, or where the entire training sample set cannot be used to train the methods described previously.
The architecture of the Siamese network is shown in Figure 3.27. Let $X_1$ and $X_2$ be a pair of images shown to our learning machine, and let $Y$ be a binary label of the pair: $Y = 0$ if the images $X_1$ and $X_2$ belong to the same object and $Y = 1$ otherwise. Let $W$ be the shared parameter vector that is subject to learning, and let $G_W(X_1)$ and $G_W(X_2)$ be the two points in the low-dimensional space that are generated by mapping $X_1$ and $X_2$. Then our system can be viewed as a scalar energy function $E_W(X_1, X_2)$ that measures the compatibility between $X_1$ and $X_2$. It is defined as:
\[ E_W(X_1, X_2) = \left\lVert G_W(X_1) - G_W(X_2) \right\rVert \tag{3.5} \]
We can define the loss function as:
\[ L(W) = \sum_{i=1}^{P} L\!\left(W, (Y, X_1, X_2)^i\right) \tag{3.6} \]
\[ L\!\left(W, (Y, X_1, X_2)^i\right) = (1 - Y)\, L_G\!\left(E_W(X_1, X_2)^i\right) + Y\, L_I\!\left(E_W(X_1, X_2)^i\right) \tag{3.7} \]
Here $(Y, X_1, X_2)^i$ denotes the $i$th training sample, consisting of an image pair and its label; $L_G$ computes the partial loss for a genuine (same-class) pair, $L_I$ the partial loss for an impostor (different-class) pair, and $P$ is the number of training samples. With this design, the difference between the two classes becomes more pronounced; a minimal sketch of this computation follows.
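To make equations (3.5)–(3.7) concrete, the following is a minimal PyTorch sketch of the energy and the total pairwise loss. The specific forms used here for $L_G$ and $L_I$ (a squared energy for genuine pairs and a squared hinge with margin m for impostor pairs) are illustrative assumptions, not necessarily the functions chosen in [6]:

```python
import torch

def energy(g1, g2):
    """E_W(X1, X2) = || G_W(X1) - G_W(X2) ||  (Eq. 3.5), per pair."""
    return torch.norm(g1 - g2, dim=1)

def pair_loss(e, y, m=1.0):
    """Eq. 3.7 with illustrative choices: L_G(e) = e^2 for genuine pairs
    (y = 0) and L_I(e) = max(0, m - e)^2 for impostor pairs (y = 1)."""
    l_genuine = e ** 2
    l_impostor = torch.clamp(m - e, min=0.0) ** 2
    return (1 - y) * l_genuine + y * l_impostor

def total_loss(g1, g2, y):
    """Eq. 3.6: sum of the per-sample losses over the P training pairs."""
    return pair_loss(energy(g1, g2), y).sum()

# Example with P = 8 pairs of 32-dimensional embeddings G_W(X).
g1, g2 = torch.randn(8, 32), torch.randn(8, 32)
y = torch.randint(0, 2, (8,)).float()  # 0 = same object, 1 = different
print(total_loss(g1, g2, y))
```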
This requirement can be expressed as a margin condition on $E_W$: there must exist an $m > 0$ such that $E_W(X_1, X_2) + m < E_W(X_1, X_2')$, where $(X_1, X_2)$ is a genuine pair and $(X_1, X_2')$ an impostor pair. Based on this condition, we can get the loss function as: