
International Conferences Computer Graphics, Visualization, Computer Vision and Image Processing 2021; Connected Smart Cities 2021; and Big Data Analytics, Data Mining and Computational Intelligence 2021

ANALYSIS OF CAPSULE NETWORKS FOR IMAGE CLASSIFICATION

Evgin Goceri
Biomedical Engineering Department, Engineering Faculty, Akdeniz University, Dumlupinar Boulevard, 07058, Antalya, Turkey

ABSTRACT

Recently, interest in convolutional neural networks has grown exponentially for image classification. Their success is based on the ability to learn hierarchical and meaningful image representations, which yields a feature extraction technique that is general, flexible and able to encode complex patterns. However, these networks have some drawbacks. For example, they need a large amount of labeled data, they lose valuable information in the pooling operation (about internal properties, such as shape, location, pose and orientation in an image, and their relationships), and they are not able to encode deformation information. Therefore, capsule-based networks have been introduced as an alternative to convolutional neural networks. A capsule is a group of neurons (logistic units) whose activity represents the presence of an entity and whose output vector indicates relationships between features by encoding instantiation parameters. Unlike convolutional neural networks, a capsule network employs no maximum pooling layers; instead, a dynamic routing mechanism is applied iteratively to decide the coupling of capsules between successive layers. In other words, training between capsule layers is provided with a routing-by-agreement method. However, capsule networks and their ability to provide high accuracy for image classification have not been sufficiently investigated. Therefore, this paper aims (i) to point out drawbacks of convolutional networks, (ii) to examine capsule networks, and (iii) to present advantages, weaknesses and strengths of capsule networks proposed for image classification.
KEYWORDS

Image Classification, Capsule Network, Dynamic Routing, Deep Network, Convolutional Neural Networks

1. INTRODUCTION

In recent years, the importance of deep neural networks has increased due to their great success in many fields. In particular, Convolutional Neural Networks (CNNs) have been widely implemented for the classification of images because of their ability to learn by using convolution filters and non-linearity units (Seeland and Mader, 2021; Guoqing et al., 2021; Abbas et al., 2021). However, their architectures lead to two important issues. The first concerns the robustness of these architectures to affine transformations (e.g., shifts in pose). This problem is usually alleviated by various image augmentations; however, the testing dataset can present unpredictable shifts, so CNNs need a huge amount of training data. The second concerns the spatial relationships between features. CNNs have a tendency to memorize data instead of understanding it, and they do not have the capability to learn the relationships between required features (Hinton et al., 2018). Several pooling algorithms can provide a little translation invariance. However, pooling methods do not store information about the location of features, since they maintain only the presence information and ignore the pose information (Szegedy et al., 2013). To overcome these two drawbacks of CNNs, Capsule Networks (CapsNets) (Sabour et al., 2017), new architectures based on the capsule concept (Hinton et al., 2011; Kosiorek et al., 2019; Wang and Liu, 2018), have been proposed. Unlike CNNs, CapsNets store information at the vector level rather than as scalar values. These capsule vectors (groups of neurons) represent richer information in the architecture. Also, to classify objects in an image, CapsNets use a routing-by-agreement method.
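The loss of pose information under pooling can be seen in a toy example (a minimal NumPy sketch; the `max_pool_2x2` helper is ours, for illustration only): two feature maps that activate at different positions become indistinguishable after max pooling.

```python
import numpy as np

def max_pool_2x2(x):
    """Naive 2x2 max pooling with stride 2 on a 2D feature map."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two feature maps with the same activation at *different* positions...
a = np.array([[9., 0.],
              [0., 0.]])
b = np.array([[0., 0.],
              [0., 9.]])

# ...produce identical outputs after pooling: the pose information is gone.
print(max_pool_2x2(a))  # [[9.]]
print(max_pool_2x2(b))  # [[9.]]
```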
The routing process enables each capsule to maintain information obtained from the parent (previous) capsules and provides classification agreement by comparing this information. Therefore, CapsNets can store the orientation and location of components in an image. For example, a CapsNet can learn to decide whether a flower exists in the image, and also whether it is rotated or located to the right or left. This property is called equivariance. The authors in (Jimenez-Sanchez et al., 2018) reported that this property reduces the requirement for huge amounts of data in the training stage. Therefore, CapsNets are promising for image classification. CapsNets are new architectures, and explorations in the literature about their capabilities have focused on the design of capsule layers (Hinton et al., 2018; Rajasegaran et al., 2019; Deliege et al., 2018) and the inclusion of extra convolutional layers to feed primary capsules for feature extraction (Nair et al., 2018; Phaye et al., 2018). However, the performance of CapsNets in the classification of images has not been sufficiently investigated. Therefore, this paper aims (i) to point out drawbacks of CNNs, (ii) to examine CapsNets, and (iii) to present advantages, strengths and weaknesses of the CapsNets proposed for image classification. This paper is organized as follows: fundamentals of a CapsNet architecture and a comparison with convolutional networks are explained in Section 2. CapsNet-based image classification approaches are given in Section 3. Finally, discussion and conclusion are presented in Section 4.

2. FUNDAMENTALS OF A CAPSNET ARCHITECTURE AND ITS DIFFERENTIATIONS FROM A CNN

In the original CapsNet architecture (Sabour et al., 2017), capsule vectors have been used in the convolutional network. Also, instead of conventional down-sampling algorithms (e.g., maximum pooling), a dynamic routing mechanism has been implemented to link units within a capsule.
This original CapsNet structure has been extended, and various models of it have been designed to achieve image classification with high performance (Section 3). Like a CNN, a CapsNet constructs a hierarchical representation of an image by passing it through layers. On the other hand, unlike deeper CNNs with many layers, the CapsNet in its original form includes merely two capsule layers, called the primary and secondary capsule layers. The primary layer captures low-level features. The secondary layer has the ability to predict the existence of an object and its pose information in the image (Figure 1). The main properties of a CapsNet and its differentiations from a CNN architecture can be summarized as follows:
(i) Convolution is applied solely in the primary capsule layer as the first operation.
(ii) A series of feature channels are grouped to construct tensors in a CapsNet, rather than performing a non-linear operation on scalar values obtained from convolutional filters. A squashing process is applied to increase the non-linearity of the capsules. In the squashing process, a long capsule is shrunk to a length slightly less than 1 and a short capsule is shrunk to a length close to zero. In other words, the squashing algorithm takes the jth vector sj and bounds its length to the [0, 1] interval for probability modelling and preservation of vector orientations. This process produces a vector vj whose length indicates the probability that an object is present. The direction of this vector (which is also the output value of capsule j) presents information about the pose of the object.
(iii) Routing by agreement is applied to optimize the weight values (wij) in a CapsNet, rather than back-propagation as is usual in a CNN model. In the routing process, a capsule in a low level transmits its input value to a capsule in the upper level. Therefore, these weights provide connections between the ith primary capsule (low-level information) and the jth secondary capsule (high-level information).
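The squashing step and the routing-by-agreement procedure described in (ii) and (iii) can be sketched as follows (a minimal NumPy version; the tensor `u_hat` stands for the already weighted prediction vectors Wij·ui, and all shapes and the seed are illustrative assumptions, not values from the original paper):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction:
    v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||), as in Sabour et al. (2017)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing by agreement between two capsule layers.
    u_hat[i, j] is the prediction vector from primary capsule i for
    secondary capsule j; shape (num_primary, num_secondary, dim)."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                     # routing logits, start uniform
    for _ in range(num_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)        # coupling coefficients (softmax over j)
        s = (c[..., None] * u_hat).sum(axis=0)      # weighted sum per output capsule
        v = squash(s)                               # output capsules, shape (n_out, dim)
        b = b + (u_hat * v[None]).sum(axis=-1)      # agreement update: <u_hat_ij, v_j>
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 3, 4))   # 8 primary capsules, 3 output capsules, dim 4
v = dynamic_routing(u_hat)
print(np.linalg.norm(v, axis=-1))    # each output capsule length lies in [0, 1)
```

Capsules whose predictions agree with the emerging output vj receive larger coupling coefficients on the next iteration, which is the "agreement" in routing by agreement.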
In other words, the weights provide an affine transformation to learn part-whole relations.
(iv) Unlike a CNN used for classification, whose last layer is typically a softmax layer trained with a cross-entropy loss, the class capsules of a CapsNet are trained with a margin loss, which is computed for the kth class as (Sabour et al., 2017):

L_k = T_k max(0, m+ - ||v_k||)^2 + λ (1 - T_k) max(0, ||v_k|| - m-)^2     (1)

In (1), the term λ = 0.5 is a down-weighting parameter used to guarantee the final convergence. The term T_k refers to the labels and is 1 when an object does exist in the kth class. The margins are m- = 0.1 and m+ = 0.9, so the loss pushes the capsule length ||v_k|| above 0.9 when the object is present (i.e., its expected probability is greater than 0.9) and below 0.1 when it is absent. The distances between positive samples are forced to be short by the margin loss because the threshold is not assigned to 0.5. Therefore, a robust classifier is generated.
After the original CapsNet, matrix capsules were discussed and expectation-maximization-based routing was applied (Hinton et al., 2018). Figure 1 shows connections in a CapsNet (Figure 1a) and a standard convolutional network (Figure 1b).

Figure 1. A CapsNet structure (a) and a convolutional network (b)

3. CAPSNET-BASED IMAGE CLASSIFICATION TECHNIQUES

Various modifications of the original CapsNet architecture have been proposed to improve accuracy and to enhance the computational efficiency and representation capacity for image classification. For instance, in (Xi et al., 2017), the capacity of the network has been increased (by increasing both the size of the capsules and the number of layers) according to changes in the activation function. To generate semantic and structural information, a multi-scale
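The margin loss in (1) can be sketched as follows (a minimal NumPy version; the function name and the toy capsule lengths are ours, for illustration only):

```python
import numpy as np

def margin_loss(v_norms, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss of Sabour et al. (2017), summed over classes.
    v_norms: lengths ||v_k|| of the class capsules, shape (num_classes,)
    labels:  one-hot targets T_k, shape (num_classes,)."""
    present = labels * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1.0 - labels) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent)

# A confident, correct prediction incurs no loss...
print(margin_loss(np.array([0.95, 0.05]), np.array([1.0, 0.0])))  # 0.0
# ...while a confident wrong one is heavily penalized.
print(margin_loss(np.array([0.05, 0.95]), np.array([1.0, 0.0])))
```

Because each class has its own capsule and its own term in the sum, the loss naturally handles images containing multiple objects, unlike a single softmax over classes.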