2018 International Symposium on Nonlinear Theory and Its Applications, NOLTA2018, Tarragona, Spain, September 2-6, 2018

Recent Advances in the Deep CNN

Kunihiko Fukushima

Fuzzy Logic Systems Institute, Iizuka, Fukuoka, Japan
Email: [email protected]

Abstract—Deep convolutional neural networks (deep CNN) show great power for robust recognition of visual patterns. The neocognitron is a network classified to this category. It acquires the ability to recognize visual patterns robustly through learning. This paper discusses the recent neocognitron, focusing on differences from the conventional deep CNN.

1. Introduction

Recently, deep convolutional neural networks (deep CNN) have become very popular in the field of visual pattern recognition (e.g. [1]). The neocognitron, which was first proposed by Fukushima [2][3], is a network classified to this category. It is a hierarchical multi-layered network. Its architecture was suggested by neurophysiological findings on the visual systems of mammals. It acquires the ability to recognize visual patterns robustly through learning. Although the neocognitron has a long history, improvements of the network are still continuing. This paper discusses the recent neocognitron, focusing on differences from the conventional deep CNN.

Figure 1: The architecture of the neocognitron. (The figure shows the input layer U0 followed by layers US1, UC1, US2, UC2, US3, UC3, US4, UC4; the input pattern passes through feature extraction in the shallower stages toward classification in the deepest stage.)

2. Network Architecture

As shown in Fig. 1, the neocognitron consists of a cascaded connection of a number of stages preceded by an input layer U0. Each stage is composed of a layer of S-cells followed by a layer of C-cells. After learning, S-cells come to work as feature-extracting cells. C-cells pool the responses of S-cells in a retinotopic neighborhood. Incidentally, S-cells behave like simple cells in the visual cortex, and C-cells behave like complex cells. In the figure, USl and UCl indicate the layer of S-cells and the layer of C-cells of the l-th stage, respectively.

Each layer of the network is divided into a number of sub-layers, called cell-planes, depending on the feature to which cells respond preferentially. A cell-plane is a group of cells that are arranged retinotopically and share the same set of input connections. All cells in a cell-plane have receptive fields of an identical characteristic, but the locations of the receptive fields differ from cell to cell.

The input connections to S-cells are variable and are modified through learning. After learning, each S-cell comes to respond selectively to a particular visual feature. The feature extracted by an S-cell is determined during learning. Generally speaking, local features of the input pattern are extracted in shallow layers, and more global features are extracted in deeper layers.

The output of layer USl is fed to layer UCl. C-cells pool the responses of S-cells. Each C-cell has fixed excitatory connections from a group of S-cells whose receptive-field locations are slightly deviated. C-cells thus average the responses of S-cells spatially. We can also interpret this as the response of S-cells being spatially blurred by the C-cells. Although deformation of an input pattern causes a shift of local features, C-cells can absorb the effect of the shift by spatial averaging.

In the neocognitron, the pooling by C-cells is done by spatial averaging (L2 pooling), not by a MAX operation. The spatial averaging is useful, not only for robust recognition of deformed patterns, but also for smoothing additive random noise contained in the responses of S-cells. To reduce the computational cost by decreasing the density of cells, down-sampling is done in the layers of C-cells. Spatial blur (i.e., low-pass filtering of spatial frequency) before down-sampling is also important for suppressing aliasing noise caused by coarse down-sampling. The output of layer UCl is fed to layer USl+1 of the next stage.

Based on the features extracted in the intermediate stages, the final classification of the input pattern is made in the deepest stage.
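To make the pooling operation concrete, the following sketch contrasts the spatial averaging used by C-cells (here in a root-mean-square, L2 form) with a MAX operation over the same window, and includes the down-sampling that reduces the density of cells. It is only an illustration, not the implementation of the neocognitron itself; the 2x2 window, the stride, and the function names are arbitrary choices.

```python
import numpy as np

def l2_pool(s_plane, size=2, stride=2):
    """Root-mean-square (L2) pooling over non-overlapping windows,
    followed by down-sampling: one C-cell per window."""
    h, w = s_plane.shape
    h2, w2 = h // stride, w // stride
    out = np.zeros((h2, w2))
    for i in range(h2):
        for j in range(w2):
            win = s_plane[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = np.sqrt(np.mean(win ** 2))  # spatial averaging of squared responses
    return out

def max_pool(s_plane, size=2, stride=2):
    """MAX pooling over the same windows, for comparison."""
    h, w = s_plane.shape
    h2, w2 = h // stride, w // stride
    out = np.zeros((h2, w2))
    for i in range(h2):
        for j in range(w2):
            win = s_plane[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = win.max()
    return out

# An S-cell response to a local feature, and the same feature shifted by one cell.
s = np.zeros((8, 8)); s[2, 2] = 1.0
s_shifted = np.zeros((8, 8)); s_shifted[2, 3] = 1.0

print(np.allclose(l2_pool(s), l2_pool(s_shifted)))   # True: the one-cell shift is absorbed
print(l2_pool(s).max(), max_pool(s).max())           # 0.5 vs 1.0: averaging attenuates an isolated response
```

A feature shifted by one cell usually stays inside the same pooling window, so the pooled plane changes only slightly; the averaging form also attenuates an isolated noisy response instead of propagating it at full strength, as a MAX operation would.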
3. Feature Extraction by S-cells

Figure 2: Connections converging to an S-cell. (The figure shows presynaptic C-cells with outputs xn feeding the S-cell through excitatory connections an, and a V-cell that computes v = sqrt(Σn xn²) = ||x|| and inhibits the S-cell with strength −θ.)

S-cells work as feature extractors. Let vector a be the strength of excitatory connections to an S-cell from presynaptic C-cells, whose outputs are x (Fig. 2). The S-cell also receives an inhibitory connection of strength −θ (0 ≤ θ < 1) through a V-cell. The V-cell receives signals from the same group of C-cells as does the S-cell and calculates the norm of x. Namely, v = ||x||. The output u of the S-cell is given by

    u = (1/(1 − θ)) · ϕ[(a, x) − θ·||x||]                                   (1)

where ϕ[ ] is a rectified linear function. Connection a is given by a = X/||X||, where X is the training vector (or a linear combination of training vectors) that the S-cell has learned. Then, (1) reduces to

    u = ||x|| · ϕ[s − θ]/(1 − θ),   where   s = (X, x)/(||X||·||x||)        (2)

In the multi-dimensional feature space, s shows a kind of similarity between X and x, which is defined by the inner product of X and x. If similarity s is larger than θ, the S-cell yields a non-zero response [3]. Thus θ determines the threshold of the S-cell. The area that satisfies s > θ in the multi-dimensional feature space is named the tolerance area of the S-cell. We call X the reference vector of the S-cell. It represents the preferred feature of the S-cell.

Now, the distance between two vectors can be defined by the similarity s. In this paper, the nearest vector means the vector that has the largest similarity.
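As a numerical illustration of (1) and (2) (not code from the paper; the vectors and the threshold are arbitrary), the following sketch computes the response of a single S-cell with subtractive inhibition from the V-cell and checks that the two forms agree.

```python
import numpy as np

def s_cell_response(a, x, theta):
    """S-cell output with subtractive inhibition, eq. (1):
    u = (1/(1 - theta)) * relu[(a, x) - theta * ||x||],
    where a = X/||X|| is the learned connection vector."""
    relu = lambda t: max(t, 0.0)
    v = np.linalg.norm(x)                   # V-cell response: v = ||x||
    return relu(np.dot(a, x) - theta * v) / (1.0 - theta)

# Reference vector X learned by the S-cell, and a test input x.
X = np.array([1.0, 2.0, 0.5])
x = np.array([0.9, 2.1, 0.4])
theta = 0.7
a = X / np.linalg.norm(X)                   # a = X/||X||

u1 = s_cell_response(a, x, theta)

# Equivalent form, eq. (2): u = ||x|| * relu[s - theta] / (1 - theta),
# where s is the similarity (normalized inner product) between X and x.
s = np.dot(X, x) / (np.linalg.norm(X) * np.linalg.norm(x))
u2 = np.linalg.norm(x) * max(s - theta, 0.0) / (1.0 - theta)

print(np.isclose(u1, u2))   # True
print(s > theta, u1 > 0)    # the S-cell fires only inside its tolerance area (s > theta)
```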

Different from the neocognitron of older versions, the inhibition from V-cells works, not in a divisional manner, but in a subtractive manner. This is effective for increasing robustness to background noise [4].

If we use divisional inhibition, the response of an S-cell is given by u = ϕ[s − θ]/(1 − θ) instead of (2). Hence, features are extracted independently of their intensity. This might seem desirable, but it actually causes great trouble when an input pattern presented to U0 is contaminated by weak background noise. S-cells in an intermediate layer have small receptive fields. The response of an S-cell whose receptive field covers only a part of the weak background noise becomes as large as the response of an S-cell whose receptive field covers a part of the strong target pattern. This makes the network very vulnerable to noise.

4. Intermediate Layers

4.1. Add-if-Silent

For training intermediate layers of the neocognitron, an unsupervised learning rule called AiS (Add-if-Silent) is used [3]. Under the AiS rule, a new cell is generated and added to the network if all postsynaptic cells are silent in spite of non-silent presynaptic cells. The strength of the input connections (namely, the reference vector) of the generated S-cell is determined to be proportional to the response of the presynaptic C-cells at this moment. Once a cell is generated, its input connections do not change any more. Thus the training process is very simple and does not require time-consuming repetitive calculation. This means that learning by the AiS is a process of choosing a small set of reference vectors from the large set of training vectors.

Under the AiS rule, no cell can be generated any more within the tolerance areas of existing S-cells, whose size is determined by the threshold θ of the S-cells. As a result, the reference vectors of the generated S-cells come to be distributed uniformly in the multi-dimensional feature space after presentation of a large enough number of training vectors.

We can express this situation as follows. During learning with the AiS, S-cells behave like grandmother cells. Namely, each training vector elicits a response from only one (or only a small number of) S-cells. This is useful for producing a uniform distribution of the reference vectors of the generated S-cells in the feature space.

During the recognition phase, however, behavior like grandmother cells is not desirable for robust recognition of deformed patterns. If an S-cell, which has been the only active S-cell in a layer, stops responding because of a slight deformation of the test vector, and if another S-cell comes to respond instead, the layer comes to exhibit a completely different response.

For robust recognition of deformed patterns, it is desirable that some number of cells respond together to an input pattern. In other words, a situation like a sparse population coding is required. We therefore use a dual threshold for S-cells: after finishing the learning, the threshold of the S-cells is set to a lower value than the threshold used for the learning. We can thus produce a situation like a population coding during the recognition phase.

We can understand that S-cells of a low threshold produce a blur (or pooling) in the feature space, while C-cells produce a blur in the retinotopic space.

The reason why a good recognition rate can be obtained with the simple algorithm of the AiS can be explained as follows. The final classification of input patterns is made, not by an intermediate layer, but by the deepest layer of the network. The role of the intermediate layers is to represent an input pattern accurately, not by the response of a single cell, but by population coding. In the case of population coding, best-fitting of individual cells to the training stimuli is not necessarily important. It is enough if the input pattern is accurately represented by the response of the whole set of cells.
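The AiS rule itself can be stated in a few lines of code. The sketch below is a toy rendering, not the program used for the neocognitron: a "training vector" here is just a generic feature vector, and the threshold value is an arbitrary choice. A new cell is generated only when every existing S-cell is silent, i.e., when no existing reference vector has similarity above the learning threshold.

```python
import numpy as np

def similarity(X, x):
    """Similarity s = (X, x) / (||X|| ||x||), as in eq. (2)."""
    return np.dot(X, x) / (np.linalg.norm(X) * np.linalg.norm(x))

def add_if_silent(training_vectors, theta_learn=0.9):
    """Add-if-Silent: choose reference vectors from the training set.
    A vector is adopted only if it falls outside the tolerance area
    (s > theta_learn) of every reference vector chosen so far."""
    refs = []
    for x in training_vectors:
        if np.linalg.norm(x) == 0.0:
            continue                       # silent presynaptic cells: nothing to learn
        if all(similarity(X, x) <= theta_learn for X in refs):
            refs.append(x.copy())          # connections are fixed once the cell is generated
    return refs

rng = np.random.default_rng(0)
train = rng.random((1000, 3))              # toy non-negative 3-D feature vectors
refs = add_if_silent(train, theta_learn=0.9)
print(len(refs), len(train))               # far fewer reference vectors than training vectors
```

Because a vector falling inside the tolerance area of an existing cell is never adopted, the chosen reference vectors end up spread roughly uniformly over the region of the feature space visited by the training vectors, as described above.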
4.2. AiS with Feedback

To apply the AiS (Add-if-Silent) rule to the neocognitron, a slight modification is required because the neocognitron is a CNN (convolutional neural network). Each layer of the neocognitron consists of a number of cell-planes. In a cell-plane, all cells are arranged retinotopically and share the same set of input connections. This condition of shared connections has to be kept even during the learning.

In the neocognitron, generation of a new S-cell means the generation of a new cell-plane. All cells in the cell-plane come to have the same input connections as the generated S-cell.

Suppose a training pattern is presented to input layer U0. Here, the response of the C-cells of UCl−1 works as the training stimulus for USl. The AiS rule is applied at the retinotopic location where all postsynaptic S-cells are silent in spite of non-silent presynaptic C-cells. If there are a number of such locations, we have to choose one of them. We use negative feedback signals from non-silent S-cells for this purpose [5]. The strength of the negative feedback connections from an S-cell is the same as that of the feed-forward connections converging to the S-cell. Negative feedback signals from active postsynaptic S-cells thus inhibit the presynaptic C-cells that have already contributed to eliciting responses from S-cells. We then apply the AiS rule at the location where the response of the C-cells, which have received the negative feedback, is the largest.

After generation of a new cell-plane with the AiS rule, if there still remains any area in which presynaptic C-cells have not been inhibited yet, the same process of generating a cell-plane is repeated. We can thus choose all retinotopic locations where important features exist.

It should be noted here, however, that the feedback is used only for determining the retinotopic location where the AiS rule is to be applied, and that the reference vector of the generated S-cell (the seed cell) is set to be proportional to the response of the presynaptic C-cells before being suppressed by the negative feedback.
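The seed-selection step can be sketched as follows. This is a deliberately simplified, hypothetical rendering of the procedure described above: real cell-planes are two-dimensional and the feedback acts through connections, whereas here each retinotopic location is reduced to a feature vector and the suppression is applied in an all-or-nothing manner.

```python
import numpy as np

def similarity(X, x):
    n = np.linalg.norm(X) * np.linalg.norm(x)
    return np.dot(X, x) / n if n > 0 else 0.0

def generate_cell_planes(c_responses, theta_learn=0.9):
    """c_responses: array (n_locations, n_features) of presynaptic C-cell
    responses at the retinotopic locations of one training pattern.
    Returns the reference vectors of the cell-planes generated by AiS."""
    refs = []                                   # one reference vector per new cell-plane
    residual = c_responses.copy()               # C responses after negative feedback
    while True:
        # Locations where every existing S-cell (cell-plane) is still silent.
        silent = [i for i, x in enumerate(residual)
                  if np.linalg.norm(x) > 0
                  and all(similarity(X, x) <= theta_learn for X in refs)]
        if not silent:
            break
        # Apply AiS at the silent location with the largest remaining C response.
        i = max(silent, key=lambda k: np.linalg.norm(residual[k]))
        seed = c_responses[i].copy()            # reference vector taken before suppression
        refs.append(seed)
        # Negative feedback: suppress the C responses already accounted for.
        for k in range(len(residual)):
            if similarity(seed, residual[k]) > theta_learn:
                residual[k] = np.zeros_like(residual[k])
    return refs

rng = np.random.default_rng(1)
c = np.abs(rng.normal(size=(30, 8)))            # toy C-cell responses at 30 locations
planes = generate_cell_planes(c)
print(len(planes))                              # cell-planes generated for this toy pattern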

5. Deepest Layer

Training of the S-cells of USL is done by supervised learning. The reference vectors of the S-cells are created in such a way that the large number of training vectors of each class can be represented by a small number of reference vectors. Each reference vector is made of a weighted sum of training vectors of the same class and has a label of the class name. Here we first discuss the method of recognition, before discussing the detailed method of learning.

5.1. Interpolating-Vector

Test patterns presented to input layer U0 are classified in the deepest layer, USL, based on the features extracted by the intermediate layers (Fig. 1). For this purpose, a method named Interpolating-Vector (IntVec) is used [3].

Figure 3: Recognition by the IntVec (Int-2). The test vector is classified, not to class B, but to class A, because the nearest line is chosen instead of the nearest reference vector.

Let x be the input signals to an S-cell of USL, which are the responses of the C-cells of UCL−1. In the deepest layer, the threshold of the S-cells is set to θ = 0. Hence, from (2), the response of an S-cell is given by u = s·||x||,¹ where s = (X, x)/(||X||·||x||) is the similarity between x and the reference vector X of the S-cell.

In the multi-dimensional feature space, we assume lines connecting every pair of reference vectors of the same label (Fig. 3). Every line is assigned the same label as the reference vectors that span the line. We then measure distances (based on similarity) to these lines from test vector x. The label of the nearest line (namely, the line that has the largest similarity to x), instead of that of the nearest reference vector, shows the result of recognition.

The process of searching for the nearest line can be expressed mathematically as follows. Let Xi and Xj be two reference vectors of the same label. Let ξ be a vector that is given by a linear combination of this pair of vectors. It is named an interpolating vector, and represents a point on the line connecting the two vectors:

    ξ = pi·Xi/||Xi|| + pj·Xj/||Xj||,   (pi + pj = 1)                        (3)

Over the possible combinations of pi and pj, the similarity between ξ and test vector x takes a maximum value

    s_line = sqrt[ (si² − 2·si·sj·sij + sj²) / (1 − sij²) ]                 (4)

where

    si  = (Xi, x) / (||Xi||·||x||)
    sj  = (Xj, x) / (||Xj||·||x||)
    sij = (Xi, Xj) / (||Xi||·||Xj||)

We can interpret that s_line represents the similarity (or distance) between test vector x and the line Xi–Xj.

¹ For economy of computational cost, analysis of the response of the S-cells of USL is actually performed, not at all retinotopic locations, but only at the location where v = ||x|| (the response of the V-cell) is the largest.
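The following sketch evaluates (4) and classifies a test vector by the label of the nearest line. It is a toy example with made-up three-dimensional vectors, not the paper's implementation; the numbers are chosen so that the situation of Fig. 3 is reproduced: the nearest single reference vector belongs to class B, but the nearest line belongs to class A.

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def s_line(Xi, Xj, x):
    """Similarity between test vector x and the line Xi-Xj, eq. (4)."""
    si, sj, sij = cos_sim(Xi, x), cos_sim(Xj, x), cos_sim(Xi, Xj)
    return np.sqrt((si**2 - 2.0*si*sj*sij + sj**2) / (1.0 - sij**2))

def classify_int2(refs, labels, x):
    """Int-2: label of the nearest line spanned by reference vectors of the same label."""
    best_label, best_s = None, -1.0
    for i in range(len(refs)):
        for j in range(i + 1, len(refs)):
            if labels[i] == labels[j]:
                s = s_line(refs[i], refs[j], x)
                if s > best_s:
                    best_s, best_label = s, labels[i]
    return best_label

def classify_wta(refs, labels, x):
    """Plain nearest-reference-vector (WTA) classification, for comparison."""
    return labels[int(np.argmax([cos_sim(X, x) for X in refs]))]

refs = [np.array([1.0, 0.0, 0.2]), np.array([0.0, 1.0, 0.2]),   # class A
        np.array([0.8, 0.5, 1.0]), np.array([0.5, 0.8, 1.0])]   # class B
labels = ['A', 'A', 'B', 'B']
x = np.array([0.7, 0.7, 0.15])

print(classify_wta(refs, labels, x))    # 'B': the nearest single reference vector
print(classify_int2(refs, labels, x))   # 'A': the nearest line, as in Fig. 3
```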
This method of the IntVec, which is named Int-2, can be extended to Int-3. In Int-3, we assume planes spanned by every trio of reference vectors of the same label. Although we use planes instead of lines, the rest of the process is the same as that for Int-2. The computational cost increases a little, but Int-3 yields a better recognition rate than Int-2. Computer simulation has shown that the recognition error can be made much smaller by the IntVec than by the WTA or even by the SVM.

Not only in the neocognitron but also in most neural networks, the recognition error can usually be reduced by increasing the number of training patterns. It is reported that, if a sufficiently large number of training patterns is not available, even the use of artificially generated training patterns to cover the shortage is effective. We can understand that the IntVec performs this process, not during the training phase, but during the recognition phase. Namely, the number of extracted features is virtually increased during the recognition phase, without increasing the number of training patterns themselves.

5.2. Margined WTA

One of the important roles of learning is to produce a compact set of reference vectors (namely, S-cells) that can accurately represent the large set of training vectors. Similarly to the intermediate layers, S-cells are generated during learning. In the deepest layer USL, however, each S-cell (namely, its reference vector) has a label indicating the class name to which it is to be classified. Hence the training of the deepest layer is done by supervised learning.

Figure 4: During training by the mWTA, a handicap is given to vectors of classes other than A. In this figure, training vector x is judged to be classified correctly by the WTA, but erroneously by the mWTA. (In the figure, the nearest vector of class A wins under the WTA, but with the margin the nearest vector of class B (= non-A) wins, so x is adopted as a new reference vector.)

The result of classification by the IntVec (as well as by other methods, say, the WTA or the SVM) is mainly affected by reference vectors located near the borders between different classes. Hence it is desirable that the reference vectors of each class be generated so that they are distributed more densely near the class borders than near the center of the class's cluster.

A simple method for producing this situation is the use of the WTA (Winner-Take-All). Every time a training vector is presented, we search for the nearest reference vector. If the nearest reference vector has a label different from that of the training vector, we judge that the classification is wrong and generate a new reference vector (namely, a new S-cell) at the location of the training vector. In other words, the training vector itself is adopted as the new reference vector. If the classification is correct, however, the training vector is just ignored, and no reference vector is generated.

Here we propose, not to use the WTA directly, but to introduce a certain amount of margin into the WTA. We call this the margined Winner-Take-All (mWTA) [5][6]. In other words, in the search for the nearest reference vector during the learning, we impose a handicap on the reference vectors of classes other than that of the training vector (Fig. 4). Thus, reference vectors are just chosen (or sampled) from vectors in the training set.

Different from the intermediate layers, the tuning of the chosen reference vectors is effective for reducing the recognition error, because, in the deepest layer, the test vector is classified based on similarities to individual labeled reference vectors. The tuning is done by adding training vectors to the reference vectors selected by the IntVec.

It is important, however, that the tuning should not start from the beginning of the learning. If we start tuning from the beginning of the learning, when a sufficiently large number of reference vectors has not been generated yet, there is a large risk of distorting reference vectors that have already been generated. Even if a trio (in the case of Int-3) of reference vectors happened to span the nearest plane, there is a possibility that the three reference vectors are not relevant enough to the training vector.

We therefore separate the learning process of the deepest layer into two steps. In the first step, reference vectors are just generated, and their tuning starts from the second step.
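The generation step of the mWTA can be sketched as follows. This is a toy rendering with assumed details: in particular, the handicap is modeled here as a fixed additive bonus to the similarity of reference vectors of other classes, which is only one possible form, and the second-step tuning by the IntVec is not included.

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def train_mwta(xs, ys, margin=0.05):
    """Margined WTA: reference vectors are sampled from the training set.
    During the nearest-reference search, reference vectors of classes other
    than the training vector's class get a handicap (modeled additively here)
    that lets them win more easily; a training vector that then loses the
    search is adopted as a new labeled reference vector."""
    refs, labels = [], []
    for x, c in zip(xs, ys):
        if refs:
            handicapped = [cos_sim(X, x) + (margin if lab != c else 0.0)
                           for X, lab in zip(refs, labels)]
            if labels[int(np.argmax(handicapped))] == c:
                continue                 # judged correct: the training vector is ignored
        refs.append(x.copy())            # judged wrong (or no references yet): adopt it as-is
        labels.append(c)
    return refs, labels

# Toy two-class data; new reference vectors tend to accumulate near the class border.
rng = np.random.default_rng(2)
xs = rng.normal(size=(500, 2)) + np.array([2.0, 2.0])
ys = ['A' if p[0] > p[1] else 'B' for p in xs]
refs, labels = train_mwta(xs, ys, margin=0.05)
print(len(refs), labels[:10])            # reference vectors sampled from the training set
```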
6. Networks Extended from the Neocognitron

Various extensions and modifications of the neocognitron have been proposed [3]. For example, by adding top-down connections to the neocognitron, the function of selective attention can be introduced. The ability to recognize and complete partly occluded patterns can also be realized.

Acknowledgments: This work was partially supported by JSPS KAKENHI Grant Number 25330300.

References

[1] J. Schmidhuber, "Deep learning in neural networks: an overview", Neural Networks, vol.61, pp.85–117, 2015.

[2] K. Fukushima, "Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position", Biological Cybernetics, vol.36, no.4, pp.193–202, 1980.

[3] K. Fukushima, "Artificial vision by multi-layered neural networks: neocognitron and its advances", Neural Networks, vol.37, pp.103–119, 2013.

[4] K. Fukushima, "Increasing robustness against background noise: visual pattern recognition by a neocognitron", Neural Networks, vol.24, pp.767–778, 2011.

[5] K. Fukushima, H. Shouno, "Deep convolutional network neocognitron: improved interpolating-vector", IJCNN 2015, pp.1603–1610, 2015.

[6] K. Fukushima, "Margined winner-take-all: new learning rule for pattern recognition", Neural Networks, vol.97, pp.152–161, 2018.
