NOLTA, IEICE Invited Paper

Recent advances in the deep CNN neocognitron

Kunihiko Fukushima 1 a)

1 Fuzzy Logic Systems Institute, 680-41 Kawazu, Iizuka, Fukuoka 820-0067, Japan
a) [email protected]

Received October 19, 2018; Published October 1, 2019

Abstract: Deep convolutional neural networks (deep CNN) show great power for robust recognition of visual patterns. The neocognitron, first proposed by Fukushima (1979), is a network of this category. Its architecture was suggested by neurophysiological findings on the visual systems of mammals. It acquires the ability to recognize visual patterns robustly through learning. Although the neocognitron has a long history, improvements to the network are still continuing. This paper discusses the recent neocognitron, focusing on differences from the conventional deep CNN. Some other functions of the visual system can also be realized by networks extended from the neocognitron, for example, recognition of partly occluded patterns, the mechanism of selective attention, and so on.

Key Words: deep CNN, pattern recognition, neocognitron, add-if-silent rule, interpolating-vector

1. Introduction

Recently, deep convolutional neural networks (deep CNN) have become very popular in the field of visual pattern recognition (e.g., [1]). The neocognitron, first proposed by Fukushima [2–4], is a network of this category. It is a hierarchical multi-layered network. Its architecture was suggested by neurophysiological findings on the visual systems of mammals. It acquires the ability to recognize visual patterns robustly through learning. Although the neocognitron has a long history, improvements to the network are still continuing.

For training the intermediate layers of the neocognitron, a learning rule called AiS (add-if-silent) is used. Under the AiS rule, a new cell is generated and added to the network if all postsynaptic cells are silent in spite of non-silent presynaptic cells.
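Schematically, the generation step of the AiS rule might look like the sketch below. The function name `ais_update`, the list-of-reference-vectors representation, and the fixed threshold `theta` are illustrative assumptions, not the paper's exact formulation; the silence test reuses the similarity-threshold response described later in Section 3.

```python
import numpy as np

def ais_update(cells, x, theta=0.5):
    """One step of a simplified Add-if-Silent (AiS) rule.

    `cells` holds the reference vectors of existing S-cells;
    `x` is the presynaptic activity vector for the current stimulus.
    """
    nx = np.linalg.norm(x)
    if nx == 0.0:
        return cells                     # presynaptic cells silent: nothing to learn
    silent = True
    for X in cells:
        # An existing cell responds only if its similarity to x exceeds theta.
        s = np.dot(X, x) / (np.linalg.norm(X) * nx)
        if max(s - theta, 0.0) > 0.0:
            silent = False
    if silent:                           # all postsynaptic cells silent
        cells.append(x.copy())           # one-shot learning: new cell adopts x
    return cells
```

Because a generated cell simply copies the presynaptic activity, no repetitive error-correction loop is needed; presenting each training pattern once per layer suffices in this toy setting.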
The generated cell learns the activity of the presynaptic cells in one shot. Once a cell has been generated, its input connections do not change any more. The training process is therefore very simple and does not require time-consuming repetitive calculation.

In the deepest layer, a method called IntVec (interpolating-vector) is used for classifying input patterns based on the features extracted by the intermediate layers. For recognition by the IntVec, we search the multi-dimensional feature space for the nearest plane (or line) spanned by a trio (or pair) of reference vectors. Computer simulation shows that the recognition error can be made much smaller by the IntVec than by the WTA (winner-take-all) or even by the SVM (support vector machine).

Nonlinear Theory and Its Applications, IEICE, vol. 10, no. 4, pp. 304–321, © IEICE 2019. DOI: 10.1587/nolta.10.304

Some other functions of the visual system can also be realized by networks extended from the neocognitron, for example, recognition of partly occluded patterns, the mechanism of selective attention, and so on. Sections 2 to 5 discuss the recent neocognitron, focusing on differences from the conventional deep CNN. Sections 6 and 7 show some networks extended from the neocognitron.

2. Network architecture of the neocognitron

As shown in Fig. 1, the neocognitron consists of a cascaded connection of a number of stages preceded by an input layer U0. Each stage is composed of a layer of S-cells followed by a layer of C-cells. After learning, S-cells come to work as feature-extracting cells; C-cells pool the responses of S-cells in a retinotopic neighborhood. Incidentally, S-cells behave like simple cells in the visual cortex, and C-cells behave like complex cells. In the figure, USl and UCl indicate the layer of S-cells and the layer of C-cells of the l-th stage, respectively.

Fig. 1. The architecture of the neocognitron.
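As a toy illustration of the cascade in Fig. 1, the sketch below runs a 1-D input through alternating S- and C-layers. The rectified linear feature maps and plain average pooling are simplifications standing in for learned S-cells and the blur-plus-down-sampling of C-cells; all names and shapes are illustrative assumptions.

```python
import numpy as np

def s_layer(u, w, theta=0.0):
    """Toy S-layer: one linear feature map per row of w, rectified.
    Real S-cells compare the input against learned reference vectors."""
    return np.maximum(u @ w.T - theta, 0.0)

def c_layer(u, pool=2):
    """Toy C-layer: average pooling along the retinotopic axis,
    i.e. spatial blur followed by down-sampling."""
    n = (u.shape[0] // pool) * pool
    return u[:n].reshape(-1, pool, u.shape[1]).mean(axis=1)

def forward(u0, stage_weights):
    """Cascade of Fig. 1: U0 -> (US1 -> UC1) -> (US2 -> UC2) -> ..."""
    u = u0
    for w in stage_weights:
        u = c_layer(s_layer(u, w))   # US_l followed by UC_l
    return u
```

Each stage halves the retinotopic resolution while allowing the number of cell-planes to grow, mirroring the neocognitron's progression from local to global features.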
Each layer of the network is divided into a number of sub-layers, called cell-planes, depending on the feature to which cells respond preferentially. A cell-plane is a group of cells that are arranged retinotopically and share the same set of input connections. All cells in a cell-plane have receptive fields of an identical characteristic, but the locations of the receptive fields differ from cell to cell.

The input connections to S-cells are variable and are modified through learning. After learning, each S-cell comes to respond selectively to a particular visual feature. The feature extracted by an S-cell is determined during learning. Generally speaking, local features of the input pattern are extracted in shallow layers, and more global features are extracted in deeper layers.

The output of layer USl is fed to layer UCl. C-cells pool the responses of S-cells. Each C-cell has fixed excitatory connections from a group of S-cells whose receptive field locations are slightly deviated. C-cells thus average the responses of S-cells spatially. We can also interpret this as the S-cells' response being spatially blurred by C-cells. Although deformation of an input pattern causes shift of local features, C-cells can absorb the effect of shift by spatial averaging.

In the neocognitron, the pooling by C-cells is done by spatial averaging (L2 pooling), not by MAX-operation. The spatial averaging is useful, not only for robust recognition of deformed patterns, but also for smoothing additive random noise contained in the responses of S-cells. To reduce the computational cost by decreasing the density of cells, down-sampling of cells is done in layers of C-cells. Spatial blur (i.e., a low-pass filter for spatial frequency) before down-sampling is also important for suppressing aliasing noise caused by coarse down-sampling. The output of layer UCl is fed to layer USl+1 of the next stage.
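A minimal sketch of the C-cell operation just described, assuming a 3×3 averaging neighborhood and a stride-2 down-sampling (both illustrative choices, not the paper's actual parameters):

```python
import numpy as np

def c_pool(s_resp, stride=2):
    """C-cell pooling sketch: spatial averaging (blur) over a small
    neighborhood, then down-sampling.  The blur acts as a low-pass
    filter, suppressing aliasing that coarse down-sampling alone
    would introduce."""
    h, w = s_resp.shape
    blurred = np.zeros_like(s_resp)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(i - 1, 0), min(i + 2, h)
            j0, j1 = max(j - 1, 0), min(j + 2, w)
            blurred[i, j] = s_resp[i0:i1, j0:j1].mean()   # 3x3 average
    return blurred[::stride, ::stride]                    # down-sample
```

Averaging before sub-sampling is exactly the anti-aliasing argument made above: the blur removes spatial frequencies that the coarser grid of C-cells could not represent.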
Based on the features extracted in the intermediate stages, the final classification of the input pattern is made in the deepest stage.

3. Feature extraction by S-cells

3.1 Response of an S-cell

S-cells work as feature extractors. Let vector a be the strengths of the excitatory connections to an S-cell from presynaptic C-cells, whose outputs are x (Fig. 2). The V-cell receives signals from the same group of C-cells as does the S-cell and calculates the norm of x, namely v = ‖x‖.

Fig. 2. Connections converging to an S-cell.

The output u of the S-cell is given by

    u = (1/(1 − θ)) · ϕ[(a, x) − θ‖x‖]        (1)

where ϕ[ ] is a rectified linear function, defined by ϕ[x] = max(x, 0). Connection a is given by a = X/‖X‖, where X is the training vector (or a linear combination of training vectors) that the S-cell has learned. Then (1) reduces to

    u = ‖x‖ · ϕ[s − θ]/(1 − θ),   where   s = (X, x)/(‖X‖ · ‖x‖)        (2)

In the multi-dimensional feature space, s is a kind of similarity between X and x, defined by their normalized inner product. If the similarity s is larger than θ, the S-cell yields a non-zero response [4]. Thus θ determines the threshold of the S-cell. The area that satisfies s > θ in the multi-dimensional feature space is named the tolerance area of the S-cell (Fig. 3). We call X the reference vector of the S-cell. It represents the preferred feature of the S-cell.

Fig. 3. Tolerance area of an S-cell in the feature space [4].

A distance between two vectors can now be defined through the similarity s: in this paper, the nearest vector means the vector that has the largest similarity.

3.2 Robustness to background noise

Unlike older versions of the neocognitron, the inhibition from V-cells works not in a divisional manner but in a subtractive manner. This is effective for increasing robustness to background noise [5].

Figure 4 illustrates a situation in which the background noise is a faint image of another character.
As shown in the lower left of the figure, each S-cell of an intermediate layer has a small receptive field and observes only a part of the input pattern.

Fig. 4. The effect of background noise on the response of S-cells of an intermediate layer: comparison between subtractive and divisional inhibition. Recent neocognitrons use subtractive inhibition, while old neocognitrons used divisional inhibition. (modified from [5])

If the inhibition works in a divisional manner, an S-cell whose receptive field (left circle on U0) covers only a part of the faint background character responds strongly despite the weak intensity of the stimulus. As a result, S-cells in the intermediate layer respond as though the input pattern shown in the lower right of Fig. 4 were presented. Correct recognition of the foreground character is largely interfered with by the strong responses elicited from S-cells by these irrelevant features.

On the contrary, if the inhibition works in a subtractive manner, the faint background noise elicits only negligibly small responses from S-cells. The responses of S-cells are thus little affected by the background noise. Incidentally, when the receptive field of an S-cell covers parts of both the background and foreground characters together, as shown by the upper circle in U0, the S-cell's response is almost the same with either divisional or subtractive inhibition.

Figure 5 shows how the intensity of the background noise affects the recognition error of the neocognitron.

Fig. 5. The recognition error under background noise, which is a faint image of a different digit. The recognition error remains small if the inhibition to S-cells works in a subtractive manner, but becomes very large if it works in a divisional manner [5].
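The contrast in Figs. 4 and 5 can be illustrated numerically. The subtractive branch of the sketch below follows Eq. (2); the divisional branch is only a simplified stand-in for the older mechanism (the exact old formulation differs), included to show why a faint background feature with high similarity still fires strongly once the stimulus intensity is divided out.

```python
import numpy as np

def s_cell(x, X, theta=0.6, inhibition="subtractive"):
    """S-cell response.  X: reference (preferred) vector, x: input.
    'subtractive' follows Eq. (2); 'divisional' is a toy variant in
    which the stimulus intensity ||x|| is normalized away."""
    nx, nX = np.linalg.norm(x), np.linalg.norm(X)
    if nx == 0.0:
        return 0.0
    s = np.dot(X, x) / (nX * nx)           # similarity s of Eq. (2)
    phi = max(s - theta, 0.0)              # rectified linear function
    if inhibition == "subtractive":
        return nx * phi / (1.0 - theta)    # scales with intensity ||x||
    return phi / (1.0 - theta)             # intensity cancels out

# A faint copy of the preferred feature (5% intensity):
X = np.array([3.0, 4.0])
x = 0.05 * X
faint_sub = s_cell(x, X, inhibition="subtractive")   # weak response
faint_div = s_cell(x, X, inhibition="divisional")    # full-strength response
```

With θ = 0.6, the faint stimulus yields a response of 0.25 under subtractive inhibition but 1.0 under the divisional variant, reproducing in miniature the behavior illustrated in Fig. 4.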