arXiv:1909.13480v1 [cs.NE] 30 Sep 2019

A Video Recognition Method by using Adaptive Structural Learning of Long Short Term Memory based Deep Belief Network

Shin Kamada
Advanced Artificial Intelligence Project Research Center,
Research Organization of Regional Oriented Studies,
Prefectural University of Hiroshima
1-1-71, Ujina-Higashi, Minami-ku, Hiroshima 734-8558, Japan
E-mail: [email protected]

Takumi Ichimura
Advanced Artificial Intelligence Project Research Center,
Research Organization of Regional Oriented Studies,
and Faculty of Management and Information System,
Prefectural University of Hiroshima
1-1-71, Ujina-Higashi, Minami-ku, Hiroshima 734-8558, Japan
E-mail: [email protected]

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract—Deep learning builds deep architectures such as multi-layered artificial neural networks to effectively represent multiple features of input patterns. The adaptive structural learning method of Deep Belief Network (DBN) can realize a high classification capability while searching the optimal network structure during training. The method can find the optimal number of hidden neurons of a Restricted Boltzmann Machine (RBM) by the neuron generation-annihilation algorithm to train the given input data, and then it can make a new layer in the DBN by the layer generation algorithm to actualize a deep data representation. Moreover, the learning algorithm of Adaptive RBM and Adaptive DBN was extended to time-series analysis by using the idea of LSTM (Long Short Term Memory). In this paper, our proposed deep prediction method was applied to Moving MNIST, which is a benchmark data set for video recognition. We challenge to reveal the power of our proposed method in the video recognition research field, since video includes a rich source of visual information. Compared with the LSTM model, our method showed higher prediction performance (more than 90% prediction accuracy for test data).

Index Terms—Deep learning, Deep Belief Network, Adaptive structural learning method, Video recognition

I. INTRODUCTION

Recently, Artificial Intelligence (AI) with sophisticated technologies has become an essential technique in our life. Especially, the recent advances in deep learning enable higher performance for big data compared to several traditional methods [1], [2], [3]. For example, CNNs (Convolutional Neural Networks) such as AlexNet [4], GoogLeNet [5], VGG16 [6], and ResNet [7] highly improved accuracy in image recognition or classification [8]. As an improvement of image recognition, deep learning is also applied to video recognition [9]. Video recognition is a kind of fusion task which needs both image recognition and time-series prediction simultaneously; that is, a function that classifies an image or detects objects while predicting the future, given a series of video frames, is required. Understanding of video is expected in various kinds of industrial fields, such as object detection, human pose estimation from a camera, or facial detection in an autonomous driving system, and so on [10].

LSTM (Long Short Term Memory) is a well-known method for time-series prediction and is applied to the traditional deep recurrent neural network [11]. The resulting network recognizes not only long-term memory but also short-term memory of given sequential data [12]. For video recognition, the idea of LSTM can also be used, but convolutional filters are used instead of one-dimensional neurons, since one frame of a sequential video can be seen as one image [13].

In our research, we proposed the adaptive structural learning method of DBN (Adaptive DBN) [14]. The adaptive structural learning algorithms can find a suitable size of network structure for given input space during training. The neuron generation and annihilation algorithms were implemented on the Restricted Boltzmann Machine (RBM) [17] in [15], [16], and the layer generation algorithm was implemented on the Deep Belief Network (DBN) [19] in [18]. The adaptive structural learning of DBN (Adaptive DBN) shows the highest classification capability in the research of image recognition using some benchmark data sets such as MNIST [20], CIFAR-10, and CIFAR-100 [21].

Moreover, the learning algorithm of Adaptive RBM and Adaptive DBN was extended to time-series prediction by using the idea of LSTM [22]. The idea of LSTM was implemented on the Adaptive RBM and the Adaptive DBN structure, and then the proposed method showed higher prediction accuracy than the other methods for several time-series benchmark data sets, such as Nottingham (MIDI) and CMU (Motion Capture).

For further improvement of the method, in this paper our proposed method was applied to Moving MNIST [23], which is a benchmark data set for video recognition. We challenge to reveal the power of our proposed method in the video recognition research field, since video includes a rich source of visual information. Compared with the LSTM model with hidden neurons [24], our method makes a higher performance of prediction.

The remainder of this paper is organized as follows. In Section II, the basic idea of the adaptive structural learning of DBN is briefly explained. Section III gives the description of the extension algorithm of Adaptive DBN for time-series prediction. In Section IV, the effectiveness of our proposed method is verified on Moving MNIST. In Section V, we give some discussions to conclude this paper.

II. ADAPTIVE LEARNING METHOD OF DEEP BELIEF NETWORK

Fig. 1. A network structure of RBM (visible neurons v_0, v_1, ..., v_I; hidden neurons h_0, h_1, ..., h_J; connection weights W_ij)

This section explains the traditional RBM [17] and DBN [19] to describe the basic behavior of our proposed adaptive learning method of DBN.
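As background for the energy-based criteria used later, the standard formulation from [17] can be stated with the same symbols used in the rest of this paper (b for the visible biases, c for the hidden biases, W for the connection weights); this is a paraphrase of the common convention, not a quotation of [17]:

    E(v, h) = - Σ_i b_i v_i - Σ_j c_j h_j - Σ_{i,j} v_i W_ij h_j,
    p(v, h) = exp(-E(v, h)) / Z,

where Z is the partition function. Training lowers the energy of (and raises the probability of) configurations that match the given input data.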

A. Neuron Generation and Annihilation Algorithm of RBM

While recent deep learning models have higher classification capability, some problems related to the network structure or the number of parameters still remain and make them a difficult task in AI research. For this problem, we have developed the adaptive structural learning method in the RBM model (Adaptive RBM) [14]. The RBM shown in Fig. 1 is an unsupervised graphical and energy based model with two kinds of layers: the visible layer for input and the hidden layer for the feature vector, respectively. The neuron generation algorithm of the Adaptive RBM can generate an optimal number of hidden neurons, and the trained RBM then has a suitable structure for the given input space.

The neuron generation is based on the idea of Walking Distance (WD), which is inspired by the multi-layered neural network in the paper [25]. WD is the difference between the prior variance and the current one of the learning parameters. An RBM has 3 kinds of parameters according to the visible neurons, the hidden neurons, and the weights among their connections. The Adaptive RBM monitors these parameters excluding the visible one (the paper [14] describes the reason for this disregard). A large WD means that the existing hidden neurons alone cannot represent an ambiguous pattern, because of the lack of the number of hidden neurons. In order to express the ambiguous patterns, a new neuron is inserted to inherit the attributes of the parent hidden neuron, as shown in Fig. 2(a).

In addition to the neuron generation, the neuron annihilation algorithm is applied to the Adaptive RBM after the neuron generation process, as shown in Fig. 2(b). Some unnecessary or redundant neurons may be generated by the neuron generation process. Therefore, such hidden neurons will be removed according to their output activities.

Fig. 2. Adaptive RBM: (a) Neuron generation (a new hidden neuron h_new is inserted), (b) Neuron annihilation (a redundant hidden neuron is removed)
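The two checks above can be organized as in the following Python/NumPy sketch. The exact generation and annihilation criteria are defined in [14]-[16]; the per-neuron combination of bias and weight WD, the thresholds theta_gen and theta_del, and the helper names below are illustrative assumptions only.

```python
import numpy as np

# Hedged sketch of the neuron generation / annihilation checks (Adaptive RBM).
# Assumptions: a hidden neuron is a generation candidate when the walking
# distances (WD) of its bias and incoming weights are still large, and an
# annihilation candidate when its mean activation over the data is nearly zero.

def walking_distance(prev_variation, curr_variation):
    """WD: difference between the prior and current variation of a parameter."""
    return np.abs(curr_variation - prev_variation)

def generation_candidates(wd_c, wd_W, theta_gen=0.05):
    """Indices of hidden neurons whose combined WD exceeds theta_gen.

    wd_c: WD of the hidden biases, shape (J,); wd_W: WD of the weights, shape (I, J).
    """
    score = wd_c * wd_W.mean(axis=0)
    return np.where(score > theta_gen)[0]

def annihilation_candidates(hidden_probs, theta_del=0.01):
    """Indices of hidden neurons whose mean activation p(h_j=1|v) is below theta_del."""
    return np.where(hidden_probs.mean(axis=0) < theta_del)[0]

def generate_neuron(W, c, j):
    """Insert a new hidden neuron that inherits the attributes of parent neuron j."""
    return np.concatenate([W, W[:, [j]]], axis=1), np.concatenate([c, c[[j]]])

def annihilate_neurons(W, c, idx):
    """Remove the hidden neurons listed in idx."""
    keep = np.setdiff1d(np.arange(W.shape[1]), idx)
    return W[:, keep], c[keep]
```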

B. Layer Generation Algorithm of DBN

A DBN is a hierarchical model of stacking several pre-trained RBMs. In the building process, the output (hidden neuron activations) of the l-th RBM can be seen as the next input of the (l+1)-th RBM. Generally, a DBN with multiple RBMs has higher data representation power than one RBM. Such a hierarchical model can represent the specified features from an abstract concept to a concrete object in the direction from the input layer to the output layer. However, the optimal number of RBMs depends on the target data space.

We developed Adaptive DBN, which can automatically adjust an optimal network structure by self-organization in a similar way to the WD monitoring. If both WD and the energy function do not become small values, then a new RBM will be generated to keep a suitable network classification power for the data set, since the current RBM lacks the power of data representation to draw an image of the input patterns. Therefore, the condition for layer generation is defined by using the total WD and the energy function. Fig. 3 shows the overview of layer generation in Adaptive DBN: pre-training proceeds from the input layer upward (between the 1st and 2nd layers, the 2nd and 3rd layers, and so on) with neuron generation and annihilation in each layer, so that a suitable number of hidden neurons and layers is automatically generated, followed by fine-tuning.

Fig. 3. Overview of Adaptive DBN
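A minimal sketch of the layer generation check described above, assuming the criterion simply compares the total WD and the mean RBM energy against fixed thresholds after a layer has been pre-trained; theta_wd, theta_energy, and stack_new_rbm are hypothetical names, not values or functions from the paper.

```python
import numpy as np

# Hedged sketch of the layer generation condition in Adaptive DBN.  The paper
# states only that a new RBM is stacked when both the total WD and the energy
# function do not become small; the thresholds below are hypothetical.

def mean_rbm_energy(v, h, b, c, W):
    """Mean of E(v, h) = -v.b - h.c - v.W.h over a batch (v: (N, I), h: (N, J))."""
    return float(np.mean(-v @ b - h @ c - np.sum((v @ W) * h, axis=1)))

def should_generate_layer(total_wd, energy, theta_wd=0.1, theta_energy=-100.0):
    """Stack a new RBM when neither the total WD nor the energy is small enough."""
    return (total_wd > theta_wd) and (energy > theta_energy)

# Usage sketch after pre-training the l-th RBM:
#   if should_generate_layer(total_wd_of_layer, mean_rbm_energy(v, h, b, c, W)):
#       stack_new_rbm()   # hypothetical helper: add RBM (l+1) on top
```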

III. ADAPTIVE RNN-DBN FOR TIME-SERIES PREDICTION

In time-series prediction, some LSTM methods improve the prediction performance of the traditional recurrent neural network by using several gates such as the forget gate, the peephole connection gate, and the full gradient gate [11]. These gates can represent multiple patterns of a time-series sequence, that is, not only short-term memory but also long-term memory. The Recurrent Neural Network Restricted Boltzmann Machine (RNN-RBM) [26] is an RBM based recurrent model for time-series prediction. The method is an extension model of the traditional Temporal RBM (TRBM) and the Recurrent TRBM (RTRBM) [27], and it also uses a similar idea to LSTM for better performance.

Fig. 4 shows the network structure of RNN-RBM. The RNN-RBM has a recurrent structure on a Markov process for a given time-series sequence, as well as an RNN. Let an input sequence with length T be V = {v^(1), ..., v^(t), ..., v^(T)}. v^(t) and h^(t) are the input and hidden neurons of the t-th RBM, respectively. The time dependency parameters b^(t), c^(t), and u^(t) are calculated from the past parameters at t-1 by the following equations.

    b^(t) = b + W_uv u^(t-1),                              (1)
    c^(t) = c + W_uh u^(t-1),                              (2)
    u^(t) = σ(u + W_uu u^(t-1) + W_vu v^(t)),              (3)

where σ() is an activation function; for example, the Tanh function was used in the paper [26]. u^(t) represents the time-series context for the given input. u^(0) is the initial state and its values are given randomly. θ = {b, c, W, u, W_uv, W_uh, W_vu, W_uu} is the set of learning parameters and is not time-dependent. At each time t, the RBM learning can employ the update algorithm of b^(t) and c^(t) at time t, and of the weights W between them. After the error calculation continues until time T, the gradient for θ is updated by tracing back from time T to time t by BPTT [28], [29]. BPTT is the Back Propagation Through Time method, which is often used in the traditional recurrent neural network.

Fig. 4. Structure of RNN-RBM (the context units u^(0), ..., u^(T) connect the RBMs at each time step through the weights W_uv, W_uh, W_vu, and W_uu)

We developed Adaptive RNN-RBM by applying the neuron generation and annihilation algorithm to RNN-RBM. A suitable number of hidden neurons is sought by the neuron generation and annihilation algorithm for better representation power, as in the usual Adaptive RBM [22]. In addition, the hierarchical model with layer generation was developed as Adaptive RNN-DBN by using the same monitoring function of Adaptive DBN. In other words, the output signal of the hidden neurons h^(t) of the l-th Adaptive RNN-RBM can be seen as the next input signal of the (l+1)-th Adaptive RNN-RBM, as shown in Fig. 3. The proposed Adaptive RNN-DBN was applied to several time-series benchmark data sets, such as Nottingham (MIDI) and CMU (Motion Capture). In [22], the prediction accuracy of the proposed method was higher than that of the traditional methods.
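The recurrence in Eqs. (1)-(3) can be written directly in NumPy. This is a minimal sketch for illustration: the array shapes, the use of tanh as σ (following [26]), and the reading of the parameter u in θ as the bias term of Eq. (3) are assumptions, and the RBM conditioned on (b^(t), c^(t)) at each step, together with its contrastive divergence update and the BPTT pass, is omitted.

```python
import numpy as np

# Hedged sketch of Eqs. (1)-(3).  Assumed shapes: v_t (I,), b (I,), c (J,),
# context u (K,); W_uv (I, K), W_uh (J, K), W_vu (K, I), W_uu (K, K).

def rnn_rbm_step(v_t, u_prev, p):
    """Compute b^(t), c^(t) and the next context u^(t) from u^(t-1) and v^(t)."""
    b_t = p["b"] + p["W_uv"] @ u_prev                              # Eq. (1)
    c_t = p["c"] + p["W_uh"] @ u_prev                              # Eq. (2)
    u_t = np.tanh(p["u"] + p["W_uu"] @ u_prev + p["W_vu"] @ v_t)   # Eq. (3)
    return b_t, c_t, u_t

def unroll(V, p, u_init):
    """Unroll the recurrence over a sequence V = [v^(1), ..., v^(T)].

    u_init corresponds to u^(0), which the paper initializes randomly.
    """
    u_t, biases = u_init, []
    for v_t in V:
        b_t, c_t, u_t = rnn_rbm_step(v_t, u_t, p)
        biases.append((b_t, c_t))    # these condition the RBM at time t
    return biases, u_t
```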
IV. EXPERIMENT RESULTS

In this paper, the effectiveness of our proposed method on the Moving MNIST benchmark data set [24] is verified. We challenge to reveal the power of our proposed method in the video recognition research field, since video includes a rich source of visual information. When we want to detect and identify visual objects, faces, emotions, actions, or events in real time, state-of-the-art video recognition software for comprehensive video content analysis can make advanced solutions for AI-based visual content search. Therefore, the prediction performance is compared with the traditional LSTM in this paper.

A. Data set: Moving MNIST

Moving MNIST [24] is a benchmark data set for video recognition. There are 10,000 samples, including 8,000 for training and 2,000 for test. Each sample consists of 20 sequential gray scale images ([0, 255]) of a 64 x 64 patch, where two digits move in the sequence. The digits were chosen randomly from the MNIST training set. The selected digits are placed initially at random locations inside the patch. Each digit was assigned a velocity which has a direction and a magnitude; the direction and the magnitude were also chosen randomly. Fig. 5 shows two samples of 20 sequential images.

Fig. 5. Two samples of Moving MNIST

Moving MNIST is used in various researches such as the video decomposition [30] and the future predictor [24]. This paper aims to investigate the effectiveness of our LSTM model in video recognition, and we compare the recognition capability as a future predictor.

Since a teacher signal for each sample is not provided in Moving MNIST, the LSTM in [24] investigated the cross entropy between the given sequence (ground truth) and the predicted sequence. Fig. 6 shows the composite model of LSTM in [24]. In the method, the first 10 frames are used as an input sequence of the model, and the remaining 10 frames are used for evaluation of future prediction, as shown in Fig. 5.

Fig. 6. The Composite Model of LSTM [24] (input frames v_1, v_2, v_3 are encoded into a learned representation, which is copied to an input reconstruction branch and a future prediction branch)

Fig. 7 shows an abstract procedure of prediction in our method, and the same evaluation method was used. Since our method can predict the next frame for a given frame, the predicted frame is used as the next input, and then the squared error and the prediction accuracy between the ground truth images and the predicted images were evaluated.

Fig. 7. Abstract of prediction
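A minimal sketch of the closed-loop evaluation just described: the first 10 frames are fed as ground truth, each predicted frame is fed back as the next input, and the last 10 frames are scored. The model interface predict_next(), the [0, 1] scaling, and the 0.5 threshold used for the per-frame correct ratio are assumptions for illustration, not details from the paper.

```python
import numpy as np

# Hedged sketch of the evaluation procedure of Sec. IV-A on one Moving MNIST
# sample.  `model.predict_next(frame)` is a hypothetical one-step predictor.

def evaluate_sequence(model, frames, n_input=10):
    """frames: array of shape (20, 64, 64), pixel values scaled to [0, 1]."""
    # Warm-up: feed the ground-truth input frames; keep the last prediction.
    for t in range(n_input):
        pred = model.predict_next(frames[t])
    # Closed loop: each prediction becomes the next input frame.
    preds = []
    for _ in range(len(frames) - n_input):
        preds.append(pred)
        pred = model.predict_next(pred)
    preds, truth = np.stack(preds), frames[n_input:]
    sq_error = float(np.sum((preds - truth) ** 2))
    # Per-frame correct ratio (cf. Table III); its mean gives an overall accuracy.
    per_frame_acc = np.mean((preds > 0.5) == (truth > 0.5), axis=(1, 2))
    return sq_error, per_frame_acc.mean(), per_frame_acc
```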
B. Experimental Results

Table I and Table II show the prediction results on Moving MNIST for training and test. The value in Table I is the cross entropy between the ground truth images and the predicted images at the last sequence. The result of the LSTM is cited from the paper [24]. The left value of each column in Table II is the squared error, and the value in brackets is the prediction accuracy.

TABLE I
PREDICTION RESULT (CROSS ENTROPY)

    Model                             Training    Test
    LSTM [24]                         -           341.2
    Adaptive RNN-DBN (lr = 0.010)     18.5        165.3
    Adaptive RNN-DBN (lr = 0.050)     16.1        134.0
    Adaptive RNN-DBN (lr = 0.001)     17.0        140.6

TABLE II
PREDICTION RESULT (SQUARED LOSS AND CORRECT RATIO)

    Model                             Training        Test
    Adaptive RNN-DBN (lr = 0.010)     11.4 (99.0%)    119.9 (91.8%)
    Adaptive RNN-DBN (lr = 0.050)     10.3 (99.4%)    100.5 (92.5%)
    Adaptive RNN-DBN (lr = 0.001)     14.5 (98.9%)    140.2 (89.4%)

In the LSTM, the result for training and the prediction accuracy for test are not reported. In the Adaptive RNN-DBN, three settings of the learning ratio (lr) were used for evaluation. For training of the Adaptive RNN-DBN, the cross entropy and the squared error reached almost 0, and the prediction accuracy reached almost 100%. For test, the three settings of the Adaptive RNN-DBN showed smaller prediction error than the LSTM. The best performance was acquired when the learning ratio was 0.050 in the Adaptive RNN-DBN.

We also investigated the intermediate results in the predicted images of the Adaptive RNN-DBN. Table III shows the future prediction accuracy for each predicted frame. Basically, the prediction accuracy slowly decreased from the 11th to the 20th frame.

TABLE III
PREDICTION ACCURACY FOR EACH FRAME

    Model \ Sequence                  11      12      13      14      15      16      17      18      19      20
    Adaptive RNN-DBN (lr = 0.010)     98.2%   97.9%   97.0%   95.2%   94.5%   93.9%   93.3%   92.8%   92.0%   91.8%
    Adaptive RNN-DBN (lr = 0.050)     99.5%   98.0%   97.9%   96.5%   96.4%   94.5%   93.1%   92.9%   92.8%   92.5%
    Adaptive RNN-DBN (lr = 0.001)     96.8%   96.8%   96.4%   94.8%   93.6%   92.9%   91.8%   91.0%   90.0%   89.4%

nications Surveys & Tutorials (2018) ^ ^ ^ [10] H.Zhang, Y.Zhang and B.Zhong, et.al., A Comprehensive Survey of Input Reconstruction v3 v2 v1 Vision-Based Human Action Recognition Methods, Sensors, vol.19, no.5, pp.1–20 (2019) [11] Y.Bengio, P.Simard, and P.Frasconi, Learning long-term dependencies W2 W2 with is difficult, IEEE Transactions on Neural Networks, vol.5, no.2, pp.157–166 (1994) [12] Z.C.Lipton, D.C.Kale, C.Elkan, and R.Wetzell, Learning to Diagnose with LSTM Recurrent Neural Networks, in International Conference on Learned Representation Learning Representations (ICLR 2016), pp.1–18 (2016) copy v v [13] S.Xingjian, C.Zhourong, W.Hao, Y.Dit-Yan, W.Wai-kin, W.Wang-chun, 3 2 Convolutional LSTM Network: A Machine Learning Approach for Precip- W1 W1 itation Nowcasting, Advances in Neural Information Processing Systems 28 (NIPS 2015) pp.802–810 (2015) [14] S.Kamada, T.Ichimura, A.Hara, and K.J.Mackin, Adaptive Structure copy ^ ^ ^ v4 v5 v6 Learning Method of Deep Belief Network using Neuron Generation- Annihilation and Layer Generation, Neural Computing and Applications, pp.1–15 (2018) v1 v2 v3 W W [15] S.Kamada and T.Ichimura, An Adaptive Learning Method of Restricted 3 3 Boltzmann Machine by Neuron Generation and Annihilation Algorithm. Proc. of 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC2016), pp.1273–1278 (2016) [16] S.Kamada, T.Ichimura, A Structural Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm, Future Prediction Neural Information Processing, Proc. of the 23rd International Conference v4 v5 on Neural Information Processing, Springer LNCS9950), pp.372–380 (2016) [17] G.E.Hinton, A Practical Guide to Training Restricted Boltzmann Ma- Fig. 6. The Composite Model of LSTM [24] chines, Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science (LNCS, vol.7700), pp.599–619 (2012) [18] S.Kamada and T.Ichimura, An Adaptive Learning Method of Deep Belief Network by Layer Generation Algorithm, Proc. of IEEE TENCON2016, [5] C.Szegedy, W. Liu, Y.Jia, P.Sermanet, S.Reed, D.Anguelov, D.Erhan, pp.2971–2974 (2016) V.Vanhoucke, A.Rabinovich, Going Deeper with Convolutions, Proc. of [19] G.E.Hinton, S.Osindero and Y.Teh, A fast learning algorithm for deep CVPR2015 (2015) belief nets, Neural Computation, vol.18, no.7, pp.1527–1554 (2006) [6] K.Simonyan, A.Zisserman, Very deep convolutional networks for large- [20] Y.LeCun, L.Bottou, Y.Bengio, and P.Haffner, Gradient-based learning scale image recognition, Proc. of International Conference on Learning applied to document recognition, Proc. of the IEEE, vol.86, no.11, Representations (ICLR 2015) (2015) pp.2278–2324 (1998) [7] K.He, X.Zhang, S.Ren, J.Sun, J, Deep residual learning for image [21] A.Krizhevsky: Learning Multiple Layers of Features from Tiny Images, recognition, Proc. of 2016 IEEE Conference on Computer Vision and Master of thesis, University of Toronto (2009) Pattern Recognition (CVPR), pp.770-778 (2016) [22] T.Ichimura, S.Kamada, Adaptive Learning Method of Recurrent Tem- [8] O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S.Ma, Z.Huang, poral Deep Belief Network to Analyze Time Series Data, Proc. of the A.Karpathy, A.Khosla, M.Bernstein, et al., Imagenet large scale visual 2017 International Joint Conference on Neural Network (IJCNN 2017), recognition challenge, International Journal of Computer Vision, vol.115, pp.2346–2353 (2017) no.3. 
pp.211–252 (2015) [23] http://www.cs.toronto.edu/∼nitish/unsupervised video/ (2019/7/23) [9] M.Mohammadi, A.Al-Fuqaha, S.Sorour, and M.Guizani, Deep Learning [24] N.Srivastava, E.Mansimov, R.Salakhutdinov, Unsupervised learning of for IoT Big Data and Streaming Analytics: A Survey, in IEEE Commu- video representations using LSTMs, Proc. of ICML’15 Proceedings of the Fig. 7. Abstract of prediction

32nd International Conference on International Conference on Machine Learning (ICML 15), vol.37, pp.843–852 (2015) [25] T.Ichimura, E.Tazaki and K.Yoshida, Extraction of fuzzy rules using neural networks with structure level adaptation -verification to the diagnosis of hepatobiliary disorders, International Journal of Biomedical Computing, Vol.40, No.2, pp.139–146 (1995) [26] N.B.Lewandowski, Y.Bengio and P.Vincent, Modeling Temporal Depen- dencies in High-Dimensional Sequences:Application to Polyphonic Music Generation and Transcription, Proc. of the 29th International Conference on Machine Learning (ICML 2012), pp.1159–1166 (2012) [27] I.Sutskever, G.E.Hinton, and G.W.Taylor, The Recurrent Temporal Re- stricted Boltzmann Machine, Proc. of Advances in Neural Information Processing Systems 21 (NIPS-2008) (2008) [28] J.Elman, Finding structure in time, Cognitive Science, Vol.14, No.2 (1990) [29] M.Jordan, Serial order: A parallel distributed processing approach, Tech. Rep. No. 8604. San Diego: University of California, Institute for Cognitive Science (1986) [30] J.Hsieh, B.Liu, D.Huang, L.Fei-Fei, J.C.Niebles, Learning to Decom- pose and Disentangle Representations for Video Prediction, Procs. of Advances in Neural Information Processing Systems 31 (NIPS 2018) (2018)