arXiv:1804.09812v2 [stat.ML] 12 Aug 2019

Improved Classification Based on Deep Belief Networks

Jaehoon Koo
Northwestern University
Evanston, IL, USA
[email protected]

Diego Klabjan
Northwestern University
Evanston, IL, USA
[email protected]

Abstract—For better classification, generative models are used to initialize the model and model features before training a classifier. Typically it is needed to solve separate unsupervised and supervised learning problems. Generative restricted Boltzmann machines and deep belief networks are widely used for unsupervised learning. We developed several supervised models based on DBN in order to improve this two-phase training strategy. Modifying the loss function to account for expectation with respect to the underlying generative model, introducing weight bounds, and multi-level programming are applied in model development. The proposed models capture both unsupervised and supervised objectives effectively. The computational study verifies that our models perform better than the two-phase training approach.

Index Terms—deep learning, neural networks, classification

I. INTRODUCTION

Restricted Boltzmann machine (RBM), an energy-based model that defines an input distribution, is widely used to extract latent features before classification. Such an approach combines unsupervised learning for feature modeling and supervised learning for classification. Two training steps are needed. The first step, called pre-training, is to model the features used for classification. This can be done by training an RBM that captures the distribution of the input. The second step, called fine-tuning, is to train a separate classifier based on the features from the first step [1]. This two-phase training approach is also used for deep networks. Deep belief networks (DBN) are built with stacked RBMs and trained in a layer-wise manner [2]. Two-phase training based on a deep network consists of a DBN and a classifier on top of it.

The two-phase training strategy has three possible problems. 1) It requires two training processes; one for training RBMs and one for training a classifier. 2) It is not guaranteed that the features obtained in the first step are useful in the classification phase since they are modeled independently of the classification task. 3) There is a need for an effort to decide which method is best for each classification problem [1].

To resolve these problems, recent papers suggest transforming RBM into a model that can conduct feature modeling and classification concurrently. Since RBM can calculate the joint and conditional probabilities, the suggested prior models combine both unsupervised and supervised learning, which forms a generative and discriminative RBM. Consequently, this hybrid discriminative RBM is trained concurrently by summing the two objectives [1], [3]. In a similar way, a self-contained RBM for classification is developed by applying the free-energy function based approximation to RBM, which was originally used for reinforcement learning [4]. However, these approaches are limited to transforming RBM, which is a shallow network.

In this study, we developed alternative models to solve a classification problem based on DBN. Viewing the two-phase training as two separate optimization problems, we applied optimization modeling techniques in developing our models. Our first approach is to design new objective functions. We design an expected loss function based on p(h|x) built by DBN and the loss function of the classifier. Second, we introduce constraints that bound the DBN weights in the loss function of the feed-forward classifier. The constraints keep a good representation of the input as well as regularize the model weights during updates. Third, we apply bilevel programming to the two-phase training method. The bilevel model has the DBN objective function in its constraints, so that it searches for solutions of the classifier only where the DBN objective values are optimal; this constrains the possible solutions of the model to the optimal solutions of phase-1.

Our main contributions are several classification models combining DBN and a loss function in a coherent way. In a computational study we verify that the suggested models perform better than the two-phase method.
II. LITERATURE REVIEW

The two-phase training strategy has been applied to many classification tasks on different types of data. Two-phase training with RBM and support vector machines (SVM) has been explored in classification tasks on network intrusion data [5], images [6], [7], and documents [8], [9]. Logistic regression replacing SVM has also been explored [10]. Gehler et al. [11] used the 1-nearest neighborhood classifier with RBM to solve a document classification task. Hinton and Salakhutdinov [2] suggested DBN, consisting of stacked RBMs, that is trained in a layer-wise manner. The two-phase method using DBN and a deep neural network trained in this manner has been studied to solve various classification problems such as image and text recognition [2], [12], [13]. Recently, this approach has been applied to motor imagery classification in the area of brain–computer interface [14], biomedical research, classification of Cytochrome P450 1A2 inhibitors and non-inhibitors [15], and web spam classification that detects web pages deliberately created to manipulate search rankings [16]. All these papers rely on two distinct phases, while our models assume a holistic view of both aspects.

Many studies have been conducted to improve the problems of two-phase training. Most of the research has been focused on transforming RBM so that the modified model can achieve generative and discriminative objectives at the same time. Schmah et al. [17] proposed a discriminative RBM method, and subsequently classification is done in the manner of a Bayes classifier. However, this method cannot capture the
relationship between the classes since the RBM of each class is trained separately. Larochelle et al. [1], [3] proposed a self-contained discriminative RBM framework where the objective function consists of the generative learning objective, p(x, y), and the discriminative learning objective, p(y|x). Both distributions are derived from RBM. Similarly, a self-contained discriminative RBM method for classification is proposed in [4]. The free-energy function based approximation is applied in the development of this method, which was initially suggested for reinforcement learning. These prior papers rely on RBM conditional probabilities, while we handle general loss functions. Our models also hinge on completely different principles.

III. BACKGROUND

a) Restricted Boltzmann Machines: RBM is an energy-based probabilistic model, which is a restricted version of Boltzmann machines (BM), a log-linear Markov Random Field. It has visible nodes x corresponding to the input and hidden nodes h matching the latent features. The joint distribution of the visible nodes x ∈ R^J and hidden variables h ∈ R^I is defined as

    p(x, h) = (1/Z) e^{-E(x,h)},   E(x, h) = -h^T W x - c^T h - b^T x,

where W ∈ R^{I×J}, b ∈ R^J, and c ∈ R^I are the model parameters, and Z is the partition function. Since units in a layer are independent in RBM, the conditional distributions factorize as

    p(h|x) = ∏_{i=1}^{I} p(h_i|x),   p(x|h) = ∏_{j=1}^{J} p(x_j|h).

For binary units, where x ∈ {0,1}^J and h ∈ {0,1}^I, we can write p(h_i = 1|x) = σ(c_i + W_{i·} x) and p(x_j = 1|h) = σ(b_j + h^T W_{·j}), where σ() is the sigmoid function and W_{i·}, W_{·j} denote the i-th row and j-th column of W. In this manner an RBM with binary units is an unsupervised neural network with a sigmoid activation function. The model calibration of RBM can be done by minimizing the negative log-likelihood through stochastic gradient descent. RBM takes advantage of having the above conditional probabilities, which make it easy to obtain model samples through Gibbs sampling. Contrastive divergence (CD) makes Gibbs sampling even simpler: 1) start a Markov chain with training samples, and 2) stop to obtain samples after k steps. It is shown that CD with a few steps performs effectively [18], [19].

b) Deep Belief Networks: DBN is a generative graphical model consisting of stacked RBMs. Based on its deep structure, DBN can capture a hierarchical representation of the input data. Hinton et al. (2006) introduced DBN with a training algorithm that greedily trains one layer at a time. Given visible unit x and ℓ hidden layers, the joint distribution is defined as [18], [20]

    p(x, h^1, ..., h^ℓ) = p(h^{ℓ-1}, h^ℓ) ( ∏_{k=1}^{ℓ-2} p(h^k | h^{k+1}) ) p(x | h^1).

Since each layer of DBN is constructed as an RBM, training each layer of DBN is the same as training an RBM.

Classification is conducted by initializing a network through DBN training [12], [20]. A two-phase training can be done sequentially by: 1) pre-training, the layer-wise unsupervised training of the stacked RBMs, and 2) fine-tuning, the supervised training with a classifier. Each phase requires solving an optimization problem. Given training dataset D = {(x^{(1)}, y^{(1)}), ..., (x^{(|D|)}, y^{(|D|)})} with input x and label y, the pre-training phase solves the following optimization problem at each layer k for k = 1, ..., ℓ:

    min_{θ_k} (1/|D|) Σ_{i=1}^{|D|} [ -log p(x_k^{(i)}; θ_k) ]

where θ_k = (W_k, b_k, c_k) is the RBM model parameter that denotes the weights, visible bias, and hidden bias in the energy function, and x_k^{(i)} is the visible input to layer k corresponding to input x^{(i)}. Note that in the layer-wise updating manner we need to solve ℓ such problems from the bottom to the top hidden layer. For the fine-tuning phase we solve the following optimization problem

    min_φ (1/|D|) Σ_{i=1}^{|D|} [ L(φ; y^{(i)}, h(x^{(i)})) ]    (1)

where L() is a loss function, h denotes the final hidden features at layer ℓ, and φ denotes the parameters of the classifier. Here for simplicity we write h(x^{(i)}) = h(x_ℓ^{(i)}). When combining DBN and a feed-forward neural network (FFN) with sigmoid activation, all the weight and hidden bias parameters among the input and hidden layers are shared by both training phases. Therefore, in this case we initialize FFN by training DBN.
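As a concrete illustration of the pre-training update described above, the following is a minimal NumPy sketch of a CD-k step for a single binary RBM layer. The function and parameter names (cd_k_update, lr, etc.) are ours, not from the paper, and the update uses the usual CD approximation rather than the exact log-likelihood gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(W, b, c, x, k=1, lr=0.01, rng=np.random):
    """One CD-k parameter update for a binary RBM.
    W: (I, J) weights, b: (J,) visible bias, c: (I,) hidden bias,
    x: (batch, J) binary inputs."""
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(x @ W.T + c)                       # p(h=1|x)
    # Gibbs chain started at the data, run for k steps.
    h = (rng.random_sample(ph_data.shape) < ph_data).astype(x.dtype)
    for _ in range(k):
        px = sigmoid(h @ W + b)                          # p(x=1|h)
        x_neg = (rng.random_sample(px.shape) < px).astype(x.dtype)
        ph_neg = sigmoid(x_neg @ W.T + c)
        h = (rng.random_sample(ph_neg.shape) < ph_neg).astype(x.dtype)
    n = x.shape[0]
    # CD approximation of the log-likelihood gradient (ascent direction).
    dW = (ph_data.T @ x - ph_neg.T @ x_neg) / n
    db = (x - x_neg).mean(axis=0)
    dc = (ph_data - ph_neg).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc
```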
IV. PROPOSED MODELS

We model an expected loss function for classification. Since classification in the two-phase method is conducted on the hidden space, the probability distribution of the hidden variables obtained by DBN is used in the proposed models. The two-phase method also provides information about the model parameters after each phase is trained. Constraints based on this information are suggested to prevent the model parameters from deviating far from a good representation of the input. The optimal solution set for the unsupervised objective of the two-phase method is a good set of candidate solutions for the second phase. The bilevel model uses this set to find optimal solutions for the phase-2 objective, so that it conducts the two-phase training in one shot.

a) DBN Fitting Plus Loss Model: We start with a naive model of summing the pre-training and fine-tuning objectives. This model conducts the two-phase training strategy simultaneously; however, we need to add one more hyperparameter ρ to balance the impact of both objectives. The model (DBN+loss) is defined as

    min_{θ_L, θ_DBN}  E_{y,x}[L(θ_L; y, h(x))] + ρ E_x[-log p(x; θ_DBN)]

and empirically, based on training samples D,

    min_{θ_L, θ_DBN}  (1/|D|) Σ_{i=1}^{|D|} [ L(θ_L; y^{(i)}, h(x^{(i)})) - ρ log p(x^{(i)}; θ_DBN) ]    (2)

where θ_L, θ_DBN are the underlying parameters. Note that θ_L = φ from (1) and θ_DBN = (θ_k)_{k=1}^{ℓ}. This model has already been proposed when the classification loss function is based on the RBM conditional distribution [1], [3].
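The generative term -log p(x; θ_DBN) in (2) cannot be evaluated exactly because of the partition function, so in practice its gradient is typically replaced by a CD-style estimate. The sketch below, which is ours and not the authors' implementation, combines a cross-entropy classifier gradient with a CD-1 generative gradient weighted by ρ for a single shared hidden layer; all names (dbn_plus_loss_step, U, d, etc.) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dbn_plus_loss_step(params, x, y_onehot, rho=0.1, lr=0.1, rng=np.random):
    """One mini-batch update of objective (2) for one shared hidden layer:
    cross-entropy loss plus rho times a CD-1 estimate of grad -log p(x)."""
    W, b, c, U, d = params                        # RBM/FFN weights, softmax weights
    # Supervised part: forward pass and cross-entropy gradients.
    h = sigmoid(x @ W.T + c)                      # shared hidden layer
    p = softmax(h @ U + d)
    n = x.shape[0]
    delta_out = (p - y_onehot) / n
    dU, dd = h.T @ delta_out, delta_out.sum(axis=0)
    delta_h = (delta_out @ U.T) * h * (1 - h)
    dW_sup, dc_sup = delta_h.T @ x, delta_h.sum(axis=0)
    # Generative part: CD-1 estimate of grad -log p(x).
    h_s = (rng.random_sample(h.shape) < h).astype(x.dtype)
    x_neg = sigmoid(h_s @ W + b)
    h_neg = sigmoid(x_neg @ W.T + c)
    dW_gen = -(h.T @ x - h_neg.T @ x_neg) / n
    db_gen = -(x - x_neg).mean(axis=0)
    dc_gen = -(h - h_neg).mean(axis=0)
    # Combined gradient descent step on (2).
    W -= lr * (dW_sup + rho * dW_gen)
    b -= lr * rho * db_gen
    c -= lr * (dc_sup + rho * dc_gen)
    U -= lr * dU
    d -= lr * dd
    return W, b, c, U, d
```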

b) Expected Loss Model with DBN Boxing: We first design an expected loss model based on the conditional distribution p(h|x) obtained by DBN. This model conducts classification on the hidden space. Since it minimizes the expected loss, it should be more robust and thus yield better accuracy on data not observed. The mathematical model that minimizes the expected loss function is defined as

    min_{θ_L, θ_DBN}  E_{y,h|x}[L(θ_L; y, h(θ_DBN; x))]

and empirically, based on training samples D,

    min_{θ_L, θ_DBN}  (1/|D|) Σ_{i=1}^{|D|} [ Σ_h p(h|x^{(i)}) L(θ_L; y^{(i)}, h(θ_DBN; x^{(i)})) ].

With the notation h(θ_DBN; x^{(i)}) = h(x^{(i)}) we explicitly show the dependency of h on θ_DBN. We modify the expected loss model by introducing a constraint that sets bounds on the DBN related parameters with respect to their optimal values. This model has two benefits. First, the model keeps a good representation of the input by constraining parameters fitted in the unsupervised manner. Also, the constraint regularizes the model parameters by preventing them from blowing up while being updated. Given training samples D, the mathematical form of the model (EL-DBN) reads

    min_{θ_L, θ_DBN}  (1/|D|) Σ_{i=1}^{|D|} [ Σ_h p(h|x^{(i)}) L(θ_L; y^{(i)}, h(θ_DBN; x^{(i)})) ]
    s.t.  |θ_DBN - θ*_DBN| ≤ δ

where θ*_DBN are the optimal DBN parameters and δ is a hyperparameter. This model needs a pre-training phase to obtain the DBN fitted parameters.
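The inner sum over h runs over exponentially many hidden configurations, so in practice the expectation has to be approximated. A minimal sketch of a Monte Carlo estimate, assuming a single DBN layer and a softmax classifier with negative log-likelihood loss, is given below; the function and variable names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_loss_mc(W, c, U, d, x, y, n_samples=10, rng=np.random):
    """Monte Carlo estimate of E_{h|x}[L] with h ~ Bernoulli(p(h|x)).
    W: (I, J), c: (I,), U: (I, K), d: (K,), x: (n, J), y: (n,) integer labels."""
    ph = sigmoid(x @ W.T + c)                         # p(h=1|x)
    total = 0.0
    for _ in range(n_samples):
        h = (rng.random_sample(ph.shape) < ph).astype(x.dtype)
        logits = h @ U + d
        logits -= logits.max(axis=1, keepdims=True)   # stable log-softmax
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -logp[np.arange(x.shape[0]), y].mean()
    return total / n_samples
```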

c) Expected Loss Model with DBN Classification Boxing: Similar to the DBN boxing model, this expected loss model has a constraint that the DBN parameters are bounded by their optimal values at the end of both phases. This model regularizes the parameters with those that are fitted in both the unsupervised and supervised manner. Therefore, it can achieve better accuracy even though an additional training phase on top of the two-phase training is needed. Given training samples D, the model (EL-DBNOPT) reads

    min_{θ_L, θ_DBN}  (1/|D|) Σ_{i=1}^{|D|} [ Σ_h p(h|x^{(i)}) L(θ_L; y^{(i)}, h(θ_DBN; x^{(i)})) ]
    s.t.  |θ_DBN - θ*_{DBN-OPT}| ≤ δ    (3)

where θ*_{DBN-OPT} are the optimal values of the DBN parameters after two-phase training and δ is a hyperparameter.

d) Feed-forward Network with DBN Boxing: We also propose a model based on boxing constraints where FFN is constrained by the DBN output. The mathematical model (FFN-DBN) based on training samples D is

    min_{θ_L, θ_DBN}  (1/|D|) Σ_{i=1}^{|D|} [ L(θ_L; y^{(i)}, h(θ_DBN; x^{(i)})) ]
    s.t.  |θ_DBN - θ*_DBN| ≤ δ.    (4)

e) Feed-forward Network with DBN Classification Boxing: Given training samples D, this model (FFN-DBNOPT), which is a mixture of (3) and (4), reads

    min_{θ_L, θ_DBN}  (1/|D|) Σ_{i=1}^{|D|} [ L(θ_L; y^{(i)}, h(θ_DBN; x^{(i)})) ]
    s.t.  |θ_DBN - θ*_{DBN-OPT}| ≤ δ.
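The paper does not spell out how the boxing constraints in (3) and (4) are enforced during optimization. One standard option, shown as a sketch below under that assumption, is projected gradient descent: after each gradient step, clip each DBN parameter back into the δ-box around its reference value. The helper name project_to_box is ours.

```python
import numpy as np

def project_to_box(theta, theta_ref, delta):
    """Project DBN parameters into the box |theta - theta_ref| <= delta.
    theta, theta_ref: lists of NumPy arrays (e.g., [W1, b1, c1, W2, ...])."""
    return [np.clip(p, r - delta, r + delta) for p, r in zip(theta, theta_ref)]

# Usage inside a training loop (sketch):
# theta_dbn = [p - lr * g for p, g in zip(theta_dbn, grads)]   # gradient step
# theta_dbn = project_to_box(theta_dbn, theta_dbn_ref, delta)  # enforce the box
```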

f) Bilevel Model: We also apply bilevel programming to the two-phase training method. This model searches for optimal solutions that minimize the loss function of the classifier only where the DBN objective solutions are optimal. Possible candidates for optimal solutions of the first level objective function are the optimal solutions of the second level objective function. This model (BL) reads

    min_{θ_L, θ*_DBN}  E_{y,x}[L(θ_L; y, h(θ*_DBN; x))]
    s.t.  θ*_DBN = arg min_{θ_DBN} E_x[-log p(x; θ_DBN)]

and empirically, based on training samples,

    min_{θ_L, θ*_DBN}  (1/|D|) Σ_{i=1}^{|D|} [ L(θ_L; y^{(i)}, h(θ*_DBN; x^{(i)})) ]
    s.t.  θ*_DBN = arg min_{θ_DBN} (1/|D|) Σ_{i=1}^{|D|} [ -log p(x^{(i)}; θ_DBN) ].

One of the solution approaches to bilevel programming is to apply the Karush–Kuhn–Tucker (KKT) conditions to the lower level problem. After applying KKT to the lower level, we obtain

    min_{θ_L, θ*_DBN}  E_{y,x}[L(θ_L; y, h(θ*_DBN; x))]
    s.t.  ∇_{θ_DBN} E_x[-log p(x; θ_DBN)] |_{θ*_DBN} = 0.

Furthermore, we transform this constrained problem into an unconstrained problem with a quadratic penalty function:

    min_{θ_L, θ*_DBN}  E_{y,x}[L(θ_L; y, h(θ*_DBN; x))] + (μ/2) || ∇_{θ_DBN} E_x[-log p(x; θ_DBN)] |_{θ*_DBN} ||^2    (5)

where μ is a hyperparameter. The gradient of the objective function is derived in the appendix.
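Evaluating the penalty in (5) exactly requires ∇_{θ_DBN}[-log p(x)], which is intractable; a practical sketch could substitute the CD-1 estimate of that gradient, as in pre-training, and penalize its squared norm. The sketch below makes that assumption for a single RBM layer; names are ours and this is not the authors' exact procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalty_term(W, b, c, x, mu, rng=np.random):
    """Quadratic penalty (mu/2)*||grad -log p(x)||^2 from (5),
    with the gradient replaced by its CD-1 estimate for one RBM layer."""
    ph = sigmoid(x @ W.T + c)
    h = (rng.random_sample(ph.shape) < ph).astype(x.dtype)
    x_neg = sigmoid(h @ W + b)
    ph_neg = sigmoid(x_neg @ W.T + c)
    n = x.shape[0]
    gW = -(ph.T @ x - ph_neg.T @ x_neg) / n       # approx. grad of -log p(x)
    gb = -(x - x_neg).mean(axis=0)
    gc = -(ph - ph_neg).mean(axis=0)
    return 0.5 * mu * (np.sum(gW**2) + np.sum(gb**2) + np.sum(gc**2))
```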

V. COMPUTATIONAL STUDY

To evaluate the proposed models, classification tasks on three datasets were conducted: the MNIST hand-written images [1], the KDD'99 network intrusion dataset (NI) [2], and the isolated letter speech recognition dataset (ISOLET) [3]. The experimental results of the proposed models on these datasets were compared to those of the two-phase method.

In FFN, the sigmoid functions in the hidden layers and the softmax function in the output layer were chosen, with negative log-likelihood as the loss function of the classifiers. We selected the hyperparameters based on the settings used in [21], which were fine-tuned. We first implemented the two-phase method with DBNs of 1, 2, 3, and 4 hidden layers to find the best configuration for each dataset, and then applied the best configuration to the proposed models.

Implementations were done in Theano using a GeForce GTX TITAN X. The mini-batch gradient descent algorithm was used to solve the optimization problem of each model. To calculate the gradients of each objective function of the models, Theano's built-in function, 'theano.tensor.grad', was used. We denote by DBN-FFN the two-phase approach.

[1] http://yann.lecun.com/exdb/mnist/
[2] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[3] https://archive.ics.uci.edu/ml/datasets/ISOLET
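The paper only names 'theano.tensor.grad'; the snippet below is a minimal sketch, assuming a scalar cost built from Theano shared variables, of how such symbolic gradients and update rules are typically wired together. The variable names and the simple softmax cost are ours, used purely for illustration.

```python
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.ivector('y')
W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p_y = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])   # negative log-likelihood

gW, gb = T.grad(cost, [W, b])                          # theano.tensor.grad
train = theano.function([x, y], cost,
                        updates=[(W, W - 0.1 * gW), (b, b - 0.1 * gb)])
```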
A. MNIST

The task on the MNIST is to classify ten digits from 0 to 9 given by 28 × 28 pixel hand-written images. The dataset is divided into 60,000 samples for training and validation, and 10,000 samples for testing. The hyperparameters are set as: there are 1,000 hidden units in each layer; the number of pre-training epochs per layer is 100 with the learning rate of 0.01; the number of fine-tuning epochs is 300 with the learning rate of 0.1; the batch size is 10; and ρ in the DBN+loss and µ in the BL model are diminishing during epochs. Note that DBN+loss and BL do not require pre-training.

DBN-FFN with three hidden layers of size 784-1000-1000-1000-10 was the best, and subsequently we compared it to the proposed models with the same size of the network. We computed the means of the classification errors and their standard deviations for each model averaged over 5 random runs. In each table, we stress in bold the best three models, with ties broken by standard deviation. In Table I, the best mean test error rate was achieved by FFN-DBNOPT, 1.32%. Furthermore, the models with the DBN classification constraints, EL-DBNOPT and FFN-DBNOPT, perform similar to, or better than, the two-phase method. This shows that the DBN classification boxing constraints regularize the model parameters by keeping a good representation of the input.

TABLE I: Classification errors with respect to the best DBN structure for the MNIST.
            Mean    Sd.
DBN-FFN     1.33%   0.03%
DBN+loss    1.84%   0.14%
EL-DBN      1.46%   0.05%
EL-DBNOPT   1.33%   0.04%
FFN-DBN     1.34%   0.04%
FFN-DBNOPT  1.32%   0.03%
BL          1.85%   0.07%

B. Network Intrusion

The classification task on NI is to distinguish between normal and bad connections given the related network connection information. The preprocessed dataset consists of 41 input features and 5 classes, with 4,898,431 examples for training and 311,029 examples for testing. The experiments were conducted on 20%, 30%, and 40% subsets of the whole training set, which were obtained by stratified random sampling. The hyperparameters are set as: there are 15 hidden units in each layer; the number of pre-training epochs per layer is 100 with the learning rate of 0.01; the number of fine-tuning epochs is 500 with the learning rate of 0.1; the batch size is 1,000; and ρ in the DBN+loss and µ in the BL model are diminishing during epochs.

On NI the best structure of DBN-FFN was 41-15-15-5 for all three subsets. Table II shows the experimental results of the proposed models with the same network as the best DBN-FFN. BL performed the best on all subsets, achieving the lowest mean classification error without the pre-training step. The difference in the classification error between our best model, BL, and DBN-FFN is statistically significant since the p-values are 0.03, 0.01, and 0.03 for the 20%, 30%, and 40% datasets, respectively. This shows that a model trained concurrently for unsupervised and supervised purposes can achieve better accuracy than the two-phase method. Furthermore, both EL-DBNOPT and FFN-DBNOPT yield similar to, or lower, mean error rates than DBN-FFN in all three subsets.

TABLE II: Classification errors with respect to the best DBN structure for NI.
            20% dataset        30% dataset        40% dataset
            Mean    Sd.        Mean    Sd.        Mean    Sd.
DBN-FFN     8.14%   0.12%      8.18%   0.12%      8.06%   0.02%
DBN+loss    8.07%   0.06%      8.13%   0.09%      8.05%   0.05%
EL-DBN      8.30%   0.09%      8.27%   0.07%      8.29%   0.14%
EL-DBNOPT   8.14%   0.14%      8.15%   0.15%      8.08%   0.10%
FFN-DBN     8.17%   0.09%      8.20%   0.08%      8.07%   0.11%
FFN-DBNOPT  8.07%   0.12%      8.12%   0.11%      7.95%   0.11%
BL          7.93%   0.09%      7.90%   0.11%      7.89%   0.10%

C. ISOLET

The classification on ISOLET is to predict which letter-name was spoken among the 26 English alphabets given 617 input features of the related signal processing information. The dataset consists of 5,600 examples for training, 638 for validation, and 1,559 for testing. The hyperparameters are set as: there are 1,000 hidden units in each layer; the number of pre-training epochs per layer is 100 with the learning rate of 0.005; the number of fine-tuning epochs is 300 with the learning rate of 0.1; the batch size is 20; and ρ in the DBN+loss and µ in the BL model are diminishing during epochs.

In this experiment the shallow network performed better than the deep network; 617-1000-26 was the best structure for DBN-FFN. One possible reason for this is the small size of the training set. The EL models performed well on this instance. EL-DBNOPT achieved the best mean classification error, tied with FFN-DBNOPT. With the same training effort, EL-DBN achieved a lower mean classification error and a smaller standard deviation than the two-phase method, DBN-FFN. Considering the relatively small sample size of ISOLET, EL shows that it yields better accuracy on unseen data as it minimizes the expected loss, i.e., it generalizes better. On this dataset, the p-value is 0.07 for the difference in the classification error between our best model, FFN-DBNOPT, and DBN-FFN.

TABLE III: Classification errors with respect to the best DBN structure for ISOLET.
            Mean    Sd.
DBN-FFN     3.94%   0.22%
DBN+loss    4.38%   0.20%
EL-DBN      3.91%   0.18%
EL-DBNOPT   3.75%   0.14%
FFN-DBN     3.94%   0.19%
FFN-DBNOPT  3.75%   0.13%
BL          4.43%   0.18%

VI. CONCLUSIONS

DBN+loss performs better than the two-phase training DBN-FFN only in one instance. Aggregating the unsupervised and supervised objectives without a specific treatment is not effective. Second, the models with DBN boxing, EL-DBN and FFN-DBN, do not perform better than DBN-FFN on almost all datasets. Regularizing the model parameters with unsupervised learning alone is not so effective in solving a supervised learning problem. Third, the models with DBN classification boxing, EL-DBNOPT and FFN-DBNOPT, perform better than DBN-FFN in almost all of the experiments. FFN-DBNOPT is consistently one of the best three performers in all instances. This shows that classification accuracy can be improved by regularizing the model parameters with the values trained for unsupervised and supervised purposes. One drawback of this approach is that one more training phase on top of the two-phase approach is necessary. Last, BL shows that one-step training can achieve a better performance than two-phase training. Even though it worked in only one instance, improvements to the current BL can be made, such as applying different solution search algorithms, supervised learning regularization techniques, or different initialization strategies.

REFERENCES

[1] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio, "Learning algorithms for the classification restricted Boltzmann machine," Journal of Machine Learning Research, vol. 13, pp. 643–669, 2012.
[2] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[3] H. Larochelle and Y. Bengio, "Classification using discriminative restricted Boltzmann machines," in International Conference on Machine Learning (ICML) 25, (Helsinki, Finland), pp. 536–543, 2008.
[4] S. Elfwing, E. Uchibe, and K. Doya, "Expected energy-based restricted Boltzmann machine for classification," Neural Networks, vol. 64, pp. 29–38, 2015.
[5] E. P. Xing, R. Yan, and A. G. Hauptmann, "Mining associated text and images with dual-wing Harmoniums," in Conference on Uncertainty in Artificial Intelligence (UAI), vol. 21, (Edinburgh, Scotland), pp. 633–641, 2005.
[6] M. Norouzi, M. Ranjbar, and G. Mori, "Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR), (Miami, FL, USA), pp. 2735–2742, 2009.
[7] M. A. Salama, H. F. Eid, R. A. Ramadan, A. Darwish, and A. E. Hassanien, "Hybrid intelligent intrusion detection scheme," Advances in Intelligent and Soft Computing, pp. 293–303, 2011.
[8] G. E. Dahl, R. P. Adams, and H. Larochelle, "Training restricted Boltzmann machines on word observations," in International Conference on Machine Learning (ICML) 29, vol. 29, (Edinburgh, Scotland, UK), pp. 679–686, 2012.
[9] A. McCallum, C. Pal, G. Druck, and X. Wang, "Multi-conditional learning: generative/discriminative training for clustering and classification," in National Conference on Artificial Intelligence (AAAI), vol. 21, pp. 433–439, 2006.
[10] K. Cho, A. Ilin, and T. Raiko, "Improved learning algorithms for restricted Boltzmann machines," in Artificial Neural Networks and Machine Learning (ICANN), vol. 6791, Springer, Berlin, Heidelberg, 2011.
[11] P. V. Gehler, A. D. Holub, and M. Welling, "The rate adapting Poisson (RAP) model for information retrieval and object recognition," in International Conference on Machine Learning (ICML) 23, vol. 23, (Pittsburgh, PA, USA), pp. 337–344, 2006.
[12] Y. Bengio and P. Lamblin, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems (NIPS) 19, vol. 20, pp. 153–160, MIT Press, 2007.
[13] R. Sarikaya, G. E. Hinton, and A. Deoras, "Application of deep belief networks for natural language understanding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 778–784, 2014.
[14] N. Lu, T. Li, X. Ren, and H. Miao, "A deep learning scheme for motor imagery classification based on restricted Boltzmann machines," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, pp. 566–576, 2017.
[15] L. Yu, X. Shi, and T. Shengwei, "Classification of Cytochrome P450 1A2 inhibitors and noninhibitors based on deep belief network," International Journal of Computational Intelligence and Applications, vol. 16, p. 1750002, 2017.
[16] Y. Li, X. Nie, and R. Huang, "Web spam classification method based on deep belief networks," Expert Systems With Applications, vol. 96, no. 1, pp. 261–270, 2018.
[17] T. Schmah, G. E. Hinton, R. S. Zemel, S. L. Small, and S. Strother, "Generative versus discriminative training of RBMs for classification of fMRI images," in Advances in Neural Information Processing Systems (NIPS) 21, vol. 21, pp. 1409–1416, Curran Associates, Inc., 2009.
[18] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[19] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[20] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[21] B. Wang and D. Klabjan, "Regularization for unsupervised deep neural nets," in National Conference on Artificial Intelligence (AAAI), vol. 31, pp. 2681–2687, 2017.
[22] A. Fischer and C. Igel, "An introduction to restricted Boltzmann machines," Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, vol. 7441, pp. 14–36, 2012.
APPENDIX

DBN defines the joint distribution of the visible unit x and the ℓ hidden layers h^1, h^2, ..., h^ℓ as

    p(x, h^1, ..., h^ℓ) = p(h^{ℓ-1}, h^ℓ) ( ∏_{k=0}^{ℓ-2} p(h^k | h^{k+1}) )

with h^0 = x.

a) DBN Fitting Plus Loss Model: From Eq. (2), p(x) in the second term of the objective function is approximated as

    p(x; θ_DBN) = Σ_{h^1, h^2, ..., h^ℓ} p(x, h^1, ..., h^ℓ) ≈ Σ_{h^1} p(x, h^1).

b) Expected Loss Models: p(h|x) in the objective function is approximated as

    p(h^ℓ|x) ≈ p(h^ℓ | x, h^1, ..., h^{ℓ-1})
             = p(h^ℓ, h^{ℓ-1}, ..., h^1, x) / p(h^{ℓ-1}, h^{ℓ-2}, ..., h^1, x)
             = [ p(h^{ℓ-1}, h^ℓ) ∏_{k=0}^{ℓ-2} p(h^k|h^{k+1}) ] / [ p(h^{ℓ-2}, h^{ℓ-1}) ∏_{k=0}^{ℓ-3} p(h^k|h^{k+1}) ]
             = p(h^{ℓ-1}, h^ℓ) p(h^{ℓ-2}|h^{ℓ-1}) / p(h^{ℓ-2}, h^{ℓ-1})
             = p(h^{ℓ-1}, h^ℓ) p(h^{ℓ-2}, h^{ℓ-1}) / ( p(h^{ℓ-2}, h^{ℓ-1}) p(h^{ℓ-1}) )
             = p(h^ℓ | h^{ℓ-1}).
c) Bilevel Model: From Eq. (5), ∇_{θ_DBN} log p(x) in the objective function is approximated for i = 0, 1, ..., ℓ as

    [∇_{θ_DBN} log p(x)]_i = ∂ log p(x) / ∂θ^i_DBN
                           = ∂ log( Σ_{h^1, h^2, ..., h^ℓ} p(x, h^1, h^2, ..., h^ℓ) ) / ∂θ^i_DBN
                           ≈ ∂ log( Σ_{h^{i+1}} p(h^i, h^{i+1}) ) / ∂θ^i_DBN    (6)

where θ_DBN = (θ^0_DBN, θ^1_DBN, ..., θ^i_DBN, ..., θ^ℓ_DBN). The gradient of this approximated quantity is then the Hessian matrix of the underlying RBM.

We write the approximated ||∇_{θ_DBN}(-log p(x))||^2 at layer i as

    ||[∇_{θ_DBN}(-log p(x))]_i||^2 ≈ || ∂(-log Σ_{h^{i+1}} p(h^i, h^{i+1})) / ∂θ^i_DBN ||^2
        = ( ∂(-log p(h^i)) / ∂θ^i_{11} )^2 + ( ∂(-log p(h^i)) / ∂θ^i_{12} )^2 + ... + ( ∂(-log p(h^i)) / ∂θ^i_{nm} )^2

where m and n denote the dimensions of h^i and h^{i+1}, and θ^i_{pq} denotes the (p, q) component of θ^i_DBN. The gradient of the approximated ||∇_{θ_DBN}(-log p(x))||^2 at layer i is

    ∂/∂θ^i_{pq} ( Σ_{p,q} ( ∂(-log p(h^i)) / ∂θ^i_{pq} )^2 )
        = 2 [ (∂(-log p(h^i))/∂θ^i_{11}) (∂^2(-log p(h^i))/∂θ^i_{11}∂θ^i_{pq})
            + (∂(-log p(h^i))/∂θ^i_{12}) (∂^2(-log p(h^i))/∂θ^i_{12}∂θ^i_{pq})
            + ... + (∂(-log p(h^i))/∂θ^i_{pq}) (∂^2(-log p(h^i))/∂θ^i_{pq}∂θ^i_{pq})
            + ... + (∂(-log p(h^i))/∂θ^i_{nm}) (∂^2(-log p(h^i))/∂θ^i_{nm}∂θ^i_{pq}) ]

for p = 1, ..., n and q = 1, ..., m. This shows that the gradient of the approximated ||∇_{θ_DBN}(-log p(x))||^2 in (5) is the Hessian matrix times the gradient of the underlying RBM. The stochastic gradient of -log p(x) of an RBM with binary input x and hidden unit h with respect to w_{pq} is

    ∂RBM/∂w_{pq} = p(h_p = 1|x) x_q - Σ_{x̃} p(x̃) p(h_p = 1|x̃) x̃_q

where RBM denotes -log p(x) [22]. We derive the Hessian matrix with respect to w_{pq} as

    ∂^2 RBM / ∂w_{pq}^2 = ∂/∂w_{pq}[ p(h_p = 1|x) x_q ] - ∂/∂w_{pq}[ Σ_{x̃} p(x̃) p(h_p = 1|x̃) x̃_q ]
        = σ(net_p)(1 - σ(net_p)) x_q^2
          - Σ_{x̃} [ (∂p(x̃)/∂w_{pq}) p(h_p = 1|x̃) x̃_q + p(x̃) σ(ñet_p)(1 - σ(ñet_p)) x̃_q^2 ],

    ∂^2 RBM / ∂w_{pk}∂w_{pq} = ∂/∂w_{pk}[ p(h_p = 1|x) x_q ] - ∂/∂w_{pk}[ Σ_{x̃} p(x̃) p(h_p = 1|x̃) x̃_q ]
        = σ(net_p)(1 - σ(net_p)) x_q x_k
          - Σ_{x̃} [ (∂p(x̃)/∂w_{pk}) p(h_p = 1|x̃) x̃_q + p(x̃) σ(ñet_p)(1 - σ(ñet_p)) x̃_q x̃_k ],

    ∂^2 RBM / ∂w_{kq}∂w_{pq} = ∂/∂w_{kq}[ p(h_p = 1|x) x_q ] - ∂/∂w_{kq}[ Σ_{x̃} p(x̃) p(h_p = 1|x̃) x̃_q ]
        = - Σ_{x̃} [ (∂p(x̃)/∂w_{kq}) p(h_p = 1|x̃) x̃_q + p(x̃) ∂/∂w_{kq}[ p(h_p = 1|x̃) x̃_q ] ],

    ∂^2 RBM / ∂w_{kp}∂w_{pq} = - Σ_{x̃} [ (∂p(x̃)/∂w_{kp}) p(h_p = 1|x̃) x̃_q + p(x̃) ∂/∂w_{kp}[ p(h_p = 1|x̃) x̃_q ] ]

where σ() is the sigmoid function, net_p = Σ_q w_{pq} x_q + c_p (and ñet_p is the same quantity evaluated at x̃), and c_p is the hidden bias. Based on what we derived above we can calculate the gradient of the approximated ||[∇_{θ_DBN}(-log p(x))]_i||^2.