Wasserstein Training of Deep Boltzmann Machines

Chaoyang Wang 1 Tian Tong 2 Yang Zou 2

*Equal contribution. 1 Robotics Institute, 2 Department of Electrical & Computer Engineering. Correspondence to: Tian Tong, Chaoyang Wang, Yang Zou.

Abstract

Training of probabilistic graphical models such as restricted Boltzmann machines (RBMs) and deep Boltzmann machines (DBMs) is usually done by maximum likelihood estimation, or equivalently by minimizing the Kullback-Leibler (KL) divergence between the data distribution and the model distribution. On the other hand, the Wasserstein distance, also known as the optimal transport distance or earth mover's distance, is a fundamental distance for quantifying the difference between probability distributions. Owing to good properties such as smoothness and symmetry, the Wasserstein distance has attracted broad interest in machine learning and computer vision. In this project, we propose a general Wasserstein training method for graphical models that replaces the standard KL-divergence loss with a Wasserstein loss, and we develop Wasserstein training of RBMs and DBMs as specific examples. Finally, we experimentally explore Wasserstein training of RBMs and DBMs for digit generation on the MNIST dataset, and show its superiority over traditional KL-divergence training.

1. Introduction

The Boltzmann machine is a family of recurrent neural networks and Markov random fields. It has applications in dimensionality reduction, classification, collaborative filtering, and topic modelling (Hinton & Salakhutdinov, 2006; Larochelle & Bengio, 2008; Salakhutdinov et al., 2007; Coates et al., 2010; Hinton & Salakhutdinov, 2009). Boltzmann machines have many variants, such as Deep Boltzmann Machines (DBMs) (Salakhutdinov & Hinton, 2009), Restricted Boltzmann Machines (RBMs) (Salakhutdinov et al., 2007), and Conditional Restricted Boltzmann Machines (CRBMs) (Taylor & Hinton, 2009).

Learning in Boltzmann machines is usually done by maximum likelihood estimation, which is equivalent to minimizing the Kullback-Leibler divergence (KL-divergence) between the empirical distribution and the model distribution.

In high dimensions, the KL-divergence fails when the two distributions do not share a non-negligible common support, which happens commonly for distributions supported on lower-dimensional manifolds. In such cases, another probability distance, the Wasserstein distance, has gained considerable attention (Cuturi, 2013; Doucet, 2014). (Montavon et al., 2016) developed a training method for restricted Boltzmann machines with the Wasserstein distance as the loss function, and showed the power of Wasserstein training of RBMs through applications to digit generation, data completion and denoising; the Wasserstein-trained RBM has better generative properties than the traditional RBM. (Arjovsky et al., 2017) introduced the Wasserstein distance to Generative Adversarial Networks (GANs) to improve learning stability. (Frogner et al., 2015) used the Wasserstein distance as the loss function for supervised learning in the label space. Theoretically, the Wasserstein distance is a well-defined distance metric even for non-overlapping distributions, and it is continuous with respect to the model parameters (Arjovsky et al., 2017).

In this project, we propose a general training method that minimizes the Wasserstein distance, and discuss training of RBMs and DBMs as specific examples. In experiments, we evaluate the validity of Wasserstein training of RBMs and DBMs for digit generation on the MNIST dataset.

The report is organized as follows. Section 2 gives a brief introduction to the definition of the Wasserstein distance, its duality, and its comparison with the KL-divergence. Section 3 presents the theory of Wasserstein training. Section 4 focuses on the Wasserstein RBM and section 5 on the Wasserstein DBM. Section 6 considers the practical implementation of Wasserstein training with KL regularization. In section 7, we experimentally demonstrate the validity of Wasserstein Boltzmann machines for digit generation on MNIST. Section 8 gives our conclusion.

2. Background

2.1. Wasserstein Distance

Given a complete separable metric space (M, d), let p and q be absolutely continuous Borel probability measures on M. Consider π as a joint probability measure on M × M with marginals p and q, also called a transport plan:

π(A × M) = p(A) for any Borel subset A ⊆ M;
π(M × B) = q(B) for any Borel subset B ⊆ M.    (1)

For d(x, x') a distance metric on M, the Wasserstein distance between p and q is defined as

W(p, q) = min_{π ∈ Π(p,q)} E_π[d(x, x')],    (2)

where Π(p, q) denotes the marginal polytope of p and q. A transport plan π is optimal if (2) achieves its infimum. Optimal transport plans on Euclidean spaces are characterized by push-forward measures. In practice, a linear program is usually adopted to compute the Wasserstein distance; however, when the dimension of the domain is high, solving the linear program becomes intractable. Recently, (Cuturi, 2013) and (Doucet, 2014) proposed an efficient approximation of Wasserstein distances along with their derivatives, which plays an important role in the practical implementation of these computations.

2.2. Kantorovich Duality

When p and q are discrete distributions, the Wasserstein distance can be written as a linear program:

W(p, q) = min_π Σ_{x,x'} π(x, x') d(x, x')
s.t. Σ_{x'} π(x, x') = p(x),  Σ_x π(x, x') = q(x'),  π(x, x') ≥ 0.    (3)

Its dual form can be written as

W(p, q) = max_{α,β} Σ_x α(x) p(x) + Σ_{x'} β(x') q(x')
s.t. α(x) + β(x') ≤ d(x, x').    (4)

The primal and dual forms have an interesting intuitive interpretation (Villani, 2008). Consider p(x) as the amount of bread produced by a bakery located at x, q(x') as the amount of bread sold by a café located at x', d(x, x') as the cost of transporting a unit of bread from x to x', and π(x, x') as the transport plan, i.e., the amount of bread transported from x to x'. The primal problem is then the minimum cost of transporting all bread from bakeries to cafés. Now assume there is an agent who wants to compete with the transportation department. He buys bread from the bakery located at x at unit price −α(x), and sells it to the café located at x' at unit price β(x'). The agent wants to remain competitive, meaning that his net price is no higher than the transportation cost, i.e., α(x) + β(x') ≤ d(x, x'). The dual problem is therefore the maximum profit of the agent while remaining competitive. If the agent chooses the optimal pricing scheme, he earns exactly as much as the transportation department, i.e., the dual achieves the primal.
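To make the primal form concrete, the following is a small sketch (our own illustration; it assumes SciPy's generic linprog solver rather than a dedicated optimal transport library) that solves (3) directly for two discrete distributions:

import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(p, q, D):
    """Exact Wasserstein distance between discrete p (m,) and q (n,)
    with cost matrix D (m, n), by solving the linear program (3)."""
    m, n = D.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_{x'} pi(x, x') = p(x)
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # sum_{x} pi(x, x') = q(x')
    b_eq = np.concatenate([p, q])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n)

# Two point masses at distance 3: all mass must move, so W = 3.
p = np.array([1.0, 0.0]); q = np.array([0.0, 1.0])
D = np.array([[0.0, 3.0], [3.0, 0.0]])
print(wasserstein_lp(p, q, D)[0])   # -> 3.0

The number of variables in this linear program grows as the product of the two support sizes, which is exactly why the generic approach becomes intractable in high dimensions and motivates the entropic smoothing of section 3.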

2.3. Comparison between Wasserstein Distance and KL-divergence

The KL-divergence is defined as

KL(p, q) := ∫_M log(p(x)/q(x)) p(x) dx.    (5)

The KL-divergence is asymmetric and thus not a valid distance. From its definition, if the supports of p and q do not have a non-negligible intersection, i.e., if the two densities are both non-zero only on a negligible region, the KL-divergence becomes undefined or infinite. The following example illustrates the idea.

Let Y ∼ U[0, 1] be the uniform distribution on the unit interval. Let p_0 be the distribution of (0, Y) ∈ R^2, uniform on a straight line segment passing through the origin, and let p_θ be the distribution of (θ, Y), with θ a parameter. Fig. 1 illustrates both distributions.

Figure 1. Example for Wasserstein distance vs. KL-divergence.

The Wasserstein distance between p_0 and p_θ is

W(p_0, p_θ) = |θ|,    (6)

while the KL-divergence is

KL(p_0, p_θ) = ∞ if θ ≠ 0, and 0 if θ = 0.    (7)

As we can see, the Wasserstein distance is continuous with respect to θ, while the KL-divergence is not. This simple example illustrates that when two distributions are non-overlapping, the Wasserstein distance still provides discriminative information, while the KL-divergence fails to do so.

3. Theory of Wasserstein Training

In this section, we introduce the theory of Wasserstein training, beginning with how to compute the Wasserstein distance. Since the vanilla Wasserstein distance requires solving a linear program and is quite time consuming, we turn to the γ-smoothed Wasserstein distance (Cuturi, 2013). Given marginal distributions p(x) and q(x'), the γ-smoothed Wasserstein distance is defined as

W_γ(p, q) = min_{π ∈ Π(p,q)} E_π[d(x, x')] − γ H(π),    (8)

where Π(p, q) denotes the set of joint distributions with marginals p(x) and q(x'), and H(π) denotes the Shannon entropy of π. This amounts to invoking the maximum-entropy principle. The γ-smoothed Wasserstein distance is differentiable with respect to p and q, which considerably facilitates computations, and, as we show below, it can be computed with much lower complexity than the vanilla Wasserstein distance.

For simplicity, assume that p and q take discrete values. The γ-smoothed Wasserstein distance can be written as the optimization problem

W_γ(p, q) = min_π Σ_{x,x'} π(x, x') (d(x, x') + γ log π(x, x'))
s.t. Σ_{x'} π(x, x') = p(x),  Σ_x π(x, x') = q(x').

To solve this problem, introduce the Lagrangian

L = Σ_{x,x'} π(x, x') (d(x, x') + γ log π(x, x')) + Σ_x α(x) (p(x) − Σ_{x'} π(x, x')) + Σ_{x'} β(x') (q(x') − Σ_x π(x, x')),

where α(x), β(x') are Lagrange multipliers, i.e., dual variables. Setting the derivative of L with respect to π(x, x') to zero,

∂L/∂π(x, x') = d(x, x') + γ(1 + log π(x, x')) − α(x) − β(x') = 0,

gives the solution

π(x, x') = exp( (1/γ)(α(x) + β(x') − d(x, x')) − 1 ).    (9)

Define the vectors u(x) = exp(α(x)/γ), v(x') = exp(β(x')/γ) and the matrix K(x, x') = exp(−d(x, x')/γ − 1). The constraints require that

u(x) Σ_{x'} K(x, x') v(x') = p(x),   v(x') Σ_x u(x) K(x, x') = q(x').    (10)

When x and x' take finitely many values, (10) becomes the matrix equations

K v = p ./ u,   K^T u = q ./ v,    (11)

which can be solved by iteratively updating u and v until a fixed point is reached. We then recover α*(x), β*(x') as

α*(x) = γ log u(x),   β*(x') = γ log v(x'),    (12)

and the final value of the γ-smoothed Wasserstein distance is

W_γ(p, q) = Σ_x α*(x) p(x) + Σ_{x'} β*(x') q(x') − γ Σ_{x,x'} exp( (1/γ)(α*(x) + β*(x') − d(x, x')) − 1 ).    (13)

Notice that given optimal dual variables (α*(x), β*(x')), for any constant c the pair (α*(x) − c, β*(x') + c) is also optimal: the optimal dual variables have one extra degree of freedom, due to one redundant constraint. We can therefore set c = Σ_x α*(x) p(x), i.e., require α*(x) to be the centered optimal dual variable satisfying Σ_x α*(x) p(x) = 0.

Next, we consider parameter estimation using the γ-smoothed Wasserstein distance. Here we take p to be the model distribution p_θ, and q to be the data distribution p̂ = Σ_{i=1}^N (1/N) δ_{x_i}, where {x_i}_{i=1}^N are the data points. The sensitivity analysis theorem relates the gradient with respect to p_θ(x) to the optimal dual variable α*(x):

∂W_γ(p_θ, p̂) / ∂p_θ(x) = α*(x).

Therefore, by the chain rule, the gradient with respect to θ is

∂W_γ(p_θ, p̂)/∂θ = Σ_x α*(x) ∂p_θ(x)/∂θ
                = Σ_x α*(x) (∂ log p_θ(x)/∂θ) p_θ(x)
                = E_{x∼p_θ(x)}[ α*(x) ∂ log p_θ(x)/∂θ ].    (14)

Since α*(x) is centered, p_θ(x) in (14) may in fact be an unnormalized distribution.

Next, we point out that solving for the α*(x) corresponding to W_γ(p_θ, p̂), and computing the expectation in (14), are usually intractable, since x has exponentially many configurations in high-dimensional cases. In practice, we adopt a sampling approximation: draw samples {x̃_j}_{j=1}^{Ñ} from the distribution p_θ, and replace p_θ by the empirical distribution p̃_θ = Σ_{j=1}^{Ñ} (1/Ñ) δ_{x̃_j}. The optimal dual variables α*(x̃_j), corresponding to W_γ(p̃_θ, p̂), can then be computed with complexity of order N Ñ through the iterative updates in (11); the details are given in Algorithm 1. The gradient is estimated as

∂W_γ(p_θ, p̂)/∂θ ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) ∂ log p_θ(x̃_j)/∂θ.    (15)

Notice that in this gradient the real data enters only indirectly, through the optimal dual variables α*(x̃_j).

Algorithm 1 Compute Wasserstein distance and optimal dual variables
  Input: data x_i, samples x̃_j, distance metric d, smoothing parameter γ.
  Output: Wasserstein distance W_γ(p̃_θ, p̂), optimal dual variables α*(x̃_j), β*(x_i).
  D = [d(x̃_j, x_i)]; K = exp(−D/γ − 1); u = 1/Ñ
  while u changes or other relevant stopping criterion do
    v = 1/N ./ (K^T u)
    u = 1/Ñ ./ (K v)
  end while
  W_γ(p̃_θ, p̂) = dot(u, (D .* K) v)
  α* = γ log u
  β* = γ log v + mean(α*)
  α* = α* − mean(α*)   % centered

Finally, in some probabilistic graphical models such as DBMs, it is simpler to work with a joint distribution that includes hidden variables, where the marginal distribution is p_θ(x) = Σ_h p_θ(x, h). In such cases, the gradient with respect to θ is

∂W_γ(p_θ, p̂)/∂θ = Σ_x α*(x) ∂p_θ(x)/∂θ
                = Σ_x α*(x) Σ_h ∂p_θ(x, h)/∂θ
                = Σ_x Σ_h α*(x) (∂ log p_θ(x, h)/∂θ) p_θ(x, h)
                = E_{(x,h)∼p_θ(x,h)}[ α*(x) ∂ log p_θ(x, h)/∂θ ].    (16)

Adopting the sampling approximation, we draw samples (x̃_j, h̃_j) from the joint distribution p_θ(x, h) and estimate the gradient in (16) as

∂W_γ(p_θ, p̂)/∂θ ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) ∂ log p_θ(x̃_j, h̃_j)/∂θ.    (17)
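For reference, the following is a minimal NumPy transcription of Algorithm 1 (a sketch only; the function name, iteration cap and convergence tolerance are our own choices):

import numpy as np

def sinkhorn_dual(X_model, X_data, gamma=0.1, n_iter=500, tol=1e-6):
    """Gamma-smoothed Wasserstein distance between the empirical distribution of
    model samples X_model (N_tilde x D) and data X_data (N x D), together with
    centered dual variables alpha* for the model samples (Algorithm 1)."""
    N_tilde, N = X_model.shape[0], X_data.shape[0]
    # Pairwise distances D[j, i] = d(x_tilde_j, x_i); Euclidean is one possible choice.
    D = np.linalg.norm(X_model[:, None, :] - X_data[None, :, :], axis=-1)
    K = np.exp(-D / gamma - 1.0)     # in practice D may need rescaling for stability
    u = np.full(N_tilde, 1.0 / N_tilde)
    for _ in range(n_iter):
        v = (1.0 / N) / (K.T @ u)            # K^T u = q ./ v   (Eq. 11)
        u_new = (1.0 / N_tilde) / (K @ v)    # K v   = p ./ u   (Eq. 11)
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    W = u @ ((D * K) @ v)            # transport cost under the smoothed plan
    alpha = gamma * np.log(u)
    beta = gamma * np.log(v) + alpha.mean()
    alpha = alpha - alpha.mean()     # center alpha, cf. the discussion after Eq. (13)
    return W, alpha, beta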

4. Wasserstein Training of RBM

In this section, we implement the Wasserstein training algorithm for a restricted Boltzmann machine (RBM). In an RBM, the joint distribution of the observable variables x and the hidden variables h is

p_θ(x, h) = (1/Z_θ) exp( c^T x + h^T W x + b^T h ).

The gradients of log p_θ(x) with respect to the parameters θ = (W, b, c) are

∂ log p_θ(x)/∂W = σ(W x + b) x^T − E_{x∼p_θ(x)}[σ(W x + b) x^T],
∂ log p_θ(x)/∂b = σ(W x + b) − E_{x∼p_θ(x)}[σ(W x + b)],
∂ log p_θ(x)/∂c = x − E_{x∼p_θ(x)}[x].

Plugging these into (15), the second (model-expectation) terms vanish because the α*(x̃_j) are centered, and the gradients are estimated as

∂W_γ(p_θ, p̂)/∂W ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) σ(W x̃_j + b) x̃_j^T,    (18)
∂W_γ(p_θ, p̂)/∂b ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) σ(W x̃_j + b),    (19)
∂W_γ(p_θ, p̂)/∂c ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) x̃_j.    (20)

The samples {x̃_j}_{j=1}^{Ñ} are updated using persistent contrastive divergence (PCD).
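As an illustration of how (18)-(20) might be used in practice, the sketch below performs one update for a binary RBM, reusing the hypothetical sinkhorn_dual helper above; the plain gradient-descent step and all names are our own assumptions (the experiments reported later use RMSprop):

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wasserstein_rbm_step(W, b, c, X_chain, X_data, gamma=0.1, lr=0.01, rng=np.random):
    """One Wasserstein gradient step for a binary RBM (Eqs. 18-20); sketch only."""
    # Advance the persistent Gibbs chains by one step (PCD).
    H = rng.binomial(1, sigmoid(X_chain @ W.T + b))    # sample h | x
    X_chain = rng.binomial(1, sigmoid(H @ W + c))      # sample x | h
    # Dual variables for the current model samples versus the data batch.
    _, alpha, _ = sinkhorn_dual(X_chain, X_data, gamma=gamma)
    S = sigmoid(X_chain @ W.T + b)                     # sigma(W x_j + b), one row per sample
    N_tilde = X_chain.shape[0]
    grad_W = (alpha[:, None] * S).T @ X_chain / N_tilde    # Eq. (18)
    grad_b = (alpha[:, None] * S).mean(axis=0)             # Eq. (19)
    grad_c = (alpha[:, None] * X_chain).mean(axis=0)       # Eq. (20)
    # Descend on the Wasserstein loss.
    return W - lr * grad_W, b - lr * grad_b, c - lr * grad_c, X_chain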

5. Wasserstein Training of DBM

In this section, we implement the Wasserstein training algorithm for a deep Boltzmann machine (DBM). Take a two-layer DBM as an example. The joint distribution is

p_θ(x, h) = (1/Z_θ) exp( c^T x + h^{(1)T} W^{(1)} x + b^{(1)T} h^{(1)} + h^{(2)T} W^{(2)} h^{(1)} + b^{(2)T} h^{(2)} ).

The gradients of log p_θ(x, h) with respect to the parameters θ = (W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}, c) are

∂ log p_θ(x, h)/∂W^{(1)} = h^{(1)} x^T − E_{(x,h)∼p_θ(x,h)}[h^{(1)} x^T],
∂ log p_θ(x, h)/∂W^{(2)} = h^{(2)} h^{(1)T} − E_{(x,h)∼p_θ(x,h)}[h^{(2)} h^{(1)T}],
∂ log p_θ(x, h)/∂b^{(1)} = h^{(1)} − E_{(x,h)∼p_θ(x,h)}[h^{(1)}],
∂ log p_θ(x, h)/∂b^{(2)} = h^{(2)} − E_{(x,h)∼p_θ(x,h)}[h^{(2)}],
∂ log p_θ(x, h)/∂c = x − E_{(x,h)∼p_θ(x,h)}[x].

Plugging these into (17), the second terms again vanish, and the gradients are estimated as

∂W_γ(p_θ, p̂)/∂W^{(1)} ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) h̃_j^{(1)} x̃_j^T,    (21)
∂W_γ(p_θ, p̂)/∂W^{(2)} ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) h̃_j^{(2)} h̃_j^{(1)T},    (22)
∂W_γ(p_θ, p̂)/∂b^{(1)} ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) h̃_j^{(1)},    (23)
∂W_γ(p_θ, p̂)/∂b^{(2)} ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) h̃_j^{(2)},    (24)
∂W_γ(p_θ, p̂)/∂c ≈ (1/Ñ) Σ_{j=1}^{Ñ} α*(x̃_j) x̃_j.    (25)

The samples {x̃_j, h̃_j^{(1)}, h̃_j^{(2)}}_{j=1}^{Ñ} are updated using PCD. Whereas the gradients of the KL-divergence consist of model-expectation terms minus data-expectation terms, the gradients of the Wasserstein distance contain no data-expectation terms: the real data influences the update only through the optimal dual variables α*(x̃_j).
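A PCD update for this two-layer model might alternate layer-wise Gibbs sampling and then form the weighted estimates (21)-(25), as in the sketch below; it assumes a fully binary DBM and again reuses the hypothetical sinkhorn_dual and sigmoid helpers:

def wasserstein_dbm_gradients(W1, W2, b1, b2, c, X_chain, H1, H2, X_data,
                              gamma=0.1, n_gibbs=1, rng=np.random):
    """Advance the PCD chains of a two-layer binary DBM and return the Wasserstein
    gradient estimates of Eqs. (21)-(25); sketch only."""
    for _ in range(n_gibbs):
        # Layer-wise Gibbs sampling: h1 is conditioned on both x and h2.
        H1 = rng.binomial(1, sigmoid(X_chain @ W1.T + H2 @ W2 + b1))
        H2 = rng.binomial(1, sigmoid(H1 @ W2.T + b2))
        X_chain = rng.binomial(1, sigmoid(H1 @ W1 + c))
    _, alpha, _ = sinkhorn_dual(X_chain, X_data, gamma=gamma)
    N_tilde = X_chain.shape[0]
    a = alpha[:, None]
    grad_W1 = (a * H1).T @ X_chain / N_tilde   # Eq. (21)
    grad_W2 = (a * H2).T @ H1 / N_tilde        # Eq. (22)
    grad_b1 = (a * H1).mean(axis=0)            # Eq. (23)
    grad_b2 = (a * H2).mean(axis=0)            # Eq. (24)
    grad_c = (a * X_chain).mean(axis=0)        # Eq. (25)
    return (grad_W1, grad_W2, grad_b1, grad_b2, grad_c), (X_chain, H1, H2)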

6. Stabilization with KL regularization

For training based on the pure Wasserstein distance, we notice that the optimization is very likely to become trapped in a poor local minimum. Based on the observed data x, we generate x' using one Gibbs step and calculate the cross-entropy loss between x and x' as a monitor of training. As shown in Fig. 2, the loss stops decreasing at a level around 300.

Figure 2. Loss vs. epoch number for the pure Wasserstein RBM.

We hypothesize that the poor training is due to the Wasserstein gradient depending mainly on the samples generated from the model distribution p_θ, and only indirectly on the data distribution p̂. If the samples generated by the model differ strongly from the data, this becomes a problem because no weighting α*(x̃_j) of the generated samples can represent the desired direction towards a better minimum; in that case the Wasserstein gradient leads to a bad local minimum.

To alleviate this issue, we use a hybrid of the Wasserstein distance and the KL-divergence, similar to the approach proposed in (Montavon et al., 2016). The learning objective now becomes minimizing

(1 − η) W(p_θ, p̂) + η KL(p̂, p_θ),    (26)

where η is the mixing ratio.

Figure 3. Loss vs. epoch number for the KL-regularized Wasserstein RBM.

Fig. 3 shows that with the proposed KL-regularized Wasserstein learning objective and η = 0.1, the validation cross-entropy loss decreases below 200, a large improvement over pure Wasserstein training.
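In implementation terms, the hybrid objective simply mixes the two gradient estimates; the sketch below shows one reasonable way to do this for an RBM (the KL part is the standard PCD estimate of the negative log-likelihood gradient) and is an assumption of ours rather than the authors' exact code:

def hybrid_rbm_gradient(W, b, c, X_data, X_chain, alpha, eta=0.1):
    """Gradient of (1 - eta) * W_gamma + eta * KL(p_hat, p_theta) for a binary RBM;
    alpha are the centered dual variables for the model samples X_chain. Sketch only."""
    N, N_tilde = X_data.shape[0], X_chain.shape[0]
    S_data = sigmoid(X_data @ W.T + b)
    S_model = sigmoid(X_chain @ W.T + b)
    # Wasserstein part (Eqs. 18-20).
    gW_w = (alpha[:, None] * S_model).T @ X_chain / N_tilde
    gb_w = (alpha[:, None] * S_model).mean(axis=0)
    gc_w = (alpha[:, None] * X_chain).mean(axis=0)
    # KL part: model expectation minus data expectation.
    gW_kl = S_model.T @ X_chain / N_tilde - S_data.T @ X_data / N
    gb_kl = S_model.mean(axis=0) - S_data.mean(axis=0)
    gc_kl = X_chain.mean(axis=0) - X_data.mean(axis=0)
    return ((1 - eta) * gW_w + eta * gW_kl,
            (1 - eta) * gb_w + eta * gb_kl,
            (1 - eta) * gc_w + eta * gc_kl)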

Figure 4. Visualization of generated samples and model weights for the Wasserstein RBM (η = 0.1) and the standard RBM: (a) samples generated from the standard RBM; (b) samples generated from the Wasserstein RBM; (c) illustration of W for the standard RBM; (d) illustration of W for the Wasserstein RBM.

7. Experiments

We evaluate the proposed Wasserstein training of RBMs and DBMs on the MNIST dataset. We strictly follow the MNIST protocol, which divides the dataset into a training set of 60,000 digits and a test set of 10,000 digits. Throughout the experiments, we set the smoothing parameter γ = 0.1 for the γ-smoothed Wasserstein distance.

7.1. RBM

The results of training with the standard KL-divergence and with the Wasserstein distance are compared as follows. In training the RBM, we use 100 hidden nodes and adopt gradient descent with batch size 100, PCD chain size 100, an initial learning rate of 0.01 with adaptive adjustment (RMSprop), and train for at most 5000 epochs with early stopping.

The learned features W are shown in Fig. 4(c,d); they reveal some structures such as edges.

Samples drawn from the learned RBMs are shown in Fig. 4(a,b). We draw these samples by first randomly initializing the nodes and then performing 1000 Gibbs sampling steps.

7.2. DBM

The DBM model used in this experiment consists of two hidden layers: the first hidden layer has 1000 hidden nodes and the second has 500. In training the DBM, we adopt gradient descent with batch size 100, PCD chain size 100, RMSprop to adaptively adjust the learning rate, and train for 2000 epochs.
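For orientation, the pieces above might be assembled into a DBM training loop along the following lines; the initialization, the plain SGD step, and all names are our own illustrative assumptions (the reported experiments adapt the learning rate with RMSprop):

def train_w_dbm(X_train, n_h1=1000, n_h2=500, n_epochs=2000, batch_size=100,
                chain_size=100, lr=1e-3, gamma=0.1, rng=np.random):
    """Illustrative outer loop for Wasserstein DBM training with PCD chains."""
    n_v = X_train.shape[1]
    W1 = 0.01 * rng.standard_normal((n_h1, n_v))
    W2 = 0.01 * rng.standard_normal((n_h2, n_h1))
    b1, b2, c = np.zeros(n_h1), np.zeros(n_h2), np.zeros(n_v)
    X_chain = rng.binomial(1, 0.5, (chain_size, n_v)).astype(float)
    H1 = rng.binomial(1, 0.5, (chain_size, n_h1)).astype(float)
    H2 = rng.binomial(1, 0.5, (chain_size, n_h2)).astype(float)
    for epoch in range(n_epochs):
        for start in range(0, X_train.shape[0], batch_size):
            X_batch = X_train[start:start + batch_size]
            grads, (X_chain, H1, H2) = wasserstein_dbm_gradients(
                W1, W2, b1, b2, c, X_chain, H1, H2, X_batch, gamma=gamma, rng=rng)
            for param, g in zip((W1, W2, b1, b2, c), grads):
                param -= lr * g    # descend on the Wasserstein loss
    return W1, W2, b1, b2, c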

Figure 5. Visualization of generated samples and model weights for the W-DBM (η = 0.3) and the standard DBM: (a) samples generated from the standard DBM; (b) samples generated from the Wasserstein DBM; (c) illustration of W^{(1)} for the standard DBM; (d) illustration of W^{(1)} for the Wasserstein DBM.

To investigate the effect of the mixing ratio parameter η, we train DBMs with η ranging from 0 to 1, and report their performance in terms of the Wasserstein distance between 10,000 digits sampled from the model and the 10,000 digits of the test set. Moreover, to faithfully reflect the performance of Wasserstein training, we do not perform the discriminative fine-tuning that is usually done in DBM training. The final values of the Wasserstein distance after 2000 epochs of training are shown in Fig. 6.

Figure 6. Wasserstein training with different mixing ratios η. When η = 1, the model is equivalent to a standard DBM trained by KL-divergence (green star); η = 0.3 (orange star) achieves the lowest W-distance on the test set.

From Fig. 6, we can see that a mixture of Wasserstein distance and KL-divergence outperforms both the pure Wasserstein distance (η = 0) and the pure KL-divergence (η = 1). This result empirically shows that although using the Wasserstein distance alone is problematic, it is complementary to the KL-divergence, and a combination of the two achieves better results for DBM training.

To give a more subjective comparison of the proposed Wasserstein-trained DBM (W-DBM, η = 0.3) with the DBM trained with KL-divergence, we show the learned model weights W^{(1)} and digits randomly sampled from the W-DBM (Fig. 5, right) and the standard DBM (Fig. 5, left).

Finally, in Fig. 7, we visualize the learned W-DBM model distribution by projecting 10,000 randomly drawn samples from the model (red dots) onto a 2D plane. The projection is obtained by performing PCA on the data points of the training set. As a reference, we also plot the samples from the test set (blue dots). From the figure, we can see that although the W-DBM does not perfectly cover the whole empirical data distribution, it does loosely align with the data distribution in many local areas.

Figure 7. 2-D visualization of samples from the W-DBM (red dots) and real data (blue dots).

8. Conclusion

In this report, we introduced the Wasserstein distance as a loss function in place of the KL-divergence, and proposed a general training method that minimizes the Wasserstein distance between the sample distribution and the model distribution for various probabilistic models. By adding an entropy regularization term, we obtained an efficient method to approximate the gradient of the γ-smoothed Wasserstein distance with respect to the parameters, and we discussed how to train models containing hidden variables. We then developed Wasserstein training of RBMs and DBMs as specific examples. Experiments and analysis have shown the superiority of W-RBMs and W-DBMs compared to their traditional counterparts.

References

Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Coates, Adam, Lee, Honglak, and Ng, Andrew Y. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001(48109):2, 2010.

Cuturi, Marco. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.

Doucet, Arnaud. Fast computation of Wasserstein barycenters. 2014.

Frogner, Charlie, Zhang, Chiyuan, Mobahi, Hossein, Araya, Mauricio, and Poggio, Tomaso A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2053–2061, 2015.

Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pp. 1607–1614, 2009.

Larochelle, Hugo and Bengio, Yoshua. Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning, pp. 536–543. ACM, 2008.

Montavon, Grégoire, Müller, Klaus-Robert, and Cuturi, Marco. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pp. 3711–3719, 2016.

Salakhutdinov, Ruslan and Hinton, Geoffrey. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455, 2009.

Salakhutdinov, Ruslan, Mnih, Andriy, and Hinton, Geoffrey. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pp. 791–798. ACM, 2007.

Taylor, Graham W and Hinton, Geoffrey E. Factored conditional restricted Boltzmann machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1025–1032. ACM, 2009.

Villani, Cédric. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.