The Helmholtz machine revisited
Danilo Jimenez Rezende BMI-EPFL
November 8, 2012
1 Introduction
2 Variational Approximation
3 Relevant special cases
4 Learning with non-factorized qs
5 Extending the model over time
6 Final remarks
Helmholtz machines

Helmholtz machines [Dayan et al., 1995, Dayan and Hinton, 1996, Dayan, 2000] are directed graphical models with a layered structure X_L -> ... -> X_1 -> X_0. Complete data likelihood:

p(X|\theta^g) = \Big[\prod_{l=0}^{L-1} p(X_l|X_{l+1}, \theta^g)\Big]\, p(X_L)

Only X_0 is observed.
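As a concrete sketch, ancestral sampling from such a layered model could look as follows. The layer sizes, random weights, and the Bernoulli(0.5) top-layer sample are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Hypothetical sketch of ancestral sampling from a layered directed model
# p(X|theta_g) = [prod_l p(X_l | X_{l+1})] p(X_L), with binary units.
rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_top_down(weights, biases, x_top):
    """Sample X_{L-1}, ..., X_0 given X_L by ancestral sampling."""
    x = x_top
    samples = [x]
    for W, b in zip(weights, biases):          # layer L-1 down to layer 0
        p = sigmoid(W @ x + b)                 # p(X_l | X_{l+1})
        x = (rng.random(p.shape) < p).astype(float)
        samples.append(x)
    return samples                             # [X_L, X_{L-1}, ..., X_0]

weights = [rng.normal(size=(6, 4)), rng.normal(size=(8, 6))]  # L = 2
biases = [np.zeros(6), np.zeros(8)]
x_top = (rng.random(4) < 0.5).astype(float)    # X_L, here Bern(0.5)
xs = sample_top_down(weights, biases, x_top)
```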
Types of Helmholtz machines

Binary units:

p(X_l|X_{l+1}) = \mathrm{Bern} \circ \mathrm{sigmoid}(W_l^g X_{l+1} + B_l^g);

Parameters \theta^g = \{W^g, B^g\}.

Smooth units:

p(X_l|X_{l+1}) = \mathcal{N}(X_l;\ \tanh(W_l^g X_{l+1} + B_l^g),\ \Sigma_l).

Parameters \theta^g = \{W^g, B^g, \Sigma^g\}.
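These two layer conditionals can be sketched directly; the function names and the diagonal covariance below are illustrative assumptions:

```python
import numpy as np

# Sketch of the two layer conditionals of a Helmholtz machine.
rng = np.random.default_rng(1)

def sample_binary_layer(W, B, x_above):
    """Binary units: X_l ~ Bern(sigmoid(W X_{l+1} + B))."""
    p = 1.0 / (1.0 + np.exp(-(W @ x_above + B)))
    return (rng.random(p.shape) < p).astype(float)

def sample_smooth_layer(W, B, sigma2, x_above):
    """Smooth units: X_l ~ N(tanh(W X_{l+1} + B), diag(sigma2))."""
    mean = np.tanh(W @ x_above + B)
    return mean + np.sqrt(sigma2) * rng.standard_normal(mean.shape)

W, B = rng.normal(size=(5, 3)), np.zeros(5)
x_above = np.ones(3)
x_bin = sample_binary_layer(W, B, x_above)
x_smooth = sample_smooth_layer(W, B, np.full(5, 0.1), x_above)
```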
Goal

For a data set of iid samples y \in \mathrm{Data}, maximize the data log-likelihood w.r.t. \theta^g:

\ln p(\mathrm{Data}|\theta^g) = \sum_{y\in\mathrm{Data}} \ln p(y|\theta^g),

where

p(y|\theta^g) = \int \Big(\prod_{l>0} dX_l\Big)\, p(y|X_1) \Big(\prod_{l=1}^{L-1} p(X_l|X_{l+1}, \theta^g)\Big)\, p(X_L)
The variational trick

Introduce a parametric family of distributions q(X_{l>0}|X_0, \theta^r). Then

\ln p(X_0|\theta^g) = -\underbrace{\langle \ln q(X_{l>0}|X_0,\theta^r) - \ln p(X) \rangle_{q(X_{l>0}|X_0,\theta^r)}}_{F(X_0,\theta^g,\theta^r)} + \mathrm{KL}(q; p),

where

\mathrm{KL}(q; p) = \Big\langle \ln \frac{q(X_{l>0}|X_0,\theta^r)}{p(X_{l>0}|X_0,\theta^g)} \Big\rangle_{q(X_{l>0}|X_0,\theta^r)}

\mathrm{KL} \ge 0 \;\Rightarrow\; \ln p(X_0|\theta^g) \ge -F(X_0,\theta^g,\theta^r) \quad \forall \theta^r
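The bound can be checked numerically on a toy model with one binary latent X_1 and one binary observed X_0; all probabilities below are made-up toy numbers:

```python
import numpy as np

# Numerical check of the bound ln p(X0) >= -F on a tiny toy model.
p_x1 = np.array([0.3, 0.7])                   # prior p(X1)
p_x0_given_x1 = np.array([[0.9, 0.1],         # p(X0 | X1 = 0)
                          [0.2, 0.8]])        # p(X0 | X1 = 1)
x0 = 1
joint = p_x1 * p_x0_given_x1[:, x0]           # p(X0 = 1, X1)
log_px0 = np.log(joint.sum())                 # exact ln p(X0 = 1)

q = np.array([0.5, 0.5])                      # any q with the same support
F = np.sum(q * (np.log(q) - np.log(joint)))   # <ln q - ln p(X)>_q

# F touches -ln p(X0) exactly when q equals the true posterior:
post = joint / joint.sum()
F_opt = np.sum(post * (np.log(post) - np.log(joint)))
```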
The variational trick

Redefine the learning problem as

\{\hat\theta^g, \hat\theta^r\} = \arg\min_{\theta^g,\theta^r} \sum_{x\in\mathrm{Data}} F(x, \theta^g, \theta^r)

Why?
- q can be any distribution with the same support as p
- Choose q so that we can calculate expectations in a reasonable time
- Allows us to solve inference using standard optimization techniques
Fully factorized q

q(X_{l>0}|X_0, \theta^r) = \prod_{l>0,\,i} p(X_{l,i}|\theta^r_{l,i})

- Expectations are analytically tractable for sigmoid/tanh nonlinearities [Frey, 1996, Jordan, 1999]
- Yields local message-passing type of algorithms
- Fast convergence
- Bad approximation to multimodal posteriors
Helmholtz machine's choice of q

Bottom-up graph [Dayan et al., 1995, Dayan and Hinton, 1996, Dayan, 2000]:

q(X_{l>0}|X_0, \theta^r) = \prod_{l=1}^{L} p(X_l|X_{l-1}, \theta^r)

- Conditioned on X_0
- Cannot solve any expectation analytically
- Resort to Monte Carlo approximations
Point-estimate of the free energy

F(X_0, \theta^g, \theta^r) = \langle \ln q(X_{l>0}|X_0,\theta^r) - \ln p(X) \rangle_{q(X_{l>0}|X_0,\theta^r)} = \langle \hat F \rangle_{q(X_{l>0}|X_0,\theta^r)},

where

\hat F = \ln q(X_{l>0}|X_0,\theta^r) - \ln p(X)
REINFORCE

A simple, unbiased, stochastic gradient estimator.

Update for \theta^g:

\delta\theta^g \propto -\nabla_{\theta^g} F(X_0, \theta^g, \theta^r) = \langle \nabla_{\theta^g} \ln p \rangle_q

Update for \theta^r:

\delta\theta^r \propto -\nabla_{\theta^r} F(X_0, \theta^g, \theta^r) = -\langle (\hat F - b)\,\nabla_{\theta^r} \ln q \rangle_q, \quad \text{where } b \approx \langle \hat F \rangle_q

Scales badly:

\mathrm{Var}[(\hat F - b)\nabla_{\theta^r} \ln q] \approx O(\text{number of hidden nodes})
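A small numerical sketch of the likelihood-ratio estimator and its baseline. The Bernoulli(theta) latent and cost f(x) = x are toy assumptions; for this choice the true gradient of the expected cost w.r.t. theta is exactly 1, and the baseline reduces variance without changing the mean:

```python
import numpy as np

# REINFORCE / likelihood-ratio gradient for a toy Bernoulli(theta) latent.
rng = np.random.default_rng(2)
theta = 0.3
x = (rng.random(200_000) < theta).astype(float)
score = (x - theta) / (theta * (1 - theta))    # grad_theta ln Bern(x; theta)

g_plain = np.mean(x * score)                   # unbiased, high variance
b = x.mean()                                   # baseline b ~ <f>, as on the slide
g_base = np.mean((x - b) * score)              # unbiased, lower variance
```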
Wake-Sleep: the wrong gradient that works

Update for \theta^g (the "wake" phase): \delta\theta^g \propto \langle \nabla_{\theta^g} \ln p \rangle_q

Update for \theta^r (the "sleep" phase): \delta\theta^r \propto \langle \nabla_{\theta^r} \ln q \rangle_p

Why is it wrong? It minimizes KL(p; q) instead of KL(q; p).

Why and when does it work? If for any \theta^g there exists \theta^r such that q(X_{l>0}|X_0, \theta^r) = p(X_{l>0}|X_0, \theta^g), then [Ikeda et al., 1999]

\arg\min_q \mathrm{KL}(p; q) = \arg\min_q \mathrm{KL}(q; p)
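The gap between the two KL objectives can be illustrated on a toy correlated distribution over two bits approximated by a factorized q; all numbers are illustrative assumptions:

```python
import numpy as np

# Toy illustration: minimizing KL(p;q) (wake-sleep) differs from minimizing
# KL(q;p) (variational) when q cannot represent p.  Here p is a correlated
# distribution over two bits and q(a, b) = Bern(a) x Bern(b) is factorized.
p = np.array([0.49, 0.01, 0.01, 0.49])         # states 00, 01, 10, 11

def q_of(a, b):
    return np.array([(1 - a) * (1 - b), (1 - a) * b, a * (1 - b), a * b])

grid = np.linspace(0.01, 0.99, 99)
best_inc = (np.inf, None, None)                # (KL, a, b) for KL(p;q)
best_exc = (np.inf, None, None)                # (KL, a, b) for KL(q;p)
for a in grid:
    for b in grid:
        q = q_of(a, b)
        kl_inc = np.sum(p * np.log(p / q))     # KL(p;q): wake-sleep target
        kl_exc = np.sum(q * np.log(q / p))     # KL(q;p): variational target
        if kl_inc < best_inc[0]:
            best_inc = (kl_inc, a, b)
        if kl_exc < best_exc[0]:
            best_exc = (kl_exc, a, b)
# KL(p;q) is minimized at the product of marginals (a = b = 0.5), while
# KL(q;p) locks onto one of the two modes.
```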
Wake-Sleep: the wrong gradient that works

[Figure: the generative (top-down) and recognition (bottom-up) networks over the layers X_0, ..., X_L.]
Summary

- REINFORCE is unbiased but scales badly
- Wake-sleep scales nicely, but it is wrong
- Neither exploits specific properties of exponential-family pdfs
Why is REINFORCE so bad?

REINFORCE is based on the likelihood-ratio identity:

\nabla_\theta \langle f(x) \rangle_{p(x)} = \langle f(x)\,\nabla_\theta \ln p(x) \rangle_{p(x)} = \langle (f(x) - b)\,\nabla_\theta \ln p(x) \rangle_{p(x)}

If p(x) = \mathcal{N}(x; \mu, \Sigma) (\theta = \{\mu, \Sigma\}), we can do much better.

Bonnet's theorem:

\nabla_{\mu_i} \langle f(x) \rangle_{p(x)} = \Big\langle \frac{\partial}{\partial x_i} f(x) \Big\rangle_{p(x)}

Price's theorem:

\nabla_{\Sigma_{i,j}} \langle f(x) \rangle_{p(x)} = \Big(\tfrac{1}{2}\Big)^{\delta_{i,j}} \Big\langle \frac{\partial^2}{\partial x_i \partial x_j} f(x) \Big\rangle_{p(x)}
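Bonnet's theorem is easy to verify by Monte Carlo in one dimension; the test function f(x) = x^3 is a toy assumption:

```python
import numpy as np

# Monte Carlo check of Bonnet's theorem in 1-D: for x ~ N(mu, sigma^2),
# grad_mu <f(x)> = <f'(x)>.
rng = np.random.default_rng(3)
mu, sigma = 0.7, 1.3
x = mu + sigma * rng.standard_normal(500_000)

grad_bonnet = np.mean(3 * x**2)                # <f'(x)> by Monte Carlo
# Analytically <x^3> = mu^3 + 3 mu sigma^2, so grad_mu = 3 mu^2 + 3 sigma^2
grad_exact = 3 * mu**2 + 3 * sigma**2
```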
Rules for stochastic backpropagation (SBP)

Let V(y) be some cost function on y, and let \varepsilon_y = \frac{\partial V(y)}{\partial y} and \lambda_y = \frac{\partial^2 V(y)}{\partial y^2} be its first and second derivatives.

If y = \mu(x) (deterministic case), then

\varepsilon_x = \varepsilon_y \frac{\partial \mu(x)}{\partial x} \Big|_{y=\mu(x)}

If y \sim \mathcal{N}(y|\mu(x), \Sigma(x)) (stochastic case), then

\varepsilon_x = \Big\langle \underbrace{\varepsilon_y \frac{\partial \mu(x)}{\partial x}}_{\text{"drift"}} + \underbrace{\tfrac{1}{2}\,\lambda_y \frac{\partial \Sigma(x)}{\partial x}}_{\text{"diffusion"}} \Big\rangle_{\mathcal{N}(y|\mu(x),\Sigma(x))}
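The stochastic rule can be checked numerically for a one-dimensional unit; the particular mu(x), Sigma(x) and V(y) below are toy assumptions chosen so the exact gradient is available in closed form:

```python
import numpy as np

# Check of the stochastic SBP rule for a 1-D unit y ~ N(mu(x), Sigma(x))
# with cost V(y) = y**2, so eps_y = 2y and lam_y = 2:
#   eps_x = < eps_y mu'(x) + (1/2) lam_y Sigma'(x) >
rng = np.random.default_rng(4)
x = 0.4
mu, dmu = np.tanh(x), 1.0 - np.tanh(x)**2
Sigma, dSigma = x**2 + 0.1, 2.0 * x            # keep Sigma(x) > 0

y = mu + np.sqrt(Sigma) * rng.standard_normal(200_000)
eps_y, lam_y = 2.0 * y, 2.0
eps_x = np.mean(eps_y * dmu + 0.5 * lam_y * dSigma)

# Exact: <V(y)> = mu(x)^2 + Sigma(x), so d/dx <V> = 2 mu mu' + Sigma'
grad_exact = 2.0 * mu * dmu + dSigma
```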
Proposed smooth recognition model

[Figure: generative and recognition networks over X_0, ..., X_L; the recognition network outputs a mean \mu_l and covariance \Sigma_l at each layer, and the chain-rule expressions for backpropagating \partial V / \partial y through f(x) are annotated on the diagram.]
Proposed smooth recognition model

[Figure: two generative/recognition network pairs over X_0, ..., X_L with layer-wise means \mu_l and covariances \Sigma_l, and transforms T_1(h, z) between them.]
Danilo Jimenez Rezende BMI-EPFL The Helmholtz machine revisited backpropagateThroughGenerativeModel() µ εl ← ∇hl F(v, h, z) δ εΣ ← 1 i,j ∇2 F(v, h, z) l 2 hl,i ,hl,j backpropagateThroughRecognitionModel() applyGradients()
Introduction Variational Approximation Relevant special cases Learning with non-factorized qs Extending the model over time Final remarks Learning procedure: algorithm 1
while isLearning() do v ← getNewDataSample() bottomUp()
Danilo Jimenez Rezende BMI-EPFL The Helmholtz machine revisited backpropagateThroughRecognitionModel() applyGradients()
Introduction Variational Approximation Relevant special cases Learning with non-factorized qs Extending the model over time Final remarks Learning procedure: algorithm 1
while isLearning() do v ← getNewDataSample() bottomUp() backpropagateThroughGenerativeModel() µ εl ← ∇hl F(v, h, z) δ εΣ ← 1 i,j ∇2 F(v, h, z) l 2 hl,i ,hl,j
Danilo Jimenez Rezende BMI-EPFL The Helmholtz machine revisited applyGradients()
Introduction Variational Approximation Relevant special cases Learning with non-factorized qs Extending the model over time Final remarks Learning procedure: algorithm 1
while isLearning() do v ← getNewDataSample() bottomUp() backpropagateThroughGenerativeModel() µ εl ← ∇hl F(v, h, z) δ εΣ ← 1 i,j ∇2 F(v, h, z) l 2 hl,i ,hl,j backpropagateThroughRecognitionModel()
Learning procedure: algorithm 1

while isLearning() do
    v <- getNewDataSample()
    bottomUp()
    backpropagateThroughGenerativeModel()
        \varepsilon_l^\mu <- \nabla_{h_l} F(v, h, z)
        \varepsilon_{l,i,j}^\Sigma <- (\tfrac{1}{2})^{\delta_{i,j}} \nabla^2_{h_{l,i}, h_{l,j}} F(v, h, z)
    backpropagateThroughRecognitionModel()
    applyGradients()
Performance comparison on MNIST

Training set: 60000 28x28 grey images; test set: 10000 images.

[Figure: learning-performance comparison across methods, negative log-likelihood curves.]
Model            Parameters                   Log-Likelihood
HM               Nh = 200 + 50, Wake-Sleep    -157
HM               Nh = 200 + 50, REINFORCE     -145
Mix. Bernoulli   N = 500                      -137.64
RBM              Nh = 500, CD1                -125.53
HM               Nh = 200 + 50, BP-SBP        -108.35
RBM              Nh = 500, CD3                -105.53
NADE             N = 500                      -88.86
RBM              Nh = 500, CD25               -86.86
RBM, Mix. Bernoulli and NADE log-likelihoods from [Larochelle and Murray, 2011].
Making the model conditional

[Figure: generative and recognition networks over X_0, ..., X_L where each layer additionally receives a conditioning input z through weights W_l and D_l.]
Conditioning on a RNN of LSTMs (differentiable memory cells)

[Figure: an LSTM memory cell. The internal state s is maintained by a recurrent connection of fixed weight 1.0, modulated by the forget gate \phi; the input gate \iota and output gate \omega scale the cell input a_c and output b_c via multiplicative units.]

Idea suggested (but not implemented) by [Sutskever and Hinton, 2008].
Full model

[Figure: at each time step t, an LSTM layer (LSTM_1, ..., LSTM_N) with state z_t conditions a generative/recognition pair over hidden layers h_1, ..., h_L (with per-layer parameters \mu_l, \Sigma_l) and the visible units v.]
Learning procedure: algorithm 2

while isLearning() do
    resetGradients()
    for t = 1 -> T do
        forwardLSTM(t)
        v_t <- getDataFrame(t)
        bottomUp()
        backpropagateThroughGenerativeModel()
        backpropagateThroughRecognitionModel()
        \varepsilon_t^{LSTM} <- \nabla_{z_t} F(v_t, h_t, z_t)
        incrementGradientsOfGenerativeAndRecognitionModels()
    for t = T -> 1 do
        LSTM.BPTT(t)
    applyGradients()
Benchmarks: temporal sequences
Final remarks

- Performance is similar to RBM + CD3
- Not yet competitive with RBM + CD25
- Exploiting specific properties of the noise and graph is substantially better than Wake-Sleep and REINFORCE
- How much would we benefit from less constrained covariance matrices?
THANK YOU!
Bibliography
Dayan, P. (2000). Helmholtz machines and wake-sleep learning. Handbook of Brain Theory and Neural Network. MIT Press, Cambridge, MA.

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8):1385-1403.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5):889-904.

Frey, B. J. (1996). Variational inference for continuous sigmoidal Bayesian networks. In Sixth International Workshop on Artificial Intelligence and Statistics.

Ikeda, S., Nakahara, H., and Amari, S.-i. (1999). Convergence of the wake-sleep algorithm. Advances in Neural Information Processing Systems 11, pages 239-245.

Jordan, M. I. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183-233.

Larochelle, H. and Murray, I. (2011). The neural autoregressive distribution estimator. Journal of Machine Learning Research, 15:29-37.

Sutskever, I. and Hinton, G. (2008). The recurrent temporal restricted Boltzmann machine. Advances in Neural Information Processing Systems, 21.
Estimating the data log-likelihood

Naive:

\ln p(X_0) \approx \ln \frac{1}{K} \sum_k p(X_0|X_1^k),

where X_1^k, k = 1 \dots K, are samples from p.

Importance sampling:

\ln p(X_0) \approx \ln \frac{1}{K} \sum_k \exp(-\hat F_k),

where \hat F_k is a point-estimate of F using a sample from q.
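In practice the sum of exp(-F_hat_k) terms should be evaluated with a stable log-mean-exp, since the F_hat_k are typically large; a sketch (the F_hat values below are made up for illustration):

```python
import numpy as np

# Stable evaluation of ln (1/K) sum_k exp(-F_hat_k): subtracting the max
# before exponentiating avoids underflow when the F_hat_k are large.
def log_mean_exp(a):
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

F_hat = np.array([1002.3, 999.8, 1001.1, 1000.4])
log_px0 = log_mean_exp(-F_hat)   # direct np.exp(-F_hat) would underflow to 0
```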