
The Helmholtz machine revisited

Danilo Jimenez Rezende, BMI-EPFL

November 8, 2012


1 Introduction

2 Variational Approximation

3 Relevant special cases

4 Learning with non-factorized qs

5 Extending the model over time

6 Final remarks

Helmholtz machines

Helmholtz machines [Dayan et al., 1995, Dayan and Hinton, 1996, Dayan, 2000] are directed graphical models with a layered structure X_L → ... → X_1 → X_0. Complete data likelihood:

    p(X \mid \theta^g) = \prod_{l=0}^{L-1} p(X_l \mid X_{l+1}, \theta^g)\, p(X_L)

Only X_0 is observed.

Types of Helmholtz machines

Binary units:

    p(X_l \mid X_{l+1}) = \mathrm{Bern} \circ \mathrm{sigmoid}(W_l^g X_{l+1} + B_l^g)

Parameters θ^g = {W^g, B^g}.

Smooth units:

    p(X_l \mid X_{l+1}) = \mathcal{N}\big(X_l;\ \tanh(W_l^g X_{l+1} + B_l^g),\ \Sigma_l\big)

Parameters θ^g = {W^g, B^g, Σ^g}.
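To make the generative pass concrete, here is a minimal numpy sketch of ancestral sampling for the binary-unit case; the layer sizes, random parameter values, and uniform top-layer prior are made-up illustrations, not the talk's settings.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sample_generative(W, B, p_top, rng):
        """Ancestral sample X_L, ..., X_0 from a binary Helmholtz machine.

        W[l], B[l] parametrize p(X_l | X_{l+1}) = Bern(sigmoid(W[l] X_{l+1} + B[l]));
        p_top holds the Bernoulli means of the top layer X_L.
        """
        x = (rng.random(p_top.shape) < p_top).astype(float)   # X_L ~ p(X_L)
        layers = [x]
        for Wl, Bl in zip(reversed(W), reversed(B)):          # l = L-1, ..., 0
            mean = sigmoid(Wl @ x + Bl)
            x = (rng.random(mean.shape) < mean).astype(float)
            layers.append(x)
        return layers[::-1]                                   # [X_0, ..., X_L]

    rng = np.random.default_rng(0)
    sizes = [784, 200, 50]                                    # X_0, X_1, X_2 (hypothetical)
    W = [rng.normal(0, 0.1, (sizes[l], sizes[l + 1])) for l in range(2)]
    B = [np.zeros(sizes[l]) for l in range(2)]
    x0, x1, x2 = sample_generative(W, B, p_top=np.full(sizes[-1], 0.5), rng=rng)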

Goal

For a data set of iid samples y ∈ Data, maximize the data log-likelihood with respect to θ^g:

    \ln p(\mathrm{Data} \mid \theta^g) = \sum_{y \in \mathrm{Data}} \ln p(y \mid \theta^g),

where

    p(y \mid \theta^g) = \int \prod_{l>0} dX_l\ p(y \mid X_1) \prod_{l=1}^{L-1} p(X_l \mid X_{l+1}, \theta^g)\, p(X_L)


The variational trick

Introduce a parametric family of distributions q(X_{l>0} | X_0, θ^r). Then

    \ln p(X_0 \mid \theta^g) = -\overbrace{\big\langle \ln q(X_{l>0} \mid X_0, \theta^r) - \ln p(X) \big\rangle_{q(X_{l>0} \mid X_0, \theta^r)}}^{F(X_0,\, \theta^g,\, \theta^r)} + \mathrm{KL}(q; p),

where

    \mathrm{KL}(q; p) = \Big\langle \ln \frac{q(X_{l>0} \mid X_0, \theta^r)}{p(X_{l>0} \mid X_0, \theta^g)} \Big\rangle_{q(X_{l>0} \mid X_0, \theta^r)}

    \mathrm{KL} \geq 0 \ \Rightarrow\ \ln p(X_0 \mid \theta^g) \geq -F(X_0, \theta^g, \theta^r) \quad \forall\, \theta^r
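To see why the decomposition holds, substitute \ln p(X) = \ln p(X_{l>0} \mid X_0, \theta^g) + \ln p(X_0 \mid \theta^g) and note that the \ln q terms cancel:

    -F(X_0, \theta^g, \theta^r) + \mathrm{KL}(q; p)
      = \big\langle \ln p(X) - \ln q(X_{l>0} \mid X_0, \theta^r) \big\rangle_q
        + \big\langle \ln q(X_{l>0} \mid X_0, \theta^r) - \ln p(X_{l>0} \mid X_0, \theta^g) \big\rangle_q
      = \big\langle \ln p(X) - \ln p(X_{l>0} \mid X_0, \theta^g) \big\rangle_q
      = \ln p(X_0 \mid \theta^g)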


The variational trick

Redefine the learning problem as

    \{\hat\theta^g, \hat\theta^r\} = \arg\min_{\theta^g,\, \theta^r} \sum_{x \in \mathrm{Data}} F(x, \theta^g, \theta^r)

Why?
q can be any distribution with the same support as p.
Choose q so that we can calculate the required expectations in reasonable time.
Allows inference to be solved using standard optimization techniques.


Fully factorized q

    q(X_{l>0} \mid X_0, \theta^r) = \prod_{l>0,\, i} p(X_{l,i} \mid \theta^r_{l,i})

Expectations are analytically tractable for sigmoid/tanh nonlinearities [Frey, 1996, Jordan, 1999].
Yields local message-passing types of algorithms.
Fast convergence.
Bad approximation to multimodal posteriors.


Helmholtz machine's choice of q

Bottom-up graph [Dayan et al., 1995, Dayan and Hinton, 1996, Dayan, 2000]:

    q(X_{l>0} \mid X_0, \theta^r) = \prod_{l=1}^{L} p(X_l \mid X_{l-1}, \theta^r)

Conditioned on X_0.
Cannot solve any expectation analytically.
Resort to Monte Carlo approximations.

Point-estimate of the free energy

    F(X_0, \theta^g, \theta^r) = \big\langle \ln q(X_{l>0} \mid X_0, \theta^r) - \ln p(X) \big\rangle_{q(X_{l>0} \mid X_0, \theta^r)} = \langle \hat F \rangle_{q(X_{l>0} \mid X_0, \theta^r)},

where

    \hat F = \ln q(X_{l>0} \mid X_0, \theta^r) - \ln p(X)
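A minimal numpy sketch of this point estimate for binary units, sampling X_{l>0} from the bottom-up q of the previous slide and scoring both densities; the layer sizes, random parameters, uniform top-layer prior, and helper names (f_hat, log_bern) are hypothetical.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def log_bern(x, p):
        # summed log Bernoulli mass of binary vector x under means p
        return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

    def f_hat(x0, Wr, Br, Wg, Bg, p_top, rng):
        """One-sample estimate F_hat = ln q(X_{l>0}|X_0) - ln p(X)."""
        xs, log_q = [x0], 0.0
        for Wl, Bl in zip(Wr, Br):            # bottom-up sample from q
            p = sigmoid(Wl @ xs[-1] + Bl)
            x = (rng.random(p.shape) < p).astype(float)
            log_q += log_bern(x, p)
            xs.append(x)
        log_p = log_bern(xs[-1], p_top)       # top-down evaluate ln p(X)
        for l in range(len(Wg)):
            log_p += log_bern(xs[l], sigmoid(Wg[l] @ xs[l + 1] + Bg[l]))
        return log_q - log_p

    rng = np.random.default_rng(0)
    sizes = [784, 200, 50]
    Wg = [rng.normal(0, 0.1, (sizes[l], sizes[l + 1])) for l in range(2)]
    Bg = [np.zeros(sizes[l]) for l in range(2)]
    Wr = [rng.normal(0, 0.1, (sizes[l + 1], sizes[l])) for l in range(2)]
    Br = [np.zeros(sizes[l + 1]) for l in range(2)]
    x0 = (rng.random(784) < 0.5).astype(float)
    F = np.mean([f_hat(x0, Wr, Br, Wg, Bg, np.full(50, 0.5), rng) for _ in range(10)])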


REINFORCE

Simple, unbiased, stochastic gradient estimator.

Update for θ^g:

    \delta\theta^g \propto -\nabla_{\theta^g} F(X_0, \theta^g, \theta^r) = \langle \nabla_{\theta^g} \ln p \rangle_q

Update for θ^r: δθ^r ∝ −∇_{θ^r} F(X_0, θ^g, θ^r), with

    \nabla_{\theta^r} F(X_0, \theta^g, \theta^r) = \big\langle (\hat F - b)\, \nabla_{\theta^r} \ln q \big\rangle_q, \quad \text{where } b \approx \langle \hat F \rangle_q

Scales badly:

    \mathrm{Var}\big[(\hat F - b)\, \nabla_{\theta^r} \ln q\big] \approx O(\text{number of hidden nodes})
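A toy numpy sketch of the likelihood-ratio update for a factorized Bernoulli q with logits θ; the stand-in cost f (playing the role of F̂), the sizes, and the learning rate are made up. Increasing H inflates the variance of the estimator, which is the scaling problem above.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    H = 50                                   # number of hidden units
    theta = np.zeros(H)                      # q(x) = prod_i Bern(x_i; sigmoid(theta_i))
    target = (np.arange(H) % 2).astype(float)
    f = lambda x: np.sum((x - target) ** 2)  # stand-in for F_hat(x)

    K = 100
    for step in range(500):
        p = sigmoid(theta)
        x = (rng.random((K, H)) < p).astype(float)      # samples from q
        fx = np.array([f(xk) for xk in x])
        b = fx.mean()                                    # baseline b ≈ <F_hat>_q
        grad_log_q = x - p                               # ∇_theta ln q(x) for Bernoulli
        grad_F = np.mean((fx - b)[:, None] * grad_log_q, axis=0)
        theta -= 0.1 * grad_F                            # δθ^r ∝ −∇_{θ^r} F

    print(np.round(sigmoid(theta))[:10], target[:10])    # q should concentrate near target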


Wake-Sleep: The wrong gradient that works

Update for θ^g (the "wake" phase): δθ^g ∝ ⟨∇_{θ^g} ln p⟩_q
Update for θ^r (the "sleep" phase): δθ^r ∝ ⟨∇_{θ^r} ln q⟩_p

Why is it wrong? It minimizes KL(p; q) instead of KL(q; p).

Why and when does it work? If for any θ^g there exists a θ^r such that q(X_{l>0} | X_0, θ^r) = p(X_{l>0} | X_0, θ^g), then [Ikeda et al., 1999]

    \arg\min_q \mathrm{KL}(p; q) = \arg\min_q \mathrm{KL}(q; p)
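A minimal numpy sketch of wake-sleep for a single layer of binary units; the stand-in data distribution, sizes, and learning rate are placeholders. Note the sleep phase fits q to samples dreamed from p, which is exactly the KL(p; q) direction criticized above.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def bern(p, rng):
        return (rng.random(p.shape) < p).astype(float)

    rng = np.random.default_rng(0)
    D, H, lr = 20, 10, 0.01
    Wg, bg, c = rng.normal(0, 0.1, (D, H)), np.zeros(D), np.zeros(H)  # generative
    Wr, br = rng.normal(0, 0.1, (H, D)), np.zeros(H)                  # recognition

    for _ in range(1000):
        v = bern(np.full(D, 0.3), rng)            # stand-in data sample

        # Wake phase: h ~ q(h|v), ascend <∇_{θ^g} ln p>_q
        h = bern(sigmoid(Wr @ v + br), rng)
        dv = v - sigmoid(Wg @ h + bg)             # gradient of ln p(v|h) w.r.t. its logits
        Wg += lr * np.outer(dv, h); bg += lr * dv
        c += lr * (h - sigmoid(c))                # gradient of ln p(h) w.r.t. prior logits

        # Sleep phase: (h, v) ~ p, ascend <∇_{θ^r} ln q>_p
        h = bern(sigmoid(c), rng)
        v_dream = bern(sigmoid(Wg @ h + bg), rng)
        dh = h - sigmoid(Wr @ v_dream + br)       # gradient of ln q(h|v) w.r.t. its logits
        Wr += lr * np.outer(dh, v_dream); br += lr * dh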

Wake-Sleep: The wrong gradient that works

[Figure: the generative (top-down) and recognition (bottom-up) networks side by side, each spanning layers X_0 through X_L.]


Summary

REINFORCE is unbiased but scales badly.
Wake-Sleep scales nicely, but it is wrong.
Neither exploits specific properties of exponential-family pdfs.


Why is REINFORCE so bad?

REINFORCE is based on the likelihood-ratio identity:

    \nabla_\theta \langle f(x) \rangle_{p(x)} = \langle f(x)\, \nabla_\theta \ln p(x) \rangle_{p(x)} = \langle (f(x) - b)\, \nabla_\theta \ln p(x) \rangle_{p(x)}

If p(x) = N(x; μ, Σ) (so θ = {μ, Σ}), we can do much better.

Bonnet's theorem:

    \nabla_{\mu_i} \langle f(x) \rangle_{p(x)} = \Big\langle \frac{\partial}{\partial x_i} f(x) \Big\rangle_{p(x)}

Price's theorem:

    \nabla_{\Sigma_{i,j}} \langle f(x) \rangle_{p(x)} = \Big(\frac{1}{2}\Big)^{\delta_{i,j}} \Big\langle \frac{\partial^2}{\partial x_i\, \partial x_j} f(x) \Big\rangle_{p(x)}
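A quick numerical illustration of Bonnet's theorem in one dimension: both estimators below are unbiased for ∇_μ⟨f(x)⟩, but the likelihood-ratio one has far larger variance. The test function and parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, var = 0.5, 2.0
    f = lambda x: np.sin(x) + x ** 2       # arbitrary smooth test function
    df = lambda x: np.cos(x) + 2 * x

    n = 100000
    x = mu + np.sqrt(var) * rng.standard_normal(n)

    # Likelihood-ratio (REINFORCE) estimate of d<f>/dmu:
    # ∇_mu ln N(x; mu, var) = (x - mu) / var
    lr = f(x) * (x - mu) / var
    # Bonnet estimate: <f'(x)>
    bonnet = df(x)

    print(lr.mean(), bonnet.mean())   # both ≈ cos(mu) * exp(-var / 2) + 2 * mu
    print(lr.var(), bonnet.var())     # the LR estimator's variance is much larger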


Rules for stochastic backpropagation (SBP)

Let V(y) be some cost function on y, and let \varepsilon_y = \partial V(y)/\partial y and \lambda_y = \partial^2 V(y)/\partial y^2 be its first and second derivatives.

If y = μ(x) (deterministic case), then

    \varepsilon_x = \frac{\partial \mu(x)}{\partial x}\, \varepsilon_y \Big|_{y = \mu(x)}

If y ∼ N(y | μ(x), Σ(x)) (stochastic case), then

    \varepsilon_x = \Big\langle \underbrace{\varepsilon_y\, \frac{\partial \mu(x)}{\partial x}}_{\text{drift}} + \underbrace{\frac{1}{2}\, \lambda_y\, \frac{\partial \Sigma(x)}{\partial x}}_{\text{diffusion}} \Big\rangle_{N(y \mid \mu(x),\, \Sigma(x))}
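A one-dimensional numerical check of the stochastic rule, assuming illustrative choices of μ(x), Σ(x) and V(y); the SBP estimate (drift + diffusion) is compared against a finite difference of the Monte Carlo objective computed with common random numbers.

    import numpy as np

    rng = np.random.default_rng(0)

    mu = lambda x: np.tanh(x)            # mean path (illustrative)
    dmu = lambda x: 1.0 - np.tanh(x) ** 2
    Sig = lambda x: np.exp(x)            # variance path Σ(x) (illustrative)
    dSig = lambda x: np.exp(x)
    V = lambda y: y ** 4 + y             # cost with known derivatives
    eps = lambda y: 4 * y ** 3 + 1       # ε_y = ∂V/∂y
    lam = lambda y: 12 * y ** 2          # λ_y = ∂²V/∂y²

    x, n = 0.3, 200000
    y = mu(x) + np.sqrt(Sig(x)) * rng.standard_normal(n)

    # SBP estimate of d<V(y)>/dx: drift + diffusion
    sbp = np.mean(eps(y) * dmu(x) + 0.5 * lam(y) * dSig(x))

    # Finite-difference check with common random numbers
    z = rng.standard_normal(n)
    g = lambda x: np.mean(V(mu(x) + np.sqrt(Sig(x)) * z))
    fd = (g(x + 1e-4) - g(x - 1e-4)) / 2e-4

    print(sbp, fd)   # the two estimates should agree to a few decimals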

Proposed smooth recognition model

[Figures: the generative and recognition networks over layers X_0 ... X_L; the recognition model outputs a mean μ_l and covariance Σ_l for each layer, with transformations T_1(h, z) attached to the layers. Gradients pass through a deterministic map y = f(x) as (∂f/∂x)^T ∂V/∂y for first derivatives and (∂f/∂x)^T (∂²V/∂y²) (∂f/∂x) + (∂V/∂y)(∂²f/∂x²) for second derivatives.]

Danilo Jimenez Rezende BMI-EPFL The Helmholtz machine revisited backpropagateThroughGenerativeModel() µ εl ← ∇hl F(v, h, z) δ εΣ ← 1  i,j ∇2 F(v, h, z) l 2 hl,i ,hl,j backpropagateThroughRecognitionModel() applyGradients()

Introduction Variational Approximation Relevant special cases Learning with non-factorized qs Extending the model over time Final remarks Learning procedure: algorithm 1

while isLearning() do v ← getNewDataSample() bottomUp()

Danilo Jimenez Rezende BMI-EPFL The Helmholtz machine revisited backpropagateThroughRecognitionModel() applyGradients()

Introduction Variational Approximation Relevant special cases Learning with non-factorized qs Extending the model over time Final remarks Learning procedure: algorithm 1

while isLearning() do v ← getNewDataSample() bottomUp() backpropagateThroughGenerativeModel() µ εl ← ∇hl F(v, h, z) δ εΣ ← 1  i,j ∇2 F(v, h, z) l 2 hl,i ,hl,j

Danilo Jimenez Rezende BMI-EPFL The Helmholtz machine revisited applyGradients()

Introduction Variational Approximation Relevant special cases Learning with non-factorized qs Extending the model over time Final remarks Learning procedure: algorithm 1

while isLearning() do v ← getNewDataSample() bottomUp() backpropagateThroughGenerativeModel() µ εl ← ∇hl F(v, h, z) δ εΣ ← 1  i,j ∇2 F(v, h, z) l 2 hl,i ,hl,j backpropagateThroughRecognitionModel()

Learning procedure: algorithm 1

while isLearning() do
    v ← getNewDataSample()
    bottomUp()
    backpropagateThroughGenerativeModel()
    ε_l^μ ← ∇_{h_l} F(v, h, z)
    ε_l^Σ ← (1/2)^{δ_{i,j}} ∇²_{h_{l,i}, h_{l,j}} F(v, h, z)
    backpropagateThroughRecognitionModel()
    applyGradients()
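A runnable sketch of one pass of this loop for the special case of a single Gaussian latent layer with unit observation noise, where the drift term gives ε^μ and the diffusion term gives ε^Σ in closed form; the model sizes, the exp-parameterized diagonal variances, and the learning rate are illustrative choices, not the talk's multi-layer implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    D, H, lr = 20, 5, 0.01
    Wg, bg = rng.normal(0, 0.1, (D, H)), np.zeros(D)   # p(v|h) = N(Wg h + bg, I), p(h) = N(0, I)
    Wr, br = rng.normal(0, 0.1, (H, D)), np.zeros(H)   # mu(v) = Wr v + br
    s = np.zeros(H)                                    # ln of q's diagonal variances

    for _ in range(2000):
        v = rng.normal(0.0, 1.0, D)                    # stand-in data sample

        # bottomUp(): sample h ~ q(h|v) = N(mu, diag(exp(s)))
        mu = Wr @ v + br
        h = mu + np.exp(0.5 * s) * rng.standard_normal(H)

        # backpropagateThroughGenerativeModel(), with V(h) = -ln p(v, h):
        r = v - Wg @ h - bg                            # residual of p(v|h)
        eps_mu = h - Wg.T @ r                          # ∇_h V (one-sample drift term)
        eps_var = 0.5 * (1.0 + np.einsum('ij,ij->j', Wg, Wg))  # (1/2) ∇²_{h_i,h_i} V

        # generative update: δθ^g ∝ <∇_{θ^g} ln p>_q
        Wg += lr * np.outer(r, h); bg += lr * r

        # backpropagateThroughRecognitionModel() + applyGradients():
        Wr -= lr * np.outer(eps_mu, v); br -= lr * eps_mu
        s -= lr * (eps_var * np.exp(s) - 0.5)          # chain rule through var = exp(s),
                                                       # minus the entropy gradient of q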

Performance comparison on MNIST

Training set: 60000 28x28 grey images; test set: 10000 images.

Model           Parameters                   Log-likelihood
HM              Nh = 200 + 50, Wake-Sleep    -157
HM              Nh = 200 + 50, REINFORCE     -145
Mix. Bernoulli  N = 500                      -137.64
RBM             Nh = 500, CD1                -125.53
HM              Nh = 200 + 50, BP-SBP        -108.35
RBM             Nh = 500, CD3                -105.53
NADE            N = 500                      -88.86
RBM             Nh = 500, CD25               -86.86

RBM, Mix. Bernoulli and NADE log-likelihoods from [Larochelle and Murray, 2011].

Making the model conditional

[Figure: the generative and recognition networks as before, with the layer parameters (W_l, D_l and the recognition weights W_l^r) driven by a conditioning input z.]

Conditioning on a RNN of LSTMs (differentiable memory cells)

[Figure: a differentiable, multiplicative memory cell. The internal state s is maintained by a recurrent connection of fixed weight 1.0, modulated by the forget gate φ; the input gate ι and output gate ω scale the cell input a_c and cell output b_c via multiplicative units.]

Idea suggested (but not implemented) by [Sutskever and Hinton, 2008].
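For reference, a minimal numpy sketch of one step of such a memory cell; the sigmoid gates and tanh cell nonlinearities are a standard choice, and all shapes and initializations here are assumptions for illustration.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_cell_step(x, b_prev, s_prev, W, U, bias):
        """One LSTM step. Keys: 'i' (input gate ι), 'f' (forget gate φ),
        'o' (output gate ω), 'c' (cell input a_c)."""
        i = sigmoid(W['i'] @ x + U['i'] @ b_prev + bias['i'])    # input gate ι
        f = sigmoid(W['f'] @ x + U['f'] @ b_prev + bias['f'])    # forget gate φ
        o = sigmoid(W['o'] @ x + U['o'] @ b_prev + bias['o'])    # output gate ω
        a_c = np.tanh(W['c'] @ x + U['c'] @ b_prev + bias['c'])  # cell input
        s = f * s_prev + i * a_c     # state kept by the fixed-weight recurrence
        b = o * np.tanh(s)           # gated cell output b_c
        return b, s

    rng = np.random.default_rng(0)
    nx, nh = 10, 20
    W = {k: rng.normal(0, 0.1, (nh, nx)) for k in 'ifoc'}
    U = {k: rng.normal(0, 0.1, (nh, nh)) for k in 'ifoc'}
    bias = {k: np.zeros(nh) for k in 'ifoc'}
    b, s = np.zeros(nh), np.zeros(nh)
    for x in rng.normal(size=(5, nx)):    # roll the cell over a short sequence
        b, s = lstm_cell_step(x, b, s, W, U, bias)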

Full model

[Figure: at each time step t, an LSTM layer (LSTM_1 ... LSTM_N) carries a state z_t that conditions both the generative and recognition networks (layers h_1 ... h_L with means μ_l and covariances Σ_l) over the observation v_t, with step t−1 feeding step t.]

Learning procedure: algorithm 2

while isLearning() do
    resetGradients()
    for t = 1 → T do
        forwardLSTM(t)
        v_t ← getDataFrame(t)
        bottomUp()
        backpropagateThroughGenerativeModel()
        backpropagateThroughRecognitionModel()
        ε_t^LSTM ← ∇_{z_t} F(v_t, h_t, z_t)
        incrementGradientsOfGenerativeAndRecognitionModels()
    for t = T → 1 do
        LSTM.BPTT(t)
    applyGradients()

Benchmarks: temporal sequences


Final remarks

Performance is similar to RBM + CD3.
Not yet competitive with RBM + CD25.
Exploiting specific properties of the noise and the graph is substantially better than Wake-Sleep and REINFORCE.
How much would we benefit from less constrained covariance matrices?


THANK YOU!

Bibliography

Dayan, P. (2000). Helmholtz machines and wake-sleep learning. In Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz Machine. Neural Networks, 9(8):1385–1403.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural computation, 7(5):889–904.

Frey, B. J. (1996). Variational inference for continuous sigmoidal Bayesian networks. In Sixth International Workshop on Artificial Intelligence and Statistics.

Ikeda, S., Nakahara, H., and Amari, S.-i. (1999). Convergence of the wake-sleep algorithm. Advances in Neural Information Processing Systems 11, pages 239–245.

Jordan, M. I. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183–233.

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. Journal of Machine Learning Research W&CP, 15:29–37.

Sutskever, I. and Hinton, G. (2008). The recurrent temporal restricted Boltzmann machine. Advances in Neural Information Processing Systems, 21.


Estimating the data log-likelihood

Naive:

    \ln p(X_0) \approx \ln \frac{1}{K} \sum_k p(X_0 \mid X_1^k),

where X_1^k, k = 1 ... K, are samples from p.

Importance sampling:

    \ln p(X_0) \approx \ln \frac{1}{K} \sum_k \exp(-\hat F_k),

where \hat F_k is a point-estimate of F using a sample from q.
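A numerically stable sketch of the importance-sampling estimate: since the \hat F_k can be large, \exp(-\hat F_k) is summed in log space; the toy \hat F_k values are placeholders for the output of a point-estimate routine like the f_hat sketch earlier.

    import numpy as np

    def log_lik_is(f_hats):
        """ln p(X_0) ≈ ln (1/K) Σ_k exp(−F̂_k), computed via log-sum-exp."""
        a = -np.asarray(f_hats)
        m = a.max()
        return m + np.log(np.exp(a - m).sum()) - np.log(a.size)

    print(log_lik_is([110.0, 108.5, 112.3]))   # toy F̂_k values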
