The Helmholtz Machine Revisited, EPFL 2012
The Helmholtz machine revisited
Danilo Jimenez Rezende, BMI-EPFL
November 8, 2012

Outline
1 Introduction
2 Variational Approximation
3 Relevant special cases
4 Learning with non-factorized $q$'s
5 Extending the model over time
6 Final remarks

Introduction: Helmholtz machines

Helmholtz machines [Dayan et al., 1995; Dayan and Hinton, 1996; Dayan, 2000] are directed graphical models with a layered structure $X_0, X_1, \dots, X_L$. The complete-data likelihood is

\[ p(X \mid \theta^g) = \prod_{l=0}^{L-1} p(X_l \mid X_{l+1}; \theta^g) \, p(X_L). \]

Only $X_0$ is observed.

Types of Helmholtz machines

Binary units:
\[ p(X_l \mid X_{l+1}) = \mathrm{Bern} \circ \mathrm{sigmoid}\!\left(W^g_l X_{l+1} + B^g_l\right), \]
with parameters $\theta^g = \{W^g, B^g\}$.

Smooth units:
\[ p(X_l \mid X_{l+1}) = \mathcal{N}\!\left(X_l;\, \tanh\!\left(W^g_l X_{l+1} + B^g_l\right),\, \Sigma_l\right), \]
with parameters $\theta^g = \{W^g, B^g, \Sigma^g\}$.
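Read as a sampling procedure, the complete-data likelihood is an ancestral top-down pass: draw $X_L$ from its prior, then each $X_l$ from $p(X_l \mid X_{l+1})$. Below is a minimal NumPy sketch of the binary-unit case; the factorized Bernoulli prior on $X_L$ and the toy layer sizes are illustrative assumptions, not fixed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_top_down(W, B, p_top=0.5):
    """Ancestral sample from p(X) = prod_{l=0}^{L-1} p(X_l | X_{l+1}) p(X_L).

    W[l], B[l] parametrize p(X_l | X_{l+1}); binary-unit case:
    p(X_l | X_{l+1}) = Bern(sigmoid(W_l X_{l+1} + B_l)).
    The factorized Bernoulli(p_top) prior on X_L is an assumption.
    """
    L = len(W)
    x = (rng.random(W[L - 1].shape[1]) < p_top).astype(float)  # X_L
    layers = [x]
    for l in reversed(range(L)):            # X_{L-1}, ..., X_0
        mean = sigmoid(W[l] @ x + B[l])
        x = (rng.random(mean.shape) < mean).astype(float)
        layers.append(x)
    return layers[::-1]                     # [X_0, ..., X_L]

# Toy two-layer hierarchy: X_2 (4 units) -> X_1 (8 units) -> X_0 (16 units).
sizes = [16, 8, 4]
W = [rng.normal(scale=0.1, size=(sizes[l], sizes[l + 1])) for l in range(2)]
B = [np.zeros(sizes[l]) for l in range(2)]
x0, x1, x2 = sample_top_down(W, B)
```

In the smooth-unit case, the Bernoulli draw would be replaced by a $\tanh$ mean plus Gaussian noise with covariance $\Sigma_l$.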
Goal

For a data set of i.i.d. samples $y \in \mathrm{Data}$, maximize the data log-likelihood with respect to $\theta^g$,

\[ \ln p(\mathrm{Data} \mid \theta^g) = \sum_{y \in \mathrm{Data}} \ln p(y \mid \theta^g), \]

where

\[ p(y \mid \theta^g) = \int \prod_{l>0} dX_l \; p(y \mid X_1) \prod_{l=1}^{L-1} p(X_l \mid X_{l+1}; \theta^g) \, p(X_L). \]

The variational trick

Introduce a parametric family of distributions $q(X_{l>0} \mid X_0; \theta^r)$. Then

\[ \ln p(X_0 \mid \theta^g) = -\underbrace{\big\langle \ln q(X_{l>0} \mid X_0; \theta^r) - \ln p(X) \big\rangle_{q(X_{l>0} \mid X_0, \theta^r)}}_{F(X_0, \theta^g, \theta^r)} + \mathrm{KL}(q, p), \]

where

\[ \mathrm{KL}(q, p) = \left\langle \ln \frac{q(X_{l>0} \mid X_0; \theta^r)}{p(X_{l>0} \mid X_0; \theta^g)} \right\rangle_{q(X_{l>0} \mid X_0, \theta^r)}. \]

Since $\mathrm{KL}(q, p) \ge 0$,

\[ \ln p(X_0; \theta^g) \ge -F(X_0; \theta^g, \theta^r) \quad \forall\, \theta^r. \]

Redefine the learning problem as

\[ \{\hat\theta^g, \hat\theta^r\} = \arg\min_{\theta^g, \theta^r} \sum_{x \in \mathrm{Data}} F(x; \theta^g, \theta^r). \]

Why?
- $q$ can be any distribution with the same support as $p$.
- Choose $q$ so that the required expectations can be computed in reasonable time.
- Inference can then be solved with standard optimization techniques.
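Everything above hinges on evaluating $F(X_0, \theta^g, \theta^r)$, or at least an unbiased estimate of it. Since $F$ is an expectation under $q$, sampling from $q$ yields a Monte Carlo estimator. A small sketch, where `sample_q`, `log_q`, and `log_p_joint` are hypothetical callables standing in for a concrete model:

```python
import numpy as np

def free_energy_mc(x0, sample_q, log_q, log_p_joint, n_samples=100):
    """Monte Carlo estimate of
       F(X_0, theta_g, theta_r) = < ln q(X_{l>0}|X_0) - ln p(X) >_q.

    sample_q(x0)       -- draws hidden states X_{l>0} ~ q(. | X_0)
    log_q(h, x0)       -- evaluates ln q(h | X_0; theta_r)
    log_p_joint(x0, h) -- evaluates ln p(X_0, h; theta_g)
    All three interfaces are assumptions, not part of the slides.
    """
    vals = []
    for _ in range(n_samples):
        h = sample_q(x0)
        vals.append(log_q(h, x0) - log_p_joint(x0, h))
    return float(np.mean(vals))

# Because KL(q, p) >= 0, minus this estimate lower-bounds
# ln p(X_0 | theta_g) in expectation, for any theta_r.
```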
Fully factorized q

\[ q(X_{l>0} \mid X_0; \theta^r) = \prod_{l>0,\, i} p(X_{l,i} \mid \theta^r_{l,i}). \]

- Expectations are analytically tractable for sigmoid/tanh nonlinearities [Frey, 1996; Jordan, 1999]; see the sketch below.
- Yields local, message-passing-type algorithms.
- Fast convergence.
- Poor approximation to multimodal posteriors.
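To see where the tractability comes from, note that when $q$ factorizes over units, the $\langle \ln q \rangle_q$ part of $F$ decomposes into a sum of per-unit terms. A sketch for Bernoulli hidden units, where the arrays of means `m` stand in for the variational parameters $\theta^r_{l,i}$:

```python
import numpy as np

def expected_log_q_factorized(means):
    """< ln q >_q for q = prod_{l>0, i} Bern(X_{l,i}; m_{l,i}).

    `means` is a list of arrays, one per hidden layer, entries in (0, 1).
    Factorization reduces the expectation to a per-unit sum -- the
    tractability that the fully factorized choice of q buys.
    """
    total = 0.0
    for m in means:
        m = np.clip(m, 1e-7, 1 - 1e-7)  # guard against log(0)
        total += np.sum(m * np.log(m) + (1.0 - m) * np.log(1.0 - m))
    return total
```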
The Helmholtz machine's choice of q

Bottom-up graph [Dayan et al., 1995; Dayan and Hinton, 1996; Dayan, 2000]:

\[ q(X_{l>0} \mid X_0; \theta^r) = \prod_{l=1}^{L} p(X_l \mid X_{l-1}; \theta^r_l). \]

- Conditioned on $X_0$.
- No expectation can be solved analytically.
- Resort to Monte Carlo approximations.
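The bottom-up $q$ mirrors the generative pass in the opposite direction: conditioned on the data $X_0$, one forward sweep yields a joint sample of all hidden layers, and expectations under $q$ become Monte Carlo averages over such sweeps. A sketch for binary units, with recognition parameters `W_r`, `B_r` (names assumed):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_bottom_up(x0, W_r, B_r):
    """Sample from q(X_{l>0} | X_0) = prod_{l=1}^{L} p(X_l | X_{l-1}; theta_r_l).

    W_r[l-1], B_r[l-1] parametrize the l-th recognition conditional;
    binary-unit case, mirroring the top-down generative pass.
    """
    x = x0
    hidden = []
    for W, B in zip(W_r, B_r):
        mean = sigmoid(W @ x + B)
        x = (rng.random(mean.shape) < mean).astype(float)
        hidden.append(x)
    return hidden  # [X_1, ..., X_L]

# Feeding such samples into the Monte Carlo estimator of F above gives
# the objective that is minimized over both theta_g and theta_r.
```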