
Unsupervised Learning

Lecture 6: Hierarchical and Nonlinear Models

Zoubin Ghahramani [email protected]

Carl Edward Rasmussen [email protected]

Gatsby Computational Neuroscience Unit
PhD Course at IMM, DTU, Fall 2001

Why we need nonlinearities

Linear systems have limited modelling capability.

[Figure: graphical model of a state-space model, with hidden states X_1, X_2, X_3, ..., X_T and observations Y_1, Y_2, Y_3, ..., Y_T]

Consider linear-Gaussian state-space models: only certain dynamics can be modelled.

Why we need hierarchical models

Many generative processes can be naturally described at different levels of detail.

[Figure: a hierarchy of representations, from high-level causes (e.g. objects, illumination, pose), through intermediate descriptions (e.g. object parts, surfaces) and low-level features (e.g. edges), down to the observed data (retinal image, i.e. pixels)]

Biology seems to have developed hierarchical representations.

Why we need distributed representations

[Figure: graphical model of a hidden Markov model, with hidden states S_1, S_2, S_3, ..., S_T and observations Y_1, Y_2, Y_3, ..., Y_T]

Consider a hidden Markov model (HMM). To capture N bits of information about the history of the sequence, an HMM requires K = 2^N states!

Factorial Hidden Markov Models and Dynamic Bayesian Networks

[Figure: a factorial HMM with three hidden chains S_t^(1), S_t^(2), S_t^(3) and observations Y_t, unrolled over time; and a dynamic Bayesian network with variables A_t, B_t, C_t, D_t, unrolled over time]

These are hidden Markov models with many state variables (i.e. a distributed representation of the state).
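To make the "K = 2^N states" contrast concrete, here is a back-of-the-envelope count (my own illustration, not from the slides) comparing a flat HMM that encodes N bits of history in a single state variable with a factorial HMM that uses N binary chains, i.e. a distributed state representation:

```python
# Rough state/parameter counts for a flat HMM vs. a factorial HMM with
# N binary chains. The factorial model's joint state space is still 2^N
# configurations, but its transition structure factorizes.

def flat_hmm(n_bits):
    K = 2 ** n_bits                    # single state variable with K = 2^N values
    return {"states": K, "transition_params": K * K}

def factorial_hmm(n_bits):
    # N binary chains, each with its own 2x2 transition matrix
    return {"state_variables": n_bits, "transition_params": n_bits * 4}

for n in (5, 10, 20):
    print(n, flat_hmm(n), factorial_hmm(n))
```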

Independent Components Analysis

[Figure: ICA graphical model with hidden sources X_1, ..., X_K, mixing matrix Λ, and observations Y_1, Y_2, ..., Y_D]

• P(x_k) is non-Gaussian.

• Equivalently, P(x_k) is Gaussian and there is a nonlinearity g(·):

y_d = Σ_{k=1}^K Λ_dk g(x_k) + ε_d

• For K = D, and observation noise assumed to be zero, inference and learning are easy (standard ICA). Many extensions are possible (e.g. with noise ⇒ IFA).
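As a quick illustration of the noiseless K = D case, the sketch below (my own example, not from the slides) mixes two heavy-tailed sources with a square matrix and recovers them with FastICA from scikit-learn; the matrix Lam and the Laplacian sources are arbitrary choices for the demo:

```python
# Noiseless K = D ICA: mix non-Gaussian sources, then unmix with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, K = 5000, 2
x = rng.laplace(size=(T, K))          # non-Gaussian (heavy-tailed) sources
Lam = np.array([[1.0, 0.5],
                [0.3, 1.0]])          # square mixing matrix (K = D)
y = x @ Lam.T                         # observations y_d = sum_k Lam_dk x_k

ica = FastICA(n_components=K, random_state=0)
x_hat = ica.fit_transform(y)          # estimated sources (up to permutation/scale)
print(np.round(np.corrcoef(x.T, x_hat.T), 2))
```

Each estimated source should correlate strongly (near ±1) with exactly one true source, reflecting the usual permutation and sign/scale ambiguity of ICA.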

ICA Nonlinearity

Generative model: x = g(w), y = Λx + v, where w and v are zero-mean Gaussian noises with covariances I and R respectively. The density of x can be written in terms of g(·):

p_x(x) = N(0, 1)|_{g⁻¹(x)} / |g′(g⁻¹(x))|

For example, if p_x(x) = 1/(π cosh(x)) we find that setting:

g(w) = ln tan( (π/4) (1 + erf(w/√2)) )

generates vectors x in which each component is distributed according to 1/(π cosh(x)).
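A quick numerical check of this claim (my own sketch, not part of the slides): push standard normal samples through g and compare a histogram of the results with the target density 1/(π cosh(x)):

```python
# Check that x = g(w), with w ~ N(0, 1), is distributed as 1/(pi*cosh(x)).
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
w = rng.standard_normal(200_000)
x = np.log(np.tan(np.pi / 4 * (1.0 + erf(w / np.sqrt(2.0)))))

grid = np.linspace(-4, 4, 17)                       # histogram bin edges
hist, edges = np.histogram(x, bins=grid, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
target = 1.0 / (np.pi * np.cosh(centres))
print(np.round(hist, 3))                            # empirical density per bin
print(np.round(target, 3))                          # target density at bin centres
```

The two printed rows should roughly agree: (1 + erf(w/√2))/2 is the Gaussian CDF of w, and ln tan(πu/2) is the inverse CDF of the 1/(π cosh) density.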

So, ICA can be seen either as a linear generative model with non-Gaussian priors for the hidden variables, or as a nonlinear generative model with Gaussian priors for the hidden variables.

ICA: The magic of unmixing

How ICA Relates to Factor Analysis and Other Models

• Factor Analysis (FA): Assumes the factors are Gaussian.
• Principal Components Analysis (PCA): Assumes no noise on the observations: Ψ = lim_{σ²→0} σ²I.
• Independent Components Analysis (ICA): Assumes the factors are non-Gaussian (and no noise).

• Mixture of Gaussians: A single discrete-valued "factor": x_k = 1 and x_j = 0 for all j ≠ k.
• Mixture of Factor Analysers: Assumes the data has several clusters, each of which is modelled by a single factor analyser.

• Linear Dynamical Systems: Time-series model in which the factor at time t depends linearly on the factor at time t−1, with Gaussian noise.

Boltzmann Machines

Undirected graphical model (i.e. a Markov network) over a vector of binary variables s_i ∈ {0, 1}:

P(s | W, b) = (1/Z) exp{ −Σ_ij W_ij s_i s_j − Σ_i b_i s_i }

Learning algorithm: a version of EM, using Gibbs sampling to approximate the E step.
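The E step above relies on Gibbs sampling; the sketch below (a minimal illustration under the energy convention above, with arbitrary random W and b) shows a single-site Gibbs sweep that resamples each unit from its exact conditional:

```python
# Gibbs sampling in a small Boltzmann machine:
# E(s) = sum_ij W_ij s_i s_j + sum_i b_i s_i,  P(s) ∝ exp(-E(s)),  s_i ∈ {0, 1}.
import numpy as np

def energy(s, W, b):
    return s @ W @ s + b @ s

def gibbs_sweep(s, W, b, rng):
    # Resample each unit from P(s_i | s_{-i}), computed from the energy
    # difference between s_i = 1 and s_i = 0 with the other units fixed.
    for i in range(len(s)):
        s1, s0 = s.copy(), s.copy()
        s1[i], s0[i] = 1.0, 0.0
        delta = energy(s1, W, b) - energy(s0, W, b)
        p_on = 1.0 / (1.0 + np.exp(delta))
        s[i] = float(rng.random() < p_on)
    return s

rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.5, size=(n, n))
W = 0.5 * (W + W.T)                     # symmetric weights
np.fill_diagonal(W, 0.0)                # no self-connections
b = rng.normal(scale=0.5, size=n)

s = rng.integers(0, 2, size=n).astype(float)
for sweep in range(1000):
    s = gibbs_sweep(s, W, b, rng)
print(s)                                # one (approximate) sample from P(s | W, b)
```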

Sigmoid Belief Networks

Directed graphical model (i.e. a belief network) over a vector of binary variables s_i ∈ {0, 1}:

P(s | W, b) = Π_i P(s_i | {s_j}_{j<i}, W, b)
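In contrast to the Boltzmann machine, sampling from this directed model needs no iteration. The sketch below (my own illustration, taking the parents of unit i to be all units j < i, with random W and b, and the same sign convention as the Boltzmann-machine energy above) draws an exact sample in a single top-down pass:

```python
# Ancestral sampling in a sigmoid belief network:
# P(s_i = 1 | s_j, j < i) = 1 / (1 + exp( sum_{j<i} W_ij s_j + b_i )).
import numpy as np

def sample_sbn(W, b, rng):
    n = len(b)
    s = np.zeros(n)
    for i in range(n):
        a = W[i, :i] @ s[:i] + b[i]          # input from parents s_j, j < i
        p_on = 1.0 / (1.0 + np.exp(a))
        s[i] = float(rng.random() < p_on)
    return s

rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.5, size=(n, n))
b = rng.normal(scale=0.5, size=n)
print(sample_sbn(W, b, rng))                 # one exact sample from P(s | W, b)
```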

For many probabilistic models of interest, exact inference is not computationally feasible. This occurs for two (main) reasons:

• distributions may have complicated forms (non-linearities in the generative model)
• "explaining away" causes coupling from observations: observing the value of a child induces dependencies amongst its parents (high-order interactions)

[Figure: a network in which parents X_1, ..., X_5 all feed into a single child Y]

Y = X_1 + 2 X_2 + X_3 + X_4 + 3 X_5

We can still work with such models by using approximate inference techniques to estimate the latent variables.
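To see the explaining-away effect in this example, the snippet below (my own illustration, assuming binary parents with independent Bernoulli(1/2) priors and a noise-free observation, neither of which is specified on the slide) enumerates the parent configurations consistent with an observed Y = 3:

```python
# Explaining away: observing the child Y couples the a-priori independent parents.
from itertools import product

weights = [1, 2, 1, 1, 3]               # Y = X1 + 2*X2 + X3 + X4 + 3*X5
consistent = [x for x in product([0, 1], repeat=5)
              if sum(w * xi for w, xi in zip(weights, x)) == 3]

print(len(consistent), "configurations are consistent with Y = 3")
for x in consistent:
    print(x)
# Given Y = 3, learning that X5 = 1 forces all the other parents to 0,
# even though the parents were independent before Y was observed.
```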

Approximate Inference

• Sampling: Approximate the posterior distribution over hidden variables by a set of random samples. We often need Markov chain Monte Carlo methods to sample from difficult distributions (a minimal sampler sketch follows after this list).

• Linearization: Approximate nonlinearities by Taylor series expansion about a point (e.g. the approximate mean of the hidden variable distribution). Linear approximations are particularly useful since Gaussian distributions are closed under linear transformations.

• Recognition Models: Approximate the hidden variable posterior distribution using an explicit bottom-up recognition model/network.

• Variational Methods: Approximate the hidden variable posterior p(H) with a tractable form q(H). This gives a lower bound on the likelihood that can be maximised with respect to the parameters of q(H).
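As a concrete instance of the sampling option above, here is a generic random-walk Metropolis sketch (not a method specific to these slides; the bimodal target log_p_star is an arbitrary stand-in for an unnormalized posterior over a hidden variable h):

```python
# Random-walk Metropolis: draw approximate samples from p*(h) when only the
# unnormalized (log) density can be evaluated.
import numpy as np

def log_p_star(h):
    # Example unnormalized log-density: a 1-D mixture of two Gaussian bumps.
    return np.logaddexp(-0.5 * (h - 2.0) ** 2, -0.5 * (h + 2.0) ** 2)

rng = np.random.default_rng(0)
h, samples = 0.0, []
for t in range(20_000):
    proposal = h + rng.normal(scale=1.0)                  # symmetric proposal
    if np.log(rng.random()) < log_p_star(proposal) - log_p_star(h):
        h = proposal                                      # accept
    samples.append(h)                                     # else keep current h
print(np.mean(samples), np.std(samples))
```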

Recognition Models

• a model is trained in a supervised way to recover the hidden causes (latent variables) from the observations

• this may take the form of an explicit recognition network (e.g. the Helmholtz machine) which mirrors the generative network (tractability at the cost of a restricted approximating distribution)

• inference is done in a single bottom-up pass (no iteration required)

Appendix

Practical Bayesian Approaches

• Laplace approximations
• Large sample approximations (e.g. BIC)
• Markov chain Monte Carlo methods
• Variational approximations

Laplace Approximation

Data set Y, models M_1, ..., M_n, parameter sets θ_1, ..., θ_n:

P(M_i | Y) ∝ P(M_i) P(Y | M_i)

For large amounts of data (relative to the number of parameters, d) the parameter posterior is approximately Gaussian around the MAP estimate θ̂_i:

P(θ_i | Y, M_i) ≈ (2π)^{−d/2} |A|^{1/2} exp{ −(1/2) (θ_i − θ̂_i)ᵀ A (θ_i − θ̂_i) }

P(Y | M_i) = P(θ_i, Y | M_i) / P(θ_i | Y, M_i)

Evaluating the above expression for ln P(Y | M_i) at θ̂_i:

ln P(Y | M_i) ≈ ln P(θ̂_i | M_i) + ln P(Y | θ̂_i, M_i) + (d/2) ln 2π − (1/2) ln |A|

where A is the d × d negative Hessian matrix of the log posterior. This can be used for model selection.
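To see the formula in action, the sketch below (my own worked example, not from the slides) applies it to a conjugate model, a Gaussian mean θ with known noise variance and a Gaussian prior, where the posterior is exactly Gaussian and the Laplace approximation therefore reproduces the exact evidence; sigma, tau and the simulated data are arbitrary choices:

```python
# Laplace approximation to ln P(Y|M) for a 1-D Gaussian-mean model,
# compared against the analytically available exact log evidence.
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
N, sigma, tau = 20, 1.0, 2.0                         # noise std, prior std
y = rng.normal(loc=0.7, scale=sigma, size=N)

# Posterior over theta is Gaussian with precision A and mode (MAP) th.
A = N / sigma**2 + 1.0 / tau**2
th = (y.sum() / sigma**2) / A

laplace = (norm.logpdf(th, 0.0, tau)                 # ln P(th_hat | M)
           + norm.logpdf(y, th, sigma).sum()         # ln P(Y | th_hat, M)
           + 0.5 * np.log(2 * np.pi)                 # (d/2) ln 2pi, with d = 1
           - 0.5 * np.log(A))                        # -(1/2) ln |A|

exact = multivariate_normal.logpdf(
    y, mean=np.zeros(N), cov=tau**2 * np.ones((N, N)) + sigma**2 * np.eye(N))
print(laplace, exact)                                # agree for this conjugate model
```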

BIC

The Bayesian Information Criterion (BIC) can be obtained from the Laplace approximation

ln P(Y | M_i) ≈ ln P(θ̂_i | M_i) + ln P(Y | θ̂_i, M_i) + (d/2) ln 2π − (1/2) ln |A|

by taking the large sample limit:

ln P(Y | M_i) ≈ ln P(Y | θ̂_i, M_i) − (d/2) ln N

where N is the number of data points.

Properties:

• Quick and easy to compute
• It does not depend on the prior
• We can use the ML estimate of θ instead of the MAP estimate
• It assumes that in the large sample limit, all the parameters are well-determined (i.e. the model is identifiable; otherwise, d should be the number of well-determined parameters)
• It is equivalent to the MDL criterion
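A small illustration of BIC-style model selection (my own example, not from the slides): fit polynomials of increasing order by maximum likelihood to synthetic data generated from a quadratic, and score each model with ln P(Y | θ̂, M) − (d/2) ln N:

```python
# BIC model selection over polynomial order, with Gaussian observation noise.
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = np.linspace(-2, 2, N)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=N)   # true order 2

for order in range(6):
    coeffs = np.polyfit(x, y, order)
    resid = y - np.polyval(coeffs, x)
    sigma2 = resid.var()                      # ML estimate of the noise variance
    d = order + 2                             # polynomial coefficients + noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1.0)
    bic_score = loglik - 0.5 * d * np.log(N)
    print(order, round(bic_score, 1))
# The score typically peaks at the true order (2 here): higher orders improve
# the fit only slightly but pay the (d/2) ln N penalty.
```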