Probabilistic Models for Unsupervised Learning

Zoubin Ghahramani Sam Roweis

Gatsby Computational Neuroscience Unit, University College London http://www.gatsby.ucl.ac.uk/

NIPS Tutorial, December 1999

Learning

Imagine a machine or organism that experiences over its lifetime a series of sensory inputs: x_1, x_2, x_3, x_4, ...

Supervised learning: The machine is also given desired outputs y_1, y_2, ..., and its goal is to learn to produce the correct output given a new input.

Unsupervised learning: The goal of the machine is to build representations from x that can be used for reasoning, decision making, predicting things, communicating, etc.

Reinforcement learning: The machine can also produce actions a_1, a_2, ... which affect the state of the world, and receives rewards (or punishments) r_1, r_2, .... Its goal is to learn to act in a way that maximises rewards in the long term.

Goals of Unsupervised Learning

To find useful representations of the data, for example:

• finding clusters, e.g. k-means, ART

• dimensionality reduction, e.g. PCA, Hebbian learning, multidimensional scaling (MDS)

• building topographic maps, e.g. elastic networks, Kohonen maps

• finding the hidden causes or sources of the data

• modeling the data density

We can quantify what we mean by “useful” later.

Uses of Unsupervised Learning

• data compression

• outlier detection

• classification

• make other learning tasks easier

• a theory of human learning and perception

Probabilistic Models

• A probabilistic model of sensory inputs can:
– make optimal decisions under a given loss function
– make inferences about missing inputs
– generate predictions/fantasies/imagery
– communicate the data in an efficient way

• Probabilistic modeling is equivalent to other views of learning:
– information theoretic: finding compact representations of the data
– physical analogies: minimising the free energy of a corresponding statistical mechanical system

Bayes rule

D — data set; m — models (or parameters)

The probability of a model given data set D is:

P(m|D) = P(D|m) P(m) / P(D)

P(D|m) is the evidence (or likelihood)

P(m) is the prior probability of m

P(m|D) is the posterior probability of m

P(D) = Σ_m P(D|m) P(m)
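As a hypothetical numerical illustration of the rule above (the three candidate coin biases, the uniform prior, and the 7-heads-in-10-tosses data are all made up), Bayes rule can be applied over a small discrete set of models:

```python
import numpy as np

# Three candidate models of a coin, a uniform prior, and an observed
# data set of 7 heads in 10 tosses (all values invented for illustration).
biases = np.array([0.3, 0.5, 0.8])        # P(heads) under each model m
prior = np.array([1 / 3, 1 / 3, 1 / 3])   # P(m)
heads, tosses = 7, 10

# evidence P(D|m) for each model (binomial likelihood; the constant
# binomial coefficient cancels in the normalisation)
likelihood = biases**heads * (1 - biases) ** (tosses - heads)

posterior = likelihood * prior / np.sum(likelihood * prior)  # P(m|D)
```

The denominator here is exactly P(D) = Σ_m P(D|m) P(m) from above.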

Under very weak and reasonable assumptions, Bayes rule is the only rational and consistent way to manipulate uncertainties/beliefs (Pólya, Cox axioms, etc.).

Bayes, MAP and ML

Bayesian Learning: Assumes a prior over the model parameters. Computes the posterior distribution of the parameters: P(θ|D).

Maximum a Posteriori (MAP) Learning: Assumes a prior over the model parameters, P(θ). Finds a parameter setting that maximises the posterior: P(θ|D) ∝ P(θ) P(D|θ).

Maximum Likelihood (ML) Learning: Does not assume a prior over the model parameters. Finds a parameter setting that maximises the likelihood of the data: P(D|θ).

Modeling Correlations

[Figure: scatter plot of data in (Y1, Y2) with a fitted Gaussian.]

• Consider a set of variables y_1, ..., y_D.

• A very simple model: means μ_i = ⟨y_i⟩ and correlations C_ij = ⟨y_i y_j⟩ − ⟨y_i⟩⟨y_j⟩

• This corresponds to fitting a Gaussian to the data:

p(y) = |2π C|^{−1/2} exp{ −(1/2) (y − μ)ᵀ C^{−1} (y − μ) }

• There are D(D+3)/2 parameters in this model (D means plus D(D+1)/2 covariances)

• What if D is large?

Factor Analysis

[Figure: factors X_1, ..., X_K with loading matrix Λ generating observations Y_1, Y_2, ..., Y_D.]

Linear generative model:

y_d = Σ_{k=1}^{K} Λ_dk x_k + w_d

• x_k are independent N(0, 1) Gaussian factors

• w_d are independent N(0, Ψ_dd) Gaussian noise

• So, y is Gaussian with:

p(y) = ∫ p(x) p(y|x) dx = N(0, Λ Λᵀ + Ψ)

where Λ is a D × K matrix, and Ψ is diagonal.

Dimensionality Reduction: Finds a low-dimensional projection of high-dimensional data that captures most of the correlation structure of the data.

Factor Analysis: Notes

[Figure: factors X_1, ..., X_K with loading matrix Λ generating observations Y_1, Y_2, ..., Y_D.]

• ML learning finds Λ and Ψ given data

• number of parameters (with correction from symmetries): DK + D − K(K−1)/2

• no closed form solution for ML params

• a Bayesian treatment would integrate over all Λ and Ψ and would find a posterior on the number of factors; however, it is intractable.

Network Interpretations

[Figure: autoencoder-style network — input units Y_1, ..., Y_D feed an encoder (“recognition”) into hidden units X_1, ..., X_K, which feed a decoder (“generation”) producing output units Ŷ_1, ..., Ŷ_D.]

• neural network

• if trained to minimise MSE, then we get PCA

• if MSE + output noise, we get PPCA

• if MSE + output noises + reg. penalty, we get FA

Graphical Models

A directed acyclic graph (DAG) in which each node corresponds to a random variable.

[Figure: example DAG over x_1, ..., x_5 with joint P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) P(x_4|x_2) P(x_5|x_3, x_4).]

Definitions: children, parents, descendents, ancestors

Key quantity: joint probability distribution over nodes:

P(x) = P(x_1, x_2, ..., x_N)

(1) The graph specifies a factorization of this joint pdf:

P(x) = Π_i P(x_i | pa(x_i))

(2) Each node stores a conditional distribution over its own value given the values of its parents.

(1) & (2) completely specify the joint pdf numerically.

Semantics: Given its parents, each node is conditionally independent from its non-descendents.

(Also known as Bayesian Networks, Belief Networks, Probabilistic Independence Networks.)

Two Unknown Quantities

In general, two quantities in the graph may be unknown:

• parameter values in the distributions P(x_i | pa(x_i))

• hidden (unobserved) variables not present in the data

Assume you knew one of these:

• Known hidden variables, unknown parameters → this is complete data learning (decoupled problems)

• Known parameters, unknown hidden variables → this is called inference (often the crux)

But what if both were unknown simultaneously...

Learning with Hidden Variables: The EM Algorithm

[Figure: a belief network with parameters θ_1, ..., θ_4, hidden variables X_1, X_2, X_3 and observed variable Y.]

Assume a model parameterised by θ with observable variables Y and hidden variables X.

Goal: maximise the log likelihood of the observables:

L(θ) = log P(Y|θ) = log Σ_X P(Y, X|θ)

E-step: first infer Q(X) = P(X|Y, θ_old), then

M-step: find θ_new using complete data learning

The E-step requires solving the inference problem: finding explanations, X, for the data, Y, given the current model θ.

EM algorithm & F-function

[Figure: the same belief network with parameters θ_1, ..., θ_4, hidden variables X_1, X_2, X_3 and observed Y.]

Any distribution Q(X) over the hidden variables defines a lower bound on log P(Y|θ), called F(Q, θ):

log P(Y|θ) = log Σ_X P(X, Y|θ) = log Σ_X Q(X) [ P(X, Y|θ) / Q(X) ]
           ≥ Σ_X Q(X) log [ P(X, Y|θ) / Q(X) ] = F(Q, θ)

E-step: Maximise F w.r.t. Q with θ fixed:

Q*(X) = P(X|Y, θ)

M-step: Maximise F w.r.t. θ with Q fixed:

θ* = argmax_θ Σ_X Q*(X) log P(X, Y|θ)

NB: a maximum of F is a maximum of log P(Y|θ)

Two Intuitions about EM

I. EM decouples the parameters

The E-step “fills in” values for the hidden variables. With no hidden variables, the likelihood is a simpler function of the parameters. The M-step for the parameters at each node can be computed independently, and depends only on the values of the variables at that node and its parents.

II. EM is coordinate ascent in F

EM for Factor Analysis

[Figure: factors X_1, ..., X_K with loading matrix Λ generating observations Y_1, Y_2, ..., Y_D.]

F(Q, θ) = ∫ Q(x) [ log p(x) + log p(y|x, θ) − log Q(x) ] dx

E-step: Maximise F w.r.t. Q with θ = (Λ, Ψ) fixed:

Q*(x) = p(x|y, θ) = N(β y, I − β Λ),   where β = Λᵀ (Λ Λᵀ + Ψ)^{−1}

M-step: Maximise F w.r.t. θ with Q fixed:

Λ^new = ( Σ_n y_n ⟨x⟩_nᵀ ) ( Σ_n ⟨x xᵀ⟩_n )^{−1}

• The E-step reduces to computing the Gaussian posterior distribution over the hidden variables.

• The M-step reduces to solving a weighted linear regression problem.
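The two steps can be sketched numerically. This is a minimal illustration rather than the authors' code: the synthetic data, the number of factors K, and the iteration count are made up, and the log-likelihood is tracked only to exhibit the monotonic increase that EM guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 500, 5, 2
# synthetic centred data: 2 latent factors mixed into 5 dimensions plus noise
Y = rng.standard_normal((N, K)) @ rng.standard_normal((K, D)) \
    + 0.1 * rng.standard_normal((N, D))
Lam, Psi = rng.standard_normal((D, K)), np.ones(D)

def loglik(Lam, Psi):
    C = Lam @ Lam.T + np.diag(Psi)                    # model covariance
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * N * (D * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, Y.T @ Y / N)))

ll = []
for _ in range(50):
    ll.append(loglik(Lam, Psi))
    # E-step: Gaussian posterior over factors, beta = Lam' (Lam Lam' + Psi)^-1
    beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + np.diag(Psi))
    Ex = Y @ beta.T                                   # posterior means <x>
    Exx = N * (np.eye(K) - beta @ Lam) + Ex.T @ Ex    # sum_n <x x'>
    # M-step: weighted linear regression
    Lam = (Y.T @ Ex) @ np.linalg.inv(Exx)
    Psi = np.diag(Y.T @ Y - Lam @ (Ex.T @ Y)) / N
```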

Inference in Graphical Models

[Figure: a singly connected network and a multiply connected network over W, X, Y, Z.]

Singly connected nets: the belief propagation algorithm.
Multiply connected nets: the junction tree algorithm.

These are efficient ways of applying Bayes rule using the conditional independence relationships implied by the graph.

How Factor Analysis is Related to Other Models

• Principal Components Analysis (PCA): Assume no noise on the observations: Ψ = lim_{ε→0} ε I

• Independent Components Analysis (ICA): Assume the factors are non-Gaussian (and no noise).

• Mixture of Gaussians: A single discrete-valued factor: x_k = 1 and x_j = 0 for all j ≠ k.

• Mixture of Factor Analysers: Assume the data has several clusters, each of which is modeled by a single factor analyser.

• Linear Dynamical Systems: Time series model in which the factor at time t depends linearly on the factor at time t−1, with Gaussian noise.

A Generative Model for Generative Models

[Figure: a family tree of models linked by operations — Factor Analysis (PCA) connected to Mixture of Gaussians (VQ), Mixture of Factor Analyzers, ICA, Linear Dynamical Systems (SSMs), HMMs, Factorial HMMs, Switching State-space Models, Mixtures of HMMs and of LDSs, Boltzmann Machines, Sigmoid Belief Networks (SBN), Cooperative Vector Quantization, Nonlinear Gaussian Belief Nets, and Nonlinear Dynamical Systems.

Legend: mix = mixture; red-dim = reduced dimension; dyn = dynamics; distrib = distributed representation; hier = hierarchical; nonlin = nonlinear; switch = switching.]

Mixture of Gaussians and K-Means

Goal: finding clusters in data.

To generate data from this model, assuming k clusters:

• Pick cluster s ∈ {1, ..., k} with probability π_s

• Generate data according to a Gaussian with mean μ_s and covariance Σ_s

p(y) = Σ_s p(s) p(y|s) = Σ_s π_s N(μ_s, Σ_s)

E-step: Compute responsibilities for each data vector y_n:

r_ns = p(s|y_n) = π_s N(y_n; μ_s, Σ_s) / Σ_{s′} π_{s′} N(y_n; μ_{s′}, Σ_{s′})

M-step: Estimate π_s, μ_s and Σ_s using the data weighted by the responsibilities.

The k-means algorithm for clustering is a special case of EM for mixture of Gaussians where Σ_s = lim_{σ²→0} σ² I.
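A minimal sketch of these EM updates, assuming spherical covariances Σ_s = σ_s² I for brevity; the two-cluster synthetic data and the initial means are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic data: two well-separated 2-D clusters of 100 points each
Y = np.concatenate([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
k, (N, D) = 2, Y.shape
pi, mu, var = np.full(k, 1 / k), np.array([[-1.0, 0.0], [1.0, 0.0]]), np.ones(k)

for _ in range(30):
    # E-step: responsibilities r[n, s] = p(s | y_n)
    d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    r = pi * (2 * np.pi * var) ** (-D / 2) * np.exp(-d2 / (2 * var))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters using data weighted by responsibilities
    Ns = r.sum(axis=0)
    pi = Ns / N
    mu = (r.T @ Y) / Ns[:, None]
    d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    var = (r * d2).sum(axis=0) / (D * Ns)
```

Shrinking all σ_s² towards zero turns the soft responsibilities into hard nearest-mean assignments, recovering k-means.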

Mixture of Factor Analysers

Assumes the model has several clusters (indexed by a discrete hidden variable s).

Each cluster is modeled by a factor analyser:

p(y) = Σ_k p(s = k) p(y | s = k)

where

p(y | s = k) = N(μ_k, Λ_k Λ_kᵀ + Ψ)

• it’s a way of fitting a mixture of Gaussians to high-dimensional data

• clustering and dimensionality reduction

• Bayesian learning can infer a posterior over the number of clusters and their intrinsic dimensionalities.

Independent Components Analysis

[Figure: factors X_1, ..., X_K with mixing matrix Λ generating observations Y_1, Y_2, ..., Y_D.]

• p(x_k) is non-Gaussian.

• Equivalently, x_k is Gaussian and

y_d = Σ_k Λ_dk g(x_k) + w_d

where g(·) is a nonlinearity.

• For K = D, and observation noise assumed to be zero, inference and learning are easy (standard ICA).

Many extensions possible (e.g. with noise → IFA).

Hidden Markov Models/Linear Dynamical Systems

[Figure: state-space graphical model — hidden states X_1, ..., X_T emitting outputs Y_1, ..., Y_T.]

Hidden states {x_t}, outputs {y_t}. The joint probability factorises:

P(X, Y) = Π_{t=1}^{T} P(x_t | x_{t−1}) P(y_t | x_t)

• you can think of this as:
– a Markov chain with stochastic measurements (a Gauss–Markov process in a pancake), or
– a mixture model with states coupled across time (factor analysis through time).

HMM Generative Model

[Figure: HMM — hidden states S_1, ..., S_4 emitting outputs Y_1, ..., Y_4.]

plain-vanilla HMM — a “probabilistic function of a Markov chain”:

1. Use a 1st-order Markov chain to generate a hidden state sequence (path):

S_1, S_2, ..., S_T   with   P(S_1 = j) = π_j   and   P(S_{t+1} = j | S_t = i) = T_ij

2. Use a set of output prob. distributions A_j(·) (one per state) to convert this state path into a sequence of observable symbols or vectors:

P(y_t | S_t = j) = A_j(y_t)

Notes:

– Even though the hidden state sequence is 1st-order Markov, the output process is not Markov of any order [ex. 1111121111311121111131...]

– Discrete state, discrete output models can approximate any continuous dynamics and observation mapping even if nonlinear; however they lose the ability to interpolate
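Sampling from this generative process can be sketched as follows; the two-state initial distribution, transition matrix, and emission matrix are made-up examples.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([1.0, 0.0])            # P(S_1 = j): always start in state 0
T = np.array([[0.9, 0.1],            # T[i, j] = P(S_{t+1} = j | S_t = i)
              [0.2, 0.8]])
A = np.array([[0.8, 0.2],            # A[j, k] = P(y_t = k | S_t = j)
              [0.1, 0.9]])

def sample_hmm(n_steps):
    states, outputs = [], []
    s = rng.choice(2, p=pi)          # draw the initial state
    for _ in range(n_steps):
        states.append(s)
        outputs.append(rng.choice(2, p=A[s]))  # emit a symbol from state s
        s = rng.choice(2, p=T[s])              # step the Markov chain
    return states, outputs

states, outputs = sample_hmm(100)
```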

LDS Generative Model

[Figure: LDS — continuous hidden states X_1, ..., X_4 emitting outputs Y_1, ..., Y_4.]

• Gauss–Markov continuous state process:

x_{t+1} = A x_t + w_t

observed through the “lens” of a noisy linear embedding:

y_t = C x_t + v_t

• Noises w_t and v_t are temporally white and uncorrelated with everything else

• Think of this as “matrix flow in a pancake”

(Also called state-space models, Kalman filter models.)
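The generative process and the corresponding forward (Kalman filter) recursions can be sketched in one dimension; A, C, the noise variances Q and R, and the horizon are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
A, C, Q, R, T = 0.95, 1.0, 0.1, 0.5, 200   # dynamics, emission, noise variances

# simulate x_{t+1} = A x_t + w_t,  y_t = C x_t + v_t
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = A * x[t - 1] + rng.normal(0, np.sqrt(Q))
    y[t] = C * x[t] + rng.normal(0, np.sqrt(R))

# Kalman filter: P(x_t | y_1..y_t) is Gaussian with mean xf and variance P
xf, P = 0.0, 1.0
xs = np.zeros(T)
for t in range(T):
    xp, Pp = A * xf, A * P * A + Q           # predict one step ahead
    K = Pp * C / (C * Pp * C + R)            # Kalman gain
    xf = xp + K * (y[t] - C * xp)            # correct with the observation
    P = (1 - K * C) * Pp
    xs[t] = xf
```

The filtered estimates combine the dynamics prior with each measurement, so they track the hidden state more closely than the raw observations do.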

EM applied to HMMs and LDSs

[Figure: state-space graphical model — hidden states X_1, ..., X_T emitting outputs Y_1, ..., Y_T.]

Given a sequence of T observations {y_1, ..., y_T}:

E-step. Compute the posterior probabilities:

HMM: Forward–backward algorithm: P(s_t | {y})
LDS: Kalman smoothing recursions: P(x_t | {y})

M-step. Re-estimate parameters:

HMM: Count expected frequencies.
LDS: Weighted linear regression.
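The HMM half of the E-step rests on the forward recursion; a minimal sketch with scaling is below (the model parameters and observation sequence are made up).

```python
import numpy as np

pi = np.array([0.5, 0.5])              # initial state distribution
Tr = np.array([[0.9, 0.1],             # Tr[i, j] = P(s_{t+1} = j | s_t = i)
               [0.2, 0.8]])
A = np.array([[0.8, 0.2],              # A[j, k] = P(y_t = k | s_t = j)
              [0.1, 0.9]])
y = [0, 0, 1, 1, 1]                    # an observed symbol sequence

alpha = pi * A[:, y[0]]                # alpha_1(j) = pi_j A_j(y_1)
rho = alpha.sum()
alpha /= rho                           # rescale to avoid underflow
loglik = np.log(rho)
for t in range(1, len(y)):
    alpha = (Tr.T @ alpha) * A[:, y[t]]  # propagate, then weight by likelihood
    rho = alpha.sum()
    alpha /= rho
    loglik += np.log(rho)              # log P(y_1..y_T) = sum_t log rho_t
```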

Notes:

1. The forward–backward and Kalman smoothing recursions are special cases of belief propagation.

2. Online (causal) inference P(x_t | y_1, ..., y_t) is done by the forward algorithm or the Kalman filter.

3. What sets the (arbitrary) scale of the hidden state? The scale of Q (usually fixed at I).

Tree-structured p each node has exactly one parent. o Discrete nodes or linear-Gaussian. o Hybrid systems are possible: mixed discrete & continuous nodes. But, to remain tractable, discrete nodes must have discrete parents. o Exact & efficient inference is done by belief propagation (generalised Kalman Smoothing). o Can capture multiscale structure (e.g. images) Polytrees/Layered Networks

• more complex models for which the junction-tree algorithm would be needed to do exact inference

• discrete/linear-Gaussian nodes are possible

• the case of binary units is widely studied: Sigmoid Belief Networks

• but usually intractable

Intractability

For many probabilistic models of interest, exact inference is not computationally feasible. This occurs for two (main) reasons:

• distributions may have complicated forms (non-linearities in the generative model)

• “explaining away” causes coupling from observations: observing the value of a child induces dependencies amongst its parents (high order interactions)

[Figure: five parents X1, ..., X5 with a common observed child Y = X1 + 2 X2 + X3 + X4 + 3 X5 — observing Y couples all the parents.]

We can still work with such models by using approximate inference techniques to estimate the latent variables.

Approximate Inference

• Sampling: approximate the true distribution over the hidden variables with a few well chosen samples at certain values

• Linearization: approximate the transformation on the hidden variables by one which keeps the form of the distribution closed (e.g. Gaussians and linear)

• Recognition Models: approximate the true distribution with an approximation that can be computed easily/quickly by an explicit bottom-up inference model/network

• Variational Methods: approximate the true distribution with an approximate form that is tractable; maximise a lower bound on the likelihood with respect to free parameters in this form

Sampling

Gibbs Sampling

To sample from a joint distribution P(x_1, x_2, ..., x_N):

Start from some initial state (x_1⁰, x_2⁰, ..., x_N⁰), then iterate the following procedure:

• Pick x_1^{t+1} from P(x_1 | x_2^t, x_3^t, ..., x_N^t)

• Pick x_2^{t+1} from P(x_2 | x_1^{t+1}, x_3^t, ..., x_N^t)

  ⋮

• Pick x_N^{t+1} from P(x_N | x_1^{t+1}, x_2^{t+1}, ..., x_{N−1}^{t+1})

This procedure goes from x^t to x^{t+1}, creating a Markov chain which converges to P(x_1, ..., x_N).

Gibbs sampling can be used to estimate the expectations under the posterior distribution needed for the E-step of EM.

It is just one of many Markov chain Monte Carlo (MCMC) methods. Easy to use if you can easily update subsets of latent variables at a time.

Key questions: how many iterations per sample? how many samples?
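A minimal sketch for a case where the full conditionals are known exactly — a bivariate Gaussian with correlation ρ; the correlation, chain length, and initial state are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_samples = 0.8, 20000
x1 = x2 = 0.0                                  # arbitrary initial state
samples = np.empty((n_samples, 2))
for t in range(n_samples):
    # each full conditional of a bivariate Gaussian is itself Gaussian
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # sample P(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # sample P(x2 | x1)
    samples[t] = x1, x2

emp_rho = np.corrcoef(samples.T)[0, 1]         # approaches rho as the chain mixes
```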

Particle Filters

Assume you have M weighted samples {x_t^(m), w_t^(m)} from P(x_t | y_1, ..., y_t), with normalised weights w_t^(m).

1. Generate a new sample set by sampling M times with replacement from {x_t^(m)}, with probabilities proportional to the weights w_t^(m).

2. For each element of the new set, predict, using the stochastic dynamics, by sampling x_{t+1}^(m) from P(x_{t+1} | x_t^(m)).

3. Using the measurement model, weight each new sample by

w_{t+1}^(m) = P(y_{t+1} | x_{t+1}^(m))

(the likelihoods) and normalise so that Σ_m w_{t+1}^(m) = 1.

Samples need to be weighted by the ratio of the distribution we draw them from to the true posterior (this is importance sampling). An easy way to do that is to draw from the prior and weight by the likelihood.

(Also known as the CONDENSATION algorithm.)

Linearization

Extended Kalman Filtering and Smoothing

[Figure: state-space model with inputs U_1, ..., U_4, hidden states X_1, ..., X_4, and outputs Y_1, ..., Y_4.]

x_{t+1} = f(x_t) + w_t
y_t = g(x_t) + v_t

Linearise about the current estimate, i.e. given x̂_t:

x_{t+1} ≈ f(x̂_t) + (∂f/∂x)|_{x̂_t} (x_t − x̂_t) + w_t

y_t ≈ g(x̂_t) + (∂g/∂x)|_{x̂_t} (x_t − x̂_t) + v_t

Run the Kalman smoother (belief propagation for linear-Gaussian systems) on the linearised system. This approximates the non-Gaussian posterior by a Gaussian.

Recognition Models

• a function approximator is trained in a supervised way to recover the hidden causes (latent variables) from the observations

• this may take the form of an explicit recognition network (e.g. the Helmholtz machine) which mirrors the generative network (tractability at the cost of a restricted approximating distribution)

• inference is done in a single bottom-up pass (no iteration required)

Variational Inference

Goal: maximise log P(Y|θ).

Any distribution Q(X) over the hidden variables defines a lower bound on log P(Y|θ):

log P(Y|θ) ≥ Σ_X Q(X) log [ P(X, Y|θ) / Q(X) ] = F(Q, θ)

Constrain Q(X) to be of a particular tractable form (e.g. factorised) and maximise F subject to this constraint.

E-step: Maximise F w.r.t. Q with θ fixed, subject to the constraint on Q; equivalently, minimise:

log P(Y|θ) − F(Q, θ) = Σ_X Q(X) log [ Q(X) / P(X|Y, θ) ] = KL(Q ‖ P)

The inference step therefore tries to find the Q closest to the exact posterior distribution.

M-step: Maximise F w.r.t. θ with Q fixed

(related to mean-field approximations)

Beyond Maximum Likelihood: Finding Model Structure and Avoiding Overfitting

[Figure: polynomial fits of order M = 1, ..., 6 to the same data set — higher-order models track the training data increasingly closely.]

Model Selection Questions

How many clusters in this data set?

What is the intrinsic dimensionality of the data?

What is the order of my autoregressive process?

How many sources in my ICA model?

How many states in my HMM?

Is this input relevant to predicting that output?

Is this relationship linear or nonlinear?

Bayesian Learning and Ockham’s Razor

data D, models m_1, ..., m_M, parameter sets θ_1, ..., θ_M

(let’s ignore hidden variables for the moment; they will just introduce another level of averaging/integration)

Total Evidence:

P(D) = Σ_m P(D|m) P(m)

Model Selection:

P(m|D) ∝ P(D|m) P(m)

P(D|m) = ∫ P(D|θ_m, m) P(θ_m|m) dθ_m

P(D|m) is the probability that randomly selected parameter values from model class m would generate data set D.

• Model classes that are too simple will be very unlikely to generate that particular data set.

• Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

Ockham’s Razor

[Figure: P(Data | m) plotted against all possible data sets Y — a “too simple” model concentrates its mass on few data sets, a “too complex” model spreads its mass thinly, and a “just right” model gives the observed data set the highest evidence.]

(adapted from D.J.C. MacKay)

Overfitting

[Figure: the polynomial fits of order M = 1, ..., 6 again, together with the posterior P(M) over model order, which peaks at an intermediate value.]

Practical Bayesian Approaches

• Laplace approximations

• Large sample approximations (e.g. BIC)

• Markov chain Monte Carlo methods

• Variational approximations

Laplace Approximation

data set D, models m_1, ..., m_M, parameter sets θ_1, ..., θ_M

Model Selection:

P(m|D) ∝ P(m) P(D|m)

For large amounts of data (relative to the number of parameters, d) the parameter posterior is approximately Gaussian around the MAP estimate θ̂:

P(θ|D, m) ≈ (2π)^{−d/2} |A|^{1/2} exp{ −(1/2) (θ − θ̂)ᵀ A (θ − θ̂) }

P(D|m) = P(θ, D|m) / P(θ|D, m)

Evaluating the above expression for ln P(D|m) at θ̂:

ln P(D|m) ≈ ln P(θ̂|m) + ln P(D|θ̂, m) + (d/2) ln 2π − (1/2) ln |A|

where A is the negative Hessian of the log posterior.

This can be used for model selection.

(Note: A is a d × d matrix.)

BIC

The Bayesian Information Criterion (BIC) can be obtained from the Laplace approximation

ln P(D|m) ≈ ln P(θ̂|m) + ln P(D|θ̂, m) + (d/2) ln 2π − (1/2) ln |A|

by taking the large sample limit:

ln P(D|m) ≈ ln P(D|θ̂, m) − (d/2) ln N

where N is the number of data points.

Properties:

• Quick and easy to compute

• It does not depend on the prior

• We can use the ML estimate of θ instead of the MAP estimate

• It assumes that in the large sample limit, all the parameters are well-determined (i.e. the model is identifiable; otherwise, d should be the number of well-determined parameters)

• It is equivalent to the MDL criterion
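A minimal sketch of BIC-based model selection — choosing the order of a polynomial regression under a made-up Gaussian-noise data set whose true generating order is 2:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = np.linspace(-1, 1, N)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.2, N)    # quadratic + noise

def bic_score(order):
    coeffs = np.polyfit(x, y, order)
    resid = y - np.polyval(coeffs, x)
    sigma2 = resid.var()                                  # ML noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)  # max log-likelihood
    d = order + 2                        # polynomial coefficients + variance
    return loglik - 0.5 * d * np.log(N)  # ln P(D|m) up to a constant

scores = [bic_score(m) for m in range(7)]
best = int(np.argmax(scores))            # typically selects order 2
```

Orders below 2 pay a large likelihood cost; orders above 2 improve the fit only slightly, less than the (d/2) ln N penalty for the extra parameters.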

MCMC

Assume a model with parameters θ, hidden variables X, and observable variables Y.

Goal: to obtain samples from the (intractable) posterior distribution over the parameters, P(θ|Y).

Approach: to sample from a Markov chain whose equilibrium distribution is P(θ|Y).

One such simple Markov chain can be obtained by Gibbs sampling, which alternates between:

• Step A: Sample parameters given hidden variables and observables: θ ~ P(θ | X, Y)

• Step B: Sample hidden variables given parameters and observables: X ~ P(X | θ, Y)

Note the similarity to the EM algorithm!

Variational Bayesian Learning

Lower bound the evidence:

ln P(D) = ln ∫ P(D|θ) P(θ) dθ
        ≥ ∫ Q(θ) ln [ P(D|θ) P(θ) / Q(θ) ] dθ
        ≥ ∫ Q(θ) [ ∫ Q(X) ln [ P(X, D|θ) / Q(X) ] dX + ln [ P(θ) / Q(θ) ] ] dθ
        = F(Q(θ), Q(X))

Assumes the factorisation:

P(θ, X | D) ≈ Q(θ) Q(X)

(also known as “ensemble learning”)

Variational Bayesian Learning

EM-like optimisation:

“E-step”: Maximise F w.r.t. Q(X) with Q(θ) fixed
“M-step”: Maximise F w.r.t. Q(θ) with Q(X) fixed

Finds an approximation to the posterior over parameters, Q(θ) ≈ P(θ|D), and over hidden variables, Q(X) ≈ P(X|D)

• Maximises a lower bound on the log evidence

• Convergence can be assessed by monitoring F

• Global approximation

• F transparently incorporates a model complexity penalty (i.e. the coding cost for all the parameters of the model) so it can be compared across models

• The optimal form of Q(θ) falls out of a free-form variational optimisation (i.e. it is not assumed to be Gaussian)

• Often a simple modification of the EM algorithm

Summary

• Why probabilistic models?

• Factor analysis and beyond

• Inference and the EM algorithm

• A Generative Model for Generative Models

• A few models in detail

• Approximate inference

• Practical Bayesian approaches

Appendix

Desiderata (or Axioms) for Computing Plausibilities

Paraphrased from E.T. Jaynes, using the notation B(x|y): the plausibility of statement x given that you know that statement y is true.

• Degrees of plausibility are represented by real numbers

• Qualitative correspondence with common sense, e.g.
– If B(a|c′) > B(a|c) but B(b|a, c′) = B(b|a, c), then B(a, b|c′) ≥ B(a, b|c)

• Consistency:
– If a conclusion can be reasoned in more than one way, then every possible way must lead to the same result.
– All available evidence should be taken into account when inferring a plausibility.
– Equivalent states of knowledge should be represented with equivalent plausibility statements.

Accepting these desiderata leads to Bayes Rule being the only way to manipulate plausibilities.

Learning with Complete Data

[Figure: a belief network with parameters θ_1, ..., θ_4 and observed variables Y_1, ..., Y_4.]

Assume a data set of i.i.d. observations D = {y^(1), ..., y^(N)} and a parameter vector θ.

Goal is to maximise the likelihood:

P(D|θ) = Π_{n=1}^{N} P(y^(n) | θ)

Equivalently, maximise the log likelihood:

L(θ) = Σ_{n=1}^{N} log P(y^(n) | θ)

Using the graphical model factorisation:

P(y^(n) | θ) = Π_i P(y_i^(n) | pa(y_i^(n)), θ_i)

So:

L(θ) = Σ_{n=1}^{N} Σ_i log P(y_i^(n) | pa(y_i^(n)), θ_i)

In other words, the parameter estimation problem breaks into many independent, local problems (uncoupled).

Building a Junction Tree

• Start with the recursive factorization from the DAG:

P(x) = Π_i P(x_i | pa(x_i))

• Convert these local conditional probabilities into potential functions over both x_i and all its parents.

• This is called moralising the DAG, since the parents get connected. Now the product of the potential functions gives the correct joint.

• When evidence is absorbed, potential functions must agree on the prob. of shared variables: consistency.

• This can be achieved by passing messages between potential functions to do local marginalising and rescaling.

• Problem: a variable may appear in two non-neighbouring cliques. To avoid this we need to triangulate the original graph to give the potential functions the running intersection property. Now local consistency will imply global consistency.

Bayesian Networks: Belief Propagation

[Figure: a node n with parents p_1, p_2, p_3 and children c_1, c_2, c_3; the evidence above n is e⁺(n), the evidence below is e⁻(n).]

Each node n divides the evidence, e, in the graph into two disjoint sets: e⁺(n) and e⁻(n).

Assume a node n with parents {p_1, ..., p_P} and children {c_1, ..., c_C}. Belief propagation combines a prior term passed down from the parents with likelihood terms passed up from the children:

P(n|e) ∝ [ Σ_{p_1, ..., p_P} P(n | p_1, ..., p_P) Π_i P(p_i | e⁺(p_i)) ] · [ Π_j P(e⁻(c_j) | n) ]

ICA Nonlinearity

Generative model:

[Figure: the nonlinearity g(w).]

y = Λ g(x) + w

where x and w are zero-mean Gaussian noises with covariances I and Ψ respectively.

For an invertible, monotonic g, the density of z = g(x) with x ~ N(0, 1) follows from the usual change of variables:

p_z(z) = p_x(g^{−1}(z)) / |g′(g^{−1}(z))|

Choosing g appropriately, each component of g(x) can therefore be given any desired non-Gaussian (e.g. heavy-tailed) density exactly.

So, ICA can be seen either as a linear generative model with non-Gaussian priors for the hidden variables, or as a nonlinear generative model with Gaussian priors for the hidden variables.

HMM Example

• Character sequences (discrete outputs)

[Figure: sample character sequences generated from an HMM with discrete outputs.]

• Geyser data (continuous outputs)

[Figure: state output functions fitted to the geyser data, plotted in (y1, y2).]

LDS Example

• Population model:

state → population histogram
first row of A → birthrates
subdiagonal of A → 1 − deathrates
Q → immigration/emigration
C → noisy indicators

[Figure: noisy measurements, inferred population, and true population over time.]

Viterbi Decoding

• The numbers γ_j(t) in forward–backward gave the posterior probability distribution over all states at any time.

• By choosing the state with the largest γ_j(t) at each time, we can make a “best” state path. This is the path with the maximum expected number of correct states.

• But it is not the single path with the highest likelihood of generating the data. In fact it may be a path of probability zero!

• To find the single best path, we do Viterbi decoding, which is just Bellman’s dynamic programming algorithm applied to this problem.

• The recursions look the same, except with max instead of Σ.

• There is also a modified Baum–Welch training based on the Viterbi decode.

HMM Pseudocode

• Forward–backward including scaling tricks

B_j(t) = A_j(y_t)                                   (output probabilities)
α(1) = π .* B(1);   ρ(1) = Σ_j α_j(1);   α(1) = α(1)/ρ(1)
for t = 2 : T
    α(t) = (Tᵀ α(t−1)) .* B(t);   ρ(t) = Σ_j α_j(t);   α(t) = α(t)/ρ(t)
β(T) = 1
for t = (T−1) : 1
    β(t) = T (β(t+1) .* B(t+1)) / ρ(t+1)
γ(t) = α(t) .* β(t)
log P(y_1:T) = Σ_t log ρ(t)

• Baum–Welch parameter updates

Initialise accumulators π̂ = 0, T̂ = 0, Â = 0. For each sequence, run forward–backward to get α, β and γ, then:

π̂ = π̂ + γ(1)
T̂_ij = T̂_ij + Σ_t α_i(t) T_ij B_j(t+1) β_j(t+1) / ρ(t+1)
Â_j(y) = Â_j(y) + Σ_{t : y_t = y} γ_j(t)

Finally normalise: π̂ to sum to 1, each row of T̂ to sum to 1, and each Â_j(·) to sum to 1.

LDS Pseudocode

• Kalman filter/smoother including scaling tricks

Forward (filter), for t = 1 : T:

x̂_t⁻ = A x̂_{t−1}          P_t⁻ = A P_{t−1} Aᵀ + Q          (predict)
K_t = P_t⁻ Cᵀ (C P_t⁻ Cᵀ + R)^{−1}                          (gain)
x̂_t = x̂_t⁻ + K_t (y_t − C x̂_t⁻)     P_t = (I − K_t C) P_t⁻  (correct)
log P(y_1:T) = Σ_t log N(y_t; C x̂_t⁻, C P_t⁻ Cᵀ + R)

Backward (smoother), for t = (T−1) : 1:

J_t = P_t Aᵀ (P_{t+1}⁻)^{−1}
x̂_tˢ = x̂_t + J_t (x̂_{t+1}ˢ − x̂_{t+1}⁻)
P_tˢ = P_t + J_t (P_{t+1}ˢ − P_{t+1}⁻) J_tᵀ

• EM parameter updates

For each sequence, run the Kalman smoother to get the posterior moments ⟨x_t⟩, ⟨x_t x_tᵀ⟩ and ⟨x_t x_{t−1}ᵀ⟩, accumulate the sufficient statistics, then solve the two linear regressions:

A = ( Σ_t ⟨x_t x_{t−1}ᵀ⟩ ) ( Σ_t ⟨x_{t−1} x_{t−1}ᵀ⟩ )^{−1}
C = ( Σ_t y_t ⟨x_tᵀ⟩ ) ( Σ_t ⟨x_t x_tᵀ⟩ )^{−1}

with Q and R set to the corresponding residual covariances.

Selected References

Graphical Models and the EM algorithm:

• Learning in Graphical Models (1998) Edited by M.I. Jordan. Dordrecht: Kluwer Academic Press. Also available from MIT Press (paperback).

• Motivation for Bayes Rule: “Probability Theory — the Logic of Science,” E.T. Jaynes, http://www.math.albany.edu:8008/JaynesBook.html

• EM:

Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41:164–171;

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1–38;

Neal, R. M. and Hinton, G. E. (1998). A new view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models.

• Factor Analysis and PCA:

Mardia, K.V., Kent, J.T., and Bibby J.M. (1979) Multivariate Analysis Academic Press, London

Roweis, S. T. (1998). EM algorithms for PCA and SPCA. NIPS98

Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1

[http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-1.ps.gz] Department of Computer Science, University of Toronto.

Tipping, M. and Bishop, C. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):435–474.

• Belief propagation:

Kim, J.H. and Pearl, J. (1983) A computational model for causal and diagnostic reasoning in inference systems. In Proc of the Eighth International Joint Conference on AI: 190-193;

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

• Junction tree: Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, pages 157–224.

Other graphical models:

• Roweis, S.T and Ghahramani, Z. (1999) A unifying review of linear Gaussian models. Neural Computation 11(2): 305–345.

• ICA:

Comon, P. (1994). Independent component analysis: A new concept. Signal Processing, 36:287–314;

Baram, Y. and Roth, Z. (1994) Density shaping by neural networks with application to classification, estimation and forecasting. Technical Report TR-CIS-94-20, Center for Intelligent Systems, Technion, Israel Institute for Technology, Haifa, Israel.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159.

• Trees:

Chou, K.C., Willsky, A.S., Benveniste, A. (1994) Multiscale recursive estimation, data fusion, and regularization. IEEE Trans. Automat. Control 39:464-478;

Bouman, C. and Shapiro, M. (1994). A multiscale random field model for Bayesian segmentation. IEEE Transactions on Image Processing 3(2):162–177.

• Sigmoid Belief Networks:

Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71–113.

Saul, L.K. and Jordan, M.I. (1999) Attractor dynamics in feedforward neural networks. To appear in Neural Computation.

Approximate Inference & Learning

• MCMC: Neal, R. M. (1993). Probabilistic inference using Markov chain monte carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.

• Particle Filters:

Gordon, N.J. , Salmond, D.J. and Smith, A.F.M. (1993) A novel approach to nonlinear/non-Gaussian Bayesian state estimation IEE Proceedings on Radar, Sonar and Navigation 140(2):107-113;

Kitagawa, G. (1996) Monte Carlo filter and smoother for non-Gaussian nonlinear state space models, Journal of Computational and Graphical Statistics 5(1):1–25.

Isard, M. and Blake, A. (1998) CONDENSATION – conditional density propagation for visual tracking Int. J. Computer Vision 29 1:5–28.

• Extended Kalman Filtering: Goodwin, G. and Sin, K. (1984). Adaptive filtering prediction and control. Prentice-Hall.

• Recognition Models:

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158–1161.

Dayan, P., Hinton, G. E., Neal, R., and Zemel, R. S. (1995) The Helmholtz Machine . Neural Computation, 7, 1022-1037.

• Ockham’s Razor and Laplace Approximation:

Jefferys, W.H., Berger, J.O. (1992) Ockham’s Razor and Bayesian Analysis. American Scientist 80:64-72; MacKay, D.J.C. (1995) Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks. Network: Computation in Neural Systems. 6 469-505

ΠBIC: Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6:461-464.

ΠMDL: Wallace, C.S. and Freeman, P.R. (1987) Estimation and inference by compact coding. J. of the Royal Stat. Society series B 49(3): 240-265; J. Rissanen (1987)

ΠBayesian Gibbs sampling software: The BUGS Project Рhttp://www.iph.cam.ac.uk/bugs/

ΠVariational Bayesian Learning:

Hinton, G. E. and van Camp, D. (1993) Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. In Sixth ACM Conference on Computational Learning Theory, Santa Cruz.

Waterhouse, S., MacKay, D., and Robinson, T. (1995) Bayesian methods for Mixtures of Experts. NIPS95. (See also several unpublished papers by MacKay).

Bishop, C.M. (1999) Variational PCA. In Proc. Ninth Int. Conf. on Artificial Neural Networks ICANN99 1:509 - 514.

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence;

Ghahramani, Z. and Beal, M.J. (1999) Variational Bayesian Inference for Mixture of Factor Analysers. NIPS99;

http://www.gatsby.ucl.ac.uk/