Introduction to Probabilistic Modeling, Inference, and Learning


Overview

• What’s Probabilistic Graphical Model for ?

• Tasks in Graphical Model: – Modeling – Inference – Learning

• Examples – Topic Model – Hidden Markov Model – Markov Random Field

What’s PGM for ?

• Use “Probability” to “Model” dependencies among target of interest as a “Graph”. A unified framework for :

– Prediction (Classification / Regression)

– Discovery (Clustering / Pattern Recognition / System Modeling)

– State Tracking (Localization/ Monitoring/ MotionTracking)

– Ranking ( Search Engine / Recommendation for Text / Image / Item )

Prediction (Lectures Before …)

Variables of interest ?

Perceptron / SVM / Linear Regression / K-Nearest Neighbor

Prediction with PGM

Predictions are dependent on one another. ( Collective Classification )

Data:

(x1,x2,….,xd, y) (x1,x2,….,xd, y) (x1,x2,….,xd, y) …

(x1,x2,….,xd, y)

Prediction with PGM

Labels are missing for most of data. ( Semi-supervised Learning )

Data:

(x1,x2,….,xd, y=0) (x1,x2,….,xd, y=1) (x1,x2,….,xd, y=?) …

(x1,x2,….,xd, y=?)

What’s PGM for ?

• Use “Probability” to “Model” dependencies among target of interest as a “Graph” . A unified framework for :

– Prediction (Classification / Regression)

– Discovery (Clustering / Pattern Recognition / System Modeling)

– State Tracking (Localization/ Monitoring/ MotionTracking)

– Ranking ( Search Engine / Recommendation for Text / Image / Item )

Discovery with PGM

We are interested in hidden variables. ( Unsupervised Learning )

Data:

(x1,x2,….,xd, y=?) (x1,x2,….,xd, y=?) … (x1,x2,….,xd, y=?)

[Figure: the data points fall into clusters Y=1, Y=2, Y=3]

Discovery with PGM

Pattern Mining / Pattern Recognition.

Discovery with PGM

Learn a whole picture of the problem domain.

Some domain knowledge in hand about the generating process.

Data:

(x1,x2,….,xd) (x1,x2,….,xd) … …

(x1,x2,….,xd)

What’s PGM for ?

• Use “Probability” to “Model” dependencies among target of interest as a “Graph” . A unified framework for :

– Prediction (Classification / Regression)

– Discovery (Clustering / Pattern Recognition / System Modeling)

– State Tracking (Localization/ Mapping/ MotionTracking)

– Ranking ( Search Engine / Recommendation for Text / Image / Item )

State Tracking with PGM

Given Observations from Sensors (ex. Camera, Laser, Sonar …….)

Localization : Locate the sensor itself.
Mapping : Locate feature points of the environment.
Tracking : Locate moving objects in the environment.

What’s PGM for ?

• Use “Probability” to “Model” dependencies among target of interest as a “Graph” . A unified framework for :

– Prediction (Classification / Regression)

– Discovery (Clustering / Pattern Recognition / System Modeling)

– State Tracking (Localization/ Mapping/ MotionTracking)

– Ranking ( Search Engine / Recommendation for Text / Image / Item )

Ranking with PGM

Who needs the scores ?  Search Engine / Recommendation. Variables of interest ?

Overview

• What’s Probabilistic Graphical Model for ?

• Tasks in Graphical Model: – Modeling – Learning – Inference

• Examples – Topic Model – Hidden Markov Model – Markov Random Field

A Simple Example

?

X ~ Mul(p1~p6)   (Modeling)

A Simple Example

Training Data ?

X ~ Mul(p1~p6)   (Learning) ⇒ (0, 0, .5, 0, .5, 0)   (Inference)

A Simple Example

Training Data ?

X ~ Mul(p1~p6)

Is this the best model, (0, 0, .5, 0, .5, 0) ? Why not (.1, .1, .3, .1, .3, .1) ?

Maximum Likelihood Estimation ( MLE ) Criteria :

Best Model = argmax_model Likelihood(Data) = P(Data|model) = P(X1) * P(X2) * …… * P(X10)

= (p3)^5 * (p5)^5 ,  subject to p1 + p2 + p3 + p4 + p5 + p6 = 1

⇒ p3 = 5/(5+5) = .5 , p5 = 5/(5+5) = .5
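A minimal sketch of this MLE computation and the test-set check that follows, using the slides’ toy rolls (the two test sequences below are assumed stand-ins for the slides’ testing data):

```python
import math
from collections import Counter

train = [3, 5, 3, 5, 3, 5, 3, 5, 3, 5]          # ten rolls: five 3's, five 5's
counts = Counter(train)

# MLE for a multinomial: normalized counts.
p_mle = {face: counts[face] / len(train) for face in range(1, 7)}
print(p_mle)                                     # p3 = p5 = 0.5, all other faces 0

def log_likelihood(data, p):
    # Summing logs avoids the underflow of multiplying many small probabilities.
    return sum(math.log(p[x]) if p[x] > 0 else float('-inf') for x in data)

print(log_likelihood([3, 5, 3, 5, 3], p_mle))    # 5 * ln(0.5) = -3.47
print(log_likelihood([3, 5, 3, 5, 1], p_mle))    # -inf: the MLE overfits the training data
```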

A Simple Example

Training Data ?

X ~ Mul(p1~p6)

Is the “MLE” model (0, 0, .5, 0, .5, 0) best ?

Compute “Likelihood” on Testing Data. Testing Data

X1 X2 X3 X4 X5

P( Data | Your Model ) = P(X1)* P(X2)* P(X3)* P(X4)* P(X5)

The “Likelihood” tends to underflow, so in practice we use the “Log Likelihood”:

ln[ P( Data | Your Model ) ] = lnP(X1) + lnP(X2) + lnP(X3) + lnP(X4) + lnP(X5)

A Simple Example

Training Data ?

X ~ Mul(p1~p6)

Is the “MLE” model (0, 0, .5, 0, .5, 0) best ?

Compute “Likelihood” on Testing Data. Testing Data

P( Data | Your Model ) = .5 * .5 * .5 * .5 * .5 = 0.0312   (1 is best)

ln[ P( Data | Your Model ) ] = ln(0.5) * 5 = -3.47   (0 is best)

A Simple Example

Training Data ?

X ~ Mul(p1~p6)

Is the “MLE” model (0, 0, .5, 0, .5, 0) best ?

Compute “Likelihood” on Testing Data. Testing Data Overfit Training Data!!

P( Data | Your Model ) = .5 * .5 * .5 * .5 * .0 = 0

ln[ P( Data | Your Model ) ] = ln(0.5) * 4 + ln(0) * 1 = -∞

Bayesian Learning

Training Data ?

X ~ Mul(p1~p6)

Prior Knowledge ?  (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

Prior = P(p1~p6) = const. * (p1)^1 * (p2)^1 * (p3)^1 * (p4)^1 * (p5)^1 * (p6)^1

Likelihood = P(Data|p1~p6) = (p3)^5 * (p5)^5

P(p1~p6|Data) = const. * Prior * Likelihood
= const. * (p1)^1 * (p2)^1 * (p3)^(1+5) * (p4)^1 * (p5)^(1+5) * (p6)^1

Dirichlet Distribution

Prior = P(p1,p3,p5) = const. * (p1)^1 * (p3)^1 * (p5)^1

[Simplex plot over (p1, p3, p5)]

Dirichlet Distribution

Prior = P(p1,p3,p5) = const. * (p1)^5 * (p3)^5 * (p5)^5

[Simplex plot over (p1, p3, p5)]

Dirichlet Distribution

Prior = P(p1,p3,p5) = const. * (p1)^100 * (p3)^100 * (p5)^100

[Simplex plot over (p1, p3, p5)]

Dirichlet Distribution

Prior = P(p1,p3,p5) = const. * (p1)^0 * (p3)^0 * (p5)^0

[Simplex plot over (p1, p3, p5)]

Dirichlet Distribution

Likelihood = P(Data|p1,p3,p5) = (p1)^0 * (p3)^5 * (p5)^5

[Simplex plot over (p1, p3, p5)]

Dirichlet Distribution

P(p1,p3,p5|Data) = C * Prior * Likelihood = const. * (p1)^1 * (p3)^(1+5) * (p5)^(1+5)

Observation: 5 , 5

[Plots: Prior, Likelihood, Posterior]

Dirichlet Distribution

P(p1,p3,p5|Data) = C * Prior * Likelihood = const. * (p1)^1 * (p3)^(1+30) * (p5)^(1+30)

Observation: 30 , 30

[Plots: Prior, Likelihood, Posterior]

Dirichlet Distribution

P(p1,p3,p5|Data) = C * Prior * Likelihood = const. * (p1)^1 * (p3)^(1+300) * (p5)^(1+300)

Observation: 300 , 300

[Plots: Prior, Likelihood, Posterior]

Bayesian Learning

Training Data ?

X ~ Mul(p1~p6) How to Predict with Posterior ?

P(p1~p6|Data) = const. * Prior * Likelihood
= const. * (p1)^1 * (p2)^1 * (p3)^(1+5) * (p4)^1 * (p5)^(1+5) * (p6)^1

1. Maximum a Posteriori (MAP) :

(p1~p6) = argmax_{p1~p6} P(p1~p6|Data) ⇒ ( 1/16, 1/16, 6/16, 1/16, 6/16, 1/16 )

Testing Data :  P(Data | p1~p6) = (6/16) * (6/16) * (1/16) * (6/16) * (6/16) ≈ 0.0012

ln P(Data | p1~p6) ≈ -6.7   (much better)
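A small sketch of this MAP computation under the Dir(2,…,2) prior above; the test rolls are an assumed sequence chosen to reproduce the slides’ numbers:

```python
import math
from collections import Counter

train = [3, 5, 3, 5, 3, 5, 3, 5, 3, 5]
alpha = 2.0                        # Dir(2,...,2): one pseudo-count per face

counts = Counter(train)
# MAP of a multinomial under a Dirichlet prior: (count + alpha - 1) / (N + K*(alpha - 1)).
denom = len(train) + 6 * (alpha - 1)
p_map = {f: (counts[f] + alpha - 1) / denom for f in range(1, 7)}
print(p_map)                       # (1/16, 1/16, 6/16, 1/16, 6/16, 1/16)

test = [3, 5, 1, 3, 5]             # hypothetical test rolls
ll = sum(math.log(p_map[x]) for x in test)
print(math.exp(ll), ll)            # ~0.0012 and ~-6.7: no face gets probability 0
```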

Bayesian Learning

Training Data ?

X ~ Mul(p1~p6) How to Predict with Posterior ?

P(p1~p6|Data) = const. * Prior * Likelihood
= const. * (p1)^1 * (p2)^1 * (p3)^(1+5) * (p4)^1 * (p5)^(1+5) * (p6)^1

2. Evaluate Uncertainty :

Var_{P(p1~p6|Data)}[ p1...p6 ] ≈ (.003, .003, .014, .003, .014, .003)

Stderr_{P(p1~p6|Data)}[ p1...p6 ] ≈ (0.05, 0.05, 0.1, 0.05, 0.1, 0.05)

Bayesian Learning on Regression

P(Y|X, W) = N( Y ; X^T W , σ^2 )

Likelihood(W) = P(Data | W) = Π_{n=1..N} P(Yn | Xn , W)

P(W | Data) = const. * Prior(W) * Likelihood(W)

Bayesian Learning on Regression

[Plots: prior P(W), likelihood P(Data|W), posterior P(W|Data), predictive P(Y|X, Data)]

Bayesian Learning on Regression

Predictive Distribution : P( Y|X , Data ) = ∫ P(Y|X, W) * P(W|Data) dW
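A minimal sketch of this pipeline for linear regression with a Gaussian prior; the data, the prior variance tau2, and the noise variance sigma2 are assumed toy values, and the closed-form posterior is the standard conjugate-Gaussian result rather than anything specific to these slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, tau2 = 20, 0.1, 1.0                       # noise / prior variances (assumed)
X = rng.uniform(-1, 1, size=(N, 2)); X[:, 1] = 1.0   # inputs with a bias column
w_true = np.array([2.0, -0.5])
Y = X @ w_true + rng.normal(0, np.sqrt(sigma2), N)

# Posterior P(W|Data) for prior N(0, tau2*I) and Gaussian likelihood:
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
mu_post = Sigma_post @ X.T @ Y / sigma2

# Predictive distribution P(Y|x, Data) = ∫ P(Y|x,W) P(W|Data) dW, also Gaussian:
x_new = np.array([0.5, 1.0])
pred_mean = x_new @ mu_post
pred_var = sigma2 + x_new @ Sigma_post @ x_new       # noise + parameter uncertainty
print(pred_mean, pred_var)
```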

Modeling Uncertainty :

Overview

• What’s Probabilistic Graphical Model for ?

• Tasks in Graphical Model: – Modeling (Simple Probability Model) – Learning (MLE, MAP, Bayesian) – Inference (?)

I have seen a “Probabilistic Model”. But where is the “Graph” ?

• Examples – Topic Model – Hidden Markov Model – Markov Random Field


How to Model a Document of Text ?

What you want to do ?

Build Search Engine ?

Natural Language Understanding ?

Machine Translation ?

Bag of Words Model:

Don’t consider word positions; treat the document just like a bag of words.

Model a Document of Text

?

X ~ Mul(p1~p6)   (Modeling)

Document :  W1 W2 W3 … ,  with each W ~ Mul( p1 ~ p|Vocabulary| )

Model a Document of Text

Doc1: dog, puppy, breed    Doc2: learning, intelligence, algorithm

MLE from data ⇒ pW = 1/6 for all w. What’s the problem ?
⇒ Likelihood = P(Data) = P(Doc1, Doc2) = (1/6)^3 * (1/6)^3 ≈ 2*10^-5

Each Doc. has its own distribution of words ?!

Documents on the same Topic have similar distributions of words.

A Topic Model : “Word” depends on “Topic”   (Template Representation / Ground Representation)

μ = (1/2, 1/2) ;  Topic[Doc1] = “1” (dog, puppy, breed) ,  Topic[Doc2] = “2” (learning, intelligence, algorithm)

θ per topic :       dog   puppy  breed  learning  intelligence  algorithm
Topic 1 (θ^1) :     1/3   1/3    1/3    0.0       0.0           0.0
Topic 2 (θ^2) :     0.0   0.0    0.0    1/3       1/3           1/3

For all Documents d :
1. draw Topic[d] ~ Multi(μ)
For all Positions w :
2. draw W[d,w] ~ Multi( θ^Topic[d] )

Likelihood = P(Data) = P(Doc1) * P(Doc2)
= { P(T1) P(W1~W3|T1) } * { P(T2) P(W1~W3|T2) }
= (1/2)(1/3)^3 * (1/2)(1/3)^3 ≈ 3*10^-4

Learning with Hidden Variables

1. Given Topics, we can learn each topic’s word distribution :  P(W|Topic1), P(W|Topic2)

       Doc1  Doc2  Doc3  Doc4  Doc5  Doc6
T      1     1     1     2     2     2
W      A     A     E     A     A     C
W      A     B     C     C     B     C
W      B     D     F     D     C     C
W      B     A     F     C     A     F
W      E     B     C     F     A     F

Learning with Hidden Variables

2. Given the word distributions P(W|Topic), we can infer Topics :

P(T|W1~W5) = const. * P(T) * P(W1|T) * P(W2|T) * … * P(W5|T)

? ? ? ? ? ?

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Both of them are unknown; how to learn ?

Use the EM algorithm. (Here is a simplified version.)

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Topic is unknown (no one can tell us); how to learn the word distribution ?

Random Initialize :

1 2 2 1 1 2

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , M-Step: μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15

1 2 2 1 1 2

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 E-Step: 1 P(T|W1~W5) = const.* P(W1~W5|T) * P(T) 1/2, 1/2 2 T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 E-Step: 1 6*6*3*3*1 1 2 2*2*2*2*1 T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 E-Step: 1 6*3*1*6*3 1 1 2 2*2*1*2*2 T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 E-Step: 1 1*3*2*2*3 1 1 2 2 1*5*4*4*5 T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 E-Step: 1 6*3*1*3*2 1 1 2 2 2 2*5*1*5*4 T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 E-Step: 1 6*3*3*6*6 1 1 2 2 1 2 2*2*5*2*2 T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 E-Step: 1 3*3*3*2*2 1 1 2 2 1 2 2 5*5*5*4*4 T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 6/15 3/15 3/15 1/15 1/15 2/15 Word θ 1/2 T2 2/15 2/15 5/15 1/15 1/15 4/15 1 2 2 1 1 2 Not Converge: 1 1 2 2 1 2

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? M-Step: Topic

Pos. P(T)P(T) P(W|T) A B C D E F Topic 1/21/2 T1 7/156/15 5/153/15 1/153/15 1/15 1/15 0/152/15 Word θ 1/21/2 T2 1/152/15 0/152/15 7/155/15 1/15 1/15 5/154/15

1 1 2 2 1 2

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 7/15 5/15 1/15 1/15 1/15 0/15 Word θ 1/2 T2 1/15 0/15 7/15 1/15 1/15 5/15

E-Step: 1 1 2 2 1 2

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Document Topic is unknown (No one can tell us.) , μ how to learn word distribution ? Topic

Pos. P(T) P(W|T) A B C D E F Topic 1/2 T1 7/15 5/15 1/15 1/15 1/15 0/15 Word θ 1/2 T2 1/15 0/15 7/15 1/15 1/15 5/15 1 1 2 2 1 2 Converge 1 1 2 2 1 2

T Doc1 Doc2 Doc3 Doc4 Doc5 Doc6

A A E A A C W A B C C B C W B D F D C C W B A F C A F W W E B C F A F

Learning with Hidden Variables

Learning & Inference Jointly.

2 Tasks :
1. Infer the Topic of documents.
2. Learn each Topic’s “word distribution”.
⇒ EM

P(T)   P(W|T):  A     B     C     D     E     F
1/2    T1       7/15  5/15  1/15  1/15  1/15  0/15
1/2    T2       1/15  0/15  7/15  1/15  1/15  5/15

Final assignment :  Doc1, Doc2, Doc5 → Topic 1 ;  Doc3, Doc4, Doc6 → Topic 2
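A compact sketch of the simplified (hard-assignment) EM loop walked through above, on the same six toy documents, starting from the slides’ random initialization:

```python
from collections import Counter

docs = [list("AABBE"), list("ABDAB"), list("ECFFC"),
        list("ACDCF"), list("ABCAA"), list("CCCFF")]   # Doc1..Doc6, read off the slides
vocab = "ABCDEF"
z = [1, 2, 2, 1, 1, 2]                                 # random initial topic per document

for _ in range(10):
    # M-step: re-estimate P(T) and each topic's word distribution from the assignments.
    theta, prior = {}, {}
    for t in (1, 2):
        words = [w for d, zt in zip(docs, z) if zt == t for w in d]
        c = Counter(words)
        theta[t] = {w: c[w] / len(words) for w in vocab}
        prior[t] = z.count(t) / len(z)

    # E-step: hard-assign each document to its most probable topic.
    def score(d, t):
        p = prior[t]
        for w in d:
            p *= theta[t][w]
        return p
    z_new = [max((1, 2), key=lambda t, d=d: score(d, t)) for d in docs]

    if z_new == z:
        break                                          # converged
    z = z_new

print(z)   # [1, 1, 2, 2, 1, 2], matching the slides' converged assignment
```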

Topic Model

Problem Solved ?

If……. there are two Topics :

       word1  word2  word3  word4  word5  word6
Doc1   1      1      1      1      0      0     → Topic1
Doc2   0      0      0      0      1      1     → Topic2
Doc3   0      0      1      1      1      1     → Topic2 ?
Doc4   1      1      0      0      1      1     → ?????????
Doc5   0      0      1      1      0      0     → ?????????

Topic Model

Problem Solved ?

If……. there are three Topics :

       word1  word2  word3  word4  word5  word6
Doc1   1      1      1      1      0      0     → Topic 1 or 2 ?
Doc2   0      0      0      0      1      1     → Topic3
Doc3   0      0      1      1      1      1     → Topic 2 or 3 ?
Doc4   1      1      0      0      1      1     → Topic 1 or 3 ?
Doc5   0      0      1      1      0      0     → Topic2

Obviously a doc. has a mix of Topics instead of only 1 Topic.

How to model each doc. as a “Mix of Topics” ?

How to Model Mix of Topics ?

[Plate diagram: Document: μ → Topic ; Position: θ^Topic → Word W1 W2 W3 …]

How to Model Mix of Topics ?

Relax the “one Topic per doc.” assumption. Instead, every doc. has a “Mix” of topics :
Mix = (μ1, μ2, μ3,…, μK) ,  0 <= μk <= 1 ,  Σk μk = 1

How to Model Mix of Topics ?

Relax the “one Topic per doc.” assumption. Instead, every doc. has a “Mix” of topics :
Mix = (μ1, μ2, μ3,…, μK) ,  0 <= μk <= 1 ,  Σk μk = 1

[Plate diagram: Document: Mix → Topic ; Position: θ^Topic → Word]

[Figure: each document is a mix of Topic 1 / Topic 2 / Topic 3, e.g. 40% / 33% / 27%]

How to Model Mix of Topics ?

Mix = (μ1, μ2, μ3,…, μK) defines a Multi(Mix) distribution on Topics, while we still need a “Distribution on Mix” ⇒ use the “Dirichlet distribution”.

For all Documents d :
1. draw Mix[d] ~ Dirichlet(α)
For all Positions p in d :
2. draw Topic[d,p] ~ Multinomial( Mix[d] )
3. draw Word[d,p] ~ Multinomial( θ^Topic[d,p] )

Due to the “Hidden Mix” of a doc., this model is well known as “Latent Dirichlet Allocation (LDA)”.

Documents = Mix of Topics.

From: David M. Blei et al., Latent Dirichlet Allocation, 2003
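A short sketch of the generative process just described; K, V, α, and the per-topic word distributions θ are all assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha = 3, 6, 0.5                      # topics, vocabulary size, Dirichlet parameter
theta = rng.dirichlet(np.ones(V), size=K)    # per-topic word distributions (sampled for the demo)

def generate_doc(length):
    mix = rng.dirichlet(alpha * np.ones(K))           # 1. Mix[d] ~ Dirichlet(alpha)
    words = []
    for _ in range(length):
        topic = rng.choice(K, p=mix)                  # 2. Topic[d,p] ~ Multinomial(Mix[d])
        words.append(rng.choice(V, p=theta[topic]))   # 3. Word[d,p] ~ Multinomial(theta_Topic)
    return mix, words

mix, words = generate_doc(10)
print(np.round(mix, 2), words)
```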

So, How to learn an LDA ?

[Plate diagram: Dir(α) → Mix ; Document: Mix → Topic ; Position: θ^Topic → Word]

The “E-Step” & “M-Step” are much more difficult; they are out of scope. Please see:
David M. Blei et al., Latent Dirichlet Allocation, 2003

Luckily, you don’t need to know exactly how to learn & infer: there are tools online for learning & inference. What matters is knowing how to “design a model”.

Application for LDA : Search Engine

Retrieve in “Topic Level” instead of “Word Level”.

Given the words in a document, we can infer its “topics”. Given the words in a query, we can infer the “topics” the user wants.

Mix Mix Mix Mix

X X X X doc1 doc2 doc3 query

Score = ∫ P(Mix | Document) * P(Query | Mix) dMix

Application for LDA : Social Network

[Figure: user-user adjacency matrix (link = 1), with two communities circled as Topic 1 and Topic 2]

Topic Model on Social Network

[Figure: the same adjacency matrix with users grouped into three communities: 吉他社 (Guitar Club), 熱舞社 (Dance Club), 資管系 (Information Management Dept.)]

Application for LDA : Recommendation

[Figure: User × Movie matrix; a 1 marks that the user watched/rated the movie]

Application for LDA : Image Categorization / Retrieval

Fei-Fei et al., ICCV 2005

Overview

• What’s Probabilistic Graphical Model for ?

• Tasks in Graphical Model: – Modeling (Simple Probability Model) – Learning (MLE, MAP, Bayesian) – Inference (Bayes Rule ??)

Where is the Graph ? Is that important ?

• Examples – Topic Model (EM algorithm) – Hidden Markov Model – Markov Random Field


Given a Problem: Binary Coding/Decoding

I L o v e Y o u

001 00 0100 0100 0110 011 00 0111 0100 101

( Coding with different length, ex. Huffman Coding. )

Application Examples :
1. Decode words from codes (given the coding pattern).
2. Learn the coding pattern from data.
3. Decide which coding method a data sequence uses.

A Naïve Model : 1st-order Markov Chain   (2TBN Template)

Design a model explaining this data: 010 1011 010 1001 010 1001 010 1011

Ct-1 Ct

A 2TBN template specifies how “before” affects “after”.

A Naïve Model : 1st-order Markov Chain   (2TBN Template)

Design a model explaining this data: 010 1011 010 1001 010 1001 010 1011

Sample Generating Procedure :
1. Draw C[t=1] ~ Bernoulli( p0 )
For t = 2 ~ T :
2. Draw C[t] ~ Bernoulli( p_C[t-1] )

Ground Representation (example parameters) :

p0 :  C=0 0.2 , C=1 0.8

C[t-1]     p(Ct=0)  p(Ct=1)
Ct-1=0     0.01     0.99
Ct-1=1     0.99     0.01

A Naïve Model : 1st-order Markov Chain

How to explain the data with high likelihood ?  MLE estimate :

p0 :  C=0 1 , C=1 0   (the sequence starts with 0)

C[t-1]     p(Ct=0)           p(Ct=1)
Ct-1=0     #“00” / #“0?”      #“01” / #“0?”
Ct-1=1     #“10” / #“1?”      #“11” / #“1?”

From the data :

C[t-1]     p(Ct=0)  p(Ct=1)
Ct-1=0     2/14     12/14
Ct-1=1     11/13    2/13

Likelihood = P( Data ) = P(0) * P(0|0)^2 * P(1|0)^12 * P(0|1)^11 * P(1|1)^2
= 1 * (2/14)^2 * (12/14)^12 * (11/13)^11 * (2/13)^2

Can we do better ?  Patterns “010”, “1011”, ... are not explicitly handled.
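A quick sketch of this bigram-count MLE on the slides’ sequence:

```python
from collections import Counter

seq = "010" "1011" "010" "1001" "010" "1001" "010" "1011"   # the data, concatenated
pairs = Counter(zip(seq, seq[1:]))                          # transition (bigram) counts

for prev in "01":
    total = sum(pairs[(prev, nxt)] for nxt in "01")
    for nxt in "01":
        print(f"P({nxt}|{prev}) = {pairs[(prev, nxt)]}/{total}")
# P(0|0) = 2/14, P(1|0) = 12/14, P(0|1) = 11/13, P(1|1) = 2/13
```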

How to Model a Pattern ?

Assume “101110” is a frequent pattern in the data (or we already know it is the coding of some word).

A Naïve Approach : 5th-order Markov Chain

Pattern :  1 0 1 1 1 0

C[t-5]~C[t-1]   Ct=0   Ct=1
00000           ?      ?
00001           ?      ?
……
10111           High   Low
……
11111           ?      ?

Table Size = # of params = 2^6  ⇒  Intractability & Overfitting.
When handling a code with 10 values = {1,2,….,10} ⇒ Size = 10^6 !!!

How to Model a Pattern ?

Assume in data, “101110” is more frequent than usual. ( or we have known it is a coding for some word. ) Observation : a pattern can be produced with a State Machine

A State Machine is a special case of a 1st-order Markov Chain :  S1 → S2 → S3 → S4 → S5 → S6

CPD of the transition (red) variable : P(S_{i+1}|S_i) = 1, a deterministic chain.
CPD of the emission (green) variable : each state deterministically emits one symbol of the pattern
(S1→1, S2→0, S3→1, S4→1, S5→1, S6→0).

A 1st-order “Hidden” Markov Chain suffices to produce the pattern !

If there are K patterns, we can have K State Machines for them.

How to Model Multiple Patterns?

Design 3 state machines (A, B, and C) for the 3 patterns; the data then segments as :

A    C     A    B     A    B     A    C
010  1011  010  1001  010  1001  010  1011

State machines (each with an Entry and an Exit state), and their emissions :

A1 A2 A3    B1 B2 B3 B4    C1 C2 C3 C4
0  1  0     1  0  0  1     1  0  1  1

Consider the “Transition Probability” among patterns.

How to Model Multiple Patterns?

Transition Table of the “Hidden Markov Chain” :
A1→A2: 1 ;  A2→A3: 1 ;  A3→B1: .5 , A3→C1: .5
B1→B2: 1 ;  B2→B3: 1 ;  B3→B4: 1 ;  B4→A1: 1
C1→C2: 1 ;  C2→C3: 1 ;  C3→C4: 1 ;  C4→A1: 1

Transition Diagram of the “Hidden Markov Chain” (emissions below the states) :

A1 A2 A3    B1 B2 B3 B4    C1 C2 C3 C4
0  1  0     1  0  0  1     1  0  1  1

How to decode (Inference) ?

1 0 0 1 0 1 0 1 0 1 1

Transition Diagram of “Hidden Markov Chain” 0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

In terms of difficulty, there are 3 types of inference problems :

• Inference which is easily solved with Bayes rule.

• Inference which is tractable using some dynamic programming technique. (e.g. Variable Elimination or J-tree algorithm )

• Inference which is provably intractable & should be solved using some Approximate Method. (e.g. Approximation with Optimization or Sampling techniques.)

Most Probable Assignment

• Given Data = {X1=x1,…, XD=xD} and some other variables Z={Z1, …, Zk} unspecified , Most Probable Assignment of Z is given by:

MPA(Z | X) = argmax_Z P(Z | X) = argmax_Z P(X | Z)P(Z) / P(X) = argmax_Z P(X | Z)P(Z)

= argmax_{Z1~Z5} P(1|Z1) * P(0|Z2)P(Z2|Z1) * …… * P(0|Z5)P(Z5|Z4)

Z Z1 Z2 Z3 Z4 Z5

X :  1 0 0 1 0

M(B) = max_A F(A,B) : the max-marginal of B

For every choice of B, we decide an A*(B) = argmax_A F(A,B), with M(B) = F( A*(B), B ).

F(A,B)   a1  a2  a3          B    A*(B)  M(B)
b1        1   2   4          b1   a3     4
b2        3   5   7          b2   a3     7
b3        9   8   6          b3   a1     9

Most Probable Assignment on a Chain

A → B → C → D → E ,  with E = e observed.

max_{A,B,C,D} P(E=e, D, C, B, A)
= max_D max_C max_B max_A P(E=e|D) P(D|C) P(C|B) P(B|A) P(A)
= max_D P(E=e|D) max_C P(D|C) max_B P(C|B) max_A P(B|A) P(A)

Step 1 :  F(A,B) = P(B|A)P(A) ;  M(B) = max_A F(A,B) , recording A*(B) = argmax_A F(A,B).
Step 2 :  F(B,C) = P(C|B) M(B) ;  M(C) = max_B F(B,C) , recording B*(C).
Step 3 :  F(C,D) = P(D|C) M(C) ;  M(D) = max_C F(C,D) , recording C*(D).
Step 4 :  F(D,E=e) = P(E=e|D) M(D) ;  M = max_D F(D,E=e) , recording D*.

What do we get ?   M = max_{A,B,C,D} P(A,B,C,D,E=e)

What do we want ?  (A*,B*,C*,D*) = argmax_{A,B,C,D} P(A,B,C,D,E=e)

Backtrack through the recorded tables (D* = d2 ; C*(d2) = c2 ; B*(c2) = b1 ; A*(b1) = a1) :
⇒ (A*,B*,C*,D*) = (a1, b1, c2, d2)
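This max-product recursion with backpointers is essentially the Viterbi algorithm; a generic sketch (the toy 2-state parameters are assumed, not from the slides):

```python
import numpy as np

def viterbi(trans, emit, init, obs):
    # Most probable state path: max-product forward pass + backpointers.
    T, S = len(obs), trans.shape[0]
    M = np.zeros((T, S))                 # M[t,s] = best score of a path ending in s at t
    back = np.zeros((T, S), dtype=int)
    M[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        scores = M[t-1][:, None] * trans          # scores[s', s] = M[t-1, s'] * P(s|s')
        back[t] = scores.argmax(axis=0)           # best predecessor of each state
        M[t] = scores.max(axis=0) * emit[:, obs[t]]
    path = [int(M[-1].argmax())]                  # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], M[-1].max()

trans = np.array([[0.9, 0.1], [0.1, 0.9]])        # P(s_t | s_{t-1})
emit  = np.array([[0.8, 0.2], [0.2, 0.8]])        # P(obs | state)
init  = np.array([0.5, 0.5])
print(viterbi(trans, emit, init, [0, 0, 1, 1]))   # ([0, 0, 1, 1], best joint score)
```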

How to decode (Inference) ?

Observed sequence :  1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

1  C4 → A1
1  C1 → C2
1  B4 → A1
1  B1 → B2
1  A2 → A3

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

1 C4  A1

1 C1  C2

B4  A1

1 B1  B2

1 A2  A3

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

1 C4  A1  A2

1 C1  C2  C3

B4  A1

.5 B1  B2 C1

.5 A2  A3  B1

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

1 C4  A1  A2

1 C1  C2  C3

B4  A1

.5 B1  B2 C1

.5 A2  A3  B1

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

1 C4  A1  A2  A3

C1  C2  C3

B4  A1

.5 B1  B2 C1  C2

.5 A2  A3  B1  B2

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

1 C4  A1  A2  A3

C1  C2  C3

B4  A1

.5 B1  B2 C1  C2

.5 A2  A3  B1  B2

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

.5 C4  A1  A2  A3  B1 .5 C  C  C 1 2 3 C1 B4  A1

.5 B1  B2 C1  C2  C3

A2  A3  B1  B2

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

.5 C4  A1  A2  A3  B1 .5 C  C  C 1 2 3 C1 B4  A1

.5 B1  B2 C1  C2  C3

A2  A3  B1  B2

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

.5 C4  A1  A2  A3  B1  B2 .5 C  C  C 1 2 3 C1  C2 B4  A1

B1  B2 C1  C2  C3

A2  A3  B1  B2

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

.5 C4  A1  A2  A3  B1  B2 .5 C  C  C 1 2 3 C1  C2 B4  A1

B1  B2 C1  C2  C3

A2  A3  B1  B2

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

.5 C4 → A1 → A2 → A3 → B1 → B2 → B3
C1 → C2 → C3 ;  C1 → C2 ;  B4 → A1   (pruned candidates)

MPA Likelihood = max_Z P(X|Z)

B1  B2 C1  C2  C3

A2  A3  B1  B2

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to decode (Inference) ?

.5 C4  A1  A2  A3  B1  B2  B3 C  C  C 1 2 3 C1  C2 B4  A1

B1  B2 C1  C2  C3

A2  A3  B1  B2

(or B4 ) C4 A1 A2 A3 B1 B2 B3

1 0 1 0 1 0 0

0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

1 1 How to Model Multiple Patterns? Transition Table of the “Hidden Markov Chain”

Using a single-layer HMM to implement this, the Transition Table is O(|state|^2), with most of the entries “zero”.

Transition Diagram of the “Hidden Markov Chain” :  0.5 0.5

A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

0 1 0 1 0 0 1 1 0 1 1

How to Model Multiple Patterns?

Exploiting the Hierarchical Structure, we get a more compact representation and separate the 2 layers.

Top layer (pattern-level) transition table :  A→B: .5 , A→C: .5 ;  B→A: 1 ;  C→A: 1

Lower layer (per-pattern state machines and emissions) :
A : A1→A2→A3 , emitting 0 1 0
B : B1→B2→B3→B4 , emitting 1 0 0 1
C : C1→C2→C3→C4 , emitting 1 0 1 1

How to Model Multiple Patterns?

How can we know there will be an “A” pattern before decoding the data ???
Is this a Legal Graphical Model ?
How can we know there is no dependency before decoding the data ?  No !!

Note: the Structure of a GM should be fixed in advance; uncertainty can only involve the values of variables.
So, how do we encode structural uncertainty into some variables ?

Solution of Hierarchical HMM (2001, K. P. Murphy)

Introduce a “Control Variable” :  F^t = F( Z^t_layer1 , Z^t_layer2 ) = 1 if Z^t_layer2 is the Exit State of Z^t_layer1 , 0 otherwise.

Ft = 0 ⇒ pattern not ending ⇒ Layer 1 : Z^{t+1} = Z^t ;  Layer 2 : transit according to the State Machine.

Ft = 1 ⇒ pattern ending ⇒ Layer 1 : draw from P(Z^{t+1}|Z^t) ;  Layer 2 : draw independently of the previous state.

Layer 1 A A A B B B B 0 0 1 0 0 0 1

Layer 2 1 2 3 1 2 3 4

F^t = F( Z^t_layer1 , Z^t_layer2 ) :  0 0 1 0 0 0 1

Observations :  0 1 0 1 0 0 1

Solution of Hierarchical HMM (2001, K. P. Murphy)

[2TBN: Z^{t-1} → Z^t (layer 1, parameters θZ) ; F^{t-1}, F^t control variables ; Y^{t-1} → Y^t (layer 2, parameters θY) ; Y^t → X^t (observation, parameters θX)]

Layer 1 : A A A B B B B ;  F : 0 0 1 0 0 0 1 ;  Layer 2 : 1 2 3 1 2 3 4 ;  X : 0 1 0 1 0 0 1

2TBN Template. Explain the data :  010 1011 010 1001 010 1001 010 1011  =  A C A B A B A C

Likelihood = P(A) * P(C|A) * …… * P(C|A)
* P(transitions in A|A) * P(transitions in C|C) * …… * P(transitions in C|C)
* P(0|A,1)P(1|A,2)P(0|A,3) * P(1|C,1)P(0|C,2)P(1|C,3)P(1|C,4) * ……

= P(A) * P(C|A) * …… * P(C|A) = (0.5)^4   (all within-pattern factors are deterministic)

Much better than the Naïve Model.

Comparison of Single & Multi-Layer HMM
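A toy sketch of the two-layer generative process (the pattern codes and the top-layer table are read off the slides; the sampling helper itself is an assumed illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
patterns = {"A": "010", "B": "1001", "C": "1011"}
top_trans = {"A": (["B", "C"], [0.5, 0.5]),      # Layer-1 (pattern-level) transitions
             "B": (["A"], [1.0]),
             "C": (["A"], [1.0])}

def generate(n_patterns, z="A"):
    out = []
    for _ in range(n_patterns):
        out.extend(patterns[z])        # Layer 2 walks z's state machine, emitting its code;
                                       # the control variable F stays 0 until the exit state
        nxt, p = top_trans[z]
        z = str(rng.choice(nxt, p=p))  # F = 1 at the exit: Layer 1 draws the next pattern
    return "".join(out)

print(generate(8))                     # e.g. "0101011010...": eight patterns concatenated
```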

Note: both methods can yield the same likelihood to explain the data.

[2TBN diagrams: Single-Layer HMM (Y^{t-1} → Y^t → X^t) vs Multi-Layer HMM (Z, F, Y, X as above)]

Assume there are K patterns, where each pattern’s state machine has D states. Size of the Transition Tables :

Single-layer :  θY = O( (K*D)^2 ).   Multi-layer :  θZ = O( K^2 ) , θY = O( D^2 ).

Hierarchical HMM is the foundation of Speech Recognition.

Layer 1 :  I enjoy attending classes   (Language Model)
Layer 2 :  [aɪ] [ɪnˋdʒɔɪ] [əˋtɛndɪŋ] [klæs]   (Lexicon)
Observed :  acoustic signal   (Acoustic Models)

Decode using Parameters Trained Separately.

When decoding the low layer, the high layer’s model is taken into consideration. (ex. some “incorrect” pronunciations can be recognized with knowledge of the high-layer Language Model.)

Comparison of Single & Multi-Layer HMM

Note: both methods can yield the same likelihood to explain the data.

[2TBN diagrams: Single-Layer HMM vs Multi-Layer HMM, as above]

1. We may want to train different layers separately. (ex. high layer : Language Model ; low layer : Lexical Model)

2. We may want to introduce a prior on only some layers. (see next)

See “2001, K. P. Murphy” for more advantages of this approach.

Application in Speech Recognition ( Speech Signal Processing 2010 Fall , 李琳山 )

Hierarchical HMM is the foundation of Speech Recognition.

How to handle different persons’ pronunciations ??

Application in Speech Recognition

Transition Parameters : θY , θZ.  How to handle different persons’ pronunciations ??

Technique similar to LDA (a Latent Pronunciation Model) : put a Dir(α) prior on the pronunciation parameters.

For every person, introduce a Pronunciation Pattern Variable θXY.

In the beginning, use the “prior of θXY” to decode that person’s speech.

As the # of samples from the person grows, a new θXY can be learned/inferred to adapt to that person.

Application : Natural Language Understanding

Multi-Layer HMM

Vt-1 Vt Ft-1 Ft

Zt-1 Zt Ft-1 Ft

Yt-1 Yt

Xt

t-1 t Gene Decoding

Pattern Mining / Pattern Recognition.

Overview

• What’s Probabilistic Graphical Model for ?

• Tasks in Graphical Model: – Modeling (Simple Probability Model) – Learning (MLE, MAP, Bayesian) – Inference (Bayes Rule ??)

Is there any other Graph than “a Chain” ?

• Examples – Topic Model (EM algorithm) – Hidden Markov Model (Variable Elimination) – Markov Random Field

( All of the previous models are special cases of it. )

Given Domain Problem : Modeling Spatial Pattern

Distribution of Shape Distribution of Texture

In sequential data, we model “before as cause” and “after as effect”.

In spatial data, who is the “cause” ?

[Figure: a binary grid of pixels]

Modeling with Markov Random Field

Define the types of Nodes (variables) :  X , with Val(X) = {0,1}

Without “cause & effect”, an MRF defines a “Neighborhood” that a variable interacts with.

Along with the Neighborhood, the types of “Cliques” are defined (here, pairs {X,Y} of neighbors).

Local Structure :

Define a Potential function φ_C(variables in C) for every type of clique C we care about. The Gibbs Distribution of the MRF is given by :

P(X) = (1/Z) Π_{clique C} φ(X_C) ,  where Z = Σ_{X∈Val(X)} Π_{clique C} φ(X_C) is for normalization.

Example pairwise potentials (rows = X, columns = Y ∈ {0,1}) :

φ1(X,Y) = [[2,1],[1,2]] ,  φ2(X,Y) = [[2,1],[1,2]]   (neighbors prefer equal values)

φ1(X,Y) = [[1,7],[7,1]] ,  φ2(X,Y) = [[7,1],[1,7]]

φ1(X,Y) = [[1,4],[4,1]] ,  φ2(X,Y) = [[1,4],[4,1]]

How does an MRF Generate Samples ?

How does a BN generate samples ?  How does an MRF generate samples ?

For a BN, P( X|Pa(X) ) is available given Pa(X) ⇒ sampling follows a Topological Order.
For an MRF, P( X|N(X) ) can be derived from Bayes Rule ⇒ but how to find an order ?

How does a BN Generate Samples ?

Burglary → Alarm ← Earthquake ;  Alarm → John Calls , Alarm → Mary Calls

P(B) = 0.001 ,  P(E) = 0.002

B E : P(A)    t t : 0.95 ;  t f : 0.94 ;  f t : 0.29 ;  f f : 0.01

A : P(J)      t : 0.90 ;  f : 0.05
A : P(M)      t : 0.70 ;  f : 0.01

Sampling in topological order gives, e.g., ~b, e, a, j, ~m.

i.i.d. Sample : (~b, e, a, j, ~m)

How does an MRF Generate Samples ?

Gibbs Sampling for MRF

Gibbs Sampling :

1. Initialize all variables randomly.
for t = 1~M :
   for every variable X :
      2. Draw X^t from P( X | N(X)^{t-1} ).

P( X=1 | N(X) ) = Π_{Y∈N(X)} φ(X=1,Y) / [ Π_{Y∈N(X)} φ(X=1,Y) + Π_{Y∈N(X)} φ(X=0,Y) ]

with φ(X,Y) :  φ(0,0)=5 , φ(0,1)=1 , φ(1,0)=1 , φ(1,1)=9.

t=2 , for the central node :  P( X=1 | N(X) ) = (1*9*9*1) / (1*9*9*1 + 5*1*1*5) ≈ 0.76

t=3 , for the central node :  P( X=1 | N(X) ) = (9*9*9*9) / (9*9*9*9 + 1*1*1*1) ≈ 0.99

When M is large enough, X^(M) follows the stationary distribution :

π(X) = P(X) = (1/Z) Π_C φ(X_C)

(Regularity : all entries in the Potential are positive.)
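A small sketch of this sampler on a 3×3 binary grid with the slide’s potential (the grid size and sweep count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([[5.0, 1.0],                  # phi[x, y] from the slides
                [1.0, 9.0]])
H = W = 3
X = rng.integers(0, 2, size=(H, W))          # 1. random initialization

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < H and 0 <= j + dj < W:
            yield i + di, j + dj

for sweep in range(100):                     # for t = 1..M
    for i in range(H):
        for j in range(W):
            # Unnormalized P(X=x | N(X)): product of pairwise potentials to each neighbor.
            p = np.array([np.prod([phi[x, X[a, b]] for a, b in neighbors(i, j)])
                          for x in (0, 1)])
            X[i, j] = rng.choice(2, p=p / p.sum())   # 2. resample from the conditional

print(X)   # approximately a sample from the stationary distribution P(X)
```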

Modeling with Markov Random Field

To model texture with multiple labels, we have, e.g., Val(X) = {0,1,2}.

For texture, we extend the Neighborhood (e.g. to the surrounding pixels); we may use only part of them.

With the Neighborhood, the possible “Cliques” are defined : pairs {X,Y}, and larger cliques such as {X,Y,Z}.

Example potentials on 3 labels (rows = X, columns = Y) :

φ(X,Y) = [[1,7,7],[7,1,7],[7,7,1]]   (neighbors prefer different labels)
φ(X,Y) = [[7,1,1],[1,7,1],[1,1,7]]   (neighbors prefer equal labels)
φ(X,Y) = [[9,1,1],[1,9,1],[1,1,9]]

Infer The Lost Segment

With such potentials, a missing patch of texture can be inferred from its neighborhood :

φ(X,Y) = [[1,7,7],[7,1,7],[7,7,1]] ,  φ(X,Y) = [[7,1,1],[1,7,1],[1,1,7]]

Some Models are Intractable for Exact Inference

Example : A Grid MRF

Eliminating a variable, e.g. max_C P(A,B)*P(B,C) = F(A,B), creates new factors over the remaining neighbors; on a grid, these factors keep growing as elimination proceeds.

Approximate Inference is needed.

Generally, elimination on an N*N grid creates a clique of “size N”, which is indeed intractable.

Gibbs Sampling for Inference

1. Initialize the unobserved variables randomly; keep the observed variables fixed.
for t = 1~M :
   for every unobserved variable X :
      2. Draw X^t from P( X | N(X)^{t-1} ).

t=2 , for the central node :  P( X=1 | N(X) ) = (1*9*1*1) / (1*9*1*1 + 5*1*5*5) ≈ 0.07

t=3 , for the central node :  P( X=1 | N(X) ) = (9*9*9*9) / (9*9*9*9 + 1*1*1*1) ≈ 0.99

When M is large enough, X^(M) follows the stationary distribution :  π(X) = P(X) = (1/Z) Π_C φ(X_C)

(Regularity : all entries in the Potential are positive.)

Application of MRF : Joint Segmentation & Classification

Layer 1 (segment / class labels) :  A A B B
Layer 2 (pixel labels) :  1 0 2 1 / 0 1 1 2

Texture 1 / Texture 2

Application of MRF : Collective Classification

F: Faculty Member S: Student C: Course webpage

S S hyperlink F S

C C

F S

Collective Classification on a Network

F: Faculty Member S: Student Webpage Webpage C: Course webpage Text Text S S hyperlink Type Type { S, F, C } { S, F, C } F S

C C Link from to F S

S Define Global Dependency

We can define an MRF on the Schema :  F: Faculty Member , S: Student , C: Course.

Each Webpage has Text and a Type ∈ { S, F, C }, with potentials φ(Text, Type); each hyperlink between webpages gets a potential φ(Type[W1], Type[W2]).

for each W1, W2 s.t. Link(W1,W2), define φ(Type[W1],Type[W2])

Define Local Potential Function

φ(Type[w1], Type[w2]) :   F    S    C
F                         0    3    5      (F often links to S, C)
S                         2    0    1
C                         5    0    0      (C often links to F)

P(X) = (1/Z) Π_{clique C} φ(X_C)

Exact Inference on Graphical Model

Reference:

Probabilistic Graphical Models, Ch. 9 & Ch. 10 (Koller & Friedman)
CMU 10-708 (Fall 2009), Probabilistic Graphical Models, Lectures 8, 9, 10 (Eric Xing)

Probabilistic Inference

• A Graphical Model specifies a joint distribution PM(X) over a collection of variables X.

• How can we answer queries/questions about P(X) ? That is, how can we do inference using P(X) ?

• Type of queries:

– 1. Likelihood of evidence/assignments on variables – 2. Conditional Probability of some variables (given others). – 3. Most Probable Assignment for some variables (given others ). Query 1: Likelihood

• Given Evidence E = {X1=x1,…, XD=xD} specifying some variables’ value and let Z={Z1, …, Zk} be variables unspecified , the likelihood of a model M yielding this evidence can be computed by:

likelihood of E = P_M(X) = Σ_{Z1} … Σ_{ZK} P_M(Z1,…,ZK, x1,…,xD)

The naïve algorithm yields O( |Z|^K ) complexity…

Query 1: Likelihood

likelihood of E = Σ_{Z1} … Σ_{ZK} P_M(Z1,…,ZK, x1,…,xD)

• This measure is often used as a criterion for Model Selection.
Ex. In speech recognition, Z : words (unspecified) , X : wave samples (specified evidence E).

How likely the person M = (Language, Pronunciation) produced this wave sample can be computed by :

likelihood of M producing E = P_M(X) = Σ_{Z1,Z2,Z3} P_M(Z1, Z2, Z3, x1, x2, x3)

(The person’s “Language Model” P(Z_{t+1}|Z_t) and “Pronunciation Model” P(X_t|Z_t); the summation runs over all possible words Z producing X.)

Query 1: Likelihood

likelihood of E = P_M(X) = Σ_{Z1} … Σ_{ZK} P_M(Z1,…,ZK, x1,…,xD)

Taking the special case E = empty, this can also be used to compute the Normalizing Constant Z of an MRF :

Let P̃(Z1…ZK) = Π_{clique C in M} φ(C) be the unnormalized distribution, so that P(Z1…ZK) = (1/Z) P̃(Z1…ZK).

Σ_{Z1} … Σ_{ZK} P(Z1,…,ZK) = (1/Z) Σ_{Z1} … Σ_{ZK} P̃(Z1,…,ZK) = 1
⇒ Z = Σ_{Z1} … Σ_{ZK} P̃(Z1,…,ZK)

Query 2: Conditional (marginal) Probability

• Given Evidence E = {X1=x1,…, XD=xD} and some other variables Z = {Z1, …, Zk} unspecified, the Conditional Probability of Z is given by :

P(Z | X) = P(Z, X) / P(X) ,  where P(X) is given by Query 1.

• Sometimes we are interested in only some variables Y in Z, where Z = { Y, W } ; then the conditional (marginal) prob. of Y is

P(Y | X) = Σ_W P(Z | X) = Σ_{W1} … Σ_{WK} P(Y, W1…WK | X)

Naïve summation over the uninteresting variables W yields O( |W|^K ) complexity…

Query 2: Conditional (marginal) Probability

Ex. In speech recognition, Z : words (unspecified) , X : wave sample (specified evidence E).

P(Z | X) = P(Z, X) / P(X) ,  where P(X) is given by Query 1.

A word sequence Z1…ZK ’s probability given the wave sample X1…XK.

If we only care about the 1st word, then :

P(Z1 | X) = Σ_{Z2} Σ_{Z3} P(Z1, Z2, Z3 | X)

The 1st word Z1’s marginal distribution given the wave sample X1…XK.  (The naïve method is intractable for large K.)

Query 3: Most Probable Assignment

• Given Evidence E = {X1=x1,…, XD=xD} and some other variables Z = {Z1, …, Zk} unspecified, the Most Probable Assignment of Z is given by :

MPA(Z | X) = argmax_Z P(Z | X) = argmax_Z P(X | Z)P(Z) / P(X) = argmax_Z P(X | Z)P(Z)

• MPA is also called the “maximum a posteriori configuration” or “MAP inference”.

Note :
1. Even if we have computed Query 2 = P(Z|X), it is intractable to enumerate all possible Z to get argmax_Z P(Z|X).
2. MPA cares about the “Joint Maximum”, not the “Marginal Maximum” :
argmax_Z P(Z|X) ≠ ( argmax_{Z1} P(Z1|X) , …… , argmax_{ZK} P(ZK|X) )

Query 3: Most Probable Assignment

We often just want to “decode words ” from the wave sample,

That is, we care Z* = argmaxZ P(Z|X) but not P(Z|X) itself.

Marginal Maximum :

( argmax_{Z1} P(Z1|X) , argmax_{Z2} P(Z2|X) , argmax_{Z3} P(Z3|X) )  may give  Z1 = 'I' , Z2 = 'comes' , Z3 = 'front'   (inconsistent decoding)

Joint Maximum (MPA) :

argmax_{Z1,Z2,Z3} P(Z1, Z2, Z3 | X)  may give  'I' 'come' 'from'   (consistent decoding)

In terms of difficulty, there are 3 types of inference problems.

• Inference which is easily solved with Bayes rule.

Today’s focus

• Inference which is tractable using some dynamic programming technique. (e.g. Variable Elimination or J-tree algorithm )

• Inference which is proved intractable & should be solved using some Approximate Method. (e.g. Approximation with Optimization or Sampling technique.) Agenda

• Introduce the concept of “Variable Elimination” in special case of Tree-structured Factor Graph.

• Extend the idea of “VE” to general Factor graph with concept of “Clique Tree”.

• See how to extend “VE” to the “Most Probable Assignment” (MAP configuration) Problem.

Variable Elimination: Inference on a Chain

How to get P(E=e) ?   P(E=e) = Σ_A Σ_B Σ_C Σ_D P(A, B, C, D, E=e)

By the structure of the BN :  P(A,B,C,D,E) = P(E|D)P(D|C)P(C|B)P(B|A)P(A)

P(E) = Σ_D Σ_C Σ_B Σ_A P(E|D)P(D|C)P(C|B)P(B|A)P(A)

We can push each summation as far right as possible :

P(E) = Σ_D P(E|D) Σ_C P(D|C) Σ_B P(C|B) Σ_A P(B|A)P(A)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A F(A,B)

A Table size=|A||B| Variable Elimination: Inference on a Chain M(B)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A

∑AF(A,B)=M( B )

Eliminate “A”. A Table size=|B|. Variable Elimination: Inference on a Chain F(B,C)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A P(C|B)*M( B ) = F(B,C)

A Table size=|B||C|. Variable Elimination: Inference on a Chain M(C)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A

∑B F(B,C)=M(C)

Eliminate “B”. A Table size=|C|. Variable Elimination: Inference on a Chain F(C,D)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A P(D|C)M(C) = F(C,D)

A Table size=|C||D|. Variable Elimination: Inference on a Chain M(D)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A

∑C F(C,D)= M(D)

Eliminate “C”. A Table size=|D|. Variable Elimination: Inference on a Chain F(D,E)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A P(E|D)M(D)=F(D,E)

A Table size=|D||E|. Variable Elimination: Inference on a Chain M(E)

P(E  e)  P(A)P(B | A)P(C | B)P(D | C)P(E  e | D) A B C D  P(E | D)P(D | C)P(C | B)P(B | A)P(A) D C B A

∑D F(D,E)=M(E)=P(E)

Eliminate D. Get the answer.

Both Time & Space Complexity are O( |A||B|+|B||C|+|C||D|+|D||E| )  O( |Range|2 )

Naïve method complexity = O(|A||B||C||D||E|)  O( |Range|N ) Variable Elimination: Inference on a Chain
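A compact sketch of this chain elimination with random CPDs (the sizes and distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4                                      # |Range| of each variable
pA = rng.dirichlet(np.ones(k))             # P(A)
cpds = [rng.dirichlet(np.ones(k), size=k)  # P(B|A), P(C|B), P(D|C), P(E|D); cpd[a, b] = P(b|a)
        for _ in range(4)]

# Sum-product, pushing each sum inward: M(next) = sum_prev M(prev) * P(next|prev).
M = pA
for cpd in cpds:                           # each step costs O(k^2)
    M = M @ cpd                            # M[b] = sum_a M[a] * cpd[a, b]
print(M)                                   # = P(E)

# Naive check: build the full O(k^5) joint and marginalize.
joint = np.einsum('a,ab,bc,cd,de->abcde', pA, *cpds)
print(joint.sum(axis=(0, 1, 2, 3)))        # same P(E)
```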

Variable Elimination: Inference on a Chain

A → B → C → D → E

How about inference on an Undirected Model (MRF) ?

P(A,B,C,D,E) = (1/Z) φ(E,D) φ(D,C) φ(C,B) φ(B,A)

P(E) = (1/Z) Σ_D Σ_C Σ_B Σ_A φ(E,D) φ(D,C) φ(C,B) φ(B,A)
= (1/Z) Σ_D φ(E,D) Σ_C φ(D,C) Σ_B φ(C,B) Σ_A φ(B,A)

The same idea applies !!

Variable Elimination: Inference on a Chain

A - B - C - D - E ,  with factors f(A,B), f(B,C), f(C,D), f(D,E)

From now on, we won’t distinguish between BN & MRF; the same algorithm applies to both in a “Factor View” :

P(A,B,C,D,E) = (1/Z) φ(E,D) φ(D,C) φ(C,B) φ(B,A)
P(A,B,C,D,E) = 1 * P(E|D)P(D|C)P(C|B)P(B|A)P(A)

Both are viewed as :  (1/Z) f(E,D) f(D,C) f(C,B) f(B,A)

Variable Elimination: Inference on a Chain

A - B - C - D - E ,  with factors f(A,B), f(B,C), f(C,D), f(D,E)

F(A,B) = f(A,B) ;           M(B) = Σ_A F(A,B)        (Sum !)
F(B,C) = M(B) * f(B,C)      (Product !)
M(C) = Σ_B F(B,C)           (Sum !)
F(C,D) = M(C) * f(C,D)      (Product !)
M(D) = Σ_C F(C,D)           (Sum !)
F(D,E) = M(D) * f(D,E)      (Product !)
M(E) = Σ_D F(D,E) = P(E)    (Sum !)

The elimination process is sometimes called the “Sum-Product algorithm”.

Variable Elimination: Inference on a Tree

If the factor graph is a tree without cycles, then VE can be applied in a similar way.

Tree : the chain A - B - C - D , with E attached to C , F attached to B , and G, H attached to F through the factor f(F,G,H).

Elimination Order : let E be the root ; eliminate from the leaves to the root.

F(F,G,H) = f(F,G,H) ;  M(F) = Σ_{G,H} F(F,G,H)   (Sum !)
F(B,F) = f(B,F) * M(F)   (Product !) ;  M(B) = Σ_F F(B,F)   (Sum !)
F(A,B) = f(A,B) ;  M’(B) = Σ_A F(A,B)   (Sum !)
F(B,C) = M(B) * M’(B) * f(B,C)   (Product !) ;  M(C) = Σ_B F(B,C)   (Sum !)
F(C,D) = f(C,D) ;  M’(C) = Σ_D F(C,D)   (Sum !)
F(C,E) = f(C,E) * M(C) * M’(C)   (Product !) ;  M(E) = Σ_C F(C,E) = P(E)   (Sum !)

Variable Elimination: Inference on a Tree

Following the elimination process, we can build a "Clique Tree", in which:

1. Every node is a factor F(.) before elimination.
2. Every edge is a "message" M(.) passed from one F(.) to another.

(Clique tree for the example: nodes F(A,B), F(B,C), F(C,D), F(C,E), F(B,F), F(F,G,H); the edges carry the messages M(B), M'(B), M(C), M'(C), M(F); at the root, ∑C F(C,E) = P(E).)

Why is the Clique Tree useful? If we want P(D) instead of P(E), only the last part differs: every message is reused and we simply compute ∑C F(C,D) = P(D). Likewise, for P(B) we compute ∑F F(B,F) = P(B). In general, the queried node must receive messages from all other nodes on the tree to yield its marginal distribution.

To get the marginal distributions of N nodes we need not run VE N times; 2 passes are enough to compute all possible messages:

1st pass: take a node as root; run Sum-Product from the leaves to the root.
2nd pass: run Sum-Product from the root to the leaves.

All marginal distributions can then be derived by (1) multiplying all messages M(.) from the neighbors with the factor f(.) on the node, e.g. P(B,F) ∝ M(B) f(B,F) M(F), and (2) eliminating the unwanted variables.
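A minimal sketch of this two-pass idea on a chain-structured factor graph (an assumed Python/NumPy helper, not part of the original slides): one leaves-to-root pass, one root-to-leaves pass, then every marginal is the normalized product of the two messages meeting at a node.

    import numpy as np

    def chain_marginals(pair_factors):
        """All single-node marginals of a chain MRF P(x) ∝ Π_i f_i(x_i, x_{i+1})
        via two passes of sum-product. pair_factors[i] is a (K, K) table."""
        n, K = len(pair_factors) + 1, pair_factors[0].shape[0]
        fwd = [np.ones(K)]                       # pass 1: leaves -> root
        for f in pair_factors:
            fwd.append(fwd[-1] @ f)              # M(x_{i+1}) = sum_{x_i} M(x_i) f(x_i, x_{i+1})
        bwd = [np.ones(K)]                       # pass 2: root -> leaves
        for f in reversed(pair_factors):
            bwd.append(f @ bwd[-1])
        bwd = bwd[::-1]
        out = []
        for i in range(n):                       # multiply incoming messages, normalize
            b = fwd[i] * bwd[i]
            out.append(b / b.sum())
        return out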

Complexity: eliminating a clique-tree node such as F(A,B,C) takes O(|A||B||C|) space and time, so the algorithm's bottleneck is the elimination of the "largest node" on the clique tree, e.g. O(|F||G||H|) time and space for F(F,G,H).

Agenda

• Introduce the concept of "Variable Elimination" (VE) in the special case of a tree-structured factor graph.

• Extend the idea of VE to general factor graphs with the concept of a "Clique Tree".

• See how to extend VE to the "Most Probable Assignment" (MAP configuration) problem.

Variable Elimination: Inference on General Graph

A problem ignored so far: different "Elimination Orders" have different effects.

Elimination Order 1: B, C, D, E, A  (a star graph: center A with neighbors B, C, D, E, F)

P(F) = (1/Z) ∑A f(A,F) [∑E f(A,E)] [∑D f(A,D)] [∑C f(A,C)] [∑B f(A,B)]

Eliminating each leaf produces a message Mi(A) over A alone. Maximum node size = 2.

Variable Elimination: Inference on General Graph

A problem ignored so far: different "Elimination Orders" have different effects.

Elimination Order 2: A, B, C, D, E

P(F) = (1/Z) ∑E ∑D ∑C ∑B ∑A f(A,F) f(A,E) f(A,D) f(A,C) f(A,B)

Eliminating the hub A first couples all of its neighbors: the messages are, in turn, M(B,C,D,E,F), M(C,D,E,F), M(D,E,F), M(E,F), and finally M(F). Maximum node size = 6 (the clique {A,B,C,D,E,F}).

Variable Elimination: Inference on General Graph

Compare with Elimination Order 1 (B, C, D, E, A), where every message Mi(A) involves A alone and the maximum node size is only 2.

In a tree-structured factor graph, the optimal elimination order is simply "eliminate from the leaves". If the factor graph is not a tree, what is the best elimination order?

Variable Elimination: Inference on General Graph

When the factor graph is not a tree, we want an elimination order that introduces as few fill-in edges as possible (so that the resulting factors stay small).

Example: eliminating C fills 1 edge and produces a factor of size 3,
M(A,D,E) = ∑C f(A,C) f(C,D) f(C,E),
while eliminating A fills 2 edges and produces a factor of size 4,
M(B,C,D,E) = ∑A f(A,B) f(A,C) f(A,D) f(A,E).
Eliminating C first is better!

Variable Elimination: Inference on General Graph

Unfortunately, finding the elimination order with the smallest "maximum factor" is NP-hard.

Fortunately, a greedy algorithm works quite well in practice: at each step we pick the "least-cost" variable to eliminate (see the sketch below), where

1. if all variables have the same cardinality, cost = (# of edges introduced by the elimination);

2. if variables have different cardinalities, cost = (# of edges introduced), weighted by the cardinalities of the nodes involved.
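A minimal sketch of this greedy min-fill heuristic, assuming equal cardinalities and a graph given as an adjacency dict (Python; function and variable names are illustrative, not from the slides):

    import itertools

    def min_fill_order(adj):
        """Greedy elimination order: repeatedly eliminate the variable whose
        elimination would introduce the fewest fill-in edges."""
        adj = {v: set(nbrs) for v, nbrs in adj.items()}
        order = []
        while adj:
            def cost(v):  # edges that eliminating v would add among its neighbors
                return sum(1 for a, b in itertools.combinations(adj[v], 2)
                           if b not in adj[a])
            v = min(adj, key=cost)
            for a, b in itertools.combinations(adj[v], 2):   # add the fill-in edges
                adj[a].add(b); adj[b].add(a)
            for u in adj[v]:                                 # disconnect and remove v
                adj[u].discard(v)
            del adj[v]
            order.append(v)
        return order

    # star graph from the slides: the leaves B..F are eliminated before the hub A
    print(min_fill_order({'A': set('BCDEF'), **{c: {'A'} for c in 'BCDEF'}}))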

Example: Factorial HMM

Speech recognition: a language model over the word sequence Z1, Z2, Z3 generates the pronunciations x1, x2, x3 (a plain HMM).

Decoding two people's speech from one recorded wave: the chain Y1, Y2, Y3 is the language model of the 1st person, the chain Z1, Z2, Z3 is the language model of the 2nd person, and each observation xt is the superposition of the two waves.

Example: Factorial HMM

Because a factor corresponds to a "clique" in the undirected representation, we transform the Factorial HMM into an undirected graph before running VE.

Moralization connects the parents of each common child (here Yt and Zt, which share the child xt) and then drops the edge directions. The result is not a tree.

Review: in a factor graph, a factor such as P(C|A,B) likewise moralizes into a clique over {A, B, C}.

Example: Factorial HMM

Finding an Elimination Order: greedily pick the elimination that adds as few fill edges as possible.

• Eliminating Y1 would send M(Z1,Y2,X1) = ∑Y1 f(Y1,Z1) f(Y1,Y2) f(Y1,X1) and introduce 2 fill edges.

• Eliminating Z1 would send M(Y1,Z2,X1) = ∑Z1 f(Y1,Z1) f(Z1,Z2) f(Z1,X1) and also introduce 2 fill edges.

• Eliminating X1 sends M(Y1,Z1) = ∑X1 f(Y1,X1) f(Z1,X1) and introduces no fill edge → eliminate it! Likewise X2 and X3, yielding M(Y2,Z2) and M(Y3,Z3).

• Eliminating Z2 next would send M(Z1,Y2,Z3) = ∑Z2 M(Y2,Z2) f(Z1,Z2) f(Z2,Z3) and introduce 3 fill edges — too many.

• Eliminating Z1 sends M(Y1,Z2) = ∑Z1 M(Y1,Z1) f(Z1,Z2) and introduces 1 fill edge (the best available); this creates the clique-tree node {Y1,Z1,Z2}.

• Eliminating Y1 sends M'(Y2,Z2) = ∑Y1 M(Y1,Z2) f(Y1,Y2) and introduces 0 fill edges (the best); clique-tree node {Y1,Y2,Z2}.

• Eliminating Z2 sends M(Y2,Z3) = ∑Z2 M'(Y2,Z2) M(Y2,Z2) f(Z2,Z3); clique-tree node {Y2,Z2,Z3}.

• Eliminating Y2 with f(Y2,Y3) continues the same pattern; clique-tree node {Y2,Y3,Z3}, and so on down the chain.

(The resulting clique tree: leaves {Y1,Z1,X1}, {Y2,Z2,X2}, {Y3,Z3,X3} send the messages M(Yt,Zt) up into the chain of nodes {Y1,Z1,Z2} — {Y1,Y2,Z2} — {Y2,Z2,Z3} — {Y2,Y3,Z3}.)

Example: Factorial HMM

After building the clique tree, we can run 2 passes on the tree to get all the messages M(.) needed for computing marginals:

1st pass: take a node as root; run Sum-Product from the leaves to the root.
2nd pass: run Sum-Product from the root to the leaves.

All marginal distributions can then be derived by multiplying all messages M(.) from the neighbors with the factor f(.) on a node and eliminating the unwanted variables.

Assume we want P(Y1,Y2). The belief (marginal distribution) on the clique {Y1,Y2,Z2} is

b(Y1,Y2,Z2) = M(Y1,Z2) f(Y1,Y2,Z2) M(Y2,Z2),    P(Y1,Y2) = ∑Z2 b(Y1,Y2,Z2)

Y1 and Y2 are dependent, so there must be a node in the clique tree containing both (Y1,Y2).

Example: General Factorial HMM

With four hidden chains emitting each xt, moralization creates cliques of size 5, which is intractable most of the time (no tractable elimination order exists).

Some Models are Intractable for Exact Inference — Example: A Grid MRF

Generally, variable elimination on an N*N grid produces cliques of size N, which is indeed intractable. We will introduce Approximate Inference for this kind of problem later.

Variable Elimination: Dealing with Evidence

What if some variables X = {X1…XD} are given as evidence? Given evidence {B=b}, every factor touching B is reduced: f(A,B) becomes f(A,B=b), f(B,C) becomes f(B=b,C), and f(B,F) becomes f(B=b,F). Each reduced factor is a function of the remaining variables only — effectively f(A), f(C), f(F) — so a model with evidence is equivalent to another model without evidence. To infer PM(Z|X), we transform M into another model M' and infer PM'(Z).
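A small sketch of this factor reduction, assuming factors stored as NumPy arrays alongside a list of axis variable names (the helper and its input format are assumptions, not from the slides):

    import numpy as np

    def reduce_factor(table, axis_vars, evidence):
        """Clamp observed variables in a factor table, e.g. f(A,B) with {B: b} -> f(A)."""
        index = tuple(evidence.get(v, slice(None)) for v in axis_vars)
        kept = [v for v in axis_vars if v not in evidence]
        return table[index], kept

    f_ab = np.array([[0.9, 0.1], [0.2, 0.8]])                 # f(A,B)
    f_a, vars_a = reduce_factor(f_ab, ['A', 'B'], {'B': 1})   # f(A, B=1)
    print(f_a, vars_a)                                        # [0.1 0.8] ['A']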

Variable Elimination: Dealing with Evidence

If we know in advance which variables will be given, an intractable model can become a tractable one. Sometimes we want to capture more dependencies in a model, which makes inference intractable:

P(X,Z) = (1/Z) f(Z1,X1) f(Z1,X2) f(Z1,X3) · f(Z2,X1) f(Z2,X2) f(Z2,X3) · f(Z3,X1) f(Z3,X2) f(Z3,X3) · f(Z1,Z2) f(Z2,Z3)

couples every Zi with every Xj. But given X1~X3 = x, each f(Zi,Xj) reduces to a unary factor on Zi:

P(X=x, Z) = (1/Zx) fx(Z1) fx'(Z1) fx''(Z1) · fx(Z2) fx'(Z2) fx''(Z2) · fx(Z3) fx'(Z3) fx''(Z3) · f(Z1,Z2) f(Z2,Z3)

and the unary factors can be absorbed into the pairwise chain factors:

P(X=x, Z) = (1/Zx) fx'(Z1,Z2) fx'(Z2,Z3)

So given X1~X3, we actually run inference on another model M'. The conditional is

P(Z|X=x) = P(X=x,Z) / P(X=x)
         = fx'(Z1,Z2) fx'(Z2,Z3) / ∑Z1,Z2,Z3 fx'(Z1,Z2) fx'(Z2,Z3)
         = (1/Z'(x)) fx'(Z1,Z2) fx'(Z2,Z3)

M' is a chain, hence much more tractable, and the new normalization constant Z'(x) can be computed using VE.

Variable Elimination: Dealing with Evidence

Even if P(Z|X) can be inferred efficiently, learning P(X,Z) is intractable. One solution is to model P(Z|X) directly, yielding the "CRF" model.

Intractable MRF model:
P(X,Z) = (1/Z) f(Z1,X1) f(Z1,X2) f(Z1,X3) f(Z2,X1) f(Z2,X2) f(Z2,X3) f(Z3,X1) f(Z3,X2) f(Z3,X3) f(Z1,Z2) f(Z2,Z3)

Tractable CRF model:
P(Z|X) = (1/Z'(X)) fX'(Z1,Z2) fX'(Z2,Z3)

Agenda

• Introduce the concept of "Variable Elimination" in the special case of a tree-structured factor graph.

• Extend the idea of VE to general factor graphs with the concept of a "Clique Tree".

• See how to extend VE to the "Most Probable Assignment" (MAP configuration) problem.

Query 3: Most Probable Assignment

• Given evidence E = {X1=x1,…, XD=xD} and some other unspecified variables Z = {Z1,…,ZK}, the Most Probable Assignment of Z is

MPA(Z|X) = argmaxZ P(Z|X) = argmaxZ P(X|Z)P(Z) / P(X) = argmaxZ P(X|Z)P(Z)

Note that the joint maximizer differs from the collection of per-variable maximizers:

argmaxZ P(Z|X) ≠ ( argmaxZ1 P(Z1|X), …, argmaxZK P(ZK|X) )

What's the difference?

MPA goal: maxZ P(Z|X) = maxZ1 … maxZK P(Z1…ZK | X)

Likelihood goal (solved using VE): P(X) = ∑Z1 … ∑ZK P(Z1…ZK, X)

Exploiting the similarity between "max" and "∑" is the key to solving MPA using VE.

What's the difference?

Review:

P(E=e) = ∑D ∑C ∑B ∑A P(E=e|D) P(D|C) P(C|B) P(B|A) P(A)
       = ∑D P(E=e|D) ∑C P(D|C) ∑B P(C|B) ∑A P(B|A) P(A)        (the inner sums are marginals, e.g. M(B))

maxA,B,C,D P(A,B,C,D,E=e) = maxD maxC maxB maxA P(E=e|D) P(D|C) P(C|B) P(B|A) P(A)
                          = maxD P(E=e|D) maxC P(D|C) maxB P(C|B) maxA P(B|A) P(A)    (the inner maxes define M'(B) : ???)

M(B) = maxA F(A,B) : the max-marginal of B

For every choice of B, we record A*(B) = argmaxA F(A,B), so that M(B) = F(A*(B), B).

F(A,B):      b1  b2  b3
       a1     1   3   9
       a2     2   5   8
       a3     4   7   6

A*(B):  (b1) a3   (b2) a3   (b3) a1
M(B):   (b1) 4    (b2) 7    (b3) 9

Most Probable Assignment on a Chain

maxA,B,C,D P(A,B,C,D,E=e) = maxD P(E=e|D) maxC P(D|C) maxB P(C|B) maxA P(B|A) P(A)

The elimination proceeds exactly as in sum-product, with "max" in place of "∑", storing an argmax table at every step:

1. F(A,B) = P(B|A) P(A);        M(B) = maxA F(A,B), recording A*(B).
2. F(B,C) = P(C|B) M(B);        M(C) = maxB F(B,C), recording B*(C).
3. F(C,D) = P(D|C) M(C);        M(D) = maxC F(C,D), recording C*(D).
4. F(D,E=e) = P(E=e|D) M(D);    M = maxD F(D,E=e), recording D*.

What we get:  M = maxA,B,C,D P(A,B,C,D,E=e).
What we want:  (A*,B*,C*,D*) = argmaxA,B,C,D P(A,B,C,D,E=e).

Traceback through the stored tables recovers it: read D* from the last table, then C*(D*), then B*(C*(D*)), then A*(B*(…)) — e.g. (a1, b1, c2, d2).
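A minimal sketch of this max-product (Viterbi) recursion with traceback on a chain, assuming pairwise tables that already absorb the conditional probabilities (Python/NumPy; names are illustrative):

    import numpy as np

    def chain_map(pair_factors):
        """Most probable assignment of a chain: forward max-marginals with argmax
        tables, then a backward traceback. pair_factors[i] is a (K, K) table."""
        m = np.ones(pair_factors[0].shape[0])
        back = []
        for f in pair_factors:
            F = f * m[:, None]                 # F(x_i, x_{i+1}) = f(x_i, x_{i+1}) M(x_i)
            back.append(F.argmax(axis=0))      # x_i*(x_{i+1})
            m = F.max(axis=0)                  # M(x_{i+1}) = max_{x_i} F
        states = [int(m.argmax())]
        for b in reversed(back):               # traceback along the argmax pointers
            states.append(int(b[states[-1]]))
        return states[::-1], float(m.max())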

It’s straight forward to generalizeGraph algorithm above to case of general graph with similarity of “∑” and “max”.

(The difference is there must be a “traceback” procedure to find the “argmax” after we get “max”.)

Y1Y2 Y2Y3 Z2 Z3

Y1 Y2 Y3

Z1 Z2 Z3 Y1 Y2 Z1Z2 Z2Z3

x1 x2 x3

Y1 Y2 Y3 Z1 Z2 Z3 X1 X2 X3 Summary

• To solve inference problems such as the likelihood of X, P(Z|X), and the Most Probable Assignment, we can use the Variable Elimination (Sum-Product / Max-Product) algorithm.

• For a tree-structured factor graph, we just run 2 passes of VE — from the leaves to a root, and the reverse.

• For a general-structured graph, we must find a "good" elimination order inducing the smallest "maximum clique", which is often done with a greedy method.

• When we know in advance which variables will be given, we can derive from the original model M a much simpler model M' with the evidence absorbed, which is more tractable for inference & learning.

Deterministic (Variational) Approximate Inference

References:
• Bayesian Reasoning and Machine Learning, Ch. 28 (David Barber)
• Probabilistic Graphical Models, Ch. 11 (Koller & Friedman)
• Pattern Recognition and Machine Learning, Ch. 10 (Bishop)

In terms of difficulty, there are 3 types of inference problems.

• Inference which is easily solved with Bayes rule.

• Inference which is tractable using some dynamic programming technique. (e.g. Variable Elimination or J-tree algorithm )

Today’s focus • Inference which is proved intractable & should be solved using some Approximate Method. (e.g. Approximation with Optimization or Sampling technique.) Agenda

• Principle of Variational Approximation

• Global Approximation (Mean Field Approximation)

• Message Approximation (Expectation Propagation)

Intractable Inference

Example: an N*N grid MRF (N=3, variables A…I).

What we can solve (tractable): evaluate the unnormalized distribution

P̃(A,…,I) = ∏(X1,X2) φ(X1,X2),    where    P(A,…,I) = (1/Z) P̃(A,…,I)

What we cannot solve (intractable as N increases):

Z = ∑(A,…,I) P̃(A,…,I)    and marginals such as    P̃(A) = ∑(B,…,I) P̃(A,…,I)

Intractable Inference

Example: an N-layer Factorial HMM.

What we can solve (easy): evaluate the joint

P(W,Y,Z,X=x) = P(W) P(Y) P(Z) P(X=x|W,Y,Z)

and hence the unnormalized posterior P(W,Y,Z|X=x) = (1/Z(X=x)) P(W,Y,Z,X=x).

What we cannot solve (hard):

P(X=x) = ∑W,Y,Z P(W,Y,Z,X=x)    and    P̃(Z1=z|X=x) = ∑(W,Y,Z2…Z3) P(W,Y,Z1…Z3,X=x)

Intractable Inference

Example: Latent Topic Model. Some intractability comes not from the "structure", but from passing messages between different types of distributions.

Let Mixθ = (θ1,…,θK), K = number of topics, with document prior Mixθ ~ Dir(α). For each word position i, P(Topici = Zi | Mixθ) = θZi, so the message from position i into the topic mixture is

MTopici→Mixθ(θ) = ∑Zi θZi P(wi|Zi)

There is no compact representation for the resulting message product:

P̃(Mixθ = θ | w) = P(Mixθ = θ) · MTopic1→Mixθ(θ) · MTopic2→Mixθ(θ)
  = const · ( ∏k θk^(αk−1) ) · ( ∑Z1 θZ1 P(w1|Z1) ) ( ∑Z2 θZ2 P(w2|Z2) )

Expanding the product yields one Dirichlet-like term per joint assignment (Z1, Z2, …) — exponential in the number of variables — and the normalizing integral

∫ P̃(Mixθ | w) dθ = const · ∑Z1 ∑Z2 P(w1|Z1) P(w2|Z2) ∫ ∏k θk^(αk−1+1[k=Z1]+1[k=Z2]) dθ

involves a summation that is intractable (exponential in the number of variables).

Principle of Variational Approximation

Let X: observations, Z: hidden variables.

Find an approximate distribution Q(Z) from a "tractable family" that is most similar to the target distribution P(Z|X), as measured by a distance such as the KL divergence. Two variants:

Q*(Z) = argminQ(Z) KL( P(Z|X) || Q(Z) )      or      Q*(Z) = argminQ(Z) KL( Q(Z) || P(Z|X) )

(Picture: P(Z|X) lies in an intractable family; Q(Z) is its projection onto the tractable family.)

How can we optimize Q(Z) without computing P(Z|X)?

Agenda

• Principle of Variational Approximation

• Global Approximation (Mean Field Approximation)

• Message Approximation (Expectation Propagation)

Global Approximation

One of the most popular tractable families is the Factorized Distribution, which assumes the target (posterior) distribution P(Z|X) can be factorized into q(Z1) q(Z2) … q(ZN), i.e. the variables are treated as independent of each other.

Example: for the grid MRF over A,…,I,

P(A,…,I) = (1/Z) P̃(A,…,I)    is approximated by    Q(Z) = q(Z1) q(Z2) … q(ZN) = q(A) q(B) … q(I)

How to Find Q*(Z)?

Q*(Z) = argminQ(Z)∈Tractable Family KL( Q(Z) || P(Z|X) )

KL( Q(Z) || P(Z|X) ) = EQ(Z)[ log(1/P(Z|X)) − log(1/Q(Z)) ]
  = EQ(Z)[ log Q(Z) − log P(Z|X) ]
  = EQ(Z)[ log Q(Z) − log P(Z,X) + log P(X) ]
  = EQ(Z)[ log Q(Z) ] − EQ(Z)[ log P(Z,X) ] + log P(X)

The first two terms are tractable if Q(Z) is tractable; log P(X) is intractable but independent of Q(Z). The resulting problem is therefore equivalent to:

Q*(Z) = argmaxQ(Z)∈Tractable Family EQ(Z)[ log( P(Z,X)/Q(Z) ) ]

i.e. find the Q(Z) that puts weight similar to P(Z,X) on each assignment Z=z.

How to Find Q*(Z)?

Q*(Z) = argmaxQ(Z)=q(Z1)q(Z2)…q(ZN) EQ(Z)[ log( P(Z,X)/Q(Z) ) ]

We maximize with respect to one q(Zn) at a time, fixing all the others:

maxq(Z1)  EQ(Z)[ log P(Z,X) ] − EQ(Z)[ log Q(Z) ]
  = Eq(Z1)[ Eq(Z2)…q(ZN)[ log P(Z,X) ] ] − Eq(Z1)[ log q(Z1) ] − ∑k≠1 Eq(Zk)[ log q(Zk) ]

The inner expectation over the other variables defines log P̂(Z1,X) up to a constant, and the last sum is independent of q(Z1). So the problem becomes

maxq(Z1)  Eq(Z1)[ log( P̂(Z1,X)/q(Z1) ) ] + const.  =  −KL( q(Z1) || P̂(Z1,X) ) + const.

which is maximized by q*(Z1) ∝ P̂(Z1,X), i.e.

log q*(Z1) = Eq(Z2)…q(ZN)[ log P(Z,X) ] + const. = ∑f : factors involving Z1 E[ log f(Zk1…Zkm) ] + const.

How to Find Q*(Z)?

Example: on the grid MRF with P(A,…,I) = (1/Z) P̃(A,…,I) ≈ q(A) q(B) … q(I), given q(B)…q(I) fixed, maximize with respect to q(A):

log q̃*(A) = Eq(B)…q(I)[ log P̃(A,…,I) ] = Eq(B)[ log φ(A,B) ] + Eq(D)[ log φ(A,D) ] + const.

(only the factors touching A survive). Similarly, for q(B):

log q̃*(B) = Eq(A)q(C)…q(I)[ log P̃(A,…,I) ] = Eq(A)[ log φ(A,B) ] + Eq(C)[ log φ(B,C) ] + Eq(E)[ log φ(B,E) ] + const.

Iterate over all variables, until convergence!
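A compact sketch of these coordinate-ascent (mean-field) updates for a pairwise MRF, assuming binary variables and log-potentials supplied as (2, 2) tables keyed by edge (Python/NumPy; the input format is an assumption):

    import numpy as np

    def mean_field(log_phi, n_iters=50):
        """Mean-field updates: q(Z_v) ∝ exp( sum over edges at v of E_q[log phi] )."""
        nodes = sorted({v for edge in log_phi for v in edge})
        q = {v: np.full(2, 0.5) for v in nodes}          # uniform initialization
        for _ in range(n_iters):
            for v in nodes:
                s = np.zeros(2)
                for (i, j), lp in log_phi.items():       # lp[a, b] = log phi(x_i=a, x_j=b)
                    if i == v:
                        s += lp @ q[j]                   # E_{q(Z_j)}[log phi(Z_v, Z_j)]
                    elif j == v:
                        s += lp.T @ q[i]
                e = np.exp(s - s.max())                  # normalize log q*(Z_v)
                q[v] = e / e.sum()
        return q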

This coordinate ascent is guaranteed to converge to a stationary point of max EQ(Z)[ log( P(Z,X)/Q(Z) ) ]. Why? Every update can only increase the objective function (each step sets q(Zk) to the exact optimum of its KL term, and KL(q||p) = 0 only if q(zk) = p(zk)); since the maximum is bounded, we are guaranteed convergence.

Agenda

• Principle of Variational Approximation

• Global Approximation (Mean Field Approximation)

• Message Approximation (Expectation Propagation)

Message Approximation

Example: an N*N grid MRF (N=3). Variable elimination builds a clique tree with nodes F1(A,B,C), F2(A,B,C,D,E,F), F3(D,E,F,G,H,I): eliminating A,B,C sends the message MF1F2(A,B,C); eliminating D,E,F sends MF2F3(D,E,F); eliminating G,H,I closes the tree.

The elimination

MF2F3(D,E,F) = ∑A,B,C MF1F2(A,B,C) F2(A,B,C,D,E,F)

is intractable (exponential in N). However, can we approximate the message MFi→Fj(…) to make the elimination tractable? → Assume it is factorized!

Message Approximation

Approximate the message by a factorized distribution:

MF1F2(A,B,C) ≈ q(A) q(B) q(C)

Then A, B, C are no longer entangled: q(A) q(B) q(C) F2(A,B,C,D,E,F) forms a tree, so we can compute its marginals with the sum-product algorithm. How do we obtain the next "factorized message" q(D) q(E) q(F)?

How to Obtain a Factorized Message?

One option, minimizing KL( q(D)q(E)q(F) || MF2F3(D,E,F) ) over q(D)q(E)q(F), must be solved iteratively for q(D), q(E), q(F). Is there a more efficient way? Consider the reverse direction:

min over q(D)q(E)q(F) of KL( MF2F3(D,E,F) || q(D)q(E)q(F) )

KL( M(D,E,F) || q(D)q(E)q(F) ) = EM(D,E,F)[ log( M(D,E,F) / (q(D)q(E)q(F)) ) ]
  = EM[ log( M(D,E,F) / (M(D)M(E)M(F)) ) ] + EM[ log( M(D)/q(D) ) ] + EM[ log( M(E)/q(E) ) ] + EM[ log( M(F)/q(F) ) ]
  = KL( M(D,E,F) || M(D)M(E)M(F) ) + ∑X∈{D,E,F} KL( M(X) || q(X) ) + const.

So simply set q*(X) = M(X) — set q(D), q(E), q(F) equal to the marginals of the message.

2-Layer Sum-Product Algorithm with Approximate Messages

On the clique tree F1(A,B,C) — F2(A,B,C,D,E,F) — F3(D,E,F,G,H,I), each message is kept factorized: eliminating A,B,C passes M(A)M(B)M(C); eliminating D,E,F passes M(D)M(E)M(F); the reverse pass sends M'(A)M'(B)M'(C) and M'(D)M'(E)M'(F). Elimination is easy because the factors inside every clique form a tree, so the needed marginals (e.g. M(D), M(E), M(F)) are computed by an inner sum-product algorithm.

Approximate Message: Expectation Propagation

The previous example is a special case of "Expectation Propagation". General EP uses distributions from the log-linear (exponential) family, which includes the Gaussian, Multinomial, Poisson, and Dirichlet distributions:

Qθ(X) = (1/Z(θ)) exp( θT f(X) ),    Z(θ) = ∑X exp( θT f(X) ),    ∂/∂θ log Z(θ) = EQθ(X)[ f(X) ]

where f(X) = [ f1(X), f2(X), …, fD(X) ]T are sufficient statistics (features) derived from X.

minθ KL( P(X) || Qθ(X) ) = EP(X)[ log P(X) ] − EP(X)[ log Qθ(X) ]    (the first term is constant in θ)

⇔ maxθ EP(X)[ log Qθ(X) ] = EP(X)[ θT f(X) ] − log Z(θ)

Setting the gradient to zero, ∂/∂θ { EP(X)[ θT f(X) ] − log Z(θ) } = 0, gives

EP(X)[ f(X) ] = EQθ(X)[ f(X) ]    — Moment Matching!! Match the expected features of the approximation to those of the original message.

Moment Matching: EP(X)[ f(X) ] = EQθ(X)[ f(X) ]

Example 1. Q(X) is Multinomial(θ): fk(X) = 1[X=k], and EQθ[ 1[X=k] ] = Qθ(X=k), so moment matching sets equal marginal probabilities, Qθ(X=k) = P(X=k) — as in the previous MRF example.

Example 2. Q(X) is Gaussian(μ,Σ): f1(X) = X, f2(X) = X XT. Moment matching sets equal mean and variance:

EQθ(X)[ X ] = μ = EP(X)[ X ]
EQθ(X)[ X XT ] = Σ + μ μT = EP(X)[ X XT ],    i.e.    Σ = VarP(X)[ X ]
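A one-function sketch of this Gaussian moment matching for a 1-D mixture of Gaussians — the situation that arises in the tracking example below (plain Python; names are illustrative):

    def match_moments(weights, means, variances):
        """Collapse sum_i w_i N(mu_i, s2_i) into a single Gaussian N(mu_Q, s2_Q)
        by matching the mean and variance (law of total variance)."""
        mu = sum(w * m for w, m in zip(weights, means))
        s2 = sum(w * (v + (m - mu) ** 2) for w, m, v in zip(weights, means, variances))
        return mu, s2

    # two components, e.g. observation either comes from the target or is noise
    print(match_moments([0.7, 0.3], [1.0, 4.0], [0.5, 2.0]))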

Example: Using EP to Handle a Continuous/Discrete BN

When a BN contains both discrete and continuous variables, messages cannot keep a compact representation: a Gaussian times a discrete switch becomes a mixture of Gaussians, then a mixture of mixtures of Gaussians, and so on — the number of components explodes exponentially.

Tracking positions: Xt is the tracked position, Ot the sensor observation, and Zt = 1 if Ot comes from Xt, Zt = 2 if Ot is noise.

To prevent the message from growing into exponentially many mixtures of Gaussians, approximate each message with a single Gaussian Q(Xt) by expectation (moment) matching. The exact message

M(X1) = ∑Z1 P(Z1) P(X1) P(O1|Z1,X1) = w1 N(X; μ1, Σ1) + w2 N(X; μ2, Σ2)

is approximated by N(μQ, ΣQ) with

μQ = EM(X1)[ X1 ] = w1 μ1 + w2 μ2

ΣQ = VarM(X1)[ X1 ] = EP(Z1)[ Var[X1|Z1] ] + VarP(Z1)[ E[X1|Z1] ]
   = w1 Σ1 + w2 Σ2 + w1 (μ1−μQ)(μ1−μQ)T + w2 (μ2−μQ)(μ2−μQ)T

Agenda

• Principle of Variational Approximation

• Global Approximation (Mean Field Approximation)

• Message Approximation (Expectation Propagation)

• Comparison

Mean Field Approximation vs. Expectation Propagation

• Both find a tractable distribution (e.g. a factorized distribution) Q(Z) to approximate the real distribution.

• Mean Field approximates the joint posterior distribution P(Z|X), minimizing KL(Q||P). (Why not KL(P||Q)? 1)

• Expectation Propagation approximates messages, minimizing KL(P||Q). (Why not KL(Q||P)? 2)

• Expectation Propagation needs only one pass of Sum-Product, while Mean Field needs iterative maximization.

• min KL(Q||P) has more false negatives (it misses mass of the truth). (Why? 3)
• min KL(P||Q) has more false positives (it puts mass where the truth has little). (Why? 4)

(Figure: green = Mean Field, red = EP, blue = truth.)

 log Z()  E [ f (X )]  Q (X)

 1  log Z()  exp  T f(X)   Z()  X

1  exp  T f(X) *f(X)  Q (X ) f (X )  E [ f (X )]    Q ( X ) Z() X X

Back Particle-Based Approximate Inference on Graphical Model

References:
• Probabilistic Graphical Models, Ch. 12 (Koller & Friedman)
• CMU 10-708 Probabilistic Graphical Models, Fall 2009, Lectures 18–19 (Eric Xing)
• Pattern Recognition and Machine Learning, Ch. 11 (Bishop)

In terms of difficulty, there are 3 types of inference problems.

• Inference which is easily solved with the Bayes rule.

• Inference which is tractable using some dynamic programming technique (e.g. Variable Elimination or the Junction-tree algorithm).

• (Today's focus) Inference which is provably intractable and must be solved with some approximate method (e.g. approximation by optimization, or sampling techniques).

Agenda

• When to use Particle-Based Approximate Inference ?

• Forward Sampling & Importance Sampling

• Markov Chain Monte Carlo (MCMC)

• Collapsed Particles

Example: General Factorial HMM

After moralization, cliques of size 5 appear — intractable most of the time (no tractable elimination order exists). Similarly, a grid MRF yields cliques of size N for an N*N grid. Approximate inference is needed.

General Idea of Particle-Based (Monte Carlo) Approximation

Most of the queries we want can be written as

EP(X)[ f(X) ] = ∑X1 … ∑XK P(X1…XK) f(X1…XK)

which is intractable most of the time (exponentially many terms as K grows). Assume we can generate i.i.d. samples X(1)…X(N) from P(X); then we approximate the expectation by

f̂ = (1/N) ∑n=1..N f(X(n))

This is an unbiased estimator whose variance converges to 0 as N→∞:

E[f̂] = (1/N) ∑n E[ f(X(n)) ] = E[ f(X) ]
Var[f̂] = (1/N²) ∑n Var[ f(X(n)) ] = Var[ f(X) ] / N  →  0 as N→∞

Note the variance is not related to the dimension of X.

Which Problems Can Use Particle-Based (Monte Carlo) Approximation?

• Types of queries:

– 1. Likelihood of evidence/assignments of variables.
– 2. Conditional probability of some variables (given others).
– 3. Most probable assignment for some variables (given others).

i.e. any problem which can be written in the following form:

EP(X)[ f(X) ] = ∑X1 … ∑XK P(X1…XK) f(X1…XK)

Marginal Distribution (Monte Carlo)

To compute the marginal distribution of Xk:

P(Xk = xk) = ∑X−k P(Xk = xk, X−k) = ∑X P(X) · 1{Xk = xk} = EP(X)[ 1{Xk = xk} ]

Particle-based approximation:

f̂ = (1/N) ∑n=1..N 1{Xk(n) = xk}

(Just count the proportion of samples in which Xk = xk.)
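The counting estimator in code form, as a trivial sketch (Python; samples are assumed to be tuples of variable values):

    def marginal_estimate(samples, k, xk):
        """Monte Carlo estimate of P(X_k = xk): the fraction of joint samples
        whose k-th coordinate equals xk."""
        return sum(1 for s in samples if s[k] == xk) / len(samples)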

Marginal Joint Distribution (Monte Carlo)

To compute the marginal distribution of a pair (Xi, Xj):

P(Xi = xi, Xj = xj) = ∑X−ij P(Xi = xi, Xj = xj, X−ij) = EP(X)[ 1{Xi = xi & Xj = xj} ]

Particle-based approximation:

f̂ = (1/N) ∑n=1..N 1{Xi(n) = xi, Xj(n) = xj}

(Just count the proportion of samples in which Xi = xi and Xj = xj.)

So What's the Problem?

Note what we can do: "evaluate" the probability/likelihood P(X1=x1,…, XK=xK).

What we cannot do: summation/integration in high-dimensional space, ∑X P(X1,…, XK).

What we want to do (for approximation): "draw" samples from P(X1,…, XK) — and, beyond that, make better use of the samples and know when we've sampled enough.

How to Draw Samples from P(X)?

• Forward Sampling: draw from ancestors to descendants in a BN.

• Rejection Sampling: create samples with forward sampling, and reject those inconsistent with the evidence.

• Importance Sampling: sample from a proposal distribution Q(X), but give larger weights to samples with high likelihood under P(X).

• Markov Chain Monte Carlo: define a transition distribution T(x→x') such that the samples get closer and closer to P(X).

Agenda

• When to use Particle-Based Approximate Inference ?

• Forward Sampling & Importance Sampling

• Markov Chain Monte Carlo (MCMC)

• Collapsed Particles

Forward Sampling

The Burglary network (Alarm has parents Burglary and Earthquake; JohnCalls and MaryCalls depend on Alarm):

P(B=t) = 0.001,    P(E=t) = 0.002
P(A=t | B,E):   (t,t) 0.95    (t,f) 0.94    (f,t) 0.29    (f,f) 0.01
P(J=t | A):     (t) 0.90    (f) 0.05
P(M=t | A):     (t) 0.70    (f) 0.01

Sample parents before children: e.g. draw ~b, then e, then a ~ P(A|~b,e), then j ~ P(J|a) and ~m ~ P(M|a), giving the i.i.d. sample (~b, e, a, j, ~m).

Forward Sampling

Samples (~b,e,a,j,~m), … form a particle-based representation of the joint distribution P(B,E,A,J,M):

P(M = m) ≈ (1/N) ∑n 1{M(n) = m}
P(B = b, M = ~m) ≈ (1/N) ∑n 1{B(n) = b, M(n) = ~m}

What if we want samples from P(B,E,A | J=j, M=~m)?

1. Collect all samples in which J=j and M=~m.
2. Those samples form the particle-based representation of P(B,E,A | J=j, M=~m).

Disadvantage…
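A runnable sketch of forward sampling plus this rejection step, grounded in the CPTs of the Burglary network above (Python; the sample count is illustrative):

    import random

    def forward_sample():
        """One ancestral sample (B, E, A, J, M) from the Burglary network."""
        b = random.random() < 0.001
        e = random.random() < 0.002
        p_a = {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.01}[(b, e)]
        a = random.random() < p_a
        j = random.random() < (0.90 if a else 0.05)
        m = random.random() < (0.70 if a else 0.01)
        return b, e, a, j, m

    samples = [forward_sample() for _ in range(200_000)]
    kept = [s for s in samples if s[3] and not s[4]]      # evidence J=j, M=~m
    print(len(kept), sum(s[0] for s in kept) / len(kept)) # P(B=t | J=j, M=~m)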

Forward Sampling from P(Z|Data)?

(Model: hidden variables Z1…ZK with observed children X1=1, X2=0, X3=1, …, XK=0.)

1. Forward sample N times.

2. Collect all samples (Z(n), X(n)) in which X1=1, X2=0, X3=1, …, XK=0.

3. Those samples form the particle-based representation of P(Z|X).

How many such samples do we get? N·P(Data)!! (Less than 1 if N is not large enough…) Solutions…

Importance Sampling to the Rescue

We need not draw from P(X) to compute EP(X)[f(X)]:

EP(X)[ f(X) ] = ∑X P(X) f(X) = ∑X Q(X) · ( P(X)/Q(X) ) f(X) = EQ(X)[ ( P(X)/Q(X) ) f(X) ]

ÊP(X)[ f(X) ] = (1/N) ∑n=1..N ( P(X(n))/Q(X(n)) ) f(X(n))

That is, we can draw from an arbitrary distribution Q(X), but give larger weights to samples having higher probability under P(X).

Importance Sampling to the Rescue

Sometimes we can only evaluate an unnormalized distribution P̃(X), where P(X) = P̃(X)/Z. Then we can estimate Z as follows:

Z = ∑X P̃(X) = ∑X Q(X) · P̃(X)/Q(X) = EQ(X)[ P̃(X)/Q(X) ],    so    Ẑ = (1/N) ∑n P̃(X(n))/Q(X(n))

Note that we can compute Ẑ only if we can evaluate a normalized Q(X) — that is, we know ZQ, or Q comes from a BN. Combining the two estimates gives the self-normalized estimator

ÊP(X)[ f(X) ] = [ ∑n ( P̃(X(n))/Q(X(n)) ) f(X(n)) ] / [ ∑n P̃(X(n))/Q(X(n)) ]

Importance Sampling from P(Z|Data)?

(Model: hidden Z1…ZK, each taking values in {A,B}, with observed X = (1, 0, 1, …, 0).)

1. Sample Z(n) from P(Z), the normalized distribution obtained from the BN by truncating the observed part.

2. Give each sample (Z(n), X(n)) a weight

w(n) = P̃(Z)/Q(Z) = P(Z) P(Data|Z) / P(Z) = P(Data|Z) = P(1|A) P(0|B) P(1|A) … P(0|A)

3. The effective number of samples is Neff = ∑n w(n), and P̂(Data) = (1/N) ∑n w(n) = (1/N) ∑n P(Data|Z(n)).

Example (N = 3):
Z(1) = (A,B,A,…,A), w = 0.01;   Z(2) = (A,B,A,…,B), w = 0.3;   Z(3) = (B,B,B,…,A), w = 1.0
Neff = 1.31,    P̂(Data) = Neff/N = 1.31/3

The same weighted samples answer any query about P(Z|Data):

P̂(Z1 = B | Data) = (0.01·0 + 0.3·0 + 1.0·1) / 1.31 ≈ 0.76

P̂(Z1 = A, ZK = B | Data) = (0.01·0 + 0.3·1 + 1.0·0) / 1.31 ≈ 0.23

Any joint distribution can be estimated (no "out of clique" problem).

Bayesian Treatment with Importance Sampling

(Plate model: parameters θ generate Y from X for each of N data points, e.g. Pθ(Y=1|X) = logistic(θ1·X + θ0).)

Often, the posterior over the parameters θ,

P(θ|Data) = P(Data|θ) P(θ) / P(Data) = P(Data|θ) P(θ) / ∫ P(Data|θ) P(θ) dθ

is intractable, because many types of P(Data|θ) cannot be integrated analytically. Drawing θ(n) ~ P(θ) and weighting by the likelihood gives

P̂(θ = a | Data) = [ ∑n P(Data|θ(n)) 1{θ(n) = a} ] / [ ∑n P(Data|θ(n)) ]

so we need not evaluate "the integration" to estimate P(θ|Data) using importance sampling.

What's the Problem?

(Figure: a peaked target P(X) next to a mismatched proposal Q(X) over Val(X).)

If P(X) and Q(X) are not matched properly, only a small number of samples fall in the region where P(X) is high, so a very large N is needed to get a good picture of P(X).

How Do P(Z|X) and Q(Z) Match?

When the evidence is close to the roots, forward sampling gives a good Q(Z): it generates samples with high likelihood under P(Z|X), so Q(Z) is close to P(Z|X).

How Do P(Z|X) and Q(Z) Match?

When the evidence is on the leaves, forward sampling gives a bad Q(Z), far from P(Z|X): it yields samples with very low likelihood P(X|Z), so we need a very large sample size to get a good picture of P(Z|X).

Can we improve over time and draw from a distribution more and more like the desired P(Z|X)? MCMC tries to draw from a distribution closer and closer to P(Z|X). (It applies equally well to BNs and MRFs.)

Agenda

• When to use Particle-Based Approximate Inference ?

• Forward Sampling & Importance Sampling

• Markov Chain Monte Carlo (MCMC)

• Collapsed Particles

What is a Markov Chain (MC)?

A Markov chain is a set of random variables X = (X1,…,XK) that change with time, X(t) = (X1(t),…,XK(t)), taking transitions following

P(X(t+1) = x' | X(t) = x) = T(x→x')

A stationary distribution πT(X) for the transition T satisfies

πT(X = x') = ∑x πT(X = x) · T(x→x')

(after a transition, the distribution over all possible configurations is unchanged).

Example: an MC with a single variable X taking values {x1, x2, x3} and transition matrix

T = [ 0.25  0    0.75
      0     0.7  0.3
      0.5   0.5  0   ]

has stationary distribution πT = (0.2, 0.5, 0.3), since πT · T = πT.

What is MCMC (Markov Chain Monte Carlo)?

Importance sampling is efficient only if Q(X) matches P(X) well, and finding such a Q(X) is difficult. Instead, MCMC tries to find a transition distribution T(x→x') such that X tends to move into states with high P(X) and finally follows the stationary distribution πT = P(X).

Setting X(0) to any initial value, we sample X(1), X(2), …, X(M) following T(x→x') and hope that X(M) follows the stationary distribution πT = P(X). If it does, X(M) is a sample from P(X).

Why will the MC converge to a stationary distribution? There is a simple, useful sufficient condition: a "regular" Markov chain (for a finite state space) is one in which any state x can reach any other state x' with probability > 0 (e.g. all entries of the potentials/CPDs are > 0); then X(M) follows a unique πT once M is large enough.

How to Define T(x→x')? — Gibbs Sampling

Gibbs Sampling is the most popular choice in graphical models. In a graphical model:

it is easy to draw a sample of each individual variable given the others, P(Xk|X−k), while drawing from the joint distribution of (X1,X2,…,XK) is difficult.

So we define T(X→X') in Gibbs sampling as taking transitions for X1 ~ XK in turn, with transition distributions T1(x1→x1'), …, TK(xK→xK'), where

Tk(xk→xk') = P(Xk = xk' | X−k)      (redraw Xk from its conditional distribution given all the others)

In a graphical model, P(Xk = xk' | X−k) = P(Xk = xk' | MarkovBlanket(Xk)).

Gibbs Sampling for MRF

Gibbs Sampling:
1. Initialize all variables randomly.
2. For t = 1~M, for every variable X: draw X(t) from P(X | N(X)(t−1)), where

P(X=1 | N(X)) = ∏Y∈N(X) φ(X=1,Y) / [ ∏Y∈N(X) φ(X=1,Y) + ∏Y∈N(X) φ(X=0,Y) ]

with the pairwise potential φ(X,Y): φ(0,0)=5, φ(0,1)=1, φ(1,0)=1, φ(1,1)=9 (agreement is favored).

Example on a 3×3 binary grid: at t=2, the central node, with two neighbors at 1 and two at 0, gets
P(X=1 | N(X)) = (1·9·9·1) / (1·9·9·1 + 5·1·1·5) ≈ 0.76;
at t=3, with all four neighbors at 1,
P(X=1 | N(X)) = 9·9·9·9 / (9·9·9·9 + 1·1·1·1) ≈ 0.99,
so the grid is pulled toward agreeing configurations.

When M is large enough, X(M) follows the stationary distribution

πT(X) = P(X) = (1/Z) ∏C φ(XC)

(Regularity holds: all entries of the potential are positive.)
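The whole sampler in a few lines, following the update rule above (Python; the grid size and sweep count are illustrative):

    import random

    def gibbs_grid(n, phi, n_sweeps):
        """Gibbs sampling on an n x n binary grid MRF with shared pairwise
        potential phi[a][b]; returns the configuration after n_sweeps sweeps."""
        x = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
        def neighbors(i, j):
            return [(a, b) for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= a < n and 0 <= b < n]
        for _ in range(n_sweeps):
            for i in range(n):
                for j in range(n):
                    p1 = p0 = 1.0
                    for a, b in neighbors(i, j):       # P(X=1 | N(X)) from the slide
                        p1 *= phi[1][x[a][b]]
                        p0 *= phi[0][x[a][b]]
                    x[i][j] = 1 if random.random() < p1 / (p1 + p0) else 0
        return x

    print(gibbs_grid(3, phi=[[5, 1], [1, 9]], n_sweeps=100))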

Why Does Gibbs Sampling Have πT = P(X)?

To prove that P(X) is the stationary distribution, we show P(X) is invariant under each Tk(xk→xk'). Assume (X1,…,XK) currently follows P(X) = P(Xk|X−k) P(X−k).

1. After Tk(xk→xk'), X−k still follows P(X−k), because those variables are unchanged.

2. Since Tk(xk→xk') = P(Xk=xk'|X−k) — the new state is independent of the current value xk — Xk(t) still follows P(Xk|X−k).

So after T1(x1→x1'), …, TK(xK→xK'), X = (X1,…,XK) still follows P(X). (Uniqueness and convergence are guaranteed by the regularity of the MC.)

Gibbs Sampling Does Not Always Work

It fails when drawing from an individual variable is not possible (we can evaluate P(Y|X) but not P(X|Y)):

• Non-linear dependency, e.g.
P(Y|X) = N( w0 + w1 X + w2 X², σ² ),   P(Y|X) = logistic( w0 + w1 X1 ),   P(Y|X) = N( ∑n K(X, X(n)), σ² )  (kernel trick)
where P(X|Y) = P(Y|X)P(X) / ∫X P(Y|X)P(X) dX involves an intractable integration.

• Large state space: in structure learning, the state space is {G1, G2, G3, …}, and
P(G|Data) = P(Data|G)P(G) / ∑G P(Data|G)P(G)
has too large a state space for the summation.

Other MCMC methods like Metropolis-Hastings are needed (see the references).

Metropolis-Hastings — MCMC

Metropolis-Hastings (M-H) is a general MCMC method to sample from P(X|Y) whenever we can evaluate P(Y|X); evaluating P(X|Y) itself is not needed.

In M-H, instead of drawing from P(X|Y), we draw a proposal from another distribution T(x→x') based on the current sample x, and Accept the proposal with probability

P(accept x→x') = 1,                                   if P(x')T(x'→x) ≥ P(x)T(x→x')
               = P(x')T(x'→x) / ( P(x)T(x→x') ),      otherwise

Example: P(X) = N(μ, σ²)

Proposal distribution T(x→x') = N(x, 0.2²):

P(accept x→x') = 1,                              if |x' − μ| ≤ |x − μ|
               = N(x'; μ, σ²) / N(x; μ, σ²),     otherwise

(T(x→x') = T(x'→x) in this case, so the proposal terms cancel.)
(Figure: red = rejected proposals, green = accepted proposals.)
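This random-walk M-H sampler as a short sketch (Python; the step size and the log-density guard are implementation choices, not from the slides):

    import math, random

    def mh_gaussian(mu, sigma, n_steps, step=0.2):
        """Metropolis-Hastings for a N(mu, sigma^2) target with the symmetric
        proposal x' ~ N(x, step^2); returns the whole chain."""
        def log_p(x):                       # log target, up to a constant
            return -0.5 * ((x - mu) / sigma) ** 2
        x, chain = 0.0, []
        for _ in range(n_steps):
            x_new = x + random.gauss(0.0, step)
            # symmetric proposal: accept with probability min(1, P(x')/P(x))
            if math.log(random.random() + 1e-300) < log_p(x_new) - log_p(x):
                x = x_new
            chain.append(x)
        return chain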

Example: Structure Posterior P(G|Data)

Proposal distribution: T(G→G') = P(add/remove a randomly chosen edge of G → G')

P(accept G→G') = 1,                              if P(Data|G') ≥ P(Data|G)
               = P(Data|G') / P(Data|G),         otherwise

(T(G→G') = T(G'→G) in this case.)

Why Does Metropolis-Hastings Have πT = P(X)?

Detailed balance — a sufficient condition: if πT(x') T(x'→x) = πT(x) T(x→x'), then πT(x) is stationary under T.

Given the desired πT(x) = P(X) and a proposal T(x→x'), we can satisfy detailed balance using an acceptance probability A(x→x'). Assume P(x')T(x'→x) ≤ P(x)T(x→x'); then we know

P(x') T(x'→x) · 1 = P(x) T(x→x') · [ P(x')T(x'→x) / ( P(x)T(x→x') ) ]

so defining

A(x→x') = 1,                                 if P(x')T(x'→x) ≥ P(x)T(x→x')
        = P(x')T(x'→x) / ( P(x)T(x→x') ),    otherwise

makes the effective transitions T(x→x')A(x→x') and T(x'→x)A(x'→x) balance.

How to Collect Samples?

Assume we want to collect N samples. Two options:

1. Run N chains of MCMC and collect each chain's M-th sample. Cost = M·N steps; the samples are independent.

2. Run 1 chain of MCMC and collect its (M+1)-th ~ (M+N)-th samples. Cost = M+N steps; the samples are correlated.

What's the difference?

Comparison

E[f̂] = E[ (1/N) ∑n f(X(n)) ] = (1/N) ∑n E[ f(X(n)) ] = E[ f(X) ]

No independence assumption is used, so the estimator is unbiased in both cases.

For a simple analysis, take N = 2:

Independent samples:  Var[f̂] = Var[ ( f(X(1)) + f(X(2)) ) / 2 ] = (1/4)( Var[f(X(1))] + Var[f(X(2))] ) = Var[f(X)] / 2

Correlated samples:   Var[f̂] = (1/4)( Var[f(X(1))] + Var[f(X(2))] + 2 Cov[f(X(1)), f(X(2))] )
                             = Var[f(X)]/2 + ( Var[f(X)]/2 ) · ρf(X(1)),f(X(2))

Practically, many correlated samples (the cheap option) often outperform a few independent samples (the expensive option).

How to Check Convergence?

Run several chains; their estimates should be consistent if they have converged to πT. Assume K chains, each contributing N samples, with per-chain means f̄k and overall mean f̄ = (1/K) ∑k f̄k. Check that the ratio B/W is close enough to 1, where

B (variance between chains) = N/(K−1) · ∑k=1..K ( f̄k − f̄ )²

W (variance within chains) = 1/(K(N−1)) · ∑k=1..K ∑n=1..N ( f(X(k,n)) − f̄k )²
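The diagnostic in code form, following the two formulas above (Python; a sketch):

    def convergence_ratio(chains):
        """Between/within-chain variance ratio for K chains of N scalar values
        f(X^(k,n)); values near 1 suggest convergence to the stationary dist."""
        K, N = len(chains), len(chains[0])
        means = [sum(c) / N for c in chains]
        grand = sum(means) / K
        B = N / (K - 1) * sum((m - grand) ** 2 for m in means)
        W = sum(sum((x - m) ** 2 for x in c)
                for c, m in zip(chains, means)) / (K * (N - 1))
        return B / W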

The Critical Problem of MCMC

When ρ → 1, M → ∞ and Var[·] stops decreasing with N: MCMC cannot yield an acceptable result in reasonable time. (Figure: a target P(X) whose modes are connected only by transitions of tiny probability, e.g. 0.0001–0.0002, so a very large M is needed to converge to πT.)

How to Reduce the Correlation (ρ) Among Samples?

Take large steps in the sample space:

• Block Gibbs Sampling

• Collapsed-Particle Sampling

Problem of Gibbs Sampling

The correlation (ρ) between successive samples is high when the correlation among the variables X1~XK is high. Example: two binary variables with

φ(Y,X):  φ(0,0) = 99,  φ(0,1) = 1,  φ(1,0) = 1,  φ(1,1) = 99

Single-site Gibbs changes X or Y one at a time, so it almost never crosses between the (0,0) and (1,1) modes: it takes a very large M to converge to πT.

Block Gibbs Sampling

Draw a "block" of variables jointly, via P(X,Y) = P(X) P(Y|X):

first draw X from its marginal (here ∑Y φ(Y,X) = 100 for both values of X, i.e. uniform), then draw Y from P(Y|X). Each step can now jump between the two modes, so the chain converges to πT much more quickly.

Block Gibbs Sampling

Divide X into several "tractable blocks" X1, X2, …, XB; each block Xb can be drawn jointly given the variables in the other blocks.

Example: in the 4-chain factorial HMM (chains V, W, Y, Z emitting x1, x2, x3), take each chain as a block. Given samples of the other blocks, each chain is a simple chain-structured model, so drawing it jointly is tractable; iterate over the blocks V, W, Y, Z in turn.

Block Gibbs Sampling by VE

Drawing a block Xb jointly may need one pass of VE. For a chain A–B–C–D–E with factors f(A,B) f(B,C) f(C,D) f(D,E), first compute the backward messages

M(D) = ∑E f(D,E),   M(C) = ∑D f(C,D) M(D),   M(B) = ∑C f(B,C) M(C)

then sample forward:

Draw A ∝ ∑B f(A,B) M(B)
Draw B ∝ f(A=a, B) M(B)
Draw C ∝ f(B=b, C) M(C)
Draw D ∝ f(C=c, D) M(D)
Draw E ∝ f(D=d, E)

Then draw another block.
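This backward-messages-then-forward-sampling routine as a sketch (Python/NumPy; rng is assumed to be a numpy random Generator, and the factors are illustrative):

    import numpy as np

    def sample_chain_block(pair_factors, rng):
        """Jointly sample a chain block with pairwise factors f(X_i, X_{i+1}):
        one backward VE pass for the messages, then forward sampling."""
        K = pair_factors[0].shape[0]
        msgs = [np.ones(K)]
        for f in reversed(pair_factors):         # M(x_i) = sum_{x_{i+1}} f * M(x_{i+1})
            msgs.append(f @ msgs[-1])
        msgs = msgs[::-1]                        # msgs[i] = message into X_i from the right
        p = msgs[0] / msgs[0].sum()
        states = [rng.choice(K, p=p)]            # draw X_0 from its marginal
        for i, f in enumerate(pair_factors):     # draw X_{i+1} | X_i
            w = f[states[-1]] * msgs[i + 1]
            states.append(rng.choice(K, p=w / w.sum()))
        return states

    states = sample_chain_block([np.ones((2, 2))] * 4, np.random.default_rng(0))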

(Example: a student model — student type S, knowledge-component type KC, a chain of knowledge states K, correctness observations C, and problem types T.)

1. The model is intractable for exact inference.
2. Plain Gibbs sampling may converge too slowly.

Divide the variables into 4 blocks:
1. Student type — independent of each other given samples of the other blocks → easy to draw jointly.
2. KC type — independent of each other given the other blocks → easy to draw jointly.
3. KState — chain-structured given the other blocks → draw jointly using VE.
4. Problem type — independent of each other given the other blocks → easy to draw jointly.

Agenda

• When to use Approximate Inference ?

• Forward Sampling & Importance Sampling

• Markov Chain Monte Carlo (MCMC)

• Collapsed Particles

Collapsed Particle

Exact:  EP(X)[ f(X) ] = ∑X P(X) f(X)

Particle-based:  f̂ = (1/N) ∑n=1..N f(X(n))

Collapsed-particle: divide X into 2 parts {Xp, Xd}, where Xd permits exact inference given Xp:

EP(X)[ f(X) ] = ∑X P(X) f(X) = ∑Xp P(Xp) ∑Xd P(Xd|Xp) f(X)

ÊP(X)[ f(X) ] = (1/N) ∑n=1..N ( ∑Xd P(Xd | Xp(n)) f(Xd, Xp(n)) )

(If Xp contains few variables, the variance can be much reduced!!)

In the student model, divide into {Xp, Xd} with
Xp: Student type, KC type, Problem type;
Xd: KState — which can be exactly inferred given Xp.

Collapsed Particle with VE

To draw one variable Xk of Xp given all the other variables in Xp, sum out all the variables in Xd by VE:

Draw S (given KC=k and T=t) from:    M(S) = ∑K1,K2 F(S, KC=k, K1, K2) M(K1, T1=t1) M(S, KC=k, K2)

Draw KC (given S=s and T=t) from:    M(KC) = ∑K1,K2 F(S=s, KC, K1, K2) M(K1, T1=t1) M(S=s, KC, K2)

Draw T1 (given S=s and KC=k) from:   M(T1) = ∑K1 M(K1, S=s, KC=k) F(K1, T1)

Collect Samples

Each collapsed particle stores an assignment of Xp = (S, KC, T1, T2, T3) together with the exact conditional distributions over Xd = (K1, K2, K3):

(Intel, Quick, Hard, Easy, Hard) → ( {1/3,1/3,1/3}, {1/4,1/4,1/2}, {1/2,1/2,0} )
(Intel, Slow, Easy, Easy, Hard) → ( {1/2,1/2,1/4}, {1/5,4/5,0}, {1/4,1/4,1/2} )
…
(Dull, Slow, Easy, Easy, Hard) → ( {1/3,1/3,1/3}, {1/4,1/4,1/2}, {1/2,1/2,0} )

Averaging over the particles — and, inside each particle, taking the exact expectation over Xd — gives

ÊP(X)[ f(X) ] = (1/N) ∑n=1..N ( ∑Xd P(Xd | Xp(n)) f(Xd, Xp(n)) )