
School of Computer Science

Probabilistic Graphical Models

Dirichlet Process and Hierarchical DP

-- case studies in genetics and text analysis

Eric Xing Lecture 20, March 31, 2014

Reading:

© Eric Xing @ CMU, 2005-2014

Dirichlet Process

• A CDF G on possible worlds of random partitions follows a Dirichlet Process if, for any measurable finite partition (φ1, φ2, …, φm) of the sample space:

  (G(φ1), G(φ2), …, G(φm)) ~ Dirichlet(α G0(φ1), …, α G0(φm))

  where G0 is the base measure and α is the scale parameter.

• A draw G is a discrete distribution over a continuous space; thus a Dirichlet Process defines a distribution over distributions.

Stick-breaking Process

 0 0.4 0.4   ∞  G  ∑ k ( k ) k 1 0.6 0.5 0.3 G  k ~ 0 ∞ Location 0.3 0.8 0.24 ∑ k 1  k 1  k -1

k  k ∏ (1- k )  j 1 Mass G0 k ~ Beta(1,)

Chinese Restaurant Process

 1  2

0 P( ci = k| c-i) = 1 0 1  1+ 1+ 0 1 1  2 + 2 + 2 + 1 2  3 + 3 + 3 +
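These seating probabilities can be simulated in a few lines; a minimal sketch, with alpha as the scale parameter:

```python
import random

def crp(n_customers, alpha, seed=None):
    """Simulate table assignments under the Chinese Restaurant Process.

    Customer i joins existing table k with probability m_k / (i - 1 + alpha)
    and opens a new table with probability alpha / (i - 1 + alpha).
    """
    rng = random.Random(seed)
    counts = []                      # m_k: number of customers at table k
    assignments = []
    for _ in range(n_customers):
        # Existing tables weighted by occupancy; one extra slot for a new table.
        k = rng.choices(range(len(counts) + 1), weights=counts + [alpha])[0]
        if k == len(counts):
            counts.append(0)         # open a new table
        counts[k] += 1
        assignments.append(k)
    return assignments, counts

assignments, counts = crp(n_customers=100, alpha=1.0, seed=0)
```

The first customer always opens table 0, and the rich-get-richer weighting makes a handful of tables absorb most customers.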

The CRP defines an exchangeable distribution on partitions over an (infinite) sequence of samples; such a distribution is formally known as the Dirichlet Process (DP).

Representations of DP mixture

[Figure: two graphical-model representations of the DP mixture. Left, the CRP construction: G0 and α generate θ_i, which generates x_i, i = 1…N. Right, the stick-breaking construction: G0 and α generate G; indicators y_i select components for x_i, i = 1…N.]

Case I: Ancestral Inference

Ak k

Hn1 Hn2

Gn N

Essentially a clustering problem, but …

• Better recovery of the ancestors leads to better haplotyping results (because of more accurate grouping of common haplotypes)

• True haplotypes are obtainable only at high cost, but when available they can validate the model more objectively (as opposed to examining the saliency of clustering)

• Many other biological/scientific utilities

Example: DP-haplotyper [Xing et al., 2004]

• Clustering human populations

[Figure: plate diagram. G0 and α define DP(α, G0), from which G is drawn; infinitely many mixture components {A_k, θ_k} (population haplotypes); a likelihood model generates individual haplotypes H_n1, H_n2 and genotypes G_n, n = 1…N.]

• Inference: Markov chain Monte Carlo (MCMC)
  • Metropolis-Hastings

The DP Mixture of Ancestral Haplotypes

• The customers around a table in the CRP form a cluster
  • associate a mixture component (i.e., a population haplotype) with each table
  • sample {a_k, θ_k} at each table from the base measure G0 to obtain the population haplotype and nucleotide-substitution frequency for that component

[Figure: customers (individual haplotypes) seated at tables, each table serving an ancestral haplotype {A, θ}.]

• With p(h | a, θ) and p(g | h1, h2), the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide-substitution frequencies)

Inheritance and Observation Models

• Single-locus mutation model (ancestral pool → haplotypes):

  P_H(h_t | a_t, θ) = θ                      for h_t = a_t
  P_H(h_t | a_t, θ) = (1 − θ) / (|B| − 1)    for h_t ≠ a_t

  where B is the allele alphabet.

• Noisy observation model (haplotypes → genotype):

  P_G(g | h_1, h_2):  g_t = h_{1,t} ⊕ h_{2,t} (the unordered allele pair) with probability γ

[Figure: ancestral pool A1, A2, A3; individual haplotypes H_i1, H_i2; genotype G_i.]
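The mutation model multiplies the per-locus terms across a haplotype; a minimal sketch of that likelihood (the four-letter alphabet in the example is an illustrative choice, not from the slides):

```python
def mutation_likelihood(h, a, theta, alphabet_size):
    """Single-locus mutation model P_H(h | a, theta), applied per locus.

    Each haplotype allele h_t copies its ancestral allele a_t with
    probability theta, and otherwise mutates uniformly to one of the
    other |B| - 1 alleles.
    """
    p = 1.0
    for h_t, a_t in zip(h, a):
        p *= theta if h_t == a_t else (1.0 - theta) / (alphabet_size - 1)
    return p

# Four-letter alphabet (|B| = 4): one mismatch out of four loci.
lik = mutation_likelihood("ACGT", "ACGA", theta=0.9, alphabet_size=4)
```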

MCMC for Haplotype Inference

• Gibbs sampling for exploring the posterior distribution under the proposed model
  • Integrate out the parameters such as θ or γ, and sample c_{ie}, a_k, and h_{ie}:

  p(c_{ie} = k | c_{−[ie]}, h, a) ∝ p(c_{ie} = k | c_{−[ie]}) · p(h_{ie} | a_k, h_{−[ie]}, c)

  Posterior ∝ Prior (the CRP term) × Likelihood

• Gibbs sampling algorithm: draw samples of each variable in turn given the values of all the remaining variables
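One collapsed Gibbs update of this prior-times-likelihood form can be sketched as follows; the `log_lik(i, k)` interface is hypothetical, standing in for the marginal likelihood term p(h_{ie} | a_k, …):

```python
import math, random

def gibbs_resample_indicator(i, c, counts, alpha, log_lik, rng):
    """One collapsed Gibbs update for the cluster indicator c[i]:
    p(c_i = k | c_-i, data) is proportional to n_k * likelihood for an
    existing cluster k, and to alpha * likelihood for a new cluster
    (CRP prior times marginal likelihood)."""
    counts[c[i]] -= 1                      # remove i from its current cluster
    if counts[c[i]] == 0:
        del counts[c[i]]
    options = list(counts) + [None]        # None = open a new cluster
    logw = [math.log(counts[k] if k is not None else alpha) + log_lik(i, k)
            for k in options]
    mx = max(logw)                         # stabilize before exponentiating
    k = rng.choices(options, weights=[math.exp(v - mx) for v in logw])[0]
    if k is None:
        k = max(counts, default=-1) + 1    # fresh cluster label
    counts[k] = counts.get(k, 0) + 1
    c[i] = k
    return c, counts

# Toy usage with a flat (constant) likelihood, so only the CRP prior matters.
rng = random.Random(0)
c = [0, 0, 1]
counts = {0: 2, 1: 1}
c, counts = gibbs_resample_indicator(2, c, counts, alpha=1.0,
                                     log_lik=lambda i, k: 0.0, rng=rng)
```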

MCMC for Haplotype Inference

1. Sample c_{ie}^{(j)} from its conditional posterior

2. Sample a_k from its conditional posterior

3. Sample h_{ie}^{(j)} from its conditional posterior

• For the DP scale parameter α: a vague inverse-Gamma prior

Convergence of Ancestral Inference

DP vs. Finite Mixture via EM

[Figure: individual error on five data sets for the DP model vs. a finite mixture fit by EM.]

Multi-population Genetic Demography

• Pool everything together and solve 1 hap problem?
  • ignores population structure
• Solve 4 hap problems separately?
  • data fragmentation
• Co-clustering … solve 4 coupled hap problems jointly

Solving Multiple Clustering Problems?

G   ~ DP( G0 ) G   ~ DP( G0 ) G   ~ DP( G0 )

… … … Nature articles PNAS articles Science articles

G  ~ DP( G0 ) 1 3 2 4 5 6 ?

G   ~ DP( G0 ) © Eric Xing @ CMU, 2005-2014 15 How? And what is the challenge?

• How to share mixture components?
  • Because the base measure is continuous, we have zero probability of picking the same component twice.
• If we want to pick the same topic twice, we need to use a discrete base measure.
• For example, if we chose the base measure to be …
• We want there to be an infinite number of topics, so we want an infinite, discrete base measure.
• We want the locations of the topics to be random, so we want an infinite, discrete, random base measure.

Hierarchical Dirichlet Process (Teh et al., 2006)

• Solution: Sample the base measure from a Dirichlet process!

Hierarchical Dirichlet Process [Teh et al., 2005; Xing et al., 2005]

• Two-level CRP scheme: the Chinese restaurant franchise
• At the i-th step in the j-th "group":
  • choose an existing component k with probability n_{jk} / (n_{j·} + α0)
  • go to the upper-level DP (the "oracle") with probability α0 / (n_{j·} + α0), and there:
    • choose an existing component k with probability m_k / (m_· + γ)
    • draw a new component with probability γ / (m_· + γ)

Chinese restaurant franchise

• Imagine a franchise of restaurants, serving an infinitely large, global menu.
• Each table in each restaurant orders a single dish.
• Let n_{rt} be the number of customers in restaurant r sitting at table t.
• Let m_{rd} be the number of tables in restaurant r serving dish d.
• Let m_{·d} be the number of tables, across all restaurants, serving dish d.
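With these counts, the franchise's seating scheme can be simulated directly; a minimal sketch (group sizes and concentration parameters are illustrative):

```python
import random

def chinese_restaurant_franchise(group_sizes, alpha0, gamma, seed=None):
    """Simulate dish assignments in a Chinese restaurant franchise.

    Within restaurant r, a customer joins table t with probability
    proportional to n_rt, or opens a new table with mass alpha0.
    A new table orders dish d with probability proportional to m_.d
    (tables serving d across all restaurants), or a new dish with mass gamma.
    """
    rng = random.Random(seed)
    m_dot = []                          # m_.d: tables per dish, all restaurants
    dishes_per_customer = []
    for size in group_sizes:
        table_counts = []               # n_rt for this restaurant
        table_dish = []                 # dish served at each table
        assignments = []
        for _ in range(size):
            weights = table_counts + [alpha0]
            t = rng.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_counts):             # open a new table
                dish_weights = m_dot + [gamma]
                d = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
                if d == len(m_dot):
                    m_dot.append(0)                # brand-new global dish
                m_dot[d] += 1
                table_counts.append(0)
                table_dish.append(d)
            table_counts[t] += 1
            assignments.append(table_dish[t])
        dishes_per_customer.append(assignments)
    return dishes_per_customer, m_dot

groups, m_dot = chinese_restaurant_franchise([50, 50, 50], alpha0=1.0, gamma=1.0, seed=0)
```

Because new tables consult the shared top-level counts m_{·d}, popular dishes (components) recur across restaurants (groups), which is exactly the sharing the HDP provides.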

Recall: Graphical Model Representations of DP

[Figure: two graphical-model representations of the DP, as before: the Pólya urn construction and the stick-breaking construction.]

Hierarchical DP Mixture

[Figure: HDP plate diagram. H and γ generate the global measure G0; G0 and α0 generate the group-specific measures G_j, j = 1…J; within group j, θ_i ~ G_j generates x_i.]

Stick-breaking construction of the HDP:

  θ_k ~ H,  β ~ Stick(γ):  β'_k ~ Beta(1, γ),  β_k = β'_k ∏_{l=1}^{k−1} (1 − β'_l),  G0 = Σ_{k=1}^∞ β_k δ(θ_k)

  π_j ~ Stick(α0, β):  π'_jk ~ Beta(α0 β_k, α0 (1 − Σ_{l=1}^{k} β_l)),  π_jk = π'_jk ∏_{l=1}^{k−1} (1 − π'_jl),  G_j = Σ_{k=1}^∞ π_jk δ(θ_k)

Results - Simulated Data

• 5 populations with 20 individuals each (two kinds of mutation rates)
• the 5 populations share parts of their ancestral haplotypes
• sequence length = 10

[Figure: haplotype error on the simulated data.]

Results - International HapMap DB

• Different sample sizes, and different # of sub-populations

Constructing a topic model with infinitely many topics

• LDA: Each document is associated with a distribution over K topics.
• Problem: How to choose the number of topics?
• Solution:
  • Infinitely many topics!
  • Replace the Dirichlet distribution over topics with a Dirichlet process!
• Problem: We want to make sure the topics are shared between documents.

Infinite Topic Models

 H  H   H Stick Dirichlet breaking

j  j   j  k  

zi zi zi

wi wi wi N N N J A single image A single image J images with k topic with inf-topic with inf-topic

An LDA A DP An HDP

© Eric Xing @ CMU, 2005-2014 An infinite topic model

• Restaurants = documents; dishes = topics.
• Let H be a V-dimensional Dirichlet distribution, so a sample from H is a distribution over a vocabulary of V words.
• Sample a global distribution over topics, G0 ~ DP(α, H).
• For each document m = 1, …, M:
  • Sample a document-level distribution over topics, G_m ~ DP(γ, G0).
  • For each word n = 1, …, N_m:
    • Sample a topic ϕ_{mn} ~ G_m.
    • Sample a word w_{mn} ~ Discrete(ϕ_{mn}).
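This generative process can be sketched with a truncated stick-breaking stand-in for the two DP levels; the topic-smoothing parameter `eta` and the truncation level are illustrative assumptions, not from the slides:

```python
import numpy as np

def hdp_lda_generate(n_docs, words_per_doc, vocab_size,
                     alpha=1.0, gamma=1.0, eta=0.1, truncation=50, rng=None):
    """Generate documents from a truncated HDP topic model (sketch).

    Global topic weights beta ~ Stick(gamma); topics phi_k ~ Dirichlet(eta).
    Each document draws its own weights pi_m ~ Dirichlet(alpha * beta),
    a finite-truncation stand-in for G_m ~ DP(alpha, G0).
    """
    rng = np.random.default_rng(rng)
    # Global stick-breaking weights over topics, renormalized after truncation.
    b = rng.beta(1.0, gamma, size=truncation)
    beta = b * np.concatenate([[1.0], np.cumprod(1.0 - b)[:-1]])
    beta /= beta.sum()
    topics = rng.dirichlet(np.full(vocab_size, eta), size=truncation)
    docs = []
    for _ in range(n_docs):
        pi = rng.dirichlet(alpha * beta + 1e-12)   # document-level topic weights
        z = rng.choice(truncation, size=words_per_doc, p=pi)
        words = [int(rng.choice(vocab_size, p=topics[k])) for k in z]
        docs.append(words)
    return docs, beta, topics

docs, beta, topics = hdp_lda_generate(5, 20, vocab_size=100, rng=0)
```

All documents draw their topic weights from the same global β, so the same topics recur across documents, unlike independent per-document DPs.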

The “right” number of topics

Dynamic Dirichlet Process

• Two main ideas:
  • Infinite HMM: a hidden Markov DP (see appendix)
  • Dependent DP/HDP: directly evolving a DP/HDP

Evolutionary Clustering

• Adapts the number of mixture components over time
• Mixture components can die out
• New mixture components can be born at any time
• Retained mixture components' parameters evolve according to Markovian dynamics

[Figure: topics of research papers (Phy, Bio, CS) evolving over time, 1900-2000.]

The Chinese Restaurant Process

• Customers correspond to data points
• Tables correspond to clusters/mixture components
• Dishes correspond to the parameters of the mixture components

[Figure: CRP seating with dishes θ1, θ2 and scale parameter α.]

Temporal DPM [Ahmed and Xing, 2008]

• The Recurrent Chinese Restaurant Process
  • The restaurant operates in epochs
  • The restaurant is closed at the end of each epoch
  • The state of the restaurant at epoch t depends on its state at epoch t−1
  • Can be extended to higher-order dependencies

[Figure: epoch T=1 with tables serving dishes θ_{1,1}, θ_{2,1}, θ_{3,1}; the dish eaten at table 3 at epoch 1 is the parameter of cluster 3 at epoch 1.]

Generative process. Customers at epoch T=1 are seated as in the ordinary CRP:
• choose an existing table j with probability ∝ N_{j,1}, and sample x_i ~ f(θ_{j,1})
• choose a new table K+1 with probability ∝ α, sample θ_{K+1,1} ~ G0, and sample x_i ~ f(θ_{K+1,1})

[Figure (animation frames): epoch T=1 ends with tables θ_{1,1}, θ_{2,1}, θ_{3,1} and counts N_{1,1}=2, N_{2,1}=3, N_{3,1}=1. At T=2 the restaurant re-opens with those tables available as priors: the first customer joins table j with probability N_{j,1}/(6+α), e.g. 3/(6+α) for table 2, or opens a new table with probability α/(6+α). A retained table's dish evolves as θ_{1,2} ~ P(· | θ_{1,1}). Later customers add the current epoch's counts to the denominators, e.g. 3/(6+1+α), and so on. By the end of epoch 2, cluster 3 has died out and a new cluster θ_{4,2} has been born; epoch 3 then starts from N_{1,2}=2, N_{2,2}=2, N_{4,2}=1.]

Temporal DPM

• Can be extended to model higher-order dependencies
• Can decay dependencies over time
• The pseudo-count for table k at time t is

  N'_{k,t} = Σ_{w=1}^{W} e^{−w/λ} N_{k,t−w}

  where W is the history size, λ is the decay factor, and N_{k,t−w} is the number of customers sitting at table k at epoch t−w.

[Figure: with decay, table 2's epoch-3 pseudo-count N'_{2,3} combines its decayed counts from epochs 1 and 2.]

TDPM Generative Power

Special cases of the window size W:
• W = T (keep the whole history): recovers a single DPM over all the data
• finite W (e.g., W = 4): the TDPM proper; cluster sizes follow a power-law curve
• W = 0 (any decay): independent DPMs, one per epoch

Results: NIPS 12

• Building a simple dynamic topic model
  • Chain dynamics are as before
  • Emission model for document x_{k,t}:
    • project θ_{k,t} onto the simplex, and
    • sample x_{k,t} | c_{t,i} ~ Multinomial(· | Logistic(θ_{k,t}))
  • Unlike LDA, here a document belongs to a single topic
• Use this model to analyze the NIPS12 corpus
  • Proceedings of the NIPS conference, 1987-1999

The Big Picture

[Figure: a 2×2 map of models along a time axis and a dimension axis. Top row: fixed-dimension (K) clustering and its dynamic-clustering counterpart. Bottom row: the DPM (infinite-dimensional, with G ~ DP and indicators c_i over data x_i) and the TDPM.]

Summary

• A non-parametric Bayesian model for pattern discovery
  • finite mixture of latent patterns (e.g., image segments, objects) → infinite mixture of prototypes: an alternative to model selection
  • hierarchical infinite mixture
  • temporal infinite mixture model
• Applications in general data mining …

Appendix:

• What if we have an HMM with an unknown # of states?
  • E.g., “recombination” over an unknown number of chromosomes?

A common inheritance model to begin with …

Each individual haplotype is a mosaic of ancestral haplotypes

[Figure: ancestral chromosomes (K=5); each individual chromosome is a mosaic of ancestral segments.]

The Hidden Markov Model

[Figure: the K×K transition matrix between ancestor states C_t and C_{t+1}.]

• Transition process (recombination):

  p(c_{i,t+1} = k' | c_{i,t} = k) = e^{−dr} δ(k, k') + (1 − e^{−dr}) π(k, k')

• Emission process (mutation):

  p(h_{i,t} | a_{k,t}, θ_k) = θ_k^{I(h_{i,t} = a_{k,t})} · ((1 − θ_k)/(|B| − 1))^{I(h_{i,t} ≠ a_{k,t})}

• How many recombining ancestors?

[Figure: ancestral pool A1, A2, A3; haplotypes H_i1, H_i2; genotype G_i.]

Hidden Markov Dirichlet Process

(Xing and Sohn, Bayesian Analysis, 2007)

• Hidden Markov Dirichlet process mixtures
  • extension of the HMM to an infinite ancestral space
  • infinite-dimensional transition matrix
  • each row of the transition matrix is modeled with a DP

[Figure: the transition matrix C_t → C_{t+1} with an unbounded number of columns.]

HMDP as a Graphical Model

• Ancestor allele reconstruction
• Inferring population structure
• Inferring recombination hotspots

[Figure: the HMDP as a graphical model, with base measure H, ancestors A, and the Markov chain C_{y1}, C_{y2}, C_{y3}, …, C_{yN} over loci.]

Recombination Analysis

[Figure: recombination analysis on HapMap data; panels for CEU, YRI, and HCB+JPT.]
