Learning Heterogeneous Hidden Markov Random Fields

Jie Liu (CS, UW-Madison) · Chunming Zhang (UW-Madison) · Elizabeth Burnside (Radiology, UW-Madison) · David Page (BMI & CS, UW-Madison)

Abstract

Hidden Markov random fields (HMRFs) are conventionally assumed to be homogeneous in the sense that the potential functions are invariant across different sites. However, in some biological applications it is desirable to make HMRFs heterogeneous, especially when there exists some background knowledge about how the potential functions vary. We formally define heterogeneous HMRFs and propose an EM algorithm whose M-step combines a contrastive divergence learner with a kernel smoothing step to incorporate the background knowledge. Simulations show that our algorithm is effective for learning heterogeneous HMRFs and outperforms alternative binning methods. We learn a heterogeneous HMRF in a real-world study.

1 Introduction

Hidden Markov models (HMMs) and hidden Markov random fields (HMRFs) are useful approaches for modelling structured data such as speech, text, vision and biological data. HMMs and HMRFs have been extended in many ways, such as the infinite models [Beal et al., 2002, Gael et al., 2008, Chatzis and Tsechpenakis, 2009], the factorial models [Ghahramani and Jordan, 1997, Kim and Zabih, 2002], the high-order models [Lan et al., 2006] and the nonparametric models [Hsu et al., 2009, Song et al., 2010]. HMMs are homogeneous in the sense that the transition matrix stays the same across different sites. HMRFs, intensively used in image segmentation tasks [Zhang et al., 2001, Celeux et al., 2003, Chatzis and Varvarigou, 2008], are also homogeneous. The homogeneity assumption for HMRFs in image segmentation tasks is legitimate, because people usually assume that the neighborhood system on an image is invariant across different regions. However, it is necessary to bring heterogeneity to HMMs and HMRFs in some biological applications where the correlation structure can change over different sites. For example, a heterogeneous HMM is used for segmenting array CGH data [Marioni et al., 2006], and the transition matrix depends on some background knowledge, i.e. some distance measurement which changes over the sites. A heterogeneous HMRF is used to filter SNPs in genome-wide association studies [Liu et al., 2012a], and the pairwise potential functions depend on some background knowledge, i.e. some correlation measure between the SNPs which can differ between different pairs. In both of these applications, the transition matrix and the pairwise potential functions are heterogeneous and are parameterized as monotone parametric functions of the background knowledge. Although the algorithms tune the parameters in the monotone functions, there is no justification that the parameterization of the monotone functions is correct. Can we adopt the background knowledge about these heterogeneous parameters adaptively during HMRF learning, and recover the relation between the parameters and the background knowledge nonparametrically?

Our paper is the first to learn HMRFs with heterogeneous parameters by adaptively incorporating the background knowledge. It is an EM algorithm whose M-step combines a contrastive divergence style learner with a kernel smoothing step to incorporate the background knowledge. Details about our EM-kernel-PCD algorithm are given in Section 3, after we formally define heterogeneous HMRFs in Section 2. Simulations in Section 4 show that our EM-kernel-PCD algorithm is effective for learning heterogeneous HMRFs and outperforms alternative methods. In Section 5, we learn a heterogeneous HMRF in a real-world genome-wide association study. We conclude in Section 6.

2 Models

2.1 HMRFs and the Homogeneity Assumption

Suppose that X = {0, 1, ..., m − 1} is a discrete space, and we have a Markov random field (MRF) defined on a random vector X ∈ X^d. The conditional independence is described by an undirected graph G(V, E). The node set V consists of d nodes, and the edge set E consists of r edges. The probability of x from the MRF with parameters θ is

\[
P(x;\theta) = \frac{Q(x;\theta)}{Z(\theta)} = \frac{1}{Z(\theta)} \prod_{c \in C(G)} \phi_c(x;\theta_c), \qquad (1)
\]

where Z(θ) is the normalizing constant and Q(x; θ) is the unnormalized measure, with C(G) being some subset of the cliques in G. The potential function φ_c is defined on the clique c and is parameterized by θ_c. For simplicity, in this paper we consider pairwise MRFs, whose potential functions are defined on the edges, namely |C(G)| = r. We further assume that each pairwise potential function is parameterized by a single parameter, i.e. θ_c = {θ_c}.

A hidden Markov random field [Zhang et al., 2001, Celeux et al., 2003, Chatzis and Varvarigou, 2008] consists of a hidden random field X ∈ X^d and an observable random field Y ∈ Y^d, where Y is another space (either continuous or discrete). The random field X is a Markov random field with density P(x; θ), as defined in Formula (1), and its instantiation x cannot be measured directly. Instead, we can observe the emitted random field Y, with each component Y_i depending on X_i for i = 1, ..., d, namely P(y|x; ϕ) = ∏_{i=1}^d P(y_i|x_i; ϕ), where ϕ = {ϕ_0, ..., ϕ_{m−1}} and ϕ_{x_i} parameterizes the emitting distribution of Y_i under the state x_i. Therefore, the joint probability of x and y is

\[
P(x,y;\theta,\varphi) = P(x;\theta)\,P(y|x;\varphi) = \frac{1}{Z(\theta)} \prod_{c \in C(G)} \phi_c(x;\theta_c) \prod_{i=1}^{d} P(y_i|x_i;\varphi). \qquad (2)
\]

Example 1: A pairwise HMRF model with three latent variables (X1, X2, X3) and three observable variables (Y1, Y2, Y3) is given in Figure 1. Let X = {0, 1}. X1, X2 and X3 are connected by three edges. The pairwise potential function φ_i on edge i (connecting X_u and X_v), parameterized by θ_i (0 < θ_i < 1), is φ_i(X; θ_i) = θ_i^{I(X_u = X_v)} (1 − θ_i)^{I(X_u ≠ X_v)} for i = 1, 2, 3, where I is an indicator variable. Let Y = R. For i = 1, 2, 3, Y_i | X_i = 0 ∼ N(μ_0, σ_0²) and Y_i | X_i = 1 ∼ N(μ_1, σ_1²), namely ϕ_0 = {μ_0, σ_0} and ϕ_1 = {μ_1, σ_1}.

Figure 1: The pairwise HMRF model with three latent nodes (X1, X2, X3) and observable nodes (Y1, Y2, Y3), with parameters θ = {θ1, θ2, θ3} and ϕ = {ϕ0, ϕ1}.

In common applications of HMRFs, we observe only one instantiation y, which is emitted according to the hidden state vector x, and the task is to infer the most probable state configuration of X, or to compute the marginal probabilities of X. In both tasks, we need to estimate the parameters θ = {θ1, ..., θr} and ϕ = {ϕ0, ..., ϕ_{m−1}}. Usually, we seek maximum likelihood estimates of θ and ϕ which maximize the log likelihood

\[
L(\theta,\varphi) = \log P(y;\theta,\varphi) = \log \sum_{x \in \mathcal{X}^d} P(x,y;\theta,\varphi). \qquad (3)
\]

Since we only have one instantiation (x, y), we usually have to assume that the θ_i's are the same for i = 1, ..., r for effective parameter learning. This homogeneity assumption is widely used in computer vision problems because people usually assume that the neighborhood system on an image is invariant across its different regions. Therefore, conventional HMRFs refer to homogeneous HMRFs, similar to conventional HMMs whose transition matrix is invariant across different sites.

2.2 Heterogeneous HMRFs

In a heterogeneous HMRF, the potential functions on different cliques can be different. Taking the model in Figure 1 as an example, θ1, θ2 and θ3 can be different if the HMRF is heterogeneous. As with conventional HMRFs, we want to be able to address applications that have one instantiation (x, y), where y is observable and x is hidden. Therefore, learning an HMRF from one instantiation y is infeasible if we free all θ's. To partially free the parameters, we assume that there is some background knowledge k = {k1, ..., kr} about the parameters θ = {θ1, ..., θr}, in the form of some unknown smooth mapping function which maps θ_i to k_i for i = 1, ..., r. The background knowledge describes how these potential functions differ across different cliques. Taking pairwise HMRFs as an example, the potentials on edges with similar background knowledge should have similar parameters. We can regard the homogeneity assumption in conventional HMRFs as an extreme type of background knowledge in which k1 = k2 = ... = kr. The problem we solve in this paper is to estimate θ and ϕ which maximize the log likelihood L(θ, ϕ) in Formula (3), subject to the condition that the estimate of θ is smooth with respect to k.
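To make Example 1 concrete, here is a minimal sketch, not from the paper, of how Formula (2) could be evaluated and used for exact posterior inference in the three-node model of Figure 1; the numeric values of θ, ϕ and y below are illustrative choices of ours.

```python
import numpy as np
from scipy.stats import norm
from itertools import product

# Example 1: three hidden binary nodes connected by three edges.
edges = [(0, 1), (1, 2), (0, 2)]
theta = [0.9, 0.7, 0.6]                 # illustrative edge parameters theta_i in (0, 1)
phi = {0: (0.0, 1.0), 1: (2.0, 1.0)}    # emission parameters (mu_s, sigma_s) for state s

def unnormalized_joint(x, y):
    """Q(x; theta) * P(y | x; phi), i.e. the joint in Formula (2) up to 1/Z(theta)."""
    q = 1.0
    for (u, v), t in zip(edges, theta):
        q *= t if x[u] == x[v] else (1.0 - t)     # phi_i(x) = t^I(eq) * (1-t)^I(neq)
    for xi, yi in zip(x, y):
        mu, sigma = phi[xi]
        q *= norm.pdf(yi, loc=mu, scale=sigma)    # P(y_i | x_i; phi)
    return q

y = np.array([0.3, 2.1, 1.7])                     # one observed instantiation
# Posterior over the 2^3 hidden configurations, normalized by brute force;
# the intractable Z(theta) cancels in the normalization.
weights = {x: unnormalized_joint(x, y) for x in product([0, 1], repeat=3)}
total = sum(weights.values())
posterior = {x: w / total for x, w in weights.items()}
print(max(posterior, key=posterior.get))          # most probable configuration of X
```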

3 Parameter Learning Methods

Learning heterogeneous HMRFs in the above manner involves three difficulties: (i) the intractable Z(θ), (ii) the latent x, and (iii) the heterogeneous θ. The way we handle the intractable Z(θ) is similar to using contrastive divergence [Hinton, 2002] to learn MRFs. We review contrastive divergence and its variations in Section 3.1. To handle the latent x in HMRF learning, we introduce an EM algorithm in Section 3.2, which is applicable to conventional HMRFs. In Section 3.3, we further address the heterogeneity of θ in the M-step of the EM algorithm.

3.1 Contrastive Divergence for MRFs

Assume that we observe s independent samples X = {x¹, x², ..., xˢ} from (1), and we want to estimate θ. The log likelihood L(θ|X) is concave w.r.t. θ, and we can use gradient ascent to find the MLE of θ. The partial derivative of L(θ|X) with respect to θ_i is

\[
\frac{\partial L(\theta|\mathcal{X})}{\partial \theta_i} = \frac{1}{s} \sum_{j=1}^{s} \psi_i(x^j) - E_{\theta}\psi_i = E_{\mathcal{X}}\psi_i - E_{\theta}\psi_i, \qquad (4)
\]

where ψ_i is the sufficient statistic corresponding to θ_i, and E_θψ_i is the expectation of ψ_i with respect to the distribution specified by θ. In the i-th iteration of gradient ascent, the parameter update is

\[
\theta_{(i+1)} = \theta_{(i)} + \eta \nabla L(\theta_{(i)}|\mathcal{X}) = \theta_{(i)} + \eta\,\big(E_{\mathcal{X}}\psi - E_{\theta_{(i)}}\psi\big),
\]

where η is the learning rate. However, the exact computation of E_θψ_i takes time exponential in the treewidth of G. A few sampling-based methods have been proposed to solve this problem. The key differences among these methods are how they draw particles and how they compute E_θψ from the particles. MCMC-MLE [Geyer, 1991, Zhu and Liu, 2002] uses importance sampling, but might suffer from degeneracy when θ_(i) is far away from θ_(1). Contrastive divergence [Hinton, 2002] generates new particles in each iteration according to the current θ_(i) and does not require the particles to reach equilibrium, so as to save computation. Variations of contrastive divergence include particle-filtered MCMC-MLE [Asuncion et al., 2010], persistent contrastive divergence (PCD) [Tieleman, 2008] and fast PCD [Tieleman and Hinton, 2009]. Because PCD is efficient and easy to implement, we employ it in this paper. Its pseudo-code is provided in Algorithm 1. Other than contrastive divergence, MRFs can also be learned via ratio matching [Hyvärinen, 2007], non-local contrastive objectives [Vickrey et al., 2010], noise-contrastive estimation [Gutmann and Hyvärinen, 2010] and minimum KL contraction [Lyu, 2011].

Algorithm 1 PCD-n Algorithm [Tieleman, 2008]
1: Input: independent samples X = {x¹, x², ..., xˢ} from P(x; θ), maximum iteration number T
2: Output: θ̂ from the last iteration
3: Procedure:
4: Initialize θ_(1) and initialize particles
5: Calculate E_Xψ from X
6: for i = 1 to T do
7:   Advance particles n steps under θ_(i)
8:   Calculate E_{θ_(i)}ψ from the particles
9:   θ_(i+1) = θ_(i) + η(E_Xψ − E_{θ_(i)}ψ)
10:  Adjust η
11: end for
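As an illustration of Algorithm 1, the following is a minimal PCD-n sketch of our own for a toy pairwise binary MRF on a chain. It works in the equivalent log-linear parameterization w_i = log(θ_i/(1−θ_i)), so the sufficient statistic of edge i is ψ_i(x) = I(x_u = x_v); particles are advanced by Gibbs sweeps, and the learning-rate adjustment of line 10 is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise binary MRF on a chain; log-linear edge parameters w_i,
# equivalent to theta_i in (0, 1) via w_i = log(theta_i / (1 - theta_i)).
d = 20
edges = [(i, i + 1) for i in range(d - 1)]
nbrs = {i: [] for i in range(d)}
for e, (u, v) in enumerate(edges):
    nbrs[u].append((e, v))
    nbrs[v].append((e, u))

def suff_stats(x):
    """psi_i(x) = I(x_u == x_v) for every edge i."""
    return np.array([float(x[u] == x[v]) for u, v in edges])

def gibbs_sweep(x, w):
    """One Gibbs sweep over all nodes under edge parameters w (in place)."""
    for i in range(d):
        logp = [sum(w[e] * (s == x[j]) for e, j in nbrs[i]) for s in (0, 1)]
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
        x[i] = int(rng.random() < p1)
    return x

# "Training" samples drawn from a ground-truth model by long Gibbs runs.
w_true = rng.uniform(0.0, 2.0, size=len(edges))
X = []
for _ in range(200):
    x = rng.integers(0, 2, size=d)
    for _ in range(50):
        gibbs_sweep(x, w_true)
    X.append(x.copy())
EX_psi = np.mean([suff_stats(x) for x in X], axis=0)   # line 5 of Algorithm 1

# PCD-n: persistent particles, n Gibbs sweeps per iteration (lines 6-11).
n, T, eta = 1, 500, 0.05
w = np.zeros(len(edges))
particles = [rng.integers(0, 2, size=d) for _ in range(30)]
for _ in range(T):
    for p in particles:
        for _ in range(n):
            gibbs_sweep(p, w)                          # advance particles n steps
    E_model = np.mean([suff_stats(p) for p in particles], axis=0)
    w += eta * (EX_psi - E_model)                      # gradient step, Formula (4)

print("mean |w_hat - w_true| =", np.mean(np.abs(w - w_true)))
```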

3.2 Expectation-Maximization for Learning Conventional HMRFs

We begin with a lower bound of the log likelihood function, and then introduce the EM algorithm which handles the latent variables in HMRFs. Let q_x(x) be any distribution on x ∈ X^d. It is well known that there exists a lower bound of the log likelihood L(θ, ϕ) in (3), which is provided by an auxiliary function F(q_x(x), {θ, ϕ}) defined as follows,

\[
\mathcal{F}(q_x(x),\{\theta,\varphi\}) = \sum_{x \in \mathcal{X}^d} q_x(x) \log \frac{P(x,y;\theta,\varphi)}{q_x(x)} = L(\theta,\varphi) - \mathrm{KL}\big[q_x(x)\,\|\,P(x|y,\theta,\varphi)\big], \qquad (5)
\]

where KL[q_x(x) ‖ P(x|y, θ, ϕ)] is the Kullback-Leibler divergence between q_x(x) and P(x|y, θ, ϕ), the posterior distribution of the hidden variables. This Kullback-Leibler divergence is the gap between L(θ, ϕ) and F(q_x(x), {θ, ϕ}).

Expectation-Maximization: We maximize L(θ, ϕ) with an expectation-maximization (EM) algorithm which iteratively maximizes its lower bound F(q_x(x), {θ, ϕ}). We first initialize θ^(0) and ϕ^(0). In the t-th iteration, the updates in the expectation (E) step and the maximization (M) step are

\[
q_x^{(t)} = \arg\max_{q_x}\ \mathcal{F}\big(q_x(x), \{\theta^{(t-1)}, \varphi^{(t-1)}\}\big) \quad \text{(E)},
\qquad
\theta^{(t)}, \varphi^{(t)} = \arg\max_{\{\theta,\varphi\}}\ \mathcal{F}\big(q_x^{(t)}, \{\theta, \varphi\}\big) \quad \text{(M)}.
\]

In the E-step, we maximize F(q_x(x), {θ^(t−1), ϕ^(t−1)}) with respect to q_x(x). Because the difference between F(q_x(x), {θ, ϕ}) and L(θ, ϕ) is KL[q_x(x) ‖ P(x|y, θ, ϕ)], the maximizer q_x^(t) in the E-step is P(x|y, θ^(t−1), ϕ^(t−1)), namely the posterior distribution of x|y under the current estimated parameters θ^(t−1) and ϕ^(t−1). For general graphs, this posterior distribution can be calculated by Monte Carlo methods.
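For general graphs, one way to approximate this E-step is to run a Gibbs sampler on P(x|y; θ, ϕ). The sketch below is our own illustration, again using the log-linear edge parameterization and the Gaussian emissions of Example 1; the returned samples play the role of q_x^(t).

```python
import numpy as np

rng = np.random.default_rng(1)

def e_step_gibbs(y, w, edges, mu, sigma, burn_in=100, n_samples=200):
    """Approximate q_x(x) = P(x | y; theta, phi) by Gibbs sampling.

    y     : observed vector of length d
    w     : edge parameters, w_e = log(theta_e / (1 - theta_e))
    edges : list of (u, v) pairs
    mu    : (mu_0, mu_1) emission means; sigma: common emission std. dev.
    """
    d = len(y)
    nbrs = {i: [] for i in range(d)}
    for e, (u, v) in enumerate(edges):
        nbrs[u].append((e, v))
        nbrs[v].append((e, u))

    x = rng.integers(0, 2, size=d)
    samples = []
    for t in range(burn_in + n_samples):
        for i in range(d):
            logp = []
            for s in (0, 1):
                prior = sum(w[e] * (s == x[j]) for e, j in nbrs[i])
                loglik = -0.5 * ((y[i] - mu[s]) / sigma) ** 2   # Gaussian emission
                logp.append(prior + loglik)
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
            x[i] = int(rng.random() < p1)
        if t >= burn_in:
            samples.append(x.copy())
    return np.array(samples)      # each row approximates a draw from q_x^{(t)}
```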

In the M-step, we maximize F(q_x^(t)(x), {θ, ϕ}) with respect to {θ, ϕ}, which can be rewritten as

\[
\arg\max_{\{\theta,\varphi\}} \mathcal{F}\big(q_x^{(t)}(x), \{\theta,\varphi\}\big)
= \arg\max_{\{\theta,\varphi\}} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(x,y;\theta,\varphi)
= \arg\max_{\{\theta,\varphi\}} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \big[\log P(x;\theta) + \log P(y|x;\varphi)\big].
\]

It is obvious that this function can be maximized with respect to ϕ and θ separately as

\[
\theta^{(t)} = \arg\max_{\theta} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(x;\theta),
\qquad
\varphi^{(t)} = \arg\max_{\varphi} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(y|x;\varphi). \qquad (6)
\]

Estimating ϕ: Estimating ϕ in this maximum likelihood manner is straightforward, because the maximization can be rewritten as follows,

\[
\arg\max_{\varphi} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(y|x;\varphi)
= \arg\max_{\varphi} \sum_{i=1}^{d} \sum_{x_i \in \mathcal{X}} q_{x_i}^{(t)}(x_i) \log P(y_i|x_i;\varphi),
\]

where q_x^(t)(x) = ∏_{i=1}^d q_{x_i}^(t)(x_i).
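With the Gaussian emissions of Example 1 and q_x^(t) represented by posterior samples (rows of a 0/1 matrix), this weighted maximum-likelihood update of ϕ reduces to posterior-weighted means and variances. A minimal sketch under these assumptions (the function name is ours):

```python
import numpy as np

def estimate_phi(y, samples):
    """Weighted MLE of phi = {(mu_0, sigma_0), (mu_1, sigma_1)} for Gaussian
    emissions, where q_{x_i}(s) is approximated by the fraction of posterior
    samples (rows of `samples`) with x_i = s."""
    y = np.asarray(y, dtype=float)
    q1 = samples.mean(axis=0)                  # q_{x_i}(1) for each node i
    q = {0: 1.0 - q1, 1: q1}
    phi = {}
    for s in (0, 1):
        wsum = q[s].sum()
        mu_s = (q[s] * y).sum() / wsum         # posterior-weighted mean
        var_s = (q[s] * (y - mu_s) ** 2).sum() / wsum
        phi[s] = (mu_s, np.sqrt(var_s))
    return phi
```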

Estimating θ: Estimating θ in Formula (6) is difficult due to the intractable Z(θ). Some approaches [Zhang et al., 2001, Celeux et al., 2003] use pseudo-likelihood [Besag, 1975] to estimate θ in the M-step. It can be shown that Σ_{x∈X^d} q_x^(t)(x) log P(x; θ) is concave with respect to θ. Therefore, we can use gradient ascent to find the MLE of θ, which is similar to using contrastive divergence [Hinton, 2002] to learn MRFs in Section 3.1.

Denote Σ_{x∈X^d} q_x^(t)(x) log P(x; θ) by L_M(θ|q_x^(t)). The partial derivative of L_M(θ|q_x^(t)) with respect to θ_i is

\[
\frac{\partial L_M(\theta|q_x^{(t)})}{\partial \theta_i} = \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x)\,\big[\psi_i(x) - E_{\theta}\psi_i(x)\big].
\]

Therefore, the derivative here is similar to the derivative in contrastive divergence in Formula (4), except that we have to reweight it by q_x^(t). We run the EM algorithm until both θ and ϕ converge. Note that when learning homogeneous HMRFs with this algorithm, we tie all θ's all the time, namely θ = {θ}. Therefore, we name this parameter learning algorithm for conventional HMRFs the EM-homo-PCD algorithm.

3.3 Learning Heterogeneous HMRFs

Learning heterogeneous HMRFs is different from learning conventional homogeneous HMRFs in two ways. First, we need to free the θ's in heterogeneous HMRFs. Second, there is some background knowledge k about how the θ's are different, as introduced in Section 2. Therefore, we make two modifications to the EM-homo-PCD algorithm in order to learn heterogeneous HMRFs with background knowledge. First, we estimate the θ's separately, which obviously brings more variance into the estimation. Second, within each iteration of the contrastive divergence update, we apply a kernel regression to smooth the estimate of the θ's with respect to the background knowledge k. Specifically, in the i-th iteration of the PCD update, we advance the particles under θ̂_(i) for n steps, and calculate the moments E_{θ̂_(i)}ψ from the particles. Therefore, we can update the estimate as

\[
\tilde{\theta}_{(i+1)} = \hat{\theta}_{(i)} + \eta\,\nabla L_M\big(\theta\,|\,q_x^{(t)}\big).
\]

Then we regress θ̃_(i+1) with respect to k via Nadaraya-Watson kernel regression [Nadaraya, 1964, Watson, 1964], and set θ̂_(i+1) to be the fitted values. For ease of notation, we drop the iteration index (i+1). Suppose that θ̃ = {θ̃_1, ..., θ̃_r} is the estimate before kernel smoothing; we set the smoothed estimate θ̂ = {θ̂_1, ..., θ̂_r} as

\[
\hat{\theta}_j = \sum_{i=1}^{r} \gamma_{ij}\,\tilde{\theta}_i \quad \text{for } j = 1, \ldots, r,
\qquad \text{where} \quad
\gamma_{ij} = \frac{K\big((k_i - k_j)/h\big)}{\sum_{m=1}^{r} K\big((k_m - k_j)/h\big)}.
\]

For the kernel function K, we use the Epanechnikov kernel, which is usually computationally more efficient than a Gaussian kernel. We tune the bandwidth h through cross-validation, namely we select the bandwidth which minimizes the leave-one-out score

\[
\frac{1}{r} \sum_{i=1}^{r} \left( \frac{\tilde{\theta}_i - \hat{\theta}_i}{1 - \gamma_{ii}} \right)^2.
\]

Tuning the bandwidth is usually computation-intensive, so we tune it every t_0 iterations to save computation. We name our parameter learning algorithm for heterogeneous HMRFs the EM-kernel-PCD algorithm. Its pseudo-code is given in Algorithm 2.

Another intuitive way of handling background knowledge about these heterogeneous parameters is to create bins according to the background knowledge and tie the θ's that are in the same bin. Suppose that we have b bins after we carefully select the binwidth; then we have θ = {θ_1, ..., θ_b}. The rest of the algorithm is the same as the EM-homo-PCD algorithm in Section 3.2. We name this parameter learning algorithm via binning the EM-binning-PCD algorithm. We can also regard our EM-kernel-PCD algorithm as a soft-binning version of EM-binning-PCD.
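A minimal sketch of the smoothing step under the same assumptions as before: Nadaraya-Watson regression of the raw estimates θ̃ on the background knowledge k with an Epanechnikov kernel, plus bandwidth selection by the leave-one-out score above. The function names are ours.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kernel_reg_fit(theta_tilde, k, h):
    """Nadaraya-Watson smoothing of theta_tilde against the background
    knowledge k with bandwidth h. Returns the fitted values theta_hat and
    the diagonal weights gamma_ii used in the leave-one-out score."""
    U = (k[:, None] - k[None, :]) / h         # U[i, j] = (k_i - k_j) / h
    K = epanechnikov(U)
    gamma = K / K.sum(axis=0, keepdims=True)  # gamma[i, j] as defined above
    theta_hat = gamma.T @ theta_tilde         # theta_hat_j = sum_i gamma_ij * theta_tilde_i
    return theta_hat, np.diag(gamma)

def select_bandwidth(theta_tilde, k, grid):
    """Pick h from `grid` minimizing the leave-one-out cross-validation score."""
    best_h, best_score = None, np.inf
    for h in grid:
        theta_hat, gamma_ii = kernel_reg_fit(theta_tilde, k, h)
        resid = (theta_tilde - theta_hat) / (1.0 - gamma_ii)
        score = np.mean(resid ** 2)
        if np.isfinite(score) and score < best_score:
            best_h, best_score = h, score
    return best_h
```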

Algorithm 2 EM-kernel-PCD Algorithm
1: Input: sample y, background knowledge k, max iteration number T, initial bandwidth h
2: Output: θ̂ from the last iteration
3: Procedure:
4: Initialize θ̂, ϕ̂ and particles
5: while not converged do
6:   E-step: infer x̂ from y
7:   Calculate E_x̂ψ from x̂
8:   for i = 1 to T do
9:     Advance particles for n steps under θ̂_(i)
10:    Calculate E_{θ̂_(i)}ψ from the particles
11:    θ̃_(i+1) = θ̂_(i) + η∇L_M(θ|q_x^(t))
12:    θ̂_(i+1) = kernelRegFit(θ̃_(i+1), k, h)
13:    Adjust η and tune bandwidth h
14:  end for
15:  MLE ϕ̂ from x̂ and y
16: end while

Figure 2: Performance of EM-homo-PCD, EM-binning-PCD and EM-kernel-PCD in tree-HMRFs and grid-HMRFs for two types of background knowledge: (a) k_i = sin θ_i + ε, and (b) k_i = θ_i² + ε.

4 Simulations

We investigate the performance of our EM-kernel-PCD algorithm on heterogeneous HMRFs with different structures, namely a tree-structure HMRF and a grid-structure HMRF. In the simulations, we first set the ground truth of the parameters, and then set the background knowledge. We then generate one example x and then generate one example y|x. With the observable y, we apply EM-kernel-PCD, EM-binning-PCD and EM-homo-PCD to learn the parameters θ. We eventually compare the three algorithms by their average absolute estimate error \( \frac{1}{r}\sum_{i=1}^{r} |\hat{\theta}_i - \theta_i| \), where θ̂_i is the estimate of θ_i.

For the HMRFs, each dimension of X takes values in {0, 1}. The pairwise potential function φ_i on edge i (connecting X_u and X_v), parameterized by θ_i (0 < θ_i < 1), is φ_i(X; θ_i) = θ_i^{I(X_u=X_v)} (1 − θ_i)^{I(X_u≠X_v)} for i = 1, ..., r, where I is an indicator variable. For the tree structure, we choose a perfect binary tree of height 12, which yields a total of 8,191 nodes and 8,190 parameters, i.e. d = 8,191 and r = 8,190. For the grid structure, we choose a grid of 100 rows and 100 columns, which yields a total of 10,000 nodes and 19,800 parameters, i.e. d = 10,000 and r = 19,800.
For both of the two models, we generate θ_i ∼ U(0.5, 1) independently and then generate the background knowledge k_i. We have two types of background knowledge. In the first type, we set k_i = sin θ_i + ε. In the second type, we set k_i = θ_i² + ε, where ε is random Gaussian noise from N(0, σ_ε²). We try three values for σ_ε, namely 0, 0.01 and 0.02. Then we generate one instantiation x. Finally, we generate one observable y from a d-dimensional multivariate normal distribution N(μ_x, σ²I), where μ = 2 is the strength of the signal, σ² = 1.0 is the variance of the manifestation, and I is the identity matrix of dimension d. For our EM-kernel-PCD algorithm, we use an Epanechnikov kernel with α = β = 5. For tuning the bandwidth h, we try 100 values in total, namely 0.005, 0.01, 0.015, ..., 0.5. For the EM-binning-PCD algorithm, we set the binwidth to be 0.005. The rest of the parameter settings for the three algorithms are the same, including the n parameter in PCD, which is set to 1, and the number of particles, which is set to 100. We replicate each experiment 20 times, and the averaged results are reported.

Performance of the algorithms: The results from the tree-structure HMRFs and the grid-structure HMRFs are reported in Figure 2. We plot the average absolute error of the estimates of the three algorithms against the number of iterations of the PCD update. We have separate plots for background knowledge k_i = sin θ_i + ε and background knowledge k_i = θ_i² + ε. Since there are three noise levels for the background knowledge, both the EM-kernel-PCD algorithm and the EM-binning-PCD algorithm have three variations. All three algorithms converge as they iterate. It is observed that the absolute estimate error of the EM-homo-PCD algorithm reduces to 0.125 as it converges. Since the parameters θ_i are drawn independently from the uniform distribution on the interval [0.5, 1], the EM-homo-PCD algorithm ties all the θ_i's and estimates them to be 0.75. Therefore, the averaged absolute error is \( \int_{0.5}^{1.0} 2|x - 0.75|\,dx = 0.125 \). Our EM-kernel-PCD algorithm significantly outperforms the EM-binning-PCD algorithm and the EM-homo-PCD algorithm. It is also observed that as the noise level of the background knowledge increases, the performance of the EM-kernel-PCD algorithm and the EM-binning-PCD algorithm deteriorates. However, as long as the noise level is moderate, the performance of our EM-kernel-PCD algorithm is satisfactory. The results from the tree-structure HMRFs and the grid-structure HMRFs are comparable, except that it takes more iterations to converge in the grid-structure HMRFs than in the tree-structure HMRFs.

Behavior of the algorithms: We then plot the estimated parameters against their background knowledge over the iterations of our EM-kernel-PCD algorithm. We provide plots after 100 iterations, after 200 iterations and after convergence, respectively, to show how the EM-kernel-PCD algorithm behaves during the gradient ascent. Figure 3 shows the plots for the background knowledge k_i = sin θ_i + ε and the background knowledge k_i = θ_i² + ε with three levels of noise (namely σ_ε = 0, 0.01 and 0.02) for both the tree-structure HMRFs and the grid-structure HMRFs. It is observed that as the algorithm iterates, it gradually recovers the relationship between the parameters and the background knowledge. There is still a gap between our estimate and the ground truth. This is because we only have one hidden instantiation x, and we have to infer x from the observed y in the E-step. Especially at the boundaries, we can observe a certain amount of estimation bias. Boundary bias is very common in kernel regression problems because there are fewer data points at the boundaries [Fan, 1992].

Choosing parameter n: One parameter in contrastive divergence algorithms is n, the number of MCMC steps we need to perform under the current parameters in order to generate the particles. The rationale of contrastive divergence is that a few MCMC steps under the current parameters are enough to find the direction in which to update the parameters, and we do not have to reach equilibrium. Therefore, the parameter n is usually set to be very small to save computation when learning general Markov random fields. Here we explore how we should choose the n parameter in our EM-kernel-PCD algorithm for learning HMRFs. We choose three values for n in the simulations, namely 1, 5 and 10. In Figure 4, the running time and the absolute estimate error are plotted for the three choices in the tree-structure HMRFs and the grid-structure HMRFs, under different levels of noise in the background knowledge k_i = sin θ_i + ε and the background knowledge k_i = θ_i² + ε. The running time increases as n increases, but the estimation accuracy does not increase. This observation stays the same for different structures and different levels of noise in different types of background knowledge. This suggests that we can simply choose n = 1 in our EM-kernel-PCD algorithm.

5 Real-world Application

We use our EM-kernel-PCD algorithm to learn a heterogeneous HMRF model in a real-world genome-wide association study on breast cancer. The dataset is from NCI's Cancer Genetics Markers of Susceptibility (CGEMS) study [Hunter et al., 2007]. In total, 528,173 genetic markers (single-nucleotide polymorphisms or SNPs) for 1,145 breast cancer cases and 1,142 controls are genotyped on the Illumina HumanHap500 array, and the task is to identify the SNPs which are associated with breast cancer. This dataset has been used in the study of Liu et al. [2012b]. We build a heterogeneous HMRF model to identify the associated SNPs. In the HMRF model, the hidden vector X ∈ {0, 1}^d denotes whether the SNPs are associated with breast cancer, i.e. X_i = 1 means that SNP i is associated with breast cancer. For each SNP, we can perform a two-proportion z-test from the minor allele count in cases and the minor allele count in controls.
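For concreteness, the two-proportion z-test statistic mentioned above could be computed as follows; this is a hedged sketch, and the allele counts in the usage example are made up for illustration rather than taken from the CGEMS data.

```python
import numpy as np

def two_proportion_z(count_case, n_case, count_ctrl, n_ctrl):
    """Two-proportion z-test statistic for one SNP, comparing minor allele
    frequencies in cases and controls (counts out of 2 * #subjects chromosomes)."""
    p1, p2 = count_case / n_case, count_ctrl / n_ctrl
    p = (count_case + count_ctrl) / (n_case + n_ctrl)   # pooled proportion
    se = np.sqrt(p * (1.0 - p) * (1.0 / n_case + 1.0 / n_ctrl))
    return (p1 - p2) / se

# Illustrative (hypothetical) minor allele counts for one SNP:
# 1,145 cases -> 2 * 1,145 chromosomes; 1,142 controls -> 2 * 1,142 chromosomes.
y_i = two_proportion_z(count_case=620, n_case=2 * 1145,
                       count_ctrl=520, n_ctrl=2 * 1142)
print(y_i)
```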
Denote by Y_i the test statistic from the two-proportion z-test for SNP i. It can be derived that Y_i | X_i = 0 ∼ N(0, 1) and Y_i | X_i = 1 ∼ N(μ_1, 1) for some unknown μ_1 (μ_1 ≠ 0). We assume that X forms a pairwise Markov random field with respect to the graph G. The graph G is built as follows. We query the squared correlation coefficients (r² values) among the SNPs from HapMap [The International HapMap Consortium, 2003]. Each SNP becomes a node in the graph. For each SNP, we connect it with the SNP having the highest r² value with it. We also remove the edges whose r² values are below 0.25. There are in total 340,601 edges in the graph. The pairwise potential function φ_i on edge i (connecting X_u and X_v), parameterized by θ_i (0 < θ_i < 1), is φ_i(X; θ_i) = θ_i^{I(X_u=X_v)} (1 − θ_i)^{I(X_u≠X_v)} for i = 1, ..., 340,601, where I is an indicator variable. It is believed that two SNPs with a higher level of correlation are more likely to agree in their association with breast cancer. Therefore, we set the background knowledge k about the parameters to be the r² values between the SNPs on the edges. We first perform the two-proportion z-tests and set y to be the calculated test statistics. Then we estimate θ|y, k in the heterogeneous HMRF with respect to G using our EM-kernel-PCD algorithm.
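The graph construction just described, connecting each SNP to its highest-r² partner and dropping weak edges, could be sketched as below; the r² matrix is assumed to be available (e.g. queried from HapMap beforehand), and the function name is ours.

```python
import numpy as np

def build_snp_graph(r2, threshold=0.25):
    """Build the edge set described above: connect each SNP to the SNP with
    which it has the highest r^2, then drop edges with r^2 < threshold.
    `r2` is a symmetric (d, d) matrix of squared correlations (diagonal ignored).
    Returns the edge list and the background knowledge k (the r^2 per edge)."""
    r2 = np.asarray(r2, dtype=float).copy()
    d = r2.shape[0]
    np.fill_diagonal(r2, -np.inf)
    edges = set()
    for i in range(d):
        j = int(np.argmax(r2[i]))            # strongest partner of SNP i
        if r2[i, j] >= threshold:
            edges.add((min(i, j), max(i, j)))
    edge_list = sorted(edges)
    k = np.array([r2[u, v] for u, v in edge_list])
    return edge_list, k
```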

Figure 3: The behavior of the EM-kernel-PCD algorithm during gradient ascent for different types of background knowledge with different levels of noise in the tree-structure HMRFs and the grid-structure HMRFs. The red dots show the mapping pattern between the ground truth of the parameters and their background knowledge.

After we estimate θ and μ_1, we calculate the marginal probabilities of the hidden X. Eventually, we rank the SNPs by the marginal probabilities P(X_i = 1 | y; θ̂, μ̂_1), and select the SNPs with the largest marginal probabilities.

The algorithm ran for 46 days on a single processor (AMD Opteron Processor, 3300 MHz) before it converged. We plotted the estimated parameters against their background knowledge, namely the r² values between the pairs of SNPs on the edges. The plot is provided in Figure 5. It is observed that the mapping between the estimated parameters and the background knowledge is monotone increasing, as we expect. Finally, we calculated the marginal probabilities of the hidden X and ranked the SNPs by the marginal probabilities P(X_i = 1 | y; θ̂, μ̂_1). There are in total five SNPs with P(X_i = 1 | y; θ̂, μ̂_1) greater than 0.99, which means they are associated with breast cancer with a probability greater than 0.99 given the observed test statistics y under the estimated parameters θ̂ and μ̂_1.

Figure 5: The estimated parameters against their background knowledge, namely the r² values between the pairs of SNPs.

There is strong evidence in the literature that supports the association with breast cancer for three of them. The two SNPs rs2420946 and rs1219648 on chromosome 10 are reported by Hunter et al. [2007], and have been further validated by 1,776 cases and 2,072 controls from three additional studies. Their associated gene FGFR2 is very well known to be associated with breast cancer in the literature. There is also strong evidence supporting the association of the SNP rs7712949 on chromosome 5. The SNP rs7712949 is highly correlated (r² = 0.948) with SNP rs4415084, which has been identified to be associated with breast cancer by another six large-scale studies (http://snpedia.com/index.php/rs4415084).

Figure 4: Absolute estimate error (plotted in blue, in the units on the right axes) and running time (plotted in black, in minutes on the left axes) of the EM-kernel-PCD algorithm in the tree-structure HMRFs and the grid-structure HMRFs when we choose different n values; n is the number of MCMC steps for advancing particles in the PCD algorithm. The absolute estimate error in the first 400 iterations is not shown in the plots.

6 Conclusion

Capturing parameter heterogeneity is an important issue in machine learning and statistics, and it is particularly challenging in HMRFs due to both the intractable Z(θ) and the latent x. In this paper, we propose the EM-kernel-PCD algorithm for learning the heterogeneous parameters with background knowledge. Our algorithm is built upon the PCD algorithm, which handles the intractable Z(θ). The EM part we add is for dealing with the hidden x. The kernel smoothing part we add adaptively incorporates the background knowledge about the heterogeneity of the parameters into the gradient ascent learning. Eventually, the relation between the parameters and the background knowledge is recovered in a nonparametric way, which is also adaptive to the data. Simulations show that our algorithm is effective for learning heterogeneous HMRFs and outperforms alternative binning methods.

Similar to other EM algorithms, our algorithm only converges to a local maximum of the likelihood L(θ, ϕ), although the lower bound F(q_x(x), {θ, ϕ}) is nondecreasing over the EM iterations (except for some MCMC error introduced in the E-step). Our algorithm also suffers from a long running time due to the computationally expensive PCD algorithm within each M-step. These two issues are important directions for future work.

Acknowledgements

The authors gratefully acknowledge the support of NIH grants R01GM097618, R01CA165229, R01LM010921, P30CA014520, UL1TR000427, NSF grants DMS-1106586, DMS-1308872 and the UW Carbone Cancer Center.

References

A. U. Asuncion, Q. Liu, A. T. Ihler, and P. Smyth. Particle filtered MCMC-MLE with connections to contrastive divergence. In ICML, 2010.

M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In NIPS, 2002.

J. Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D (The Statistician), 24(3):179–195, 1975.

G. Celeux, F. Forbes, and N. Peyrard. EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recognition, 36:131–144, 2003.

S. P. Chatzis and G. Tsechpenakis. The infinite hidden Markov random field model. In ICCV, 2009.

S. P. Chatzis and T. A. Varvarigou. A fuzzy clustering approach toward hidden Markov random field models for enhanced spatially constrained image segmentation. IEEE Transactions on Fuzzy Systems, 16:1351–1361, 2008.

J. Fan. Design-adaptive nonparametric regression. Journal of the American Statistical Association, 87(420):998–1004, 1992.

J. V. Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In ICML, 2008.

C. J. Geyer. Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics, pages 156–163, 1991.

Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245–273, 1997.

M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.

G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

D. J. Hunter, P. Kraft, K. B. Jacobs, D. G. Cox, M. Yeager, S. E. Hankinson, S. Wacholder, Z. Wang, R. Welch, A. Hutchinson, J. Wang, K. Yu, N. Chatterjee, N. Orr, W. C. Willett, G. A. Colditz, R. G. Ziegler, C. D. Berg, S. S. Buys, C. A. McCarty, H. S. Feigelson, E. E. Calle, M. J. Thun, R. B. Hayes, M. Tucker, D. S. Gerhard, J. F. Fraumeni, R. N. Hoover, G. Thomas, and S. J. Chanock. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics, 39(7):870–874, 2007.

A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007.

J. Kim and R. Zabih. Factorial Markov random fields. In ECCV, pages 321–334, 2002.

X. Lan, S. Roth, D. Huttenlocher, and M. J. Black. Efficient belief propagation with learned higher-order Markov random fields. In ECCV, pages 269–282, 2006.

J. Liu, C. Zhang, C. McCarty, P. Peissig, E. Burnside, and D. Page. High-dimensional structured feature screening using binary Markov random fields. In AISTATS, 2012a.

J. Liu, C. Zhang, C. McCarty, P. Peissig, E. Burnside, and D. Page. Graphical-model based multiple testing under dependence, with applications to genome-wide association studies. In UAI, 2012b.

S. Lyu. Unifying non-maximum likelihood learning objectives with minimum KL contraction. In NIPS, pages 64–72, 2011.

J. C. Marioni, N. P. Thorne, and S. Tavaré. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22:1144–1146, 2006.

E. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 9(1):141–142, 1964.

L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. In ICML, 2010.

The International HapMap Consortium. The International HapMap Project. Nature, 426:789–796, 2003.

T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.

T. Tieleman and G. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.

D. Vickrey, C. Lin, and D. Koller. Non-local contrastive objectives. In ICML, 2010.

G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4):359–372, 1964.

Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 2001.

S. C. Zhu and X. Liu. Learning in Gibbsian fields: How accurate and how fast can it be? IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1001–1006, 2002.