Learning Heterogeneous Hidden Markov Random Fields

Jie Liu (CS, UW-Madison) · Chunming Zhang (UW-Madison) · Elizabeth Burnside (Radiology, UW-Madison) · David Page (BMI & CS, UW-Madison)

Abstract

Hidden Markov random fields (HMRFs) are conventionally assumed to be homogeneous in the sense that the potential functions are invariant across different sites. However, in some biological applications it is desirable to make HMRFs heterogeneous, especially when there exists some background knowledge about how the potential functions vary. We formally define heterogeneous HMRFs and propose an EM algorithm whose M-step combines a contrastive divergence learner with a kernel smoothing step to incorporate the background knowledge. Simulations show that our algorithm is effective for learning heterogeneous HMRFs and outperforms alternative binning methods. We learn a heterogeneous HMRF in a real-world study.

1 Introduction

Hidden Markov models (HMMs) and hidden Markov random fields (HMRFs) are useful approaches for modelling structured data such as speech, text, vision and biological data. HMMs and HMRFs have been extended in many ways, such as the infinite models [Beal et al., 2002, Gael et al., 2008, Chatzis and Tsechpenakis, 2009], the factorial models [Ghahramani and Jordan, 1997, Kim and Zabih, 2002], the high-order models [Lan et al., 2006] and the nonparametric models [Hsu et al., 2009, Song et al., 2010]. HMMs are homogeneous in the sense that the transition matrix stays the same across different sites. HMRFs, intensively used in image segmentation tasks [Zhang et al., 2001, Celeux et al., 2003, Chatzis and Varvarigou, 2008], are also homogeneous. The homogeneity assumption for HMRFs in image segmentation tasks is legitimate, because people usually assume that the neighborhood system on an image is invariant across different regions. However, it is necessary to bring heterogeneity to HMMs and HMRFs in some biological applications where the correlation structure can change over different sites. For example, a heterogeneous HMM is used for segmenting array CGH data [Marioni et al., 2006], and the transition matrix depends on some background knowledge, i.e. some distance measurement which changes over the sites. A heterogeneous HMRF is used to filter SNPs in genome-wide association studies [Liu et al., 2012a], and the pairwise potential functions depend on some background knowledge, i.e. some correlation measure between the SNPs which can differ between different pairs. In both of these applications, the transition matrix and the pairwise potential functions are heterogeneous and are parameterized as monotone parametric functions of the background knowledge. Although the algorithms tune the parameters in the monotone functions, there is no justification that the parameterization of the monotone functions is correct. Can we adopt the background knowledge about these heterogeneous parameters adaptively during HMRF learning, and recover the relation between the parameters and the background knowledge nonparametrically?

Our paper is the first to learn HMRFs with heterogeneous parameters by adaptively incorporating the background knowledge. It is an EM algorithm whose M-step combines a contrastive divergence style learner with a kernel smoothing step to incorporate the background knowledge. Details about our EM-kernel-PCD algorithm are given in Section 3, after we formally define heterogeneous HMRFs in Section 2. Simulations in Section 4 show that our EM-kernel-PCD algorithm is effective for learning heterogeneous HMRFs and outperforms alternative methods. In Section 5, we learn a heterogeneous HMRF in a real-world genome-wide association study. We conclude in Section 6.

2 Models

2.1 HMRFs and the Homogeneity Assumption

Suppose that X = {0, 1, ..., m − 1} is a discrete space, and we have a Markov random field (MRF) defined on a random vector X ∈ X^d. The conditional independence is described by an undirected graph G(V, E). The node set V consists of d nodes, and the edge set E consists of r edges. The probability of x from the MRF with parameters θ is

\[
P(x;\theta) = \frac{Q(x;\theta)}{Z(\theta)} = \frac{1}{Z(\theta)} \prod_{c \in C(G)} \phi_c(x;\theta_c), \qquad (1)
\]

where Z(θ) is the normalizing constant and Q(x; θ) is the unnormalized measure, with C(G) being some subset of the cliques in G. The potential function φ_c is defined on the clique c and is parameterized by θ_c. For simplicity, in this paper we consider pairwise MRFs, whose potential functions are defined on the edges, namely |C(G)| = r. We further assume that each pairwise potential function is parameterized by a single parameter, i.e. θ_c = {θ_c}.

A hidden Markov random field [Zhang et al., 2001, Celeux et al., 2003, Chatzis and Varvarigou, 2008] consists of a hidden random field X ∈ X^d and an observable random field Y ∈ Y^d, where Y is another space (either continuous or discrete). The random field X is a Markov random field with density P(x; θ), as defined in Formula (1), and its instantiation x cannot be measured directly. Instead, we can observe the emitted random field Y, with each component Y_i depending on X_i for i = 1, ..., d, namely P(y|x; ϕ) = ∏_{i=1}^d P(y_i|x_i; ϕ), where ϕ = {ϕ_0, ..., ϕ_{m−1}} and ϕ_{x_i} parameterizes the emitting distribution of Y_i under the state x_i. Therefore, the joint probability of x and y is

\[
P(x,y;\theta,\varphi) = P(x;\theta)\,P(y|x;\varphi) = \frac{1}{Z(\theta)} \prod_{c \in C(G)} \phi_c(x;\theta_c) \prod_{i=1}^{d} P(y_i|x_i;\varphi). \qquad (2)
\]

Example 1: A pairwise HMRF model with three latent variables (X1, X2, X3) and three observable variables (Y1, Y2, Y3) is given in Figure 1. Let X = {0, 1}. X1, X2 and X3 are connected by three edges. The pairwise potential function φ_i on edge i (connecting X_u and X_v), parameterized by θ_i (0 < θ_i < 1), is φ_i(X; θ_i) = θ_i^{I(X_u = X_v)} (1 − θ_i)^{I(X_u ≠ X_v)} for i = 1, 2, 3, where I is an indicator variable. Let Y = R. For i = 1, 2, 3, Y_i | X_i = 0 ∼ N(μ_0, σ_0²) and Y_i | X_i = 1 ∼ N(μ_1, σ_1²), namely ϕ_0 = {μ_0, σ_0} and ϕ_1 = {μ_1, σ_1}.

Figure 1: The pairwise HMRF model with three latent nodes (X1, X2, X3) and observable nodes (Y1, Y2, Y3), with parameters θ = {θ1, θ2, θ3} and ϕ = {ϕ0, ϕ1}.

In common applications of HMRFs, we observe only one instantiation y, which is emitted according to the hidden state vector x, and the task is to infer the most probable state configuration of X, or to compute the marginal probabilities of X. In both tasks, we need to estimate the parameters θ = {θ1, ..., θr} and ϕ = {ϕ0, ..., ϕ_{m−1}}. Usually, we seek maximum likelihood estimates of θ and ϕ which maximize the log likelihood

\[
L(\theta,\varphi) = \log P(y;\theta,\varphi) = \log \sum_{x \in \mathcal{X}^d} P(x,y;\theta,\varphi). \qquad (3)
\]

Since we only have one instantiation (x, y), we usually have to assume that the θ_i's are the same for i = 1, ..., r for effective parameter learning. This homogeneity assumption is widely used in computer vision problems because people usually assume that the neighborhood system on an image is invariant across its different regions. Therefore, conventional HMRFs refer to homogeneous HMRFs, similar to conventional HMMs whose transition matrix is invariant across different sites.

2.2 Heterogeneous HMRFs

In a heterogeneous HMRF, the potential functions on different cliques can be different. Taking the model in Figure 1 as an example, θ1, θ2 and θ3 can be different if the HMRF is heterogeneous. As with conventional HMRFs, we want to be able to address applications that have one instantiation (x, y), where y is observable and x is hidden. Therefore, learning an HMRF from one instantiation y is infeasible if we free all θ's. To partially free the parameters, we assume that there is some background knowledge k = {k1, ..., kr} about the parameters θ = {θ1, ..., θr}, in the form of some unknown smooth mapping function which maps θ_i to k_i for i = 1, ..., r. The background knowledge describes how these potential functions differ across different cliques. Taking pairwise HMRFs as an example, the potentials on edges with similar background knowledge should have similar parameters. We can regard the homogeneity assumption in conventional HMRFs as an extreme type of background knowledge in which k1 = k2 = ... = kr. The problem we solve in this paper is to estimate θ and ϕ which maximize the log likelihood L(θ, ϕ) in Formula (3), subject to the condition that the estimate of θ is smooth with respect to k.
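To make Example 1 concrete, here is a minimal sketch, not from the paper, of how Formula (2) could be evaluated and used for exact posterior inference in the three-node model of Figure 1; the numeric values of θ, ϕ and y below are illustrative choices of ours.

```python
import numpy as np
from scipy.stats import norm
from itertools import product

# Example 1: three hidden binary nodes connected by three edges.
edges = [(0, 1), (1, 2), (0, 2)]
theta = [0.9, 0.7, 0.6]                 # illustrative edge parameters theta_i in (0, 1)
phi = {0: (0.0, 1.0), 1: (2.0, 1.0)}    # emission parameters (mu_s, sigma_s) for state s

def unnormalized_joint(x, y):
    """Q(x; theta) * P(y | x; phi), i.e. the joint in Formula (2) up to 1/Z(theta)."""
    q = 1.0
    for (u, v), t in zip(edges, theta):
        q *= t if x[u] == x[v] else (1.0 - t)     # phi_i(x) = t^I(eq) * (1-t)^I(neq)
    for xi, yi in zip(x, y):
        mu, sigma = phi[xi]
        q *= norm.pdf(yi, loc=mu, scale=sigma)    # P(y_i | x_i; phi)
    return q

y = np.array([0.3, 2.1, 1.7])                     # one observed instantiation
# Posterior over the 2^3 hidden configurations, normalized by brute force;
# the intractable Z(theta) cancels in the normalization.
weights = {x: unnormalized_joint(x, y) for x in product([0, 1], repeat=3)}
total = sum(weights.values())
posterior = {x: w / total for x, w in weights.items()}
print(max(posterior, key=posterior.get))          # most probable configuration of X
```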

3 Parameter Learning Methods

Learning heterogeneous HMRFs in the above manner involves three difficulties: (i) the intractable Z(θ), (ii) the latent x, and (iii) the heterogeneous θ. The way we handle the intractable Z(θ) is similar to using contrastive divergence [Hinton, 2002] to learn MRFs. We review contrastive divergence and its variations in Section 3.1. To handle the latent x in HMRF learning, we introduce an EM algorithm in Section 3.2, which is applicable to conventional HMRFs. In Section 3.3, we further address the heterogeneity of θ in the M-step of the EM algorithm.

3.1 Contrastive Divergence for MRFs

Assume that we observe s independent samples X = {x¹, x², ..., xˢ} from (1), and we want to estimate θ. The log likelihood L(θ|X) is concave w.r.t. θ, and we can use gradient ascent to find the MLE of θ. The partial derivative of L(θ|X) with respect to θ_i is

\[
\frac{\partial L(\theta|\mathcal{X})}{\partial \theta_i} = \frac{1}{s} \sum_{j=1}^{s} \psi_i(x^j) - E_{\theta}\psi_i = E_{\mathcal{X}}\psi_i - E_{\theta}\psi_i, \qquad (4)
\]

where ψ_i is the sufficient statistic corresponding to θ_i, and E_θψ_i is the expectation of ψ_i with respect to the distribution specified by θ. In the i-th iteration of gradient ascent, the parameter update is

\[
\theta_{(i+1)} = \theta_{(i)} + \eta \nabla L(\theta_{(i)}|\mathcal{X}) = \theta_{(i)} + \eta\,\big(E_{\mathcal{X}}\psi - E_{\theta_{(i)}}\psi\big),
\]

where η is the learning rate. However, the exact computation of E_θψ_i takes time exponential in the treewidth of G. A few sampling-based methods have been proposed to solve this problem. The key differences among these methods are how they draw particles and how they compute E_θψ from the particles. MCMC-MLE [Geyer, 1991, Zhu and Liu, 2002] uses importance sampling, but might suffer from degeneracy when θ_(i) is far away from θ_(1). Contrastive divergence [Hinton, 2002] generates new particles in each iteration according to the current θ_(i) and does not require the particles to reach equilibrium, so as to save computation. Variations of contrastive divergence include particle-filtered MCMC-MLE [Asuncion et al., 2010], persistent contrastive divergence (PCD) [Tieleman, 2008] and fast PCD [Tieleman and Hinton, 2009]. Because PCD is efficient and easy to implement, we employ it in this paper. Its pseudo-code is provided in Algorithm 1. Other than contrastive divergence, MRFs can also be learned via ratio matching [Hyvärinen, 2007], non-local contrastive objectives [Vickrey et al., 2010], noise-contrastive estimation [Gutmann and Hyvärinen, 2010] and minimum KL contraction [Lyu, 2011].

Algorithm 1 PCD-n Algorithm [Tieleman, 2008]
1: Input: independent samples X = {x¹, x², ..., xˢ} from P(x; θ), maximum iteration number T
2: Output: θ̂ from the last iteration
3: Procedure:
4: Initialize θ_(1) and initialize particles
5: Calculate E_Xψ from X
6: for i = 1 to T do
7:   Advance particles n steps under θ_(i)
8:   Calculate E_{θ_(i)}ψ from the particles
9:   θ_(i+1) = θ_(i) + η(E_Xψ − E_{θ_(i)}ψ)
10:  Adjust η
11: end for
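As an illustration of Algorithm 1, the following is a minimal PCD-n sketch of our own for a toy pairwise binary MRF on a chain. It works in the equivalent log-linear parameterization w_i = log(θ_i/(1−θ_i)), so the sufficient statistic of edge i is ψ_i(x) = I(x_u = x_v); particles are advanced by Gibbs sweeps, and the learning-rate adjustment of line 10 is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise binary MRF on a chain; log-linear edge parameters w_i,
# equivalent to theta_i in (0, 1) via w_i = log(theta_i / (1 - theta_i)).
d = 20
edges = [(i, i + 1) for i in range(d - 1)]
nbrs = {i: [] for i in range(d)}
for e, (u, v) in enumerate(edges):
    nbrs[u].append((e, v))
    nbrs[v].append((e, u))

def suff_stats(x):
    """psi_i(x) = I(x_u == x_v) for every edge i."""
    return np.array([float(x[u] == x[v]) for u, v in edges])

def gibbs_sweep(x, w):
    """One Gibbs sweep over all nodes under edge parameters w (in place)."""
    for i in range(d):
        logp = [sum(w[e] * (s == x[j]) for e, j in nbrs[i]) for s in (0, 1)]
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
        x[i] = int(rng.random() < p1)
    return x

# "Training" samples drawn from a ground-truth model by long Gibbs runs.
w_true = rng.uniform(0.0, 2.0, size=len(edges))
X = []
for _ in range(200):
    x = rng.integers(0, 2, size=d)
    for _ in range(50):
        gibbs_sweep(x, w_true)
    X.append(x.copy())
EX_psi = np.mean([suff_stats(x) for x in X], axis=0)   # line 5 of Algorithm 1

# PCD-n: persistent particles, n Gibbs sweeps per iteration (lines 6-11).
n, T, eta = 1, 500, 0.05
w = np.zeros(len(edges))
particles = [rng.integers(0, 2, size=d) for _ in range(30)]
for _ in range(T):
    for p in particles:
        for _ in range(n):
            gibbs_sweep(p, w)                          # advance particles n steps
    E_model = np.mean([suff_stats(p) for p in particles], axis=0)
    w += eta * (EX_psi - E_model)                      # gradient step, Formula (4)

print("mean |w_hat - w_true| =", np.mean(np.abs(w - w_true)))
```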

3.2 Expectation-Maximization for Learning Conventional HMRFs

We begin with a lower bound of the log likelihood function, and then introduce the EM algorithm which handles the latent variables in HMRFs. Let q_x(x) be any distribution on x ∈ X^d. It is well known that there exists a lower bound of the log likelihood L(θ, ϕ) in (3), which is provided by an auxiliary function F(q_x(x), {θ, ϕ}) defined as follows,

\[
\mathcal{F}(q_x(x),\{\theta,\varphi\}) = \sum_{x \in \mathcal{X}^d} q_x(x) \log \frac{P(x,y;\theta,\varphi)}{q_x(x)} = L(\theta,\varphi) - \mathrm{KL}\big[q_x(x)\,\|\,P(x|y,\theta,\varphi)\big], \qquad (5)
\]

where KL[q_x(x) ‖ P(x|y, θ, ϕ)] is the Kullback-Leibler divergence between q_x(x) and P(x|y, θ, ϕ), the posterior distribution of the hidden variables. This Kullback-Leibler divergence is the gap between L(θ, ϕ) and F(q_x(x), {θ, ϕ}).

Expectation-Maximization: We maximize L(θ, ϕ) with an expectation-maximization (EM) algorithm which iteratively maximizes its lower bound F(q_x(x), {θ, ϕ}). We first initialize θ^(0) and ϕ^(0). In the t-th iteration, the updates in the expectation (E) step and the maximization (M) step are

\[
q_x^{(t)} = \arg\max_{q_x}\ \mathcal{F}\big(q_x(x), \{\theta^{(t-1)}, \varphi^{(t-1)}\}\big) \quad \text{(E)},
\qquad
\theta^{(t)}, \varphi^{(t)} = \arg\max_{\{\theta,\varphi\}}\ \mathcal{F}\big(q_x^{(t)}, \{\theta, \varphi\}\big) \quad \text{(M)}.
\]

In the E-step, we maximize F(q_x(x), {θ^(t−1), ϕ^(t−1)}) with respect to q_x(x). Because the difference between F(q_x(x), {θ, ϕ}) and L(θ, ϕ) is KL[q_x(x) ‖ P(x|y, θ, ϕ)], the maximizer q_x^(t) in the E-step is P(x|y, θ^(t−1), ϕ^(t−1)), namely the posterior distribution of x|y under the current estimated parameters θ^(t−1) and ϕ^(t−1). For general graphs, this posterior distribution can be calculated by Monte Carlo methods.
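For general graphs, one way to approximate this E-step is to run a Gibbs sampler on P(x|y; θ, ϕ). The sketch below is our own illustration, again using the log-linear edge parameterization and the Gaussian emissions of Example 1; the returned samples play the role of q_x^(t).

```python
import numpy as np

rng = np.random.default_rng(1)

def e_step_gibbs(y, w, edges, mu, sigma, burn_in=100, n_samples=200):
    """Approximate q_x(x) = P(x | y; theta, phi) by Gibbs sampling.

    y     : observed vector of length d
    w     : edge parameters, w_e = log(theta_e / (1 - theta_e))
    edges : list of (u, v) pairs
    mu    : (mu_0, mu_1) emission means; sigma: common emission std. dev.
    """
    d = len(y)
    nbrs = {i: [] for i in range(d)}
    for e, (u, v) in enumerate(edges):
        nbrs[u].append((e, v))
        nbrs[v].append((e, u))

    x = rng.integers(0, 2, size=d)
    samples = []
    for t in range(burn_in + n_samples):
        for i in range(d):
            logp = []
            for s in (0, 1):
                prior = sum(w[e] * (s == x[j]) for e, j in nbrs[i])
                loglik = -0.5 * ((y[i] - mu[s]) / sigma) ** 2   # Gaussian emission
                logp.append(prior + loglik)
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
            x[i] = int(rng.random() < p1)
        if t >= burn_in:
            samples.append(x.copy())
    return np.array(samples)      # each row approximates a draw from q_x^{(t)}
```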

In the M-step, we maximize F(q_x^(t)(x), {θ, ϕ}) with respect to {θ, ϕ}, which can be rewritten as

\[
\arg\max_{\{\theta,\varphi\}} \mathcal{F}\big(q_x^{(t)}(x), \{\theta,\varphi\}\big)
= \arg\max_{\{\theta,\varphi\}} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(x,y;\theta,\varphi)
= \arg\max_{\{\theta,\varphi\}} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \big[\log P(x;\theta) + \log P(y|x;\varphi)\big].
\]

It is obvious that this function can be maximized with respect to ϕ and θ separately as

\[
\theta^{(t)} = \arg\max_{\theta} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(x;\theta),
\qquad
\varphi^{(t)} = \arg\max_{\varphi} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(y|x;\varphi). \qquad (6)
\]

Estimating ϕ: Estimating ϕ in this maximum likelihood manner is straightforward, because the maximization can be rewritten as follows,

\[
\arg\max_{\varphi} \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x) \log P(y|x;\varphi)
= \arg\max_{\varphi} \sum_{i=1}^{d} \sum_{x_i \in \mathcal{X}} q_{x_i}^{(t)}(x_i) \log P(y_i|x_i;\varphi),
\]

where q_x^(t)(x) = ∏_{i=1}^d q_{x_i}^(t)(x_i).
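With the Gaussian emissions of Example 1 and q_x^(t) represented by posterior samples (rows of a 0/1 matrix), this weighted maximum-likelihood update of ϕ reduces to posterior-weighted means and variances. A minimal sketch under these assumptions (the function name is ours):

```python
import numpy as np

def estimate_phi(y, samples):
    """Weighted MLE of phi = {(mu_0, sigma_0), (mu_1, sigma_1)} for Gaussian
    emissions, where q_{x_i}(s) is approximated by the fraction of posterior
    samples (rows of `samples`) with x_i = s."""
    y = np.asarray(y, dtype=float)
    q1 = samples.mean(axis=0)                  # q_{x_i}(1) for each node i
    q = {0: 1.0 - q1, 1: q1}
    phi = {}
    for s in (0, 1):
        wsum = q[s].sum()
        mu_s = (q[s] * y).sum() / wsum         # posterior-weighted mean
        var_s = (q[s] * (y - mu_s) ** 2).sum() / wsum
        phi[s] = (mu_s, np.sqrt(var_s))
    return phi
```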

Estimating θ: Estimating θ in Formula (6) is difficult due to the intractable Z(θ). Some approaches [Zhang et al., 2001, Celeux et al., 2003] use pseudo-likelihood [Besag, 1975] to estimate θ in the M-step. It can be shown that Σ_{x∈X^d} q_x^(t)(x) log P(x; θ) is concave with respect to θ. Therefore, we can use gradient ascent to find the MLE of θ, which is similar to using contrastive divergence [Hinton, 2002] to learn MRFs in Section 3.1.

Denote Σ_{x∈X^d} q_x^(t)(x) log P(x; θ) by L_M(θ|q_x^(t)). The partial derivative of L_M(θ|q_x^(t)) with respect to θ_i is

\[
\frac{\partial L_M(\theta|q_x^{(t)})}{\partial \theta_i} = \sum_{x \in \mathcal{X}^d} q_x^{(t)}(x)\,\big[\psi_i(x) - E_{\theta}\psi_i(x)\big].
\]

Therefore, the derivative here is similar to the derivative in contrastive divergence in Formula (4), except that we have to reweight it by q_x^(t). We run the EM algorithm until both θ and ϕ converge. Note that when learning homogeneous HMRFs with this algorithm, we tie all θ's all the time, namely θ = {θ}. Therefore, we name this parameter learning algorithm for conventional HMRFs the EM-homo-PCD algorithm.

3.3 Learning Heterogeneous HMRFs

Learning heterogeneous HMRFs is different from learning conventional homogeneous HMRFs in two ways. First, we need to free the θ's in heterogeneous HMRFs. Second, there is some background knowledge k about how the θ's are different, as introduced in Section 2. Therefore, we make two modifications to the EM-homo-PCD algorithm in order to learn heterogeneous HMRFs with background knowledge. First, we estimate the θ's separately, which obviously brings more variance into the estimation. Second, within each iteration of the contrastive divergence update, we apply a kernel regression to smooth the estimate of the θ's with respect to the background knowledge k. Specifically, in the i-th iteration of the PCD update, we advance the particles under θ̂_(i) for n steps, and calculate the moments E_{θ̂_(i)}ψ from the particles. Therefore, we can update the estimate as

\[
\tilde{\theta}_{(i+1)} = \hat{\theta}_{(i)} + \eta\,\nabla L_M\big(\theta\,|\,q_x^{(t)}\big).
\]

Then we regress θ̃_(i+1) with respect to k via Nadaraya-Watson kernel regression [Nadaraya, 1964, Watson, 1964], and set θ̂_(i+1) to be the fitted values. For ease of notation, we drop the iteration index (i+1). Suppose that θ̃ = {θ̃_1, ..., θ̃_r} is the estimate before kernel smoothing; we set the smoothed estimate θ̂ = {θ̂_1, ..., θ̂_r} as

\[
\hat{\theta}_j = \sum_{i=1}^{r} \gamma_{ij}\,\tilde{\theta}_i \quad \text{for } j = 1, \ldots, r,
\qquad \text{where} \quad
\gamma_{ij} = \frac{K\big((k_i - k_j)/h\big)}{\sum_{m=1}^{r} K\big((k_m - k_j)/h\big)}.
\]

For the kernel function K, we use the Epanechnikov kernel, which is usually computationally more efficient than a Gaussian kernel. We tune the bandwidth h through cross-validation, namely we select the bandwidth which minimizes the leave-one-out score

\[
\frac{1}{r} \sum_{i=1}^{r} \left( \frac{\tilde{\theta}_i - \hat{\theta}_i}{1 - \gamma_{ii}} \right)^2.
\]

Tuning the bandwidth is usually computation-intensive, so we tune it every t_0 iterations to save computation. We name our parameter learning algorithm for heterogeneous HMRFs the EM-kernel-PCD algorithm. Its pseudo-code is given in Algorithm 2.

Another intuitive way of handling background knowledge about these heterogeneous parameters is to create bins according to the background knowledge and tie the θ's that are in the same bin. Suppose that we have b bins after we carefully select the binwidth; then we have θ = {θ_1, ..., θ_b}. The rest of the algorithm is the same as the EM-homo-PCD algorithm in Section 3.2. We name this parameter learning algorithm via binning the EM-binning-PCD algorithm. We can also regard our EM-kernel-PCD algorithm as a soft-binning version of EM-binning-PCD.
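A minimal sketch of the smoothing step under the same assumptions as before: Nadaraya-Watson regression of the raw estimates θ̃ on the background knowledge k with an Epanechnikov kernel, plus bandwidth selection by the leave-one-out score above. The function names are ours.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kernel_reg_fit(theta_tilde, k, h):
    """Nadaraya-Watson smoothing of theta_tilde against the background
    knowledge k with bandwidth h. Returns the fitted values theta_hat and
    the diagonal weights gamma_ii used in the leave-one-out score."""
    U = (k[:, None] - k[None, :]) / h         # U[i, j] = (k_i - k_j) / h
    K = epanechnikov(U)
    gamma = K / K.sum(axis=0, keepdims=True)  # gamma[i, j] as defined above
    theta_hat = gamma.T @ theta_tilde         # theta_hat_j = sum_i gamma_ij * theta_tilde_i
    return theta_hat, np.diag(gamma)

def select_bandwidth(theta_tilde, k, grid):
    """Pick h from `grid` minimizing the leave-one-out cross-validation score."""
    best_h, best_score = None, np.inf
    for h in grid:
        theta_hat, gamma_ii = kernel_reg_fit(theta_tilde, k, h)
        resid = (theta_tilde - theta_hat) / (1.0 - gamma_ii)
        score = np.mean(resid ** 2)
        if np.isfinite(score) and score < best_score:
            best_h, best_score = h, score
    return best_h
```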

Algorithm 2 EM-kernel-PCD Algorithm
1: Input: sample y, background knowledge k, max iteration number T, initial bandwidth h
2: Output: θ̂ from the last iteration
3: Procedure:
4: Initialize θ̂, ϕ̂ and particles
5: while not converged do
6:   E-step: infer x̂ from y
7:   Calculate E_x̂ψ from x̂
8:   for i = 1 to T do
9:     Advance particles for n steps under θ̂_(i)
10:    Calculate E_{θ̂_(i)}ψ from the particles
11:    θ̃_(i+1) = θ̂_(i) + η∇L_M(θ|q_x^(t))
12:    θ̂_(i+1) = kernelRegFit(θ̃_(i+1), k, h)
13:    Adjust η and tune bandwidth h
14:  end for
15:  MLE ϕ̂ from x̂ and y
16: end while

Figure 2: Performance of EM-homo-PCD, EM-binning-PCD and EM-kernel-PCD in tree-HMRFs and grid-HMRFs for two types of background knowledge: (a) k_i = sin θ_i + ε, and (b) k_i = θ_i² + ε.

4 Simulations

We investigate the performance of our EM-kernel-PCD algorithm on heterogeneous HMRFs with different structures, namely a tree-structure HMRF and a grid-structure HMRF. In the simulations, we first set the ground truth of the parameters, and then set the background knowledge. We then generate one example x and then generate one example y|x. With the observable y, we apply EM-kernel-PCD, EM-binning-PCD and EM-homo-PCD to learn the parameters θ. We eventually compare the three algorithms by their average absolute estimate error \( \frac{1}{r}\sum_{i=1}^{r} |\hat{\theta}_i - \theta_i| \), where θ̂_i is the estimate of θ_i.

For the HMRFs, each dimension of X takes values in {0, 1}. The pairwise potential function φ_i on edge i (connecting X_u and X_v), parameterized by θ_i (0 < θ_i < 1), is φ_i(X; θ_i) = θ_i^{I(X_u=X_v)} (1 − θ_i)^{I(X_u≠X_v)} for i = 1, ..., r, where I is an indicator variable. For the tree structure, we choose a perfect binary tree of height 12, which yields a total of 8,191 nodes and 8,190 parameters, i.e. d = 8,191 and r = 8,190. For the grid structure, we choose a grid of 100 rows and 100 columns, which yields a total of 10,000 nodes and 19,800 parameters, i.e. d = 10,000 and r = 19,800.
For both of the two models, we generate θ_i ∼ U(0.5, 1) independently and then generate the background knowledge k_i. We have two types of background knowledge. In the first type, we set k_i = sin θ_i + ε. In the second type, we set k_i = θ_i² + ε, where ε is random Gaussian noise from N(0, σ_ε²). We try three values for σ_ε, namely 0, 0.01 and 0.02. Then we generate one instantiation x. Finally, we generate one observable y from a d-dimensional multivariate normal distribution N(μ_x, σ²I), where μ = 2 is the strength of the signal, σ² = 1.0 is the variance of the manifestation, and I is the identity matrix of dimension d. For our EM-kernel-PCD algorithm, we use an Epanechnikov kernel with α = β = 5. For tuning the bandwidth h, we try 100 values in total, namely 0.005, 0.01, 0.015, ..., 0.5. For the EM-binning-PCD algorithm, we set the binwidth to be 0.005. The rest of the parameter settings for the three algorithms are the same, including the n parameter in PCD, which is set to 1, and the number of particles, which is set to 100. We replicate each experiment 20 times, and the averaged results are reported.

Performance of the algorithms: The results from the tree-structure HMRFs and the grid-structure HMRFs are reported in Figure 2. We plot the average absolute error of the estimates of the three algorithms against the number of iterations of the PCD update. We have separate plots for background knowledge k_i = sin θ_i + ε and background knowledge k_i = θ_i² + ε. Since there are three noise levels for the background knowledge, both the EM-kernel-PCD algorithm and the EM-binning-PCD algorithm have three variations. All three algorithms converge as they iterate. It is observed that the absolute estimate error of the EM-homo-PCD algorithm reduces to 0.125 as it converges. Since the parameters θ_i are drawn independently from the uniform distribution on the interval [0.5, 1], the EM-homo-PCD algorithm ties all the θ_i's and estimates them to be 0.75. Therefore, the averaged absolute error is \( \int_{0.5}^{1.0} 2|x - 0.75|\,dx = 0.125 \). Our EM-kernel-PCD algorithm significantly outperforms the EM-binning-PCD algorithm and the EM-homo-PCD algorithm. It is also observed that as the noise level of the background knowledge increases, the performance of the EM-kernel-PCD algorithm and the EM-binning-PCD algorithm deteriorates. However, as long as the noise level is moderate, the performance of our EM-kernel-PCD algorithm is satisfactory. The results from the tree-structure HMRFs and the grid-structure HMRFs are comparable, except that it takes more iterations to converge in the grid-structure HMRFs than in the tree-structure HMRFs.

Behavior of the algorithms: We then plot the estimated parameters against their background knowledge over the iterations of our EM-kernel-PCD algorithm. We provide plots after 100 iterations, after 200 iterations and after convergence, respectively, to show how the EM-kernel-PCD algorithm behaves during the gradient ascent. Figure 3 shows the plots for the background knowledge k_i = sin θ_i + ε and the background knowledge k_i = θ_i² + ε with three levels of noise (namely σ_ε = 0, 0.01 and 0.02) for both the tree-structure HMRFs and the grid-structure HMRFs. It is observed that as the algorithm iterates, it gradually recovers the relationship between the parameters and the background knowledge. There is still a gap between our estimate and the ground truth. This is because we only have one hidden instantiation x, and we have to infer x from the observed y in the E-step. Especially at the boundaries, we can observe a certain amount of estimation bias. Boundary bias is very common in kernel regression problems because there are fewer data points at the boundaries [Fan, 1992].

Choosing parameter n: One parameter in contrastive divergence algorithms is n, the number of MCMC steps we need to perform under the current parameters in order to generate the particles. The rationale of contrastive divergence is that a few MCMC steps under the current parameters are enough to find the direction in which to update the parameters, and we do not have to reach equilibrium. Therefore, the parameter n is usually set to be very small to save computation when learning general Markov random fields. Here we explore how we should choose the n parameter in our EM-kernel-PCD algorithm for learning HMRFs. We choose three values for n in the simulations, namely 1, 5 and 10. In Figure 4, the running time and the absolute estimate error are plotted for the three choices in the tree-structure HMRFs and the grid-structure HMRFs, under different levels of noise in the background knowledge k_i = sin θ_i + ε and the background knowledge k_i = θ_i² + ε. The running time increases as n increases, but the estimation accuracy does not increase. This observation stays the same for different structures and different levels of noise in different types of background knowledge. This suggests that we can simply choose n = 1 in our EM-kernel-PCD algorithm.

5 Real-world Application

We use our EM-kernel-PCD algorithm to learn a heterogeneous HMRF model in a real-world genome-wide association study on breast cancer. The dataset is from NCI's Cancer Genetics Markers of Susceptibility (CGEMS) study [Hunter et al., 2007]. In total, 528,173 genetic markers (single-nucleotide polymorphisms or SNPs) for 1,145 breast cancer cases and 1,142 controls are genotyped on the Illumina HumanHap500 array, and the task is to identify the SNPs which are associated with breast cancer. This dataset has been used in the study of Liu et al. [2012b]. We build a heterogeneous HMRF model to identify the associated SNPs. In the HMRF model, the hidden vector X ∈ {0, 1}^d denotes whether the SNPs are associated with breast cancer, i.e. X_i = 1 means that SNP i is associated with breast cancer. For each SNP, we can perform a two-proportion z-test from the minor allele count in cases and the minor allele count in controls.
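For concreteness, the two-proportion z-test statistic mentioned above could be computed as follows; this is a hedged sketch, and the allele counts in the usage example are made up for illustration rather than taken from the CGEMS data.

```python
import numpy as np

def two_proportion_z(count_case, n_case, count_ctrl, n_ctrl):
    """Two-proportion z-test statistic for one SNP, comparing minor allele
    frequencies in cases and controls (counts out of 2 * #subjects chromosomes)."""
    p1, p2 = count_case / n_case, count_ctrl / n_ctrl
    p = (count_case + count_ctrl) / (n_case + n_ctrl)   # pooled proportion
    se = np.sqrt(p * (1.0 - p) * (1.0 / n_case + 1.0 / n_ctrl))
    return (p1 - p2) / se

# Illustrative (hypothetical) minor allele counts for one SNP:
# 1,145 cases -> 2 * 1,145 chromosomes; 1,142 controls -> 2 * 1,142 chromosomes.
y_i = two_proportion_z(count_case=620, n_case=2 * 1145,
                       count_ctrl=520, n_ctrl=2 * 1142)
print(y_i)
```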
Denote by Y_i the test statistic from the two-proportion z-test for SNP i. It can be derived that Y_i | X_i = 0 ∼ N(0, 1) and Y_i | X_i = 1 ∼ N(μ_1, 1) for some unknown μ_1 (μ_1 ≠ 0). We assume that X forms a pairwise Markov random field with respect to the graph G. The graph G is built as follows. We query the squared correlation coefficients (r² values) among the SNPs from HapMap [The International HapMap Consortium, 2003]. Each SNP becomes a node in the graph. For each SNP, we connect it with the SNP having the highest r² value with it. We also remove the edges whose r² values are below 0.25. There are in total 340,601 edges in the graph. The pairwise potential function φ_i on edge i (connecting X_u and X_v), parameterized by θ_i (0 < θ_i < 1), is φ_i(X; θ_i) = θ_i^{I(X_u=X_v)} (1 − θ_i)^{I(X_u≠X_v)} for i = 1, ..., 340,601, where I is an indicator variable. It is believed that two SNPs with a higher level of correlation are more likely to agree in their association with breast cancer. Therefore, we set the background knowledge k about the parameters to be the r² values between the SNPs on the edges. We first perform the two-proportion z-tests and set y to be the calculated test statistics. Then we estimate θ|y, k in the heterogeneous HMRF with respect to G using our EM-kernel-PCD algorithm.
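The graph construction just described, connecting each SNP to its highest-r² partner and dropping weak edges, could be sketched as below; the r² matrix is assumed to be available (e.g. queried from HapMap beforehand), and the function name is ours.

```python
import numpy as np

def build_snp_graph(r2, threshold=0.25):
    """Build the edge set described above: connect each SNP to the SNP with
    which it has the highest r^2, then drop edges with r^2 < threshold.
    `r2` is a symmetric (d, d) matrix of squared correlations (diagonal ignored).
    Returns the edge list and the background knowledge k (the r^2 per edge)."""
    r2 = np.asarray(r2, dtype=float).copy()
    d = r2.shape[0]
    np.fill_diagonal(r2, -np.inf)
    edges = set()
    for i in range(d):
        j = int(np.argmax(r2[i]))            # strongest partner of SNP i
        if r2[i, j] >= threshold:
            edges.add((min(i, j), max(i, j)))
    edge_list = sorted(edges)
    k = np.array([r2[u, v] for u, v in edge_list])
    return edge_list, k
```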

Figure 3: The behavior of the EM-kernel-PCD algorithm during gradient ascent for different types of background knowledge with different levels of noise in the tree-structure HMRFs and the grid-structure HMRFs. The red dots show the mapping pattern between the ground truth of the parameters and their background knowledge.

After we estimate θ and μ_1, we calculate the marginal probabilities of the hidden X. Eventually, we rank the SNPs by the marginal probabilities P(X_i = 1 | y; θ̂, μ̂_1), and select the SNPs with the largest marginal probabilities.

The algorithm ran for 46 days on a single processor (AMD Opteron Processor, 3300 MHz) before it converged. We plotted the estimated parameters against their background knowledge, namely the r² values between the pairs of SNPs on the edges. The plot is provided in Figure 5. It is observed that the mapping between the estimated parameters and the background knowledge is monotone increasing, as we expect. Finally, we calculated the marginal probabilities of the hidden X and ranked the SNPs by the marginal probabilities P(X_i = 1 | y; θ̂, μ̂_1). There are in total five SNPs with P(X_i = 1 | y; θ̂, μ̂_1) greater than 0.99, which means they are associated with breast cancer with a probability greater than 0.99 given the observed test statistics y under the estimated parameters θ̂ and μ̂_1.

Figure 5: The estimated parameters against their background knowledge, namely the r² values between the pairs of SNPs.

There is strong evidence in the literature that supports the association with breast cancer for three of them. The two SNPs rs2420946 and rs1219648 on chromosome 10 are reported by Hunter et al. [2007], and have been further validated by 1,776 cases and 2,072 controls from three additional studies. Their associated gene FGFR2 is very well known to be associated with breast cancer in the literature. There is also strong evidence supporting the association of the SNP rs7712949 on chromosome 5. The SNP rs7712949 is highly correlated (r² = 0.948) with SNP rs4415084, which has been identified to be associated with breast cancer by another six large-scale studies (http://snpedia.com/index.php/rs4415084).

Figure 4: Absolute estimate error (plotted in blue, in the units on the right axes) and running time (plotted in black, in minutes on the left axes) of the EM-kernel-PCD algorithm in the tree-structure HMRFs and the grid-structure HMRFs when we choose different n values; n is the number of MCMC steps for advancing particles in the PCD algorithm. The absolute estimate error in the first 400 iterations is not shown in the plots.

6 Conclusion

Capturing parameter heterogeneity is an important issue in machine learning and statistics, and it is particularly challenging in HMRFs due to both the intractable Z(θ) and the latent x. In this paper, we propose the EM-kernel-PCD algorithm for learning the heterogeneous parameters with background knowledge. Our algorithm is built upon the PCD algorithm, which handles the intractable Z(θ). The EM part we add is for dealing with the hidden x. The kernel smoothing part we add adaptively incorporates the background knowledge about the heterogeneity of the parameters into the gradient ascent learning. Eventually, the relation between the parameters and the background knowledge is recovered in a nonparametric way, which is also adaptive to the data. Simulations show that our algorithm is effective for learning heterogeneous HMRFs and outperforms alternative binning methods.

Similar to other EM algorithms, our algorithm only converges to a local maximum of the likelihood L(θ, ϕ), although the lower bound F(q_x(x), {θ, ϕ}) is nondecreasing over the EM iterations (except for some MCMC error introduced in the E-step). Our algorithm also suffers from a long running time due to the computationally expensive PCD algorithm within each M-step. These two issues are important directions for future work.

Acknowledgements

The authors gratefully acknowledge the support of NIH grants R01GM097618, R01CA165229, R01LM010921, P30CA014520, UL1TR000427, NSF grants DMS-1106586, DMS-1308872 and the UW Carbone Cancer Center.

References

A. U. Asuncion, Q. Liu, A. T. Ihler, and P. Smyth. Particle filtered MCMC-MLE with connections to contrastive divergence. In ICML, 2010.

M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In NIPS, 2002.

J. Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D (The Statistician), 24(3):179–195, 1975.

G. Celeux, F. Forbes, and N. Peyrard. EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recognition, 36:131–144, 2003.

S. P. Chatzis and G. Tsechpenakis. The infinite hidden Markov random field model. In ICCV, 2009.

S. P. Chatzis and T. A. Varvarigou. A fuzzy clustering approach toward hidden Markov random field models for enhanced spatially constrained image segmentation. IEEE Transactions on Fuzzy Systems, 16:1351–1361, 2008.

J. Fan. Design-adaptive nonparametric regression. Journal of the American Statistical Association, 87(420):998–1004, 1992.

J. V. Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In ICML, 2008.

C. J. Geyer. Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics, pages 156–163, 1991.

Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. Machine Learning, 29:245–273, 1997.

M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.

G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

D. J. Hunter, P. Kraft, K. B. Jacobs, D. G. Cox, M. Yeager, S. E. Hankinson, S. Wacholder, Z. Wang, R. Welch, A. Hutchinson, J. Wang, K. Yu, N. Chatterjee, N. Orr, W. C. Willett, G. A. Colditz, R. G. Ziegler, C. D. Berg, S. S. Buys, C. A. McCarty, H. S. Feigelson, E. E. Calle, M. J. Thun, R. B. Hayes, M. Tucker, D. S. Gerhard, J. F. Fraumeni, R. N. Hoover, G. Thomas, and S. J. Chanock. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics, 39(7):870–874, 2007.

A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007.

J. Kim and R. Zabih. Factorial Markov random fields. In ECCV, pages 321–334, 2002.

X. Lan, S. Roth, D. Huttenlocher, and M. J. Black. Efficient belief propagation with learned higher-order Markov random fields. In ECCV, pages 269–282, 2006.

J. Liu, C. Zhang, C. McCarty, P. Peissig, E. Burnside, and D. Page. High-dimensional structured feature screening using binary Markov random fields. In AISTATS, 2012a.

J. Liu, C. Zhang, C. McCarty, P. Peissig, E. Burnside, and D. Page. Graphical-model based multiple testing under dependence, with applications to genome-wide association studies. In UAI, 2012b.

S. Lyu. Unifying non-maximum likelihood learning objectives with minimum KL contraction. In NIPS, pages 64–72, 2011.

J. C. Marioni, N. P. Thorne, and S. Tavaré. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22:1144–1146, 2006.

E. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 9(1):141–142, 1964.

L. Song, B. Boots, S. Siddiqi, G. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. In ICML, 2010.

The International HapMap Consortium. The International HapMap Project. Nature, 426:789–796, 2003.

T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.

T. Tieleman and G. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.

D. Vickrey, C. Lin, and D. Koller. Non-local contrastive objectives. In ICML, 2010.

G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4):359–372, 1964.

Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 2001.

S. C. Zhu and X. Liu. Learning in Gibbsian fields: How accurate and how fast can it be? IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1001–1006, 2002.