
Convergent Stochastic Expectation Maximization algorithm with efficient sampling in high dimension. Application to deformable template model estimation. Stéphanie Allassonnière, Estelle Kuhn

To cite this version:

Stéphanie Allassonnière, Estelle Kuhn. Convergent Stochastic Expectation Maximization algorithm with efficient sampling in high dimension. Application to deformable template model estimation. 2013. hal-00720617v4

HAL Id: hal-00720617 https://hal.archives-ouvertes.fr/hal-00720617v4 Preprint submitted on 6 Sep 2013


Convergent Stochastic Expectation Maximization algorithm with efficient sampling in high dimension. Application to deformable template model estimation

Stéphanie Allassonnière · Estelle Kuhn

Received: date / Accepted: date

Abstract Estimation in the deformable template model is a big challenge in image analysis. The issue is to estimate an atlas of a population. This atlas contains a template and the corresponding geometrical variability of the observed shapes. The goal is to propose an accurate algorithm with low computational cost and with theoretical guarantees of relevance. This becomes very demanding in high dimensional settings, which is particularly the case for medical images. We propose to use an optimized Markov chain Monte Carlo method inside a stochastic Expectation Maximization algorithm in order to estimate the model parameters by maximizing the likelihood. In this paper, we present a new Anisotropic Metropolis Adjusted Langevin Algorithm which we use as the transition in the MCMC method. We first prove that this new sampler leads to a geometrically uniformly ergodic Markov chain. We also prove that, under mild conditions, the estimated parameters converge almost surely and are asymptotically Gaussian distributed. The methodology is then tested on handwritten digits and on 2D and 3D medical images for deformable model estimation. More widely, the proposed algorithm can be used for a large range of models in many fields of application such as pharmacology or genetics.

Keywords Deformable template · geometric variability · maximum likelihood estimation · missing variable · high dimension · stochastic EM algorithm · MCMC · Anisotropic MALA

S. Allassonnière, CMAP, Ecole Polytechnique, Route de Saclay, 91128 Palaiseau, FRANCE. Tel.: +331.69.33.45.65. E-mail: [email protected]

E. Kuhn, INRA, Domaine de Vilvert, 78352 Jouy-en-Josas, FRANCE

1 Introduction

We consider here the deformable template model introduced for Computational Anatomy in [18]. This model, which has had a great impact in image analysis, was later developed and analyzed by many groups (among others [28,24,33,27]). It offers several major advantages. First, it describes the population of interest by a digital anatomical template. It also captures the geometric variability of the population shapes through the modeling of deformations of the template which match it to the observations. Moreover, the metric on the space of deformations is specified in the model as a quantification of the deformation cost. Besides describing the population, this also allows sampling from the model, using both the template and the geometrical metric of the deformation space which together define the atlas. Nevertheless, the key statistical issue is how to estimate these model parameters efficiently and accurately from an observed population of images.

Several numerical methods have been developed, mainly for the estimation of the template image (for example [11,20]). Even if these methods lead to visually interesting results on some training samples, they suffer from a lack of theoretical properties, raising the question of the relevance of the output, and they are not robust to noisy data. Another important contribution toward the statistical formulation of the template estimation issue was proposed in [17]. However interesting, this approach is not entirely satisfactory since the deformations are applied to discrete observations, requiring some interpolation. Moreover, it does not formulate the analysis in terms of a generative model, which appears very attractive as mentioned above. To overcome these shortcomings, a coherent statistical generative model was formulated in [2]. For estimating all the model parameters, the template image together with the geometrical metric, the authors proposed a deterministic algorithm based on an approximation of the well-known Expectation Maximization (EM) algorithm (see [14]), where the posterior distribution is replaced by a Dirac measure at its mode (called FAM-EM). However, such an approximation leads to the non-convergence of the estimates, which becomes apparent when considering noisy observations.

One solution to this problem is to consider a convergent stochastic approximation of the EM algorithm (SAEM), which was proposed in [13]. An extension using Markov chain Monte Carlo (MCMC) methods was developed and studied in [21] and [5], allowing for wider applications. To apply this extension to the deformable template model, the authors in [5] chose a Metropolis Hastings within Gibbs sampler (also called hybrid Gibbs) as the MCMC method, since the variables to sample were of large dimension (the usual Metropolis Hastings algorithm providing low acceptation rates). This estimation algorithm has been proved convergent and performs very well on very different kinds of data, as presented in [4]. Nevertheless, the hybrid Gibbs sampler becomes computationally very expensive when sampling very high dimensional variables. Although it reduces the dimension of the sampling to one, which makes it easier to stride across the target density support, it loops over the coordinates of the sampled variable, which becomes computationally unusable as soon as the dimension is very large or the acceptation ratio involves heavy computations. To overcome the computational cost of this estimation algorithm, some authors propose to simplify the model by constraining the correlations of the deformations (see [29,22]).

Our purpose in this paper is to propose an efficient and convergent estimation algorithm for the deformable template model in high dimension without any constraints. With regard to the above considerations, the computational cost of the estimation algorithm can be reduced by optimizing the sampling scheme in the MCMC method.

The sampling of high dimensional variables is a well-known difficult challenge. In particular, many authors have proposed to use the Metropolis Adjusted Langevin Algorithm (MALA) (see [30] and [31]). This algorithm is a particular random walk Metropolis Hastings sampler. Starting from the current iterate of the Markov chain, one simulates a candidate with respect to a Gaussian proposal whose expectation equals the sum of this current iterate and a drift related to the target distribution. The covariance matrix of the proposal is diagonal and isotropic. This candidate is accepted or rejected with a probability given by the Metropolis Hastings acceptance ratio.

Some modifications have been proposed, in particular to optimize the covariance matrix of the proposal in order to better stride across the support of the target distribution (see [32,7,23,16]). In [7] and [23], the authors proposed to construct adaptive MALA chains for which they prove the geometric ergodicity of the chain uniformly on any compact subset of its parameters. Unfortunately, this technique does not take full advantage of shaping the proposal using the target distribution. In particular, the covariance matrix of the proposal is given by a stochastic approximation of the empirical covariance matrix. This choice is completely relevant once convergence toward the stationary distribution is reached. However, it does not provide a good guess of the variability during the first iterations of the chain, since it is still very dependent on the initialization.
This leads to chains that may be numerically trapped. Moreover, this particular algorithm may require a lot of tuning parameters. Although the theoretical convergence is proved, this algorithm may be very difficult to optimize in practice inside an estimation process.

Recently, the authors in [16] proposed the Riemann manifold Langevin algorithm in order to sample from a target density in a high dimensional setting with strong correlations. This algorithm is also MALA based; the choice of the proposal covariance is guided by the metric of the underlying Riemann manifold. It requires evaluating the metric, its inverse as well as its derivatives. The proposed well-suited metric is the Fisher-Rao information matrix or its empirical value. However, in the context we are dealing with, the real metric, namely the metric of the space of non-rigid deformations, is not explicit, preventing any use of it (the simplest case of the 3-landmark problem is calculated in [26], leading to a very intricate formula which is difficult to extend to more complex models). Moreover, if we consider the constant curvature simplification suggested in [16], one still needs to invert the metric, which may be neither explicit nor computationally tractable. Note that these constraints are common to other application fields such as genetics or pharmacology, where models are often complex.

For all these reasons, we propose to adapt the MALA algorithm in the spirit of both works [7] and [16] to get an efficient sampler inside the stochastic EM algorithm. Therefore, we propose to sample from a proposal distribution which has the same expectation as the MALA but uses a full anisotropic covariance matrix based on the anisotropy and correlations of the target distribution. This sampler will be called AMALA in the sequel. The expectation is obtained as the sum of the current iterate plus a drift which is proportional to the gradient of the logarithm of the target distribution. We construct the covariance matrix as a regularization of the Gram matrix of this drift. We prove the geometric ergodicity of the AMALA uniformly on any compact set, assuming some regularity conditions on the target distribution. We also prove the almost sure convergence of the sequence of parameter estimates generated by the coupling of the AMALA and SAEM algorithms (AMALA-SAEM) toward the maximum likelihood estimate, under some regularity assumptions on the model. Moreover, we prove a Central Limit Theorem for this sequence under usual conditions on the model.

We test our estimation algorithm on the deformable template model for estimating handwritten digit atlases from the USPS database as well as medical images of corpus callosum (2D) and of dendrite spine excrescences (3D). The proposed estimation method is compared with the results obtained from the FAM-EM algorithm and from the MCMC-SAEM algorithm using different samplers, namely the hybrid Gibbs sampler, the MALA and the adaptive MALA of [7] introduced previously. The comparison is also made via classification rates on the USPS database. These experiments demonstrate the good behavior of our method in terms of both the accuracy of the estimation and the low computational cost in high dimension.

The paper is organized as follows. In Section 2, we recall the Bayesian Mixed Effect (BME) template model. In Section 3, we consider the maximum likelihood estimation issue in the general framework of incomplete data models, and we present our stochastic version of the EM algorithm using the AMALA sampler. The convergence properties are established in Section 4. Section 5 is devoted to the experiments on the BME template estimation. Finally, we give some conclusions in Section 6. The proofs are postponed to Section 7.

2 Description of the Bayesian Mixed Effect (BME) Template model

The deformable template model aims at summarizing a population of images by two quantities. The first one is a mean image, called the template, which has to represent a relevant shape as one could find in the population. The second quantity represents the metric in the space of shapes; it corresponds to the geometrical variability around the mean shape. Let us now describe the deformable template model more precisely.

We consider the hierarchical Bayesian framework for dense deformable templates developed in [2], where each image in a population is assumed to be generated as a noisy and randomly deformed version of the template.
The expectation is ob- plate which has to represent a relevant shape as tained as the sum of the current iterate plus one could find in the population. The second a drift which is proportional to the gradient quantity represents the in the space of the logarithm of the target distribution. We of shapes. This corresponds to the geometrical construct the covariance matrix as a regulariza- variability around the mean shape. Let us now tion of the Gram matrix of this drift. We prove describe the deformable template model more the geometric ergodicity uniformly on any com- precisely. pact set of the AMALA assuming some reg- We consider the hierarchical Bayesian frame- ularity conditions on the target distribution. work for dense deformable template developed We also prove the almost sure convergence of in [2] where each image in a population is as- the parameter estimated sequence generated by sumed to be generated as a noisy and randomly the coupling of AMALA and SAEM algorithms deformed version of the template. (AMALA-SAEM) toward the maximum likeli- The database is composed of n grey level hood estimate under some regularity assump- images (yi)1≤i≤n observed on a grid Λ of pix- tions on the model. Moreover, we prove a Cen- els (or voxels) included in a continuous domain tral Limit Theorem for this sequence under usual D ⊂ Rd, (typically D = [−1, 1]d where d equals d conditions on the model. 2 or 3). The expected template I0 : R → R We test our estimation algorithm on the de- takes its values in the continuous domain. Each formable template model for estimating hand- observation y is assumed to be a discretization written digit atlases from the USPS database on Λ of a random deformation of this template and medical images of corpus callosum (2D) plus an independent noise. Therefore, there ex- and of dendrite spine excrescences (3D). The ists an unobserved deformation field (also called 4 St´ephanieAllassonni`ere,Estelle Kuhn mapping) m : Rd → Rd such that for u ∈ Λ as shown in [2]. The priors are all independent: 2 θ = (α, σ ,Γg) ∼ νp ⊗ νg where y(u) = I0(vu − m(vu)) + σ(u) ,    2 1 T −1 where σ denotes the independent additive noise  νp(dα, dσ ) ∝ exp − (α − µp) (Σp) (α − µp) ×  2 and vu is the location of pixel (or voxel) u.    2  ap  σ0 1 2 Considering the template and the deforma-  exp − √ dσ dα, ap ≥ 3 , 2σ2 σ2 tions as continuous functions would lead to a !ag  dense problem. The dimension is reduced as-  −1 1  νg(dΓg) ∝ exp(−hΓg ,ΣgiF /2)p dΓg, suming that both elements belong to a sub-  |Γg|  set of fixed Reproducing Kernel Hilbert Spaces  ag ≥ 4kg + 1 . (RKHS) Vp and Vg defined by their respective (5) kernels Kp and Kg. More precisely, let (rp,j)1≤j≤kp -respectively (rg,j)1≤j≤k - be some fixed con- For two matrices A, B we define the Frobenius g T trol points in the domain D: there exist α ∈ Rkp inner product by hA, BiF , tr(A B). -resp. z ∈ Rkg × Rkg - such that for all v in D: Parameter estimation for this model is then kp X j performed by Maximum A Posteriori (MAP) : Iα(v) = (Kpα)(v) = Kp(v, rp,j)α (1) j=1 θˆ = argmax q(θ|y) , (6) θ∈Θ kg X j where q(θ|y) is the posterior density of θ con- mz(v) = (Kgz)(v) = Kg(v, rg,j)z . (2) j=1 ditional on y. The existence and consistency of the MAP estimator for the BME template For clarity, we write y = (yi)1≤i≤n for the model has been proved in [2]. n−tuple of observations and z = (zi)1≤i≤n for the n−tuple of unobserved variables defining Note that this model belongs to a more gen- the deformations. 
The statistical model on the eral class called mixed effect models. The fixed observations is chosen as follows: effects are the parameters θ and the random effects are the deformation coefficients z. The  z ∼ ⊗n N (0,Γ ) | Γ ,  i=1 dkg g g estimation issue in this class is treated in the same way as the likelihood maximization prob-  n 2 2 y ∼ ⊗i=1N|Λ|(mzi Iα, σ Id|Λ|) | z, α, σ , lem in the more general framework of incomplete- (3) data models. Therefore, the next section will be presented in this general setting in which the where ⊗ denotes the product of independent proposed algorithm applies. variables and mIα(u) = Iα(vu − m(vu)), for u in Λ. The parameters of interest are the tem- plate α, the noise variance σ2 and the defor- 3 Maximum likelihood estimation mation covariance matrix Γg. We assume that 2 3.1 Maximum likelihood estimation for θ = (α, σ ,Γg) belongs to the parameter space Θ: incomplete data setting

3 Maximum likelihood estimation

3.1 Maximum likelihood estimation in the incomplete data setting

We consider in this section the standard incomplete data (or partially-observed-data) setting and recall the usual notation. We denote by y ∈ R^q the observed data and by z ∈ R^l the missing data, so that we obtain the complete data (y, z) ∈ R^{q+l} for some q ∈ N* and l ∈ N*. We consider these data as random vectors. Let µ_0 be a σ-finite measure on R^{q+l} and µ the restriction of µ_0 to R^l generated by the projection (y, z) ↦ z. We assume that the probability density function (pdf) of the random vector (y, z) belongs to P = {f(y, z; θ), θ ∈ Θ}, a family of parametric probability density functions on R^{q+l} w.r.t. µ_0, where Θ ⊂ R^p. Therefore, the observed likelihood (i.e. the incomplete-data likelihood) is defined for θ ∈ Θ by:

g(y; θ) ≜ ∫ f(y, z; θ) µ(dz). (7)

Our purpose is to find the maximum likelihood estimate, that is the value θ̂_g in Θ that maximizes the observed likelihood g given a sample of observations. However, this maximization can often not be done analytically because of the integration involved in (7). A powerful tool which enables this maximization in such a setting is the Expectation Maximization (EM) algorithm (see [14]). It is an iterative procedure which consists of two steps. First, the so-called E-step computes the conditional expectation of the complete log-likelihood using the current parameter value. Second, the M-step updates the parameter by maximizing this expectation over Θ.
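In symbols, and using only the quantities just defined, the two steps read (this is the standard EM recursion of [14], stated here for completeness):

E-step: Q(θ | θ_k) = ∫ log f(y, z; θ) p(z|y; θ_k) µ(dz),
M-step: θ_{k+1} = argmax_{θ∈Θ} Q(θ | θ_k),

where p(z|y; θ_k) denotes the conditional density of the missing data given the observations, made explicit below.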

However, the computation of this expectation is often analytically intractable. Therefore, alternative procedures have been proposed. We are particularly interested in the Stochastic Approximation EM (SAEM) algorithm (see [13]) because of its theoretical convergence property and its small computation time. In this stochastic algorithm, the usual E-step is replaced by two steps: the first one corresponds to the simulation of realizations of the missing data, the second one to the computation of a stochastic approximation of the complete log-likelihood using these simulated values. It can be shown under weak regularity conditions that the sequence generated by this algorithm converges almost surely toward a local maximum of the observed likelihood (see [13]).

Nevertheless, the simulation step requires some attention. In the SAEM algorithm the simulated values of the missing data have to be drawn from the posterior distribution defined by:

p(z|y; θ) ≜ f(y, z; θ)/g(y; θ) if g(y; θ) ≠ 0, and 0 if g(y; θ) = 0.

When this is not possible, the extension using MCMC methods (see [21,5]) allows to apply the SAEM algorithm using simulations obtained from the transition probability of an ergodic Markov chain having the targeted posterior distribution as stationary distribution. Methods like the Metropolis Hastings algorithm or the Gibbs sampler are useful to perform this task. However, this becomes very challenging in a high dimensional setting. Indeed, when the MCMC procedure has to explore a space of high dimension, its convergence may occur in practice only after a possibly infinite time. Thus, it is necessary to optimize this MCMC procedure. This is what we propose in the following paragraph.

3.2 Description of the sampling method: Anisotropic Metropolis Adjusted Langevin Algorithm

We propose an anisotropic version of the well-known Metropolis Adjusted Langevin Algorithm (MALA). So let us first recall the steps of this algorithm. Let X be an open subset of R^l, the l-dimensional Euclidean space equipped with its Borel σ-algebra B. Let us denote by π the pdf of the target distribution with respect to the Lebesgue measure on X. We assume that π is positive and continuously differentiable. At each iteration k of this algorithm, a candidate X_c is simulated with respect to the Gaussian distribution with expectation X_k + (σ²/2) D(X_k) and covariance σ² Id_l, where X_k is the current value,

D(x) = b ∇log π(x) / max(b, |∇log π(x)|), (8)

Id_l is the identity matrix in R^l and b > 0 is a fixed truncation threshold. Note that the truncation of the drift D was already suggested in [15] to provide more stability. In the following, we denote by q_MALA(x, ·) the pdf of this Gaussian candidate distribution starting from x. Given this candidate, the next value of the Markov chain is updated using an acceptance ratio α_MALA(X_k, X_c) as follows: X_{k+1} = X_c with probability

α_MALA(X_k, X_c) = min( 1, [π(X_c) q_MALA(X_c, X_k)] / [q_MALA(X_k, X_c) π(X_k)] ) (9)

and X_{k+1} = X_k with probability 1 − α_MALA(X_k, X_c). This provides a transition kernel Π_MALA of the following form: for any Borel set A ∈ B,

Π_MALA(x, A) = ∫_A α_MALA(x, z) q_MALA(x, z) dz + 1_A(x) ∫_X (1 − α_MALA(x, z)) q_MALA(x, z) dz. (10)
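As a point of reference before the anisotropic variant, here is a minimal Python sketch of one MALA transition implementing the truncated drift (8) and the acceptance ratio (9). The default b matches the value used later in the experiments; the remaining choices are ours.

import numpy as np

def truncated_drift(grad_log_pi, x, b=1000.0):
    # Eq. (8): D(x) = b * grad / max(b, |grad|)
    g = grad_log_pi(x)
    return b * g / max(b, np.linalg.norm(g))

def mala_step(x, log_pi, grad_log_pi, sigma2, b=1000.0, rng=np.random):
    mean = lambda u: u + 0.5 * sigma2 * truncated_drift(grad_log_pi, u, b)
    xc = mean(x) + np.sqrt(sigma2) * rng.standard_normal(x.shape)
    # log q(u, v) for the isotropic proposal N(mean(u), sigma2 * I), up to constants
    log_q = lambda u, v: -np.sum((v - mean(u)) ** 2) / (2.0 * sigma2)
    log_ratio = log_pi(xc) + log_q(xc, x) - log_pi(x) - log_q(x, xc)   # Eq. (9)
    return xc if np.log(rng.uniform()) < log_ratio else x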

The Gaussian proposal of the MALA algorithm is optimized with respect to its expectation, guided by the Langevin diffusion. One step further is to also optimize its covariance matrix. A first work in this direction was proposed in [7]. There, the covariance matrix of the proposal is given by a projection of a stochastic approximation of the empirical covariance matrix, which produces an adaptive Markov chain. This process involves some additional tuning parameters which have to be calibrated. Since our goal is to use this sampler in an estimation algorithm, the sampler has at each iteration a different target distribution (depending on the current estimate of the parameter). Therefore, the optimal tuning parameters may differ along the iterations of the estimation process. Although we agree with the idea of using an adaptive chain, we prefer to take advantage of the dynamics of the estimation algorithm. On the other side, an intrinsic solution has been proposed in [16], where the covariance matrix is given by the metric of the Riemann manifold of the variable to sample. Unfortunately, this metric may not be accessible and its empirical approximation not easy to compute. This is particularly the case in the BME template model.

For these reasons, we propose a sampler in the spirit of [7], [16] or [23], which however does not produce an adaptive chain, as motivated above. The adaption comes from the dependency of the target distribution on the parameters of the model, which are updated along the estimation algorithm. The proposal remains a Gaussian distribution, but both the drift and the covariance matrix depend on the gradient of the target distribution. At the kth iteration, we are provided with X_k. The candidate is sampled from the Gaussian distribution with expectation X_k + δD(X_k) and covariance matrix δΣ(X_k), denoted in the sequel N(X_k + δD(X_k), δΣ(X_k)), where Σ(x) is given by:

Σ(x) = ε Id_l + D(x) D(x)^T, (11)

D is defined in Equation (8) and ε > 0 is a small regularization parameter. Note that the threshold parameter b leads to a symmetric positive definite covariance matrix with bounded non-zero eigenvalues. We introduce the gradient of log π into the covariance matrix to provide an anisotropic covariance matrix depending on the amplitude of the drift at the current value. When the drift is large, the candidate is likely to be far from the current value. This large step may not be of the right amplitude, and a large variance will enable more flexibility. Moreover, it allows exploring a larger area around such candidates, which would not be possible with a fixed variance. On the other hand, when the drift is small in a particular direction, it suggests that the current value is within a region of high probability for the next value of the Markov chain. Therefore, the candidate should not move too far, neither through a large drift nor through a large variance. This enables to sample a lot around large modes, which is of particular interest. This covariance also enables to treat the directions of interest with different amplitudes of moves, as the drift already does. It also provides dependencies between coordinates, since the directions of large variance are likely to differ from the Euclidean axes. This is taken into account here by introducing the Gram matrix of the drift into the covariance matrix.

We denote by q_c the pdf of this proposal distribution. The transition kernel becomes: for any Borel set A ∈ B,

Π(x, A) = ∫_A α(x, z) q_c(x, z) dz + 1_A(x) ∫_X (1 − α(x, z)) q_c(x, z) dz, (12)

where

α(X_k, X_c) = min( 1, [π(X_c) q_c(X_c, X_k)] / [q_c(X_k, X_c) π(X_k)] ). (13)
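Since Σ(x) in (11) is a rank-one update of a scaled identity, the AMALA proposal can be simulated and its density evaluated without forming an l × l matrix: a draw from N(0, ε Id_l + DD^T) is √ε ξ + D u with ξ ∼ N(0, Id_l) and u ∼ N(0, 1) independent, and the quadratic form and determinant follow from the Sherman-Morrison formula and the matrix determinant lemma. The following Python sketch rests on these standard identities; it is our implementation suggestion, not taken from the paper.

import numpy as np

def amala_proposal_sample(x, D, delta, eps, rng=np.random):
    # Candidate ~ N(x + delta * D, delta * (eps * I + D D^T)), cf. Eq. (11)
    xi = rng.standard_normal(x.shape)
    u = rng.standard_normal()
    return x + delta * D + np.sqrt(delta) * (np.sqrt(eps) * xi + D * u)

def amala_proposal_logpdf(xc, x, D, delta, eps):
    # log N(xc; x + delta*D, delta*Sigma(x)), up to the common (2*pi) constant
    r = xc - x - delta * D
    d2 = D @ D
    # Sherman-Morrison: (eps*I + D D^T)^{-1} = I/eps - D D^T / (eps*(eps + d2))
    quad = (r @ r) / eps - (r @ D) ** 2 / (eps * (eps + d2))
    # Matrix determinant lemma: det(eps*I + D D^T) = eps^(l-1) * (eps + d2)
    l = x.size
    logdet = (l - 1) * np.log(eps) + np.log(eps + d2) + l * np.log(delta)
    return -0.5 * (quad / delta + logdet)

In the acceptance ratio (13) the common (2π)^{l/2} factors cancel, so they are omitted from the log-density.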
3.3 Description of the stochastic estimation algorithm

Back to the stochastic estimation algorithm: the target distribution of the sampler is the posterior distribution p(·|y; θ). The four steps of the proposed AMALA-SAEM algorithm are detailed in this subsection: simulation, stochastic approximation, truncation on random boundaries and maximization. At each iteration k of the algorithm, simulated values of the missing data are drawn from the transition probability of the AMALA algorithm described in Section 3.2 with the current value of the parameters. Then, a stochastic approximation of the complete log-likelihood is computed using these simulated values for the missing data and is truncated using random boundaries. Finally, the parameters are updated by maximizing this quantity over Θ.

We consider here only parametric models P which belong to the curved exponential family; this means that the complete likelihood f(y, z; θ) can be written as:

f(y, z; θ) = exp[ −ψ(θ) + ⟨S(z), φ(θ)⟩ ],

where ⟨·,·⟩ denotes the Euclidean scalar product, the sufficient statistic S is a function on R^l taking its values in a subset S of R^m, and ψ, φ are two functions on Θ (note that S, φ and ψ may also depend on y, but we omit this dependency for simplicity). This condition is usual in the framework of EM algorithm applications and is fulfilled by a large range of models, even complex ones such as the BME template model. Therefore the stochastic approximation can be done either on the sufficient statistics S of the model or on the complete log-likelihood, using a positive step-size sequence (γ_k)_{k∈N}.

Concerning the truncation procedure, we introduce a sequence of increasing compact subsets of S denoted by (K_q)_{q≥0} such that ∪_{q≥0} K_q = S and K_q ⊂ int(K_{q+1}) for all q ≥ 0. Let also (ε_q)_{q≥0} be a monotone non-increasing sequence of positive numbers and K a compact subset of R^l. At iteration k, we simulate a value z̄ for the missing data from the Anisotropic Metropolis Adjusted Langevin Algorithm using the current value of the parameter θ_{k−1}. We compute the associated stochastic approximation of the sufficient statistics of the model, s̄. If it does not wander outside the current compact set K_{κ_{k−1}} and if it is not too far from its previous value s_{k−1}, we keep the proposed values (z_k, s_k). As soon as one of these conditions is not fulfilled, we reinitialize the sequences of z and s using a projection (for more details see [6]) and we increase the size of the compact set used for the truncation. As explained in [6], the re-projections act as a drift as they force the chain to come back to a compact set when it grows too rapidly, and they reinitialize the algorithm with a smaller step size. However, as the chain has an unbounded support, this requires the use of adaptive truncations. As we shall see in the Proof section (and as already noted in [6]), the limitation imposed on the increments of the sequence is required in order to ensure the convergence of the whole algorithm.

Concerning the maximization step, we denote by L the function defined on S × Θ, taking values in R, equal for all (s, θ) to L(s, θ) = −ψ(θ) + ⟨s, φ(θ)⟩. We assume that there exists a function θ̂ defined on S, taking values in Θ, such that

∀θ ∈ Θ, ∀s ∈ S, L(s, θ̂(s)) ≥ L(s, θ).

Finally we update the parameter using the value of the function θ̂ evaluated at s_k.

The complete algorithm is summarized in Algorithm 1. It only involves three tuning parameters: the threshold b for the gradient, which appears in the expectation as well as in the covariance matrix; the scale δ on this gradient; and the small regularization parameter ε which ensures a positive definite covariance matrix. The scale δ can easily be optimized by looking at the data at hand, so as to adapt to the range of the drift. The value of the threshold b is in practice never reached. The practical choices for the sequences (γ_k)_k and (ε_k)_k of positive step sizes used in the stochastic approximation and for the tuning parameters will be detailed in the section devoted to the experiments.

Algorithm 1 AMALA within SAEM

for k = 1 : k_end do
  Sample z_c with respect to N( z_{k−1} + δD(z_{k−1}, θ_{k−1}), δΣ(z_{k−1}, θ_{k−1}) ), whose pdf is denoted q_{s_{k−1}}(z_{k−1}, ·), where
    D(z_{k−1}, θ_{k−1}) = b ∇log p(z_{k−1}|y; θ_{k−1}) / max(b, |∇log p(z_{k−1}|y; θ_{k−1})|),
    Σ(z_{k−1}, θ_{k−1}) = D(z_{k−1}, θ_{k−1}) D(z_{k−1}, θ_{k−1})^T + ε Id_l.
  Compute the acceptance ratio α_{θ_{k−1}}(z_{k−1}, z_c) as defined in Eq. (13).
  Sample z̄ = z_c with probability α_{θ_{k−1}}(z_{k−1}, z_c) and z̄ = z_{k−1} with probability 1 − α_{θ_{k−1}}(z_{k−1}, z_c).
  Do the stochastic approximation
    s̄ = s_{k−1} + γ_k (S(z̄) − s_{k−1}),
  where (γ_k)_k is a sequence of positive step sizes.
  if s̄ ∈ K_{κ_{k−1}} and ‖s̄ − s_{k−1}‖ ≤ ε_{ζ_{k−1}} then
    set (z_k, s_k) = (z̄, s̄) and κ_k = κ_{k−1}, ν_k = ν_{k−1} + 1, ζ_k = ζ_{k−1} + 1
  else
    set (z_k, s_k) = (z̃, s̃) ∈ K × K_0 and κ_k = κ_{k−1} + 1, ν_k = 0, ζ_k = ζ_{k−1} + Ψ(ν_{k−1}),
    where Ψ : N → Z is a function such that Ψ(k) > −k for any k, and (z̃, s̃) is chosen arbitrarily
  end if
  Update the parameter θ_k = θ̂(s_k)
end for
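Read as code, one iteration of Algorithm 1 chains the AMALA transition, the stochastic approximation and the truncation test. In the following minimal Python sketch, amala_step, sufficient_stats, theta_hat, the compact sets compacts (with a membership test) and the restart point are model-specific stand-ins assumed by this sketch, not part of the paper:

import numpy as np

def amala_saem_step(state, gamma_k, eps_seq, compacts, restart, psi):
    # One pass of Algorithm 1; state = (z, s, theta, kappa, nu, zeta).
    z, s, theta, kappa, nu, zeta = state
    z_bar = amala_step(z, theta)                          # AMALA transition (Sec. 3.2)
    s_bar = s + gamma_k * (sufficient_stats(z_bar) - s)   # stochastic approximation
    if compacts[kappa].contains(s_bar) and np.linalg.norm(s_bar - s) <= eps_seq[zeta]:
        z, s, kappa, nu, zeta = z_bar, s_bar, kappa, nu + 1, zeta + 1
    else:
        z, s = restart                                    # arbitrary (z~, s~) in K x K_0
        kappa, zeta, nu = kappa + 1, zeta + psi(nu), 0    # re-projection; psi(k) > -k
    theta = theta_hat(s)                                  # maximization step
    return (z, s, theta, kappa, nu, zeta)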

4 Theoretical Properties

4.1 Geometric ergodicity of the AMALA

Let S be a subset of R^m for some positive integer m. Let X be a measurable subspace of R^l for some positive integer l. Let (π_s)_{s∈S} be a family of positive continuously differentiable probability density functions with respect to the Lebesgue measure on X. For any s ∈ S, denote by Π_s the transition kernel corresponding to the AMALA procedure described in Section 3.2 with stationary distribution π_s. We prove in the following proposition that each kernel of the family (Π_s)_{s∈S} is geometrically ergodic and that this property holds uniformly in s on any compact subset K of S.

We require a usual assumption on the stationary distributions, namely the so-called super-exponential property:

(B1) For all s ∈ S, the density π_s is positive with continuous first derivative such that:

lim_{|x|→∞} n(x)·∇log π_s(x) = −∞ (14)

and

lim sup_{|x|→∞} n(x)·m_s(x) < 0, (15)

where ∇ is the gradient operator in R^l, n(x) = x/|x| is the unit vector pointing in the direction of x and m_s(x) = ∇π_s(x)/|∇π_s(x)| is the unit vector in the direction of the gradient of the stationary distribution at point x.

We also assume some regularity of the stationary distributions with respect to s:

(B2) For all x ∈ X, the functions s ↦ π_s and s ↦ ∇_x log π_s are continuous on S.

We now define, for some β ∈ ]0,1[, V_s(x) = c_s π_s(x)^{−β} where c_s is a constant such that V_s(x) ≥ 1 for all x ∈ X. Let also V_1(x) = inf_{s∈S} V_s(x) and V_2(x) = sup_{s∈S} V_s(x). Let us assume the following condition on V_2:

(B3) There exists b_0 > 0 such that, for all s ∈ S and x ∈ X, V_2^{b_0} is integrable against Π_s(x, ·) and

lim_{b→0} sup_{s∈S, x∈X} Π_s V_2^b(x) = 1. (16)

Proposition 1 Assume (B1-B3). Let K be a compact subset of S. There exist a function V ≥ 1, a set C ⊆ X and a probability measure ν such that ν(C) > 0, and there exist constants λ ∈ ]0,1[, b ∈ [0,∞[ and ε ∈ ]0,1] such that for all s ∈ K:

Π_s(x, A) ≥ ε ν(A) ∀x ∈ C, ∀A ∈ B, (17)

Π_s V(x) ≤ λ V(x) + b 1_C(x). (18)

The proof of Proposition 1 is given in the Appendix.

The first equation defines C as a small set for the transition kernels (Π_s). Note that both ε and ν can depend on C. The ν-small set Equation (17) "in one step" also implies the ν-irreducibility of the transition kernels and their aperiodicity (see [25]). The second inequality is a drift condition which states that the transition kernels tend to bring back elements into the small set. As a consequence of these well-known drift conditions, the transition kernels (Π_s) are V-uniformly ergodic. Moreover, this property holds uniformly in s in any compact subset K ⊂ S. That is to say: for any compact K ⊂ S, there exist 0 < ρ < 1 and 0 < c < ∞ such that for all n ∈ N* and all f such that ‖f‖_V = sup_{x∈X} ‖f(x)‖/V(x) < ∞:

sup_{s∈K} ‖Π_s^n f(·) − π_s f‖_V ≤ c ρ^n ‖f‖_V. (19)

Remark 1 The same property holds for any power p of the function V such that 0 < pβ < 1. Indeed, the proof follows the same lines, as can be seen in Section 7. This property will prove useful in the sequel to establish some properties of the estimation algorithm.

4.2 Convergence property of the estimated sequence generated by the AMALA-SAEM algorithm

We make the following assumptions on the model, which are quite usual in the context of missing data models using EM-like algorithms (see [13], [21]). For sake of simplicity, we denote in the sequel p_θ(·) instead of p(·|y; θ) the posterior distribution.

– (M1) The parameter space Θ is an open subset of R^p. The complete data likelihood function is given by:

f(y, z; θ) = exp[ −ψ(θ) + ⟨S(z), φ(θ)⟩ ],

where S is a Borel function on R^l taking its values in an open convex subset S of R^m. Moreover, the convex hull of S(R^l) is included in S and, for all θ in Θ,

∫ ‖S(z)‖ p_θ(z) µ(dz) < ∞.

– (M2) The functions ψ and φ are twice continuously differentiable on Θ.

– (M3) The function s̄ : Θ → S defined as

s̄(θ) ≜ ∫ S(z) p_θ(z) µ(dz)

is continuously differentiable on Θ.

– (M4) The function l : Θ → R defined as the observed-data log-likelihood

l(θ) ≜ log g(y; θ) = log ∫ f(y, z; θ) µ(dz)

is continuously differentiable on Θ and

∂_θ ∫ f(y, z; θ) µ(dz) = ∫ ∂_θ f(y, z; θ) µ(dz).

– (M5) There exists a function θ̂ : S → Θ such that:

∀s ∈ S, ∀θ ∈ Θ, L(s; θ̂(s)) ≥ L(s; θ).

Moreover, the function θ̂ is continuously differentiable on S.

– (M6) The functions l : Θ → R and θ̂ : S → Θ are m times differentiable.

– (M7) (i) There exists M_0 > 0 such that

{ s ∈ S, ∂_s l(θ̂(s)) = 0 } ⊂ { s ∈ S, −l(θ̂(s)) < M_0 }.

(ii) For all M_1 > M_0, the set Conv(S(R^l)) ∩ { s ∈ S, −l(θ̂(s)) ≤ M_1 } is a compact set of S.

– (M8) There exists a polynomial function P such that for all z ∈ X,

‖S(z)‖ ≤ P(z).

– (B4) For any compact subset K of S, there exists a polynomial function Q of the hidden variable such that sup_{s∈K} |∇_z log p_{θ̂(s)}(z)| ≤ Q(z).

Moreover, a usual additional assumption is required on the step size sequences of the stochastic approximation:

– (A4) The sequences γ = (γ_k)_{k≥0} and ε = (ε_k)_{k≥0} are non-increasing, positive and satisfy: there exist 0 < a < 1 and p ≥ 2 such that Σ_{k=0}^∞ γ_k = ∞, lim_{k→∞} ε_k = 0 and

Σ_{k=1}^∞ { γ_k² + γ_k ε_k^a + (γ_k ε_k^{−1})^p } < ∞.
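As a concrete sanity check (our example, not taken from the paper), the pair γ_k = k^{−3/4}, ε_k = k^{−1/2} satisfies (A4) with a = 3/5 and p = 5:

Σ γ_k = Σ k^{−3/4} = ∞,  Σ γ_k² = Σ k^{−3/2} < ∞,
Σ γ_k ε_k^a = Σ k^{−3/4−3/10} = Σ k^{−21/20} < ∞,
Σ (γ_k ε_k^{−1})^p = Σ k^{(−3/4+1/2)·5} = Σ k^{−5/4} < ∞.

This choice is also compatible with assumption (N3) below, since 2/3 < 3/4 < 1.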

Theorem 1 (Convergence result for the estimated sequence generated by Algorithm 1) Assume (M1-M8) and (A4). Assume that the family of posterior density functions {p_{θ̂(s)}, s ∈ S} satisfies assumptions (B1-B4). Let K be a compact subset of X and K_0 ⊂ {s ∈ S, −l(θ̂(s)) < M_0} ∩ Conv(S(R^l)) (where M_0 is defined in (M7)). Then, for all z_0 ∈ K and s_0 ∈ K_0, we have lim_{k→∞} d(θ_k, L) = 0 a.s., where (θ_k)_k is the sequence generated by Algorithm 1 and L ≜ {θ ∈ Θ, ∂_θ l(θ) = 0}.

The proof is postponed to Appendix 7.2.

4.3 Central Limit Theorem for the estimated sequence generated by the AMALA-SAEM

Theorem 1 ensures that the number of re-initializations of the sequence of stochastic approximations of Algorithm 1 is finite almost surely. We can therefore consider only the non-truncated sequence when we are interested in its asymptotic behavior. Let us write the stochastic approximation procedure:

s_k = s_{k−1} + γ_k h(s_{k−1}) + γ_k η_k,

where H_s(z) = S(z) − s, h(s) = E_{p_{θ̂(s)}}(H_s(z)), η_k = S(z_k) − E_{p_{θ̂(s_{k−1})}}(S(z)) and E_{p_{θ̂(s)}} is the expectation under the invariant measure p_{θ̂(s)}.

Let us introduce some usual assumptions, in the spirit of those of Delyon (see [12]).

(N1) The sequence (s_k)_k converges to s* a.s. The function h is C¹ in some neighborhood of s* with Lipschitz first derivatives, and J, the Jacobian matrix of the mean field h at s*, has all its eigenvalues with negative real part.

(N2) Let g_{θ̂(s)} be a solution of the Poisson equation g − Π_{θ̂(s)} g = H_s − p_{θ̂(s)}(H_s) for any s ∈ S. There exists a bounded function w such that

w − Π_{θ̂(s*)} w = g_{θ̂(s*)} g_{θ̂(s*)}^T − Π_{θ̂(s*)} g_{θ̂(s*)} (Π_{θ̂(s*)} g_{θ̂(s*)})^T − U, (20)

where the deterministic matrix U is given by:

U = E_{θ̂(s*)}[ g_{θ̂(s*)}(z) g_{θ̂(s*)}(z)^T − Π_{θ̂(s*)} g_{θ̂(s*)}(z) (Π_{θ̂(s*)} g_{θ̂(s*)}(z))^T ]. (21)

(N3) The step size sequence (γ_k) is decreasing and satisfies γ_k = 1/k^α with 2/3 < α < 1.

Theorem 2 Under the assumptions of Theorem 1 and under (N1)-(N3), the sequence (s_k − s*)/√γ_k converges in distribution to a Gaussian random vector with zero mean and covariance matrix Γ, where Γ is the solution of the following Lyapunov equation:

U + JΓ + ΓJ^T = 0.

Moreover,

(θ_k − θ*)/√γ_k →_L N(0, ∂_s θ̂(s*) Γ ∂_s θ̂(s*)^T),

where θ* = θ̂(s*).

The proof of Theorem 2 is given in Appendix 7.3.

5 Applications on the Bayesian Mixed Effect Template model

5.1 Comparison between MALA and AMALA samplers

As a first experiment, we compare the mixing properties of the MALA and AMALA samplers. We used both algorithms to sample from a 10 dimensional normal distribution with zero mean and non-diagonal covariance matrix. Its eigenvalues range from 1 to 10, and the eigen-directions are chosen randomly. The autocorrelations of both chains are plotted in Fig. 1, where we can see that there is a benefit in using the anisotropic sampler. To evaluate the weight of the anisotropic term D(x)D(x)^T in the covariance matrix, we compute its amplitude (as its non-zero eigenvalue, since it is a rank one matrix). We see that it is on average of the same order as the diagonal part and jumps up to 15 times bigger. This shows the importance of the anisotropic term. The last check is the Mean Square Euclidean Jump Distance (MSEJD), which computes the expected squared distance between successive draws of the Markov chain. The two methods provide MSEJDs of the same order, showing a very slight advantage in terms of visiting the space for the AMALA sampler (1.29 versus 1.25 for the MALA).

We will observe in the following experiments that the advantage of considering the AMALA instead of the MALA sampler is intensified when the problem dimension increases and when the sampler is included into our estimation process.
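A minimal Python sketch of this comparison for the MALA chain, reusing mala_step from Section 3.2; the random eigen-directions and the eigenvalues from 1 to 10 follow the description above, while the seed, step size and chain length are illustrative assumptions (the AMALA chain is run analogously with the AMALA proposal helpers):

import numpy as np

rng = np.random.default_rng(0)
l = 10
Q, _ = np.linalg.qr(rng.standard_normal((l, l)))      # random eigen-directions
cov = Q @ np.diag(np.linspace(1.0, 10.0, l)) @ Q.T    # eigenvalues 1, ..., 10
prec = np.linalg.inv(cov)
log_pi = lambda x: -0.5 * x @ prec @ x
grad_log_pi = lambda x: -prec @ x

def msejd(chain):
    # Mean Square Euclidean Jump Distance between successive draws
    return np.mean(np.sum(np.diff(chain, axis=0) ** 2, axis=1))

def autocorr(chain, lag, dim=0):
    x = chain[:, dim] - chain[:, dim].mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

chain = np.zeros((10000, l))
for k in range(1, 10000):
    chain[k] = mala_step(chain[k - 1], log_pi, grad_log_pi, sigma2=0.5)
print("MALA MSEJD:", msejd(chain))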

Fig. 1 Autocorrelations of the MALA (blue) and AMALA (red) samplers targeting the 10 dimensional normal distribution with anisotropic covariance matrix.

5.2 BME Template estimation

Back to our targeted application, we apply the proposed estimation process to different databases. The first one is the USPS hand-written digit base, as used in [2] and [5]. The other two are medical images of 2D corpus callosum and 3D murine dendrite spine excrescences used in [4].

We begin by presenting the experiments on the USPS database. In order to allow comparison, we estimate the parameters in the same conditions as in the previously mentioned works, that is to say using the same 20 images per digit. Each image has grey levels between 0 (background) and 2 (bright white). These images are presented on the top panel of Fig. 2. We also use a noisy training dataset generated by adding a standardized independent Gaussian noise. These images are presented on the bottom panel of Fig. 2. We test five algorithms: the deterministic approximation of the EM algorithm (FAM-EM) presented in [2], and four MCMC-SAEM where the sampler is either the MALA, the adaptive MALA proposed in [7], the hybrid Gibbs sampler presented in [5] or our AMALA algorithm.

For these experiments the tuning parameters are chosen as follows: the threshold b is set to 1,000, the scale δ to 10^−3 and the regularization ε to 10^−4. The other tuning parameters and hyper-parameters are chosen as in [5].

Note that this model satisfies the conditions of our convergence theorem, as these conditions are similar to the ones proved in [5].

Fig. 2 Top: twenty images per digit of the training set used for the estimation of the model parameters (inverse video). Bottom: same images with additive noise of variance 1.

5.3 Computational performances

We first compare the computational performances of the algorithms. The computational time is smaller for the three MCMC-SAEM algorithms using "MALA-like" samplers compared to the FAM. Indeed, a numerical convergence of that algorithm requires about 30 to 50 EM steps. Each of them requires a gradient descent which involves 15 iterations on average. This implies computing 15 times the gradient of the energy (which actually equals our gradient) for each image in each EM step. The "MALA-like" SAEM algorithms require about 100 to 150 EM steps (depending on the digit), but only one gradient is computed for each image at each step. This reduces the computational time by a factor of at least 4 (up to 7 depending on the digit). No comparison can be done when the data are noisy, since the FAM-EM does not converge toward the MAP estimator as mentioned above.

Compared to the hybrid Gibbs-SAEM, the computational time is 8 times lower with the AMALA-SAEM in this particular case of application. Indeed, the hybrid Gibbs sampler requires no computation of the gradient. However, it includes a loop over the coordinates of the hidden variable, here the deformation vector of size 2k_g = 72. At each of these iterations, the candidate is straightforward to sample, whereas the computational cost lies in the acceptance rate. When this computation becomes heavy, the fewer times one calculates it, the better. In the AMALA-SAEM, this acceptance rate only has to be calculated once for each image. Therefore, even when the dimension of the hidden variable increases, this is of constant cost. The main price to pay is the computation of the gradient. Therefore, a tradeoff has to be found between the computation of either one gradient or dk_g acceptance rates in order to select the algorithm to use in a given case.

5.4 Results on the template estimation

All the estimated templates obtained with the five algorithms on noise-free and noisy training data are presented in Fig. 3. As noticed in [5], the FAM-EM estimation is sharp when the training set is noise-free and is deteriorated when adding noise. This behavior is not surprising with regard to the theoretical bound established in [8] in the particular case of a compact deformation group. Considering the adaptive sampler, it does not reach a good estimation of the templates, which are still very blurry and noisy in both cases. The problem seems to come from the very low acceptation rate already at the beginning of the estimation. The bad initial guess we have for the covariance matrix of the proposal seems to block the chain. Moreover, the tuning parameters are difficult to calibrate along the iterations of the estimation algorithm. Concerning the estimated templates using the Gibbs, MALA and AMALA samplers, they look very similar to each other, on the noise-free data as well as on the noisy ones. This similarity confirms the convergence of all these algorithms toward the MAP estimator. In this case, the templates are as expected: noise free and sharp.

Nevertheless, when the dimension of the hidden variable increases, both the Gibbs and the MALA samplers show limitations. We run the estimation on the same noisy USPS database, increasing the number k_g of geometrical control points. We choose the dimension of the deformation vector equal to 72, 128 and 200. The Gibbs-SAEM would produce sharp estimations but explodes the computational time; for this reason, we did not run this algorithm on the higher dimensional experiments. The results are presented in Fig. 4. Concerning the MALA sampler, it does not seem to capture the whole variability of the population in such high dimension. This yields a poor estimation of the templates. This phenomenon does not appear using our AMALA-SAEM algorithm: the templates still look sharp and the acceptation rates remain reasonable.

5.5 Results on the covariance matrix estimation

Since we are provided with a generative model, once the parameters have been estimated we can generate synthetic samples in order to evaluate the constraints on the deformations that have been learnt. Some of these samples are presented in Fig. 5. For each digit, 20 examples are generated with the deformations given by +z and 20 others with −z, where z is simulated with respect to N(0, Γ_g). We recall that, as already noticed in [5], the Gaussian distribution is symmetric, which may lead to strange samples in one direction whereas the other one looks like something present in the training set.

With regard to the above remarks concerning the computational time and the template estimations, we present in this subsection only the results obtained using the MALA and AMALA-SAEM algorithms. We notice that the samples generated by both algorithms look alike in the case of a hidden variable of dimension 72. Thus, we present only the results of our AMALA-SAEM estimation. As we can see, the deformations are very well estimated in both cases (without or with noise) and even look similar. This tends to demonstrate that the noise has been separated from the template as well as from the geometric variability during the estimation process.

Increasing the dimension of the deformation to 128, we run both algorithms on the noisy dataset. We observe in Fig. 6 that the geometric variability of the samples remains similar to the one obtained in lower dimension using our AMALA-SAEM. However, the MALA-SAEM does not manage to capture the whole variability of the deformations, which is related to the results observed above on the template. This confirms the limitation of the use of the MALA-SAEM in higher dimension.

5.6 Results on the noise variance estimation

The last check of the accuracy of the estimation relies on the noise variance estimation. The plots of its evolution along the AMALA-SAEM iterations for each digit, in both cases (without and with noise), are presented in Fig. 7. This variance is underestimated, in particular in the noisy case, which is a well-known effect of the maximum likelihood estimator. We observe that the geometrically very constrained digits such as 1 or 7 tend to converge very quickly, whereas the digits 2 and 4 require more iterations to capture all the shape variability.

Since this is a real parameter, we used it to illustrate the Central Limit Theorem stated in Subsection 4.3. Figure 8 and Figure 9 show the histograms of 10,000 runs of the algorithm with the same initial conditions. We use the digits 0 and 2 of the original data set as well as of the noisy data. As the iterations go along, the distribution of the estimates tends to look like a Gaussian distribution centered at the estimated noise variance, which demonstrates empirically the Central Limit Theorem.

(Fig. 3 grid — columns: FAM, Hybrid Gibbs, MALA, Adaptive MALA, AMALA; rows: no noise, noise.)

Fig. 3 Estimated templates using the five algorithms on noise free and noisy data. The training set includes 20 images per digit. The dimension of the hidden variable is 72.

(Fig. 4 grid — columns: 2k_g = 72, 128, 200; rows: MALA, AMALA.)

Fig. 4 Estimated templates using MALA and AMALA samplers in the stochastic EM algorithm on noisy training data. The training set includes 20 images per digit. The dimension of the hidden variable increases from 72 to 200.

Fig. 5 Synthetic samples generated with respect to the BME template model using the estimated parameters with AMALA-SAEM. For each digit, the two lines represent the deformation using + and − the simulated deformation z. Left: data without noise. Right: data with noise variance 1. The number of geometric control points is 36, leading to a hidden variable of dimension 72.

Fig. 6 Synthetic samples generated with respect to the BME template model using the estimated parameters with AMALA-SAEM (left) and MALA-SAEM (right). For each digit, the two lines represent the deformation using + and − the simulated deformation z. The number of geometric control points is 64, leading to a hidden variable of dimension 128.

Fig. 7 Evolution of the estimation of the noise variance along the AMALA-SAEM iterations. Top: original data. Bottom: noisy data.

5.7 Classification results

The deformable template model enables to perform classification using the maximum likelihood of a new image to allocate it to one class, here the digit. We use the test USPS database (which contains 2007 digits) for classification, while the training was done on the previous 20 noisy images. The results obtained with the hybrid Gibbs, MALA and AMALA-SAEM are presented in Table 1. In dimension 72, the best classification rate is obtained by the hybrid Gibbs-SAEM. This is easily understandable since its sampling scheme catches deformations which have been optimized control point by control point; therefore, the estimated covariance matrix carries more local accuracy. The AMALA-SAEM, leading to a much smaller computation time and to estimates of the same quality, also provides a very good classification rate. This confirms the good results observed on both the template estimates and the synthetic samples. Unfortunately, the MALA-SAEM again shows some limitations. Even if the templates look acceptable, the sampler does not manage to capture the whole class variability. Therefore, the classification rate falls down.

In order to evaluate the stability of our estimation algorithm with respect to the dimension, we perform the same classification with more control points. As expected, the MALA-SAEM classification rate deteriorates, whereas our AMALA-SAEM keeps very good performances. Note that the hybrid Gibbs sampler was not tested in dimension 2k_g = 128 because of its very long computational time.
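The decision rule just described assigns a test image to the class whose estimated model explains it best. A minimal sketch, where class_loglik is a hypothetical stand-in for the per-class observed log-likelihood of an image — in practice approximated, e.g. by maximizing over or integrating out the deformation z under the estimated class parameters (α_c, σ_c², Γ_{g,c}):

def classify(y, fitted_models):
    # fitted_models: {digit: theta_c}; pick the class maximizing the likelihood
    scores = {c: class_loglik(y, theta_c) for c, theta_c in fitted_models.items()}
    return max(scores, key=scores.get)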

Fig. 8 Empirical convergence toward the Gaussian distribution of the estimated noise variance along the AMALA-SAEM iterations for digit 0. Top: original data. Bottom: noisy data.

Fig. 9 Empirical convergence toward the Gaussian distribution of the estimated noise variance along the AMALA-SAEM iterations for digit 2. Top: original data. Bottom: noisy data.

Table 1 Error rates (in %) on the test set of the USPS database, using the estimations on the noisy training set, with respect to the sampler used in the MCMC-SAEM algorithm and the dimension of the deformation 2k_g.

Sampler / Dim. of def. | Hybrid Gibbs | MALA  | AMALA
72                     | 22.43        | 35.98 | 23.22
128                    | ×            | 43.8  | 25.36

5.8 2D medical image template estimation

A second database is used to illustrate our algorithm. As before, in order to allow comparisons with existing algorithms, we use the same database as presented in [4]. It consists of 47 medical images, each of them being a 2D square zone around the end point of the corpus callosum. This box contains a part of this corpus callosum as well as a part of the cerebellum. Ten exemplars are presented in the top rows of Fig. 10.

The estimations are compared with those obtained with the FAM-EM and the hybrid Gibbs-SAEM algorithms, and with the grey level mean image (bottom row of Fig. 10). In this real situation, the Euclidean grey level mean image (a) is very blurry. The template estimated using the FAM-EM (b) provides a first improvement, in particular leading to a sharper corpus callosum. However, the cerebellum still looks blurry, in particular when comparing it to the shape which appears in the template estimated using the hybrid Gibbs-SAEM (c). The result of our AMALA-SAEM is given in image (d). This template is very close to (c), as we could expect at a convergence point. Nevertheless, the AMALA-SAEM has a much lower computational time than the hybrid Gibbs-SAEM. This shows the advantage of using the AMALA-SAEM in real cases of high dimension.

5.9 3D medical image template estimation

We also test our algorithm in a much higher dimension, using the dataset of murine dendrite spines (see [1,9,10]) already used in [4]. The dataset consists of 50 binary images of microscopic structures, tiny protuberances found on many types of neurons, termed dendrite spines. The images are from control mice and from knock-out mice which have been genetically modified to mimic human neurological pathologies like Parkinson's disease. The acquisition process consisted of electron microscopy after injection of Lucifer yellow and subsequent photo-oxidation. The shapes were then manually segmented on the tomographic reconstruction of the neurons. Some of these binary images are presented in Fig. 11, which shows 3D views of some exemplars among the training set. Each image is a binary (background = 0, object = 2) cubic volume of size 28³. We can notice here the large geometrical variability of this population of images. Therefore we use a hidden variable of dimension 3k_g = 648 to catch this complex structure.

The templates estimated with either 30 or 50 observations are presented in Fig. 13. We obtain similar shapes, which are coherent with what a mean shape could be regarding the training sample. To evaluate the estimated geometrical variability, we generate synthetic samples as done in Subsection 5.5. Eight of these are shown in Fig. 12. We observe different twistings which are all coherent with the shapes observed in the dataset. Note that the training shapes have very irregular boundaries, whereas the parametric model used for the template leads to a smoother image. Thus, the synthetic samples do not reflect the local ruggedness of the segmented murine dendrite spines. If the aim were to capture these local bumps, the number of photometrical control points would have to be increased. However, the goal of our study was to detect global shape deformations.

Fig. 13 Estimated templates of murine dendrite spines. The training set is either composed of 30 (left) or 50 (right) images.


Fig. 10 Medical image template estimation. Top rows: 10 corpus callosum and cerebellum training images among the 47 available. Bottom row: (a) mean image. (b) FAM-EM estimated template. (c) Hybrid Gibbs-SAEM estimated template. (d) AMALA-SAEM estimated template.

6 Conclusion

In this paper we have considered the deformable template estimation issue using the BME model. We were particularly interested in the high dimensional setting. To that purpose, we have proposed to optimize the sampling scheme in the MCMC-SAEM algorithm to get an efficient and accurate estimation process. We have exhibited a new MCMC method based on the classical Metropolis Adjusted Langevin Algorithm, in which we introduced an anisotropic covariance matrix in the proposal. This optimization takes into account the anisotropy of the target distribution. We proved that the generated Markov chain is geometrically ergodic uniformly on any compact set. We have also proved the almost sure convergence of the sequence of parameters generated by the estimation algorithm, as well as its asymptotic normality. We have illustrated this estimation algorithm on the BME model. We have considered different datasets from the literature, namely the USPS database, 2D medical images of corpus callosum and 3D medical images of murine dendrite excrescences. We have compared the results with previously published ones to highlight the gain in speed and accuracy of the proposed algorithm.

We emphasize that the proposed estimation scheme can be applied in a wide range of application fields involving missing data models in a high dimensional setting. In particular, this method is promising when considering mixture models as proposed in [3]. Indeed, it will enable to shorten the computation time of the simulation part, which in that case requires the use of many auxiliary Markov chains. This also provides a good tool for this BME model when introducing a diffeomorphic constraint on the deformations.
In this case, it is even more important to get an efficient estimation process, since the computational cost of diffeomorphic deformations is intrinsically large.

Fig. 11 3D views of eight samples of the dataset of dendrite spines. Each image is a rendering of a binary volume.

Fig. 12 3D views of eight synthetic samples. The estimated template shown on the left of Fig. 13 is randomly deformed according to the estimated covariance matrix.

7 Appendix

7.1 Proof of Proposition 1

The idea of the proof is the same as that of the geometric ergodicity of the random walk Metropolis algorithm developed in [19] and reworked in [7] for the adaptive version of the MALA with truncated drift. The fact that both the drift and the covariance matrix are bounded, even though they depend on the gradient of log π_s, enables partially similar proofs. Let us first recall the transition kernel:

Π_s(x, A) = ∫_A α_s(x, z) q_s(x, z) dz + 1_A(x) ∫_X (1 − α_s(x, z)) q_s(x, z) dz ,  (22)

where α_s(x, z) = min(1, ρ_s(x, z)) and ρ_s(x, z) = [π_s(z) q_s(z, x)] / [q_s(x, z) π_s(x)].

Thanks to the bounded drift and covariance matrix, we can bound the proposal distribution q_s, uniformly in s ∈ S, between two centered Gaussian densities: there exist constants 0 < k_1 < k_2, ε_1 > 0 and ε_2 > 0 such that for all (x, z) ∈ X² and all s ∈ S,

k_1 g_{ε_1}(x − z) ≤ q_s(x, z) ≤ k_2 g_{ε_2}(x − z) ,  (23)

denoting by g_a the centered Gaussian probability density function on R^l with covariance matrix a Id_l.

7.1.1 Proof of the existence of a small set C

Let C be a compact subset of X and let K be a compact set. We define τ = inf{ρ_s(x, z), x ∈ C, z ∈ K, s ∈ K}. Since ρ_s is a ratio of positive continuous functions in s, x and z, and K is a compact subset of S, we have τ > 0. The same argument holds for (s, x, z) ↦ q_s(x, z), which is bounded from below by some µ > 0. Therefore, for all x ∈ C, for any A ∈ B and for all s ∈ K:

Π_s(x, A) ≥ ∫_{A∩K} α_s(x, z) q_s(x, z) dz ≥ min(1, τ) µ ∫_A 1_K(z) dz .

Therefore, we can define ν(A) = (1/Z) ∫_A 1_K(z) dz, where Z is the renormalization constant, and ε = min(1, τ) µ Z, so that C is a small set for the transition kernel Π_s for all s ∈ K and (17) holds.
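As a purely numerical sanity check of the sandwich (23), not needed for the proof, one can take a toy one-dimensional proposal with bounded drift and bounded variance and verify that its ratio to centered Gaussian densities g_a stays bounded away from zero for a smaller than the minimal proposal variance, and bounded above for a larger than the maximal one. The drift tanh(x) and variance 1 + 0.5 sin²(x) below are arbitrary illustrations of the boundedness assumptions, not the paper's proposal.

    import numpy as np

    def q(x, z):
        # toy Langevin-like proposal: |drift| <= 1 and variance in [1, 1.5]
        D, s2 = np.tanh(x), 1.0 + 0.5 * np.sin(x) ** 2
        return np.exp(-((z - x - D) ** 2) / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

    def g(u, a):
        # centered Gaussian density with variance a (the g_a of Equation (23))
        return np.exp(-(u ** 2) / (2.0 * a)) / np.sqrt(2.0 * np.pi * a)

    xs = np.linspace(-10.0, 10.0, 201)
    us = np.linspace(-20.0, 20.0, 2001)              # u = z - x
    X, U = np.meshgrid(xs, us)
    print("k1 ~", (q(X, X + U) / g(U, 0.8)).min())   # eps_1 = 0.8 < min variance
    print("k2 ~", (q(X, X + U) / g(U, 2.0)).max())   # eps_2 = 2.0 > max variance

Both printed values are finite and positive, matching the roles of k_1 and k_2 in (23).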

7.1.2 Proof of the drift condition

We prove this property in two steps. First, we establish that each kernel Π_s satisfies a drift property with a specific function V_s. Then, we construct a common function V which allows us to prove the drift property uniformly in s ∈ K.

Let us concentrate on the first step and consider s fixed. As already suggested in [19], we only need to prove the two following conditions:

sup_{x∈X} Π_s V_s(x) / V_s(x) < ∞  (24)

and

limsup_{|x|→∞} Π_s V_s(x) / V_s(x) < 1 .  (25)

We take the same path as in [7], applied to our case, and refer to Fig. 14 for a 2D visualization of the sets introduced along the proof.

For any x ∈ X, we denote by A_s(x) = {z ∈ X : ρ_s(x, z) ≥ 1} the acceptance set and by R_s(x) = A_s(x)^c its complementary set. We recall that V_s(x) = c_s π_s(x)^{−β} for some β ∈ ]0, 1[. Therefore, for all x ∈ X:

Π_s V_s(x)/V_s(x) = ∫_{A_s(x)} q_s(x,z) [V_s(z)/V_s(x)] dz + ∫_{R_s(x)} [π_s(z) q_s(z,x) / (π_s(x) q_s(x,z))] q_s(x,z) [V_s(z)/V_s(x)] dz + ∫_{R_s(x)} (1 − π_s(z) q_s(z,x) / (π_s(x) q_s(x,z))) q_s(x,z) dz
≤ ∫_{A_s(x)} q_s(x,z) [π_s(z)/π_s(x)]^{−β} dz + ∫_{R_s(x)} q_s(z,x) [π_s(z)/π_s(x)]^{1−β} dz + ∫_{R_s(x)} q_s(x,z) dz ,

and we denote by f_1(x,z), f_2(x,z) and f_3(x,z) the integrands of the three terms of the last bound.

On the acceptance set A_s(x), we have:

[π_s(z)/π_s(x)]^{−β} q_s(x,z) ≤ q_s(z,x)^β q_s(x,z)^{1−β} .

Thanks to Equation (23), one can bound this right hand side by the following symmetric Gaussian density:

[π_s(z)/π_s(x)]^{−β} q_s(x,z) ≤ k_2 g_{ε_2}(z − x) ,  (26)

which yields:

∫_{A_s(x)} f_1(x,z) dz ≤ k_2 ∫_{A_s(x)} g_{ε_2}(z − x) dz .  (27)

Equivalently, on R_s(x) we have the following bound:

[π_s(z)/π_s(x)]^{1−β} q_s(z,x) ≤ q_s(x,z)^β q_s(z,x)^{1−β}  (28)
≤ k_2 g_{ε_2}(z − x) .  (29)

Fix ε > 0; there exists a > 0 such that ∫_{B(x,a)} g_{ε_2}(z − x) dz ≥ 1 − ε. This leads to:

∫_{A_s(x) ∩ B(x,a)^c} f_1(x,z) dz ≤ k_2 ε .

Let C_{π_s(x)} be the level set of π_s through x: C_{π_s(x)} = {z ∈ X : π_s(z) = π_s(x)}. We define a pipe around this level set as

C_{π_s(x)}(u) = {z + t n(z), |t| ≤ u, z ∈ C_{π_s(x)}} .

Fig. 14 2D representation of the sets used in the proof.

Thanks to assumption (B1), there exists r_1 > 0 such that for all x ∈ X satisfying |x| ≥ r_1, the point 0 lies inside the region delimited by the level set C_{π_s(x)} (π_s(0) > π_s(x)). Therefore, let x ∈ X with |x| ≥ r_1; then for all z ∈ X there exist x_1 ∈ C_{π_s(x)} and t > 0 such that z = x_1 + t n(x_1). Since z ↦ g_{ε_2}(z − x) is a smooth density in the variable z, we can find u > 0 sufficiently small such that

∫_{B(x,a) ∩ C_{π_s(x)}(u)} g_{ε_2}(z − x) dz ≤ ε ,  (30)

leading to

∫_{A_s(x) ∩ B(x,a) ∩ C_{π_s(x)}(u)} f_1(x,z) dz ≤ k_2 ε .

Assumption (B1) implies that for any r > 0 and t > 0, d_r(t) = sup_{|x|≥r} π_s(x + t n(x)) / π_s(x) goes to 0 as r goes to ∞. Denote C_{π_s(x)}(u)^{c+} = {z ∈ C_{π_s(x)}(u)^c s.t. π_s(x) > π_s(z)} and C_{π_s(x)}(u)^{c−} = {z ∈ C_{π_s(x)}(u)^c s.t. π_s(x) < π_s(z)}.

We denote D^+ = A_s(x) ∩ B(x,a) ∩ C_{π_s(x)}(u)^{c+}. There exists r_2 > r_1 + a such that for any x with |x| ≥ r_2,

∫_{D^+} f_1(x,z) dz ≤ ∫_{D^+} [π_s(z)/π_s(x)]^{1−β} q_s(z,x) dz ≤ d_{r_2}(u)^{1−β} k_2 ∫_X g_{ε_2}(z − x) dz ≤ k_2 d_{r_2}(u)^{1−β} ,

using Equation (15), which states that the stationary distribution is decreasing in the direction of the normal of x for |x| sufficiently large. In the same way, one has on the set D^− = A_s(x) ∩ B(x,a) ∩ C_{π_s(x)}(u)^{c−}:

∫_{D^−} f_1(x,z) dz ≤ ∫_{D^−} [π_s(z)/π_s(x)]^{−β} q_s(x,z) dz ≤ k_2 d_{r_2}(u)^β .

The same inequalities can be obtained for f_2 using the same arguments:

∫_{R_s(x) ∩ B(x,a)^c} f_2(x,z) dz ≤ k_2 ε ,
∫_{R_s(x) ∩ B(x,a) ∩ C_{π_s(x)}(u)} f_2(x,z) dz ≤ k_2 ε ,
∫_{R_s(x) ∩ B(x,a) ∩ C_{π_s(x)}(u)^{c+}} f_2(x,z) dz ≤ k_2 d_{r_2}(u)^{1−β} ,
∫_{R_s(x) ∩ B(x,a) ∩ C_{π_s(x)}(u)^{c−}} f_2(x,z) dz ≤ k_2 d_{r_2}(u)^β .

This yields

limsup_{|x|→∞} Π_s V_s(x)/V_s(x) ≤ limsup_{|x|→∞} ∫_{R_s(x)} q_s(x,z) dz .  (31)

Let Q(x, A_s(x)) = ∫_{A_s(x)} q_s(x,z) dz; we get

limsup_{|x|→∞} Π_s V_s(x)/V_s(x) ≤ 1 − liminf_{|x|→∞} Q(x, A_s(x)) .

Let us now prove that liminf_{|x|→∞} Q(x, A_s(x)) ≥ c > 0, where c does not depend on x. Let a be fixed as above. Since q_s is an exponential function, there exists c_0 > 0 such that for all x ∈ X and s ∈ S,

inf_{z ∈ B(x,a)} q_s(z,x)/q_s(x,z) ≥ c_0^a .  (32)

Moreover, thanks to assumption (B1), there exists r_3 > 0 such that for all x ∈ X with |x| ≥ r_3 there exists 0 < u_2 < a such that

π_s(x) / π_s(x − u_2 n(x)) ≤ c_0^a .  (33)

Hence, for |x| ≥ r_3, any point x_2 = x − u_2 n(x) belongs to A_s(x). Let W(x) be the cone defined as:

W(x) = { x_2 − tζ, 0 < t < a − u_2, ζ ∈ S^{d−1}, |ζ − n(x_2)| ≤ ε/2 } ,  (34)

where S^{d−1} is the unit sphere in R^d.

Let us prove that W(x) ⊂ A_s(x). Using assumption (B1), we have for sufficiently large x: m(x)·n(x) ≤ −ε. Besides, by construction of W(x), for large x and all z ∈ W(x), |n(z) − n(x)| ≤ ε/2 with n(x) = n(x_2) (see Fig. 14). This leads, for any sufficiently large x and all z ∈ W(x), to

m(z)·ζ = m(z)·(ζ − n(x_2)) + m(z)·(n(x_2) − n(z)) + m(z)·n(z) ≤ ε/2 + ε/2 − ε = 0 .  (35)

Let now z = x_2 − tζ ∈ W(x). Using the mean value theorem on the differentiable function π_s between x_2 and z, we get that there exists τ ∈ ]0, t[ such that π_s(z) − π_s(x_2) = −t ζ·∇π_s(x_2 − τζ). Using the definition of m, this implies that π_s(z) − π_s(x_2) = −t ζ·m(x_2 − τζ) |∇π_s(x_2 − τζ)| ≥ 0 thanks to Equation (35). Putting all these results together, we finally get that for all z ∈ W(x), π_s(z) ≥ π_s(x_2) ≥ (1/c_0^a) π_s(x). Moreover, as z ∈ B(x,a) as well, Equation (32) is satisfied, leading to z ∈ A_s(x).

Then, we have

Q(x, A_s(x)) = ∫_{A_s(x)} q_s(x,z) dz ≥ ∫_{A_s(x)} k_1 g_{ε_1}(z − x) dz ≥ k_1 ∫_{W(x)} g_{ε_1}(z − x) dz = k_1 ∫_{T_x(W(x))} g_{ε_1}(z) dz ,

where

T_x(W(x)) = { −u_2 n(x) − tζ, 0 < t < a − u_2, ζ ∈ S^{d−1}, |ζ − n(x)| ≤ ε/2 }  (36)

is the translation of the set W(x) by the vector x. Note that W(x) does not depend on s.

But since g_{ε_1} is isotropic and T_x(W(x)) only depends on the fixed constant u_2 and on n(x), this last integral is independent of x, so there exists a positive constant c, independent of s ∈ S, such that:

c = ∫_{T_x(W(x))} g_{ε_1}(z) dz .  (37)

Back to our limit: for all s ∈ S,

limsup_{|x|→∞} Π_s V_s(x)/V_s(x) ≤ 1 − c ,  (38)

which ends the proof of condition (25).

To prove (24), we use the previous result. Indeed, since Π_s V_s(x)/V_s(x) is a smooth function on X, it is bounded on every compact subset. Moreover, since the limsup is finite, it is also bounded outside a fixed compact set. This proves the result.

Thanks to assumption (B2) and the bounded drift, there exists a constant c_0, uniform in s ∈ S, such that Equations (32) and (33) still hold for all s ∈ S. This implies, as mentioned above, that the set T_x(W(x)) is independent of s ∈ S. Therefore, we can set λ̃ = 1 − c < 1, where c is defined in Equation (37) and is also independent of s ∈ S.

This proves the drift property for the function V_s: there exist constants 0 < λ̃ < 1 and b̃ > 0 such that for all x ∈ X,

Π_s V_s(x) ≤ λ̃ V_s(x) + b̃ 1_C(x) ,  (39)

where C is a small set. Note that b̃ is also independent of s ∈ S, using the same arguments as before.

Let us now exhibit a function V independent of s ∈ S and prove the uniform drift condition. We define, for all x ∈ X,

V(x) = V_1(x)^ξ V_2(x)^{2ξ}  (40)

for 0 < ξ < min(1/(2β), b_0/4). Therefore, for all s ∈ S and all ε > 0 we have:

Π_s V(x) = ∫_X Π_s(x,z) V_1(z)^ξ V_2(z)^{2ξ} dz ≤ ∫_X Π_s(x,z) (1/2) [ V_1(z)^{2ξ}/ε² + ε² V_2(z)^{4ξ} ] dz ≤ (1/(2ε²)) ∫_X Π_s(x,z) V_s(z)^{2ξ} dz + (ε²/2) ∫_X Π_s(x,z) V_2(z)^{4ξ} dz .  (41)

Applying the drift property for Π_s with V_s^{2ξ},

Π_s V(x) ≤ (1/(2ε²)) ( λ̃ V_s(x)^{2ξ} + b̃ 1_C(x) ) + (ε²/2) ∫_X Π_s(x,z) V_2(z)^{4ξ} dz .  (42)

Using the definition of V and the fact that V_1 is bounded from below by 1, we get:

Π_s V(x) ≤ (λ̃/(2ε²)) V(x) + (b̃/(2ε²)) 1_C(x) + (ε²/2) ∫_X Π_s(x,z) V_2(z)^{4ξ} dz .  (43)

Since 0 < λ̃ < 1 is independent of s ∈ S and using assumption (B3), there exists ξ > 0 such that

sup_{s∈S, x∈X} ∫_X Π_s(x,z) V_2(z)^{4ξ} dz ≤ 2/(1 + λ̃) .  (44)

This yields

Π_s V(x) ≤ ( λ̃/(2ε²) + ε²/(1 + λ̃) ) V(x) + (b̃/(2ε²)) 1_C(x) .  (45)

We can now fix ε² = √( λ̃(1 + λ̃)/2 ), which leads to

Π_s V(x) ≤ √( 2λ̃/(1 + λ̃) ) V(x) + (b̃/(2ε²)) 1_C(x) .  (46)

We set λ = √( 2λ̃/(1 + λ̃) ) < 1 and b = b̃/(2ε²) > 0, which concludes the proof.
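For the reader's convenience, here is the elementary optimization behind the choice of ε² above (a verification we add, not part of the original text). Setting t = ε² and φ(t) = λ̃/(2t) + t/(1 + λ̃), the coefficient of V(x) in (45),

φ'(t) = −λ̃/(2t²) + 1/(1 + λ̃) = 0  ⟺  t* = √( λ̃(1 + λ̃)/2 ) ,

and substituting t* back gives

φ(t*) = 2 √( λ̃ / (2(1 + λ̃)) ) = √( 2λ̃/(1 + λ̃) ) = λ ,

which is smaller than 1 exactly when λ̃ < 1, so the contraction in (46) is indeed preserved.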

7.2 Proof of Theorem 1

We provide here the proof of the convergence of the estimated sequence generated by Algorithm 1. We apply Theorem 4.1 from [5] with the functions H_s equal to H_s(z) = S(z) − s, Π_s = Π_{θ̂(s)}, π_s = p_{θ̂(s)} and

h(s) = ∫_X (S(z) − s) p_{θ̂(s)}(z) µ(dz) .

Let us first prove assumption (A1'), which ensures the existence of a global Lyapunov function for the mean field of the stochastic approximation. It guarantees that, under some conditions, the sequence (s_k)_{k≥0} remains in a compact subset of S and converges to the set of critical points of the log-likelihood.
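Before checking (A1'), it may help to unpack the mean field (an elementary remark added for readability, immediate from the definitions above):

h(s) = ∫_X S(z) p_{θ̂(s)}(z) µ(dz) − s = E_{p_{θ̂(s)}}[S(z)] − s ,

so that h(s*) = 0 if and only if s* = E_{p_{θ̂(s*)}}[S(z)]. The stationary points of the stochastic approximation are thus exactly the self-consistent values of the sufficient statistics, i.e. the fixed points of the corresponding EM-type update.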

Assumptions (M1)-(M7) ensure that S is an open subset and that the function h is continuous on S. Moreover, defining w(s) = −l(θ̂(s)), we get that w is continuously differentiable on S. Applying Lemma 2 of [13], we get (A1')(i), (A1')(iii) and (A1')(iv). To prove (A1')(ii), we consider as absorbing set S_a the closure of the convex hull of S(R^l), denoted Conv(S(R^l)). Assumption (M7)(ii) is then exactly equivalent to assumption (A1')(ii). This achieves the proof of assumption (A1').

Let us now prove assumption (A2), which states in particular the existence of a unique invariant distribution for the Markov chain. To that purpose, we prove that our family of kernels satisfies the drift conditions mentioned in [6] and used in [5] in a similar context. These conditions are the existence of a small set uniformly in s ∈ K, the uniform drift condition and an upper bound on the kernel family:

(DRI1) For any s ∈ S, Π_{θ̂(s)} is ψ-irreducible and aperiodic. In addition, there exist a function V : R^l → [1, ∞[ and a constant p ≥ 2 such that for any compact subset K ⊂ S, there exist an integer j and constants 0 < λ < 1, B, κ, δ > 0 and a probability measure ν such that

sup_{s∈K} Π_{θ̂(s)}^j V^p(z) ≤ λ V^p(z) + B 1_C(z)  ∀z ∈ X ,  (47)
sup_{s∈K} Π_{θ̂(s)} V^p(z) ≤ κ V^p(z)  ∀z ∈ X ,  (48)
inf_{s∈K} Π_{θ̂(s)}^j(z, A) ≥ δ ν(A)  ∀z ∈ C, ∀A ∈ B(X) .  (49)

Let us start with the irreducibility of Π_{θ̂(s)}. The kernel Π_{θ̂(s)} is bounded from below as follows:

Π_{θ̂(s)}(x, A) ≥ ∫_A α_s(x, z) q_s(x, z) dz ,  (50)

where α_s(x, z) = min(1, ρ_s(x, z)) and ρ_s(x, z) = π_s(z) q_s(z, x) / (q_s(x, z) π_s(x)) > 0. Since the proposal density q_s is positive, this proves that Π_s(x, A) is positive, hence the ψ-irreducibility of each kernel of the family.

Proposition 1 and Remark 1 show that Equations (47) and (49) hold for j = 1 with V defined in Equation (40) and some p > 2. Moreover, since Equation (47) holds for j = 1 and V ≥ 1, Equation (48) directly comes from Equation (47) choosing κ = B + λ (indeed Π_{θ̂(s)} V^p ≤ λ V^p + B 1_C ≤ (λ + B) V^p). This implies all three inequalities. Since the small set condition is satisfied with j = 1 (small set "in one step"), each chain of the family is aperiodic (see [25]). Assumption (A2) is therefore directly implied by assumption (M1).

Let us now prove assumption (A3'), which states some regularity conditions (Hölder-type ones) on the solution of the Poisson equation related to the transition kernel. It also ensures that this solution and its image through the transition kernel have reasonable behaviors as the chain goes to infinity and that the kernel is V^p-bounded in expectation. The drift conditions proved previously imply the geometric ergodicity uniformly in s in any compact set K. This also ensures the existence of a solution of the Poisson equation (see [25]) required in assumption (A3').

We first consider condition (A3'(i)). Let us define, for any g : X → R^m, the norm ‖g‖_V = sup_{z∈X} ‖g(z)‖ / V(z). Since H_s(z) = S(z) − s, assumptions (M8) and (B1) ensure that sup_{s∈K} ‖H_s‖_V < ∞, and inequality (4.3) of (A3'(i)) holds.

The uniform ergodicity of the family of Markov chains corresponding to the AMALA on K ensures that there exist constants 0 < γ_K < 1 and C_K > 0 such that

sup_{s∈K} ‖g_{θ̂(s)}‖_V = sup_{s∈K} ‖ Σ_{k≥0} ( Π_{θ̂(s)}^k H_s − p_{θ̂(s)}(H_s) ) ‖_V ≤ sup_{s∈K} Σ_{k≥0} C_K γ_K^k ‖H_s‖_V < ∞ .

Thus, for all s in K, g_{θ̂(s)} belongs to L_V = {g : R^l → R^m, ‖g‖_V < ∞}. Repeating the same calculation, it is immediate that sup_{s∈K} ‖Π_{θ̂(s)} g_{θ̂(s)}‖_V is bounded. This ends the proof of inequality (4.4) of (A3'(i)).

We now move to the Hölder conditions (4.5) of (A3'(i)). We will use the two following lemmas, which state Hölder conditions on the transition kernel and its iterates.

Lemma 1 Let K be a compact subset of S. There exists a constant C_K such that for all 1 ≤ p there exists q > p such that for all functions f ∈ L_{V^p} and for all (s, s') ∈ K² we have:

‖Π_{θ̂(s)} f − Π_{θ̂(s')} f‖_{V^q} ≤ C_K ‖f‖_{V^p} ‖s − s'‖ .

Proof For any f ∈ L_{V^p} and any x ∈ R^l, we have

Π_s f(x) = ∫ f(z) α_s(x,z) q_s(x,z) dz + f(x) (1 − ᾱ_s(x)) ,

where α_s(x,z) = min(1, p_{θ̂(s)}(z) q_s(z,x) / (q_s(x,z) p_{θ̂(s)}(x))) and ᾱ_s(x) = ∫ α_s(x,z) q_s(x,z) dz is the average acceptance rate. Let us denote, for all x, z and s, r_s(x,z) = α_s(x,z) q_s(x,z).

Let s and s' be two points in K. We note that s ↦ θ̂(s) is a continuously differentiable function, therefore uniformly bounded in s ∈ K. This leads to:

‖Π_s f(x) − Π_{s'} f(x)‖ ≤ ‖f‖_{V^p} [ ∫_X |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz + V^p(x) ∫_X |r_s(x,z) − r_{s'}(x,z)| dz ] ≤ 2 ‖f‖_{V^p} V^p(x) ∫_X |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz .

Let I = ∫_X |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz. For the sake of simplicity, we denote by A_s the acceptance set instead of A_s(x). We decompose I into four terms:

I = ∫_{A_s ∩ A_{s'}} |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz  (51)
+ ∫_{A_s ∩ A_{s'}^c} |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz  (52)
+ ∫_{A_s^c ∩ A_{s'}} |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz  (53)
+ ∫_{A_s^c ∩ A_{s'}^c} |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz .  (54)

Let us first consider the term (51):

∫_{A_s ∩ A_{s'}} |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz = ∫_{A_s ∩ A_{s'}} |q_s(x,z) − q_{s'}(x,z)| V^p(z) dz .  (55)

We use the mean value theorem on the smooth function s ↦ q_s(x,z) for fixed values of (x,z):

dq_s(x,z)/ds = q_s(x,z) d log q_s(x,z)/ds .

After some calculations, using the bounded drift and covariance and assumption (M8), we get:

‖d log q_s(x,z)/ds‖ ≤ P̃_1(x,z) [ ‖dD_s(x)/ds‖ + ‖dΣ_s(x)/ds‖_F + ‖dΣ_s^{−1}(x)/ds‖_F ] ≤ C_K P_1(x,z) ,

where D_s and Σ_s are respectively the drift and the covariance of the proposal q_s, ‖·‖_F denotes the Frobenius norm, and P̃_1 and P_1 are two polynomial functions in both variables. Using Equation (23), we have:

‖dq_s(x,z)/ds‖ ≤ k_2 C_K P_1(x,z) g_{ε_2}(z − x) ,

which leads to:

∫_{A_s ∩ A_{s'}} |q_s(x,z) − q_{s'}(x,z)| V^p(z) dz ≤ k_2 C_K ‖s − s'‖ ∫_X V^p(z) P_1(x,z) g_{ε_2}(z − x) dz ≤ k_2 C_K Q_1(x) ‖s − s'‖ ,

where Q_1 is a polynomial function.

Now we move to the second term (52). Let z ∈ A_s ∩ A_{s'}^c. We define, for all u ∈ [0,1], the barycenter s(u) of s and s', equal to u s + (1 − u) s', which belongs to the convex hull of the compact subset K. Since u ↦ ρ_{s(u)}(x,z) is continuously differentiable, ρ_s(x,z) ≥ 1 and ρ_{s'}(x,z) < 1, using the intermediate value theorem, there exists u ∈ ]0,1], depending on x and z, such that ρ_{s(u)}(x,z) = 1. We choose the minimum value of u satisfying this condition. Therefore,

|α_s(x,z) q_s(x,z) − α_{s'}(x,z) q_{s'}(x,z)| ≤ |q_s(x,z) − q_{s(u)}(x,z)| + |ρ_{s(u)}(x,z) q_{s(u)}(x,z) − ρ_{s'}(x,z) q_{s'}(x,z)| .  (56)

We treat the first term of the right hand side as previously. For the second term, we use the mean value theorem for the function v ↦ f_{s(v)}(x,z) = ρ_{s(v)}(x,z) q_{s(v)}(x,z) on ]0, u[. There exists v ∈ ]0, u[ such that

|f_{s(u)}(x,z) − f_{s'}(x,z)| ≤ ‖d f_{s(v)}(x,z)/dv‖ .

Thanks to the upper bound above, we get (the factor ‖s − s'‖ coming from the chain rule along the segment):

‖d log f_{s(v)}(x,z)/dv‖ ≤ C_K P_2(x,z) ‖s − s'‖ ,

where P_2 is a polynomial function in both variables. Since on the segment defined by s(u) and s' we have ρ_s(x,z) ≤ 1:

‖d f_{s(v)}(x,z)/dv‖ = f_{s(v)}(x,z) ‖d log f_{s(v)}(x,z)/dv‖ ≤ C_K q_{s(v)}(x,z) P_2(x,z) ‖s − s'‖ ≤ k_2 C_K ‖s − s'‖ P_2(x,z) g_{ε_2}(z − x) .

This yields:

∫_{A_s ∩ A_{s'}^c} |r_s(x,z) − r_{s'}(x,z)| V^p(z) dz ≤ k_2 C_K ( Q_1(x) + Q_2(x) ) ‖s − s'‖ .

The third term (53) is symmetric to the second one.

Let us end with the last term (54):

∫_{A_s^c ∩ A_{s'}^c} |α_s(x,z) q_s(x,z) − α_{s'}(x,z) q_{s'}(x,z)| V^p(z) dz = ∫_{A_s^c ∩ A_{s'}^c} |ρ_s(x,z) q_s(x,z) − ρ_{s'}(x,z) q_{s'}(x,z)| V^p(z) dz .

If ρ_{s(u)}(x,z) < 1 for all u ∈ ]0,1[, then this term can be treated as the second term of Equation (56). If there exists u ∈ ]0,1[ such that ρ_{s(u)}(x,z) ≥ 1, we define u_0 and u_1 respectively as the smallest and the largest elements of ]0,1[ such that ρ_{s(u_0)} = ρ_{s(u_1)} = 1. The first and last terms are treated as in the previous case, and the middle term is treated as the term (51).

Putting all these upper bounds together yields:

‖Π_s f(x) − Π_{s'} f(x)‖ ≤ 2 ‖f‖_{V^p} V^p(x) Q(x) ‖s − s'‖ ,  (57)

where Q is a polynomial function of x ∈ X. Therefore, there exists q > p such that V^p(x) Q(x) ≤ V^q(x), which concludes the proof.

Lemma 2 Let K be a compact subset of S. There exists a constant C_K such that for all 1 ≤ p < q, for all functions f ∈ L_{V^p}, for all (s, s') ∈ K² and for all k ≥ 0, we have:

‖Π_{θ̂(s)}^k f − Π_{θ̂(s')}^k f‖_{V^q} ≤ C_K ‖f‖_{V^p} ‖s − s'‖ .

Proof The proof of Lemma 2 follows the lines of the proof of Proposition B.2 of [6].

Thanks to the proofs of [5], we get that h is a Hölder function for any 0 < a < 1, which leads to (A3'(i)).

We finally focus on the proof of (A3'(ii)).

Lemma 3 Let K be a compact subset of S and p ≥ 1. For all sequences γ = (γ_k)_{k≥0} and ε = (ε_k)_{k≥0} satisfying ε_k < ε for some ε sufficiently small, there exists C_K > 0 such that for any z_0 ∈ X we have

sup_{s∈K} sup_{k≥0} E^γ_{z,s}[ V^p(z_k) 1_{σ(K)∧ν(ε) ≥ k} ] ≤ C_K V^p(z_0) ,

where E^γ_{z,s} is the expectation related to the non-homogeneous Markov chain ((z_k, s_k))_k started from (z, s) with step-size sequence γ.

Proof Let K' be a compact subset of Θ such that θ̂(K) ⊂ K'. We note in the sequel θ_k = θ̂(s_k). We have for k ≥ 2, using the Markov property and the drift property (18) for V^p:

E^γ_{z,s}[ V^p(z_k) 1_{σ(K)∧ν(ε) ≥ k} ] ≤ E^γ_{z,s}[ Π_{θ_{k−1}} V^p(z_{k−1}) ]  (58)
≤ λ E^γ_{z,s}[ V^p(z_{k−1}) ] + C .  (59)

Iterating the same arguments recursively leads to:

E^γ_{z,s}[ V^p(z_k) 1_{σ(K)∧ν(ε) ≥ k} ] ≤ λ^k E^γ_{z,s}[ V^p(z_0) ] + C Σ_{l=0}^{k−1} λ^l .

Since λ < 1 and V(z) ≥ 1 for all z ∈ X, for all k ∈ N we have:

E^γ_{z,s}[ V^p(z_k) 1_{σ(K)∧ν(ε) ≥ k} ] ≤ V^p(z_0) + C Σ_{l=0}^{k−1} λ^l ≤ V^p(z_0) ( 1 + C/(1 − λ) ) .

This yields (A3'(ii)), which concludes the proof of Theorem 1.
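As an aside, the uniform-in-k moment bound of Lemma 3, obtained by iterating a one-step drift inequality, can be observed numerically on any toy chain satisfying such an inequality. The AR(1) example below is our own illustration (it is not the AMALA chain): with V(z) = 1 + z², one has exactly Π V(z) = 1 + a²z² + σ² ≤ λ V(z) + b with λ = a² and b = 1 − a² + σ², hence E[V(z_k)] ≤ V(z_0) + b/(1 − λ) for all k.

    import numpy as np

    rng = np.random.default_rng(0)
    a, sigma, z0 = 0.5, 1.0, 3.0
    lam, b = a ** 2, 1.0 - a ** 2 + sigma ** 2     # one-step drift constants
    V = lambda z: 1.0 + z ** 2
    bound = V(z0) + b / (1.0 - lam)                # uniform-in-k bound from the iterated drift

    z = np.full(20000, z0)                         # 20000 independent chains
    for k in range(30):
        z = a * z + sigma * rng.standard_normal(z.size)
        assert V(z).mean() <= bound                # holds at every step, up to MC error
    print("sup_k E[V(z_k)] <=", bound)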

7.3 Proof of the Central Limit Theorem for the Estimated Sequence

The proof of Theorem 2 follows the lines of the proof of Theorem 25 of [12]. This theorem is an application of Theorem 24 of [12] in the case of Markovian dynamics. However, some assumptions required in Theorem 24 are not fulfilled in our case: the A-stability of the algorithm and the boundedness in infinite norm of the solution of the Poisson equation.

Consider the stochastic approximation:

s_k = s_{k−1} + γ_k h(s_{k−1}) + γ_k η_k ,  (60)

where the remainder term is decomposed as follows:

η_k = ξ_k + ν_k − ν_{k−1} + r_k  (61)

with

ξ_k = g_{θ̂(s_{k−1})}(z_k) − Π_{θ̂(s_{k−1})} g_{θ̂(s_{k−1})}(z_{k−1})  (62)
ν_k = −Π_{θ̂(s_k)} g_{θ̂(s_k)}(z_k)  (63)
r_k = Π_{θ̂(s_k)} g_{θ̂(s_k)}(z_k) − Π_{θ̂(s_{k−1})} g_{θ̂(s_{k−1})}(z_k) ,  (64)

and for any s ∈ S, g_{θ̂(s)} is a solution of the Poisson equation g − Π_{θ̂(s)} g = H_s − p_{θ̂(s)}(H_s).

We recall Theorem 24 of [12] with sufficient assumptions for our setting.

Theorem 3 (Adapted from Theorem 24 of [12]) Let assumptions (N1) and (N3) be fulfilled. Furthermore, assume that for some matrix U, some ε > 0 and some positive random variables X, X', X'':

the sequence (ξ_i) is an F-martingale ,  (65)

sup_{i∈N} ‖ξ_i‖_{2+ε} < ∞ ,  (66)

lim_{k→∞} γ_k^{−1/2} ‖X r_k‖_1 = 0 ,  (67)

lim_{k→∞} γ_k^{1/2} ‖X' ν_k‖_1 = 0 ,  (68)

lim_{k→∞} γ_k ‖ X'' ( Σ_{i=1}^k ξ_i ξ_i^T − U ) ‖_1 = 0 ,  (69)

where F = (F_i)_{i∈N} is the increasing family of σ-algebras generated by the random variables (s_0, z_1, ..., z_i). Then

(s_k − s*)/√γ_k →_L N(0, V) ,  (70)

where V is the solution of the following Lyapunov equation: U + J V + V J^T = 0.
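Purely as an illustration of the normalized convergence (70) (this toy example is ours and much simpler than the BME setting), consider the scalar scheme s_k = s_{k−1} + γ_k(−s_{k−1} + σ ε_k) with i.i.d. standard Gaussian ε_k: here s* = 0, h(s) = −s so J = −1, the martingale increments have variance U = σ², and the Lyapunov equation U + JV + VJ^T = 0 gives V = σ²/2. With slowly decreasing steps γ_k = k^{−2/3}, the empirical variance of s_k/√γ_k should approach σ²/2:

    import numpy as np

    rng = np.random.default_rng(1)
    sigma, n_paths, n_steps = 2.0, 4000, 20000
    s = np.zeros(n_paths)                          # start at the root s* = 0
    for k in range(1, n_steps + 1):
        gamma = k ** (-2.0 / 3.0)                  # step exponent in ]1/2, 1[
        s += gamma * (-s + sigma * rng.standard_normal(n_paths))
    gamma = n_steps ** (-2.0 / 3.0)
    print("empirical var of s_k/sqrt(gamma_k):", np.var(s / np.sqrt(gamma)))
    print("Lyapunov solution V = sigma^2/2  :", sigma ** 2 / 2.0)

With these (arbitrary) constants the two printed values agree to within Monte Carlo error, which is the content of (70) in this linear Gaussian special case.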

The result of Theorem 24 still holds when replacing assumption (C) of [12] by (N1). Indeed, it is sufficient to establish that the random variable γ_k^{−1/2} Σ_{i=0}^k exp[(t_k − t_i) J] γ_i r_i converges toward 0 in probability, where t_i = Σ_{j=1}^i γ_j. Theorem 19 and Proposition 39 of [12] can be applied in expectation. Theorems 23 and 20 of [12] also still hold with assumption (N1).

We now prove that the assumptions of Theorem 3 hold. By definition of ξ_i, it is obvious that (65) is fulfilled. Moreover, the following lemma proves that there exists ε > 0 such that (66) holds with X = 1.

Lemma 4 For all ε > 0, the sequence (ξ_k) is in L^{2+ε}.

Proof We use the convexity of the function x ↦ x^{2+ε}. Indeed, we have

|g_{θ̂(s_{k−1})}(z_k) − Π_{θ̂(s_{k−1})} g_{θ̂(s_{k−1})}(z_{k−1})|^{2+ε} ≤ ( |g_{θ̂(s_{k−1})}(z_k)| + |Π_{θ̂(s_{k−1})} g_{θ̂(s_{k−1})}(z_{k−1})| )^{2+ε} ≤ C_ε ( |g_{θ̂(s_{k−1})}(z_k)|^{2+ε} + |Π_{θ̂(s_{k−1})} g_{θ̂(s_{k−1})}(z_{k−1})|^{2+ε} ) ,

where C_ε = 2^{1+ε}. Applying the drift condition, we get:

E( ‖ξ_k‖^{2+ε} | F_{k−1} ) ≤ C_ε ( E( |g_{θ̂(s_{k−1})}(z_k)|^{2+ε} | F_{k−1} ) + |Π_{θ̂(s_{k−1})} g_{θ̂(s_{k−1})}(z_{k−1})|^{2+ε} ) ≤ C E( V(z_k)^{2+ε} + V(z_{k−1})^{2+ε} | F_{k−1} ) ≤ C ( λ V^{2+ε}(z_{k−1}) + 1 ) .

Finally, taking the expectation after induction as in Lemma 3 leads to:

E( ‖ξ_k‖^{2+ε} ) ≤ C V^{2+ε}(z_0) < +∞ .

Let us now focus on Equation (67). Thanks to the Hölder property of our kernel and the fact that H_{s_k} belongs to L_V:

‖r_k‖_1 = E[ |Π_{θ̂(s_k)} g_{θ̂(s_k)}(z_k) − Π_{θ̂(s_{k−1})} g_{θ̂(s_{k−1})}(z_k)| ] ≤ C E[ V^q(z_k) |s_k − s_{k−1}|^a ] ≤ C E[ V^{q+1}(z_k) ] γ_k^a ≤ C γ_k^a ,

where the last inequality comes from the drift property. Since the Hölder property is true for any 0 < a < 1, we can choose a > 1/2, which leads to the conclusion.

To prove Equation (68), we note that, using the drift condition as in the previous lemma, E(‖ν_k‖) is uniformly bounded in k. Since the step-size sequence (γ_k)_k tends to zero, the result follows with X' = 1.

We follow the lines of the proof of Theorem 25 of [12] to establish Equation (69). As in the proof of Lemma 4, we use the drift property coupled with our Hölder condition in L_V-norm instead of the usual Hölder condition considered in [12], which is denoted (MS) there. This concludes the proof of Theorem 3.

Applying Theorem 3 allows us to prove the first part of Theorem 2. Assumption (N2) enables us to characterize the covariance matrix Γ. The Delta method gives the result on the sequence (θ_k), achieving the proof of our Central Limit Theorem.

References

1. Aldridge, G., Ratnanather, J., Martone, M., Terada, M., Beg, M., Fong, L., Ceyhan, E., Kolasny, A., Brown, T., Cochran, E., Tang, S., Pisano, D., Vaillant, M., Hurdal, M., Churchill, J., Greenough, W., Miller, M., Ellisman, M.: Semi-automated shape analysis of dendrite spines from animal models of fragile X and Parkinson's disease using large deformation diffeomorphic metric mapping. Society for Neuroscience Annual Meeting, Washington DC (2005)
2. Allassonnière, S., Amit, Y., Trouvé, A.: Toward a coherent statistical framework for dense deformable template estimation. JRSS 69, 3–29 (2007)
3. Allassonnière, S., Kuhn, E.: Stochastic algorithm for Bayesian mixture effect template estimation. ESAIM Probab. Stat. 14, 382–408 (2010)
4. Allassonnière, S., Kuhn, E., Trouvé, A.: Bayesian consistent estimation in deformable models using stochastic algorithms: applications to medical images. Journal de la Société Française de Statistique 151(1), 1–16 (2010)
5. Allassonnière, S., Kuhn, E., Trouvé, A.: Bayesian deformable models building via stochastic approximation algorithm: a convergence study. Bernoulli J. 16(3), 641–678 (2010)
6. Andrieu, C., Moulines, É., Priouret, P.: Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. 44(1), 283–312 (2005)
7. Atchadé, Y.: An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodol. Comput. Appl. Probab. 8, 235–254 (2006)
8. Bigot, J., Charlier, B.: On the consistency of Fréchet means in deformable models for curve and image analysis. Electron. J. Stat. 5, 1054–1089 (2011). DOI 10.1214/11-EJS633
9. Ceyhan, E., Fong, L., Tasky, T., Hurdal, M., Beg, M.F., Martone, M., Ratnanather, J.: Type-specific analysis of morphometry of dendrite spines of mice. 5th Int. Symp. Image Signal Proc. Analysis, ISPA, pp. 7–12 (2007)
10. Ceyhan, E., Ölken, R., Fong, L., Tasky, T., Hurdal, M., Beg, M., Martone, M., Ratnanather, J.: Modeling metric distances of dendrite spines of mice based on morphometric measures. Int. Symp. on Health Informatics and Bioinformatics (2007)
11. Cootes, T., Taylor, C., Cooper, D., Graham, J.: Active shape models: their training and application. Comp. Vis. and Image Understanding 61(1), 38–59 (1995)
12. Delyon, B.: Stochastic approximation with decreasing gain: convergence and asymptotic theory. Technical report, Publication interne 952, IRISA (2000)
13. Delyon, B., Lavielle, M., Moulines, É.: Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist. 27(1), 94–128 (1999)
14. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1–38 (1977)
15. Gilks, W., Richardson, S., Spiegelhalter, D.: Markov Chain Monte Carlo in Practice. Chapman & Hall (1996)
16. Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B 73(2), 123–214 (2011)
17. Glasbey, C.A., Mardia, K.V.: A penalised likelihood approach to image warping. Journal of the Royal Statistical Society, Series B 63, 465–492 (2001)
18. Grenander, U., Miller, M.I.: Computational anatomy: an emerging discipline. Quarterly of Applied Mathematics LVI(4), 617–694 (1998)
19. Jarner, S., Hansen, E.: Geometric ergodicity of Metropolis algorithms. Stochastic Processes and Their Applications, 341–361 (1998)
20. Joshi, S., Davis, B., Jomier, M., Gerig, G.: Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23, S151–S160 (2004)
21. Kuhn, E., Lavielle, M.: Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM Probab. Stat. 8, 115–131 (2004)
22. Maire, F., Lefebvre, S., Moulines, É., Douc, R.: Aircraft classification with low infrared sensor. Statistical Signal Processing Workshop (SSP), IEEE (2011)
23. Marshall, T., Roberts, G.: An adaptive approach to Langevin MCMC. Statistics and Computing 22(5), 1041–1057 (2012)

24. Marsland, S., Twining, C.: Constructing diffeomorphic representations for the groupwise analysis of non-rigid registrations of medical images. IEEE Transactions on Medical Imaging 23 (2004)
25. Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer-Verlag, London (1993)
26. Micheli, M., Michor, P.W., Mumford, D.B.: Sectional curvature in terms of the cometric, with applications to the Riemannian manifolds of landmarks. SIAM Journal on Imaging Sciences 5(1), 394–433 (2012)
27. Miller, M., Priebe, C., Qiu, A., Fischl, B., Kolasny, A., Brown, T., Park, Y., Ratnanather, J., Busa, E., Jovicich, J., Yu, P., Dickerson, B., Buckner, R., Morphometry BIRN: Collaborative computational anatomy: an MRI morphometry study of the human brain via diffeomorphic metric mapping. Human Brain Mapping 30(7), 2132–2141 (2009)
28. Miller, M.I., Trouvé, A., Younes, L.: On the metrics and Euler-Lagrange equations of computational anatomy. Annual Review of Biomedical Engineering 4 (2002)
29. Richard, F., Samson, A., Cuenod, C.A.: A SAEM algorithm for the estimation of template and deformation parameters in medical image sequences. Statistics and Computing 19, 465–478 (2009)
30. Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996). DOI 10.2307/3318418
31. Stramer, O., Tweedie, R.: Langevin-type models I: diffusions with given stationary distributions, and their discretizations. Methodol. Comput. Appl. Probab. 1(3), 283–306 (1999)
32. Stramer, O., Tweedie, R.: Langevin-type models II: self-targeting candidates for MCMC algorithms. Methodol. Comput. Appl. Probab. 1(3), 307–328 (1999)
33. Vercauteren, T., Pennec, X., Perchant, A., Ayache, N.: Diffeomorphic demons: efficient non-parametric image registration. NeuroImage 45, 61–72 (2009)