
Properties of the Stochastic Approximation EM Algorithm with Mini-batch Sampling

Estelle Kuhn · Catherine Matias · Tabea Rebafka


Abstract To deal with very large datasets a mini-batch version of the Monte Carlo Markov Chain Stochastic Approximation Expectation-Maximization algorithm for general latent variable models is proposed. For exponential models the algorithm is shown to be convergent under classical conditions as the number of iterations increases. Numerical experiments illustrate the performance of the mini-batch algorithm in various models. In particular, we highlight that mini-batch sampling results in an important speed-up of the convergence of the sequence of estimators generated by the algorithm. Moreover, insights on the effect of the mini-batch size on the limit distribution are presented. Finally, we illustrate how to use mini-batch sampling in practice to improve results when a constraint on the computing time is given.

Keywords EM algorithm, mini-batch sampling, stochastic approximation, Monte Carlo Markov chain.

Mathematics Subject Classification (2010) 65C60 · 62F12

Estelle Kuhn, MaIAGE, INRAE, Université Paris-Saclay, Jouy-en-Josas, France. E-mail: [email protected]
Catherine Matias, Sorbonne Université, Université de Paris, CNRS, Laboratoire de Probabilités, Statistique et Modélisation (LPSM), Paris, France. E-mail: [email protected]
Tabea Rebafka, Sorbonne Université, Université de Paris, CNRS, Laboratoire de Probabilités, Statistique et Modélisation (LPSM), Paris, France. E-mail: [email protected]

1 Introduction

On very large datasets the computing time of the classical expectation-maximization (EM) algorithm (Dempster et al., 1977) as well as its variants such as Monte Carlo EM, Stochastic Approximation EM, Monte Carlo Markov Chain-SAEM and others can be very long, since all data points are visited in every iteration. To circumvent this problem, a number of EM-type algorithms have been proposed, namely various mini-batch (Neal and Hinton, 1999; Liang and Klein, 2009; Karimi et al., 2018; Nguyen et al., 2020) and online (Titterington, 1984; Lange, 1995; Cappé and Moulines, 2009; Cappé, 2011) versions of the EM algorithm. They all consist in using only a part of the observations during one iteration in order to shorten computing time and accelerate convergence. While online algorithms process a single observation per iteration, handled in the order of arrival, mini-batch algorithms use larger, randomly chosen subsets of observations. The size of these subsets of data is generally called the mini-batch size. Choosing large mini-batch sizes entails long computing times, while very small mini-batch sizes as well as online algorithms may result in a loss of accuracy of the algorithm. This raises the question of the optimal mini-batch size that would achieve a compromise between accuracy and computing time. However, this issue is generally overlooked.

In this article, we propose a mini-batch version of the MCMC-SAEM algorithm (Delyon et al., 1999; Kuhn and Lavielle, 2004). The original MCMC-SAEM algorithm is a powerful alternative to EM when the E-step is intractable. This is particularly interesting for nonlinear models or non-Gaussian models, where the unobserved data cannot be simulated exactly from the conditional distribution. Moreover, the MCMC-SAEM algorithm is also more computationally efficient than the MCMC-EM algorithm, since only a single instance of the latent variable is sampled at every iteration of the algorithm. Nevertheless, when the dimension of the latent variable is huge, the simulation step as well as the update of the sufficient statistics can be time consuming. From this point of view the mini-batch version proposed here is computationally more efficient than the original MCMC-SAEM, since at each iteration only a small proportion of the latent variables is simulated and only the corresponding data are visited to update the parameter estimates. For exponential models, we prove almost-sure convergence of the sequence of estimates generated by the mini-batch MCMC-SAEM algorithm under classical conditions as the number of iterations of the algorithm increases. We also conjecture an asymptotic normality result and the relation between the limiting variance and the mini-batch size. Moreover, for various models we assess via numerical experiments the influence of the mini-batch size on the speed-up of the convergence at the beginning of the algorithm as well as its impact on the limit distribution of the estimates. Furthermore, we study the computing time of the algorithm and address the question of how to use mini-batch sampling in practice to improve results.

2 Latent variable model and algorithm

This section introduces the general latent variable model considered throughout this paper and the original MCMC-SAEM algorithm. Then the new mini-batch version of the MCMC-SAEM algorithm is presented.

2.1 Model and assumptions

Consider the common latent variable model with incomplete (observed) data y and latent (unobserved) variable z. Denote n the dimension of the latent variable z = (z_1, . . . , z_n) ∈ R^n. In many models, n also corresponds to the number of observations, but it is not necessary that z and y have the same size or that each observation y_i depends only on a single latent component z_i, as is for instance the case in the stochastic block model, Section 4.3.

Denote θ ∈ Θ ⊂ R^d the model parameter of the joint distribution of the complete data (y, z). In what follows, omitting all dependencies on the observations y, which are considered as fixed realizations in the analysis, we assume that the complete-data likelihood function has the following form

    f(z; θ) = exp{−ψ(θ) + ⟨S(z), φ(θ)⟩} c(z),

where ⟨·, ·⟩ is the scalar product, S(z) denotes a vector of sufficient statistics of the model taking its values in some set S, and ψ and φ are functions of θ. The posterior distribution of the latent variables z given the observations is denoted by π(·; θ).

2.2 Description of the MCMC-SAEM algorithm

The original MCMC-SAEM algorithm proposed by Kuhn and Lavielle (2004) is appropriate for models where the classical EM algorithm cannot be applied due to difficulties in the E-step. In particular, in those models the conditional expectation E_{θ_{k−1}}[S(z)] of the sufficient statistic under the current parameter value θ_{k−1} has no closed-form expression. In the MCMC-SAEM algorithm the quantity E_{θ_{k−1}}[S(z)] is thus estimated by a stochastic approximation algorithm. This means that the classical E-step is replaced with a simulation step using an MCMC procedure combined with a stochastic approximation step. Here, we focus on a version where the MCMC part is a Metropolis-Hastings-within-Gibbs algorithm (Robert and Casella, 2004). More precisely, the k-th iteration of the classical MCMC-SAEM algorithm consists of the following three steps.

2.2.1 Simulation step

A new realization z_k of the latent variable is sampled from an ergodic Markov transition kernel Π(z_{k−1}, ·|θ_{k−1}), whose stationary distribution is the posterior distribution π(·; θ_{k−1}). In practice, this simulation is done by performing one iteration of a Metropolis-Hastings-within-Gibbs algorithm. That is, we consider a collection (Π_i)_{1≤i≤n} of symmetric random walk Metropolis kernels defined on R^n, where the subscript i indicates that Π_i acts only on the i-th coordinate, see Fort et al. (2003). These kernels are applied successively to update the components of z one after the other. More precisely, let (e_i)_{1≤i≤n} be the canonical basis of R^n. Then, for each i ∈ {1, . . . , n}, starting from the n-vector z = (z_1, . . . , z_n), the proposal in the direction of e_i is given by z + x e_i, where x ∈ R is sampled from a symmetric increment density q_i. This proposal is then accepted with probability min{1, π(z + x e_i; θ_{k−1}) / π(z; θ_{k−1})}.

2.2.2 Stochastic approximation step

The approximation of the sufficient statistic is updated by

    s_k = (1 − γ_k) s_{k−1} + γ_k S(z_k),    (1)

where (γ_k)_{k≥1} is a decreasing sequence of positive step sizes such that ∑_{k≥1} γ_k = ∞ and ∑_{k≥1} γ_k² < ∞. That is, the current approximation s_k of the sufficient statistic is a weighted mean of its previous value s_{k−1} and the value of the sufficient statistic S(z_k) evaluated at the current value of the simulated latent variable z_k.

2.2.3 Maximization step

The model parameter θ is updated by θ_k = θ̂(s_k) with

    θ̂(s) = arg max_{θ∈Θ} L(θ; s),   where L(θ; s) = −ψ(θ) + ⟨s, φ(θ)⟩.

Depending on the model, the maximization problem may have a closed-form solution or not.
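To fix ideas, the following Python sketch spells out one iteration of the classical MCMC-SAEM just described for a generic exponential-family model. It is only an illustration, not the authors' code: the functions log_posterior, sufficient_stat and argmax_theta are hypothetical placeholders for the model-specific quantities log π(z; θ), S(z) and θ̂(s).

    import numpy as np

    rng = np.random.default_rng(0)

    def mcmc_saem_step(z, s, theta, gamma,
                       log_posterior, sufficient_stat, argmax_theta, scale=0.1):
        """One iteration of the classical (batch) MCMC-SAEM.

        z: current latent vector, s: current statistic approximation,
        theta: current parameter, gamma: step size gamma_k.
        The three function arguments are model-specific placeholders."""
        n = len(z)
        z = z.copy()
        # Simulation step: one Metropolis-Hastings-within-Gibbs sweep with a
        # symmetric Gaussian random walk proposal on each coordinate.
        for i in range(n):
            proposal = z.copy()
            proposal[i] += scale * rng.standard_normal()
            log_ratio = log_posterior(proposal, theta) - log_posterior(z, theta)
            if np.log(rng.uniform()) < log_ratio:
                z = proposal
        # Stochastic approximation step: s_k = (1 - gamma_k) s_{k-1} + gamma_k S(z_k).
        s = (1.0 - gamma) * s + gamma * sufficient_stat(z)
        # Maximization step: theta_k = argmax_theta L(theta; s_k).
        theta = argmax_theta(s)
        return z, s, theta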

2.3 Mini-batch MCMC-SAEM algorithm

When the dimension n of the latent variable z is large, the simulation step can be very time-consuming. Indeed, simulating all latent components z_i at every iteration is costly in time. Thus, in the spirit of other mini-batch algorithms, updating only a part of the latent components may speed up the computing time and also the convergence of the algorithm. With this idea in mind, denote 0 < α ≤ 1 the average proportion of components of the latent variable z that are updated during one iteration.

Furthermore, depending on the model, the evaluation of the sufficient statistic S(z_k) on the current latent variable z_k can be accelerated, as only a part of the components of z_k have changed. This brings a further gain in computing time.

2.3.1 Mini-batch simulation step

In the mini-batch version of the MCMC-SAEM algorithm the simulation step consists of two parts. First, select the indices of the components z_i of the latent variable that will be updated. That is, we sample the number r_k of indices from a binomial distribution Bin(n, α) and then randomly select r_k indices among {1, . . . , n} without replacement. Denote I_k this set of selected indices at iteration k. Second, instead of sampling all components z_{k,i}, we only sample from the Metropolis kernels Π_i for i ∈ I_k to update only the components z_{k,i} with index i ∈ I_k.

2.3.2 Stochastic approximation step

Again this step consists in updating the sufficient statistic s_k according to Equation (1). However, the naive evaluation of the sufficient statistic S(z_k) generally involves all data y and thus is time-consuming on large datasets. Though, in most models it is computationally much more efficient to derive the value of S(z_k) from its previous value S(z_{k−1}) by correcting only for the terms that involve recently updated latent components. In general, this amounts to using only a small part of the data and thus speeds up computing. An example is detailed in Section 4.3.

2.3.3 Maximization step

This step is identical to the one in the original algorithm. It does not depend on the mini-batch proportion α in any way: the formulae for the update of the parameter estimates are identical and the computing time of the M-step is the same for any α.

2.3.4 Initialization

Initial values θ_0, s_0 and z_0 for the model parameter, the sufficient statistic and the latent variable, respectively, have to be chosen by the user or at random.

See Algorithm 1 for a complete description of the algorithm.

Algorithm 1 Mini-batch MCMC-SAEM
    Input: data y.
    Initialization: Choose initial values θ_0, s_0, z_0. Set k = 1.
    while not converged do
        Sample r_k ∼ Bin(n, α).
        Sample r_k indices from {1, . . . , n}, denoted by I_k.
        Set z_k = z_{k−1}.
        for i ∈ I_k do
            Sample z ∼ Π_i(z_k, ·|θ_{k−1}).
            Set z_k = z.
        end for
        Set s_k = (1 − γ_k) s_{k−1} + γ_k S(z_k).
        Update the parameter θ_k = θ̂(s_k).
        Increment k.
    end while

3 Convergence of the algorithm

In this section we show that, under appropriate assumptions, the mini-batch MCMC-SAEM algorithm converges as the number of iterations increases. Note that we consider convergence of the algorithm for a fixed dataset y when the number of iterations tends to infinity, and not statistical convergence where the sample size grows. Other convergence results for mini-batch EM and SAEM algorithms appeared recently in Nguyen et al. (2020) and Karimi (Chapter 7, 2019), respectively. Basically, the assumptions are classical conditions that ensure the convergence of the original MCMC-SAEM algorithm.

So roughly, our theorem says that if the batch MCMC-SAEM algorithm converges, so does the mini-batch version for any mini-batch proportion α. Compared to the classical MCMC-SAEM algorithm, the mini-batch version involves an additional stochastic part that comes from the selection of the indices of the latent components that are to be updated. This additional stochasticity is governed by the value of the mini-batch proportion α. We also present arguments to explain the impact of the mini-batch proportion α on the limit distribution of the sequence generated by the algorithm.

3.1 Equivalent descriptions

The above description of the simulation step is convenient for achieving maximal computing efficiency. We now focus on a mathematically equivalent framework that underlines the fact that the mini-batch MCMC-SAEM algorithm formally belongs to the family of classical MCMC-SAEM algorithms.

For two kernels P_1 and P_2, we denote their composition by

    P_2 ◦ P_1(z, z′) = ∫ P_2(z̃, z′) P_1(z, dz̃).

With this notation at hand, the Metropolis-Hastings-within-Gibbs sampler uses the kernel Π = Π_n ◦ · · · ◦ Π_1. Now, to describe the mini-batch simulation step in terms of a Markov kernel, we first introduce the kernel Π̄_{α,i}, which is a mixture of the original kernel Π_i and the identity kernel Id, defined as

    Π̄_{α,i}(z, z′|θ) = α Π_i(z, (z_1, . . . , z′_i, . . . , z_n)|θ) + (1 − α) Id(z, z′).

Hence, the mini-batch simulation step corresponds to generating a latent vector z according to the Markov kernel Π̄_α = Π̄_{α,n} ◦ · · · ◦ Π̄_{α,1}. Indeed, Π̄_α can also be written as

    Π̄_α(z, z′|θ) = ∑_{k=0}^{n} α^k (1 − α)^{n−k} ∑_{1≤i_1<···<i_k≤n} (Π_{i_k} ◦ · · · ◦ Π_{i_1})(z, z′|θ).

We now give a third mathematically equivalent description of the simulation step, which will be appropriate for the analysis of the theoretical properties of the algorithm. In the k-th mini-batch simulation step, we sample for every i ∈ {1, . . . , n} a Bernoulli random variable U_{k,i} with parameter α. So U_{k,i} indicates whether the latent variable z_{k−1,i} is updated at iteration k or not. Next, we sample a realization z̃_{k,i} from the transition kernel Π_i and set

    z_{k,i} = U_{k,i} z̃_{k,i} + (1 − U_{k,i}) z_{k−1,i}.    (2)

In particular, we see from this formula that, for α = 1, the sequence (z_k)_{k≥1} generated by the batch algorithm is a Markov chain with transition kernel Π = Π_n ◦ · · · ◦ Π_1, as has already been mentioned above.

Fort et al. (2003) establish results on the geometric ergodicity of hybrid samplers and in particular for the random-scan Gibbs sampler. The latter is defined as n^{−1} ∑_{i=1}^n Π_i, where each Π_i is a kernel on R^n acting only on the i-th component. More generally, the random-scan Gibbs sampler may be defined as ∑_{i=1}^n a_i Π_i, where (a_1, . . . , a_n) is a probability distribution. This means that at each step of the algorithm, only one component i is drawn from the probability distribution (a_1, . . . , a_n) and then updated. These probabilities may be chosen uniformly (a_i = 1/n) or, for example, can be used to favor a component that is more difficult to explore. We generalize the results of Fort et al. (2003) to a setup where at each step k a kernel Π̄_α is used that is iterated from a random-scan Gibbs sampler Π̃_α as

    Π̄_α(·, ·|θ_k) = Π̃_α(·, ·|θ_k)^{∑_i U_{k,i}},    (3)

with

    Π̃_α(·, ·|θ_k) = (∑_{i=1}^n U_{k,i})^{−1} ∑_{i=1}^n U_{k,i} Π_i(·, ·|θ_k)   if ∑_{i=1}^n U_{k,i} ≥ 1,   and   Π̃_α(·, ·|θ_k) = Id   otherwise.

Note that this is not exactly the kernel corresponding to the algorithm described above, as the same component i can possibly be updated more than once during the same iteration. Nonetheless, we neglect this effect in what follows.
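For illustration, the Bernoulli-indicator representation (2) of the mini-batch simulation step translates directly into the following Python sketch (our own illustration, not the authors' code); metropolis_update_coord is a hypothetical stand-in for one draw of coordinate i from the kernel Π_i.

    import numpy as np

    rng = np.random.default_rng(1)

    def minibatch_simulation_step(z_prev, theta, alpha, metropolis_update_coord):
        """Mini-batch simulation step written as in Equation (2).

        For each coordinate i, U_i ~ Bernoulli(alpha) decides whether z_i is
        refreshed by the coordinate-wise Metropolis kernel Pi_i or kept fixed.
        metropolis_update_coord(z, i, theta) is a placeholder returning one
        draw of coordinate i from Pi_i(z, .|theta)."""
        n = len(z_prev)
        U = rng.binomial(1, alpha, size=n)       # update indicators U_{k,i}
        z_new = z_prev.copy()
        for i in np.flatnonzero(U):              # only the selected coordinates
            z_tilde_i = metropolis_update_coord(z_new, i, theta)
            z_new[i] = z_tilde_i                 # z_{k,i} = U z~ + (1 - U) z_{k-1,i}
        return z_new, U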

We consider the following assumptions on the model.

(M1) The parameter space Θ is an open subset of R^d. The complete-data likelihood is given by
    f(z; θ) = exp{−ψ(θ) + ⟨S(z), φ(θ)⟩} c(z),
where S is a continuous function on R^n taking its values in an open subset S of R^m. Moreover, the convex hull of S(R^n) is included in S, and for all θ ∈ Θ, ∫ |S(z)| π(z; θ) dz < ∞.
(M2) The functions ψ and φ are twice continuously differentiable on Θ.
(M3) The function s̄ : Θ → S defined as s̄(θ) = ∫ S(z) π(z; θ) dz is continuously differentiable on Θ.
(M4) The observed-data log-likelihood function ℓ : Θ → R defined as ℓ(θ) = log ∫ f(z; θ) dz is continuously differentiable on Θ and ∂_θ ∫ f(z; θ) dz = ∫ ∂_θ f(z; θ) dz.
(M5) Define L : S × Θ → R as L(s; θ) = −ψ(θ) + ⟨s, φ(θ)⟩. There exists a continuously differentiable function θ̂ : S → Θ such that, for any s ∈ S and any θ ∈ Θ, L(s; θ̂(s)) ≥ L(s; θ).

We now introduce the usual conditions that ensure convergence of the SAEM procedure.

(SAEM1) For all k ∈ N, γ_k ∈ [0, 1], ∑_{k=1}^∞ γ_k = ∞ and ∑_{k=1}^∞ γ_k² < ∞.
(SAEM2) The functions ℓ : Θ → R and θ̂ : S → Θ are m times differentiable, where we recall that S is an open subset of R^m.

For any s ∈ S, we define H_s(z) = S(z) − s and its expectation with respect to the posterior distribution π(·; θ̂(s)), denoted by h(s) = E_{θ̂(s)}[S(z)] − s. For any ρ > 0, denote V_ρ(z) = sup_{θ∈Θ}[π(z; θ)]^ρ. We consider the following additional assumptions, as done in Fort et al. (2003).

(H1) There exists a constant M_0 such that
    L = {s ∈ S, ⟨∇ℓ(θ̂(s)), h(s)⟩ = 0} ⊂ {s ∈ S, −ℓ(θ̂(s)) < M_0}.
In addition, there exists M_1 ∈ (M_0, ∞] such that {s ∈ S, −ℓ(θ̂(s)) ≤ M_1} is a compact set.
(H2) The family {q_i}_{1≤i≤n} of symmetric densities is such that, for i = 1, . . . , n, there exist constants η_i > 0 and δ_i < ∞ such that q_i(x) > η_i for all |x| < δ_i.
(H3) There are constants δ and ∆ with 0 ≤ δ ≤ ∆ ≤ ∞ such that
    inf_{i=1,...,n} ∫_δ^∆ q_i(x) dx > 0,
and for any sequence {z^j} with lim_{j→∞} |z^j| = ∞, a subsequence {ž^j} can be extracted with the property that, for some i ∈ {1, . . . , n}, for any x ∈ [δ, ∆] and any θ ∈ Θ,
    lim_{j→∞} π(ž^j; θ) / π(ž^j − sign(ž^j_i) x e_i; θ) = 0   and   lim_{j→∞} π(ž^j + sign(ž^j_i) x e_i; θ) / π(ž^j; θ) = 0.
(H4) There exist C > 1, ρ ∈ (0, 1) and θ_0 ∈ Θ such that, for all z ∈ R^n, |S(z)| ≤ C π(z; θ_0)^{−ρ}.

To state our convergence result, we consider the version of the algorithm with truncation on random boundaries studied by Andrieu et al. (2005). This additional projection step ensures in particular the stability of the algorithm for the theoretical analysis and is only a technical tool for the proof without any practical consequences.

Theorem 1 Assume that the conditions (M1)–(M5), (SAEM1), (SAEM2) and (H1)–(H4) hold. Let 0 < α ≤ 1 and (θ_k)_{k≥1} be a sequence generated by the mini-batch MCMC-SAEM algorithm with corresponding Markov kernel Π̄_α(·, ·|θ) and truncation on random boundaries as in Andrieu et al. (2005). Then, almost surely,

    lim_{k→∞} d(θ_k, {θ, ∇ℓ(θ) = 0}) = 0,

where d(x, A) = inf_{y∈A} |x − y|, that is, (θ_k)_{k≥1} converges to the set of critical points of the observed likelihood ℓ(θ) as the number of iterations increases.

Proof The proof consists of two steps. First, we prove the convergence of the sequence of sufficient statistics (s_k)_{k≥1} towards the set of zeros of the function h using Theorem 5.5 in Andrieu et al. (2005). Second, following the usual reasoning for EM-type algorithms, described for instance in Delyon et al. (1999), we deduce that the sequence (θ_k)_{k≥1} converges to the set of critical points of the observed data log-likelihood ℓ.

First step. In order to apply Theorem 5.5 in Andrieu et al. (2005), we need to establish that their conditions (A1) to (A4) are satisfied. In what follows, (A1) to (A4) refer to the conditions stated in Andrieu et al. (2005). First, note that under our assumptions (H1), (M1)–(M5) and (SAEM2), condition (A1) is satisfied. Indeed, this is a consequence of Lemma 2 in Delyon et al. (1999). To establish (A2) and (A3), as suggested in Andrieu et al. (2005), we establish their drift conditions (DRI), see Proposition 6.1 in Andrieu et al. (2005). We first focus on establishing (DRI1) in Andrieu et al. (2005). To this aim, we rely on Fort et al. (2003), who establish results for the random-scan Metropolis sampler. In their context, they consider a sampler Π = n^{−1} ∑_{i=1}^n Π_i. We generalize their results to our setup according to (3). Following the lines of the proof of Theorem 2 in Fort et al. (2003), we can show that Equations (6.1) and (6.3) appearing in the drift condition (DRI1) in Andrieu et al. (2005) are satisfied when (H2)–(H3) hold. Indeed, following the strategy developed in Allassonnière et al. (2010), we first establish Equations (6.1) and (6.3) using a drift function depending on θ, namely V_θ(z) = π(z; θ)^{−ρ}, where ρ is given by (H4). Then we define the common drift function V as follows. Let θ_0 ∈ Θ and ρ be given in (H4) and define V(z) = π(z; θ_0)^{−ρ}. Then for any compact K ⊂ Θ, there exist two positive constants c_K and C_K such that for all θ ∈ K and for all z, we get c_K V(z) ≤ π(z; θ)^{−ρ} ≤ C_K V(z). We then establish Equations (6.1) and (6.3) for this drift function V. Moreover, using Proposition 1 and Proposition 2 in Fort et al. (2003) we obtain that Equation (6.2) in (DRI1) from Andrieu et al. (2005) holds. Under assumption (H4) we have the first part of (DRI2) in Andrieu et al. (2005). The second part is true in our case with β = 1. Finally, (DRI3) in Andrieu et al. (2005) holds in our context with β = 1, since s ↦ θ̂(s) is twice continuously differentiable and thus Lipschitz on any compact set. To prove this, we decompose the space into an acceptance region and a rejection region and consider the integral over four sets leading to different expressions of the acceptance ratio (see, for example, the proof of Lemma 4.7 in Fort et al., 2015). This implies that (DRI) and therefore (A2)–(A3) in Andrieu et al. (2005) are satisfied. Notice that (SAEM1) ensures (A4). This concludes the first step of the proof.

Second step. As the function s ↦ θ̂(s) is continuous, the second step is immediate by applying Lemma 2 in Delyon et al. (1999).

3.3 Limit distribution

The theoretical study of the impact of the mini-batch proportion α on the limit distribution of the sequence (θ_k)_{k≥1} generated by the mini-batch MCMC-SAEM algorithm is involved, and here we only present some heuristic arguments. We conjecture that, under reasonable assumptions, (θ_k)_{k≥1} is asymptotically normal at rate 1/√k and that the limiting covariance, say V_α, depends on the mini-batch proportion α in the following form

    V_α = ((2 − α)/α) V_1,

where V_1 denotes the limiting covariance of the batch algorithm. This formula is coherent with the expected behavior with respect to the mini-batch proportion. Namely, the limit variance is monotone in α, for α = 1 we recover the limit variance of the batch algorithm, and, when α vanishes, the limit variance tends to infinity. Numerical experiments in Sections 4.3 and 4.4 support this conjecture.

The general approach to establish asymptotic normality of (θ_k)_{k≥1} consists in establishing asymptotic normality of the sequence of sufficient statistics (s_k)_{k≥1} and then applying the delta method. Now consider the simple case where the model has a single latent component, that is, n = 1. We rewrite Equation (1) as

    s_k = s_{k−1} + γ_k h(s_k) + γ_k η_k,

where η_k = S(z_k) − E_{θ_{k−1}}[S(z)], and note that S(z_k) = U_k S(z̃_k) + (1 − U_k) S(z_{k−1}). In general, the principal contribution to the limit variance comes from the term

    (1/√k) ∑_{l=1}^k η_l = (1/√k) ∑_{l=1}^k N_{l,k} (S(z̃_l) − E_{θ_{l−1}}[S(z)]) + (1/√k) ∑_{l=1}^k (N_{l,k} − 1) E_{θ_{l−1}}[S(z)],    (4)

where

    N_{l,k} = ∑_{j=l}^k { U_l ∏_{m=1}^{j−l} (1 − U_{l+m}) }.

The quantity N_{l,k} equals zero if there is no update of the unique latent component at iteration l. Otherwise, N_{l,k} is the number of iterations until the next update after the one at iteration l. The random variable N_{l,k} takes its values in {0, 1, . . . , k − l + 1} and it can be shown that

    P(N_{l,k} = m) = 1 − α for m = 0,   α²(1 − α)^{m−1} for m = 1, . . . , k − l,   and   α(1 − α)^{k−l} for m = k − l + 1.

Another important property is that ∑_{l=1}^k N_{l,k} = k almost surely. It can be shown that the second term on the right-hand side of (4) tends to zero in probability when k goes to infinity. To analyze the first term, we consider the simple setting where for all k ≥ 1, θ_k = θ*, where θ* is constant, e.g. a critical point of the observed likelihood, and the z̃_k are simulated from the conditional distribution π(·; θ*). In this case, conditionally on the Bernoulli indicators of updates (U_k)_{k≥1}, the central limit theorem can be applied to the first term. For the conditional variance of the first term in Equation (4) we obtain

    Var( (1/√k) ∑_{l=1}^k N_{l,k} (S(z̃_l) − E_{θ_{l−1}}[S(z)]) | (U_k)_{k≥1} ) = (1/k) ∑_{l=1}^k N_{l,k}² V_1,

where V_1 is the variance of S(z) with z ∼ π(·; θ*). Further computations yield that the expectation taken over (U_k)_{k≥1} gives

    lim_{k→∞} E[ (1/k) ∑_{l=1}^k N_{l,k}² ] = (2 − α)/α.

The main difficulty for generalizing this approach to n > 1 arises from two facts. First, the components of z_k are not updated simultaneously, but at different iterations. Second, the sufficient statistic may not be linear in z. More precisely, when z is a vector, N_{l,k} is also a vector and an equivalent of (4) is not immediate.

4 Numerical experiments

We carry out various numerical experiments in a nonlinear mixed effects model, a Bayesian deformable template model, the stochastic block model and a frailty model, to illustrate the performance and the properties of the proposed mini-batch MCMC-SAEM algorithm and the potential gain in efficiency and accuracy. The programs of these experiments are available at www.lpsm.paris/pageperso/rebafka/MiniBatchSAEM.tgz

4.1 Nonlinear mixed effects model for pharmacokinetic study

Clinical pharmacokinetic studies aim at analyzing the evolution of the concentration of a given drug in the blood of an individual over a given time interval after absorbing the drug. In this section a classical one-compartment model is considered.

4.1.1 Model

The model presented in Davidian and Giltinan (1995) serves to analyze the kinetics of the drug theophylline used in therapy for respiratory diseases. For i = 1, . . . , n and j = 1, . . . , J, we define

    y_{ij} = h(V_i, ka_i, Cl_i) + ε_{ij}

with

    h(V_i, ka_i, Cl_i) = (d_i ka_i / (V_i ka_i − Cl_i)) [ exp(−Cl_i t_{ij} / V_i) − exp(−ka_i t_{ij}) ],

where the observation y_{ij} is the measure of drug concentration on individual i at time t_{ij}. The drug dose administered to individual i is denoted d_i. The parameters for individual i are the volume V_i of the central compartment, the constant ka_i of the drug absorption rate, and the drug's clearance Cl_i. The random measurement error is denoted by ε_{ij} and is supposed to have a centered normal distribution with variance σ². For the individual parameters V_i, ka_i and Cl_i, log-normal distributions are considered, given by

    log V_i = log(µ_V) + z_{i,1},
    log ka_i = log(µ_ka) + z_{i,2},
    log Cl_i = log(µ_Cl) + z_{i,3},

where z_i = (z_{i,1}, z_{i,2}, z_{i,3}) are independent latent random variables following a centered normal distribution with covariance matrix Ω = diag(ω²_V, ω²_ka, ω²_Cl). Then the model parameters are θ = (µ_V, µ_ka, µ_Cl, ω²_V, ω²_ka, ω²_Cl, σ²).

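As an illustration, the mean concentration h and the log-normal individual parameters of this model can be coded directly; the following Python sketch is ours, with the numerical values in the example taken from the simulation setting described below purely for illustration.

    import numpy as np

    def concentration(d, t, V, ka, Cl):
        """One-compartment model: h(V, ka, Cl) evaluated at dose d and times t."""
        return d * ka / (V * ka - Cl) * (np.exp(-Cl * t / V) - np.exp(-ka * t))

    def individual_parameters(z_i, mu_V, mu_ka, mu_Cl):
        """Log-normal individual parameters: log V_i = log(mu_V) + z_{i,1}, etc."""
        return mu_V * np.exp(z_i[0]), mu_ka * np.exp(z_i[1]), mu_Cl * np.exp(z_i[2])

    # Example (illustrative values): mean concentration of an individual with
    # latent vector z_i = 0 at times t = 1,...,10 for dose d = 320.
    V, ka, Cl = individual_parameters(np.zeros(3), mu_V=30.0, mu_ka=1.8, mu_Cl=3.5)
    y_mean = concentration(320.0, np.arange(1, 11), V, ka, Cl)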
4.1.2 Algorithm

We implement the mini-batch MCMC-SAEM algorithm presented in Section 2.3. In the simulation step we use the following sampling procedure. Let I_k be the subset of indices of latent variable components z_i that have to be updated at iteration k. For each i ∈ I_k, we use a Metropolis-Hastings procedure: first, for l ∈ {1, 2, 3} draw a candidate z̃_{k,i,l} from the normal distribution N(z_{k−1,i,l}, η_l), with η_1 = 0.01, η_2 = 0.02 and η_3 = 0.03, chosen to get good mean acceptance rates. Then compute the acceptance ratio ρ_{k,i,l} = ρ(z_{k−1,i,l}, z̃_{k,i,l}) of the usual Metropolis-Hastings procedure. Finally, draw a realization ω_{k,i,l} from the uniform distribution U[0, 1] and set z_{k,i,l} = z̃_{k,i,l} if ω_{k,i,l} < ρ_{k,i,l}, and z_{k,i,l} = z_{k−1,i,l} otherwise.

In the next step, we compute the stochastic approximation of the sufficient statistics of the model, which take values in R^7, according to

    s_k = (1 − γ_k) s_{k−1} + γ_k S(z_k),

Fig. 1 Estimates of the parameter µ_V using mini-batch MCMC-SAEM with α ∈ {0.1, 0.3, 0.5, 0.8, 1} as a function of the number of epochs.

Fig. 2 Sample standard deviation of the estimate of parameter µ_V using mini-batch MCMC-SAEM with α ∈ {0.1, 0.3, 0.5, 0.8, 1} as a function of the number of iterations.

where

    S(z_k) = ( (1/n) ∑_i z_{i,1}, (1/n) ∑_i z_{i,2}, (1/n) ∑_i z_{i,3}, (1/n) ∑_i z²_{i,1}, (1/n) ∑_i z²_{i,2}, (1/n) ∑_i z²_{i,3}, (1/(nJ)) ∑_{i,j} (y_{ij} − h(V_i, ka_i, Cl_i))² )

and where the sequence (γ_k)_{k≥1} is chosen such that 0 < γ_k < 1 for all k, ∑ γ_k = ∞ and ∑ γ_k² < ∞.

Finally the maximization step is performed using explicit solutions given by the following equations

    µ_{V,k} = exp(s_{k,1});   ω²_{V,k} = s_{k,4} − s²_{k,1};
    µ_{ka,k} = exp(s_{k,2});   ω²_{ka,k} = s_{k,5} − s²_{k,2};
    µ_{Cl,k} = exp(s_{k,3});   ω²_{Cl,k} = s_{k,6} − s²_{k,3};
    σ²_k = s_{k,7}.

For more technical details on the implementation, we refer to Kuhn and Lavielle (2005).

4.1.3 Numerical results

In a simulation study we generate one dataset from the above model with the following parameter values: n = 1000, J = 10, µ_V = 30, µ_ka = 1.8, µ_Cl = 3.5, ω²_V = 0.02, ω²_ka = 0.04, ω²_Cl = 0.06, σ² = 2. For all individuals i the same dose d_i = 320 and the time points t_{ij} = j are used.

Then we estimate the model parameters by both the original MCMC-SAEM algorithm and our mini-batch version. The step sizes are set to γ_k = 1 for 1 ≤ k ≤ 50 and γ_k = (k − 50)^{−0.6} otherwise. This corresponds to a classical choice ensuring Assumption (SAEM1) (Delyon et al., 1999). The algorithm did not seem to be sensitive to this choice. Several mini-batch proportions, namely α ∈ {0.1, 0.3, 0.5, 0.8, 1}, are considered. To be more precise, the estimation task is repeated 100 times on the same fixed dataset for all considered algorithms. The results for parameter µ_V are shown in Figures 1 and 2. The results for the other parameters are very similar and therefore omitted.

Figure 1 shows the evolution of the precision of the running mean of the estimates µ̄_{V,k} = ∑_{l=1}^k µ_{V,l}/k of parameter component µ_V as a function of the number of epochs for different values of the proportion α. An epoch is the average number of iterations required to update n latent components. That is, one epoch corresponds on average to 1/α iterations of the mini-batch algorithm with proportion α. This means that parameter estimators are compared when the different algorithms have spent approximately the same time in the simulation step, and, due to the dependency structure of the nonlinear mixed model, when the algorithms have visited approximately the same amount of data. That is, an epoch takes approximately the same computing time for any proportion α. So Figure 1 compares estimates at comparable computing times. It is obvious that for all algorithms estimation improves when the number of epochs increases. Moreover, and more importantly, the rate of convergence depends on the mini-batch proportion: the smaller α, the faster the convergence to the target value µ_V = 30. Here, the fastest convergence is obtained with the smallest mini-batch proportion, that is, α = 0.1. For instance, to attain the precision obtained within 5 epochs with α = 0.1, we need at least 25 epochs with the batch algorithm α = 1.

This acceleration of convergence at the beginning of the algorithm induced by mini-batch sampling is characteristic for mini-batch sampling in any EM-type algorithm. Let us give an intuitive explanation of this characteristic phenomenon. In general, the initial values of the algorithm are far away from the unknown target values. So, during the first iteration of the batch algorithm, many time-consuming computations are done using a very bad value θ_0. Only at the very end of the first iteration, the parameter estimate is updated to a slightly better value θ_1. During the same time, a mini-batch algorithm with small α performs some computations with the same bad value θ_0, but after a short time already, the M-step is attained for the first time. As only a couple of latent components have been updated and only a few data points have been visited, the new value θ_1 may be only a very slight correction of θ_0, but, nevertheless, it is a move in the right direction and the next iteration is performed using a slightly better value than the previous one. Hence, performing mini-batch sampling consists in making many small updates of the parameter θ, while in the same time the batch algorithm only makes very few updates. Metaphorically speaking, the batch algorithm makes long and time-consuming steps, but these steps are not necessarily directed in the best direction, whereas the mini-batch version makes plenty of small and quick steps, correcting its direction after every step. As a whole, the mini-batch strategy leads to much better results, as illustrated in Figure 1.

Figure 2 presents for different values of the proportion α the estimates of the empirical standard deviation with respect to the number of iterations. We observe that, as the number of iterations increases, the standard deviations obtained with larger values of α are lower than those obtained with smaller values of α. This illustrates in particular that including more data in the inference task leads to more accurate estimation results. This is indeed very intuitive. Therefore an optimal choice of α should achieve a trade-off between speeding up the convergence and involving enough data in the process to get accurate estimates.

Fig. 3 Estimation of the template with n = 20 images. First row: using batch MCMC-SAEM; second row: using mini-batch MCMC-SAEM with α = 0.1; columns correspond to 1, 2 and 3 epochs, respectively.

4.2 Deformable template model for image analysis

In this section an example on handwritten digits illustrates the benefits of using mini-batch sampling. We consider the dense deformation template model for image analysis that was first introduced in Allassonnière et al. (2007). This model considers observed images as deformations of a common reference image, called template.

4.2.1 Model and algorithm

Using the formulation in Allassonnière et al. (2010), let (y_i)_{1≤i≤n} be n observed gray level images. Each image y_i is defined on a grid of pixels Λ ⊂ R², where for each s ∈ Λ, x_s is the location of pixel s in a domain D ⊂ R². We assume that every image derives from the deformation of a common unknown template I, which is a function from D to R. Furthermore, we assume that for every image y_i there exists an unobserved deformation field Φ_i : R² → R² such that

    y_i(s) = I(x_s − Φ_i(x_s)) + σ ε_i(s),

where the ε_i(s) have standard normal distribution and σ² denotes the variance. To formulate this complex problem in a simpler way, the template I and the deformations Φ_i are supposed to have the following parametric form. Given a set of landmarks denoted by (p_k)_{1≤k≤k_p}, which covers the domain D, the template function I is parametrized by coefficients ξ ∈ R^{k_p} through

    I = K_p ξ,   where (K_p ξ)(x) = ∑_{k=1}^{k_p} K_p(x, p_k) ξ(k),

and K_p is a fixed known kernel. Likewise, for another fixed set of landmarks (g_k)_{1≤k≤k_g} in D, the deformation field is given by

    Φ_i(x) = (K_g z_i)(x) = ∑_{k=1}^{k_g} K_g(x, g_k) (z_i^{(1)}(k), z_i^{(2)}(k)),

where z_i = (z_i^{(1)}, z_i^{(2)}) ∈ R^{k_g} × R^{k_g} and again K_g is a fixed known kernel. The latent variables z_i are centered Gaussian variables with covariance matrix Γ. We refer to Allassonnière et al. (2010) for further details on the model and also for the implementation of the MCMC-SAEM algorithm, which estimates all model parameters (ξ, Γ, σ²), and so the template I.
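To make this parametric form concrete, the following Python sketch evaluates a template and a deformation field on a pixel grid. It is our own illustration: the Gaussian kernel, the kernel widths and all numerical values are assumptions made only for the example, not the authors' choices.

    import numpy as np

    def gaussian_kernel(x, landmarks, width):
        """Matrix of kernel values K(x_s, p_k) for pixel locations x (N x 2)
        and landmarks (K x 2); a Gaussian kernel is assumed here."""
        d2 = ((x[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * width ** 2))

    def template(x, landmarks_p, xi, width=0.2):
        """I(x) = sum_k K_p(x, p_k) xi(k)."""
        return gaussian_kernel(x, landmarks_p, width) @ xi

    def deformation(x, landmarks_g, z, width=0.3):
        """Phi_i(x) = sum_k K_g(x, g_k) (z^(1)(k), z^(2)(k)); z has shape (k_g, 2)."""
        return gaussian_kernel(x, landmarks_g, width) @ z

    # Example with assumed sizes: a 40 x 40 pixel grid on [0, 1]^2.
    grid = np.stack(np.meshgrid(np.linspace(0, 1, 40), np.linspace(0, 1, 40)), -1).reshape(-1, 2)
    p = np.random.default_rng(2).uniform(size=(36, 2))    # template landmarks
    xi = np.random.default_rng(3).standard_normal(36)     # template coefficients
    I_values = template(grid, p, xi)                       # template on the grid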

Fig. 4 Synthetic images sampled from the model for digit 5 using the parameter estimates obtained with the batch version on 20 images (top) and with the mini-batch version with α = 0.2 on 100 images (bottom).

4.2.2 Numerical results

In our numerical study we compare the performance of the standard MCMC-SAEM algorithm to the mini-batch version on images from the United States Postal Service database (Hull, 1994).

In the first experiment we set the mini-batch proportion α to 0.1 and have a look at the performance during the very first iterations of the algorithms. Figure 3 shows the estimated template for digit 5 with n = 20 images after the first three epochs, that is, after three passes through the dataset. We observe that the mini-batch algorithm obtains a more contrasted and accurate template estimate than the batch version. That is, convergence is accelerated at the beginning of the algorithm when mini-batch sampling is used. This is very similar to our observations in the nonlinear mixed model in the previous section.

Now the question is how to take advantage of this speed-up in practice, as we usually do not stop any algorithm after only three epochs. As it is good practice to run any MCMC-SAEM algorithm until convergence, our approach is the following. Suppose that we find the computing time acceptable when running the batch algorithm on n = 20 images during 1000 iterations, that is, until convergence. Can we do something better by using mini-batch sampling, say with α = 0.2, within the same computing time? When applying the mini-batch algorithm on the same 20 images, convergence is attained much faster and there is a gain in computing time, but probably also some loss in accuracy. This is not what we are interested in, as we want to make use of the entire allotted computing time. The solution is to increase the number of images in the input of the mini-batch algorithm. Our reasoning is the following: in the batch version 20 images are processed per iteration. So the mini-batch algorithm with α = 0.2 can be applied to n = 100 images, as on average only 20 images are used per iteration. Hence, running the mini-batch algorithm with α = 0.2 and n = 100 and the batch version with n = 20 over the same number of iterations takes almost exactly the same time. We see that, given a constraint on the computing time, using a small mini-batch proportion allows one to increase the number of images in the input.

To assess the accuracy of the estimates obtained by the two algorithms, we generate new samples from the model using the parameter estimates obtained by the different algorithms. From the synthetic images presented in Figure 4 we see that the ones in the lower part of the figure resemble more usual handwritten digits 5 than the ones in the upper part. This highlights that both template and deformation are better estimated by the mini-batch version performed on 100 images of the dataset than with the batch algorithm on 20 images. Hence, given a constraint on the computing time, more accuracy can be obtained by using the mini-batch MCMC-SAEM instead of the original algorithm.

4.3 Stochastic block model

Now we turn to a random graph model which is interesting as it has a complex dependency structure. The model is the so-called stochastic block model (see Matias and Robin (2014) for a review), where every observation depends on more than one latent component.

4.3.1 Model

In the stochastic block model (SBM) the latent variable z = (z_1, . . . , z_n) is composed of i.i.d. random variables z_i taking their values in {1, . . . , Q} with probabilities p_q = P_θ(z_1 = q) for q = 1, . . . , Q. The observed adjacency matrix y = (y_{i,j})_{i,j} of a directed graph is such that the observations y_{i,j} are independent conditional on z and y_{i,j}|z has Bernoulli distribution with parameter ν_{z_i,z_j} depending on the latent variables of the interacting nodes i and j.

We see that every observation y_{i,j} depends on two latent components, namely on z_i and z_j. In turn, the latent component z_k influences all observations in the set {y_{i,k}, i = 1, . . . , n} ∪ {y_{k,j}, j = 1, . . . , n}, creating complex stochastic dependencies between the observations.

Denote θ = ((p_q)_{1≤q≤Q}, (ν_{q,l})_{1≤q,l≤Q}) the collection of model parameters for a directed SBM. The complete log-likelihood function is given by

    log P_θ(y, z) = log P_θ(y|z) + log P_θ(z)
                  = ∑_{q=1}^Q ∑_{i=1}^n 1{z_i = q} log p_q
                    + ∑_{q,l} ∑_{i,j} 1{z_i = q, z_j = l} y_{i,j} log ν_{q,l}
                    + ∑_{q,l} ∑_{i,j} 1{z_i = q, z_j = l} (1 − y_{i,j}) log(1 − ν_{q,l}),

where 1{A} denotes the indicator function of the set A. Hence, the complete log-likelihood of the SBM belongs to the exponential family with the following sufficient statistics, for 1 ≤ q, l ≤ Q,

    S_1^q(z) = ∑_{i=1}^n 1{z_i = q},
    S_2^{q,l}(z) = ∑_{i,j} 1{z_i = q, z_j = l} y_{i,j},
    S_3^{q,l}(z) = ∑_{i,j} 1{z_i = q, z_j = l} (1 − y_{i,j}),

and corresponding natural parameters φ_1^q(θ) = log p_q, φ_2^{q,l}(θ) = log ν_{q,l} and φ_3^{q,l}(θ) = log(1 − ν_{q,l}).

Fig. 5 Estimation of the limit distribution of the estimate of ν_{2,2} = 0.2 after 10 000 iterations for the mini-batch algorithm with proportions α ∈ {0.01, 0.05, 0.1, 0.2, 0.5, 0.75, 1}.

4.3.2 Algorithm

The implementation is straightforward. In the simulation step, the proposal distribution q of latent variables in the Metropolis algorithm is the discrete uniform distribution on {1, . . . , Q}.

As in other models, the update of the sufficient statistic S(z) can be numerically optimized. Indeed, with I_k denoting the indices of latent components that are simulated, the vectors z_k and z_{k−1} differ only in the components with indices belonging to I_k. As a consequence, the statistic S_1^q(z_k), for instance, is more rapidly updated by computing

    S_1^q(z_{k−1}) + ∑_{i∈I_k} (1{z_{k,i} = q} − 1{z_{k−1,i} = q})

than by the formula ∑_{i=1}^n 1{z_{k,i} = q}, which involves more operations. Likewise, S_2^{q,l}(z_k) is faster computed as follows

    S_2^{q,l}(z_k) = S_2^{q,l}(z_{k−1}) + ∑_{i∈I_k} ∑_{j=1}^n 1{z_{k,i} = q, z_{k,j} = l} y_{i,j} − ∑_{i∈I_k} ∑_{j=1}^n 1{z_{k−1,i} = q, z_{k−1,j} = l} y_{i,j}
                    + ∑_{i=1}^n ∑_{j∈I_k} 1{z_{k,i} = q, z_{k,j} = l} y_{i,j} − ∑_{i=1}^n ∑_{j∈I_k} 1{z_{k−1,i} = q, z_{k−1,j} = l} y_{i,j}.

Here we see that not the entire data y are visited to update S_2^{q,l}, but only those observations y_{i,j} that stochastically depend on the updated latent components z_i with i ∈ I_k.

The maximisation of the complete likelihood function with given values s_1^q, s_2^{q,l} and s_3^{q,l} for the sufficient statistics is straightforward. The updated parameter values are given by p_q = s_1^q / n and ν_{q,l} = s_2^{q,l} / (s_2^{q,l} + s_3^{q,l}).
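The incremental updates above translate directly into code. The following Python sketch maintains S_1 and S_2 while visiting only the rows and columns of y that touch the indices in I_k; it is our own illustration with our own variable names (not the authors' code), and it handles pairs with both endpoints in I_k through an explicit boolean mask.

    import numpy as np

    def update_S1_S2(S1, S2, y, z_old, z_new, I_k):
        """Incremental update of the SBM statistics when only the components
        with indices in I_k have been re-simulated (sketch).

        S1[q]    = sum_i 1{z_i = q}
        S2[q, l] = sum_{i,j} 1{z_i = q, z_j = l} y_{i,j}
        Only the rows and columns of y indexed by I_k are visited."""
        S1 = S1.copy()
        S2 = S2.copy()
        I_k = np.asarray(I_k)
        # S1: move the updated nodes from their old block to their new block.
        np.subtract.at(S1, z_old[I_k], 1)
        np.add.at(S1, z_new[I_k], 1)
        # S2: remove the old contribution of the affected rows and columns,
        # then add the new one.
        affected = np.zeros(len(z_old), dtype=bool)
        affected[I_k] = True
        mask = affected[:, None] | affected[None, :]     # pairs touching I_k
        rows, cols = np.nonzero(mask)
        for z, sign in ((z_old, -1), (z_new, +1)):
            np.add.at(S2, (z[rows], z[cols]), sign * y[rows, cols])
        return S1, S2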

Fig. 6 Sample variances of the parameter estimates after 10 000 iterations as a function of the mini-batch proportion α (solid lines) and adjusted theoretical limit variances (dashed lines).

Fig. 7 Mean ARI obtained by mini-batch MCMC-SAEM algorithms as a function of the number of epochs for α ∈ {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1}.

4.3.3 Simulation results

For our simulations, model parameters are set to π_1 = 1 − π_2 = 0.6 and

    (ν_{q,l})_{q,l} = ( 0.25  0.1 ; 0.1  0.2 ).

To study the impact of mini-batch sampling on the asymptotic behavior of the estimates, a directed graph with n = 100 nodes is generated from the model and the mini-batch algorithm with α ranging from 0.01 to 1 and with 10 000 iterations is applied 1000 times.

Figure 5 shows the histograms of the estimates of parameter ν_{2,2} = 0.2 obtained with the different algorithms. All histograms are approximately unimodal, centered at the same value and symmetric. Moreover, we see that the larger the mini-batch size the tighter the distribution. Indeed, it seems that the estimates are asymptotically normal and the limit variance increases when α decreases. This increase of the limit variance induced by mini-batch sampling is illustrated for all model parameter estimates in Figure 6. Furthermore, Figure 6 checks whether the theoretical formula of the limit variance derived in Section 3.3 is adequate. Recall that we conjecture that the limit variance obtained with the mini-batch algorithm with proportion α equals (2 − α)/α times the limit variance obtained with the batch algorithm, here represented by the dashed lines. The excellent fit supports our conjecture.

Concerning the behavior at the beginning of the algorithm, we observe the same acceleration for the mini-batch versions as in the other models. Here 500 datasets are simulated from the SBM and Figure 7 shows the evolution of the adjusted rand index (ARI) during the first epochs (Hubert and Arabie, 1985). This index compares the clustering of the nodes obtained by the algorithm to the true block memberships given by the latent components z_i. The ARI equals one if two clusterings are identical up to permutation of the block labels. We see that the algorithm with the smallest mini-batch proportion provides the best clustering at any given number of epochs. Finding the good clustering is essential for accurate parameter estimates.

Finally, we have a closer look at computing time aspects. As already mentioned, the computing time of the M-step does not depend on the mini-batch proportion α. It is also clear that the simulation step in the mini-batch algorithm takes on average α times the computing time of the simulation step in the batch version, as only a proportion α of the latent components are simulated. However, the computing time of the stochastic approximation step, and in particular the update of the sufficient statistic, depends on the amount of data used and thus on the dependency structure of the model. In most models, where every observation only depends on a single latent component z_i, the proportion of data involved in the update is α. However, in the SBM a set of latent components {z_1, . . . , z_m} for m < n influences the set of observations {y_{i,j}, i = 1, . . . , m, j = 1, . . . , n} ∪ {y_{i,j}, i = 1, . . . , n, j = 1, . . . , m}, whose cardinality is 2mn − m². It follows that for a mini-batch proportion α the corresponding proportion of data used to update the sufficient statistics is α(2 − α), and so the computing time of this step is α(2 − α) times the computing time of the stochastic approximation step in the batch algorithm.

Let us call SAE-step the combination of the simulation step and the stochastic approximation step. We determine the median computing time of the SAE-step over 100 runs of the SAE-step for different mini-batch proportions α and different numbers of nodes n. Figure 8 shows the ratio of the median time of the mini-batch SAE-step with proportion α over the median time of the batch SAE-step for different numbers of nodes, that is,

    α ↦ (median time of SAE-step with α) / (median time of batch SAE-step).

The dashed lines represent the functions α ↦ α and α ↦ α(2 − α). The first corresponds to the expected ratio of computing times of the simulation step, the latter to the expected ratio of computing times of the stochastic approximation step. As expected, the observed computing time ratios of the entire SAE-step fall between these two boundaries.

Fig. 8 Ratio of the median computing time of the mini-batch SAE-step with α over the median time of the batch SAE-step for different numbers of nodes (solid lines). The dashed lines represent the functions α ↦ α and α ↦ α(2 − α); the first corresponds to the expected ratio of computing times of the simulation step, the latter to the expected ratio of computing times of the stochastic approximation step.
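As a side check of the conjecture recalled above, the factor (2 − α)/α can also be reproduced by directly simulating the update indicators of Section 3.3. The following sketch is our own illustration (not part of the original study) and estimates E[(1/k) ∑_l N_{l,k}²] by Monte Carlo; the handling of the last update is a boundary convention that does not affect the limit.

    import numpy as np

    def mean_sum_N_squared(alpha, k=2000, reps=200, seed=0):
        """Monte Carlo estimate of E[(1/k) sum_l N_{l,k}^2], where N_{l,k} is
        the number of iterations until the next update after an update at
        iteration l (and 0 if there is no update at l)."""
        rng = np.random.default_rng(seed)
        out = np.empty(reps)
        for r in range(reps):
            U = rng.binomial(1, alpha, size=k)
            N = np.zeros(k)
            update_times = np.flatnonzero(U)
            for idx, l in enumerate(update_times):
                nxt = update_times[idx + 1] if idx + 1 < len(update_times) else k
                N[l] = nxt - l
            out[r] = (N ** 2).sum() / k
        return out.mean()

    # Compare with (2 - alpha)/alpha; for alpha = 0.2 the limit is 9.
    print(mean_sum_N_squared(0.2), (2 - 0.2) / 0.2)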

4.4 Frailty model in survival analysis

In survival analysis the frailty model is an extension of the well-known Cox model (Duchateau and Janssen, 2008). The hazard rate function in the frailty model includes an additional random effect, called frailty, to account for unexplained heterogeneity.

4.4.1 Model and algorithm

The observations are survival times t = (t_{ij})_{1≤i≤n, 1≤j≤m} measured over n groups with m measurements per group, and covariates X_{ij} ∈ R^p. We denote by λ_0 the baseline hazard function, which is here chosen to be the Weibull function given by

    λ_0(t) = λ_0 ρ t^{ρ−1},  t > 0,

with λ_0 > 0 and ρ > 1. For every group i, a latent variable z_i is introduced representing a frailty term. We suppose that z_1, . . . , z_n are i.i.d. with centered Gaussian distribution with variance σ². The conditional hazard rate λ_{ij}(·|z_i) of observation t_{ij} given the frailty z_i is defined as

    λ_{ij}(·|z_i) = λ_0(·) exp(X_{ij}′β + z_i),

where β ∈ R^p. Thus, the unknown model parameter is θ = (β, σ², λ_0, ρ). In practical applications the main interest lies in the estimation of the regression parameter β.

In the frailty model the conditional survival function is given by

    G_{ij}(t|z_i) = P(t_{ij} > t|z_i) = exp[−λ_0 t^ρ exp(X_{ij}′β + z_i)].

In other words, the conditional distribution of the survival time t_{ij} given z_i is the Weibull distribution with scale parameter λ_0 exp(X_{ij}′β + z_i) and shape parameter ρ. For the conditional and the complete likelihood functions we obtain

    P_θ(t|z) = ∏_{i=1}^n ∏_{j=1}^m G_{ij}(t_{ij}|z_i) λ_{ij}(t_{ij}|z_i),
    P_θ(t, z) = P_θ(t|z) ∏_{i=1}^n φ(z_i/σ),

where φ denotes the density of the standard normal distribution.

The implementation of the algorithm is straightforward. In the simulation step, candidates z̃_{k,i} are drawn from the normal distribution N(z_{k−1,i}, 0.2). In the M-step, the updates of σ² and λ_0 are explicit, while those of β and ρ are obtained by the Newton-Raphson method.
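For concreteness, the following Python sketch simulates data from this frailty model by inverting the conditional survival function (each t_ij is a Weibull draw). It is our own illustration with our own function names; the parameter values in the example mirror the simulation setting described below purely for illustration.

    import numpy as np

    def simulate_frailty_data(n, m, beta, sigma2, lambda0, rho, seed=0):
        """Simulate survival times t_ij from the frailty model:
        z_i ~ N(0, sigma2) and P(t_ij > t | z_i) = exp(-lambda0 * t**rho *
        exp(X_ij' beta + z_i)). Covariates are drawn uniformly on [0, 1]."""
        rng = np.random.default_rng(seed)
        p = len(beta)
        X = rng.uniform(size=(n, m, p))                  # covariates X_ij
        z = rng.normal(scale=np.sqrt(sigma2), size=n)    # frailties z_i
        eta = X @ np.asarray(beta) + z[:, None]          # X_ij' beta + z_i
        u = rng.uniform(size=(n, m))
        # Inverse of the conditional survival function (Weibull draw).
        t = (-np.log(u) / (lambda0 * np.exp(eta))) ** (1.0 / rho)
        return t, X, z

    # Example using the values of the simulation study below (illustrative).
    t, X, z = simulate_frailty_data(n=5000, m=100, beta=(2.0, 3.0),
                                    sigma2=2.0, lambda0=3.0, rho=3.6)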

Fig. 9 Evolution of 80%-confidence bands for β_1 with respect to the number of epochs for mini-batch proportions α ∈ {0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1}.

Fig. 10 Evolution of the logarithm of the empirical mean squared error of estimates of β_1 with respect to the number of epochs for mini-batch proportions α ∈ {0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1}.

Fig. 11 Estimation of the limit distribution of the estimate of β_1 after 8000 iterations for the mini-batch algorithm with proportions α ∈ {0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1}.

Fig. 12 Sample variances of the parameter estimates after 8000 iterations as a function of the mini-batch proportion α (solid lines) and adjusted theoretical limit variances (dashed lines).

4.4.2 Numerical results

In a simulation study we consider the frailty model with parameters fixed to β = (β_1, β_2) = (2, 3), λ_0 = 3, σ² = 2 and ρ = 3.6. We set n = 5000 and m = 100. The covariates X_{ij} are drawn independently from the uniform distribution for every dataset.

In the first setting, 500 datasets are generated and the mini-batch MCMC-SAEM algorithm with random initial values and mini-batch proportions α between 0.01 and 1 is applied. Figures 9 and 10 show the evolution of the precision of the mean of the estimates β̄_{1,k} = ∑_{l=1}^k β_{1,l}/k of parameter component β_1 as a function of the number of epochs. Figure 9 shows 80%-confidence bands for β_1, and Figure 10 gives the corresponding evolution of the logarithm of the empirical mean squared error. Again, it is clear from both graphs that the rate of convergence depends on the mini-batch proportion. At (almost) any number of epochs, the mean squared error and the width of the confidence intervals are increasing in α. From Figures 9 and 10 we see, for instance, that the mini-batch version with α = 0.01 attains convergence after only three epochs, while almost 30 epochs are required when α = 0.1. The choice of α that achieves the fastest convergence is the smallest mini-batch proportion, here 0.01.

In the second setting, we study the asymptotic behavior of the estimates, when the algorithms are supposed to have converged, that is, after 8000 iterations. To evaluate the variance of the estimates that is only due to the stochasticity of the algorithm, we fix a dataset and run the different algorithms 500 times using different initial values. Figure 11 illustrates the limit distributions of the estimates of β_1, which seem to be Gaussian, centered at the same value, but with varying variances. Figure 12 gives the corresponding values of the limit variances for the different parameter estimates. Again, we see that the limit variance increases when α decreases. This is expected as less data are visited per iteration for smaller α. Furthermore, we conjecture in Section 3.3 that the theoretical limit variance of the sequence generated by the algorithm with mini-batch proportion α is equal to V_1(2 − α)/α, where V_1 stands for the limit variance of the batch algorithm. Figure 12 shows a very good fit of the sample variances of the different parameters to the function α ↦ V_1(2 − α)/α.

5 Conclusion

In this paper we have proposed to perform mini-batch sampling in the MCMC-SAEM algorithm. We have shown that the mini-batch algorithm belongs to the family of classical MCMC-SAEM algorithms with specific transition kernels. It is also shown that the mini-batch algorithm converges almost surely, whenever the original algorithm does. Concerning the limit distribution, according to heuristic arguments and simulation results in different models, estimators are asymptotically normal and we have quantified the impact of the mini-batch proportion on the limit variance. However, the formal proof of this result is left for future work.

The numerical experiments carried out in various latent variable models illustrate that mini-batch sampling leads to an important speed-up of convergence at the beginning of the algorithm compared to the original algorithm. In most papers on mini-batch sampling, as well as in this paper, the illustration of this speed-up is based on a comparison of estimates at the same number of epochs. However, the computing time of an epoch depends on the mini-batch proportion α, on the amount of data used in the stochastic approximation step and on the computing time of the M-step, which is performed much more often when α is small. Indeed, the smaller the value of α, the larger the number of M-steps within an epoch. In the frailty model, for instance, the maximisation in the M-step is not explicit and thus more time-consuming than the other steps of the algorithm, which makes mini-batch sampling less attractive for practical use in this model. This also raises the question of the interest of an analysis of the performance of the estimators with respect to the number of epochs. From a practical point of view, in future studies, we advocate that it would be much more appealing to compare algorithms relying on the same computing time rather than on the same number of epochs.

The study of the stochastic block model has shown that the common presentation of mini-batch sampling as a method where a subset of the data is selected to perform an iteration is misleading. Indeed, in the algorithm, first the latent components that are to be simulated are chosen, and only then the data that are associated with these updated latent components are determined to perform the update of the sufficient statistic. In models where every observation only depends on a single latent component, the proportion of data used for the update equals the proportion of simulated latent components. However, in models with a more involved dependency structure such as the SBM this no longer holds. As a consequence, in the SBM the computing time of one iteration of the SAE-step in the mini-batch algorithm is not α times that of the corresponding SAE-step in the batch version.

These issues on the computing time lead us to the important question of how to make good use of mini-batch sampling in practice, where we are often confronted with constraints on the computing time. It seems to us that the relevant problem is to find the algorithm that provides the best results within some allotted computing time. As we have seen in Section 4.2, combining mini-batch sampling with an increase of the number of observations compared to the batch version is a means to achieve more accurate estimates under a given constraint on the computing time. Indeed, increasing the sample size for mini-batch algorithms compensates for the loss in accuracy of the final estimates, while the acceleration of the convergence at the beginning of the algorithm ensures the convergence of the MCMC-SAEM algorithm within the considered computing time.

To find the optimal mini-batch proportion α and the associated optimal sample size, an analysis of the computing time per iteration is required, instead of per number of epochs. These optimal values are model dependent. Furthermore, as any MCMC-SAEM algorithm must always be run until convergence, it is necessary to understand the impact of the mini-batch proportion α and the sample size on the convergence of the algorithm, that is, on the number of iterations required to achieve convergence.

All programs are available on request from the authors.

Acknowledgements Work partly supported by the grant ANR-18-CE02-0010 of the French National Research Agency ANR (project EcoNet).

References

Allassonnière S, Amit Y, Trouvé A (2007) Toward a coherent statistical framework for dense deformable template estimation. J Roy Statist Soc Ser B 69:3–29
Allassonnière S, Kuhn E, Trouvé A (2010) Construction of Bayesian deformable models via a stochastic approximation algorithm: A convergence study. Bernoulli 16(3):641–678
Andrieu C, Moulines E, Priouret P (2005) Stability of stochastic approximation under verifiable conditions. SIAM J Control Optim 44(1):283–312
Cappé O (2011) Online EM algorithm for hidden Markov models. Journal of Computational and Graphical Statistics 20(3):728–749
Cappé O, Moulines E (2009) On-line expectation-maximization algorithm for latent data models. J Roy Statist Soc Ser B 71(3):593–613
Davidian M, Giltinan DM (1995) Nonlinear Models for Repeated Measurement Data. Chapman & Hall/CRC Press

Delyon B, Lavielle M, Moulines E (1999) Convergence of a stochastic approximation version of the EM algorithm. Ann Statist 27(1):94–128
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc Ser B 39(1):1–38
Duchateau L, Janssen P (2008) The frailty model. Springer, New York, NY
Fort G, Moulines E, Roberts GO, Rosenthal JS (2003) On the geometric ergodicity of hybrid samplers. Journal of Applied Probability 40:123–146
Fort G, Jourdain B, Kuhn E, Lelièvre T, Stoltz G (2015) Convergence of the Wang-Landau algorithm. Mathematics of Computation 84(295):2297–2327
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Hull JJ (1994) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5):550–554
Karimi B, Lavielle M, Moulines E (2018) On the convergence properties of the mini-batch EM and MCEM algorithms, unpublished
Karimi B (2019) Non-Convex Optimization for Latent Data Models: Algorithms, Analysis and Applications. PhD Thesis, https://tel.archives-ouvertes.fr/tel-02319140
Kuhn E, Lavielle M (2004) Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: P&S 8:115–131
Kuhn E, Lavielle M (2005) Maximum likelihood estimation in nonlinear mixed effects models. Comput Stat Data An 49(4):1020–1038
Lange K (1995) A gradient algorithm locally equivalent to the EM algorithm. J Roy Statist Soc Ser B 2(57):425–437
Liang P, Klein D (2009) Online EM for unsupervised models. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, NAACL '09, pp 611–619
Matias C, Robin S (2014) Modeling heterogeneity in random graphs through latent space models: a selective review. ESAIM: Proceedings and Surveys 47:55–74
Neal RM, Hinton GE (1999) A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan MI (ed) Learning in Graphical Models, MIT Press, Cambridge, MA, USA, pp 355–368
Nguyen H, Forbes F, McLachlan G (2020) Mini-batch learning of exponential family finite mixture models. Stat Comput
Robert CP, Casella G (2004) Monte Carlo statistical methods, 2nd edn. Springer Texts in Statistics, Springer-Verlag, New York
Titterington DM (1984) Recursive parameter estimation using incomplete data. J Roy Statist Soc Ser B 2(46):257–267